Opencl Motion Detection Using NVIDIA Fermi and the Opencl Programming Framework

General purpose computing on graphics processing units using OpenCL Motion detection using NVIDIA Fermi and the OpenCL programming framework Master of Science thesis in the Integrated Electronic System Design programme MATS JOHANSSON OSCAR WINTER Chalmers University of Technology University of Gothenburg Department of Computer Science and Engineering Göteborg, Sweden, June 2010 The Author grants to Chalmers University of Technology and University of Gothenburg the non-exclusive right to publish the Work electronically and in a non-commercial purpose make it accessible on the Internet. The Author warrants that he/she is the author to the Work, and warrants that the Work does not contain text, pictures or other material that violates copyright law. The Author shall, when transferring the rights of the Work to a third party (for example a publisher or a company), acknowledge the third party about this agreement. If the Author has signed a copyright agreement with a third party regarding the Work, the Author warrants hereby that he/she has obtained any necessary permission from this third party to let Chalmers University of Technology and University of Gothenburg store the Work electronically and make it accessible on the Internet. General purpose computing on graphics processing units using OpenCL Motion detection using NVIDIA Fermi and the OpenCL programming framework MATS JOHANSSON, [email protected] OSCAR WINTER, [email protected] © MATS JOHANSSON, June 2010. © OSCAR WINTER, June 2010. Examiner: ULF ASSARSSON Chalmers University of Technology University of Gothenburg Department of Computer Science and Engineering SE-412 96 Göteborg Sweden Telephone + 46 (0)31-772 1000 Department of Computer Science and Engineering Göteborg, Sweden June 2010 Abstract General-Purpose computing using Graphics Processing Units (GPGPU) has been an area of active research for many years. During 2009 and 2010 much has happened in the GPGPU research field with the release of the Open Computing Language (OpenCL) programming framework and the new NVIDIA Fermi Graphics Processing Unit (GPU) architecture. This thesis explores the hardware architectures of three GPUs and how well they support general computations; the NVIDIA Geforce 8800 GTS (the G80 architecture) from 2006, the AMD Radeon 4870 (the RV700 architecture) from 2008 and the NVIDIA Geforce GTX 480 (the Fermi architecture) from 2010. Special concern is given to the new Fermi architecture and the GPGPU related improvements implemented in this architecture. The Lukas-Kanade algorithm for optical flow estimation has been implemented in OpenCL to evaluate the framework and the impact of several different parallel application optimizations. The RV700 architecture is not well suited for GPGPU. The performance of the G80 architecture is very good taking its relative age into account. However, much effort must be spent optimizing a parallel application for the G80 before full performance is obtained, a task that can be quite tedious. Fermi excels in all aspects of GPGPU programming. Fermi’s performance is much higher than that of the RV700 and the G80 architectures and its new memory hierarchy makes GPGPU programming easier than ever before. OpenCL is a stable and competent framework well suited for any GPGPU project that would benefit from the increased flexibility of software and hardware platform independence. However, if performance is more important than flexibility, NVIDIA’s Compute Unified Device Architecture (CUDA) or AMD’s ATI Stream might be better alternatives. Sammanfattning Generella beräkningar med hjälp av grafikprocessorer (General-Purpose computation using Graphics Processing Units, GPGPU) har varit ett aktivt forskningsområde under många år. Stora framsteg har gjorts under 2009 och 2010 i och med lanseringen av programmeringsramverket Open Computing Language (OpenCL) och NVIDIAs nya GPU- arkitektur Fermi. Denna tes utforskar hårdvaruarkitekturen hos tre grafikprocessorer och hur väl de är anpassade för generella beräkningar; NVIDIA Geforce 8800 (G80-arkitekturen) utgivet 2006, AMD Radeon 4870 (RV700-arkitekturen) utgivet 2008 och NVIDIA Geforce GTX 480 (Fermi-arkitekturen) utgivet 2010. Stort fokus läggs på Fermi och de GPGPU-relaterade förbättringar som gjorts på denna arkitektur jämfört med tidigare generationer. Ramverket OpenCL och den relativa påverkan hos flertalet olika optimeringar av en parallell applikation har utvärderats genom att implementera Lukas-Kanades algoritm för uppskattning av optiskt flöde. RV700-arkitekturen är ej lämpad för generella beräkningar. Prestandan hos G80-arkitekturen är utmärkt trots dess relativa ålder. Mycket möda måste dock tillägnas G80-specifika optimeringar av den parallella applikationen för att kunna uppnå högsta möjliga prestanda. Fermi är överlägsen i alla aspekter av GPGPU. Fermis nya minneshierarki tillåter att generella beräkningar utförs både lättare och snabbare än tidigare. På samma gång är Fermis prestanda mycket högre än hos de två andra arkitekturerna och detta redan innan några hårdvaruspecifika optimeringar gjorts. Programmeringsramverket OpenCL är ett stabilt och kompetent ramverk väl anpassat för GPGPU-projekt som kan dra nytta av den ökade flexibiliteten av mjuk- och hårdvaruoberoende. Om prestanda är viktigare än flexibilitet kan dock NVIDIAs Compute Unified Device Architecture (CUDA) eller AMDs ATI Stream vara bättre alternativ. Acknowledgements We would like to thank Combitech AB for making this thesis possible. We would also like to thank our supervisors Peter Gelin and Dennis Arkeryd at Combitech AB as well as Ulf Assarsson at Chalmers University of Technology for their help and feedback. Finally we would like to give a special thanks to our families for their love and support. Contents 1. Introduction ............................................................................................................................ 1 1.1. Background ...................................................................................................................... 1 1.2. Purpose ............................................................................................................................ 3 1.3. Delimitations ................................................................................................................... 3 2. Previous work ......................................................................................................................... 5 3. General-Purpose computing on Graphics Processing Units................................................... 6 3.1. History ............................................................................................................................. 6 3.1.1. History of graphics hardware .................................................................................... 6 3.1.2. History of the GPGPU research field ........................................................................ 7 3.2. Graphics hardware ........................................................................................................... 8 3.2.1. Global scheduling ..................................................................................................... 9 3.2.2. Execution model ....................................................................................................... 9 3.2.3. Streaming multiprocessors ........................................................................................ 9 3.2.3.1. Scheduling ........................................................................................................ 10 3.2.3.2. Memory ............................................................................................................ 10 3.2.3.3. Functional units ................................................................................................ 11 3.2.4. Memory pipeline ..................................................................................................... 11 3.2.5. The NVIDIA Fermi architecture ............................................................................. 12 3.3. Open Computing Language (OpenCL) ......................................................................... 14 3.3.1. Platform model ........................................................................................................ 14 3.3.2. Execution model ..................................................................................................... 14 3.3.3. Memory model ........................................................................................................ 16 3.3.4. Programming model ................................................................................................ 17 4. Implementing and optimizing an OpenCL application ........................................................ 18 4.1. Application background ................................................................................................ 18 4.2. The Lucas-Kanade method for optical flow estimation ................................................ 18 4.3. Implementation of the Lucas-Kanade method ............................................................... 21 4.4. Parallel implementation of the Lucas-Kanade method .................................................. 23 4.4.1. OpenCL implementation ......................................................................................... 24 4.5. Hardware optimizations ................................................................................................. 26 4.5.1. Instruction

Opencl Motion Detection Using NVIDIA Fermi and the Opencl Programming Framework

The Importance of Data

AMD Accelerated Parallel Processing Opencl Programming Guide

Exploring Weak Scalability for FEM Calculations on a GPU-Enhanced Cluster

Evolution of the Graphical Processing Unit

Graphics Processing Units

The Shallow Water Equations on Graphics Processing Units

Fragment Shader

GPGPU in System Simulations

An Introduction to GPU Computing

Contributions of Hybrid Architectures to Depth Imaging: a CPU, APU and GPU Comparative Study

GPGPU, 4Th Meeting

Heterogeneous Multi-Core Architectures for High Performance Computing