
PARALLEL IMPLEMENTATION OF THE SINGULAR VALUE DECOMPOSITION USING OPENCL

A Thesis Presented to the Graduate School of Clemson University

In Partial Fulfillment of the Requirements for the Degree Master of Science Computer Engineering

by Bhushan Dattatraya Rayrikar December 2011

Accepted by: Dr. Melissa C. Smith, Committee Chair Dr. Stanley Birchfield Dr. John Gowdy

ABSTRACT

General-Purpose Graphics Processing Units (GPGPUs) have massively parallel computational capabilities. Low cost and ease of programming make them a popular choice over other parallel architectures such as large clusters and accelerators such as Field-Programmable Gate Arrays (FPGAs). Mature programming frameworks for GPGPUs, such as CUDA from NVIDIA and OpenCL from the Khronos Group, reduce the learning curve and development time for programming GPGPU architectures. OpenCL, a relatively new industry standard for parallel computing, makes it possible to write a single program for heterogeneous platforms that is portable across multiple platforms, including GPGPUs and multi-core processors, with minimal coding modifications.

GPGPU architectures have been successfully used for accelerating many computationally expensive problems including many linear algebra algorithms, which are inherently parallel in nature. Singular Value Decomposition (SVD) is a computationally expensive linear algebra technique that has many applications including data compression, facial recognition, and solving a system of equations. As the dimensions of the matrix increase, SVD computation becomes increasingly time consuming. Since SVD is a major part of some algorithms such as Eigenfaces (a facial recognition algorithm based on Principal Component Analysis), the overall runtime for these algorithms depends heavily on the execution time of SVD. Hence, to implement efficient applications based on SVD, for example real-time facial recognition, it is desirable to accelerate the SVD algorithm.


In this work, a parallel implementation of Singular Value Decomposition is discussed in detail. It uses many basic linear algebra techniques such as matrix-vector multiplication, vector norms and vector outer products. This work focuses on the implementation techniques, optimization methods (specifically for a GPGPU implementation) and their effect on the overall performance of the algorithm. We present the performance analysis of this algorithm on NVIDIA’s Tesla C2050 GPU as compared to the single-threaded serial implementation executed on an Intel 2.66 GHz Q9450 processor. We report speedups up to 20x for the parallel SVD computation. The results discussed in this thesis demonstrate the potential of the computational resources available with GPGPUs.


DEDICATION

I dedicate this thesis to my family. Without their support, it wouldn’t have been possible.


ACKNOWLEDGEMENTS

I would like to thank Dr. Melissa Smith. Her constant guidance and support were instrumental in completing this thesis. I would also like to thank Dr. Birchfield and Dr. Gowdy for reviewing my work.

My fellow group members: Ashraf, Vivek, Scott, Harsh, Tushar and Sumedh must be thanked as well. They made our research lab a fun place to work.

Last but not least, I would like to express my gratitude to all the researchers and engineers from NVIDIA, AMD and the Khronos Group for providing excellent technical documentation for the benefit of all researchers.


TABLE OF CONTENTS

Page

TITLE PAGE ...... i

ABSTRACT ...... ii

DEDICATION ...... iv

ACKNOWLEDGMENTS ...... v

LIST OF TABLES ...... viii

LIST OF FIGURES ...... ix

CHAPTER

I. INTRODUCTION ...... 1

II. RELATED WORK, GPGPU ARCHITECTURE AND OPENCL PROGRAMMING FRAMEWORK ...... 5

Related Work ...... 5
History of GPGPU ...... 7
Basic GPU Architecture ...... 8
OpenCL Architecture and Programming Framework ...... 9
Summary ...... 15

III. THE SVD COMPUTATION ...... 16

SVD Computation ...... 16
Householder Bidiagonalization ...... 16
Accumulation of Left and Right Hand Transformations ...... 20
Diagonalization of the Bidiagonal Form ...... 22
Summary ...... 23

IV. PARALLEL IMPLEMENTATION OF THE FIRST TWO STEPS OF THE SVD AND OPTIMIZATION TECHNIQUES ...... 24

Overview of Parallel Implementation ...... 24
Parallel Implementation of Householder Bidiagonalization ...... 26
Accumulation of Left and Right Hand Transformations ...... 30



Optimization Techniques ...... 31
Summary ...... 35

V. EXPERIMENTAL SYSTEM AND RESULTS ...... 36

Experimental System ...... 36
Speed up Calculations ...... 37
Memory Requirements ...... 38
Results ...... 40
Summary ...... 43

VI. CONCLUSION AND FUTURE WORK ...... 44

REFERENCES ...... 47


LIST OF TABLES

Table Page

1.1 Increase in serial runtime with increase in matrix size ...... 2

5.1 Tesla C2050 Specifications...... 37

5.2 Resources Required for each Kernel...... 39

5.3 Serial Runtimes ...... 40

5.4 Speed Up Values for Optimum Local Work Size ...... 42

5.5 Execution times for the fastest parallel implementation ...... 43


LIST OF FIGURES

Figure Page

2.1 Overview of NVIDIA’s GPU Architecture ...... 8

2.2 OpenCL Platform Model ...... 10

2.3 OpenCL Execution Model ...... 11

2.4 Example of OpenCL Indexing ...... 12

2.5 OpenCL Memory Model...... 13

2.6 Structure of an OpenCL Program ...... 14

4.1 Division of the SVD Algorithm on the GPGPU and the CPU ...... 25

4.2 Illustration of the Reduction ...... 27

4.3 Parallel Matrix-Vector Product on the GPGPU ...... 29

4.4 Parallel Implementation of the Outer Product on the GPGPU ...... 30

4.5 Use of Local Memory ...... 33

4.6 Avoiding Unnecessary Conditional Statements...... 35

5.1 Matrix Size vs. Speed up for each Local Work Size ...... 41


CHAPTER 1

INTRODUCTION

Singular Value Decomposition (SVD) is a common matrix decomposition technique with applications in various fields. Given a matrix A with m rows and n columns, the matrix can be decomposed using SVD as:

Equation 1.1: $A = U D V^T$

where $U$ is an $m \times n$ matrix with orthogonal columns ($m \ge n$), $D$ is an $n \times n$ diagonal matrix, and $V^T$ is the transpose of an $n \times n$ orthogonal matrix $V$ [1]. A more detailed explanation of SVD is provided in Chapter 3.

SVD has numerous applied uses such as data compression and solving systems of linear equations as explained in [2]. SVD is also commonly used in signal/image processing applications. One such example is Eigenfaces, a facial recognition algorithm based on Principal Component Analysis (PCA). Such algorithms typically involve a large dataset, and as the data size increases, the time required for SVD computation grows rapidly. For example, the matrix formed by the images in the database used for Eigenfaces has one image per column. So, for a database of 10000 images, each image with a resolution of 100x100, the matrix size becomes 10000x10000. SVD can be used to compute the eigenvectors of the covariance matrix of this database. The SVD computation of such a large database requires a tremendous amount of time. Table 1.1 shows SVD computation times for matrices of different sizes. Table 1.1 clearly shows that with increasing size, the computation time for the serial code executed on an Intel 2.66 GHz processor grows rapidly. Hence, acceleration of this algorithm becomes desirable.

Table 1.1: Increase in serial runtime with increase in matrix size

Matrix Size (m × n)      Serial Runtime (Seconds)
1000 × 1000              11.021325
2000 × 2000              136.301463
4000 × 4000              1410.563345
10000 × 10000            29786.823024

The SVD algorithm, as explained in [1], clearly defines the steps required to compute the SVD. The analysis of the serial C code provided in [1] shows the use of basic linear algebra operations such as matrix-vector multiplication, vector norm and scalar-vector multiplication, which are all inherently parallel. A number of hardware platforms and software tools are available that can be used to exploit this parallelism. Field Programmable Gate Arrays (FPGAs), multi-node clusters, and Graphics Processing Units (GPUs) are some of these hardware platforms. FPGAs can be used to build custom hardware, but the learning curve associated with the tools needed for programming makes them a less favorable choice. Multi-node clusters make use of several high-speed processors connected together in a network. A parallel programming framework such as the Message Passing Interface (MPI) can be used to implement a program in parallel, but the size of the cluster and the cost associated with its storage and maintenance can be prohibitive.


Finally, recent developments have enabled GPUs to perform general purpose computing, popularly known as GPGPU (General Purpose Computing on GPUs).

GPGPUs are massively parallel devices composed of a large number of compute cores that can handle millions of lightweight threads with very low context-switching overhead. They also have on-chip and off-chip memories to store the data.

Efficient use of the compute cores and the memory can lead to significant performance improvement in a given application. Fermi from NVIDIA and Radeon from AMD are two state-of-the-art GPGPU architectures. The devices based on these architectures offer a peak performance capability of up to 1 TFLOPS [3] [4]. The leading programming frameworks for GPGPUs are CUDA from NVIDIA and OpenCL from Khronos Group.

These programming APIs are easy to learn and facilitate fast development time. An advantage of the more general purpose OpenCL framework is that it allows an algorithm to be run on different hardware platforms with minimal or no modifications to the code.

Hence, ease of programming, relatively low cost, and small size make GPGPUs a good choice for accelerating day-to-day as well as scientific applications.

In this thesis, a parallel implementation of the first two steps involved in the SVD computation on GPGPUs using OpenCL will be discussed. These steps are the Householder Bidiagonalization and the accumulation of the transformations. These operations are computationally expensive and can be broken down into vector norms, matrix-vector multiplications and vector outer products. We will present parallel implementations for these operations. The third step, diagonalization of the bidiagonal form, is not computationally expensive and will be performed on the CPU. The experimental setup consists of NVIDIA's Tesla C2050, which is based on the Fermi architecture. The serial code used for performance analysis and comparison is run on an Intel 2.66 GHz processor. This thesis is organized as follows: Chapter 2 will present related work and an introduction to GPGPU architectures and the OpenCL programming framework. In Chapter 3, the SVD algorithm will be discussed in detail. Chapter 4 will explain the parallel implementation of these algorithms and the optimization techniques used in the implementation. Chapter 5 will discuss the experimental system, results and the analysis of those results. Chapter 6 offers some conclusions of this work and possibilities for future work.


CHAPTER 2

RELATED WORK, GPGPU ARCHITECTURE AND OPENCL PROGRAMMING FRAMEWORK

This chapter will discuss previous research work related to our implementation and introduce the GPGPU architecture and the OpenCL programming framework. We will also discuss some architectural details of the NVIDIA Fermi architecture that has been used with OpenCL for implementation of the algorithms.

2.1 Related Work

Several parallel platforms, such as FPGAs and GPGPUs, have been explored by researchers for parallelizing the SVD algorithm. In [5], Lahabar and Narayanan present a GPGPU implementation of SVD using CUDA, the programming framework by NVIDIA. They compare the GPGPU implementation on NVIDIA's GTX 280 with two implementations on an Intel Dual Core 2.66 GHz processor, one using MATLAB and the other using the Intel Math Kernel Library (MKL). They report a speedup of up to 60x over the MATLAB implementation and up to 8x over the Intel MKL implementation.

In [6], Bondhugula et al. present another GPU implementation of the SVD algorithm. They model the SVD computations as a graphics problem, representing matrices as two-dimensional textures on the GPU. They compare their results with an Intel MKL implementation of the SVD and show some performance improvement.


Rahmati et al. present the implementation of a Jacobi-based SVD for image processing applications on an FPGA in [7]. They use dedicated multiplier blocks available in the Spartan-3 family of Xilinx FPGAs for matrix operations and LUTs on the Configurable Logic Blocks (CLBs) for calculating the rotation angles for the diagonalization. They compare their implementation with Intel MKL routines on an Intel Pentium-M processor.

In [8], Kerr et al. present a parallel implementation of the QR decomposition using blocked Householder transformations. They use CUDA and some CUBLAS functions for their implementation. They report a speed up of up to 5x over the Intel MKL implementation.

Tomov et al. present a programming model for hybrid systems with multicores and GPU accelerators using CUDA [9]. They present hybridization techniques for Cholesky, LU and QR factorizations for dense linear algebra solvers.

Leow et al. present a CUDA implementation of the Gaussian elimination method for solving a system of equations in [10]. They report a speed up of up to 185x on an NVIDIA Tesla C1060.

CULA [11] and MAGMA [12] are two professional Matrix Algebra libraries that implement several BLAS and LAPACK subroutines using CUDA.

NVIDIA’s GPU Computing SDK provides an optimized OpenCL routine for reduction of a vector [13]. Our implementation, which will be discussed in a later chapter, performs additional computations in the kernel to obtain the sum of the squared elements of the vector.


While [5], [6], [11] and [12] all present GPGPU implementations of the SVD, our implementation differs in the choice of the programming framework. We choose the OpenCL framework by the Khronos Group, which enables cross-platform development. In the following sections of this chapter, we will discuss the latest GPGPU architecture and the OpenCL framework.

2.2 History of the GPGPU

The term Graphics Processing Unit (GPU) was coined by NVIDIA in 1999 [14].

Traditionally, GPUs have been used to handle pixel/vertex data efficiently and quickly.

Special architectural changes to allow programmability of the graphics pipeline were introduced around 2000. This allowed acceleration of non-graphics applications with graphics APIs such as Cg and DirectX. But programming with these APIs was cumbersome since the algorithm had to be modeled in the form of a graphics problem.

There were also limitations on memory accesses; random memory reads and writes were not allowed.

In 2006, NVIDIA introduced CUDA [15], a unified software and hardware architecture that facilitated complete programmability of NVIDIA GPUs. OpenCL [16], an open standard for programming multi-core processors and GPUs, was introduced in 2008 by the Khronos Group and made programming of heterogeneous platforms possible.

Both CUDA and OpenCL have enabled researchers to accelerate applications in various fields using GPUs.


2.3 Basic GPU Architecture

Figure 2.1 shows a general NVIDIA GPU architecture. A GPU has a number of Streaming Multiprocessors (SMs), each composed of 16 to 32 Stream Processors (SPs). A global thread scheduler dispatches threads to different SMs. Each SM has its own thread dispatcher that schedules thread execution on individual SPs. Each SM has a shared memory that is common to all the SPs within that particular SM. The GPU architecture also has a DRAM that is shared by all the SMs. We will focus on GPUs from the Fermi architecture family [17], the latest GPU architecture by NVIDIA.

Figure 2.1: Overview of NVIDIA’s GPU architecture


2.3.1 NVIDIA’s Fermi Architecture

NVIDIA’s Fermi architecture has major improvements over previous GPU architectures such as the G80 and GT200. It has more compute cores, more memory and Error Correction Code (ECC) support for all the register files, caches and DRAM. The Fermi architecture’s double-precision performance has improved 8x over previous architectures. GPUs based on NVIDIA’s Fermi architecture feature up to 512 cores (SPs) organized into 16 SMs, each consisting of 32 SPs, with up to 6 GB of memory. The SMs share a common L2 cache. Each SM has a scheduler and dispatch unit that manages the execution of threads. In this work, we have used a Tesla C2050 based on the Fermi architecture.

2.4 OpenCL Architecture and Programming Framework

OpenCL is an open standard for parallel computing from the Khronos Group and is not tied to a specific platform. OpenCL can be used to develop applications for CPUs, GPUs and other processors such as DSPs. To provide this cross-platform support, OpenCL utilizes a subset of ISO C99 with extensions for parallelism [16]. We will discuss the OpenCL platform model, execution model, memory model and programming model in this section.

2.4.1 OpenCL Platform Model

Figure 2.2 shows the platform model of OpenCL, which consists of a host and one or more compute devices. The host is usually a CPU and a compute device can be a multi-core CPU, GPU or another such device. In this work, a compute device will be a GPU. A compute device is further divided into several compute units, each of which, on a GPU, maps to a streaming multiprocessor (SM). A compute unit is composed of several processing elements (PEs), which are the individual processing cores (SPs) on NVIDIA GPUs.

Figure 2.2: OpenCL Platform Model [16]

2.4.2 OpenCL Execution Model

The OpenCL execution model defines how threads are mapped to the processing elements. Threads executing on the device are divided into work groups; a work group executes on a single compute unit (SM), where each thread is scheduled for execution on a PE. Threads are identified using a unique ID, which can be obtained by requesting the global ID or by combining the work group ID, the work group size and the local ID of the thread within the work group.


Figure 2.3: OpenCL Execution Model

The global index space is called the NDRange. In Figure 2.3, Gx is the global index space in the x dimension and Gy is the global index space in the y dimension. The NDRange is divided into several work groups as shown in Figure 2.3, where Wx and Wy are the dimensions of each work group in the x and y dimensions. Therefore, there are Wx * Wy work-items in each work group. Finally, the number of work groups in an NDRange is (Gx / Wx) * (Gy / Wy). Figure 2.4 shows an example of OpenCL indexing. A 1-D array with 8 elements has unique global IDs ranging from 0 to 7. We divide this global index space into two work groups so that each work group has 4 elements. The work groups have unique group IDs. Also, the work-items within each work group are identified with thread IDs ranging from 0 to 3. We can access an element in the array in two ways:

1. By directly using the global ID.


2. By using the combination of group ID, group size and thread ID:

Index = (Group ID * Group Size) + Thread ID

Figure 2.4: Example of OpenCL Indexing
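The two indexing schemes can be expressed directly in kernel code with the OpenCL work-item functions. The kernel below is a minimal illustration written for this discussion (it is not taken from the thesis code); it simply copies an element addressed both ways to show that, for a 1-D NDRange with zero offset, the two indices coincide.

    __kernel void indexing_demo(__global const float *A, __global float *B)
    {
        /* Way 1: the global ID directly. */
        int direct = get_global_id(0);

        /* Way 2: group ID * group size + local ID. */
        int composed = get_group_id(0) * get_local_size(0) + get_local_id(0);

        /* For a 1-D NDRange with zero offset, direct == composed. */
        B[composed] = A[direct];
    }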

2.4.3 OpenCL Memory Model

The OpenCL memory model is shown in Figure 2.5. As seen in the figure, each processing element has a private memory that is not accessible to other PEs. This memory is used to store temporary data needed for execution of that particular PE. Each work group has a local memory that is shared by all the threads within that work group.

Private and local memories are on-chip and are faster because of their much lower latency.

Each device also has a global and a constant memory that are off-chip. All threads executing on the device have access to these memories and can read/write a location in the global memory, but the constant memory is read only. Typically, the data transferred from the host to the device is stored in global memory. It is the programmer’s responsibility to move the data from the global memory to the local memory as needed and store the results back to the global memory to make them available to the host.


Figure 2.5: OpenCL Memory Model [16]

2.4.4 OpenCL Programming Model

As shown in Figure 2.6, an OpenCL code is divided into host code and kernel code. Through a query, the host code obtains information about the platform and devices and creates contexts and command queues for those devices. A context contains information about the devices, the memory associated with the devices and the command queues for those devices. Each device has a separate command queue, which is used to hold all the commands submitted to the device for execution. These commands typically include kernels to be executed on the device and data transfer commands used to move the data between the host and the device. The kernel code is executed on the device by all the threads concurrently. The kernels are typically executed in a data-parallel manner, although OpenCL also supports a task-parallel model.

Figure 2.6: Structure of an OpenCL Program

Threads within a work group can synchronize using a work group barrier, ensuring that all threads in the work group have reached the same point. There is no mechanism for threads in two separate work groups to synchronize. In the case of multiple kernels, the programmer can explicitly synchronize between two kernels from the host code using OpenCL event objects or a command queue barrier.
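The host-side flow described above can be condensed into a few OpenCL API calls. The listing below is a minimal, self-contained sketch (error checking omitted, and a trivial scaling kernel used as a stand-in for the SVD kernels); it is not the thesis host code, only an illustration of the platform query, context and command queue creation, buffer transfers and kernel launch.

    #include <CL/cl.h>

    /* Kernel source: a trivial stand-in kernel that scales a vector. */
    static const char *src =
        "__kernel void scale(__global float *x, const float a) {\n"
        "    int i = get_global_id(0);\n"
        "    x[i] = a * x[i];\n"
        "}\n";

    int main(void)
    {
        /* 1. Query the platform and a GPU device. */
        cl_platform_id platform;
        cl_device_id device;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

        /* 2. Create a context and a command queue for the device. */
        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
        cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

        /* 3. Build the program and create the kernel object. */
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
        clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
        cl_kernel kern = clCreateKernel(prog, "scale", NULL);

        /* 4. Move data to the device through a buffer in global memory. */
        float h[256];
        for (int i = 0; i < 256; ++i) h[i] = (float)i;
        cl_mem d = clCreateBuffer(ctx, CL_MEM_READ_WRITE, sizeof(h), NULL, NULL);
        clEnqueueWriteBuffer(queue, d, CL_TRUE, 0, sizeof(h), h, 0, NULL, NULL);

        /* 5. Set arguments and enqueue the kernel over a 1-D NDRange. */
        float a = 2.0f;
        clSetKernelArg(kern, 0, sizeof(cl_mem), &d);
        clSetKernelArg(kern, 1, sizeof(float), &a);
        size_t gws = 256, lws = 64;
        clEnqueueNDRangeKernel(queue, kern, 1, NULL, &gws, &lws, 0, NULL, NULL);

        /* 6. Read the result back and wait for all commands to finish. */
        clEnqueueReadBuffer(queue, d, CL_TRUE, 0, sizeof(h), h, 0, NULL, NULL);
        clFinish(queue);
        return 0;
    }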


2.5 Summary

In this chapter, we introduced the GPGPU architecture and OpenCL programming framework. Later chapters will introduce the SVD algorithm, the parallel implementation and the optimization techniques used to improve the performance of the GPGPU implementation.


CHAPTER 3

THE SVD COMPUTATION

Chapter 2 introduced the OpenCL framework and the GPU architecture. This chapter will introduce the SVD algorithm in detail. It will explain the steps involved in computation of the SVD and will highlight parts of the algorithm that can be computed in parallel.

3.1 SVD Computation

As explained in Chapter 1, Singular Value Decomposition, commonly referred to as SVD, is a matrix decomposition technique with applications in many fields. A matrix A with m rows and n columns can be decomposed using SVD as [18] [1]:

Equation 3.1: $A = U D V^T$

where $U$ is an $m \times n$ matrix with orthogonal columns, $D$ is an $n \times n$ diagonal matrix, and $V$ is an $n \times n$ orthogonal matrix. SVD computation involves three major steps:

1. Householder bidiagonalization.
2. Accumulation of left and right hand transformations.
3. Diagonalization of the bidiagonal form.

The following sections will discuss each of these steps in detail.

3.2 Householder Bidiagonalization

In linear algebra, before performing any complex operations on a matrix, it is preferable to reduce it to a simple form such as a bidiagonal or a tridiagonal form. Such reduction often involves annihilating some elements in each column and/or row of the matrix with reflection and rotation techniques [18]. For Householder bidiagonalization, Householder reflections are used. A Householder matrix P that transforms a vector is defined [18] as:

Equation 3.2: $P = I - \dfrac{2 v v^T}{v^T v}$

where $I$ is the identity matrix and $v$ is the Householder vector; the matrix $P$ is orthogonal. According to the definition of the Householder reflector, multiplying a vector x, composed of a column of A starting from the diagonal element or a row of A starting from the element next to the diagonal element, with the Householder matrix P should result in annihilation of all the elements of x except the first one.

Equation 3.3: $Px = \left(I - \dfrac{2 v v^T}{v^T v}\right)x = x - \dfrac{2\, v^T x}{v^T v}\, v$

From Equation 3.2, we can see that the computation of P depends on the choice of vector v. Consider the vector $e_1 = [1, 0, \dots, 0]^T$ with a length equal to that of vector x. Because of the nature of the Householder transformation, Px should be a multiple of $e_1$, which means that $v$ must lie in the span of $x$ and $e_1$. It implies that $v = x + \alpha e_1$ for some scalar $\alpha$. Substituting $v = x + \alpha e_1$ in Equation 3.3 gives:

Equation 3.4: $Px = \left(1 - 2\,\dfrac{x^T x + \alpha x_1}{x^T x + 2\alpha x_1 + \alpha^2}\right)x - 2\alpha\,\dfrac{v^T x}{v^T v}\, e_1$

To make the coefficient of x in Equation 3.4 equal to zero, we choose $\alpha = \pm \|x\|_2$. Therefore, v becomes:

Equation 3.5: $v = x \pm \|x\|_2\, e_1$

Substituting the value of v from Equation 3.5 into Equation 3.3, we obtain:

Equation 3.6: $Px = x - \dfrac{2\, v^T x}{v^T v}\, v = \mp \|x\|_2\, e_1$

Algorithm 3.1 [18] demonstrates the steps necessary to compute the Householder vector v for a vector x.

3.2.1 Algorithm 3.1 (Computation of Householder vector)

Input: a column or row x of matrix A with n elements. Output: Householder vector v and scalar β. Indexing starts from 1.

i.     Compute σ = x(2:n)ᵀ x(2:n)
ii.    v = [1; x(2:n)]
iii.   if σ = 0
iv.        β = 0
v.     else
vi.        μ = sqrt(x(1)² + σ)
vii.       if x(1) ≤ 0
viii.          v(1) = x(1) − μ
ix.        else
x.             v(1) = −σ / (x(1) + μ)
xi.        end
xii.       β = 2 v(1)² / (σ + v(1)²)
xiii.      v = v / v(1)
xiv.   end
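The C function below is a compact sketch of Algorithm 3.1 (the house routine of Golub and Van Loan as reconstructed above), written with 0-based array indexing. It is offered only as an illustration under those assumptions; the thesis performs the same computation with OpenCL kernels.

    #include <math.h>

    /* Householder vector computation following Algorithm 3.1 (0-based indexing).
       x has n entries; outputs are v (length n) and *beta. */
    static void house(const double *x, int n, double *v, double *beta)
    {
        double sigma = 0.0;
        for (int i = 1; i < n; ++i)
            sigma += x[i] * x[i];            /* sigma = x(2:n)^T x(2:n) */

        v[0] = 1.0;
        for (int i = 1; i < n; ++i)
            v[i] = x[i];                     /* v = [1; x(2:n)] */

        if (sigma == 0.0) {
            *beta = 0.0;
        } else {
            double mu = sqrt(x[0] * x[0] + sigma);
            if (x[0] <= 0.0)
                v[0] = x[0] - mu;
            else
                v[0] = -sigma / (x[0] + mu);
            *beta = 2.0 * v[0] * v[0] / (sigma + v[0] * v[0]);
            for (int i = n - 1; i >= 0; --i)
                v[i] /= v[0];                /* normalize so that v(1) = 1; v[0] is divided last */
        }
    }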

To reduce matrix A to the bidiagonal form, we choose vector x as one of the columns of the matrix. We compute the vector v for x and update matrix A. By repeatedly applying Algorithm 3.1 for each column and row of a matrix, we can completely reduce it to the bidiagonal form. The choice of x for columns of the matrix versus rows of the matrix is slightly different. When choosing x for a column, we choose it from the diagonal element. When choosing x for a row, we choose it from the element next to the diagonal element. Algorithm 3.2 [18] explains the complete procedure for Householder bidiagonalization of a matrix.

3.2.2 Algorithm 3.2 (Householder Bidiagonalization)

i.     for i = 1:n
ii.        Compute β and v for x = A(i:m, i), the column of matrix A starting from the diagonal element, as explained in Algorithm 3.1.
iii.       Update matrix A using A(i:m, i:n) = (I − βvvᵀ) A(i:m, i:n)
iv.        Store v in place of the column: A(i+1:m, i) = v(2:m−i+1)
v.         if i ≤ n−2
vi.            Compute β and v for x = A(i, i+1:n)ᵀ, the row of matrix A starting from the element next to the diagonal element, as explained in Algorithm 3.1.
vii.           Update matrix A using A(i:m, i+1:n) = A(i:m, i+1:n)(I − βvvᵀ)
viii.          Store v in place of the row: A(i, i+2:n) = v(2:n−i)ᵀ
ix.        end
x.     end
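As an illustration of the update in step (iii), the serial C sketch below applies (I − βvvᵀ) to the trailing submatrix of a row-major array by first forming the matrix-vector part and then subtracting the rank-1 term. It is a reference version under the stated storage assumptions, not the GPU code described in Chapter 4.

    /* Apply (I - beta*v*v^T) to the trailing submatrix A(i:, i:) of a row-major
       m x n array, as in step (iii) of Algorithm 3.2 (0-based indexing).
       v has m - i entries and corresponds to rows i..m-1. */
    static void apply_householder_left(double *A, int m, int n, int i,
                                       const double *v, double beta)
    {
        for (int col = i; col < n; ++col) {
            /* w = beta * (v^T A(i:, col)) -- the matrix-vector part */
            double w = 0.0;
            for (int row = i; row < m; ++row)
                w += v[row - i] * A[row * n + col];
            w *= beta;

            /* A(i:, col) -= v * w -- one column of the outer-product update */
            for (int row = i; row < m; ++row)
                A[row * n + col] -= v[row - i] * w;
        }
    }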

3.2.3 Parallelism in Householder Bidiagonalization

Computation of the Householder vector (Algorithm 3.1) involves computing the sum of the squared elements of vector x, which can be achieved by a parallel reduction mechanism where each thread computes the sum of two elements in the vector. After one reduction pass, we obtain partial sums. These can then be added sequentially to produce the final result.

Algorithm 3.2 uses Algorithm 3.1 in steps (ii) and (vi). Steps (iii) and (vii) involve updating the matrix:


Equation 3.7: $A(i:m,\, i:n) = (I - \beta v v^T)\,A(i:m,\, i:n) = A(i:m,\, i:n) - \beta v\,(v^T A(i:m,\, i:n))$

This update involves one matrix-vector multiplication and one outer product. Both of these operations can be parallelized. For the matrix-vector multiplication, the matrix and the vector are divided into several blocks. Each thread computes the dot product of one vector block and the corresponding block of one row of the matrix in parallel. For the outer product, the column vector is multiplied by each element in the row vector and then added to the respective column of the matrix. Each thread block handles a block of the column vector, multiplies it with the corresponding element in the row vector and updates the corresponding block of the matrix. This part of the algorithm is executed on the GPU. A more detailed explanation of the parallel implementation is provided in Chapter 4.

3.3 Accumulation of left and right hand transformations

The transformations that reduce the matrix to the bidiagonal form are accumulated to form the left and right singular vector matrices. For a matrix A, B is the bidiagonal form obtained after applying the Householder reflectors:

Equation 3.8: $B = U_B^T\, A\, V_B$

where

Equation 3.9: $U_B = (I - \beta_1 v_1 v_1^T)(I - \beta_2 v_2 v_2^T) \cdots (I - \beta_n v_n v_n^T)$

and

Equation 3.10: $V_B = (I - \beta'_1 v'_1 v_1'^T)(I - \beta'_2 v'_2 v_2'^T) \cdots (I - \beta'_{n-2} v'_{n-2} v_{n-2}'^T)$


Equation 3.9 represents the column transformations and Equation 3.10 represents the row transformations. From steps (iv) and (viii) of Algorithm 3.2, we can see that the Householder vectors are stored in place of the columns and rows of the matrix under consideration. Using these vectors, we can construct the left and right hand singular matrices U and V mentioned in Equation 3.1. The right hand transformations are accumulated first in V, and then the left hand transformations are accumulated. Algorithm 3.3 illustrates this operation.

3.3.1 Algorithm 3.3 (Accumulation of transformations)

i.     Initialize V to all zeros.
ii.    for i = n:1
iii.       Copy the i-th row of A to the i-th column of V.
iv.        Update V with the row transformation vectors: V(i:n, i:n) = (I − βvvᵀ) V(i:n, i:n)
v.     end

A similar algorithm is employed on the column vectors to accumulate the left hand transformations.

3.3.2 Parallelism in Algorithm 3.3

The update in step (iv) of Algorithm 3.3 is similar to step (iii) of Algorithm 3.2. The update involves a matrix-vector product and an outer product. Hence, the parallelization methods mentioned before for the Householder bidiagonalization also apply to Algorithm 3.3.

This part of the algorithm is also executed on the GPU.


3.4 Diagonalization of the bidiagonal form

This step iteratively annihilates the super-diagonal elements and produces the diagonal matrix D of Equation 3.1. This is achieved with the help of implicit QR shifts that use Givens rotations [1] [18]. A Givens matrix G has the form:

Equation 3.11: $G = \begin{bmatrix} c & s \\ -s & c \end{bmatrix}$

where $c = \cos\theta$ and $s = \sin\theta$. Multiplying a vector with the Givens matrix rotates it by θ radians. We can choose θ such that one of the elements of a vector v is made zero. We begin with the bottom 2x2 minor of matrix B, formed by its last two rows and columns, and perform the Givens rotations on all the minors moving up diagonally.

Algorithm 3.4 [18] explains the steps involved in this operation.

3.4.1 Algorithm 3.4 (Diagonalization of the bidiagonal form)

i.     Find y = t₁₁ − μ and z = t₁₂, where μ is the eigenvalue of the bottom 2x2 minor of T = BᵀB. Computation of the full T is not necessary as we only need its bottom 2x2 elements.
ii.    for i = n:1
iii.       Determine c = cos θ and s = sin θ such that, after applying the Givens rotation to the vector [y, z], z becomes zero, and apply the transformation to the corresponding elements of V.
iv.        Update y and z with the elements of B affected by this rotation.
v.         Determine c = cos θ and s = sin θ such that, after applying the Givens rotation to the vector [y, z], z becomes zero, and apply the transformation to the corresponding elements of U.
vi.    end
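For reference, the rotation coefficients in steps (iii) and (v) can be computed as in the short C function below. This is the textbook formulation (the helper name is hypothetical, not from the thesis code), with hypot used to avoid overflow when forming the norm.

    #include <math.h>

    /* Compute c = cos(theta) and s = sin(theta) such that
         [ c  s ] [ y ]   [ r ]
         [-s  c ] [ z ] = [ 0 ],   r = sqrt(y*y + z*z).  */
    static void givens(double y, double z, double *c, double *s)
    {
        if (z == 0.0) {
            *c = 1.0;
            *s = 0.0;
        } else {
            double r = hypot(y, z);   /* hypot avoids overflow/underflow in the norm */
            *c = y / r;
            *s = z / r;
        }
    }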


3.4.2 Parallelism in Algorithm 3.4

Algorithm 3.4 is iterative and the amount of data handled in a single iteration is relatively small. Hence, it is not suitable for data parallel techniques and this portion of the algorithm is executed on the host, i.e., the CPU.

3.5 Summary

In this chapter, the details of SVD computation were discussed and the parallelism in each step of the algorithm was highlighted. Chapter 4 will discuss the parallel implementation in more detail along with optimization techniques used to improve the performance of the parallel code.


CHAPTER 4

PARALLEL IMPLEMENTATION OF THE FIRST TWO STEPS OF THE SVD AND OPTIMIZATION TECHNIQUES

Chapter 3 introduced the SVD algorithm, the steps involved in SVD computation and the parallelism inherent to each step. In this chapter, we will explain the parallel implementation of the SVD algorithm and the optimization techniques that fully utilize the GPGPU hardware. We will refer to the equations and algorithms described in Chapter 3, explain how the SVD computation is a sequence of basic linear algebra operations, and demonstrate their parallel implementation.

4.1 Overview of Parallel Implementation

As explained in Chapter 3, the SVD algorithm consists of three major steps:

1. Householder Bidiagonalization

2. Accumulation of Left and Right Hand Transformations

3. Diagonalization of the Bidiagonal Form

Steps 1 and 2 have more parallelism and are executed on the GPGPU, while Step 3 is implemented on the CPU. The following sections will describe the parallel implementation of Steps 1 and 2 in detail. Figure 4.1 illustrates this division of computation between the host CPU and the GPGPU.

For a matrix with n columns, the Householder Bidiagonalization is achieved using Householder reflections on each row and column one by one as explained in Chapter 3.


Starting with the first column (from the first diagonal element), we first apply the column reflections, then the row reflections on the row that contains the same diagonal element.

We repeat this process on all the columns and rows sequentially. For the accumulation of transformations, we use the same process in reverse order, i.e., we start from the last row, compute the transformation for the last column, and then go up diagonally to the first column. Hence, for the first step, we iterate from 0 to n-1 performing one Householder step in each iteration. Similarly, for the second step, we iterate from 0 to n-1 and in each iteration, we perform one Accumulation step. The following sections in this chapter will explain the parallel implementation of the Householder and Accumulation steps.

Figure 4.1: Division of the SVD algorithm on the GPGPU and the CPU


4.2 Parallel Implementation of Householder Bidiagonalization

This procedure requires computation of the length of a row or column vector of the matrix, one matrix-vector multiplication, and one outer product for updating the matrix. This section will describe the parallel implementation of each of the above steps in detail.

4.2.1 Computation of the length of a vector using Reduction

From Algorithms 3.1 and 3.2 in Chapter 3, the first step in Householder Bidiagonalization is determining the Householder vector v, which requires calculation of the length of a row or column of the given matrix. The length of a vector x with n elements is calculated as:

Equation 4.1: $\|x\| = \sqrt{\sum_{i=1}^{n} x(i)^2}$

This operation is performed on the GPGPU by a reduction kernel. Each element in the vector is first squared. The vector is divided into numThreadBlocks parts, where numThreadBlocks is the number of work groups scheduled for execution on the GPGPU. Each work group or thread block operates on its n/numThreadBlocks elements to produce a partial sum by applying the reduction technique. For a block of n/numThreadBlocks elements, each work group requires n/(2 * numThreadBlocks) processing elements. Each processing element computes the sum of 2 elements, resulting in n/(2 * numThreadBlocks) sums after the 1st level of reduction. This process continues until the total sum is obtained for all the elements in that work group, resulting in numThreadBlocks partial sums. These sums can be added sequentially to obtain the final sum. Figure 4.2 illustrates this reduction process. Our implementation of the reduction is specific to the first step of the SVD computation and is different from the optimized OpenCL reduction kernel provided with the NVIDIA GPU Computing SDK [13]. We square each element of the vector before adding, and hence perform more operations.

Therefore, we will not provide any comparisons between these two implementations.

Figure 4.2: Illustration of Reduction
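A minimal OpenCL kernel implementing this squared-sum reduction is sketched below. It is an illustration under the assumptions stated in the comments (the kernel name and argument list are hypothetical), not the exact thesis kernel; each work group writes one partial sum, and the remaining partial sums are added afterwards.

    __kernel void sum_of_squares(__global const float *x,
                                 __global float *partial,
                                 __local float *scratch,
                                 const int n)
    {
        int gid = get_global_id(0);
        int lid = get_local_id(0);

        /* Square while loading; pad with zeros past the end of the vector. */
        float v = (gid < n) ? x[gid] : 0.0f;
        scratch[lid] = v * v;
        barrier(CLK_LOCAL_MEM_FENCE);

        /* Tree reduction in local memory: each step halves the active threads. */
        for (int s = get_local_size(0) / 2; s > 0; s >>= 1) {
            if (lid < s)
                scratch[lid] += scratch[lid + s];
            barrier(CLK_LOCAL_MEM_FENCE);
        }

        /* One partial sum per work group. */
        if (lid == 0)
            partial[get_group_id(0)] = scratch[0];
    }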


4.2.2 Matrix-Vector Product

After the reduction, we update the matrix A using:

Equation 4.2: $A(i:m,\, i:n) = (I - \beta v v^T)\,A(i:m,\, i:n) = A(i:m,\, i:n) - \beta v\,(v^T A(i:m,\, i:n))$

Equation 4.2 is step (iii) of Algorithm 3.2. This step represents a matrix-vector multiplication (wᵀ = vᵀ A(i:m, i:n)) and an outer product (v wᵀ), which is then subtracted from the original matrix A. In the parallel implementation, the vector is divided into blocks of local work size LWS = n / numThreadBlocks, where n is the size of the vector, numThreadBlocks is the number of work groups scheduled for execution on the GPGPU, and LWS is the number of threads or processing elements per work group. The matrix is divided into blocks of LWS rows/columns, each containing LWS elements. Each thread computes the dot product of one vector block with one row/column block of the matrix. Hence, n partial dot products are computed in parallel, one by each thread. This process is repeated with the remaining blocks of the matrix and the vector, and the dot products are added to the ones computed for the first block, finally producing the resultant vector. We get the final result after GWS/LWS iterations, where GWS is the Global Work Size, which is equal to the size of the vector. GWS must be divisible by LWS. If it is not divisible, we have to append some zeros to the vector such that GWS becomes an integer multiple of LWS. The parallel implementation of the matrix-vector multiplication on the GPGPU is shown in Figure 4.3. The serial implementation of the matrix-vector product has a computational complexity of O(n × n), while the parallel implementation has a complexity of O(n).

Figure 4.3: Parallel Matrix-Vector Product on the GPGPU
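The sketch below illustrates this blocked scheme for the product y = Aᵀx on a row-major matrix, with the vector block staged in local memory. The kernel name and argument list are hypothetical; the thesis kernels additionally operate on trailing submatrices and handle both the row and column variants.

    __kernel void matT_vec(__global const float *A,   /* m x n matrix, row-major */
                           __global const float *x,   /* vector of length m      */
                           __global float *y,         /* result of length n      */
                           __local float *xblk,       /* LWS floats of local memory */
                           const int m, const int n)
    {
        int col = get_global_id(0);
        int lid = get_local_id(0);
        int lws = get_local_size(0);
        float acc = 0.0f;

        /* Walk over the vector one block of LWS elements at a time. */
        for (int base = 0; base < m; base += lws) {
            int r = base + lid;
            xblk[lid] = (r < m) ? x[r] : 0.0f;   /* stage one vector block in local memory */
            barrier(CLK_LOCAL_MEM_FENCE);

            int blk = min(lws, m - base);
            if (col < n)
                for (int k = 0; k < blk; ++k)    /* partial dot product for this block */
                    acc += A[(base + k) * n + col] * xblk[k];
            barrier(CLK_LOCAL_MEM_FENCE);
        }

        if (col < n)
            y[col] = acc;
    }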

4.2.3 The Outer Product

The outer product of two vectors u and v (u vᵀ) produces a matrix with n rows and n columns. A column vector is multiplied by each element of a row vector (or vice-versa) to produce a corresponding column of the resultant matrix. Hence, it is a series of scalar-vector products. In Householder Bidiagonalization, instead of creating a new matrix, we update the elements of the given matrix. In the parallel implementation of this procedure, one of the vectors and the matrix are divided into blocks of size LWS just as in the matrix-vector product. Each thread computes the multiplication of one vector block with each element of the other vector and then updates the corresponding row/column block of the matrix. Thus, n scalar-vector multiplications are carried out in parallel in one iteration for one block. This process is repeated for the remaining blocks until we compute the scalar-vector multiplication of the whole vector with each element of the other vector. The serial implementation of this operation has a computational complexity of O(n²), while the parallel version has a computational complexity of O(n). Figure 4.4 shows the parallel implementation of the outer product.

Figure 4.4: Parallel Implementation of the Outer Product on the GPGPU
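A matching sketch of the blocked rank-1 update A ← A − βvwᵀ is shown below, again with the column vector staged in local memory one block at a time. As before, the kernel name and arguments are hypothetical and the thesis kernels operate on trailing submatrices.

    __kernel void rank1_update(__global float *A,        /* m x n matrix, row-major */
                               __global const float *v,  /* column vector, length m */
                               __global const float *w,  /* row vector, length n    */
                               __local float *vblk,      /* LWS floats of local memory */
                               const float beta,
                               const int m, const int n)
    {
        int col = get_global_id(0);
        int lid = get_local_id(0);
        int lws = get_local_size(0);
        float wj = (col < n) ? w[col] : 0.0f;

        for (int base = 0; base < m; base += lws) {
            int r = base + lid;
            vblk[lid] = (r < m) ? v[r] : 0.0f;   /* stage a block of v in local memory */
            barrier(CLK_LOCAL_MEM_FENCE);

            int blk = min(lws, m - base);
            if (col < n)
                for (int k = 0; k < blk; ++k)    /* scalar-vector update of this block */
                    A[(base + k) * n + col] -= beta * vblk[k] * wj;
            barrier(CLK_LOCAL_MEM_FENCE);
        }
    }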

4.3 Accumulation of Left and Right Hand Transformations

According to Algorithm 3.3, the matrix update required for accumulation of the transformations has the form:


Equation 4.3: $V(i:n,\, i:n) = (I - \beta v v^T)\,V(i:n,\, i:n) = V(i:n,\, i:n) - \beta v\,(v^T V(i:n,\, i:n))$

Equations 4.3 and 4.2 are similar; as in Equation 4.2, Equation 4.3 also represents one matrix-vector product and one outer product. Hence, the parallel implementation for this step is the same as that of the Householder Bidiagonalization.

4.4 Optimization Techniques

Along with parallelizing the algorithm, optimizing the code to handle the data efficiently is also important. Once the data is transferred to the compute cores, computations are rapidly performed. However, the transfer of data from the host to the device and the transfer of data between the global and the local memory pose a bottleneck for the overall performance and speed of the application. The optimization techniques discussed here aid in efficient data management.

4.4.1 Use of Local Memory

Local memory or shared memory is located on-chip and is shared by all threads within a work group. The off-chip global memory has a latency of 400 to 600 cycles [15] [19] and hence, repeated access should be avoided. In cases where a large number of operations will be performed on the data, necessitating repeated access, copying the data from the global memory to the local memory and then performing the operations on that data yields better performance, since the local memory is much faster. A common method used in such cases is copying one block of data to the local memory of each work group, performing the operations and then copying the result back to the global memory. We have used this method for the reduction, the matrix-vector product, and the outer product kernels.

In the reduction kernel, we must access one vector block repeatedly in order to compute the partial sum for that block. Hence, we copy that block to the local memory first and then perform the reduction. In the matrix-vector product kernel, each vector block is accessed repeatedly for its multiplication with the corresponding column/row vector block from the matrix. Hence, we first copy the vector block to the local memory and then perform the multiplication. Similarly, in the outer product kernel, the vector block is accessed repeatedly for its multiplication with the corresponding elements in another vector block. Hence, it is first copied to the local memory before the operations are executed. We do not need to copy the matrix blocks to the local memory as each element in a block is accessed only once. The code snippet in Figure 4.5 illustrates how to copy the data from the global memory to the local memory. Assume array A is stored in the global memory, has a size of 8, and the local work size (LWS) is 4. Therefore, the global ID ranges from 0 to 7 and the local ID ranges from 0 to 3. We declare a local array (named sdata in the code snippet) of size LWS, i.e., 4. Each work group has its own copy of the local array; in this case there are two. We copy 4 elements into each local array: the first 4 elements are copied to the local array belonging to the first work group and the remaining 4 elements are copied to the local array belonging to the second work group. We then perform the operations on the local data and copy the result back to the global memory. It is possible to utilize conditionals for copying only the required part of the array from some offset.


The barrier used in the code ensures that all threads within the work group have reached the same point in the execution path.

#define LWS 4   /* local work size; must be a compile-time constant to size the local array */

__kernel void xxx(__global float *A)
{
    int gid = get_global_id(0);
    int tid = get_local_id(0);

    __local float sdata[LWS];
    sdata[tid] = A[gid];
    barrier(CLK_LOCAL_MEM_FENCE);

    /* Perform operations on local data */

    /* Write the result back to the global memory */
}

Figure 4.5: Use of Local Memory

4.4.2 Minimizing Transfer of Data between the Host and the Device

Before performing any GPGPU operations on the data, the data must first be transferred to the device, and once the results are obtained, they must be transferred back to the host. Such transfers add a significant overhead if repeated frequently, and any unnecessary transfers must be avoided. In our implementation of the reduction kernel, we compute n/LWS partial sums on the device, where n is the number of elements in the vector and LWS is the local work size. Since the CPU has a higher clock rate than the GPGPU, a natural choice would be to transfer the array of partial sums back to the host, compute the sum and transfer it back to the device. Since the lengths of all the rows and columns starting from the diagonal element must be computed, n transfers would be required between the host and the device, which would add a large overhead and reduce the speed of the algorithm. Hence, we compute the sum sequentially on the device, even though it is slower than computing it on the host, because the overhead from the data transfers makes the alternative cost prohibitive. We transfer the data to the device, perform the Householder Bidiagonalization and the accumulation of transformations and, finally, copy the result back to the host, eliminating any unnecessary transfer of the data in between.

4.4.3 Ensuring the Same Execution Path for All the Threads and Reducing Conditional Statements

If threads in a work group encounter an if-else condition and follow different execution paths, the device evaluates both branches, which serializes the code and reduces performance [15] [19]. Hence, it is important to avoid divergent paths. In our implementation, to make the global work size an integer multiple of the local work size, we select a number greater than the dimension of the matrix (either the number of rows or columns) that is a multiple of the local work size. Hence, when we copy the elements from the global memory to the local memory, if the indices are out of bounds, we make those entries in the local memory 0. An if-else condition seems an easy choice in such a situation, but since conditionals degrade the performance, a better solution is offered in Figure 4.6.


If-Else implementation:

    if (global_id < num_columns)
        local_vector[thread_id] = A[global_id];
    else
        local_vector[thread_id] = 0.0f;

Improved implementation:

    temp = 0.0f;
    if (global_id < num_columns)
        temp = A[global_id];
    local_vector[thread_id] = temp;

Figure 4.6: Avoiding Unnecessary Conditional Statements

4.5 Summary

In this chapter, we discussed the overview of the SVD algorithm and the details of the parallel implementation of the two steps necessary for SVD computation. We also discussed some important optimization techniques that help improve the performance of the parallel code. In Chapter 5, we will discuss results of the parallel implementation with these optimizations.


CHAPTER 5

EXPERIMENTAL SYSTEM AND RESULTS

Chapter 4 described the parallel implementation of the SVD algorithm and the optimization techniques. In this chapter, we will explain the experimental system used for executing the parallel and the serial codes. We will also describe the speed up calculation method, the different time components of the parallel implementation, and the memory requirements for the parallel implementation and present the results of our work.

5.1 Experimental System

The experimental system consists of an Intel Q9450 2.66 GHz quad core CPU that acts as the host and an NVIDIA Tesla C2050, a GPGPU based on the Fermi architecture. The single-threaded serial code executes completely on the CPU while the parallel implementation executes part of the code on the GPGPU and some on the CPU.

As explained in Chapter 3, the Householder Bidiagonalization and the accumulation of the transformations execute on the GPGPU, and the diagonalization of the bidiagonal form is performed on the CPU. Table 5.1 shows the important specifications for the NVIDIA Tesla C2050. The warp size specifies the number of threads active at a given time on a multi-processor. At most, 32 warps can reside on a multi-processor simultaneously, meaning 1024 threads can be scheduled for execution on a multi-processor simultaneously. The local memory size of 48 KB is the local memory available to each multiprocessor.


Table 5.1: Tesla C2050 Specifications

Number of Compute Units                        14
Number of Processing Cores                     448 (32 per Compute Unit)
Maximum Clock Frequency                        1147 MHz
Global Memory Size                             2687 MB
Local Memory Size                              48 KB
Number of Registers per 32 Processing Cores    32768
Warp Size                                      32
Maximum Number of Warps per Multi-processor    32

5.2 Speed Up Calculation

The serial and parallel codes are run on matrices of varying sizes with varying local work sizes and their execution times are compared. Both versions of the code read the matrices from binary files, compute the SVD, and write the results back to binary files.

The time required for reading and writing the files is not considered in the execution time. However, the time required for transferring the matrices to the device and transferring the result back to the host is considered in the execution time for the parallel implementation. The serial execution time $T_{serial}$ is the time required to compute the SVD for an m × n matrix, excluding the time required for reading the matrix from the file and writing the result back. The parallel runtime is defined as:

Equation 5.1: $T_{parallel} = T_{bidiag} + T_{accum} + T_{diag} + T_{transfer}$

where $T_{bidiag}$ is the time required to perform the Householder Bidiagonalization on the GPGPU, $T_{accum}$ is the time required to perform the accumulation of the transformations on the GPGPU, $T_{diag}$ is the time required for the diagonalization of the bidiagonal form on the CPU, and $T_{transfer}$ is the total time required for transferring the matrices to the device and transferring the results back to the host. The speed up is calculated as:

Equation 5.2: $Speedup = T_{serial} / T_{parallel}$

In the remaining sections of this chapter, we will present the memory requirements, speed up, and breakdown of the runtime for each kernel of the parallel code.

5.3 Memory Requirements

The parallel code is divided into six compute kernels. Four of them compute the Householder bidiagonalization, the fifth one computes the right hand transformations and the last one computes the left hand transformations. OpenCL has a function clGetProgramBuildInfo that can be used in conjunction with NVIDIA's OpenCL build option -cl-nv-verbose to obtain information about the resources that each kernel is using. Table 5.2 shows the amount of constant memory and registers used by each kernel. The maximum register count, as seen from Table 5.2, is 25 registers per thread. Since there are 32 processing cores per compute unit, the total number of registers amounts to 25 × 32 = 800 per compute unit. Since the NVIDIA Tesla C2050 has 32768 registers per 32 processing cores, the register requirement is far less than the available registers. All temporary variables declared inside the kernel reside in these registers. The constant memory shown in Table 5.2 is part of the global memory, which is cached to improve performance. All constant variables are stored in the constant memory.

Table 5.2: Resources required for each kernel

Kernel                                        Function                                          Resources Used
Row Vector Reduction                          Compute length of a row vector                    Registers: 8; Constant Memory: 52 bytes
Column Vector Reduction                       Compute length of a column vector                 Registers: 8; Constant Memory: 56 bytes
Row Transformations                           Matrix-row vector product and outer product       Registers: 26; Constant Memory: 76 bytes
Column Transformations                        Matrix-column vector product and outer product    Registers: 23; Constant Memory: 76 bytes
Accumulation of Right Hand Transformations    Matrix-row vector product and outer product       Registers: 24; Constant Memory: 68 bytes
Accumulation of Left Hand Transformations     Matrix-column vector product and outer product    Registers: 25; Constant Memory: 64 bytes

The use of local or shared memory depends on the local work size. Each kernel allocates LWS × sizeof(float) bytes of local memory. For the SVD computation of an m × n matrix, we allocate m × n × sizeof(float) + n × n × sizeof(float) + n × sizeof(float) bytes of global memory. The first chunk of memory is for the m × n matrix U, the n × n chunk of memory is for the matrix V, and the n × 1 chunk is for a vector that contains the singular values, which are the diagonal entries of the matrix D.
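In host code, these allocations correspond to three clCreateBuffer calls of the sizes given above. The helper below is a hypothetical sketch (the function and variable names are not from the thesis) assuming single-precision float elements.

    #include <CL/cl.h>

    /* Hypothetical helper: allocate the three global-memory buffers described
       above, assuming single-precision (float) matrix elements. */
    static void alloc_svd_buffers(cl_context ctx, int m, int n,
                                  cl_mem *dU, cl_mem *dV, cl_mem *dW)
    {
        cl_int err;
        *dU = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                             (size_t)m * n * sizeof(float), NULL, &err); /* m x n matrix U   */
        *dV = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                             (size_t)n * n * sizeof(float), NULL, &err); /* n x n matrix V   */
        *dW = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                             (size_t)n * sizeof(float), NULL, &err);     /* n singular values */
    }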


5.4 Results

The execution times for the serial code based on [1] for matrices of different sizes are listed in Table 5.3. We will present the achieved speed up graphs for matrix sizes mentioned in Table 5.3 with different local work sizes and then present the breakdown of the execution time required for the different parts of the parallel code for the fastest results.

Table 5.3: Serial Runtimes

Matrix Size (m × n)      Serial Runtime (Seconds)
10000 × 100              0.865705
1000 × 1000              11.021325
2000 × 2000              136.301463
4000 × 4000              1410.563345
10000 × 10000            29786.823024

Figure 5.1 shows the matrix size vs. speed up for different local work sizes. We can observe that the speed up increases as the matrix size increases. Also, the speed up for each matrix size depends on the local work size because, as the matrix size increases, each work group handles more data irrespective of the local work size, since more warps are scheduled per work group. While one warp is performing computations, another warp scheduled on the same multi-processor can fetch data from the global memory to the local memory and be ready for its turn. This helps to hide global memory latency. The ratio of the number of active warps to the maximum number of warps allowed is called the occupancy of a multi-processor.

Figure 5.1: Matrix sizes vs. Speed up for each Local Work Size

The choice of local work size and the number of registers used in the kernel play an important role in determining the occupancy. NVIDIA provides an occupancy calculator that calculates the occupancy based on the number of registers, the bytes of local memory used and the local work size. All the kernels have an occupancy of 67% for a local work size of 128 and a local memory usage of 512 bytes. Above 128, the occupancy does not increase. As seen from Figure 5.1, for the matrix size of 10000X10000, we have the highest speed up for the local work size of 128. Higher local work sizes serialize the execution since at a given time only 32 threads are active per multi-processor. Hence, 128 serves as an optimum value for the 10000X10000 matrix. For smaller matrices, lower work sizes work better as the data is divided across all the multiprocessors for smaller work sizes. For example, if we use 128 as the local work size for a 1000X1000 matrix, only 8 multiprocessors will be used for computations. The local work size of 128 will help hide the global memory latency but it will also serialize the execution and will not utilize the available compute resources efficiently. We can clearly see the effects of this configuration in Figure 5.1. Hence, it is necessary to determine the best local work size for a matrix of a given size through experimentation. Table 5.4 presents the optimum local work sizes for each of the matrix sizes corresponding to Figure 5.1 and the speed up calculated for each matrix size. The maximum speed up achieved was approximately 20x for the largest matrix size, 10000X10000.

Table 5.4: Speed Up Values for the Optimum Local Work Size

Matrix Size      Total Execution Time (Seconds)   Local Work Size   Speed Up
10000X100        2.940277                         16                0.294429743
1000X1000        4.795616                         16                2.298208405
2000X2000        21.925758                        32                6.216499471
4000X4000        100.043918                       64                14.09944126
10000X10000      1455.350711                      128               20.46710995

Table 5.5 presents the best execution times for different matrix sizes for each kernel of the parallel implementation. We can see that the Householder Bidiagonalization and the Accumulation of the Transformations are the largest contributors to the execution time of the parallel implementation. The diagonalization, which is executed serially on the CPU, and the transfer of the data to and from the device take a very small percentage of the total execution time. This time is small because we transfer all the data only once at the beginning and move the results back after the accumulation of the transformations is done. The time required for moving the data and for the serial computation of the diagonalization increases with the matrix size, as expected.

Table 5.5: Execution time in seconds for the fastest parallel implementations

Matrix Size      Householder Bidiagonalization   Accumulation of the Transformations   Diagonalization   Data Transfer Time
10000X100        1.526413                        1.40745                               0.006252          0.01041
1000X1000        2.641518                        2.036463                              0.117484          0.024773
2000X2000        13.342647                       8.560996                              0.021964          0.098177
4000X4000        65.372981                       34.404371                             0.266414          0.383297
10000X10000      1189.908109                     263.527825                            1.914547          2.412052

5.5 Summary

We discussed the experimental system, method of speed up calculation, memory requirements and the results of our implementation. We also discussed the reasons for the different speed up values for different matrix sizes and local work sizes. In the next chapter, we will present the conclusions and the possibilities for future work.


CHAPTER 6

CONCLUSION AND FUTURE WORK

The GPGPU implementation based on the parallel approach explained in Chapter 4 was successful in improving the performance of the serial SVD algorithm by making use of the vast computational capabilities of the GPGPU. We were able to divide the SVD algorithm into multiple steps and determine the parallelism in each of those steps. Our parallel implementation achieved speed ups of up to 20x over the serial implementation on an NVIDIA Tesla C2050.

We used several optimizations to achieve this result: reducing conditional statements helped avoid serialization of the code, maximizing the use of local memory helped overcome the large latency of the global memory, and minimizing the transfer of data between the host and the device reduced unnecessary data transfer overhead. The last optimization was achieved by computing the sum of the partial sums on the device as explained in Chapter 4, which kept the amount of time spent transferring the data below 0.6% of the total execution time [Table 5.5].

Empirical observations for different local work sizes for a matrix showed improved performance for certain local work sizes, and the CUDA occupancy calculator aided in the understanding of these performance differences. The results for different matrix sizes show that the parallel implementation is more suitable for large matrices, as the data transfer time and other latencies are effectively hidden when working with larger data sizes. The actual amount of data needed for good performance depends on the nature of the algorithm and must be determined by experimentation.

Our implementation uses OpenCL, an open standard for programming heterogeneous platforms. Hence, it can be executed on GPGPUs from other vendors with minimal coding modifications. Several professional libraries such as CULA [11] and MAGMA [12] are available that implement efficient linear algebra techniques, including the SVD, using CUDA. AMD has an accelerated math library written in OpenCL [20] that includes several BLAS routines. However, to our knowledge, this is the first attempt to use OpenCL for an implementation of the SVD algorithm. We successfully demonstrated how to harness the computing power of the GPGPU with the help of OpenCL through our implementation.

Our work focuses on the performance of the SVD algorithm on NVIDIA GPGPUs. Although the OpenCL code written for NVIDIA GPGPUs can be executed on AMD GPGPUs as mentioned previously, achieving maximum performance on AMD GPGPUs will require in-depth knowledge of their architecture and may require a few additional optimizations. In future work, a study of the AMD GPGPU architecture will provide insight into the set of optimizations common to both NVIDIA and AMD GPGPUs and the optimizations specific to a particular architecture.

The size of the matrix for SVD computation is limited by the GPGPU memory size. For large matrices that cannot fit in the available memory, a multi-GPU implementation will be necessary. It will involve synchronization between the host and all the GPGPUs and transferring the data between the GPGPUs. Further studies can evaluate the feasibility of such an implementation.

Finally, it will be interesting to compare the OpenCL implementation with a CUDA implementation. Even though the authors of [5] have a CUDA implementation and both [11] and [12] are professional libraries in CUDA, a direct comparison would not be appropriate, as the implementation in [5] uses an older GPGPU and the source code for [11] and [12] is not open source, so their SVD implementations are not known. Our OpenCL implementation ported to CUDA will be more suitable for further analysis.


REFERENCES

[1] W. H. Press, S. A. Teukolsky, W. T. Vetterling and B. P. Flannery, Numerical Recipes in C: The Art of Scientific Computing, Cambridge University Press, 1992.

[2] A. G. Akritas and G. I. Malaschonok, "Applications of singular-value decomposition (SVD)," Mathematics and Computers in Simulation, vol. 67, pp. 15-31, 2004.

[3] NVIDIA Corporation, "Tesla C2050 Product Brief," July 2010. [Online]. Available: www.nvidia.com.

[4] AMD, "AMD Radeon HD 5870 Specification," AMD, [Online]. Available: www.amd.com.

[5] S. Lahabar and P. J. Narayanan, "Singular value decomposition on GPU using CUDA," in IEEE International Symposium on Parallel and Distributed Processing, 2009.

[6] V. Bondhugula, N. Govindaraju and D. Manocha, "Fast singular value decomposition on graphics processors," University of North Carolina at Chapel Hill, 2006.

[7] M. Rahmati, M. Sadri and M. Naeini, "FPGA based singular value decomposition for image processing applications," in International Conference on Application-Specific Systems, Architectures and Processors, 2008.

[8] A. Kerr, D. Campbell and M. Richards, "QR Decomposition on GPUs," in General Purpose Processing on Graphics Processing Units, 2009.

[9] S. Tomov, R. Nath, H. Ltaief and J. Dongarra, "Dense linear algebra solvers for multicore with GPU accelerators," in IEEE International Symposium on Parallel & Distributed Processing , 2010.

[10] L. K. Kah, A. A., I. Guven and E. Madenci, "High Performance Linear Equation Solver Using NVIDIA GPUs," in NASA/ESA Conference on Adaptive Hardware and Systems, 2011.

[11] N. C. EM Photonics, "GPU Accelerated Linear Algebra Libraries," 2010. [Online]. Available: www.culatools.com.


[12] "MAGMA: Matrix Algebra on GPU and Multicore Architectures," [Online]. Available: http://icl.cs.utk.edu/magma/.

[13] NVIDIA Corporation, "NVIDIA GPU Computing SDK - OpenCL SDK Code Samples," NVIDIA Corporation, [Online]. Available: http://developer.nvidia.com/opencl-sdk-code-samples.

[14] NVIDIA Corporation, "Graphics Processing Unit," August 1999. [Online]. Available: www.nvidia.com.

[15] NVIDIA Corporation, "NVIDIA CUDA Programming Guide 2.3," August 2009. [Online]. Available: www.nvidia.com.

[16] Khronos Group, "OpenCL 1.1 Specification," June 2011. [Online]. Available: www.khronos.com.

[17] NVIDIA Corporation, "Whitepaper: NVIDIA's Next Generation CUDA Compute Architecture Fermi," September 2009. [Online]. Available: www.nvidia.com.

[18] G. H. Golub and C. F. Van Loan, Matrix Computations, The Johns Hopkins University Press, 1996.

[19] NVIDIA Corporation, "NVIDIA OpenCL Best Practices Guide," August 2010. [Online]. Available: www.nvidia.com.

[20] AMD, "AMD Accelerated Parallel Processing Math Libraries," [Online]. Available: http://developer.amd.com.
