Computers and Structures 128 (2013) 31–37


Technical Note

A fast parallel Gauss Jordan algorithm for matrix inversion using CUDA

Girish Sharma a, Abhishek Agarwala b, Baidurya Bhattacharya a,*

a Department of Civil Engineering, Indian Institute of Technology, Kharagpur 721302, India
b Archayne Labs, Gurgaon 122001, India

* Corresponding author. Tel./fax: +91 3222 283422. E-mail addresses: [email protected], [email protected], [email protected] (B. Bhattacharya).

Article history: Received 5 April 2012; Accepted 24 June 2013; Available online 13 September 2013

Keywords: Graphics processing unit; Compute unified device architecture; Matrix inversion; Gauss Jordan; Parallelization

Abstract: The ability to invert large matrices quickly and accurately determines the effectiveness of a computational tool. Current literature suggests that the time complexity of matrix inversion is quadratic in the matrix size or higher. This paper redesigns the Gauss Jordan algorithm for matrix inversion on a CUDA platform to exploit the large-scale parallelization feature of a massively multithreaded GPU. The algorithm is tested for various types of matrices and the performance metrics are studied and compared with CPU based parallel methods. We show that the time complexity of matrix inversion scales as n as long as n^2 threads can be supported by the GPU.

1. Introduction

Matrix inversion is an essential step in a wide range of numerical problems – starting with solving linear equations [1–4], structural analyses using the finite element method [5–7], 3D rendering [8], digital filtering [9], image filtering [10,11] and image processing [12] – and constitutes an indispensable component in almost all mathematical/statistical software suites. Some of the commonly available algorithms for computing the inverse of a matrix are Strassen [13], Strassen–Newton [14], Gaussian elimination [15], Gauss–Jordan [15], Coppersmith and Winograd [16], LUP decomposition [17], Cholesky decomposition [18], QR decomposition [19], RRQR factorization [20], Monte Carlo methods for the inverse [21,22], etc.

Until the late 1960s matrix inversion was believed to require a cubic number of operations, as the fastest algorithm known was the Gaussian elimination method, or rather the Gauss Jordan [15] method, which runs in O(n^3) time where n is the size of the matrix. In 1969, Strassen [13] excited the research community by giving the first sub-cubic time algorithm for matrix multiplication, running in O(n^2.808) time. This also reduced the time complexity of matrix inversion using Strassen multiplication to O(n^2.808). This discovery triggered a long line of research that gradually reduced the exponent w (where the time complexity is O(n^w)) over time. In 1978, Pan [23] presented a method that proved w < 2.796 and the next year, Bini et al. [24] introduced the notion of border rank and obtained w < 2.78. Schönhage [25] generalized this notion in 1981, proving w < 2.548. In the same paper, combining his work with ideas by Pan [23], he also showed w < 2.522. The following year, Romani [26] found that w < 2.517. The first result to break 2.5 was by Coppersmith and Winograd [16] who obtained w < 2.496. In 1988, Strassen introduced his laser method [27], which was a novel approach for matrix multiplication, and thus decreased the bound to w < 2.479. Two years later, Coppersmith and Winograd [28] combined Strassen's technique with a novel form of analysis based on large sets avoiding arithmetic progressions and obtained the famous bound of w < 2.376, which has remained unchanged for more than 22 years (a very recent unpublished work [29] claims to have brought the limit down to w < 2.3727). While most activity focuses on trying to reduce the exponent w, both Coppersmith and Winograd [28] and Cohn et al. [30] presented conjectures which, if true, would imply w = 2, but never less than that.

Inversion methods for specific types of matrices, sometimes with no set time complexity, also exist, such as Monte Carlo methods [21,22] for inverting positive definite matrices, a fast algorithm [31] for the inversion of general Toeplitz matrices, and various matrix decomposition methods [16–19].

Recent developments in parallel architecture and its use in computation have brought about the prospect of massively parallel systems capable of reducing the running times of codes below the limit of w = 2. The amount of performance boost of course depends largely upon the scope of parallelization that the algorithm provides. Owing to its unique architecture, the graphics processing unit (GPU) enables massive parallelization unlike anything possible on a CPU based network, as described later.

The Gauss Jordan method is one of the oldest methods for matrix inversion. It is straightforward and robust and is particularly suitable for massive parallelization, unlike many of the more advanced methods. Nevertheless, the available literature on either the scope of parallelization of the Gauss Jordan algorithm or its optimization appears rather insufficient. This paper tailors and implements the Gauss–Jordan algorithm for matrix inversion on a CUDA (Compute Unified Device Architecture [32]) based GPU platform and studies the performance metrics of the algorithm.

2. Parallelization and challenges

The Strassen approach [14,33] for matrix inversion reduces the problem of inverting an n × n matrix into seven n/2 × n/2 multiplications and the inversion of two n/2 × n/2 matrices, each of which can then be solved recursively. The inverse is given as:

A^{-1} = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}^{-1} = \begin{bmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{bmatrix}    (1)

The four partitions of C may be computed in the following steps:

1: P_1 = A_{11}^{-1}      2: P_2 = A_{21} P_1      3: P_3 = P_1 A_{12}
4: P_4 = A_{21} P_3       5: P_5 = P_4 - A_{22}    6: P_6 = P_5^{-1}            (2)
7: C_{12} = P_3 P_6       8: C_{21} = P_6 P_2      9: C_{11} = P_1 - P_3 C_{21}
10: C_{22} = -P_6

For large n, the individual matrices will be large enough to provide scope for parallelization of the matrix multiplication steps (steps 2, 3, 4, 7, 8 and 9), while the inverse calculation steps (steps 1 and 6) can be performed recursively. It is this inherent recursive nature of the algorithm that reduces the scope of large scale parallelization.

An additional issue faced by the Strassen approach is that the accuracy of the resultant inverse depends highly on the quarter matrix A_{11}. If the quarter matrix chosen in step 1 is singular, then pivoting is required and the matrix with the larger determinant between A_{11} and A_{22} is chosen for step 1. This check is performed at each level of recursion in steps 1 as well as 6, making the algorithm cumbersome. While certain parallel implementations for specific types of matrices have been studied before [34,35], developing an algorithm that copes with all types of matrices is a difficult task. A much easier and more robust parallel implementation can be developed using the straightforward method, i.e. the Gauss Jordan method, as we have attempted in this paper.

Parallel computations can be performed both on a central processing unit (CPU) and on a graphics processing unit (GPU). While the term CPU has been in use at least since the 1960s [36], the GPU, which is a single chip processor with multiple capabilities and a highly parallel structure, is of more recent vintage, dating from 1999 [32]. Among the many differences between CPU and GPU in terms of architecture and usage, the overriding one with respect to parallelization is the number of cores present on the respective chip. The number of parallel threads that can run on a GPU is orders of magnitude larger than on a CPU [37].

Within a GPU, there are a couple of techniques for performing parallel computation. In the early years, general-purpose computing on graphics processing units (GPGPU) [38] was the technique used by programmers to harness the computational power of a GPU, until CUDA (Compute Unified Device Architecture) was made publicly available in 2007 as the platform to replace GPGPU. CUDA has several advantages [39] over GPGPU and CPU computing, which include faster memory sharing and read-backs, minimal thread creation overhead and flexibility in the choice of programming language, and it has been adopted for this work. More extensive information about Nvidia and the CUDA model is given in the Programming Guide published by Nvidia [32].

3. Parallel Gauss–Jordan algorithm

The Gauss Jordan method for computing the inverse of a matrix is one of the oldest methods. It is robust, accurate over a range of matrices and does not require multiple checks depending upon the type of matrix involved. Existing work on the parallelization of the Gauss Jordan algorithm (e.g. [33]) has been limited in the scope of parallelization, mainly due to the hardware limitations of the time. This paper redesigns the Gauss Jordan method so as to make full use of the massive parallelization feature of a modern GPU.

The standard Gauss Jordan method for computing the inverse of a matrix A of size n starts by augmenting the matrix with the identity matrix of size n:

[C] = [A | I]    (3)

Then, performing elementary row transformations on matrix C, the left half of C is transformed column by column into the unit matrix. This is broken down into two steps per column of the left half matrix. The first step converts the diagonal element a_{jj} of column j into 1 by the transformation

R_j \leftarrow R_j / a_{jj}    (4)

If a_{jj} is zero, then any row with a non-zero element in the jth column is added to the jth row before applying Eq. (4). The second step reduces all the other elements of the jth column to zero by the following transformation of every row except the jth one:

R_i \leftarrow R_i - a_{ij} R_j    (5)

After transforming the first column, the matrix reduces to:

[C'] = \begin{bmatrix}
1 & a_{12}/a_{11} & a_{13}/a_{11} & \cdots & 1/a_{11} & 0 & 0 & \cdots & 0 \\
0 & a_{22} - a_{21}a_{12}/a_{11} & a_{23} - a_{21}a_{13}/a_{11} & \cdots & -a_{21}/a_{11} & 1 & 0 & \cdots & 0 \\
0 & a_{32} - a_{31}a_{12}/a_{11} & a_{33} - a_{31}a_{13}/a_{11} & \cdots & -a_{31}/a_{11} & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\
0 & a_{n2} - a_{n1}a_{12}/a_{11} & a_{n3} - a_{n1}a_{13}/a_{11} & \cdots & -a_{n1}/a_{11} & 0 & 0 & \cdots & 1
\end{bmatrix}    (6)

Following these two steps for each column sequentially, the left half of C' becomes the unit matrix while the right half becomes the desired inverse of A:

[C'] = [I | A^{-1}]    (7)

Thus, Step 1 of the original Gauss Jordan algorithm (which converts the jth element of the jth column to 1, Eq. (4)) involves processing 2n elements, and Step 2 (which converts the remaining elements of the jth column to 0, Eq. (5)) involves processing (n − 1) rows of 2n elements each (of which n are identically zero). If these two steps are performed sequentially, without any parallelization, the time complexity of the program becomes proportional to n × (n − 1) × n, i.e. O(n^3).
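The sequential procedure of Eqs. (3)–(7) can be summarized in a short reference implementation. The following is a minimal single-threaded sketch in C (the function name, the row-major storage and the zero-pivot tolerance are assumptions of this sketch, not taken from the paper); it builds the augmented matrix of Eq. (3), applies Eqs. (4) and (5) column by column, and makes the O(n^3) cost of the unparallelized method explicit.

/* Sequential Gauss-Jordan inversion sketch (illustrative only).
 * A is n x n in row-major order; Ainv receives the inverse.
 * Returns 0 on success, -1 on failure. */
#include <stdlib.h>
#include <math.h>

int gauss_jordan_invert(const double *A, double *Ainv, int n)
{
    int w = 2 * n;                              /* width of augmented matrix [A | I] */
    double *C = (double *)malloc(sizeof(double) * n * w);
    if (!C) return -1;

    for (int i = 0; i < n; ++i)                 /* build [C] = [A | I], Eq. (3) */
        for (int k = 0; k < w; ++k)
            C[i * w + k] = (k < n) ? A[i * n + k] : (k - n == i ? 1.0 : 0.0);

    for (int j = 0; j < n; ++j) {
        if (fabs(C[j * w + j]) < 1e-12) {       /* zero pivot: add another row */
            int r = -1;
            for (int i = 0; i < n; ++i)
                if (i != j && fabs(C[i * w + j]) > 1e-12) { r = i; break; }
            if (r < 0) { free(C); return -1; }  /* matrix is singular */
            for (int k = 0; k < w; ++k) C[j * w + k] += C[r * w + k];
        }
        double piv = C[j * w + j];
        for (int k = 0; k < w; ++k)             /* Eq. (4): normalize row j */
            C[j * w + k] /= piv;
        for (int i = 0; i < n; ++i) {           /* Eq. (5): clear column j */
            if (i == j) continue;
            double m = C[i * w + j];
            for (int k = 0; k < w; ++k)
                C[i * w + k] -= m * C[j * w + k];
        }
    }
    for (int i = 0; i < n; ++i)                 /* right half of C is the inverse, Eq. (7) */
        for (int k = 0; k < n; ++k)
            Ainv[i * n + k] = C[i * w + n + k];
    free(C);
    return 0;
}

The three nested loops (columns, rows, elements) are what the n × (n − 1) × n estimate above counts; the two parallel kernels described next remove the two inner loops.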

We now redesign the algorithm to exploit the capabilities of GPU parallel processing by performing Steps 1 and 2 in parallel. For Step 1, we spawn n threads and process the whole row at once, which can be implemented in CUDA C as shown in Fig. 1. For Step 2, we spawn n × (n − 1) threads to convert the rest of the column to 0, which can be implemented in CUDA C as shown in Fig. 2.

Fig. 1. Process n elements, i.e. from a_{jj} in the left side matrix to a_{j,n+j} in the right side matrix (top), and corresponding pseudo code (bottom).
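Fig. 1 itself is not reproduced here. A minimal CUDA C sketch of a Step 1 kernel consistent with the description above is given below; the kernel name, the single-block launch configuration and the row-major layout of the n × 2n augmented matrix are assumptions of this sketch rather than details taken from the paper.

/* Step 1 sketch: normalize row j of the augmented matrix C so that C[j][j]
 * becomes 1, Eq. (4). Launched as step1<<<1, n + 1>>>(dC, n, j), assuming
 * n + 1 does not exceed the maximum number of threads per block. */
__global__ void step1(float *C, int n, int j)
{
    __shared__ float pivot;
    int k = j + threadIdx.x;            /* columns j .. n+j of row j           */
    if (threadIdx.x == 0)
        pivot = C[j * 2 * n + j];       /* save the pivot before it is changed */
    __syncthreads();
    /* columns outside j .. n+j hold only zeros in row j at this stage,
     * so they need not be touched */
    C[j * 2 * n + k] /= pivot;
}

Because all threads synchronize on the saved pivot before writing, the thread that rescales the pivot element itself cannot disturb the others.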

Fig. 2. Process n^2 elements, starting from a_{1j} in the left side matrix to a_{n,n+j} in the right side matrix (top), and corresponding pseudo code (bottom).
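Similarly, Fig. 2 is not reproduced here. A CUDA C sketch of a Step 2 kernel matching the description above could be written as follows; using one block per eliminated row and one thread per processed column (columns j to n + j, the reduced range discussed later in the text) is a choice of this sketch, not necessarily the authors' exact configuration.

/* Step 2 sketch: make column j zero in every row except the pivot row j,
 * Eq. (5). Launched as step2<<<n - 1, n + 1>>>(dC, n, j) after step1,
 * so that C[j][j] == 1. C is the n x 2n augmented matrix, row-major. */
__global__ void step2(float *C, int n, int j)
{
    __shared__ float m;                 /* multiplier C[i][j] for this row      */
    int i = blockIdx.x;                 /* one block per row other than row j   */
    if (i >= j) ++i;                    /* skip the pivot row                   */
    int k = j + threadIdx.x;            /* columns j .. n+j                     */
    if (threadIdx.x == 0)
        m = C[i * 2 * n + j];           /* read C[i][j] before it is zeroed     */
    __syncthreads();
    C[i * 2 * n + k] -= m * C[j * 2 * n + k];
}

The pivot row is only read, never written, by this kernel, so no synchronization across blocks is needed; the paper's shared-memory optimization (keeping one row and one column on chip) can be layered on top of this skeleton.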

Fig. 3. Pseudo code explaining the Gauss Jordan algorithm for matrix inversion adapted to GPU computing.
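To complement the pseudo code of Fig. 3, a minimal host-side driver for the two kernels sketched above might look as follows. The fix_pivot helper named here is hypothetical (it would add a row with a non-zero entry in column j to row j, the zero-pivot fix of Section 3), and the per-column device-to-host copy is only for clarity; the premature pivoting described in the text avoids it.

/* Host-side driver sketch (illustrative only). dC already holds the n x 2n
 * augmented matrix [A | I] in device memory, built as in Eq. (3). */
#include <cuda_runtime.h>
#include <math.h>

/* kernels from the sketches above, plus a hypothetical zero-pivot fix-up */
__global__ void step1(float *C, int n, int j);
__global__ void step2(float *C, int n, int j);
__global__ void fix_pivot(float *C, int n, int j);

void gpu_gauss_jordan(float *dC, int n)
{
    for (int j = 0; j < n; ++j) {
        float piv;
        cudaMemcpy(&piv, dC + j * 2 * n + j, sizeof(float),
                   cudaMemcpyDeviceToHost);
        if (fabsf(piv) < 1e-6f)
            fix_pivot<<<1, n + 1>>>(dC, n, j);   /* hypothetical helper kernel */

        step1<<<1, n + 1>>>(dC, n, j);           /* Eq. (4): normalize row j   */
        step2<<<n - 1, n + 1>>>(dC, n, j);       /* Eq. (5): clear column j    */
    }
    cudaDeviceSynchronize();
    /* the right half of dC now holds the inverse, Eq. (7) */
}

Kernels launched on the same stream execute in order, so no explicit synchronization is needed between the step1 and step2 launches of one column.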

We also minimize computations by processing only the first n columns (starting from the jth column) in Step 2 instead of processing all the 2n columns, as also indicated in Fig. 2. This reduces the computations to half without affecting the final result, since for the columns after the (n + j)th column the element in the jth row is still zero, and hence those columns do not contribute to the transformation of Eq. (5) as the product becomes zero.

A further speed boost is obtained when the program is run using shared memory. Shared memory is a local common memory available to all the threads of the same thread block. Data read-and-write on shared memory is faster than that on GPU global memory [39], making the shared memory act like a cache for the GPU. Thus, subject to availability of computational resources, the time complexity becomes proportional to n × (1 + 1), i.e. O(n). The final pseudo code is presented in Fig. 3. Pivoting is done to prevent a division-by-zero exception. While pivoting can be done inside the second for-thread loop using an if...else condition, it is done beforehand for all elements to avoid extra GPU cycles being spent on the if...else comparison.

4. Testing and results

4.1. Hardware advantages and limitations

As shown above, if all the n^2 computations can be done in parallel, the complexity of the Gauss Jordan algorithm on a massively parallel architecture reduces to O(n). For all the computation to be in parallel, the thread creation should be very light, else the creation time of n^2 threads will contribute to the time complexity. This is where CUDA has an advantage [39] over other methods of parallel computation: thread creation in CUDA is very light and does not contribute to the run-time of the program.

While thread creation is lighter using CUDA, the use of a GPU brings in some limitations dependent on the hardware of the GPU. The first limitation has to do with the running of n^2 threads in parallel, which depends upon the type of GPU used for computation. A modern GPU (Nvidia GTX 590) may have up to 49,152 active threads divided into sets of 32 (called warps) [32], i.e. 32 SM × 48 resident warps per SM × 32 threads per warp (Table 9 of [32]), but only 1024 dispatched threads running in parallel, i.e. 32 SM × 32 threads per warp [32]; any number above that will lead to stacking of warps waiting for their turn. Thus, up to n = √1024 = 32, the algorithm will have a time complexity of O(n); the complexity starts increasing for n > 221 and becomes quadratic at n = 1024. If a cluster of parallel GPUs is used and the program is run using all the GPUs, then this hardware capacity can be increased further.

The second issue is the space available in shared memory. We use the GPU's shared memory to store a maximum of one row and one column of the matrix in order to prevent reads and writes to the global memory of the GPU. Access to the global memory is very slow compared to the shared memory [40], but at the same time the size of the shared memory is very small (up to 48 KB) compared to the global memory (up to 3 GB).

The third issue is the maximum allowed block size (in terms of the maximum number of threads per block). In this algorithm, each block handles the computations involved for one column at a time. If the column size is greater than the maximum allowed block size, then the matrix would need to be split vertically so that each part has an appropriate column size. The maximum number of threads per block allowed in a modern GPU is 1024 [32] (compute capability 3.0).

A modern CPU, in comparison, can spawn a small number of threads, typically 8 or 12 [41]. Adding to this the fact that thread creation on a CPU is not as light as that on a GPU, a GPU has a clear advantage over a CPU for parallel computation.

4.2. Results and comparisons

The algorithm was programmed using CUDA C and was tested against various sizes of the following types of matrices: identity matrix, sparse matrix, banded matrix (with k = n/2), random matrix and hollow matrix. For this paper, the GPU used was an Nvidia GTX 260 (216 cores), capable of storing 9216 active threads and 288 dispatched concurrent threads (9 SM × 32 resident warps/SM × 32 threads/warp and 9 SM × 32 threads/warp, respectively), having a shared memory size of 16 KB (up to 4096 floating point locations) and a maximum thread-per-block limit of 512. The CPU used for comparison purposes is an Intel Quad Core Q8400 processor @ 2.66 GHz, capable of running 4 threads in parallel.

As the compute capability of the GTX 260 is 1.3, we chose to use single-precision floating-point arithmetic operations instead of double-precision floating-point numbers, as the warp cycles involved for double-precision operations are 8 times those of single precision [32] (compute capability 1.x). In-depth accuracy and performance analyses on single and double precision floating point numbers can be found in [42]. Finally, the optimization flags used while compiling the program are -ftz=true, --use_fast_math and --prec-div=true.

While the identity matrix is used to test the accuracy of the inverse, the sparse and banded matrices are representative of problems involving structural analysis. The hollow matrix is a matrix having all the diagonal elements equal to zero. Thus a hollow matrix requires an extra row transformation to fix the diagonal element for each column, making it the type of matrix with the maximum amount of computation involved. For performance testing of the algorithm, all the matrices were stored in dense format.

Fig. 4 displays the computational time taken for matrix sizes up to 100 and demonstrates the linear nature of the time complexity. Fig. 5 displays the time taken by the algorithm on the GPU over a larger scale as compared to a CPU.
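The hardware limits quoted above (number of SMs, shared memory per block, maximum threads per block) can be queried at run time through the standard CUDA runtime API, which makes it possible to decide automatically when the column-splitting fallback of Section 4.1 is needed. A small sketch (the printed fields are standard cudaDeviceProp members; the program layout is ours):

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);               /* properties of device 0 */
    printf("SMs: %d, warp size: %d\n",
           prop.multiProcessorCount, prop.warpSize);
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("shared memory per block: %zu bytes\n",
           (size_t)prop.sharedMemPerBlock);
    /* if n + 1 exceeds maxThreadsPerBlock, one column no longer fits in a
     * single block and must be split, as discussed in Section 4.1 */
    return 0;
}

On the GTX 260 used here this should report the 16 KB of shared memory and the 512-thread block limit mentioned above; on a compute capability 3.0 device it should report 48 KB and 1024 threads.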

Fig. 4. (a–e): Linear computation time for matrix inversion is observed up to n ≈ 64 using the GPU; (f) computation time for inverting a hollow matrix using the CPU.

Fig. 5. Computation time for inverting different types of matrices, (a–e): using the GPU, (f) using the CPU.

It is observed from Fig. 4 that the graph is still linear around n = 64 even though a matrix of size 64 requires 4096 parallel running threads. This is explained by the fact that all the 4096 threads, or 128 warps, were already loaded in memory and scheduled to be dispatched, and thus there was no latency observed while dispatching these warps of threads. Also, due to the elegant design of the algorithm in Fig. 2, the running time of each thread is not significant enough compared to the total number of threads to contribute to the complexity. We also observe that the first quadratic curvature in the graph appears at around n = 100. For n = 100, 10,100 threads are required, whereas only 9216 threads can be scheduled at a time. This leads to latencies between dispatching warps while the remaining 884 threads are loaded into memory. Moreover, 10,100 threads would require 35 cycles of 288 parallel running threads to complete the algorithm. Thus values around n = 100 start displaying the quadratic nature of the algorithm due to hardware limitations.

5. Conclusions

The ability to invert large matrices accurately and quickly determines the effectiveness of a wide range of computational algorithms and products. GPU computing is ideally suited for massively parallel tasks as the thread creation and memory transfer overheads are negligible. We have redesigned the Gauss Jordan algorithm for matrix inversion on a GPU based CUDA platform, tested it on five different types of matrices (identity, sparse, banded, random and hollow) of various sizes, and have shown that the time complexity of matrix inversion scales as n if enough computational resources are available (we were limited by only one GPU capable of running 288 threads in parallel, with which we showed that the time complexity is of order n up to n ≈ 64). Linear scaling will continue by using a network of GPUs. We also show that GPU based parallelization for matrix inversion is orders of magnitude faster than CPU based parallelization. Even a small matrix of size 10 × 10 did better using the parallel CUDA Gauss Jordan algorithm.

References

[1] Demmel JW. Applied numerical linear algebra. 1st ed. Society for Industrial and Applied Mathematics; 1997.
[2] Gantmacher FR. Applications of the theory of matrices. 1st ed. Dover Publications; 2005.
[3] Arfken GB, Weber HJ, Harris FE. Mathematical methods for physicists: a comprehensive guide. 7th ed. Academic Press Inc; 2012.
[4] Marcus M, Minc H. Introduction to linear algebra. New ed. New York: Dover Publications; 1988.
[5] Felippa CA, Park KC. The construction of free–free flexibility matrices for multilevel structural analysis. Comput Methods Appl Mech Eng 2002;191:2139–68.
[6] Liang P, Chen SH, Huang C. Moore–Penrose inverse method of topological variation of finite element systems. Comput Struct 1997;62.
[7] Fung TC. Derivatives of dynamic stiffness matrices. In: Proceedings of the Asian Pacific conference on computational mechanics. Hong Kong; 1991. p. 607–613.
[8] Shi HF, Payandeh S. GPU in haptic rendering of deformable objects. In: Proceedings of haptics: perception, devices and scenarios: 6th international conference. Madrid; 2008. p. 163–168.
[9] Farden DC, Scharf LL. Statistical design of non recursive digital filters. IEEE Trans Acoust Speech Signal Process 1974;22:188–96.
[10] Jain AK. An operator factorization method for restoration of blurred images. IEEE Trans Comput 1977;C-26:1061–71.
[11] Jain AK, Padgug RA. Fast restoration of finite objects degraded by finite PSF. J Comput Phys 1978;28:167–80.
[12] Leroux JD, Selivanov V, Fontaine R, Lecomte R. Fast 3D image reconstruction method based on SVD decomposition of a block-circulant system matrix. In: IEEE nuclear science symposium and medical imaging conference, NSS–MIC. Honolulu; 2007. p. 3038–3045.
[13] Strassen V. Gaussian elimination is not optimal. Numer Math 1969;13:354–6.
[14] Bailey DH, Ferguson HRP. A Strassen–Newton algorithm for high-speed parallelizable matrix inversion. In: Supercomputing '88: proceedings of the 1988 ACM/IEEE conference on supercomputing. Orlando; 1988. p. 419–424.
[15] Althoen SC, McLaughlin R. Gauss–Jordan reduction: a brief history. Am Math Mon 1987;94:130–42.
[16] Coppersmith D, Winograd S. On the asymptotic complexity of matrix multiplication. SIAM J Comput 1981;11:472–92.
[17] Press WH, Flannery BP, Teukolsky SA, Vetterling WT. Section 2.3: LU decomposition and its applications. Numerical recipes in FORTRAN: the art of scientific computing. New York: Cambridge University Press; 2007. p. 34–42.
[18] Burian A, Takala J, Ylinen M. A fixed-point implementation of matrix inversion using Cholesky decomposition. In: Proceedings of the 46th international Midwest symposium on circuits and systems. Cairo; 2003. p. 1431–1433.
[19] Press WH, Teukolsky SA, Vetterling WT, Flannery BP. Section 2.10: QR decomposition. Numerical recipes: the art of scientific computing. New York: Cambridge University Press; 2007. p. 1256.
[20] Ming G, Eisenstat SC. Efficient algorithms for computing a strong rank-revealing QR factorization. SIAM J Sci Comput 1996;17:848–69.
[21] Fathi Vajargah B. A way to obtain Monte Carlo matrix inversion with minimal error. Appl Math Comput 2007;191:225–33.
[22] Fathi Vajargah B. Different stochastic algorithms to obtain matrix inversion. Appl Math Comput 2007;189:1841–6.
[23] Pan VY. Strassen's algorithm is not optimal: trilinear technique of aggregating, uniting and canceling for constructing fast algorithms for matrix operations. In: Proceedings of the 19th annual symposium on foundations of computer science. Ann Arbor; 1978. p. 166–176.
[24] Bini D, Capovani M, Romani F, Lotti G. O(n^2.7799) complexity for n×n approximate matrix multiplication. Inf Process Lett 1979;8:234–5.
[25] Schönhage A. Partial and total matrix multiplication. SIAM J Comput 1981;10:434–55.
[26] Romani F. Some properties of disjoint sums of tensors related to matrix multiplication. SIAM J Comput 1982;11:263–7.
[27] Strassen V. The asymptotic spectrum of tensors. J Für Die Reine und Angewandte Mathematik 1988;384:102–54.
[28] Coppersmith D, Winograd S. Matrix multiplication via arithmetic progressions. J Symbolic Comput 1990;9:251–80.
[29] Williams VV. Breaking the Coppersmith–Winograd barrier. Retrieved from: http://unicyb.kiev.ua/~vingar/progr/201112/1semestr/matrixmult.pdf.
[30] Cohn H, Kleinberg R, Szegedy B, Umans C. Group-theoretic algorithms for matrix multiplication. In: 46th annual IEEE symposium on foundations of computer science. Pittsburgh; 2005. p. 379–388.
[31] Jain AK. Fast inversion of banded Toeplitz matrices by circular decompositions. IEEE Trans Acoust Speech Signal Process 1978;26.
[32] CUDA Programming Guide. Retrieved from: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html. Accessed: 05 December 2012.
[33] Vancea C, Vancea F. Parallel algorithm for computing matrix inverse by Gauss–Jordan method. J Comput Sci Control Syst 2008;1:110–3.
[34] Gravvanis GA, Filelis-Papadopoulos CK, Giannoutakis KM. Solving finite difference linear systems on GPUs: CUDA based parallel explicit preconditioned biconjugate conjugate gradient type methods. J Supercomput 2011;61:590–604.
[35] Filelis-Papadopoulos CK, Gravvanis GA, Matskanidis PI, Giannoutakis KM. On the GPGPU parallelization issues of finite element approximate inverse preconditioning. J Comput Appl Math 2011;236:294–307.
[36] Weik MH. A third survey of domestic electronic digital computing systems (Report No. 1115). Ballistic Research Laboratories; 1961. Retrieved from: http://ed-thelen.org/comp-hist/BRL61.html.
[37] CUDA C Best Practices Guide. Retrieved from: http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html. Accessed: 05 December 2012.
[38] General-purpose computing on graphics processing units. Retrieved from: http://www.gpgpu.org/. Accessed: 01 April 2012.
[39] Hwu W-m, Kirk D. Video lectures for ECE 498AL, University of Illinois [m4v video]. Retrieved from: http://www.nvidia.com/content/cudazone/cudacasts/CUDA%20Programming%20Model.m4v. Accessed: 01 April 2012.
[40] Farber R. CUDA application design and development. 1st ed. Waltham, MA: Morgan Kaufmann; 2011.
[41] Intel Core i7-3930K Processor. Retrieved from: http://ark.intel.com/products/63697/Intel-Core-i7-3930K-Processor-%2812M-Cache-3_20-GHz%29. Accessed: 01 April 2012.
[42] Whitehead N, Fit-Florea A. Precision & performance: floating point and IEEE 754 compliance for NVIDIA GPUs. Retrieved from: https://developer.nvidia.com/sites/default/files/akamai/cuda/files/NVIDIA-CUDA-Floating-Point.pdf. Accessed: 05 December 2012.