Computers and Structures 128 (2013) 31–37
Technical Note

A fast parallel Gauss Jordan algorithm for matrix inversion using CUDA

Girish Sharma a, Abhishek Agarwala b, Baidurya Bhattacharya a,⇑

a Department of Civil Engineering, Indian Institute of Technology, Kharagpur 721302, India
b Archayne Labs, Gurgaon 122001, India
Article history:
Received 5 April 2012
Accepted 24 June 2013
Available online 13 September 2013

Keywords: Graphics processing unit; Compute unified device architecture; Matrix inversion; Gauss Jordan; Parallelization

Abstract: The ability to invert large matrices quickly and accurately determines the effectiveness of a computational tool. Current literature suggests that the exponent w of the time complexity of matrix inversion is 2 or higher. This paper redesigns the Gauss Jordan algorithm for matrix inversion on a CUDA platform to exploit the large-scale parallelization feature of a massively multithreaded GPU. The algorithm is tested for various types of matrices and the performance metrics are studied and compared with CPU-based parallel methods. We show that the time complexity of matrix inversion scales as n as long as n^2 threads can be supported by the GPU.

© 2013 Elsevier Ltd. All rights reserved.
1. Introduction

Matrix inversion is an essential step in a wide range of numerical problems – starting with solving linear equations [1–4], structural analyses using the finite element method [5–7], 3D rendering [8], digital filtering [9], image filtering [10,11] and image processing [12] – and constitutes an indispensable component in almost all mathematical/statistical software suites. Some of the commonly available algorithms for computing the inverse of a matrix are Strassen [13], Strassen–Newton [14], Gaussian elimination [15], Gauss–Jordan [15], Coppersmith and Winograd [16], LUP decomposition [17], Cholesky decomposition [18], QR decomposition [19], RRQR factorization [20], Monte Carlo methods for the inverse [21,22], etc.

⇑ Corresponding author. Tel./fax: +91 3222 283422. E-mail addresses: [email protected], [email protected], [email protected] (B. Bhattacharya).

Until the late 1960s matrix inversion was believed to require a cubic number of operations, as the fastest algorithm known was the Gaussian elimination method, or rather the Gauss Jordan [15] method, which runs in O(n^3) time where n is the size of the matrix. In 1969, Strassen [13] excited the research community by giving the first sub-cubic time algorithm for matrix multiplication, running in O(n^2.808) time. This also reduced the time complexity of matrix inversion using Strassen multiplication to O(n^2.808). This discovery triggered a long line of research that gradually reduced the complexity exponent w over time. In 1978, Pan [23] presented a method that proved w < 2.796 and the next year, Bini et al. [24] introduced the notion of border rank and obtained w < 2.78. Schönhage [25] generalized this notion in 1981, proving w < 2.548. In the same paper, combining his work with ideas by Pan [23], he also showed w < 2.522. The following year, Romani [26] found that w < 2.517. The first result to break 2.5 was by Coppersmith and Winograd [16], who obtained w < 2.496. In 1988, Strassen introduced his laser method [27], a novel approach to matrix multiplication, and thus decreased the bound to w < 2.479. Two years later, Coppersmith and Winograd [28] combined Strassen's technique with a novel form of analysis based on large sets avoiding arithmetic progressions and obtained the famous bound of w < 2.376, which has remained unchanged for more than 22 years (a very recent unpublished work [29] claims to have brought the limit down to w < 2.3727). While most activity focuses on trying to reduce the exponent w, both Coppersmith and Winograd [28] and Cohn et al. [30] presented conjectures which, if true, would imply w = 2, but never less than that.

Inversion methods for specific types of matrices, sometimes with no set time complexity, also exist: Monte Carlo methods [21,22] for inverting Hermitian and positive definite matrices, a fast algorithm [31] for the inversion of general Toeplitz matrices, and various matrix decomposition methods [16–19].

Recent developments in parallel architectures and their use in computation have brought about the prospect of massively parallel systems capable of reducing the running times of codes below the limit of w = 2. The amount of performance boost of course depends largely upon the scope of parallelization that the algorithm provides. Owing to its unique architecture, the graphics processing unit (GPU) enables massive parallelization unlike anything possible on a CPU-based network, as described later. The Gauss Jordan method is
one of the oldest methods for matrix inversion. It is straightforward and robust and is particularly suitable for massive parallelization, unlike many of the more advanced methods. Nevertheless, the available literature on either the scope of parallelization of the Gauss Jordan algorithm or its optimization appears rather insufficient. This paper tailors and implements the Gauss–Jordan algorithm for matrix inversion on a CUDA (Compute Unified Device Architecture [32]) based GPU platform and studies the performance metrics of the algorithm.

2. Parallelization and challenges

The Strassen approach [14,33] for matrix inversion reduces the problem of inverting an n × n matrix into six n/2 × n/2 multiplications and the inversion of two n/2 × n/2 matrices, each of which can then be solved recursively. With A partitioned into n/2 × n/2 blocks A11, A12, A21 and A22, the inverse is given as:

R1 = A11^-1,  R2 = A21 R1,  R3 = R1 A12,  R4 = A21 R3,  R5 = R4 - A22,  R6 = R5^-1,
C12 = R3 R6,  C21 = R6 R2,  R7 = R3 C21,  C11 = R1 - R7,  C22 = -R6,

so that A^-1 = [C11 C12; C21 C22], with the two half-size inversions (R1 and R6) solved recursively.

Within a GPU, there are a couple of techniques for performing parallel computation. In the early years, general-purpose computing on graphics processing units (GPGPU) [38] was the technique used by programmers to harness the computational power of a GPU, until CUDA (Compute Unified Device Architecture) was made publicly available in 2007 as the platform to replace GPGPU. CUDA has several advantages [39] over GPGPU and the CPU, which include faster memory sharing and read-backs, minimal thread creation overhead, and flexibility in the choice of programming language, and it has been adopted for this work. More extensive information about Nvidia and the CUDA model is given in the Programming Guide published by Nvidia [32].

3. Parallel Gauss–Jordan algorithm

Gauss Jordan method for computation of inverse of a matrix is
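For reference, the Gauss–Jordan elimination named above can be sketched serially as follows. This is a minimal illustrative CPU sketch, not the paper's CUDA implementation; the function name and the singularity tolerance are our own choices. On the GPU, the work of the two inner elimination loops would be shared among n^2 threads, one per element of the augmented matrix, for each of the n pivot steps.

```cpp
#include <cmath>
#include <utility>
#include <vector>

// Serial Gauss-Jordan inversion with partial pivoting (illustrative sketch).
// 'a' holds an n x n matrix in row-major order; on success it is overwritten
// with its inverse. Returns false if the matrix is (numerically) singular.
bool gauss_jordan_inverse(std::vector<double>& a, int n) {
    const int w = 2 * n;  // width of the augmented matrix [A | I]
    std::vector<double> aug(n * w, 0.0);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) aug[i * w + j] = a[i * n + j];
        aug[i * w + n + i] = 1.0;  // identity block on the right
    }
    for (int p = 0; p < n; ++p) {
        // Partial pivoting: pick the largest |entry| in column p at or below row p.
        int best = p;
        for (int r = p + 1; r < n; ++r)
            if (std::fabs(aug[r * w + p]) > std::fabs(aug[best * w + p])) best = r;
        if (std::fabs(aug[best * w + p]) < 1e-12) return false;  // singular
        if (best != p)
            for (int j = 0; j < w; ++j) std::swap(aug[p * w + j], aug[best * w + j]);
        // Normalize the pivot row so the pivot becomes 1.
        const double piv = aug[p * w + p];
        for (int j = 0; j < w; ++j) aug[p * w + j] /= piv;
        // Eliminate column p from every other row; this O(n^2) step per pivot
        // is what the GPU distributes across one thread per matrix element.
        for (int r = 0; r < n; ++r) {
            if (r == p) continue;
            const double f = aug[r * w + p];
            for (int j = 0; j < w; ++j) aug[r * w + j] -= f * aug[p * w + j];
        }
    }
    // The right half of the augmented matrix now holds A^{-1}.
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) a[i * n + j] = aug[i * w + n + j];
    return true;
}
```

For example, inverting the 2 × 2 matrix [4 7; 2 6] (determinant 10) with this routine yields [0.6 -0.7; -0.2 0.4].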