Scalable Sparse Symbolic LU Factorization on Gpus

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 1 GSOFA: Scalable Sparse Symbolic LU Factorization on GPUs Anil Gaihre, Xiaoye Sherry Li and Hang Liu Abstract—Decomposing a matrix A into a lower matrix L and an upper matrix U, which is also known as LU decomposition, is an essential operation in numerical linear algebra. For a sparse matrix, LU decomposition often introduces more nonzero entries in the L and U factors than in the original matrix. A symbolic factorization step is needed to identify the nonzero structures of L and U matrices. Attracted by the enormous potentials of the Graphics Processing Units (GPUs), an array of efforts have surged to deploy various LU factorization steps except for the symbolic factorization, to the best of our knowledge, on GPUs. This paper introduces GSOFA, the first GPU-based symbolic factorization design with the following three optimizations to enable scalable LU symbolic factorization for nonsymmetric pattern sparse matrices on GPUs. First, we introduce a novel fine-grained parallel symbolic factorization algorithm that is well suited for the Single Instruction Multiple Thread (SIMT) architecture of GPUs. Second, we tailor supernode detection into a SIMT friendly process and strive to balance the workload, minimize the communication and saturate the GPU computing resources during supernode detection. Third, we introduce a three-pronged optimization to reduce the excessive space consumption problem faced by multi-source concurrent symbolic factorization. Taken together, GSOFA achieves up to 31× speedup from 1 to 44 Summit nodes (6 to 264 GPUs) and outperforms the state-of-the-art CPU project, on average, by 5×. Notably, GSOFA also achieves up to 47% of the peak memory throughput of a V100 GPU in Summit. Index Terms—Sparse linear algebra, sparse linear solvers, LU decomposition, static symbolic factorization on GPU F 0 1 2 3 4 5 6 7 8 9 Figure 1: (a) A 1 INTRODUCTION 0 9 1 0 1 sparse matrix A 2 used throughout this Many scientific and engineering problems require solving 3 large-scale linear systems, i.e., Ax = b. Solving this prob- 4 8 2 5 paper where each lem with direct methods [1] often involves LU factoriza- 5 represents a nonzero, 6 (b) G(A): the graph tion [2], [3], that is, decomposing the original matrix A into 7 6 7 3 8 4 representation of A. lower and upper triangular matrices L and U, respectively, 9 where A = LU. Since LU decomposition of a sparse matrix (a) The sparse matrix A. (b) G(A): Graph of A. typically introduces new nonzeros, also known as fill-ins, GPUs have a relatively large amount of device memory, in the L and U factored matrices, symbolic factorization [2] which allows the entire sparse matrix to reside on the GPUs. is used to compute the locations of the fill-ins for both L Secondly, more and more exascale application codes are and U matrices. This information is needed to allocate the moving to GPUs, which demands the underlying linear compressed sparse data structures for L and U, and for the solvers to reside on GPUs to eliminate CPU-GPU com- subsequent numerical factorization. munication. Ultimately, we will move all the workflows Symbolic factorization is a graph algorithm that acts on of the sparse LU solver to GPUs. Third, many GPU-based the adjacency graph of the corresponding sparse matrix. efforts have gone to deploying numerical factorization and arXiv:2007.00840v4 [cs.DC] 9 May 2021 Figure 1 shows a sparse matrix A and its adjacency graph triangular solution on GPUs [4], [5] due to (i) numerical G(A). In the graph, each row of a matrix corresponds to factorization and triangular solution take more time, and (ii) a vertex in the directed graph with a nonzero in that row symbolic factorization consumes excessive memory space. corresponding to an out edge of that vertex. For instance, We argue that since sparse symbolic factorization detects row 8 of A in Figure 1(a) has nonzeros in columns f1, 2, 7, the fill-ins and supernodes for the numerical steps, it is 8, 9g. We thus have those corresponding out edges for vertex imperative to move sparse symbolic factorization on GPUs. 8 in G(A) of Figure 1(b). For brevity, we do not represent Existing symbolic factorization algorithms are difficult the self edges corresponding to the diagonal elements. We to deploy on GPUs, as well as scale-up in distributed will use this example throughout this paper. settings, due to stringent data dependency, lack of fine- The motivation to parallelize the sparse symbolic factor- grained parallelism, and excessive memory consumption. ization on GPUs is threefold. Firstly, the newer generation There mainly exist three types of algorithms to perform symbolic factorization, that is, fill1 and fill2 algorithms by • A. Gaihre and H. Liu are with Department of Electrical and Computer Rose and Tarjan [6] and Gilbert-Peierls algorithm [7], [8]. Engineering, Stevens Institute of Technology, NJ. E-mail: fagaihre, [email protected] They are based on traversing either the graphs G(L) and • X.S. Li is with Lawrence Berkeley National Laboratory, Berkeley, CA. G(U) or graph G(A). In particular, fill1 works on G(U) for E-mail: [email protected] fill-ins detection while Gilbert-Peierls algorithm does that Manuscript revised May 7, 2021. on G(L). Fill2 algorithm is analogous to Dijkstra’s algorithm [9]. Consequently, fill1 and Gilbert-Peierls algorithms structures between CPU and GPU memories. are sequential in nature as the fill-in detection of larger rows The rest of this paper is organized as follows: Section 2 or columns has to wait until the completion of the smaller introduces the background of this work. Section 3 presents rows or columns. Fill2 algorithm lacks fine-grained paral- the design challenges. Sections 4, 5 and 6 present the fine- lelism (i.e., in each source) due to strict priority requirement, grained symbolic factorization algorithm design, parallel although coarse-grained parallelism (i.e., at source level) efficient supernode detection, and space consumption opti- exists. Furthermore, symbolic factorization needs to identify mization techniques. We evaluate GSOFA in Section 7, study supernode [10], which is used to expedite the numerical the related work in Section 8, and conclude in Section 9. steps. But supernode detection presents cross-source data dependencies. Third, we need to perform parallel symbolic 2 BACKGROUND factorization for a collection of rows to saturate the computing resources on GPUs. This results in overwhelming 2.1 Sparse LU Factorization space consumption, which is similar to multi-source graph Factorizing a sparse matrix A into the L and U matrices traversal [11]. often involves several major steps, such as preprocessing, In this paper, we choose the fill2 algorithm as a starting symbolic factorization, and numerical factorization. point stemming from the fact that fill2 directly traverses the Matrix preprocessing performs either row (partial piv- original instead of the filled graphs, allowing independent oting) or both row and column permutations (complete traversals from multiple sources. However, we have to pivoting) in order to improve numerical stability and reduce revamp this algorithm to expose a higher degree of fine- the number of fill-ins in the L and U matrices. For numerical grained parallelism to match GPUs’ SIMT nature and reduce stability, existing methods aim to find a permutation so the excessive memory requirements during the multi-source that the pivots on the diagonal are big, for instance, by symbolic factorization to, again, fit GPUs, which often equip maximizing the product of diagonal entries and make the limited memory space. To summarize, we make the follow- matrix diagonal dominant [12]. Supplemental file illustrates ing three major contributions. how matrix preprocessing works with matrix A. Regarding First, to the best of our knowledge, we introduce the fill-in reduction, the popular strategies are minimum degree first fine-grained parallel symbolic factorization that is well algorithm [13], Reverse Cuthill-McKee [14] and nested dis- suited for the SIMT architecture of GPUs. Particularly, in- section [15], or a combination of them, such as METIS [16]. stead of processing one vertex at a time, we allow all the Symbolic factorization. Symbolic factorization can be neighboring vertices to be processed in parallel, such that viewed as performing Gaussian elimination on a sparse the computations align well with the SIMT nature. Since matrix with either 0 or 1 entries. At each elimination step, this relaxation enables traversal to some vertices before their the pivoting row is multiplied by a nonzero scalar and dependencies are satisfied, we further allow revisitation of these updated into another row below, which has nonzero in the vertices to ensure correctness. We also introduce optimizations pivoting column; this is called an elementary row operation. to reduce the reinitialization overhead and random memory After a sequence of row operations, the original matrix is access to the algorithmic data. Finally, we use multi-source transformed into two triangular matrices L and U. Con- concurrent symbolic factorization to saturate each GPU and sidering row 8 in Figure 2(a), since row 8 has nonzeros achieve intra-GPU workload balance. at columns 1 and 2, both rows 1 and 2 will perform row Second, we not only tailor supernode detection into a operations on row 8. Using row operation from row 2 as an SIMT friendly process but also strive to balance the workload, example, it produces a fill-in (8, 3), which further triggers minimize the communication and saturate GPU computing re- row operations from rows 3 and 5 that produce fill-ins (8, 4) sources during supernode detection. Particularly, we break and (8, 5).

Scalable Sparse Symbolic LU Factorization on Gpus

Recursive Approach in Sparse Matrix LU Factorization

Lecture 5 - Triangular Factorizations & Operation Counts

LU Factorization LU Decomposition LU Decomposition: Motivation LU Decomposition

LU-Factorization 1 Introduction 2 Upper and Lower Triangular Matrices

LU, QR and Cholesky Factorizations Using Vector Capabilities of Gpus LAPACK Working Note 202 Vasily Volkov James W

LU and Cholesky Decompositions. J. Demmel, Chapter 2.7

Math 2270 - Lecture 27 : Calculating Determinants

LU Decomposition S = LU, Then We Can  1 0 0   2 4 −2   2 4 −2  Use It to Solve Sx = F

LU, QR and Cholesky Factorizations Using Vector Capabilities of Gpus LAPACK Working Note 202 Vasily Volkov James W

Gaussian Elimination and Lu Decomposition (Supplement for Ma511)

Effective GPU Strategies for LU Decomposition H

Triangular Factorization and Inversion by Fast Matrix Multiplication*