
Direct and Line Based Iterative Methods for Solving Sparse Block Linear Systems

A thesis submitted to the Graduate School of the University of Cincinnati in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

in the Department of Aerospace Engineering and Engineering Mechanics of the College of Engineering and Applied Science by

Xiaolin Yang

B.S. Shandong University, June 2012

Date: Oct/31/2018

Committee Chair: Mark G. Turner, Sc.D

Abstract

Solving sparse linear systems of equations represents the major computational cost in many scientific and engineering areas. There are two major approaches for solving large sparse linear systems: direct methods and iterative methods. Each has its own advantages for certain types of problems. In general, the direct method is more robust and the iterative method has better scalability.

High-order Discontinuous Galerkin (DG) methods have gained growing interest in the Computational Fluid Dynamics (CFD) community. The Jacobian matrices that arise in the application of the DG method are sparse and block-structured. This thesis summarizes the development of direct and iterative solvers for sparse block linear systems. Block capability is achieved by using the Intel CPU library or Nvidia GPU based libraries. The direct solver uses a left-looking method with fill-reducing ordering to factorize the matrices into lower/upper triangular parts. The iterative solver uses the line-based Successive Over-Relaxation method (SLOR) and the Alternating Direction Implicit method (ADI), which exploit the characteristics of a structured grid.

The direct and iterative solvers are tested with matrices from the simulation of a flow channel using the DG method. The grid dimension is 6 × 2 × 2. The results show that the direct solver performs better on these small matrices. However, the iterative solver using the ADI method demonstrates better scalability with respect to the degree of polynomial used in the DG scheme.

This work advances the development of linear solvers for DG methods.


Acknowledgement

First and foremost, I would like to express my sincere gratitude to my advisor Dr. Mark Turner for his guidance and patience. Dr. Turner and I spent a lot of time on this thesis, and the discussions with him always benefited me.

I would also like to thank the rest of my committee members, Dr. Shaaban Abdallah and Dr. Donald French, for their time and insightful comments.

I am sincerely grateful to Nathan Wukie for providing the test cases for this thesis. This thesis could not have been done without Nathan's help.

Finally, I would like to thank my parents and Sisi for their support over the years. They have always been there for me during the hard times of my life.


Contents

Abstract ...... ii

Acknowledgement ...... iv

Contents ...... v

List of Figures ...... viii

Nomenclature ...... xi

1 Introduction ...... 1

Background ...... 1

Motivation ...... 2

Sparse Linear Solver Data Structure ...... 3

Overview of Direct Method ...... 5

1.4.1 Symbolic Analysis ...... 5

1.4.2 Numerical Factorization...... 7

1.4.3 Solving Sparse Triangular System ...... 8

Overview of Iterative Method ...... 8

Linear Algebra Package ...... 11

1.6.1 Intel Math Kernel Library ...... 11

1.6.2 Nvidia cuBLAS and cuSolver ...... 12

Thesis Outline ...... 12

2 Direct Methodology ...... 13

Overall Algorithm ...... 13

Solving Sparse Triangular System ...... 14

Block-wise Left-looking Method ...... 17

Symbolic Analysis and Fill-reducing Ordering ...... 19

Error Analysis ...... 23

Iterative Refinement...... 24

Memory Management ...... 25

Data Structure ...... 25

3 SLOR/SLOR-ADI Methodology...... 27

Overall Algorithm of SLOR Method ...... 27

Symbolic Analysis of SLOR Method ...... 28

Block-wise Thomas Algorithm ...... 31

Convergence Analysis of SLOR Method ...... 32

SLOR-ADI Method ...... 34

Memory Management ...... 39

Data Structure ...... 40

4 Implementation and Results ...... 41

Test Case ...... 41


Test Configuration ...... 43

Direct Method ...... 44

SLOR Method ...... 51

ADI Method ...... 57

GPU Application ...... 62

5 Conclusion and Future Work...... 65

References ...... 66


List of Figures

2.1: Illustration of symbolic analysis process...... 14

2.2: Finding the non-zero pattern of x...... 16

2.3: Row-merge tree of (2.6) ...... 21

3.1: Illustration of SLOR method ...... 27

3.2: A structured 4×3 2D grid with natural ordering...... 28

3.3: Illustration of SLOR-ADI method...... 34

3.4: 3×2×2 structured 3D grid with natural ordering...... 35

4.1: 6×2×2 3D structured grid ...... 42

4.2: Contours of pressure coefficient...... 42

4.3: Nonzero pattern of Jacobian ...... 43

4.4: Nonzero pattern of L+U-I with original ordering...... 44

4.5: Nonzero pattern of L+U-I with fill-reducing ordering...... 45

4.6: Direct method execution time (ms)...... 46

4.7: Execution time per equation (ms)...... 46

4.8: Log scaled execution time with original ordering...... 47

4.9: Log scaled execution time with fill-reducing ordering...... 47

4.10: Log scaled execution time per equation with original ordering...... 48

4.11: Log scaled execution time per equation with fill-reducing ordering...... 48

4.12: Flop/Time ratios of LU factorization of different orderings...... 49


4.13: Flop/Time ratios of LU factorization of different matrices...... 49

4.14: Condition number (log scale)...... 50

4.15: Log scaled upper bound of relative error...... 51

4.16: Direct method residual norm...... 51

4.17: P1 spectral radius vs omega ...... 52

4.18: P2 spectral radius vs omega...... 52

4.19: P3 spectral radius vs omega...... 53

4.20: P1 SLOR steps to converge vs omega...... 53

4.21: P2 SLOR Steps to Converge vs Omega...... 54

4.22: P3 SLOR steps to converge vs omega...... 54

4.23: x-SLOR execution time (ms)...... 55

4.24: x-SLOR execution time per equation (ms)...... 55

4.25: Log scaled x-SLOR execution time...... 56

4.26: Log scaled x-SLOR execution time per equation...... 56

4.27: Log scaled x-SLOR relative error bound...... 57

4.28: x-SLOR residual norm...... 57

4.29: P1 x-SLOR/ADI residual vs steps...... 58

4.30: P2 x-SLOR/ADI residual vs steps...... 58

4.31: P3 x-SLOR/ADI residual vs steps...... 59

4.32: ADI execution time (ms)...... 59

4.33: ADI execution time per equation (ms)...... 60


4.34: Log scaled ADI execution time...... 60

4.35: Log scaled ADI execution time per equation...... 61

4.36: Log scaled ADI relative error bound...... 61

4.37: ADI residual norm...... 62

4.38: P1 execution time (ms) ...... 63

4.39: P2 execution time (ms)...... 63

4.40: P3 execution time (ms)...... 63

4.41: Log scaled direct (left) /ADI (middle) /SLOR (right) execution time...... 64


Nomenclature

N    Matrix Dimension

n    Block Dimension

$A_{ij}$    Element of Matrix A at ith Row and jth Column

$A^{-1}$    Inverse of Matrix A

$A^T$    Transpose of Matrix A

$|A|$    Number of Nonzero Elements of Matrix A

L    Lower Triangular Matrix

U    Upper Triangular Matrix

TD    Main Tri-diagonal Part of a Matrix

G    Graph of a Matrix

$\|A\|$    Natural Norm of Matrix A


Introduction

Background

A sparse matrix can be defined as a matrix with enough zeros that it pays to take advantage of them [1]. This practical definition captures the essence of sparse linear system solving methods.

Solving large sparse linear systems lies at the heart of a wide range of scientific and engineering computations and often represents the major computational cost. There are two major approaches for solving large sparse linear systems: direct methods and iterative methods. Both have been developed for decades and have their own advantages for certain types of problems. A direct method is more robust than an iterative method and its computational workload is predictable. In addition, the factorization from a direct method can be reused many times if multiple right-hand sides are present. However, it has higher memory usage and worse scalability than an iterative method. On the other hand, iterative methods can achieve the desired accuracy through iterations, but convergence is not always guaranteed. Iterative methods are preferred for large problems due to their scalability, where direct methods are usually ineffective. Both approaches are discussed in this thesis.


Motivation

Discontinuous Galerkin (DG) methods have gained growing interest in the CFD community [2] [3] [4]. The numerical scheme represents the approximate solution with piecewise cell-local polynomials that are continuous within a given cell, while the discrete solution can have a discontinuity across cell boundaries. High-order accuracy can be obtained by increasing the order of the polynomial expansion without enlarging the stencil. Like finite volume methods, this method also relies on an integral form of the conservation law equations. The integral form is obtained by multiplying the conservation law equations by a set of test functions and integrating over a control volume. The resulting volume integral is then converted into a weak form. The flux in the boundary integral is substituted with the same numerical upwind fluxes, or approximate Riemann solvers, used for finite volume discretizations [5] [6].

The system of linear equations of interest in this thesis, which arises from the Discontinuous Galerkin method, has the form:

$\frac{\partial \Re(Q^n)}{\partial Q} \Delta Q^n = -\Re(Q^n)$    (1.1) [7]

$\Re(Q)$ is the sum of all spatial integrals. $Q$ is the vector of all cell dependent variables in the computational domain used to compute $\Delta Q^n$, the update vector used to update the entire solution vector as $Q^{n+1} = Q^n + \Delta Q^n$ for each Newton iteration $n$; that is, $\Delta Q^n$ is the global change in $Q$. On a structured mesh, the Jacobian matrix $\frac{\partial \Re(Q^n)}{\partial Q}$ forms a block tri-, block penta-, and block hepta-diagonal sparse matrix for one-, two-, and three-dimensional problems, respectively. The dimension of each dense block can be very large, which is computationally expensive to work with. This thesis explores the efficiency of direct methods and line-based iterative methods applied to such block sparse matrices.

Sparse Linear Solver Data Structure

A regular sparse matrix is typically stored in a column-major form called compressed sparse column (CSC) form or a row-major form called compressed sparse row (CSR) form [8]. In either form, an $N \times N$ sparse matrix can be represented by three 1-D arrays. For a block sparse matrix, each dense block is treated as a single element. The corresponding storage schemes are the column-major block sparse column (BSC) form and the row-major block sparse row (BSR) form, respectively. Both BSC and BSR forms are used in this thesis. In either form, each dense block can be stored in column-major or row-major fashion. As an example, the following sparse matrix (1.2) can be stored either as a $6 \times 6$ matrix in CSC form or as a $3 \times 3$ block matrix in BSR form.

$\begin{bmatrix} 4 & -1 & 0 & 0 & 0 & 0 \\ -1 & 4 & -1 & 0 & 0 & 0 \\ 0 & -1 & 4 & -1 & 0 & 0 \\ 0 & 0 & -1 & 4 & -1 & 0 \\ 0 & 0 & 0 & -1 & 4 & -1 \\ 0 & 0 & 0 & 0 & -1 & 4 \end{bmatrix}$    (1.2)

CSC form in C-style arrays (0-based index):

int col_ptr[] = {0, 2, 5, 8, 11, 14, 16};
int row_idx[] = {0, 1, 0, 1, 2, 1, 2, 3, 2, 3, 4, 3, 4, 5, 4, 5};
double value[] = {4, -1, -1, 4, -1, -1, 4, -1, -1, 4, -1, -1, 4, -1, -1, 4};

The integer array col_ptr, of length $N + 1$ ($N = 6$), stores the starting point of each column in the arrays row_idx and value, except that its last entry stores the number of nonzeros in the matrix. The row indices of column j are stored in row_idx[col_ptr[j]] through row_idx[col_ptr[j + 1] - 1]. The numerical values of column j are stored in value[col_ptr[j]] through value[col_ptr[j + 1] - 1]. The lengths of the arrays row_idx and value are equal to the number of nonzeros in the matrix.

BSR form in C-style arrays (0-based index):

int row_ptr[] = {0, 2, 5, 7};
int col_idx[] = {0, 1, 0, 1, 2, 1, 2};
double value[] = {4, -1, -1, 4,  0, -1, 0, 0,  0, 0, -1, 0,
                  4, -1, -1, 4,  0, -1, 0, 0,  0, 0, -1, 0,  4, -1, -1, 4};

As an extension, the above matrix can also be stored as a $3 \times 3$ block matrix with $2 \times 2$ blocks in BSR form. Each dense block is stored in column-major form. The column indices of block row j are stored in col_idx[row_ptr[j]] through col_idx[row_ptr[j + 1] - 1]. The numerical values of block row j are stored in value[row_ptr[j] * n * n] through value[row_ptr[j + 1] * n * n - 1], where n is the dimension of the dense blocks.
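As a small illustration of this indexing (a sketch only; the array names follow the CSC and BSR examples above), the following C fragment walks over one column of the CSC matrix and one block row of the BSR matrix.

#include <stdio.h>

/* Print the entries of column j of a CSC matrix. */
void print_csc_column(const int *col_ptr, const int *row_idx,
                      const double *value, int j)
{
    for (int p = col_ptr[j]; p < col_ptr[j + 1]; p++)
        printf("A(%d,%d) = %g\n", row_idx[p], j, value[p]);
}

/* Print the block column indices of block row j of a BSR matrix with
   n x n blocks. The values of block p start at value[p * n * n]. */
void print_bsr_block_row(const int *row_ptr, const int *col_idx,
                         const double *value, int n, int j)
{
    for (int p = row_ptr[j]; p < row_ptr[j + 1]; p++)
        printf("block (%d,%d) starts at value[%d]\n", j, col_idx[p], p * n * n);
}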


Overview of Direct Method

The direct method for solving a sparse linear system $Ax = b$ starts with the $LU$ factorization of $A$,

$A = LU$    (1.3)

followed by forward and backward substitution:

$Ly = b$    (1.4)

$Ux = y$    (1.5)

The direct method will always find the solution if the entire process can be completed.
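For a small dense system, the factor-then-substitute sequence (1.3) through (1.5) can be carried out directly with LAPACK. The sketch below uses the LAPACKE C interface (available in MKL); the 3 × 3 matrix and right-hand side are made-up values for illustration, and dgetrf/dgetrs perform the factorization and the forward/backward substitution with partial pivoting.

#include <stdio.h>
#include "mkl_lapacke.h"

int main(void)
{
    /* 3 x 3 example system in column-major storage (values for illustration only). */
    double A[9] = { 4, -1, 0,   -1, 4, -1,   0, -1, 4 };
    double b[3] = { 1, 2, 3 };
    MKL_INT ipiv[3];

    /* A = LU with partial pivoting, as in (1.3). */
    MKL_INT info = LAPACKE_dgetrf(LAPACK_COL_MAJOR, 3, 3, A, 3, ipiv);
    if (info != 0) { printf("factorization failed\n"); return 1; }

    /* Forward and backward substitution, as in (1.4) and (1.5); b is overwritten by x. */
    LAPACKE_dgetrs(LAPACK_COL_MAJOR, 'N', 3, 1, A, 3, ipiv, b, 3);

    for (int i = 0; i < 3; i++) printf("x[%d] = %g\n", i, b[i]);
    return 0;
}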

1.4.1 Symbolic Analysis

Symbolic analysis is the first step of LU factorization. Its purpose is to make the subsequent operations more efficient in terms of time and memory. The factors L and U usually have more nonzero elements than the lower and upper triangular parts of A. These newly created nonzeros are called "fill-in", which may cause the factorization to take $O(N^2)$ memory and $O(N^3)$ time. The symbolic analysis phase takes this fill-in into account and tries to reduce it in the nonzero pattern of L and U.

Symbolic analysis of sparse matrices is tightly coupled with graph analysis. The symbolic Cholesky factorization of symmetric matrices was researched first, and it provides the foundation for all other factorizations [9]. The Cholesky factors $L + L^T$ from $A = LL^T$ can be viewed as the adjacency matrix of an undirected graph $G_{L+L^T}$, in which an edge between vertices $i$ and $j$ exists if $l_{ij} \neq 0$. This graph, however, can be pruned while maintaining its reachability. The pruned result is the elimination tree, which is obtained by removing all "redundant" edges from $G_{L+L^T}$ [10]. The elimination tree contains the ancestor-descendant relations between the nodes, and it can be constructed in $O(|A|)$ time [11] [12] [13].

The first symbolic analysis of unsymmetric matrices was done by Rose and Tarjan [14], but their method is costly and does not include any permutation. Later, George and Ng found that if matrix A is nonsingular and has a zero-free diagonal, then the factor of the Cholesky factorization of $A^T A$ provides an upper bound on the nonzero patterns of L and U, regardless of any partial pivoting [15]. Gilbert and Ng also showed that when matrix A has the strong Hall property [16], the bound on U is tight [17].

It has been found that the nonzero pattern of the factors depends largely on how the columns and rows are permuted [18]. Therefore, it is important to find a fill-reducing ordering, which can be stated as finding row and column permutations P and Q such that the number of nonzeros in the factorization of PAQ, or the amount of work required to compute the factorization, is minimized [1]. However, finding the best P and Q that minimize memory usage or flop count is an NP-hard problem [19]. Therefore, heuristics must be used. The result on the Cholesky factorization of $A^T A$ suggests that it is possible to find an ordering of $A^T A$ that reduces the amount of fill-in in L and U [15] [20]. However, a noticeable drawback of analyzing the structure of $A^T A$ is that it requires explicitly forming $A^T A$. Each row of A creates a clique in the adjacency graph of $A^T A$, which may result in a dense submatrix that makes the upper bound very loose. Based on a symbolic analysis of the conventional outer-product formulation of Gaussian elimination of the matrix A and the row-merge tree [21], Davis et al. developed a symbolic analysis approach that does not need to form $A^T A$ explicitly and generates a tighter bound on the nonzero patterns of L and U. They also applied the Column Approximate Minimum Degree (COLAMD) method [22] [23] to their symbolic analysis [24], which has been proven to be an effective approach to reduce fill-in. The key idea of the COLAMD method is that at each elimination step, the column that minimizes a selected metric over all candidate pivot columns is chosen as the pivot column for that step, and the metric is then recomputed for each remaining column. The permutation method used in this thesis is based on the above research; it is a modified COLAMD method that accommodates the data structure used in this thesis.

1.4.2 Numerical Factorization

The numerical factorization phase usually dominates the time consumption of a direct method. Besides the benefits from symbolic analysis, numerical factorization can be accelerated in two ways: (1) creating task-level parallelism and (2) taking advantage of BLAS or LAPACK operations. Many numerical schemes have been developed based on these approaches. Duff surveyed their impact on the performance of direct sparse solvers [25].

For the block sparse matrices considered in this thesis, applying level 3 BLAS and LAPACK operations to the factorization process is straightforward. However, little task-level parallelism can be extracted from such matrices. Therefore, a sequential left-looking method is adopted in this thesis. The left-looking LU factorization method computes L and U one column at a time, from left to right. At the kth step, it accesses columns 1 to k - 1 of L and column k of A (assuming a one-based index). The earliest implementation of this method was done by Sato and Tinney [26]. Gilbert and Peierls showed that the left-looking method takes time proportional to the number of floating-point operations [27]. According to the survey by Davis et al. [1], no other method provides this guarantee. In this thesis, the left-looking method is also used in the symbolic analysis phase.

1.4.3 Solving Sparse Triangular System

Solving a sparse triangular system Lx = b or Ux = b, where L and U are lower and upper sparse triangular matrices respectively, is the fundamental mathematical kernel of the direct sparse method. It is used not only in the solving phase after factorization, but also during the numerical factorization itself. The sparsity in L and U needs to be considered. If the right-hand side b is a dense vector, the solving process is close to that of solving a dense triangular system. However, if b is a sparse vector, then not all columns of L or U take part in the computation. The detailed algorithms for solving sparse triangular systems are discussed in Chapter 2.

Overview of Iterative Method

Iterative methods are usually preferred for large linear systems because they use less memory and are easier to parallelize than direct methods. They can also exploit physical information about the problem; thus the iteration scheme can be tailored to the specific problem to improve effectiveness and robustness [28] [29].

The generic classic iterative method has the form:

$x^{k+1} = T x^k + c$    (1.6)

The matrix $T$ is called the iteration matrix; it is the amplification matrix of the error. The method starts with a given initial solution and generates a sequence of improving approximate solutions until the termination criterion is met.

There are many ways to convert $Ax = b$ into the iteration form (1.6). The classic iteration methods are based on a splitting of A of the form:

$A = M + N$    (1.7)

where M is a nonsingular matrix. Then $Ax = b$ can be converted into the fixed-point iteration form:

$x^{k+1} = -M^{-1} N x^k + M^{-1} b$    (1.8)

Among the classic iteration methods, the most fundamental is the Jacobi method. It splits the matrix as:

$A = D + (L + U)$    (1.9)

where $D$ is the diagonal, $L$ is the strictly lower triangular part and $U$ is the strictly upper triangular part.

The corresponding iteration form of the Jacobi method is:

$x^{k+1} = -D^{-1}(L + U) x^k + D^{-1} b$    (1.10)

The Gauss-Seidel method modifies the Jacobi method by overwriting the approximate solution with the new value as soon as it is available. This results in the splitting:

$A = (D + L) + U$    (1.11)

and the iteration form:

$x^{k+1} = -(D + L)^{-1} U x^k + (D + L)^{-1} b$    (1.12)

The convergence rate of Gauss-Seidel can be improved by applying a weighted average of the new value and the one obtained during the previous iteration:

$x^{k+1} = (1 - \omega) x^k + \omega x^{k+1}_{GS}$    (1.13)

where $x^{k+1}_{GS}$ is the Gauss-Seidel value. This is called the Successive Over-Relaxation (SOR) method. The iteration form is defined by:

$x^{k+1} = (D + \omega L)^{-1}[(1 - \omega)D - \omega U] x^k + \omega (D + \omega L)^{-1} b$    (1.14)

where $\omega$ is the relaxation factor.

If the linear system $Ax = b$ is generated from a structured grid, then the matrix A has multiple tri-diagonal structures on its diagonal. Each tri-diagonal represents the connections along a single grid line. The line-based SOR method (SLOR) takes advantage of this structure by solving each tri-diagonal system simultaneously. Adding implicitness to the unknowns of the same line allows information to propagate more effectively over the entire domain compared with the point-wise methods mentioned above. The SLOR iteration is based on the splitting:

$A = TD + L' + U'$    (1.15)

The iteration form is:

$x^{k+1} = (TD + \omega L')^{-1}[(1 - \omega)TD - \omega U'] x^k + \omega (TD + \omega L')^{-1} b$    (1.16)

where $TD$ represents the main tri-diagonals, and $L'$ and $U'$ are the remaining lower triangular and upper triangular parts, respectively.

10

The SLOR method can be combined with the Alternating Direction Implicit (ADI) method [30]. The basic idea is to apply the SLOR method along a different direction in each iteration step, which requires reordering the matrix for each direction. This is particularly effective if the couplings along multiple directions are strong. Assuming the original ordering is along the $x$ direction, the iteration form of SLOR-ADI is:

$x^{k+1} = (TD + \omega_x L')^{-1}[(1 - \omega_x)TD - \omega_x U'] x^k + \omega_x (TD + \omega_x L')^{-1} b$    (1.17)

$x_y^{k+1} = (TD_y + \omega_y L_y')^{-1}[(1 - \omega_y)TD_y - \omega_y U_y'] x_y^k + \omega_y (TD_y + \omega_y L_y')^{-1} b_y$    (1.18)

$x_z^{k+1} = (TD_z + \omega_z L_z')^{-1}[(1 - \omega_z)TD_z - \omega_z U_z'] x_z^k + \omega_z (TD_z + \omega_z L_z')^{-1} b_z$    (1.19)

where $\omega_x$, $\omega_y$ and $\omega_z$ are the relaxation factors for each direction.
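Before the details in Chapter 3, the overall flow of one SLOR-ADI cycle can be sketched as follows. slor_sweep and permute_vector are hypothetical placeholders for the routines developed later; the sketch only shows that each direction applies the same line-relaxation kernel to a reordered view of the same system, under those assumptions.

/* Hypothetical helpers, standing in for the routines of Chapter 3. */
void slor_sweep(const void *A, const double *b, double *x, double omega);
void permute_vector(const int *P, const double *src, double *dst, int N, int n);

/* One SLOR-ADI cycle (schematic), following (1.17)-(1.19). A_x, A_y, A_z are the
 * x-, y- and z-ordered views of the matrix; Py, Pz and their inverses are the
 * permutations of Chapter 3; xw and bw are permuted copies of x and b. */
void adi_cycle(const void *A_x, const void *A_y, const void *A_z,
               const int *Py, const int *InvPy, const int *Pz, const int *InvPz,
               const double *b, double *x, double *xw, double *bw,
               double wx, double wy, double wz, int N, int n)
{
    slor_sweep(A_x, b, x, wx);               /* x-direction sweep, eq. (1.17) */

    permute_vector(Py, x, xw, N, n);         /* reorder x and b along y */
    permute_vector(Py, b, bw, N, n);
    slor_sweep(A_y, bw, xw, wy);             /* y-direction sweep, eq. (1.18) */
    permute_vector(InvPy, xw, x, N, n);      /* back to the original ordering */

    permute_vector(Pz, x, xw, N, n);         /* same pattern along z */
    permute_vector(Pz, b, bw, N, n);
    slor_sweep(A_z, bw, xw, wz);             /* z-direction sweep, eq. (1.19) */
    permute_vector(InvPz, xw, x, N, n);
}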

Although such classic iterative methods are rarely used alone nowadays, they serve as preconditioners for Krylov subspace methods. They can also be useful when combined with multigrid methods [31]. This thesis investigates the effectiveness of SLOR and SLOR-ADI on the block sparse matrices arising from (1.1).

Linear Algebra Package

The performance of a sparse linear solver relies heavily on dense linear algebra packages. Intel and Nvidia provide high-performance linear algebra packages optimized for their own hardware, and they are among the most popular packages in the scientific computing community.

1.6.1 Intel Math Kernel Library

Intel Math Kernel Library (MKL) is a library of math routines that is optimized specifically for Intel processors [32]. It provides BLAS and LAPACK functionality with both C and Fortran interfaces. The C interface of this library is used in this thesis.

1.6.2 Nvidia cuBLAS and cuSolver

The development of Graphics Processing Units (GPUs) is revolutionizing the scientific computing community. Nvidia provides linear algebra packages similar to MKL for its own GPUs [33] [34]. Research has shown that significant speedup can be observed on GPUs for problems involving a large amount of level 2 or level 3 BLAS operations [35] [36] [37]. However, high performance can only be achieved with a dedicated GPU for scientific computing, which was not available for the work of this thesis.

Thesis Outline

A direct solver and an iterative solver for the sparse matrices arising from (1.1) are developed. The direct method is discussed in Chapter 2 and the iterative methods are discussed in Chapter 3. Detailed algorithms are provided for both. Verification of the direct and iterative solvers is presented in Chapter 4. Conclusions and future work are given in Chapter 5.


Chapter 2

Direct Methodology

This chapter explains the direct method used to solve the sparse linear systems arising from (1.1). Algorithms for symbolic analysis, numerical factorization and solving triangular systems are given. Error analysis and data structures are also presented.

Overall Algorithm

The overall algorithm of the direct method is given below.

Algorithm 2.1 Direct method algorithm
Input: N × N block sparse matrix in BSC form, N*n vector as the right-hand side
//Symbolic analysis phase
for k = 1 to N do
    select the kth pivot column from the candidate pivot columns
    find the nonzero pattern of the kth column of L and U
end for
apply the permutation P A P^T
//Numerical factorization phase
for k = 1 to N do
    compute column k of L and U
end for
//Sparse triangular system solving phase
solve L y = P b
solve U x' = y
compute x = P^T x'

The symbolic analysis phase uses a combined left-looking and right-looking method.


This phase is illustrated by Fig 2.1. At the kth step, the selection of the pivot column is based on the result of fill-reducing ordering, which is discussed in Section 2.4. The nonzero pattern of the kth column of L and U is then computed using the left-looking method, as discussed in Section 2.3. Finally, the estimate of the nonzero pattern of the remaining N - k columns is updated, which is a right-looking approach.

Figure 2.1: Illustration of symbolic analysis process.

The numerical factorization phase is separate from the symbolic phase. This feature makes the solver GPU friendly: the numerical factorization work can be sent to the GPU without large data transfers to and from the CPU during the process. In addition, having an independent numerical factorization process allows the solver to reuse the result of symbolic analysis for matrices with the same nonzero pattern. This process also relies on the left-looking method.

Solving Sparse Triangular System

The algorithm for solving a sparse triangular system is the fundamental mathematical kernel of the direct method. Therefore, it is discussed first.

Solving a sparse triangular system with a dense right-hand side (RHS) is essentially the same as solving a dense triangular system. The block-wise algorithms are given here.

Solving a lower triangular system $Lx = b$ with a dense RHS uses forward substitution, where $x$ and $b$ are both $N \times n$ dense block vectors. The algorithm is presented here [38].

Algorithm 2.2 Algorithm for lower triangular system with dense RHS
x = b
for j = 1 to N do
    x_j = L_jj^{-1} x_j
    for each i > j for which L_ij is nonzero do
        x_i = x_i - L_ij x_j
    end for
end for
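A minimal C sketch of Algorithm 2.2 for the block case is given below. It assumes that the strictly lower blocks of L are stored in BSC arrays (cptr, ridx, val), that the inverses of the diagonal blocks have been precomputed in diag_inv, and that x and b are stored as N contiguous blocks of length n; these storage details are assumptions for illustration, not the solver's exact interface.

#include <stdlib.h>
#include <string.h>
#include "mkl.h"

/* Block forward substitution Lx = b with dense RHS (Algorithm 2.2).
 * cptr/ridx/val hold the strictly lower n x n column-major blocks of L in BSC
 * form and diag_inv[j*n*n] holds the inverse of the diagonal block L_jj. */
void block_lower_solve(int N, int n, const int *cptr, const int *ridx,
                       const double *val, const double *diag_inv,
                       const double *b, double *x)
{
    double *tmp = (double *)malloc((size_t)n * sizeof(double));
    memcpy(x, b, (size_t)N * n * sizeof(double));
    for (int j = 0; j < N; j++) {
        /* x_j = L_jj^{-1} x_j (matrix-vector product with the inverted block) */
        cblas_dgemv(CblasColMajor, CblasNoTrans, n, n, 1.0,
                    diag_inv + (size_t)j * n * n, n, x + (size_t)j * n, 1,
                    0.0, tmp, 1);
        memcpy(x + (size_t)j * n, tmp, (size_t)n * sizeof(double));
        /* x_i = x_i - L_ij x_j for every nonzero block below the diagonal */
        for (int p = cptr[j]; p < cptr[j + 1]; p++) {
            int i = ridx[p];
            cblas_dgemv(CblasColMajor, CblasNoTrans, n, n, -1.0,
                        val + (size_t)p * n * n, n, x + (size_t)j * n, 1,
                        1.0, x + (size_t)i * n, 1);
        }
    }
    free(tmp);
}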

The algorithm for an upper triangular system $Ux = b$ is similar.

Algorithm 2.3 Algorithm for upper triangular system with dense RHS
x = b
for j = N to 1 do
    x_j = U_jj^{-1} x_j
    for each i < j for which U_ij is nonzero do
        x_i = x_i - U_ij x_j
    end for
end for

Changing from a dense RHS to a sparse RHS has a significant impact on the algorithm. For a lower triangular system, if any $x_j = 0$ then the jth column of L can be skipped in the computation. The problem becomes how to determine the nonzero pattern of $x$. Gilbert and Peierls provide a topological approach to find the nonzero pattern of $x$ [39]. For a lower triangular system such as the one shown in Fig 2.2, it can be stated in the following two statements [38]:

(1) if $b_i$ is nonzero, then $x_i$ is nonzero.

(2) if $x_i$ is nonzero and $L_{ji}$ is nonzero, then $x_j$ is nonzero.

Figure 2.2: Finding the non-zero pattern of x.

The above statements can be implemented by a non-recursive depth-first search (DFS) with a stack of size N [38]. This approach is used as part of the process of factorizing A into L and U.

Algorithm 2.4 Non-recursive DFS algorithm for lower triangular system
Let S be a stack of size N
Let bool be a Boolean variable
Let B be the nonzero pattern of b
Let X be the nonzero pattern of x
for each i ∈ B do
    if i is not labeled in X then
        label i in X
        push i onto the stack
    end if
    while the stack is nonempty do
        bool = false
        j = top of the stack
        for each k > j for which L_kj is nonzero do
            if k is not labeled in X then
                label k in X
                push k onto the stack
                bool = true
                break
            end if
        end for
        if not bool then
            pop the stack
        end if
    end while
end for
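Algorithm 2.4 can be written compactly in C as the following sketch. The interface (BSC arrays cptr/ridx for L, an explicit list B of the nonzero block rows of b, caller-provided mark and stack arrays) is an assumption made for illustration.

/* Non-recursive DFS (Algorithm 2.4): mark every block row of x that becomes
 * nonzero when solving Lx = b with a sparse b. L is in BSC form (cptr/ridx),
 * B/nb lists the nonzero block rows of b, and mark[] must be zero-initialized.
 * Returns the number of nonzero block rows found in x. */
int reach(int N, const int *cptr, const int *ridx,
          const int *B, int nb, int *mark, int *stack)
{
    int count = 0;
    for (int s = 0; s < nb; s++) {
        int top = 0;
        if (mark[B[s]]) continue;
        mark[B[s]] = 1;
        stack[top++] = B[s];
        count++;
        while (top > 0) {
            int j = stack[top - 1];
            int found = 0;                 /* the "bool" of Algorithm 2.4 */
            for (int p = cptr[j]; p < cptr[j + 1]; p++) {
                int k = ridx[p];
                if (k > j && !mark[k]) {   /* unvisited row below the diagonal */
                    mark[k] = 1;
                    stack[top++] = k;
                    count++;
                    found = 1;
                    break;
                }
            }
            if (!found) top--;             /* column j fully explored: pop */
        }
    }
    return count;
}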

Once the nonzero pattern of x is found, the lower triangular system can be solved by using a modified Algorithm 2.2, which is shown as Algorithm 2.5. [38]

Algorithm 2.5 Algorithm for lower triangular system with sparse RHS
x = b
for each j ∈ X do
    x_j = L_jj^{-1} x_j
    for each i > j for which L_ij is nonzero do
        x_i = x_i - L_ij x_j
    end for
end for

Algorithms 2.4 and 2.5 are used in computing the nonzero pattern and in the numerical factorization. The algorithms for an upper triangular system with a sparse RHS are applied in a similar manner and are omitted here.

Block-wise Left-looking Method

The left-looking method can be derived from the following $3 \times 3$ block expression of the LU factorization [38]. The uppercase terms represent submatrices and the lowercase terms represent vectors. The lower triangular factor L is assumed to have an identity diagonal.

$\begin{bmatrix} L_{11} & & \\ l_{21} & 1 & \\ L_{31} & l_{32} & L_{33} \end{bmatrix} \begin{bmatrix} U_{11} & u_{12} & U_{13} \\ & u_{22} & u_{23} \\ & & U_{33} \end{bmatrix} = \begin{bmatrix} A_{11} & a_{12} & A_{13} \\ a_{21} & a_{22} & a_{23} \\ A_{31} & a_{32} & A_{33} \end{bmatrix}$    (2.1)

In (2.1), the lowercase row and column of each matrix are the kth row and column of L, U, and A, respectively. Assuming no permutation, the left-looking method uses the first k - 1 columns of L and the kth column of A to compute the kth column of L and U. This requires solving the following lower triangular system using Algorithm 2.5:

$\begin{bmatrix} L_{11} & & \\ l_{21} & 1 & \\ L_{31} & 0 & I \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} a_{12} \\ a_{22} \\ a_{32} \end{bmatrix}$    (2.2)

The solution to (2.2) gives:

$u_{12} = x_1, \qquad u_{22} = x_2, \qquad l_{32} = u_{22}^{-1} x_3$    (2.3)

This method accesses L, U and A column by column, so it can easily accommodate any column ordering with a column-major data structure.

The block-wise method is derived from (2.3) by replacing the vector-vector operations with matrix-vector operations. The division is also replaced by left multiplication with the inverse of the diagonal block. The corresponding algorithm is given below [38].

Algorithm 2.6 Left-looking method algorithm
for k = 1 to N do
    apply Algorithm 2.4 to find the nonzero pattern of the kth column of L
    apply Algorithm 2.5 to solve (2.2)
    compute (2.3)
end for


Symbolic Analysis and Fill-reducing Ordering

The derivation of the ordering approach is based on a symbolic analysis of outer-product Gaussian elimination, which is a right-looking method [24]. For the matrices within the scope of this thesis, only the diagonal blocks are nonsingular, so only the column ordering needs to be found; the row ordering will be the same as the column ordering because the diagonal blocks need to be inverted in the subsequent process.

The basic idea of symbolic analysis is to find an upper bound on the nonzero patterns of L and U. Consider the Gaussian elimination process of the following 8 × 8 block sparse matrix A, shown in (2.4). Let $A^k$ represent the lower $(N - k) \times (N - k)$ submatrix of A after the kth step of Gaussian elimination has been performed.

∗ ∗ ∗

∗ ∗ ∗ ∗

∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ [ ∗ ∗ ∗ ]    (2.4)

At the kth step, the pivot row updates the nonzero pattern of $A^k$ by letting all the rows of $A^{k-1}$ that have a nonzero in its left-most column take the same upper bound as their new nonzero pattern, namely the set union of those rows. Assuming the original ordering is used, after the first step the nonzero pattern of matrix A becomes:


∗ ∗ ∗

∗ ∗ ∗ ● ● ∗

∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ● ● ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ [ ∗ ∗ ∗ ]    (2.5)

The black circles in (2.5) represent the fill-ins. The 1st and 4th rows of $A^1$ now have the same upper bound on their nonzero pattern, which is the set union of the 1st, 2nd and 5th rows of $A^0$.

The second step can be implemented in the same way and the matrix A is turned into:

∗ ∗ ∗

∗ ∗ ∗ ● ● ∗

∗ ∗ ∗ ● ● ∗ ∗ ∗ ∗ ∗ ∗ ● ● ∗ ∗ ∗ ● ∗ ● ● ∗ ∗ ∗ ∗ ∗ ∗ ∗ [ ∗ ∗ ∗ ]    (2.6)

The 1st, 3rd and 4th rows of $A^2$ now have the same upper bound on their nonzero pattern, which is the set union of the 1st, 2nd, 4th and 5th rows of $A^1$.

After the 8th step, the matrix has become:

∗ ∗ ∗

∗ ∗ ∗ ● ● ∗

∗ ∗ ∗ ● ● ∗ ∗ ∗ ∗ ● ● ∗ ∗ ● ● ∗ ∗ ∗ ● ● ∗ ● ● ∗ ∗ ∗ ● ∗ ● ● ∗ ∗ ∗ [ ∗ ● ● ∗ ∗ ]    (2.7)

This is the upper bound on the nonzero pattern of $L + U$ using the original ordering.

Let $A_i$ be the nonzero pattern of row i of A, and $R_k$ be the upper bound formed at the kth step of elimination. From the above discussion, we have:

$R_k = \Big( \bigcup_{k = \min R_i} R_i \Big) \cup \Big( \bigcup_{k = \min A_i} A_i \Big) \setminus \{k\}$    (2.8)

This process is called regular row absorption [24]. It can be described by the row-merge tree [21], which is a slightly modified elimination tree of $A^T A$. The corresponding row-merge tree of (2.7) is presented in Fig 2.3.

(2.7) is presented in Fig 2.3.

By removing nodes of A from the row-merge tree, it becomes elimination tree of 퐴푇퐴 .

Therefore, the row-merge tree captures the same connection information as the elimination tree of 퐴푇퐴.

Figure 2.3: Row-merge tree of (2.6)

In the row absorption process, the sequence in which the rows are eliminated affects the upper bound on the nonzero pattern of L and U. Because a dense pivot row could destroy the sparsity of L and U, fill-reducing ordering needs to be applied during the elimination process.

The fill-reducing ordering is given by Algorithm 2.7 below, which is derived from Davis's and Gilbert's ordering approaches [24] [40]. The main modification made is the selection of the pivot row. For the matrices within the scope of this thesis, the off-diagonal blocks are singular, which means only one row in each column can be selected as the pivot row. That being said, the column with the sparsest pivot row is picked as the pivot column at each step.

Algorithm 2.7 is implemented together with Algorithm 2.4. It computes the nonzero pattern of L and U column by column. In each step, the nonzero pattern of the selected pivot column is computed by Algorithm 2.4, and then the nonzero pattern of all the unselected rows in the pivot column is updated with a new upper bound, as described in the row-merging process.

As an example, the nonzero pattern of $L + U - I$ for (2.4) is reduced when fill-reducing ordering is applied, as shown in (2.9).

∗ ∗ ∗

∗ ∗ ∗ ∗ ∗ ∗ ∗

∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ (2.9) ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ [ ∗ ∗ ∗ ∗ ∗ ∗ ]

Algorithm 2.7 Fill-reducing ordering algorithm
Let A_i be the nonzero pattern of row i of A
Let A_i^k be the nonzero pattern of row i of A^k
Let L_k be the nonzero pattern of column k of L
Let R_k be the upper bound of the nonzero pattern formed at the kth step
Let d_i be the value of the upper bound of the nonzero pattern of row i
Let P_i be the permutation of row/column i
Let Pinv_i be the inverse/transpose of P
Let AA_i be the highest ancestor of A_i in the row-merge tree
pivot = 0; d = N; R_k = ∅
for k = 1 to N do
    pivot = k
    for each candidate pivot column i do
        if d_i < d then
            d = d_i
            pivot = i
        end if
    end for
    P_k = pivot
    Pinv_pivot = k
    apply Algorithm 2.4 to compute the nonzero pattern of column k of L and U
    for each i ∈ L_k do
        if row i has not been absorbed into a pivot row then
            R_k = R_k ∪ A_i
        else
            A_i^k = R_{AA_i}
            R_k = R_k ∪ A_i^k
        end if
    end for
    R_k = R_k \ {k}
    for each i ∈ L_k do
        AA_i = k
        d_i = |R_k|
    end for
end for

Error Analysis

The direct solution of $Ax = b$ is subject to floating-point error. Let $x'$ be the computed solution of $Ax = b$ and $r$ the residual vector for $x'$. Then for any natural norm it can be proven that [41]:

$\|x - x'\| \le \|r\| \cdot \|A^{-1}\|$    (2.10)

and if $\|x\| \neq 0$ and $\|b\| \neq 0$,

$\dfrac{\|x - x'\|}{\|x\|} \le \|A\| \cdot \|A^{-1}\| \dfrac{\|r\|}{\|b\|} = \mathrm{cond}(A) \dfrac{\|r\|}{\|b\|}$    (2.11)

in which $\mathrm{cond}(A)$ is the condition number of $A$.

If the computation is carried out in t-digit arithmetic, then it can be shown that the residual vector can be approximated by [42]:

$\|r\| \approx 10^{-t} \|A\| \cdot \|x'\|$    (2.12)

An estimate of the condition number can be obtained by solving:

$Ay = r$    (2.13)

because the vector y can be expressed as:

$y = A^{-1} r = A^{-1}(b - Ax') = x - x'$    (2.14)

Combining (2.12) and (2.14),

$\|y\| = \|A^{-1} r\| \le \|A^{-1}\| \cdot \|r\| \approx 10^{-t} \|x'\| \, \mathrm{cond}(A)$    (2.15)

Hence the condition number can be estimated from (2.15) as:

$\mathrm{cond}(A) \approx 10^{t} \dfrac{\|y'\|}{\|x'\|}$    (2.16)

where $y'$ is the solution of $Ay = r$ obtained with t-digit arithmetic. This estimate can, however, be very conservative.

Iterative Refinement

As an option of the direct solver, the residual norm of $Ax = b$ can be reduced by iterative refinement, which uses the solution of (2.13). The process can be repeated as many times as needed, but usually one step is enough. The algorithm is given below.

Algorithm 2.8 Iterative refinement algorithm [41]
compute r = b - Ax
Let M be the maximum number of refinement steps
k = 0
while k < M do
    solve Ay = r
    compute x = x + y
    compute r' = b - Ax
    r = r'
    k = k + 1
end while
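As a sketch of how Algorithm 2.8 maps onto the solver's data structures (described in the Data Structure section of this chapter), the fragment below reuses the factors for each refinement step. bsr_matvec and lu_resolve are hypothetical stand-ins for the solver's block matrix-vector product and for repeating the forward/backward substitution with the stored factors; they are not actual routines of this code.

#include "mkl.h"

struct HYBC;
struct BSCLU;
/* Hypothetical stand-ins for the solver's block mat-vec and LU re-solve. */
void bsr_matvec(const struct HYBC *A, const double *x, double *y);
void lu_resolve(const struct BSCLU *F, const double *rhs, double *y);

/* One or more steps of iterative refinement (Algorithm 2.8).
 * len = N * n is the length of the solution vector; r and y are workspaces. */
void iterative_refinement(const struct HYBC *A, const struct BSCLU *F,
                          const double *b, double *x, double *r, double *y,
                          int len, int max_steps)
{
    for (int k = 0; k < max_steps; k++) {
        /* r = b - A x */
        cblas_dcopy(len, b, 1, r, 1);
        bsr_matvec(A, x, y);
        cblas_daxpy(len, -1.0, y, 1, r, 1);
        /* solve A y = r with the existing LU factors, then x = x + y */
        lu_resolve(F, r, y);
        cblas_daxpy(len, 1.0, y, 1, x, 1);
    }
}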

Memory Management

The memory space of this solver is allocated dynamically on the heap. The solver reads a formatted matrix file into pre-allocated memory. If the space is not enough while reading a matrix from a file, the solver allocates a bigger chunk of memory, copies the contents to the new space and frees the old space. The memory for $L + U - I$ is allocated at exactly the required size using the result of symbolic analysis, and it is freed after the solver finishes. No large strides are involved during computation because the matrix is stored in column-major order. If the solver cannot allocate enough space, it frees all the memory allocated on the heap and prints an error message to the screen.

The memory needed to store the structural information is linearly proportional to the number of nonzero blocks, which is trivial compared with the memory requirement of L and U for the matrices considered in this thesis. Therefore, the memory requirement of this direct solver is bounded by the number of nonzero blocks in the matrix factors.

Data Structure

The input data structure has structural information stored in both BSC and BSR form; the numerical values are stored in BSC form, and the dense blocks are stored in column-major order. The HYBC struct is initialized by the function that reads the matrix from a file, which requires the dimension of A and the dimension of the dense blocks as input parameters.


struct HYBC {
    int cc_dim;      //number of rows and columns
    int cc_bdim;     //dimension of each dense block
    int nzb;         //number of nonzero blocks
    int *cc_cptr;    //BSC column pointers
    int *cr_rptr;    //BSR row pointers
    int *cc_ridx;    //BSC row indices
    int *cr_cidx;    //BSR column indices
    double *cc_dat;  //numerical values, each block in column major
};

The result of symbolic analysis and numerical factorization is stored in the BSCLU struct. The structural information is stored in both BSC and BSR form. The numerical values are stored in BSC form. The identity diagonal of L is omitted from both the structural information and the numerical values. The array p is the permutation vector that stores the result of the fill-reducing ordering, where p[i] = j means the jth column of A is placed at the ith position. The array pinv is the inverse of p, which is also the transpose of the permutation.

struct BSCLU {
    int dim_A;        //matrix dim
    int bdim_A;       //block dim
    int *cptr_A;      //col pointers of filled graph of A
    int *ridx_A;      //row indices of filled graph of A
    int *colct_L;     //col count of L
    int *colct_U;     //col count of U
    int *row_sq;      //the sequence of each row per col
    double *dat;      //numerical values of L + U - I
    double *diag_inv; //inverses of diagonal blocks
    int *p;           //permutation
    int *pinv;        //inverse permutation
};

The RHS input is a pointer to an array of size N * n. This array is overwritten by the solution vector when the solver finishes.


Chapter 3

SLOR/SLOR-ADI Methodology

This chapter outlines the line-based SLOR method and the SLOR-ADI method for solving the sparse linear systems in (1.1). The iterative schemes and detailed algorithms for both methods are given. Convergence analysis and data structures are also presented.

Overall Algorithm of SLOR Method

The numerical scheme has been defined in (1.16). In each iteration step, (1.16) is applied in a line-by-line manner, as illustrated in Fig 3.1. The construction of the lines is based on the structured grid. The overall algorithm is given below [28].

Figure 3.1: Illustration of SLOR method


Algorithm 3.1 SLOR algorithm
Input: N × N block sparse matrix with n × n dense blocks in BSR form, N*n vector as RHS
Let M be the maximum number of iteration steps
Let L be the number of grid lines in the domain
Let l, i, j be grid line indices
Let x^0 be the initial guess of the solution
Let k = 0
while k < M do
    for l = 1 to L do
        TD_l x_l^{k+1} + ω L'_i x_i^{k+1} = (1 - ω) TD_l x_l^k - ω U'_j x_j^k + ω b_l
    end for
    k = k + 1
end while
check residual

Symbolic Analysis of SLOR Method

The matrix elements representing coupling between grid lines must be grouped with the proper lines of unknowns. The purpose of symbolic analysis is to determine the indices i and j in Algorithm 3.1. Consider the following 4 × 3 structured 2D grid with natural ordering, shown in Fig 3.2.

Figure 3.2: A structured 2D 4×3 grid with natural ordering.


The corresponding system of linear equations $Ax = (TD + L' + U')x = b$ is structured as:

$\begin{bmatrix}
\ast & \ast & & & \circ & & & & & & & \\
\ast & \ast & \ast & & & \circ & & & & & & \\
 & \ast & \ast & \ast & & & \circ & & & & & \\
 & & \ast & \ast & & & & \circ & & & & \\
\bullet & & & & \ast & \ast & & & \circ & & & \\
 & \bullet & & & \ast & \ast & \ast & & & \circ & & \\
 & & \bullet & & & \ast & \ast & \ast & & & \circ & \\
 & & & \bullet & & & \ast & \ast & & & & \circ \\
 & & & & \bullet & & & & \ast & \ast & & \\
 & & & & & \bullet & & & \ast & \ast & \ast & \\
 & & & & & & \bullet & & & \ast & \ast & \ast \\
 & & & & & & & \bullet & & & \ast & \ast
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_{12} \end{bmatrix}
=
\begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_{12} \end{bmatrix}$    (3.1)

The tri-diagonal system TD is represented by stars, the remaining lower triangular part L' by black dots, and the remaining upper triangular part U' by white dots. The elements in Fig 3.2 are ordered along the longest grid dimension; therefore this grid has three lines. Each element in Fig 3.2 is mapped to its corresponding row in (3.1) with the same ordering number. Elements numbered 1 to 4 belong to the 1st line of the grid, which corresponds to rows 1 to 4 of (3.1). Similarly, the 2nd line elements correspond to rows 5 to 8, and the 3rd line elements correspond to rows 9 to 12. In addition, each row of (3.1) represents the connection of each element with its neighbors.

For grid line l, according to Algorithm 3.1, the SLOR iteration is:

$TD_l x_l^{k+1} + \omega L_i' x_i^{k+1} = (1 - \omega) TD_l x_l^k - \omega U_j' x_j^k + \omega b_l$    (3.2)

in which l, i, j are grid line numbers. To use the Thomas algorithm to solve the tri-diagonal system, L' and U' need to be matched with the proper lines of the x vector. In (3.1), the U' entries of rows 1 to 4 are matched to the 2nd line of x, which is represented by rows 5 to 8 of x. The L' and U' entries of rows 5 to 8 are matched to the 1st and 3rd lines of x, respectively. The L' entries of rows 9 to 12 are matched to the 2nd line of x.

Finding the correct match for L' and U' can be done by traversing the first row of A for each grid line, which avoids going through all the nonzero elements in matrix A. The key idea comes from the observation that elements of L' or U' from the same grid line have the same offset within their rows of the matrix. In (3.1), for example, the L' entries of the 2nd and 3rd lines are the first nonzero in each of their rows, and the U' entries of the 1st and 2nd lines are the last nonzero in each of their rows. The algorithm is given below.

Algorithm 3.2 Line-matching algorithm
Let i be the row to traverse for each grid line
Let j be a column index in row i
Let td_offset be the location of the tri-diagonal system
Let offset be the array that stores the location of column j
Let xline be the array that stores the corresponding line of column j
ct = 0
for each column j in row i of A do
    if j < i then
        offset[ct] = location of column j in row i - starting point of row i
        xline[ct] = floor(j / dimension of each line) + 1
        ct = ct + 1
    end if
    if j == i then
        td_offset = location of column j in row i - starting point of row i
    end if
    if j > i + 1 then
        offset[ct] = location of column j in row i - starting point of row (i + 1)
        xline[ct] = floor(j / dimension of each line) + 1
        ct = ct + 1
    end if
end for
for k = 1 to ct do
    if offset[k] ≥ 0 then
        i = xline[k]
        compute L'_i x_i^{k+1}
    end if
    if offset[k] < 0 then
        i = xline[k]
        compute U'_i x_i^k
    end if
end for

For example, the matrix rows that need to be traversed in (3.1) are rows 1, 5 and 9. For row 5, the first column index is 1. It belongs to L' with offset 0, and it should be matched to the 1st line of x. The L' entries in rows 5 to 8 have the same offset and matching. The second column index is 5. Its offset is 1 and it is the first element of the tri-diagonals; the tri-diagonals in rows 5 to 8 can be located by this offset. The third column index is 6 and it is ignored because it also belongs to the tri-diagonals. The fourth column index is 9 and it belongs to U'. Its offset is -1 (measured from the start of the next row) and it is matched to the 3rd line of x, which means the U' entries in rows 5 to 8 share this offset and matching.

Block-wise Thomas Algorithm

The tri-diagonal systems are solved simultaneously using the block-wise Thomas algorithm. The tri-diagonal system in (3.3) can be solved by Algorithm 3.3 below. The difference between Algorithm 3.3 and the regular Thomas algorithm is that the multiplications and divisions are replaced by level 2 and level 3 BLAS and LAPACK operations.


$\begin{bmatrix}
b_1 & c_1 & & & \\
a_2 & b_2 & c_2 & & \\
 & \ddots & \ddots & \ddots & \\
 & & a_{n-1} & b_{n-1} & c_{n-1} \\
 & & & a_n & b_n
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_{n-1} \\ x_n \end{bmatrix}
=
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_{n-1} \\ y_n \end{bmatrix}$    (3.3)

Algorithm 3.3 Block-wise Thomas algorithm [7]
Let γ_1 = β_1 = 0
Let x_{n+1} = 0
for i = 1 to n do
    γ_{i+1} = -(a_i γ_i + b_i)^{-1} c_i
    β_{i+1} = (a_i γ_i + b_i)^{-1} (y_i - a_i β_i)
end for
for i = n + 1 to 2 do
    x_{i-1} = γ_i x_i + β_i
end for
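A possible realization of Algorithm 3.3 with MKL calls is sketched below. The storage layout (separate arrays a, b, c of column-major n × n blocks for one grid line, with caller-provided gamma, beta, M and ipiv workspaces) is an assumption for illustration; LAPACKE_dgetrf/LAPACKE_dgetrs apply the inverse of a_i γ_i + b_i and cblas_dgemm/cblas_dgemv supply the block products.

#include <string.h>
#include "mkl.h"

/* Block-wise Thomas algorithm (Algorithm 3.3) for one grid line of nblk blocks.
 * a, b, c hold the sub-, main- and super-diagonal n x n blocks (column major,
 * block i at offset i*n*n; a[0] and c[nblk-1] unused); y is the RHS and x the
 * solution, both nblk blocks of length n. gamma/beta hold (nblk+1) blocks. */
void block_thomas(int nblk, int n,
                  const double *a, const double *b, const double *c,
                  const double *y, double *x,
                  double *gamma, double *beta, double *M, MKL_INT *ipiv)
{
    size_t bs = (size_t)n * n;
    memset(gamma, 0, bs * sizeof(double));                 /* gamma_1 = 0 */
    memset(beta, 0, (size_t)n * sizeof(double));           /* beta_1  = 0 */
    for (int i = 0; i < nblk; i++) {
        /* M = b_i + a_i * gamma_i (the a_i term is absent for the first block) */
        memcpy(M, b + (size_t)i * bs, bs * sizeof(double));
        if (i > 0)
            cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, n, n, n,
                        1.0, a + (size_t)i * bs, n, gamma + (size_t)i * bs, n,
                        1.0, M, n);
        /* right-hand side for beta_{i+1}: y_i - a_i * beta_i */
        memcpy(beta + (size_t)(i + 1) * n, y + (size_t)i * n, (size_t)n * sizeof(double));
        if (i > 0)
            cblas_dgemv(CblasColMajor, CblasNoTrans, n, n, -1.0,
                        a + (size_t)i * bs, n, beta + (size_t)i * n, 1,
                        1.0, beta + (size_t)(i + 1) * n, 1);
        /* right-hand side for gamma_{i+1}: -c_i (no c_i for the last block) */
        if (i < nblk - 1) {
            memcpy(gamma + (size_t)(i + 1) * bs, c + (size_t)i * bs, bs * sizeof(double));
            cblas_dscal((MKL_INT)bs, -1.0, gamma + (size_t)(i + 1) * bs, 1);
        }
        /* apply (a_i gamma_i + b_i)^{-1} to both right-hand sides */
        LAPACKE_dgetrf(LAPACK_COL_MAJOR, n, n, M, n, ipiv);
        LAPACKE_dgetrs(LAPACK_COL_MAJOR, 'N', n, 1, M, n, ipiv,
                       beta + (size_t)(i + 1) * n, n);
        if (i < nblk - 1)
            LAPACKE_dgetrs(LAPACK_COL_MAJOR, 'N', n, n, M, n, ipiv,
                           gamma + (size_t)(i + 1) * bs, n);
    }
    /* Back substitution: x_i = gamma_{i+1} x_{i+1} + beta_{i+1}, with x_{nblk+1} = 0 */
    for (int i = nblk - 1; i >= 0; i--) {
        memcpy(x + (size_t)i * n, beta + (size_t)(i + 1) * n, (size_t)n * sizeof(double));
        if (i < nblk - 1)
            cblas_dgemv(CblasColMajor, CblasNoTrans, n, n, 1.0,
                        gamma + (size_t)(i + 1) * bs, n, x + (size_t)(i + 1) * n, 1,
                        1.0, x + (size_t)i * n, 1);
    }
}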

Convergence Analysis of SLOR Method

The SLOR numerical scheme (1.16) is rewritten here as:

$x^{k+1} = (TD + \omega L')^{-1}[(1 - \omega)TD - \omega U'] x^k + (TD + \omega L')^{-1} \omega b$    (3.4)

Let

$T_s = (TD + \omega L')^{-1}[(1 - \omega)TD - \omega U']$    (3.5)

and

$c_s = (TD + \omega L')^{-1} \omega b$    (3.6)

These give (3.4) the generic iteration form

$x^{k+1} = T_s x^k + c_s$    (3.7)

The analysis of (3.7) focuses on the iteration matrix $T_s$.


The spectral radius $\rho(T_s)$ of the matrix $T_s$ is defined by:

$\rho(T_s) = \max |\lambda| = \lim_{n \to \infty} \|T_s^n\|^{1/n}$    (3.8)

where $\lambda$ is an eigenvalue of $T_s$.

The spectral radius of $T_s$ is independent of the choice of matrix norm. Hence:

$\rho(T_s) \le \|T_s\|$    (3.9)

It has been proven that (3.7) converges to the unique solution of $x = T_s x + c_s$ if and only if [41]

$\rho(T_s) < 1$    (3.10)

and the following error bound holds:

$\|x - x^k\| \le \|T_s\|^k \|x^0 - x\|$    (3.11)

The error can also be estimated by:

$\|x^k - x\| \approx \rho(T_s)^k \|x^0 - x\|$    (3.12)

The spectral radius can be found using the power method, as shown in Algorithm 3.4. If the matrix has a unique dominant eigenvalue and the initial vector has a nonzero component along the dominant eigenvector, then the power method converges to the dominant eigenvalue, whose modulus gives the spectral radius, and to the corresponding eigenvector [43]. However, the power method will not converge if the matrix has multiple dominant eigenvalues [44]. In this case, (3.12) can be used to estimate the spectral radius.

It should also be noted that the spectral radius of $T_s$ depends on the value of the relaxation factor $\omega$. W. Kahan established the maximum range of the relaxation factor [45]:

$0 < \omega < 2$    (3.13)


The choice of the optimal $\omega$ is problem dependent [46]. There is no established theorem for the matrices within the scope of this thesis, so numerical experiments need to be performed to find the optimal $\omega$.

Algorithm 3.4 Power method algorithm [41]
Let q^0 be an initial vector
Let M be the maximum number of iteration steps
while k < M do
    z^k = A q^{k-1}
    q^k = z^k / ||z^k||
    λ^k = (q^k)^T A q^k
    error = ||q^k - q^{k-1}||
    if error is small enough then
        break
    end if
    k = k + 1
end while
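The following C sketch implements Algorithm 3.4 for a dense, explicitly stored matrix; the iteration matrix $T_s$ is not formed explicitly by the solver, so this is only an illustration of the method itself, with a Rayleigh-quotient estimate of the dominant eigenvalue.

#include "mkl.h"

/* Power method (Algorithm 3.4) for a dense N x N column-major matrix A.
 * q must contain a nonzero starting vector and is overwritten with the
 * approximate dominant eigenvector; z is a workspace of length N.
 * Returns the Rayleigh-quotient estimate of the dominant eigenvalue, whose
 * modulus approximates the spectral radius when the method converges. */
double power_method(int N, const double *A, double *q, double *z,
                    int max_steps, double tol)
{
    double lambda = 0.0;
    cblas_dscal(N, 1.0 / cblas_dnrm2(N, q, 1), q, 1);      /* normalize q^0 */
    for (int k = 0; k < max_steps; k++) {
        /* z = A q and the Rayleigh quotient lambda = q^T A q */
        cblas_dgemv(CblasColMajor, CblasNoTrans, N, N,
                    1.0, A, N, q, 1, 0.0, z, 1);
        lambda = cblas_ddot(N, q, 1, z, 1);
        /* next iterate q = z / ||z||, and measure how much q changed */
        cblas_dscal(N, 1.0 / cblas_dnrm2(N, z, 1), z, 1);
        cblas_daxpy(N, -1.0, z, 1, q, 1);                  /* q = q_old - q_new */
        double change = cblas_dnrm2(N, q, 1);
        cblas_dcopy(N, z, 1, q, 1);                        /* accept q_new */
        if (change < tol)
            break;
    }
    return lambda;
}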

SLOR-ADI Method

An important enhancement of line-based iterative methods is to solve strongly coupled unknowns simultaneously. The SLOR method can be combined with the ADI method to accelerate convergence. The key idea of ADI is to apply SLOR along all three directions, since the construction of the lines is based on the structured grid, as illustrated in Fig 3.3. This gives the capability of exploiting the coupling along all three directions.

Figure 3.3: Illustration of SLOR-ADI method.


Applying SLOR along the y and z directions can be done by reordering the matrix. Consider the following 3 × 2 × 2 structured 3D grid.

Figure 3.4: 3×2×2 structured 3D grid with natural ordering.

The grid is ordered along the x direction, which is the longest dimension. The nonzero pattern of its Jacobian matrix is:

$\begin{bmatrix}
\ast & \ast & & \circ & & & \circ & & & & & \\
\ast & \ast & \ast & & \circ & & & \circ & & & & \\
 & \ast & \ast & & & \circ & & & \circ & & & \\
\bullet & & & \ast & \ast & & & & & \circ & & \\
 & \bullet & & \ast & \ast & \ast & & & & & \circ & \\
 & & \bullet & & \ast & \ast & & & & & & \circ \\
\bullet & & & & & & \ast & \ast & & \circ & & \\
 & \bullet & & & & & \ast & \ast & \ast & & \circ & \\
 & & \bullet & & & & & \ast & \ast & & & \circ \\
 & & & \bullet & & & \bullet & & & \ast & \ast & \\
 & & & & \bullet & & & \bullet & & \ast & \ast & \ast \\
 & & & & & \bullet & & & \bullet & & \ast & \ast
\end{bmatrix}$    (3.14)

The matrix (3.14) has four tri-diagonal blocks because the original ordering has four lines, and (3.14) is also the matrix used by SLOR in the x direction. The ADI method applies the SLOR scheme along all coordinate directions. To apply SLOR along the y direction, the grid and its matrix need to be reordered along the y direction with the same origin as that of the x direction. For the grid in Fig 3.4, 1 remains 1, 2 becomes 3, 3 becomes 5, 4 becomes 2, 5 becomes 4, and 6 remains 6, etc. Let $P_y$ represent the permutation along the y direction; the reordered system can be written as:

$P_y A P_y^T (P_y x) = P_y b$    (3.15)

The first step in determining $P_y$ is to find the direction of the y axis in the originally ordered grid; in other words, to find the second element of the y-direction ordered grid. Since this element is connected to the origin, it must be the lowest-numbered element in the first row of U' (denoted by white dots in (3.14)) of the original matrix. Once the direction is determined, the elements can be reordered line by line.

The algorithm for determining $P_y$ is given below.

Algorithm 3.5 y-direction reordering algorithm
Let xDim be the dimension of the x direction
Let yDim be the dimension of the y direction
Let zDim be the dimension of the z direction
Let line be the grid line number of the original ordering
Let P_y be the permutation vector
Let InvP_y be the inverse of P_y
for k = 1 to zDim do
    for i = 1 to yDim do
        line = 0 + k × yDim + i
        for n = 1 to xDim do
            start = k × xDim × yDim + i
            InvP_y[line × xDim + n] = start + n × yDim
            P_y[start + n × yDim] = line × xDim + n
        end for
    end for
end for

The nonzero pattern of the matrix for the y-direction ordered grid of Fig 3.4 is shown below. (3.16) has 6 tri-diagonal blocks because the grid has 6 lines when it is ordered along the y direction.

$\begin{bmatrix}
\ast & \ast & \circ & & & & \circ & & & & & \\
\ast & \ast & & \circ & & & & \circ & & & & \\
\bullet & & \ast & \ast & \circ & & & & \circ & & & \\
 & \bullet & \ast & \ast & & \circ & & & & \circ & & \\
 & & \bullet & & \ast & \ast & & & & & \circ & \\
 & & & \bullet & \ast & \ast & & & & & & \circ \\
\bullet & & & & & & \ast & \ast & \circ & & & \\
 & \bullet & & & & & \ast & \ast & & \circ & & \\
 & & \bullet & & & & \bullet & & \ast & \ast & \circ & \\
 & & & \bullet & & & & \bullet & \ast & \ast & & \circ \\
 & & & & \bullet & & & & \bullet & & \ast & \ast \\
 & & & & & \bullet & & & & \bullet & \ast & \ast
\end{bmatrix}$    (3.16)

The z-direction reordering can be implemented in a similar fashion:

$P_z A P_z^T (P_z x) = P_z b$    (3.17)

The algorithm for finding $P_z$ is given below. The main difference from Algorithm 3.5 is that the lines being reordered run along the z direction instead of the y direction.

Algorithm 3.6 z-direction reordering algorithm
Let xDim be the dimension of the x direction
Let yDim be the dimension of the y direction
Let zDim be the dimension of the z direction
Let line be the grid line number of the original ordering
Let P_z be the permutation vector
Let InvP_z be the inverse of P_z
for k = 1 to yDim do
    for i = 1 to zDim do
        line = 0 + k + i × yDim
        for n = 1 to xDim do
            start = k × xDim × zDim + i
            InvP_z[line × xDim + n] = start + n × zDim
            P_z[start + n × zDim] = line × xDim + n
        end for
    end for
end for

퐼푛푣푃푧[푙푖푛푒 × 푥퐷푖푚 + 푛] = 푠푡푎푟푡 + 푛 × 푧퐷푖푚 푃푧[푠푡푎푟푡 + 푛 × 푧퐷푖푚] = 푙푖푛푒 × 푥퐷푖푚 + 푛 end for end for end for

Before applying SLOR to the y and z directions, the matrix A, the solution vector x and the right-hand-side vector b need to be permuted according to (3.15) and (3.17). Permuting x and b is straightforward and lightweight; however, permuting the matrix A is expensive. Explicitly permuting A requires extra memory space equal to that of the original matrix A, which would defeat the purpose of a memory-saving iterative method. The main idea for avoiding moving the actual numerical data around is to use indirect addressing. In that case, only the structure arrays of the new matrix need to be constructed. Indirect addressing is implemented by adding an array whose size is the number of nonzero blocks, which stores the location of each nonzero block in the original matrix A. The permutation algorithms are given below.

Algorithm 3.7 Vector permutation algorithm
Let P be the permutation vector
Let inverseP be the inverse of P
Let x be the vector to permute
Let x' be the permuted x
//permute x
for k = 1 to N do
    x'[k] = x[P[k]]
end for
//recover x
for k = 1 to N do
    x[k] = x'[inverseP[k]]
end for
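Algorithm 3.7 amounts to a block-wise gather, as in the short C sketch below; the block length n and the array layout follow the BSR data structure, and the function name is illustrative.

#include <string.h>

/* Block-wise vector permutation (Algorithm 3.7): block k of the permuted
 * vector xp is block P[k] of x, where each block holds n values. The
 * recovery step is the same call with the inverse permutation. */
void permute_block_vector(int N, int n, const int *P,
                          const double *x, double *xp)
{
    for (int k = 0; k < N; k++)
        memcpy(xp + (size_t)k * n, x + (size_t)P[k] * n,
               (size_t)n * sizeof(double));
}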

Algorithm 3.8 Implicit matrix permutation algorithm
Let P be the permutation vector
Let inverseP be the inverse of P
Let A be the matrix to permute
Let A' be the permuted A
Let C be a workspace of size N
Set C to -1
for k = 1 to N do
    i = P[k]
    for each column j in row i of A do
        j' = inverseP[j]
        C[j'] = sequence of A[i][j]
    end for
    for n = 1 to N do
        if C[n] ≥ 0 then
            store n in row k of A'
            store C[n] in row k of A'
            C[n] = -1
        end if
    end for
end for

As in the SLOR method, the choice of the relaxation factor $\omega$ for each direction has a significant impact on the rate of convergence. Moreover, the combined $\omega_x$, $\omega_y$ and $\omega_z$ may act differently from how they act when applied alone [47]. In other words, the relaxation factors that minimize the spectral radius of each direction individually may not be the best choice when they are put together. The choice of $\omega$ for each direction is problem dependent, and theorems have been established for elliptic problems [48]. For the problem in this thesis, numerical experiments need to be performed to find the optimal relaxation factors.

Memory Management

The SLOR/SLOR-ADI solver uses the same memory management strategy as the direct solver, and it uses less memory than the direct solver since it does not need to store the LU factors. However, large strides are involved during the computation because the matrix is stored in row-major order. All memory is allocated dynamically and released when the solver finishes. If an allocation fails, the solver prints an error message and releases the heap memory.


Data Structure

The input data structure required by the SLOR and ADI solvers is the BSR form.

struct BSR //compressed block row form
{
    MKL_INT dim;       //block matrix dimension
    MKL_INT bdim;      //dimension of each block submatrix
    MKL_INT nzb;       //number of nonzero blocks
    MKL_INT *cr_rptr;  //row pointers
    MKL_INT *cr_cidx;  //column indices
    double *cr_dat;    //numerical values of each block, stored in column major
};

This data structure is initialized by the function that reads the matrix A from a file, which requires the dimension of A and the dimension of the dense blocks (n) as input parameters.

The RHS input is a pointer to an array of size N * n. This array is overwritten by the solution vector once the solver has finished.


Chapter 4

Implementation and Results

This chapter details the implementation and results of the direct method and the line-based iterative methods discussed in the previous chapters. A description of the test case is provided, and the test results are presented.

Test Case

The test case is an inviscid flow over a Gaussian smooth bump in a channel, computed using a high-order Discontinuous Galerkin solver [49]. The domain is represented by a 6 × 2 × 2 grid as shown in Fig 4.1. The inlet boundary condition is specified by stagnation pressure and temperature with the flow direction imposed along the x axis. The outlet boundary condition is specified by a constant static pressure. The contours of pressure coefficient are shown in Fig 4.2 for different orders of the DG polynomial.


Figure 4.1: 6×2×2 3D structured grid

Figure 4.2: Contours of pressure coefficient. [49]

This 3퐷 grid has 24 elements, so its Jacobian matrix is a 24 × 24 hepta-diagonal sparse matrix, whose nonzero pattern is shown in Fig 4.3. The number of nonzero elements is 112.


Figure 4.3: Nonzero pattern of Jacobian matrix.

As mentioned before, the Jacobian matrix is block-structured. The block dimension represents the degrees of freedom in each grid volume for the 5 equations (mass, 3 momentum components, and energy) to be solved. In 3D problems, an nth-degree polynomial basis has $(n + 1)^3$ degrees of freedom per equation. The P1 (piecewise linear), P2 (piecewise quadratic), and P3 (piecewise cubic) bases are used for constructing the solution in the test cases. Therefore, the block dimension of the P1 case is 40, of P2 is 135, and of P3 is 320.

Test Configuration

Hardware

Processor: Intel Core i7-7500 2.7 GHz

Memory: 16 GB DDR4

GPU: NVIDIA GeForce 940MX with 2GB VRAM

Software


Windows 10 Home

Microsoft Visual Studio 2017

Intel MKL 2018

CUDA Toolkit 9.2

Direct Method

The nonzero patterns of $L + U - I$ with the original ordering and with fill-reducing ordering are shown in Fig 4.4 and Fig 4.5, respectively. These figures reveal the effectiveness of fill-reducing ordering: it reduces the workload of the subsequent numerical factorization by reducing the number of nonzero elements (394 nonzero elements with the original ordering versus 238 with fill-reducing ordering; the full matrix would have $24^2 = 576$ elements).

Figure 4.4: Nonzero pattern of L+U-I with original ordering.


Figure 4.5: Nonzero pattern of L+U-I with fill-reducing ordering.

Fig 4.6 shows the total execution time for the P1, P2, and P3 matrices. The computation time is measured with the original ordering (denoted by "0") and with fill-reducing ordering (denoted by "1"). Fig 4.7 shows the execution time per equation.

It should be noted from Fig 4.6 that the LU factorization dominates the entire execution time. Since the left-looking LU factorization should take time proportional to the number of floating-point operations, this explains why reducing the number of nonzeros significantly reduces the total execution time.


[Chart data: total time and LU factorization time (ms) for P1, P2, P3; series: Time_0, Time_1, LU_0, LU_1.]

Figure 4.6: Direct method execution time (ms).

[Chart data: execution time per equation (ms) for P1, P2, P3; series: t_0 (original ordering), t_1 (fill-reducing ordering).]

Figure 4.7: Execution time per equation (ms).

The scalability of the direct method is better illustrated in log-scaled charts. Fig 4.8 through Fig 4.11 show that the original ordering and the fill-reducing ordering have similar scalability.



Figure 4.8: Log scaled execution time with original ordering.


Figure 4.9: Log scaled execution time with fill-reducing ordering.



Figure 4.10: Log scaled execution time per equation with original ordering.


Figure 4.11: Log scaled execution time per equation with fill-reducing ordering.

Since the LU factorization is the dominant part of the whole process, it determines the performance of the solver. As mentioned before, the left-looking LU factorization time should be proportional to the number of floating-point operations. The floating-point operations take place in BLAS gemm calls and LAPACK getrf/getri calls during the LU factorization. The execution time ratios of the LU factorization, compared with the corresponding ratios of floating-point operations, are shown in Fig 4.12 and Fig 4.13. Only the P2 and P3 cases are considered here because the execution time for the P1 case is too short, and the overhead could be significant.

The gemm operation is called 1984 times and the getrf/getri operations are called 24 times when using the original ordering; the corresponding numbers for the fill-reducing ordering are 710 and 24, respectively. From Fig 4.12 and Fig 4.13, the time ratios are bounded by the floating-point operation ratios for the P2 and P3 matrices.
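As a concrete illustration of where these floating-point operations occur, the sketch below shows the two kinds of MKL calls being counted: a block update of the form C = C − A·B via cblas_dgemm, and the factor-and-invert of a dense diagonal block via LAPACKE_dgetrf/LAPACKE_dgetri. It assumes column-major blocks of dimension nb; it is a sketch of the call pattern, not the thesis code itself.

#include <mkl.h>

/* Block update: C <- C - A * B, with nb x nb column-major blocks.      */
/* This is the gemm call counted above (1984 / 710 times).              */
static void block_update(MKL_INT nb, const double *A, const double *B, double *C)
{
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                nb, nb, nb, -1.0, A, nb, B, nb, 1.0, C, nb);
}

/* Factor and invert a dense diagonal block in place (getrf + getri),   */
/* called once per block row (24 times for this grid).                  */
static lapack_int block_factor_invert(MKL_INT nb, double *D, MKL_INT *ipiv)
{
    lapack_int info = LAPACKE_dgetrf(LAPACK_COL_MAJOR, nb, nb, D, nb, ipiv);
    if (info != 0) return info;
    return LAPACKE_dgetri(LAPACK_COL_MAJOR, nb, D, nb, ipiv);
}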

[Chart data: categories P2_0/P2_1 and P3_0/P3_1; series: FLOP Ratio, T Ratio.]

Figure 4.12: Flop/Time ratios of LU factorization of different orderings.

[Chart data: categories P3_0/P2_0 and P3_1/P2_1; series: FLOP Ratio, T Ratio.]

Figure 4.13: Flop/Time ratios of LU factorization of different matrices.


The accuracy of the computation relates to the condition number, especially for a direct method. It is found that the average condition number of the diagonal blocks is roughly one order of magnitude smaller than that of the full matrix. The condition numbers are shown in Fig 4.14. This result implies that the condition number of the diagonal blocks can be used to estimate the matrix condition number. However, the estimate of the condition number from (2.16) gives very small values (O(10)), which indicates that the computation reached machine zero.

[Chart data: condition numbers (log scale) for P1, P2, P3; series: Avg Diag Cond, Matrix Cond.]

Figure 4.14: Condition number (log scale).

The magnitude of the relative error |x − x′|/|x| is bounded by the condition number times the norm of the residual, as in (2.9). The upper bound of the relative error and the Euclidean norm of the residual with the fill-reducing ordering are shown in Fig 4.15 and Fig 4.16. The primed items are the results with one step of iterative refinement. From the above discussion, the error upper bound is conservative and iterative refinement may not be necessary, but the time increment for iterative refinement is not significant since this process can reuse the LU factors.
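For reference, the standard form of this bound, which (2.9) is assumed to resemble (the exact normalization used in Chapter 2 may differ), is

|x − x′| / |x| ≤ κ(A) · |b − A x′| / |b|,

where x′ is the computed solution, b − A x′ is the residual, and κ(A) is the condition number of the matrix.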



Figure 4.15: Log scaled upper bound of relative error.


Figure 4.16: Direct method residual norm.

SLOR Method

The convergence rate of the SLOR method is determined by the relaxation factor. The effect of different relaxation factors on the spectral radius for the P1, P2, and P3 matrices is shown in Fig 4.17 to Fig 4.19.


[Plot: spectral radius vs omega (0.5 to 1.2); series: rhoX, rhoY, rhoZ.]

Figure 4.17: P1 spectral radius vs omega

[Plot: spectral radius vs omega (0.5 to 1.1); series: rhoX, rhoY, rhoZ.]

Figure 4.18: P2 spectral radius vs omega.


[Plot: spectral radius vs omega (0.5 to 1.1); series: rhoX, rhoY, rhoZ.]

Figure 4.19: P3 spectral radius vs omega.

Because the primary flow direction of the test case is along the x axis, SLOR in the x direction has the smallest spectral radius and the fastest convergence rate. The number of steps required to converge in each direction, versus the relaxation factor, is shown in Fig 4.20 to Fig 4.22.
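This is consistent with the standard asymptotic estimate (quoted here for context, not taken from the earlier chapters) that relates the spectral radius ρ of the iteration matrix to the iteration count: reducing the error by a factor of 10^d requires roughly

k ≈ d / (−log10 ρ)

iterations, so a smaller ρ in the x direction translates directly into fewer steps in Fig 4.20 to Fig 4.22.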

[Plot: steps to converge vs omega (0.6 to 1.1); series: x, y, z.]

Figure 4.20: P1 SLOR steps to converge vs omega.


[Plot: steps to converge vs omega (0.6 to 1.0); series: x, y, z.]

Figure 4.21: P2 SLOR Steps to Converge vs Omega.

[Plot: steps to converge vs omega (0.6 to 1.0); series: x, y, z.]

Figure 4.22: P3 SLOR steps to converge vs omega.

It can be seen from the above charts that the SLOR method favors under-relaxation, especially for the P2 and P3 matrices. SLOR in the x direction with the optimum relaxation factor is used to test the performance. The total execution time is shown in Fig 4.23 and the execution time per equation is shown in Fig 4.24.



Figure 4.23: x-SLOR execution time (ms).


Figure 4.24: x-SLOR execution time per equation (ms).

The log-scaled charts in Fig 4.25 and Fig 4.26 show the scalability of x-SLOR. The scalability from P2 to P3 is noticeably poor. Since x-SLOR needs roughly the same number of steps to converge for the P2 and P3 cases, the reason for the poor scalability should be the increased block dimension.



Figure 4.25: Log scaled x-SLOR execution time.


Figure 4.26: Log scaled x-SLOR execution time per equation.

The upper bound of the relative error and the residual norm are shown in Fig 4.27 and Fig 4.28. They show that SLOR has better accuracy than the direct method without iterative refinement.



Figure 4.27: Log scaled x-SLOR relative error bound.


Figure 4.28: x-SLOR residual norm.

ADI Method

Applying the ADI scheme to SLOR can reduce the number of steps required to reach convergence. The choice of relaxation factors is based on the results of numerical experiments. Fig 4.29 to Fig 4.31 demonstrate the effect of ADI on the convergence rate.
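The structure of one ADI cycle used in these comparisons can be sketched as below. The line-sweep routines are hypothetical placeholders for the block line-SOR sweeps described earlier, and each direction receives its own relaxation factor (the experiments in Fig 4.29 to Fig 4.31 use wx = 0.7, wy = 0.8, wz = 0.9). This is an illustrative sketch, not the thesis code.

/* Minimal sketch of one ADI cycle: one line-SOR sweep per coordinate      */
/* direction, each with its own relaxation factor. The sweep functions are */
/* hypothetical placeholders for the block line-SOR sweeps of the solver.  */
struct BSR;  /* block matrix structure, as defined above */

typedef void (*line_sweep_fn)(const struct BSR *A, const double *b,
                              double *x, double omega);

void adi_cycle(const struct BSR *A, const double *b, double *x,
               line_sweep_fn sweep_x, line_sweep_fn sweep_y, line_sweep_fn sweep_z,
               double wx, double wy, double wz)
{
    sweep_x(A, b, x, wx);  /* relax along lines in the x direction */
    sweep_y(A, b, x, wy);  /* relax along lines in the y direction */
    sweep_z(A, b, x, wz);  /* relax along lines in the z direction */
}

A full solve then repeats the cycle until the residual norm falls below the convergence tolerance.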


[Plot: residual vs steps; series: x-SLOR (omega = 1.1) and ADI (wx = 0.7, wy = 0.8, wz = 0.9).]

Figure 4.29: P1 x-SLOR/ADI residual vs steps.

[Plot: residual vs steps; series: x-SLOR (omega = 0.9) and ADI (wx = 0.7, wy = 0.8, wz = 0.9).]

Figure 4.30: P2 x-SLOR/ADI residual vs steps.

58

[Plot: residual vs steps; series: x-SLOR (omega = 0.9) and ADI (wx = 0.7, wy = 0.8, wz = 0.9).]

Figure 4.31: P3 x-SLOR/ADI residual vs steps.

From P1 to P3, the effect of ADI becomes more significant. For the P3 case, ADI reduces the number of iteration steps by nearly 70%. The execution time and its scalability are shown in Fig 4.32 to Fig 4.35.


Figure 4.32: ADI execution time (ms).



Figure 4.33: ADI execution time per equation (ms).


Figure 4.34: Log scaled ADI execution time.



Figure 4.35: Log scaled ADI execution time per equation.

The relative error bound and the residual norm of ADI are shown in Fig 4.36 and Fig 4.37, respectively. ADI achieves a tighter error upper bound and a smaller residual norm than x-SLOR.


Figure 4.36: Log scaled ADI relative error bound.



Figure 4.37: ADI residual norm.

Comparison

The execution times for the P1, P2, and P3 cases are summarized in Fig 4.38 to Fig 4.40 ("0" represents the original ordering and "1" the fill-reducing ordering). For this test case, the direct method performs better. The ADI method is slower than the SLOR method for the P1 and P2 cases because each ADI step is more time consuming. However, ADI outperforms SLOR on the P3 case, which indicates that ADI scales better than SLOR.

The scalability of each method is compared in the log-scaled Fig 4.41. The ADI method shows the best scalability, and it is expected to perform even better when viscosity is involved, because the coupling in the y and z directions can then also be strong.



Figure 4.38: P1 execution time (ms)


Figure 4.39: P2 execution time (ms).


Figure 4.40: P3 execution time (ms).



Figure 4.41: Log scaled direct (left) /ADI (middle) /SLOR (right) execution time.

GPU Application

A GPU solver using both the direct and ADI methods is also developed for this thesis. A notable feature of the GPU is that it has its own independent memory space; the GPU communicates with the CPU through the PCIe bus, which is roughly 10 to 20 times slower than RAM.

With that in mind, the GPU solvers are developed with minimal data movement between the GPU and the CPU. They also take advantage of batched library routines, as the CPU solver does.
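As one illustration of the kind of batched routine involved (a sketch only, not the thesis GPU solver), the snippet below performs many small block products in a single cublasDgemmBatched call; the block data is assumed to already reside on the GPU so that only pointer arrays are assembled on the host. The function name and argument layout here are illustrative.

#include <cublas_v2.h>
#include <cuda_runtime.h>

/* Apply 'count' block updates C[i] <- C[i] - A[i] * B[i] on nb x nb        */
/* column-major blocks already stored on the device. dA, dB, dC are host    */
/* arrays of device pointers, one entry per block.                          */
void batched_block_update(cublasHandle_t handle, int nb, int count,
                          const double **dA, const double **dB, double **dC)
{
    const double alpha = -1.0, beta = 1.0;
    const double **dA_dev, **dB_dev;
    double **dC_dev;

    /* The batched routine expects the pointer arrays themselves on the device. */
    cudaMalloc((void **)&dA_dev, count * sizeof(double *));
    cudaMalloc((void **)&dB_dev, count * sizeof(double *));
    cudaMalloc((void **)&dC_dev, count * sizeof(double *));
    cudaMemcpy(dA_dev, dA, count * sizeof(double *), cudaMemcpyHostToDevice);
    cudaMemcpy(dB_dev, dB, count * sizeof(double *), cudaMemcpyHostToDevice);
    cudaMemcpy(dC_dev, dC, count * sizeof(double *), cudaMemcpyHostToDevice);

    cublasDgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                       nb, nb, nb, &alpha,
                       dA_dev, nb, dB_dev, nb, &beta,
                       dC_dev, nb, count);

    cudaFree(dA_dev);
    cudaFree(dB_dev);
    cudaFree(dC_dev);
}

Batching the block operations this way keeps the small gemm calls on the device and avoids launching one kernel per block, which is the main reason for preferring the batched routines over individual calls.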

However, high double-precision performance can only be achieved on GPUs dedicated to scientific computing, which were not available for this work. The GPU solver is validated with the P1 case on a GeForce 940MX graphics card, and the test results are shown below.

Direct method with fill-reducing ordering: 11609 ms.

ADI method: 753056 ms.


Chapter 5

Conclusion and Future Work

A CPU-based and a CPU/GPU-based solver have been developed for the sparse block linear systems arising from the DG method. Because the DG method produces singular off-diagonal blocks, a new fill-reducing ordering approach is developed for the direct solver that keeps the diagonal blocks on the diagonal.

Line-based iterative solvers (SLOR and ADI) for sparse block linear systems are also developed on CPU and CPU/GPU platforms. This is the first known application of SLOR or ADI to linear systems arising from the DG method.

From the test results, it is reasonable to choose the direct solver for such small problems. For large problems, especially when viscosity is involved, the ADI solver would be the better choice due to its lower memory usage (the direct method uses roughly twice the memory needed to store the original matrix) and its scalability. Whether a problem counts as large or small depends on the hardware capability.

For future work, the line-based iterative methods can be extended to create connections across grid boundaries when multiple grids are used. The line-based methods can also be used as preconditioners for Krylov subspace methods [50] [51] [52] [53]. In addition, further investigation of the choice of relaxation factor would be beneficial.


References

[1] T. A. Davis, S. Rajamanickam and W. M. Sid-Lakhdar, "A Survey of Direct Methods for Sparse Linear Systems," Acta Numerica, vol. 25, pp. 383-566, 2016.

[2] W. Reed and T. Hill, "Triangular Mesh Methods for the Neutron Transport Equation," in National Topical Meeting on Mathematical Models and Computational Techniques for Analysis of Nuclear Systems, Ann Arbor, Michigan, USA, 1973.

[3] B. Cockburn, G. Karniadakis and C. Shu, "The Development of Discontinuous Galerkin Methods," in Discontinuous Galerkin Methods: Theory, Computation and Application, Berlin, Heidelberg: Springer Publishing Company, 1999.

[4] C. Shu, "Discontinuous Galerkin Method for Time-Dependent Problems: Survey and Recent Developments," in Recent Developments in Discontinuous Galerkin Finite Element Methods for Partial Differential Equations, Cham: Springer, 2014.

[5] E. Toro, Riemann Solvers and Numerical Methods for Fluid Dynamics: A Practical Introduction, 2nd ed., Springer-Verlag, 1999.

[6] C. Klaij, J. van der Vegt and H. van der Ven, "Space-time Discontinuous Galerkin Method for the Compressible Navier-Stokes Equations," Journal of Computational Physics, vol. 217, no. 2, pp. 589-611, 2006.

[7] M. Galbraith, "A Discontinuous Galerkin Chimera Overset Solver," PhD Thesis, University of Cincinnati, Cincinnati, 2013.

[8] A. Sherman, "On the Efficient Solution of Sparse Systems of Linear and Nonlinear Equations," Yale University, New Haven, 1975.

[9] D. Rose, "A Graph-Theoretic Study of the Numerical Solution of Sparse Positive Definite Systems of Linear Equations," in Graph Theory and Computing, Cambridge, MA: Academic Press, 1972, pp. 183-217.

[10] D. Rose, R. Tarjan and G. Lueker, "Algorithmic Aspects of Vertex Elimination on Graphs," SIAM J. Comput., vol. 5, no. 2, pp. 266-283, 1976.

[11] R. Schreiber, "A New Implementation of Sparse Gaussian Elimination," ACM Transactions on Mathematical Software (TOMS), vol. 8, no. 3, pp. 256-276, 1982.

[12] J. Liu, "A Compact Row Storage Scheme for Cholesky Factors Using Elimination Trees," ACM Transactions on Mathematical Software (TOMS), vol. 12, no. 2, pp. 127-148, 1986.

[13] J. Liu, "The Role of Elimination Trees in Sparse Factorization," SIAM J. Matrix Anal. & Appl., vol. 11, no. 1, pp. 134-172, 1990.

[14] D. Rose and R. Tarjan, "Algorithmic Aspects of Vertex Elimination on Directed Graphs," Stanford University, 1975.

[15] A. George and E. Ng, "An Implementation of Gaussian Elimination with Partial Pivoting for Sparse Systems," SIAM J. Sci. and Stat. Comput., vol. 6, no. 2, pp. 390-409, 1985.

[16] T. Coleman, A. Edenbrandt and J. Gilbert, "Predicting Fill for Sparse Orthogonal Factorization," Journal of the ACM (JACM), vol. 33, no. 3, pp. 517-532, 1986.

[17] J. Gilbert and E. Ng, "Predicting Structure in Nonsymmetric Sparse Matrix Factorizations," in Graph Theory and Sparse Matrix Computation, The IMA Volumes in Mathematics and its Applications, vol. 56, A. George, J. R. Gilbert and J. W. H. Liu, Eds., New York, NY: Springer, 1993.

[18] A. George and J. Liu, Computer Solution of Large Sparse Positive Definite Systems, Prentice Hall Professional Technical Reference, 1981.

[19] M. Yannakakis, "Computing the Minimum Fill-in is NP-Complete," SIAM J. on Algebraic and Discrete Methods, vol. 2, no. 1, pp. 77-79, 1981.

[20] A. George and E. Ng, "Symbolic Factorization for Sparse Gaussian Elimination with Partial Pivoting," SIAM Journal on Scientific and Statistical Computing, vol. 8, no. 6, pp. 877-898, 1987.

[21] J. Liu, "A Generalized Envelope Method for Sparse Factorization by Rows," ACM Transactions on Mathematical Software (TOMS), vol. 17, no. 1, pp. 112-129, 1991.

[22] T. Davis and I. Duff, "An Unsymmetric-Pattern Multifrontal Method for Sparse LU Factorization," SIAM J. Matrix Anal. & Appl., vol. 18, no. 1, pp. 140-158, 1997.

[23] T. Davis and I. Duff, "A Combined Unifrontal/Multifrontal Method for Unsymmetric Sparse Matrices," ACM Transactions on Mathematical Software (TOMS), vol. 25, no. 1, pp. 1-20, 1999.

[24] T. Davis, J. Gilbert, S. Larimore and E. Ng, "A Column Approximate Minimum Degree Ordering Algorithm," ACM Transactions on Mathematical Software (TOMS), vol. 30, no. 3, pp. 353-376, 2004.

[25] I. Duff, "The Impact of High Performance Computing in the Solution of Linear Systems: Trends and Problems," Journal of Computational and Applied Mathematics, vol. 123, no. 1-2, pp. 515-530, 2000.

[26] N. Sato and W. Tinney, "Techniques for Exploiting the Sparsity of the Network Admittance Matrix," IEEE Transactions on Power Apparatus and Systems, vol. 82, pp. 944-950, 1963.

[27] J. Gilbert and T. Peierls, "Sparse Partial Pivoting in Time Proportional to Arithmetic Operations," SIAM J. Sci. and Stat. Comput., vol. 9, no. 5, pp. 862-874, 1988.

[28] S. Mazumder, Numerical Methods for Partial Differential Equations, Cambridge: Academic Press, 2015.

[29] Y. Saad, Iterative Methods for Sparse Linear Systems, 2nd ed., Philadelphia: Society for Industrial and Applied Mathematics, 2003.

[30] D. Peaceman and H. Rachford, Jr., "The Numerical Solution of Parabolic and Elliptic Differential Equations," Journal of the Society for Industrial and Applied Mathematics, vol. 3, no. 1, pp. 28-41, 1955.

[31] H. Atkins and C. Shu, "Analysis of Preconditioning and Relaxation Operators for the Discontinuous Galerkin Method Applied to Diffusion," in 15th AIAA Computational Fluid Dynamics Conference, Fluid Dynamics and Co-located Conferences, Anaheim, CA, 2001.

[32] Intel, "Developer Reference for Intel® Math Kernel Library," Intel, 2018.

[33] Nvidia, "cuBLAS Library User Guide," Nvidia, 2018.

[34] Nvidia, "cuSOLVER Library User Guide," Nvidia, 2018.

[35] T. George, V. Saxena, A. Gupta, A. Singh and A. Choudhury, "Multifrontal Factorization of Sparse SPD Matrices on GPUs," in 2011 IEEE International Parallel & Distributed Processing Symposium, Anchorage, AK, USA, 2011.

[36] C. Yu, W. Wang and D. Pierce, "A CPU-GPU Hybrid Approach for the Unsymmetric Multifrontal Method," Parallel Computing, vol. 37, no. 12, pp. 759-770, 2011.

[37] S. Rennich, D. Stosic and T. Davis, "Accelerating Sparse Cholesky Factorization on GPUs," Parallel Computing, vol. 59, pp. 140-150, 2016.

[38] T. Davis, Direct Methods for Sparse Linear Systems, Philadelphia: SIAM, 2006.

[39] J. Gilbert and T. Peierls, "Sparse Partial Pivoting in Time Proportional to Arithmetic Operations," SIAM J. Sci. and Stat. Comput., vol. 9, no. 5, pp. 862-874, 1988.

[40] J. R. Gilbert, C. Moler and R. Schreiber, "Sparse Matrices in MATLAB: Design and Implementation," SIAM J. Matrix Anal. & Appl., vol. 13, no. 1, pp. 333-356, 1992.

[41] R. Burden and J. Faires, Numerical Analysis, 9th ed., Boston: CENGAGE Learning, 2011.

[42] G. Forsythe and C. Moler, Computer Solution of Linear Algebraic Systems, Upper Saddle River: Prentice-Hall, 1967.

[43] G. Golub and C. Van Loan, Matrix Computations, Baltimore: Johns Hopkins University Press, 2013.

[44] J. Wilkinson, The Algebraic Eigenvalue Problem, Oxford: Clarendon Press, 1965.

[45] W. Kahan, "Gauss-Seidel Methods of Solving Large Systems of Linear Equations," PhD Thesis, University of Toronto, Toronto, 1958.

[46] J. Ortega, Numerical Analysis: A Second Course, Philadelphia: SIAM, 1990.

[47] E. Wachspress, The ADI Model Problem, New York: Springer-Verlag New York, 2013.

[48] G. Avdelas and A. Hadjidimos, "Jordan-Wachspress Parameters in Three Dimensions," Linear Algebra and its Applications, vol. 24, pp. 251-261, 1979.

[49] N. Wukie, "A Discontinuous Galerkin Method for Turbomachinery and Acoustic Applications," University of Cincinnati, Cincinnati, 2018.

[50] S. Ma and Y. Saad, "Block-ADI Preconditioners for Solving Sparse Non-Symmetric Linear Systems of Equations," University of Minnesota, Minneapolis, 1995.

[51] M. Gasteiger, L. Einkemmer, A. Ostermann and D. Tskhakaya, "ADI Type Preconditioners for the Steady State Inhomogeneous Vlasov Equation," Journal of Plasma Physics, vol. 83, no. 1, 2017.

[52] D. J. Mavriplis and B. R. Ahrabi, "Scalable Solution Strategies for Stabilized Finite-Element Flow Solvers on Unstructured Meshes," in 55th AIAA Aerospace Sciences Meeting, Grapevine, 2017.

[53] B. R. Ahrabi and D. J. Mavriplis, "Scalable Solution Strategies for Stabilized Finite-Element Flow Solvers on Unstructured Meshes, Part II," in 23rd AIAA Computational Fluid Dynamics Conference, Denver, 2017.