
Chapter F11 Sparse Linear Algebra

Contents

1  Scope of the Chapter

2  Background to the Problems
   2.1  Iterative Methods
   2.2  Iterative Methods for Real Nonsymmetric and Complex Non-Hermitian Linear Systems
   2.3  Iterative Methods for Real Symmetric Linear Systems
   2.4  Sparse Matrices and Their Storage
        2.4.1  Coordinate storage (CS) format
        2.4.2  Sparse matrices with symmetric sparsity pattern
   2.5  Data Distribution
        2.5.1  Sparse matrices in Chapter F11
        2.5.2  Cyclic row block distribution of sparse matrices
        2.5.3  Dense vectors in Chapter F11
        2.5.4  Conformally distributed dense vectors
   2.6  Local Matrices and Vectors
        2.6.1  Local matrices
        2.6.2  Storage order of elements in conformally distributed vectors
   2.7  Preconditioners
        2.7.1  Domain decomposition based preconditioners
        2.7.2  Matrix splitting-based preconditioners
   2.8  Parallel Sparse Matrix Operations
        2.8.1  Basic solvers (reverse-communication)
        2.8.2  Matrix-vector multiplication
        2.8.3  Block-Jacobi preconditioning
   2.9  References

3  Recommendations on Choice and Use of Available Routines
   3.1  Types of Routines Available
   3.2  Choosing an Iterative Method and a Preconditioner
        3.2.1  Real nonsymmetric or complex non-Hermitian linear systems
        3.2.2  Real symmetric linear systems
   3.3  Auxiliary Information for Parallel Sparse Matrix Computations
        3.3.1  Contents of IAINFO
        3.3.2  Sequence of accesses to IAINFO
        3.3.3  Required size of IAINFO

4  Routines Withdrawn or Scheduled for Withdrawal

5  F11 Chapter Quick Reference Table


1 Scope of the Chapter

This chapter provides routines for the solution of large sparse systems of simultaneous linear equations, particularly those arising from the discretisation of PDEs, using iterative methods for real symmetric, real nonsymmetric and complex non-Hermitian linear systems.

2 Background to the Problems

This section is only a brief introduction to the solution of sparse linear systems. For a more detailed discussion see, for example, Axelsson [2], Barrett et al. [3], Bruaset [4], or Saad [10].

2.1 Iterative Methods

Iterative methods for the solution of the linear algebraic system

    Ax = b                                                    (1)

approach the solution x through a sequence of approximations, until some user-specified termination criterion is met or until some predefined maximum number of iterations has been carried out. The number of iterations required for convergence is not generally known in advance, as it depends on the accuracy required, and on the properties of the matrix A — its sparsity pattern, conditioning and eigenvalue spectrum. Faster convergence can often be achieved using a preconditioner (Golub and van Loan [6], Barrett et al. [3]). A preconditioner maps the original system of equations onto a different system

    Ā x̄ = b̄,                                                  (2)

which hopefully exhibits better convergence characteristics: for example, the condition number of the matrix Ā may be better than that of A, or it may have eigenvalues of greater multiplicity. An unsuitable preconditioner or no preconditioning at all may result in a very slow rate or lack of convergence. Generally, preconditioning involves a trade-off between the reduction in the number of iterations required for convergence and the additional computational costs per iteration. Also, setting up a preconditioner may involve non-negligible overheads. The application of preconditioners to real nonsymmetric, complex non-Hermitian and real symmetric systems of equations is given further consideration in Section 2.2 and Section 2.3.

2.2 Iterative Methods for Real Nonsymmetric and Complex Non-Hermitian Linear Systems

Many of the most effective iterative methods for the solution of (1) lie in the class of non-stationary Krylov subspace methods (Barrett et al. [3]). For real nonsymmetric and complex non-Hermitian matrices this class includes: the restarted generalized minimum residual method (RGMRES) (Saad and Schultz [11]); the conjugate gradient squared method (CGS) (Sonneveld [13]); and the polynomial stabilized bi-conjugate gradient method of order ℓ (Bi-CGSTAB(ℓ)) (van der Vorst [14], Sleijpen and Fokkema [12]). Here we give only a brief overview of these methods as implemented in Chapter F11. For full details see Section 1 of the documents for F11BAFP and F11BRFP.

RGMRES is based on the Arnoldi method, which explicitly generates an orthogonal basis for the Krylov subspace span{A^k r_0}, k = 0, 1, 2, ..., where r_0 is the initial residual. The solution is then expanded onto the orthogonal basis so as to minimize the residual norm. For real nonsymmetric and complex non-Hermitian matrices the generation of the basis requires a 'long' recurrence relation, resulting in prohibitive computational and storage costs. RGMRES limits these costs by restarting the Arnoldi process from the latest available residual every m iterations. The value of m is chosen in advance and is fixed throughout the computation. Unfortunately, an optimum value of m cannot easily be predicted.


CGS is a development of the bi-conjugate gradient method where the nonsymmetric Lanczos method is applied to reduce the coefficient matrix to tridiagonal form: two bi-orthogonal sequences of vectors are generated starting from the initial residual r_0 and from the shadow residual r̂_0 corresponding to the arbitrary problem A^H x̂ = b̂, where b̂ is chosen so that r_0 = r̂_0. In the course of the iteration, the residual and shadow residual r_i = P_i(A) r_0 and r̂_i = P_i(A^H) r̂_0 are generated, where P_i is a polynomial of order i, and bi-orthogonality is exploited by computing the vector product ρ_i = (r̂_i, r_i) = (P_i(A^H) r̂_0, P_i(A) r_0) = (r̂_0, P_i^2(A) r_0). Applying the 'contraction' operator P_i(A) twice, the iteration coefficients can still be recovered without advancing the solution of the shadow problem, which is of no interest. The CGS method often provides fast convergence; however, there is no reason why the contraction operator should also reduce the once reduced vector P_i(A) r_0: this can lead to a highly irregular convergence.

Bi-CGSTAB(ℓ) is similar to the CGS method. However, instead of generating the sequence {P_i^2(A) r_0}, it generates the sequence {Q_i(A) P_i(A) r_0}, where the Q_i(A) are polynomials chosen to minimize the residual after the application of the contraction operator P_i(A). Two main steps can be identified for each iteration: an OR (Orthogonal Residuals) step where a basis of order ℓ is generated by a Bi-CG iteration, and an MR (Minimum Residuals) step where the residual is minimized over the basis generated, by a method akin to GMRES (Saad and Schultz [11]). For ℓ = 1, the method corresponds to the Bi-CGSTAB method of van der Vorst [14]. For ℓ > 1, more information about complex eigenvalues of the iteration matrix can be taken into account, and this may lead to improved convergence and robustness. However, as ℓ increases, numerical instabilities may arise (Sleijpen and Fokkema [12]).

Faster convergence can usually be achieved by using a preconditioner. A left preconditioner M^{-1} can be used by the RGMRES and CGS methods, such that Ā = M^{-1} A ≈ I_n in (2), where I_n is the identity matrix of order n; a right preconditioner M^{-1} can be used by the Bi-CGSTAB(ℓ) method, such that Ā = A M^{-1} ≈ I_n. These are formal definitions, used only in the design of the algorithms; in practice, only the means to compute the matrix–vector products v = Au and v = A^H u (the latter only being required when an estimate of ||A||_1 or ||A||_∞ is computed internally), and to solve the preconditioning equations Mv = u, are required; that is, explicit information about M, or its inverse, is not required at any stage.

2.3 Iterative Methods for Real Symmetric Linear Systems

Two of the best known iterative methods applicable to real symmetric linear systems are the conjugate gradient method (CG) (Hestenes and Stiefel [7], Golub and van Loan [6]) and a Lanczos type method based on SYMMLQ (Paige and Saunders [9]). For the CG method the matrix A should ideally be positive-definite. The application of CG to indefinite matrices may lead to failure, or to lack of convergence. The SYMMLQ method is suitable for both positive-definite and indefinite symmetric matrices. It is more robust than CG, but less efficient when A is positive-definite.

Both methods start from the residual r_0 = b − A x_0, where x_0 is an initial estimate for the solution (often x_0 = 0), and generate an orthogonal basis for the Krylov subspace span{A^k r_0}, for k = 0, 1, ..., by means of three-term recurrence relations (Golub and van Loan [6]). A sequence of symmetric tridiagonal matrices {T_k} is also generated. Here and in the following, the index k denotes the iteration count. The resulting symmetric tridiagonal systems of equations are usually more easily solved than the original problem. A sequence of solution iterates {x_k} is thus generated such that the sequence of the norms of the residuals {r_k} converges to a required tolerance. Note that, in general, the convergence is not monotonic.

In exact arithmetic, after n iterations, this process is equivalent to an orthogonal reduction of A to symmetric tridiagonal form, T_n = Q^T A Q; the solution x_n would thus achieve exact convergence. In finite-precision arithmetic, cancellation and round-off errors accumulate causing loss of orthogonality. These methods must therefore be viewed as genuinely iterative methods, able to converge to a solution within a prescribed tolerance. The orthogonal basis is not formed explicitly in either method. The basic difference between the two methods lies in the method of solution of the resulting symmetric tridiagonal systems of equations: the CG method is equivalent to carrying out an LDL^T (Cholesky) factorization whereas the Lanczos method (SYMMLQ) uses an LQ factorization.

A preconditioner for these methods must be symmetric and positive-definite, i.e., representable by M = E E^T, where M is non-singular, and such that Ā = E^{-1} A E^{-T} ≈ I_n in (2), where I_n is the identity matrix of order n. These are formal definitions, used only in the design of the algorithms; in practice, only the means to compute the matrix-vector products v = Au and to solve the preconditioning equations Mv = u are required.

2.4 Sparse Matrices and Their Storage

A matrix A may be described as sparse if the number of zero elements is sufficiently large that it is worthwhile using algorithms which avoid computations involving zero elements. If A is sparse, and the chosen method requires the matrix coefficients to be stored, a significant saving in storage can often be made by storing only non-zero elements. A number of different formats may be used to represent sparse matrices economically. These differ according to the amount of storage required, the amount of indirect addressing required for fundamental operations such as matrix–vector products, and their suitability for vector and/or parallel architectures. For a survey of some of these storage formats see Barrett et al. [3].

Some of the routines in this chapter have been designed to be independent of the matrix storage format. This allows the user to choose his or her own preferred format, or to avoid storing the matrix altogether. Other routines are black boxes, which are easier to use, but are based on fixed storage formats. One such fixed format is currently catered for, known as coordinate storage (CS) format. A number of utility routines which use CS format are also provided. In this release there is no special storage format particularly tailored to symmetric matrices, such as, for example, symmetric coordinate storage (SCS).

2.4.1 Coordinate storage (CS) format

This storage format represents a sparse nonsymmetric matrix A, with NNZ non-zero elements, in terms of a real or complex array A and two integer arrays IROW and ICOL. These arrays are all of rank 1 and of dimension at least NNZ. A contains the non-zero elements themselves, while IROW and ICOL store the corresponding row and column indices respectively. For example, the matrix

        (  1   2  -1  -1  -3 )
        (  0  -1   0   0  -4 )
    A = (  3   0   0   0   2 )
        (  2   0   4   1   1 )
        ( -2   0   0   0   1 )

might be represented in the arrays A, IROW and ICOL as

    A    = ( 1, 2, -1, -1, -3, -1, -4, 3, 2, 2, 4, 1, 1, -2, 1)
    IROW = ( 1, 1,  1,  1,  1,  2,  2, 3, 3, 4, 4, 4, 4,  5, 5)
    ICOL = ( 1, 2,  3,  4,  5,  2,  5, 1, 5, 1, 3, 4, 5,  1, 5)

Notes

(i) The general format specifies no ordering of the array elements, but some routines may impose a specific ordering. For example, the non-zero elements may be required to be ordered by increasing row index and by increasing column index within each row, as in the example above. However, for reasons of efficiency a specific ordering is required by most routines in this chapter and a utility routine is provided to order the elements appropriately.

(ii) With this storage format it is possible to enter duplicate elements. These may be interpreted in various ways (raising an error, ignoring all but the first entry, all but the last, or summing, for example).
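As a concrete illustration of how the CS arrays are used, the following sketch computes the matrix-vector product y = Ax directly from the three arrays. It is not a Library routine: the name CSMATV and all argument names are assumptions made for this example only, and no particular ordering of the entries is assumed.

*     Illustrative sketch (not a Library routine): computes y = A*x for a
*     square matrix of order N held in coordinate storage (CS) format.
      SUBROUTINE CSMATV(N, NNZ, A, IROW, ICOL, X, Y)
      INTEGER N, NNZ
      INTEGER IROW(NNZ), ICOL(NNZ)
      DOUBLE PRECISION A(NNZ), X(N), Y(N)
      INTEGER I, K
*     Clear the result vector
      DO 10 I = 1, N
         Y(I) = 0.0D0
   10 CONTINUE
*     Accumulate the contribution of each stored non-zero element
      DO 20 K = 1, NNZ
         Y(IROW(K)) = Y(IROW(K)) + A(K)*X(ICOL(K))
   20 CONTINUE
      RETURN
      END

Note that duplicate entries, if present, are simply summed by this loop, corresponding to one of the interpretations mentioned in Note (ii) above.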

2.4.2 Sparse matrices with symmetric sparsity pattern

Although, in this release, no special storage formats particularly tailored to symmetric matrices are employed, some routines can take advantage of matrices A which are nonsymmetric or non-Hermitian, but have a symmetric sparsity pattern. This means that the non-zero structure of A is symmetric:

    a_ij ≠ 0   if and only if   a_ji ≠ 0,   for i, j = 1, ..., n,

where n is the order of the matrix A.
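A simple (and deliberately unoptimised) way to test whether a matrix held in CS format has a symmetric sparsity pattern is to look for a matching (j,i) entry for every off-diagonal (i,j) entry. The following sketch is not a Library routine; the name SYMPAT, the argument names and the O(NNZ**2) search are assumptions made for this example only.

*     Illustrative sketch (not a Library routine): returns .TRUE. if every
*     off-diagonal non-zero (i,j) in the CS arrays has a stored (j,i) entry.
      LOGICAL FUNCTION SYMPAT(NNZ, IROW, ICOL)
      INTEGER NNZ
      INTEGER IROW(NNZ), ICOL(NNZ)
      INTEGER K, L
      LOGICAL FOUND
      SYMPAT = .TRUE.
      DO 20 K = 1, NNZ
         IF (IROW(K).NE.ICOL(K)) THEN
            FOUND = .FALSE.
*           Search for the transposed entry (ICOL(K),IROW(K))
            DO 10 L = 1, NNZ
               IF (IROW(L).EQ.ICOL(K) .AND. ICOL(L).EQ.IROW(K))
     *            FOUND = .TRUE.
   10       CONTINUE
            IF (.NOT.FOUND) THEN
               SYMPAT = .FALSE.
               RETURN
            END IF
         END IF
   20 CONTINUE
      RETURN
      END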

2.5 Data Distribution

2.5.1 Sparse matrices in Chapter F11

The basic solvers are 'matrix-free', i.e., no details of the matrix, its storage format or distribution need to be provided, either directly or indirectly, by the calling program. This is achieved by employing reverse communication (see Section 2.8.1 and Section 3.1). This scheme allows the precise tailoring of a matrix representation to suit a given problem. This chapter also provides a range of Black Box routines and utility routines (preprocessing of sparse matrices, computation of matrix-vector products, generation and application of preconditioners, etc.): they all employ the CS format for the representation of sparse matrices.

2.5.2 Cyclic row block distribution of sparse matrices

In the cyclic row block distribution, blocks of M_b contiguous rows of a matrix A are distributed cyclically on the logical processors of the Library Grid, starting from the {0,0} logical processor: processors are assumed ordered row by row, i.e., following the row major ordering of the Library Grid. In other words, the first block will be stored on the {0,0} logical processor, the second block on the {0,1} processor, and so on until a row block has been assigned to all processors; then the distribution will be repeated for the remaining row blocks.

[Figure 1 shows the cyclic assignment of the row blocks: B1 → {0,0}, B2 → {0,1}, B3 → {0,2}, B4 → {1,0}, B5 → {1,1}, B6 → {1,2}, B7 → {0,0}, B8 → {0,1}, B9 → {0,2}, B10 → {1,0}, B11 → {1,1}, B12 → {1,2}, B13 → {0,0}, B14 → {0,1}, B15 → {0,2}.]

Figure 1 Cyclic row block distribution over a 2 by 3 logical grid of processors

[Figure 2 shows the blocks held by each processor: {0,0} holds B1, B7, B13; {0,1} holds B2, B8, B14; {0,2} holds B3, B9, B15; {1,0} holds B4, B10; {1,1} holds B5, B11; {1,2} holds B6, B12.]

Figure 2 Data distribution from the processors’ point of view

Figure 1 illustrates the cyclic row block distribution of a matrix split into fifteen row blocks, marked B1 to B15, over a 2 by 3 Library Grid of logical processors. Each shading denotes the six row blocks assigned in each round of the cyclic distribution.


Figure 2 shows the row blocks stored on each logical processor of the 2 by 3 Library Grid. Shading and markings are the same as in Figure 1. The number of row blocks stored in each processor may vary: in the case depicted, the processors in the first and second rows of the Library Grid store three and two row blocks each, respectively.
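The owner of a given row block under this distribution can be computed directly. The following sketch is not a Library routine; the name BLKOWN and the argument names are assumptions made for this example only. It returns the grid coordinates {IPROW, IPCOL} of the processor storing row block IB of a matrix distributed over an NPROW by NPCOL Library Grid.

*     Illustrative sketch (not a Library routine): grid coordinates of the
*     processor that stores row block IB (1-based) under the cyclic,
*     row-major assignment described above.
      SUBROUTINE BLKOWN(IB, NPROW, NPCOL, IPROW, IPCOL)
      INTEGER IB, NPROW, NPCOL, IPROW, IPCOL
      INTEGER IDX
*     Position of the block within one round of the cyclic distribution
      IDX = MOD(IB-1, NPROW*NPCOL)
*     Row-major ordering of the logical processors
      IPROW = IDX/NPCOL
      IPCOL = MOD(IDX, NPCOL)
      RETURN
      END

For the 2 by 3 grid of Figure 1, this gives block 4 to {1,0} and block 7 back to {0,0}, in agreement with the figure.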

2.5.3 Dense vectors in Chapter F11

In the basic, 'matrix-free' solver routines, the vectors can be distributed in either of two different ways.

In the first way, vectors are distributed across all processors in the Library Grid, with different processors holding different parts of the vector. All vectors are distributed in the same way. This distribution must be chosen if the solver routines are to be used in connection with any Chapter F11 routine other than the corresponding set-up or diagnostic routines.

In the second way, on initial entry to the solver routines, input vectors may be distributed along the first column or row of the logical grid of processors. Each processor in the first column or row then broadcasts the elements stored locally to all the other processors in the same row or column, respectively. Thereafter, all vectors in the solve routine, input, output and internal vectors alike, are distributed identically along all the columns or the rows of the logical grid of processors. Hence, if vectors are distributed by column or row, all processors in the same row or column, respectively, of the grid will hold copies of the same vector elements.

In either distribution scheme, details of the distribution pattern are not required by any of the routines in the suites, only the number of vector elements stored in the processors.

2.5.4 Conformally distributed dense vectors

Whenever a sparse matrix A is represented explicitly in coordinate storage format, any vector x used in conjunction with A must be distributed conformally to A. This means that the elements of the vector x are aligned with the rows of A, i.e., the element x(i) and the row A(i, 1:n), for i = 1, ..., n, are assigned to the same logical processor and must be sorted in the same order. This kind of distribution can also be obtained by interpreting x as a column vector and distributing x across the Library Grid identically to each of the columns A(1:n, j), j = 1, ..., n.

2.6 Local Matrices and Vectors

2.6.1 Local matrices

In order to allow an efficient implementation of sparse matrix operations, a user-specified distributed sparse matrix A, represented in coordinate storage format, undergoes a preprocessing phase. This includes:

(a) dealing with entries with duplicate row and column indices: these may be removed, or their values may be summed;
(b) dealing with entries with zero values: these may optionally be removed;
(c) transforming the user-supplied index values in IROW and ICOL;
(d) reordering the non-zero elements and identifying specific parts of the matrix A.

Each sparse matrix A, represented in distributed coordinate storage format, needs to be appropriately preprocessed before it is used in any other library routine. The following terms and notational conventions are used in this context:

Rows assigned to a logical processor

Let I denote the index set of the rows assigned to a logical processor. Then m_l is used to denote the number of elements in I, i.e., the number of rows assigned to a logical processor.

Internal and interface coefficients and indices

A local non-zero element a_ij is said to be an internal coefficient if j ∈ I. Otherwise, a_ij is said to be an interface coefficient, and the indices i and j are referred to as internal and external interface indices, respectively. The numbers of internal and external interface indices on a logical processor are denoted n_int^i and n_int^e, respectively.


Global and local indexing

Before preprocessing, the row and column indices of the elements of a distributed matrix and the indices of a distributed vector are global indices, defined by the mathematical objects. Preprocessing transforms the global indices in a way that depends on, and is tailored to, the data distribution and the sparsity pattern of A, aiming to achieve maximum efficiency in the numerical computations. The transformation used on each processor is based on a local permutation of the index set {1, 2, ..., n}, i.e., different logical processors employ different permutations. The resulting row and column indices are referred to as local indices. Note that a global index is usually mapped to different local index values on different logical processors. After the preprocessing stage, the values in the local arrays IROW and ICOL satisfy, on each processor, the constraints

    1 ≤ IROW(i) ≤ m_l,              i = 1, ..., NNZ,   and
    1 ≤ ICOL(i) ≤ m_l + n_int^e,    i = 1, ..., NNZ,

where NNZ denotes the number of local non-zero elements. Internal global indices are mapped onto local indices between 1 and m_l, whereas external interface indices are mapped onto local indices between m_l + 1 and m_l + n_int^e.

Local Diagonal Blocks

The symbol n_LB is used to denote the number of row blocks assigned to a logical processor. Let I_k, for k = 1, 2, ..., n_LB, be the set of row indices associated with each block. Then the local diagonal blocks A_k, for k = 1, 2, ..., n_LB, are defined by

    A_k = (a_ij),   i, j ∈ I_k.

These relationships are illustrated in Figure 3, where a sixteen by sixteen sparse matrix is distributed on a 2 by 1 Library Grid using a block size M_b = 8. The internal and external non-zero elements of A are denoted by solid and empty circles, respectively. The local diagonal blocks, labelled A_1^(0) and A_2^(0) on processors {0,0} and {1,0}, respectively, are shaded.

It can easily be seen that, on the second processor, for i = 9, 10, 11, 12, the non-zero elements a_{i,i−4} ≠ 0 are interface coefficients, hence {9, 10, 11, 12} are the internal interface indices and {5, 6, 7, 8} are the external interface indices. Consequently, n_int^i = 4 and n_int^e = 4 on the second processor. The local row indices of the non-zero elements on the second logical processor will therefore be numbered from 1 to 8 and the local column indices from 1 to 12: the global indices {9, ..., 16} will be mapped onto the local indices {1, ..., 8} and the global indices {5, 6, 7, 8} will be mapped onto the local indices {9, 10, 11, 12} on this processor.

[Figure 3 shows the 16 by 16 matrix on a 2 by 1 Library Grid with M_b = 8: the shaded local diagonal blocks A_1^(0) (rows 1 to 8, processor {0,0}) and A_2^(0) (rows 9 to 16, processor {1,0}), with internal and interface non-zero elements marked by solid and empty circles.]

Figure 3 Local Diagonal Blocks and Interface Coefficients


2.6.2 Storage order of elements in conformally distributed vectors

The elements of any vector x distributed conformally to a sparse matrix A must be stored in a local array X of size at least m_l in ascending order according to their local indices. This storage order is henceforth referred to as the matrix-based order. A number of routines in Chapter F01 can be used to generate or distribute vectors and store their elements in the matrix-based order.

If the elements of a conformally distributed vector are stored locally in a different order, they must be permuted into the matrix-based order in a preprocessing phase. A number of routines in Chapter F01 can be used to permute given vector elements stored in distribution-based order into matrix-based order and vice versa. The distribution-based order of vector elements is defined as follows: vector elements are ordered first according to the row block to which they are assigned; vector elements in the same row block are then ordered according to increasing global index. When block cyclic data distribution is used or when only a single row block is assigned to a given processor, vector elements given in distribution-based order are stored in ascending order according to their global indices. On the other hand, the global indices of vector elements given in distribution-based order may not be ascending if several row blocks are assigned to a given processor.

2.7 Preconditioners

Two types of preconditioners are available at this release:

Domain decomposition-based: block Jacobi preconditioner;

Matrix splitting-based: successive over-relaxation (SOR) preconditioners, only for matrices with a symmetric sparsity pattern; relaxed Jacobi preconditioner.

These preconditioners are described briefly in the subsections below.

2.7.1 Domain decomposition based preconditioners

In this release block Jacobi (non-overlapping additive Schwarz) preconditioners are obtained by (approximately) inverting the local diagonal blocks A_k^(0), for k = 1, 2, ..., n_LB, on each processor (see also Figure 3). The approximate inverse of each local diagonal block A_k^(0) is calculated using an incomplete LU factorization. This means that A_k is decomposed in the form

    A_k = M_k + R_k

where

    M_k = P_k L_k D_k U_k Q_k

and L_k, D_k and U_k are lower triangular with unit diagonal elements, diagonal, and upper triangular with unit diagonal elements, respectively; P_k and Q_k are permutation matrices, and R_k is a remainder matrix. The preconditioning matrix M is distributed identically to the coefficient matrix A on the logical grid of processors. Therefore, the local diagonal blocks of M are given by M_k, k = 1, 2, ..., n_LB.

The quality of block Jacobi preconditioners depends largely on the coupling between unknowns associated with different local diagonal blocks and on the accuracy of the incomplete LU factorizations. The relative number and sizes of the non-zero elements in the off-diagonal parts and within the diagonal blocks of A can give an indication of the coupling strength. When the coupling is strong, then, generally, block Jacobi preconditioners are of poor quality. For weak couplings, block Jacobi preconditioners can be reasonably effective, provided the incomplete LU factorizations yield approximate inverses of sufficient accuracy. It is possible to improve a block Jacobi preconditioner by increasing the size of the diagonal blocks. Inevitably, this leads to a decrease in the number of diagonal blocks and, consequently, to a reduction in the degree of parallelism in generating and applying the preconditioner.


2.7.2 Matrix splitting-based preconditioners

This release includes routines for relaxed Jacobi and for forward, backward and symmetric successive over-relaxation (SOR) iterative methods. SOR methods can only be applied to matrices with symmetric sparsity patterns.

The relaxed Jacobi method splits the matrix A into the component matrices D, L and U, which are diagonal, strictly lower triangular and strictly upper triangular, respectively, so that A = L + D + U. Applied to a linear system Ax = b, and starting from an initial guess for the solution, x_0, the iteration proceeds using the recurrence

    D x_{k+1} = [(1 − ω)D − ω(L + U)] x_k + ωb,

where k is the iteration number and 0 < ω < 2 is the relaxation parameter. SOR iterative methods are defined by the recurrence formulae

    (D + ωL) x_{k+1} = [(1 − ω)D − ωU] x_k + ωb,

for the forward SOR method

    (D + ωU) x_{k+1} = [(1 − ω)D − ωL] x_k + ωb,

for the backward SOR method, and

    (D + ωU) x_{k+1} = [(1 − ω)D − ωL](D + ωL)^{-1} [(1 − ω)D − ωU] x_k + ω(2 − ω) D (D + ωL)^{-1} b

for the symmetric SOR (SSOR) method. D, L, U, k and ω have the same meaning as for the relaxed Jacobi method above.

SOR methods are inherently sequential: multi-colour ordering schemes, for example, can be applied to provide a degree of parallelism, though at the cost of some loss of effectiveness and slower convergence. From the adjacency graph of the sparse matrix A a multi-coloured partition of A into disjoint blocks can be derived. Each block is labelled by a 'colour' and all blocks of the same colour, but of no other, can be operated upon at the same time, in parallel. Generally, between operating on two different 'colours', some communication takes place between the processors. The matrix A may need to be symmetrically permuted for best efficiency: in such a case, the SOR method is applied to the matrix P A P^T rather than A, where P is a permutation matrix.

Although the sequences of iterates {x_k} generated by the relaxed Jacobi or SOR methods can be shown to converge to the solution of (1) for certain classes of matrices (a more restricted class for Jacobi methods), these methods are usually inferior to the iterative methods described in Sections 2.2 and 2.3, both in terms of performance and reliability. Hence, Jacobi or SOR methods should not be used as iterative methods in their own right, but rather as preconditioners: in many cases the iteration matrix arising from the application of repeated Jacobi or SOR iterations can be an adequate representation of A^{-1}, sufficiently accurate for effective preconditioning.
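For illustration, the relaxed Jacobi recurrence above is equivalent to x_{k+1} = x_k + ω D^{-1}(b − A x_k), and a single sweep can be written down directly for a matrix in CS format. The following sketch is not a Library routine and works on a single processor only; the name JACSWP, the argument names and the workspace array D (which receives the diagonal of A) are assumptions made for this example only. Every diagonal element of A is assumed to be non-zero.

*     Illustrative sketch (not a Library routine): one relaxed Jacobi sweep
*     XNEW = XOLD + OMEGA*inv(D)*(B - A*XOLD) for a matrix of order N in
*     CS format.
      SUBROUTINE JACSWP(N, NNZ, A, IROW, ICOL, OMEGA, B, XOLD, XNEW, D)
      INTEGER N, NNZ
      INTEGER IROW(NNZ), ICOL(NNZ)
      DOUBLE PRECISION A(NNZ), OMEGA, B(N), XOLD(N), XNEW(N), D(N)
      INTEGER I, K
*     Accumulate the residual b - A*xold in XNEW and the diagonal of A in D
      DO 10 I = 1, N
         XNEW(I) = B(I)
         D(I) = 0.0D0
   10 CONTINUE
      DO 20 K = 1, NNZ
         XNEW(IROW(K)) = XNEW(IROW(K)) - A(K)*XOLD(ICOL(K))
         IF (IROW(K).EQ.ICOL(K)) D(IROW(K)) = D(IROW(K)) + A(K)
   20 CONTINUE
*     Relaxed Jacobi update, with relaxation parameter 0 < OMEGA < 2
      DO 30 I = 1, N
         XNEW(I) = XOLD(I) + OMEGA*XNEW(I)/D(I)
   30 CONTINUE
      RETURN
      END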

2.8 Parallel Sparse Matrix Operations

2.8.1 Basic solvers (reverse-communication)

Most computations within the basic (reverse-communication) solver routines are carried out by parallel Level-1 BLAS. Most Level-1 operations can be carried out independently on each processor of the Library Grid. Reduction operations, such as dot products, involve global summations, requiring communication across the whole Library Grid, only along its columns, or only along its rows, depending on how the vectors are distributed.

2.8.2 Matrix-vector multiplication

Each processor computes disjoint portions of the vector v resulting from the matrix-vector product Au = v: the resulting vector v is distributed conformally to A.

Each processor calculates v_loc = A_loc u, where v_loc and A_loc are the local parts of the vector v and of the matrix A, respectively. If the local matrix portion A_loc is split into the two parts A_loc^(i) and A_loc^(e), containing the internal and interface coefficients, respectively, the matrix-vector product can be rewritten as

    v_loc = A_loc u = A_loc^(i) u_loc + A_loc^(e) u_ext


where u_ext denotes the set of the elements of u required by the local portion of the matrix-vector product but stored elsewhere, on other processors. Thus, the matrix-vector product Au = v consists of a number of steps, to be carried out in sequence by each processor:

Step 1) send the elements of u_loc required by other processors;

Step 2) calculate v_loc = A_loc^(i) u_loc;

Step 3) receive the required elements of u_ext;

Step 4) calculate v_loc = v_loc + A_loc^(e) u_ext.

Step 2) can be performed using only data stored locally, whereas Step 4) involves elements of the vector u stored on other processors. At least in principle, Steps 1), 2) and 3) can be overlapped, i.e., carried out concurrently on each processor. The sparsity pattern and the distribution of A have a major impact on the parallel efficiency of the matrix-vector product. While the number of floating-point operations depends linearly on the number of non-zeros, the amount of interprocessor communication depends on the number of interface non-zero elements each processor needs to send and receive. If the number of non-zero elements in A_loc^(i) is large compared to the number of non-zero elements in A_loc^(e), then there is a high computation-to-communication ratio, leading to satisfactory parallel efficiency. Otherwise, the operation can become communication-bound, leading to poorer parallel performance. Some of the preprocessing routines in this chapter aim to organise and redistribute the data to increase data locality as far as possible, or at least to improve communication characteristics. A similar technique can also be used for the transposed matrix-vector product A^T u = v.
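The following sketch illustrates Steps 2 and 4 for the local part of the product, assuming the local CS entries have already been transformed to local indexing (see Section 2.6.1), so that column indices up to m_l address locally held elements of u and larger column indices address the received elements u_ext. It is not a Library routine; the name LOCMV and all argument names are assumptions made for this example only, and the communication of Steps 1 and 3 is omitted.

*     Illustrative sketch (not a Library routine): local part of v = A*u
*     from CS entries in local indexing.  Column indices 1..ML address
*     ULOC; indices ML+1..ML+NEXT address the received elements UEXT.
      SUBROUTINE LOCMV(ML, NEXT, NNZ, A, IROW, ICOL, ULOC, UEXT, VLOC)
      INTEGER ML, NEXT, NNZ
      INTEGER IROW(NNZ), ICOL(NNZ)
      DOUBLE PRECISION A(NNZ), ULOC(ML), UEXT(NEXT), VLOC(ML)
      INTEGER J, K
      DO 10 J = 1, ML
         VLOC(J) = 0.0D0
   10 CONTINUE
*     Step 2: contributions of the internal coefficients (local data only)
      DO 20 K = 1, NNZ
         J = ICOL(K)
         IF (J.LE.ML) VLOC(IROW(K)) = VLOC(IROW(K)) + A(K)*ULOC(J)
   20 CONTINUE
*     Step 4: contributions of the interface coefficients, using the
*     elements of u received from other processors
      DO 30 K = 1, NNZ
         J = ICOL(K)
         IF (J.GT.ML) VLOC(IROW(K)) = VLOC(IROW(K)) + A(K)*UEXT(J-ML)
   30 CONTINUE
      RETURN
      END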

2.8.3 Block-Jacobi preconditioning

The main advantage of the block Jacobi preconditioner is that the solution of the preconditioning equation Mv = u does not require any inter-processor communication: M is block diagonal and each diagonal block of M is stored, in its entirety, on a single processor. In other words, the preconditioning equation is partitioned into disjoint sets of (smaller) preconditioning equations, a set for each processor, namely M_k v_k = u_k, for k = 1, 2, ..., n_LB.
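The block-diagonal structure of M means that the preconditioning equation can be solved by a purely local loop over the n_LB diagonal blocks. The following sketch shows this structure; it is not a Library routine. For clarity the factors of each block are held as dense unit lower and upper triangular matrices and a diagonal, and the permutations P_k and Q_k are omitted (the Library routines use sparse incomplete factors); the names BJSOL, IBLK and all other argument names are assumptions made for this example only.

*     Illustrative sketch (not a Library routine): solve M*v = u where M is
*     block diagonal with NLB local blocks, each factorized as L*D*U with
*     unit triangular L and U (stored densely).  IBLK(k) is the first local
*     row of block k and IBLK(NLB+1) = ML+1.
      SUBROUTINE BJSOL(NLB, IBLK, ML, ELL, DK, UP, U, V)
      INTEGER NLB, ML
      INTEGER IBLK(NLB+1)
      DOUBLE PRECISION ELL(ML,ML), DK(ML), UP(ML,ML), U(ML), V(ML)
      INTEGER K, I, J, I1, I2
      DO 50 K = 1, NLB
         I1 = IBLK(K)
         I2 = IBLK(K+1) - 1
*        Forward substitution with the unit lower triangular factor
         DO 20 I = I1, I2
            V(I) = U(I)
            DO 10 J = I1, I - 1
               V(I) = V(I) - ELL(I,J)*V(J)
   10       CONTINUE
   20    CONTINUE
*        Diagonal solve
         DO 30 I = I1, I2
            V(I) = V(I)/DK(I)
   30    CONTINUE
*        Back substitution with the unit upper triangular factor
         DO 45 I = I2, I1, -1
            DO 40 J = I + 1, I2
               V(I) = V(I) - UP(I,J)*V(J)
   40       CONTINUE
   45    CONTINUE
   50 CONTINUE
      RETURN
      END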

2.9 References

[1] Arnoldi W (1951) The principle of minimized iterations in the solution of the matrix eigenvalue problem Quart. Appl. Math. 9 17–29

[2] Axelsson O (1996) Iterative Solution Methods Cambridge University Press, Cambridge

[3] Barrett R, Berry M, Chan T F, Demmel J, Donato J, Dongarra J, Eijkhout V, Pozo R, Romine C and van der Vorst H (1994) Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods SIAM, Philadelphia

[4] Bruaset A M (1995) A survey of preconditioned iterative methods Pitman Research Notes in Mathematics 328 Longman Scientific & Technical, Harlow, Essex

[5] Dias da Cunha R and Hopkins T (1994) PIM 1.1 — the parallel iterative method package for systems of linear equations user's guide — Fortran 77 version Technical Report Computing Laboratory, University of Kent at Canterbury, Kent CT2 7NZ, UK

[6] Golub G H and van Loan C F (1996) Matrix Computations Johns Hopkins University Press (3rd Edition), Baltimore

[7] Hestenes M and Stiefel E (1952) Methods of conjugate gradients for solving linear systems J. Res. Nat. Bur. Stand. 49 409–436

[8] Higham N J (1988) FORTRAN codes for estimating the one-norm of a real or complex matrix, with applications to condition estimation ACM Trans. Math. Software 14 381–396

[9] Paige C C and Saunders M A (1975) Solution of sparse indefinite systems of linear equations SIAM J. Numer. Anal. 12 617–629


[10] Saad Y (1996) Iterative Methods for Sparse Linear Systems PWS Publishing Company, Boston, MA

[11] Saad Y and Schultz M (1986) GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems SIAM J. Sci. Statist. Comput. 7 856–869

[12] Sleijpen G L G and Fokkema D R (1993) BiCGSTAB(ℓ) for linear equations involving matrices with complex spectrum ETNA 1 11–32

[13] Sonneveld P (1989) CGS, a fast Lanczos-type solver for nonsymmetric linear systems SIAM J. Sci. Statist. Comput. 10 36–52

[14] van der Vorst H (1992) Bi-CGSTAB: a fast and smoothly converging variant of Bi-CG for the solution of nonsymmetric linear systems SIAM J. Sci. Statist. Comput. 13 631–644

3 Recommendations on Choice and Use of Available Routines

3.1 Types of Routines Available

The routines available in this chapter are divided essentially into three types: basic routines, utility routines and Black Box routines.

Basic routines

Basic routines are grouped in suites of three, and implement the underlying iterative method. Each suite comprises a set-up routine, a solver, and a routine to return additional information. The solver routine is independent of the matrix storage format (indeed the matrix need not be stored at all) and the type of preconditioner. It uses reverse communication, i.e., it returns repeatedly to the calling program with the parameter IREVCM set to specified values which require the calling program to carry out a specific task (either to compute a matrix–vector product or to solve the preconditioning equation), to signal the completion of the computation or to allow the calling program to monitor the solution. Reverse communication has the following advantages.

(i) Maximum flexibility in the representation and storage of sparse matrices: all matrix operations are carried out outside the solver routine, thereby avoiding the need for a complicated interface with enough flexibility to cope with all types of storage schemes and sparsity patterns. This applies also to preconditioners.

(ii) Enhanced user interaction: the progress of the solution can be closely monitored by the user and tidy or immediate termination can be requested. This is useful, for example, when alternative termination criteria are to be employed or in case of failure of the external routines used to perform matrix operations.

At present there are three suites of basic routines, for real symmetric, real nonsymmetric and complex non-Hermitian systems, respectively. The following section of code illustrates a simple loop to call one of the iterative solvers (F11GBFP) using reverse communication. The loop is preceded by a call to the set-up routine (F11GAFP) and, optionally, followed by a call to the third routine in the suite.

*
*     Initialize the suite
*
      CALL F11GAFP( ... , IFAIL)
      IF (IFAIL .... ) ....
*
*     Solve the equations
*
      LOOP = .TRUE.
      IREVCM = 0
   10 CONTINUE
      CALL F11GBFP( ... , IREVCM, ... , IFAIL)
*
      IF (IREVCM.EQ.1) THEN
*
*        Matrix-vector product
*
         ......
*
      ELSE IF (IREVCM.EQ.2) THEN
*
*        Solve the preconditioning equations
*
         ......
*
      ELSE IF (IREVCM.EQ.3) THEN
*
*        Monitoring step (optional)
*
         ......
*
      ELSE IF (IREVCM.EQ.4) THEN
*
*        Completion
*
         LOOP = .FALSE.
      END IF
      IF (LOOP) GO TO 10
      IF (IFAIL ... ) ...
*
*     Get additional information
*
      CALL F11GCFP( ... )

Utility routines

Utility routines perform such tasks as preprocessing a sparse matrix, carrying out a permutation or a reordering of a sparse matrix, generating a preconditioner, applying a preconditioner, or computing a matrix-vector product. All these routines assume that the matrix is stored in coordinate storage (CS) format (see Section 2.4.1) using a cyclic row block distribution (see Section 2.5.2). Used in combination, basic routines and utility routines therefore provide iterative methods with a considerable degree of flexibility, allowing the user to select from different termination criteria, monitor the approximate solution, and compute various diagnostic parameters. The tasks of computing the matrix-vector products and dealing with the preconditioner are removed from the user, but at the expense of sacrificing some flexibility in the choice of preconditioner and matrix storage format.

Black Box routines

Black Box routines call basic and utility routines in order to provide easy-to-use routines for particular preconditioners and sparse matrix storage formats. They are much less flexible than the basic routines, but do not use reverse communication, and may be suitable in many simple cases.

The structure of this chapter has been designed to cater for as many types of application as possible. If a Black Box routine exists which is suitable for a given application, you are recommended to use it. If you then decide you need some additional flexibility, it is easy to achieve this by using basic and utility routines which reproduce the algorithm used in the Black Box routine, but allow more access to algorithmic control parameters and monitoring. If you wish to use a preconditioner or storage format for which no utility routines are provided, you must call basic routines, and provide your own utility routines.


3.2 Choosing an Iterative Method and a Preconditioner

3.2.1 Real nonsymmetric or complex non-Hermitian linear systems

The suite of basic routines F11BAFP, F11BBFP and F11BCFP implements either RGMRES, CGS or Bi-CGSTAB(ℓ) for the iterative solution of the real sparse nonsymmetric linear system Ax = b. The corresponding suite of basic routines for complex non-Hermitian problems consists of F11BRFP, F11BSFP and F11BTFP, and implements the same methods. These routines allow a choice of termination criteria and of the norms used in them, allow monitoring of the approximate solution, and can return estimates of the norm of A and the largest singular value of the preconditioned matrix Ā.

In general, it is not possible to recommend one of these methods in preference to another. RGMRES is popular, particularly for problems arising from the discretisation of PDEs, but requires the most storage, and can easily stagnate when the size m of the orthogonal basis is too small, or the preconditioner is not good enough. CGS can be the fastest method, but the computed residuals can exhibit instability which may greatly affect the convergence and quality of the solution. Bi-CGSTAB(ℓ) seems robust and reliable, but it can be slower than the other methods, at times. Some further discussion of the relative merits of these methods can be found in Barrett et al. [3].

Three main preconditioners are provided: block Jacobi, relaxed Jacobi and SOR, the latter only for matrices with symmetric sparsity patterns. SOR can be employed in a range of applications, provided these result in symmetric sparsity patterns, and, in these cases, it could prove more suitable than block Jacobi preconditioning. The effectiveness of SOR preconditioning may depend on the properties of the matrix, the suitability of the multicolouring scheme used, etc. Relaxed Jacobi preconditioners can be applied to any type of matrix with non-zero diagonal entries. However, the effectiveness of Jacobi preconditioning may depend on the properties of the matrix even more than in the case of SOR preconditioners.

3.2.2 Real symmetric linear systems

The suite of basic routines F11GAFP, F11GBFP and F11GCFP implements either the conjugate gradient method (CG) or a Lanczos method based on SYMMLQ for the iterative solution of the real sparse symmetric linear system Ax = b. If A is known to be positive-definite the CG method should be chosen; the Lanczos method is more robust but less efficient for positive-definite matrices. These routines allow a choice of termination criteria and the norms used in them, allow monitoring of the approximate solution, and can return estimates of the norm of A and the largest singular value of the preconditioned matrix Ā.

In this release, utility routines and data formats specifically tailored to symmetric matrices are not available. The utility routines available for real general (nonsymmetric) problems should be used instead. However, the iterative methods for real symmetric matrices, namely CG and SYMMLQ, should always be used as they are likely to be much more effective than general methods such as RGMRES, CGS and Bi-CGSTAB(ℓ). The Black Box routines F11JHFP and F11JEFP employ symmetric iterative solvers and general matrix-vector multiplication routines and preconditioners.

The choice of preconditioner for real symmetric problems is based upon very similar considerations to those already mentioned for the real nonsymmetric case, the main difference being that SOR preconditioning is applicable in all cases.

3.3 Auxiliary Information for Parallel Sparse Matrix Computations

Most operations involving a sparse matrix A stored in distributed coordinate storage format (see Section 2.5) require additional auxiliary information about A in order to be performed efficiently in parallel by multiple processors. This auxiliary information is stored in and retrieved from the associated array IAINFO by a number of routines in Chapter F01, Chapter F11 and Chapter X04.

3.3.1 Contents of IAINFO

While most of the data contained in IAINFO is meant to be used only internally by library routines, the following array elements are crucial to determining the required size of local arrays and therefore may need to be accessed by user programs:


IAINFO(1) – On successful exit, whenever a routine adds information to IAINFO, the element r = IAINFO(1) contains the maximum number of elements of IAINFO used by that routine. Thus, r provides the minimum size of IAINFO required to complete the routine successfully.

IAINFO(2) – On successful exit, whenever a routine adds information to IAINFO, the element s = IAINFO(2) contains the number of elements of IAINFO which can subsequently be used by other library routines. Thus the first s elements of IAINFO must not be changed by the user program between any subsequent calls to library routines involving the matrix A.

IAINFO(3) – m_l, the number of rows of A stored on the local processor (see Section 2.6.1).

IAINFO(4) – m_l^o, the overall number of non-local rows of A in the diagonal blocks (see Section 2.7).

IAINFO(5) – m_l^max, the maximum number of rows of A stored on any processor of the Library Grid.

IAINFO(6) – n_int^i, the number of internal interface indices for the local processor (see Section 2.6.1).

IAINFO(7) – n_int^e, the number of external interface indices for the local processor (see Section 2.6.1).

IAINFO(8) – n_LB, the number of row blocks stored on the local processor (see Section 2.6.1).

IAINFO(9) – NNZ, the overall number of non-zero entries in the local diagonal blocks.

3.3.2 Sequence of accesses to IAINFO

Fundamental items of information are stored in IAINFO by an appropriate general set-up routine. The routines F11ZBFP and F11ZPFP initialise the array IAINFO. One of F11ZBFP or F11ZPFP must be called once for each sparse matrix A, represented in distributed coordinate format, before any other F11 routine can be applied to A.

Additional auxiliary information needed for some sparse matrix operations is stored in IAINFO by corresponding operation-specific set-up routines. In this release, operation-specific set-up routines are available for block Jacobi preconditioning, F11DFFP and F11DTFP. These set-up routines have to be called once before the corresponding operations can be performed using F11DGFP and F11DUFP, respectively. Similarly, for multi-colour SSOR the set-up routines are F11ZGFP and F11ZUFP, which must be called before the corresponding operations can be performed using F11DEFP, F11DDFP, F11DRFP, and F11DSFP.

The direction of the arrows in Figure 4 illustrates the order in which the array IAINFO is supposed to be accessed (and possibly modified for subsequent routine calls) by different routines in Chapters F01, F11 and X04. Routines which modify IAINFO are contained in solid boxes, while routines which merely retrieve data from IAINFO are signified by dashed boxes.

[Figure 4 shows the order of accesses: F11ZBFP first initialises IAINFO; F01YEFP, F11DFFP, F11ZGFP and X04YAFP may then access it; these are followed by F11DGFP, F11DHFP, F11DEFP and F11DDFP.]

Figure 4 Sequence of Accesses to IAINFO (for real A)


3.3.3 Required size of IAINFO

The size of IAINFO required to store all auxiliary information needed to perform a set of sparse matrix operations depends on a number of problem characteristics, most importantly the order of the matrix A, the partitioning of A into row blocks, and the sparsity pattern of A. The dependency of the required size of IAINFO on the sparsity pattern of A, in particular, makes it difficult to determine in advance the storage requirement for IAINFO in a set-up routine. Therefore, suggested values for the size of IAINFO in set-up routines can prove insufficient for particular sparse matrices. In this case it is recommended to increase the storage space according to the value of IAINFO(1) returned from the respective set-up routine.

In this release of the Library dynamic memory allocation has been implemented. There is now an option of passing LIA (the dimension of IAINFO) as −1 to indicate that memory for IAINFO should be allocated dynamically. In this case IAINFO must be dimensioned to 200.

4 Routines Withdrawn or Scheduled for Withdrawal

F11DAFP has been renamed to F11DFFP. Please consult the new routine document.
F11DBFP has been renamed to F11DGFP. Please consult the new routine document.
F11DCFP has been renamed to F11DHFP. Please consult the new routine document.
F11XAFP is now a dummy routine and will be removed at the next release.
F11ZAFP is replaced by F11ZBFP and will be withdrawn at the next release.

5 F11 Chapter Quick Reference Table

Basic routines

  real nonsymmetric linear systems
    set-up routine                                                        F11BAFP
    reverse communication RGMRES, CGS or Bi-CGSTAB(ℓ)                     F11BBFP
    diagnostic routine                                                    F11BCFP

  complex non-Hermitian linear systems
    set-up routine                                                        F11BRFP
    reverse communication RGMRES, CGS or Bi-CGSTAB(ℓ)                     F11BSFP
    diagnostic routine                                                    F11BTFP

  real symmetric linear systems
    set-up routine                                                        F11GAFP
    reverse communication CG or SYMMLQ                                    F11GBFP
    diagnostic routine                                                    F11GCFP

Black Box routines

  real nonsymmetric linear systems
    RGMRES, CGS or Bi-CGSTAB(ℓ) using SOR or Jacobi preconditioner        F11DEFP
    RGMRES, CGS or Bi-CGSTAB(ℓ) using block Jacobi preconditioner         F11DHFP

  complex non-Hermitian linear systems
    RGMRES, CGS or Bi-CGSTAB(ℓ) using SOR or Jacobi preconditioner        F11DSFP
    RGMRES, CGS or Bi-CGSTAB(ℓ) using block Jacobi preconditioner         F11DVFP

  real symmetric linear systems
    CG or SYMMLQ using SOR or Jacobi preconditioner                       F11JEFP
    CG or SYMMLQ using block Jacobi preconditioner                        F11JHFP

Real utility routines

  matrix preprocessing and permutation
    preprocess a matrix distributed in cyclic row block form              F11ZBFP
    permute a matrix with same sparsity pattern as a preprocessed matrix  F11YAFP
    permute a vector from global to local indexing order                  F11YBFP
    permute a vector from local to global indexing order                  F11YCFP
    multi-colouring for SOR preconditioning                               F11ZGFP

  preconditioning
    compute the incomplete LU factorisation of diagonal blocks            F11DFFP
    block Jacobi preconditioner                                           F11DGFP
    SOR preconditioner                                                    F11DDFP
    Jacobi preconditioner                                                 F11DKFP

  other routines
    matrix-vector product                                                 F11XBFP

Complex utility routines

  matrix preprocessing and permutation
    preprocess a matrix distributed in cyclic row block form              F11ZPFP
    permute a matrix with same sparsity pattern as a preprocessed matrix  F11YNFP
    permute a vector from global to local indexing order                  F11YPFP
    permute a vector from local to global indexing order                  F11YQFP
    multi-colouring for SOR preconditioning                               F11ZUFP

  preconditioning
    compute the incomplete LU factorisation of diagonal blocks            F11DTFP
    block Jacobi preconditioner                                           F11DUFP
    SOR preconditioner                                                    F11DRFP
    Jacobi preconditioner                                                 F11DXFP

  other routines
    matrix-vector product                                                 F11XPFP

Other utility routines
    release of internally allocated memory                                F11ZZFP
