Improving iterative solvers: preconditioning, deflation, numerical software and parallelisation
Gerard Sleijpen and Martin van Gijzen
Delft University of Technology
November 29, 2017

Solving Ax = b, an overview
[Flowchart: decision tree for selecting a Krylov method (CG, MINRES, SYMMLQ, GCR, GMRES, Bi-CGSTAB, BiCGstab(ℓ), IDR, IDRstab), based on whether A = A^*, whether A > 0, whether the problem is ill-conditioned, whether a good (and flexible) preconditioner is available, whether A + A^* is strongly indefinite, and whether A has large imaginary eigenvalues.]


Introduction
We already saw that the performance of iterative methods can be improved by applying a preconditioner. Preconditioners (and deflation techniques) are a key to successful iterative methods. In general they are very problem dependent.
Today we will discuss some standard preconditioners and we will explain the idea behind deflation. We will also discuss some efforts to standardise numerical software. Finally we will discuss how to perform scientific computations on a parallel computer.

Program
• Preconditioning
• Diagonal scaling, Gauss-Seidel, SOR and SSOR
• Incomplete Choleski and Incomplete LU
• Deflation
• Numerical software
• Parallelisation
• Shared memory versus distributed memory
• Domain decomposition


Preconditioning
A preconditioned iterative solver solves the system

  M^{-1}Ax = M^{-1}b.

The matrix M is called the preconditioner.
The preconditioner should satisfy certain requirements:
• Convergence should be (much) faster (in time) for the preconditioned system than for the original system. Normally this means that M is constructed as an "easily invertible" approximation to A. Note that if M = A, any method converges in one iteration.
• Operations with M should be easy to perform ("cheap").
• M should be relatively easy to construct.

Why preconditioners?
[Figure: information propagation through the grid after 1 iteration.]
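As a quick illustration of the remark that M = A gives convergence in one iteration, here is a minimal MATLAB sketch (the Poisson test matrix, right-hand side and tolerance are arbitrary choices, not part of the slides):

  % With M = A the preconditioned matrix is the identity, so PCG converges
  % in a single iteration.
  A = gallery('poisson', 20);                         % SPD 2-D Laplacian, n = 400
  b = ones(size(A,1), 1);
  [x, flag, relres, iter] = pcg(A, b, 1e-10, 50, A);  % preconditioner M = A
  fprintf('iterations: %d, relative residual: %g\n', iter, relres);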


Why preconditioners?
[Figures: information propagation through the grid after 7 iterations and after 21 iterations.]

Why preconditioning? (2)
From the previous pictures it is clear that, in 2-D, we need O(1/h) = O(√n) iterations to move information from one end to the other end of the grid. So, at best it takes O(n^{3/2}) operations to compute the solution with an iterative method.
In order to improve this we need a preconditioner that enables fast propagation of information through the mesh.

Clustering the spectrum
In lecture 7 we saw that CG performs better when the spectrum of A is clustered. Therefore, a good preconditioner clusters the spectrum.
[Figure: eigenvalue distributions in the complex plane.]
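The clustering effect can be observed numerically by comparing the spectrum of A with that of a symmetrically preconditioned matrix. A minimal MATLAB sketch, assuming an incomplete Choleski factor C with A ≈ CC^* (this preconditioner is discussed later in these notes; the test matrix is an arbitrary choice):

  A = gallery('poisson', 10);      % small SPD test matrix, n = 100
  C = ichol(A);                    % incomplete Choleski factor (lower triangular)
  lamA = eig(full(A));             % original spectrum
  lamP = eig(full(C \ A / C'));    % spectrum of C^{-1} A C^{-*}: clustered near 1
  disp([min(lamA) max(lamA); min(lamP) max(lamP)])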


Incorporating a preconditioner
Normally the matrix M^{-1}A is not explicitly formed. The multiplication

  u = M^{-1}Av

can simply be carried out by the two operations

  t = Av               %% MatVec
  solve Mu = t for u   %% MSolve

Incorporating a preconditioner (2)
Preconditioners can be applied in different ways:
• from the left: M^{-1}Ax = M^{-1}b,
• centrally: M = LU; L^{-1}AU^{-1}y = L^{-1}b; x = U^{-1}y,
• or from the right: AM^{-1}y = b; x = M^{-1}y.
Some methods allow implicit preconditioning. Left, central and right preconditioning is explicit.
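In MATLAB the two operations look as follows (a small sketch; the diagonal preconditioner and the test matrix are arbitrary choices, and backslash stands in for MSolve):

  A = gallery('poisson', 10);  v = rand(size(A,1), 1);
  M = spdiags(diag(A), 0, size(A,1), size(A,1));   % e.g. a diagonal preconditioner
  t = A * v;                                       %% MatVec
  u = M \ t;                                       %% MSolve: solve Mu = t for u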


Incorporating a preconditioner (3)
All forms of preconditioning lead to the same spectrum. Yet there are differences:
• Left preconditioning is most natural: no extra step is required to compute x.
• Central preconditioning allows to preserve symmetry (if U = L^*).
• Right preconditioning does not affect the residual norm.
• Implicit preconditioning does not affect the residual norm, and no extra steps are required to compute x. Unfortunately, not all methods allow implicit preconditioning. If possible, adaptation of the code of the iterative solver is required.
• Explicit preconditioning allows optimisation of the combination of MSolve and MatVec, as in the Eisenstat trick.

CG with central preconditioning

  x = Ux0,  r = L^{-1}(b − Ax0),  u = 0,  ρ = 1   %% Initialization
  while ||r|| > tol do
     σ = ρ,  ρ = r^*r,  β = −ρ/σ
     u ← r − β u,   c = (L^{-1}AU^{-1})u
     σ = u^*c,  α = ρ/σ
     r ← r − α c                                   %% update residual
     x ← x + α u                                   %% update iterate
  end while
  x ← U^{-1}x

Here, M = LU with U = L^*. Note that M is positive definite.
If M ≡ (L̃ + D)D^{-1}(L̃ + D)^* (with L̃ the strictly lower triangular part of A and D = diag(A)), take L ≡ (L̃ + D)D^{-1/2}.


CG with implicit preconditioning

  x = x0,  r = b − Ax,  u = 0,  ρ = 1   %% Initialization
  while ||r|| > tol do
     c = M^{-1}r
     σ = ρ,  ρ = c^*r,  β = −ρ/σ
     u ← c − β u,   c = Au
     σ = u^*c,  α = ρ/σ
     r ← r − α c                        %% update residual
     x ← x + α u                        %% update iterate
  end while

M should be Hermitian. Matlab's PCG uses implicit preconditioning. Matlab takes M1, M2 (i.e., M = M1·M2); for instance, M1 = L̃D^{-1} + I and M2 = (L̃ + D)^*, with L̃ the strictly lower triangular part of A and D = diag(A).

Incorporating a preconditioner (4)
Explicit preconditioning requires
• preprocessing (to solve Mb′ = b for b′ in left preconditioning, and x0′ = Mx0 in right preconditioning),
• adjusting MatVec: adjustment of the code for the operation that does the matrix-vector multiplication (to produce, say, c = M^{-1}Au),
• postprocessing (to solve Mx = x′ for x in right preconditioning).
Implicit preconditioning requires adaptation of the code of the iterative solver.
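The remark on Matlab's PCG can be tried directly. A minimal sketch that builds M1 and M2 as above, so that M = M1·M2 = (L̃ + D)D^{-1}(L̃ + D)^* (the test matrix and tolerance are arbitrary choices):

  A  = gallery('poisson', 20);  b = ones(size(A,1), 1);  n = size(A,1);
  D  = spdiags(diag(A), 0, n, n);
  Lt = tril(A, -1);                        % strictly lower triangular part of A
  M1 = Lt / D + speye(n);                  % M1 = L~ D^{-1} + I
  M2 = (Lt + D)';                          % M2 = (L~ + D)^*
  [x, flag, relres, iter] = pcg(A, b, 1e-8, 200, M1, M2);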


Preconditioning in MATLAB (1)
Example. L̃ ≡ L + D, where A = L + D + L^* is p.d.
Ax = b,  M ≡ L̃D^{-1}L̃^*:  solve  (D^{1/2}L̃^{-1}AL̃^{-*}D^{1/2}) y = D^{1/2}L̃^{-1}b.

  function [PreMV,pre_b,PostProcess,pre_x0] = preprocess(A,b,x0)
    L = tril(A); D = diag(sqrt(diag(A))); L = L/D; U = L';
    function c = MyPreMV(u)
      c = L\(A*(U\u));
    end
    PreMV = @MyPreMV;
    pre_b = L\b; pre_x0 = U*x0;
    function x = MyPostProcess(y)
      x = U\y;
    end
    PostProcess = @MyPostProcess;
  end

Here, A is a sparse matrix. Note that L, U, D are then also sparse.

Preconditioning in MATLAB (2)
Example. L̃ ≡ L + D, where A = L + D + L^* is p.d.
Ax = b,  M ≡ L̃D^{-1}L̃^*:  solve  (D^{1/2}L̃^{-1}AL̃^{-*}D^{1/2}) y = D^{1/2}L̃^{-1}b.

  function x = CG(A,b,tol,kmax,x0)
    ...
    c = A*u;
    ...
  end

In Matlab's command window, replace
  > x = CG(A,b,10^(-8),100,x0);
by
  > [pMV,pb,postP,px0] = preprocess(A,b,x0);
  > y = CG(pMV,pb,10^(-8),100,px0);
  > x = postP(y);



Preconditioning in MATLAB (3)
Example. L̃ ≡ L + D, where A = L + D + L^* is p.d.
Ax = b,  M ≡ L̃D^{-1}L̃^*:  solve  (D^{1/2}L̃^{-1}AL̃^{-*}D^{1/2}) y = D^{1/2}L̃^{-1}b.

The CG code should take both functions and matrices:

  function x = CG(A,b,tol,kmax,x0)
    A_is_matrix = isnumeric(A);
    function c = MV(u)
      if A_is_matrix, c = A*u; else, c = A(u); end
    end
    ...
    c = MV(u);
    ...
  end

Preconditioning in MATLAB (4)
Example. L̃ ≡ L + D, where A = L + D + L^* is p.d.
Ax = b,  M ≡ L̃D^{-1}L̃^*, implicit preconditioning.

  function MSolve = SetPrecond(A)
    L = tril(A); U = L'; D = diag(diag(A));
    function c = MyMSolve(u)
      c = U\(D*(L\u));
    end
    MSolve = @MyMSolve;
  end

  function x = PCG(A,b,tol,kmax,MSolve,x0)
    ...
    c = MSolve(r);
    ...
  end


Diagonal scaling
Diagonal scaling or Jacobi preconditioning uses M = diag(A) as preconditioner. Clearly, this preconditioner does not enable fast propagation through a grid. On the other hand, operations with diag(A) are very easy to perform, and diagonal scaling can be useful as a first step, in combination with other techniques.
If the choice for M is restricted to diagonal matrices, then M = diag(A) minimizes the condition number C(M^{-1}A) in some sense.

Gauss-Seidel, SOR and SSOR
The Gauss-Seidel preconditioner is defined by

  M = L + D

with L the strictly lower-triangular part of A and D = diag(A). By introducing a relaxation parameter ω (using D/ω instead of D), we get the SOR preconditioner.
For symmetric problems it is wise to take a symmetric preconditioner. A symmetric variant of Gauss-Seidel is

  M = (L + D)D^{-1}(L + D)^*.

By introducing a relaxation parameter ω (i.e., by using D/ω instead of D) we get the so-called SSOR preconditioner.
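A small MATLAB sketch of these preconditioners in action (the test matrix is an arbitrary SPD example, ω = 1 is used, i.e. symmetric Gauss-Seidel, and the factors are passed to PCG as M1, M2 with M = M1·M2):

  A = gallery('poisson', 20);  b = ones(size(A,1), 1);  n = size(A,1);
  D = spdiags(diag(A), 0, n, n);
  L = tril(A, -1);                              % strictly lower triangular part of A

  % Jacobi / diagonal scaling: M = diag(A)
  [~, ~, ~, itJac] = pcg(A, b, 1e-8, 500, D);

  % Symmetric Gauss-Seidel (SSOR with omega = 1): M = (L+D) D^{-1} (L+D)'
  M1 = L + D;  M2 = D \ (L + D)';
  [~, ~, ~, itSGS] = pcg(A, b, 1e-8, 500, M1, M2);

  fprintf('iterations: Jacobi %d, symmetric Gauss-Seidel %d\n', itJac, itSGS);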



ILU-preconditioners
ILU-preconditioners are the most popular 'black box' preconditioners. They are constructed by making a standard LU-decomposition

  A = LU.

However, during the elimination process some nonzero entries in the factors are discarded (i.e., replaced by 0). This leads to M = LU with L and U the "incomplete" L- and U-factors.
Discarding nonzero entries can be done on the basis of two criteria:
• Sparsity pattern: e.g., an entry in a factor is only kept if it corresponds to a nonzero entry in A;
• Size: small entries in the decomposition are dropped.

Sparsity patterns, fill
[Figure: sparsity pattern of A; nz = 184.]


Sparsity patterns, fill
[Figure: sparsity pattern of the exact LU-factors; nz = 352.]

Sparsity patterns, ILU(0)
[Figure: fill pattern, fill of level 0; nz = 184.]



Sparsity patterns, ILU(1)
[Figure: fill pattern, fill of level 1; nz = 244.]

Sparsity patterns, ILU(2)
[Figure: fill pattern, fill of level 2; nz = 294.]


ILU-preconditioners (2)
Modified ILU-preconditioners are constructed by making an ILU-decomposition

  M = LU.

However, during the elimination process, the nonzero entries in the factors that are discarded in the ILU construction are now added to the diagonal entry of U. Then A·1 = M·1, with 1 the vector of all ones.
The idea here is that both A and M have approximately the same effect when acting on "smooth" functions (vectors).

ILU-preconditioners (3)
The number of nonzero entries that is maintained in the LU-factors is normally of the order of the number of nonzeros in A. This implies that operations with the ILU-preconditioner are approximately as costly as multiplications with A.
For A symmetric positive definite a special variant of ILU exists, called Incomplete Choleski. This preconditioner is based on the Choleski decomposition A = CC^*.
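MATLAB provides these incomplete factorisations directly; a minimal sketch of using ILU(0) with GMRES and Incomplete Choleski with PCG (the test matrix, tolerances and iteration limits are arbitrary choices):

  A = gallery('poisson', 20);  b = ones(size(A,1), 1);   % SPD test problem

  [L, U] = ilu(A, struct('type', 'nofill'));   % ILU(0): no fill outside the pattern of A
  x1 = gmres(A, b, [], 1e-8, 100, L, U);       % preconditioned GMRES

  C  = ichol(A);                               % Incomplete Choleski: A ~ C*C'
  x2 = pcg(A, b, 1e-8, 100, C, C');            % preconditioned CG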



Deflation
There are two popular ways to improve the convergence of Krylov solvers.
• Preconditioning: improve the distribution of the eigenvalues (cluster eigenvalues).
• Deflation: remove eigenvectors with eigenvalues that are small in absolute value.
Actually, small eigenvalues are mapped to zero, and approximate solutions are constructed in the orthogonal complement of the kernel of the deflated matrix.

Deflation (2)
U is an n × s matrix (typically with s small), V ≡ AU. Suppose the interaction matrix µ ≡ U^*AU = U^*V is non-singular. Consider the skew projections

  Π₁ ≡ I − Vµ^{-1}U^*   and   Π₀ ≡ I − Uµ^{-1}U^*A.

Note. Π₁y ⊥ U for all n-vectors y, Π₁AU = 0, Π₁A = AΠ₀.

Theorem. Put A′ ≡ Π₁A and b′ ≡ Π₁b. If x′ solves A′x′ = b′ for x′ ⊥ U, then x ≡ Π₀x′ + Uµ^{-1}U^*b solves Ax = b.

Exercise. If A is Hermitian, then A′ is Hermitian.
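A minimal MATLAB sketch of this construction. The choice of U (eigenvectors computed with eigs), the test matrix and the use of GMRES for the deflated system are illustrative assumptions, not prescribed by the slide:

  A = gallery('poisson', 20);  n = size(A,1);  b = ones(n, 1);
  s = 4;
  [U, ~] = eigs(A, s, 'smallestabs');   % columns: eigenvectors with smallest eigenvalues

  V  = A*U;  mu = U'*V;                 % interaction matrix, assumed non-singular
  P1 = @(y) y - V*(mu\(U'*y));          % Pi_1 = I - V mu^{-1} U^*
  P0 = @(y) y - U*(mu\(U'*(A*y)));      % Pi_0 = I - U mu^{-1} U^* A

  xp = gmres(@(y) P1(A*y), P1(b), [], 1e-10, 200);  % solve A'x' = b', x' orthogonal to U
  x  = P0(xp) + U*(mu\(U'*b));          % recover the solution of Ax = b (theorem above)
  norm(A*x - b)                         % check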


Deflation (3)
The deflated map A′ maps U^⊥ to U^⊥. In particular, Krylov subspaces generated by A′ and b′ are orthogonal to U, and Krylov subspace solvers find approximate solutions in U^⊥.
If the space spanned by the columns of U (approximately) contains the eigenvectors associated with the s absolute smallest eigenvalues, then the deflated map A′ has (approximately) the same eigenvalues as A, except for the s absolute smallest ones, which are replaced by 0.
When A′ is considered as a map from U^⊥ to U^⊥, then these 0 eigenvalues are removed (then A′ has only n − s eigenvalues).

Deflation, examples
• Multiple right-hand-side systems or block systems. U = [u₁, ..., u_s], where the u_j are the Ritz vectors with absolute smallest Ritz values obtained from solving Ax = b₁ with GMRES. Then U can be used to solve Ax = b₂, ....
• GCRO: Local Minimal Residual iteration where the update vector u_k in step k is obtained by solving A′u′ = r_k for u′ ⊥ U_{k−1} with ℓ steps of GMRES. In the next step U_k ≡ [U_{k−1}, u_k].
• Multi-grid: U^* is the matrix that represents the restriction to a coarse grid, U is the prolongation, and µ^{-1} is the coarse grid correction. s is not really small in this application (typically s = n/4).



Deflation, examples (2)
In many applications (typically those from CFD), eigenvectors with absolute small eigenvalues correspond to smooth (non- or mildly oscillating) functions on the computational grid. Then the columns of U are taken to be vectors corresponding to smooth functions, such as the function that is 1 everywhere, plus functions that are 1 on part of the grid and 0 elsewhere. The part of the grid where the function is 0 can be a part where the coefficients in the PDE are (small) constant.
The idea here is that the 'hampering' eigenvectors are not known, but they will be close to the space spanned by the U-vectors. Deflation of this type is typically applied in case of an isolated cluster of absolute small eigenvalues.

Numerical software
To promote the use of good quality software, several efforts have been made in the past. We mention:
• Eispack: Fortran 66 package for eigenvalue computations.
• Linpack: F77 package for dense or banded linear systems.
• BLAS: standardisation of basic linear algebra operations. Available in several languages.
• LAPACK: F77 package for dense eigenvalue problems and linear systems. Builds on BLAS and has replaced Eispack and Linpack. F90 and C++ versions also exist.
The above software can be downloaded from http://www.netlib.org.


Basic Linear Algebra Subroutines
The BLAS libraries provide a standard for basic linear algebra subroutines. The BLAS library is available in optimised form on many (super)computers. Three versions are available:
• BLAS 1: vector-vector operations;
• BLAS 2: matrix-vector operations;
• BLAS 3: matrix-matrix operations.

BLAS 1
Examples of BLAS 1 routines are:
• Vector update: daxpy
• Inner product: ddot
• Vector scaling: dscal



Cache memory
Computers normally have a small, fast memory close to the processor, the so-called cache memory. For optimal performance, data in the cache should be reused as much as possible.
Level 1 BLAS routines are for this reason not very efficient: for every number that is loaded into the cache, only one calculation is made.

BLAS 2 and 3
An example of a BLAS 2 routine is the matrix-vector multiplication routine dgemv. An example of a BLAS 3 routine is the matrix-matrix multiplication routine dgemm.
BLAS 2 and in particular BLAS 3 routines make much better use of the cache than BLAS 1 routines.


Parallel computing
Modern supercomputers may contain many thousands of processors. Another popular type of parallel computer is the cluster of workstations connected via a communication network.
Parallel computing poses special restrictions on the numerical algorithms. In particular, algorithms that rely on recursions are difficult to parallelise.

Shared versus distributed memory
An important distinction in computer architecture is the memory organisation.
Shared memory machines have a single address space: all processors read from and write to the same memory. This type of machine can be parallelised using fine-grain, loop-level parallelisation techniques.
On distributed memory computers every processor has its own local memory. Data is exchanged via a message passing mechanism. Parallelisation should be done using a coarse-grain approach. This is often achieved by making a domain decomposition.



Domain decomposition
A standard way to parallelise a grid-based computation is to split the domain into p subdomains, and to map each subdomain on a processor.
[Figure: simple mesh, split over subdomains.]

Domain decomposition (2)
The domain decomposition decomposes the system Ax = b into blocks. For two subdomains one obtains

  Ax = [ A11  A12 ] [ x1 ]  =  [ b1 ]
       [ A21  A22 ] [ x2 ]     [ b2 ],

where x1 and x2 represent the subdomain unknowns, A11 and A22 the subdomain discretization matrices, and A12 and A21 the coupling matrices between subdomains.

Parallel matrix-vector multiplication
For the matrix-vector multiplication (MV, MatVec) u = Av the operations

  u1 = A11·v1 + A12·v2   and   u2 = A22·v2 + A21·v1

can be performed in parallel. However, processor 1 has to send (part of) v1 to processor 2 before the computation, and processor 2 (part of) v2 to processor 1.
This kind of communication is local: only information from neighbouring subdomains is needed.

Parallel vector operations
Inner products (DOT) u^*v are calculated by computing the local inner products u1^*v1 and u2^*v2. The final result is obtained by adding all local inner products. This requires global communication.
Vector updates (AXPY) x ← x + αy can be performed locally, without any communication.
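The block form of the product can be mimicked in a few MATLAB lines (a sketch on a single machine; in an actual distributed code, processor 1 owns the rows in I1 and processor 2 the rows in I2, and v2 (resp. v1) must be communicated first):

  A = gallery('poisson', 10);  n = size(A,1);  v = rand(n, 1);
  I1 = 1:n/2;  I2 = n/2+1:n;                     % two contiguous subdomains
  u1 = A(I1,I1)*v(I1) + A(I1,I2)*v(I2);          % computed on processor 1
  u2 = A(I2,I2)*v(I2) + A(I2,I1)*v(I1);          % computed on processor 2
  norm([u1; u2] - A*v)                           % identical to the global product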



Domain decomposition preconditioners
It is a natural idea to solve a linear system Ax = b by solving the subdomain problems independently and to iterate to correct for the error.
This idea has given rise to the family of domain decomposition preconditioners. The theory for domain decomposition preconditioners is vast; here we only discuss some important ideas.

Block-Jacobi preconditioner
A simple domain decomposition preconditioner is defined by

  M = [ A11   0  ]
      [  0   A22 ]

(Block-Jacobi preconditioner) or, in general, by

  M = [ A11           ]
      [      ...      ]
      [           AMM ]

The subdomain systems can be solved in parallel.
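A minimal MATLAB sketch of a block-Jacobi preconditioner (contiguous index blocks and the Poisson test matrix are arbitrary choices; here PCG solves with M via backslash, i.e. the subdomain systems are solved exactly):

  A = gallery('poisson', 20);  n = size(A,1);  b = ones(n, 1);
  p = 4;  m = n/p;                         % p subdomains of equal size
  M = sparse(n, n);
  for j = 1:p
    I = (j-1)*m+1 : j*m;
    M(I,I) = A(I,I);                       % keep only the diagonal blocks of A
  end
  [x, flag, relres, iter] = pcg(A, b, 1e-8, 500, M);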


Solution of subdomain problems
There are several ways to solve the subdomain problems:
• Exact solves. This is in general (too) expensive.
• Inexact solves using an incomplete decomposition (block-ILU).
• Inexact solves using an iterative method to solve the subproblems. Since in this case the preconditioner is variable, the outer iteration should be flexible, for example GCR.

On the scalability of block-Jacobi
Without special techniques, the number of iterations increases with the number of subdomains. The method is not scalable.
To overcome this problem, techniques can be applied to enable the exchange of information between subdomains.



Improving the scalability
Two popular techniques to improve the flow of information are:
• Use an overlap between subdomains. Of course, one has to ensure that the value of unknowns in gridpoints that belong to multiple domains is unique.
• Use a coarse grid correction. The solution of the coarse grid problem is added to the subdomain solutions. This idea is closely related to multigrid, since the coarse-grid solution is the non-local, smooth part of the solution that cannot be represented on a single subdomain.

Concluding remarks
Preconditioners are the key to a successful iterative method. Today we saw some of the most important preconditioners. The 'best' preconditioner, however, depends completely on the problem.

