Linear algebra and sparse direct methods

Patrick R. Amestoy (ENSEEIHT-IRIT, Toulouse) and J.-Y. L’Excellent (INRIA/LIP-ENS, Lyon )

done in collaboration with

M. Daydé (ENSEEIHT-IRIT, Toulouse),

Lecture notes Semester on High Performance Computing, UNESCO chair at LAMSIN (Tunis), April 12-16, 2004 Tunis

Contents

1 Introduction
  1.1 Motivation and main issues
  1.2 Gaussian elimination
  1.3 QR factorization based on Householder transformations
  1.4 Graph notations and definitions
  1.5 Direct and/or iterative?

2 Graph models and sparse direct solvers
  2.1 The elimination graph model
  2.2 The elimination tree
  2.3 The eDAGs for unsymmetric matrices
  2.4 Column elimination graphs and trees

3 Reordering sparse matrices
  3.1 Objectives
  3.2 Maximum matching/transversal orderings
  3.3 Reordering matrices to special forms
  3.4 Reduction to Block Triangular Form
  3.5 Local heuristics to reduce fill-in during factorization
  3.6 Equivalent orderings of symmetric matrices
  3.7 Combining approaches

4 Partitioning and coloring
  4.1 Introduction
  4.2 Graph coloring
  4.3 Graph partitioning

5 Factorization of sparse matrices
  5.1 Left and right looking approaches
  5.2 Elimination tree and multifrontal approach
  5.3 Sparse LDL^T factorization
  5.4 Multifrontal QR factorization
  5.5 Task partitioning and scheduling
  5.6 Distributed memory sparse solvers
  5.7 Comparison between 3 approaches for LU factorization
  5.8 Iterative refinement
  5.9 Concluding remarks

6 Comparison of two sparse solvers
  6.1 Main features of MUMPS and SuperLU solvers
  6.2 Preprocessing and numerical accuracy
  6.3 Factorization cost study
  6.4 Factorization/solve time study
  6.5 Recent improvements on MUMPS solver
  6.6 An unsymmetrized multifrontal approach

7 GRID-TLSE
  7.1 Description of main components of the site
  7.2 Running the Grid-TLSE demonstrator
  7.3 Description of the software components
  7.4 Concluding remarks on GRID-TLSE project

1 Introduction

A selection of references

• Books
  – Duff, Erisman and Reid, Direct Methods for Sparse Matrices, Clarendon Press, Oxford, 1986.
  – Dongarra, Duff, Sorensen and van der Vorst, Solving Linear Systems on Vector and Shared Memory Computers, SIAM, 1991.
  – George, Liu and Ng, Computer Solution of Sparse Positive Definite Systems, book to appear (2004).
• Articles
  – Gilbert and Liu, Elimination structures for unsymmetric sparse LU factors, SIMAX, 14, pp. 334-352, 1993.
  – Liu, The role of elimination trees in sparse factorization, SIMAX, 11, pp. 134-172, 1990.
  – Heath, Ng and Peyton, Parallel algorithms for sparse linear systems, SIAM Review, 33, pp. 420-460, 1991.

1.1 Motivation and main issues

Motivations

• Solution of linear systems → key algorithmic kernel

Continuous problem ↓ Discretization ↓ Solution of linear system

• Main parameters:
  – Numerical properties of the linear system (symmetry, positive definiteness, conditioning, ...)
  – Size and structure:
    * Large (> 10000 × 10000?), square/rectangular
    * Dense or sparse (structured/unstructured)
  – Target computer (sequential/parallel)

Motivations for designing algorithms • Time-critical applications • Solve larger problems • Decrease elapsed time • Minimize cost of computations (time, memory)

Difficulties

• Access to data:
  – Computer: complex memory hierarchy (registers, multilevel cache, main memory (shared or distributed), disk)
  – Sparse matrices: large, irregular, dynamic data structures.
  → Exploit the locality of references to data on the computer (design algorithms providing such locality)
• Efficiency (time and memory)
  – The number of operations and the memory needed depend very much on the algorithm used and on the numerical and structural properties of the problem.
  – The algorithm depends on the target computer (vector, scalar, shared, distributed, clusters of Symmetric Multi-Processors (SMP), GRID).
  → Algorithmic choices are critical

Key numbers:

1. Average size: 100 MB matrix; factors = 2 GB; flops = 10 Gflops.
2. A bit more "challenging" (Lab. Géosciences Azur, Valbonne):
   – Complex matrix arising in 2D, 16 × 10^6 unknowns, 150 × 10^6 nonzeros
   – Storage: 5 GB (12 GB with the factors?)
   – Flops: tens of TeraFlops?
3. Typical performance (MUMPS):
   – PC LINUX (P4, 2 GHz): 1.0 GFlops/s
   – Cray T3E (512 procs): speed-up ≈ 170, perf. 71 GFlops/s

Typical test problems:

• BMW car body (MSC.Software): 227,362 unknowns, 5,757,996 nonzeros; size of factors: 51.1 million entries; number of operations: 44.9 × 10^9.
• BMW crankshaft (MSC.Software): 148,770 unknowns, 5,396,386 nonzeros; size of factors: 97.2 million entries; number of operations: 127.9 × 10^9.

Factorization process Solution of Ax = b

• A is unsymmetric : – A is factorized as: PAQ = LU where P and Q are permutation matrices, L and U are triangular matrices.

– Forward-backward substitution: Ly = Pb, then U(Q^T x) = y
• A is symmetric:

– A positive definite: PAPT = LLT (Cholesky) – A indefinite : PAPT = LDLT

• A is rectangular, m × n with m ≥ n, and we solve min_x ||Ax − b||_2:
  – A = QR where Q is orthogonal (Q^{-1} = Q^T) and R is triangular.
  – Solve: y = Q^T b, then Rx = y

Difficulties

• Only non-zero values are stored
• Data structures are complex
• Computations are only a small portion of the code (the rest is data manipulation)
• Memory size is a limiting factor → out-of-core solvers

Several levels of parallelism can be exploited:
• At problem level: the problem can be decomposed into sub-problems (e.g. domain decomposition)
• At matrix level, arising from its sparse structure
• At submatrix level, within dense linear algebra computations (parallel BLAS, ...)

Data structure for sparse matrices

• Storage scheme depends on the pattern of the matrix and on the type of access required – band or variable-band matrices – “block bordered” or block tridiagonal matrices – general matrix – row, column or diagonal access

Data formats for a general sparse matrix A What needs to be represented

• Assembled matrices: MxN matrix A with NNZ nonzeros. • Elemental matrices (unassembled): MxN matrix A with NELT elements. • Arithmetic: Real (4 or 8 bytes) or complex (8 or 16 bytes) • Symmetric (or Hermitian) → store only part of the data. • Distributed format ? • Duplicate entries and/or out-of-range values ?

Assembled matrix: illustration

• Example of a 3 × 3 matrix with 5 nonzeros:

        1    2    3
  1   a11
  2        a22  a23
  3   a31       a33

• Coordinate format:
  IRN [1 : NNZ] = 1 3 2 2 3
  JCN [1 : NNZ] = 1 1 2 3 3
  A   [1 : NNZ] = a11 a31 a22 a23 a33
• Compressed Sparse Column (CSC) format:
  IRN [1 : NNZ] = 1 3 2 2 3
  A   [1 : NNZ] = a11 a31 a22 a23 a33
  IP  [1 : N+1] = 1 3 4 6
  Column J corresponds to IRN/A locations IP(J) ... IP(J+1)-1
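As a rough illustration (not part of the original notes), the following Python sketch builds the CSC arrays of the 3 × 3 example above from its coordinate representation; the names IRN, JCN, A and IP mirror the arrays used in the text.

N = 3
IRN = [1, 3, 2, 2, 3]                         # row indices (coordinate format)
JCN = [1, 1, 2, 3, 3]                         # column indices
A   = ["a11", "a31", "a22", "a23", "a33"]     # values (kept symbolic here)

# Count the entries of each column, then build the column pointer array IP.
count = [0] * (N + 1)
for j in JCN:
    count[j] += 1
IP = [1] * (N + 2)
for j in range(1, N + 1):
    IP[j + 1] = IP[j] + count[j]
IP = IP[1:]                                   # 1-based pointers: [1, 3, 4, 6]

# Scatter each entry into its column segment.
next_slot = list(IP[:-1])                     # next free slot of each column
IRN_csc = [0] * len(A)
A_csc = [None] * len(A)
for i, j, v in zip(IRN, JCN, A):
    k = next_slot[j - 1]
    IRN_csc[k - 1] = i
    A_csc[k - 1] = v
    next_slot[j - 1] += 1

print(IP)        # [1, 3, 4, 6]
print(IRN_csc)   # [1, 3, 2, 2, 3]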

Example of elemental matrix format

• Two 3 × 3 elements, A1 on variables {1, 2, 3} and A2 on variables {3, 4, 5}:

  A1 = ( -1  2  3 )        A2 = (  2 -1  3 )
       (  2  1  1 )             (  1  2 -1 )
       (  1  1  1 )             (  3  2  1 )

• N = 5, NELT = 2, NVAR = 6, A = Σ_{i=1}^{NELT} A_i
  ELTPTR [1:NELT+1] = 1 4 7
  ELTVAR [1:NVAR]   = 1 2 3 3 4 5
  ELTVAL [1:NVAL]   = -1 2 1 2 1 1 3 1 1 2 1 3 -1 2 2 3 -1 1
• Remarks:
  – NVAR = ELTPTR(NELT+1) - 1
  – NVAL = Σ S_i^2 (unsymmetric) or Σ S_i(S_i+1)/2 (symmetric), with S_i = ELTPTR(i+1) - ELTPTR(i)
  – storage of the elements in ELTVAL: by columns

File storage: Rutherford-Boeing

• Standard ASCII format for files • Header + Data (CSC format). key xyz: – x=[rcp] (real, complex, pattern) – y=[suhzr] (sym., uns., herm., skew sym., rectang.) – z=[ae] (assembled, elemental) – ex: M T1.RSA, SHIP003.RSE • Supplementary files: right-hand-sides, solution, permutations. . . • Canonical format introduced to guarantee a unique representation (order of entries in each column, no duplicates).

File storage: Rutherford-Boeing (example)

DNV-Ex 1 : Tubular joint-1999-01-17 M_T1 1733710 9758 492558 1231394 0 rsa 97578 97578 4925574 0 (10I8) (10I8) (3e26.16) 1 49 96 142 187 231 274 346 417 487 556 624 691 763 834 904 973 1041 1108 1180 1251 1321 1390 1458 1525 1573 1620 1666 1711 1755 1798 1870 1941 2011 2080 2148 2215 2287 2358 2428 2497 2565 2632 2704 2775 2845 2914 2982 3049 3115 ... 1 2 3 4 5 6 7 8 9 10 11 12 49 50 51 52 53 54 55 56 57 58 59 60 67 68 69 70 71 72 223 224 225 226 227 228 229 230 231 232 233 234 433 434 435 436 437 438 2 3 4 5 6 7 8 9 10 11 12 49 50 51 52 53 54 55 56 57 58 59 ... -0.2624989288237320E+10 0.6622960540857440E+09 0.2362753266740760E+11 0.3372081648690030E+08 -0.4851430162799610E+08 0.1573652896140010E+08 0.1704332388419270E+10 -0.7300763190874110E+09 -0.7113520995891850E+10 0.1813048723097540E+08 0.2955124446119170E+07 -0.2606931100955540E+07 0.1606040913919180E+07 -0.2377860366909130E+08 -0.1105180386670390E+09 0.1610636280324100E+08 0.4230082475435230E+07 -0.1951280618776270E+07 0.4498200951891750E+08 0.2066239484615530E+09 0.3792237438608430E+08 0.9819999042370710E+08 0.3881169368090200E+08 -0.4624480572242580E+08

1.2 Gaussian elimination

Gaussian elimination on a small example

A = A^(1), b = b^(1); the system A^(1) x = b^(1) is

  ( a11 a12 a13 ) ( x1 )   ( b1 )
  ( a21 a22 a23 ) ( x2 ) = ( b2 )     row 2 ← row 2 − row 1 × a21/a11
  ( a31 a32 a33 ) ( x3 )   ( b3 )     row 3 ← row 3 − row 1 × a31/a11

A^(2) x = b^(2):

  ( a11  a12      a13     ) ( x1 )   ( b1      )
  ( 0    a22^(2)  a23^(2) ) ( x2 ) = ( b2^(2)  )    with det(A^(2)) = ±det(A^(1)),
  ( 0    a32^(2)  a33^(2) ) ( x3 )   ( b3^(2)  )    b2^(2) = b2 − a21 b1/a11, a32^(2) = a32 − a31 a12/a11, ...

Finally A^(3) x = b^(3):

  ( a11  a12      a13     ) ( x1 )   ( b1      )
  ( 0    a22^(2)  a23^(2) ) ( x2 ) = ( b2^(2)  )    with a33^(3) = a33^(2) − a32^(2) a23^(2)/a22^(2), ...
  ( 0    0        a33^(3) ) ( x3 )   ( b3^(3)  )

Typical Gaussian elimination step k:  a_ij^(k+1) = a_ij^(k) − a_ik^(k) a_kj^(k) / a_kk^(k)

Relation with LU factors

Let L^(k) be the identity matrix except for column k, which holds the negated multipliers −l_{k+1,k}, ..., −l_{n,k} below the diagonal, with l_ik = a_ik^(k) / a_kk^(k).

Then A^(k+1) = L^(k) A^(k), thus U = A^(n) = L^(n−1) ... L^(1) A.
Since [L^(1)]^{-1} ... [L^(n−1)]^{-1} is the unit lower triangular matrix whose (i, j) entry below the diagonal is l_ij, we obtain A = LU.

Furthermore, let d_kk = a_kk^(k); then A = LDU' where U = DU', and if A is symmetric, A = LDL^T.
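A compact way to see the relation above is to carry out the elimination numerically. The following Python sketch (not from the notes; it uses dense NumPy arrays purely for illustration) performs the update a_ij ← a_ij − l_ik a_kj, accumulates the multipliers l_ik, and checks that A = LU.

import numpy as np

def lu_no_pivoting(A):
    """Plain Gaussian elimination (no pivoting): returns L (unit lower) and U."""
    A = A.astype(float).copy()
    n = A.shape[0]
    L = np.eye(n)
    for k in range(n - 1):
        for i in range(k + 1, n):
            L[i, k] = A[i, k] / A[k, k]        # multiplier l_ik = a_ik^(k) / a_kk^(k)
            A[i, k:] -= L[i, k] * A[k, k:]     # a_ij^(k+1) = a_ij^(k) - l_ik a_kj^(k)
    return L, np.triu(A)

A = np.array([[4.0, 2.0, 1.0],
              [2.0, 5.0, 3.0],
              [1.0, 3.0, 6.0]])
L, U = lu_no_pivoting(A)
print(np.allclose(L @ U, A))   # True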

Gaussian Elimination and sparsity Step k of LU factorization:

• a_kk is the pivot.
• For i > k, j > k:  a'_ij = a_ij − (a_ik × a_kj) / a_kk
• If a_ik ≠ 0 and a_kj ≠ 0 then a'_ij ≠ 0.
• If a_ij was zero, its new non-zero value must be stored: fill-in.

• Idem for Cholesky: for i > k, j > k (lower triangular part),
  a'_ij = a_ij − (a_ik × a_jk) / a_kk

Example

• Original matrix (full first row and column, plus the diagonal):

  x x x x
  x x
  x   x
  x     x

• The matrix is full after the first step of elimination.
• After reordering the matrix (1st row and column ↔ last row and column):

  x     x
    x   x
      x x
  x x x x

• No fill-in.

Table 1: Benefits of sparsity on a matrix of order 2021 with 7353 nonzeros (Dongarra et al., 1991).

  Procedure                      Total storage   Flops         Time (sec.) on CRAY J90
  Full system                    4084 Kwords     5503 × 10^6   34.5
  Sparse system                    71 Kwords     1073 × 10^6    3.4
  Sparse system and reordering     14 Kwords       42 × 10^3    0.9

Control of growth: numerical pivoting

• In dense linear algebra, partial pivoting is commonly used (at each step the largest entry in the column is selected).
• In sparse linear algebra, flexibility to preserve sparsity is offered:
  – Partial threshold pivoting: eligible pivots must not be too small with respect to the maximum in the column. Set of eligible pivots = { r : |a_rk^(k)| ≥ u × max_i |a_ik^(k)| }, where 0 < u ≤ 1.
  – Then, among eligible pivots, select one that better preserves sparsity.
  – u is called the threshold parameter (u = 1 → partial pivoting).
  – It restricts the maximum possible growth of a_ij = a_ij − (a_ik × a_kj)/a_kk per step to 1 + 1/u, which is sufficient to preserve numerical stability.
  – u ≈ 0.1 is often chosen in practice.
• For symmetric indefinite problems, 2 × 2 pivots (with threshold) are also used to preserve symmetry and sparsity.
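As an illustration (a sketch only, not the code of any actual solver), the following Python fragment selects, within column k, a pivot row satisfying the threshold criterion with parameter u and, among the eligible rows, picks one with the fewest nonzeros (a crude sparsity tie-break); the function name and data layout are invented for this example.

import numpy as np

def choose_pivot(A, k, u=0.1):
    """Return an eligible pivot row r >= k in column k of the dense array A:
    |A[r, k]| >= u * max_i |A[i, k]|, and among eligible rows the one whose
    row has the fewest nonzeros (simple sparsity heuristic)."""
    col = np.abs(A[k:, k])
    max_entry = col.max()
    if max_entry == 0.0:
        raise ValueError("column %d is structurally singular" % k)
    eligible = np.where(col >= u * max_entry)[0] + k
    row_counts = [(np.count_nonzero(A[r, k:]), r) for r in eligible]
    return min(row_counts)[1]

A = np.array([[0.001, 2.0, 0.0],
              [1.0,   1.0, 1.0],
              [3.0,   0.0, 4.0]])
print(choose_pivot(A, 0, u=0.1))   # -> 2 (sparsest eligible row)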

Threshold pivoting and numerical accuracy

Table 2: Effect of the variation of the threshold parameter u on a 541 × 541 matrix with 4285 nonzeros (Dongarra et al., 1991).

  u         Nonzeros in LU factors   Error
  1.0       16767                    3 × 10^-9
  0.25      14249                    6 × 10^-10
  0.1       13660                    4 × 10^-9
  0.01      15045                    1 × 10^-5
  10^-4     16198                    1 × 10^2
  10^-10    16553                    3 × 10^23

Efficient implementation of sparse solvers(1)

• Indirect addressing is often used in sparse calculations: e.g. sparse SAXPY

do i = 1, m
   A( inc(i) ) = A( inc(i) ) + alpha * w( i )
enddo

• Even if manufacturers provide hardware support to improve indirect addressing, it penalizes performance.
• A common remedy is to switch to dense calculations as soon as the matrix is no longer sparse enough.

Effect of switching to dense calculations

Table 3: Matrix from a 5-point discretization of the Laplacian on a 50 × 50 grid (Dongarra et al., 1991).

  Density for switch   Order of       Millions    Time
  to full code         full matrix    of flops    (sec.)
  No switch                0              7       21.8
  1.00                    74              7       21.4
  0.80                   190              8       15.0
  0.40                   305             21        9.0
  0.20                   422             50        5.5
  0.10                   531            100        3.7
  0.005                  140           1908        6.1

1.3 QR factorization based on Householder transformations

A = QR based on Householder transformations: the m × n matrix A is factorized as A = Q (R over a zero block), with Q an m × m orthogonal matrix and R an n × n upper triangular matrix.

• QR is numerically more stable than LU.

• Solve linear least-squares problems min_x ||b − Ax||_2.
• A is m × n with full column rank; Q is orthogonal: Q^{-1} = Q^T.
• Standard ways to compute Q are Householder transformations, Givens rotations, and Gram-Schmidt (in the latter case Q is m × n and R is n × n).
• We limit our presentation to Q based on Householder transformations.

Solving linear least-squares problems min_x ||b − Ax||_2

• The solution is unique if A has full column rank.
• The set of solutions S is such that x ∈ S ↔ A^T(b − Ax) = 0.
  – Thus A^T A x = A^T b (the so-called normal equations).
  – Let r = b − Ax; then A^T r = 0 and b = Ax + r, with r orthogonal to Span{A}, the range of A.
• The 2-norm is invariant under orthogonal transformations, thus min_x ||b − Ax||_2 = min_x ||Q^T b − Rx||_2.

Householder transformations

Householder matrix: H = I − 2 u u^T / (u^T u); H is orthogonal (H^T H = I) and symmetric (H^T = H). Geometrically, Hv is the reflection of v with respect to the hyperplane orthogonal to u.

Let v be a nonzero vector; u can be chosen such that Hv = −σ e_1 = −σ (1, 0, ..., 0)^T, namely u = v + σ e_1 with σ = sign(v_1) ||v||.
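A small numerical check of this construction (illustrative only, using NumPy; not from the notes): build u = v + σ e_1 with σ = sign(v_1)||v|| and verify that Hv has all entries but the first equal to zero.

import numpy as np

def householder_vector(v):
    """Return u such that H = I - 2 u u^T / (u^T u) maps v to -sigma * e1."""
    sigma = np.sign(v[0]) * np.linalg.norm(v)
    u = v.copy()
    u[0] += sigma                          # u = v + sigma * e1
    return u, sigma

def apply_householder(u, x):
    """Compute H x without forming H explicitly."""
    return x - 2.0 * (u @ x) / (u @ u) * u

v = np.array([3.0, 4.0, 0.0])
u, sigma = householder_vector(v)
print(apply_householder(u, v))             # approx [-5, 0, 0] = -sigma * e1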

Properties of Householder transformations

• H = I − 2 u u^T / (u^T u) = I − w w^T, with w = √2 u / ||u||.
  – Implicit storage of H → only the vector w is stored.
  – (H_n H_{n−1} ... H_1) = Q^T  ⇒  Q = H_1 ... H_{n−1} H_n
• Blocked representation (for BLAS 3 computation): Q = I − Y T Y^T, where Y = [w_1 ... w_n] and Y^T Y = T^{-1} + T^{-T}.
• Update of a block F̃ by (I − Y T Y^T):
  B = Y^T F̃            (DGEMM)
  Y^T Y ⇒ T^{-1}        (DSYRK)
  B = (T^{-1})^{-1} B   (DTRSM)
  F̃ = F̃ − Y B           (DGEMM)

Properties of QR

a. Let A = QR with Q = [Q1 Q2], and A = Q̃R̃ with Q̃ = [Q̃1 Q̃2]. Then Q1 = Q̃1 S and R = SR̃, where S = diag(±1).

b. Let A = QR and A^T A = LL^T. Then R = L^T (under the hypothesis that R has positive diagonal entries).

c. The R factor is invariant under row permutations of the matrix A:

   Ã = PA = Q̃R̃ and A = QR  ⇒  R̃ = R

Properties a. and c. allow us to choose an equivalent factorization algorithm:

(Illustration omitted: an 8 × 4 matrix processed with successive row reorderings (p = 2, then p = 4, etc.), showing that rows of A can be permuted during the factorization without changing the R factor.)

Householder transformations and sparsity

• Let C = HB where H = I − w w^T.
  – Struct(C) = Struct((I − w w^T) B).
  – Let F = w w^T B; then Struct(C) ⊂ Struct(B) + Struct(F).
• Fill-in in C corresponds to non-zero entries of F.
• Structure of F = w (w^T B):
  – If w_i = 0 then row i of F is zero.
  – Let w' be the vector of nonzero rows of w, and F' and B' the corresponding rows of F and B.
  – If b'_ij ≠ 0 then column j of F' is full.

Sparse QR factorization based on Householder transformations

(Illustration omitted: successive Householder steps on a small sparse matrix, with the legend: × original entry, • modified zero, * modified entry.)

• A zero which becomes nonzero → fill-in.
• Fill-in that "appears" and then "disappears" → transient fill-in.
• Fill-in that will remain in the R factor → permanent fill-in.

1.4 Graph notations and definitions

• A graph G = (V, E) consists of a finite set of nodes or vertices V together with a set of edges E, which are ordered or unordered (and possibly weighted) pairs of vertices.
• A graph with unordered edges is called an undirected graph or a symmetric graph.
• A graph with ordered edges is called a directed graph or a digraph.
• An ordering or labelling of G = (V, E) with |V| = n is a mapping of V onto {1, 2, ..., n}.

• A graph is strongly connected if for any 2 vertices i and j there is a path from i to j.
• G is complete if ∀(i, j) ∈ V², (i, j) ∈ E.
• A subset S of V is a clique or complete set if it induces a complete subgraph.
• S is a maximal clique if there is no clique of G which properly contains S.
• S is a maximum clique of G if there is no clique of larger cardinality; ω(G) = |S| is then the clique number of G.
• A clique cover of G is a partition of V into cliques of G.

• A stable or independent set is a subset X of vertices no two of which are adjacent.

• A coloring is a partition of the vertices V = X1 + ... + Xc such that each Xi is an independent set.

• If each member of X_i is given color i then adjacent vertices will receive different colors.
• The chromatic number χ(G) is the smallest number of colors.
• Property: ω(G) ≤ χ(G), since every vertex in a maximum clique must have a different color.
• If ω(G) = χ(G) then G is a perfect graph.

(Figures omitted: a connected graph G on vertices {u, v, w, x, y, z}; a coloring of G with 3 independent sets (chromatic number 3) and a clique cover with clique number 3, showing that G is a perfect graph.)

Trees, spanning trees and topological ordering

• A rooted tree is a tree in which one node has been selected to be the root.
• A topological ordering of a rooted tree is an ordering that numbers children nodes before their parent.
• Postorderings are topological orderings which number the nodes of any subtree consecutively.
• A spanning tree (forest) of a connected graph G is a subgraph T of G with the same set of nodes, such that if there is a path in G between i and j then there exists a path between i and j in T.

(Figures omitted: a connected graph G and two rooted spanning trees of G, one numbered with a topological ordering and one with a postordering.)

How to explore a graph: two algorithms such that each edge is traversed exactly once in the forward and reverse directions and each vertex is visited. (Connected graphs are considered in the description of the algorithms.)
• Depth-first search: starting from a given vertex (mark it), follow a path (marking each vertex on the path) until the current vertex is adjacent only to marked vertices; return to the previous vertex on the path and continue.
• Breadth-first search: select a vertex and put it on an empty queue of vertices to be visited. Repeatedly remove the vertex x at the head of the queue and place onto the queue all vertices adjacent to x that have never been enqueued.
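The following short Python sketch (illustrative only; the adjacency-list graph is made up for the example) builds the depth-first and breadth-first spanning trees of a connected graph as parent dictionaries.

from collections import deque

def dfs_tree(adj, root):
    """Depth-first search: return the parent dict of the DFS spanning tree."""
    parent = {root: None}
    def visit(x):
        for y in adj[x]:
            if y not in parent:        # y not yet marked
                parent[y] = x
                visit(y)
    visit(root)
    return parent

def bfs_tree(adj, root):
    """Breadth-first search: return the parent dict of the BFS spanning tree."""
    parent = {root: None}
    queue = deque([root])
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if y not in parent:        # y never enqueued
                parent[y] = x
                queue.append(y)
    return parent

adj = {'u': ['v', 'x'], 'v': ['u', 'x', 'y'], 'x': ['u', 'v', 'w'],
       'y': ['v', 'z'], 'w': ['x', 'z'], 'z': ['y', 'w']}
print(dfs_tree(adj, 'v'))
print(bfs_tree(adj, 'v'))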

Illustration of DFS and BFS exploration

(Figures omitted: a connected graph on vertices {s, t, u, v, w, x, y, z} and its depth-first and breadth-first spanning trees rooted at v, with the corresponding visit numbers.)

Properties of DFS and BFS exploration

• If the graph G is not connected, the algorithm is restarted; in fact it then detects the connected components of G.
• Each vertex is visited once.
• The edges visited by the DFS algorithm form a spanning forest of a general graph G, called the depth-first spanning forest of G.
• The edges visited by the BFS algorithm form a spanning forest of a general graph G, called the breadth-first spanning forest of G.

Graph shapes and peripheral vertices

• The distance d(v, w) between two vertices of a graph is the length of the shortest path joining those vertices.
• The eccentricity l(v) = max{ d(v, w) : w ∈ V }.
• The diameter δ(G) = max{ l(v) : v ∈ V }.
• v is a peripheral vertex if l(v) = δ(G).
• Example: a connected graph with δ(G) = 5 and peripheral vertices s, w, x.

(Figure omitted: the example graph on vertices {s, t, u, v, w, x, y, z}.)

Pseudo-peripheral vertices

• The rooted level structure L(v) of G is a partitioning of G, L(v) = {L_0(v), ..., L_{l(v)}(v)}, such that L_0(v) = {v} and L_i(v) = Adj(L_{i−1}(v)) \ L_{i−2}(v).
• The eccentricity l(v) is also called the length of L(v). The width w(v) of L(v) is defined as w(v) = max{ |L_i(v)| : 0 ≤ i ≤ l(v) }.
• The aspect ratio of L(v) is defined as l(v)/w(v).
• A pseudo-peripheral vertex is a node with a large eccentricity.

• Algorithm to determine a pseudo-peripheral vertex: let i = 0, let r_i be an arbitrary node, and set nblevels = l(r_i). Select r_{i+1} as a vertex of minimum degree in the last level L_{l(r_i)}(r_i); as long as l(r_{i+1}) > nblevels, set nblevels = l(r_{i+1}) and iterate.
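A minimal Python sketch of this procedure (assuming an adjacency-list graph; the helper names and the small test graph are invented for the example): build the level structure by BFS, then iterate from a minimum-degree vertex of the last level while the eccentricity keeps increasing.

def level_structure(adj, root):
    """Return the rooted level structure L(root) as a list of levels (BFS)."""
    levels, seen = [[root]], {root}
    while True:
        nxt = [y for x in levels[-1] for y in adj[x] if y not in seen]
        if not nxt:
            return levels
        nxt = list(dict.fromkeys(nxt))            # remove duplicates, keep order
        seen.update(nxt)
        levels.append(nxt)

def pseudo_peripheral(adj, start):
    """Iteratively increase the eccentricity, starting from 'start'."""
    r = start
    levels = level_structure(adj, r)
    nblevels = len(levels) - 1                    # eccentricity l(r)
    while True:
        last = levels[-1]
        cand = min(last, key=lambda v: len(adj[v]))   # min degree in last level
        cand_levels = level_structure(adj, cand)
        if len(cand_levels) - 1 > nblevels:
            r, levels, nblevels = cand, cand_levels, len(cand_levels) - 1
        else:
            return r

adj = {'s': ['t'], 't': ['s', 'v'], 'v': ['t', 'u', 'y', 'z'],
       'u': ['v'], 'y': ['v', 'z', 'x'], 'z': ['v', 'y', 'w'],
       'x': ['y'], 'w': ['z']}
print(pseudo_peripheral(adj, 'v'))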

(Figures omitted: on the example graph, the first step selects v and builds its level structure L(v) of length 3; the second step selects w, a vertex of minimum degree in the last level L_3(v), whose level structure has length 5.)

1.5 Direct and/or iterative?

Solution of sparse linear systems Ax = b: direct or iterative approaches?

Direct methods
• Very general techniques:
  – high numerical precision,
  – sparse matrices with irregular patterns.
• Factorization of the matrix A:
  – may be costly in terms of flops or memory for building the factors,
  – allows computing solutions corresponding to multiple right-hand sides b.
• Computing issues:
  – good granularity of computations,
  – several levels of parallelism can be exploited.

Iterative methods
• Efficiency depends on the type of problem:
  – convergence, preconditioning,
  – numerical properties and/or structure of A.
• Require products of A by a vector:
  – less costly in terms of memory and possibly flops,
  – computation of solutions with multiple right-hand sides can be prohibitive.
• Computing issues:
  – smaller granularity of computations,
  – often, only one level of parallelism.

2 Graph models and sparse direct solvers

2.1 The elimination graph model

The elimination graph model for symmetric matrices

• Let A be a symmetric positive definite matrix of order n and G(A) the corresponding undirected graph.

• The LL^T factorization can be described by the equation:

  A = A_1 = ( d_1   v_1^T )  =  ( √d_1        0       ) ( 1   0    ) ( √d_1   v_1^T/√d_1 )  =  L_1 Ā_1 L_1^T,
            ( v_1   H_1   )     ( v_1/√d_1    I_{n−1} ) ( 0   H̄_1  ) ( 0      I_{n−1}    )

  where H̄_1 = H_1 − v_1 v_1^T / d_1.

• This basic step is then applied recursively to H̄_1, H̄_2, ... to obtain: A = (L_1 L_2 ... L_{n−1} L_n) I_n (L_n^T L_{n−1}^T ... L_2^T L_1^T) = LL^T.

Correspondence with the graph modifications (construction of the elimination graphs)

Let v_i denote the vertex of index i. Start with G_0 = G(A) and i = 1. At each step:
  1. delete v_i and its incident edges;
  2. add edges so that the vertices of Adj(v_i) become pairwise adjacent in G_i = G(H_i).

The G_i are the so-called elimination graphs.
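A small Python sketch of this process (illustrative; the graph is stored as a dict of adjacency sets and the toy graph is not the 6 × 6 example of the figures): eliminating a vertex removes it and turns its neighbourhood into a clique, which is exactly where fill-in appears.

def eliminate(G, v):
    """One elimination step: remove v from the graph G (dict of sets, modified
    in place) and make its neighbours pairwise adjacent. Returns the fill edges."""
    nbrs = G.pop(v)
    for x in nbrs:
        G[x].discard(v)
    fill = []
    for x in nbrs:
        for y in nbrs:
            if x < y and y not in G[x]:
                G[x].add(y); G[y].add(x)
                fill.append((x, y))
    return fill

# Toy symmetric graph: a 6-cycle 1-2-3-4-5-6-1.
G = {1: {2, 6}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4, 6}, 6: {1, 5}}
total_fill = []
for v in range(1, 7):                   # eliminate in the natural order 1, ..., 6
    total_fill += eliminate(G, v)
print(total_fill)                       # fill edges: [(2, 6), (3, 6), (4, 6)]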

(Figure omitted: the initial graph G_0 = G(A) and the matrix H_0 of a 6 × 6 example.)

A sequence of elimination graphs

(Figures omitted: the elimination graphs G_0, G_1, G_2, G_3 and the corresponding reduced matrices H_0, H_1, H_2, H_3 of the 6 × 6 example; each elimination removes a vertex and connects its remaining neighbours.)

Introducing the filled graph G+(A)

• Let F = L + L^T be the filled matrix, and G(F) the filled graph of A, denoted by G+(A).
• Lemma (Parter, 1961): (v_i, v_j) ∈ G+ if and only if (v_i, v_j) ∈ G, or ∃ k < min(i, j) such that (v_i, v_k) ∈ G+ and (v_k, v_j) ∈ G+.

(Figure omitted: the 6 × 6 example, its filled graph G+(A) = G(F) and the filled matrix F = L + L^T.)

Modeling elimination by reachable sets

• The fill edge (v4, v6) is due to the path (v4, v2, v6) in G1. However (v2, v6) is formed from the path (v2, v1, v6) in G0.

• Thus the path (v_4, v_2, v_1, v_6) in the original graph is in fact responsible for the fill edge (v_4, v_6).

• This has motivated George and Liu to introduce the notion of reachable sets.

Reachable sets

• Let G = (V, E) and let S be a subset of V.
• Definition: the vertex v is said to be reachable from a vertex w through S if there exists a path (w, v_1, ..., v_k, v) such that v_i ∈ S (where k could be zero).
• Reach(w, S) is the set of vertices reachable from w through S.

• Illustration: let S = {s_1, s_2, s_3, s_4}; then Reach(w, S) = {a, b, c}. (Figure omitted; the vertex d of the figure does not belong to Reach(w, S).)

Fill-in and reachable sets

The filled graph G+(A) can be characterized with reachable sets. Let S_i denote the set of variables eliminated at step i.

Theorem 1. For i < j, (v_i, v_j) ∈ G+ if and only if v_j ∈ Reach(v_i, S_{i−1}).

In terms of the matrix, the set Reach(v_i, S_{i−1}) is the set of row subscripts corresponding to non-zero entries below the diagonal in column i of L.
Illustration: Reach(v_i, S_{i−1}) provides the column structure of column i of L: Reach(1, S_0) = {5, 8}; Reach(2, S_1) = {4, 5}; Reach(3, S_2) = {8}; Reach(4, S_3) = {5, 7}; Reach(5, S_4) = {6, 7, 8}; ...

(Figure omitted: the 9 × 9 example matrix and its graph used for this illustration.)

Structure of L factors

Theorem 2. Let w be a vertex of G_i. The set of vertices adjacent to w in G_i (i.e. the column structure of w) is given by Reach(w, S_i), where the reach operator is applied to the original graph G_0(A).

The structure of L can thus be characterized in terms of the structure of A.
Illustration: S_2 = {1, 2}; Reach(3, S_2) = {4, 5, 6}; Reach(4, S_2) = {3, 6}; Reach(5, S_2) = {3, 6}; and Reach(6, S_2) = {3, 4, 5}, due to the paths (3, 2, 1, 6), (4, 2, 1, 6), (5, 6).

(Figures omitted: the elimination graph G_2 and the original graph G_0 of the 6 × 6 example, its filled graph G+(A) = G(F) and the filled matrix F = L + L^T.)

• The previous theorem shows that the whole elimination process can be described implicitly by the sequence of reach operators. It can be considered as an implicit model of elimination, as opposed to the explicit model using elimination graphs.
• This theorem also gives an indication of the impact of reordering on the fill-in.

• Suppose that S separates G into two disconnected components C1 and C2. If the vertices of S are labelled after those of C1 and C2, then, by the previous theorem, there cannot be any fill edge between vertices of the two disconnected components.

2.2 The elimination tree

A first definition of the elimination tree

• Let A be a symmetric positive definite irreducible matrix, with Cholesky factorization A = LL^T and filled graph G+(A) (the graph of F = L + L^T).
• Since the matrix is irreducible, each column of L (except the last one) has at least one off-diagonal non-zero.

Definition 3. For each column of L we remove all the non-zeros except the first one below the diagonal. The graph resulting from this construction entirely depends on the structure of A; it is a tree and will be referred to as the elimination tree of A, denoted T(A). The node associated with the last column of L is the root of T(A).

Matrix structures and graph structures

(Figures omitted: a 10 × 10 symmetric matrix A on variables a, ..., j and the corresponding filled matrix F, distinguishing original entries, fill-in entries and the entries retained in T(A); the graph G(A), the filled graph G+(A) = G(F) and the elimination tree T(A), rooted at j and numbered 1, ..., 10.)

Properties of the elimination tree

• T(A) is a spanning tree of G+(A).
• The elimination tree structure can be represented by the PARENT vector, defined as follows (a direct computation is sketched after this list):

  PARENT[j] = min{ i > j : l_ij ≠ 0 }.

• Another perspective also leads to the elimination tree

– Dependency between columns of L :

1. Column i > j depends on column j iff l_ij ≠ 0.
2. Use a directed graph to express these dependencies.
3. Simplify redundant dependencies (transitive reduction in graph theory).
– The transitive reduction of the directed filled graph gives the elimination tree structure.
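For illustration (a sketch, not the authors' code), the PARENT vector can be computed directly from the structure of A with Liu's classical elimination-tree algorithm, using path compression through an auxiliary "ancestor" array; here the lower triangular structure of A is given row by row.

def elimination_tree(n, lower_rows):
    """lower_rows[j] = indices i < j such that a_ji != 0 (structure of row j of
    the lower triangular part of the symmetric matrix A, 0-based).
    Returns PARENT, with PARENT[r] = -1 for roots."""
    parent = [-1] * n
    ancestor = [-1] * n
    for j in range(n):
        for i in lower_rows[j]:
            r = i
            while ancestor[r] != -1 and ancestor[r] != j:
                nxt = ancestor[r]
                ancestor[r] = j            # path compression towards j
                r = nxt
            if ancestor[r] == -1:          # r had no parent yet: j becomes its parent
                ancestor[r] = j
                parent[r] = j
    return parent

# 4x4 "arrow" example: nonzeros a_30, a_31, a_32 below the diagonal.
print(elimination_tree(4, [[], [], [], [0, 1, 2]]))   # -> [3, 3, 3, -1]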

Directed filled graph and its transitive reduction

19 10 j 1 3 7 1 3 7 a c g a c g 9 i

8 6 8 6 8 4 h f h f e h d 5 4 10 4 10 d j d j 2 7 b g 9 2 9 2 i i e b e b 3 6 5 5 c f Transitive reduction 1 Directed filled graph a T(A)

Interpretation as a Depth first search Theorem 4. The elimination tree T (A) of a connected graph G(A) is a depth-first search tree of the filled graph G+(A). + Proof. Let x1, x2, . . . , xn be the node ordering of G (A). Consider the depth-first search subject to the following tie-breaking rule: when there is a choice of more than one node to explore next, always pick the one with largest subscript. With this additional rule the depth-first search will construct T (A).

A depth-first search tree of the filled graph

(Figure omitted: a depth-first search tree of the filled graph G+(A) of the 10-node example; the edges of G+(A) that are not tree edges are back edges.)

Path characterization of filled edges

Theorem 5. If l_ij ≠ 0 then node x_i is an ancestor of x_j in T(A).

Corollary 6. Let T[x_i] and T[x_j] be two disjoint subtrees of T(A); then ∀ x_s ∈ T[x_i] and ∀ x_t ∈ T[x_j], l_st = 0.

• Let S be the set of nodes belonging to neither T[x_i] nor T[x_j]. S separates the filled graph G+(A) into two disjoint subgraphs corresponding to the nodes of T[x_i] and T[x_j] respectively.
• The corollary also shows that the elimination tree expresses the parallelism arising from sparsity.

2.3 The eDAGs for unsymmetric matrices Elimination Directed Acyclic Graph (eDAG) The notion of elimination tree, designed to study the Cholesky factorization of symmetric matrices, is generalized here to the LU factorization without numerical pivoting of unsymmetric matrices (see Gilbert and Liu SIMAX, 1993) A Directed Acyclic graph will be also referred to as a DAG or dag.

20 Unsymmetric matrices and graphs

• Unsymmetric square matrices are commonly represented by directed graphs.
• A bipartite graph representation is used for rectangular matrices and also for unsymmetric matrices. The bipartite graph H(A) of a matrix A has two sets of vertices (V_r and V_c) corresponding respectively to the row and column indices of A, and a_rc ≠ 0 ≡ (r, c) ∈ H(A).
• A bipartite graph with m rows and n columns has the Hall property if every set of k column vertices is adjacent to at least k row vertices, for all 0 ≤ k ≤ n.
• Hall's theorem: a bipartite graph has a column matching iff it has the Hall property.

Strong Hall property

• A bipartite graph with m rows and n columns has the strong Hall property if every set of k column vertices is adjacent to at least k + 1 row vertices for all 0 ≤ k < n. • Any matrix that is not strong Hall can be permuted in block lower triangular form (see Section on orderings).

Graph structures involved during LU factorization

• A = LU with the bordering method:

  A = ( A'    h )  =  ( L'    0 ) ( U'   h' )
      ( g^T   s )     ( g'^T  1 ) ( 0    s' )

  where A' is a matrix of order n − 1, g and h are vectors of size n − 1, g' = U'^{-T} g, h' = L'^{-1} h, and s' = s − g'^T h'. The matrix A' can then be factored recursively in the same way.
• At each step k, h' and g' are obtained as the solutions of two lower triangular systems of order n − k.
• The entire factorization can thus be viewed as a sequence of n pairs of triangular solves.

Sparse triangular matrices and dags

• The directed graph G(A) of a matrix A is such that a_ij ≠ 0 iff (i, j) is an edge of G(A).

(Figure omitted: a 6 × 6 lower triangular matrix and its directed graph.)

• The directed graph of a triangular matrix is acyclic: it is a dag.

Property of dags

Theorem 7. The structure of the solution vector x of Lx = b is given by the set of vertices reachable from the vertices of Struct(b) by paths in the dag G(L^T).

• Illustration: Struct(b) = {1} implies Struct(x) = Reach(b) = {1, 3, 4, 5, 6}; Struct(b) = {2, 4} implies Struct(x) = Reach(b) = {2, 4, 5, 6}.

(Figure omitted: the 6 × 6 triangular matrix and its dag used for this illustration.)
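The theorem translates directly into a graph traversal. The sketch below (illustrative Python; the data layout and the toy structure, which mirrors the example above shifted to 0-based indices, are invented) computes Struct(x) for Lx = b as the set of vertices reachable from Struct(b).

def solution_structure(n, col_struct, b_struct):
    """col_struct[j] = row indices i > j with l_ij != 0 (0-based).
    Returns the sorted structure of x in Lx = b (forward substitution)."""
    reached = set(b_struct)
    stack = list(b_struct)
    while stack:
        j = stack.pop()
        for i in col_struct[j]:        # x_j != 0 and l_ij != 0  =>  x_i != 0
            if i not in reached:
                reached.add(i)
                stack.append(i)
    return sorted(reached)

# Toy 6x6 lower triangular structure (diagonal implicit).
col_struct = [[2, 3], [3, 5], [4], [4], [5], []]
print(solution_structure(6, col_struct, {0}))     # -> [0, 2, 3, 4, 5]
print(solution_structure(6, col_struct, {1, 3}))  # -> [1, 3, 4, 5]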

Transitive reduction of a DAG

• Objective: find another DAG with fewer edges than G(L) but preserving the same path structure.
• The transitive reduction of G(L), denoted G^o(L), is unique and preserves all the path structure defined by the previous theorem.

(Figures omitted: the matrix L, its dag G(L), and the transitive reduction G^o(L) obtained by removing redundant edges.)

Definition of the elimination DAGS Definition 8. The pair of directed acyclic graphs Go(L) and Go(U) associated to the LU factorization without pivoting of a matrix A are the elimination DAGS or eDAGS of A.

(Figures omitted: a 6 × 6 unsymmetric matrix A and its LU factors; the dags G(L) and G(U); and the corresponding eDAGs G^o(L) and G^o(U) obtained by transitive reduction.)

Property

• If A is symmetric then, by definition, T(A) = G^o(L).
• Illustration: A symmetric → is G^o(L) a tree?

(Figures omitted: a 6 × 6 example and the dags G(L) of the original and of a modified matrix, together with the corresponding eDAGs of L; in one case a node of the eDAG has two fathers, in the other the eDAG of L is a tree.)

Definitions

• If B and C are two matrices of order n with nonzeros on the diagonal, then G(B) · G(C) is the product of the graphs, defined by: (i, j) ∈ G(B) · G(C) iff (i, j) ∈ G(B), or (i, j) ∈ G(C), or ∃ k such that (i, k) ∈ G(B) and (k, j) ∈ G(C).
• Lemma: G(BC) ⊆ G(B) · G(C), with equality unless there is coincidental cancellation in BC.
• If A = LU then G(A) ⊆ G(L) · G(U) (the inclusion need not be an equality, since fill-in entries of L and U cancel out when the factors are multiplied back). This is equivalent to saying that an edge of G(A) corresponds to an edge of G(L) followed by an edge of G(U).
• Corollary: a path in G(A) corresponds to a path in G(L) followed by a path in G(U).

Paths in eDAGs

Theorem 9. Any path in the directed graph G(A) corresponds to an ascending path in G^o(U) (up to some vertex k) followed by a descending path in G^o(L) from k to j.

Illustration:

(Figures omitted: a 6 × 6 matrix A with a path a, b, c in G(A), and the corresponding dags and eDAGs of L and U showing the ascending and descending paths.)

Properties of the elimination DAGs

Theorem 10 (row structure of L). l_ij ≠ 0 iff ∃ k such that a_ik ≠ 0 and there is a path from k to j in G^o(U). (A similar theorem holds for the column structure of U.)

Illustration:

(Figures omitted: sketches of the matrix showing an entry a_ik, a path of U entries from k to j in the eDAG of U, and the resulting entry l_ij; and the symmetric situation for entries of U using the eDAG of L.)

Theorem 11 (column structure of L). The column structure of L_{*i} is the union of the column structure of A_{*i} with the column structures of all columns L_{*l} such that l < i and u_li ≠ 0. (A similar theorem holds for the row structure of U.)

Illustration:

(Figures omitted: sketches showing how the structures of earlier columns of L are merged into column i, driven by the entries of column i of U.)

Symbolic LU factorization using eDAGs

for i = 1 to n do
   Compute the row structure of L_i* (using the previous theorem: traverse G^o(U_{i−1}) from Struct(A_i*))
   Transitively reduce Struct(L_i*) using G^o(L_{i−1}) (extension of the eDAG to G^o(L_i))
   Compute the row structure of U_i* (using the previous theorem)
   Transitively reduce Struct(U_i*) using G^o(U_{i−1}) (extension of the eDAG to G^o(U_i))
end for

2.4 Column elimination graphs and trees

The column intersection graph

• The column intersection graph of an m × n matrix A is the undirected graph G_∩(A) whose vertices are the column indices of A and such that (i, j) ∈ G_∩(A) iff ∃ r such that a_ri ≠ 0 and a_rj ≠ 0.
• An edge thus joins two vertices whose columns share a nonzero row of A.

• Unless there is numerical cancellation, G_∩(A) = G(A^T A) (in all cases G(A^T A) ⊆ G_∩(A)).

(Figure omitted: a 10 × 6 matrix A and the structure of A^T A, which defines the column intersection graph G_∩(A).)

The column elimination tree

• The column elimination tree is the elimination tree of G(A^T A).
• Given an ordering of the columns of A, the column elimination tree of A captures the dependencies between the columns, independently of the row interchanges performed (partial pivoting).

• If A = QR then A^T A = R^T Q^T Q R = R^T R; R is the Cholesky factor of the so-called normal equations matrix A^T A.

• If A has full column rank and the strong Hall property, then A = QR satisfies G(R) = G+_∩(A) (the filled column intersection graph).

(Figures omitted: the same 10 × 6 matrix A, the filled structure F(A^T A) with fill-in entries marked F, and the column elimination tree of the example, rooted at column 6 with columns 1, 2, 3 as leaves.)

3 Reordering sparse matrices

3.1 Objectives

Objectives of ordering

• Reorder the matrix to special forms (block tridiagonal, block upper triangular, ...).
• Reduce fill-in during factorization (local and global heuristics).
• Compute a maximum transversal of a matrix.
• Equivalent orderings (topological orderings and postorderings of a tree; traverse the tree so as to minimize working memory).

Fill-in reduction

Three main classes of methods for minimizing fill-in during factorization:
• Global approach: the matrix is permuted into a matrix with a given pattern.
  – Fill-in is restricted to occur within that structure.
  – Cuthill-McKee (block tridiagonal matrix).
  – Nested dissection ("block bordered" matrix).
• Local heuristics: at each step of the factorization, select the pivot that is likely to minimize fill-in.
  – The method is characterized by the way pivots are selected.
  – Markowitz criterion (for a general matrix).
  – Minimum degree (for symmetric matrices).
• Hybrid approaches: once the matrix has been permuted in order to obtain a block structure, local heuristics are used within the blocks.

Symmetric matrices and graphs

• Assumptions: A is symmetric and pivots are chosen on the diagonal.
• Problem: find a symmetric permutation of A that minimizes fill-in.
• Considering the pattern of A is sufficient.
• The structure of a symmetric matrix A is represented by the graph G_0 = (V_0, E_0):
  – vertices are associated to columns: V_0 = {1, ..., n};
  – edges E_0 are defined by (i, j) ∈ E_0 ↔ a_ij ≠ 0;
  – G_0 is undirected (symmetry of A).
• Remarks:
  – Number of nonzeros in column j = |Adj_{G_0}(j)|.
  – Symmetric permutation ≡ renumbering the graph.

(Figure omitted: a 5 × 5 symmetric matrix and its corresponding graph.)

3.2 Maximum matching/transversal orderings

Components of the maximum matching algorithm

• Suppose that a matching of size k has been found (i.e. there exist permutations that place entries on the first k diagonal positions).
• Augmenting paths: try to find a matching of size k + 1 starting from a matching of size k.
• Branch and bound techniques on a depth-first search.

(Figure omitted: a 6 × 6 matrix and its bipartite graph, with row vertices r1, ..., r6 and column vertices c1, ..., c6.)

Structurally singular matrices

If the maximum matching has size smaller than n (it cannot be completed to a full transversal), the matrix is said to be structurally singular or symbolically singular. In the example of the figure, columns 1 and 6 have the same structure.

(Figure omitted: a 6 × 6 matrix whose columns 1 and 6 have identical patterns.)

Illustration of the algorithm

(Figures omitted: successive steps of the augmenting-path search on the 6 × 6 example, ending with a column permutation that puts a nonzero on every diagonal position.)

Maximum weighted matching (structural + numerical problem):
• Maximum weighted transversal: how to find a column permutation P of A such that the product (or another metric: min, sum, ...) of the diagonal entries of AP is maximum?
• Maximum weighted matching: same techniques as in maximum matching, but the branch and bound is more complicated and it is more efficient to use a breadth-first search. MC64, in Fortran 77 or 90 (HSL library).

3.3 Reordering matrices to special forms

Cuthill-McKee and Reverse Cuthill-McKee

Consider the matrix

      1 2 3 4 5 6
  1 [ x x   x x   ]
  2 [ x x         ]
  3 [     x x x   ]
  4 [ x   x x   x ]
  5 [ x   x   x   ]
  6 [       x   x ]

(Figure omitted: the corresponding graph on vertices 1, ..., 6.)

Cuthill-McKee algorithm

• Attempts to permute the matrix into a block tridiagonal form.
• Level sets (as in BFS) are built from the vertex of minimum degree (priority to the vertex with the smallest number).

We get S1 = {2}, S2 = {1}, S3 = {4, 5}, S4 = {3, 6}, and thus the ordering 2, 1, 4, 5, 3, 6. The reordered matrix is

      2 1 4 5 3 6
  2 [ x x         ]
  1 [ x x x x     ]
  4 [   x x   x x ]
  5 [   x   x x   ]
  3 [     x x x   ]
  6 [     x     x ]

Reverse Cuthill-McKee

• The ordering is the reverse of that obtained with Cuthill-McKee, i.e. on the example 6, 3, 5, 4, 1, 2.
• The reordered matrix is

      6 3 5 4 1 2
  6 [ x     x     ]
  3 [   x x x     ]
  5 [   x x   x   ]
  4 [ x x   x x   ]
  1 [     x x x x ]
  2 [         x x ]

• More efficient than Cuthill-McKee at reducing the envelope of the matrix.
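A compact Python sketch of (reverse) Cuthill-McKee on an adjacency-list graph (illustrative only): following the rule used in the notes' example, neighbours are visited by increasing vertex number (classical implementations use increasing degree); on the graph of the example it reproduces the ordering 2, 1, 4, 5, 3, 6 and its reverse.

from collections import deque

def cuthill_mckee(adj):
    """Return a Cuthill-McKee ordering of the vertices of adj (dict of lists).
    Starts from a vertex of minimum degree (smallest label first)."""
    degree = {v: len(adj[v]) for v in adj}
    order, seen = [], set()
    for start in sorted(adj, key=lambda v: (degree[v], v)):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        while queue:
            x = queue.popleft()
            order.append(x)
            for y in sorted(adj[x]):       # visit neighbours by increasing number
                if y not in seen:
                    seen.add(y)
                    queue.append(y)
    return order

adj = {1: [2, 4, 5], 2: [1], 3: [4, 5], 4: [1, 3, 6], 5: [1, 3], 6: [4]}
cm = cuthill_mckee(adj)
print(cm)          # [2, 1, 4, 5, 3, 6]
print(cm[::-1])    # reverse Cuthill-McKee: [6, 3, 5, 4, 1, 2]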

Nested dissection: an example

(Figure omitted: the graph is split by a separator S1 into two subdomains, each split again by separators S2 and S3; in the permuted matrix the separator variables are ordered last, giving a bordered block structure.)

3.4 Reduction to Block Triangular Form Reduction to Block Triangular Form (BTF)

• Suppose that there exist permutation matrices P and Q such that

  PAQ = ( B11                      )
        ( B21  B22                 )
        ( B31  B32  B33            )
        ( ...                      )
        ( BN1  BN2  BN3  ...  BNN  )

• If N > 1, A is said to be reducible (irreducible otherwise).

• Each Bii is supposed to be irreducible (otherwise a finer decomposition is possible).

• Advantage: to solve Ax = b, only the diagonal blocks Bii need be factored:

  Bii yi = (Pb)i − Σ_{j=1}^{i−1} Bij yj,   i = 1, ..., N,   with y = Q^T x.

A two stage approach to compute the reduction

• Stage 1: compute a (column) permutation matrix Q1 such that AQ1 has nonzeros on the diagonal (find a maximum transversal of A); then

• Stage 2: compute a (symmetric) permutation matrix P such that PAQ1P^T is BTF.

The diagonal blocks of the BTF are uniquely defined. Techniques exist to compute the BTF directly; they present no advantage over this two-stage approach.

Main components of the algorithm

• Objective: assume AQ1 has nonzeros on the diagonal and compute P such that PAQ1P^T is BTF.
• Use the digraph (directed graph) associated to the matrix; self-loops (diagonal entries) are not needed.
• Symmetric permutations on the digraph ≡ relabelling the nodes of the graph.
• If there is no closed path through all nodes of the digraph, then the digraph can be subdivided into two parts.
• Strong components of a graph are the sets of nodes belonging to a closed path.
• The strong components of the graph are the diagonal blocks Bii of the BTF.

(Figure omitted: a 6 × 6 matrix A, its digraph, and its block triangular form PAP^T.)

Preamble : Finding the triangular form of a matrix

• Observations: a triangular matrix has a BTF with blocks of size 1 (the BTF will be considered as a generalization of the triangular form, where each diagonal entry becomes a strong component). Note that if the matrix has a triangular form, the digraph has no cycle (it is acyclic).
• Sargent and Westerberg algorithm:
  1. Select a starting node and trace a path until finding a node from which no path leaves.
  2. Number the last node first, eliminate it (and all edges pointing to it).
  3. Continue from the previous node on the path (or choose a new starting node).

Generalizing the Sargent and Westerberg algorithm to compute a BTF

A composite node is a set of nodes belonging to a closed path. Start from any node and follow a path until:
(1) a closed path is found: collapse all nodes of the closed path into a composite node; the path continues from the composite node;
(2) a node or composite node is reached from which no path leaves: this node or composite node is numbered next.
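For illustration, the strong components (and hence the diagonal blocks of the BTF) can also be obtained with Tarjan's algorithm, sketched below in Python on a small made-up digraph; this is a standard alternative to the Sargent and Westerberg scheme described above, not the code of the notes.

def strong_components(adj):
    """Tarjan's algorithm: return the strongly connected components of the
    digraph adj (dict: node -> list of successors). The emission order is such
    that each component only points to previously emitted ones, i.e. it can be
    used directly as the block order of a lower block triangular form."""
    index, low, on_stack = {}, {}, set()
    stack, comps, counter = [], [], [0]

    def visit(v):
        index[v] = low[v] = counter[0]; counter[0] += 1
        stack.append(v); on_stack.add(v)
        for w in adj.get(v, []):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:           # v is the root of a strong component
            comp = []
            while True:
                w = stack.pop(); on_stack.discard(w); comp.append(w)
                if w == v:
                    break
            comps.append(comp)

    for v in adj:
        if v not in index:
            visit(v)
    return comps

# Toy digraph: {1, 2} and {3, 4} are cycles, 5 and 6 are blocks of size 1.
adj = {1: [2], 2: [1, 3], 3: [4], 4: [3, 5], 5: [6], 6: []}
print(strong_components(adj))            # [[6], [5], [4, 3], [2, 1]]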

(Figures omitted: two worked examples. In the first, the digraph of A is acyclic; following paths and numbering last nodes first yields a matrix PAP^T permuted to triangular form (the ordering is not unique). In the second, closed paths are collapsed into composite nodes, and the resulting strong components give the diagonal blocks of the BTF of A.)

3.5 Local heuristics to reduce fill-in during factorization

Let G(A) be the graph associated to a matrix A that we want to order using local heuristics, and let Metric be such that Metric(v_i) < Metric(v_j) implies that v_i is a better pivot candidate than v_j.

Generic algorithm: loop until all nodes are selected:
  Step 1: select the current node p (so-called pivot) with minimum metric value;
  Step 2: update the elimination (or quotient) graph;
  Step 3: update Metric(v_j) for all non-selected nodes v_j.
Step 3 should only be applied to nodes for which the metric value might have changed.

Markowitz criterion

Reordering unsymmetric matrices: Markowitz criterion
• At step k of Gaussian elimination (on the active submatrix A^k):
  – r_i^k = number of non-zeros in row i of A^k,
  – c_j^k = number of non-zeros in column j of A^k,
  – the selected pivot must be large enough and should minimize (r_i^k − 1) × (c_j^k − 1) over the candidates, i, j > k.
• Minimum degree: the Markowitz criterion for symmetric diagonally dominant matrices.

Minimum Degree Minimum degree algorithm • Step 1: Select the vertex that possesses the smallest number of neighbors in G0.

(Figures omitted: (a) a 10 × 10 sparse symmetric matrix and (b) its elimination graph.)

The node/variable selected is 1, of degree 2.

• Notation for the elimination graph:

  – Let G_k = (V_k, E_k) be the graph built at step k.
  – G_k describes the structure of A_k after the elimination of k pivots.
  – G_k is undirected (A_k is symmetric).
  – Fill-in in A_k ≡ adding edges in the graph.

Illustration: step 1, elimination of pivot 1

(Figures omitted: the elimination graph and the factors/active submatrix after eliminating pivot 1; initial nonzeros, nonzeros in the factors and fill-in are distinguished.)

Minimum degree algorithm applied to the graph: • Step k : Select the node with the smallest number of neighbors • Gk is built from Gk−1 by suppressing the pivot and adding edges corresponding to fill-in.

Minimum degree algorithm based on elimination graphs

  For all i in [1 ... n]: t_i = |Adj_{G_0}(i)|
  For k = 1 to n Do
     p = argmin_{i ∈ V_{k−1}} (t_i)
     For each i ∈ Adj_{G_{k−1}}(p) Do
        Adj_{G_k}(i) = (Adj_{G_{k−1}}(i) ∪ Adj_{G_{k−1}}(p)) \ {i, p}
        t_i = |Adj_{G_k}(i)|
     EndFor
     V_k = V_{k−1} \ {p}
  EndFor
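A direct (and deliberately naive) Python transcription of this elimination-graph version, for illustration only; real implementations rely on quotient graphs, supervariables and approximate degrees, as discussed below. The 3 × 3 grid graph used in the test is invented for the example.

def minimum_degree(adj):
    """Naive minimum degree ordering on an undirected graph given as a dict of
    sets (modified in place). Returns the pivot order."""
    order = []
    while adj:
        p = min(adj, key=lambda v: (len(adj[v]), v))   # smallest degree, then label
        nbrs = adj.pop(p)
        order.append(p)
        for i in nbrs:
            adj[i] = (adj[i] | nbrs) - {i, p}          # Adj_Gk(i) = (Adj(i) U Adj(p)) \ {i, p}
    return order

def grid3():
    """Adjacency sets of a 3x3 grid graph with vertices 1..9."""
    adj = {}
    for r in range(3):
        for c in range(3):
            v = 3 * r + c + 1
            nb = set()
            if r > 0: nb.add(v - 3)
            if r < 2: nb.add(v + 3)
            if c > 0: nb.add(v - 1)
            if c < 2: nb.add(v + 1)
            adj[v] = nb
    return adj

print(minimum_degree(grid3()))   # [1, 3, 7, 9, 2, 4, 5, 6, 8]: corners first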

Illustration (cont'd): graphs G_1, G_2, G_3 and the corresponding reduced matrices.

(Figures omitted: (a) the elimination graphs G_1, G_2, G_3 and (b) the corresponding factors and active submatrices, distinguishing original nonzeros, modified original nonzeros, nonzeros in the factors and fill-in.)

Minimum degree does not always minimize fill-in!

Consider the following matrix.

(Figures omitted: a 9-node example, its elimination graph, and the graph after step 1 of minimum degree. With the initial ordering there is no fill-in, whereas selecting pivot 5 (minimum degree 2) at step 1 adds the fill edge (4, 6).)

Efficient implementation of minimum degree

Reduce time complexity
1. Accelerate the selection of pivots and the update of the graph:
   1.1 When eliminating a pivot, all vertices that belong to a complete subgraph including the pivot can also be eliminated (supervariables).
   1.2 Two non-adjacent nodes of the same degree can be eliminated simultaneously (multiple eliminations).
   1.3 The degree update of the neighbours of the pivot can be effected in an approximate way (Approximate Minimum Degree).

Reduce memory complexity
2. Decrease the size of the working space. Using the elimination graph, the working space is of order O(number of nonzeros in the factors).
   • Property: let pivot be the pivot at step k. If i ∈ Adj_{G_{k−1}}(pivot), then Adj_{G_{k−1}}(pivot) ⊂ Adj_{G_k}(i).
   • We can then use an implicit representation of fill-in by defining the notions of element (variable already eliminated) and quotient graph. A variable of the quotient graph is adjacent to variables and to elements.

• We can show that ∀ k ∈ [1 ... N], the size of the quotient graph is O(|G_0|).

Quotient graph notations

Variables refer to un-eliminated nodes; elements refer to eliminated nodes. The quotient graph is G_k = (V_k ∪ V̄_k, E_k ∪ Ē_k), where
• V: variable nodes,
• V̄: element nodes,
• E: edges between variables,
• Ē: edges between variables and elements (edges between elements are absorbed).

  A_i = { j : (i, j) ∈ E },                          (1)
  E_i = { e : (i, e) ∈ Ē },                          (2)
  Adj_G(i) = A_i ∪ E_i,                              (3)
  L_e = { i : (i, e) ∈ Ē },   Adj_G(e) = L_e.        (4)

Quotient graph versus elimination graph

A clique is represented by a list of its members rather than by a list of all the edges in the clique.

• Relation between the quotient graph and the elimination graph G: if i is a node of G, then i is a variable of the quotient graph and

  Adj_G(i) = ( A_i ∪ ⋃_{e ∈ E_i} L_e ) \ {i}          (5)

• Properties of quotient graphs:
  – storage never exceeds the storage of the original graph;
  – no edges between elements (element absorption).

Minimum degree algorithm using the quotient graph

  For k = 1 to n
     Choose x_k of minimum degree
     Form the clique of Adj(x_k)
     Absorb the elements/cliques in E_{x_k}
     For all variables i ∈ Adj(x_k)
        Add the element/clique x_k to E_i
        Update the degree of i

Real/approximate external degree

• Real external degree:

  d_i = |Adj_G(i) \ V_i| = |A_i \ V_i| + | ( ⋃_{e ∈ E_i} L_e ) \ V_i |          (6)

  since the set A_i is disjoint from any set L_e for e ∈ E_i.

• Approximate external degree:

  d̄_i^k = min {  n − k,
                 d̄_i^{k−1} + |L_p \ V_i|,
                 |A_i \ V_i| + |L_p \ V_i| + Σ_{e ∈ E_i \ {p}} |L_e \ L_p|  }   (7)

  where d_i^k ≤ d̄_i^k.

(Figure omitted: the current pivot p, two previously selected pivots (elements e1 and e2) and a variable j adjacent to them; original entries are distinguished.)

Computing the approximate external degree

The last term of the min in (7),

  |A_i \ V_i| + |L_p \ V_i| + Σ_{e ∈ E_i \ {p}} |L_e \ L_p|          (8)

requires the quantities |L_e \ L_p|.

Computing |L_e \ L_p|: since L_e = (L_e \ L_p) ∪ (L_e ∩ L_p), we have |L_e| = |L_e \ L_p| + |L_e ∩ L_p|. Note that |L_e ∩ L_p| is the number of times e appears in the element lists E_i over all variables i ∈ L_p.

Computation of |L_e \ L_p| for all elements e:

  assume w(1 ... n) < 0
  for each variable i ∈ L_p do
     for each element e ∈ E_i do
        if (w(e) < 0) then w(e) = |L_e|
        w(e) = w(e) − 1
     end for
  end for
  Then |L_e \ L_p| = w(e) if w(e) ≥ 0, and |L_e| otherwise.

Performance illustration

Four ordering algorithms are compared:
1. the Approximate Minimum Degree (AMD) algorithm, based on external degrees,
2. Liu's Multiple Minimum Degree (MMD),
3. Duff and Reid's minimum degree (MA27),
4. George and Liu's General Nested Dissection (ND).

Relative fill in the factors: histograms of |L| of AMD over |L| of each algorithm.

                             MMD    MA27    ND
  AMD has more fill:
    1.67 > f ≥ 1.43            -      -      2
    1.43 > f ≥ 1.25            -      -      1
    1.25 > f ≥ 1.11            3      1      -
    slightly more            118     44     12
  identical fill (f = 1)      16     18      2
  AMD has less fill:
    slightly less            177    215     38
    0.9 ≥ f > 0.8              2     31     68
    0.8 ≥ f > 0.7              1      8     51
    0.7 ≥ f > 0.6              -      -     56
    0.6 ≥ f > 0.5              1      -     38
    0.5 ≥ f > 1/4              -      1     39
    1/4 ≥ f > 1/8              -      -     10

Histograms of run time relative to AMD:

                             MMD    MA27    ND
  AMD is slower:
    32 > r ≥ 16                -      -      3
    16 > r ≥ 8                 -      -      1
    8 > r ≥ 4                  -      -      4
    4 > r ≥ 2                  -      -     12
    2 > r ≥ 1.33               2      -     15
    1.33 > r ≥ 1.11           14      2     10
  similar run time            16     44     20
  AMD is faster:
    0.9 ≥ r > 0.75            13     24     10
    0.75 ≥ r > 0.5            51     41     21
    0.5 ≥ r > 1/4             23      9     17
    1/4 ≥ r > 1/8             11      8     10
    1/8 ≥ r > 1/16             1      -      -
    1/16 ≥ r > 1/32            3      4      -
    1/32 ≥ r > 1/64            1      -      -
    1/64 ≥ r                   1      1      -
  not compared               185    182    195

Influence on the structure of the factors

Harwell-Boeing matrix dwt 592.rua (structural computation on a submarine); NZ(LU factors) = 58202.

(Figure omitted: nonzero pattern of the original matrix, nz = 5104.)

Structure of the factors after permutation

(Figures omitted: nonzero patterns of the factors with the basic Minimum Degree (nz = 15110) and with MMD, i.e. variants 1.1 + 1.2 + 2 (nz = 14838).)

Detection of supervariables allows to build more regularly structured factors (easier factorization).

Comparison of 3 implementations of Minimum Degree

• Let V0 be the initial algorithm (based on the elimination graph).
• MMD: the version including 1.1 + 1.2 + 2 (Multiple Minimum Degree, Liu 1985, 1989), used within MATLAB.
• AMD: the version including 1.1 + 1.3 + 2 (Approximate Minimum Degree; Amestoy, Davis, Duff 1995).

Execution times (secs) on a SUN Sparc 10

  Matrix            Order   Nonzeros     V0       MMD      AMD
  dwt 2680           2680     13853      35       0.2      0.2
  Min. memory size                      250 KB   110 KB   110 KB
  Wang4             26068     75552       -        11       5
  Orani678           2529     85426       -       125       5

• Fill-in is similar.
• Memory space for MMD and AMD: ≈ 2 × NZ integers.
• V0 was not able to perform the reordering for the last 2 matrices (lack of memory after 2 hours of computation).

Minimum fill-in

Minimum fill-in heuristics: recalling the generic algorithm. Let G(A) be the graph associated to a matrix A that we want to order using local heuristics, and let Metric be such that Metric(v_i) < Metric(v_j) ≡ v_i is a better pivot candidate than v_j.

Generic algorithm: loop until all nodes are selected:
  Step 1: select the current node p (so-called pivot) with minimum metric value;
  Step 2: update the elimination (or quotient) graph;
  Step 3: update Metric(v_j) for all non-selected nodes v_j.
Step 3 should only be applied to nodes for which the metric value might have changed.

Minimum fill-in based algorithm

• Metric(vi) is the amount of fill-in that vi would introduce if it were selected as a pivot. • Illustration: r has a degree d = 4 and a fill-in metric of d × (d − 1)/2 = 6 whereas s has degree d = 5 but a fill-in metric of d × (d − 1)/2 − 9 = 1.

(Figure omitted: two vertices r and s; no two neighbours of r are adjacent, while the 5 neighbours of s already share 9 edges.)

Minimum fill-in properties

• The situation typically occurs when {i1, i2, i3} and {i2, i3, i4, i5} were adjacent to two already selected nodes (here e2 and e1)

(Figure omitted: the neighbourhoods of r and s, with the previously selected nodes e1 and e2 whose cliques {i1, i2, i3} and {i2, i3, i4, i5} explain the already existing edges.)

• The elimination of a node vk affects the degree of nodes adjacent to vk. The fill-in metric of Adj(Adj(vk)) is also affected.

• Illustration: selecting r affects the fill-in metric of i1 (the fill edge (j3, j4) should be deducted from it).

How to compute the fill-in metrics

Computing the exact minimum fill-in metric is too costly:
• only nodes adjacent to the current pivot are updated;
• only approximate metrics (using clique structures) are computed.

• Let d_k be the degree of node k; d_k × (d_k − 1)/2 is an upper bound of the fill.
  1. Deduct the clique area of the "last" selected pivot adjacent to k (for s: the clique of e2).
  2. Deduct the largest clique area among all adjacent selected pivots (for s: the clique of e1).
  3. If, instead of d, we use the AMD approximation, then the cliques of all adjacent selected pivots can be deducted.

3.6 Equivalent orderings of symmetric matrices

Equivalent orderings

Let F be the filled matrix of a symmetric matrix A (that is, F = L + L^T, where A = LL^T), and let G+(A) be the associated filled graph.

Definition 12 (Equivalent orderings). P and Q are said to be equivalent orderings iff G+(PAP^T) = G+(QAQ^T). By extension, a permutation P is said to be an equivalent ordering of a matrix A iff G+(PAP^T) = G+(A).

It can be shown that an equivalent reordering also preserves the amount of arithmetic operations of the sparse Cholesky factorization.

Relation with elimination trees • We recall that a topological ordering is an ordering of a tree such that children nodes are numbered before their parent node. • Postorderings are topological orderings which number nodes in any subtree consecutively. • Any topological ordering of the elimination tree corresponds to an equivalent reordering P of A and the elimination tree of PAPT is identical to that of A. • Topological orderings are thus tree preserving reorderings.

Example of matrix, tree and topological ordering

(Figure omitted: a 6 × 6 matrix A on variables u, v, w, x, y, z with two fill entries, its filled graph G+(A), and the elimination tree of A numbered with a topological ordering.)

Tree rotations

Definition 13. An ordering that does not introduce any fill is referred to as a perfect ordering. The natural ordering is a perfect ordering of the filled matrix F.

Theorem 14. For any node x of G+(A) = G(F), there exists a perfect ordering on G(F) such that x is numbered last.

• Essence of tree rotations:
  – the nodes in the clique of x in F are numbered last;
  – the relative ordering of the other nodes is preserved.

Example of equivalent re-orderings

Tree rotation applied on w: the clique of w in F is {w, x}, and for the other nodes the relative ordering with respect to the original tree is preserved.

(Figures omitted: the matrix and elimination tree before and after the rotation, the rotated tree being numbered with a postordering.)

Finding the optimal postordering for memory usage

Example 1: Processing a wide tree

(Figures omitted: a wide tree rooted at node 7 and the evolution of the memory while processing it; the legend distinguishes unused memory space, stack memory space, factor memory space and non-free memory space.)

Example 2: Processing a deep tree

(Figures omitted: a deep tree rooted at node 4 and the evolution of the memory while processing it, showing the allocation of node 3, the assembly step for 3, and the factorization and stacking steps; same legend as above.)

Two extreme cases

(Figure omitted: the same tree processed with the postordering abcdefghi, the best case for the stack, and with the reverse ordering ihgfedcba, the worst case.)

Objective/Notations Objective: Find an optimal postordering (or traversal) of the tree that minimizes the stack memory usage.

Definition 15. nb_children(i) is the number of children of node i; c_{i,j} denotes the j-th child of node i; cb_i is the size of the contribution block and factor_i the size of the factor of node i; store_i = factor_i + cb_i; T(i) is the subtree rooted at node i. Let M_i be the maximum amount of stack memory needed to process T(i).

Modelling the problem

The assembly step for node i requires a storage of

  store_i + Σ_{j=1}^{nb_children(i)} cb_{c_{i,j}}

The storage needed while factorizing the frontal matrix of child c_{i,j} is

  M_{c_{i,j}} + Σ_{k=1}^{j−1} cb_{c_{i,k}}

M_i is thus recursively defined as

  M_i = max(  max_{j=1,...,nb_children(i)} ( M_{c_{i,j}} + Σ_{k=1}^{j−1} cb_{c_{i,k}} ),
              store_i + Σ_{j=1}^{nb_children(i)} cb_{c_{i,j}} )                          (9)

Since we want to minimize the peak of the stack, we should reduce the value of M for the root node(s). Adopting a theorem from [37], which says that the minimum of max_j ( x_j + Σ_{i=1}^{j−1} y_i ) is obtained when the sequence (x_i, y_i) is sorted in decreasing order of x_i − y_i, we deduce that an optimal child sequence is obtained by rearranging the children nodes in decreasing order of M_{c_{i,k}} − cb_{c_{i,k}}. Based on this result, the algorithm below gives an optimal traversal of a tree T in terms of the peak of the stack: it sorts the children of each node i in descending order of M_{c_{i,j}} − cb_{c_{i,j}} and computes the new value M_i for the parent node i using (9). Finally, note that we have implemented variants of this algorithm: one minimizing the global memory (stack + factors) and one minimizing the average stack size during execution (when the peak is not changed). A complete analysis of all variants is available in [30].

Optimal tree reordering for minimizing the stack memory peak

Tree_Reorder(T):
  for all i in the set of root nodes do
    Process_Child(i)
  end for

Process_Child(i):
  if i is a leaf then
    M_i = store_i
  else
    for j = 1 to nb_children(i) do
      Process_Child(c_{i,j})
    end for
    Reorder the children c_{i,j} of i in decreasing order of (M_{c_{i,j}} − cb_{c_{i,j}})
    Compute M_i using formula (9)
  end if
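A minimal Python sketch of this reordering, assuming the tree is given as child lists and that store_i and cb_i are known for every node (the names tree, store, cb and reorder_children are illustrative, not part of any solver API):

def reorder_children(node, tree, store, cb, M):
    """Postorder the subtree rooted at `node`, sorting the children of each
    node in decreasing order of M_child - cb_child, and return M[node]."""
    children = tree.get(node, [])
    for c in children:
        reorder_children(c, tree, store, cb, M)
    # Sort the children to minimize the stack peak (decreasing M_c - cb_c).
    children.sort(key=lambda c: M[c] - cb[c], reverse=True)
    # Formula (9): peak while processing the children, then peak at assembly.
    stacked = 0
    peak = 0
    for c in children:
        peak = max(peak, M[c] + stacked)     # child c is processed on top of
        stacked += cb[c]                     # the contribution blocks already stacked
    peak = max(peak, store[node] + stacked)  # assembly of the frontal matrix of `node`
    M[node] = peak
    return peak

# Example: the deep tree of Example 2 (1 and 2 children of 3, 3 child of 4),
# with made-up sizes.
tree  = {4: [3], 3: [1, 2]}
store = {1: 2, 2: 2, 3: 3, 4: 2}
cb    = {1: 1, 2: 1, 3: 1, 4: 0}
M = {}
print(reorder_children(4, tree, store, cb, M), tree)   # peak and reordered child lists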

Impact of the fill-in reduction algorithm on the shape of the tree

Reordering technique   Shape of the tree           Observations
AMD                    Deep, well-balanced         Large frontal matrices on top
AMF                    Very deep, unbalanced       Small frontal matrices
PORD                   Deep, unbalanced            Small frontal matrices
SCOTCH                 Very wide, well-balanced    Large frontal matrices
METIS                  Wide, well-balanced         Smaller frontal matrices (than SCOTCH)

Shape of the trees: summary

            Peak of the stack    Average size of the stack
METIS       +                    +
SCOTCH      +                    +
PORD        −−                   −−
AMF         −                    −
AMD         ++                   ++

Table 4: Characteristics of the stack memory for different reordering techniques. The symbol “++” means large value; . . . “−−” means very small value.

3.7 Combining approaches

Example (1) of hybrid approach

• Top-down followed by bottom-up processing of the graph:
  Top-down: apply nested dissection (ND) on the complete graph
  Bottom-up: local heuristic on each subgraph

• Bottom-up nested dissection based ordering: a bottom-up ordering is used to determine a multisection of the graph, then a bottom-up heuristic is applied on each subtree.

Example (1) of hybrid approach (continued)
[Figure: nested dissection partitions the graph with separators S1, S2, S3; in the permuted matrix the subgraphs come first (minimum degree MD applied locally on each of them) and the separators S3, S2, S1 come last.]

Example (2): combine maximum transversal and fill-in reduction

• Consider the LU factorization A = LU of an unsymmetric matrix. • Compute the column permutation Q leading to a maximum numerical transversal of A: AQ has large (in some sense) numerical entries on the diagonal. • Find the best ordering of AQ preserving the diagonal entries. This is equivalent to finding a symmetric permutation P such that the factorization of PAQP^T has reduced fill-in.

4 Partitioning and coloring

4.1 Introduction

• Graphs:

– Vertices: Objects to be allocated (data, tasks, ...), – Edges: Links between the vertices (communication, dependency, ...). – Weighted graphs : ∗ Weight on vertices : size, computational costs, degrees of freedom, . . . ∗ Weight on edges : communication costs, . . .

• Classes of problems

– Coloring: find a partition of the graph such that each part consists of independent objects (no edge connecting them) – Partitioning: ∗ find a partitioning of a graph such that the parts are of approximately equal size and the number of edges between the parts is minimal (minimize the edge-cut). ∗ On a weighted graph: the sum of the vertex weights is balanced over the parts; the edge-cut is then the sum of the edge weights between parts.

Applications

• Vectorization / parallelization in PDE solvers (mesh coloring, mesh partitioning, domain decom- position) • Solution of Ax = b (A sparse) using – iterative solvers (parallelization of matrix-vector product and efficient preconditioning) or – direct solvers (reordering and parallelization). – Mapping tasks or subdomains – ...

4.2 Graph coloring Graph coloring

• Very general problem : within a set of objects subject to dependencies, identify independent objects (i.e. that can be treated simultaneously) – Graph problem : ∗ Vertices ↔ objects ∗ Edges ↔ express dependencies – After a coloring procedure : all vertices of same color can be treated simultaneously – Wide range of algorithms in terms of cost and efficiency

Example of coloring algorithm
Build the graph associated to the problem. The degree of a vertex is the number of edges incident on it.
[Figure: a small graph with vertices of degree 3, 2, 2 and 1.]

Algorithm:
1. Renumber the vertices according to decreasing degrees. Set i = 0.
2. i = i + 1; build color C_i:
   • Find the uncolored vertex p_{i1} with the smallest index. C_i = {p_{i1}}.
   • Loop: select the uncolored vertex of smallest index among the vertices not adjacent to any vertex of C_i; let p_{ij} be that vertex; then C_i = C_i ∪ {p_{ij}}.
   • Until there is no such non-adjacent vertex left.
Repeat step 2 until all vertices are colored.
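A small Python sketch of this greedy coloring, assuming the graph is given as an adjacency dictionary (the names graph and greedy_coloring are illustrative):

def greedy_coloring(graph):
    """Greedy coloring: vertices are considered by decreasing degree; each
    color class is grown with the smallest-priority vertex not adjacent to it."""
    order = sorted(graph, key=lambda v: (-len(graph[v]), v))
    colors = []                      # list of sets, one per color
    uncolored = set(graph)
    while uncolored:
        current = set()
        for v in order:              # highest priority (smallest new index) first
            if v in uncolored and not (graph[v] & current):
                current.add(v)
                uncolored.discard(v)
        colors.append(current)
    return colors

# Dependency graph of the A_i of the example below: A_i and A_j are adjacent
# if they share a variable.
graph = {1: {2, 4, 5, 6}, 2: {1, 3, 4, 6}, 3: {2, 4, 6},
         4: {1, 2, 3, 5, 6}, 5: {1, 4, 6}, 6: {1, 2, 3, 4, 5}}
print(greedy_coloring(graph))   # [{4}, {6}, {1, 3}, {2, 5}]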

Example of coloring procedure. Parallel calculation of y = A × x with A an unassembled matrix of order 6 such that A = \sum_{i=1}^{6} A_i. The elementary matrices A_i involve the following variables:

• A1 = {1, 6},A2 = {1, 2, 3},A3 = {2, 3}

• A4 = {3, 5, 6},A5 = {5, 6}

• A6 = {1, 2, 3, 4, 5, 6}

The matrix-vector product will be parallelized over the Ai × x.

→ Identify Ai that do not share variables

Dependency graph of the A_i
[Figure: the dependency graph of the A_i; two vertices are connected if the corresponding A_i share a variable.]

Applying the coloring procedure we obtain the following coloring (priority to the vertex of smallest number): Color1 = {4}, Color2 = {6}, Color3 = {1, 3}, Color4 = {2, 5}.

Example: updates in an explicit method

• PDE solved using 5-point finite differences • Domain decomposition on a 2D grid • At each time step: Update at a grid point involves north, south, east and west neighbors

[Figure: 5-point stencil, the update of x_{i,j} involves x_{i−1,j}, x_{i+1,j}, x_{i,j−1} and x_{i,j+1}.]

• Updates imply communications with the north, south, east and west subdomains

• Communication scheme without deadlock

send values to neighbors ↓ receive values from neighbors

• Limits on size of communication buffers may cause deadlock to occur • Coloring to identify sets of processors that can communicate independently

4.3 Graph partitioning Graph partitioning

• Partitioning: let G = (V, E); find a partition of V into k non-empty sets V_i such that V_i ∩ V_j = ∅ for all i ≠ j and ∪_i V_i = V. • Quality of the produced partitions: – load balancing: sub-graphs of equal size – minimization of communications: minimize the number of edges whose incident vertices belong to different parts (edge-cut). • Static or dynamic mapping of sub-graphs (tasks, sub-domains ...) onto processors. (A small helper computing these two quality measures is sketched just after this list.)

• NP-hard problem.

• Classes of partitioning techniques

– Spectral methods : efficient on unstructured problems but complex and often costly. – Geometrical methods : usually fast but partitions produced are often poor. – Multi-level methods: faster than spectral methods, they produce partitions of reasonable quality.
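A tiny Python helper for the two quality measures mentioned above (edge-cut and load balance), assuming the graph is an adjacency dictionary and the partition a vertex-to-part map; all names are illustrative:

def edge_cut(graph, part):
    """Number of edges whose endpoints lie in different parts."""
    return sum(1 for u in graph for v in graph[u]
               if u < v and part[u] != part[v])

def balance(part):
    """Sizes of the parts (load balancing is good when they are close)."""
    sizes = {}
    for v, p in part.items():
        sizes[p] = sizes.get(p, 0) + 1
    return sizes

graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
part = {1: 0, 2: 0, 3: 0, 4: 1}
print(edge_cut(graph, part), balance(part))   # 1 {0: 3, 1: 1}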

Straightforward techniques: 1D and 2D dissections
• Example: a 4 × 3 grid to be partitioned on 4 processors:

  9------10------11------12
  |      |       |       |
  5------6-------7-------8
  |      |       |       |
  1------2-------3-------4

• Dissection in 2 steps: 1. Find level sets: Breadth-First-search Algorithm 2. Partitioning

Breadth-First-Search Algorithm

• Breadth-first-search algorithm: from an arbitrary vertex classify the n vertices in level sets • Possibly perform several runs from different random vertices and select the one with the smaller edge-cut

BFS algorithm
Let i1 be the initial vertex; Lev_set(1) = {i1}; next = 2; marker(i1) = 1; level = 1
While (next ≤ n) Do
  Lev_set(level+1) = {}
  For j in Lev_set(level) Do
    Forall neighbors k of j such that marker(k) = 0 Do
      Add k to Lev_set(level+1)
      marker(k) = 1
      next = next + 1
    EndForall
  EndFor
  level = level + 1
EndWhile
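A runnable Python version of this level-set construction, assuming the graph is an adjacency dictionary (bfs_level_sets is an illustrative name):

def bfs_level_sets(graph, start):
    """Return the level sets (lists of vertices) of a breadth-first search."""
    levels = [[start]]
    marked = {start}
    while True:
        next_level = []
        for j in levels[-1]:
            for k in graph[j]:
                if k not in marked:
                    marked.add(k)
                    next_level.append(k)
        if not next_level:
            return levels
        levels.append(next_level)

# 4 x 3 grid of the example, vertices numbered row by row from 1.
grid = {v: set() for v in range(1, 13)}
for v in range(1, 13):
    if v % 4 != 1: grid[v].add(v - 1)
    if v % 4 != 0: grid[v].add(v + 1)
    if v > 4:      grid[v].add(v - 4)
    if v < 9:      grid[v].add(v + 4)
print(bfs_level_sets(grid, 1))
# [[1], [2, 5], [3, 6, 9], [4, 7, 10], [8, 11], [12]]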

On the example with i1=1 and Si= Lev set(i):

• S1 = {1}

• S2 = {2, 5}

• S3 = {3, 6, 9}

• S4 = {4, 7, 10}

• S5 = {8, 11}

• S6 = {12} Remark: Quite similar to Cuthill-McKee for ordering sparse matrices (attempt to minimize bandwidth).

1D Partitioning

• Let nb be the number of vertices to assign to each partition. • Visit the level sets by increasing order and assign nb consecutive vertices to each partition.

Algorithm
nb_lev = number of level sets, Lev_set = level sets, dom = domains, nb = number of vertices per domain.
nb_dom = 1; size = 0
For i = 1, nb_lev Do
  For j in Lev_set(i) Do
    Add j to dom(nb_dom)
    size = size + 1
    If (size ≥ nb) Then
      nb_dom = nb_dom + 1
      size = 0
    EndIf
  EndFor
EndFor
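A short Python sketch of this 1D partitioning, reusing bfs_level_sets and the grid defined in the previous sketch (partition_1d is an illustrative name):

def partition_1d(levels, nb):
    """Assign nb consecutive vertices, visited level set by level set, per domain."""
    domains, current = [], []
    for level in levels:
        for v in level:
            current.append(v)
            if len(current) == nb:
                domains.append(current)
                current = []
    if current:                       # the last domain may be smaller
        domains.append(current)
    return domains

levels = bfs_level_sets(grid, 1)      # level sets of the 4 x 3 grid example
print(partition_1d(levels, 3))
# [[1, 2, 5], [3, 6, 9], [4, 7, 10], [8, 11, 12]]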

Illustration of 1D partitioning

• Algorithms 1. For construction of level sets : Breadth-First-Search algorithm (BFS) 2. Partitioning over the level sets • On the example with nb = 3 : – domain 1 = {1, 2, 5} – domain 2 = {3, 6, 9} – domain 3 = {4, 7, 10} – domain 4 = {8, 11, 12}

2D Partitioning

• Idea : apply twice 1D partitioning • First step : 1D partition • Second step : apply a 1D partitioning within the sub-domains obtained at Step 1

On the example with nb1 = 3 and nb2 = 2 : • Step 1:

– Level sets with i1 = 1 : S1 = {1},S2 = {2, 5},S3 = {3, 6, 9},S4 = {4, 7, 10},S5 = {8, 11},S6 = {12}

– Partition obtained using nb1 = 3 : domain1 = {1, 2, 5}, domain2 = {3, 6, 9}, domain3 = {4, 7, 10}, domain4 = {8, 11, 12}.

• Step 2 : 1D partition within each domain with nb2 = 2 :

domain1 = {1, 2} domain2 = {5} domain3 = {3, 6} domain4 = {9} domain5 = {4, 7} domain6 = {10} domain7 = {8, 11} domain8 = {12}

Illustration of 2D partitioning
[Figure: an initial graph, its 1D partitioning with nb1 = 8, and its 2D partitioning with nb1 = 8 and nb2 = 4.]
• Algorithms: – 1D partitioning according to the level sets obtained with BFS. – 1D partitioning within each domain.

Limitations of the previous algorithms

• Choice of the starting vertex. • Quality of the partitioning. • These problems are even more critical on large graphs.

Multilevel graph partitioning
p-way graph partitioning problem:
• Given a graph G = (V, E) with |V| = n
• Partition V into p subsets V_1, V_2, ..., V_p such that:
  – V_i ∩ V_j = ∅ for all i ≠ j,
  – |V_i| ≈ n/p for all i,
  – and ∪_i V_i = V, minimizing the edge-cut.

• Most frequently solved using recursive bisection : – Obtain a 2 − way partition of V – Further divide each part using 2-way partitions – After log(p) phases, G partitioned into p parts. – Partitioning problem is reduced to that of performing a sequence of 2-way partitions or bisec- tions.

51 Basic structure of the algorithm

1. Coarsening phase 2. Partitioning (bisection) on the reduced graph 3. This partition is projected back towards the original graph (uncoarsening phase) by periodically refining the partition.

Software: • Often based on multilevel approaches with recursive bisections. • CHACO (Sandia Nat. Lab.), METIS (Univ. Minnesota), SCOTCH (LaBRI, Bordeaux)

Multilevel Graph Bisection
[Figure: the coarsening phase reduces G0 to G1, G2, G3; an initial partitioning (bisection) is computed on G3; during the uncoarsening phase the partition is projected back from G3 to G0 and refined at each level.]

Coarsening Phase Objective: Obtain a sequence of smaller graphs

• Build a sequence of smaller graphs Gl = (Vl,El) from G0 = (V0,E0) such that |Vl| > |Vl+1|

• Gl+1 constructed from Gl by finding a maximal matching Ml ⊆ El of Gl and collapsing together vertices that are incident on each edge of the matching. • Various ways of computing a maximal matching • We describe a simple scheme called random matching where vertices are visited in a random order. For each unmatched vertex we randomly match it with one of its unmatched neighbors.

Random Matching Algorithm

• While there exist unmarked vertices Do

– Randomly select an unmarked vertex u in Gi – Randomly select an unmarked neighbor v of u (if it exists) – Mark u and v (if it exists). Include (u, v) in the “matching”. • EndWhile

• Gi+1 is built by amalgamating the pairs of vertices of Gi that are matched. The weights of the vertices and edges are updated.
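A minimal Python sketch of random-matching coarsening, assuming a weighted undirected graph stored as nested dictionaries of edge weights (random_matching and coarsen are illustrative names):

import random

def random_matching(graph, rng=random):
    """Return a list of matched pairs; unmatched vertices stay on their own."""
    unmarked = set(graph)
    matching = []
    vertices = list(graph)
    rng.shuffle(vertices)                            # visit vertices in random order
    for u in vertices:
        if u not in unmarked:
            continue
        unmarked.discard(u)
        candidates = [v for v in graph[u] if v in unmarked]
        if candidates:
            v = rng.choice(candidates)               # random unmatched neighbor
            unmarked.discard(v)
            matching.append((u, v))
    return matching

def coarsen(graph, vweight, matching):
    """Amalgamate each matched pair; sum vertex weights and parallel edge weights."""
    rep = {v: v for v in graph}                      # representative of each vertex
    for u, v in matching:
        rep[v] = u
    cgraph = {rep[v]: {} for v in graph}
    cweight = {}
    for v in graph:
        cweight[rep[v]] = cweight.get(rep[v], 0) + vweight[v]
        for w, ew in graph[v].items():
            ru, rw = rep[v], rep[w]
            if ru != rw:                             # drop collapsed (internal) edges
                cgraph[ru][rw] = cgraph[ru].get(rw, 0) + ew
    return cgraph, cweight

The heavy-edge variant described next only changes the choice of v: instead of a random unmatched neighbor, pick the unmatched neighbor connected by the edge of largest weight.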

Heavy Edge Random Matching

• Objective: minimize the sum of the weights of the edges in the coarser graph.

• The matching Ml is computed so that the weight of the matched edges is high. • The vertices are visited in random order. • Vertex u is matched with the unmatched vertex v connected by the heaviest edge (i.e. the edge (u, v) has maximal weight).

Coarsening Phase: illustration
[Figure: a 14-vertex graph with unit vertex and edge weights is coarsened twice, once with random matching (the index of an amalgamated vertex is the smallest index of the matched pair) and once with heavy-edge random matching; the two matchings lead to different coarser graphs G1 and G2, whose bisections give edge-cuts of 2 and 8 respectively.]

Initial partitioning on the coarsest graph
Bisection: • Randomly select several vertices as starting points • Apply the BFS algorithm from each starting point • Partition according to the level sets • Select the best partition (e.g. the one with the smallest edge-cut)
Remark: • Efficient enough on a small graph • Only represents a small fraction of the run time of the overall algorithm

Uncoarsening Phase The partition of the coarsest graph is projected back to G0 by going through intermediate graphs. • Example:

[Figure: uncoarsening from graph G_{i+1} to G_i; the bisection of G_{i+1} (vertices a, b) is projected onto the vertices a1, a2, a3, b1, b2, b3 of G_i and then refined; after refinement the edge-cut is 2.]

• After each step of projection, the resulting partition is further refined using vertex swap heuristics that decrease the edge-cut. • E.g. Boundary Kernighan-Lin refinement

Kernighan-Lin refinement

• Idea: a greedy algorithm that attempts to decrease the edge-cut by moving vertices between subgraphs.

• Definition: the gain g_v obtained by moving vertex v to the other partition is

    g_v = \sum_{u \notin Partition(v)} Weight(edge(u, v)) − \sum_{u \in Partition(v)} Weight(edge(u, v))

• If g_v > 0 then moving v improves the partition.

K.L. Algorithm

• Initialization: compute the gain of each vertex
• While there exists an unmarked vertex v with a positive gain in the largest partition Do
  – Select the vertex v of highest positive gain
  – Move v to the other partition and mark it
  – Update the gains of the neighbors of v
• EndWhile

Remark: In practice these heuristics substantially improve the final partitioning.
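A simplified Python sketch of this gain-based refinement for a bisection (kl_refine is an illustrative name; the balance control is deliberately crude, vertices are only moved out of the largest part, as in the algorithm above):

def gain(v, part, graph):
    """Edge-cut decrease obtained by moving v to the other part."""
    return sum(w for u, w in graph[v].items() if part[u] != part[v]) \
         - sum(w for u, w in graph[v].items() if part[u] == part[v])

def kl_refine(graph, part):
    """Greedy refinement: repeatedly move the unmarked vertex of largest
    positive gain out of the largest part, then update its neighbors' gains."""
    marked = set()
    while True:
        sizes = {0: 0, 1: 0}
        for v in part:
            sizes[part[v]] += 1
        largest = max(sizes, key=sizes.get)
        candidates = [v for v in graph
                      if v not in marked and part[v] == largest
                      and gain(v, part, graph) > 0]
        if not candidates:
            return part
        v = max(candidates, key=lambda v: gain(v, part, graph))
        part[v] = 1 - part[v]          # move v to the other part
        marked.add(v)

# Tiny made-up weighted graph and initial bisection, for illustration only.
graph = {'a': {'b': 1, 'c': 2}, 'b': {'a': 1, 'd': 1},
         'c': {'a': 2, 'd': 1}, 'd': {'b': 1, 'c': 1}}
part = {'a': 0, 'b': 0, 'c': 1, 'd': 1}
print(kl_refine(graph, part))          # 'a' is moved, the edge-cut drops from 3 to 2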

Refinement using K.L.
[Figure: on graph G_{i+1}, vertex a3 has the maximal gain Gain(a3) = 3 − 1 = 2 and is moved and marked; then Gain(b1) = 1 + 1 − 1 = 1 and b1 is moved and marked; after refinement the edge-cut is 2.]

Influence of the coarsening phase Algorithms Partitioning based on recursive bisections:

1. Graph reduction based on Random Matching (RM) and Heavy Edge Random Matching (HERM) 2. Partitioning using recursive bisection of the reduced graph (1D bisection) 3. Expansion with or without refinement (Kernighan-Lin)

Testing environment

• Tests on graphs resulting from sparse matrices (coming from industrial problems) • Partitioning into 32 domains on an SGI Challenge, 200 MHz • Software: METIS (Karypis G. and Kumar V., Univ. of Minnesota)

Impact of coarsening on the performance Karypis G and Kumar V., Univ. Minnesota, TR 95-035

Graph      #vertices   #edges     Edge-cut without refinement    Edge-cut with refinement during uncoarsening
                                  RM         HERM                RM        HERM
bcsstk31   35588       572914     144879     84024               44810     45991
bcsstk32   44609       985046     184236     148637              71416     69361
cant       54195       1960797    817500     487543              323000    323000
inpro1     46949       1117809    205525     187482              78375     75203
wave       156317      1059331    239090     212742              73364     72034

Example: mapping tasks on processors

• Task dependency graph – Edges express dependencies – Weights on edge express amount of communication – Vertices = tasks – Weights on vertices = amount of computations

Partitioning process provides a mapping of tasks onto processors

5 Factorization of sparse matrices

Main classes of approaches • For sparse factorization – General technique – Frontal method – Left/Right/Multifrontal approaches • We first describe Left/Right/Multifrontal approaches. Frontal method and General technique will be introduced and compared to a multifrontal approach. • Distributed memory right and multifrontal LU solvers are compared in a separate section.

Three-phase scheme based approach
Applies to the frontal and to the right-/left-looking/multifrontal approaches. • Solution of Ax = b with A sparse • Preprocessing of A (permutations, scaling) • Build the dependency graph (elimination tree, eDAG, . . . ) (not for the frontal approach) • Factorization (LU, LDL^T, LL^T, QR) • Solution based on the factored matrices.

5.1 Left and right looking approaches Left and right looking approaches for Cholesky factorization • cmod(j, k) : Modification of column j by column k, k < j, • cdiv(j) division of column j by a scalar

Left-looking approach
  for j = 1 to n do
    for k ∈ Struct(L_{j*}) do
      cmod(j, k)
    cdiv(j)

Right-looking approach
  for k = 1 to n do
    cdiv(k)
    for j ∈ Struct(L_{*k}) do
      cmod(j, k)
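A dense Python illustration of the two variants (sparsity is ignored, so Struct(·) is simply all later rows/columns; function names are illustrative and the scalar in cdiv is the square root of the diagonal, as in Cholesky):

import numpy as np

def cholesky_left_looking(A):
    """Dense column Cholesky, left-looking: column j receives all cmod(j,k),
    k < j, just before its own cdiv(j)."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    L = np.zeros_like(A)
    for j in range(n):
        col = A[j:, j].copy()
        for k in range(j):                     # cmod(j, k)
            col -= L[j, k] * L[j:, k]
        L[j:, j] = col / np.sqrt(col[0])       # cdiv(j)
    return L

def cholesky_right_looking(A):
    """Dense column Cholesky, right-looking: after cdiv(k), column k
    immediately performs cmod(j, k) on every later column j."""
    L = np.tril(np.asarray(A, dtype=float))
    n = L.shape[0]
    for k in range(n):
        L[k:, k] /= np.sqrt(L[k, k])           # cdiv(k)
        for j in range(k + 1, n):              # cmod(j, k)
            L[j:, j] -= L[j, k] * L[j:, k]
    return L

A = np.array([[4.0, 1.0, 1.0], [1.0, 3.0, 0.0], [1.0, 0.0, 2.0]])
print(np.allclose(cholesky_left_looking(A), cholesky_right_looking(A)),
      np.allclose(cholesky_right_looking(A) @ cholesky_right_looking(A).T, A))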

Illustration of left- and right-looking approaches
[Figure: in the left-looking variant the columns to the left of column j are used to modify it; in the right-looking variant column k modifies the columns to its right.]

Supernodes
Definition 16. A supernode is a set of contiguous columns in the factor L that share essentially the same sparsity structure.

• All column algorithms (ordering, symbolic factorization, factorization, solve) generalize to blocked versions. • Use of efficient matrix-matrix kernels (improved cache usage). • Same concept as supervariables for the elimination tree / minimum degree ordering. • Generalizes to LU.

5.2 Elimination tree and Multifrontal approach Elimination tree and multifrontal approach We recall that: • The elimination tree expresses dependencies between the various steps of the factorization. • It also exhibits parallelism arising from the sparse structure of the matrix.

Building the elimination tree

• Perform the symbolic factorization of the matrix permuted to reduce fill-in, PAP^T. The filled matrix is A_F = L + L^T where PAP^T = LL^T. The elimination tree results from an analysis of the structure of A_F. • Each column corresponds to a node of the graph. Each node k of the tree corresponds to the factorization of a frontal matrix whose row structure is that of column k of A_F.

• If there is no nonzero to the left of the diagonal in row j then node j is a leaf.

[Figure: row j has no entry to the left of the diagonal, so j is a leaf.]

• Node j is the father of k in the tree if and only if j is the row index of the first nonzero below the diagonal in column k.

[Figure: the first subdiagonal nonzero of column k lies in row j, so j is the father of k.]

Inclusion property:

• For every node i of the elimination tree, let {i, i_1, i_2, ..., i_f} be its column structure.
• If i is a son of a node j with column structure {j, j_1, j_2, ..., j_p},
• then {i_1, i_2, ..., i_f} ⊂ {j, j_1, j_2, ..., j_p}.

Example

[Figure: a 6 × 6 symmetric matrix, already permuted and with fill-in included; F2 marks fill-in created by the elimination of variable 2 and F4 fill-in created by the elimination of variable 4.]

• Nodes 1, 2 and 3 are leaves. • Nonzeros (4,1) and (4,2) are first in their columns. 4 is then father of 1 and 2. • Similarly, (5,3) and (5,4) imply that 5 father of 3 and 4 • (6,5) implies 6 father of 5.

• Build the list of indices for each frontal matrix:
[Figure: elimination tree with frontal index lists; root 6 (front {6}); its child 5 (front {5, 6}); children of 5: node 4 (front {4, 5, 6}) and the leaf 3 (front {3, 5, 6}); children of 4: the leaves 1 (front {1, 4}) and 2 (front {2, 4, 5}).]

Amalgamation

• Goal – Exploit a more regular structure in the original matrix – Decrease the amount of indirect addressing – Increase the size of frontal matrices

• How? – Relax the number of nonzeros of the matrix – Amalgamation of nodes of the elimination tree

59 • Consequences? – Increase in the total amount of flops – But decrease of indirect addressing – And increase in performance

• Remark: if a node i with structure {i, i_1, ..., i_f} is a son of node j with structure {j, j_1, ..., j_p} and {j, j_1, ..., j_p} ⊂ {i_1, ..., i_f}, then the amalgamation of i and j is without fill-in.

Amalgamation: illustration
[Figure: the original matrix of the previous example and its elimination tree (root 6; node 5 with front {5, 6}; node 4 with front {4, 5, 6}; leaves with fronts {1, 4}, {2, 4, 5}, {3, 5, 6}).]

Amalgamation and Supervariables
Amalgamation of supervariables does not cause fill-in.
[Figure: initial graph on 13 vertices.]
Reordering: 1, 3, 4, 2, 6, 8, 10, 11, 5, 7, 9, 12, 13
Supervariables: {1, 3, 4}; {2, 6, 8}; {10, 11}; {5, 7, 9, 12, 13}

[Figure: two amalgamations of the example tree, one WITHOUT fill-in and one WITH fill-in; in the second, the nodes with fronts {3, 5, 6} and {4, 5, 6} are merged into a front {3, 4, 5, 6} and the fill-in entries (3,4) and (4,3) are created.]

Illustration of multifrontal factorization
We assume that the pivots are chosen down the diagonal in order.
[Figure: filled matrix and elimination graph; node 4 is the root, node 3 (front {3, 4}) is its child, and the leaves 1 (front {1, 3, 4}) and 2 (front {2, 3, 4}) are children of 3.]
Treatment at each node:
• Assembly of the frontal matrix using the contributions from the sons.
• Gaussian elimination on the frontal matrix.

• Elimination of variable 1 (pivot a_11)
  – Assembly of the frontal matrix, with rows and columns {1, 3, 4}: row and column 1 are full; rows and columns 3 and 4 only contain the original entries a_31, a_41, a_13, a_14.
  – Contributions a_{ij}^{(1)} = −(a_{i1} a_{1j}) / a_{11}, i > 1, j > 1, on a_33, a_44, a_34 and a_43:

    a_33^{(1)} = −(a_31 a_13)/a_11    a_34^{(1)} = −(a_31 a_14)/a_11
    a_43^{(1)} = −(a_41 a_13)/a_11    a_44^{(1)} = −(a_41 a_14)/a_11

  The terms −(a_{i1} a_{1j})/a_{11} of the contribution matrix are stored for later updates.

• Elimination of variable 2 (pivot a_22)
  – Assembly of the frontal matrix, with rows and columns {2, 3, 4}: the elements of the pivot row and column are updated using contributions from previous updates (none here).
  – Contributions on a_33, a_34, a_43 and a_44:

    a_33^{(2)} = −(a_32 a_23)/a_22    a_34^{(2)} = −(a_32 a_24)/a_22
    a_43^{(2)} = −(a_42 a_23)/a_22    a_44^{(2)} = −(a_42 a_24)/a_22

• Elimination of variable 3
  – Assembly of the frontal matrix (rows and columns {3, 4}), updating with the previous contributions:

    a'_33 = a_33 + a_33^{(1)} + a_33^{(2)}
    a'_34 = a_34 + a_34^{(1)} + a_34^{(2)}    (a_34 = 0)
    a'_43 = a_43 + a_43^{(1)} + a_43^{(2)}    (a_43 = 0)
    a'_44 = a_44^{(1)} + a_44^{(2)},          stored as a so-called contribution matrix.

  Note that a'_44 is only partially summed, since contributions are transferred only between a son and its father.
  – Contribution on a_44: a_44^{(3)} = a'_44 − (a'_43 a'_34)/a'_33.
• Elimination of variable 4
  – The frontal matrix involves only a_44: a_44 = a_44 + a_44^{(3)}.

Representation using the elimination tree gives: • order for eliminating the pivots • dependencies arising from contributions • parallelism arising from the sparsity In the example 1 and 2 can be eliminated simultaneously (no contribution of elimination of 1 on 2).
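The update formulas above can be checked numerically. Below is a small numpy sketch (the numerical values are made up; only the sparsity pattern of the example is respected) that reproduces the multifrontal contributions and compares the final pivot a_44 with the one obtained by dense Gaussian elimination:

import numpy as np

# Made-up values on the pattern of the example: a12 = a21 = 0 (1 and 2 are
# independent leaves) and a34 = a43 = 0 (these entries are fill-in).
A = np.array([[4.0, 0.0, 1.0, 2.0],
              [0.0, 5.0, 1.0, 1.0],
              [1.0, 1.0, 6.0, 0.0],
              [2.0, 1.0, 0.0, 7.0]])

# Contribution blocks of the leaves (fronts {1,3,4} and {2,3,4}).
c1 = {(i, j): -A[i, 0] * A[0, j] / A[0, 0] for i in (2, 3) for j in (2, 3)}
c2 = {(i, j): -A[i, 1] * A[1, j] / A[1, 1] for i in (2, 3) for j in (2, 3)}

# Front {3,4}: assemble the original entries with the two contribution blocks.
a33 = A[2, 2] + c1[(2, 2)] + c2[(2, 2)]
a34 = A[2, 3] + c1[(2, 3)] + c2[(2, 3)]          # original a34 is zero
a43 = A[3, 2] + c1[(3, 2)] + c2[(3, 2)]
a44_partial = c1[(3, 3)] + c2[(3, 3)]            # original a44 not yet assembled
c3 = a44_partial - a43 * a34 / a33               # contribution block of node 3

# Root front {4}: assemble the original a44 with the contribution of node 3.
a44 = A[3, 3] + c3

# Same pivot obtained by dense elimination (Schur complement of the leading block).
a44_dense = A[3, 3] - A[3, :3] @ np.linalg.solve(A[:3, :3], A[:3, 3])
print(abs(a44 - a44_dense))   # ~1e-16: multifrontal and dense elimination agree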

Parallelization
[Figure: the elimination tree of the example; root 4, node 3 (front {3, 4}), and the leaves 1 (front {1, 3, 4}) and 2 (front {2, 3, 4}).]

• 2 levels of parallelism – Arising from sparsity : between nodes of the elimination tree first level of parallelism – Within each node: dense LU factorization second level of parallelism

Other features • Dynamic management of parallelism – Pool of tasks for exploiting the two levels of parallelism • Dynamic management of data – Storage of LU factors, frontal and contribution matrices – Amount of memory available may conflict with exploiting maximum parallelism

62 Parallelism at node level

• Assembly : sparse SAXPY • Gaussian elimination on fully-summed variables – Blocked LU Factorization LU – Efficiency depends on the size • Use of parallel BLAS type of operations • When moving toward the root of tree – Size of frontal matrices increases – Parallelism from the tree decreases • Exploiting the second level of parallelism is crucial

Multifrontal factorization

Computer         #procs   MFlops (speed-up) (1)   MFlops (speed-up) (2)
Alliant FX/80    8        15 (1.9)                34 (4.3)
IBM 3090J/6VF    6        126 (2.1)               227 (3.8)
CRAY-2           4        316 (1.8)               404 (2.3)
CRAY Y-MP        6        529 (2.3)               1119 (4.8)

Performance summary of the multifrontal factorization on matrix BCSSTK15. In column (1), only the parallelism from the tree is exploited. In column (2), the two levels of parallelism are combined.

5.3 Sparse LDL^T factorization
Sparse LDL^T factorization of a sparse symmetric matrix A, where: • L is lower triangular, • D is a block diagonal matrix with 1×1 and 2×2 blocks.

→ sparse: symmetric ordering to decrease fill-in in the factors. → general symmetric: preprocessing techniques to improve the numerical behavior of the factorization (ongoing work by Duff, Gilbert, Pralet).

Sparse LDL^T
Example of criteria to limit the growth factor to 1/u:
• 1×1 pivot: |a_kk| ≥ u × max_i |a_ki|
• 2×2 pivot P = [ a_kk  a_kl ; a_lk  a_ll ], selected with the Duff and Reid algorithm:

    |P^{-1}| ( max_i |a_ki| ; max_j |a_lj| )^T ≤ ( 1/u ; 1/u )^T

OXO and TILES management

• Independently of the numerical values, structural blocks of zeros (so-called oxo and tile 2×2 pivots) could be exploited with great benefit. • However, they are complex to exploit in an automatic way during the analysis and/or the factorization.

5.4 Multifrontal QR factorization Sparse QR factorization

[Figure: Householder QR on a sparse matrix; at each step the transformation modifies zeros (•) and entries (∗) in the rows it touches, while × marks original entries. Some of the zeros that become nonzero remain in R, others only appear during the computation.]

• zero which becomes nonzero → Fill-in

• fill-in that “appears” and “disappears” during the computation → transient fill-in

• fill-in that will remain in the R factor → permanent fill-in

Reducing Permanent fill-in and Transient fill-in PERMANENT FILL-IN −→ Prediction of the structure of R. ( depends on the column ordering of the matrix A.)

[Figure: the structure of R predicted for A and for AP, where P reverses the column order 1 2 3 4 into 4 3 2 1; the second column ordering produces much less permanent fill-in.]

How to compute a “good” column ordering?

BECAUSE, mathematically, A^TA = R^TQ^TQR = R^TR = LL^T ⟹ R = L^T,
and, structurally, if A has the Strong Hall Property ⟹ Struct(R) = Struct(L^T),

THEN, if A has the SHP, the column ordering is given by a symmetric permutation of A^TA [P^TA^TAP = (AP)^T(AP)].
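A small scipy sketch of this idea, using reverse Cuthill-McKee as a stand-in for the symmetric fill-reducing ordering of A^TA (the notes suggest minimum degree; RCM is chosen here only because it ships with scipy, and the matrix is a made-up example):

import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee

# Hypothetical sparse rectangular matrix A: only its nonzero pattern matters here.
A = sp.random(30, 12, density=0.15, random_state=0, format='csc')

ata = (A.T @ A).tocsr()                         # structure of A^T A
perm = reverse_cuthill_mckee(ata, symmetric_mode=True)

AP = A[:, perm]                                 # column-permuted matrix used for QR
print("column ordering for A:", perm)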

64 Transient fill-in Depends on the order of the orthogonal transformations.

[Figure: two sequences of Householder eliminations on the same matrix; depending on the order of the orthogonal transformations, different zero entries are temporarily modified (transient fill-in). Legend: × original entry, • modified zero, ∗ modified entry.]

The multifrontal QR factorization: 3-step algorithm

1. Analysis: compute a column permutation (minimum degree) and the elimination tree (based on the structure of A^TA).

2. Factorization → the Q and R factors [A = QR] (based on the elimination tree and on full QR factorizations).

3. Solve → the solution x [Rx = Q^Tb] (based on the elimination tree and on full operations).

The multifrontal factorization steps

[Figure: a 10 × 6 overdetermined matrix and its elimination tree; at each node the son contributions are assembled with the original rows, a full QR factorization of the frontal matrix is performed, the R rows are kept and the remaining rows are stacked as the contribution block for the father.]

Strategies for assembling the son contribution blocks
[Figure: three strategies (Strategy 1, 2 and 3) for building the frontal matrix of a father node from its original data and the contribution blocks C1, C2 and triangular factors R1, R2 of its two son nodes; the strategies differ in how much of the original data and of the sons' rows is carried into the father front.]

[Figure: comparison of the three strategies; memory for the Q factors and for the reals (MWords), and factorization time (seconds) on 1 and 8 processors, for the matrices medium, medium2, Nfac70 and Nfac80.]

5.5 Task partitioning and scheduling Task scheduling on shared memory computers

• T_col(j): the updates on column j, T_col(j) = {cmod(j, k) | k ∈ Struct(L_{j*})} ∪ {cdiv(j)} • Dynamic scheduling of the tasks (pool of “ready” tasks). • Each processor selects a task (the order can influence the performance). • Example of a “good” topological ordering (w.r.t. time):

[Figure: an elimination tree whose nodes are numbered level by level from the leaves up to the root 18; a topological ordering that is good with respect to time.]

Ordering not so good in terms of multifrontal working memory.

Subtree to subcube mapping

• Recursively partition the processors “equally” between the sons of a given node. • Initially all processors are assigned to the root node. • Good at localizing communication, but not so easy if there is no overlapping between the processor partitions at each step.

67 1,2,3,4,5 18

1,2,3 16 4,5 17 1 2,3 4 4,5 11 12 13 14 4 5 1 1 2 3 3 4 4 1 2 3 4 5 6 7 9 10

Mapping of the tasks onto the 5 processors

Mapping of the tree onto the processors

Objective: find a layer L0 such that the subtrees rooted at L0 can be mapped onto the processors with a good balance.
• Decomposition of the tree into levels L0, L1, L2, L3, ...
[Figure: steps A, B and C of the construction of the layer L0 and of the subtree roots.]
The mapping of the top of the tree can be dynamic. This can be useful for both shared and distributed memory codes.

5.6 Distributed memory sparse solvers Computational strategies for parallel direct solvers

• The parallel algorithm is characterized by: – Computational graph dependency – Communication graph • Three classical approaches 1. “Fan-in” 2. “Fan-out” 3. “Multifrontal”

Assumptions and Notations

• Assumptions :

– We assume that each column of L is mapped onto a processor. – P is the number of processors. – Each processor is in charge of computing cdiv(j) for the columns j that it owns.

• Notations :

– mycols(p) is the set of columns owned by processor p. – map(j) gives the processor owning column j.

– procs(L∗k) = {map(j) | j ∈ Struct(L∗k)} (only processors in procs(L∗k) require updates from column k).

Fan-in variant
Demand-driven algorithm: the data required are aggregated update columns computed by the sending processor.

Fan-in(p)
  For j = 1 to n do
    if j ∈ mycols(p) OR Struct(L_{j*}) ∩ mycols(p) ≠ ∅ then
      u = 0
      For k ∈ Struct(L_{j*}) ∩ mycols(p) do
        u = u + cmod(j, k)
      If map(j) ≠ p then
        Send u to map(j)
      Else
        incorporate u in column j
        While aggregated updates for j still need to be received
          receive an aggregate and incorporate it into column j
        cdiv(j)

[Figure: left-looking pattern; columns (1), (2), (3) are used to modify column (4).]

Algorithm (Cholesky):
  For j = 1 to n do
    For k in Struct(L_{j,*}) do
      cmod(j, k)
    Endfor
    cdiv(j)
  Endfor

If map(1) = map(2) = map(3) = p and map(4) ≠ p, (only) one message is sent by p to update column 4 → the fan-in variant exploits the data locality of the subtree-to-subcube mapping.

Fan-out variant
Data-driven algorithm.

Fan-out(p)
  For j ∈ mycols(p) do
    if j is a leaf node in T(A) then
      cdiv(j)
      send L_{*j} to procs(L_{*j})
      mycols(p) = mycols(p) − {j}
  While mycols(p) ≠ ∅ do
    receive any column (say L_{*k}) of L
    For j ∈ Struct(L_{*k}) ∩ mycols(p) do
      cmod(j, k)
      If column j is completely updated then
        cdiv(j)
        send column j to procs(L_{*j})
        mycols(p) = mycols(p) − {j}

[Figure: right-looking pattern; column (1) updates columns (2), (3), (4).]

Algorithm (Cholesky):
  For k = 1 to n do
    cdiv(k)
    For j in Struct(L_{*,k}) do
      cmod(j, k)
    Endfor
  Endfor

If map(2) = map(3) = p and map(4) ≠ p, then 2 messages (for columns 2 and 3) are sent by p to update column 4.
Properties of fan-out: • Historically the first variant implemented. • Incurs greater interprocessor communication than the fan-in (or multifrontal) approach, both in terms of – total number of messages – total volume. • Does not exploit the data locality of the subtree-to-subcube mapping. • Improved algorithm (local aggregation): – send aggregated update columns instead of individual factor columns for columns mapped on a single processor. – Improves the exploitation of subtree-to-subcube data locality. – But the memory increase to store the aggregates can be critical.

Multifrontal variant
[Figure: right-looking updates restricted to the frontal matrix, and the corresponding elimination tree.]

Algorithm:
  For k = 1 to n do
    Build the full frontal matrix with all indices in Struct(L_{*,k})
    Partial factorization
    Send the contribution block to the father
  Endfor

Tree mapping in the distributed memory multifrontal solver MUMPS
[Figure: the bottom subtrees are mapped onto single processors (P0-P3); higher nodes use a 1D distribution over several processors; the root is factorized with a 2D distribution.]

Node-level parallelism and numerical pivoting in the multifrontal MUMPS solver
[Figure: the fully-summed rows of the father frontal matrix F are held by its master process (master_F); the non-fully-summed (NE) rows of the sons S1 and S2 are distributed over slave processes (P11, P12, P21, P22) by the son masters master_S1 and master_S2.]

5.7 Comparison between 3 approaches for LU factorization Case study on a sequential computer We compare three general approaches for sparse LU factorization • General technique • Frontal method • Multifrontal approach In a separate section a distributed memory multifrontal and a supernodal approaches will be compared

Description of the 3 approaches

• General technique – Numerical and sparsity pivoting performed at the same time – Dynamic sparse data structures • Frontal method – Extension of band or variable-band schemes – No indirect addressing is required in the innermost loop (data are stored in full matrices) • Multifrontal approach – Extension of frontal method

71 – Sparsity pivoting is separated from numerical pivoting. – Full matrices are used in the innermost loops. Compared to frontal schemes: -Assembling of full matrices handling data is more complex / costly -Preserve in a better way the sparse structure

Properties of the 3 approaches

• General technique Good preservation of sparsity : local decision influenced by numerical choices. • Frontal method Simple data structure, fast methods, easier to implement. Very popular, easy out-of-core implementation. Sequential by nature • Multifrontal approach Complex to implement Should be a compromise between general sparse and frontal approaches, Parallelism from the sparsity.

Note that with a left- or right-looking supernodal approach, a behaviour “similar” to the multifrontal approach should be expected.

Frontal method
[Figure: the matrix A with pivots a_11, a_22, ..., a_jj down the diagonal.]
Step 1 (elimination of a_11): the frontal matrix holds the values needed for the elimination of a_11; the update values f_ij = (a_i1 × a_1j)/a_11 are generated.
Step j (elimination of a_jj, j = 2, ...): the frontal matrix holds the updates from the previous steps and, additionally, row j and column j of the original matrix.

• Properties: the band is treated as full → an efficient reordering to minimize the bandwidth is crucial.

Factorization of a general matrix with a frontal method
[Figure: Step 1 eliminates a_11, Step 2 eliminates a_22; at each step the frontal matrix is treated as a full matrix.]

Multifrontal methods

[Figure: comparison with the frontal method on a 3-block matrix; in the multifrontal method, considering frontal matrices is sufficient and several fronts move ahead simultaneously (the treatment of blocks 1 and 2 is independent); each frontal matrix is considered as full.]

Characteristics: • More complex data structures. • Usually more efficient for preserving sparsity than frontal techniques • Parallelism arising from sparsity.

Illustration : comparison between 3 software for LU

• General approach (MA38, Davis and Duff): – Control of fill-in: Markowitz criterion – Numerical stability: partial pivoting – Numerical and sparsity pivoting are performed in one step • Multifrontal method (MA41, Amestoy and Duff): – Reordering before numerical factorization – Minimum degree type of reordering – Partial pivoting for numerical stability

• Frontal method (MA42, Duff and Scott): – Reordering before numerical factorization – Reordering for decreasing bandwidth – Partial pivoting for numerical stability

• All these software (MA38, MA41, MA42) are available in HSL-Library.

Test problems from Harwell-Boeing and Tim Davis (Univ. Florida) collections.

Execution times on a SUN

Matrix          Order    Nb of nonzeros   Description
Orani678.rua    2526     90158            Economic modelling
Onetone1.rua    36057    341088           Harmonic balance method
Garon2.rua      13535    390607           2D Navier-Stokes
Wang3.rua       26064    177168           3D simulation of semiconductor
mhda416.rua     416      8562             Spectral problem in hydrodynamics
rim.rua         22560    1014951          CFD nonlinear problem

Matrix (Flops unit)     Method         Flops   Size of factors (×10^6 words)   Time (seconds)
Onetone1.rua (×10^9)    General        2       5                               59
                        Frontal        19      115                             6392
                        Multifrontal   8       10                              193
Garon2.rua (×10^8)      General        40      8                               95
                        Frontal        20      9                               86
                        Multifrontal   4       2                               8
mhda416.rua (×10^5)     General        24      0.16                            0.80
                        Frontal        3       0.02                            0.07
                        Multifrontal   16      0.04                            0.11

5.8 Iterative refinement Iterative refinement

• Suppose that we have performed an (exact, partial, or approximate) factorization of A (LU, LDL^T or QR). • Let x* be the exact solution and B be a factorization of A. Since Ax = b can be written Bx = (B − A)x + b, we can build the iterative scheme x_{p+1} = (I − B^{-1}A)x_p + B^{-1}b. Since x_{p+1} − x* = (I − B^{-1}A)(x_p − x*) = (I − B^{-1}A)^{p+1}(x_0 − x*), the scheme converges if ρ(I − B^{-1}A) < 1. • Finally, since r_p = b − Ax_p, we obtain x_{p+1} = x_p + B^{-1}r_p.

Iterative refinement for linear systems

1. Compute r = b − Ax̃.

2. Solve B δx = r (B = LU, LDL^T or QR). 3. Update x̃ = x̃ + δx. 4. If the componentwise scaled backward error ω is small enough (w.r.t. the machine precision ε) or does not decrease by at least a factor of two, then stop; else go to 1.
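A short numpy/scipy sketch of this loop, using scipy's dense LU as the factorization B (the matrix, the iteration limit and the tolerance are illustrative):

import numpy as np
from scipy.linalg import lu_factor, lu_solve

def refine(A, b, max_it=5, eps=np.finfo(float).eps):
    lu, piv = lu_factor(A)                              # B = LU (dense factorization)
    x = lu_solve((lu, piv), b)
    omega_old = np.inf
    for _ in range(max_it):
        r = b - A @ x                                   # step 1
        dx = lu_solve((lu, piv), r)                     # step 2
        x = x + dx                                      # step 3
        # componentwise scaled backward error, as defined later in Section 6.2
        omega = np.max(np.abs(r) / (np.abs(A) @ np.abs(x) + np.abs(b)))
        if omega <= eps or omega > omega_old / 2:       # step 4
            break
        omega_old = omega
    return x

A = np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])
print(refine(A, b))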

Iterative refinement for linear least-squares

1. Compute ( p ; g ) = ( b ; 0 ) − [ I  A ; A^T  0 ] ( r ; x )

2. Solve [ I  A ; A^T  0 ] ( δr ; δx ) = ( p ; g )

3. Correct the residual and the solution: ( r ; x ) = ( r ; x ) + ( δr ; δx )

4. If ω ≤ ε or ω does not decrease by at least a factor of two, then stop; else go to 1.

5.9 Concluding remarks Conclusions

• Key parameters in selecting a method 1. Characteristics of the matrix – Numerical properties and pivoting. – Symmetric or general – Pattern and density 2. Preprocessing of the matrix – Scaling – Reordering for minimizing fill-in 3. Target computer (architecture) • Substantial gains can be achieved with an adequate solver: in terms of numerical precision, com- puting and storage • Good knowledge of matrix and solvers

Solving sparse linear least-squares problems min_x ‖Ax − b‖_2

1. Normal Equations (NE): A^TAx = A^Tb. Analysis of A^TA; factorization A^TA = R^TR; solve R^TRx = A^Tb.

2. Orthogonal factorization (QR): Ax ≈ b. Analysis of A^TA; factorization A = QR; solve Rx = Q^Tb.

3. SemiNormal Equations (SNE): A^TAx = A^Tb. Analysis of A^TA; factorization A = QR; solve R^TRx = A^Tb.

4. Augmented System (AU): [ I  A ; A^T  0 ] ( r ; x ) = ( b ; 0 ). Analysis of the augmented matrix; factorization [ I  A ; A^T  0 ] = LDL^T; solve Lz = ( b ; 0 ) and L^Ty = D^{-1}z.

5. Gauss factorization (LU): the same augmented system. Analysis of the augmented matrix; factorization [ I  A ; A^T  0 ] = LU; solve LUz = ( b ; 0 ).

6 Comparison of two sparse solvers

3 phase approaches for direct solvers

1. Analysis and preprocessing

- Improvement of numerical properties - Fill-in reduction, - Computation of the dependency graph, - Static mapping of tasks and data 2. Factorization of the preprocessed matrix

- A symmetric: A = LL^T or A = LDL^T - A unsymmetric: A = LU, or PAQ = LU (unsymmetric pivoting), or A = QR (overdetermined, linear least-squares) - Synchronous/asynchronous algorithm - Static/dynamic scheduling.

3. Solve and iterative refinement

6.1 Main features of MUMPS and SuperLU solvers Distributed memory sparse direct codes

Code       Technique         Scope      Availability (www.)
CAPSS      Multifrontal LU   SPD        netlib.org/scalapack
MUMPS      Multifrontal      SYM/UNS    enseeiht.fr/apo/MUMPS
PaStiX     Fan-in            SPD        see caption§
PSPASES    Multifrontal      SPD        cs.umn.edu/~mjoshi/pspases
SPOOLES    Fan-in            SYM/UNS    netlib.org/linalg/spooles
SuperLU    Fan-out           UNS        nersc.gov/~xiaoye/SuperLU
S+         Fan-out†          UNS        cs.ucsb.edu/research/S+
WSMP‡      Multifrontal      SYM        IBM product

§ dept-info.labri.u-bordeaux.fr/~ramet/pastix
‡ Only object code for IBM is available. No numerical pivoting performed.

Shared memory sparse direct codes

Code        Technique            Scope      Availability (www.)
MA41        Multifrontal         UNS        cse.clrc.ac.uk/Activity/HSL
MA49        Multifrontal QR      RECT       cse.clrc.ac.uk/Activity/HSL
PanelLLT    Left-looking         SPD        Ng
PARDISO     Left-right looking   UNS        Schenk
PSL†        Left-looking         SPD/UNS    SGI product
SPOOLES     Fan-in               SYM/UNS    netlib.org/linalg/spooles
SuperLU     Left-looking         UNS        nersc.gov/~xiaoye/SuperLU
WSMP‡       Multifrontal         SYM/UNS    IBM product

† Only object code for SGI is available

Factorization of symmetric matrices

[Figure: A = LL^T on a 4 × 4 example; the graph of L, the elimination tree and the assembly tree. The elimination tree is the transitive reduction of the directed acyclic graph of L.]

Factorization of unsymmetric matrices Factorization of Unsymmetric Matrices

[Figure: A = LU on a 4 × 4 example; the DAG of L, the DAG of U, and the elimination DAG of U (the transitive reduction of DAG(U)).]

MUMPS (Multifrontal sparse solver) P.R. Amestoy, I.S. Duff, J. Koster, J.-Y. L'Excellent 1. Analysis and Preprocessing • Preprocessing (max. transversal, scaling) • Fill-in reduction on A + A^T • Partial static mapping (elimination tree) 2. Factorization • Multifrontal (elimination tree of A + A^T), Struct(L) = Struct(U) • Partial threshold pivoting • Node parallelism - Partitioning (1D front, 2D root) - Dynamic distributed scheduling 3. Resolution and iterative refinement

Features: Real/complex Symmetric/Unsymmetric matrices; Distributed input; Assembled/Elemental format; Schur com- plement; multiple sparse right-hand-sides;

SuperLU (Gaussian elimination with static pivoting) X.S. Li and J.W. Demmel 1. Analysis and Preprocessing • Preprocessing (max. transversal, scaling) • Fill-in reduction on A + A^T • Static mapping on a 2D grid of processes

2. Factorization • Fan-out (elimination DAGs) • Static pivoting: if |a_ii| < √ε ‖A‖, set a_ii to √ε ‖A‖ • 2D irregular block-cyclic partitioning (based on the supernode structure) • Pipelining / BLAS3-based factorization 3. Resolution and iterative refinement

Features: Real and complex matrices ; Multiple right-hand-sides.

SuperLU: 2D block cyclic layout and data structures

[Figure: the global matrix is partitioned into blocks following the supernode structure and distributed over a 2D process mesh (processes 0-5) in a block-cyclic way; each process stores its blocks of a block column of L in a compressed structure (an index array holding the number of blocks, the leading dimension of nzval, and for each block its block number, number of full rows and row subscripts, plus the nzval array of values); a similar structure is used for U.]

Test problems

Real Unsymmetric Assembled (rua)
Matrix      Order    NZ         StrSym   Origin
bbmat       38744    1771722    54       R.-B. (CFD)
ecl32       51993    380415     93       EECS Dept. UC Berkeley
invextr1    30412    1793881    97       Parasol (Polyflow)
fidapm11    22294    623554     99       SPARSKIT2 (CFD)
garon2      13535    390607     100      Davis (CFD)
lhr71c      70304    1528092    0        Davis (Chem Eng)
lnsp3937    3937     25407      87       R.-B. (CFD)
mixtank     29957    1995041    100      Parasol (Polyflow)
rma1010     46835    2374001    98       Davis (CFD)
twotone     120750   1224224    14       R.-B. (circuit sim)

Real Symmetric Assembled (rsa)
bmwcra_1    148770   5396386    100      Parasol (MSC.Software)
cranksg2    63838    7106348    100      Parasol (MSC.Software)
inline_1    503712   18660027   100      Parasol (MSC.Software)

StrSym : structural symmetry; R.-B. : Rutherford-Boeing set.

SuperLU : MPI buffer size and performance

Nprocs       1    2     4    8     16    32    64    128
Grid size    29   33    36   41    46    51    57    64
unlimited    57   62    53   62    63    66    76    81
4 Kbytes     58   108   92   103   104   102   119   111

SuperLU factorization time (seconds) for cubic grid problems with MPI_BUFFER_MAX set to unlimited and to 4 Kbytes.

Asynchronous communications and MPI buffers
• On the T3E, messages fitting in the MPI internal buffer are copied immediately to the destination processor.
[Figure: case A, size(message) < MPI internal buffer: the asynchronous send (mpi_isend) copies the message into the MPI buffer and the destination later receives it with mpi_recv; case B, size(message) > MPI internal buffer: the destination must post an asynchronous receive into a user buffer (mpi_irecv) before the transfer can complete (mpi_wait).]

MUMPS: introducing asynchronous receive
• Time for factorization (cranksg2, 8 procs, ND):

                        MPI buffer size (bytes)
type of receive    0      512    1K     4K∗    64K    512K   2M
standard           37.0   37.4   38.3   37.6   32.8   28.3   26.4
immediate          27.3   26.5   26.6   26.4   26.2   26.2   26.4

∗ default value on the T3E.

6.2 Preprocessing and numerical Accuracy Impact of preprocessing and numerical issues

• Objective: Maximize diagonal entries of permuted matrix

• MC64 HSL (Harwell Sub. Lib.) code from Duff and Koster (1999) – Unsymmetric permutation and scaling

– Preprocessed matrix B = D1AQD2 is such that |bii| = 1 and |bij| ≤ 1 • Expectations :

– MUMPS : reduce NB of off-diagonal pivots and postponed var. (reduce numerical fill-in) – SuperLU : reduce NB of modified diagonal entries – Improve accuracy.

MC64 and the flops (×10^9) for the factorization (AMD ordering)

Matrix      MC64   StrSym   MUMPS       SuperLU
lhr71c      No     0        1431.0(∗)   –
            Yes    21       1.4         0.5
twotone     No     28       1221.1      159.0
            Yes    43       29.3        8.0
fidapm11    No     100      9.7         8.9
            Yes    29       28.5        22.0

(∗) Estimated during the analysis; – not enough memory to run the factorization.

Backward error analysis: Berr = max_i |r|_i / (|A| · |x| + |b|)_i

[Figure: componentwise backward error Berr of MUMPS and SuperLU on the test matrices, with and without MC64 preprocessing (logarithmic scale from 10^{-16} to 10^{0}).]

One step of iterative refinement generally leads to Berr ≈ ε Cost (1 step of iterative refinement) ≈ Cost (LUx = b − Ax)

6.3 Factorization cost study Factorization cost study

• SuperLU preserves/exploits better the sparsity/asymmetry than MUMPS. This results in ++ smaller size of factors (less memory) ++ fewer operations ++ more independency/parallelism −− Extra cost of taking into account asymmetry −− Smaller block-size for BLAS-3 kernels

Cost of preserving sparsity (time on T3E 4 Procs)

[Figure: ratios SuperLU/MUMPS of the size of the factors, the flops, the factorization time and the solve time on 4 processors of a T3E, for bbmat(50), fidapm11(46), twotone(43) and lhr71c(21); the number in parentheses is the structural symmetry.]

Nested Dissection versus Minimum Degree orderings (time on T3E 4 Procs)

[Figure: flops with the AMD and ND orderings for MUMPS and SuperLU, and factorization time ratio AMD/ND, for bbmat, ecl32, invextr1 and mixtank.]

Communication issues

Average Communication Volume on 64 Processors Average Message Size on 64 Processors 60 25 MUMPS (AMD) SuperLU (AMD) MUMPS (ND) SuperLU (ND) 50 20

40 MUMPS (AMD) SuperLU (AMD) 15 MUMPS (ND) SuperLU (ND) 30 Kbytes Mbytes

10

20

5 10

0 0 bbmat ecl32 invextr1 mixtank twotone bbmat ecl32 invextr1 mixtank twotone

6.4 Factorization/solve time study

Time Ratios of the numerical phases Time(SuperLU) / Time(MUMPS)

[Figure: ratios Time(SuperLU)/Time(MUMPS) for the factorization and solve phases on 4 to 512 processors, for bbmat, ecl32, invextr1, mixtank and twotone.]

Time (in seconds) of the numerical phases

[Figure: factorization and solve times (seconds) of MUMPS and SuperLU on 4 to 512 processors, for the same matrices.]

Performance analysis on 3-D grid problems (rectangular grids, nested dissection ordering)
[Figure: megaflop rate and parallel efficiency of MUMPS (symmetric and unsymmetric) and SuperLU on 1 to 128 processors.]

Summary

• Sparsity and Total memory -SuperLU preserves better sparsity -SuperLU (≈ 20%) less memory on 64 Procs (Asymmetry - Fan-out/Multifrontal) • Communication -Global volume is comparable -MUMPS : much smaller (/10) nb of messages • Factorization / Solve time -MUMPS is faster on nprocs ≤ 64 -SuperLU is more scalable

• Accuracy -MUMPS provides a better initial solution -SuperLU : one step of iter. refin. often enough

6.5 Recent improvements on MUMPS solver Recent improvements on MUMPS solver

• Improve static scheduling (based on Amestoy, Duff, and Voemel, SIMAX 2004) • Take into account unsymmetry (based on Amestoy and Puglisi, SIMAX 2002) • Improve scheduling – Scheduling to optimize memory usage (A. Guermouche and J.-Y. l’Excellent) – Scheduling for SMP computers (Thesis of S. Pralet) • Scheduling under memory constraints. (Collaborative work during the thesis of A. Guermouche and S. Pralet)

Improved scheduling in MUMPS solver

• During the preprocessing phase: 1. Bottom up processing of the task dependency graph (pre-list of authorized candidate processors per task) 2. Top down processing of the task dependency graph Choose candidate processors based on the pre-list • During numerical factorization : Scheduling of the task based on least loaded processors with the list of candidates

82 Experimental results Benefits resulting from using a new scheduling strategy:

[Figure: MUMPS LU on a cubic grid (ND ordering); T3E factorization time and estimated memory for the LU factors on 1 to 512 processors, for the original and the candidate-based schedulings.]

Better memory estimation and improved performance scalability

6.6 An unsymmetrized multifrontal approach An unsymmetrized multifrontal approach Outline: • The standard multifrontal approach for unsymmetric matrices • The new approach • Performance analysis

A standard multifrontal approach Illustration on a small matrix

[Figure: a 7 × 7 unsymmetric matrix A and the symmetrized matrix M = A + A^T; N marks the entries that are new in M with respect to A.]

Assembly tree

• The assembly tree is based on M_F = L + L^T, where L is the Cholesky factor of M.
[Figure: the filled matrix M_F; original nonzeros, N entries that are new in M, and F fill-in (w.r.t. M).]
• Struct(M_F) = Struct(L) + Struct(U) and Struct(L^T) = Struct(U)

83 Assembly tree and multifrontal factorization

[Figure: the matrix A and its assembly tree; leaves (1) and (2), node (3) with variables {3, 4, 5, 7}, and the root (4) with variables {5, 6, 7}; for each node the arrowhead of its pivot variable and the contribution block (L and U parts) are shown.]

A closer look at the assembly tree

[Figure: a closer look at the standard assembly tree; leaves (1) {1, 3, 5} and (2) {2, 4, 7}, node (3) {3, 4, 5, 7}, root (4) {5, 6, 7}.]

The assembly tree processed by the new algorithm

[Figure: the same assembly tree processed by the new (unsymmetrized) algorithm; the frontal matrices are smaller because rows and columns that are only structural in M = A + A^T (and not in A) are not carried along.]

Structure of the LU factors (standard algorithm vs. new algorithm)

[Figure: structure of the LU factors computed by the standard and by the new algorithm; the new algorithm avoids storing the entries that exist only in M = A + A^T (N) and creates less fill-in (F).]

1.1

1 0.965

0.9

0.8

0.7

0.6

0.5

0.4

0.3 Ratio new over standard version

0.2

0.1 total working space factorization time 0 2 4 8 9 15 15 17 19 20 21 43 43 48 49 50 56 56 57 64 65 77 77 78 87 94 95 sy sy sy sy

Structural Symmetry

Factorization time analysis

[Figure: ratio (new over standard version) of the factorization time, of the operations for the factors and of the operations for the assemblies, as a function of the structural symmetry.]

Correlation between factor size, maximum stack size, and total working space

85 1 0.96 0.9

[Figure: ratio (new over standard version) of the maximum size of the stack, of the space for the LU factors and of the total working space, as a function of the structural symmetry.]

Summary : Performance ratios New over Standard algorithms

                                       Space for                  Operations
                                       LU     Stack   Total       Elim    Assemb.   Time
0 ≤ Structural symmetry < 50
  mean                                 0.79   0.35    0.69        0.67    0.41      0.59
  median                               0.80   0.35    0.76        0.68    0.40      0.61
50 ≤ Structural symmetry < 80
  mean                                 0.93   0.85    0.93        0.89    0.82      0.86
  median                               0.95   0.91    0.95        0.93    0.89      0.90

7 GRID-TLSE

GRID-TLSE : a Web expertise site for sparse linear algebra http://www.enseeiht.fr/lima/tlse

• 3-year project funded by the ACI GRID program of the French Ministry of Research (Jan 2003 → Jan 2006) • Research labs: CERFACS and ENSEEIHT-IRIT (Toulouse), LaBRI (Bordeaux), LIP-ENS (Lyon) • Industrial partners: CNES, CEA, EADS, EDF, IFP • International links: LBNL-Berkeley, RAL, Parallab-Bergen, Univ. Florida, Univ. Minneapolis, Univ. Minnesota, Univ. Tennessee, Univ. San Diego, Univ. Indiana, . . .

Comparison of solvers:

• many solvers with numerous algorithmic parameters

• strong need for expertise from users: -Error analysis wrt threshold pivoting parameter. -Which ordering should I use on my problem ? -Which solver behaves best in terms of memory/time on my problem? -Can I use the preprocessing done by this solver on this other one ? . . .

• this currently requires installing solvers, coding, testing and analysing results

⇒ Grid Computing could be the way to:

• provide remote access to a set of solvers • automate answer to expertise requests

Goals of the project

• Design a Web expert site for sparse matrices (direct solvers), disseminate the expertise of our community • Provide interface to experiment software – public ... as well as commercial – sequential ... as well as parallel • User can submit her/his matrix or use matrix collections • Provide tools to add/update/validate services, design new scenarii.

Example of expertise request

• Assumption : The performance (time and memory used) of our solvers depends mostly on the choice of the ordering used. • Examples of request: – Memory required to factor a matrix – Error analysis as a function of the threshold pivoting value – Minimum time on a given computer to factor a given unsymmetric matrix ?

87 Grid computing : is it realistic ?

• Each request involves a large number of elementary requests (e.g. as many simultaneous executions of a sparse package as available orderings or more generally appropriate values of input parameters) • Choice of target computers depends on type of request: (Matrix availability, Memory requirement, CPU requirement, software availability, cost of computing time ...) • Grid of moderate size where each elementary request will run on one node (uni or multipro- cessor) of the Grid. • Time to answer is not so critical

7.1 Description of main components of the site Main components of the site

• Database of services (sparse direct solvers):

– SuperLU (LBNL), MUMPS (CERFACS, IRIT, LIP-ENS), TAUCS (Tel-Aviv University), UMFPACK (Univ Florida) PaStiX-SCOTCH (LaBRI), SPOOLES (Boeing), OBLIO (Old Dominion) . . . – HSL Library (RAL), PARDISO (ETH, Zurich), . . . – and others . . . • Database of matrices: – Sparse matrix collections : Davis(Univ. of Florida), Matrix Market, RAL-BOEING, PARA- SOL . . . – User-provided matrices (private or public) • Database of experimental results

Grid Infrastructure

• Use of tools developed within GRID-ASP project (LIP-ReMAP, LORIA-R´es´edas,LIFC-SDRP) : FAST, DIET • High-level administrator interface for the definition, the deployment, and the exploitation of services over a grid : Weaver • Interactive Web interface with the Grid: WebSolve • We do not provide computational resources, we just perform expertise (i.e. we may only report statistics on using various software on a matrix)

Software components

[Figure: architecture of the site; external and internal users and experts interact through WebSolve; Weaver drives the expertise scenarii and relies on the middleware (DIET/FAST) to run the solvers on the grid; databases hold the matrix collections (RAL-BOEING / Parasol), user-supplied matrices, bibliography, history, logfiles, and static and dynamic statistics.]

7.2 Running the Grid-TLSE demonstrator The Grid-TLSE prototype

• First presented at IPDPS ’03 (Nice)

• Illustrate main functionalities and understand difficulties

• Get feedback from users and sparse community to improve specifications and validate software/middleware choices.

• Conference Sparse days and Grid Comput., (St Girons June 03)

• RNTL conference in Grenoble (Oct 03), INRIA ReMap project. • Open to industrial partners (Nov 03). • SIAM conference on parallel processing, San Francisco (Feb. 2004).

Scenario : Ordering Sensitivity

• Algorithm 1/ Get orderings 2/ Obtain values of required metrics for each ordering • Metrics of type estimation 1 solver (get all internal orderings) ≥ 1 solver (get all possible orderings) • Metrics of type effective Factorisation also performed for each ordering

Scenario : Minimum Time

Algorithm

• Phase 1: get the permutations proposed by all solvers.
• Phase 2: for each permutation and each requested solver, perform a flops estimation and keep the best permutation per solver.
• Phase 3: for each solver, factorize with both the selected permutation and the internal default permutation, and report the statistics of the run with minimum time.

A sketch of these three phases is given below.
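A minimal Python sketch of the three phases, assuming each solver object exposes hypothetical get_permutation, estimate_flops and factorize methods (these names are illustrative, not the Weaver API):

```python
# Minimal sketch of the "minimum time" scenario; the solver methods
# (get_permutation, estimate_flops, factorize) are hypothetical.

def minimum_time(solvers, matrix):
    # Phase 1: collect the permutations proposed by every solver (analysis only).
    permutations = [s.get_permutation(matrix) for s in solvers]

    # Phase 2: for each solver, estimate the flops of every permutation and
    # keep the cheapest one.
    best = {s.name: min(permutations, key=lambda p: s.estimate_flops(matrix, p))
            for s in solvers}

    # Phase 3: factorize with both the selected permutation and the solver's
    # internal default, then report the statistics of the fastest run.
    report = {}
    for s in solvers:
        runs = [s.factorize(matrix, permutation=best[s.name]),
                s.factorize(matrix)]  # internal default ordering
        report[s.name] = min(runs, key=lambda r: r["time"])
    return report
```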

7.3 Description of the software components

Software components

[Figure: the same software-components diagram as in Section 7.1 (WebSolve, Weaver, database, DIET/FAST middleware, solvers on the grid).]

WebSolve provides a simple and intuitive Web interface:

• for users:
  – problem description;
  – request management;
  – graphical results;
• for experts:
  – user management;
  – request management;
  – description and testing of scenarii;
  – description and testing of solvers;
  – observation of the grid state.

Weaver manages descriptions of:

• solvers;
• architectures;
• scenarii.

It converts expertise requests into solver runs:

• apply the scenarii;
• look for suitable solvers and architectures (trader service).

It abstracts away from the specific middleware (currently DIET).

Weaver : Main problem

Solvers are quite complex:

• many algorithmic control parameters;
• many secondary results qualifying an execution;
• several functions: Analysis, Factorisation, Solve, ...;
• highly dependent on the underlying architecture.

All solvers share the same purpose, but their parameters and results can be quite different.

Solver profile: a property-based abstract description of all solvers

• name;
• algorithm (LU, QR, LDLt, ...);
• parameters (A, b, ...) and results (x, ...);
• controls (ordering, threshold pivoting, ...);
• metrics (flops, memory, precision, ...);
• architecture.

Each property can be qualified semantically (symmetric matrix, 1 GB of memory, ...). Critical: the profile must be easy to evolve (meta-description); a sketch is given below.
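As an illustration only, such a profile could be captured by a small record like the following Python dataclass; the field values shown for MUMPS are examples, not its actual meta-description in Grid-TLSE.

```python
# Sketch of a property-based solver profile; the field values are examples,
# not the actual Grid-TLSE meta-description.
from dataclasses import dataclass, field

@dataclass
class SolverProfile:
    name: str
    algorithm: str                                   # "LU", "QR", "LDLt", ...
    parameters: list = field(default_factory=list)   # A, b, ...
    results: list = field(default_factory=list)      # x, ...
    controls: dict = field(default_factory=dict)     # ordering, threshold, ...
    metrics: list = field(default_factory=list)      # flops, memory, precision
    architecture: str = ""

mumps = SolverProfile(
    name="MUMPS",
    algorithm="LU",
    parameters=["A", "b"],
    results=["x"],
    controls={"ordering": ["AMD", "METIS"], "threshold_pivoting": 0.01},
    metrics=["flops", "memory", "precision"],
    architecture="distributed memory",
)
```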

Solver signature: the description of a specific solver

• its parameters and results;
• a hierarchical description of its sub-functions (e.g., a solver decomposes into Analysis and Factorisation, each with its own parameters and controls, and its own results and metrics);
• the meaning of the profile properties for this solver.

Some properties may not be significant; some properties may be defined using other solvers.

Expertise requests: the problem defined by the user

• specify one (or more) expertise scenarii;
• provide values for some profile properties.

A hypothetical example of such a request is sketched below.
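For instance, an expertise request could look like the following Python dictionary; the property names and values are hypothetical and only illustrate the profile-based style of the request.

```python
# Hypothetical expertise request: a scenario plus values for some
# profile properties; the names are illustrative only.
expertise_request = {
    "scenario": "OrderingSensitivity",
    "matrix": "user_matrix.rb",              # user-supplied or from a collection
    "properties": {
        "algorithm": "LU",
        "symmetric": False,
        "memory": "1 GB",
    },
    "metrics": ["estimated_flops", "memory"],
}
```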

Weaver prototype

[Figures: two snapshots of the Weaver prototype. In the first, the Grid-TLSE user submits an expertise request through WebSolve; Weaver matches it against the available scenarii and services and launches the solvers on the grid through DIET. The second adds the intermediate objects handled by Weaver: each expertise request is decomposed into expertise steps, and each step into expertise runs on the solvers.]

Current expert work: writing Java classes to produce

• expertise steps from expertise requests (one class per scenario);
• solver runs from expertise steps (one per solver): from abstract properties to concrete parameters and results;
• solver runs from the results of a previous step (one per solver): idem;
• expertise graphs from expertise step results (one per metric).

A conceptual sketch of this expansion is given below.
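A conceptual sketch of that expansion, written here in Python for brevity (the actual expert classes are in Java, and the names below are hypothetical): an expertise request is expanded into expertise steps, each step into solver runs, and the step results into expertise graphs.

```python
# Conceptual sketch only: the real expert classes are written in Java and
# the names below (make_steps, make_run, make_graphs, ...) are hypothetical.

def expand_request(request, scenario, solvers):
    steps = scenario.make_steps(request)            # one or more expertise steps
    for step in steps:
        runs = [scenario.make_run(step, s)          # abstract properties ->
                for s in solvers                    # concrete parameters/results
                if s.matches(request)]
        step.results = [run.execute() for run in runs]
    return scenario.make_graphs(steps)              # one graph per requested metric
```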

Weaver purpose : reduce the amount of Java code written by the expert

Future expert work: XML-based description (Web-based editor) of

• solver profile;
• solver signature;
• links between profile and signature;
• rule-based scenarii (pattern-matching on expertise requests).

Important : Profile evolution does not require any change in previous scenarii, although it may improve expertise quality

7.4 Concluding remarks on GRID-TLSE project

Open issues

• Reasonable cost to add a solver, a scenario, some graphs?
• Scheduling improvements based on the hierarchical structure of signatures (persistence between runs)?
• Scheduling improvements based on expertise results (flops and memory estimations)?
• Interface to batch schedulers.
• Meaningful description of architectures.
• Expertise for iterative methods?
• Data-mining expertise from raw results?
