Lecture 10: Linear Algebra Algorithms
U of Tennessee. Slides are adapted from Jim Demmel, UCB's lecture on Linear Algebra Algorithms.

Outline
° Motivation for dense linear algebra
• Ax = b: Computational Electromagnetics
• Ax = λx: Quantum Chemistry
° Review of Gaussian Elimination (GE) for solving Ax = b
° Optimizing GE for caches on sequential machines
• using matrix-matrix multiplication (BLAS)
° LAPACK library overview and performance
° Data layouts on parallel machines
° Parallel matrix-matrix multiplication
° Parallel Gaussian Elimination
° ScaLAPACK library overview
° Eigenvalue problem

Parallelism in Sparse Matrix-Vector Multiplication
° y = A*x, where A is sparse and n x n
° Questions
• which processors store
- y[i], x[i], and A[i,j]
• which processors compute
- y[i] = sum (from 1 to n) of A[i,j] * x[j] = (row i of A) . x … a sparse dot product
° Partitioning
• Partition index set {1,…,n} = N1 ∪ N2 ∪ … ∪ Np
• For all i in Nk, processor k stores y[i], x[i], and row i of A
• For all i in Nk, processor k computes y[i] = (row i of A) . x
- "owner computes" rule: processor k computes the y[i]s it owns (see the sketch after this slide)
° Goals of partitioning
• balance load (how is load measured?)
• balance storage (how much does each processor store?)
• minimize communication (how much is communicated?)

Graph Partitioning and Sparse Matrices
° Relationship between matrix and graph
(figure: a 6x6 sparse matrix and the graph whose edges connect i and j wherever A[i,j] is nonzero)
° A "good" partition of the graph has
• equal (weighted) number of nodes in each part (load and storage balance)
• minimum number of edges crossing between parts (minimize communication)
° Can reorder the rows/columns of the matrix by putting all the nodes in one partition together
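A minimal sketch of the "owner computes" rule in Python; the helper name owner_computes_spmv and the contiguous partition are illustrative assumptions, not from the slides:

    import numpy as np
    from scipy.sparse import random as sparse_random

    def owner_computes_spmv(A, x, my_rows):
        # Processor k stores y[i], x[i], and row i of A for all i in Nk;
        # here we simulate that by slicing out the rows this "processor" owns.
        return A[my_rows, :] @ x   # each y[i] is a sparse dot product (row i of A) . x

    n, p = 6, 2
    A = sparse_random(n, n, density=0.4, format="csr", random_state=0)
    x = np.ones(n)
    # Partition the index set {0,...,n-1} into p contiguous pieces N1,...,Np
    parts = [list(range(k * n // p, (k + 1) * n // p)) for k in range(p)]
    y = np.concatenate([owner_computes_spmv(A, x, Nk) for Nk in parts])
    assert np.allclose(y, A @ x)

In a real distributed code each processor would also need the entries of x that its rows touch, which is exactly the communication the partitioning goals above try to minimize.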


More on Matrix Reordering via Graph Partitioning
° "Ideal" matrix structure for parallelism: (nearly) block diagonal
• p (number of processors) blocks
• few non-zeros outside these blocks, since these require communication
(figure: a block-diagonal matrix with rows distributed over processors P0 … P4)

What About Implicit Methods and Eigenproblems?
° Direct methods (Gaussian elimination)
• Called LU decomposition, because we factor A = L*U
• Future lectures will consider both dense and sparse cases
• More complicated than sparse matrix-vector multiplication
° Iterative solvers
• Will discuss several of these in future lectures
- Jacobi, Successive Overrelaxation (SOR), Conjugate Gradients (CG), Multigrid, ...
• Most have sparse matrix-vector multiplication as their kernel
° Eigenproblems
• Future lectures will discuss dense and sparse cases
• Also depend on sparse matrix-vector multiplication and direct methods
° Graph partitioning


Partial Differential Equations (PDEs)

Continuous Variables, Continuous Parameters
Examples of such systems include:
° Heat flow: Temperature(position, time)
° Diffusion: Concentration(position, time)
° Electrostatic or Gravitational Potential: Potential(position)
° Fluid flow: Velocity, Pressure, Density(position, time)
° Quantum mechanics: Wave-function(position, time)
° Elasticity: Stress, Strain(position, time)

Example: Deriving the Heat Equation
(figure: a 1D bar from 0 to 1, with neighboring points x-h, x, x+h)
° Consider a simple problem: a bar of uniform material, insulated except at the ends
° Let u(x,t) be the temperature at position x at time t
° Heat travels from x-h to x+h at a rate proportional to:

    d u(x,t)/dt = C * [ (u(x-h,t) - u(x,t))/h - (u(x,t) - u(x+h,t))/h ] / h

° As h → 0, we get the heat equation:

    d u(x,t)/dt = C * d²u(x,t)/dx²

Explicit Solution of the Heat Equation
° For simplicity, assume C = 1
° Discretize both time and position
° Use finite differences with u[j,i] as the heat at
• time t = i*dt (i = 0,1,2,…) and position x = j*h (j = 0,1,…,N = 1/h)
• initial conditions on u[j,0]
• boundary conditions on u[0,i] and u[N,i]
° At each timestep i = 0,1,2,...

    for j = 1 to N-1
        u[j,i+1] = z*u[j-1,i] + (1-2*z)*u[j,i] + z*u[j+1,i]
    where z = dt/h²

° This corresponds to
• matrix-vector multiply (what is the matrix?)
• nearest neighbors on the grid
(figure: space-time grid of u[j,i] for timesteps t = 0 through t = 5, starting from the initial row u[0,0] … u[5,0])
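The explicit update is a few lines of NumPy; a sketch, where the grid size, timestep, and initial heat spike are illustrative choices:

    import numpy as np

    N, steps = 100, 500
    h = 1.0 / N
    dt = 0.4 * h**2            # keep z = dt/h^2 <= 0.5 (stability, next slide)
    z = dt / h**2

    u = np.zeros(N + 1)        # u[j] ~ temperature at x = j*h at the current step
    u[N // 2] = 1.0            # initial condition: a heat spike in the middle
    for i in range(steps):
        # u[j,i+1] = z*u[j-1,i] + (1-2z)*u[j,i] + z*u[j+1,i] on interior points;
        # the boundary values u[0] and u[N] are held fixed at 0
        u[1:N] = z * u[0:N-1] + (1 - 2*z) * u[1:N] + z * u[2:N+1]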

Parallelism in Explicit Method for PDEs
° Partitioning the space (x) into p large chunks
• good load balance (assuming a large number of points relative to p)
• minimized communication (only p chunk boundaries)
° Generalizes to
• multiple dimensions
• arbitrary graphs (= sparse matrices)
° Problem with the explicit approach
• numerical instability
• solution blows up eventually if z = dt/h² > 0.5
• need to make the timesteps very small when h is small: dt < 0.5*h²

Discretization Error
° How accurate will the approximate solution be?
° Beyond the scope of this course.
° The discretization error is e = O(Δt) + O[(Δx)²]
° The fact that Δt appears to the first power and Δx to the second power is usually described by saying the discretization is first-order accurate in time and second-order accurate in space.
° For the discretization to be stable, Δt and Δx must satisfy the relationship

    Δt ≤ (Δx)² / 2


Instability in Solving the Heat Equation Explicitly
(figure: the computed solution oscillating and blowing up as the timesteps proceed when z > 0.5)

Implicit Methods
° The previous method was called explicit because the values u[j,i+1] at the next time level are obtained by an explicit formula in terms of the values at the previous time level:

    u[j,i+1] = z*u[j-1,i] + (1-2*z)*u[j,i] + z*u[j+1,i]

° Consider instead the difference approximation

    u[j,i+1] - u[j,i] = z*(u[j+1,i+1] - 2*u[j,i+1] + u[j-1,i+1])

• Similar in form, but with the important difference that the values of u[j] on the right are now evaluated at the (i+1)st time level rather than at the ith.
• Must solve equations to advance to the next time level.

Implicit Solution
° As with many (stiff) ODEs, need an implicit method
° This turns into solving the following equation:

    (I + (z/2)*T) * u[:,i+1] = (I - (z/2)*T) * u[:,i]

° Here I is the identity matrix and T is the tridiagonal matrix

        (  2 -1          )
        ( -1  2 -1       )
    T = (    -1  2 -1    )      Graph and "stencil": (-1) — (2) — (-1)
        (       -1  2 -1 )
        (          -1  2 )

° I.e., essentially solving Poisson's equation in 1D

2D Implicit Method
° Similar to the 1D case, but the matrix T now has 4 on the diagonal and -1 entries coupling each grid point to its four nearest neighbors. Graph and "stencil":

            -1
        -1   4  -1
            -1

° Multiplying by this matrix (as in the explicit case) is simply a nearest-neighbor computation on the 2D grid
° To solve this system, there are several techniques
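A sketch of one implicit timestep in 1D with SciPy, in the (I + (z/2)T) form written above; factoring the matrix once and reusing it every step is an assumption about usage, not something the slide prescribes:

    import numpy as np
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    N = 100
    h = 1.0 / N
    dt = h                     # implicit methods tolerate dt ~ h, not dt ~ h^2
    z = dt / h**2

    # 1D Poisson matrix T = tridiag(-1, 2, -1) on the interior points
    T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(N - 1, N - 1), format="csc")
    I = sp.identity(N - 1, format="csc")

    u = np.zeros(N - 1)
    u[N // 2] = 1.0
    B = I - (z / 2) * T
    solve = spla.factorized((I + (z / 2) * T).tocsc())   # factor once, reuse
    for i in range(100):
        u = solve(B @ u)       # (I + (z/2)T) u[:,i+1] = (I - (z/2)T) u[:,i]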

Algorithms for 2D Poisson Equation with N Unknowns

    Algorithm       Serial      PRAM            Memory     #Procs
    Dense LU        N^3         N               N^2        N^2
    Band LU         N^2         N               N^(3/2)    N
    Jacobi          N^2         N               N          N
    Explicit Inv.   N^2         log N           N^2        N^2
    Conj. Grad.     N^(3/2)     N^(1/2)*log N   N          N
    RB SOR          N^(3/2)     N^(1/2)         N          N
    Sparse LU       N^(3/2)     N^(1/2)         N*log N    N
    FFT             N*log N     log N           N          N
    Multigrid       N           log² N          N          N
    Lower bound     N           log N           N

PRAM is an idealized parallel model with zero-cost communication.

Short Explanations of Algorithms on Previous Slide
° Sorted in two orders (roughly):
• from slowest to fastest on sequential machines
• from most general (works on any matrix) to most specialized (works on matrices "like" T)
° Dense LU: Gaussian elimination; works on any N-by-N matrix
° Band LU: exploits the fact that T is nonzero only on the sqrt(N) diagonals nearest the main diagonal, so faster
° Jacobi: essentially does a matrix-vector multiply by T in the inner loop of an iterative algorithm
° Explicit Inverse: assume we want to solve many systems with T, so we can precompute and store inv(T) "for free", and just multiply by it
• It's still expensive!
° Conjugate Gradients: uses matrix-vector multiplication, like Jacobi, but exploits mathematical properties of T that Jacobi does not
° Red-Black SOR (Successive Overrelaxation): variation of Jacobi that exploits yet different mathematical properties of T
• Used in Multigrid
° Sparse LU: Gaussian elimination exploiting the particular zero structure of T
° FFT (Fast Fourier Transform): works only on matrices very like T
° Multigrid: also works on matrices like T, those that come from elliptic PDEs
° Lower Bound: serial (time to print the answer); parallel (time to combine N inputs)

Composite Mesh from a Mechanical Structure
(figure)

Converting the Mesh to a Matrix
(figure)


Effects of Ordering Rows and Columns on Gaussian Elimination
(figure)

Irregular Mesh: NASA Airfoil in 2D (direct solution)
(figure)


Irregular Mesh: Tapered Tube (multigrid)
(figure)

Adaptive Mesh Refinement (AMR)
(figure: adaptive mesh around an explosion)
° Adaptive mesh around an explosion
° John Bell and Phil Colella at LBL (see class web page for URL)
° Goal of Titanium is to make these algorithms easier to implement in parallel


Computational Electromagnetics
• Developed during the 1980s, driven by defense applications
• Determine the RCS (radar cross section) of an airplane
• Reduce the signature of a plane (stealth technology)
• Other applications are antenna design, medical equipment
• Two fundamental numerical approaches:
• MOM, methods of moments (frequency domain), and
• finite differences (time domain)

Computational Electromagnetics
- Discretize the surface into triangular facets using standard modeling tools
- Amplitudes of the currents on the surface are the unknowns
- The integral equation is discretized into a set of linear equations
(image: NW Univ. Comp. Electromagnetics Laboratory http://nueml.ece.nwu.edu/)


Computational Electromagnetics (MOM)
After discretization the integral equation has the form

    A x = b

where
    A is the (dense) impedance matrix,
    x is the unknown vector of amplitudes, and
    b is the excitation vector.

Computational Electromagnetics (MOM)
The main steps in the solution process are:

    Fill:        computing the matrix elements of A
    Factor:      factoring the dense matrix A
    Solve:       solving for one or more excitations b
    Field Calc:  computing the fields scattered from the object

(see Cwik, Patterson, and Scott, Electromagnetic Scattering on the Intel Touchstone Delta, IEEE Supercomputing '92, pp 538-542)

Analysis of MOM for Parallel Implementation

    Task         Work     Parallelism        Parallel Speed
    Fill         O(n^2)   embarrassing       low
    Factor       O(n^3)   moderately diff.   very high
    Solve        O(n^2)   moderately diff.   high
    Field Calc.  O(n)     embarrassing       high

Results for Parallel Implementation on Delta

    Task         Time (hours)
    Fill         9.20
    Factor       8.25
    Solve        2.17
    Field Calc.  0.12

The problem solved was for a matrix of size 48,672. (The world record in 1991.)


Current Records for Solving Dense Systems

    Year    System Size   Machine          # Procs   Gflops (Peak)
    1950's  O(100)
    1995    128,600       Intel Paragon      6768      281 (338)
    1996    215,000       Intel ASCI Red     7264     1068 (1453)
    1998    148,000       Cray T3E           1488     1127 (1786)
    1998    235,000       Intel ASCI Red     9152     1338 (1830)
    1999    374,000       SGI ASCI Blue      5040     1608 (2520)
    1999    362,880       Intel ASCI Red     9632     2379 (3207)
    2000    430,000       IBM ASCI White     8192     4928 (12000)
    2002    1,075,200     NEC Earth Sim      5120    35860 (41000)

source: Alan Edelman http://www-math.mit.edu/~edelman/records.html
LINPACK Benchmark: http://www.netlib.org/performance/html/PDSreports.html

Computational Chemistry
° Seek energy levels of a molecule, crystal, etc.
• Solve Schroedinger's Equation for energy levels = eigenvalues
• Discretize to get Ax = λBx, solve for eigenvalues λ and eigenvectors x
• A and B large, symmetric or Hermitian matrices (B positive definite)
• May want some or all eigenvalues/eigenvectors
° MP-Quest (Sandia NL)
• Si and sapphire crystals of up to 3072 atoms
• Local Density Approximation to Schroedinger Equation
• A and B up to n = 40000, Hermitian
• Need all eigenvalues and eigenvectors
• Need to iterate up to 20 times (for self-consistency)
° Implemented on Intel ASCI Red
• 9200 Pentium Pro 200 processors (4600 duals, a CLUMP)
• Overall application ran at 605 Gflops (out of 1800 Gflops peak); the eigensolver ran at 684 Gflops
• www.cs.berkeley.edu/~stanley/gbell/index.html
• Runner-up for the Gordon Bell Prize at Supercomputing 98

EISPACK and LINPACK
° EISPACK
• Designed for the algebraic eigenvalue problem, Ax = λx and Ax = λBx
• Work of J. Wilkinson and colleagues in the 70's
• Fortran 77 software based on translation of ALGOL
° LINPACK
• Designed for solving systems of equations, Ax = b
• Fortran 77 software using the Level 1 BLAS

Review of Gaussian Elimination (GE) for Solving Ax=b
° Add multiples of each row to later rows to make A upper triangular
° Solve the resulting triangular system Ux = c by substitution

    … for each column i
    … zero it out below the diagonal by adding multiples of row i to later rows
    for i = 1 to n-1
        … for each row j below row i
        for j = i+1 to n
            … add a multiple of row i to row j
            for k = i to n
                A(j,k) = A(j,k) - (A(j,i)/A(i,i)) * A(i,k)
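A direct transcription of this loop nest into Python/NumPy (0-based indexing instead of the slide's 1-based; no pivoting, so it assumes every A(i,i) it divides by is nonzero):

    import numpy as np

    def gaussian_elimination(A):
        """Reduce A to upper triangular form in place (no pivoting)."""
        n = A.shape[0]
        for i in range(n - 1):               # for each column i
            for j in range(i + 1, n):        # for each row j below row i
                m = A[j, i] / A[i, i]        # the multiplier A(j,i)/A(i,i)
                A[j, i:] -= m * A[i, i:]     # add a multiple of row i to row j
        return A

Note that the multiplier is read once per row here, which is exactly the first refinement made on the next slide.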


Refine GE Algorithm (1)
° Initial version

    for i = 1 to n-1
        for j = i+1 to n
            for k = i to n
                A(j,k) = A(j,k) - (A(j,i)/A(i,i)) * A(i,k)

° Remove computation of the constant A(j,i)/A(i,i) from the inner loop

    for i = 1 to n-1
        for j = i+1 to n
            m = A(j,i)/A(i,i)
            for k = i to n
                A(j,k) = A(j,k) - m * A(i,k)

Refine GE Algorithm (2)
° Last version

    for i = 1 to n-1
        for j = i+1 to n
            m = A(j,i)/A(i,i)
            for k = i to n
                A(j,k) = A(j,k) - m * A(i,k)

° Don't compute what we already know: zeros below the diagonal in column i

    for i = 1 to n-1
        for j = i+1 to n
            m = A(j,i)/A(i,i)
            for k = i+1 to n
                A(j,k) = A(j,k) - m * A(i,k)


Refine GE Algorithm (3)
° Last version

    for i = 1 to n-1
        for j = i+1 to n
            m = A(j,i)/A(i,i)
            for k = i+1 to n
                A(j,k) = A(j,k) - m * A(i,k)

° Store the multipliers m below the diagonal in the zeroed entries for later use

    for i = 1 to n-1
        for j = i+1 to n
            A(j,i) = A(j,i)/A(i,i)
            for k = i+1 to n
                A(j,k) = A(j,k) - A(j,i) * A(i,k)

Refine GE Algorithm (4)
° Last version

    for i = 1 to n-1
        for j = i+1 to n
            A(j,i) = A(j,i)/A(i,i)
            for k = i+1 to n
                A(j,k) = A(j,k) - A(j,i) * A(i,k)

° Express using matrix operations (BLAS)

    for i = 1 to n-1
        A(i+1:n , i) = A(i+1:n , i) / A(i,i)
        A(i+1:n , i+1:n) = A(i+1:n , i+1:n) - A(i+1:n , i) * A(i , i+1:n)
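The BLAS form maps almost verbatim onto NumPy slicing; a sketch (0-based, still no pivoting), with the multipliers overwriting the zeroed entries so L and U share A's storage:

    import numpy as np

    def lu_in_place(A):
        """A = L*U with unit-diagonal L stored below the diagonal, U on/above."""
        n = A.shape[0]
        for i in range(n - 1):
            A[i+1:, i] /= A[i, i]                               # scale a vector
            A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])   # rank-1 update
        return A

    A = np.array([[4., 3.], [6., 3.]])
    LU = lu_in_place(A.copy())
    L = np.tril(LU, -1) + np.eye(2)
    U = np.triu(LU)
    assert np.allclose(L @ U, A)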


What GE Really Computes

    for i = 1 to n-1
        A(i+1:n , i) = A(i+1:n , i) / A(i,i)
        A(i+1:n , i+1:n) = A(i+1:n , i+1:n) - A(i+1:n , i) * A(i , i+1:n)

° Call the strictly lower triangular matrix of multipliers M, and let L = I+M
° Call the upper triangle of the final matrix U
° Lemma (LU Factorization): If the above algorithm terminates (does not divide by zero) then A = L*U
° Solving A*x = b using GE
• Factorize A = L*U using GE (cost = 2/3 n³ flops)
• Solve L*y = b for y, using substitution (cost = n² flops)
• Solve U*x = y for x, using substitution (cost = n² flops)
° Thus A*x = (L*U)*x = L*(U*x) = L*y = b as desired (see the sketch after this slide)

Problems with the Basic GE Algorithm
° What if some A(i,i) is zero? Or very small?
• Result may not exist, or be "unstable", so need to pivot
° Current computation is all BLAS 1 or BLAS 2, but we know that BLAS 3 (matrix multiply) is fastest (Lecture 2)

    for i = 1 to n-1
        A(i+1:n , i) = A(i+1:n , i) / A(i,i)          … BLAS 1 (scale a vector)
        A(i+1:n , i+1:n) = A(i+1:n , i+1:n)
                         - A(i+1:n , i) * A(i , i+1:n) … BLAS 2 (rank-1 update)

(figure: IBM RS/6000 Power 3 (200 MHz, 800 Mflop/s peak) — BLAS 3 approaches peak while BLAS 2 and BLAS 1 stay far below, for vector/matrix orders 10 to 500)
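A sketch of the full A*x = b solve built from these pieces, with SciPy's triangular solver doing the two substitution steps:

    import numpy as np
    from scipy.linalg import solve_triangular

    def ge_solve(A, b):
        """Solve A x = b via A = L*U (no pivoting, for illustration only)."""
        A = A.astype(float)
        n = len(b)
        for i in range(n - 1):                    # factorize: 2/3 n^3 flops
            A[i+1:, i] /= A[i, i]
            A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])
        L = np.tril(A, -1) + np.eye(n)
        U = np.triu(A)
        y = solve_triangular(L, b, lower=True)    # forward substitution, n^2 flops
        x = solve_triangular(U, y, lower=False)   # back substitution, n^2 flops
        return x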

Pivoting in Gaussian Elimination
° A = [ 0 1 ]  fails completely, even though A is "easy"
      [ 1 0 ]
° Illustrate the problems in 3-decimal-digit arithmetic:

    A = [ 1e-4 1 ]  and  b = [ 1 ] ; correct answer to 3 places is x = [ 1 ]
        [ 1    1 ]           [ 2 ]                                     [ 1 ]

° Result of the LU decomposition is

    L = [ 1          0 ] = [ 1   0 ]   … no roundoff error yet
        [ fl(1/1e-4) 1 ]   [ 1e4 1 ]

    U = [ 1e-4 1           ] = [ 1e-4  1   ]   … error in 4th decimal place
        [ 0    fl(1-1e4*1) ]   [ 0    -1e4 ]

    Check: L*U = [ 1e-4 1 ]   … (2,2) entry entirely wrong
                 [ 1    0 ]

° Algorithm "forgets" the (2,2) entry, gets the same L and U for all |A(2,2)| < 5
° Numerical instability; the computed solution x is totally inaccurate
° Cure: pivot (swap rows of A) so the entries of L and U are bounded

Gaussian Elimination with Partial Pivoting (GEPP)
° Partial pivoting: swap rows so that each multiplier satisfies |L(i,j)| = |A(j,i)/A(i,i)| <= 1

    for i = 1 to n-1
        find and record k where |A(k,i)| = max {i <= j <= n} |A(j,i)|
            … i.e. largest entry in the rest of column i
        if |A(k,i)| = 0
            exit with a warning that A is singular, or nearly so
        elseif k != i
            swap rows i and k of A
        end if
        A(i+1:n , i) = A(i+1:n , i) / A(i,i)   … each quotient lies in [-1,1]
        A(i+1:n , i+1:n) = A(i+1:n , i+1:n) - A(i+1:n , i) * A(i , i+1:n)

° Lemma: This algorithm computes A = P*L*U, where P is a permutation matrix
° Since each entry of |L(i,j)| <= 1, this algorithm is considered numerically stable
° For details see the LAPACK code at www.netlib.org/lapack/single/sgetf2.f
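A sketch of GEPP in NumPy, recording the pivots and checking P*A = L*U on the 3-digit example above (where pivoting now keeps every multiplier bounded by 1):

    import numpy as np

    def gepp(A):
        """In-place LU with partial pivoting; returns (LU, piv) with A[piv] = L*U."""
        A = A.astype(float)
        n = A.shape[0]
        piv = np.arange(n)
        for i in range(n - 1):
            k = i + np.argmax(np.abs(A[i:, i]))  # largest entry in rest of column i
            if A[k, i] == 0:
                raise ValueError("A is singular, or nearly so")
            if k != i:
                A[[i, k]] = A[[k, i]]            # swap rows i and k of A
                piv[[i, k]] = piv[[k, i]]
            A[i+1:, i] /= A[i, i]                # each quotient lies in [-1, 1]
            A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])
        return A, piv

    A = np.array([[1e-4, 1.], [1., 1.]])
    LU, piv = gepp(A)
    L = np.tril(LU, -1) + np.eye(2)
    U = np.triu(LU)
    assert np.allclose(L @ U, A[piv])            # P*A = L*U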

History of Block Partitioned Algorithms
° Early algorithms involved use of small main memory, with tapes as secondary storage.
° Recent work centers on the use of vector registers, level 1 and 2 cache, main memory, and "out of core" memory.

Blocked Partitioned Algorithms
° LU factorization
° Cholesky factorization
° Symmetric indefinite factorization
° Matrix inversion
° QR, QL, RQ, LQ factorizations
° Form Q or QᵀC
° Orthogonal reduction to:
• (upper) Hessenberg form
• symmetric tridiagonal form
• bidiagonal form
° Block QR iteration for nonsymmetric eigenvalue problems


Converting BLAS2 to BLAS3 in GEPP
° Blocking
• Used to optimize matrix multiplication
• Harder here because of the data dependencies in GEPP
° Delayed updates
• Save updates to the "trailing matrix" from several consecutive BLAS2 updates
• Apply many saved updates simultaneously in one BLAS3 operation
• Open questions remain
° Need to choose a block size b
• The algorithm will save and apply b updates
• b must be small enough that the active submatrix of b columns of A fits in cache
• b must be large enough to make BLAS3 fast

Blocked GEPP (www.netlib.org/lapack/single/sgetrf.f)

    for ib = 1 to n-1 step b      … Process the matrix b columns at a time
        end = ib + b-1            … Point to the end of the block of b columns
        apply BLAS2 version of GEPP to get A(ib:n , ib:end) = P' * L' * U'
        … let LL denote the strict lower triangular part of A(ib:end , ib:end) + I
        A(ib:end , end+1:n) = LL⁻¹ * A(ib:end , end+1:n)  … update next b rows of U
        A(end+1:n , end+1:n) = A(end+1:n , end+1:n)
                             - A(end+1:n , ib:end) * A(ib:end , end+1:n)
            … apply the delayed updates with a single matrix multiply
            … with inner dimension b

(For a correctness proof, see the on-line notes.)
° The same idea works for much of dense linear algebra
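A sketch of this blocked right-looking update in NumPy (panel pivoting omitted so the delayed-update structure stays visible; the block size is illustrative):

    import numpy as np
    from scipy.linalg import solve_triangular

    def blocked_lu(A, b=32):
        """Right-looking blocked LU without pivoting (structure of sgetrf)."""
        A = A.astype(float)
        n = A.shape[0]
        for ib in range(0, n, b):                # process b columns at a time
            end = min(ib + b, n)
            for i in range(ib, end):             # BLAS2 (unblocked) LU of the panel
                A[i+1:, i] /= A[i, i]
                A[i+1:, i+1:end] -= np.outer(A[i+1:, i], A[i, i+1:end])
            if end < n:
                LL = np.tril(A[ib:end, ib:end], -1) + np.eye(end - ib)
                # update the next b rows of U: triangular solve against LL
                A[ib:end, end:] = solve_triangular(LL, A[ib:end, end:], lower=True)
                # delayed update: one matrix multiply with inner dimension b
                A[end:, end:] -= A[end:, ib:end] @ A[ib:end, end:]
        return A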


LAPACK
° Linear algebra library in Fortran 77
• Solution of systems of equations
• Solution of eigenvalue problems
° Combines algorithms from LINPACK and EISPACK into a single package
° Efficient on a wide range of computers
• RISC, vector, SMPs
° User interface similar to LINPACK
• Single, Double, Complex, Double Complex
° Built on the Level 1, 2, and 3 BLAS

Derivation of Blocked Algorithms: Cholesky Factorization A = UᵀU
Partition A = UᵀU with the jth column split out:

    ( A11   a_j   A13 )   ( U11ᵀ   0     0   ) ( U11  u_j   U13  )
    ( a_jᵀ  a_jj  α_jᵀ) = ( u_jᵀ  u_jj   0   ) ( 0    u_jj  μ_jᵀ )
    ( A13ᵀ  α_j   A33 )   ( U13ᵀ  μ_j   U33ᵀ ) ( 0    0     U33  )

Equating coefficients of the jth column, we obtain

    a_j  = U11ᵀ u_j
    a_jj = u_jᵀ u_j + u_jj²

Hence, if U11 has already been computed, we can compute u_j and u_jj from the equations:

    U11ᵀ u_j = a_j
    u_jj²    = a_jj - u_jᵀ u_j
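A sketch of this column-by-column Cholesky in NumPy, computing u_j by a triangular solve against the already-computed U11:

    import numpy as np
    from scipy.linalg import solve_triangular

    def cholesky_utu(A):
        """Column Cholesky: A = U^T U with U upper triangular."""
        n = A.shape[0]
        U = np.zeros_like(A, dtype=float)
        for j in range(n):
            # U11^T u_j = a_j : solve against the columns computed so far
            u = solve_triangular(U[:j, :j], A[:j, j], trans="T") if j else np.empty(0)
            s = A[j, j] - u @ u                # u_jj^2 = a_jj - u_j^T u_j
            if s <= 0:
                raise ValueError("matrix is not positive definite")
            U[:j, j] = u
            U[j, j] = np.sqrt(s)
        return U

    A = np.array([[4., 2.], [2., 5.]])
    U = cholesky_utu(A)
    assert np.allclose(U.T @ U, A)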

LINPACK Implementation
° Here is the body of the LINPACK routine SPOFA, which implements the method:

         DO 30 J = 1, N
            INFO = J
            S = 0.0E0
            JM1 = J - 1
            IF( JM1.LT.1 ) GO TO 20
            DO 10 K = 1, JM1
               T = A( K, J ) - SDOT( K-1, A( 1, K ), 1, A( 1, J ), 1 )
               T = T / A( K, K )
               A( K, J ) = T
               S = S + T*T
      10    CONTINUE
      20    CONTINUE
            S = A( J, J ) - S
      C     ...EXIT
            IF( S.LE.0.0E0 ) GO TO 40
            A( J, J ) = SQRT( S )
      30 CONTINUE

LAPACK Implementation

         DO 10 J = 1, N
            CALL STRSV( 'Upper', 'Transpose', 'Non-Unit', J-1, A, LDA,
        $               A( 1, J ), 1 )
            S = A( J, J ) - SDOT( J-1, A( 1, J ), 1, A( 1, J ), 1 )
            IF( S.LE.ZERO ) GO TO 20
            A( J, J ) = SQRT( S )
      10 CONTINUE

° This change by itself is sufficient to significantly improve the performance on a number of machines.
° From 238 to 312 Mflop/s for a matrix of order 500 on a Pentium 4 at 1.7 GHz.
° However, the peak is 1,700 Mflop/s, which suggests further work is needed.


Derivation of Blocked Algorithms
Partition A = UᵀU by blocks of columns:

    ( A11   A12   A13 )   ( U11ᵀ   0     0   ) ( U11  U12  U13 )
    ( A12ᵀ  A22   A23 ) = ( U12ᵀ  U22ᵀ   0   ) ( 0    U22  U23 )
    ( A13ᵀ  A23ᵀ  A33 )   ( U13ᵀ  U23ᵀ  U33ᵀ ) ( 0    0    U33 )

Equating coefficients of the second block of columns, we obtain

    A12 = U11ᵀ U12
    A22 = U12ᵀ U12 + U22ᵀ U22

Hence, if U11 has already been computed, we can compute U12 as the solution of the equation U11ᵀ U12 = A12 by a call to the Level 3 BLAS routine STRSM, and then factor U22ᵀ U22 = A22 - U12ᵀ U12.

LAPACK Blocked Algorithms

         DO 10 J = 1, N, NB
            CALL STRSM( 'Upper', 'Transpose', 'Non-Unit', J-1, JB,
        $               ONE, A, LDA, A( 1, J ), LDA )
            CALL SSYRK( 'Upper', 'Transpose', JB, J-1, -ONE, A( 1, J ), LDA,
        $               ONE, A( J, J ), LDA )
            CALL SPOTF2( 'Upper', JB, A( J, J ), LDA, INFO )
            IF( INFO.NE.0 ) GO TO 20
      10 CONTINUE

• On a Pentium 4, the Level 3 BLAS squeezes a lot more out of one processor:

    Intel Pentium 4 1.7 GHz, N = 500    Rate of Execution
    Linpack variant (L1 BLAS)            238 Mflop/s
    Level 2 BLAS variant                 312 Mflop/s
    Level 3 BLAS variant                1262 Mflop/s
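The same STRSM/SSYRK/SPOTF2 sequence in NumPy form (block size illustrative; NumPy's own cholesky stands in for the unblocked SPOTF2 on the diagonal block):

    import numpy as np
    from scipy.linalg import solve_triangular

    def blocked_cholesky(A, nb=64):
        """Blocked A = U^T U, processing nb columns at a time (spotrf structure)."""
        A = A.astype(float)
        n = A.shape[0]
        for j in range(0, n, nb):
            jb = min(nb, n - j)
            if j:
                # STRSM: solve U11^T U12 = A12 for the block of columns U12
                A[:j, j:j+jb] = solve_triangular(A[:j, :j], A[:j, j:j+jb], trans="T")
                # SSYRK: A22 <- A22 - U12^T U12
                A[j:j+jb, j:j+jb] -= A[:j, j:j+jb].T @ A[:j, j:j+jb]
            # SPOTF2: unblocked factorization of the (still symmetric) diagonal block
            A[j:j+jb, j:j+jb] = np.linalg.cholesky(A[j:j+jb, j:j+jb]).T
        return np.triu(A)

    A = np.array([[4., 2.], [2., 5.]])
    U = blocked_cholesky(A, nb=1)
    assert np.allclose(U.T @ U, A)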

LAPACK Contents
° Combines algorithms from LINPACK and EISPACK into a single package. User interface similar to LINPACK.
° Built on the Level 1, 2 and 3 BLAS, for high performance (manufacturers optimize the BLAS)
° LAPACK does not provide routines for structured problems or general sparse matrices (i.e. sparse storage formats such as compressed-row, -column, -diagonal, skyline, ...)

LAPACK Ongoing Work
° Add functionality
• updating/downdating, divide and conquer least squares, bidiagonal bisection, bidiagonal inverse iteration, band SVD, Jacobi methods, ...
° Move to the new generation of high performance machines
• IBM SPs, CRAY T3E, SGI Origin, clusters of workstations
° New challenges
• New languages: FORTRAN 90, HP FORTRAN, ...
• Message passing libraries (CMMD, MPL, NX, ...)
- many flavors of message passing; need a standard (PVM, MPI): BLACS
° Highly varying ratio of computational speed to communication speed
° Many ways to lay out data
° Fastest parallel algorithm is sometimes less stable numerically


Gaussian Elimination
(figure: three views of the active submatrix)
° Standard way: subtract a multiple of a row
° LINPACK: apply a sequence of row operations to a column
    a2 = L⁻¹ a2,  a3 = a3 - a1*a2
° LAPACK: apply the sequence to nb columns, then apply the nb updates to the rest of the matrix

Gaussian Elimination via a Recursive Algorithm
F. Gustavson and S. Toledo

LU Algorithm:
1: Split the matrix into two rectangles (m x n/2);
   if only 1 column, scale by the reciprocal of the pivot and return
2: Apply the LU algorithm to the left part
3: Apply the transformations to the right part
   (triangular solve A12 = L⁻¹ A12 and matrix multiplication A22 = A22 - A21*A12)
4: Apply the LU algorithm to the right part

Most of the work is in the matrix multiply; the recursion works on matrices of size n/2, n/4, n/8, …
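A sketch of this recursion in NumPy (no pivoting shown, so step 1's "pivot" is just the diagonal entry; L and U come back packed in one matrix as before):

    import numpy as np
    from scipy.linalg import solve_triangular

    def recursive_lu(A):
        """Recursive LU on an m x n matrix, splitting the columns in half."""
        m, n = A.shape
        if n == 1:
            A[1:] /= A[0]          # one column: scale by reciprocal of pivot
            return A
        k = n // 2                 # 1: split into two m x n/2 rectangles
        recursive_lu(A[:, :k])     # 2: LU of the left part (in place, via views)
        L11 = np.tril(A[:k, :k], -1) + np.eye(k)
        A[:k, k:] = solve_triangular(L11, A[:k, k:], lower=True)  # 3: A12 = L^-1 A12
        A[k:, k:] -= A[k:, :k] @ A[:k, k:]                        # 3: A22 -= A21*A12
        recursive_lu(A[k:, k:])    # 4: LU of the right part
        return A

    A = np.array([[4., 3.], [6., 3.]])
    LU = recursive_lu(A.copy())    # L below the diagonal, U on and above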

Recursive Factorizations
° Just as accurate as the conventional method
° Same number of operations
° Automatic variable blocking
• Level 1 and 3 BLAS only!
° Extreme clarity and simplicity of expression
° Highly efficient
° The recursive formulation is just a rearrangement of the pointwise LINPACK algorithm
° The standard error analysis applies (assuming the matrix operations are computed the "conventional" way)
° OK for LU, LLᵀ, & QR
• Open question for 2-sided algorithms, e.g. eigenvalue reduction

(figures: DGEMM & DGETRF on a Pentium III 550 MHz with ATLAS; Recursive LU Factorization on a dual-processor AMD Athlon 1 GHz (~$1100 system) — recursive LU reaches higher Mflop/s than LAPACK on both one and two processors, for matrix orders 500 to 5000)


Challenges in Developing Distributed Memory Libraries
° How to integrate software?
• Until recently no standards
• Many parallel languages
• Various parallel programming models
• Assumptions about the parallel environment
- granularity
- topology
- overlapping of communication/computation
- development tools
° Where is the data?
• Who owns it?
• Optimal data distribution
° Who determines the data layout?
• Determined by the user?
• Determined by the library developer?
• Allow dynamic data distribution?
• Load balancing

LU Factorization
(figure: LU factorization rates in Mflop/s on a Pentium 4, 1.5 GHz, using SSE2 — curves for sLU, dLU, cLU, and zLU over matrix orders 100 to 2800)

ScaLAPACK
° Library of software dealing with dense & banded routines
° Distributed memory - message passing
° MIMD computers and networks of workstations
° Clusters of SMPs

Programming Style
° SPMD Fortran 77 with object-based design
° Built on various modules
• PBLAS
• BLACS (interprocessor communication)
- PVM, MPI, IBM SP, CRI T3, Intel, TMC
- provides the right level of notation
• BLAS
° LAPACK software expertise/quality
• Software approach
• Numerical methods


Overall Structure of the Software
° Object based - array descriptor
• Contains the information required to establish the mapping between a global array entry and its corresponding process and memory location
• Provides a flexible framework to easily specify additional data distributions or matrix types
• Currently dense, banded, & out-of-core
° Uses the concept of a context

PBLAS
° Similar to the BLAS in functionality and naming
° Built on the BLAS and BLACS
° Provide a global view of the matrix

    BLAS:   CALL DGEXXX ( M, N, A( IA, JA ), LDA, ... )
    PBLAS:  CALL PDGEXXX( M, N, A, IA, JA, DESCA, ... )


ScaLAPACK Structure
(figure: the software hierarchy — ScaLAPACK sits on the PBLAS; the PBLAS sit on LAPACK and the BLACS; LAPACK sits on the BLAS, and the BLACS sit on PVM/MPI/...; ScaLAPACK and the PBLAS present the global view, the layers below are local)

Choosing a Data Distribution
° Main issues are:
• load balancing
• use of the Level 3 BLAS


Possible Data Layouts
° 1D block and cyclic column distributions
° 1D block-cyclic column and 2D block-cyclic distributions
° 2D block-cyclic is used in ScaLAPACK for dense matrices

Distribution and Storage
° The matrix is block-partitioned and the blocks are mapped to processes
° Distributed 2D block-cyclic scheme
° Example: a 5x5 matrix partitioned in 2x2 blocks, mapped onto a 2x2 process grid

    Global view:                    2x2 process grid point of view:

    A11 A12 | A13 A14 | A15         P(0,0): A11 A12 A15    P(0,1): A13 A14
    A21 A22 | A23 A24 | A25                 A21 A22 A25            A23 A24
    --------+---------+----                 A51 A52 A55            A53 A54
    A31 A32 | A33 A34 | A35
    A41 A42 | A43 A44 | A45         P(1,0): A31 A32 A35    P(1,1): A33 A34
    --------+---------+----                 A41 A42 A45            A43 A44
    A51 A52 | A53 A54 | A55

° Routines are available to distribute/redistribute data.
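A sketch of the 2D block-cyclic map in Python: which process owns global entry (i, j), and where it lands in that process's local storage (0-based indexing; nb is the block size and P x Q the process grid — the function name is illustrative):

    def block_cyclic_owner(i, j, nb, P, Q):
        """Return ((prow, pcol), (li, lj)): owning process and local indices."""
        bi, bj = i // nb, j // nb          # global block coordinates
        prow, pcol = bi % P, bj % Q        # blocks are dealt out cyclically
        li = (bi // P) * nb + i % nb       # local position inside that process
        lj = (bj // Q) * nb + j % nb
        return (prow, pcol), (li, lj)

    # The 5x5 example above (2x2 blocks, 2x2 grid): A51 is global (i,j) = (4,0)
    # and lives on process (0,0) as local entry (2,0), i.e. its third local row.
    assert block_cyclic_owner(4, 0, nb=2, P=2, Q=2) == ((0, 0), (2, 0))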


Parallelism in ScaLAPACK
° Level 3 BLAS block operations
• all the reduction routines
° Pipelining
• QR algorithm, triangular solvers, classic factorizations
° Redundant computations
• condition estimators
° Static work assignment
• bisection
° Task parallelism
• sign function eigenvalue computations
° Divide and conquer
• tridiagonal and band solvers, symmetric eigenvalue problem, and sign function
° Cyclic reduction
• reduced system in the band solver
° Data parallelism
• sign function

ScaLAPACK - What's Included
° Timing and testing routines for almost all routines
• this is a large component of the package
° Prebuilt libraries available for SP, PGON, HPPA, DEC, Sun, RS6K

    Ax = b                SDrv  EDrv  Factor  Solve  Inv  Cond Est  Iter Refin
    Triangular                                  X     X       X          X
    SPD                     X     X      X      X     X       X          X
    SPD Banded              X            X      X
    SPD Tridiagonal         X            X      X
    General                 X     X      X      X     X       X          X
    General Banded          X            X      X
    General Tridiagonal     X            X      X
    Least squares           X            X      X
    GQR                                  X
    GRQ                                  X

    Ax = λx or Ax = λBx   SDrv  EDrv  Reduct  Solution
    Symmetric (2 types)     X     X      X        X
    General (2 types)                    X        X
    Generalized BSPD        X            X        X
    SVD                     X            X

Heterogeneous Computing
° Software intended to be used in this context
° Communication of floating point numbers between processors
° Machine precision and other machine-specific parameters
° Iterative convergence across clusters of processors
° Defensive programming required

ScaLAPACK LU Solver
(figure: ScaLAPACK LU solver Mflop/s on the T0 Dual (500 MHz Alpha EV5/6) over Fast and Gigabit Ethernet — 1x4 and 1x8 Giganet vs 1x4 and 1x8 Fast Ethernet configurations, problem sizes 1000 to 8000)

Out of Core Approach
° High-level I/O interface
° ScaLAPACK uses a "right-looking" variant for LU, QR and Cholesky factorizations.
° A "left-looking" variant is used for out-of-core factorization to reduce I/O traffic.
° Requires two in-core column panels.
° Imposes another level in the memory hierarchy.

Out of Core Algorithm
° Hybrid approach
° The algorithm is "left-looking" in nature, but uses "right-looking" (ScaLAPACK) on Panel B
° Latency tolerant
° A model for deep memory hierarchy algorithms
(figure: Panel A and Panel B within the matrix)


QR Factorization on 64 Processors of the Intel Paragon
(figure: time in seconds for out-of-core vs in-core QR factorization, square problem sizes 5000 to 22400)

References
° http://www.netlib.org
° http://www.netlib.org/lapack
° http://www.netlib.org/scalapack
° http://www.netlib.org/lapack/lawns
° http://www.netlib.org/atlas
° http://www.netlib.org/papi/
° http://www.netlib.org/netsolve/
° http://www.netlib.org/lapack90
° http://www.nhse.org
° [email protected]
° [email protected]
