
Overview of ScaLAPACK

Jack Dongarra
Computer Science Department
University of Tennessee
and
Mathematical Sciences Section
Oak Ridge National Laboratory

http://www.netlib.org/utk/people/JackDongarra.html

The Situation:

- Parallel scientific applications are typically
  - written from scratch, or manually adapted from sequential programs
  - using a simple SPMD model
  - in Fortran or C
  - with explicit message passing primitives
- Which means
  - similar components are coded over and over again,
  - codes are difficult to develop and maintain,
  - debugging is particularly unpleasant...
  - components are difficult to reuse,
  - not really designed for a long-term software solution.


What should math software look like?

- Many more possibilities than shared memory
- Possible approaches
  - Minimum change to status quo
    - Message passing and object oriented
  - Reusable templates - requires sophisticated user
  - Problem solving environments
    - Integrated systems a la Matlab, Mathematica
  - Computational server, like xnetlib/f2c
- Some users want high performance and access to details
- Others sacrifice speed to hide details
- Not enough penetration of libraries into the hardest applications on vector machines
- Need to look at applications and rethink mathematical software

The NetSolve Client

- Virtual Software Library
- Use with heterogeneous platforms
- Multiplicity of interfaces -> Ease of use
- Programming interfaces
  - C interface
  - FORTRAN interface
- Interactive interfaces
  - Non-graphic interfaces
    - MATLAB interface
    - UNIX-Shell interface
  - Graphic interfaces
    - TK/TCL interface
    - HotJava Browser interface (WWW)


The NetSolve System

- The system is a set of computational servers
- Heterogeneity supported (XDR)
- Dynamic addition of resources
- Fault tolerance
- Each server uses the installed scientific packages. Examples of packages:
  - BLAS
  - LAPACK
  - ScaLAPACK
  - Ellpack
- Connection to other systems
  - NEOS

The NetSolve Agent

- Intelligent Agent -> Optimum use of resources
- The agent is permanently aware of:
  - the configuration of the NetSolve system
  - the characteristics of each NetSolve resource
  - the load of each resource
  - the functionalities of each resource
- Each client request informs the agent about:
  - the type of problem to solve
  - the size of the problem
- The NetSolve agent then chooses the NetSolve resource (load balancing strategy).


LAPACK / ScaLAPACK: A Scalable Library for Distributed Memory Machines

- LAPACK success: from the HP-48G to the CRAY C-90
  - Targets workstations, vector machines, shared memory
  - Dense and banded matrices
  - Linear systems, least squares, eigenproblems
  - Manual available from SIAM; html version; Japanese translation
  - Second release and manual in 9/94
  - 449,015 requests since then (as of 10/95)
  - Used by Cray, IBM, Convex, NAG, IMSL, ...
- Provide useful scientific program libraries for computational scientists working on new parallel architectures
  - Move to new-generation high performance machines: SP-2, T3D, Paragon, clusters of workstations, ...
  - Add functionality (generalized eigenproblem, sparse matrices, ...)
- New challenges
  - No standard software (Fortran 90, HPF, PVM, MPI, but ...)
  - Many flavors of message passing -> need a standard: BLACS
  - Highly varying ratio

        r = computation speed / communication speed

  - Many ways to lay out data
  - Fastest parallel algorithm is sometimes less stable numerically


Needs

- Standards
  - Portable
  - Scalable
  - Vendor-supported environments
- High efficiency
- Protect the programming investment, if possible
- Avoid low-level details
- Scalability

ScaLAPACK - OVERVIEW

Goal: Port LAPACK (Linear Algebra PACKage) to distributed-memory machines.

- Scalability and Performance
  - Optimized compute and communication engines
  - Blocked data access (level 3 BLAS) yields good node performance
  - 2-D block-cyclic data decomposition yields good load balance
- Portability
  - BLAS: optimized compute engine
  - BLACS: communication kernel (PVM, MPI, ...)
- Simple calling interface (call sketch below)
  - Stay as close as possible to LAPACK in calling sequence, storage, etc.
- Ease of Programming / Maintenance
  - Modularity: build a rich set of linear algebra tools
    - BLAS
    - BLACS
    - PBLAS
  - Whenever possible, use reliable LAPACK algorithms.
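A sketch illustrating the "simple calling interface" point above: the LAPACK and ScaLAPACK dense-solver calls side by side. This is only an illustration; it assumes the distributed matrix A, the right-hand sides B, and their descriptors DESCA and DESCB have already been set up as described on the PBLAS slides later.

*     LAPACK on one node: solve A*X = B.
      CALL DGESV( N, NRHS, A, LDA, IPIV, B, LDB, INFO )
*     ScaLAPACK on the process grid: the same operation on distributed
*     data; the leading dimension is replaced by global indices plus
*     an array descriptor.
      CALL PDGESV( N, NRHS, A, IA, JA, DESCA, IPIV,
     $             B, IB, JB, DESCB, INFO )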


Communication Library for Linear Algebra: Basic Linear Algebra Communication Subprograms

Purpose of the BLACS:

- A design tool: they are a conceptual aid in design and coding.
- Associate widely recognized mnemonic names with communication operations; improve program readability and the self-documenting quality of code.
- Promote efficiency by identifying frequently-occurring operations of linear algebra which can be optimized on various computers.
- Program portability is improved through standardization of kernels, without giving up efficiency, since optimized versions will be available.

2-DIMENSIONAL BLOCK CYCLIC DISTRIBUTION

[Figure: global (left) and distributed (right) views of an 8 x 8 block matrix (A11 ... A88) mapped onto a 2 x 3 process grid.]

- Ensures good load balance -> performance and scalability,
- Encompasses a large number (but not all) of data distribution schemes,
- Need redistribution routines to go from one distribution to the other,
- Not intuitive.
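A small sketch of the underlying index mapping (ICYCOWN is a hypothetical helper written for this illustration, not part of ScaLAPACK): blocks of NB consecutive indices are dealt out cyclically over the processes, and applying the same rule to the row and column indices independently gives the 2-D owner of A(i,j).

      INTEGER FUNCTION ICYCOWN( IG, NB, ISRC, NPROCS )
*     Process coordinate owning the 1-based global index IG when
*     blocks of size NB are dealt out cyclically over NPROCS process
*     rows (or columns), starting on process ISRC.
      INTEGER            IG, NB, ISRC, NPROCS
      ICYCOWN = MOD( ( IG - 1 ) / NB + ISRC, NPROCS )
      RETURN
      END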


BLACS - BASICS

- Processes are embedded in a 2-dimensional grid, e.g. a 3 x 4 grid:

            0    1    2    3
       0    0    1    2    3
       1    4    5    6    7
       2    8    9   10   11

- Operations are matrix-based and ID-less.

  [Figure: general (M x N, leading dimension LDA) and trapezoidal (M, N, M-N, N-M) matrix shapes.]

- An operation which involves more than one sender and one receiver is called a "scoped operation". Using a 2D grid, there are 3 natural scopes:

      SCOPE     MEANING
      Row       All processes in a process row participate.
      Column    All processes in a process column participate.
      All       All processes in the process grid participate.

  -> use of a SCOPE parameter.

- 4 types of BLACS routines: point-to-point communication, broadcast, combine operations and support routines.
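A minimal sketch of the usual grid set-up with the BLACS support routines mentioned above (the 2-row grid shape is just an example choice, and the sketch assumes the process count divides evenly):

      PROGRAM GRID
      INTEGER            IAM, NPROCS, ICTXT, NPROW, NPCOL,
     $                   MYROW, MYCOL
*     How many processes are there, and which one am I?
      CALL BLACS_PINFO( IAM, NPROCS )
*     Get the default system context.
      CALL BLACS_GET( -1, 0, ICTXT )
*     Pick a grid shape and create the grid.
      NPROW = 2
      NPCOL = NPROCS / NPROW
      CALL BLACS_GRIDINIT( ICTXT, 'Row', NPROW, NPCOL )
*     Recover this process's coordinates in the grid.
      CALL BLACS_GRIDINFO( ICTXT, NPROW, NPCOL, MYROW, MYCOL )
*     ... BLACS / PBLAS / ScaLAPACK calls using ICTXT go here ...
      CALL BLACS_GRIDEXIT( ICTXT )
      CALL BLACS_EXIT( 0 )
      END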


BLACS - COMMUNICATION ROUTINES

1. Send/Recv

Send a submatrix from one process to another:

      _xxSD2D( ICONTXT, M, N, A, LDA, RDEST, CDEST )
      _xxRV2D( ICONTXT, M, N, A, LDA, RSRC, CSRC )

where "_" is the data type and "xx" the matrix type:

      _  Data type              xx  Matrix type
      I: INTEGER                GE: GEneral rectangular matrix
      S: REAL                   TR: TRapezoidal matrix
      D: DOUBLE PRECISION
      C: COMPLEX
      Z: DOUBLE COMPLEX

Example:

      CALL BLACS_GRIDINFO( ICONTXT, NPROW, NPCOL, MYROW, MYCOL )
      IF( MYROW.EQ.0 .AND. MYCOL.EQ.0 ) THEN
         CALL DGESD2D( ICONTXT, 5, 1, X, 5, 1, 0 )
      ELSE IF( MYROW.EQ.1 .AND. MYCOL.EQ.0 ) THEN
         CALL DGERV2D( ICONTXT, 5, 1, Y, 5, 0, 0 )
      END IF

2. Broadcast

Send a submatrix to all processes, or to a subsection of the processes in SCOPE, using various distribution patterns (TOP):

      _xxBS2D( ICONTXT, SCOPE, TOP, M, N, A, LDA )
      _xxBR2D( ICONTXT, SCOPE, TOP, M, N, A, LDA, RSRC, CSRC )

      SCOPE        TOP
      'Row'        ' ' (default)
      'Column'     'Increasing Ring'
      'All'        '1-tree', ...

Example:

      CALL BLACS_GRIDINFO( ICONTXT, NPROW, NPCOL, MYROW, MYCOL )
      IF( MYROW.EQ.0 ) THEN
         IF( MYCOL.EQ.0 ) THEN
            CALL DGEBS2D( ICONTXT, 'Row', ' ', 2, 2, A, 3 )
         ELSE
            CALL DGEBR2D( ICONTXT, 'Row', ' ', 2, 2, A, 3, 0, 0 )
         END IF
      END IF


BLACS - COMBINE OPERATIONS

3. Global combine operations

Perform element-wise SUM, |MAX|, |MIN| operations on rectangular matrices:

      _GSUM2D( ICONTXT, SCOPE, TOP, M, N, A, LDA, RDEST, CDEST )
      _GAMX2D( ICONTXT, SCOPE, TOP, M, N, A, LDA, RA, CA, RCFLAG,
               RDEST, CDEST )
      _GAMN2D( ICONTXT, SCOPE, TOP, M, N, A, LDA, RA, CA, RCFLAG,
               RDEST, CDEST )

- RDEST = -1 indicates that the result of the operation should be left on all processes selected by SCOPE.
- For |MAX| and |MIN|, when RCFLAG = -1, RA and CA are not referenced; otherwise, these arrays are set on output with the coordinates of the process owning the corresponding maximum (resp. minimum) element in absolute value of A. Both arrays must be at least of size M x N and their leading dimension is specified by RCFLAG.

BLACS - ADVANCED TOPICS

The BLACS context is the BLACS mechanism for partitioning communication space. A defining property of a context is that a message in a context cannot be sent or received in another context. The BLACS context includes the definition of a grid and each process's coordinates in it.

It allows the user to

- create arbitrary groups of processes,
- create multiple overlapping and/or disjoint grids,
- isolate each process grid so that grids do not interfere with each other.

-> the BLACS context is compatible with the MPI communicator.
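A usage sketch for the combine operations above, using the RDEST = -1 convention to leave a row-wide sum on every process of the caller's process row (ROWSUM is a hypothetical helper written for this illustration):

      SUBROUTINE ROWSUM( ICONTXT, S )
*     Sum the local scalar S across this process row and leave the
*     global sum on every process in the row.
      INTEGER            ICONTXT
      DOUBLE PRECISION   S( 1 )
*     SCOPE = 'Row': all processes in my process row participate;
*     TOP = ' ' picks the default topology; RDEST = CDEST = -1
*     leaves the 1 x 1 result on every process in scope.
      CALL DGSUM2D( ICONTXT, 'Row', ' ', 1, 1, S, 1, -1, -1 )
      RETURN
      END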


BLACS - REFERENCES

- BLACS software and documentation can be obtained via:
  - WWW: http://www.netlib.org/blacs
  - anonymous ftp to ftp.netlib.org: cd blacs; get index
  - email to [email protected] with the message: send index from blacs
- Comments and questions can be addressed to [email protected]
- J. Dongarra and R. van de Geijn, Two Dimensional Basic Linear Algebra Communication Subprograms, Technical Report UT CS-91-138, LAPACK Working Note 37, University of Tennessee, 1991.
- R. C. Whaley, Basic Linear Algebra Communication Subprograms: Analysis and Implementation Across Multiple Parallel Architectures, Technical Report UT CS-94-234, LAPACK Working Note 73, University of Tennessee, 1994.
- J. Dongarra and R. C. Whaley, A User's Guide to the BLACS v1.0, Technical Report UT CS-95-281, LAPACK Working Note 94, University of Tennessee, 1995.
- Message Passing Interface Forum, MPI: A Message Passing Interface Standard, International Journal of Supercomputer Applications and High Performance Computing, 8(3-4), 1994.
- A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. Sunderam, PVM: Parallel Virtual Machine. A Users' Guide and Tutorial for Networked Parallel Computing, The MIT Press, Cambridge, Massachusetts, 1994.

PBLAS - INTRODUCTION

Parallel Basic Linear Algebra Subprograms for distributed memory MIMD computers.

- Similar functionality to the BLAS: distributed vector-vector, matrix-vector and matrix-matrix operations,
- Simplification of the parallelization of dense linear algebra codes, especially when implemented on top of the BLAS,
- Clarity: code is shorter and easier to read,
- Modularity: gives the programmer larger building blocks,
- Program portability: machine dependencies are confined to the BLAS and the BLACS.


PBLAS - STORAGE CONVENTIONS

- An M-by-N matrix is block partitioned, and these MB-by-NB blocks are distributed according to the 2-dimensional block-cyclic scheme
  - load balanced computations, scalability.
- Locally, the scattered columns are stored contiguously ("column-major" FORTRAN)
  - re-use of the BLAS (local leading dimension LLD).

[Figure: a 5 x 5 matrix (a11 ... a55) partitioned in 2 x 2 blocks, shown from the global point of view and from the 2 x 2 process grid point of view.]

Descriptor DESC: 8-integer array describing the matrix layout, containing M, N, MB, NB, RSRC, CSRC, CTXT and LLD, where (RSRC, CSRC) are the coordinates of the process owning the first matrix entry in the grid specified by CTXT.

Ex: M = N = 5, MB = NB = 2, RSRC = CSRC = 0, LLD >= 3 in process row 0, and >= 2 in process row 1.

PBLAS - ARGUMENT CONVENTIONS

- Global view of the matrix operands, allowing global addressing of distributed matrices (hiding complex local indexing).

[Figure: the M-by-N submatrix A( IA:IA+M-1, JA:JA+N-1 ) inside the M_-by-N_ global matrix A.]

- Code reusability: the interface is very close to the sequential BLAS:

      CALL DGEXXX ( M, N, A( IA, JA ), LDA )
      CALL PDGEXXX( M, N, A, IA, JA, DESCA )

      CALL DGEMM ( 'No Transpose', 'No Transpose',
     $             M-J-JB+1, N-J-JB+1, JB, -ONE, A( J+JB, J ),
     $             LDA, A( J, J+JB ), LDA, ONE, A( J+JB, J+JB ),
     $             LDA )

      CALL PDGEMM( 'No Transpose', 'No Transpose',
     $             M-J-JB+1, N-J-JB+1, JB, -ONE, A, J+JB,
     $             J, DESCA, A, J, J+JB, DESCA, ONE, A,
     $             J+JB, J+JB, DESCA )
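Returning to the 8-integer descriptor described on the storage-conventions slide above, here is a sketch of filling it by hand for the 5 x 5 example, assuming the entries are stored in the order listed there. ICONTXT comes from the BLACS grid set-up; LOCR is a hypothetical argument holding this process's local row count (3 in process row 0, 2 in process row 1).

      SUBROUTINE SETDESC( DESCA, ICONTXT, LOCR )
      INTEGER            DESCA( 8 ), ICONTXT, LOCR
*     Global dimensions M and N, block sizes MB and NB.
      DESCA( 1 ) = 5
      DESCA( 2 ) = 5
      DESCA( 3 ) = 2
      DESCA( 4 ) = 2
*     (RSRC, CSRC): process coordinates owning A(1,1).
      DESCA( 5 ) = 0
      DESCA( 6 ) = 0
*     CTXT: BLACS context of the process grid.
      DESCA( 7 ) = ICONTXT
*     LLD: local leading dimension, at least the local row count.
      DESCA( 8 ) = MAX( 1, LOCR )
      RETURN
      END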


PBLAS - SPECIFICATIONS

      DGEMV ( TRANS, M, N, ALPHA, A, LDA, X, INCX,
              BETA, Y, INCY )

      PDGEMV( TRANS, M, N, ALPHA, A, IA, JA, DESCA,
              X, IX, JX, DESCX, INCX,
              BETA, Y, IY, JY, DESCY, INCY )

      BLAS                      PBLAS
      INTEGER LDA               INTEGER IA, JA, DESCA( 8 )
      INTEGER INCX              INTEGER INCX, IX, JX, DESCX( 8 )
      A, LDA                    A, IA, JA, DESCA
      X, INCX                   X, IX, JX, DESCX, INCX

- In the PBLAS, the increment specified for vectors is always global. So far only INCX = 1 and INCX = DESCX(1) are supported.
- PBLAS matrix transposition routine (REAL):

      PDTRAN( M, N, ALPHA, A, IA, JA, DESCA, BETA,
              C, IC, JC, DESCC )

- Level 1 BLAS functions have become PBLAS subroutines. The output scalar is correct in the operand scope.

>> SPMD PROGRAMMING MODEL <<

PBLAS - REFERENCES

- PBLAS software and documentation can be obtained via:
  - WWW: http://www.netlib.org/scalapack
  - anonymous ftp to ftp.netlib.org: cd scalapack; get index
  - email to [email protected] with the message: send index from scalapack
- Comments and questions can be addressed to [email protected]
- J. Choi, J. Dongarra, S. Ostrouchov, A. Petitet, D. Walker and R. C. Whaley, A Proposal for a Set of Parallel Basic Linear Algebra Subprograms, Technical Report UT CS-95-292, LAPACK Working Note 100, University of Tennessee, 1995.


ScaLAPACK IMPLEMENTATION

Goal
- Port sequential libraries for shared memory machines that use the BLAS to distributed memory machines with little effort.
- Reuse the existing software by hiding the message passing in a set of BLAS routines.

[Figure: layered diagram - ScaLAPACK / PBLAS / LAPACK / BLAS / BLACS / SYSTEM.]

- LAPACK -> ScaLAPACK
- BLAS -> PBLAS (BLAS, BLACS)
- Quality maintenance
- Portability (F77, C)
- Efficiency - Reusability (BLAS, BLACS)
- Hide parallelism in the PBLAS

Parallel Level 2 and 3 BLAS: PBLAS

- Parallel implementation of the BLAS that understands how the matrix is laid out and, when called, can perform not only the operation but also the required data transfer.

BLAS: Performance and Portability
- Blocked data access (level 3 BLAS) yields good local performance.

BLACS: Performance and Portability
- Dedicated machine-dependent communication kernel,
- 2D array-based and ID-less communication,
- Code re-entrance, multiple overlapping process grids,
- Safe co-existence with other parallel libraries.


SEQUENTIAL LU FACTORIZATION CODE

      DO 20 J = 1, MIN( M, N ), NB
         JB = MIN( MIN( M, N )-J+1, NB )
*
*        Factor diagonal and subdiagonal blocks and test for exact
*        singularity.
*
         CALL DGETF2( M-J+1, JB, A( J, J ), LDA, IPIV( J ), IINFO )
*
*        Adjust INFO and the pivot indices.
*
         IF( INFO.EQ.0 .AND. IINFO.GT.0 )
     $      INFO = IINFO + J - 1
         DO 10 I = J, MIN( M, J+JB-1 )
            IPIV( I ) = J - 1 + IPIV( I )
   10    CONTINUE
*
*        Apply interchanges to columns 1:J-1.
*
         CALL DLASWP( J-1, A, LDA, J, J+JB-1, IPIV, 1 )
*
         IF( J+JB.LE.N ) THEN
*
*           Apply interchanges to columns J+JB:N.
*
            CALL DLASWP( N-J-JB+1, A( 1, J+JB ), LDA, J, J+JB-1,
     $                   IPIV, 1 )
*
*           Compute block row of U.
*
            CALL DTRSM( 'Left', 'Lower', 'No transpose', 'Unit',
     $                  JB, N-J-JB+1, ONE, A( J, J ), LDA,
     $                  A( J, J+JB ), LDA )
            IF( J+JB.LE.M ) THEN
*
*              Update trailing submatrix.
*
               CALL DGEMM( 'No transpose', 'No transpose',
     $                     M-J-JB+1, N-J-JB+1, JB, -ONE,
     $                     A( J+JB, J ), LDA, A( J, J+JB ), LDA,
     $                     ONE, A( J+JB, J+JB ), LDA )
            END IF
         END IF
   20 CONTINUE

PARALLEL LU FACTORIZATION CODE

      DO 10 J = JA, JA+MIN( M, N )-1, DESCA( 4 )
         JB = MIN( MIN( M, N )-J+JA, DESCA( 4 ) )
         I = IA + J - JA
*
*        Factor diagonal and subdiagonal blocks and test for exact
*        singularity.
*
         CALL PDGETF2( M-J+JA, JB, A, I, J, DESCA, IPIV, IINFO )
*
*        Adjust INFO and the pivot indices.
*
         IF( INFO.EQ.0 .AND. IINFO.GT.0 )
     $      INFO = IINFO + J - JA
*
*        Apply interchanges to columns JA:J-JA.
*
         CALL PDLASWP( 'Forward', 'Rows', J-JA, A, IA, JA, DESCA,
     $                 J, J+JB-1, IPIV )
*
         IF( J-JA+JB+1.LE.N ) THEN
*
*           Apply interchanges to columns J+JB:JA+N-1.
*
            CALL PDLASWP( 'Forward', 'Rows', N-J-JB+JA, A, IA, J+JB,
     $                    DESCA, J, J+JB-1, IPIV )
*
*           Compute block row of U.
*
            CALL PDTRSM( 'Left', 'Lower', 'No transpose', 'Unit',
     $                   JB, N-J-JB+JA, ONE, A, I, J, DESCA,
     $                   A, I, J+JB, DESCA )
            IF( J-JA+JB+1.LE.M ) THEN
*
*              Update trailing submatrix.
*
               CALL PDGEMM( 'No transpose', 'No transpose',
     $                      M-J-JB+JA, N-J-JB+JA, JB, -ONE,
     $                      A, I+JB, J, DESCA, A, I, J+JB, DESCA,
     $                      ONE, A, I+JB, J+JB, DESCA )
            END IF
         END IF
   10 CONTINUE

PERFORMANCE RESULTS

[Figure: LU factorization on the IBM SP-2 - Gflops vs. problem size (2000 to 17000) on 16, 32, and 64 nodes.]


PERFORMANCE RESULTS

[Figure: QR factorization on Intel machines (512 processors) - Gflops vs. problem size (10000 to 34000) on the Delta and the Paragon.]

[Figure: QR factorization on 128 nodes of Intel machines - Gflops vs. problem size (1000 to 18000) on the Paragon and the Delta (i860).]


[Figure: LU factorization on the Intel Paragon - Gflops vs. problem size on 128, 256, 512, and 1024 nodes.]

[Figure: LU factorization on Sparc-5 workstations (Ethernet) - Mflops vs. problem size (1000 to 5000) on 1 to 16 nodes.]

[Figure: LU performance on the Cray T3D (NB = 32) - Gflops vs. matrix size N on a 16 x 16 grid of processors.]

[Figure: LU factorization on the IBM SP-2 - Gflops vs. problem size (1000 to 15000) on 4, 8, and 16 nodes.]


Symmetric Eigenproblem  Ax = λx,  A = A^T

- Two phases:

  Phase 1: Reduce A to tridiagonal T:  A = U T U^T,

      T =  [ a1   b1                      ]
           [ b1   a2   b2                 ]
           [      b2   .    .             ]
           [           .    .    b(n-1)   ]
           [              b(n-1)   an     ]

  Phase 2: Find the eigenvalues and eigenvectors of T:  T = Q Λ Q^T

  So  A = (UQ) Λ (UQ)^T.

- Phase 2 is the bottleneck (roughly 3x Phase 1)

- LAPACK: Cuppen's divide-and-conquer method

- ScaLAPACK:
  - Bisection + inverse iteration with limited reorthogonalization: fast, but may fail to compute orthogonal eigenvectors if the eigenvalues are clustered
  - QR iteration: slower but reliable

Outline of Cuppen's Divide and Conquer

1. Goal: compute T = Q Λ Q^T,  Λ = diag(λ_i),  Q Q^T = I

2. Split T into two halves plus a rank-one correction:

      T =  [ T1   0  ]  +  v v^T
           [ 0    T2 ]

3. Solve T1 = Q1 Λ1 Q1^T and T2 = Q2 Λ2 Q2^T recursively.

4.    T =  [ Q1 Λ1 Q1^T      0       ]  +  v v^T
           [     0       Q2 Λ2 Q2^T  ]

        =  [ Q1  0  ]  ( D + z z^T )  [ Q1  0  ]^T
           [ 0   Q2 ]                 [ 0   Q2 ]

      where  D = diag( Λ1, Λ2 )  and  z = diag( Q1, Q2 )^T v.

5. The eigenvalues λ of D + z z^T are the roots of the "secular equation"

      1 + sum_i  z_i^2 / ( d_ii - λ )  =  0

6. The eigenvector of D + z z^T for λ_j is ( D - λ_j I )^{-1} z.

7. The eigenvectors of T are diag( Q1, Q2 ) V, where V collects the eigenvectors of D + z z^T.
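A small illustrative sketch (not from the slides): evaluating the secular function of step 5, f(λ) = 1 + sum_i z_i^2 / (d_ii - λ). A root finder such as bisection between consecutive poles d_ii would call it repeatedly to locate each eigenvalue.

      DOUBLE PRECISION FUNCTION SECFUN( N, D, Z, LAMBDA )
*     Value of the secular function at LAMBDA for the rank-one
*     modified diagonal matrix D + Z*Z**T (D holds the diagonal).
      INTEGER            N
      DOUBLE PRECISION   D( N ), Z( N ), LAMBDA
      INTEGER            I
      SECFUN = 1.0D0
      DO 10 I = 1, N
         SECFUN = SECFUN + Z( I )**2 / ( D( I ) - LAMBDA )
   10 CONTINUE
      RETURN
      END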


Speedups with Divide-and-Conquer

[Figures: random and geometrically distributed matrices, reduced to tridiagonal form; time relative to DC (divide and conquer) = 1, for QR iteration, BZ (bisection + inverse iteration), and the expert driver (bisection + inverse + QR), matrix sizes up to 1000.]

ScaLAPACK Performance on 64-Processor Intel Gamma

[Figures: predicted vs. measured parallel efficiency as a function of matrix dimension (500 to 2500), and the distribution of time (% of total) among the TRD, EBZ, EIN, QR, MM, and BZ phases.]


WHAT IS AVAILABLE TO YOU?

- Tools
  - BLACS (CMMD, MPL, NX, PVM)
  - PBLAS
  - REDISTRIBUTION
  - PUMMA
- Factorizations
  - LU + solve (call sketch after these slides)
  - Cholesky + solve
  - QR, QL, LQ, RQ, QRP
- Matrix inversions
- Q, QC, Q^T C, CQ, CQ^T
- Reductions
  - to Hessenberg form
  - to symmetric tridiagonal form
  - to bidiagonal form
- Symmetric eigensolver

ScaLAPACK activities

- ARPACK (Arnoldi-Pack) - D. Sorensen
  - Algorithms for symmetric and nonsymmetric linear systems, eigenproblems, generalized eigenproblems
  - User's matrix-vector product, rest serial
  - Based on a clever identity for compressing the partial Arnoldi factorization
- CAPSS (Cartesian Parallel Sparse Solver) - P. Raghavan and M. Heath
  - Parallel sparse Cholesky
  - For finite difference / finite element problems
  - Uses coordinates from the underlying mesh
  - Cartesian nested dissection to lay out data
- Sparse LU with pivoting
  - Exploits data locality using a supernodal approach
  - Up to 120 Mflops on an IBM RS6000/590, depending on the sparsity pattern
  - Fastest code available on many test cases, along with UMFPACK
  - To be released soon
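A call sketch for the "LU + solve" entry in the list above. It is only a sketch; it assumes the distributed matrix A, the right-hand sides B, their descriptors DESCA and DESCB, and the local pivot array IPIV have already been set up.

*     Factor the distributed N-by-N matrix as P*L*U ...
      CALL PDGETRF( N, N, A, IA, JA, DESCA, IPIV, INFO )
*     ... then solve A*X = B for NRHS right-hand sides, overwriting B.
      CALL PDGETRS( 'No transpose', N, NRHS, A, IA, JA, DESCA,
     $              IPIV, B, IB, JB, DESCB, INFO )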


Software Framework

[Figure: layered software hierarchy - ScaLAPACK and the Parallel BLAS (PBLAS) at the global level; the sequential BLAS, the BLACS, and the message passing primitives (MPI, PVM) at the local level.]

PVM calls within the BLACS

- PVM (Parallel Virtual Machine) is a software package that permits a heterogeneous collection of serial, parallel, and vector computers hooked together by a network to appear as one large computer.
- BLACS (Basic Linear Algebra Communication Subroutines) is a software package to perform communication at the same level as the BLAS are for computation.
- The BLACS have been written in terms of calls to PVM.
- Allows a network of workstations to be used with minimal change.
- MPI (Message Passing Interface): standard for message passing.
- BLACS -> MPI underway.


SCALAPACK - ONGOING WORK

- HPF
  - HPF supports the 2D block-cyclic distribution used by ScaLAPACK
  - HPF-like calling sequence (global indexing scheme)
- Portability: MPI implementation of the BLACS
- More flexibility added to the PBLAS
- More testing and timing programs
- Condition estimators
- Iterative refinement of linear system solutions
- SVD
- Linear least squares solvers
- Banded systems
- Nonsymmetric eigensolvers
- ...

LU Results on IBM Cluster

[Figure: LU factorization with solve on a token-ring-connected cluster of IBM RS/6000-320H workstations; speed in Mflops vs. number of processors (1 to 8), for problem sizes from n=1500 on 1 processor up to n=3000 on a 1x8 grid.]

      Proc   Time   Speedup        Proc   Problem   Mflop/s   Speedup
        1    1030      1             1    n=1500       33        -
        2     403      2.6           2    n=2000       64       1.9
        4     155      6.6           4    n=3000      116       3.5
        6     115      8.9           6    n=3000      156       4.7
        8      93     11.0           8    n=3000      194       5.9