Overview of ScaLAPACK

Jack Dongarra
Computer Science Department
University of Tennessee
and
Mathematical Sciences Section
Oak Ridge National Laboratory

http://www.netlib.org/utk/people/JackDongarra.html


The Situation:

Parallel scientific applications are typically written from scratch, or
manually adapted from sequential programs, using a simple SPMD model in
Fortran or C with explicit message passing primitives.

Which means:
- similar components are coded over and over again,
- codes are difficult to develop and maintain,
- debugging is particularly unpleasant,
- components are difficult to reuse,
- not really designed for a long-term software solution.
What should math software look like?

- Many more possibilities than shared memory
- Possible approaches
  - Minimum change to status quo: message passing and object oriented
    (requires a sophisticated user)
  - Reusable templates
  - Problem solving environments: integrated systems a la Matlab, Mathematica
  - Computational server, like netlib/xnetlib, f2c
- Some users want high performance and access to details
- Others sacrifice speed to hide details
- Not enough penetration of libraries into the hardest applications on
  vector machines
- Need to look at applications and rethink mathematical software


The NetSolve Client

- Virtual Software Library
- Multiplicity of interfaces -> Ease of use
- Programming interfaces (use with heterogeneous platforms)
  - C interface
  - FORTRAN interface
- Interactive interfaces
  - Non-graphic interfaces
    - MATLAB interface
    - UNIX-Shell interface
  - Graphic interfaces
    - TK/TCL interface
    - HotJava Browser interface (WWW)
The NetSolve Agent

- Intelligent Agent -> Optimum use of resources
- The agent is permanently aware of:
  - the configuration of the NetSolve system
  - the characteristics of each NetSolve resource
  - the load of each resource
  - the functionalities of each resource
- Each client request informs the agent about:
  - the type of problem to solve
  - the size of the problem
- The NetSolve Agent chooses the NetSolve resource: Load Balancing Strategy


The NetSolve System

- The system is a set of computational servers
- Heterogeneity supported (XDR)
- Dynamic addition of resources
- Fault Tolerance
- Each server uses the installed scientific packages. Examples of packages:
  - BLAS
  - LAPACK
  - ScaLAPACK
  - Ellpack
- Connection to other systems: NEOS
LAPACK / ScaLAPACK
A Scalable Library for Distributed Memory Machines

- LAPACK success: from the HP-48G to the CRAY C-90
  - Targets workstations, vector machines, shared memory
  - 449,015 requests since then (as of 10/95)
  - Dense and banded matrices
  - Linear systems, least squares, eigenproblems
  - Manual available from SIAM; html; Japanese translation
  - Second release and manual in 9/94
  - Used by Cray, IBM, Convex, NAG, IMSL, ...
- Move to new generation high performance machines:
  SP-2, T3D, Paragon, clusters of workstations, ...
- Add functionality: generalized eigenproblem, sparse matrices, ...


New challenges

- No standard software: Fortran 90, HPF, PVM, MPI, but ...
- Many flavors of message passing: need a standard - BLACS
- Highly varying ratio  r = computation speed / communication speed
- Many ways to lay out data
- Fastest parallel algorithm sometimes less stable numerically
- Provide useful scientific program libraries for computational scientists
  working on new parallel architectures
Needs

- Standards
  - Portable
  - Vendor-supported
  - Scalable
- High efficiency
- Protect programming investment, if possible
- Avoid low-level details
- Scalability
- Simple calling interface


ScaLAPACK - OVERVIEW

Goal: Port the Linear Algebra PACKage (LAPACK) to distributed-memory
environments.

Scalability and Performance
- Optimized compute and communication engines
- Blocked data access (level 3 BLAS) yields good node performance
- 2-D block-cyclic data decomposition yields good load balance

Portability
- BLAS: optimized compute engine
- BLACS: communication kernel (PVM, MPI, ...)
- Stay as close as possible to LAPACK in calling sequence, storage, etc.

Ease of Programming / Maintenance
- Modularity: build a rich set of linear algebra tools (BLAS, BLACS, PBLAS)
- Whenever possible, use reliable LAPACK algorithms.
Communication Library for Linear Algebra
Basic Linear Algebra Communication Subroutines (BLACS)

Purpose of the BLACS:
- A design tool: a conceptual aid in design and coding.
- Associate widely recognized mnemonic names with communication operations;
  improve program readability and the self-documenting quality of code.
- Promote efficiency by identifying frequently-occurring operations of
  linear algebra which can be optimized on various computers.
- Program portability is improved through standardization of kernels,
  without giving up efficiency since optimized versions will be available.


2-DIMENSIONAL BLOCK CYCLIC DISTRIBUTION

[Figure: global (left) and distributed (right) views of an 8 x 8 matrix
mapped block-cyclically onto a 2 x 3 process grid.]

- Ensures good load balance -> performance and scalability,
- Encompasses a large number (but not all) of data distribution schemes,
- Redistribution routines are needed to go from one distribution to the other,
- Not intuitive.
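The block-cyclic map itself is easy to state in code. The following small
routine is a sketch written for this transcript (it is not part of ScaLAPACK,
and the name BCMAP and its argument list are invented here): applied
independently to the rows and to the columns, it returns the process
coordinate and local index owning a global index IG when blocks of size NB
are dealt out cyclically over NPROCS process rows (or columns), assuming the
first block lives on process 0.

*     Sketch only: 1-D block-cyclic map used along each dimension of
*     the 2-D distribution shown above (first block on process 0).
      SUBROUTINE BCMAP( IG, NB, NPROCS, IPROC, IL )
      INTEGER            IG, NB, NPROCS, IPROC, IL
      INTEGER            IB
*     0-based index of the block containing global index IG (1-based)
      IB = ( IG - 1 ) / NB
*     Process (row or column) coordinate owning that block
      IPROC = MOD( IB, NPROCS )
*     1-based local index of IG within the owning process
      IL = ( IB / NPROCS ) * NB + MOD( IG - 1, NB ) + 1
      RETURN
      END

For instance, with NB = 2 and NPROCS = 2, global rows 1, 2, 5, 6 land on
process 0 as local rows 1-4, and rows 3, 4, 7, 8 land on process 1.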
BLACS - BASICS

- Processes are embedded in a 2-dimensional grid.

[Figure: example of a 3 x 4 process grid with processes numbered 0-11.]


BLACS - BASICS

- Operations are matrix based and ID-less.

[Figure: the M-by-N rectangular and trapezoidal operand shapes, stored with
leading dimension LDA.]

- An operation which involves more than one sender and one receiver is
  called a "scoped operation". Using a 2D grid, there are 3 natural scopes:

  SCOPE    MEANING
  Row      All processes in a process row participate.
  Column   All processes in a process column participate.
  All      All processes in the process grid participate.

  -> use of a SCOPE parameter.

- 4 types of BLACS routines: point-to-point communication, broadcast,
  combine operations and support routines.
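Before any of the routines on the following pages can be used, a process
grid and its context have to be created. The fragment below is a sketch
written for this transcript of the usual start-up sequence, using the BLACS
support routines BLACS_PINFO, BLACS_GET, BLACS_GRIDINIT, BLACS_GRIDINFO,
BLACS_GRIDEXIT and BLACS_EXIT; the 2 x 3 grid shape is an arbitrary choice
for illustration and assumes at least 6 processes are available.

*     Sketch only: create a 2 x 3 process grid in row-major order,
*     query this process' coordinates, then release the grid.
      PROGRAM GRID
      INTEGER            IAM, NPROCS, ICONTXT
      INTEGER            NPROW, NPCOL, MYROW, MYCOL
*
      CALL BLACS_PINFO( IAM, NPROCS )
*     Get the default system context
      CALL BLACS_GET( -1, 0, ICONTXT )
*     Map the processes onto a 2 x 3 grid
      NPROW = 2
      NPCOL = 3
      CALL BLACS_GRIDINIT( ICONTXT, 'Row', NPROW, NPCOL )
*     Where am I in the grid?
      CALL BLACS_GRIDINFO( ICONTXT, NPROW, NPCOL, MYROW, MYCOL )
*     ... BLACS / PBLAS / ScaLAPACK calls using ICONTXT go here ...
      CALL BLACS_GRIDEXIT( ICONTXT )
      CALL BLACS_EXIT( 0 )
      END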
BLACS - COMMUNICATION ROUTINES

1. Send/Recv

Send a submatrix from one process to another:

   _xxSD2D( ICONTXT, M, N, A, LDA, RDEST, CDEST )
   _xxRV2D( ICONTXT, M, N, A, LDA, RSRC, CSRC )

   _  Data type                xx  Matrix type
   I: INTEGER,                 GE: GEneral rectangular matrix,
   S: REAL,                    TR: TRapezoidal matrix.
   D: DOUBLE PRECISION,
   C: COMPLEX,
   Z: DOUBLE COMPLEX.

   CALL BLACS_GRIDINFO( ICONTXT, NPROW, NPCOL, MYROW, MYCOL )
   IF( MYROW.EQ.0 .AND. MYCOL.EQ.0 ) THEN
      CALL DGESD2D( ICONTXT, 5, 1, X, 5, 1, 0 )
   ELSE IF( MYROW.EQ.1 .AND. MYCOL.EQ.0 ) THEN
      CALL DGERV2D( ICONTXT, 5, 1, Y, 5, 0, 0 )
   END IF


BLACS - COMMUNICATION ROUTINES

2. Broadcast

Send a submatrix to all processes or a subsection of processes in SCOPE,
using various distribution patterns (TOP):

   _xxBS2D( ICONTXT, SCOPE, TOP, M, N, A, LDA )
   _xxBR2D( ICONTXT, SCOPE, TOP, M, N, A, LDA, RSRC, CSRC )

   SCOPE       TOP
   'Row'       ' ' (default)
   'Column'    'Increasing Ring'
   'All'       '1-tree', ...

   CALL BLACS_GRIDINFO( ICONTXT, NPROW, NPCOL, MYROW, MYCOL )
   IF( MYROW.EQ.0 ) THEN
      IF( MYCOL.EQ.0 ) THEN
         CALL DGEBS2D( ICONTXT, 'Row', ' ', 2, 2, A, 3 )
      ELSE
         CALL DGEBR2D( ICONTXT, 'Row', ' ', 2, 2, A, 3, 0, 0 )
      END IF
   END IF
BLACS - COMBINE OPERATIONS

3. Global combine operations

Perform element-wise SUM, |MAX|, |MIN| operations on rectangular matrices:

   _GSUM2D( ICONTXT, SCOPE, TOP, M, N, A, LDA,
            RDEST, CDEST )

   _GAMX2D( ICONTXT, SCOPE, TOP, M, N, A, LDA,
            RA, CA, RCFLAG, RDEST, CDEST )

   _GAMN2D( ICONTXT, SCOPE, TOP, M, N, A, LDA,
            RA, CA, RCFLAG, RDEST, CDEST )

- RDEST = -1 indicates that the result of the operation should be left on
  all processes selected by SCOPE.
- For |MAX| and |MIN|, when RCFLAG = -1, RA and CA are not referenced;
  otherwise, these arrays are set on output with the coordinates of the
  process owning the corresponding maximum (resp. minimum) element in
  absolute value of A. Both arrays must be at least of size M x N and
  their leading dimension is specified by RCFLAG.


BLACS - ADVANCED TOPICS

The BLACS context is the BLACS mechanism for partitioning communication
space. A defining property of a context is that a message in a context
cannot be sent or received in another context. The BLACS context includes
the definition of a grid, and each process' coordinates in it.

It allows the user to
- create arbitrary groups of processes,
- create multiple overlapping and/or disjoint grids,
- isolate each process grid so that grids do not interfere with each other.

-> the BLACS context is compatible with the MPI communicator.
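As a small illustration of the combine operations (a sketch written for this
transcript, not taken from the slides; the routine name ROWSUM is invented
here): each process in a process row contributes a 2 x 2 local matrix, and
the element-wise sum is left on every process of that row by passing
RDEST = -1, as described above.

*     Sketch only: element-wise sum of the local 2 x 2 matrix W over
*     this process row; the result is left on every process of the
*     row because RDEST = -1 (CDEST is then ignored).
*     ICONTXT is assumed to come from BLACS_GRIDINIT.
      SUBROUTINE ROWSUM( ICONTXT, W )
      INTEGER            ICONTXT
      DOUBLE PRECISION   W( 2, 2 )
      CALL DGSUM2D( ICONTXT, 'Row', ' ', 2, 2, W, 2, -1, 0 )
      RETURN
      END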
BLACS - REFERENCES

BLACS software and documentation can be obtained via:
- WWW: http://www.netlib.org/blacs
- anonymous ftp: ftp.netlib.org: cd blacs; get index
- email [email protected] with the message: send index from blacs

Comments and questions can be addressed to ...

- J. Dongarra and R. van de Geijn, Two Dimensional Basic Linear Algebra
  Communication Subprograms, Technical Report UT CS-91-138, LAPACK Working
  Note 37, University of Tennessee, 1991.
- R. C. Whaley, Basic Linear Algebra Communication Subprograms: Analysis and
  Implementation Across Multiple Parallel Architectures, Technical Report
  UT CS-94-234, LAPACK Working Note 73, University of Tennessee, 1994.
- J. Dongarra and R. C. Whaley, A User's Guide to the BLACS v1.0, Technical
  Report UT CS-95-281, LAPACK Working Note 94, University of Tennessee, 1995.
- Message Passing Interface Forum, MPI: A Message Passing Interface Standard,
  International Journal of Supercomputer Applications and High Performance
  Computing, 8(3-4), 1994.
- A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. Sunderam,
  PVM: Parallel Virtual Machine. A Users' Guide and Tutorial for Networked
  Parallel Computing, The MIT Press, Cambridge, Massachusetts, 1994.


PBLAS - INTRODUCTION

- Parallel Basic Linear Algebra Subprograms for distributed memory MIMD
  computers.
- Similar functionality as the BLAS: distributed vector-vector, matrix-vector
  and matrix-matrix operations.
- Simplification of the parallelization of dense linear algebra codes:
  especially when implemented on top of the BLAS.
- Clarity: code is shorter and easier to read.
- Modularity: gives the programmer larger building blocks.
- Program portability: machine dependencies are confined to the PBLAS
  (BLAS and BLACS).
PBLAS - STORAGE CONVENTIONS

- An M-by-N matrix is block partitioned and these MB-by-NB blocks are
  distributed according to the 2-dimensional block-cyclic scheme
  (load-balanced computations, scalability).
- Locally, the scattered columns are stored contiguously ("column-major"
  FORTRAN order; re-use of the BLAS leading dimension LLD).

[Figure: a 5 x 5 matrix partitioned in 2 x 2 blocks, shown from the 2 x 2
process grid point of view.]

- Descriptor DESC_: 8-integer array describing the matrix layout, containing
  M, N, MB, NB, RSRC, CSRC, CTXT and LLD, where (RSRC, CSRC) are the
  coordinates of the process owning the first matrix entry in the grid
  specified by CTXT.
- Ex: M = N = 5, MB = NB = 2, RSRC = CSRC = 0, LLD >= 3 in process row 0,
  and >= 2 in process row 1.


PBLAS - ARGUMENT CONVENTIONS

- Global view of the matrix operands, allowing global addressing of
  distributed matrices (hiding complex local indexing).

[Figure: the M-by-N submatrix A( IA:IA+M-1, JA:JA+N-1 ) within the
M_-by-N_ global matrix.]

- Code reusability; interface very close to the sequential BLAS:

   CALL DGEXXX( M, N, A( IA, JA ), LDA )
   CALL PDGEXXX( M, N, A, IA, JA, DESCA )

   CALL DGEMM( 'No Transpose', 'No Transpose',
  $            M-J-JB+1, N-J-JB+1, JB, -ONE, A( J+JB, J ),
  $            LDA, A( J, J+JB ), LDA, ONE, A( J+JB, J+JB ),
  $            LDA )

   CALL PDGEMM( 'No Transpose', 'No Transpose',
  $             M-J-JB+1, N-J-JB+1, JB, -ONE, A, J+JB,
  $             J, DESCA, A, J, J+JB, DESCA, ONE, A,
  $             J+JB, J+JB, DESCA )
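For concreteness, the fragment below is a sketch written for this transcript
(not taken from the talk) of how the 8-integer descriptor of the 5 x 5
example above might be filled by hand; the entry order follows the list
given above (M, N, MB, NB, RSRC, CSRC, CTXT, LLD), ICONTXT is assumed to be
a BLACS context obtained from BLACS_GRIDINIT, LLDA is this process' local
leading dimension, and the routine name FILLDA is invented here.

*     Sketch only: descriptor for the 5 x 5 example (MB = NB = 2,
*     first entry owned by process (0,0) of the grid ICONTXT).
      SUBROUTINE FILLDA( ICONTXT, LLDA, DESCA )
      INTEGER            ICONTXT, LLDA, DESCA( 8 )
*     Global rows and columns
      DESCA( 1 ) = 5
      DESCA( 2 ) = 5
*     Row and column block sizes
      DESCA( 3 ) = 2
      DESCA( 4 ) = 2
*     Process row / column owning the first matrix entry
      DESCA( 5 ) = 0
      DESCA( 6 ) = 0
*     BLACS context and local leading dimension
      DESCA( 7 ) = ICONTXT
      DESCA( 8 ) = LLDA
      RETURN
      END

Per the example above, LLDA would be at least 3 on process row 0 and at
least 2 on process row 1.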
PBLAS - SPECIFICATIONS

   DGEMV( TRANS, M, N, ALPHA, A, LDA, X, INCX,
          BETA, Y, INCY )

                        |
                        v

   PDGEMV( TRANS, M, N, ALPHA, A, IA, JA, DESCA,
           X, IX, JX, DESCX, INCX,
           BETA, Y, IY, JY, DESCY, INCY )

- In the PBLAS, the increment specified for vectors is always global.
  So far only INCX = 1 and INCX = DESCX( 1 ) are supported.

   BLAS                 PBLAS
   INTEGER LDA          INTEGER IA, JA, DESCA( 8 )
   INTEGER INCX         INTEGER INCX, IX, JX, DESCX( 8 )
   A, LDA               A, IA, JA, DESCA
   X, INCX              X, IX, JX, DESCX, INCX

- PBLAS matrix transposition routine (REAL):

   PDTRAN( M, N, ALPHA, A, IA, JA, DESCA, BETA,
           C, IC, JC, DESCC )

- Level 1 BLAS functions have become PBLAS subroutines. Output scalar
  correct in operand scope.

>> SPMD PROGRAMMING MODEL <<


PBLAS - REFERENCES

PBLAS software and documentation can be obtained via:
- WWW: http://www.netlib.org/scalapack
- anonymous ftp: ftp.netlib.org: cd scalapack; get index
- email [email protected] with the message: send index from scalapack

Comments and questions can be addressed to ...

- J. Choi, J. Dongarra, S. Ostrouchov, A. Petitet, D. Walker and R. C.
  Whaley, A Proposal for a Set of Parallel Basic Linear Algebra Subprograms,
  Technical Report UT CS-95-292, LAPACK Working Note 100, University of
  Tennessee, 1995.
ScaLAPACK IMPLEMENTATION

[Figure: software hierarchy - ScaLAPACK built on the PBLAS, which are built
on the BLAS and the BLACS on top of the SYSTEM.]

- LAPACK -> ScaLAPACK
- BLAS -> PBLAS (BLAS, BLACS)
- Quality maintenance
- Portability (F77 - C)
- Efficiency - Reusability (BLAS, BLACS)
- Hide parallelism in the PBLAS


Parallel Level 2 and 3 BLAS: PBLAS

Goal
- Port sequential libraries for shared memory machines that use the BLAS to
  distributed memory machines with little effort.
- Reuse the existing software by hiding the message passing in a set of
  BLAS routines.
- Parallel implementation of the BLAS that understands how the matrix is
  laid out and, when called, can perform not only the operation but also
  the required data transfer.

BLAS - Performance and Portability
- Blocked data access (level 3 BLAS) yields good local performance

BLACS - Performance and Portability
- Dedicated machine-dependent communication kernel,
- 2D array-based and ID-less communication,
- Code re-entrance, multiple overlapping process grids,
- Safe co-existence with other parallel libraries.
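To show what hiding the parallelism means at the user level, the fragment
below is a sketch written for this transcript (not from the talk) of the
typical call sequence for solving a distributed linear system with the
ScaLAPACK routines PDGETRF and PDGETRS; the grid and the descriptors
DESCA / DESCB are assumed to have been set up as on the earlier BLACS and
PBLAS pages (8-integer descriptors, as used in these slides), the local
pieces of A and B are assumed to be filled in already, and the wrapper name
PSOLVE is invented here.

*     Sketch only: LU factor and solve a distributed system A*X = B.
*     DESCA / DESCB describe the block-cyclic layout of A and B,
*     IPIV holds the local pivot information produced by PDGETRF.
      SUBROUTINE PSOLVE( N, NRHS, A, DESCA, IPIV, B, DESCB, INFO )
      INTEGER            N, NRHS, INFO
      INTEGER            DESCA( 8 ), DESCB( 8 ), IPIV( * )
      DOUBLE PRECISION   A( * ), B( * )
*     Factor sub( A ) = P * L * U
      CALL PDGETRF( N, N, A, 1, 1, DESCA, IPIV, INFO )
*     Solve using the factors, overwriting B with the solution X
      IF( INFO.EQ.0 )
     $   CALL PDGETRS( 'No transpose', N, NRHS, A, 1, 1, DESCA,
     $                 IPIV, B, 1, 1, DESCB, INFO )
      RETURN
      END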
SEQUENTIAL LU FACTORIZATION CODE

      DO 20 J = 1, MIN( M, N ), NB
         JB = MIN( MIN( M, N )-J+1, NB )
*
*        Factor diagonal and subdiagonal blocks and test for exact
*        singularity.
*
         CALL DGETF2( M-J+1, JB, A( J, J ), LDA, IPIV( J ),
     $                IINFO )
*
*        Adjust INFO and the pivot indices.
*
         IF( INFO.EQ.0 .AND. IINFO.GT.0 )
     $      INFO = IINFO + J - 1
         DO 10 I = J, MIN( M, J+JB-1 )
            IPIV( I ) = J - 1 + IPIV( I )
   10    CONTINUE
*
*        Apply interchanges to columns 1:J-1.
*
         CALL DLASWP( J-1, A, LDA, J, J+JB-1, IPIV, 1 )
*
         IF( J+JB.LE.N ) THEN
*
*           Apply interchanges to columns J+JB:N.
*
            CALL DLASWP( N-J-JB+1, A( 1, J+JB ), LDA, J, J+JB-1,
     $                   IPIV, 1 )
*
*           Compute block row of U.
*
            CALL DTRSM( 'Left', 'Lower', 'No transpose', 'Unit',
     $                  JB, N-J-JB+1, ONE, A( J, J ), LDA,
     $                  A( J, J+JB ), LDA )
            IF( J+JB.LE.M ) THEN
*
*              Update trailing submatrix.
*
               CALL DGEMM( 'No transpose', 'No transpose',
     $                     M-J-JB+1, N-J-JB+1, JB, -ONE,
     $                     A( J+JB, J ), LDA, A( J, J+JB ), LDA,
     $                     ONE, A( J+JB, J+JB ), LDA )
            END IF
         END IF
   20 CONTINUE


PARALLEL LU FACTORIZATION CODE

      DO 10 J = JA, JA+MIN( M, N )-1, DESCA( 4 )
         JB = MIN( MIN( M, N )-J+JA, DESCA( 4 ) )
         I = IA + J - JA
*
*        Factor diagonal and subdiagonal blocks and test for exact
*        singularity.
*
         CALL PDGETF2( M-J+JA, JB, A, I, J, DESCA, IPIV, IINFO )
*
*        Adjust INFO and the pivot indices.
*
         IF( INFO.EQ.0 .AND. IINFO.GT.0 )
     $      INFO = IINFO + J - JA
*
*        Apply interchanges to columns JA:J-JA.
*
         CALL PDLASWP( 'Forward', 'Rows', J-JA, A, IA, JA, DESCA,
     $                 J, J+JB-1, IPIV )
*
         IF( J-JA+JB+1.LE.N ) THEN
*
*           Apply interchanges to columns J+JB:JA+N-1.
*
            CALL PDLASWP( 'Forward', 'Rows', N-J-JB+JA, A, IA,
     $                    J+JB, DESCA, J, J+JB-1, IPIV )
*
*           Compute block row of U.
*
            CALL PDTRSM( 'Left', 'Lower', 'No transpose', 'Unit',
     $                   JB, N-J-JB+JA, ONE, A, I, J, DESCA, A, I,
     $                   J+JB, DESCA )
            IF( J-JA+JB+1.LE.M ) THEN
*
*              Update trailing submatrix.
*
               CALL PDGEMM( 'No transpose', 'No transpose',
     $                      M-J-JB+JA, N-J-JB+JA, JB, -ONE, A,
     $                      I+JB, J, DESCA, A, I, J+JB, DESCA,
     $                      ONE, A, I+JB, J+JB, DESCA )
            END IF
         END IF
   10 CONTINUE


PERFORMANCE RESULTS

[Figure: LU Factorization on the IBM SP-2 - Gflops versus problem size
(2000 to 17000) on 16, 32 and 64 nodes.]
PERFORMANCE RESULTS

[Figure: QR Factorization on Intel Machines - Gflops versus problem size
on the Intel Delta and Paragon.]


PERFORMANCE RESULTS

[Figure: QR Factorization on 128 nodes of Intel machines (512 processors) -
Gflops versus problem size on the Delta (i860) and Paragon.]
[Figure: LU Factorization on the Intel Paragon - Gflops versus problem size
on 128, 256, 512 and 1024 nodes.]

[Figure: LU Factorization on Sparc-5 workstations connected by Ethernet -
Mflops versus problem size on 1, 2, 4, 8 and 16 nodes.]

[Figure: LU performance on the Cray T3D (NB = 32, 16 x 16 process grid) -
Gflops versus matrix size N.]

[Figure: LU Factorization on the IBM SP-2 - Gflops versus problem size on
4, 8 and 16 nodes.]
Symmetric Eigenproblem  Ax = lambda x,  A = A^T

Two phases:

Phase 1: Reduce A to tridiagonal form T:  A = U T U^T, where T is symmetric
tridiagonal with diagonal entries a_1, ..., a_n and off-diagonal entries
b_1, ..., b_{n-1}.

Phase 2: Find the eigenvalues and eigenvectors of T:  T = Q Lambda Q^T.

So  A = (U Q) Lambda (U Q)^T = V Lambda V^T.

Phase 2 is the bottleneck (about 3x Phase 1).

LAPACK: Cuppen's divide-and-conquer method.

ScaLAPACK:
- Bisection + inverse iteration with limited reorthogonalization: fast, but
  may fail to compute orthogonal eigenvectors if eigenvalues are clustered.
- QR iteration: slower but reliable.


Outline of Cuppen's Divide and Conquer

1. Goal: compute T = Q Lambda Q^T,  Lambda = diag(lambda_i),  Q Q^T = I.

2. Tear T into two half-size tridiagonal matrices plus a rank-one update:

       T = [ T_1   0  ]  +  v v^T
           [  0   T_2 ]

3. Solve T_1 = Q_1 Lambda_1 Q_1^T and T_2 = Q_2 Lambda_2 Q_2^T recursively.

4. Then

       T = [ Q_1   0  ] [ Lambda_1    0     ] [ Q_1   0  ]^T  +  v v^T
           [  0   Q_2 ] [    0     Lambda_2 ] [  0   Q_2 ]

         = [ Q_1   0  ] ( D + z z^T ) [ Q_1   0  ]^T
           [  0   Q_2 ]               [  0   Q_2 ]

   where  D = [ Lambda_1 0 ; 0 Lambda_2 ]  and  z = [ Q_1^T 0 ; 0 Q_2^T ] v.

5. The eigenvalues lambda of D + z z^T are the roots of the "secular
   equation"

       1 + sum_i  z_i^2 / ( d_ii - lambda )  =  0

6. The eigenvectors v_j of D + z z^T are ( D - lambda_j I )^{-1} z.

7. The eigenvectors of T are [ Q_1 0 ; 0 Q_2 ] V.
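A concrete instance of the tear in step 2 may help (this worked 4-by-4
example is added in this transcript; it is not on the slide). Only the two
middle diagonal entries change, and the rank-one piece couples the two
halves:

    [ a1 b1  0  0 ]     [ a1   b1     0      0  ]
    [ b1 a2 b2  0 ]  =  [ b1  a2-b2   0      0  ]  +  b2 * u u^T,
    [  0 b2 a3 b3 ]     [  0    0   a3-b2   b3  ]
    [  0  0 b3 a4 ]     [  0    0    b3     a4  ]

    with  u = ( 0, 1, 1, 0 )^T.

Here T_1 and T_2 are the two 2-by-2 tridiagonal blocks on the right, and
(for b2 > 0) v = sqrt(b2) * u recovers the v v^T form used in step 2.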
Speedups with Divide-and-Conquer

[Figure: time of the expert driver (BZ: bisection + inverse iteration) and
of QR iteration relative to divide and conquer (DC = 1), for random matrices
reduced to tridiagonal form and for matrices with geometrically distributed
eigenvalues, matrix sizes up to 1000.]


ScaLAPACK Performance on 64 Processor Intel Gamma

[Figure: "ScaLAPACK Performance, Intel Gamma, 64 PEs" - predicted and
measured parallel efficiency versus matrix dimension (500 to 2500);
"ScaLAPACK Time Distribution, Intel Gamma, 64 PEs" - percentage of time
spent in the TRD, EBZ, EIN and MM phases versus matrix dimension.]
WHAT IS AVAILABLE TO YOU?

Tools
- BLACS (CMMD, MPL, NX, PVM)
- PBLAS
- REDISTRIBUTION
- PUMMA

Factorizations
- LU + solve
- Cholesky + solve
- QR, QL, LQ, RQ, QRP

Matrix inversions

Q, QC, Q^T C, CQ, CQ^T

Reductions
- to Hessenberg form
- to symmetric tridiagonal form
- to bidiagonal form

Symmetric eigensolver


Sparse matrix activities

ARPACK - Arnoldi-Pack - D. Sorensen
- Algorithms for symmetric and nonsymmetric linear systems, eigenproblems,
  generalized eigenproblems
- User's matrix-vector product, rest serial
- Based on a clever identity for compressing the partial Arnoldi
  factorization

CAPSS - Cartesian Parallel Sparse Solver - P. Raghavan and M. Heath
- Parallel sparse Cholesky
- For finite difference / finite element problems
- Uses coordinates from the underlying mesh
- Cartesian nested dissection to lay out data

Sparse LU with Pivoting
- Exploits data locality using a supernodal approach
- Up to 120 Mflops on an IBM RS6000/590, depending on sparsity pattern
- Fastest code available on many test cases, along with UMFPACK
- To be released soon
Software Framework

[Figure: the ScaLAPACK software hierarchy - GLOBAL: ScaLAPACK, Parallel BLAS
(PBLAS), BLACS; LOCAL: Sequential BLAS and Message Passing Primitives
(MPI, PVM).]


PVM calls within the BLACS

- PVM (Parallel Virtual Machine) is a software package that permits a
  heterogeneous collection of serial, parallel, and vector computers hooked
  together by a network to appear as one large computer.
- The BLACS (Basic Linear Algebra Communication Subroutines) are a software
  package to perform communication at the same level as the BLAS are for
  computation.
- The BLACS have been written in terms of calls to PVM.
- Allows networks of workstations to be used with minimal change.
- MPI (Message Passing Interface): standard for message passing.
- BLACS -> MPI underway.
SCALAPACK - ONGOING WORK

- HPF
  - HPF supports the 2D block-cyclic distribution used by ScaLAPACK
  - HPF-like calling sequence (global indexing scheme)
- Portability: MPI implementation of the BLACS
- More flexibility added to the PBLAS
- More testing and timing programs
- Condition estimators
- Iterative refinement of linear system solutions
- SVD
- Linear least squares solvers
- Banded systems
- Nonsymmetric eigensolvers
- ...


[Figure: LU Factorization with Solve, Token Ring Connected IBM RS/6000-320H -
speed in Megaflops versus number of processors (1 to 8), for grids 1x1
(n=1500), 1x2 (n=2000), 1x4, 1x6 and 1x8 (n=3000).]

LU Results on IBM Cluster

   Proc   Time   Speedup        Proc          Mflop/s   Speedup
   1      1030   1              1  (n=1500)    33
   2       403   2.6            2  (n=2000)    64       1.9
   4       155   6.6            4  (n=3000)   116       3.5
   6       115   8.9            6  (n=3000)   156       4.7
   8        93   11.0           8  (n=3000)   194       5.9