LAPACK Working Note 58



The Design of Linear Algebra Libraries for High Performance Computers

Jack Dongarra*† and David Walker†

*Department of Computer Science
University of Tennessee
Knoxville, TN 37996

†Mathematical Sciences Section
Oak Ridge National Laboratory
Oak Ridge, TN 37831

Abstract

This paper discusses the design of linear algebra libraries for high performance computers. Particular emphasis is placed on the development of scalable algorithms for MIMD distributed memory concurrent computers. A brief description of the EISPACK, LINPACK, and LAPACK libraries is given, followed by an outline of ScaLAPACK, which is a distributed memory version of LAPACK currently under development. The importance of block-partitioned algorithms in reducing the frequency of data movement between different levels of hierarchical memory is stressed. The use of such algorithms helps reduce the message startup costs on distributed memory concurrent computers. Other key ideas in our approach are the use of distributed versions of the Level 3 Basic Linear Algebra Subprograms (BLAS) as computational building blocks, and the use of Basic Linear Algebra Communication Subprograms (BLACS) as communication building blocks. Together the distributed BLAS and the BLACS can be used to construct higher-level algorithms, and hide many details of the parallelism from the application developer.

The block-cyclic data distribution is described, and adopted as a good way of distributing block-partitioned matrices. Block-partitioned versions of the Cholesky and LU factorizations are presented, and optimization issues associated with the implementation of the LU factorization algorithm on distributed memory concurrent computers are discussed, together with its performance on the Intel Delta system. Finally, approaches to the design of library interfaces are reviewed.



This work was supported in part by the NSF under Grant No. ASC-9005933 and by ARPA and ARO under contract number DAAL03-91--0047.

1 Introduction

The increasing availability of advanced-architecture computers is having a very significant effect on all spheres of scientific computation, including algorithm research and software development in numerical linear algebra. Linear algebra, in particular the solution of linear systems of equations, lies at the heart of most calculations in scientific computing. This chapter discusses some of the recent developments in linear algebra designed to exploit these advanced-architecture computers. Particular attention will be paid to dense factorization routines, such as the Cholesky and LU factorizations, and these will be used as examples to highlight the most important factors that must be considered in designing linear algebra software for advanced-architecture computers. We use these factorization routines for illustrative purposes not only because they are relatively simple, but also because of their importance in several scientific and engineering applications that make use of boundary element methods. These applications include electromagnetic scattering and computational fluid dynamics problems, as discussed in more detail in Section 4.1.

Much of the work in developing linear algebra software for advanced-architecture computers is motivated by the need to solve large problems on the fastest computers available. In this chapter, we focus on four basic issues: (1) the motivation for the work; (2) the development of standards for use in linear algebra and the building blocks for a library; (3) aspects of algorithm design and parallel implementation; and (4) future directions for research.

For the past 15 years or so, there has been a great deal of activity in the area of algorithms and software for solving linear algebra problems. The linear algebra community has long recognized the need for help in developing algorithms into software libraries, and several years ago, as a community effort, put together a de facto standard for identifying basic operations required in linear algebra algorithms and software. The hope was that the routines making up this standard, known collectively as the Basic Linear Algebra Subprograms (BLAS), would be efficiently implemented on advanced-architecture computers by many manufacturers, making it possible to reap the portability benefits of having them efficiently implemented on a wide range of machines. This goal has been largely realized.

The key insight of our approach to designing linear algebra algorithms for advanced-architecture computers is that the frequency with which data are moved between different levels of the memory hierarchy must be minimized in order to attain high performance. Thus, our main algorithmic approach for exploiting both vectorization and parallelism in our implementations is the use of block-partitioned algorithms, particularly in conjunction with highly-tuned kernels for performing matrix-vector and matrix-matrix operations (the Level 2 and 3 BLAS). In general, the use of block-partitioned algorithms requires data to be moved as blocks, rather than as vectors or scalars, so that although the total amount of data moved is unchanged, the latency (or startup cost) associated with the movement is greatly reduced because fewer messages are needed to move the data.

A second key idea is that the performance of an algorithm can be tuned by a user by varying the parameters that specify the data layout. On shared memory machines, this is controlled by the block size, while on distributed memory machines it is controlled by the block size and the configuration of the logical process mesh, as described in more detail in Section 5.

In Section 1, we first give an overview of some of the major software projects aimed at solving dense linear algebra problems. Next, we describe the types of machine that benefit most from the use of block-partitioned algorithms, and discuss what is meant by high-quality, reusable software for advanced-architecture computers. Section 2 discusses the role of the BLAS in portability and performance on high-performance computers. We discuss the design of these building blocks, and their use in block-partitioned algorithms, in Section 3. Section 4 focuses on the design of a block-partitioned algorithm for LU factorization, and Sections 5, 6, and 7 use this example to illustrate the most important factors in implementing dense linear algebra routines on MIMD, distributed memory, concurrent computers. Section 5 deals with the issue of mapping the data onto the hierarchical memory of a concurrent computer. The layout of an application's data is crucial in determining the performance and scalability of the parallel code. In Sections 6 and 7, details of the parallel implementation and optimization issues are discussed. Section 8 presents some future directions for investigation.

1.1 Dense Linear Algebra Libraries

Over the past twenty-five years, the first author has been directly involved in the development of several important packages of dense linear algebra software: EISPACK, LINPACK, LAPACK, and the BLAS. In addition, both authors are currently involved in the development of ScaLAPACK, a scalable version of LAPACK for distributed memory concurrent computers. In this section, we give a brief review of these packages: their history, their advantages, and their limitations on high-performance computers.

1.1.1 EISPACK

EISPACK is a collection of Fortran subroutines that compute the eigenvalues and eigenvectors of nine classes of matrices: complex general, complex Hermitian, real general, real symmetric, real symmetric banded, real symmetric tridiagonal, special real tridiagonal, generalized real, and generalized real symmetric matrices. In addition, two routines are included that use singular value decomposition to solve certain least-squares problems.

EISPACK is primarily based on a collection of Algol procedures developed in the 1960s and collected by J. H. Wilkinson and C. Reinsch in a volume entitled Linear Algebra in the Handbook for Automatic Computation [57] series. This volume was not designed to cover every possible method of solution; rather, algorithms were chosen on the basis of their generality, elegance, accuracy, speed, or economy of storage.

Since the release of EISPACK in 1972, over ten thousand copies of the collection have been distributed worldwide.

1.1.2 LINPACK

LINPACK is a collection of Fortran subroutines that analyze and solve linear equations and linear least-squares problems. The package solves linear systems whose matrices are general, banded, symmetric indefinite, symmetric positive definite, triangular, and tridiagonal square. In addition, the package computes the QR and singular value decompositions of rectangular matrices and applies them to least-squares problems.

LINPACK is organized around four matrix factorizations: LU factorization, pivoted Cholesky factorization, QR factorization, and singular value decomposition. The term LU factorization is used here in a very general sense to mean the factorization of a square matrix into a lower triangular part and an upper triangular part, perhaps with pivoting. These factorizations will be treated at greater length later, when the actual LINPACK subroutines are discussed. But first a digression on organization and factors influencing LINPACK's efficiency is necessary.

LINPACK uses column-oriented algorithms to increase efficiency by preserving locality of reference. This means that if a program references an item in a particular block, the next reference is likely to be in the same block. By column orientation we mean that the LINPACK codes always reference arrays down columns, not across rows. This works because Fortran stores arrays in column major order. Thus, as one proceeds down a column of an array, the memory references proceed sequentially in memory. On the other hand, as one proceeds across a row, the memory references jump across memory, the length of the jump being proportional to the length of a column. The effects of column orientation are quite dramatic: on systems with virtual or cache memories, the LINPACK codes will significantly outperform codes that are not column oriented. We note, however, that textbook examples of matrix algorithms are seldom column oriented.
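The effect is easy to see in a small self-contained program. The following sketch is illustrative only (it is not LINPACK code); it scales the same array twice, once sweeping down columns and once across rows, and on machines with cache or virtual memory the column-oriented loop nest is the one that preserves locality of reference.

      program colrow
*     Illustrative only: contrasts column-oriented and row-oriented
*     sweeps over a Fortran array, which is stored in column-major order.
      integer m, n
      parameter (m = 1000, n = 1000)
      real a(m,n), s
      integer i, j
      s = 2.0e0
      do j = 1, n
         do i = 1, m
            a(i,j) = 1.0e0
         end do
      end do
*     Column-oriented: the inner loop moves down a column, so successive
*     memory references are adjacent in memory.
      do j = 1, n
         do i = 1, m
            a(i,j) = s*a(i,j)
         end do
      end do
*     Row-oriented: the inner loop moves across a row, so successive
*     references are m elements apart and locality of reference is lost.
      do i = 1, m
         do j = 1, n
            a(i,j) = s*a(i,j)
         end do
      end do
      write(*,*) a(1,1)
      end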

Another important factor influencing the efficiency of LINPACK is the use of the Level 1 BLAS; there are three effects.

First, the overhead entailed in calling the BLAS reduces the efficiency of the code. This reduction is negligible for large matrices, but it can be quite significant for small matrices. The matrix size at which it becomes unimportant varies from system to system; for square matrices it is typically between n = 25 and n = 100. If this seems like an unacceptably large overhead, remember that on many modern systems the solution of a system of order 25 or less is itself a negligible calculation. Nonetheless, it cannot be denied that a person whose programs depend critically on solving small matrix problems in inner loops will be better off with BLAS-less versions of the LINPACK codes. Fortunately, the BLAS can be removed from the smaller, more frequently used program in a short editing session.

Second, the BLAS improve the efficiency of programs when they are run on nonoptimizing compilers. This is because doubly subscripted array references in the inner loop of the algorithm are replaced by singly subscripted array references in the appropriate BLAS. The effect can be seen for matrices of quite small order, and for large orders the savings are quite significant.

Finally, improved efficiency can be achieved by coding a set of BLAS [17] to take advantage of the special features of the computers on which LINPACK is being run. For most computers, this simply means producing machine-language versions. However, the code can also take advantage of more exotic architectural features, such as vector operations.

Further details about the BLAS are presented in Section 2.

1.1.3 LAPACK

LAPACK [14] provides routines for solving systems of simultaneous linear equations, least-squares solutions of linear systems of equations, eigenvalue problems, and singular value problems. The associated matrix factorizations (LU, Cholesky, QR, SVD, Schur, generalized Schur) are also provided, as are related computations such as reordering of the Schur factorizations and estimating condition numbers. Dense and banded matrices are handled, but not general sparse matrices. In all areas, similar functionality is provided for real and complex matrices, in both single and double precision.

The original goal of the LAPACK project was to make the widely used EISPACK and LINPACK libraries run efficiently on shared-memory vector and parallel processors. On these machines, LINPACK and EISPACK are inefficient because their memory access patterns disregard the multi-layered memory hierarchies of the machines, thereby spending too much time moving data instead of doing useful floating-point operations. LAPACK addresses this problem by reorganizing the algorithms to use block matrix operations, such as matrix multiplication, in the innermost loops [3, 14]. These block operations can be optimized for each architecture to account for the memory hierarchy [2], and so provide a transportable way to achieve high efficiency on diverse modern machines. Here we use the term "transportable" instead of "portable" because, for fastest possible performance, LAPACK requires that highly optimized block matrix operations be already implemented on each machine. In other words, the correctness of the code is portable, but high performance is not, if we limit ourselves to a single Fortran source code.

LAPACK can be regarded as a successor to LINPACK and EISPACK. It has virtually all the capabilities of these two packages and much more besides. LAPACK improves on LINPACK and EISPACK in four main respects: speed, accuracy, robustness and functionality. While LINPACK and EISPACK are based on the vector operation kernels of the Level 1 BLAS, LAPACK was designed at the outset to exploit the Level 3 BLAS, a set of specifications for Fortran subprograms that do various types of matrix multiplication and the solution of triangular systems with multiple right-hand sides. Because of the coarse granularity of the Level 3 BLAS operations, their use tends to promote high efficiency on many high-performance computers, particularly if specially coded implementations are provided by the manufacturer.

1.1.4 ScaLAPACK

The ScaLAPACK software library, scheduled for completion by the end of 1994, will extend the LAPACK library to run scalably on MIMD, distributed memory, concurrent computers [10, 11]. For such machines the memory hierarchy includes the off-processor memory of other processors, in addition to the hierarchy of registers, cache, and local memory on each processor. Like LAPACK, the ScaLAPACK routines are based on block-partitioned algorithms in order to minimize the frequency of data movement between different levels of the memory hierarchy. The fundamental building blocks of the ScaLAPACK library are distributed memory versions of the Level 2 and Level 3 BLAS, and a set of Basic Linear Algebra Communication Subprograms (BLACS) [16, 26] for communication tasks that arise frequently in parallel linear algebra computations. In the ScaLAPACK routines, all interprocessor communication occurs within the distributed BLAS and the BLACS, so the source code of the top software layer of ScaLAPACK looks very similar to that of LAPACK.

We envisage a number of user interfaces to ScaLAPACK. Initially, the interface will be similar to that of LAPACK, with some additional arguments passed to each routine to specify the data layout. Once this is in place, we intend to modify the interface so the arguments to each ScaLAPACK routine are the same as in LAPACK. This will require information about the data distribution of each matrix and vector to be hidden from the user. This may be done by means of a ScaLAPACK initialization routine. This interface will be fully compatible with LAPACK. Provided "dummy" versions of the ScaLAPACK initialization routine and the BLACS are added to LAPACK, there will be no distinction between LAPACK and ScaLAPACK at the application level, though each will link to different versions of the BLAS and BLACS. Following on from this, we will experiment with object-based interfaces for LAPACK and ScaLAPACK, with the goal of developing interfaces compatible with Fortran 90 [10] and C++ [24].

1.2 Target Architectures

The EISPACK and LINPACK software libraries were designed for supercomputers used in the 1970s and early 1980s, such as the CDC-7600, Cyber 205, and Cray-1. These machines featured multiple functional units pipelined for good performance [43]. The CDC-7600 was basically a high-performance scalar computer, while the Cyber 205 and Cray-1 were early vector computers.

The development of LAPACK in the late 1980s was intended to make the EISPACK and LINPACK libraries run efficiently on shared memory, vector supercomputers. The ScaLAPACK software library will extend the use of LAPACK to distributed memory concurrent supercomputers. The development of ScaLAPACK began in 1991 and is expected to be completed by the end of 1994.

The underlying concept of both the LAPACK and ScaLAPACK libraries is the use of block-partitioned algorithms to minimize data movement between different levels in hierarchical memory. Thus, the ideas discussed in this chapter for developing a library for dense linear algebra computations are applicable to any computer with a hierarchical memory that (1) imposes a sufficiently large startup cost on the movement of data between different levels in the hierarchy, and for which (2) the cost of a context switch is too great to make fine grain size multithreading worthwhile. Our target machines are, therefore, medium and large grain size advanced-architecture computers. These include "traditional" shared memory, vector supercomputers, such as the Cray Y-MP and C90, and MIMD distributed memory concurrent supercomputers, such as the Intel Paragon, and Thinking Machines' CM-5, and the more recently announced IBM SP1 and Cray T3D concurrent systems. Since these machines have only very recently become available, most of the ongoing development of the ScaLAPACK library is being done on a 128-node Intel iPSC/860 hypercube and on the 520-node Intel Delta system.

The Intel Paragon supercomputer can have up to 2000 nodes, each consisting of an i860 processor and a communications processor. The nodes each have at least 16 Mbytes of memory, and are connected by a high-speed network with the topology of a two-dimensional mesh. The CM-5 from Thinking Machines Corporation [53] supports both SIMD and MIMD programming models, and may have up to 16k processors, though the largest CM-5 currently installed has 1024 processors. Each CM-5 node is a Sparc processor and up to 4 associated vector processors. Point-to-point communication between nodes is supported by a data network with the topology of a "fat tree" [46]. Global communication operations, such as synchronization and reduction, are supported by a separate control network. The IBM SP1 system is based on the same RISC chip used in the IBM RS/6000 workstations and uses a multistage switch to connect processors. The Cray T3D uses the Alpha chip from Digital Equipment Corporation, and connects the processors in a three-dimensional torus.

Future advances in compiler and hardware technologies in the mid to late 1990s are expected to make multithreading a viable approach for masking communication costs. Since the blocks in a block-partitioned algorithm can be regarded as separate threads, our approach will still be applicable on machines that exploit medium and coarse grain size multithreading.

1.3 High-Quality, Reusable, Mathematical Software

In developing a library of high-quality subroutines for dense linear algebra computations the design goals fall into three broad classes:

 • performance
 • ease-of-use
 • range-of-use

1.3.1 Performance

Two important performance metrics are concurrent efficiency and scalability. We seek good performance characteristics in our algorithms by eliminating, as much as possible, overhead due to load imbalance, data movement, and algorithm restructuring. The way the data are distributed (or decomposed) over the memory hierarchy of a computer is of fundamental importance to these factors. Concurrent efficiency, ε, is defined as the concurrent speedup per processor [32], where the concurrent speedup is the execution time, T_seq, for the best sequential algorithm running on one processor of the concurrent computer, divided by the execution time, T, of the parallel algorithm running on N_p processors. When direct methods are used, as in LU factorization, the concurrent efficiency depends on the problem size and the number of processors, so on a given parallel computer and for a fixed number of processors, the running time should not vary greatly for problems of the same size. Thus, we may write,

    ε(N, N_p) = (1/N_p) [ T_seq(N) / T(N, N_p) ]                      (1)

where N represents the problem size. In dense linear algebra computations, the execution time is usually dominated by the floating-point operation count, so the concurrent efficiency is related to the performance, G, measured in floating-point operations per second, by

    G(N, N_p) = (N_p / t_calc) ε(N, N_p)                              (2)

where t_calc is the time for one floating-point operation. For iterative routines, such as eigensolvers, the number of iterations, and hence the execution time, depends not only on the problem size, but also on other characteristics of the input data, such as condition number. A parallel algorithm is said to be scalable [37] if the concurrent efficiency depends on the problem size and number of processors only through their ratio. This ratio is simply the problem size per processor, often referred to as the granularity. Thus, for a scalable algorithm, the concurrent efficiency is constant as the number of processors increases while keeping the granularity fixed. Alternatively, Eq. 2 shows that this is equivalent to saying that, for a scalable algorithm, the performance depends linearly on the number of processors for fixed granularity.
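To make these definitions concrete, consider a purely illustrative example with invented numbers (they are not measurements from any of the machines discussed here). Suppose the best sequential algorithm solves a problem in T_seq(N) = 200 seconds, and the parallel algorithm on N_p = 64 processors takes T(N, N_p) = 5 seconds. Eq. 1 then gives a concurrent efficiency of ε = (1/64)(200/5) ≈ 0.63, and if one floating-point operation takes t_calc = 10 ns, Eq. 2 gives G = (64/10⁻⁸) × 0.63 ≈ 4 × 10⁹ floating-point operations per second, about 4 Gflops. A scalable algorithm would maintain ε ≈ 0.63 as the number of processors grows, provided the problem size per processor (the granularity) is held fixed.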

1.3.2 Ease-Of-Use

Ease-of-use is concerned with factors such as portability and the user interface to the library. Portability, in its most inclusive sense, means that the code is written in a standard language, such as Fortran, and that the source code can be compiled on an arbitrary machine to produce a program that will run correctly. We call this the "mail-order software" model of portability, since it reflects the model used by software servers such as netlib [20]. This notion of portability is quite demanding. It requires that all relevant properties of the computer's arithmetic and architecture be discovered at runtime within the confines of a Fortran code. For example, if it is important to know the overflow threshold for scaling purposes, it must be determined at runtime without overflowing, since overflow is generally fatal. Such demands have resulted in quite large and sophisticated programs [28, 44] which must be modified frequently to deal with new architectures and software releases. This "mail-order" notion of software portability also means that codes generally must be written for the worst possible machine expected to be used, thereby often degrading performance on all others. Ease-of-use is also enhanced if implementation details are largely hidden from the user, for example, through the use of an object-based interface to the library [24].

1.3.3 Range-Of-Use

Range-of-use may be gauged by how numerically stable the algorithms are over a range of input problems, and the range of data structures the library will support. For example, LINPACK and EISPACK deal with dense matrices stored in a rectangular array, packed matrices where only the upper or lower half of a symmetric matrix is stored, and banded matrices where only the nonzero bands are stored. In addition, some special formats such as Householder vectors are used internally to represent orthogonal matrices. There are also sparse matrices, which may be stored in many different ways; but in this paper we focus on dense and banded matrices, the mathematical types addressed by LINPACK, EISPACK, and LAPACK.

2 The BLAS as the Key to Portability

At least three factors affect the performance of portable Fortran code.

1. Vectorization. Designing vectorizable algorithms in linear algebra is usually straightforward. Indeed, for many computations there are several variants, all vectorizable, but with different characteristics in performance (see, for example, [15]). Linear algebra algorithms can approach the peak performance of many machines, principally because peak performance depends on some form of chaining of vector addition and multiplication operations, and this is just what the algorithms require. However, when the algorithms are realized in straightforward Fortran 77 code, the performance may fall well short of the expected level, usually because vectorizing Fortran compilers fail to minimize the number of memory references, that is, the number of vector load and store operations.

2. Data movement. What often limits the actual performance of a vector, or scalar, floating-point unit is the rate of transfer of data between different levels of memory in the machine. Examples include the transfer of vector operands in and out of vector registers, the transfer of scalar operands in and out of a high-speed scalar processor, the movement of data between main memory and a high-speed cache or local memory, paging between actual memory and disk storage in a virtual memory system, and interprocessor communication on a distributed memory concurrent computer.

3. Parallelism. The nested loop structure of most linear algebra algorithms offers considerable scope for loop-based parallelism. This is the principal type of parallelism that LAPACK and ScaLAPACK presently aim to exploit. On shared memory concurrent computers, this type of parallelism can sometimes be generated automatically by a compiler, but often requires the insertion of compiler directives. On distributed memory concurrent computers, data must be moved between processors. This is usually done by explicit calls to message passing routines, although parallel language extensions such as Coherent Parallel C [31] and Split-C [13] do the message passing implicitly.

The question arises, "How can we achieve sufficient control over these three factors to obtain the levels of performance that machines can offer?" The answer is through use of the BLAS. There are now three levels of BLAS:

Level 1 BLAS [45]: for vector operations, such as y ← αx + y

Level 2 BLAS [18]: for matrix-vector operations, such as y ← αAx + βy

Level 3 BLAS [17]: for matrix-matrix operations, such as C ← αAB + βC.

Here, A, B and C are matrices, x and y are vectors, and α and β are scalars.
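As a concrete illustration of the three levels, the following self-contained sketch calls one single-precision routine from each: SAXPY (Level 1), SGEMV (Level 2), and SGEMM (Level 3). It is an illustrative fragment of our own, not code taken from LAPACK, and it assumes a BLAS library is linked in.

      program blas3l
*     Illustrative sketch: one call from each BLAS level, assuming a
*     BLAS library providing SAXPY, SGEMV, and SGEMM is linked in.
      integer n
      parameter (n = 4)
      real alpha, beta
      real x(n), y(n), a(n,n), b(n,n), c(n,n)
      integer i, j
      alpha = 2.0e0
      beta  = 1.0e0
      do j = 1, n
         x(j) = 1.0e0
         y(j) = real(j)
         do i = 1, n
            a(i,j) = 1.0e0
            b(i,j) = 1.0e0
            c(i,j) = 0.0e0
         end do
      end do
*     Level 1:  y <- alpha*x + y         (vector-vector)
      call saxpy( n, alpha, x, 1, y, 1 )
*     Level 2:  y <- alpha*A*x + beta*y  (matrix-vector)
      call sgemv( 'N', n, n, alpha, a, n, x, 1, beta, y, 1 )
*     Level 3:  C <- alpha*A*B + beta*C  (matrix-matrix)
      call sgemm( 'N', 'N', n, n, n, alpha, a, n, b, n, beta, c, n )
      write(*,*) 'y(1) =', y(1), '   c(1,1) =', c(1,1)
      end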

The Level 1 BLAS are used in LAPACK, but for convenience rather than for performance: they perform an insignificant fraction of the computation, and they cannot achieve high efficiency on most modern supercomputers.

The Level 2 BLAS can achieve near-peak performance on many vector processors, such as a single processor of a CRAY X-MP or Y-MP, or Convex C-2 machine. However, on other vector processors such as a CRAY-2 or an IBM 3090 VF, the performance of the Level 2 BLAS is limited by the rate of data movement between different levels of memory.

The Level 3 BLAS overcome this limitation. This third level of BLAS performs O(n^3) floating-point operations on O(n^2) data, whereas the Level 2 BLAS perform only O(n^2) operations on O(n^2) data. The Level 3 BLAS also allow us to exploit parallelism in a way that is transparent to the software that calls them. While the Level 2 BLAS offer some scope for exploiting parallelism, greater scope is provided by the Level 3 BLAS, as Table 1 illustrates.

Table 1: Speed (Megaflops) of Level 2 and Level 3 BLAS Operations on a CRAY Y-MP. All matrices are of order 500; U is upper triangular.

                                Number of processors:
                                   1      2      4      8
    Level 2: y ← αAx + βy        311    611   1197   2285
    Level 3: C ← αAB + βC        312    623   1247   2425
    Level 2: x ← Ux              293    544    898   1613
    Level 3: B ← UB              310    620   1240   2425
    Level 2: x ← U^-1 x          272    374    479    584
    Level 3: B ← U^-1 B          309    618   1235   2398

3 Block Algorithms and Their Derivation

It is comparatively straightforward to recode many of the algorithms in LINPACK and EISPACK so that they call Level 2 BLAS. Indeed, in the simplest cases the same floating-point operations are done, possibly even in the same order: it is just a matter of reorganizing the software. To illustrate this point, we consider the Cholesky factorization algorithm used in the LINPACK routine SPOFA, which factorizes a symmetric positive definite matrix as A = U^T U. We consider Cholesky factorization because the algorithm is simple, and no pivoting is required. In Section 4 we shall consider the slightly more complicated example of LU factorization.

Suppose that after j−1 steps the block A_00 in the upper left-hand corner of A has been factored as A_00 = U_00^T U_00. The next row and column of the factorization can then be computed by writing A = U^T U as

    [ A_00   b_j    A_02  ]   [ U_00^T   0       0      ] [ U_00   v_j    U_02  ]
    [  .     a_jj   c_j^T ] = [ v_j^T    u_jj    0      ] [  0     u_jj   w_j^T ]
    [  .     .      A_22  ]   [ U_02^T   w_j     U_22^T ] [  0     0      U_22  ]

where b_j, c_j, v_j, and w_j are column vectors of length j−1, and a_jj and u_jj are scalars. Equating coefficients of the j-th column, we obtain

    b_j  = U_00^T v_j
    a_jj = v_j^T v_j + u_jj^2.

Since U_00 has already been computed, we can compute v_j and u_jj from the equations

    U_00^T v_j = b_j
    u_jj^2     = a_jj − v_j^T v_j.

The body of the code of the LINPACK routine SPOFA that implements the above method is shown in Figure 1. The same computation recoded in "LAPACK-style" to use the Level 2 BLAS routine STRSV (which solves a triangular system of equations) is shown in Figure 2. The call to STRSV has replaced the loop over K which made several calls to the Level 1 BLAS routine SDOT. For reasons given below, this is not the actual code used in LAPACK; hence the term "LAPACK-style".

This change by itself is sufficient to result in large gains in performance on a number of machines, for example, from 72 to 251 megaflops for a matrix of order 500 on one processor of a CRAY Y-MP. Since this is 81% of the peak speed of matrix-matrix multiplication on this processor, we cannot hope to do very much better by using Level 3 BLAS.

We can, however, restructure the algorithm at a deeper level to exploit the faster speed of the Level 3 BLAS. This restructuring involves recasting the algorithm as a block algorithm, that is, an algorithm that operates on blocks or submatrices of the original matrix.

3.1 Deriving a Block Algorithm

To derive a block form of Cholesky factorization, we partition the matrices as shown in Figure 4, in which the diagonal blocks of A and U are square, but of differing sizes. We assume that the first block has already been factored as A_00 = U_00^T U_00, and that we now want to determine the second block column of U consisting of the blocks U_01 and U_11. Equating submatrices in the second block of columns, we obtain

    A_01 = U_00^T U_01
    A_11 = U_01^T U_01 + U_11^T U_11.

Hence, since U_00 has already been computed, we can compute U_01 as the solution to the equation

    U_00^T U_01 = A_01

by a call to the Level 3 BLAS routine STRSM; and then we can compute U_11 from

    U_11^T U_11 = A_11 − U_01^T U_01.

This involves first updating the symmetric submatrix A_11 by a call to the Level 3 BLAS routine SSYRK, and then computing its Cholesky factorization. Since Fortran does not allow recursion, a separate routine must be called (using Level 2 BLAS rather than Level 3), named SPOTF2 in Figure 3. In this way, successive blocks of columns of U are computed. The LAPACK-style code for the block algorithm is shown in Figure 3. This code runs at 49 megaflops on an IBM 3090, more than double the speed of the LINPACK code. On a CRAY Y-MP, the use of Level 3 BLAS squeezes a little more performance out of one processor, but makes a large improvement when using all 8 processors.

But that is not the end of the story, and the code given above is not the code actually used in the LAPACK routine SPOTRF. We mentioned earlier that for many linear algebra computations there are several algorithmic variants, often referred to as i-, j-, and k-variants, according to a convention introduced in [15] and used in [36]. The same is true of the corresponding block algorithms.

It turns out that the j-variant chosen for LINPACK, and used in the above examples, is not the fastest on many machines, because it performs most of the work in solving triangular systems of equations, which can be significantly slower than matrix-matrix multiplication. The variant actually used in LAPACK is the i-variant, which relies on matrix-matrix multiplication for most of the work.

Table 2 summarizes the results.

      do j = 0, n-1
         info = j + 1
         s = 0.0e0
         jm1 = j
         if (jm1 .ge. 1) then
            do k = 0, jm1 - 1
               t = a(k,j) - sdot(k,a(0,k),1,a(0,j),1)
               t = t/a(k,k)
               a(k,j) = t
               s = s + t*t
            end do
         end if
         s = a(j,j) - s
         if (s .le. 0.0e0) go to 40
         a(j,j) = sqrt(s)
      end do

Figure 1: The body of the LINPACK routine SPOFA for Cholesky factorization.

      do j = 0, n-1
         call strsv( 'upper', 'transpose', 'non-unit', j, a, lda,
     $               a(0,j), 1 )
         s = a(j,j) - sdot( j, a(0,j), 1, a(0,j), 1 )
         if ( s .le. zero ) go to 20
         a(j,j) = sqrt( s )
      end do

Figure 2: The body of the "LAPACK-style" routine SPOFA for Cholesky factorization.

      do j = 0, n-1, nb
         jb = min( nb, n-j )
         call strsm( 'left', 'upper', 'transpose', 'non-unit', j, jb,
     $               one, a, lda, a(0,j), lda )
         call ssyrk( 'upper', 'transpose', jb, j, -one, a(0,j), lda,
     $               one, a(j,j), lda )
         call spotf2( 'upper', jb, a(j,j), lda, info )
         if (info .ne. 0) go to 20
      end do

Figure 3: The body of the "LAPACK-style" routine SPOFA for block Cholesky factorization. In this code fragment, nb denotes the width of the blocks.

    [ A_00    A_01    A_02 ]   [ U_00^T     0        0    ]   [ U_00   U_01   U_02 ]
    [ A_01^T  A_11    A_12 ] = [ U_01^T   U_11^T     0    ] * [  0     U_11   U_12 ]
    [ A_02^T  A_12^T  A_22 ]   [ U_02^T   U_12^T   U_22^T ]   [  0      0     U_22 ]

Figure 4: Partitioning of A, U, and U^T into blocks. It is assumed that the first block has already been factored as A_00 = U_00^T U_00, and we next want to determine the block column consisting of U_01 and U_11. Note that the diagonal blocks of A and U are square matrices.

Table 2: Speed (Megaflops) of Cholesky Factorization A = U^T U for n = 500

                                      IBM 3090 VF,   CRAY Y-MP,   CRAY Y-MP,
                                        1 proc.        1 proc.      8 proc.
    j-variant: LINPACK                     23             72           72
    j-variant: using Level 2 BLAS          24            251          378
    j-variant: using Level 3 BLAS          49            287         1225
    i-variant: using Level 3 BLAS          50            290         1414

3.2 Examples of Block Algorithms in LAPACK

Having discussed in detail the derivation of one particular block algorithm, we now describe examples of the performance achieved with two well-known block algorithms: LU and Cholesky factorizations. No extra floating-point operations nor extra working storage are required for either of these simple block algorithms. See Gallivan et al. [33] and Dongarra et al. [19] for surveys of algorithms for dense linear algebra on high-performance computers.

Table 3 illustrates the speed of the LAPACK routine for LU factorization of a real matrix, SGETRF in single precision on CRAY machines, and DGETRF in double precision on all other machines. Thus, 64-bit floating-point arithmetic is used on all machines tested. A block size of 1 means that the unblocked algorithm is used, since it is faster than, or at least as fast as, a block algorithm.

LAPACK is designed to give high efficiency on vector processors, high-performance "superscalar" workstations, and shared memory multiprocessors. LAPACK in its present form is less likely to give good performance on other types of parallel architectures (for example, massively parallel SIMD machines, or MIMD distributed memory machines), but the ScaLAPACK project, described in Section 1.1.4, is intended to adapt LAPACK to these new architectures. LAPACK can also be used satisfactorily on all types of scalar machines (PCs, workstations, mainframes).

Table 4 gives similar results for Cholesky factorization, extending the results given in Table 2.

LAPACK, like LINPACK, provides LU and Cholesky factorizations of band matrices. The LINPACK algorithms can easily be restructured to use Level 2 BLAS, though restructuring has little effect on performance for matrices of very narrow bandwidth. It is also possible to use Level 3 BLAS, at the price of doing some extra work with zero elements outside the band [22]. This process becomes worthwhile for large matrices and semi-bandwidth greater than 100 or so.

Table 3: Speed (Megaflops) of SGETRF/DGETRF for Square Matrices of Order n

                                 No. of       Block           Values of n
    Machine                      processors   size     100   200   300   400   500
    IBM RISC/6000-530                1         32        19    25    29    31    33
    Alliant FX/8                     8         16         9    26    32    46    57
    IBM 3090J VF                     1         64        23    41    52    58    63
    Convex C-240                     4         64        31    60    82   100   112
    CRAY Y-MP                        1          1       132   219   254   272   283
    CRAY-2                           1         64       110   211   292   318   358
    Siemens/Fujitsu VP 400-EX        1         64        46   132   222   309   397
    NEC SX2                          1          1       118   274   412   504   577
    CRAY Y-MP                        8         64       195   556   920  1188  1408

Table 4: Speed (Megaflops) of SPOTRF/DPOTRF for Matrices of Order n. Here UPLO = 'U', so the factorization is of the form A = U^T U.

                                 No. of       Block           Values of n
    Machine                      processors   size     100   200   300   400   500
    IBM RISC/6000-530                1         32        21    29    34    36    38
    Alliant FX/8                     8         16        10    27    40    49    52
    IBM 3090J VF                     1         48        26    43    56    62    67
    Convex C-240                     4         64        32    63    82    96   103
    CRAY Y-MP                        1          1       126   219   257   275   285
    CRAY-2                           1         64       109   213   294   318   362
    Siemens/Fujitsu VP 400-EX        1          1        53   145   237   312   369
    NEC SX2                          1          1       155   387   589   719   819
    CRAY Y-MP                        8         32       146   479   845  1164  1393

4 LU Factorization

In this section, we first discuss the uses of dense LU factorization in several fields. We next develop a block-partitioned version of the k-variant (or right-looking variant) of the LU factorization algorithm. In subsequent sections, the parallelization of this algorithm is described in detail in order to highlight the issues and considerations that must be taken into account in developing an efficient, scalable, and transportable dense linear algebra library for MIMD, distributed memory, concurrent computers.

4.1 Uses of LU Factorization in Science and Engineering

A major source of large dense linear systems is problems involving the solution of boundary integral equations. These are integral equations defined on the boundary of a region of interest. All examples of practical interest compute some intermediate quantity on a two-dimensional boundary and then use this information to compute the final desired quantity in three-dimensional space. The price one pays for replacing three dimensions with two is that what started as a sparse problem in O(n^3) variables is replaced by a dense problem in O(n^2).

Dense systems of linear equations are found in numerous applications, including:

 • airplane wing design;
 • radar cross-section studies;
 • flow around ships and other off-shore constructions;
 • diffusion of solid bodies in a liquid;
 • noise reduction; and
 • diffusion of light through small particles.

The electromagnetics community is a major user of dense linear systems solvers. Of particular interest to this community is the solution of the so-called radar cross-section problem. In this problem, a signal of fixed frequency bounces off an object; the goal is to determine the intensity of the reflected signal in all possible directions. The underlying differential equation may vary, depending on the specific problem. In the design of stealth aircraft, the principal equation is the Helmholtz equation. To solve this equation, researchers use the method of moments [38, 56]. In the case of fluid flow, the problem often involves solving the Laplace or Poisson equation. Here, the boundary integral solution is known as the panel method [40, 41], so named from the quadrilaterals that discretize and approximate a structure such as an airplane. Generally, these methods are called boundary element methods.

Use of these methods produces a dense linear system of size O(N) by O(N), where N is the number of boundary points (or panels) being used. It is not unusual to see size 3N by 3N, because of three physical quantities of interest at every boundary element.

A typical approach to solving such systems is to use LU factorization. Each entry of the matrix is computed as an interaction of two boundary elements. Often, many integrals must be computed. In many instances, the time required to compute the matrix is considerably larger than the time for solution.

    [ A_00   A_01 ]   [ L_00    0   ]   [ U_00   U_01 ]
    [ A_10   A_11 ] = [ L_10   L_11 ] * [  0     U_11 ]

Figure 5: Block LU factorization of the partitioned matrix A. A_00 is r × r, A_01 is r × (N−r), A_10 is (M−r) × r, and A_11 is (M−r) × (N−r). L_00 and L_11 are lower triangular matrices with 1's on the main diagonal, and U_00 and U_11 are upper triangular matrices.

Only the builders of stealth technology who are interested in radar cross-sections are considering using direct Gaussian elimination methods for solving dense linear systems. These systems are always symmetric and complex, but not Hermitian.

For further information on various methods for solving large dense linear algebra problems that arise in computational fluid dynamics, see the report by Alan Edelman [30].

4.2 Derivation of a Block Algorithm for LU Factorization

Suppose the M × N matrix A is partitioned as shown in Figure 5, and we seek a factorization A = LU, where the partitioning of L and U is also shown in Figure 5. Then we may write,

    L_00 U_00               = A_00                                    (3)
    L_10 U_00               = A_10                                    (4)
    L_00 U_01               = A_01                                    (5)
    L_10 U_01 + L_11 U_11   = A_11                                    (6)

where A_00 is r × r, A_01 is r × (N−r), A_10 is (M−r) × r, and A_11 is (M−r) × (N−r). L_00 and L_11 are lower triangular matrices with 1s on the main diagonal, and U_00 and U_11 are upper triangular matrices.

Equations 3 and 4 taken together perform an LU factorization on the first M × r panel of A (i.e., A_00 and A_10). Once this is completed, the matrices L_00, L_10, and U_00 are known, and the lower triangular system in Eq. 5 can be solved to give U_01. Finally, we rearrange Eq. 6 as,

    A'_11 = A_11 − L_10 U_01 = L_11 U_11                              (7)

From this equation we see that the problem of finding L_11 and U_11 reduces to finding the LU factorization of the (M−r) × (N−r) matrix A'_11. This can be done by applying the steps outlined above to A'_11 instead of to A. Repeating these steps K times, where

    K = min( ⌈M/r⌉, ⌈N/r⌉ )                                           (8)

we obtain the LU factorization of the original M × N matrix A. For an in-place algorithm, A is overwritten by L and U; the 1s on the diagonal of L do not need to be stored explicitly. Similarly, when A_11 is updated by Eq. 7 this may also be done in place.

Figure 6: Stage k + 1 of the block LU factorization algorithm showing how the panels B and C, and the trailing submatrix E are updated. The trapezoidal submatrices L and U have already been factored in previous steps. L has kr columns, and U has kr rows. In the step shown another r columns of L and r rows of U are evaluated.

After k of these K steps, the first kr columns of L and the first kr rows of U have been evaluated, and matrix A has been updated to the form shown in Figure 6, in which panel B is (M − kr) × r and C is r × (N − (k+1)r). Step k + 1 then proceeds as follows,

1. factor B to form the next panel of L, performing partial pivoting over rows if necessary (see Figure 14). This evaluates the matrices L_0, L_1, and U_0 in Figure 6.

2. solve the triangular system L_0 U_1 = C to get the next row of blocks of U.

3. do a rank-r update on the trailing submatrix E, replacing it with E' = E − L_1 U_1.

The LAPACK implementation of this form of LU factorization uses the Level 3 BLAS routines xTRSM and xGEMM to perform the triangular solve and rank-r update. We can regard the algorithm as acting on matrices that have been partitioned into blocks of r × r elements, as shown in Figure 7.
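To connect the three steps above to actual BLAS calls, here is a minimal sequential sketch of the blocked, right-looking update in double precision. It is illustrative only: it omits the partial pivoting (and associated row interchanges) that LAPACK's DGETRF and the ScaLAPACK routines perform, factoring each panel with a simple unblocked loop instead, and it uses the Level 3 BLAS routines DTRSM and DGEMM for the triangular solve and the rank-r update.

      subroutine blklu( m, n, r, a, lda )
*     Illustrative sketch of blocked, right-looking LU factorization
*     WITHOUT pivoting.  On exit the strictly lower triangle of a holds
*     L (the unit diagonal is not stored) and the upper triangle holds U.
      integer m, n, r, lda
      double precision a(lda,*)
      double precision one
      parameter (one = 1.0d0)
      integer j, jb, k, jj, i
      do j = 1, min(m,n), r
         jb = min( r, min(m,n) - j + 1 )
*        Step 1: factor the panel B = A(j:m, j:j+jb-1) with an
*        unblocked loop (this yields L0, L1, and U0 of Figure 6).
         do k = j, j + jb - 1
            do i = k + 1, m
               a(i,k) = a(i,k) / a(k,k)
            end do
            do jj = k + 1, j + jb - 1
               do i = k + 1, m
                  a(i,jj) = a(i,jj) - a(i,k)*a(k,jj)
               end do
            end do
         end do
         if ( j + jb .le. n ) then
*           Step 2: solve L0 * U1 = C for the next block row U1 of U.
            call dtrsm( 'Left', 'Lower', 'No transpose', 'Unit',
     $                  jb, n-j-jb+1, one, a(j,j), lda,
     $                  a(j,j+jb), lda )
*           Step 3: rank-r update of the trailing submatrix E.
            if ( j + jb .le. m ) then
               call dgemm( 'No transpose', 'No transpose',
     $                     m-j-jb+1, n-j-jb+1, jb, -one,
     $                     a(j+jb,j), lda, a(j,j+jb), lda,
     $                     one, a(j+jb,j+jb), lda )
            end if
         end if
      end do
      return
      end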

5 Data Distribution

The fundamental data object in the LU factorization algorithm presented in Section 4.2 is a block-partitioned matrix. In this section, we describe the block-cyclic method for distributing such a matrix over a two-dimensional mesh of processes, or template. In general, each process has an independent thread of control, and with each process is associated some local memory directly accessible only by that process. The assignment of these processes to physical processors is a machine-dependent optimization issue, and will be considered later in Section 7.

An important property of the class of data distribution we shall use is that independent decompositions are applied over rows and columns. We shall, therefore, begin by considering the distribution of a vector of M data objects over P processes. This can be described by a mapping of the global index, m, of a data object to an index pair (p, i), where p specifies the process to which the data object is assigned, and i specifies the location in the local memory of p at which it is stored. We shall assume 0 ≤ m < M.

Figure 7: Block-partitioned matrix A. Each block A_{i,j} consists of r × r matrix elements.

Two common decompositions are the block and the cyclic decompositions [55, 32]. The block decomposition, that is often used when the computational load is distributed homogeneously over a regular data structure such as a Cartesian grid, assigns contiguous entries in the global vector to the processes in blocks,

    m ↦ ( ⌊m/L⌋ , m mod L )                                           (9)

where L = ⌈M/P⌉. The cyclic decomposition (also known as the wrapped or scattered decomposition) is commonly used to improve load balance when the computational load is distributed inhomogeneously over a regular data structure. The cyclic decomposition assigns consecutive entries in the global vector to successive different processes,

    m ↦ ( m mod P , ⌊m/P⌋ )                                           (10)

Examples of the block and cyclic decompositions are shown in Figure 8.

    m  0 1 2 3 4 5 6 7 8 9          m  0 1 2 3 4 5 6 7 8 9
    p  0 0 0 0 1 1 1 1 2 2          p  0 1 2 0 1 2 0 1 2 0
    i  0 1 2 3 0 1 2 3 0 1          i  0 0 0 1 1 1 2 2 2 3

           (a) Block                       (b) Cyclic

Figure 8: Examples of block and cyclic decompositions of M = 10 data objects over P = 3 processes.

The block cyclic decomposition is a generalization of the block and cyclic decompositions in which blocks of consecutive data objects are distributed cyclically over the processes. In the block cyclic decomposition the mapping of the global index, m, can be expressed as m ↦ (p, b, i), where p is the process number, b is the block number in process p, and i is the index within block b to which m is mapped. Thus, if the number of data objects in a block is r, the block cyclic decomposition may be written,

    m ↦ ( ⌊(m mod T)/r⌋ , ⌊m/T⌋ , m mod r )                           (11)

    m  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22
    p  0  0  1  1  2  2  0  0  1  1  2  2  0  0  1  1  2  2  0  0  1  1  2
    b  0  0  0  0  0  0  1  1  1  1  1  1  2  2  2  2  2  2  3  3  3  3  3
    i  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0

                           (a) m ↦ (p, b, i)

    p  0  0  0  0  0  0  0  0  1  1  1  1  1  1  1  1  2  2  2  2  2  2  2
    b  0  0  1  1  2  2  3  3  0  0  1  1  2  2  3  3  0  0  1  1  2  2  3
    i  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0
    m  0  1  6  7 12 13 18 19  2  3  8  9 14 15 20 21  4  5 10 11 16 17 22

                           (b) (p, b, i) ↦ m

Figure 9: An example of the block cyclic decomposition of M = 23 data objects over P = 3 processes for a block size of r = 2. (a) shows the mapping from global index, m, to the triplet (p, b, i), and (b) shows the inverse mapping.

where T = rP. It should be noted that this reverts to the cyclic decomposition when r = 1, with local index i = 0 for all blocks. A block decomposition is recovered when r = L, in which case there is a single block in each process with block number b = 0. The inverse mapping of the triplet (p, b, i) to a global index is given by,

    (p, b, i) ↦ Br + i = pr + bT + i                                  (12)

where B = p + bP is the global block number. The block cyclic decomposition is one of the data distributions supported by High Performance Fortran (HPF) [42], and has been previously used, in one form or another, by several researchers (see [1, 4, 5, 9, 23, 27, 50, 52, 54] for examples of its use). The block cyclic decomposition is illustrated with an example in Figure 9.
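Eqs. 11 and 12 amount to a few integer operations per index. The short program below is an illustrative sketch (it is not part of the BLACS or ScaLAPACK); it reproduces the example of Figure 9, with M = 23 data objects, P = 3 processes, and block size r = 2, mapping each global index m to (p, b, i) and then recovering m from Eq. 12.

      program blkcyc
*     Illustrative sketch of the block cyclic mapping of Eq. 11 and its
*     inverse, Eq. 12, with zero template offset (0-based indices).
      integer r, nproc, t
      parameter (r = 2, nproc = 3)
      integer m, p, b, i, bigb, minv
      t = r*nproc
*     Reproduce Figure 9: M = 23 data objects, P = 3 processes, r = 2.
      do m = 0, 22
*        Eq. 11:  m -> (p, b, i)
         p = mod(m,t) / r
         b = m / t
         i = mod(m,r)
*        Eq. 12:  (p, b, i) -> m via the global block number B = p + b*P
         bigb = p + b*nproc
         minv = bigb*r + i
         write(*,'(5i4)') m, p, b, i, minv
      end do
      end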

The form of the block cyclic decomposition given by Eq. 11 ensures that the block with global index 0 is placed in process 0, the next block is placed in process 1, and so on. However, it is sometimes necessary to offset the processes relative to the global block index so that, in general, the first block is placed in process p_0, the next in process p_0 + 1, and so on. We, therefore, generalize the block cyclic decomposition by replacing m on the righthand side of Eq. 11 by m' = m + r p_0 to give,

    m ↦ ( ⌊(m' mod T)/r⌋ , ⌊m'/T⌋ , m' mod r )
      = ( ( ⌊(m mod T)/r⌋ + p_0 ) mod P , ⌊(m + r p_0)/T⌋ , m mod r ).    (13)

Equation 12 may also be generalized to,

    (p, b, i) ↦ Br + i = (p − p_0) r + bT + i                         (14)

where now the global block number is given by B = (p − p_0) + bP. It should be noted that in processes with p < p_0, block 0 is not within the range of the block cyclic mapping and it is, therefore, an error to reference it in any way.

In decomposing an M × N matrix we apply independent block cyclic decompositions in the row and column directions. Thus, suppose the matrix rows are distributed with block size r and offset p_0 over P processes by the block cyclic mapping with parameters (r, p_0, P), and the matrix columns are distributed with block size s and offset q_0 over Q processes by the block cyclic mapping with parameters (s, q_0, Q). Then the matrix element indexed globally by (m, n) is mapped as follows,

    m ↦ (p, b, i)
    n ↦ (q, d, j).                                                    (15)

The decomposition of the matrix can be regarded as the tensor product of the row and column decompositions, and we can write,

    (m, n) ↦ ( (p, q), (b, d), (i, j) ).                              (16)

The block cyclic matrix decomposition given by Eqs. 15 and 16 distributes blocks of size r × s to a mesh of P × Q processes. We shall refer to this mesh as the process template, and refer to processes by their position in the template. Equation 16 says that global index (m, n) is mapped to process (p, q), where it is stored in the block at location (b, d) in a two-dimensional array of blocks. Within this block it is stored at location (i, j). The decomposition is completely specified by the parameters r, s, p_0, q_0, P, and Q. In Figure 10 an example is given of the block cyclic decomposition of a 36 × 80 matrix for block size 4 × 5, a process template 3 × 4, and a template offset (p_0, q_0) = (0, 0). Figure 11 shows the same example but for a template offset of (1, 2).

The block cyclic decomposition can reproduce most of the data distributions commonly used in linear algebra computations on parallel computers. For example, if Q = 1 and r = ⌈M/P⌉ the block row decomposition is obtained. Similarly, P = 1 and s = ⌈N/Q⌉ gives a block column decomposition. These decompositions, together with row and column cyclic decompositions, are shown in Figure 12. Other commonly used block cyclic matrix decompositions are shown in Figure 13.
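Combining the offset form of Eq. 13 with the independent row and column mappings of Eqs. 15 and 16 gives the complete rule for locating a matrix element. The routine below is an illustrative sketch, not a ScaLAPACK or BLACS interface; it assumes 0-based global indices (m, n) and the template parameters r, s, p_0, q_0, P, Q introduced above (here named r, s, p0, q0, nprow, npcol).

      subroutine g2lmap( m, n, r, s, p0, q0, nprow, npcol,
     $                   p, q, b, d, i, j )
*     Illustrative sketch: map the global element (m,n) to its process
*     (p,q), local block (b,d), and offset (i,j) within that block, by
*     applying the offset block cyclic mapping of Eq. 13 independently
*     to the rows (r, p0, nprow) and the columns (s, q0, npcol).  All
*     indices are 0-based.
      integer m, n, r, s, p0, q0, nprow, npcol
      integer p, q, b, d, i, j
      integer trow, tcol, ms, ns
      trow = r*nprow
      tcol = s*npcol
*     Row mapping with shifted index m' = m + r*p0.
      ms = m + r*p0
      p  = mod(ms,trow) / r
      b  = ms / trow
      i  = mod(m,r)
*     Column mapping with shifted index n' = n + s*q0.
      ns = n + s*q0
      q  = mod(ns,tcol) / s
      d  = ns / tcol
      j  = mod(n,s)
      return
      end

For example, with the 36 × 80 matrix of Figure 10 (r = 4, s = 5, nprow = 3, npcol = 4, zero offsets), element (10, 27) is mapped to process (2, 1), local block (0, 1), and offset (2, 2) within that block, consistent with the periodic stamping of the template shown in the figure.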

6 Parallel Implementation

In this section we describe the parallel implementation of LU factorization, with partial pivoting over rows, for a block-partitioned matrix. The matrix, A, to be factored is assumed to have a block cyclic decomposition, and at the end of the computation is overwritten by the lower and upper triangular factors, L and U. This implicitly determines the decomposition of L and U. Quite a high-level description is given here since the details of the parallel implementation involve optimization issues that will be addressed in Section 7.

The sequential LU factorization algorithm described in Section 4.2 uses square blocks. Although in the parallel algorithm we could choose to decompose the matrix using nonsquare blocks, this would result in a more complicated code, and additional sources of concurrent overhead. For LU factorization we, therefore, restrict the decomposition to use only square blocks, so that the blocks used to decompose the matrix are the same as those used to partition the computation. If the block size is r × r, then an M × N matrix consists of M_b × N_b blocks, where M_b = ⌈M/r⌉ and N_b = ⌈N/r⌉.

As discussed in Section 4.2, LU factorization proceeds in a series of sequential steps indexed by k = 0, ..., min(M_b, N_b) − 1, in each of which the following three tasks are performed,

1. factor the k-th column of blocks, performing pivoting if necessary. This evaluates the matrices L_0, L_1, and U_0 in Figure 6.

2. evaluate the k-th block row of U by solving the lower triangular system L_0 U_1 = C.

3. do a rank-r update on the trailing submatrix E, replacing it with E' = E − L_1 U_1.

11 2,0 2,1 2,2 2,3 2,0 2,1 2,2 2,3 2,0 2,1 2,2 2,3 2,0 2,1 2,2 2,3

a Assignment of global blo ck indices, B; D, to pro cesses, p; q .

q B,D 0123 0,0 0,4 0,8 0,12 0,1 0,5 0,9 0,13 0,2 0,6 0,10 0,14 0,3 0,7 0,11 0,15

3,0 3,4 3,8 3,12 3,1 3,5 3,9 3,13 3,2 3,6 3,10 3,14 3,3 3,7 3,11 3,15 0 6,0 6,4 6,8 6,12 6,1 6,5 6,9 6,13 6,2 6,6 6,10 6,14 6,3 6,7 6,11 6,15

9,0 9,4 9,8 9,12 9,1 9,5 9,9 9,13 9,2 9,6 9,10 9,14 9,3 9,7 9,11 9,15

1,0 1,4 1,8 1,12 1,1 1,5 1,9 1,13 1,2 1,6 1,10 1,14 1,3 1,7 1,11 1,15

4,0 4,4 4,8 4,12 4,1 4,5 4,9 4,13 4,2 4,6 4,10 4,14 4,3 4,7 4,11 4,15 p 1 7,0 7,4 7,8 7,12 7,1 7,5 7,9 7,13 7,2 7,6 7,10 7,14 7,3 7,7 7,11 7,15

10,0 10,4 10,8 10,12 10,1 10,5 10,9 10,13 10,2 10,6 10,10 10,14 10,3 10,7 10,11 10,15

2,0 2,4 2,8 2,12 2,1 2,5 2,9 2,13 2,2 2,6 2,10 2,14 2,3 2,7 2,11 2,15

5,0 5,4 5,8 5,12 5,1 5,5 5,9 5,13 5,2 5,6 5,10 5,14 5,3 5,7 5,11 5,15 2 8,0 8,4 8,8 8,12 8,1 8,5 8,9 8,13 8,2 8,6 8,10 8,14 8,3 8,7 8,11 8,15

11,0 11,4 11,8 11,12 11,1 11,5 11,9 11,13 11,2 11,6 11,10 11,14 11,3 11,7 11,11 11,15

b Global blo cks, B; D, in each pro cess, p; q .

Figure 10: Block-cyclic decomposition of a 36 × 80 matrix with a block size of 4 × 5, onto a 3 × 4 process template. Each small rectangle represents one matrix block; individual matrix elements are not shown. In (a), shading is used to emphasize the process template that is periodically stamped over the matrix, and each block is labeled with the process to which it is assigned. In (b), each shaded region shows the blocks in one process, and is labeled with the corresponding global block indices. In both figures, the black rectangles indicate the blocks assigned to process (0, 0).

[Figure 11(a): assignment of global block indices, (B, D), to processes, (p, q), for the offset template; the table is not reproduced here.]

[Figure 11(b): global blocks, (B, D), in each process, (p, q); dashed entries mark blocks containing no data; the table is not reproduced here.]

Figure 11: The same matrix decomposition as shown in Figure 10, but for a template offset of (p_0, q_0) = (1, 2). Dashed entries in (b) indicate that the block does not contain any data. In both figures, the black rectangles indicate the blocks assigned to process (0, 0).

[Figure 12: four element-to-process maps of a 10 × 10 matrix; the cell-by-cell tables are not reproduced here. The parameters of the panels are:
(a) r = 3, s = 10, P = 4, Q = 1;   (b) r = 1, s = 10, P = 4, Q = 1;
(c) r = 10, s = 3, P = 1, Q = 4;   (d) r = 10, s = 1, P = 1, Q = 4.]

Figure 12: These four figures show different ways of decomposing a 10 × 10 matrix. Each cell represents a matrix element, and is labeled by the position, (p, q), in the template of the process to which it is assigned. To emphasize the pattern of decomposition, the matrix entries assigned to the process in the first row and column of the template are shown shaded, and each separate shaded region represents a matrix block. Figures (a) and (b) show block and cyclic row-oriented decompositions, respectively, for 4 nodes. In figures (c) and (d) the corresponding column-oriented decompositions are shown. Below each figure we give the values of r, s, P, and Q corresponding to the decomposition. In all cases p_0 = q_0 = 0.

[Figure 13: four element-to-process maps of a 10 × 10 matrix on a 4 × 4 template; the cell-by-cell tables are not reproduced here. The parameters of the panels are:
(a) r = 3, s = 3, P = 4, Q = 4;   (b) r = 3, s = 1, P = 4, Q = 4;
(c) r = 1, s = 3, P = 4, Q = 4;   (d) r = 1, s = 1, P = 4, Q = 4.]

Figure 13: These four figures show different ways of decomposing a 10 × 10 matrix over 16 processes arranged as a 4 × 4 template. Below each figure we give the values of r, s, P, and Q corresponding to the decomposition. In all cases p_0 = q_0 = 0.

[Figure 14 diagram: column kr + i, the already-factored parts L and U, the pivot row, and the exchange with row kr + i; the diagram itself is not reproduced here.]

Figure 14: This figure shows pivoting for step i of the kth stage of LU factorization. The element with largest absolute value in the gray shaded part of column kr + i is found, and the row containing it is exchanged with row kr + i. If the rows exchanged lie in different processes, communication may be necessary.

We now consider the parallel implementation of each of these tasks. The computation in the factorization step involves a single column of blocks, and these lie in a single column of the process template. In the kth factorization step, each of the r columns in block column k is processed in turn. Consider the ith column in block column k. The pivot is selected by finding the element with largest absolute value in this column between row kr + i and the last row, inclusive. The elements involved in the pivot search at this stage are shown shaded in Figure 14. Having selected the pivot, the value of the pivot and its row are broadcast to all other processors. Next, pivoting is performed by exchanging the entire row kr + i with the row containing the pivot. We exchange entire rows, rather than just the part to the right of the columns already factored, in order to simplify the application of the pivots to the right-hand side in any subsequent solve phase. Finally, each value in the column below the pivot is divided by the pivot. If a cyclic column decomposition is used, like that shown in Figure 12(d), only one processor is involved in the factorization of the block column, and no communication is necessary between the processes. However, in general P processes are involved, and communication is necessary in selecting the pivot and in exchanging the pivot rows.
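To make the pivot selection step concrete, the sketch below shows how the P processes in the template column holding block column k might agree on the pivot for one matrix column. It is only an illustration: the function and variable names, the use of MPI (at the time of writing only an emerging standard, see Section 7.1.2), and the assumption that each process holds its piece of the column as a contiguous array are assumptions of the sketch, not part of the algorithm as described above.

    #include <math.h>
    #include <mpi.h>

    /* Illustrative sketch of the pivot search for one matrix column.  local_col
     * holds this process's nloc elements of the column, first_global_row is the
     * global row index of local_col[0], and col_comm contains only the P
     * processes of this template column.  All names are hypothetical. */
    void select_pivot(const double *local_col, int nloc, int first_global_row,
                      MPI_Comm col_comm, int *pivot_row, double *pivot_value)
    {
        struct { double absval; int row; } mine = { -1.0, -1 }, best;

        for (int i = 0; i < nloc; i++)          /* local candidate */
            if (fabs(local_col[i]) > mine.absval) {
                mine.absval = fabs(local_col[i]);
                mine.row    = first_global_row + i;
            }

        /* Global winner: largest absolute value, and the global row holding it. */
        MPI_Allreduce(&mine, &best, 1, MPI_DOUBLE_INT, MPI_MAXLOC, col_comm);
        *pivot_row = best.row;

        /* The owner contributes the signed pivot value; everyone else adds zero. */
        double v = (mine.row == best.row && mine.row >= 0)
                       ? local_col[best.row - first_global_row] : 0.0;
        MPI_Allreduce(&v, pivot_value, 1, MPI_DOUBLE, MPI_SUM, col_comm);
    }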

The solution of the lower triangular system L_0 U_1 = C to evaluate the kth block row of U involves a single row of blocks, and these lie in a single row of the process template. If a cyclic row decomposition is used, like that shown in Figure 12(b), only one processor is involved in the triangular solve, and no communication is necessary between the processes. However, in general Q processes are involved, and communication is necessary to broadcast the lower triangular matrix, L_0, to all processes in the row. Once this has been done, each process in the row independently performs a lower triangular solve for the blocks of C that it holds.

The communication necessary to update the trailing submatrix at step k takes place in two steps. First, each process holding part of L_1 broadcasts these blocks to the other processes in the same row of the template. This may be done in conjunction with the broadcast of L_0, mentioned in the preceding paragraph, so that all of the factored panel is broadcast together.

    pcol = q_0
    prow = p_0
    do k = 0, min(M_b, N_b) - 1

        do i = 0, r - 1
            if (q = pcol) find pivot value and location
            broadcast pivot value and location to all processes
            exchange pivot rows
            if (q = pcol) divide column kr + i below the diagonal by the pivot
        end do

        if (p = prow) then
            broadcast L_0 to all processes in the same template row
            solve L_0 U_1 = C
        end if

        broadcast L_1 to all processes in the same template row
        broadcast U_1 to all processes in the same template column
        update E <- E - L_1 U_1

        pcol = (pcol + 1) mod Q
        prow = (prow + 1) mod P
    end do

Figure 15: Pseudocode for the basic parallel block-partitioned LU factorization algorithm. This code is executed by each process. The first box inside the k loop factors the kth column of blocks. The second box solves a lower triangular system to evaluate the kth row of blocks of U, and the third box updates the trailing submatrix. The template offset is given by (p_0, q_0), and (p, q) is the position of a process in the template.

Next, each process holding part of U_1 broadcasts these blocks to the other processes in the same column of the template. Each process can then complete the update of the blocks that it holds with no further communication.

A pseudocode outline of the parallel LU factorization algorithm is given in Figure 15. There are two points worth noting in Figure 15. First, the triangular solve and update phases operate on matrix blocks and may, therefore, be done with parallel versions of the Level 3 BLAS (specifically, xTRSM and xGEMM, respectively). The factorization of the column of blocks, however, involves a loop over matrix columns. Hence, it is not a block-oriented computation, and cannot be performed using the Level 3 BLAS. The second point to note is that most of the parallelism in the code comes from updating the trailing submatrix, since this is the only phase in which all the processes are busy.
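As an illustration of how the two block-oriented phases reduce to Level 3 BLAS calls within a single process, the sketch below performs the local triangular solve and the local rank-r update using the C interface to the BLAS. The routine name, the argument list, and the assumption that each local piece is stored as one contiguous column-major array are ours, chosen for compactness; the layout actually used is discussed in Section 7.1.2.

    #include <cblas.h>

    /* One process's share of the solve and update phases at stage k (sketch).
     *   L0 : r x r diagonal block of L, assumed unit lower triangular
     *   C  : r x nloc, this process's columns of the k-th block row;
     *        overwritten by the corresponding blocks of U1
     *   L1 : mloc x r, this process's rows of the factored panel below L0
     *   E  : mloc x nloc, this process's part of the trailing submatrix
     * All arrays are column-major, leading dimension equal to the row count. */
    void local_solve_and_update(int r, int mloc, int nloc,
                                const double *L0, double *C,
                                const double *L1, double *E)
    {
        /* Task 2: solve L0 * U1 = C for the local blocks of U1 (xTRSM). */
        cblas_dtrsm(CblasColMajor, CblasLeft, CblasLower, CblasNoTrans,
                    CblasUnit, r, nloc, 1.0, L0, r, C, r);

        /* Task 3: rank-r update E <- E - L1 * U1 as one large call to xGEMM. */
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    mloc, nloc, r, -1.0, L1, mloc, C, r, 1.0, E, mloc);
    }

In the parallel algorithm only the processes in template row prow perform the triangular solve, while every process performs its share of the update.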

Figure 15 also shows quite clearly where communication is required; namely, in finding the pivot, exchanging pivot rows, and performing various types of broadcast. The exact way in which these communications are done and interleaved with computation generally has an important effect on performance, and will be discussed in more detail in Section 7.

Figure 15 refers to broadcasting data to all processes in the same row or column of the template. This is a common operation in parallel linear algebra algorithms, so the idea will be described here in a little more detail.

[Figure 16 diagram: (a) broadcast along rows; (b) broadcast along columns; the diagram itself is not reproduced here.]

Figure 16: Schematic representation of broadcast along rows and columns of a 4 × 6 process template. In (a), each shaded process broadcasts to the processes in the same row of the process template. In (b), each shaded process broadcasts to the processes in the same column of the process template.

Consider, for example, the task of broadcasting the lower triangular block, L_0, to all processes in the same row of the template, as required before solving L_0 U_1 = C. If L_0 is in process (p, q), then it will be broadcast to all processes in row p of the process template. As a second example, consider the broadcast of L_1 to all processes in the same template row, as required before updating the trailing submatrix. This type of "rowcast" is shown schematically in Figure 16(a). If L_1 is in column q of the template, then each process (p, q) broadcasts its blocks of L_1 to the other processes in row p of the template. Loosely speaking, we can say that L_0 and L_1 are broadcast along the rows of the template. This type of data movement is the same as that performed by the Fortran 90 routine SPREAD [7]. The broadcast of U_1 to all processes in the same template column is very similar. This type of communication is sometimes referred to as a "colcast", and is shown in Figure 16(b).
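For readers more familiar with general message-passing libraries than with linear algebra communication packages, a rowcast can be pictured as an ordinary broadcast within a communicator containing one template row. The sketch below uses MPI purely for illustration; the communicator handling and the names are assumptions of the sketch.

    #include <mpi.h>

    /* Sketch of a rowcast: the process in template column srccol sends buf
     * (n doubles) to every process in the same template row.  p and q are
     * this process's coordinates in the P x Q template, and grid_comm
     * contains all P*Q processes. */
    void rowcast(double *buf, int n, int p, int q, int srccol,
                 MPI_Comm grid_comm)
    {
        MPI_Comm row_comm;

        /* Processes with the same template row p form one communicator;
         * ranking them by q makes each rank equal to its column coordinate. */
        MPI_Comm_split(grid_comm, p, q, &row_comm);

        MPI_Bcast(buf, n, MPI_DOUBLE, srccol, row_comm);

        MPI_Comm_free(&row_comm);
    }

In practice the row (and column) communicators would be created once when the template is set up rather than on every broadcast, and a colcast is identical with the roles of p and q interchanged.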

7 Optimization, Tuning, and Trade-offs

In this section, we shall examine techniques for optimizing the basic LU factorization code presented in Section 4.2. Among the issues to be considered are the assignment of processes to physical processors, the arrangement of the data in the local memory of each process, the trade-off between load imbalance and communication latency, the potential for overlapping communication and calculation, and the type of algorithm used to broadcast data. Many of these issues are interdependent, and in addition the portability and ease of code maintenance and use must be considered. For further details of the optimization of parallel LU factorization algorithms for specific concurrent machines, together with timing results, the reader is referred to the work of Chu and George [12], Geist and Heath [34], Geist and Romine [35], Van de Velde [55], Brent [8], Hendrickson and Womble [39], Lichtenstein and Johnsson [47], and Dongarra and co-workers [10, 25].

7.1 Mapping Logical Memory to Physical Memory

In Section 5, a logical, or virtual, matrix decomposition was described in which the global index (m, n) is mapped to a position, (p, q), in a logical process template, a position, (b, d), in a logical array of blocks local to the process, and a position, (i, j), in a logical array of matrix elements local to the block. Thus, the block-cyclic decomposition is hierarchical, and attempts to represent the hierarchical memory of advanced-architecture computers. Although the parallel LU factorization algorithm can be specified solely in terms of this logical hierarchical memory, its performance depends on how the logical memory is mapped to physical memory.
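For the row dimension this mapping can be written down directly; the column dimension has exactly the same form with r, P, and p_0 replaced by s, Q, and q_0. The function below is a small sketch of the rule illustrated in Figures 10 and 11; the function and argument names are ours and are not part of any library interface.

    /* Map a global row index m (0-based) to its owner and local position under
     * the block-cyclic distribution: block size r, P process rows, template
     * offset p0 (Figure 11 uses p0 = 1).  Outputs:
     *   *p : template row owning element m
     *   *b : local block-row index within that process
     *   *i : row index within the block */
    void map_row_index(int m, int r, int P, int p0, int *p, int *b, int *i)
    {
        int B = m / r;            /* global block-row index B                  */
        *i = m % r;               /* position within the block                 */
        *p = (B + p0) % P;        /* template row to which block B is assigned */
        *b = B / P;               /* number of complete template periods above */
    }

For the 36 × 80 matrix of Figure 10 (r = 4, P = 3, p_0 = 0), global rows 12 to 15 form block B = 3, which this rule assigns to template row 0 as that process's local block b = 1, in agreement with the figure.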

7.1.1 Assignment of Processes to Processors

Consider, first, the assignment of processes, (p, q), to physical processors. In general, more than one process may be assigned to a processor, so the problem may be overdecomposed. To avoid load imbalance, the same number of processes should be assigned to each processor as nearly as possible. If this condition is satisfied, the assignment of processes to processors can still affect performance by influencing the communication overhead. On recent distributed memory machines, such as the Intel Delta and CM-5, the time to send a single message between two processors is largely independent of their physical location [29, 48, 49], and hence the assignment of processes to processors does not have much direct effect on performance. However, when a collective communication task, such as a broadcast, is being done, contention for physical resources can degrade performance. Thus, the way in which processes are assigned to processors can affect performance if some assignments result in differing amounts of contention. Logarithmic contention-free broadcast algorithms have been developed for processors connected as a two-dimensional mesh [6, 51], so on such machines process (p, q) is usually mapped to the processor at position (p, q) in the mesh of processors. Such an assignment also ensures that the multiple one-dimensional broadcasts of L_1 and U_1 along the rows and columns of the template, respectively, do not give rise to contention.

7.1.2 Layout of Local Process Memory

The layout of matrix blocks in the local memory of a process, and the arrangement of matrix elements within each block, can also affect performance. Here, tradeoffs among several factors need to be taken into account. When communicating matrix blocks, for example in the broadcasts of L_1 and U_1, we would like the data in each block to be contiguous in physical memory so there is no need to pack them into a communication buffer before sending them. On the other hand, when updating the trailing submatrix, E, each process multiplies a column of blocks by a row of blocks, to do a rank-r update on the part of E that it contains. If this were done as a series of separate block-block matrix multiplications, as shown in Figure 18(a), the performance would be poor except for sufficiently large block sizes, r, since the vector and/or pipeline units on most processors would not be fully utilized, as may be seen in Figure 17 for the i860 processor. Instead, we arrange the loops of the computation as shown in Figure 18(b). Now, if the data are laid out in physical memory first by running over the i index and then over the d index, the inner two loops can be merged, so that the length of the inner loop is now r·d_max.

[Figure 17 plot: Mflops versus matrix size M on one i860 processor, with curves for A(M × M)·B(M × M), A(M × M/2)·B(M/2 × M), and A(M/2 × M)·B(M × M/2); the plot itself is not reproduced here.]

Figure 17: Performance of the assembly-coded Level 3 BLAS matrix multiplication routine DGEMM on one i860 processor of the Intel Delta system. Results for square and rectangular matrices are shown. Note that the peak performance of about 35 Mflops is attained only for matrices whose smallest dimension exceeds 100. Thus, performance is improved if a few large matrices are multiplied by each process, rather than many small ones.

This generally results in much better vector/pipeline performance. The b and j loops in Figure 18(b) can also be merged, giving the algorithm shown in Figure 18(c). This is just the outer product form of the multiplication of an (r·d_max) × r matrix by an r × (r·b_max) matrix, and would usually be done by a call to the Level 3 BLAS routine xGEMM, of which an assembly-coded sequential version is available on most machines. Note that in Figure 18(c) the order of the inner two loops is appropriate for a Fortran implementation; for the C language this order should be reversed, and the data should be stored in each process by rows instead of by columns.

We have found in our work on the Intel iPSC/860 hypercube and the Delta system that it is better to optimize for the sequential matrix multiplication with an (i, d, j, b) ordering of memory in each process, rather than adopting an (i, j, d, b) ordering to avoid buffer copies when communicating blocks. However, there is another reason for doing this. On most distributed memory computers the message startup cost is sufficiently large that it is preferable, wherever possible, to send data as one large message rather than as several smaller messages. Thus, when communicating L_1 and U_1, the blocks to be broadcast would be amalgamated into a single message, which requires a buffer copy. The emerging Message Passing Interface (MPI) standard [21] provides support for noncontiguous messages, so in the future the need to avoid buffer copies will not be of such concern to the application developer.
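As an indication of what such support looks like, the fragment below builds an MPI derived datatype describing one r × r block embedded in a larger column-major local array, so that the block can be sent without first being packed into a contiguous buffer. This is a sketch of the facility referred to above, not an excerpt from the proposed standard.

    #include <mpi.h>

    /* Send an r x r block stored inside a column-major local array with
     * leading dimension lda, without a buffer copy. */
    void send_block(const double *block, int r, int lda, int dest, int tag,
                    MPI_Comm comm)
    {
        MPI_Datatype blocktype;

        /* r columns, each of r contiguous doubles, successive columns lda apart. */
        MPI_Type_vector(r, r, lda, MPI_DOUBLE, &blocktype);
        MPI_Type_commit(&blocktype);

        MPI_Send((void *) block, 1, blocktype, dest, tag, comm);

        MPI_Type_free(&blocktype);
    }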

7.2 Tradeoffs between Load Balance and Communication Latency

We have discussed the mapping of the logical hierarchical memory to physical memory. In addition, we have pointed out the importance of maintaining long inner loops to get good sequential performance for each process, and the desirability of sending a few large messages rather than many smaller ones.

    do b = 0, b_max - 1
      do d = 0, d_max - 1
        do i = 0, r - 1
          do j = 0, r - 1
            do k = 0, r - 1
              E(b, d, i, j) = E(b, d, i, j) - L_1(b, d, i, k) * U_1(b, d, k, j)
    end all do loops

    (a) Block-block multiplication

    do k = 0, r - 1
      do b = 0, b_max - 1
        do j = 0, r - 1
          do d = 0, d_max - 1
            do i = 0, r - 1
              E(b, d, i, j) = E(b, d, i, j) - L_1(b, d, i, k) * U_1(b, d, k, j)
    end all do loops

    (b) Intermediate form of algorithm

    do k = 0, r - 1
      do x = 0, r·b_max - 1
        do y = 0, r·d_max - 1
          E(x, y) = E(x, y) - L_1(x, k) * U_1(k, y)
    end all do loops

    (c) Outer product form of algorithm

Figure 18: Pseudocode for different versions of the rank-r update, E <- E - L_1 U_1, for one process. The number of row and column blocks per process is given by b_max and d_max, respectively; r is the block size. Blocks are indexed by (b, d), and elements within a block by (i, j). In version (a) the r × r blocks are multiplied one at a time, giving an inner loop of length r. (b) shows the loops rearranged before merging the i and d loops, and the j and b loops. This leads to the outer product form of the algorithm shown in (c), in which the inner loop is now of length r·d_max.

We next consider load balance issues. Assuming that equal numbers of processes have been assigned to each processor, load imbalance arises in two phases of the parallel LU factorization algorithm; namely, in factoring each column block, which involves only P processes, and in solving the lower triangular system to evaluate each row block of U, which involves only Q processes. If the time for data movement is negligible, the aspect ratio of the template that minimizes load imbalance in step k of the algorithm is

imbalance in step k of the algorithm is,

Sequential time to factor column blo ck P

=

Q Sequential time for triangular solve

2

M k 1=3+ O1=r 

b

17 =

2

N k 1+ O1=r 

b

where M_b × N_b is the matrix size in blocks, and r the block size. Thus, the optimal aspect ratio of the template should be the same as the aspect ratio of the matrix, i.e., M_b/N_b in blocks, or M/N in elements. If the effect of communication time is included, then we must take into account the relative times taken to locate and broadcast the pivot information, and the time to broadcast the lower triangular matrix, L_0, along a row of the template. For both tasks the communication time increases with the number of processes involved, and since the communication time associated with the pivoting is greater than that associated with the triangular solve, we would expect the optimum aspect ratio of the template to be less than M/N. In fact, for our runs on the Intel Delta system we found an aspect ratio, P/Q, of between 1/4 and 1/8 to be optimal for most problems with square matrices, and that performance depends rather weakly on the aspect ratio, particularly for large grain sizes. Some typical results are shown in Figure 19 for 256 processors, which show a variation of less than 20% in performance as P/Q varies between 1/16 and 1 for the largest problem.

The block size, r, also affects load balance. Here the tradeoff is between the load imbalance that arises as rows and columns of the matrix are eliminated as the algorithm progresses, and communication startup costs. The block-cyclic decomposition seeks to maintain good load balance by cyclically assigning blocks to processes, and the load balance is best if the blocks are small. On the other hand, cumulative communication startup costs are less if the block size is large since, in this case, fewer messages must be sent (although the total volume of data sent is independent of the block size). Thus, there is a block size that optimally balances the load imbalance and communication startup costs.

7.3 Optimality and Pipelining Tradeoffs

The communication algorithms used also influence performance. In the LU factorization algorithm, all the communication can be done by moving data along rows and/or columns of the process template. This type of communication can be done by passing from one process to the next along the row or column. We shall call this a "ring" algorithm, although the ring may, or may not, be closed. An alternative is to use a spanning tree algorithm, of which there are several varieties. The complexity of the ring algorithm is linear in the number of processes involved, whereas that of spanning tree algorithms is logarithmic (for example, see [6]). Thus, considered in isolation, the spanning tree algorithms are preferable to a ring algorithm. However, in a spanning tree algorithm, a process may take part in several of the logarithmic steps, and in some implementations these algorithms act as a barrier. In a ring algorithm, each process needs to communicate only once, and can then continue to compute, in effect overlapping the communication with computation. An algorithm that interleaves communication and calculation in this way is often referred to as a pipelined algorithm. In a pipelined LU factorization algorithm with no pivoting, communication and calculation would flow in waves across the matrix. Pivoting tends to inhibit this advantage of pipelining.

[Figure 19 plot: Gflop/s versus matrix size M for process templates 4 × 64, 8 × 32, 12 × 21, and 16 × 16; the plot itself is not reproduced here.]

Figure 19: Performance of LU factorization on the Intel Delta as a function of square matrix size for different processor templates containing approximately 256 processors. The best performance is for an aspect ratio of 1/4, though the dependence on aspect ratio is rather weak.

    if (q = pcol) then
        do i = 0, r - 1
            find pivot value and location
            exchange pivot rows lying within the panel
            divide column kr + i below the diagonal by the pivot
        end do
    end if
    broadcast pivot information for the r pivots along the template rows
    exchange pivot rows lying outside the panel for each of the r pivots

Figure 20: Pseudocode fragment for partial pivoting over rows. This may be regarded as replacing the first box inside the k loop in Figure 15. In the above code, pivot information is first disseminated within the template column doing the panel factorization. The pivoting of the parts of the rows lying outside the panel is deferred until the panel factorization has been completed.


In the pseudocode in Figure 15, we do not specify how the pivot information should be broadcast. In an optimized implementation, we need to finish with the pivot phase, and the triangular solve phase, as soon as possible in order to begin the update phase, which is richest in parallelism. Thus, it is not a good idea to broadcast the pivot information from a single source process using a spanning tree algorithm, since this may occupy some of the processes involved in the panel factorization for too long. It is important to get the pivot information to the other processes in this template column as soon as possible, so the pivot information is first sent to these processes, which subsequently broadcast it along the template rows to the other processes not involved in the panel factorization. In addition, the exchange of the parts of the pivot rows lying within the panel is done separately from that of the parts outside the pivot panel. Another factor to consider here is when the pivot information should be broadcast along the template rows. In Figure 15, the information is broadcast, and rows exchanged, immediately after the pivot is found. An alternative is to store up the sequence of r pivots for a panel and to broadcast them along the template rows when the panel factorization is complete. This defers the exchange of pivot rows for the parts outside the panel until the panel factorization has been done, as shown in the pseudocode fragment in Figure 20. An advantage of this second approach is that only one message is used to send the pivot information for the panel along the template rows, instead of r messages.
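The saving is simply that the r pivot locations travel in one message rather than r. A sketch of the deferred variant, with an MPI broadcast standing in for whatever row-oriented broadcast is actually used, and with illustrative names throughout, is:

    #include <mpi.h>

    /* Deferred pivot broadcast (sketch).  During the panel factorization the
     * processes in the factoring template column record the r pivot rows in
     * pivots[0..r-1]; afterwards the whole list is sent along the template
     * row in a single message.  row_comm groups one template row, and pcol
     * is the rank of the factoring column within it. */
    void broadcast_panel_pivots(int *pivots, int r, int pcol, MPI_Comm row_comm)
    {
        MPI_Bcast(pivots, r, MPI_INT, pcol, row_comm);
        /* Each receiving process now applies all r row exchanges to the parts
         * of the pivot rows lying outside the panel, as in Figure 20. */
    }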

In our implementation of LU factorization on the Intel Delta system, we used a spanning tree algorithm to locate the pivot and to broadcast it within the column of the process template performing the panel factorization. This ensures that pivoting, which involves only P processes, is completed as quickly as possible. A ring broadcast is used to pipeline the pivot information and the factored panel along the template rows. Finally, after the triangular solve phase has completed, a spanning tree broadcast is used to send the newly formed block row of U along the template columns. Results for square matrices from runs on the Intel Delta system are shown in Figure 21. For each curve the results for the best process template configuration are shown. Recalling that for a scalable algorithm the performance should depend linearly on the number of processors for fixed granularity (see Eq. 2), it is apparent that scalability may be assessed by the extent to which isogranularity curves differ from linearity. An isogranularity curve is a plot of performance against number of processors for a fixed granularity. The results in Figure 21 can be used to generate the isogranularity curves shown in Figure 22, which show that on the Delta system the LU factorization routine starts to lose scalability when the granularity falls below about 0.2 × 10^6. This corresponds to a matrix size of about M = 10000 on 512 processors, or about 13% of the memory available to applications on the Delta, indicating that LU factorization scales rather well on the Intel Delta system.
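The ring broadcast used to pipeline the factored panel along a template row can be sketched as follows; MPI point-to-point calls are used only for concreteness (the Delta implementation itself predates MPI), and the names are illustrative. Each process receives the panel from its left-hand neighbor, immediately passes it to its right-hand neighbor, and only then performs its own update, so that communication and computation overlap around the ring.

    #include <mpi.h>

    /* One ring ("pipelined") broadcast along a template row of Q processes.
     * q is this process's column coordinate, root the column holding the
     * factored panel, and row_comm a communicator whose ranks coincide with
     * the column coordinates. */
    void ring_broadcast(double *panel, int n, int q, int Q, int root,
                        MPI_Comm row_comm)
    {
        int left  = (q - 1 + Q) % Q;
        int right = (q + 1) % Q;

        if (q != root)        /* wait for the panel from the left neighbor */
            MPI_Recv(panel, n, MPI_DOUBLE, left, 0, row_comm,
                     MPI_STATUS_IGNORE);

        if (right != root)    /* pass it on, then return to local computation */
            MPI_Send(panel, n, MPI_DOUBLE, right, 0, row_comm);

        /* ... the caller now updates its share of the trailing submatrix ... */
    }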

8 Conclusions and Future Research Directions

Portability of programs has always been an important consideration. Portability was easy to achieve when there was a single architectural paradigm (the serial von Neumann machine) and a single programming language for scientific programming (Fortran) embodying that common model of computation. Architectural and linguistic diversity have made portability much more difficult, but no less important, to attain. Users simply do not wish to invest significant amounts of time to create large-scale application codes for each new machine. Our answer is to develop portable software libraries that hide machine-specific details.

8.1 Portability, Scalability, and Standards

In order to be truly portable, parallel software libraries must be standardized. In a parallel computing environment in which the higher-level routines and/or abstractions are built upon lower-level computation and message-passing routines, the benefits of standardization are particularly apparent. Furthermore, the definition of computational and message-passing standards provides vendors with a clearly defined base set of routines that they can implement efficiently.

From the user's point of view, portability means that, as new machines are developed, they are simply added to the network, supplying cycles where they are most appropriate.

From the mathematical software developer's point of view, portability may require significant effort. Economy in development and maintenance of mathematical software demands that such development effort be leveraged over as many different computer systems as possible. Given the great diversity of parallel architectures, this type of portability is attainable to only a limited degree, but machine dependences can at least be isolated.

LAPACK is an example of a mathematical software package whose highest-level components are portable, while machine dependences are hidden in lower-level modules. Such a hierarchical approach is probably the closest one can come to software portability across diverse parallel architectures. And the BLAS that are used so heavily in LAPACK provide a portable, efficient, and flexible standard for applications programmers.

Like portability, scalability demands that a program be reasonably effective over a wide range of numbers of processors. The scalability of parallel algorithms, and software libraries based on them, over a wide range of architectural designs and numbers of processors will likely require that the fundamental granularity of computation be adjustable to suit the particular circumstances in which the software may happen to execute. Our approach to this problem is block algorithms with adjustable block size. In many cases, however, polyalgorithms may be required to deal with the full range of architectures and processor multiplicity likely to be available in the future. (In a polyalgorithm the actual algorithm used depends on the computing environment and the input data; the optimal algorithm in a particular instance is automatically selected at runtime.)

Scalable parallel architectures of the future are likely to be based on a distributed memory architectural paradigm. In the longer term, progress in hardware development, operating systems, languages, compilers, and communications may make it possible for users to view such distributed architectures, without significant loss of efficiency, as having a shared memory with a global address space.

[Figure 21 plot: Gflops versus matrix size M, with curves for 8 × 64, 8 × 32, 4 × 32, 4 × 16, and 2 × 16 process templates; the plot itself is not reproduced here.]

Figure 21: Performance of LU factorization on the Intel Delta as a function of square matrix size for different numbers of processors. For each curve, results are shown for the process template configuration that gave the best performance for that number of processors.

[Figure 22 plot: Gflops versus number of processors; isogranularity curves labeled 1.221, 0.500, 0.195, and 0.096; the plot itself is not reproduced here.]

Figure 22: Isogranularity curves in the (N_p, G) plane for the LU factorization of square matrices on the Intel Delta system. The curves are labeled by the granularity in units of 10^6 matrix elements per processor. The linearity of the plots for granularities exceeding about 0.2 × 10^6 indicates that the LU factorization algorithm scales well on the Delta.

For the near term, however, the distributed nature of the underlying hardware will continue to be visible at the programming level; therefore, efficient procedures for explicit communication will continue to be necessary. Given this fact, standards for basic message passing (send/receive), as well as higher-level communication constructs (global summation, broadcast, etc.), become essential to the development of scalable libraries that have any degree of portability. In addition to standardizing general communication primitives, it may also be advantageous to establish standards for problem-specific constructs in commonly occurring areas such as linear algebra.

The BLACS (Basic Linear Algebra Communication Subprograms) [16, 26] is a package that provides the same ease of use and portability for MIMD message-passing linear algebra communication that the BLAS [17, 18, 45] provide for linear algebra computation. Therefore, we recommend that future software for dense linear algebra on MIMD platforms consist of calls to the BLAS for computation and calls to the BLACS for communication. Since both packages will have been optimized for a particular platform, good performance should be achieved with relatively little effort. Also, since both packages will be available on a wide variety of machines, code modifications required to change platforms should be minimal.
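The fragment below indicates what this recommended division of labor looks like in practice for the rowcast of Figure 16(a). The calls shown are the C interface to the BLACS as it is commonly distributed; exact names and calling sequences vary between BLACS implementations, so the fragment should be read as an illustration of the style rather than as a definitive interface.

    /* External declarations for the C interface to the BLACS (illustrative). */
    extern void Cblacs_get(int icontxt, int what, int *val);
    extern void Cblacs_gridinit(int *icontxt, char *order, int nprow, int npcol);
    extern void Cblacs_gridinfo(int icontxt, int *nprow, int *npcol,
                                int *myrow, int *mycol);
    extern void Cdgebs2d(int icontxt, char *scope, char *top,
                         int m, int n, double *A, int lda);
    extern void Cdgebr2d(int icontxt, char *scope, char *top,
                         int m, int n, double *A, int lda, int rsrc, int csrc);
    extern void Cblacs_gridexit(int icontxt);

    /* Broadcast an r x r block from template column 0 along each template row. */
    void blacs_rowcast_example(int nprow, int npcol, int r, double *block)
    {
        int ictxt, myrow, mycol;

        Cblacs_get(-1, 0, &ictxt);                    /* default system context */
        Cblacs_gridinit(&ictxt, "Row", nprow, npcol); /* P x Q process template */
        Cblacs_gridinfo(ictxt, &nprow, &npcol, &myrow, &mycol);

        if (mycol == 0)
            Cdgebs2d(ictxt, "Row", " ", r, r, block, r);           /* send     */
        else
            Cdgebr2d(ictxt, "Row", " ", r, r, block, r, myrow, 0); /* receive  */

        Cblacs_gridexit(ictxt);
    }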

8.2 Alternative Approaches

Traditionally, large, general-purpose mathematical software libraries have required users to write their own programs that call library routines to solve specific subproblems that arise during a computation. Adapted to a shared-memory parallel environment, this conventional interface still offers some potential for hiding underlying complexity. For example, the LAPACK project incorporates parallelism in the Level 3 BLAS, where it is not directly visible to the user.

But when going from shared-memory systems to the more readily scalable distributed memory systems, the complexity of the distributed data structures required is more difficult to hide from the user. Not only must the problem decomposition and data layout be specified, but different phases of the user's problem may require transformations between different distributed data structures.

These deficiencies in the conventional user interface have prompted extensive discussion of alternative approaches for scalable parallel software libraries of the future. Possibilities include:

1. Traditional function library (i.e., minimum possible change to the status quo in going from a serial to a parallel environment). This will allow one to protect the programming investment that has been made.

2. Reactive servers on the network. A user would be able to send a computational problem to a server that was specialized in dealing with the problem. This fits well with the concepts of a networked, heterogeneous computing environment with various specialized hardware resources (or even the heterogeneous partitioning of a single homogeneous parallel machine).

3. General interactive environments like Matlab or Mathematica, perhaps with "expert" drivers (i.e., knowledge-based systems). With the growing popularity of the many integrated packages based on this idea, this approach would provide an interactive, graphical interface for specifying and solving scientific problems. Both the algorithms and data structures are hidden from the user, because the package itself is responsible for storing and retrieving the problem data in an efficient, distributed manner. In a heterogeneous networked environment, such interfaces could provide seamless access to computational engines that would be invoked selectively for different parts of the user's computation according to which machine is most appropriate for a particular subproblem.

4. Domain-specific problem solving environments, such as those for structural analysis. Environments like Matlab and Mathematica have proven to be especially attractive for rapid prototyping of new algorithms and systems that may subsequently be implemented in a more customized manner for higher performance.

5. Reusable templates (i.e., users adapt "source code" to their particular applications). A template is a description of a general algorithm rather than the executable object code or the source code more commonly found in a conventional software library. Nevertheless, although templates are general descriptions of key data structures, they offer whatever degree of customization the user may desire.

Novel user interfaces that hide the complexity of scalable parallelism will require new concepts and mechanisms for representing scientific computational problems and for specifying how those problems relate to each other. Very high level languages and systems, perhaps graphically based, not only would facilitate the use of mathematical software from the user's point of view, but also would help to automate the determination of effective partitioning, mapping, granularity, data structures, etc. However, new concepts in problem specification and representation may also require new mathematical research on the analytic, algebraic, and topological properties of problems (e.g., existence and uniqueness).

We have already begun work on developing such templates; future work will focus on extending the use of templates to dense matrix computations.

We hope the insight we gained from our work will influence future developers of hardware, compilers, and systems software so that they provide tools to facilitate development of high quality portable numerical software.

The EISPACK, LINPACK, and LAPACK linear algebra libraries are in the public domain, and are available from netlib. For example, for more information on how to obtain LAPACK, send the following one-line email message to [email protected]:

send index from lapack

Information for EISPACK and LINPACK can be similarly obtained. We expect to make a preliminary version of the ScaLAPACK library available from netlib in 1993.

Acknowledgments

This research was performed in part using the Intel Touchstone Delta System operated by the California Institute of Technology on behalf of the Concurrent Supercomputing Consortium. Access to this facility was provided through the Center for Research on Parallel Computing.

References

[1] E. Anderson, A. Benzoni, J. J. Dongarra, S. Moulton, S. Ostrouchov, B. Tourancheau, and R. van de Geijn. LAPACK for distributed memory architectures: Progress report. In Parallel Processing for Scientific Computing, Fifth SIAM Conference. SIAM, 1991.

[2] E. Anderson and J. Dongarra. Results from the initial release of LAPACK. Technical Report LAPACK Working Note 16, Computer Science Department, University of Tennessee, Knoxville, TN, 1989.

[3] E. Anderson and J. Dongarra. Evaluating block algorithm variants in LAPACK. Technical Report LAPACK Working Note 19, Computer Science Department, University of Tennessee, Knoxville, TN, 1990.

[4] C. C. Ashcraft. The distributed solution of linear systems using the torus wrap data mapping. Engineering Computing and Analysis Technical Report ECA-TR-147, Boeing Computer Services, 1990.

[5] C. C. Ashcraft. A taxonomy of distributed dense LU factorization methods. Engineering Computing and Analysis Technical Report ECA-TR-161, Boeing Computer Services, 1991.

[6] M. Barnett, D. G. Payne, and R. van de Geijn. Broadcasting on meshes with worm-hole routing. Technical report, Department of Computer Science, University of Texas at Austin, April 1993. Submitted to Supercomputing '93.

[7] W. S. Brainerd, C. H. Goldberg, and J. C. Adams. Programmer's Guide to Fortran 90. McGraw-Hill, New York, 1990.

[8] R. P. Brent. The LINPACK benchmark for the Fujitsu AP 1000. In Proceedings of the Fourth Symposium on the Frontiers of Massively Parallel Computation, pages 128-135. IEEE Computer Society Press, 1992.

[9] R. P. Brent. The LINPACK benchmark on the AP 1000: Preliminary report. In Proceedings of the 2nd CAP Workshop, November 1991.

[10] J. Choi, J. J. Dongarra, R. Pozo, and D. W. Walker. ScaLAPACK: A scalable linear algebra library for distributed memory concurrent computers. In Proceedings of the Fourth Symposium on the Frontiers of Massively Parallel Computation, pages 120-127. IEEE Computer Society Press, 1992.

[11] J. Choi, J. J. Dongarra, and D. W. Walker. The design of scalable software libraries for distributed memory concurrent computers. In J. J. Dongarra and B. Tourancheau, editors, Environments and Tools for Parallel Scientific Computing. Elsevier Science Publishers, 1993.

[12] E. Chu and A. George. Gaussian elimination with partial pivoting and load balancing on a multiprocessor. Parallel Computing, 5:65-74, 1987.

[13] D. E. Culler, A. Dusseau, S. C. Goldstein, A. Krishnamurthy, S. Lumetta, T. von Eicken, and K. Yelick. Introduction to Split-C: Version 0.9. Technical report, Computer Science Division, EECS, University of California, Berkeley, CA 94720, February 1993.

[14] J. Demmel. LAPACK: A portable linear algebra library for supercomputers. In Proceedings of the 1989 IEEE Control Systems Society Workshop on Computer-Aided Control System Design, December 1989.

[15] J. J. Dongarra. Increasing the performance of mathematical software through high-level modularity. In Proc. Sixth Int. Symp. Comp. Methods in Eng. & Applied Sciences, Versailles, France, pages 239-248. North-Holland, 1984.

[16] J. J. Dongarra. LAPACK Working Note 34: Workshop on the BLACS. Computer Science Dept. Technical Report CS-91-134, University of Tennessee, Knoxville, TN, May 1991.

[17] J. J. Dongarra, J. Du Croz, S. Hammarling, and I. Duff. A set of Level 3 basic linear algebra subprograms. ACM Transactions on Mathematical Software, 16(1):1-17, 1990.

[18] J. J. Dongarra, J. Du Croz, S. Hammarling, and R. Hanson. An extended set of Fortran basic linear algebra subroutines. ACM Transactions on Mathematical Software, 14(1):1-17, March 1988.

[19] J. J. Dongarra, I. S. Duff, D. C. Sorensen, and H. A. van der Vorst. Solving Linear Systems on Vector and Shared Memory Computers. SIAM Publications, Philadelphia, PA, 1991.

[20] J. J. Dongarra and E. Grosse. Distribution of mathematical software via electronic mail. Communications of the ACM, 30(5):403-407, July 1987.

[21] J. J. Dongarra, R. Hempel, A. J. G. Hey, and D. W. Walker. A proposal for a user-level message passing interface in a distributed memory environment. Technical Report TM-12231, Oak Ridge National Laboratory, February 1993.

[22] J. J. Dongarra, Peter Mayes, and Giuseppe Radicati di Brozolo. The IBM RISC System/6000 and linear algebra operations. Supercomputer, 44 (VIII-4):15-30, 1991.

[23] J. J. Dongarra and S. Ostrouchov. LAPACK block factorization algorithms on the Intel iPSC/860. Technical Report CS-90-115, University of Tennessee at Knoxville, Computer Science Department, October 1990.

[24] J. J. Dongarra, R. Pozo, and D. W. Walker. An object oriented design for high performance linear algebra on distributed memory architectures. In Proceedings of the Object Oriented Numerics Conference, 1993.

[25] J. J. Dongarra, R. van de Geijn, and D. W. Walker. A look at scalable dense linear algebra libraries. In Proceedings of the Scalable High-Performance Computing Conference, pages 372-379. IEEE Publishers, 1992.

[26] J. J. Dongarra and R. A. van de Geijn. Two-dimensional basic linear algebra communication subprograms. Technical Report LAPACK Working Note 37, Computer Science Department, University of Tennessee, Knoxville, TN, October 1991.

[27] J. J. Dongarra and R. A. van de Geijn. Reduction to condensed form for the eigenvalue problem on distributed memory architectures. Parallel Computing, 18:973-982, 1992.

[28] J. Du Croz and M. Pont. The development of a floating-point validation package. In M. J. Irwin and R. Stefanelli, editors, Proceedings of the 8th Symposium on Computer Arithmetic, Como, Italy, May 19-21, 1987. IEEE Computer Society Press, 1987.

[29] T. H. Dunigan. Communication performance of the Intel Touchstone Delta mesh. Technical Report TM-11983, Oak Ridge National Laboratory, January 1992.

[30] A. Edelman. Large dense numerical linear algebra in 1993: The parallel computing influence. International Journal of Supercomputer Applications, 1993. Accepted for publication.

[31] E. W. Felten and S. W. Otto. Coherent parallel C. In G. C. Fox, editor, Proceedings of the Third Conference on Hypercube Concurrent Computers and Applications, pages 440-450. ACM Press, 1988.

[32] G. C. Fox, M. A. Johnson, G. A. Lyzenga, S. W. Otto, J. K. Salmon, and D. W. Walker. Solving Problems on Concurrent Processors, volume 1. Prentice Hall, Englewood Cliffs, NJ, 1988.

[33] K. Gallivan, R. Plemmons, and A. Sameh. Parallel algorithms for dense linear algebra computations. SIAM Review, 32(1):54-135, 1990.

[34] A. Geist and M. Heath. Matrix factorization on a hypercube multiprocessor. In M. Heath, editor, Hypercube Multiprocessors, 1986, pages 161-180, Philadelphia, PA, 1986. Society for Industrial and Applied Mathematics.

[35] A. Geist and C. Romine. LU factorization algorithms on distributed-memory multiprocessor architectures. SIAM J. Sci. Statist. Comput., 9(4):639-649, July 1988.

[36] G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins Press, Baltimore, Maryland, 2nd edition, 1989.

[37] A. Gupta and V. Kumar. On the scalability of FFT on parallel computers. In Proceedings of the Frontiers 90 Conference on Massively Parallel Computation. IEEE Computer Society Press, 1990. Also available as Technical Report TR 90-20 from the Computer Science Department, University of Minnesota, Minneapolis, MN 55455.

[38] R. Harrington. Origin and development of the method of moments for field computation. IEEE Antennas and Propagation Magazine, June 1990.

[39] B. Hendrickson and D. Womble. The torus-wrap mapping for dense matrix computations on massively parallel computers. Technical Report SAND92-0792, Sandia National Laboratories, April 1992.

[40] J. L. Hess. Panel methods in computational fluid dynamics. Annual Reviews of Fluid Mechanics, 22:255-274, 1990.

[41] J. L. Hess and M. O. Smith. Calculation of potential flows about arbitrary bodies. In D. Küchemann, editor, Progress in Aeronautical Sciences, Volume 8. Pergamon Press, 1967.

[42] High Performance Fortran Forum. High Performance Fortran Language Specification, Version 1.0, January 1993.

[43] R. W. Hockney and C. R. Jesshope. Parallel Computers. Adam Hilger Ltd., Bristol, UK, 1981.

[44] W. Kahan. Paranoia. Available from netlib [20].

[45] C. Lawson, R. Hanson, D. Kincaid, and F. Krogh. Basic linear algebra subprograms for Fortran usage. ACM Trans. Math. Softw., 5:308-323, 1979.

[46] C. Leiserson. Fat trees: Universal networks for hardware-efficient supercomputing. IEEE Transactions on Computers, C-34(10):892-901, 1985.

[47] W. Lichtenstein and S. L. Johnsson. Block-cyclic dense linear algebra. Technical Report TR-04-92, Harvard University, Center for Research in Computing Technology, January 1992.

[48] M. Lin, D. Du, A. E. Klietz, and S. Saroff. Performance evaluation of the CM-5 interconnection network. Technical report, Department of Computer Science, University of Minnesota, 1992.

[49] R. Ponnusamy, A. Choudhary, and G. Fox. Communication overhead on CM-5: An experimental performance evaluation. In Proceedings of the Fourth Symposium on the Frontiers of Massively Parallel Computation, pages 108-115. IEEE Computer Society Press, 1992.

[50] Y. Saad and M. H. Schultz. Parallel direct methods for solving banded linear systems. Technical Report YALEU/DCS/RR-387, Department of Computer Science, Yale University, 1985.

[51] S. R. Seidel. Broadcasting on linear arrays and meshes. Technical Report TM-12356, Oak Ridge National Laboratory, April 1993.

[52] A. Skjellum and A. Leung. LU factorization of sparse, unsymmetric Jacobian matrices on multicomputers. In D. W. Walker and Q. F. Stout, editors, Proceedings of the Fifth Distributed Memory Concurrent Computing Conference, pages 328-337. IEEE Press, 1990.

[53] Thinking Machines Corporation, Cambridge, MA. CM-5 Technical Summary, 1991.

[54] R. A. van de Geijn. Massively parallel LINPACK benchmark on the Intel Touchstone Delta and iPSC/860 systems. Computer Science Report TR-91-28, University of Texas, 1991.

[55] E. F. Van de Velde. Data redistribution and concurrency. Parallel Computing, 16, December 1990.

[56] J. J. H. Wang. Generalized Moment Methods in Electromagnetics. John Wiley & Sons, New York, 1991.

[57] J. Wilkinson and C. Reinsch. Handbook for Automatic Computation: Volume II - Linear Algebra. Springer-Verlag, New York, 1971.