LAPACK Working Note 58

The Design of Linear Algebra Libraries for High Performance Computers

Jack Dongarra†‡ and David Walker‡

†Department of Computer Science
University of Tennessee
Knoxville, TN 37996

‡Mathematical Sciences Section
Oak Ridge National Laboratory
Oak Ridge, TN 37831
Abstract

This paper discusses the design of linear algebra libraries for high performance computers.
Particular emphasis is placed on the development of scalable algorithms for MIMD distributed
memory concurrent computers. A brief description of the EISPACK, LINPACK, and LAPACK
libraries is given, followed by an outline of ScaLAPACK, which is a distributed memory version
of LAPACK currently under development. The importance of block-partitioned algorithms in
reducing the frequency of data movement between different levels of hierarchical memory is
stressed. The use of such algorithms helps reduce the message startup costs on distributed
memory concurrent computers. Other key ideas in our approach are the use of distributed
versions of the Level 3 Basic Linear Algebra Subprograms (BLAS) as computational building blocks,
and the use of Basic Linear Algebra Communication Subprograms (BLACS) as communication
building blocks. Together the distributed BLAS and the BLACS can be used to construct higher-
level algorithms, and hide many details of the parallelism from the application developer.
The block-cyclic data distribution is described, and adopted as a good way of distributing
block-partitioned matrices. Block-partitioned versions of the Cholesky and LU factorizations
are presented, and optimization issues associated with the implementation of the LU factor-
ization algorithm on distributed memory concurrent computers are discussed, together with its
performance on the Intel Delta system. Finally, approaches to the design of library interfaces
are reviewed.

This work was supported in part by the NSF under Grant No. ASC-9005933 and by ARPA and ARO under
contract number DAAL03-91-C-0047.
1 Introduction

The increasing availability of advanced-architecture computers is having a very significant effect
on all spheres of scientific computation, including algorithm research and software development in
numerical linear algebra. Linear algebra, in particular the solution of linear systems of equations,
lies at the heart of most calculations in scientific computing. This chapter discusses some of the
recent developments in linear algebra designed to exploit these advanced-architecture computers.
Particular attention will be paid to dense factorization routines, such as the Cholesky and LU
factorizations, and these will be used as examples to highlight the most important factors that
must be considered in designing linear algebra software for advanced-architecture computers. We
use these factorization routines for illustrative purposes not only because they are relatively sim-
ple, but also because of their importance in several scientific and engineering applications that
make use of boundary element methods. These applications include electromagnetic scattering and
computational fluid dynamics problems, as discussed in more detail in Section 4.1.

Much of the work in developing linear algebra software for advanced-architecture computers is
motivated by the need to solve large problems on the fastest computers available. In this chapter,
we focus on four basic issues: (1) the motivation for the work; (2) the development of standards
for use in linear algebra and the building blocks for a library; (3) aspects of algorithm design and
parallel implementation; and (4) future directions for research.
For the past 15 years or so, there has been a great deal of activity in the area of algorithms and
software for solving linear algebra problems. The linear algebra community has long recognized
the need for help in developing algorithms into software libraries, and several years ago, as a
community effort, put together a de facto standard for identifying basic operations required in linear
algebra algorithms and software. The hope was that the routines making up this standard, known
collectively as the Basic Linear Algebra Subprograms (BLAS), would be efficiently implemented on
advanced-architecture computers by many manufacturers, making it possible to reap the portability
benefits of having them efficiently implemented on a wide range of machines. This goal has been
largely realized.
The key insight of our approach to designing linear algebra algorithms for advanced-architecture
computers is that the frequency with which data are moved between different levels of the memory
hierarchy must be minimized in order to attain high performance. Thus, our main algorithmic
approach for exploiting both vectorization and parallelism in our implementations is the use of
block-partitioned algorithms, particularly in conjunction with highly-tuned kernels for performing
matrix-vector and matrix-matrix operations (the Level 2 and 3 BLAS). In general, the use of block-
partitioned algorithms requires data to be moved as blocks, rather than as vectors or scalars, so
that although the total amount of data moved is unchanged, the latency (or startup cost) associated
with the movement is greatly reduced because fewer messages are needed to move the data.
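To make the latency argument concrete, consider the commonly used linear communication-cost
model (introduced here only as an illustration; the symbols t_s and t_w are not used elsewhere in
this note): sending one message of m words costs t_s + m t_w, where t_s is the startup (latency)
cost and t_w is the per-word transfer cost. Moving m words one element at a time, versus in
blocks of b words, then costs

    T_elementwise = m (t_s + t_w),        T_blocked = (m/b) t_s + m t_w.

The volume term m t_w is the same in both cases, but the blocked transfer sends only m/b messages,
so the startup contribution is reduced by a factor of b.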
A second key idea is that the performance of an algorithm can be tuned by a user by varying
the parameters that specify the data layout. On shared memory machines, this is controlled by
the block size, while on distributed memory machines it is controlled by the block size and the
configuration of the logical process mesh, as described in more detail in Section 5.
In Section 1, we first give an overview of some of the major software projects aimed at solving
dense linear algebra problems. Next, we describe the types of machine that benefit most from the
use of block-partitioned algorithms, and discuss what is meant by high-quality, reusable software
for advanced-architecture computers. Section 2 discusses the role of the BLAS in portability and
performance on high-performance computers. We discuss the design of these building blocks, and
their use in block-partitioned algorithms, in Section 3. Section 4 focuses on the design of a block-
partitioned algorithm for LU factorization, and Sections 5, 6, and 7 use this example to illustrate
the most important factors in implementing dense linear algebra routines on MIMD, distributed
memory, concurrent computers. Section 5 deals with the issue of mapping the data onto the
hierarchical memory of a concurrent computer. The layout of an application's data is crucial in
determining the performance and scalability of the parallel code. In Sections 6 and 7, details of
the parallel implementation and optimization issues are discussed. Section 8 presents some future
directions for investigation.
1.1 Dense Linear Algebra Libraries

Over the past twenty-five years, the first author has been directly involved in the development of
several important packages of dense linear algebra software: EISPACK, LINPACK, LAPACK, and
the BLAS. In addition, both authors are currently involved in the development of ScaLAPACK,
a scalable version of LAPACK for distributed memory concurrent computers. In this section, we
give a brief review of these packages: their history, their advantages, and their limitations on
high-performance computers.
1.1.1 EISPACK

EISPACK is a collection of Fortran subroutines that compute the eigenvalues and eigenvectors
of nine classes of matrices: complex general, complex Hermitian, real general, real symmetric,
real symmetric banded, real symmetric tridiagonal, special real tridiagonal, generalized real, and
generalized real symmetric matrices. In addition, two routines are included that use singular value
decomposition to solve certain least-squares problems.

EISPACK is primarily based on a collection of Algol procedures developed in the 1960s and
collected by J. H. Wilkinson and C. Reinsch in a volume entitled Linear Algebra in the Handbook for
Automatic Computation [57] series. This volume was not designed to cover every possible method of
solution; rather, algorithms were chosen on the basis of their generality, elegance, accuracy, speed,
or economy of storage.

Since the release of EISPACK in 1972, over ten thousand copies of the collection have been
distributed worldwide.
1.1.2 LINPACK

LINPACK is a collection of Fortran subroutines that analyze and solve linear equations and linear
least-squares problems. The package solves linear systems whose matrices are general, banded,
symmetric indefinite, symmetric positive definite, triangular, and tridiagonal square. In addition,
the package computes the QR and singular value decompositions of rectangular matrices and applies
them to least-squares problems.

LINPACK is organized around four matrix factorizations: LU factorization, pivoted Cholesky
factorization, QR factorization, and singular value decomposition. The term LU factorization is
used here in a very general sense to mean the factorization of a square matrix into a lower triangular
part and an upper triangular part, perhaps with pivoting. These factorizations will be treated at
greater length later, when the actual LINPACK subroutines are discussed. But first a digression
on organization and factors influencing LINPACK's efficiency is necessary.
LINPACK uses column-oriented algorithms to increase efficiency by preserving locality of ref-
erence. This means that if a program references an item in a particular block, the next reference
is likely to be in the same block. By column orientation we mean that the LINPACK codes al-
ways reference arrays down columns, not across rows. This works because Fortran stores arrays
in column major order. Thus, as one proceeds down a column of an array, the memory references
proceed sequentially in memory. On the other hand, as one proceeds across a row, the memory ref-
erences jump across memory, the length of the jump being proportional to the length of a column.
The effects of column orientation are quite dramatic: on systems with virtual or cache memories,
the LINPACK codes will significantly outperform codes that are not column oriented. We note,
however, that textbook examples of matrix algorithms are seldom column oriented.
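As an illustration (written for this discussion, not taken from LINPACK), the two Fortran 77
subroutines below scale a matrix, the first by columns and the second by rows. Only the loop
order differs, but on a machine with cache or virtual memory the column-oriented version walks
through consecutive memory locations, while the row-oriented version strides through memory.

      SUBROUTINE SCALEC( M, N, A, LDA, ALPHA )
      INTEGER            M, N, LDA, I, J
      DOUBLE PRECISION   A( LDA, * ), ALPHA
*     Column-oriented: the inner loop runs down a column, so the
*     references A(I,J), A(I+1,J) are adjacent in memory.
      DO 20 J = 1, N
         DO 10 I = 1, M
            A( I, J ) = ALPHA*A( I, J )
   10    CONTINUE
   20 CONTINUE
      RETURN
      END

      SUBROUTINE SCALER( M, N, A, LDA, ALPHA )
      INTEGER            M, N, LDA, I, J
      DOUBLE PRECISION   A( LDA, * ), ALPHA
*     Row-oriented: the inner loop runs across a row, so successive
*     references A(I,J), A(I,J+1) are LDA elements apart in memory,
*     defeating cache and paging locality.
      DO 40 I = 1, M
         DO 30 J = 1, N
            A( I, J ) = ALPHA*A( I, J )
   30    CONTINUE
   40 CONTINUE
      RETURN
      END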
Another important factor influencing the efficiency of LINPACK is the use of the Level 1 BLAS;
there are three effects.

First, the overhead entailed in calling the BLAS reduces the efficiency of the code. This reduc-
tion is negligible for large matrices, but it can be quite significant for small matrices. The matrix
size at which it becomes unimportant varies from system to system; for square matrices it is typ-
ically between n = 25 and n = 100. If this seems like an unacceptably large overhead, remember
that on many modern systems the solution of a system of order 25 or less is itself a negligible calcu-
lation. Nonetheless, it cannot be denied that a person whose programs depend critically on solving
small matrix problems in inner loops will be better off with BLAS-less versions of the LINPACK
codes. Fortunately, the BLAS can be removed from the smaller, more frequently used program in
a short editing session.
Second, the BLAS improve the efficiency of programs when they are run on nonoptimizing
compilers. This is because doubly subscripted array references in the inner loop of the algorithm
are replaced by singly subscripted array references in the appropriate BLAS. The effect can be seen
for matrices of quite small order, and for large orders the savings are quite significant.
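For example, the two fragments below (an illustration written for this note, not an excerpt from
LINPACK; the wrapper names COLUPD and COLUPB are invented) perform the same column update,
first with an explicit doubly subscripted inner loop and then with a single call to the Level 1 BLAS
routine DAXPY, inside which the array references are singly subscripted.

      SUBROUTINE COLUPD( N, K, J, T, A, LDA )
*     Illustration only: add T times column K of A to column J,
*     as in one step of column-oriented elimination, using a
*     doubly subscripted inner loop.
      INTEGER            N, K, J, LDA, I
      DOUBLE PRECISION   T, A( LDA, * )
      DO 10 I = K + 1, N
         A( I, J ) = A( I, J ) + T*A( I, K )
   10 CONTINUE
      RETURN
      END

      SUBROUTINE COLUPB( N, K, J, T, A, LDA )
*     The same update expressed as a Level 1 BLAS call; inside
*     DAXPY the array references are singly subscripted.
      INTEGER            N, K, J, LDA
      DOUBLE PRECISION   T, A( LDA, * )
      EXTERNAL           DAXPY
      CALL DAXPY( N-K, T, A( K+1, K ), 1, A( K+1, J ), 1 )
      RETURN
      END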
Finally, improved efficiency can be achieved by coding a set of BLAS [17] to take advantage of
the special features of the computers on which LINPACK is being run. For most computers, this
simply means producing machine-language versions. However, the code can also take advantage of
more exotic architectural features, such as vector operations.

Further details about the BLAS are presented in Section 2.
1.1.3 LAPACK

LAPACK [14] provides routines for solving systems of simultaneous linear equations, least-squares
solutions of linear systems of equations, eigenvalue problems, and singular value problems. The
associated matrix factorizations (LU, Cholesky, QR, SVD, Schur, generalized Schur) are also pro-
vided, as are related computations such as reordering of the Schur factorizations and estimating
condition numbers. Dense and banded matrices are handled, but not general sparse matrices. In
all areas, similar functionality is provided for real and complex matrices, in both single and double
precision.
The original goal of the LAPACK project was to make the widely used EISPACK and LIN-
PACK libraries run efficiently on shared-memory vector and parallel processors. On these machines,
LINPACK and EISPACK are inefficient because their memory access patterns disregard the multi-
layered memory hierarchies of the machines, thereby spending too much time moving data instead
of doing useful floating-point operations. LAPACK addresses this problem by reorganizing the algo-
rithms to use block matrix operations, such as matrix multiplication, in the innermost loops [3, 14].
These block operations can be optimized for each architecture to account for the memory hierarchy
[2], and so provide a transportable way to achieve high efficiency on diverse modern machines. Here
we use the term "transportable" instead of "portable" because, for fastest possible performance,
LAPACK requires that highly optimized block matrix operations be already implemented on each
machine. In other words, the correctness of the code is portable, but high performance is not, if
we limit ourselves to a single Fortran source code.
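As an illustrative sketch of what "block matrix operations in the innermost loops" means (not an
actual LAPACK routine; the name TRLUPD and its argument list are invented here), the fragment
below updates a trailing submatrix with one call to the Level 3 BLAS routine DGEMM instead of with
scalar inner loops. This is the kind of update that dominates block-partitioned LU and Cholesky
factorizations, and it is exactly the operation a vendor-tuned DGEMM can perform near peak speed.

      SUBROUTINE TRLUPD( M, N, NB, A, LDA, B, LDB, C, LDC )
*     Illustration only: C := C - A*B, where C is the M-by-N trailing
*     submatrix, A is an M-by-NB block, and B is an NB-by-N block.
      INTEGER            M, N, NB, LDA, LDB, LDC
      DOUBLE PRECISION   A( LDA, * ), B( LDB, * ), C( LDC, * )
      DOUBLE PRECISION   ONE
      PARAMETER          ( ONE = 1.0D+0 )
      EXTERNAL           DGEMM
      CALL DGEMM( 'No transpose', 'No transpose', M, N, NB,
     $            -ONE, A, LDA, B, LDB, ONE, C, LDC )
      RETURN
      END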
LAPACK can be regarded as a successor to LINPACK and EISPACK. It has virtually all the
capabilities of these two packages and much more besides. LAPACK improves on LINPACK and
EISPACK in four main respects: speed, accuracy, robustness, and functionality. While LINPACK
and EISPACK are based on the vector operation kernels of the Level 1 BLAS, LAPACK was
designed at the outset to exploit the Level 3 BLAS, a set of specifications for Fortran subprograms
that do various types of matrix multiplication and the solution of triangular systems with multiple
right-hand sides. Because of the coarse granularity of the Level 3 BLAS operations, their use tends
to promote high efficiency on many high-performance computers, particularly if specially coded
implementations are provided by the manufacturer.
1.1.4 ScaLAPACK

The ScaLAPACK software library, scheduled for completion by the end of 1994, will extend the
LAPACK library to run scalably on MIMD, distributed memory, concurrent computers [10, 11].
For such machines the memory hierarchy includes the off-processor memory of other processors,
in addition to the hierarchy of registers, cache, and local memory on each processor. Like LA-
PACK, the ScaLAPACK routines are based on block-partitioned algorithms in order to minimize
the frequency of data movement between different levels of the memory hierarchy. The funda-
mental building blocks of the ScaLAPACK library are distributed memory versions of the Level
2 and Level 3 BLAS, and a set of Basic Linear Algebra Communication Subprograms (BLACS)
[16, 26] for communication tasks that arise frequently in parallel linear algebra computations. In
the ScaLAPACK routines, all interprocessor communication occurs within the distributed BLAS
and the BLACS, so the source code of the top software layer of ScaLAPACK looks very similar to
that of LAPACK.
We envisage a number of user interfaces to ScaLAPACK. Initially, the interface will be similar to
that of LAPACK, with some additional arguments passed to each routine to specify the data layout.
Once this is in place, we intend to modify the interface so the arguments to each ScaLAPACK
routine are the same as in LAPACK. This will require information about the data distribution of
each matrix and vector to be hidden from the user. This may be done by means of a ScaLAPACK
initialization routine. This interface will be fully compatible with LAPACK. Provided "dummy"
versions of the ScaLAPACK initialization routine and the BLACS are added to LAPACK, there
will be no distinction between LAPACK and ScaLAPACK at the application level, though each
will link to different versions of the BLAS and BLACS. Following on from this, we will experiment
with object-based interfaces for LAPACK and ScaLAPACK, with the goal of developing interfaces
compatible with Fortran 90 [10] and C++ [24].
1.2 Target Architectures

The EISPACK and LINPACK software libraries were designed for supercomputers used in the
1970s and early 1980s, such as the CDC-7600, Cyber 205, and Cray-1. These machines featured
multiple functional units pipelined for good performance [43]. The CDC-7600 was basically a
high-performance scalar computer, while the Cyber 205 and Cray-1 were early vector computers.

The development of LAPACK in the late 1980s was intended to make the EISPACK and
LINPACK libraries run efficiently on shared memory, vector supercomputers. The ScaLAPACK
software library will extend the use of LAPACK to distributed memory concurrent supercomputers.
The development of ScaLAPACK began in 1991 and is expected to be completed by the end of
1994.
The underlying concept of both the LAPACK and ScaLAPACK libraries is the use of block-
partitioned algorithms to minimize data movement between different levels in hierarchical memory.
Thus, the ideas discussed in this chapter for developing a library for dense linear algebra compu-
tations are applicable to any computer with a hierarchical memory that (1) imposes a sufficiently
large startup cost on the movement of data between different levels in the hierarchy, and for which
(2) the cost of a context switch is too great to make fine grain size multithreading worthwhile.
Our target machines are, therefore, medium and large grain size advanced-architecture computers.
These include "traditional" shared memory, vector supercomputers, such as the Cray Y-MP and
C90, and MIMD distributed memory concurrent supercomputers, such as the Intel Paragon, the
Thinking Machines CM-5, and the more recently announced IBM SP1 and Cray T3D concurrent
systems. Since these machines have only very recently become available, most of the ongoing de-
velopment of the ScaLAPACK library is being done on a 128-node Intel iPSC/860 hypercube and
on the 520-node Intel Delta system.
The Intel Paragon supercomputer can have up to 2000 nodes, each consisting of an i860 processor
and a communications processor. The nodes each have at least 16 Mbytes of memory, and are
connected by a high-speed network with the topology of a two-dimensional mesh. The CM-5 from
Thinking Machines Corporation [53] supports both SIMD and MIMD programming models, and
may have up to 16k processors, though the largest CM-5 currently installed has 1024 processors.
Each CM-5 node is a Sparc processor and up to 4 associated vector processors. Point-to-point
communication between nodes is supported by a data network with the topology of a "fat tree"
[46]. Global communication operations, such as synchronization and reduction, are supported by a
separate control network. The IBM SP1 system is based on the same RISC chip used in the IBM
RS/6000 workstations and uses a multistage switch to connect processors. The Cray T3D uses the
Alpha chip from Digital Equipment Corporation, and connects the processors in a three-dimensional
torus.
Future advances in compiler and hardware technologies in the mid to late 1990s are expected
to make multithreading a viable approach for masking communication costs. Since the blocks
in a block-partitioned algorithm can be regarded as separate threads, our approach will still be
applicable on machines that exploit medium and coarse grain size multithreading.
1.3 High-Quality, Reusable, Mathematical Software

In developing a library of high-quality subroutines for dense linear algebra computations the design
goals fall into three broad classes:

  • performance
  • ease-of-use
  • range-of-use
1.3.1 Performance

Two important performance metrics are concurrent efficiency and scalability. We seek good per-
formance characteristics in our algorithms by eliminating, as much as possible, overhead due to
load imbalance, data movement, and algorithm restructuring. The way the data are distributed
(or decomposed) over the memory hierarchy of a computer is of fundamental importance to these
factors. Concurrent efficiency, ε, is defined as the concurrent speedup per processor [32], where the
concurrent speedup is the execution time, T_seq, for the best sequential algorithm running on one
processor of the concurrent computer, divided by the execution time, T, of the parallel algorithm
running on N_p processors. When direct methods are used, as in LU factorization, the concurrent
efficiency depends on the problem size and the number of processors, so on a given parallel com-
puter and for a fixed number of processors, the running time should not vary greatly for problems
of the same size. Thus, we may write

    ε(N, N_p) = (1/N_p) [ T_seq(N) / T(N, N_p) ]                      (1)

where N represents the problem size. In dense linear algebra computations, the execution time is
usually dominated by the floating-point operation count, so the concurrent efficiency is related to
the performance, G, measured in floating-point operations per second, by

    G(N, N_p) = (N_p / t_calc) ε(N, N_p)                              (2)

where t_calc is the time for one floating-point operation. For iterative routines, such as eigensolvers,
the number of iterations, and hence the execution time, depends not only on the problem size, but
also on other characteristics of the input data, such as condition number. A parallel algorithm
is said to be scalable [37] if the concurrent efficiency depends on the problem size and number
of processors only through their ratio. This ratio is simply the problem size per processor, often
referred to as the granularity. Thus, for a scalable algorithm, the concurrent efficiency is constant as
the number of processors increases while keeping the granularity fixed. Alternatively, Eq. 2 shows
that this is equivalent to saying that, for a scalable algorithm, the performance depends linearly
on the number of processors for fixed granularity.
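The last statement can be spelled out using the quantities already defined. Writing the granularity
as g = N/N_p, scalability means that ε depends on N and N_p only through g, so Eq. 2 becomes

    G(N, N_p) = (N_p / t_calc) ε(g),        g = N/N_p fixed,

which is linear in N_p: doubling the number of processors while also doubling the total problem
size should double the achieved floating-point rate.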
1.3.2 Ease-Of-Use

Ease-of-use is concerned with factors such as portability and the user interface to the library.
Portability, in its most inclusive sense, means that the code is written in a standard language,
such as Fortran, and that the source code can be compiled on an arbitrary machine to produce a
program that will run correctly. We call this the "mail-order software" model of portability, since
it reflects the model used by software servers such as netlib [20]. This notion of portability is quite
demanding. It requires that all relevant properties of the computer's arithmetic and architecture be
discovered at runtime within the confines of a Fortran code. For example, if it is important to know
the overflow threshold for scaling purposes, it must be determined at runtime without overflowing,
since overflow is generally fatal. Such demands have resulted in quite large and sophisticated
programs [28, 44] which must be modified frequently to deal with new architectures and software
releases. This "mail-order" notion of software portability also means that codes generally must be
written for the worst possible machine expected to be used, thereby often degrading performance
on all others. Ease-of-use is also enhanced if implementation details are largely hidden from the
user, for example, through the use of an object-based interface to the library [24].
1.3.3 Range-Of-Use

Range-of-use may be gauged by how numerically stable the algorithms are over a range of input
problems, and the range of data structures the library will support. For example, LINPACK and
EISPACK deal with dense matrices (stored in a rectangular array), packed matrices (where only the
upper or lower half of a symmetric matrix is stored), and banded matrices (where only the nonzero
bands are stored). In addition, some special formats such as Householder vectors are used internally
to represent orthogonal matrices. There are also sparse matrices, which may be stored in many
different ways; but in this paper we focus on dense and banded matrices, the mathematical types
addressed by LINPACK, EISPACK, and LAPACK.
2 The BLAS as the Key to Portability

At least three factors affect the performance of portable Fortran code.

1. Vectorization. Designing vectorizable algorithms in linear algebra is usually straightfor-
ward. Indeed, for many computations there are several variants, all vectorizable, but with
different characteristics in performance (see, for example, [15]). Linear algebra algorithms can
approach the peak performance of many machines, principally because peak performance de-
pends on some form of chaining of vector addition and multiplication operations, and this
is just what the algorithms require. However, when the algorithms are realized in straight-
forward Fortran 77 code, the performance may fall well short of the expected level, usually
because vectorizing Fortran compilers fail to minimize the number of memory references, that
is, the number of vector load and store operations.

2. Data movement. What often limits the actual performance of a vector, or scalar, floating-
point unit is the rate of transfer of data between different levels of memory in the machine.
Examples include the transfer of vector operands in and out of vector registers, the transfer
of scalar operands in and out of a high-speed scalar processor, the movement of data between
main memory and a high-speed cache or local memory, paging between actual memory and
disk storage in a virtual memory system, and interprocessor communication on a distributed
memory concurrent computer.

3. Parallelism. The nested loop structure of most linear algebra algorithms offers considerable
scope for loop-based parallelism. This is the principal type of parallelism that LAPACK and
ScaLAPACK presently aim to exploit. On shared memory concurrent computers, this type of
parallelism can sometimes be generated automatically by a compiler, but often requires the
insertion of compiler directives. On distributed memory concurrent computers, data must be
moved between processors. This is usually done by explicit calls to message passing routines,
although parallel language extensions such as Coherent Parallel C [31] and Split-C [13] do
the message passing implicitly.
The question arises, "How can we achieve sufficient control over these three factors to obtain
the levels of performance that machines can offer?" The answer is through use of the BLAS.

There are now three levels of BLAS:

  • Level 1 BLAS [45]: for vector operations, such as y ← αx + y
  • Level 2 BLAS [18]: for matrix-vector operations, such as y ← αAx + βy
  • Level 3 BLAS [17]: for matrix-matrix operations, such as C ← αAB + βC.

Here, A, B and C are matrices, x and y are vectors, and α and β are scalars.
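In double precision these three operations correspond to the standard BLAS routines DAXPY, DGEMV,
and DGEMM. The fragment below (written for this note as an illustration; the wrapper name BLASEX
and its choice of square N-by-N matrices are arbitrary) shows one representative call at each level.

      SUBROUTINE BLASEX( N, ALPHA, BETA, X, Y, A, B, C, LDA )
*     Illustration only: one call at each BLAS level, assuming
*     N-by-N matrices A, B, C with leading dimension LDA and
*     N-vectors x and y.
      INTEGER            N, LDA
      DOUBLE PRECISION   ALPHA, BETA
      DOUBLE PRECISION   X( * ), Y( * )
      DOUBLE PRECISION   A( LDA, * ), B( LDA, * ), C( LDA, * )
      EXTERNAL           DAXPY, DGEMV, DGEMM
*     Level 1:  y := alpha*x + y
      CALL DAXPY( N, ALPHA, X, 1, Y, 1 )
*     Level 2:  y := alpha*A*x + beta*y
      CALL DGEMV( 'No transpose', N, N, ALPHA, A, LDA, X, 1,
     $            BETA, Y, 1 )
*     Level 3:  C := alpha*A*B + beta*C
      CALL DGEMM( 'No transpose', 'No transpose', N, N, N,
     $            ALPHA, A, LDA, B, LDA, BETA, C, LDA )
      RETURN
      END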
The Level 1 BLAS are used in LAPACK, but for convenience rather than for performance: they
perform an insignificant fraction of the computation, and they cannot achieve high efficiency on
most modern supercomputers.

The Level 2 BLAS can achieve near-peak performance on many vector processors, such as a
single processor of a CRAY X-MP or Y-MP, or Convex C-2 machine. However, on other vector
processors such as a CRAY-2 or an IBM 3090 VF, the performance of the Level 2 BLAS is limited
by the rate of data movement between different levels of memory.
The Level 3 BLAS overcome this limitation. This third level of BLAS performs O(n^3) floating-
point operations on O(n^2) data, whereas the Level 2 BLAS perform only O(n^2) operations on
O(n^2) data.
Table 1: Speed (Megaflops) of Level 2 and Level 3 BLAS Operations on a CRAY Y-MP. All matrices
are of order 500; U is upper triangular.

    Number of processors       1     2     4     8
    Level 2: y ← αAx + βy    311   611  1197  2285
    Level 3: C ← αAB + βC    312   623  1247  2425
    Level 2: x ← Ux          293   544   898  1613
    Level 3: B ← UB          310   620  1240  2425