LAPACK Working Note 139
A Numerical Linear Algebra Problem Solving Environment Designer's Perspective

A. Petitet, H. Casanova, J. Dongarra†, Y. Robert‡, and R. C. Whaley

October, 1998

Department of Computer Science, University of Tennessee, Knoxville, TN 37996-1301

† Department of Computer Science, University of Tennessee, Knoxville, TN 37996-1301, and Mathematical Sciences Section, Oak Ridge National Laboratory, Oak Ridge, TN 37831

‡ Ecole Normale Supérieure de Lyon, 69364 Lyon Cedex 07, France
Abstract

This chapter discusses the design of modern numerical linear algebra problem solving environments. Particular emphasis is placed on three essential components out of which such environments are constructed, namely well-designed numerical software libraries, software tools that generate optimized versions of a collection of numerical kernels for various processor architectures, and software systems that transform disparate, loosely-connected computers and software libraries into a unified, easy-to-access computational service.

A brief description of the "pioneers", namely the EISPACK and LINPACK software libraries as well as their successor, the Linear Algebra PACKage (LAPACK), illustrates the essential importance of block-partitioned algorithms for shared-memory, vector, and parallel processors. Indeed, these algorithms reduce the frequency of data movement between different levels of hierarchical memory. A key idea in this approach is the use of the Basic Linear Algebra Subprograms (BLAS) as computational building blocks. An outline of the ScaLAPACK software library, which is a distributed-memory version of LAPACK, highlights the equal importance of the above design principles to the development of scalable algorithms for MIMD distributed-memory concurrent computers. The impact of the architecture of high performance computers on the design of such libraries is stressed.

Producing hand-optimized implementations of even a reduced set of well-designed software components such as the BLAS for a wide range of architectures is an expensive and tedious proposition. For any given architecture, customizing a numerical kernel's source code to optimize performance requires a comprehensive understanding of the exploitable hardware resources of that architecture. Since this time-consuming customization process must be repeated whenever a slightly different target architecture is available, the relentless pace of hardware innovation makes the tuning of numerical libraries a constant burden. This chapter presents an innovative approach to automating the process of producing such optimized kernels for various processor architectures.

Finally, many scientists and researchers increasingly tend nowadays to use simultaneously a variety of distributed computing resources such as massively parallel processors, networks and clusters of workstations, and "piles" of PCs. This chapter describes the NetSolve software system that has been specifically designed and conceived to use efficiently such a diverse and lively computational environment and to tackle the problems posed by such a complex and innovative approach to scientific problem solving. NetSolve provides the user with a pool of computational resources. These resources are computational servers that provide run-time access to arbitrary optimized numerical software libraries. This unified, easy-to-access computational service can make enormous amounts of computing power transparently available to users on ordinary platforms.
1 Introduction

The increasing availability of advanced-architecture computers is having a very significant effect on all spheres of scientific computation, including algorithm research and software development in numerical linear algebra. Linear algebra, in particular the solution of linear systems of equations, lies at the heart of most calculations in scientific computing. In this chapter, particular attention will be paid to dense general linear system solvers, and these will be used as examples to highlight the most important factors that must be considered in designing linear algebra software for advanced-architecture computers. We use these general linear system solving algorithms for illustrative purposes not only because they are relatively simple, but also because of their importance in several scientific and engineering applications [Ede93] that make use of boundary element methods. These applications include, for instance, electromagnetic scattering [Har90, Wan91] and computational fluid dynamics problems [Hes90, HS67].
This chapter discusses some of the recent developments in linear algebra software designed to exploit these advanced-architecture computers. Since most of the work is motivated by the need to solve large problems on the fastest computers available, we focus on three essential components out of which modern problem solving environments are constructed:

1. well-designed numerical software libraries providing a comprehensive functionality and confining most machine dependencies into a small number of kernels that offer a wide scope for efficiently exploiting computer hardware resources,

2. automatic generation and optimization of such a collection of numerical kernels on various processor architectures, that is, software tools enabling well-designed software libraries to achieve high performance on most modern computers in a transportable manner,

3. software systems that transform disparate, loosely-connected computers and software libraries into a unified, easy-to-access computational service, that is, a service able to make enormous amounts of computing power transparently available to users on ordinary platforms.
For the past twenty years or so, there has been a great deal of activity in the area of algorithms and software for solving linear algebra problems. The linear algebra community has long recognized the need for help in developing algorithms into software libraries, and several years ago, as a community effort, put together a de facto standard identifying basic operations required in linear algebra algorithms and software. The hope was that the routines making up this standard, known collectively as the Basic Linear Algebra Subprograms (BLAS) [LHK+79, DDH+88, DDH+90], would be efficiently implemented on advanced-architecture computers by many manufacturers, making it possible to reap the portability benefits of having them efficiently implemented on a wide range of machines. This goal has been largely realized.
The key insight of our approach to designing linear algebra algorithms for advanced-architecture computers is that the frequency with which data is moved between different levels of the memory hierarchy must be minimized in order to attain high performance. Thus, our main algorithmic approach for exploiting both vectorization and parallelism in our implementations is the use of block-partitioned algorithms, particularly in conjunction with highly-tuned kernels for performing matrix-vector and matrix-matrix operations. In general, the use of block-partitioned algorithms requires data to be moved as blocks, rather than as vectors or scalars, so that although the total amount of data moved is unchanged, the latency (or startup cost) associated with the movement is greatly reduced because fewer messages are needed to move the data. A second key idea is that the performance of an algorithm can be tuned by a user by varying the parameters that specify the data layout. On shared-memory machines, this is controlled by the block size, while on distributed-memory machines it is controlled by the block size and the configuration of the logical process mesh.
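The structure of a block-partitioned algorithm can be sketched with a small example. The following is a minimal, hypothetical Python illustration (not LAPACK or ScaLAPACK code, which is written in Fortran) of a blocked matrix multiply with a tunable block size `nb`; the point is only the loop structure, in which data is touched one block at a time:

```python
# Sketch of a block-partitioned matrix multiply C = C + A*B.
# Hypothetical illustration only, not library code. Matrices are
# square lists of lists; `nb` is the tunable block size that
# determines how much data is processed as a unit.

def blocked_matmul(A, B, C, nb):
    n = len(A)
    for ii in range(0, n, nb):          # block row of C
        for jj in range(0, n, nb):      # block column of C
            for kk in range(0, n, nb):  # accumulate one block product
                for i in range(ii, min(ii + nb, n)):
                    for k in range(kk, min(kk + nb, n)):
                        aik = A[i][k]
                        for j in range(jj, min(jj + nb, n)):
                            C[i][j] += aik * B[k][j]
    return C
```

In a real library the innermost three loops would be a single highly-tuned matrix-matrix kernel call, and `nb` would be chosen to make each block fit in cache (or, on a distributed-memory machine, to balance load across the process mesh).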
Section 2 presents an overview of some of the major numerical linear algebra software library projects aimed at solving dense and banded problems. We discuss the role of the BLAS in portability and performance on high-performance computers, as well as the design of these building blocks and their use in block-partitioned algorithms.
The Linear Algebra PACKage (LAPACK) [ABB+95], for instance, is a typical example of such a software design, where most of the algorithms are expressed in terms of a reduced set of computational building blocks, in this case called the Basic Linear Algebra Subprograms (BLAS). These computational building blocks support the creation of software that efficiently expresses higher-level block-partitioned algorithms, while hiding many details of the parallelism from the application developer. These subprograms can be optimized for each architecture to account for the deep memory hierarchies [AD89, DMR91] and pipelined functional units that are common to most modern computer architectures, and thus provide a transportable way to achieve high efficiency across diverse computing platforms. For fastest possible performance, LAPACK requires that highly optimized block matrix operations be already implemented on each machine; that is, the correctness of the code is portable, but high performance is not if we limit ourselves to a single source code.
Speed and portable optimization are thus conflicting objectives that have proved difficult to satisfy simultaneously, and the typical strategy for addressing this problem, namely confining most of the hardware dependencies to a small number of heavily-used computational kernels, has limitations. For instance, producing hand-optimized implementations of even a reduced set of well-designed software components for a wide range of architectures is an expensive and tedious task. For any given architecture, customizing a numerical kernel's source code to optimize performance requires a comprehensive understanding of the exploitable hardware resources of that architecture. This primarily includes the memory hierarchy and how it can be utilized to maximize data reuse, as well as the functional units and registers and how these hardware components can be programmed to generate the correct operands at the correct time. Clearly, the size of the various cache levels, the latency of floating point instructions, the number of floating point units and other hardware constants are essential parameters that must be taken into consideration as well. Since this time-consuming customization process must be repeated whenever a slightly different target architecture is available, or even when a new version of the compiler is released, the relentless pace of hardware innovation makes the tuning of numerical libraries a constant burden.
The difficult search for fast and accurate numerical methods for solving numerical linear algebra problems is compounded by the complexities of porting and tuning numerical libraries to run on the best hardware available to different parts of the scientific and engineering community. Given the fact that the performance of common computing platforms has increased exponentially in the past few years, scientists and engineers have acquired legitimate expectations about being able to immediately exploit these available resources at their highest capabilities. Fast, accurate, and robust numerical methods have to be encoded in software libraries that are highly portable and optimizable across a wide range of systems in order to be exploited to their fullest potential.
Section 3 discusses an innovative approach [BAC+97, WD97] to automating the process of producing such optimized kernels for RISC processor architectures that feature deep memory hierarchies and pipelined functional units. These research efforts have so far demonstrated very encouraging results, and have generated great interest in the scientific computing community.
Many scientists and researchers increasingly tend nowadays to use simultaneously a variety of distributed computing resources such as massively parallel processors, networks and clusters of workstations, and "piles" of PCs. In order to use such a diverse and lively computational environment efficiently, many challenging research aspects of network-based computing, such as fault-tolerance, load balancing, user-interface design, and computational servers or virtual libraries, must be addressed. User-friendly, network-enabled, application-specific toolkits have been specifically designed and conceived to tackle the problems posed by such a complex and innovative approach to scientific problem solving [FK98]. Section 4 describes the NetSolve software system [CD95] that provides users with a pool of computational resources. These resources are computational servers that provide run-time access to arbitrary optimized numerical software libraries. The NetSolve software system transforms disparate, loosely-connected computers and software libraries into a unified, easy-to-access computational service. This service can make enormous amounts of computing power transparently available to users on ordinary platforms.
The NetSolve system allows users to access computational resources, such as hardware and software, distributed across the network. These resources are embodied in computational servers and allow users to easily perform scientific computing tasks without having any computing facility installed on their computer. Users' access to the servers is facilitated by a variety of interfaces: Application Programming Interfaces (APIs), Textual Interactive Interfaces, and Graphical User Interfaces (GUIs). As the NetSolve project matures, several promising extensions and applications will emerge. In this chapter, we provide an overview of the project and examine some of the extensions being developed for NetSolve: an interface to the Condor system [LLM88], an interface to the ScaLAPACK parallel library [BCC+97], a bridge with the Ninf system [SSN+96], and an integration of NetSolve and ImageVision [ENB96].

Future directions for research and investigation are finally presented in Section 5.
2 Numerical Linear Algebra Libraries

This section first presents a few representative numerical linear algebra packages in a chronological perspective. We then focus on the software design of the LAPACK and ScaLAPACK software libraries. The importance of the BLAS as a key to transportable efficiency, as well as the derivation of block-partitioned algorithms, are discussed in detail.
2.1 Chronological Perspective

The EISPACK, LINPACK, LAPACK and ScaLAPACK numerical linear algebra software libraries are briefly outlined below in chronological order. The essential features of each of these packages are in turn rapidly described in order to illustrate the reasons for this evolution. Particular emphasis is placed on the impact of high-performance computer architecture on the design features of these libraries.
2.1.1 The Pioneers: EISPACK and LINPACK

The EISPACK and LINPACK software libraries were designed for supercomputers used in the seventies and early eighties, such as the CDC-7600, Cyber 205, and Cray-1. These machines featured multiple functional units pipelined for good performance [HJ81]. The CDC-7600 was basically a high-performance scalar computer, while the Cyber 205 and Cray-1 were early vector computers.
EISPACK is a collection of Fortran subroutines that compute the eigenvalues and eigenvectors of nine classes of matrices: complex general, complex Hermitian, real general, real symmetric, real symmetric banded, real symmetric tridiagonal, special real tridiagonal, generalized real, and generalized real symmetric matrices. In addition, two routines are included that use singular value decomposition to solve certain least-squares problems. EISPACK is primarily based on a collection of Algol procedures developed in the sixties and collected by J. H. Wilkinson and C. Reinsch in a volume entitled Linear Algebra in the Handbook for Automatic Computation [WR71] series. This volume was not designed to cover every possible method of solution; rather, algorithms were chosen on the basis of their generality, elegance, accuracy, speed, or economy of storage. Since the release of EISPACK in 1972, over ten thousand copies of the collection have been distributed worldwide.
LINPACK is a collection of Fortran subroutines that analyze and solve linear equations and linear least-squares problems. The package solves linear systems whose matrices are general, banded, symmetric indefinite, symmetric positive definite, triangular, and tridiagonal square. In addition, the package computes the QR and singular value decompositions of rectangular matrices and applies them to least-squares problems. LINPACK is organized around four matrix factorizations: LU factorization, pivoted Cholesky factorization, QR factorization, and singular value decomposition. The term LU factorization is used here in a very general sense to mean the factorization of a square matrix into a lower triangular part and an upper triangular part, perhaps with pivoting. Some of these factorizations will be treated at greater length later, but first a digression on organization and factors influencing LINPACK's efficiency is necessary.
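The kind of factorization meant here can be sketched in a few lines. The following is a hypothetical teaching version in Python (not LINPACK's Fortran) of LU factorization with partial pivoting, overwriting the matrix with the multipliers of L below the diagonal and U on and above it:

```python
# Sketch of LU factorization with partial pivoting (illustrative only,
# not LINPACK code). Overwrites A in place: the strictly lower part
# holds the multipliers of L (unit diagonal implied), the upper part
# holds U. Returns the row permutation as a list of pivot indices.

def lu_factor(A):
    n = len(A)
    piv = list(range(n))
    for k in range(n):
        # Partial pivoting: bring the largest entry of column k to the diagonal.
        p = max(range(k, n), key=lambda i: abs(A[i][k]))
        if A[p][k] == 0.0:
            raise ZeroDivisionError("matrix is singular")
        if p != k:
            A[k], A[p] = A[p], A[k]
            piv[k], piv[p] = piv[p], piv[k]
        # Eliminate below the diagonal, storing the multipliers in place.
        for i in range(k + 1, n):
            A[i][k] /= A[k][k]
            for j in range(k + 1, n):
                A[i][j] -= A[i][k] * A[k][j]
    return piv
```

This unblocked, column-by-column elimination mirrors LINPACK's organization; the blocked reformulation discussed later in this chapter performs the same arithmetic on panels of columns at a time.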
LINPACK uses column-oriented algorithms to increase efficiency by preserving locality of reference. This means that if a program references an item in a particular block, the next reference is likely to be in the same block. By column orientation we mean that the LINPACK codes always reference arrays down columns, not across rows. This works because Fortran stores arrays in column major order. Thus, as one proceeds down a column of an array, the memory references proceed sequentially in memory. On the other hand, as one proceeds across a row, the memory references jump across memory, the length of the jump being proportional to the column's length. The effects of column orientation are quite dramatic: on systems with virtual or cache memories, the LINPACK codes will significantly outperform codes that are not column oriented.
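The storage convention behind this can be made concrete. Below is a hypothetical Python sketch (Python, unlike Fortran, has no built-in column-major arrays, so the layout is simulated with a flat buffer) showing why column traversal is sequential and row traversal strides by the column length:

```python
# Sketch of Fortran's column-major storage (illustrative only).
# An m-by-n matrix is kept in one flat buffer; element (i, j),
# 0-based, lives at flat index i + j*m. Walking down a column
# touches consecutive locations; walking across a row jumps by m.

def colmajor_index(i, j, m):
    return i + j * m

m, n = 3, 2
# The matrix [[1, 4], [2, 5], [3, 6]] stored column by column:
buf = [1, 2, 3, 4, 5, 6]

# Down column 0: flat indices 0, 1, 2 -- stride 1, sequential.
col0 = [buf[colmajor_index(i, 0, m)] for i in range(m)]

# Across row 0: flat indices 0, 3 -- stride m, jumping in memory.
row0 = [buf[colmajor_index(0, j, m)] for j in range(n)]
```

With cache lines and virtual-memory pages holding consecutive addresses, the stride-1 column walk reuses each fetched line fully, while the stride-m row walk may touch a new line (or page) on every reference.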
Another important influence on the efficiency of LINPACK is the use of the Level 1 BLAS [LHK+79]. These BLAS are a small set of routines that may be coded to take advantage of the special features of the computers on which LINPACK is being run. For most computers, this simply means producing machine-language versions. However, the code can also take advantage of more exotic architectural features, such as vector operations. Further details about the BLAS are presented below in Section 2.2.1.
2.1.2 LAPACK

The development of LAPACK [ABB+95] in the late eighties was intended to make the EISPACK and LINPACK libraries run efficiently on shared-memory vector supercomputers. LAPACK [Dem89] provides routines for solving systems of simultaneous linear equations, least-squares solutions of linear systems of equations, eigenvalue problems, and singular value problems. The associated matrix factorizations (LU, Cholesky, QR, SVD, Schur, generalized Schur) are also provided, along with related computations such as reordering of the Schur factorizations and estimating condition numbers. Dense and banded matrices are handled, but not general sparse matrices. In all areas, similar functionality is provided for real and complex matrices, in both single and double precision. LAPACK is in the public domain and available from netlib [DG87].
The original goal of the LAPACK project was to make the widely used EISPACK and LINPACK libraries run efficiently on shared-memory vector and parallel processors. On these machines, LINPACK and EISPACK are inefficient because their memory access patterns disregard the multilayered memory hierarchies of the machines, thereby spending too much time moving data instead of doing useful floating point operations. LAPACK addresses this problem by reorganizing the algorithms to use block matrix operations, such as matrix multiplication, in the innermost loops [AD90, Dem89]. These block operations can be optimized for each architecture to account for the memory hierarchy [AD89, DMR91], and so provide a transportable way to achieve high efficiency on diverse modern machines. Here we use the term "transportable" instead of "portable" because, for fastest possible performance, LAPACK requires that highly optimized block matrix operations be already implemented on each machine. In other words, the correctness of the code is portable, but high performance is not if we limit ourselves to a single Fortran source code.
LAPACK can be regarded as a successor to LINPACK and EISPACK. It has virtually all the capabilities of these two packages and much more besides. LAPACK improves on LINPACK and EISPACK in four main respects: speed, accuracy, robustness and functionality. While LINPACK and EISPACK are based on the vector operation kernels of the Level 1 BLAS [LHK+79], LAPACK was designed at the outset to exploit the Level 3 BLAS [DDH+90], a set of specifications for Fortran subprograms that do various types of matrix multiplication and the solution of triangular systems with multiple right-hand sides. Because of the coarse granularity of the Level 3 BLAS operations, their use tends to promote high efficiency on many high-performance computers, particularly if specially coded implementations are provided by the manufacturer.
2.1.3 ScaLAPACK

The ScaLAPACK [BCC+97] software library extends the LAPACK library to run scalably on MIMD distributed-memory concurrent computers. For such machines the memory hierarchy includes the off-processor memory of other processors, in addition to the hierarchy of registers, cache, and local memory on each processor. Like LAPACK, the ScaLAPACK routines are based on block-partitioned algorithms in order to minimize the frequency of data movement between different levels of the memory hierarchy. The fundamental building blocks of the ScaLAPACK library are parallel distributed-memory versions of the BLAS (PBLAS) [CDO+95], and a set of Basic Linear Algebra Communication Subprograms (BLACS) [WD95] for communication tasks that arise frequently in parallel linear algebra computations. In the ScaLAPACK routines, all interprocessor communication occurs within the PBLAS and the BLACS, so that the source code of the top software layer of ScaLAPACK looks very similar to that of LAPACK.
The ScaLAPACK library contains routines for the solution of systems of linear equations, linear least squares problems and eigenvalue problems. The goals of the LAPACK project, which continue into the ScaLAPACK project, are efficiency, so that the computationally intensive routines execute as fast as possible; scalability as the problem size and number of processors grow; reliability, including the return of error bounds; portability across machines; flexibility, so that users may construct new routines from well-designed components; and ease of use. Towards this last goal the ScaLAPACK software has been designed to look as much like the LAPACK software as possible.
Many of these goals have been attained by developing and promoting standards, especially specifications for basic computational and communication routines. Thus LAPACK relies on the BLAS [LHK+79, DDH+88, DDH+90], particularly the Level 2 and 3 BLAS, for computational efficiency, and ScaLAPACK [BCC+97] relies upon the BLACS [WD95] for efficiency of communication and uses a set of parallel BLAS, the PBLAS [CDO+95], which themselves call the BLAS and the BLACS. LAPACK and ScaLAPACK will run on any machines for which the BLAS and the BLACS are available. A PVM [GBD+94] version of the BLACS has been available for some time, and the portability of the BLACS has recently been further increased by the development of a version that uses MPI [MPI+94, SOH+96].
The underlying concept of both the LAPACK and ScaLAPACK libraries is the use of block-partitioned algorithms to minimize data movement between different levels in hierarchical memory. Thus, the ideas discussed in this chapter for developing a library for dense linear algebra computations are applicable to any computer with a hierarchical memory that imposes a sufficiently large startup cost on the movement of data between different levels in the hierarchy, and for which the cost of a context switch is too great to make fine grain size multithreading worthwhile. The target machines are, therefore, medium and large grain size advanced-architecture computers. These include, respectively, "traditional" shared-memory vector supercomputers, such as the Cray Y-MP and C90, and MIMD distributed-memory concurrent computers, such as massively parallel processors (MPPs) and networks or clusters of workstations.
The ScaLAPACK software has been designed specifically to achieve high efficiency for a wide range of modern distributed-memory computers. Examples of such computers include the Cray T3 series, the IBM Scalable POWERparallel SP series, the Intel iPSC and Paragon computers, the nCube-2/3 computer, networks and clusters of workstations (NoWs and CoWs), and "piles" of PCs (PoPCs).

Future advances in compiler and hardware technologies in the mid to late nineties are expected to make multithreading a viable approach for masking communication costs. Since the blocks in a block-partitioned algorithm can be handled by separate threads, our approach will still be applicable on machines that exploit medium and coarse grain size multithreading.
2.2 Software Design

Developing a library of high-quality subroutines for dense linear algebra computations requires tackling a large number of issues. On the one hand, one must develop or select numerically stable algorithms and estimate the accuracy and/or domain of validity of the results produced by the corresponding routines. On the other hand, it is often necessary to reformulate or adapt those algorithms for performance reasons related to the architecture of the target computers. This section presents three fundamental ideas to this effect that characterize the design of the LAPACK and ScaLAPACK software.
2.2.1 The BLAS as the Key to Transportable Efficiency

At least three factors affect the performance of portable Fortran code:

1. Vectorization. Designing vectorizable algorithms in linear algebra is usually straightforward. Indeed, for many computations there are several variants, all vectorizable, but with different characteristics in performance (see, for example, [Don84]). Linear algebra algorithms can approach the peak performance of many machines, principally because peak performance depends on some form of chaining of vector addition and multiplication operations, and this is just what the algorithms require. However, when the algorithms are realized in straightforward Fortran 77 code, the performance may fall well short of the expected level, usually because vectorizing Fortran compilers fail to minimize the number of memory references, that is, the number of vector load and store operations.

2. Data movement. What often limits the actual performance of a vector, or scalar, floating point unit is the rate of transfer of data between different levels of memory in the machine. Examples include the transfer of vector operands in and out of vector registers, the transfer of scalar operands in and out of a high-speed scalar processor, the movement of data between main memory and a high-speed cache or local memory, paging between actual memory and disk storage in a virtual memory system, and interprocessor communication on a distributed-memory concurrent computer.

3. Parallelism. The nested loop structure of most linear algebra algorithms offers considerable scope for loop-based parallelism. This is the principal type of parallelism that LAPACK and ScaLAPACK presently aim to exploit. On shared-memory concurrent computers, this type of parallelism can sometimes be generated automatically by a compiler, but often requires the insertion of compiler directives. On distributed-memory concurrent computers, data must be moved between processors. This is usually done by explicit calls to message passing routines, although parallel language extensions such as Coherent Parallel C [FO88] and Split-C [CDG+93] do the message passing implicitly.
The question arises, "How can we achieve sufficient control over these three factors to obtain the levels of performance that machines can offer?" The answer is through use of the BLAS. There are now three levels of BLAS:

Level 1 BLAS [LHK+79]: for vector-vector operations (y ← αx + y),

Level 2 BLAS [DDH+88]: for matrix-vector operations (y ← αAx + βy),

Level 3 BLAS [DDH+90]: for matrix-matrix operations (C ← αAB + βC).

Here, A, B and C are matrices, x and y are vectors, and α and β are scalars.
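The semantics of one representative operation from each level can be sketched as follows (a plain Python illustration; the real BLAS are optimized Fortran and assembly routines with richer interfaces, and the names axpy, gemv, gemm follow the standard BLAS naming):

```python
# Sketches of one operation from each BLAS level (illustrative only).
# Note the work-to-data ratios: Level 1 does O(n) work on O(n) data,
# Level 2 does O(n^2) work on O(n^2) data, but Level 3 does O(n^3)
# work on O(n^2) data, which is what enables data reuse in cache.

def axpy(alpha, x, y):
    """Level 1: y <- alpha*x + y (vector-vector)."""
    for i in range(len(y)):
        y[i] += alpha * x[i]

def gemv(alpha, A, x, beta, y):
    """Level 2: y <- alpha*A*x + beta*y (matrix-vector)."""
    for i in range(len(y)):
        s = sum(A[i][k] * x[k] for k in range(len(x)))
        y[i] = alpha * s + beta * y[i]

def gemm(alpha, A, B, beta, C):
    """Level 3: C <- alpha*A*B + beta*C (matrix-matrix)."""
    for i in range(len(C)):
        for j in range(len(C[0])):
            s = sum(A[i][k] * B[k][j] for k in range(len(B)))
            C[i][j] = alpha * s + beta * C[i][j]
```

The favorable ratio of floating point operations to data for the Level 3 routines is the coarse granularity referred to above: each matrix element fetched into fast memory can participate in O(n) operations before it is evicted.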
Table 1: Speed (Mflops) of Level 2 and Level 3 BLAS operations on a CRAY Y-MP. All matrices are of order 500; U is upper triangular.

    Number of processors:       1     2     4     8
    Level 2: y ← αAx + βy     311   611  1197  2285
    Level 3: C ← αAB + βC     312   623  1247  2425
    Level 2: x ← Ux           293   544   898  1613
    Level 3: B ← UB           310   620  1240  2425