Templates for Linear Algebra Problems
Zhaojun Bai
University of Kentucky
David Day
University of Kentucky
James Demmel
University of California - Berkeley
Jack Dongarra
University of Tennessee and Oak Ridge National Laboratory
Ming Gu
University of California - Berkeley
Axel Ruhe
Chalmers University
and
Henk van der Vorst
Utrecht University
Abstract. The increasing availability of advanced-architecture computers is having a very
significant effect on all spheres of scientific computation, including algorithm research and
software development in numerical linear algebra. Linear algebra, in particular the solution
of linear systems of equations and eigenvalue problems, lies at the heart of most calculations in
scientific computing. This chapter discusses some of the recent developments in linear algebra
designed to help the user on advanced-architecture computers.
Much of the work in developing linear algebra software for advanced-architecture computers
is motivated by the need to solve large problems on the fastest computers available. In this
chapter, we focus on four basic issues: (1) the motivation for the work; (2) the development
of standards for use in linear algebra and the building blocks for a library; (3) aspects of
templates for the solution of large sparse systems of linear equations; and (4) templates for
the solution of large sparse eigenvalue problems. This last project is under development and
we will pay more attention to it in this chapter.
1 Introduction and Motivation
Large scale problems of engineering and scientific computing often require solutions of linear algebra
problems, such as systems of linear equations, least squares problems, or eigenvalue problems. There
is a vast amount of material available on solving such problems, in books, in journal articles, and as
software. This software consists of well-maintained libraries available commercially or electronically
in the public domain, other libraries distributed with texts or other books, individual subroutines
tested and published by organizations like the ACM, and yet more software available from individuals
or electronic sources like Netlib [21], which may be hard to find or come without support. So although
many challenging numerical linear algebra problems still await satisfactory solutions, many excellent
methods exist from a plethora of sources.
* This work was made possible in part by grants from the Defense Advanced Research Projects Agency under
contract DAAL03-91-C-0047 administered by the Army Research Office, the Office of Scientific Computing,
U.S. Department of Energy, under Contract DE-AC05-84OR21400, the National Science Foundation Science
and Technology Center Cooperative Agreement No. CCR-8809615, and National Science Foundation
Grant No. ASC-9005933.
But the sheer number of algorithms and their implementations makes it hard even for experts, let
alone general users, to find the best solution for a given problem. This has led to the development of
various on-line search facilities for numerical software. One has been developed by NIST (National
Institute of Standards and Technology), and is called GAMS (Guide to Available Mathematical
Software) [7]; another is part of Netlib [21]. These facilities permit search based on library names,
subroutine names, keywords, and a taxonomy of topics in numerical computing. But for the general
user in search of advice as to which algorithm or which subroutine to use for her particular problem,
they offer relatively little guidance.
Furthermore, many challenging problems cannot be solved with existing "black-box" software
packages in a reasonable time or space. This means that more special-purpose methods must be
used, and tuned for the problem at hand. This tuning is the greatest challenge, since there is a large
number of tuning options available, and for many problems it is a challenge to get any acceptable
answer at all, or to have confidence in what is computed. The expertise regarding which options, or
combinations of options, are likely to work in a specific application area is widely distributed.
Thus, there is a need for tools to help users pick the best algorithm and implementation for their
numerical problems, as well as expert advice on how to tune them. In fact, we see three potential
user communities for such tools:
- The "HPCC" (High Performance Computing and Communication) community consists of those
scientists trying to solve the largest, hardest applications in their fields. They desire high speed,
access to algorithmic details for performance tuning, and reliability for their problem.
- Engineers and scientists generally desire easy-to-use, reliable software that is also reasonably
efficient.
- Students and teachers want simple but generally effective algorithms, which are easy to explain
and understand.
It may seem difficult to address the needs of such diverse communities with a single document.
Nevertheless, we believe this is possible in the form of templates.
A template for an algorithm includes:
1. A high-level description of the algorithm.
2. A description of when it is effective, including conditions on the input, and estimates of the time,
space or other resources required. If there are natural competitors, they will be referenced.
3. A description of available refinements and user-tunable parameters, as well as advice on when
to use them.
4. Pointers to complete or partial implementations, perhaps in several languages or for several
architectures (e.g., each parallel architecture). These implementations expose those details suitable
for user-tuning, and hide the others.
5. Numerical examples, on a common set of examples, illustrating both easy cases and difficult
cases.
6. Troubleshooting advice.
7. Pointers to texts or journal articles for further information.
In addition to individual templates, there will be a decision tree to help steer the user to the right
algorithm, or subset of possible algorithms, based on a sequence of questions about the nature of
the problem to be solved.
Our goal in this paper is to outline such a set of templates for systems of linear equations and
eigenvalue problems. The main theme is to explain how to use iterative methods.
The rest of this paper is organized as follows. Section 2 discusses related work. Section 3 describes
the work we have done on templates for sparse linear systems. Section 4 describes our taxonomy
of eigenvalue problems and algorithms, which we will use to organize the templates and design a
decision tree. Section 4.1 discusses notation and terminology. Sections 4.2 through 4.7 below describe
the decision tree in more detail. Section 4.8 outlines a chapter on accuracy issues, which is meant to
describe what accuracy one can expect to achieve, since eigenproblems can be very ill-conditioned,
as well as tools for measuring accuracy. Section 4.9 describes the formats in which we expect to
deliver both software and documentation.
This paper is a design document, and we encourage feedback, especially from potential users.
2 Related Work
Many excellent numerical linear algebra texts [43, 25, 30, 11, 33] and black-box software libraries
[36, 24, 1] already exist. A great deal of more specialized software for eigenproblems also exists;
for surveys on the subject see [31, 4].
A book of templates, including software, has already been written for iterative methods for solving
linear systems [6], although it does not include all the ingredients mentioned above. In particular,
it discusses some important advanced methods, such as preconditioning, domain decomposition and
multigrid, relatively briefly, and does not have a comprehensive set of numerical examples to help
explain the expected performance of each algorithm on various problem classes. It has a relatively
simple decision tree to help users choose an algorithm. Nevertheless, it successfully incorporates many
of the features we wish to have.
The linear algebra chapter in Numerical Recipes [32] contains a brief description of the conjugate
gradient algorithm for sparse linear systems, and the eigensystem chapter contains only the basic
methods for solving the dense standard eigenvalue problem. All of the methods and available software
in the Recipes, except the Jacobi method for the symmetric eigenvalue problem, are simplified
versions of algorithms in EISPACK and LAPACK. Besides being supplied on disk, the software in
the Recipes is printed line by line in the book.
2.1 Dense Linear Algebra Libraries
Over the past twenty-five years, we have been involved in the development of several important
packages of dense linear algebra software: EISPACK [36, 24], LINPACK [16], LAPACK [1], and
the BLAS [28, 19, 18]. In addition, we are currently involved in the development of ScaLAPACK
[8], a scalable version of LAPACK for distributed memory concurrent computers. In this section,
we give a brief review of these packages: their history, their advantages, and their limitations on
high-performance computers.
EISPACK. EISPACK is a collection of Fortran subroutines that compute the eigenvalues and eigen-
vectors of nine classes of matrices: complex general, complex Hermitian, real general, real symmetric,
real symmetric banded, real symmetric tridiagonal, special real tridiagonal, generalized real, and
generalized real symmetric matrices. In addition, two routines are included that use singular value
decomposition to solve certain least-squares problems.
EISPACK is primarily based on a collection of Algol procedures developed in the 1960s and
collected by J. H. Wilkinson and C. Reinsch in a volume entitled Linear Algebra in the Handbook for
Automatic Computation series [44]. This volume was not designed to cover every possible method of
solution; rather, algorithms were chosen on the basis of their generality, elegance, accuracy, speed,
or economy of storage.
Since the release of EISPACK in 1972, thousands of copies of the collection have been distributed
worldwide.
LINPACK. LINPACK is a collection of Fortran subroutines that analyze and solve linear equations
and linear least-squares problems. The package solves linear systems whose matrices are general,
banded, symmetric indefinite, symmetric positive definite, triangular, and tridiagonal square. In
addition, the package computes the QR and singular value decompositions of rectangular matrices
and applies them to least-squares problems.
LINPACK is organized around four matrix factorizations: LU factorization, Cholesky factoriza-
tion, QR factorization, and singular value decomposition. The term LU factorization is used here in
a very general sense to mean the factorization of a square matrix into a lower triangular part and
an upper triangular part, perhaps with pivoting.
LINPACK uses column-oriented algorithms to increase efficiency by preserving locality of ref-
erence. By column orientation we mean that the LINPACK codes always reference arrays down
columns, not across rows. This works because Fortran stores arrays in column major order. Thus,
as one proceeds down a column of an array, the memory references proceed sequentially in memory.
On the other hand, as one proceeds across a row, the memory references jump across memory, the
length of the jump being proportional to the length of a column. The effects of column orientation
are quite dramatic: on systems with virtual or cache memories, the LINPACK codes will significantly
outperform codes that are not column oriented. We note, however, that textbook examples of matrix
algorithms are seldom column oriented.
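To make the point concrete, the following routine is a minimal illustrative sketch (not LINPACK
source): it sums the entries of a matrix with the inner loop running down columns, so successive
memory references are adjacent under Fortran's column-major storage. Interchanging the two loops
would give row-oriented code whose inner-loop references are separated by LDA elements.

      SUBROUTINE COLSUM( M, N, A, LDA, S )
*     Illustrative sketch, not a LINPACK routine: sums the entries
*     of an M-by-N matrix A.  Because Fortran stores A in column
*     major order, the inner loop over I touches consecutive memory
*     locations; interchanging the loops would make the inner loop
*     stride through memory in jumps of length LDA.
      INTEGER            M, N, LDA
      DOUBLE PRECISION   A( LDA, * ), S
      INTEGER            I, J
      S = 0.0D0
      DO 20 J = 1, N
         DO 10 I = 1, M
            S = S + A( I, J )
   10    CONTINUE
   20 CONTINUE
      RETURN
      END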
Another important factor influencing the efficiency of LINPACK is the use of the Level 1 BLAS;
there are three effects.
First, the overhead entailed in calling the BLAS reduces the efficiency of the code. This reduction
is negligible for large matrices, but it can be quite significant for small matrices. The matrix size
at which it becomes unimportant varies from system to system; for square matrices it is typically
between n = 25 and n = 100. If this seems like an unacceptably large overhead, remember that on
many modern systems the solution of a system of order 25 or less is itself a negligible calculation.
Nonetheless, it cannot be denied that a person whose programs depend critically on solving small
matrix problems in inner loops will be better off with BLAS-less versions of the LINPACK codes.
Fortunately, the BLAS can be removed from the smaller, more frequently used programs in a short
editing session.
Second, the BLAS improve the efficiency of programs when they are run on nonoptimizing
compilers. This is because doubly subscripted array references in the inner loop of the algorithm are
replaced by singly subscripted array references in the appropriate BLAS. The effect can be seen for
matrices of quite small order, and for large orders the savings are quite significant.
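As a sketch of this second effect, consider the column update at the heart of Gaussian elimination.
The routine below (illustrative only, not the actual LINPACK source) replaces the doubly
subscripted inner loop by a single call to the Level 1 BLAS routine DAXPY, whose reference
implementation works only with singly subscripted arrays.

      SUBROUTINE COLUPD( N, K, J, T, A, LDA )
*     Illustrative sketch, not actual LINPACK source.  The explicit
*     doubly subscripted loop
*         DO 10 I = K+1, N
*            A( I, J ) = A( I, J ) + T*A( I, K )
*     10  CONTINUE
*     is replaced by one call to the Level 1 BLAS routine DAXPY,
*     which computes y := a*x + y on vectors of length N-K.
      INTEGER            N, K, J, LDA
      DOUBLE PRECISION   T, A( LDA, * )
      EXTERNAL           DAXPY
      CALL DAXPY( N-K, T, A( K+1, K ), 1, A( K+1, J ), 1 )
      RETURN
      END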
Finally, improved efficiency can be achieved by coding a set of BLAS [18] to take advantage of
the special features of the computers on which LINPACK is being run. For most computers, this
simply means producing machine-language versions. However, the code can also take advantage of
more exotic architectural features, such as vector operations.
Further details about the BLAS are presented in Section 2.3.
LAPACK. LAPACK [14] provides routines for solving systems of linear equations, least-squares
problems, eigenvalue problems, and singular value problems. The associated matrix factorizations
(LU, Cholesky, QR, SVD, Schur, generalized Schur) are also provided, as are related computations
such as reordering of the Schur factorizations and estimating condition numbers. Dense and banded
matrices are handled, but not general sparse matrices. In all areas, similar functionality is provided
for real and complex matrices, in both single and double precision.
The original goal of the LAPACK project was to make the widely used EISPACK and LINPACK
libraries run efficiently on shared-memory vector and parallel processors, as well as RISC worksta-
tions. On these machines, LINPACK and EISPACK are inefficient because their memory access
patterns disregard the multilayered memory hierarchies of the machines, thereby spending too much
time moving data instead of doing useful floating-point operations. LAPACK addresses this problem
by reorganizing the algorithms to use block matrix operations, such as matrix multiplication, in the
innermost loops [3, 14]. These block operations can be optimized for each architecture to account for
the memory hierarchy [2], and so provide a transportable way to achieve high efficiency on diverse
modern machines. Here we use the term "transportable" instead of "portable" because, for fastest
possible performance, LAPACK requires that highly optimized block matrix operations be already
implemented on each machine. In other words, the correctness of the code is portable, but high
performance is not, if we limit ourselves to a single Fortran source code.
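The flavor of these block algorithms can be sketched in a few lines. In a right-looking blocked
factorization, for example, once a panel of NB columns has been factored, the entire trailing
submatrix is updated by a single matrix-matrix multiplication, so almost all of the floating-point
work lands in the Level 3 BLAS routine DGEMM, which each vendor can tune for its memory
hierarchy. The routine below is a simplified sketch of that update, not actual LAPACK source.

      SUBROUTINE TRUPD( N, K, NB, A, LDA )
*     Simplified sketch, not actual LAPACK source: after columns
*     K through K+NB-1 of an N-by-N matrix have been factored,
*     update the trailing submatrix A22 := A22 - A21*A12 with one
*     DGEMM call.  Here A21 is M-by-NB, A12 is NB-by-M, and A22 is
*     M-by-M, where M = N-K-NB+1.
      INTEGER            N, K, NB, LDA
      DOUBLE PRECISION   A( LDA, * )
      DOUBLE PRECISION   ONE
      PARAMETER          ( ONE = 1.0D0 )
      INTEGER            M
      EXTERNAL           DGEMM
      M = N - K - NB + 1
      CALL DGEMM( 'No transpose', 'No transpose', M, M, NB, -ONE,
     $            A( K+NB, K ), LDA, A( K, K+NB ), LDA, ONE,
     $            A( K+NB, K+NB ), LDA )
      RETURN
      END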
LAPACK can be regarded as a successor to LINPACK and EISPACK. It has virtually all the
capabilities of these two packages and much more besides. LAPACK improves on LINPACK and
EISPACK in four main respects: speed, accuracy, robustness and functionality. While LINPACK and
EISPACK are based on the vector operation kernels of the Level 1 BLAS, LAPACK was designed
at the outset to exploit the Level 3 BLAS, a set of specifications for Fortran subprograms that do
various types of matrix multiplication and the solution of triangular systems with multiple right-hand
sides. Because of the coarse granularity of the Level 3 BLAS operations, their use tends to promote
high efficiency on many high-performance computers, particularly if specially coded implementations
are provided by the manufacturer.
ScaLAPACK. The ScaLAPACK software library extends the LAPACK library to run scalably on
MIMD, distributed memory, concurrent computers [8, 9]. For such machines the memory hierarchy
includes the off-processor memory of other processors, in addition to the hierarchy of registers,
cache, and local memory on each processor. Like LAPACK, the ScaLAPACK routines are based on
block-partitioned algorithms in order to minimize the frequency of data movement between different
levels of the memory hierarchy. The fundamental building blocks of the ScaLAPACK library are
distributed memory versions of the Level 2 and Level 3 BLAS, and a set of Basic Linear Algebra
Communication Subprograms (BLACS) [17, 20] for communication tasks that arise frequently in
parallel linear algebra computations. In the ScaLAPACK routines, all interprocessor communication
occurs within the distributed BLAS and the BLACS, so the source code of the top software layer of
ScaLAPACK looks very similar to that of LAPACK.
The interface for ScaLAPACK is similar to that of LAPACK, with some additional arguments
passed to each routine to specify the data layout.
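For example, where LAPACK solves AX = B with CALL DGESV( N, NRHS, A, LDA, IPIV, B,
LDB, INFO ), the corresponding ScaLAPACK driver takes global index arguments and integer
descriptors that encode the block-cyclic layout. The wrapper below sketches the distributed call;
initialization of the BLACS process grid and of the descriptors DESCA and DESCB is assumed to
be done by the caller and is omitted here.

      SUBROUTINE PSOLVE( N, NRHS, A, IA, JA, DESCA, IPIV,
     $                   B, IB, JB, DESCB, INFO )
*     Sketch of the ScaLAPACK calling sequence for solving A*X = B.
*     Compared with the LAPACK driver DGESV, the extra arguments
*     IA, JA and IB, JB locate the global submatrices, and the
*     integer arrays DESCA and DESCB describe how A and B are
*     distributed over the process grid.  BLACS grid and descriptor
*     setup is assumed to be done by the caller.
      INTEGER            N, NRHS, IA, JA, IB, JB, INFO
      INTEGER            DESCA( * ), DESCB( * ), IPIV( * )
      DOUBLE PRECISION   A( * ), B( * )
      EXTERNAL           PDGESV
      CALL PDGESV( N, NRHS, A, IA, JA, DESCA, IPIV, B, IB, JB,
     $             DESCB, INFO )
      RETURN
      END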
2.2 Target Architectures
The EISPACK and LINPACK software libraries were designed for supercomputers used in the
1970s and early 1980s, such as the CDC-7600, Cyber 205, and Cray-1. These machines featured
multiple functional units pipelined for good performance [26]. The CDC-7600 was basically a high-
performance scalar computer, while the Cyber 205 and Cray-1 were early vector computers.
The development of LAPACK in the late 1980s was intended to make the EISPACK and LIN-
PACK libraries run efficiently on shared-memory vector supercomputers and RISC workstations.
The ScaLAPACK software library will extend the use of LAPACK to distributed memory concurrent
supercomputers.
The underlying concept of both the LAPACK and ScaLAPACK libraries is the use of block-
partitioned algorithms to minimize data movement between different levels in hierarchical memory.
Thus, the ideas for developing a library for dense linear algebra computations are applicable to any
computer with a hierarchical memory that (1) imposes a sufficiently large startup cost on the move-
ment of data between different levels in the hierarchy, and for which (2) the cost of a context switch
is too great to make fine grain size multithreading worthwhile. Our target machines are, therefore,
medium and large grain size advanced-architecture computers. These include "traditional" shared
memory, vector supercomputers, such as the Cray Y-MP and C90, and MIMD distributed mem-
ory concurrent supercomputers, such as the Intel Paragon, the IBM SP2 and Cray T3D concurrent
systems.
Future advances in compiler and hardware technologies in the mid to late 1990s are expected
to make multithreading a viable approach for masking communication costs. Since the blocks in a
block-partitioned algorithm can be regarded as separate threads, our approach will still be applicable
on machines that exploit medium and coarse grain size multithreading.
2.3 The BLAS as the Key to Portability
At least three factors affect the performance of portable Fortran code.
1. Vectorization. Designing vectorizable algorithms in linear algebra is usually straightforward.
Indeed, for many computations there are several variants, all vectorizable, but with different
characteristics in performance (see, for example, [15]; two such variants are sketched after this
list). Linear algebra algorithms can approach the peak performance of many machines, principally
because peak performance depends on some form of chaining of vector addition and multiplication
operations, and this is just what the algorithms require. However, when the algorithms are realized
in straightforward Fortran 77 code, the performance may fall well short of the expected level,
usually because vectorizing Fortran compilers fail to minimize the number of memory references,
that is, the number of vector load and store operations.
2. Data movement. What often limits the actual performance of a vector, or scalar, floating-
point unit is the rate of transfer of data between different levels of memory in the machine.
Examples include the transfer of vector operands in and out of vector registers, the transfer
of scalar operands in and out of a high-speed scalar processor, the movement of data between
main memory and a high-speed cache or local memory, paging between actual memory and disk
storage in a virtual memory system, and interprocessor communication on a distributed memory
concurrent computer.
3. Parallelism. The nested loop structure of most linear algebra algorithms offers considerable
scope for loop-based parallelism. This is the principal type of parallelism that LAPACK and
ScaLAPACK presently aim to exploit. On shared memory concurrent computers, this type of
parallelism can sometimes be generated automatically by a compiler, but often requires the
insertion of compiler directives. On distributed memory concurrent computers, data must be
moved between processors. This is usually done by explicit calls to message passing routines,
although parallel language extensions such as Coherent Parallel C [22] and Split-C [10] do the
message passing implicitly.
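As an illustration of the variants mentioned under item 1, here are two mathematically equivalent
formulations of y := Ax + y, a sketch in the spirit of the comparisons in [15]. Both inner loops
vectorize, but the column-sweep form accesses A down columns while the dot-product form crosses
rows with stride LDA, so their performance can differ markedly from machine to machine.

      SUBROUTINE MVCOL( M, N, A, LDA, X, Y )
*     Column-sweep (axpy) variant of y := A*x + y.  The inner loop
*     runs down a column of A, giving unit-stride memory access.
      INTEGER            M, N, LDA
      DOUBLE PRECISION   A( LDA, * ), X( * ), Y( * )
      INTEGER            I, J
      DO 20 J = 1, N
         DO 10 I = 1, M
            Y( I ) = Y( I ) + X( J )*A( I, J )
   10    CONTINUE
   20 CONTINUE
      RETURN
      END

      SUBROUTINE MVROW( M, N, A, LDA, X, Y )
*     Dot-product variant of the same computation.  The inner loop
*     crosses a row of A, so memory references are LDA apart.
      INTEGER            M, N, LDA
      DOUBLE PRECISION   A( LDA, * ), X( * ), Y( * )
      INTEGER            I, J
      DO 40 I = 1, M
         DO 30 J = 1, N
            Y( I ) = Y( I ) + A( I, J )*X( J )
   30    CONTINUE
   40 CONTINUE
      RETURN
      END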
The question arises, "How can we achieve sufficient control over these three factors to obtain the
levels of performance that machines can offer?" The answer is through use of the BLAS.
There are now three levels of BLAS:
Level 1 BLAS [28]: for vector operations, such as y ← αx + y;
Level 2 BLAS [19]: for matrix-vector operations, such as y ← αAx + βy;
Level 3 BLAS [18]: for matrix-matrix operations, such as C ← αAB + βC.
Here, A, B and C are matrices, x and y are vectors, and α and β are scalars.
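In the standard double precision interfaces, these three operations correspond to the routines
DAXPY, DGEMV and DGEMM. The routine below is a minimal sketch showing one representative
call at each level, assuming square matrices of order N and unit vector strides.

      SUBROUTINE BLASEX( N, ALPHA, BETA, A, B, C, X, Y, LDA )
*     Minimal sketch: one representative call at each BLAS level,
*     assuming square matrices of order N and unit vector strides.
      INTEGER            N, LDA
      DOUBLE PRECISION   ALPHA, BETA
      DOUBLE PRECISION   A( LDA, * ), B( LDA, * ), C( LDA, * )
      DOUBLE PRECISION   X( * ), Y( * )
      EXTERNAL           DAXPY, DGEMV, DGEMM
*     Level 1:  y := alpha*x + y
      CALL DAXPY( N, ALPHA, X, 1, Y, 1 )
*     Level 2:  y := alpha*A*x + beta*y
      CALL DGEMV( 'No transpose', N, N, ALPHA, A, LDA, X, 1,
     $            BETA, Y, 1 )
*     Level 3:  C := alpha*A*B + beta*C
      CALL DGEMM( 'No transpose', 'No transpose', N, N, N, ALPHA,
     $            A, LDA, B, LDA, BETA, C, LDA )
      RETURN
      END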
The Level 1 BLAS are used in LAPACK, but for convenience rather than for performance: they
perform an insignificant fraction of the computation, and they cannot achieve high efficiency on
most modern supercomputers.
The Level 2 BLAS can achieve near-peak performance on many vector processors, such as a single
processor of a CRAY X-MP or Y-MP, or Convex C-2 machine. However, on other vector processors
such as a CRAY-2 or an IBM 3090 VF, the performance of the Level 2 BLAS is limited by the rate
of data movement between different levels of memory.
The Level 3 BLAS overcome this limitation. This third level of BLAS performs O(n³) floating-
point operations on O(n²) data, whereas the Level 2 BLAS perform only O(n²) operations on
O(n²) data. The Level 3 BLAS also allow us to exploit parallelism in a way that is transparent to
the software that calls them. While the Level 2 BLAS offer some scope for exploiting parallelism,
greater scope is provided by the Level 3 BLAS, as Table 1 illustrates.
Table 1. Speed in Mflop/s of Level 2 and Level 3 BLAS operations on a CRAY C90
(all matrices are of order 1000; U is upper triangular)

  Number of processors:         1     2     4     8     16
  Level 2: y ← αAx + βy       899  1780  3491  6783  11207
  Level 3: C ← αAB + βC       900  1800  3600  7199  14282
  Level 2: x ← Ux             852  1620  3063  5554   6953
  Level 3: B ← UB             900  1800  3574  7147  13281