The Sp ectral Decomp osition of Nonsymmetric Matrices on
Distributed Memory Parallel Computers
y z x { k
Z. Bai J. Demmel J. Dongarra A. Petitet H. Robinson K. Stanley
Octob er 9, 1995
Abstract
The implementation and p erformance of a class of divide-and-conquer algorithms for
computing the sp ectral decomp osition of nonsymmetric matrices on distributed memory
parallel computers are studied in this pap er. After presenting a general framework,
we fo cus on a sp ectral divide-and-conquer SDC algorithm with Newton iteration.
Although the algorithm requires several times as many oating p oint op erations as
the b est serial QR algorithm, it can b e simply constructed from a small set of highly
parallelizable matrix building blo cks within Level 3 BLAS. Ecient implementations of
these building blo cks are available on a wide range of machines. In some ill-conditioned
cases, the algorithm may lose numerical stability, but this can easily b e detected and
comp ensated for.
The algorithm reached 31 eciency with resp ect to the underlying PUMMA ma-
trix multiplication and 82 eciency with resp ect to the underlying ScaLAPACK ma-
trix inversion on a 256 pro cessor Intel Touchstone Delta system, and 41 eciency
with resp ect to the matrix multiplication in CMSSL on a 32 no de Thinking Machines
CM-5 with vector units. Our p erformance mo del predicts the p erformance reasonably
accurately.
To take advantage of the geometric nature of SDC algorithms, wehave designed
a graphical user interface to let user cho ose the sp ectral decomp osition according to
sp eci ed regions in the complex plane.
1 Intro duction
A standard technique in parallel computing is to build algorithms from existing high p erfor-
mance building blo cks. For example, LAPACK [1] is built on Basic Linear Algebra Subrou-
tines BLAS, for which ecient implementations are available on manyworkstations, vector
pro cessors, and shared memory parallel machines. The recently released ScaLAPACK 1.0
Department of Mathematics, University of Kentucky, Lexington, KY 40506.
y
Computer Science Division and Mathematics Department, University of California, Berkeley, CA 94720.
z
Department of Computer Science, UniversityofTennessee, Knoxville, TN 37996 and Mathematical
Sciences Section, Oak Ridge National Lab oratory, Oak Ridge, TN 37831.
x
Department of Computer Science, UniversityofTennessee, Knoxville, TN 37996.
{
Department of Mathematics, University of California, Berkeley, CA 94720.
k
Computer Science Division, University of California, Berkeley, CA 94720. 1
2
linear algebra library [11] is written in terms of the Parallel Blo ck BLAS PB-BLAS [12],
Basic Linear Algebra Communication Subroutines BLACS [19], BLAS and LAPACK.
ScaLAPACK includes routines for LU, QR and Cholesky factorizations, and matrix inver-
sion, and has b een p orted to the Intel Gamma, Delta and Paragon, Thinking Machines
CM-5, and PVM clusters. The Connection Machine Scienti c Software Library CMSSL
provides analogous functionality and high p erformance for the CM-5.
In this pap er, we use these high p erformance kernels to implement a sp ectral divide-and-
conquer SDC algorithm for nding eigenvalues and invariant subspaces of nonsymmetric
matrices on distributed memory parallel computers. The algorithm recursively divides the
matrix into smaller submatrices, each of which has a subset of the original eigenvalues as
its own [5,28,3]. On a 256 pro cessor Intel Touchstone Delta system, the SDC algorithm
reached 31 eciency with resp ect to the underlying matrix multiplication PUMMA [13]
for matrices of order 4000. and 82 eciency with resp ect to the underlying ScaLAPACK
1.0 matrix inversion. On a 32 pro cessor Thinking Machines CM-5 with vector units, a
varition of the SDC algorithm obtained 41 eciency with resp ect to matrix multiplication
from CMSSL 3.2 for matrices of order 2048.
The nonsymmetric sp ectral decomp osition problem has until recently resisted attempts
at parallelization [16]. The conventional serial metho d is to use the QR algorithm [1]. The
algorithm had app eared to required ne grain parallelism and b e dicult to parallelize. But
recently Henry and van de Geijn [21] have shown that the Hessenb erg QR iteration phase
can b e e ectively parallelized for distributed memory parallel computers with up to 100
pro cessors. Although it do es not app ear to b e as scalable as the algorithm presented in this
pap er, it may b e faster on a wide range of distributed memory parallel computers. The SDC
algorithm p erforms several times as many oating p oint op erations as the QR algorithm,
but they are nearly all within Level 3 BLAS, whereas implementations of the QR algorithm
p erforming the fewest oating p oint op erations use less ecient Level 1 and 2 BLAS. A
thorough comparison of these algorithms will b e the sub ject of a future pap er. We also
note that the algorithm discussed in this pap er may b e less stable than the QR algorithm
in a numb er of circumstances. Fortunately, it is easy to detect and comp ensate for this loss
of stability. Compared with other approaches, we b elieve that the new algorithm o ers an
e ective tradeo b etween parallelizabil ity and stability.
Other parallel algorithms for the eigenproblem include Hessenb erg divide-and-conquer
using either Newton's metho d [18] or homotopies [26], and Jacobi's metho d [33,32]. All
these metho ds su er from the use of ne-grain parallelism, instability, slow or misconver-
gence in the presence of clustered eigenvalues of the original problem or some constructed
subproblems [16]. The other algorithms most closely related to the approach used here
may b e found in [2, 6, 24], where symmetric matrices, or more generally matrices with real
sp ectra, are treated.
One of the notable features of the SDC algorithm is that it can calculate just those
eigenvalues and the corresp onding invariant subspace in a user-sp eci ed region of the
complex plane. To help the user sp ecify this region, we develop ed a graphical user interface
for the algorithm.
The rest of this pap er is organized as follows. In x2, we rst present a general framework
of SDC algorithms, and then fo cus on an SDC algorithm with Newton iteration. We show
how to divide the sp ectrum along arbitrary circles and lines in the complex plane. The
3
implementation and p erformance on Intel Delta and CM-5 are presented in x3. x4 presents
a mo del for p erformance analysis, and demonstrates that it can predict the execution time
reasonably accurately. x5 describ es the design of an X-window user interface. Concluding
remarks are given in x6.
2 Sp ectral Divide and Conquer Algorithms
2.1 General Framework
A general framework of SDC algorithms can b e describ ed as the following. Let
!
J 0
+