A COMPARISON OF PARALLEL SOLVERS FOR DIAGONALLY
DOMINANT AND GENERAL NARROW-BANDED LINEAR
SYSTEMS

PETER ARBENZ†, ANDREW CLEARY‡, JACK DONGARRA§, AND MARKUS HEGLAND¶
Abstract. We investigate and compare stable parallel algorithms for solving diagonally dominant and general narrow-banded linear systems of equations. Narrow-banded means that the bandwidth is very small compared with the matrix order, typically between 1 and 100. The solvers compared are the banded system solvers of ScaLAPACK [11] and those investigated by Arbenz and Hegland [3, 6]. For the diagonally dominant case, the algorithms are analogs of the well-known tridiagonal cyclic reduction algorithm, while the inspiration for the general case is the lesser-known bidiagonal cyclic reduction, which allows a clean parallel implementation of partial pivoting. These divide-and-conquer type algorithms complement fine-grained algorithms which perform well only for wide-banded matrices, with each family of algorithms having a range of problem sizes for which it is superior. We present theoretical analyses as well as numerical experiments conducted on the Intel Paragon.
Key words. narrow-banded linear systems, stable factorization, parallel solution, cyclic reduction, ScaLAPACK
1. Introduction. In this paper we compare implementations of direct parallel methods for solving banded systems of linear equations

$$Ax = b. \qquad (1.1)$$

The $n \times n$ matrix $A$ is assumed to have lower half-bandwidth $k_l$ and upper half-bandwidth $k_u$, meaning that $k_l$ and $k_u$ are the smallest integers that imply

$$a_{ij} \neq 0 \;\Longrightarrow\; -k_l \le j - i \le k_u.$$
We assume that the matrix $A$ has a narrow band, such that $k_l + k_u \ll n$. Linear systems with wide band can be solved efficiently by methods similar to full system solvers. In particular, parallel algorithms using two-dimensional mappings such as the torus-wrap mapping and Gaussian elimination with partial pivoting have achieved reasonable success [16, 10, 18]. The parallelism of these algorithms is the same as that of dense matrix algorithms applied to matrices of size $\min\{k_l, k_u\}$, independent of $n$, from which it is obvious that small bandwidths severely limit the usefulness of these algorithms, even for large $n$.
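As a concrete illustration (not part of the original paper), the half-bandwidths $k_l$ and $k_u$ can be read off a matrix numerically; the sketch below assumes NumPy and a small dense test matrix:

```python
import numpy as np

def half_bandwidths(A):
    """Return (k_l, k_u): the smallest integers such that
    a_ij != 0 implies -k_l <= j - i <= k_u."""
    rows, cols = np.nonzero(A)
    k_l = int((rows - cols).max(initial=0))  # deepest nonzero subdiagonal
    k_u = int((cols - rows).max(initial=0))  # highest nonzero superdiagonal
    return k_l, k_u

# A 5x5 tridiagonal matrix has k_l = k_u = 1.
A = np.diag(np.full(5, 4.0)) + np.diag(np.ones(4), 1) + np.diag(np.ones(4), -1)
print(half_bandwidths(A))  # (1, 1)
```

For a narrow-banded system in the sense of the paper, both values returned here would be tiny compared with the matrix order.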
Parallel algorithms for the solution of banded linear systems with small bandwidth have been considered by many authors, both because they serve as a canonical form of recursive equations, as well as having direct applications. The latter include the solution of eigenvalue problems with inverse iteration [17], spline interpolation and smoothing [9], and the solution of boundary value problems for ordinary differential equations using finite difference or finite element methods [27]. For these
† Institute of Scientific Computing, Swiss Federal Institute of Technology (ETH), 8092 Zurich, Switzerland ([email protected])
‡ Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, P.O. Box 808, L-561, Livermore CA 94551, U.S.A. ([email protected])
§ Department of Computer Science, University of Tennessee, Knoxville TN 37996-1301, U.S.A.
¶ Computer Sciences Laboratory, RSISE, Australian National University, Canberra ACT 0200, Australia ([email protected])
one-dimensional applications, bandwidths typically vary between 2 and 30. The discretisation of partial differential equations leads to applications with slightly larger bandwidths, for example, the computation of fluid flow in a long narrow pipe. In this case, the number of grid points orthogonal to the flow direction is much smaller than the number of grid points along the flow, and this results in a matrix with bandwidth relatively small compared to the total size of the problem. There is a tradeoff for these types of problems between band solvers and general sparse techniques: the band solver assumes that all of the entries within the band are nonzero, which they are not, and thus performs unnecessary computation, but its data structures are much simpler and there is no indirect addressing as in general sparse methods.
In section 2 we review an algorithm for the class of nonsymmetric narrow-banded matrices that can be factored stably without pivoting, such as diagonally dominant matrices or M-matrices. This algorithm has been discussed in detail in [3, 11], where the performance of implementations on distributed memory multicomputers like the Intel Paragon [3] or the IBM SP/2 [11] is analyzed as well. Johnsson [23] considered the same algorithm and its implementation on the Thinking Machines CM-2, which required a different model for the complexity of the interprocessor communication. Related algorithms have been presented in [26, 15, 14, 7, 12, 28] for shared memory multiprocessors with a small number of processors. The algorithm that we consider here can be interpreted as a generalization of cyclic reduction (CR) or, more usefully, as Gaussian elimination applied to a symmetrically permuted system of equations $(PAP^T)(Px) = Pb$. The latter interpretation has important consequences; for example, it implies that the algorithm is backward stable [5]. It can also be used to show that the permutation necessarily causes Gaussian elimination to generate fill-in, which in turn increases the computational complexity as well as the memory requirements of the algorithm.
In section 3 we consider an algorithm for solving (1.1) for arbitrary narrow-banded matrices $A$ that may require pivoting for stability reasons. This algorithm was proposed and thoroughly discussed in [6]. It can be interpreted as a generalization of the well-known block tridiagonal cyclic reduction to block bidiagonal matrices, and again, it is also equivalent to Gaussian elimination applied to a permuted (nonsymmetrically, for the general case) system of equations $(PAQ^T)(Qx) = Pb$. Block bidiagonal cyclic reduction for the solution of banded linear systems was introduced by Hegland [19].
In section 4 we compare the ScaLAPACK implementations [11] of the two algorithms above with the implementations by Arbenz [3] and Arbenz and Hegland [6], respectively, by means of numerical experiments conducted on the Intel Paragon. ScaLAPACK is a software package with a diverse user community. Each subroutine should have an easily intelligible calling sequence (interface) and work with easily manageable data distributions. These constraints may reduce the performance of the code. The other two codes are experimental. They have been optimized for low communication overhead. The number of messages sent among processors and the marshaling process has been minimized for the task of solving a single system of equations. The code does not, for instance, split the LU factorization from the forward elimination, which precludes solving a sequence of systems of equations with the same system matrix without factoring that matrix over and over again. Our comparisons shall answer the question of how much performance the ScaLAPACK algorithms may have lost through the constraint of being user friendly. We further continue a discussion started in [5] on the overhead introduced by partial pivoting. Is
it necessary to have a pivoting as well as a non-pivoting algorithm for nonsymmetric band systems in ScaLAPACK? In LAPACK [2], for instance, there are only pivoting subroutines for solving dense and banded systems of equations, respectively.
2. Parallel Gaussian elimination for the diagonally dominant case. In this section we assume that the matrix $A = [a_{ij}]_{i,j=1,\dots,n}$ in (1.1) is diagonally dominant, i.e., that

$$|a_{ii}| > \sum_{\substack{j=1 \\ j \neq i}}^{n} |a_{ij}|, \qquad i = 1, \dots, n.$$

Then the system of equations can be solved by Gaussian elimination without pivoting in the following three steps:
1. Factorization into $A = LU$.
2. Solution of $Lz = b$ (forward elimination).
3. Solution of $Ux = z$ (backward substitution).
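The three steps can be sketched in a few lines. The following is an illustrative dense implementation (the solvers discussed in the paper of course operate on banded storage, touching only entries within the band); pivoting is safely omitted because the example matrix is diagonally dominant:

```python
import numpy as np

def lu_nopivot(A):
    """LU factorization without pivoting -- stable when A is diagonally dominant."""
    n = A.shape[0]
    L, U = np.eye(n), A.astype(float)
    for j in range(n - 1):
        L[j+1:, j] = U[j+1:, j] / U[j, j]              # multipliers for column j
        U[j+1:, j:] -= np.outer(L[j+1:, j], U[j, j:])  # eliminate below the diagonal
    return L, U

# Steps 1-3 on a small diagonally dominant (tridiagonal) example.
A = np.array([[4., 1., 0.],
              [1., 4., 1.],
              [0., 1., 4.]])
b = np.array([1., 2., 3.])
L, U = lu_nopivot(A)          # 1. A = LU
z = np.linalg.solve(L, b)     # 2. forward elimination, Lz = b
x = np.linalg.solve(U, z)     # 3. backward substitution, Ux = z
print(np.allclose(A @ x, b))  # True
```

In a banded implementation each elimination step updates only the $k_l \times (k_u + 1)$ window below and to the right of the pivot, which is what yields the linear-in-$n$ cost discussed next.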
The lower and upper triangular Gauss factors $L$ and $U$ are banded with bandwidths $k_l$ and $k_u$, respectively, where $k_l$ and $k_u$ are the half-bandwidths of $A$. The number of floating point operations $\varphi_n$ for solving the banded system (1.1) with $r$ right-hand sides by Gaussian elimination is (see also, e.g., [17])

$$\varphi_n = 2(k_u + 1)k_l n + (2k_l + 2k_u + 1)rn + O(k^2 + rk), \qquad k := \max\{k_l, k_u\}. \qquad (2.1)$$
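As a rough, back-of-the-envelope illustration of the linear growth in $n$ (this helper is not from the paper), the leading terms of (2.1) can be evaluated directly:

```python
def banded_flops(n, k_l, k_u, r):
    """Leading terms of the flop count (2.1) for banded Gaussian elimination
    with r right-hand sides; the O(k^2 + rk) remainder is dropped."""
    factor = 2 * (k_u + 1) * k_l * n       # LU factorization
    solve = (2 * k_l + 2 * k_u + 1) * r * n  # forward elimination + back substitution
    return factor + solve

# For a narrow band (k_l = k_u = 5) the work is a small constant times n.
print(banded_flops(10_000, 5, 5, 1))  # 810000
```

Doubling $n$ doubles the work, in contrast to the cubic growth of dense solvers.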
For solving (1.1) in parallel on a $p$-processor multicomputer we partition the matrix $A$, the solution vector $x$, and the right-hand side $b$ according to

$$
\begin{pmatrix}
A_1 & B_1^U & & & & \\
B_1^L & C_1 & D_2^U & & & \\
& D_2^L & A_2 & B_2^U & & \\
& & B_2^L & C_2 & \ddots & \\
& & & \ddots & \ddots & D_p^U \\
& & & & D_p^L & A_p
\end{pmatrix}
\begin{pmatrix} x_1 \\ \xi_1 \\ x_2 \\ \vdots \\ \xi_{p-1} \\ x_p \end{pmatrix}
=
\begin{pmatrix} b_1 \\ \beta_1 \\ b_2 \\ \vdots \\ \beta_{p-1} \\ b_p \end{pmatrix},
\qquad (2.2)
$$

where $A_i \in \mathbb{R}^{n_i \times n_i}$, $C_i \in \mathbb{R}^{k \times k}$, $x_i, b_i \in \mathbb{R}^{n_i}$, $\xi_i, \beta_i \in \mathbb{R}^{k}$, and $\sum_{i=1}^{p} n_i + (p-1)k = n$.
This block tridiagonal partition is feasible only if $n_i > k$. This condition restricts the degree of parallelism, i.e., the maximal number of processors $p$ that can be exploited for parallel execution, to $p < (n+k)/(2k)$. The structure of $A$ and its submatrices is depicted in Fig. 2.1(a) for the case $p = 4$. In ScaLAPACK [11, 8], the local portions of $A$ on each processor are stored in the LAPACK scheme as depicted in Fig. 2.2. This input scheme requires a preliminary step of moving the triangular block $D_i^L$ from processor $i-1$ to processor $i$. This transfer of the block can be overlapped with computation and has a negligible effect on the overall performance of the algorithm [11]. The input format used by Arbenz does not require this initial data movement.
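The partition and its feasibility condition can be sketched as follows; `partition_sizes` is a hypothetical helper (not from the paper or ScaLAPACK) that splits the $n$ unknowns into $p$ blocks of size $n_i$ separated by $p-1$ separator blocks of size $k$:

```python
def partition_sizes(n, k, p):
    """Block sizes n_1, ..., n_p with sum(n_i) + (p - 1) * k == n.
    Feasible only if every n_i > k, i.e. p < (n + k) / (2 * k)."""
    total = n - (p - 1) * k  # rows left for the diagonal blocks A_i
    base, extra = divmod(total, p)
    sizes = [base + (1 if i < extra else 0) for i in range(p)]
    if min(sizes) <= k:
        raise ValueError("degree of parallelism exceeded: need p < (n + k) / (2k)")
    return sizes

# n = 1000 unknowns, half-bandwidth k = 10, p = 4 processors:
print(partition_sizes(1000, 10, 4))  # [243, 243, 242, 242]
```

For $n = 1000$ and $k = 10$ the bound $p < (n+k)/(2k)$ permits at most 50 processors; asking for more raises the error above.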
Fig. 2.1. Non-zero structure (shaded area) of (a) the original and (b) the block odd-even permuted band matrix with $k_u > k_l$.

Fig. 2.2. Storage scheme of the band matrix. The thick lines frame the local portions of $A$.
We now execute the first step of block-cyclic reduction [22]. This is best regarded as block Gaussian elimination of the block odd-even permuted $A$,

$$
\left[\begin{array}{ccccc|cccc}
A_1 & & & & & B_1^U & & & \\
& A_2 & & & & D_2^L & B_2^U & & \\
& & \ddots & & & & \ddots & \ddots & \\
& & & A_{p-1} & & & & D_{p-1}^L & B_{p-1}^U \\
& & & & A_p & & & & D_p^L \\ \hline
B_1^L & D_2^U & & & & C_1 & & & \\
& B_2^L & D_3^U & & & & C_2 & & \\
& & \ddots & \ddots & & & & \ddots & \\
& & & B_{p-1}^L & D_p^U & & & & C_{p-1}
\end{array}\right]
\left[\begin{array}{c} x_1 \\ x_2 \\ \vdots \\ x_p \\ \hline \xi_1 \\ \xi_2 \\ \vdots \\ \xi_{p-1} \end{array}\right]
=
\left[\begin{array}{c} b_1 \\ b_2 \\ \vdots \\ b_p \\ \hline \beta_1 \\ \beta_2 \\ \vdots \\ \beta_{p-1} \end{array}\right].
\qquad (2.3)
$$
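The block odd-even permutation $P$ can be illustrated as an index vector; the sketch below (with a hypothetical `odd_even_permutation` helper, not from the paper) gathers the $x$-blocks first and the separator blocks $\xi_i$ last, so that the permuted matrix is obtained as `A[np.ix_(perm, perm)]`:

```python
import numpy as np

def odd_even_permutation(sizes, k):
    """Indices reordering (x_1, xi_1, x_2, ..., xi_{p-1}, x_p)
    into (x_1, ..., x_p, xi_1, ..., xi_{p-1})."""
    big, small, pos = [], [], 0
    for i, n_i in enumerate(sizes):
        big.extend(range(pos, pos + n_i))      # rows of the diagonal block A_i
        pos += n_i
        if i < len(sizes) - 1:
            small.extend(range(pos, pos + k))  # rows of the separator block C_i
            pos += k
    return np.array(big + small)

# Three blocks of size 3 separated by k = 1 rows (indices 3 and 7):
perm = odd_even_permutation([3, 3, 3], k=1)
print(perm)  # x-blocks first, then the separator indices 3 and 7
# Applying P symmetrically to a matrix A of order 11:
# A_perm = A[np.ix_(perm, perm)]
```

Applying this index vector to a banded matrix with the structure of Fig. 2.1(a) produces the arrowhead-like structure of Fig. 2.1(b).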
The structure of this matrix is depicted in Fig. 2.1(b). We write (2.3) in the form

$$
\begin{bmatrix} \hat{A} & B^U \\ B^L & C \end{bmatrix}
\begin{bmatrix} x \\ \xi \end{bmatrix}
=
\begin{bmatrix} b \\ \beta \end{bmatrix},
\qquad (2.4)
$$

where the respective submatrices and subvectors are indicated by the lines in equation (2.3). If $LR = \hat{A}$ is the ordinary LU factorization of $\hat{A}$, then