A COMPARISON OF PARALLEL SOLVERS FOR DIAGONALLY DOMINANT
AND GENERAL NARROW-BANDED LINEAR SYSTEMS

PETER ARBENZ, ANDREW CLEARY, JACK DONGARRA, AND MARKUS HEGLAND

Abstract. We investigate and compare stable parallel algorithms for solving diagonally dominant and general narrow-banded linear systems of equations. Narrow-banded means that the bandwidth is very small compared with the order of the matrix. The solvers compared are the banded system solvers of ScaLAPACK and those investigated by Arbenz and Hegland. For the diagonally dominant case the algorithms are analogs of the well-known tridiagonal cyclic reduction algorithm, while the inspiration for the general case is the lesser-known bidiagonal cyclic reduction, which allows a clean parallel implementation of partial pivoting. These divide-and-conquer type algorithms complement fine-grained algorithms, which perform well only for wide-banded matrices, and each family of algorithms has a range of problem sizes for which it is superior. We present theoretical analyses as well as numerical experiments conducted on the Intel Paragon.

Key words. narrow-banded linear systems, stable factorization, parallel solution, cyclic reduction, ScaLAPACK

1. Introduction. In this paper we compare implementations of direct parallel methods for solving banded systems of linear equations

$$ A x = b. \qquad (1.1) $$

The n-by-n matrix A is assumed to have lower half-bandwidth $k_l$ and upper half-bandwidth $k_u$, meaning that $k_l$ and $k_u$ are the smallest integers such that

$$ a_{ij} = 0 \quad \text{whenever} \quad j < i - k_l \ \text{ or } \ j > i + k_u. $$

We assume that the matrix A has a narrow band, such that $k_l, k_u \ll n$. Linear systems with wide band can be solved efficiently by methods similar to full system solvers. In particular, parallel algorithms using two-dimensional mappings such as the torus-wrap mapping, with partial pivoting, have achieved reasonable success. The parallelism of these algorithms is the same as that of dense matrix algorithms applied to matrices of size $\min\{k_l, k_u\}$, independent of n, from which it is obvious that small bandwidths severely limit the usefulness of these algorithms even for large n.

Parallel algorithms for the solution of banded linear systems with small bandwidth have been considered by many authors, both because such systems serve as a canonical form of recursive equations and because they have direct applications. The latter include the solution of eigenvalue problems by inverse iteration, spline interpolation and smoothing, and the solution of boundary value problems for ordinary differential equations using finite difference or finite element methods. For these one-dimensional applications the bandwidths are typically very small. The discretisation of partial differential equations leads to applications with slightly larger bandwidths, for example the computation of fluid flow in a long, narrow pipe. In this case the number of grid points orthogonal to the flow direction is much smaller than the number of grid points along the flow, and this results in a matrix whose bandwidth is small relative to the total size of the problem. For these types of problems there is a tradeoff between band solvers and general sparse techniques: the band solver assumes that all of the entries within the band are nonzero (which they are not) and thus performs unnecessary computation, but its data structures are much simpler and there is no indirect addressing as in general sparse methods.

In the second half of the 1980s numerous papers dealt with parallel solvers for the class of nonsymmetric narrow-banded matrices that can be factored stably without pivoting, such as diagonally dominant matrices or M-matrices.

y

Institute of Scienti Computing Swiss Federal Institute of Technology ETH Zurich Switzerland

arbenzinfethzch

z

Center for Applied Scientic Computing Lawrence Livermore National Lab oratory PO Box L Liver

more CA USA aclearyllnlgov

x

Department of Computer Science University of Tennessee Knoxville TN USA

dongarracsutkedu

Computer Sciences Lab oratory RSISE Australian National University Canberra ACT Australia

MarkusHeglandanuedu au


Many of these solvers have grown out of their tridiagonal counterparts. Banded system solvers similar to Wang's tridiagonal solver were presented for shared-memory multiprocessors with a small number of processors. The algorithm that we review in section 2 is related to cyclic reduction (CR). This algorithm has been discussed in detail elsewhere, where the performance of implementations on distributed memory multicomputers like the Intel Paragon or the IBM SP is analyzed as well. Johnsson considered the same algorithm and its implementation on the Thinking Machines CM, which required a different model for the complexity of the interprocessor communication. A similar algorithm has been presented by Hajj and Skelboe for the Intel iPSC hypercube. Cyclic reduction can be interpreted as Gaussian elimination applied to a symmetrically permuted system of equations $(PAP^T)(Px) = Pb$. This interpretation has important consequences; for example, it implies that the algorithm is backward stable. It can also be used to show that the permutation necessarily causes Gaussian elimination to generate fill-in, which in turn increases the computational complexity as well as the memory requirements of the algorithm.

In section 3 we consider algorithms for solving (1.1) for arbitrary narrow-banded matrices A that may require pivoting for stability reasons. This algorithm was proposed and thoroughly discussed by Arbenz and Hegland. It can be interpreted as a generalization of the well-known block tridiagonal cyclic reduction to block bidiagonal matrices, and again it is equivalent to Gaussian elimination applied to a permuted (nonsymmetrically, in the general case) system of equations $(PAQ)(Q^T x) = Pb$. Block bidiagonal cyclic reduction for the solution of banded linear systems was introduced by Hegland. Conroy described an algorithm that performs the first step of our pivoting algorithm but then continues by solving the reduced system sequentially. A further stable algorithm, based on full pivoting, was presented by Wright; this algorithm is, however, difficult to implement and prone to load imbalance. Parallel banded system solvers based on the QR factorization have also been discussed; these algorithms require about twice the amount of work of those considered here.

In section 4 we compare the ScaLAPACK implementations of the two algorithms above with the implementations by Arbenz and by Arbenz and Hegland, respectively, by means of numerical experiments conducted on the Intel Paragon. Corresponding results for the IBM SP are presented in the companion paper. ScaLAPACK is a software package with a diverse user community. Each routine should have an easily intelligible calling sequence (interface) and work with easily manageable data distributions. These constraints may reduce the performance of the code. The other two codes are experimental. They have been optimized for low communication overhead: the number of messages sent among processors and the marshaling process have been minimized for the task of solving a single system of equations. The code does, for instance, not split the LU factorization from the forward elimination, which prohibits the solution of a sequence of systems of equations with the same system matrix without factoring the system matrix over and over again. Our comparisons shall answer the question of how much performance the ScaLAPACK algorithms may have lost through the constraint of being user friendly.

We further continue a discussion, started in earlier work, on the overhead introduced by partial pivoting. As the execution time of the pivoting algorithm was only marginally higher than that of the non-pivoting algorithm, the question arose whether it is necessary to have a pivoting as well as a non-pivoting algorithm in ScaLAPACK. In LAPACK, for instance, all dense and banded systems that are not symmetric positive definite are solved by the same pivoting algorithm. However, in contrast to the serial algorithm, in the parallel algorithm the doubling of the memory requirement is unavoidable (see section 3). We settle this issue in the concluding section.

2. Parallel Gaussian elimination for the diagonally dominant case. In this section we assume that the matrix $A = (a_{ij})_{1 \le i,j \le n}$ in (1.1) is diagonally dominant, i.e., that

$$ |a_{ii}| > \sum_{\substack{j=1 \\ j \ne i}}^{n} |a_{ij}|, \qquad i = 1, \ldots, n. $$

Then the system of equations can be solved by Gaussian elimination without pivoting in the following three steps:


1. factorization of A into A = LU,
2. solution of Lz = b (forward elimination),
3. solution of Ux = z (backward substitution).

The lower and upper triangular Gauss factors L and U are banded with bandwidths $k_l$ and $k_u$, respectively, where $k_l$ and $k_u$ are the half-bandwidths of A. The number of floating point operations for solving the banded system with r right-hand sides by Gaussian elimination is approximately

$$ \varphi_n = 2 k_l k_u n + 2 (k_l + k_u) r n, $$

up to lower-order terms that are independent of n and involve only $k_l$, $k_u$, $\max\{k_l, k_u\}$, and r.
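The small sketch below is our own illustration of this serial building block; the routine scipy.linalg.solve_banded and its diagonal-ordered band storage are standard SciPy, while the helper band_to_ab, the test matrix, and all parameter values are assumptions made purely for demonstration.

```python
import numpy as np
from scipy.linalg import solve_banded

def band_to_ab(A, kl, ku):
    """Pack a dense banded matrix into the diagonal-ordered band storage
    expected by scipy.linalg.solve_banded: ab[ku + i - j, j] = A[i, j]."""
    n = A.shape[0]
    ab = np.zeros((kl + ku + 1, n))
    for j in range(n):
        for i in range(max(0, j - ku), min(n, j + kl + 1)):
            ab[ku + i - j, j] = A[i, j]
    return ab

# small diagonally dominant test system: ones inside the band, large diagonal
n, kl, ku, r = 60, 2, 3, 2
A = np.triu(np.tril(np.ones((n, n)), ku), -kl)        # ones within the band
A += (kl + ku + 1) * np.eye(n)                        # enforce diagonal dominance
X = np.arange(1.0, n * r + 1).reshape(n, r)           # known solution
b = A @ X
x = solve_banded((kl, ku), band_to_ab(A, kl, ku), b)  # banded LU + fwd/back solves
print("max error:", np.max(np.abs(x - X)))
print("flop estimate:", 2 * kl * ku * n + 2 * (kl + ku) * r * n)
```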

For solving (1.1) in parallel on a p-processor multicomputer we partition the matrix A, the solution vector x, and the right-hand side b according to

$$
\begin{pmatrix}
A_1   & B_1^U &       &       &        &       \\
B_1^L & C_1   & D_2^U &       &        &       \\
      & D_2^L & A_2   & B_2^U &        &       \\
      &       & B_2^L & C_2   & \ddots &       \\
      &       &       & \ddots& \ddots & D_p^U \\
      &       &       &       & D_p^L  & A_p
\end{pmatrix}
\begin{pmatrix} x_1 \\ \xi_1 \\ x_2 \\ \xi_2 \\ \vdots \\ x_p \end{pmatrix}
=
\begin{pmatrix} b_1 \\ \beta_1 \\ b_2 \\ \beta_2 \\ \vdots \\ b_p \end{pmatrix},
$$

where $A_i \in \mathbb{R}^{n_i \times n_i}$, $C_i \in \mathbb{R}^{k \times k}$ with $k := \max\{k_l, k_u\}$, $x_i, b_i \in \mathbb{R}^{n_i}$, $\xi_i, \beta_i \in \mathbb{R}^{k}$, and $\sum_{i=1}^{p} n_i + (p-1)k = n$. This block tridiagonal partition is feasible only if $n_i \ge k$. This condition restricts the degree of parallelism, i.e., the maximal number of processors p that can be exploited for parallel execution: $p \le (n + k)/(2k)$. The structure of A and its submatrices is depicted in Fig. 2.1(a) for a small value of p.

Fig. 2.1. Nonzero structure (shaded area) of (a) the original and (b) the block odd-even permuted matrix.

In ScaLAPACK the local portions of A on each processor are stored in the LAPACK scheme, as depicted in Fig. 2.2. This input scheme requires a preliminary step of moving the triangular block $D_i^U$ from processor i-1 to processor i. This transfer can be overlapped with computation and has a negligible effect on the overall performance of the algorithm. The input format used by Arbenz does not require this initial data movement.
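A minimal sketch of this block tridiagonal partition on a dense copy of A follows; it is our own illustration (the actual codes work directly on the distributed band storage), and the equal block sizes chosen here are an assumption.

```python
import numpy as np

def partition_banded(A, p, kl, ku):
    """Split A into the block tridiagonal partition of this section:
    diagonal blocks A_i, separators C_i (k-by-k with k = max(kl, ku)),
    and the coupling blocks B_i^U, B_i^L, D_i^U, D_i^L."""
    n = A.shape[0]
    k = max(kl, ku)
    ni = (n - (p - 1) * k) // p                          # nominal block size
    assert ni >= k, "partition infeasible: need n_i >= k"
    sizes = [ni] * (p - 1) + [n - (p - 1) * (ni + k)]    # last block absorbs the rest
    starts = [i * (ni + k) for i in range(p)]            # first index of x_i
    blocks = []
    for i in range(p):
        s, m = starts[i], sizes[i]
        blk = {"A": A[s:s+m, s:s+m]}
        if i < p - 1:                                    # separator between blocks i, i+1
            t = s + m
            blk["BU"] = A[s:s+m, t:t+k]                  # B_i^U
            blk["BL"] = A[t:t+k, s:s+m]                  # B_i^L
            blk["C"]  = A[t:t+k, t:t+k]                  # C_i
        if i > 0:                                        # coupling to separator i-1
            t = s - k
            blk["DU"] = A[t:t+k, s:s+m]                  # D_i^U (the moved block)
            blk["DL"] = A[s:s+m, t:t+k]                  # D_i^L
        blocks.append(blk)
    return blocks
```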


Fig. 2.2. Storage scheme of the band matrix. The thick lines frame the local portions of A.

We now execute the first step of block cyclic reduction. This is best regarded as block Gaussian elimination of the block odd-even permuted system

$$
\begin{pmatrix}
A_1   &       &        &       & B_1^U &        &         \\
      & A_2   &        &       & D_2^L & B_2^U  &         \\
      &       & \ddots &       &       & \ddots & \ddots  \\
      &       &        & A_p   &       &        & D_p^L   \\
B_1^L & D_2^U &        &       & C_1   &        &         \\
      & B_2^L & \ddots &       &       & \ddots &         \\
      &       & \ddots & D_p^U &       &        & C_{p-1}
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_p \\ \xi_1 \\ \vdots \\ \xi_{p-1} \end{pmatrix}
=
\begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_p \\ \beta_1 \\ \vdots \\ \beta_{p-1} \end{pmatrix}.
$$

The structure of this matrix is depicted in Fig. 2.1(b). We write the permuted system in the compact form

$$
\begin{pmatrix} A & B^U \\ B^L & C \end{pmatrix}
\begin{pmatrix} x \\ \xi \end{pmatrix} =
\begin{pmatrix} b \\ \beta \end{pmatrix},
$$

where the respective submatrices and subvectors are indicated by the lines in the display above. If $A = LR$ is the ordinary (block diagonal) LU factorization of A, then

$$
\begin{pmatrix} A & B^U \\ B^L & C \end{pmatrix}
= \begin{pmatrix} L & O \\ B^L R^{-1} & I \end{pmatrix}
  \begin{pmatrix} R & L^{-1} B^U \\ O & S \end{pmatrix},
\qquad S = C - B^L A^{-1} B^U.
$$

Fig. 2.3. Fill-in produced by block Gaussian elimination. The bright-shaded areas indicate original nonzeros; the dark-shaded areas indicate the potential fill-in.

The matrices $L_i^{-1} B_i^U$, $L_i^{-1} D_i^L$, $B_i^L R_i^{-1}$, and $D_i^U R_i^{-1}$ overwrite $B_i^U$, $D_i^L$, $B_i^L$, and $D_i^U$, respectively. As $L_i^{-1} D_i^L$ and $D_i^U R_i^{-1}$ suffer from fill-in (cf. Fig. 2.3), additional memory space for about $(k_l + k_u) n$ floating point numbers has to be provided; the overall memory requirement of the parallel algorithm is about twice as high as that of the sequential algorithm. The blocks $L_i^{-1} B_i^U$ and $B_i^L R_i^{-1}$ keep the structure of $B_i^U$ and $B_i^L$ and can be stored at their original places (cf. Fig. 2.3).

The Schur complement S of A in the permuted matrix is a diagonally dominant (p-1)-by-(p-1) block tridiagonal matrix whose blocks are k-by-k,

$$
S = \begin{pmatrix}
T_1 & U_1    &         &         \\
V_2 & T_2    & U_2     &         \\
    & \ddots & \ddots  & U_{p-2} \\
    &        & V_{p-1} & T_{p-1}
\end{pmatrix},
$$

where

$$
\begin{aligned}
T_i &= C_i - B_i^L A_i^{-1} B_i^U - D_{i+1}^U A_{i+1}^{-1} D_{i+1}^L
     = C_i - (B_i^L R_i^{-1})(L_i^{-1} B_i^U) - (D_{i+1}^U R_{i+1}^{-1})(L_{i+1}^{-1} D_{i+1}^L), \\
U_i &= -D_{i+1}^U A_{i+1}^{-1} B_{i+1}^U = -(D_{i+1}^U R_{i+1}^{-1})(L_{i+1}^{-1} B_{i+1}^U), \\
V_i &= -B_i^L A_i^{-1} D_i^L = -(B_i^L R_i^{-1})(L_i^{-1} D_i^L).
\end{aligned}
$$

As indicated in Fig. 2.3, these blocks are not full if $k_l < k$ or $k_u < k$. This is taken into account in the ScaLAPACK implementation, but not in the implementation by Arbenz, where the block tridiagonal CR solver treats the k-by-k blocks as full blocks. Using the block LU factorization above, one gets

$$
\begin{pmatrix} x \\ \xi \end{pmatrix}
= \begin{pmatrix} R & L^{-1} B^U \\ O & S \end{pmatrix}^{-1}
  \begin{pmatrix} L & O \\ B^L R^{-1} & I \end{pmatrix}^{-1}
  \begin{pmatrix} b \\ \beta \end{pmatrix}
=: \begin{pmatrix} R & L^{-1} B^U \\ O & S \end{pmatrix}^{-1}
   \begin{pmatrix} c \\ \gamma \end{pmatrix},
$$

where the sections $c_i$ and $\gamma_i$ of the vectors c and $\gamma$ are given by

$$
c_i = L_i^{-1} b_i, \qquad
\gamma_i = \beta_i - (B_i^L R_i^{-1}) c_i - (D_{i+1}^U R_{i+1}^{-1}) c_{i+1}.
$$

Up to this point of the algorithm no interprocessor communication is necessary, as each processor i independently factors its diagonal block $A_i = L_i R_i$ and computes the blocks $L_i^{-1} B_i^U$, $B_i^L R_i^{-1}$, $L_i^{-1} D_i^L$, $D_i^U R_i^{-1}$, and $L_i^{-1} b_i$. Each processor then forms its portions of the reduced system $S\xi = \gamma$, i.e., the products $(B_i^L R_i^{-1})(L_i^{-1} B_i^U)$, $(B_i^L R_i^{-1})(L_i^{-1} D_i^L)$, $(D_i^U R_i^{-1})(L_i^{-1} B_i^U)$, $(D_i^U R_i^{-1})(L_i^{-1} D_i^L)$, $(B_i^L R_i^{-1}) c_i$, and $(D_i^U R_i^{-1}) c_i$.
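A dense sketch of this communication-free local phase is given below. It is our own illustration under simplifying assumptions: the blocks are passed as dense arrays, a single LU solve replaces the separate use of the two triangular factors, and the function and key names are hypothetical.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def local_reduction(Ai, BU, BL, C, DU, DL, bi):
    """Communication-free local phase on one 'processor'.  Returns the
    fill-in spikes and this processor's contributions to the reduced
    (Schur complement) system; blocks that do not exist on the first or
    last processor may be passed as None."""
    lu = lu_factor(Ai)                                   # A_i = L_i R_i
    Wb = lu_solve(lu, BU) if BU is not None else None    # A_i^{-1} B_i^U
    Wd = lu_solve(lu, DL) if DL is not None else None    # A_i^{-1} D_i^L
    ci = lu_solve(lu, bi)                                # A_i^{-1} b_i
    out = {"spike_BU": Wb, "spike_DL": Wd, "c": ci}
    if BL is not None:                        # processor i also owns B_i^L, C_i
        out["T_own"] = C - BL @ Wb            # first part of T_i
        out["V"] = -BL @ Wd if Wd is not None else None   # V_i
        out["gamma_own"] = -BL @ ci           # part of gamma_i (beta_i added later)
    if DU is not None:                        # contributions destined for row i-1
        out["T_up"] = -DU @ Wd                # second part of T_{i-1}
        out["U_up"] = -DU @ Wb if Wb is not None else None  # U_{i-1}
        out["gamma_up"] = -DU @ ci            # second part of gamma_{i-1}
    return out
```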

Standard theory in Gaussian elimination shows that the reduced system is diagonally dominant One

option is to solve the reduced system on a single pro cessor This may b e reasonable on shared memory

multiprocessors with small pro cessor numbers p but complexity analysis reveals that this

quickly dominates total computation time on multicomputers with many pro cessors An attractive

parallel alternative for solving the system S  is blo ck cyclic reduction Implementationally

the reduction step describ ed ab ove is rep eated until the reduced system b ecomes a dense k byk

system which is trivially solved on a single pro cessor Since the order of the remaining system is

halved in each reduction step blog p c steps are needed Note that the degree of parallelism is

also halved in each step
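For concreteness, here is a small serial sketch of block cyclic reduction for a block tridiagonal system such as $S\xi = \gamma$. It is written for exposition only (dense blocks, recursive formulation, no distribution over processors) and is not the ScaLAPACK or Arbenz-Hegland code.

```python
import numpy as np

def block_cyclic_reduction(D, L, U, d):
    """Solve a block tridiagonal system by cyclic reduction.
    D[i]: diagonal blocks, L[i]: block left of D[i] (used for i >= 1),
    U[i]: block right of D[i] (used for i <= m-2), d[i]: rhs blocks."""
    m = len(D)
    if m == 1:
        return [np.linalg.solve(D[0], d[0])]
    De, Le, Ue, de = [], [], [], []
    for i in range(0, m, 2):                 # even-indexed block rows survive
        Dn, dn = D[i].copy(), d[i].copy()
        Ln, Un = np.zeros_like(D[i]), np.zeros_like(D[i])
        if i - 1 >= 0:                       # eliminate x_{i-1} using row i-1
            F = L[i] @ np.linalg.inv(D[i-1])
            Dn -= F @ U[i-1]; dn -= F @ d[i-1]
            if i - 2 >= 0:
                Ln = -F @ L[i-1]
        if i + 1 <= m - 1:                   # eliminate x_{i+1} using row i+1
            G = U[i] @ np.linalg.inv(D[i+1])
            Dn -= G @ L[i+1]; dn -= G @ d[i+1]
            if i + 2 <= m - 1:
                Un = -G @ U[i+1]
        De.append(Dn); Le.append(Ln); Ue.append(Un); de.append(dn)
    xe = block_cyclic_reduction(De, Le, Ue, de)   # system of half the size
    x = [None] * m
    for j, i in enumerate(range(0, m, 2)):
        x[i] = xe[j]
    for i in range(1, m, 2):                 # back substitute the odd rows
        r = d[i] - L[i] @ x[i-1]
        if i + 1 <= m - 1:
            r -= U[i] @ x[i+1]
        x[i] = np.linalg.solve(D[i], r)
    return x
```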

In order to understand how we proceed with CR, we take another look at how the partitioned matrix is distributed over the p processors. Processor i holds the 2-by-2 diagonal block

$$ \begin{pmatrix} A_i & B_i^U \\ B_i^L & C_i \end{pmatrix} $$

together with the block $D_i^U$ above it and the block $D_i^L$ to the left of it. To obtain a similar situation with the reduced system, we want consecutive pairs of block rows of S, i.e., the 2-by-2 diagonal blocks

$$ \begin{pmatrix} T_{i-1} & U_{i-1} \\ V_i & T_i \end{pmatrix} $$

of S, together with the block U above and the block V to the left of them, to reside on every second processor, which then allows us to proceed with this reduced number of processors as before. To that end, each odd-numbered processor has to send some of its portion of S to its neighboring processors. The even-numbered processors then continue to compute the even-numbered portions of $\xi$. Having done so, the odd-numbered processors receive $\xi_{i-1}$ and $\xi_{i+1}$ from their neighboring processors, which allows them to compute their own portion $\xi_i$ of $\xi$, provided they know the i-th block row of S. This is easily attained if, at the beginning of this CR step, not only the odd-numbered but all processors send the needed information to their neighbors.

Finally, if the vectors $\xi_i$, $i = 1, \ldots, p-1$, are known, each processor can compute its section of x:

$$
\begin{aligned}
x_1 &= R_1^{-1}\bigl(c_1 - L_1^{-1} B_1^U \xi_1\bigr), \\
x_i &= R_i^{-1}\bigl(c_i - L_i^{-1} D_i^L \xi_{i-1} - L_i^{-1} B_i^U \xi_i\bigr), \qquad i = 2, \ldots, p-1, \\
x_p &= R_p^{-1}\bigl(c_p - L_p^{-1} D_p^L \xi_{p-1}\bigr).
\end{aligned}
$$

Notice that the even-numbered processors first have to receive the required $\xi_i$ from their direct neighbors. In the back substitution phase each processor can then again proceed independently, without interprocessor communication.
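The sketch below expresses this final step with the quantities computed in the local-reduction sketch above (fully solved spikes and $c_i = A_i^{-1} b_i$ instead of the two triangular factors); it is an illustration of ours, not the implemented kernels.

```python
def back_substitute(c_i, spike_DL, spike_BU, xi_left, xi_right):
    """x_i = c_i - (A_i^{-1} D_i^L) xi_{i-1} - (A_i^{-1} B_i^U) xi_i,
    with blocks missing on the first/last processor passed as None."""
    x_i = c_i.copy()
    if spike_DL is not None and xi_left is not None:
        x_i -= spike_DL @ xi_left
    if spike_BU is not None and xi_right is not None:
        x_i -= spike_BU @ xi_right
    return x_i
```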

The parallel complexity of the above divide-and-conquer algorithm, as implemented by Arbenz and Hegland, has the form

$$
T^{AH}_{n,p} = O\bigl(k_l k_u + (k_l + k_u) r\bigr)\,\frac{n}{p}
+ \lfloor \log_2 p \rfloor \; O\bigl(k^3 + k^2 r + (k^2 + k r)\, t_w + t_s\bigr),
\qquad k = \max\{k_l, k_u\}.
$$

In the ScaLAPACK implementation the factorization phase is separated from the forward substitution phase. This allows a user to solve several systems of equations without the need to factor the system matrix over and over again. In the implementation by Arbenz several systems can be solved simultaneously, but only in connection with the matrix factorization. The close coupling of factorization and forward substitution reduces communication at the expense of the flexibility of the code. In the ScaLAPACK solver the number of messages sent is higher, while the overall message volume and operation count remain the same:

$$
T^{ScaLAPACK}_{n,p} = T^{AH}_{n,p} + O(t_s)\, \lfloor \log_2 p \rfloor.
$$

The complexity of solving a system with an already factored matrix consists of the terms above containing r, the number of right-hand sides. In both expressions we have assumed that the time for the transmission of a message of length n floating point numbers from one processor to another is independent of the processor distance and can be represented in the form

$$ t(n) = t_s + n\, t_w. $$

Here $t_s$ denotes the startup time relative to the time of a floating point operation, i.e., the number of flops that can be executed during the startup time, and $t_w$ denotes the number of floating point operations that can be executed during the transmission of one word (here an 8-byte floating point number). Notice that $t_s$ is much larger than $t_w$. On the Intel Paragon, for instance, the values of $t_s$ and $t_w$ obtained by comparing the measured transmission time and bandwidth between applications with the Mflop/s rate of the LINPACK benchmark are substantial; on the IBM SP or the SGI/Cray T3D the characteristic numbers $t_s$ and $t_w$ are even bigger.

Dividing the serial operation count $\varphi_n$ by these parallel complexities, the speedups become

$$
S^{AH}_{n,p} = \frac{\varphi_n}{T^{AH}_{n,p}}, \qquad
S^{ScaLAPACK}_{n,p} = \frac{\varphi_n}{T^{ScaLAPACK}_{n,p}}.
$$

The processor number for which the highest speedup is observed is $O(n/k)$. Speedup and efficiency are relatively small, however, due to the high redundancy of the parallel algorithm. Redundancy is the ratio of the serial complexity of the parallel algorithm to that of the serial algorithm, i.e., it indicates the parallelization overhead with respect to floating point operations.

Fig. 2.4. Speedups versus number of processors for three problem sizes (n = 100000, k_l = k_u = 10, r = 10; n = 100000, k_l = k_u = 10, r = 1; n = 20000, k_l = k_u = 10, r = 1), as predicted by the complexity model above. Drawn lines indicate the Arbenz-Hegland implementation, dashed lines the ScaLAPACK implementation.

If r is small, say r = 1, then the redundancy is substantial if $k_l = k_u$, and even higher otherwise; if r is bigger, the redundancy tends to a smaller limiting value. In Fig. 2.4 the speedup is plotted versus the processor number for three different problem sizes, as predicted by the complexity model with the Paragon values chosen for $t_s$ and $t_w$. Because fewer messages are sent in the Arbenz-Hegland implementation than in the ScaLAPACK implementation, the former can be expected to yield slightly higher speedups. However, the gain in flexibility with the ScaLAPACK routine certainly justifies the small performance loss. Notice, though, that the formulas given for $T^{AH}_{n,p}$ and $T^{ScaLAPACK}_{n,p}$ must be considered very approximate. The assumption that all floating point operations take the same amount of time is completely unrealistic on modern RISC processors. Also, the numbers $t_s$ and $t_w$ are crude estimates of reality. Nevertheless, the sawteeth caused by the term $\lfloor \log_2 p \rfloor$ are clearly visible in timings. The numerical experiments of section 4 will give a more realistic comparison of the implementations and will tell more about the value of the above complexity measures.
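To make the shape of such predictions reproducible, the snippet below evaluates a model of this form. All coefficients (c_local, c_red, c_words) and the values of t_s and t_w are illustrative placeholders of ours, not the constants or measurements of the paper.

```python
import math

def predicted_time(n, k, r, p, t_s=10_000.0, t_w=40.0,
                   c_local=2.0, c_red=8.0, c_words=3.0):
    """Model time in flop units: a local phase proportional to n/p plus
    floor(log2 p) reduction steps of O(k^3 + k^2 r) flops and a few
    messages of O(k^2 + k r) words (cf. the expressions above)."""
    if p == 1:
        return 2.0 * k * k * n + 4.0 * k * r * n          # serial banded LU + solves
    local = c_local * (2.0 * k * k + 4.0 * k * r) * n / p
    steps = math.floor(math.log2(p))
    per_step = c_red * (k**3 + k**2 * r) + 2 * (t_s + c_words * (k**2 + k * r) * t_w)
    return local + steps * per_step

def predicted_speedup(n, k, r, p):
    return predicted_time(n, k, r, 1) / predicted_time(n, k, r, p)

# rough shape of the curves in Fig. 2.4 (not their exact values):
print([round(predicted_speedup(100_000, 10, 1, p), 1) for p in (4, 16, 64, 256)])
```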

3. Parallel Gaussian elimination with partial pivoting. In this section we treat the case where A is not diagonally dominant. Then the LU factorization may not exist, or its computation may be unstable. Thus it is advisable to use partial pivoting with the elimination in this case. The corresponding factorization is $PA = LU$, where P is the row permutation matrix. Pivoting additionally requires about $k_l n$ comparisons and n row interchanges. More importantly, the bandwidth of U can get as large as $k_l + k_u$; L loses its bandedness but still has only $k_l$ nonzeros per column and can therefore be stored in its original place. The wider the band of U, the higher the number of arithmetic operations, and our previous upper bound for the floating point operations increases in the worst case to

$$ \varphi^{pp}_n = 2 k_l (k_l + k_u) n + 2 (2 k_l + k_u) r n + \text{lower-order terms independent of } n, $$

where again r is the number of right-hand sides. This bound is obtained by counting the flops for solving a banded system with lower and upper half-bandwidths $k_l$ and $k_l + k_u$, respectively, without pivoting. The overhead introduced by pivoting, which may be as big as a factor of $(k_l + k_u)/k_u$, is inevitable if stability of the LU factorization cannot be guaranteed. Therefore the methods for solving banded systems in packages like LAPACK incorporate partial pivoting and accept the overhead. (This is actually a deficiency of LAPACK which has been eliminated in ScaLAPACK; the memory space wasted is simply too big.) Further, back substitution can be implemented faster if it is known that there are no row interchanges. Note that this overhead is particular to banded and sparse linear systems; it does not occur with dense matrices.
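The small experiment below illustrates these worst-case bounds numerically; it is our own check (the random matrix, sizes, and tolerance are arbitrary assumptions) and relies only on the standard routine scipy.linalg.lu.

```python
import numpy as np
from scipy.linalg import lu

def half_bandwidths(M, tol=1e-14):
    """Lower and upper half-bandwidths of a dense matrix."""
    idx = np.argwhere(np.abs(M) > tol)
    return int((idx[:, 0] - idx[:, 1]).max()), int((idx[:, 1] - idx[:, 0]).max())

rng = np.random.default_rng(0)
n, kl, ku = 40, 2, 3
A = np.triu(np.tril(rng.standard_normal((n, n)), ku), -kl)   # banded, not dd
P, L, U = lu(A)                               # A = P @ L @ U with partial pivoting
print("upper bandwidth of U:", half_bandwidths(U)[1], "<=", kl + ku)
nnz_below = max(int((np.abs(L[j+1:, j]) > 1e-14).sum()) for j in range(n))
print("max nonzeros per column of L below the diagonal:", nnz_below, "<=", kl)
```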

Fig. 3.1. Nonzero structure (shaded area) of (a) the original and (b) the block column permuted band matrix.

The partition of section 2 is not suited for the parallel solution of (1.1) if partial pivoting is to be applied. In order that pivoting can take place independently in the block columns, they must not have elements in the same row. Therefore the separators have to be $k := k_l + k_u$ columns wide (cf. Fig. 3.1(a)). As discussed in detail by Arbenz and Hegland, we consider the matrix A as a cyclic band matrix by moving the last $k_l$ rows to the top. Then we partition this matrix into the cyclic lower block bidiagonal form

$$
\begin{pmatrix}
A_1 &     &     &     &        &     & D_1 \\
B_1 & C_1 &     &     &        &     &     \\
    & D_2 & A_2 &     &        &     &     \\
    &     & B_2 & C_2 &        &     &     \\
    &     &     & \ddots & \ddots &  &     \\
    &     &     &        & D_p & A_p &     \\
    &     &     &        &     & B_p & C_p
\end{pmatrix}
\begin{pmatrix} x_1 \\ \xi_1 \\ x_2 \\ \xi_2 \\ \vdots \\ x_p \\ \xi_p \end{pmatrix}
=
\begin{pmatrix} b_1 \\ \beta_1 \\ b_2 \\ \beta_2 \\ \vdots \\ b_p \\ \beta_p \end{pmatrix},
$$

where the diagonal blocks $A_i$ are square of order $n_i$, the separator blocks $C_i$ are k-by-k with $k = k_l + k_u$, $x_i, b_i \in \mathbb{R}^{n_i}$, $\xi_i, \beta_i \in \mathbb{R}^{k}$, and $\sum_{i=1}^{p} (n_i + k) = n$. If $n_i \ge k$ for all i, the degree of parallelism is p.

For solving Ax = b in parallel we apply a generalization of cyclic reduction that permits pivoting. We again discuss the first step more closely; the later steps are similar, except that the matrix blocks are square. We first formally apply a block odd-even permutation to the columns of the matrix above, i.e., we order the unknowns as $x_1, \ldots, x_p, \xi_1, \ldots, \xi_p$. For p = 4, for example, the permuted system becomes

$$
\begin{pmatrix}
A_1 &     &     &     &     &     &     & D_1 \\
B_1 &     &     &     & C_1 &     &     &     \\
    & A_2 &     &     & D_2 &     &     &     \\
    & B_2 &     &     &     & C_2 &     &     \\
    &     & A_3 &     &     & D_3 &     &     \\
    &     & B_3 &     &     &     & C_3 &     \\
    &     &     & A_4 &     &     & D_4 &     \\
    &     &     & B_4 &     &     &     & C_4
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ \xi_1 \\ \xi_2 \\ \xi_3 \\ \xi_4 \end{pmatrix}
=
\begin{pmatrix} b_1 \\ \beta_1 \\ b_2 \\ \beta_2 \\ b_3 \\ \beta_3 \\ b_4 \\ \beta_4 \end{pmatrix}.
$$

The structure of this matrix is depicted in Fig. 3.1(b). Notice that the permutation that moves the last $k_l$ rows to the top was done for pedagogical reasons: it makes the diagonal blocks $A_i$ and $C_i$ square and the first elimination step formally equal to the successive ones. A different point of view, which leads to the same factorisation, could allow the first and the last diagonal blocks to be nonsquare.

The local matrices are stored in the LAPACK storage scheme for non-diagonally dominant band matrices: in addition to the $k_l + k_u + 1$ rows that store the original portions of the matrix, further rows have to be provided for storing the fill-in. In the ScaLAPACK algorithm processor i stores the blocks $A_i$, $B_i$, $C_i$, $D_i$. In the Arbenz-Hegland implementation processor i stores $A_i$, $B_i$, $C_i$, and $D_i^T$; it is assumed that $D_i$ is stored in an extra k-by-$n_i$ array. The ScaLAPACK algorithm constructs this situation by an initial communication step that consumes a negligible fraction of the overall computing time, as in the discussion in the previous section. In both algorithms the blocks $B_p$, $C_p$, and $D_p$ do not really appear but are incorporated into $A_p$.

Let

$$
P_i \begin{pmatrix} A_i \\ B_i \end{pmatrix}
= L_i \begin{pmatrix} R_i \\ O \end{pmatrix}, \qquad i = 1, \ldots, p,
$$

be the LU factorizations, with partial pivoting, of the $(n_i + k)$-by-$n_i$ blocks on the left of the permuted system, and let

$$
\begin{pmatrix} X_i \\ V_i \end{pmatrix} := L_i^{-1} P_i \begin{pmatrix} D_i \\ O \end{pmatrix}, \qquad
\begin{pmatrix} Y_i \\ T_i \end{pmatrix} := L_i^{-1} P_i \begin{pmatrix} O \\ C_i \end{pmatrix}, \qquad
\begin{pmatrix} c_i \\ \gamma_i \end{pmatrix} := L_i^{-1} P_i \begin{pmatrix} b_i \\ \beta_i \end{pmatrix}.
$$

Then the two block rows owned by processor i are transformed according to

$$
L_i^{-1} P_i
\begin{pmatrix} A_i & D_i & O & b_i \\ B_i & O & C_i & \beta_i \end{pmatrix}
= \begin{pmatrix} R_i & X_i & Y_i & c_i \\ O & V_i & T_i & \gamma_i \end{pmatrix},
$$

and collecting the lower halves of these local results by a block odd-even permutation $\tilde P$ of the rows leaves a block upper triangular system whose trailing block is the reduced system. Here $\tilde P$ denotes the odd-even permutation of the rows. The structure of the matrices before and after this row permutation is depicted in Fig. 3.2(a) and (b), respectively.

Fig. 3.2. Fill-in produced by Gaussian elimination with partial pivoting.

The last step shows that again we end up with a reduced system

$$
S \xi = \gamma, \qquad
S = \begin{pmatrix}
T_1 &        &         & V_1 \\
V_2 & T_2    &         &     \\
    & \ddots & \ddots  &     \\
    &        & V_p     & T_p
\end{pmatrix},
$$

with the same cyclic block bidiagonal structure as the original partitioned matrix A.

The reduced system can be treated as before by $\lceil p/2 \rceil$ processors. This procedure is discussed in detail by Arbenz and Hegland. Finally, if the vectors $\xi_i$, $i = 1, \ldots, p$, are known, each processor can compute its section of x:

$$
\begin{aligned}
x_1 &= R_1^{-1}\bigl(c_1 - X_1 \xi_p - Y_1 \xi_1\bigr), \\
x_i &= R_i^{-1}\bigl(c_i - X_i \xi_{i-1} - Y_i \xi_i\bigr), \qquad i = 2, \ldots, p.
\end{aligned}
$$

The back substitution phase does not need interprocessor communication.
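The following dense sketch mirrors this first elimination step on one 'processor': it eliminates the columns belonging to $A_i$ and $B_i$ with partial pivoting and returns $R_i$, the spikes $X_i, Y_i$, the reduced-system blocks $V_i, T_i$, and the transformed right-hand sides. It is our own illustration; the real codes operate on band storage and never form these dense blocks.

```python
import numpy as np

def eliminate_local_block(A, B, D, C, b, beta):
    """Gaussian elimination with partial pivoting restricted to the columns
    of [A; B] in the stacked local rows [A D 0 b; B 0 C beta]."""
    ni, k = A.shape[1], C.shape[1]
    E = np.block([[A, D,                         np.zeros((A.shape[0], k)), b],
                  [B, np.zeros((B.shape[0], k)), C,                         beta]]).astype(float)
    m = E.shape[0]
    for j in range(ni):                        # eliminate only the first n_i columns
        piv = j + int(np.argmax(np.abs(E[j:, j])))
        E[[j, piv]] = E[[piv, j]]              # row interchange (partial pivoting)
        for i in range(j + 1, m):
            E[i, j:] -= (E[i, j] / E[j, j]) * E[j, j:]
    R           = E[:ni, :ni]                  # upper triangular
    X, Y, c     = E[:ni, ni:ni+k], E[:ni, ni+k:ni+2*k], E[:ni, ni+2*k:]
    V, T, gamma = E[ni:, ni:ni+k], E[ni:, ni+k:ni+2*k], E[ni:, ni+2*k:]
    return R, X, Y, c, V, T, gamma

# once the reduced system has delivered xi_prev and xi_i, the local solution is
# x_i = np.linalg.solve(R, c - X @ xi_prev - Y @ xi_i)
```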

The parallel complexity of this algorithm has the form

$$
T^{pp,AH}_{n,p} = O\bigl(k_l (k_l + k_u) + (k_l + k_u) r\bigr)\,\frac{n}{p}
+ \lfloor \log_2 p \rfloor \; O\bigl(k^3 + k^2 r + (k^2 + k r)\, t_w + t_s\bigr),
\qquad k = k_l + k_u,
$$

in the Arbenz-Hegland implementation and

$$
T^{pp,ScaLAPACK}_{n,p} = T^{pp,AH}_{n,p} + O(t_s)\, \lfloor \log_2 p \rfloor
$$

in the ScaLAPACK implementation. We treat the blocks in S as full k-by-k blocks, as their nonzero pattern is not predictable due to the pivoting process. The speedups of these algorithms are

$$
S^{pp,AH}_{n,p} = \frac{\varphi^{pp}_n}{T^{pp,AH}_{n,p}}, \qquad
S^{pp,ScaLAPACK}_{n,p} = \frac{\varphi^{pp}_n}{T^{pp,ScaLAPACK}_{n,p}}.
$$

Fig. 3.3. Speedups versus number of processors for various problem sizes (n = 100000, k = 20, r = 10; n = 100000, k = 20, r = 1; n = 20000, k = 20, r = 1), as predicted by the formulas above, in linear scale (left) and doubly logarithmic scale (right).

In general, the redundancy of the pivoting algorithm is modest, both for small numbers r of right-hand sides and for large r. Note, though, that the redundancy varies according to the specific sequence of interchanges during partial pivoting; here we have always assumed that the fill-in is the worst possible.

We make particular note of the following: since the matrix is partitioned differently than in the diagonally dominant case, this algorithm results in a different factorization than the diagonally dominant algorithm, even if the original matrix A is diagonally dominant. In particular, the diagonal blocks $A_i$, $i = 1, \ldots, p$, are treated like banded lower triangular matrices. In fact, a diagonally dominant matrix is a somewhat poor case for this algorithm. This is easily seen: by reordering each diagonal block to be lower triangular, we have moved the largest elements away from the diagonal and put them in the middle of each column, thus forcing interchanges in the partial pivoting algorithm where none were necessary for the input matrix.

In Fig. 3.3 we have again plotted predicted speedups for two problem sizes and different numbers of right-hand sides. We do not distinguish between the two implementations, as the $t_s$ term is small compared with the others, even for $k = k_l + k_u$. The plot on the right shows the same data in doubly logarithmic scale. It shows that the speedup is close to ideal for very small processor numbers and then deteriorates. The gap between the actual and the ideal speedup for small processor numbers illustrates the impact of the redundancy.

Remark 1. If the original problem were actually cyclic-banded, the redundancy would be 1, i.e., the parallel algorithm has essentially the same complexity as the serial algorithm. This is so because in this case the serial code also generates fill-in.

Remark 2. Instead of the LU factorizations above, QR factorizations could be computed. This doubles the computational effort but enhances stability. Similar ideas are pursued by Amestoy et al. for the parallel computation of the QR factorization of large sparse matrices.

4. Numerical experiments on the Intel Paragon. We compared the algorithms described in the previous two sections by means of three test problems. The matrix A has all ones within the band and a constant value on the diagonal; the condition numbers grow very large as this diagonal value tends to one. Three problem sizes (n, k_l, k_u) were used: a small, a medium, and a large one (cf. the figures and tables). Estimates of the condition numbers, obtained by Higham's algorithm, are listed in Table 4.1 for various values of the diagonal parameter. The right-hand sides were chosen such that the solution was a known vector, which enabled us to compute the error in the computed solution. We compiled one program for each problem size, adjusting the arrays to just the needed size; in this way we could solve the small problems on one processor, too. When compiling, we chose the highest optimization level and turned off IEEE arithmetic. (With IEEE arithmetic turned on, execution times were erratic and non-reproducible.)
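For reference, test matrices of this kind can be generated as below; this is our reading of the description above (the actual parameter values of the paper are not reproduced), and we use a plain dense condition number instead of Higham's estimator.

```python
import numpy as np

def test_matrix(n, kl, ku, diag):
    """All ones inside the band, the constant 'diag' on the diagonal."""
    A = np.triu(np.tril(np.ones((n, n)), ku), -kl)
    np.fill_diagonal(A, diag)
    return A

# the condition number grows rapidly as the diagonal value approaches one
for d in (30.0, 10.0, 4.0, 1.5):          # d > kl + ku gives diagonal dominance
    A = test_matrix(500, 10, 10, d)
    print(d, "cond_1 ~ %.2e" % np.linalg.cond(A, p=1))
```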

We begin with the discussion of the diagonally dominant case. In Table 4.2 the execution times are listed for all problem sizes.

Table 4.1. Estimated condition numbers of the systems of equations solved in the tables below, for the three problem sizes and various values of the diagonal parameter.

Table 4.2. Diagonally dominant case on the Intel Paragon: selected execution times t (in milliseconds), speedups S = t_1/t_p, and the norm of the error of the computed solution, for the ScaLAPACK and the Arbenz-Hegland implementations and the three problem sizes.

For the ScaLAPACK and the Arbenz-Hegland (AH) implementations the one-processor times are quite close. The difference in this part of the code is that the AH implementation calls the Level 2 BLAS based LAPACK routine dgbtf2 for the triangular factorization, whereas the ScaLAPACK implementation calls the Level 3 BLAS based routine dgbtrf. The latter is advantageous for the wider bandwidth of the large problem, while dgbtf2 performs slightly better for the narrow bands.

The two implementations show a noteworthy difference in their two-processor performance. The ScaLAPACK implementation performs about as fast as on one processor, which is to be expected. The AH implementation, however, loses noticeably. We attribute this loss in performance to the zeroing of auxiliary arrays that will be used to store the fill-in spikes; this is done unnecessarily in the preparation phase of the algorithm.

In ScaLAPACK, the LAPACK routine dtbtrs (based on Level 2 BLAS) is called for forward elimination and backward substitution. In the AH implementation this routine is expanded in place in order to avoid unnecessary checks of whether rows have been exchanged in the factorization phase; this avoids the evaluation of if-statements. In the AH implementation the above-mentioned auxiliary arrays are stored as lying (transposed) blocks to further improve the scalability and to better exploit the RISC architecture of the underlying hardware.

The speedup of the AH implementation relative to its two-processor performance is very close to ideal up to large processor numbers. The ScaLAPACK implementation does not scale as well; for large processor numbers the difference in execution times roughly correlates with the ratio of the numbers of messages sent in the two implementations.

As indicated by the complexity model, the speedups for the medium-size problem are best. The n/p-term, which contains the factorization of the $A_i$ and the computation of the spikes such as $L_i^{-1} D_i^L$ and $D_i^U R_i^{-1}$, consumes five times as much time as with the small problem size and scales very well. This portion is increased further with the large problem size; there, however, the solution of the reduced system also gets expensive.

Table 4.3. Non-diagonally dominant case on the Intel Paragon, small problem size: selected execution times t (in milliseconds), speedups S, and norm errors of the ScaLAPACK and the Arbenz-Hegland implementations for varying diagonal parameter.

We now compare the performance of the ScaLAPACK and the Arbenz-Hegland implementations of the pivoting algorithm of section 3. Tables 4.3, 4.4, and 4.5 contain the respective numbers (execution time, speedup, and norm of the error) for the three problem sizes.

Relative to the AH implementation, the execution times for ScaLAPACK contain overhead proportional to the problem size, mainly the zeroing of elements of work arrays; in the AH implementation this is done while the matrices are built. Therefore the comparison in the non-diagonally dominant case should not be based on execution times but on speedups. Nevertheless, it should be noted that the computing times increase with the difficulty, i.e., with the condition, of the problem. They are, of course, hard or even impossible to predict, as the pivoting sequence is not known in advance. At least the two problems with the smaller bandwidth can be discussed along similar lines: the AH implementation scales better than ScaLAPACK, and its execution times for large processor numbers are a fraction of those of the ScaLAPACK implementation, again reflecting the ratio of messages sent. Notice that here the block size of the reduced system, but also of the fill-in blocks (spikes), is twice as big as in the diagonally dominant case; therefore the performance in Mflop/s is higher here.

Table 4.4. Non-diagonally dominant case on the Intel Paragon, intermediate problem size: selected execution times t (in milliseconds), speedups S, and norm errors of the ScaLAPACK and the Arbenz-Hegland implementations for varying diagonal parameter.

This plays a role mainly in the computation of the fill-in. The redundancy does not have the high weight that the flop count of the previous section indicates; in fact, the pivoting algorithm performs almost as well as, or sometimes even better than, the algorithm for the diagonally dominant case. This may suggest always using the former algorithm. This consideration is correct with respect to computing time. It must, however, be remembered that the pivoting algorithm requires twice as much memory space as the algorithm for the diagonally dominant case, whereas in the serial algorithm the ratio is only about $(2 k_l + k_u)/(k_l + k_u)$. In any case, the overhead for pivoting in the solution of the reduced system by bidiagonal cyclic reduction is not so big that it justifies sacrificing stability.

The picture is different for the largest problem size. Here ScaLAPACK scales quite a bit better than the implementation by Arbenz and Hegland. Reducing the number of messages and the marshaling overhead without regard to the message volume is counterproductive here: with the wide band, the message volume times $t_w$ by far outweighs the accumulated startup times (cf. the communication model of section 2). So, for the largest processor numbers, ScaLAPACK is fastest and yields the highest speedups.
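This tradeoff can be made concrete with the message-time model $t(n) = t_s + n t_w$ introduced in section 2; the parameter values below are placeholders of ours, not measured machine constants.

```python
def message_time(words, t_s=10_000.0, t_w=40.0):
    """Time of one message in flop units, using the model t_s + n * t_w."""
    return t_s + words * t_w

for k in (10, 50):                       # narrow vs. wide separator blocks
    words = k * k                        # one k-by-k block per message
    combined = message_time(2 * words)   # two blocks marshalled into one message
    separate = 2 * message_time(words)   # the same data sent as two messages
    # the saving is exactly one startup t_s; it matters only while k^2 * t_w is small
    print(k, "saving:", separate - combined, "volume term:", 2 * words * 40.0)
```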

5. Conclusion. We have shown that the algorithms implemented in ScaLAPACK are stable and perform reasonably well. The implementations of the same algorithms by Arbenz and Hegland, which are designed to reduce the number of messages communicated, are faster for very small bandwidths; the difference is, however, not too big. The flexibility and versatility of ScaLAPACK justify the loss in performance.

Nevertheless, it may be useful to have in ScaLAPACK a routine that combines the factorization and the solution phase. Appropriate routines would be the drivers pddbsv for the diagonally dominant case and pdgbsv for the non-diagonally dominant case. In the present version of ScaLAPACK the former routine consecutively calls pddbtrf and pddbtrs, and the latter calls pdgbtrf and pdgbtrs, respectively. The storage policy could stay the same, so the flexibility in how to apply the routines remains.

Table 4.5. Non-diagonally dominant case on the Intel Paragon, large problem size: selected execution times t (in milliseconds), speedups S, and norm errors of the two implementations for varying diagonal parameter. The single-processor execution times have been estimated; they are 41089, 45619, 68553, and 68737 ms for ScaLAPACK and 36333, 40785, 64100, and 64308 ms for the Arbenz-Hegland implementation.

We found that the pivoting algorithm does not imply a large computational overhead over the solver for diagonally dominant systems of equations; we even observed shorter solution times in some cases. However, as the pivoting algorithm requires twice as much memory space, it should only be used in uncertain situations. Notice further that on the IBM SP the computational overhead of the pivoting algorithm is also considerable. Therefore we recommend using the algorithms for diagonally dominant systems whenever possible.

REFERENCES

P. R. Amestoy, I. S. Duff, and C. Puglisi, Multifrontal QR factorization in a multiprocessor environment, Numer. Linear Algebra Appl.
E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov, and D. Sorensen, LAPACK Users' Guide, Society for Industrial and Applied Mathematics, Philadelphia, PA. Software and guide are available from Netlib at URL http://www.netlib.org/lapack.
I. J. Anderson and S. K. Harbour, Parallel factorization of banded linear matrices using a systolic array processor, Adv. Comput. Math.
P. Arbenz, On experiments with a parallel direct solver for diagonally dominant banded linear systems, in Euro-Par Parallel Processing, L. Bougé, P. Fraigniaud, A. Mignotte, and Y. Robert, eds., Lecture Notes in Computer Science, Springer, Berlin.
P. Arbenz, A. Cleary, J. Dongarra, and M. Hegland, A comparison of parallel solvers for diagonally dominant and general narrow-banded linear systems II, in Euro-Par Parallel Processing, P. Amestoy, P. Berger, M. Daydé, I. Duff, V. Frayssé, L. Giraud, and D. Ruiz, eds., Lecture Notes in Computer Science, Springer, Berlin.
P. Arbenz and W. Gander, A survey of direct parallel algorithms for banded linear systems, Tech. Report, ETH Zürich, Computer Science Department. Available at URL http://www.inf.ethz.ch/publications.
P. Arbenz and M. Hegland, Scalable stable solvers for non-symmetric narrow-banded linear systems, in Seventh International Parallel Computing Workshop (PCW), P. Mackerras, ed., Australian National University, Canberra, Australia.
P. Arbenz and M. Hegland, On the stable parallel solution of general narrow banded linear systems, in High Performance Algorithms for Structured Matrix Problems, P. Arbenz, M. Paprzycki, A. Sameh, and V. Sarin, eds., Nova Science Publishers, Commack, NY.
L. S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley, ScaLAPACK Users' Guide, Society for Industrial and Applied Mathematics, Philadelphia, PA. Software and guide are available from Netlib at URL http://www.netlib.org/scalapack.
C. de Boor, A Practical Guide to Splines, Springer, New York.
R. Brent, A. Cleary, M. Dow, M. Hegland, J. Jenkinson, Z. Leyk, M. Nakanishi, M. Osborne, P. Price, S. Roberts, and D. Singleton, Implementation and performance of scalable scientific library subroutines on Fujitsu's VPP parallel-vector supercomputer, in Proceedings of the Scalable High-Performance Computing Conference, IEEE Computer Society Press, Los Alamitos, CA.
A. Cleary and J. Dongarra, Implementation in ScaLAPACK of divide-and-conquer algorithms for banded and tridiagonal systems, Tech. Report, University of Tennessee, Knoxville, TN. Available as a LAPACK Working Note from URL http://www.netlib.org/lapack/lawns.
J. M. Conroy, Parallel algorithms for the solution of narrow banded systems, Appl. Numer. Math.
M. J. Daydé and I. S. Duff, The use of computational kernels in full and sparse linear solvers, efficient code design on high-performance RISC processors, in Vector and Parallel Processing (VECPAR), J. M. L. M. Palma and J. Dongarra, eds., Lecture Notes in Computer Science.
J. J. Dongarra and L. Johnsson, Solving banded systems on a parallel processor, Parallel Computing.
J. J. Dongarra and A. H. Sameh, On some parallel banded system solvers, Parallel Computing.
J. Du Croz, P. Mayes, and G. Radicati, Factorization of band matrices using Level 3 BLAS, Tech. Report, University of Tennessee, Knoxville, TN (LAPACK Working Note).
G. H. Golub and C. F. van Loan, Matrix Computations, 2nd ed., The Johns Hopkins University Press, Baltimore, MD.
A. Gupta, F. G. Gustavson, M. Joshi, and S. Toledo, The design, implementation, and evaluation of a symmetric banded linear solver for distributed-memory parallel computers, ACM Trans. Math. Softw.
I. N. Hajj and S. Skelboe, A multilevel parallel solver for block tridiagonal and banded linear systems, Parallel Computing.
M. Hegland, Divide and conquer for the solution of banded linear systems of equations, in Proceedings of the Fourth Euromicro Workshop on Parallel and Distributed Processing, IEEE Computer Society Press, Los Alamitos, CA.
M. Hegland and M. Osborne, Algorithms for block bidiagonal systems on vector and parallel computers, in International Conference on Supercomputing (ICS), D. Gannon, G. Egan, and R. Brent, eds., ACM, New York, NY.
N. J. Higham, Accuracy and Stability of Numerical Algorithms, Society for Industrial and Applied Mathematics, Philadelphia, PA.
R. W. Hockney and C. R. Jesshope, Parallel Computers, 2nd ed., Hilger, Bristol.
S. L. Johnsson, Solving narrow banded systems on ensemble architectures, ACM Trans. Math. Softw.
V. Kumar, A. Grama, A. Gupta, and G. Karypis, Introduction to Parallel Computing, Benjamin/Cummings, Redwood City, CA.
D. Lawrie and A. Sameh, The computation and communication complexity of a parallel banded system solver, ACM Trans. Math. Softw.
U. Meier, A parallel partition method for solving banded systems of linear equations, Parallel Computing.
J. Ortega, Introduction to Parallel and Vector Solution of Linear Systems, Plenum Press, New York.
Y. Saad and M. H. Schultz, Parallel direct methods for solving banded linear systems, Linear Algebra Appl.
A. Sameh and D. Kuck, On stable parallel linear system solvers, J. ACM.
J. Stoer and R. Bulirsch, Introduction to Numerical Analysis, 2nd ed., Springer, New York.
D. W. Walker, T. Aldcroft, A. Cisneros, G. C. Fox, and W. Furmanski, LU decomposition of banded matrices and the solution of linear systems on hypercubes, in The Third Conference on Hypercube Concurrent Computers and Applications, G. Fox, ed., ACM, New York, NY.
S. J. Wright, Parallel algorithms for banded linear systems, SIAM J. Sci. Stat. Comput.

Edited by Jack Dongarra and Erricos John Kontoghiorghes.
