Chapter

An Improved Algorithm for Parallel Sparse LU

Decomp osition on a DistributedMemory Multipro cessor

y

Jacko Koster Rob H Bisseling

Abstract

In this pap er we present a new parallel algorithm for the LU decomp osition of a

general sparse Among its features are matrix redistribution at regular intervals

and a dynamic pivot search strategy that adapts itself to the numb er of pivots pro duced

Exp erimental results obtained on a network of transputers show that these features

considerably improve the p erformance

Intro duction

This pap er presents an improved version of the parallel algorithm for the LU decomp osition

of a general develop ed by van der Stapp en Bisseling and van de Vorst

The LU decomp osition of a matrix A A i j n pro duces a unit lower

ij

L an upp er triangular matrix U a row p ermutation vector and a

column p ermutation vector such that

LU for i j n A

ij

i j

We assume that A is sparse and nonsingular and that it has an arbitrary pattern of nonzeros

with all elements having the same small probability of b eing nonzero A review of parallel

algorithms for sparse LU decomp osition can b e found in

We use the following notations A submatr ix of a matrix A is the intersection of several

rows and columns of A The submatrix AI J I J f n g has domain I J

If I fig we use Ai J as shorthand for Afig J The concurrent assignment op erator

c d a b denotes the simultaneous assignment of a to c and b to d For any submatrix

A nz A denotes the numb er of nonzeros in A For any set I jI j is the cardinality of I

Our algorithm is aimed at a distributedmemory messagepassing MIMD multipro cessor

with an M N mesh communication network We identify each pro cessor in the mesh with

a pair s t s M t N A Cartesian distribution of A is a pair of

mappings that assigns matrix element A to pro cessor with M

ij i j i

and N For pro cessor s t the set I s denotes the lo cal set of row indices

j

I s fi i I sg Similarly J t fj j J tg

i j

CERFACS Ave G Coriolis Toulouse Cedex France JackoKostercerfacsfr Part of

this work was done while this author was employed at Eindhoven University of Technology

y

Department of Mathematics Utrecht University PO Box TA Utrecht the Netherlands

bisselingmathruunl Part of this work was done while this author was employed at KoninklijkeSh ell

Lab oratorium Amsterdam

Koster and Bisseling

The PARPACKLU algorithm

In this section we briey describ e the previous algorithm which we refer to as

PARPACKLU This algorithm assigns the nonzero matrix elements to the pro cessors

according to the grid distribution dened by

i mo d M i mo d N for i n

i i

Each step of the algorithm consists of a pivot search row and column p ermutations

and a multiple up date of the reduced matrix AI I At the b eginning of a step

I fk n g where k is the numb er of pivots pro cessed so far

The pivot search determines a set S of m pivots from the reduced matrix with the

following three prop erties First each element i j from S satises the threshold criterion

jA j u max jA j

ij lj

lI

where u is a userdened parameter u Ch This ensures

Second the elements of S have low Markowitz cost to preserve sparsity The Markowitz

cost of a nonzero element A in a submatrix AI J equals nz AI j nz Ai J

ij

Third the elements of S are mutually compatible ie

A for i j i j S i j i j A

i j ij

The compatibility of the pivots enables the algorithm to pro cess them in one step and to

p erform a single rankm up date of the reduced matrix

After the pivot search the rows and columns of the matrix are p ermuted such that the

m m submatrix AI I I fk k m g turns into a diagonal submatrix with

S S S

the m pivots p ositioned on the diagonal This is followed by the rankm up date which

contains the bulk of the oating p oint op erations In this part the set I is subtracted

S

from I each multiplier column j in AI I is divided by the corresp onding pivot value A

S j j

and the matrix pro duct AI I AI I is subtracted from AI I

S S

The new algorithm

An outline of the new algorithm is given b elow a detailed description and a program text

can b e found in

Algorithm Parallel LU decomp osition

I s J t fi i n sg fj j n tg

i j

L U k

while k n do b egin

nd pivot set S fi j k r k mg

r r

I J fi k r k mg fj k r k mg

S S r r

I s J t I s n I J t n J

S S

register pivots in and

p erform multiplerank up date

store multipliers in L and pivots and up date rows in U

k k m

rowredistributeA L

colredistributeA U

end

rowp ermuteL

colp ermuteU

Improved Parallel Sparse LU Decomposition

In the search for pivot candidates we vary the numb er ncol of matrix columns searched

p er pro cessor column on the basis of the estimated density of the reduced matrix In the

rst steps of the algorithm the reduced matrix is still sparse and the candidate pivots will

most likely b e compatible and hence most of them will b ecome pivots As the computation

pro ceeds llin causes the reduced matrix to b ecome gradually denser Consequently

the probability of nonzero elements b eing compatible decreases and exp ensive searches for

large sets of candidate pivots will yield only relatively few compatible pivots Therefore

for higher densities of the reduced matrix we decrease ncol When the reduced matrix

b ecomes so dense that the pivot search rep eatedly pro duces only one pivot the algorithm

switches to a simple search for only one pivot

The PARPACKLU algorithm distributes the work load by maintaining the grid

distribution and p erforming explicit row and column p ermutations at every step

Therefore it often requires the exchange of rows b etween pro cessor rows and similarly

for columns b ecause many of the explicit p ermutations cannot b e p erformed lo cally This

induces a certain amount of added communication time When implicit p ermutations

are used the rows and columns of the matrix remain in place The matrix elements are

addressed indirectly and no communication for row or column movements is required This

however may lead to a p o or load balance dep ending on the sequence of pivot choices

We distribute the load by redistributing the reduced matrix at regular intervals such

that jJ tj dn k N e for all t and jI sj dn k M e for all s It suces to move

columns from pro cessor columns t for which jJ tj dn k N e to pro cessor columns

for which jJ tj dn k N e and similarly for the rows of the matrix The numb er of

columns to b e sent left and right is determined using the criterion a column to b e disp osed

of is sent in the direction that has the least average surplus p er pro cessor column The

rst pro cessor column that has space available accepts a passing column This heuristic

reduces the distance over which the rows and columns are communicated The column

redistribution by pro cessor column t is done as follows

Algorithm Column redistribution

if redistribution needed then b egin

ceiling dn k N e

surplus jJ tj ceiling

if surplus then determine n n such that n n surplus

l r l r

else n n

l r

redistribute n right

r

redistribute n left

l

end

The theoretical analysis of shows that the unordered twodimensional doubly linked

list is the b est lo cal data structure for parallel sparse LU decomp osition among several

plausible candidates We use this data structure for the implementation of the new

algorithm the previous algorithm was implemented using the ordered twodimensional

singly linked list The present data structure facilitates insertion and deletion of nonzeros

Exp erimental results

The parallel computer used for our exp eriments and those of is a Parsytec Sup erCluster

FT consisting of a square mesh of INMOS T transputers The new algorithm

is implemented in ANSI C with extensions for communication and parallelism This

Koster and Bisseling

Table

Time in s of LU decomposition on p processors using a static pivot search strategy

Matrix p p p p p p p p

IMPCOL B

WEST

FS

STEAM

SHL

BP

JPWH

SHERMAN

SHERMAN

LNS

GEMAT

implementation allows us to use the parallel program also for exp eriments on an ordinary

sequential computer We did not optimise the parallel program for these sequential runs

The sequential exp eriments were p erformed on a SUN SPARC mo del workstation

The test set of sparse matrices that we use for our exp eriments is taken from It consists

of eleven unsymmetric matrices from the HarwellBo eing sparse matrix collection

The user of the program has to sp ecify six input parameters The parameter ncol is the

initial numb er of matrix columns that is searched for pivot candidates by each pro cessor

column The parameter u is used for threshold criterion Pivot candidates with a

Markowitz cost higher than M are discarded Here and are input parameters

min

and M is the lowest Markowitz cost of a pivot candidate After nl ast successive pivot

min

searches pro duce only one pivot a switch is made to a singlepivot strategy Each time a

multiple of f pivots has b een pro cessed the matrix is redistributed

First we rep eated the exp eriments from Table to compare the p erformance of the

new algorithm to that of the PARPACKLU algorithm The standard static pivot search

pro cedure in is most closely approximated by xing ncol and using u

and nlast The matrix is redistributed at every step ie f to mimic the

explicit row and column p ermutations of the PARPACKLU algorithm This exp eriment

allows us to examine the eect of cho osing a dierent data structure without regard to

other improvements Table shows the time needed for parallel LU decomp osition on

a square mesh of p transputers A comparison with the results for the PARPACKLU

algorithm shows that the new algorithm is sup erior It outp erforms the previous one

for small p with a gain of up to a factor of for SHERMAN and p For large

p the two algorithms p erform equally well b ecause communication time which dominates

for such p is similar

To investigate the inuence of the pivot strategy we replaced the standard pivot strategy

of by a dynamic one ncol and at the end of each step ncol is adjusted according

to

ncol N

if m then ncol ncol else ncol ncol

This strategy strives for generating twice as many pivot candidates as pivots This way a

relatively large set of pivot candidates each with a reasonable probability of b ecoming a

pivot is used in constructing the pivot set while the time required for the compatibility

checks is kept within b ounds The other parameters in this exp eriment are u

Improved Parallel Sparse LU Decomposition

Table

Time in s of LU decomposition on a sequential computer and on p processors of a paral lel

computer using a dynamic pivot search strategy

Matrix SPARC p p p p

IMPCOL B

WEST

FS

STEAM

SHL

BP

JPWH

SHERMAN

SHERMAN

LNS

GEMAT

nl ast and f Table shows that the dynamic pivot search strategy

is sup erior We attribute this mainly to the higher rank m of the matrix up dates which

leads to less frequent synchronisation of the pro cessors Even for p there is a gain for

example the computing time is reduced by a factor of three for the matrix SHL This

conrms the theoretical analysis from that multiplerank up dates are b enecial also in

sequential sparse LU decomp osition algorithms

References

R H Bisseling and J G G van de Vorst Paral lel LU decomposition on a transputer network

Lecture Notes in Computer Science SpringerVerlag New York pp

D A Calahan Paral lel solution of sparse simultaneous linear equations Pro c th Annual

Allerton Conf on Circuits and System Theory pp

T A Davis and PC Yew A nondeterministic paral lel algorithm for general unsymmetric

sparse LU SIAM J Matrix Anal Appl pp

I S Du A M Erisman and J K Reid Direct Methods for Sparse Matrices Oxford

University Press Oxford UK

I S Du R G Grimes and J G Lewis Sparse matrix test problems ACM Trans Math

Software pp

G A Geist and C H Romine LU factorization algorithms on distributedmemory multipro

cessor architectures SIAM J Sci Stat Comput pp

J Koster Paral lel solution of sparse systems of linear equations on a mesh network of

transputers Final Rep ort Institute for Continuing Education Eindhoven University of

Technology Eindhoven The Netherlands July

A Skjellum Concurrent dynamic simulation multicomputer algorithms research applied

to ordinary dierentialalgebraic process systems in chemical engineering Ph D Thesis

California Institute of Technology Pasadena CA May

A F van der Stapp en R H Bisseling and J G G van de Vorst Paral lel sparse LU

decomposition on a mesh network of transputers SIAM J Matrix Anal Appl pp