Chapter 1 an Improved Algorithm for Parallel Sparse LU Decomposition on a Distributed-Memory Multiprocessor

Chapter An Improved Algorithm for Parallel Sparse LU Decomp osition on a DistributedMemory Multipro cessor y Jacko Koster Rob H Bisseling Abstract In this pap er we present a new parallel algorithm for the LU decomp osition of a general sparse matrix Among its features are matrix redistribution at regular intervals and a dynamic pivot search strategy that adapts itself to the numb er of pivots pro duced Exp erimental results obtained on a network of transputers show that these features considerably improve the p erformance Intro duction This pap er presents an improved version of the parallel algorithm for the LU decomp osition of a general sparse matrix develop ed by van der Stapp en Bisseling and van de Vorst The LU decomp osition of a matrix A A i j n pro duces a unit lower ij triangular matrix L an upp er triangular matrix U a row p ermutation vector and a column p ermutation vector such that LU for i j n A ij i j We assume that A is sparse and nonsingular and that it has an arbitrary pattern of nonzeros with all elements having the same small probability of b eing nonzero A review of parallel algorithms for sparse LU decomp osition can b e found in We use the following notations A submatr ix of a matrix A is the intersection of several rows and columns of A The submatrix AI J I J f n g has domain I J If I fig we use Ai J as shorthand for Afig J The concurrent assignment op erator c d a b denotes the simultaneous assignment of a to c and b to d For any submatrix A nz A denotes the numb er of nonzeros in A For any set I jI j is the cardinality of I Our algorithm is aimed at a distributedmemory messagepassing MIMD multipro cessor with an M N mesh communication network We identify each pro cessor in the mesh with a pair s t s M t N A Cartesian distribution of A is a pair of mappings that assigns matrix element A to pro cessor with M ij i j i and N For pro cessor s t the set I s denotes the lo cal set of row indices j I s fi i I sg Similarly J t fj j J tg i j CERFACS Ave G Coriolis Toulouse Cedex France JackoKostercerfacsfr Part of this work was done while this author was employed at Eindhoven University of Technology y Department of Mathematics Utrecht University PO Box TA Utrecht the Netherlands bisselingmathruunl Part of this work was done while this author was employed at KoninklijkeSh ell Lab oratorium Amsterdam Koster and Bisseling The PARPACKLU algorithm In this section we briey describ e the previous algorithm which we refer to as PARPACKLU This algorithm assigns the nonzero matrix elements to the pro cessors according to the grid distribution dened by i mo d M i mo d N for i n i i Each step of the algorithm consists of a pivot search row and column p ermutations and a multiplerank up date of the reduced matrix AI I At the b eginning of a step I fk n g where k is the numb er of pivots pro cessed so far The pivot search determines a set S of m pivots from the reduced matrix with the following three prop erties First each element i j from S satises the threshold criterion jA j u max jA j ij lj lI where u is a userdened parameter u Ch This ensures numerical stability Second the elements of S have low Markowitz cost to preserve sparsity The Markowitz cost of a nonzero element A in a submatrix AI J equals nz AI j nz Ai J ij Third the elements of S are mutually compatible ie A for i j i j S i j i j A i j ij The compatibility of the pivots enables the algorithm to pro cess them in one step and to p erform a single rankm up date of the reduced matrix After the pivot search the rows and columns of the matrix are p ermuted such that the m m submatrix AI I I fk k m g turns into a diagonal submatrix with S S S the m pivots p ositioned on the diagonal This is followed by the rankm up date which contains the bulk of the oating p oint op erations In this part the set I is subtracted S from I each multiplier column j in AI I is divided by the corresp onding pivot value A S j j and the matrix pro duct AI I AI I is subtracted from AI I S S The new algorithm An outline of the new algorithm is given b elow a detailed description and a program text can b e found in Algorithm Parallel LU decomp osition I s J t fi i n sg fj j n tg i j L U k while k n do b egin nd pivot set S fi j k r k mg r r I J fi k r k mg fj k r k mg S S r r I s J t I s n I J t n J S S register pivots in and p erform multiplerank up date store multipliers in L and pivots and up date rows in U k k m rowredistributeA L colredistributeA U end rowp ermuteL colp ermuteU Improved Parallel Sparse LU Decomposition In the search for pivot candidates we vary the numb er ncol of matrix columns searched p er pro cessor column on the basis of the estimated density of the reduced matrix In the rst steps of the algorithm the reduced matrix is still sparse and the candidate pivots will most likely b e compatible and hence most of them will b ecome pivots As the computation pro ceeds llin causes the reduced matrix to b ecome gradually denser Consequently the probability of nonzero elements b eing compatible decreases and exp ensive searches for large sets of candidate pivots will yield only relatively few compatible pivots Therefore for higher densities of the reduced matrix we decrease ncol When the reduced matrix b ecomes so dense that the pivot search rep eatedly pro duces only one pivot the algorithm switches to a simple search for only one pivot The PARPACKLU algorithm distributes the work load by maintaining the grid distribution and p erforming explicit row and column p ermutations at every step Therefore it often requires the exchange of rows b etween pro cessor rows and similarly for columns b ecause many of the explicit p ermutations cannot b e p erformed lo cally This induces a certain amount of added communication time When implicit p ermutations are used the rows and columns of the matrix remain in place The matrix elements are addressed indirectly and no communication for row or column movements is required This however may lead to a p o or load balance dep ending on the sequence of pivot choices We distribute the load by redistributing the reduced matrix at regular intervals such that jJ tj dn k N e for all t and jI sj dn k M e for all s It suces to move columns from pro cessor columns t for which jJ tj dn k N e to pro cessor columns for which jJ tj dn k N e and similarly for the rows of the matrix The numb er of columns to b e sent left and right is determined using the criterion a column to b e disp osed of is sent in the direction that has the least average surplus p er pro cessor column The rst pro cessor column that has space available accepts a passing column This heuristic reduces the distance over which the rows and columns are communicated The column redistribution by pro cessor column t is done as follows Algorithm Column redistribution if redistribution needed then b egin ceiling dn k N e surplus jJ tj ceiling if surplus then determine n n such that n n surplus l r l r else n n l r redistribute n right r redistribute n left l end The theoretical analysis of shows that the unordered twodimensional doubly linked list is the b est lo cal data structure for parallel sparse LU decomp osition among several plausible candidates We use this data structure for the implementation of the new algorithm the previous algorithm was implemented using the ordered twodimensional singly linked list The present data structure facilitates insertion and deletion of nonzeros Exp erimental results The parallel computer used for our exp eriments and those of is a Parsytec Sup erCluster FT consisting of a square mesh of INMOS T transputers The new algorithm is implemented in ANSI C with extensions for communication and parallelism This Koster and Bisseling Table Time in s of LU decomposition on p processors using a static pivot search strategy Matrix p p p p p p p p IMPCOL B WEST FS STEAM SHL BP JPWH SHERMAN SHERMAN LNS GEMAT implementation allows us to use the parallel program also for exp eriments on an ordinary sequential computer We did not optimise the parallel program for these sequential runs The sequential exp eriments were p erformed on a SUN SPARC mo del workstation The test set of sparse matrices that we use for our exp eriments is taken from It consists of eleven unsymmetric matrices from the HarwellBo eing sparse matrix collection The user of the program has to sp ecify six input parameters The parameter ncol is the initial numb er of matrix columns that is searched for pivot candidates by each pro cessor column The parameter u is used for threshold criterion Pivot candidates with a Markowitz cost higher than M are discarded Here and are input parameters min and M is the lowest Markowitz cost of a pivot candidate After nl ast successive pivot min searches pro duce only one pivot a switch is made to a singlepivot strategy Each time a multiple of f pivots has b een pro cessed the matrix is redistributed First we rep eated the exp eriments from Table to compare the p erformance of the new algorithm to that of the PARPACKLU algorithm The standard static pivot search pro cedure in is most closely approximated by xing ncol and using u and nlast The matrix is redistributed at every step ie

Load more