A COMPARISON OF PARALLEL SOLVERS FOR DIAGONALLY DOMINANT
AND GENERAL NARROW-BANDED LINEAR SYSTEMS

PETER ARBENZ, ANDREW CLEARY, JACK DONGARRA, AND MARKUS HEGLAND

Abstract. We investigate and compare stable parallel algorithms for solving diagonally dominant and general narrow-banded linear systems of equations. Narrow-banded means that the bandwidth is very small compared with the order of the matrix. The solvers compared are the banded system solvers of ScaLAPACK and those investigated by Arbenz and Hegland. For the diagonally dominant case the algorithms are analogs of the well-known tridiagonal cyclic reduction algorithm, while the inspiration for the general case is the lesser-known bidiagonal cyclic reduction, which allows a clean parallel implementation of partial pivoting. These divide-and-conquer type algorithms complement fine-grained algorithms, which perform well only for wide-banded matrices, and each family of algorithms has a range of problem sizes for which it is superior. We present theoretical analyses as well as numerical experiments conducted on the Intel Paragon.

Key words. narrow-banded linear systems, stable factorization, parallel solution, cyclic reduction, ScaLAPACK

1. Introduction. In this paper we compare implementations of direct parallel methods for solving banded systems of linear equations

$$ A x = b. \qquad (1.1) $$

The n-by-n matrix A is assumed to have lower half-bandwidth $k_l$ and upper half-bandwidth $k_u$, meaning that $k_l$ and $k_u$ are the smallest integers such that

$$ a_{ij} = 0 \quad \text{whenever} \quad j < i - k_l \ \text{ or } \ j > i + k_u. $$

We assume that the matrix A has a narrow band, such that $k_l, k_u \ll n$. Linear systems with wide band can be solved efficiently by methods similar to full system solvers. In particular, parallel algorithms using two-dimensional mappings such as the torus-wrap mapping, with partial pivoting, have achieved reasonable success. The parallelism of these algorithms is the same as that of dense matrix algorithms applied to matrices of size $\min\{k_l, k_u\}$, independent of n, from which it is obvious that small bandwidths severely limit the usefulness of these algorithms even for large n.

Parallel algorithms for the solution of banded linear systems with small bandwidth have been considered by many authors, both because such systems serve as a canonical form of recursive equations and because they have direct applications. The latter include the solution of eigenvalue problems by inverse iteration, spline interpolation and smoothing, and the solution of boundary value problems for ordinary differential equations using finite difference or finite element methods. For these one-dimensional applications the bandwidths are typically very small. The discretisation of partial differential equations leads to applications with slightly larger bandwidths, for example the computation of fluid flow in a long, narrow pipe. In this case the number of grid points orthogonal to the flow direction is much smaller than the number of grid points along the flow, and this results in a matrix whose bandwidth is small relative to the total size of the problem. For these types of problems there is a tradeoff between band solvers and general sparse techniques: the band solver assumes that all of the entries within the band are nonzero (which they are not) and thus performs unnecessary computation, but its data structures are much simpler and there is no indirect addressing as in general sparse methods.

In the second half of the 1980s numerous papers dealt with parallel solvers for the class of nonsymmetric narrow-banded matrices that can be factored stably without pivoting, such as diagonally dominant matrices or M-matrices.

y

Institute of Scienti Computing Swiss Federal Institute of Technology ETH Zurich Switzerland

arbenzinfethzch

z

Center for Applied Scientic Computing Lawrence Livermore National Lab oratory PO Box L Liver

more CA USA aclearyllnlgov

x

Department of Computer Science University of Tennessee Knoxville TN USA

dongarracsutkedu

Computer Sciences Lab oratory RSISE Australian National University Canberra ACT Australia

MarkusHeglandanuedu au


Many of these solvers have grown out of their tridiagonal counterparts. Banded system solvers similar to Wang's tridiagonal solver were presented for shared-memory multiprocessors with a small number of processors. The algorithm that we review in section 2 is related to cyclic reduction (CR). This algorithm has been discussed in detail elsewhere, where the performance of implementations on distributed memory multicomputers like the Intel Paragon or the IBM SP is analyzed as well. Johnsson considered the same algorithm and its implementation on the Thinking Machines CM, which required a different model for the complexity of the interprocessor communication. A similar algorithm has been presented by Hajj and Skelboe for the Intel iPSC hypercube. Cyclic reduction can be interpreted as Gaussian elimination applied to a symmetrically permuted system of equations $(PAP^T)(Px) = Pb$. This interpretation has important consequences; for example, it implies that the algorithm is backward stable. It can also be used to show that the permutation necessarily causes Gaussian elimination to generate fill-in, which in turn increases the computational complexity as well as the memory requirements of the algorithm.

In section 3 we consider algorithms for solving (1.1) for arbitrary narrow-banded matrices A that may require pivoting for stability reasons. This algorithm was proposed and thoroughly discussed by Arbenz and Hegland. It can be interpreted as a generalization of the well-known block tridiagonal cyclic reduction to block bidiagonal matrices, and again it is equivalent to Gaussian elimination applied to a permuted (nonsymmetrically, in the general case) system of equations $(PAQ)(Q^T x) = Pb$. Block bidiagonal cyclic reduction for the solution of banded linear systems was introduced by Hegland. Conroy described an algorithm that performs the first step of our pivoting algorithm but then continues by solving the reduced system sequentially. A further stable algorithm, based on full pivoting, was presented by Wright; this algorithm is, however, difficult to implement and prone to load imbalance. Parallel banded system solvers based on the QR factorization have also been discussed; these algorithms require about twice the amount of work of those considered here.

In section 4 we compare the ScaLAPACK implementations of the two algorithms above with the implementations by Arbenz and by Arbenz and Hegland, respectively, by means of numerical experiments conducted on the Intel Paragon. Corresponding results for the IBM SP are presented in the companion paper. ScaLAPACK is a software package with a diverse user community. Each routine should have an easily intelligible calling sequence (interface) and work with easily manageable data distributions. These constraints may reduce the performance of the code. The other two codes are experimental. They have been optimized for low communication overhead: the number of messages sent among processors and the marshaling process have been minimized for the task of solving a single system of equations. The code does, for instance, not split the LU factorization from the forward elimination, which prohibits the solution of a sequence of systems of equations with the same system matrix without factoring the system matrix over and over again. Our comparisons shall answer the question of how much performance the ScaLAPACK algorithms may have lost through the constraint of being user friendly.

We further continue a discussion, started in earlier work, on the overhead introduced by partial pivoting. As the execution time of the pivoting algorithm was only marginally higher than that of the non-pivoting algorithm, the question arose whether it is necessary to have a pivoting as well as a non-pivoting algorithm in ScaLAPACK. In LAPACK, for instance, all dense and banded systems that are not symmetric positive definite are solved by the same pivoting algorithm. However, in contrast to the serial algorithm, in the parallel algorithm the doubling of the memory requirement is unavoidable (see section 3). We settle this issue in the concluding section.

2. Parallel Gaussian elimination for the diagonally dominant case. In this section we assume that the matrix $A = (a_{ij})_{1 \le i,j \le n}$ in (1.1) is diagonally dominant, i.e., that

$$ |a_{ii}| > \sum_{\substack{j=1 \\ j \ne i}}^{n} |a_{ij}|, \qquad i = 1, \ldots, n. $$

Then the system of equations can be solved by Gaussian elimination without pivoting in the following three steps:


1. factorization of A into A = LU,
2. solution of Lz = b (forward elimination),
3. solution of Ux = z (backward substitution).

The lower and upper triangular Gauss factors L and U are banded with bandwidths $k_l$ and $k_u$, respectively, where $k_l$ and $k_u$ are the half-bandwidths of A. The number of floating point operations for solving the banded system with r right-hand sides by Gaussian elimination is approximately

$$ \varphi_n = 2 k_l k_u n + 2 (k_l + k_u) r n, $$

up to lower-order terms that are independent of n and involve only $k_l$, $k_u$, $\max\{k_l, k_u\}$, and r.
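The small sketch below is our own illustration of this serial building block; the routine scipy.linalg.solve_banded and its diagonal-ordered band storage are standard SciPy, while the helper band_to_ab, the test matrix, and all parameter values are assumptions made purely for demonstration.

```python
import numpy as np
from scipy.linalg import solve_banded

def band_to_ab(A, kl, ku):
    """Pack a dense banded matrix into the diagonal-ordered band storage
    expected by scipy.linalg.solve_banded: ab[ku + i - j, j] = A[i, j]."""
    n = A.shape[0]
    ab = np.zeros((kl + ku + 1, n))
    for j in range(n):
        for i in range(max(0, j - ku), min(n, j + kl + 1)):
            ab[ku + i - j, j] = A[i, j]
    return ab

# small diagonally dominant test system: ones inside the band, large diagonal
n, kl, ku, r = 60, 2, 3, 2
A = np.triu(np.tril(np.ones((n, n)), ku), -kl)        # ones within the band
A += (kl + ku + 1) * np.eye(n)                        # enforce diagonal dominance
X = np.arange(1.0, n * r + 1).reshape(n, r)           # known solution
b = A @ X
x = solve_banded((kl, ku), band_to_ab(A, kl, ku), b)  # banded LU + fwd/back solves
print("max error:", np.max(np.abs(x - X)))
print("flop estimate:", 2 * kl * ku * n + 2 * (kl + ku) * r * n)
```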

For solving (1.1) in parallel on a p-processor multicomputer we partition the matrix A, the solution vector x, and the right-hand side b according to

$$
\begin{pmatrix}
A_1   & B_1^U &       &       &        &       \\
B_1^L & C_1   & D_2^U &       &        &       \\
      & D_2^L & A_2   & B_2^U &        &       \\
      &       & B_2^L & C_2   & \ddots &       \\
      &       &       & \ddots& \ddots & D_p^U \\
      &       &       &       & D_p^L  & A_p
\end{pmatrix}
\begin{pmatrix} x_1 \\ \xi_1 \\ x_2 \\ \xi_2 \\ \vdots \\ x_p \end{pmatrix}
=
\begin{pmatrix} b_1 \\ \beta_1 \\ b_2 \\ \beta_2 \\ \vdots \\ b_p \end{pmatrix},
$$

where $A_i \in \mathbb{R}^{n_i \times n_i}$, $C_i \in \mathbb{R}^{k \times k}$ with $k := \max\{k_l, k_u\}$, $x_i, b_i \in \mathbb{R}^{n_i}$, $\xi_i, \beta_i \in \mathbb{R}^{k}$, and $\sum_{i=1}^{p} n_i + (p-1)k = n$. This block tridiagonal partition is feasible only if $n_i \ge k$. This condition restricts the degree of parallelism, i.e., the maximal number of processors p that can be exploited for parallel execution: $p \le (n + k)/(2k)$. The structure of A and its submatrices is depicted in Fig. 2.1(a) for a small value of p.

Fig. 2.1. Nonzero structure (shaded area) of (a) the original and (b) the block odd-even permuted matrix.

In ScaLAPACK the local portions of A on each processor are stored in the LAPACK scheme, as depicted in Fig. 2.2. This input scheme requires a preliminary step of moving the triangular block $D_i^U$ from processor i-1 to processor i. This transfer can be overlapped with computation and has a negligible effect on the overall performance of the algorithm. The input format used by Arbenz does not require this initial data movement.
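A minimal sketch of this block tridiagonal partition on a dense copy of A follows; it is our own illustration (the actual codes work directly on the distributed band storage), and the equal block sizes chosen here are an assumption.

```python
import numpy as np

def partition_banded(A, p, kl, ku):
    """Split A into the block tridiagonal partition of this section:
    diagonal blocks A_i, separators C_i (k-by-k with k = max(kl, ku)),
    and the coupling blocks B_i^U, B_i^L, D_i^U, D_i^L."""
    n = A.shape[0]
    k = max(kl, ku)
    ni = (n - (p - 1) * k) // p                          # nominal block size
    assert ni >= k, "partition infeasible: need n_i >= k"
    sizes = [ni] * (p - 1) + [n - (p - 1) * (ni + k)]    # last block absorbs the rest
    starts = [i * (ni + k) for i in range(p)]            # first index of x_i
    blocks = []
    for i in range(p):
        s, m = starts[i], sizes[i]
        blk = {"A": A[s:s+m, s:s+m]}
        if i < p - 1:                                    # separator between blocks i, i+1
            t = s + m
            blk["BU"] = A[s:s+m, t:t+k]                  # B_i^U
            blk["BL"] = A[t:t+k, s:s+m]                  # B_i^L
            blk["C"]  = A[t:t+k, t:t+k]                  # C_i
        if i > 0:                                        # coupling to separator i-1
            t = s - k
            blk["DU"] = A[t:t+k, s:s+m]                  # D_i^U (the moved block)
            blk["DL"] = A[s:s+m, t:t+k]                  # D_i^L
        blocks.append(blk)
    return blocks
```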


Fig. 2.2. Storage scheme of the band matrix. The thick lines frame the local portions of A.

We now execute the first step of block cyclic reduction. This is best regarded as block Gaussian elimination of the block odd-even permuted system

$$
\begin{pmatrix}
A_1   &       &        &       & B_1^U &        &         \\
      & A_2   &        &       & D_2^L & B_2^U  &         \\
      &       & \ddots &       &       & \ddots & \ddots  \\
      &       &        & A_p   &       &        & D_p^L   \\
B_1^L & D_2^U &        &       & C_1   &        &         \\
      & B_2^L & \ddots &       &       & \ddots &         \\
      &       & \ddots & D_p^U &       &        & C_{p-1}
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_p \\ \xi_1 \\ \vdots \\ \xi_{p-1} \end{pmatrix}
=
\begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_p \\ \beta_1 \\ \vdots \\ \beta_{p-1} \end{pmatrix}.
$$

The structure of this matrix is depicted in Fig. 2.1(b). We write the permuted system in the compact form

$$
\begin{pmatrix} A & B^U \\ B^L & C \end{pmatrix}
\begin{pmatrix} x \\ \xi \end{pmatrix} =
\begin{pmatrix} b \\ \beta \end{pmatrix},
$$

where the respective submatrices and subvectors are indicated by the lines in the display above. If $A = LR$ is the ordinary (block diagonal) LU factorization of A, then

$$
\begin{pmatrix} A & B^U \\ B^L & C \end{pmatrix}
= \begin{pmatrix} L & O \\ B^L R^{-1} & I \end{pmatrix}
  \begin{pmatrix} R & L^{-1} B^U \\ O & S \end{pmatrix},
\qquad S = C - B^L A^{-1} B^U.
$$

Fig. 2.3. Fill-in produced by block Gaussian elimination. The bright-shaded areas indicate original nonzeros; the dark-shaded areas indicate the potential fill-in.

The matrices $L_i^{-1} B_i^U$, $L_i^{-1} D_i^L$, $B_i^L R_i^{-1}$, and $D_i^U R_i^{-1}$ overwrite $B_i^U$, $D_i^L$, $B_i^L$, and $D_i^U$, respectively. As $L_i^{-1} D_i^L$ and $D_i^U R_i^{-1}$ suffer from fill-in (cf. Fig. 2.3), additional memory space for about $(k_l + k_u) n$ floating point numbers has to be provided; the overall memory requirement of the parallel algorithm is about twice as high as that of the sequential algorithm. The blocks $L_i^{-1} B_i^U$ and $B_i^L R_i^{-1}$ keep the structure of $B_i^U$ and $B_i^L$ and can be stored at their original places (cf. Fig. 2.3).

The Schur complement S of A in the permuted matrix is a diagonally dominant (p-1)-by-(p-1) block tridiagonal matrix whose blocks are k-by-k,

$$
S = \begin{pmatrix}
T_1 & U_1    &         &         \\
V_2 & T_2    & U_2     &         \\
    & \ddots & \ddots  & U_{p-2} \\
    &        & V_{p-1} & T_{p-1}
\end{pmatrix},
$$

where

$$
\begin{aligned}
T_i &= C_i - B_i^L A_i^{-1} B_i^U - D_{i+1}^U A_{i+1}^{-1} D_{i+1}^L
     = C_i - (B_i^L R_i^{-1})(L_i^{-1} B_i^U) - (D_{i+1}^U R_{i+1}^{-1})(L_{i+1}^{-1} D_{i+1}^L), \\
U_i &= -D_{i+1}^U A_{i+1}^{-1} B_{i+1}^U = -(D_{i+1}^U R_{i+1}^{-1})(L_{i+1}^{-1} B_{i+1}^U), \\
V_i &= -B_i^L A_i^{-1} D_i^L = -(B_i^L R_i^{-1})(L_i^{-1} D_i^L).
\end{aligned}
$$

As indicated in Fig. 2.3, these blocks are not full if $k_l < k$ or $k_u < k$. This is taken into account in the ScaLAPACK implementation, but not in the implementation by Arbenz, where the block tridiagonal CR solver treats the k-by-k blocks as full blocks. Using the block LU factorization above, one gets

$$
\begin{pmatrix} x \\ \xi \end{pmatrix}
= \begin{pmatrix} R & L^{-1} B^U \\ O & S \end{pmatrix}^{-1}
  \begin{pmatrix} L & O \\ B^L R^{-1} & I \end{pmatrix}^{-1}
  \begin{pmatrix} b \\ \beta \end{pmatrix}
=: \begin{pmatrix} R & L^{-1} B^U \\ O & S \end{pmatrix}^{-1}
   \begin{pmatrix} c \\ \gamma \end{pmatrix},
$$

where the sections $c_i$ and $\gamma_i$ of the vectors c and $\gamma$ are given by

$$
c_i = L_i^{-1} b_i, \qquad
\gamma_i = \beta_i - (B_i^L R_i^{-1}) c_i - (D_{i+1}^U R_{i+1}^{-1}) c_{i+1}.
$$

Up to this point of the algorithm no interprocessor communication is necessary, as each processor i independently factors its diagonal block $A_i = L_i R_i$ and computes the blocks $L_i^{-1} B_i^U$, $B_i^L R_i^{-1}$, $L_i^{-1} D_i^L$, $D_i^U R_i^{-1}$, and $L_i^{-1} b_i$. Each processor then forms its portions of the reduced system $S\xi = \gamma$, i.e., the products $(B_i^L R_i^{-1})(L_i^{-1} B_i^U)$, $(B_i^L R_i^{-1})(L_i^{-1} D_i^L)$, $(D_i^U R_i^{-1})(L_i^{-1} B_i^U)$, $(D_i^U R_i^{-1})(L_i^{-1} D_i^L)$, $(B_i^L R_i^{-1}) c_i$, and $(D_i^U R_i^{-1}) c_i$.
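A dense sketch of this communication-free local phase is given below. It is our own illustration under simplifying assumptions: the blocks are passed as dense arrays, a single LU solve replaces the separate use of the two triangular factors, and the function and key names are hypothetical.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def local_reduction(Ai, BU, BL, C, DU, DL, bi):
    """Communication-free local phase on one 'processor'.  Returns the
    fill-in spikes and this processor's contributions to the reduced
    (Schur complement) system; blocks that do not exist on the first or
    last processor may be passed as None."""
    lu = lu_factor(Ai)                                   # A_i = L_i R_i
    Wb = lu_solve(lu, BU) if BU is not None else None    # A_i^{-1} B_i^U
    Wd = lu_solve(lu, DL) if DL is not None else None    # A_i^{-1} D_i^L
    ci = lu_solve(lu, bi)                                # A_i^{-1} b_i
    out = {"spike_BU": Wb, "spike_DL": Wd, "c": ci}
    if BL is not None:                        # processor i also owns B_i^L, C_i
        out["T_own"] = C - BL @ Wb            # first part of T_i
        out["V"] = -BL @ Wd if Wd is not None else None   # V_i
        out["gamma_own"] = -BL @ ci           # part of gamma_i (beta_i added later)
    if DU is not None:                        # contributions destined for row i-1
        out["T_up"] = -DU @ Wd                # second part of T_{i-1}
        out["U_up"] = -DU @ Wb if Wb is not None else None  # U_{i-1}
        out["gamma_up"] = -DU @ ci            # second part of gamma_{i-1}
    return out
```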

Standard theory in Gaussian elimination shows that the reduced system is diagonally dominant One

option is to solve the reduced system on a single pro cessor This may b e reasonable on shared memory

multiprocessors with small pro cessor numbers p but complexity analysis reveals that this

quickly dominates total computation time on multicomputers with many pro cessors An attractive

parallel alternative for solving the system S  is blo ck cyclic reduction Implementationally

the reduction step describ ed ab ove is rep eated until the reduced system b ecomes a dense k byk

system which is trivially solved on a single pro cessor Since the order of the remaining system is

halved in each reduction step blog p c steps are needed Note that the degree of parallelism is

also halved in each step
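For concreteness, here is a small serial sketch of block cyclic reduction for a block tridiagonal system such as $S\xi = \gamma$. It is written for exposition only (dense blocks, recursive formulation, no distribution over processors) and is not the ScaLAPACK or Arbenz-Hegland code.

```python
import numpy as np

def block_cyclic_reduction(D, L, U, d):
    """Solve a block tridiagonal system by cyclic reduction.
    D[i]: diagonal blocks, L[i]: block left of D[i] (used for i >= 1),
    U[i]: block right of D[i] (used for i <= m-2), d[i]: rhs blocks."""
    m = len(D)
    if m == 1:
        return [np.linalg.solve(D[0], d[0])]
    De, Le, Ue, de = [], [], [], []
    for i in range(0, m, 2):                 # even-indexed block rows survive
        Dn, dn = D[i].copy(), d[i].copy()
        Ln, Un = np.zeros_like(D[i]), np.zeros_like(D[i])
        if i - 1 >= 0:                       # eliminate x_{i-1} using row i-1
            F = L[i] @ np.linalg.inv(D[i-1])
            Dn -= F @ U[i-1]; dn -= F @ d[i-1]
            if i - 2 >= 0:
                Ln = -F @ L[i-1]
        if i + 1 <= m - 1:                   # eliminate x_{i+1} using row i+1
            G = U[i] @ np.linalg.inv(D[i+1])
            Dn -= G @ L[i+1]; dn -= G @ d[i+1]
            if i + 2 <= m - 1:
                Un = -G @ U[i+1]
        De.append(Dn); Le.append(Ln); Ue.append(Un); de.append(dn)
    xe = block_cyclic_reduction(De, Le, Ue, de)   # system of half the size
    x = [None] * m
    for j, i in enumerate(range(0, m, 2)):
        x[i] = xe[j]
    for i in range(1, m, 2):                 # back substitute the odd rows
        r = d[i] - L[i] @ x[i-1]
        if i + 1 <= m - 1:
            r -= U[i] @ x[i+1]
        x[i] = np.linalg.solve(D[i], r)
    return x
```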

In order to understand how we proceed with CR, we take another look at how the partitioned matrix is distributed over the p processors. Processor i holds the 2-by-2 diagonal block

$$ \begin{pmatrix} A_i & B_i^U \\ B_i^L & C_i \end{pmatrix} $$

together with the block $D_i^U$ above it and the block $D_i^L$ to the left of it. To obtain a similar situation with the reduced system, we want consecutive pairs of block rows of S, i.e., the 2-by-2 diagonal blocks

$$ \begin{pmatrix} T_{i-1} & U_{i-1} \\ V_i & T_i \end{pmatrix} $$

of S, together with the block U above and the block V to the left of them, to reside on every second processor, which then allows us to proceed with this reduced number of processors as before. To that end, each odd-numbered processor has to send some of its portion of S to its neighboring processors. The even-numbered processors then continue to compute the even-numbered portions of $\xi$. Having done so, the odd-numbered processors receive $\xi_{i-1}$ and $\xi_{i+1}$ from their neighboring processors, which allows them to compute their own portion $\xi_i$ of $\xi$, provided they know the i-th block row of S. This is easily attained if, at the beginning of this CR step, not only the odd-numbered but all processors send the needed information to their neighbors.

Finally, if the vectors $\xi_i$, $i = 1, \ldots, p-1$, are known, each processor can compute its section of x:

$$
\begin{aligned}
x_1 &= R_1^{-1}\bigl(c_1 - L_1^{-1} B_1^U \xi_1\bigr), \\
x_i &= R_i^{-1}\bigl(c_i - L_i^{-1} D_i^L \xi_{i-1} - L_i^{-1} B_i^U \xi_i\bigr), \qquad i = 2, \ldots, p-1, \\
x_p &= R_p^{-1}\bigl(c_p - L_p^{-1} D_p^L \xi_{p-1}\bigr).
\end{aligned}
$$

Notice that the even-numbered processors first have to receive the required $\xi_i$ from their direct neighbors. In the back substitution phase each processor can then again proceed independently, without interprocessor communication.
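The sketch below expresses this final step with the quantities computed in the local-reduction sketch above (fully solved spikes and $c_i = A_i^{-1} b_i$ instead of the two triangular factors); it is an illustration of ours, not the implemented kernels.

```python
def back_substitute(c_i, spike_DL, spike_BU, xi_left, xi_right):
    """x_i = c_i - (A_i^{-1} D_i^L) xi_{i-1} - (A_i^{-1} B_i^U) xi_i,
    with blocks missing on the first/last processor passed as None."""
    x_i = c_i.copy()
    if spike_DL is not None and xi_left is not None:
        x_i -= spike_DL @ xi_left
    if spike_BU is not None and xi_right is not None:
        x_i -= spike_BU @ xi_right
    return x_i
```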

The parallel complexity of the above divide-and-conquer algorithm, as implemented by Arbenz and Hegland, has the form

$$
T^{AH}_{n,p} = O\bigl(k_l k_u + (k_l + k_u) r\bigr)\,\frac{n}{p}
+ \lfloor \log_2 p \rfloor \; O\bigl(k^3 + k^2 r + (k^2 + k r)\, t_w + t_s\bigr),
\qquad k = \max\{k_l, k_u\}.
$$

In the ScaLAPACK implementation the factorization phase is separated from the forward substitution phase. This allows a user to solve several systems of equations without the need to factor the system matrix over and over again. In the implementation by Arbenz several systems can be solved simultaneously, but only in connection with the matrix factorization. The close coupling of factorization and forward substitution reduces communication at the expense of the flexibility of the code. In the ScaLAPACK solver the number of messages sent is higher, while the overall message volume and operation count remain the same:

$$
T^{ScaLAPACK}_{n,p} = T^{AH}_{n,p} + O(t_s)\, \lfloor \log_2 p \rfloor.
$$

The complexity of solving a system with an already factored matrix consists of the terms above containing r, the number of right-hand sides. In both expressions we have assumed that the time for the transmission of a message of length n floating point numbers from one processor to another is independent of the processor distance and can be represented in the form

$$ t(n) = t_s + n\, t_w. $$

Here $t_s$ denotes the startup time relative to the time of a floating point operation, i.e., the number of flops that can be executed during the startup time, and $t_w$ denotes the number of floating point operations that can be executed during the transmission of one word (here an 8-byte floating point number). Notice that $t_s$ is much larger than $t_w$. On the Intel Paragon, for instance, the values of $t_s$ and $t_w$ obtained by comparing the measured transmission time and bandwidth between applications with the Mflop/s rate of the LINPACK benchmark are substantial; on the IBM SP or the SGI/Cray T3D the characteristic numbers $t_s$ and $t_w$ are even bigger.

Dividing the serial operation count $\varphi_n$ by these parallel complexities, the speedups become

$$
S^{AH}_{n,p} = \frac{\varphi_n}{T^{AH}_{n,p}}, \qquad
S^{ScaLAPACK}_{n,p} = \frac{\varphi_n}{T^{ScaLAPACK}_{n,p}}.
$$

The processor number for which the highest speedup is observed is $O(n/k)$. Speedup and efficiency are relatively small, however, due to the high redundancy of the parallel algorithm. Redundancy is the ratio of the serial complexity of the parallel algorithm to that of the serial algorithm, i.e., it indicates the parallelization overhead with respect to floating point operations.

Fig. 2.4. Speedups versus number of processors for three problem sizes (n = 100000, k_l = k_u = 10, r = 10; n = 100000, k_l = k_u = 10, r = 1; n = 20000, k_l = k_u = 10, r = 1), as predicted by the complexity model above. Drawn lines indicate the Arbenz-Hegland implementation, dashed lines the ScaLAPACK implementation.

If r is small, say r = 1, then the redundancy is substantial if $k_l = k_u$, and even higher otherwise; if r is bigger, the redundancy tends to a smaller limiting value. In Fig. 2.4 the speedup is plotted versus the processor number for three different problem sizes, as predicted by the complexity model with the Paragon values chosen for $t_s$ and $t_w$. Because fewer messages are sent in the Arbenz-Hegland implementation than in the ScaLAPACK implementation, the former can be expected to yield slightly higher speedups. However, the gain in flexibility with the ScaLAPACK routine certainly justifies the small performance loss. Notice, though, that the formulas given for $T^{AH}_{n,p}$ and $T^{ScaLAPACK}_{n,p}$ must be considered very approximate. The assumption that all floating point operations take the same amount of time is completely unrealistic on modern RISC processors. Also, the numbers $t_s$ and $t_w$ are crude estimates of reality. Nevertheless, the sawteeth caused by the term $\lfloor \log_2 p \rfloor$ are clearly visible in timings. The numerical experiments of section 4 will give a more realistic comparison of the implementations and will tell more about the value of the above complexity measures.
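To make the shape of such predictions reproducible, the snippet below evaluates a model of this form. All coefficients (c_local, c_red, c_words) and the values of t_s and t_w are illustrative placeholders of ours, not the constants or measurements of the paper.

```python
import math

def predicted_time(n, k, r, p, t_s=10_000.0, t_w=40.0,
                   c_local=2.0, c_red=8.0, c_words=3.0):
    """Model time in flop units: a local phase proportional to n/p plus
    floor(log2 p) reduction steps of O(k^3 + k^2 r) flops and a few
    messages of O(k^2 + k r) words (cf. the expressions above)."""
    if p == 1:
        return 2.0 * k * k * n + 4.0 * k * r * n          # serial banded LU + solves
    local = c_local * (2.0 * k * k + 4.0 * k * r) * n / p
    steps = math.floor(math.log2(p))
    per_step = c_red * (k**3 + k**2 * r) + 2 * (t_s + c_words * (k**2 + k * r) * t_w)
    return local + steps * per_step

def predicted_speedup(n, k, r, p):
    return predicted_time(n, k, r, 1) / predicted_time(n, k, r, p)

# rough shape of the curves in Fig. 2.4 (not their exact values):
print([round(predicted_speedup(100_000, 10, 1, p), 1) for p in (4, 16, 64, 256)])
```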

3. Parallel Gaussian elimination with partial pivoting. In this section we treat the case where A is not diagonally dominant. Then the LU factorization may not exist, or its computation may be unstable. Thus it is advisable to use partial pivoting with the elimination in this case. The corresponding factorization is $PA = LU$, where P is the row permutation matrix. Pivoting additionally requires about $k_l n$ comparisons and n row interchanges. More importantly, the bandwidth of U can get as large as $k_l + k_u$; L loses its bandedness but still has only $k_l$ nonzeros per column and can therefore be stored in its original place. The wider the band of U, the higher the number of arithmetic operations, and our previous upper bound for the floating point operations increases in the worst case to

$$ \varphi^{pp}_n = 2 k_l (k_l + k_u) n + 2 (2 k_l + k_u) r n + \text{lower-order terms independent of } n, $$

where again r is the number of right-hand sides. This bound is obtained by counting the flops for solving a banded system with lower and upper half-bandwidths $k_l$ and $k_l + k_u$, respectively, without pivoting. The overhead introduced by pivoting, which may be as big as a factor of $(k_l + k_u)/k_u$, is inevitable if stability of the LU factorization cannot be guaranteed. Therefore the methods for solving banded systems in packages like LAPACK incorporate partial pivoting and accept the overhead. (This is actually a deficiency of LAPACK which has been eliminated in ScaLAPACK; the memory space wasted is simply too big.) Further, back substitution can be implemented faster if it is known that there are no row interchanges. Note that this overhead is particular to banded and sparse linear systems; it does not occur with dense matrices.
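The small experiment below illustrates these worst-case bounds numerically; it is our own check (the random matrix, sizes, and tolerance are arbitrary assumptions) and relies only on the standard routine scipy.linalg.lu.

```python
import numpy as np
from scipy.linalg import lu

def half_bandwidths(M, tol=1e-14):
    """Lower and upper half-bandwidths of a dense matrix."""
    idx = np.argwhere(np.abs(M) > tol)
    return int((idx[:, 0] - idx[:, 1]).max()), int((idx[:, 1] - idx[:, 0]).max())

rng = np.random.default_rng(0)
n, kl, ku = 40, 2, 3
A = np.triu(np.tril(rng.standard_normal((n, n)), ku), -kl)   # banded, not dd
P, L, U = lu(A)                               # A = P @ L @ U with partial pivoting
print("upper bandwidth of U:", half_bandwidths(U)[1], "<=", kl + ku)
nnz_below = max(int((np.abs(L[j+1:, j]) > 1e-14).sum()) for j in range(n))
print("max nonzeros per column of L below the diagonal:", nnz_below, "<=", kl)
```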

Fig. 3.1. Nonzero structure (shaded area) of (a) the original and (b) the block column permuted band matrix.

The partition of section 2 is not suited for the parallel solution of (1.1) if partial pivoting is to be applied. In order that pivoting can take place independently in the block columns, they must not have elements in the same row. Therefore the separators have to be $k := k_l + k_u$ columns wide (cf. Fig. 3.1(a)). As discussed in detail by Arbenz and Hegland, we consider the matrix A as a cyclic band matrix by moving the last $k_l$ rows to the top. Then we partition this matrix into the cyclic lower block bidiagonal form

$$
\begin{pmatrix}
A_1 &     &     &     &        &     & D_1 \\
B_1 & C_1 &     &     &        &     &     \\
    & D_2 & A_2 &     &        &     &     \\
    &     & B_2 & C_2 &        &     &     \\
    &     &     & \ddots & \ddots &  &     \\
    &     &     &        & D_p & A_p &     \\
    &     &     &        &     & B_p & C_p
\end{pmatrix}
\begin{pmatrix} x_1 \\ \xi_1 \\ x_2 \\ \xi_2 \\ \vdots \\ x_p \\ \xi_p \end{pmatrix}
=
\begin{pmatrix} b_1 \\ \beta_1 \\ b_2 \\ \beta_2 \\ \vdots \\ b_p \\ \beta_p \end{pmatrix},
$$

where the diagonal blocks $A_i$ are square of order $n_i$, the separator blocks $C_i$ are k-by-k with $k = k_l + k_u$, $x_i, b_i \in \mathbb{R}^{n_i}$, $\xi_i, \beta_i \in \mathbb{R}^{k}$, and $\sum_{i=1}^{p} (n_i + k) = n$. If $n_i \ge k$ for all i, the degree of parallelism is p.

For solving Ax = b in parallel we apply a generalization of cyclic reduction that permits pivoting. We again discuss the first step more closely; the later steps are similar, except that the matrix blocks are square. We first formally apply a block odd-even permutation to the columns of the matrix above, i.e., we order the unknowns as $x_1, \ldots, x_p, \xi_1, \ldots, \xi_p$. For p = 4, for example, the permuted system becomes

$$
\begin{pmatrix}
A_1 &     &     &     &     &     &     & D_1 \\
B_1 &     &     &     & C_1 &     &     &     \\
    & A_2 &     &     & D_2 &     &     &     \\
    & B_2 &     &     &     & C_2 &     &     \\
    &     & A_3 &     &     & D_3 &     &     \\
    &     & B_3 &     &     &     & C_3 &     \\
    &     &     & A_4 &     &     & D_4 &     \\
    &     &     & B_4 &     &     &     & C_4
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ \xi_1 \\ \xi_2 \\ \xi_3 \\ \xi_4 \end{pmatrix}
=
\begin{pmatrix} b_1 \\ \beta_1 \\ b_2 \\ \beta_2 \\ b_3 \\ \beta_3 \\ b_4 \\ \beta_4 \end{pmatrix}.
$$

The structure of this matrix is depicted in Fig. 3.1(b). Notice that the permutation that moves the last $k_l$ rows to the top was done for pedagogical reasons: it makes the diagonal blocks $A_i$ and $C_i$ square and the first elimination step formally equal to the successive ones. A different point of view, which leads to the same factorisation, could allow the first and the last diagonal blocks to be nonsquare.

The local matrices are stored in the LAPACK storage scheme for non-diagonally dominant band matrices: in addition to the $k_l + k_u + 1$ rows that store the original portions of the matrix, further rows have to be provided for storing the fill-in. In the ScaLAPACK algorithm processor i stores the blocks $A_i$, $B_i$, $C_i$, $D_i$. In the Arbenz-Hegland implementation processor i stores $A_i$, $B_i$, $C_i$, and $D_i^T$; it is assumed that $D_i$ is stored in an extra k-by-$n_i$ array. The ScaLAPACK algorithm constructs this situation by an initial communication step that consumes a negligible fraction of the overall computing time, as in the discussion in the previous section. In both algorithms the blocks $B_p$, $C_p$, and $D_p$ do not really appear but are incorporated into $A_p$.

Let

$$
P_i \begin{pmatrix} A_i \\ B_i \end{pmatrix}
= L_i \begin{pmatrix} R_i \\ O \end{pmatrix}, \qquad i = 1, \ldots, p,
$$

be the LU factorizations, with partial pivoting, of the $(n_i + k)$-by-$n_i$ blocks on the left of the permuted system, and let

$$
\begin{pmatrix} X_i \\ V_i \end{pmatrix} := L_i^{-1} P_i \begin{pmatrix} D_i \\ O \end{pmatrix}, \qquad
\begin{pmatrix} Y_i \\ T_i \end{pmatrix} := L_i^{-1} P_i \begin{pmatrix} O \\ C_i \end{pmatrix}, \qquad
\begin{pmatrix} c_i \\ \gamma_i \end{pmatrix} := L_i^{-1} P_i \begin{pmatrix} b_i \\ \beta_i \end{pmatrix}.
$$

Then the two block rows owned by processor i are transformed according to

$$
L_i^{-1} P_i
\begin{pmatrix} A_i & D_i & O & b_i \\ B_i & O & C_i & \beta_i \end{pmatrix}
= \begin{pmatrix} R_i & X_i & Y_i & c_i \\ O & V_i & T_i & \gamma_i \end{pmatrix},
$$

and collecting the lower halves of these local results by a block odd-even permutation $\tilde P$ of the rows leaves a block upper triangular system whose trailing block is the reduced system. Here $\tilde P$ denotes the odd-even permutation of the rows. The structure of the matrices before and after this row permutation is depicted in Fig. 3.2(a) and (b), respectively.

Fig. 3.2. Fill-in produced by Gaussian elimination with partial pivoting.

The last step shows that again we end up with a reduced system

$$
S \xi = \gamma, \qquad
S = \begin{pmatrix}
T_1 &        &         & V_1 \\
V_2 & T_2    &         &     \\
    & \ddots & \ddots  &     \\
    &        & V_p     & T_p
\end{pmatrix},
$$

with the same cyclic block bidiagonal structure as the original partitioned matrix A.

The reduced system can be treated as before by $\lceil p/2 \rceil$ processors. This procedure is discussed in detail by Arbenz and Hegland. Finally, if the vectors $\xi_i$, $i = 1, \ldots, p$, are known, each processor can compute its section of x:

$$
\begin{aligned}
x_1 &= R_1^{-1}\bigl(c_1 - X_1 \xi_p - Y_1 \xi_1\bigr), \\
x_i &= R_i^{-1}\bigl(c_i - X_i \xi_{i-1} - Y_i \xi_i\bigr), \qquad i = 2, \ldots, p.
\end{aligned}
$$

The back substitution phase does not need interprocessor communication.
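The following dense sketch mirrors this first elimination step on one 'processor': it eliminates the columns belonging to $A_i$ and $B_i$ with partial pivoting and returns $R_i$, the spikes $X_i, Y_i$, the reduced-system blocks $V_i, T_i$, and the transformed right-hand sides. It is our own illustration; the real codes operate on band storage and never form these dense blocks.

```python
import numpy as np

def eliminate_local_block(A, B, D, C, b, beta):
    """Gaussian elimination with partial pivoting restricted to the columns
    of [A; B] in the stacked local rows [A D 0 b; B 0 C beta]."""
    ni, k = A.shape[1], C.shape[1]
    E = np.block([[A, D,                         np.zeros((A.shape[0], k)), b],
                  [B, np.zeros((B.shape[0], k)), C,                         beta]]).astype(float)
    m = E.shape[0]
    for j in range(ni):                        # eliminate only the first n_i columns
        piv = j + int(np.argmax(np.abs(E[j:, j])))
        E[[j, piv]] = E[[piv, j]]              # row interchange (partial pivoting)
        for i in range(j + 1, m):
            E[i, j:] -= (E[i, j] / E[j, j]) * E[j, j:]
    R           = E[:ni, :ni]                  # upper triangular
    X, Y, c     = E[:ni, ni:ni+k], E[:ni, ni+k:ni+2*k], E[:ni, ni+2*k:]
    V, T, gamma = E[ni:, ni:ni+k], E[ni:, ni+k:ni+2*k], E[ni:, ni+2*k:]
    return R, X, Y, c, V, T, gamma

# once the reduced system has delivered xi_prev and xi_i, the local solution is
# x_i = np.linalg.solve(R, c - X @ xi_prev - Y @ xi_i)
```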

The parallel complexity of this algorithm has the form

$$
T^{pp,AH}_{n,p} = O\bigl(k_l (k_l + k_u) + (k_l + k_u) r\bigr)\,\frac{n}{p}
+ \lfloor \log_2 p \rfloor \; O\bigl(k^3 + k^2 r + (k^2 + k r)\, t_w + t_s\bigr),
\qquad k = k_l + k_u,
$$

in the Arbenz-Hegland implementation and

$$
T^{pp,ScaLAPACK}_{n,p} = T^{pp,AH}_{n,p} + O(t_s)\, \lfloor \log_2 p \rfloor
$$

in the ScaLAPACK implementation. We treat the blocks in S as full k-by-k blocks, as their nonzero pattern is not predictable due to the pivoting process. The speedups of these algorithms are

$$
S^{pp,AH}_{n,p} = \frac{\varphi^{pp}_n}{T^{pp,AH}_{n,p}}, \qquad
S^{pp,ScaLAPACK}_{n,p} = \frac{\varphi^{pp}_n}{T^{pp,ScaLAPACK}_{n,p}}.
$$

Fig. 3.3. Speedups versus number of processors for various problem sizes (n = 100000, k = 20, r = 10; n = 100000, k = 20, r = 1; n = 20000, k = 20, r = 1), as predicted by the formulas above, in linear scale (left) and doubly logarithmic scale (right).

In general, the redundancy of the pivoting algorithm is modest, both for small numbers r of right-hand sides and for large r. Note, though, that the redundancy varies according to the specific sequence of interchanges during partial pivoting; here we have always assumed that the fill-in is the worst possible.

We make particular note of the following: since the matrix is partitioned differently than in the diagonally dominant case, this algorithm results in a different factorization than the diagonally dominant algorithm, even if the original matrix A is diagonally dominant. In particular, the diagonal blocks $A_i$, $i = 1, \ldots, p$, are treated like banded lower triangular matrices. In fact, a diagonally dominant matrix is a somewhat poor case for this algorithm. This is easily seen: by reordering each diagonal block to be lower triangular, we have moved the largest elements away from the diagonal and put them in the middle of each column, thus forcing interchanges in the partial pivoting algorithm where none were necessary for the input matrix.

In Fig. 3.3 we have again plotted predicted speedups for two problem sizes and different numbers of right-hand sides. We do not distinguish between the two implementations, as the $t_s$ term is small compared with the others, even for $k = k_l + k_u$. The plot on the right shows the same data in doubly logarithmic scale. It shows that the speedup is close to ideal for very small processor numbers and then deteriorates. The gap between the actual and the ideal speedup for small processor numbers illustrates the impact of the redundancy.

Remark 1. If the original problem were actually cyclic-banded, the redundancy would be 1, i.e., the parallel algorithm has essentially the same complexity as the serial algorithm. This is so because in this case the serial code also generates fill-in.

Remark 2. Instead of the LU factorizations above, QR factorizations could be computed. This doubles the computational effort but enhances stability. Similar ideas are pursued by Amestoy et al. for the parallel computation of the QR factorization of large sparse matrices.

4. Numerical experiments on the Intel Paragon. We compared the algorithms described in the previous two sections by means of three test problems. The matrix A has all ones within the band and a constant value on the diagonal; the condition numbers grow very large as this diagonal value tends to one. Three problem sizes (n, k_l, k_u) were used: a small, a medium, and a large one (cf. the figures and tables). Estimates of the condition numbers, obtained by Higham's algorithm, are listed in Table 4.1 for various values of the diagonal parameter. The right-hand sides were chosen such that the solution was a known vector, which enabled us to compute the error in the computed solution. We compiled one program for each problem size, adjusting the arrays to just the needed size; in this way we could solve the small problems on one processor, too. When compiling, we chose the highest optimization level and turned off IEEE arithmetic. (With IEEE arithmetic turned on, execution times were erratic and non-reproducible.)
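For reference, test matrices of this kind can be generated as below; this is our reading of the description above (the actual parameter values of the paper are not reproduced), and we use a plain dense condition number instead of Higham's estimator.

```python
import numpy as np

def test_matrix(n, kl, ku, diag):
    """All ones inside the band, the constant 'diag' on the diagonal."""
    A = np.triu(np.tril(np.ones((n, n)), ku), -kl)
    np.fill_diagonal(A, diag)
    return A

# the condition number grows rapidly as the diagonal value approaches one
for d in (30.0, 10.0, 4.0, 1.5):          # d > kl + ku gives diagonal dominance
    A = test_matrix(500, 10, 10, d)
    print(d, "cond_1 ~ %.2e" % np.linalg.cond(A, p=1))
```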

We begin with the discussion of the diagonally dominant case. In Table 4.2 the execution times are listed for all problem sizes.

Table 4.1. Estimated condition numbers of the systems of equations solved in the tables below, for the three problem sizes and various values of the diagonal parameter.

Table 4.2. Diagonally dominant case on the Intel Paragon: selected execution times t (in milliseconds), speedups S = t_1/t_p, and the norm of the error of the computed solution, for the ScaLAPACK and the Arbenz-Hegland implementations and the three problem sizes.

For the ScaLAPACK and the Arbenz-Hegland (AH) implementations the one-processor times are quite close. The difference in this part of the code is that the AH implementation calls the Level 2 BLAS based LAPACK routine dgbtf2 for the triangular factorization, whereas the ScaLAPACK implementation calls the Level 3 BLAS based routine dgbtrf. The latter is advantageous for the wider bandwidth of the large problem, while dgbtf2 performs slightly better for the narrow bands.

The two implementations show a noteworthy difference in their two-processor performance. The ScaLAPACK implementation performs about as fast as on one processor, which is to be expected. The AH implementation, however, loses noticeably. We attribute this loss in performance to the zeroing of auxiliary arrays that will be used to store the fill-in spikes; this is done unnecessarily in the preparation phase of the algorithm.

In ScaLAPACK, the LAPACK routine dtbtrs (based on Level 2 BLAS) is called for forward elimination and backward substitution. In the AH implementation this routine is expanded in place in order to avoid unnecessary checks of whether rows have been exchanged in the factorization phase; this avoids the evaluation of if-statements. In the AH implementation the above-mentioned auxiliary arrays are stored as lying (transposed) blocks to further improve the scalability and to better exploit the RISC architecture of the underlying hardware.

The speedup of the AH implementation relative to its two-processor performance is very close to ideal up to large processor numbers. The ScaLAPACK implementation does not scale as well; for large processor numbers the difference in execution times roughly correlates with the ratio of the numbers of messages sent in the two implementations.

As indicated by the complexity model, the speedups for the medium-size problem are best. The n/p-term, which contains the factorization of the $A_i$ and the computation of the spikes such as $L_i^{-1} D_i^L$ and $D_i^U R_i^{-1}$, consumes five times as much time as with the small problem size and scales very well. This portion is increased further with the large problem size; there, however, the solution of the reduced system also gets expensive.

Table 4.3. Non-diagonally dominant case on the Intel Paragon, small problem size: selected execution times t (in milliseconds), speedups S, and norm errors of the ScaLAPACK and the Arbenz-Hegland implementations for varying diagonal parameter.

We now compare the performance of the ScaLAPACK and the Arbenz-Hegland implementations of the pivoting algorithm of section 3. Tables 4.3, 4.4, and 4.5 contain the respective numbers (execution time, speedup, and norm of the error) for the three problem sizes.

Relative to the AH implementation, the execution times for ScaLAPACK contain overhead proportional to the problem size, mainly the zeroing of elements of work arrays; in the AH implementation this is done while the matrices are built. Therefore the comparison in the non-diagonally dominant case should not be based on execution times but on speedups. Nevertheless, it should be noted that the computing times increase with the difficulty, i.e., with the condition, of the problem. They are, of course, hard or even impossible to predict, as the pivoting sequence is not known in advance. At least the two problems with the smaller bandwidth can be discussed along similar lines: the AH implementation scales better than ScaLAPACK, and its execution times for large processor numbers are a fraction of those of the ScaLAPACK implementation, again reflecting the ratio of messages sent. Notice that here the block size of the reduced system, but also of the fill-in blocks (spikes), is twice as big as in the diagonally dominant case; therefore the performance in Mflop/s is higher here.

Table 4.4. Non-diagonally dominant case on the Intel Paragon, intermediate problem size: selected execution times t (in milliseconds), speedups S, and norm errors of the ScaLAPACK and the Arbenz-Hegland implementations for varying diagonal parameter.

This plays a role mainly in the computation of the fill-in. The redundancy does not have the high weight that the flop count of the previous section indicates; in fact, the pivoting algorithm performs almost as well as, or sometimes even better than, the algorithm for the diagonally dominant case. This may suggest always using the former algorithm. This consideration is correct with respect to computing time. It must, however, be remembered that the pivoting algorithm requires twice as much memory space as the algorithm for the diagonally dominant case, whereas in the serial algorithm the ratio is only about $(2 k_l + k_u)/(k_l + k_u)$. In any case, the overhead for pivoting in the solution of the reduced system by bidiagonal cyclic reduction is not so big that it justifies sacrificing stability.

The picture is different for the largest problem size. Here ScaLAPACK scales quite a bit better than the implementation by Arbenz and Hegland. Reducing the number of messages and the marshaling overhead without regard to the message volume is counterproductive here: with the wide band, the message volume times $t_w$ by far outweighs the accumulated startup times (cf. the communication model of section 2). So, for the largest processor numbers, ScaLAPACK is fastest and yields the highest speedups.
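This tradeoff can be made concrete with the message-time model $t(n) = t_s + n t_w$ introduced in section 2; the parameter values below are placeholders of ours, not measured machine constants.

```python
def message_time(words, t_s=10_000.0, t_w=40.0):
    """Time of one message in flop units, using the model t_s + n * t_w."""
    return t_s + words * t_w

for k in (10, 50):                       # narrow vs. wide separator blocks
    words = k * k                        # one k-by-k block per message
    combined = message_time(2 * words)   # two blocks marshalled into one message
    separate = 2 * message_time(words)   # the same data sent as two messages
    # the saving is exactly one startup t_s; it matters only while k^2 * t_w is small
    print(k, "saving:", separate - combined, "volume term:", 2 * words * 40.0)
```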

5. Conclusion. We have shown that the algorithms implemented in ScaLAPACK are stable and perform reasonably well. The implementations of the same algorithms by Arbenz and Hegland, which are designed to reduce the number of messages communicated, are faster for very small bandwidths; the difference is, however, not too big. The flexibility and versatility of ScaLAPACK justify the loss in performance.

Nevertheless, it may be useful to have in ScaLAPACK a routine that combines the factorization and the solution phase. Appropriate routines would be the drivers pddbsv for the diagonally dominant case and pdgbsv for the non-diagonally dominant case. In the present version of ScaLAPACK the former routine consecutively calls pddbtrf and pddbtrs, and the latter calls pdgbtrf and pdgbtrs, respectively. The storage policy could stay the same, so the flexibility in how to apply the routines remains.

Table 4.5. Non-diagonally dominant case on the Intel Paragon, large problem size: selected execution times t (in milliseconds), speedups S, and norm errors of the two implementations for varying diagonal parameter. The single-processor execution times have been estimated; they are 41089, 45619, 68553, and 68737 ms for ScaLAPACK and 36333, 40785, 64100, and 64308 ms for the Arbenz-Hegland implementation.

We found that the pivoting algorithm does not imply a large computational overhead over the solver for diagonally dominant systems of equations; we even observed shorter solution times in some cases. However, as the pivoting algorithm requires twice as much memory space, it should only be used in uncertain situations. Notice further that on the IBM SP the computational overhead of the pivoting algorithm is also considerable. Therefore we recommend using the algorithms for diagonally dominant systems whenever possible.

REFERENCES

P. R. Amestoy, I. S. Duff, and C. Puglisi, Multifrontal QR factorization in a multiprocessor environment, Numer. Linear Algebra Appl.
E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov, and D. Sorensen, LAPACK Users' Guide, Society for Industrial and Applied Mathematics, Philadelphia, PA. Software and guide are available from Netlib at URL http://www.netlib.org/lapack.
I. J. Anderson and S. K. Harbour, Parallel factorization of banded linear matrices using a systolic array processor, Adv. Comput. Math.
P. Arbenz, On experiments with a parallel direct solver for diagonally dominant banded linear systems, in Euro-Par Parallel Processing, L. Bougé, P. Fraigniaud, A. Mignotte, and Y. Robert, eds., Lecture Notes in Computer Science, Springer, Berlin.
P. Arbenz, A. Cleary, J. Dongarra, and M. Hegland, A comparison of parallel solvers for diagonally dominant and general narrow-banded linear systems II, in Euro-Par Parallel Processing, P. Amestoy, P. Berger, M. Daydé, I. Duff, V. Frayssé, L. Giraud, and D. Ruiz, eds., Lecture Notes in Computer Science, Springer, Berlin.
P. Arbenz and W. Gander, A survey of direct parallel algorithms for banded linear systems, Tech. Report, ETH Zürich, Computer Science Department. Available at URL http://www.inf.ethz.ch/publications.
P. Arbenz and M. Hegland, Scalable stable solvers for non-symmetric narrow-banded linear systems, in Seventh International Parallel Computing Workshop (PCW), P. Mackerras, ed., Australian National University, Canberra, Australia.
P. Arbenz and M. Hegland, On the stable parallel solution of general narrow banded linear systems, in High Performance Algorithms for Structured Matrix Problems, P. Arbenz, M. Paprzycki, A. Sameh, and V. Sarin, eds., Nova Science Publishers, Commack, NY.
L. S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley, ScaLAPACK Users' Guide, Society for Industrial and Applied Mathematics, Philadelphia, PA. Software and guide are available from Netlib at URL http://www.netlib.org/scalapack.
C. de Boor, A Practical Guide to Splines, Springer, New York.
R. Brent, A. Cleary, M. Dow, M. Hegland, J. Jenkinson, Z. Leyk, M. Nakanishi, M. Osborne, P. Price, S. Roberts, and D. Singleton, Implementation and performance of scalable scientific library subroutines on Fujitsu's VPP parallel-vector supercomputer, in Proceedings of the Scalable High-Performance Computing Conference, IEEE Computer Society Press, Los Alamitos, CA.
A. Cleary and J. Dongarra, Implementation in ScaLAPACK of divide-and-conquer algorithms for banded and tridiagonal systems, Tech. Report, University of Tennessee, Knoxville, TN. Available as a LAPACK Working Note from URL http://www.netlib.org/lapack/lawns.
J. M. Conroy, Parallel algorithms for the solution of narrow banded systems, Appl. Numer. Math.
M. J. Daydé and I. S. Duff, The use of computational kernels in full and sparse linear solvers, efficient code design on high-performance RISC processors, in Vector and Parallel Processing (VECPAR), J. M. L. M. Palma and J. Dongarra, eds., Lecture Notes in Computer Science.
J. J. Dongarra and L. Johnsson, Solving banded systems on a parallel processor, Parallel Computing.
J. J. Dongarra and A. H. Sameh, On some parallel banded system solvers, Parallel Computing.
J. Du Croz, P. Mayes, and G. Radicati, Factorization of band matrices using Level 3 BLAS, Tech. Report, University of Tennessee, Knoxville, TN (LAPACK Working Note).
G. H. Golub and C. F. van Loan, Matrix Computations, 2nd ed., The Johns Hopkins University Press, Baltimore, MD.
A. Gupta, F. G. Gustavson, M. Joshi, and S. Toledo, The design, implementation, and evaluation of a symmetric banded linear solver for distributed-memory parallel computers, ACM Trans. Math. Softw.
I. N. Hajj and S. Skelboe, A multilevel parallel solver for block tridiagonal and banded linear systems, Parallel Computing.
M. Hegland, Divide and conquer for the solution of banded linear systems of equations, in Proceedings of the Fourth Euromicro Workshop on Parallel and Distributed Processing, IEEE Computer Society Press, Los Alamitos, CA.
M. Hegland and M. Osborne, Algorithms for block bidiagonal systems on vector and parallel computers, in International Conference on Supercomputing (ICS), D. Gannon, G. Egan, and R. Brent, eds., ACM, New York, NY.
N. J. Higham, Accuracy and Stability of Numerical Algorithms, Society for Industrial and Applied Mathematics, Philadelphia, PA.
R. W. Hockney and C. R. Jesshope, Parallel Computers, 2nd ed., Hilger, Bristol.
S. L. Johnsson, Solving narrow banded systems on ensemble architectures, ACM Trans. Math. Softw.
V. Kumar, A. Grama, A. Gupta, and G. Karypis, Introduction to Parallel Computing, Benjamin/Cummings, Redwood City, CA.
D. Lawrie and A. Sameh, The computation and communication complexity of a parallel banded system solver, ACM Trans. Math. Softw.
U. Meier, A parallel partition method for solving banded systems of linear equations, Parallel Computing.
J. Ortega, Introduction to Parallel and Vector Solution of Linear Systems, Plenum Press, New York.
Y. Saad and M. H. Schultz, Parallel direct methods for solving banded linear systems, Linear Algebra Appl.
A. Sameh and D. Kuck, On stable parallel linear system solvers, J. ACM.
J. Stoer and R. Bulirsch, Introduction to Numerical Analysis, 2nd ed., Springer, New York.
D. W. Walker, T. Aldcroft, A. Cisneros, G. C. Fox, and W. Furmanski, LU decomposition of banded matrices and the solution of linear systems on hypercubes, in The Third Conference on Hypercube Concurrent Computers and Applications, G. Fox, ed., ACM, New York, NY.
S. J. Wright, Parallel algorithms for banded linear systems, SIAM J. Sci. Stat. Comput.

Edited by Jack Dongarra and Erricos John Kontoghiorghes.
