A Sparse Matrix Library in C++ for High Performance Architectures

Jack Dongarra, Andrew Lumsdaine, Xinhui Niu, Roldan Pozo, Karin Remington

Oak Ridge National Laboratory, Mathematical Sciences Section
University of Tennessee, Dept. of Computer Science
University of Notre Dame, Dept. of Computer Science & Engineering
Abstract

We describe an object oriented sparse matrix library in C++ built upon the Level 3 Sparse BLAS proposal [5] for portability and performance across a wide class of machine architectures. The C++ library includes algorithms for various iterative methods and supports the most common sparse data storage formats used in practice. Besides simplifying the subroutine interface, the object oriented design allows the same driving code to be used for various sparse matrix formats, thus addressing many of the difficulties encountered with the typical approach to sparse matrix libraries. We emphasize the fundamental design issues of the C++ sparse matrix classes and illustrate their usage with the preconditioned conjugate gradient (PCG) method as an example. Performance results illustrate that these codes are competitive with optimized Fortran 77. We discuss the impact of our design on elegance, performance, maintainability, portability, and robustness.

(This project was supported in part by the Defense Advanced Research Projects Agency under contract DAAL03-91-C-0047, administered by the Army Research Office, the Applied Mathematical Sciences subprogram of the Office of Energy Research, U.S. Department of Energy, under Contract DE-AC05-84OR21400, and by the National Science Foundation Science and Technology Center Cooperative Agreement No. CCR-8809615, and NSF grant No. CCR-9209815.)

1 Introduction

Sparse matrices are pervasive in application codes which use finite difference, finite element or finite volume discretizations of PDEs for problems in computational fluid dynamics, structural mechanics, semiconductor simulation and other scientific and engineering applications. Nevertheless, comprehensive libraries for sparse matrix computations have not been developed and integrated to the same degree as those for dense matrices. Several factors contribute to the difficulty of designing such a comprehensive library. Different computer architectures, as well as different applications, call for different sparse matrix data formats in order to best exploit registers, data locality, pipelining, and parallel processing. Furthermore, code involving sparse matrices tends to be very complicated, and not easily portable, because the details of the underlying data formats are invariably entangled within the application code.

To address these difficulties, it is essential to develop codes which are as "data format free" as possible, thus providing the greatest flexibility for using the given algorithm (library routine) in various architecture/application combinations. In fact, the selection of an appropriate data structure can typically be deferred until link or run time. We describe an object oriented C++ library for sparse matrix computations which provides a unified interface for various iterative solution techniques across a variety of sparse data formats.

The design of the library is based on the following principles:

Clarity: Implementations of numerical algorithms should resemble the mathematical algorithms on which they are based. This is in contrast to Fortran 77, which can require complicated subroutine calls, often with parameter lists that stretch over several lines.

Reuse: A particular algorithm should only need to be coded once, with identical code used for all matrix representations.
Portability: Implementations of numerical algorithms should be directly portable across machine platforms.

High Performance: The object oriented library code should perform as well as optimized data-format-specific code written in C or Fortran.

To achieve these goals the sparse matrix library has to be more than just an unrelated collection of matrix objects. Adhering to true object oriented design philosophies, inheritance and polymorphism are used to build a matrix hierarchy in which the same codes can be used for various dense and sparse linear algebra computations across sequential and parallel architectures.
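As a minimal sketch of this idea (the class and member names below are illustrative assumptions, not the library's actual interface), a driver routine written against an abstract matrix base class runs unchanged over any storage format that implements the virtual matrix-vector product:

    // Minimal sketch of the "one driver, many formats" idea using an
    // abstract matrix base class.  Names are illustrative only.
    #include <vector>
    #include <cstddef>

    struct SparseMatrix {                       // abstract base: any sparse format
        virtual ~SparseMatrix() {}
        // y = A*x, implemented differently by each storage format
        virtual std::vector<double> matvec(const std::vector<double>& x) const = 0;
    };

    struct CoordMatrix : SparseMatrix {         // coordinate (COOR) storage
        std::vector<std::size_t> row, col;
        std::vector<double> val;
        std::size_t nrows;
        std::vector<double> matvec(const std::vector<double>& x) const {
            std::vector<double> y(nrows, 0.0);
            for (std::size_t k = 0; k < val.size(); ++k)
                y[row[k]] += val[k] * x[col[k]];
            return y;
        }
    };

    // A driver written only in terms of the base class works unchanged for
    // coordinate, compressed row, compressed column, ... representations.
    std::vector<double> residual(const SparseMatrix& A,
                                 const std::vector<double>& x,
                                 const std::vector<double>& b) {
        std::vector<double> r = A.matvec(x);
        for (std::size_t i = 0; i < r.size(); ++i) r[i] = b[i] - r[i];
        return r;
    }

The library's own hierarchy follows the same principle, with the format-specific work delegated to the underlying Sparse BLAS kernels.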
2 Iterative Solvers

The library provides algorithms for various iterative methods as described in Barrett et al. [1], together with their preconditioned counterparts:
- Jacobi SOR (SOR)
- Conjugate Gradient (CG)
- Conjugate Gradient on Normal Equations (CGNE, CGNR)
- Generalized Minimal Residual (GMRES)
- Minimum Residual (MINRES)
- Quasi-Minimal Residual (QMR)
- Chebyshev Iteration (Cheb)
- Conjugate Gradient Squared (CGS)
- Biconjugate Gradient (BiCG)
- Biconjugate Gradient Stabilized (Bi-CGSTAB)
Although iterative methods have provided much of the motivation for this library, many of the same operations and design issues are addressed for direct methods as well. In particular, some of the most popular preconditioners, such as Incomplete LU Factorization (ILU) [1], have components quite similar to direct methods.

One motivation for this work is that the high level algorithms found in [1] can be easily implemented in C++. For example, take a preconditioned conjugate gradient algorithm, used to solve Ax = b with preconditioner M. The comparison between the pseudocode and the C++ listing appears in figure 1. Here the operators such as * and += have been overloaded to work with the matrix and vector formats. This code fragment works for all of the supported sparse storage classes and makes use of the Level 3 Sparse BLAS in the matrix-vector multiply A*p.

    Pseudocode:

        initial r(0) = b - A x(0)
        for i = 1, 2, ...
            solve M z(i-1) = r(i-1)
            rho(i-1) = r(i-1)^T z(i-1)
            if i = 1
                p(1) = z(0)
            else
                beta(i-1) = rho(i-1) / rho(i-2)
                p(i) = z(i-1) + beta(i-1) p(i-1)
            endif
            q(i) = A p(i)
            alpha(i) = rho(i-1) / (p(i)^T q(i))
            x(i) = x(i-1) + alpha(i) p(i)
            r(i) = r(i-1) - alpha(i) q(i)
            check convergence
        end

    C++:

        r = b - A*x;
        for (int i = 1; i < maxiter; i++) {
            z = M.solve(r);
            rho1 = r * z;
            if (i == 1)
                p = z;
            else {
                beta = rho1 / rho0;
                p = z + p * beta;
            }
            q = A * p;
            alpha = rho1 / (p * q);
            x += alpha * p;
            r -= alpha * q;
            rho0 = rho1;
            if (norm(r) / norm(b) < tol) break;
        }

Figure 1: Pseudocode and C++ comparison of a preconditioned conjugate gradient method in the IML++ package.
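In figure 1 the preconditioning step is written simply as z = M.solve(r). As a minimal sketch (the class and member names below are illustrative assumptions rather than the library's actual interface), a diagonal preconditioner of the kind used for the timings in Section 5 could be supplied as:

    // Illustrative diagonal (Jacobi) preconditioner: solve() applies M^{-1}
    // by dividing each residual entry by the corresponding diagonal of A.
    #include <vector>
    #include <cstddef>

    class DiagPreconditioner {
        std::vector<double> diag;                 // diagonal entries of A
    public:
        explicit DiagPreconditioner(const std::vector<double>& d) : diag(d) {}
        // z = M^{-1} r  (element-wise division by the diagonal)
        std::vector<double> solve(const std::vector<double>& r) const {
            std::vector<double> z(r.size());
            for (std::size_t i = 0; i < r.size(); ++i)
                z[i] = r[i] / diag[i];
            return z;
        }
    };

Any object providing such a solve() member can play the role of the preconditioner M in the loop of figure 1.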
3 Sparse Matrix Types

We have concentrated on the most commonly used data structures, which account for a large portion of application codes. The library can be arbitrarily extended to user-specific structures and will eventually grow. We hope to incorporate user-contributed extensions in future versions of the software. Matrix classes supported in the initial version of the library include:

Sparse Vector: List of nonzero elements with their index locations. It assumes no particular ordering of elements.

COOR Matrix: Coordinate Storage Matrix. List of nonzero elements with their respective row and column indices. This is the most general sparse matrix format, but it is not very space or computationally efficient. It assumes no ordering of nonzero matrix values.

CRS Matrix: Compressed Row Storage Matrix. Subsequent nonzeros of the matrix rows are stored in contiguous memory locations, and an additional integer array specifies where each row begins. It assumes no ordering among nonzero values within each row, but rows are stored in consecutive order. (A small worked example is given below.)

CCS Matrix: Compressed Column Storage, also commonly known as the Harwell-Boeing sparse matrix format [4]. This is a variant of CRS storage where columns, rather than rows, are stored contiguously. Note that the CCS ordering of A is the same as the CRS ordering of A^T.

CDS Matrix: Compressed Diagonal Storage. Designed primarily for matrices with relatively constant bandwidth; the sub- and super-diagonals are stored in contiguous memory locations.

JDS Matrix: Jagged Diagonal Storage. Also known as ITPACK storage. More space efficient than CDS matrices at the cost of a gather/scatter operation.

BCRS Matrix: Block Compressed Row Storage. Useful when the sparse matrix is comprised of square dense blocks of nonzeros in some regular pattern. The savings in storage and reduced indirect addressing over CRS can be significant for matrices with large block sizes.

SKS Matrix: Skyline Storage. Also for variable band or profile matrices. Mainly used in direct solvers, but can also be used for handling the diagonal blocks in block matrix factorizations.

In addition, symmetric and Hermitian versions of most of these sparse formats will also be supported. In such cases only an upper or lower triangular portion of the matrix is stored. The trade-off is a more complicated algorithm with a somewhat different pattern of data access. Further details of each data storage format are given in [1] and [6].
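To make the CRS layout concrete, consider the following hypothetical 4x4 example (0-based indices; the array names val, col_ind, and row_ptr follow the conventions of [1] rather than naming the library's internal members):

    // Hypothetical 4x4 example of Compressed Row Storage (0-based indices).
    //
    //      | 10  0  0 -2 |
    //  A = |  3  9  0  0 |
    //      |  0  7  8  7 |
    //      |  0  0  5  1 |
    double val[]     = {10, -2, 3, 9, 7, 8, 7, 5, 1};   // nonzeros, row by row
    int    col_ind[] = { 0,  3, 0, 1, 1, 2, 3, 2, 3};   // column of each nonzero
    int    row_ptr[] = { 0,  2, 4, 7, 9};               // start of each row in val
    // Row i occupies positions row_ptr[i] .. row_ptr[i+1]-1 of val and col_ind.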
Our library contains the common computational kernels required for solving linear systems by many direct and iterative methods. The internal data structures of these kernels are compatible with the proposed Level 3 Sparse BLAS, thus providing the user with a large software base of Fortran 77 modules and application libraries. Just as the dense Level 3 BLAS [3] have allowed for higher performance kernels on hierarchical memory architectures, the Sparse BLAS allow vendors to provide optimized routines taking advantage of indirect addressing hardware, registers, pipelining, caches, memory management, and parallelism on their particular architecture. Standardizing the Sparse BLAS will not only provide efficient codes, but will also ensure portable computational kernels with a common interface.

There are two types of C++ interfaces to the basic kernels. The first utilizes simple binary operators for multiplication and addition, and the second is a functional interface which can group triad and more complex operations. The binary operators provide for a simpler interface, e.g. y = A * x denotes a sparse matrix-vector multiply, but may produce less efficient code. The computational kernels include:

- sparse matrix products, C <- alpha op(A) B + beta C
- solution of triangular systems, C <- alpha D op(A)^{-1} B + beta C
- reordering of a sparse matrix (permutations), A <- A op(P)
- conversion of one data format to another, A' <- A

where alpha and beta are scalars, B and C are rectangular matrices, D is a (block) diagonal matrix, A and A' are sparse matrices, and op(A) is either A or A^T.
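As a simplified illustration of the first kernel, with B and C reduced to single vectors, a reference CRS implementation of y <- alpha*A*x + beta*y might look like the following plain loop; it is meant only as an illustration, not as the tuned Sparse BLAS routine itself:

    // Reference CRS kernel for y <- alpha*A*x + beta*y (B and C taken to be
    // single vectors).  Illustrative only; not the optimized Sparse BLAS code.
    #include <cstddef>

    void crs_matvec(std::size_t nrows,
                    const double* val, const int* col_ind, const int* row_ptr,
                    double alpha, const double* x,
                    double beta, double* y) {
        for (std::size_t i = 0; i < nrows; ++i) {
            double sum = 0.0;
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
                sum += val[k] * x[col_ind[k]];       // gather along row i
            y[i] = alpha * sum + beta * y[i];
        }
    }

The arrays val, col_ind, and row_ptr are the same three arrays shown in the CRS example of Section 3.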
4 Sparse Matrix Construction and I/O

In dealing with issues of I/O, the C++ library is presently designed to support reading and writing of Harwell-Boeing format sparse matrix files [4]. These files are inherently in compressed column storage; however, since sparse matrices in the library can be transformed between various data formats, this is not a severe limitation. File input is embedded as another form of a sparse matrix constructor; a file can be read and transformed into another format using conversions and the iostream operators. In the future, the library will also support other matrix file formats, such as a Matlab-compatible format, and IEEE binary formats. Sparse matrices can also be initialized from conventional data and index vectors, thus allowing for a universal interface to import data from C or Fortran modules.

5 Performance

To compare the efficiency of our C++ class designs, we tested the performance of our library modules against optimized Fortran sparse matrix packages. Figures 2 and 3 illustrate the performance of the PCG method with diagonal preconditioning on various common example problems (2D and 3D Laplacian operators) for our C++ library and the f77 Sparskit [7] package. In all cases we utilized full optimization of each compiler.

[Figure 2: Performance comparison of C++ (g++) vs. optimized Fortran codes on a Sun SPARC 10. Bar chart of execution time in seconds for the sparse PCG test problems la2d64, la2d128, la3d16, and la3d32, comparing SPARSKIT (f77) and SparseLib++ (C++).]

The C++ modules are slightly more efficient since the low-level Sparse BLAS have been fine-tuned for each platform. This shows that it is possible to have an elegant coding interface (as shown in figure 1) and still maintain competitive, if not better, performance compared with conventional Fortran modules. It should be pointed out that our intent is not to compare ourselves with a specific library. Sparskit is an excellent Fortran library whose focus is more on portability than ultimate performance. Our goal is simply to demonstrate that the C++ codes can also achieve good performance compared to Fortran. If both libraries relied on vendor-supplied Sparse BLAS, then their performance difference would be undetectable.

[Figure 3: Performance comparison of C++ (xlC) vs. optimized Fortran codes (xlf -O) on an IBM RS/6000 Model 580. Bar chart of execution time in seconds for the sparse PCG test problems la2d64, la2d128, la3d16, and la3d32, comparing SPARSKIT (f77) and SparseLib++ (C++).]

6 Conclusion

Using C++ for numerical computation can greatly enhance clarity, reuse, and portability. In our C++ sparse matrix library, SparseLib++, the details of the underlying sparse matrix data format are completely hidden at the algorithm level. This results in iterative algorithm codes which closely resemble their mathematical denotation. Also, since the library is built upon the Level 3 Sparse BLAS, it provides performance comparable to optimized Fortran.

References

[1] R. Barrett, M. Berry, T. F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine, H. van der Vorst, Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, SIAM Press, 1994.

[2] J. Dongarra, R. Pozo, D. Walker, "LAPACK++: A Design Overview of Object-Oriented Extensions for High Performance Linear Algebra," Proceedings of Supercomputing '93, IEEE Press, 1993, pp. 162-171.

[3] J. Dongarra, J. Du Croz, I. S. Duff, S. Hammarling, "A Set of Level 3 Basic Linear Algebra Subprograms," ACM Trans. Math. Soft., Vol. 16, 1990, pp. 1-17.

[4] I. Duff, R. Grimes, J. Lewis, "Sparse Matrix Test Problems," ACM Trans. Math. Soft., Vol. 15, 1989, pp. 1-14.

[5] I. Duff, M. Marrone, G. Radicati, A Proposal for User Level Sparse BLAS, CERFACS Technical Report TR/PA/92/85, 1992.

[6] M. A. Heroux, A Proposal for a Sparse BLAS Toolkit, CERFACS Technical Report TR/PA/92/90, 1992.

[7] Y. Saad, Sparskit: A Basic Toolkit for Sparse Matrix Computations, May 1990.