A Sparse Matrix Library in C++ for High Performance Architectures

Jack Dongarra, Andrew Lumsdaine, Xinhui Niu, Roldan Pozo, Karin Remington

Oak Ridge National Laboratory, Mathematical Sciences Section
University of Tennessee, Dept. of Computer Science
University of Notre Dame, Dept. of Computer Science & Engineering

Abstract

We describe an object oriented sparse matrix library in C++ built upon the Level 3 Sparse BLAS proposal [5] for portability and performance across a wide class of machine architectures. The C++ library includes algorithms for various iterative methods and supports the most common sparse data storage formats used in practice. Besides simplifying the interface, the object oriented design allows the same driving code to be used for various sparse matrix formats, thus addressing many of the difficulties encountered with the typical approach to sparse matrix libraries. We emphasize the fundamental design issues of the C++ sparse matrix classes and illustrate their usage with the preconditioned conjugate gradient (PCG) method as an example. Performance results illustrate that these codes are competitive with optimized Fortran 77. We discuss the impact of our design on elegance, performance, maintainability, portability, and robustness.

1 Introduction

Sparse matrices are pervasive in application codes which use finite difference, finite element or finite volume discretizations of PDEs for problems in computational fluid dynamics, structural mechanics, semiconductor simulation and other scientific and engineering applications. Nevertheless, comprehensive libraries for sparse matrix computations have not been developed and integrated to the same degree as those for dense matrices. Several factors contribute to the difficulty of designing such a comprehensive library. Different computer architectures, as well as different applications, call for different sparse matrix data formats in order to best exploit registers, data locality, pipelining, and parallel processing. Furthermore, code involving sparse matrices tends to be very complicated, and not easily portable, because the details of the underlying data formats are invariably entangled within the application code.

To address these difficulties, it is essential to develop codes which are as "data format free" as possible, thus providing the greatest flexibility for using the given algorithm (library routine) in various architecture/application combinations. In fact, the selection of an appropriate data structure can typically be deferred until link or run time. We describe an object oriented C++ library for sparse matrix computations which provides a unified interface for various iterative solution techniques across a variety of sparse data formats.

The design of the library is based on the following principles:

Clarity: Implementations of numerical algorithms should resemble the mathematical algorithms on which they are based. This is in contrast to Fortran 77, which can require complicated subroutine calls, often with parameter lists that stretch over several lines.

Reuse: A particular algorithm should only need to be coded once, with identical code used for all matrix representations.

* This project was supported in part by the Defense Advanced Research Projects Agency under contract DAAL03-91-C-0047, administered by the Army Research Office, the Applied Mathematical Sciences subprogram of the Office of Energy Research, U.S. Department of Energy, under Contract DE-AC05-84OR21400, and by the National Science Foundation Science and Technology Center Cooperative Agreement No. CCR-8809615, and NSF grant No. CCR-9209815.

Portability: Implementations of numerical algorithms should be directly portable across machine platforms.

High Performance: The object oriented library code should perform as well as optimized data-format-specific code written in C or Fortran.

To achieve these goals the sparse matrix library has to be more than just an unrelated collection of matrix objects. Adhering to true object oriented design philosophies, inheritance and polymorphism are used to build a matrix hierarchy in which the same codes can be used for various dense and sparse linear algebra computations across sequential and parallel architectures.

2 Iterative Solvers

The library provides algorithms for various iterative methods as described in Barrett et al. [1], together with their preconditioned counterparts:

- Jacobi SOR (SOR)
- Conjugate Gradient (CG)
- Conjugate Gradient on Normal Equations (CGNE, CGNR)
- Generalized Minimal Residual (GMRES)
- Minimum Residual (MINRES)
- Quasi-Minimal Residual (QMR)
- Chebyshev Iteration (Cheb)
- Conjugate Gradient Squared (CGS)
- Biconjugate Gradient (BiCG)
- Biconjugate Gradient Stabilized (Bi-CGSTAB)

Although iterative methods have provided much of the motivation for this library, many of the same operations and design issues are addressed for direct methods as well. In particular, some of the most popular preconditioners, such as Incomplete LU Factorization (ILU) [1], have components quite similar to direct methods.

One motivation for this work is that the high level algorithms found in [1] can be easily implemented in C++. For example, take a preconditioned conjugate gradient algorithm, used to solve Ax = b with preconditioner M. The comparison between the pseudocode and the C++ listing appears in figure 1.

Here the operators such as * and += have been overloaded to work with the matrix and vector formats. This code fragment works for all of the supported sparse storage classes and makes use of the Level 3 Sparse BLAS in the matrix-vector multiply A*p.

3 Sparse Matrix Types

We have concentrated on the most commonly used data structures, which account for a large portion of application codes. The library can be arbitrarily extended to user-specific structures and will eventually grow. We hope to incorporate user-contributed extensions in future versions of the software. Matrix classes supported in the initial version of the library include:

Sparse Vector: List of nonzero elements with their index locations. It assumes no particular ordering of elements.

COOR Matrix: Coordinate Storage Matrix. List of nonzero elements with their respective row and column indices. This is the most general sparse matrix format, but it is not very space or computationally efficient. It assumes no ordering of nonzero matrix values.

CRS Matrix: Compressed Row Storage Matrix. Subsequent nonzeros of the matrix rows are stored in contiguous memory locations, and an additional integer array specifies where each row begins. It assumes no ordering among nonzero values within each row, but rows are stored in consecutive order.

CCS Matrix: Compressed Column Storage Matrix, also commonly known as the Harwell-Boeing sparse matrix format [4]. This is a variant of CRS storage where columns, rather than rows, are stored contiguously. Note that the CCS ordering of A is the same as the CRS ordering of A^T.

CDS Matrix: Compressed Diagonal Storage Matrix. Designed primarily for matrices with relatively constant bandwidth; the sub- and super-diagonals are stored in contiguous memory locations.

JDS Matrix: Jagged Diagonal Storage Matrix, also known as ITPACK storage. More space efficient than CDS matrices, at the cost of a gather/scatter operation.

Pseudocode:

    Initial r^(0) = b - A x^(0)
    for i = 1, 2, ...
        solve M z^(i-1) = r^(i-1)
        rho_{i-1} = (r^(i-1))^T z^(i-1)
        if i = 1
            p^(1) = z^(0)
        else
            beta_{i-1} = rho_{i-1} / rho_{i-2}
            p^(i) = z^(i-1) + beta_{i-1} p^(i-1)
        endif
        q^(i) = A p^(i)
        alpha_i = rho_{i-1} / ((p^(i))^T q^(i))
        x^(i) = x^(i-1) + alpha_i p^(i)
        r^(i) = r^(i-1) - alpha_i q^(i)
        check convergence
    end

C++:

    r = b - A*x;
    for (int i = 1; i < maxit; i++) {
        z = M.solve(r);
        rho1 = r * z;
        if (i == 1)
            p = z;
        else {
            beta = rho1 / rho0;
            p = z + p * beta;
        }
        q = A*p;
        alpha = rho1 / (p*q);
        x += alpha * p;
        r -= alpha * q;
        if (norm(r)/norm(b) < tol) break;
        rho0 = rho1;
    }

Figure 1: Pseudocode and C++ comparison of a preconditioned conjugate gradient method in the IML++ package.

BCRS Matrix: Block Compressed Row Storage Matrix. Useful when the sparse matrix is comprised of square dense blocks of nonzeros in some regular pattern. The savings in storage and reduced indirect addressing over CRS can be significant for matrices with large block sizes.

SKS Matrix: Skyline Storage Matrix, also for variable band or profile matrices. Mainly used in direct solvers, but can also be used for handling the diagonal blocks in block matrix factorizations.

In addition, symmetric and Hermitian versions of most of these sparse formats will also be supported. In such cases only an upper or lower triangular portion of the matrix is stored. The trade-off is a more complicated algorithm with a somewhat different pattern of data access. Further details of each data storage format are given in [1] and [6].

Our library contains the common computational kernels required for solving linear systems by many direct and iterative methods. The internal data structures of these kernels are compatible with the proposed Level 3 Sparse BLAS, thus providing the user with a large software base of Fortran 77 modules and application libraries. Just as the dense Level 3 BLAS [3] have allowed for higher performance kernels on hierarchical memory architectures, the Sparse BLAS allow vendors to provide optimized routines taking advantage of indirect addressing hardware, registers, pipelining, caches, memory management, and parallelism on their particular architecture. Standardizing the Sparse BLAS will not only provide efficient codes, but will also ensure portable computational kernels with a common interface.

There are two types of C++ interfaces to basic kernels. The first utilizes simple binary operators for multiplication and addition, and the second is a functional interface which can group triad and more complex operations. The binary operators provide for a simpler interface, e.g. y = A*x denotes a sparse matrix-vector multiply, but may produce less efficient code. The computational kernels include:

- sparse matrix products, C <- alpha op(A) B + beta C
- solution of triangular systems, C <- alpha D op(A)^{-1} B + beta C
- reordering of a sparse matrix (permutations), A <- A op(P)
- conversion of one data format to another, A' <- A

where alpha and beta are scalars, B and C are rectangular matrices, D is a (block) diagonal matrix, A and A' are sparse matrices, and op(A) is either A or A^T.

4 Sparse Matrix Construction

In dealing with issues of I/O, the C++ library is presently designed to support reading and writing to Harwell-Boeing format sparse matrix files [4]. These files are inherently in compressed column storage; however, since sparse matrices in the library can be transformed between various data formats, this is not a severe limitation. File input is embedded as another form of a sparse matrix constructor; a file can be read and transformed into another format using conversions and the iostream operators. In the future, the library will also support other matrix file formats, such as a Matlab compatible format, and IEEE binary formats. Sparse matrices can also be initialized from conventional data and index vectors, thus allowing for a universal interface to import data from C or Fortran modules.

5 Performance

To compare the efficiency of our C++ class designs, we tested the performance of our library modules against optimized Fortran sparse matrix packages. Figures 2 and 3 illustrate the performance of the PCG method with diagonal preconditioning on various common example problems (2D and 3D Laplacian operators: la2d64, la2d128, la3d16, la3d32) between our C++ library and the f77 Sparskit [7] package. In all cases we utilized full optimization of each compiler.

Figure 2: Performance comparison (sparse PCG and I/O, execution time in seconds) of C++ (g++) vs. optimized Fortran codes on a Sun SPARC 10.

Figure 3: Performance comparison (sparse PCG, execution time in seconds) of C++ (xlC) vs. optimized Fortran codes (xlf -O) on an IBM RS/6000 Model 580.

The C++ modules are slightly more efficient since the low-level Sparse BLAS have been fine-tuned for each platform. This shows that it is possible to have an elegant coding interface (as shown in figure 1) and still maintain competitive, if not better, performance with conventional Fortran modules.

It should be pointed out that our intent is not to compare ourselves with a specific library. Sparskit is an excellent Fortran library whose focus is more on portability than ultimate performance. Our goal is simply to demonstrate that the C++ codes can also achieve good performance, compared to Fortran. If both libraries relied on vendor-supplied Sparse BLAS, then their performance difference would be undetectable.

6 Conclusion

Using C++ for numerical computation can greatly enhance clarity, reuse, and portability. In our C++ sparse matrix library, SparseLib++, the details of the underlying sparse matrix data format are completely hidden at the algorithm level. This results in iterative algorithm codes which closely resemble their mathematical denotation. Also, since the library is built upon the Level 3 Sparse BLAS, it provides performance comparable to optimized Fortran.

References

[1] R. Barrett, M. Berry, T. F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine, H. van der Vorst, Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, SIAM Press, 1994.

[2] J. Dongarra, R. Pozo, D. Walker, "LAPACK++: A Design Overview of Object-Oriented Extensions for High Performance Linear Algebra," Proceedings of Supercomputing '93, IEEE Press, 1993, pp. 162-171.

[3] J. Dongarra, J. Du Croz, I. S. Duff, S. Hammarling, "A Set of Level 3 Basic Linear Algebra Subprograms," ACM Trans. Math. Soft., Vol. 16, 1990, pp. 1-17.

[4] I. Duff, R. Grimes, J. Lewis, "Sparse Matrix Test Problems," ACM Trans. Math. Soft., Vol. 15, 1989, pp. 1-14.

[5] I. Duff, M. Marrone, G. Radicati, A Proposal for User Level Sparse BLAS, CERFACS Technical Report TR/PA/92/85, 1992.

[6] M. A. Heroux, A Proposal for a Sparse BLAS Toolkit, CERFACS Technical Report TR/PA/92/90, 1992.

[7] Y. Saad, Sparskit: A Basic Toolkit for Sparse Matrix Computations, May 1990.