Scientific Software Libraries for Scalable Architectures


Citation Johnsson, S. Lennart and Kapil K. Mathur. 1994. Scientific Software Libraries for Scalable Architectures. Harvard Computer Science Technical Report TR-19-94.

Citable link http://nrs.harvard.edu/urn-3:HUL.InstRepos:25811003

Terms of Use: This article was downloaded from Harvard University's DASH repository, and is made available under the terms and conditions applicable to Other Posted Material, as set forth at http://nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of-use#LAA

Scientific Software Libraries for Scalable Architectures

S. Lennart Johnsson

Kapil K. Mathur

TR-19-94

August 1994

Parallel Computing Research Group

Center for Research in Computing Technology

Harvard University

Cambridge, Massachusetts

To appear in Parallel Scientific Computing, Springer-Verlag.

Scientific Software Libraries for Scalable Architectures

S. Lennart Johnsson, Thinking Machines Corp. and Harvard University

Kapil K. Mathur, Thinking Machines Corp.

Abstract

Massively parallel processors introduce new demands on software systems with respect to performance, scalability, robustness, and portability. The increased complexity of the memory systems and the increased range of problem sizes for which a given piece of software is used pose serious challenges to software developers. The Connection Machine Scientific Software Library (CMSSL) uses several novel techniques to meet these challenges. The CMSSL contains routines for managing the data distribution, and provides data distribution independent functionality. High performance is achieved through careful scheduling of arithmetic operations and data motion, and through the automatic selection of algorithms at run-time. We discuss some of the techniques used, and provide evidence that the CMSSL has reached the goals of performance and scalability for an important set of applications.

Introduction

The main reason for large-scale parallelism is performance. In order for massively parallel architectures to deliver on the promise of extreme performance compared to conventional supercomputer architectures, an efficiency in resource use close to that of conventional supercomputers is necessary. Achieving high efficiency in using a computing system is mostly a question of efficient use of its memory system. This is the premise on which the Connection Machine Scientific Software Library (the CMSSL) is based.

Another premise for the CMSSL is the notion of scalability. Software systems must be designed to operate on systems and data sets that may vary in size by as much as four orders of magnitude. This level of scalability with respect to computing system size must be accomplished transparently to the user, i.e., the same program must execute not only correctly but also efficiently, without change, over this range in processing capacity and corresponding problem sizes. Moreover, programs should not have to be recompiled for various system sizes. This requirement will be even more important in the future, since over time the assignment of processing nodes to tasks is expected to become much more dynamic than today.

Robustness of software, both with respect to performance and numerical properties, is becoming increasingly important. The memory system in each node is becoming increasingly complex in order to match the speed of the memory system with that of an individual processor. The distributed nature of the total memory compounds the complexity of the memory system. It is imperative that software systems deliver a large fraction of the available performance over a wide range of problem sizes, transparently to the user. For instance, small changes in array sizes should not impact performance in a significant way. Robustness with respect to performance in this sense will increase the demands on the software systems, in particular on the run-time parts of the systems.

Robustness with respect to numerical properties is also becoming increasingly important. The same software may be used for problem sizes over a very wide range. Condition numbers for the largest problems are expected to be significantly worse than for small problems. As a minimum, condition estimators must be provided to allow users to assess the numerical quality of the results. It will also be increasingly necessary to furnish software for ill-conditioned problems, and, whenever possible, automatically choose an appropriate numerical method. Some parallel methods do not have as good a numerical behavior as sequential methods, and this disadvantage often increases with the degree of parallelism. The trade-off between performance and numerical stability and accuracy is very complex. Much research is needed before the choice of algorithm with respect to numerical properties and performance can be automated.

Portability of codes is clearly highly desirable in order to amortize the software investment over as large a usage as possible. Portability is also critical for a rapid adoption of new technology, thus allowing for early benefits from the increased memory sizes, increased performance, or decreased cost/performance offered by new technology. But not all software is portable when performance is taken into account. New architectures like MPPs require new software technology that often lags the hardware technology by several years. Thus, it is important to exploit the architecture of software systems such that architecture-dependent, nonportable software is limited to as few functions as possible, while maintaining portability of the vast amount of application software. One of the purposes of software libraries is to enable portability of application codes without loss of performance.

The Connection Machine Scientific Software Library today has about [...] user-callable functions, covering a wide range of frequent operations in scientific and engineering computation. In this paper we illustrate how the goals of high performance and scalability have been achieved.

The outline of the paper is as follows. In the next few sections we discuss memory systems for scalable architectures and their impact on the sequence-to-storage association used in mapping arrays to the memory system. We then discuss data representations for dense and sparse arrays. The memory system and the data representation define the foundation for the CMSSL. We then present the design goals for the CMSSL, and how these goals have been approached and achieved. The multiple-instance capability of the CMSSL is an extension of the functionality of conventional libraries in the spirit of array operations, and critical to the performance in computations on both distributed and local data sets. The multiple-instance feature is discussed in Section [...]. Scalability and robustness with respect to performance both depend heavily on the ability to automatically select appropriate schedules for arithmetic-logic operations and data motion, and proper algorithms. These issues are discussed by specific examples. A summary is given in Section [...].

Architectural model

High performance computing has depended on elaborate memory systems since the early days of computing. The Atlas introduced virtual memory as a means of making the main, relatively slow, memory appear as fast as a small memory capable of delivering data to the processor at its clock speed. Since the emergence of electronic computers, processors have as a rule been faster than memories, regardless of the technology being used. Today, most computers, conventional supercomputers excepted, use MOS technology for both memories and processors. But the properties of the MOS technology are such that the speed of processors is doubling about every [...] months, while the speed of memories is increasing at a steady rate of about [...] per year.

Since the speed of individual memory units, today primarily built out of MOS memory chips, is very limited, high performance systems require a large number of memory banks (units), even when locality of reference can be exploited. High-end systems have thousands to tens of

Figure [...]: The memory system for distributed memory architectures: memory modules (M), each paired with a processor (P), interconnected by a network.

thousands of memory banks. The aggregate memory bandwidth of such systems far exceeds the bandwidth of a bus. A network of some form is used to interconnect memory modules. The nodes in the network are typically of a low degree, and for most networks independent of the size of the network. A large variety of network topologies can be constructed out of nodes with a limited, fixed degree. Massively parallel architectures have employed two- and three-dimensional mesh topologies, butterfly networks, binary cubes, complete binary trees, and fat-tree topologies.

The speed of the memory chips presents the most severe restriction with respect to performance. The second weakest technological component is the communication system. Constructing a communication system with the capacity of a full crossbar, with a bandwidth equal to the aggregate bandwidth of the memory system, is not feasible for systems of extreme performance, and would represent a considerable expense even for systems where it may be technically feasible. Hence, with a constraint of a network with a lower bandwidth than that of the full memory system, in MPPs processors are placed close to the memory modules, such that whenever locality of reference can be exploited, the potential negative impact upon performance of the limited network capacity is alleviated. This placement of processors and the limited network capacity has a fundamental impact upon the preferred sequence-to-storage association to be used for programming languages. This difference in preferred sequence-to-storage association is a major source of inefficiency in porting codes in conventional languages to MPPs.

The generic architectural model for MPPs used throughout this paper is shown in Figure [...]. The local memory system is shown only schematically. As a minimum, the local memory hierarchy consists of a processor register file and DRAM, but quite often there is at least one level of cache, and sometimes two levels. In systems without a cache, such as the Connection Machine systems, the mode in which the DRAM is operated is important. In addition to the local memory hierarchy, the access time to the memories of other nodes (a processor with associated memory modules and network interface hardware) often is nonuniform. The source of the nonuniformity in access time may be directly related to the distance in the network, which is the case for packet-switched communication systems. In circuit-switched and wormhole routing systems, the distance in itself is often insignificant with respect to access time. However, the longer the routing distance, the more likely it is that contention in the network will arise and hence add to the remote access time.

Dimensionality of the address space

The one-dimensional address space used for the conventional languages is not suitable for most applications on MPPs. A linearized address space may also result in poor performance for multidimensional arrays or nonunit stride accesses in banked and interleaved memory systems. So-called bank conflicts are well known performance limitations caused by the combination of data allocation strategies and access strides. For MPPs, for computations with a uniform use of the address space, a multidimensional address with as many dimensions as are being accessed uniformly is ideal. We discuss this claim through a number of simple but important examples. We first discuss computations dominated by operations on a single array, then consider operations involving two or three arrays.

The Fast Fourier Transform

For the Fast Fourier Transform (the FFT) and many other hierarchical or divide-and-conquer methods, an address space with log2 N dimensions may be ideal, even for a one-dimensional array of extent N. All data references are to data within unit distance in such an address space. This view is particularly useful in the mapping of the arrays to networks of memory units, since it properly models the communication needs.

The FFT computations are uniform across the index space, and the load-balance is independent of whether cyclic or consecutive allocation is used. However, the cyclic data allocation yields lower communication needs than the consecutive allocation, by up to a factor of two for unordered transforms. The reason is that the computations of the FFT always proceed from the high-order to the low-order bit in the index space. With the consecutive allocation, the high-order bits are associated with processor addresses, and must be mapped to local memory addresses before local butterfly computations can be performed. Conserving memory in this remapping means that another remapping is required when the computations are to be performed on the dimensions that were moved from local memory to processor addresses in order to accommodate the move of the leading dimensions into local memory. In the cyclic allocation, the leading dimensions are mapped to local memory from the start.
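The difference between the two allocations can be sketched in a few lines (Python used purely for illustration; the function names `consecutive_node` and `cyclic_node` are ours, not CMSSL's):

```python
def consecutive_node(i, n, p):
    """Consecutive (block) allocation: index i lives on node floor(i / (n/p)).
    The HIGH-order bits of the index select the node."""
    return i // (n // p)          # assume p divides n for simplicity

def cyclic_node(i, n, p):
    """Cyclic allocation: index i lives on node i mod p.
    The LOW-order bits of the index select the node."""
    return i % p

# With n = 16 indices on p = 4 nodes: since the FFT proceeds from the
# high-order to the low-order bits of the index, the cyclic scheme keeps
# the leading (low-order) dimensions local from the start, as described above.
n, p = 16, 4
print([consecutive_node(i, n, p) for i in range(n)])
print([cyclic_node(i, n, p) for i in range(n)])
```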

Direct methods for solution of systems of equations

LU and QR factorization involve only a single array, while the solution of triangular systems involves two or three arrays. Two important distinguishing features of dense factorization are that all data references are global, and that the computations are performed on a diminishing set of indices. The global references consist of pivot selection and the broadcast of the pivot row and column. If a block algorithm is used, sets of rows and columns are treated together, but it does not fundamentally change the reference pattern for the algorithm.

We will first discuss the preferred dimensionality and shape of the address space, then load-balancing. Since the broadcast operations are performed both along rows and columns, a one-dimensional partitioning makes one of these operations local, while for the other a complete pivot row or column must be broadcast. With a consecutive data allocation and a sqrt(N) x sqrt(N) nodal array for the factorization of a P x P matrix, the broadcast operations require the communication of P/sqrt(N) elements, instead of P elements for a one-dimensional partitioning. This argument is too simplistic, in that the communication along the two axes is not the same. But the conclusion is correct: a two-dimensional address space is desirable, and the shape of the local subgrid shall be close to square. Partial pivoting requires additional communication along one axis. Second, since not all indices are involved in all steps, the number of elements per node participating in the broadcast operation is not necessarily P/sqrt(N) and P, respectively. It depends upon what techniques are used for load-balancing, as discussed in [...].

Note that for out-of-core factorization algorithms using panel techniques, with entire columns in primary storage, the shape of the panel to be factored may be extremely rectangular. Hence, the shape of the processing array shall also be extremely rectangular, to yield almost square subgrids.

A cyclic allocation guarantees good load-balance for computations such as LU and QR factorization and triangular system solution. But a good load-balance can be achieved also for a consecutive mapping, by adjusting the elimination order accordingly. To allow for the use of level-[...] LBLAS (Local BLAS), blocking of rows and columns on each node is used. In LU factorization, a blocking of the operations on b rows and columns means that b rows are eliminated at a time from all the other rows. The resulting block-cyclic elimination order yields the desired load-balance, as well as an opportunity to conserve local memory bandwidth. A block-cyclic elimination order was first recommended in [...] for load-balanced solution of banded systems. The result of the factorization is not two block triangular matrices, but block-cyclic triangles. A block-cyclic triangle can be permuted to a block triangular matrix. However, it is not necessary to carry out this permutation for the solution of the block-cyclic triangular system of equations. Indeed, it is desirable to use the block-cyclic triangle for the forward and back substitutions, since the substitution process is load-balanced for the block-cyclic triangles. Using block triangular matrices, stored in a square data array A allocated to nodes with a consecutive data allocation scheme, would result in poor load-balance. For details, as well as modifications necessary for rectangular nodal arrays, see [...].

Note further that for triangular solvers the communication is again of the global nature, and the conclusions about the shape of the address space still apply.

The Alternating Direction Implicit Method

In the Alternating Direction Implicit (ADI) methods, a multidimensional operator is factored into one-dimensional operators that are applied in alternating order. In its most common use, tridiagonal systems of equations are solved along each coordinate direction of a grid. Whether substructured elimination or straight elimination is used, the communication requirements along each coordinate axis are proportional to the area of the surface having the normal aligned with the axis of solution. Hence, regardless of the extent of the axes in the different dimensions, it is again desirable, with respect to minimizing nonlocal references, to minimize the surface area of the subgrids assigned to each node. For a more detailed discussion of ADI on parallel computers, and of cyclic reduction based methods as well as Gaussian elimination based methods for the solution of tridiagonal systems of equations, see [...].

Stencil computations

For stencil computations on three-dimensional arrays, with a stencil symmetric with respect to the axes, the well known minimum surface-to-volume rule dictates that a three-dimensional address space shall be used for optimum locality of reference. For example, for a [...] grid distributed evenly across [...] nodes, each node holds [...] grid points. With a point-centered symmetric stencil in three dimensions, the number of nonlocal grid points that must be referenced is [...] for cubic subgrids of shape [...]. For the standard linearized array mapping used by Fortran or C, the subgrids will be of shape [...]. References along two of the axes are entirely local, but the references along the third axis require access to [...] nonlocal grid points. Thus, the linearized address space requires a factor of [...] more nonlocal references for the stencil computations.

3

Note that if the data array is of shape [...], it is still the case that the ideal local subgrid is of shape [...]. But the ideal shape of the array of processing nodes has changed from a [...] array to a [...] array. This example with simple stencil computations on a three-dimensional array has shown that a multidimensional address space is required in order to maximize the locality of reference. Moreover, the example also shows that the shape of the


Figure [...]: Influence of shared processor configuration on the performance for multiplication of square matrices of size P x P ([...]-bit precision). The shape of the [...]-processor Connection Machine system CM-[...] is Nr x Nc.

address space, i.e., how the indices for each axis are split between a physical (or processor) address and a local memory address, is very important. We have implicitly assumed a consecutive, or block, partitioning in the discussion above. A cyclic partitioning would in fact maximize the number of nonlocal references.
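The surface-to-volume argument above can be made concrete with a small sketch (illustrative Python; the seven-point, face-neighbor count is a simplification of the general symmetric stencil):

```python
# For a point-centered stencil, the nonlocal references per node are
# proportional to the surface area of the local subgrid, so for a fixed
# number of local grid points a cubic subgrid minimizes communication.

def halo_points(shape):
    """Boundary-face grid points referenced off-node for a subgrid of the
    given shape under a face-neighbor (7-point) stencil."""
    sx, sy, sz = shape
    return 2 * (sx * sy + sy * sz + sx * sz)

cubic = (8, 8, 8)        # 512 local points, from a 3-D data distribution
slab  = (512, 1, 1)      # 512 local points, from a linearized 1-D distribution
print(halo_points(cubic))  # 384
print(halo_points(slab))   # 2050
```

Both subgrids hold the same 512 points, yet the slab produced by a linearized mapping references more than five times as many off-node points as the cube.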

Matrix multiplication

We restrict the discussion to the computation C <- C + A x B. In order to minimize the communication, a good strategy is to keep the matrix with the largest number of elements stationary. The other two operands are moved as required for the required indices of the three operands to be present in a given node at the same time. With this underlying strategy, the ideal shape of the address space is such that the stationary matrix has square submatrices in each node. This result can be derived from the fact that the required communication is all-to-all broadcast and/or reduction within rows or columns.

The ideal shape of the address space has been verified on the Connection Machine system CM-[...], and is illustrated in Figure [...]. It confirms that the optimal nodal array shape is square for square matrices. For the matrix shapes used in this experiment, a one-dimensional nodal array aligned with either the row or column axis requires about a factor of six higher execution time than the ideal two-dimensional nodal array shape.

With the proper nodal array shape, a superlinear speedup is achieved for matrix multiplication, since the communication requirements increase in proportion to the matrix size, while the computational requirements grow in proportion to the size to the 3/2 power. The superlinear speedup achieved on the CM-[...] is shown in Table [...].
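The 3/2-power argument can be checked with a line of arithmetic: the data (and hence communication) volume of P x P matrices grows as P^2, while the arithmetic grows as P^3 = (P^2)^(3/2), so the arithmetic performed per communicated element grows linearly in P. A sketch (illustrative only):

```python
def flops_per_element(P):
    """Arithmetic operations per matrix element for C <- C + A * B with
    P x P operands: ~2*P**3 multiply-adds over ~3*P**2 matrix elements."""
    return (2 * P**3) / (3 * P**2)

# Growing P by a factor of 4 yields 4x more arithmetic per communicated
# element, which is why efficiency improves with problem size.
print(flops_per_element(256), flops_per_element(1024))
```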

Data representation

In the previous section we showed that linearized address spaces, as used by the conventional languages, are not compatible with the notion of locality of reference for MPP memory systems. We showed that multidimensional address spaces are required, and that the optimal shape of the address space can be derived from the data reference pattern. In this section we focus on the data

Number of Nodes | Matrix Size P | Mflops Overall | Mflops per node | Size for half peak perf., P_1/2

Table [...]: Performance of matrix multiplication in Mflops on the CM-[...].

representation, and how, based on the selected representation, the desired data allocation can be realized. With the exception of the FFT, in all our examples a consecutive data allocation is either preferred, or the choice between cyclic and consecutive allocation is immaterial with respect to performance. Thus, for simplicity, we will assume a consecutive data allocation in this section.

Dense arrays

For the allocation of dense arrays, we have seen that subgrids with equal axes extents are either optimal or close to optimal for several common and important computations. Hence, as a default, without sophisticated data reference analysis, an allocation creating subgrids with axes extents of as equal a length as possible is sensible and feasible.

Grid Sparse Matrices

For sparse arrays, the data representation is less obvious, even for sparse matrices originating from regular grids. Such matrices typically consist of a number of nonzero diagonals. For instance, consider the case with a seven-point, centered difference stencil in three dimensions. The stencil computation can be represented as a matrix-vector multiplication y = Ax, where x and y are grid point values and the matrix A represents the stencils at all the grid points. With an N1 x N2 x N3 grid, with stride one along the axis of extent N3 and stride N3 along the axis of length N2, the matrix is of shape N1 N2 N3 x N1 N2 N3, with a nonzero main diagonal, a nonzero diagonal immediately above and below the main diagonal, two nonzero diagonals at distance N3 above and below the main diagonal, and two nonzero diagonals at distance N2 N3 above and below the main diagonal.

A common representation in Fortran is either to use a set of one-dimensional arrays, one for each nonzero diagonal, or a single one-dimensional array with the nonzero diagonals appended to each other. However, neither of these representations is suitable for MPP memory systems, since preservation of locality of reference for matrix-vector multiplication is likely to be lost.

A natural representation for grid-sparse matrices and grid-point vectors is to tie the representation directly to the grid, rather than to the matrix representation of the grid. Grid-point vectors are represented as multidimensional arrays, with one axis for each axis of the grid, plus an axis for the grid-point vector. The grid-axes extents are the same as the lengths of the corresponding grid axes. A grid-sparse matrix is represented in an analogous way. The matrix represents interaction between variables in different grid points. As an example of the grid based representation of a grid-sparse matrix, we consider the common seven-point stencil in three dimensions. Each of the stencil coefficients is represented as a three-dimensional array, A through G, of nodal values of shape (LX, LY, LZ).

No. of partitions | No. of shared edges | % of total edges | No. of shared nodes | % of total nodes

Table [...]: Partitioning of a tetrahedral mesh between concentric spheres.

The corresponding vectors for the operation y = Ax may be represented as X(LX, LY, LZ), Y(LX, LY, LZ), and the computation y = Ax as

Y(x,y,z) = A(x,y,z) X(x,y,z) + B(x,y,z) X(x-1,y,z) + C(x,y,z) X(x+1,y,z)
         + D(x,y,z) X(x,y-1,z) + E(x,y,z) X(x,y+1,z)
         + F(x,y,z) X(x,y,z-1) + G(x,y,z) X(x,y,z+1)
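The grid-based formulation above maps naturally onto array shifts. A sketch in Python/numpy (our own illustration, not the CMSSL interface; the zero padding at the boundary is our simplification):

```python
import numpy as np

def grid_matvec(A, B, C, D, E, F, G, X):
    """y = Ax for the seven-point stencil in grid form:
    Y(x,y,z) = A*X(x,y,z) + B*X(x-1,y,z) + C*X(x+1,y,z)
             + D*X(x,y-1,z) + E*X(x,y+1,z)
             + F*X(x,y,z-1) + G*X(x,y,z+1)."""
    Xp = np.pad(X, 1)                      # zero halo (our boundary choice)
    nx, ny, nz = Xp.shape
    def sh(dx, dy, dz):                    # shifted interior view of Xp
        return Xp[1+dx:nx-1+dx, 1+dy:ny-1+dy, 1+dz:nz-1+dz]
    return (A * sh(0, 0, 0)
            + B * sh(-1, 0, 0) + C * sh(1, 0, 0)
            + D * sh(0, -1, 0) + E * sh(0, 1, 0)
            + F * sh(0, 0, -1) + G * sh(0, 0, 1))

# Discrete Laplacian: A = -6, B..G = 1.  On a constant field the interior
# result is zero, while boundary points see the zero halo.
ones = np.ones((3, 3, 3))
Y = grid_matvec(-6 * ones, ones, ones, ones, ones, ones, ones, ones)
print(Y[1, 1, 1], Y[0, 0, 0])  # 0.0 -3.0
```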

Representation and allocation of arbitrary sparse matrices

The representation and allocation of arbitrary sparse matrices is a very difficult topic, subject to research. Two general partitioning techniques of significant recent interest are the recursive spectral bisection technique proposed by Pothen et al. [...] and the geometric approach proposed by Miller et al. [...]. The recursive spectral bisection technique has been used successfully by Simon [...] for partitioning of finite volume and finite element meshes. A parallel implementation of this technique has been made by Johan [...].

The spectral partitioning technique is based on the eigenvector corresponding to the smallest nonzero eigenvalue of the Laplacian matrix associated with the graph to be partitioned. The Laplacian matrix is constructed such that the smallest eigenvalue is zero, and its corresponding eigenvector consists of all ones. The eigenvector associated with the smallest nonzero eigenvalue is called the Fiedler vector. Grid partitioning for finite volume and finite element methods is often based on a dual mesh, representing finite volumes or elements and their adjacencies, or some approximation thereof, rather than on the graph of nodal points. One advantage of the spectral bisection technique is that it is based on the topology of the graph underlying the sparse matrix. It requires no geometric information. However, it is computationally quite demanding.

The results of applying the spectral bisection technique to a model problem are reported in [...] and shown in Table [...]. A planar grid of tetrahedra between concentric cylinders, with [...] nodes, [...] tetrahedra, and [...] faces, is partitioned using the spectral bisection algorithm. The numbers of shared nodes and edges as a function of the number of partitions are given in the table.

The results of applying the spectral bisection technique to a more realistic finite element application are summarized in Table [...]. The spectral bisection technique in this example offered a reduction in the number of remote references by a factor of [...]. The speedup for the gather operation was a factor of [...], and for the scatter operation the speedup was a factor of [...] (the scatter operation includes the time required for addition, which is unaffected by the partitioning).
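A minimal sketch of the spectral bisection idea described above, assuming a small undirected graph given by its adjacency matrix (Python/numpy for illustration; real partitioners use iterative eigensolvers rather than a dense factorization):

```python
import numpy as np

def spectral_bisect(adj):
    """Split vertices by the sign of their entries in the Fiedler vector:
    build the Laplacian L = D - A and take the eigenvector of the smallest
    nonzero eigenvalue (column 1 of eigh's ascending-order eigenvectors)."""
    degrees = np.diag(adj.sum(axis=1))
    laplacian = degrees - adj
    eigvals, eigvecs = np.linalg.eigh(laplacian)
    fiedler = eigvecs[:, 1]
    return fiedler >= 0            # boolean partition of the vertices

# Two triangles (vertices 0-2 and 3-5) joined by the single edge (2, 3):
# the Fiedler vector separates the triangles, cutting only the bridge.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0
part = spectral_bisect(A)
print(part)
```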

Another important aspect of computations with arbitrary sparse matrices is that, unlike for dense and grid-sparse matrices, address computations cannot be performed by incrementing addresses using fixed strides. For arbitrary sparse matrices, indirect addressing is required. It frequently is the most time consuming part on uniprocessors. On a distributed memory machine, the address

Operation | Standard allocation | Spectral bisection
Partitioning
Gather
Scatter
Computation
Total time

Table [...]: Gather and scatter times in seconds on a [...]-node CM-[...] for [...] time steps, with a [...]-point integration rule, for finite element computations on [...] nodes and [...] elements.

computations involve not only the computation of local addresses, but routing information as well. In an iterative explicit method, the underlying grid may be fixed for several or all iterations. For such computations, it is important with respect to performance to amortize the cost of computing the addresses over as many iterations as possible. Caching this information and reusing it later is important for performance.
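The amortization idea can be sketched as follows (Python for illustration; the schedule construction, edge update rule, and names are ours, not CMSSL's):

```python
# Precompute the gather indices (a "communication schedule") for an
# unstructured mesh once, then reuse them on every time step, so the
# expensive address computation is paid once rather than per iteration.

def build_gather_schedule(edges):
    """Paid once: the source and destination indices each edge reads from."""
    return [u for (u, v) in edges], [v for (u, v) in edges]

def relax_step(x, schedule):
    """One explicit iteration reusing the cached schedule: each edge
    accumulates half of the neighbor value into both endpoints
    (a gather followed by a scatter-add)."""
    src, dst = schedule
    y = list(x)
    for u, v in zip(src, dst):
        y[u] += 0.5 * x[v]
        y[v] += 0.5 * x[u]
    return y

edges = [(0, 1), (1, 2), (2, 0)]            # a small triangle mesh
schedule = build_gather_schedule(edges)     # address computation: once
x = [1.0, 0.0, 0.0]
for _ in range(3):                          # schedule reused every iteration
    x = relax_step(x, schedule)
print(x)
```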

In an arbitrary sparse matrix there is no simple way of encoding the global structure. Yet, arbitrary sparse matrices may still have some local structure, resulting in a block sparse matrix. Taking advantage of such a block structure, for both economy in data representation (data storage) and efficiency of operations, is significantly simplified by explicitly representing the blocks.

Multiple-instance computation

The multiple-instance capability of the CMSSL is consistent with the idea of collective computation inherent in languages with an array syntax. We have already seen how it arises naturally in the ADI method. CMSSL routines are designed to carry out a collection of high level computations on independent sets of operands in a single call, in the same way additions of arrays are carried out through a single statement, or intrinsic functions are applied to each entry in an array, in Fortran 90. To accomplish the same task in a Fortran 77 or C library, the call to a library routine would be embedded in a set of nested loops. The multiple-instance capability not only eliminates loop nests, but also allows for parallelization and optimization without a sophisticated interprocedural data dependence analysis. The multiple-instance feature for parallel computation is necessary for the desired degree of optimization, which goes beyond the capabilities of state-of-the-art compiler and run-time systems.

We discuss the significance of the multiple-instance capability, with respect to performance and simplicity of user code, by considering the computation of the FFT along one of the axes of a two-dimensional array of shape P x Q. We assume a canonical data layout, in which the set of processing nodes is configured as an array of the same rank as the data array, and of a shape making the local subarrays approximately square. The nodal array shape is Nr x Nc, with N = Nr Nc nodes.

With the FFT performed along the P axis, the computations on the two-dimensional array consist of Q independent FFT computations, each on P data elements. We consider three different alternatives for the computation:

1. Maximize the concurrency for each FFT, through the use of a canonical data layout for one-dimensional arrays of size P.

2. Compute each FFT without data relocation.

3. Compute all Q FFTs concurrently, through multiple-instance routines.

Alternative 1 corresponds to the following code fragments:

FOR J = 1 TO Q DO
    TEMP = A(:,J)
    CALL FFT(TEMP, P)
    A(:,J) = TEMP
ENDFOR

SUBROUTINE FFT(B, N)
ARRAY B(N)
! FFT on a one-dimensional array
END FFT

The concurrency in the computation of the FFT is maximized. The data motion prior to the computation of the FFT on a column is a one-to-all personalized communication. The data redistribution corresponds to a change in data allocation from A to TEMP, and back to the original allocation, one column at a time. The arithmetic speedup is limited to min(N, P) for transforms on the P axis.

In Alternative 2, the data redistribution is avoided by computing each instance in-place. An obvious disadvantage with this approach is the poor load-balance. The speedup of the arithmetic is proportional to min(Nr, P) for a transform along the P axis.

FOR J = 1 TO Q DO
    CALL FFT(A, P, Q, J)
ENDFOR

SUBROUTINE FFT(B, N, M, K)
ARRAY B(N, M)
! In-place FFT on column K of array B
END FFT

Finally, using the CMSSL FFT corresponds to Alternative 3. All the different instances of the FFT, represented by the Q columns, are treated in-place in a single call. The concurrency and data layout issues are managed inside the FFT routine. The CMSSL call is of the form CALL FFT(A, DIM), where DIM specifies the axis of the array A subject to transformation. The actual CMSSL call has additional parameters, allowing the calling program to define the subset of axes for which forward transforms are desired, for which axes inverse transforms are desired, and for which axes ordered transforms are desired.

FORALL J DO
    CALL FFT(A(:,J))
ENDFOR

The third choice is clearly preferable, both with respect to communication and arithmetic load-balance. Note that with a single-instance library routine and canonical layouts, Alternative 1 would be realized. Further, for particular situations a noncanonical layout will alleviate the communication problem, but in many cases the communication then appears somewhere else in the application code. Thus, we claim that our discussion based on canonical layouts reflects the situation in typical computations.
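The contrast between looped single-instance calls and a multiple-instance call can be sketched in plain Python. Here a naive `dft` stands in for the single-instance transform; the function names are illustrative and the sketch ignores data distribution entirely, which is precisely what the real multiple-instance routine must manage.

```python
import cmath

def dft(x):
    """Naive O(n^2) DFT; a stand-in for a single-instance FFT."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n)
                for j in range(n))
            for k in range(n)]

def transform_looped(a):
    """Alternatives 1/2 style: transform the Q columns of the P x Q
    array one instance at a time (gather column, transform, scatter)."""
    p, q = len(a), len(a[0])
    out = [row[:] for row in a]
    for col in range(q):
        y = dft([a[i][col] for i in range(p)])
        for i in range(p):
            out[i][col] = y[i]
    return out

def transform_multiple_instance(a):
    """Alternative 3 style: all Q instances handled inside a single
    call, as a multiple-instance routine would do."""
    cols = [dft(list(c)) for c in zip(*a)]
    return [list(r) for r in zip(*cols)]
```

Both produce identical results; the difference lies only in where the concurrency and data motion are managed, which is the essential point of the multiple-instance design.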

[Table 1: Peak local and global performance per node (Mflops per node) and efficiencies achieved for a few different types of computations on the CM-5. Local operations: 2-norm, matrix-vector, matrix-matrix. Global operations: 2-norm, matrix-vector, matrix-matrix, LU-factorization, unstructured grid. The numeric entries and the precision did not survive extraction.]

CMSSL

The primary design goal for the Connection Machine Scientific Software Library (CMSSL) is to provide high level support for most numerical methods, both traditional and recently developed methods, such as hierarchical and multiscale methods and multipole and other fast so-called N-body algorithms, used for large scale scientific and engineering computations. High level support in this context means functionality at a sufficiently high level that architectural characteristics are essentially transparent to the user, yet high performance can be achieved. Specific design goals for the CMSSL include consistency with languages with an array syntax, such as Fortran 90, Connection Machine Fortran, and C*; functionality that is independent of data distribution; multiple-instance capability; support for all four conventional floating-point data types; high performance; scalability across system and problem sizes; robustness; portability; and functionality supporting traditional numerical methods. These goals have had an impact on the architecture of the CMSSL. The first few goals have also impacted the user interfaces. The current version of the CMSSL has a large set of user-callable functions. The library exists on the Connection Machine systems CM-2, CM-200, and CM-5. The CM-5 version consists of about a million lines of code, and so does the CM-2 and CM-200 version.

Table 1 gives a few examples of how the goal of high performance is met by the CMSSL. The table entries for unstructured grid computations actually represent complete applications, while the other entries represent library functions by themselves. Table 2 provides excellent data on how the goal of scalability has been met by the CMSSL, as well as by the CM architecture, over a range of a factor of a thousand in system size. ENSA¹ is an Euler and Navier-Stokes finite element code, while TeraFrac² and MicMac³ are solid mechanics finite element codes.

To first order, the performance per node is independent of the system size, thus demonstrating excellent scalability. For some computations, like matrix multiplication, the efficiency actually increases as a function of system size. For the unstructured grid computations, the performance decreases only by an insignificant amount.

With respect to scientific and engineering computations, the architectural dependence of traditional architectures has mostly been captured in a set of matrix utilities known as the BLAS (Basic Linear Algebra Subprograms). Efficient implementations of this set of routines

¹ Developed at the Division of Applied Mechanics, Stanford University.

² Developed at the Division of Engineering, Brown University, and the Technical University of Denmark.

³ Developed at the Department of Mechanical Engineering, Cornell University.

[Table 2: Performance in Mflops per node over a range of CM-5 system sizes. Columns: number of nodes; dense matrix operations (2-norm, matrix-vector, matrix-matrix, LU-factorization); unstructured grid computations (ENSA, TeraFrac, MicMac). The numeric entries and the precision did not survive extraction.]

are architecture dependent and, for most architectures, are written in assembly code. Most scientific codes achieve high performance when built on top of this set of routines. On distributed memory architectures, a distributed BLAS (DBLAS) is required in addition to a local BLAS (LBLAS) in each node. Moreover, a set of communication routines is required for data motion between nodes. But not all algorithms parallelize well, and there is an algorithmic architectural dependence. Thus, architectural independence of application programs requires higher level functions than the DBLAS, LBLAS, and communication routines. Hence, the CMSSL includes a subset of functions corresponding to traditional libraries such as Linpack, Eispack, LAPACK, FFTPACK, and ITPACK.
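The layering of a distributed BLAS on top of a local BLAS plus communication can be sketched as follows. The row-block partitioning and the function names are invented for illustration; the "communication" (broadcast of x, gather of partial results) is simulated with plain Python list handling.

```python
def local_gemv(a_block, x):
    """Local BLAS-like matrix-vector product on one node's row block."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a_block]

def distributed_gemv(a, x, n_nodes):
    """DBLAS-style sketch: the rows of A are block-partitioned over
    n_nodes, x is broadcast to every node (the communication step),
    each node calls its local BLAS, and the partial results are
    gathered by concatenation."""
    n = len(a)
    block = (n + n_nodes - 1) // n_nodes   # rows per node; last block may be short
    y = []
    for node in range(n_nodes):
        rows = a[node * block:(node + 1) * block]  # this node's row block
        y.extend(local_gemv(rows, x))              # local computation
    return y
```

The distributed result coincides with a single local call on the whole matrix; only the placement of work and the required data motion differ.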

The external architecture of the CMSSL is similar to conventional library systems in that there exists a set of matrix utilities similar to the BLAS, a set of sparse matrix utilities supporting operations on regular and irregular grids, dense and banded direct solvers, and iterative solvers. Fast Fourier transforms are supported for multidimensional transforms. In addition, the CMSSL also includes a few statistical routines, a routine for integration of systems of ordinary differential equations, and a simplex routine for dense systems. The CMSSL also contains a communications library; such libraries are unique to distributed memory machines. The CMSSL also contains tools in the form of two special compilers: a stencil compiler and a communications compiler. Novel ideas in the CMSSL can be found at all levels: in the internal architecture, in the algorithms used, in the automatic selection of algorithms at runtime, and in the local operations in each node.

The CMSSL is a global library: it accepts global, distributed data structures. Internally, the CMSSL consists of a set of library routines executing in each node and a set of communication functions. The communication functions are either part of the Connection Machine Run-Time System or part of the CMSSL. All communication functions that are part of the CMSSL are directly user accessible, and so are the functions in each node. For the global library, these functions are called internally; both the calls and the distributed nature of the data structures are transparent to the user. The internal structure of the CMSSL supports data distribution independent functionality, automatic algorithm selection for best performance for the BLAS, the FFT, and a few other functions, as well as user-specified choices for many other functions. The execution is made through calls to local routines and communication functions. It follows from the internal architecture of the CMSSL that it also has the ability to serve as a nodal library.
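Automatic algorithm selection driven by performance models can be sketched as follows. The candidate algorithms, their cost models, and the problem descriptor are all invented for illustration; they are not CMSSL internals.

```python
def select_algorithm(candidates, problem):
    """Evaluate each candidate's performance model on the problem
    descriptor and return the name of the predicted-cheapest one."""
    return min(candidates, key=lambda c: c["cost"](problem))["name"]

# Hypothetical FFT variants: one pays a data redistribution up front
# to get a favorable layout, the other works on the data in place.
candidates = [
    {"name": "redistribute_then_fft",
     "cost": lambda p: p["redistribution_cost"] + 1.0 * p["n"]},
    {"name": "fft_in_place",
     "cost": lambda p: 3.0 * p["n"]},
]
```

With a cheap redistribution the first variant is predicted cheaper; when redistribution dominates, the in-place variant is selected instead. The selection happens at runtime, once the actual data distribution of the arguments is known.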

Summary

The CMSSL has been designed for performance, scalability, robustness, and portability. The architecture with respect to functionality follows the approach in scientific libraries for sequential architectures. Internally, the CMSSL consists of a nodal library and a set of communication and data distribution functions. The CMSSL provides data distribution independent functionality, and has logic for automatic algorithm selection based on the data distribution for input and output arrays, a collection of algorithms, and performance models.

The performance goals have largely been achieved, both for the local and the global functions. Particular emphasis has been placed on reducing the problem sizes offering half of peak performance. Some peak global performance data were given above. Scalability is excellent: the performance per node has been demonstrated to be largely independent of the number of nodes in the system over a range of a factor of one thousand (Table 2).

Robustness with respect to performance is achieved through the automatic selection of algorithms as a function of data distribution, for both low level and high level functions.

The CMSSL offers portability of user codes without loss of performance. The CMSSL itself has an architecture amenable to portability: it is the same on all Connection Machine platforms. Code for maximum exploitation of the memory hierarchy is in assembly language and thus has limited portability. Some algorithmic changes were also necessary in porting the library to the CM-5. These changes are largely due to the differences in the communication systems, but also due to the MIMD nature of the CM-5.

Acknowledgement

Many people have contributed to the CMSSL. We would like to acknowledge the contributions of Paul Bay, Jean-Philippe Brunet, Steven Daly, Zdenek Johan, David Kramer, Robert L. Krawitz, Woody Lichtenstein, Doug MacDonald, Palle Pedersen, and Leo Unger, all of Thinking Machines Corp.; Ralph Brickner and William George of Los Alamos National Laboratory; Yu Hu of Harvard University; Michel Jacquemin of Yale University; Lars Malinowsky of the Royal Institute of Technology, Stockholm; and Danny Sorensen of Rice University.

The communication functions and some of the numerical routines in the CMSSL rely heavily on algorithms developed under support of the ONR to Yale University, of the AFOSR to Yale and Harvard Universities, and of the NSF and DARPA to Yale and Harvard Universities. Support for the CMSSL has also been provided by ARPA under a contract to Yale University and Thinking Machines Corp.

References

A.J. Beaudoin, P.R. Dawson, K.K. Mathur, U.F. Kocks, and D.A. Korzekwa. Application of polycrystal plasticity to sheet forming. Computer Methods in Applied Mechanics and Engineering, in press.

Jack J. Dongarra, Jeremy Du Croz, Iain Duff, and Sven Hammarling. A Set of Level 3 Basic Linear Algebra Subprograms. Technical report, Argonne National Laboratory, Mathematics and Computer Science Division, August.

M. Fiedler. A property of eigenvectors of nonnegative symmetric matrices and its application to graph theory. Czechoslovak Mathematical Journal.

Zdenek Johan. Data Parallel Finite Element Techniques for Large-Scale Computational Fluid Dynamics. PhD thesis, Department of Mechanical Engineering, Stanford University.

Zdenek Johan, Thomas J.R. Hughes, Kapil K. Mathur, and S. Lennart Johnsson. A data parallel finite element method for computational fluid dynamics on the Connection Machine system. Computer Methods in Applied Mechanics and Engineering, August.

Zdenek Johan, Kapil K. Mathur, S. Lennart Johnsson, and Thomas J.R. Hughes. An efficient communication strategy for Finite Element Methods on the Connection Machine CM-5 system. Computer Methods in Applied Mechanics and Engineering.

S. Lennart Johnsson. Fast banded systems solvers for ensemble architectures. Technical Report YALEU/DCS/RR, Dept. of Computer Science, Yale University, March.

S. Lennart Johnsson. Communication efficient basic linear algebra computations on hypercube architectures. Journal of Parallel and Distributed Computing, April.

S. Lennart Johnsson. Minimizing the communication time for matrix multiplication on multiprocessors. Parallel Computing.

S. Lennart Johnsson. Parallel Architectures and their Efficient Use, chapter Massively Parallel Computing: Data distribution and communication. Springer-Verlag.

S. Lennart Johnsson and Ching-Tien Ho. Spanning graphs for optimum broadcasting and personalized communication in hypercubes. IEEE Transactions on Computers, September.

S. Lennart Johnsson and Ching-Tien Ho. Optimizing tridiagonal solvers for alternating direction methods on Boolean cube multiprocessors. SIAM Journal on Scientific and Statistical Computing.

S. Lennart Johnsson, Ching-Tien Ho, Michel Jacquemin, and Alan Ruttenberg. Computing fast Fourier transforms on Boolean cubes and related networks. In Advanced Algorithms and Architectures for Signal Processing II. Society of Photo-Optical Instrumentation Engineers.

S. Lennart Johnsson, Michel Jacquemin, and Robert L. Krawitz. Communication efficient multiprocessor FFT. Journal of Computational Physics, October.

S. Lennart Johnsson and Kapil K. Mathur. Distributed level 2 and level 3 BLAS. Technical report, Thinking Machines Corp. In preparation.

S. Lennart Johnsson and Luis F. Ortiz. Local Basic Linear Algebra Subroutines (LBLAS) for distributed memory architectures and languages with an array syntax. The International Journal of Supercomputer Applications.

S. Lennart Johnsson, Yousef Saad, and Martin H. Schultz. Alternating direction methods on multiprocessors. SIAM Journal on Scientific and Statistical Computing.

C.L. Lawson, R.J. Hanson, D.R. Kincaid, and F.T. Krogh. Basic Linear Algebra Subprograms for Fortran Usage. ACM Transactions on Mathematical Software, September.

Woody Lichtenstein and S. Lennart Johnsson. Block cyclic dense linear algebra. SIAM Journal on Scientific Computing.

Kapil K. Mathur and S. Lennart Johnsson. All-to-all communication. Technical report, Thinking Machines Corp., December.

Kapil K. Mathur and S. Lennart Johnsson. Multiplication of matrices of arbitrary shape on a Data Parallel Computer. Parallel Computing, July.

Kapil K. Mathur, Alan Needleman, and V. Tvergaard. Ductile failure analyses on massively parallel computers. Computer Methods in Applied Mechanics and Engineering, in press.

N. Metropolis, J. Howlett, and Gian-Carlo Rota, editors. A History of Computing in the Twentieth Century. Academic Press.

Gary L. Miller, Shang-Hua Teng, William Thurston, and Stephen A. Vavasis. Automatic mesh partitioning. In Sparse Matrix Computations: Graph Theory Issues and Algorithms. The Institute of Mathematics and its Applications.

Alex Pothen, Horst D. Simon, and Kang-Pu Liou. Partitioning sparse matrices with eigenvectors of graphs. SIAM Journal on Matrix Analysis and Applications.

Horst D. Simon. Partitioning of unstructured problems for parallel processing. Computing Systems in Engineering.

Tayfun Tezduyar. Private communication.

Thinking Machines Corp. CMSSL for CM Fortran.