SIAM News, August/September 1995

The Global Array Programming Model
for High Performance Scientific Computing

J. Nieplocha, R.J. Harrison, and R.J. Littlefield
Pacific Northwest Laboratory

Motivated by the characteristics of current parallel architectures, we have developed an approach to the programming of scalable scientific applications that combines some of the best features of the message-passing and shared-memory programming models. Two assumptions permeate our work. The first is that most high performance parallel computers have, and will continue to have, physically distributed memories with non-uniform memory access (NUMA) timing characteristics. NUMA machines work best with application programs that have a high degree of locality in their memory reference patterns. The second assumption is that extra programming effort is, and will continue to be, required to construct such applications. Thus, a recurring theme in our work is the development of techniques and tools that minimize the extra effort required to construct application programs with explicit control of locality.

There are significant tradeoffs among the important considerations of portability, efficiency, and ease of coding. The message-passing programming model is widely used because of its portability. Some applications, however, are too complex to be coded in a message-passing mode if care is to be taken to maintain a balanced computation load and avoid redundant computations. The shared-memory programming model simplifies coding, but it is not portable and often provides little control over interprocessor data transfer costs. Other more recent parallel programming models, represented by such languages and facilities as HPF [1], SISAL [2], PCN [3], Fortran-M [4], Linda [5], and shared virtual memory, address these problems in different ways and to varying degrees. None of these models represents an ideal solution.

Global Arrays (GAs), the approach described here, lead to both simple coding and efficient execution for a class of applications that appears to be fairly common. The key concept of the GA model is that it provides a portable interface through which each process in a MIMD parallel program can independently, asynchronously, and efficiently access logical blocks of physically distributed matrices, with no need for explicit cooperation by other processes. In this respect, it is similar to the shared-memory programming model. In addition, however, the GA model acknowledges that more time is required to access remote data than local data, and it allows data locality to be explicitly specified and used. In these respects, it is similar to message passing.

Figure 1. The memory hierarchy of a typical NUMA architecture, from fastest and smallest to slowest and largest: registers, on-chip cache, off-chip cache, main memory, virtual memory.

NUMA Architecture

The concept of NUMA is important even to the performance of modern sequential personal computers or workstations. On a standard RISC workstation, for instance, good performance of the processors results from algorithms and compilers that optimize usage of the memory hierarchy. The memory hierarchy is formed by registers, on-chip cache, off-chip cache, main memory, and virtual memory (see Figure 1).

If the programmer ignores this structure and constantly flushes the cache or, even worse, thrashes the virtual memory, performance will be seriously degraded. The classic solution to this problem is to access data in blocks small enough to fit in the cache and then ensure that the algorithm makes sufficient use of the encached data to justify the costs of moving the data.
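To make the blocking idea concrete, the sketch below tiles the familiar triple loop of a matrix multiplication so that small tiles of each operand are reused while they remain in cache. The routine name and the block-size parameter nb are purely illustrative and are not part of the GA toolkit; nb would be tuned so that three nb-by-nb tiles fit comfortably in cache.

c     Illustrative cache-blocked matrix multiplication: c = c + a*b.
      subroutine blocked_mxm(n, nb, a, b, c)
      integer n, nb, ii, jj, kk, i, j, k
      double precision a(n,n), b(n,n), c(n,n)
      do jj = 1, n, nb
         do kk = 1, n, nb
            do ii = 1, n, nb
c              work on one nb-by-nb tile of each operand at a time
               do j = jj, min(jj+nb-1, n)
                  do k = kk, min(kk+nb-1, n)
                     do i = ii, min(ii+nb-1, n)
                        c(i,j) = c(i,j) + a(i,k)*b(k,j)
                     enddo
                  enddo
               enddo
            enddo
         enddo
      enddo
      end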

To the NUMA hierarchy of sequential computers, parallel computers add at least one extra layer: remote memory. Access to remote memory on distributed-memory machines is accomplished through message passing. Message passing, in addition to the required cooperation between sender and receiver that makes it difficult to use, introduces degradation of latency and bandwidth in the accessing of remote, as opposed to local, memory.

Scalable shared-memory machines, i.e., architecturally distributed-memory machines with hardware support for shared-memory operations (for example, the KSR-2 or the Convex Exemplar), allow access to remote memory in the same fashion as to local memory. However, this uniform mechanism for accessing local and remote memory should be seen only as a programming convenience: on both shared- and distributed-memory computers, the latency and bandwidth for accessing remote memory are significantly larger than for local memory and therefore must be incorporated into performance models.

If we think about programming of MIMD parallel computers (either shared- or distributed-memory) in terms of NUMA, then parallel computation differs from sequential computation only in terms of concurrency. By focusing on NUMA, we not only have a framework in which to reason about the performance of our parallel algorithms (i.e., memory latency, bandwidth, data and reference locality), we also conceptually unite sequential and parallel computation.

Global Array Model

The GA programming model is motivated by the NUMA characteristics of current parallel architectures. By removing the unnecessary processor interactions required to access remote data in the message-passing paradigm, the GA model greatly simplifies parallel programming and is similar in this respect to the shared-memory programming model. However, the GA model also acknowledges that it is more time consuming to access remote data than local data (i.e., remote memory is yet another layer of NUMA), and it allows data locality to be explicitly specified and used. Advantages of the GA model over a shared-memory programming model include its explicit distinction between local and remote memory and the availability of two distinct mechanisms for accessing local and remote data. Global arrays, instead of hiding the NUMA characteristics, expose them to the programmer and make it possible to write more efficient and scalable parallel programs.

The current GA programming model can be characterized as follows (a brief illustrative sketch appears after the list):

- MIMD parallelism is provided via a multiprocess approach, in which all non-GA data, file descriptors, and so on are replicated or unique to each process.

- Processes can communicate with each other by creating and accessing GA distributed matrices, as well as (if desired) by conventional message passing.

- Matrices are physically distributed block-wise, either regularly or as the Cartesian product of irregular distributions on each axis.

- Each process can independently and asynchronously access any two-dimensional patch of a GA distributed matrix, without requiring cooperation from the application code in any other process.

- Several types of access are supported, including "get," "put," "accumulate" (floating-point sum-reduction), and "get and increment" (integer). This list can be extended as needed.

- Each process is assumed to have fast access to some portion of each distributed matrix, and slower access to the remainder. These speed differences define the data as being local or remote, respectively. However, the numeric difference between local and remote memory access times is unspecified.

- Each process can determine which portion of each distributed matrix is stored locally. Every element of a distributed matrix is guaranteed to be local to exactly one process.
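As a small illustration of this access model, the fragment below has a process ask which patch of a global array it owns, fetch an arbitrary (possibly remote) patch into a local buffer, and atomically accumulate a contribution back, without any action by the processes that own the data. Only ga_create, ga_zero, and ga_put appear verbatim later in this article; ga_nodeid, ga_distribution, ga_get, and ga_acc, and their argument orders, are assumed here by analogy and may differ from the actual toolkit interface.

      integer g_a, me, ilo, ihi, jlo, jhi, ld
      integer ga_nodeid
      double precision buf(100,100), alpha
c
c     which patch of g_a is stored locally on this process?
      me = ga_nodeid()
      call ga_distribution(g_a, me, ilo, ihi, jlo, jhi)
c     one-sided fetch of a (possibly remote) 100 x 100 patch
      ld = 100
      call ga_get(g_a, 1, 100, 1, 100, buf, ld)
c     ... local work on buf ...
c     atomic accumulate of the result back into the same patch;
c     other processes may update overlapping patches concurrently
      alpha = 1.0d0
      call ga_acc(g_a, 1, 100, 1, 100, buf, ld, alpha)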

This model differs from other common models as follows. Unlike HPF, it allows task-parallel access to distributed matrices, including reduction into overlapping patches. Unlike Linda [5], it efficiently provides for sum-reduction and access to overlapping patches. Unlike shared-virtual-memory software facilities, the GA paradigm requires explicit library calls to access data, but it avoids the overhead associated with the maintenance of memory consistency and the handling of virtual page faults. The GA implementation guarantees that all of the required data for a patch can be transferred at the same time. Unlike active messages [6], the GA model does not incorporate the concept of interprocessor cooperation and can thus be implemented efficiently [7] even on shared-memory systems. Finally, unlike some other strategies based on polling, task duration is relatively unimportant in programs that use GAs, which simplifies coding and makes it possible for GA programs to exploit standard library codes without modification.

Global Array Toolkit

This GA interface has been designed in the light of emerging standards. In particular, High Performance Fortran (HPF) will certainly provide the basis for future standards definition for distributed arrays in Fortran. The operations that provide the basic functionality (create, fetch, store, accumulate, gather, scatter, data-parallel operations) can all be expressed as single statements in Fortran-90 array notation and with the data-distribution directives of HPF. The GA model is, however, more general than that of HPF, which currently precludes the use of such operations in MIMD task-parallel code.

Supported Operations

Each GA operation may be categorized as either an implementation-dependent primitive operation or an operation that has been constructed in an implementation-independent fashion from primitive operations. Operations also differ in their implied synchronization. Interfaces to third-party libraries provide a final distinction.

The following primitive operations are invoked collectively by all processes (a minimal lifecycle sketch follows the list):

- create an array, controlling alignment and distribution;

- create an array following a provided template (existing array);

- destroy an array;

- synchronize all processes.
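A minimal lifecycle sketch follows. The ga_create call matches the sample fragment shown later in this article; ga_duplicate, ga_sync, and ga_destroy are assumed names (and calling conventions) for the create-from-template, synchronize, and destroy operations listed above.

      integer g_a, g_b, n, m
c
c     collectively create an n x m double-precision array with
c     minimum block sizes of 10 x 5 (n and m are set elsewhere)
      call ga_create(MT_DBL, n, m, 'A', 10, 5, g_a)
c     create a second array with the same shape and distribution,
c     using the first as a template (assumed name ga_duplicate)
      call ga_duplicate(g_a, g_b, 'B')
c     ... MIMD computation using g_a and g_b ...
c     collective barrier (assumed name ga_sync)
      call ga_sync()
c     collectively release both arrays (assumed name ga_destroy)
      call ga_destroy(g_b)
      call ga_destroy(g_a)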

The following primitive operations can be invoked in true MIMD style by any process, with no implied synchronization with other processes and, unless otherwise stated, with no guaranteed atomicity:

- fetch, store, and atomic accumulate into a rectangular patch of a two-dimensional array;

- gather and scatter array elements;

- atomic read and increment of array elements;

- inquire about the location and distribution of the data;

- directly access local elements of an array to support and/or improve performance of application-specific data-parallel operations.

The following set of BLAS-like data-parallel operations currently available in GAs can easily be extended (efficient implementation can be done in an architecture-independent fashion on top of the GA primitive operations):

- vector operations (e.g., dot-product or scale), optimized by means of direct access to local data to avoid communication;

- matrix operations (e.g., symmetrize), optimized through direct access to local data to reduce communication and data copying;

- matrix multiplication.

The vector, matrix multiplication, copy, and print operations exist in two versions that operate on either entire arrays or specified sections of arrays. The array sections in operations that involve multiple arrays do not have to be conforming; the only requirements are that they be of the same type and contain the same number of elements.

Functionality provided by third-party libraries (standard and generalized real symmetric eigensolvers and linear equation solvers that interface to ScaLAPACK) is made available by using the GA primitives to perform the necessary data rearrangement. The O(N^2) cost of such rearrangements is observed to be negligible in comparison to that of the O(N^3) linear algebra operations. These libraries can internally use any form of parallelism appropriate to the computer system, such as cooperative message passing or shared memory.

Sample Code Fragment

The following code fragment uses the Fortran interface to create an n × m double-precision array, blocked in at least 10 × 5 chunks; after zeroing, a patch is filled from a local array. Undefined values are assumed to be computed elsewhere. The routine ga_create returns the variable g_a as a handle to the global array for subsequent references to the array.

      integer g_a, n, m, ilo, ihi, jlo, jhi, ldim
      double precision local(1:ldim, *)
c
      call ga_create(MT_DBL, n, m, 'A', 10, 5, g_a)
      call ga_zero(g_a)
      call ga_put(g_a, ilo, ihi, jlo, jhi, local, ldim)

This code is very similar in functionality to the following HPF-like statements:

      integer n, m, ilo, ihi, jlo, jhi, ldim
      double precision a(n,m), local(1:ldim, *)
!hpf$ distribute a(block(10), block(5))
c
      a = 0.0
      a(ilo:ihi, jlo:jhi) = local(1:ihi-ilo+1, 1:jhi-jlo+1)

The difference is that this single HPF assignment would be executed in a data-parallel fashion, whereas the global array ga_put operation would be executed in MIMD parallel mode, with each process able to reference different array patches.
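To illustrate that last point, the hypothetical fragment below lets each process of a MIMD program deposit a different column block of the same global array using the ga_put call from the sample above; ga_nodeid and ga_nnodes are assumed names for the process-identity and process-count inquiries implied by the model.

      integer g_a, n, m, me, nproc, jlo, jhi, ldim
      integer ga_nodeid, ga_nnodes
      double precision local(1:ldim, *)
c
c     process identity and count (assumed inquiry functions,
c     with ga_nodeid() returning 0, ..., nproc-1)
      me    = ga_nodeid()
      nproc = ga_nnodes()
c     each process picks its own column block; for simplicity,
c     assume m is a multiple of nproc
      jlo = me*(m/nproc) + 1
      jhi = (me + 1)*(m/nproc)
c     independent, one-sided deposit of this process's columns
      call ga_put(g_a, 1, n, jlo, jhi, local, ldim)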

Supported Platforms and Availability

The public-domain GA toolkit is supported on a wide range of distributed- and shared-memory computer systems, including:

1. Distributed-memory, message-passing parallel computers with interrupt-driven communications or active messages (Intel iPSC/860, Delta and Paragon, IBM SP-1/2).

2. Networks of uniprocessor and multiprocessor Unix workstations.

3. Shared-memory parallel computers (KSR-1/2, Cray T3D, SGI).

The GA toolkit is available via anonymous ftp on ftp.pnl.gov in the directory pub/global. Further information is provided on the WWW at the URL http://www.emsl.pnl.gov:2080/docs/global/ga.html.

Applications

Most applications of the GA toolkit have been in the area of computational chemistry, in determinations of the electronic structures of molecules or crystalline chemical systems. These calculations, which can predict many chemical properties that are not directly observed experimentally, account for a large fraction of the supercomputer cycles currently used for computational chemistry. All of these methods, of which the iterative self-consistent field (SCF) method [8] is the simplest, compute approximate solutions to the nonrelativistic electronic Schrödinger equation.

As an example of the programming simplifications and performance improvements that can be realized with GAs, we consider the parallel SCF application here in slightly more detail. Full details and a recent literature survey can be found in [9, 10].

The kernel of the SCF calculation is the contraction of a large, sparse four-index matrix (the electron-repulsion integrals) with a two-index matrix (the electronic density) to form another two-index matrix (the Fock matrix). The irregular sparsity and the available symmetries of the integrals drive the calculation. The dimensions of both matrices are determined by the size of an underlying basis set (N ≈ 10^3). The number of integrals scales between O(N^2) and O(N^4), depending on the nature of the system and the level of accuracy required. Integrals are most efficiently computed in batches, and each batch connects up to six blocks of the density matrix with the corresponding blocks of the Fock matrix. The cost of evaluating these batches can vary by a factor of more than 1000, which causes a load-balancing problem.

The two previous significant distributed-data algorithms both used explicit message passing. In the systolic loop algorithm of Colvin et al. [10], the best possible execution time is no better than O(N_basis^2). Overall efficiencies of approximately 50% were obtained on 256 processors of an nCUBE-2. The most efficient explicit message-passing algorithm is that of Furlani and King [10], who implemented rather complicated distributed-matrix schemes with polling. The overhead associated with communication and waiting for responses to requests for access to the density matrix was reduced by explicit double-buffering and asynchronous prefetching. This approach scales well, but the polling causes high latencies in access to the distributed matrices, thus requiring the introduction of the additional complexities of prefetching.

Given the high degree of complexity of these message-passing algorithms, and the simplicity of SCF as compared with other ab initio algorithms, we have been seeking more appropriate programming models; the GA model is the current result. Our latest SCF program, which has been developed on top of the GA toolkit, is very simple, and all computational steps with complexity greater than O(N_atom) have been parallelized. The four nested loops over the unique integrals are strip-mined into blocks, similar in spirit to Furlani and King. Geometric decomposition permits the use of sparsity for reducing both computation and references to global data. Assignment of multiple atom quartets to a task improves the caching of reads of the density matrix and accumulation into the Fock matrix, although too large a task size degrades load balancing. All tasks are dynamically assigned, which is made possible by the one-sided data access provided by GAs. The GA visualization program, which demonstrates access patterns to distributed arrays, was instrumental in designing an efficient task-scheduling strategy for the SCF program [7].
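The overall structure of that task loop can be sketched as follows. This is a simplification of the actual program (which connects up to six blocks per integral batch); ga_read_inc, ga_get, and ga_acc are assumed names for the "get and increment," "get," and "accumulate" operations described earlier, and next_blocks and compute_fock_block stand for hypothetical application code. The global-array handles and the task count are assumed to be set up elsewhere.

      integer MAXB
      parameter (MAXB = 64)
      integer g_counter, g_dens, g_fock
      integer itask, ntasks, ilo, ihi, jlo, jhi
      integer ga_read_inc
      double precision dbuf(MAXB,MAXB), fbuf(MAXB,MAXB)
c
c     g_counter is a 1 x 1 integer global array used as a shared
c     task counter; g_dens and g_fock hold the density and Fock
c     matrices.
 10   itask = ga_read_inc(g_counter, 1, 1, 1)
      if (itask .lt. ntasks) then
c        map the task number to a block of atom quartets
         call next_blocks(itask, ilo, ihi, jlo, jhi)
c        one-sided fetch of the needed density patch
         call ga_get(g_dens, ilo, ihi, jlo, jhi, dbuf, MAXB)
c        evaluate the integral batch and its Fock contribution
         call compute_fock_block(ilo, ihi, jlo, jhi, dbuf, fbuf)
c        atomic accumulate of the contribution into the Fock matrix
         call ga_acc(g_fock, ilo, ihi, jlo, jhi, fbuf, MAXB, 1.0d0)
         goto 10
      endif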

A simple performance model predicts a constant efficiency of about 99% for the Fock matrix construction for up to O(N_atom^2) processors for extended molecular systems, at which point load imbalance will degrade performance. In a modestly sized calculation (731 basis functions) using 512 processors of the Intel Delta, we obtained a speedup of 496 (97% efficiency) for the Fock-matrix construction.
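For reference, the quoted efficiency is simply the measured speedup divided by the number of processors:

\[ \mathrm{efficiency} = \frac{S}{P} = \frac{496}{512} \approx 0.97 \]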

Conclusions

Global Arrays are a new parallel programming environment for the development of scientific applications on massively parallel computers. The GA model provides a portable interface through which each process in a MIMD parallel program can efficiently access logical blocks of physically distributed matrices, with no need for explicit cooperation by other processes or by the processors where the data resides.

For applications of certain types, the GA model provides a better combination of simple coding, high-level efficiency, and portability than do other models. The applications that motivated the development of GA are characterized by (1) the need to access relatively small blocks of very large matrices (thus requiring block-wise physical distribution); (2) wide variation in task execution time (thus requiring dynamic load balancing, with attendant unpredictable data reference patterns); and (3) a fairly large ratio of computation to data movement (thus making it possible to retain high efficiency while accessing remote data on demand). The GA model provides good support for many areas of computational chemistry, especially electronic structure codes. It also appears promising for such application domains as global climate modeling, in which the codes are often characterized by both spatial locality and load imbalance.

References

[1] High Performance Fortran Forum, High Performance Fortran Language Specification, Version 1.0, Rice University, 1993.

[2] J.A. Stephen and R.R. Oldehoeft, HEP SISAL: Parallel Functional Programming, in Parallel MIMD Computation: HEP Supercomputer and Its Applications, ed. J.S. Kowalik, The MIT Press, Cambridge, MA, pp. 123-150, 1985.

[3] I.T. Foster, R. Olson, and S. Tuecke, Productive Parallel Programming: The PCN Approach, Scientific Programming, vol. 1, pp. 51-66, 1992.

[4] I.T. Foster and K.M. Chandy, Fortran M: A Language for Modular Parallel Programming, Argonne National Laboratory, preprint MCS-P327-0992, 1992.

[5] N. Carriero and D. Gelernter, How To Write Parallel Programs: A First Course, The MIT Press, Cambridge, MA, 1990.

[6] T. von Eicken, D.E. Culler, S.C. Goldstein, and K.E. Schauser, Active messages: A mechanism for integrated communications and computation, Proc. 19th Ann. Int. Symp. Comp. Arch., pp. 256-266, 1992.

[7] J. Nieplocha, R.J. Harrison, and R.J. Littlefield, Global Arrays: A Portable "Shared-Memory" Programming Model for Distributed Memory Computers, Proc. Supercomputing 1994, IEEE Computer Society Press, pp. 340-349, 1994.

[8] A. Szabo and N.S. Ostlund, Modern Quantum Chemistry: Introduction to Advanced Electronic Structure Theory, 1st Ed. Revised, McGraw-Hill, New York, 1989.

[9] R.J. Harrison, M.F. Guest, R.A. Kendall, D.E. Bernholdt, A.T. Wong, M.S. Stave, J.L. Anchell, A.C. Hess, R.J. Littlefield, G.I. Fann, J. Nieplocha, G.S. Thomas, D. Elwood, J. Tilson, R.L. Shepard, A.F. Wagner, I.T. Foster, E. Lusk, and R. Stevens, Fully Distributed Parallel Algorithms: Molecular Self-Consistent Field Calculations, J. Comp. Chem., in press.

[10] R.J. Harrison and R.L. Shepard, Ab Initio Molecular Electronic Structure on Parallel Computers, Annu. Rev. Phys. Chem., 45:623-658, 1994.