SIAM News, August/September 1995

The Global Array Programming Model for High Performance Scientific Computing

J. Nieplocha, R.J. Harrison and R.J. Littlefield
Pacific Northwest Laboratory
Motivated by the characteristics of current parallel architectures, we have developed an approach to the programming of scalable scientific applications that combines some of the best features of the message-passing and shared-memory programming models. Two assumptions permeate our work. The first is that most high performance parallel computers have, and will continue to have, physically distributed memories with non-uniform memory access (NUMA) timing characteristics. NUMA machines work best with application programs that have a high degree of locality in their memory reference patterns. The second assumption is that extra programming effort is, and will continue to be, required to construct such applications. Thus, a recurring theme in our work is the development of techniques and tools that minimize the extra effort required to construct application programs with explicit control of locality.
There are significant tradeoffs among the important considerations of portability, efficiency, and ease of coding. The message-passing programming model is widely used because of its portability. Some applications, however, are too complex to be coded in a message-passing mode if care is to be taken to maintain a balanced computation load and avoid redundant computations. The shared-memory programming model simplifies coding, but it is not portable and often provides little control over interprocessor data transfer costs. Other more recent parallel programming models, represented by such languages and facilities as HPF [1], SISAL [2], PCN [3], Fortran-M [4], Linda [5], and shared virtual memory, address these problems in different ways and to varying degrees. None of these models represents an ideal solution.
Global Arrays (GAs), the approach described here, lead to both simple coding and efficient execution for a class of applications that appears to be fairly common. The key concept of the GA model is that it provides a portable interface through which each process in a MIMD parallel program can independently, asynchronously, and efficiently access logical blocks of physically distributed matrices, with no need for explicit cooperation by other processes. In this respect, it is similar to the shared-memory programming model. In addition, however, the GA model acknowledges that more time is required to access remote data than local data, and it allows data locality to be explicitly specified and used. In these respects, it is similar to message passing.

[Figure 1: The memory hierarchy of a typical NUMA architecture: registers, on-chip cache, off-chip cache, main memory, and virtual memory, with speed decreasing and capacity increasing down the hierarchy.]
NUMA Architecture
The concept of NUMA is important even to the performance of modern sequential personal computers or workstations. On a standard RISC workstation, for instance, good performance of the processor results from algorithms and compilers that optimize usage of the memory hierarchy. The memory hierarchy is formed by registers, on-chip cache, off-chip cache, main memory, and virtual memory (see Figure 1).

If the programmer ignores this structure and constantly flushes the cache or, even worse, thrashes the virtual memory, performance will be seriously degraded. The classic solution to this problem is to access data in blocks small enough to fit in the cache and then to ensure that the algorithm makes sufficient use of the encached data to justify the costs of moving the data.
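As a concrete illustration of this blocking idea, consider the following minimal Fortran sketch (purely illustrative; the matrix size n and block size nb are arbitrary choices of ours): a matrix multiplication strip-mined so that each block of b stays encached while it is reused.

      integer n, nb, i, j, k, jj, kk
      parameter (n = 512, nb = 64)
      double precision a(n,n), b(n,n), c(n,n)
c
c     c = a*b, computed over nb-by-nb blocks of b; each block
c     is reused across all n rows of a while it remains in cache
      do j = 1, n
         do i = 1, n
            c(i,j) = 0.0d0
         end do
      end do
      do jj = 1, n, nb
         do kk = 1, n, nb
            do j = jj, min(jj+nb-1, n)
               do k = kk, min(kk+nb-1, n)
                  do i = 1, n
                     c(i,j) = c(i,j) + a(i,k)*b(k,j)
                  end do
               end do
            end do
         end do
      end do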
To the NUMA hierarchy of sequential computers, parallel computers add at least one extra layer: remote memory. Access to remote memory on distributed-memory machines is accomplished through message passing. Message passing, in addition to the required cooperation between sender and receiver that makes this programming paradigm difficult to use, introduces degradation of latency and bandwidth in the accessing of remote, as opposed to local, memory.
Scalable shared-memory machines, i.e., architecturally distributed-memory machines with hardware support for shared-memory operations (for example, the KSR-2 or the Convex Exemplar), allow access to remote memory in the same fashion as to local memory. However, this uniform mechanism for accessing local and remote memory should be seen only as a programming convenience: on both shared- and distributed-memory computers, the latency for accessing remote memory is significantly higher, and the bandwidth significantly lower, than for local memory, and these costs must therefore be incorporated into performance models.
If we think about the programming of MIMD parallel computers (either shared- or distributed-memory) in terms of NUMA, then parallel computation differs from sequential computation only in terms of concurrency. By focusing on NUMA, we not only have a framework in which to reason about the performance of our parallel algorithms (i.e., memory latency, bandwidth, data and reference locality), we also conceptually unite sequential and parallel computation.
Global Array Model
The GA programming model is motivated by the NUMA characteristics of current parallel architectures. By removing the unnecessary processor interactions required to access remote data in the message-passing paradigm, the GA model greatly simplifies parallel programming and is similar in this respect to the shared-memory programming model. However, the GA model also acknowledges that it is more time consuming to access remote data than local data (i.e., remote memory is yet another layer of NUMA), and it allows data locality to be explicitly specified and used. Advantages of the GA model over a shared-memory programming model include its explicit distinction between local and remote memory and the availability of two distinct mechanisms for accessing local and remote data. Global arrays, instead of hiding the NUMA characteristics, expose them to the programmer and make it possible to write more efficient and scalable parallel programs.
The current GA programming model can be characterized as follows (a brief sketch of the one-sided access style follows the list):

- MIMD parallelism is provided via a multiprocess approach, in which all non-GA data, file descriptors, and so on are replicated or unique to each process.
- Processes can communicate with each other by creating and accessing GA distributed matrices, as well as (if desired) by conventional message passing.
- Matrices are physically distributed block-wise, either regularly or as the Cartesian product of irregular distributions on each axis.
- Each process can independently and asynchronously access any two-dimensional patch of a GA distributed matrix, without requiring cooperation from the application code in any other process.
- Several types of access are supported, including "get," "put," "accumulate" (floating-point sum-reduction), and "get and increment" (integer). This list can be extended as needed.
- Each process is assumed to have fast access to some portion of each distributed matrix and slower access to the remainder. These speed differences define the data as being local or remote, respectively. However, the numeric difference between local and remote memory access times is unspecified.
- Each process can determine which portion of each distributed matrix is stored locally. Every element of a distributed matrix is guaranteed to be local to exactly one process.
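A minimal sketch of this one-sided style, assuming the Fortran interface of the toolkit, an existing array handle g_a, and a hypothetical 100 x 100 patch (the bounds and buffer are ours, not part of the model):

      integer g_a, me, ilo, ihi, jlo, jhi
      integer ga_nodeid
      double precision buf(100,100)
c
c     determine which block of the distributed matrix is local
      me = ga_nodeid()
      call ga_distribution(g_a, me, ilo, ihi, jlo, jhi)
c
c     fetch an arbitrary patch, local or remote; no other
c     process participates in the transfer
      call ga_get(g_a, 1, 100, 1, 100, buf, 100)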
This model differs from other common models as follows. Unlike HPF, it allows task-parallel access to distributed matrices, including reduction into overlapping patches. Unlike Linda [5], it efficiently provides for sum-reduction and access to overlapping patches. Unlike shared-virtual-memory software facilities, the GA paradigm requires explicit library calls to access data, but it avoids the overhead associated with the maintenance of memory coherence and the handling of virtual page faults. The GA implementation guarantees that all of the required data for a patch can be transferred at the same time. Unlike active messages [6], the GA model does not incorporate the concept of interprocessor cooperation and can thus be implemented efficiently [7] even on shared-memory systems. Finally, unlike some other strategies based on polling, task duration is relatively unimportant in programs that use GAs, which simplifies coding and makes it possible for GA programs to exploit standard library codes without modification.
Global Array Toolkit
The GA interface has been designed in the light of emerging standards. In particular, High Performance Fortran (HPF) will certainly provide the basis for future standards definition for distributed arrays in Fortran. The operations that provide the basic functionality (create, fetch, store, accumulate, gather, scatter, data-parallel operations) can all be expressed as single statements in Fortran-90 array notation with the data-distribution directives of HPF. The GA model is, however, more general than that of HPF, which currently precludes the use of such operations in MIMD task-parallel code.
Supported Operations
Each GA operation may be categorized as either an implementation-dependent primitive operation or an operation that has been constructed in an implementation-independent fashion from primitive operations. Operations also differ in their implied synchronization. Interfaces to third-party libraries provide a final distinction.

The following primitive operations are invoked collectively by all processes:

- create an array, controlling alignment and distribution;
- create an array following the template of an existing array;
- destroy an array;
- synchronize all processes.
The following primitive operations can be invoked in true MIMD style by any process, with no implied synchronization with other processes and, unless otherwise stated, with no guaranteed atomicity (a short sketch follows the list):

- fetch, store, and atomic accumulate into a rectangular patch of a two-dimensional array;
- gather and scatter array elements;
- atomically read and increment array elements;
- inquire about the location and distribution of the data;
- directly access local elements of an array, to support and/or improve performance of application-specific data-parallel operations.
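For instance, in a sketch under the same assumptions as before (the 50 x 50 patch bounds and the handle g_f are hypothetical), a process can atomically accumulate its contribution into a possibly overlapping patch:

      integer g_f
      double precision alpha, fbuf(50,50)
c
c     atomically add alpha*fbuf into the patch (1:50, 1:50);
c     concurrent accumulates from other processes into
c     overlapping patches are combined correctly
      alpha = 1.0d0
      call ga_acc(g_f, 1, 50, 1, 50, fbuf, 50, alpha)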
The following set of BLAS-like data-parallel operations, currently available in GAs, can easily be extended (efficient implementation can be done in an architecture-independent fashion on top of the GA primitive operations):

- vector operations (e.g., dot product or scale), optimized by means of direct access to local data to avoid communication;
- matrix operations (e.g., symmetrize), optimized through direct access to local data to reduce communication and data copying;
- matrix multiplication.

The vector, matrix multiplication, copy, and print operations exist in two versions that operate on either entire arrays or specified sections of arrays. The array sections in operations that involve multiple arrays do not have to be conforming; the only requirements are that they be of the same type and contain the same number of elements.
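For example, a dot product over two conforming global arrays reduces to a single call; this is a sketch that assumes the toolkit's double-precision naming convention and two existing array handles:

      integer g_a, g_b
      double precision ga_ddot, s
c
c     collective dot product: each process works on its local
c     patch, and the partial sums are combined globally
      s = ga_ddot(g_a, g_b)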
Functionality provided by third-party libraries (standard and generalized real symmetric eigensolvers, and linear equation solvers interfaced to ScaLAPACK) is made available by using the GA primitives to perform the necessary data rearrangement. The O(N^2) cost of such rearrangements is observed to be negligible in comparison with that of the O(N^3) linear algebra operations. These libraries can internally use any form of parallelism appropriate to the computer system, such as cooperative message passing or shared memory.
Sample Code Fragment
The following code fragment uses the Fortran interface to create an n x m double-precision array, blocked in chunks of at least 10 x 5; after zeroing, a patch is filled from a local array. Undefined values are assumed to be computed elsewhere. The routine ga_create returns the variable g_a as a handle to the global array for subsequent references to the array.
      integer g_a, n, m, ilo, ihi, jlo, jhi, ldim
      double precision local(1:ldim,*)
c
      call ga_create(MT_DBL, n, m, 'A', 10, 5, g_a)
      call ga_zero(g_a)
      call ga_put(g_a, ilo, ihi, jlo, jhi, local, ldim)
This code is very similar in functionality to the following HPF-like statements:
      integer n, m, ilo, ihi, jlo, jhi, ldim
      double precision a(n,m), local(1:ldim,*)
!hpf$ distribute a(block(10), block(5))
c
      a = 0.0
      a(ilo:ihi, jlo:jhi) = local(1:ihi-ilo+1, 1:jhi-jlo+1)
The difference is that this single HPF assignment would be executed in a data-parallel fashion, whereas the global array ga_put operation would be executed in MIMD parallel mode, with each process able to reference different array patches.
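To make the MIMD contrast concrete, here is a sketch (the slab size nb and the row-slab decomposition are our own illustrative choices) in which every process independently fills a different slab of rows of the same array:

      integer me, ilo, ihi, nb
      integer ga_nodeid
      parameter (nb = 100)
c
c     process me fills rows me*nb+1 through (me+1)*nb; no
c     cooperation is needed until the final synchronization
      me = ga_nodeid()
      ilo = me*nb + 1
      ihi = ilo + nb - 1
      call ga_put(g_a, ilo, ihi, 1, m, local, nb)
      call ga_sync()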
Supported Platforms and Availability

The public-domain GA toolkit is supported on a wide range of distributed- and shared-memory computer systems, including:

1. Distributed-memory, message-passing parallel computers with interrupt-driven communications or active messages (Intel iPSC/860, Delta, and Paragon; IBM SP-1/2).
2. Networks of uniprocessor and multiprocessor Unix workstations.
3. Shared-memory parallel computers (KSR-1/2, Cray T3D, SGI).
The GA toolkit is available via anonymous ftp from ftp.pnl.gov in the directory pub/global. Further information is provided on the WWW at the URL http://www.emsl.pnl.gov:2080/docs/global/ga.html.
Applications
Most applications of the GA toolkit have been in the area of computational chemistry, in determinations of the electronic structures of molecules or crystalline chemical systems. These calculations, which can predict many chemical properties that are not directly observed experimentally, account for a large fraction of the supercomputer cycles currently used for computational chemistry. All of these methods, of which the iterative self-consistent field (SCF) method [8] is the simplest, compute approximate solutions to the nonrelativistic electronic Schrödinger equation.
As an example of the programming simplifications and performance improvements that can be realized with GAs, we consider the parallel SCF application here in slightly more detail. Full details and a recent literature survey can be found in [9, 10].
The kernel of the SCF calculation is the contraction of a large, sparse four-index matrix (the electron-repulsion integrals) with a two-index matrix (the electronic density) to yield another two-index matrix (the Fock matrix). The irregular sparsity and the available symmetries of the integrals drive the calculation. The dimensions of both two-index matrices are determined by the size of an underlying basis set (N ~ 10^3). The number of integrals scales between O(N^2) and O(N^4), depending on the nature of the system and the level of accuracy required. Integrals are most efficiently computed in batches, and each batch connects up to six blocks of the density matrix with the corresponding blocks of the Fock matrix. The cost of evaluating these batches can vary by a factor of more than 1000, which causes a load-balancing problem.
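The one-sided GA operations map naturally onto this kernel. In a heavily simplified sketch (the integral-contraction routine and the single pair of patch bounds are hypothetical placeholders for the real blocking logic), the handling of one batch looks like:

      integer g_dens, g_fock, ilo, ihi, jlo, jhi
      double precision dij(100,100), fij(100,100)
c
c     fetch the density blocks that this batch touches
      call ga_get(g_dens, ilo, ihi, jlo, jhi, dij, 100)
c     contract the integral batch with the density locally
c     (compute_fock_batch is a hypothetical placeholder)
      call compute_fock_batch(dij, fij)
c     atomically accumulate the resulting Fock contribution
      call ga_acc(g_fock, ilo, ihi, jlo, jhi, fij, 100, 1.0d0)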
The two previous significant distributed-data algorithms both used explicit message passing. In the systolic loop algorithm of Colvin et al. [10], the best possible execution time is no better than O(N_basis^2). Overall efficiencies of approximately 50% were obtained on 256 processors of an nCUBE-2. The most efficient explicit message-passing algorithm is that of Furlani and King [10], who implemented rather complicated distributed-matrix schemes with polling. The overhead associated with communication and with waiting for responses to requests for access to the density matrix was reduced by explicit double-buffering and asynchronous prefetching. This approach scales well, but the polling causes high latencies in access to the distributed matrices, thus requiring the introduction of the additional complexities of prefetching.
Given the high degree of complexity of these message-passing algorithms, and the simplicity of SCF as compared with other ab initio algorithms, we have been seeking more appropriate programming models; the GA model is the current result. Our latest SCF program, which has been developed on top of the GA toolkit, is very simple, and all computational steps with complexity greater than O(N_atom) have been parallelized. The four nested loops over the unique integrals are strip-mined into blocks, similar in spirit to the approach of Furlani and King. Geometric decomposition permits the use of sparsity to reduce both computation and references to global data. Assignment of multiple atom quartets to a task improves the caching of reads of the density matrix and of accumulation into the Fock matrix, although too large a task size degrades load balancing. All tasks are dynamically assigned, which is made possible by the one-sided data access provided by GAs. The GA visualization program, which displays access patterns to distributed arrays, was instrumental in designing an efficient task-scheduling strategy for the SCF program [7].
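Dynamic task assignment of this kind is naturally built on the atomic read-and-increment operation. A minimal sketch (the shared counter g_ctr and the task loop are our own illustration, not the SCF program's actual bookkeeping):

      integer g_ctr, itask, ntasks
      integer ga_read_inc
c
c     g_ctr is a 1x1 integer global array used as a shared task
c     counter; each call returns a distinct task number
      itask = ga_read_inc(g_ctr, 1, 1, 1)
      do while (itask .lt. ntasks)
c        ... process task number itask ...
         itask = ga_read_inc(g_ctr, 1, 1, 1)
      end do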
A simple performance model predicts a constant efficiency of about 99% for the Fock-matrix construction on up to O(N_atom^2) processors for extended molecular systems, at which point load imbalance will degrade performance. In a modestly sized calculation (731 basis functions) using 512 processors of the Intel Delta, we obtained a speedup of 496 (97% efficiency) for the Fock-matrix construction.
Conclusions
Global Arrays are a new parallel programming environment for the development of scientific applications on massively parallel computers. The GA model provides a portable interface through which each process in a MIMD parallel program can efficiently access logical blocks of physically distributed matrices, with no need for explicit cooperation by other processes or by the processors where the data resides.

For applications of certain types, the GA model provides a better combination of simple coding, high-level efficiency, and portability than do other models. The applications that motivated the development of GA are characterized by (1) the need to access relatively small blocks of very large matrices (thus requiring block-wise physical distribution); (2) wide variation in task execution time (thus requiring dynamic load balancing, with attendant unpredictable data reference patterns); and (3) a fairly large ratio of computation to data movement (thus making it possible to retain high efficiency while accessing remote data on demand). The GA model provides good support for many areas of computational chemistry, especially electronic structure codes. It also appears promising for such application domains as global climate modeling, in which the codes are often characterized by both spatial locality and load imbalance.
References
[1] High Performance Fortran Forum, High Performance Fortran Language Specification, Version 1.0, Rice University, 1993.

[2] J.A. Stephen and R.R. Oldehoeft, HEP SISAL: Parallel Functional Programming, in Parallel MIMD Computation: HEP Supercomputer and Its Applications, ed. J.S. Kowalik, The MIT Press, Cambridge, MA, 1985, pp. 123-150.

[3] I.T. Foster, R. Olson, and S. Tuecke, Productive Parallel Programming: The PCN Approach, Scientific Programming, 1 (1992), pp. 51-66.

[4] I.T. Foster and K.M. Chandy, Fortran M: A Language for Modular Parallel Programming, Argonne National Laboratory, preprint MCS-P327-0992, 1992.

[5] N. Carriero and D. Gelernter, How To Write Parallel Programs: A First Course, The MIT Press, Cambridge, MA, 1990.

[6] T. von Eicken, D.E. Culler, S.C. Goldstein, and K.E. Schauser, Active messages: A mechanism for integrated communications and computation, Proc. 19th Ann. Int. Symp. Comp. Arch., 1992, pp. 256-266.

[7] J. Nieplocha, R.J. Harrison, and R.J. Littlefield, Global Arrays: A Portable "Shared-Memory" Programming Model for Distributed Memory Computers, Proc. Supercomputing 1994, IEEE Computer Society Press, 1994, pp. 340-349.

[8] A. Szabo and N.S. Ostlund, Modern Quantum Chemistry: Introduction to Advanced Electronic Structure Theory, 1st Ed. Revised, McGraw-Hill, New York, 1989.

[9] R.J. Harrison, M.F. Guest, R.A. Kendall, D.E. Bernholdt, A.T. Wong, M.S. Stave, J.L. Anchell, A.C. Hess, R.J. Littlefield, G.I. Fann, J. Nieplocha, G.S. Thomas, D. Elwood, J. Tilson, R.L. Shepard, A.F. Wagner, I.T. Foster, E. Lusk, and R. Stevens, Fully Distributed Parallel Algorithms -- Molecular Self-Consistent Field Calculations, J. Comp. Chem., in press.

[10] R.J. Harrison and R.L. Shepard, Ab Initio Molecular Electronic Structure on Parallel Computers, Annu. Rev. Phys. Chem., 45 (1994), pp. 623-658.