SIAM News, August/September 1995

The Global Array Programming Model
for High Performance Scientific Computing

J. Nieplocha, R.J. Harrison, and R.J. Littlefield
Pacific Northwest Laboratory

Motivated by the characteristics of current parallel architectures, we have developed an approach to the programming of scalable scientific applications that combines some of the best features of the message-passing and shared-memory programming models. Two assumptions permeate our work. The first is that most high performance parallel computers have, and will continue to have, physically distributed memories with non-uniform memory access (NUMA) timing characteristics. NUMA machines work best with application programs that have a high degree of locality in their memory reference patterns. The second assumption is that extra programming effort is, and will continue to be, required to construct such applications. Thus, a recurring theme in our work is the development of techniques and tools that minimize the extra effort required to construct application programs with explicit control of locality.

There are significant tradeoffs among the important considerations of portability, efficiency, and ease of coding. The message-passing programming model is widely used because of its portability. Some applications, however, are too complex to be coded in a message-passing mode if care is to be taken to maintain a balanced computation load and avoid redundant computations. The shared-memory programming model simplifies coding, but it is not portable and often provides little control over interprocessor data transfer costs. Other more recent parallel programming models, represented by such languages and facilities as HPF [1], SISAL [2], PCN [3], Fortran-M [4], Linda [5], and shared virtual memory, address these problems in different ways and to varying degrees. None of these models represents an ideal solution.

Global Arrays (GAs), the approach described here, lead to both simple coding and efficient execution for a class of applications that appears to be fairly common. The key concept of the GA model is that it provides a portable interface through which each process in a MIMD parallel program can independently, asynchronously, and efficiently access logical blocks of physically distributed matrices, with no need for explicit cooperation by other processes. In this respect, it is similar to the shared-memory programming model. In addition, however, the GA model acknowledges that more time is required to access remote data than local data, and it allows data locality to be explicitly specified and used. In these respects, it is similar to message passing.

Figure 1. The memory hierarchy of a typical NUMA architecture, from fastest and smallest to slowest and largest: registers, on-chip cache, off-chip cache, main memory, virtual memory.

NUMA Architecture

The concept of NUMA is important even to the performance of modern sequential personal computers or workstations. On a standard RISC workstation, for instance, good performance of the processors results from algorithms and compilers that optimize usage of the memory hierarchy. The memory hierarchy is formed by registers, on-chip cache, off-chip cache, main memory, and virtual memory (see Figure 1).

If the programmer ignores this structure and constantly flushes the cache or, even worse, thrashes the virtual memory, performance will be seriously degraded. The classic solution to this problem is to access data in blocks small enough to fit in the cache and then ensure that the algorithm makes sufficient use of the encached data to justify the costs of moving the data.
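To make the blocking idea concrete, the sketch below tiles the familiar triple loop of a matrix multiplication so that small tiles of each operand are reused while they remain in cache. The routine name and the block-size parameter nb are purely illustrative and are not part of the GA toolkit; nb would be tuned so that three nb-by-nb tiles fit comfortably in cache.

c     Illustrative cache-blocked matrix multiplication: c = c + a*b.
      subroutine blocked_mxm(n, nb, a, b, c)
      integer n, nb, ii, jj, kk, i, j, k
      double precision a(n,n), b(n,n), c(n,n)
      do jj = 1, n, nb
         do kk = 1, n, nb
            do ii = 1, n, nb
c              work on one nb-by-nb tile of each operand at a time
               do j = jj, min(jj+nb-1, n)
                  do k = kk, min(kk+nb-1, n)
                     do i = ii, min(ii+nb-1, n)
                        c(i,j) = c(i,j) + a(i,k)*b(k,j)
                     enddo
                  enddo
               enddo
            enddo
         enddo
      enddo
      end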

To the NUMA hierarchy of sequential computers, parallel computers add at least one extra layer: remote memory. Access to remote memory on distributed-memory machines is accomplished through message passing. Message passing, in addition to the required cooperation between sender and receiver that makes it difficult to use, introduces degradation of latency and bandwidth in the accessing of remote, as opposed to local, memory.

Scalable shared-memory machines, i.e., architecturally distributed-memory machines with hardware support for shared-memory operations (for example, the KSR-2 or the Convex Exemplar), allow access to remote memory in the same fashion as to local memory. However, this uniform mechanism for accessing local and remote memory should be seen only as a programming convenience: on both shared- and distributed-memory computers, the latency and bandwidth for accessing remote memory are significantly larger than for local memory and therefore must be incorporated into performance models.

If we think about programming of MIMD parallel computers (either shared- or distributed-memory) in terms of NUMA, then parallel computation differs from sequential computation only in terms of concurrency. By focusing on NUMA, we not only have a framework in which to reason about the performance of our parallel algorithms (i.e., memory latency, bandwidth, data and reference locality), we also conceptually unite sequential and parallel computation.

Global Array Model

The GA programming model is motivated by the NUMA characteristics of current parallel architectures. By removing the unnecessary processor interactions required to access remote data in the message-passing paradigm, the GA model greatly simplifies parallel programming and is similar in this respect to the shared-memory programming model. However, the GA model also acknowledges that it is more time consuming to access remote data than local data (i.e., remote memory is yet another layer of NUMA), and it allows data locality to be explicitly specified and used. Advantages of the GA model over a shared-memory programming model include its explicit distinction between local and remote memory and the availability of two distinct mechanisms for accessing local and remote data. Global arrays, instead of hiding the NUMA characteristics, expose them to the programmer and make it possible to write more efficient and scalable parallel programs.

The current GA programming model can be characterized as follows (a brief illustrative sketch appears after the list):

- MIMD parallelism is provided via a multiprocess approach, in which all non-GA data, file descriptors, and so on are replicated or unique to each process.

- Processes can communicate with each other by creating and accessing GA distributed matrices, as well as (if desired) by conventional message passing.

- Matrices are physically distributed block-wise, either regularly or as the Cartesian product of irregular distributions on each axis.

- Each process can independently and asynchronously access any two-dimensional patch of a GA distributed matrix, without requiring cooperation from the application code in any other process.

- Several types of access are supported, including "get," "put," "accumulate" (floating-point sum-reduction), and "get and increment" (integer). This list can be extended as needed.

- Each process is assumed to have fast access to some portion of each distributed matrix, and slower access to the remainder. These speed differences define the data as being local or remote, respectively. However, the numeric difference between local and remote memory access times is unspecified.

- Each process can determine which portion of each distributed matrix is stored locally. Every element of a distributed matrix is guaranteed to be local to exactly one process.
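As a small illustration of this access model, the fragment below has a process ask which patch of a global array it owns, fetch an arbitrary (possibly remote) patch into a local buffer, and atomically accumulate a contribution back, without any action by the processes that own the data. Only ga_create, ga_zero, and ga_put appear verbatim later in this article; ga_nodeid, ga_distribution, ga_get, and ga_acc, and their argument orders, are assumed here by analogy and may differ from the actual toolkit interface.

      integer g_a, me, ilo, ihi, jlo, jhi, ld
      integer ga_nodeid
      double precision buf(100,100), alpha
c
c     which patch of g_a is stored locally on this process?
      me = ga_nodeid()
      call ga_distribution(g_a, me, ilo, ihi, jlo, jhi)
c     one-sided fetch of a (possibly remote) 100 x 100 patch
      ld = 100
      call ga_get(g_a, 1, 100, 1, 100, buf, ld)
c     ... local work on buf ...
c     atomic accumulate of the result back into the same patch;
c     other processes may update overlapping patches concurrently
      alpha = 1.0d0
      call ga_acc(g_a, 1, 100, 1, 100, buf, ld, alpha)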

This model differs from other common models as follows. Unlike HPF, it allows task-parallel access to distributed matrices, including reduction into overlapping patches. Unlike Linda [5], it efficiently provides for sum-reduction and access to overlapping patches. Unlike shared-virtual-memory software facilities, the GA paradigm requires explicit library calls to access data, but it avoids the overhead associated with the maintenance of memory consistency and the handling of virtual page faults. The GA implementation guarantees that all of the required data for a patch can be transferred at the same time. Unlike active messages [6], the GA model does not incorporate the concept of interprocessor cooperation and can thus be implemented efficiently [7] even on shared-memory systems. Finally, unlike some other strategies based on polling, task duration is relatively unimportant in programs that use GAs, which simplifies coding and makes it possible for GA programs to exploit standard library codes without modification.

Global Array Toolkit

This GA interface has been designed in the light of emerging standards. In particular, High Performance Fortran (HPF) will certainly provide the basis for future standards definition for distributed arrays in Fortran. The operations that provide the basic functionality (create, fetch, store, accumulate, gather, scatter, data-parallel operations) can all be expressed as single statements in Fortran-90 array notation and with the data-distribution directives of HPF. The GA model is, however, more general than that of HPF, which currently precludes the use of such operations in MIMD task-parallel code.

Supported Operations

Each GA operation may be categorized as either an implementation-dependent primitive operation or an operation that has been constructed in an implementation-independent fashion from primitive operations. Operations also differ in their implied synchronization. Interfaces to third-party libraries provide a final distinction.

The following primitive operations are invoked collectively by all processes (a minimal lifecycle sketch follows the list):

- create an array, controlling alignment and distribution;

- create an array following a provided template (existing array);

- destroy an array;

- synchronize all processes.
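A minimal lifecycle sketch follows. The ga_create call matches the sample fragment shown later in this article; ga_duplicate, ga_sync, and ga_destroy are assumed names (and calling conventions) for the create-from-template, synchronize, and destroy operations listed above.

      integer g_a, g_b, n, m
c
c     collectively create an n x m double-precision array with
c     minimum block sizes of 10 x 5 (n and m are set elsewhere)
      call ga_create(MT_DBL, n, m, 'A', 10, 5, g_a)
c     create a second array with the same shape and distribution,
c     using the first as a template (assumed name ga_duplicate)
      call ga_duplicate(g_a, g_b, 'B')
c     ... MIMD computation using g_a and g_b ...
c     collective barrier (assumed name ga_sync)
      call ga_sync()
c     collectively release both arrays (assumed name ga_destroy)
      call ga_destroy(g_b)
      call ga_destroy(g_a)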

The following primitive operations can be invoked in true MIMD style by any process, with no implied synchronization with other processes and, unless otherwise stated, with no guaranteed atomicity:

- fetch, store, and atomic accumulate into a rectangular patch of a two-dimensional array;

- gather and scatter array elements;

- atomic read and increment of array elements;

- inquire about the location and distribution of the data;

- directly access local elements of an array to support and/or improve performance of application-specific data-parallel operations.

The following set of BLAS-like data-parallel operations currently available in GAs can easily be extended (efficient implementation can be done in an architecture-independent fashion on top of the GA primitive operations):

- vector operations (e.g., dot-product or scale), optimized by means of direct access to local data to avoid communication;

- matrix operations (e.g., symmetrize), optimized through direct access to local data to reduce communication and data copying;

- matrix multiplication.

The vector, matrix multiplication, copy, and print operations exist in two versions that operate on either entire arrays or specified sections of arrays. The array sections in operations that involve multiple arrays do not have to be conforming; the only requirements are that they be of the same type and contain the same number of elements.

Functionality provided by third-party libraries (standard and generalized real symmetric eigensolvers and linear equation solvers that interface to ScaLAPACK) is made available by using the GA primitives to perform the necessary data rearrangement. The O(N^2) cost of such rearrangements is observed to be negligible in comparison to that of the O(N^3) linear algebra operations. These libraries can internally use any form of parallelism appropriate to the computer system, such as cooperative message passing or shared memory.

Sample Code Fragment

The following code fragment uses the Fortran interface to create an n × m double-precision array, blocked in at least 10 × 5 chunks; after zeroing, a patch is filled from a local array. Undefined values are assumed to be computed elsewhere. The routine ga_create returns the variable g_a as a handle to the global array for subsequent references to the array.

      integer g_a, n, m, ilo, ihi, jlo, jhi, ldim
      double precision local(1:ldim, *)
c
      call ga_create(MT_DBL, n, m, 'A', 10, 5, g_a)
      call ga_zero(g_a)
      call ga_put(g_a, ilo, ihi, jlo, jhi, local, ldim)

This code is very similar in functionality to the following HPF-like statements:

      integer n, m, ilo, ihi, jlo, jhi, ldim
      double precision a(n,m), local(1:ldim, *)
!hpf$ distribute a(block(10), block(5))
c
      a = 0.0
      a(ilo:ihi, jlo:jhi) = local(1:ihi-ilo+1, 1:jhi-jlo+1)

The difference is that this single HPF assignment would be executed in a data-parallel fashion, whereas the global array ga_put operation would be executed in MIMD parallel mode, with each process able to reference different array patches.
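To illustrate that last point, the hypothetical fragment below lets each process of a MIMD program deposit a different column block of the same global array using the ga_put call from the sample above; ga_nodeid and ga_nnodes are assumed names for the process-identity and process-count inquiries implied by the model.

      integer g_a, n, m, me, nproc, jlo, jhi, ldim
      integer ga_nodeid, ga_nnodes
      double precision local(1:ldim, *)
c
c     process identity and count (assumed inquiry functions,
c     with ga_nodeid() returning 0, ..., nproc-1)
      me    = ga_nodeid()
      nproc = ga_nnodes()
c     each process picks its own column block; for simplicity,
c     assume m is a multiple of nproc
      jlo = me*(m/nproc) + 1
      jhi = (me + 1)*(m/nproc)
c     independent, one-sided deposit of this process's columns
      call ga_put(g_a, 1, n, jlo, jhi, local, ldim)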

Supported Platforms and Availability

The public-domain GA toolkit is supported on a wide range of distributed- and shared-memory computer systems, including:

1. Distributed-memory, message-passing parallel computers with interrupt-driven communications or active messages (Intel iPSC/860, Delta and Paragon, IBM SP-1/2).

2. Networks of uniprocessor and multiprocessor Unix workstations.

3. Shared-memory parallel computers (KSR-1/2, Cray T3D, SGI).

The GA toolkit is available via anonymous ftp on ftp.pnl.gov in the directory pub/global. Further information is provided on the WWW at the URL http://www.emsl.pnl.gov:2080/docs/global/ga.html.

Applications

Most applications of the GA toolkit have been in the area of computational chemistry, in determinations of the electronic structures of molecules or crystalline chemical systems. These calculations, which can predict many chemical properties that are not directly observed experimentally, account for a large fraction of the supercomputer cycles currently used for computational chemistry. All of these methods, of which the iterative self-consistent field (SCF) method [8] is the simplest, compute approximate solutions to the nonrelativistic electronic Schrödinger equation.

As an example of the programming simplifications and performance improvements that can be realized with GAs, we consider the parallel SCF application here in slightly more detail. Full details and a recent literature survey can be found in [9, 10].

The kernel of the SCF calculation is the contraction of a large, sparse four-index matrix (the electron-repulsion integrals) with a two-index matrix (the electronic density) to form another two-index matrix (the Fock matrix). The irregular sparsity and the available symmetries of the integrals drive the calculation. The dimensions of both matrices are determined by the size of an underlying basis set (N ≈ 10^3). The number of integrals scales between O(N^2) and O(N^4), depending on the nature of the system and the level of accuracy required. Integrals are most efficiently computed in batches, and each batch connects up to six blocks of the density matrix with the corresponding blocks of the Fock matrix. The cost of evaluating these batches can vary by a factor of more than 1000, which causes a load-balancing problem.

The two previous significant distributed-data algorithms both used explicit message passing. In the systolic loop algorithm of Colvin et al. [10], the best possible execution time is no better than O(N_basis^2). Overall efficiencies of approximately 50% were obtained on 256 processors of an nCUBE-2. The most efficient explicit message-passing algorithm is that of Furlani and King [10], who implemented rather complicated distributed-matrix schemes with polling. The overhead associated with communication and waiting for responses to requests for access to the density matrix was reduced by explicit double-buffering and asynchronous prefetching. This approach scales well, but the polling causes high latencies in access to the distributed matrices, thus requiring the introduction of the additional complexities of prefetching.

Given the high degree of complexity of these message-passing algorithms, and the simplicity of SCF as compared with other ab initio algorithms, we have been seeking more appropriate programming models; the GA model is the current result. Our latest SCF program, which has been developed on top of the GA toolkit, is very simple, and all computational steps with complexity greater than O(N_atom) have been parallelized. The four nested loops over the unique integrals are strip-mined into blocks, similar in spirit to Furlani and King. Geometric decomposition permits the use of sparsity for reducing both computation and references to global data. Assignment of multiple atom quartets to a task improves the caching of reads of the density matrix and accumulation into the Fock matrix, although too large a task size degrades load balancing. All tasks are dynamically assigned, which is made possible by the one-sided data access provided by GAs. The GA visualization program, which demonstrates access patterns to distributed arrays, was instrumental in designing an efficient task-scheduling strategy for the SCF program [7].
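The overall structure of that task loop can be sketched as follows. This is a simplification of the actual program (which connects up to six blocks per integral batch); ga_read_inc, ga_get, and ga_acc are assumed names for the "get and increment," "get," and "accumulate" operations described earlier, and next_blocks and compute_fock_block stand for hypothetical application code. The global-array handles and the task count are assumed to be set up elsewhere.

      integer MAXB
      parameter (MAXB = 64)
      integer g_counter, g_dens, g_fock
      integer itask, ntasks, ilo, ihi, jlo, jhi
      integer ga_read_inc
      double precision dbuf(MAXB,MAXB), fbuf(MAXB,MAXB)
c
c     g_counter is a 1 x 1 integer global array used as a shared
c     task counter; g_dens and g_fock hold the density and Fock
c     matrices.
 10   itask = ga_read_inc(g_counter, 1, 1, 1)
      if (itask .lt. ntasks) then
c        map the task number to a block of atom quartets
         call next_blocks(itask, ilo, ihi, jlo, jhi)
c        one-sided fetch of the needed density patch
         call ga_get(g_dens, ilo, ihi, jlo, jhi, dbuf, MAXB)
c        evaluate the integral batch and its Fock contribution
         call compute_fock_block(ilo, ihi, jlo, jhi, dbuf, fbuf)
c        atomic accumulate of the contribution into the Fock matrix
         call ga_acc(g_fock, ilo, ihi, jlo, jhi, fbuf, MAXB, 1.0d0)
         goto 10
      endif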

A simple performance model predicts a constant efficiency of about 99% for the Fock matrix construction for up to O(N_atom^2) processors for extended molecular systems, at which point load imbalance will degrade performance. In a modestly sized calculation (731 basis functions) using 512 processors of the Intel Delta, we obtained a speedup of 496 (97% efficiency) for the Fock-matrix construction.
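For reference, the quoted efficiency is simply the measured speedup divided by the number of processors:

\[ \mathrm{efficiency} = \frac{S}{P} = \frac{496}{512} \approx 0.97 \]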

Conclusions

Global Arrays are a new parallel programming environment for the development of scientific applications on massively parallel computers. The GA model provides a portable interface through which each process in a MIMD parallel program can efficiently access logical blocks of physically distributed matrices, with no need for explicit cooperation by other processes or by the processors where the data resides.

For applications of certain types, the GA model provides a better combination of simple coding, high-level efficiency, and portability than do other models. The applications that motivated the development of GA are characterized by (1) the need to access relatively small blocks of very large matrices (thus requiring block-wise physical distribution); (2) wide variation in task execution time (thus requiring dynamic load balancing, with attendant unpredictable data reference patterns); and (3) a fairly large ratio of computation to data movement (thus making it possible to retain high efficiency while accessing remote data on demand). The GA model provides good support for many areas of computational chemistry, especially electronic structure codes. It also appears promising for such application domains as global climate modeling, in which the codes are often characterized by both spatial locality and load imbalance.

References

[1] High Performance Fortran Forum, High Performance Fortran Language Specification, Version 1.0, Rice University, 1993.

[2] J.A. Stephen and R.R. Oldehoeft, HEP SISAL: Parallel Functional Programming, in Parallel MIMD Computation: HEP Supercomputer and Its Applications, ed. J.S. Kowalik, The MIT Press, Cambridge, MA, pp. 123-150, 1985.

[3] I.T. Foster, R. Olson, and S. Tuecke, Productive Parallel Programming: The PCN Approach, Scientific Programming, vol. 1, pp. 51-66, 1992.

[4] I.T. Foster and K.M. Chandy, Fortran M: A Language for Modular Parallel Programming, Argonne National Laboratory, preprint MCS-P327-0992, 1992.

[5] N. Carriero and D. Gelernter, How To Write Parallel Programs: A First Course, The MIT Press, Cambridge, MA, 1990.

[6] T. von Eicken, D.E. Culler, S.C. Goldstein, and K.E. Schauser, Active messages: A mechanism for integrated communications and computation, Proc. 19th Ann. Int. Symp. Comp. Arch., pp. 256-266, 1992.

[7] J. Nieplocha, R.J. Harrison, and R.J. Littlefield, Global Arrays: A Portable "Shared-Memory" Programming Model for Distributed Memory Computers, Proc. Supercomputing 1994, IEEE Computer Society Press, pp. 340-349, 1994.

[8] A. Szabo and N.S. Ostlund, Modern Quantum Chemistry: Introduction to Advanced Electronic Structure Theory, 1st Ed. Revised, McGraw-Hill, New York, 1989.

[9] R.J. Harrison, M.F. Guest, R.A. Kendall, D.E. Bernholdt, A.T. Wong, M.S. Stave, J.L. Anchell, A.C. Hess, R.J. Littlefield, G.I. Fann, J. Nieplocha, G.S. Thomas, D. Elwood, J. Tilson, R.L. Shepard, A.F. Wagner, I.T. Foster, E. Lusk, and R. Stevens, Fully Distributed Parallel Algorithms: Molecular Self-Consistent Field Calculations, J. Comp. Chem., in press.

[10] R.J. Harrison and R.L. Shepard, Ab Initio Molecular Electronic Structure on Parallel Computers, Annu. Rev. Phys. Chem., 45:623-658, 1994.