
SIAM News, August/September 1995

The Global Array Programming Model for High Performance Scientific Computing

J. Nieplocha, R.J. Harrison, and R.J. Littlefield
Pacific Northwest Laboratory

Motivated by the characteristics of current parallel architectures, we have developed an approach to the programming of scalable scientific applications that combines some of the best features of message-passing and shared-memory programming models. Two assumptions permeate our work. The first is that most high performance parallel computers have, and will continue to have, physically distributed memories with non-uniform memory access (NUMA) timing characteristics. NUMA machines work best with application programs that have a high degree of locality in their memory reference patterns. The second assumption is that extra programming effort is, and will continue to be, required to construct such applications. Thus, a recurring theme in our work is the development of techniques and tools that minimize the extra effort required to construct application programs with explicit control of locality.

There are significant tradeoffs among the important considerations of portability, efficiency, and ease of coding. The message-passing programming model is widely used because of its portability. Some applications, however, are too complex to be coded in a message-passing mode if care is to be taken to maintain a balanced computation load and avoid redundant computations. The shared-memory programming model simplifies coding, but it is not portable and often provides little control over interprocessor data transfer costs. Other more recent parallel programming models, represented by such languages and facilities as HPF[1], SISAL[2], PCN[3], Fortran-M[4], Linda[5], and shared virtual memory, address these problems in different ways and to varying degrees. None of these models represents an ideal solution.
Global Arrays (GAs), the approach described here, lead to both simple coding and efficient execution for a class of applications that appears to be fairly common. The key concept of the GA model is that it provides a portable interface through which each process in a MIMD parallel program can independently, asynchronously, and efficiently access logical blocks of physically distributed matrices, with no need for explicit cooperation by other processes. In this respect, it is similar to the shared-memory programming model. In addition, however, the GA model acknowledges that more time is required to access remote data than local data, and it allows data locality to be explicitly specified and used. In these respects, it is similar to message passing.

Figure 1: The memory hierarchy of a typical NUMA architecture (registers, on-chip cache, off-chip cache, main memory, virtual memory; speed increases toward the registers, capacity toward virtual memory).

NUMA Architecture

The concept of NUMA is important even to the performance of modern sequential personal computers or workstations. On a standard RISC workstation, for instance, good performance of the processors results from algorithms and compilers that optimize usage of the memory hierarchy. The memory hierarchy is formed by registers, on-chip cache, off-chip cache, main memory, and virtual memory (see Figure 1). If the programmer ignores this structure and constantly flushes the cache or, even worse, thrashes the virtual memory, performance will be seriously degraded. The classic solution to this problem is to access data in blocks small enough to fit in the cache and then ensure that the algorithm makes sufficient use of the encached data to justify the costs of moving the data.

To the NUMA hierarchy of sequential computers, parallel computers add at least one extra layer: remote memory. Access to remote memory on distributed-memory machines is accomplished through message passing.
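The blocking idea described above can be sketched with the classic tiled matrix multiply. The tile size T stands in for "small enough to fit in the cache" and is a hypothetical tuning parameter, not a value from the article; each T-by-T tile is moved once and then reused across the whole inner triple loop, which is what justifies the cost of moving it.

```python
def blocked_matmul(A, B, T=2):
    """C = A @ B for square nested-list matrices, computed tile by tile.

    Each T x T tile of A and B touched in the inner loops is small enough
    to stay in fast memory (cache) for the duration of those loops, so the
    cost of bringing it in is amortized over many uses.
    """
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, T):
        for kk in range(0, n, T):
            for jj in range(0, n, T):
                # Work entirely within three small tiles of A, B, and C.
                for i in range(ii, min(ii + T, n)):
                    for k in range(kk, min(kk + T, n)):
                        aik = A[i][k]
                        for j in range(jj, min(jj + T, n)):
                            C[i][j] += aik * B[k][j]
    return C
```

The result is identical to the untiled triple loop; only the order in which memory is touched changes, which is precisely the point of blocking.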
Message passing, in addition to the required cooperation between sender and receiver that makes this programming paradigm difficult to use, introduces degradation of latency and bandwidth in the accessing of remote, as opposed to local, memory. Scalable shared-memory machines, i.e., architecturally distributed-memory machines with hardware support for shared-memory operations (for example, the KSR-2 or the Convex Exemplar), allow access to remote memory in the same fashion as to local memory. However, this uniform mechanism for accessing local and remote memory should be seen only as a programming convenience: on both shared- and distributed-memory computers, the latency and bandwidth for accessing remote memory are significantly larger than for local memory and therefore must be incorporated into performance models.

If we think about programming of MIMD parallel computers (either shared- or distributed-memory) in terms of NUMA, then parallel computation differs from sequential computation only in terms of concurrency. By focusing on NUMA, we not only have a framework in which to reason about the performance of our parallel algorithms (i.e., memory latency, bandwidth, data and reference locality), we also conceptually unite sequential and parallel computation.

Global Array Model

The GA programming model is motivated by the NUMA characteristics of current parallel architectures. By removing the unnecessary processor interactions required to access remote data in the message-passing paradigm, the GA model greatly simplifies parallel programming and is similar in this respect to the shared-memory programming model. However, the GA model also acknowledges that it is more time consuming to access remote data than local data (i.e., remote memory is yet another layer of NUMA), and it allows data locality to be explicitly specified and used.
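The role of latency and bandwidth in such performance models can be made concrete with the standard linear cost model, t = latency + bytes / bandwidth. The latency and bandwidth figures below are invented for illustration only; they are not measurements from the article.

```python
def transfer_time(nbytes, latency_s, bandwidth_bytes_per_s):
    """Linear cost model for one remote-memory transfer."""
    return latency_s + nbytes / bandwidth_bytes_per_s

# Hypothetical NUMA numbers: 10 microseconds latency, 100 MB/s bandwidth.
LATENCY, BANDWIDTH = 10e-6, 100e6

def patchwise(nbytes):
    """Fetch nbytes in a single transfer: latency is paid once."""
    return transfer_time(nbytes, LATENCY, BANDWIDTH)

def elementwise(nbytes, element=8):
    """Fetch nbytes as individual 8-byte elements: latency is paid
    once per element."""
    return (nbytes // element) * transfer_time(element, LATENCY, BANDWIDTH)
```

Under these assumed numbers, fetching an 8 KB patch in one operation takes roughly 92 microseconds, while fetching it one 8-byte element at a time takes over 10 milliseconds, more than a hundred times longer. This is why a model that lets the programmer move whole patches, rather than individual remote elements, matters.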
Advantages of the GA model over a shared-memory programming model include its explicit distinction between local and remote memory and the availability of two distinct mechanisms for accessing local and remote data. Global arrays, instead of hiding the NUMA characteristics, expose them to the programmer and make it possible to write more efficient and scalable parallel programs.

The current GA programming model can be characterized as follows:

- MIMD parallelism is provided via a multiprocess approach, in which all non-GA data, file descriptors, and so on are replicated or unique to each process.

- Processes can communicate with each other by creating and accessing GA distributed matrices, as well as (if desired) by conventional message passing.

- Matrices are physically distributed block-wise, either regularly or as the Cartesian product of irregular distributions on each axis.

- Each process can independently and asynchronously access any two-dimensional patch of a GA distributed matrix, without requiring cooperation from the application code in any other process.

- Several types of access are supported, including "get," "put," "accumulate" (floating-point sum-reduction), and "get and increment" (integer). This list can be extended as needed.

- Each process is assumed to have fast access to some portion of each distributed matrix, and slower access to the remainder. These speed differences define the data as being local or remote, respectively. However, the numeric difference between local and remote memory access times is unspecified.

- Each process can determine which portion of each distributed matrix is stored locally. Every element of a distributed matrix is guaranteed to be local to exactly one process.

This model differs from other common models as follows. Unlike HPF, it allows task-parallel access to distributed matrices, including reduction into overlapping patches.
Unlike Linda[5], it efficiently provides for sum-reduction and access to overlapping patches. Unlike shared-virtual-memory software facilities, the GA paradigm requires explicit library calls to access data but avoids the overhead associated with the maintenance of memory coherence and the handling of virtual page faults. The GA implementation guarantees that all of the required data for a patch can be transferred at the same time. Unlike active messages[6], the GA model does not incorporate the concept of interprocessor cooperation and can thus be implemented efficiently[7] even on shared-memory systems. Finally, unlike some other strategies based on polling, task duration is relatively unimportant in programs that use GAs, which simplifies coding and makes it possible for GA programs to exploit standard library codes without modification.

Global Array Toolkit

This GA interface has been designed in the light of emerging standards. In particular, High Performance Fortran (HPF) will certainly provide the basis for future standards definition for distributed arrays in Fortran. The operations that provide the basic functionality (create, fetch, store, accumulate, gather, scatter, data-parallel operations) all can be expressed as single statements in Fortran-90 array notation and with the data-distribution directives of HPF. The GA model is, however, more general than that of HPF, which currently precludes the use of such operations in MIMD task-parallel code.

Supported Operations

Each GA operation may be categorized as either an implementation-dependent primitive operation or an operation that has been constructed in an implementation-independent fashion from primitive operations.
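As a purely illustrative picture of the model characterized above, the following toy class mimics, on a single process, the patch-oriented get / put / accumulate / get-and-increment interface and the block-wise ownership map (a Cartesian product of irregular distributions along each axis). Every name here is an invented stand-in, not the real GA toolkit API, and the "shared" state is just local data.

```python
import bisect

class ToyGlobalArray:
    """Single-process stand-in for a GA distributed matrix, holding the
    full data plus the ownership map, to illustrate the interface only."""

    def __init__(self, nrows, ncols, row_bounds, col_bounds):
        # row_bounds / col_bounds give the starting row/column of each
        # block along that axis: an irregular block-wise distribution.
        self.data = [[0.0] * ncols for _ in range(nrows)]
        self.row_bounds = row_bounds
        self.col_bounds = col_bounds
        self.counter = 0  # shared integer for get-and-increment

    def owner(self, i, j):
        """Process-grid coordinates owning element (i, j); every element
        is local to exactly one process."""
        pi = bisect.bisect_right(self.row_bounds, i) - 1
        pj = bisect.bisect_right(self.col_bounds, j) - 1
        return (pi, pj)

    def get(self, lo, hi):
        """Copy out the 2-D patch [lo[0]:hi[0], lo[1]:hi[1])."""
        return [row[lo[1]:hi[1]] for row in self.data[lo[0]:hi[0]]]

    def put(self, lo, patch):
        """Overwrite the patch whose upper-left corner is lo."""
        for di, row in enumerate(patch):
            self.data[lo[0] + di][lo[1]:lo[1] + len(row)] = row

    def acc(self, lo, patch, alpha=1.0):
        """Floating-point sum-reduction into a patch (atomic in a real
        implementation, so overlapping accumulates are safe)."""
        for di, row in enumerate(patch):
            for dj, v in enumerate(row):
                self.data[lo[0] + di][lo[1] + dj] += alpha * v

    def read_inc(self):
        """Get-and-increment on a shared integer, the usual building
        block for dynamic (task-parallel) load balancing."""
        v = self.counter
        self.counter += 1
        return v
```

In a real GA program each process would hold only its own block and the get / put / accumulate calls on remote patches would become one-sided communication; the sketch is only meant to make the semantics of the operation list concrete.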