Evaluation of Architectural Support for Global Address-Based Communication in Large-Scale Parallel Machines

Arvind Krishnamurthy*, Klaus E. Schauser†, Chris J. Scheiman†, Randolph Y. Wang*, David E. Culler*, and Katherine Yelick*

*Computer Science Division, University of California, Berkeley, CA 94720. {arvindk,culler,rywang,yelick}@cs.berkeley.edu
†Department of Computer Science, University of California, Santa Barbara, CA 93106. {schauser,chriss}@cs.ucsb.edu

Abstract

Large-scale parallel machines are incorporating increasingly sophisticated architectural support for user-level messaging and global memory access. We provide a systematic evaluation of a broad spectrum of current design alternatives based on our implementations of a global address language on the Thinking Machines CM-5, Intel Paragon, Meiko CS-2, Cray T3D, and Berkeley NOW. This evaluation includes a range of compilation strategies that make varying use of the network processor; each is optimized for the target architecture and the particular strategy. We analyze a family of interacting issues that determine the performance trade-offs in each implementation, quantify the resulting latency, overhead, and bandwidth of the global access operations, and demonstrate the effects on application performance.

1 Introduction

In recent years, several architectures have demonstrated practical scalability beyond a thousand microprocessors, including the nCUBE/2, Thinking Machines CM-5, Intel Paragon, Meiko CS-2, and Cray T3D. More recently, researchers have also demonstrated high-performance communication in Networks of Workstations (NOW) using scalable, switched local area network technology. While the dominant programming model at this scale is message passing, the primitives used are inherently expensive due to buffering and scheduling overheads. Consequently, these machines provide varying levels of architectural support for communication in a global address space via various forms of memory read and write.

We developed the Split-C language to allow experimentation with new communication hardware mechanisms by involving the compiler in the support for the global address operations. Global memory operations are statically typed, so the Split-C compiler can generate a short sequence of code for each potentially remote operation, as required by the specific target architecture. We have developed multiple highly optimized versions of this compiler, employing a range of code-generation strategies for machines with dedicated network processors. In this study, we use this spectrum of run-time techniques to evaluate the performance trade-offs in architectural support for communication found in several of the current large-scale parallel machines.
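To make the source-level model concrete, the following is a minimal Split-C sketch (ours, not an excerpt from the paper) of the statically typed global operations the compiler lowers; the split-phase ":=" assignment and sync() are standard Split-C, while the function and variable names are illustrative:

    /* Illustrative Split-C: a split-phase read and write through
     * statically typed global pointers.  Because the compiler sees the
     * type of each access, it can emit a short, machine-specific code
     * sequence (e.g., an active message request) for each potentially
     * remote operation. */
    void increment_remote(int *global src, int *global dst) {
        int v;
        v := *src;     /* split-phase get: issue the request, keep computing */
        sync();        /* wait for the outstanding get to complete */
        *dst := v + 1; /* split-phase put to a possibly remote location */
        sync();        /* wait until the put has been acknowledged */
    }

The split-phase form is what lets each target runtime overlap the request with useful work; the per-machine code sequences behind ":=" and sync() are exactly what the rest of the paper varies.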
We consider five important large-scale parallel platforms that have varying degrees of architectural support for communication: the Thinking Machines CM-5, Intel Paragon, Meiko CS-2, Cray T3D, and Berkeley NOW. The CM-5 provides direct user-level access to the network; the Paragon provides a network processor (NP) that is symmetric with the compute processor (CP); the Meiko and NOW provide an asymmetric network processor that includes the network interface (NI); and the T3D provides dedicated hardware, which acts as a specialized NP, for remote reads and writes. Against these hardware alternatives, we consider a variety of implementation techniques for global memory operations, ranging from general-purpose active message handlers to specialized handlers executing directly on the NP or in hardware. This implementation exercise reveals several crucial issues, including protection, address translation, synchronization, responsiveness, and flow control, which must be addressed differently under the different regimes and contribute significantly to the effective communication costs in a working system.

Our investigation is largely orthogonal to the many architectural studies of distributed shared memory machines, which seek to avoid unnecessary communication by exploiting address translation hardware to allow consistent replication of blocks throughout the system, and to operating system studies which seek the same end by extending virtual memory support. In those efforts, communication is caused by a single load or store instruction, and the underlying hardware or operating system mechanisms move the data transparently. We focus on what happens when the communication is necessary. So far, distributed shared memory techniques have scaled up from tens of processors toward a hundred, but many leaders of the field suggest that the thousand-processor scale will be reached only by clusters of these machines in the foreseeable future. Our investigation overlaps somewhat with the cooperative shared memory work, which initiates communication transparently but allows remote memory operations to be serviced by programmable handlers on dedicated network processors. The study could in principle be performed with other compiler-assisted shared memory implementations, but these do not have the necessary base of highly optimized implementations on a range of hardware alternatives.

The rest of the paper is organized as follows. Section 2 provides background information for our study: we briefly survey the target architectures and the source language. Section 3 sketches the basic implementation strategies. Section 4 provides a qualitative analysis of the issues that arise in our implementations and their performance impacts. Section 5 substantiates this with a quantitative comparison using microbenchmarks and parallel applications. Finally, Section 6 draws conclusions.

[Figure 1: Structure of the multiprocessor nodes. The diagrams show, for each of the five machines, how the compute processor, network processor (where present), network interface, DMA engines (the BLT on the T3D), and memory attach to the memory or I/O bus.]

2 Background

In this section we establish the background for our study, including the key architectural features of our candidate machines and an overview of Split-C.

2.1 Machines

We consider five machines, all constructed with commercial microprocessors and a scalable, low-latency interconnection network. The processor and network performance differs across the machines, but more importantly they differ in the processor's interface to the network, ranging from a minimal network interface on the CM-5 to a full-fledged processor on the Paragon. Figure 1 gives a sketch of the node architecture on each machine.

Thinking Machines CM-5: The CM-5 has the most primitive messaging hardware of the five machines. Each node contains a single 33 MHz Sparc processor and a conventional MBus-based memory system. (We ignore the vector units in both the CM-5 and Meiko machines.) The network interface unit provides user-level access to the network. Each message has a tag identifying it as a system message, an interrupting user message, or a non-interrupting user message that can be polled from the NI. The compute processor sends messages by writing to output FIFOs in the NI using uncached stores; it polls for messages by checking network status registers. Thus, the network is effectively a distributed set of queues. The queues are quite shallow, holding only three five-word messages. The network is a 4-ary fat tree with a link bandwidth of 20 MB/sec in each direction.
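The uncached-store and polling discipline just described has roughly the following shape in C. This is a hypothetical sketch: the base address, register layout, and three-word message format are illustrative placeholders, not the actual CM-5 NI programming interface.

    #include <stdint.h>

    /* Hypothetical memory-mapped NI registers; the real layout differs. */
    #define NI_SEND_FIFO  (*(volatile uint32_t *)0xA0000000u) /* push a word   */
    #define NI_SEND_OK    (*(volatile uint32_t *)0xA0000004u) /* send accepted? */
    #define NI_RECV_READY (*(volatile uint32_t *)0xA0000008u) /* msg pending?   */
    #define NI_RECV_FIFO  (*(volatile uint32_t *)0xA000000Cu) /* pop a word    */

    /* Send one small message with uncached stores; if the shallow output
     * queue was full, the NI rejects the message and we simply retry. */
    static void ni_send(uint32_t route, uint32_t w0, uint32_t w1) {
        do {
            NI_SEND_FIFO = route;  /* destination/tag word */
            NI_SEND_FIFO = w0;
            NI_SEND_FIFO = w1;
        } while (!NI_SEND_OK);     /* check status register; retry on failure */
    }

    /* Poll the network status register; drain one message if present. */
    static int ni_poll(uint32_t msg[3]) {
        if (!NI_RECV_READY)
            return 0;              /* nothing arrived */
        msg[0] = NI_RECV_FIFO;
        msg[1] = NI_RECV_FIFO;
        msg[2] = NI_RECV_FIFO;
        return 1;
    }

Note that both the send-retry loop and the explicit poll run on the compute processor itself; this is precisely the overhead that the NP-based designs below try to move off the critical path.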
Intel Paragon: In the Paragon, each node contains one or more compute processors (50 MHz i860 processors) and an identical CPU dedicated for use as a network processor; our configuration has a single compute processor per node. The compute and network processors share memory over a cache-coherent memory bus. The network processor, which runs in system mode, provides communication through shared memory to user level on the compute processor; it is also responsible for constructing and interpreting message tags. Also attached to the memory bus are DMA engines and a network interface. The network interface provides a pair of relatively deep input and output FIFOs (2 KB each), which can be driven by either processor or by the DMA engines. The network is a 2D mesh with links operating at 175 MB/s in each direction.

Meiko CS-2: The Meiko CS-2 node contains a special-purpose Elan network processor integrated with the network interface and DMA controller. The network processor is attached to the memory bus and is cache-coherent with the compute processor, a 40 MHz three-way superscalar SuperSparc. The network processor functions both as a processor and as a memory device, so the compute processor can issue commands to the network interface and get status back via a memory exchange instruction at user level. The network processor has a dedicated connection to the network; however, it has only modest processing power and no general-purpose cache, so instructions and data are accessed from main memory. The network is comprised of two 4-ary fat trees with a link-level bandwidth of 50 MB/sec.

[Figure 2: Handlers are executed on the CP. Figure 3: Handlers are executed on the NP. Both figures diagram, for a remote read, the memory operations needed to satisfy the read and the numbered steps of the active message request and reply.]
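Figures 2 and 3 differ only in where the request handler runs (on the compute processor after a poll or interrupt, or directly on the network processor); the handler logic itself looks roughly like the following C sketch. The am_request, am_reply, and am_poll primitives stand in for a generic active message layer and are assumptions for illustration, not the paper's actual runtime interface.

    #include <stdint.h>

    typedef struct {
        volatile int done;   /* set by the reply handler */
        uint32_t     value;  /* the fetched word */
    } read_flag_t;

    /* Assumed generic active-message primitives (illustrative only). */
    extern void am_request(int node, void (*handler)(), ...);
    extern void am_reply(int node, void (*handler)(), ...);
    extern void am_poll(void);

    /* Reply handler, run on the requesting node: deposit the value and
     * signal completion (the final steps in Figures 2 and 3). */
    void read_reply_handler(uint32_t value, read_flag_t *flag) {
        flag->value = value;
        flag->done  = 1;
    }

    /* Request handler, run on the remote node -- on the CP (Figure 2)
     * or on the NP (Figure 3): read the word and send back the reply. */
    void read_request_handler(int src_node, uint32_t *addr, read_flag_t *flag) {
        am_reply(src_node, read_reply_handler, *addr, flag);
    }

    /* The kind of sequence a compiler might emit for a blocking global
     * read: send the request, then service the network until the reply
     * handler fires. */
    uint32_t global_read(int node, uint32_t *addr) {
        read_flag_t f = { 0, 0 };
        am_request(node, read_request_handler, addr, &f);
        while (!f.done)
            am_poll();   /* keep draining messages while waiting */
        return f.value;
    }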