Portable, MPI-Interoperable Coarray Fortran

Chaoran Yang, Department of Computer Science, Rice University, [email protected]
Wesley Bland, Math. and Comp. Sci. Division, Argonne National Laboratory, [email protected]
John Mellor-Crummey, Department of Computer Science, Rice University, [email protected]

Pavan Balaji, Mathematics and Computer Science Division, Argonne National Laboratory, [email protected]

Abstract

The past decade has seen the advent of a number of parallel programming models such as Coarray Fortran (CAF), Unified Parallel C, X10, and Chapel. Despite the productivity gains promised by these models, most parallel scientific applications still rely on MPI as their data movement model. One reason for this trend is that it is hard for users to incrementally adopt these new programming models in existing MPI applications. Because each model uses its own runtime system, they duplicate resources and are potentially error-prone. Such independent runtime systems were deemed necessary because MPI was considered insufficient in the past to play this role for these languages.

The recently released MPI-3, however, adds several new capabilities that now provide all of the functionality needed to act as a runtime, including a much more comprehensive one-sided communication framework. In this paper, we investigate how MPI-3 can form a runtime system for one example programming model, CAF, with a broader goal of enabling a single application to use both MPI and CAF with the highest level of interoperability.

Categories and Subject Descriptors D.3.2 [Programming Languages]: Language Classifications—Concurrent, distributed, and parallel languages; D.3.4 [Programming Languages]: Processors—Runtime environments

Keywords Coarray Fortran; MPI; PGAS; Interoperability

PPoPP'14, Feb. 15-19, 2014, Orlando, Florida, USA. ACM 978-1-4503-2656-8/14/02. http://dx.doi.org/10.1145/2555243.2555270

1. Introduction

Message Passing Interface (MPI) is the de facto standard for programming large-scale parallel programs today. There are several reasons for its success, including a rich and standardized interface coupled with heavily optimized implementations on virtually every platform in the world. MPI's philosophy is simple: its primary purpose is not to make simple programs easy to implement; rather, it is to make complex programs possible to implement.

In recent years a number of new programming models such as Coarray Fortran (CAF) [24], Unified Parallel C (UPC) [29], X10 [25], and Chapel [7] have emerged. These programming models feature a Partitioned Global Address Space (PGAS) where data is accessed through language load/store constructs that eventually translate to one-sided communication routines at the runtime layer. Together with the obvious productivity benefits of having direct language constructs for moving data, these languages also provide the potential for improved performance by utilizing the native hardware communication features on each platform.

Despite the productivity promises of these newer programming models, most parallel scientific applications are slow to adopt them and continue to use MPI as their data movement model of choice. One of the reasons for this trend is that it is not easy for programmers to incrementally adopt these new programming models in existing MPI applications. Because most of these new programming models have their own runtime systems, to incrementally adopt them, the user would need to initialize and use separate runtime libraries for each model within a single application. Not only does this duplicate the runtime resources, e.g., temporary memory buffers and metadata for memory segments, but it is also error-prone, with a possibility for deadlock if the libraries are not used in a careful and potentially platform-specific manner (more details to follow).


Why do we need Interoperable Programming Models? Because of MPI's vast success in the parallel programming arena, a large ecosystem of tools and libraries has developed around it. There are high-level libraries using MPI for complex mathematical computations, I/O, visualization and data analytics, and almost every other paradigm used by scientific applications. One of the drawbacks of other programming models is that they do not enjoy such support, making it hard for applications to utilize them. For example, a new CAF application cannot directly utilize a math library (such as PETSc or Trilinos) that is written with MPI. Similarly, if one developed a high-performance FFT library in CAF, an MPI application cannot directly plug it in if the underlying runtime systems are not interoperable.

Such requirements are already seen in a number of scientific applications today. For example, in [23], Preissl et al. identified that the Gyrokinetic Tokamak Simulation code, which is based on MPI+OpenMP, can naturally benefit from the language constructs in CAF that enable direct remote data accesses. Consequently, they modified their application to further hybridize it with MPI+CAF+OpenMP and demonstrated performance improvements with such a model. QMCPACK [16], an MPI-based quantum Monte Carlo package developed by Oak Ridge National Laboratory, and GFMC [17, 22], an MPI-based nuclear physics Monte Carlo simulation developed by Argonne National Laboratory, both demonstrate such requirements as well. Specifically, both of these applications rely on large arrays that reside on each node for their core sequential computations, and they use MPI to communicate data between processes. As problem sizes grow, however, these arrays are becoming too large to reside on a single node, thus requiring the memory of multiple nodes to accommodate them. Hybridizing with MPI+CAF provides a natural extension for these MPI applications: they can simply define these arrays as CAF coarrays, allowing the runtime system to distribute them across nodes and convert load/store accesses of these arrays to remote data access operations.

Challenges in Interoperable Programming Models. There are several challenges in facilitating multiple programming models to interoperate with each other. Some of these aspects are related to the programming semantics and the execution model itself. For instance, for a programming model such as Chapel, which exposes a completely dynamic execution model where tasks can fork other tasks on demand, to interoperate with a more static model such as MPI, a clear definition of what the user is allowed to do needs to be specified. This part is not the focus of this paper. Instead we focus on the interoperation of CAF and MPI, which have similar execution model semantics, with a number of images/processes that move data between each other.

Even for programming models with similar execution semantics, various challenges exist that make interoperability hard. For example, today, each programming model uses its own separate runtime system. GASNet [5] is a common runtime system used by Berkeley UPC, CAF 2.0, and others; MPI, on the other hand, uses its own platform-specific runtime system. An application that uses multiple programming models would need to initialize and use multiple runtime systems in a single application, thus duplicating resources.

[Figure 1. An example of memory usage when using both GASNet and MPI: mapped memory size (MB) per process at 16, 64, and 256 processes for GASNet only, MPI only, and the duplicated runtimes.]

Figure 1 shows an example of the per-process memory usage when initializing both GASNet and MPI in an application. (GASNet uses less memory than MPI because it saves data such as metadata of memory segments in user-space buffers.) The memory usage of both libraries grows along with the number of processes. The duplicated runtime resources reduce the resources available to an application, which will eventually hurt performance or prevent the application from running at a larger scale.

 1  PROGRAM MAY_DEADLOCK
 2    USE MPI
 3
 4    CALL MPI_INIT(IERR)
 5    CALL MPI_COMM_RANK(MPI_COMM_WORLD, MY_RANK, IERR)
 6
 7    IF (MY_RANK .EQ. 0) THEN
 8      A(:)[1] = A(:)
 9    END IF
10
11    CALL MPI_BARRIER(MPI_COMM_WORLD, IERR)
12    CALL MPI_FINALIZE(IERR)
13  END PROGRAM

Figure 2. A CAF program that may deadlock because CAF cannot make progress when the target process blocks in MPI.

Using multiple runtime systems within a single application also makes it hard to reason about the correctness of codes. The semantics of an operation are often well-defined with respect to its own runtime system but are unclear when multiple runtime systems are used. For example, Figure 2 lists a simple CAF program that has the process with rank 0 perform a coarray write (line 8), then has every process participate in an MPI_BARRIER (line 11). Depending on the implementation of CAF, a coarray write operation may require the involvement of the target process to complete.


Although such involvement is implicit, it requires the target process to make a runtime call to make progress internally. In this example, because the target process is likely to run into the MPI_BARRIER before seeing the write operation, the coarray write on process 0 may never complete. Some implementations of CAF may use a separate progress thread or have better support from the hardware, allowing them to make asynchronous progress. This makes the scenario implementation-specific and even harder for users to debug.

Contributions of this paper. One possible solution to the interoperability issues described above is to use MPI as the runtime system for CAF, thus allowing all data movement to be funneled through a common runtime library. However, the use of MPI as a runtime system was previously deemed impossible for these PGAS programming models because MPI was considered inadequate to implement these models. Bonachea and Duell [6] put together a comprehensive list of reasons why the remote memory access (RMA) features introduced in the MPI-2 Standard fall short of the task of serving as a compilation target of PGAS languages. The recently released MPI-3 Standard, however, adds several new capabilities including a much more comprehensive one-sided communication (or RMA) model. These new additions not only address the critiques raised about MPI-2 RMA but also provide new functionality with performance benefits over existing PGAS runtime systems. But whether MPI-3 can live up to the goal of forming a runtime basis for PGAS programming models with these new additions is yet to be validated in practice.

In this paper, we investigate the capability of MPI in serving as the basis of a PGAS programming model such as CAF. We redesigned the runtime system of CAF 2.0, which was originally built on top of GASNet, to use MPI-3. The paper describes various design choices we made during this process. Further, we present the performance of CAF over MPI-3 relative to that of the original CAF implementation, and demonstrate that the MPI runtime system can achieve comparable or better performance in several cases. We also point out some cases where the performance falls short of the existing implementation; in such cases we present a detailed analysis of the performance difference.

This paper is organized as follows: Section 2 presents an overview of CAF 2.0 and MPI-3 and serves as background knowledge for the later discussion of the paper. Section 3 describes the design of the CAF-MPI runtime system. In Section 4 we evaluate our implementation using three HPC Challenge Benchmarks [11] and the CGPOP miniapp [27], which uses both CAF and MPI. In Section 5 we discuss possible extensions to the MPI-3 RMA interface that may provide further benefits when using MPI as the basis of PGAS languages. We discuss related work in Section 6 and present a summary of our work in Section 7.
2. Overview of CAF and MPI-3

To make the paper relatively self-contained, this section briefly introduces the necessary context for this paper. We give an overview of CAF [18] and MPI-3 [19]. For further background information, refer to the cited papers.

2.1 Coarray Fortran 2.0

We choose Coarray Fortran 2.0 (CAF 2.0) [18], developed by Rice University, as our PGAS programming model to test the new capabilities of MPI-3. CAF 2.0 is an open source implementation of CAF and contains a richer set of PGAS extensions than the coarray support that is integrated into the Fortran 2008 standard. CAF 2.0 therefore better represents the semantics and requirements of modern PGAS languages and provides a better picture of the appropriateness of MPI-3 as a basis for these models. We describe the PGAS extensions in CAF 2.0 that are not included in the Fortran 2008 standard in this section.

Teams. In CAF 2.0, a team is a first-class entity that represents a group of processes, analogous to an MPI communicator. Teams in CAF 2.0 serve three purposes: (a) the set of images in a team serves as a domain onto which coarrays may be allocated; (b) a team provides a name space within which images can be indexed by their relative rank within that team; (c) a team provides an isolated domain for members of a team to communicate and synchronize collectively. When a CAF program starts, all images belong to the same team named TEAM_WORLD, and new teams can be created with the team_split operation.

Synchronization. CAF 2.0 provides a rich set of synchronization constructs. These constructs are designed to enable programmers to structure correct parallel programs while allowing maximum opportunity to hide communication latency with computation. These constructs were previously introduced by Yang et al. in [32].

Pairwise synchronization between process images in CAF 2.0 can be achieved with events, a first-class entity in CAF 2.0. An event must be initialized with event_init before any other operation can occur. Events can be allocated as coarrays to enable them to be posted by other process images. One posts an event with the event_notify operation. An event_wait operation blocks execution of the current thread until the event is posted. event_trywait is a non-blocking operation that tests whether an event is posted and returns immediately.

To synchronize asynchronous operations (described in the next subsection), CAF 2.0 provides two synchronization models: the implicit synchronization model and the explicit synchronization model. All asynchronous operations in CAF 2.0 optionally accept event arguments. Asynchronous operations invoked with an event argument are explicitly synchronized operations. The events passed to asynchronous operations are notified by the runtime system when the synchronization point is reached. One can test or wait for the event to synchronize these operations.

One can also use the implicit synchronization model to synchronize asynchronous operations. CAF 2.0 provides cofence statements and finish blocks to serve this purpose. A cofence statement blocks the current thread until all asynchronous operations issued before the statement complete locally. More specifically, a cofence ensures that all local buffers used by previously issued asynchronous operations are ready to be reused. A cofence also serves as a barrier; operations are not allowed to be reordered across a cofence statement.

finish is a block-structured global synchronization construct in CAF 2.0, demarcated by finish and end finish statements. finish in CAF 2.0 is similar to the finish block in X10; it ensures that all asynchronous operations issued within the block are globally complete before the current thread exits from the block. However, unlike the finish statement in X10, a finish statement in CAF 2.0 is a collective operation. A finish statement takes a team variable, and each process image within the associated team should create a finish block that matches those of its teammates. When the thread exits from the block, all operations issued by all images within the team are globally complete. finish blocks can be nested within each other; this is useful because it allows asynchronous operations issued within the outer finish block to proceed while the process is waiting for the inner finish block to complete, which yields more opportunity for communication-computation overlap.

Asynchronous operations. CAF 2.0 provides three categories of asynchronous operations: asynchronous copy, asynchronous collectives, and function shipping. Scherer et al. and Yang provide detailed descriptions of these operations elsewhere [26, 31].

A copy_async operation transfers data from a buffer (the source) to another buffer (the destination) asynchronously. The source and destination may be local or remote coarrays. An asynchronous copy takes three optional event arguments: the predicate event, the source event, and the destination event. The copy may proceed after its predicate event is posted. The source event, when posted, indicates the source buffer is ready to be reused. Notification of the destination event indicates the data has been delivered to the destination; the destination is ready to be read.

Asynchronous collectives have similar functionality as their synchronous versions except that they are non-blocking. For example, team_reduce_async performs a reduction across a team asynchronously. An asynchronous collective takes two optional event arguments: the data completion event and the operation completion event. The data completion event indicates the local buffer is ready to be read, and the operation completion event indicates the local buffer is ready to be modified.

Function shipping is a new addition of CAF 2.0 that enables programmers to transfer computation to where data is located, rather than moving data towards computation. The design of the CAF 2.0 function shipping mechanism allows the shipped function to perform the full range of CAF 2.0 operations, e.g., spawning more functions, performing blocking communications, and synchronization.
2.2 MPI-3 Remote Memory Access Extensions

The MPI-3 Standard improved the MPI-2 RMA functionality in a number of ways, including the addition of atomic operations, request-generating RMA operations, new synchronization operations in passive target epochs, a new unified memory model, and new window types. This section serves as an introductory description of the features relevant to this paper.

One-sided atomic operations. MPI-3 RMA provides several operations that enable one to manipulate data in target windows atomically. The MPI_ACCUMULATE and MPI_GET_ACCUMULATE operations allow remote atomic updates of data, with MPI_GET_ACCUMULATE further fetching the original remote data back to the origin. The MPI-3 Standard also defines MPI_FETCH_AND_OP, which is a special case of MPI_GET_ACCUMULATE for single-element predefined datatypes (to provide a fast path for such operations), and MPI_COMPARE_AND_SWAP, which provides a remote compare-and-swap operation.

Request-generating operations. MPI-3 adds 'R' versions of most of the RMA operations that return request handles, such as MPI_RPUT, MPI_RGET, and MPI_RACCUMULATE. The request handles returned by these operations can later be passed to MPI request completion routines, such as MPI_WAIT, to manage local completion of the operation. These request-generating operations may be used only in passive target synchronization epochs.

MPI-allocated windows. MPI-3 adds several new routines for creating windows. MPI_WIN_ALLOCATE allows the MPI implementation to allocate memory for a window, rather than using a user-allocated buffer. When MPI allocates memory for a window, it has the opportunity to allocate special memory regions, such as aligned memory segments across a group or shared memory regions, which reduces the overhead of RMA operations performed on that window. MPI_WIN_ALLOCATE_SHARED allocates memory that is shared by all processes in the group of the communicator when the processes belong to the same shared-memory node. MPI_WIN_CREATE_DYNAMIC creates a window without memory attached; one can dynamically attach memory later with MPI_WIN_ATTACH.

Unified memory model. MPI-2 RMA assumes no coherence in the memory subsystem or network interface, resulting in logically distinct public and private copies of a window. This conservative model (the separate model) is a poor match for systems where coherent memory subsystems are available. The new unified memory model added in MPI-3 better exposes these hardware capabilities to the user. This assumption relaxes several restrictions present in the separate model, such as concurrent access to non-overlapping regions in a window by an MPI_PUT and a regular store operation, thus allowing for higher concurrency in access to the window data.

Passive target synchronization. MPI-3 adds a pair of locking routines to lock and unlock all targets of a window simultaneously. With MPI_WIN_LOCK_ALL, finer-grained synchronization can be achieved, for example using the request-generating operations described earlier in this section. Two new synchronization routines, MPI_WIN_FLUSH and MPI_WIN_FLUSH_ALL, were also added that allow the user to complete all operations initiated by an origin process to a specified target or to all targets.
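To make these additions concrete, the following minimal C sketch (our own illustration, not code from the paper) allocates a window with MPI_WIN_ALLOCATE, opens a passive-target epoch on all ranks with MPI_WIN_LOCK_ALL, and uses the request-generating MPI_RPUT together with MPI_WIN_FLUSH to distinguish local from remote completion. It assumes at least two processes.

```c
/* Minimal sketch of MPI-3 passive-target RMA; run with >= 2 processes. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double *base;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Let MPI allocate the window memory itself (one double per process). */
    MPI_Win_allocate(sizeof(double), sizeof(double), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &base, &win);
    base[0] = 0.0;

    /* Lock all targets once; request-generating operations may only be
       used inside passive-target epochs. */
    MPI_Win_lock_all(0, win);
    MPI_Barrier(MPI_COMM_WORLD);        /* every rank has initialized base[0] */

    if (rank == 0) {
        double val = 42.0;
        MPI_Request req;
        MPI_Rput(&val, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);  /* local completion only */
        MPI_Win_flush(1, win);              /* remote completion at rank 1 */
    }

    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 1) {
        MPI_Win_sync(win);                  /* sync public/private window copies */
        printf("rank 1 sees %.1f\n", base[0]);
    }

    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```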
3. Designing CAF over MPI-3

In this section, we describe the key details of a new implementation of CAF 2.0 using MPI-3 as its communication substrate. CAF 2.0 features that are trivial to implement with MPI, e.g., teams and collectives, are left out due to space constraints. In the rest of the paper, we will sometimes refer to our implementation of CAF as CAF-MPI and the original CAF 2.0 as CAF-GASNet.

3.1 Coarrays

Coarrays are the main addition of CAF to Fortran 95. Coarrays add a codimension to a plain Fortran array. The codimension indicates the process image on which an array is located. Coarrays on remote images can be accessed with Fortran 95 array section syntax plus a codimension. Thus, reading from or writing to a remote coarray is a one-sided operation that maps naturally to MPI_GET and MPI_PUT.

The original CAF 2.0 runtime system represents a reference to a remote memory location with an (image ID, address) tuple. Because MPI RMA hides the absolute address of remote memory in the window object, and currently provides no interface to access this information, we augmented the representation: a remote coarray location now includes a window object and the offset from the start of the window. Thus, the new remote memory reference becomes a (window, rank, displacement) tuple.

CAF-MPI allocates coarrays using MPI_WIN_ALLOCATE. With MPI_WIN_ALLOCATE, an MPI implementation can potentially improve performance by allocating aligned memory segments or shared memory regions for the window.

The semantics of coarray read and write operations require the effect of the operation to be globally visible after the operation completes. Proper synchronization is needed to ensure this semantic. Because the active target synchronization model in MPI requires the participation of the target processes, it is more convenient to use the passive target synchronization model in CAF. With the new additions of synchronization routines in the MPI-3 passive target model, we can lock all targets with MPI_WIN_LOCK_ALL when a coarray is allocated. Blocking read and write operations in CAF-MPI use MPI_WIN_FLUSH to ensure remote completion. The target processes of a window are only unlocked when the coarray is deallocated.
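The following C sketch illustrates one way the (window, rank, displacement) reference and the blocking read/write paths could be organized; the names caf_ref_t, caf_coarray_write, and caf_coarray_read are ours for illustration and do not name the actual CAF-MPI internals.

```c
/* Hypothetical sketch of a remote coarray reference and blocking accesses. */
#include <mpi.h>

typedef struct {
    MPI_Win  win;   /* window created by MPI_Win_allocate for the coarray */
    int      rank;  /* target image, expressed as a rank in the window    */
    MPI_Aint disp;  /* displacement from the start of the window          */
} caf_ref_t;

/* Blocking remote write: the window was locked with MPI_Win_lock_all when
   the coarray was allocated, so every access happens inside one long
   passive-target epoch. */
void caf_coarray_write(caf_ref_t dst, const void *src, int nbytes)
{
    MPI_Put(src, nbytes, MPI_BYTE, dst.rank, dst.disp,
            nbytes, MPI_BYTE, dst.win);
    MPI_Win_flush(dst.rank, dst.win);   /* wait for remote completion */
}

/* Blocking remote read. */
void caf_coarray_read(void *dst, caf_ref_t src, int nbytes)
{
    MPI_Get(dst, nbytes, MPI_BYTE, src.rank, src.disp,
            nbytes, MPI_BYTE, src.win);
    MPI_Win_flush(src.rank, src.win);   /* data has arrived in dst */
}
```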
3.2 Active Messages

Active Messages (AM) [30] are an integral component of GASNet. The CAF 2.0 runtime system makes heavy use of them in various places, e.g., the function shipping and event mechanisms. For CAF-MPI, we implemented Active Messages using MPI's two-sided communication routines. Our design of the AM interface is a near-exact replica of the AM interface in the GASNet core API, to maintain maximum compatibility with the original CAF 2.0 runtime system. GASNet's AM API consists of three categories of AM: short Active Messages carry only a few integer arguments; medium Active Messages can carry an opaque data payload in addition to integer arguments; and long Active Messages can also carry an opaque data payload, but the user needs to specify a predetermined address in the target process's memory space to receive the data payload. The number of integer arguments that a short AM can carry can be queried with the gasnet_AMMaxArgs() function; the size of the data payload a medium AM can carry can be queried with gasnet_AMMaxMedium(). Providing different APIs for different message sizes allows the compiler to generate more efficient code when the message size is known at compile time.

Since AMs are used in many places in CAF 2.0's runtime system, their performance is critical. To ensure a fast message injection rate for AM, we use MPI_ISEND to send all messages. Integer arguments and medium data payloads are internally buffered so that MPI_ISEND can be used, and waiting for local completion of the send operation is delayed until the next synchronization point. The data payloads of long AMs are sent with an extra blocking MPI_SEND. Theoretically, this extra send could be replaced by an MPI_PUT operation to avoid the internal data buffering within MPI. However, because the current MPI standard does not provide a mechanism to notify the target on the arrival of an MPI_PUT, it is hard to ensure that the AM is invoked only after its data payload arrives. Hence we stayed with the MPI_SEND based design in our approach.
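A simplified sketch of how a medium AM could be layered on MPI_ISEND follows; the message layout, tag, and function name are assumptions of this illustration rather than the GASNet core API or the CAF-MPI source.

```c
/* Illustrative medium active message over MPI two-sided communication. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define AM_TAG      77            /* hypothetical reserved tag */
#define AM_MAX_ARGS 16

typedef struct {
    int handler;                  /* index into the receiver's handler table */
    int nargs;
    int args[AM_MAX_ARGS];
    int payload_bytes;
} am_header_t;

/* Inject a medium AM.  'pending' accumulates send requests that the runtime
   completes later with MPI_Waitall; the buffer must stay alive until then. */
void am_request_medium(int target, int handler,
                       const int *args, int nargs,
                       const void *payload, int payload_bytes,
                       MPI_Request *pending, int *npending)
{
    am_header_t *buf = malloc(sizeof(*buf) + payload_bytes);
    buf->handler       = handler;
    buf->nargs         = nargs;
    buf->payload_bytes = payload_bytes;
    memcpy(buf->args, args, nargs * sizeof(int));
    memcpy(buf + 1, payload, payload_bytes);   /* payload follows the header */

    MPI_Isend(buf, (int)(sizeof(*buf) + payload_bytes), MPI_BYTE,
              target, AM_TAG, MPI_COMM_WORLD, &pending[(*npending)++]);
    /* A real runtime would remember 'buf' and free it after the request
       completes at the next synchronization point. */
}
```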
3.3 Asynchronous Operations

The original CAF 2.0's asynchronous progression model is based on a common progress engine that all aspects of the CAF runtime plug into. Thus, if a process is waiting on one CAF event, the runtime can make progress on other operations that are internally queued up. This progress model is similar to what most runtime systems, including MPI, use.

When redesigning CAF's runtime system with MPI, one of the restrictions we faced was with the asynchronous progress engine as it relates to MPI RMA operations. Specifically, the MPI-3 Standard does not provide a mechanism to test for remote completion for all MPI RMA operations. The 'R' versions of MPI RMA operations (e.g., MPI_RPUT and MPI_RGET) provide requests that can be tested or waited on for completion, but completion of this request only refers to local completion for MPI_RPUT and MPI_RACCUMULATE, while it refers to both local and remote completion for MPI_RGET and MPI_RGET_ACCUMULATE. For remote completion of MPI_PUT and MPI_ACCUMULATE, MPI-3 only provides MPI_WIN_FLUSH and MPI_WIN_FLUSH_ALL, apart from the epoch close operations such as MPI_WIN_UNLOCK and MPI_WIN_UNLOCK_ALL. These operations can be blocking (e.g., when someone else is holding the lock) and do not have a request handle that can be used to test for their completion.

To work around this issue, we use the following mapping of operations, also sketched in the code after this list:

1. For one-sided communication operations, if no local or remote completion event is requested, we use MPI_PUT and MPI_GET.

2. If a local or remote completion event is requested for GET-style operations, we use MPI_RGET, which returns an MPI request on which we can wait or test.

3. If only a local completion event is requested for a PUT-style operation, we use MPI_RPUT, which returns an MPI request on which we can wait or test.

4. If a remote completion event is requested for a PUT-style operation, we use active messages (based on MPI_SEND and MPI_RECV as described in Section 3.2) to transfer data in the source buffer to the target process, copy data into the destination buffer, and then post the destination event associated with the asynchronous operation. This option is obviously not as efficient as a direct MPI_PUT or MPI_GET, which can be implemented more efficiently on current network hardware, but it provides the necessary functionality. We will further discuss this limitation of MPI-3 and other possible solutions in Section 5.
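The sketch below, referenced in the list above, illustrates how this mapping could look for a PUT-style copy_async; the argument names and the boolean flags are our own simplification of the runtime's event arguments.

```c
/* Hypothetical illustration of the operation mapping for a PUT-style copy_async. */
#include <mpi.h>
#include <stdbool.h>

void copy_async_put(const void *src, int nbytes,
                    int target, MPI_Aint disp, MPI_Win win,
                    bool want_local_event, bool want_remote_event,
                    MPI_Request *req_out)
{
    if (want_remote_event) {
        /* Case 4: remote completion requested -> fall back to an active
           message (Section 3.2) that delivers the data and posts the
           destination event on the target image. */
        /* am_request_long(target, ..., src, nbytes, ...);  (sketch only) */
    } else if (want_local_event) {
        /* Case 3: only local completion requested -> MPI_RPUT returns a
           request the runtime can test or wait on. */
        MPI_Rput(src, nbytes, MPI_BYTE, target, disp,
                 nbytes, MPI_BYTE, win, req_out);
    } else {
        /* Case 1: no completion event requested -> plain MPI_PUT. */
        MPI_Put(src, nbytes, MPI_BYTE, target, disp,
                nbytes, MPI_BYTE, win);
    }
}
```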
3.4 Explicit Event Notification

There are two obvious approaches to implement the event mechanism in CAF-MPI. One approach is to leverage the newly added MPI_FETCH_AND_OP operation in MPI-3 to notify an event and use the MPI_COMPARE_AND_SWAP operation to busy-wait for the event to be posted in event_wait. The second approach is to use MPI_ISEND to notify an event and MPI_RECV to wait for the event to be posted. The performance implications of these two methods are unclear at this point and may largely depend on the underlying MPI implementation. CAF-MPI uses the second method, since the MPI_SEND and MPI_RECV routines are better tuned to date and a two-sided communication model fits the event_notify and event_wait model more naturally.

event_wait is a blocking operation; it blocks the process until the event being waited upon is posted. To be semantically correct, event_wait also forbids asynchronous operations that follow an event_wait in program order from being reordered to a position before the event_wait (i.e., it also functions as a compiler barrier). This restriction is guaranteed by the code generation process of the CAF-MPI source-to-source translator. event_wait uses a blocking network polling operation to wait for a specific message to arrive; the polling operation internally uses an MPI blocking receive. The benefit of a blocking polling operation is that it allows the MPI runtime to make progress internally and respond to other processes' requests.

The semantics of the event_notify operation specify that when a process posts the notification for an event, the target of the event can only see the notification after all previous operations issued by the notifying process are complete at their respective targets. However, the notification itself is nonblocking, to avoid a possible deadlock caused by circular event_wait and event_notify chains. We implement event_notify with a release barrier and a short AM request. The release barrier holds a request handle to every asynchronous operation initiated locally. Upon event_notify, the release barrier waits to complete all its request handles with MPI_WAITALL; this ensures local completion of these operations. Furthermore, an MPI_WIN_FLUSH_ALL is required to ensure the remote completion of all previous operations. The actual notification in event_notify is performed by sending an AM request to the target process. We use an MPI_ISEND to avoid the deadlock possibility mentioned above.
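The following sketch summarizes this notify/wait design; the bookkeeping arguments (pending requests, touched windows) and the tag are hypothetical stand-ins for the runtime's internal state, not the actual CAF-MPI source.

```c
/* Sketch of the event_notify / event_wait pair described above. */
#include <mpi.h>

#define EVENT_TAG 78   /* hypothetical reserved tag */

void event_notify(int target_rank, int event_id,
                  MPI_Request *pending, int npending,
                  MPI_Win *touched_wins, int nwins,
                  MPI_Request *notify_req)
{
    /* Release barrier: local completion of every asynchronous operation
       issued so far. */
    MPI_Waitall(npending, pending, MPI_STATUSES_IGNORE);

    /* Remote completion of all previously issued one-sided operations. */
    for (int i = 0; i < nwins; i++)
        MPI_Win_flush_all(touched_wins[i]);

    /* The notification itself is a nonblocking short AM, so circular
       notify/wait chains cannot deadlock here.  (A real runtime keeps the
       integer payload alive until notify_req completes.) */
    MPI_Isend(&event_id, 1, MPI_INT, target_rank, EVENT_TAG,
              MPI_COMM_WORLD, notify_req);
}

void event_wait(int event_id)
{
    int id;
    /* Blocking receive: MPI can progress other internal traffic while this
       image waits.  A real runtime would hand mismatched notifications to
       its AM dispatcher instead of discarding them as this sketch does. */
    do {
        MPI_Recv(&id, 1, MPI_INT, MPI_ANY_SOURCE, EVENT_TAG,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } while (id != event_id);
}
```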


3.5 cofence and finish

Thanks to the new request-generating RMA operations in MPI-3, local completion of RMA operations can easily be waited upon using their request handles. The cofence statement takes an optional argument that a user can use to request local completion notification of PUT or GET operations. Thus, the CAF-MPI runtime system internally maintains an array of request handles of implicitly synchronized PUT operations and another array of request handles of implicitly synchronized GET operations. The cofence statement translates to an MPI_WAITALL call for the local completion of these operations.

finish is implemented in the same way as in the original CAF implementation; it uses a distributed termination detection algorithm presented by Yang [31]. Yang's algorithm detects termination by repeatedly performing SUM reductions across a team to compute the global difference between the number of shipped functions and the number of completed functions shipped from others. Global termination occurs when a sum reduction yields zero for the difference. This algorithm uses n rounds of reductions in the worst case, where n is the length of the longest chain of function shipping calls in the finish block. We also implement a fast version of finish that can be used when function shipping is not used in an application. This version of finish involves calling MPI_WIN_FLUSH_ALL on every window that the local process has touched within the block, followed by an MPI_BARRIER across the team associated with the finish block.
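A sketch of this fast path is shown below; the touched-window list and team communicator are assumed runtime bookkeeping, not the actual CAF-MPI interface.

```c
/* Sketch of the "fast" finish path, usable when no function shipping
 * occurs inside the block. */
#include <mpi.h>

void finish_fast(MPI_Win *touched_wins, int nwins, MPI_Comm team_comm)
{
    /* Remote completion of every RMA operation this image issued in the
       block, one window at a time. */
    for (int i = 0; i < nwins; i++)
        MPI_Win_flush_all(touched_wins[i]);

    /* Once every image of the team has flushed and reached the barrier,
       all operations issued inside the block are globally complete. */
    MPI_Barrier(team_comm);
}
```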
4. Evaluation

We have evaluated CAF-MPI on two platforms: Fusion, an InfiniBand cluster at Argonne National Laboratory, and Edison, a Cray XC30 system at Lawrence Berkeley National Laboratory. The hardware characteristics of these systems are given in Table 1. For each platform, we compare the performance of CAF-MPI with CAF-GASNet. We evaluate the performance of our implementation with three HPC Challenge Benchmarks: HPL, FFT, and RandomAccess [11]. These benchmarks exhibit different communication-to-computation ratios and data access patterns and thus can serve as representative examples for a wide range of applications. We also demonstrate the performance of CAF-MPI with the CGPOP miniapp, which uses both CAF and MPI simultaneously. All benchmarks are compiled with Intel compilers and optimization level "-O3" on both platforms.

4.1 RandomAccess Benchmark

The HPC Challenge RandomAccess benchmark evaluates the rate at which a parallel system can apply read-modify-write updates to randomly indexed entries in a distributed table. Performance of the RandomAccess benchmark is measured in Giga Updates Per second (GUP/s). GUP/s is calculated by identifying the number of table entries that were randomly updated in one second, divided by 1 billion (10^9).

The CAF 2.0 implementation of RandomAccess uses a software routing algorithm that uses a hypercube-based pattern of bulk communication to route updates to the process image co-located with the table index being updated. The CAF 2.0 primitives most heavily used in the RandomAccess benchmark are coarray write and event_notify.

[Figure 3. Performance of RandomAccess benchmark on Fusion (GUPS vs. number of processes).]

Because the performance of the RandomAccess benchmark largely depends on the performance of one-sided communication in the underlying communication library, the result of the RandomAccess benchmark is a good indication of the overhead that the communication library adds on top of the underlying network hardware. Figure 3 shows the performance difference of RandomAccess between CAF-MPI and CAF-GASNet on Fusion. The CAF-GASNet version of RandomAccess outperforms the CAF-MPI version by a small constant factor up to 64 cores; this indicates that the MPI-3 RMA implementation has a constant per-operation overhead that is higher than that of GASNet RMA.

CAF-GASNet's performance drops at 128 cores. Further investigation shows that this is caused by the use of the Shared Receive Queue (SRQ) in GASNet. The default configuration of the GASNet library automatically enables SRQ when doing so reduces the memory usage of GASNet. The effect of using SRQ in MVAPICH2 is not noticed in our experiments of up to 2048 cores. By disabling the use of SRQ in GASNet, CAF-GASNet performs roughly the same as CAF-MPI. The CAF-GASNet version shows slightly better performance than the CAF-MPI version on 1024 and more cores.
[Figure 4. Time decomposition of RandomAccess on 2048 cores on Fusion (time in seconds spent in computation, coarray_write, event_wait, and event_notify for CAF-GASNet and CAF-MPI).]

We profiled the 2048-core run of RandomAccess with HPCToolkit [2] to analyze CAF-MPI; the analysis results are shown in Figure 4. The CAF-MPI version of RandomAccess spends around 200 seconds in event_notify while the CAF-GASNet version spends almost none. This is because of how event_notify is implemented in these two libraries. In CAF-MPI, event_notify invokes MPI_WIN_FLUSH_ALL to ensure that all previously issued operations have been completed before performing the notification. The current implementation of MPI_WIN_FLUSH_ALL in all MPICH derivatives (including MVAPICH and Cray MPI) performs a flush operation on each process within the communicator; hence the execution time of MPI_WIN_FLUSH_ALL grows linearly with the number of processes. This is, of course, a


System              Nodes   Cores per Node   Memory per Node   Interconnect     MPI Version
Cluster (Fusion)    320     2 x 4            36GB              InfiniBand QDR   MVAPICH2-1.9
Cray XC30 (Edison)  5,200   2 x 12           64GB              Cray Aries       CRAY-MPICH-6.0.2

Table 1. Experimental platforms and system characteristics.

Figure 5. Performance of RandomAccess benchmark on Edison.

Figure 6. Performance of FFT benchmark on Fusion.

The performance of RandomAccess on Edison, shown in Figure 5, tells roughly the same story as it does on Fusion. The MPI-3 RMA operations in Cray MPI are currently implemented internally with send and receive routines rather than by directly leveraging the RDMA support in the network. This causes a more pronounced performance loss for the CAF-MPI version of RandomAccess on Edison. Better implementations of MPI RMA on Cray platforms, such as foMPI [10], already exist and deliver performance competitive with CAF and UPC. However, the foMPI implementation is not yet integrated into Cray MPI.

Figure 7. Performance of FFT benchmark on Edison.

4.2 FFT Benchmark
The HPC Challenge FFT benchmark measures the ability of a system to overlap computation and communication while calculating a very large Discrete Fourier Transform of size m. Performance of the FFT benchmark is measured in GFLOP/s, with the reported performance defined as (5 m log2 m / t) x 10^-9, where m is the size of the DFT and t is the execution time in seconds. Parallel FFT algorithms have been well studied in the past [3, 13, 28].
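In display form, the performance metric just described is:

```latex
% GFLOP/s reported by the FFT benchmark, for a DFT of size m solved in t seconds
\mathrm{GFLOP/s} = \frac{5\, m \log_2 m}{t} \times 10^{-9}
```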

The CAF 2.0 FFT implementation uses a radix-2 binary exchange formulation that consists of three parts: permutation of the data to move each source element to the position given by its binary bit reversal; local FFT computation for as many layers of the DFT calculation as can fit in the memory of a single processor; and remote DFT computation for the layers that span multiple processor images. The CAF 2.0 FFT implementation relies solely on all-to-all operations for data movement.
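For illustration, the bit-reversal index used by the permutation step can be computed as follows (a generic helper, not taken from the CAF 2.0 source):

```c
/* Element i of an m = 2^bits point DFT is moved to position
 * bit_reverse(i, bits) before the butterfly stages begin. */
static unsigned bit_reverse(unsigned i, int bits)
{
    unsigned r = 0;
    for (int b = 0; b < bits; b++) {
        r = (r << 1) | (i & 1u);   /* shift in the low bit of i */
        i >>= 1;
    }
    return r;
}
```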
Figure 8. Time decomposition of FFT on 256 cores on Fusion.

Figures 6 and 7 show the performance of the FFT benchmark on Fusion and Edison, respectively. The CAF-MPI version consistently outperforms CAF-GASNet on both platforms at different scales. To analyze the performance difference between the two versions of FFT, we use HPCToolkit to profile the benchmark running on 256 cores on Fusion. Figure 8 illustrates how much of the execution time of FFT is spent in local computation and in communication (all-to-all, to be specific). We can see that the performance difference between the two versions of FFT is caused largely by the performance of the all-to-all operations used in CAF's runtime system. Because GASNet currently does not have collectives, CAF-GASNet implements the alltoall operation with GASNet's PUT, GET, and Active Messages. This is not as well tuned as MPI_ALLTOALL, which has been used and optimized for many years on most computing platforms.
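When MPI is the runtime layer, this all-to-all data movement maps directly onto a single optimized collective; the sketch below illustrates the pattern (buffer names and block sizes are assumptions, not the CAF 2.0 runtime code).

```c
#include <mpi.h>
#include <complex.h>

/* Exchange equal-sized blocks of complex data among all images with one
 * collective call.  CAF-GASNet instead assembles the same exchange from
 * individual PUT/GET/Active Message operations. */
void fft_exchange(const double complex *send, double complex *recv,
                  int block_elems, MPI_Comm comm)
{
    MPI_Alltoall(send, block_elems, MPI_C_DOUBLE_COMPLEX,
                 recv, block_elems, MPI_C_DOUBLE_COMPLEX, comm);
}
```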


Figure 9. Performance of HPL benchmark on Fusion.

Figure 10. Performance of HPL benchmark on Edison.

4.3 HPL Benchmark
The High-Performance Linpack (HPL) [1] benchmark measures the ability of a system to deliver fast floating-point execution while solving a system of linear equations. HPL is one of the most common implementations of parallel LU factorization for distributed-memory systems. HPL has been previously ported to CAF 2.0 by Jin et al. [12]. HPL differs greatly from the RandomAccess and FFT benchmarks described earlier because its performance is mostly dominated by computation rather than communication.

Figures 9 and 10 show the performance of HPL on Fusion and Edison, respectively. Between the two CAF implementations, HPL's performance difference is hardly noticeable. Because the HPL benchmark is mostly computation bound, the choice of communication library has little effect on its overall performance.

4.4 The CGPOP Miniapp
The CGPOP miniapp is the conjugate gradient solver from LANL POP 2.0, an important multi-agency code used for global ocean modeling and a component of the Community Earth System Model. CGPOP is the performance bottleneck for the full POP application; it implements a version of conjugate gradient that uses a single inner product to iteratively solve for the vector x in the equation Ax = b. The algorithm consists of a number of linear algebra computations interleaved with two communication steps: the GlobalSum function performs a 3-word vector reduction, while the UpdateHalo function performs a boundary exchange between neighboring sub-domains. The CGPOP miniapp was ported to CAF to evaluate the performance gains of using a PGAS programming model for the performance-critical part of POP. It is a hybrid MPI+CAF application that uses CAF coarray primitives for data exchange and MPI_REDUCE for reduction operations.

Figures 11 and 12 show the performance results of CGPOP on Fusion and Edison, respectively. On Edison, we can hardly see a difference in the performance of CGPOP between the two CAF implementations. On Fusion, the performance of the two CAF implementations differs only slightly when running on a small number of cores. Since both CAF versions of CGPOP use MPI_REDUCE, the only possible cause of a performance difference between the two versions of CGPOP is the PUT and GET operations used by CAF-MPI and CAF-GASNet. The lack of a performance difference indicates that both implementations are equally efficient in their raw PUT and GET operations.
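As a rough illustration of the two communication steps named above, under the assumption that the reduction maps to an MPI reduction and the push-variant halo exchange maps to one-sided PUTs, consider the following sketch; the window, buffer layout, and neighbor list are illustrative, not the CGPOP source.

```c
#include <mpi.h>

/* GlobalSum: 3-word vector reduction across all processes (shown here as an
 * allreduce; the CAF port uses MPI reduction routines, as noted in the text). */
void global_sum(const double local[3], double global[3], MPI_Comm comm)
{
    MPI_Allreduce(local, global, 3, MPI_DOUBLE, MPI_SUM, comm);
}

/* UpdateHalo, push variant: write this image's boundary values into each
 * neighbor's window, then force remote completion before the next step.
 * (A pull variant would instead issue MPI_Get for the neighbors' boundaries.) */
void update_halo_push(MPI_Win halo_win, const double *sendbuf, int count,
                      MPI_Aint remote_disp, const int *neighbors, int nnbr)
{
    for (int i = 0; i < nnbr; i++)
        MPI_Put(sendbuf + (MPI_Aint)i * count, count, MPI_DOUBLE,
                neighbors[i], remote_disp, count, MPI_DOUBLE, halo_win);
    MPI_Win_flush_all(halo_win);
}
```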

Figure 11. Performance of CGPOP on Fusion.

Figure 12. Performance of CGPOP on Edison.


5. Discussion
In this section, we discuss our findings from the process of redesigning the CAF 2.0 runtime system to use MPI. Using MPI as the basis of a PGAS runtime system, such as CAF, reveals several advantages of MPI over a low-level networking layer such as GASNet. We also discovered several features that the current MPI Standard lacks and that would be useful in building a PGAS runtime system.

The benefits of MPI's rich interface. Compared with GASNet and other low-level networking layers, MPI provides a much richer set of API functions to support different high-level libraries, languages, and applications. We found this rich interface to be immensely time-saving when building a runtime system that delivers high performance across various platforms. For example, because GASNet does not have collectives, collective operations in CAF are hand-crafted in its runtime system on top of GASNet. Not only is this time consuming, but, based on our evaluation, the result is also not as performant as the MPI collectives. Because collectives in MPI have been optimized over the years by different MPI implementations on different platforms, harnessing the performance benefits of these fully optimized collectives is valuable for a programming model, such as CAF, that provides both one-sided and two-sided operations.

The need for Active Messages in MPI. One of the goals of building an interoperable runtime system for CAF is to allow the runtime system to make progress for both CAF and MPI when applications make blocking calls in either one. The current implementation of CAF-MPI can stay true to this goal for all parts of its runtime except active messages. Because we built the AM subsystem on top of MPI_SEND and MPI_RECV, the CAF-MPI runtime has to do further processing of a message to invoke the AM handler; the MPI runtime itself cannot invoke the handler. Thus, if the application is blocked inside an MPI call, no progress is made on such AM handlers. Having AM support inside MPI can solve this problem. AMs are essential for building runtime systems for various programming models, especially for models such as X10, Chapel, and CAF 2.0 that support dynamic parallelism. Zhao et al. [33, 34] have made an effort in this direction to support MPI-interoperable Active Messages, but such a model is not a part of the current MPI Standard.
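The progress limitation described above can be seen in a stripped-down version of such an AM layer; the tag, payload format, and handler table below are assumptions, not the CAF-MPI source.

```c
#include <mpi.h>

#define AM_TAG      0xCAF   /* tag reserved for active messages (assumed) */
#define AM_MAX_LEN  4096    /* assumed upper bound on an AM payload size  */

typedef void (*am_handler_t)(const char *payload, int len, int src);
extern am_handler_t am_handlers[256];   /* dispatch table, indexed by first byte */

/* Drain pending active messages and run their handlers.  Because the handler
 * is invoked here, inside CAF runtime code, no AM makes progress while the
 * application is blocked in an unrelated MPI call -- the limitation the text
 * describes. */
void am_poll(MPI_Comm comm)
{
    int flag;
    MPI_Status st;

    MPI_Iprobe(MPI_ANY_SOURCE, AM_TAG, comm, &flag, &st);
    while (flag) {
        char buf[AM_MAX_LEN];
        int len;
        MPI_Get_count(&st, MPI_BYTE, &len);
        MPI_Recv(buf, len, MPI_BYTE, st.MPI_SOURCE, AM_TAG, comm,
                 MPI_STATUS_IGNORE);
        am_handlers[(unsigned char)buf[0]](buf + 1, len - 1, st.MPI_SOURCE);
        MPI_Iprobe(MPI_ANY_SOURCE, AM_TAG, comm, &flag, &st);
    }
}
```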

The need for a non-blocking flush operation in MPI. In passive target epochs, the MPI Standard provides two routines, MPI_WIN_FLUSH and MPI_WIN_FLUSH_ALL, for ensuring remote completion of RMA operations without closing the access epoch. These routines can be blocking calls (e.g., if another image is holding the lock). As discussed in Section 3.3, the blocking MPI_WIN_FLUSH operation eliminates the potential opportunity for overlapping the latency of communication with local computation. A nonblocking, request-based version of MPI_WIN_FLUSH, such as MPI_WIN_RFLUSH, can solve this problem. The MPI_WIN_RFLUSH operation would start the flush process and return a request handle. The request handle can later be passed to MPI completion routines to wait or test for the completion of the operation.
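To illustrate the intended use, the sketch below shows how such a routine could overlap remote completion with local work; MPI_Win_rflush is the extension proposed in the text, not an existing MPI call, and its signature here is only a guess.

```c
#include <mpi.h>

/* Hypothetical request-based flush, as proposed above; NOT part of any MPI
 * Standard, so this sketch will not link against current MPI libraries. */
int MPI_Win_rflush(int rank, MPI_Win win, MPI_Request *req);

extern void local_computation(void);

void put_with_overlap(const double *buf, int count, int target,
                      MPI_Aint disp, MPI_Win win)
{
    MPI_Request req;

    MPI_Put(buf, count, MPI_DOUBLE, target, disp, count, MPI_DOUBLE, win);

    MPI_Win_rflush(target, win, &req);   /* start remote completion ...        */
    local_computation();                 /* ... overlap it with local work ... */
    MPI_Wait(&req, MPI_STATUS_IGNORE);   /* ... and wait only when needed.     */
}
```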

6. Related Work
CAF and GASNet. CAF has been adopted by the Fortran 2008 committee as the parallel programming model in Fortran. GASNet [5] is a portable PGAS runtime used not only by CAF 2.0, but also by UPC and other parallel libraries. There has been previous work to implement GASNet using MPI, culminating in a GASNet conduit built on AM-MPI [4].
However, the AM-MPI implementation is dated and uses only MPI-1.1, which lacks many of the more recent MPI features that enable a much higher-performing implementation, including RMA.

MPI+PGAS Implementations. There has been previous work to implement PGAS-like libraries on top of MPI. Dinan et al. [9] implemented the Aggregate Remote Memory Copy Interface (ARMCI), the runtime layer for Global Arrays (GA), on top of MPI using one-sided communication. This work provided portability for GA to new systems that would previously have required significant effort to provide both an efficient one-sided MPI library and an efficient ARMCI implementation. Our work is similar, but with different goals. In this work, we want to understand the semantics and requirements of a full PGAS language, not just a high-level PGAS-like library. Specifically, a full PGAS language such as CAF enforces a large number of requirements on the MPI implementation, particularly with respect to active messages and event management, which are not required by Global Arrays.

Before this work, the same authors had made efforts to provide a hybrid MPI+UPC programming model [8]. Unified Parallel C (UPC) [29] is a PGAS language intended to simplify data sharing between processes by allowing them to share portions of their address space easily with each other. With the hybrid model between UPC and MPI, Dinan et al. demonstrated how applications could take advantage of both the large amount of memory that becomes available when using UPC and the portability and existing libraries of MPI. As an example, an application could be written primarily in UPC but could use an MPI library, such as ScaLAPACK, to perform optimized, domain-specific work. However, that work only dealt with the semantic interactions between the two programming models and did not improve the runtime infrastructure of the models. That is, it still relied on each programming model using its own runtime system.

More work was done in this area by researchers at Ohio State University [14]; that work was specific to MVAPICH. The authors unified the runtime systems of MVAPICH and UPC to create a single runtime that supports both programming models. While this work provides results similar to the work presented here, it is not portable across MPI implementations and therefore requires that MVAPICH be available on the system, something that is not always possible given its reliance on InfiniBand.

Work to unify programming models has also existed outside of purely PGAS languages. In 2011, Preissl et al. [23] examined a communication model that integrated MPI, Fortran 2008 (Coarray Fortran), and OpenMP [21]. This work was specific to the Gyrokinetic Tokamak Simulation code but demonstrated the use of many communication libraries to implement different algorithms within the code. They focused on how a combined PGAS/OpenMP approach could improve over a more traditional MPI+OpenMP algorithm.

Other MPI+X Hybrid Models. The MPI+OpenMP paradigm is just one example of a more classical hybridization of parallel libraries. It has been specifically targeted at improving on-node communication performance by taking advantage of shared address spaces with OpenMP and switching to MPI when off-node communication is necessary. More recently, accelerators have entered the hybrid programming model space through MPI+CUDA [20] and MPI+OpenCL [15], allowing an MPI application to offload work to a GPU and take advantage of its highly parallel architecture.

7. Conclusions and Future Work
In this paper, we demonstrated how MPI-3 can act as a runtime layer for CAF 2.0, thereby providing a basis for interoperability between MPI and CAF and reducing the runtime overhead introduced by having independent runtime systems. We demonstrated the new features introduced by MPI-3 and their applicability within a PGAS runtime system. We showed that changing the runtime layer from GASNet to MPI introduces minimal overhead in some cases and yields a performance improvement in instances where the CAF operations map better to native MPI operations. Finally, we outlined some improvements that could be adopted by future MPI Standards and that would allow work like this to use more optimized MPI operations than were available for this work, such as Active Messages and MPI_WIN_RFLUSH.

As future work, we plan to look into several additions to our proposed CAF-MPI framework. Among the short-term goals we plan to tackle are the performance issues that we identified within the MPI implementation, particularly with respect to the scalability of MPI_WIN_FLUSH_ALL. Addressing them would improve the performance of operations that rely heavily on CAF events, such as the RandomAccess benchmark.

As a slightly longer-term goal, we plan to study MPI_WIN_RFLUSH and its applicability to CAF-MPI. We believe that such an implementation would allow us to move away from MPI_SEND and MPI_RECV almost entirely within the CAF-MPI runtime, except within active messages. We also plan to use this study as a motivating example to encourage the standardization of such a routine in the next MPI Standard.

Finally, we plan to investigate several applications that can benefit from a hybrid MPI+CAF framework. Two of the first applications we plan to look at are QMCPACK and GFMC. As described in Section 1, these are existing MPI applications that have tremendous potential to benefit from using coarrays, which would allow their "local" data to be distributed across a small number of processes.

Acknowledgments
This work was supported by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research, under Contract DE-AC02-06CH11357. We also thank Scott K. Warren and Dung X. Nguyen from the Rice group for porting CGPOP to CAF 2.0 and allowing us to use it for the evaluation of CAF-MPI.
References
[1] A. Petitet, R. C. Whaley, J. Dongarra, and A. Cleary. HPL - A Portable Implementation of the High-Performance Linpack. http://bit.ly/JtavrU, Sept. 2008.
[2] L. Adhianto, S. Banerjee, M. Fagan, M. Krentel, G. Marin, J. Mellor-Crummey, and N. R. Tallent. HPCTOOLKIT: tools for performance analysis of optimized parallel programs. Concurr. Comput.: Pract. Exper., 22(6):685-701, 2010.
[3] R. C. Agarwal, F. G. Gustavson, and M. Zubair. A high performance parallel algorithm for 1-D FFT. In Proceedings of the 1994 Conference on Supercomputing, pages 34-40, Los Alamitos, CA, USA, 1994.
[4] D. Bonachea. Active Messages over MPI. URL http://bit.ly/14VZNOs.
[5] D. Bonachea. GASNet specification, v1.1. Technical Report UCB/CSD-02-1207, University of California at Berkeley, Berkeley, CA, USA, 2002.
[6] D. Bonachea and J. Duell. Problems with using MPI 1.1 and 2.0 as compilation targets for parallel language implementations. Int. J. High Perform. Comput. Netw., 1(1-3):91-99, Aug. 2004.
[7] B. Chamberlain, D. Callahan, and H. Zima. Parallel Programmability and the Chapel Language. Intl. J. of High Performance Computing Applications, 21(3):291-312, 2007.
[8] J. Dinan, P. Balaji, E. L. Lusk, P. Sadayappan, and R. Thakur. Hybrid Parallel Programming with MPI and Unified Parallel C. In 7th ACM International Conference on Computing Frontiers, Bertinoro, Italy, Apr. 2010.
[9] J. Dinan, P. Balaji, J. R. Hammond, S. Krishnamoorthy, and V. Tipparaju. Supporting the Global Arrays PGAS Model Using MPI One-Sided Communication. In Proc. 26th Intl. Parallel and Distributed Processing Symp. (IPDPS), Shanghai, China, May 2012.
[10] R. Gerstenberger, M. Besta, and T. Hoefler. Enabling Highly-scalable Remote Memory Access Programming with MPI-3 One Sided. In Proceedings of Intl. Conf. for High Perf. Computing, Networking, Storage and Analysis, SC '13, pages 53:1-53:12, New York, NY, USA, 2013. URL http://bit.ly/1dCbxe2.
[11] HPC Challenge Benchmark. HPC challenge benchmark, July 2010. URL http://icl.cs.utk.edu/hpcc.
[12] G. Jin, J. Mellor-Crummey, L. Adhianto, W. N. Scherer III, and C. Yang. Implementation and Performance Evaluation of the HPC Challenge Benchmarks in Coarray Fortran 2.0. In Proceedings of the 2011 IEEE Intl. Parallel & Distributed Processing Symposium, IPDPS '11, pages 1089-1100, Washington, DC, USA, 2011.
[13] S. L. Johnsson and R. L. Krawitz. Cooley-Tukey FFT on the Connection Machine. Parallel Computing, 18:1201-1221, 1991.
[14] J. Jose, M. Luo, S. Sur, and D. K. Panda. Unifying UPC and MPI runtimes: experience with MVAPICH. In Proceedings of the Fourth Conference on Partitioned Global Address Space Programming Model, page 5. ACM, 2010.
[15] Khronos OpenCL Working Group. The OpenCL Specification, Version 2.0, July 2013. URL http://bit.ly/15tR61M.
[16] J. Kim, K. P. Esler, J. McMinis, M. A. Morales, B. K. Clark, L. Shulenburger, and D. M. Ceperley. Quantum energy density: Improved efficiency for quantum Monte Carlo calculations. Physical Review B, 88(3), 2013.
[17] E. Lusk, S. Pieper, and R. Butler. More SCALABILITY, Less PAIN. SciDAC Review, (17):30-37, 2010. URL http://bit.ly/163sZtd.
[18] J. Mellor-Crummey, L. Adhianto, W. N. Scherer III, and G. Jin. A new vision for Coarray Fortran. In Proceedings of the 3rd Conf. on Partitioned Global Address Space Programing Models, PGAS '09, pages 1-9, New York, NY, USA, 2009. ACM.
[19] Message Passing Interface Forum. MPI Report 3.0, Sept. 2012. URL http://bit.ly/Ul0wY2.
[20] J. Nickolls, I. Buck, M. Garland, and K. Skadron. Scalable Parallel Programming with CUDA. Queue, 6(2):40-53, Mar. 2008. ISSN 1542-7730.
[21] OpenMP Architecture Review Board. OpenMP Application Program Interface Version 4.0, July 2013. URL http://bit.ly/13LNHtI.
[22] S. C. Pieper and R. B. Wiringa. Quantum Monte Carlo Calculations of Light Nuclei. Annual Review of Nuclear and Particle Science, 51(1):53-90, 2001. URL http://bit.ly/143fd6u.
[23] R. Preissl, N. Wichmann, B. Long, J. Shalf, S. Ethier, and A. Koniges. Multithreaded Global Address Space Communication Techniques for Gyrokinetic Fusion Applications on Ultra-Scale Platforms. In Proceedings of 2011 Int. Conf. for High Performance Computing, Networking, Storage and Analysis, SC '11, pages 78:1-78:11, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0771-0.
[24] J. Reid. Coarrays in Fortran 2008. In Proceedings of the Third Conference on Partitioned Global Address Space Programing Models, PGAS '09, pages 4:1-4:1, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-836-0.
[25] V. A. Saraswat, B. Bloom, I. Peshansky, O. Tardieu, and D. Grove. X10 Language Specification, Version 2.2, Sept. 2011. URL http://bit.ly/1431tse.
[26] W. N. Scherer III, L. Adhianto, G. Jin, J. Mellor-Crummey, and C. Yang. Hiding latency in Coarray Fortran 2.0. In Proceedings of the 4th Conf. on Partitioned Global Address Space Programming Model, PGAS '10, pages 14:1-14:9, New York, NY, USA, 2010. ACM. ISBN 978-1-4503-0461-0.
[27] A. Stone, J. Dennis, and M. M. Strout. The CGPOP Miniapp, Version 1.0. Technical Report CS-11-103, Colorado State University, July 2011.
[28] D. Takahashi and Y. Kanada. High-performance radix-2, 3 and 5 parallel 1-D complex FFT algorithms for distributed-memory parallel computers. J. Supercomput., 15(2):207-228, 2000.
[29] UPC Consortium. UPC language specifications v1.2. Technical report, Lawrence Berkeley National Laboratory, Berkeley, CA, USA, May 2005.
[30] T. von Eicken, D. E. Culler, S. C. Goldstein, and K. E. Schauser. Active messages: a mechanism for integrated communication and computation. SIGARCH Comput. Archit. News, 20:256-266, Apr. 1992. ISSN 0163-5964.
[31] C. Yang. Function shipping in a scalable parallel programming model. Master's thesis, Department of Computer Science, Rice University, Houston, Texas, 2012.
[32] C. Yang, K. Murthy, and J. Mellor-Crummey. Managing asynchronous operations in Coarray Fortran 2.0. In Proceedings of the 2013 IEEE International Symposium on Parallel and Distributed Processing, pages 1321-1332, 2013.
[33] X. Zhao, P. Balaji, W. D. Gropp, and R. S. Thakur. MPI-Interoperable Generalized Active Messages. In IEEE International Conference on Parallel and Distributed Systems (ICPADS), Dec. 2013.
[34] X. Zhao, D. Buntinas, J. A. Zounmevo, J. Dinan, D. Goodell, P. Balaji, R. Thakur, A. Afsahi, and W. Gropp. Toward Asynchronous and MPI-Interoperable Active Messages. In CCGRID '13, pages 87-94, 2013.