MPI Usage at NERSC: Present and Future

Alice Koniges, Brandon Cook, Jack Deslippe, Thorsten Kurth, Hongzhang Shan
Lawrence Berkeley National Laboratory, Berkeley, CA USA

Abstract: We describe how MPI is used at the National Energy Research Scientific Computing Center (NERSC). NERSC is the production high-performance computing center for the US Department of Energy, with more than 5,000 users and 800 distinct projects. We also compare the usage of MPI to exascale developmental programming models such as UPC++ and HPX, with an eye on what features and extensions to MPI are plausible and useful for NERSC users. We also discuss perceived shortcomings of MPI, and why certain groups use other parallel programming models. We believe our results tend to represent the advanced MPI NERSC user base, rather than an extensive survey of the entire NERSC code group population.

Special thanks to Dr. Rolf Rabenseifner and colleagues at the Stuttgart High Performance Computing Center for input in making the NERSC MPI User Survey.

NERSC Hardware

In 2015, the NERSC Program supported 6,000 active users from universities, national laboratories, and industry, who used approximately 3 billion supercomputing hours. Our users came from across the United States, with 48 states represented, as well as from around the globe, with 47 countries represented.

Cori is NERSC's newest system (NERSC-8). It is named after American biochemist Gerty Cori, the first woman to win a Nobel Prize in Physiology or Medicine. The Cori Phase 1 system is a Cray XC based on the Intel Haswell multi-core processor.
Cori Phase 2 is based on the second generation of the Intel Xeon Phi family of products, called the Knights Landing (KNL) Many Integrated Core (MIC) architecture. Cori Phase I and Phase II integration begins ~September 19, 2016 (for ~6 weeks).
• Cori is a Cray XC40 supercomputer.
• Theoretical peak performance: Phase I Haswell: 1.92 PFlops/sec; Phase II KNL: 27.9 PFlops/sec.
• Total compute nodes: Phase I Haswell: 1,630 compute nodes, 52,160 cores in total (32 cores per node); Phase II KNL: 9,304 compute nodes, 632,672 cores in total (68 cores per node).
• Cray Aries high-speed interconnect with Dragonfly topology (0.25 µs to 3.7 µs MPI latency, ~8 GB/sec MPI bandwidth).
• Aggregate memory: Phase I Haswell partition: 203 TB; Phase II KNL partition: 1 PB.
• Scratch storage capacity: 30 PB.

Edison has been NERSC's primary supercomputer system for a number of years. It is named after U.S. inventor and businessman Thomas Alva Edison.
Edison, a Cray XC30, has a peak performance of 2.57 petaflops/sec, 133,824 compute cores for running scientific applications, 357 terabytes of memory, and 7.56 petabytes of online disk storage with a peak I/O bandwidth of 168 gigabytes (GB) per second.

How MPI Is Used at NERSC (NERSC MPI Usage Survey)

Thirty-two code groups responded to the NERSC MPI Usage Survey.

Application codes used by MPI survey responders:
• NIMROD • NWChem • NWChem, CTF • MILC QCD simulation codes • NCAR LES • Parallel Research Kernels • epiSNP • PSI-Tet (finite-element-based MHD code) • GINGER FEL code • WARP/PICSAR + own applications • ALPSCore • Isam • WRF • own code • HipMer, Combinatorial BLAS, SuperLU • im3shape, cosmosis (my own codes) • applications that use PETSc • MPI-3.1 • idealized climate simulations using a code named EULAG • WRF, UMWM, HYCOM, ESMF • GTFock, Elemental • ISx • MUMPS, SuperLU_DIST, STRUMPACK, PETSc • CSU global atmospheric model • FEniCS • Lattice QCD, RHMC, Complex Langevin, Conjugate Gradient • benchmarks • our own code, adapted for MPI • Ray, HipMer • desispec and related DESI-specific custom code

Which types of decomposition and communication does your application use? (check all that apply; 32 responses)
• Nearest-neighbor communication: 12 (37.5%)
• Row/column communication: 8 (25%)
• Communication with arbitrary other processes: 13 (40.6%)
• Cartesian grid: 11 (34.4%)
• Non-Cartesian but regular grid: 11 (34.4%)
• Cyclic boundary conditions/communication: 13 (40.6%)
• Non-cyclic boundary conditions: 7 (21.9%)
• Scatter/Gather: 16 (50%)
• AllToAll: 13 (40.6%)
• Broadcast/Reduce: 27 (84.4%)
• No communication (my code is embarrassingly parallel): 3 (9.4%)
• Other: 3 (9.4%)
Free-text notes from responders: ~10,000; typically 16^2 x 45 (horizontal footprint x column depth); typically 256-1024.
Other responses:
• Primarily halo communication for integration and all-to-one for writing output, although many options exist for output depending on domain size
• master/slave
• One-sided
• MPI structure communications, threaded communications
• One-sided and point-to-point between rank 0 and other ranks
• Graph traversal / random access
• 1 master PE for I/O; generally this is a minor amount of time

How do you implement point-to-point communication? (check any that apply; 29 responses)
• Blocking MPI_Send / MPI_Recv: 18 (62.1%)
• Blocking MPI_Bsend / MPI_Recv: 2 (6.9%)
• Blocking MPI_Ssend / MPI_Recv: 1 (3.4%)
• Blocking MPI_Sendrecv: 5 (17.2%)
• Nonblocking MPI_Isend and MPI_Recv: 14 (48.3%)
• Nonblocking MPI_Send and MPI_Irecv: 13 (44.8%)

Have you ever checked if your communication is serialized? (31 responses) Responses split 61.3% to 38.7%.

Do you overlap communication and computation? (32 responses) Responses split 62.5% to 37.5%. A minimal sketch of this overlap pattern follows below.
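The nonblocking point-to-point answers and the overlap question refer to the same basic pattern. The sketch below is illustrative only (it is not taken from any responder's code; the 1-D domain and the compute_interior/compute_halo helpers are invented for the example): ghost-cell exchange is posted with MPI_Isend/MPI_Irecv, interior work proceeds while messages are in flight, and MPI_Waitall completes the exchange before the boundary work.

```c
/* Illustrative sketch: 1-D halo exchange that overlaps nonblocking
 * communication with interior computation.  Sizes and helpers are made up. */
#include <mpi.h>
#include <stdio.h>

#define N 1024   /* local interior points, arbitrary for the sketch */

static void compute_interior(double *u) { for (int i = 2; i < N; i++) u[i] *= 0.5; }
static void compute_halo(double *u)     { u[1] *= 0.5; u[N] *= 0.5; }

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double u[N + 2];                       /* u[0] and u[N+1] are ghost cells */
    for (int i = 0; i <= N + 1; i++) u[i] = (double)rank;

    int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    MPI_Request req[4];
    /* Post receives into ghost cells and sends from boundary cells. */
    MPI_Irecv(&u[0],     1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(&u[N + 1], 1, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &req[1]);
    MPI_Isend(&u[1],     1, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &req[2]);
    MPI_Isend(&u[N],     1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[3]);

    compute_interior(u);                   /* work that needs no ghost data */

    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
    compute_halo(u);                       /* work that needs the ghost cells */

    if (rank == 0) printf("halo exchange done on %d ranks\n", size);
    MPI_Finalize();
    return 0;
}
```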

Do you use any of these features? (31 responses)
• MPI_Put / MPI_Get: 11 (35.5%)
• MPI_Neighbor_alltoall: 2 (6.5%)
• MPI_Alltoall: 11 (35.5%)
• MPI_Allgather: 12 (38.7%)
• MPI_Barrier: 22 (71%)
• MPI_(Ex)Scan: 2 (6.5%)
• "V" versions, e.g., MPI_Scatterv / MPI_Gatherv: 10 (32.3%)
• Cartesian communicators: 2 (6.5%)
• Sub-communicators (normal intra-communicators): 19 (61.3%)
• Inter-communicators: 2 (6.5%)
• Nonblocking collectives (e.g., MPI_Ibcast): 8 (25.8%)
• Derived datatypes: 9 (29%)

Tell us how/why you use the features selected above. (16 responses)
• We often use alltoallv, in fact, as we have variable data lengths to send to arbitrary processes. Probably could optimise by knowing neighbors.
• Using RMA for irregular communication; using derived datatypes to abstract strided data in NWChem; sub-communicators are needed to separate computing groups and special groups for asynchronous progress (using NWChem with ARMCI/MPI and Casper); blocking collectives are used for global synchronization in NWChem; non-blocking collectives are used in the asynchronous progress engine (i.e., Casper) to handle special background synchronization that can be overlapped with application execution.
• MPI_Barrier for synchronization within a shared memory domain (with MPI-3 shared memory windows). Sub-communicators for row/column communications in dense matrix applications.
• I need to send non-contiguous parts of an array.
• Asynchronous communication is necessary for overlapping computation/communication.
• Nonblocking collectives to overlap communication and computation, including pipelined iterative solvers.
• Issend/Ibarrier for sparse "consensus": efficiently setting up an unstructured communication pattern.
• Hartree-Fock (HF) calculations play a crucial role in quantum chemistry and are a useful prototype for parallel studies. One of the most compute-intensive steps in HF calculations is building the Fock matrix, which is done through a series of one-sided operations to multidimensional tensors plus local tensor contractions. This type of communication model is very well suited to the one-sided operations offered in MPI-3 RMA, such as Put, Get, and Accumulate. Also, the choice of data layout or distribution over processes is one of the key aspects for achieving scalability in any parallel application, which could be implemented using a mix of collective all-to-all and nonblocking point-to-point operations.
• Scatter/Gather for FFTs, sub-communicators for different sets of processors depending on discretization.
• Occasional process synchronization.
• benchmark them
• UPC barrier: coordinating I/O. Sub-communicators: slicing and dicing the cores for an otherwise embarrassingly parallel problem (but where each sub-unit is itself internally parallel).
• To get orderly standard output (specifically from our timer); to work on a subset of processors in our multigrid solver.
• See ARMCI-MPI papers to start.
• MPI_Put/Get implemented to experiment with different communication solutions. Cartesian communicator because we use a Cartesian grid. Nonblocking collectives to reduce some values to the first processor for statistics. Derived datatypes to deal with data communication at the boundaries (extraction of data from a 3D matrix).
• Ranks > 0 essentially dynamically fetch "work" from rank 0.

A minimal sketch of two widely cited features (sub-communicators and derived datatypes) follows below.
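As an illustration of those two features, the sketch below (an invented example, not any responder's code; the 4x8 local matrix and the 2-wide process grid are assumptions) splits MPI_COMM_WORLD into row communicators with MPI_Comm_split and sends a non-contiguous matrix column described by an MPI_Type_vector derived datatype.

```c
/* Illustrative sketch: row sub-communicators plus a derived datatype
 * for a non-contiguous column of a locally stored NROWS x NCOLS matrix. */
#include <mpi.h>
#include <stdio.h>

#define NROWS 4
#define NCOLS 8
#define PCOLS 2   /* assumed process-grid width for the example */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Split COMM_WORLD into row communicators of a PCOLS-wide process grid. */
    MPI_Comm row_comm;
    MPI_Comm_split(MPI_COMM_WORLD, rank / PCOLS, rank, &row_comm);

    double a[NROWS][NCOLS];
    for (int i = 0; i < NROWS; i++)
        for (int j = 0; j < NCOLS; j++) a[i][j] = rank;

    /* One column of 'a': NROWS blocks of 1 double, stride NCOLS doubles. */
    MPI_Datatype col_type;
    MPI_Type_vector(NROWS, 1, NCOLS, MPI_DOUBLE, &col_type);
    MPI_Type_commit(&col_type);

    int row_rank, row_size;
    MPI_Comm_rank(row_comm, &row_rank);
    MPI_Comm_size(row_comm, &row_size);
    if (row_rank == 0) {
        /* Receive each column as a contiguous buffer (overwritten per sender
           in this sketch; a real code would store or combine them). */
        double col[NROWS];
        for (int src = 1; src < row_size; src++)
            MPI_Recv(col, NROWS, MPI_DOUBLE, src, 7, row_comm, MPI_STATUS_IGNORE);
    } else {
        /* Send the last column without packing, via the derived datatype. */
        MPI_Send(&a[0][NCOLS - 1], 1, col_type, 0, 7, row_comm);
    }

    MPI_Type_free(&col_type);
    MPI_Comm_free(&row_comm);
    MPI_Finalize();
    return 0;
}
```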
What error handling do you use? (32 responses)
• Default: 28 (87.5%)
• Predefined error handlers: 2 (6.3%)
• Self-written error handlers: 7 (21.9%)

How do you do MPI initialization? (32 responses)
• With MPI_Init: 25 (78.1%)
• With MPI_Init_thread: 14 (43.8%)
A minimal sketch of MPI_Init_thread in a hybrid code follows below.

What language(s) is (are) your application written in? (27 responses)
• Fortran
• C: 21 (77.8%)
• C++: 21 (77.8%)
• Other: 7 (25.9%)
If checked other, please specify (7 responses):
• UPC, python, bash
• Python, Javascript
• python
• R, haskell, ocaml, python
• Java, Scala, Python, PERL
• UPC
• I mostly work from within mpi4py.
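For the MPI_Init_thread answers, the sketch below shows the usual hybrid start-up, checking the granted thread level. The choice of MPI_THREAD_FUNNELED (only the master thread calls MPI) is an assumption for the example; codes whose threads all communicate would request MPI_THREAD_MULTIPLE.

```c
/* Illustrative sketch: initializing a hybrid MPI+OpenMP code with
 * MPI_Init_thread rather than MPI_Init, and checking the provided level. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided < MPI_THREAD_FUNNELED) {
        fprintf(stderr, "MPI library provides thread level %d only\n", provided);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    #pragma omp parallel
    {
        /* Threaded computation goes here; MPI calls stay on the master thread. */
    }
    if (rank == 0) printf("running with thread support level %d\n", provided);

    MPI_Finalize();
    return 0;
}
```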
Do you use MPI+X? (32 responses)
• MPI only: 18 (56.3%)
• MPI+OpenMP: 20 (62.5%)
• MPI+pthreads: 5 (15.6%)
• MPI+CUDA: 3 (9.4%)
• MPI+OpenCL: 1 (3.1%)
• MPI+OpenACC: 0 (0%)
• MPI+OpenMP-4 GPU directives: 0 (0%)
• Other: 4 (12.5%)
If you checked other, please specify (4 responses):
• UPC, UPC++, GASNet
• MPI + TBB, C(++)11 threads
• But we prefer MPI-only, possibly with shared memory optimizations (MPI+MPI)
• MPI+UPC (i.e., upcc --with-mpi)
If you combine MPI with another communication model, e.g., MPI + Co-Array Fortran, MPI + UPC, etc., which models and why? (7 responses)
• Trying to switch completely to Coarray Fortran for ease of programming.
• MPI + SHMEM
• MPI + Global Arrays
• OpenMP does not work across nodes.
• UPC mainly, to test interoperability and due to UPC's better one-sided performance and productivity.
• Ideally I'd use MPI just to get the processes spread across nodes, and use simpler Python-internal multiprocessing for additional parallelism.
• Load balancing; outside of NERSC, to allow a reduced memory footprint for execution on Xeon Phi (KNL).

Do you use one-sided communication (normal windows)? (Questions about shared memory windows are below.) (31 responses)
• yes: 29%; no: 71%

If yes, which window creation method(s)? (9 responses)
• MPI_Win_create: 5 (55.6%)
• MPI_Win_allocate: 6 (66.7%)
• MPI_Win_create_dynamic: 3 (33.3%)

If yes, do you use overlapping windows? (14 responses)
• yes: 21.4%; no: 78.6%
If you use overlapping windows, please specify the reason (2 responses):
• See Casper papers.
• I use Casper for RMA asynchronous progress, which requires overlapped windows to offload user RMA communication to background processes.

If yes to one-sided communication with normal windows, which RMA method(s)? (9 responses)
• MPI_Put: 7 (77.8%)
• MPI_Get: 8 (88.9%)
• MPI_Accumulate: 6 (66.7%)
• MPI_Get_accumulate: 6 (66.7%)
• Other: 2 (22.2%) (MPI_Fetch_and_op; FOp and CAS)

If yes to one-sided communication with normal windows, which RMA synchronization methods? (9 responses)
• MPI_Win_fence: 4 (44.4%)
• MPI_Win_post/start/complete/wait: 2 (22.2%)
• MPI_Win_lock/unlock: 5 (55.6%)
• Request-based RMA communication, e.g., MPI_Rput: 3 (33.3%)
• Other lock/flush/sync routines: 3 (33.3%)
If you checked other lock/flush/sync routines, please specify (3 responses):
• Lock all, flush all, flush local, flush, win sync, unlock all
• MPI_Win_lock_all/unlock_all/flush_all/flush_local_all
• MPI_Win_flush(_local)(_all)

If yes to one-sided communication with normal windows, which RMA memory model? (6 responses)
• separate: 2 (33.3%); unified: 4 (66.7%)

A minimal sketch of passive-target RMA with these calls follows below.
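The sketch below illustrates the most common combination reported above: a window from MPI_Win_allocate, passive-target MPI_Win_lock_all, MPI_Put plus MPI_Win_flush, and a lock on the local window before reading it. The "each rank writes one double into its neighbor's slot" exchange is invented for the example and is not any responder's code.

```c
/* Illustrative sketch: passive-target MPI-3 RMA with Put and flush. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *base;
    MPI_Win win;
    /* One slot per possible origin rank, allocated and attached in one call. */
    MPI_Win_allocate((MPI_Aint)size * sizeof(double), sizeof(double),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &base, &win);
    for (int i = 0; i < size; i++) base[i] = -1.0;
    MPI_Barrier(MPI_COMM_WORLD);   /* all windows initialized before any Put */

    MPI_Win_lock_all(0, win);      /* passive-target epoch on every rank */
    int target = (rank + 1) % size;
    double val = 100.0 + rank;
    MPI_Put(&val, 1, MPI_DOUBLE, target, /*disp=*/rank, 1, MPI_DOUBLE, win);
    MPI_Win_flush(target, win);    /* the Put is complete at the target */
    MPI_Win_unlock_all(win);

    MPI_Barrier(MPI_COMM_WORLD);   /* everyone has finished their puts */

    /* Lock the local window before reading, so the private copy is current
       even under the separate memory model. */
    MPI_Win_lock(MPI_LOCK_SHARED, rank, 0, win);
    printf("rank %d: slot from left neighbor = %.1f\n",
           rank, base[(rank - 1 + size) % size]);
    MPI_Win_unlock(rank, win);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```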

Do you use MPI shared memory windows (allocated with MPI_Win_allocate_shared)? (18 responses)
• yes: 33.3%; no: 66.7%
If yes, do you use: (6 responses)
• Normal language (C/Fortran) expressions for "remote load" on the shared memory: 6 (100%)
• Normal language (C/Fortran) assignments for "remote store" on the shared memory: 6 (100%)
• MPI_Put and/or MPI_Get on the shared memory window: 2 (33.3%)

Which synchronization method do you use? Please specify. (7 responses)
• MPI_Barrier() in MPI and sync all in CAF
• win sync or flush, as appropriate
• MPI_Win_lock_all/unlock_all/flush_all/flush_local_all
• MPI_Barrier, plus MPI_Send/Recv of empty messages
• MPI_WAIT, MPI_WAITALL
• MPI_Win_sync for shared memory windows
• multiple methods

A minimal sketch of a shared memory window with direct load/store follows below.
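The sketch below illustrates the shared-memory-window pattern behind the "remote load"/"remote store" answers: ranks on a node allocate a window with MPI_Win_allocate_shared, look up their neighbors' segments with MPI_Win_shared_query, and use plain C loads and stores, separated by MPI_Win_sync and a barrier. One double per rank is an arbitrary choice for the example.

```c
/* Illustrative sketch: MPI-3 shared memory window with direct load/store. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    /* Communicator containing only the ranks that share this node's memory. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    int nrank, nsize;
    MPI_Comm_rank(node_comm, &nrank);
    MPI_Comm_size(node_comm, &nsize);

    double *mine;
    MPI_Win win;
    MPI_Win_allocate_shared(sizeof(double), sizeof(double),
                            MPI_INFO_NULL, node_comm, &mine, &win);

    MPI_Win_lock_all(MPI_MODE_NOCHECK, win); /* passive epoch for load/store */
    *mine = 10.0 * nrank;                    /* ordinary store into my segment */

    MPI_Win_sync(win);                       /* memory barrier for the window */
    MPI_Barrier(node_comm);                  /* everyone has written its value */
    MPI_Win_sync(win);

    /* Ordinary load from the right neighbor's segment on the same node. */
    int right = (nrank + 1) % nsize;
    MPI_Aint seg_size;
    int disp_unit;
    double *theirs;
    MPI_Win_shared_query(win, right, &seg_size, &disp_unit, &theirs);
    printf("node rank %d sees neighbor value %.1f\n", nrank, *theirs);

    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```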

Do you use hints via the info object? (20 responses)
• yes: 25%; no: 75%
If yes, do you: (5 responses)
• Use them associated with windows: 4 (80%)
• Use them associated with communicators: 2 (40%)
• Use reserved hints, e.g., access_style, chunked, etc.: 0 (0%)

Do you use MPI-IO? (31 responses)
• yes: 32.3%; no: 67.7%
If yes, which positioning method do you use? (10 responses)
• Explicit offsets: 9 (90%)
• Individual file pointers: 2 (20%)
• Shared file pointers: 2 (20%)
• File views (with MPI_File_set_view): 4 (40%)
If yes, can you specify which MPI_File_write/read routines? (7 responses)
• read/write_at_all() and read/write_at()
• MPI_FILE_WRITE_ALL
• MPI_File_read_all and MPI_File_write_all
• write_at, all_write, read at
• MPI_File_write_at, MPI_File_read_at
• MPI_FILE_WRITE_AT_ALL
• Multiple
If yes, which data representation? (6 responses)
• "internal": 3 (50%); "external32": 3 (50%); other: 1 (16.7%)
Other responses: native; can't remember.

A minimal sketch of collective MPI-IO with explicit offsets and an info hint follows below.
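The sketch below combines the two most common answers above: collective writes at explicit offsets (MPI_File_write_at_all) and a hint passed through an MPI_Info object at file open. The file name, the choice of the "collective_buffering" hint, and the per-rank block of 1024 doubles are assumptions made for the example.

```c
/* Illustrative sketch: collective MPI-IO with explicit offsets plus a hint. */
#include <mpi.h>
#include <stdio.h>

#define NLOCAL 1024

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buf[NLOCAL];
    for (int i = 0; i < NLOCAL; i++) buf[i] = rank + 0.001 * i;

    /* Reserved I/O hints are passed as key/value strings on an info object. */
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "collective_buffering", "true");

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "checkpoint.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    /* Each rank writes its block at an explicit byte offset, collectively. */
    MPI_Offset offset = (MPI_Offset)rank * NLOCAL * sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, NLOCAL, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    if (rank == 0) printf("wrote checkpoint.dat\n");
    MPI_Finalize();
    return 0;
}
```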

Do you use an I/O package like HDF5? (29 responses)
• Responses split 55.2% to 44.8%.
If yes, do you use HDF5, NetCDF, or other? (16 responses)
• 11 of 16 responses use HDF5
• 7 of 16 responses use NetCDF (several responders use both; parallel-netCDF was named by more than one)
Individual responses included: HDF5, Hdf5, netCDF, NetCDF, Parallel netCDF, parallel-netcdf, "HDF5 and NetCDF", and "HDF5, NetCDF".

Which tools, if any, do you use to optimize or debug the MPI portion of your application?
• Allinea DDT
• gdb
• Totalview, print statements
• Casper for optimizing asynchronous progress of MPI RMA, DDT for debug
• Lots of print statements to debug!
• Sometimes VTune, Intel Trace Analyzer and Collector
• write(*,*) for debugging :)
• TAU, own tools
• Compare results with serial code
• Intel VTune, Allinea Forge
• GDB for debugging; recently started performing instruction-count analysis using Intel SDE
• ITAC
• Tau
• ddt
• internal timers + MPI_P*
• Valgrind

How do you think MPI could be improved to better address your needs?
• Much simpler programming model
• Active Message support, better RMA support
• Fault tolerance, implementation quality
• More documentation on design patterns. For example, it is really simple to implement Request/Reply and Pub/Sub patterns using ZeroMQ (http://zguide.zeromq.org/page:all) in C, and much easier to understand than MPI.
• Probably not
• MPI 3 has resolved some of the problems, but we have a paper in preparation to highlight the remainder.
• Lighter weight (efficiency), simpler API, and remote function execution
• Fault tolerance
• Higher-level API, simpler data sharing
• OpenSHMEM one-sided communication has an advantage over two-sided messaging for global communication patterns where MPI two-sided messaging overheads (matching, unexpected messages, etc.) can generate high overheads. MPI RMA could be used in place of OpenSHMEM if the performance of MPI RMA were optimized to the level of SHMEM.
• Fault tolerance, fine-grained tasks
• MPI should be a compilation target. Application programmers should not see it at all.

Portability issues with respect to different MPI implementations

In survey responses of 32 code groups, 10 gave the following comments on portability:
• Have noticed that MPI implementations seem to differ in how they implement read/write_at_all() and read/write_at(). On NERSC's machines, I need to use read/write_at_all(), while on NCAR's machine the same code prefers read/write_at().
• Open MPI: multiple and RMA usually broken; Blue Gene has no MPI-3.
• Compilation problem with the compiler.
• Cray MPI sometimes crashes with alltoallv at large counts.
• We write MPI code following the MPI standard, so generally no portability issues exist. But sometimes we get unexpected RMA bugs in Cray MPI releases (in both regular mode and DMAPP mode).
• Not directly. Co-workers experienced problems with one-sided MPI (Intel MPI/MPICH).
• Old versions of MPI did not have the MPI_INTEGER8 data type. Newer versions do.
• MPI_Comm_split deadlocking, MPI_Bcast deadlocking, MPI_Comm_dup using linear memory on each call, incorrect attribute caching destructor semantics.
• MPI_IO external32 vs. native encodings with OpenMPI and MPICH.
• Too many issues over 15+ years to try to list them here.
Comments/concerns/suggestions/requests with respect to MPI

In survey responses, 5 code groups gave additional information:
• Don't break things!!!!
• Ideally all meaningful patterns are benchmarked.
• Lack of good open-source MPI for Cray makes lots of things hard. So many problems would have been solved much quicker with source.
• It would be good to document automatic potential protocol switches in specific MPI implementations (e.g., Cray MPI) for small messages. That could explain some observed communication anomalies.
• Neighborhood collectives and nonblocking collectives are the most relevant optimizations. Implementation of "endpoints" would be huge for interoperability between packages that use OpenMP or a similar threading system and those that use MPI only (perhaps with windows). (A minimal sketch of a neighborhood collective follows this list.)
• For progress of RMA operations, I use a process-based progress engine for MPI RMA, namely Casper.
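To make the neighborhood-collective comment concrete, the sketch below (an invented example under assumed settings: a 2-D periodic process grid and a one-double halo) builds a Cartesian communicator and replaces several point-to-point calls with a single MPI_Neighbor_alltoall.

```c
/* Illustrative sketch: a neighborhood collective on a Cartesian communicator. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Let MPI choose a 2-D factorization of the ranks, periodic in both dims. */
    int dims[2] = {0, 0}, periods[2] = {1, 1};
    MPI_Dims_create(size, 2, dims);
    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, /*reorder=*/1, &cart);

    int rank;
    MPI_Comm_rank(cart, &rank);

    /* For a Cartesian topology the neighbor order is fixed: for each dimension,
       first the negative-direction neighbor, then the positive-direction one. */
    double send[4] = {rank, rank, rank, rank};
    double recv[4] = {-1, -1, -1, -1};
    MPI_Neighbor_alltoall(send, 1, MPI_DOUBLE, recv, 1, MPI_DOUBLE, cart);

    printf("rank %d received halos %.0f %.0f %.0f %.0f\n",
           rank, recv[0], recv[1], recv[2], recv[3]);

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}
```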
Responses from Survey of MPI Alternatives

What issue(s) caused you to consider approaches other than MPI? (27 responses)
• Overlap of communication with computation: 19 (70.4%)
• Object migration: 10 (37%)
• Fault tolerance: 11 (40.7%)
• Power awareness: 6 (22.2%)
• Performance: 13 (48.1%)
• Asynchronous execution: 20 (74.1%)
• Active messaging: 11 (40.7%)
• Portability: 9 (33.3%)
• Futures concept: 6 (22.2%)
• Uniform interface, not combined with separate threading model: 15 (55.6%)
• Better integration with C++: 10 (37%)
• Other: 8 (29.6%)
If checked other, please specify (8 responses):
• Complex DAG-like workflow dependencies; familiarity and clarity of multiprocessing vs. MPI
• Compiler transformations on communication operations
• Binary code without access to expertise to modify source
• Much less syntax to accomplish the same thing using Coarray Fortran, resulting in more readable / maintainable code
• Simpler programming (productivity)
• Funded to implement non-MPI models
• Load balancing
• MPI is too low level.

Which parallel programming model(s)/approach(es) do you use? (26 responses)
• UPC: 8 (30.8%)
• UPC++: 9 (34.6%)
• Charm++: 5 (19.2%)
• Kokkos: 0 (0%)
• Chapel: 1 (3.8%)
• Coarray Fortran: 3 (11.5%)
• HPX: 2 (7.7%)
• OCR-based: 8 (30.8%)
• Legion: 1 (3.8%)
• Parallel Python: 3 (11.5%)
• Hadoop/MapReduce: 1 (3.8%)
• Other: 7 (26.9%)
Other responses: SHMEM, OpenMP; Custom; GASNet; OpenSHMEM; Global Arrays; GPI-2 / GASPI; python multiprocessing and custom workflow DAG management.

Have you compared your approach to MPI? (27 responses)
If yes, are there results available, e.g., paper, talk, link? Please share a citation or link if available. (10 responses)
• https://dx.doi.org/10.1177/1094342016658110
• See http://upc.lbl.gov/publications/
• https://www.computer.org/csdl/proceedings/ipdps/2016/2140/00/2140a453-abs.html
• http://ieeexplore.ieee.org/document/7161562/?reload=true&arnumber=7161562
• http://gasnet.lbl.gov/performance/
• http://dx.doi.org/10.1007/978-3-319-41321-1_17 and https://github.com/ParRes/Kernels
• http://dx.doi.org/10.1109/IPDPS.2013.78
• http://dx.doi.org/10.1109/IPDPSW.2014.83 and http://www.gpi-site.com/gpi2/benchmarks/
• Not yet.
• N/A
• Not yet available.
Do you have versions in more than one different programming model? (23 responses)
• yes: 43.5%; no: 56.5%
If yes, which models?
• MPI, GA, UPC++
• Charm++, HPX
• MPI
• MPI+OpenMP, OCR, UPC, UPC++
• most of them
• MPI, OpenMP
• MPI and multiprocessing
• MPI and SHMEM
• MPI and Coarray Fortran
• benchmarks in multiple models
• MPI vs OCR

If yes, was your implementation in MPI optimized? (15 responses)
If yes, on which architecture(s) was it optimized? (8 responses)
• Cray
• many: Intel CPU cluster, Cray, etc.
• x86_64
• Cray
• x86
• Cray XC (Edison/Cori)
• x86 / InfiniBand
• See individual papers for details

Can your approach be combined with: (21 responses)
• MPI: 16 (76.2%)
• OpenMP: 17 (81%)
• pthreads: 12 (57.1%)
• Other: 3 (14.3%)
If you checked other, please specify (3 responses): TBB; Slurm; C++, Chapel, Legion.

Does the application use Python? (24 responses)
• Responses split 79.2% to 20.8%.

Does the application use dependent libraries? (23 responses)
• Responses split 69.6% to 30.4%.
If yes, which libraries does it depend upon?
• numpy, scipy, astropy
• ZeroMQ and SIGAR (https://github.com/hyperic/sigar)
• BLAS, ScaLAPACK
• CDFLIB
• HDF5
• Multiple

Are you working on a proxy app for a major application?
If yes, what is not in the proxy app that is in the full application? (Please name the full application.) (2 responses)
• Computation is greatly simplified.
• NEK5000 is the full app. At least, proper local matrix integration is missing.

Snapshot of User Comparisons of MPI to Other Programming Models at NERSC

NERSC users are trying a variety of programming models and often comparing them to MPI. While these comparisons are obviously implementation and application dependent, we give here some current results. (See the links in the MPI Alternatives Survey for more results.)

[Figure: Bi-directional bandwidth test on Cori Phase 1 using two processes, one per node; bandwidth (MB/s) vs. message size (bytes) for MPI two-sided, MPI one-sided, and UPC++.]

UPC++: A PGAS Extension for C++
• Global pointer (type T): global_ptr<T>
• Global memory allocation: allocate(rank, size)
• Contiguous data copy: async_copy(src, dest, event)
• Remote operation: async_after(event, handler)
• Global synchronization: barrier()
• Shared memory support: global_ptr<T> p; p.is_local()
• Non-contiguous data copy: async_copy_vis(...)
• Group synchronization: sync_neighbors(list, length)
Zheng, Yili, et al. "UPC++: a PGAS Extension for C++." Parallel and Distributed Processing Symposium, 2014.
The supported global address space makes the implementation of HPGMG easy.

[Figure: BoxLib using cray-mpich 7.4.0 on 64 nodes; average time per iteration (s) vs. number of ranks (128-2048) for MPI (two-sided), MPI (one-sided), and UPC++.]

BoxLib: an AMR software framework for building parallel adaptive mesh refinement (AMR) codes solving multi-scale/multi-physics partial differential equations. Available at https://github.com/BoxLib-Codes and managed by Weiqun Zhang, et al. C++ and Fortran versions; MPI + OpenMP (+ UPC++).

[Figure: N-Body using LibGeoDecomp.]
[Figure: MiniGhost weak scaling, host cores only, 1024 nodes (16k cores overall); HPX has 87% parallel efficiency and is 1.4X faster than MPI.]

High Performance ParalleX (HPX)
• Lightweight multi-threading: divides work into smaller tasks; increases concurrency
• Message-driven computation: moves work to data; keeps work local, stops blocking
• Constraint-based synchronization: declarative criteria for work; event driven; eliminates global barriers
• Data-directed execution: merger of flow control and data structure
• Shared name space: global address space; simplifies random gathers
HPX-5 is the High Performance ParalleX runtime library from Indiana University (http://hpx.crest.iu.edu). HPX-3 is the C++11/14 implementation of the ParalleX execution model from Louisiana State University (http://stellar-group.org/libraries/hpx/).
Recently started performing instruction count • The three most cited issues that caused the 22 responders to the MPI Alternative Survey to consider approaches other than MPI were: • MPI is being used in combination with a wider range of other parallel approaches as seen in our MPI User Survey HDF5 analysis using Intel SDE • ITAC • asynchronous execution (74%) HDF5 • Tau • Experiences gained by using MPI alternatives and MPI + X can be used to improve future versions of MPI • ddt HDF5 • internal timers + MPI_P* • overlap of communication with computation (70 %) • Results with these models as well as the one-sided implementation of MPI are preliminary and implementation • Valgrind dependent HDF5 • uniform interface, not combined with separate threading model (56%) parallel-netcdf

This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.

