MPI Usage at NERSC: Present and Future


Alice Koniges, Brandon Cook, Jack Deslippe, Thorsten Kurth, Hongzhang Shan
Lawrence Berkeley National Laboratory, Berkeley, CA USA

Abstract: We describe how MPI is used at the National Energy Research Scientific Computing Center (NERSC). NERSC is the production high-performance computing center for the US Department of Energy, with more than 5,000 users and 800 distinct projects. We compare the usage of MPI to exascale developmental programming models such as UPC++ and HPX, with an eye on what features and extensions to MPI are plausible and useful for NERSC users. We also discuss perceived shortcomings of MPI, and why certain groups use other parallel programming models. We believe our results tend to represent the advanced MPI NERSC user base, rather than an extensive survey of the entire NERSC code group population.

Special thanks to Dr. Rolf Rabenseifner and colleagues at the Stuttgart High Performance Computing Center for input in making the NERSC MPI User Survey.

NERSC Hardware

In 2015, the NERSC Program supported 6,000 active users from universities, national laboratories, and industry who used approximately 3 billion supercomputing hours. Our users came from across the United States, with 48 states represented, as well as from around the globe, with 47 countries represented.

Cori is NERSC's newest supercomputer system (NERSC-8). It is named after American biochemist Gerty Cori, the first woman to win a Nobel Prize in Physiology or Medicine. The Cori Phase 1 system is a Cray XC based on the Intel Haswell multi-core processor. Cori Phase 2 is based on the second generation of the Intel Xeon Phi family of products, called the Knights Landing (KNL) Many Integrated Core (MIC) architecture. Cori Phase I and Phase II integration begins ~September 19, 2016 (for ~6 weeks).
• Cori is a Cray XC40 supercomputer.
• Theoretical peak performance: Phase I Haswell: 1.92 PFlops/sec; Phase II KNL: 27.9 PFlops/sec.
• Total compute nodes: Phase I Haswell: 1,630 compute nodes, 52,160 cores in total (32 cores per node); Phase II KNL: 9,304 compute nodes, 632,672 cores in total (68 cores per node).
• Cray Aries high-speed interconnect with Dragonfly topology (0.25 µs to 3.7 µs MPI latency, ~8 GB/sec MPI bandwidth).
• Aggregate memory: Phase I Haswell partition: 203 TB; Phase II KNL partition: 1 PB.
• Scratch storage capacity: 30 PB.

Edison has been NERSC's primary supercomputer system for a number of years. It is named after U.S. inventor and businessman Thomas Alva Edison. Edison, a Cray XC30, has a peak performance of 2.57 petaflops/sec, 133,824 compute cores for running scientific applications, 357 terabytes of memory, and 7.56 petabytes of online disk storage with a peak I/O bandwidth of 168 gigabytes (GB) per second.

How MPI is used at NERSC

Which types of decomposition and communication does your application use? (check all that apply; 32 responses)
• Nearest-neighbor communication: 12 (37.5%)
• Row/column communication: 8 (25%)
• Communication with arbitrary other processes: 13 (40.6%)
• Cartesian grid: 11 (34.4%)
• Non-Cartesian but regular grid: 11 (34.4%)
• Cyclic boundary conditions/communication: 13 (40.6%)
• Non-cyclic boundary conditions: 7 (21.9%)
• Scatter/gather: 16 (50%)
• AllToAll: 13 (40.6%)
• Broadcast/reduce: 27 (84.4%)
• No communication (my code is embarrassingly parallel): 3 (9.4%)
• Other: 3 (9.4%)

Other responses included: primarily halo communication for integration and all-to-one for writing output, although many options exist for output depending on domain size; master/slave; one-sided; MPI structure communications and threaded communications; one-sided and point-to-point between rank 0 and other ranks; graph traversal / random access; and one master PE for I/O, generally a minor amount of time. Free-text notes on typical decomposition sizes included ~10,000, typically 16^2 x 45 (horizontal footprint x column depth), and typically 256-1024.

How do you implement point-to-point communication? (check any that apply; 29 responses)
• Blocking MPI_Send / MPI_Recv: 18 (62.1%)
• Blocking MPI_Bsend / MPI_Recv: 2 (6.9%)
• Blocking MPI_Ssend / MPI_Recv: 1 (3.4%)
• Blocking MPI_Sendrecv: 5 (17.2%)
• Nonblocking MPI_Isend and MPI_Recv: 14 (48.3%)
• Nonblocking MPI_Send and MPI_Irecv: 13 (44.8%)
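Blocking MPI_Send/MPI_Recv remains the most common point-to-point choice among responders, even though nonblocking calls make it easier to avoid unintended serialization and to overlap communication with computation, both of which come up in later survey questions. The following is a minimal, hedged sketch of a nearest-neighbor exchange written with nonblocking calls; it is illustrative only and not taken from any surveyed code.

```c
#include <mpi.h>
#include <stdio.h>

/* Minimal 1-D halo exchange sketch: each rank swaps one boundary value
 * with its left and right neighbors using nonblocking point-to-point
 * calls, so neither side blocks waiting for the other to post a send. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int left  = (rank - 1 + size) % size;   /* cyclic boundary conditions */
    int right = (rank + 1) % size;

    double send_left = rank, send_right = rank;
    double recv_left = -1.0, recv_right = -1.0;

    MPI_Request reqs[4];
    MPI_Irecv(&recv_left,  1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(&recv_right, 1, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &reqs[1]);
    MPI_Isend(&send_right, 1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[2]);
    MPI_Isend(&send_left,  1, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &reqs[3]);

    /* Interior work that does not depend on the halo could be done here. */

    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
    printf("rank %d received %g from left, %g from right\n",
           rank, recv_left, recv_right);

    MPI_Finalize();
    return 0;
}
```

Posting both receives before the sends means no rank waits on a matching call from its neighbor, which is one simple way to keep this pattern from serializing.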
Have you ever checked if your communication is serialized? (31 responses)
• No: 61.3%
• Yes: 38.7%

Do you overlap communication and computation? (32 responses)
• Yes: 62.5%
• No: 37.5%

Do you use any of these features? (31 responses)
• MPI_Put / MPI_Get: 11 (35.5%)
• MPI_Neighbor_alltoall: 2 (6.5%)
• MPI_Alltoall: 11 (35.5%)
• MPI_Allgather: 12 (38.7%)

Five code groups gave additional information on how these features are used, including:
• We often use alltoallv in fact, as we have variable data lengths to send to arbitrary processes. Probably could optimise by knowing neighbors.
• Using RMA for irregular communication; using derived datatypes to abstract strided data in NWChem; sub-communicators are needed to separate computing groups and special groups for asynchronous progress (using NWChem with ARMCI/MPI and Casper); blocking collectives are used for global synchronization in NWChem; non-blocking collectives are used in the asynchronous progress engine (i.e., Casper) to handle special background synchronization that can be overlapped with application execution.
• MPI_Barrier for synchronization within the shared memory domain (with MPI-3 shared memory windows); sub-communicators for row/column communications in dense matrix applications.
• I need to send non-contiguous parts of an array.

Portability issues with respect to different MPI implementations

Of the 32 responding code groups, 10 gave the following comments on portability:
• Have noticed that MPI implementations seem to differ in how they implement read/write_at_all() and read/write_at(). On NERSC's machines, I need to use read/write_at_all(), while on NCAR's machine, the same code prefers read/write_at().
• Open MPI: thread multiple and RMA usually broken; Blue Gene has no MPI-3.
• Compilation problem with the Cray compiler.
• Cray MPI sometimes crashes with alltoallv at large process counts.
• We write MPI code following the MPI standard, so generally no portability issues exist. But sometimes we get unexpected RMA bugs in Cray MPI releases (in both regular mode and DMAPP mode).
• Not directly. Co-workers experienced problems with one-sided MPI (Intel MPI/MPICH).
• Old versions of MPI did not have the MPI_INTEGER8 data type. Newer versions do.
• MPI_Comm_split deadlocking, MPI_Bcast deadlocking, MPI_Comm_dup using linear memory on each call, incorrect attribute caching destructor semantics.
• MPI_IO external32 vs. native encodings with OpenMPI & MPICH.
• Too many issues over 15+ years to try to list them here.
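Many of the feature-usage and portability comments above concern MPI one-sided (RMA) operations: MPI_Put/MPI_Get usage, RMA bugs in specific implementations, and the Casper progress engine. As a point of reference, the sketch below shows the simplest RMA pattern, a fence-synchronized MPI_Put into a neighbor's window; it is a generic illustration of the MPI-3 interface, not code from any of the surveyed applications.

```c
#include <mpi.h>
#include <stdio.h>

/* Minimal MPI-3 RMA sketch: each rank exposes one double in a window
 * and writes its rank number into the window of the next rank with
 * MPI_Put, using MPI_Win_fence as the (simplest) synchronization mode. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local = -1.0;                       /* memory exposed in the window */
    MPI_Win win;
    MPI_Win_create(&local, (MPI_Aint)sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    double value  = (double)rank;
    int    target = (rank + 1) % size;

    MPI_Win_fence(0, win);                     /* open the access epoch */
    MPI_Put(&value, 1, MPI_DOUBLE, target, 0, 1, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);                     /* close epoch; data now visible */

    printf("rank %d: window now holds %g\n", rank, local);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```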
Application codes used by MPI survey responders

• NIMROD
• NWChem
• NWChem, CTF
• MILC QCD simulation codes
• Lattice QCD: RHMC, Complex Langevin, Conjugate Gradient
• NCAR LES
• Parallel Research Kernels
• epiSNP
• PSI-Tet (finite element based MHD code)
• GINGER FEL code
• WARP/PICSAR + own applications
• ALPSCore
• Isam
• WRF
• WRF, UMWM, HYCOM, ESMF
• own code
• Our own code, adapted for MPI
• Ray, HipMer
• HipMER, Combinatorial BLAS, SuperLU
• im3shape, cosmosis (my own codes)
• Applications that use PETSc
• MUMPS, SuperLU_DIST, STRUMPACK, PETSc
• GTFock, Elemental
• ISx
• MPI-3.1 benchmarks
• idealized climate simulations using code named EULAG
• desispec and related DESI-specific custom code

Comments, concerns, suggestions, and requests regarding MPI

• Don't break things!!!!
• Ideally all meaningful patterns are benchmarked.
• Lack of good open source MPI for Cray makes lots of things hard. So many problems would have been solved much quicker with source.
• It would be good to document automatic potential protocol switches in specific MPI implementations (e.g. Cray MPI) for small messages. That could explain some observed communication anomalies.
• Neighborhood collectives and nonblocking collectives are the most relevant optimizations. Implementation of "endpoints" would be huge for interoperability between packages that use OpenMP or a similar threading system and those that use MPI only (perhaps with shared memory windows).
• For progress of RMA operations, I use a process-based progress engine for MPI RMA, namely Casper.
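One of the comments above singles out neighborhood collectives as among the most relevant optimizations for the halo-style patterns reported in the decomposition question. The sketch below is a minimal, hypothetical example of MPI_Neighbor_alltoall on a periodic one-dimensional Cartesian communicator; the 1-D topology and buffer sizes are illustrative choices, not a recommendation from the survey.

```c
#include <mpi.h>
#include <stdio.h>

/* Minimal neighborhood-collective sketch: build a periodic 1-D Cartesian
 * communicator and exchange one int with each of the two neighbors in a
 * single MPI_Neighbor_alltoall call instead of hand-written send/recv pairs. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int dims[1]    = { size };
    int periods[1] = { 1 };                  /* cyclic boundary conditions */
    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 1, dims, periods, 0, &cart);

    int rank;
    MPI_Comm_rank(cart, &rank);

    /* One send element and one receive element per neighbor (left, right). */
    int sendbuf[2] = { rank, rank };
    int recvbuf[2] = { -1, -1 };
    MPI_Neighbor_alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, cart);

    printf("rank %d got %d from left and %d from right\n",
           rank, recvbuf[0], recvbuf[1]);

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}
```

Because the neighbor relationships are attached to the communicator, the MPI implementation is free to optimize the exchange for the underlying network topology.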
Recommended publications
  • Bench - Benchmarking the State-of-the-Art Task Execution Frameworks of Many-Task Computing
    MATRIX: Bench - Benchmarking the State-of-the-Art Task Execution Frameworks of Many-Task Computing. Thomas Dubucq, Tony Forlini, Virgile Landeiro Dos Reis, and Isabelle Santos. Illinois Institute of Technology, Chicago, IL, USA {tdubucq, tforlini, vlandeir, isantos1}@hawk.iit.edu

    Abstract — Technology trends indicate that exascale systems will have billion-way parallelism, and each node will have about three orders of magnitude more intra-node parallelism than today's peta-scale systems. The majority of current runtime systems focus a great deal of effort on optimizing the inter-node parallelism by maximizing the bandwidth and minimizing the latency of the use of interconnection networks and storage, but suffer from the lack of scalable solutions to expose the intra-node parallelism. Many-task computing (MTC) is a distributed fine-grained paradigm that aims to address the challenges of managing parallelism and locality of exascale systems. MTC applications are typically structured as direct acyclic graphs of loosely coupled short tasks with explicit input/output data dependencies.

    Finally, HPX is a general-purpose C++ runtime system for parallel and distributed applications of any scale developed by Louisiana State University, and STAPL is a framework for developing parallel programs from Texas A&M. MATRIX is a many-task computing job scheduling system [3]. There are many resource managing systems aimed towards data-intensive applications. Furthermore, distributed task scheduling in many-task computing is a problem that has been considered by many research teams. In particular, Charm++ [4], Legion [5], Swift [6], [10], Spark [1][2], HPX [12], STAPL [13] and MATRIX [11] offer solutions to this problem and have
  • Adaptive Data Migration in Load-Imbalanced HPC Applications
    Louisiana State University LSU Digital Commons LSU Doctoral Dissertations Graduate School 10-16-2020 Adaptive Data Migration in Load-Imbalanced HPC Applications Parsa Amini Louisiana State University and Agricultural and Mechanical College Follow this and additional works at: https://digitalcommons.lsu.edu/gradschool_dissertations Part of the Computer Sciences Commons Recommended Citation Amini, Parsa, "Adaptive Data Migration in Load-Imbalanced HPC Applications" (2020). LSU Doctoral Dissertations. 5370. https://digitalcommons.lsu.edu/gradschool_dissertations/5370 This Dissertation is brought to you for free and open access by the Graduate School at LSU Digital Commons. It has been accepted for inclusion in LSU Doctoral Dissertations by an authorized graduate school editor of LSU Digital Commons. For more information, please [email protected]. ADAPTIVE DATA MIGRATION IN LOAD-IMBALANCED HPC APPLICATIONS A Dissertation Submitted to the Graduate Faculty of the Louisiana State University and Agricultural and Mechanical College in partial fulfillment of the requirements for the degree of Doctor of Philosophy in The Department of Computer Science by Parsa Amini B.S., Shahed University, 2013 M.S., New Mexico State University, 2015 December 2020 Acknowledgments This effort has been possible, thanks to the involvement and assistance of numerous people. First and foremost, I thank my advisor, Dr. Hartmut Kaiser, who made this journey possible with their invaluable support, precise guidance, and generous sharing of expertise. It has been a great privilege and opportunity for me be your student, a part of the STE||AR group, and the HPX development effort. I would also like to thank my mentor and former advisor at New Mexico State University, Dr.
  • Benchmarking the Intel FPGA SDK for OpenCL Memory Interface
    The Memory Controller Wall: Benchmarking the Intel FPGA SDK for OpenCL Memory Interface. Hamid Reza Zohouri (Tokyo Institute of Technology, Edgecortix Inc. Japan) and Satoshi Matsuoka (Tokyo Institute of Technology, RIKEN Center for Computational Science (R-CCS)) {zohour.h.aa@m, matsu@is}.titech.ac.jp

    Abstract — Supported by their high power efficiency and recent advancements in High Level Synthesis (HLS), FPGAs are quickly finding their way into HPC and cloud systems. Large amounts of work have been done so far on loop and area optimizations for different applications on FPGAs using HLS. However, a comprehensive analysis of the behavior and efficiency of the memory controller of FPGAs is missing in literature, which becomes even more crucial when the limited memory bandwidth of modern FPGAs compared to their GPU counterparts is taken into account. In this work, we will analyze the memory interface generated by Intel FPGA SDK for OpenCL with different configurations for input/output arrays, vector size, interleaving, kernel programming model, on-chip channels, operating frequency, padding, and multiple types of overlapped blocking. Our results point to multiple shortcomings in the memory controller of Intel FPGAs, especially with respect to memory access alignment, that can hinder the programmer's

    efficiency on Intel FPGAs with different configurations for input/output arrays, vector size, interleaving, kernel programming model, on-chip channels, operating frequency, padding, and multiple types of blocking. • We outline one performance bug in Intel's compiler, and multiple deficiencies in the memory controller, leading to significant loss of memory performance for typical applications. In some of these cases, we provide work-arounds to improve the memory performance. II. Methodology. A. Memory Benchmark Suite. For our evaluation, we develop an open-source benchmark suite called FPGAMemBench, available at https://github.com/zohourih/FPGAMemBench.
  • BCL: A Cross-Platform Distributed Data Structures Library
    BCL: A Cross-Platform Distributed Data Structures Library. Benjamin Brock, Aydın Buluç, Katherine Yelick. University of California, Berkeley; Lawrence Berkeley National Laboratory. {brock,abuluc,yelick}@cs.berkeley.edu

    ABSTRACT: One-sided communication is a useful paradigm for irregular parallel applications, but most one-sided programming environments, including MPI's one-sided interface and PGAS programming languages, lack application-level libraries to support these applications. We present the Berkeley Container Library, a set of generic, cross-platform, high-performance data structures for irregular applications, including queues, hash tables, Bloom filters and more. BCL is written in C++ using an internal DSL called the BCL Core that provides one-sided communication primitives such as remote get and remote put operations. The BCL Core has backends for MPI, OpenSHMEM, GASNet-EX, and UPC++, allowing BCL data structures to be used natively in programs written using any of these programming environments.

    high-performance computing, including several using the Partitioned Global Address Space (PGAS) model: Titanium, UPC, Coarray Fortran, X10, and Chapel [9, 11, 12, 25, 29, 30]. These languages are especially well-suited to problems that require asynchronous one-sided communication, or communication that takes place without a matching receive operation or outside of a global collective. However, PGAS languages lack the kind of high level libraries that exist in other popular programming environments. For example, high-performance scientific simulations written in MPI can leverage a broad set of numerical libraries for dense or sparse matrices, or for structured, unstructured, or adaptive meshes. PGAS languages can sometimes use those numerical libraries, but are missing the data structures that are important in some of the most irregular
  • Advances, Applications and Performance of the Global Arrays Shared Memory Programming Toolkit
    Advances, Applications and Performance of the Global Arrays Shared Memory Programming Toolkit. Jarek Nieplocha, Bruce Palmer, Vinod Tipparaju, Manojkumar Krishnan, Harold Trease, Edoardo Aprà. Abstract: This paper describes capabilities, evolution, performance, and applications of the Global Arrays (GA) toolkit. GA was created to provide application programmers with an interface that allows them to distribute data while maintaining the type of global index space and programming syntax similar to that available when programming on a single

    1 Introduction. The two predominant classes of programming models for MIMD concurrent computing are distributed memory and shared memory. Both shared memory and distributed memory models have advantages and shortcomings. Shared memory model is much easier to use but it ignores data locality/placement. Given the hierarchical nature of the memory subsystems in modern computers this characteristic can have a negative impact on performance and scalability. Careful code restructuring to increase data reuse and replacing fine grain load/stores with block access to shared data can address the problem and yield performance for shared memory that is competitive with message-passing (Shan and Singh 2000). However, this performance comes at the cost of compromising the ease of use that the shared memory model advertises. Distributed memory models, such as message-passing or one-sided communication, offer performance and scalability but they are difficult to program. The Global Arrays toolkit (Nieplocha, Harrison, and Littlefield 1994, 1996; Nieplocha et al. 2002a) attempts to offer the best features of both models. It implements a shared-memory programming model in which data locality is managed by the programmer.
  • Beowulf Clusters — an Overview
    WinterSchool 2001, Å. Ødegård. Beowulf clusters — an overview. Åsmund Ødegård, April 4, 2001. Contents: Introduction; What is a Beowulf; The history of Beowulf; Who can build a Beowulf; How to design a Beowulf; Beowulfs in more detail; Rules of thumb; What Beowulfs are Good For; Experiments; 3D nonlinear acoustic fields; Incompressible Navier–Stokes; 3D nonlinear water wave. Introduction. Why clusters? • "Work harder" - more CPU power, more memory, more everything • "Work smarter" - better algorithms • "Get help" - let more boxes work together to solve the problem - parallel processing (after Greg Pfister). Beowulfs in the Parallel Computing picture: Parallel Computing, MetaComputing, Clusters, Tightly Coupled, Vector, WS farms, Pile of PCs, NOW, NT/Win2k Clusters, Beowulf, CC-NUMA. What is a Beowulf? • Mass-market commodity off the shelf (COTS) • Low cost local area network (LAN) • Open Source UNIX-like operating system (OS) • Execute parallel applications programmed with a message passing model (MPI) • Anything from small systems to large, fast systems; the fastest ranks as no. 84 on today's Top500 • The best price/performance system available for many applications • Philosophy: the cheapest system available which solves your problem in reasonable time. The history of Beowulf. • 1993: perfect conditions for the first Beowulf - Major CPU performance advance: 80286 -> 80386 - DRAM of reasonable costs and densities (8 MB) - Disk drives of several 100 MBs available for PC - Ethernet (10 Mbps) controllers and hubs cheap enough - Linux improved rapidly, and was in a usable state - PVM widely accepted as a cross-platform message passing model • Clustering was done with commercial UNIX, but the cost was high.
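Since the defining trait above is that a Beowulf runs parallel applications written to a message-passing model (MPI), a minimal illustration is the classic MPI hello-world; the program and the mpicc/mpirun commands mentioned afterwards are a generic sketch, not part of the original overview.

```c
#include <mpi.h>
#include <stdio.h>

/* Classic MPI hello-world: each process on the cluster reports its rank. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char host[MPI_MAX_PROCESSOR_NAME];
    int  len;
    MPI_Get_processor_name(host, &len);

    printf("Hello from rank %d of %d on %s\n", rank, size, host);

    MPI_Finalize();
    return 0;
}
```

Built and launched with the usual tools (for example `mpicc hello.c -o hello` followed by `mpirun -np 8 ./hello`), this is the kind of job a commodity cluster is expected to run out of the box.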
  • Improving MPI Threading Support for Current Hardware Architectures
    University of Tennessee, Knoxville. TRACE: Tennessee Research and Creative Exchange. Doctoral Dissertations, Graduate School, 12-2019. Improving MPI Threading Support for Current Hardware Architectures. Thananon Patinyasakdikul, University of Tennessee, [email protected]. Follow this and additional works at: https://trace.tennessee.edu/utk_graddiss Recommended Citation: Patinyasakdikul, Thananon, "Improving MPI Threading Support for Current Hardware Architectures." PhD diss., University of Tennessee, 2019. https://trace.tennessee.edu/utk_graddiss/5631 This Dissertation is brought to you for free and open access by the Graduate School at TRACE: Tennessee Research and Creative Exchange. It has been accepted for inclusion in Doctoral Dissertations by an authorized administrator of TRACE: Tennessee Research and Creative Exchange. For more information, please contact [email protected]. To the Graduate Council: I am submitting herewith a dissertation written by Thananon Patinyasakdikul entitled "Improving MPI Threading Support for Current Hardware Architectures." I have examined the final electronic copy of this dissertation for form and content and recommend that it be accepted in partial fulfillment of the requirements for the degree of Doctor of Philosophy, with a major in Computer Science. Jack Dongarra, Major Professor. We have read this dissertation and recommend its acceptance: Michael Berry, Michela Taufer, Yingkui Li. Accepted for the Council: Dixie L. Thompson, Vice Provost and Dean of the Graduate School. (Original signatures are on file with official student records.) Improving MPI Threading Support for Current Hardware Architectures: A Dissertation Presented for the Doctor of Philosophy Degree, The University of Tennessee, Knoxville. Thananon Patinyasakdikul, December 2019. © by Thananon Patinyasakdikul, 2019. All Rights Reserved. To my parents Thanawij and Issaree Patinyasakdikul, my little brother Thanarat Patinyasakdikul, for their love, trust and support.
  • Enabling Efficient Use of UPC and OpenSHMEM PGAS Models on GPU Clusters
    Enabling Efficient Use of UPC and OpenSHMEM PGAS Models on GPU Clusters. Presented at GTC '15 by Dhabaleswar K. (DK) Panda, The Ohio State University. E-mail: [email protected] http://www.cse.ohio-state.edu/~panda

    Accelerator Era: • Accelerators are becoming common in high-end system architectures (Top 100, Nov 2014: 28% use accelerators, 57% use NVIDIA GPUs) • Increasing number of workloads are being ported to take advantage of NVIDIA GPUs • As they scale to large GPU clusters with high compute density - the higher the synchronization and communication overheads, the higher the penalty • Critical to minimize these overheads to achieve maximum performance. Parallel Programming Models Overview: Shared Memory Model (e.g. DSM), Distributed Memory Model (e.g. MPI, the Message Passing Interface), and Partitioned Global Address Space (PGAS) models with logical shared memory (e.g. Global Arrays, UPC, Chapel, X10, CAF, ...). • Programming models provide abstract machine models • Models can be mapped on different types of systems - e.g. Distributed Shared Memory (DSM), MPI within a node, etc. • Each model has strengths and drawbacks - suits different problems or applications. Outline: • Overview of PGAS models (UPC and OpenSHMEM) • Limitations in PGAS models for GPU computing • Proposed Designs and Alternatives • Performance Evaluation • Exploiting GPUDirect RDMA. Partitioned Global Address Space (PGAS) Models: • PGAS models, an attractive alternative to traditional message passing - Simple
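To make the PGAS idea in the overview concrete, here is a minimal OpenSHMEM sketch of a one-sided put into a symmetric variable on a neighboring PE. It is independent of the GPU-specific designs the talk goes on to propose, and the variable names and neighbor choice are illustrative.

```c
#include <shmem.h>
#include <stdio.h>

/* Minimal OpenSHMEM PGAS sketch: every PE allocates a symmetric integer
 * and writes its PE number into the copy on the next PE with a one-sided
 * put -- no matching receive is posted on the target side. */
int main(void)
{
    shmem_init();

    int me   = shmem_my_pe();
    int npes = shmem_n_pes();

    int *dest = (int *)shmem_malloc(sizeof(int));   /* symmetric allocation */
    *dest = -1;
    shmem_barrier_all();                            /* allocation visible everywhere */

    shmem_int_p(dest, me, (me + 1) % npes);         /* one-sided put to neighbor */
    shmem_barrier_all();                            /* complete and order the puts */

    printf("PE %d of %d received %d\n", me, npes, *dest);

    shmem_free(dest);
    shmem_finalize();
    return 0;
}
```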
  • Automatic Handling of Global Variables for Multi-Threaded MPI Programs
    Automatic Handling of Global Variables for Multi-threaded MPI Programs. Gengbin Zheng, Stas Negara, Celso L. Mendes, Laxmikant V. Kalé (Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA) and Eduardo R. Rodrigues (Institute of Informatics, Federal University of Rio Grande do Sul, Porto Alegre, Brazil). {gzheng,snegara2,cmendes,[email protected] [email protected]

    Abstract — Hybrid programming models, such as MPI combined with threads, are one of the most efficient ways to write parallel applications for current machines comprising multi-socket/multi-core nodes and an interconnection network. Global variables in legacy MPI applications, however, present a challenge because they may be accessed by multiple MPI threads simultaneously. Thus, transforming legacy MPI applications to be thread-safe in order to exploit multi-core architectures requires proper handling of global variables. In this paper, we present three approaches to eliminate these global variables to ensure thread-safety for an MPI program. These approaches include: (a) a compiler-based refactoring technique, using a Photran-based tool as an example, which automates the source-to-source transformation for programs

    necessary to privatize global and static variables to ensure the thread-safety of an application. In high-performance computing, one way to exploit multi-core platforms is to adopt hybrid programming models that combine MPI with threads, where different programming models are used to address issues such as multiple threads per core, decreasing amount of memory per core, load imbalance, etc. When porting legacy MPI applications to hybrid models that involve multiple threads, thread-safety of the applications needs to be addressed again. Global variables cause no problem with traditional MPI implementations, since each process image contains a separate instance of the variable.
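The privatization problem described above can be illustrated with a small sketch: a file-scope counter that would race when several threads act as independent MPI endpoints is given thread-local storage so each thread sees its own instance. This is only a generic illustration of the privatization idea using C11 _Thread_local and OpenMP; it is not the paper's compiler-based tool, and the names are made up for the example.

```c
#include <stdio.h>
#include <omp.h>

/* Illustrative privatization sketch (not the paper's approach itself): a
 * mutable file-scope variable is unsafe when several threads act as
 * independent "ranks", so it is declared _Thread_local to give each
 * thread its own private instance. */
static _Thread_local int iteration_count = 0;   /* privatized global */

static void do_work(int thread_id)
{
    for (int i = 0; i < 1000; ++i)
        iteration_count++;                      /* no data race: per-thread copy */
    printf("thread %d counted %d iterations\n", thread_id, iteration_count);
}

int main(void)
{
    #pragma omp parallel
    do_work(omp_get_thread_num());
    return 0;
}
```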
  • Overview of the Global Arrays Parallel Software Development Toolkit: Introduction to Global Address Space Programming Models
    Overview of the Global Arrays Parallel Software Development Toolkit: Introduction to Global Address Space Programming Models. P. Saddayappan (Ohio State University); Bruce Palmer, Manojkumar Krishnan, Sriram Krishnamoorthy, Abhinav Vishnu, Daniel Chavarría, Patrick Nichols, Jeff Daily (Pacific Northwest National Laboratory).

    Outline of the Tutorial: • Parallel programming models • Global Arrays (GA) programming model • GA Operations • Writing, compiling and running GA programs • Basic, intermediate, and advanced calls • With C and Fortran examples • GA Hands-on session. [Slide: Performance vs. Abstraction and Generality - chart placing domain-specific systems, GA, CAF, MPI, OpenMP, and autoparallelized C/Fortran90 along scalability and generality axes, with the "Holy Grail" combining both.] Parallel Programming Models: • Single Threaded - Data Parallel, e.g. HPF • Multiple Processes - Partitioned-Local Data Access (MPI) - Uniform-Global-Shared Data Access (OpenMP) - Partitioned-Global-Shared Data Access (Co-Array Fortran) - Uniform-Global-Shared + Partitioned Data Access (UPC, Global Arrays, X10). High Performance Fortran: • Single-threaded view of computation • Data parallelism and parallel loops • User-specified data distributions for arrays • Compiler transforms HPF program to SPMD program • Communication optimization critical to performance • Programmer may not be conscious of communication implications of parallel program. [Slide code fragments: HPF$ Independent directives on DO I = 1,N loops, s=s+1, A(1:100) = B(0:99)+B(2:101).]
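The outline above promises C examples of writing, compiling and running GA programs. As a hedged sketch of that flow (assuming the standard C API shipped with the Global Arrays distribution; the array shape, chunking, and MA sizes are illustrative), a minimal create/put/get/sync program looks roughly like this:

```c
#include <stdio.h>
#include <mpi.h>
#include "ga.h"
#include "macdecls.h"

/* Hedged Global Arrays sketch: create a distributed 2-D array, have
 * process 0 fill a small patch through the global index space with a
 * one-sided put, and read it back on every process after a sync. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    GA_Initialize();
    MA_init(C_DBL, 1000000, 1000000);          /* MA sizes are illustrative */

    int dims[2]  = { 8, 8 };
    int chunk[2] = { -1, -1 };                 /* let GA choose the distribution */
    int g_a = NGA_Create(C_DBL, 2, dims, "demo", chunk);

    double patch[4] = { 1.0, 2.0, 3.0, 4.0 };
    int lo[2] = { 0, 0 }, hi[2] = { 1, 1 }, ld[1] = { 2 };

    if (GA_Nodeid() == 0)
        NGA_Put(g_a, lo, hi, patch, ld);       /* one-sided write via global indices */
    GA_Sync();                                  /* make the put visible everywhere */

    double check[4];
    NGA_Get(g_a, lo, hi, check, ld);
    printf("process %d sees corner element %g\n", GA_Nodeid(), check[0]);

    GA_Destroy(g_a);
    GA_Terminate();
    MPI_Finalize();
    return 0;
}
```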
  • Exascale Computing Project -- Software
    Exascale Computing Project -- Software. Paul Messina, ECP Director; Stephen Lee, ECP Deputy Director. ASCAC Meeting, Arlington, VA, Crystal City Marriott, April 19, 2017. www.ExascaleProject.org

    ECP scope and goals: • Develop applications to tackle a broad spectrum of mission critical problems of unprecedented complexity • Partner with vendors to develop computer architectures that support exascale applications • Support national security • Develop a software stack that is both exascale-capable and usable on industrial & academic scale systems, in collaboration with vendors • Train a next-generation workforce of computational scientists, engineers, and computer scientists • Contribute to the economic competitiveness of the nation. ECP has formulated a holistic approach that uses co-design and integration to achieve capable exascale. [Diagram: Application Development (science and mission applications), Software Technology (scalable and productive software stack: programming models, development environment, and runtimes; math libraries and frameworks; tools; correctness, visualization, and data analysis; system software, resource management, threading, scheduling, monitoring, and control; data management, I/O and file system; memory and burst buffer; node OS and runtimes; workflows; resilience; hardware interface), Hardware Technology (hardware technology elements), and Exascale Systems (integrated exascale supercomputers).] ECP's work encompasses applications, system software, hardware technologies and architectures, and workforce
  • The Global Arrays User Manual
    The Global Arrays User Manual. Manojkumar Krishnan, Bruce Palmer, Abhinav Vishnu, Sriram Krishnamoorthy, Jeff Daily, Daniel Chavarria. February 8, 2012. This document is intended to be used with version 5.1 of Global Arrays (Pacific Northwest National Laboratory Technical Report Number PNNL-13130).

    DISCLAIMER: This material was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor the United States Department of Energy, nor Battelle, nor any of their employees, MAKES ANY WARRANTY, EXPRESS OR IMPLIED, OR ASSUMES ANY LEGAL LIABILITY OR RESPONSIBILITY FOR THE ACCURACY, COMPLETENESS, OR USEFULNESS OF ANY INFORMATION, APPARATUS, PRODUCT, SOFTWARE, OR PROCESS DISCLOSED, OR REPRESENTS THAT ITS USE WOULD NOT INFRINGE PRIVATELY OWNED RIGHTS. ACKNOWLEDGMENT: This software and its documentation were produced with United States Government support under Contract Number DE-AC05-76RLO-1830 awarded by the United States Department of Energy. The United States Government retains a paid-up non-exclusive, irrevocable worldwide license to reproduce, prepare derivative works, perform publicly and display publicly by or for the US Government, including the right to distribute to other US Government contractors. December, 2011.

    Contents: 1 Introduction (1.1 Overview; 1.2 Basic Functionality; 1.3 Programming Model; 1.4 Application Guidelines: 1.4.1 When to use GA, 1.4.2 When not to use GA); 2 Writing, Building and Running GA Programs (2.1 Platform and Library Dependencies: 2.1.1 Supported Platforms, 2.1.2 Selection of the communication network for ARMCI)