2.3.1.07 MPICH: A High-Performance MPI Implementation
Pavan Balaji¹, Kenneth Raffenetti¹, Yanfei Guo¹, Min Si¹, Rob Latham¹, Marc Snir², Giuseppe Congiu¹, Hui Zhou¹, Shintaro Iwasaki¹
1. Argonne National Laboratory  2. University of Illinois Urbana-Champaign

MPICH Overview
• Funded by DOE for 27 years
• Has been a key influencer in the adoption of MPI
  – First/most comprehensive implementation of every MPI standard
  – Allows supercomputing centers to not compromise on the MPI features they demand from vendors
• DOE R&D100 award in 2005 for MPICH
• DOE R&D100 award in 2019 for UCX (MPICH internal communication layer)
• MPICH and its derivatives are the world's most widely used MPI implementations
  – Supports all versions of MPI, including the recent MPI-3.1

MPICH is not just a software package, it's an ecosystem of tools, libraries, applications, and derivative MPI implementations, including TAU, HPCToolkit, PETSc, MPE, ADLB, DDT, Totalview, MathWorks, ANSYS, Intel MPI, Cray MPI, Microsoft MPI, MVAPICH, RIKEN MPI, Tianhe MPI, Sunway MPI, ParaStation MPI, and FG-MPI.

MPICH and its derivatives in the Top 10
1. Summit (USA): Spectrum MPI
2. Sierra (USA): Spectrum MPI
3. TaihuLight (China): Sunway MPI
4. Tianhe-2A (China): MPICH-TH2
5. Frontera (USA): Intel MPI and MVAPICH2
6. Piz Daint (Switzerland): Cray MPI
7. Trinity (USA): Cray MPI
8. ABCI (Japan): Intel MPI and MVAPICH2
9. SuperMUC-NG (Germany): Intel MPI
10. Lassen (USA): Spectrum MPI

Key Focus Areas

Goal
• Influence the evolution of the MPI Standard to address ECP and DOE application needs
• Work with vendors to ensure high-performance MPI implementations for DOE

Address Key Technical Challenges
• Performance & Scalability • Heterogeneity • Fault Tolerance • MPI+X Hybrid Programming • Topology Awareness

Performance and Scalability
• Lightweight communication
  – Reducing overhead in instruction count and memory usage
• Improvements in MPI one-sided communication
  – Enabling HW-accelerated RMA (see the sketch below)
• Communication hints
  – Allowing users to tell MPI to optimize for the crucial subset of features they use

[Figure: The MPICH/CH4 architecture. Application → MPI Interface → MPI layer (machine-independent collectives, derived-datatype management, group management) → Abstract Device Interface (ADI) → CH4 (CH4 core, architecture-specific collectives, active-message fallback, GPU-support fallback) → netmods (OFI, UCX) and shmmods (POSIX, XPMEM).]
[Figure: Estimated efficiency of MPICH/CH4 vs. MPICH/Original as a function of grid points per MPI rank, for N=5 and N=7.]
[Figure: BGQ LAMMPS strong scaling, MPICH/CH4 vs. MPICH/Original: timesteps per second and efficiency vs. number of nodes (atoms per core).]
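The one-sided items above follow the standard MPI-3 RMA interface. Below is a minimal sketch (not MPICH internals) of the usage pattern behind the hardware-accelerated atomics results shown later for NWChem: a window allocated with the standard "accumulate_ops=same_op_no_op" info hint, plus a fetch-and-add done with MPI_Fetch_and_op, which the library can offload to network atomics when the fabric supports it. The shared-counter example itself is illustrative, not taken from NWChem.

/* Sketch: HW-accelerated RMA atomics via standard MPI-3 one-sided calls. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Standard MPI-3 window hint: all accumulate operations on this window
     * use the same op (or MPI_NO_OP), which allows the implementation to
     * offload fetch-and-add to network hardware when available. */
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "accumulate_ops", "same_op_no_op");

    long *counter;                     /* window memory, one long per rank */
    MPI_Win win;
    MPI_Win_allocate(sizeof(long), sizeof(long), info, MPI_COMM_WORLD,
                     &counter, &win);
    MPI_Info_free(&info);

    *counter = 0;                      /* initialize local window memory */
    MPI_Barrier(MPI_COMM_WORLD);       /* make initialization visible before use */

    /* Every rank atomically fetches-and-increments the counter on rank 0. */
    long one = 1, prev;
    MPI_Win_lock_all(0, win);
    MPI_Fetch_and_op(&one, &prev, MPI_LONG, 0 /* target rank */,
                     0 /* displacement */, MPI_SUM, win);
    MPI_Win_flush(0, win);
    MPI_Win_unlock_all(win);

    printf("rank %d got ticket %ld\n", rank, prev);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}

Run with several ranks, each process prints a distinct ticket value, since the increments are atomic regardless of whether they execute in software or are offloaded to the NIC.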

MPI+X Hybrid Programming
• Exposing application parallelism to MPI
  – Mapping to multiple communication contexts
• Interoperability with OpenMP and user-level threads (see the sketch below)

[Figure: Multithreaded transfer model in current MPI (3.1) vs. a work-queue data transfer model with MPI endpoints, in which the user exposes parallelism and each thread drives its own communication context (CTX) down to the hardware.]
[Figure: Message rate (x 10^6 messages/s) of multithreaded communication with separate communicators (16 threads/node) vs. message size (B), comparing MPI_THREAD_SINGLE, MPI_THREAD_MULTIPLE with MPI_COMM_WORLD, and MPI_THREAD_MULTIPLE with separate COMMs.]
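The message-rate plot above compares a single shared communicator against one communicator per thread. A minimal sketch of the per-thread-communicator pattern follows (OpenMP threads with one duplicated communicator each; the thread count, ring exchange, and iteration count are illustrative, not the benchmark behind the plot):

/* Sketch: MPI_THREAD_MULTIPLE with a separate communicator per thread,
 * so the library can map threads onto independent communication contexts
 * instead of serializing them on one shared context. */
#include <mpi.h>
#include <omp.h>

#define NTHREADS 16      /* matches the 16 threads/node configuration */

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        MPI_Abort(MPI_COMM_WORLD, 1);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int right = (rank + 1) % size;          /* simple ring exchange */
    int left  = (rank - 1 + size) % size;

    /* One duplicated communicator per thread: traffic from different
     * threads can never match, so the library may use separate internal
     * queues and network resources per communicator. */
    MPI_Comm tcomm[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        MPI_Comm_dup(MPI_COMM_WORLD, &tcomm[i]);

    #pragma omp parallel num_threads(NTHREADS)
    {
        int tid = omp_get_thread_num();
        char sbuf = (char)tid, rbuf;
        for (int iter = 0; iter < 1000; iter++)
            MPI_Sendrecv(&sbuf, 1, MPI_CHAR, right, 0,
                         &rbuf, 1, MPI_CHAR, left,  0,
                         tcomm[tid], MPI_STATUS_IGNORE);
    }

    for (int i = 0; i < NTHREADS; i++)
        MPI_Comm_free(&tcomm[i]);
    MPI_Finalize();
    return 0;
}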

Heterogeneity
• Collectives
• Supporting hybrid memory configurations for MPI intranode communications
• GPU support (see the sketch below)

[Figure: Point-to-point multi-bandwidth results (MB/s) for DRAM and MCDRAM in the CH4 device implementation; the plot also shows the percentage performance improvement of MCDRAM over DRAM.]
[Figure: Degrees of freedom per second averaged over 100 iterations of the Nek5000 solver (higher is better); the plot also shows percentage errors for DRAM and MCDRAM and the percentage improvement of MCDRAM over DRAM.]
[Figure: Performance analysis of NWChem DFT on Cray XC40/KNL: evaluation of hardware-accelerated MPI atomics for Carbon 180 with the 6-31G* basis set. The atomics overhead (FOP) is significantly reduced by using HW MPI atomics.]
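For GPU support, the usage model is that device buffers are handed directly to MPI calls and the library performs the staging or GPUDirect transfer internally. A minimal sketch of that model follows, assuming a CUDA device and an MPI build with GPU support; the buffer size and ranks are illustrative:

/* Sketch: GPU-aware point-to-point, passing CUDA device pointers to MPI. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) { MPI_Finalize(); return 0; }   /* needs at least 2 ranks */

    const int n = 1 << 20;                        /* 1M floats */
    float *dev_buf;                               /* GPU memory; never staged
                                                     to the host by the app */
    cudaMalloc((void **)&dev_buf, (size_t)n * sizeof(float));

    if (rank == 0) {
        cudaMemset(dev_buf, 0, (size_t)n * sizeof(float));
        MPI_Send(dev_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(dev_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d floats into GPU memory\n", n);
    }

    cudaFree(dev_buf);
    MPI_Finalize();
    return 0;
}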


ECP Application/MPICH Integrations

CEED
• Nearest neighbor data exchanges
• Key problem areas:
  – Push strong scaling limit/time to solution
  – Efficient halo exchange
• Potential integrations:
  – Lightweight communication
  – Neighborhood collectives

HACC
• Nearest neighbor data exchange and periodic reductions
• Key problem areas:
  – Efficient halo exchange, asynchronous progress, and network saturation from a single MPI rank
• Potential integrations:
  – MPI-3 topology + neighborhood collectives (see the halo-exchange sketch below)
  – Improvements to MPI+threads; endpoints
  – Approaches for performing reductions asynchronously

LatticeQCD
• Neighbor exchanges, nonblocking reductions, checkpoint I/O
• Key problem areas:
  – Fault tolerance
• Potential integrations:
  – Resilience improvements to process manager (not directly ULFM)

ACME
• Nearest neighbor data exchanges
• Key problem areas:
  – Push strong scaling limit/time to solution
  – Efficient halo exchange & better task placement
• Potential integrations:
  – Lightweight communication
  – Neighborhood collectives

AMReX
• Mostly nearest neighbor, some irregular data movement
• Key problem areas:
  – Efficient halo exchange
• Potential integrations:
  – MPI-3 topology + neighborhood collectives
  – Expose hardware info through MPI

NWChemX
• Nearest neighbor data exchanges
• Key problem areas:
  – Push strong scaling limit/time to solution
  – Efficient halo exchange
• Potential integrations:
  – Lightweight communication
  – Neighborhood collectives
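Several of the integrations above name MPI-3 topology plus neighborhood collectives for halo exchange. A minimal sketch of that pattern follows; the 1-D periodic decomposition and one-element halos stand in for the applications' real stencils:

/* Sketch: describe the halo-exchange pattern to MPI with a distributed
 * graph communicator, then exchange halos with one neighborhood collective
 * so the library can optimize placement and scheduling. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* 1-D periodic decomposition: each rank's neighbors are left and right. */
    int left  = (rank - 1 + size) % size;
    int right = (rank + 1) % size;
    int nbrs[2] = { left, right };

    MPI_Comm nbr_comm;
    MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD,
                                   2, nbrs, MPI_UNWEIGHTED,  /* sources      */
                                   2, nbrs, MPI_UNWEIGHTED,  /* destinations */
                                   MPI_INFO_NULL, 1 /* allow reorder */,
                                   &nbr_comm);

    /* One halo value per neighbor, exchanged in a single call. */
    double send[2] = { rank + 0.25, rank + 0.75 };
    double recv[2];
    MPI_Neighbor_alltoall(send, 1, MPI_DOUBLE,
                          recv, 1, MPI_DOUBLE, nbr_comm);

    MPI_Comm_free(&nbr_comm);
    MPI_Finalize();
    return 0;
}

Setting the reorder flag to 1 also gives the implementation room to remap ranks for better task placement, one of the problem areas listed for ACME above.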

This material was based upon work supported by the U.S. Department of Energy, Office of Science, under Contract DE-AC02-06CH11357. This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357. This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725. We gratefully acknowledge the computing resources provided and operated by the Joint Laboratory for System Evaluation (JLSE) at Argonne National Laboratory.