The MVAPICH2 Project: Latest Developments and Plans Towards Exascale Computing
Presentation at the OSU Booth (SC '19)

Hari Subramoni, The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~subramon

Drivers of Modern HPC Cluster Architectures

• Multi-core/many-core technologies (high compute density, high performance/watt; >1 TFlop DP on a chip)
• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE) with <1 usec latency and 200 Gbps bandwidth
• Solid State Drives (SSDs), NVMe-SSDs, and Non-Volatile Random-Access Memory (NVRAM)
• Accelerators (NVIDIA GPGPUs and Intel Xeon Phi)
• Available on HPC clouds, e.g., Amazon EC2, NSF Chameleon, Microsoft Azure, etc.
[Figure: example systems – Summit, Sierra, Sunway TaihuLight, K Computer]

Parallel Programming Models Overview
[Figure: three process/memory organizations – shared memory (logical shared memory), distributed memory (message passing), and partitioned global address space]
• Shared Memory Model: SHMEM, DSM
• Distributed Memory Model: MPI (Message Passing Interface)
• Partitioned Global Address Space (PGAS): UPC, Chapel, X10, CAF, ...
• Programming models provide abstract machine models
• Models can be mapped onto different types of systems – e.g., Distributed Shared Memory (DSM), MPI within a node, etc.
• PGAS models and hybrid MPI+PGAS models are gradually gaining importance

Designing Communication Libraries for Multi-Petaflop and Exaflop Systems: Challenges
• Application kernels/applications and middleware
• Programming models: MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, OpenACC, Hadoop (MapReduce), Spark (RDD, DAG), etc.
• Communication library or runtime for programming models: point-to-point communication, collective communication, synchronization and locks, energy-awareness, I/O and file systems, fault tolerance
• Co-design opportunities and challenges across the various layers: performance, scalability, and fault-resilience
• Underlying hardware: networking technologies (InfiniBand, 40/100 GigE, Aries, and Omni-Path), multi-/many-core architectures, and accelerators (GPU and FPGA)

MPI+X Programming Model: Broad Challenges at Exascale
• Scalability for million to billion processors
– Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
– Scalable job start-up
• Scalable collective communication
– Offload
– Non-blocking
– Topology-aware
• Balancing intra-node and inter-node communication for next-generation nodes (128-1024 cores)
– Multiple end-points per node
• Support for efficient multi-threading
• Integrated support for GPGPUs and FPGAs
• Fault-tolerance/resiliency
• QoS support for communication and I/O
• Support for hybrid MPI+PGAS programming (MPI + OpenMP, MPI + UPC, MPI + UPC++, MPI + OpenSHMEM, CAF, ...)
• Virtualization
• Energy-awareness

Overview of the MVAPICH2 Project

• High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
– MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2 and MPI-3.1); started in 2001, first version available in 2002
– MVAPICH2-X (MPI + PGAS), available since 2011
– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
– Support for virtualization (MVAPICH2-Virt), available since 2015
– Support for energy-awareness (MVAPICH2-EA), available since 2015
– Support for InfiniBand network analysis and monitoring (OSU INAM) since 2015
– Used by more than 3,050 organizations in 89 countries
– More than 614,000 (> 0.6 million) downloads from the OSU site directly
– Empowering many TOP500 clusters (Nov '18 ranking)
• 3rd, 10,649,600 cores (Sunway TaihuLight) at the National Supercomputing Center in Wuxi, China
• 5th, 448,448 cores (Frontera) at TACC
• 8th, 391,680 cores (ABCI) in Japan
• 15th, 570,020 cores (Nurion) in South Korea, and many others
– Available with the software stacks of many vendors and distros (RedHat, SuSE, and OpenHPC)
– Partner in the TACC Frontera system
– http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade

MVAPICH2 Release Timeline and Downloads
[Chart: cumulative number of downloads (0-600,000) over time, annotated with release milestones from MV 0.9.4, MV2 0.9.0, MV2 0.9.8, MV2 1.0, MV 1.0, MV2 1.0.3, MV 1.1, MV2 1.4, MV2 1.5, MV2 1.6, MV2 1.7, MV2 1.8, MV2 1.9, MV2-GDR 2.0b, MV2-MIC 2.0, MV2-Virt 2.2, MV2-X 2.3rc2, MV2-GDR 2.3.2, MV2-Azure 2.3.2, MV2-AWS 2.3, MV2 2.3.2, and OSU INAM 0.9.3]

Architecture of MVAPICH2 Software Family

• High-performance parallel programming models: Message Passing Interface (MPI), PGAS (UPC, OpenSHMEM, CAF, UPC++), and Hybrid MPI + X (MPI + PGAS + OpenMP/Cilk)
• High-performance and scalable communication runtime with diverse APIs and mechanisms: point-to-point primitives, collective algorithms, job startup, energy-awareness, remote memory access, fault tolerance, virtualization, active messages, I/O and file systems, introspection and analysis
• Support for modern networking technologies (InfiniBand, iWARP, RoCE, Omni-Path, Elastic Fabric Adapter) and modern multi-/many-core architectures (Intel Xeon, OpenPOWER, Xeon Phi, ARM, NVIDIA GPGPU)
• Transport protocols: RC, UD, DC, UMR, ODP, SR-IOV, multi-rail; transport mechanisms and modern features: shared memory, CMA, IVSHMEM, XPMEM, NVLink, CAPI*, Optane* (* upcoming)

MVAPICH2 Software Family

Requirements and the corresponding library:
• MPI with IB, iWARP, Omni-Path, and RoCE: MVAPICH2
• Advanced MPI features/support, OSU INAM, PGAS and MPI+PGAS with IB, Omni-Path, and RoCE: MVAPICH2-X
• MPI with IB, RoCE & GPU and support for Deep Learning: MVAPICH2-GDR
• HPC Cloud with MPI & IB: MVAPICH2-Virt
• Energy-aware MPI with IB, iWARP and RoCE: MVAPICH2-EA
• MPI Energy Monitoring Tool: OEMT
• InfiniBand Network Analysis and Monitoring: OSU INAM
• Microbenchmarks for measuring MPI and PGAS performance: OMB

MVAPICH2 Distributions
• MVAPICH2 – Basic MPI support for IB, Omni-Path, iWARP and RoCE
• MVAPICH2-X – Advanced MPI features and support for INAM
– MPI, PGAS and Hybrid MPI+PGAS support for IB
• MVAPICH2-Virt – Optimized for HPC Clouds with IB and SR-IOV virtualization
– Support for OpenStack, Docker, and Singularity
• OSU Micro-Benchmarks (OMB) – MPI (including CUDA-aware MPI), OpenSHMEM and UPC
• OSU INAM – InfiniBand Network Analysis and Monitoring Tool

Startup Performance on TACC Frontera
MPI_Init on Frontera
[Chart: MPI_Init time (milliseconds) vs. number of processes (56 to 57,344), MVAPICH2 vs. Intel MPI 2019; annotated times of 3.9 s and 4.5 s at full scale]
• MPI_Init takes 3.9 seconds on 57,344 processes on 1,024 nodes
• Intel MPI takes 195 seconds for MPI_Init on 229,376 processes on 4,096 nodes, while MVAPICH2 takes 31 seconds
• All numbers reported with 56 processes per node
• New designs available in MVAPICH2-2.3.2
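For context (this is not code from the slides), startup numbers like these can be reproduced with a small timing harness; the sketch below, a hypothetical measure_init.c, times MPI_Init with a monotonic clock, since MPI_Wtime is not usable before initialization:

    /* measure_init.c (hypothetical name): time MPI_Init with a monotonic
     * clock, since MPI_Wtime cannot be used before MPI_Init returns. */
    #include <mpi.h>
    #include <stdio.h>
    #include <time.h>

    int main(int argc, char **argv)
    {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        MPI_Init(&argc, &argv);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double local = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        double max_init = 0.0;
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        /* The slowest rank determines the startup time a job actually sees. */
        MPI_Reduce(&local, &max_init, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("MPI_Init took %.3f s (max across ranks)\n", max_init);

        MPI_Finalize();
        return 0;
    }

Built with mpicc and launched with the site's usual launcher at, for example, 56 processes per node, the maximum across ranks approximates the startup delay a job actually experiences.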

One-way Latency: MPI over IB with MVAPICH2
[Charts: small-message latency (roughly 1.0-1.2 us) and large-message latency (up to ~120 us at the largest sizes shown) vs. message size for TrueScale-QDR, ConnectX-3-FDR, ConnectIB-DualFDR, ConnectX-4-EDR, Omni-Path, and ConnectX-6 HDR]
TrueScale-QDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch
ConnectX-3-FDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
ConnectIB-Dual FDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch
ConnectX-4-EDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch
Omni-Path - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with Omni-Path switch
ConnectX-6-HDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch

Bandwidth: MPI over IB with MVAPICH2
[Charts: unidirectional bandwidth (up to about 24,500 MB/s) and bidirectional bandwidth (up to about 48,000 MB/s) vs. message size for the same set of adapters; same test beds as the latency measurements]
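These curves are produced by the osu_latency and osu_bw benchmarks from OMB. As a rough illustration of what osu_latency measures, here is a stripped-down ping-pong sketch; the iteration counts and the fixed 8-byte message size are illustrative choices, not the OMB defaults, and the program needs two ranks:

    /* Stripped-down ping-pong latency measurement between ranks 0 and 1. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        enum { SIZE = 8, ITERS = 1000, SKIP = 100 };  /* illustrative values */
        char buf[SIZE];
        int rank;
        double start = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int i = 0; i < ITERS + SKIP; i++) {
            if (i == SKIP) start = MPI_Wtime();       /* skip warm-up iterations */
            if (rank == 0) {
                MPI_Send(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        if (rank == 0) {
            double total = MPI_Wtime() - start;
            /* One-way latency is half the average round-trip time. */
            printf("%d bytes: %.2f us\n", SIZE, total * 1e6 / (2.0 * ITERS));
        }
        MPI_Finalize();
        return 0;
    }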

Intra-node Point-to-Point Performance on OpenPOWER
[Charts: MVAPICH2-X 2.3.2 intra-socket small-message latency (about 0.2-0.25 us), intra-socket large-message latency, intra-socket bandwidth, and intra-socket bi-directional bandwidth (up to roughly 35,000 MB/s), for message sizes from 4 bytes to 2 MB]
Platform: two nodes of OpenPOWER (Power9-ppc64le) CPU using Mellanox EDR (MT4121) HCA

Point-to-point: Latency & Bandwidth (Inter-socket) on ARM
[Charts: MVAPICH2-X vs. OpenMPI+UCX for small-, medium-, and large-message latency (annotated as 8.3x and 3.5x better for MVAPICH2-X) and for small-, medium-, and large-message bandwidth (annotated as 5x better for small messages)]

MPI_Allreduce on KNL + Omni-Path (10,240 Processes)
[Charts: MPI_Allreduce latency (us) vs. message size (4 bytes to 256 KB) for MVAPICH2, MVAPICH2-OPT, and IMPI; OSU Micro Benchmark, 64 PPN]
• For MPI_Allreduce latency with 32K bytes, MVAPICH2-OPT can reduce the latency by 2.4X
• Available since MVAPICH2-X 2.3b
M. Bayatpour, S. Chakraborty, H. Subramoni, X. Lu, and D. K. Panda, Scalable Reduction Collectives with Data Partitioning-based Multi-Leader Design, SuperComputing '17

Evaluation of SHArP-based Non-Blocking Allreduce
MPI_Iallreduce Benchmark, 1 PPN*, 8 nodes
[Charts: communication-computation overlap and pure communication latency (us) vs. message size (4-128 bytes), MVAPICH2 vs. MVAPICH2-SHArP; 2.3x benefit annotated for the SHArP design]
• Complete offload of the Allreduce collective operation to the switch helps achieve much higher overlap of communication and computation
• Available since MVAPICH2 2.3a
*PPN: Processes Per Node
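A minimal sketch of the overlap pattern the MPI_Iallreduce benchmark above exercises; do_independent_compute stands in for application work, and enabling the switch (SHArP) offload itself is a runtime configuration detail of the library that is not shown here:

    /* Overlap an Allreduce with independent work using MPI_Iallreduce. */
    #include <mpi.h>
    #include <stdio.h>

    static void do_independent_compute(void)
    {
        /* placeholder for application work that does not need the result */
    }

    int main(int argc, char **argv)
    {
        double local[128], global[128];
        MPI_Request req;
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        for (int i = 0; i < 128; i++) local[i] = rank + i;

        /* Start the reduction; with switch offload the network can progress
         * it while the host computes. */
        MPI_Iallreduce(local, global, 128, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);
        do_independent_compute();
        MPI_Wait(&req, MPI_STATUS_IGNORE);

        if (rank == 0) printf("global[0] = %f\n", global[0]);
        MPI_Finalize();
        return 0;
    }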

Performance Engineering Applications using MVAPICH2 and TAU
● Enhance existing support for MPI_T in MVAPICH2 to expose a richer set of performance and control variables
● Get and display MPI Performance Variables (PVARs) made available by the runtime in TAU
● Control the runtime's behavior via MPI Control Variables (CVARs)
● Introduced support for new MPI_T based CVARs to MVAPICH2
○ MPIR_CVAR_MAX_INLINE_MSG_SZ, MPIR_CVAR_VBUF_POOL_SIZE, MPIR_CVAR_VBUF_SECONDARY_POOL_SIZE
● TAU enhanced with support for setting MPI_T CVARs in a non-interactive mode for uninstrumented applications
[Figure: VBUF usage without CVAR-based tuning vs. VBUF usage with CVAR-based tuning, as displayed by ParaProf]
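For illustration, an application (or a tool such as TAU) can reach these CVARs through the standard MPI_T interface; the sketch below looks up MPIR_CVAR_VBUF_POOL_SIZE by name and writes a value before MPI_Init. Whether this particular CVAR is writable at that point, and the value 1024 used here, are assumptions of the sketch rather than documented MVAPICH2 behavior:

    /* Locate a control variable by name through the MPI_T interface and
     * set it before MPI_Init (writability at this point is assumed). */
    #include <mpi.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        int provided, num_cvars;
        MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
        MPI_T_cvar_get_num(&num_cvars);

        for (int i = 0; i < num_cvars; i++) {
            char name[256], desc[256];
            int name_len = sizeof(name), desc_len = sizeof(desc);
            int verbosity, bind, scope;
            MPI_Datatype dtype;
            MPI_T_enum enumtype;

            MPI_T_cvar_get_info(i, name, &name_len, &verbosity, &dtype,
                                &enumtype, desc, &desc_len, &bind, &scope);
            if (strcmp(name, "MPIR_CVAR_VBUF_POOL_SIZE") == 0) {
                MPI_T_cvar_handle handle;
                int count, value = 1024;            /* illustrative value */
                MPI_T_cvar_handle_alloc(i, NULL, &handle, &count);
                MPI_T_cvar_write(handle, &value);   /* request a larger VBUF pool */
                MPI_T_cvar_handle_free(&handle);
                break;
            }
        }

        MPI_Init(&argc, &argv);
        /* ... application ... */
        MPI_Finalize();
        MPI_T_finalize();
        return 0;
    }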

Dynamic and Adaptive Tag Matching
• Challenge: tag matching is a significant overhead for receivers; existing solutions are static, do not adapt dynamically to the communication pattern, and do not consider memory overhead
• Solution: a new tag matching design that dynamically adapts to communication patterns, uses different strategies for different ranks, and bases its decisions on the number of request objects that must be traversed before hitting the required one
• Results: better performance than other state-of-the-art tag-matching schemes with minimum memory consumption; will be available in future MVAPICH2 releases
[Charts: normalized total tag matching time at 512 processes (normalized to default, lower is better) and normalized memory overhead per process at 512 processes (compared to default, lower is better)]
Adaptive and Dynamic Design for MPI Tag Matching; M. Bayatpour, H. Subramoni, S. Chakraborty, and D. K. Panda; IEEE Cluster 2016 [Best Paper Nominee]

Dynamic and Adaptive MPI Point-to-point Communication Protocols
Desired eager threshold for an example communication pattern (process pairs across nodes 1 and 2):
• Pair 0-4: 32 KB
• Pair 1-5: 64 KB
• Pair 2-6: 128 KB
• Pair 3-7: 32 KB
Eager threshold chosen by different designs: Default uses 16 KB for every pair, a manually tuned setting uses 128 KB for every pair, and the Dynamic + Adaptive design picks 32/64/128/32 KB per pair.
• Default: poor overlap and low memory requirement; low performance, high productivity
• Manually Tuned: good overlap but high memory requirement; high performance, low productivity
• Dynamic + Adaptive: good overlap with optimal memory requirement; high performance, high productivity
[Charts: execution time of Amber (seconds) and relative memory consumption of Amber at 128-1K processes for Default, Threshold=17K, Threshold=64K, Threshold=128K, and Dynamic Threshold]
H. Subramoni, S. Chakraborty, D. K. Panda, Designing Dynamic and Adaptive MPI Point-to-Point Communication Protocols for Efficient Overlap of Computation and Communication, ISC '17 - Best Paper

MVAPICH2 Distributions
• MVAPICH2, MVAPICH2-X, MVAPICH2-Virt, OSU Micro-Benchmarks (OMB), and OSU INAM (as listed earlier)

MVAPICH2-X for Hybrid MPI + PGAS Applications
• High-performance parallel programming models: MPI (Message Passing Interface), PGAS (UPC, OpenSHMEM, CAF, UPC++), and Hybrid MPI + X (MPI + PGAS + OpenMP/Cilk)

• High-performance and scalable unified communication runtime with diverse APIs and mechanisms: optimized point-to-point primitives, collective algorithms (blocking and non-blocking), remote memory access, active messages, scalable job startup, fault tolerance, and introspection & analysis with OSU INAM
• Support for efficient intra-node communication (POSIX SHMEM, CMA, LiMIC, XPMEM, ...), modern networking technologies (InfiniBand, iWARP, RoCE, Omni-Path, ...), and modern multi-/many-core architectures (Intel Xeon, GPGPUs, OpenPOWER, ARM, ...)
• Current model: separate runtimes for OpenSHMEM/UPC/UPC++/CAF and MPI
– Deadlock is possible if both runtimes are not progressed
– Consumes more network resources
• Unified communication runtime for MPI, UPC, UPC++, OpenSHMEM, CAF
– Available since 2012 (starting with MVAPICH2-X 1.9)
– http://mvapich.cse.ohio-state.edu

Optimized CMA-based Collectives for Large Messages

[Charts: MPI_Gather latency (us) vs. message size on KNL nodes with 64 PPN at 2 nodes/128 processes, 4 nodes/256 processes, and 8 nodes/512 processes; MVAPICH2-2.3a vs. Intel MPI 2017, annotated as roughly 2.5x, 3.2x, and 4x better (up to about 17x at some sizes)]
• Significant improvement over the existing implementation for Scatter/Gather with 1 MB messages (up to 4x on KNL, 2x on Broadwell, 14x on OpenPOWER)
• New two-level algorithms for better scalability
• Improved performance for other collectives (Bcast, Allgather, and Alltoall)
• Available since MVAPICH2-X 2.3b
S. Chakraborty, H. Subramoni, and D. K. Panda, Contention Aware Kernel-Assisted MPI Collectives for Multi/Many-core Systems, IEEE Cluster '17, Best Paper Finalist

Shared Address Space (XPMEM)-based Collectives Design

[Charts: OSU_Allreduce and OSU_Reduce latency (us) on Broadwell with 256 processes for MVAPICH2-2.3b, IMPI-2017v1.132, and the proposed XPMEM-based design; annotated values of 73.2, 37.9, 36.1, and 16.8 us, with 1.8X (Allreduce) and 4X (Reduce) improvements]
• "Shared address space"-based true zero-copy reduction collective designs in MVAPICH2
• Offloaded computation/communication to peer ranks in the reduction collective operation
• Up to 4X improvement for 4 MB Reduce and up to 1.8X improvement for 4 MB Allreduce
• Available in MVAPICH2-X 2.3rc1
J. Hashmi, S. Chakraborty, M. Bayatpour, H. Subramoni, and D. Panda, Designing Efficient Shared Address Space Reduction Collectives for Multi-/Many-cores, International Parallel & Distributed Processing Symposium (IPDPS '18), May 2018

Connect (DC) Transport o P P e N • Full Feature Set (RDMA, Atomics etc) • Separate objects for send (DC Initiator) and Node IB Node receive (DC Target) P63 P21 Network – DC Target identified by “DCT Number” P7 P3 – Messages routed with (DCT Number, LID) – Requires same “DC Key” to enable communication d e 2 5 4

o P P e m

N • Available sincei MVAPICH2-X 2.2a ) T

B

n NAMD - Apoa1: Large data set K Memory Footprint for Alltoall ( 1.2 o

500 i RC DC-Pool UD XRC t y RC DC-Pool UD XRC 1

r 97 u

o 47 c 0.8

50 22 e m

x 0.6

e 10 10 10 10 10 E

5 0.4 M d 5 3 0.2

2 e n z o i 0 i l t

a 160 320 620 c 0.5

e Number of Processes 1 1 1 1 1 m r n 80 160 320 640 o n Number of Processes o N

C H. Subramoni, K. Hamidouche, A. Venkatesh, S. Chakraborty and D. K. Panda, Designing MPI Library with Dynamic ConnectedNetwork BasedTransport Computing (DCT) of InfiniBand : Early Experiences. IEEE International Supercomputing Conference (ISC ’14)Laboratory OSU Booth SC’19 25 User-mode Memory Registration (UMR) • Introduced by Mellanox to support direct local and remote noncontiguous memory access • Avoid packing at sender and unpacking at receiver )

• Available since MVAPICH2-X 2.2b
[Charts: small & medium message latency and large message latency (us) with and without UMR-based datatype support, for message sizes up to 16 MB]
Connect-IB (54 Gbps): 2.8 GHz dual ten-core (IvyBridge) Intel PCI Gen3 with Mellanox IB FDR switch
M. Li, H. Subramoni, K. Hamidouche, X. Lu and D. K. Panda, "High Performance MPI Datatype Support with User-mode Memory Registration: Challenges, Designs and Benefits", CLUSTER 2015
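The datatype path that UMR accelerates is the ordinary MPI derived-datatype transfer; the sketch below sends one strided column of a matrix with no manual packing, which is the kind of noncontiguous access the numbers above refer to (the matrix dimensions are illustrative, and the program needs two ranks):

    /* Send one strided column of a matrix with an MPI derived datatype;
     * no manual packing or unpacking is needed (requires two ranks). */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        enum { ROWS = 4, COLS = 1024 };             /* illustrative dimensions */
        static double matrix[ROWS][COLS];
        MPI_Datatype column;
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* One column: ROWS blocks of one double, separated by a COLS stride. */
        MPI_Type_vector(ROWS, 1, COLS, MPI_DOUBLE, &column);
        MPI_Type_commit(&column);

        if (rank == 0)
            MPI_Send(&matrix[0][0], 1, column, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(&matrix[0][0], 1, column, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);

        MPI_Type_free(&column);
        MPI_Finalize();
        return 0;
    }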

On-Demand Paging (ODP)
• Applications no longer need to pin down the underlying physical pages
• Memory Regions (MRs) are never pinned by the OS
– Paged in by the HCA when needed
– Paged out by the OS when reclaimed
• ODP can be divided into two classes
– Explicit ODP: applications still register memory buffers for communication, but this operation is used to define access control for I/O rather than to pin down the pages
– Implicit ODP: applications are provided with a special memory key that represents their complete address space, so there is no need to register any virtual address range
• Advantages
– Simplifies programming
– Unlimited MR sizes
– Physical memory optimization
[Chart: execution time (s) of applications with 64 processes, pin-down vs. ODP]
M. Li, K. Hamidouche, X. Lu, H. Subramoni, J. Zhang, and D. K. Panda, "Designing MPI Library with On-Demand Paging (ODP) of InfiniBand: Challenges and Benefits", SC 2016. Available since MVAPICH2-X 2.3b
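At the verbs level, explicit ODP is requested with an extra access flag at registration time; the MPI library does this internally, but a standalone sketch (assuming a device and driver that support ODP, and omitting most error handling) looks roughly like this:

    /* Register a buffer with explicit On-Demand Paging at the verbs level:
     * IBV_ACCESS_ON_DEMAND asks the HCA to fault pages in as needed
     * instead of pinning them up front. */
    #include <infiniband/verbs.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        int num;
        struct ibv_device **devs = ibv_get_device_list(&num);
        if (!devs || num == 0) { fprintf(stderr, "no RDMA devices\n"); return 1; }

        struct ibv_context *ctx = ibv_open_device(devs[0]);
        struct ibv_pd *pd = ibv_alloc_pd(ctx);

        size_t len = 1UL << 30;                     /* 1 GiB, never pinned up front */
        void *buf = malloc(len);
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_READ |
                                       IBV_ACCESS_ON_DEMAND);
        if (!mr)
            perror("ibv_reg_mr(ODP)");              /* device may not support ODP */
        else
            ibv_dereg_mr(mr);

        free(buf);
        ibv_dealloc_pd(pd);
        ibv_close_device(ctx);
        ibv_free_device_list(devs);
        return 0;
    }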

Impact of Zero-Copy MPI Message Passing using HW Tag Matching (Point-to-point)
[Charts: osu_latency for the eager and rendezvous protocols, MVAPICH2 vs. MVAPICH2+HW-TM, for message sizes up to 4 MB]
• Removal of intermediate buffering/copies can lead to up to 35% performance improvement in the latency of medium messages on TACC Frontera

Benefits of the New Asynchronous Progress Design: Broadwell + InfiniBand
[Charts: P3DFFT time per loop (seconds) at 224/448/896 processes with PPN=28, and normalized High Performance Linpack (HPL) performance (GFLOPS) at 56/112/224/448 processes with 28 PPN; MVAPICH2-X Async vs. MVAPICH2-X Default vs. Intel MPI 18.1.163; annotations of 44% and 33%, and memory consumption = 69%]
• Up to 44% performance improvement in the P3DFFT application with 448 processes
• Up to 19% and 9% performance improvement in the HPL application with 448 and 896 processes
• Available in MVAPICH2-X 2.3rc1
A. Ruhela, H. Subramoni, S. Chakraborty, M. Bayatpour, P. Kousha, and D. K. Panda, Efficient Asynchronous Communication Progress for MPI without Dedicated Resources, EuroMPI 2018
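The pattern that benefits from asynchronous progress is the familiar post-compute-wait sequence; in the sketch below the library can advance the pending halo exchange while compute_interior runs, without the application calling into MPI (the buffer size, the ring-neighbor exchange, and the function names are illustrative):

    /* Post a halo exchange, compute, then complete: the sequence that
     * benefits when the library progresses communication asynchronously. */
    #include <mpi.h>
    #include <stdlib.h>

    static void compute_interior(double *u, int n) { (void)u; (void)n; /* local work */ }

    int main(int argc, char **argv)
    {
        const int n = 1 << 20;                      /* illustrative halo size */
        double *halo_out = malloc(n * sizeof(double));
        double *halo_in  = malloc(n * sizeof(double));
        MPI_Request reqs[2];
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        int right = (rank + 1) % size, left = (rank - 1 + size) % size;

        MPI_Irecv(halo_in,  n, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(halo_out, n, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);
        compute_interior(halo_out, n);              /* overlapped with the exchange */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

        free(halo_out);
        free(halo_in);
        MPI_Finalize();
        return 0;
    }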

Evaluation of Applications on Frontera (Cascade Lake + HDR100)
[Charts: execution time per step of MIMD Lattice Computation (MILC) at 13,824-69,984 processes with PPN=54, and of WRF2 at 7,168-28,672 processes with PPN=56]
• Performance of the MILC and WRF2 applications scales well with increasing system size

Hybrid (MPI+PGAS) Programming
• Application sub-kernels can be re-written in MPI or PGAS based on their communication characteristics (e.g., an HPC application with Kernel 1 in MPI, Kernel 2 in PGAS or MPI, ..., Kernel N in PGAS or MPI), as sketched in the example after this list
• Benefits:
– Best of the distributed computing model (MPI)
– Best of the shared memory computing model (PGAS)
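A minimal sketch of the hybrid style MVAPICH2-X enables, mixing MPI and OpenSHMEM calls in one executable over a unified runtime; the kernel bodies are placeholders, and the initialization order shown is one common pattern rather than the only supported one (consult the MVAPICH2-X user guide):

    /* Hybrid MPI + OpenSHMEM in one executable over a unified runtime. */
    #include <mpi.h>
    #include <shmem.h>
    #include <stdio.h>

    static long counter = 0;                        /* symmetric (global) variable */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        shmem_init();

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Kernel 1: bulk-synchronous exchange expressed in MPI. */
        int local = rank, sum = 0;
        MPI_Allreduce(&local, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

        /* Kernel 2: one-sided update expressed in OpenSHMEM - write my PE
         * number into the symmetric counter on my right neighbor. */
        int me = shmem_my_pe(), npes = shmem_n_pes();
        shmem_long_p(&counter, (long)me, (me + 1) % npes);
        shmem_barrier_all();

        if (rank == 0)
            printf("sum=%d counter=%ld\n", sum, counter);

        shmem_finalize();
        MPI_Finalize();
        return 0;
    }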

Application Level Performance with Graph500 and Sort
[Charts: Graph500 execution time (seconds) for MPI-Simple, MPI-CSC, MPI-CSR, and Hybrid (MPI+OpenSHMEM) at 4K/8K/16K processes, and Sort execution time (seconds) for MPI vs. Hybrid with input data and process counts of 500GB-512, 1TB-1K, 2TB-2K, and 4TB-4K; annotations of 7.6X, 13X, and 51%]
• Performance of the Hybrid (MPI+OpenSHMEM) Graph500 design
– 8,192 processes: 2.4X improvement over MPI-CSR, 7.6X improvement over MPI-Simple
– 16,384 processes: 1.5X improvement over MPI-CSR, 13X improvement over MPI-Simple
• Performance of the Hybrid (MPI+OpenSHMEM) Sort application
– 4,096 processes, 4 TB input size: MPI takes 2408 sec (0.16 TB/min) and Hybrid takes 1172 sec (0.36 TB/min), a 51% improvement over the MPI design
J. Jose, K. Kandalla, S. Potluri, J. Zhang and D. K. Panda, Optimizing Collective Communication in OpenSHMEM, Int'l Conference on Partitioned Global Address Space Programming Models (PGAS '13), October 2013
J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models, International Supercomputing Conference (ISC '13), June 2013
J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012

MVAPICH2 Distributions
• MVAPICH2, MVAPICH2-X, MVAPICH2-Virt, OSU Micro-Benchmarks (OMB), and OSU INAM (as listed earlier)

Can HPC and Virtualization be Combined?
• Virtualization has many benefits
– Fault-tolerance
– Job migration
– Compaction
• Virtualization has not been very popular in HPC due to the overhead associated with it
• New SR-IOV (Single Root I/O Virtualization) support available with Mellanox InfiniBand adapters changes the field
• Enhanced MVAPICH2 support for SR-IOV
• MVAPICH2-Virt 2.2 supports OpenStack, Docker, and Singularity
J. Zhang, X. Lu, J. Jose, R. Shi and D. K. Panda, Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters? EuroPar '14
J. Zhang, X. Lu, J. Jose, M. Li, R. Shi and D. K. Panda, High Performance MPI Library over SR-IOV enabled InfiniBand Clusters, HiPC '14
J. Zhang, X. Lu, M. Arnold and D. K. Panda, MVAPICH2 Over OpenStack with SR-IOV: an Efficient Approach to build HPC Clouds, CCGrid '15

Application-Level Performance on Chameleon

[Charts: Graph500 execution time (ms) for problem sizes (scale, edgefactor) of (22,20) through (26,16), and SPEC MPI2007 execution time (s) for milc, leslie3d, pop2, GAPgeofem, zeusmp2, and lu; MV2-SR-IOV-Def vs. MV2-SR-IOV-Opt vs. MV2-Native]
• 32 VMs, 6 cores/VM
• Compared to native, 2-5% overhead for Graph500 with 128 processes
• Compared to native, 1-9.5% overhead for SPEC MPI2007 with 128 processes

Performance of Radix on Microsoft Azure

[Charts: total execution time (seconds) of Radix on HC and HB instances (lower is better), MVAPICH2-X vs. HPCx, at 60 (1x60), 120 (2x60), and 240 (4x60) processes (nodes x PPN); MVAPICH2-X annotated as up to 3x and 38% faster]

Amazon Elastic Fabric Adapter (EFA)

• Enhanced version of the Elastic Network Adapter (ENA)
• Allows OS bypass, up to 100 Gbps bandwidth
• New QP type: Scalable Reliable Datagram (SRD)
• Network-aware multi-path routing - low tail latency
• Guaranteed delivery, no ordering guarantee
• Exposed through verbs and libfabric interfaces
Feature comparison (UD / SRD):
• Send/Recv: ✔ / ✔
• Send w/ Immediate: ✖ / ✖
• RDMA Read/Write/Atomic: ✖ / ✖
• Scatter Gather Lists: ✔ / ✔
• Shared Receive Queue: ✖ / ✖
• Reliable Delivery: ✖ / ✔
• Ordering: ✖ / ✖
• Inline Sends: ✖ / ✖
• Global Routing Header: ✔ / ✖
• MTU Size: 4 KB / 8 KB
[Charts: verbs-level latency (UD 16.95 us, SRD 15.69 us) and unidirectional message rate (about 1.8-2.0 million messages/s)]

MPI-level Performance with SRD

[Charts: latency (us) of Scatterv, Gatherv, and Allreduce at 8 nodes, 36 PPN for MV2X-UD, MV2X-SRD, and OpenMPI; execution time (seconds) of the miniGhost (MV2X about 10% better than OpenMPI) and CloverLeaf (MV2X-SRD about 27.5% better) application kernels]
• Instance type: c5n.18xlarge; CPU: Intel Xeon Platinum 8124M @ 3.00 GHz
• MVAPICH2 version: MVAPICH2-X 2.3rc2 + SRD support; OpenMPI version: Open MPI v3.1.3 with libfabric 1.7
S. Chakraborty, S. Xu, H. Subramoni, and D. K. Panda, Designing Scalable and High-performance MPI Libraries on Amazon Elastic Fabric Adapter, to be presented at the 26th Symposium on High Performance Interconnects (HOTI '19)

MVAPICH2 Distributions
• MVAPICH2, MVAPICH2-X, MVAPICH2-Virt, OSU Micro-Benchmarks (OMB), and OSU INAM (as listed earlier)

OSU Microbenchmarks
• Available since 2004
• Suite of microbenchmarks to study the communication performance of various programming models
• Benchmarks available for the following programming models
– Message Passing Interface (MPI)
– Partitioned Global Address Space (PGAS)
• Unified Parallel C (UPC)
• Unified Parallel C++ (UPC++)
• OpenSHMEM
• Benchmarks available for multiple accelerator-based architectures
– Compute Unified Device Architecture (CUDA)
– OpenACC Application Program Interface
• Part of various national resource procurement suites like NERSC-8 / Trinity Benchmarks
• Please visit the following link for more information
– http://mvapich.cse.ohio-state.edu/benchmarks/

MVAPICH2 Distributions
• MVAPICH2, MVAPICH2-X, MVAPICH2-Virt, OSU Micro-Benchmarks (OMB), and OSU INAM (as listed earlier)

Overview of OSU INAM

• A network monitoring and analysis tool that is capable of analyzing traffic on the InfiniBand network with inputs from the MPI runtime
– http://mvapich.cse.ohio-state.edu/tools/osu-inam/
• Monitors IB clusters in real time by querying various subnet management entities and gathering input from the MPI runtimes
• Capability to analyze and profile node-level, job-level and process-level activities for MPI communication
– Point-to-Point, Collectives and RMA
• Ability to filter data based on type of counters using a "drop down" list
• Remotely monitor various metrics of MPI processes at user-specified granularity
• "Job Page" to display jobs in ascending/descending order of various performance metrics in conjunction with MVAPICH2-X
• Visualize the data transfer happening in a "live" or "historical" fashion for the entire network, a job, or a set of nodes
• OSU INAM 0.9.4 released on 11/10/2018
– Enhanced performance for fabric discovery using optimized OpenMP-based multi-threaded designs
– Ability to gather InfiniBand performance counters at sub-second granularity for very large (>2,000 nodes) clusters
– Redesigned database layout to reduce database size
– Enhanced fault tolerance for database operations
• Thanks to Trey Dockendorf @ OSC for the feedback
– OpenMP-based multi-threaded designs to handle database purge, read, and insert operations simultaneously
– Improved database purging time by using bulk deletes
– Tuned database timeouts to handle very long database operations
– Improved debugging support by introducing several debugging levels

OSU INAM Features

[Figures: clustered view of Comet@SDSC (1,879 nodes, 212 switches, 4,377 network links) and finding routes between nodes]
• Show network topology of large clusters
• Visualize traffic pattern on different links
• Quickly identify congested links / links in an error state
• See the history unfold – play back historical state of the network

OSU INAM Features (Cont.)

[Figures: visualizing a job (5 nodes) and estimated process-level link utilization]
• Job level view
– Show different network metrics (load, error, etc.) for any live job
– Play back historical data for completed jobs to identify bottlenecks
• Node level view - details per process or per node
– CPU utilization for each rank/node
– Bytes sent/received for MPI operations (pt-to-pt, collective, RMA)
– Network metrics (e.g. XmitDiscard, RcvError) per rank/node
• Estimated Link Utilization view
– Classify data flowing over a network link at different granularity in conjunction with MVAPICH2-X 2.2rc1
– Job level and process level

Applications-Level Tuning: Compilation of Best Practices
• MPI runtime has many parameters
• Tuning a set of parameters can help you extract higher performance
• Compiled a list of such contributions through the MVAPICH website
– http://mvapich.cse.ohio-state.edu/best_practices/
• Initial list of applications
– Amber
– HoomDBlue
– HPCG
– Lulesh
– MILC
– Neuron
– SMG2000
– Cloverleaf
– SPEC (LAMMPS, POP2, TERA_TF, WRF2)
• Soliciting additional contributions; send your results to mvapich-help at cse.ohio-state.edu
• We will link these results with credits to you

Commercial Support for MVAPICH2 Libraries
• Supported through X-ScaleSolutions (http://x-scalesolutions.com)
• Benefits:
– Help and guidance with installation of the library
– Platform-specific optimizations and tuning
– Timely support for operational issues encountered with the library
– Web portal interface to submit issues and track their progress
– Advanced debugging techniques
– Application-specific optimizations and tuning
– Obtaining guidelines on best practices
– Periodic information on major fixes and updates
– Information on major releases
– Help with upgrading to the latest release
– Flexible Service Level Agreements
• Support provided to Lawrence Livermore National Laboratory (LLNL) during the last two years

MVAPICH2 – Plans for Exascale
• Performance and memory scalability toward 1M-10M cores
• Hybrid programming (MPI + OpenSHMEM, MPI + UPC, MPI + CAF, ...)
– MPI + Task*
• Enhanced optimization for GPUs and FPGAs*
• Taking advantage of advanced features of Mellanox InfiniBand
– Tag Matching*
– Adapter Memory*
• Enhanced communication schemes for upcoming architectures
– NVLINK*
– CAPI*
• Extended topology-aware collectives
• Extended energy-aware designs and virtualization support
• Extended support for the MPI Tools Interface (as in MPI 3.0)
• Extended FT support
• Support for * features will be available in future MVAPICH2 releases

Join us for Multiple Events at SC '19

• Presentations at the OSU and X-Scale booth (#2094)
– Members of the MVAPICH, HiBD and HiDL projects
– External speakers
• Presentations at the SC main program (tutorials, workshops, BoFs, posters, and doctoral showcase)
• Presentations at many other booths (Mellanox, Intel, Microsoft, and AWS) and satellite events
• Complete details available at http://mvapich.cse.ohio-state.edu/conference/752/talks/

Thank You!
[email protected]

Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/

The High-Performance MPI/PGAS Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/
The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/