The MVAPICH2 Project: Latest Developments and Plans Towards Exascale Computing

Presentation at Mellanox Theatre (SC ‘18) by

Dhabaleswar K. (DK) Panda, The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda

Drivers of Modern HPC Cluster Architectures

[Figure: building blocks of modern HPC clusters – high performance interconnects (InfiniBand, <1usec latency, 100Gbps bandwidth), multi-core processors (high compute density, high performance/watt, >1 TFlop DP on a chip), accelerators/coprocessors, and SSD/NVMe-SSD/NVRAM storage]
• Multi-core/many-core technologies
• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
• Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD
• Accelerators (NVIDIA GPGPUs and Intel )
• Available on HPC Clouds, e.g., Amazon EC2, NSF Chameleon, Microsoft Azure, etc.

[Images: Summit, Sunway TaihuLight, Sierra, K]

Parallel Programming Models Overview

[Diagram: processes P1-P3 under three abstract machine models – a single shared memory, per-process memories, and a partitioned logical shared memory]
• Shared Memory Model (e.g. SHMEM, DSM)
• Distributed Memory Model (e.g. MPI – Message Passing Interface)
• Partitioned Global Address Space (PGAS) – UPC, Chapel, X10, CAF, …

• Programming models provide abstract machine models
• Models can be mapped on different types of systems – e.g. Distributed Shared Memory (DSM), MPI within a node, etc.
• PGAS models and Hybrid MPI+PGAS models are gradually receiving importance

Designing Communication Libraries for Multi-Petaflop and Exaflop Systems: Challenges

Application Kernels/Applications
Programming Models: MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, OpenACC, Hadoop (MapReduce), Spark (RDD, DAG), etc.
Co-Design Opportunities and Challenges across Various Layers
Communication Library or Runtime for Programming Models: Point-to-point Communication, Collective Communication, Energy-Awareness, Synchronization and Locks, I/O and File Systems, Fault Tolerance
Performance, Fault-Resilience
Networking Technologies (InfiniBand, 40/100GigE, Aries, and Omni-Path), Multi/Many-core Architectures, Accelerators (GPU and FPGA)

MPI+X Programming model: Broad Challenges at Exascale

• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
  – Scalable job start-up
• Scalable Collective communication
  – Offload
  – Non-blocking
  – Topology-aware
• Balancing intra-node and inter-node communication for next generation nodes (128-1024 cores)
  – Multiple end-points per node
• Support for efficient multi-threading
• Integrated Support for GPGPUs and FPGAs
• Fault-tolerance/resiliency
• QoS support for communication and I/O
• Support for Hybrid MPI+PGAS programming (MPI + OpenMP, MPI + UPC, MPI+UPC++, MPI + OpenSHMEM, CAF, …) – see the sketch after this list
• Virtualization
• Energy-Awareness
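To make the MPI+X model above concrete, here is a minimal MPI+OpenMP hybrid sketch (an illustration, not code from the slides or the MVAPICH2 distribution): each rank requests funneled thread support and offloads local work to OpenMP threads before a final MPI reduction.

```c
/* Minimal MPI+OpenMP hybrid sketch (illustrative only). */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, size;

    /* MPI_THREAD_FUNNELED: only the main thread makes MPI calls */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local = 0.0;
    #pragma omp parallel reduction(+:local)
    {
        /* placeholder per-thread work */
        local += 1.0;
    }

    double global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("ranks=%d, summed thread contributions=%g\n", size, global);

    MPI_Finalize();
    return 0;
}
```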

Network Based Computing Laboratory Mellanox Theatre SC’18 5 Overview of the MVAPICH2 Project

• High Performance open-source MPI Library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
  – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1), started in 2001, first version available in 2002
  – MVAPICH2-X (MPI + PGAS), available since 2011
  – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
  – Support for Virtualization (MVAPICH2-Virt), available since 2015
  – Support for Energy-Awareness (MVAPICH2-EA), available since 2015
  – Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
  – Used by more than 2,950 organizations in 86 countries
  – More than 505,000 (> 0.5 million) downloads from the OSU site directly
  – Empowering many TOP500 clusters (Jul ‘18 ranking)
    • 2nd ranked 10,649,640-core cluster (Sunway TaihuLight) at NSC, Wuxi, China
    • 12th, 556,104 cores (Oakforest-PACS) in Japan
    • 15th, 367,024 cores (Stampede2) at TACC
    • 24th, 241,108-core (Pleiades) at NASA, and many others
  – Available with software stacks of many vendors and Distros (RedHat, SuSE, and OpenHPC)
  – http://mvapich.cse.ohio-state.edu
• Partner in the upcoming TACC Frontera System
• Empowering Top500 systems for over a decade

MVAPICH2 Release Timeline and Downloads

[Chart: cumulative number of downloads of the MVAPICH libraries from Sep-04 to Nov-18 (approaching 600,000), annotated with release milestones from MV 0.9.4 and MV2 0.9.0 through MV2 2.3, plus MV2-X, MV2-GDR, MV2-Virt, MV2-MIC, and OSU INAM releases; x-axis: Timeline, y-axis: Number of Downloads]

Architecture of MVAPICH2 Software Family

High Performance Parallel Programming Models
• Message Passing Interface (MPI)
• PGAS (UPC, OpenSHMEM, CAF, UPC++)
• Hybrid --- MPI + X (MPI + PGAS + OpenMP/Cilk)

High Performance and Scalable Communication Runtime with Diverse APIs and Mechanisms
• Point-to-point Primitives, Collectives Algorithms, Remote Memory Access, Energy-Awareness, I/O and File Systems, Fault Tolerance, Active Messages, Introspection & Analysis, Job Startup, Virtualization

Support for Modern Networking Technology (InfiniBand, iWARP, RoCE, Omni-Path) and Modern Multi-/Many-core Architectures (Intel-Xeon, OpenPower, Xeon-Phi, ARM, NVIDIA GPGPU)
• Transport Protocols, Modern Features, and Transport Mechanisms: RC, XRC, UD, DC, UMR, ODP, SR-IOV, Multi Rail, Shared Memory, CMA, IVSHMEM, XPMEM, MCDRAM*, NVLink*, CAPI*

* Upcoming

MVAPICH2 Software Family

Requirements → Library
• MPI with IB, iWARP, Omni-Path, and RoCE → MVAPICH2
• Advanced MPI Features/Support, OSU INAM, PGAS and MPI+PGAS with IB, Omni-Path, and RoCE → MVAPICH2-X
• MPI with IB, RoCE & GPU and Support for Deep Learning → MVAPICH2-GDR
• HPC Cloud with MPI & IB → MVAPICH2-Virt
• Energy-aware MPI with IB, iWARP and RoCE → MVAPICH2-EA
• MPI Energy Monitoring Tool → OEMT
• InfiniBand Network Analysis and Monitoring → OSU INAM
• Microbenchmarks for Measuring MPI and PGAS Performance → OMB

MVAPICH2 Distributions
• MVAPICH2 – Basic MPI support for IB, iWARP and RoCE
• MVAPICH2-X – Advanced MPI features and support for INAM; MPI, PGAS and Hybrid MPI+PGAS support for IB
• MVAPICH2-Virt – Optimized for HPC Clouds with IB and SR-IOV virtualization; support for OpenStack, Docker, and Singularity
• MVAPICH2-EA – Energy Efficient Support for point-to-point and collective operations; compatible with OSU Energy Monitoring Tool (OEMT-0.8)
• OSU Micro-Benchmarks (OMB) – MPI (including CUDA-aware MPI), OpenSHMEM and UPC
• OSU INAM – InfiniBand Network Analysis and Monitoring Tool
• MVAPICH2-GDR and Deep Learning (will be presented on Thursday at 10:30am)

MVAPICH2 2.3-GA
• Released on 07/23/2018
• Major Features and Enhancements
  – Based on MPICH v3.2.1
  – Introduce basic support for executing MPI jobs in Singularity
  – Improve performance for MPI-3 RMA operations
  – Enhancements to mapping strategies and automatic architecture/network detection
    • Improve performance of architecture detection on high core-count systems
    • Enhanced architecture detection for OpenPOWER, Intel Skylake and Cavium ARM (ThunderX) systems
    • New environment variable MV2_THREADS_BINDING_POLICY for multi-threaded MPI and MPI+OpenMP applications
    • Support 'spread', 'bunch', 'scatter', 'linear' and 'compact' placement of threads
    • Warn user if oversubscription of core is detected
    • Enhance MV2_SHOW_CPU_BINDING to enable display of CPU bindings on all nodes
    • Added support for MV2_SHOW_CPU_BINDING to display number of OMP threads
    • Added logic to detect heterogeneous CPU/HFI configurations in PSM-CH3 and PSM2-CH3 channels (thanks to Matias Cabral@Intel for the report)
    • Enhanced HFI selection logic for systems with multiple Omni-Path HFIs
    • Introduce run time parameter MV2_SHOW_HCA_BINDING to show process to HCA bindings
  – Enhancements for Job Startup
    • Improved job startup time for OFA-IB-CH3, PSM-CH3, and PSM2-CH3
    • On-demand connection management for PSM-CH3 and PSM2-CH3 channels
    • Enhance PSM-CH3 and PSM2-CH3 job startup to use non-blocking PMI calls
    • Introduce capability to run MPI jobs across multiple InfiniBand subnets
  – Enhancements to point-to-point operations
    • Enhance performance of point-to-point operations for CH3-Gen2 (InfiniBand), CH3-PSM, and CH3-PSM2 (Omni-Path) channels
    • Improve performance for intra- and inter-node communication for OpenPOWER architecture
    • Enhanced tuning for OpenPOWER, Intel Skylake and Cavium ARM (ThunderX) systems
    • Improve performance for host-based transfers when CUDA is enabled
    • Improve support for large processes per node and hugepages on SMP systems
  – Enhancements to collective operations
    • Enhanced performance for Allreduce, Reduce_scatter_block, Allgather, Allgatherv (thanks to Danielle Sikich and Adam Moody @ LLNL for the patch)
    • Add support for non-blocking Allreduce using Mellanox SHARP
    • Enhance tuning framework for Allreduce using SHArP
    • Enhanced collective tuning for IBM POWER8, IBM POWER9, Intel Skylake, Intel KNL, Intel Broadwell
  – Miscellaneous enhancements and improved debugging and tools support
    • Enhance support for MPI_T PVARs and CVARs
    • Enhance debugging support for PSM-CH3 and PSM2-CH3 channels
    • Update to hwloc version 1.11.9
    • Tested with CLANG v5.0.0

One-way Latency: MPI over IB with MVAPICH2
[Charts: small- and large-message one-way latency (us) vs. message size (bytes) for TrueScale-QDR, ConnectX-3-FDR, ConnectIB-DualFDR, ConnectX-4-EDR, and Omni-Path; small-message latencies range from about 1.01 to 1.19 us]
TrueScale-QDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch
ConnectX-3-FDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
ConnectIB-Dual FDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch
ConnectX-4-EDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch
Omni-Path - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with Omni-Path switch

Bandwidth: MPI over IB with MVAPICH2
[Charts: unidirectional bandwidth (up to 12,590 MBytes/sec) and bidirectional bandwidth (up to 24,136 MBytes/sec) vs. message size (bytes) for the same five interconnects and platform configurations]

Startup Performance on KNL + Omni-Path

[Charts: MPI_Init and Hello World time (seconds) vs. number of processes (64 to 230K) on TACC Stampede2 and Oakforest-PACS; annotated values of 22s (MPI_Init) and 57s at full scale on Stampede2, and 5.8s (MPI_Init) and 21s (Hello World) at 64K processes on Oakforest-PACS]

• MPI_Init takes 22 seconds on 231,936 processes on 3,624 KNL nodes (Stampede2 – full scale)
• At 64K processes, MPI_Init and Hello World take 5.8s and 21s respectively (Oakforest-PACS)
• All numbers reported with 64 processes per node, MVAPICH2-2.3a
• Designs integrated with mpirun_rsh, available for srun (SLURM launcher) as well
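For reference, a rough way to reproduce this kind of startup measurement is sketched below; this is not the actual benchmark used on Stampede2/Oakforest-PACS, just an illustrative program that wall-clocks the MPI_Init call with a monotonic OS timer and reports the slowest rank.

```c
/* Illustrative MPI_Init timing sketch (not the benchmark used in the slides). */
#include <mpi.h>
#include <stdio.h>
#include <time.h>

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(int argc, char **argv)
{
    double t0 = now_sec();
    MPI_Init(&argc, &argv);
    double t_init = now_sec() - t0;

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* report the slowest rank's MPI_Init time; a "Hello World" run would
       additionally print from every rank and call MPI_Finalize */
    double t_max;
    MPI_Reduce(&t_init, &t_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("max MPI_Init time: %.2f s\n", t_max);

    MPI_Finalize();
    return 0;
}
```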

Benefits of SHARP Allreduce at Application Level
[Chart: average DDOT Allreduce time of HPCG (seconds) for (Number of Nodes, PPN) = (4,28), (8,28), (16,28), comparing MVAPICH2 with and without SHARP; 12% improvement annotated]
SHARP support available since MVAPICH2 2.3a

• MV2_ENABLE_SHARP=1 – run-time parameter that enables SHARP-based collectives (default: disabled)
• --enable-sharp – configure flag to enable SHARP (default: disabled)
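For context, the call that benefits is an ordinary small Allreduce such as the HPCG DDOT phase above. The sketch below is illustrative (not from the slides): the application needs no source changes, since on a SHARP-capable fabric the offload is selected through the run-time parameter listed above (MV2_ENABLE_SHARP=1) when the library was built with --enable-sharp.

```c
/* Illustrative dot-product Allreduce; with MVAPICH2, SHARP offload is
 * selected via MV2_ENABLE_SHARP=1 at run time, not in the source. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* local partial dot product (placeholder value) */
    double local_dot = 1.0 * (rank + 1);
    double global_dot = 0.0;

    /* small Allreduce: the case where switch-based reduction helps most */
    MPI_Allreduce(&local_dot, &global_dot, 1, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);

    if (rank == 0)
        printf("global dot = %f\n", global_dot);

    MPI_Finalize();
    return 0;
}
```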

• Refer to the Running Collectives with Hardware based SHARP support section of the MVAPICH2 user guide for more information
• http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.3-userguide.html#x1-990006.26

Intra-node Point-to-Point Performance on OpenPower
[Charts: intra-socket small/large message latency (us), bandwidth, and bi-directional bandwidth (MB/s) vs. message size for MVAPICH2-2.3, SpectrumMPI-10.1.0.2, and OpenMPI-3.0.0; 0.30us small-message latency annotated]
Platform: two nodes of OpenPOWER (Power8-ppc64le) CPU using Mellanox EDR (MT4115) HCA

Intra-node Point-to-point Performance on ARM Cortex-A72
[Charts: small/large message latency (us), bandwidth, and bi-directional bandwidth (MB/s) vs. message size for MVAPICH2-2.3; 0.27 micro-second latency for 1-byte messages]
Platform: ARM Cortex-A72 (aarch64) processor with 64 cores, dual-socket CPU; each socket contains 32 cores.

Performance Engineering Applications using MVAPICH2 and TAU

● Enhance existing support for MPI_T in MVAPICH2 to expose a richer set of performance and control variables
● Get and display MPI Performance Variables (PVARs) made available by the runtime in TAU
● Control the runtime’s behavior via MPI Control Variables (CVARs)
● Introduced support for new MPI_T based CVARs to MVAPICH2
  ○ MPIR_CVAR_MAX_INLINE_MSG_SZ, MPIR_CVAR_VBUF_POOL_SIZE, MPIR_CVAR_VBUF_SECONDARY_POOL_SIZE
● TAU enhanced with support for setting MPI_T CVARs in a non-interactive mode for uninstrumented applications
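As a concrete illustration of the CVAR mechanism described above, the sketch below uses the standard MPI-3 tools interface to locate MPIR_CVAR_VBUF_POOL_SIZE by name, read it, and attempt to overwrite it. Whether a particular CVAR is writable at this point depends on the MVAPICH2 runtime, and the new value used here is purely illustrative.

```c
/* Sketch: query and (if permitted) set an MVAPICH2 CVAR via MPI_T.
 * Error handling is abbreviated for brevity. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    int provided, ncvar, i;
    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
    MPI_T_cvar_get_num(&ncvar);

    for (i = 0; i < ncvar; i++) {
        char name[256], desc[256];
        int name_len = sizeof(name), desc_len = sizeof(desc);
        int verbosity, bind, scope, count;
        MPI_Datatype dtype;
        MPI_T_enum enumtype;

        MPI_T_cvar_get_info(i, name, &name_len, &verbosity, &dtype,
                            &enumtype, desc, &desc_len, &bind, &scope);
        if (strcmp(name, "MPIR_CVAR_VBUF_POOL_SIZE") == 0) {
            MPI_T_cvar_handle handle;
            int val, newval = 128;              /* illustrative value only */
            MPI_T_cvar_handle_alloc(i, NULL, &handle, &count);
            MPI_T_cvar_read(handle, &val);
            printf("MPIR_CVAR_VBUF_POOL_SIZE = %d\n", val);
            MPI_T_cvar_write(handle, &newval);  /* may fail if read-only here */
            MPI_T_cvar_handle_free(&handle);
        }
    }
    MPI_T_finalize();

    MPI_Init(&argc, &argv);
    /* ... application ... */
    MPI_Finalize();
    return 0;
}
```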

[Figures: VBUF usage without and with CVAR-based tuning, as displayed by ParaProf]

Network Based Computing Laboratory Mellanox Theatre SC’18 18 Dynamic and Adaptive Tag Matching

• Challenge: tag matching is a significant overhead for receivers; existing solutions are static, do not adapt dynamically to the communication pattern, and do not consider memory overhead
• Solution: a new tag matching design that dynamically adapts to communication patterns, uses different strategies for different ranks, and bases decisions on the number of request objects that must be traversed before hitting the required one
• Results: better performance than other state-of-the-art tag-matching schemes with minimum memory consumption; will be available in future MVAPICH2 releases

[Charts: Normalized Total Tag Matching Time at 512 Processes (normalized to Default, lower is better) and Normalized Memory Overhead per Process at 512 Processes (compared to Default, lower is better)]
Adaptive and Dynamic Design for MPI Tag Matching; M. Bayatpour, H. Subramoni, S. Chakraborty, and D. K. Panda; IEEE Cluster 2016. [Best Paper Nominee]

Dynamic and Adaptive MPI Point-to-point Communication Protocols

Desired Eager Threshold per process pair (processes 0-3 on Node 1 paired with 4-7 on Node 2):
• 0 – 4: 32 KB
• 1 – 5: 64 KB
• 2 – 6: 128 KB
• 3 – 7: 32 KB

Eager Threshold for Example Communication Pattern with Different Designs:
• Default: 16 KB for every pair
• Manually Tuned: 128 KB for every pair
• Dynamic + Adaptive: 32 KB, 64 KB, 128 KB, 32 KB (per pair)

• Default: poor overlap, low memory requirement – low performance, high productivity
• Manually Tuned: good overlap, high memory requirement – high performance, low productivity
• Dynamic + Adaptive: good overlap, optimal memory requirement – high performance, high productivity
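The overlap being traded off above looks like the following in application code; this is an illustrative sketch (not from the slides) of a nonblocking exchange where independent computation proceeds while a message larger than a typical eager threshold is in flight.

```c
/* Illustrative communication/computation overlap with nonblocking MPI. */
#include <mpi.h>
#include <stdlib.h>

#define N (64 * 1024)   /* 64K doubles (512 KB): usually a rendezvous-size message */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *sendbuf = calloc(N, sizeof(double));
    double *recvbuf = calloc(N, sizeof(double));
    int peer = rank ^ 1;                 /* pair ranks 0-1, 2-3, ... */

    if (peer < size) {
        MPI_Request reqs[2];
        MPI_Irecv(recvbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(sendbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);

        /* independent computation overlapped with the transfer
           (does not touch sendbuf/recvbuf while the requests are active) */
        double acc = 0.0;
        for (int i = 0; i < N; i++)
            acc += (double)i * 0.5;

        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        (void)acc;
    }

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```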

[Charts: execution time of Amber (wall clock seconds) and relative memory consumption of Amber vs. number of processes (128-1K), comparing Default, Threshold=17K, Threshold=64K, Threshold=128K, and Dynamic Threshold]
H. Subramoni, S. Chakraborty, D. K. Panda, Designing Dynamic & Adaptive MPI Point-to-Point Communication Protocols for Efficient Overlap of Computation & Communication, ISC'17 - Best Paper

Application Scalability on Skylake and KNL (Stampede2)
[Charts: execution time (s) of Cloverleaf (bm64; MPI+OpenMP, NUM_OMP_THREADS = 2), MiniFE (1300x1300x1300 ~ 910 GB), and NEURON (YuEtAl2012) with MVAPICH2 vs. number of processes on Skylake (48 ppn) and KNL (64/68 ppn)]

Courtesy: Mahidhar Tatineni @SDSC, Dong Ju (DJ) Choi @SDSC, and Samuel Khuvis @OSC ---- Testbed: TACC Stampede2 using MVAPICH2-2.3b
Runtime parameters: MV2_SMPI_LENGTH_QUEUE=524288 PSM2_MQ_RNDV_SHM_THRESH=128K PSM2_MQ_RNDV_HFI_THRESH=128K

MVAPICH2 Distributions
• MVAPICH2 – Basic MPI support for IB, iWARP and RoCE
• MVAPICH2-X – Advanced MPI features and support for INAM; MPI, PGAS and Hybrid MPI+PGAS support for IB
• MVAPICH2-Virt – Optimized for HPC Clouds with IB and SR-IOV virtualization; support for OpenStack, Docker, and Singularity
• MVAPICH2-EA – Energy Efficient Support for point-to-point and collective operations; compatible with OSU Energy Monitoring Tool (OEMT-0.8)
• OSU Micro-Benchmarks (OMB) – MPI (including CUDA-aware MPI), OpenSHMEM and UPC
• OSU INAM – InfiniBand Network Analysis and Monitoring Tool
• MVAPICH2-GDR and Deep Learning (will be presented on Thursday at 10:30am)

MVAPICH2-X for Hybrid MPI + PGAS Applications

High Performance Parallel Programming Models: Message Passing Interface (MPI), PGAS (UPC, OpenSHMEM, CAF, UPC++), Hybrid --- MPI + X (MPI + PGAS + OpenMP/Cilk)

High Performance and Scalable Unified Communication Runtime – Diverse APIs and Mechanisms: optimized point-to-point primitives, collectives algorithms (blocking and non-blocking), remote memory access, active messages, scalable job startup, fault tolerance, introspection & analysis with OSU INAM

Support for efficient intra-node communication (POSIX SHMEM, CMA, LiMIC, XPMEM…), modern networking technologies (InfiniBand, iWARP, RoCE, Omni-Path…), and modern multi-/many-core architectures (Intel-Xeon, GPGPUs, OpenPOWER, ARM…)

• Current Model – Separate Runtimes for OpenSHMEM/UPC/UPC++/CAF and MPI
  – Possible deadlock if both runtimes are not progressed
  – Consumes more network resource
• Unified communication runtime for MPI, UPC, UPC++, OpenSHMEM, CAF
  – Available since 2012 (starting with MVAPICH2-X 1.9)
  – http://mvapich.cse.ohio-state.edu
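Below is a minimal hybrid MPI + OpenSHMEM sketch of the kind the unified runtime supports, with both models active in one executable over a single runtime; the program structure (initialization order and the particular OpenSHMEM calls) is illustrative rather than prescriptive.

```c
/* Illustrative hybrid MPI + OpenSHMEM program (assumes MPI ranks and
 * OpenSHMEM PEs coincide, as with a unified runtime). */
#include <mpi.h>
#include <shmem.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    shmem_init();                      /* OpenSHMEM 1.2+ style initialization */

    int rank, npes = shmem_n_pes();
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    static long counter = 0;           /* static => symmetric data object */
    if (rank == 0)
        shmem_long_p(&counter, 42, npes - 1);   /* one-sided put to last PE */
    shmem_barrier_all();

    /* two-sided MPI collective over the same set of processes */
    long sum = 0;
    MPI_Allreduce(&counter, &sum, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0)
        printf("PEs=%d, sum=%ld\n", npes, sum);

    shmem_finalize();
    MPI_Finalize();
    return 0;
}
```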

Network Based Computing Laboratory Mellanox Theatre SC’18 23 Minimizing Memory Footprint by Direct Connect (DC) Transport

• Constant connection cost (one QP for any peer)
• Full Feature Set (RDMA, Atomics etc)
• Separate objects for send (DC Initiator) and receive (DC Target)
  – DC Target identified by “DCT Number”
  – Messages routed with (DCT Number, LID)
  – Requires same “DC Key” to enable communication
• Available since MVAPICH2-X 2.2a
[Diagram: processes P0-P7 on Nodes 0-3 connected through an IB network]
[Charts: connection memory (KB) for Alltoall vs. number of processes (80-640), and normalized execution time of NAMD (apoa1, large data set) at 160/320/620 processes, comparing RC, DC-Pool, UD, and XRC; RC connection memory grows to about 97 KB while DC-Pool and UD stay nearly constant]
H. Subramoni, K. Hamidouche, A. Venkatesh, S. Chakraborty and D. K. Panda, Designing MPI Library with Dynamic Connected Transport (DCT) of InfiniBand: Early Experiences. IEEE International Supercomputing Conference (ISC ’14)

User-mode Memory Registration (UMR)
• Introduced by Mellanox to support direct local and remote noncontiguous memory access
• Avoid packing at sender and unpacking at receiver
• Available since MVAPICH2-X 2.2b
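To illustrate the kind of transfer UMR accelerates, the sketch below sends one strided column of a matrix using an MPI derived datatype; without hardware support the library would typically pack and unpack such noncontiguous data internally. The example is illustrative and not tied to any particular MVAPICH2 release.

```c
/* Illustrative noncontiguous (strided) transfer via a derived datatype. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* one column of a 1024x1024 row-major matrix: 1024 blocks of 1 double,
       each a full row (1024 doubles) apart */
    const int n = 1024;
    double *matrix = calloc((size_t)n * n, sizeof(double));

    MPI_Datatype column;
    MPI_Type_vector(n, 1, n, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    if (rank == 0)
        MPI_Send(matrix, 1, column, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(matrix, 1, column, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Type_free(&column);
    free(matrix);
    MPI_Finalize();
    return 0;
}
```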

[Charts: small/medium and large message latency (us) vs. message size (bytes), UMR vs. Default]
Connect-IB (54 Gbps): 2.8 GHz Dual Ten-core (IvyBridge) Intel PCI Gen3 with Mellanox IB FDR switch
M. Li, H. Subramoni, K. Hamidouche, X. Lu and D. K. Panda, “High Performance MPI Datatype Support with User-mode Memory Registration: Challenges, Designs and Benefits”, CLUSTER, 2015

On-Demand Paging (ODP)

• Applications no longer need to pin down underlying physical pages
• Memory Regions (MRs) are NEVER pinned by the OS
• Paged in by the HCA when needed
• Paged out by the OS when reclaimed
• ODP can be divided into two classes
  – Explicit ODP: applications still register memory buffers for communication, but this operation is used to define access control for IO rather than pin down the pages
  – Implicit ODP: applications are provided with a special memory key that represents their complete address space, and do not need to register any virtual address range
• Advantages: simplifies programming, unlimited MR sizes, physical memory optimization
[Chart: execution time (s) of applications at 64 processes, Pin-down vs. ODP]
M. Li, K. Hamidouche, X. Lu, H. Subramoni, J. Zhang, and D. K. Panda, “Designing MPI Library with On-Demand Paging (ODP) of InfiniBand: Challenges and Benefits”, SC 2016.
Available since MVAPICH2-X 2.3b

MPI_Allreduce on KNL + Omni-Path (10,240 Processes)
[Charts: MPI_Allreduce latency (us) vs. message size for MVAPICH2, MVAPICH2-OPT, and IMPI; OSU Micro Benchmark, 64 PPN; 2.4X improvement annotated]
• For MPI_Allreduce latency with 32K bytes, MVAPICH2-OPT can reduce the latency by 2.4X

M. Bayatpour, S. Chakraborty, H. Subramoni, X. Lu, and D. K. Panda, Scalable Reduction Collectives with Data Partitioning-based Multi-Leader Design, SuperComputing '17. Available since MVAPICH2-X 2.3b Network Based Computing Laboratory Mellanox Theatre SC’18 27 Optimized CMA-based Collectives for Large Messages

[Charts: MPI_Gather latency (us) vs. message size on KNL (2 nodes/128 procs, 4 nodes/256 procs, 8 nodes/512 procs), comparing Tuned CMA against MVAPICH2-2.3a, Intel MPI 2017, and OpenMPI 2.1.0; annotated improvements of ~2.5x, ~3.2x, ~4x, and ~17x in one comparison]
Performance of MPI_Gather on KNL nodes (64 PPN)
• Significant improvement over existing implementation for Scatter/Gather with 1MB messages (up to 4x on KNL, 2x on Broadwell, 14x on OpenPower)
• New two-level algorithms for better scalability
• Improved performance for other collectives (Bcast, Allgather, and Alltoall)
S. Chakraborty, H. Subramoni, and D. K. Panda, Contention Aware Kernel-Assisted MPI Collectives for Multi/Many-core Systems, IEEE Cluster ’17, Best Paper Finalist
Available since MVAPICH2-X 2.3b

Shared Address Space (XPMEM)-based Collectives Design
[Charts: OSU_Allreduce and OSU_Reduce latency (us) on Broadwell with 256 processes, MVAPICH2-X-2.3rc1 vs. MVAPICH2-2.3b and IMPI-2017v1.132; annotated improvements of 1.8X (Allreduce) and 4X (Reduce)]
• “Shared Address Space”-based true zero-copy Reduction collective designs in MVAPICH2
• Offloaded computation/communication to peer ranks in reduction collective operation
• Up to 4X improvement for 4MB Reduce and up to 1.8X improvement for 4M AllReduce
J. Hashmi, S. Chakraborty, M. Bayatpour, H. Subramoni, and D. Panda, Designing Efficient Shared Address Space Reduction Collectives for Multi-/Many-cores, International Parallel & Distributed Processing Symposium (IPDPS '18), May 2018.
Available in MVAPICH2-X 2.3rc1

Benefits of the New Asynchronous Progress Design: Broadwell + InfiniBand

[Charts: P3DFFT time in seconds per loop (28 PPN; 56, 112, 224, 448 processes) and normalized HPL performance in GFLOPS (28 PPN; 224, 448, 896 processes), comparing MVAPICH2-X Async with MVAPICH2-X Default and Intel MPI 18.1.163; 44% and 33% improvements and Memory Consumption = 69% annotated]
• Up to 44% performance improvement in P3DFFT application with 448 processes
• Up to 19% and 9% performance improvement in HPL application with 448 and 896 processes

A. Ruhela, H. Subramoni, S. Chakraborty, M. Bayatpour, P. Kousha, and D.K. Panda, Efficient Asynchronous Communication Progress for MPI without Dedicated Resources, EuroMPI 2018 Available in MVAPICH2-X 2.3rc1

SPEC MPI 2007 Benchmarks: Broadwell + InfiniBand
[Chart: execution time (s) of MILC, Leslie3D, POP2, LAMMPS, WRF2, and LU with Intel MPI 18.1.163 vs. MVAPICH2-X-2.3rc1; annotated differences ranging from -12% to 31%]
• MVAPICH2-X outperforms Intel MPI by up to 31%

Configuration: 448 processes on 16 Intel E5-2680v4 (Broadwell) nodes having 28 PPN and interconnected with 100Gbps Mellanox MT4115 EDR ConnectX-4 HCA Network Based Computing Laboratory Mellanox Theatre SC’18 31 Hybrid (MPI+PGAS) Programming

• Application sub-kernels can be re-written in MPI/PGAS based on communication characteristics
• Benefits:
  – Best of Distributed Computing Model
  – Best of Shared Memory Computing Model
[Diagram: an HPC application composed of Kernel 1 … Kernel N, with individual kernels using MPI or PGAS]

Application Level Performance with Graph500 and Sort
[Charts: Graph500 execution time (s) vs. number of processes (4K, 8K, 16K) for MPI-Simple, MPI-CSC, MPI-CSR, and Hybrid (MPI+OpenSHMEM), and Sort execution time (s) vs. input data/number of processes (500GB-512 through 4TB-4K) for MPI and Hybrid; 13X, 7.6X, and 51% improvements annotated]
• Performance of Hybrid (MPI+OpenSHMEM) Graph500 design
  – 8,192 processes: 2.4X improvement over MPI-CSR, 7.6X improvement over MPI-Simple
  – 16,384 processes: 1.5X improvement over MPI-CSR, 13X improvement over MPI-Simple
• Performance of Hybrid (MPI+OpenSHMEM) Sort application
  – 4,096 processes, 4 TB input size: MPI – 2408 sec (0.16 TB/min); Hybrid – 1172 sec (0.36 TB/min); 51% improvement over the MPI design

J. Jose, K. Kandalla, S. Potluri, J. Zhang and D. K. Panda, Optimizing Collective Communication in OpenSHMEM, Int'l Conference on Partitioned Global Address Space Programming Models (PGAS '13), October 2013. J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models, International Supercomputing Conference (ISC’13), June 2013 J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012 Network Based Computing Laboratory Mellanox Theatre SC’18 33 MVAPICH2 Distributions • MVAPICH2 – Basic MPI support for IB, iWARP and RoCE • MVAPICH2-X – Advanced MPI features and support for INAM – MPI, PGAS and Hybrid MPI+PGAS support for IB • MVAPICH2-Virt – Optimized for HPC Clouds with IB and SR-IOV virtualization – Support for OpenStack, Docker, and Singularity • MVAPICH2-EA – Energy Efficient Support for point-to-point and collective operations – Compatible with OSU Energy Monitoring Tool (OEMT-0.8) • OSU Micro-Benchmarks (OMB) – MPI (including CUDA-aware MPI), OpenSHMEM and UPC • OSU INAM – InfiniBand Network Analysis and Monitoring Tool • MVAPICH2-GDR and Deep Learning (Will be presented on Thursday at 10:30am)

Can HPC and Virtualization be Combined?
• Virtualization has many benefits
  – Fault-tolerance
  – Job migration
  – Compaction
• It has not been very popular in HPC due to the overhead associated with virtualization
• New SR-IOV (Single Root – IO Virtualization) support available with Mellanox InfiniBand adapters changes the field
• Enhanced MVAPICH2 support for SR-IOV
• MVAPICH2-Virt 2.2 supports:
  – OpenStack, Docker, and Singularity
J. Zhang, X. Lu, J. Jose, R. Shi and D. K. Panda, Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters? EuroPar'14
J. Zhang, X. Lu, J. Jose, M. Li, R. Shi and D.K. Panda, High Performance MPI Library over SR-IOV enabled InfiniBand Clusters, HiPC’14
J. Zhang, X. Lu, M. Arnold and D. K. Panda, MVAPICH2 Over OpenStack with SR-IOV: an Efficient Approach to build HPC Clouds, CCGrid’15

Application-Level Performance on Chameleon (A Release for Azure Coming Soon)

[Charts: Graph500 execution time (ms) vs. problem size (scale, edgefactor) and SPEC MPI2007 (milc, leslie3d, pop2, GAPgeofem, zeusmp2, lu) execution time (s), comparing MV2-SR-IOV-Def, MV2-SR-IOV-Opt, and MV2-Native; overheads of 1-9.5% annotated]
• 32 VMs, 6 cores/VM
• Compared to Native, 2-5% overhead for Graph500 with 128 procs
• Compared to Native, 1-9.5% overhead for SPEC MPI2007 with 128 procs

Application-Level Performance on Docker with MVAPICH2
[Charts: NAS (MG.D, FT.D, EP.D, LU.D, CG.D) execution time (s) and Graph 500 BFS execution time (ms) at scale/edgefactor (20,16) for 1Cont*16P, 2Conts*8P, and 4Conts*4P, comparing Container-Def, Container-Opt, and Native; 11% and 73% improvements annotated]
• 64 containers across 16 nodes, pinning 4 cores per container
• Compared to Container-Def, up to 11% and 73% execution time reduction for NAS and Graph 500
• Compared to Native, less than 9% and 5% overhead for NAS and Graph 500

Network Based Computing Laboratory Mellanox Theatre SC’18 37 Application-Level Performance on Singularity with MVAPICH2

[Charts: NPB Class D execution time (s) for CG, EP, FT, IS, LU, MG and Graph500 BFS execution time (ms) for problem sizes (scale, edgefactor) (22,16) through (26,20), Singularity vs. Native; 7% and 6% overheads annotated]
• 512 processes across 32 nodes
• Less than 7% and 6% overhead for NPB and Graph500, respectively
J. Zhang, X. Lu and D. K. Panda, Is Singularity-based Container Technology Ready for Running MPI Applications on HPC Clouds?, UCC ’17, Best Student Paper Award

MVAPICH2 Distributions
• MVAPICH2 – Basic MPI support for IB, iWARP and RoCE
• MVAPICH2-X – Advanced MPI features and support for INAM; MPI, PGAS and Hybrid MPI+PGAS support for IB
• MVAPICH2-Virt – Optimized for HPC Clouds with IB and SR-IOV virtualization; support for OpenStack, Docker, and Singularity
• MVAPICH2-EA – Energy Efficient Support for point-to-point and collective operations; compatible with OSU Energy Monitoring Tool (OEMT-0.8)
• OSU Micro-Benchmarks (OMB) – MPI (including CUDA-aware MPI), OpenSHMEM and UPC
• OSU INAM – InfiniBand Network Analysis and Monitoring Tool
• MVAPICH2-GDR and Deep Learning (will be presented on Thursday at 10:30am)

Network Based Computing Laboratory Mellanox Theatre SC’18 39 Designing Energy-Aware (EA) MPI Runtime

[Diagram: overall application energy expenditure = energy spent in computation routines + energy spent in communication routines (point-to-point, collective, and RMA routines); the MVAPICH2-EA designs cover MPI two-sided and collectives (ex: MVAPICH2) and impact MPI-3 RMA implementations (ex: MVAPICH2), one-sided runtimes (ex: ComEx), and other PGAS implementations (ex: OSHMPI)]

Network Based Computing Laboratory Mellanox Theatre SC’18 40 MVAPICH2-EA: Application Oblivious Energy-Aware-MPI (EAM)

• An energy efficient runtime that provides energy savings without application knowledge
• Uses the best energy lever automatically and transparently
• Provides guarantees on maximum degradation, with 5-41% savings at <= 5% degradation
• Pessimistic MPI applies an energy reduction lever to each MPI call
• Released (together with OEMT) since Aug’15

A Case for Application-Oblivious Energy-Efficient MPI Runtime A. Venkatesh, A. Vishnu, K. Hamidouche, N. Tallent, D. K. Panda, D. Kerbyson, and A. Hoise, Supercomputing ‘15, Nov 2015 [Best Student Paper Finalist] Network Based Computing Laboratory Mellanox Theatre SC’18 41 MVAPICH2 Distributions • MVAPICH2 – Basic MPI support for IB, iWARP and RoCE • MVAPICH2-X – Advanced MPI features and support for INAM – MPI, PGAS and Hybrid MPI+PGAS support for IB • MVAPICH2-Virt – Optimized for HPC Clouds with IB and SR-IOV virtualization – Support for OpenStack, Docker, and Singularity • MVAPICH2-EA – Energy Efficient Support for point-to-point and collective operations – Compatible with OSU Energy Monitoring Tool (OEMT-0.8) • OSU Micro-Benchmarks (OMB) – MPI (including CUDA-aware MPI), OpenSHMEM and UPC • OSU INAM – InfiniBand Network Analysis and Monitoring Tool • MVAPICH2-GDR and Deep Learning (Will be presented on Thursday at 10:30am)

Network Based Computing Laboratory Mellanox Theatre SC’18 42 OSU Microbenchmarks • Available since 2004 • Suite of microbenchmarks to study communication performance of various programming models • Benchmarks available for the following programming models – Message Passing Interface (MPI) – Partitioned Global Address Space (PGAS) • (UPC) • Unified Parallel C++ (UPC++) • OpenSHMEM • Benchmarks available for multiple accelerator based architectures – Compute Unified Device Architecture (CUDA) – OpenACC Application Program Interface • Part of various national resource procurement suites like NERSC-8 / Trinity Benchmarks • Please visit the following link for more information – http://mvapich.cse.ohio-state.edu/benchmarks/ Network Based Computing Laboratory Mellanox Theatre SC’18 43 MVAPICH2 Distributions • MVAPICH2 – Basic MPI support for IB, iWARP and RoCE • MVAPICH2-X – Advanced MPI features and support for INAM – MPI, PGAS and Hybrid MPI+PGAS support for IB • MVAPICH2-Virt – Optimized for HPC Clouds with IB and SR-IOV virtualization – Support for OpenStack, Docker, and Singularity • MVAPICH2-EA – Energy Efficient Support for point-to-point and collective operations – Compatible with OSU Energy Monitoring Tool (OEMT-0.8) • OSU Micro-Benchmarks (OMB) – MPI (including CUDA-aware MPI), OpenSHMEM and UPC • OSU INAM – InfiniBand Network Analysis and Monitoring Tool • MVAPICH2-GDR and Deep Learning (Will be presented on Thursday at 10:30am)

Network Based Computing Laboratory Mellanox Theatre SC’18 44 Overview of OSU INAM

• A network monitoring and analysis tool that is capable of analyzing traffic on the InfiniBand network with inputs from the MPI runtime – http://mvapich.cse.ohio-state.edu/tools/osu-inam/ • Monitors IB clusters in real time by querying various subnet management entities and gathering input from the MPI runtimes • Capability to analyze and profile node-level, job-level and process-level activities for MPI communication – Point-to-Point, Collectives and RMA • Ability to filter data based on type of counters using “drop down” list • Remotely monitor various metrics of MPI processes at user specified granularity • "Job Page" to display jobs in ascending/descending order of various performance metrics in conjunction with MVAPICH2-X • Visualize the data transfer happening in a “live” or “historical” fashion for entire network, job or set of nodes • OSU INAM 0.9.4 released on 11/10/2018 – Enhanced performance for fabric discovery using optimized OpenMP-based multi-threaded designs – Ability to gather InfiniBand performance counters at sub-second granularity for very large (>2000 nodes) clusters – Redesign database layout to reduce database size – Enhanced fault tolerance for database operations • Thanks to Trey Dockendorf @ OSC for the feedback – OpenMP-based multi-threaded designs to handle database purge, read, and insert operations simultaneously – Improved database purging time by using bulk deletes – Tune database timeouts to handle very long database operations – Improved debugging support by introducing several debugging levels Network Based Computing Laboratory Mellanox Theatre SC’18 45 OSU INAM Features

Comet@SDSC --- Clustered View Finding Routes Between Nodes (1,879 nodes, 212 switches, 4,377 network links) • Show network topology of large clusters • Visualize traffic pattern on different links • Quickly identify congested links/links in error state • See the history unfold – play back historical state of the network

Network Based Computing Laboratory Mellanox Theatre SC’18 46 OSU INAM Features (Cont.)

Visualizing a Job (5 Nodes) Estimated Process Level Link Utilization • Job level view • Estimated Link Utilization view • Show different network metrics (load, error, etc.) for any live job • Classify data flowing over a network link at • Play back historical data for completed jobs to identify bottlenecks different granularity in conjunction with • Node level view - details per process or per node MVAPICH2-X 2.2rc1 • CPU utilization for each rank/node • Job level and • Bytes sent/received for MPI operations (pt-to-pt, collective, RMA) • Process level • Network metrics (e.g. XmitDiscard, RcvError) per rank/node

Network Based Computing Laboratory Mellanox Theatre SC’18 47 Applications-Level Tuning: Compilation of Best Practices • MPI runtime has many parameters • Tuning a set of parameters can help you to extract higher performance • Compiled a list of such contributions through the MVAPICH Website – http://mvapich.cse.ohio-state.edu/best_practices/ • Initial list of applications – Amber – HoomDBlue – HPCG – Lulesh – MILC – Neuron – SMG2000 – Cloverleaf – SPEC (LAMMPS, POP2, TERA_TF, WRF2) • Soliciting additional contributions, send your results to mvapich-help at cse.ohio-state.edu. • We will link these results with credits to you.

Network Based Computing Laboratory Mellanox Theatre SC’18 48 MVAPICH Team Part of New NSF Tier-1 System

• TACC and MVAPICH partner to win new NSF Tier-1 System – https://www.hpcwire.com/2018/07/30/tacc-wins-next-nsf- funded-major-/ • The MVAPICH team will be an integral part of the effort • An overview of the recently awarded NSF Tier 1 System at TACC will be presented by Dan Stanzione – The presentation will also include discussion on MVAPICH collaboration on past systems and this upcoming system at TACC

Network Based Computing Laboratory Mellanox Theatre SC’18 49 Commercial Support for MVAPICH2 Libraries • Supported through X-ScaleSolutions (http://x-scalesolutions.com) • Benefits: – Help and guidance with installation of the library – Platform-specific optimizations and tuning – Timely support for operational issues encountered with the library – Web portal interface to submit issues and tracking their progress – Advanced debugging techniques – Application-specific optimizations and tuning – Obtaining guidelines on best practices – Periodic information on major fixes and updates – Information on major releases – Help with upgrading to the latest release – Flexible Service Level Agreements • Support provided to Lawrence Livermore National Laboratory (LLNL) this year

Network Based Computing Laboratory Mellanox Theatre SC’18 50 MVAPICH2 – Plans for Exascale

• Performance and Memory scalability toward 1M-10M cores • Hybrid programming (MPI + OpenSHMEM, MPI + UPC, MPI + CAF …) • MPI + Task* • Enhanced Optimization for GPUs and FPGAs* • Taking advantage of advanced features of Mellanox InfiniBand • Tag Matching* • Adapter Memory* • Enhanced communication schemes for upcoming architectures • NVLINK* • CAPI* • Extended topology-aware collectives • Extended Energy-aware designs and Virtualization Support • Extended Support for MPI Tools Interface (as in MPI 3.0) • Extended FT support • Support for * features will be available in future MVAPICH2 Releases

Network Based Computing Laboratory Mellanox Theatre SC’18 51 One More Presentation

• Thursday (11/15/18) at 10:30am

MVAPICH2-GDR: Pushing the Frontier of HPC and Deep Learning

Network Based Computing Laboratory Mellanox Theatre SC’18 52 Join us at the OSU Booth

• Join the OSU Team at SC’18 at Booth #4404 and grab a free T-Shirt! • More details about various events available at the following link • http://mvapich.cse.ohio-state.edu/conference/735/talks/

Network Based Computing Laboratory Mellanox Theatre SC’18 53 Thank You! [email protected]

Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/

The High-Performance MPI/PGAS Project The High-Performance Big Data Project The High-Performance Deep Learning Project http://mvapich.cse.ohio-state.edu/ http://hibd.cse.ohio-state.edu/ http://hidl.cse.ohio-state.edu/

Network Based Computing Laboratory Mellanox Theatre SC’18 54