High Performance and Scalable MPI+X Library for Emerging HPC Clusters

Talk at the Intel HPC Developer Conference (SC '16) by

Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda

Khaled Hamidouche
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~hamidouc

High-End Computing (HEC): ExaFlop & ExaByte

ExaFlop & HPC: 100-200 PFlops in 2016-2018; 1 EFlops in 2023-2024?

ExaByte & BigData: 10K-20K EBytes in 2016-2018; 40K EBytes in 2020?

Trends for Commodity Computing Clusters in the Top 500 List (http://www.top500.org)

[Chart: number of clusters and percentage of clusters in the Top500 over time; clusters now make up about 85% of the list]

Drivers of Modern HPC Cluster Architectures

[Figure: multi-core processors; high-performance interconnects (InfiniBand: <1 usec latency, 100 Gbps bandwidth); accelerators/coprocessors (high compute density, high performance/watt, >1 TFlop DP on a chip); SSD, NVMe-SSD, NVRAM]

• Multi-core/many-core technologies
• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
• Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD
• Accelerators (GPGPUs and Intel Xeon Phi)

[Example systems: Tianhe-2, Titan, Stampede, Tianhe-1A]

Designing Communication Libraries for Multi-Petaflop and Exaflop Systems: Challenges

[Layered co-design view, with co-design opportunities and challenges across the layers:]

• Application kernels/applications
• Programming models: MPI, PGAS (UPC, OpenSHMEM), CUDA, OpenMP, OpenACC, Hadoop (MapReduce), Spark (RDD, DAG), etc.
• Communication library or runtime for programming models: point-to-point communication, collective communication, energy-awareness, synchronization and locks, I/O and file systems, fault tolerance (with performance and fault-resilience as cross-cutting concerns)
• Networking technologies (InfiniBand, 40/100 GigE, Aries, and Omni-Path), multi-/many-core architectures, accelerators (NVIDIA GPUs and MIC)

Exascale Programming models

• The community believes the exascale programming model will be MPI+X
• But what is X?
  – Can it be just OpenMP?
• Many different environments and systems are emerging
  – Different `X' will satisfy the respective needs

[Figure: next-generation programming models — MPI + X, where X = OpenMP, OpenACC, CUDA, PGAS, tasks, ...; target scenarios: highly-threaded systems (KNL), irregular communications, heterogeneous computing with accelerators]

MPI+X Programming model: Broad Challenges at Exascale

• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
  – Scalable job start-up
• Scalable collective communication
  – Offload
  – Non-blocking
  – Topology-aware
• Balancing intra-node and inter-node communication for next-generation nodes (128-1024 cores)
  – Multiple end-points per node
• Support for efficient multi-threading (OpenMP)
• Integrated support for GPGPUs and accelerators (CUDA)
• Fault-tolerance/resiliency
• QoS support for communication and I/O
• Support for hybrid MPI+PGAS programming (MPI + OpenMP, MPI + UPC, MPI + OpenSHMEM, CAF, ...)
• Virtualization
• Energy-awareness

Additional Challenges for Designing Exascale Software Libraries

• Extreme Low Memory Footprint – Memory per core continues to decrease

• D-L-A Framework
  – Discover
    • Overall network topology (fat-tree, 3D, ...), network topology for the processes of a given job
    • Node architecture, health of network and node
  – Learn
    • Impact on performance and scalability
    • Potential for failure
  – Adapt
    • Internal protocols and algorithms
    • Mapping
    • Fault-tolerance solutions
  – Low-overhead techniques while delivering performance, scalability and fault-tolerance

Overview of the MVAPICH2 Project

• High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
  – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), started in 2001, first version available in 2002
  – MVAPICH2-X (MPI + PGAS), available since 2011
  – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
  – Support for virtualization (MVAPICH2-Virt), available since 2015
  – Support for energy-awareness (MVAPICH2-EA), available since 2015
  – Support for InfiniBand network analysis and monitoring (OSU INAM) since 2015
  – Used by more than 2,690 organizations in 83 countries
  – More than 402,000 (> 0.4 million) downloads from the OSU site directly
  – Empowering many TOP500 clusters (Nov '16 ranking)
    • 1st-ranked 10,649,640-core cluster (Sunway TaihuLight) at NSC, Wuxi, China
    • 13th-ranked 241,108-core cluster (Pleiades) at NASA
    • 17th-ranked 519,640-core cluster (Stampede) at TACC
    • 40th-ranked 76,032-core cluster (Tsubame 2.5) at Tokyo Institute of Technology, and many others
  – Available with the software stacks of many vendors and Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
  – System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) -> Sunway TaihuLight at NSC, Wuxi, China (1st in Nov '16, 10,649,640 cores, 93 PFlops)

Outline

• Hybrid MPI+OpenMP Models for Highly-threaded Systems

• Hybrid MPI+PGAS Models for Irregular Applications

• Hybrid MPI+GPGPUs and OpenSHMEM for Heterogeneous Computing with Accelerators

Highly Threaded Systems

• Systems like KNL
• MPI+OpenMP is seen as the best fit
  – 1 MPI process per socket for multi-core
  – 4-8 MPI processes per KNL
  – Each MPI process will launch OpenMP threads
• However, current MPI runtimes do not handle the hybrid model efficiently
  – Most applications use the funneled mode: only the MPI processes perform communication (see the sketch below)
  – Communication phases are the bottleneck
• Multi-endpoint based designs
  – Transparently use threads inside the MPI runtime
  – Increase the concurrency
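As a concrete reference point for the funneled bottleneck above, here is a minimal, hedged MPI+OpenMP sketch (not taken from MVAPICH2; the buffer sizes and ring exchange are illustrative) in which OpenMP threads compute but only the main thread communicates:

/* Minimal MPI+OpenMP "funneled" sketch: threads compute, only the
 * main thread communicates.  Buffer sizes and the ring exchange
 * are illustrative, not from the talk. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, nprocs;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double local[1024], halo[1024];
    #pragma omp parallel for
    for (int i = 0; i < 1024; i++)          /* threaded compute phase */
        local[i] = rank + 0.5 * i;

    /* Communication phase: serialized in the main thread only, which is
     * exactly where multi-endpoint designs add concurrency. */
    int right = (rank + 1) % nprocs;
    int left  = (rank - 1 + nprocs) % nprocs;
    MPI_Sendrecv(local, 1024, MPI_DOUBLE, right, 0,
                 halo,  1024, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d got halo[0]=%f\n", rank, halo[0]);
    MPI_Finalize();
    return 0;
}

In this pattern the parallel compute phase scales with the thread count, while the communication phase remains single-threaded per MPI process; multi-endpoint runtimes target that serialized phase.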

MPI and OpenMP

• MPI-4 will enhance the support
  – Endpoint proposal in the MPI Forum
  – Application threads will be able to efficiently perform communication
  – An endpoint is the communication entity that maps to a thread
    • The idea is to have multiple addressable communication entities within a single process
    • No context switch between application and runtime => better performance
• OpenMP 4.5 is more powerful than just traditional worksharing
  – Supports tasking since OpenMP 3.0
  – Supports heterogeneous computing with accelerator targets since OpenMP 4.0
  – Supports explicit SIMD and thread-affinity pragmas since OpenMP 4.0
  (illustrated in the sketch below)
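A hedged illustration of the OpenMP features cited above (tasking from OpenMP 3.0, accelerator offload and SIMD from OpenMP 4.0); the saxpy and task loops are invented examples, not code from the talk:

/* Hedged OpenMP feature sketch; kernels are invented for illustration.
 * Thread affinity (OpenMP 4.0) could additionally be controlled with the
 * proc_bind clause or the OMP_PLACES/OMP_PROC_BIND environment variables. */
#include <omp.h>

void saxpy(int n, float a, float *x, float *y)
{
    /* OpenMP 4.0: offload to an accelerator target if one is available */
    #pragma omp target map(to: x[0:n]) map(tofrom: y[0:n])
    /* OpenMP 4.0: explicit SIMD vectorization of the parallel loop */
    #pragma omp parallel for simd
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

void fill_with_tasks(int n, float *x)
{
    /* OpenMP 3.0: task-based parallelism */
    #pragma omp parallel
    #pragma omp single
    for (int i = 0; i < n; i++) {
        #pragma omp task firstprivate(i)
        x[i] = 2.0f * i;
    }
}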

MEP-based design: MVAPICH2 Approach

• Lock-free communication
  – Threads have their own resources
• Dynamically adapt the number of threads
  – Avoid resource contention
  – Depends on application pattern and system performance
• Both intra-node and inter-node communication
  – Threads use both channels
• New MEP-aware collectives
• Applicable to the endpoint proposal in MPI-4

[Figure: MPI/OpenMP program on top of the multi-endpoint runtime — MPI point-to-point (MPI_Isend, MPI_Irecv, MPI_Waitall, ...) supported transparently; MPI collectives (MPI_Alltoallv, MPI_Allgatherv, ...) use an optimized algorithm and require an OpenMP pragma; runtime components: endpoint controller, lock-free communication components, communication resources management, request handling, progress engine]

M. Luo, X. Lu, K. Hamidouche, K. Kandalla and D. K. Panda, Initial Study of Multi-Endpoint Runtime for MPI+OpenMP Hybrid Applications on Multi-Core Systems. International Symposium on Principles and Practice of Parallel Programming (PPoPP '14).

Performance Benefits: OSU Micro-Benchmarks (OMB) Level

[Charts: latency (us) vs. message size for multi-pair, Bcast, and Alltoallv benchmarks, comparing the original multi-threaded runtime, the proposed multi-endpoint runtime, and a process-based runtime]

• Reduces the latency from 40 us to 1.85 us (21X)
• Achieves the same latency as the process-based runtime
• 40% improvement in latency for Bcast on 4,096 cores
• 30% improvement in latency for Alltoall on 4,096 cores

Performance Benefits: Application Kernel Level

[Charts: NAS benchmarks (CG, LU, MG) with 256 nodes (4,096 cores), Class E, original vs. MEP execution time; P3DFFT execution time at 2K and 4K cores, original vs. MEP]

• 6.3% improvement for MG, 11.7% improvement for CG, and 12.6% improvement for LU on 4,096 cores
• With P3DFFT, we observe a 30% improvement in communication time and a 13.5% improvement in total execution time

Enhanced Designs for KNL: MVAPICH2 Approach

• On-load approach
  – Takes advantage of the idle cores
  – Dynamically configurable
  – Takes advantage of highly multithreaded cores
  – Takes advantage of the MCDRAM of KNL processors (see the MCDRAM sketch below)
• Applicable to other programming models such as PGAS, task-based, etc.
• Provides portability, performance, and applicability to runtimes as well as applications in a transparent manner
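To make the MCDRAM point concrete, here is a small, hedged sketch; it is not MVAPICH2 internals, just one common way (the memkind library's hbwmalloc interface, assumed to be installed) for code to place buffers in KNL high-bandwidth memory:

/* Hedged sketch (not MVAPICH2 internals): place a buffer in KNL MCDRAM via
 * the memkind library's hbwmalloc interface, falling back to DDR if
 * high-bandwidth memory is not available.  Assumes memkind is installed. */
#include <hbwmalloc.h>
#include <stdlib.h>
#include <string.h>

double *alloc_comm_buffer(size_t n)
{
    double *buf = NULL;
    if (hbw_check_available() == 0)          /* MCDRAM present? (0 means yes) */
        buf = (double *) hbw_malloc(n * sizeof(double));
    if (buf == NULL)                         /* fall back to regular DDR */
        buf = (double *) malloc(n * sizeof(double));
    memset(buf, 0, n * sizeof(double));
    return buf;
}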

Performance Benefits of the Enhanced Designs

[Charts: MVAPICH2 vs. MVAPICH2-Optimized — intra-node broadcast with a 64 MB message (latency vs. number of processes), 16-process intra-node all-to-all (latency vs. message size), and very-large-message bi-directional bandwidth; annotated improvements of 17.2%, 27%, and 52%]

• New designs to exploit the high concurrency and MCDRAM of KNL
• Significant improvements for large message sizes
• Benefits seen across varying message sizes as well as varying numbers of MPI processes

Performance Benefits of the Enhanced Designs

[Charts: CNTK MLP training time using MNIST (batch size 64) for different MPI-process : OpenMP-thread configurations, and multi-bandwidth using 32 MPI processes vs. message size, comparing MV2_Def_DRAM, MV2_Opt_DRAM, MV2_Def_MCDRAM, and MV2_Opt_MCDRAM; annotated improvements of 15% and 30%]

• Benefits observed on the training time of a Multi-Level Perceptron (MLP) model on the MNIST dataset using the CNTK deep-learning framework
• Enhanced designs will be available in upcoming MVAPICH2 releases

Outline

• Hybrid MPI+OpenMP Models for Highly-threaded Systems

• Hybrid MPI+PGAS Models for Irregular Applications

• Hybrid MPI+GPGPUs and OpenSHMEM for Heterogeneous Computing with Accelerators

Maturity of Runtimes and Application Requirements

• MPI has been the most popular model for a long time
  – Available on every major machine
  – Portability, performance and scaling
  – Most parallel HPC code is designed using MPI
  – Simplicity - structured and iterative communication patterns
• PGAS models
  – Increasing interest in the community
  – Simple abstractions and one-sided communication
  – Easier to express irregular communication
• Need for hybrid MPI + PGAS
  – An application can have kernels with different communication characteristics
  – Porting only parts of an application reduces programming effort

Hybrid (MPI+PGAS) Programming

• Application sub-kernels can be re-written in MPI/PGAS based on communication characteristics
• Benefits:
  – Best of distributed-memory computing model (MPI)
  – Best of shared-memory computing model (PGAS)

[Figure: an HPC application whose kernels are implemented in the better-suited model — Kernel 1 in MPI, Kernel 2 in PGAS or MPI, Kernel 3 in MPI, ..., Kernel N in PGAS or MPI; a structural sketch follows below]
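A minimal, hedged sketch of this structure, assuming a unified runtime such as MVAPICH2-X so that MPI and OpenSHMEM can coexist in one job; the kernel bodies and names are invented for illustration:

/* Hedged hybrid-structure sketch: one kernel keeps its regular,
 * collective-friendly MPI code, another uses one-sided OpenSHMEM
 * operations for an irregular update pattern. */
#include <mpi.h>
#include <shmem.h>

static long counter = 0;                    /* symmetric variable */

void regular_kernel(double *v, int n)       /* structured phase: MPI collective */
{
    MPI_Allreduce(MPI_IN_PLACE, v, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
}

void irregular_kernel(int target_pe)        /* irregular phase: one-sided PGAS */
{
    shmem_long_inc(&counter, target_pe);    /* remote atomic increment */
    shmem_quiet();                          /* complete outstanding one-sided ops */
}

With a unified runtime, both kernels are initialized in the same job and share one set of communication resources, which is the point of the MVAPICH2-X design described later.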

Current Approaches for Hybrid Programming

• Layering one programming model over another
  – Poor performance due to semantics mismatch
  – MPI-3 RMA tries to address this
• Separate runtime for each programming model
  – Needs more network and memory resources
  – Might lead to deadlock!

[Figure: hybrid (OpenSHMEM + MPI) applications issuing OpenSHMEM calls and MPI calls to separate OpenSHMEM and MPI runtimes over InfiniBand, RoCE, iWARP]

The Need for a Unified Runtime

[Figure: P0 issues shmem_int_fadd (data at P1) as an active message and then calls MPI_Barrier(comm), while P1 performs local computation, operates on the data, and calls MPI_Barrier(comm); each process runs separate OpenSHMEM and MPI runtimes]

• Deadlock when a message is sitting in one runtime, but the application calls the other runtime
• The prescription to avoid this is to barrier in one model (either OpenSHMEM or MPI) before entering the other (see the sketch below)
• Otherwise the runtimes require dedicated progress threads
• Bad performance!!
• Similar issues for MPI + UPC applications over individual runtimes
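A hedged sketch of that prescription when OpenSHMEM and MPI sit on separate runtimes: complete the OpenSHMEM phase with shmem_barrier_all() before calling into MPI, so no message is left sitting in the other runtime (variable names are illustrative):

/* Hedged sketch of the "barrier in one model before entering the other"
 * prescription; not a unified-runtime design, just the workaround. */
#include <mpi.h>
#include <shmem.h>

static int data = 0;                         /* symmetric variable on every PE */

void phase(int me)
{
    if (me == 0)
        shmem_int_fadd(&data, 1, 1);         /* P0: one-sided fetch-add at PE 1 */

    shmem_barrier_all();                     /* finish the OpenSHMEM phase ... */
    MPI_Barrier(MPI_COMM_WORLD);             /* ... before entering the MPI phase */
}

The extra barrier (or a dedicated progress thread) is what costs performance, which is the motivation for the unified MVAPICH2-X runtime on the next slide.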

MVAPICH2-X for Hybrid MPI + PGAS Applications

MPI, OpenSHMEM, UPC, CAF, UPC++ or Hybrid (MPI + PGAS) Applications

OpenSHMEM Calls CAF Calls UPC Calls UPC++ Calls MPI Calls

Unified MVAPICH2-X Runtime

InfiniBand, RoCE, iWARP

• Unified communication runtime for MPI, UPC, OpenSHMEM, CAF, UPC++, available with MVAPICH2-X 1.9 onwards! (since 2012)
  – http://mvapich.cse.ohio-state.edu
• Feature highlights
  – Supports MPI(+OpenMP), OpenSHMEM, UPC, CAF, UPC++, MPI(+OpenMP) + OpenSHMEM, MPI(+OpenMP) + UPC
  – MPI-3 compliant, OpenSHMEM v1.0 standard compliant, UPC v1.2 standard compliant (with initial support for UPC 1.3), CAF 2008 standard (OpenUH), UPC++
  – Scalable inter-node and intra-node communication - point-to-point and collectives

OpenSHMEM Atomic Operations: Performance

[Chart: time (us) for OpenSHMEM atomic operations — fadd, finc, add, inc, cswap, swap — comparing UH-SHMEM, MV2X-SHMEM, Scalable-SHMEM, and OMPI-SHMEM]

• OSU OpenSHMEM micro-benchmarks (OMB v4.1)
• MV2X-SHMEM performs up to 40% better compared to UH-SHMEM (the operations measured are illustrated in the sketch below)
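For reference, the atomic operations benchmarked above in a hedged, OpenSHMEM 1.2-style sketch (the values and target PE are arbitrary):

/* Hedged sketch of the OpenSHMEM atomics measured above; values and the
 * target PE are arbitrary.  Uses OpenSHMEM 1.2-style initialization. */
#include <shmem.h>
#include <stdio.h>

int main(void)
{
    static int counter = 0;                     /* symmetric data object */
    shmem_init();
    int me = shmem_my_pe();

    if (me == 0) {
        int old;
        old = shmem_int_fadd(&counter, 5, 1);       /* fetch-and-add on PE 1 */
        old = shmem_int_finc(&counter, 1);          /* fetch-and-increment */
        shmem_int_add(&counter, 5, 1);              /* add (no fetch) */
        shmem_int_inc(&counter, 1);                 /* increment (no fetch) */
        old = shmem_int_cswap(&counter, 12, 99, 1); /* compare-and-swap */
        old = shmem_int_swap(&counter, 42, 1);      /* unconditional swap */
        printf("last fetched value: %d\n", old);
    }
    shmem_barrier_all();
    shmem_finalize();
    return 0;
}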

UPC Collectives Performance

[Charts at 2,048 processes, UPC-OSU vs. UPC-GASNet: Broadcast (up to 25X), Scatter (2X), Gather (2X), and Exchange (35%) latency vs. message size]

J. Jose, K. Hamidouche, J. Zhang, A. Venkatesh, and D. K. Panda, Optimizing Collective Communication in UPC (HiPS'14, in association with IPDPS'14)

Performance Evaluations for CAF Model

[Charts: Get and Put bandwidth and latency for GASNet-IBV, GASNet-MPI, and MV2X, and NAS-CAF execution times (bt.D.256, cg.D.256, ep.D.256, ft.D.256, mg.D.256, sp.D.256)]

• Micro-benchmark improvement (MV2X vs. GASNet-IBV, UH CAF test-suite)
  – Put bandwidth: 3.5X improvement at 4KB; Put latency: reduced by 29% at 4 bytes
• Application performance improvement (NAS-CAF one-sided implementation)
  – Execution time reduced by 12% (SP.D.256) and 18% (BT.D.256)

J. Lin, K. Hamidouche, X. Lu, M. Li and D. K. Panda, High-performance Co-array Support with MVAPICH2-X: Initial Experience and Evaluation, HIPS'15

UPC++ Collectives Performance

[Chart: inter-node Broadcast latency (64 nodes, 1 process per node) vs. message size — MV2-X vs. the GASNet_MPI and GASNET_IBV conduits, up to 14X improvement] [Diagram: MPI + {UPC++} application running over the MVAPICH2-X Unified Communication Runtime (UCR) instead of layering the UPC++ runtime on a GASNet conduit (MPI)]

• Full and native support for hybrid MPI + UPC++ applications
• Better performance compared to the IBV and MPI conduits
• OSU Micro-Benchmarks (OMB) support for UPC++
• Available since MVAPICH2-X 2.2RC1

J. M. Hashmi, K. Hamidouche, and D. K. Panda, Enabling Performance Efficient Runtime Support for Hybrid MPI+UPC++ Programming Models, IEEE International Conference on High Performance Computing and Communications (HPCC 2016)

Application Level Performance with Graph500 and Sort

[Charts: Graph500 execution time (MPI-Simple, MPI-CSC, MPI-CSR, Hybrid MPI+OpenSHMEM) vs. number of processes, and Sort execution time (MPI vs. Hybrid) vs. input data size and number of processes]

• Performance of the Hybrid (MPI+OpenSHMEM) Graph500 design
  – 8,192 processes: 2.4X improvement over MPI-CSR, 7.6X improvement over MPI-Simple
  – 16,384 processes: 1.5X improvement over MPI-CSR, 13X improvement over MPI-Simple
• Performance of the Hybrid (MPI+OpenSHMEM) Sort application
  – 4,096 processes, 4 TB input size: MPI - 2408 sec (0.16 TB/min); Hybrid - 1172 sec (0.36 TB/min); 51% improvement over the MPI design

J. Jose, S. Potluri, H. Subramoni, X. Lu, K. Hamidouche, K. Schulz, H. Sundar and D. Panda, Designing Scalable Out-of-core Sorting with Hybrid MPI+PGAS Programming Models, PGAS'14

J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models, International Supercomputing Conference (ISC'13), June 2013

MiniMD – Total Execution Time

[Charts: MiniMD execution time (ms) vs. number of cores for MPI-Original, Hybrid-Barrier, and Hybrid-Advanced — performance and strong-scaling runs]

• Hybrid design performs better than the MPI implementation
• 1,024 processes: 17% improvement over the MPI version
• Strong-scaling input size: 128 x 128 x 128

Accelerating MaTEx k-NN with Hybrid MPI and OpenSHMEM

• MaTEx: MPI-based machine learning algorithm library
• k-NN: a popular supervised algorithm for classification
• Hybrid designs:
  – Overlapped data flow; one-sided data transfer; circular-buffer structure

[Charts: execution time for KDD-tranc (30 MB) on 256 cores (27.6% reduction) and full KDD (2.5 GB) on 512 cores (9.0% reduction)]

• Benchmark: KDD Cup 2010 (8,407,752 records, 2 classes, k=5)
• For the truncated KDD workload on 256 cores, execution time is reduced by 27.6%
• For the full KDD workload on 512 cores, execution time is reduced by 9.0%

J. Lin, K. Hamidouche, J. Zhang, X. Lu, A. Vishnu, D. Panda, Accelerating k-NN Algorithm with Hybrid MPI and OpenSHMEM, OpenSHMEM 2015

Outline

• Hybrid MPI+OpenMP Models for Highly-threaded Systems

• Hybrid MPI+PGAS Models for Irregular Applications

• Hybrid MPI+GPGPUs and OpenSHMEM for Heterogeneous Computing with Accelerators

GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU

• Standard MPI interfaces used for unified data movement
• Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
• Overlaps data movement from the GPU with RDMA transfers

At Sender: MPI_Send(s_devbuf, size, ...);
At Receiver: MPI_Recv(r_devbuf, size, ...);
(data movement is handled inside MVAPICH2; a self-contained sketch follows below)

High Performance and High Productivity
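A self-contained, hedged version of the snippet above: with a CUDA-aware MPI library such as MVAPICH2-GDR, device pointers can be passed directly to MPI_Send/MPI_Recv (message size and tag are arbitrary; run with 2 ranks):

/* Hedged CUDA-aware MPI sketch: device buffers passed straight to MPI.
 * The library (e.g., MVAPICH2-GDR) handles staging/GDR internally. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank;
    const int n = 1 << 20;                     /* arbitrary message size */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    float *devbuf;
    cudaMalloc((void **)&devbuf, n * sizeof(float));
    cudaMemset(devbuf, 0, n * sizeof(float));

    if (rank == 0)
        MPI_Send(devbuf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);        /* s_devbuf */
    else if (rank == 1)
        MPI_Recv(devbuf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                                  /* r_devbuf */

    cudaFree(devbuf);
    MPI_Finalize();
    return 0;
}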

CUDA-Aware MPI: MVAPICH2-GDR 1.8-2.2 Releases

• Support for MPI communication from NVIDIA GPU device memory
• High-performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU)
• High-performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU)
• Takes advantage of CUDA IPC (available since CUDA 4.1) for intra-node communication with multiple GPU adapters/node
• Optimized and tuned collectives for GPU device buffers
• MPI datatype support for point-to-point and collective communication from GPU device buffers
• Unified memory

Performance of MVAPICH2-GPU with GPU-Direct RDMA (GDR)

[Charts: GPU-GPU inter-node latency (2.18 us small-message latency, about 10X improvement), bandwidth (annotated 11X and 3X), and bi-directional bandwidth (annotated 11X and 2X) for MV2-GDR 2.2 vs. MV2-GDR 2.0b and MV2 without GDR]

Platform: MVAPICH2-GDR 2.2; Intel Ivy Bridge (E5-2680 v2) node with 20 cores; NVIDIA Tesla K40c GPU; Mellanox ConnectX-4 EDR HCA; CUDA 8.0; Mellanox OFED 3.0 with GPU-Direct RDMA

Application-Level Evaluation (HOOMD-blue)

[Charts: average time steps per second (TPS) for 64K and 256K particles vs. number of processes — MV2 vs. MV2+GDR, about 2X improvement]

• Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
• HoomdBlue version 1.0.5
• GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384

Application-Level Evaluation (Cosmo) and Weather Forecasting in Switzerland

[Charts: normalized execution time vs. number of GPUs on the Wilkes GPU cluster and the CSCS GPU cluster — Default vs. Callback-based vs. Event-based designs]

• 2X improvement on 32 GPU nodes
• 30% improvement on 96 GPU nodes (8 GPUs/node)
• Cosmo model: http://www2.cosmo-model.org/content/tasks/operational/meteoSwiss/
• On-going collaboration with CSCS and MeteoSwiss (Switzerland) in co-designing MV2-GDR and the Cosmo application

C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS'16

Need for Non-Uniform Memory Allocation in OpenSHMEM for Heterogeneous Architectures

• MIC cores have limited memory per core
• OpenSHMEM relies on symmetric memory, allocated using shmalloc()
• shmalloc() allocates the same amount of memory on all PEs
• For applications running in symmetric mode, this limits the total heap size
• Similar issues for applications (even host-only) with memory load imbalance (Graph500, out-of-core Sort, etc.)
• How to allocate different amounts of memory on host and MIC cores, and still be able to communicate?

[Figure: host memory and host cores vs. MIC memory and MIC cores, illustrating the difference in memory per core]

OpenSHMEM Design for MIC Clusters

• Non-Uniform Memory Allocation:
  – Team-based memory allocation (proposed extensions):

    void shmem_team_create(shmem_team_t team, int *ranks, int size, shmem_team_t *newteam);
    void shmem_team_destroy(shmem_team_t *team);
    void shmem_team_split(shmem_team_t team, int color, int key, shmem_team_t *newteam);
    int shmem_team_rank(shmem_team_t team);
    int shmem_team_size(shmem_team_t team);
    void *shmalloc_team(shmem_team_t team, size_t size);
    void shfree_team(shmem_team_t team, void *addr);

  – Address structure for non-uniform memory allocations

[Stack: OpenSHMEM applications (co-design) -> OpenSHMEM extensions / programming model -> MVAPICH2-X OpenSHMEM runtime (symmetric memory manager, proxy-based communication) -> InfiniBand channel, shared memory/CMA channel, SCIF channel -> multi/many-core architectures with memory heterogeneity, InfiniBand networks]
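A hedged usage sketch of the proposed team-based extensions above; these are MVAPICH2-X research extensions rather than standard OpenSHMEM, and the world-team handle and heap sizes below are assumptions made for illustration:

/* Hedged usage sketch of the proposed team extensions.  The predefined
 * world-team handle is passed in as a parameter because its name is not
 * specified here; the color scheme and sizes are illustrative. */
#include <shmem.h>

void allocate_per_team_heap(shmem_team_t world_team, int running_on_mic)
{
    shmem_team_t my_team;

    /* Split PEs into a host team and a MIC team */
    shmem_team_split(world_team, running_on_mic ? 1 : 0,
                     shmem_my_pe(), &my_team);

    /* Allocate a different amount of symmetric memory per team:
     * a large heap on host PEs, a small heap on memory-constrained MIC PEs */
    size_t bytes = running_on_mic ? (64UL << 20) : (1UL << 30);
    void *buf = shmalloc_team(my_team, bytes);

    /* ... communicate within the team using buf ... */

    shfree_team(my_team, buf);
    shmem_team_destroy(&my_team);
}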

Proxy-based Designs for OpenSHMEM

[Diagrams: OpenSHMEM Put and Get using an active proxy — the request travels over IB between hosts, data is staged between MIC and host over SCIF, and the transfer completes with an IB write/read followed by a FIN message]

• Current-generation architectures impose limitations on read bandwidth when the HCA reads from MIC memory
  – Impacts both put and get operation performance
• Solution: pipelined data transfer by a proxy running on the host using the IB and SCIF channels
• Improves latency and bandwidth!

OpenSHMEM Put/Get Performance

[Charts: OpenSHMEM Put and Get latency (us) vs. message size — MV2X-Def vs. MV2X-Opt, up to 4.5X and 4.6X improvement]

• Proxy-based designs alleviate hardware limitations
• Put latency for a 4M message: Default: 3911 us, Optimized: 838 us
• Get latency for a 4M message: Default: 3889 us, Optimized: 837 us

Performance Evaluations using Graph500

[Charts: Graph500 execution time (s) vs. number of processes, MV2X-Def vs. MV2X-Opt — native mode (8 procs/MIC) and symmetric mode (16 host + 16 MIC processes per node), up to 28% improvement]

• Graph500 execution time (native mode):
  – 8 processes per MIC node
  – At 512 processes: Default: 5.17 s, Optimized: 4.96 s
  – Performance improvement from the MIC-aware collectives design
• Graph500 execution time (symmetric mode):
  – 16 processes on each host and MIC node
  – At 1,024 processes: Default: 15.91 s, Optimized: 12.41 s
  – Performance improvement from MIC-aware collectives and proxy-based designs

Graph500 Evaluations with Extensions

[Chart: execution time (s) vs. number of processes — Host-only vs. Host+MIC (extensions), 26% improvement]

• Redesigned Graph500 using the MIC to overlap computation and communication
  – Data is transferred to MIC memory; MIC cores pre-process the received data
  – Host processes traverse vertices and send out new vertices
• Graph500 execution time at 1,024 processes:
  – 16 processes on each host and MIC node
  – Host-only: 0.33 s, Host+MIC with extensions: 0.26 s
• Magnitudes of improvement compared to the default symmetric mode
  – Default symmetric mode: 12.1 s, Host+MIC extensions: 0.16 s

J. Jose, K. Hamidouche, X. Lu, S. Potluri, J. Zhang, K. Tomko and D. K. Panda, High Performance OpenSHMEM for Intel MIC Clusters: Extensions, Runtime Designs and Application Co-Design, IEEE International Conference on Cluster Computing (CLUSTER '14) (Best Paper Finalist)

Looking into the Future ....

• Architectures for exascale systems are evolving
• Exascale systems will be constrained by
  – Power
  – Memory per core
  – Data movement cost
  – Faults
• Programming models, runtimes and middleware need to be designed for
  – Scalability
  – Performance
  – Fault-resilience
  – Energy-awareness
  – Programmability
  – Productivity
• High-performance and scalable MPI+X libraries are needed
• Highlighted some of the approaches taken by the MVAPICH2 project
• Need continuous innovation to have the right MPI+X libraries for exascale systems

Funding Acknowledgments

Funding Support by

Equipment Support by

Personnel Acknowledgments

Current Students: A. Awan (Ph.D.), M. Bayatpour (Ph.D.), S. Chakraborthy (Ph.D.), C.-H. Chu (Ph.D.), S. Guganani (Ph.D.), J. Hashmi (Ph.D.), N. Islam (Ph.D.), M. Li (Ph.D.), M. Rahman (Ph.D.), D. Shankar (Ph.D.), A. Venkatesh (Ph.D.), J. Zhang (Ph.D.)

Current Research Scientists: K. Hamidouche, X. Lu, H. Subramoni

Current Research Specialist: J. Smith

Past Students: A. Augustine (M.S.), P. Balaji (Ph.D.), S. Bhagvat (M.S.), A. Bhat (M.S.), D. Buntinas (Ph.D.), L. Chai (Ph.D.), B. Chandrasekharan (M.S.), N. Dandapanthula (M.S.), V. Dhanraj (M.S.), T. Gangadharappa (M.S.), K. Gopalakrishnan (M.S.), W. Huang (Ph.D.), W. Jiang (M.S.), J. Jose (Ph.D.), K. Kandalla (Ph.D.), S. Kini (M.S.), M. Koop (Ph.D.), S. Krishnamoorthy (M.S.), K. Kulkarni (M.S.), R. Kumar (M.S.), P. Lai (M.S.), J. Liu (Ph.D.), M. Luo (Ph.D.), A. Mamidala (Ph.D.), G. Marsh (M.S.), V. Meshram (M.S.), A. Moody (M.S.), S. Naravula (Ph.D.), R. Noronha (Ph.D.), X. Ouyang (Ph.D.), S. Pai (M.S.), S. Potluri (Ph.D.), R. Rajachandrasekar (Ph.D.), G. Santhanaraman (Ph.D.), A. Singh (Ph.D.), J. Sridhar (M.S.), H. Subramoni (Ph.D.), S. Sur (Ph.D.), K. Vaidyanathan (Ph.D.), A. Vishnu (Ph.D.), J. Wu (Ph.D.), W. Yu (Ph.D.)

Past Research Scientist: S. Sur

Past Programmers: M. Arnold, D. Bureddy, J. Perkins

Past Post-Docs: D. Banerjee, X. Besseron, H.-W. Jin, J. Lin, M. Luo, E. Mancini, S. Marcarelli, J. Vienne, H. Wang

OSU Team Will be Participating in Multiple Events at SC '16

• Three conference tutorials (IB+HSE, IB+HSE Advanced, Big Data)
• HP-CAST
• Technical papers (SC main conference; Doctoral Showcase; Poster; PDSW-DISC, PAW, COMHPC, and ESPM2 workshops)
• Booth presentations (Mellanox, NVIDIA, NRL, PGAS)
• HPC Connection Workshop
• Will be stationed at the Ohio Center/OH-TECH booth (#1107)
  – Multiple presentations and demos
• More details at http://mvapich.cse.ohio-state.edu/talks/

Thank You!

Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/

The MVAPICH Project http://mvapich.cse.ohio-state.edu/

[email protected], [email protected]