
High-performance and Scalable MPI+X Library for Emerging HPC Clusters & Cloud Platforms

Talk at HPC Developer Conference (SC ‘17) by

Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda

Hari Subramoni
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~subramon

High-End Computing (HEC): ExaFlop & ExaByte

• ExaFlop & HPC: 100-200 PFlops in 2016-2018; 1 EFlops in 2023-2024?
• ExaByte & BigData: 10K-20K EBytes in 2016-2018; 40K EBytes in 2020?

Drivers of Modern HPC Cluster Architectures

• Multi-core/many-core technologies
• High Performance Interconnects – InfiniBand, Omni-Path (<1 usec latency, 100 Gbps bandwidth)
• Accelerators / Coprocessors – high compute density, high performance/watt (>1 TFlop DP on a chip)
• High Performance Storage and Compute devices – SSD, NVMe-SSD, NVRAM
• MPI is used by the vast majority of HPC applications

Example systems: Sunway TaihuLight, K Computer, Tianhe-2, Titan

Designing Communication Libraries for Multi-Petaflop and Exaflop Systems: Challenges

Application Kernels/Applications

Programming Models: MPI, PGAS (UPC, OpenSHMEM), CUDA, OpenMP, OpenACC, Hadoop (MapReduce), Spark (RDD, DAG), etc.
(Co-design opportunities and challenges across the various layers)

Communication Library or Runtime for Programming Models:
  – Point-to-point Communication
  – Collective Communication
  – Energy-Awareness
  – Synchronization and Locks
  – I/O and File Systems
  – Fault Tolerance
Cross-cutting concerns: Performance, Fault-Resilience

Networking Technologies (InfiniBand, 40/100GigE, Aries, and Omni-Path) | Multi/Many-core Architectures | Accelerators (GPU and FPGA)

Exascale Programming Models

Next-Generation Programming Models: MPI+X, where X = ? (OpenMP, OpenACC, CUDA, PGAS, Tasks, ...)

• The community believes the exascale programming model will be MPI+X
• But what is X?
  – Can it be just OpenMP?
• Many different environments and systems are emerging (heterogeneous computing with accelerators, highly-threaded systems such as KNL, irregular communications)
  – Different `X' will satisfy the respective needs

MPI+X Programming Model: Broad Challenges at Exascale

• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
  – Scalable job start-up
• Scalable collective communication
  – Offload
  – Non-blocking
  – Topology-aware
• Balancing intra-node and inter-node communication for next generation nodes (128-1024 cores)
  – Multiple end-points per node
• Support for efficient multi-threading
• Integrated support for GPGPUs and FPGAs
• Fault-tolerance/resiliency
• QoS support for communication and I/O
• Support for hybrid MPI+PGAS programming (MPI + OpenMP, MPI + UPC, MPI + UPC++, MPI + OpenSHMEM, CAF, ...)
• Virtualization
• Energy-Awareness

Overview of the MVAPICH2 Project
• High Performance open-source MPI Library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
  – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), started in 2001, first version available in 2002
  – MVAPICH2-X (MPI + PGAS), available since 2011
  – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
  – Support for Virtualization (MVAPICH2-Virt), available since 2015
  – Support for Energy-Awareness (MVAPICH2-EA), available since 2015
  – Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
  – Used by more than 2,825 organizations in 85 countries
  – More than 432,000 (> 0.4 million) downloads from the OSU site directly
  – Empowering many TOP500 clusters (June '17 ranking)
    • 1st, 10,649,600-core (Sunway TaihuLight) at National Supercomputing Center in Wuxi, China
    • 15th, 241,108-core (Pleiades) at NASA
    • 20th, 462,462-core (Stampede) at TACC
    • 44th, 74,520-core (Tsubame 2.5) at Tokyo Institute of Technology
  – Available with stacks of many vendors and Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
  – System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) ->
  – Sunway TaihuLight (1st in Jun '17, 10M cores, 100 PFlops)

MVAPICH2 Software Family
High-Performance Parallel Programming Libraries
  – MVAPICH2: Support for InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE
  – MVAPICH2-X: Advanced MPI features, OSU INAM, PGAS (OpenSHMEM, UPC, UPC++, and CAF), and MPI+PGAS programming models with unified communication runtime
  – MVAPICH2-GDR: Optimized MPI for clusters with GPUs
  – MVAPICH2-Virt: High-performance and scalable MPI for hypervisor and container based HPC cloud
  – MVAPICH2-EA: Energy-aware and high-performance MPI
  – MVAPICH2-MIC: Optimized MPI for clusters with Intel KNC
Microbenchmarks
  – OMB: Microbenchmark suite to evaluate MPI and PGAS (OpenSHMEM, UPC, and UPC++) libraries for CPUs and GPUs
Tools
  – OSU INAM: Network monitoring, profiling, and analysis for clusters with MPI and scheduler integration
  – OEMT: Utility to measure the energy consumption of MPI applications

Outline

• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication
  – Scalable start-up
  – Dynamic and adaptive communication protocols and tag matching
  – Optimized collectives using SHArP and multi-leaders
  – Optimized CMA-based collectives

• Hybrid MPI+PGAS Models for Irregular Applications

• Heterogeneous Computing with Accelerators

• HPC and Cloud

Startup Performance on KNL + Omni-Path

Figure: MPI_Init and Hello World time (seconds) vs. number of processes (64 to 64K) on Oakforest-PACS with MVAPICH2-2.3a.

• MPI_Init takes 22 seconds on 229,376 processes on 3,584 KNL nodes (Stampede2 – full scale)
• 8.8 times faster than Intel MPI at 128K processes (Courtesy: TACC)
• At 64K processes, MPI_Init and Hello World take 5.8s and 21s, respectively (Oakforest-PACS)
• All numbers reported with 64 processes per node
New designs available in latest MVAPICH2 libraries and as patch for SLURM-15.08.8 and SLURM-16.05.1
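As a concrete reference for what these numbers measure, the following minimal sketch (not from the talk) times MPI_Init and a full "Hello World" across all ranks; the barrier-based definition of the Hello World time and the POSIX clock are my assumptions.

```c
/* Minimal sketch (illustrative, not the benchmark used in the talk):
 * time MPI_Init and a full "Hello World" across all ranks.
 * A POSIX monotonic clock is used because MPI_Wtime() is only
 * guaranteed to be available after MPI_Init. */
#include <mpi.h>
#include <stdio.h>
#include <time.h>

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(int argc, char **argv)
{
    double t0 = now_sec();
    MPI_Init(&argc, &argv);
    double t_init = now_sec() - t0;

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Barrier(MPI_COMM_WORLD);       /* everyone has reached "Hello World" */
    double t_hello = now_sec() - t0;

    if (rank == 0)
        printf("MPI_Init: %.2f s, Hello World: %.2f s\n", t_init, t_hello);

    MPI_Finalize();
    return 0;
}
```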

Dynamic and Adaptive MPI Point-to-point Communication Protocols

Desired eager threshold for an example communication pattern (processes 0-3 on Node 1 paired with processes 4-7 on Node 2):

  Pair    Desired Eager Threshold (KB)
  0 – 4   32
  1 – 5   64
  2 – 6   128
  3 – 7   32

Eager threshold actually used for this pattern by the different designs:
  – Default: 16 KB for every pair
  – Manually Tuned: 128 KB for every pair
  – Dynamic + Adaptive: 32 KB, 64 KB, 128 KB, and 32 KB, matching the desired per-pair values

  Design               Overlap / Memory                           Performance / Productivity
  Default              Poor overlap; low memory requirement       Low performance; high productivity
  Manually Tuned       Good overlap; high memory requirement      High performance; low productivity
  Dynamic + Adaptive   Good overlap; optimal memory requirement   High performance; high productivity

Figures: wall clock time (seconds) and relative memory consumption of Amber at 128-1K processes, comparing Default, Threshold=17K, Threshold=64K, Threshold=128K, and Dynamic Threshold.

H. Subramoni, S. Chakraborty, and D. K. Panda, Designing Dynamic & Adaptive MPI Point-to-Point Communication Protocols for Efficient Overlap of Computation & Communication, ISC'17 (Best Paper)
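To make the eager-threshold trade-off concrete, here is a hedged sketch (mine, not code from the paper) of the overlap pattern the protocol targets: a non-blocking exchange followed by independent computation. Messages at or below the eager threshold can complete without the receiver's participation, while larger ones fall back to a rendezvous protocol and overlap poorly; in MVAPICH2 the manual knob is the MV2_IBA_EAGER_THRESHOLD environment variable (it also appears in the HOOMD-blue run parameters later in this talk), and the dynamic design picks the value per pair automatically.

```c
/* Sketch: computation/communication overlap whose effectiveness depends on
 * the per-pair eager threshold. Assumes an even number of ranks. */
#include <mpi.h>
#include <stdlib.h>

static void compute_independent_work(double *x, int n)
{
    for (int i = 0; i < n; i++)          /* stand-in for application work */
        x[i] = x[i] * 0.5 + 1.0;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int peer = rank ^ 1;                  /* pair ranks 0-1, 2-3, ... */

    int count = 16384;                    /* 128 KB of doubles: eager under a
                                             128 KB threshold, rendezvous
                                             under a 32 KB one */
    double *sendbuf = calloc(count, sizeof(double));
    double *recvbuf = calloc(count, sizeof(double));
    double *work    = calloc(4096,  sizeof(double));

    MPI_Request reqs[2];
    MPI_Irecv(recvbuf, count, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, count, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);

    /* Overlap window: this is where a well-chosen eager threshold (or an
     * adaptive protocol) lets communication progress alongside computation. */
    compute_independent_work(work, 4096);

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    free(sendbuf); free(recvbuf); free(work);
    MPI_Finalize();
    return 0;
}
```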

Dynamic and Adaptive Tag Matching

Challenge:
  – Tag matching is a significant overhead for receivers
  – Existing solutions are static and do not adapt dynamically to the communication pattern
  – They do not consider memory overhead

Solution:
  – A new tag matching design that dynamically adapts to communication patterns
  – Uses different strategies for different ranks
  – Decisions are based on the number of request objects that must be traversed before hitting the required one

Results:
  – Better performance than other state-of-the-art tag-matching schemes
  – Minimum memory consumption
  – Will be available in future MVAPICH2 releases

Figures: normalized total tag matching time at 512 processes and normalized memory overhead per process at 512 processes, both relative to the default design (lower is better).

M. Bayatpour, H. Subramoni, S. Chakraborty, and D. K. Panda, Adaptive and Dynamic Design for MPI Tag Matching, IEEE Cluster 2016 (Best Paper Nominee)
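For context, the overhead being attacked here comes from walking the posted-receive (and unexpected-message) queues during matching. The hypothetical micro-example below (my illustration, not the design from the paper) pre-posts many receives with distinct tags and then delivers messages in reverse tag order, so each arrival matches deep in the queue.

```c
/* Sketch: a receive pattern that stresses MPI tag matching.
 * Run with at least two ranks. */
#include <mpi.h>
#include <stdlib.h>

#define NREQ 1024

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int *bufs = calloc(NREQ, sizeof(int));
    MPI_Request reqs[NREQ];

    if (rank == 0) {
        /* Pre-post NREQ receives with distinct tags: each one adds a
         * request object to the matching queue. */
        for (int t = 0; t < NREQ; t++)
            MPI_Irecv(&bufs[t], 1, MPI_INT, 1, t, MPI_COMM_WORLD, &reqs[t]);
        MPI_Waitall(NREQ, reqs, MPI_STATUSES_IGNORE);
    } else if (rank == 1) {
        /* Send in reverse tag order: every arriving message forces a long
         * traversal before the matching entry is found. */
        for (int t = NREQ - 1; t >= 0; t--)
            MPI_Send(&t, 1, MPI_INT, 0, t, MPI_COMM_WORLD);
    }

    free(bufs);
    MPI_Finalize();
    return 0;
}
```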

Advanced Allreduce Collective Designs Using SHArP and Multi-Leaders

Figures: OSU micro-benchmark MPI_Allreduce latency (us) for 4-byte to 4 KB messages (16 nodes, 28 PPN) and HPCG communication latency (seconds) at 56, 224, and 448 processes (28 PPN), comparing MVAPICH2, the proposed socket-based design, and MVAPICH2+SHArP (lower is better).

• The socket-based design can reduce the communication latency by 23% and 40% on Xeon + IB nodes
• Support is available in MVAPICH2 2.3a and MVAPICH2-X 2.3b

M. Bayatpour, S. Chakraborty, H. Subramoni, X. Lu, and D. K. Panda, Scalable Reduction Collectives with Data Partitioning-based Multi-Leader Design, Supercomputing '17.
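For reference, the operation being optimized is the plain MPI_Allreduce call; the sketch below (my example, in the spirit of the OSU micro-benchmark rather than its actual source) times it for one message size.

```c
/* Sketch: timing MPI_Allreduce for one message size, roughly what the
 * OSU micro-benchmark (osu_allreduce) reports. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int count = 32 * 1024 / sizeof(float);      /* 32 KB message */
    float *sendbuf = calloc(count, sizeof(float));
    float *recvbuf = calloc(count, sizeof(float));
    int iters = 1000;

    /* Warm-up, then timed iterations. */
    MPI_Allreduce(sendbuf, recvbuf, count, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
    MPI_Barrier(MPI_COMM_WORLD);

    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Allreduce(sendbuf, recvbuf, count, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
    double avg_us = (MPI_Wtime() - t0) / iters * 1e6;

    if (rank == 0)
        printf("MPI_Allreduce, %d bytes: %.2f us\n",
               (int)(count * sizeof(float)), avg_us);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```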

Performance of MPI_Allreduce on Stampede2 (10,240 Processes)

Figures: OSU micro-benchmark MPI_Allreduce latency (us) for 4-byte to 4 KB and 8 KB to 256 KB messages (64 PPN), comparing MVAPICH2, MVAPICH2-OPT, and Intel MPI (IMPI); MVAPICH2-OPT is up to 2.4X better.

• MPI_Allreduce latency with 32K bytes reduced by 2.4X

Performance of MiniAMR Application on Stampede2 and Bridges

Figures: MiniAMR latency (s) vs. number of processes on Stampede2 (32 PPN, 512-4096 processes) and Bridges (28 PPN, 448-1792 processes), comparing MVAPICH2, the optimized design (DPML / MVAPICH2-OPT), and Intel MPI (IMPI).

• For the MiniAMR application with 2,048 processes, MVAPICH2-OPT can reduce the latency by 2.6X on Stampede2
• On Bridges, with 1,792 processes, MVAPICH2-OPT can reduce the latency by 1.5X

Enhanced MPI_Bcast with Optimized CMA-based Design

Figures: MPI_Bcast latency (us) vs. message size (1 KB to 4 MB) on KNL (64 processes), Broadwell (28 processes), and Power8 (160 processes), comparing MVAPICH2-2.3a, Intel MPI 2017, OpenMPI 2.1.0, and the proposed design; the proposed design switches from SHMEM to CMA as the message size grows.

• Up to 2x - 4x improvement over the existing implementation for 1 MB messages
• Up to 1.5x - 2x faster than Intel MPI and Open MPI for 1 MB messages
• Improvements obtained for large messages only
  – p-1 copies with CMA, p copies with SHMEM
  – Fallback to SHMEM for small messages
• Support is available in MVAPICH2-X 2.3b

S. Chakraborty, H. Subramoni, and D. K. Panda, Contention Aware Kernel-Assisted MPI Collectives for Multi/Many-core Systems, IEEE Cluster '17 (Best Paper Finalist)
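The call being accelerated is an ordinary MPI_Bcast of a large buffer; a minimal hedged sketch of the measured pattern follows (the 1 MB size is illustrative and matches the message range where the CMA design helps most).

```c
/* Sketch: broadcasting a large (1 MB) buffer, the case where the
 * CMA-based intra-node design helps most. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    size_t bytes = 1 << 20;                 /* 1 MB */
    char *buf = malloc(bytes);
    if (rank == 0)
        for (size_t i = 0; i < bytes; i++)  /* root fills the payload */
            buf[i] = (char)(i & 0xff);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    MPI_Bcast(buf, (int)bytes, MPI_CHAR, 0, MPI_COMM_WORLD);
    double t = MPI_Wtime() - t0;

    if (rank == 0)
        printf("MPI_Bcast of %zu bytes took %.2f us\n", bytes, t * 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}
```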

Outline
• Scalability for million to billion processors
• Hybrid MPI+PGAS Models for Irregular Applications

• Heterogeneous Computing with Accelerators

• HPC and Cloud

Hybrid (MPI+PGAS) Programming

• Application sub-kernels can be re-written in MPI/PGAS based on communication characteristics (see the sketch after this list)
  – Diagram: an HPC application with kernels 1..N, each implemented over MPI or PGAS
• Benefits:
  – Best of Distributed Computing Model
  – Best of Shared Memory Computing Model
• Cons:
  – Two different runtimes
  – Need great care while programming
  – Prone to deadlock if not careful
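As a minimal sketch of the hybrid pattern (my illustration, assuming OpenSHMEM 1.2-style calls and that the runtime permits initializing both models, as a unified runtime such as MVAPICH2-X is designed to do), one "kernel" uses a one-sided OpenSHMEM put while another uses an MPI collective in the same executable.

```c
/* Sketch: one program mixing OpenSHMEM one-sided puts with an MPI
 * collective. Initialization order for hybrid programs is
 * implementation-specific; for MVAPICH2-X consult the user guide. */
#include <mpi.h>
#include <shmem.h>
#include <stdio.h>

static long received = 0;   /* symmetric variable: one copy per PE */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    shmem_init();

    int me = shmem_my_pe();
    int npes = shmem_n_pes();

    /* "PGAS kernel": one-sided put of my rank to my right neighbor. */
    long val = me;
    shmem_long_put(&received, &val, 1, (me + 1) % npes);
    shmem_barrier_all();            /* ensure all puts are visible */

    /* "MPI kernel": bulk-synchronous reduction over the received values. */
    long sum = 0;
    MPI_Allreduce(&received, &sum, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);
    if (me == 0)
        printf("sum of received ranks over %d PEs = %ld\n", npes, sum);

    shmem_finalize();
    MPI_Finalize();
    return 0;
}
```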

MVAPICH2-X for Hybrid MPI + PGAS Applications

MPI, OpenSHMEM, UPC, CAF, UPC++ or Hybrid (MPI + PGAS) Applications

OpenSHMEM Calls CAF Calls UPC Calls UPC++ Calls MPI Calls

Unified MVAPICH2-X Runtime

InfiniBand, RoCE, iWARP

• Unified communication runtime for MPI, UPC, OpenSHMEM, CAF, and UPC++, available with MVAPICH2-X 1.9 onwards (since 2012)
  – http://mvapich.cse.ohio-state.edu
• Feature highlights
  – Supports MPI(+OpenMP), OpenSHMEM, UPC, CAF, UPC++, MPI(+OpenMP) + OpenSHMEM, MPI(+OpenMP) + UPC
  – MPI-3 compliant, OpenSHMEM v1.0 standard compliant, UPC v1.2 standard compliant (with initial support for UPC 1.3), CAF 2008 standard (OpenUH), UPC++
  – Scalable inter-node and intra-node communication – point-to-point and collectives

UPC++ Collectives Performance

Figure: inter-node broadcast time (us) vs. message size on 64 nodes (1 PPN), comparing GASNet_MPI, GASNET_IBV, and MV2-X; MV2-X is up to 14x faster. Diagram: an MPI + {UPC++} application can run over the UPC++ runtime and GASNet interfaces with an MPI conduit, or over the MVAPICH2-X Unified Communication Runtime (UCR).

• Full and native support for hybrid MPI + UPC++ applications
• Better performance compared to the IBV and MPI conduits
• OSU Micro-Benchmarks (OMB) support for UPC++
• Available since MVAPICH2-X 2.2RC1

J. M. Hashmi, K. Hamidouche, and D. K. Panda, Enabling Performance Efficient Runtime Support for Hybrid MPI+UPC++ Programming Models, IEEE International Conference on High Performance Computing and Communications (HPCC 2016)

Application Level Performance with Graph500 and Sort

Figures: Graph500 execution time (s) at 4K-16K processes (MPI-Simple, MPI-CSC, MPI-CSR, Hybrid MPI+OpenSHMEM) and Sort execution time (s) from 500 GB on 512 processes up to 4 TB on 4K processes (MPI vs. Hybrid).

• Performance of the Hybrid (MPI+OpenSHMEM) Graph500 design
  – 8,192 processes: 2.4X improvement over MPI-CSR, 7.6X improvement over MPI-Simple
  – 16,384 processes: 1.5X improvement over MPI-CSR, 13X improvement over MPI-Simple
• Performance of the Hybrid (MPI+OpenSHMEM) Sort application
  – 4,096 processes, 4 TB input size: MPI – 2408 sec (0.16 TB/min); Hybrid – 1172 sec (0.36 TB/min); 51% improvement over the MPI design

J. Jose, S. Potluri, H. Subramoni, X. Lu, K. Hamidouche, K. Schulz, H. Sundar, and D. Panda, Designing Scalable Out-of-core Sorting with Hybrid MPI+PGAS Programming Models, PGAS'14

J. Jose, S. Potluri, K. Tomko, and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models, International Supercomputing Conference (ISC'13), June 2013

Accelerating MaTEx k-NN with Hybrid MPI and OpenSHMEM
• MaTEx: MPI-based machine learning algorithm library
• k-NN: a popular supervised algorithm for classification
• Hybrid designs:
  – Overlapped data flow; one-sided data transfer; circular-buffer structure

Figures: k-NN execution time for truncated KDD (30 MB) on 256 cores and full KDD (2.5 GB) on 512 cores; the hybrid design reduces execution time by 27.6% and 9.0%, respectively.

• Benchmark: KDD Cup 2010 (8,407,752 records, 2 classes, k=5)
• For the truncated KDD workload on 256 cores, execution time is reduced by 27.6%
• For the full KDD workload on 512 cores, execution time is reduced by 9.0%

J. Lin, K. Hamidouche, J. Zhang, X. Lu, A. Vishnu, and D. Panda, Accelerating k-NN Algorithm with Hybrid MPI and OpenSHMEM, OpenSHMEM 2015

Performance of PGAS Models on KNL using MVAPICH2-X

Figures: intra-node PUT and GET latency (us) vs. message size (1 byte to 1 MB), comparing shmem_put/shmem_get, upc_putmem/upc_getmem, and upcxx_async_put/upcxx_async_get.


• Intra-node performance of one-sided Put/Get operations of PGAS libraries/languages using the MVAPICH2-X communication conduit
• Near-native communication performance is observed on KNL
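The measurement behind these curves is a simple one-sided put loop; the sketch below (an illustration, not the OSU benchmark source, assuming OpenSHMEM 1.2-style shmem_malloc/shmem_putmem) shows the pattern for a single message size between two PEs on the same node.

```c
/* Sketch: intra-node one-sided put latency, in the spirit of the OSU
 * OpenSHMEM put benchmark. Run with 2 PEs on the same node. */
#include <shmem.h>
#include <stdio.h>
#include <sys/time.h>

#define ITERS 1000

static double now_us(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1e6 + tv.tv_usec;
}

int main(void)
{
    shmem_init();
    int me = shmem_my_pe();

    size_t bytes = 4096;                       /* one message size */
    char *buf = shmem_malloc(bytes);           /* symmetric heap buffer */
    char src[4096] = {0};

    shmem_barrier_all();
    if (me == 0) {
        double t0 = now_us();
        for (int i = 0; i < ITERS; i++) {
            shmem_putmem(buf, src, bytes, 1);  /* put into PE 1's buffer */
            shmem_quiet();                     /* wait for remote completion */
        }
        printf("put %zu bytes: %.2f us\n", bytes, (now_us() - t0) / ITERS);
    }
    shmem_barrier_all();

    shmem_free(buf);
    shmem_finalize();
    return 0;
}
```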

Optimized OpenSHMEM with AVX and MCDRAM: Application Kernels Evaluation

Figures: execution time (s) of the Heat-2D kernel (Jacobi method) and the Heat Image kernel at 16-128 processes, comparing KNL (Default), KNL (AVX-512), KNL (AVX-512+MCDRAM), and Broadwell.

• On heat-diffusion-based kernels, AVX-512 vectorization showed better performance (a sketch of such a kernel follows below)
• MCDRAM showed significant benefits on the Heat-Image kernel for all process counts; combined with AVX-512 vectorization, it showed up to 4X improved performance
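For context, the Heat-2D kernel's inner loop is a Jacobi stencil sweep like the minimal sketch below (mine, not the evaluated OpenSHMEM code); it is the kind of loop a compiler can auto-vectorize for AVX-512, e.g. with -xMIC-AVX512 (Intel) or -mavx512f (GCC). In the OpenSHMEM version, halo rows would be exchanged with one-sided puts between sweeps.

```c
/* Sketch: one Jacobi relaxation sweep of a 2-D heat kernel. The inner loop
 * is a straightforward candidate for AVX-512 auto-vectorization. */
#include <stddef.h>

void jacobi_sweep(const double *restrict u, double *restrict unew,
                  size_t nx, size_t ny)
{
    for (size_t i = 1; i < nx - 1; i++) {
        for (size_t j = 1; j < ny - 1; j++) {
            /* 5-point stencil: average of the four neighboring values. */
            unew[i * ny + j] = 0.25 * (u[(i - 1) * ny + j] +
                                       u[(i + 1) * ny + j] +
                                       u[i * ny + (j - 1)] +
                                       u[i * ny + (j + 1)]);
        }
    }
}
```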

Outline

• Scalability for million to billion processors

• Hybrid MPI+PGAS Models for Irregular Applications

• Heterogeneous Computing with Accelerators

• HPC and Cloud

GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU
• Standard MPI interfaces used for unified data movement
• Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
• Overlaps data movement from the GPU with RDMA transfers

At Sender:
    MPI_Send(s_devbuf, size, …);
At Receiver:
    MPI_Recv(r_devbuf, size, …);
(s_devbuf and r_devbuf are GPU device buffers; the data movement is handled inside MVAPICH2)

High Performance and High Productivity
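Expanded into a complete hedged example (sizes, tag, and rank pairing are illustrative), the same pattern looks like this; with MVAPICH2-GDR the GPU path is enabled at run time via MV2_USE_CUDA=1, as seen in the HOOMD-blue run parameters later in the talk.

```c
/* Sketch: CUDA-aware MPI point-to-point transfer directly from GPU device
 * buffers. Run with two ranks; compile with mpicc and link the CUDA runtime. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int count = 1 << 20;                 /* 1M floats (4 MB) */
    float *devbuf;
    cudaMalloc((void **)&devbuf, count * sizeof(float));
    cudaMemset(devbuf, 0, count * sizeof(float));

    if (rank == 0) {
        /* Pass the device pointer straight to MPI; no explicit
         * cudaMemcpy staging through a host buffer is required. */
        MPI_Send(devbuf, count, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(devbuf, count, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d floats into GPU memory\n", count);
    }

    cudaFree(devbuf);
    MPI_Finalize();
    return 0;
}
```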

CUDA-Aware MPI: MVAPICH2-GDR 1.8-2.3 Releases

• Support for MPI communication from NVIDIA GPU device memory
• High-performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU)
• High-performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU)
• Taking advantage of CUDA IPC (available since CUDA 4.1) in intra-node communication for multiple GPU adapters/node
• Optimized and tuned collectives for GPU device buffers
• MPI datatype support for point-to-point and collective communication from GPU device buffers
• Unified memory

Optimized MVAPICH2-GDR Design

Figures: GPU-GPU inter-node latency (1.88 us, up to 11X better), bi-bandwidth (up to 10x better), and bandwidth (up to 9x better) for 1-byte to 4K/8K messages, comparing MV2-(NO-GDR) and MV2-GDR-2.3a.

Platform: MVAPICH2-GDR-2.3a on an Intel Haswell (E5-2687W @ 3.10 GHz) node with 20 cores, NVIDIA Volta V100 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 9.0, and Mellanox OFED 4.0 with GPU-Direct-RDMA.

Application-Level Evaluation (HOOMD-blue)

Figures: average time steps per second (TPS) for 64K and 256K particles at 4-32 processes, comparing MV2 and MV2+GDR; MV2+GDR achieves about 2X the TPS.

• Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
• HoomdBlue version 1.0.5
• GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384

Application-Level Evaluation (Cosmo) and Weather Forecasting in Switzerland

Figures: normalized execution time on the Wilkes GPU cluster (4-32 GPUs) and the CSCS GPU cluster (16-96 GPUs), comparing Default, Callback-based, and Event-based designs.

• 2X improvement on 32 GPU nodes
• 30% improvement on 96 GPU nodes (8 GPUs/node)
Cosmo model: http://www2.cosmo-model.org/content/tasks/operational/meteoSwiss/
On-going collaboration with CSCS and MeteoSwiss (Switzerland) in co-designing MV2-GDR and the Cosmo application

C.-H. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS'16

Application Evaluation: Deep Learning Frameworks
• @ RI2 cluster, 16 GPUs, 1 GPU/node
  – Cognitive Toolkit (CNTK) [https://github.com/Microsoft/CNTK]

Figures: training time (s) for the AlexNet and VGG models at 8 and 16 GPU nodes (lower is better), comparing MV2-GDR-Knomial, MV2-GDR-Ring, and MCAST-GDR-Opt.

• Reduces latency by up to 24% and 15% for the AlexNet and VGG models
• Higher improvement can be observed for larger system sizes

C.-H. Chu, X. Lu, A. A. Awan, H. Subramoni, J. Hashmi, B. Elton, and D. K. Panda, Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning, ICPP'17.

MVAPICH2-GDS: Preliminary Results

Figures: latency (us) for Kernel+Send and Recv+Kernel, and overlap (%) with host computation/communication, for 1-byte to 4 KB messages, comparing Default MPI and Enhanced MPI+GDS.

• Latency oriented: able to hide the kernel launch overhead
  – 8-15% improvement compared to the default behavior
• Overlap: communication and computation tasks are offloaded asynchronously to the queue
  – 89% overlap with host computation at 128-byte message size

Platform: Intel Sandy Bridge, NVIDIA Tesla K40c, and Mellanox FDR HCA; CUDA 8.0, OFED 3.4; each kernel is ~50 us. Will be available in a public release soon.

Outline
• Scalability for million to billion processors

• Hybrid MPI+PGAS Models for Irregular Applications

• Heterogeneous Computing with Accelerators

• HPC and Cloud

Can HPC and Virtualization be Combined?
• Virtualization has many benefits
  – Fault-tolerance
  – Job migration
  – Compaction
• It has not been very popular in HPC due to the overhead associated with virtualization
• New SR-IOV (Single Root I/O Virtualization) support available with Mellanox InfiniBand adapters changes the field
• Enhanced MVAPICH2 support for SR-IOV
• MVAPICH2-Virt 2.2 supports:
  – OpenStack, Docker, and Singularity

J. Zhang, X. Lu, J. Jose, R. Shi, and D. K. Panda, Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters?, EuroPar'14
J. Zhang, X. Lu, J. Jose, M. Li, R. Shi, and D. K. Panda, High Performance MPI Library over SR-IOV enabled InfiniBand Clusters, HiPC'14
J. Zhang, X. Lu, M. Arnold, and D. K. Panda, MVAPICH2 Over OpenStack with SR-IOV: an Efficient Approach to build HPC Clouds, CCGrid'15

Application-Level Performance on Docker with MVAPICH2

Figures: NAS class D execution time (s) for MG, FT, EP, LU, and CG, and Graph 500 (scale 20, edgefactor 16) BFS execution time (ms) for 1x16, 2x8, and 4x4 container/process configurations, comparing Container-Def, Container-Opt, and Native.

• 64 containers across 16 nodes, pinning 4 cores per container
• Compared to Container-Def, up to 11% and 73% execution time reduction for NAS and Graph 500
• Compared to Native, less than 9% and 5% overhead for NAS and Graph 500

Application-Level Performance on Singularity with MVAPICH2

Figures: NPB Class D execution time (s) for CG, EP, FT, IS, LU, and MG, and Graph500 BFS execution time (ms) for problem sizes (scale, edgefactor) from (22,16) to (26,20), comparing Singularity and Native.

• 512 processes across 32 nodes
• Less than 7% and 6% overhead for NPB and Graph500, respectively

J. Zhang, X. Lu, and D. K. Panda, Is Singularity-based Container Technology Ready for Running MPI Applications on HPC Clouds?, UCC '17

Looking into the Future ...
• Architectures for Exascale systems are evolving
• Exascale systems will be constrained by
  – Power
  – Memory per core
  – Data movement cost
  – Faults
• Programming models, runtimes, and middleware need to be designed for
  – Scalability
  – Performance
  – Fault-resilience
  – Energy-awareness
  – Programmability
  – Productivity
• High-performance and scalable MPI+X libraries are needed
• Highlighted some of the approaches taken by the MVAPICH2 project
• Need continuous innovation to have the right MPI+X libraries for Exascale systems

Funding Acknowledgments
Funding Support by

Equipment Support by

Personnel Acknowledgments

Current Students: A. Awan (Ph.D.), M. Bayatpour (Ph.D.), S. Chakraborthy (Ph.D.), C.-H. Chu (Ph.D.), S. Guganani (Ph.D.), J. Hashmi (Ph.D.), N. Islam (Ph.D.), M. Li (Ph.D.), M. Rahman (Ph.D.), D. Shankar (Ph.D.), A. Venkatesh (Ph.D.), J. Zhang (Ph.D.)
Current Research Scientists: X. Lu, H. Subramoni
Current Research Specialists: J. Smith, M. Arnold
Current Post-doc: A. Ruhela

Past Students: W. Huang (Ph.D.), M. Luo (Ph.D.), R. Rajachandrasekar (Ph.D.), A. Augustine (M.S.), P. Balaji (Ph.D.), W. Jiang (M.S.), A. Mamidala (Ph.D.), G. Santhanaraman (Ph.D.), S. Bhagvat (M.S.), J. Jose (Ph.D.), G. Marsh (M.S.), A. Singh (Ph.D.), A. Bhat (M.S.), S. Kini (M.S.), V. Meshram (M.S.), J. Sridhar (M.S.), S. Sur (Ph.D.), D. Buntinas (Ph.D.), M. Koop (Ph.D.), A. Moody (M.S.), H. Subramoni (Ph.D.), L. Chai (Ph.D.), K. Kulkarni (M.S.), S. Naravula (Ph.D.), K. Vaidyanathan (Ph.D.), B. Chandrasekharan (M.S.), R. Kumar (M.S.), R. Noronha (Ph.D.), N. Dandapanthula (M.S.), S. Krishnamoorthy (M.S.), X. Ouyang (Ph.D.), A. Vishnu (Ph.D.), V. Dhanraj (M.S.), K. Kandalla (Ph.D.), S. Pai (M.S.), J. Wu (Ph.D.), T. Gangadharappa (M.S.), P. Lai (M.S.), S. Potluri (Ph.D.), W. Yu (Ph.D.), K. Gopalakrishnan (M.S.), J. Liu (Ph.D.)
Past Research Scientists: K. Hamidouche, S. Sur
Past: D. Bureddy, J. Perkins

Past Post-Docs – D. Banerjee – J. Lin – S. Marcarelli – X. Besseron – M. Luo – J. Vienne – H.-W. Jin – E. Mancini – H. Wang

Thank You!
[email protected]

Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/

The High-Performance MPI/PGAS Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/

Please join us for other events at SC'17

• Workshops
  – ESPM2 2017: Third International Workshop on Extreme Scale Programming Models and Middleware
• Tutorials
  – InfiniBand, Omni-Path, and High-Speed Ethernet for Dummies
  – InfiniBand, Omni-Path, and High-Speed Ethernet: Advanced Features, Challenges in Designing HEC Systems and Usage
• BoFs
  – MPICH BoF: MVAPICH2 Project: Latest Status and Future Plans
• Technical Talks
  – Designing and Building Efficient HPC Cloud with Modern Networking Technologies on Heterogeneous HPC Clusters
  – Co-designing MPI Runtimes and Deep Learning Frameworks for Scalable Distributed Training on GPU Clusters
  – High-Performance and Scalable Broadcast Schemes for Deep Learning on GPU Clusters
  – EReinit: Scalable and Efficient Fault-Tolerance for Bulk-Synchronous MPI Applications
  – An In-Depth Performance Characterization of CPU- and GPU-Based DNN Training on Modern Architectures
  – Scalable Reduction Collectives with Data Partitioning-Based Multi-Leader Design
• Booth Talks
  – Scalability and Performance of MVAPICH2 on OakForest-PACS
  – The MVAPICH2 Project: Latest Developments and Plans Towards Exascale Computing
  – Performance of PGAS Models on KNL: A Comprehensive Study with MVAPICH2-X
  – Exploiting Latest Networking and Accelerator Technologies for MPI, Streaming, and Deep Learning: An MVAPICH2-Based Approach
  – MVAPICH2-GDR Library: Pushing the Frontier of HPC and Deep Learning
  – MVAPICH2-GDR for HPC and Deep Learning
Please refer to http://mvapich.cse.ohio-state.edu/talks/ for more details
