Designing HPC, Big Data, Deep Learning, and Cloud Middleware for Exascale Systems: Challenges and Opportunities

Keynote Talk at SEA Symposium (April ‘18) by

Dhabaleswar K. (DK) Panda, The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda

High-End Computing (HEC): Towards Exascale

100 PFlops in 2016

1 EFlops in 2019-2020?

Expected to have an ExaFlop system in 2019-2020!

Increasing Usage of HPC, Big Data and Deep Learning

HPC (MPI, RDMA, Lustre, etc.), Big Data (Hadoop, Spark, HBase, Memcached, etc.), and Deep Learning (Caffe, TensorFlow, BigDL, etc.)

Convergence of HPC, Big Data, and Deep Learning! Increasing need to run these applications on the Cloud!!

Can We Run Big Data and Deep Learning Jobs on Existing HPC Infrastructure?

(Figure: Hadoop, Spark, and Deep Learning jobs running side by side on the same physical compute infrastructure)

HPC, Big Data, Deep Learning, and Cloud
• Traditional HPC
  – Message Passing Interface (MPI), including MPI + OpenMP
  – Support for PGAS and MPI + PGAS (OpenSHMEM, UPC)
  – Exploiting accelerators
• Big Data/Enterprise/Commercial Computing
  – Spark and Hadoop (HDFS, HBase, MapReduce)
  – Memcached is also used for Web 2.0
• Deep Learning
  – Caffe, CNTK, TensorFlow, and many more
• Cloud for HPC and Big Data
  – Virtualization with SR-IOV and containers

Parallel Programming Models Overview

(Figure: three abstract machine models, each with processes P1..P3 and their memories: Shared Memory Model (SHMEM, DSM), Message Passing Model (MPI), and Partitioned Global Address Space (PGAS) Model (Global Arrays, UPC, Chapel, X10, CAF, ...))

• Programming models provide abstract machine models
• Models can be mapped on different types of systems
  – e.g. Distributed Shared Memory (DSM), MPI within a node, etc.
• PGAS models and hybrid MPI+PGAS models are gradually gaining importance

Partitioned Global Address Space (PGAS) Models
• Key features
  – Simple shared-memory abstractions
  – Lightweight one-sided communication (illustrated in the sketch below)
  – Easier to express irregular communication
• Different approaches to PGAS
  – Languages: UPC, Co-Array Fortran (CAF), X10, Chapel
  – Libraries: OpenSHMEM, UPC++, Global Arrays
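To make the PGAS abstractions above concrete, here is a minimal OpenSHMEM sketch (not from the talk) of symmetric-heap allocation and a one-sided put; the buffer name and neighbor pattern are illustrative:

    #include <stdio.h>
    #include <shmem.h>

    int main(void) {
        shmem_init();
        int me   = shmem_my_pe();
        int npes = shmem_n_pes();

        /* Symmetric allocation: every PE gets a remotely accessible buffer */
        long *dst = (long *) shmem_malloc(sizeof(long));
        *dst = -1;
        shmem_barrier_all();

        /* One-sided put: no matching receive is posted on the target PE */
        shmem_long_p(dst, (long) me, (me + 1) % npes);

        shmem_barrier_all();
        printf("PE %d received %ld\n", me, *dst);

        shmem_free(dst);
        shmem_finalize();
        return 0;
    }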

Hybrid (MPI+PGAS) Programming

• Application sub-kernels can be re-written in MPI/PGAS based on communication characteristics
• Benefits:
  – Best of Distributed Computing Model
  – Best of Shared Memory Computing Model

(Figure: an HPC application whose sub-kernels are mapped to different models, e.g. Kernel 1 in MPI, Kernel 2 re-written in PGAS, Kernel 3 in MPI, ..., Kernel N re-written in PGAS)
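A hedged sketch of what such a hybrid program can look like, assuming a unified runtime (such as MVAPICH2-X, discussed later) that allows MPI and OpenSHMEM to be mixed in one job; the kernel names mirror the figure and are illustrative only:

    #include <mpi.h>
    #include <shmem.h>

    /* Kernel 1: regular exchange expressed with an MPI collective */
    static void kernel_mpi(double *local, double *global, int n) {
        MPI_Allreduce(local, global, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    }

    /* Kernel 2: irregular, one-sided updates expressed with OpenSHMEM */
    static void kernel_pgas(long *table, int npes) {
        int target = (shmem_my_pe() + 1) % npes;
        shmem_long_inc(&table[0], target);   /* remote atomic increment */
        shmem_barrier_all();
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        shmem_init();   /* both models active; requires a hybrid-capable runtime */

        double local = 1.0, global = 0.0;
        long *table = (long *) shmem_malloc(sizeof(long));
        *table = 0;
        shmem_barrier_all();

        kernel_mpi(&local, &global, 1);
        kernel_pgas(table, shmem_n_pes());

        shmem_free(table);
        shmem_finalize();
        MPI_Finalize();
        return 0;
    }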

Supporting Programming Models for Multi-Petaflop and Exaflop Systems: Challenges

(Figure: software stack with co-design opportunities and challenges across the various layers)
• Application kernels/applications
• Programming models: MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, OpenACC, Hadoop (MapReduce), Spark (RDD, DAG), etc.
• Communication library or runtime for programming models: point-to-point communication, collective communication, energy-awareness, synchronization and locks, I/O and file systems, fault tolerance, resilience, performance
• Networking technologies (InfiniBand, 40/100 GigE, Aries, and Omni-Path), multi-/many-core architectures, accelerators (GPU and FPGA)

Broad Challenges in Designing Communication Middleware for (MPI+X) at Exascale
• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
  – Scalable job start-up
• Scalable collective communication
  – Offload
  – Non-blocking (see the sketch below)
  – Topology-aware
• Balancing intra-node and inter-node communication for next generation nodes (128-1024 cores)
  – Multiple end-points per node
• Support for efficient multi-threading
• Integrated support for GPGPUs and accelerators
• Fault-tolerance/resiliency
• QoS support for communication and I/O
• Support for hybrid MPI+PGAS programming (MPI + OpenMP, MPI + UPC, MPI + OpenSHMEM, MPI + UPC++, CAF, ...)
• Virtualization
• Energy-awareness
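As one illustration of the non-blocking collectives item above, the following hedged sketch overlaps an MPI_Iallreduce with independent computation; buffer sizes and the compute routine are illustrative:

    #include <mpi.h>

    /* Placeholder for work that does not depend on the reduction result */
    static void independent_compute(double *buf, int n) {
        for (int i = 0; i < n; i++)
            buf[i] = buf[i] * 0.5 + 1.0;
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        enum { N = 1 << 20 };
        static double local[N], global[N], other[N];
        MPI_Request req;

        /* Start the reduction, then overlap it with independent work; the
           library (and hardware offload such as SHArP) can progress the
           collective while the application computes. */
        MPI_Iallreduce(local, global, N, MPI_DOUBLE, MPI_SUM,
                       MPI_COMM_WORLD, &req);
        independent_compute(other, N);
        MPI_Wait(&req, MPI_STATUS_IGNORE);

        MPI_Finalize();
        return 0;
    }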

Additional Challenges for Designing Exascale Software Libraries
• Extreme low memory footprint
  – Memory per core continues to decrease
• D-L-A Framework
  – Discover
    • Overall network topology (fat-tree, 3D, ...), network topology for the processes of a given job
    • Node architecture, health of network and nodes
  – Learn
    • Impact on performance and scalability
    • Potential for failure
  – Adapt
    • Internal protocols and algorithms
    • Process mapping
    • Fault-tolerance solutions
  – Low-overhead techniques while delivering performance, scalability and fault-tolerance

Overview of the MVAPICH2 Project
• High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
  – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1); started in 2001, first version available in 2002
  – MVAPICH2-X (MPI + PGAS), available since 2011
  – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
  – Support for virtualization (MVAPICH2-Virt), available since 2015
  – Support for energy-awareness (MVAPICH2-EA), available since 2015
  – Support for InfiniBand network analysis and monitoring (OSU INAM) since 2015
  – Used by more than 2,875 organizations in 86 countries
  – More than 460,000 (> 0.46 million) downloads from the OSU site directly
  – Empowering many TOP500 clusters (Nov '17 ranking)
    • 1st, 10,649,600-core (Sunway TaihuLight) at National Supercomputing Center in Wuxi, China
    • 9th, 556,104 cores (Oakforest-PACS) in Japan
    • 12th, 368,928-core (Stampede2) at TACC
    • 17th, 241,108-core (Pleiades) at NASA
    • 48th, 76,032-core (Tsubame 2.5) at Tokyo Institute of Technology
  – Available with the software stacks of many vendors and Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade

MVAPICH2 Release Timeline and Downloads

(Figure: cumulative number of downloads from 2004 to early 2018, approaching 500,000, annotated with release points from MVAPICH 0.9.0 through MVAPICH2 2.3rc1, MVAPICH2-X 2.3b, MVAPICH2-GDR 2.3a, MVAPICH2-Virt 2.2, and OSU INAM 0.9.3)

Architecture of MVAPICH2 Software Family

(Figure: MVAPICH2 software architecture)
• High-performance parallel programming models: Message Passing Interface (MPI), PGAS (UPC, OpenSHMEM, CAF, UPC++), and hybrid MPI + X (MPI + PGAS + OpenMP/Cilk)
• High-performance and scalable communication runtime with diverse APIs and mechanisms: point-to-point primitives, collectives algorithms, job startup, energy-awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, introspection and analysis
• Support for modern networking technologies (InfiniBand, iWARP, RoCE, Omni-Path) and modern multi-/many-core architectures (Intel Xeon, OpenPower, Xeon Phi, ARM, NVIDIA GPGPU)
• Transport protocols (RC, XRC, UD, DC, UMR, ODP) and transport mechanisms / modern features (SR-IOV, multi-rail, shared memory, CMA, IVSHMEM, XPMEM*, MCDRAM*, NVLink*, CAPI*)

* Upcoming

Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale
• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication
  – Scalable start-up
  – Optimized collectives using SHArP and multi-leaders
  – Optimized CMA-based collectives
  – Upcoming optimized XPMEM-based collectives
  – Performance engineering with MPI_T support
  – Integrated network analysis and monitoring
• Unified runtime for hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, UPC++, ...)
• Integrated support for GPGPUs
• Optimized MVAPICH2 for OpenPower (with NVLink) and ARM
• Application scalability and best practices

One-way Latency: MPI over IB with MVAPICH2

(Figure: small-message and large-message one-way latency for TrueScale-QDR, ConnectX-3-FDR, ConnectIB-DualFDR, ConnectX-5-EDR, and Omni-Path; small-message latencies are about 1 us, ranging from 0.98 us to 1.19 us across the adapters)

Platforms: TrueScale-QDR: 3.1 GHz deca-core (Haswell) Intel, PCI Gen3, IB switch; ConnectX-3-FDR: 2.8 GHz deca-core (IvyBridge) Intel, PCI Gen3, IB switch; ConnectIB-DualFDR: 3.1 GHz deca-core (Haswell) Intel, PCI Gen3, IB switch; ConnectX-5-EDR: 3.1 GHz deca-core (Haswell) Intel, PCI Gen3, IB switch; Omni-Path: 3.1 GHz deca-core (Haswell) Intel, PCI Gen3, Omni-Path switch

Bandwidth: MPI over IB with MVAPICH2

(Figure: unidirectional bandwidths of 3,373, 6,356, 12,358, 12,366, and 12,590 MB/s and bidirectional bandwidths of 6,228, 12,161, 21,983, 22,564, and 24,136 MB/s across the same five adapters)

Same platform configurations as above.
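Latency and bandwidth numbers like the ones above are typically gathered with OSU micro-benchmark-style ping-pong loops. The following is a stripped-down sketch of such a measurement (not the actual osu_latency source); the message size and iteration count are illustrative:

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define ITERS 10000

    int main(int argc, char **argv) {
        int rank, size_bytes = 4;          /* message size under test */
        char *buf;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        buf = (char *) malloc(size_bytes);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < ITERS; i++) {
            if (rank == 0) {
                MPI_Send(buf, size_bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, size_bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, size_bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, size_bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();

        /* One-way latency = half the round-trip time */
        if (rank == 0)
            printf("%d bytes: %.2f us\n", size_bytes,
                   (t1 - t0) * 1e6 / (2.0 * ITERS));

        free(buf);
        MPI_Finalize();
        return 0;
    }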

Startup Performance on KNL + Omni-Path

(Figure: MPI_Init and Hello World time vs. number of processes (64 to 64K) on Oakforest-PACS with MVAPICH2-2.3a; at 64K processes MPI_Init takes 5.8 s and Hello World 21 s)

• MPI_Init takes 22 seconds on 229,376 processes on 3,584 KNL nodes (Stampede2, full scale)
• 8.8 times faster than Intel MPI at 128K processes (Courtesy: TACC)
• At 64K processes, MPI_Init and Hello World take 5.8 s and 21 s, respectively (Oakforest-PACS)
• All numbers reported with 64 processes per node

New designs available since MVAPICH2-2.3a and as a patch for SLURM 15, 16, and 17
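For reference, the "MPI_Init & Hello World" times above correspond to a trivial program of roughly this shape (a sketch, not the exact benchmark used); the wall-clock timing around MPI_Init is illustrative:

    #include <mpi.h>
    #include <stdio.h>
    #include <sys/time.h>

    static double now_sec(void) {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec * 1e-6;
    }

    int main(int argc, char **argv) {
        double t_start = now_sec();
        MPI_Init(&argc, &argv);
        double t_init = now_sec();

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (rank == 0)
            printf("Hello from %d ranks, MPI_Init took %.2f s\n",
                   size, t_init - t_start);

        MPI_Finalize();
        return 0;
    }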

Advanced Allreduce Collective Designs Using SHArP

(Figure: average DDOT Allreduce time of HPCG and mesh refinement time of MiniAMR for (nodes, PPN) = (4,28), (8,28), (16,28); MVAPICH2-SHArP improves over MVAPICH2 by 12% for HPCG and 13% for MiniAMR)

SHArP support is available since MVAPICH2 2.3a

M. Bayatpour, S. Chakraborty, H. Subramoni, X. Lu, and D. K. Panda, Scalable Reduction Collectives with Data Partitioning-based Multi-Leader Design, Supercomputing '17.

Performance of MPI_Allreduce On Stampede2 (10,240 Processes)

(Figure: MPI_Allreduce latency for 4 bytes to 256 KB with the OSU micro-benchmark at 64 PPN, comparing MVAPICH2, MVAPICH2-OPT, and Intel MPI (IMPI); MVAPICH2-OPT shows up to 2.4X lower latency)

• For MPI_Allreduce latency with 32K bytes, MVAPICH2-OPT can reduce the latency by 2.4X
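The latencies in this comparison come from osu_allreduce-style loops over the collective. A minimal sketch of that measurement pattern, with an assumed 32 KB payload, is:

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        const int count = 32 * 1024 / sizeof(float);   /* 32 KB payload */
        const int iters = 1000;
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        float *sendbuf = (float *) malloc(count * sizeof(float));
        float *recvbuf = (float *) malloc(count * sizeof(float));
        for (int i = 0; i < count; i++) sendbuf[i] = 1.0f;

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++)
            MPI_Allreduce(sendbuf, recvbuf, count, MPI_FLOAT, MPI_SUM,
                          MPI_COMM_WORLD);
        double t1 = MPI_Wtime();

        if (rank == 0)
            printf("MPI_Allreduce (32 KB): %.2f us per call\n",
                   (t1 - t0) * 1e6 / iters);

        free(sendbuf);
        free(recvbuf);
        MPI_Finalize();
        return 0;
    }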

M. Bayatpour, S. Chakraborty, H. Subramoni, X. Lu, and D. K. Panda, Scalable Reduction Collectives with Data Partitioning-based Multi-Leader Design, Supercomputing '17. Available in MVAPICH2-X 2.3b.

Optimized CMA-based Collectives for Large Messages

(Figure: MPI_Gather latency on KNL nodes at 64 PPN for 2 nodes/128 procs, 4 nodes/256 procs, and 8 nodes/512 procs, comparing MVAPICH2-2.3a, Intel MPI 2017, OpenMPI 2.1.0, and the tuned CMA design; the tuned CMA design is ~2.5x, ~3.2x, and ~4x better, respectively, and up to ~17x better than OpenMPI)

• Significant improvement over the existing implementation for Scatter/Gather with 1 MB messages (up to 4x on KNL, 2x on Broadwell, 14x on OpenPower)
• New two-level algorithms for better scalability
• Improved performance for other collectives (Bcast, Allgather, and Alltoall)

S. Chakraborty, H. Subramoni, and D. K. Panda, Contention Aware Kernel-Assisted MPI Collectives for Multi/Many-core Systems, IEEE Cluster '17, Best Paper Finalist. Available in MVAPICH2-X 2.3b.

Shared Address Space (XPMEM)-based Collectives Design

(Figure: OSU_Allreduce and OSU_Reduce latency on Broadwell with 256 processes for 16 KB to 4 MB messages, comparing MVAPICH2-2.3b, IMPI-2017v1.132, and MVAPICH2-Opt; MVAPICH2-Opt is up to 1.8X better for Allreduce and up to 4X better for Reduce)

• "Shared Address Space"-based true zero-copy reduction collective designs in MVAPICH2
• Offloaded computation/communication to peer ranks in reduction collective operations
• Up to 4X improvement for 4MB Reduce and up to 1.8X improvement for 4M AllReduce

J. Hashmi, S. Chakraborty, M. Bayatpour, H. Subramoni, and D. Panda, Designing Efficient Shared Address Space Reduction Collectives for Multi-/Many-cores, International Parallel & Distributed Processing Symposium (IPDPS '18), May 2018. Will be available in a future release.

Performance Engineering Applications using MVAPICH2 and TAU
• Enhance existing support for MPI_T in MVAPICH2 to expose a richer set of performance and control variables
• Get and display MPI Performance Variables (PVARs) made available by the runtime in TAU
• Control the runtime's behavior via MPI Control Variables (CVARs)
• Introduced support for new MPI_T based CVARs to MVAPICH2: MPIR_CVAR_MAX_INLINE_MSG_SZ, MPIR_CVAR_VBUF_POOL_SIZE, MPIR_CVAR_VBUF_SECONDARY_POOL_SIZE
• TAU enhanced with support for setting MPI_T CVARs in a non-interactive mode for uninstrumented applications (see the sketch below)
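A hedged sketch of how an application or tool can query one of the MVAPICH2 CVARs named above through the standard MPI_T interface; it assumes the variable is integer-typed and only reads it (error handling omitted, and the variable index must be discovered at run time):

    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv) {
        int provided, ncvars, rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);

        MPI_T_cvar_get_num(&ncvars);
        for (int i = 0; i < ncvars; i++) {
            char name[256], desc[256];
            int name_len = sizeof(name), desc_len = sizeof(desc);
            int verbosity, bind, scope, count, value;
            MPI_Datatype dtype;
            MPI_T_enum etype;
            MPI_T_cvar_handle handle;

            MPI_T_cvar_get_info(i, name, &name_len, &verbosity, &dtype,
                                &etype, desc, &desc_len, &bind, &scope);
            if (strcmp(name, "MPIR_CVAR_VBUF_POOL_SIZE") != 0)
                continue;

            /* Assumes an integer-typed, non-object-bound CVAR */
            MPI_T_cvar_handle_alloc(i, NULL, &handle, &count);
            MPI_T_cvar_read(handle, &value);
            if (rank == 0)
                printf("%s = %d\n", name, value);
            MPI_T_cvar_handle_free(&handle);
        }

        MPI_T_finalize();
        MPI_Finalize();
        return 0;
    }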

(Figures: VBUF usage without and with CVAR-based tuning, as displayed by ParaProf)

S. Ramesh, A. Maheo, S. Shende, A. Malony, H. Subramoni, and D. K. Panda, MPI Performance Engineering with the MPI Tool Interface: the Integration of MVAPICH and TAU, EuroMPI 2017 [Best Paper Award].

Overview of OSU INAM
• A network monitoring and analysis tool that is capable of analyzing traffic on the InfiniBand network with inputs from the MPI runtime
  – http://mvapich.cse.ohio-state.edu/tools/osu-inam/
• Monitors IB clusters in real time by querying various subnet management entities and gathering input from the MPI runtimes
• Capability to analyze and profile node-level, job-level and process-level activities for MPI communication
  – Point-to-point, collectives and RMA
• Ability to filter data based on type of counters using a "drop down" list
• Remotely monitor various metrics of MPI processes at user-specified granularity
• "Job Page" to display jobs in ascending/descending order of various performance metrics in conjunction with MVAPICH2-X
• Visualize the data transfer happening in a "live" or "historical" fashion for the entire network, a job, or a set of nodes
• OSU INAM v0.9.3 released on 03/16/2018
  – Enhance INAMD to query end nodes based on command line option
  – Add a web page to display size of the database in real-time
  – Enhance interaction between the web application and SLURM job launcher for increased portability
  – Improve packaging of web application and daemon to ease installation

OSU INAM Features

(Figures: Comet@SDSC clustered view (1,879 nodes, 212 switches, 4,377 network links) and finding routes between nodes)
• Show network topology of large clusters
• Visualize traffic pattern on different links
• Quickly identify congested links / links in error state
• See the history unfold: play back historical state of the network

OSU INAM Features (Cont.)

(Figures: visualizing a job (5 nodes); estimated process-level link utilization)
• Job level view
  – Show different network metrics (load, error, etc.) for any live job
  – Play back historical data for completed jobs to identify bottlenecks
• Node level view: details per process or per node
  – CPU utilization for each rank/node
  – Bytes sent/received for MPI operations (pt-to-pt, collective, RMA)
  – Network metrics (e.g. XmitDiscard, RcvError) per rank/node
• Estimated Link Utilization view
  – Classify data flowing over a network link at different granularity in conjunction with MVAPICH2-X 2.2rc1
  – Job level and process level

Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale
• Scalability for million to billion processors
• Unified runtime for hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, UPC++, ...)
• Integrated support for GPGPUs
• Optimized MVAPICH2 for OpenPower (with NVLink) and ARM
• Application scalability and best practices

MVAPICH2-X for Hybrid MPI + PGAS Applications

• Current model: separate runtimes for OpenSHMEM/UPC/UPC++/CAF and MPI
  – Deadlock is possible if both runtimes are not progressed
  – Consumes more network resources
• Unified communication runtime for MPI, UPC, UPC++, OpenSHMEM, CAF
  – Available since 2012 (starting with MVAPICH2-X 1.9)
  – http://mvapich.cse.ohio-state.edu

Application Level Performance with Graph500 and Sort

(Figure: Graph500 execution time at 4K-16K processes for MPI-Simple, MPI-CSC, MPI-CSR, and Hybrid (MPI+OpenSHMEM); Sort execution time for MPI vs. Hybrid on 500GB-512 through 4TB-4K configurations)

• Performance of Hybrid (MPI+OpenSHMEM) Graph500 design
  – 8,192 processes: 2.4X improvement over MPI-CSR, 7.6X improvement over MPI-Simple
  – 16,384 processes: 1.5X improvement over MPI-CSR, 13X improvement over MPI-Simple
• Performance of Hybrid (MPI+OpenSHMEM) Sort application
  – 4,096 processes, 4 TB input size: MPI 2408 sec (0.16 TB/min); Hybrid 1172 sec (0.36 TB/min); 51% improvement over the MPI design

J. Jose, K. Kandalla, S. Potluri, J. Zhang and D. K. Panda, Optimizing Collective Communication in OpenSHMEM, Int'l Conference on Partitioned Global Address Space Programming Models (PGAS '13), October 2013.
J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models, International Supercomputing Conference (ISC '13), June 2013.
J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012.

Optimized OpenSHMEM with AVX and MCDRAM: Application Kernels Evaluation

(Figure: execution time of the Heat-2D kernel (Jacobi method) and the Heat Image kernel at 16-128 processes, comparing KNL (default), KNL (AVX-512), KNL (AVX-512+MCDRAM), and Broadwell)

• On heat diffusion based kernels, AVX-512 vectorization showed better performance
• MCDRAM showed significant benefits on the Heat-Image kernel for all process counts; combined with AVX-512 vectorization, it showed up to 4X improved performance
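For reference, the per-iteration update in a Heat-2D Jacobi kernel looks roughly like the serial sketch below; the inner loop is what AVX-512 vectorization accelerates, and keeping the two grids in MCDRAM helps because the sweep is memory-bandwidth bound. Array and function names are illustrative, and the actual benchmark distributes the grid across OpenSHMEM PEs:

    #include <stdlib.h>
    #include <stdio.h>

    /* One Jacobi sweep of a 2-D heat (diffusion) stencil */
    static void jacobi_sweep(int nx, int ny,
                             const double *restrict uold,
                             double *restrict unew)
    {
        for (int i = 1; i < nx - 1; i++)
            for (int j = 1; j < ny - 1; j++)
                unew[i * ny + j] = 0.25 * (uold[(i - 1) * ny + j] +
                                           uold[(i + 1) * ny + j] +
                                           uold[i * ny + (j - 1)] +
                                           uold[i * ny + (j + 1)]);
    }

    int main(void) {
        int nx = 1024, ny = 1024, iters = 100;
        double *a = calloc((size_t) nx * ny, sizeof(double));
        double *b = calloc((size_t) nx * ny, sizeof(double));

        /* Fixed hot boundary on the top row of both grids */
        for (int j = 0; j < ny; j++) { a[j] = 100.0; b[j] = 100.0; }

        for (int t = 0; t < iters; t++) {   /* ping-pong the two grids */
            jacobi_sweep(nx, ny, a, b);
            double *tmp = a; a = b; b = tmp;
        }
        printf("center value: %f\n", a[(nx / 2) * ny + ny / 2]);
        free(a); free(b);
        return 0;
    }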

Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale
• Scalability for million to billion processors
• Unified runtime for hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, UPC++, ...)
• Integrated support for GPGPUs
  – CUDA-aware MPI
  – GPUDirect RDMA (GDR) support
  – Support for streaming applications
• Optimized MVAPICH2 for OpenPower (with NVLink) and ARM
• Application scalability and best practices

GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU
• Standard MPI interfaces used for unified data movement
• Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
• Overlaps data movement from the GPU with RDMA transfers

At Sender:
    MPI_Send(s_devbuf, size, ...);

At Receiver:
    MPI_Recv(r_devbuf, size, ...);

(The movement of data from/to GPU memory is handled inside MVAPICH2.)

High Performance and High Productivity
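A slightly fuller, hedged version of the snippet above: with a CUDA-aware MPI such as MVAPICH2-GDR, the device pointer returned by cudaMalloc is passed straight to MPI_Send/MPI_Recv with no explicit staging through host memory (run-time features are enabled via parameters such as MV2_USE_CUDA=1, shown later in these slides). Message size and device selection are illustrative:

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv) {
        int rank;
        const size_t size = 4 * 1024 * 1024;   /* 4 MB message */
        void *devbuf;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        cudaSetDevice(0);                       /* one GPU per rank assumed */
        cudaMalloc(&devbuf, size);

        /* Device pointers go directly into MPI calls; the library moves the
           data via GPUDirect RDMA / pipelining internally. */
        if (rank == 0)
            MPI_Send(devbuf, (int) size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(devbuf, (int) size, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);

        cudaFree(devbuf);
        MPI_Finalize();
        return 0;
    }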

CUDA-Aware MPI: MVAPICH2-GDR 1.8-2.3 Releases
• Support for MPI communication from NVIDIA GPU device memory
• High-performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU)
• High-performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU)
• Taking advantage of CUDA IPC (available since CUDA 4.1) in intra-node communication for multiple GPU adapters/node
• Optimized and tuned collectives for GPU device buffers
• MPI datatype support for point-to-point and collective communication from GPU device buffers
• Unified memory

Optimized MVAPICH2-GDR Design

(Figure: GPU-GPU inter-node latency, bandwidth, and bi-bandwidth for MV2 (no GDR) vs. MV2-GDR 2.3a; small-message latency down to 1.88 us (about 11X better), bandwidth about 9x better, and bi-bandwidth about 10x better)

Platform: MVAPICH2-GDR-2.3a; Intel Haswell (E5-2687W @ 3.10 GHz) node with 20 cores; NVIDIA Volta V100 GPU; Mellanox ConnectX-4 EDR HCA; CUDA 9.0; Mellanox OFED 4.0 with GPU-Direct RDMA

Application-Level Evaluation (HOOMD-blue)

(Figure: average time steps per second (TPS) for 64K and 256K particles at 4-32 processes; MV2+GDR achieves about 2X higher TPS than MV2)

• Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
• HOOMD-blue version 1.0.5
• GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384

Application-Level Evaluation (Cosmo) and Weather Forecasting in Switzerland

(Figure: normalized execution time with Default, Callback-based, and Event-based designs on the Wilkes GPU cluster (4-32 GPUs) and the CSCS GPU cluster (16-96 GPUs))

• 2X improvement on 32 GPU nodes
• 30% improvement on 96 GPU nodes (8 GPUs/node)

Cosmo model: http://www2.cosmo-model.org/content/tasks/operational/meteoSwiss/
On-going collaboration with CSCS and MeteoSwiss (Switzerland) in co-designing MV2-GDR and the Cosmo application.
C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS '16.

Streaming Applications
• Streaming applications on HPC systems
  1. Communication (MPI): broadcast-type operations
  2. Computation (CUDA): multiple GPU nodes as workers
• A data source performs real-time streaming; a sender on HPC resources broadcasts the data to GPU workers for real-time analytics (data streaming-like broadcast operations)

(Figure: a sender broadcasting to multiple worker nodes, each with a CPU and GPUs)

Hardware Multicast-based Broadcast

• For GPU-resident data, using
  – GPUDirect RDMA (GDR)
  – InfiniBand Hardware Multicast (IB-MCAST)
• Overhead
  – IB UD limit
  – GDR limit
• Scheme: 1. IB gather + GDR read at the source; 2. IB hardware multicast through the switch; 3. IB scatter + GDR write at each destination

(Figure: source and destinations 1..N; the header travels via the host CPU and HCA while the data moves directly between GPU memory and the HCA)

A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, "A High Performance Broadcast Design with Hardware Multicast and GPUDirect RDMA for Streaming Applications on InfiniBand Clusters," HiPC 2014, Dec 2014.
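At the MPI level, the streaming broadcast of GPU-resident data described above can be expressed simply as repeated MPI_Bcast calls on a device buffer; the hardware-multicast plus GDR machinery sits underneath the call in MVAPICH2-GDR. A hedged sketch, with illustrative chunk size and count and one GPU per rank assumed:

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv) {
        const size_t chunk = 1 << 20;        /* 1 MB per broadcast */
        const int    nchunks = 64;           /* simulated stream length */
        void *devbuf;

        MPI_Init(&argc, &argv);
        cudaSetDevice(0);
        cudaMalloc(&devbuf, chunk);

        /* Rank 0 acts as the data source; every worker receives each chunk
           into GPU memory and would launch its CUDA processing kernel here. */
        for (int i = 0; i < nchunks; i++)
            MPI_Bcast(devbuf, (int) chunk, MPI_CHAR, 0, MPI_COMM_WORLD);

        cudaFree(devbuf);
        MPI_Finalize();
        return 0;
    }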

Streaming Benchmark @ CSCS (88 GPUs)

(Figure: broadcast latency for small (1 B to 8 KB) and large (16 KB to 4 MB) messages, comparing MCAST-GDR and MCAST-GDR-OPT; the optimized design reduces latency by up to 58% for small and 79% for large messages)

• IB-MCAST + GDR + topology-aware IPC-based schemes
  – Up to 58% and 79% reduction for small and large messages

C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters," SBAC-PAD '16, Oct. 26-28, 2016.

Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale
• Scalability for million to billion processors
• Unified runtime for hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, UPC++, ...)
• Integrated support for GPGPUs
• Optimized MVAPICH2 for OpenPower (with NVLink) and ARM
• Application scalability and best practices

Intra-node Point-to-Point Performance on OpenPower

(Figure: intra-socket small- and large-message latency, bandwidth, and bi-directional bandwidth comparing MVAPICH2-2.3rc1, SpectrumMPI-10.1.0.2, and OpenMPI-3.0.0; MVAPICH2 achieves 0.30 us small-message intra-socket latency)

Platform: two OpenPOWER (Power8-ppc64le) nodes using a Mellanox EDR (MT4115) HCA

Inter-node Point-to-Point Performance on OpenPower

(Figure: inter-node small- and large-message latency, bandwidth, and bi-directional bandwidth for the same three MPI libraries)

Platform: two OpenPOWER (Power8-ppc64le) nodes using a Mellanox EDR (MT4115) HCA

MVAPICH2-GDR: Performance on OpenPOWER (NVLink + Pascal)

(Figure: intra-node and inter-node GPU-GPU latency and bandwidth, split into intra-socket (NVLink) and inter-socket results)
• Intra-node latency: 13.8 us (without GPUDirect RDMA); intra-node bandwidth: 33.2 GB/sec (NVLink)
• Inter-node latency: 23 us (without GPUDirect RDMA); inter-node bandwidth: 6 GB/sec (FDR)
• Available in MVAPICH2-GDR 2.3a

Platform: OpenPOWER (ppc64le) nodes equipped with a dual-socket CPU, 4 Pascal P100-SXM GPUs, and a 4X-FDR InfiniBand interconnect

Scalable Host-based Collectives with CMA on OpenPOWER (Intra-node Reduce & Alltoall)

(Figure: intra-node MPI_Reduce and MPI_Alltoall latency at 1 node, 20 PPN, comparing MVAPICH2-GDR-Next, SpectrumMPI-10.1.0.2, and OpenMPI-3.0.0; annotated improvements range from 1.2X to 5.2X in favor of MVAPICH2)

Up to 5X and 3X performance improvement by MVAPICH2 for small and large messages, respectively

Optimized All-Reduce with XPMEM on OpenPOWER

(Figure: MPI_Allreduce latency for 16 KB to 2 MB messages on 1 and 2 nodes at 20 PPN, comparing MVAPICH2-GDR-Next, SpectrumMPI-10.1.0, and OpenMPI-3.0.0; MVAPICH2 is up to 2X better than Spectrum MPI and up to 4X better than OpenMPI)

• Optimized MPI All-Reduce design in MVAPICH2
  – Up to 2X performance improvement over Spectrum MPI and 4X over OpenMPI for intra-node

Optimized runtime parameters: MV2_CPU_BINDING_POLICY=hybrid MV2_HYBRID_BINDING_POLICY=bunch

Intra-node Point-to-point Performance on ARMv8

(Figure: intra-node small- and large-message latency, bandwidth, and bi-directional bandwidth with MVAPICH2 on ARMv8; 0.74 microsec latency for 4-byte messages; available since MVAPICH2 2.3a)

Platform: ARMv8 (aarch64) dual-socket CPU with 96 cores; each socket contains 48 cores.

Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale
• Scalability for million to billion processors
• Unified runtime for hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, UPC++, ...)
• Integrated support for GPGPUs
• Optimized MVAPICH2 for OpenPower (with NVLink) and ARM
• Application scalability and best practices

Performance of SPEC MPI 2007 Benchmarks (KNL + Omni-Path)

(Figure: execution time of milc, leslie3d, pop2, lammps, wrf2, tera_tf, and lu with Intel MPI 18.0.0 vs. MVAPICH2 2.3rc1; 448 processes on 7 KNL nodes of TACC Stampede2, 64 ppn; MVAPICH2 is 1-10% faster depending on the benchmark)

MVAPICH2 outperforms Intel MPI by up to 10%

Performance of SPEC MPI 2007 Benchmarks (Skylake + Omni-Path)

(Figure: execution time of milc, leslie3d, pop2, lammps, wrf2, GaP, tera_tf, and lu with Intel MPI 18.0.0 vs. MVAPICH2 2.3rc1; 480 processes on 10 Skylake nodes of TACC Stampede2, 48 ppn; MVAPICH2 ranges from about 4% slower to 38% faster depending on the benchmark)

MVAPICH2 outperforms Intel MPI by up to 38%

Compilation of Best Practices
• The MPI runtime has many parameters
• Tuning a set of parameters can help you extract higher performance
• Compiled a list of such contributions through the MVAPICH website
  – http://mvapich.cse.ohio-state.edu/best_practices/
• Initial list of applications: Amber, HoomDBlue, HPCG, Lulesh, MILC, Neuron, SMG2000
• Soliciting additional contributions; send your results to mvapich-help at cse.ohio-state.edu and we will link them with credits to you.

HPC, Big Data, Deep Learning, and Cloud
• Traditional HPC
  – Message Passing Interface (MPI), including MPI + OpenMP
  – Support for PGAS and MPI + PGAS (OpenSHMEM, UPC)
  – Exploiting accelerators
• Big Data/Enterprise/Commercial Computing
  – Spark and Hadoop (HDFS, HBase, MapReduce)
  – Memcached is also used for Web 2.0
• Deep Learning
  – Caffe, CNTK, TensorFlow, and many more
• Cloud for HPC and Big Data
  – Virtualization with SR-IOV and containers

How Can HPC Clusters with High-Performance Interconnect and Storage Architectures Benefit Big Data Applications?

• Can RDMA-enabled high-performance interconnects benefit Big Data processing?
• Can HPC clusters with high-performance storage systems (e.g. SSD, parallel file systems) benefit Big Data applications?
• How much performance benefit can be achieved through enhanced designs?
• Can the bottlenecks be alleviated with new designs that take advantage of HPC technologies?
• What are the major bottlenecks in current Big Data processing middleware (e.g. Hadoop, Spark, and Memcached)?
• How to design benchmarks for evaluating the performance of Big Data middleware on HPC clusters?

Bring HPC and Big Data processing into a “convergent trajectory”!

Designing Communication and I/O Libraries for Big Data Systems: Challenges

(Figure: Big Data software stack with open design questions at each layer)
• Applications and benchmarks
• Big Data middleware (HDFS, MapReduce, HBase, Spark and Memcached): upper-level changes?
• Programming models (Sockets): RDMA?
• Communication and I/O library: point-to-point communication, threaded models and synchronization, virtualization (SR-IOV), I/O and file systems, QoS & fault tolerance, performance tuning
• Commodity computing system architectures (multi- and many-core architectures and accelerators), networking technologies (InfiniBand, 1/10/40/100 GigE and intelligent NICs), storage technologies (HDD, SSD, NVM, and NVMe-SSD)

The High-Performance Big Data (HiBD) Project
• RDMA for Apache Spark
• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
  – Plugins for Apache, Hortonworks (HDP) and Cloudera (CDH) Hadoop distributions
• RDMA for Apache HBase
• RDMA for Memcached (RDMA-Memcached)
• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
• OSU HiBD-Benchmarks (OHB)
  – HDFS, Memcached, HBase, and Spark micro-benchmarks
• Available for InfiniBand and RoCE; also runs on Ethernet
• Available for x86 and OpenPOWER
• http://hibd.cse.ohio-state.edu
• Users base: 275 organizations from 34 countries
• More than 25,550 downloads from the project site

HiBD Release Timeline and Downloads

(Figure: cumulative number of downloads over time, approaching 30,000, annotated with RDMA-Hadoop, RDMA-Spark, RDMA-HBase, RDMA-Memcached, and OHB release points)

Different Modes of RDMA for Apache Hadoop 2.x

• HHH: Heterogeneous storage devices with hybrid replication schemes are supported in this mode of operation to have better fault-tolerance as well as performance. This mode is enabled by default in the package.
• HHH-M: A high-performance in-memory based setup has been introduced in this package that can be utilized to perform all I/O operations in-memory and obtain as much performance benefit as possible.
• HHH-L: With parallel file systems integrated, HHH-L mode can take advantage of the Lustre available in the cluster.
• HHH-L-BB: This mode deploys a Memcached-based burst buffer system to reduce the bandwidth bottleneck of shared file system access. The burst buffer design is hosted by Memcached servers, each of which has a local SSD.
• MapReduce over Lustre, with/without local disks: Besides HDFS-based solutions, this package also provides support to run MapReduce jobs on top of Lustre alone. Here, two different modes are introduced: with local disks and without local disks.
• Running with Slurm and PBS: Supports deploying RDMA for Apache Hadoop 2.x with Slurm and PBS in different running modes (HHH, HHH-M, HHH-L, and MapReduce over Lustre).

Design Overview of Spark with RDMA

(Figure: RDMA-based Spark architecture: benchmarks/applications/libraries/frameworks on top of Spark Core; the Shuffle Manager (Sort, Hash, Tungsten-Sort) uses a Block Transfer Service (Netty, NIO, RDMA-Plugin) with Netty/NIO/RDMA servers and clients; the Java Socket Interface and Java Native Interface (JNI) sit above a native RDMA-based communication engine running over RDMA-capable networks (IB, iWARP, RoCE) or 1/10/40/100 GigE / IPoIB networks)

• Design features
  – RDMA-based shuffle plugin
  – SEDA-based architecture
  – Dynamic connection management and sharing
  – Non-blocking data transfer
  – Off-JVM-heap buffer management
  – InfiniBand/RoCE support
• Enables high-performance RDMA communication, while supporting the traditional socket interface
• JNI layer bridges Scala-based Spark with the communication library written in native code

X. Lu, M. W. Rahman, N. Islam, D. Shankar, and D. K. Panda, Accelerating Spark with RDMA for Big Data Processing: Early Experiences, Int'l Symposium on High Performance Interconnects (HotI '14), August 2014.
X. Lu, D. Shankar, S. Gugnani, and D. K. Panda, High-Performance Design of Apache Spark with RDMA and Its Benefits on Various Workloads, IEEE BigData '16, Dec. 2016.

Using HiBD Packages for Big Data Processing on Existing HPC Infrastructure


Performance Numbers of RDMA for Apache Hadoop 2.x – RandomWriter & TeraGen in OSU-RI2 (EDR)

(Figure: RandomWriter execution time for 80-160 GB and TeraGen execution time for 80-240 GB, IPoIB (EDR) vs. OSU-IB (EDR); reduced by 3x for RandomWriter and 4x for TeraGen; cluster with 8 nodes and a total of 64 maps)

• RandomWriter: 3x improvement over IPoIB for 80-160 GB file size
• TeraGen: 4x improvement over IPoIB for 80-240 GB file size

Performance Evaluation of RDMA-Spark on SDSC Comet – HiBench PageRank

(Figure: HiBench PageRank total time for Huge, BigData, and Gigantic data sizes, IPoIB vs. RDMA, on 32 worker nodes (768 cores) and 64 worker nodes (1536 cores); RDMA reduces total time by 37% and 43%, respectively)

• InfiniBand FDR, SSD, 32/64 worker nodes, 768/1536 cores (768/1536M, 768/1536R)
• RDMA vs. IPoIB with 768/1536 concurrent tasks, single SSD per node
  – 32 nodes/768 cores: total time reduced by 37% over IPoIB (56 Gbps)
  – 64 nodes/1536 cores: total time reduced by 43% over IPoIB (56 Gbps)

Performance Evaluation on SDSC Comet: Astronomy Application

• Kira Toolkit [1]: distributed astronomy image processing toolkit implemented using Apache Spark
• Source extractor application, using a 65 GB dataset from the SDSS DR2 survey that comprises 11,150 image files
• Compare RDMA Spark performance with the standard Apache implementation using IPoIB

(Figure: execution times (sec) for the Kira SE benchmark using the 65 GB dataset on 48 cores; RDMA Spark is 21% faster than Apache Spark over IPoIB)

[1] Z. Zhang, K. Barbary, F. A. Nothaft, E. R. Sparks, M. J. Franklin, D. A. Patterson, S. Perlmutter. Scientific Computing meets Big Data Technology: An Astronomy Use Case. CoRR, abs/1507.03325, Aug 2015.

M. Tatineni, X. Lu, D. J. Choi, A. Majumdar, and D. K. Panda, Experiences and Benefits of Running RDMA Hadoop and Spark on SDSC Comet, XSEDE’16, July 2016

HPC, Big Data, Deep Learning, and Cloud
• Traditional HPC
  – Message Passing Interface (MPI), including MPI + OpenMP
  – Support for PGAS and MPI + PGAS (OpenSHMEM, UPC)
  – Exploiting accelerators
• Big Data/Enterprise/Commercial Computing
  – Spark and Hadoop (HDFS, HBase, MapReduce)
  – Memcached is also used for Web 2.0
• Deep Learning
  – Caffe, CNTK, TensorFlow, and many more
• Cloud for HPC and Big Data
  – Virtualization with SR-IOV and containers

Deep Learning: New Challenges for MPI Runtimes
• Deep Learning frameworks are a different game altogether
  – Unusually large message sizes (order of megabytes)
  – Most communication based on GPU buffers
• Existing state-of-the-art
  – cuDNN, cuBLAS, NCCL --> scale-up performance
  – NCCL2, CUDA-Aware MPI --> scale-out performance
    • For small and medium message sizes only!
• Proposed: can we co-design the MPI runtime (MVAPICH2-GDR) and the DL framework (Caffe) to achieve both?
  – Efficient overlap of computation and communication
  – Efficient large-message communication (reductions)
  – What application co-designs are needed to exploit communication-runtime co-designs?

(Figure: scale-up vs. scale-out performance of Hadoop, gRPC, MPI, NCCL, NCCL2, cuDNN/cuBLAS, and the proposed co-designs)

A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '17).

MVAPICH2: Allreduce Comparison with Baidu and OpenMPI
• 16 GPUs (4 nodes), MVAPICH2-GDR vs. Baidu-Allreduce and OpenMPI 3.0

(Figure: Allreduce latency across small, medium, and large message sizes (4 bytes to 512 MB), comparing MVAPICH2, Baidu-allreduce, and OpenMPI; annotations show MVAPICH2 up to ~10X and ~4X better, MV2 ~2X better than Baidu, ~30X better than OpenMPI, and OpenMPI ~5X slower than Baidu)
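The Allreduce operations being compared here run directly on GPU-resident gradient buffers. A hedged sketch of the core call pattern used for data-parallel gradient summation with a CUDA-aware MPI; the buffer size is illustrative and the subsequent averaging kernel is omitted:

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv) {
        const int nelems = 64 * 1024 * 1024;     /* large DL-style message */
        float *grad;

        MPI_Init(&argc, &argv);
        cudaSetDevice(0);
        cudaMalloc((void **) &grad, nelems * sizeof(float));

        /* In-place sum of gradients across all ranks, operating directly on
           device memory (requires a CUDA-aware MPI such as MVAPICH2-GDR);
           a real framework would then scale by 1/nranks in a CUDA kernel. */
        MPI_Allreduce(MPI_IN_PLACE, grad, nelems, MPI_FLOAT, MPI_SUM,
                      MPI_COMM_WORLD);

        cudaFree(grad);
        MPI_Finalize();
        return 0;
    }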

*Available with MVAPICH2-GDR 2.3a

MVAPICH2-GDR vs. NCCL2 – Broadcast Operation
• Optimized designs in MVAPICH2-GDR 2.3b* offer better/comparable performance for most cases
• MPI_Bcast (MVAPICH2-GDR) vs. ncclBcast (NCCL2) on 16 K-80 GPUs

(Figure: broadcast latency with 1 GPU/node and 2 GPUs/node, NCCL2 vs. MVAPICH2-GDR; MVAPICH2-GDR is up to ~10X and ~4X better)

*Will be available with the upcoming MVAPICH2-GDR 2.3b
Platform: Intel Xeon (Broadwell) nodes equipped with a dual-socket CPU, 2 K-80 GPUs, and EDR InfiniBand interconnect

MVAPICH2-GDR vs. NCCL2 – Reduce Operation

• Optimized designs in MVAPICH2-GDR 2.3b* offer better/comparable performance for most cases
• MPI_Reduce (MVAPICH2-GDR) vs. ncclReduce (NCCL2) on 16 GPUs

(Figure: reduce latency for small and large messages, MVAPICH2-GDR vs. NCCL2; MVAPICH2-GDR is ~2.5X to ~5X better)

*Will be available with the upcoming MVAPICH2-GDR 2.3b
Platform: Intel Xeon (Broadwell) nodes equipped with a dual-socket CPU, 1 K-80 GPU, and EDR InfiniBand interconnect

MVAPICH2-GDR vs. NCCL2 – Allreduce Operation

• Optimized designs in MVAPICH2-GDR 2.3b* offer better/comparable performance for most cases
• MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) on 16 GPUs

(Figure: allreduce latency for small and large messages, MVAPICH2-GDR vs. NCCL2; MVAPICH2-GDR is ~1.2X to ~3X better)

*Will be available with the upcoming MVAPICH2-GDR 2.3b
Platform: Intel Xeon (Broadwell) nodes equipped with a dual-socket CPU, 1 K-80 GPU, and EDR InfiniBand interconnect

OSU-Caffe: Scalable Deep Learning
• Caffe: a flexible and layered Deep Learning framework
• Benefits and weaknesses
  – Multi-GPU training within a single node
  – Performance degradation for GPUs across different sockets
  – Limited scale-out
• OSU-Caffe: MPI-based parallel training
  – Enables scale-up (within a node) and scale-out (across multi-GPU nodes)
  – Scale-out on 64 GPUs for training the CIFAR-10 network on the CIFAR-10 dataset
  – Scale-out on 128 GPUs for training the GoogLeNet network on the ImageNet dataset
• OSU-Caffe publicly available from http://hidl.cse.ohio-state.edu/
• Support on OpenPOWER will be available soon

(Figure: GoogLeNet (ImageNet) training time on 8-128 GPUs for Caffe, OSU-Caffe (1024), and OSU-Caffe (2048); plain Caffe cannot run at this scale ("invalid use case"))

Performance Benefits for RDMA-gRPC with Micro-Benchmark

(Figure: RDMA-gRPC RPC latency for payloads from 2 bytes to 8 MB, Default gRPC vs. OSU RDMA gRPC)

• gRPC-RDMA latency on SDSC-Comet-FDR
  – Up to 2.7x speedup over IPoIB for small messages
  – Up to 2.8x speedup over IPoIB for medium messages
  – Up to 2.5x speedup over IPoIB for large messages

R. Biswas, X. Lu, and D. K. Panda, Accelerating gRPC and TensorFlow with RDMA for High-Performance Deep Learning over InfiniBand, under review.

Performance Benefits for RDMA-gRPC with TensorFlow Communication Mimic Benchmark

(Figure: latency and calls/second of the TensorFlow communication pattern mimic benchmark for 2 MB, 4 MB, and 8 MB payloads, Default gRPC vs. OSU RDMA gRPC)

• TensorFlow communication pattern mimic on SDSC-Comet-FDR
  – A single process spawns both the gRPC server and the gRPC client
  – Up to 3.6x latency speedup over IPoIB
  – Up to 3.7x throughput improvement over IPoIB

High-Performance Deep Learning over Big Data (DLoBD) Stacks
• Benefits of Deep Learning over Big Data (DLoBD)
  – Easily integrate deep learning components into Big Data processing workflows
  – Easily access the stored data in Big Data systems
  – No need to set up new dedicated deep learning clusters; reuse existing big data analytics clusters
• Challenges
  – Can RDMA-based designs in DLoBD stacks improve performance, scalability, and resource utilization on high-performance interconnects, GPUs, and multi-core CPUs?
  – What are the performance characteristics of representative DLoBD stacks on RDMA networks?
• Characterization of DLoBD stacks
  – CaffeOnSpark, TensorFlowOnSpark, and BigDL
  – IPoIB vs. RDMA; in-band communication vs. out-of-band communication; CPU vs. GPU; etc.
  – Performance, accuracy, scalability, and resource utilization
  – RDMA-based DLoBD stacks (e.g., BigDL over RDMA-Spark) can achieve 2.6x speedup compared to the IPoIB-based scheme, while maintaining similar accuracy

(Figure: time per epoch and accuracy over 18 epochs for IPoIB vs. RDMA; RDMA is 2.6X faster at similar accuracy)

X. Lu, H. Shi, M. H. Javed, R. Biswas, and D. K. Panda, Characterizing Deep Learning over Big Data (DLoBD) Stacks on RDMA-capable Networks, HotI 2017.

HPC, Big Data, Deep Learning, and Cloud
• Traditional HPC
  – Message Passing Interface (MPI), including MPI + OpenMP
  – Support for PGAS and MPI + PGAS (OpenSHMEM, UPC)
  – Exploiting accelerators
• Big Data/Enterprise/Commercial Computing
  – Spark and Hadoop (HDFS, HBase, MapReduce)
  – Memcached is also used for Web 2.0
• Deep Learning
  – Caffe, CNTK, TensorFlow, and many more
• Cloud for HPC and Big Data
  – Virtualization with SR-IOV and containers

Can HPC and Virtualization be Combined?
• Virtualization has many benefits
  – Fault-tolerance
  – Job migration
  – Compaction
• It has not been very popular in HPC due to the overhead associated with virtualization
• New SR-IOV (Single Root I/O Virtualization) support available with Mellanox InfiniBand adapters changes the field
• Enhanced MVAPICH2 support for SR-IOV
• MVAPICH2-Virt 2.2 supports:
  – OpenStack, Docker, and Singularity

J. Zhang, X. Lu, J. Jose, R. Shi and D. K. Panda, Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters?, EuroPar '14.
J. Zhang, X. Lu, J. Jose, M. Li, R. Shi and D. K. Panda, High Performance MPI Library over SR-IOV enabled InfiniBand Clusters, HiPC '14.
J. Zhang, X. Lu, M. Arnold and D. K. Panda, MVAPICH2 Over OpenStack with SR-IOV: an Efficient Approach to build HPC Clouds, CCGrid '15.

Application-Level Performance on Chameleon

(Figure: Graph500 execution time for problem sizes (22,20) through (26,16) and SPEC MPI2007 execution time for milc, leslie3d, pop2, GAPgeofem, zeusmp2, and lu, comparing MV2-SR-IOV-Def, MV2-SR-IOV-Opt, and MV2-Native)

• 32 VMs, 6 cores/VM
• Compared to native, 2-5% overhead for Graph500 with 128 procs
• Compared to native, 1-9.5% overhead for SPEC MPI2007 with 128 procs

Application-Level Performance on Singularity with MVAPICH2

(Figure: NPB Class D execution time (CG, EP, FT, IS, LU, MG) and Graph500 BFS execution time for problem sizes (22,16) through (26,20), Singularity vs. Native)

• 512 processes across 32 nodes
• Less than 7% and 6% overhead for NPB and Graph500, respectively

J. Zhang, X. Lu and D. K. Panda, Is Singularity-based Container Technology Ready for Running MPI Applications on HPC Clouds?, UCC '17, Best Student Paper Award.

MVAPICH2-GDR on Container with Negligible Overhead

(Figure: GPU-GPU inter-node latency, bandwidth, and bi-bandwidth with MVAPICH2-GDR, Docker vs. Native; the curves are nearly identical)

Platform: MVAPICH2-GDR-2.3a; Intel Haswell (E5-2687W @ 3.10 GHz) node with 20 cores; NVIDIA Volta V100 GPU; Mellanox ConnectX-4 EDR HCA; CUDA 9.0; Mellanox OFED 4.0 with GPU-Direct RDMA

Concluding Remarks

• Next-generation HPC systems need to be designed with a holistic view of Big Data, Deep Learning and Cloud
• Presented some of the approaches and results along these directions
• Enable the Big Data, Deep Learning and Cloud communities to take advantage of modern HPC technologies
• Many other open issues need to be solved
• Next-generation HPC professionals need to be trained along these directions

Additional Presentation and Tutorials
• Talk: Exploiting Computation and Communication Overlap in MVAPICH2 and MVAPICH2-GDR MPI Libraries
  – 04/04/18 at 9:00 am, at the Overlapping Communication with Computation Symposium
• Tutorial: How to Boost the Performance of Your MPI and PGAS Applications with MVAPICH2 Libraries
  – 04/05/18, 9:00 am - 12:00 noon
• Tutorial: High Performance Distributed Deep Learning: A Beginner's Guide
  – 04/05/18, 1:00 pm - 4:00 pm

Funding Acknowledgments

Funding Support by (sponsor logos)

Equipment Support by

Network Based Computing Laboratory SEA Symposium (April ’18) 83 Personnel Acknowledgments

Current Students (Graduate) Current Students (Undergraduate) Current Research Scientists Current Research Specialist – A. Awan (Ph.D.) – J. Hashmi (Ph.D.) – N. Sarkauskas (B.S.) – X. Lu – J. Smith – R. Biswas (M.S.) – H. Javed (Ph.D.) – H. Subramoni – M. Arnold – M. Bayatpour (Ph.D.) – P. Kousha (Ph.D.) – S. Chakraborthy (Ph.D.) – D. Shankar (Ph.D.) Current Post-doc – C.-H. Chu (Ph.D.) – H. Shi (Ph.D.) – A. Ruhela – S. Guganani (Ph.D.) – J. Zhang (Ph.D.) – K. Manian Past Students Past Research Scientist – W. Huang (Ph.D.) – J. Liu (Ph.D.) – R. Rajachandrasekar (Ph.D.) – A. Augustine (M.S.) – K. Hamidouche – P. Balaji (Ph.D.) – W. Jiang (M.S.) – M. Luo (Ph.D.) – G. Santhanaraman (Ph.D.) – S. Sur – S. Bhagvat (M.S.) – J. Jose (Ph.D.) – A. Mamidala (Ph.D.) – A. Singh (Ph.D.) – A. Bhat (M.S.) – S. Kini (M.S.) – G. Marsh (M.S.) – J. Sridhar (M.S.) Past Programmers – S. Sur (Ph.D.) – D. Buntinas (Ph.D.) – M. Koop (Ph.D.) – V. Meshram (M.S.) – D. Bureddy – H. Subramoni (Ph.D.) – L. Chai (Ph.D.) – K. Kulkarni (M.S.) – A. Moody (M.S.) – J. Perkins – K. Vaidyanathan (Ph.D.) – B. Chandrasekharan (M.S.) – R. Kumar (M.S.) – S. Naravula (Ph.D.) – N. Dandapanthula (M.S.) – S. Krishnamoorthy (M.S.) – R. Noronha (Ph.D.) – A. Vishnu (Ph.D.) – V. Dhanraj (M.S.) – K. Kandalla (Ph.D.) – X. Ouyang (Ph.D.) – J. Wu (Ph.D.) – T. Gangadharappa (M.S.) – M. Li (Ph.D.) – S. Pai (M.S.) – W. Yu (Ph.D.) – K. Gopalakrishnan (M.S.) – P. Lai (M.S.) – S. Potluri (Ph.D.)

Past Post-Docs – D. Banerjee – J. Lin – S. Marcarelli – X. Besseron – M. Luo – J. Vienne – H.-W. Jin – E. Mancini – H. Wang

Thank You!
[email protected]

Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/

The High-Performance MPI/PGAS Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/
The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/
