<<

Building HPC Cloud with InfiniBand: Efficient Support in MVAPICH2 for KVM, Docker, Singularity, OpenStack, and SLURM. Tutorial and Demo at MUG 2017 by

Xiaoyi Lu, The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~luxi

Cloud Computing and Virtualization

• Cloud Computing focuses on maximizing the effectiveness of shared resources
• Virtualization is the key technology for resource sharing in the Cloud
• Widely adopted in industry computing environments
• Gartner and International Data Corporation (IDC) have issued a forecast predicting that public cloud spending will balloon to $203.4 billion by 2020 (Courtesy: https://www.mspinsights.com/doc/cloud-spending-will-reach-billion-in-0001)

Overview of Hypervisor-/Containers-based Virtualization

(Figure: hypervisor-based virtualization (VMs) vs. container-based virtualization)

• Hypervisor-based virtualization provides higher isolation, multi-OS support, etc.
• Container-based technologies (e.g., Docker) provide lightweight virtualization solutions
• Container-based virtualization – containers share the host kernel

Drivers of Modern HPC Cluster and Cloud Architecture

(Figure: multi-/many-core processors, high-performance interconnects – InfiniBand with SR-IOV (<1usec latency, 200Gbps bandwidth), large memory nodes (up to 2 TB), SSD/object storage clusters; example systems: SDSC Comet, TACC Stampede, Cloud)

• Multi-core/many-core technologies, Accelerators
• Large memory nodes
• Solid State Drives (SSDs), NVM, Parallel Filesystems, Object Storage Clusters
• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
• Single Root I/O Virtualization (SR-IOV)

Single Root I/O Virtualization (SR-IOV)

• Single Root I/O Virtualization (SR-IOV) provides new opportunities to design HPC clouds with very low overhead
• Allows a single physical device, or a Physical Function (PF), to present itself as multiple virtual devices, or Virtual Functions (VFs)
• VFs are designed based on the existing non-virtualized PFs, so no driver change is needed
• Each VF can be dedicated to a single VM through PCI pass-through (see the sketch below)
• Works with 10/40 GigE and InfiniBand
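To make the pass-through point concrete, here is a small, hedged sketch (not part of MVAPICH2) that uses the standard libibverbs API to enumerate HCAs; run inside a guest with a passed-through VF, the VF is listed and queried just like a native adapter. The filename is illustrative; build with something like `gcc list_hca.c -libverbs`.

```c
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num_devices = 0;

    /* Enumerate all IB devices visible to this OS image. Inside a VM,
     * a passed-through SR-IOV VF shows up here like a regular HCA. */
    struct ibv_device **dev_list = ibv_get_device_list(&num_devices);
    if (!dev_list) {
        perror("ibv_get_device_list");
        return 1;
    }

    for (int i = 0; i < num_devices; i++) {
        struct ibv_context *ctx = ibv_open_device(dev_list[i]);
        if (!ctx)
            continue;

        struct ibv_device_attr attr;
        if (ibv_query_device(ctx, &attr) == 0)
            printf("%-16s ports: %d  max_qp: %d\n",
                   ibv_get_device_name(dev_list[i]),
                   (int)attr.phys_port_cnt, attr.max_qp);

        ibv_close_device(ctx);
    }

    ibv_free_device_list(dev_list);
    return 0;
}
```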


Building HPC Cloud with SR-IOV and InfiniBand

• High-Performance Computing (HPC) has adopted advanced interconnects and protocols
  – InfiniBand
  – 10/40/100 Gigabit Ethernet/iWARP
  – RDMA over Converged Enhanced Ethernet (RoCE)
• Very good performance
  – Low latency (a few microseconds)
  – High bandwidth (200 Gb/s with HDR InfiniBand)
  – Low CPU overhead (5-10%)
• The OpenFabrics software stack with IB, iWARP and RoCE interfaces is driving HPC systems
• How to build HPC Clouds with SR-IOV and InfiniBand for delivering optimal performance?

Broad Challenges in Designing Communication and I/O Middleware for HPC on Clouds

• Virtualization Support with Virtual Machines and Containers – KVM, Docker, Singularity, etc.
• Communication coordination among optimized communication channels on Clouds – SR-IOV, IVShmem, IPC-Shm, CMA, etc.
• Locality-aware communication
• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
• Scalable collective communication – Offload; Non-blocking; Topology-aware
• Balancing intra-node and inter-node communication for next generation nodes (128-1024 cores) – Multiple end-points per node
• NUMA-aware communication for nested virtualization
• Integrated support for GPGPUs and Accelerators
• Fault-tolerance/resiliency – Migration support with virtual machines
• QoS support for communication and I/O
• Support for Hybrid MPI+PGAS programming (MPI + OpenMP, MPI + UPC, MPI + OpenSHMEM, MPI+UPC++, CAF, …)
• Energy-Awareness
• Co-design with resource management and scheduling systems on Clouds – OpenStack, Slurm, etc.

Approaches to Build HPC Clouds
• MVAPICH2-Virt with SR-IOV and IVSHMEM – Standalone, OpenStack
• SR-IOV-enabled VM Migration Support in MVAPICH2
• MVAPICH2 with Containers (Docker and Singularity)
• MVAPICH2 with Nested Virtualization (Container over VM)
• MVAPICH2-Virt on SLURM – SLURM alone, SLURM + OpenStack

Overview of the MVAPICH2 Project

• High Performance open-source MPI Library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
  – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), started in 2001, first version available in 2002
  – MVAPICH2-X (MPI + PGAS), available since 2011
  – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
  – Support for Virtualization (MVAPICH2-Virt), available since 2015
  – Support for Energy-Awareness (MVAPICH2-EA), available since 2015
  – Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
  – Used by more than 2,800 organizations in 85 countries
  – More than 424,000 (> 0.4 million) downloads from the OSU site directly
  – Empowering many TOP500 clusters (Nov '16 ranking)
    • 1st ranked 10,649,640-core cluster (Sunway TaihuLight) at NSC, Wuxi, China
    • 15th ranked 241,108-core cluster (Pleiades) at NASA
    • 20th ranked 519,640-core cluster (Stampede) at TACC
    • 44th ranked 76,032-core cluster (Tsubame 2.5) at Tokyo Institute of Technology, and many others
  – Available with software stacks of many vendors and Distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
  – System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) -> Sunway TaihuLight at NSC, Wuxi, China (1st in Nov '16, 10,649,640 cores, 93 PFlops)
• VM/Container-aware design is already publicly available in MVAPICH2-Virt!

MVAPICH2 Software Family

High-Performance Parallel Programming Libraries
  – MVAPICH2: Support for InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE
  – MVAPICH2-X: Advanced MPI features, OSU INAM, PGAS (OpenSHMEM, UPC, UPC++, and CAF), and MPI+PGAS programming models with unified communication runtime
  – MVAPICH2-GDR: Optimized MPI for clusters with NVIDIA GPUs
  – MVAPICH2-Virt: High-performance and scalable MPI for hypervisor- and container-based HPC clouds
  – MVAPICH2-EA: Energy-aware and high-performance MPI
  – MVAPICH2-MIC: Optimized MPI for clusters with KNC
Microbenchmarks
  – OMB: Microbenchmark suite to evaluate MPI and PGAS (OpenSHMEM, UPC, and UPC++) libraries for CPUs and GPUs
Tools
  – OSU INAM: Network monitoring, profiling, and analysis for clusters with MPI and scheduler integration
  – OEMT: Utility to measure the energy consumption of MPI applications

HPC on Cloud Computing Systems: Challenges Addressed by OSU So Far

(Figure: layered solution stack)
• Applications
• HPC and Big Data Middleware – HPC (MPI, PGAS, MPI+PGAS, MPI+OpenMP, etc.)
• Resource Management and Scheduling Systems for Cloud Computing (OpenStack Nova, Heat; Slurm)
• Communication and I/O Library
  – Communication Channels (SR-IOV, IVShmem, IPC-Shm, CMA)
  – Locality- and NUMA-aware Communication
  – Virtualization (Hypervisor and Container)
  – Fault-Tolerance & Consolidation (Migration)
  – QoS-aware Communication and I/O
  – Future Studies
• Commodity Computing System Architectures (Multi- and Many-core architectures and accelerators)
• Networking Technologies (InfiniBand, Omni-Path, 1/10/40/100 GigE and Intelligent NICs)
• Storage Technologies (HDD, SSD, NVRAM, and NVMe-SSD)

Overview of OSU Solutions for Building HPC Clouds - MVAPICH2-Virt with SR-IOV and IVSHMEM

Intra-Node Inter-VM Performance

(Figure: intra-node inter-VM latency for IB_Send, IB_Read, IB_Write, and IVShmem channels across small and large message sizes)

Latency at 64 bytes message size: SR-IOV(IB_Send) - 0.96μs, IVShmem - 0.2μs

Can IVShmem scheme benefit MPI communication within a node on SR-IOV enabled InfiniBand clusters?
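Background for this question: the ivshmem device shows up in a guest as a PCI device whose BAR 2 is the host-provided shared memory region, and a process can map that BAR through sysfs. The sketch below is illustrative only (the PCI address is a placeholder, found in practice with lspci) and is not MVAPICH2-Virt code.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* BAR 2 of the ivshmem PCI device is the shared memory region that all
     * co-resident VMs on the same host can map (placeholder PCI address). */
    const char *bar2 = "/sys/bus/pci/devices/0000:00:05.0/resource2";

    int fd = open(bar2, O_RDWR);
    if (fd < 0) {
        perror("open ivshmem BAR2");
        return 1;
    }

    size_t len = 4096;   /* map just one page for the demo */
    void *shm = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (shm == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return 1;
    }

    /* Anything written here is immediately visible to the other VMs that
     * mapped the same region -- this is the basis of the IVShmem channel. */
    strcpy((char *)shm, "hello from this VM");

    munmap(shm, len);
    close(fd);
    return 0;
}
```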

Overview of MVAPICH2-Virt with SR-IOV and IVSHMEM

(Figure: two guests, each running MPI processes over a PCI VF device driver; the hypervisor's PF driver and an IVShmem device backed by /dev/shm on the host provide the SR-IOV and IV-Shmem channels)

• Redesign MVAPICH2 to make it virtual machine aware
  – SR-IOV shows near-to-native performance for inter-node point-to-point communication
  – IVSHMEM offers shared-memory-based data access across co-resident VMs
  – Locality Detector: maintains the locality information of co-resident virtual machines
  – Communication Coordinator: selects the communication channel (SR-IOV, IVSHMEM) adaptively (a minimal sketch follows below)

J. Zhang, X. Lu, J. Jose, R. Shi, D. K. Panda. Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters? Euro-Par, 2014
J. Zhang, X. Lu, J. Jose, R. Shi, M. Li, D. K. Panda. High Performance MPI Library over SR-IOV Enabled InfiniBand Clusters. HiPC, 2014
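As an illustration of how a locality detector and communication coordinator fit together, here is a minimal, hedged sketch; all names (locality_table, channel_t, select_channel) are hypothetical, not MVAPICH2-Virt symbols.

```c
#include <stdio.h>

/* Hedged sketch of the Locality Detector + Communication Coordinator idea
 * described above; every name here is illustrative. */
typedef enum { CHANNEL_IVSHMEM, CHANNEL_SRIOV } channel_t;

/* Filled at startup by the locality detector: 1 if the peer rank runs in
 * a VM co-resident on the same physical host, 0 otherwise. */
static const int locality_table[4] = { 1, 1, 0, 0 };

static channel_t select_channel(int peer_rank)
{
    return locality_table[peer_rank] ? CHANNEL_IVSHMEM  /* shared memory */
                                     : CHANNEL_SRIOV;   /* VF over IB    */
}

int main(void)
{
    for (int rank = 0; rank < 4; rank++)
        printf("peer %d -> %s\n", rank,
               select_channel(rank) == CHANNEL_IVSHMEM ? "IVSHMEM" : "SR-IOV");
    return 0;
}
```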

MVAPICH2-Virt with SR-IOV and IVSHMEM over OpenStack

• OpenStack is one of the most popular open-source solutions to build clouds and manage virtual machines
• Deployment with OpenStack
  – Supporting SR-IOV configuration
  – Supporting IVSHMEM configuration
  – Virtual Machine aware design of MVAPICH2 with SR-IOV
• An efficient approach to build HPC Clouds with MVAPICH2-Virt and OpenStack

J. Zhang, X. Lu, M. Arnold, D. K. Panda. MVAPICH2 over OpenStack with SR-IOV: An Efficient Approach to Build HPC Clouds. CCGrid, 2015

Intra-node Inter-VM Point-to-Point Performance

(Figure: intra-node inter-VM point-to-point latency and bandwidth for SR-IOV, Proposed, and Native across message sizes)

• Compared to SR-IOV, up to 84% and 158% improvement on Latency & Bandwidth • Compared to Native, 3%-8% overhead on both Latency & Bandwidth


Application Performance (NAS & P3DFFT)

(Figure: execution time of NAS Class B (FT, LU, CG, MG, IS) with 32 VMs (8 VMs per node) and P3DFFT 512x512x512 (SPEC, INVERSE, RAND, SINE) with 32 VMs (8 VMs per node), comparing SR-IOV and the Proposed design)

• Proposed design delivers up to 43% (IS) improvement for NAS
• Proposed design brings 29%, 33%, 29% and 20% improvement for INVERSE, RAND, SINE and SPEC

Application-Level Performance on Chameleon

(Figure: Graph500 BFS execution time across problem sizes (Scale, Edgefactor) and SPEC MPI2007 execution time for milc, leslie3d, pop2, GAPgeofem, zeusmp2, and lu, comparing MV2-SR-IOV-Def, MV2-SR-IOV-Opt, and MV2-Native)

• 32 VMs, 6 Core/VM
• Compared to Native, 2-5% overhead for Graph500 with 128 Procs
• Compared to Native, 1-9.5% overhead for SPEC MPI2007 with 128 Procs

Approaches to Build HPC Clouds
• MVAPICH2-Virt with SR-IOV and IVSHMEM – Standalone, OpenStack
• SR-IOV-enabled VM Migration Support in MVAPICH2
• MVAPICH2 with Containers (Docker and Singularity)
• MVAPICH2 with Nested Virtualization (Container over VM)
• MVAPICH2-Virt on SLURM – SLURM alone, SLURM + OpenStack

Overview of OSU Solutions for Building HPC Clouds: SR-IOV-enabled VM Migration Support in MVAPICH2

Execute with SR-IOV Device

Overview of Existing Migration Solutions for SR-IOV

(Table columns: Platform | NIC | No Guest OS Modification | Device Driver Independent | Hypervisor Independent)
• Zhai, etc. (Linux bonding driver): Ethernet | N/A | Yes | No | Yes
• Kadav, etc. (shadow driver): Ethernet | Intel Pro/1000 gigabit NIC | No | Yes | No (Xen)
• Pan, etc. (CompSC): Ethernet | Intel 82576, Intel 82599 | Yes | No | No (Xen)
• Guay, etc.: InfiniBand | Mellanox ConnectX2 QDR HCA | Yes | No | Yes (Oracle VM Server (OVS) 3.0)
• Han: Ethernet | smart NIC | Yes | No | No (QEMU+KVM)
• Xu, etc. (SRVM): Ethernet | Intel 82599 | Yes | Yes | No (VMware ESXi)

Can we have a hypervisor-independent and device-driver-independent solution for InfiniBand-based HPC Clouds with SR-IOV?

High Performance SR-IOV enabled VM Migration Framework

• Consists of an SR-IOV enabled IB cluster and an external Migration Controller
• Detachment/re-attachment of virtualized devices: multiple parallel libraries coordinate the VMs during migration (detach/re-attach SR-IOV/IVShmem, migrate VMs, track migration status)
• IB connection: the MPI runtime handles IB connection suspending and reactivating
• Propose Progress Engine (PE) based and Migration Thread (MT) based designs to optimize VM migration and MPI application performance

J. Zhang, X. Lu, D. K. Panda. High-Performance Virtual Machine Migration Framework for MPI Applications on SR-IOV enabled InfiniBand Clusters. IPDPS, 2017

Parallel Libraries in Controller

(Figure: the controller on the login node coordinates the migration source and target compute nodes through phases 1-5: suspend channel, detach VF/IVSHMEM, VM migration, attach VF/IVSHMEM, reactivate channel)

• Network Suspend Trigger
  – Issues the VM migration request from the admin
  – Implemented on top of the MPI startup channel for scalability
• Ready-to-Migrate Detector and Migration Done Detector
  – Periodically detect states on the IVShmem regions
  – Maintain a counter to keep track of all the states; utilize MPI reduce to collect them
• Migration Trigger
  – Carries out the VM live migration operation
  – Detaches/re-attaches the SR-IOV VF and IVShmem devices
• Network Reactivate Notifier
  – Works after the VM has migrated to the targeted host
  – Notifies MPI processes to reactivate communication channels (the overall sequence is sketched below)
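To make the phase ordering concrete, the hedged sketch below walks one VM through the five phases using printf stubs; none of the helper names are real MVAPICH2 or libvirt APIs.

```c
#include <stdio.h>

/* Hedged sketch of the controller-side migration sequence described above.
 * All helpers are printf stubs standing in for the real coordination logic. */
typedef struct { const char *vm; const char *dst_host; } vm_job_t;

static void suspend_channels(const char *vm)      { printf("phase 1: suspend IB channels in %s\n", vm); }
static void wait_ready_to_migrate(const char *vm) { printf("phase 2: %s reports ready-to-migrate\n", vm); }
static void detach_devices(const char *vm)        { printf("phase 3a: detach VF + IVShmem from %s\n", vm); }
static void live_migrate_vm(const char *vm, const char *h) { printf("phase 3b: migrate %s to %s\n", vm, h); }
static void attach_devices(const char *vm)        { printf("phase 3c: re-attach VF + IVShmem to %s\n", vm); }
static void reactivate_channels(const char *vm)   { printf("phase 4-5: reactivate channels, wait done (%s)\n", vm); }

int main(void)
{
    vm_job_t job = { "vm0", "node2" };

    suspend_channels(job.vm);              /* Network Suspend Trigger      */
    wait_ready_to_migrate(job.vm);         /* Ready-to-Migrate Detector    */
    detach_devices(job.vm);                /* Migration Trigger, phase 3.a */
    live_migrate_vm(job.vm, job.dst_host); /* phase 3.b                    */
    attach_devices(job.vm);
    reactivate_channels(job.vm);           /* Network Reactivate Notifier  */
    return 0;
}
```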

Performance Evaluation of VM Migration Framework

(Figure: breakdown of single-VM migration time (Set POST_MIGRATION, Add IVSHMEM, Attach VF, Migration, Remove IVSHMEM, Detach VF, Set PRE_MIGRATION) over TCP, IPoIB, and RDMA; multiple-VM migration time for 2/4/8/16 VMs with sequential migration vs. the proposed migration framework)

• Compared with TCP, the RDMA scheme reduces the total migration time by 20%
• Total time is dominated by the 'Migration' step; times for the other steps are similar across the different schemes
• The proposed migration framework reduces migration time by up to 51%

Performance Evaluation of VM Migration Framework

(Figure: point-to-point latency and Bcast latency (4 VMs x 2 procs/VM) during migration for PE-IPoIB, PE-RDMA, MT-IPoIB, and MT-RDMA across message sizes)

• Migrate a VM from one machine to another while the benchmark is running inside
• Proposed MT-based designs perform slightly worse than PE-based designs because of lock/unlock
• No benefit from MT because no computation is involved

Overlapping Evaluation

(Figure: total time vs. computation percentage (0-70%) for PE, MT-worst, MT-typical, and NM)

• Fix the communication time and increase the computation time
• With 10% computation, part of the migration time can already be overlapped with computation in MT-typical
• As the computation percentage increases, there is more chance for overlapping

Performance Evaluation with Applications

(Figure: execution time of NAS (LU.B, EP.B, MG.B, CG.B, FT.B) and Graph500 (20,10; 20,16; 20,20; 22,10) for PE, MT-worst, MT-typical, and NM)

• 8 VMs in total; 1 VM carries out migration while the application is running
• Compared with NM, MT-worst and PE incur some overhead
• MT-typical allows migration to be completely overlapped with computation

Approaches to Build HPC Clouds
• MVAPICH2-Virt with SR-IOV and IVSHMEM – Standalone, OpenStack
• SR-IOV-enabled VM Migration Support in MVAPICH2
• MVAPICH2 with Containers (Docker and Singularity)
• MVAPICH2 with Nested Virtualization (Container over VM)
• MVAPICH2-Virt on SLURM – SLURM alone, SLURM + OpenStack

Overview of OSU Solutions for Building HPC Clouds: MVAPICH2 with Containers (Docker and Singularity)

Benefits of Containers-based Virtualization for HPC on Cloud

(Figure: ib_send_lat latency for VM-PT, VM-SR-IOV, Container-PT, and Native; Graph500 BFS execution time split into communication and computation for Native-16P, 1Cont*16P, 2Conts*8P, and 4Conts*4P at Scale 20, Edgefactor 16)

• Experiment on NFS, Chameleon Cloud
• Container has less overhead than VM
• BFS time in Graph 500 increases significantly as the number of containers per host increases. Why?

J. Zhang, X. Lu, D. K. Panda. Performance Characterization of Hypervisor- and Container-Based Virtualization for HPC on SR-IOV Enabled InfiniBand Clusters. IPDRM, IPDPS Workshop, 2016

Containers-based Design: Issues, Challenges, and Approaches

• What are the performance bottlenecks when running MPI applications on multiple containers per host in HPC cloud?
• Can we propose a new design to overcome the bottlenecks on such container-based HPC clouds?
• Can an optimized design deliver near-native performance for different container deployment scenarios?
• Locality-aware design to enable CMA and shared memory channels for MPI communication across co-resident containers (see the CMA sketch below)

J. Zhang, X. Lu, D. K. Panda. High Performance MPI Library for Container-based HPC Cloud on InfiniBand Clusters. ICPP, 2016
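The CMA channel mentioned in the last bullet is built on the Linux cross-memory-attach system calls. The self-contained sketch below (not MVAPICH2 code) uses process_vm_readv to copy a buffer directly out of another process's address space, which is the single-copy path that co-resident processes/containers can use once locality is detected.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/uio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    static char src[64] = "payload owned by the child process";
    char dst[64] = {0};

    pid_t pid = fork();
    if (pid == 0) {        /* child: simply keeps its address space alive */
        sleep(1);
        _exit(0);
    }

    /* Parent: pull the buffer straight out of the child's memory with one
     * copy -- this is the essence of the CMA channel for intra-node MPI. */
    struct iovec local  = { .iov_base = dst, .iov_len = sizeof(dst) };
    struct iovec remote = { .iov_base = src, .iov_len = sizeof(src) };
    ssize_t n = process_vm_readv(pid, &local, 1, &remote, 1, 0);
    if (n < 0)
        perror("process_vm_readv (needs same user or CAP_SYS_PTRACE)");
    else
        printf("copied %zd bytes: \"%s\"\n", n, dst);

    waitpid(pid, NULL, 0);
    return 0;
}
```

On most systems this path requires that the two processes belong to the same user (or have CAP_SYS_PTRACE), which is one reason container deployment details matter when enabling this channel across co-resident containers.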

MPI Point-to-Point and Collective Performance

(Figure: point-to-point latency and MPI_Allgather latency for Cont-intra-socket-Def/Opt, Cont-inter-socket-Def/Opt, and Native-intra/inter-socket across message sizes)

• Two containers are deployed on the same socket and on different sockets
• 256 procs in total (4 procs/container, 64 containers spread evenly across 16 nodes)
• Up to 79% and 86% improvement for Point-to-Point and MPI_Allgather, respectively (Cont-Opt vs. Cont-Def)
• Minor overhead compared with Native performance (Cont-*-Opt vs. Native-*)

Application-Level Performance on Docker with MVAPICH2

(Figure: NAS Class D (MG, FT, EP, LU, CG) execution time and Graph 500 BFS time (Scale 20, Edgefactor 16; 1Cont*16P, 2Conts*8P, 4Conts*4P) for Container-Def, Container-Opt, and Native)

• 64 Containers across 16 nodes, pinning 4 cores per container
• Compared to Container-Def, up to 11% and 73% reduction in execution time for NAS and Graph 500
• Compared to Native, less than 9% and 5% overhead for NAS and Graph 500

MVAPICH2 Intra-Node and Inter-Node Point-to-Point Performance on Singularity

(Figure: point-to-point latency and bandwidth for Singularity vs. Native, intra-node and inter-node, across message sizes)

• Less than 8% overhead on latency
• Less than 3% overhead on BW

MVAPICH2 Collective Performance on Singularity

(Figure: MPI_Bcast and MPI_Allreduce latency for Singularity vs. Native across message sizes)

• 512 Processes across 32 nodes
• Less than 8% and 9% overhead for Bcast and Allreduce, respectively

Application-Level Performance on Singularity with MVAPICH2

(Figure: NPB Class D (CG, EP, FT, IS, LU, MG) execution time and Graph500 BFS time across problem sizes (Scale, Edgefactor) for Singularity vs. Native)

• 512 Processes across 32 nodes
• Less than 7% and 6% overhead for NPB and Graph500, respectively

Approaches to Build HPC Clouds
• MVAPICH2-Virt with SR-IOV and IVSHMEM – Standalone, OpenStack
• SR-IOV-enabled VM Migration Support in MVAPICH2
• MVAPICH2 with Containers (Docker and Singularity)
• MVAPICH2 with Nested Virtualization (Container over VM)
• MVAPICH2-Virt on SLURM – SLURM alone, SLURM + OpenStack

Overview of OSU Solutions for Building HPC Clouds: MVAPICH2 with Nested Virtualization

Nested Virtualization: Containers over Virtual Machines

(Figure: nested virtualization stack – containers (app, bins/libs) on Docker Engine inside VM1 and VM2, running on the hypervisor, host OS, and hardware)

• Useful for live migration, sandboxing applications, legacy system integration, software deployment, etc.
• Performance issues because of the redundant call stacks (two-layer virtualization) and isolated physical resources

Usage Scenario of Nested Virtualization

(Figure: users A and B each run their own VMs on separate VLANs on the host; Docker containers inside the VMs pull images that a developer pushes to a Docker Registry)

• VM provides good isolation and security so that the applications and workloads of users A and B will not interfere with each other • Root permission of VM can be given to do special configuration • Docker brings an effective, standardized and repeatable way to port and distribute the applications and workloads

Multiple Communication Paths in Nested Virtualization

• Different VM placements introduce multiple communication paths at the container level:
  1. Intra-VM Intra-Container (across core 4 and core 5)
  2. Intra-VM Inter-Container (across core 13 and core 14)
  3. Inter-VM Inter-Container (across core 6 and core 12)
  4. Inter-Node Inter-Container (across core 15 and a core on a remote node)

Performance Characteristics on Communication Paths

(Figure: intra-socket and inter-socket latency for Intra-VM Inter-Container-Def, Inter-VM Inter-Container-Def, Intra-VM Inter-Container-1Layer*, Inter-VM Inter-Container-1Layer*, and Native across message sizes; the 1Layer designs remain 2X-3X from native)

• Two VMs are deployed on the same socket and on different sockets, respectively
• *-Def and Inter-VM Inter-Container-1Layer have similar performance
• Still a large gap compared to native performance with just the 1Layer design

1Layer* – J. Zhang, X. Lu, D. K. Panda. High Performance MPI Library for Container-based HPC Cloud on InfiniBand Clusters. ICPP, 2016

Challenges of Nested Virtualization

• How to further reduce the performance overhead of running applications in a nested virtualization environment?

• What are the impacts of the different VM/container placement schemes on communication at the container level?

• Can we propose a design which can adapt to these different VM/container placement schemes and deliver near-native performance for nested virtualization environments?

Overview of Proposed Design in MVAPICH2

Two-Layer Locality Detector: dynamically detects MPI processes in the co-resident containers inside one VM as well as the ones in the co-resident VMs

Two-Layer NUMA Aware Communication Coordinator: leverages nested locality info, NUMA architecture info, and message information to select the appropriate communication channel (a minimal sketch follows after the reference below)

J. Zhang, X. Lu, D. K. Panda. Designing Locality and NUMA Aware MPI Runtime for Nested Virtualization based HPC Cloud with SR-IOV Enabled InfiniBand, VEE, 2017
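Extending the single-layer sketch shown earlier, the hedged example below adds the second (VM) layer and a NUMA flag; again, all types and names are illustrative rather than the actual MVAPICH2 implementation.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hedged sketch of two-layer (container + VM) locality-aware channel
 * selection; all names are illustrative, not actual MVAPICH2 symbols. */
typedef enum {
    CH_SHARED_MEM,   /* same VM (possibly different containers): CMA/shm */
    CH_IVSHMEM,      /* different VMs on the same host: IVShmem device   */
    CH_SRIOV_IB      /* different hosts: SR-IOV VF over InfiniBand       */
} channel_t;

struct peer_locality {
    bool same_vm;        /* layer 1: detected inside the VM              */
    bool same_host;      /* layer 2: co-resident VM on this host         */
    bool same_numa_node; /* NUMA info, e.g. to tune eager/rendezvous     */
};

static channel_t select_channel(const struct peer_locality *p)
{
    if (p->same_vm)
        return CH_SHARED_MEM;   /* paths 1 and 2: intra-VM               */
    if (p->same_host)
        return CH_IVSHMEM;      /* path 3: inter-VM, intra-host          */
    return CH_SRIOV_IB;         /* path 4: inter-node                    */
}

int main(void)
{
    struct peer_locality peers[] = {
        { true,  true,  true  },   /* co-resident container, same VM     */
        { false, true,  false },   /* container in another VM, same host */
        { false, false, false },   /* container on a remote node         */
    };
    const char *names[] = { "SHARED_MEM", "IVSHMEM", "SR-IOV_IB" };

    for (int i = 0; i < 3; i++)
        printf("peer %d -> %s\n", i, names[select_channel(&peers[i])]);
    return 0;
}
```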

Inter-VM Inter-Container Pt2Pt (Intra-Socket)

(Figure: latency and bandwidth)

• 1Layer has similar performance to the Default
• Compared with 1Layer, 2Layer delivers up to 84% and 184% improvement for latency and BW

Inter-VM Inter-Container Pt2Pt (Inter-Socket)

(Figure: latency and bandwidth)

• 1Layer has similar performance to the Default
• 2Layer has near-native performance for small messages, but clear overhead on large messages
• Compared to 2Layer, the Hybrid design brings up to 42% and 25% improvement for latency and BW, respectively

Applications Performance

(Figure: Graph500 BFS execution time across problem sizes (22,20 through 28,16) and NAS Class D (IS, MG, EP, FT, CG, LU) execution time for Default, 1Layer, and 2Layer-Enhanced-Hybrid)

• 256 processes across 64 containers on 16 nodes
• Compared with Default, the enhanced-hybrid design reduces execution time by up to 16% (28,16) for Graph 500 and 10% (LU) for NAS
• Compared with the 1Layer case, the enhanced-hybrid design also brings up to 12% (28,16) and 6% (LU) performance benefit

Approaches to Build HPC Clouds
• MVAPICH2-Virt with SR-IOV and IVSHMEM – Standalone, OpenStack
• SR-IOV-enabled VM Migration Support in MVAPICH2
• MVAPICH2 with Containers (Docker and Singularity)
• MVAPICH2 with Nested Virtualization (Container over VM)
• MVAPICH2-Virt on SLURM – SLURM alone, SLURM + OpenStack

Overview of OSU Solutions for Building HPC Clouds: MVAPICH2-Virt on SLURM and OpenStack

Typical Usage Scenarios

(Figure: three typical usage scenarios on the compute nodes – Exclusive Allocations Sequential Jobs (EASJ), Shared-host Allocations Concurrent Jobs (SACJ), and Exclusive Allocations Concurrent Jobs (EACJ), with MPI jobs running inside VMs)

Need for Supporting SR-IOV and IVSHMEM in SLURM

• Requirement of managing and isolating the virtualized resources of SR-IOV and IVSHMEM
• Such management and isolation is hard to achieve with the MPI library alone, but much easier with SLURM
• Running MPI applications efficiently on HPC Clouds needs SLURM support for managing SR-IOV and IVSHMEM
  – Can critical HPC resources be efficiently shared among users by extending SLURM with support for SR-IOV and IVSHMEM based virtualization?
  – Can SR-IOV and IVSHMEM enabled SLURM and MPI library provide bare-metal performance for end applications on HPC Clouds?

Architecture Overview

(Figure: a job is submitted via sbatch together with a VM configuration file; SLURMctld turns the physical resource request into a physical node list; SLURMd on each physical node launches the VMs through libvirtd, and the MPI processes run inside the VMs, with images from an Image Pool and global storage on Lustre)

VM launching/reclaiming handles:
1. SR-IOV virtual function
2. IVSHMEM device
3. Network setting
4. Image management (Image Pool)
5. Launching VMs and checking availability (see the libvirt sketch below)
6. Mounting global storage (e.g., Lustre), etc.
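As an illustration of step 5, a node-level helper could ask the local libvirtd to create a domain roughly as sketched below; this is not the SLURM-V code, and the domain XML path is a placeholder (a real deployment would generate the XML, including the SR-IOV VF and ivshmem device entries, from the job's VM configuration file).

```c
#include <stdio.h>
#include <stdlib.h>
#include <libvirt/libvirt.h>

/* Read a whole file into a malloc'd, NUL-terminated buffer. */
static char *read_file(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (!f) return NULL;
    fseek(f, 0, SEEK_END);
    long len = ftell(f);
    fseek(f, 0, SEEK_SET);
    char *buf = malloc(len + 1);
    if (buf && fread(buf, 1, len, f) == (size_t)len)
        buf[len] = '\0';
    fclose(f);
    return buf;
}

int main(void)
{
    virConnectPtr conn = virConnectOpen("qemu:///system");
    if (!conn) { fprintf(stderr, "cannot connect to libvirtd\n"); return 1; }

    char *xml = read_file("/tmp/vm1-domain.xml");   /* placeholder path */
    if (!xml) { virConnectClose(conn); return 1; }

    /* Create and start a transient domain from the XML description. */
    virDomainPtr dom = virDomainCreateXML(conn, xml, 0);
    if (dom) {
        printf("launched VM: %s\n", virDomainGetName(dom));
        virDomainFree(dom);
    } else {
        fprintf(stderr, "virDomainCreateXML failed\n");
    }

    free(xml);
    virConnectClose(conn);
    return 0;
}
```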

SLURM SPANK Plugin based Design

(Figure: mpirun_vm interacting with Slurmctld and Slurmd; the SPANK plugin is loaded at job-step registration, VM launch, and VM reclaim)

• VM Configuration Reader – registers all VM configuration options and sets them in the job control environment so that they are visible to all allocated nodes (a minimal SPANK sketch follows below)
• VM Launcher – sets up VMs on each allocated node
  – File-based lock to detect occupied VFs and exclusively allocate a free VF
  – Assign a unique ID to each IVSHMEM across VMs and dynamically attach it to each VM
• VM Reclaimer – tears down VMs and reclaims resources
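For orientation, SPANK plugins are plain C shared objects loaded by SLURM; the hedged sketch below registers a job option and exports it to the task environment, which is the general mechanism a VM Configuration Reader relies on. The option name, environment variable, and file-path handling are illustrative, not the actual SLURM-V plugin.

```c
#include <slurm/spank.h>
#include <string.h>

/* Hedged sketch of a SPANK plugin registering a "vm-config" job option.
 * Build as a shared object and list it in plugstack.conf. */
SPANK_PLUGIN(vm_config_reader, 1);

static char vm_config_path[4096] = "";

static int vm_config_cb(int val, const char *optarg, int remote)
{
    (void)val; (void)remote;
    if (optarg)
        strncpy(vm_config_path, optarg, sizeof(vm_config_path) - 1);
    return 0;
}

static struct spank_option vm_opt = {
    "vm-config",                                    /* --vm-config=<file> */
    "file",
    "Path to the VM configuration file for this job",
    1,                                              /* has_arg            */
    0,                                              /* val                */
    vm_config_cb
};

int slurm_spank_init(spank_t sp, int ac, char **av)
{
    (void)ac; (void)av;
    return spank_option_register(sp, &vm_opt) == ESPANK_SUCCESS ? 0 : -1;
}

int slurm_spank_task_init(spank_t sp, int ac, char **av)
{
    (void)ac; (void)av;
    /* Make the value visible to tasks on every allocated node. */
    if (vm_config_path[0])
        spank_setenv(sp, "VM_CONFIG_FILE", vm_config_path, 1);
    return 0;
}
```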

SLURM SPANK Plugin with OpenStack based Design

(Figure: mpirun_vm, Slurmctld, Slurmd, and the OpenStack daemon cooperating to register the job, launch VMs, and reclaim VMs)

• VM Configuration Reader – registers the VM options
• VM Launcher, VM Reclaimer – offloaded to the underlying OpenStack infrastructure
  – PCI whitelist to pass a free VF through to the VM
  – Extend Nova to enable IVSHMEM when launching the VM

J. Zhang, X. Lu, S. Chakraborty, D. K. Panda. SLURM-V: Extending SLURM for Building Efficient HPC Cloud with SR-IOV and IVShmem. Euro-Par, 2016

Application-Level Performance on Chameleon (Graph500)

(Figure: Graph500 BFS execution time across problem sizes (Scale, Edgefactor) for VM vs. Native under three scenarios – Exclusive Allocations Sequential Jobs, Shared-host Allocations Concurrent Jobs, and Exclusive Allocations Concurrent Jobs)

• 32 VMs across 8 nodes, 6 Core/VM
• EASJ – compared to Native, less than 4% overhead with 128 Procs
• SACJ, EACJ – also minor overhead, when running NAS as a concurrent job with 64 Procs

Available Appliances on Chameleon Cloud*

• CentOS 7 KVM SR-IOV – Chameleon bare-metal image customized with the KVM hypervisor and a recompiled kernel to enable SR-IOV over InfiniBand. https://www.chameleoncloud.org/appliances/3/
• MPI bare-metal cluster complex appliance (Based on Heat) – Deploys an MPI cluster composed of bare-metal instances using the MVAPICH2 library over InfiniBand. https://www.chameleoncloud.org/appliances/29/
• MPI + SR-IOV KVM cluster (Based on Heat) – Deploys an MPI cluster of KVM virtual machines using the MVAPICH2-Virt implementation, configured with SR-IOV for high-performance communication over InfiniBand. https://www.chameleoncloud.org/appliances/28/
• CentOS 7 SR-IOV RDMA-Hadoop – Built from the CentOS 7 KVM SR-IOV appliance and additionally contains the RDMA-Hadoop library with SR-IOV. https://www.chameleoncloud.org/appliances/17/

• Through these available appliances, users and researchers can easily deploy HPC clouds to perform experiments and run jobs with
  – High-Performance SR-IOV + InfiniBand
  – High-Performance MVAPICH2 Library over bare-metal InfiniBand clusters
  – High-Performance MVAPICH2 Library with Virtualization Support over SR-IOV enabled KVM clusters
  – High-Performance Hadoop with RDMA-based Enhancements Support

[*] Only includes appliances contributed by OSU NowLab

MPI Complex Appliances based on MVAPICH2 on Chameleon

MVAPICH2-Virt Heat-based Complex Appliance – orchestration steps:
1. Load VM Config
2. Allocate Ports
3. Allocate FloatingIPs
4. Generate SSH Keypair
5. Launch VM
6. Attach SR-IOV Device
7. Hotplug IVShmem Device
8. Download/Install MVAPICH2-Virt
9. Populate VMs/IPs
10. Associate FloatingIPs

Demos on Chameleon Cloud

• A Demo of Deploying MPI Bare-Metal Cluster with InfiniBand
• A Demo of Deploying MPI KVM Cluster with SR-IOV enabled InfiniBand
• Running MPI Programs on Chameleon

Login to Chameleon Cloud

https://chi.tacc.chameleoncloud.org/dashboard/auth/login/

Create a Lease

Lease Starts

Select Complex Appliance - MPI Bare-Metal Cluster

Get Template of MPI Bare-Metal Appliance

Save URL of Template (Will be used later)

Launch Stack

Use Saved Template URL as Source

Input Stack Information

Use Created Lease

Stack Creation In Progress

Stack Details

Demos on Chameleon Cloud

• A Demo of Deploying MPI Bare-Metal Cluster with InfiniBand
• A Demo of Deploying MPI KVM Cluster with SR-IOV enabled InfiniBand
• Running MPI Programs on Chameleon

Select Complex Appliance - MPI KVM with SR-IOV Cluster

Get Template of MPI KVM with SR-IOV Appliance

Save URL of Template (Will be used later)

Launch Stack

Instances in Stack (Spawning …)

Demos on Chameleon Cloud

• A Demo of Deploying MPI Bare-Metal Cluster with InfiniBand
• A Demo of Deploying MPI KVM Cluster with SR-IOV enabled InfiniBand
• Running MPI Programs on Chameleon

Create Stack Successfully – Login to the First Instance with Floating IP

Login Instance with Floating IP

SSH to Other Instances

Compile MPI Program – Hello World
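The screenshot is not reproduced in this deck, so here is a minimal MPI Hello World of the kind compiled in this step (file and host names are illustrative).

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank    */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of ranks  */
    MPI_Get_processor_name(name, &len);     /* which instance we are on */

    printf("Hello from rank %d of %d on %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}
```

Compile with `mpicc hello.c -o hello` and launch with MVAPICH2's `mpirun_rsh -np 4 -hostfile hosts ./hello` (or `mpiexec`), where `hosts` lists the instances' IPs.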

Distribute Executable to Other Instances

Run MPI Program – Hello World

Run OSU MPI Benchmarks – Latency and Bandwidth
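The OSU Micro-Benchmarks shipped with MVAPICH2 are the real tool used in this step; as a rough, hedged stand-in for what osu_latency measures, the sketch below times a two-rank ping-pong at a single message size (the actual benchmark adds warm-up iterations, a message-size sweep, and a separate windowed bandwidth test).

```c
#include <stdio.h>
#include <string.h>
#include <mpi.h>

#define ITERS 1000
#define MSG_SIZE 8   /* bytes; osu_latency sweeps many sizes */

int main(int argc, char **argv)
{
    int rank;
    char buf[MSG_SIZE];
    memset(buf, 'a', sizeof(buf));

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    double t1 = MPI_Wtime();
    if (rank == 0)   /* one-way latency is half the round-trip time */
        printf("%d-byte latency: %.2f us\n", MSG_SIZE,
               (t1 - t0) * 1e6 / (2.0 * ITERS));

    MPI_Finalize();
    return 0;
}
```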

Next Steps of MVAPICH2-Virt

Conclusions

• MVAPICH2-Virt over SR-IOV-enabled InfiniBand is an efficient approach to build HPC Clouds
  – Standalone, OpenStack, Slurm, and Slurm + OpenStack
  – Supports Virtual Machine migration with SR-IOV InfiniBand devices
  – Supports Virtual Machines, Containers (Docker and Singularity), and Nested Virtualization
• Very little overhead with virtualization, near-native performance at the application level
• Much better performance than Amazon EC2
• MVAPICH2-Virt is available for building HPC Clouds
  – SR-IOV, IVSHMEM, Docker and Singularity, OpenStack
• Appliances for MVAPICH2-Virt are available for building HPC Clouds
• Demos on NSF Chameleon Cloud
• Future releases for supporting running MPI jobs in VMs/Containers with SLURM, etc.
• SR-IOV/container support and appliances for other MVAPICH2 libraries (MVAPICH2-X, MVAPICH2-GDR, ...)

Thank You! [email protected] http://www.cse.ohio-state.edu/~luxi

Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/ MVAPICH: MPI over InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE http://mvapich.cse.ohio-state.edu/
