
Building HPC Cloud with InfiniBand: Efficient Support in MVAPICH2 for KVM, Docker, Singularity, OpenStack, and SLURM

A Tutorial at MUG 2018 by

Xiaoyi Lu
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~luxi

HPC Meets Cloud Computing

• Cloud Computing is widely adopted in industry computing environments
• Cloud Computing provides high resource utilization and flexibility
• Virtualization is the key technology to enable Cloud Computing
• Intersect360 study shows cloud is the fastest growing class of HPC
• HPC Meets Cloud: the convergence of Cloud Computing and HPC

HPC Cloud - Combining HPC with Cloud

• IDC expects that by 2019, HPC ecosystem revenue will jump to a record $30.2 billion. IDC foresees public clouds, and especially custom public clouds, supporting an increasing proportion of the aggregate HPC workload as these cloud facilities grow more capable and mature (Courtesy: http://www.idc.com/getdoc.jsp?containerId=247846)
• Combining HPC with Cloud still faces challenges because of the performance overhead associated with virtualization support
  – Lower performance of virtualized I/O devices
• HPC Cloud Examples
  – Amazon EC2 with Enhanced Networking
    • Using Single Root I/O Virtualization (SR-IOV)
    • Higher performance (packets per second), lower latency, and lower jitter
    • 10 GigE
  – NSF Chameleon Cloud

Outline
• Overview of Cloud Computing System Software
• Overview of Modern HPC Cloud Architecture
• Challenges of Building HPC Clouds
• High-Performance MPI Library on HPC Clouds
• Integrated Designs with Cloud Resource Manager
• Appliances and Demos on Chameleon Cloud
• Conclusion and Q&A

Virtualization Technology (VM vs. Container)

• Provides abstractions of multiple virtual resources by utilizing an intermediate software layer on top of the underlying system

Hypervisor-based Virtualization (VMs):
• Hypervisor provides a full abstraction of the VM
• Full virtualization, different guest OS, better isolation
• Larger overhead due to the heavy stack

Container-based Virtualization:
• Shares the host kernel
• Allows execution of isolated user space instances
• Lightweight, portability
• Not strong isolation

[Figure: hypervisor-based stack (App, bins/libs, Guest OS per VM over Hypervisor, Host OS, Hardware) vs. container-based stack (App, bins/libs per container over Host Linux OS, Hardware)]

Overview of Kernel-based Virtual Machine (KVM)

• A full virtualization solution for Linux on x86 hardware that contains virtualization extensions (Intel VT or AMD-V)
• The KVM module creates a bare-metal hypervisor on the Linux kernel
• KVM hosts the virtual machine images as regular Linux processes
• Each virtual machine image can use all of the features of the Linux kernel, including hardware, security, storage, etc.

https://www.ibm.com/support/knowledgecenter/en/linuxonibm/liaat/liaatkvmover.htm
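Where KVM is in play, a quick sanity check from the host can confirm the pieces described above; a minimal sketch, assuming libvirt/QEMU are installed (package names and outputs vary by distribution):

```bash
# Check for hardware virtualization extensions (Intel VT / AMD-V).
egrep -c '(vmx|svm)' /proc/cpuinfo   # non-zero means the CPU supports it
# Check that the KVM kernel modules are loaded and the device node exists.
lsmod | grep kvm                     # expect kvm plus kvm_intel or kvm_amd
ls -l /dev/kvm
# Each guest is hosted as a regular Linux (QEMU) process.
virsh list --all                     # VMs known to libvirt
ps -ef | grep qemu                   # the corresponding host processes
```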

Container Technology - Docker

• Inherits the advantages of container techniques
• Active community contribution
• Root-owned daemon process
• Root escalation in Docker containers
• Non-negligible performance overhead

Singularity Overview

• Reproducible software stacks
  – Easily verify via checksum or cryptographic signature
• Mobility of compute
  – Able to transfer (and store) containers via standard data mobility tools
• Compatibility with complicated architectures
  – Runtime immediately compatible with existing HPC architecture
• Security model
  – Supports untrusted users running untrusted containers

http://singularity.lbl.gov/about

Container Technology (Docker vs. Singularity)

• Singularity aims to provide reproducible and mobile environments across HPC centers
• NO root-owned daemon
• NO root escalation
• Example (see the sketch below): mpirun_rsh -np 2 -hostfile htfiles singularity exec /tmp/Centos-7.img /usr/bin/osu_latency
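A slightly fuller sketch of this workflow, assuming MVAPICH2 is installed on both hosts and that the image and hostfile names are placeholders:

```bash
# The container image is assumed to already exist and to contain the OSU
# micro-benchmarks (e.g. built with `singularity build` from a definition
# file that installs them).
printf "node1\nnode2\n" > hostfile
# Launch the containerized benchmark through MVAPICH2's mpirun_rsh;
# MVAPICH2 runs on the hosts, only the application is containerized.
mpirun_rsh -np 2 -hostfile hostfile \
    singularity exec /tmp/Centos-7.img /usr/bin/osu_latency
```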


Drivers of Modern HPC Cluster and Cloud Architecture

[Figure: high performance interconnects - InfiniBand (with SR-IOV), <1 usec latency, 200 Gbps bandwidth; SSDs, object storage clusters; large memory nodes (up to 2 TB); multi-/many-core processors]
• Multi-core/many-core technologies, Accelerators
• Large memory nodes
• Solid State Drives (SSDs), NVM, Parallel Filesystems, Object Storage Clusters
• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
• Single Root I/O Virtualization (SR-IOV)

[Figure: example systems - clouds, SDSC Comet, TACC Stampede]

Trends in High-Performance Networking Technologies
• Advanced Interconnects and RDMA protocols
  – InfiniBand (up to 200 Gbps, HDR)
  – 10/40/100 Gigabit Ethernet/iWARP
  – RDMA over Converged Enhanced Ethernet (RoCE)
  – Omni-Path
• Delivering excellent performance (latency, bandwidth, and CPU utilization)
• Has influenced re-designs of enhanced HPC middleware
  – Message Passing Interface (MPI) and PGAS
  – Parallel File Systems (Lustre, GPFS, ...)
• Paving the way to wide utilization in HPC Clouds with virtualization support (SR-IOV)

Available Interconnects and Protocols for Data Centers

[Figure: interconnect/protocol stack - applications and middleware run over Sockets, Verbs, or OFI interfaces; protocol options include TCP/IP (with or without hardware offload/TOE), IPoIB, RSockets, SDP, iWARP, RoCE, and native IB verbs; adapters and switches span 1/10/25/40/50/100 GigE, InfiniBand, and Omni-Path]

Open Standard InfiniBand Networking Technology
• Introduced in Oct 2000
• High Performance Data Transfer
  – Interprocessor communication and I/O
  – Low latency (<1.0 microsec), high bandwidth (up to 25 GigaBytes/sec -> 200 Gbps), and low CPU utilization (5-10%)
• Flexibility for LAN and WAN communication
• Multiple Transport Services
  – Reliable Connection (RC), Unreliable Connection (UC), Reliable Datagram (RD), Unreliable Datagram (UD), and Raw Datagram
  – Provides flexibility to develop upper layers
• Multiple Operations
  – Send/Recv
  – RDMA Read/Write
  – Atomic Operations (very unique)
    • High performance and scalable implementations of distributed locks, semaphores, collective communication operations
• Leading to big changes in designing HPC clusters, file systems, cloud computing systems, grid computing systems, ...
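These verbs-level operations can be exercised directly with the OFED perftest utilities; a hedged sketch (hostnames are placeholders, and the exact tool set depends on the installed perftest version):

```bash
# On the server node, start the matching receiver for each test first, e.g.:
#   ib_send_lat    ib_read_lat    ib_write_bw    ib_atomic_lat
# Then, from the client node:
ib_send_lat   ib-server      # Send/Recv latency
ib_read_lat   ib-server      # RDMA Read latency
ib_write_bw   ib-server      # RDMA Write bandwidth
ib_atomic_lat ib-server      # atomic (fetch-and-add / compare-and-swap) latency
```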

Background: Inter-VM Shared Memory (IVShmem)
• Inter-VM Shared Memory (IVShmem, e.g. Nahanni) provides zero-copy access to data in shared memory of co-resident VMs on the KVM platform
• IVShmem is designed and implemented mainly in the system-call layer, and its interfaces are also visible to user-space applications
• Three components: the guest kernel driver, the modified QEMU supporting a PCI device, and the POSIX shared memory region on the host OS
• The shared memory region is allocated by host POSIX operations and mapped into the QEMU process address space; the mapped memory can then be used by guest applications by being remapped to user space in the guest VMs
• Evaluations show IVShmem can improve point-to-point and collective operations by up to 193% and 91%, respectively, and reduce application execution time by up to 96%, compared to SR-IOV, while adding only small overheads compared with the native environment

Single Root I/O Virtualization (SR-IOV)
• SR-IOV is a PCI Express (PCIe) standard which specifies native I/O virtualization capabilities in PCIe adapters, providing new opportunities to design HPC clouds with very low overhead
• Allows a single physical device, or a Physical Function (PF), to present itself as multiple virtual devices, or Virtual Functions (VFs)
• VFs are designed based on the existing non-virtualized PFs, so no driver change is needed
• Each VF can be dedicated to a single VM through PCI pass-through, which allows each VM to directly access the corresponding VF
• Works with 10/40 GigE and InfiniBand

[Fig. 2: Overview of the Inter-VM Shmem and SR-IOV communication mechanisms]
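As an illustration of what SR-IOV looks like from the host side, the sketch below inspects and instantiates VFs through the standard Linux sysfs interface (the device name mlx5_0 and the VF count are assumptions; on Mellanox HCAs SR-IOV must also be enabled in firmware, e.g. via mlxconfig):

```bash
# How many VFs the PF can expose, and how many are currently instantiated.
cat /sys/class/infiniband/mlx5_0/device/sriov_totalvfs
cat /sys/class/infiniband/mlx5_0/device/sriov_numvfs
# Instantiate 4 VFs on the PF (requires root).
echo 4 > /sys/class/infiniband/mlx5_0/device/sriov_numvfs
# The VFs show up as additional PCIe devices that can be passed to VMs.
lspci | grep -i "virtual function"
```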

Overview of MVAPICH2 Project

• High Performance open-source MPI Library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
  – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), started in 2001, first version available in 2002
  – MVAPICH2-X (MPI + PGAS), available since 2011
  – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
  – Support for Virtualization (MVAPICH2-Virt), available since 2015
  – Support for Energy-Awareness (MVAPICH2-EA), available since 2015
  – Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
  – Used by more than 2,925 organizations in 86 countries
  – More than 484,000 (> 0.4 million) downloads from the OSU site directly
  – Empowering many TOP500 clusters (Jul '18 ranking)
    • 2nd ranked 10,649,640-core cluster (Sunway TaihuLight) at NSC, Wuxi, China
    • 12th, 556,104 cores (Oakforest-PACS) in Japan
    • 15th, 367,024 cores (Stampede2) at TACC
    • 24th, 241,108-core (Pleiades) at NASA, and many others
  – Available with the software stacks of many vendors and Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu

Architecture of MVAPICH2 Software Family

[Figure: MVAPICH2 software family architecture]
• High Performance Parallel Programming Models: Message Passing Interface (MPI); PGAS (UPC, OpenSHMEM, CAF, UPC++); Hybrid --- MPI + X (MPI + PGAS + OpenMP/Cilk)
• High Performance and Scalable Communication Runtime with diverse APIs and mechanisms: point-to-point primitives, collectives algorithms, energy-awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, job startup, introspection & analysis
• Support for Modern Networking Technology (InfiniBand, iWARP, RoCE, Omni-Path) and Modern Multi-/Many-core Architectures (Intel Xeon, OpenPower, Xeon Phi (MIC, KNL), NVIDIA GPGPU)
  – Transport protocols: RC, XRC, UD, DC
  – Modern features: UMR, ODP*
  – Transport mechanisms: SR-IOV, Multi-Rail, Shared Memory, CMA, IVSHMEM
  – Modern features: MCDRAM*, NVLink*, CAPI*
(* Upcoming)

Building HPC Cloud with SR-IOV and InfiniBand

• High-Performance Computing (HPC) has adopted advanced interconnects and protocols
  – InfiniBand
  – 10/40/100 Gigabit Ethernet/iWARP
  – RDMA over Converged Enhanced Ethernet (RoCE)

• Very Good Performance
  – Low latency (few microseconds)
  – High bandwidth (200 Gb/s with HDR InfiniBand)
  – Low CPU overhead (5-10%)

• The OpenFabrics software stack with IB, iWARP and RoCE interfaces is driving HPC systems
• How to build HPC Clouds with SR-IOV and InfiniBand while delivering optimal performance?

HPC, Big Data, and Deep Learning on Cloud Computing Systems: Challenges

[Figure: layered challenge view]
• Applications
• HPC, Big Data, Deep Learning Middleware: HPC (MPI, PGAS, MPI+PGAS, MPI+OpenMP, etc.), Big Data (HDFS, MapReduce, Spark, HBase, Memcached, etc.), Deep Learning (Caffe, TensorFlow, CNTK, BigDL, etc.)
• Resource Management and Scheduling Systems for Cloud Computing (OpenStack Nova, Swift, Heat; Slurm, etc.)
• Communication and I/O Library: communication channels (SR-IOV, IVShmem, IPC-Shm, CMA); locality- and NUMA-aware communication; virtualization (hypervisor and container); data placement & task scheduling; fault-tolerance (migration, replication, etc.); QoS-aware, etc.
• Commodity Computing System Architectures (multi- and many-core architectures and accelerators), Networking Technologies (InfiniBand, Omni-Path, 1/10/40/100 GigE and intelligent NICs), Storage Technologies (HDD, SSD, NVRAM, and NVMe-SSD)

Broad Challenges in Designing Communication and I/O Middleware for HPC on Clouds
• Virtualization Support with Virtual Machines and Containers
  – KVM, Docker, Singularity, etc.
• Communication coordination among optimized communication channels on Clouds
  – SR-IOV, IVShmem, IPC-Shm, CMA, etc.
• Locality-aware communication
• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
• Scalable collective communication
  – Offload; Non-blocking; Topology-aware
• Balancing intra-node and inter-node communication for next generation nodes (128-1024 cores)
  – Multiple end-points per node
• NUMA-aware communication for nested virtualization
• Integrated support for GPGPUs and Accelerators
• Fault-tolerance/resiliency
  – Migration support with virtual machines
• QoS support for communication and I/O
• Support for Hybrid MPI+PGAS programming (MPI + OpenMP, MPI + UPC, MPI + OpenSHMEM, MPI + UPC++, CAF, ...)
• Energy-Awareness
• Co-design with resource management and scheduling systems on Clouds
  – OpenStack, Slurm, etc.


High-Performance MPI Library on HPC Clouds
• MVAPICH2-Virt with SR-IOV and IVSHMEM
• SR-IOV-enabled VM Migration Support in MVAPICH2
• MVAPICH2 with Containers (Docker and Singularity)
• MVAPICH2 with Nested Virtualization (Container over VM)

HPC on Cloud Computing Systems: Challenges Addressed by OSU So Far

[Figure: layers addressed by OSU so far]
• Applications
• HPC and Big Data Middleware: HPC (MPI, PGAS, MPI+PGAS, MPI+OpenMP, etc.)
• Resource Management and Scheduling Systems for Cloud Computing (OpenStack Nova, Heat; Slurm)
• Communication and I/O Library: communication channels (SR-IOV, IVShmem, IPC-Shm, CMA); locality- and NUMA-aware communication; virtualization (hypervisor and container); fault-tolerance & consolidation (migration); QoS-aware (future studies)
• Commodity Computing System Architectures (multi- and many-core architectures and accelerators), Networking Technologies (InfiniBand, Omni-Path, 1/10/40/100 GigE and intelligent NICs), Storage Technologies (HDD, SSD, NVRAM, and NVMe-SSD)

MVAPICH2-Virt with SR-IOV and IVSHMEM

Intra-Node Inter-VM Performance

[Figure: intra-node inter-VM latency for IB_Send, IB_Read, IB_Write, and IVShmem across message sizes from 2 bytes to 1 MB]

Latency at 64 bytes message size: SR-IOV(IB_Send) - 0.96μs, IVShmem - 0.2μs

Can IVShmem scheme benefit MPI communication within a node on SR-IOV enabled InfiniBand clusters?

Overview of MVAPICH2-Virt with SR-IOV and IVSHMEM

• Redesign MVAPICH2 to make it virtual machine aware
  – SR-IOV shows near-to-native performance for inter-node point-to-point communication
  – IVSHMEM offers shared-memory-based data access across co-resident VMs
  – Locality Detector: maintains the locality information of co-resident virtual machines
  – Communication Coordinator: selects the communication channel (SR-IOV, IVSHMEM) adaptively

[Figure: MPI processes in guest VMs use the SR-IOV channel (VF device driver, InfiniBand adapter) for inter-node traffic and the IVShmem channel (/dev/shm on the host) for intra-node inter-VM traffic]

J. Zhang, X. Lu, J. Jose, R. Shi, D. K. Panda. Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters? Euro-Par, 2014
J. Zhang, X. Lu, J. Jose, R. Shi, M. Li, D. K. Panda. High Performance MPI Library over SR-IOV Enabled InfiniBand Clusters. HiPC, 2014

Intra-node Inter-VM Point-to-Point Performance

[Figure: intra-node inter-VM point-to-point latency and bandwidth for SR-IOV, Proposed, and Native across message sizes]

• Compared to SR-IOV, up to 84% and 158% improvement on latency and bandwidth
• Compared to Native, 3%-8% overhead on both latency and bandwidth


Application Performance (NAS & P3DFFT)

[Figure: execution time of NAS Class B (FT, LU, CG, MG, IS) and P3DFFT 512x512x512 (SPEC, INVERSE, RAND, SINE) with 32 VMs (8 VMs per node), SR-IOV vs. Proposed design]

• Proposed design delivers up to 43% (IS) improvement for NAS
• Proposed design brings 29%, 33%, 29% and 20% improvement for the P3DFFT INVERSE, RAND, SINE and SPEC tests, respectively

Application-Level Performance on Chameleon

[Figure: Graph500 BFS execution time (problem sizes 22,20 through 26,16) and SPEC MPI2007 execution time (milc, leslie3d, pop2, GAPgeofem, zeusmp2, lu) for MV2-SR-IOV-Def, MV2-SR-IOV-Opt, and MV2-Native]
• 32 VMs, 6 cores/VM
• Compared to Native, 2-5% overhead for Graph500 with 128 procs
• Compared to Native, 1-9.5% overhead for SPEC MPI2007 with 128 procs

Application-Level Performance on Azure

[Figure: NPB Class B (bt, cg, ep, ft, is, lu, mg, sp) and Class C (cg, ep, ft, is, lu, mg) execution time for MVAPICH2 vs. Intel MPI]

• NPB Class B with 16 processes on 2 Azure A8 instances
• NPB Class C with 32 processes on 4 Azure A8 instances
• Comparable performance between MVAPICH2 and Intel MPI


SR-IOV-enabled VM Migration Support in MVAPICH2

Execute Live Migration with SR-IOV Device

Overview of Existing Migration Solutions for SR-IOV

Solution | Platform | NIC | No Guest OS Modification | Device Driver Independent | Hypervisor Independent
Zhai, etc. (Linux bonding driver) | Ethernet | N/A | Yes | No | Maybe
Kadav, etc. (shadow driver) | Ethernet | Intel Pro/1000 gigabit NIC | No | Yes | No (Xen)
Pan, etc. (CompSC) | Ethernet | Intel 82576, Intel 82599 | Yes | No | No (Xen)
Guay, etc. | InfiniBand | Mellanox ConnectX2 QDR HCA | Yes | No | Yes (Oracle VM Server (OVS) 3.0)
Han | Ethernet | Huawei smart NIC | Yes | No | No (QEMU+KVM)
Xu, etc. (SRVM) | Ethernet | Intel 82599 | Yes | Yes | No (VMware ESXi)

Can we have a hypervisor-independent and device driver-independent solution for InfiniBand based HPC Clouds with SR-IOV?

High Performance SR-IOV enabled VM Migration Framework

• Consists of an SR-IOV enabled IB cluster and an external Migration Controller
• Detachment/re-attachment of virtualized devices: multiple parallel libraries coordinate VMs during migration (detach/reattach SR-IOV/IVShmem, migrate VMs, migration status)
• IB connection: MPI runtime handles IB connection suspending and reactivating
• Propose Progress Engine (PE) based and Migration Thread (MT) based designs to optimize VM migration and MPI application performance

[Figure: guest VMs with MPI and SR-IOV VFs run on hosts with hypervisor, IVShmem, IB and Ethernet adapters; an external Migration Controller coordinates the suspend, migrate, and reactivate steps across hosts]

J. Zhang, X. Lu, D. K. Panda. High-Performance Virtual Machine Migration Framework for MPI Applications on SR-IOV enabled InfiniBand Clusters. IPDPS, 2017

Performance Evaluation of VM Migration Framework

[Figure: breakdown of VM migration time (set POST_MIGRATION, add IVSHMEM, attach VF, migration, remove IVSHMEM, detach VF, set PRE_MIGRATION) over TCP, IPoIB, and RDMA; and multiple-VM (2-16 VMs) migration time for the sequential vs. proposed migration framework]

• Compared with TCP, the RDMA scheme reduces the total migration time by 20%
• Total time is dominated by the `Migration' step; times for the other steps are similar across the different schemes
• The proposed migration framework reduces migration time by up to 51%

Performance Evaluation of VM Migration Framework

[Figure: point-to-point latency and Bcast latency (4 VMs * 2 procs/VM) during migration for PE-IPoIB, PE-RDMA, MT-IPoIB, and MT-RDMA]
• Migrate a VM from one machine to another while the benchmark is running inside
• Proposed MT-based designs perform slightly worse than PE-based designs because of lock/unlock
• No benefit from MT because NO computation is involved

Performance Evaluation with Applications

[Figure: execution time of NAS (LU.B, EP.B, MG.B, CG.B, FT.B) and Graph500 (20,10 through 22,10) for PE, MT-worst, MT-typical, and NM (no migration)]
• 8 VMs in total and 1 VM carries out migration during the application run
• Compared with NM, MT-worst and PE incur some overhead
• MT-typical allows migration to be completely overlapped with computation


MVAPICH2 with Containers (Docker and Singularity)

Benefits of Containers-based Virtualization for HPC on Cloud

[Figure: ib_send_lat latency for VM-PT, VM-SR-IOV, Container-PT, and Native; Graph500 BFS execution time (scale, edgefactor 20,16) broken down into communication and computation for Native-16P, 1Cont*16P, 2Conts*8P, and 4Conts*4P]
• Experiment on NSF Chameleon Cloud
• Container has less overhead than VM

• BFS time in Graph 500 significantly increases as the number of containers per host increases. Why?

J. Zhang, X. Lu, D. K. Panda. Performance Characterization of Hypervisor- and Container-Based Virtualization for HPC on SR-IOV Enabled InfiniBand Clusters. IPDRM, IPDPS Workshop, 2016

Containers-based Design: Issues, Challenges, and Approaches

• What are the performance bottlenecks when running MPI applications on multiple containers per host in HPC cloud?
• Can we propose a new design to overcome the bottleneck on such container-based HPC cloud?
• Can an optimized design deliver near-native performance for different container deployment scenarios?
• Locality-aware based design to enable CMA and shared memory channels for MPI communication across co-resident containers
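For reference, a common (hypothetical) way to give co-resident Docker containers access to the InfiniBand HCA is to pass the RDMA device nodes into each container and share the host network/IPC namespaces; the image name and device paths below are assumptions, not the exact setup used in these experiments:

```bash
# Launch one long-running, MPI-capable container on a host.
docker run -d --name mpi-cont0 \
    --net=host --ipc=host --cap-add=IPC_LOCK \
    --device=/dev/infiniband/uverbs0 \
    --device=/dev/infiniband/rdma_cm \
    my-mvapich2-image sleep infinity
# MPI jobs can then be started across such containers (e.g. with mpirun_rsh
# and a hostfile of container/host addresses), much as on bare metal.
```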

J. Zhang, X. Lu, D. K. Panda. High Performance MPI Library for Container-based HPC Cloud on InfiniBand Clusters. ICPP, 2016

MPI Point-to-Point and Collective Performance

[Figure: point-to-point latency (Cont-intra-socket-Def/Opt, Cont-inter-socket-Def/Opt, Native-intra-socket, Native-inter-socket) and MPI_Allgather latency (Con-Def, Con-Opt, Native) across message sizes]

• Two containers are deployed on the same socket and on different sockets
• 256 procs in total (4 procs/container, 64 containers spread evenly across 16 nodes)

• Up to 79% and 86% improvement for Point-to-Point and MPI_Allgather, respectively (Cont-Opt vs. Cont-Def)

• Minor overhead, compared with Native performance (Cont-*-Opt vs. Native-*)

Application-Level Performance on Docker with MVAPICH2

[Figure: NAS Class D (MG, FT, EP, LU, CG) execution time and Graph 500 BFS execution time (1Cont*16P, 2Conts*8P, 4Conts*4P) for Container-Def, Container-Opt, and Native]

• 64 containers across 16 nodes, pinning 4 cores per container
• Compared to Container-Def, up to 11% and 73% execution time reduction for NAS and Graph 500
• Compared to Native, less than 9% and 5% overhead for NAS and Graph 500

Singularity Performance on Different Processor Architectures

[Figure: MPI point-to-point bandwidth on Haswell and KNL, Singularity vs. Native, intra-node and inter-node]

• MPI point-to-point bandwidth
• On both Haswell and KNL, less than 7% overhead for the Singularity solution
• KNL shows worse intra-node performance than Haswell because of its low CPU frequency, the complex cluster mode, and the cost of maintaining cache coherence

• KNL - inter-node performs better than the intra-node case beyond around 256 KB, as the Omni-Path interconnect outperforms shared-memory-based transfers for large message sizes

Singularity Performance on Different Memory Access Modes (NUMA, Cache)

[Figure: MPI point-to-point latency with 2 NUMA nodes (intra-socket vs. inter-socket) and in cache mode (intra-node vs. inter-node), Singularity vs. Native]
• MPI point-to-point latency
• NUMA - intra-socket performs better than the inter-socket case, due to the QPI bottleneck between NUMA nodes

  – The performance difference gradually decreases as the message size increases
• Overall, less than 8% overhead for the Singularity solution in both cases, compared with Native

Singularity Performance on Different Memory Access Modes (Flat)

[Figure: MPI point-to-point bandwidth and MPI_Allreduce latency in flat mode, Singularity vs. Native, with DDR vs. MCDRAM]
• Explicitly specify DDR or MCDRAM for memory allocation
• MPI point-to-point BW: no significant performance difference
• MPI collective Allreduce: clear benefits (up to 67%) with MCDRAM beyond around 256 KB messages, compared with DDR
• More parallel processes increase data accesses that cannot fit in the L2 cache, hence the higher BW with MCDRAM
• Near-native performance for Singularity (less than 8% overhead)

Singularity Performance on Haswell with InfiniBand

[Figure: Class D NAS (CG, EP, FT, IS, LU, MG) and Graph500 (22,16 through 26,20) execution time for Singularity vs. Native]

• 512 processors across 32 Haswell nodes
• Singularity delivers near-native performance, less than 7% overhead on Haswell with InfiniBand

J. Zhang, X. Lu, D. K. Panda. Is Singularity-based Container Technology Ready for Running MPI Applications on HPC Clouds? UCC 2017 (Best Student Paper Award)

MVAPICH2 with Nested Virtualization

Nested Virtualization: Containers over Virtual Machines

[Figure: nested virtualization stack - Containers 1-4 (app, bins/libs) run on Docker Engines inside VM1 (Redhat) and VM2 (Ubuntu), which run on a hypervisor over the host OS and hardware]

• Useful for live migration, sandboxing applications, legacy system integration, software deployment, etc.
• Performance issues because of the redundant call stacks (two-layer virtualization) and isolated physical resources

Usage Scenario of Nested Virtualization

[Figure: users A and B are isolated on separate VLANs; each user's VMs (e.g. Ubuntu, MacOS) host Docker containers; a developer pushes images to a Docker Registry and the containers pull them]
• VM provides good isolation and security so that the applications and workloads of users A and B will not interfere with each other
• Root permission inside a VM can be given to do special configuration
• Docker brings an effective, standardized and repeatable way to port and distribute the applications and workloads

Multiple Communication Paths in Nested Virtualization

[Figure: two VMs with two containers each, spread over cores 0-15 across NUMA 0 and NUMA 1, connected by QPI]
• Different VM placements introduce multiple communication paths at the container level:
  1. Intra-VM Intra-Container (across core 4 and core 5)
  2. Intra-VM Inter-Container (across core 13 and core 14)
  3. Inter-VM Inter-Container (across core 6 and core 12)
  4. Inter-Node Inter-Container (across core 15 and a core on a remote node)

Performance Characteristics on Communication Paths

[Figure: intra-socket and inter-socket latency for Intra-VM Inter-Container-Def, Inter-VM Inter-Container-Def, Intra-VM Inter-Container-1Layer*, Inter-VM Inter-Container-1Layer*, and Native; the gap vs. Native reaches 2X-3X]

• Two VMs are deployed on the same socket and on different sockets, respectively
• *-Def and Inter-VM Inter-Container-1Layer have similar performance
• Still a large gap compared to native performance with just the 1Layer design

1Layer* - J. Zhang, X. Lu, D. K. Panda. High Performance MPI Library for Container-based HPC Cloud on InfiniBand, ICPP, 2016

Challenges of Nested Virtualization

• How to further reduce the performance overhead of running applications on the nested virtualization environment?

• What are the impacts of the different VM/container placement schemes for the communication on the container level?

• Can we propose a design which can adapt these different VM/container placement schemes and deliver near-native performance for nested virtualization environments?

Overview of Proposed Design in MVAPICH2

• Two-Layer Locality Detector: dynamically detects MPI processes in the co-resident containers inside one VM as well as the ones in the co-resident VMs (Container Locality Detector, VM Locality Detector, and Nested Locality Combiner)
• Two-Layer NUMA-Aware Communication Coordinator: leverages nested locality information, NUMA architecture information, and message characteristics to select the appropriate communication channel (CMA channel, shared memory (SHM) channel, or network (HCA) channel)

[Figure: the Two-Layer Locality Detector and Two-Layer NUMA-Aware Communication Coordinator sit between the VM/container placement (cores across NUMA 0/1) and the available communication channels]

J. Zhang, X. Lu, D. K. Panda. Designing Locality and NUMA Aware MPI Runtime for Nested Virtualization based HPC Cloud with SR-IOV Enabled InfiniBand, VEE, 2017

Inter-VM Inter-Container Pt2Pt (Intra-Socket)

[Figure: inter-VM inter-container intra-socket point-to-point latency and bandwidth for Default, 1Layer, 2Layer, Native (w/o CMA), and Native]
• 1Layer has similar performance to the Default
• Compared with 1Layer, 2Layer delivers up to 84% and 184% improvement for latency and BW

Inter-VM Inter-Container Pt2Pt (Inter-Socket)

[Figure: inter-VM inter-container inter-socket point-to-point latency and bandwidth for Default, 1Layer, 2Layer, Basic-Hybrid, Enhanced-Hybrid, Native (w/o CMA), and Native]
• 1Layer has similar performance to the Default
• 2Layer has near-native performance for small messages, but clear overhead for large messages
• Compared to 2Layer, the Hybrid design brings up to 42% and 25% improvement for latency and BW, respectively

Applications Performance

[Figure: Graph500 (22,20 through 28,16) BFS execution time and Class D NAS (IS, MG, EP, FT, CG, LU) execution time for Default, 1Layer, and 2Layer-Enhanced-Hybrid]

• 256 processes across 64 containers on 16 nodes
• Compared with Default, the enhanced-hybrid design reduces execution time by up to 16% (28,16) for Graph 500 and 10% (LU) for NAS
• Compared with the 1Layer case, the enhanced-hybrid design also brings up to 12% (28,16) and 6% (LU) performance benefit

Integrated Design with SLURM and OpenStack

MVAPICH2-Virt with SR-IOV and IVSHMEM over OpenStack

• OpenStack is one of the most popular open-source solutions to build clouds and manage virtual machines
• Deployment with OpenStack
  – Supporting SR-IOV configuration
  – Supporting IVSHMEM configuration
  – Virtual-machine-aware design of MVAPICH2 with SR-IOV
• An efficient approach to build HPC Clouds with MVAPICH2-Virt and OpenStack

[Figure: OpenStack services - Heat orchestrates the cloud, Horizon provides the UI, Nova provisions VMs, Neutron provides networking, Glance provides images, Swift stores images, Cinder provides volumes, Ceilometer monitors, Keystone provides authentication]

J. Zhang, X. Lu, M. Arnold, D. K. Panda. MVAPICH2 over OpenStack with SR-IOV: An Efficient Approach to Build HPC Clouds. CCGrid, 2015

Typical Usage Scenarios

[Figure: typical usage scenarios on the compute nodes - Exclusive Allocations Sequential Jobs (EASJ), Shared-host Allocations Concurrent Jobs (SACJ), and Exclusive Allocations Concurrent Jobs (EACJ), each running MPI jobs inside VMs]

Need for Supporting SR-IOV and IVSHMEM in SLURM

• Requirement of managing and isolating the virtualized resources of SR-IOV and IVSHMEM
• Such management and isolation is hard to achieve by the MPI library alone, but much easier with SLURM
• Efficiently running MPI applications on HPC Clouds needs SLURM support for managing SR-IOV and IVSHMEM
  – Can critical HPC resources be efficiently shared among users by extending SLURM with support for SR-IOV and IVSHMEM based virtualization?
  – Can SR-IOV and IVSHMEM enabled SLURM and MPI library provide bare-metal performance for end applications on HPC Clouds?

Architecture Overview

[Figure: SLURM-V architecture - the user submits a job via sbatch with a VM configuration file; SLURMctld handles the physical resource request and returns the physical node list; SLURMd on each allocated physical node launches VMs through libvirtd (with VF and IVSHMEM attached), executes the MPI job, returns results, and cleans up]
VM launching/reclaiming steps: 1. SR-IOV virtual function; 2. IVSHMEM device; 3. Network setting; 4. Image management (image load, snapshot, image pool on Lustre); 5. Launching VMs and checking availability; 6. Mounting global storage, etc.

SLURM SPANK Plugin based Design

• VM Configuration Reader – registers all VM configuration options and sets them in the job control environment so that they are visible to all allocated nodes
• VM Launcher – sets up VMs on each allocated node
  - File-based lock to detect occupied VFs and exclusively allocate a free VF
  - Assigns a unique ID to each IVSHMEM device across VMs and dynamically attaches it to each VM
• VM Reclaimer – tears down VMs and reclaims resources

[Figure: SPANK plugin flow - mpirun_vm registers the job with Slurmctld; each Slurmd loads the SPANK VM Config Reader and VM Launcher, runs the MPI job over the VM hostfile, and finally loads the VM Reclaimer to reclaim VMs and release hosts]

SLURM SPANK Plugin with OpenStack based Design
• VM Configuration Reader – registers VM options
• VM Launcher, VM Reclaimer – offloaded to the underlying OpenStack infrastructure
  - PCI whitelist to pass a free VF through to the VM
  - Extend Nova to enable IVSHMEM when launching the VM

[Figure: mpirun_vm registers the job with Slurmctld; the SPANK VM Config Reader runs on Slurmd, which requests the OpenStack daemon to launch (and later reclaim) VMs; the MPI job runs over the returned VM hostfile]

J. Zhang, X. Lu, S. Chakraborty, D. K. Panda. SLURM-V: Extending SLURM for Building Efficient HPC Cloud with SR-IOV and IVShmem. Euro-Par, 2016

Application-Level Performance on Chameleon (Graph500)

[Figure: Graph500 BFS execution time, VM vs. Native, under Exclusive Allocations Sequential Jobs, Shared-host Allocations Concurrent Jobs, and Exclusive Allocations Concurrent Jobs]
• 32 VMs across 8 nodes, 6 cores/VM
• EASJ - compared to Native, less than 4% overhead with 128 procs
• SACJ, EACJ - also minor overhead when running NAS as a concurrent job with 64 procs


Available Appliances on Chameleon Cloud*

• CentOS 7 KVM SR-IOV: Chameleon bare-metal image customized with the KVM hypervisor and a recompiled kernel to enable SR-IOV over InfiniBand. https://www.chameleoncloud.org/appliances/3/
• MPI bare-metal cluster complex appliance (based on Heat): deploys an MPI cluster composed of bare-metal instances using the MVAPICH2 library over InfiniBand. https://www.chameleoncloud.org/appliances/29/
• MPI + SR-IOV KVM cluster (based on Heat): deploys an MPI cluster of KVM virtual machines using the MVAPICH2-Virt implementation, configured with SR-IOV for high-performance communication over InfiniBand. https://www.chameleoncloud.org/appliances/28/
• CentOS 7 SR-IOV RDMA-Hadoop: built from the CentOS 7 KVM SR-IOV appliance and additionally contains the RDMA-Hadoop library with SR-IOV. https://www.chameleoncloud.org/appliances/17/

• Through these available appliances, users and researchers can easily deploy HPC clouds to perform experiments and run jobs with
  – High-Performance SR-IOV + InfiniBand
  – High-Performance MVAPICH2 Library over bare-metal InfiniBand clusters
  – High-Performance MVAPICH2 Library with Virtualization Support over SR-IOV enabled KVM clusters
  – High-Performance Hadoop with RDMA-based Enhancements Support

[*] Only includes appliances contributed by OSU NowLab

MPI Complex Appliances based on MVAPICH2 on Chameleon

1. Load VM Config
2. Allocate Ports
3. Allocate FloatingIPs
4. Generate SSH Keypair
5. Launch VM
6. Attach SR-IOV Device
7. Hotplug IVShmem Device
8. Download/Install MVAPICH2-Virt
9. Populate VMs/IPs
10. Associate FloatingIPs

MVAPICH2-Virt Heat-based Complex Appliance
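The demo slides that follow drive this through the Horizon dashboard, but the same complex appliance can also be launched from the OpenStack CLI. A sketch, assuming the Heat template has been downloaded and using placeholder parameter names (the actual parameters are defined by the appliance template):

```bash
# Create the stack from the saved template and watch its progress.
openstack stack create -t mvapich2-virt-appliance.yaml \
    --parameter key_name=my-keypair \
    --parameter reservation_id=my-lease-reservation \
    my-mpi-cluster
openstack stack show my-mpi-cluster    # status and outputs (e.g. floating IP)
```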

Demos on Chameleon Cloud
• A Demo of Deploying MPI Bare-Metal Cluster with InfiniBand
• A Demo of Deploying MPI KVM Cluster with SR-IOV enabled InfiniBand
• Running MPI Programs on Chameleon

Login to Chameleon Cloud

https://chi.tacc.chameleoncloud.org/dashboard/auth/login/

Create a Lease

Lease Starts

Select Complex Appliance - MPI Bare-Metal Cluster

Get Template of MPI Bare-Metal Appliance

Save URL of Template (Will be used later)

Launch Stack

Use Saved Template URL as Source

Input Stack Information

Use Created Lease

Stack Creation In Progress

Stack Details

Select Complex Appliance - MPI KVM with SR-IOV Cluster

Get Template of MPI KVM with SR-IOV Appliance

Save URL of Template (Will be used later)

Launch Stack

Instances in Stack (Spawning ...)

Create Stack Successfully - Login to the first instance with Floating IP

Login Instance with Floating IP

SSH to Other Instances

Compile MPI Program – Hello World

Distribute Executable to Other Instances

Run MPI Program – Hello World
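The compile/distribute/run screenshots above correspond roughly to the following shell sketch; hostnames, the hostfile, and paths are placeholders:

```bash
# Write and compile a minimal MPI Hello World with MVAPICH2's mpicc.
cat > hello.c << 'EOF'
#include <mpi.h>
#include <stdio.h>
int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
EOF
mpicc -o hello hello.c

# Copy the executable to the other instances listed in the hostfile.
for host in $(cat hosts); do scp hello "$host": ; done

# Run it across the instances.
mpirun_rsh -np 4 -hostfile hosts ./hello
```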

Overview of OSU MPI Micro-Benchmarks (OMB) Suite

• A comprehensive suite of benchmarks to
  – Compare performance of different MPI libraries on various networks and systems
  – Validate low-level functionalities
  – Provide insights into the underlying MPI-level designs
• Started with basic send-recv (MPI-1) micro-benchmarks for latency, bandwidth and bi-directional bandwidth
• Extended later to
  – MPI one-sided
  – Collectives
  – GPU-aware data movement
  – OpenSHMEM (point-to-point and collectives)
  – UPC
• Has become an industry standard
• Extensively used for design/development of MPI libraries, performance comparison of MPI libraries and even in procurement of large-scale systems
• Available from http://mvapich.cse.ohio-state.edu/benchmarks
• Available in an integrated manner with the MVAPICH2 stack

Examples of OMB Benchmarks

• MPI Two-sided Pt2Pt: MPI_Latency (inter-node, intra-node), MPI_BW (inter-node, intra-node)
• MPI One-sided Pt2Pt: MPI_Put_Latency
• MPI Collectives (64 procs = 4 nodes * 16 procs/node): MPI_Bcast, MPI_Alltoall

Run OSU MPI Benchmarks – Latency and Bandwidth
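A sketch of the latency and bandwidth runs, assuming the OMB binaries that ship with MVAPICH2 are installed under the usual libexec layout (the install prefix is an assumption):

```bash
OMB=/usr/local/mvapich2/libexec/osu-micro-benchmarks/mpi/pt2pt
# Two processes on two different hosts from the hostfile.
mpirun_rsh -np 2 -hostfile hosts $OMB/osu_latency
mpirun_rsh -np 2 -hostfile hosts $OMB/osu_bw
```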

Next Steps of MVAPICH2-Virt

Conclusions

• MVAPICH2-Virt over SR-IOV-enabled InfiniBand is an efficient approach to build HPC Clouds
  – Standalone, OpenStack, Slurm, and Slurm + OpenStack
  – Support for Virtual Machine migration with SR-IOV InfiniBand devices
  – Support for Virtual Machines, Containers (Docker and Singularity), and Nested Virtualization
• Very little overhead with virtualization, near-native performance at the application level
• Much better performance than Amazon EC2
• MVAPICH2-Virt is available for building HPC Clouds
  – SR-IOV, IVSHMEM, Docker and Singularity, OpenStack
• Appliances for MVAPICH2-Virt are available for building HPC Clouds
• Demos on NSF Chameleon Cloud
• Future releases will support running MPI jobs in VMs/Containers with SLURM, etc.
• SR-IOV/container support and appliances for other MVAPICH2 libraries (MVAPICH2-X, MVAPICH2-GDR, ...)

Network Based Computing Laboratory MUG 2018 104 Thank You! [email protected] http://www.cse.ohio-state.edu/~luxi

Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/ MVAPICH: MPI over InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE http://mvapich.cse.ohio-state.edu/
