
Building HPC Cloud with InfiniBand: Efficient Support in MVAPICH2 for KVM, Docker, Singularity, OpenStack, and SLURM

A Tutorial at MUG 2018 by

Xiaoyi Lu
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~luxi

HPC Meets Cloud Computing

• Cloud Computing is widely adopted in industry computing environments
• Cloud Computing provides high resource utilization and flexibility
• Virtualization is the key technology to enable Cloud Computing
• Intersect360 study shows cloud is the fastest growing class of HPC
• HPC Meets Cloud: the convergence of Cloud Computing and HPC

HPC Cloud - Combining HPC with Cloud

• IDC expects that by 2019, HPC ecosystem revenue will jump to a record $30.2 billion. IDC foresees public clouds, and especially custom public clouds, supporting an increasing proportion of the aggregate HPC workload as these cloud facilities grow more capable and mature (Courtesy: http://www.idc.com/getdoc.jsp?containerId=247846)
• Combining HPC with Cloud still faces challenges because of the performance overhead associated with virtualization support
  – Lower performance of virtualized I/O devices
• HPC Cloud Examples
  – Amazon EC2 with Enhanced Networking
    • Using Single Root I/O Virtualization (SR-IOV)
    • Higher performance (packets per second), lower latency, and lower jitter
    • 10 GigE
  – NSF Chameleon Cloud

Outline
• Overview of Cloud Computing System Software
• Overview of Modern HPC Cloud Architecture
• Challenges of Building HPC Clouds
• High-Performance MPI Library on HPC Clouds
• Integrated Designs with Cloud Resource Manager
• Appliances and Demos on Chameleon Cloud
• Conclusion and Q&A

Virtualization Technology (VM vs. Container)

• Provides abstractions of multiple virtual resources by utilizing an intermediate software layer on top of the underlying system

Hypervisor-based Virtualization (VMs):
• Hypervisor provides a full abstraction of the VM
• Full virtualization, different guest OS, better isolation
• Larger overhead due to the heavy stack

Container-based Virtualization:
• Shares the host kernel
• Allows execution of isolated user space instances
• Lightweight, portability
• Not strong isolation

[Figure: hypervisor-based stack (App, bins/libs, Guest OS per VM over Hypervisor, Host OS, Hardware) vs. container-based stack (App, bins/libs per container over Host Linux OS, Hardware)]

Overview of Kernel-based Virtual Machine (KVM)

• A full virtualization solution for Linux on x86 hardware that contains virtualization extensions (Intel VT or AMD-V)
• The KVM module creates a bare-metal hypervisor on the Linux kernel
• KVM hosts the virtual machine images as regular Linux processes
• Each virtual machine image can use all of the features of the Linux kernel, including hardware, security, storage, etc.

https://www.ibm.com/support/knowledgecenter/en/linuxonibm/liaat/liaatkvmover.htm
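Where KVM is in play, a quick sanity check from the host can confirm the pieces described above; a minimal sketch, assuming libvirt/QEMU are installed (package names and outputs vary by distribution):

```bash
# Check for hardware virtualization extensions (Intel VT / AMD-V).
egrep -c '(vmx|svm)' /proc/cpuinfo   # non-zero means the CPU supports it
# Check that the KVM kernel modules are loaded and the device node exists.
lsmod | grep kvm                     # expect kvm plus kvm_intel or kvm_amd
ls -l /dev/kvm
# Each guest is hosted as a regular Linux (QEMU) process.
virsh list --all                     # VMs known to libvirt
ps -ef | grep qemu                   # the corresponding host processes
```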

Container Technology - Docker

• Inherits the advantages of container techniques
• Active community contribution
• Root-owned daemon process
• Root escalation in Docker containers
• Non-negligible performance overhead

Singularity Overview

• Reproducible software stacks
  – Easily verify via checksum or cryptographic signature
• Mobility of compute
  – Able to transfer (and store) containers via standard data mobility tools
• Compatibility with complicated architectures
  – Runtime immediately compatible with existing HPC architecture
• Security model
  – Supports untrusted users running untrusted containers

http://singularity.lbl.gov/about

Container Technology (Docker vs. Singularity)

• Singularity aims to provide reproducible and mobile environments across HPC centers
• NO root-owned daemon
• NO root escalation
• Example (see the sketch below): mpirun_rsh -np 2 -hostfile htfiles singularity exec /tmp/Centos-7.img /usr/bin/osu_latency
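A slightly fuller sketch of this workflow, assuming MVAPICH2 is installed on both hosts and that the image and hostfile names are placeholders:

```bash
# The container image is assumed to already exist and to contain the OSU
# micro-benchmarks (e.g. built with `singularity build` from a definition
# file that installs them).
printf "node1\nnode2\n" > hostfile
# Launch the containerized benchmark through MVAPICH2's mpirun_rsh;
# MVAPICH2 runs on the hosts, only the application is containerized.
mpirun_rsh -np 2 -hostfile hostfile \
    singularity exec /tmp/Centos-7.img /usr/bin/osu_latency
```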


Drivers of Modern HPC Cluster and Cloud Architecture

[Figure: high performance interconnects - InfiniBand (with SR-IOV), <1 usec latency, 200 Gbps bandwidth; SSDs, object storage clusters; large memory nodes (up to 2 TB); multi-/many-core processors]
• Multi-core/many-core technologies, Accelerators
• Large memory nodes
• Solid State Drives (SSDs), NVM, Parallel Filesystems, Object Storage Clusters
• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
• Single Root I/O Virtualization (SR-IOV)

[Figure: example systems - clouds, SDSC Comet, TACC Stampede]

Trends in High-Performance Networking Technologies
• Advanced Interconnects and RDMA protocols
  – InfiniBand (up to 200 Gbps, HDR)
  – 10/40/100 Gigabit Ethernet/iWARP
  – RDMA over Converged Enhanced Ethernet (RoCE)
  – Omni-Path
• Delivering excellent performance (latency, bandwidth, and CPU utilization)
• Has influenced re-designs of enhanced HPC middleware
  – Message Passing Interface (MPI) and PGAS
  – Parallel File Systems (Lustre, GPFS, ...)
• Paving the way to wide utilization in HPC Clouds with virtualization support (SR-IOV)

Available Interconnects and Protocols for Data Centers

[Figure: interconnect/protocol stack - applications and middleware run over Sockets, Verbs, or OFI interfaces; protocol options include TCP/IP (with or without hardware offload/TOE), IPoIB, RSockets, SDP, iWARP, RoCE, and native IB verbs; adapters and switches span 1/10/25/40/50/100 GigE, InfiniBand, and Omni-Path]

Open Standard InfiniBand Networking Technology
• Introduced in Oct 2000
• High Performance Data Transfer
  – Interprocessor communication and I/O
  – Low latency (<1.0 microsec), high bandwidth (up to 25 GigaBytes/sec -> 200 Gbps), and low CPU utilization (5-10%)
• Flexibility for LAN and WAN communication
• Multiple Transport Services
  – Reliable Connection (RC), Unreliable Connection (UC), Reliable Datagram (RD), Unreliable Datagram (UD), and Raw Datagram
  – Provides flexibility to develop upper layers
• Multiple Operations
  – Send/Recv
  – RDMA Read/Write
  – Atomic Operations (very unique)
    • High performance and scalable implementations of distributed locks, semaphores, collective communication operations
• Leading to big changes in designing HPC clusters, file systems, cloud computing systems, grid computing systems, ...
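These verbs-level operations can be exercised directly with the OFED perftest utilities; a hedged sketch (hostnames are placeholders, and the exact tool set depends on the installed perftest version):

```bash
# On the server node, start the matching receiver for each test first, e.g.:
#   ib_send_lat    ib_read_lat    ib_write_bw    ib_atomic_lat
# Then, from the client node:
ib_send_lat   ib-server      # Send/Recv latency
ib_read_lat   ib-server      # RDMA Read latency
ib_write_bw   ib-server      # RDMA Write bandwidth
ib_atomic_lat ib-server      # atomic (fetch-and-add / compare-and-swap) latency
```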

Background: Inter-VM Shared Memory (IVShmem)
• Inter-VM Shared Memory (IVShmem, e.g. Nahanni) provides zero-copy access to data in shared memory of co-resident VMs on the KVM platform
• IVShmem is designed and implemented mainly in the system-call layer, and its interfaces are also visible to user-space applications
• Three components: the guest kernel driver, the modified QEMU supporting a PCI device, and the POSIX shared memory region on the host OS
• The shared memory region is allocated by host POSIX operations and mapped into the QEMU process address space; the mapped memory can then be used by guest applications by being remapped to user space in the guest VMs
• Evaluations show IVShmem can improve point-to-point and collective operations by up to 193% and 91%, respectively, and reduce application execution time by up to 96%, compared to SR-IOV, while adding only small overheads compared with the native environment

Single Root I/O Virtualization (SR-IOV)
• SR-IOV is a PCI Express (PCIe) standard which specifies native I/O virtualization capabilities in PCIe adapters, providing new opportunities to design HPC clouds with very low overhead
• Allows a single physical device, or a Physical Function (PF), to present itself as multiple virtual devices, or Virtual Functions (VFs)
• VFs are designed based on the existing non-virtualized PFs, so no driver change is needed
• Each VF can be dedicated to a single VM through PCI pass-through, which allows each VM to directly access the corresponding VF
• Works with 10/40 GigE and InfiniBand

[Fig. 2: Overview of the Inter-VM Shmem and SR-IOV communication mechanisms]
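As an illustration of what SR-IOV looks like from the host side, the sketch below inspects and instantiates VFs through the standard Linux sysfs interface (the device name mlx5_0 and the VF count are assumptions; on Mellanox HCAs SR-IOV must also be enabled in firmware, e.g. via mlxconfig):

```bash
# How many VFs the PF can expose, and how many are currently instantiated.
cat /sys/class/infiniband/mlx5_0/device/sriov_totalvfs
cat /sys/class/infiniband/mlx5_0/device/sriov_numvfs
# Instantiate 4 VFs on the PF (requires root).
echo 4 > /sys/class/infiniband/mlx5_0/device/sriov_numvfs
# The VFs show up as additional PCIe devices that can be passed to VMs.
lspci | grep -i "virtual function"
```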

Overview of MVAPICH2 Project

• High Performance open-source MPI Library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
  – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), started in 2001, first version available in 2002
  – MVAPICH2-X (MPI + PGAS), available since 2011
  – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
  – Support for Virtualization (MVAPICH2-Virt), available since 2015
  – Support for Energy-Awareness (MVAPICH2-EA), available since 2015
  – Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
  – Used by more than 2,925 organizations in 86 countries
  – More than 484,000 (> 0.4 million) downloads from the OSU site directly
  – Empowering many TOP500 clusters (Jul '18 ranking)
    • 2nd ranked 10,649,640-core cluster (Sunway TaihuLight) at NSC, Wuxi, China
    • 12th, 556,104 cores (Oakforest-PACS) in Japan
    • 15th, 367,024 cores (Stampede2) at TACC
    • 24th, 241,108-core (Pleiades) at NASA, and many others
  – Available with the software stacks of many vendors and Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu

Architecture of MVAPICH2 Software Family

[Figure: MVAPICH2 software family architecture]
• High Performance Parallel Programming Models: Message Passing Interface (MPI); PGAS (UPC, OpenSHMEM, CAF, UPC++); Hybrid --- MPI + X (MPI + PGAS + OpenMP/Cilk)
• High Performance and Scalable Communication Runtime with diverse APIs and mechanisms: point-to-point primitives, collectives algorithms, energy-awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, job startup, introspection & analysis
• Support for Modern Networking Technology (InfiniBand, iWARP, RoCE, Omni-Path) and Modern Multi-/Many-core Architectures (Intel Xeon, OpenPower, Xeon Phi (MIC, KNL), NVIDIA GPGPU)
  – Transport protocols: RC, XRC, UD, DC
  – Modern features: UMR, ODP*
  – Transport mechanisms: SR-IOV, Multi-Rail, Shared Memory, CMA, IVSHMEM
  – Modern features: MCDRAM*, NVLink*, CAPI*
(* Upcoming)

Building HPC Cloud with SR-IOV and InfiniBand

• High-Performance Computing (HPC) has adopted advanced interconnects and protocols
  – InfiniBand
  – 10/40/100 Gigabit Ethernet/iWARP
  – RDMA over Converged Enhanced Ethernet (RoCE)

• Very Good Performance
  – Low latency (few microseconds)
  – High bandwidth (200 Gb/s with HDR InfiniBand)
  – Low CPU overhead (5-10%)

• The OpenFabrics software stack with IB, iWARP and RoCE interfaces is driving HPC systems
• How to build HPC Clouds with SR-IOV and InfiniBand while delivering optimal performance?

HPC, Big Data, and Deep Learning on Cloud Computing Systems: Challenges

[Figure: layered challenge view]
• Applications
• HPC, Big Data, Deep Learning Middleware: HPC (MPI, PGAS, MPI+PGAS, MPI+OpenMP, etc.), Big Data (HDFS, MapReduce, Spark, HBase, Memcached, etc.), Deep Learning (Caffe, TensorFlow, CNTK, BigDL, etc.)
• Resource Management and Scheduling Systems for Cloud Computing (OpenStack Nova, Swift, Heat; Slurm, etc.)
• Communication and I/O Library: communication channels (SR-IOV, IVShmem, IPC-Shm, CMA); locality- and NUMA-aware communication; virtualization (hypervisor and container); data placement & task scheduling; fault-tolerance (migration, replication, etc.); QoS-aware, etc.
• Commodity Computing System Architectures (multi- and many-core architectures and accelerators), Networking Technologies (InfiniBand, Omni-Path, 1/10/40/100 GigE and intelligent NICs), Storage Technologies (HDD, SSD, NVRAM, and NVMe-SSD)

Broad Challenges in Designing Communication and I/O Middleware for HPC on Clouds
• Virtualization Support with Virtual Machines and Containers
  – KVM, Docker, Singularity, etc.
• Communication coordination among optimized communication channels on Clouds
  – SR-IOV, IVShmem, IPC-Shm, CMA, etc.
• Locality-aware communication
• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
• Scalable collective communication
  – Offload; Non-blocking; Topology-aware
• Balancing intra-node and inter-node communication for next generation nodes (128-1024 cores)
  – Multiple end-points per node
• NUMA-aware communication for nested virtualization
• Integrated support for GPGPUs and Accelerators
• Fault-tolerance/resiliency
  – Migration support with virtual machines
• QoS support for communication and I/O
• Support for Hybrid MPI+PGAS programming (MPI + OpenMP, MPI + UPC, MPI + OpenSHMEM, MPI + UPC++, CAF, ...)
• Energy-Awareness
• Co-design with resource management and scheduling systems on Clouds
  – OpenStack, Slurm, etc.


High-Performance MPI Library on HPC Clouds
• MVAPICH2-Virt with SR-IOV and IVSHMEM
• SR-IOV-enabled VM Migration Support in MVAPICH2
• MVAPICH2 with Containers (Docker and Singularity)
• MVAPICH2 with Nested Virtualization (Container over VM)

HPC on Cloud Computing Systems: Challenges Addressed by OSU So Far

[Figure: layers addressed by OSU so far]
• Applications
• HPC and Big Data Middleware: HPC (MPI, PGAS, MPI+PGAS, MPI+OpenMP, etc.)
• Resource Management and Scheduling Systems for Cloud Computing (OpenStack Nova, Heat; Slurm)
• Communication and I/O Library: communication channels (SR-IOV, IVShmem, IPC-Shm, CMA); locality- and NUMA-aware communication; virtualization (hypervisor and container); fault-tolerance & consolidation (migration); QoS-aware (future studies)
• Commodity Computing System Architectures (multi- and many-core architectures and accelerators), Networking Technologies (InfiniBand, Omni-Path, 1/10/40/100 GigE and intelligent NICs), Storage Technologies (HDD, SSD, NVRAM, and NVMe-SSD)

MVAPICH2-Virt with SR-IOV and IVSHMEM

Intra-Node Inter-VM Performance

[Figure: intra-node inter-VM latency for IB_Send, IB_Read, IB_Write, and IVShmem across message sizes from 2 bytes to 1 MB]

Latency at 64 bytes message size: SR-IOV(IB_Send) - 0.96μs, IVShmem - 0.2μs

Can IVShmem scheme benefit MPI communication within a node on SR-IOV enabled InfiniBand clusters?

Overview of MVAPICH2-Virt with SR-IOV and IVSHMEM

• Redesign MVAPICH2 to make it virtual machine aware
  – SR-IOV shows near-to-native performance for inter-node point-to-point communication
  – IVSHMEM offers shared-memory-based data access across co-resident VMs
  – Locality Detector: maintains the locality information of co-resident virtual machines
  – Communication Coordinator: selects the communication channel (SR-IOV, IVSHMEM) adaptively

[Figure: MPI processes in guest VMs use the SR-IOV channel (VF device driver, InfiniBand adapter) for inter-node traffic and the IVShmem channel (/dev/shm on the host) for intra-node inter-VM traffic]

J. Zhang, X. Lu, J. Jose, R. Shi, D. K. Panda. Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters? Euro-Par, 2014
J. Zhang, X. Lu, J. Jose, R. Shi, M. Li, D. K. Panda. High Performance MPI Library over SR-IOV Enabled InfiniBand Clusters. HiPC, 2014

Intra-node Inter-VM Point-to-Point Performance

[Figure: intra-node inter-VM point-to-point latency and bandwidth for SR-IOV, Proposed, and Native across message sizes]

• Compared to SR-IOV, up to 84% and 158% improvement on latency and bandwidth
• Compared to Native, 3%-8% overhead on both latency and bandwidth


Application Performance (NAS & P3DFFT)

[Figure: execution time of NAS Class B (FT, LU, CG, MG, IS) and P3DFFT 512x512x512 (SPEC, INVERSE, RAND, SINE) with 32 VMs (8 VMs per node), SR-IOV vs. Proposed design]

• Proposed design delivers up to 43% (IS) improvement for NAS
• Proposed design brings 29%, 33%, 29% and 20% improvement for the P3DFFT INVERSE, RAND, SINE and SPEC tests, respectively

Application-Level Performance on Chameleon

[Figure: Graph500 BFS execution time (problem sizes 22,20 through 26,16) and SPEC MPI2007 execution time (milc, leslie3d, pop2, GAPgeofem, zeusmp2, lu) for MV2-SR-IOV-Def, MV2-SR-IOV-Opt, and MV2-Native]
• 32 VMs, 6 cores/VM
• Compared to Native, 2-5% overhead for Graph500 with 128 procs
• Compared to Native, 1-9.5% overhead for SPEC MPI2007 with 128 procs

Application-Level Performance on Azure

[Figure: NPB Class B (bt, cg, ep, ft, is, lu, mg, sp) and Class C (cg, ep, ft, is, lu, mg) execution time for MVAPICH2 vs. Intel MPI]

• NPB Class B with 16 processes on 2 Azure A8 instances
• NPB Class C with 32 processes on 4 Azure A8 instances
• Comparable performance between MVAPICH2 and Intel MPI


SR-IOV-enabled VM Migration Support in MVAPICH2

Execute Live Migration with SR-IOV Device

Overview of Existing Migration Solutions for SR-IOV

Solution | Platform | NIC | No Guest OS Modification | Device Driver Independent | Hypervisor Independent
Zhai, etc. (Linux bonding driver) | Ethernet | N/A | Yes | No | Maybe
Kadav, etc. (shadow driver) | Ethernet | Intel Pro/1000 gigabit NIC | No | Yes | No (Xen)
Pan, etc. (CompSC) | Ethernet | Intel 82576, Intel 82599 | Yes | No | No (Xen)
Guay, etc. | InfiniBand | Mellanox ConnectX2 QDR HCA | Yes | No | Yes (Oracle VM Server (OVS) 3.0)
Han | Ethernet | Huawei smart NIC | Yes | No | No (QEMU+KVM)
Xu, etc. (SRVM) | Ethernet | Intel 82599 | Yes | Yes | No (VMware ESXi)

Can we have a hypervisor-independent and device driver-independent solution for InfiniBand based HPC Clouds with SR-IOV?

High Performance SR-IOV enabled VM Migration Framework

• Consists of an SR-IOV enabled IB cluster and an external Migration Controller
• Detachment/re-attachment of virtualized devices: multiple parallel libraries coordinate VMs during migration (detach/reattach SR-IOV/IVShmem, migrate VMs, migration status)
• IB connection: MPI runtime handles IB connection suspending and reactivating
• Propose Progress Engine (PE) based and Migration Thread (MT) based designs to optimize VM migration and MPI application performance

[Figure: guest VMs with MPI and SR-IOV VFs run on hosts with hypervisor, IVShmem, IB and Ethernet adapters; an external Migration Controller coordinates the suspend, migrate, and reactivate steps across hosts]

J. Zhang, X. Lu, D. K. Panda. High-Performance Virtual Machine Migration Framework for MPI Applications on SR-IOV enabled InfiniBand Clusters. IPDPS, 2017

Performance Evaluation of VM Migration Framework

[Figure: breakdown of VM migration time (set POST_MIGRATION, add IVSHMEM, attach VF, migration, remove IVSHMEM, detach VF, set PRE_MIGRATION) over TCP, IPoIB, and RDMA; and multiple-VM (2-16 VMs) migration time for the sequential vs. proposed migration framework]

• Compared with TCP, the RDMA scheme reduces the total migration time by 20%
• Total time is dominated by the `Migration' step; times for the other steps are similar across the different schemes
• The proposed migration framework reduces migration time by up to 51%

Performance Evaluation of VM Migration Framework

[Figure: point-to-point latency and Bcast latency (4 VMs * 2 procs/VM) during migration for PE-IPoIB, PE-RDMA, MT-IPoIB, and MT-RDMA]
• Migrate a VM from one machine to another while the benchmark is running inside
• Proposed MT-based designs perform slightly worse than PE-based designs because of lock/unlock
• No benefit from MT because NO computation is involved

Performance Evaluation with Applications

[Figure: execution time of NAS (LU.B, EP.B, MG.B, CG.B, FT.B) and Graph500 (20,10 through 22,10) for PE, MT-worst, MT-typical, and NM (no migration)]
• 8 VMs in total and 1 VM carries out migration during the application run
• Compared with NM, MT-worst and PE incur some overhead
• MT-typical allows migration to be completely overlapped with computation


MVAPICH2 with Containers (Docker and Singularity)

Benefits of Containers-based Virtualization for HPC on Cloud

[Figure: ib_send_lat latency for VM-PT, VM-SR-IOV, Container-PT, and Native; Graph500 BFS execution time (scale, edgefactor 20,16) broken down into communication and computation for Native-16P, 1Cont*16P, 2Conts*8P, and 4Conts*4P]
• Experiment on NSF Chameleon Cloud
• Container has less overhead than VM

• BFS time in Graph 500 significantly increases as the number of containers per host increases. Why?

J. Zhang, X. Lu, D. K. Panda. Performance Characterization of Hypervisor- and Container-Based Virtualization for HPC on SR-IOV Enabled InfiniBand Clusters. IPDRM, IPDPS Workshop, 2016

Containers-based Design: Issues, Challenges, and Approaches

• What are the performance bottlenecks when running MPI applications on multiple containers per host in HPC cloud?
• Can we propose a new design to overcome the bottleneck on such container-based HPC cloud?
• Can an optimized design deliver near-native performance for different container deployment scenarios?
• Locality-aware based design to enable CMA and shared memory channels for MPI communication across co-resident containers
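For reference, a common (hypothetical) way to give co-resident Docker containers access to the InfiniBand HCA is to pass the RDMA device nodes into each container and share the host network/IPC namespaces; the image name and device paths below are assumptions, not the exact setup used in these experiments:

```bash
# Launch one long-running, MPI-capable container on a host.
docker run -d --name mpi-cont0 \
    --net=host --ipc=host --cap-add=IPC_LOCK \
    --device=/dev/infiniband/uverbs0 \
    --device=/dev/infiniband/rdma_cm \
    my-mvapich2-image sleep infinity
# MPI jobs can then be started across such containers (e.g. with mpirun_rsh
# and a hostfile of container/host addresses), much as on bare metal.
```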

J. Zhang, X. Lu, D. K. Panda. High Performance MPI Library for Container-based HPC Cloud on InfiniBand Clusters. ICPP, 2016

MPI Point-to-Point and Collective Performance

[Figure: point-to-point latency (Cont-intra-socket-Def/Opt, Cont-inter-socket-Def/Opt, Native-intra-socket, Native-inter-socket) and MPI_Allgather latency (Con-Def, Con-Opt, Native) across message sizes]

• Two containers are deployed on the same socket and on different sockets
• 256 procs in total (4 procs/container, 64 containers spread evenly across 16 nodes)

• Up to 79% and 86% improvement for Point-to-Point and MPI_Allgather, respectively (Cont-Opt vs. Cont-Def)

• Minor overhead, compared with Native performance (Cont-*-Opt vs. Native-*)

Application-Level Performance on Docker with MVAPICH2

[Figure: NAS Class D (MG, FT, EP, LU, CG) execution time and Graph 500 BFS execution time (1Cont*16P, 2Conts*8P, 4Conts*4P) for Container-Def, Container-Opt, and Native]

• 64 containers across 16 nodes, pinning 4 cores per container
• Compared to Container-Def, up to 11% and 73% execution time reduction for NAS and Graph 500
• Compared to Native, less than 9% and 5% overhead for NAS and Graph 500

Singularity Performance on Different Processor Architectures

[Figure: MPI point-to-point bandwidth on Haswell and KNL, Singularity vs. Native, intra-node and inter-node]

• MPI point-to-point bandwidth
• On both Haswell and KNL, less than 7% overhead for the Singularity solution
• KNL shows worse intra-node performance than Haswell because of its low CPU frequency, the complex cluster mode, and the cost of maintaining cache coherence

• KNL - inter-node performs better than the intra-node case beyond around 256 KB, as the Omni-Path interconnect outperforms shared-memory-based transfers for large message sizes

Singularity Performance on Different Memory Access Modes (NUMA, Cache)

[Figure: MPI point-to-point latency with 2 NUMA nodes (intra-socket vs. inter-socket) and in cache mode (intra-node vs. inter-node), Singularity vs. Native]
• MPI point-to-point latency
• NUMA - intra-socket performs better than the inter-socket case, due to the QPI bottleneck between NUMA nodes

  – The performance difference gradually decreases as the message size increases
• Overall, less than 8% overhead for the Singularity solution in both cases, compared with Native

Singularity Performance on Different Memory Access Modes (Flat)

[Figure: MPI point-to-point bandwidth and MPI_Allreduce latency in flat mode, Singularity vs. Native, with DDR vs. MCDRAM]
• Explicitly specify DDR or MCDRAM for memory allocation
• MPI point-to-point BW: no significant performance difference
• MPI collective Allreduce: clear benefits (up to 67%) with MCDRAM beyond around 256 KB messages, compared with DDR
• More parallel processes increase data accesses that cannot fit in the L2 cache, hence the higher BW with MCDRAM
• Near-native performance for Singularity (less than 8% overhead)

Singularity Performance on Haswell with InfiniBand

[Figure: Class D NAS (CG, EP, FT, IS, LU, MG) and Graph500 (22,16 through 26,20) execution time for Singularity vs. Native]

• 512 processors across 32 Haswell nodes
• Singularity delivers near-native performance, less than 7% overhead on Haswell with InfiniBand

J. Zhang, X. Lu, D. K. Panda. Is Singularity-based Container Technology Ready for Running MPI Applications on HPC Clouds? UCC 2017 (Best Student Paper Award)

MVAPICH2 with Nested Virtualization

Nested Virtualization: Containers over Virtual Machines

[Figure: nested virtualization stack - Containers 1-4 (app, bins/libs) run on Docker Engines inside VM1 (Redhat) and VM2 (Ubuntu), which run on a hypervisor over the host OS and hardware]

• Useful for live migration, sandboxing applications, legacy system integration, software deployment, etc.
• Performance issues because of the redundant call stacks (two-layer virtualization) and isolated physical resources

Usage Scenario of Nested Virtualization

[Figure: users A and B are isolated on separate VLANs; each user's VMs (e.g. Ubuntu, MacOS) host Docker containers; a developer pushes images to a Docker Registry and the containers pull them]
• VM provides good isolation and security so that the applications and workloads of users A and B will not interfere with each other
• Root permission inside a VM can be given to do special configuration
• Docker brings an effective, standardized and repeatable way to port and distribute the applications and workloads

Multiple Communication Paths in Nested Virtualization

[Figure: two VMs with two containers each, spread over cores 0-15 across NUMA 0 and NUMA 1, connected by QPI]
• Different VM placements introduce multiple communication paths at the container level:
  1. Intra-VM Intra-Container (across core 4 and core 5)
  2. Intra-VM Inter-Container (across core 13 and core 14)
  3. Inter-VM Inter-Container (across core 6 and core 12)
  4. Inter-Node Inter-Container (across core 15 and a core on a remote node)

Performance Characteristics on Communication Paths

[Figure: intra-socket and inter-socket latency for Intra-VM Inter-Container-Def, Inter-VM Inter-Container-Def, Intra-VM Inter-Container-1Layer*, Inter-VM Inter-Container-1Layer*, and Native; the gap vs. Native reaches 2X-3X]

• Two VMs are deployed on the same socket and on different sockets, respectively
• *-Def and Inter-VM Inter-Container-1Layer have similar performance
• Still a large gap compared to native performance with just the 1Layer design

1Layer* - J. Zhang, X. Lu, D. K. Panda. High Performance MPI Library for Container-based HPC Cloud on InfiniBand, ICPP, 2016

Challenges of Nested Virtualization

• How to further reduce the performance overhead of running applications on the nested virtualization environment?

• What are the impacts of the different VM/container placement schemes for the communication on the container level?

• Can we propose a design which can adapt these different VM/container placement schemes and deliver near-native performance for nested virtualization environments?

Overview of Proposed Design in MVAPICH2

• Two-Layer Locality Detector: dynamically detects MPI processes in the co-resident containers inside one VM as well as the ones in the co-resident VMs (Container Locality Detector, VM Locality Detector, and Nested Locality Combiner)
• Two-Layer NUMA-Aware Communication Coordinator: leverages nested locality information, NUMA architecture information, and message characteristics to select the appropriate communication channel (CMA channel, shared memory (SHM) channel, or network (HCA) channel)

[Figure: the Two-Layer Locality Detector and Two-Layer NUMA-Aware Communication Coordinator sit between the VM/container placement (cores across NUMA 0/1) and the available communication channels]

J. Zhang, X. Lu, D. K. Panda. Designing Locality and NUMA Aware MPI Runtime for Nested Virtualization based HPC Cloud with SR-IOV Enabled InfiniBand, VEE, 2017

Inter-VM Inter-Container Pt2Pt (Intra-Socket)

[Figure: inter-VM inter-container intra-socket point-to-point latency and bandwidth for Default, 1Layer, 2Layer, Native (w/o CMA), and Native]
• 1Layer has similar performance to the Default
• Compared with 1Layer, 2Layer delivers up to 84% and 184% improvement for latency and BW

Inter-VM Inter-Container Pt2Pt (Inter-Socket)

[Figure: inter-VM inter-container inter-socket point-to-point latency and bandwidth for Default, 1Layer, 2Layer, Basic-Hybrid, Enhanced-Hybrid, Native (w/o CMA), and Native]
• 1Layer has similar performance to the Default
• 2Layer has near-native performance for small messages, but clear overhead for large messages
• Compared to 2Layer, the Hybrid design brings up to 42% and 25% improvement for latency and BW, respectively

Applications Performance

[Figure: Graph500 (22,20 through 28,16) BFS execution time and Class D NAS (IS, MG, EP, FT, CG, LU) execution time for Default, 1Layer, and 2Layer-Enhanced-Hybrid]

• 256 processes across 64 containers on 16 nodes
• Compared with Default, the enhanced-hybrid design reduces execution time by up to 16% (28,16) for Graph 500 and 10% (LU) for NAS
• Compared with the 1Layer case, the enhanced-hybrid design also brings up to 12% (28,16) and 6% (LU) performance benefit

Integrated Design with SLURM and OpenStack

MVAPICH2-Virt with SR-IOV and IVSHMEM over OpenStack

• OpenStack is one of the most popular open-source solutions to build clouds and manage virtual machines
• Deployment with OpenStack
  – Supporting SR-IOV configuration
  – Supporting IVSHMEM configuration
  – Virtual-machine-aware design of MVAPICH2 with SR-IOV
• An efficient approach to build HPC Clouds with MVAPICH2-Virt and OpenStack

[Figure: OpenStack services - Heat orchestrates the cloud, Horizon provides the UI, Nova provisions VMs, Neutron provides networking, Glance provides images, Swift stores images, Cinder provides volumes, Ceilometer monitors, Keystone provides authentication]

J. Zhang, X. Lu, M. Arnold, D. K. Panda. MVAPICH2 over OpenStack with SR-IOV: An Efficient Approach to Build HPC Clouds. CCGrid, 2015

Typical Usage Scenarios

[Figure: typical usage scenarios on the compute nodes - Exclusive Allocations Sequential Jobs (EASJ), Shared-host Allocations Concurrent Jobs (SACJ), and Exclusive Allocations Concurrent Jobs (EACJ), each running MPI jobs inside VMs]

Need for Supporting SR-IOV and IVSHMEM in SLURM

• Requirement of managing and isolating the virtualized resources of SR-IOV and IVSHMEM
• Such management and isolation is hard to achieve by the MPI library alone, but much easier with SLURM
• Efficiently running MPI applications on HPC Clouds needs SLURM support for managing SR-IOV and IVSHMEM
  – Can critical HPC resources be efficiently shared among users by extending SLURM with support for SR-IOV and IVSHMEM based virtualization?
  – Can SR-IOV and IVSHMEM enabled SLURM and MPI library provide bare-metal performance for end applications on HPC Clouds?

Architecture Overview

[Figure: SLURM-V architecture - the user submits a job via sbatch with a VM configuration file; SLURMctld handles the physical resource request and returns the physical node list; SLURMd on each allocated physical node launches VMs through libvirtd (with VF and IVSHMEM attached), executes the MPI job, returns results, and cleans up]
VM launching/reclaiming steps: 1. SR-IOV virtual function; 2. IVSHMEM device; 3. Network setting; 4. Image management (image load, snapshot, image pool on Lustre); 5. Launching VMs and checking availability; 6. Mounting global storage, etc.

SLURM SPANK Plugin based Design

• VM Configuration Reader – registers all VM configuration options and sets them in the job control environment so that they are visible to all allocated nodes
• VM Launcher – sets up VMs on each allocated node
  - File-based lock to detect occupied VFs and exclusively allocate a free VF
  - Assigns a unique ID to each IVSHMEM device across VMs and dynamically attaches it to each VM
• VM Reclaimer – tears down VMs and reclaims resources

[Figure: SPANK plugin flow - mpirun_vm registers the job with Slurmctld; each Slurmd loads the SPANK VM Config Reader and VM Launcher, runs the MPI job over the VM hostfile, and finally loads the VM Reclaimer to reclaim VMs and release hosts]

SLURM SPANK Plugin with OpenStack based Design
• VM Configuration Reader – registers VM options
• VM Launcher, VM Reclaimer – offloaded to the underlying OpenStack infrastructure
  - PCI whitelist to pass a free VF through to the VM
  - Extend Nova to enable IVSHMEM when launching the VM

[Figure: mpirun_vm registers the job with Slurmctld; the SPANK VM Config Reader runs on Slurmd, which requests the OpenStack daemon to launch (and later reclaim) VMs; the MPI job runs over the returned VM hostfile]

J. Zhang, X. Lu, S. Chakraborty, D. K. Panda. SLURM-V: Extending SLURM for Building Efficient HPC Cloud with SR-IOV and IVShmem. Euro-Par, 2016

Application-Level Performance on Chameleon (Graph500)

[Figure: Graph500 BFS execution time, VM vs. Native, under Exclusive Allocations Sequential Jobs, Shared-host Allocations Concurrent Jobs, and Exclusive Allocations Concurrent Jobs]
• 32 VMs across 8 nodes, 6 cores/VM
• EASJ - compared to Native, less than 4% overhead with 128 procs
• SACJ, EACJ - also minor overhead when running NAS as a concurrent job with 64 procs


Available Appliances on Chameleon Cloud*

• CentOS 7 KVM SR-IOV: Chameleon bare-metal image customized with the KVM hypervisor and a recompiled kernel to enable SR-IOV over InfiniBand. https://www.chameleoncloud.org/appliances/3/
• MPI bare-metal cluster complex appliance (based on Heat): deploys an MPI cluster composed of bare-metal instances using the MVAPICH2 library over InfiniBand. https://www.chameleoncloud.org/appliances/29/
• MPI + SR-IOV KVM cluster (based on Heat): deploys an MPI cluster of KVM virtual machines using the MVAPICH2-Virt implementation, configured with SR-IOV for high-performance communication over InfiniBand. https://www.chameleoncloud.org/appliances/28/
• CentOS 7 SR-IOV RDMA-Hadoop: built from the CentOS 7 KVM SR-IOV appliance and additionally contains the RDMA-Hadoop library with SR-IOV. https://www.chameleoncloud.org/appliances/17/

• Through these available appliances, users and researchers can easily deploy HPC clouds to perform experiments and run jobs with
  – High-Performance SR-IOV + InfiniBand
  – High-Performance MVAPICH2 Library over bare-metal InfiniBand clusters
  – High-Performance MVAPICH2 Library with Virtualization Support over SR-IOV enabled KVM clusters
  – High-Performance Hadoop with RDMA-based Enhancements Support

[*] Only includes appliances contributed by OSU NowLab

MPI Complex Appliances based on MVAPICH2 on Chameleon

1. Load VM Config
2. Allocate Ports
3. Allocate FloatingIPs
4. Generate SSH Keypair
5. Launch VM
6. Attach SR-IOV Device
7. Hotplug IVShmem Device
8. Download/Install MVAPICH2-Virt
9. Populate VMs/IPs
10. Associate FloatingIPs

MVAPICH2-Virt Heat-based Complex Appliance
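The demo slides that follow drive this through the Horizon dashboard, but the same complex appliance can also be launched from the OpenStack CLI. A sketch, assuming the Heat template has been downloaded and using placeholder parameter names (the actual parameters are defined by the appliance template):

```bash
# Create the stack from the saved template and watch its progress.
openstack stack create -t mvapich2-virt-appliance.yaml \
    --parameter key_name=my-keypair \
    --parameter reservation_id=my-lease-reservation \
    my-mpi-cluster
openstack stack show my-mpi-cluster    # status and outputs (e.g. floating IP)
```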

Demos on Chameleon Cloud
• A Demo of Deploying MPI Bare-Metal Cluster with InfiniBand
• A Demo of Deploying MPI KVM Cluster with SR-IOV enabled InfiniBand
• Running MPI Programs on Chameleon

Login to Chameleon Cloud

https://chi.tacc.chameleoncloud.org/dashboard/auth/login/

Create a Lease

Lease Starts

Select Complex Appliance - MPI Bare-Metal Cluster

Get Template of MPI Bare-Metal Appliance

Save URL of Template (Will be used later)

Launch Stack

Use Saved Template URL as Source

Input Stack Information

Use Created Lease

Stack Creation In Progress

Stack Details

Select Complex Appliance - MPI KVM with SR-IOV Cluster

Get Template of MPI KVM with SR-IOV Appliance

Save URL of Template (Will be used later)

Launch Stack

Instances in Stack (Spawning ...)

Create Stack Successfully - Login to the first instance with Floating IP

Login Instance with Floating IP

SSH to Other Instances

Compile MPI Program – Hello World

Distribute Executable to Other Instances

Run MPI Program – Hello World
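The compile/distribute/run screenshots above correspond roughly to the following shell sketch; hostnames, the hostfile, and paths are placeholders:

```bash
# Write and compile a minimal MPI Hello World with MVAPICH2's mpicc.
cat > hello.c << 'EOF'
#include <mpi.h>
#include <stdio.h>
int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
EOF
mpicc -o hello hello.c

# Copy the executable to the other instances listed in the hostfile.
for host in $(cat hosts); do scp hello "$host": ; done

# Run it across the instances.
mpirun_rsh -np 4 -hostfile hosts ./hello
```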

Overview of OSU MPI Micro-Benchmarks (OMB) Suite

• A comprehensive suite of benchmarks to
  – Compare performance of different MPI libraries on various networks and systems
  – Validate low-level functionalities
  – Provide insights into the underlying MPI-level designs
• Started with basic send-recv (MPI-1) micro-benchmarks for latency, bandwidth and bi-directional bandwidth
• Extended later to
  – MPI one-sided
  – Collectives
  – GPU-aware data movement
  – OpenSHMEM (point-to-point and collectives)
  – UPC
• Has become an industry standard
• Extensively used for design/development of MPI libraries, performance comparison of MPI libraries and even in procurement of large-scale systems
• Available from http://mvapich.cse.ohio-state.edu/benchmarks
• Available in an integrated manner with the MVAPICH2 stack

Examples of OMB Benchmarks

• MPI Two-sided Pt2Pt: MPI_Latency (inter-node, intra-node), MPI_BW (inter-node, intra-node)
• MPI One-sided Pt2Pt: MPI_Put_Latency
• MPI Collectives (64 procs = 4 nodes * 16 procs/node): MPI_Bcast, MPI_Alltoall

Run OSU MPI Benchmarks – Latency and Bandwidth
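A sketch of the latency and bandwidth runs, assuming the OMB binaries that ship with MVAPICH2 are installed under the usual libexec layout (the install prefix is an assumption):

```bash
OMB=/usr/local/mvapich2/libexec/osu-micro-benchmarks/mpi/pt2pt
# Two processes on two different hosts from the hostfile.
mpirun_rsh -np 2 -hostfile hosts $OMB/osu_latency
mpirun_rsh -np 2 -hostfile hosts $OMB/osu_bw
```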

Next Steps of MVAPICH2-Virt

Conclusions

• MVAPICH2-Virt over SR-IOV-enabled InfiniBand is an efficient approach to build HPC Clouds
  – Standalone, OpenStack, Slurm, and Slurm + OpenStack
  – Support for Virtual Machine migration with SR-IOV InfiniBand devices
  – Support for Virtual Machines, Containers (Docker and Singularity), and Nested Virtualization
• Very little overhead with virtualization, near-native performance at the application level
• Much better performance than Amazon EC2
• MVAPICH2-Virt is available for building HPC Clouds
  – SR-IOV, IVSHMEM, Docker and Singularity, OpenStack
• Appliances for MVAPICH2-Virt are available for building HPC Clouds
• Demos on NSF Chameleon Cloud
• Future releases will support running MPI jobs in VMs/Containers with SLURM, etc.
• SR-IOV/container support and appliances for other MVAPICH2 libraries (MVAPICH2-X, MVAPICH2-GDR, ...)

Network Based Computing Laboratory MUG 2018 104 Thank You! [email protected] http://www.cse.ohio-state.edu/~luxi

Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/ MVAPICH: MPI over InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE http://mvapich.cse.ohio-state.edu/
