
Designing and Building Efficient HPC Cloud with Modern Networking Technologies on Heterogeneous HPC Clusters

Dissertation

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By

Jie Zhang, M.S.

Graduate Program in Computer Science and Engineering

The Ohio State University

2018

Dissertation Committee:

Dr. Dhabaleswar K. Panda, Advisor
Dr. Christopher Stewart
Dr. P. Sadayappan
Dr. Yang Wang
Dr. Xiaoyi Lu

Copyright by

Jie Zhang

2018

Abstract

Cloud Computing platforms (e.g., Amazon EC2 and Azure) have been widely adopted by many users and organizations due to their high availability and scalable computing resources. By using virtualization technology, VM or container instances in a cloud can be constructed on bare-metal hosts for users to run their systems and applications whenever they need computational resources. This has significantly increased the flexibility of resource provisioning in clouds compared to traditional resource management approaches. These days cloud computing has gained momentum in HPC communities, which brings us a broad challenge: how to design and build efficient HPC clouds with modern networking technologies and virtualization capabilities on heterogeneous HPC clusters?

Through the convergence of HPC and cloud computing, the users can get all the desirable features such as ease of system management, fast deployment, and resource sharing.

However, many HPC applications running on the cloud still suffer from fairly low performance, more specifically, degraded I/O performance from the virtualized I/O devices.

Recently, a hardware-based I/O virtualization standard called Single Root I/O Virtualization (SR-IOV) has been proposed to help solve this problem by enabling near-native I/O performance. However, SR-IOV lacks locality-aware communication support, so communication across co-located VMs or containers cannot leverage shared-memory-backed communication mechanisms. To deliver high performance to the end HPC applications in the HPC cloud, we present a high-performance

locality-aware and NUMA-aware MPI library over SR-IOV enabled InfiniBand clusters, which is able to dynamically detect the locality information in VM, container, or even nested cloud environments and coordinate the data movements appropriately. The proposed design improves the performance of NAS by up to 43% over the default SR-IOV based scheme across 32 VMs, while incurring less than 9% overhead compared with native performance. We also evaluate the performance of Singularity, one of the most attractive container technologies for building HPC clouds, on various aspects including processor architectures, advanced interconnects, memory access modes, and the virtualization overhead. Singularity shows very little overhead for running MPI-based HPC applications.

SR-IOV is able to provide efficient sharing of high-speed interconnect resources and achieve near-native I/O performance; however, SR-IOV based virtual networks prevent VM migration, which is an essential virtualization capability towards high flexibility and availability. Although several initial solutions have been proposed in the literature to solve this problem, there are still many restrictions on these proposed approaches, such as depending on specific network adapters and/or hypervisors, which limit the usage scope of these solutions in HPC environments. In this thesis, we propose a high-performance hypervisor-independent and InfiniBand driver-independent VM migration framework for

MPI applications on SR-IOV enabled InfiniBand clusters, which is able to not only achieve fast VM migration but also guarantee high performance for MPI applications during the migration in the HPC cloud. The evaluation results indicate that our proposed design can completely hide the migration overhead through computation and migration overlapping.

In addition, resource management and scheduling systems, such as Slurm and PBS, are widely used in modern HPC clusters. In order to build efficient HPC clouds, some

of the critical HPC resources, like SR-IOV enabled virtual devices and Inter-VM shared memory devices, need to be properly enabled and isolated among VMs. We thus propose a novel framework, Slurm-V, which extends Slurm with virtualization-oriented capabilities to support efficiently running multiple concurrent MPI jobs on HPC clusters. The proposed

Slurm-V framework shows good scalability and the ability to efficiently run concurrent

MPI jobs on SR-IOV enabled InfiniBand clusters. To the best of our knowledge, Slurm-V is the first attempt to extend Slurm for the support of running concurrent MPI jobs with isolated SR-IOV and IVShmem resources.

On a heterogeneous HPC cluster, GPU devices have achieved significant success for parallel applications. In addition to highly optimized computation kernels on GPUs, the cost of data movement on GPU clusters plays a critical role in delivering high performance for the end applications. Our studies show that there is a significant demand to design high performance cloud-aware GPU-to-GPU communication schemes to deliver near-native performance on clouds. We propose C-GDR, a set of high-performance Cloud-aware GPUDirect communication schemes on RDMA networks. It allows the communication runtime to successfully detect process locality, GPU residency, NUMA architecture information, and communication patterns to enable intelligent and dynamic selection of the best communication and data movement schemes on GPU-enabled clouds. Our evaluations show C-GDR can outperform the default scheme by up to 26% on HPC applications.

To my family, friends, and mentors.

Acknowledgments

This work was made possible through the love and support of several people who stood

by me, through the many years of my doctoral program and all through my life leading to

it. I would like to take this opportunity to thank all of them.

My family - my parents, Chong Zhang and Jinchuan Li, who have always given me

complete freedom and love to let me go after my dreams and unconditional support to let

me venture forth; my uncle, Pengxi Li, who has always inspired and encouraged me to

pursue higher goals; my grandmother, Aixiang Yu, who has stood by me and prayed

for me at all times.

My fiancée, Hongjin Wang, for her love, support, and understanding. I admire and respect her for the many qualities she possesses, particularly her great courage and determined mind in facing the new challenges in her career.

My advisor, Dr. Dhabaleswar K. Panda, for his guidance and support throughout my doctoral program. I have been able to grow, both personally and professionally, through my association with him. He works hard and professionally, and I can deeply feel his respect for the career he has been pursuing. Even after knowing him for six years, I am still amazed by the energy and commitment he has towards research.

My collaborators - I would like to express my appreciation to my collaborator, Dr. Xiaoyi Lu. Through six years of collaboration with him, I have witnessed his attitude and passion towards science and research: he continually and convincingly conveyed

a spirit of exploration in regard to research and scholarship, and an excitement in regard to

teaching. Without his guidance and persistent help this dissertation would not have been

possible.

My friends - I am very happy to have met and become friends with Jithin Jose, Hari

Subramoni, Mingzhe Li, Rong Shi, Ching-Hsiang Chu, Dipti Shankar, Jeff Smith, Jonathan

Perkins, Mark Arnold, Shashank Gugnani, and Haiyang Shi. This work would remain incomplete without their support and contributions. They have given me memories that I will cherish for the rest of my life.

I would also like to thank all my colleagues, who have helped me in one way or another throughout my graduate studies.

Vita

2004-2008 ...... B.S., Computer Science, Tianjin University of Technology and Education, China

2008-2011 ...... M.S., Computer Science, Nankai University, China

2012-Present ...... Ph.D., Computer Science and Engineering, The Ohio State University, U.S.A.

Publications

Jie Zhang, Xiaoyi Lu and Dhabaleswar K. Panda, C-GDR: High-Performance Cloud-aware GPUDirect MPI Communication Schemes on RDMA Networks (Under Review)

Jie Zhang, Xiaoyi Lu and Dhabaleswar K. Panda, Is Singularity-based Container Technology Ready for Running MPI Applications on HPC Clouds? The 10th International Conference on Utility and Cloud Computing (UCC '17), Dec 2017, Best Student Paper Award

Jie Zhang, Xiaoyi Lu and Dhabaleswar K. Panda, High-Performance Migration Framework for MPI Applications on SR-IOV enabled InfiniBand Clusters, The 31st IEEE International Parallel and Distributed Processing Symposium (IPDPS '17), May 2017

Jie Zhang, Xiaoyi Lu and Dhabaleswar K. Panda, Designing Locality and NUMA Aware MPI Runtime for Nested Virtualization based HPC Cloud with SR-IOV Enabled InfiniBand, The 13th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE '17), April 2017

Jie Zhang, Xiaoyi Lu, Sourav Chakraborty and Dhabaleswar K. Panda, SLURM-V: Extending SLURM for Building Efficient HPC Cloud with SR-IOV and IVShmem, The 22nd International European Conference on Parallel and Distributed Computing (Euro-Par '16), Aug 2016

Jie Zhang, Xiaoyi Lu and Dhabaleswar K. Panda, High Performance MPI Library for Container-based HPC Cloud on InfiniBand Clusters, The 45th International Conference on Parallel Processing (ICPP '16), Aug 2016

Jie Zhang, Xiaoyi Lu and Dhabaleswar K. Panda, Performance Characterization of Hypervisor- and Container-based Virtualization for HPC on SR-IOV Enabled InfiniBand Clusters, The 1st International Workshop on Emerging Parallel and Distributed Runtime Systems and Middleware (IPDRM '16), held in conjunction with the 30th IEEE International Parallel and Distributed Processing Symposium (IPDPS '16), May 2016

Jie Zhang, Xiaoyi Lu, Mark Arnold and Dhabaleswar K. Panda, MVAPICH2 over OpenStack with SR-IOV: An Efficient Approach to Build HPC Clouds, The 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid '15), May 2015

Jie Zhang, Xiaoyi Lu, Jithin Jose and Dhabaleswar K. Panda, High Performance MPI Library over SR-IOV Enabled InfiniBand Clusters, The International Conference on High Performance Computing (HiPC '14), Dec 2014

Jie Zhang, Xiaoyi Lu, Jithin Jose, Rong Shi and Dhabaleswar K. Panda, Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters?, The 20th International European Conference on Parallel and Distributed Computing (Euro-Par '14), Aug 2014

Mingzhe Li, Xiaoyi Lu, Khaled Hamidouche, Jie Zhang and Dhabaleswar K. Panda, Mizan-RMA: Accelerating Mizan Graph Processing Framework with MPI RMA, International Conference on High Performance Computing (HiPC '16), December 2016

Khaled Hamidouche, Jie Zhang, Karen Tomko, and Dhabaleswar K. Panda, OpenSHMEM NonBlocking Data Movement Operations with MVAPICH2-X: Early Experiences, International Conference on PGAS Applications Workshop, November 2016

Mingzhe Li, Khaled Hamidouche, Xiaoyi Lu, Hari Subramoni, Jie Zhang and Dhabaleswar K. Panda, Designing MPI Library with On-Demand Paging (ODP) of InfiniBand: Challenges and Benefits, International Conference on Supercomputing (SC '16), November 2016

Mingzhe Li, Khaled Hamidouche, Xiaoyi Lu, Jie Zhang, Jian Lin and Dhabaleswar K. Panda, High Performance OpenSHMEM Strided Communication Support with InfiniBand UMR, International Conference on High Performance Computing (HiPC '15), December 2015

Jian Lin, Khaled Hamidouche, Jie Zhang, Xiaoyi Lu, Abhinav Vishnu, and Dhabaleswar K. Panda, Accelerating k-NN Algorithm with Hybrid MPI and OpenSHMEM, International Conference on OpenSHMEM 2015 for PGAS Programming in the Exascale Era, Aug 2015

Rong Shi, Xiaoyi Lu, Sreeram Potluri, Khaled Hamidouche, Jie Zhang, and Dhabaleswar K. Panda, HAND: A Hybrid Approach to Accelerate Non-contiguous Data Movement using MPI Datatypes on GPU Clusters, International Conference on Parallel Processing (ICPP '14), Sep 2014

Jithin Jose, Khaled Hamidouche, Xiaoyi Lu, Sreeram Potluri, Jie Zhang, Karen Tomko, and Dhabaleswar K. Panda, High Performance OpenSHMEM for MIC Clusters: Extensions, Runtime Designs, and Application Co-Design, International Conference on CLUSTER Computing (CLUSTER '14), Sep 2014

Jithin Jose, Khaled Hamidouche, Jie Zhang, Akshay Venkatesh, and Dhabaleswar K. Panda, Optimizing Collective Communication in UPC, International Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS '14), May 2014

Jithin Jose, Jie Zhang, Akshay Venkatesh, Sreeram Potluri, and Dhabaleswar K. Panda, A Comprehensive Performance Evaluation of OpenSHMEM Libraries on InfiniBand Clusters, International Workshop on OpenSHMEM (OpenSHMEM '14), Mar 2014

Jithin Jose, Krishna Kandalla, Sreeram Potluri, Jie Zhang, and Dhabaleswar K. Panda, Optimizing Collective Communication in OpenSHMEM, International Conference on Partitioned Global Address Space Programming Models (PGAS '13), Oct 2013

Antonio Gómez-Iglesias, Dmitry Pekurovsky, Khaled Hamidouche, Jie Zhang and Jérôme Vienne, Porting Scientific Libraries to PGAS in XSEDE Resources: Practice and Experience, XSEDE Conference: Scientific Advancements Enabled by Enhanced Cyberinfrastructure (XSEDE '15), 2015

Fields of Study

Major Field: Computer Science and Engineering

Table of Contents

Page

Abstract ...... ii

Dedication ...... v

Acknowledgments ...... vi

Vita ...... viii

List of Tables ...... xvi

List of Figures ...... xvii

1. Introduction ...... 1

1.1 Problem Statement ...... 6
1.2 Research Framework ...... 7

2. Background ...... 12

2.1 Cloud Computing and OpenStack ...... 12
2.2 Virtualization Technology ...... 14
2.2.1 Hypervisor-based Virtualization ...... 14
2.2.2 Container-Based Virtualization ...... 14
2.2.3 Inter-VM Shared Memory (IVShmem) ...... 16
2.2.4 Nested Virtualization ...... 17
2.3 High Performance Computing (HPC) Systems ...... 19
2.3.1 InfiniBand ...... 19
2.3.2 Single Root I/O Virtualization ...... 20
2.3.3 Knights Landing (KNL) Architecture ...... 20
2.3.3.1 Intel KNL Memory Modes ...... 22
2.3.3.2 Intel KNL Cluster Modes ...... 22

2.3.4 Intel Omni-Path Architecture (OPA) ...... 24
2.3.5 Accelerator ...... 25
2.3.6 Slurm and SPANK ...... 26
2.3.7 Programming Models ...... 26

3. Designing VM-aware MPI Communication with SR-IOV and IVShmem . . . . 29

3.1 Understanding Performance of IVShmem ...... 29
3.2 VM-aware MPI Communication with SR-IOV and IVShmem ...... 30
3.2.1 Design Overview ...... 30
3.2.2 Locality Detector ...... 32
3.2.3 Communication Coordinator ...... 34
3.2.4 Optimizing Communication for IVShmem Channel ...... 35
3.2.5 Optimizing Communication for SR-IOV Channel ...... 38
3.3 Performance Evaluation ...... 38
3.3.1 Point-to-Point Communication Performance ...... 39
3.3.2 Collective Communication Performance ...... 40
3.3.3 Different InfiniBand Transport Protocol (RC & UD) ...... 42
3.3.4 Application Performance ...... 46
3.4 Related Work ...... 46
3.5 Summary ...... 48

4. Designing SR-IOV Enabled VM Migration Framework ...... 50

4.1 Hypervisor and InfiniBand Adapter Driver Independent SR-IOV Enabled VM Migration Framework ...... 50
4.1.1 Design Overview ...... 50
4.1.2 VM Migration Procedure ...... 52
4.1.3 Design of VM Migration Controller ...... 53
4.1.4 Design of MPI Runtime ...... 55
4.2 Performance Evaluation ...... 57
4.2.1 VM Migration Performance ...... 58
4.2.2 Overhead Evaluation of Different Schemes ...... 60
4.2.3 Point-to-Point Performance ...... 60
4.2.4 Collective Performance ...... 62
4.2.5 Overlapping Evaluation ...... 62
4.2.6 Application Performance ...... 64
4.3 Related Work ...... 65
4.4 Summary ...... 67

5. Designing Container-aware MPI Communication for Light-weight Virtualization . 68

5.1 Container-aware MPI Communication ...... 68
5.1.1 Design Overview ...... 68
5.1.2 Container Locality Detector ...... 70
5.1.3 Optimizing SHM and CMA Channels ...... 71
5.1.4 Optimizing Communication for HCA Channel ...... 72
5.2 Performance Evaluation for Container ...... 73
5.2.1 Experiment Setup ...... 73
5.2.2 Point-to-Point Performance ...... 73
5.2.3 Collective Performance ...... 76
5.2.4 Application Performance ...... 76
5.3 Performance Evaluation for Singularity ...... 77
5.3.1 Experimental Setup ...... 77
5.3.2 Point-to-Point Communication Performance ...... 78
5.3.3 Collective Communication Performance ...... 80
5.3.4 Application Performance ...... 84
5.4 Related Work ...... 88
5.5 Summary ...... 91

6. Designing High Performance MPI Communication for Nested Virtualization . 93

6.1 Two-Layer Locality Aware and NUMA-Aware Design in MPI Library ...... 95
6.1.1 Design Overview ...... 97
6.1.2 Two-Layer Locality Detector ...... 98
6.1.3 Two-Layer NUMA Aware Communication Coordinator ...... 100
6.1.4 Performance Benefit Analysis ...... 101
6.2 Hybrid Design for NUMA-Aware Communication ...... 102
6.2.1 Basic Hybrid Design with HCA Channel ...... 103
6.2.2 Enhanced Hybrid Design ...... 104
6.2.3 Putting All Together ...... 105
6.3 Performance Evaluation ...... 106
6.3.1 Point-to-Point Performance ...... 108
6.3.2 Collective Performance ...... 112
6.3.3 Application Performance ...... 113
6.4 Related Work ...... 114
6.5 Summary ...... 115

7. Co-designing with Resource Management and Scheduling Systems ...... 116

7.1 Design of Slurm-V ...... 118

7.1.1 Architecture Overview of Slurm-V ...... 118
7.1.2 Alternative Designs ...... 119
7.2 Performance Evaluation ...... 123
7.2.1 Startup Performance ...... 124
7.2.2 Scalability ...... 126
7.2.3 Application Performance ...... 126
7.3 Related Work ...... 128
7.4 Summary ...... 129

8. Designing High-Performance Cloud-aware GPUDirect MPI Communication Schemes on RDMA Networks ...... 130

8.1 Performance Characteristics of GPU Communication Schemes on Container Environments ...... 130
8.1.1 GPU Communication Schemes on Cloud ...... 130
8.1.2 Performance Study of GPU Communication on Cloud ...... 133
8.1.2.1 Latency-sensitive Benchmark ...... 134
8.1.2.2 Bandwidth-sensitive Benchmark ...... 134
8.1.3 Analysis and Design Principles for Optimal GPU Communication on Cloud ...... 135
8.2 Proposed Design of C-GDR in MVAPICH2 ...... 137
8.2.1 GPU Locality-aware Detection ...... 139
8.2.2 Workload Characterization Tracing ...... 142
8.2.3 Communication Scheduling ...... 144
8.3 Performance Evaluation ...... 146
8.3.1 Experimental Testbed ...... 146
8.3.2 MPI Level Point-to-Point Micro-benchmarks ...... 147
8.3.3 MPI Level Collective Micro-benchmarks ...... 149
8.3.4 Application Performances ...... 149
8.4 Related Work ...... 152
8.5 Summary ...... 153

9. Impact on the HPC and Cloud Computing Communities ...... 155

9.1 Software Release and Wide Acceptance ...... 157
9.1.1 MVAPICH2-Virt Library ...... 157
9.1.2 Heat-based Complex Appliance ...... 157

10. Future Research Directions ...... 158

10.1 Exploring GPU-enabled VM Migration ...... 158
10.2 QoS-aware Data Access and Movement ...... 159

10.3 Exploring Different Programming Models on HPC Cloud ...... 159

11. Conclusion and Contribution ...... 160

Bibliography ...... 164

List of Tables

Table Page

2.1 OpenStack Services ...... 13

4.1 Total Migration Time Breakdown ...... 59

6.1 Comparison with Existing Studies ...... 94

7.1 VM Startup Breakdown ...... 124

8.1 Best Schemes Discovered for Given Message Ranges for Latency-sensitive and Bandwidth-sensitive Benchmarks ...... 146

List of Figures

Figure Page

1.1 Research Framework ...... 7

2.1 Hypervisor- and Container-based Virtualization ...... 13

2.2 Singularity usage workflows ...... 15

2.3 IVShmem Communication Mechanism ...... 16

2.4 Nested Virtualization ...... 17

2.5 Practical Scenario of Nested Virtualization ...... 19

2.6 SR-IOV Communication Mechanism ...... 21

2.7 Intel KNL Overview [97] ...... 21

2.8 KNL Memory Modes [100] ...... 23

2.9 Slurm Architecture ...... 25

3.1 Primitive-Level Latency Comparison between SR-IOV IB and IVShmem . 30

3.2 MVAPICH2 Stack Running in Native and Virtualization Environments . . . 32

3.3 Virtual Machine Locality Detection ...... 34

3.4 Communication Coordinator ...... 36

3.5 Communication Optimization for IVShmem Channel ...... 36

3.6 Communication Optimization for SR-IOV Channel ...... 37

3.7 Point-to-point Performance ...... 41

3.8 Collective Communication Performance on 32 VMs (8 VMs per node) . . . 43

3.9 Intra-host Inter-VM Point-to-Point Performance on RC and UD Protocols . 45

3.10 32 VMs (8 VMs per node) Collective Performance on RC and UD Protocols 45

3.11 Application Performance ...... 46

4.1 An Overview of the Proposed Migration Framework ...... 51

4.2 Sequence Diagram of Process Migration ...... 52

4.3 The Proposed Progress Engine based Design and Migration-thread based Design ...... 55

4.4 VM Migration Time and Profiling Results ...... 58

4.5 Overhead Evaluation of Different Designs ...... 60

4.6 MPI Communication Performance with VM Migration of Different Designs 61

4.7 Benchmark to Evaluate Computation and Migration Overlapping ...... 64

4.8 Application Execution Time with VM Migration of Different Designs . . . 65

5.1 MVAPICH2 Stack Running in Container-based Environments ...... 69

5.2 Container Locality Detection ...... 71

5.3 Communication Channel Optimization ...... 72

5.4 MPI Two-Sided Point-to-Point Communication Performance ...... 74

5.5 MPI One-Sided Point-to-Point Communication Performance ...... 75

5.6 Collective Communication Performance with 256 Processes ...... 75

5.7 Application Performance with 256 Processes ...... 77

5.8 MPI Point-to-Point Communication Performance on Haswell ...... 81

5.9 MPI Point-to-Point Communication Performance on KNL with Cache Mode 82

5.10 MPI Point-to-Point Communication Performance on KNL with Flat Mode . 83

5.11 MPI Collective Communication Performance with 512-Process on Haswell 85

5.12 MPI Collective Communication Performance with 128-Process on KNL with Cache Mode ...... 86

5.13 MPI Collective Communication Performance with 128-Process on KNL with Flat Mode ...... 87

5.14 Application Performance with 512-Process on Haswell ...... 88

5.15 Application Performance with 128-Process on KNL with Cache Mode . . . 89

5.16 Application Performance with 128-Process on KNL with Flat Mode . . . . 89

6.1 MPI Point-to-Point Latency Performance on Nested Virtualization Environment (Compare Default, One-Layer Locality-Aware and Native) ...... 96

6.2 Communication Paths across Containers on Different VM/Container Placements ...... 96

6.3 Two-Layer Locality Aware Communication in Nested Virtualization Environments ...... 96

6.4 Two-Layer Locality Detector Design (VM Locality Detector utilizes the VM Locality-Aware List to detect the processes on the same host. Further, Container Locality Detector leverages the Container Locality-Aware List to identify the processes on the same VM. Finally, each MPI process has a global view of the locality information. “V” denotes the processes in the same VM, “H” denotes the processes in the same host, but the different VM, “N” denotes the processes on remote hosts) ...... 98

6.5 Two-Layer NUMA Aware Communication Coordinator ...... 100

6.6 MPI Point-to-Point Latency Performance on Nested Virtualization Environment (Compare Default, One-Layer Locality-Aware, Two-Layer Locality-Aware and Native) ...... 101

6.7 Basic Hybrid Design (SHM Channel for Small Messages, Network Loopback Channel for Large Messages) ...... 103

6.8 Enhanced Hybrid Design (SHM Channel for Small Messages and Control Messages, Network Loopback Channel for Large Messages) ...... 104

6.9 MPI Point-to-Point Latency of Hybrid Design for Inter-Socket Communication on Nested Virtualization Environment ...... 106

6.10 Point-to-Point Communication Performance of Inter-VM Inter-Container Scenario ...... 109

6.11 Collective Communication Performance with 256 Processes ...... 112

6.12 Application Performance with 256 Processes ...... 113

7.1 Different Scenarios of Running MPI Jobs over VMs on HPC Cloud . . . . 116

7.2 Architecture Overview of Slurm-V ...... 120

7.3 SPANK Plugin-based and SPANK Plugin over OpenStack-based Design . . 122

7.4 VM Launch Breakdown Results on Cluster-A and Chameleon ...... 124

7.5 Scalability Studies on Cluster-A and Chameleon ...... 126

7.6 Graph500 Performance with 64 Processes on Different Scenarios ...... 128

8.1 Data Movement Strategies between GPUs in Container Environments within a node ...... 131

8.2 Latency comparison of data movement strategies on Docker container environment within a node ...... 135

8.3 Bandwidth comparison of data movement strategies on Docker container environment within a node ...... 136

8.4 Overview of GPU Locality-aware Detection in C-GDR ...... 139

8.5 GPU Locality-aware Detection Module in C-GDR ...... 141

8.6 NUMA-aware Support in Locality-aware Detection Module ...... 143

8.7 Workload Characterization Tracing Module in C-GDR ...... 144

8.8 Communication Scheduling Module in C-GDR ...... 145

8.9 MPI Point-to-Point Performance for GPU to GPU Communication . . . . . 147

8.10 MPI Collective Communication Performance across 16 GPU Devices . . . 150

8.11 Application Performance across 16 GPU Devices (For Communication Time, lower is better; For TPS and GFLOPS, higher is better) ...... 151

Chapter 1: Introduction

To meet the increasing demand for computational power, HPC clusters have grown tremendously in size and complexity. As the prevalence of high-speed interconnects, multi/many-core processors, and accelerators continues to increase, efficient sharing of such resources is becoming more important to achieve faster turnaround time and reduce the cost per user. Furthermore, a large number of users, including many enterprise users, experience large variability in workloads depending on business needs, which makes predicting the required resources for future workloads a difficult task. For such users, cloud computing can be an attractive solution that offers on-demand resource acquisition, high configurability, and high performance at a low cost. This demand has been evidenced by the plethora of vendors offering such solutions, including Amazon, Google, and Microsoft. Virtualization technology, as one of the foundations of cloud computing, has been developing rapidly over the past few decades. Several different virtualization solutions, such as Xen [108],

VMware ESX/ESXi [102], and KVM [52] are proposed and improved by the community.

These hypervisor-based virtualization solutions bring several benefits, including hardware independence, high availability, isolation, and security. They have been widely adopted in industry computing environments. For instance, Amazon's Elastic Compute Cloud

(EC2) [9], Google’s Compute Engine [27] and VMWare’s vCloud Air [103] utilize Xen,

KVM and ESX/ESXi on their cloud computing platforms, respectively. On the other hand,

as a lightweight virtualization solution [98], container-based virtualization (such as Linux-VServer [61], Linux Containers (LXC) [60], or Docker [16]) has attracted considerable attention recently. In container-based virtualization, the same host OS kernel is shared across containers, which leads to a more efficient way to provide virtualized computing environments to end users. The container-based solution is growing and influencing the evolution of cloud computing. With the emergence of container-based virtualization technology on the clouds, another usage paradigm, which is called "nested virtualization", is becoming more and more popular. As a typical example, many end users choose to run their applications encapsulated by Docker containers over Amazon

EC2 virtual machines. Such an approach of running containers nested in virtual machines can bring easy deployment benefit for end users while making the cloud easy-to-manage for administrators.

Even though cloud computing with virtualization has gained significant momentum in the industry computing domain, running HPC applications on cloud systems with good performance is still challenging [72]. One of the biggest hurdles is the lower performance of virtualized I/O devices [49], which limits the adoption of virtualized cloud computing systems for HPC applications. To address this issue, the community has recently introduced an enhanced networking capability, Single Root I/O Virtualization (SR-IOV) [96], which offers a high performance alternative for virtualizing I/O devices on cloud computing systems. The SR-IOV specification provides higher I/O performance and lower CPU utilization compared to the traditional software-based virtualization solutions. Currently,

SR-IOV has already been used in production cloud computing systems, such as the C3 and I2 instance types (using 10GigE) in Amazon EC2, where this feature shows higher packet-per-second performance and lower network jitter. Although SR-IOV is enabled in these systems, the communications between co-located instances also have to go through SR-IOV, which results in performance overheads. The main drawback of SR-IOV is that it does not have locality-aware communication support, whereas high performance MPI libraries in the HPC domain typically use shared memory based schemes for intra-host communication. In the virtualized environment, Inter-VM Shared Memory

(IVShmem) [64] can be hot-plugged to a VM as a virtualized PCI device to support shared memory backed intra-node-inter-VM communication. This brings the following challenge:

Can MPI runtime be redesigned to provide high performance virtualization support, such as locality-aware and NUMA-aware support, for virtual machines, containers, and nested virtualization environments on HPC clouds to deliver optimal communication performance? How much benefit can be achieved on HPC clouds with the redesigned MPI runtime for scientific kernels and applications?

SR-IOV is able to provide efficient sharing of high-speed interconnect resources to

VMs. However, as an essential virtualization capability towards high availability and resource provisioning, virtual machine migration with SR-IOV devices is still facing challenges. For instance, to successfully migrate a VM with an SR-IOV enabled InfiniBand virtual device, we have to handle the challenges of detachment of an active IB device during migration and reestablishment of the IB connections after migration. All of these need to be done transparently and efficiently for the applications. Recent studies [29, 79, 110] have shown that SR-IOV based virtual networks (both InfiniBand and High-Performance

Ethernet) prevent virtual machine migration with the current generation of hypervisors (KVM, Xen, and ESXi) and InfiniBand or high performance Ethernet SR-IOV drivers.

Although several initial prototypes have been proposed in these works to support VM migration with SR-IOV devices, our investigations show that there are still many restrictions

on these proposed approaches. First of all, those approaches need to modify hypervisors (e.g., [79] needs to modify Xen; [110] needs to modify VMware ESXi). One drawback of modifying the hypervisor is that the proposed solution will be dependent on the particular hypervisor being used on the cloud, and even on a particular version of that hypervisor. More importantly, in the HPC environment, it will be very hard to request

HPC resource administrators to run a modified version of the hypervisor in their clusters due to security concerns. Besides, many of these approaches are also dependent on particular network adapters or drivers. For example, the proposed approaches in [29] and [79] need to modify the drivers of InfiniBand adapters and Intel Ethernet adapters, respectively.

Obviously, such approaches will limit the usage of their designs on HPC clouds with different adapters from different vendors. To address this challenge, can hypervisor-independent and InfiniBand adapter driver-independent fault-tolerance/resilience

(VM Live Migration) be supported on SR-IOV enabled HPC clouds?

In addition, for improved flexibility and resource utilization, it is important to manage and isolate the critical virtualized resources such as SR-IOV enabled Virtual Functions and

IVShmem devices when building an efficient VM-based HPC cloud environment to support running multiple concurrent MPI jobs. As this requires knowledge of and some level of control over the underlying physical hosts, it is difficult to achieve this with the MPI library alone, which is only aware of the virtual nodes and resources inside. Thus, extracting the best performance from virtualized clusters requires the proper support from other middleware like resource management and job scheduling systems, which have a global view of the VMs and the underlying physical hosts. Alternatively, container-based virtualization technologies (such as Docker [16]) have been seen as another promising way for building

HPC clouds due to their low overhead. Singularity has become one of the most attractive container technologies in the HPC field these days. The Singularity community has claimed that the primary design goals of Singularity are to provide reproducible environments across HPC centers and to deliver near-native performance for HPC applications. These lead to the following broad challenge on building efficient HPC clouds: How to build

efficient VM-based HPC cloud through co-designing with resource management and

scheduling systems on modern HPC systems?

As a critical component, GPU devices have been widely used in many modern HPC and cloud computing environments. The computational power of the GPU has changed the way researchers and developers parallelize their applications on such high-performance heterogeneous computing platforms. For instance, the GPU has become one of the most important driving factors of fast and scalable applications such as artificial intelligence, computational chemistry, and weather forecasting [95]. To efficiently utilize GPUs for parallel applications, in addition to designing highly optimized computation kernels on GPUs, the performance of data movement operations on GPU clusters also makes significant differences [89]. However, the existence of GPUs significantly complicates the communication runtime designs on heterogeneous clusters. There are multiple data movement schemes for GPU-to-GPU communication, such as cudaMemcpy, GDRCOPY, cudaIPC, and GDR. The problem gets even more complicated in the cloud environment. For instance, different placement schemes (intra-socket or inter-socket) can be used to deploy containers, multiple containers could be deployed on the same host, and each data movement approach might have a different applicable scenario or different performance characteristics on such diverse container configurations compared to the native environment. This leads to the following challenge: How to design a cloud-aware GPU-to-GPU communication library on RDMA networks to enable intelligent and adaptive communication scheduling for achieving optimal application performance on the complex cloud environment?

1.1 Problem Statement

It is critical that the issues outlined above be addressed in order to design and build efficient HPC clouds with modern networking technologies on heterogeneous HPC systems. To deliver near-native performance to the end HPC applications, the communication runtime needs to be able to provide high performance virtualization support for different types of virtualization environments, such as virtual machines, containers, and nested virtualization. As live migration is an essential feature in the cloud computing area, high performance and scalable fault-tolerance and resilience capabilities should be supported on the SR-IOV enabled HPC cloud. For building efficient HPC clouds, we also need to co-design with resource management and job scheduling systems in order to efficiently share the resources on modern HPC systems. Alternatively, container-based technologies, in particular Singularity, should be carefully evaluated when building HPC clouds. Moreover, as an indispensable component in a heterogeneous HPC system, GPU devices play an increasingly important role on the HPC cloud. Therefore, GPU-to-GPU communication should also be thoroughly studied and intelligently designed in the complicated cloud context. To summarize, this dissertation addresses the following broad challenges:

1. Can MPI runtime be redesigned to provide high performance virtualization support

for virtual machines and containers when building HPC clouds?

2. How much benefit can be achieved on HPC clouds with the redesigned MPI runtime for

scientific kernels and applications?

3. Can hypervisor-independent and InfiniBand adapter driver-independent fault-

tolerance/resilience (VM Live Migration) be supported on SR-IOV enabled HPC

clouds?

4. How to build efficient VM-based HPC cloud through co-designing with resource

management and scheduling systems on modern HPC systems?

5. How to design a cloud-aware GPU-to-GPU communication library on RDMA net-

works to enable intelligent and adaptive communication scheduling for achieving

optimal application performance on the complex cloud environment?

1.2 Research Framework

Figure 1.1 depicts the research framework that we propose to address the challenges highlighted above. We discuss how we use the framework to address each of the challenges in detail.

Figure 1.1: Research Framework

1. Can MPI runtime be redesigned to provide high performance virtualization support

for virtual machines and containers when building HPC clouds?

Through the convergence of HPC and cloud computing, the users can get all the de-

sirable features such as ease of system management, fast deployment, and resource

sharing. However, many HPC applications running on the cloud still suffer from

fairly low performance, more specifically, the degraded I/O performance from the

virtualized I/O devices. Recently, a hardware-based I/O virtualization standard called

Single Root I/O Virtualization (SR-IOV) has been proposed to help solve the prob-

lem, which makes SR-IOV achieve near-native I/O performance. Whereas, SR-IOV

lacks “Locality-Aware Communication” support. That is, the communications across

the co-located instances (VMs, containers, or nested virtualization environments) are not able to leverage shared memory backed communication mechanisms, such as the SMP and CMA channels shown in Figure 1.1. In addition, the instances on the cloud could have different deployment schemes. For example, multiple instances could be deployed on either the same or different NUMA nodes. With "NUMA-Aware Communication" support, the MPI runtime can schedule the appropriate communication channels according to the NUMA information. To deliver high performance to the upper layer HPC applications, we propose a high-performance locality-aware and NUMA-aware MPI library over SR-IOV enabled InfiniBand clusters, which is able to dynamically detect the locality information in VM, container, or even nested cloud environments and reschedule the communication channels appropriately based on the locality and NUMA information (a minimal code sketch of this kind of locality detection is given at the end of this list).

2. How much benefit can be achieved on HPC clouds with the redesigned MPI runtime for

scientific kernels and applications?

Without "Locality-Aware Communication" and "NUMA-Aware Communication" support, the MPI communication across the instances can only go through the SR-IOV channel. In contrast, high performance MPI libraries in the native environment typically use shared memory based schemes for intra-node communication to extract better communication performance. Through the redesigned MPI runtime with the proposed locality-aware and NUMA-aware communication support, the end HPC applications running on the upper layer can take full advantage of all available channels, including the SMP/IVShmem channel, CMA channel, and SR-IOV channel. We evaluate and discuss the benefit to the end HPC applications that results from the redesigned MPI runtime. As a promising container-based solution

for building HPC cloud, we study the performance of Singularity on different multi-

core processor and co-processor architectures (e.g., Xeon, Xeon Phi [97]), different

types of interconnects (e.g., Omni-Path [12], InfiniBand), different memory access

modes, etc.

3. Can fault-tolerance/resilience (VM Live Migration) be supported on SR-IOV enabled

HPC clouds?

SR-IOV technology is able to provide efficient sharing of high-speed interconnect

resources and achieve near-native I/O performance. However, it prevents virtual machine migration, which is an essential fault-tolerance and resilience capability towards high availability and resource provisioning. Although several initial solutions

have been proposed in the literature to solve this problem, there are still many restrictions on these proposed approaches, such as depending on specific network adapters and/or hypervisors, which will limit the usage scope of these solutions in HPC environments. In this thesis, we propose a high-performance virtual machine migration framework for MPI applications on SR-IOV enabled InfiniBand clusters. The framework consists of an external "Parallel Migration Controller" and the associated "Migration Support" within the MPI runtime. The migration controller with multiple

parallel libraries is used to monitor the migration status and coordinate with the MPI runtime during migration. The migration support within the MPI runtime can handle the IB connection suspension and reactivation according to the signals from the external controller. The proposed solution does not need any modification to the hypervisor and InfiniBand drivers, and it can efficiently handle VM migration with an SR-IOV IB device.

4. How to co-design with resource management and scheduling systems on modern

HPC systems to improve the resource utilization when building efficient HPC clouds?

To build VM-based HPC cloud, the critical virtualized resources, such as SR-IOV

enabled Virtual Functions and IVShmem devices, need to be properly managed and

isolated to support running multiple concurrent MPI jobs. This is difficult to achieve with the MPI library alone, as it requires knowledge of and some level of control over the underlying physical hosts. Thus, extracting the best performance from virtualized clusters requires support from other middleware like job launchers

and resource managers, which have a global view of the VMs and the underlying

physical hosts. In this thesis, we propose a framework, Slurm-V, which extends

Slurm through SPANK to manage and isolate virtualized resources when building

efficient HPC clouds. With the help of Slurm-V, MPI applications can be run concurrently on virtual machines. In the Slurm-V framework, three alternative designs

10 are proposed: Task-based design, SPANK plugin-based design, and SPANK plugin

over OpenStack-based design.

5. How to design a cloud-aware GPU-to-GPU communication library on RDMA net-

works to enable intelligent and adaptive communication scheduling for achieving

optimal application performance on the complex cloud environment?

GPU devices have been widely adopted in many modern HPC and cloud computing

environments because of the massively parallel architectures they provide. To efficiently utilize GPUs for parallel applications, in addition to designing highly optimized

computing kernels on GPUs, the performance of data movement operations on GPU

clusters also makes significant differences. However, the complexity of designing

efficient GPU-based communication schemes on clouds is significantly increased

on heterogeneous systems, given that there exist different instance deployment schemes, multiple data movement schemes, and multiple advanced hardware features. In this thesis, we first investigate the performance characteristics of state-of-the-art GPU-based communication schemes in native and container-based

environments. Then we propose the C-GDR approach to design high-performance

cloud-aware GPU-to-GPU communication schemes on RDMA networks. With the

proposed ”GPU-based Communication” support and redesigned accelerator channel,

the intelligent and adaptive communication scheduling can be achieved to deliver the

optimal application performance on the complex cloud environment.
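To make the locality detection described in the first item above concrete, the following minimal sketch shows how co-located MPI ranks can be identified using standard MPI-3 calls alone. It is an illustrative example, not the MVAPICH2-Virt detector presented later in this thesis, which additionally distinguishes VM-level and container-level boundaries and NUMA placement.

/* locality_probe.c: a minimal, illustrative sketch (NOT the MVAPICH2-Virt
 * implementation) of how an MPI runtime or application can discover which
 * ranks share a node/VM/container boundary, using standard MPI-3 calls. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int world_rank, world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    /* Group ranks that can share memory (the same VM or container in a
     * cloud deployment, the same host natively). */
    MPI_Comm shmem_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &shmem_comm);

    int local_rank, local_size;
    MPI_Comm_rank(shmem_comm, &local_rank);
    MPI_Comm_size(shmem_comm, &local_size);

    /* A locality-aware runtime would use this information to route
     * intra-instance traffic over shared memory (SMP/CMA/IVShmem) and only
     * fall back to the SR-IOV channel for truly remote peers. */
    printf("world rank %d/%d is local rank %d/%d on its instance\n",
           world_rank, world_size, local_rank, local_size);

    MPI_Comm_free(&shmem_comm);
    MPI_Finalize();
    return 0;
}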

Chapter 2: Background

2.1 Cloud Computing and OpenStack

Cloud computing is a type of Internet-based computing that provides on-demand computer processing resources and data. It is a model for enabling ubiquitous, on-demand access to a shared pool of configurable computing resources (e.g., computer networks, servers, storage, applications, and services), which can be rapidly provisioned and released with minimal management effort.

OpenStack [5] is an open-source middleware for cloud computing that controls large pools of computing, storage, and networking resources throughout a datacenter, all managed through a dashboard that gives administrative control and web access to users. A breakdown of the OpenStack services is given in Table 2.1. Nova is a core component among them. It is designed to manage and automate pools of compute resources and can work with widely available virtualization technologies, as well as bare metal and high performance computing (HPC) configurations.

Table 2.1: OpenStack Services

Service          | Project name | Description
Dashboard        | Horizon      | Provides a web-based self-service portal to interact with underlying OpenStack services.
Compute          | Nova         | Manages the lifecycle of compute instances in an OpenStack environment.
Networking       | Neutron      | Enables network connectivity as a service for other OpenStack services.
Object Storage   | Swift        | Stores and retrieves arbitrary unstructured data objects.
Block Storage    | Cinder       | Provides persistent block storage to running instances.
Identity Service | Keystone     | Provides an authentication and authorization service for other OpenStack services.
Image Service    | Glance       | Stores and retrieves virtual machine disk images.
Telemetry        | Ceilometer   | Monitors and meters the OpenStack cloud for billing, benchmarking, scalability, and statistical purposes.
Orchestration    | Heat         | Orchestrates multiple composite cloud applications.

Figure 2.1: Hypervisor- and Container-based Virtualization ((a) Hypervisor-based Virtualization; (b) Container-based Virtualization)

2.2 Virtualization Technology

2.2.1 Hypervisor-based Virtualization

The hypervisor is also called the virtual machine monitor (VMM). As shown in Figure 2.1(a), the hypervisor works as an intermediate layer that interacts with the underlying host operating system and hardware and actually controls the host resources. It supports running multiple guest operating systems on a single hardware host. The guest operating systems share the physical resources, but they appear to have their own exclusive hardware views without disturbing each other. A hypervisor is a powerful tool that can consolidate multiple servers and make use of the available physical computing resources. However, as an additional layer, it introduces some overhead compared to using physical hardware directly, especially for I/O intensive applications. In this thesis, we use KVM as the hypervisor. KVM is an open-source virtualization solution for Linux on x86 processors containing virtualization extensions

(Intel VT or AMD-V). It allows a user space program to utilize the hardware virtualization features in a full virtualization manner. PCI passthrough allows giving access and control of physical devices to guests: that is, one can use PCI passthrough to assign a PCI device (NIC, disk controller, HBA, USB controller, sound card, etc.) to a guest domain, giving it full and direct access to the PCI device.

2.2.2 Container-Based Virtualization

Container-based virtualization is a technology provided by operating system kernels to support lightweight virtualization. As shown in Figure 2.1(b), container-based virtualization provides "containers" as self-contained execution environments. In container-based virtualization, a single kernel can execute several isolated userspace instances. Each instance can have independent namespaces, resource views, and software stacks, while the kernel is shared. The resource accesses from different containers are scheduled by the same kernel through a thin engine layer. This avoids the overhead of a whole stack of guest operating systems.

Docker [16] is a popular open-source platform for building and running containers and offers several important features, including portable deployment across machines, versioning, reuse of container images, and a searchable public registry for images. In addition,

Docker gives users the flexibility to share certain namespaces with either the host or other

containers. For example, sharing the host’s process (PID) namespace allows the processes

within the containers to see all of the processes on the system. And sharing IPC namespace

can accelerate inter-process communication with shared memory segments, semaphores,

and message queues. In the container context, runtime privilege gives a container access to all devices. For example, when the operator executes docker run --privileged, Docker will enable access to all devices on the host as well as set some configuration in SELinux to allow the container to have nearly the same access to the host as processes running outside containers on the host. We can use the privileged option to give a container access to the InfiniBand device on the host.
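The IPC-namespace sharing mentioned above is what makes shared-memory communication between co-located containers possible. The following minimal sketch assumes two containers have been started with a shared IPC namespace (for example via Docker's --ipc option) and uses an arbitrary example key and size; it shows two peers exchanging data through one System V shared memory segment.

/* shm_peer.c: minimal sketch of System V shared memory IPC. If two
 * containers are launched so that they share one IPC namespace, a segment
 * created in one container with this well-known key is visible to the
 * other. The key and size are arbitrary illustrative values. */
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#define SHM_KEY  0x4d50u   /* arbitrary well-known key shared by both peers */
#define SHM_SIZE 4096

int main(void)
{
    /* Create the segment if it does not exist yet, otherwise attach to it. */
    int shmid = shmget(SHM_KEY, SHM_SIZE, IPC_CREAT | 0666);
    if (shmid < 0) {
        perror("shmget");
        return 1;
    }

    char *buf = (char *)shmat(shmid, NULL, 0);
    if (buf == (char *)-1) {
        perror("shmat");
        return 1;
    }

    /* One peer writes, the other peer (running the same program in the
     * co-located container) reads the message without going through the
     * network stack. */
    if (buf[0] == '\0')
        snprintf(buf, SHM_SIZE, "hello from a co-located container");
    else
        printf("peer wrote: %s\n", buf);

    shmdt(buf);
    return 0;
}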

[Figure 2.2 shows the Singularity usage workflows: on the user endpoint, container creation (sudo singularity create, sudo singularity bootstrap), import (sudo singularity import docker://ubuntu), and interactive modification (sudo singularity shell --writable); on the shared computational resource, execution with singularity run, singularity shell, and singularity exec.]

Figure 2.2: Singularity usage workflows

Singularity [54] enables users to have full control of their environment. This means that a non-privileged user can ”swap out” the operating system on the host for one they control. So if the host system is running RHEL6 but your application runs in Ubuntu, you

can create an Ubuntu image, install your applications into that image, copy the image to another host, and run your application on that host in its native Ubuntu environment. As shown in Figure 2.2, the standard Singularity usage workflow involves a working endpoint

(left) where the user has root, and a container can be created, modified, and updated, and then transferred to a shared computational resource (right) to be executed at scale. Moreover, Singularity also allows users to leverage the resources of whatever host they are on. This includes HPC interconnects, resource managers, file systems, GPUs and/or accelerators, etc. Singularity does this by enabling several key facets: 1. Encapsulation of the environment; 2. Containers are image based; 3. No user contextual changes or root escalation allowed; 4. No root owned daemon processes. Singularity uses the filesystem

(mount), PID and user namespaces.

2.2.3 Inter-VM Shared Memory (IVShmem)

Figure 2.3: IVShmem Communication Mechanism

IVShmem (e.g., Nahanni) [64] provides zero-copy access to data in shared memory of co-resident VMs on the KVM platform. IVShmem is designed and implemented mainly in the system call layer, and its interfaces are visible to user space applications as well. As shown in Figure 2.3, IVShmem contains three components: the guest kernel driver, the modified QEMU supporting the PCI device, and the POSIX shared memory region on the host OS. The shared memory region is allocated by host POSIX operations and mapped into the QEMU process address space. The mapped memory in QEMU can be used by guest applications by being remapped to user space in the guest VMs. Evaluation results illustrate that both micro-benchmarks and HPC applications can achieve better performance with IVShmem support [45].
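For illustration, the following minimal user-space sketch shows how a guest process can map the IVShmem region once the device is visible inside the guest. It is not the Nahanni guest driver itself, and the PCI address in the sysfs path is a hypothetical example that depends on how the guest enumerates the device.

/* ivshmem_map.c: minimal guest-side sketch (not the IVShmem guest driver)
 * showing how a user-space process can map the ivshmem shared memory
 * region. It assumes the device's data BAR (BAR2) is visible at the sysfs
 * path below; the actual PCI address depends on the guest. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Hypothetical PCI address of the ivshmem device inside the guest. */
    const char *bar2 = "/sys/bus/pci/devices/0000:00:05.0/resource2";
    const size_t region_size = 1 << 20;   /* must match the ivshmem size */

    int fd = open(bar2, O_RDWR | O_SYNC);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* The mapping is backed by the host POSIX shared memory object, so
     * writes become visible to co-resident VMs that map the same region. */
    char *shm = mmap(NULL, region_size, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (shm == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return 1;
    }

    snprintf(shm, region_size, "hello from this VM");
    printf("first bytes of shared region: %.32s\n", shm);

    munmap(shm, region_size);
    close(fd);
    return 0;
}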

2.2.4 Nested Virtualization

Figure 2.4: Nested Virtualization

A special organization pattern of virtualization is nested virtualization [11]. In this pattern, two or more levels of virtualization mechanisms are deployed. As shown in Figure 2.4, the upper-level virtual machines or containers run inside the guest operating systems of the lower-level virtual machines. Nested virtualization has potential applications in the Infrastructure as a Service (IaaS) cloud environment, where tenants may create a user-controlled virtualization platform for private Platform as a Service (PaaS) or Software as a Service (SaaS) clouds. Nested virtualization is also useful for scenarios such as cloud live migration, application sandboxing, or legacy system integration. Due to the limitations of CPU virtualization extensions, not all virtualization mechanisms can be nested without extra design. Running Linux containers over hypervisor-based virtual machines directly is feasible. However, it still has some performance issues because of the redundant call stacks and isolated physical resources.

Figure 2.5 depicts a practical scenario of nested virtualization. We can see that four

VMs are deployed on a node by users A and B, respectively. On top of that, they launch two Docker containers in each VM they created. The reason for deploying VMs on the first virtualization layer is that a VM provides good isolation and security, so that the applications and workloads of users A and B will not interfere with each other. The root permission of a VM can be given to normal users, whereas root permission is typically not allowed on a bare-metal cluster. The user is able to do any necessary configuration to implement special functions, such as a specific virtual network among the VMs. In addition, the user can have different OS options within the Guest OS to gain maximum agility. Docker brings an effective, standardized, and repeatable way to port and distribute applications and workloads. Developers can easily build Docker images for their applications and store these images in a Docker registry for sharing. Users A and B can update their applications and workloads by quickly pulling their respective images from the Docker registry. Once this is done, they are free to run Docker-packaged applications in the VM environment. Through this nested virtualization, users can take advantage of both technologies in a complementary manner.

Another commercial example of nested virtualization is "Photon OS" [81], a lightweight Linux operating system for cloud-native applications. It is optimized for vSphere and vCloud Air, providing users an easy way to extend their current platform with VMware and run modern, distributed applications using containers in their vSphere environments.

Figure 2.5: Practical Scenario of Nested Virtualization

2.3 High Performance Computing (HPC) Systems

2.3.1 InfiniBand

InfiniBand (IB) is an industry standard switched fabric that is designed for interconnecting compute and I/O nodes in High-End Computing clusters [38]. It has emerged as the most-used internal systems interconnect in the Top 500 list of supercomputers. The list released in June 2018 reveals that, for the top 500 HPC systems, InfiniBand achieves 36.1% in the "Interconnect Family Performance Share".

Remote Direct Memory Access (RDMA) is one of the main features of InfiniBand; it allows software to access the memory contents of a remote process without any involvement at the remote side. When a connection between two channel adapters is established, one of five transport-layer communication protocols defined by the InfiniBand specification can be selected: Reliable Connection (RC), Reliable Datagram (RD), Unreliable Connection (UC), Unreliable Datagram (UD), and Raw Datagram. RC and UD are the two most common protocols. RC is the most popular transport service for implementing MPI over InfiniBand. For connection-oriented RC, a Queue Pair (QP) must be dedicated to communicating with only one other QP; that is, each peer communicating with N other peers needs to create at least N QPs. RC provides RDMA capability, atomic operations, and reliable service, and data transfers using RC are acknowledged. UD is a connection-less and unreliable transport without acknowledgement. It is the most basic transport specified for InfiniBand. The advantage is that a single UD QP can communicate with any number of other UD QPs. However, UD does not guarantee reliability or message ordering.
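For readers unfamiliar with the verbs interface, the following minimal sketch (not part of the software developed in this dissertation; the chosen device, queue depths, and error handling are illustrative assumptions) shows how an RC and a UD queue pair are created with libibverbs:

```c
/* Minimal sketch: creating an RC and a UD queue pair with libibverbs.
 * Compile with -libverbs. Queue depths are illustrative. */
#include <infiniband/verbs.h>
#include <stdio.h>

static struct ibv_qp *create_qp(struct ibv_pd *pd, struct ibv_cq *cq,
                                enum ibv_qp_type type)
{
    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .qp_type = type,            /* IBV_QPT_RC or IBV_QPT_UD */
        .cap     = { .max_send_wr = 128, .max_recv_wr = 128,
                     .max_send_sge = 1, .max_recv_sge = 1 },
    };
    return ibv_create_qp(pd, &attr);
}

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { fprintf(stderr, "no IB device\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);
    struct ibv_cq *cq = ibv_create_cq(ctx, 256, NULL, NULL, 0);

    /* RC: connection-oriented, reliable, supports RDMA and atomics;
     * one QP is needed per communicating peer.                       */
    struct ibv_qp *rc_qp = create_qp(pd, cq, IBV_QPT_RC);
    /* UD: connection-less and unreliable; a single QP can talk to
     * any number of remote UD QPs.                                   */
    struct ibv_qp *ud_qp = create_qp(pd, cq, IBV_QPT_UD);
    if (!rc_qp || !ud_qp) { fprintf(stderr, "ibv_create_qp failed\n"); return 1; }
    printf("RC QPN=%u, UD QPN=%u\n", rc_qp->qp_num, ud_qp->qp_num);

    ibv_destroy_qp(ud_qp); ibv_destroy_qp(rc_qp);
    ibv_destroy_cq(cq); ibv_dealloc_pd(pd); ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```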

2.3.2 Single Root I/O Virtualization

Single Root I/O Virtualization (SR-IOV) is a PCI Express (PCIe) standard which specifies native I/O virtualization capabilities in PCIe adapters. As shown in Figure 2.6, SR-IOV allows a single physical device, or Physical Function (PF), to present itself as multiple virtual devices, or Virtual Functions (VFs). Each virtual device can be dedicated to a single VM through PCI passthrough, which allows each VM to directly access its corresponding VF. Hence, SR-IOV is a hardware-based approach to implementing I/O virtualization. Furthermore, VFs are designed based on the existing non-virtualized PFs. Therefore, the drivers of current adapters can also be used to drive the VFs in a portable manner.
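One common way to provision VFs on Linux hosts (an assumption about the deployment environment, not a detail taken from this work) is the sysfs sriov_numvfs attribute; the sketch below writes the desired VF count for a hypothetical PF:

```c
/* Minimal sketch (assumes the standard Linux sysfs SR-IOV interface):
 * enable a given number of VFs on a PF by writing its sriov_numvfs
 * attribute. The PCI address is a placeholder. */
#include <stdio.h>

int main(void)
{
    const char *path =
        "/sys/bus/pci/devices/0000:03:00.0/sriov_numvfs"; /* hypothetical PF */
    FILE *f = fopen(path, "w");
    if (!f) { perror("open sriov_numvfs"); return 1; }
    fprintf(f, "4\n");   /* expose 4 virtual functions */
    fclose(f);
    return 0;
}
```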

2.3.3 Intel Knights Landing (KNL) Architecture

Intel Knights Landing (KNL) is the successor to the Knights Corner (KNC) many-core architecture. It is a self-booting processor that packs up to six teraflops of computing throughput.


Figure 2.6: SR-IOV Communication Mechanism

As shown in Figure 2.7, KNL comes equipped with 68-72 cores located on 34-36 active tiles. Each tile has a single 1-megabyte L2 cache that is shared between its two cores, and each core further supports four threads via hyperthreading. A 2D-mesh interconnect is used for on-die communication by the cores, memory and I/O controllers, and other agents.

Figure 2.7: Intel KNL Overview [97]

2.3.3.1 Intel KNL Memory Modes

KNL comprises six DDR4 channels and eight Multi-Channel DRAM (MCDRAM) channels. The MCDRAM memory can yield an aggregate bandwidth of 450 GB/s, in contrast with DDR4 memory, which can yield 90 GB/s; hence it is aptly referred to as High Bandwidth Memory (HBM). The processor’s memory mode determines whether the fast HBM operates as RAM, as direct-mapped L3 cache, or as a mixture of the two. From a software perspective, the two common ways of directing allocations to the HBM are through the Linux NUMA utilities or through the memkind library [66]. Through these utilities, programs can directly perform memory allocations on either the DRAM or the HBM in the flat and hybrid modes of operation.

Cache Mode This mode is shown in Figure 2.8(b). The fast HBM is configured as an L3 cache that transparently caches data from main memory. In this mode, the user has access to 96GB of RAM, all of it traditional DDR4.

Flat Mode This mode is shown in Figure 2.8(a). DDR4 and HBM act as two distinct Non-Uniform Memory Access (NUMA) nodes. Therefore, it is possible to specify the type of memory (DDR4 or HBM) when allocating memory. In this mode, the user has access to 112GB of RAM: 96GB of traditional DDR4 and 16GB of fast HBM. By default, memory allocations occur in DDR4.

Hybrid Mode In this mode, the MCDRAM is configured so that a portion acts as L3 cache and the rest as RAM (a second NUMA node supplementing DDR4).
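As a concrete illustration of the flat- and hybrid-mode allocation interfaces mentioned above, the sketch below uses the memkind hbwmalloc API. It assumes the memkind library is installed and that MCDRAM is exposed as a separate NUMA node; alternatively, the whole process could be bound to the HBM node with numactl.

```c
/* Minimal sketch (assumes KNL in flat or hybrid mode and the memkind
 * library): allocating a buffer from MCDRAM (HBM) versus DDR4.
 * Compile with -lmemkind. */
#include <hbwmalloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    size_t n = 1 << 20;

    /* A regular allocation lands on DDR4 by default in flat mode. */
    double *ddr = malloc(n * sizeof(double));

    /* hbw_malloc() asks memkind for MCDRAM-backed memory. */
    if (hbw_check_available() != 0) {
        fprintf(stderr, "no HBM nodes visible (cache mode or non-KNL?)\n");
        free(ddr);
        return 1;
    }
    double *hbm = hbw_malloc(n * sizeof(double));

    for (size_t i = 0; i < n; i++) { ddr[i] = 1.0; hbm[i] = 2.0; }
    printf("DDR and HBM buffers of %zu doubles initialized\n", n);

    hbw_free(hbm);
    free(ddr);
    return 0;
}
```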

2.3.3.2 Intel KNL Cluster Modes

The details of the KNL coherence mechanism are proprietary, but the key idea is that each tile tracks an assigned range of memory addresses. It does so on behalf of all cores on the chip, maintaining a data structure (a tag directory) that records which cores are using data from its assigned addresses.

(a) KNL Flat Mode; (b) KNL Cache Mode

Figure 2.8: KNL Memory Modes [100]

Coherence requires both tile-to-tile and tile-to-memory communication. Cores that read or modify data must communicate with the tiles that manage the memory associated with that data. Similarly, when cores need data from main memory, the tile(s) that manage the associated addresses communicate with the memory controllers on behalf of those cores. The KNL organizes this traffic according to its cluster mode. Each cluster mode, specified in the BIOS as a boot-time option, represents a tradeoff between simplicity and control. There are three major cluster modes, with a few minor variations:

All-to-All Mode All-to-all is the most flexible and most general mode, intended to work on all possible hardware and memory configurations of the KNL. But this mode also may have higher latencies than other cluster modes because the processor does not attempt to optimize coherency-related communication paths.

Quadrant Mode This mode attempts to localize communication without requiring explicit memory management by the user. It achieves this by grouping tiles into four logical/virtual (not physical) quadrants, then requiring each tile to manage HBM addresses only in its own quadrant (and DDR addresses in its own half of the chip). This reduces the average number of “hops” that tile-to-memory requests require compared to All-to-All mode, which can reduce latency and congestion on the mesh.

Sub-NUMA Clustering Mode This mode, abbreviated SNC, divides the chip into two or four NUMA nodes so that it acts like a two- or four-socket processor. SNC aims to optimize coherency-related on-chip communication by confining this communication to a single NUMA node whenever possible. This requires explicit manual memory management by the programmer/user (in particular, allocating memory within the NUMA node that will use that memory) to achieve any performance benefit.
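To make this memory-management requirement concrete, the following sketch (illustrative only; the chosen node number and buffer size are assumptions) uses libnuma to place an allocation on one sub-NUMA node:

```c
/* Minimal sketch (assumes SNC mode and libnuma): bind an allocation to a
 * specific sub-NUMA node so coherence traffic stays within it.
 * Compile with -lnuma. */
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma not supported on this system\n");
        return 1;
    }
    size_t bytes = 64UL << 20;                   /* 64 MB */
    int node = 1;                                /* one SNC cluster (example) */

    void *buf = numa_alloc_onnode(bytes, node);  /* pages placed on 'node' */
    if (!buf) { perror("numa_alloc_onnode"); return 1; }

    printf("allocated %zu bytes on NUMA node %d of %d\n",
           bytes, node, numa_max_node() + 1);

    numa_free(buf, bytes);
    return 0;
}
```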

2.3.4 Intel Omni-Path Architecture (OPA)

The Intel OPA is designed to enable a broad class of computations requiring scalable, tightly coupled CPU, memory, and storage resources. Integration between devices in the Intel OPA family and Intel CPUs enables improvements in system-level packaging and network efficiency. When coupled with the new user-focused open-standard APIs developed by the OpenFabrics Alliance (OFA) OpenFabrics Interfaces (OFI) initiative, host fabric interfaces (HFIs) and switches in the Intel OPA family are optimized to provide low latency, high bandwidth, and high message rates. Intel OPA provides important innovations to enable a multigeneration, scalable fabric, including link-layer reliability, extended fabric addressing, and optimizations for high-core-count CPUs. Datacenter needs are also a core focus for Intel OPA; these include link-level traffic flow optimization to minimize datacenter jitter for high-priority packets, robust partitioning support, quality-of-service support, and a centralized fabric management system [12].

2.3.5 Accelerator

Graphics Processing Units (GPUs) have drawn significant attention for use in HPC applications because of the potential performance improvement via the massive parallelism they provide in a small package. The current generation of GPUs from NVIDIA are connected to a host server system as peripheral devices on the Peripheral Component Interconnect Express (PCIe) interface. GPUDirect technology [7] is a set of features NVIDIA provides that enables efficient communication among GPUs and between GPUs and other devices, and it significantly enhances communication performance on GPU clusters. GPUDirect provides third-party PCIe devices with direct access to GPU memory through Remote Direct Memory Access (RDMA). This feature is called GPUDirect RDMA (GDR) and is currently supported with Mellanox InfiniBand host channel adapters (HCAs). It provides a path for moving data to/from GPU device memory over an InfiniBand network that completely bypasses the host CPU and its memory, which reduces PCIe and CPU resource consumption and may be faster.


Figure 2.9: Slurm Architecture

2.3.6 Slurm and SPANK

Simple Linux Utility for Resource Management (Slurm) [8] is an open-source resource manager for large-scale Linux-based clusters. Slurm can provide users with exclusive and/or shared access to cluster resources. As shown in Figure 2.9, Slurm provides a framework including controller daemons (slurmctld), a database daemon (slurmdbd), compute node daemons (slurmd), and a set of user commands (e.g., srun, scontrol, squeue) to start, execute, and monitor jobs on a set of allocated nodes and to manage a queue of pending jobs. The Slurm Plug-in Architecture for Node and job (K)control (SPANK) [6] provides a generic interface that can be used to dynamically modify the job launch code. SPANK plugins have the ability to add user options to srun. They can be built without access to the Slurm source code and are automatically loaded at the next job launch. Thus, SPANK provides a low-cost and low-effort mechanism to change the runtime behavior of Slurm.
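For illustration, a minimal SPANK plugin skeleton is sketched below; the plugin and option names are hypothetical and the code is not part of Slurm itself. It registers an srun option and logs a message at task initialization:

```c
/* Minimal sketch of a SPANK plugin; build against <slurm/spank.h> and
 * register it in plugstack.conf. Names are illustrative. */
#include <slurm/spank.h>

SPANK_PLUGIN(vm_launch_demo, 1);       /* hypothetical plugin name, version */

static int use_vm = 0;

/* Callback invoked when the user passes --with-vm to srun. */
static int opt_cb(int val, const char *optarg, int remote)
{
    (void) val; (void) optarg; (void) remote;
    use_vm = 1;
    return 0;
}

/* Options exported to srun; the table must be named spank_options. */
struct spank_option spank_options[] = {
    { "with-vm", NULL, "Launch this job inside VMs (demo option)",
      0, 0, opt_cb },
    SPANK_OPTIONS_TABLE_END
};

int slurm_spank_init(spank_t sp, int ac, char **av)
{
    (void) sp; (void) ac; (void) av;
    return ESPANK_SUCCESS;
}

int slurm_spank_task_init(spank_t sp, int ac, char **av)
{
    (void) ac; (void) av;
    int taskid = 0;
    spank_get_item(sp, S_TASK_ID, &taskid);
    if (use_vm)
        slurm_info("vm_launch_demo: task %d requested VM launch", taskid);
    return ESPANK_SUCCESS;
}
```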

2.3.7 Programming Models

Over the last two decades, the Message Passing Interface (MPI) has become the de-facto standard for developing scientific applications in the HPC domain. Portability and the availability of high-performance implementations on most modern architectures have been key factors in the wide acceptance of MPI. MPI offers communication with different kinds of semantics: point-to-point, collective, and one-sided. A communication end-point (usually a process) is referred to using a rank in MPI. Point-to-point operations (Send/Recv) are used to move data between two ranks, while collective operations are used to exchange data among a group of processes. In these operations, each rank provides information about its local source/destination buffers. Point-to-point and collective communication are very commonly used in HPC applications.
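The following minimal example illustrates the two most common MPI communication styles mentioned above, a point-to-point Send/Recv between two ranks and an Allreduce collective across all ranks:

```c
/* Minimal illustrative MPI example: point-to-point and collective. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Point-to-point: rank 0 sends a value to rank 1. */
    int msg = 42;
    if (rank == 0 && size > 1)
        MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* Collective: every rank contributes its rank; all get the sum. */
    int sum = 0;
    MPI_Allreduce(&rank, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum of ranks across %d processes = %d\n", size, sum);

    MPI_Finalize();
    return 0;
}
```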

The MVAPICH2 MPI Library: MVAPICH2 [80] is an open-source implementation of the MPI-3 specification over modern high-speed networks such as InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE. MVAPICH2 delivers the best performance, scalability, and fault tolerance for high-end computing systems and servers using InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE networking technologies. This software is being used by more than 2,900 organizations world-wide in 86 countries, and more than 475,000 downloads have taken place from the project’s site. According to the latest Top500 list, released in June 2018, it is powering many of the top supercomputing centers in the world, including the 2nd-ranked Sunway TaihuLight, the 12th-ranked Oakforest-PACS, the 15th-ranked Stampede2, and the 24th-ranked Pleiades.

MPI libraries typically use the eager protocol for small messages and the rendezvous protocol for large-message communication operations. MVAPICH2 uses an RDMA-based eager protocol called RDMA-Fast-Path, along with various optimizations, to improve the latency of small-message point-to-point communication operations. For large messages, MVAPICH2 uses zero-copy designs based on RDMA-Write or RDMA-Read operations to achieve excellent communication bandwidth. Further, MVAPICH2 offers good scalability through advanced designs such as eXtended RC (XRC), Shared Receive Queues (SRQ), and Hybrid (UD/RC) communication modes. MVAPICH2 also provides optimized collective communication using shared-memory-based designs, and it employs different collective algorithms based on the message and job sizes.

CUDA-Aware MPI: Modern high-performance computing systems are typically equipped with accelerators such as NVIDIA GPUs. This has led to the extension of MPI runtimes to support efficient communication between GPUs. Before the introduction of the Unified Virtual Addressing (UVA) and GPUDirect features, MPI application developers were forced to explicitly copy data between CPU and GPU buffers. Through GDR technology, several MPI implementations including OpenMPI, CrayMPI, and MVAPICH2 provide CUDA-Aware MPI primitives for performing point-to-point, one-sided, and collective operations. This feature enables MPI applications to communicate directly from GPU buffers without any explicit copies. The CUDA-Aware MPI features provide high-performance and high-productivity programming for application developers.
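As an illustration of this programming style, the sketch below (assuming a CUDA-aware MPI build, for example MVAPICH2-GDR) passes a device buffer directly to MPI calls; the buffer size is arbitrary:

```c
/* Minimal sketch of CUDA-aware MPI: the GPU pointer is handed directly
 * to MPI_Send/MPI_Recv; with GPUDirect RDMA the data can move between
 * the HCA and the GPU without staging through host memory. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;
    float *d_buf;
    cudaMalloc((void **)&d_buf, n * sizeof(float));   /* device memory */
    cudaMemset(d_buf, 0, n * sizeof(float));

    if (rank == 0)
        MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```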

Chapter 3: Designing VM-aware MPI Communication with SR-IOV and IVShmem

3.1 Understanding Performance of IVShmem

SR-IOV can attain near-native performance for inter-node point-to-point communication at the MPI level. However, one of its main drawbacks is that it does not support VM locality-aware communication. Thus, inter-VM communications within a node also have to go through the SR-IOV channel, leading to performance overheads. On the other hand, for intra-host VM communication, IVShmem offers shared-memory-backed communication for VMs within a single host. Consequently, we carry out a primitive-level experiment using Perftest-1.2.3 [2] to understand the performance of IVShmem, as shown in Figure 3.1. The experiment compares the primitive-level latencies of SR-IOV based IB communication and IVShmem based communication and presents the performance overheads. For a 64-byte message size, the latencies observed are 0.96 µs and 0.20 µs for SR-IOV (IB-Send) and IVShmem, respectively. These results clearly indicate that the IVShmem scheme can benefit MPI communication within a node on SR-IOV enabled InfiniBand clusters.


Figure 3.1: Primitive-Level Latency Comparison between SR-IOV IB and IVShmem

3.2 VM-aware MPI Communication with SR-IOV and IVShmem

Given that IVShmem can bring clear performance benefits, as observed in Section 3.1, it is critical to make full use of SR-IOV and IVShmem within the MPI runtime to deliver optimal performance. In this section, we first present an overview of our proposed design for a high-performance MPI library. Then we discuss and analyze different locality detection approaches and describe our design of the locality detector. Next, we present the design details of the communication coordinator. In Section 3.2.4 and Section 3.2.5, we discuss the communication optimization for the IVShmem and SR-IOV channels, respectively.

3.2.1 Design Overview

Our design is based on MVAPICH2, an open-source MPI library over InfiniBand. For portability reasons, it follows a layered approach, as shown in Figure 3.2(a). The Abstract Device Interface V3 (ADI3) layer implements all MPI-level primitives. Multiple communication channels provide basic message-delivery functionality on top of communication device APIs. There are two types of communication channels available in MVAPICH2: a shared memory channel that communicates over user-space shared memory with peers on the same host, and a network channel that communicates over InfiniBand user-level APIs with other peers.

Without any modification, the default MVAPICH2 can run in a virtualization environment. However, VMs running on the same host cannot use the shared memory channel (SMP) for communication, which can lead to severe performance limitations. In our proposed high-performance MPI library, as shown in Figure 3.2(b), we add two components, a ‘Communication Coordinator’ and a ‘Locality Detector’, between the ADI3 layer and the channel layer. In the channel layer, we integrate an IVShmem channel into the library alongside the SR-IOV channel. The Communication Coordinator is responsible for selecting the communication channel in the lower channel layer, while the Locality Detector maintains the information about local VMs on the same host. The Communication Coordinator decides which channel to use by consulting the Locality Detector to identify whether the communicating VMs are co-resident on the same host. If they are co-resident on a given host, the Communication Coordinator selects the IVShmem channel for communication between these co-located VMs; otherwise, the communication goes through the SR-IOV channel. The Locality Detector further identifies whether there are multiple processes running in the same VM; in that case, the Communication Coordinator selects the default SMP channel inside the VM (not the host) for communication between those processes. Since this chapter mainly focuses on communication optimization for co-resident VMs, and the default SMP channel in a VM is similar to the one in the host, we do not discuss this channel in detail.

(a) Native Environment; (b) Virtualization Environment

Figure 3.2: MVAPICH2 Stack Running in Native and Virtualization Environments

3.2.2 Locality Detector

Given the functionality provided by IVShmem, described in Section 2.2.3, how to identify co-resident VMs among all VMs becomes a critical problem. Basically, there are two locality identification alternatives we can evaluate.

The first one is a static method, which is mainly used when the information about co-resident VMs is preconfigured by the administrator, and it is assumed that the membership of co-resident VMs does not change during subsequent communication. Thus, the VM locality information is already available when launching the MPI jobs. The advantage of this approach is that the processes can be directly re-mapped in the VM layer based on this information, with little overhead. The problem is that, without intervention from the administrator, the static information cannot be dynamically updated.

The other one is dynamic detection; that is, MPI jobs dynamically detect the VMs running on the same host. According to who initiates the process, there are two ways to implement it. Since the privileged domain plays a central role in a virtualization environment, we can use it to periodically gather VM information on the same host: VM peers advertise their membership information, such as presence and absence, to all other VMs running on the same host. This approach is asynchronous and needs centralized management from the privileged domain. However, the period between two gather operations needs to be configured properly. If the period is longer than needed, the co-residency information may not be accurate in time; if it is too short, it might lead to unnecessary probing and thus waste CPU cycles. The second approach works in synchronous mode. When a VM takes a significant action, it notifies the related VMs to update the co-residency information. Thus, the updates are immediate upon the occurrence of the corresponding events. In comparison, the first approach periodically collects the status from co-resident VMs and thus introduces delayed updates, wasted CPU cycles, and potential inconsistency, while for the second approach, it is possible that the co-residency information of multiple VMs changes concurrently [111].

To take advantage of the dynamic detection approach, we propose a locality detector component. Based on IVShmem support, we create a VM list structure in the shared memory region of each host, and each process writes its own membership information into this shared VM list structure according to its global rank. For example, consider launching an 8-process MPI job, one process per VM. Let ranks 0, 1, 4, and 5 run on the same host (e.g., host1), as shown in Figure 3.3, and the other 4 ranks run on another host (e.g., host2). Then the four VMs (ranks 0, 1, 4, and 5) write their membership information into positions 0, 1, 4, and 5 of the VM list on host1, correspondingly. Other positions are left blank. Similarly, the other four VMs write at positions 2, 3, 6, and 7 of the VM list on host2. In this case, the number of local processes on host1 can be acquired by checking and counting whether the membership information has been written or not. Similarly, their local ordering is maintained by their positions in the VM list. Therefore, membership information written into the same VM list indicates that the corresponding processes are co-resident.

Since a byte is the smallest granularity of memory access that requires no lock, in our proposed design the VM list is built from multiple bytes, with one byte used to tag each VM. This guarantees that multiple VMs on the same host are able to write membership information into their corresponding positions concurrently without introducing lock/unlock operations, which reduces the overhead of the locality detection procedure. Moreover, the proposed approach does not introduce much overhead for traversing the VM list. For an MPI job with one million processes, for instance, the whole VM list occupies only 1 megabyte of memory. Therefore, it brings good scalability in virtualized MPI environments.
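The following sketch illustrates the byte-per-rank VM list idea described above. It is an illustration, not the MVAPICH2 implementation; the shared file name, the list size, and the command-line rank argument are assumptions, and a file under /dev/shm stands in for the IVShmem-backed region.

```c
/* Minimal sketch of the byte-per-rank VM list used for locality detection. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAX_RANKS 4096                 /* one byte per global MPI rank */

int main(int argc, char **argv)
{
    int my_rank = (argc > 1) ? atoi(argv[1]) : 0;

    /* The real design maps the IVShmem PCI region; here a file under
     * /dev/shm stands in for that shared region. */
    int fd = open("/dev/shm/vm_list", O_CREAT | O_RDWR, 0666);
    if (fd < 0) { perror("open"); return 1; }
    if (ftruncate(fd, MAX_RANKS) != 0) { perror("ftruncate"); return 1; }

    unsigned char *list = mmap(NULL, MAX_RANKS, PROT_READ | PROT_WRITE,
                               MAP_SHARED, fd, 0);
    if (list == MAP_FAILED) { perror("mmap"); return 1; }

    list[my_rank] = 1;   /* single-byte write: no lock/unlock needed */

    /* Any rank whose byte is set shares this host with us. */
    int local = 0;
    for (int r = 0; r < MAX_RANKS; r++)
        if (list[r]) local++;
    printf("rank %d: %d co-resident process(es) detected\n", my_rank, local);

    munmap(list, MAX_RANKS);
    close(fd);
    return 0;
}
```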


Figure 3.3: Virtual Machine Locality Detection

3.2.3 Communication Coordinator

The default MVAPICH2 stack, as shown in Figure 3.2(a), can also be deployed in a virtualization environment, but processes running in different VMs cannot communicate through the shared memory channel, even though they are co-located on the same host. However, with the help of the Locality Detector and the VM lists created and maintained in the shared memory region, the co-resident VMs can be dynamically identified.

Another key component in our proposed design is the Communication Coordinator, as shown in Figure 3.4. It is responsible for capturing the communication channel requests coming from the upper layer and carrying out the channel selection by checking the membership information provided by the Locality Detector. If the communicating processes are co-resident, the Communication Coordinator schedules them to communicate through the IVShmem channel; otherwise, they go through the SR-IOV channel. For example, we can see in Figure 3.4 that Guest 1 and Guest 2 are co-located on the same host, with MPI process rank 1 and rank 4 running on Guest 1 and Guest 2, respectively. They can access the same VM list located in the IVShmem region by mapping the IVShmem region into their own user space. By checking the flag at position 4 of the VM list, the Communication Coordinator finds that the flag has been set, which means process rank 4 is on the same host. Thus, the Communication Coordinator schedules the communication between rank 1 and rank 4 to go through the IVShmem channel, as shown by the solid line. If the Communication Coordinator finds that a flag is not set (e.g., position 6), then it coordinates the communication between rank 1 and rank 6 to go through the SR-IOV channel, as shown by the dashed line. The same happens on the Guest 2 side: rank 4 is scheduled to communicate with rank 6 using the SR-IOV channel.
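The channel-selection policy just described can be summarized by the following sketch; the names are illustrative and do not correspond to MVAPICH2 internals, and the VM list contents mirror the example from the text (ranks 0, 1, 4, and 5 on the local host).

```c
/* Minimal sketch of the Communication Coordinator's channel selection. */
#include <stdio.h>

enum channel { CHANNEL_SMP, CHANNEL_IVSHMEM, CHANNEL_SRIOV };

/* vm_list: byte-per-rank membership array mapped from the IVShmem region. */
static enum channel select_channel(const unsigned char *vm_list,
                                   int my_rank, int peer_rank, int same_vm)
{
    if (same_vm)
        return CHANNEL_SMP;          /* processes inside the same VM     */
    if (vm_list[my_rank] && vm_list[peer_rank])
        return CHANNEL_IVSHMEM;      /* co-resident VMs on the same host */
    return CHANNEL_SRIOV;            /* remote host: go through the VF   */
}

int main(void)
{
    /* Example from the text: ranks 0, 1, 4, 5 are on this host. */
    unsigned char vm_list[8] = { 1, 1, 0, 0, 1, 1, 0, 0 };

    printf("rank 1 -> rank 4: %s\n",
           select_channel(vm_list, 1, 4, 0) == CHANNEL_IVSHMEM
               ? "IVShmem channel" : "SR-IOV channel");
    printf("rank 1 -> rank 6: %s\n",
           select_channel(vm_list, 1, 6, 0) == CHANNEL_IVSHMEM
               ? "IVShmem channel" : "SR-IOV channel");
    return 0;
}
```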

3.2.4 Optimizing Communication for IVShmem Channel

When the IVShmem channel is selected by the Communication Coordinator, the default environment settings, which are optimized for the native environment, may not benefit MPI communication to the greatest extent. Therefore, we need to further optimize the IVShmem channel in order to achieve high-performance message passing for intra-host inter-VM communication. There are four related parameters that need to be optimized: SMP_EAGER_SIZE, SMP_SEND_BUF_SIZE, SMPI_LENGTH_QUEUE, and SMP_NUM_SEND_BUFFER.


Figure 3.4: Communication Coordinator

(a) Impact of Eager Message Size; (b) Impact of Send Buffer Size; (c) Impact of Length Queue; (d) Impact of Number of Send Buffers

Figure 3.5: Communication Optimization for IVShmem Channel

SMP_EAGER_SIZE defines the switch point between the Eager protocol and the Rendezvous protocol. SMPI_LENGTH_QUEUE is the size of the shared memory buffer used to store outstanding small and control messages. Messages larger than SMP_EAGER_SIZE are packetized and sent out in a pipelined manner; SMP_SEND_BUF_SIZE is the packet size and SMP_NUM_SEND_BUFFER is the number of send buffers. Figure 3.5 shows the optimization results. Here we only show the bandwidth optimization results because there is no clear difference in terms of latency or buffer space (memory footprint) constraints. As we can see in Figure 3.5(a), the optimal bandwidth performance, more than 9.3 GB/s, is delivered when SMP_EAGER_SIZE is set to 32k. Even though 64k also delivers similar bandwidth performance, we select 32k in order to reduce the memory footprint. For large-message transfers, it can be observed in Figure 3.5(b) that, based on the optimized SMP_EAGER_SIZE value, the bandwidth can reach 9.6 GB/s when SMP_SEND_BUF_SIZE is set to 16k. Similarly, SMPI_LENGTH_QUEUE and SMP_NUM_SEND_BUFFER are set to 128k and 16, respectively.

(a) Impact of Eager Threshold on Latency; (b) Impact of Eager Threshold on Message Rate

Figure 3.6: Communication Optimization for SR-IOV Channel

3.2.5 Optimizing Communication for SR-IOV Channel

For the optimization of the SR-IOV channel, we need to consider an important parameter, MV2_IBA_EAGER_THRESHOLD, which specifies the switch point between the eager and rendezvous protocols. If the threshold is too small, it could incur the additional overhead of RTS/CTS exchange during rendezvous transfers between sender and receiver for many message sizes. If it is too large, it requires a larger amount of memory space for the library. Therefore, we need to tune this channel to find the optimal threshold for inter-host inter-VM communication. We measure the performance by setting MV2_IBA_EAGER_THRESHOLD to different values from 13k to 20k. In Figure 3.6, only some representative values are shown to make the contrast clearer. We can see in Figure 3.6 that the optimal performance in terms of latency and message rate is delivered when MV2_IBA_EAGER_THRESHOLD is set to 17k.

3.3 Performance Evaluation

In this section, we describe our experimental testbed and discuss the evaluation results along different dimensions, based on the optimization results presented in Section 3.2.4 and Section 3.2.5. We evaluate the performance of our proposed design on SR-IOV enabled InfiniBand clusters along four dimensions: point-to-point communication, collective operations, different InfiniBand transport protocols (RC and UD), and representative HPC applications.

Our testbed is an InfiniBand cluster consisting of four physical nodes, where each node has dual 8-core 2.6 GHz Intel Xeon E5-2670 (Sandy Bridge) processors with 20 MB shared L3 cache and 32 GB main memory, and is equipped with Mellanox ConnectX-3 FDR (56 Gbps) HCAs with PCI Express Gen3 interfaces. We use RedHat Enterprise Linux Server release 6.4 (Santiago) with kernel 2.6.32-279.19.1.el6.x86_64 as the host and VM OS. In addition, we use the Mellanox OpenFabrics Enterprise Distribution MLNX_OFED_LINUX-2.1-1.0.0 to provide the InfiniBand interface with SR-IOV support and use KVM as the Virtual Machine Monitor (VMM). Each VM is pinned to a single core and has 1.5 GB main memory. All applications and libraries used in this study are compiled with the gcc 4.4.6 compiler.

All experiments are conducted by comparing our proposed design with MVAPICH2-2.0. We choose OSU Micro-Benchmarks (OMB) 4.3 for the evaluations. Over all four physical nodes, we allocate 8 VMs per node to conduct the experiments for collectives, different transport protocols, and applications. For the point-to-point experiments, we select two of these VMs.

3.3.1 Point-to-Point Communication Performance

In this section, we evaluate MPI point-to-point communication performance for inter-VM communication in terms of latency and bandwidth.

Figure 3.7(a) and Figure 3.7(b) show the point-to-point performance for intra-host inter-VM communication. From these two figures, we can observe that, compared to SR-IOV, our proposed design significantly improves the point-to-point performance by up to 84% and 158% for latency and bandwidth, respectively. If we compare the performance of our design with that of native MPI, we can see that our design has only 3%-8% overhead, which is much smaller than the overhead of SR-IOV. For example, at a 1 KB message size, the MPI point-to-point latency of SR-IOV is around 2.36 µs, while the latencies of our design and the native mode are 0.52 µs and 0.5 µs, respectively. In this case, our design shows only about 4% overhead. Through this comparison, we can clearly see the performance benefits of incorporating locality-aware communication into the MPI library in virtualized environments.

For inter-host inter-VM point-to-point communication, as shown in Figure 3.7(c) and Figure 3.7(d), our proposed design has performance similar to SR-IOV in terms of latency and bandwidth. This is because the Communication Coordinator in the proposed design selects the SR-IOV channel for inter-host inter-VM data movement. These results also show that the newly introduced components in our proposed design do not cause extra overhead. If we compare the performance with native MPI, we can see that the overheads of both our proposed design and SR-IOV are very small. For example, the bandwidth of native MPI is about 6.3 GB/s at a message size of 256 KB, while both SR-IOV and our design achieve 6.2 GB/s. From the above discussion, we can see that our proposed design achieves near-native performance for both intra-host inter-VM and inter-host inter-VM communication. This is because our design fully exploits the benefits of locality-aware communication for intra-host inter-VM data movement, while maintaining performance behavior similar to the SR-IOV channel for inter-host inter-VM communication.

3.3.2 Collective Communication Performance

We select four widely used collective communication operations in our evaluations: Broadcast, Allgather, Allreduce, and Alltoall. As shown in Figures 3.8(a)-3.8(d), we can clearly observe that, compared with SR-IOV, the proposed design effectively cuts down the latency of each collective operation across 32 VMs.

(a) Intra-host Inter-VM Latency; (b) Intra-host Inter-VM Bandwidth; (c) Inter-host Inter-VM Latency; (d) Inter-host Inter-VM Bandwidth

Figure 3.7: Point-to-point Performance

We show the Broadcast performance in Figure 3.8(a). The latency of the SR-IOV scheme is 6.73 µs at a 4-byte message size, while it is 4.44 µs for the proposed design, a 34% improvement. The performance benefit comes from locality-aware communication in our proposed design instead of IB loopback in SR-IOV. The Allgather performance is shown in Figure 3.8(b); the latencies of the SR-IOV scheme and the proposed design at a 4-byte message size are 15.77 µs and 11.2 µs, respectively, so the proposed design improves the performance by 29%. Figure 3.8(c) shows the latency of the Allreduce operation. We can see that at a 4-byte message size the latency values are 17.29 µs and 6.97 µs for the SR-IOV scheme and the proposed design, respectively, a 60% improvement. With respect to the Alltoall operation, as shown in Figure 3.8(d), SR-IOV and the proposed design deliver 32.38 µs and 27.20 µs latencies, respectively; the proposed design reduces the Alltoall latency at a 4-byte message size by 16%. Across different message sizes, the proposed design improves the latency of the above four collective operations (Broadcast, Allgather, Allreduce, Alltoall) by up to 68%, 76%, 61%, and 29%, respectively.

Based on these experimental results, our proposed design achieves remarkable improvements for MPI collective operations compared to SR-IOV.

3.3.3 Different InfiniBand Transport Protocol (RC & UD)

Section 2.3.1 introduced the different InfiniBand transport protocols. In this section, we evaluate the performance of SR-IOV and the proposed design on these transport protocols. The point-to-point and collective results are shown in Figure 3.9 and Figure 3.10, respectively. Figure 3.9 shows that the RC protocol performs better than UD for the SR-IOV scheme in terms of latency for intra-host inter-VM point-to-point communication. The latency difference can be up to 60%.

(a) MPI Bcast; (b) MPI Allgather; (c) MPI Allreduce; (d) MPI Alltoall

Figure 3.8: Collective Communication Performance on 32 VMs (8 VMs per node)

This is because the MVAPICH2 library enables the Fast Path feature by default for the RC protocol, which supports RDMA-based communication, while the UD scheme does not support this feature. Moreover, the reliability support for UD in the MPI library costs some additional overhead. Therefore, we see performance differences between the RC and UD protocols for the SR-IOV scheme. Since our proposed design uses IVShmem based communication instead of SR-IOV within the same node, it does not exhibit this kind of performance difference. Based on the above discussion of point-to-point communication, we can reasonably explain the performance behavior of the following collective operations. For the Broadcast operation in Figure 3.10(a), the proposed design outperforms the SR-IOV scheme for both the RC and UD protocols. In addition, there is a clear performance difference between the RC and UD protocols for the SR-IOV scheme. Compared with RC, the UD protocol increases the broadcast latency by up to 206%, whereas the proposed design delivers similar performance for these two transport protocols. This is also because, in the proposed design, intra-host inter-VM communication goes through the IVShmem channel instead of the SR-IOV channel and is therefore not much affected by the transport protocol. However, as the proportion of inter-host inter-VM communication to the total amount of communication increases, the influence of the transport protocol becomes more pronounced. Similar to Broadcast, the other three collective operations (Alltoall, Allgather, and Allreduce) for the SR-IOV scheme on the UD protocol incur latency increases of up to 173%, 76%, and 118%, respectively, compared with the RC protocol. In contrast, our proposed design still delivers close performance when switching from RC to UD.


Figure 3.9: Intra-host Inter-VM Point-to-Point Performance on RC and UD Protocols

(a) Broadcast; (b) Allgather; (c) Allreduce; (d) Alltoall

Figure 3.10: 32 VMs (8 VMs per node) Collective Performance on RC and UD Protocols

3.3.4 Application Performance

In this section, we evaluate our proposed design and the SR-IOV scheme with two end applications: P3DFFT and the NAS Parallel Benchmarks (NPB).

Figure 3.11 shows the performance comparison of the SR-IOV scheme and the proposed design on the Class B NAS benchmarks on 32 VMs across 4 nodes. Figure 3.11(a) shows that our proposed design improves the performance of NAS by up to 43% over the SR-IOV based scheme. For the IS benchmark, the SR-IOV scheme needs 2.84 s, whereas our proposed design takes only 1.61 s. In Figure 3.11(b), we show the performance of the SR-IOV scheme and our proposed design with P3DFFT. We run all 5 tests with the same input size of 512×512×512. From the results, we can see that our proposed scheme outperforms the SR-IOV scheme in all cases. The improvements for INVERSE, RAND, SINE, and SPEC are 29%, 33%, 29%, and 20%, respectively. The performance benefits for P3DFFT come from shared memory collective operations that are not available in the SR-IOV scheme.

(a) NAS Performance; (b) P3DFFT Performance

Figure 3.11: Application Performance

3.4 Related Work

In general, I/O virtualization schemes can be classified into software-based and hardware-based schemes. Earlier studies such as [10, 67] have presented network performance evaluations of software-based approaches in Xen. Studies [17, 37, 62] have shown that SR-IOV demonstrates significantly better performance than software-based solutions for 10GigE networks. Liu et al. [62] provide a detailed performance evaluation of SR-IOV capable 10GigE Ethernet in KVM and study several important factors that affect network performance in both virtualized and native environments. Studies [33, 35, 36, 63] with Xen demonstrate the ability to achieve near-native performance in VM-based environments for HPC applications. In addition, the work in [64] first presents the Nahanni framework and introduces it in detail. Based on it, the MPI-Nahanni user-level library was developed, which ports the MPICH2 library from the Nemesis channel, which uses memory-mapped shared memory, to Nahanni in order to accelerate inter-VM communication on the same host. Simon et al. [82, 83] discuss how dynamically changing topologies and locality-awareness affect point-to-point and collective communication at runtime in virtualized environments where the MPI processes are encapsulated within virtual machines. Moreover, Yang et al. [104, 109, 112, 113, 122, 123] propose multiple designs for topology-aware and power-aware scheduling on HPC systems, which reveal potential applicability in complex cloud environments.

In our earlier studies, we proposed designs to improve intra-node point-to-point communication operations using an Inter-VM Communication Library (IVC) and re-designed the MVAPICH2 library to leverage the features offered by the IVC [33]. However, that solution is based on the Xen platform and does not cover SR-IOV enabled InfiniBand clusters. Our early evaluation of using SR-IOV with InfiniBand [49] shows that while SR-IOV enables low-latency communication, MPI libraries need to be redesigned carefully in order to provide advanced features that improve intra-node inter-VM communication. Within a single node, our recent evaluation [45] reveals that the performance of intra-node inter-VM communication can be dramatically improved through IVShmem, compared to the SR-IOV scheme on virtualized InfiniBand clusters. This exhibits significant potential for further optimizing MPI communication across nodes.

Based on our previous evaluations, we propose a new high-performance MPI library design with the support of KVM and IVShmem [64], which dynamically detects VM locality information and coordinates communication between the IVShmem and SR-IOV channels to offer effective locality-aware communication on SR-IOV enabled InfiniBand clusters. The evaluation shows promising results for our proposed design with regard to point-to-point and collective benchmarks and end applications.

3.5 Summary

In this chapter, we analyze multiple VM locality detection approaches and propose a high-performance MPI library design over SR-IOV enabled InfiniBand clusters, which can dynamically detect co-located VMs and coordinate communication between the SR-IOV and IVShmem channels. The proposed design efficiently supports locality-aware communication across VMs. We further analyze and optimize MPI library level core mechanisms and design parameters in both the SR-IOV and IVShmem channels for virtualized environments. Based on our new design, we conduct comprehensive performance evaluations using point-to-point and collective benchmarks and representative HPC applications.

Our performance evaluations show that, compared to SR-IOV, our proposed design can significantly improve the performance of intra-host inter-VM communication by up to 84% and 158% for latency and bandwidth, respectively, while introducing only 3%-8% overhead compared with the native mode. The evaluation also shows that our proposed design effectively integrates the SR-IOV channel for inter-host inter-VM communication. For collective operations, the proposed design achieves up to 68%, 76%, 61%, and 29% performance improvements for Broadcast, Allgather, Allreduce, and Alltoall, respectively, compared to SR-IOV. In addition, the evaluations of different InfiniBand transport protocols (RC and UD) indicate that SR-IOV incurs performance degradation when switching the protocol from RC to UD in the MPI library, whereas, based on locality-aware communication, our proposed design delivers similar performance for the two protocols. Finally, compared to SR-IOV, our design improves the performance of NAS and P3DFFT by up to 43% and 33%, respectively.

Chapter 4: Designing SR-IOV Enabled VM Migration Framework

As discussed in Section 1, although several initial prototypes have been proposed to support VM migration with SR-IOV devices, they need to modify the hypervisor and/or the network adapter driver. These modifications raise security concerns and bind the HPC systems to a particular version of the hypervisor or adapter driver and to a specific device vendor. Consequently, they are very hard to adopt in HPC environments. On the other hand, MPI has become the de-facto programming model for HPC applications. Such constraints motivate us to design a hypervisor-independent and InfiniBand adapter driver-independent approach for VM migration over SR-IOV enabled InfiniBand clusters for MPI applications.

4.1 Hypervisor and InfiniBand Adapter Driver Independent SR-IOV Enabled VM Migration Framework

4.1.1 Design Overview

We propose a framework to support VM migration for MPI applications on SR-IOV enabled IB clusters. The framework, shown in Figure 4.1, consists of two major parts: the SR-IOV enabled IB cluster and the external migration controller. Each virtual IB device (VF) is directly assigned to a VM in passthrough mode so that HPC jobs such as MPI applications can be executed across these VMs with high-speed interconnect support. In addition, an IVShmem region is attached to each VM by exposing itself as a PCI device, as shown in the box next to the VF/SR-IOV device in Figure 4.1.


Figure 4.1: An Overview of the Proposed Migration Framework

The other major part is an external controller, which is responsible for coordinating VM status and executing VM migration related operations. With the help of IVShmem, the external controller is able to communicate with the MPI processes running inside the VMs and implement all the necessary interactions in order to appropriately migrate the VMs while MPI applications are running, using any hypervisor built-in live migration mechanism. The controller contains the following five components: Migration Trigger, Network Suspend Trigger, Ready-to-Migrate (RTM) Detector, Network Reactivate Notifier, and Migration Done Detector. In the rest of this section, we first describe the execution procedure of our proposed VM migration framework for MPI applications. Then we describe the main functionalities of the five components in the controller. To successfully implement VM migration for MPI applications, we discuss the critical issues in handling on-going MPI communication and propose two alternative designs to respond appropriately to the requests from the external controller.


Figure 4.2: Sequence Diagram of Process Migration

4.1.2 VM Migration Procedure

Figure 4.2 depicts the interactions between the MPI processes and the controller in the different phases of a VM migration.

Phase 1: Upon receiving a VM migration request, the external controller utilizes the Suspend Trigger component to set the state ‘Pre-Migration’ on the IVShmem region of each node that the MPI job involves. When an MPI process discovers this state change on the PCI device that the IVShmem region maps to, it suspends the current InfiniBand communication channel and drains all in-transit messages; otherwise, these messages would be lost when the passed-through VF is released. Our design guarantees that there is no message loss, through channel suspension and message draining. Details about discovering the state change are discussed in Section 4.1.3. At the end of Phase 1, all the MPI processes have successfully suspended their communication channels and changed the state to ‘Ready-to-Migrate’.

Phase 2: The controller uses the Ready-to-Migrate Detector component to periodically collect the state changes. Once it discovers the state ‘Ready-to-Migrate’ on all the nodes, the controller enters the next phase to start the migration procedure.

Phase 3(a): Since the VF is assigned to the VM in passthrough mode, it cannot be migrated to the target node. The controller therefore detaches the VF from the VM. Similarly, it also detaches the IVShmem device exposed inside the VM.

Phase 3(b): Now that all the devices preventing migration have been detached from the VM, the controller invokes the hypervisor to execute VM live migration. The VM has been migrated to the target node at the end of this phase.

Phase 3(c): In order to resume the MPI communication that was suspended earlier, the controller re-attaches the VF and the IVShmem device to the VM.

Phase 4: Since the controller has completed the VM migration and the VF/IVShmem re-attachment at the end of Phase 3(c), it utilizes the Network Reactivate Notifier to set the state on all the IVShmem regions to ‘Post-Migration’, which notifies all the MPI processes to resume their InfiniBand communication from the earlier suspended state.

Phase 5: Each MPI process executes the channel reactivation operation in this phase and sets the state to ‘Migration-Done’. After this operation, all the MPI processes are able to resume communication, and the MPI application can proceed on the target node. Details about suspending and reactivating the communication channel are discussed in Section 7.2.3.
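The interaction between the controller and an MPI process over the shared state can be summarized by the following sketch; the state encoding and the helper stubs are assumptions made for illustration and do not reflect the actual implementation described in this chapter.

```c
/* Minimal sketch of how an MPI process could react to the controller's
 * state changes in the IVShmem region (Phases 1 and 4-5 above). */
#include <stdint.h>
#include <stdio.h>

enum mig_state {
    STATE_NORMAL = 0,
    STATE_PRE_MIGRATION,     /* controller asks: suspend the IB channel  */
    STATE_READY_TO_MIGRATE,  /* process replies: channel drained         */
    STATE_POST_MIGRATION,    /* controller asks: VF re-attached, resume  */
    STATE_MIGRATION_DONE     /* process replies: channel rebuilt         */
};

/* Stubs standing in for the real runtime operations (hypothetical names). */
static void suspend_ib_channel(void)    { puts("draining and suspending"); }
static void reactivate_ib_channel(void) { puts("rebuilding connections"); }

/* Called by the MPI runtime when it notices a state change on the mapped
 * IVShmem region. */
static void handle_migration_state(volatile uint32_t *state)
{
    switch (*state) {
    case STATE_PRE_MIGRATION:
        suspend_ib_channel();
        *state = STATE_READY_TO_MIGRATE;
        break;
    case STATE_POST_MIGRATION:
        reactivate_ib_channel();
        *state = STATE_MIGRATION_DONE;
        break;
    default:
        break;
    }
}

int main(void)
{
    /* A local word stands in for the shared IVShmem state in this demo. */
    volatile uint32_t state = STATE_PRE_MIGRATION;
    handle_migration_state(&state);       /* -> READY_TO_MIGRATE          */
    state = STATE_POST_MIGRATION;         /* controller acts in between   */
    handle_migration_state(&state);       /* -> MIGRATION_DONE            */
    printf("final state: %u\n", (unsigned) state);
    return 0;
}
```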

4.1.3 Design of VM Migration Controller

As described in the VM migration procedure in Section 4.1.2, the controller needs to collect state changes from, and issue state changes to, the MPI processes and take different actions in the different phases. Therefore, it is fairly important to select a fast-responding and scalable state exchange channel. In this way, both the controller and the MPI processes can quickly update the current state and respond to state changes accordingly, which speeds up the VM migration procedure for MPI applications on HPC clouds. We propose to use IVShmem as the notification channel between the controller running in the host and the MPI processes running inside the VMs, since IVShmem provides efficient memory-based access between the host and the VMs. It needs only one hop to notify MPI processes about migration progress, which is faster than other solutions [78] such as FTB and PMI networks that adopt multi-hop notification. Please note that IVShmem is merely one choice for sharing information between the VM and the host; other choices are also applicable. Based on the efficient state exchange channel that IVShmem provides, the controller utilizes the following five components to issue and collect the state of VM migration.

Network Suspend Trigger is a key component for a scalable migration framework. In order to build a high-performance and scalable Network Suspend Trigger, this module is implemented on top of the MPI startup channel, which has been optimized at scale [13].

Both the Ready-to-Migrate Detector and the Migration-Done Detector need to collect information from all processes. A naive design would have each process send a message to the controller, which keeps track of the progress; however, this solution is not scalable. Our module is built on top of the MPI reduce routine, which typically has a time complexity of O(log(p)), where p denotes the number of MPI processes [101]. This module periodically collects the states on the involved IVShmem regions and maintains a counter to keep track of them. Once the counter reaches the number of MPI processes, it means that the InfiniBand communication channels have been suspended by all the MPI processes and no message is in flight, so the controller can take over the supervision of the VMs. For the Migration-Done Detector, its counter indicates whether all the MPI processes have resumed communication.

Migration Trigger is responsible for carrying out the VM live migration operation and for detaching/re-attaching the SR-IOV VF and IVShmem devices.

Network Reactivate Notifier notifies the MPI processes to reactivate their communication channels once the VM and its VF/IVShmem devices have been migrated to the target host successfully. All these components use an MPI-based scheme; for instance, the Ready-to-Migrate Detector issues an MPI_Reduce routine with a “sum” operator to collect the states.
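A minimal sketch of this reduction-based counting, under the assumption that each process contributes a 0/1 readiness flag, is shown below:

```c
/* Minimal sketch of the O(log p) readiness collection: every process
 * contributes 1 once its channel is suspended; the root learns when the
 * count equals the job size. */
#include <mpi.h>
#include <stdio.h>

/* Returns 1 on the root once all ranks have reported "ready-to-migrate". */
int all_ready_to_migrate(int my_ready /* 0 or 1 */, MPI_Comm comm)
{
    int size = 0, count = 0;
    MPI_Comm_size(comm, &size);
    MPI_Reduce(&my_ready, &count, 1, MPI_INT, MPI_SUM, 0, comm);
    return count == size;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int ready = 1;                       /* pretend the channel is suspended */
    if (all_ready_to_migrate(ready, MPI_COMM_WORLD) && rank == 0)
        printf("all processes ready: controller may start the migration\n");

    MPI_Finalize();
    return 0;
}
```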

4.1.4 Design of MPI Runtime


Figure 4.3: The Proposed Progress Engine based Design and Migration-thread based Design

As described in Section 4.1.3, the MPI process needs to take different actions according to the state in the IVShmem region. We design three associated components in the MPI runtime. The State Control Manager is responsible for updating and responding to the state on the PCI device. The received state is sent to the Channel Manager component, which invokes either the Suspend Channel operation or the Reactivate Channel operation, according to the state. Similarly, the operation results are sent back to the State Control Manager. The Suspend Channel and Reactivate Channel operations guarantee the global consistency of the MPI program and transparency for MPI applications. To achieve communication channel suspension, all the in-transit messages need to be drained so that no message is lost when the network is released. The protocol needs to guarantee that, with respect to a certain point, all messages before this point have been delivered and no message after this point has been posted to the network. Channel reactivation is achieved by rebuilding the underlying network connections, updating the local communication channel, and then sending control messages to update the other side of the channel [24].

Another critical issue is how to design our MPI runtime to apply the suspend/reactivate channel operations and the state updates. We propose two alternative designs: the Progress Engine based design (PE) and the Migration-thread based design (MT).

In the Progress Engine based design, each MPI call first checks for a migration signal on the IVShmem device. If the state ‘Pre-Migration’ or ‘Post-Migration’ is detected, the progress engine delays the current MPI call and invokes the channel suspension or reactivation operation first. As we can see in Figure 4.3, the time of the second MPI call is increased by the time of the channel suspension, channel reactivation, and VM migration.

We also notice that the channel suspension operation happens during the second MPI call, even though the 'Pre-Migration' state is issued by the controller in the early computation stage. This indicates that the Progress Engine based design does not allow the overlap between

VM migration and computation in the MPI applications.
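The PE behavior can be illustrated with the following self-contained toy program. The state names and helper functions are placeholders for the IVShmem-based protocol, and the real MVAPICH2 progress engine is, of course, far more involved; the point is that a signal raised during computation is only acted upon at the next MPI call.

    /* Toy version of the Progress Engine based (PE) check: every "MPI call"
     * first polls the migration flag and, if needed, suspends/reactivates the
     * channel before doing its own work. */
    #include <stdio.h>

    enum state { NONE, PRE_MIGRATION, POST_MIGRATION };
    static enum state ivshmem_state = NONE;        /* written by the controller */

    static void suspend_channel(void)    { puts("  drain and release IB channel"); }
    static void reactivate_channel(void) { puts("  rebuild IB channel"); }

    static void mpi_call(const char *name)
    {
        /* The progress engine checks the signal only when an MPI call is made,
         * so a signal raised during computation waits until the next call. */
        if (ivshmem_state == PRE_MIGRATION)  { suspend_channel();    ivshmem_state = NONE; }
        if (ivshmem_state == POST_MIGRATION) { reactivate_channel(); ivshmem_state = NONE; }
        printf("%s proceeds\n", name);
    }

    int main(void)
    {
        mpi_call("MPI call 1");
        ivshmem_state = PRE_MIGRATION;   /* controller raises signal during computation */
        puts("computation (signal pending, nothing happens yet)");
        mpi_call("MPI call 2");          /* pays the suspension cost here */
        ivshmem_state = POST_MIGRATION;
        mpi_call("MPI call 3");          /* pays the reactivation cost here */
        return 0;
    }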

After detaching the SR-IOV VF and IVShmem devices, the VM can be live migrated to the target host, which gives us the opportunity to achieve the desired overlapping. We therefore propose our Migration-thread based design. As shown in Figure 4.3, an extra thread is created during MPI initialization. This thread is responsible for executing all VM migration related operations, which frees the main MPI thread P0. If the controller issues the 'Pre-Migration' signal in the computation stage, the migration thread can respond immediately. It will first lock the communication channel, suspend the channel, and drain all the messages. Then the subsequent VM live migration operations are executed.
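A stripped-down, self-contained sketch of the MT idea is shown below: a helper thread waits for the (simulated) 'Pre-Migration' signal, locks the channel, performs the suspend/migrate/reactivate sequence, and unlocks it, while the main thread keeps computing. The names, the polling loop, and the sleep standing in for the live migration are placeholders rather than the dissertation's implementation; production code would also use atomics or condition variables instead of volatile flags.

    /* Toy Migration-thread (MT) sketch; compile with -lpthread. */
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static pthread_mutex_t channel_lock = PTHREAD_MUTEX_INITIALIZER;
    static volatile int pre_migration = 0;     /* set by the controller (simulated) */
    static volatile int done = 0;

    static void *migration_thread(void *arg)
    {
        (void)arg;
        while (!done) {
            if (pre_migration) {
                pthread_mutex_lock(&channel_lock);   /* block MPI calls on the channel */
                puts("suspend channel, drain messages");
                sleep(1);                            /* stands in for VM live migration */
                puts("reactivate channel");
                pthread_mutex_unlock(&channel_lock);
                pre_migration = 0;
            }
            sleep(1);                                /* poll the IVShmem state word */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t tid;
        pthread_create(&tid, NULL, migration_thread, NULL);

        pre_migration = 1;                           /* controller raises the signal */
        for (int i = 0; i < 5; i++) {
            puts("computation step (overlaps with migration)");
            sleep(1);
        }

        pthread_mutex_lock(&channel_lock);           /* an MPI call would block here */
        puts("MPI call proceeds once the channel is unlocked");
        pthread_mutex_unlock(&channel_lock);

        done = 1;
        pthread_join(tid, NULL);
        return 0;
    }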

As we observed in the typical case (Typical Scenario), if the total VM migration time is less than the computation time, then it can be completely overlapped. In the worst case

(Worst Scenario), the Migration-thread based design will fall back to the case of Progress

Engine based design. It might have more overhead, compared to the Progress Engine based design, as the lock/unlock operations we introduced. Our runtime designs are working on the underlying device (IB) channel inside MPI runtime, so both our approaches obey the

MPI standard and work with the MPI THREAD MULTIPLE mode.
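For completeness, the snippet below shows how an application requests and verifies MPI_THREAD_MULTIPLE support at initialization; it is a generic MPI usage example, not part of the proposed runtime.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided;
        /* Request full multi-threading support and check what is granted. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        if (provided < MPI_THREAD_MULTIPLE)
            printf("warning: library only provides thread level %d\n", provided);
        MPI_Finalize();
        return 0;
    }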

4.2 Performance Evaluation

We conduct our experiments on two testbeds. The first testbed is Chameleon Cloud, where each node has 24-core 2.3 GHz Intel Xeon E5-2670v3 (Haswell) processors with 128

GB main memory and equipped with Mellanox ConnectX-3 FDR (56 Gbps) HCAs with

PCI Express Gen3 interfaces. We use CentOS Linux release 7.1.1503 (Core) with kernel

3.10.0-229.el7.x86_64 as the host OS and MLNX_OFED_LINUX-3.0-1.0.1 as the HCA driver. The second testbed is the Nowlab InfiniBand cluster, where each node has dual 8-core

2.6 GHz Intel Xeon E5-2670 (Sandy Bridge) processors with 20MB L3 shared cache, 32

(Figure panels: (a) Single VM Migration Time for 512M, 1G, and 2G VMs over TCP, IPoIB, and RDMA; (b) Profiling Results of Single VM Migration, covering Set PRE_MIGRATION, Detach VF, Remove IVSHMEM, Migration, Attach VF, Add IVSHMEM, and Set POST_MIGRATION; (c) Multiple VM Migration Time for 2-16 VMs, Sequential Migration Framework versus Proposed Migration Framework.)

Figure 4.4: VM Migration Time and Profiling Results

GB main memory and equipped with Mellanox ConnectX-3 FDR (56 Gbps) HCAs. The

MLNX_OFED_LINUX-3.2-2.0.0 is used as the HCA driver.

All applications and libraries used in this study are compiled with the GCC 4.8.3 compiler. The MPI communication performance experiments use MVAPICH2-v2.2 and OSU micro-benchmarks v5.3. Experimental results are averaged across five runs to ensure a fair comparison.

4.2.1 VM Migration Performance

Our first experiment aims to evaluate the total time to migrate VMs configured with various memory sizes from 512MB to 2GB. Figure 4.4(a) shows the time to migrate VMs over different networks. We can see that the RDMA-based IB network always performs better than the other networks for all memory sizes. When the VM memory size is 512MB, the total migration times with TCP, IPoIB, and IB are 2.411s, 2.384s, and 1.957s, respectively. Compared with Ethernet (TCP) and IPoIB, the RDMA-based IB network can reduce the migration time by about 20%. In order to see where the performance differences come from, we profiled the total time of migrating a VM configured with 512MB memory. The breakdown of this migration time is shown in Figure 4.4(b).

Table 4.1: Total Migration Time Breakdown (seconds)

                          Chameleon                Nowlab
Steps                  TCP    IPoIB   RDMA      TCP    IPoIB   RDMA
Total Migration Time   2.41   2.38    1.95      2.48   2.53    2.15
Set PRE_MIGRATION      0.13   0.13    0.13      0.14   0.14    0.14
Detach VF              0.18   0.18    0.17      0.18   0.20    0.19
Remove IVSHMEM         0.01   0.01    0.01      0.01   0.01    0.01
Migration              1.26   1.23    0.80      1.30   1.28    0.93
Attach VF              0.39   0.39    0.35      0.38   0.40    0.36
Add IVSHMEM            0.10   0.10    0.09      0.12   0.13    0.12
Set POST_MIGRATION     0.32   0.32    0.37      0.33   0.35    0.38

From the time breakdown, we can see that most of the time is spent in transferring the VM (the Migration step) from one machine to another. Since the RDMA-based IB network has higher bandwidth and lower latency than the other networks, the performance difference in the VM transfer phase is expected. All other steps take almost the same amount of time across these networks. Table 4.1 shows the time spent in each phase of the VM migration. On the right side of the table, we also conduct the same experiment on the Nowlab cluster, which uses different versions of KVM, host OS, and device driver than the Chameleon cluster. From the results, we observe the same performance trend on the Nowlab cluster. These numbers show that our proposed migration framework works efficiently and independently of the versions of KVM, host OS, and device drivers. Figure 4.4(c) shows the time to migrate a varying number of VMs from one machine to another simultaneously. Each VM is configured with 512MB memory. From the results, we can see that, compared with a sequential migration framework, our proposed migration framework reduces the migration time by up to 51%. The benefits come from the MPI-based scheme in all the components of the proposed VM Migration Controller and the efficient state exchange channel, by which all the trigger, detection, and notification operations can be executed in parallel and efficiently.

(Figure panels: (a) MPI Pt-to-Pt Latency, (b) MPI Pt-to-Pt Bandwidth, (c) MPI Broadcast, (d) MPI Reduce; each compares NM-PE, NM-MT, and NM-Default across message sizes.)

Figure 4.5: Overhead Evaluation of Different Designs

4.2.2 Overhead Evaluation of Different Schemes

In this section, we conduct micro-benchmark level experiments to evaluate the impact of our proposed migration framework on the MPI runtime. We evaluate three schemes in the no-migration case (NM). The first scheme is NM-PE, which is our proposed progress-engine based design. The second scheme is NM-MT, which is our proposed migration-thread based design. The third scheme is NM-Default, which is MVAPICH2-v2.2 without any modifications. The experimental results are shown in Figure 4.5. The point-to-point latency with a 4-byte message size for NM-PE, NM-MT, and NM-Default is 1.82us, 1.87us, and 1.8us, respectively. The added overhead from our proposed schemes is negligible. The bandwidth, broadcast, and reduce results share the same behavior. These results show that our proposed framework does not affect the micro-benchmark performance even though we modified the most critical paths of the MPI runtime, such as the progress engine.

4.2.3 Point-to-Point Performance

In this section, we evaluate the communication performance of point-to-point operations, shown in Figures 4.6(a)-4.6(b). In these experiments, we migrate a VM from one

(Figure panels: (a) MPI Pt-to-Pt Latency, (b) MPI Pt-to-Pt Bandwidth, (c) Broadcast, (d) Reduce; each compares PE-IPoIB, PE-RDMA, MT-IPoIB, and MT-RDMA across message sizes.)

Figure 4.6: MPI Communication Performance with VM Migration of Different Designs

machine to another machine while a benchmark is running inside. Since all these micro-benchmarks run for thousands of iterations, the migration overhead is amortized over these iterations. PE-IPoIB uses our proposed progress engine based design and the IPoIB network for VM migration. PE-RDMA represents our proposed progress engine based design but with the RDMA-IB network for VM migration. MT-IPoIB uses our proposed migration-thread based design and the IPoIB network. MT-RDMA uses our proposed migration-thread based design and the RDMA-IB network. Figure 4.6(a) shows the point-to-point latency results with the different designs. We can see that the VM migration happens at the 4KB message size. The latencies for PE-IPoIB, PE-RDMA, MT-IPoIB, and MT-RDMA are 160.22us, 141.44us, 161.21us, and 141.84us, respectively. We can see that our proposed MT-based designs perform worse than the PE-based designs. The performance difference is because of locking/unlocking the communication channel in the MT-based design to make sure that the communication channel is drained. Since there is no computation involved in these benchmarks, no overlap between computation and migration can be achieved, so our MT-based design does not bring performance benefits here. Figure 4.6(b) shows the point-to-point bandwidth results, which show the same behavior as the point-to-point latency results.

4.2.4 Collective Performance

In this section, we evaluate the overhead of VM migration on collective operations. We take these numbers with four VMs, each running two MPI processes. The experimental results are shown in Figure 4.6(c) and Figure 4.6(d). For the broadcast benchmark, the latencies during migration with the PE-IPoIB, PE-RDMA, MT-IPoIB, and MT-RDMA schemes are 476.25us, 349.07us, 479.76us, and 351.88us, respectively. The MT-RDMA scheme performs worse than PE-RDMA because of the overhead of locking/unlocking the communication channel and the context switch overhead between the migration thread and the main thread. The same trend can also be seen in the reduce results shown in Figure 4.6(d).

4.2.5 Overlapping Evaluation

In this section, we aim to evaluate overlapping between computation and migration.

We use a benchmark with different ratios of communication and computation. In this benchmark, we fix the communication time and increase the computation time. The experimental result, which is the average execution time over ten VM migrations, is shown in Figure 4.7(a). MT-typical stands for the typical scenario where computation and migration can be overlapped. MT-worst stands for the worst-case scenario where no overlap between computation and migration can be achieved. When the computation proportion comes to 10%, the total execution times (computation+communication+migration) for the PE scenario, MT-worst scenario, MT-typical scenario, and NM are 18.5s, 18.6s, 17s, and 15.6s, respectively. There are three interesting observations in these numbers. The first

is that the MT-worst scenario performs worse than the PE scheme. The reason is that the MT-worst scenario has the additional overhead of context switching and locking/unlocking. The second observation is that the MT-typical scenario performs better than the PE scheme. The performance benefits come from the overlap of computation and migration. The third observation is that the MT-typical scenario still performs worse than the NM scheme, mainly because only part of the migration time can be overlapped with the computation. As the computation time goes up, we can see that the MT-typical scenario performs close to the NM scenario. When the computation percentage reaches 60%, the total times for the PE, MT-worst, MT-typical, and NM cases are 37.9s, 38.2s, 35.26s, and 35.01s, respectively.

The MT-typical scenario performs almost the same as the NM scenario. Some minor overheads are introduced by the locking/unlocking operations and the VM down-time for live migration. These numbers show that our MT-based design is able to overlap computation and migration to hide the migration overhead. These results also indicate that applications with more computation can more easily overlap computation and migration. The users should choose PE or MT based on the ratios of computation and communication in their applications. Figure 4.7(b) further presents the total execution time of each iteration for the 20% computation case shown in Figure 4.7(a). VM migrations are executed on the 29th and the 45th iterations, respectively. We can observe from these two iterations that the total execution time of MT-typical varies. It is 19.35s on the 29th iteration, while it is 18.22s on the 45th one. The performance of MT-typical depends on the time at which the migration signal is triggered. The earlier the migration signal is triggered in the computation phase, the more overlap can be achieved.

(Figure panels: (a) Overlapping Benchmark Results comparing PE, MT-worst, MT-typical, and NM across computation percentages; (b) Total Execution Time in Each Iteration for the 20% Computation Case.)

Figure 4.7: Benchmark to Evaluate Computation and Migration Overlapping

4.2.6 Application Performance

In this section, we evaluate the performance of our proposed design with two end applications: Graph 500 and the NAS Parallel Benchmarks (NPB). For NAS, we evaluate the Class 'B' test set in VMs configured with 1G memory and the Class 'C' test set in VMs configured with 2G memory, separately. We launch eight VMs in total, and one of them carries out the migration during the benchmark run. We run LU, FT, EP, IS, and MG in the experiments. For each benchmark, we report the PE, MT-worst, MT-typical, and NM results. Figure 4.8(a) shows the NPB results with Class 'B'. From the results, we can see that the typical case of the MT-based design achieves similar performance to the NM case on LU, FT, and EP, while the worst case of the MT-based design and the PE design incur some overhead compared with the NM case. For the MT-typical case, the migration time is overlapped with the computation. For the LU results, the total times for the PE, MT-worst, MT-typical, and NM cases are 24.84s, 24.91s, 22.96s, and 22.65s, respectively. Compared with the MT-worst and PE results, MT-typical reduces the total execution time by around 10% by completely overlapping VM migration with application computation. For the MG and IS results in both Figure 4.8(a) and Figure 4.8(b), the MT-typical scenario delivers similar performance to the MT-worst scenario. The reason is that there is less computation in MG and IS, which results in less overlap with the migration phase.

Figure 4.8(c) shows the Graph500 BFS kernel median time across multiple iterations with various numbers of edges and edge factors. We can see that all our proposed designs deliver similar performance to the NM scenario. The reason is that even though the VM migration delays the BFS process, it affects only one iteration, and the median time across multiple iterations remains the same. We further show the BFS kernel max time in Figure 4.8(d). We can see that our proposed schemes have some overhead compared to the NM results. The reason is that the Graph500 benchmark is dominated by communication, so there is little overlap between communication and computation.

(Figure panels: (a) NAS (1G Memory), (b) NAS (2G Memory), (c) Graph 500 Median, (d) Graph 500 Max; each compares PE, MT-worst, MT-typical, and NM.)

Figure 4.8: Application Execution Time with VM Migration of Different Designs

4.3 Related Work

Previous studies [17, 37, 62] have demonstrated that SR-IOV is significantly better than software-based solutions for 10GigE networks. While SR-IOV enables low-latency communication, MPI libraries need to be designed carefully and offer advanced features for improving intra-node, inter-VM communication [44, 45]. Based on these studies, we

provide an efficient approach to build HPC clouds with OpenStack over SR-IOV enabled

InfiniBand clusters [117].

Many efforts have been carried out to enable migration of virtual machines over IB devices by reusing the solutions used in Ethernet devices. Most of these efforts to enable migration over SR-IOV devices have been limited to conventional Ethernet devices [20].

Due to the architectural differences between Ethernet and IB, these efforts are not applicable to IB. Huang et al. [32] propose a high performance virtual machine migration design based on RDMA. Zhai et al. [114] propose an approach that combines the PCI hot plug mechanism with the Linux bonding driver. With this solution, the VF is detached when the migration starts and then reattached after the migration is finished. The network connectivity is maintained using a secondary device during the migration. The ReNIC proposed by Dong et al. [18] extends the SR-IOV specification by cloning the internal VF state and migrating it during the VM migration phase. Pan et al. [79] propose guest VF driver support that enables live migration by tracking dirty memory pages and handling VF migration. It tracks dirty pages by explicitly writing dummy data into each received page so that the page modifications can be captured by the hypervisor. Xu et al. [110] propose designs in the hypervisor to ensure the VF device can be correctly used by the migrated VM and the applications. This design does not need to modify the guest operating systems or guest

VM drivers. Kadav et al. [51] propose using a shadow driver to monitor, capture, and reset the state of the device driver for the proper migration of direct assignment devices. This approach hot unplugs the device before migration and hot plugs the device after migration.

It then uses the shadow driver to recover the device state after migration. The concept of a shadow driver, however, does not scale in an IB context because the state of each QP must be monitored and captured, which can amount to tens of thousands of QPs. Pickartz et

al. [84] present a qualitative and quantitative investigation of the different migration types for their application in HPC.

Apart from the above-mentioned efforts to enable migration with SR-IOV Ethernet and InfiniBand, several previous works that focus on process migration are also relevant to our effort to enable migration with SR-IOV IB [78]. Although process migration differs from VM migration, in principle both face the same problem: before migration, the resources of active processes must be released, and a new set of resources must be reallocated after the migration is completed.

Our study goes beyond the existing studies presented in this section. First, our proposed framework is hypervisor and driver independent. Second, our study focuses on a virtual machine migration framework for MPI applications on SR-IOV enabled InfiniBand clusters.

4.4 Summary

In this chapter, we present a high-performance VM migration framework for MPI applications on SR-IOV enabled InfiniBand clusters. Our proposed framework does not depend on a specific hypervisor or host/guest device drivers. Furthermore, we design a high-performance and scalable controller which works seamlessly with our proposed designs in the MPI runtime to significantly improve the efficiency of VM migration in terms of software overhead. We systematically evaluate the proposed framework with MPI level micro-benchmarks and real-world HPC applications. At the application level, for the NPB LU benchmark running inside VMs, our proposed design can completely hide the overhead of VM migration through computation and migration overlapping.

Chapter 5: Designing Container-aware MPI Communication for Light-weight Virtualization

5.1 Container-aware MPI Communication

To address this performance bottleneck, we propose our design of a high performance locality-aware MPI library in this section. We further optimize the communication channels in the proposed MPI library.

5.1.1 Design Overview

Our design is based on MVAPICH2, an open-source MPI library over InfiniBand. For portability reasons, it follows a layered approach, as shown in Figure 5.1. The Abstract

Device Interface V3 (ADI3) layer implements all MPI-level primitives. Multiple communication channels provide basic message delivery functionality on top of the communication device APIs. By default, there are three types of communication channels available in MVAPICH2: the shared memory (SHM) channel, which communicates over user-space shared memory with peers hosted on the same host; the CMA channel, which uses dedicated system calls for intra-node communication by directly reading/writing from/to another process's address space once the message size exceeds a predefined threshold; and the HCA channel, which communicates over InfiniBand user-level APIs with the remaining peers.

Without any modification, default MVAPICH2 can run in a container-based virtualization environment. However, the SHM and CMA channels cannot be used for communication across different containers running on the same host due to the different types of namespace isolation, which leads to performance limitations. Although sharing the host's IPC and PID namespaces among containers provides the necessary conditions for SHM and CMA based communication across co-resident containers, the MPI communication still has to go through the HCA channel. That is because each container has a unique hostname, so the default MPI runtime is not able to identify the co-residence of containers based on these different hostnames.

Therefore, we propose a high performance locality-aware MPI library to address this limitation. As shown in Figure 5.1, we add a 'Container Locality Detector' component between the ADI3 layer and the channel layer. The Container Locality Detector is responsible for dynamically detecting and maintaining the information about local containers on the same host. According to the locality information, the communication path is rescheduled, so that co-located containers can communicate through either the SHM or the CMA channel with better performance.

(Figure: layered stack consisting of the Application, the MPI Layer, the ADI3 Layer with the Container Locality Detector of the locality-aware MPI library, the SHM/CMA/HCA channels, the communication device APIs (shared memory copy, CMA syscall, InfiniBand API), and the Host.)

Figure 5.1: MVAPICH2 Stack Running in Container-based Environments

5.1.2 Container Locality Detector

As discussed in Section 2.2.2, containers can share the host's shared memory segments, semaphores, and message queues by sharing the IPC namespace. Therefore, we create a container list structure in a shared memory region on each host, such as /dev/shm/locality in Figure 5.2. During initialization, each MPI process writes its own membership information into this shared container list structure according to its global rank. Figure 5.2 illustrates a scenario of launching an 8-process MPI job. Containers A, B, and C are on the same host (e.g., host1). MPI ranks 0 and 1 are running in container-A, and ranks 4 and 5 are running in container-B and container-C, respectively, while the other four ranks run on another host (e.g., host2). The three containers (ranks 0, 1, 4, and 5) then write their membership information into positions 0, 1, 4, and 5 of the container list on host1, correspondingly. The other positions are left blank. Similarly, the other four MPI processes write at positions 2, 3, 6, and 7 of the container list on host2. Once the membership update of all processes completes, the real communication can take place. In this case, the number of local processes on host1 can be acquired by checking and counting which membership entries have been written. Their local ordering is maintained by their positions in the container list. Therefore, the written membership information in the container list indicates which processes are co-resident.

In our proposed design, the container list is built from multiple bytes, as the byte is the smallest granularity of memory access that does not require a lock. Each byte is used to tag one container. This guarantees that multiple containers on the same host are able to write their membership information at their corresponding positions concurrently without introducing lock/unlock operations. This approach reduces the overhead of the locality detection procedure. Moreover, the proposed approach does not introduce much overhead for traversing the container list. Taking a one-million-process MPI job, for instance, the whole container list occupies only 1 MB of memory. Therefore, it brings good scalability to virtualized MPI environments.
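A minimal, self-contained sketch of this byte-per-rank list is given below. It is not MVAPICH2's implementation: the segment name /locality_demo is hypothetical, error handling and cleanup (shm_unlink) are omitted, and it assumes that co-resident containers see the same /dev/shm (e.g., containers started with the host's IPC namespace). Compile with an MPI compiler and, on older systems, -lrt.

    /* Sketch of the shared-memory container list: one byte per global rank,
     * written without locks, then counted to find co-resident processes. */
    #include <mpi.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Appears as /dev/shm/locality_demo; only ranks on the same host
         * (or containers sharing the host's /dev/shm) see the same segment. */
        int fd = shm_open("/locality_demo", O_CREAT | O_RDWR, 0666);
        ftruncate(fd, size);
        unsigned char *list = mmap(NULL, size, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);

        list[rank] = 1;                      /* tag my slot: no lock needed */
        MPI_Barrier(MPI_COMM_WORLD);         /* wait until everyone has written */

        int local = 0;
        for (int i = 0; i < size; i++)       /* ranks tagged here are co-resident */
            local += list[i];
        printf("rank %d sees %d co-resident processes\n", rank, local);

        munmap(list, size);
        close(fd);
        MPI_Finalize();
        return 0;
    }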

(Figure: on host1, containers A, B, and C hold MPI ranks 0, 1, 4, and 5; through the shared IPC namespace each rank tags its own position in the container list at /dev/shm/locality, while ranks 2, 3, 6, and 7 tag the corresponding positions of the list on host2.)

Figure 5.2: Container Locality Detection

5.1.3 Optimizing SHM and CMA Channels

The default settings in the MVAPICH2 library have been optimized for the native environment, but they may not be the best settings for MPI communication over the SHM and CMA channels in a container-based environment. Therefore, we need to optimize these two channels further in order to achieve high performance message passing for intra-host inter-container communication. For the SHM channel, there are four important parameters for the eager and rendezvous protocols. However, with our proposed design to enable the CMA channel, messages transferred over the rendezvous protocol directly go through the CMA channel, so we only need to consider two parameters for the eager protocol over the SHM channel. In addition, we only show the optimization results for bandwidth and message rate, since there is no clear performance difference in terms of latency.
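The resulting channel choice can be summarized by the following toy function; it is not MVAPICH2 code, the locality flag corresponds to the Container Locality Detector's output, and the 8K value mirrors the tuned SMP_EAGER_SIZE reported below.

    #include <stdio.h>
    #include <stddef.h>

    enum channel { CHANNEL_SHM, CHANNEL_CMA, CHANNEL_HCA };

    /* Toy decision: co-resident peers use SHM for eager-sized messages and
     * CMA for rendezvous-sized messages; everything else goes over the HCA. */
    static enum channel select_channel(int peer_is_coresident, size_t msg_bytes)
    {
        const size_t eager_threshold = 8 * 1024;   /* tuned SMP_EAGER_SIZE */

        if (!peer_is_coresident)
            return CHANNEL_HCA;                    /* inter-host: InfiniBand     */
        if (msg_bytes <= eager_threshold)
            return CHANNEL_SHM;                    /* intra-host eager path      */
        return CHANNEL_CMA;                        /* intra-host rendezvous path */
    }

    int main(void)
    {
        printf("co-resident 1KB  -> %d (SHM)\n", select_channel(1, 1024));
        printf("co-resident 64KB -> %d (CMA)\n", select_channel(1, 65536));
        printf("remote      64KB -> %d (HCA)\n", select_channel(0, 65536));
        return 0;
    }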

SMP_EAGER_SIZE defines the switch point between the eager and rendezvous protocols. The eager protocol goes through the SHM channel, while the rendezvous protocol goes through the CMA channel. As shown in Figure 5.3(a), we evaluate the performance impact of different eager message sizes. The results indicate that the optimal performance can be achieved by setting SMP_EAGER_SIZE to 8K.

SMPI_LENGTH_QUEUE defines the size of the shared buffer between every two processes on the same node for transferring messages smaller than SMP_EAGER_SIZE. Figure 5.3(b) shows the performance impact of different length queue sizes. We can see that a length queue of size 128K delivers the optimal performance.

(Figure panels: (a) Impact of CMA Threshold on Bandwidth, comparing eager sizes 4K, 8K, and 16K; (b) Impact of Length Queue for Small Messages, comparing queue sizes 128K, 256K, and 512K; (c) Impact of Eager Threshold on Message Rate, comparing thresholds 13K, 15K, 17K, and 19K.)

Figure 5.3: Communication Channel Optimization

5.1.4 Optimizing Communication for HCA Channel

Similarly, we need to optimize the HCA channel in container-based environments.

MV2_IBA_EAGER_THRESHOLD specifies the switch point between the eager and rendezvous protocols. If the threshold is too small, it could incur the additional overhead of RTS/CTS exchanges during rendezvous transfers between sender and receiver for many message sizes. If it is too large, it will require a larger amount of memory in the library. Therefore, we need to find the optimal threshold for inter-host inter-container communication. We measure the performance by setting MV2_IBA_EAGER_THRESHOLD to different values from 13K to 19K, as shown in Figure 5.3(c). The results show that the HCA channel delivers the optimal performance when this threshold is set to 17K for container environments.

5.2 Performance Evaluation for Docker Container

5.2.1 Experiment Setup

We use 16 bare metal InfiniBand nodes on Chameleon Cloud as our testbed, where

each node has 24-core 2.3 GHz Intel Xeon E5-2670 processors with 128 GB main memory

and equipped with Mellanox ConnectX-3 FDR (56 Gbps) HCAs with PCI Express Gen3 interfaces. We use CentOS Linux release 7.1.1503 (Core) with kernel 3.10.0-229.el7.x86_64 as the host OS and MLNX_OFED_LINUX-3.0-1.0.1 as the HCA driver. Docker 1.8.2 is

deployed as the engine to build and run Docker containers. The privileged option is enabled

to give the container access to the host HCA. All containers are set to share the host’s PID

and IPC namespaces.

All applications and libraries used in this study are compiled with the GCC 4.8.3 compiler.

The MPI communication performance experiments use MVAPICH2-2.2b and OSU micro-

benchmarks v5.0. Experimental results are averaged across multiple runs to ensure a fair

comparison.

5.2.2 Point-to-Point Performance

In this section, we evaluate the point-to-point communication performance between

two containers on a single host. The two containers are deployed on the same socket

and on different sockets to represent the intra-socket and inter-socket cases. Figures 5.4(a)-

5.4(c) show the 2-sided point-to-point communication performance in terms of latency,

bandwidth and bi-directional bandwidth. The evaluation results indicate that compared to

the default performance (Cont-*-Def), our proposed design (Cont-*-Opt) can significantly

improve the point-to-point performance in both intra-socket and inter-socket cases. The performance benefits can be up to 79%, 191%, and 407% for latency, bandwidth, and bi-directional bandwidth, respectively. If we compare the performance of our design with that of native MPI, we can see that our design has only minor overhead, which is much smaller than the overhead of the default case. For example, at a 1KB message size, the MPI intra-socket point-to-point latency of the default case is around 2.26µs, while the latencies of our design and the native mode are 0.47µs and 0.44µs, respectively. In this case, our design shows only about 7% overhead. Figures 5.5(a)-5.5(f) present the 1-sided communication performance. The evaluation results show that, compared with the default performance, our proposed design brings up to 95% and 9X improvement in terms of latency and bandwidth for the put and get operations. Compared with the native performance, there is also only minor overhead with our proposed design. Taking put bandwidth, for instance, at a 4-byte message size, the intra-socket bandwidth of the default case is 15.73Mbps, while our design and the native mode achieve 147.99Mbps and 155.47Mbps, respectively. Through this comparison, we can clearly observe the performance benefits of optimizing the MPI library with the locality-aware design on a container-based HPC cloud.

(Figure panels: (a) MPI Point-to-Point Latency, (b) MPI Point-to-Point Bandwidth, (c) MPI Point-to-Point Bi-Bandwidth; each compares the container default, container optimized, and native cases for intra-socket and inter-socket placements.)

Figure 5.4: MPI Two-Sided Point-to-Point Communication Performance

(Figure panels: (a) MPI Put Latency, (b) MPI Put Bandwidth, (c) MPI Put Bi-Bandwidth, (d) MPI Get Latency, (e) MPI Get Bandwidth, (f) MPI Get Accumulate Latency; each compares the container default, container optimized, and native cases for intra-socket and inter-socket placements.)

Figure 5.5: MPI One-Sided Point-to-Point Communication Performance

(Figure panels: (a) Broadcast, (b) Allgather, (c) Allreduce, (d) Alltoall; each compares Cont-Def, Cont-Opt, and Native.)

Figure 5.6: Collective Communication Performance with 256 Processes

5.2.3 Collective Performance

In this section, we deploy 64 containers evenly across 16 nodes. By pinning different containers to different cores, we avoid applications competing for the same core and the related performance degradation. Figures 5.6(a)-5.6(d) show the performance of broadcast, allgather, allreduce, and alltoall operations with 256 processes, respectively. With the proposed design, the communication across the four co-resident containers on each host can go through the SHM and CMA channels, which improves the overall performance of the collective operations. The evaluation results indicate that, compared with the default performance, our proposed design can clearly improve the performance by up to 59%, 64%, 86%, and 28% for MPI_Bcast, MPI_Allreduce, MPI_Allgather, and MPI_Alltoall, respectively.

As inter-node transfers contribute a certain proportion of the total communication, we do not see the same benefits as in point-to-point communication. In the meantime, the proposed design incurs at most 9% overhead for the above four collective operations, compared with the native performance.

5.2.4 Application Performance

Further, we run Graph 500 and Class D NAS with 256 processes across 16 nodes.

As presented in Figures 5.7(a) and 5.7(b), the evaluation results show that, compared with the default case, the proposed design can reduce the execution time by up to 16% (22,16) for Graph 500 and 11% (CG) for NAS. Compared with the native performance, the proposed design has only up to 5% and 9% overhead.

(Figure panels: (a) Graph 500, (b) Class D NAS; each compares Cont-Def, Cont-Opt, and Native.)

Figure 5.7: Application Performance with 256 Processes

5.3 Performance Evaluation for Singularity

In this section, we describe the experimental setup, provide the evaluation results of

Singularity, and give an in-depth analysis of these results.

5.3.1 Experimental Setup

Chameleon Cloud: We use 32 bare-metal InfiniBand nodes on this testbed, where each has 24 cores delivered in dual-socket Intel Xeon E5-2670 v3 Haswell processors with 128

GB main memory and equipped with Mellanox ConnectX-3 FDR (56 Gbps) HCAs with

PCI Express Gen3 interfaces. We use CentOS Linux release 7.1.1503 (Core) with kernel

3.10.0-229.el7.x86_64 as the host OS and MLNX_OFED_LINUX-3.4-2.0.0 as the HCA

driver.

Local cluster (Nowlab): We use four KNL nodes in this testbed. Each node is equipped with an Intel Xeon Phi(TM) CPU 7250 (1.40GHz), 96GB host memory, and 16GB

MCDRAM, and Omni-Path HFI Silicon 100 Series fabric controller. The operating system

used is CentOS 7.3.1611 (Core), with kernel version 3.10.0-514.16.1.el7.x86_64. OFED-

3.18-3 is used as the driver of the Omni-Path fabric controller.

Singularity 2.3 is used to conduct all the Singularity-related experiments. All applications and libraries used in this study are compiled with the gcc 4.8.3 compiler. All MPI communication performance experiments use MVAPICH2-2.3a and OSU micro-benchmarks v5.3. Experimental results are averaged across five runs to ensure a fair comparison.

5.3.2 Point-to-Point Communication Performance

Figure 5.8 shows the performance of MPI point-to-point communication on the Haswell architecture. Since each node has two CPU sockets, we measure the intra-node point-to-point communication performance in terms of intra-socket and inter-socket cases for Singularity and native, which are presented in Figure 5.8(a) and Figure 5.8(c). We observe that the intra-socket case has better performance than the inter-socket case with respect to both latency and bandwidth. For instance, the native latency of the intra-socket case is merely 3.36µs at a 16 Kbyte message size, while it is 5.18µs for the inter-socket case. Similarly, the bandwidths of the intra-socket and inter-socket cases at a 16 Kbyte message size reach 9.8GB/s and 5.3GB/s, respectively. This is because memory access across different NUMA nodes has to go over the QPI link, which is much slower than accessing local memory within the same NUMA node. We also notice that the performance difference gradually decreases as the message size increases. Figure 5.8(b) and Figure 5.8(d) show the inter-node point-to-point communication performance in terms of latency and bandwidth. Given that a 56 Gbps Mellanox ConnectX-3 FDR HCA is used in this testbed, the peak bandwidth reaches around 6.4GB/s. On the virtualization aspect, we can clearly observe that there is only minor overhead for the Singularity solution compared with native performance. The evaluation results indicate that the overhead is less than 7%.

Figure 5.9 shows the point-to-point communication performance on the KNL architecture with the Cache memory mode. Note that we separate the message sizes into two ranges, 1B-16KB and 32KB-4MB, to clearly present the performance trends.

Since there is only one NUMA node on the KNL architecture, we do not consider intra/inter-socket cases here. Having only one NUMA node also avoids the performance bottleneck of the QPI link that exists on the Haswell architecture. The MPI point-to-point latency performance is presented in Figure 5.9(a) and Figure 5.9(b). The evaluation results indicate that the latency performance on KNL with cache-mode memory is worse than the performance on the Haswell architecture. For example, the intra-node and inter-node latencies at a four-byte message size are 1.13µs and 2.68µs, respectively, while they are 0.2µs and 1.08µs on the Haswell architecture. The reason mainly comes from three aspects. The first is that the CPU frequency on KNL is much lower than the one on Haswell. The second is that KNL has relatively complex cluster modes due to its many-core nature; the communication between a core and the corresponding memory controller takes extra time, which increases the overall latency. In addition, maintaining cache coherency across the large number of cores on the KNL architecture is more costly than on a multi-core processor. Another interesting observation in Figure 5.9(b) and Figure 5.9(d) is that the inter-node latency is better than the intra-node latency beyond around 512 Kbytes. This is because the Omni-Path interconnect performs better than shared memory based transfers for large message sizes, especially considering the complex and costly memory access and cache coherency operations within one node. Moreover, since each KNL node is equipped with one Omni-Path fabric controller (100Gbps), we can see that the peak bandwidth reaches around 9.2GB/s. The evaluation results also indicate that the Singularity-based virtualization solution incurs less than 8% overhead, which is similar to what we observed in Figure 5.8.

We then measure the point-to-point communication performance with the Flat memory mode on KNL, which is shown in Figure 5.10. As discussed in Section 2.3.3.1, we are able to explicitly specify the type of memory (DDR or MCDRAM) when allocating memory. We thus conduct the experiments with DDR and MCDRAM, respectively. The evaluation results show that there is no significant difference between the performance with DDR and the one with MCDRAM. Similar to the observation in the Cache mode, the inter-node latency is also lower than the intra-node latency beyond around 1 Mbyte. The peak bandwidth is also able to reach 9.2GB/s, and Singularity-based virtualization delivers near-native performance as well. Compared with the Cache mode performance shown earlier, we can also observe that the intra-node bandwidth with the Cache mode in Figure 5.9(c) is slightly worse than that with the Flat mode in Figure 5.10(c). Cache misses in MCDRAM are the primary factor in the performance difference between using MCDRAM as addressable memory and using it as a cache; the Cache mode can therefore get close to, or even match, the performance of the Flat mode, but there is still some inherent overhead associated with using MCDRAM as the cache.
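As an illustration only (the dissertation does not state which allocation mechanism was used in these runs), one common way to place a buffer explicitly in MCDRAM under the Flat mode is the memkind library's hbwmalloc interface; binding the whole process with numactl is another option. The snippet below is a generic usage sketch and links with -lmemkind.

    /* Allocate a buffer from MCDRAM (high-bandwidth memory) in Flat mode. */
    #include <hbwmalloc.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        size_t bytes = 64UL * 1024 * 1024;

        if (hbw_check_available() != 0) {   /* no MCDRAM exposed as a NUMA node */
            fprintf(stderr, "MCDRAM not available, falling back to DDR\n");
            return 1;
        }

        double *buf = hbw_malloc(bytes);    /* allocated from MCDRAM */
        if (!buf) return 1;

        for (size_t i = 0; i < bytes / sizeof(double); i++)
            buf[i] = (double)i;             /* touch pages so they are faulted in */

        hbw_free(buf);
        return 0;
    }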

5.3.3 Collective Communication Performance

In this section, we conduct the communication performance evaluation with four commonly used MPI collective operations: MPI_Bcast, MPI_Allgather, MPI_Allreduce, and MPI_Alltoall. Figure 5.11 shows the evaluation results with 512 processes across 32

(Figure panels: (a) MPI Intra-Node Point-to-Point Latency, (b) MPI Inter-Node Point-to-Point Latency, (c) MPI Intra-Node Point-to-Point Bandwidth, (d) MPI Inter-Node Point-to-Point Bandwidth; each compares Singularity and native for the intra-socket, inter-socket, and inter-node cases.)

Figure 5.8: MPI Point-to-Point Communication Performance on Haswell

(Figure panels: (a) MPI Point-to-Point Latency (1B-16KB), (b) MPI Point-to-Point Latency (32KB-4MB), (c) MPI Point-to-Point Bandwidth (1B-16KB), (d) MPI Point-to-Point Bandwidth (16KB-4MB); each compares Singularity and native for intra-node and inter-node cases.)

Figure 5.9: MPI Point-to-Point Communication Performance on KNL with Cache Mode

(Figure panels: (a) MPI Intra-Node Pt-to-Pt Latency, (b) MPI Inter-Node Pt-to-Pt Latency, (c) MPI Intra-Node Pt-to-Pt Bandwidth, (d) MPI Inter-Node Pt-to-Pt Bandwidth; each compares Singularity and native with DDR and MCDRAM.)

Figure 5.10: MPI Point-to-Point Communication Performance on KNL with Flat Mode

nodes with the Haswell architecture, while Figure 5.12 and Figure 5.13 present the corresponding results with 128 processes across two KNL nodes with the Cache and Flat memory modes, respectively. Overall, the Singularity-based virtualization solution is still able to deliver near-native performance with less than 8% overhead on all four operations. In addition, when the message size exceeds around 256 Kbytes, we can clearly see the benefits for all four collective operations with MCDRAM in the Flat memory mode. The benefits can be up to 38%, 56%, 67%, and 16% for MPI_Bcast, MPI_Allgather, MPI_Allreduce, and MPI_Alltoall, respectively. As the message size increases, the data can no longer fit in the L2 cache. Compared with DDR, MCDRAM is able to deliver the data more efficiently through its 'Multi-Channel' architecture to the processes involved in the collective operations. On the other hand, the Singularity-based virtualization solution consistently reflects this performance characteristic of the native environment.

5.3.4 Application Performance

In this section, we evaluate the application performance with NAS and Graph500. The application performance with 512 MPI processes on Haswell is presented in Figure 5.14. The performance with 128 MPI processes on KNL with the Cache and Flat modes is shown in Figure 5.15 and Figure 5.16, respectively. Six different benchmarks included in the NAS test suite are presented as labels on the x-axis of Figures 5.14(a), 5.15(a), and 5.16(a). Most of them are computation-intensive; the communication takes only a small portion of the total execution time. That is why there is no clear performance difference between DDR and MCDRAM in the Flat mode in Figure 5.16(a). The FT performance improves with MCDRAM since it involves a large number of alltoall operations. Graph500 is a data-analytics workload, which heavily utilizes point-to-point

(Figure panels: (a) MPI Broadcast, (b) MPI Allgather, (c) MPI Allreduce, (d) MPI Alltoall; each compares Singularity and native.)

Figure 5.11: MPI Collective Communication Performance with 512-Process on Haswell

(Figure panels: (a) MPI Broadcast, (b) MPI Allgather, (c) MPI Allreduce, (d) MPI Alltoall; each compares Singularity and native.)

Figure 5.12: MPI Collective Communication Performance with 128-Process on KNL with Cache Mode

(Figure panels: (a) MPI Broadcast, (b) MPI Allgather, (c) MPI Allreduce, (d) MPI Alltoall; each compares Singularity and native with DDR and MCDRAM.)

Figure 5.13: MPI Collective Communication Performance with 128-Process on KNL with Flat Mode

communication (MPI_Isend and MPI_Irecv) with a 4 Kbyte message size for the BFS search of random vertices. The x-axis represents the different problem sizes as SCALE-edgefactor pairs. SCALE is the logarithm base two of the number of vertices, while edgefactor indicates the ratio of the graph's edge count to its vertex count. For instance, (24,16) represents a graph with 16M (2^24) vertices and 256M (16*2^24) edges. As we observed earlier in Figure 5.10, MCDRAM and DDR have similar performance for Graph500 in the Flat mode here. The evaluation results for all three cases in Figures 5.14-5.16 show that, compared with the native performance, Singularity-based container technology introduces less than 7% overhead, which stems from the inherent cost of containerization. Therefore, it reveals a promising way to efficiently run MPI applications on

HPC clouds.

(Figure panels: (a) Class D NAS with 512 Processes, (b) Graph500 with 512 Processes; each compares Singularity and native.)

Figure 5.14: Application Performance with 512-Process on Haswell

5.4 Related Work

As a lightweight alternative, container technology has been popularized during the last several years. More and more studies focus on evaluating the performance of different hypervisor-based and container-based solutions for HPC. Xavier et al. [107] conducted an in-depth performance evaluation of container-based virtualization (Linux VServer, OpenVZ,

(Figure panels: (a) Class C NAS with 128 Processes, (b) Graph500 with 128 Processes; each compares Singularity and native.)

Figure 5.15: Application Performance with 128-Process on KNL with Cache Mode

(Figure panels: (a) Class C NAS with 128 Processes, (b) Graph500 with 128 Processes; each compares Singularity and native with DDR and MCDRAM.)

Figure 5.16: Application Performance with 128-Process on KNL with Flat Mode

and LXC) and hypervisor-based virtualization (Xen) for HPC in terms of computing, memory, disk, network, application overhead, and isolation. Wes Felter et al. [23] explore the performance of traditional virtual machine deployments (KVM) and contrast them with the use of Docker. They use a suite of workloads that stress CPU, memory, storage, and networking resources. Their results show that containers result in equal or better performance than VMs in almost all cases. In addition, they find that both VMs and containers require tuning to support I/O-intensive applications [23]. Cristian et al. [93] evaluate the performance of Linux-based container solutions using the NAS parallel benchmarks in various container deployment configurations. The evaluation shows the limits of using containers, the types of applications that suffer the most, and the level of oversubscription containers can handle without considerably impacting application performance. Yuyu et al. [121] compare the virtualization (KVM) and containerization (Docker) techniques for HPC in terms of features and performance using up to 64 nodes on the Chameleon testbed with 10GigE networks. Charliecloud [86] uses the user and mount namespaces to run Docker containers with no privileged operations or daemons on center resources, providing user-defined services in a usable manner while minimizing the risks: security, support burden, missing functionality, and performance. Nevertheless, there are several prominent issues with it, such as compatibility, dependency, and user-driven features. The software makes use of kernel namespaces that are not deemed stable by multiple prominent distributions of Linux (e.g., no versions of Enterprise Linux or compatibles support it), and they may not be included in these distributions for the foreseeable future [54]. In addition, the workflow begins with Docker. While Docker is becoming a standard technology in the industry, it would be desirable not to be bound to it for baseline operation. rkt [92] is an open-source Go project backed by CoreOS, Inc. rkt avoids the need for trusted daemons and optionally uses the user namespace, but it is still a large project with much functionality not focused on HPC. It can run Docker images and also provides a competing image specification language.

Beyond the performance characterization [119], we also focus on building efficient container-based HPC clouds under different container deployment scenarios. We identify and address a clear performance bottleneck for MPI applications running in multi-container-per-host environments by proposing a high performance locality-aware MPI library. Further, we conduct a comprehensive performance evaluation [115] for Docker containers and

Singularity.

5.5 Summary

In this chapter, we identify the performance bottleneck for MPI applications running in container-based HPC clouds through a detailed performance analysis. To eliminate this bottleneck, we propose a high performance locality-aware MPI library for container-based HPC clouds. With the help of the locality-aware design, the MPI library is able to dynamically and efficiently detect co-resident containers at runtime, so that shared memory and CMA based communication can be used to improve the communication performance across co-resident containers. We further analyze and optimize the core mechanisms and design parameters of the MPI library for the SHM, CMA, and HCA channels in container-based HPC clouds. Through a comprehensive performance evaluation, we show that the proposed locality-aware design can significantly improve the communication performance across co-resident containers.

The evaluation results for Docker indicate that, compared with the default case, our proposed design can bring up to 95% and 9X performance improvement for MPI point-to-point communication in terms of latency and bandwidth. For collective operations, the proposed design can achieve up to 59%, 64%, 86%, and 28% improvement for MPI broadcast, allreduce, allgather, and alltoall operations, respectively. The evaluation results at the application level demonstrate that the proposed design can reduce the execution time of Graph 500 and NAS across 64 containers by up to 16% and 11%, respectively. In the meantime, the results also show that the proposed locality-aware design can deliver near-native performance for applications with less than 5% overhead. The performance results of Singularity demonstrate that Singularity-based container technology can achieve near-native performance on both Intel Xeon and Intel Xeon Knights Landing (KNL) platforms with different memory access modes. They also show that Singularity has very little overhead for running MPI-based HPC applications on both Omni-Path and InfiniBand networks. Therefore, the proposed high performance locality-aware MPI library shows significant potential for efficiently building large-scale container-based HPC clouds.

92 Chapter 6: Designing High Performance MPI Communication for Nested Virtualization

With the emergence of container-based virtualization technology in clouds, another type of usage paradigm, which is called “nested virtualization”, is becoming more and more popular in clouds. As a typical example, many end users choose to run their ap- plications encapsulated by Docker containers over Amazon EC2 virtual machines. Such an approach of running containers nested in virtual machines can bring easy deployment benefit for end users while making the cloud easy-to-manage for administrators. As an- other example, cloud infrastructures such as Chameleon [14] provide bare-metal machines to cloud designers and developers to build different types of cloud-based environments.

If cloud designers and developers are trying to build HPC clouds based on these infras- tructures, they can deliver the resources to their users in the form of virtual machines for achieving cost-effective resource sharing and security. The users can then run their applica- tions through lightweight containers on these virtual machines to achieve high productivity by easy and fast container deployment.

Nested virtualization seems a promising way to build clouds to achieve high produc- tivity, security and efficient resource sharing. However, recent studies [17, 37, 45, 62] have shown that running applications in either virtual machines or containers still has

93 the significant performance overhead, especially for I/O intensive applications. In or- der to improve the performance, several studies are proposed on either VM or container layer. Table 6.1 compares the related works and provides a brief description. The stud- ies [34, 44, 53, 64, 105, 120] provide co-resident VMs detection on VM environment.

Study [118] supports locality detection on container level, and the work [70] is publicly available. However, none of them focuses on the nested virtualization environment and explores the associated locality-aware and NUMA-aware support.

Table 6.1: Comparison with Existing Studies Studies Locality Aware Level NUMA Aware Support Key Ideas [34, 44, 53, 64, 105, 120] 1Layer (VM) × Support co-resident VMs detection [118] 1Layer (Container) × Support co-resident containers detection

Therefore, we conduct the experiments, as shown in Figure 6.1 to explore the perfor- mance overhead at MPI level in the nested virtualization environment. In the experiments, we launch two VMs on the same host and further launch two containers in each VM. Then we measure the MPI point-to-point latency across the containers. The two VMs are de- ployed on the same and different sockets, respectively, as presented in Figures 6.1(a) and

6.1(b). Compared with the native performance, we can observe that there is a significant overhead on default mechanism (denoted as *-Def), in both intra-socket and inter-socket cases. In order to explore further, we test with the 1Layer locality-aware mechanism pro- posed in [118] and available in the library [70]. In this mechanism, the communication across the different containers in the same VM can go through the shared memory based channel and CMA channel. We therefore clearly observe that the Intra-VM Inter-Container-

1Layer delivers near-native performance in both cases. However, there is no performance

94 benefit for Inter-VM Inter-Container-1Layer, which has the similar performance to the de- fault mechanism (denoted as *-Def). Then can we further reduce the performance over- head of running applications on the nested virtulization environment? In addition, the VMs could have different placements, as introduced above. Accordingly, the communication on the container level can be in the same or different containers/VMs, on the same or different

NUMA nodes. As the example illustrated in Figure 6.2, two VMs are deployed on a host, which is equipped with a dual socket processors. Each VM is pinned with eight physical cores on a single NUMA node. And two containers are running inside each VM. There exists four types of communication paths under this placement scheme:

(1) Intra-VM Intra-Container (across core 4 and core 5).

(2) Intra-VM Inter-Container (across core 13 and core 14).

(3) Inter-VM Inter-Container (across core 6 and core 12).

(4) Inter-Node Inter-Container (across core 15 and the core on remote node).

Therefore, it brings another two challenging questions. What are the impacts of the differ- ent VM/container placement schemes for the communication on the container level? Can we propose a design to adapt these different VM/container placement schemes and deliver near-native performance for nested virtualization environment?

6.1 Two-Layer Locality Aware and NUMA-Aware Design in MPI Li- brary

To address the above-mentioned performance bottleneck, we propose a design of a high performance two-layer locality-aware MPI library in this section.

95 250 250 Intra-VM Inter-Container-Def Intra-VM Inter-Container-Def 200 Inter-VM Inter-Container-Def 200 Inter-VM Inter-Container-Def Intra-VM Inter-Container-1Layer Intra-VM Inter-Container-1Layer 150 Inter-VM Inter-Container-1Layer 150 Inter-VM Inter-Container-1Layer 3 Native 3 Native 100 2 100 2

Latency (us) 1 Latency (us) 1 50 0 50 0 1 4 16 64 2561K 1 4 16 64 2561K 0 0 1 16 256 4K 64K 1M 1 16 256 4K 64K 1M Message Size (bytes) Message Size (bytes) (a) Intra-socket (b) Inter-socket

Figure 6.1: MPI Point-to-Point Latency Performance on Nested Virtualization Environ- ment (Compare Default, One-Layer Locality-Aware and Native)

VM 0 VM 1 Container 0 Container 1 Container 2 Container 3

core 0 core 1 core 2 core 3 core 8 core 9 core 10 core 11

QPI Memory Controller Memory Controller core 4 core 5 core 6 core 7 core 12 core 13 core 14 core 15

1 2 4 3 ... NUMA 0 NUMA 1

Figure 6.2: Communication Paths across Containers on Different VM/Container Place- ments

VM 0 VM 1 Container 0 Container 1 Container 2 Container 3

core 0 core 1 core 2 core 3 core 8 core 9 core 10 core 11

QPI Memory Controller Memory Controller core 4 core 5 core 6 core 7 core 12 core 13 core 14 core 15

NUMA 0 1 3 2 4 NUMA 1 Two-Layer Locality Detector

Container Locality Detector VM Locality Detector

Nested Locality Combiner

Two-Layer NUMA Aware Communication Coordinator

CMA SHared Memory Network (HCA) Channel (SHM) Channel Channel

Figure 6.3: Two-Layer Locality Aware Communication in Nested Virtualization Environ- ments

96 6.1.1 Design Overview

Two new components are added in the MPI library, which are Two-Layer Locality De- tector and Two-Layer NUMA Aware Communication Coordinator, as shown in Figure 6.3.

As shown in Figure 6.3, there exist multiple different VM/container placement schemes and associated multiple different communication paths. In the bare-metal environment, most MPI libraries use shared memory based communication channels for intra-node mes- sage transfer because of the low latency and high bandwidth, while network channel for inter-node transfer. In Figure 6.3, paths 1 and 4 will work in the same way as the ones in the bare-metal environment. With the work in [118], path 2 is also able to utilize any intra-node message transfer mechanisms, making it behave just as path 1. However, the communication path 3 is considered as across nodes due to the lack of the nested locality aware support. Therefore, the Two-Layer Locality Detector is responsible for dynamically detecting MPI processes in the co-resident containers inside one VM as well as the ones in the co-resident VMs on a single host. Once the two-layer locality detection completes, each MPI process will have accurate locality information in the nested virtualization envi- ronment. Another component, the Two-Layer NUMA Aware Communication Coordinator, will leverage this nested locality information, and combine the information from NUMA architecture and message to coordinate the selection of communication channels. By the help of the Two-Layer NUMA Aware Communication Coordinator, the communication will be rescheduled to an appropriate channel from the underlying SHM, CMA and HCA channels with optimal performance.

97 VM 0 VM 1

Container 0 Container 1 Container 2 Container 3

MPI Rank 0 MPI Rank 1 MPI Rank 6 MPI Rank 7

V V N N N N H H V V N N N N H H H H N N N N V V H H N N N N V V

1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 Combiner Combiner PCI Device PCI Device (IVSHMEM) (IVSHMEM) Nested Locality Nested Locality Container Locality-Aware List Container Locality-Aware List (on Shared IPC Namespace) (on Shared IPC Namespace) VM

Hypervisor Host

1 1 0 0 0 0 1 1 VM Locality-Aware List (on IVSHMEM )

Figure 6.4: Two-Layer Locality Detector Design (VM Locality Detector utilizes the VM Locality-Aware List to detect the processes on the same host. Further, Container Locality Detector leverages the Container Locality-Aware List to identify the processes on the same VM. Finally, each MPI process has a global view of the locality information. “V” denotes the processes in the same VM, “H” denotes the processes in the same host, but the different VM, “N” denotes the processes on remote hosts)

6.1.2 Two-Layer Locality Detector

Two-Layer Locality Detector contains three function units, which are VM Locality

Detector, Container Locality Detector and Nested Locality Combiner, respectively.

VM Locality Detector provides VM level locality aware support. As discussed in Sec- tion 2.2.3, IVShmem provides a mechanism, which can expose a host memory as a PCI device in the co-resident VMs. As a result, we utilize it to create a VM Locality-Aware

List for the VM level locality detection. Each MPI process writes its locality identification according to its global rank by accessing the virtualized PCI device that IVShmem exposes.

In this way, the locality information of MPI processes in all the co-resident VMs can be identified.

Similarly, Container Locality Detector is responsible for container level locality aware support. Containers can share the shared memory segments, semaphores and message

98 queues by sharing the IPC namespace. Therefore, we create a Container Locality-Aware

List on the shared memory segments in each VM. Each MPI process in co-resident con- tainers will also write its own locality information into this shared container list structure according to its global rank. After a synchronization, it can guarantee that the locality in- formation of all MPI processes has been collected up and stored in the two locality aware lists.

Then, each MPI process uses Nested Locality Combiner to combine the locality infor- mation from both VM and container locality aware lists. If the same locality identification has been written at the same position of both lists when traversing, that means it is one MPI process inside the same VM (denoted as “V”). If the locality identification only exists on

VM locality aware list, then it is one MPI process on the same host, but a different VM

(denoted as “H”). The blank positions on both lists indicate that those corresponding MPI processes are on the remote physical nodes (denoted as “N”).

Figure 6.4 illustrates an example of launching an 8-process MPI job. Two VMs (VMs

0 and 1) are deployed on the same host, and each VM hosts two containers (Container 0-1 and Container 2-3). There is one MPI process in each container, and the other four ranks run on another host. In the VM locality detection unit, the four MPI processes (ranks 0, 1, 6, and 7) write their locality identifications on positions 0, 1, 6, and 7 on VM Locality-Aware

List, respectively. In the container locality detection unit, two MPI processes (ranks 0 and

1) write their locality information on positions 0 and 1 on Container Locality-Aware List in VM 0. A similar procedure happens in VM 1. Through the Nested Locality Combiner,

MPI process with rank 0 discovers that process with rank 1 is in the same VM, processes with ranks 6 and 7 are on the same host, but a different VM. While processes with ranks

2-5 are on the remote nodes. As a result, each MPI process generates a nested locality

99 aware list, which is comprised of “V, H, N”. The number of local processes on host can be acquired by checking and counting the positions with symbols “V” and “H”. Their local ordering will still be maintained by their positions in the list.

In the design of two-layer locality detector, each list is designed by using multiple bytes, as the byte is the smallest granularity of memory access without a lock. Each byte will be used to tag each MPI process. This guarantees that multiple processes in the same VM or host are able to write their locality information on their corresponding positions concur- rently without introducing lock&unlock operations. This approach reduces the overhead of locality detection procedure. Moreover, the proposed approach will not introduce much overhead of traversing the lists. For instance, taking a one million processes MPI job only occupies 1MB memory space for each list. The space complexity is O(n) . Therefore, it brings good scalability on the virtualized MPI environment.

6.1.3 Two-Layer NUMA Aware Communication Coordinator

Two-Layer Locality Detector

Two-Layer NUMA Aware Communication Coordinator

Message NUMA Loader Nested Locality Loader Parser

Communication Coordinator

SHared Memory Network (HCA) CMA Channel (SHM) Channel Channel

Figure 6.5: Two-Layer NUMA Aware Communication Coordinator

Two-Layer NUMA Aware Communication Coordinator reschedules the message to go through the appropriate communication channel in order to deliver the optimal communica- tion performance in the nested virtualization environment. Figure 6.5 presents the architec- ture of Two-Layer NUMA Aware Communication Coordinator. In this component, there

100 are three function units, which are Nested Locality Loader, NUMA Loader, and Message

Parser, respectively. Nested Locality Loader reads the locality information of destination process from the Two-Layer Locality Detector. NUMA Loader reads the information of

VM/container placements to decide on which NUMA node the destination process is pin- ning. Message Parser obtains the attributes of the message, such as message type and message size. For a communication request to a specific destination process, Communi- cation Coordinator selects the appropriate communication channel based on all the above information, In Section 6.2.3, Algorithm 1 elaborately describes the procedure of selecting the appropriate communication channel, which will be introduced later.

6.1.4 Performance Benefit Analysis

350 350 Intra-VM Inter-Container-Def Intra-VM Inter-Container-Def Inter-VM Inter-Container-Def Inter-VM Inter-Container-Def 300 Intra-VM Inter-Container-1Layer 300 Intra-VM Inter-Container-1Layer Inter-VM Inter-Container-1Layer Inter-VM Inter-Container-1Layer 250 Intra-VM Inter-Container-2Layer 250 Intra-VM Inter-Container-2Layer Inter-VM Inter-Container-2Layer Inter-VM Inter-Container-2Layer Native Native 200 200 5 5 4 4 150 3 150 3 2 2 Latency (us) 100 1 Latency (us) 100 1 0 0 50 1 4 16 64 256 1K 4K 50 1 4 16 64 256 1K 4K

1 4 16 64 256 1K 4K 16K 64K 256K 1M 1 4 16 64 256 1K 4K 16K 64K 256K 1M Message Size (bytes) Message Size (bytes) (a) Intra-socket (b) Inter-socket

Figure 6.6: MPI Point-to-Point Latency Performance on Nested Virtualization Environ- ment (Compare Default, One-Layer Locality-Aware, Two-Layer Locality-Aware and Na- tive)

Figure 6.6 shows the performance of our two-layer locality-aware design on Intel Haswell architecture. In both intra-socket and inter-socket cases, we observe that the one-layer de- sign for inter-VM inter-container scenario (denoted as Inter-VM Inter-Container-1Layer)

101 has the similar performance to the default mechanism (denoted as *-Def), as they all trans- fer the messages through the network channel. In the intra-socket case, as shown in Fig- ure 6.6(a), we can see that our two-layer locality-aware design (denoted as Inter-VM Inter-

Container-2Layer) significantly improves the Inter-VM Inter-Container latency, compared with the one-layer design (denoted as Inter-VM Inter-Container-1Layer). For the small message sizes, our two-layer locality-aware design can achieve near-native performance.

For the large message sizes (>32KB), our two-layer locality-aware design can not utilize the CMA channel because of the different VM kernels. This is the reason that there is some overhead, compared with the native performance. In the inter-socket case, as presented in

Figure 6.6(b), our two-layer locality-aware design (denoted as Inter-VM Inter-Container-

2Layer) also delivers near-native performance for the small message sizes. However, we see clear overhead for the large message sizes for two-layer locality-aware design, com- pared with the one-layer design (denoted as Inter-VM Inter-Container-1Layer). Please note that our two-layer locality-aware design utilizes the shared memory channel for message transfer, while the one-layer design uses the network channel due to the lack of locality information. It indicates that the network loopback channel has better communication per- formance than the shared memory channel for the large message sizes in the inter-socket case. This interesting observation inspires us to further explore on NUMA-awareness.

6.2 Hybrid Design for NUMA-Aware Communication

To address the performance difference between the shared memory channel and the network loopback channel for the large message sizes in inter-socket case, we propose our hybrid design for inter-socket communication in this section.

102 6.2.1 Basic Hybrid Design with HCA Channel

VM1 VM2

Container A Container B Container A Container B MPI MPI MPI MPI Rank 0 Rank 1 Rank 0 Rank 1 Host OS

Data IVSHMEM

RTS Data CTS Network

NIC

Figure 6.7: Basic Hybrid Design (SHM Channel for Small Messages, Network Loopback Channel for Large Messages)

MPI stacks use rendezvous protocol for large message transfers. A typical rendezvous protocol transfers data in two steps. The first step is to exchange control messages be- tween the sender and receiver, as RTS and CTS shown in Figure 6.7. The real data trans- fer happens in the second step. In Figure 6.6(b), we can clearly see that the two-layer locality-aware design has more overhead for the large messages in the inter-socket case, while having significant improvement for the small messages, compared with the one-layer design. Therefore, we propose a basic hybrid design, which combines two-layer locality- aware design for the small messages and one-layer design for the large messages. Since the one-layer locality-aware design does not support VM level locality detection in the nested virtualization environment, the basic hybrid design in fact utilizes the network loopback channel for the large message transfer, including the control messages and real data, as illustrated in Figure 6.7.

103 VM1 VM2

Container A Container B Container A Container B MPI MPI MPI MPI Rank 0 Rank 1 Rank 0 Rank 1 Host OS

RTS Data CTS IVSHMEM

Data Network

NIC

Figure 6.8: Enhanced Hybrid Design (SHM Channel for Small Messages and Control Mes- sages, Network Loopback Channel for Large Messages)

6.2.2 Enhanced Hybrid Design

For large message transfer with our proposed basic hybrid design, both control mes- sages and data payload go through the network loopback channel. However, from Fig- ures 6.6(a) and 6.6(b), we see that the SHM channel always performs better than the net- work loopback channel for small message transfers. This implies that control messages exchange for large message transfers going over the network loopback channel are not get- ting optimal performance. Figure 6.8 shows an overview of our enhanced hybrid design.

In this design, we look at large data transfers at a finer granularity and choose the optimal channel for both metadata (control messages) and real data. We use the same rendezvous protocol, but control messages and data payloads are scheduled over different channels.

When a control packet is sent to a destination, the locality detection could dynamically select the optimal channel to use. The control messages for large message transfers are scheduled to go over the SHM channel. After the control message exchange is completed, the real data transfer still goes through the network channel.

104 6.2.3 Putting All Together

The two-layer locality aware design is proposed in Section 7.1, which significantly im- proves the communication performance compared to the one-layer locality aware solution in most cases. However, there are clear overheads for inter-socket large message transfer on the inter-vm inter-container case. We finally propose the enhanced-hybrid design in

Section 6.2.2 to adapt different VM/container placement schemes and deliver near-native performance for nested virtualization environment. Therefore, Algorithm 1 describes the procedure of two-layer NUMA aware communication coordination in the enhanced-hybrid design. Function TwoL Comm Coordinate reschedules the communication channel and sends the message based on the nested locality information of destination process. For the large message transfer (msg.size > eager threashold), Function Hand Shake first ex- changes the control message with destination process, before sending real data. The most important point is that if the destination process is on a different VM and is pinned to a different socket (On Same Socket returns false), the control message will be exchanged through SHM channel, while the real data will be sent through HCA channel. Figure 6.9 presents the performance of our proposed two hybrid designs for inter-socket communi- cation. For the large message transfer in the inter-VM inter-container scenario, we can observe that our hybrid designs (denoted as Inter-VM Inter-Container-*-Hybrid) clearly improve the performance by leveraging network channel, compared to the two-layer design

(denoted as Inter-VM Inter-Container-2Layer). Since the portion of the control messages is fairly small compared to the data portion in the large messages, the benefit of the enhanced hybrid design (denoted as Inter-VM Inter-Container-Enhanced-Hybrid) over the basic hy- brid design (Inter-VM Inter-Container-Basic-Hybrid) is easily amortized as the message size keeps increasing. This is why we do not see the clear performance improvement from

105 the enhanced hybrid design over the basic hybrid design. Considering that VM migration

is an essential feature in the cloud environment if a VM is migrated to another node, the

two-layer locality information needs to be re-detected since the VM location gets changed.

This needs to be taken care when MPI runtime tries to resume the connections after VM

migration. It’s doable and there are no limitations from our design for implementing the

re-detection. The two-layer NUMA aware communication coordination is only executed

during MPI communication phase. Thus the migration does not influence the logic of the

communication coordination as long as the locality information can be updated. In the

following Section 6.3, we use this design evaluating the collective operations and end ap-

plications. 300 Inter-VM Inter-Container-2Layer Inter-VM Inter-Container-Basic-Hybrid 250 Inter-VM Inter-Container-Enhanced-Hybrid Native

200 3 2.5 150 2 1.5 100 1 Latency (us) 0.5 0 50 1 4 16 64 256 1K 4K

0 1 4 16 64 256 1K 4K 16K 64K 256K 1M Message Size (bytes)

Figure 6.9: MPI Point-to-Point Latency of Hybrid Design for Inter-Socket Communication on Nested Virtualization Environment

6.3 Performance Evaluation

Our testbed is comprised of 16 bare-metal InfiniBand nodes on Chameleon Cloud, where each node has 24-core 2.3GHz delivered in dual socket Intel Xeon E5-2670v3 pro- cessors with 128 GB main memory and equipped with Mellanox ConnectX-3 FDR (56

Gbps) HCAs with PCI Express Gen3 interfaces. We use CentOS Linux release 7.1.1503

106 Algorithm 1: Procedure of Two-Layer NUMA Aware Communication Coordination

1 function TwoL Comm Coordinate() 2 begin 3 loc flag ← detect nested locality(dst) 4 switch loc flag do 5 case N 6 TwoL Comm Inter Node(msg, dst) 7 endsw 8 case H 9 TwoL Comm Same Host(msg, dst, numa) 10 endsw 11 otherwise 12 TwoL Comm Same VC(msg, dst) 13 endsw 14 endsw 15 end 16 function TwoL Comm Inter Node(msg, dst) 17 begin 18 if msg.size ≤ eager threashold then 19 Send Data(msg, dst, HCA) 20 else 21 Hand Shake(ctl pkt, dst, HCA) 22 Send Data(msg, dst, HCA) 23 end 24 end 25 function TwoL Comm Same Host(msg, dst, numa) 26 begin 27 if msg.size ≤ eager threashold then 28 Send Data(msg, dst, SHM) 29 else 30 if On Same Socket(numa) then 31 Hand Shake(ctl pkt, dst, SHM) 32 Send Data(msg, dst, SHM) 33 else 34 Hand Shake(ctl pkt, dst, SHM) 35 Send Data(msg, dst, HCA) 36 end 37 end 38 end 39 function TwoL Comm Same VC(msg, dst) 40 begin 41 if msg.size ≤ eager threashold then 42 Send Data(msg, dst, SHM) 43 else 44 Hand Shake(ctl pkt, dst, SHM) 45 Send Data(msg, dst, CMA) 46 end 47 end

107 (Core) with kernel 3.10.0-229.el7.x86 64 as the host OS and MLNX OFED LINUX-3.0-

1.0.1 as the HCA driver.

To configure the VM environment at the outer virtualization layer, KVM is used as the

Virtual Machine Monitor (VMM). We deploy two VMs per node, each with 12 cores, 32

GB memory and the same OS as the host. In addition, an IVShmem device and a dedicated virtual InfiniBand device (VF) are attached to each VM.

For the container environment at the inner virtualization layer, Docker 1.10.3 is de- ployed as the engine to build and run Docker containers. Two docker containers are launched in each VM. The privileged option is enabled to give the container access to the virtual InfiniBand device in the VM. The container is set to share the PID and IPC namespaces of the VM where it resides.

All applications and libraries used in this study are compiled with GCC 4.8.3 compiler.

The MPI communication performance experiments use OSU micro-benchmarks v5.3 [73].

Experimental results are averaged across five runs to ensure a fair comparison.

6.3.1 Point-to-Point Performance

In this section, we evaluate the MPI point-to-point communication performance in terms of latency and bandwidth between two containers on a single host. Since the one- layer locality-aware design can achieve near-native performance for intra-vm inter-container case, as seen in Figures 6.6(a) and 6.6(b), we do not discuss it further. We focus on the per- formance evaluation of inter-vm inter-container case in this section. Depending on whether the two VMs are deployed on the same CPU socket or the different one, the evaluation is divided into two categories, intra-socket and inter-socket. For the intra-socket case, the communication of inter-vm inter-container will be within the same socket. Otherwise, the

108 250 300 Default Default 1Layer 1Layer 2Layer 250 2Layer 200 Native(w/o CMA) Basic-Hybrid Native Enhanced-Hybrid 200 Native(w/o CMA) Native 150 4 4 3.5 150 3.5 3 3 100 2.5 2.5 2 2 100

Latency (us) 1.5 Latency (us) 1.5 1 1 0.5 0.5 50 0 0 50 1 4 16 64 256 1K 4K 1 4 16 64 256 1K 4K

0 0 1 4 16 64 256 1K 4K 16K 64K 256K 1M 1 4 16 64 256 1K 4K 16K 64K 256K 1M Message Size (bytes) Message Size (bytes) (a) Intra-Socket Latency (b) Inter-Socket Latency 16000 16000 Default Default 1Layer 1Layer 14000 2Layer 14000 2Layer Native(w/o CMA) Basic-Hybrid 12000 Native 12000 Enhanced-Hybrid Native(w/o CMA) 10000 10000 Native

8000 8000

6000 6000

4000 4000 Bandwidth (MB/s) Bandwidth (MB/s) 2000 2000

0 0 1 4 16 64 256 1K 4K 16K 64K 256K 1M 1 4 16 64 256 1K 4K 16K 64K 256K 1M Message Size (bytes) Message Size (bytes) (c) Intra-Socket Bandwidth (d) Inter-Socket Bandwidth

Figure 6.10: Point-to-Point Communication Performance of Inter-VM Inter-Container Sce- nario

109 communication is happening across the socket. Figures 6.10(a) and 8.9(c) show the evalu-

ation results of MPI point-to-point latency.

As we can see in Figure6.10(a), one-layer locality-aware design (denoted as 1Layer)

has the performance similar to the default case. This is because both of them have to go

through network loopback channel for communication instead of shared memory based

channel, due to lack of locality-aware support on VM level. This is also the reason we can

see clear performance degradation, compared with native performance.

Compared with the one-layer design (denoted as 1Layer), we can also clearly observe

that our two-layer locality-aware design (denoted as 2Layer) is able to significantly improve

the performance. The benefit is up to 84%. For example, the latency at 4byte message size

is 1.16µs for one-layer design, while it is only 0.2µs for two-layer design. Compared with the native performance, the two-layer design can achieve near-native performance with less than 9% overhead for small messages. Whereas, we observe that there exists more overhead for the large messages. The reason is that the communication happens across the VM. Our two-layer design can not leverage the CMA channel across two different

VM kernels, as we explained in Section 6.1.4. After disabling the CMA channel for large messages, we can see that our two-layer design has similar performance with the native, as denoted by Native(w/o CMA). The minor performance overhead between 2Layer and

Native(w/o CMA) comes from the virtualization environment itself.

From Figure 8.9(c), we can observe the latency performance for inter-socket case. We see that the one-layer design still performs similarly as the default case, and much worse than the native, because of the same reason as the intra-socket case, which is both of the one-layer design and default mechanism use network loopback channel. Our proposed two- layer design can achieve near-native performance for the small messages through the shared

110 memory channel. However, we see clear performance degradation for the large messages, compared with the one-layer design. As we discussed in Section 6.1.4, this is because the performance of the shared memory channel is worse than the performance of network loopback channel for inter-socket large messages transfer in the nested virtualization en- vironment on the current architecture. Our basic-hybrid design uses the shared memory channel for the small messages and the network loopback channel for the large messages, we, therefore can see that our basic-hybrid design is able to maintain the near-native per- formance for the small messages and achieve the similar performance of one-layer design for the large messages. Our basic-hybrid design can significantly improve the performance of the two-layer design with up to 42% benefit for the large messages. Compared with the one-layer design, the basic-hybrid design delivers up to 64% improvement for the small messages. For instance, the latency at 4byte message size is 1.25µs for the one-layer de- sign, while it is only 0.45µs for the basic-hybrid design. After disabling the CMA channel, we can see that all the mechanisms utilizing the network loopback channel for the large message transfer (denoted as Default, 1Layer, Basic-Hybrid, Enhanced-Hybrid) have bet- ter performance than the Native(w/o CMA), which uses the shared memory channel for the large messages. From the evaluation results, we also find that the enhanced-hybrid design does not have clear performance benefit, compared to the basic-hybrid design. As we dis- cussed in Section 6.2.3, the reason is because of the small control message transfer, so the optimization on the control message can not be shown clearly.

Similarly, Figures 6.10(c) and 8.9(f) present the evaluation results of MPI point-to- point bandwidth performance for intra-socket and inter-socket cases. For the intra-socket case, the evaluation results indicate that our two-layer locality-aware design can bring up to

184% improvement, compared to the one-layer design. For example, the bandwidth of the

111 one-layer design is 11.94Mbps, while it can achieve 26Mbps for the two-layer design. For the inter-socket case, our enhanced hybrid design clearly increases the bandwidth of the two-layer design by up to 25% for the large messages, while maintaining the near-native performance for the small messages. Overall, it brings up to 110% performance benefit, compared with the one-layer design.

6.3.2 Collective Performance 22 4000 Default Default 20 1Layer 3500 1Layer 2Layer-Enhanced-Hybrid 2Layer-Enhanced-Hybrid 18 3000 16 2500 80 14 2000 60

12 1500 40 Latency (us) Latency (us)

10 1000 20 4 16 64 8 500 6 0 4 16 64 256 1K 4K 4 16 64 256 1K 4K Message Size (bytes) Message Size (bytes)

(a) Broadcast (MPI Bcast) (b) Allgather (MPI Allgather) 120 10000 Default Default 1Layer 9000 1Layer 100 2Layer-Enhanced-Hybrid 8000 2Layer-Enhanced-Hybrid 7000 80 250 6000 200 60 5000 150 4000

Latency (us) Latency (us) 100 40 3000 50 20 2000 4 16 64 1000 0 0 4 16 64 256 1K 4K 4 16 64 256 1K 4K Message Size (bytes) Message Size (bytes)

(c) Allreduce (MPI Allreduce) (d) Alltoall (MPI Alltoall)

Figure 6.11: Collective Communication Performance with 256 Processes

In this section, we evaluate the performance of collective operations. We deploy 64 containers across 32 VMs on 16 nodes evenly. Each VM is configured with 12 cores which reside on a single socket. Figures 6.11(a)-6.11(d) show the performance of broadcast, allre- duce, allgather and alltoall operations with 256 processes, respectively. As we proposed

112 in Section 6.2, the enhanced-hybrid design is both two-layer locality aware and NUMA

aware. With the enhanced-hybrid design, the communication across the co-resident con-

tainers within one host can be scheduled to the optimal channel, depending on the NUMA

information and message size, which improves the overall performance of collective oper-

ations. Thus, we use enhanced-hybrid design to carry out the experiments. The evaluation

results indicate that compared with the performance of the default design, our proposed

enhanced-hybrid design can clearly improve the performance by up to 57%, 75%, 85% and

29% for MPI Bcast, MPI Allreduce, MPI Allgather, MPI Alltoall, respec-

tively. And compared with the one-layer design, the proposed enhanced-hybrid design can

deliver up to 38%, 68%, 81%, and 17% performance benefit, respectively.

6.3.3 Application Performance

10 180 Default Default 9 1Layer 160 1Layer 2Layer-Enhanced-Hybrid 2Layer-Enhanced-Hybrid 8 140 7 120 6 100 5 80 4 60

3 Execution Time (s) BFS Execution Time (s) 2 40 1 20 0 0 22,20 24,16 24,20 24,24 26,16 26,20 26,24 28,16 IS MG EP FT CG LU (a) Graph 500 (b) Class D NAS

Figure 6.12: Application Performance with 256 Processes

In this section, we evaluate the performance of our proposed enhanced-hybrid design with two end applications: NAS Parallel Benchmarks (NPB) and Graph 500. We run Graph

500 and Class D NAS with 256 processes across 64 containers on 16 nodes. The evaluation results are shown in Figure 6.12(a) and 6.12(b), respectively. Compared with the default case, the proposed enhanced-hybrid design can reduce up to 16% (28,16) and 10% (LU)

113 of execution time for Graph 500 and NAS, respectively. And compared with the one-layer case, the enhanced-hybrid design also brings up to 12% (28,16) and 6% (LU) performance benefit.

6.4 Related Work

On the nested virtualization area, several studies focus on the feasibility, practicabil- ity and performance issues of using different organizations of nested environments. Ben-

Yehuda et al. [11] propose the Turtles Project to support multiple KVM and VMware hy- pervisors running in a nested way. They design new multiplexing mechanisms for CPU, memory and I/O in to overcome the limitation of extension.

Turtles introduces a small overhead even if using it inside a nested virtual machine. Oracle

Ravello system [77] provides the HVX hypervisor to support nested virtualization. It can be used for rapid development and testing of cloud systems, training and experimenting on cloud environments, etc. Due to its “consolidation” design, the performance of the ap- plication in nested VMs is excellent, and the cost of leasing hosted VMs can be reduced.

Microsoft Hyper-V also has supported nested virtualization [68] for Windows 10 system.

Different from these work, this chapter focuses on the nested virtualization in the high- performance computing scene. Our previous studies [42, 44, 118] propose locality-aware support within MPI runtime for VM- and container-based HPC cloud respectively.

114 6.5 Summary

In this chapter, we propose a high performance two-layer locality-aware and NUMA aware MPI library for nested virtualization environment on HPC cloud. Through the two- layer locality-aware design, MPI library is able to dynamically and efficiently detect co- resident containers in the same VM as well as co-resident VMs in the same host at runtime.

Thus the MPI processes across different containers and VMs can communicate to each other by shared memory or Cross Memory Attach (CMA) channels instead of network channel as long as they are co-resident. We further propose the basic-hybrid and enhanced- hybrid design with NUMA aware support, so that the proposed enhanced-hybrid design is able to adapt the different VM/container placement schemes and deliver the optimal communication performance. Our evaluation results indicate that our proposed enhance- hybrid design can bring up to 184%, 81% and 12% benefit on point-to-point, collective operations, and end applications, compared with the state-of-art design. Compared with the default performance, our enhanced-hybrid design delivers up to 184%, 85% and 16% performance improvement, accordingly.

115 Chapter 7: Co-designing with Resource Management and Scheduling Systems

It is fairly important to manage and isolate virtualized resources of SR-IOV and IVSh- mem to support running multiple concurrent MPI jobs for better flexibility and resource utilization. As this requires knowledge of and some level of control over the underlying physical hosts, it is difficult to achieve this with the MPI library alone, which is only aware of the virtual nodes and resources inside. Thus, extracting the best performance from vir- tualized clusters requires co-design with resource management and scheduling systems, which have a global view of the VMs and the underlying physical hosts. Figure 7.1 illus- trates three possible scenarios of running MPI jobs over VMs in shared HPC clusters.

IVShmem-2

Exclusive Allocation VM VM

Concurrent Jobs MPI MPI (EACJ) MPI MPI

Compute Nodes Compute VF1 VF2 IVShmem-1

Shared-hosts Allocation IVShmem-2 Exclusive Allocation Concurrent Jobs VM (SACJ) Sequential Jobs (EASJ) VM MPI VF1 VM VM MPI MPI MPI VF2

VF3 VM

VF4 VM MPI

VF1 VF2 IVShmem-1 MPI

IVShmem-1

Figure 7.1: Different Scenarios of Running MPI Jobs over VMs on HPC Cloud

116 Exclusive Allocation for Sequential Jobs (EASJ): Users exclusively allocate the phys-

ical nodes and add dedicated SR-IOV and IVShmem devices for each VM to sequentially

run MPI jobs. This scenario requires co-resident VMs select different Virtual Functions,

like VF1 and VF2, and add virtualized PCI devices mapping to the same IVShmem region,

like IVShmem-1 as shown in Figure 7.1.

Exclusive Allocation for Concurrent Jobs (EACJ): Users get exclusive allocations, but multiple IVShmem devices, like IVShmem-1 and IVShmem-2 in Figure 7.1 need to be added to each VM for multiple MPI jobs running concurrently. Because each MPI job at least needs one IVShmem device on one host to support Inter-VM shared memory based communications.

Shared-hosts Allocation for Concurrent Jobs (SACJ): In shared HPC clusters, dif- ferent users might allocate VMs on the same physical node. Each VM needs to have a dedicated SR-IOV virtual function, like VF1 to VF4. And IVShmem devices in different users’ VMs need to point to different shared memory regions on the physical node, like

IVShmem-1 and IVShmem-2 in Figure 7.1.

Unfortunately, to the best of our knowledge, none of the currently available studies on resource managers such as Slurm [40, 65] are SR-IOV and IVShmem aware. Therefore, they are not able to handle the above three scenarios of running MPI jobs. To address this challenge, we propose Slurm-V framework, which will be introduced in the following section in detail.

117 7.1 Design of Slurm-V

7.1.1 Architecture Overview of Slurm-V

To co-design with resource management and scheduling systems, we propose a Slurm-

V framework. Figure 7.2 presents an overview of Slurm-V framework. As we can see, it is based on the original architecture of Slurm. It has a centralized manager, Slurmctld, to monitor work and resources. Each compute node has a Slurm daemon, which waits for the task, executes that task, returns status, and waits for more tasks [8]. Users can put their physical resource requests and computation tasks in a batch file, submit it by sbatch to the Slurm control daemon, Slurmctld. Slurmctld will respond with the requested physical resources according to its scheduling mechanism. Subsequently, the specified MPI jobs are executed on those physical resources.

In our framework Slurm-V, three new components are integrated into the current ar- chitecture. The first component is VM Configuration Reader, which extracts the related parameters for VM configuration. Each time when users request physical resources, they can specify the detailed VM configuration information, such as vcpu-per-vm, memory-per- vm, disk-size, vm-per-node, etc. In order to support high performance MPI communica- tion, the user can also specify SR-IOV devices on those allocated nodes, and the number of IVShmem devices which is the number of concurrent MPI jobs they want to run in- side VMs. The VM Configuration Reader will parse this information, and set them in the current Slurm job control environment. In this way, the tasks executed on those physical nodes are able to extract information from job control environment and take proper actions accordingly. The second component is the VM Launcher, which is mainly responsible for launching required VMs on each allocated physical node based on user-specified VM con-

figuration. The zoom-in box in Figure 7.2 lists the main functionalities of this component.

118 If the user specifies the SR-IOV enabled device, this component detects those occupied

VFs and selects a free one for each VM. It also loads user-specified VM image from the

publicly accessible storage system, such as NFS or Lustre, to the local node. Then it gen-

erates XML file and invokes libvirtd or OpenStack infrastructure to launch VM. During

VM boot, the selected VF will be passthroughed to VM. If the user enables the IVShmem

option, this component assigns a unique ID for each IVShmem device, and sequentially

hotplugs them to VM. In this way, IVShmem devices can be isolated with each other, such

that each concurrent MPI job will use a dedicated one for inter-VM shared memory based

communication. On the aspect of network setting, each VM will be dynamically assigned

an IP address from an outside DHCP server. Another important functionality is that the

VM Launcher records and propagates the mapping records between local VM and its as-

signed IP address to all other VMs. Other functionalities include mounting global storage

systems, etc. Once the MPI job reaches completion, the VM Reclaimer is executed. Its

responsibilities include reclaiming VMs and the critical resources, such as unlocking the

passthroughed VFs, returning them to VF pool, detaching IVShmem devices and reclaim-

ing corresponding host shared memory regions.

If OpenStack infrastructure is deployed on the underlying layer, VM Launcher invokes

OpenStack controller to accomplish VM configuration, launch and destruction.

7.1.2 Alternative Designs

We propose three alternative designs to effectively support the three components.

Task-based Design: The three new components are treated as three tasks/steps in a

Slurm job. Therefore, the end-user needs to implement corresponding scripts and explicitly

119 insert them in the job batch file. After the job being submitted, srun will execute these three tasks on allocated nodes.

Slurm-V Listing 7.1: SPANK Plugin- VM Configuration sbatch File Reader Slurmd based Script Execute MPI Job MPI MPI Return Results Image load VM1 VM2 1 #!/bin/bash Submit Job Slurmd Image Image snapshot Pool 2 #SBATCH -J Slurm-V Node List Physical Request Resource Physical physical VF IVSHMEM VF IVSHMEM Lustre node 3 #SBATCH -N 2 Launch VMs Slurm-V 4 #SBATCH -p All VM Launcher 5 #SBATCH --vm-per-node=2 Slurmd VM Reclaimer Slurmctld 1. Image Management 6 #SBATCH --vcpu-per-vm=2 physical 2. SR-IOV node Passthrough SPANK 8 #SBATCH --disk-size=10G SPANK 3. Launching VMs and Task plugin over plugin Check availability based OpenStack 10 #SBATCH --sriov-ib=1 based 4. IVSHMEM Hotplug Design based Design 5. Network Setting Design Slurmd 6. Propagate VM/IP 11 #SBATCH --ivshmem=1 7. Mount global storage, etc. 12 #SBATCH --num-ivshmem=1 physical libvirtd OpenStack node 13 #SBATCH --ivshm-sz=128M 14 15 Slurm-V-run -np 8 a.out

Figure 7.2: Architecture Overview of Slurm-V

The Task-based design is portable and easy to integrate with existing HPC environments without any change to Slurm architecture. However, it is not transparent to end users as they need to explicitly insert the three extra tasks in their jobs. More importantly, it may incur some permission and security issues. VF passthrough requires that VM Launcher connects to the libvirtd instance running with the privileged system account ‘root’, which in turn exposes security threats to the host system. In addition, the scripts implementation may be varied for different users. This will impact the deployment and application performance.

To address these issues, we propose SPANK plugin-based design as discussed below.

SPANK Plugin-based Design: As introduced in Section 2.3.6, the SPANK plugin architecture allows a developer to dynamically extend functions during a Slurm job execu- tion. Listing 7.1 presents an example of a SPANK plugin-based batch job in the Slurm-V framework. As we can see from line5-line13, the user can specify all VM configuration options as inherent ones preceded with #SBATCH. The Slurm-V-run on line15 is a launcher wrapper of srun for launching MPI jobs on VMs. Also, there is no need to insert

120 extra tasks in this job script. Thus, it is more transparent to the end user compared to the

Task-based design. Once the user submits the job using sbatch command, the SPANK

plugin is loaded and the three components are invoked in different contexts.

Figure 7.3(a) illustrates the workflow of the SPANK plugin-based design in detail un-

der the Slurm-V framework. Once the user submits the batch job request, SPANK plugin

is loaded, and spank init will first register all VM configuration options specified by

the user and do a sanity checking for them locally before sending to the remote side. Then,

spank init post opt will set these options in the current job control environment so

that they are visible to all Slurmd daemons on allocated nodes later. Slurmctld identifies re-

quested resources, environment and queues the request in its priority-ordered queue. Once

the resources are available, Slurmctld allocates resources to the job and contacts the first

node in the allocation for starting user’s job. The Slurmd on that node responds to the

request, establishes the new environment, and initiates the user task specified by srun

command in the launcher wrapper. srun connects to Slurmctld to request a job step and then passes the job step credential to Slurmds running on the allocated nodes.

After exchanging the job step credential, SPANK plugin is loaded on each of the al- located nodes. During this process, spank task init privileged is invoked to

execute VM Launcher component in order to setup VM for the following MPI job. The

function spank task exit is responsible for executing VM Reclaimer component to

tear down VMs and reclaim resources. In this design, we utilize the file-based lock mech-

anism to detect occupied VFs and exclusively allocate VFs from available VF pool. With

this design, each IVShmem device will be assigned a unique ID and dynamically attached

to VM. In this way, IVShmem devices can be efficiently isolated to support running multi-

ple concurrent MPI jobs.

121 In this design, we utilize snapshot and the multi-threading mechanism to speed up the

image transfer and VM launching, respectively. This will further reduce VM deployment

time.

Slurmctld Slurmd Slurmd OpenStack daemon Slurmctld Slurmd …… submit sbatch VM Launcher sbatch job.sh request launch VM options register VM Configuration Reader load SPANK launch VM spank_init spank_init_post_opt return job queued VM Launcher VM Configuration run req receive launch status VM Launcher Reader populate VM/IP list run reply VM Launcher send Slurm-V-run VM/IP to local job step req VM pass job step env. job step reply validate job step env. execute MPI job on VM

SPANK: return spank_task_init_privileged VM Reclaimer VM Launcher request reclaim VM MPI Job across VMs reclaim VM

job step comp. SPANK: spank_task_exit return release nodes notify exit VM Reclaimer receive reclaim status exit VM Reclaimer (a) SPANK Plugin-based Design (b) SPANK Plugin over OpenStack-based Design

Figure 7.3: SPANK Plugin-based and SPANK Plugin over OpenStack-based Design

SPANK Plugin over OpenStack-based Design: This section discusses the design that combines SPANK plugin and OpenStack infrastructure. In this design, the VM Launcher and VM Reclaimer components will accomplish their functionalities by offloading the tasks to OpenStack infrastructure.

Figure 7.3(b) presents the workflow of SPANK plugin over OpenStack. When the user submits a Slurm job, SPANK plugin is loaded first. VM configuration options are registered and parsed. The difference is that, on local context, VM Launcher will send a VM launch request to OpenStack daemon on its controller node. The core component of OpenStack,

Nova, is responsible for launching VMs on all allocated compute nodes. Upon the launch completes, it returns a mapping list between all VM instance names and their IP addresses to VM Launcher. VM Launcher propagates this VM/IP list to all VMs. The MPI job will be

122 executed after this. Once the result of MPI job is returned, VM Reclaimer in local context

sends a VM destruction request to OpenStack daemon. Subsequently, VMs are torn down

and associated resources are reclaimed in the way that OpenStack defines. In addition, our

earlier work [117] describes in details about VF allocation/release and enabling IVShmem

devices for VM under OpenStack framework. In this design, except VM Configuration

Reader, the other two components work by sending requests to OpenStack controller and

receiving its returning results. There are dedicated services in OpenStack infrastructure to

manage and optimize different aspects of VM management, such as identification, image,

networking. Therefore, the SPANK plugin over OpenStack-based design is more flexible

and reliable.

7.2 Performance Evaluation

Cluster-A: This cluster has four physical nodes. Each node has dual 8-core 2.6 GHz

Intel Xeon E5-2670 (Sandy Bridge) processors with 32 GB RAM and equipped with Mel- lanox ConnectX-3 FDR (56 Gbps) HCAs.

Chameleon: [14] It has eight physical nodes, each with 24 cores delivered in dual socket Intel Xeon E5-2670 v3 (Haswell) processors, 128 GB RAM and equipped with

Mellanox ConnectX-3 FDR (56 Gbps) HCAs as well.

CentOS Linux 7 (Core) 3.10.0-229.el7.x86 64 is used as both host and guest OS. In

addition, we use KVM as the Virtual Machine Monitor (VMM), and Mellanox OpenFabrics

MLNX OFED LINUX-3.0-1.0.1 to provide the InfiniBand interface with SR-IOV support.

Our Slurm-V framework is based on Slurm-14.11.8. MVAPICH2-Virt library is used to

conduct application experiments.

123 Job Submission SSH Boot Job Submission SSH Boot VF/XML Generation IVShmem Hotplug VF/XML Generation IVShmem Hotplug Image Transfer VM/IP Propagation Image Transfer VM/IP Propagation VM Creation VM Creation

70 70

60 60

50 50

40 40

30 30 Startup Time (s) Startup Time (s) 20 20

10 10

0 0 Task SPANK SPANKoverlap Task SPANK SPANKoverlap Task SPANK SPANKoverlap Task SPANK SPANKoverlap Direct Image Copy Image Snapshot Direct Image Copy Image Snapshot

(a) Cluster-A (b) Chameleon

Figure 7.4: VM Launch Breakdown Results on Cluster-A and Chameleon

Table 7.1: VM Startup Breakdown Part Time Period Description Job Submission From submitting sbatch job to starting VM configuration VF/XML Generation Reading VM configurations, selecting available VF to generate XML Image Transfer Transferring VM image from public location to store location of each VM VM Creation Time between invoking API to create VM and its return SSH Boot Booting VM, getting available IP address until starting SSH service IVShmem Hotplug Time of completing IVShmem hotplug operation VM/IP Propagation Propagating VMs’ hostname/IP records to all VMs

7.2.1 Startup Performance

To analyze and optimize the startup performance of the Slurm-V framework, we break down the whole VM startup process into several parts. Table 7.1 describes the time period of each part.

Overlapping: We found that image transfer is independent of VF/XML generation, so they can start simultaneously after submitting the job. As shown in Figures 7.4(a) and 7.4(b), the time spent on direct image copy (2.2GB) is larger than the time spending on

VF selection and XML generation. So it can be completely overlapped. The overlapping

124 effect can be clearly observed between SPANK and SPANKoverlap under direct image copy scheme on Chameleon.

Snapshot: We also observe that direct image copy takes a large proportion of the whole

VM startup time for any startup methods on both Cluster-A and Chameleon. In order to shorten the time of image transfer, the external snapshot mechanism is applied. The original image file that user specified will be in a read-only saved state. The new file created using external snapshot will be the delta for the changes and take the original image as its backup file. All the changes from here onwards will be written to this delta file. Instead of transferring a large-size image file, we only create a small-size snapshot file for each

VM, which clearly reduces the image transfer time. In addition, the backup file can be read in parallel by running VMs. Therefore, the snapshot mechanism enhances the VM startup performance significantly. The evaluation result shows that the whole VM startup time is shortened by up to 2.64X and 2.09X on Cluster-A and Chameleon, respectively.

Total VM Launch Time: We discussed the SPANK plugin over OpenStack-based de- sign in Section 7.1.2. As VM Launcher offloads its task to OpenStack infrastructure as a whole task, we do not breakdown timings within the OpenStack operations. The evalua- tion results show that the total VM launch times are 24.6s, 23.8s, and 20.2s for SPANK plugin-based design, SPANK plugin-based design with overlap and SPANK plugin over

OpenStack-based design, respectively. Compared to other designs, SPANK plugin over

OpenStack has better total VM launch time, which is around 20s. This is because Open-

Stack, as a well-developed and relatively mature framework, has integrated optimizations on different steps of VM launch.

125 7.2.2 Scalability

In this section, we evaluate the scalability of proposed Slurm-V framework using single- threading (ST) and multi-threading (MT) schemes. In the evaluation, snapshot with over- lapping is used for both schemes. In MT case, each thread is responsible for launching one VM. From Figures 7.5(a) and 7.5(b), it can be observed that MT scheme significantly improves the VM startup performance, compared to ST scheme on both Cluster-A and

Chameleon. For instance, to launch 32 VMs across 4 nodes on Chameleon, ST scheme takes 260.11s, while MT only spends 34.88s. Compared with ST scheme, MT scheme reduces the VM startup time by up to 86% and 87% on Cluster-A and Chameleon, respec- tively. As the number of physical nodes increases, we do not see the clear increase for startup time of MT scheme. These results indicate that our proposed Slurm-V framework scales well.

[Figure: panels (a) Cluster-A and (b) Chameleon; x-axis: Number of VMs (# Nodes * # VMs); y-axis: Startup Time (s); curves: ST-snapshot-overlap and MT-snapshot-overlap.]

Figure 7.5: Scalability Studies on Cluster-A and Chameleon

7.2.3 Application Performance

The Slurm-V framework extends Slurm to manage and isolate the virtualized SR-IOV and IVShmem resources to support running multiple concurrent MPI jobs under different scenarios. In this section, we evaluate Graph500 performance under three scenarios (EASJ, EACJ, and SACJ), as indicated in Figure 7.1, with 64 processes across 8 nodes on Chameleon. Each VM is configured with 6 cores and 10 GB RAM.

For EASJ, two VMs are launched on each node. Figure 7.6(a) shows the Graph500 performance with 64 processes on 16 VMs in this scenario. The evaluation results indicate that the VMs launched by Slurm-V with SR-IOV and IVShmem support can deliver near-native performance, with less than 4% overhead. This is because the Slurm-V framework is able to efficiently isolate SR-IOV VFs and enable an IVShmem device across co-resident VMs. Co-resident VMs can perform shared memory based communication through the IVShmem device, while each VM with its dedicated VF can achieve near-native inter-node communication performance. For SACJ, four VMs, VM(0-3), are launched on each node. Graph500 is executed across all VM(0-1), while a second MPI job is executed across all VM(2-3) simultaneously. We run NAS as the second MPI job. For the native case, we use 8 cores corresponding to VM(0-1) to run Graph500, while using another 8 cores corresponding to VM(2-3) to run the second job. As shown in Figure 7.6(b), the execution time of Graph500 on VMs is similar to the native case, with around 6% overhead. This indicates that the Slurm-V framework is able to efficiently manage and isolate the virtual resources of SR-IOV and IVShmem at both the VM and user level, even in the shared allocation: one dedicated VF is passed through to each VM, and one unique IVShmem device is attached to all co-resident VMs of each user. For EACJ, similarly, our Slurm-V framework can also deliver near-native performance, with around 8% overhead, as shown in Figure 7.6(c). The Slurm-V framework supports the management and isolation of IVShmem at the MPI job level, so each MPI job can have a unique IVShmem device to perform shared memory backed communication across the co-resident VMs.

From these application studies, we see that VMs deployed by Slurm-V with appropriately managed and isolated SR-IOV and IVShmem resources are able to deliver high performance for concurrent MPI jobs, which is a promising result for running applications on shared HPC clouds.

[Figure: panels (a) EASJ, (b) SACJ, and (c) EACJ; x-axis: Problem Size (Scale, Edgefactor); y-axis: Execution Time (ms); bars: VM vs. Native.]

Figure 7.6: Graph500 Performance with 64 Processes on Different Scenarios

7.3 Related Work

For building cloud computing environments with Slurm, Jacobsen et al. [65] present 'shifter', which is tightly integrated into Slurm for managing Docker and other user-defined images. Ismael [40] uses VMs for dynamic fractional resource management and load balancing in a batch cluster environment. For building HPC cloud environments, Ruivo et al. [15] explore the potential use of SR-IOV on InfiniBand in an OpenNebula cloud towards the efficient support of MPI-based workloads. Zhang et al. [117] propose an efficient approach to build HPC clouds by extending OpenStack with a redesigned MVAPICH2 library. However, none of these works discusses how to effectively manage and isolate IVShmem and SR-IOV resources in a shared HPC cluster under the Slurm framework in order to support running MPI jobs in the different scenarios [116] presented in this chapter.

7.4 Summary

In this chapter, we propose a novel Slurm-V framework to efficiently support running multiple concurrent MPI jobs with SR-IOV and IVShmem in shared HPC clusters. The proposed framework extends the Slurm architecture and introduces three new components: VM Configuration Reader, VM Launcher, and VM Reclaimer. We present three alternative designs to support these components: the Task-based design, the SPANK plugin-based design, and the SPANK plugin over OpenStack-based design. We evaluate our Slurm-V framework from different aspects, including startup performance, scalability, and application performance under different scenarios. The evaluation results indicate that the VM startup time can be reduced by up to 2.64X by using the snapshot scheme. Compared with the single-threading scheme, the multi-threading scheme reduces the VM startup time by up to 87%. In addition, the Slurm-V framework shows good scalability and is able to support running multiple MPI jobs under different scenarios on HPC clouds.

Chapter 8: Designing High-Performance Cloud-aware GPUDirect MPI Communication Schemes on RDMA Networks

8.1 Performance Characteristics of GPU Communication Schemes on Container Environments

Choosing the optimal data movement scheme for a given message is a challenging task in the native environment. It becomes even more complicated in the container-based cloud environment because various container deployment configurations can be used in the cloud. In this section, we conduct experiments to understand the performance characteristics of the GPU communication schemes in native and cloud environments. Based on this performance study, we aim to derive design guidance for optimizing GPU communication on clouds.

8.1.1 GPU Communication Schemes on Cloud

Communication schemes on HPC systems have been substantially studied and optimized over the last few decades. However, the landscape has changed significantly since GPUs joined the HPC community. Specifically, GPU-to-GPU communication can be roughly categorized into the intra-node and inter-node cases. Intra-node refers to the case in which two or more GPU devices are installed in the same physical node, and the communication happens from one GPU buffer to another GPU buffer within that node. The inter-node case, in contrast, means that the GPU-to-GPU communication needs to go across different physical nodes via the network. Several data movement mechanisms exist in these two cases, and Figure 8.1 illustrates them for better understanding.

[Figure: four containers (A-D) on one node, each with CPU cores and a GPU with GDDR memory, connected through the chipset to system memory and an InfiniBand HCA; arrows indicate the cudaMemcpy, GDRCOPY, cudaIPC, and GDR data movement paths.]

Figure 8.1: Data Movement Strategies between GPUs in Container Environments within a node

• Intra-node

cudaIPC: CUDA Inter-Process Communication (IPC) facilitates direct copies of data between GPU device buffers allocated by different processes on the same node, which bypasses the host memory and thus eliminates the data staging overhead (from GPU device memory to host memory). As shown in Figure 8.1, this is only applicable in the intra-node case; a minimal sketch of the underlying handle exchange is given after this list.

cudaMemcpy: Whenever cudaIPC is not available or does not provide good performance, an explicit data staging scheme through a shared memory region on the host is unavoidable. cudaMemcpy is one such data staging scheme; it copies data between GPU device memory and host memory, with the direction of the copy specified by the caller.

GDRCOPY: GDRCOPY is another data staging scheme, which provides a low-latency GPU memory copy operation based on NVIDIA GPUDirect RDMA technology. Basically, it offers the infrastructure to create user-space mappings of GPU memory via one PCIe BAR (Base Address Register) of the GPU. The user-space mappings can then be manipulated as if they were plain host memory [88], as shown in Figure 8.1.

• Inter-node

GDR: GDR technology enables a path for moving data to/from GPU device memory over an InfiniBand Host Channel Adapter (HCA) that completely bypasses the host CPU and its memory. If the GDR feature is available, the HCA can directly read the source data from one GPU's memory and write it to another GPU's memory. However, due to performance concerns, many communication runtimes have designs that stage the GPU-resident data through the host memory, where an advanced host-based pipeline design is common [85]. The same staging schemes as described in the intra-node case can be applied here as well.

GDR-loopback: In the container-based cloud environment, container deployment is flexible, and multiple containers can be deployed on the same node. However, they do not recognize each other, even though the communicating peers are physically within the same node. The communication in this case actually operates in a loopback manner over the GDR scheme.
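The cudaIPC path described above amounts to exporting a device allocation in one process and opening it in another before issuing a device-to-device copy. The following minimal sketch illustrates that handle exchange between two co-located MPI ranks; it is an illustration of the CUDA IPC API, not the channel implementation inside the MPI library, and error handling is omitted for brevity.

    /* Minimal sketch: rank 0 exports a device buffer via a CUDA IPC handle,
     * rank 1 opens it and copies directly device-to-device. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        size_t bytes = 1 << 20;              /* 1 MB payload */
        void *dbuf;
        cudaMalloc(&dbuf, bytes);

        if (rank == 0) {
            cudaIpcMemHandle_t handle;
            cudaIpcGetMemHandle(&handle, dbuf);            /* export device buffer */
            MPI_Send(&handle, sizeof(handle), MPI_BYTE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            cudaIpcMemHandle_t handle;
            MPI_Recv(&handle, sizeof(handle), MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            void *peer;
            cudaIpcOpenMemHandle(&peer, handle, cudaIpcMemLazyEnablePeerAccess);
            cudaMemcpy(dbuf, peer, bytes, cudaMemcpyDeviceToDevice); /* no host staging */
            cudaIpcCloseMemHandle(peer);
        }

        cudaFree(dbuf);
        MPI_Finalize();
        return 0;
    }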

8.1.2 Performance Study of GPU Communication on Cloud

In this section, we conduct experiments to understand the performance characteristics of GPU-to-GPU communication with different data movement schemes in the native and cloud environments.

The experiments are conducted on a testbed cloud as described in Section 8.3. We use MVAPICH2-GDR, a GPU-aware MPI library, and the OSU Micro-Benchmark (OMB) suite to evaluate latency and bandwidth with different data movement strategies across multiple message sizes. We use two Docker container deployments to accommodate the different data movement strategies in the cloud environment. To evaluate the performance of cudaMemcpy, GDRCOPY, and cudaIPC, one Docker container equipped with four CPU cores and two GPUs is deployed. To evaluate the performance of GDR-loopback, two Docker containers are deployed on the same host; each container is allocated four CPU cores and one dedicated GPU device, and the HCA is shared by the two containers. In this deployment, each container launches one MPI process, and the two processes exchange data residing in GPU memory. The latency and bandwidth in the native environment are also presented as a reference. Since the performance of GPU-to-GPU communication is well studied and tuned in the native environment, we use the default runtime configuration.

The experiments are conducted over ten runs, and each run has 1,000 iterations. Intuitively, one may expect similar performance between the cloud and native environments because they essentially have the same physical configuration; that is, two MPI processes are communicating with each other using data on their dedicated GPU devices on the same host. However, as presented in Figures 8.2 and 8.3, we observe a clear performance difference among the different data movement strategies in the container environment. This observation implies the necessity and significance of studying GPU-to-GPU communication performance in the cloud environment.

8.1.2.1 Latency-sensitive Benchmark

The latency-sensitive benchmark, e.g., osu_latency in OMB, uses blocking communication interfaces such as MPI_Send and MPI_Recv to ensure the completion of each communication operation.
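As a concrete illustration, the core of such a latency test is a ping-pong loop over device buffers, roughly as sketched below. This is a simplified version of the osu_latency pattern, assuming a CUDA-aware MPI library such as MVAPICH2-GDR; warm-up iterations and error checks are omitted.

    /* Simplified GPU-to-GPU ping-pong latency loop (osu_latency style). */
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int iters = 1000;
        const int bytes = 4;                    /* message size under test */
        char *dbuf;
        cudaMalloc((void **)&dbuf, bytes);      /* send/recv directly from device memory */

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(dbuf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(dbuf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(dbuf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(dbuf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();
        if (rank == 0)
            printf("latency: %.2f us\n", (t1 - t0) * 1e6 / (2.0 * iters));

        cudaFree(dbuf);
        MPI_Finalize();
        return 0;
    }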

From Figure 8.2, we can see that GDRCOPY in the container environment delivers the lowest latency for small messages (1-16 bytes), GDR-loopback achieves the optimal performance for medium messages (16 bytes-16 KB), and cudaIPC outperforms the other schemes for large messages. Because of the high latency of GDRCOPY at these sizes, we remove it from Figure 8.2(b) in order to show a clear performance comparison among the other schemes. Our observation indicates that no single data movement strategy is beneficial for all message sizes. It is critical to carefully coordinate the different data movement strategies according to the message size.

Moreover, we notice that the shared memory based intra-node GPU-to-GPU data movement schemes, such as GDRCOPY and cudaIPC, cannot be applied in the co-located containers scenario due to the lack of locality-aware support. The only scheme such containers can utilize is GDR-loopback, even though the communicating peers are physically co-located.

8.1.2.2 Bandwidth-sensitive Benchmark

Here, a bandwidth test, e.g., osu_bw in OMB, is performed between two processes within a node. The test issues multiple non-blocking communication calls, such as MPI_Isend and MPI_Irecv, to saturate the available bandwidth of the IB HCA.
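A sketch of the corresponding window-based bandwidth loop is shown below, again a simplified osu_bw-style pattern over device buffers and assuming a CUDA-aware MPI library; warm-up and error handling are omitted.

    /* Simplified GPU-to-GPU bandwidth loop (osu_bw style). */
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int window = 64, iters = 100;
        const int bytes = 1 << 20;              /* 1 MB messages */
        char *dbuf;
        cudaMalloc((void **)&dbuf, bytes);
        MPI_Request reqs[64];

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                for (int w = 0; w < window; w++)   /* keep many sends in flight */
                    MPI_Isend(dbuf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &reqs[w]);
                MPI_Waitall(window, reqs, MPI_STATUSES_IGNORE);
                MPI_Recv(NULL, 0, MPI_CHAR, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                for (int w = 0; w < window; w++)
                    MPI_Irecv(dbuf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &reqs[w]);
                MPI_Waitall(window, reqs, MPI_STATUSES_IGNORE);
                MPI_Send(NULL, 0, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();
        if (rank == 0)
            printf("bandwidth: %.2f MB/s\n",
                   (double)bytes * window * iters / (t1 - t0) / 1e6);

        cudaFree(dbuf);
        MPI_Finalize();
        return 0;
    }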

[Figure: panels (a) 1-16K bytes and (b) 32K-4M bytes; x-axis: Message Size (Bytes); y-axis: Latency (us); curves: GDR-loopback, GDRCOPY, cudaMemcpy, cudaIPC (GDRCOPY omitted in panel (b)).]

Figure 8.2: Latency comparison of data movement strategies on Docker container environment within a node

As shown in Figure 8.3, we can observe that the performance differs significantly across the data movement schemes in the container environment, as we have already seen in the latency tests. This again implies that the communication paths and data movement strategies need to be carefully selected for different message sizes in the container environment.

We also notice that the switch points to the optimal schemes differ between the latency-sensitive and bandwidth-sensitive tests. For instance, in Figure 8.2(b), in order to deliver the lowest latency, GDR-loopback should be switched to cudaIPC at around a 16 KB message size, while the switch point is approximately 512 KB for the bandwidth test in Figure 8.3(b).

8.1.3 Analysis and Design Principles for Optimal GPU Communication on Cloud

[Figure: panels (a) 1-64 bytes and (b) 128 bytes-4M bytes; x-axis: Message Size (Bytes); y-axis: Bandwidth (MB/s); curves: GDR-loopback, GDRCOPY, cudaMemcpy, cudaIPC.]

Figure 8.3: Bandwidth comparison of data movement strategies on Docker container environment within a node

The major difference between the container and native environments is the capability to detect the physical location of CPUs and GPUs. In the container environment, the communication between co-located containers is always treated as the inter-node case due to the lack of locality-aware capability in current communication runtimes. Therefore, the GDR-loopback communication path will always be used, and the GPU communication cannot leverage other communication schemes such as GDRCOPY, cudaMemcpy, and cudaIPC. The experimental results in Figures 8.2 and 8.3 provide the following insights:

1) No single data movement scheme can deliver the optimal communication performance over all the different message sizes.

2) In order to deliver the optimal communication performance, it is necessary to appropriately coordinate the different data movement strategies.

3) For the co-located container case, the shared memory based intra-node data movement schemes cannot be applied, even though they perform best at some message sizes. Therefore, locality-aware support is required to enable the optimal communication channel.

4) Comparing Figure 8.2(b) with Figure 8.3(b), we find that the switch points among the optimal schemes are different for the latency-intensive and bandwidth-intensive tests.

Based on these insights, the design principles for optimal GPU-based communication schemes in the cloud environment can be summarized as follows:

• Locality-aware support is required to allow runtimes to enable the intra-node communication paths, such as GDRCOPY, cudaMemcpy, and cudaIPC, when applicable.

• An intelligent communication path scheduling mechanism is needed to allow runtimes to dynamically select the optimal communication path and data movement scheme for a given message size.

• A real-time workload characterization tracing mechanism is needed to allow runtimes to be aware of latency-sensitive or bandwidth-sensitive communication workloads and to dynamically switch the communication path during application runtime.

8.2 Proposed Design of C-GDR in MVAPICH2

In this section, we take MVAPICH2, a popular open-source MPI library, as a case study to provide high-performance cloud-aware GPUDirect communication schemes on RDMA networks, based on the insights and guidance we explored in Section 8.1 for the container-based HPC cloud environment. Figure 8.4 presents an overview of our case study. As we can see, a node is equipped with one multi-core processor, one HCA, and multiple GPU devices. Accordingly, multiple containers are deployed to fully take advantage of these powerful computing resources.

In order to support high-performance GPUDirect communication schemes over RDMA networks in container-based HPC cloud environments, three new modules are introduced into the MPI library: a GPU Locality-aware Detection module, a Workload Characterization Tracing module, and a Communication Coordinator (Scheduling) module. As we introduced in Section 8.1, there exist multiple different communication paths on a GPU-based platform. In the bare-metal environment, the MVAPICH2 library uses the cudaMemcpy, cudaIPC, and GDRCOPY communication channels for intra-node GPU-to-GPU (device-to-device) message transfers, while utilizing the GDR and Host-based Pipeline channels for inter-node GPU-to-GPU communication, as presented in the bottom layer of Figure 8.4.

In the container-based HPC cloud environment, the communication channels and the communication channel coordination can work in the same way as in the bare-metal environment. However, GPU-to-GPU communication between two co-resident containers will be treated as inter-node communication (GDR-loopback), due to the lack of GPU locality-aware support. Therefore, the GPU Locality-aware Detection module helps the MPI runtime and the applications running on top of it to dynamically and transparently detect the MPI processes in co-resident containers. With this module, MPI-based communications between co-resident GPUs have the opportunity to be rescheduled to more efficient communication channels. Moreover, there can be multiple different container deployment schemes on a NUMA architecture, and the communication between co-resident GPUs can be significantly affected by the varying container deployments from both functionality and performance perspectives. The NUMA-aware Support module is responsible for providing NUMA information to MPI processes. With the aid of the NUMA-aware Support module, the source process is able to figure out whether the destination process is running on the same socket or a different one before the real communication takes place. The Communication Scheduling module leverages the GPU locality information and NUMA information generated by the GPU Locality-aware Detection module and the NUMA-aware Support module, respectively, to reschedule the communication through the appropriate and optimal underlying channel, based on the communication characteristics we explored for the container-based HPC cloud environment in Section 8.1.

[Figure: node architecture with GPUs (GDDR memory), HCA, chipset, system memory, and Containers A and B with CPU cores; software layers: Locality-aware Support and Workload Tracing Support feeding Communication Coordination over the cudaIPC, GDRCOPY, cudaMemcpy, GDR, and Host-based Pipeline channels.]

Figure 8.4: Overview of GPU Locality-aware Detection in C-GDR

8.2.1 GPU Locality-aware Detection

The GPU Locality-aware Detection module is responsible for dynamically and transparently detecting the location of the communicating processes on co-resident GPUs. Since shared memory segments, semaphores, and message queues can be shared across multiple Docker containers by sharing the IPC namespace when launching the containers, we allocate such a shared memory segment on each physical node and create a GPU Locality-aware List in it. Each MPI process associated with one GPU in a co-resident container writes its own locality information into this shared list structure according to its global rank. After a synchronization, it is guaranteed that the locality information of all local MPI processes has been collected and stored in the GPU Locality-aware List. If the user launches two MPI processes to carry out GPU-to-GPU communication, the GPU Locality-aware Detection module is able to quickly identify whether the communication is between co-resident GPUs by checking the locality information in the list according to their global MPI ranks.

Figure 8.5 illustrates an example of launching a 6-process MPI job. Two containers (Container-A and Container-B) are deployed on the same host, and each container is equipped with one GPU device. There is one MPI process in each container, and the other four MPI processes are running on another host. In the GPU Locality-aware Detection module, the two MPI processes (rank 0 and rank 1) write their identifications at positions 0 and 1 of the GPU Locality-aware List, respectively. The other four positions in the list hold '0', as those four MPI processes are not running on the same host. If the MPI processes with ranks 0 and 1 are going to execute GPU-to-GPU message transfers, the co-residence of the GPU devices can be efficiently identified by checking the corresponding positions in the GPU Locality-aware List. The number of local processes on the host can be acquired by traversing the list and counting the positions with written identifications, and their local ordering is still maintained by their positions in the list. It is costly to frequently access the GPU Locality-aware Detection module for each message transfer. Each MPI process, therefore, scans the locality results generated by the GPU Locality-aware Detection module and maintains its own local copy for all the peer processes. When considering process migration or other scenarios that might cause the locality to change, the proposed Locality-aware Detection module needs to be re-triggered to update the locality information. Taking migration as an example, the communication channel is suspended before migration to guarantee that there are no in-flight messages during migration [43]. Once the migration procedure finishes, the locality information of all processes needs to be re-detected in order to resume the communication. That is, all the communication after the migration proceeds according to the re-detected results, preventing communication with inconsistent locality information.

In the design of the GPU Locality-aware Detection module, the GPU Locality-aware List uses multiple bytes, as a byte is the smallest granularity of memory access that does not require a lock. A fixed number of bytes is used to tag each MPI process. This guarantees that multiple processes belonging to co-resident containers are able to write their locality information at their corresponding positions concurrently without introducing lock and unlock operations, which reduces the overhead of the locality detection procedure. Moreover, the proposed method does not incur much cost for traversing the list. For instance, an MPI job with one million processes only occupies T × 1M bytes of memory for the list, where T is the fixed number of bytes used to tag each MPI process. The space complexity is O(N), where N is the number of MPI processes, which provides good scalability in the container-based HPC cloud environment. A minimal sketch of such a shared locality list is given below.
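The sketch below shows one way such a byte-per-rank list could be placed in a POSIX shared memory segment visible to all containers that share the IPC namespace, mirroring the /dev/shm/residency segment shown in Figure 8.5. The segment name, the one-byte tag size, and the helper functions are illustrative, not MVAPICH2's internal implementation.

    /* Minimal sketch of a byte-per-rank locality list in POSIX shared memory. */
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define SHM_NAME "/residency"     /* appears as /dev/shm/residency */

    unsigned char *attach_locality_list(int nprocs, int my_rank) {
        int fd = shm_open(SHM_NAME, O_CREAT | O_RDWR, 0666);
        if (fd < 0) return NULL;
        ftruncate(fd, nprocs);                       /* one tag byte per MPI rank */
        unsigned char *list = mmap(NULL, nprocs, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);
        close(fd);
        if (list == MAP_FAILED) return NULL;
        list[my_rank] = 1;      /* mark this rank as resident on this host;   */
                                /* byte-granularity writes need no lock        */
        return list;            /* after a barrier: list[r] == 1 <=> rank r local */
    }

    int is_co_resident(const unsigned char *list, int peer_rank) {
        return list[peer_rank] == 1;
    }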

[Figure: two containers (A and B) sharing an IPC namespace on one host, each with a GPU and an MPI rank (0 and 1); both ranks tag their positions in the GPU Locality-aware List stored at /dev/shm/residency (entries 1 1 0 0 0 0).]

Figure 8.5: GPU Locality-aware Detection Module in C-GDR

In addition, there can be different placement schemes to deploy the containers on a NUMA architecture, and the communication performance will be affected by the placements accordingly. The GPU Locality-aware Detection module can also be used to provide NUMA information about peer MPI processes to the subsequent Communication Scheduling module, so that performance bottlenecks and functionality limitations can be avoided during the communication rescheduling phase. We assume that the administrators or the cloud deployment stack specify the CPU cores on which to launch the containers, and that different containers are not launched on the same sets of cores, to eliminate unnecessary performance interference. When the Docker engine is invoked to launch a container with the specified core IDs, it forms a tuple with the container name, the corresponding core IDs, and the associated NUMA node ID, (Container, Cores, Sockets), as shown in Figure 8.6. Such a tuple is then exported to each MPI process in the co-resident containers through the shared IPC namespace, like MPI Rank 0 and MPI Rank 1 in Figure 8.6. If the destination process is identified as co-resident through the GPU Locality-aware Detection module, NUMA-aware Support is triggered to further compare the NUMA node ID of the destination MPI process with its own ID to identify the relative NUMA information; more specifically, it can be identified whether the message transfer will cross a socket or not. A small sketch of this comparison is shown below.
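The following sketch illustrates this idea with a tuple table shared through the IPC namespace (e.g., a file such as /dev/shm/hosts as in Figure 8.6). The record format, file parsing, and helper names are hypothetical, intended only to show how the relative NUMA information could be derived.

    /* Illustrative only: read (container, core range, socket) tuples from a
     * shared file and decide whether a peer is on the same NUMA socket. */
    #include <stdio.h>
    #include <string.h>

    struct placement { char container[32]; int core_lo, core_hi, socket; };

    /* Parse lines of the hypothetical form "A 0 3 0" (name, first core,
     * last core, socket). */
    static int load_placements(const char *path, struct placement *p, int max) {
        FILE *f = fopen(path, "r");
        if (!f) return 0;
        int n = 0;
        while (n < max && fscanf(f, "%31s %d %d %d", p[n].container,
                                 &p[n].core_lo, &p[n].core_hi, &p[n].socket) == 4)
            n++;
        fclose(f);
        return n;
    }

    /* Return 1 if the two containers are pinned to the same socket. */
    int same_socket(const struct placement *p, int n,
                    const char *mine, const char *peer) {
        int s_mine = -1, s_peer = -2;
        for (int i = 0; i < n; i++) {
            if (strcmp(p[i].container, mine) == 0) s_mine = p[i].socket;
            if (strcmp(p[i].container, peer) == 0) s_peer = p[i].socket;
        }
        return s_mine == s_peer;
    }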

8.2.2 Workload Characterization Tracing

In Section 8.1, we observed that latency and bandwidth have different communication channel switch points in the container environment. This implies that the channel switch point must be controlled dynamically at runtime in order to deliver the optimal communication performance for different types of workloads.

[Figure: each MPI rank in the co-resident containers holds a copy of the (Container, Cores, Sockets) tuples (e.g., A: cores 0-3, socket 0; ...; E: core 15, socket 1), exported through the shared IPC namespace at /dev/shm/hosts.]

Figure 8.6: NUMA-aware Support in Locality-aware Detection Module

C-GDR provides the Workload Characterization Tracing module, which is responsible for keeping track of communication patterns. For instance, the Workload Characterization Tracing module can persistently record the use of MPI_Send/MPI_Recv and MPI_Isend/MPI_Irecv to decide whether the workload is latency-intensive or bandwidth-intensive. Figure 8.7 shows rank 0's view of the Workload Characterization Tracing module. When the process with rank 0 needs to send/receive a message to/from the process with rank 3, it first checks the locality information of the destination process (rank 3) through its locality detection module. If the destination process is detected as a co-located process, it updates the Send/Recv counter if the communication is in blocking mode; otherwise, the Isend/Irecv counter is updated for non-blocking communication. Once one of these two counters exceeds the predefined threshold, the communication channel switch point is adaptively updated. The workload characterization tracing results can be quickly updated and easily maintained in the performance-critical path, and thus do not incur severe performance overhead. A minimal sketch of such a tracing structure follows.
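The following sketch shows one plausible shape for the per-peer counters and the threshold check. The structure, the threshold value, and the switch-point constants are illustrative only (the constants follow Table 8.1), not the library's actual data structures.

    /* Illustrative per-peer tracing state; not MVAPICH2's internal structures. */
    #include <stddef.h>
    #include <stdint.h>

    #define TRACE_THRESHOLD 1024   /* hypothetical trigger count */

    typedef enum { WL_UNKNOWN, WL_LATENCY, WL_BANDWIDTH } workload_t;

    struct peer_trace {
        uint64_t send_recv;        /* blocking MPI_Send/MPI_Recv issued to peer */
        uint64_t isend_irecv;      /* non-blocking MPI_Isend/MPI_Irecv to peer  */
        workload_t workload;       /* current classification                    */
        size_t switch_point;       /* GDR-loopback -> cudaIPC switch (bytes)    */
    };

    /* Called on every send/recv to a co-located peer. */
    static void trace_update(struct peer_trace *t, int is_blocking) {
        if (is_blocking) t->send_recv++; else t->isend_irecv++;

        if (t->send_recv >= TRACE_THRESHOLD && t->workload != WL_LATENCY) {
            t->workload = WL_LATENCY;
            t->switch_point = 16 * 1024;        /* 16 KB, per Table 8.1  */
        } else if (t->isend_irecv >= TRACE_THRESHOLD && t->workload != WL_BANDWIDTH) {
            t->workload = WL_BANDWIDTH;
            t->switch_point = 512 * 1024;       /* 512 KB, per Table 8.1 */
        }
    }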

[Figure: rank 0 maintains locality information for ranks 1-7 (ranks 1-3 marked local) and, for each co-located destination such as rank 3, a pair of Send/Recv and Isend/Irecv counters.]

Figure 8.7: Workload Characterization Tracing Module in C-GDR

8.2.3 Communication Scheduling

The Communication Scheduling module reschedules each message to go through the appropriate communication channel in order to deliver the optimal GPU-to-GPU communication performance in the container-based cloud environment. Figure 8.8 presents the architecture of the Communication Scheduling module. The module contains four function units: the GPU Locality Loader, the Workload Characterization Parser, the Message Attribute Parser, and the Communication Scheduler. The GPU Locality Loader reads the locality information, including the NUMA placement of the destination process, from the GPU Locality-aware Detection module. The Workload Characterization Parser parses the tracing results from the Workload Characterization Tracing module. The Message Attribute Parser obtains the attributes of the message, such as message type and message size. For a communication request to a specific destination process, the Communication Scheduler selects the appropriate communication channel based on all the information from the above three aspects. By utilizing the Locality-aware Detection module, the communication between co-located processes is able to use the high-performance intra-node communication channels, such as GDRCOPY, cudaMemcpy, and cudaIPC, for different message sizes. From our experiments, we observe that the GDR-loopback scheme can deliver better performance than the shared memory based intra-node data movement schemes for some message sizes; in this scenario, the Communication Scheduling module can also select the GDR-loopback communication channel for that range of message sizes, even though the communicating processes are detected as co-located. If the workload characterization tracing results indicate that one of the counters has reached the predefined threshold, the Communication Scheduling module re-schedules the communication channel based on the comparison between the message size and the channel switch point. For instance, once the Isend/Irecv counter exceeds the threshold, the workload is identified as bandwidth-intensive and the switch point from GDR-loopback to cudaIPC is updated from 16 KB to 512 KB. After that, messages smaller than 512 KB still go through the GDR-loopback channel.

[Figure: the Locality Detector and Workload Characterization Tracer feed the GPU Locality Loader, Workload Characterization Parser, and Message Attribute Parser; their outputs drive the Communication Scheduler, which dispatches to the cudaIPC, GDRCOPY, cudaMemcpy, GDR, and Host-based Pipeline channels.]

Figure 8.8: Communication Scheduling Module in C-GDR

Through our experiments, we summarize the final, optimal scheduling policy for container-based cloud environments in Table 8.1. For latency-sensitive workloads, GDRCOPY is selected for GPU-to-GPU communication with message sizes of 1-16 bytes. For message sizes larger than 16 bytes and smaller than 16 KB, GDR-loopback is selected instead of the intra-node data movement schemes, and cudaIPC is utilized for large message transfers.

Table 8.1: Best Schemes Discovered for Given Message Ranges for Latency-sensitive and Bandwidth-sensitive Benchmarks

                            Latency-sensitive    Bandwidth-sensitive
    msg ≤ 16B               GDRCOPY              GDRCOPY
    16B < msg < 16KB        GDR-loopback         GDR-loopback
    16KB ≤ msg ≤ 512KB      cudaIPC              GDR-loopback
    msg > 512KB             cudaIPC              cudaIPC

As the performance characterization is different for the bandwidth-sensitive workloads, GDR-loopback is chosen as the optimal GPU-to-GPU communication scheme for message sizes ranging from 16 bytes to 512 KB. A sketch of a channel selection routine encoding this policy is shown below.
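The routine below is a minimal sketch of how the policy of Table 8.1 could be encoded; the channel names, size constants, and function itself are illustrative rather than the library's actual implementation.

    /* Minimal sketch of a channel selection routine encoding Table 8.1. */
    #include <stddef.h>

    typedef enum { CH_GDRCOPY, CH_GDR_LOOPBACK, CH_CUDAIPC, CH_GDR } channel_t;
    typedef enum { WL_LATENCY, WL_BANDWIDTH } workload_t;

    channel_t select_channel(size_t msg, workload_t wl, int co_resident) {
        if (!co_resident)
            return CH_GDR;                         /* true inter-node transfer */

        size_t ipc_switch = (wl == WL_LATENCY) ? 16 * 1024    /* 16 KB  */
                                               : 512 * 1024;  /* 512 KB */
        if (msg <= 16)
            return CH_GDRCOPY;                     /* tiny messages            */
        if (msg < ipc_switch)
            return CH_GDR_LOOPBACK;                /* small/medium messages    */
        return CH_CUDAIPC;                         /* large messages           */
    }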

8.3 Performance Evaluation

8.3.1 Experimental Testbed

Our testbed consists of eight physical nodes. Each node has a dual-socket 28-core 2.4 GHz Intel Xeon E5-2680 (Broadwell) processor with 128 GB main memory and is equipped with Mellanox ConnectX-4 EDR (100 Gbps) HCAs and one NVIDIA K-80 GK210GL GPU. Please note that each K-80 is a dual-GPU card. The two GPU cards and HCAs are connected to the same socket. We deploy 16 Docker containers using NVIDIA Docker 1.01 [74] on these eight physical nodes to make the images agnostic of the NVIDIA driver. Each node hosts two containers, which are pinned to the same socket, and each container is equipped with one GPU card.

On both the physical nodes and the containers, we use CentOS Linux 7 as the OS. In addition, we use the Mellanox OpenFabrics Enterprise Distribution (OFED) [75] MLNX_OFED_LINUX-3.4-2.0.0, NVIDIA Driver 384.81, and CUDA Toolkit 8.0. 'Native' denotes the performance of the process running in the bare-metal environment. 'Container-Def' denotes the performance of the process running in the container environment, bound to the same physical cores and GPU device as in the 'Native' scheme. 'Container-Opt' denotes the corresponding performance in the container with our proposed optimizations.

8.3.2 MPI Level Point-to-Point Micro-benchmarks

[Figure: six panels comparing Native, Container-Def, and Container-Opt: (a) intra-node latency, small messages; (b) intra-node latency, large messages; (c) inter-node latency; (d) intra-node bandwidth, small messages; (e) intra-node bandwidth, large messages; (f) inter-node bandwidth; x-axes: Message Size (Bytes); y-axes: Latency (us) or Bandwidth (MB/s).]

Figure 8.9: MPI Point-to-Point Performance for GPU to GPU Communication

In this subsection, we evaluate the influence of our proposed designs on communication performance with micro-benchmarks. We focus on the inter-node and single-node inter-container cases in this section.

Figures 8.9(a) and 8.9(d) show the latency and bandwidth results for small messages. We can clearly observe that both the Native and proposed container (Container-Opt) schemes perform better than the default container (Container-Def) scheme for 1-16 byte messages. The performance benefit comes from choosing GDRCOPY as the optimal communication channel. For example, the latencies of Native, Container-Def, and Container-Opt with a 4-byte message size are 1.55 us, 2.1 us, and 1.57 us, respectively. Compared to the Container-Def case, the Container-Opt scheme reduces the latency by up to 27%. The bandwidths of Native, Container-Def, and Container-Opt with a 16-byte message size are 12.61 MB/s, 10.14 MB/s, and 12.55 MB/s, respectively. Compared to the Container-Def scheme, the Container-Opt scheme improves the bandwidth by up to 24%. In addition, Container-Opt incurs only minor overhead compared to the Native scheme.

From Figures 8.9(a), 8.9(b), 8.9(d), and 8.9(e), we can clearly observe that the Container-Def and Container-Opt schemes have similar performance beyond a 16-byte message size. This is because the GDR-loopback scheme is selected by the communication coordinator. For example, the latencies of Container-Def and Container-Opt with an 8 KB message size are 7.2 us and 7.34 us, respectively, as shown in Figure 8.9(b). The bandwidths of Container-Def and Container-Opt with a 256 KB message size are 6.59 GB/s and 6.43 GB/s, respectively, as shown in Figure 8.9(e). Compared to the Native scheme, both Container-Opt and Container-Def deliver near-native performance.

For large message sizes, cudaIPC performs better than the other schemes, which is why Container-Opt brings up to 46% improvement in latency and 32% improvement in bandwidth compared with Container-Def. Please note that the optimal communication channel switch points (from GDR-loopback to cudaIPC) are different for latency (around 16 KB) and bandwidth (around 512 KB), which further verifies the performance characterization results in Table 8.1. As we can see, the proposed Container-Opt keeps delivering near-native performance.

Figures 8.9(c) and 8.9(f) show the inter-node latency and bandwidth results. We can clearly observe that both the Container-Def and Container-Opt schemes achieve performance similar to the Native scheme in terms of both latency and bandwidth. The MPI point-to-point micro-benchmark results indicate that the Container-Opt scheme always selects the optimal communication channel for different message sizes and thus achieves optimal performance compared with the Container-Def scheme; in some cases, it is even better than the native performance.

8.3.3 MPI Level Collective Micro-benchmarks

In this section, we evaluate our C-GDR communication schemes with five MPI-level collective operations: MPI_Bcast, MPI_Allgather, MPI_Reduce, MPI_Allreduce, and MPI_Alltoall. We choose these five collective operations since they are widely used by GPU-based applications. The performance results are shown in Figure 8.10. The evaluation results indicate that our optimized communication scheme, Container-Opt, can achieve near-native collective performance. Compared with the performance of the Container-Def scheme in the container environment, our optimized scheme brings up to 63%, 66%, 49%, and 50% performance improvement for MPI_Bcast, MPI_Allgather, MPI_Allreduce, and MPI_Alltoall, respectively.

8.3.4 Application Performance

In this section, we evaluate our proposed C-GDR scheme, Container-Opt, with several end applications: the Jacobi solver, HOOMD-blue Lennard-Jones liquid (Hoomd-LJ), and Anelastic Wave Propagation (AWP-ODC), as shown in Figure 8.11. Jacobi solves the Poisson equation on a rectangle with Dirichlet boundary conditions. It leverages CUDA-aware MPI to directly send/receive (MPI_Sendrecv) through the device buffer without staging the data in a host buffer. We can adjust the message size for sending and receiving: Jacobi-16B and Jacobi-512KB mean that we use 16 bytes and 512 KB as the message size, respectively.

[Figure: four panels, (a) MPI_Bcast, (b) MPI_Allgather, (c) MPI_Allreduce, and (d) MPI_Alltoall, comparing Native, Container-Def, and Container-Opt; x-axes: Message Size (Bytes); y-axes: Latency (us); annotated improvements of 63%, 66%, 49%, and 50%, respectively.]

Figure 8.10: MPI Collective Communication Performance across 16 GPU Devices

In the Jacobi-16B case, the evaluation results indicate that our proposed Container-Opt brings a 25% communication performance improvement compared with the default case, while having performance similar to that of the native environment. This is because GDRCOPY is used in the Native and Container-Opt cases, which brings optimal communication performance for 16-byte message transfers, as summarized in Table 8.1.

In the Jacobi-512KB case, the cudaIPC scheme is used to deliver the optimal communication performance. This is why we see similar communication times between the Native and Container-Opt schemes, and a 26% performance improvement compared with the Container-Def scheme.

Both Hoomd-LJ and AWP-ODC predominantly use messages of around 1 MB for communication, and the intra-node communication scheme (cudaIPC) performs better than the GDR-loopback scheme, as summarized in Table 8.1. Accordingly, we can see from Figure 8.11 that our proposed Container-Opt is able to achieve the optimal performance for Hoomd-LJ and AWP-ODC. It brings 10% and 14% performance improvements for Hoomd-LJ and AWP-ODC, respectively, compared with the Container-Def case.

[Figure: normalized rates for Native, Container-Def, and Container-Opt on Jacobi-16B and Jacobi-512KB (communication time), HOOMD-LJ (TPS), and AWP-ODC (GFLOPS).]

Figure 8.11: Application Performance across 16 GPU Devices (for Communication Time, lower is better; for TPS and GFLOPS, higher is better)

8.4 Related Work

There are four ways to use a GPU in a Virtual Machine (VM): I/O pass-through, device emulation, API remoting, and mediated pass-through. In a virtualized environment, a GPU can be directly passed through to a specific VM [19]. Using this technique, Amazon [1] has provided GPU instances to customers for high-performance computing. Intel has introduced VT-d, which allows a GPU to be passed to a virtual machine exclusively [21]. With GPU device passthrough, the device is dedicated to a specific virtual machine, so it sacrifices the sharing capability of the virtualized environment. CPU virtualization can be done through device emulation; however, such an emulation technique is difficult to apply to GPUs.

GPU virtualization can also be achieved through API remoting, which is commonly used in commercial software. API remoting forwards graphics commands from the guest OS to the host. VMGL [55] replaces the standard OpenGL library in Linux guests with its own implementation to pass the OpenGL commands to the VMM. Shi et al. present a CUDA-oriented GPU virtualization solution in [94]; it uses API interception to capture CUDA calls on the guest OS with a wrapper library and redirects them to the host OS, where a stub service is running. Duato et al. [22] propose a library to allow each node in a cluster to access any of the CUDA-compatible accelerators installed in the cluster nodes. Remote GPUs are virtualized devices made available by a wrapper library replacing the CUDA Runtime; this library forwards the API calls to a remote server and retrieves the results from those remote executions to offer them to the calling application [22]. Several other studies use the same technique to forward CUDA and OpenCL commands, solving the problem of virtualizing GPGPU devices [25, 30, 87]. VMware products consist of a virtual PCI device and its corresponding driver for different operating systems; the host handles all accesses to the virtual PCI device inside a VM through a user-level process, which carries out the actual GPU operations. GPUvm presents a GPU virtualization solution on an NVIDIA card [99] and implements both para- and full-virtualization. However, full-virtualization exhibits considerable overhead for MMIO handling, and compared to native execution, the performance of optimized para-virtualization is two to three times slower. Since NVIDIA GPUs have their own graphics memory on the PCI card, GPUvm cannot handle page faults caused by NVIDIA GPUs [28]. NVIDIA GRID [4] is a proprietary virtualization solution from NVIDIA on the Kepler architecture; however, no technical details about these products are available to the public. Reano et al. propose optimizations at the InfiniBand verbs level to accelerate a GPU virtualization framework [91]. Ravi et al. implement a scheduling policy based on an affinity score between GPU kernels when consolidating kernels among multiple VMs [90]. Iserte et al. propose to decouple real GPUs from the compute nodes by using the rCUDA virtualization technology [39].

Compared to these works, our work focuses on analyzing and characterizing different GPU-to-GPU communication schemes in container-based cloud environments and identifying performance bottlenecks. Based on our findings, we further propose C-GDR, high-performance cloud-aware GPUDirect communication schemes on RDMA networks, which can dynamically schedule the optimal communication channels.

8.5 Summary

The increase in the number of cloud-based applications that leverage GPUs for parallel computation has made it vital to understand and design efficient GPU-based communication schemes in cloud environments. Towards this goal, we first investigate the performance characteristics of state-of-the-art GPU-based communication schemes in both native and container-based cloud environments and identify the performance bottlenecks for communication in GPU-enabled cloud environments. To alleviate the identified bottlenecks, we present the C-GDR approach to design high-performance cloud-aware GPUDirect communication schemes on RDMA networks and integrate it with the MVAPICH2 MPI library. The proposed designs provide locality-aware, NUMA-aware, and communication-pattern-aware capabilities to enable intelligent and adaptive communication coordination for the optimal communication performance on GPU-enabled clouds. Performance evaluations show that MVAPICH2 with C-GDR can outperform default MVAPICH2-GDR schemes by up to 66% on micro-benchmarks and can deliver up to a 26% performance benefit for various applications on container-based GPU-enabled clouds.

Chapter 9: Impact on the HPC and Cloud Computing Communities

HPC cloud is gaining momentum in both the HPC and cloud computing communities. The designs in this dissertation provide high-performance virtualization support for the different virtualization environments on HPC clouds. The locality-aware support in the redesigned MPI runtime eliminates performance degradation by taking into account the locality information of the communication peers. The NUMA-aware support adapts to different VM or container placement policies and delivers the optimal communication performance. SR-IOV technology brings near-native communication performance but prevents VM migration, which has become an obstacle to adopting HPC clouds. The proposed high-performance virtual machine migration framework enables high-performance and scalable VM migration for MPI applications on SR-IOV enabled HPC clouds. The proposed framework is hypervisor-independent and driver-independent; therefore, the deployment of an HPC cloud is not bound to a particular vendor, and administrators retain complete control of their systems. This addresses the significant tension between the security concerns of the HPC community and the flexibility expected by the cloud community, which will speed up the adoption of HPC clouds. To build an efficient HPC cloud, the critical virtualized resources need to be carefully managed and isolated; however, such tasks cannot be accomplished by an MPI runtime alone running inside the instances.

Slurm is a very popular resource management and job scheduling middleware on many HPC systems. Our proposed Slurm-V framework extends Slurm with virtualization-oriented capabilities. It enables efficient sharing of HPC cluster resources while isolating the critical virtualized HPC resources among VMs. Slurm-V will be beneficial not only for administrators building HPC clouds on existing HPC systems, but also for end users running multiple concurrent MPI jobs without performance impact. As an attractive alternative, Singularity provides another promising approach to build efficient HPC clouds with container technology. However, there has been a lack of a systematic study of Singularity's performance. We propose a four-dimensional methodology to evaluate the performance of Singularity on various aspects, including processor architecture, advanced interconnects, memory access modes, and the virtualization overhead. Compared to native performance, there is very little overhead when running MPI-based HPC applications over a Singularity-based HPC cloud. This work not only fills the gap in performance evaluation, but can also be used as an important reference for building HPC clouds. Moreover, GPUs have become an indispensable element in the HPC cloud because of their powerful computation capabilities. With the deployment of GPUs at large scale, the data movement schemes among GPU devices become dramatically more complicated in the HPC cloud. The proposed C-GDR approach presents high-performance cloud-aware GPU-based communication schemes on RDMA networks. C-GDR provides locality-aware, NUMA-aware, and communication-pattern-aware capabilities to enable intelligent and adaptive communication coordination for the optimal communication performance on GPU-enabled HPC clouds. Through these designs, we are able to design and build efficient HPC clouds with modern networking technologies on heterogeneous HPC clusters and deliver the optimal performance of HPC applications to the end users.

9.1 Software Release and Wide Acceptance

9.1.1 MVAPICH2-Virt Library

MVAPICH2-Virt, derived from MVAPICH2, is MPI software that exploits the novel features and mechanisms of high-performance networking technologies with SR-IOV, as well as other virtualization technologies such as IVShmem for virtual machines and IPC-enabled Shared Memory (IPC-SHM) and Cross Memory Attach (CMA) for Docker/Singularity containers. MVAPICH2-Virt can deliver the best performance and scalability to MPI applications running inside both VMs and containers over SR-IOV enabled InfiniBand clusters. As of July 2018, 1,410 downloads have taken place from this project's site.

9.1.2 Heat-based Complex Appliance

In order to help users quickly deploy HPC clouds and conduct their research with the MVAPICH2 and MVAPICH2-Virt libraries, we develop two appliances [69, 71] on the NSF-supported Chameleon Cloud based on the OpenStack Heat component. Through these appliances, users and researchers can easily deploy HPC clouds to perform experiments and run jobs in different environments, which include high-performance SR-IOV enabled InfiniBand clusters, the high-performance MVAPICH2 library over bare-metal InfiniBand clusters, and the high-performance MVAPICH2 library with virtualization support over SR-IOV enabled KVM clusters.

Chapter 10: Future Research Directions

This chapter describes possible future research directions that can be explored as a follow-up to the work done as part of this thesis.

10.1 Exploring GPU-enabled VM Migration

Recently, cloud computing platforms have been widely adopting GPGPUs and NVMe devices to achieve high-performance and energy-efficient computation and storage. As an essential virtualization capability towards high availability and efficient resource provisioning, support for VM live migration is key. Although VM live migration mechanisms on homogeneous clusters have been widely discussed [43], handling live migration in heterogeneous environments such as GPU- and NVMe-enabled systems remains an open issue. The most challenging part of GPGPU migration is how to efficiently suspend and resume the compute kernels and data movement channels. Specifically, the computation and communication operations are scheduled in a stream-like manner [3]; in other words, they work asynchronously with respect to the CPU and other devices. As a result, it is required to keep track of the GPU streams used by the migration source and the corresponding operations that are executing or queued on those streams. Moreover, coordination and synchronization among the various GPU-to-GPU communication schemes during migration is complex and needs to be handled carefully as well.

10.2 QoS-aware Data Access and Movement

The paradigm of cloud computing is heavily based on providing a guarantee of service. Quality of Service (QoS) is an extremely important part of this paradigm; in fact, it is one of the primary reasons for the popularity of cloud computing. Most cloud providers these days offer Service Level Agreements (SLAs) to their clients as a basic way of achieving QoS. In GPGPU-enabled cloud computing platforms, many VMs can share the same set of GPUs to perform concurrent computation kernels or data movements, and one VM can use multiple GPUs. In this context, providing QoS is essential and challenging in order to meet users' expectations as well as to maximize the utilization of GPU resources, i.e., their massive parallelism. However, there are only limited studies on providing QoS for GPUs [46, 106], and these studies mainly focus on QoS support for GPUs in native environments. Thus, it is desirable to fill this gap and provide intelligent priority-based scheduling mechanisms for computation kernels and for the movement of GPU-resident data.

10.3 Exploring Different Programming Models on HPC Cloud

Compared to the traditional MPI model, the one-sided programming model is gaining momentum, since it shows promise for expressing algorithms that have irregular computation and communication patterns. Partitioned Global Address Space (PGAS) and Remote Memory Access (RMA) are two examples of the one-sided model. PGAS assumes a global memory address space that is logically partitioned, with a portion of it local to each process. For example, OpenSHMEM [26, 41, 47, 50, 56, 59, 76], Unified Parallel C (UPC) [48], and Co-array Fortran [31] are different implementations of PGAS. Remote Memory Access (RMA) [57, 58] extends the one-sided communication capabilities based on traditional MPI.

Chapter 11: Conclusion and Contribution

Cloud computing has been widely adopted in the industry computing domain due to several attractive features, such as on-demand resource provisioning, efficient resource sharing, performance isolation, and live migration. More and more enterprises are migrating their services and applications, which previously ran and were maintained on dedicated systems, onto public cloud computing platforms. In this way, not only can the system resources be efficiently shared by more users, but the enterprises are also able to reduce their costs and achieve fast turnaround. Virtualization technologies play the key role behind the scenes of cloud computing. There exist three different types of virtualization solutions: the hypervisor-based solution, the container-based solution, and the emerging nested virtualization solution. Even though cloud computing and virtualization have been successful in the industry computing environment over the past decades, they still face challenges in the HPC domain. More specifically, one of the biggest barriers is the low performance of virtualized I/O. SR-IOV technology addresses this issue by delivering near-native point-to-point performance. However, it still lacks high-performance virtualization support for co-resident instances, such as locality-aware and NUMA-aware support, which incurs severe performance degradation.

In this dissertation, we propose designs for the MPI runtime to provide high-performance virtualization support for different types of virtualization environments on HPC clouds. For the hypervisor-based virtualization solution, we propose a high-performance locality-aware MPI library, which can dynamically detect co-located VMs and coordinate communications between the SR-IOV and IVShmem channels. For container-based virtualization, the locality-aware design and IPC namespace sharing are utilized to dynamically and efficiently detect co-resident containers at communication runtime, so that shared memory and CMA based communication can be used to improve the communication performance across co-resident containers. The evaluation results for both Docker and Singularity show near-native performance on various aspects, including processor architecture, advanced interconnects, memory access modes, and the virtualization overhead for MPI-based HPC applications. Further, we propose a high-performance two-layer locality-aware and NUMA-aware MPI library for nested virtualization environments on HPC clouds. Through the two-layer locality-aware design, the MPI library is able to dynamically and efficiently detect co-resident containers in the same VM as well as co-resident VMs on the same host at runtime. Through the NUMA-aware design, the MPI runtime is also able to adapt to different VM/container placement schemes and deliver the optimal communication performance.

The SR-IOV specification is able to provide efficient sharing of high-speed interconnect resources and achieve near-native I/O performance. However, SR-IOV-based virtual networks prevent VM migration, which is an essential virtualization capability towards high availability and resource provisioning. Current solutions have many restrictions, such as depending on specific network adapters and/or hypervisors, which limits the usage scope of these solutions in HPC environments from a security perspective. In this dissertation, we present a high-performance virtual machine migration framework for MPI applications on SR-IOV enabled HPC clouds. The framework is hypervisor-independent and host/guest device driver-independent. It consists of a redesigned MPI runtime, which can hide the migration overhead by overlapping it with computation, and a high-performance and scalable controller, which works seamlessly with the redesigned MPI runtime to significantly improve the efficiency of virtual machine migration.

To build an efficient HPC cloud, the HPC cluster resources need to be shared efficiently by the end users through virtualization. In this context, critical HPC resources among VMs, such as SR-IOV enabled virtual functions and IVShmem devices, need to be enabled and isolated to support efficiently running multiple concurrent MPI jobs on HPC clouds. However, the original Slurm is not able to supervise VMs and their associated critical resources. In this dissertation, we propose a novel framework, Slurm-V, which extends Slurm with virtualization-oriented capabilities such as job submission to dynamically created VMs with isolated SR-IOV and IVShmem resources.
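One natural place to hook such VM setup and teardown is Slurm's SPANK plugin interface. The skeleton below only sketches that idea: the plugin name, the helper scripts it invokes, and the choice of callbacks are illustrative assumptions, not the actual Slurm-V code.

/*
 * Sketch of a SPANK plugin hooking VM setup/teardown around a job
 * (illustration only; script paths and plugin name are hypothetical).
 * Build as a shared object and list it in plugstack.conf.
 */
#include <stdlib.h>
#include <slurm/spank.h>

SPANK_PLUGIN(vm_launcher_demo, 1);

/* Node-side hook: create the VM and attach an SR-IOV VF and an
 * IVShmem device before the user's tasks are launched. */
int slurm_spank_init(spank_t sp, int ac, char **av)
{
    (void)ac; (void)av;
    if (!spank_remote(sp))              /* only act on compute nodes     */
        return ESPANK_SUCCESS;
    if (system("/opt/vmtools/start_vm.sh") != 0)  /* hypothetical helper */
        return ESPANK_ERROR;
    return ESPANK_SUCCESS;
}

/* Exit hook: detach devices and reclaim the VM. */
int slurm_spank_exit(spank_t sp, int ac, char **av)
{
    (void)ac; (void)av;
    if (spank_remote(sp))
        (void)system("/opt/vmtools/stop_vm.sh");  /* hypothetical helper */
    return ESPANK_SUCCESS;
}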

GPUs, as one type of accelerator, have achieved significant success for parallel applications on heterogeneous HPC clusters. In addition to highly optimized computation kernels on GPUs, the cost of data movement on GPU clusters plays a critical role in delivering high performance for end applications. In this dissertation, we propose C-GDR, high-performance Cloud-aware GPUDirect communication schemes on RDMA networks. C-GDR allows the communication runtime to detect process locality, GPU residency, NUMA architecture information, and communication patterns, enabling intelligent and dynamic selection of the best communication and data-movement schemes on GPU-enabled clouds. Our evaluations show that C-GDR can outperform the default scheme by up to 25% on HPC applications.
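Conceptually, the scheme selection in C-GDR can be viewed as a decision table keyed on process locality, GPU residency, NUMA placement relative to the HCA, and message size. The following sketch captures that idea with made-up names and thresholds (xfer_desc_t, select_scheme, GDR_LIMIT); it is not the actual C-GDR implementation.

/*
 * Illustrative decision table for cloud-aware GPU data movement
 * (names and thresholds are assumptions, not the C-GDR source).
 */
#include <stdio.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum {
    SCHEME_SHMEM,        /* intra-node, host buffers                     */
    SCHEME_NET_RDMA,     /* inter-node, host buffers                     */
    SCHEME_CUDA_IPC,     /* intra-node, GPU buffers                      */
    SCHEME_GDR_RDMA,     /* inter-node GPU buffers, GPUDirect RDMA       */
    SCHEME_HOST_STAGED   /* inter-node GPU buffers, staged via host      */
} scheme_t;

typedef struct {
    bool   src_on_gpu, dst_on_gpu;  /* GPU residency of the two buffers  */
    bool   same_node;               /* process locality                  */
    bool   same_numa_as_hca;        /* GPU and HCA share a NUMA domain   */
    size_t msg_size;                /* bytes                             */
} xfer_desc_t;

static scheme_t select_scheme(const xfer_desc_t *x)
{
    const size_t GDR_LIMIT = 32 * 1024;      /* assumed threshold        */

    if (!x->src_on_gpu && !x->dst_on_gpu)
        return x->same_node ? SCHEME_SHMEM : SCHEME_NET_RDMA;
    if (x->same_node)
        return SCHEME_CUDA_IPC;              /* intra-node GPU-to-GPU    */
    /* Inter-node GPU traffic: GPUDirect RDMA is attractive for small
       and medium messages when the GPU sits close to the HCA; larger
       messages are pipelined through pinned host memory.               */
    if (x->same_numa_as_hca && x->msg_size <= GDR_LIMIT)
        return SCHEME_GDR_RDMA;
    return SCHEME_HOST_STAGED;
}

int main(void)
{
    xfer_desc_t d = { true, true, false, true, 8 * 1024 };
    static const char *names[] = { "shared memory", "network RDMA",
                                   "CUDA IPC", "GPUDirect RDMA",
                                   "host staged" };
    printf("chosen scheme: %s\n", names[select_scheme(&d)]);
    return 0;
}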
