Designing and Building Efficient HPC Cloud with Modern Networking Technologies on Heterogeneous HPC Clusters

Dissertation

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By

Jie Zhang, M.S.

Graduate Program in Computer Science and Engineering

The Ohio State University

2018

Dissertation Committee:

Dr. Dhabaleswar K. Panda, Advisor
Dr. Christopher Stewart
Dr. P. Sadayappan
Dr. Yang Wang
Dr. Xiaoyi Lu

© Copyright by Jie Zhang 2018

Abstract

Cloud computing platforms (e.g., Amazon EC2 and Microsoft Azure) have been widely adopted by many users and organizations due to their high availability and scalable computing resources. By using virtualization technology, VM or container instances in a cloud can be constructed on bare-metal hosts for users to run their systems and applications whenever they need computational resources. This has significantly increased the flexibility of resource provisioning in clouds compared to traditional resource management approaches. Cloud computing has recently gained momentum in HPC communities, which brings us a broad challenge: how to design and build efficient HPC clouds with modern networking technologies and virtualization capabilities on heterogeneous HPC clusters?

Through the convergence of HPC and cloud computing, users can get all the desirable features such as ease of system management, fast deployment, and resource sharing. However, many HPC applications running on the cloud still suffer from fairly low performance, more specifically, degraded I/O performance from virtualized I/O devices. Recently, a hardware-based I/O virtualization standard called Single Root I/O Virtualization (SR-IOV) has been proposed to help solve this problem, and it is able to achieve near-native I/O performance. However, SR-IOV lacks locality-aware communication support, so communication across co-located VMs or containers cannot leverage shared-memory-backed communication mechanisms. To deliver high performance to end HPC applications in the HPC cloud, we present a high-performance locality-aware and NUMA-aware MPI library over SR-IOV enabled InfiniBand clusters, which is able to dynamically detect locality information in VM, container, or even nested cloud environments and coordinate data movement appropriately. The proposed design improves the performance of NAS by up to 43% over the default SR-IOV based scheme across 32 VMs, while incurring less than 9% overhead compared with native performance. We also evaluate the performance of Singularity, one of the most attractive container technologies for building HPC clouds, on various aspects, including processor architectures, advanced interconnects, memory access modes, and virtualization overhead. Singularity shows very little overhead for running MPI-based HPC applications.

SR-IOV is able to provide efficient sharing of high-speed interconnect resources and achieve near-native I/O performance. However, SR-IOV based virtual networks prevent VM migration, which is an essential virtualization capability for high flexibility and availability. Although several initial solutions have been proposed in the literature to solve this problem, these approaches still have many restrictions, such as depending on specific network adapters and/or hypervisors, which limits their usage scope in HPC environments.
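To make the locality-aware idea above concrete, the following is a minimal sketch in C of how an MPI runtime might route traffic between a shared-memory channel for co-located peers and the SR-IOV virtual function for remote peers. All names here (peer_info_t, select_channel, the channel constants) are hypothetical illustrations under assumed data structures; this is not code from the dissertation's MPI library.

    /* Illustrative sketch (hypothetical names): locality-aware channel
     * selection for an MPI runtime on an SR-IOV enabled InfiniBand cluster. */
    #include <stdio.h>
    #include <string.h>

    typedef enum {
        CHANNEL_SHMEM,   /* shared-memory backed channel for co-located peers */
        CHANNEL_SR_IOV   /* InfiniBand virtual function for remote peers      */
    } channel_t;

    typedef struct {
        char host_id[64];   /* identifier of the underlying physical host */
        int  numa_node;     /* NUMA node the process is bound to          */
    } peer_info_t;

    /* Pick the communication channel for a peer based on detected locality. */
    static channel_t select_channel(const peer_info_t *self, const peer_info_t *peer)
    {
        if (strcmp(self->host_id, peer->host_id) == 0) {
            /* Co-located VMs/containers: bypass the network and use the
             * shared-memory backed channel. */
            return CHANNEL_SHMEM;
        }
        /* Peers on different hosts communicate through the SR-IOV VF. */
        return CHANNEL_SR_IOV;
    }

    int main(void)
    {
        peer_info_t a = { "node042", 0 }, b = { "node042", 1 }, c = { "node077", 0 };
        printf("a-b: %s\n", select_channel(&a, &b) == CHANNEL_SHMEM ? "shared memory" : "SR-IOV VF");
        printf("a-c: %s\n", select_channel(&a, &c) == CHANNEL_SHMEM ? "shared memory" : "SR-IOV VF");
        return 0;
    }

In a full runtime this decision would also take NUMA placement into account, in line with the NUMA-aware design described above.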
In this thesis, we propose a high-performance, hypervisor-independent and InfiniBand driver-independent VM migration framework for MPI applications on SR-IOV enabled InfiniBand clusters, which is able not only to achieve fast VM migration but also to maintain high performance for MPI applications during migration in the HPC cloud. The evaluation results indicate that our proposed design can completely hide the migration overhead by overlapping computation with migration.

In addition, resource management and scheduling systems, such as Slurm and PBS, are widely used in modern HPC clusters. In order to build efficient HPC clouds, some of the critical HPC resources, such as SR-IOV enabled virtual devices and Inter-VM shared memory (IVShmem) devices, need to be properly enabled and isolated among VMs. We thus propose a novel framework, Slurm-V, which extends Slurm with virtualization-oriented capabilities to support efficiently running multiple concurrent MPI jobs on HPC clusters. The proposed Slurm-V framework shows good scalability and the ability to efficiently run concurrent MPI jobs on SR-IOV enabled InfiniBand clusters. To the best of our knowledge, Slurm-V is the first attempt to extend Slurm to support running concurrent MPI jobs with isolated SR-IOV and IVShmem resources.

On heterogeneous HPC clusters, GPU devices have achieved significant success for parallel applications. In addition to highly optimized computation kernels on GPUs, the cost of data movement on GPU clusters plays a critical role in delivering high performance to end applications. Our studies show that there is a significant demand for high-performance cloud-aware GPU-to-GPU communication schemes that deliver near-native performance on clouds. We propose C-GDR, high-performance Cloud-aware GPUDirect communication schemes on RDMA networks. C-GDR allows the communication runtime to detect process locality, GPU residency, NUMA architecture information, and communication patterns, enabling intelligent and dynamic selection of the best communication and data movement schemes on GPU-enabled clouds. Our evaluations show that C-GDR can outperform the default scheme by up to 26% on HPC applications.
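As a rough sketch of the kind of decision logic such a cloud-aware GPUDirect design involves, the C snippet below selects a GPU-to-GPU data movement path from detected locality, NUMA affinity, and GPUDirect RDMA availability. The path names, the structure fields, and the 64 KB threshold are assumptions made for illustration only; they are not the actual C-GDR implementation.

    /* Illustrative sketch (hypothetical names): selecting a GPU-to-GPU data
     * movement path from locality and GPU residency, in the spirit of a
     * cloud-aware GPUDirect design. */
    #include <stdbool.h>
    #include <stdio.h>

    typedef enum {
        PATH_CUDA_IPC,     /* peer-to-peer copy between GPUs on the same node */
        PATH_GDR_RDMA,     /* GPUDirect RDMA through the (SR-IOV) network     */
        PATH_HOST_STAGED   /* staging through host memory otherwise           */
    } gpu_path_t;

    typedef struct {
        bool same_host;     /* are the two processes co-located?          */
        bool same_numa;     /* do GPU and HCA share a NUMA node?          */
        bool gdr_supported; /* is GPUDirect RDMA available on this VF/HCA? */
        long msg_size;      /* message size in bytes                      */
    } comm_ctx_t;

    static gpu_path_t select_gpu_path(const comm_ctx_t *ctx)
    {
        if (ctx->same_host)
            return PATH_CUDA_IPC;                 /* intra-node: GPU peer copy  */
        if (ctx->gdr_supported && ctx->same_numa && ctx->msg_size <= 64 * 1024)
            return PATH_GDR_RDMA;                 /* good affinity, small msgs  */
        return PATH_HOST_STAGED;                  /* otherwise pipeline via host */
    }

    int main(void)
    {
        comm_ctx_t ctx = { .same_host = false, .same_numa = true,
                           .gdr_supported = true, .msg_size = 8192 };
        static const char *names[] = { "CUDA IPC", "GPUDirect RDMA", "host-staged" };
        printf("selected path: %s\n", names[select_gpu_path(&ctx)]);
        return 0;
    }

A production runtime would refine these choices with measured communication patterns, which is the dynamic selection the abstract describes.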
To my family, friends, and mentors.

Acknowledgments

This work was made possible through the love and support of several people who stood by me through the many years of my doctoral program and all through my life leading to it. I would like to take this opportunity to thank all of them.

My family - my parents, Chong Zhang and Jinchuan Li, who have always given me complete freedom and love to let me go after my dreams and unconditional support to let me venture forth; my uncle, Pengxi Li, who has always inspired and encouraged me to pursue higher goals; and my grandmother, Aixiang Yu, who has stood by me and prayed for me at all times.

My fiancee, Hongjin Wang, for her love, support, and understanding. I admire and respect her for the many qualities she possesses, particularly the great courage and determination with which she takes on new challenges in her career.

My advisor, Dr. Dhabaleswar K. Panda, for his guidance and support throughout my doctoral program. I have been able to grow, both personally and professionally, through my association with him. He works hard and professionally, and I can deeply feel his respect for the career he has been pursuing. Even after knowing him for six years, I am still amazed by the energy and commitment he has towards research.

My collaborators - I would like to express my appreciation to my collaborator, Dr. Xiaoyi Lu. Through six years of collaboration with him, I have witnessed his attitude and passion towards science and research: he continually and convincingly conveyed a spirit of exploration in regard to research and scholarship, and an excitement in regard to teaching. Without his guidance and persistent help this dissertation would not have been possible.

My friends - I am very happy to have met and become friends with Jithin Jose, Hari Subramoni, Mingzhe Li, Rong Shi, Ching-Hsiang Chu, Dipti Shankar, Jeff Smith, Jonathan Perkins, Mark Arnold, Shashank Gugnani, and Haiyang Shi. This work would remain incomplete without their support and contribution. They have given me memories that I will cherish for the rest of my life.

I would also like to thank all my colleagues, who have helped me in one way or another throughout my graduate studies.

Vita

2004-2008: B.S., Computer Science, Tianjin University of Technology and Education, China
2008-2011: M.S., Computer Science, Nankai University, China
2012-Present: Ph.D., Computer Science and Engineering, The Ohio State University, U.S.A.

Publications

Jie Zhang, Xiaoyi Lu and Dhabaleswar K. Panda, C-GDR: High-Performance Cloud-aware GPUDirect MPI Communication Schemes on RDMA Networks (Under Review)

Jie Zhang, Xiaoyi Lu and Dhabaleswar K. Panda, Is Singularity-based Container Technology Ready for Running MPI Applications on HPC Clouds? The 10th International Conference on Utility and Cloud Computing (UCC '17), Dec 2017, Best Student Paper Award

Jie Zhang, Xiaoyi Lu and Dhabaleswar K. Panda, High-Performance Virtual Machine Migration Framework for MPI Applications on SR-IOV enabled InfiniBand Clusters, The 31st IEEE International Parallel and Distributed Processing Symposium (IPDPS '17), May 2017

Jie Zhang, Xiaoyi Lu and Dhabaleswar K. Panda, Designing Locality and NUMA Aware MPI Runtime for Nested Virtualization based HPC Cloud with SR-IOV Enabled InfiniBand, The 13th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE '17), April 2017

Jie Zhang, Xiaoyi Lu, Sourav Chakraborty and Dhabaleswar K. Panda, SLURM-V: Extending SLURM for Building Efficient HPC Cloud with SR-IOV and IVShmem, The 22nd International European Conference on Parallel and Distributed Computing (Euro-Par '16), Aug 2016

Jie Zhang, Xiaoyi Lu and Dhabaleswar K. Panda, High Performance MPI Library for Container-based HPC Cloud on InfiniBand Clusters, The 45th International
