Understanding and Optimizing I/O Virtualization in Data Centers
Understanding and Optimizing I/O Virtualization in Data Centers

by Ron Chi-Lung Chiang

M.Sc. in Computer Science, May 2001, National Chung Cheng University
B.Sc. in Computer Science, May 1999, Tamkang University

A Dissertation submitted to The Faculty of The School of Engineering and Applied Science of the George Washington University in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

January 31, 2014

Dissertation directed by H. Howie Huang, Assistant Professor of Engineering and Applied Science

The School of Engineering and Applied Science of the George Washington University certifies that Ron Chi-Lung Chiang has passed the Final Examination for the degree of Doctor of Philosophy as of August 28, 2013. This is the final and approved form of the dissertation, "Understanding and Optimizing I/O Virtualization in Data Centers."

Dissertation Research Committee:

Howie Huang, Assistant Professor of Engineering and Applied Science, Dissertation Director
Tarek El-Ghazawi, Professor of Engineering and Applied Science, Committee Member
Suresh Subramaniam, Professor of Engineering and Applied Science, Committee Member
Guru Venkataramani, Assistant Professor of Engineering and Applied Science, Committee Member
Timothy Wood, Assistant Professor of Computer Science, Committee Member

Dedication

To my beloved wife Claire H. Huang and my family.

Acknowledgement

A PhD dissertation is never an individual effort. I am indebted to all the people who inspired, motivated, and supported me on my PhD odyssey. First and foremost, I give my sincere gratitude to my dissertation advisor, Prof. Howie Huang. His immense passion and relentless enthusiasm for doing great research always motivated and encouraged me, and his guidance steered my vision and goals in the right direction. Without his great support, I would not have been able to finish my journey of pursuing a PhD.
I am also grateful to my dissertation committee members, Prof. Tarek El-Ghazawi, Prof. Suresh Subramaniam, Prof. Guru Prasadh Venkataramani, and Prof. Timothy Wood, for their valuable mentorship throughout my journey and for helping me polish this dissertation. Their insight and professional acuity strongly strengthened this work. I am very fortunate to have had the best collaborators in the lab. I express my appreciation to my lab mates, Xin Xu, Hang Liu, Ahsen Uppal, Jie Chen, and Jinho Hwang; I will miss their company during lunch, research, and coursework. I thank Dr. Oliver Spatscheck and Dr. Simon X. Chen for offering me an internship opportunity at AT&T Labs. Last but not least, I give deep thanks to my dearest wife, Claire H. Huang, who has given me countless support, encouragement, and morale boosts over the years, and I thank my parents for understanding and supporting my adventure. This work is supported in part by the National Science Foundation.

Abstract

Understanding and Optimizing I/O Virtualization in Data Centers

Large-scale data centers leverage virtualization technology to achieve excellent resource utilization, scalability, and high availability. Ideally, the performance of an application running inside a virtual machine (VM) should be independent of co-located applications and VMs that share the same physical machine. However, adverse interference effects exist and are especially severe for data-intensive applications in such virtualized environments.

We demonstrate on Amazon Elastic Compute Cloud (EC2) a new type of performance vulnerability caused by competition among virtual I/O workloads: an adversary can intentionally slow down the execution of a targeted application in a VM that shares the same hardware. In Chapter 3, we design and implement Swiper, a framework which uses a carefully designed workload to incur significant delays on the target VM at minimum cost (i.e., resource consumption).
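The core timing idea can be illustrated with a toy sketch: an adversary with a fixed I/O budget induces far more delay by concentrating its requests on the intervals when the victim is I/O-active than by spreading them uniformly. The model below is purely illustrative (all names and numbers are hypothetical, not Swiper's actual implementation), but it captures why synchronization matters.

```python
# Toy model of a synchronized I/O interference attack. Each unit of
# attacker I/O that overlaps a victim-active interval adds `slowdown`
# units of delay to the victim. All values are illustrative.

def induced_delay(victim, attacker, slowdown=0.5):
    """Total delay added to the victim by overlapping attacker I/O."""
    return sum(slowdown * a for v, a in zip(victim, attacker) if v > 0)

def uniform_attack(budget, length):
    """Naive baseline: spread the I/O budget evenly over all intervals."""
    return [budget / length] * length

def synchronized_attack(budget, victim):
    """Swiper-style idea: spend the whole budget on victim-active intervals."""
    active = [i for i, v in enumerate(victim) if v > 0]
    attack = [0.0] * len(victim)
    for i in active:
        attack[i] = budget / len(active)
    return attack

if __name__ == "__main__":
    victim = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]   # victim I/O activity per interval
    budget = 10.0
    naive = induced_delay(victim, uniform_attack(budget, len(victim)))
    synced = induced_delay(victim, synchronized_attack(budget, victim))
    print(naive, synced)  # 2.0 vs 5.0: same budget, much larger delay
```

With the victim active in 4 of 10 intervals, the synchronized attacker induces 2.5x the delay of the uniform baseline for the same budget, which is why the framework's synchronization step (inferring when the victim performs I/O) is the hard part of the attack.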
We conduct a comprehensive set of experiments on EC2 which clearly demonstrate that Swiper is capable of significantly slowing down various server applications while consuming a small amount of resources.

Our subsequent research on the interference effect leads us to construct mathematical models of resource contention and leverage the modeling results in task scheduling. In Chapter 4, we present TRACON, a novel Task and Resource Allocation CONtrol framework that mitigates the interference effects from concurrent data-intensive applications and greatly improves overall system performance. TRACON utilizes modeling and control techniques from statistical machine learning and consists of three major components: an interference prediction model that infers application performance from resource consumption observed across different VMs, an interference-aware scheduler designed to utilize the model for effective resource management, and a task and resource monitor that collects application characteristics at runtime for model adaptation. We implement TRACON on a cluster and validate its effectiveness with experiments using a variety of cloud applications. Experimental results show that TRACON achieves up to 25% improvement in application throughput.

Swiper and TRACON address contention for shared physical resources among co-located VMs. Other main factors contributing to VM performance unpredictability include limited control over VM allocation and lack of knowledge about the performance of a specific VM among the tens of VM types offered by public cloud providers. In Chapter 5, we propose Matrix, a novel performance and resource management system that ensures the performance an application achieves on a VM closely matches what it would achieve on a target physical server.
To this end, Matrix utilizes machine learning methods (clustering models with probability estimates) to predict the performance of new workloads in a virtualized environment, choose a suitable VM type, and dynamically adjust the resource configuration of a VM on the fly. Evaluations on a private cloud and two public clouds (Rackspace and Amazon EC2) show that, for an extensive set of cloud applications, Matrix is able to estimate application performance with 90% average accuracy. In addition, Matrix can deliver the target performance within 3% variance, and do so with the best cost-efficiency in most cases.

While the above works address performance issues on top of the virtualization framework, we also explore the virtualization architecture itself to design innovative I/O virtualization frameworks. Traditional data prefetching has focused on applications running on bare-metal systems with hard drives. In contrast, virtualized systems using solid-state drives (SSDs) present different challenges for data prefetching. Most existing prefetching techniques, if applied unchanged in virtualized environments, are likely to fail to fully capture I/O access patterns, interfere with blended I/O requests, or cause too much overhead if run in every virtualized instance, any of which can result in undesirable application performance. In Chapter 6, we demonstrate that data prefetching, when running in a virtualization-friendly manner, can provide significant performance benefits for a wide range of data-intensive applications. We have implemented and evaluated VIO-prefetching in a Linux system with the Xen hypervisor. Our comprehensive study provides insights into VIO-prefetching's behavior under various virtualization system configurations, e.g., the number of VMs, in-guest processes, and application types.
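The general shape of virtualization-level prefetching can be sketched as follows: the hypervisor-side component watches the merged block-I/O stream of a virtual disk, and only prefetches ahead when the recent accesses are sufficiently sequential, so that blended I/O from multiple guests does not trigger wasted reads. This is a minimal illustrative sketch; the class name, window size, and thresholds are hypothetical, not VIO-prefetching's actual parameters.

```python
# Minimal sketch of pattern-gated prefetching at the virtual-disk level:
# track recent block numbers, and prefetch ahead only when the observed
# stream is mostly sequential. All thresholds are illustrative.

from collections import deque

class SequentialPrefetcher:
    def __init__(self, window=8, threshold=0.75, depth=4):
        self.history = deque(maxlen=window)  # recent block numbers
        self.threshold = threshold           # min fraction of sequential steps
        self.depth = depth                   # how many blocks to prefetch ahead

    def on_read(self, block):
        """Record a read; return the blocks to prefetch (possibly empty)."""
        self.history.append(block)
        if len(self.history) < self.history.maxlen:
            return []                        # not enough history yet
        recent = list(self.history)
        sequential = sum(1 for a, b in zip(recent, recent[1:]) if b == a + 1)
        if sequential / (len(recent) - 1) >= self.threshold:
            return [block + i for i in range(1, self.depth + 1)]
        return []                            # stream too random; do nothing

# A sequential scan (blocks 0..7) triggers prefetching of blocks 8..11,
# while a random stream of the same length triggers nothing.
```

The design choice worth noting is the gating: because the hypervisor sees one blended stream per virtual disk rather than per-process streams, the detector must be conservative enough that interleaved random I/O from co-located guests does not look sequential.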
The proposed method improves virtual I/O performance by up to 43%, with an average of 14%, for 1 to 12 VMs running various applications on a Xen virtualization system.

In brief, this dissertation shows that virtualization overheads and architectures in cloud computing environments are critical to performance, and proposes effective novel approaches that advance the state of the art. More specifically, Swiper and TRACON construct mathematical models and scheduling algorithms to mitigate the interference problem; Matrix leverages machine learning and optimization techniques to realize the "equivalence" property of virtualization with the best cost-efficiency; and VIO-prefetching fundamentally changes the prefetching scheme in the virtualization architecture and improves virtual I/O throughput. The results of this dissertation also open up numerous possibilities for advancing virtualization and cloud computing technology.

Contents

Dedication  iii
Acknowledgement  iv
Abstract  v
Contents  viii
List of Figures  xi
List of Tables  xvi

1 Introduction  1
  1.1 Swiper  3
  1.2 TRACON  3
  1.3 Matrix  5
  1.4 VIO-Prefetching  8
  1.5 Contributions  9
  1.6 Dissertation Organization  12
2 Background and Related Work  14
  2.1 Amazon Elastic Compute Cloud  14
  2.2 Virtualization  15
  2.3 Preliminary Interference Experiments  16
  2.4 Related Work  17
    2.4.1 Swiper  18
    2.4.2 TRACON  21
    2.4.3 Matrix  23
    2.4.4 VIO-prefetching  24
3 Swiper  26
  3.1 Introduction  26
  3.2 Threat Model  30
    3.2.1 Resource Sharing in Cloud Computing Systems  30
    3.2.2 Problem Definition  30
  3.3 I/O-Based Co-Location Detection  32
  3.4 Resource Competition for a Two-Party System  34
    3.4.1 Technical Challenges for Reaching the Maximum Delay  34
    3.4.2 Main Ideas for Synchronization  36
    3.4.3 Performance Attack  39
  3.5 Systems with Background Processes  41
    3.5.1 Synchronization in Multi-VM Systems  41
    3.5.2 Length of Observation Process  42
  3.6 Experiment Results  46
    3.6.1 Experiment Setup  46
    3.6.2 Comparison with Baseline Attacks  47
    3.6.3 Analysis of Performance Attack  52
    3.6.4 Analysis of Synchronization Accuracy  53
  3.7 Dealing with User Randomness  55
  3.8 Attacking Migratable VMs  57
  3.9 Potential Monetary Loss  59
4 TRACON  60
  4.1 TRACON System Architecture  60
  4.2 Interference Prediction Model  62
  4.3 Interference-Aware Scheduling  67
    4.3.1 Machine Learning Based Scheduling  70
  4.4 Simulation  74
    4.4.1 Data-intensive Benchmarks  74
    4.4.2 Simulation Settings  76
    4.4.3 Performance of Prediction Models  77
    4.4.4 Task Scheduling with Different Models