Understanding and Optimizing I/O Virtualization in Data Centers

by Ron Chi-Lung Chiang

M.Sc. in Computer Science, May 2001, National Chung Cheng University
B.Sc. in Computer Science, May 1999, Tamkang University

A Dissertation submitted to

The Faculty of

The School of Engineering and Applied Science of the George Washington University in partial fulfillment of the requirements for the degree of Doctor of Philosophy

January 31, 2014

Dissertation directed by

H. Howie Huang, Assistant Professor of Engineering and Applied Science

The School of Engineering and Applied Science of The George Washington University certifies that Ron Chi-Lung Chiang has passed the Final Examination for the degree of Doctor of Philosophy as of August 28, 2013. This is the final and approved form of the dissertation.

Understanding and Optimizing I/O Virtualization in Data Centers

Ron Chi-Lung Chiang

Dissertation Research Committee:

Howie Huang, Assistant Professor of Engineering and Applied Science, Dissertation Director

Tarek El-Ghazawi, Professor of Engineering and Applied Science, Committee Member

Suresh Subramaniam, Professor of Engineering and Applied Science, Committee Member

Guru Venkataramani, Assistant Professor of Engineering and Applied Science, Committee Member

Timothy Wood, Assistant Professor of Computer Science, Committee Member

Dedication

To my beloved wife Claire H. Huang and my family.

Acknowledgement

Accomplishing a PhD dissertation is never an individual effort. I am indebted to all the people who have inspired, motivated, and supported me on my PhD odyssey. First and foremost, I give my sincere gratitude to my dissertation advisor, Prof. Howie Huang. His immense passion and relentless enthusiasm for doing great research have always motivated and encouraged me. His guidance has steered my vision and goals in the right direction. Without his great support, I would not have been able to finish my journey of pursuing a PhD. I am also grateful to my dissertation committee members, Prof. Tarek El-Ghazawi, Prof. Suresh Subramaniam, Prof. Guru Prasadh Venkataramani, and Prof. Timothy Wood, for their valuable mentorship throughout this journey and for helping me polish this dissertation. Their insight and professional acuity have strengthened this work. I am very fortunate to have had the best collaborators in the lab. I express my appreciation to my lab mates, Xin Xu, Hang Liu, Ahsen Uppal, Jie Chen, and Jinho Hwang. I will miss their company at lunch and while doing research and coursework. I thank Dr. Oliver Spatscheck and Dr. Simon X. Chen for offering me an internship opportunity at AT&T Labs. Last but not least, I give deep thanks to my dearest wife, Claire H. Huang, who has given me countless support, encouragement, and morale boosts over the years. I thank my parents for understanding and supporting my adventure. This work is supported in part by the National Science Foundation.

Abstract

Understanding and Optimizing I/O Virtualization in Data Centers

Large-scale data centers leverage virtualization technology to achieve excellent resource utilization, scalability, and high availability. Ideally, the performance of an application running inside a virtual machine (VM) should be independent of co-located applications and VMs that share the physical machine. However, adverse interference effects exist and are especially severe for data-intensive applications in such virtualized environments.

We demonstrate on Amazon Elastic Compute Cloud (EC2) a new type of performance vulnerability caused by competition among virtual I/O workloads: an adversary can intentionally slow down the execution of a targeted application in a VM that shares the same hardware. In Chapter 3, we design and implement Swiper, a framework which uses a carefully designed workload to incur significant delays on the target VM with minimum cost (i.e., resource consumption). We conduct a comprehensive set of experiments in EC2, which clearly demonstrates that Swiper is capable of significantly slowing down various server applications while consuming a small amount of resources.

Our subsequent research on the interference effect leads us to construct mathematical models of resource contention and leverage the modeling results in task scheduling. In Chapter 4, we present TRACON, a novel Task and Resource Allocation CONtrol framework that mitigates the interference effects from concurrent data-intensive applications and greatly improves the overall system performance. TRACON utilizes modeling and control techniques from statistical machine learning and consists of three major components: the interference prediction model that infers application performance from resource consumption observed from different VMs, the interference-aware scheduler that is designed to utilize the model for effective resource management, and the task and resource monitor that collects application characteristics at runtime for model adaptation. We implement TRACON on a cluster and validate its effectiveness with experiments using a variety of cloud applications. Experiment results show that TRACON achieves up to 25% improvement in application throughput.

Swiper and TRACON address the contention on shared physical resources among co-located VMs. In addition, other main factors contributing to VM performance unpredictability include limited control of VM allocation as well as a lack of knowledge about the performance of a specific VM out of the tens of VM types offered by public cloud providers. In Chapter 5, we propose Matrix, a novel performance and resource management system that ensures that the performance of an application on a VM closely matches its performance on a target physical server. To this end, Matrix utilizes machine learning methods, namely clustering models with probability estimates, to predict the performance of new workloads in a virtualized environment, choose a suitable VM type, and dynamically adjust the resource configuration of a VM on the fly. The evaluations on a private cloud and two public clouds (Rackspace and Amazon EC2) show that for an extensive set of cloud applications, Matrix is able to estimate application performance with 90% average accuracy. In addition, Matrix can deliver the target performance within 3% variance, and do so with the best cost-efficiency in most cases.

In addition to the above works, which address performance issues on top of the virtualization framework, our exploration goes deeper into the virtualization architecture to design innovative I/O virtualization frameworks. Traditional data prefetching has focused on applications running on bare-metal systems using hard drives. In contrast, virtualized systems using solid-state drives (SSDs) present different challenges for data prefetching. Most existing prefetching techniques, if applied unchanged in virtualized environments, are likely to either fail to fully capture I/O access patterns, interfere with blended I/O requests, or cause too much overhead if run in every virtualized instance, all of which could result in undesirable application performance. In Chapter 6, we demonstrate that data prefetching, when running in a virtualization-friendly manner, can provide significant performance benefits for a wide range of data-intensive applications. We have implemented and evaluated VIO-prefetching in a system with a hypervisor. Our comprehensive study provides insights into VIO-prefetching’s behavior at various virtualization system configurations, e.g., the number of VMs, in-guest processes, application types, etc. The proposed method improves virtual I/O performance by up to 43%, with an average of 14%, for 1 to 12 VMs while running various applications on a Xen virtualization system.

In brief, this dissertation shows that virtualization overheads and architectures in cloud computing environments are critical to performance, and proposes effective novel approaches which successfully advance the state of the art. More specifically, Swiper and TRACON construct mathematical models and scheduling algorithms to mitigate the interference problem; Matrix leverages machine learning and optimization techniques to realize the “equivalence” property of virtualization with the best cost-efficiency; and VIO-prefetching fundamentally changes the prefetching scheme in the virtualization architecture and improves virtual I/O throughput. The results of this dissertation also open up numerous possibilities for advancing virtualization and cloud computing technology.

Contents

Dedication iii

Acknowledgement iv

Abstract v

Contents viii

List of Figures xi

List of Tables xvi

1 Introduction 1
1.1 Swiper ...... 3
1.2 TRACON ...... 3
1.3 Matrix ...... 5
1.4 VIO-Prefetching ...... 8
1.5 Contributions ...... 9
1.6 Dissertation Organization ...... 12

2 Background and Related Work 14
2.1 Amazon Elastic Compute Cloud ...... 14

2.2 Virtualization ...... 15
2.3 Preliminary Interference Experiments ...... 16
2.4 Related Work ...... 17
2.4.1 Swiper ...... 18
2.4.2 TRACON ...... 21
2.4.3 Matrix ...... 23
2.4.4 VIO-prefetching ...... 24

3 Swiper 26
3.1 Introduction ...... 26
3.2 Threat Model ...... 30
3.2.1 Resource Sharing in Cloud Computing Systems ...... 30
3.2.2 Problem Definition ...... 30
3.3 I/O-Based Co-Location Detection ...... 32
3.4 Resource Competition for a Two-Party System ...... 34
3.4.1 Technical Challenges for Reaching the Maximum Delay ...... 34
3.4.2 Main Ideas for Synchronization ...... 36
3.4.3 Performance Attack ...... 39
3.5 Systems with Background Processes ...... 41
3.5.1 Synchronization in Multi-VM Systems ...... 41
3.5.2 Length of Observation Process ...... 42
3.6 Experiment Results ...... 46
3.6.1 Experiment Setup ...... 46
3.6.2 Comparison with Baseline Attacks ...... 47
3.6.3 Analysis of Performance Attack ...... 52
3.6.4 Analysis of Synchronization Accuracy ...... 53
3.7 Dealing with User Randomness ...... 55
3.8 Attacking Migratable VMs ...... 57
3.9 Potential Monetary Loss ...... 59

4 TRACON 60
4.1 TRACON System Architecture ...... 60
4.2 Interference Prediction Model ...... 62
4.3 Interference-Aware Scheduling ...... 67
4.3.1 Machine Learning Based Scheduling ...... 70
4.4 Simulation ...... 74
4.4.1 Data-intensive Benchmarks ...... 74
4.4.2 Simulation Settings ...... 76
4.4.3 Performance of Prediction Models ...... 77
4.4.4 Task Scheduling with Different Models ...... 79
4.4.5 NLM Prediction Accuracy ...... 80
4.4.6 Model Adaption ...... 80
4.4.7 Performance of Scheduling Algorithms ...... 82
4.4.8 Energy Savings ...... 85
4.5 Implementation and Experiments ...... 86
4.5.1 Implementation and Experiment Environment ...... 86
4.5.2 Cloud Applications ...... 87
4.5.3 Experiment Results ...... 90

5 Matrix 96
5.1 Support Vector Machine ...... 96
5.2 Matrix Architecture ...... 98
5.2.1 Workload Signatures ...... 100
5.2.2 Clustering Method ...... 101
5.2.3 Performance Modeling ...... 103
5.2.4 Automatic Resource Configuration ...... 105
5.3 Implementation ...... 106
5.4 Evaluations ...... 108
5.4.1 Physical to Virtual Machines ...... 110
5.4.2 Physical to Virtual Clusters ...... 116

5.4.3 Virtual Cluster in Private and Public Cloud ...... 121

6 VIO-Prefetching 124
6.1 Challenges ...... 124
6.1.1 Challenge #1: No One-size-fits-all ...... 124
6.1.2 Challenge #2: Virtual I/O Blending ...... 125
6.2 Design Principles ...... 127
6.3 The Architecture of VIO-Prefetching ...... 128
6.3.1 Block Prefetching ...... 133
6.3.2 Feedback Monitoring ...... 135
6.4 Evaluation ...... 135
6.4.1 Experiment Environment and Applications ...... 137
6.4.2 VIO-Prefetching vs. Flashy Prefetching ...... 139
6.4.3 Evaluation with Cloud Applications ...... 142

7 Conclusion and Future Work 148
7.1 Conclusion ...... 148
7.2 Future Work ...... 150

Bibliography 152

List of Figures

1.1 The performance for various EC2 instances ranges from 18% to 3.7 times that of two physical machines, PM1 and PM2. Each column shows an average of ten runs ...... 6

2.1 Xen I/O architecture. Solid lines represent the I/O channels for data transmission. Host hardware interrupts are controlled and routed through VMM and depicted as dashed lines ...... 16

3.1 Read I/O trace of FileServer ...... 35
3.2 Overlapping I/O of an attacker and a victim ...... 41
3.3 Synchronization accuracy when the victim arrives at different times ...... 45
3.4 Synchronization accuracy with background VMs ...... 45
3.5 Overlapping I/O of an attacker, a victim, and a background noise ...... 46
3.6 Runtime increases by different attacks on Amazon EC2 ...... 47
3.7 The runtime increases on small instances with different workloads when varying attacker’s data consumption ...... 49
3.8 Boxplot of FileServer runtime by peak attack on micro and small instances ...... 49
3.9 The means and standard deviations of I/O throughput decreases ...... 51
3.10 Throughput changes when a multi-VM system hosted by Xen/KVM is attacked by the peak attack with various data usage limits ...... 51
3.11 I/O throughput decreases of FileServer and WebServer at different data usage limit of an attacker in a two-VM system ...... 53

3.12 I/O throughput decreases of FileServer and WebServer at different data usage limit of an attacker in a multi-VM system ...... 54
3.13 Synchronization accuracy when the victim arrives at different times ...... 55
3.14 Synchronization accuracy with background VMs ...... 55
3.15 An extended Swiper architecture for dealing with user randomness ...... 56
3.16 Matched minutes at each testing hour in the one-day test when holding a two-hour trace in the repository. The dotted line shows a polynomial fit of the observed data points ...... 57
3.17 The average throughput decrease per attack at each testing hour ...... 57
3.18 The throughput decreases when migration is enabled are shown in red solid lines. The blue dotted lines represent the throughput decreases when the victim is not migrated ...... 58
3.19 Potential revenue loss caused by Swiper on small and micro instances ...... 59

4.1 TRACON system architecture ...... 61
4.2 Model prediction errors ...... 78
4.3 Runtime and IOPS improvements with different models ...... 79
4.4 Predicted minimum runtime of each application compared to its measured minimum, average, and maximum runtimes ...... 80
4.5 Predicted maximum IOPS of each application compared to its measured minimum, average, and maximum IOPS ...... 81
4.6 Online model learning ...... 81

4.7 Speedup by MIBSRT and MIBSIO ...... 82

4.8 Normalized throughput of MIBS8, MIOS and MIX8 at λ tasks per minute...... 83

4.9 Normalized throughput of MIBS8, MIBS4, and MIBS2 at λ tasks per minute...... 83

4.10 Normalized throughput of MIBS8, MIOS and MIX8 at different numbers of machines ...... 84

4.11 Normalized throughput of MIBS8, MIBS4 and MIBS2 at different numbers of machines ...... 85
4.12 Energy savings in a data center with 1,024 machines ...... 86
4.13 Box plots of normalized runtime of each cloud application when running with other co-located VMs. The column heights represent runtimes normalized to the unaffected runtimes ...... 91
4.14 Interference prediction errors. The column heights represent the average prediction errors, and the error bars represent the standard deviations ...... 91
4.15 Normalized throughput at different scheduling methods. The column heights represent the throughputs normalized to FIFO’s throughput ...... 92
4.16 Normalized throughput of scheduling methods at different task arrival rate. The column heights represent the throughputs normalized to FIFO’s throughput ...... 92
4.17 Normalized throughput of scheduling methods at different I/O intensities. The column heights represent the throughputs normalized to FIFO’s throughput ...... 93
4.18 Normalized number of I/O requests completed of different scheduling methods. The column heights represent the total I/Os normalized to FIFO’s I/Os ...... 94

5.1 Matrix Architecture ...... 99
5.2 Matrix prototype ...... 106
5.3 Percentage of time that each component contributes to the whole overhead. (A) Data preparation, (B) Workload identification, (C) Model generation, (D) Resource allocation ...... 107
5.4 Application composition examples ...... 110

5.5 Accuracies on predicting performance. The labels aCbM-VSc on the leftmost four columns mean these tests are done on a VM with a VCPU and b GB memory hosted by our local machine VSc. The rightmost seven labels, RS1 to RS7, represent Rackspace instances from the smallest to the biggest one. Other labels represent Amazon instance types used ...... 110
5.6 RP changes as resources and workload intensity change. Intensities are changed every ten minutes ...... 113
5.7 Percentage of instances that are recommended for an application. The y-axis lists testing applications and intensities. Each bar represents percentages of certain instance types that are recommended for the corresponding application type and intensity ...... 114
5.8 Average RPs from running the same application on the recommended instance types ...... 114
5.9 Average accuracies and standard deviations on predicting RPs at different VM types and various numbers of VMs ...... 117
5.10 Accuracies on predicting RP decrease as the size of training set shrinks ...... 118
5.11 As workload types and intensities change at every ten minutes, resources on each VM and the numbers of VMs are altered to keep RPs close to one ...... 119
5.12 RPs when using Matrix and three static cluster settings on Amazon and Rackspace ...... 121
5.13 Average measured RPs and standard deviations using the recommended Amazon EC2 VC configurations ...... 122

6.1 Virtual I/O blending effect ...... 127
6.2 Integrating VIO-prefetching with a virtual machine host ...... 129

6.3 The speedups and standard deviations of three prefetching systems. (a) Experiments with various numbers of VMs; (b) Experiments with different numbers of in-guest processes; (c) Experiments with multiple VMs and mixed workload types; (d) Experiments with multiple in-guest processes and mixed workload types ...... 140
6.4 The speedups by VIO-prefetching for different applications and numbers of VMs ...... 142
6.5 VIO-prefetching accuracy for different applications and numbers of VMs. The accuracy is on the y-axis, measured as the amount of prefetched and used data divided by total used data ...... 144
6.6 VIO-prefetching cost for different benchmarks and number of VMs. VIO-prefetching cost is on the y-axis, defined as the ratio of the amount of unused prefetched data to the amount of prefetched data ...... 144
6.7 Box plots of speedups at different read/write ratios ...... 145
6.8 Average throughputs and standard deviations with and without VIO-prefetching at different scheduler combinations ...... 146
6.9 Box plots of speedups at different workload mixes ...... 146
6.10 Average speedups and standard deviations at different queue sizes and request patterns ...... 147

List of Tables

2.1 EBS and instance stores comparison ...... 15
2.2 Characteristics of Amazon micro and small instances ...... 15
2.3 Normalized App1 runtime in VM1 while running various App2 in VM2 ...... 17

3.1 S, SAT , and SAC of Xen/KVM in two-VM experiments ...... 50

3.2 S, SAT , and SAC of Xen/KVM in multi-VM experiments ...... 52

4.1 Application Characteristics ...... 63
4.2 Data-Intensive Applications and Benchmarks ...... 75
4.3 Cloud application settings ...... 88

5.1 Summary of representative applications ...... 100
5.2 Most recommended instance types for running certain workloads as on PM1 ...... 115
5.3 Cost efficiency (RPC and PPC) of Matrix and three static configurations on Amazon and Rackspace respectively ...... 121
5.4 Cost efficiency (RPC/PPC) of Matrix and three static configurations ...... 123

6.1 Applications for testing VIO-prefetching ...... 137

Chapter 1

Introduction

Cloud computing has achieved tremendous success in offering Infrastructure, Platform, and Software as a Service in an on-demand fashion to a large number of clients. This is evident in the popularity of cloud software services, e.g., Gmail and Facebook, and the rapid development of cloud platforms, e.g., Amazon Elastic Compute Cloud (EC2). The key enabling factor for cloud computing is virtualization technology, e.g., Xen [18], which provides an abstraction layer on top of the underlying physical resources and allows multiple operating systems (OSs) and applications to run simultaneously on the same hardware. Virtualization [150] plays a vital role in cloud computing. As hypervisors or virtual machine monitors (VMMs) encapsulate different applications into separate guest virtual machines (VMs), a cloud provider can leverage VM consolidation and migration to achieve excellent resource utilization and high availability in large data centers. A cloud computing system offers its users the illusion of “infinite” computing and storage capacities on an on-demand basis [11]. In particular, for scalability and flexibility of resource delivery, a cloud computing system does not provide each user with a different physical machine; instead, it allocates each user an independently managed VM which can be dynamically created, modified, and migrated. Examples of such a platform include the Xen VMs of Amazon EC2 and the .NET-based run-time environment of Microsoft Azure.

In their classic paper “Formal Requirements for Virtualizable Third Generation Architectures” [141], Popek and Goldberg stated that a VM shall have three properties: 1) efficiency, where a significant portion of the program runs without any intervention from the VMM, the hypervisor that manages the VMs; 2) resource control, which prevents any program from gaining full control of the system resources; and 3) equivalence, where any program running in a VM “performs in a manner indistinguishable” from an equivalent real machine. Although we have made strides on these properties, we have yet to achieve the vision of “an efficient, isolated duplicate of a real machine” that shows “at worst only minor decreases in speed”. To the contrary, today there is a big gap in performance between real (physical) and virtual machines: on the former one receives consistent, predictable performance, while the performance of the latter varies with a number of factors, e.g., the underlying hardware, the VMM, co-located applications, time of day, etc.

The essence of virtualization is that multiple VMs may multiplex and share the same physical resources (e.g., CPU, cache, DRAM, and I/O devices). Nonetheless, each VM is supposed to enjoy isolation (in terms of security and performance) from the other VMs. That is, different VMs should not be able to interfere with each other’s execution. Unfortunately, the lack of physical isolation can indeed introduce adverse interference effects among co-located VMs. In the following sections, Sec. 1.1 introduces how Swiper (Chapter 3) exploits these interference effects to degrade performance. Then, Sec. 1.2 states how TRACON (Chapter 4) builds interference models and arranges VMs to minimize interference. Sec. 1.3 brings in Matrix (Chapter 5), which gives users the desired VM performance with minimum cost. Sec. 1.4 introduces how VIO-Prefetching (Chapter 6) changes the way prefetching is done in virtualized environments. We summarize the contributions in Sec. 1.5 and provide an outline of this dissertation in Sec. 1.6.

1.1 Swiper

To comprehensively explore the interference effect, we develop Swiper to demonstrate a new type of VM vulnerability which enables a malicious user (i.e., VM) to exploit the resource contention between co-located VMs and obstruct the execution of a targeted application running in a separate VM located on the same physical machine as the malicious one. In particular, we focus on exploiting contention on shared I/O resources that are critical to data-intensive applications, e.g., hard disks and networks. Note that the main concern of Swiper is performance degradation caused by co-located adversaries, rather than information leakage, which has been the main focus of vulnerability studies in cloud computing systems [148]. Performance degradation is critical because it directly increases the cost per workload completed in the cloud [53, 66]. On the other hand, existing works on performance-degradation analysis were conducted in non-virtualized environments (e.g., for CPU, DRAM, hard disk, and network usage [37]) and cannot be directly applied to VMs. For example, a relevant prior work that proposed to exploit contention on hard disks [87] required access to the hard-disk queue in order to analyze the requests from both the adversary and the victim. However, this queue cannot be directly accessed by VMs, rendering such exploitation no longer applicable.
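The full attack pipeline (co-location detection, synchronization, and the peak attack) is presented in Chapter 3. As a purely illustrative aid, the sketch below shows the kind of timed, low-cost I/O burst a co-located tenant could issue to contend for a shared disk; the file path, burst size, and pacing are hypothetical and not taken from Swiper's implementation.

```python
import time

def issue_io_burst(path, burst_bytes=64 << 20, chunk=1 << 20):
    """Sequentially read `burst_bytes` from `path` in `chunk`-sized requests.

    A short, timed burst like this is enough to contend for the shared disk
    with a co-located VM, while issuing it only at chosen moments keeps the
    issuer's own resource consumption (and cost) low.
    """
    read = 0
    with open(path, "rb", buffering=0) as f:
        while read < burst_bytes:
            data = f.read(chunk)
            if not data:          # wrap around at end of file
                f.seek(0)
                continue
            read += len(data)
    return read

if __name__ == "__main__":
    # Hypothetical large file on the instance's local store.
    target = "/mnt/scratch/large_file.bin"
    for _ in range(3):                      # three bursts, 10 s apart
        start = time.time()
        n = issue_io_burst(target)
        print("burst: %d MiB in %.2f s" % (n >> 20, time.time() - start))
        time.sleep(10)
```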

1.2 TRACON

Having exposed the vulnerability of performance isolation in a virtualized environment and provided insights into this issue, our subsequent research on interference leads us to build mathematical models for estimating performance degradation and a novel scheduling framework for optimizing execution in data centers. As we discussed earlier, an application running inside a VM shall achieve the performance it would achieve if it owned a portion of the machine to itself, that is, independent of co-located applications and VMs that share the same physical resource. Although extensive work has been done to achieve this so-called performance isolation [96, 120, 131], including various techniques to ensure CPU fair sharing [91], little attention has been paid to data-intensive applications that perform complex analytics tasks on a large amount of data, which have become increasingly common in this environment [54]. Traditionally assuming exclusive ownership of the physical resources, these applications are optimized for hard drive based storage systems by issuing large sequential reads and writes. However, this assumption breaks down in a shared, virtualized environment, and subsequently the previously optimized I/O requests are no longer advantageous. To the contrary, multiple data-intensive applications will compete for the limited bandwidth and throughput of network and storage systems, which very likely leads to high I/O interference and low performance. In this case, the combined effects from concurrent applications, when deployed on shared resources, are largely difficult to predict, and the interference resulting from competing I/Os remains an obstacle to achieving high-performance computing in a virtualized environment.

We systematically study the performance effects of co-located data-intensive applications, and develop TRACON1, a novel Task and Resource Allocation CONtrol framework that mitigates the interference effects from concurrent applications. TRACON leverages modeling and control techniques from statistical machine learning and acts as the core management scheme for a virtualized environment.

TRACON was first evaluated in simulation and showed a good ability to improve application performance and I/O throughput for data-intensive applications. Then, we conduct experiments on real-world cloud applications to show that TRACON can achieve up to 25% improvement in application throughput for virtualized data centers. In addition, as the total energy used by US data centers is estimated to approach 3% of national electricity consumption [182], TRACON may help a data center save up to 30% more energy because the overall system performance is greatly improved.

1 In aviation, TRACON stands for Terminal Radar Approach Control facilities, “for example, Potomac TRACON handles air traffic going into and out of all the airports around Washington D.C., Baltimore, MD, and Richmond, VA.” http://www.faa.gov.
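Chapter 4 develops the actual interference prediction models. The following sketch, which uses synthetic profiling numbers and a simple quadratic basis chosen purely for illustration, shows the general shape of such a model: it maps a co-located VM's observed resource consumption to a predicted slowdown of the target application.

```python
import numpy as np

def quadratic_features(x):
    """Expand [cpu, io] into [1, cpu, io, cpu*io, cpu^2, io^2] -- a simple
    nonlinear basis in the spirit of (but not identical to) the nonlinear
    interference models described in Chapter 4."""
    cpu, io = x
    return np.array([1.0, cpu, io, cpu * io, cpu ** 2, io ** 2])

# Hypothetical profiling data: co-runner's CPU utilization (0-1) and I/O rate
# (MB/s), with the observed normalized runtime of the target application when
# the two share a host.
profiles = np.array([
    [0.1,  5.0,  1.05],
    [0.8,  5.0,  1.90],
    [0.1, 80.0,  9.80],
    [0.8, 80.0, 15.70],
    [0.4, 40.0,  3.10],
    [0.9, 60.0, 12.20],
    [0.2, 20.0,  1.60],
    [0.6, 70.0, 10.50],
])

X = np.array([quadratic_features(row[:2]) for row in profiles])
y = profiles[:, 2]

# Least-squares fit of the interference model.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Predict the slowdown for a new co-runner (hypothetical values).
new_corunner = np.array([0.5, 70.0])
predicted = quadratic_features(new_corunner) @ coef
print("predicted normalized runtime: %.2f" % predicted)
```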

1.3 Matrix

While TRACON utilizes interference models to make VM performance more predictable, virtualization technology has yet to achieve the vision of “an efficient, isolated duplicate of a real machine”. That is, a VM shall be able to provide performance close to what a user would expect from a specific physical machine. Take a real-world example: before buying a new tablet computer from an online retailer, one may visit a local store like BestBuy to test drive and compare various products. Generally, any product the customer eventually receives from the online retailer will be the same as what is presented locally. Unfortunately, when one purchases a VM in the cloud, little guarantee is provided that an application hosted by the VM will achieve performance similar to running on a real machine. We propose the concept of Relative Performance (RP) as the “equivalence” metric that measures the performance ratio between a physical machine (PM) and a VM. For a workload w, the RP can be formally defined as

RP = Perf_VM / Perf_PM,    (1.1)

where Perf_VM and Perf_PM are the performance of the workload w when running on the VM and PM, respectively. The performance is workload dependent and can be measured as runtime (e.g., sequence alignment), throughput (e.g., video streaming), latency (e.g., webpage serving), etc. An RP equal to one means that the workload delivers identical performance on the VM and PM.

In a cloud, many factors such as limited control of VM allocation and competition from co-located VMs for shared resources (e.g., CPU and I/O devices) contribute to hard-to-predict VM performance. To illustrate the gap in expected performance, we run three benchmarks, ranging from I/O-intensive and memory-intensive to CPU-intensive workloads, both locally and on Amazon EC2. Fig. 1.1 shows the RPs (in runtime/latency for three benchmarks) for two local physical machines (PM1 and PM2) on four EC2 instances. In this test, PM1 has a 2.93 GHz Intel Core2 Duo processor and 4 GB memory, and PM2 has a 3 GHz Intel Pentium4 processor with 2 GB memory. For the Amazon EC2 instances, the m1.small type is equipped with 1.7 GB memory and 1 EC2 Compute Unit priced at six cents per hour, m1.medium with 3.75 GB memory and 2 Compute Units at 12 cents per hour, m1.large with 7.5 GB memory and 4 Compute Units at 24 cents per hour, and t1.micro with the smallest amount of memory (613 MB) and CPU resources. Our tests show that the RP for these three benchmarks can vary dramatically, from 18% of the target performance to more than three times. Clearly, it is challenging to know ahead of time which VM instance provides a good tradeoff between cost and performance for each application. Benchmarking an application in the cloud may alleviate the problem, but it can become cumbersome as public cloud providers offer dozens of VM types.

Figure 1.1: The performance for various EC2 instances ranges from 18% to 3.7 times that of two physical machines, PM1 and PM2. Each column shows an average of ten runs.

In this work, we propose a novel performance and resource management system, Matrix, that helps cloud providers gain better control of shared resources while delivering predictable VM performance to users. To achieve this goal, Matrix utilizes clustering models with probability estimates to predict the performance of new workloads in a virtualized environment, chooses a suitable VM type, and dynamically adjusts the resource configuration of a VM on the fly.

The first contribution is that Matrix can predict accurately how a new workload will perform on different cloud VM instances. To this end, Matrix first constructs performance models of a set of representative workloads that define common application “genes”. Given performance models for these applications, we leverage support vector clustering (SVC) to quickly classify a new workload, using soft boundary probability estimates to infer its “gene” composition. A number of studies [103, 121, 176, 178] have worked on service-level agreements (SLAs), performance prediction, and anomaly detection in virtualized environments, where the difficulties often lie in understanding the dynamic relationship between resource allocation and VM performance. To the best of our knowledge, this is the first attempt to build classification models based on probability estimates of different applications and cloud instance types.

The second contribution is that Matrix allocates VM resources to an application in a way that minimizes cost while achieving good performance. To this end, Matrix applies an approximate optimization algorithm and makes use of the characteristics of the kernel functions of the support vector machine (SVM) to find the optimized resource allocation. More specifically, support vector regression (SVR) is used to develop our RP models. By exploiting gene composition knowledge, Matrix can do so without knowing a priori application information within guest VMs.

Third, Matrix is able to handle different cloud environments and applications. We conduct a comprehensive set of experiments with real cloud applications, ranging from a single machine and a local cluster to a virtual cluster, to evaluate Matrix on both our private cloud and the public clouds of Amazon EC2 and Rackspace cloud servers. In this dissertation, we present three use cases of Matrix:

• Automatic VM configuration. Matrix can adapt VM settings to the changes in workload, while maintaining desired performance and achieving good cost-efficiency in the cloud.

• VM instance recommendation. With workload and VM performance models, Matrix can recommend the VM instance that is best suited for specific applications.

• Cloud provider recommendation. Given a new application, Matrix can also help users to choose an appropriate VM from different cloud providers.
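To make the classification step described above concrete, the following small, self-contained sketch (not Matrix's actual implementation) uses scikit-learn's SVM classifier with probability estimates on made-up workload signatures; the feature choices, gene labels, and numbers are all hypothetical.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical low-dimensional workload "signatures" (e.g., normalized CPU
# utilization, disk I/O rate, and memory bandwidth), sampled around three
# invented application "genes".
rng = np.random.default_rng(0)
centers = {"cpu": [0.9, 0.1, 0.2], "io": [0.2, 0.9, 0.1], "mem": [0.3, 0.2, 0.9]}
X, y = [], []
for gene, center in centers.items():
    for _ in range(10):
        X.append(np.clip(rng.normal(center, 0.05), 0.0, 1.0))
        y.append(gene)

# An SVM classifier with probability estimates stands in for the SVC-based
# classification with soft boundaries described in Chapter 5.
clf = SVC(kernel="rbf", probability=True, random_state=0)
clf.fit(np.array(X), np.array(y))

# The class probabilities for a new, unlabeled workload serve as its soft
# "gene" composition, which per-gene performance models could then combine
# to choose a VM type.
new_workload = np.array([[0.5, 0.6, 0.2]])
composition = dict(zip(clf.classes_, clf.predict_proba(new_workload)[0]))
print(composition)
```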

1.4 VIO-Prefetching

After developing models and tools that improve virtualized systems by handling virtualization overhead and performance interference, we zoom in on the virtualization architecture and propose innovative designs for I/O virtualization. In particular, we improve data prefetching by leveraging the peculiarities of I/O virtualization and solid-state drives (SSDs).

In addition to the prevailing virtualization technology in data centers, cloud service providers have also adopted flash-based SSDs for high throughput and low energy consumption [3, 43]. For example, Amazon is using SSDs for the DynamoDB application in Amazon Web Services (AWS), which offers VMs, virtual storage, and computing services [123]. However, the semantic gap between the host and guest domains prevents the system from optimizing the data prefetching scheme. The host machine does not have complete records of the guests’ I/O requests; the guest machine does not know the load on the physical storage devices and is not aware of the existence of other VMs. Therefore, future data centers certainly need novel prefetching technology to meet the needs of new storage devices and virtualized environments. A prefetching algorithm should adapt its read size and pattern based on device and application characteristics. In a virtualized environment, however, I/O requests from multiple data-intensive applications lose most of their sequentiality after going through the virtualization layer and reaching the storage devices, a phenomenon known as virtual I/O blending [159].

To address these new challenges, we propose VIO-prefetching for the prevailing virtualized environments and emerging flash drives. VIO-prefetching monitors the system status as feedback and is aware of the dynamics of both devices and applications. The notable features of VIO-prefetching include not only feedback-controlled aggressiveness that exploits high-performance SSDs, but also bridging of the information gap between the virtualized domains to improve pattern recognition accuracy. We implement and evaluate VIO-prefetching on a Xen virtualized system with VMs running various server workloads. The evaluation results show that VIO-prefetching is able to identify I/O access patterns from guest VMs and successfully improve virtual I/O performance. Our comprehensive study also provides insights into VIO-prefetching’s behavior at various virtualization system configurations.
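Chapter 6 describes the actual feedback-monitoring design; the toy sketch below only illustrates the idea of feedback-controlled aggressiveness, using hypothetical thresholds and block counts rather than the real controller.

```python
def adjust_prefetch_size(current_size, used, prefetched,
                         min_size=8, max_size=512,
                         low=0.4, high=0.8):
    """Feedback rule in the spirit of VIO-prefetching (Chapter 6): grow the
    per-interval prefetch size while most prefetched blocks are consumed,
    and back off when too much prefetched data goes unused. The thresholds
    and size limits here are invented for illustration."""
    accuracy = used / prefetched if prefetched else 0.0
    if accuracy >= high:
        current_size = min(current_size * 2, max_size)   # be more aggressive
    elif accuracy < low:
        current_size = max(current_size // 2, min_size)  # back off
    return current_size

# Example: the monitor reports (used, prefetched) block counts each interval.
size = 32
for used, prefetched in [(30, 32), (60, 64), (50, 128), (20, 64)]:
    size = adjust_prefetch_size(size, used, prefetched)
    print("next prefetch size: %d blocks" % size)
```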

1.5 Contributions

The goal of this dissertation is to understand and optimize machine virtualization in data centers, with a particular focus on the virtual I/O framework. Bearing the above inadequacies and challenges in mind, this dissertation makes the following major contributions:

• Conducted thorough research on I/O virtualization with the goal to advance cloud computing technology;

• Systematically explored the interference effect from co-located VMs;

• Reduced resource contention between VMs and enhanced virtual I/O performance, especially improved I/O throughput by a new virtual I/O prefetching scheme;

• Achieved the equivalence property of virtualization through mathematical modeling and machine learning techniques.

More specifically, we present Swiper in Chapter 3, which studies the adverse performance effects in virtualized environments. In order to effectively degrade a target VM’s performance, Swiper addresses several technical challenges and makes the following contributions:

• To synchronize with a victim VM’s working patterns, we design a discrete Fourier transform (DFT) based algorithm which recovers the victim’s original I/O pattern from the observed (distorted) time-series of I/O throughput, and then determines whether the victim application has reached a pre-determined point at which it is most vulnerable to exploitation, e.g., when it is reading large amounts of data from storage (a simplified sketch of this idea follows this list).

• One critical issue for Swiper is the existence of “bystander” VMs. In particular, the I/Os from bystander VMs become “noise” that is mixed into the observed I/O time-series - which requires the attacker VM to somehow “filter out” the noise before launching the synchronization and attack phases. We demonstrate through theoretical analysis and experiments that our DFT-based synchronization and peak attack techniques are resilient to the existence of one or more bystander VMs.

• A comprehensive set of experiments over Amazon EC2 - with the results clearly showing that Swiper is capable of degrading various server applications by 22.54% on average (and up to 31%) for different instance types and benchmarks, while keeping the resource consumption to a minimum.
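The actual synchronization algorithm, including how the scaling factor and offset are derived, is given in Chapter 3. The fragment below is only a toy illustration, on synthetic traces, of the first step: recovering a stretching factor by comparing the dominant frequencies of a reference and an observed I/O throughput time-series. The trace shapes and numbers are invented.

```python
import numpy as np

def dominant_freq(trace, dt=1.0):
    """Return the dominant non-DC frequency of an I/O throughput trace."""
    spectrum = np.abs(np.fft.rfft(trace - np.mean(trace)))
    freqs = np.fft.rfftfreq(len(trace), d=dt)
    return freqs[1:][np.argmax(spectrum[1:])]

def estimate_stretch(reference, observed, dt=1.0):
    """Estimate how much the observed trace is stretched in time relative to
    the reference: a stretch > 1 means the victim is running slower than in
    the reference profile."""
    return dominant_freq(reference, dt) / dominant_freq(observed, dt)

# Synthetic example: a reference victim trace with a 20-second I/O cycle,
# and an observed trace of the same workload stretched by 1.5x.
t = np.arange(0, 600)                               # one sample per second
reference = 50 + 40 * np.sin(2 * np.pi * t / 20)
observed = 50 + 40 * np.sin(2 * np.pi * t / 30)     # same pattern, stretched

print("estimated stretching factor: %.2f" % estimate_stretch(reference, observed))
```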

Then, we introduce TRACON in Chapter 4 to provide interference models and design a novel scheduling framework for optimizing performance in data centers. The main contributions of TRACON are two-fold:

• We characterize the I/O interference from multiple concurrent data-intensive applications in a virtualized environment, and build interference prediction models that reason about application performance in the event of varied levels of I/O interference. While prior works have extensively studied the VM interference on CPUs, caches, and main memory [96, 102, 131], we address the new challenges that arise when modeling data-intensive applications in virtualized data centers. Our statistical models profile the performance of a target application, when running against a set of benchmarks, to infer both the runtime and I/O throughput of the application. We propose to employ nonlinear models that are critical to capture the bursty I/O patterns in data-intensive applications. This design utilizes the application characteristics, observed from the VMs, and maintains a low system overhead.

• We develop a management system TRACON for a virtualized data center to mitigate the interference effects from co-located data-intensive applications. To achieve this goal, we incorporate the proposed nonlinear interference prediction models into TRACON and by doing so, the system can make optimized scheduling decisions that lead to significant improvements in both application performance and resource utilization. We conduct a comprehensive set of experiments on two virtualized clusters with real-world cloud applications.

With the interference effect in mind, Matrix in Chapter 5 enables better control of shared resources while delivering predictable VM performance in data centers. There are three main contributions of Matrix:

• The first contribution is that Matrix can predict accurately how a new workload will perform on different cloud VM instances. To this end, Matrix first constructs performance models of a set of representative workloads that define common application “genes”. Given performance models for these applications, we leverage the support vector clustering (SVC) to quickly classify a new workload, using soft boundary probability estimates to infer its “gene” composition. A number of studies [103, 121, 176, 178] have worked on service-level agreement (SLA), performance prediction, and anomaly detection in virtualized environments, where the difficulties often lie in the understanding of the dynamic relationship between resource allocation and VM performance. To the best of our knowledge, this is the first attempt to build classification models based on probability estimates of different applications and cloud instance types.

• The second contribution is that Matrix allocates VM resources to an application in a way that minimizes cost while achieving good performance. To this end, Matrix applies an approximate optimization algorithm and makes use of the characteristics of the kernel functions of the support vector machine (SVM) to find the optimized resource allocation. More specifically, support vector regression (SVR) is used to develop our RP models (see the sketch after this list). By exploiting gene composition knowledge, Matrix can do so without knowing a priori application information within guest VMs.

• Third, Matrix is able to handle different cloud environments and applications. We conduct a comprehensive set of experiments with real cloud applications, ranging from a single machine and a local cluster to a virtual cluster, to evaluate Matrix on both our private cloud and the public clouds of Amazon EC2 and Rackspace cloud servers.
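As a companion to the classification sketch in Section 1.3, the snippet below sketches one possible shape of the RP-modeling step: an RBF-kernel support vector regression fitted on made-up (VCPU, memory) measurements, followed by a brute-force search over a few candidate instances standing in for the approximate optimization described in Chapter 5. All configurations, RP values, and prices are hypothetical.

```python
import numpy as np
from sklearn.svm import SVR

# Hypothetical training data for one application "gene":
# (VCPUs, memory in GB) -> measured relative performance (RP).
configs = np.array([[1, 1.7], [1, 3.75], [2, 3.75], [2, 7.5],
                    [4, 7.5], [4, 15.0], [8, 15.0], [8, 30.0]])
rp = np.array([0.35, 0.45, 0.70, 0.85, 0.98, 1.05, 1.15, 1.20])

# An RBF-kernel support vector regression stands in for Matrix's RP model.
model = SVR(kernel="rbf", C=10.0, gamma=0.5, epsilon=0.02)
model.fit(configs, rp)

# A brute-force search over candidate instances replaces the approximate
# optimization: pick the cheapest candidate whose predicted RP reaches the
# target. Instance specs and hourly prices are invented.
candidates = {"small": ([1, 1.7], 0.06), "medium": ([2, 3.75], 0.12),
              "large": ([4, 7.5], 0.24), "xlarge": ([8, 15.0], 0.48)}
target_rp = 0.9
best = None
for name, (cfg, price) in sorted(candidates.items(), key=lambda kv: kv[1][1]):
    if model.predict(np.array([cfg]))[0] >= target_rp:
        best = (name, price)
        break
print("recommended instance and hourly price:", best)
```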

Next in Chapter 6, we zoom in to the virtualization architecture and propose an innovative prefetching method for I/O virtualization. The main contributions of VIO-prefetching are:

• The design of VIO-prefetching is to prefetch data that appropriately matches VM needs by monitoring performance metrics online and adjusting the prefetching size.

• The innovative virtual I/O prefetching architecture. The evaluation results show that VIO-prefetching is able to identify I/O access patterns from guest VMs and successfully improve virtual I/O performance. Our comprehensive study provides insights into VIO-prefetching’s behavior at various virtualization system configurations.

1.6 Dissertation Organization

The remainder of this dissertation is organized as follows. We start with background knowledge, virtualization technology, and related work in Chapter 2. Next, Chapter 3 studies the adverse performance effects in virtualized environments caused by resource competition. Then, we introduce TRACON in Chapter 4 to provide interference models and design a novel scheduling framework for optimizing performance in data centers. Chapter 5 presents Matrix, a novel performance and resource management system, which enables better control of shared resources while delivering predictable VM performance in data centers. Chapter 6 proposes an innovative I/O prefetching method that leverages the characteristics of flash-based SSDs and I/O virtualization. Finally, in Chapter 7, we conclude and envision numerous potential directions for further improving I/O virtualization and cloud computing.

Chapter 2

Background and Related Work

2.1 Amazon Elastic Compute Cloud

We use Amazon EC2 as one platform to carry out the experiments of Swiper and Matrix in Chapters 3 and 5, respectively. EC2 customers can control and configure computing resources as needed on a pay-as-you-go basis. Amazon provides various types of computing “instances” for a customer to choose from, with each instance providing a specified amount of computing capacity. The customer is charged based on the instance-hours consumed.

As Swiper focuses on a vulnerability arising from competition for I/O resources, we choose two types of Amazon EC2 instances, micro and small, as the experiment instances because they use two fundamentally different storage types. Micro instances can only use the Elastic Block Store (EBS) as their storage device, while small instances can use local instance stores (i.e., local storage devices). EBS provides persistent network storage, which is independent of the lifetime of the EC2 instance. It also automatically generates duplicates for fault tolerance. In contrast, data stored on a local instance store only exists during the lifetime of the instance. Note that unlike EBS, there is no extra cost to use local instance stores. We list and compare the features of EBS and local instance stores in Table 2.1. Besides the difference in storage types, micro and small instances also have different memory sizes and compute capacities. Table 2.2 summarizes the characteristics of the micro and small instances. In general, the micro instance is designed for applications that periodically consume some compute cycles, and the small instance (the most popular EC2 instance type) is well suited for most applications. We leave research on I/O resources on other Amazon instances (e.g., large, extra large) as future work.

Table 2.1: EBS and instance stores comparison

                  EBS                      Instance store
Devices           Networked disk array     Local hard drive
Persistence       Yes                      No
Replica           Yes                      No
Cost              $0.1 per GB-month,       No
                  $0.1 per million I/Os

Table 2.2: Characteristics of Amazon micro and small instances

Features             micro       small
Memory               613 MB      1.7 GB
Storage              EBS only    instance stores
EC2 Compute Unit     1∼2         1

2.2 Virtualization

Virtualized data centers are common cloud computing platforms. We focus on Xen and its notable paravirtualization technique, where the Xen VMM works as a hardware abstraction layer for guest operating systems with modified kernels. Note that Xen also supports hardware-assisted full virtualization that emulates the host hardware for unmodified operating systems. In paravirtualization, the VMM is in charge of resource control and management, including CPU time scheduling, routing hardware interrupt events, allocating memory space, etc. In addition, a driver domain (Dom0) that has the native drivers of the host hardware performs the I/O operations on behalf of the guest domains (DomU). Fig. 2.1 depicts a typical Xen I/O architecture, where each guest domain uses a virtual device driver (frontend) in cooperation with a corresponding module (backend) and the native driver in the driver domain to accomplish I/O operations. The hypervisor and Dom0 work together to ensure security isolation and performance fairness among all VMs. While fairness in CPU and memory virtualization is relatively easy to achieve, we show that maintaining performance isolation for virtual I/O can be extremely challenging.

Figure 2.1: Xen I/O architecture. Solid lines represent the I/O channels for data transmission. Host hardware interrupts are controlled and routed through the VMM and depicted as dashed lines.
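The figure above is represented here only by its caption. As a conceptual aid (not Xen code), the following sketch models the split-driver path in a few lines of Python: guest frontends place block requests on a shared ring and a Dom0 backend drains them in arrival order, which also gives a small-scale view of the virtual I/O blending discussed later. All names and block numbers are invented.

```python
from collections import deque

class Frontend:
    """Simplified paravirtual frontend: the guest's block requests are placed
    on a shared ring instead of going to the hardware directly."""
    def __init__(self, name, ring):
        self.name, self.ring = name, ring

    def read(self, start_block, count):
        for block in range(start_block, start_block + count):
            self.ring.append((self.name, block))

def backend_service(ring):
    """Simplified Dom0 backend: drains the ring in arrival order and would
    issue the requests through the native driver (here, just recorded)."""
    order = []
    while ring:
        guest, block = ring.popleft()
        order.append("%s:%d" % (guest, block))
    return order

ring = deque()
vm1, vm2 = Frontend("Dom1", ring), Frontend("Dom2", ring)

# Two guests issue sequential reads in alternating batches; by the time the
# backend sees them, the two streams are interleaved.
for i in range(3):
    vm1.read(100 + 4 * i, 4)
    vm2.read(900 + 4 * i, 4)
print(" ".join(backend_service(ring)))
```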

2.3 Preliminary Interference Experiments

When multiple VMs are running on the same physical machine, several factors contribute to degraded application performance, including virtualization overheads and the imperfect performance isolation between VMs. As cloud applications become increasingly data centric, we shall address the additional challenges of I/O interference when running data-intensive applications in such virtualized environments.

In the following example, we illustrate this problem using data collected on a machine managed by Xen, where two applications, App1 and App2, run in two separate VMs, VM1 and VM2, respectively. In the first scenario (the first row in Table 2.3), App1 is a CPU-intensive program (Calc) that performs algorithmic calculations in VM1, and its runtimes are measured when the program in VM2 is CPU-intensive, data-intensive, both CPU- and data-intensive, or moderately CPU- and data-intensive. We normalize App1 runtimes to that of App1 running alone, that is, to the runtime without interference. In the second scenario (the second row in Table 2.3), App1 is a data-intensive program (SeqRead) that sequentially reads a large file.

Table 2.3: Normalized App1 runtime in VM1 while running various App2 in VM2

App1 \ App2    CPU (High)    I/O (High)    CPU & I/O (Medium)    CPU & I/O (High)
Calc           1.96          1.26          1.77                  2.52
SeqRead        1.03          10.23         1.78                  16.11

As illustrated in Table 2.3, while two CPU-intensive applications show performance slowdowns, the result is not unexpected. In this case, because both VMs are multiplexed on the same CPU, the runtime of App1 is simply doubled due to Xen’s fair credit scheduling. However, for a data-intensive program, performance interference is much more severe and unpredictable. For example, while App1 (SeqRead) experiences little change in performance when App2 has a heavy CPU consumption, it slows down by 10 times when App2 is performing a similar task and competing for I/O devices. Furthermore, because the driver domain handles I/O operations on behalf of unprivileged guest domains [120, 143], the interference can become even more severe when App2 demands both CPU and I/O resources - App1 is 16 times slower in this case.

2.4 Related Work

To cover the broad spectrum of the literature, this section classifies related work according to its relevance to each chapter.

2.4.1 Swiper

In recent years, there have been extensive studies on side channel attacks in virtualized environments which compromise sensitive information and/or reduce performance. These methods have been carried out and also mitigated on memory [126] and cache [131, 145, 148]. The key difference between these studies and Swiper is that we leverage the limited access bandwidth of non-volatile storage devices which are usually shared by a larger number of VMs and are more difficult to enforce an isolation across different VMs. The synchronization phase of our proposed attack is related to the problem of subsequence matching in time-series data, which has been extensively studied in the literature of databases and data mining (see [94] for a survey), because the main objective is to identify the correspondence between a subsequence of the (observed) throughput allocated to attacker and a subsequence of the throughput requested by the victim application. In the existing work, whole [4, 38, 93], range [14, 60, 124], and ranked [72] sequence matching have been addressed using distance measures such as dynamic time wrapping (DTW) [100]. Nonetheless, a key difference between the subsequence matching problem and the requirement of synchronization is that the I/O request time-series of the victim application can be “stretched” substantially at runtime. Thus, most existing techniques based on sliding window constructions cannot be directly applied. To address this challenge, our proposed synchronization technique focuses on recovering the stretching factor through DFT, and then use the stretching factor to derive the scaling factor and the offset, thereby achieving synchronization. While efforts [63, 194] have been spent to address the security challenges in VMs, the state of the art cannot fully enforce the performance isolation, especially in the cases of I/O-intensive applications [44, 71]. The existing approaches to provide perfor- mance guarantees include: resource allocation in a static [9] or dynamic [167] manner; scheduling-based bandwidth and latency control [28, 68, 78, 136, 155]; and feedback and adaptive control, e.g., Fa¸cade[113] designed a feedback controller to throttle I/O

18 requests to virtualized storage systems. The similar approach has been adopted and extended in [68, 77, 88, 166, 183, 188, 203]. The main goal of prior works is to en- sure the fairness among the users, or enforce pre-defined policies where certain users are favored. However, as we will show in Sec. 3.6, an adversary, while consuming little I/O bandwidth, can intentionally delay the victim VM after locating the most vulnerable stages of the targeted application. A number of techniques can be utilized to mitigate the performance inference attacks from malicious VMs. First, CPU resource allocation is one of the key factors for the performance isolation of data-intensive workloads [44, 71]. The increasing number of IO requests from a guest domain will obviously increase Dom0’s CPU consumption. Nevertheless, such resource usage is often not accurately accounted, although Dom0 handles all guest domains’ I/O requests. A potential solution is to provide physically isolated CPU cores for each guest domains’ I/O process. One may also bound guest domain’s I/O process and VM instance on the same CPU(s) to avoid extra cores for each guest domain. Alternatively, in order to achieve complete performance isolation, a new type of virtualization is proposed in [92] that use all dedicated resources, e.g., one VM per CPU core, full memory partition, and dedicated I/O devices. However, it remains undecided if such virtualization would achieve wide adoption due to the need of special hardware supports. Second, a new storage system management can be built based on dynamic resource allocation. Static storage system design for performance guarantee [9] is not practical in a virtualized environment because workload patterns from each guest domain may change over time. On the other hand, dynamic resource allocation is more suitable in this kind of environment. One common technique [155, 167] to provide isolation guarantee is called quanta-based scheduling, which basically grants each guest domain a period of time to exclusive ownership of the resources. However, quanta-based scheduling method will likely increase the latency and may not be suitable for all applications. Further, although quanta-based scheduling may guarantee certain level of performance isolation, it loses the flexibility to provide higher throughput when many VMs are idle. In other words, it may waste available resources. There is a

19 trade-off between latency and overhead in quanta-based scheduling. If the time slot is small, the latency can be smaller but the number of context switching will be higher. Consequently, resources may not be well utilized. In this case, one can apply Swiper’s detection and synchronization techniques to avoid the collision of the performance peaks among the VMs. Thus, the performance isolation and high resource utilization can both be achieved. Several previous works also exploit performance models for performance debug- ging and anomaly detection in virtualized environments [23, 49, 158, 175, 176]. For example, Bodik et al. [23] exploit logistic regression with L1-norm regularization to effectively diagnose and fix performance anomaly; Cohen et al. [49] use Bayesian networks to capture signatures of performance crisis; Shen et al. [158] analyze and model application traces to identify performance anomaly. All these works are possi- ble countermeasures to locate malicious VMs. Many works use feedback and adaptive control [68, 77, 88, 113, 166, 183, 188, 203] for managing resources. These advanced control methods actually provide additional hints because Swiper’s throughput will be changed when searching for victims. These works may be able to mitigate the performance inference, but will require the whole system to use more resources to maintain the desired performance. In addition, Swiper may make the controller unstable because of frequently changing I/O patterns. As a result, the system will never reach a stable state. One possible approach of avoiding potentially interfering workloads is to use the live migration feature of VMs. However, live migration usually migrates the com- puting instances that may keep using the same storage node. This minimizes the pausing time of VM execution because moving the disk image is very expensive. As Swiper targets at a victim’s I/O operations, an effective live migration framework may choose to dynamically distribute the workloads among several duplicate disk images that consistency needs to be properly maintained. Except using duplicate disk images, an alternative solution on public clouds could be using I/O optimized instances, e.g., high-I/O (HI1) or high-storage (HS1) on- demand instances. HI1 and HS1 are exclusively backed with solid state drive and

hard disk drive RAIDs, respectively. They are also the two most expensive on-demand instances, at 3.1 and 4.6 dollars per hour. Both HI1 and HS1 instances provide high I/O performance, which may reduce the possibility of resource contention from co-located VMs. However, other physical resources, e.g., the last-level cache and network channels, are still shared among these instances. Exploring the interference effect on these I/O-optimized instances would be interesting future work.

2.4.2 TRACON

Traditional performance modeling [129, 173, 205] and scheduling techniques [26, 164, 190] focus on computation-intensive applications and model CPU utilization and performance. Extensive work has been done to characterize and predict I/O performance, mostly in non-virtualized environments [76, 122, 156]. Du et al. [58] study the issues in profiling VMs for performance debugging and locating performance bottlenecks in a virtualized environment. For VM performance modeling, Wood et al. [199] measure and use application characteristics to model the virtualization overheads, and Kundu et al. [102] propose an iterative model training technique based on artificial neural networks to build models for predicting application performance in virtualized environments. Although [102] considers I/O contention when evaluating performance models, it does not estimate the degree of interference among co-located VMs. Mei et al. [120] and Pu et al. [143] study network traffic interference in virtualized cloud environments. However, it is unclear how the above work can be utilized to mitigate the I/O interference for data-intensive applications. Kim et al. [95] focus on the scheduler at the VMM level and improve Xen's credit scheduler by taking I/O performance into account while guaranteeing CPU fairness. Weng et al. [193] also focus on the scheduler at the VMM level and propose a hybrid scheduling framework that considers the needs of different types of VMs. While TRACON focuses on managing performance interference across virtualization hosts, TRACON and [95, 193] can work together toward a better virtualized data center. Speitkamp and Bichler [169] formulate server consolidation in virtualized data centers

as optimization problems. The objective function of these problems is to minimize operation costs by using as few physical servers as possible. TRACON can complement this work by adding performance interference as a trade-off factor. AutoControl [138] proposes to automatically adapt to dynamic workload changes to achieve application service level objectives (SLOs). AutoControl combines an online model estimator and a multi-input, multi-output (MIMO) resource controller. The model estimator captures the relationship between application performance and resource allocations, while the MIMO controller allocates multiple virtualized resources to achieve application SLOs. Similarly, Wang et al. [189] present a cluster-level MIMO power controller that adjusts power among servers based on their performance needs, while controlling the total power of the cluster to stay at or below a constraint imposed by the capacity of its power supplies. Xu et al. [200] propose a two-level control system to manage the mappings of workloads to VMs and of VMs to physical resources. It focuses on a multi-objective optimization problem aiming to simultaneously optimize possibly conflicting objectives, including making efficient usage of multidimensional resources, avoiding hotspots, and reducing energy consumption. TRACON is unique in the sense that it focuses on application performance interference among co-located VMs. pSciMapper [208] is a power-aware consolidation framework for scientific workloads that builds models to relate resource requirements to performance and power consumption. pSciMapper evaluates the trade-off between energy reduction and performance degradation when consolidating workloads onto one host, and utilizes correlation analysis between performance and resource contention as the distance metric in its clustering algorithm for consolidation. There is no direct performance prediction in pSciMapper. In contrast, TRACON is able to estimate runtime or IOPS and make use of these predictions in VM scheduling. TRACON is closely related to Q-Clouds [131]. Q-Clouds utilizes online feedback to build a MIMO model that captures the performance interference, and tunes resource allocations to mitigate it. However, Q-Clouds focuses on CPU-bound workloads. In comparison, TRACON goes further by focusing on

the performance improvement of the whole system. Moreover, TRACON investigates data-intensive scientific workflows and demonstrates that it can be used in a large-scale system under heavy workloads.

2.4.3 Matrix

The two main categories of research related to Matrix (Chapter 5) are performance modeling and resource management. Performance Modeling and Analysis has been extensively studied, both in non-virtualized environments [130, 171] and in virtualized environments [45, 80, 102, 146, 199]. There are also performance models which target specific applications or system components. For example, Dryad [108] models the performance of parallel matrix multiplication in virtualized environments, and Watson et al. build probability distribution models of response time and CPU allocations in virtualized environments [192]. While we share the same idea of exploiting machine learning techniques, we further explore the ability of classification with probability estimates to model the performance of new workloads. Automatic Resource Configuration is an important issue in parallel and distributed systems [74, 106, 152], and performance monitoring and analysis tools [105] have been developed for production virtualization software. Similarly, various machine learning techniques have shown promising results for VM provisioning and configuration, e.g., clustering [144], classification [117], and reinforcement learning [147]. Also, several works have focused on minimizing operation cost; for example, Niehörster et al. [133] apply fuzzy control at runtime, and Kingfisher [157] formulates the problem as an integer linear program (ILP) and implements a heuristic ILP solver. Most related to our work are several existing resource configuration frameworks such as DejaVu [186], JustRunIt [207], and [165]. The key differences of Matrix lie in a quick online SVM-based learning algorithm, the formulation of a resource minimization problem, and the application of the Lagrange algorithm to find a configuration optimized for performance and cost. While DejaVu can handle new applications and adapt resources well to suit new

demands, DejaVu uses dedicated sandbox machines to clone and profile VMs. In contrast, Matrix utilizes representative models to construct the target workload's model in an online fashion. Also, Matrix deals with the problem of multi-cloud resource management, which is shown to be critical in [62, 70, 112, 115]. Performance interference in virtualized environments is another critical barrier to providing predictable performance. DeepDive [135] utilizes mathematical models and clustering techniques to detect interference. Again, this framework requires comparing the performance of VM clones on dedicated machines. Similar to [96, 131], Matrix removes this need by including the interference factors in the performance models of representative workloads. It then dynamically uses probabilistic clustering methods to construct the target performance model.

2.4.4 VIO-prefetching

Prefetching data from main memory into processor caches on high-performance processors is a common practice today, notably [69, 170]. Prefetching from storage devices, which faces challenges similar to those of prefetching from memory regarding bandwidth and pollution, has also been intensively studied at different layers of the I/O stack, e.g., the block level (e.g., [55, 109]), the file level (e.g., [99, 160, 195, 201]), and in combination with caching [17, 64, 198, 201, 206]. Prefetching has also been studied with various techniques. Notable representatives include using a probability graph to estimate the access probability of a region [67]; using data compression, mining, and semantics-aware techniques to construct and predict access patterns [52, 109, 32, 161]; tracking addresses of processes [64, 65]; and providing hints, compiler support, and off-line information [30, 40, 85, 86, 127, 139]. VIO-prefetching is orthogonal and unique with respect to techniques previously applied to HDDs (hard drives) and bare-metal systems in two aspects. First, our approach bridges the information gap for recognizing patterns in a virtualized environment. Li et al. implement their prefetching method in the guest OS [107]. Prefetching in the guest OS has only a partial view of the I/O processes in the guest domain; this

approach lacks knowledge of the physical disk block mappings, where the sequence of disk blocks is critical to prefetching performance. Several previous works utilize caching in the hypervisor to improve virtual I/O performance [83, 111, 191]. These caching techniques are orthogonal to VIO-prefetching and may be combined with it to provide better virtual I/O performance. Second, we focus on emerging flash-based RAIDs, whose high performance enables new potentials and challenges for prefetching. While VIO-prefetching is similar to freeblock scheduling [114], which utilizes the remaining I/O bandwidth of HDDs, we provide insights into prefetching with SSDs. We believe that our technique can potentially be combined with several existing prefetching techniques, e.g., [40, 195, 201]. FAST is a recent work that focuses on shortening application launch time and utilizes prefetching on SSDs for the quick start of various applications [84]. It takes advantage of the nearly identical block-level accesses from run to run and the tendency of these reads to be interspersed with CPU computations. This approach uses the blktrace API with an LBA-to-inode mapper instead of a loopback device as in VIO-prefetching. A work similar to FAST is C-Miner [109], which discovers block correlations to predict which blocks will be accessed. This approach can cope with a wider variety of access patterns, while VIO-prefetching is limited to simpler strided forward and backward patterns. Our approach differs from these two in that it can handle request streams from multiple simultaneous applications and includes an aggressiveness-adjusting feedback mechanism. We believe that incorporating block correlations would improve VIO-prefetching's accuracy in some cases and plan to investigate this approach in the future.

Chapter 3

Swiper

Dora the Explorer¹: "Swiper, no swiping!"  Swiper the Fox: "You are too late."

3.1 Introduction

A cloud computing system offers its users the illusion of "infinite" computing and storage capacities on an on-demand basis [11]. Examples of commercial cloud computing platforms include Amazon Elastic Compute Cloud (EC2) and Simple Storage Service (S3), Google AppEngine, Microsoft Azure, etc. Virtualization [150] plays a vital role in cloud computing. In particular, for the purpose of scalability and flexibility of resource delivery, a cloud computing system does not provide each user with a different physical machine - instead, it allocates each user to an independently managed virtual machine (VM) which can be dynamically created, modified, and migrated. Examples of such a platform include Xen VM for Amazon EC2 and the .NET-based run-time environment for Microsoft Azure. The essence of virtualization is that multiple VMs may multiplex and share the

¹Swiper the Fox is a cartoon character in the animated series Dora the Explorer, who often sneaks up on Dora and Boots and takes away the items that are needed for Dora's adventures.

same physical resources (e.g., CPU, cache, DRAM, and I/O devices). Nonetheless, each VM is supposed to enjoy isolation (in terms of security and performance) from the other VMs. That is, different VMs should not be able to interfere with each other's execution. Unfortunately, the lack of physical isolation can indeed pose new security threats to co-located VMs. We consider a new type of VM vulnerability which enables a malicious user (i.e., VM) to exploit the resource contention between co-located VMs and obstruct the execution of a targeted application running in a separate VM that is located on the same physical machine as the malicious one. In particular, we focus on exploiting contention on shared I/O resources that are critical to data-intensive applications, e.g., hard disks and networks. In practice, service providers often exclude such threats from their service level agreements. That is, customers are solely responsible for losses caused by resource contention from co-located VMs. However, most service providers do not enable dynamic migration [185]. Even if a customer suspects an attack and wants to move affected VMs away, they need to shut down and restart all affected VMs. Therefore, an attack from Swiper may incur a nontrivial loss in business by introducing service degradation or interruption; e.g., Amazon.com would lose 1% of sales for every 100 ms delay in page load time, and a similar test at Google revealed that a 500 ms increase in displaying the search results could reduce revenue by 20% [97]. Note that the main concern of this work is performance degradation caused by co-located adversaries, rather than information leakage, which has been the main focus of vulnerability studies in cloud computing systems [148]. Performance degradation is critical because it directly increases the cost per workload completed in the cloud [66, 53]. On the other hand, the existing work on performance-degradation analysis was conducted in non-virtualized environments (e.g., for CPU, DRAM, hard disk, and network usage [37]) and cannot be directly applied to VMs. For example, a relevant prior work that proposed to exploit contention on hard disks [87] required access to the hard-disk queue in order to analyze the requests from both the adversary and the victim. However, this queue cannot be directly accessed by VMs, rendering such

exploitation no longer applicable. In this work, we design and implement Swiper, a framework that exploits the virtual I/O vulnerability in three phases: 1) co-location ("sneaking-up"): place the adversary VM on the same physical machine as the victim VM; 2) synchronization ("getting-ready"): identify whether the targeted application is running on the victim VM and, if so, the state of execution of the targeted application (which we shall elaborate below); and 3) exploiting ("swiping"): design an adversarial workload according to the state of the victim application, and launch the workload to delay the victim. For the first stage (co-location), we propose an I/O-based VM co-location detection method by extending the existing network-probing-based techniques [148]. As we focus on data-intensive applications, we determine co-location by measuring the consumption of both network and storage I/O resources. We found the proposed co-location detection technique very effective on Amazon EC2. The objective of the synchronization phase is to determine the execution stage of the victim application. Intuitively, any application would go through several execution stages (perhaps iteratively many times): reading inputs, computation, and saving outputs. An I/O-based resource exploitation would have little effect if it is carried out when the victim application is mainly doing CPU-intensive computation. Therefore, synchronization is critical for the adversary to reach a satisfactory trade-off between delaying the victim application and minimizing its own I/O consumption. The key challenge to synchronization is that the adversary cannot directly observe the I/O requests made by the victim VM. Instead, it has to infer the victim's execution stage based on the I/O throughput allocated to its own VM. Our main technical contribution for this phase is a discrete Fourier transformation (DFT) based algorithm which recovers the victim's original I/O pattern from the observed (distorted) time-series of I/O throughput, and then determines if the victim application has reached a pre-determined point when it is most vulnerable to an exploitation, e.g., when it is writing large amounts of data to storage. The objective of the attack phase is to design an attacking workload that incurs the

maximum delay on the victim application while minimizing the attacker's I/O consumption. We develop the peak attack, which launches I/O requests at the peaks of the victim's I/O requests, as identified by the synchronization phase. A key question one must address for the design of the peak attack is what constitutes an effective I/O workload to launch during peaks, e.g., request type (reads or writes), pattern (sequential or random), and interval. Our main hypothesis is that sequential reads would be the most effective method, and we validate this hypothesis on a number of benchmarks in the experiments. The experiment results demonstrate the superiority of our synchronized peak attack over two baseline attacks: a naive attack which maximizes its I/O requests for a given time period, and a random attack which launches I/O requests at random time points. Another critical issue we address in the paper is the existence of "bystander" VMs, i.e., those that are also sharing the same physical resources as the victim and the attacker VMs. The existence of such bystander VMs makes the synchronization both challenging and indispensable. In particular, the I/Os from bystander VMs become "noise" that is mixed into the observed I/O time-series, which requires the attacker VM to somehow "filter out" the noise before launching the synchronization and attack phases. Another challenge imposed by bystander VMs is that the attacker would have to minimize collateral damage to the bystanders (in order to reduce the risk of being detected and punished by the cloud computing service provider). We demonstrate through theoretical analysis and experiments that our DFT-based synchronization and peak attack techniques are resilient to the existence of one or more bystander VMs. The contribution of Chapter 3 also includes a comprehensive set of experiments on Amazon EC2, with the results clearly showing that Swiper is capable of degrading various server applications by 22.54% on average (and up to 31%) for different instance types and benchmarks, while keeping the resource consumption to a minimum.

3.2 Threat Model

3.2.1 Resource Sharing in Cloud Computing Systems

In general, a cloud computing system provides its end-users with a pool of virtualized computing and I/O resources supported by a large number of distributed, heterogeneous, commodity computers. For I/O, VMs utilize the device drivers (the frontend drivers) in the guest OS to communicate with the backend drivers in Dom0, which access the physical devices, e.g., hard drives and networks, on behalf of each VM. In other words, application I/Os within a VM - which basically consist of block reads and writes to the virtual disks - are translated by the virtualization layer to system calls in the host OS, such as requests to the physical disks. In Xen, the hypervisor and Dom0 work together to ensure security isolation and performance fairness among all VMs. While fairness in CPU and memory virtualization is relatively easy to achieve, in this paper we show that maintaining performance isolation for virtual I/O can be extremely challenging - which opens the door for security threats. In this work, we also evaluate Swiper on KVM (Kernel-based Virtual Machine), which utilizes hardware-assisted full virtualization instead of Xen's paravirtualization. Although Xen and KVM are used to demonstrate this threat in our work, our tests and previous work indicate that other virtualization frameworks such as VMware also exhibit similar interference problems [202].

3.2.2 Problem Definition

A straightforward way to delay a victim process is to launch an attacking process which constantly requests a large amount of resources shared with the victim (e.g., I/O bandwidth). Nonetheless, such an attack can be easily detected and countered (e.g., a dynamic resource allocation algorithm can restrict the amount of resources obtained by each process). Thus, our focus in this paper is to incur the maximum delay on the victim while keeping the resource request from the attacker below a pre-determined (low) threshold.

Prior Knowledge of the Adversary: Since the adversary now has to target the attack specifically at the victim process (instead of blindly delaying all processes sharing the resource), it has to possess certain characteristics of the victim process that distinguish it from others. For the purpose of this paper, we consider the case where the adversary holds the trace of resource requests from the victim process as the "fingerprint" of the victim. Research on cross-VM side channels can be used to support this assumption [1, 2, 179, 204]: malicious VMs are able to retrieve a variety of information, such as data and instruction cache usage, I/O usage profiles, and even private keys, from co-located VMs and hosts via side channels. The techniques for co-location detection in Sec. 3.3 can also be adapted to profile I/O access patterns. We plan to extend the profiling technique in future work. In the experiment section, we shall demonstrate that the various workloads we tested all exhibit unique resource-request time-series that can be easily distinguished from each other.

Limits on the Adversary: Many cloud computing systems charge by the amount of resource requests. For example, Amazon Elastic Block Store (EBS) charges $0.10 - $0.11 per 1 million I/O requests, while Amazon EC2 charges by total network consumption, i.e., the amount of data transferred into and out of the system [10]. Thus, the adversary must minimize the amount of resource requests it initiates. In this paper, we consider a pre-determined upper bound on the total resource consumption by the adversary.

Problem Statement: Given a workload fingerprint of a victim process, determine an adversarial workload of I/O requests which incurs the maximum delay on the victim process without exceeding the pre-determined threshold on the adversary's own resource consumption.

3.3 I/O-Based Co-Location Detection

In this work, we use Amazon EC2 as one testing platform to carry out experiments. As we focus on the vulnerability arising from competition for I/O resources, we choose two types of Amazon EC2 instances, micro and small, as the instance types in our experiments. I/O-Based Co-Location: As we focus on the competition for I/O resources, we need to further determine the sharing of I/O resources by extending the network-based approach in [148]. To this end, we propose a VM co-location detection mechanism based on measured metrics of both network and storage I/O workloads. We utilize this I/O-based co-location detection to conduct experiments on Amazon EC2. The co-location detection mainly consists of two stages: Probing and Locking-on. Probing: An adversary can locate the geographical zone of a victim process from the victim's IP information [148]. To conveniently manage separate networks for all availability zones, Amazon EC2 partitions its internal IP address space between availability zones. Administration tasks would be more difficult if the internal IP address mapping changed frequently. Because different ranges of internal IP addresses represent different availability zones and public IP addresses can be mapped to private IP addresses by DNS, an adversary can easily locate the availability zone of a victim, thus greatly reducing the number of instances needed before achieving a co-location placement. Once an adversary knows the availability zone of a victim, it uses network probing to check for co-residence. In general, if an adversary and a victim are co-located, they are likely to have 1) identical Dom0 IP addresses, and 2) small packet round-trip times (RTT). Therefore, an adversary can create several probing instances to perform a TCP SYN traceroute operation to a victim's open service port. If a probing instance and the victim were co-located, they would share the same Dom0 and there would be only a single hop to the victim with a small RTT. In our experience, if the RTT is smaller than half of the average RTT of all one-hop instances in the same zone, the probing instance is very likely on the same physical machine as the victim.

Locking-on: Co-location on the same physical machine does not necessarily mean the sharing of I/O resources - co-located VMs may end up using different storage types. In our tests, if two co-located VMs do not share one hard drive, launching a workload to compete for I/O resources shows limited effect on I/O throughput. On the other hand, if two instances share the same storage device and both try to max out the bandwidth, each can only get part of the total bandwidth. Prior work [143, 120, 132, 166] has also shown similar interference effects in virtualized environments. Because the adversary knows its own performance under a given I/O workload, to confirm the I/O sharing it needs a VM instance that would potentially co-locate with the victim and try to compete for I/O resources. The adversary can then simply measure the I/O performance, and an obvious performance degradation would be a strong indicator of VM co-location. Amazon currently has eight availability regions: US East (Northern Virginia), US West (Oregon), US West (Northern California), EU (Ireland), Asia Pacific (Singapore), Asia Pacific (Tokyo), South America (Sao Paulo), and AWS GovCloud [10]. Our experiments on Amazon EC2 are conducted in the US East region, which is the largest among all regions and hosts more than 300,000 blade servers [110]. More specifically, the instances are created in the us-east-1c zone. To co-locate an attacker with a victim, we first launch ten instances as targets. Then, we keep generating probing instances until a pair of co-located instances is found. We successfully located four attacker-victim pairs in two hours, using about two hundred probing instances. The success rate for the probing stage is about 8%, and for a successful locking-on it is 2%. Note that not all probing instances were running during the whole probing period: an instance is terminated immediately after confirming that there is no co-located target. Thus, the cost (data and instance usage) of probing and locking-on is small. For example, using micro (0.02 dollars per hour) and small instances (0.06 dollars per hour) as the probing VMs costs about one and three dollars, respectively, for co-locating with one target (based on the 2% success rate). Such an initial cost is very small compared with the potential revenue loss shown in Sec. 3.9.
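The two-stage detection above reduces to a simple decision procedure once the adversary has collected its measurements. The following is a minimal sketch of that logic, assuming the adversary has already measured probe RTTs, the Dom0 hop addresses, and its own baseline versus contended sequential-read bandwidth; the function names and the 30% degradation threshold are illustrative assumptions, not values taken from Swiper's implementation.

```python
# Minimal sketch of the probing/locking-on decision logic described above.
# All names, inputs, and the 30% degradation threshold are illustrative
# assumptions; they are not taken from the Swiper implementation.

def probably_same_host(probe_rtt_ms, one_hop_rtts_ms, probe_dom0, victim_dom0):
    """Probing: co-residence is likely if the Dom0 addresses match and the
    probe's RTT is below half the average RTT of all one-hop instances."""
    avg_rtt = sum(one_hop_rtts_ms) / len(one_hop_rtts_ms)
    return probe_dom0 == victim_dom0 and probe_rtt_ms < 0.5 * avg_rtt

def shares_storage(baseline_mbps, contended_mbps, degradation=0.3):
    """Locking-on: an obvious drop in the adversary's own sequential-read
    bandwidth while competing with the victim indicates shared storage."""
    return contended_mbps < (1.0 - degradation) * baseline_mbps

# Example: a probe with a 0.4 ms RTT (zone average 1.1 ms), a matching Dom0,
# and read bandwidth that falls from 90 MB/s to 55 MB/s under contention.
if probably_same_host(0.4, [1.0, 1.2, 1.1], "10.1.2.1", "10.1.2.1") and \
   shares_storage(90.0, 55.0):
    print("likely co-located and sharing the storage device")
```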

After locking on to a victim, a smart adversary does not just launch a huge workload to compete for resources. This naive method leads to ineffective attacks and wastes time and money. Therefore, we develop a synchronization method in the following sections to ensure accurate attack timing. As we will show later, this method generates more severe performance degradation and uses fewer resources than the naive method.

3.4 Resource Competition for a Two-Party System

We start with a simple scenario where the resource is only shared between two parties, i.e., the attacker and the victim. In the following, we shall first outline two main technical challenges for an attacker to delay the victim under resource limitations. Then, we shall describe our main ideas for addressing the two challenges respectively. Note that we focus on the synchronization and attack phases in this section and Section 3.5, and discuss the design of the co-location phase in Section 3.3.

3.4.1 Technical Challenges for Reaching the Maximum Delay

There are two critical challenges in incurring the maximum delay on a victim - synchronization and adaptive attack - which we explain as follows: Synchronization: In order for the adversary to incur the maximum delay under a resource constraint, it has to be able to (1) determine whether the victim process is running, and (2) predict the resource request from the victim process at a given time. The necessity of the first task is easy to understand - the attack should not be launched when the victim is not running. It is also easy to achieve in a two-party system: the adversary will observe a decrease in its allocated resources once the victim process is running and competing for the same resource. The necessity of the second task can be observed from Fig. 3.1, which depicts


Figure 3.1: Read I/O trace of FileServer

the read requests from FileServer, a popular I/O benchmark that simulates a file server. One can see that the read operation is not constant over time, but has peaks and valleys. An attack launched at a valley may have barely any effect on the victim (e.g., consider the case where the victim has no I/O request during the attack period), while an attack at a peak can significantly slow down the victim. Thus, given the resource constraint, it is critical for the adversary to predict the peak moments of I/O requests from the victim. Since the adversary holds the time-series of the victim's resource requests (when no other process is running) as prior knowledge, the task of predicting the victim's request at a given time is essentially the synchronization of the pre-known time series with the one observed in real time. Note that the adversary can directly observe neither the amount of resources requested by the victim nor the total resource requests received by the system. Instead, it can only perform the synchronization based on the comparison between its own resource request and its allocated resources. Adaptive Attack: Based on the result of synchronization, the adversary should align its resource request (i.e., attack) with the victim. In general, the higher the victim's demand at a given time, the larger the request the adversary should submit to the shared resource. The key challenge here for a two-party system is that the adversary may not be able to control the exact moment when its request is transmitted to the I/O device because of the complex scheduling algorithms deployed in the virtual machines and operating systems. Thus, in order to maximize the delay incurred on the victim, the attack must be designed to minimize the impact of the scheduling algorithms.

3.4.2 Main Ideas for Synchronization

Overview and Basic Notions: In this paper we consider a simple adversarial strategy of conducting an observation process with a sequential read operation. We chose read over write because the time-series of throughput allocated to write operations tend to have sharp bursts, which would make the synchronization significantly more difficult. Both sequential and random reads in our tests yield similar results in terms of the accuracy of synchronization. We chose sequential read over random read because the latter is rarely the behavior of a normal user and therefore may be detected by the cloud computing system. Before describing the details of synchronization, we first introduce a few basic notions. Recall that the adversary holds as prior knowledge the I/O request time series of the victim (when no other process is running). Let v(t) be the bandwidth requested by the victim at t seconds after the victim starts running. At run-time, let ob (seconds) be the length of the observation process (where ob stands for observation length) and a(t) (t ∈ [1, ob]) be the (observed) throughput allocated to the adversary during the t-th second since the observation process starts. Let a_U be the (upper bound on) throughput for the sequential read operation when no other process is running. Definitions of Offset, Stretching Factor, and Scaling Factor: The objective of synchronization is for the adversary to align the pre-known v(t) with the observed time-series a_U − a(t). In the ideal case, a_U − a(t) would be a concatenation of two sub-series: one with zero readings (i.e., when the victim has not yet started running or has finished running), and a sub-sequence of v(t). In practice, however, additive noise and rescaling on both time and throughput may be present, leading to a requirement of aligning v(t) with a_U − a(t) with the following three factors:

• Offset: The victim process might start before or after the observation process starts. We denote the offset by t_off, i.e., the time when the victim starts running, considering the start of the observation process as time 0. Note that t_off < 0 if the victim starts before the observation process starts running.

• Stretching Factor: With the observation process running, the victim will be delayed, i.e., the pre-known time-series v(t) will be "stretched" on the time domain. We assume such delay is governed by a factor δ_ST (δ_ST ≥ 1), which we refer to as the stretching factor.

• Scaling Factor: The throughput allocated to the victim will also be reduced due to the execution of the observation process. We capture such an amplitude reduction again by a factor δ_SC (δ_SC ≤ 1), which we refer to as the scaling factor.

With offset and stretching/scaling factors, we can represent the relationship between a_U − a(t) and v(t) by

a_U − a(t) = δ_SC · v((t − t_off)/δ_ST) + ε(t),    (3.1)

for t ∈ [1, ob], where ε(t) is the additive noise inherent in the system. We assume ε(t) is generated i.i.d. for t ∈ [1, ob], independent of both victim and attacker, and has a mean of 0 as well as a small variance. Given (3.1), the main procedure of our synchronization process can be stated as

follows: For each possible value of the offset t_off, we compute the optimal stretching and scaling factors δ_ST and δ_SC that minimize the additive noise ε(t) - in particular, we aim to minimize the 1-norm ‖ε(t)‖_1 = Σ_t |ε(t)|. Then, we choose the offset that minimizes such a minimum 1-norm, i.e.,

t_off = arg min_t  min_{δ_ST, δ_SC}  ‖ a_U − a(t) − δ_SC · v((t − t_off)/δ_ST) ‖_1.

The main challenge in the procedure is to derive the optimal δ_ST and δ_SC that minimize ‖ε(t)‖_1. Our main idea is to use the Discrete Fourier Transformation (DFT) to transform the pre-known and observed time-series v(t) and a_U − a(t) to the frequency domain. Intuitively, regardless of the scaling factor, the strongest frequency component of v(t) should have a frequency that is δ_ST times as large as that of a_U − a(t), due to the stretching effect of δ_ST on the frequency domain of the observed time series. Therefore, we can determine the stretching factor by identifying and comparing their strongest frequency components. Once the stretching factor is determined, the scaling factor δ_SC can be correspondingly derived by comparing the amplitudes of the strongest frequency components of a_U − a(t) and v(t).

Formally, given v(0), . . . , v(N − 1), the DFT transforms it into a sequence of N complex numbers V_0, . . . , V_{N−1} on the frequency domain such that

V_k = Σ_{n=0}^{N−1} v(n) · e^{−(2πi/N)·k·n}.    (3.2)

Intuitively, V_k represents the amplitude of the original time-domain series v(t) on frequency k/N. Let

v′(t) = a_U − a(t) ≈ δ_SC · v((t − t_off)/δ_ST).    (3.3)

Consider the DFT of v′(t) to V′_k on the frequency domain:

V′_k ≈ δ_SC · Σ_{n=0}^{N−1} v(n/δ_ST) · e^{−(2πi/N)·k·n}    (3.4)
    ≈ δ_SC · δ_ST · Σ_{n=0}^{N/δ_ST − 1} v(n) · e^{−(2πi/(N/δ_ST))·k·n}.    (3.5)

Note that the cumulative sum in (3.5) is actually the amplitude of the time-domain series v(0), . . . , v(N/δ_ST − 1) on frequency k · δ_ST/N. When N/δ_ST is sufficiently large, we can approximate it by the amplitude on the same frequency for the original time series v(0), . . . , v(N − 1), i.e., V_{k·δ_ST}. Thus,

V′_k ≈ δ_SC · δ_ST · V_{k·δ_ST}.    (3.6)

According to (3.6), the frequencies of the strongest frequency components for v(t) and v′(t) should differ by a ratio of δ_ST. Thus, we derive an estimated stretching factor δ_ST based on such a comparison. Note that (3.6) also indicates that the amplitudes of the strongest frequency components of v(t) and v′(t) differ by a factor of δ_SC · δ_ST. Thus, based on the estimated δ_ST, we derive an estimate for the scaling factor δ_SC from the amplitude comparison.
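The derivation above maps directly onto a few lines of signal-processing code. The following is a minimal sketch, assuming uniformly sampled one-second traces and using NumPy's FFT; it illustrates equations (3.1)-(3.6) and the 1-norm offset search, and is not Swiper's actual implementation.

```python
import numpy as np

def estimate_stretch_scale(v, a_obs, a_upper):
    """Estimate the stretching factor (delta_ST) and scaling factor (delta_SC)
    by comparing the strongest non-DC frequency components of the pre-known
    trace v(t) and the observed series a_U - a(t), as in (3.6)."""
    v = np.asarray(v, dtype=float)
    w = a_upper - np.asarray(a_obs, dtype=float)        # v'(t) = a_U - a(t)
    V, W = np.fft.rfft(v), np.fft.rfft(w)
    fv, fw = np.fft.rfftfreq(len(v)), np.fft.rfftfreq(len(w))
    kv = np.abs(V[1:]).argmax() + 1                      # strongest bin of v(t)
    kw = np.abs(W[1:]).argmax() + 1                      # strongest bin of v'(t)
    delta_st = fv[kv] / fw[kw]                           # frequency ratio
    delta_sc = np.abs(W[kw]) / (delta_st * np.abs(V[kv]))  # amplitude ratio
    return delta_st, delta_sc

def estimate_offset(v, a_obs, a_upper, delta_st, delta_sc):
    """Pick the offset that minimizes the 1-norm of the residual noise."""
    v = np.asarray(v, dtype=float)
    w = a_upper - np.asarray(a_obs, dtype=float)
    t = np.arange(len(w))
    return min(range(-len(v), len(w)),
               key=lambda off: np.abs(
                   w - delta_sc * np.interp((t - off) / delta_st,
                                            np.arange(len(v)), v,
                                            left=0.0, right=0.0)).sum())
```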

3.4.3 Performance Attack

Based on the result of synchronization, we consider a performance attack which launches multiple segments of sequential read operations to delay the victim process. Each segment persists for a fixed, pre-determined amount of time. In the following, we discuss three critical issues related to the design of such a performance attack: (1) when should each segment be launched, (2) how long should each segment persist, and (3) why should each segment use a sequential read operation. Positioning of Attack Segments: Recall from the discussion in Section 3.4.1 that, to incur the maximum delay on the victim process, the attack segments should be positioned to cover the moments of peak requests from the victim process. Thus, to position h attack segments, each persisting for ℓ seconds, we use a greedy algorithm which first locates the ℓ-second interval in v(t) which has not yet been executed and has the maximum total request, i.e., finds the start of the interval t_S ∈ [ob − t_off, N − 1] such that

t_S = arg max_t Σ_{i=t}^{t+ℓ} v(i),    (3.7)

and then repeats this process after removing the interval [t_S, t_S + ℓ] from consideration, until all h intervals are found. Note that there must be t_S ≥ ob − t_off because, by the end of the observation process, the first ob − t_off seconds of the victim process have already passed and thus cannot be attacked. Length of Attack Segments: Somewhat surprisingly, our experiments (as we shall present in Section 3.6) show that as long as each attack segment covers a peak of the victim's requests, the length of the attack segment does not have a significant impact

on the delay incurred on the victim process. Intuitively, this is because the portion of the attack which does not overlap with the peaks of the victim's requests incurs little delay to the victim. Nonetheless, this does not mean that the adversary should set each attack segment to be as short as possible - instead, it has to take into account the estimation error of synchronization, and make the attack segment long enough to ensure the coverage of the peaks. Operations of Attack Segments: Each attack segment may perform four types of operations: sequential read, random read, sequential write, and random write. We choose the sequential read operation for the following reasons. First, we excluded the write operations from consideration for the same reason as that discussed for the design of the observation process: write operations tend to introduce sharp bursts in throughput, which makes them difficult to synchronize with the victim's peak requests. We chose sequential read over random read because a random read operation is unlikely to sustain a high enough throughput to "compete" with the victim process and delay it. One note of caution is that, while each attack segment should perform a sequential read operation, the adversary must ensure that consecutive (but different) attack segments do not read sequentially on adjacent blocks. This is because the hard drive may pre-fetch the latter blocks while performing the previous attack segment; as a result, the latter attack segment does not actually incur any I/O to the hard drive, incurring no delay on the victim process. To address this issue, a simple attack strategy is for each segment to first randomly choose one from a set of files, and then read the file sequentially. Fig. 3.2 shows an example trace when Swiper issues overlapping sequential read operations to slow down a co-located FileServer. When comparing Fig. 3.1 with Fig. 3.2, one can clearly see that Swiper issues most of its I/O operations when the victim does as well. In addition, the victim's trace is visibly distorted to a certain degree. We will further analyze the performance decreases in Sec. 3.6.


Figure 3.2: Overlapping I/O of an attacker and a victim
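A minimal sketch of the greedy interval selection in (3.7) is given below; the function name and the simple list-based bookkeeping are our own illustrative choices, standing in for whatever bookkeeping the actual attack tool uses.

```python
def position_attack_segments(v, h, seg_len, earliest):
    """Greedily pick h non-overlapping seg_len-second intervals of v(t),
    starting no earlier than `earliest` (= ob - t_off), in decreasing order
    of total victim I/O request, following eq. (3.7)."""
    taken = [False] * len(v)      # marks seconds already covered by a segment
    segments = []
    for _ in range(h):
        best_start, best_load = None, -1.0
        for t in range(earliest, len(v) - seg_len):
            if any(taken[t:t + seg_len]):
                continue                          # interval already removed
            load = sum(v[t:t + seg_len])          # total request in the window
            if load > best_load:
                best_start, best_load = t, load
        if best_start is None:
            break                                 # no free interval remains
        segments.append(best_start)
        for i in range(best_start, best_start + seg_len):
            taken[i] = True                       # remove [t_S, t_S + seg_len)
    return sorted(segments)
```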

3.5 Systems with Background Processes

In this section, we first describe the changes to our synchronization process for a multi-VM system. Then, we present an information-theoretic model for analyzing the accuracy of synchronization for a given number of background VMs.

3.5.1 Synchronization in Multi-VM Systems

A key assumption one has to make to enable synchronization in a multi-VM system is that the background VMs are independent of the victim, i.e., for any background VM with its time-series of resource requests being b_i and its DFT-transformed result B_i,

Σ_{i=0}^{N−1} ( b_i − (1/N) Σ_{i=0}^{N−1} b_i ) · ( v_i − (1/N) Σ_{i=0}^{N−1} v_i ) = 0.    (3.8)

To see why this assumption is needed, consider a case where a background VM has exactly the same pattern as the victim. In this case, the synchronization process (and therefore the attack) cannot distinguish between the victim and the background VM, and would therefore waste attack resources on delaying the background process. The offset and stretching/scaling factors for the victim process still apply to a multi-VM system. The main change is the additive noise: instead of only considering the system-inherent noise ε(t), which is a small error with mean equal to 0, we have to consider it as a cumulative sum of the resources consumed by background VMs, which

can be substantially larger than the victim's request. As such, we rewrite (3.1) as

a_U − a(t) = δ_SC · v((t − t_off)/δ_ST) + B(t) + ε(t),    (3.9)

where B(t) is the cumulative throughput consumption of all background VMs. Given (3.9), the objective of synchronization changes from minimizing the 1-norm of a_U − a(t) − δ_SC · v((t − t_off)/δ_ST) to minimizing its variance over time, i.e., t = t_off should minimize

min_{δ_ST, δ_SC}  var_{t ∈ [0, N−1]} [ a_U − a(t) − δ_SC · v((t − t_off)/δ_ST) ].    (3.10)

To compute the stretching factor δ_ST, we again consider the DFTs of v(t) and v′(t) = a_U − a(t), denoted V_k and V′_k, respectively. The difference is that V′_k, i.e., the amplitude of v′(t) on frequency k/N, is now the sum of δ_SC · δ_ST · V_{k·δ_ST} and the amplitude of B(t) on frequency k/N. We again consider the strongest frequency component of v(t), say on frequency k_max/N. The intuition here is that, since we assume the independence of b(t) and v(t) in (3.8), the amplitude of B(t) on frequency k_max/(δ_ST · N) is unlikely to significantly affect the amplitude of v′(t) on k_max/(δ_ST · N). Thus, we identify the closest frequency f to k_max/N which satisfies (1) f ≤ k_max/N and (2) the amplitude (modulus) of v′(t) on f is within [r_1, r_2] times that of v(t) on frequency k_max/N, where r_1 and r_2 are pre-determined thresholds whose value selection we shall discuss in the experiments section. Then, we estimate δ_ST as k_max/(f · N). Given the estimated δ_ST, δ_SC and, in turn, t_off can be derived from (3.10).
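The constrained frequency search described above can be sketched as follows, again with NumPy; the default values for r_1 and r_2 are placeholders standing in for the experimentally chosen thresholds, and the function name is ours.

```python
import numpy as np

def estimate_stretch_multi_vm(v, a_obs, a_upper, r1=0.2, r2=5.0):
    """Estimate delta_ST in the presence of background VMs: take the strongest
    frequency k_max/N of v(t), then find the closest lower frequency f of
    v'(t) = a_U - a(t) whose amplitude is within [r1, r2] times that of v(t)
    at k_max/N, and return (k_max/N) / f."""
    v = np.asarray(v, dtype=float)
    w = a_upper - np.asarray(a_obs, dtype=float)
    V, W = np.fft.rfft(v), np.fft.rfft(w)
    fv, fw = np.fft.rfftfreq(len(v)), np.fft.rfftfreq(len(w))
    k_max = np.abs(V[1:]).argmax() + 1            # strongest non-DC component
    ref_amp, ref_freq = np.abs(V[k_max]), fv[k_max]
    candidates = [(f, amp) for f, amp in zip(fw[1:], np.abs(W[1:]))
                  if f <= ref_freq and r1 * ref_amp <= amp <= r2 * ref_amp]
    if not candidates:
        return None                                # no plausible match found
    f_best = max(candidates)[0]                    # closest frequency below k_max/N
    return ref_freq / f_best
```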

3.5.2 Length of Observation Process

A key challenge in deploying the performance attack in a multi-VM system is how to determine the required length of the observation process. Recall that in the two-party system, the observation process is terminated when the estimation error (i.e., ‖ε(t)‖_1) falls below a threshold. Nonetheless, this strategy cannot be used in a multi-VM system because, while the value of ε(t) is system-inherent noise which can be assumed to be small, the residual for a multi-VM system, i.e., the variance of B(t) + ε(t), can be quite large when many background VMs are running concurrently. To address this challenge, we analyze the accuracy of synchronization in a multi-VM system to investigate how to properly change the length of the observation process according to the number of background VMs. In the following, we first draw an analogy between the synchronization process and the transmission of a signal through a noisy communication channel, and then use the noisy-channel coding theorem to derive the length of the observation process required for reliable synchronization. Consider an observation process which persists for ob seconds in every interval of c seconds - i.e., if the observation process finds the absence of the victim after ob seconds, it hibernates for c − ob seconds before starting again for ob seconds. For the sake of simplicity, assume that the length of the victim N ≫ c ≫ ob. The result of synchronization is the offset of the victim t_off ∈ [ob − c, ob]. To produce such a result, the synchronization algorithm takes as input two time-series: the original victim request pattern v(t) and the observed consumption by VMs other than the attacker, a_U − a(t). The length of the observed consumption is ob seconds. For v(t), only the first ob − t_off seconds may be meaningful for synchronization (because the later requests bear no impact on the observed consumption). Consider v(t) and a_U − a(t) as the input and output signals of a noisy communication channel, respectively. The noise is generated by the background VMs. The objective of synchronization is to transmit the message t_off through this communication channel. If such a transmission is successful, then the synchronization process can reproduce t_off by comparing v(t) with a_U − a(t). Note that the length of the message t_off is log_2 c bits because its prior distribution is uniform over c possible values. According to the noisy-channel coding theorem, the required length of the observation period is determined by the capacity of the noisy channel C (bits/second) - i.e., given capacity C, unless ob ≥ (log_2 c)/C, no synchronization algorithm can estimate t_off with an arbitrarily small probability of error.

Since our objective is to analyze how ob should change with the number of background VMs m, we focus on deriving the relationship between m and the channel capacity C. For this purpose, we consider the total consumption of the background VMs as additive Gaussian white noise, the worst-case scenario under the independence assumption in (3.8). According to the Shannon-Hartley theorem, the channel capacity is

C = B log_2 (1 + S/N),    (3.11)

where B is the bandwidth of the channel, and S and N are the power of the signal and the noise, respectively. Note that while B and S do not change with m, N is proportional to m. Also note that when S ≪ N, i.e., when a large number of background VMs are running, C ≈ B · S/N (up to a constant factor). Thus, the channel capacity is inversely proportional to m. As such, the adversary should set the length of the observation process in proportion to m.

The synchronization accuracy is determined by the delay in seconds of detecting an I/O peak in the victim. In the two-process test, we first run the attacker VM and start the victim, the WebServer application, after 10, 20, and 30 seconds. Our synchronization algorithm works very well - the error remains within one second when observing the victim for the last 20 seconds. Fig. 3.3 shows the prediction errors in seconds when the victim arrives at various time points. Note that, for our algorithm to work, the attacker needs a "clear" time window to understand its own I/O pattern. This process can be interfered with when the victim comes in during that window, e.g., 10 seconds after the attacker's start time. In this case, the algorithm does not become stable until 45 seconds later, more than double the time needed for the other two cases. A similar phenomenon appears later when there are background processes, which also take a longer time for the algorithm to stabilize.

Figure 3.3: Synchronization accuracy when the victim arrives at different times

In Fig. 3.4, we test the synchronization accuracy when there are one and two background VMs, which randomly read and write several files. Without the noise, i.e., background VMs, the algorithm has at most one second of error when the observation time is beyond 50 seconds. As we discussed earlier, our algorithm needs a longer time to "filter" the background noise - in the case of one background VM, it stabilizes at 65 seconds. The process takes 76 seconds for two background VMs. Nevertheless, this test shows that our approach is able to correctly account for the effects of concurrent I/Os and identify the victim in a timely fashion.

Figure 3.4: Synchronization accuracy with background VMs

Fig. 3.5 shows sample traces when there is one more co-located VM that generates background noise by randomly reading several files. Swiper still issues most of its I/O operations at the victim's original I/O peaks despite the background noise.


Figure 3.5: Overlapping I/O of an attacker, a victim, and a background noise
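As a rough numerical illustration of the bound ob ≥ (log_2 c)/C from Section 3.5.2, the snippet below computes the minimum observation length for hypothetical values of the channel bandwidth, signal power, and per-VM noise power; all numbers are made up purely for illustration and do not come from the experiments.

```python
import math

def min_observation_seconds(c, bandwidth_hz, signal_power, noise_per_vm, m):
    """Noisy-channel lower bound on the observation length:
    C = B * log2(1 + S/N) with N proportional to the number m of
    background VMs, and ob >= log2(c) / C."""
    capacity = bandwidth_hz * math.log2(1 + signal_power / (noise_per_vm * m))
    return math.log2(c) / capacity

# With c = 128 possible offsets and hypothetical B = 1, S = 4, N_per_VM = 1,
# the required observation length grows roughly in proportion to m.
for m in (1, 2, 4, 8):
    print(m, round(min_observation_seconds(128, 1.0, 4.0, 1.0, m), 1))
```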

3.6 Experiment Results

3.6.1 Experiment Setup

Because a substantial portion of Amazon EC2's address space hosts publicly accessible web servers [148], we test Swiper with the following popular cloud applications or benchmarks: YCSB (Yahoo! Cloud Serving Benchmark) is a performance measurement framework for cloud data serving [50]; YCSB's core workload C is used to emulate read-intensive applications. Wiki-1 and Wiki-2 run Wikibench [181] with real Wikipedia request traces from the first day of September and October 2007, respectively. Darwin is an open-source version of Apple's QuickTime media streaming server. FileServer mimics a typical workload on a file system, which consists of a variety of operations (e.g., create, read, write, delete) on a directory tree. VideoServer emulates a video server, which actively serves videos to a number of client threads and uses one thread to write new videos to replace obsolete ones. WebServer mostly performs read operations on a number of web pages, and appends to a log file. FileServer, VideoServer, and WebServer belong to the FileBench suite [118]. Micro and small Amazon EC2 instances and a local machine are used as the test platforms in this work. We use the technique described in Sec. 3.3 to locate Amazon EC2 instances that share the same storage device. The tests are repeated 50 times and the means are reported. To evaluate the effectiveness of an attack, we define three metrics: 1) the slowdown/decrease in percentage of the victim, S, which assesses the overall effect of an

attack. This can be measured as the runtime in seconds or the throughput in KB. 2)

the victim slowdown divided by the total runtime (in seconds) of the attacker, S_AT, which determines the impact of the length of an attack. A bigger S_AT indicates that an attacker can inflict large damage within a shorter time window. 3) the victim slowdown divided by the total throughput (in MB) of the attacker, S_AC, which evaluates the effect of the bandwidth consumption of an attacker. A bigger S_AC means that an attack is effective while consuming a smaller number of bytes.
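For concreteness, the three metrics can be computed directly from the measured quantities, as in the short sketch below; the variable names and the example numbers are ours, not from the experiments.

```python
def attack_metrics(victim_normal, victim_attacked, attacker_runtime_s,
                   attacker_data_mb):
    """Compute S (victim slowdown/decrease in percent), S_AT (slowdown per
    second of attacker runtime), and S_AC (slowdown per MB of attacker data).
    victim_normal / victim_attacked can be runtimes in seconds (attacked value
    larger) or throughputs (attacked value smaller)."""
    s = 100.0 * abs(victim_attacked - victim_normal) / victim_normal
    return s, s / attacker_runtime_s, s / attacker_data_mb

# Example: a 100 s run slowed to 125 s by a 60 s attack that read 500 MB.
print(attack_metrics(100.0, 125.0, 60.0, 500.0))   # (25.0, 0.4167, 0.05)
```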

3.6.2 Comparison with Baseline Attacks

Amazon EC2 instances: We first demonstrate our tests on Amazon EC2. To minimize cache effects and max out the bandwidth, we make the total file size of all four benchmarks larger than double the memory size. Namely, the working set size of each benchmark is 4 GB on the micro and 8 GB on the small instance in this experiment. Fig. 3.6 shows the runtime increases of the benchmarks on micro and small instances when the attacker is restricted to a 2 and 4 GB data limit to interfere with the victim on the micro and small instance, respectively.

Figure 3.6: Runtime increases by different attacks on Amazon EC2: (a) micro instance, (b) small instance

Recall that the naive attack exhausts the bandwidth within the given time or data constraints and the random attack launches I/O requests at stochastic time points. The naive attack shows little to no effect on the runtime increase of the victim workload: the average runtime increases caused by the naive attack are 6% and 14% on the micro and small instance, respectively. The random attack is better than the naive one because there are chances for the random attack to hit a peak of the victim workload. The random attack increases victims' runtime by 35% and 57% on the micro and small instance, respectively. The peak attack has the best results, 67% and 100% on the micro and small instance, respectively. In general, the micro instance is better than the small instance at resisting attacks, which implies that the networked disk array and storage duplicates help to reduce I/O interference. Despite the possible methods for mitigating disk I/O interference, other shared system components are also susceptible to interference issues, e.g., caches [132] and network interfaces [96]. The idea of Swiper could be complemented by combining it with other techniques to exploit vulnerable components. As we shall see later, our method can effectively detect the I/O peaks and launch workloads to slow down the victim. Altogether, the average runtime increase of peak attacks across instance types and benchmarks is 85%, about two and eight times that of a random or naive attack, respectively. Fig. 3.7 shows the victim's performance changes when varying the attacker's data consumption limit. The peak attack still has the best result under different data usage limits. Note that the instance store is more vulnerable than EBS, and accessing instance stores incurs no extra cost for the adversary. An EC2 user should use EBS volumes instead of instance stores to reduce potential damage. A boxplot of runtime reveals more insight into how peak attacks affect victims' performance. Fig. 3.8 demonstrates FileServer runtime distributions when peak attacks happen on micro/small instances. On micro instances, peak attacks effectively change the distribution of runtime: the distribution is now skewed to higher values. On small instances, almost all test results under peak attacks have longer runtimes than normal runs. These two figures show that peak attacks can effectively slow down a victim most of the time. Note that Amazon EC2 should be considered a multi-VM testing environment because it is very likely that instances from other users are co-located.

Figure 3.7: The runtime increases on small instances with different workloads when varying the attacker's data consumption: (a) FileServer, (b) VideoServer, (c) WebServer

Figure 3.8: Boxplot of 100 runs of FileServer runtime under the peak attack on micro and small instances

The runtime increase may not provide a tangible idea of the monetary loss. Sec. 3.9 transforms the runtime increase into the revenue loss in business. Our analysis shows that a significant amount of financial loss could be caused by Swiper if the target is providing critical business services. Two VMs: For convenient analysis of different virtualization systems, we conduct the following tests on local machines with a 2.93 GHz Intel Core2 Duo E7500 processor, 4 GB RAM, and a 1 TB Samsung hard drive. The host operating system is CentOS Linux with a 2.6 kernel. Two virtualization frameworks are tested on this

machine: one is KVM and the other is Xen 4.0. Note that Amazon EC2 also uses Xen. All VMs in the experiments have one VCPU and 512 MB of memory. The effectiveness metrics of three selected applications are shown in Table 3.1. The attacker's data usage is limited to 500 MB.

Table 3.1: S, S_AT, and S_AC of Xen/KVM in two-VM experiments

Application   Metric   Naive         Random        Peak
WebServer     S        7.06/4.37     6.74/1.70     22.70/26.07
              S_AT     0.75/0.53     0.72/0.29     3.81/3.62
              S_AC     0.014/0.008   0.015/0.003   0.048/0.084
Darwin        S        5.32/4.59     9.45/5.82     28.62/29.33
              S_AT     0.45/0.43     1.66/0.55     5.70/4.81
              S_AC     0.010/0.009   0.019/0.011   0.059/0.069
Wiki-2        S        6.59/9.60     8.41/7.59     24.69/25.27
              S_AT     0.72/1.01     0.89/1.48     2.88/2.45
              S_AC     0.013/0.019   0.017/0.017   0.068/0.054

The proposed peak attack clearly captures I/O request patterns and achieves additional performance degradation on both Xen and KVM. The peak attack generates an average S of 26.11% for the victim, compared to 6.25% and 6.67% for the naive and random attacks. Recall that S_AT is calculated as S divided by the total runtime (in seconds) of the attacker, and S_AC is S divided by the total data usage (in MB) of the attacker. For the peak attack, the normalized S by time and data usage is even better - the S_AT value is about

4.17 and 5.96 times better than those of the random and naive attacks, while the SAC value is about 4.57 and 5.33 times better. We also study the I/O throughput degradation when the peak attack operates under different data usage limits. Fig. 3.9 illustrates the average throughput decreases of different applications running in a Xen VM when the peak attack has data usage limits of 100, 300, and 500 MB, respectively. With every increase in the attacker's data consumption limit, victims see a larger drop in throughput. Darwin is the most susceptible application among all testing benchmarks because of its intensive and clear read request patterns; its average throughput decreases are 4.57%, 14.65%, and 22.54% at the 100, 300, and 500 MB data usage limits, respectively.

Multiple VMs: In addition to the victim and attacker VMs, other VMs may co-exist as background processes. A cloud service can also be provided by multiple collaborating VMs.


Figure 3.9: The means and standard deviations of I/O throughput decreases

[Figure 3.10 panels: (a) Darwin, (b) YCSB, (c) Wiki-1, (d) Wiki-2; y-axis: throughput decrease (%), x-axis: attacker's data limit (MB); Xen vs. KVM]

Figure 3.10: Throughput changes when a multi-VM system hosted by Xen/KVM is attacked by the peak attack with various data usage limits

For example, one VM serves as the frontend portal and another VM is responsible for providing the requested data. Therefore, the applications in the following tests are composed of multiple VMs to construct a real-world scenario. Recall that Wiki-1 and Wiki-2 run Wikibench with traces from Wikipedia. Fig. 3.10 presents the changes in the I/O throughput of four cloud service systems. The results show that Xen and KVM are both vulnerable to this threat and that neither is clearly better at resisting it. In Table 3.2, we present the effectiveness of the three attack types on web serving applications when the attacker's data consumption is limited to 500 MB. For the peak

attack, the normalized degradation by time and data usage is again better: the SAT value is about 8.18 and 3.06 times better than those of the random and naive attacks, while the SAC value is about 8.69 and 4.0 times better.

Table 3.2: S, SAT, and SAC of Xen/KVM in multi-VM experiments

  Application   Metric   Naive         Random        Peak
  WebServer     S        9.47/6.82     1.45/2.29     29.5/12.23
                SAT      0.85/1.15     0.20/0.31     2.76/2.10
                SAC      0.018/0.013   0.003/0.005   0.073/0.030
  Wiki-1        S        7.37/8.99     4.94/1.56     16.22/28.56
                SAT      1.14/1.05     0.61/0.29     1.41/4.59
                SAC      0.014/0.017   0.011/0.003   0.050/0.070
  Wiki-2        S        6.26/7.83     4.70/3.20     26.51/23.73
                SAT      0.95/0.77     0.39/0.40     4.99/2.31
                SAC      0.012/0.015   0.010/0.006   0.080/0.054

Note that Sec. 3.5 has shown that attackers may need a longer observation length to maintain the synchronization accuracy when the number of VMs increases.

3.6.3 Analysis of Performance Attack

We study the I/O throughput under attack on local machines with an Intel Atom 1.6 GHz processor, 4 GB RAM, and a 1 TB Samsung hard drive. This energy-efficient architecture has been adopted in scientific and cloud computing environments [174, 187]. Fig. 3.11 illustrates the throughput in IOPS that FileServer and WebServer are able to achieve during an attack. With every increase in the attacker's bandwidth consumption, victims see a larger drop in throughput. Again, the peak attack significantly outperforms the naive and random attacks. FileServer and WebServer see 40% and 28% decreases in IOPS throughput, respectively, when the attacker's data usage is limited to 500 MB. In addition to the victim and attacker VMs, other VMs may co-exist as background processes. The next test shows that our peak attack is able to synchronize with the victim's I/Os and launch the attacks at those peaks in a multi-VM environment. The test environment is the one used for the throughput test in Fig. 3.11.

[Figure 3.11 panels: (a) FileServer, (b) WebServer; y-axis: IOPS decrease (%), x-axis: attacker's data limit (MB); series: Naïve, Random, Peak]

Figure 3.11: I/O throughput decreases of FileServer and WebServer at different data usage limits of an attacker in a two-VM system

We introduce a background VM in this test to verify Swiper's noise resistance. Fig. 3.12 presents the changes in the I/O throughput of (a) the FileServer and (b) the WebServer benchmark. The peak attack once again significantly outperforms the naive and random attacks. When the attacker's data usage is limited to 500 MB, FileServer and WebServer see 51% and 70% decreases in IOPS throughput, respectively.

3.6.4 Analysis of Synchronization Accuracy

The synchronization accuracy is determined by the delay in seconds of detecting an I/O peak in the victim. In the two-process test, we first run the attacker VM and start the victim, the WebServer application, after 10, 20, and 30 seconds.

[Figure 3.12 panels: (a) FileServer, (b) WebServer; y-axis: IOPS decrease (%), x-axis: attacker's data limit (MB); series: Naïve, Random, Peak]

Figure 3.12: I/O throughput decreases of FileServer and WebServer at different data usage limits of an attacker in a multi-VM system

Our synchronization algorithm works very well: the error remains within one second when observing the victim for the last 20 seconds. Fig. 3.13 shows the prediction errors in seconds when the victim arrives at various time points. Note that, for our algorithm to work, the attacker needs a "clear" time window to understand its own I/O pattern. This process can be interfered with when the victim comes in during that window, e.g., 10 seconds after the attacker's start time. In this case, the algorithm does not become stable until 45 seconds later, more than double the time needed for the other two cases. A similar phenomenon appears later when there are background processes, which also take the algorithm a longer time to filter out. In Fig. 3.14, we test the synchronization accuracy when there are one and two background VMs, which randomly read and write several files. Without the noise, i.e., background VMs, the algorithm has at most a one-second error when the observation time is beyond 50 seconds.

Figure 3.13: Synchronization accuracy when the victim arrives at different times

Figure 3.14: Synchronization accuracy with background VMs

As we discussed earlier, our algorithm needs a longer time to "filter" the background noise: in the case of one background VM, it stabilizes at 65 seconds, and the process takes 76 seconds for two background VMs. Nevertheless, this test shows that our approach is able to correctly account for the effects of concurrent I/Os and identify the victim in a timely fashion.

3.7 Dealing with User Randomness

We have seen that Swiper could successfully locate and synchronize with a victim. Some applications, however, may not have completely deterministic I/O traces because of user randomness. This section demonstrates how Swiper addresses user randomness by working with a pattern repository and learning module.

Figure 3.15 shows a high-level architecture of this extended Swiper.

[Figure 3.15 components: Start; Monitoring co-located VMs; Log trace; Known pattern?; Synchronize and Attack; Pattern repository; Clustering and Learning]

Figure 3.15: An extended Swiper architecture for dealing with user randomness

As a prototype implementation, the pattern store consists of 120 pre-stored one-minute Wikipedia traces collected from 9 to 11 am on Monday, October 1, 2007. We then replay a 24-hour trace from the same day to evaluate how Swiper reacts to it. Note that the pattern here is the time and amount of bandwidth usage by the target. Since we do not use any advanced pattern learning module (which by itself may become a separate research topic), we relax the scaling and stretching factors by 10% to allow Swiper to accept similar patterns in the 24-hour testing set. If more than one pattern matches due to this relaxation, the one with the least distortion is selected. When Swiper identifies a known pattern in a one-minute interval, it synchronizes with and attacks the victim during the remaining time of the matched minute. We limit the data usage of Swiper to 1 GB per matched minute. The machine setting of this experiment is the same as the two-VM one in Sec. 3.6. Note that designing clustering and learning methods for Swiper may by itself be a new research topic; we leave them as future work.

In Figure 3.16, we first show the matched and attacked minutes at every testing hour during the experiment. This evaluation essentially shows how many one-minute traces are similar to the I/O patterns in the repository. The polynomial fit of the matched minutes shows that similar patterns demonstrate time locality. The requests during the night hours (hours 12 to 20) are less frequent and less intense and thus less similar to the stored patterns, which are from the daytime. Note that Swiper looks for similarity in I/O patterns: the request traces could be accessing different files while the disk shows similar read patterns.
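To make the matching step concrete, the following Python sketch shows one way the relaxed matching could be implemented. It is not the dissertation's actual code: the resample/match function names, the distortion measure (Euclidean distance), and the acceptance threshold are assumptions; only the 10% scaling/stretching relaxation and the least-distortion tie-breaking come from the description above.

    import numpy as np

    def resample(trace, length):
        # Linearly resample a bandwidth trace to a given number of samples.
        old = np.linspace(0.0, 1.0, len(trace))
        new = np.linspace(0.0, 1.0, length)
        return np.interp(new, old, trace)

    def match(observed, repository, relax=0.10, threshold=50.0):
        # Compare a one-minute observed trace against every stored pattern,
        # allowing +/-10% amplitude scaling and time stretching, and return
        # the pattern with the least distortion if it is below the threshold.
        observed = np.asarray(observed, dtype=float)
        best, best_d = None, float("inf")
        for name, pattern in repository.items():
            for scale in (1.0 - relax, 1.0, 1.0 + relax):
                for stretch in (1.0 - relax, 1.0, 1.0 + relax):
                    length = min(len(observed), int(round(len(pattern) * stretch)))
                    candidate = scale * resample(pattern, length)
                    d = float(np.linalg.norm(observed[:length] - candidate)) / length
                    if d < best_d:
                        best, best_d = name, d
        return (best, best_d) if best_d <= threshold else (None, best_d)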


Figure 3.16: Matched minutes at each testing hour in the one-day test when holding two hours of traces in the repository. The dotted line shows a polynomial fit of the observed data points

Because the extended Swiper relaxes the matching criterion and does not hold a full trace, attacking a matched minute does not necessarily mean a correct match or guarantee a degradation as significant as before. Therefore, Figure 3.17 examines the average throughput decrease per attack at each testing hour.


Figure 3.17: The average throughput decrease per attack at each testing hour

Although the last 22 hours are not as good as the first two, the results confirm that a historical trace can still be useful in the future. The throughput degradation ranges from 2 to 20%, with an overall average of 13.12%. As future work, using clustering methods to identify and generate patterns may greatly improve the effectiveness of Swiper.

3.8 Attacking Migratable VMs

Live migration is a possible approach to avoiding potentially interfering workloads. In this experiment, we assume the victim VM is aware of being attacked and wants to be migrated away. The host machines, MA and MB, have an identical configuration, which is the same as the one used in the previous two-VM tests. The hosts are interconnected over Gigabit Ethernet and share the same storage device on another machine, MC.


[Figure 3.18 panels: (a) FileServer, (b) WebServer, (c) Darwin, (d) YCSB, (e) Wiki-1, (f) Wiki-2; y-axis: throughput decrease (%), x-axis: time period (min)]

Figure 3.18: The throughput decreases when migration is enabled are shown in red solid lines. The blue dotted lines represent the throughput decreases when the victim is not migrated

Note that using the same VM image is a common practice because moving the disk image can vastly increase the pause time of VM execution. The testing time of each run is 10 minutes, and the data usage of Swiper is limited to 500 MB per minute. In the first run, the attacker and the victim are both on MA and the victim is attacked for 10 minutes. In the second run, the victim is migrated away after being attacked for two minutes, but the attacker keeps interfering with the storage accesses. Figure 3.18 shows the average throughput decreases of these two scenarios in every testing minute. Migration mitigates the damage by only about 3.23% on average because of the pause caused by the live migration, which slightly delays the victim's subsequent requests to the storage device. However, the victim keeps suffering as long as the two VMs share the storage device. Therefore, an effective live-migration framework should dynamically distribute the workloads among several duplicate disk images, whose consistency needs to be properly maintained.

3.9 Potential Monetary Loss

The runtime increase may not give a tangible idea of the monetary loss. Thus, we use a linear model to translate the runtime increase into revenue loss for a business. In Sec. 3.1, we have seen that a 100 ms delay in loading pages may cause a 1% revenue loss and a 500 ms delay in displaying search results may reduce revenue by 20%. We also know that the median web page loading time is about 3 seconds [13] and the average time to display a search result is 0.2 seconds [75]. We use these data to build two linear cost-vs-delay models, which we call the SLA1 and SLA2 cost models, respectively. In case SLA1 and SLA2 are too optimistic for Swiper, we also use SLA3 and SLA4 models, which assume the expected loss is only one tenth of that of SLA1 and SLA2, respectively. The potential revenue loss caused by Swiper is then interpolated from these models and our experiments on EC2. The columns in Figure 3.19 represent the potential revenue loss caused by the average runtime increase; the whiskers represent the revenue loss caused by the minimum and maximum runtime increases from the EC2 experiments. The average revenue loss could be at least around 10% and up to 30% on small and micro instances when using the SLA1 and SLA2 models, respectively. Even with SLA3 and SLA4, the average revenue loss is about 1.8% across the two instance types and cost models, which is big enough to justify the co-locating cost of Swiper (Sec. 3.3).
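As an illustration of the interpolation, the Python sketch below applies the four linear cost-vs-delay models to a given runtime increase. The slopes and baseline times (1% per 100 ms on a 3 s page load; 20% per 500 ms on a 0.2 s search) come from the text above; the function name, the clamping at 100%, and the example inputs are assumptions, and the sketch is not meant to reproduce the exact numbers in Fig. 3.19.

    def revenue_loss(runtime_increase_pct, baseline_s, loss_pct_per_unit, delay_unit_s, scale=1.0):
        # Added delay = baseline service time * fractional runtime increase;
        # loss is interpolated linearly and capped at 100%.
        added_delay_s = baseline_s * runtime_increase_pct / 100.0
        loss = (added_delay_s / delay_unit_s) * loss_pct_per_unit * scale
        return min(loss, 100.0)

    # SLA1: 1% loss per 100 ms on a 3 s median page load
    # SLA2: 20% loss per 500 ms on a 0.2 s average search-result time
    # SLA3/SLA4: one tenth of SLA1/SLA2
    def sla1(inc): return revenue_loss(inc, 3.0, 1.0, 0.1)
    def sla2(inc): return revenue_loss(inc, 0.2, 20.0, 0.5)
    def sla3(inc): return revenue_loss(inc, 3.0, 1.0, 0.1, scale=0.1)
    def sla4(inc): return revenue_loss(inc, 0.2, 20.0, 0.5, scale=0.1)

    for inc in (35, 67, 85):  # hypothetical runtime increases (%) under attack
        print(inc, round(sla1(inc), 1), round(sla2(inc), 1),
              round(sla3(inc), 2), round(sla4(inc), 2))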


Figure 3.19: Potential revenue loss caused by Swiper on small and micro instances

Chapter 4

TRACON

4.1 TRACON System Architecture

Most cloud service providers utilize a hierarchical management scheme to administer their large quantities of machines [7]. In this environment, each application server has a virtualized environment similar to that shown in Fig. 2.1, where VMs are dynamically allocated to run the applications from the clients. A manager server is responsible for supervising a group of application servers, which report their status to the manager server at regular intervals. The manager servers can form a tree-like hierarchy for high scalability. Given a virtualized environment that consists of a large number of physical machines and different applications, we utilize statistical machine learning techniques, in particular statistical modeling, to reason about an application's performance under interference. We share the same philosophy as [24, 104] that statistical machine learning will play an important role in application and resource management in large-scale data centers. As the core management scheme for a virtualized environment, TRACON, our Task and Resource Allocation CONtrol framework, consists of three major components: 1) the interference prediction model, which infers the application performance from

the resource consumption observed from multiple VMs; 2) the interference-aware scheduler, which utilizes the model to generate optimized assignments of tasks and physical resources; and 3) the task and resource monitor, which collects application characteristics at runtime and feeds them to both the model and the scheduler. Fig. 4.1 presents the TRACON architecture and the interactions between its components.

[Figure 4.1 components: manager server with task queue, interference prediction model (prediction and training modules), and interference-aware scheduler; application servers with VM1...VMn and task/resource monitors; exchanged information: possible assignments, predicted interference, model updates, machine status, and task assignments]

Figure 4.1: TRACON system architecture

Upon the arrival of tasks, the scheduler generates a number of possible assignments based on the incoming tasks and the list of available VMs, which are then communicated to the interference prediction module. This module uses the constructed models, the application profiles, and the machine status to predict the interference effects for the given assignments. Finally, depending on the predictions, the scheduler makes the scheduling decision and assigns the tasks to different servers. On the application servers, the task and resource monitors manage the ongoing tasks that are assigned to each VM and collect application characteristics and the performance interference. In a simple example where two VMs share the host

hardware, the monitor measures the resource utilization of both VMs via existing system tools (e.g., xentop and iostat) while the assigned tasks are running. Xentop is used to monitor and record the physical CPU utilization of each domain, because a top command inside a domain, including the driver domain, can only report that domain's virtual CPU utilization. In addition, because Dom0 performs the I/O operations on behalf of each guest domain, we use iostat in Dom0 to monitor and record the resource utilization of the physical storage devices. Note that monitoring resource utilization in Dom0 is preferable because system administrators generally do not have access to each guest domain for security and privacy reasons. The collected information is used as feedback to update the interference prediction model.
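As a rough sketch of such a monitor, the Python script below periodically samples xentop in batch mode and iostat from Dom0. It is only an illustration under assumptions: the command-line flags shown are common ones but may differ across xentop/iostat versions, and a real monitor would parse the output into the four model parameters rather than print it.

    import subprocess
    import time

    def sample_xentop():
        # One batch-mode xentop snapshot: per-domain physical CPU usage.
        out = subprocess.run(["xentop", "-b", "-i", "1"],
                             capture_output=True, text=True, check=True)
        return out.stdout

    def sample_iostat(device="sda"):
        # One extended iostat snapshot for the physical storage device;
        # the second report reflects activity since the first.
        out = subprocess.run(["iostat", "-d", "-x", device, "1", "2"],
                             capture_output=True, text=True, check=True)
        return out.stdout

    def monitor(interval_s=5, rounds=3):
        # Collect a few samples; a real monitor would extract read/write
        # requests per second and DomU/Dom0 CPU utilization and feed them
        # back to the interference prediction model.
        for _ in range(rounds):
            print(sample_xentop().splitlines()[:4])   # header plus first domains
            print(sample_iostat().splitlines()[-2:])  # last reported device line
            time.sleep(interval_s)

    if __name__ == "__main__":
        monitor()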

4.2 Interference Prediction Model

At a high level, the interference can be perceived as changes in performance, including the total runtime, as used in prior work [205, 129], and the I/O throughput, which we have shown to be critical to data-intensive applications. In TRACON, we construct interference prediction models to extrapolate the application performance as a function of the resource consumption of the virtual machines. We apply several regression analysis techniques that are commonly used for modeling the relationship between an observed response and controlled variables [33]. For data-intensive applications that consume a significant amount of I/O bandwidth and CPU cycles, we characterize each application by four key parameters (controlled variables), listed in Table 4.1: the read throughput, the write throughput, the CPU utilization in the current guest VM domain (DomU), and the CPU utilization in the virtual machine monitor (Dom0). The first two parameters measure the I/O workload of the target application in terms of the number of requests per second, and the third models the CPU consumption from the application's data processing. While a model with these three parameters is a straightforward way to reason about an application, it is not sufficient to

achieve high accuracy in a virtualized environment, which is why we introduce the fourth parameter, the CPU utilization in the virtual machine monitor. Intuitively, because all the requests from guest VMs are routed through Dom0, it is crucial to properly account for the CPU consumption of the I/O handling tasks performed in DomU, as well as in Dom0, which acts on the physical devices on behalf of DomU. If such I/O overheads on CPU utilization were ignored, the prediction models would produce significantly larger errors, as we show in Section 4.4.

Table 4.1: Application Characteristics

  CPU                    I/O
  Utilization in DomU    Read requests per second
  Utilization in Dom0    Write requests per second

In our model, all four parameters can be easily collected with the help of the TRACON task and resource monitor. This approach leverages low-overhead system tools and reduces unnecessary interference from system monitoring. We propose this simple approach to avoid additional changes to a state-of-the-art virtualized environment (OS and virtual machine kernels), which we believe can lead to wide adoption in current production systems. In the following, we present three types of models, the weighted mean method (WMM), the linear model (LM), and the non-linear model (NLM), for two different responses: application runtime and I/O throughput. For simplicity, we assume that there are two virtual machines (VM1 and VM2), each of which can be assigned one application. For each model, the response relates to the four key parameters of each VM, that is, eight variables in total, which are the application characteristics of both virtual machines. Y (runtime or IOPS) is the response variable, while X_{VM1,i} and X_{VM2,i}, where i ∈ {1, 2, 3, 4}, the application characteristics on VM1 and VM2, are the controlled variables. Note that Sec. 4.5 demonstrates TRACON's ability to manage more than two VMs by referencing each pair's interference in scheduling.

The weighted mean method is based on principal component analysis (PCA) [82], which applies linear algebra to the data set to produce a set of uncorrelated variables, i.e., the principal components that capture the most important dynamics in the data. The assumption is that because the principal components account for most of the data variance, they are likely a good representation of the data, while variables with smaller variances mostly contribute noise and redundancy. As such, PCA is commonly used to deal with complex data sets, as well as to reduce high dimensionality in the modeling process. Our WMM model, similar to [96], calculates Euclidean distances between the data points in the space spanned by the first four principal components. It then chooses the three nearest data points and uses the reciprocals of their distances as weights to compute the predicted response. We use WMM as the baseline when evaluating the linear and non-linear models.
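A minimal numpy sketch of this baseline is shown below, assuming the eight per-VM characteristics are already collected into a training matrix; the function names and the epsilon guard against zero distances are ours, not the dissertation's.

    import numpy as np

    def fit_wmm(X_train, y_train, n_components=4):
        # X_train: (n_samples, 8) characteristics of both VMs; y_train: runtime or IOPS.
        mean = X_train.mean(axis=0)
        # Principal components from the SVD of the centered training data.
        _, _, vt = np.linalg.svd(X_train - mean, full_matrices=False)
        components = vt[:n_components]
        Z_train = (X_train - mean) @ components.T
        return mean, components, Z_train, np.asarray(y_train, dtype=float)

    def predict_wmm(model, x, k=3, eps=1e-9):
        # Project the query point, find the k nearest training points in the
        # principal-component space, and weight them by reciprocal distance.
        mean, components, Z_train, y_train = model
        z = (np.asarray(x) - mean) @ components.T
        dist = np.linalg.norm(Z_train - z, axis=1)
        nearest = np.argsort(dist)[:k]
        weights = 1.0 / (dist[nearest] + eps)
        return float(np.sum(weights * y_train[nearest]) / np.sum(weights))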

Linear models assume there is a linear relationship between the response variable and controlled variables, which can be formally presented as

\hat{Y} = c + \sum_{i=1}^{4} \alpha_i \cdot X_{VM1,i} + \sum_{i=1}^{4} \beta_i \cdot X_{VM2,i}    (4.1)

where \alpha_i and \beta_i are coefficients and c is a constant. The error is defined as the

difference between the real and expected values, i.e., Y - \hat{Y}, and the sum of squared errors (SSE) is calculated as \sum (Y - \hat{Y})^2. To obtain a linear model with high prediction accuracy, one needs to search for

a good combination of the constant c and coefficients \alpha_i and \beta_i such that the SSE is minimized. However, in our case, a prediction model that consists of all eight parameters for two VMs may not necessarily provide the best fit to the observed data. In other words, a model with fewer inputs may have higher or equivalent prediction accuracy compared to a more complex one. To this end, we use a stepwise algorithm [56] that adds or removes candidate variables one at a time in the model fitting process. After re-fitting the new model, the algorithm evaluates the model's goodness of fit. The process continues iteratively and in the end outputs the best model among

all the candidates. For this comparison, the algorithm needs a good metric of a model's goodness of fit. Although the goal of a modeling process is to minimize the SSE, it is insufficient to simply use SSEs because of the trade-off between accuracy and flexibility [33]. To this end, we utilize the Akaike information criterion (AIC) [6], which is based on information theory and provides scores for evaluating such a trade-off.
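The sketch below shows a simplified, forward-only version of such a stepwise search, scoring each candidate linear model with AIC computed under a Gaussian-error assumption (n ln(SSE/n) + 2k, up to an additive constant, consistent with the definition given next). The dissertation's stepwise algorithm also removes variables; this sketch only adds them, and the function names are ours.

    import numpy as np

    def aic(X, y):
        # AIC of an ordinary least-squares fit of y on X (with an intercept).
        n = len(y)
        A = np.column_stack([np.ones(n), X]) if X.shape[1] else np.ones((n, 1))
        coef, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
        sse = float(np.sum((y - A @ coef) ** 2))
        return n * np.log(sse / n) + 2 * A.shape[1]

    def forward_stepwise(X, y):
        # Greedily add the controlled variable that lowers AIC the most,
        # stopping when no addition improves the score.
        remaining = list(range(X.shape[1]))
        selected = []
        best = aic(X[:, selected], y)
        improved = True
        while improved and remaining:
            improved = False
            score, j = min((aic(X[:, selected + [c]], y), c) for c in remaining)
            if score < best:
                best, improved = score, True
                selected.append(j)
                remaining.remove(j)
        return selected, best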

By definition, AIC can be described as -2 \ln(\text{maximum likelihood}) + 2 \times (\text{number of parameters}), which describes the quality of a model with regard to the parameters selected by maximum likelihood estimation; more specifically, it aims to minimize the Kullback-Leibler distance [101] between the statistical model and the true distribution. Note that a lower AIC value indicates a better model. We use the stepwise algorithm with AIC as the scoring function to select a linear model.

Nonlinear models: Our assessment of the linear models reveals that while their prediction accuracy is mostly on par with the weighted mean method, these models cannot be considered a good fit to the observed data. As we focus on data-intensive applications, the bursty I/O patterns in such applications [77, 35] tend to break the linearity assumption and lead to large prediction errors. The need for an alternative to both the weighted mean method and the linear models leads us to explore nonlinear models, in particular those of degree two, i.e., quadratic models in our study.

By expanding the controlled variables X_{VM1,i} and X_{VM2,i} into all the terms of the degree-2 polynomial (1 + \sum_{i=1}^{4} X_{VM1,i} + \sum_{i=1}^{4} X_{VM2,i})^2, we can construct an initial non-linear function of the controlled variables for the regression, shown in equation (4.2). In the non-linear modeling process, we use the Gauss-Newton method [46] to find the coefficients such that the SSE is minimized. The Gauss-Newton method is an iterative process that gradually updates the parameters to reach the optimal solution. Similarly, we use a stepwise algorithm to choose a non-linear model with the best AIC value. In general, we find that nonlinear models have the best prediction accuracy

compared to the other two methods in predicting either the runtime or IOPS, as we will show in Sec. 4.4.

\hat{Y} = c + \sum_{i=1}^{4} \alpha_i^{(1)} \cdot X_{VM1,i} + \sum_{i=1}^{4} \alpha_i^{(2)} \cdot X_{VM2,i}
        + \sum_{i=1}^{4} \sum_{j=1}^{4} \beta_{i,j}^{(1)} \cdot X_{VM1,i} \cdot X_{VM2,j}
        + \sum_{i=1}^{4} \sum_{j=1}^{i-1} \beta_{i,j}^{(2)} \cdot X_{VM1,i} \cdot X_{VM1,j}
        + \sum_{i=1}^{4} \sum_{j=1}^{i-1} \beta_{i,j}^{(3)} \cdot X_{VM2,i} \cdot X_{VM2,j}
        + \sum_{i=1}^{4} \gamma_i^{(1)} \cdot X_{VM1,i}^{2} + \sum_{i=1}^{4} \gamma_i^{(2)} \cdot X_{VM2,i}^{2}    (4.2)
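Because the quadratic model is linear in its coefficients once the terms are expanded, the feature construction behind equation (4.2) can be sketched directly; the ordinary least-squares fit below is a simplification under that observation, whereas the dissertation fits the coefficients with Gauss-Newton and prunes terms with the stepwise AIC procedure. Function names are assumptions.

    import numpy as np

    def expand_degree2(x_vm1, x_vm2):
        # Build the linear, cross-VM, within-VM pairwise, and squared terms
        # that appear in equation (4.2) for one observation.
        feats = list(x_vm1) + list(x_vm2)
        feats += [x_vm1[i] * x_vm2[j] for i in range(4) for j in range(4)]
        feats += [x_vm1[i] * x_vm1[j] for i in range(4) for j in range(i)]
        feats += [x_vm2[i] * x_vm2[j] for i in range(4) for j in range(i)]
        feats += [v * v for v in x_vm1] + [v * v for v in x_vm2]
        return np.array(feats)

    def fit_quadratic(X1, X2, y):
        # X1, X2: (n_samples, 4) characteristics of VM1 and VM2; y: runtime or IOPS.
        F = np.array([expand_degree2(a, b) for a, b in zip(X1, X2)])
        A = np.column_stack([np.ones(len(y)), F])   # prepend the constant c
        coef, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
        return coef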

Model training and learning: For a given application, we generate its interference profile by running it on VM1 while varying the workloads on VM2, for which we develop a workload generator that exercises both CPU and I/O devices with different intensities. By doing so, we obtain a collection of data on the interference effects under different background workloads. For the CPU utilization, the workload generator executes a set of arithmetic operations in a loop with varied idle intervals between iterations, so that the CPU utilization in a guest domain can be controlled at five intensities: 0%, 25%, 50%, 75%, and 100%. In the meantime, a storage device is tasked with either read or write requests; in both cases, the workload generator reads from or writes to a file that is much larger than the allocated memory size of the guest domain to avoid OS caching. Similarly, the read requests per second and write requests per second can each be controlled at five intensities ranging from 0% to 100% by adjusting the length of the sleep interval between iterations. To create more realistic scenarios, we create in total 125 different workloads that serve as the background applications for profiling the interference. VM1 then runs against each background setting four times; that is, 500 data points are collected as the training set of one performance model. Note that

we also include the performance of each application without interference, that is, when the application runs in one VM while the other VM is idle. For a cloud platform, our approach can simply be automated when a new application arrives. Further, this approach supports online learning of the interference prediction model: the model shall be dynamically monitored and modified when it can no longer accurately capture the interference relationships among different applications. The causes may be changes in the applications, virtual machines, operating systems, or cloud infrastructure. To this end, TRACON collects statistics on applications and virtual machines and keeps track of the prediction errors of the models. Upon the occurrence of some predefined events (e.g., a significant shift of the mean or a large surge in the variance), TRACON starts to build a new model with the latest data.
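A hypothetical sketch of the background workload generator described above is given below: a CPU loop with a tunable duty cycle and a throttled file reader/writer. The file path, block size, period, and sleep heuristic are assumptions; only the idea of controlling intensity through idle intervals and of using a file larger than the guest's memory comes from the text.

    import os
    import time

    def cpu_load(intensity, period_s=0.1, duration_s=60):
        # Keep one core busy for roughly `intensity` (0.0-1.0) of each period.
        end = time.time() + duration_s
        while time.time() < end:
            busy_until = time.time() + intensity * period_s
            while time.time() < busy_until:
                _ = 12345 * 6789              # arithmetic work
            time.sleep((1.0 - intensity) * period_s)

    def io_load(path, intensity, write=False, block=64 * 1024, duration_s=60):
        # Issue read or write requests against a pre-created file that is much
        # larger than the guest's RAM, sleeping between iterations so the
        # request rate roughly matches the requested intensity.
        end = time.time() + duration_s
        with open(path, "r+b" if write else "rb") as f:
            size = os.path.getsize(path)
            offset = 0
            while time.time() < end:
                f.seek(offset % size)
                if write:
                    f.write(os.urandom(block))
                else:
                    f.read(block)
                offset += block
                if intensity < 1.0:
                    time.sleep(0.01 * (1.0 - intensity) / max(intensity, 0.05))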

4.3 Interference-Aware Scheduling

Interference prediction completes one side of the story: with the help of these models, TRACON can now schedule the incoming tasks to different virtual machines in a way that minimizes the interference effects from co-located applications. In general, optimally mapping tasks to machines in parallel and distributed computing environments has been shown to be an NP-complete problem [48]. In this work, we explore a number of heuristic techniques to find a good solution to the scheduling problem. Specifically, TRACON aims to reduce the runtime and improve the I/O throughput of data-intensive applications in a virtualized environment. Given a set of tasks T, where each task t ∈ T has runtime RT_t and I/O throughput IOPS_t, we define the total runtime RT_{total} for this set of tasks as

RT_{total} = \sum_{\forall t \in T} RT_t    (4.3)

and the combined throughput IOPS_{total} as

IOPS_{total} = \sum_{\forall t \in T} IOPS_t    (4.4)

As the first step, we explore three different scheduling strategies: online scheduling, which reduces the queueing time for each incoming task by quickly dispatching it to a virtual machine; batch scheduling, which pairs the incoming tasks based on the predicted interference; and mixed scheduling, which aims to balance batch and online scheduling. For comparison, we use a FIFO scheduler as the baseline, where the incoming tasks are allocated to virtual machines in first-in, first-out order.

Minimum interference online scheduler (MIOS) is designed to make a quick scheduling decision, which becomes necessary when tasks arrive rapidly. In such a scenario, tasks arrive at the queue at arbitrary times and the scheduler dispatches an incoming task immediately without waiting for later tasks. We design MIOS based on the concept of the minimum completion time (MCT) heuristic [29]. With the goal of minimizing the sum of the execution times of all tasks, MCT maps each incoming task to the machine that completes it in the shortest time. When a task t arrives, MIOS predicts t's performance on each available VM and assigns t to the VM with the best predicted performance. The advantage of MIOS is its ability to dispatch a task in a short time. On the downside, the task assignment may not be as good as that of a batch scheduler, which considers more possible assignments. The MIOS algorithm is presented in Algorithm 1.

Minimum Interference Batch Scheduler (MIBS) is a batch scheduling algorithm based on the concept of the Min-Min heuristic [79]. In a batch scheduling scenario, the scheduling process takes place when the queue that holds the incoming tasks is full. In the first step, the Min-Min heuristic finds a machine with the minimum score (e.g., completion time) for each task in the queue (the first "Min").

Algorithm 1: MIOS

Data: Task t; Pool consists of VM_{j,k}, where j ∈ 1, ..., m and k ∈ 1, ..., n; Model is the interference prediction model.
Result: t and VM_{j,k} assignment.
begin
    for each VM_{j,k} in the Pool do
        score_{j,k} = Predict(t, VM_{j,k}, Model);
    end
    VM_candidate = Min(score_{j,k});
    Assign(t, VM_candidate);
end

In the second step, among all task-machine pairs, Min-Min finds the pair with the minimum score (the second "Min") and assigns the selected task to its corresponding machine. This procedure repeats until the queue is empty.

In TRACON, assume we have a queue of incoming tasks t_i, where i ∈ {1, 2, ..., l} and l is the total number of available tasks, and virtual machines are denoted as

VM_{j,k}, where j ∈ {1, 2, ..., m} and m is the number of VMs per physical machine, and k ∈ {1, 2, ..., n} and n is the number of physical machines. First, MIBS takes the first task, candidate_1, in the queue as the input to run MIOS. Second, MIBS chooses another

task candidate_2 from the rest of the queued tasks that has the least interference with candidate_1. Then, MIBS takes candidate_2 as the input to run MIOS. On the one hand, MIBS needs to calculate the interference between the incoming tasks, which may lead to a longer waiting time in the queue. On the other hand, as MIBS considers the pairing of all incoming tasks, it has a good chance of improved performance when the models accurately predict the interference between different tasks. The MIBS scheduling algorithm is listed in Algorithm 2.

Minimum Interference miXed scheduler (MIX) intends to combine the two algorithms and possibly improve the performance. The scheduler does not dispatch an assignment of MIBS immediately. Instead, MIX gives every job a chance to be the first job in the queue when executing MIBS, in the hope that future assignments offer new opportunities for better scheduling decisions.

Algorithm 2: MIBS

Data: Queue is a task batch of t_i, where i ∈ 1, ..., l; VM_{j,k} are VMs on available machines, where j ∈ 1, ..., m and k ∈ 1, ..., n; Model is the interference prediction model.
Result: t_i and VM_{j,k} assignments.
begin
    while Queue is not empty do
        candidate_1 = t_1;
        MIOS(candidate_1, VM_{j,k}, Model);
        for each task t_i in the Queue, i ≠ 1 do
            score_i = Predict(t_i, t_1, Model);
        end
        // the first "Min"
        candidate_2 = Min(score_i);
        // the second "Min"
        MIOS(candidate_2, VM_{j,k}, Model);
        RemoveFromQueue(candidate_1, candidate_2);
    end
end

The obvious drawback here is that the delay for each task may increase, although the overall performance could potentially improve. The MIX scheduling algorithm is listed in Algorithm 3. In summary, the three scheduling strategies have different advantages and drawbacks: MIOS has the lowest scheduling overhead, MIX has the potential to achieve the best performance while incurring the highest overhead, and MIBS stands in between, which, as we will show shortly, can lead to a good balance between scheduling performance and overhead.

4.3.1 Machine Learning Based Scheduling

Machine learning techniques have been used to assist task scheduling because of their self-adaptive nature [16, 140, 142]. The algorithms in Section 4.3 all share the same goal: to find VMs with small mutual interference and put them onto the same physical machine.

Algorithm 3: MIX

Data: Queue is a task batch of t_i, where i ∈ 1, ..., l; VM_{j,k} are VMs on available machines, where j ∈ 1, ..., m and k ∈ 1, ..., n; Model is the interference prediction model.
Result: t_i and VM_{j,k} assignments.
begin
    while Queue is not empty do
        for each task t_i in the Queue do
            Mark t_i as the first task in Queue;
            Assignment_i = MIBS(Queue, VM_{j,k}, Model);
            if Assignment_i is better than Assignment_MIX then
                keep Assignment_i as Assignment_MIX;
            end
        end
        Execute Assignment_MIX;
        RemoveFromQueue(Assignment_MIX);
    end
end

Here we take a different point of view to deal with this problem: minimizing the chance that severely interfering VMs are scheduled onto the same physical machine. This way, we can avoid large interference on each physical machine and minimize the total interference. In particular, we design two machine-learning-based schedulers, a k-means++ scheduler and a doubling scheduler, both based on clustering algorithms. In both schedulers, we use interference as the distance metric of the original clustering problem. The following are detailed descriptions of both algorithms:

k-means++: The quality of the final clustering of a k-means algorithm depends mainly on the initialization process. The idea of most k-means initializations is to pick points that are far away from each other. However, some methods, like farthest-first traversal, are too sensitive to outliers. Therefore, k-means++ [12] is designed to make a good choice of the initial k centers: instead of choosing the point farthest from the points chosen so far, k-means++ picks each point at random with probability proportional to its squared distance. In this approach, the interference-aware scheduler applies k-means++ to initialize the physical machines with VMs that interfere with each other severely. The k-means++ algorithm for interference-aware scheduling is presented in Algorithm 4.


Algorithm 4: k-means++
Data: A set of VMs T; k physical machines.
Result: Task and machine assignments.
begin
    Randomly and uniformly choose one t ∈ T;
    Assign t onto one physical machine and remove t from T;
    while not all physical machines have been assigned one VM do
        for each VM t ∈ T do
            D(t) = the smallest interference between t and a VM that has already been assigned;
        end
        Pick VM t' ∈ T at random with probability proportional to D(t')^2;
        Assign t' to an idle physical machine and remove t' from T;
    end
    while T ≠ ∅ do
        Assign x ∈ T to the machine such that the interference among co-located VMs is minimized;
        Remove x from T;
    end
end
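The seeding step of Algorithm 4 can be written compactly in Python; the sketch below assumes a precomputed symmetric matrix of pairwise interference scores (e.g., produced by the prediction model) and covers only the D(t)^2-proportional seeding, with a small epsilon added to avoid a zero-probability corner case.

    import numpy as np

    def seed_machines(interference, k, rng=None):
        # interference: (n, n) symmetric matrix of pairwise interference scores.
        # Returns k VM indices, one seed per physical machine.
        if rng is None:
            rng = np.random.default_rng()
        n = interference.shape[0]
        seeds = [int(rng.integers(n))]            # first seed: uniform at random
        while len(seeds) < k:
            remaining = [i for i in range(n) if i not in seeds]
            # D(t): smallest interference between t and any already-placed VM.
            d = np.array([interference[i, seeds].min() for i in remaining])
            probs = d ** 2 + 1e-12
            probs /= probs.sum()
            seeds.append(int(rng.choice(remaining, p=probs)))
        return seeds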

Doubling: Charikar et al. [41] proposed a doubling algorithm for incremental clustering. Given a sequence of points in a metric space, the goal of this algorithm is to efficiently maintain a clustering such that the maximum cluster diameter is minimized. The basic idea is to maintain at most k clusters without unnecessarily increasing the cluster radius. The doubling algorithm has two stages: the merging stage, where the algorithm decreases the number of clusters by merging pairs whose distance is smaller than the cost of the current clustering, and the update stage, where the algorithm takes a new point as one of the cluster centers if its distance to all current cluster centers is larger than the cost of the current clustering. The doubling algorithm can thus avoid large interference between tasks: here, each point is a VM to be scheduled, and the distance between points is their interference. The doubling algorithm for interference-aware scheduling is shown in Algorithm 5.

Algorithm 5: Doubling
Data: A set of VMs T; k physical machines.
Result: Task and machine assignments.
begin
    G ← the first k VMs in T;
    R = the smallest interference between VMs in G;
    while the number of VMs in G equals k do
        // Merging stage
        for each VM x ∈ G do
            r = the smallest interference between x and other VMs in G;
            if r > 2R then
                move x from G to G';
            end
        end
        // Updating stage
        if the number of VMs in G is less than k then
            r = the smallest interference between y ∈ T and VMs in G;
            if r > 2R then
                move y to G;
            end
            if T ≠ ∅ then
                R = 2R;
            else
                T = G' and R = 0.5 × R;
            end
        end
    end
    Assign one VM from G to one physical machine;
    while not all available spots are occupied do
        Assign one VM ∈ T to the machine such that the interference among co-located VMs is minimized;
    end
end

It is important to note that not all machine learning algorithms significantly improve scheduling performance. As the results later show, while the k-means-based approach achieves good speedups on different workloads, the benefit from the doubling algorithm is limited. Furthermore, not all machine learning techniques can benefit the interference-aware scheduling problem. For example, a cover tree is a rooted

infinite tree that keeps a record of distances among nodes so that inserting a new node and finding nearby nodes are efficient. Beygelzimer et al. [19] use the cover tree to handle different cluster numbers k simultaneously. Ideally, a task scheduler could use a cover tree to efficiently manage the tasks and machines. However, there is a problem with applying the cover tree to the interference problem. For example, suppose there are VM_A, VM_B, and VM_C in the system, and the pairs (VM_A, VM_B) and (VM_A, VM_C) have small interference while the pair (VM_B, VM_C) has large interference. Such pairwise interference does not behave like a metric distance, which makes the cover tree unsuitable for interference scheduling. We could try to construct a cover tree per VM, but this adjustment would not save time in maintaining the interference relations among the VMs. Clearly, selecting an appropriate machine learning technique is key to constructing a good interference-aware scheduling algorithm.

4.4 Simulation

4.4.1 Data-intensive Benchmarks

As we mostly focus on IO performance interference in a virtualized environment, we select eight data-intensive benchmarks for Sec. 4.4, covering applications from bioinformatics, software development, system administration, data mining, multimedia processing, and server workloads. Table 4.2 summarizes the benchmarks used in Sec. 4.4. For IO intensity, a larger number indicates a higher IOPS and throughput requirement.

Bioinformatics: Finding similar DNA or protein sequences is a crucial task in bioinformatics research. The Basic Local Alignment Search Tool (BLAST) [8] is one of the most widely used algorithms for identifying local similarity between biological sequences. This is done by comparing sequences against databases and identifying sequence regions with statistically significant scores. BLAST can be used for multiple purposes, and we use two NIH BLAST algorithms, blastn and blastp, which are used to

Table 4.2: Data-Intensive Applications and Benchmarks

  Name      Category                 Description                      Data size   File count   IO Intensity
  blastn    Bioinformatics           DNA sequence searching           12 GB       101          6
  blastp    Bioinformatics           Protein sequence searching       11 GB       61           3
  compile   Software development     Linux kernel compilation         2.1 GB      1,358        4
  dedup     System administration    Compression and deduplication    672 MB      1            7
  email     Server application       Email server workload            1.6 GB      249,825      1
  freqmine  Data mining              Frequent itemset mining          206 MB      1            5
  video     Multimedia processing    H.264 video encoding             1.5 GB      1            8
  web       Server application       Web server workload              160 MB      10,000       2

answer nucleotide and protein queries, respectively. As the inputs, the nucleotide and protein databases used are NCBI's (National Center for Biotechnology Information) NT (12 GB) and NR (11 GB) databases, which contain the full set of non-redundant DNA and protein sequences.

Software development: Source code compilation is a commonly used benchmark for storage systems. During the compilation process, the compiler reads a number of source code files at different time points and writes the object files to disk. Here we compile Linux kernel 2.6.18.

System administration: As data continues its exponential growth, deduplication becomes an important task for system administrators to remove data redundancy and reduce the cost of storage systems. We use dedup from the Parsec benchmark suite [21], which includes a number of diverse multi-threaded applications. Dedup applies various data compressions to a data stream in a pipelined manner and writes an output file with the compressed data. In the test, dedup uses an input file of 672 MB. We also choose two other data-intensive benchmarks from Parsec.

Data mining: For data-mining applications, we pick freqmine from Parsec, which mines frequent itemsets from a 206 MB input file.

Media processing: Again, we choose a Parsec benchmark, called video, which encodes an H.264 video file of 1.5 GB. Video has the highest IOPS among all the benchmarks.

Server application: We benchmark two typical enterprise servers, email and web servers. For the email server workload, we use a popular benchmark,

postmark [89], which performs a large number of file operations (create, read, write, delete) on small files. For the web server workload, we use the web server profile in FileBench [119]. For the web benchmark, we evaluate IOPS interference only and do not evaluate runtime because FileBench takes the runtime as an input. In this benchmark, web simulates a mix of open/read/close operations on 10,000 files in about 20 directories, and a data append to a file is issued for every 10 reads/writes to simulate the proxy log. 100 threads are used and the average file size is 16 KB.

Mixed IO workload: We utilize the eight benchmarks to generate workloads with different IO intensities. In particular, we want to create three types of workloads, namely light, medium, and heavy IO, which represent mixtures of workloads with low, medium, and high IO requirements, respectively. To this end, we rank the eight benchmarks by their IOPS, as shown in Table 4.2; each number represents the rank of an application in terms of IO intensity. For example, 1 represents email, with the lowest IOPS, and 8 represents video, with the highest IOPS. We generate light, medium, and heavy IO workloads by following Gaussian distributions with means of 2.5, 4, and 5.5, respectively.
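A small sketch of how such mixes could be drawn is given below: sample an IO-intensity rank from a Gaussian with the stated mean, round and clamp it to the 1-8 range, and map the rank back to a benchmark using Table 4.2. The standard deviation is an assumption, since the text specifies only the means.

    import numpy as np

    RANK_TO_BENCHMARK = {1: "email", 2: "web", 3: "blastp", 4: "compile",
                         5: "freqmine", 6: "blastn", 7: "dedup", 8: "video"}

    def generate_workload(n_tasks, mean, std=1.5, rng=None):
        # Draw IO-intensity ranks from N(mean, std), clamp to the valid range,
        # and translate each rank into the corresponding benchmark name.
        rng = np.random.default_rng() if rng is None else rng
        ranks = np.clip(np.rint(rng.normal(mean, std, n_tasks)), 1, 8).astype(int)
        return [RANK_TO_BENCHMARK[r] for r in ranks]

    light = generate_workload(100, mean=2.5)    # light IO mix
    medium = generate_workload(100, mean=4.0)   # medium IO mix
    heavy = generate_workload(100, mean=5.5)    # heavy IO mix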

4.4.2 Simulation Settings

We implement a simulator to emulate TRACON's performance in large-scale data centers. The simulator can evaluate two different scenarios, with static and dynamic workloads. In the first case, we assume that there is a list of applications waiting to be processed, and the number of applications equals the total number of available virtual machines. Upon the arrival of such a workload, the simulator queries the interference prediction module for the expected workload interference and generates a schedule based on the predicted results. The simulator calculates the performance by using the actual statistics that have been measured on the real systems. In the dynamic workload scenario, we assume that the workload arrival rate follows a given distribution, and each task can be scheduled as soon as possible. When a

scheduling event is triggered, the simulator takes all the tasks in the queue and the current status of all VMs as input, and queries the prediction module. Next, the scheduler generates an assignment based on the predicted results, and the emulator estimates the actual time and system status using previously measured data. Since workloads arrive randomly in time, they may be scheduled in between executions of their co-located tasks. To address this, we calculate the new interfered runtime from the portion of the workload that remains. For example, suppose task A in VM1 and task B in VM2 are running on the same physical machine from the beginning, but task B finishes earlier than task A and the scheduler puts task C as the next task onto VM2. If task A has already finished 80% of its workload, its remaining runtime is estimated as 20% of its workload running alongside task C. We run the simulations for a data center with 8 to 1,024 machines, and scale up to 10,000 machines. We measured the real effects of interference and use the measured data in the simulation. All evaluation and measurement are conducted on Dell machines with a 2.93 GHz Intel Core2 Duo E7500 processor, 4 GB RAM, and a 1 TB Samsung SATA hard drive, running Linux kernel 2.6.18 and Xen 3.1.2. Each VM is allocated 1 virtual CPU, 512 MB RAM, and 200 GB of disk space. For simplicity, we assume that the machines in the data center are homogeneous. For all the experiments, we report the average value of three runs. The emulation results are compared to the First-In-First-Out (FIFO) scheduler, which serves as the baseline in all following experiments.
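The remaining-runtime adjustment in the example above can be captured in a few lines; the function and argument names below are ours, and the inputs are the predicted full runtimes of the task next to its old and new neighbors.

    def remaining_runtime(elapsed_s, runtime_with_old_neighbor_s, runtime_with_new_neighbor_s):
        # Fraction of work already completed next to the old neighbor,
        # re-costed under the predicted runtime next to the new neighbor.
        finished = min(elapsed_s / runtime_with_old_neighbor_s, 1.0)
        return (1.0 - finished) * runtime_with_new_neighbor_s

    # Example: task A has finished 80% of its work next to task B, so 20% of
    # its predicted runtime next to task C remains (values are placeholders).
    print(remaining_runtime(elapsed_s=80.0,
                            runtime_with_old_neighbor_s=100.0,
                            runtime_with_new_neighbor_s=120.0))  # -> 24.0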

4.4.3 Performance of Prediction Models

We profile and model the eight benchmarks with the methods described in Section 4.2. The prediction error is defined as |predicted value − actual value| / actual value. Fig. 4.2(a) and 4.2(b) show the prediction errors of LM, NLM, and WMM on runtime and IOPS for the different benchmarks.

[Figure 4.2 panels: (a) Runtime models, (b) IOPS models; y-axis: prediction error; series: WMM, LM, NLM, NLM w/o Dom0 CPU]

Figure 4.2: Model prediction errors

While NLM's prediction errors on IOPS stay relatively stable across benchmarks, LM and WMM show higher prediction errors for benchmarks with many random IO operations, like compile or web, than for those with mainly sequential IO operations, like video. The differences between the linear and non-linear models are mostly attributable to bursty IO patterns in the applications, which make linearity hard to hold in such cases. We find that adding degree-2 terms to a model significantly reduces the prediction error. In general, NLM has less than 10% prediction error, compared to 20% or more for LM and WMM; the improvement is even bigger when applying the models to IOPS. We also want to point out that, as shown in Fig. 4.2(a), the fourth parameter, the Dom0 CPU utilization, is very important for the model to achieve high accuracy. Without it, NLMs would have prediction errors up to two times larger.

4.4.4 Task Scheduling with Different Models

We evaluate the effectiveness of the different prediction models when used in the scheduler. In each run, we generate a batch of tasks by sampling uniformly at random from the eight applications. For a batch of tasks whose size is equal to the number of VMs, three schedules are generated by MIBS with WMM, LM, and NLM, respectively. The performance numbers are normalized to those from the FIFO

scheduler. Let the total runtime of the tasks scheduled by a scheduler S be RT_S and the

one by FIFO be RT_{FIFO}. The runtime improvement, speedup, is defined as

Speedup = \frac{RT_{FIFO}}{RT_S} .    (4.5)

Similarly, let the total IOPS of the tasks scheduled by a scheduler S be IOPS_S and

the one by FIFO be IOPS_{FIFO}. The throughput improvement, IOBoost, is defined as

IOBoost = \frac{IOPS_S}{IOPS_{FIFO}} .    (4.6)


Figure 4.3: Runtime and IOPS improvements with different models

Let MIBS_RT be the MIBS variant whose goal is to minimize total runtime and MIBS_IO be the variant whose goal is to maximize total IOPS. Fig. 4.3 shows Speedup and

IOBoost for MIBS_RT and MIBS_IO when using WMM, LM, and NLM, respectively. NLM not only has lower prediction errors than WMM and LM, but also better assists the scheduler in minimizing runtime and increasing IO throughput.


4.4.5 NLM Prediction Accuracy

In this section, we analyze NLM's ability to determine the minimum runtime and maximum throughput. Fig. 4.4 shows the predicted minimum, measured (real) minimum, average, and maximum runtimes of each application when it runs concurrently with other applications. One can see that NLM is able to closely predict a benchmark's minimum runtime, and the predicted minimum never goes beyond the measured average or maximum runtimes. Similarly, NLM performs well on IOPS predictions, as shown in Fig. 4.5: the predicted maximum IOPS is always within a small distance of the real measured maximum throughput.


Figure 4.4: Predicted minimum runtime of each application compared to its measured minimum, average, and maximum runtimes

4.4.6 Model Adaptation

Our models can be dynamically adjusted to improve prediction accuracy at runtime. In this experiment, we build an initial interference model of blastn with application statistics (a total of 500 data points) collected on a machine with local storage devices. We then use this model to predict blastn's runtime and IOPS on a machine with identical software and hardware settings but with remote storage devices accessed via the iSCSI (Internet SCSI) interface.


Figure 4.5: Predicted maximum IOPS of each application compared to its measured minimum, average, and maximum IOPS

In Fig. 4.6, we can see that a different storage device can result in dramatic drops in prediction accuracy for our blastn models: the prediction error of the IOPS model increases from 12% to 83%, and that of the runtime model increases from 12% to 160%. In this case, TRACON continues to collect the application's statistics from the runtime environment and gradually replaces the old training data with the newly collected data. We rebuild the models every time 160 new data points are collected. As shown in Fig. 4.6, TRACON is able to quickly bring the prediction error back down to the previous level of around 10%. If the environment remains unchanged, that is, local storage is used throughout the experiment, the model improves slightly, although the difference between the old and new models is too small to be distinguishable in Fig. 4.6.


Figure 4.6: Online model learning

In summary, NLM has a lower prediction error and better performance than both

LM and WMM. In addition, it is able to dynamically adjust its prediction accuracy when adapting to a different environment. Therefore, we use NLM as the prediction module in the following emulations.

4.4.7 Performance of Scheduling Algorithms

Static workload: In the static workload scenario, we use workloads with different IO intensities to examine the speedups when scheduling them with MIBS_RT and MIBS_IO.

Fig. 4.7 demonstrates the speedups from MIBS_RT and MIBS_IO with respect to different numbers of machines and IO workloads. For the heavy IO workload, both

MIBS_RT and MIBS_IO obtain limited speedups because there is not much room to reduce interference: almost all combinations in this workload are likely to interfere severely with each other. MIBS_RT outperforms MIBS_IO in this case because the IO bandwidth has been saturated. When the workload has light IO intensity, both MIBS_RT and MIBS_IO achieve significantly better performance than with the heavy IO workload, with 30% speedups. However, in this case, even FIFO intuitively has a good chance of encountering less interference.

The best performance is achieved for the medium IO workload, where both MIBS_RT and MIBS_IO obtain more than 40% improvement over FIFO. In addition, MIBS_IO beats FIFO by 1.5 times when there are 1,024 machines. Note that MIBS_IO outperforms MIBS_RT in this case because it can effectively increase the IO utilization without utilizing all the bandwidth.


Figure 4.7: Speedup by MIBS_RT and MIBS_IO


Figure 4.8: Normalized throughput of MIBS_8, MIOS, and MIX_8 at λ tasks per minute


Figure 4.9: Normalized throughput of MIBS_8, MIBS_4, and MIBS_2 at λ tasks per minute

Dynamic workload: Most data centers deal with tasks that arrive dynamically and need to schedule them in a real-time fashion. In this section, we assume that the task arrival rate follows a Poisson process with an average rate of λ tasks per minute. The throughput T_S is defined as the number of tasks completed on a system with scheduler S in a time period. The normalized throughput is defined as

T_S / T_{FIFO}.

Suppose that the schedulers are MIBS_8, MIOS, and MIX_8, where the subscript denotes the queue length, and that there are ten hours to process tasks in a data

center with 64 machines. We present in Fig. 4.8 the normalized throughput of MIBS_8,

MIOS, and MIX_8 at different λ values when the workloads have light, medium, or heavy IO intensity, respectively. When λ is small, the three schedulers have similar throughput


Figure 4.10: Normalized throughput of MIBS_8, MIOS, and MIX_8 at different numbers of machines

because the data center is idle most of the time; that is, a scheduler can always find an idle machine for an incoming task without any interference. As λ goes up, the machines gradually become occupied, and the advantages of the scheduling algorithms become more obvious. In this case, although MIX_8 has the best performance, MIBS_8's performance is very close with much lower overhead, which makes it more suitable for dynamic workloads. Similar to the previous results for the static workload, the three schedulers achieve higher throughput for the medium IO workload than for the light and heavy IO workloads. Fig. 4.9 shows that the normalized throughputs of MIBS improve as λ increases. We vary the queue length of MIBS from 2 to 4 and 8. The trend remains that, for different queue lengths, MIBS works best for the medium IO workload. Clearly, the performance

improves when the queue length increases, e.g., at λ of 100, MIBS8 achieves about

10% higher throughput than MIBS4 and MIBS2, with a small overhead increase. Scalability: We explore the performance of different schedulers when using 8

to 1,024 machines with λ = 1,000. Fig. 4.10 shows that MIBS8's throughput is close

to MIX8’s and the gap is reduced as the number of machines increases. Clearly,

MIBS8 is a better solution because it has less scheduling overhead while achieving

a throughput comparable to MIX8's. In contrast, MIOS has the least performance improvement over FIFO. When we scale the data center to 10,000 machines and

λ = 10,000, the normalized throughput of MIBS8 with the medium IO workload remains high, with a 40% improvement.

Figure 4.11: Normalized throughput of MIBS8, MIBS4, and MIBS2 at different numbers of machines

Fig. 4.11 demonstrates how the throughputs of MIBS8, MIBS4, and MIBS2 change as the number of machines changes. Similar to the results at different λ, MIBS with a longer queue achieves higher throughput than one with a shorter queue.

4.4.8 Energy Savings

The success of cloud computing needs strong support from many large-scale data centers. However, energy consumption is a primary issue in data center management. The total energy used by US data centers was about 61 billion kilowatt-hours of electricity in 2006 (1.5% of national electricity consumption) and is estimated to approach 3% of national electricity consumption [182]. Previous works mainly focus on scheduling to minimize energy consumption or heat generation [42, 125]. In this section, we examine TRACON's influence on data centers' energy consumption as a by-product of interference reduction. Assuming a machine consumes 442.7 W in the active state and 105.3 W when idle [90], one can save up to 76% of energy when completing a task faster. Given a fixed amount of workload, we consider the time difference between the FIFO runtime and the MIBS runtime as the idle time period. Fig. 4.12 shows the percentage of savings on energy consumption when using the MIBS scheduler in both the static and dynamic workload scenarios. In the dynamic workload scenario, MIBS can help save energy from 20% to

30%. If we assume that the machines are turned off completely when they are idle, the energy savings can go up to more than 40%.
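As a minimal sketch of this accounting (assuming only the 442.7 W active and 105.3 W idle figures cited above; the runtimes are hypothetical inputs), the savings from finishing the same workload faster can be computed as follows.

```python
ACTIVE_W = 442.7   # active power per machine [90]
IDLE_W = 105.3     # idle power per machine [90]

def energy_savings(fifo_runtime_h, mibs_runtime_h, power_off_idle=False):
    """Fraction of energy saved by finishing the same workload faster.

    The machine is active for mibs_runtime_h hours; for the remaining
    (fifo_runtime_h - mibs_runtime_h) hours it is either idle or powered off.
    """
    idle_h = fifo_runtime_h - mibs_runtime_h
    baseline = ACTIVE_W * fifo_runtime_h              # FIFO: active the whole time
    idle_power = 0.0 if power_off_idle else IDLE_W
    with_mibs = ACTIVE_W * mibs_runtime_h + idle_power * idle_h
    return 1.0 - with_mibs / baseline

# Example: a 30% runtime reduction saves about 23% (idle) or 30% (powered off).
print(energy_savings(10.0, 7.0), energy_savings(10.0, 7.0, power_off_idle=True))
```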

Figure 4.12: Energy savings in a data center with 1,024 machines for light, medium, and heavy IO loading under static and dynamic workloads

4.5 Implementation and Experiments

In this section, we implement TRACON on two clusters and evaluate its functionality and effectiveness with real-world cloud applications.

4.5.1 Implementation and Experiment Environment

We evaluate the proposed methods on two sets of machines: Cluster A has four nodes, each with a two-core 1.8 GHz Intel Atom D525 CPU and 4 GB main memory; and Cluster B has two nodes, each with two six-core 2 GHz Intel Xeon E5-2620 CPUs and 24 GB main memory. All nodes run Linux 2.6.32 with Xen 4.0 and are connected to a separate NFS server over a Gigabit Ethernet. To understand the interference between various applications, the independent variable in the experiment should be the application combination, and all other factors should remain constant. Therefore, the VMs are configured with 1 VCPU and 1 GB RAM on cluster A, and 4 VCPUs and 4 GB RAM on cluster B. Each cluster A machine hosts 4 VMs and each cluster B machine hosts 6 VMs. In addition, as shown in Fig. 4.1, we have a dedicated machine

(not part of cluster A or B) as the manager server that is responsible for scheduling and monitoring the application servers. In our experiments, the management system consists of Linux shell scripts, the Xen management tool (xm), and a task generating and scheduling process implemented in C. The tasks in the queue are randomly generated. We will describe the generating methods and distributions in Sec. 4.5.3. When the test starts, TRACON takes the task queue and machine status as the inputs to query the scheduler for a potential assignment. Once an assignment is decided, TRACON uses the xm tool to start the VMs on the application servers and then starts running applications in the VMs. All VMs are loaded with scripts to automatically run applications in the VMs. These scripts randomly configure application parameters before each run, so the applications do not reuse the same dataset and problem size. We will introduce these configurable parameters for cloud applications in Sec. 4.5.2. Note that the management server and application servers only need to exchange commands and system status, and the application data do not need to move because all application servers are connected to the same repository. The measured time from triggering the scheduler to delivering an assignment is less than one second in all experiments on clusters A and B. The management system keeps dispatching tasks until there are no available resources for new VMs. As soon as a VM finishes its tasks and there are pending tasks in the queue, the scheduler will be triggered to generate new assignments. Note that VMs can be migrated to other hosts if it is necessary for the new arrangement. In order to do live migration between all cluster nodes, all nodes in the experiments are configured as migration servers.
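For illustration only, the following Python sketch outlines such a dispatch loop; the status probe, the FIFO scheduler stub, and the task fields are hypothetical stand-ins, and only the xm create invocation mirrors the tool named above.

```python
import subprocess
import time
from collections import deque

def machine_status(host):
    """Hypothetical probe; a real manager would query xm/xentop on the host."""
    return {"free_vm_slots": 1}

def fifo_scheduler(tasks, status):
    """Stub FIFO policy: place the head task on any host with a free VM slot."""
    for host, s in status.items():
        if s["free_vm_slots"] > 0:
            return tasks[0], host
    return None

def dispatch_loop(scheduler, tasks, hosts):
    """Manager loop: query the scheduler, then start the task's VM via xm."""
    queue = deque(tasks)
    while queue:
        status = {h: machine_status(h) for h in hosts}
        assignment = scheduler(list(queue), status)
        if assignment is None:          # no available resources; wait and retry
            time.sleep(5)
            continue
        task, host = assignment
        # Start the pre-built VM on the chosen application server; the in-VM
        # scripts then configure and run the application automatically.
        subprocess.run(["ssh", host, "xm", "create", task["vm_config"]],
                       check=True)
        queue.remove(task)
```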

4.5.2 Cloud Applications

Table 4.3 shows a number of popular cloud applications and their configurations used in our experiments. Note that data sizes, datasets, and the number of processes are dynamically configured in the experiments. Thus, there are thousands of possible tasks in the experiment. In summary, YCSB's I/O pattern shows high burst I/O

intermittently because of the Zipfian distribution, and YCSB's I/O bandwidth usage is less than its average most of the time. Darwin streaming server has high CPU and high average I/O consumption. Cloud9 has relatively low I/O and moderate CPU usage compared to the others. Cloudstone and Wikibooks have random I/O operations.

Table 4.3: Cloud application settings

Name        Description          Workload Types                    Data size   Client #  Dataset #
Cloudstone  Social event app     Interactive update and read       10∼50 GBs   10∼50     5
Wikibooks   Wikimedia            Less updates than Cloudstone      10∼60 GBs   10∼100    5
YCSB1       Cloud data serving   Update heavy. Read:Write 50:50    10∼50 GBs   10∼100    5
YCSB2       Cloud data serving   Read heavy. Read:Write 95:5       10∼50 GBs   10∼100    5
Darwin      Video streaming      Heavy read operations             2∼36 GBs    10∼100    5
Cloud9      Software testing     Less I/O. More CPU                2 GBs       1∼10      98

• Cloudstone, a performance measurement framework for Web 2.0 [163], which has three main components: 1) Web application: Cloudstone uses Olio [177], an open source online social-event calendar. We use Apache Tomcat with PHP and a geocoder emulator to host Olio. 2) Database: used to store and handle user accounts and calendar events. We use MySQL as the backend database in our experiments. Besides the MySQL database, Olio also needs a directory to store data. The required space depends on the number of users in the database. This directory and the database directory can be mounted to different datasets before each run. The size may vary from 10 to 50 GBs. Note that all applications in the experiments use this method to avoid repeating datasets. 3) Workload generator: Cloudstone uses Faban [168] as the workload generator, which uses the Markovian arrival time model [163] to generate requests to the Web application server. Operations are a mixture of common social web site activities, such as loading home pages, logging in, adding events, etc.

• Wikipedia is a free online encyclopedia which contains 23 million articles. We use the VMs loaded with Wikibooks from BenchLab [36]. In our experiments, we use database dumps from Wikimedia foundation [196] and real request traces from the Wikibench web site [181].

• YCSB (Yahoo! Cloud Serving Benchmark) is designed to address the need for performance measurement of cloud serving systems [50]. In general, YCSB has two main components: a workload generator and agents on every data serving node. The workload generator specifies the characteristics of requests. Agents on every data serving node follow the commands from the workload generator to exercise and benchmark data serving systems. In our case, we use Apache Cassandra 0.7.3 as the cloud data serving system, and install the YCSB 0.1.3 framework to measure the performance of Cassandra. Two core workloads from YCSB are used in our experiments. We call them YCSB1 and YCSB2. Both send requests following a Zipfian distribution. The major difference between YCSB1 and YCSB2 is the read:write ratio. YCSB1 generates the activities of session stores which frequently record recent actions from users. In other words, YCSB1 is an update heavy workload with a read:write ratio of 50:50. On the other hand, YCSB2 reproduces a read mostly workload. An application example is adding photos or photo tags on Facebook, where there are few updates but many read requests. The read:write ratio is 95:5 in YCSB2. In our experiments, the Cassandra database is configured with at most a 2 GB Java heap and 200 MB garbage collection space.

• Darwin is a video streaming server, an open source version of Apple's QuickTime server. It streams videos across networks using the industry standard real-time transport protocol (RTP) and real time streaming protocol (RTSP). The clients send requests following the Faban workload driver's commands. Since we focus on I/O interference, our workload mixture contains one third each of low, medium, and high bit-rate videos.

• Cloud9, an automated software testing tool, aims to make use of abundant resources in cloud systems to provide a high-quality on-demand software testing service [31, 47]. Cloud9 utilizes scalable parallelization of symbolic execution to handle path explosion in large scale software testing. Two main modules of Cloud9 are worker nodes and a load balancer. Each worker independently runs

KLEE [34], an open source symbolic execution engine built on top of LLVM, to explore a subtree of the whole execution tree. The load balancer is responsible for global load distribution, and workers report their load to it periodically. We measure the time needed by Cloud9 to reach complete coverage of the tested code. The testing traces are from CloudSuite and are generated by symbolically executing 98 different utilities from GNU CoreUtils 6.10.

4.5.3 Experiment Results

Sensitivity to co-located VMs: We study each application's sensitivity to co-located VMs by measuring the changes in the runtime of each application. Fig. 4.13 uses box plots of normalized runtimes to show the cloud applications' sensitivity to co-located VMs. The large value range in Fig. 4.13 indicates certain applications are very sensitive to co-located VMs, but it also indicates that there are opportunities for improved performance should the applications be arranged in a smart way. For the different applications in Fig. 4.13, data serving (YCSB1 and YCSB2) is very sensitive to co-located VMs because of its need for bandwidth. Among the two data serving workloads, YCSB1, which is update heavy, is more sensitive than YCSB2 because write operations are usually sacrificed for achieving high read performance. On the other hand, I/O operations from other VMs, when running concurrently with YCSB1, are also delayed because when write operations need to be flushed, other operations have to wait for the relatively slow writes. Besides data serving, video streaming (Darwin) is also very sensitive to co-located VMs because video streaming needs constant usage of bandwidth and requires low latency. When co-located VMs are competing for resources, the quality of the video streaming service is obviously affected. For the different kinds of machines, the average values of normalized runtime are 1.3 and 1.15 on clusters A and B respectively. Cluster B has smaller interference than cluster A. Since cluster A and cluster B both use the same storage, cluster B's bigger memory, caches, and more powerful CPUs may help to reduce the interference among VMs.

Figure 4.13: Box plots of normalized runtime of each cloud application when running with other co-located VMs. Runtimes are normalized to the unaffected runtimes.


Prediction accuracy: Since we have shown that NLM can outperform LM and WMM, we build NLM models to predict the runtime of a VM when running with other co-located VMs on a cluster node. Fig. 4.14 shows the prediction error on normalized runtime when the application is running against others. Recall that the prediction error is defined as | predicted value − actual value | / actual value. As shown in Fig. 4.14, the prediction errors are mostly below 0.2 on both cluster A and cluster B. The high sensitivity of the data serving and video streaming services makes the prediction errors on them higher. On average we achieve prediction errors of 0.17 on cluster A and 0.15 on cluster B.

Figure 4.14: Interference prediction errors for each cloud application on cluster A and cluster B. The column heights represent the average prediction errors, and the error bars represent the standard deviations

Figure 4.15: Normalized throughput of different scheduling methods (MIBS, k-means++, and Doubling) on cluster A and cluster B. The column heights represent the throughputs normalized to FIFO's throughput

Figure 4.16: Normalized throughput of scheduling methods at different task arrival rates (2, 4, 8, and Full) on cluster A and cluster B. The column heights represent the throughputs normalized to FIFO's throughput

Application throughput improvement: We compare normalized application throughputs to show the effectiveness of each scheduling method. The (application) throughput is defined as the number of tasks completed in a given time period. The baseline is the application throughput under the FIFO scheduler. In Fig. 4.15, we measure the throughput under each scheduling method for two hours and normalize it to the baseline. Each incoming task is uniformly randomly selected from the cloud applications listed in Table 4.3. In order to generate a more realistic workload, we randomly choose the datasets, data sizes, and number of processes. As shown in Fig. 4.15, k-means++ has the highest throughput improvement of about 1.15 on both clusters. The test in Fig. 4.15 assumes that tasks are always available in the queue.


Figure 4.17: Normalized throughput of scheduling methods at different I/O intensities. The column heights represent the throughputs normalized to FIFO's throughput

In our experiments, the unaffected process time of a task ranges from about three to ten minutes depending on its configuration. To understand the effectiveness of the scheduling methods at different task arrival rates, we set the task arrival rate following a Poisson process with an average rate of two, four, or eight tasks per minute in Fig. 4.16. The results from Fig. 4.15 are also included and labeled as Full for comparison. As the task arrival rate decreases from full to two (left to right), we gradually create more available resources and reduce the stress on the servers. When the servers are heavily loaded at all times, the room for improvement is limited. When the stress is gradually decreased, the schedulers have more chances to improve throughput. Since clusters A and B have different amounts of resources, the stress from the same task arrival rate is different on cluster A and B. In this experiment, the throughput improvements by all schedulers peak at four and eight tasks per minute on cluster A and B respectively. When the task arrival rate is as low as two tasks per minute, the normalized throughputs of all schedulers are again small, because the interference effect is small given abundant available resources. Workloads used in Fig. 4.15 and Fig. 4.16 are uniformly randomly selected. To study how the schedulers work at different I/O intensities, we create three types of workloads, namely light, medium, and heavy I/O, which represent mixtures of workloads with low, medium, and high I/O requirements respectively. Here we sort the cloud applications based on their I/O requirements. From the lightest to the heaviest I/O requirements, the order is: Cloud9, Wikibooks, Cloudstone, YCSB1,

YCSB2, and Darwin. These applications are then labeled from one to six in order. The light, medium, and heavy I/O workloads are generated by following Gaussian distributions of unit variance with means of 2, 3.5, and 5, respectively. We take the closest integer from the generated number. For example, if the generated number is 4.3, the selected application will be YCSB1. Fig. 4.17 presents the normalized throughputs at different I/O intensities. The highest normalized throughputs are achieved at medium I/O intensity, where k-means++ has the highest value of 1.25. The second highest throughput improvement is achieved at heavy I/O intensity because the range of interference degrees among heavy I/O workloads is smaller than that among medium I/O workloads. When the I/O intensity is light, the normalized throughputs of all schedulers are small because the interference effect is also small in this case. In general, k-means++ has the best performance across all cases.
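The sketch below is a hypothetical reconstruction of this generator (the clamping to the valid label range is our assumption): it samples a unit-variance Gaussian around the stated mean and rounds to the nearest application label.

```python
import numpy as np

# Applications ordered from the lightest to the heaviest I/O requirements.
APPS = ["Cloud9", "Wikibooks", "Cloudstone", "YCSB1", "YCSB2", "Darwin"]
MEANS = {"light": 2.0, "medium": 3.5, "heavy": 5.0}

def sample_workload(intensity, n_tasks, seed=0):
    """Draw n_tasks applications for the given I/O intensity mix."""
    rng = np.random.default_rng(seed)
    labels = rng.normal(MEANS[intensity], 1.0, size=n_tasks)  # unit variance
    # Round to the nearest label and clamp into the valid range 1..6.
    labels = np.clip(np.rint(labels), 1, len(APPS)).astype(int)
    return [APPS[i - 1] for i in labels]

print(sample_workload("medium", 5))   # e.g., a draw of 4.3 maps to YCSB1
```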

Figure 4.18: Normalized number of I/O requests completed under different scheduling methods on cluster A and cluster B. The column heights represent the total I/Os normalized to FIFO's I/Os

I/O improvements: To study the improvement on completed I/O requests, we plot the normalized I/O numbers in Fig. 4.18. The numbers are obtained from the same test as in Fig. 4.15, where the applications are uniformly randomly configured. The I/O improvements are higher than 1.2, and k-means++ has the highest value of 1.27 on both clusters. Note that the IO improvement in the experiments is smaller than in the simulation in the previous section, mostly because the cloud applications we use here are more IO-intensive and utilize large data sets. We plan to further enhance our

94 modeling and scheduling techniques, and leave the evaluation in a large-scale cluster as part of future work.

Chapter 5

Matrix

While TRACON designs scheduling algorithms and builds interference models to make VM performance more predictable, virtualization technology has yet to achieve the vision of “an efficient, isolated duplicate of a real machine”. In other words, a VM shall be able to provide performance close to what a user would expect from a specific physical machine. To fill this missing piece, this chapter designs a novel performance and resource management system, Matrix, to ensure that the performance an application achieves on a VM closely matches what it would achieve on a target physical server. Matrix utilizes machine learning techniques, in particular SVM (Support Vector Machine), to classify workloads and build the RP models. Therefore, in the following sections, we will first give an introduction to SVM, then describe the design and evaluation of Matrix.

5.1 Support Vector Machine

SVM is one of the most powerful supervised learning techniques and has been widely used in pattern recognition and classification [22, 73, 151]. SVM has two main categories: classification (SVC) and regression (SVR). In this work, the term SVM will refer to both classification and regression methods, and SVC and SVR will

be used when we specifically discuss one of them. Originating from the perceptron [149], the original goal of SVM, as a binary SVC, is to find a hyperplane with the maximum margin that separates the data points. A common formulation of a basic SVC is a relaxed quadratic optimization problem:

\begin{aligned}
\min_{\theta,\,\theta_0,\,\xi} \quad & \frac{1}{2}\|\theta\|^2 + C \sum_{i=1}^{n} \xi_i \\
\text{subject to} \quad & y_i(\theta \cdot x_i + \theta_0) \ge 1 - \xi_i, \quad i = 1, \dots, n \\
& \xi_i \ge 0, \quad i = 1, \dots, n
\end{aligned}

where {(x_1, y_1), . . . , (x_n, y_n)} are training points, x_i ∈ R^d are feature vectors, and y_i ∈ R^1 are target outputs; θ_0 is the offset to the origin, θ is the separating vector, and the slack variable ξ_i makes it possible to find a more generalized separator that allows some of the margin constraints to be violated. The term

C Σ_{i=1}^{n} ξ_i represents the penalty for constraint violation, where C > 0 is the regularization parameter. That is, a bigger value of C permits fewer (if any) constraint violations. If C is too big, the resulting model may overfit the training data and fail to generalize.

Many problems do not work with a linear classifier even after the slack variable ξi is used to allow misclassified data points. To address this problem, the kernel functions are introduced to conduct non-linear feature mappings [5, 27]. By mapping data points from the original space to a higher dimensional feature space, and maximizing a linear margin in the feature space, SVCs can obtain non-linear margin curves in the original space. The SVC with a kernel function can be formulated by simply substituting the xi with the kernel function φ (xi). Common kernel types are

97 • Polynomial function

\phi(x) = (\gamma \cdot x \cdot x' + c)^d, \quad c \text{ is a constant} \qquad (5.1)

• Gaussian radial basis function (RBF)

\phi(x) = \exp\left(-\gamma \cdot \|x - x'\|^2\right) \qquad (5.2)

• Hyperbolic tangent or sigmoid function

\phi(x) = \tanh(\gamma \cdot x \cdot x' + c), \quad c \text{ is a constant} \qquad (5.3)

The parameter γ > 0 in these kernel functions controls the kernel width. A small γ implies a smooth fit because the influence of a single data point is small. Conversely, a large γ leads to a curlier fit due to the larger effect of a single data point. In addition to the SVC, Matrix also utilizes the SVR to construct the basic RP models of each training application. We will discuss SVR in detail in Sec. 5.2.3. For now, SVR, which differs from SVC by introducing an alternative loss function, also shows excellent performance in regression and time series prediction [116, 128, 172].
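For illustration, here is a minimal scikit-learn sketch (not the SVM tooling used in this dissertation) of fitting a soft-margin classifier with the RBF kernel on toy data, where C and γ play exactly the roles described above.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data: two noisy classes that are not linearly separable.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)   # ring-shaped boundary

# C penalizes margin violations; gamma is the RBF kernel width coefficient.
clf = SVC(kernel="rbf", C=1.0, gamma=0.5)
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
```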

5.2 Matrix Architecture

The goal of Matrix is to predict and configure VMs in an automatic manner so that the applications running within the VMs achieve performance in close vicinity of that on a specific physical machine. We present the architecture of Matrix in Fig. 5.1. On the left, Matrix builds both the clustering and RP models of representative workloads, and this task is done offline. There are three steps in this phase: 1) profiling

98 the training set of representative workloads (presented in Sec. 5.2.1); 2) tuning the SVM parameters to find the best configuration; and 3) training the classifier and the basic RP models, for later use of the online module (Sec. 5.2.2 and 5.2.3). This offline training stage builds RP models from our generic benchmarks, but it can be repeated periodically to include data from newly added workloads.

Figure 5.1: Matrix architecture: training the clustering model and basic RP models offline (left), and predicting RP and adapting resources online (right)

When a new application is to be moved to the cloud, Matrix requires only a workload signature (i.e., a time series of resource measurements) from running on its current infrastructure, which could be either physical or virtual machines. As shown on the right hand side of Fig. 5.1, Matrix can immediately classify these workload signatures against the previously trained models. Then, the system calculates a runtime RP model based on gene-adjusted performance estimates and outputs the predicted RP to the resource allocation module. Next, Matrix searches for the VM configuration with the minimum cost to maintain an RP close to one (Sec. 5.2.4). To provide automatic resource management, we formulate an optimization problem with nonlinear inequality constraints. For fast response time, Matrix utilizes Lagrange multipliers to provide an approximate solution and a bound on the minimum resource cost.

5.2.1 Workload Signatures

Matrix first must calculate a set of workload “genes” that indicate how different types of applications will perform when moved between the native and cloud platforms. A group of representative applications is first selected to construct an expert system. Our selection principle, similar to [25], is to have the reference workloads as diverse as possible: the resulting collection shall range from CPU-intensive to data-intensive, and the problem sizes shall also vary from small to large data volumes. Table 5.1 summarizes the proposed representative applications selected from a few widely used benchmark suites, e.g., FileBench [118], SysBench [98], SPEC2006 [51], PARSEC [20], and Cloud9 [31, 47]. Note that while this set of applications is by no means optimal, they provide, as we will see in the evaluations, a good basis for RP modeling. We leave the exploration of different gene applications as future work.

Table 5.1: Summary of representative applications

Name          Description
video server  serving a set of video files
web server    retrieving web contents and updating log files
file server   a mixture of various file I/O operations
OLTP          query and update database tables
mcf           running simplex algorithm
hmmer         pattern searching of gene database
soplex        linear program solver
canneal       evolutionary algorithm
DS01 to DS15  15 distributed data serving workloads
C01 to C15    15 parallel CPU-intensive workloads

For parallel applications, we select a training set that consists of 15 data-intensive workloads (DS01 to DS15) and 15 CPU-intensive workloads (C01 to C15). The first five DS series workloads run Apache Cassandra, a distributed key-value store, with read/write ratios of 100/0, 75/25, 50/50, 25/75, and 0/100, where the record popularity follows a uniform distribution. DS06 to DS10 access Cassandra with the same read/write ratios but with a Zipfian distribution of record popularity. The 11th to 15th training workloads share the same pattern and order of read/write ratios as the first two groups of five, but the record popularity follows the latest distribution. The last 15 representative applications in the training set are

CPU-intensive parallel workloads from Cloud9, a scalable parallel software testing service. The training set for CPU-intensive parallel workloads is randomly selected out of 98 different utility traces from GNU CoreUtils 6.10 for running Cloud9. For a basic signature, we take the arithmetic means of three system parameters: CPU utilization and the amounts of data read and written per second. Since it is insufficient to use the mean alone to represent a workload when there is large variability in the observed data, we also include the coefficient of variation (C.O.V.) in the signatures to describe the variability. As prior work [81, 25] has already shown that the resource allocation of VMs greatly affects the observed system parameters, we include the number of VCPUs and the size of memory in the workload signatures because these two parameters are frequently used knobs for tuning VM performance. Furthermore, we also take into account the interference from co-located VMs. For simplicity, the workload signatures from all other VMs are summed up as one background VM and included in the modeling process. Dealing with a parallel application running on a cluster of machines poses additional challenges. The traffic in and out of each node is critical to data-intensive applications' performance. Moreover, it is important to include the number of nodes as one knob for modeling workload concurrency. In other words, Matrix needs to scale resources horizontally (increasing and decreasing the number of nodes), as well as vertically (scaling up and down resources on each node). Therefore, Matrix includes the amount of data in and out of each node and the number of nodes in a cluster as two additional parameters for modeling.
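As an illustration of the signature layout (the function and field names are ours, not Matrix's code), the following sketch assembles the means and coefficients of variation of the monitored parameters together with the resource knobs and the aggregated background VM.

```python
import numpy as np

def cov(series):
    """Coefficient of variation: std / mean (0 if the mean is 0)."""
    m = np.mean(series)
    return np.std(series) / m if m else 0.0

def workload_signature(cpu_util, read_bps, write_bps, vcpus, mem_gb,
                       background=None):
    """Signature = mean and C.O.V. of CPU, read, and write rates,
    the VM's resource knobs, and the summed co-located (background) VM."""
    sig = []
    for series in (cpu_util, read_bps, write_bps):
        sig += [np.mean(series), cov(series)]
    sig += [vcpus, mem_gb]
    if background is not None:            # one aggregated background VM
        sig += list(background)
    return np.array(sig)
```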

5.2.2 Clustering Method

In Matrix, a workload classifier is needed to identify new workloads that are running in the guest VMs. Most previous works use a “hard” classifier. That is, the classifier outputs a certain workload without ambiguity. That method, however, provides little help when dealing with a new workload, which can be very different from any workload in the classifier's training set. To address this problem, we explore

a “soft” classifier in this work, a classifier with a soft boundary that outputs probability estimates of being each component in the model. These probability estimates can be utilized as weights to infer the “gene” composition of new workloads. Specifically, we utilize a multiclass SVC with likelihoods provided by a pairwise coupling method [39]. We use a rigorous procedure to tune and train the classifiers. Our classifiers are trained by the following steps: Data Scaling avoids having the attributes in larger numeric ranges dominate those in smaller numeric ranges. In addition, scaling data into a restricted range can avoid numerical difficulties during the kernel value calculation [39]. We scale each attribute into the range [0, 1]. Parameter Selection: Choosing the optimal parameter values is a critical step in the SVC design. The grid search method is a common practice for finding the best configuration of an SVC. That is, the parameter selection is usually done by varying parameters and comparing either estimates of the generalization error or some other related performance measure [59]. At each grid point, the search calculates the value of ten-fold cross validation (CV). In order to save searching time, the search first starts with a coarse grid to identify regions with good CV values. Then, the search uses a finer grid to further approach the best configuration. We conduct the grid search on the following parameters:

• SVC types: C-SVC [27] and ν-SVC [154, 153]. The former has a constraint violation cost C. The latter introduces a new parameter ν ∈ (0, 1] to provide an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors.

• Kernel functions: Polynomial, sigmoid, and RBF.

• C: Constraint violation cost. A good C is needed to avoid overfitting or underfitting. C ∈ R+.

• γ: Kernel width coefficient. γ affects the model smoothness. γ ∈ R+.

• ν: Only available in ν-SVC. ν provides an upper bound on the training errors, ν ∈ (0, 1].

Training: Once the best parameter configuration is decided, the final classifier is obtained by using the best configuration to train on the whole training set. In terms of SVC types and kernel functions, the grid search results suggest that ν-SVC with the RBF kernel outperforms the other classifiers and kernel functions. Therefore, Matrix uses ν-SVC with the RBF kernel as the classifier in the final training and online clustering. ν-SVC has been proved to provide an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors.
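A minimal scikit-learn sketch of this coarse-grid search (the dissertation uses its own SVM tooling; the data, parameter ranges, and ten-fold CV setup below are placeholders) for tuning ν and γ of a ν-SVC with the RBF kernel.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import NuSVC

# Placeholder workload signatures (rows) and gene labels.
rng = np.random.default_rng(0)
X = rng.random((300, 6))
y = rng.integers(0, 3, size=300)

# Scale every attribute into [0, 1], as in the Data Scaling step.
X = MinMaxScaler().fit_transform(X)

# Coarse grid over nu and the RBF kernel width gamma, scored by 10-fold CV.
coarse = {"nu": [0.1, 0.3, 0.5], "gamma": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(NuSVC(kernel="rbf", probability=True), coarse, cv=10)
search.fit(X, y)
print(search.best_params_)
# A finer grid around search.best_params_ would follow in a second pass.
```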

5.2.3 Performance Modeling

Matrix's performance modeling has two main procedures: Constructing the building block: Matrix utilizes the SVR to construct the basic RP models of each training application. A popular version of SVR is ν-SVR [154, 153]. Like ν-SVC, ν-SVR uses a parameter ν ∈ (0, 1] to control the number of support vectors. The standard form of ν-SVR is

\begin{aligned}
\min_{\theta,\,\theta_0,\,\xi,\,\xi^*} \quad & \frac{1}{2}\|\theta\|^2 + C\left(\nu\epsilon + \frac{1}{n}\sum_{i=1}^{n}(\xi_i + \xi_i^*)\right) \\
\text{subject to} \quad & \theta \cdot \phi(x_i) + \theta_0 - y_i \le \epsilon + \xi_i, \quad \text{(loss func.)} \\
& y_i - \theta \cdot \phi(x_i) - \theta_0 \le \epsilon + \xi_i^*, \quad \text{(loss func.)} \\
& C > 0, \; \epsilon > 0, \\
& \xi_i, \xi_i^* \ge 0, \quad i = 1, \dots, n
\end{aligned}

where {(x_1, y_1), . . . , (x_n, y_n)} are training points, x_i ∈ R^d are feature vectors, and y_i ∈ R^1 are target outputs. The parameter C determines the trade-off between model smoothness and the degree to which deviations larger than ε are tolerated. Different from the SVC, the first two constraints represent the so-called ε-insensitive loss function, where ε is a threshold of deviation tolerance [162]. The ν in ν-SVR, as in ν-SVC, serves as an upper bound on the fraction of

training errors and a lower bound on the fraction of support vectors. We compare the models built by ν-SVR and another common SVR, the ε-SVR [184, 57], in the grid search process. Because the SVR is essentially derived from the SVC, data scaling and parameter selection are important steps in the SVR training as well. Therefore, we conduct a procedure similar to the one in Sec. 5.2.2 to configure and train the basic RP models. The main differences are that SVR has a different problem formulation and one additional threshold parameter ε.
• SVR types: Two common SVR types, the ε-SVR and the ν-SVR, are included in the grid search process.

• ε: A threshold of deviation tolerance, ε ∈ R+.

Matrix uses ν-SVR with the RBF kernel for the basic RP modeling because the grid search results suggest it is better than the others.

Generating the performance model: The performance modeling of representative workloads completes one part of the story. Our goal is to be able to capture new workloads' RP models in an online fashion.

Suppose there are n representative workloads w_i, i ∈ {1, . . . , n}. The corresponding performance models are f_i(R), where r_j ∈ {x ∈ R | 0 ≤ x ≤ 1} and R = {r_1, . . . , r_m} is a resource configuration, j ∈ {1, . . . , m}. Because all performance models are built by SVR, they can be represented as f_i(R) = θ · φ(r_i) + θ_0, where φ(r_i) are kernel functions.

Our classifier then analyzes a new workload wnew and generates an output {p1, . . . , pn}, where pi are the probability estimates of being workload wi, i ∈ {1, . . . , n}. The final performance model of workload wnew is formulated as

f_{\mathrm{new}}(R) = \sum_{i=1}^{n} p_i \cdot f_i(R), \qquad \text{where } \sum_{i=1}^{n} p_i = 1. \qquad (5.4)

In other words, the likelihood pi acts as a weight to control the fraction of fi in the

final model fnew.
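A small sketch (assuming hypothetical per-gene model objects with a predict method, such as scikit-learn regressors) of how the classifier's probability estimates weight the basic RP models into the runtime model of Eq. 5.4.

```python
import numpy as np

def compose_rp_model(prob_estimates, basic_models):
    """Return f_new(R) = sum_i p_i * f_i(R) from Eq. 5.4.

    prob_estimates: array of p_i from the soft classifier (sums to 1).
    basic_models:   list of trained per-gene regressors, each exposing
                    a predict(R) method (e.g., NuSVR), one per p_i.
    """
    p = np.asarray(prob_estimates, dtype=float)
    p = p / p.sum()                      # enforce sum_i p_i = 1

    def f_new(R):
        R = np.atleast_2d(R)             # one resource configuration per row
        preds = np.stack([m.predict(R) for m in basic_models])
        return p @ preds                 # weighted sum over the gene models
    return f_new
```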

5.2.4 Automatic Resource Configuration

Once we obtain a performance model of a new workload wnew, the resource configuration module starts to find the minimum resource allocation that keeps the VMs as real as possible.

Let Cj be the cost of resource j, j ∈ {1, . . . , m}. Resources are, e.g., the memory

size and the number of VCPUs and VMs. rj is the ratio of resource j on a physical server that is allocated to the VM. We can formulate the resource configuration problem as an optimization one with a nonlinear equality constraint:

\begin{aligned}
\min_{R} \quad & F_c(R) = \sum_{j=1}^{m} C_j \times r_j \\
\text{subject to} \quad & f_{\mathrm{new}}(R) = \sum_{i=1}^{n} p_i \cdot f_i(R) = 1, \\
& \sum_{i=1}^{n} p_i = 1, \\
& r_j \in \{x \in \mathbb{R} \mid 0 \le x \le 1\}, \quad i \in \{1, \dots, n\}, \; j \in \{1, \dots, m\}
\end{aligned}

Because both the objective and the constraint function are continuously differentiable, we utilize the Lagrange multiplier method for solving this problem.
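For illustration only, the same constrained problem can be sketched with an off-the-shelf solver; the cost vector, the placeholder composed model, and the SLSQP method below are our assumptions, whereas Matrix itself uses a Lagrange-multiplier-based approximation.

```python
import numpy as np
from scipy.optimize import NonlinearConstraint, minimize

costs = np.array([1.0, 0.5, 2.0])        # hypothetical per-resource costs C_j

def f_new(r):
    """Placeholder composed RP model; a real run would use Eq. 5.4."""
    return 0.4 * r[0] + 0.8 * r[1] + 0.6 * r[2] + 0.2

objective = lambda r: costs @ r           # F_c(R) = sum_j C_j * r_j
rp_equals_one = NonlinearConstraint(f_new, 1.0, 1.0)   # f_new(R) = 1
bounds = [(0.0, 1.0)] * len(costs)        # each r_j is a fraction of the host

res = minimize(objective, x0=np.full(3, 0.5), bounds=bounds,
               constraints=[rp_equals_one], method="SLSQP")
print(res.x)                              # minimum-cost configuration R*
```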

Note that the above problem is formulated under the assumption that rj is the ratio of resource j on a physical server that is allocated to the VM. However, real systems usually cannot partition resources at an arbitrary granularity. For example, the memory allocation for VMs is usually done in units of one megabyte. If a system has 2 GB memory, the finest possible rj values will be {1/2000, 2/2000, ..., 2000/2000}. As a result, the system won't be able to use the optimal resource configuration R∗. Instead,

the system needs to take (⌈r1∗⌉, . . . , ⌈rm∗⌉) as the resource configuration, where the ceiling operation on ri∗ here is defined as taking the smallest value ri′ in the finest possible granularity such that ri′ ≥ ri∗. Let the granularity of resource i be di, i ∈ {1, . . . , m}. In other words, the extra allocation on resource i is at most di. Therefore, the upper bound on the extra resource allocation cost is Σ_{i=1}^{m} C_i × d_i.

5.3 Implementation

We have implemented and tested Matrix on both a local private cloud and two public clouds, namely Amazon EC2 and Rackspace cloud servers. Fig. 5.2 summarizes the workflow of the prototype. The data preparation block includes parsing, formatting, and scaling the collected traces. The clustering model and RP models are built offline in advance from the training set. The Matrix online module is controlled by a Linux bash shell script combined with an SVM module written in C and an optimization problem solver in MATLAB. The tasks of this online module are to 1) collect traces, 2) analyze workload compositions, and 3) predict the current RP and suggest a configuration that obtains the desired performance with less cost.

Figure 5.2: Matrix prototype: the online loop prepares collected traces into workload signatures, uses the clustering model (SVC-predict) to obtain the workload composition, combines it with the basic RP models into the constraint function and RP, and outputs a minimum-resource recommendation used to adjust VM resources before sleeping until the next interval

The online module is running as a background process in the host domain, which

collects workload signatures of the VMs every second by using xentop. Every minute, a parser parses the collected data and scales all values into the range [0, 1]. Note that the value range used in the scaling process should be consistent across all steps. Otherwise, the prediction is certainly incorrect because the trained models are not working on the same scale as the input data. The online module then feeds the scaled trace and the clustering model to the SVC-predict module, which outputs the workload composition as probabilities of the representative workloads. These probability estimates along with the basic RP models become the running workload's performance model (Eq. 5.4). Then, Eq. 5.4 serves as the constraint function of the optimization problem in Sec. 5.2.4. Finally, the online module uses the xm tools to adjust resource allocations and repeats the same procedure for the next interval. There are three main differences between the two Matrix prototypes on private and public clouds. First, Matrix in public clouds cannot use xentop to collect traces because we have no access to the host domain. Instead, we run top and iostat in every guest domain to collect traces. Second, Matrix cannot arbitrarily adjust the resources of an instance in the public cloud. Third, instance types can only be changed when an instance is not running. To address this problem, we adapt Xen-blanket [197] nested virtualization for some tests. Prototype performance. The measured running time from parsing the collected trace to outputting the minimum resource recommendation is around 0.6 second, where the optimization solver takes about 70% of the whole process. Fig. 5.3 shows the percentage of time that each component contributes to the whole process. As future work, the running time of the online module can be further reduced by implementing the solver natively without using MATLAB. In addition, Matrix may also be integrated with Monalytics [105] to reduce overheads.
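Because the scaling range must stay consistent between training and prediction, here is a minimal sketch (the class name and placeholder data are ours; the real parser is a shell and C pipeline) of reusing training-time minima and maxima when scaling online traces into [0, 1].

```python
import numpy as np

class FixedRangeScaler:
    """Scale traces into [0, 1] using ranges fixed at training time."""

    def __init__(self, train_matrix):
        self.lo = train_matrix.min(axis=0)
        self.hi = train_matrix.max(axis=0)

    def transform(self, trace):
        # Clip so that online values outside the training range
        # cannot push features outside [0, 1].
        span = np.where(self.hi > self.lo, self.hi - self.lo, 1.0)
        return np.clip((trace - self.lo) / span, 0.0, 1.0)

# Fit once on the offline training signatures, reuse online every minute.
scaler = FixedRangeScaler(np.random.default_rng(0).random((500, 5)))
online_minute = np.random.default_rng(1).random((60, 5))
scaled = scaler.transform(online_minute)
```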

Figure 5.3: Percentage of time that each component contributes to the whole overhead: (A) data preparation, (B) workload identification, (C) model generation, (D) resource allocation

5.4 Evaluations

Testing scenarios. We evaluate Matrix in three scenarios: a single machine, a cluster of physical machines, and a cluster of virtual machines. Three virtualized environments are used in our experiments: local Xen virtualized servers, Amazon EC2 instances, and Rackspace cloud servers. We label the Rackspace cloud servers from the smallest to the largest as RS1 to RS7. For example, RS1 has 1 VCPU and 512 MB memory and RS7 has 8 VCPUs and 30 GB memory. All tests on public clouds are conducted for at least 30 runs, with multiple batches that run at different times of day and on various days (including weekends). In Sec. 5.4.1, we start the experiments with the single machine case: given a target physical machine, Matrix aims to accommodate the testing applications in a VM such that the workloads perform closely to the physical one. We mainly use PM1 as the target physical machine as described in Ch. 1. We have two local servers: VS1 and VS2 each have two six-core Intel Xeon CPUs, at 2.67 GHz and 2 GHz, with 24 and 32 GB memory, respectively. Both machines run Linux 2.6.32, Xen 4.0, and NFS over a Gigabit Ethernet. In Sec. 5.4.2, we use a four-node physical cluster (PC) as the performance target, each node of which has a 1.80 GHz Intel Atom CPU D525 (two physical cores with hyper-threading) and four GB memory connected on a Gigabit Ethernet. In the local private cloud, we use VS2 to host a virtualized cluster (VC). Similar to the single machine case in Sec. 5.4.1, the public VCs are hosted on Amazon EC2 and Rackspace cloud servers. In Sec. 5.4.3, we use a virtual cluster of 32 and 64 VMs in a local cloud as the target, and study how to configure virtual clusters in Amazon EC2 and Rackspace cloud servers to achieve similar performance. Each VM has one VCPU and 1.5 GB memory. This way, we examine the feasibility of migrating a virtual cluster from a private to a public cloud while providing the desired performance with minimized cost.

A full list of Amazon EC2 instance types and prices can be found at http://www.ec2instances.info/. A full list of Rackspace cloud servers can be found at http://www.rackspace.com/cloud/servers/.

Cloud applications that are used in this work consist of Cloudstone, a performance measurement framework for Web 2.0 [163]; Wikipedia, with database dumps from the Wikimedia foundation [196] and real request traces from the Wikibench web site [181]; Darwin, an open source version of Apple's QuickTime video streaming server; Cloud9, which makes use of cloud resources to provide a high-quality on-demand software testing service; and YCSB (Yahoo! Cloud Serving Benchmark), a performance measurement framework for cloud serving systems [50]. For YCSB, the experiments use two core workloads, YCSB1 and YCSB2, which both send requests following a Zipfian distribution. The major difference between YCSB1 and YCSB2 is the read:write ratio: YCSB1 is an update heavy workload with a read:write ratio of 50:50, and YCSB2 reproduces a read mostly workload with a read:write ratio of 95:5. Note that after Sec. 5.4.2, YCSB1 and YCSB2 are served from multiple nodes. In addition, YCSB3, YCSB4, and YCSB5 will be added into the testing set as well. YCSB3 is a 100% read workload. 95% of YCSB4's requests are read operations and mostly work on the latest records. 95% of YCSB5's requests are also read operations, but each scans within 100 records. Evaluation metrics. We use three metrics to evaluate the performance of Matrix. To measure the accuracy of the models, we define the prediction accuracy as

1 − (|predicted value − actual value|/actual value).

Clearly, a prediction accuracy closer to 1 indicates a good model. The goal of Matrix is to make a VM perform as close to the target platform as possible with minimum cost. To this end, we define two additional metrics: the RP-Cost product (RPC) as |RP − 1| · (VM Cost), and the Performance Per Cost (PPC) as RP / VM Cost. In this test, we measure the cost of purchasing instances on public clouds in dollars. For RPC, a smaller value is preferred as it indicates a small performance difference and cost, and for PPC, a larger value is better because it indicates better performance for the same cost. As we will show later, Matrix can achieve the best cost efficiency, with the best RPC and PPC values, compared to static allocations.
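A small helper sketch (function names are ours, not the dissertation's code) that computes the three evaluation metrics exactly as defined above.

```python
def prediction_accuracy(predicted, actual):
    return 1.0 - abs(predicted - actual) / actual

def rpc(rp, vm_cost_dollars):
    """RP-Cost product: smaller is better (RP close to 1 at low cost)."""
    return abs(rp - 1.0) * vm_cost_dollars

def ppc(rp, vm_cost_dollars):
    """Performance Per Cost: larger is better."""
    return rp / vm_cost_dollars

print(prediction_accuracy(0.95, 1.0), rpc(1.08, 0.12), ppc(1.08, 0.12))
```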

5.4.1 Physical to Virtual Machines

Model Composition. Before we look into the prediction accuracy, we present how Matrix analyzes applications and composes performance models. Fig. 5.4 demonstrates the snapshots taken by Matrix while applications are running.

Figure 5.4: Application composition examples over the representative applications (video server, web server, file server, OLTP, mcf, hmmer, soplex, canneal)

Let's take Darwin as an example. Darwin is about 67% like the video server, 22% like mcf, 10% like soplex, and the possibilities of being the others are very small. Although Darwin is a video streaming server, it is not 100% like the video server from FileBench among the representatives. The reason is that the video server only emulates I/O operations and omits many CPU tasks of a video streaming server, which can be captured by Matrix by including mcf and soplex as part of Darwin's workload signature. Therefore, Darwin's estimated performance by the composition in Fig. 5.4

will be 0.67 · fvideo server + 0.22 · fmcf + 0.1 · fsoplex + ... (Recall Eq. 5.4). Similarly, the sample composition of YCSB1 has a large portion of file server, OLTP, and hmmer. Note that these are just sample snapshots, and the composition ratio depends on the workload intensity and datasets, and may change over time.

Figure 5.5: Accuracies on predicting performance. The labels aCbM-VSc on the leftmost four columns mean these tests are done on a VM with a VCPUs and b GB memory hosted by our local machine VSc. The rightmost seven labels, RS1 to RS7, represent Rackspace instances from the smallest to the biggest one. The other labels represent the Amazon instance types used

Model Accuracy. We examine Matrix's accuracy on predicting new workloads' RP across different settings on our local VMs, the Amazon EC2 instances, and the Rackspace cloud servers. To train the RP models on the local VMs, we run the training set on PM1 and the VMs to obtain the RPs and training data. We collect 1,000 data points for each training workload's performance model. Each data point is generated by running the workload with a uniformly randomly configured thread (worker) count (2 to 32), working set size (5 to 32 GB), and resource allocation (1 to 8 VCPUs and 1 to 8 GB memory). Because hardware heterogeneity potentially affects the performance of cloud applications [137, 61], Matrix also trains models for working on Amazon and Rackspace, instead of simply using those trained on the local VMs. The training process on the public clouds is almost identical to the one on the local VMs, except for the part of dynamically configuring resources. Because we cannot arbitrarily adjust resources on the public clouds, the training data are collected by running the workloads on each instance type 100 times. Note that Matrix needs only a one-time training process for modeling the gene workloads in the VMs. For the tests on local VMs, we run each configuration five times, five minutes per run. In Fig. 5.5, each column shows the average prediction accuracy and standard deviation of 30 runs (five runs for each of six testing applications). The same testing process is repeated on Amazon and Rackspace at three different times and days. More specifically, the tests are repeated at the daytime and the night-time of a weekday, and on the weekend. Thus, the public cloud results in Fig. 5.5 are averages of 90 runs. Most of the prediction accuracies are higher than 85% and the average value across all cases is 90.15%. The local VM tests on VS2 have a slightly higher accuracy (91.1%) than those on VS1 do (90%). On Amazon EC2, t1.micro has the lowest prediction accuracy due to large variance in its performance. In general, larger instance types are more stable and usually lead to higher accuracies. The experiments on Rackspace also show that larger instances tend to have higher accuracy. Given the same instance type, HVM instances have lower accuracy than paravirtualized VMs, partly due to virtualization overheads. The average prediction accuracies across all Amazon and Rackspace instance types are 89.8% and 90.3% respectively.

All results pass the two-sample t-tests and are stable across all test environments. Note that we also conduct the same tests on the training set. The results show that the training applications can be identified correctly over 95% of the time, and their performance estimations have an accuracy higher than 94% across all training applications. Automatic Resource Configuration. Suppose a user currently has PM1 and wants a VM to run YCSB as PM1 does with the minimum resources allocated. Here, the user can reduce cost as the YCSB VM needs fewer resources and maintains the desired throughput. On the other hand, service providers have more free resources to host additional VMs. In this test, we run YCSB2 for one hour and change the workload intensity every ten minutes. In the first ten minutes, two threads work on two million records; the workload intensity is increased to four threads and eight million records in the second period, and to eight threads and 16 million records in the third period; then, the workload intensity is decreased to four threads and 16 million records in the fourth period, two threads and 16 million records in the fifth period, and two threads and two million records in the last ten minutes. Fig. 5.6(a) shows the corresponding resources and RPs as the workload intensity changes. Over the hour, the average resource savings are 37% on CPU and 55% on memory, when compared to a baseline VM which keeps using two VCPUs and four GB memory (PM1's setting). The average performance is 1.06 (closer to the target value) compared to 1.56 provided by the baseline VM. In Amazon EC2, we can only change the type of an instance when it is not running. As a workaround, we use Xen-blanket (nested virtualization) in an Amazon EC2 HVM instance (m3.2xlarge). The physical machine targeted here is PM2. In the one hour test, the average resource saving is about 5% on memory, compared to a baseline VM which keeps using one VCPU and two GB memory. There are no resource saving numbers for CPU because the minimum VCPU number is one in this test. As shown in Fig. 5.6(b), the average RP is about 0.95 compared to 0.83 for the baseline VM. In other words, with the ability to dynamically adjust resources to accommodate demands, Matrix can push the average RP close to one with as few resources as possible.

Figure 5.6: RP changes as resources and workload intensity change on (a) the local cloud and (b) Amazon EC2. Intensities are changed every ten minutes

Recommended EC2 instances. Fig. 5.7 presents the percentage of EC2 instance types recommended for running a certain workload as on PM1. The light, medium, and heavy workloads are defined as 4, 16, and 32 threads (or workers) with 8, 16, and 32 GB working sets respectively. Again, each bar in Fig. 5.7 represents results from 30 runs, where 10 runs each are conducted at the weekday daytime, the weekday night-time, and the weekend. For the light intensity, m1.medium is recommended most of the time except for YCSB1 and YCSB2, because these two have relatively lower CPU demand than the others and m1.small can deliver I/O throughput comparable to m1.medium's in this case. For the medium and heavy intensities, m1.medium is the choice most of the time. Note that the recommendation for each workload is not always the same, partly due to background workload interference and the nature of the probability estimates in our models. To verify the recommendations, Fig. 5.8 shows the RPs obtained from running the same applications on the recommended instance types. The average RP is 1.08 with a standard deviation of 0.26 for all the cases. Choosing instances among cloud providers. In this experiment, we conduct

Figure 5.7: Percentage of instances that are recommended for an application. The y-axis lists the testing applications and intensities (light, medium, heavy). Each bar represents the percentages of the instance types (m1.small, m1.medium, m1.large, c1.xlarge) that are recommended for the corresponding application type and intensity

Figure 5.8: Average RPs from running the same application on the recommended instance types at light, medium, and heavy intensities

the same tests on Rackspace cloud servers as on Amazon EC2. Then, we list the most recommended instance types in Table 5.2, such that certain workloads would run as on PM1 with less cost. If the recommended instances on both sides have the same price, e.g., RS2 vs. m1.small, the one that provides a higher RP is selected. For the light workload intensity, RS3 is the most recommended type to use, which has the same price as m1.medium at $0.12 per hour. RS3 is chosen because it provides a higher RP at the same price. The performance of the YCSB workloads is sensitive to the heap size because it affects the amount of cached contents and the

frequency of flushing the cached requests in Cassandra. This effect would be more obvious if there are more write operations. Therefore, the recommendation for light YCSB1 is m1.small against RS2 because its memory space is larger.

Table 5.2: Most recommended instance types for running certain workloads as on PM1

Applications  Light      Medium      Heavy
Cloudstone    RS3        RS3         m1.large
Wiki          RS3        m1.medium   m1.large
YCSB1         m1.small   m1.medium   m1.medium
YCSB2         RS2        m1.medium   m1.medium
Darwin        RS3        RS3         m1.medium
Cloud9        RS3        RS3         RS3

For the medium workload intensity, the recommended Rackspace instances for YCSB1 and YCSB2 are both RS4, whereas the recommended Amazon instances are m1.medium. Although RS4 provides higher performance than m1.medium for these workloads, RS4 is more expensive and its RPs here are further above one than m1.medium's. Therefore, the recommended instances for medium YCSB1 and YCSB2 are both m1.medium. For the rest of the applications with medium workload intensity, we mostly select the one with the higher RP between RS3 and m1.medium. For the heavy workload intensity, Cloudstone and Wiki choose m1.large against RS4 because of the higher performance at the same price. The situation for the heavy YCSB is the same as its medium case. Darwin chooses m1.medium because Darwin does not need more CPU cores but more memory would be helpful. On the other hand, the heavy Cloud9 desires more CPU cores than memory. Thus, the heavy Cloud9 chooses RS3 over m1.medium. Choosing the right instance types to minimize cost and optimize performance for a certain workload requires sophisticated analysis of application and platform characteristics. Such processes could be very time consuming without the help of Matrix.

115 5.4.2 Physical to Virtual Clusters

Many cloud applications are designed to work on multiple computers and communicate via a network. In this section, we first start the tests on a local VC. For profiling the system under different resource configurations, the number of VMs in a VC is one, two, four, or eight; the number of VCPUs per VM is varied from one to four; and the size of memory per VM is also varied from one to four GB. In other words, we have 64 VC settings in terms of the VM numbers, VCPUs, and memory sizes. We assume all VMs in a cluster are identical and leave heterogeneous or asymmetric clusters as future work. We collect the required profiling statistics from five runs of each representative application on all 64 VC settings. In order to capture the dynamics of various workload intensities, the training applications are uniformly randomly configured with thread/worker numbers from 2 to 128 and working set sizes from 20 to 100 GB in each run, for a total of 9,600 data points. For our tests on the public cloud, the instance types included as the Amazon VC instances for training and testing are t1.micro, m1.small, m1.medium, m1.large, m1.xlarge, and m2.xlarge. Similar to the local VC test, the number of VMs in an Amazon VC is one, two, four, or eight. Thus, we have 24 VC settings on EC2. The workload intensity is changed for profiling in the same way as in profiling the local VCs. We also profiled virtual clusters on Rackspace. The instance types used are RS1 to RS5. The remaining processes and settings on Rackspace are similar to those on Amazon. Prediction Accuracy. We first explore the accuracies of predicting RPs on clusters with different VM types and various numbers of VMs. All VMs in a cluster have an identical type. Fig. 5.9 presents the average prediction accuracies on the selected VCs. There are four columns for different VC types at a given VM number. Each column shows the average prediction accuracy of 45 tests. The accuracies from the tests on all local VCs vary between 86% and 92% with a mean of 90.01%. There is no big difference in accuracy observed between various VM types or between the numbers of VMs. For the tests on public clouds, Matrix is more accurate on

116 predicting RPs at larger size clusters and also at more powerful instances. The mean accuracy across all cases is 90.18% with a standard deviation of 2.55, where the mean accuracies on Amazon and Rackspace are 90.05% and 90.3% respectively.
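For reference, the profiling sweep described above can be enumerated as in the sketch below. It only reproduces the counts stated in the text (64 local settings, 24 EC2 settings) and the random workload parameters, not the actual profiling harness.

```python
# Enumeration of the profiling sweep: 4 VM counts x 4 VCPU counts x 4 memory
# sizes = 64 local settings; 6 instance types x 4 VM counts = 24 EC2 settings.
import itertools
import random

vm_counts = [1, 2, 4, 8]
local_settings = list(itertools.product(vm_counts, [1, 2, 3, 4], [1, 2, 3, 4]))
assert len(local_settings) == 64  # (num_vms, vcpus_per_vm, mem_gb_per_vm)

ec2_types = ["t1.micro", "m1.small", "m1.medium",
             "m1.large", "m1.xlarge", "m2.xlarge"]
ec2_settings = list(itertools.product(ec2_types, vm_counts))
assert len(ec2_settings) == 24

def random_workload():
    # Thread/worker count drawn from [2, 128], working set size from [20, 100] GB.
    return {"threads": random.randint(2, 128),
            "working_set_gb": random.randint(20, 100)}
```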

[Bar chart: prediction accuracy (%) vs. number of VMs (2, 4, 8) for the 1C1M, 4C4M, m1.small, m1.large, RS2, and RS4 clusters.]

Figure 5.9: Average accuracies and standard deviations on predicting RPs at different VM types and various numbers of VMs

From the accuracy tests, we found that Matrix has relatively good prediction accuracy on some applications, e.g., YCSB3 and YCSB4. Take YCSB3, a read-only testing workload with a Zipfian distribution, as an example. Matrix identifies it as a 100% read workload with a probability above 95%. Among the three possible distributions for pure read requests, Matrix recognizes that this workload follows the Zipfian distribution with a probability above 75%. This contributes to a relatively high accuracy for YCSB3. To further analyze the influence of the representative applications in the training set, we remove five workloads at a time from the training set and rebuild all models from the reduced set. We then examine the accuracy of predicting RPs for the applications in the testing set on a four-VM cluster whose VMs each have four VCPUs and four GB of memory. This test is repeated three times and the average accuracies are reported in Fig. 5.10. We remove the CPU-intensive training applications first; YCSB5 shows larger degradation than the others in the beginning because it consumes more CPU by scanning records when processing requests. When we start to remove data-intensive training workloads (training set size below 15), all three testing applications drop dramatically. When we reduce the training size from ten to five, YCSB1 and YCSB5 both drop by more than 20% because key genes (the 50/50 and 100/0 workloads in the Zipfian distribution for YCSB1 and YCSB5 respectively) are removed. YCSB4 holds up better than the others at the training size of five because the 100/0 workload in the latest distribution, which represents most of YCSB4, is still kept in the final five.

[Line chart: prediction accuracy (%) for YCSB1–YCSB5 as the training set size shrinks from 30 to 5.]

Figure 5.10: Accuracies on predicting RP decrease as the size of training set shrinks

Virtual Cluster in Private Cloud. In this test, Fig. 5.11 shows that Matrix can scale VM resource as the workload varies. All VMs have four VCPUs and four GB memory when the test starts. All values shown in Fig. 5.11 are means of three runs, with six 10-minute phases per run. In the beginning, the cluster is running a low intensity YCSB1 workload, which has only two threads and a working set size of 10 GB. Because both the CPU frequency and data bandwidth are better on the virtualized cluster, the RP goes over five in the first minute. Then, Matrix reduces VM resources, in particular, the number of VMs, to pursue a system with the RP close to one. In the second phase, we increase the thread counts to 16 and the working set size to 80 GB. Because of the increased amount of requests, Matrix increases the number of CPUs on each VM. Similarly to the first ten minutes, the VM numbers stay low to keep RPs close to one. After 30 minutes, we change workload to Cloud9 (highly parallel and CPU intensive), and Matrix in turn increases the number of VCPUs and VMs. We keep using Cloud9 in the fifth phase. The RPs keep staying around one stably. Note that one big VM with more than four VCPUs could be comparable to the four VMs in the fourth and fifth phases. Matrix allocates VCPUs no more than four because it is out of the predefined bound when we start this test. We will demonstrate Matrix is able to work in a larger scale later. In the last ten minutes, the workload is changed back to the same settings as the first ten minutes. Matrix

demonstrates the ability to catch the change and provide similar configurations as in the first phase.


Figure 5.11: As workload types and intensities change at every ten minutes, resources on each VM and the numbers of VMs are altered to keep RPs close to one

Overall, Matrix tracks the workload closely and is able to change VM configuration quickly (after the first couple of minutes in each phase). Because Matrix adjusts VM resources every one minute, the first one or two minutes in each phase may exhibit a higher or lower RP. Except that short period of time, RPs and resource allocations stay on track. Virtual Cluster in Public Clouds. Here we will only change the type of instance. We did not use Xen-blanket here due to concern over the overhead of nested virtualization. In this case, we run the tests in three steps: 1) Each application in the testing set is executed for ten minutes on a VC with a randomly uniformly selected type and the number of VMs from one to eight. 2) Matrix collects required system statistics, and recommends a configuration. 3) The same application then runs on a cluster of instances closest to the recommended configuration. We repeat the above steps at the weekday daytime, the weekday night-time, and the weekend. In addition, we change workload intensity from light, medium, and heavy for each testing application. In general, lighter workloads tend to have smaller instances and sizes. Amazon and Rackspace price most instance types in proportional to their resources. Matrix catches this difference and prefer t1.micro to m1.small, for example, in the light

intensity case (recall the cost parameter in Sec. 5.2.4). Fig. 5.12 shows the average RPs and standard deviations when we re-run the testing cases with the recommended configurations as well as three fixed-size VCs. Each column shows the average RP of 45 runs; Fig. 5.12(a) and Fig. 5.12(b) are the results from Amazon and Rackspace respectively. All the RPs from Matrix fall between 0.88 and 1.16 with a mean of 1.02 across all cases. As shown in Fig. 5.12, the configurations suggested by Matrix keep the average RPs closer to one and with smaller variance than the three static configurations. Using 4 × m1.small or 4 × RS2 all the time leads to a low average RP of 0.82 because the medium and heavy workloads are too intensive for these instances. In general, Matrix uses 4 × m1.small or 4 × RS2 for light workloads but switches to more powerful instances as the workload gets heavier. The average RP of all 4 × m1.medium and 4 × RS3 cases is close to one, but its standard deviation is 0.1, more than twice that of Matrix (0.04). The large variance in RPs of the 4 × m1.medium and 4 × RS3 cases comes from over-provisioning at the light workload, inadequacy at the heavy one, and differences in the workload mix even at the appropriate intensity. For example, Matrix uses 3 × m1.medium for YCSB1 and 2 × m1.large for YCSB5 at the medium workload, which keeps RPs closer to one than 4 × m1.medium does. When the workload is heavy, Matrix uses 2 × m1.large or 2 × RS4 most of the time. Thus, although statically using 4 × m1.large and 4 × RS4 yields small variance, the average RP in this case increases to 1.14.

Cost Efficiency. Here we examine the RPC and PPC values to assess the cost-efficiency of each configuration. To ease comparison, the RPC and PPC values in Table 5.3 are normalized to Matrix's. A smaller RPC means a configuration is more cost-efficient while keeping RPs close to one; conversely, a higher PPC means more RP can be achieved for the same cost. In the tests on both Amazon and Rackspace, Matrix outperforms the static settings in both metrics.
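The normalization used in Table 5.3 can be written out directly: each configuration's raw RPC and PPC are divided by Matrix's values, so Matrix reads 1.00 in both rows. The raw numbers in this sketch are hypothetical placeholders, since the underlying RPC/PPC definitions from Sec. 5.2.4 are not repeated here.

```python
# Normalize raw RPC/PPC values to the Matrix baseline (hypothetical raw values).
def normalize_to_matrix(raw, baseline="Matrix"):
    """raw: dict mapping configuration name -> (rpc, ppc)."""
    base_rpc, base_ppc = raw[baseline]
    return {name: (rpc / base_rpc, ppc / base_ppc)
            for name, (rpc, ppc) in raw.items()}

raw_costs = {                      # (RPC, PPC), placeholder raw values
    "Matrix":         (0.05, 20.0),
    "4 x m1.small":   (1.20, 16.8),
    "4 x m1.medium":  (1.02,  9.4),
    "4 x m1.large":   (7.15,  6.6),
}
for name, (rpc, ppc) in normalize_to_matrix(raw_costs).items():
    print(f"{name:15s} RPC={rpc:6.2f} PPC={ppc:5.2f}")
```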

[Bar charts: average RPs for YCSB1–YCSB5 under Matrix and the static settings 4 × m1.small, 4 × m1.medium, 4 × m1.large on Amazon (a), and 4 × RS2, 4 × RS3, 4 × RS4 on Rackspace (b).]

Figure 5.12: RPs when using Matrix and three static cluster settings on Amazon and Rackspace

Table 5.3: Cost efficiency (RPC and PPC) of Matrix and three static configurations on Amazon and Rackspace respectively

Amazon EC2                Matrix   4 × m1.small   4 × m1.medium   4 × m1.large
RPC                       1.00     24.00          20.41           143.02
PPC                       1.00     0.84           0.47            0.33

Rackspace cloud servers   Matrix   4 × RS2        4 × RS3         4 × RS4
RPC                       1.00     25.33          18.67           90.54
PPC                       1.00     0.78           0.68            0.52

5.4.3 Virtual Cluster in Private and Public Cloud

In this case, we evaluate the case of migrating the virtual cluster from private to public cloud. We make the number of VMs per cluster larger than previous tests in order to test the scalability. Because our Rackspace account has a limitation on memory size at 64 GB, the results of public cloud here are all obtained from the Amazon EC2. We use 32- and 64-VM local VCs (VC32 and VC64) in this case. These local VCs are hosted on four VS2 servers. Each VM in one local VC has one VCPU and 1.5 GB memory, and each VS2 hosts 16 VMs. The training procedure on EC2 is almost the same as the previous one in Sec. 5.4.2, except that we extend the number of VMs to 32 and 64 in the procedure. We then verify the prediction accuracies in Amazon VCs.

The average accuracy across different clusters is 0.89 with a standard deviation of 0.03. In general, instances with more VCPUs and memory are more stable and easier to predict. For example, prediction on an m1.small cluster has an average accuracy of 0.87 with a standard deviation of 0.031, while prediction on an m1.xlarge cluster has an average accuracy of 0.9 with a standard deviation of 0.027. Similar to the test in Fig. 5.7, we have Matrix recommend EC2 configurations comparable to the 32- and 64-VM local VCs for running the light, medium, and heavy testing workloads, which have 8, 32, and 64 threads and 80, 160, and 320 GB working set sizes respectively. We run each testing application and intensity 30 times on a VC of 32 × m1.xlarge instances for Matrix to find the matched configurations. In general, Matrix mostly uses 30 × m1.medium, 24 × m1.large, and 20 × m1.xlarge instances for the VC32 at the light, medium, and heavy workloads respectively. When the cluster size increases from 32 to 64, Matrix makes the EC2 cluster use correspondingly more instances: the configuration for the light workload changes from 30 × m1.medium to 64 × m1.medium, and the configurations for the medium and heavy workloads become 44 × m1.large and 36 × m1.xlarge respectively. In Fig. 5.13, the suggested configurations for the VC32 and VC64 have average RPs of 1.02 and 1.03 with standard deviations of 0.07 and 0.09 across the workload intensities. The RPs for all cases fall between 0.88 and 1.16 with a mean of 1.03.


Figure 5.13: Average measured RPs and standard deviations using the recommended Amazon EC2 VC configurations

Table 5.4 lists the PPC and RPC values of Matrix and three static EC2 settings for running virtual clusters. The RPC and PPC values are normalized to the Matrix’s.

The three configurations displayed here use m1.medium, m1.large, and m1.xlarge because they are the three most frequently recommended types. For comparison, we use 32 and 64 EC2 instances in the VC32 and VC64 tests respectively. According to the RPC values, Matrix costs much less than the static EC2 VC settings while keeping RP close to one, especially in the VC64 tests. Further, Matrix achieves better PPC values than the static settings, which indicates good performance-cost efficiency. The PPC values also show that using powerful instances may not be cost-efficient even though they do provide better performance.

Table 5.4: Cost efficiency (RPC/PPC) of Matrix and three static configurations

        Matrix      m1.medium   m1.large    m1.xlarge
VC32    1.00/1.00   2.62/0.99   4.23/0.83   28.80/0.60
VC64    1.00/1.00   3.17/0.89   6.28/0.85   38.64/0.63

Chapter 6

VIO-Prefetching

While Swiper and TRACON address the interference issue and Matrix completes a missing piece of virtualization, the exploration in this chapter goes deeper into the virtualization architecture to design an innovative I/O virtualization framework. This chapter demonstrates that data prefetching, when run in a virtualization-friendly manner, can significantly improve performance for a broad selection of data-intensive applications. Traditional data prefetching has focused on applications running on bare-metal systems with hard drives. In contrast, virtualized systems using solid-state drives (SSDs) present different challenges for data prefetching. Before introducing the design of VIO-prefetching, we take a look at these new challenges.

6.1 Challenges

6.1.1 Challenge #1: No One-size-fits-all

There is no one-size-fits-all solution for data prefetching because many things could be different from one server to another. Here we focus on two major aspects: different storage devices and applications.

124 First, storage devices are different. Flash-based devices and spindle drives have very different characteristics. For example: SSDs tend to have excellent seek, random read and write performance, and parallel I/O support. SSDs also have down- sides, e.g., expensive small writes, and limited erase cycles. At a high level of hardware blocks, SSDs commonly consist of three main components: NAND flash packages, controllers, and buffers. However, various SSDs usually show differences in their ar- chitecture and hardware components. For example, they can be different in buffer size and number of channels. These differences can happen as frequently as new SSD drives available. The same manufacturer may even choose to adopt different con- trollers across two models. In our previous tests [180], when measured under Linux, SSDs clearly have higher bandwidths than the hard drive (measured read bandwidth at about 90 MB/s). SSDs normally outperform the hard drive by 2 times or more. SSDs also have obvious difference in write performance, which can range from 80 MB/s to 200 MB/s on different drives. Second, applications are different. Applications are different in many aspects. In the domain of performance management and resource control, applications can be roughly categorized into computing- or data-intensive ones. Even though data- intensive applications tend to have a higher demand in data bandwidth, each one of them still can be different subtly. In our previous work [180], we studied the average application throughput in I/O operations per second (IOPS) for ten applications ranging from large file operations, search engine traces, and database workloads. The demand of I/O can range from hundreds to thousands IOPS across these ten applications. Moreover, each application has multiple stages, as we have seen in Chapter 5, and each stage may have different I/O requirements.

6.1.2 Challenge #2: Virtual I/O Blending

Virtualization is widely used among cloud service providers because virtualization technology provides a number of advantages, such as flexible resource management, efficient resource utilization, etc. But, it comes at the cost of I/O performance. Vanilla

operating systems such as Linux assume exclusive control of storage drives and optimize disk I/Os under that assumption. Virtualization puts operating systems into guest VMs and hosts many VMs on one physical machine. Storage devices are now shared among numerous guest VMs, so the assumption of exclusivity is no longer valid. However, the I/O stacks inside each guest VM still try to optimize I/O patterns for sequential access on the virtual disk. When these I/O requests are forwarded to the hypervisor, they are likely to be blended in an unpredictable fashion, and by the time they arrive at the physical storage devices, they are no longer sequential. This is called the virtual I/O blending effect [159]. The more VMs involved in the blending, the more pronounced the effect. Fig. 6.1(a) shows the virtual I/O blending effect on overall performance. It compares the combined throughput of sequential read operations on one HDD running multiple guest VMs, each identically configured with the IOZONE benchmark tool [134]. The performance of a single VM is almost identical to that of a non-virtualized server. As we add more VMs, the combined throughput of the HDD decreases dramatically; at eight VMs, it is about half that of a single VM. An SSD does mitigate the situation. Fig. 6.1(b) presents the same tests on an SSD. The combined throughput reaches the maximum bandwidth at four VMs, and performance declines from there because the test runs on a four-core machine. However, the guest systems are not aware of this competition, and the host system does not know the I/O access patterns inside each guest VM. Knowing guest I/O process information helps prefetching methods capture the targeted patterns more effectively. In this work, we implement a virtualized system that passes guest I/O process identifications through to the prefetcher in the host system for more accurate pattern identification. Section 6.4 demonstrates that VIO-prefetching can successfully prefetch needed data by identifying I/O access patterns and maintain high throughput with feedback control.
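A toy model, not taken from the dissertation, illustrates why the host sees little sequentiality even when every guest issues purely sequential reads: the per-VM streams are interleaved by the hypervisor, so back-to-back requests at the physical device are rarely contiguous.

```python
# Toy illustration of virtual I/O blending: each VM's stream is sequential on
# its own VMDK, but the host observes an interleaving of the streams.
import random

def host_view(num_vms, requests_per_vm=8, vmdk_stride=1_000_000):
    streams = {vm: iter(range(vm * vmdk_stride, vm * vmdk_stride + requests_per_vm))
               for vm in range(num_vms)}
    order = [vm for vm in streams for _ in range(requests_per_vm)]
    random.shuffle(order)                      # hypervisor-level interleaving
    return [next(streams[vm]) for vm in order]

def sequential_fraction(blocks):
    pairs = zip(blocks, blocks[1:])
    return sum(b == a + 1 for a, b in pairs) / max(len(blocks) - 1, 1)

for n in (1, 2, 4, 8):
    print(n, "VMs -> sequential fraction seen by the host:",
          round(sequential_fraction(host_view(n)), 2))
```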

[Bar charts: combined sequential-read throughput (MB/s) vs. number of VMs (1, 2, 4, 6, 8, 10) for (a) HDD and (b) SSD.]

Figure 6.1: Virtual I/O blending effect

6.2 Design Principles

With virtualization and flash drives in mind, VIO-prefetching has the following fea- tures: 1) The ability to dynamically adapt prefetching size based on the diversity of device characteristics, loadings and application demands, 2) awareness of tempo- ral locality of the applications and virtual machine disks, and 3) bridging the in- formation gap of I/O traces between domains. More specifically, we have designed VIO-prefetching around three major principles.

Control prefetching based on drive loading, application demands, and prefetching performance: The prefetching method needs to be aware of the limits of the storage devices and must not overload the available bandwidth. Even if all prefetched data are useful for a particular application, other applications may be starving for data. Also, over-prefetching can evict useful data from the cache and actually hurt performance. VIO-prefetching therefore adapts the amount of prefetching based on its own performance. If prefetching benefits the system, VIO-prefetching gradually reads data at a faster rate to further improve performance. On the other hand, the aggressiveness is decreased if VIO-prefetching finds that prefetching does not help the current application's data accesses. VIO-prefetching coordinates this control process between the guest VMs and the host domain.

Enable prefetching for concurrent multiple accesses: To control concur- rent accesses and increase prefetching accuracy, VIO-prefetching needs to track each simultaneous access pattern individually. In other words, VIO-prefetching is aware of the program context. The context includes process id, drive id, block id, how much data an application accesses at a given time, and whether a particular access pattern exists, stops, and changes. VIO-prefetching utilizes this information to assist pattern recognition and prefetching.

Bridging the information gap: Virtualization creates a semantic gap among domains. More specifically, guest VMs have limited information about the underlying physical devices. The host domain does not have a complete view of I/O traces because some of them may be satisfied by guest buffers and never reach the host I/O system. Therefore, we design an inter-domain communication framework for bridging the semantic gap. In this way, VIO-prefetching can improve the accuracy of pattern recognition and achieve better control on the prefetching aggressiveness.

6.3 The Architecture of VIO-Prefetching

There are four main operations of VIO-prefetching:

• tracking each application’s I/O traces,

• recognizing access patterns for a series of requests,

• prefetching data from the drive to the cache in the background, and

• adjusting the prefetching rate according to system performance.

[Figure 6.2: Integrating VIO-prefetching with a virtual machine host. (a) The overview of a virtual I/O path: guest VMs with frontend drivers, the hypervisor, and the backend and device drivers in the host leading to the VMDKs on a logic drive; the drivers are modified to carry the in-guest process identification (PID) between the guest and host domains. (b) A zoom-out view of VIO-prefetching: the trace collection, pattern recognition, controller, and feedback modules sit between the VMDKs and the logic drive.]

We design the architecture of VIO-prefetching to integrate these four operations with virtualized environments. Fig. 6.2(a) depicts the I/O path in a virtualized server integrated with VIO-prefetching. Guest VMs use the front-end drivers to talk to the backend drivers in the driver domain, and the backend drivers use the real device drivers to access the physical devices. To complete a virtual I/O operation through the communication between these drivers, the I/O channel also needs to map memory pages and translate addresses. These operations are the cause of virtualization overheads. Because of these complex operations, the virtual I/O blending, and the semantic gap among domains, the I/O access patterns are hard to identify by the time they arrive at the physical storage devices. VIO-prefetching provides a remedy to this problem by bridging the information gaps among domains and identifying access patterns at the block device level. We make the VMs pass their in-guest process identifications to the host domain for the pattern recognition module of VIO-prefetching. Then, VIO-prefetching groups and prefetches the needed data in sequences by utilizing the underused bandwidth of the physical devices.

In a virtualized environment, each VM has its own view of a disk, called the virtual machine disk (VMDK). A VMDK could be an image file on the host machine's file system, a disk partition, or a physical block device on the host machine. We choose to integrate VIO-prefetching with the virtual machine host system, not the guest VMs, for the following reasons:

• There are many I/O layers between a guest VM and the underlying physical devices. When a sequential prefetching from a guest VM arrives at underlying physical devices, it is no longer sequential. As a result, prefetching from guest VMs may have limited benefit on performance improvements.

• VIO-prefetching utilizes underused bandwidth for prefetching. Many factors can change the maximum bandwidth observed in a guest VM, e.g., priority, device types, schedulers, etc. These factors may change from time to time and complicate the practical implementation of VIO-prefetching in guest VMs. For practical purposes, VIO-prefetching in virtualization hosts provides greater advantages than in guest VMs. In particular, this approach is independent of the types of guest operating system.

Prefetching in a virtualization host, however, hinders identifying sequential pat- terns because of the missing processes identification. Typically, a backend driver dispatches actual I/O requests to storage devices on behalf of a guest VM. Thus, the driver domain treats all I/O requests from a VM as from a single process, even if multiple processes are making requests inside the guest VM. As a result, it is more difficult for a host prefetcher to catch a sequential access process inside a VM than in a host domain. To assist the pattern recognition module, VIO-prefetching passes I/O requests’ owner process identifications in guest domains to the driver domain. A Xen frontend driver is modified to embed requests’ owner identifications when generating a Xen blkfront I/O request. Correspondingly, the backend driver is also modified to extract

130 requests’ owner identifications when transforming a Xen block request into a normal one. Then, the backend driver uses blktrace API [15] to update traces when submitting the request. Blktrace uses the Linux kernel debug filesystem to trace I/O events. Using blk- trace requires calling the BLKTRACESETUP and BLKTRACESTART ioctls for a file descriptor associated with a block device. The blktrace API offers several useful pieces of context that are not present in a traditional I/O event queue in the driver: the events have the timestamps, process ids, and names of the originating process. VIO-prefetching can use this information to differentiate requests from multiple ap- plications. Also, by examining the process id, requests from VIO-prefetching itself can be ignored when considering applications’ access patterns. Events can also be automatically filtered (read vs. write) with a mask before being delivered to VIO- prefetching. In the current implementation, a process context identifies an application execu- tion environment by using a combination of process id, in-guest process id, drive id and block region. Note that original blktrace API did not have a field for in-guest process id, which is added in this work. In order to collect I/O event traces from all VMs to corresponding VMDKs, all VMDKs are stored in one logic drive. This is a common practice to manage storage systems and can be achieved by utilizing Logical Volume Manager (LVM), loopback devices, or RAIDs. After placing VMDKs in one single logical drive, VIO-prefetching monitors the logic drive for virtual I/O event traces. Fig. 6.2(b) shows how VIO- prefetching is integrated with a virtual machine host. First, the trace collection module gathers a record for every I/O request. Note that not every request by VMs will actually reach its VMDK because some of them may be satisfied by the system cache, but VIO-prefetching traces both issued VM requests and those that actually reach the disk. Then, the pattern recognizer wakes up to look at the accumulated I/O events when a timer expires. The pattern recognizer then informs the controller whether, where, and how much to prefetch. The controller optionally adjusts the aggressiveness based on recent prefetching performance.

VIO-prefetching collects the I/O traces in the driver domain. The collected information includes the request time, type, amount, and the process identification. VIO-prefetching keeps records for every I/O request issued in a VM, as well as for those that actually reach the disk, and uses them in the pattern recognizer. The design of VIO-prefetching for virtualized environments considers the access patterns both from the VMs and on the storage devices. VIO-prefetching operates its pattern recognition module in a polling mode: the pattern recognizer starts working when a timer expires. The amount of I/O events collected in the traces for the recognizer is determined by the polling interval. A short interval may not collect enough information to identify a pattern; a long interval, on the other hand, may miss the appropriate timing to prefetch data. VIO-prefetching uses 0.5 seconds by default. If VIO-prefetching recognizes a particular pattern, it starts to prefetch data following the same behavior. VIO-prefetching currently identifies four types of behaviors: sequential forward reads, sequential backward reads, strided forward reads, and strided backward reads. The strided pattern is a recurring sequential read pattern in which a few blocks are skipped in between. VIO-prefetching maintains several state machines, indexed by a hash of process id and block location. The count of consecutive blocks is updated when the current and previous requests are connected at either the start or the end block. Once the fraction of blocks that occurred in patterns, divided by the overall count of blocks, exceeds a certain threshold, the state machine for that hash entry is ready to perform a prefetch. The default pattern matching threshold is 0.6, which means that VIO-prefetching will fetch a certain pattern in the next interval if 60% of the requests during the current interval match that pattern. The prefetching starts at the next block contiguous to the most recent request; the stop block is set dynamically according to the application performance and system status.
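A minimal sketch of this bookkeeping is shown below; it covers only the sequential-forward case, and the class and function names are illustrative rather than VIO-prefetching's actual implementation.

```python
# Per-stream bookkeeping sketch: each (process id, in-guest pid, block region)
# bucket counts how many requests extend a run of consecutive blocks; a bucket
# becomes eligible to prefetch once that fraction exceeds the 0.6 threshold.
from collections import defaultdict

PATTERN_THRESHOLD = 0.6   # fraction of requests that must match the pattern
REGION_BITS = 20          # coarse block region used in the hash key

class StreamState:
    def __init__(self):
        self.total = 0          # requests seen this polling interval
        self.sequential = 0     # requests contiguous with the previous one
        self.last_end = None    # end block of the most recent request

streams = defaultdict(StreamState)

def record_read(pid, guest_pid, start_block, num_blocks):
    key = (pid, guest_pid, start_block >> REGION_BITS)
    s = streams[key]
    s.total += 1
    if s.last_end is not None and start_block == s.last_end:
        s.sequential += 1       # extends the run: sequential forward read
    s.last_end = start_block + num_blocks
    return key

def eligible_streams():
    # Called when the polling timer expires.
    return [key for key, s in streams.items()
            if s.total and s.sequential / s.total > PATTERN_THRESHOLD]
```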

6.3.1 Block Prefetching

The amount of data to prefetch once a pattern has been recognized is determined with the goal of reading data from an SSD into the system cache, but only those blocks that the application will actually request in the near future. For simplicity, we describe the logic for consecutive prefetching. The logic for strided and reverse prefetching is similar. In VIO-prefetching, we utilize two key parameters that control how much data will be prefetched:

Aggressiveness scale factor S is defined as

S = prefetched data amount / application read data amount

which means how aggressive prefetching is compared to the application’s measured request rate. While we can measure the application’s rate to tailor prefetching based on the application’s needs, we have found that using a fixed, static scale factor does not work well. The optimal value for this scale factor is application-specific and can be adjusted by feedback (which will be described in the next section). Our experiments showed that the values near 1.0 typically work well as the starting point for the feedback mechanism. A value of 1.0 means that the amount of prefetched data matched the application’s request rate. If the value is higher than one, VIO-prefetching might prefetch some unneeded data. On the other hand, if the value is less than one, some requests may not be satisfied by the prefetched data.

Maximum disk throughput: This has a different optimal value for each disk. During the time interval when prefetching is occurring, VIO-prefetching is careful to avoid saturating the available read bandwidth of the disk with prefetching requests at the expense of actual application requests that may be mispredicted and have to go to the disk. If that happened, the requested prefetch would take more than the entire allotted time interval and VIO-prefetching would drift further and further behind real application time. To prevent this, the prefetcher estimates what amount of application requests will actually reach the disk because they will not be prefetched successfully, and sets the prefetching throughput limit (PF_limit) to the maximum disk throughput minus this value. For this purpose, we use the percentage of consecutive reads that is already computed in the previous stage of pattern recognition. Since the maximum disk throughput depends on the characteristics of each drive, we measure the raw throughput of each disk by reading a large, uncached file and use this as the maximum. In a virtualized environment, guest VMs do not know the utilization status of the physical drives. Therefore, VIO-prefetching helps each VM prefetch data with the device characteristics in mind.

Putting these two parameters together, the prefetcher uses the last known (read) stop block as its start block (B_start) and finds the stop block as follows. It first determines the linear throughput of the application (TP_L) by multiplying the total throughput (TP) by the percentage of consecutive reads (PCT_SEQ). We consider the remainder of the total application throughput to come from random accesses (TP_R). Next, the prefetcher uses the scale factor S and the total available bandwidth BW_A (obtained by subtracting TP_R from the maximum disk throughput BW_T) to determine the stop block (B_stop) for the next polling interval. Supposing the polling interval is T seconds, the calculation of B_stop is as follows:

    TP_L     = TP × PCT_SEQ
    TP_R     = TP × (1 − PCT_SEQ)
    BW_A     = BW_T − TP_R
    PF_limit = min(S × TP_L, BW_A)
    B_stop   = B_start + T × PF_limit

Once the number of blocks to prefetch for an application during an interval is determined, VIO-prefetching simply issues a system call (e.g., readahead in Linux) with the starting block number and the number of blocks to read. (For strided access, there may be multiple readahead calls.) We leave the details of the cache management itself to the underlying operating system.
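The per-interval sizing can be sketched as follows, assuming throughputs are measured in bytes per second and blocks are 4 KB; os.posix_fadvise with POSIX_FADV_WILLNEED stands in here for the Linux readahead call mentioned above, and all parameter names are illustrative.

```python
# Sketch of the prefetch sizing from Section 6.3.1 (unit assumptions noted above).
import os

def prefetch_plan(tp, pct_seq, bw_t, s, b_start, interval_s=0.5, block_bytes=4096):
    tp_linear = tp * pct_seq              # TP_L: linear part of the throughput
    tp_random = tp * (1.0 - pct_seq)      # TP_R: random part, will still hit disk
    bw_avail = bw_t - tp_random           # BW_A: bandwidth left for prefetching
    pf_limit = min(s * tp_linear, bw_avail)          # bytes/s allowed to prefetch
    b_stop = b_start + int(interval_s * pf_limit / block_bytes)
    return b_start, b_stop

def issue_prefetch(fd, b_start, b_stop, block_bytes=4096):
    # Hint the kernel to read the chosen block range into the page cache.
    length = (b_stop - b_start) * block_bytes
    if length > 0:
        os.posix_fadvise(fd, b_start * block_bytes, length, os.POSIX_FADV_WILLNEED)
```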

6.3.2 Feedback Monitoring

Feedback monitoring classifies the stream of read operations reaching disk as linear (meaning sequential, reverse, and strided) similar to the way read requests to the operating system were classified during pattern recognition. The intuition is that if there are any linear, easily predictable reads that were not prefetched, and still reached disk, then the prefetching aggressiveness (S) should be increased. On the other hand, if there are no linear reads reaching the disk and the statistics show that the prefetching amount is more than what the applications are requesting, we decrease the aggressiveness accordingly. In practice, not all linear reads can be predicted so we increase the prefetch aggres- siveness scale factor when the percentage of linear reads reaching disk is greater than a predefined threshold. We decrease the aggressiveness when it is clear that additional prefetching would not help. When we see that the number of linear reads reaching disk is zero and that the number of prefetched blocks reaching disk is greater than the number of linear reads that the application requested to the operating system, the prefetch aggressiveness will be reduced. During each polling interval, the feedback monitor analyzes the actual performance of the prefetch operations from the last time interval and adjusts its aggressiveness accordingly. This monitoring is done by comparing the access pattern of reads that the application makes to the operating system (entering the cache) vs. the pattern of reads reaching disk (missed in the cache). Algorithm 6 presents the high-level pseudocode of VIO-prefetching.
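A sketch of this feedback rule, with illustrative thresholds and step sizes, might look as follows:

```python
# Illustrative feedback rule: raise the aggressiveness S when predictable,
# linear reads still reach the disk; lower it when no linear reads reach the
# disk and more was prefetched than the application requested linearly.
def adjust_aggressiveness(s, linear_reads_to_disk, total_reads_to_disk,
                          prefetched_blocks_to_disk, linear_blocks_requested,
                          raise_threshold=0.2, step=0.1, s_min=0.5, s_max=4.0):
    if total_reads_to_disk and \
       linear_reads_to_disk / total_reads_to_disk > raise_threshold:
        s += step        # missed easy prefetch opportunities: be more aggressive
    elif linear_reads_to_disk == 0 and \
         prefetched_blocks_to_disk > linear_blocks_requested:
        s -= step        # prefetching more than needed: back off
    return min(max(s, s_min), s_max)
```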

6.4 Evaluation

We have implemented a prototype VIO-prefetching in a Linux system with Xen virtu- alization. To assist prefetching in the host domain, VIO-prefetching passes guest I/O requests’ owner process identifications to the driver domain by embedding requests’ owner identifications when generating a Xen blkfront I/O request and extracting them

when transforming a Xen block request into a normal one. In the following sections, we first introduce the applications and environments used for the experiments, and then explain the experiments and results.

Algorithm 6: The pseudocode for VIO-prefetching
  Data: disk read event ReadDisk; requested read event ReqRead;
        number of consecutive blocks B_SEQ; number of total blocks B_total
  begin
    for each polling interval T do
      // Collect read operations that reached the physical disk
      // (i.e., not satisfied by the cache)
      for each ReadDisk do
        // Update the per-disk counters
        Update S, TP_L, and TP_R
      end
      // Collect read operations that are partially satisfied by the cache
      for each ReqRead do
        // Track concurrent multiple reads
        h = Hash(process id, in-guest id, B_start)
        // Update the counters for the mapped state machine
        UpdateCounters(ReqRead, h)
      end
      for each state machine h do
        // Calculate PCT_SEQ
        PCT_SEQ = h.B_SEQ / h.B_total
        if PCT_SEQ > Th_SEQ then
          // Calculate the prefetch amount and set a prefetch ceiling
          PF_limit = min(S × TP_L, BW_A)
          Prefetch(B_start, B_stop)
        end
      end
      // Adjust S for the next time interval
    end
  end

Table 6.1: Applications for testing VIO-prefetching

Name         Description                Workload types                               Data size
Cloudstone   Social event application   Interactive update and read                  20 GB
Wiki         Wikimedia website          Similar to WS but with fewer updates         60 GB
Darwin       Video streaming            Heavy sequential read operations             36 GB
FS           File server                Create, delete, read, and write files        20 GB
WS           Web server                 Open, read, and close multiple files         20 GB
VS           Video server               Sequential read and write                    20 GB
WP           Web proxy server           Open, create, read, write, and close files   20 GB

6.4.1 Experiment Environment and Applications

High-performance storage systems are needed in many different types of data-intensive applications. To evaluate the performance of VIO-prefetching technique, we choose a wide variety of benchmarks, including numerous cloud applications and file system benchmarks. Table 6.1 shows a number of popular cloud applications and their config- urations used in our experiments. In brief, Cloudstone is a performance measurement framework for Web 2.0 [163]; Wiki with Database dumps is from Wikimedia founda- tion [196] and the real request traces are from the Wikibench web site [181]; Darwin is an open source version of Apple’s QuickTime video streaming server; FS, WS, VS, and WP are file, web, video, and web proxy servers respectively, which are all from Filebench [119]. All applications are running in eight concurrent threads/workers unless elsewhere specified.

• Cloudstone is a performance measurement framework for Web 2.0 [163]. It consists of three main components: 1) Web application: Cloudstone uses Olio [177], an open source online social-event calendar. We use Apache Tomcat with PHP and a geocoder emulator to host Olio. 2) Database is used to store and handle user accounts and calendar events. We use MySQL as the backend database in our experiments. 3) Workload generator: Cloudstone uses Faban [168] as the workload generator, which uses the Markovian arrival time model [163] to generate requests to the web application server. Operations are the mixture of common social web site activities, such as, loading home pages, logging in,

adding events, etc.

• Wiki replicates Wikipedia, which is a free online encyclopedia which contains 23 million articles. We use the VMs loaded with Wikibooks from BenchLab [36]. In our experiments, we use database dumps from Wikimedia foundation [196] and real request traces from the Wikibench web site [181].

• Darwin is a video streaming server, an open source version of Apple’s Quick- Time server. It streams videos across networks using the industry standard real-time transport protocol (RTP) and real time streaming protocol (RTSP). The clients send the requests following Faban workload driver’s commands. Since we focus on I/O interference, we have one third of low, medium and high bit-rate videos respectively in our workload mixture.

• FS emulates file-server I/O activity. FS performs a sequence of creates, deletes, appends, reads, writes and attribute operations on a directory tree.

• WS mimics web-server I/O activity. WS performs a sequence of opens, reads, and closes on multiple files in a directory tree plus a log file append.

• VS is a video server. It has two file sets: actively served videos and available but currently inactive videos. VS writes new videos to replace inactive videos. Meanwhile, VS also serves videos from another file set.

• WP reproduces I/O activity of a web proxy server. The operations are a mix of create-write-close, open-read-close, and delete on multiple files in a directory tree and a file append to simulate proxy log.

The test system has two six-core Intel Xeon CPUs at 2 GHz and 32 GB mem- ory. This machine is running Linux 3.2, Xen 4.1, and eight 120 GB SSDs configured as RAID0 on a MegaRAID card. Each storage device is formatted with an ext2 filesystem, mounted with the noatime option and filled with one large file which was connected to a loopback device. The loopback device is then formatted with an ext3 filesystem and also mounted with the noatime option for running the benchmarks.

The noatime option prevents read operations to the filesystem from generating metadata updates, which would require writes to the device, and is intended to improve I/O throughput.

6.4.2 VIO-Prefetching vs. Flashy Prefetching

Flashy prefetching, the preliminary work of VIO-prefetching, has shown its ability to read ahead just before needed blocks are accessed by the application [180]. Pre- liminary results also indicate that flashy prefetching is capable of adapting the speed of data prefetching on the fly to match the needs of the application. In this work, VIO-prefetching adapts flashy prefetching for virtualized environments by bridging the missed process identification among domains. To see if this change makes VIO- prefetching better than flashy prefetching in a virtualized environment, we start the evaluation section with the comparisons between these two prefetching schemes. We firstly want to see if the VIO-perfetching performs as flashy prefetching when there is only one major I/O process in guest VMs. Therefore, in the first experiment, there is only one major process which is sequentially reading 64 KB from a 1 GB file in a guest VM. The VM has one VCPU and 512 MB memory. The baseline is running this sequential read process in a guest VM with the default readahead set- ting in Linux. Then, we turn off the default Linux readahead function and run the same sequential read process with the help of flashy and VIO-prefetching respectively. The above experiment is repeated three times at 1, 3, 6, 9, and 12-VM cases respec- tively. The average speedups and standard deviations are reported in Fig. 6.3(a). The speedup is obtained by normalizing the measured throughput to the baseline. Flashy and VIO-prefeching perform closely in this experiment because there is only one major I/O threads in guest VMs. Both flashy and VIO-prefetching have bet- ter aggregate throughputs than the baseline. The average speedups by flashy and VIO-prefetching are 1.19 and 1.2 respectively. The peak speedup of flashy and VIO- prefetching schemes appear at 6-VM case with the value of 1.35 and 1.3 respectively. The reason of the reduced speedup at large numbers of VM is because of the satu-

rated bandwidth. As the number of sequential-read processes increases, the available bandwidth for prefetching decreases, which limits the benefit.

Figure 6.3: The speedups and standard deviations of the three prefetching systems. (a) Experiments with various numbers of VMs; (b) experiments with different numbers of in-guest processes; (c) experiments with multiple VMs and mixed workload types; (d) experiments with multiple in-guest processes and mixed workload types.

The second test verifies whether passing the in-guest process identification can help prefetching in virtualized environments. To validate this, we run multiple benchmarking processes in a guest VM; specifically, we have 1, 3, 6, 9, and 12-process cases. Each process sequentially reads 64 KB records from its own 1 GB file. There is only one VM in this test, which has 12 VCPUs and 6 GB memory. We measure the aggregate throughput of these processes when using the baseline, flashy prefetching, and VIO-prefetching; the speedup is obtained by normalizing the measured throughput to the baseline. The tests are repeated three times and the average speedups and standard deviations are drawn as the columns and whiskers in Fig. 6.3(b). When there is only one process, the speedups of flashy (1.19) and VIO-prefetching (1.18) are close. As the number of processes increases, VIO-prefetching shows higher speedups than flashy, with the biggest difference of 0.2

at the 6-process case. On average, the speedup of VIO-prefetching is higher than flashy's by 0.15. We then test the prefetching systems in a more complex environment in which multiple VMs run different workloads concurrently. There are 12 VMs running concurrently in the third test. As in the first test, each VM has 1 VCPU and 512 MB memory, and there is only one major process in each guest VM. There are five cases in this test. In the first case, one VM performs sequential reads while the other VMs perform random reads and writes; the number of sequential-read VMs is increased to 3, 6, 9, and 12 in the second to fifth cases respectively. In all cases, the VMs other than the sequential-read ones perform random 64 KB reads/writes from/to 1 GB files with a 50:50 read:write ratio. We measure the aggregate throughput of the sequential-read VMs when using the baseline, flashy prefetching, and VIO-prefetching, and the speedup is calculated by normalizing the measured throughput to the baseline. The tests are repeated three times and the average speedups and standard deviations are shown as the columns and whiskers in Fig. 6.3(c). Similar to the first test, flashy and VIO-prefetching perform closely in this experiment because there is only one major I/O thread in each guest VM. In all cases, both flashy and VIO-prefetching have better aggregate throughputs than the baseline, which implies the ability to distinguish sequential from random I/O processes. The average speedups of flashy and VIO-prefetching are both 1.13. The overall speedup in Fig. 6.3(c) is less than that in Fig. 6.3(a) because the additional VMs doing random I/O concurrently limit the bandwidth available for speedups. In the fourth test, the setting is the same as in the second test (Fig. 6.3(b)), except that we add extra random I/O processes so that 12 I/O processes run concurrently in all testing cases. The random I/O processes are the same as those in the previous test. The goal of this test is to examine the prefetching systems with multiple I/O processes in a guest VM. We measure the aggregate throughput of the sequential-read processes when using the baseline, flashy prefetching, and VIO-prefetching, and the speedup is calculated by normalizing the measured throughput to the baseline. The tests are repeated three times and the average speedups and standard deviations are shown as the columns and whiskers in Fig. 6.3(d). When there is only one process, the speedups of flashy (1.08) and VIO-prefetching (1.10) are close. In all cases, VIO-prefetching has higher speedups than flashy; on average, the speedup of VIO-prefetching (1.13) is higher than flashy's (1.06) by 0.07. Because of the extra random I/O processes, the speedups in Fig. 6.3(d) are lower than those in Fig. 6.3(b).

6.4.3 Evaluation with Cloud Applications

After comparing VIO-prefetching with other prefetching methods in a virtualized environment, we now evaluate VIO-prefetching with numerous cloud applications and file system benchmarks in this section. Experiments on Different VM Numbers. The experiment environment and applications are described in Section 6.4.1. The goal of this test is to see how VIO- prefetching works at different applications and numbers of VMs. We have tested 1, 2, 4, ··· , and 12 VMs, while each VM has 1 VCPU and 1 GB memory.


Figure 6.4: The speedups by VIO-prefetching for different applications and numbers of VMs

Fig. 6.4 shows the speedups of the applications for different numbers of co-located VMs. At each run, all co-located VMs are executing the same application. The results are the average numbers of three runs. The overall mean of the speedups in Fig. 6.4 is 1.14. As shown in Fig. 6.4, WS and Wiki have relatively little speedups. We believe this is for two reasons. First, the nature of web servers are random I/O accesses, thus only a few number of sequential patterns can be found. Second, most requests to web servers are small in size, e.g., 4K, which makes prefetching less effective. On the

142 other hand, CloudStone, FS, and WP demonstrate good performance improvements, up to a 21% speedup on one VM and 2-7% for 12 VMs, because of more predictable, sequential access patterns. For example, the FS in the experiments has the mean file size at 64 MB and each request size is larger than 1 MB. Note it is fair and reasonable to use large file sizes because several new distributed file systems have large file sizes in practice, e.g. the Google file system uses 64 MB chunk size. For Darwin and VS, VIO-prefetching can provide over 14% speedup for 1 to 12 concurrent VMs because a number of sequential read requests are made by these video streaming services. Note that VIO-prefetching is aware of the maximum I/O bandwidth of the system. When more VMs are sharing the same bus and storage, the available I/O bandwidth is decreasing, which leaves less room for data prefetching. Therefore, VIO-prefetching has a reduced benefit when there are more VMs in the system. However, the I/O performance can be significantly improved when the VMs are supported by a high- end storage system, such as Fibre Channel based Storage Area Network (SAN), that comes with much larger I/O bandwidth and lower latency. Fig. 6.5 presents the accuracy of VIO-prefetching for different numbers of co- located VMs and applications. The accuracy is measured as the amount of prefetched and used data divided by total data used by VMs. The higher value the more ac- curate. The accuracy reasonably corresponds to the speedups. It is not surprising that benchmarks with more random access patterns have lower accuracies. But, VIO- prefetching effectively detects that these benchmarks have no sequential patterns and limits the number of attempts to prefetch. Fig. 6.6 explains this phenomenon by showing the cost of VIO-prefetching. The idea of cost is to show the ratio of waste data amount to the total data usage. If the cost is smaller, the system spends less bandwidth in prefetching unneeded data. Therefore, the cost is defined as the ratio of the amount of unused prefetched data to the total data usage of the application and VIO-prefetching. As it is shown in Fig. 6.6, although VIO-prefetching does not speed up WS greatly, the prefetcher does not waste I/O bandwidth on prefetching data. When the number of VMs is large, the cost is lower because the available bandwidth is reduced and thus the prefetched

and unused data amount is also reduced. The cost is higher at small numbers of VMs because VIO-prefetching aggressively consumes bandwidth. Note that the higher cost at small numbers of VMs does not mean lower application performance, because VIO-prefetching only uses the spare bandwidth. This result clearly shows the success of the pattern recognition and feedback control modules.

Figure 6.5: VIO-prefetching accuracy for different applications and numbers of VMs. The accuracy is on the y-axis, measured as the amount of prefetched and used data divided by the total used data.
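Written out from the definitions given in the prose above, the two metrics are simple ratios of byte counts (a minimal sketch; the counters would come from the trace module).

```python
# Accuracy and cost metrics used in Figs. 6.5 and 6.6.
def prefetch_accuracy(prefetched_and_used, total_used):
    # Fraction of the data the VMs consumed that was served by prefetching.
    return prefetched_and_used / total_used if total_used else 0.0

def prefetch_cost(prefetched_unused, app_data_used, prefetched_total):
    # Unused prefetched data relative to all data moved for the application
    # and the prefetcher together.
    total = app_data_used + prefetched_total
    return prefetched_unused / total if total else 0.0
```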


Figure 6.6: VIO-prefetching cost for different benchmarks and number of VMs. VIO-prefetching cost is on the y-axis, defined as the ratio of the amount of unused prefetched data to the amount of prefetched data

Prefetching on Different Read/Write Ratios. We have demonstrated the experiment results on multiple VMs with a single in-guest process in Fig. 6.3(c) and one VM with multiple in-guest processes in Fig. 6.3(d). We now study how the VIO-prefetching reacts at workloads with different read/write ratios. To see how VIO-prefetching works in a more complicated environment, we demonstrate the experiments of different read/write ratios in a multiple VMs and in-guest processes environment. The following experiments are conducted on a server with two six-core

Intel Xeon CPUs at 2 GHz and 32 GB memory. The machine runs Linux 3.2 and Xen 4.1 with eight 120 GB SSDs configured as RAID 0 on a MegaRAID card. There are eight VMs, each with one VCPU and one GB of memory. We use four in-guest processes in each VM to synthesize the read/write ratios. For example, a read/write ratio of 75/25 means that each VM has three sequential-read and one sequential-write processes, where the I/O size is 64 KB and the file size is one GB for each process. We draw box plots in Fig. 6.7 to show the results. As the read ratio increases, the speedup of the VM's throughput grows because of prefetching. VIO-prefetching successfully identifies read-intensive processes and prefetches data for future use. The overall average speedup is 1.1.
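The mapping from a target read/write ratio to the in-guest process mix can be sketched as follows (the worker labels are illustrative):

```python
# Map a read/write ratio onto four in-guest workers per VM, each handling
# 64 KB sequential I/O against its own 1 GB file, as described above.
def worker_mix(read_ratio_percent, workers_per_vm=4):
    readers = round(workers_per_vm * read_ratio_percent / 100)
    return ["seq-read"] * readers + ["seq-write"] * (workers_per_vm - readers)

for ratio in (25, 50, 75, 100):
    print(ratio, "% reads ->", worker_mix(ratio))
```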


Figure 6.7: Box plots of speedups at different read/write ratios

Prefetching on Different Schedulers. System administrators could configure I/O schedulers for specific storage devices and applications to achieve higher perfor- mance. All experiments above use NOOP scheduler in both guest and host domains because NOOP has been recommended for guest domains and SSDs. Because the default I/O scheduler on most Linux distributions is CFQ, we test difference combi- nations of CFQ and NOOP in both guest and host domains. The application here is the Darwin video streaming in a VM with eight VCPUs and eight GB memory. Fig. 6.8 shows the average throughputs and standard deviations of ten runs at dif- ferent I/O scheduler combinations. VIO-perfetching improves most on CFQ-CFQ. One characteristic of CFQ is fairly sharing between VMs. However, this feature may increase the latency and underutilize the bandwidth. VIO-prefetching as an assistant to the request issuer effectively prefetches required data when the issuer is forced to wait by CFQ. Note that the good speedup on CFQ-CFQ does not mean CFQ-CFQ is

the best combination. In fact, NOOP-NOOP has the best performance in this test.


Figure 6.8: Average throughputs and standard deviations with and without VIO- prefetching at different scheduler combinations

Prefetching on Different Workload Mixtures. One user may be running video streaming services and a user’s VM is hosting web pages. But, there are many incentives, e.g., high utilization, for service providers to consolidate different VMs on the same physical machine. Thus, VMs on the same server may not run the same application. In this test, we want to see how VIO-prefetching works at different workload mixes. Because video streaming’s sequential I/O patterns are good for prefetching, we control the number of VMs that run streaming services (Darwin) from 0, 2, 4, 6, to 8 VMs. The rest VMs are randomly assigned to run either a file server, web server, web proxy, or data server. These are all multi-threaded applications and there are totally 8 VMs running concurrently. Each VM has one VCPU and one GB memory. Average speedups and standard deviations of ten runs are shown in Fig. 6.9. The trend shows that VIO-prefetching brings more speedup when the number of video streaming VMs increases in the workload mix. The overall average speedup is 1.07.


Figure 6.9: Box plots of speedups at different workload mixes

Prefetching on Different Request Queue Sizes. A bigger I/O request queue could provide more chances for merging multiple small random requests into a large

146 sequential one. We benchmark Darwin and YCSB3 to test how VIO-prefetching works at various I/O request queue sizes, where Darwin is an open source version of Apple’s QuickTime video streaming server and YCSB3, which emulates Hadoop workloads, is from a performance measurement framework for cloud serving systems, YCSB (Yahoo! Cloud Serving Benchmark) [50]. The default queue size is 128 and both the guest and host have I/O request queues. In our experiments, the guest/host queue is fixed at the default size when varying host/guest queue sizes from 128, 512, 2048, to 8192. Fig. 6.10 shows the average speedups and standard deviations of ten runs.


Figure 6.10: Average speedups and standard deviations at different queue sizes and request patterns

VIO-prefetching has improved Darwin’s performance a lot at the default queue size. When increasing either guest or host queue sizes, the speedup improvement on Darwin is limited. When increasing guest queue sizes, the speedup of YCSB3 could be improved from 1.04 to 1.1 because more random requests become a sequential one from guest machines. However, the large variance shows the unsteadiness at this setting. On the other hand, increasing host queue sizes does not improve the speedup of YCSB3. This setting may not increase the ratio of merging random requests to sequential requests because the host queue is handling requests from all guests and the random guest requests are distributed onto different separate files (virtual disks) on the storage devices.

Chapter 7

Conclusion and Future Work

7.1 Conclusion

This dissertation has presented in-depth research on optimizing I/O virtualization. It systematically studied the causes and consequences of interference effects from co-located VMs; improved virtual I/O performance by mitigating resource contention; achieved the "equivalence" property of virtualization with the best cost-efficiency; and fundamentally changed the virtual I/O prefetching scheme to improve I/O throughput. In brief, this dissertation advances virtualization and cloud computing technology. To study the adverse interference effect on public clouds, we presented Swiper, a novel framework that uses a carefully designed I/O workload to incur significant delays on a targeted application running in a separate virtual machine on the same physical system. Such a performance reduction poses an especially serious threat to data-intensive applications, which issue a large number of I/O requests. Performance degradation directly increases the cost per completed workload in cloud computing systems. Our experimental results demonstrated the effectiveness of Swiper on different types of victim workloads in real-world systems with various numbers of virtual machines.

After exploring the vulnerability of performance isolation in a virtualized environment with Swiper and providing insights into this issue, our subsequent research led us to mathematical models for estimating performance degradation and a novel scheduling framework for optimizing execution in data centers. We designed a management system, TRACON, that mitigates the interference effects from concurrent data-intensive applications and greatly improves the overall system performance. First, we studied the use of statistical modeling techniques to build different models of performance interference, and recommended the non-linear models as the prediction module in TRACON. Second, we developed several scheduling algorithms that work with the prediction module to manage task assignments in virtualized data centers. We also integrated VM migration and consolidation into the management system. The experiments on two clusters showed that TRACON achieved up to a 25% improvement in throughput for real-world cloud applications.

With the knowledge and tools from Swiper and TRACON, we then presented Matrix, a novel performance and resource management system. Matrix utilizes clustering methods with probability estimates to analyze the composition of new workloads. It then constructs a new RP model based on the analyzed components and the RP models of representative workloads. Matrix makes use of the Lagrange algorithm and the kernel functions of SVM to adjust resource configurations for the desired performance at lower cost. With the help of Matrix, the performance of new workloads in a VM becomes more tangible and the operating cost is reduced. Both customers and providers of cloud computing systems benefit from Matrix: users gain a better "look and feel" of the purchased VMs, and providers' revenues increase.

In addition to the models and tools that improve virtualization systems by handling the virtualization overhead and performance interference, we also zoomed in on the virtualization architecture and proposed innovative designs for I/O virtualization. We designed and implemented a virtualization-friendly data prefetcher, VIO-prefetching, for emerging high-performance storage devices, including flash-based SSDs, which detects application access patterns, retrieves data to match both drive characteristics and application needs, and dynamically controls its aggressiveness with feedback. VIO-prefetching effectively improved the performance of a number of I/O-intensive applications under different virtualized configurations. A prototype of VIO-prefetching on a Xen virtualization server demonstrated a noticeable improvement in virtual I/O performance, with a speedup of up to 43%.
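
To make the non-linear interference prediction idea summarized above more concrete, the following toy sketch fits a kernel-based regressor that predicts an application's normalized runtime from the resource consumption observed on co-located VMs. It assumes scikit-learn and uses synthetic data; it only illustrates this class of models and is not TRACON's actual prediction module.

# Toy illustration (not TRACON's actual model): fit a non-linear regressor
# that predicts an application's normalized runtime from resource-usage
# features observed on co-located VMs. Assumes scikit-learn; data is synthetic.
import numpy as np
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Features per sample: [own I/O rate, neighbor I/O rate, neighbor CPU utilization]
X = rng.uniform(0, 1, size=(200, 3))
# Synthetic ground truth: runtime grows non-linearly with neighbor I/O pressure.
y = 1.0 + 0.8 * X[:, 1] ** 2 + 0.3 * X[:, 0] * X[:, 2] + rng.normal(0, 0.02, 200)

model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.01))
model.fit(X[:150], y[:150])
print("prediction error (MAE):", np.abs(model.predict(X[150:]) - y[150:]).mean())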

7.2 Future Work

One prominent characteristic of virtualization is to prevent any program from gaining full control of the system resources [141]. However, the latest NUMA-aware virtualization products fail to keep this property because they rely on a static NUMA topology for VMs, and the configuration cannot be changed unless the VMs are shut down. The challenge here is the lack of mutual understanding between the host and guest systems. As a result, dynamic resource configuration is prohibited because it may lead to inconsistent runtime state and eventually crash the VM. Most related work focuses on co-locating memory and VCPUs on the same physical CPUs. However, blindly co-locating memory pages and VCPUs may hurt performance because of local resource contention. In addition, frequently moving VCPUs among physical cores for local memory access may cause significant scheduling and context-switching overheads. If the hypervisor knows the characteristics of the host machine, e.g., the cost of and pressure on local and remote memory accesses, and the properties of applications, e.g., working set sizes and access patterns, the whole system can be organized for better performance. In short, the challenge for a NUMA-aware virtualization system is to automatically recognize the characteristics of all VMs and the host machine, and to optimize the performance of all VMs.

In addition to these new challenges on the CPU side, non-volatile memory (NVM) brings new challenges and perspectives to the design of data prefetching and cache architectures for virtualization systems. More and more systems may be equipped with various NVM devices as storage or caches. These architectural changes bring challenges as well as opportunities to further improve virtualization systems on such new hardware architectures.

The challenges of optimizing I/O virtualization also extend to storage systems. I/O schedulers are designed to work on I/O queues to optimize the address linearity and locality of I/O requests as well as fairness among processes. I/O queues serve as staging areas before the final requests are sent to the storage devices. It has been shown that I/O queue sizes are related to the performance of I/O systems, and the optimal queue size differs across devices, I/O request patterns, and schedulers. In a virtualized environment with multiple VMs, finding the optimal setting becomes even more challenging because of the interplay among all of these factors across VMs.

Virtualization has become the key component for sharing resources and maintaining availability in data centers. Still, many challenges await researchers. With the insights and the research thrusts from this dissertation, solving these difficult challenges becomes possible. We believe that I/O virtualization can be further optimized through innovative designs of virtualization architectures and management algorithms.
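
As a concrete starting point for the NUMA-aware direction outlined above, a hypervisor-level agent would first need to discover the host's NUMA layout. The sketch below reads the standard Linux sysfs interface for NUMA nodes; it is a simplified illustration under that assumption, not a proposed implementation.

# Minimal sketch: enumerate the host's NUMA nodes, their CPUs, and free
# memory from the standard Linux sysfs interface (/sys/devices/system/node).
from pathlib import Path

def numa_topology():
    topo = {}
    for node in sorted(Path("/sys/devices/system/node").glob("node[0-9]*")):
        cpus = (node / "cpulist").read_text().strip()
        free_kb = None
        for line in (node / "meminfo").read_text().splitlines():
            if "MemFree" in line:
                free_kb = int(line.split()[-2])   # value reported in kB
        topo[node.name] = {"cpus": cpus, "free_kb": free_kb}
    return topo

if __name__ == "__main__":
    for name, info in numa_topology().items():
        print(name, info)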

Bibliography

[1] O. Acıiçmez. Yet another microarchitectural attack: exploiting I-cache. In Proceedings of the 2007 ACM workshop on Computer security architecture, CSAW '07, pages 11–18, New York, NY, USA, 2007. ACM.

[2] O. Acıiçmez, B. B. Brumley, and P. Grabher. New results on instruction cache attacks. In Proceedings of the 12th international conference on Cryptographic hardware and embedded systems, CHES'10, pages 110–124, Berlin, Heidelberg, 2010. Springer-Verlag.

[3] N. Agrawal, V. Prabhakaran, T. Wobber, J. Davis, M. Manasse, and R. Panigrahy. Design tradeoffs for SSD performance. In USENIX Annual Technical Conference, pages 57–70, 2008.

[4] R. Agrawal, C. Faloutsos, and A. N. Swami. Efficient similarity search in se- quence databases. In Proceedings of the 4th International Conference on Foun- dations of Data Organization and Algorithms, pages 69–84, 1993.

[5] A. Aizerman, E. M. Braverman, and L. I. Rozoner. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821–837, 1964.

[6] H. Akaike. A new look at the statistical model identification. Automatic Control, IEEE Transactions on, 19(6):716 – 723, dec 1974.

[7] M. Al-Fares, A. Loukissas, and A. Vahdat. A scalable, commodity data center network architecture. SIGCOMM Comput. Commun. Rev., 38:63–74, August 2008.

[8] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. Basic local alignment search tool. Journal of molecular biology, 215(3):403–410, Oct. 1990.

[9] G. A. Alvarez, E. Borowsky, S. Go, T. H. Romer, R. Becker-Szendy, R. Golding, A. Merchant, M. Spasojevic, A. Veitch, and J. Wilkes. Minerva: An automated resource provisioning tool for large-scale storage systems. ACM Trans. Comput. Syst., 19(4):483–518, 2001.

[10] Amazon EC2. Amazon EC2 instance types. http://aws.amazon.com/ec2/instance-types/.

[11] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, and M. Zaharia. A view of cloud computing. Commun. ACM, 53(4):50–58, 2010.

[12] D. Arthur and S. Vassilvitskii. k-means++: the advantages of careful seeding. In SODA’07, pages 1027–1035, Philadelphia, PA, USA, 2007. Society for Industrial and Applied Mathematics.

[13] M. M. T. Arvind Jain and I. Grigorik. Global site speed overview: How fast are websites around the world?

[14] V. Athitsos, P. Papapetrou, M. Potamias, G. Kollios, and D. Gunopulos. Ap- proximate embedding-based subsequence matching of time series. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pages 365–378, 2008.

[15] J. Axboe and A. D. Brunelle. blktrace user guide, 2007.

[16] H. Aytug, S. Bhattacharyya, G. Koehler, and J. Snowdon. A review of ma- chine learning in scheduling. Engineering Management, IEEE Transactions on, 41(2):165 –171, may 1994.

[17] S. H. Baek and K. H. Park. Prefetching with adaptive cache culling for striped disk arrays. In USENIX Annual Technical Conference, pages 363–376, 2008.

[18] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield. Xen and the art of virtualization. In Proceedings of the nineteenth ACM Symposium on Operating Systems Principles, SOSP, pages 164–177, 2003.

[19] A. Beygelzimer, S. Kakade, and J. Langford. Cover trees for nearest neighbor. In ICML’06, pages 97–104, New York, NY, USA, 2006. ACM.

[20] C. Bienia. Benchmarking Modern Multiprocessors. PhD thesis, Princeton Uni- versity, January 2011.

[21] C. Bienia, S. Kumar, J. Singh, and K. Li. The PARSEC Benchmark Suite: Characterization and Architectural Implications. Princeton University Techni- cal Report TR-811-08, January 2008.

[22] C. M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.

[23] P. Bodik, M. Goldszmidt, A. Fox, D. B. Woodard, and H. Andersen. Fin- gerprinting the datacenter: automated classification of performance crises. In Proceedings of the 5th European conference on Computer systems, EuroSys ’10, pages 111–124, New York, NY, USA, 2010. ACM.

[24] P. Bodík, R. Griffith, C. Sutton, A. Fox, M. Jordan, and D. Patterson. Statistical machine learning makes automatic control practical for internet datacenters. In Proceedings of the 2009 conference on Hot topics in cloud computing, pages 12–12. USENIX Association, 2009.

[25] J. L. Bonebakker. Finding representative workloads for computer system design. Technical report, Mountain View, CA, USA, 2007.

[26] C. Boneti, R. Gioiosa, F. J. Cazorla, and M. Valero. A dynamic scheduler for balancing hpc applications. In SC '08, pages 41:1–41:12. IEEE Press, 2008.

[27] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the fifth annual workshop on Computational learning theory, COLT ’92, pages 144–152, New York, NY, USA, 1992. ACM.

[28] D. Boutcher and A. Chandra. Does virtualization make disk scheduling pass´e? SIGOPS Oper. Syst. Rev., 44(1):20–24, 2010.

[29] T. D. Braun, H. J. Siegel, N. Beck, L. L. Bölöni, A. I. Reuther, M. D. Theys, B. Yao, R. F. Freund, M. Maheswaran, J. P. Robertson, and D. Hensgen. A comparison study of static mapping heuristics for a class of meta-tasks on heterogeneous computing systems. In Heterogeneous Computing Workshop (HCW), pages 15–29, 1999.

[30] A. D. Brown, T. C. Mowry, and O. Krieger. Compiler-based I/O prefetching for out-of-core applications. ACM Trans. Comput. Syst., 19(2):111–170, 2001.

[31] S. Bucur, V. Ureche, C. Zamfir, and G. Candea. Parallel symbolic execution for automated real-world software testing. In Proceedings of the sixth conference on Computer systems, EuroSys ’11, pages 183–198, New York, NY, USA, 2011. ACM.

[32] N. C. Burnett, J. Bent, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. Ex- ploiting gray-box knowledge of buffer-cache management. In USENIX Annual Technical Conference, pages 29–44, 2002.

[33] K. P. Burnham and D. R. Anderson. Model selection and multimodel inference: a practical information-theoretic approach. Springer, 2nd edition, July 2002.

[34] C. Cadar, D. Dunbar, and D. Engler. Klee: unassisted and automatic generation of high-coverage tests for complex systems programs. In Proceedings of the 8th USENIX conference on Operating systems design and implementation, OSDI’08, pages 209–224, Berkeley, CA, USA, 2008. USENIX Association.

[35] P. Carns, K. Harms, W. Allcock, C. Bacon, S. Lang, R. Latham, and R. Ross. Understanding and improving computational science storage access through continuous characterization. In MSST'11, pages 1–14, May 2011.

[36] E. Cecchet, V. Udayabhanu, T. Wood, and P. Shenoy. Benchlab: an open testbed for realistic benchmarking of web applications. In Proceedings of the 2nd USENIX conference on Web application development, WebApps’11, pages 4–4, Berkeley, CA, USA, 2011. USENIX Association.

[37] CERT. http://www.cert.org/tech_tips/denial_of_service.html.

[38] K. Chakrabarti, E. Keogh, S. Mehrotra, and M. Pazzani. Locally adaptive dimensionality reduction for indexing large time series databases. ACM Trans- actions on Database Systems, 27(2):188–228, 2002.

[39] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011.

Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[40] F. Chang and G. A. Gibson. Automatic I/O hint generation through speculative execution. In Proceedings of the third symposium on Operating systems design and implementation, pages 1–14, New Orleans, Louisiana, United States, 1999. USENIX Association.

[41] M. Charikar, C. Chekuri, T. Feder, and R. Motwani. Incremental clustering and dynamic information retrieval. SIAM J. Comput., 33:1417–1440, June 2004.

[42] J. S. Chase, D. C. Anderson, P. N. Thakar, A. M. Vahdat, and R. P. Doyle. Managing energy and server resources in hosting centers. SIGOPS Oper. Syst. Rev., 35:103–116, October 2001.

[43] F. Chen, D. Koufaty, and X. Zhang. Understanding intrinsic characteristics and system implications of flash memory based solid state drives. In Proceedings of the eleventh international joint conference on Measurement and modeling of computer systems, pages 181–192. ACM, 2009.

[44] L. Cherkasova and R. Gardner. Measuring cpu overhead for i/o processing in the xen virtual machine monitor. In Proceedings of the annual conference on USENIX Annual Technical Conference, 2005.

[45] R. C. Chiang and H. H. Huang. TRACON: Interference-aware scheduling for data-intensive applications in virtualized environments. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Stor- age and Analysis, SC ’11, pages 47:1–47:12. ACM, 2011.

[46] E. K. P. Chong and S. H. Zak. An Introduction to Optimization (Wiley- Interscience Series in Discrete Mathematics and Optimization). Wiley- Interscience, 3 edition, Feb. 2008.

[47] L. Ciortea, C. Zamfir, S. Bucur, V. Chipounov, and G. Candea. Cloud9: a software testing service. SIGOPS Oper. Syst. Rev., 43(4):5–10, Jan. 2010.

[48] E. G. Coffman. Computer and Job Shop Scheduling Theory. John Wiley & Sons Inc, New York, 1976.

[49] I. Cohen, S. Zhang, M. Goldszmidt, J. Symons, T. Kelly, and A. Fox. Capturing, indexing, clustering, and retrieving system history. SIGOPS Oper. Syst. Rev., 39(5):105–118, Oct. 2005.

[50] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears. Benchmarking cloud serving systems with YCSB. In SoCC, pages 143–154, 2010.

[51] S. P. E. Corporation. Spec cpu2006. http://www.spec.org/cpu2006/.

[52] K. M. Curewitz, P. Krishnan, and J. S. Vitter. Practical prefetching via data compression. In Proceedings of the ACM SIGMOD international conference on management of data, pages 257–266, 1993.

[53] E. Deelman, G. B. Berriman, G. Juve, Y.-S. Kee, M. Livny, and G. Singh. Clouds: An opportunity for scientific applications? In High Performance Com- puting Workshop, pages 192–215, 2008.

[54] E. Deelman and A. Chervenak. Data management challenges of data-intensive scientific workflows. In CCGRID '08, pages 687–692, May 2008.

[55] X. Ding, S. Jiang, F. Chen, K. Davis, and X. Zhang. Diskseen: exploiting disk layout and access history to enhance i/o prefetch. In USENIX Annual Technical Conference, pages 20:1–20:14, 2007.

[56] N. R. Draper and H. Smith. Applied Regression Analysis. John Wiley and Sons, New York, 1981.

[57] H. Drucker, C. J. C. Burges, L. Kaufman, A. J. Smola, and V. Vapnik. Support vector regression machines. In NIPS, pages 155–161, 1996.

[58] J. Du, N. Sehrawat, and W. Zwaenepoel. Performance profiling of virtual ma- chines. In Proceedings of the 7th ACM SIGPLAN/SIGOPS international con- ference on Virtual execution environments, VEE ’11, pages 3–14, New York, NY, USA, 2011. ACM.

[59] K. Duan, S. Keerthi, and A. N. Poo. Evaluation of simple performance measures for tuning svm hyperparameters. Neurocomputing, 51(0):41 – 59, 2003.

[60] C. Faloutsos, M. Ranganathan, and Y. Manolopoulos. Fast subsequence match- ing in time-series databases. SIGMOD Record, 23(2), 1994.

[61] B. Farley, A. Juels, V. Varadarajan, T. Ristenpart, K. D. Bowers, and M. M. Swift. More for your money: exploiting performance heterogeneity in public clouds. In Proceedings of the Third ACM Symposium on Cloud Computing, SoCC ’12, pages 20:1–20:14, New York, NY, USA, 2012. ACM.

[62] A. J. Ferrer, F. Hernández, J. Tordsson, E. Elmroth, A. Ali-Eldin, C. Zsigri, R. Sirvent, J. Guitart, R. M. Badia, K. Djemame, W. Ziegler, T. Dimitrakos, S. K. Nair, G. Kousiouris, K. Konstanteli, T. Varvarigou, B. Hudzia, A. Kipp, S. Wesner, M. Corrales, N. Forgó, T. Sharif, and C. Sheridan. Optimis: A holistic approach to cloud service provisioning. Future Generation Computer Systems, 28(1):66–77, 2012.

[63] T. Garfinkel and M. Rosenblum. When virtual is harder than real: security challenges in virtual machine based computing environments. In Proceedings of the 10th conference on Hot Topics in Operating Systems, 2005.

[64] B. S. Gill and L. A. D. Bathen. AMP: adaptive multi-stream prefetching in a shared cache. In Proceedings of the 5th USENIX conference on File and Storage Technologies, San Jose, CA, 2007. USENIX Association.

[65] B. S. Gill and D. S. Modha. SARC: sequential prefetching in adaptive re- placement cache. In Proceedings of the USENIX Annual Technical Conference. Berkeley, CA, USA, 2005.

[66] A. Greenberg, J. Hamilton, D. A. Maltz, and P. Patel. The cost of a cloud: research problems in data center networks. 39(1):68–73, 2009.

[67] J. Griffioen. Performance measurements of automatic prefetching. In Proceed- ings of the ISCA International Conference on Parallel and Distributed Com- puting Systems, pages 165—170, 1995.

[68] A. Gulati, I. Ahmad, and C. A. Waldspurger. Parda: proportional allocation of resources for distributed storage access. In FAST ’09: Proccedings of the 7th conference on File and storage technologies, pages 85–98, 2009.

[69] Y. Guo, P. Narayanan, M. A. Bennaser, S. Chheda, and C. A. Moritz. Energy- efficient hardware data prefetching. IEEE Trans. Very Large Scale Integr. Syst., 19(2):250–263, Feb. 2011.

[70] A. Gupta, D. Milojicic, and L. V. Kal´e. Optimizing vm placement for hpc in the cloud. In Proceedings of the 2012 workshop on Cloud services, federation, and the 8th open cirrus summit, FederatedClouds ’12, pages 1–6, New York, NY, USA, 2012. ACM.

[71] D. Gupta, L. Cherkasova, R. Gardner, and A. Vahdat. Enforcing per- formance isolation across virtual machines in xen. In Proceedings of the

ACM/IFIP/USENIX 2006 International Conference on Middleware, pages 342–362, 2006.

[72] W.-S. Han, J. Lee, Y.-S. Moon, and H. Jiang. Ranked subsequence matching in time-series databases. In Proceedings of the 33rd international conference on Very large data bases, pages 423–434, 2007.

[73] T. Hastie, R. Tibshirani, and J. H. Friedman. The elements of statistical learn- ing: data mining, inference, and prediction: with 200 full-color illustrations. New York: Springer-Verlag, 2001.

[74] Z. Hill, J. Rowanhill, A. Nguyen-Tuong, G. Wasson, J. Knight, J. Basney, and M. Humphrey. Meeting virtual organization performance goals through adaptive grid reconfiguration. In Grid Computing, 2007 8th IEEE/ACM Inter- national Conference on, pages 177–184, 2007.

[75] U. Hölzle. Powering a Google search. http://googleblog.blogspot.com/2009/01/powering-google-search.html.

[76] H. Huang, S. Li, A. Szalay, and A. Terzis. Performance Modeling and Analysis of Flash-based Storage Devices. In Proceedings of the IEEE Symposium on Massive Storage Systems and Technologies (MSST), 2011.

[77] H. H. Huang and A. S. Grimshaw. Automated performance control in a vir- tual distributed storage system. In Proceedings of the 2008 9th IEEE/ACM International Conference on Grid Computing, pages 242–249, 2008.

[78] L. Huang, G. Peng, and T.-c. Chiueh. Multi-dimensional storage virtualization. SIGMETRICS Perform. Eval. Rev., 32(1):14–24, 2004.

[79] O. H. Ibarra and C. E. Kim. Heuristic algorithms for scheduling independent tasks on nonidentical processors. J. ACM, 24:280–289, April 1977.

[80] A. Iosup, S. Ostermann, M. Yigitbasi, R. Prodan, T. Fahringer, and D. H. J. Epema. Performance analysis of cloud computing services for many-tasks sci-

160 entific computing. Parallel and Distributed Systems, IEEE Transactions on, 22(6):931–945, 2011.

[81] R. K. Jain. The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling. Wiley, 1 edition, Apr. 1991.

[82] R. A. Johnson and D. W. Wichern, editors. Applied multivariate statistical analysis. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1988.

[83] S. T. Jones, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. Geiger: mon- itoring the buffer cache in a virtual machine environment. In Proceedings of the 12th international conference on architectural support for programming lan- guages and operating systems, pages 14–24, San Jose, California, USA, 2006.

[84] Y. Joo, J. Ryu, S. Park, and K. Shin. FAST: quick application launch on solid- state drives. In Proceedings of the 9th USENIX conference on File and stroage technologies, pages 19–19. USENIX Association, 2011.

[85] M. Kallahalla and P. J. Varman. Optimal prefetching and caching for parallel i/o sytems. In Proceedings of the thirteenth annual ACM symposium on parallel algorithms and architectures, pages 219–228, 2001.

[86] M. Kallahalla and P. J. Varman. Pc-opt: Optimal offline prefetching and caching for parallel i/o systems. IEEE Trans. Comput., 51(11):1333–1344, Nov. 2002.

[87] P. A. Karger and J. C. Wray. Storage channels in disk arm optimization. In IEEE Symposium on Security and Privacy, 1991.

[88] M. Karlsson, C. Karamanolis, and X. Zhu. Triage: Performance differentiation for storage systems using adaptive control. Trans. Storage, 1(4):457–480, 2005.

[89] J. Katcher. Postmark: a new file system benchmark. Network Appliance Tech Report TR3022, Oct. 1997.

161 [90] R. T. Kaushik and M. Bhandarkar. Greenhdfs: towards an energy-conserving, storage-efficient, hybrid hadoop compute cluster. In Proceedings of the 2010 international conference on Power aware computing and systems, HotPower’10, pages 1–9, Berkeley, CA, USA, 2010. USENIX Association.

[91] V. Kazempour, A. Kamali, and A. Fedorova. Aash: an asymmetry-aware sched- uler for hypervisors. In VEE’10, pages 85–96. ACM, 2010.

[92] E. Keller, J. Szefer, J. Rexford, and R. B. Lee. Nohype: virtualized cloud infrastructure without the virtualization. In ISCA, 2010.

[93] E. Keogh. Exact indexing of dynamic time warping. In Proceedings of the 28th international conference on Very Large Data Bases, pages 406–417, 2002.

[94] E. Keogh and S. Kasetty. On the need for time series data mining benchmarks: A survey and empirical demonstration. Data Mining and Knowledge Discovery, 7(4):349–371, 2003.

[95] H. Kim, H. Lim, J. Jeong, H. Jo, and J. Lee. Task-aware virtual ma- chine scheduling for i/o performance. In Proceedings of the 2009 ACM SIG- PLAN/SIGOPS international conference on Virtual execution environments, VEE ’09, pages 101–110, New York, NY, USA, 2009. ACM.

[96] Y. Koh, R. Knauerhase, P. Brett, M. Bowman, Z. Wen, and C. Pu. An analysis of performance interference effects in virtual environments. In In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2007.

[97] R. Kohavi and R. Longbotham. Online experiments: Lessons learned. Com- puter, 40(9):103–105, 2007.

[98] A. Kopytov. Sysbench. http://sysbench.sourceforge.net/index.html.

[99] T. M. Kroeger and D. D. E. Long. Design and implementation of a predictive file prefetching algorithm. In USENIX Annual Technical Conference, pages 105–118, 2001.

[100] J. B. Kruskall and M. Liberman. The symmetric time warping algorithm: From continuous to discrete, chapter In Time Warps. Addison-Wesley, 1983.

[101] S. Kullback and R. A. Leibler. On information and sufficiency. Ann. Math. Statistics, 22:79–86, 1951.

[102] S. Kundu, R. Rangaswami, K. Dutta, and M. Zhao. Application performance modeling in a virtualized environment. In High Performance Computer Archi- tecture (HPCA), 2010 IEEE 16th International Symposium on, pages 1 –10, 2010.

[103] S. Kundu, R. Rangaswami, A. Gulati, M. Zhao, and K. Dutta. Modeling vir- tualized applications using machine learning techniques. In Proceedings of the 8th ACM SIGPLAN/SIGOPS conference on Virtual Execution Environments, VEE ’12, pages 3–14, New York, NY, USA, 2012. ACM.

[104] D. Kusic, J. Kephart, J. Hanson, N. Kandasamy, and G. Jiang. Power and performance management of virtualized computing environments via lookahead control. Cluster Computing, 12(1):1–15, 2009.

[105] M. Kutare, G. Eisenhauer, C. Wang, K. Schwan, V. Talwar, and M. Wolf. Mon- alytics: online monitoring and analytics for managing large scale data centers. In Proceedings of the 7th international conference on Autonomic computing, ICAC ’10, pages 141–150, New York, NY, USA, 2010. ACM.

[106] S. Lacour, C. Perez, and T. Priol. Generic application description model: to- ward automatic deployment of applications on computational grids. In Grid Computing, 2005. The 6th IEEE/ACM International Workshop on, pages 4 pp.–, 2005.

[107] C. Li, K. Shen, and A. E. Papathanasiou. Competitive prefetching for concur- rent sequential i/o. SIGOPS Oper. Syst. Rev., 41(3):189–202, Mar. 2007.

163 [108] H. Li, G. Fox, and J. Qiu. Performance model for parallel matrix multiplication with dryad: Dataflow graph runtime. In Cloud and Green Computing (CGC), 2012 Second International Conference on, pages 675–683, 2012.

[109] Z. Li, Z. Chen, S. M. Srinivasan, and Y. Zhou. C-Miner: mining block correla- tions in storage systems. In Proceedings of the 3rd USENIX Conference on File and Storage Technologies, pages 173–186, 2004.

[110] H. Liu. Amazon EC2 data center size. http://huanliu.wordpress.com/2012/03/13/amazon-data-center-size/.

[111] P. Lu and K. Shen. Virtual machine memory access tracing with hypervisor exclusive cache. In Proceedings of the USENIX Annual Technical Conference, ATC’07, pages 3:1–3:15, Berkeley, CA, USA, 2007.

[112] W. Lu, J. Jackson, J. Ekanayake, R. Barga, and N. Araujo. Performing large science experiments on azure: Pitfalls and solutions. In Cloud Computing Tech- nology and Science (CloudCom), 2010 IEEE Second International Conference on, pages 209–217, 2010.

[113] C. R. Lumb, A. Merchant, and G. A. Alvarez. Façade: Virtual storage devices with performance guarantees. In FAST '03: Proceedings of the 2nd USENIX Conference on File and Storage Technologies, pages 131–144, 2003.

[114] C. R. Lumb, J. Schindler, and G. R. Ganger. Freeblock scheduling outside of disk firmware. In Proceedings of the Conference on File and Storage Technolo- gies, pages 275–288, 2002.

[115] P. Marshall, H. Tufo, and K. Keahey. Provisioning policies for elastic computing environments. In Parallel and Distributed Processing Symposium Workshops PhD Forum (IPDPSW), 2012 IEEE 26th International, pages 1085–1094, 2012.

[116] D. Mattera and S. Haykin. Advances in kernel methods. chapter Support vector machines for dynamic reconstruction of a chaotic system, pages 211–241. MIT Press, Cambridge, MA, USA, 1999.

164 [117] M. Maurer, I. Brandic, and R. Sakellariou. Self-adaptive and resource-efficient sla enactment for cloud computing infrastructures. In Cloud Computing (CLOUD), 2012 IEEE 5th International Conference on, pages 368–375, 2012.

[118] R. McDougall, J. Crase, and S. Debnath. Filebench. http://sourceforge.net/projects/filebench/.

[119] R. McDougall and J. Mauro. Solaris Internals: Solaris 10 and OpenSolaris Kernel Architecture. Prentice Hall, 2006.

[120] Y. Mei, L. Liu, X. Pu, and S. Sivathanu. Performance measurements and analysis of network i/o applications in virtualized cloud. In Cloud Computing (CLOUD), 2010 IEEE 3rd International Conference on, pages 59 –66, 2010.

[121] A. Menon, J. R. Santos, Y. Turner, G. J. Janakiraman, and W. Zwaenepoel. Diagnosing performance overheads in the xen virtual machine environment. In Proceedings of the 1st ACM/USENIX international conference on Virtual execution environments, VEE ’05, pages 13–23, New York, NY, USA, 2005. ACM.

[122] M. P. Mesnier, M. Wachs, R. R. Sambasivan, A. X. Zheng, and G. R. Ganger. Modeling the relative fitness of storage. In Proceedings of the 2007 ACM SIG- METRICS international conference on Measurement and modeling of computer systems, SIGMETRICS ’07, pages 37–48, New York, NY, USA, 2007. ACM.

[123] C. Metz. Flash Drives Replace Disks at Amazon, Facebook, Dropbox. URL:http://www.wired.com/wiredenterprise/2012/06/flash-data-centers/.

[124] Y.-S. Moon, K.-Y. Whang, and W.-S. Han. General match: a subsequence matching method in time-series databases based on generalized windows. In Proceedings of the 2002 ACM SIGMOD international conference on Manage- ment of data, pages 382–393, 2002.

[125] J. Moore, J. Chase, P. Ranganathan, and R. Sharma. Making scheduling “cool”: temperature-aware workload placement in data centers. In USENIX ATC'05, pages 5–5. USENIX Association, 2005.

[126] T. Moscibroda and O. Mutlu. Memory performance attacks: denial of memory service in multi-core systems. In SS’07: Proceedings of 16th USENIX Security Symposium on USENIX Security Symposium, pages 1–18, 2007.

[127] T. C. Mowry, A. K. Demke, and O. Krieger. Automatic compiler-inserted i/o prefetching for out-of-core applications. SIGOPS Oper. Syst. Rev., 30(SI):3–17, Oct. 1996.

[128] K.-R. Müller, A. J. Smola, G. Rätsch, B. Schölkopf, J. Kohlmorgen, and V. Vapnik. Predicting time series with support vector machines. In Proceedings of the 7th International Conference on Artificial Neural Networks, ICANN '97, pages 999–1004, London, UK, 1997. Springer-Verlag.

[129] F. Nadeem and T. Fahringer. Predicting the execution time of grid workflow applications through local learning. In SC’09, pages 33:1–33:12. ACM, 2009.

[130] Y. Nakajima, Y. Aida, M. Sato, and O. Tatebe. Performance evaluation of data management layer by data sharing patterns for grid rpc applications. In Euro-Par 2008 Parallel Processing, volume 5168 of Lecture Notes in Computer Science, pages 554–564. Springer Berlin Heidelberg, 2008.

[131] R. Nathuji, A. Kansal, and A. Ghaffarkhah. Q-clouds: managing performance interference effects for qos-aware clouds. In Proceedings of the 5th European conference on Computer systems, EuroSys ’10, pages 237–250, New York, NY, USA, 2010. ACM.

[132] R. Nathuji, A. Kansal, and A. Ghaffarkhah. Q-clouds: managing performance interference effects for qos-aware clouds. In EuroSys ’10: Proceedings of the 5th European conference on Computer systems, pages 237–250, 2010.

[133] O. Niehörster, A. Brinkmann, A. Keller, C. Kleineweber, J. Krüger, and J. Simon. Cost-aware and slo-fulfilling software as a service. Journal of Grid Computing, 10:553–577, 2012.

[134] W. Norcutt. IOZone filesystem benchmark. In http://www.iozone.org.

[135] D. Novaković, N. Vasić, S. Novaković, D. Kostić, and R. Bianchini. DeepDive: Transparently Identifying and Managing Performance Interference in Virtualized Environments. 2013.

[136] D. Ongaro, A. L. Cox, and S. Rixner. Scheduling i/o in virtual machine moni- tors. In VEE ’08: Proceedings of the fourth ACM SIGPLAN/SIGOPS interna- tional conference on Virtual execution environments, pages 1–10, 2008.

[137] Z. Ou, H. Zhuang, J. K. Nurminen, A. Ylä-Jääski, and P. Hui. Exploiting hardware heterogeneity within the same instance type of Amazon EC2. In Proceedings of the 4th USENIX conference on Hot Topics in Cloud Computing, HotCloud'12, pages 4–4, Berkeley, CA, USA, 2012. USENIX Association.

[138] P. Padala, K.-Y. Hou, K. G. Shin, X. Zhu, M. Uysal, Z. Wang, S. Singhal, and A. Merchant. Automated control of multiple virtualized resources. In Proceed- ings of the 4th ACM European conference on Computer systems, EuroSys ’09, pages 13–26, New York, NY, USA, 2009. ACM.

[139] R. H. Patterson, G. A. Gibson, E. Ginting, D. Stodolsky, and J. Zelenka. In- formed prefetching and caching. SIGOPS Oper. Syst. Rev., 29(5):79–95, 1995.

[140] M. L. Pinedo. Scheduling: Theory, Algorithms, and Systems. Springer Publish- ing Company, Incorporated, 3rd edition, 2008.

[141] G. J. Popek and R. P. Goldberg. Formal requirements for virtualizable third generation architectures. Commun. ACM, 17:412–421, July 1974.

[142] P. Priore, D. De La Fuente, A. Gomez, and J. Puente. A review of machine learning in dynamic scheduling of flexible manufacturing systems. Artif. Intell. Eng. Des. Anal. Manuf., 15:251–263, June 2001.

[143] X. Pu, L. Liu, Y. Mei, S. Sivathanu, Y. Koh, and C. Pu. Understanding performance interference of i/o workload in virtualized cloud environments. In Cloud Computing (CLOUD), 2010 IEEE 3rd International Conference on, pages 51–58, 2010.

[144] A. Quiroz, H. Kim, M. Parashar, N. Gnanasambandam, and N. Sharma. To- wards autonomic workload provisioning for enterprise grids and clouds. In Grid Computing, 2009 10th IEEE/ACM International Conference on, pages 50–57, 2009.

[145] H. Raj, R. Nathuji, A. Singh, and P. England. Resource management for iso- lation enhanced cloud services. In Proceedings of the 2009 ACM workshop on Cloud computing security, pages 77–84, 2009.

[146] L. Ramakrishnan, R. S. Canon, K. Muriki, I. Sakrejda, and N. J. Wright. Evaluating interconnect and virtualization performance for high performance computing. In Proceedings of the second international workshop on Perfor- mance modeling, benchmarking and simulation of high performance computing systems, PMBS ’11, pages 1–2, New York, NY, USA, 2011. ACM.

[147] J. Rao, X. Bu, C.-Z. Xu, L. Wang, and G. Yin. Vconf: a reinforcement learning approach to virtual machines auto-configuration. In Proceedings of the 6th international conference on Autonomic computing, ICAC ’09, pages 137–146, New York, NY, USA, 2009. ACM.

[148] T. Ristenpart, E. Tromer, H. Shacham, and S. Savage. Hey, you, get off of my cloud: exploring information leakage in third-party compute clouds. In CCS, 2009.

[149] F. Rosenblatt. Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Spartan Books, 1962.

[150] M. Rosenblum and T. Garfinkel. Virtual machine monitors: Current technology and future trends. Computer, 38:39–47, 2005.

[151] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach (3rd Edition). Prentice Hall, 3 edition, 2009.

[152] P. Ruth, J. Rhee, D. Xu, R. Kennell, and S. Goasguen. Autonomic live adap- tation of virtual computational environments in a multi-domain infrastructure. In Autonomic Computing, 2006. ICAC ’06. IEEE International Conference on, pages 5–14, 2006.

[153] B. Schölkopf, J. C. Platt, J. C. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Estimating the support of a high-dimensional distribution. Neural Comput., 13(7):1443–1471, July 2001.

[154] B. Schölkopf, A. J. Smola, R. C. Williamson, and P. L. Bartlett. New support vector algorithms. Neural Comput., 12(5):1207–1245, May 2000.

[155] S. R. Seelam and P. J. Teller. Virtual i/o scheduler: a scheduler of schedulers for performance virtualization. In VEE ’07: Proceedings of the 3rd international conference on Virtual execution environments, pages 105–115, 2007.

[156] H. Shan, K. Antypas, and J. Shalf. Characterizing and predicting the i/o performance of hpc applications using a parameterized synthetic benchmark. In SC’08, pages 42:1–42:12. IEEE Press, 2008.

[157] U. Sharma, P. Shenoy, S. Sahu, and A. Shaikh. A cost-aware elasticity provi- sioning system for the cloud. In Distributed Computing Systems (ICDCS), 2011 31st International Conference on, pages 559 –570, june 2011.

[158] K. Shen, C. Stewart, C. Li, and X. Li. Reference-driven performance anomaly identification. In Proceedings of the eleventh international joint conference on Measurement and modeling of computer systems, SIGMETRICS ’09, pages 85– 96, New York, NY, USA, 2009. ACM.

[159] J.-Y. Shin, M. Balakrishnan, L. Ganesh, T. Marian, and H. Weatherspoon. Gecko: a contention-oblivious design for cloud storage. In Proceedings of the 4th

USENIX conference on Hot Topics in Storage and File Systems, HotStorage'12, pages 4–4, Berkeley, CA, USA, 2012. USENIX Association.

[160] E. Shriver, C. Small, and K. A. Smith. Why does file system prefetching work? In USENIX Annual Technical Conference, pages 6–35, 1999.

[161] M. Sivathanu, V. Prabhakaran, F. I. Popovici, T. E. Denehy, A. C. Arpaci- Dusseau, and R. H. Arpaci-Dusseau. Semantically-smart disk systems. In Pro- ceedings of the 2nd USENIX conference on File and Storage Technologies, pages 6–22, 2003.

[162] A. J. Smola and B. Schölkopf. A tutorial on support vector regression. Statistics and Computing, 14(3):199–222, Aug. 2004.

[163] W. Sobel, S. Subramanyam, A. Sucharitakul, J. Nguyen, H. Wong, A. Klepchukov, S. Patil, A. Fox, and D. Patterson. Cloudstone: Multi-platform, multi-language benchmark and measurement tools for web 2.0. In CCA, 2008.

[164] F. Song, A. YarKhan, and J. Dongarra. Dynamic task scheduling for linear algebra algorithms on distributed-memory multicore systems. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC ’09, pages 19:1–19:11, New York, NY, USA, 2009. ACM.

[165] A. A. Soror, U. F. Minhas, A. Aboulnaga, K. Salem, P. Kokosielis, and S. Ka- math. Automatic virtual machine configuration for database workloads. In Proceedings of the 2008 ACM SIGMOD international conference on Manage- ment of data, SIGMOD ’08, pages 953–966, New York, NY, USA, 2008. ACM.

[166] G. Soundararajan and C. Amza. Towards end-to-end quality of service: control- ling i/o interference in shared storage servers. In Middleware ’08: Proceedings of the 9th ACM/IFIP/USENIX International Conference on Middleware, pages 287–305, 2008.

[167] G. Soundararajan, D. Lupei, S. Ghanbari, A. D. Popescu, J. Chen, and C. Amza. Dynamic resource allocation for database servers running on virtual

storage. In FAST '09: Proceedings of the 7th conference on File and storage technologies, pages 71–84, 2009.

[168] SPEC Research Group. Faban Harness and Benchmark Framework. http://faban.org/.

[169] B. Speitkamp and M. Bichler. A mathematical programming approach for server consolidation problems in virtualized data centers. IEEE Transactions on Services Computing, 3:266–278, 2010.

[170] S. Srinath, O. Mutlu, H. Kim, and Y. N. Patt. Feedback directed prefetching: Improving the performance and bandwidth-efficiency of hardware prefetchers. In Proceedings of the 2007 IEEE 13th International Symposium on High Per- formance Computer Architecture, pages 63–74, 2007.

[171] C. Stewart, T. Kelly, and A. Zhang. Exploiting nonstationarity for performance prediction. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Con- ference on Computer Systems 2007, EuroSys ’07, pages 31–44, New York, NY, USA, 2007. ACM.

[172] M. O. Stitson, A. Gammerman, V. Vapnik, V. Vovk, C. Watkins, and J. Weston. Advances in kernel methods. chapter Support vector regression with ANOVA decomposition kernels, pages 285–291. MIT Press, Cambridge, MA, USA, 1999.

[173] R. Susukita, H. Ando, M. Aoyagi, H. Honda, Y. Inadomi, K. Inoue, S. Ishizuki, Y. Kimura, H. Komatsu, M. Kurokawa, K. J. Murakami, H. Shibamura, S. Yamamura, and Y. Yu. Performance prediction of large-scale parallel system and application using macro-level simulation. In Proceedings of the 2008 ACM/IEEE conference on Supercomputing, SC '08, pages 20:1–20:9, Piscataway, NJ, USA, 2008. IEEE Press.

[174] A. S. Szalay, G. C. Bell, H. H. Huang, A. Terzis, and A. White. Low-power amdahl-balanced blades for data intensive computing. SIGOPS Oper. Syst. Rev., 44(1):71–75, 2010.

[175] Y. Tan, X. Gu, and H. Wang. Adaptive system anomaly prediction for large-scale hosting infrastructures. In Proceedings of the 29th ACM SIGACT-SIGOPS symposium on Principles of distributed computing, PODC '10, pages 173–182, New York, NY, USA, 2010. ACM.

[176] Y. Tan, H. Nguyen, Z. Shen, X. Gu, C. Venkatramani, and D. Rajan. Prepare: Predictive performance anomaly prevention for virtualized cloud systems. In Distributed Computing Systems (ICDCS), 2012 IEEE 32nd International Con- ference on, pages 285 –294, june 2012.

[177] The Apache Software Foundation. Olio. http://incubator.apache.org/ olio/.

[178] O. Tickoo, R. Iyer, R. Illikkal, and D. Newell. Modeling virtual machine per- formance: challenges and approaches. SIGMETRICS Perform. Eval. Rev., 37(3):55–60, Jan. 2010.

[179] E. Tromer, D. A. Osvik, and A. Shamir. Efficient cache attacks on aes, and countermeasures. J. Cryptol., 23(2):37–71, Jan. 2010.

[180] A. Uppal, R. Chiang, and H. Huang. Flashy prefetching for high-performance flash drives. In Mass Storage Systems and Technologies (MSST), 2012 IEEE 28th Symposium on, pages 1 –12, april 2012.

[181] G. Urdaneta, G. Pierre, and M. van Steen. Wikipedia workload analysis for decentralized hosting. Elsevier Computer Networks, 53(11):1830–1845, July 2009. http://www.globule.org/publi/WWADH_comnet2009.html.

[182] US Environmental Protection Agency (EPA). Report to congress on server and data center energy efficiency: Public law 109-431. 2008.

[183] S. Uttamchandani, L. Yin, G. A. Alvarez, J. Palmer, and G. Agha. Chameleon: a self-evolving, fully-adaptive resource arbitrator for storage systems. In ATEC ’05: Proceedings of the annual conference on USENIX Annual Technical Con- ference, 2005.

[184] V. N. Vapnik. Statistical learning theory. Wiley, 1 edition, Sept. 1998.

[185] V. Varadarajan, T. Kooburat, B. Farley, T. Ristenpart, and M. M. Swift. Resource-freeing attacks: improve your cloud performance (at your neighbor’s expense). In Proceedings of the 2012 ACM conference on Computer and com- munications security, CCS ’12, pages 281–292, New York, NY, USA, 2012. ACM.

[186] N. Vasić, D. Novaković, S. Miučin, D. Kostić, and R. Bianchini. DejaVu: accelerating resource allocation in virtualized environments. In Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '12, pages 423–436, New York, NY, USA, 2012. ACM.

[187] Venturebeat. Seamicro drops an atom bomb on the server industry, http://venturebeat.com/2010/06/13/seamicro-drops-an-atom-bomb-on-the- server-industry/.

[188] M. Wachs, M. Abd-El-Malek, E. Thereska, and G. R. Ganger. Argon: perfor- mance insulation for shared storage servers. In FAST ’07: Proceedings of the 5th USENIX conference on File and Storage Technologies, pages 5–5, 2007.

[189] X. Wang and M. Chen. Cluster-level feedback power control for performance optimization. In High Performance Computer Architecture, 2008. HPCA 2008. IEEE 14th International Symposium on, pages 101 –110, feb. 2008.

[190] X. Wang, E. Perlman, R. Burns, T. Malik, T. Budavári, C. Meneveau, and A. Szalay. Jaws: Job-aware workload scheduling for the exploration of turbulence simulations. In SC '10, pages 1–11. IEEE Computer Society, 2010.

[191] A. Warfield, R. Ross, K. Fraser, C. Limpach, and S. Hand. Parallax: managing storage for a million machines. In Proceedings of the 10th conference on Hot Topics in Operating Systems - Volume 10, HOTOS’05, pages 4–4, Berkeley, CA, USA, 2005.

[192] B. J. Watson, M. Marwah, D. Gmach, Y. Chen, M. Arlitt, and Z. Wang. Probabilistic performance modeling of virtualized resource allocation. In Proceedings of the 7th international conference on Autonomic computing, ICAC '10, pages 99–108, New York, NY, USA, 2010. ACM.

[193] C. Weng, Z. Wang, M. Li, and X. Lu. The hybrid scheduling framework for virtual machine systems. In Proceedings of the 2009 ACM SIGPLAN/SIGOPS international conference on Virtual execution environments, VEE ’09, pages 111–120, New York, NY, USA, 2009. ACM.

[194] A. Whitaker, M. Shaw, and S. D. Gribble. Denali: a scalable isolation kernel. In Proceedings of the 10th workshop on ACM SIGOPS European workshop, pages 10–15, 2002.

[195] G. Whittle, J.-F. Pâris, A. Amer, D. Long, and R. Burns. Using multiple predictors to improve the accuracy of file access predictions. In 20th IEEE/11th NASA Goddard Conference on Mass Storage Systems and Technologies, pages 230–240, Apr. 2003.

[196] Wikimedia Foundation. Wikipedia: Database download. http://en.wikipedia.org/wiki/Wikipedia:Database_download.

[197] D. Williams, H. Jamjoom, and H. Weatherspoon. The xen-blanket: virtualize once, run everywhere. In Proceedings of the 7th ACM european conference on Computer Systems, EuroSys ’12, pages 113–126, New York, NY, USA, 2012.

[198] T. M. Wong and J. Wilkes. My cache or yours? making storage more exclusive. In USENIX Annual Technical Conference, pages 161–175, 2002.

[199] T. Wood, L. Cherkasova, K. Ozonat, and P. Shenoy. Profiling and mod- eling resource usage of virtualized applications. In Proceedings of the 9th ACM/IFIP/USENIX International Conference on Middleware, Middleware ’08, pages 366–387, New York, NY, USA, 2008. Springer-Verlag New York, Inc.

[200] J. Xu and J. A. B. Fortes. Multi-objective virtual machine placement in virtualized data center environments. In Proceedings of the 2010 IEEE/ACM Int'l Conference on Green Computing and Communications & Int'l Conference on Cyber, Physical and Social Computing, GREENCOM-CPSCOM '10, pages 179–188, Washington, DC, USA, 2010. IEEE Computer Society.

[201] C. Yang, T. Mitra, and T. Chiueh. A decoupled architecture for application- specific file prefetching. In USENIX Annual Technical Conference, FREENIX Track, pages 157–170, 2002.

[202] Z. Yang, H. Fang, Y. Wu, C. Li, B. Zhao, and H. Huang. Understanding the effects of hypervisor i/o scheduling for virtual machine performance interfer- ence. In Cloud Computing Technology and Science (CloudCom), 2012 IEEE 4th International Conference on, pages 34–41, 2012.

[203] J. Zhang, A. Sivasubramaniam, Q. Wang, A. Riska, and E. Riedel. Storage performance virtualization via throughput and latency control. Trans. Storage, 2(3):283–308, 2006.

[204] Y. Zhang, A. Juels, M. K. Reiter, and T. Ristenpart. Cross-vm side channels and their use to extract private keys. In Proceedings of the 2012 ACM conference on Computer and communications security, CCS ’12, pages 305–316, New York, NY, USA, 2012. ACM.

[205] Y. Zhang, W. Sun, and Y. Inoguchi. Predicting running time of grid tasks based on cpu load predictions. In GRID ’06, pages 286–292. IEEE Computer Society, 2006.

[206] Z. Zhang, A. Kulkarni, X. Ma, and Y. Zhou. Memory resource allocation for file system prefetching: from a supply chain management perspective. In Proceedings of the 4th ACM European conference on Computer systems, pages 75–88, 2009.

[207] W. Zheng, R. Bianchini, G. J. Janakiraman, J. R. Santos, and Y. Turner. Justrunit: experiment-based management of virtualized data centers. In Proceedings of the 2009 conference on USENIX Annual technical conference, USENIX'09, pages 18–18, Berkeley, CA, USA, 2009. USENIX Association.

[208] Q. Zhu, J. Zhu, and G. Agrawal. Power-aware consolidation of scientific work- flows in virtualized environments. In Proceedings of the 2010 ACM/IEEE In- ternational Conference for High Performance Computing, Networking, Storage and Analysis, SC ’10, pages 1–12, Washington, DC, USA, 2010. IEEE Computer Society.
