Forgoing hypervisor fidelity for measuring virtual machine performance

Oliver R. A. Chick

Gonville and Caius College

This dissertation is submitted for the degree of Doctor of Philosophy

FORGOING HYPERVISOR FIDELITY FOR MEASURING VIRTUAL MACHINE PERFORMANCE

OLIVER R. A. CHICK

For the last ten years there has been rapid growth in cloud computing, which has largely been powered by virtual machines. Understanding the performance of a virtual machine is hard: There is limited access to hardware counters, techniques for probing have a higher probe effect than on physical machines, and performance is tightly coupled with the hypervisor’s scheduling decisions. Yet, the need for measuring virtual machine performance is high as virtual machines are slower than physical machines and have highly-variable performance.

Current performance-measurement techniques demand hypervisor fidelity: They execute the same instructions on a virtual machine and a physical machine. Whilst fidelity has historically been considered an advantage, as it allows the hypervisor to be transparent to virtual machines, the use case of hypervisors has changed from multiplexing access to a single mainframe across an institution to forming a building block of the cloud.

In this dissertation I reconsider the argument for hypervisor fidelity and show the advantages of software that co-operates with the hypervisor. I focus on producing software that explains the performance of virtual machines by forgoing hypervisor fidelity. To this end, I develop three methods of exposing the hypervisor interface to performance measurement tools: (i) Kamprobes is a technique for probing virtual machines that uses unprivileged instructions rather than interrupt-based techniques. I show that this brings the time required to fire a probe in a virtual machine to within twelve cycles of native performance. (ii) Shadow Kernels is a technique that uses the hypervisor’s memory management unit so that a kernel can have per-process specialisation, which can be used to selectively fire probes, with low overheads (835 ± 354 cycles per page) and minimal operating system changes (340 LoC). (iii) Soroban uses machine learning on the hypervisor’s scheduling activity to report the virtualisation overhead in servicing requests and can distinguish between latency caused by high virtual machine load and latency caused by the hypervisor.

Understanding the performance of a machine is particularly difficult when executing in the cloud due to the combination of the hypervisor and other virtual machines. This dissertation shows that it is worthwhile forgoing hypervisor fidelity to improve the visibility of virtual machine performance.

DECLARATION

This dissertation is my own work and contains nothing which is the outcome of work done in collaboration with others, except where specified in the text. This dissertation is not substantially the same as any that I have submitted for a degree or diploma or other qualification at any other university. This dissertation does not exceed the prescribed limit of 60 000 words.

Oliver R. A. Chick November 30, 2015

ACKNOWLEDGEMENTS

This work was principally supported by the Engineering and Physical Sciences Research Council [grant number EP/K503009/1] and by internal funds from the University of Cambridge Computer Laboratory. I should like to pay personal thanks to Dr Andrew Rice and Dr Ripduman Sohan for their countless hours of supervision and technical expertise, without which I would have been unable to conduct my research. Further thanks to Dr Ramsey M. Faragher for encouragement and help in wide-ranging areas. Special thanks to Lucian Carata and James Snee for their efforts in coding reviews and being prudent collaborators, as well as Dr Jeunese A. Payne, Daniel R. Thomas, and Diana A. Vasile for proof reading this dissertation. My gratitude goes to Prof. Andy Hopper for his support for the Resourceful project. All members of the DTG, especially Daniel R. Thomas and other inhabitants of SN14, have provided me with both wonderful friendships and technical assistance, which has been invaluable throughout my Ph.D. Final thanks naturally go to my parents for their perpetual support.

CONTENTS

1 Introduction
  1.1 Defining ‘forgoing hypervisor fidelity’
  1.2 Limitations of hypervisor fidelity in performance measurement tools
  1.3 The case for forgoing hypervisor fidelity in performance measurement tools
  1.4 Kamprobes
  1.5 Shadow Kernels
  1.6 Soroban
  1.7 Scope of thesis
    1.7.1 Xen hypervisor
    1.7.2 GNU/Linux operating system
    1.7.3 Paravirtualised guests
    1.7.4 x86-64
  1.8 Overview

2 Background
  2.1 Historical justification for hypervisor fidelity
  2.2 Contemporary uses for virtualisation
  2.3 Virtualisation performance problems
    2.3.1 Privileged instructions
    2.3.2 I/O
    2.3.3 Networking
    2.3.4 Increased contention
    2.3.5 Locking
    2.3.6 Unpredictable timing
    2.3.7 Summary
  2.4 The changing state of hypervisor fidelity
    2.4.1 Historical changes to hypervisor fidelity
    2.4.2 Recent changes to hypervisor fidelity
    2.4.3 Current state of hypervisor fidelity
      2.4.3.1 Installing guest additions
      2.4.3.2 Moving services into dedicated domains
      2.4.3.3 Lack of transparency of HVM containers
      2.4.3.4 Hypervisor/operating system semantic gap
    2.4.4 Summary
  2.5 Rethinking operating system design for hypervisors
  2.6 Virtual machine performance measurement
    2.6.1 Kernel probing
    2.6.2 Kernel specialisation
    2.6.3 Performance interference
      2.6.3.1 Measurement
      2.6.3.2 Modelling
      2.6.3.3 Summary
  2.7 Application to a broader context
    2.7.1 Containers
    2.7.2 …
  2.8 Summary

3 Kamprobes: Probing designed for virtualised operating systems
  3.1 Introduction
  3.2 Current probing techniques
    3.2.1 Linux: Kprobes
    3.2.2 Windows: Detours
    3.2.3 FreeBSD, NetBSD, OS X: DTrace function boundary tracers
    3.2.4 Summary
  3.3 Experimental evidence against virtualising current probing techniques
    3.3.1 Cost of virtualising Kprobes
    3.3.2 Cost of virtualised interrupts
    3.3.3 Other causes of slower performance when virtualised
  3.4 Kamprobes design
  3.5 Implementation
    3.5.1 Kamprobes API
    3.5.2 Kernel module
    3.5.3 Changes to the x86-64 instruction stream
      3.5.3.1 Inserting Kamprobes into an instruction stream
      3.5.3.2 Kamprobe wrappers
  3.6 Evaluation
    3.6.1 Inserting probes
    3.6.2 Firing probes
    3.6.3 Kamprobes executing on bare metal
  3.7 Evaluation summary
  3.8 Discussion
    3.8.1 Backtraces
    3.8.2 FTrace compatibility
    3.8.3 Instruction limitations
    3.8.4 Applicability to other instruction sets and ABIs
  3.9 Conclusion

4 Shadow kernels: A general mechanism for kernel specialisation in existing operating systems
  4.1 Introduction
  4.2 Motivation
    4.2.1 Shadow Kernels for probing
    4.2.2 Per-process kernel profile-guided optimisation
    4.2.3 Kernel optimisation and fast-paths
    4.2.4 Kernel updates
  4.3 Design and implementation
    4.3.1 User space API
    4.3.2 Linux kernel module
      4.3.2.1 Module insertion
      4.3.2.2 Initialisation of a shadow kernel
      4.3.2.3 Adding pages to the shadow kernel
      4.3.2.4 Switching shadow kernel
      4.3.2.5 Interaction with other kernel modules
  4.4 Evaluation
    4.4.1 Creating a shadow kernel
    4.4.2 Switching shadow kernel
      4.4.2.1 Switching time
      4.4.2.2 Effects on caching
    4.4.3 Kamprobes and Shadow Kernels
    4.4.4 Application to web workload
    4.4.5 Evaluation summary
  4.5 Alternative approaches
  4.6 Discussion
    4.6.1 Modifications required to kernel debuggers
    4.6.2 Software guard extensions
  4.7 Conclusion

5 Soroban: Attributing latency in virtualised environments
  5.1 Introduction
  5.2 Motivation
    5.2.1 Performance monitoring
    5.2.2 Virtualisation-aware timeouts
    5.2.3 Dynamic allocation
    5.2.4 QoS-based, fine-grained charging
    5.2.5 Diagnosing performance anomalies
  5.3 Sources of virtualisation overhead
  5.4 Effect of virtualisation overhead on end-to-end latency
  5.5 Attributing latency
    5.5.1 Justification of Gaussian processes
    5.5.2 Alternative approaches
  5.6 Choice of feature vector elements
  5.7 Implementation
    5.7.1 Xen modifications
      5.7.1.1 Exposing scheduler data
      5.7.1.2 Sharing scheduler data between Xen and its virtual machines
    5.7.2 Linux kernel module
    5.7.3 Application modifications
      5.7.3.1 Soroban API
      5.7.3.2 Using the Soroban API
    5.7.4 Data processing
  5.8 Evaluation
    5.8.1 Validation of model
      5.8.1.1 Mapping scheduling data to virtualisation overhead
      5.8.1.2 Negative virtualisation overhead
    5.8.2 Validating virtualisation overhead
    5.8.3 Detecting increased load from the cloud provider
    5.8.4 Performance overheads of Soroban
  5.9 Discussion
    5.9.1 Increased programmer burden of program annotations
    5.9.2 Scope of performance isolation considered by Soroban
    5.9.3 Limitation to uptake
    5.9.4 Improvements to machine learning
  5.10 Conclusion

6 Conclusion
  6.1 Kamprobes
  6.2 Shadow Kernels
  6.3 Soroban
  6.4 Future work
    6.4.1 Kamprobes
    6.4.2 Shadow Kernels
    6.4.3 Soroban
    6.4.4 Other performance measurement techniques that forgo hypervisor fidelity
  6.5 Overview

CHAPTER 1

INTRODUCTION

The recent emergence of cloud computing is largely dependent on the popularisation of high-performance and secure x86-64 virtualisation. By using a hypervisor, cloud operators are able to multiplex their hardware, with high performance and strong data isolation, between multiple competing users. This multiplexing allows cloud providers to increase machine utilisation and increase service scalability. Moreover, the hypervisor eases system management with maintenance features such as snapshotting and live migration.

Yet, despite the advantages of virtual machines, they remain slower than physical machines and have highly-variable performance [60]. Whilst efforts have improved both the raw performance and performance isolation of virtual machines, the increased indirection and additional complexity in virtualising privileged instructions make it unlikely that we shall achieve parity of performance. Developers therefore need techniques to help them measure how much slower their applications execute in a virtual machine than they would have done on bare metal. Furthermore, they need to be able to diagnose and fix performance issues that occur in virtualised production systems.

However, using current techniques it is difficult to measure the performance of software when it executes in virtual machines. Many of the methods used to measure the performance of software when executing on bare metal, such as raw access to performance counters, processor tracing, and visibility of hardware performance metrics, are not directly accessible [18], expensive [105], or inaccurate [105, 71] when executing in a virtual machine. The combination of less predictable performance and unavailability of performance-debugging techniques makes it hard to measure the performance of an application executing in a virtual machine.

One technique is to optimise software on bare metal, where access to more hardware features is available, and then to virtualise the software. However, this is a poor approach as virtualisation has different performance impacts on different operations.¹

Currently, the main virtualisation techniques used by hypervisors either have guests execute unmodified code, relying on hardware virtualisation extensions to emulate bare-metal hardware from the point of view of the guest, or execute paravirtualised guests, whereby the virtual machines are made aware that they are executing on a hypervisor and issue hypercalls rather than executing privileged instructions. But such paravirtualisation of mainstream operating systems only applies to the low-level hardware interfaces, typically restricted to the architecture-dependent (arch/) code. As such, performance measurement techniques that execute on a virtual machine exhibit hypervisor fidelity: they execute without consideration of the fact that they are executing in a virtual machine. They are therefore unable to access the same set of counters that they can on physical machines and are unable to explain performance issues, such as CPU starvation of the entire operating system, that do not exist on physical machines.

Slower and less-predictable performance are two of the greatest disadvantages of executing software in a virtual machine, yet current techniques for measuring this performance do not consider the rôle of virtualisation in slow performance. In this dissertation I argue the benefits of forgoing hypervisor fidelity to measure performance. That is, given the importance of measuring the performance of virtual machines, we should turn to forgoing fidelity, in the same way that we have previously forgone fidelity to ameliorate other problems with virtualisation, such as slow performance and the difficulties in virtualising classical x86.

I show that by forgoing hypervisor fidelity it is possible to build performance-analysis techniques that reduce the probe effect of measuring virtual machines and explain performance characteristics of software that one cannot measure without considering the rôle of the hypervisor in executing software.

¹Indeed, I show in Chapter 4 and Chapter 5 that, depending on the operation performed, virtualisation overheads can vary to the extent of changing the shape of a distribution.

1.1 Defining ‘forgoing hypervisor fidelity’

Hypervisor fidelity is a well-defined concept [115]. However, the concept of forgoing hypervisor fidelity is less well defined. In this dissertation I define forgoing hypervisor fidelity as a property of software that is designed for execution on a virtual machine and makes use of the properties of the hypervisor.

1.2 Limitations of hypervisor fidelity in performance measurement tools

Hypervisors date back to early work by IBM in the 1960s, where they were initially used to multiplex access to a scarce, expensive mainframe. However, the current trend of using hypervisors to virtualise cloud infrastructure has its roots in the renaissance that followed fast and secure techniques to virtualise the x86-64 instruction set. The re-emergence of paravirtualisation, the addition of hardware virtualisation extensions, and servers with plentiful memory and CPU capacity throughout the 2000s made it possible to execute many virtual machines on a single server to increase utilisation. This, combined with a consumer movement towards performing computations and storing data on servers, made virtualisation attractive to industry as virtualisation is cheaper and more scalable than executing on dedicated machines.

The rise of cloud computing in recent years has been impressive. Amazon EC2 alone has grown from nine million to twenty-eight million public IP addresses in the past two years [143]. This number is clearly an underestimate of the actual use of virtual machines as it does not include other cloud providers or non-public IP addresses.

However, the performance of virtual machines executing in the cloud is highly-variable [39, 49], with cloud providers now competing on the predictability of their services [9]. Despite this, the tools available to users to measure the performance of their virtual machines have not kept up with the growth in cloud computing. Given the difficulty in correctly virtualising all hardware counters and eliminating performance interference, I show how by forgoing hypervisor fidelity we can build tools that aid with measuring the performance of a virtual machine.

1.3 The case for forgoing hypervisor fidelity in performance measurement tools

Forgoing hypervisor fidelity to ameliorate problems in the virtualisation domain has been repeatedly used in the past. I now explore previous times that we have forgone hypervisor fidelity to improve the utility of virtual machines and argue that contemporary problems mean that it is time to forgo hypervisor fidelity of performance measurement techniques.

The concept of forgoing hypervisor fidelity is almost as old as virtualisation itself. The early literature relating to OS/360 and OS/370 considers the rôle of pure against impure virtual machines, whereby an impure virtual machine executes differently because it has been virtualised. The advantage of impure virtual machines was that they could execute faster than pure virtual machines. In the end, pure virtual machines became the dominant virtual machine type, although techniques such as paravirtualisation borrow from the ideas of impure virtual machines.

More recently, forgoing hypervisor fidelity has been used to overcome classical limitations of the x86 instruction set that meant it was not virtualisable in a way that provided both security and performance. By adopting paravirtualisation to overcome the limitations of classical x86, Xen forgoes hypervisor fidelity since virtual machines execute with knowledge of the hypervisor and issue hypercalls rather than executing non-virtualisable instructions.

Even today, we forgo hypervisor fidelity to overcome performance problems with virtualisation. One problem that virtual machines face is the possibility of not being scheduled when they need to execute, for instance after packets have arrived for the virtual machine. In order for the hypervisor to schedule the virtual machine more favourably when it has work to do, under Xen there are two hypercalls that allow guests to deschedule themselves: yield and block. When a guest is waiting for I/O or the network it can execute the block hypercall, parameterised on the event that it is waiting for. The hypervisor then preempts the guest until the corresponding event is placed on the guest’s event channel, at which point the hypervisor wakes the guest. The advantage in this case of the guest acknowledging the presence of the hypervisor is that by blocking when it cannot make progress, the scheduling algorithm stops consuming credit from the domain. Therefore, when the guest is able to execute, the scheduling algorithm will be more favourable to the domain. Similarly, the yield hypercall allows guests to relinquish their slot on the CPU, without parameterisation, such that they will later be scheduled more favourably. Both the block and yield hypercalls improve the performance of the guest, through forgoing hypervisor fidelity.
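To make this mechanism concrete, the following is a minimal sketch of how a paravirtualised Linux guest issues these two hypercalls through Xen's sched_op interface. The wrapper and constant names are those used by the Linux Xen headers; error handling is omitted, and a real guest only makes these calls from carefully chosen points, such as its idle loop.

    /* Minimal sketch: a paravirtualised Linux guest descheduling itself
     * via Xen's sched_op hypercall. */
    #include <xen/interface/sched.h>   /* SCHEDOP_yield, SCHEDOP_block */
    #include <asm/xen/hypercall.h>     /* HYPERVISOR_sched_op()        */
    #include <linux/irqflags.h>        /* local_irq_enable()           */

    /* Relinquish the remainder of this vCPU's quantum so that the credit
     * scheduler treats the domain more favourably later. */
    static void guest_yield(void)
    {
            HYPERVISOR_sched_op(SCHEDOP_yield, NULL);
    }

    /* Block this vCPU until a pending event-channel event wakes it; event
     * delivery must be unmasked first, otherwise the wake-up can be missed. */
    static void guest_block(void)
    {
            local_irq_enable();
            HYPERVISOR_sched_op(SCHEDOP_block, NULL);
    }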

Even with the advent of hardware virtualisation that allows unmodified virtual machines to execute, we still forgo hypervisor fidelity in the drivers on virtual machines to improve performance. On hardware virtual machines (HVM) the emulation of connected devices (which a tool such as QEMU can provide) is slow; therefore, HVM guests that need more performance are often converted to ‘PV on HVM’ guests, using paravirtualised drivers that replace the emulated devices with drivers that directly issue hypercalls. This allows guests to use the hardware-assisted virtualisation interface when this is fastest, such as when executing a system call, since the lack of rings one and two on x86-64 requires all pure-paravirtualised system calls to perform a context switch through the hypervisor, and to use the paravirtualised interface when this is faster, such as when avoiding hardware emulation. This is an example of the virtual machine forgoing hypervisor fidelity to improve its performance.

As we have seen, forgoing hypervisor fidelity is an oft-used technique for solving problems in the virtualisation domain, in particular performance issues. A significant issue facing virtualisation today is that performance is variable, and yet techniques for measuring the performance of virtual machines have lower utility than techniques for measuring the performance of physical machines. I propose rethinking where we forgo hypervisor fidelity in a mainstream operating system, designed to execute in a contemporary cloud environment.

In this dissertation I show that by building performance measurement tools that do not have strict hypervisor fidelity it is possible to mitigate many of the issues of measuring the performance of a virtual machine. Forgoing hypervisor fidelity should not be controversial given the trend of forgoing hypervisor fidelity to solve performance-related issues.

In the remainder of this chapter I introduce three key methods by which forgoing hypervisor fidelity allows software to report better performance measurements when virtualised. Later, I present each contribution in detail.

1.4 Kamprobes

Current kernel probing mechanisms are built without forgoing hypervisor fidelity. That is, developers execute the same types of probes on virtual machines as they do on physical machines. However, these methods usually rely on setting software interrupts in an instruction stream. Whilst these generally execute well on physical hardware, I show in Chapter 3 that interrupts on a virtual machine are 1.81 times more expensive than interrupts on hardware (§ 3.3.2), as the hypervisor has to execute.

Probes are a common technique for measuring the performance of computer software. By allowing developers to add code at a program’s runtime, probes let them measure wall-clock time, cycles, or other resources used by a piece of code without the burden of modifying the software’s source code, recompiling, and re-executing the software. However, a problem with probes is that when they fire they consume resources, thereby affecting the performance of the application that they try to measure. Whilst this probe effect impacts both physical machines and virtual machines, the overheads are 2.28 times higher on virtual machines than physical machines (§ 3.3). Moreover, virtualisation increases the standard deviation of the number of cycles required to fire a probe from 8 cycles to 869 cycles (§ 3.6.2).

By having higher overheads, probing mechanisms on virtual machines exacerbate the probe effect. This makes it harder to identify the cause of poor performance of applications on virtual machines.

Kamprobes is a technique for probing virtual machines that only uses unprivileged instructions, such that the hypervisor is not involved in a probe firing, and avoids other operations that are expensive in a virtual machine, such as holding locks. Kamprobes forgoes hypervisor fidelity by being designed to execute with maximum performance on a virtual machine. There is only a modest difference between executing in a virtual machine and on a physical machine in the number of cycles (twelve cycles) and the variability (two cycles of standard deviation). Moreover, Kamprobes execute much faster than Kprobes (the current state-of-the-art in Linux kernel probing), with a Kamprobe taking 69 ± 16 cycles to execute, whereas a Kprobe takes 6980 ± 869 cycles to execute (§ 3.6.2). Furthermore—whilst not an issue of virtualisation—when Kprobes determines which handler to execute it performs a lookup that scales as O(n) in the number of probes inserted. The technique that Kamprobes uses does not need to perform a lookup, and so runs in constant time (O(1)). Kamprobes can therefore be used in circumstances that require many probes—such as for a function boundary tracer—for which Kprobes is too slow.
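The interface below is a hypothetical sketch rather than the actual Kamprobes API presented in Chapter 3: the structure, function names and handler signature are illustrative assumptions. It is included only to show the shape of a probing interface whose firing path is an unprivileged, rewritten call into a wrapper rather than an int3 trap.

    /* Hypothetical sketch of a Kamprobes-style interface (illustrative
     * names, not the real API).  Registering a probe patches the probed
     * location with an unprivileged call into a per-probe wrapper, so
     * firing never raises a software interrupt or enters the hypervisor. */
    #include <linux/types.h>

    struct kamprobe {
            void *addr;                       /* instruction to be probed    */
            void (*pre_handler)(void *ctx);   /* runs before the probed code */
            void (*post_handler)(void *ctx);  /* runs after the probed code  */
    };

    /* Rewrite the instruction at kp->addr with a call into the wrapper. */
    int kamprobe_register(struct kamprobe *kp);

    /* Restore the original instruction bytes. */
    void kamprobe_unregister(struct kamprobe *kp);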

1.5 Shadow Kernels

Whilst Kamprobes are a low-overhead technique for probing virtual machines, if they are used—even with empty probe handlers—on hot codepaths, the overhead of them repeatedly firing can significantly reduce performance. In principle this should not be an issue because much of the time developers want to measure the performance of one particular process’s interactions with the kernel in isolation. But there is no current way of setting kernel probes that only fire when one particular process executes.

Shadow Kernels is a technique I developed by which specialisation, such as setting probes, can be applied to a kernel instruction stream on a fine-grained basis such that the specialisation applies to a subset of the processes or system calls executing on the system.

Currently, specialising the operating system kernel makes changes to the kernel instruction stream that affect all processes executing on the system. This is because whenever the kernel instruction stream is modified the address space of every process is modified, as each process maps the shared kernel into its own address space. The underlying issue is that modifications to the instruction stream of the kernel are a global operation, in that the shared instruction stream is executed by all processes. I therefore show that the effect of this is to reduce the performance of all processes executing on the system, regardless of whether their interactions with the kernel were the target of specialisation.

Shadow Kernels requires co-operation of virtual machines with the hypervisor since the virtual machines execute hypercalls that cause the hypervisor to modify the physical-to-machine memory mappings such that the virtual memory containing the kernel instruction stream maps to different machine-physical memory depending on the calling context.

Shadow Kernels is a technique that utilises the indirection of virtualised page tables such that multiple copies of the kernel instruction stream co-exist within a single domain. This allows processes that are not the target of instrumentation to execute their original kernel instruction stream, whilst applications whose interaction with the kernel is the target of specialisation execute a specialised instruction stream.

Building Shadow Kernels without a hypervisor would be challenging: Operating systems are designed with a memory layout such that the kernel resides at a fixed offset in physical memory. However, with Shadow Kernels there are multiple copies of pages that include the kernel instruction stream, with the memory management unit changing which page virtual addresses resolve to. Therefore, there is no longer a fixed mapping between physical and virtual pages in the kernel instruction stream. Furthermore, the hypervisor-based approach makes it easy to port Shadow Kernels to other operating systems.
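As a conceptual sketch only, and not the implementation described in Chapter 4, the fragment below shows how a paravirtualised Linux guest could ask Xen to repoint a single page of kernel text at a shadow copy. The helper name and the shadow_mfn argument, assumed to be a machine frame already holding the specialised copy of that page, are illustrative assumptions; the hypercall and macros are those exposed to Xen guests by Linux.

    /* Conceptual sketch: switch one page of kernel text to a shadow copy
     * by repointing its virtual-to-machine mapping via a Xen hypercall. */
    #include <asm/xen/hypercall.h>  /* HYPERVISOR_update_va_mapping() */
    #include <asm/xen/page.h>       /* mfn_pte()                      */
    #include <asm/pgtable_types.h>  /* PAGE_KERNEL_ROX                */

    static int shadow_switch_page(unsigned long text_va,
                                  unsigned long shadow_mfn)
    {
            /* Build a PTE that points the kernel-text virtual address at
             * the shadow machine frame, read-only and executable. */
            pte_t pte = mfn_pte(shadow_mfn, PAGE_KERNEL_ROX);

            /* Ask Xen to install the mapping and invalidate the stale TLB
             * entry; the next fetch from text_va executes the shadow copy. */
            return HYPERVISOR_update_va_mapping(text_va, pte, UVMF_INVLPG);
    }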

1.6 Soroban

A key issue with executing software in the cloud is that applications often execute more slowly and sometimes with performance interference from other virtual machines [60]. For latency-sensitive applications, in particular, this virtualisation overhead prevents users from switching to virtual machines [142]. However, current application monitoring systems are built with hypervisor fidelity, in that they report the same metrics whether they execute on a physical machine or a virtual machine. As the performance of an application is affected by the hypervisor in a way that is hard to predict, it is currently difficult to measure how much of the latency of a program executing in the cloud is caused by the overheads of virtualisation and how much is due to other causes, such as a high load on the virtual machine. Soroban is a technique that forgoes hypervisor fidelity to measure how much of the latency of a request is due to the overheads of virtualisation. By forgoing hypervisor fidelity throughout the software stack, up to the application, Soroban reports the additional latency imposed on servicing individual requests in a request-response system. This allows developers to measure the additional overheads that their application experiences due to executing in a virtual machine, as opposed to executing on bare metal. By reporting the virtualisation overhead, developers can decide whether the additional overheads are worthwhile.

Soroban uses a modified version of Xen that shares with each domain the activity performed on it by the scheduler, such as timestamps of when the virtual machine is scheduled in and out. Soroban then trains a Gaussian process on the relationship between these variables and the response time of a request-response system. The result of the learning phase is a model that, when given a feature vector of scheduling activity on a domain, reports the impact that these events have on the response time.

I evaluate Soroban, showing that the technique can be applied to a web server to measure the increase in latency due to virtualisation in servicing requests. I demonstrate that as more virtual machines execute concurrently, Soroban reports an increase in the latency attributed to virtualisation, but when the web server executes requests slowly due to high load, Soroban does not increase its measure of virtualisation overhead.
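The snippet below is a hypothetical sketch of what such request annotation could look like; the type, function names and returned unit are illustrative placeholders rather than the Soroban API described in Chapter 5.

    /* Hypothetical sketch of Soroban-style request annotation (illustrative
     * names, not the actual API).  The application brackets each request;
     * the library pairs the observed service time with the scheduling
     * events that Xen shared with the guest over that window and queries
     * the trained Gaussian process for the latency attributable to
     * virtualisation. */
    #include <stdint.h>

    typedef struct soroban_request soroban_request_t;

    /* Mark the start of handling one request. */
    soroban_request_t *soroban_request_begin(void);

    /* Mark the end of handling; returns the estimated virtualisation
     * overhead for this request, in nanoseconds. */
    uint64_t soroban_request_end(soroban_request_t *req);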

1.7 Scope of thesis

In this dissertation, I primarily focus on Xen, executing paravirtualised GNU/Linux on x86-64 hardware. I now justify this choice.

1.7.1 Xen hypervisor

Xen is the hypervisor used by Amazon EC2 [139], which as of May 2015 is ten times larger than the combined size of all its competitors [85]. Given the clear dominance of Xen in the cloud, solutions to problems of measuring performance when virtualised with Xen have a high impact. However, the key contributions of my thesis can be ported to other hypervisors.

1.7.2 GNU/Linux operating system

As of December 2014, 75% of enterprises report using Linux as their primary cloud platform [52], with the market share of Linux virtual machines increasing. Most of the remainder are Windows virtual machines; however, the number of these is falling.

1.7.3 Paravirtualised guests

Presently there are two main techniques for virtualising an operating system in the cloud: (i) hardware extensions allow an unmodified operating system to execute on a hypervisor (HVM), which is a common way of virtualising proprietary operating systems, such as Windows; (ii) the guest operating system is modified such that it is aware that it is executing in a virtualised environment and directly issues hypercalls rather than performing privileged instructions.

The performance of paravirtualised guests is comparable with the performance of hardware virtual machines, with regular changes as to which is the faster form of virtualisation.

In this dissertation I use paravirtualised virtual machines as they have an existing interface with the hypervisor, through which virtual machines can issue hypercalls. As this dissertation proposes forgoing hypervisor fidelity—as such creating ‘paravirtualised’ performance measurement techniques—it is more natural to build these on paravirtualised virtual machines. However, hardware virtual machines often have a paravirtualised interface through which drivers can operate, so many of the ideas could be ported to hardware virtual machines.

1.7.4 x86-64

Whilst instruction sets other than x86-64 are virtualisable, Intel currently has a 98.5% market share in server processors (as measured by number of processors) [77], with much of the remainder being taken by AMD x86-64 processors. As such, I do not consider other instruction sets.

The contributions of Kamprobes in Chapter 3 are particularly tightly-coupled with the x86-64 instruction set. However, the fundamental idea of using unprivileged instructions to build a probing system holds true across other instruction sets. Indeed, on a fixed-width instruction set, such as ARM, this technique is both easier to implement and can be used on more opcodes than on x86-64.

Both Shadow Kernels and Soroban are less reliant on any particular instruction set.

1.8 Overview

In summary, the key contributions of this dissertation are:

Kamprobes. Current probing techniques are built to execute on a physical machine and as such rely on interrupts to obtain an execution context. However, on a virtual machine interrupts are privileged operations, so are expensive. Kamprobes is a low-overhead probing technique for x86-64 virtual machines that executes with near-native performance in a virtual machine.

Shadow Kernels. By forgoing hypervisor fidelity, virtual machines can remap their text section, allowing them to specialise shared text regions, in particular the kernel. Whilst I focus on the use case of scoping kernel probes, the technique can be applied to other types of kernel text specialisation, such as profile-guided optimisation.

Soroban. A key concern that prevents the uptake of virtualisation is the impact of the virtualisation overhead. I show that by building software that acknowledges the presence of the hypervisor in its own monitoring, it is possible to measure the virtualisation overhead of fine-grained activities, such as serving an HTTP request.

The remainder of this dissertation is structured as follows. I explore the background for my thesis in Chapter 2, arguing that the requirement of hypervisor fidelity for performance measurement techniques is a relic of classical hypervisor use cases and can be forgone for contemporary operating systems. In Chapter 3 I introduce Kamprobes, a probing technique for virtualised x86-64 operating systems. In Chapter 4 I propose Shadow Kernels as a solution for specialisation, such as scoping the firing of probes. In Chapter 5 I present Soroban, a technique for using machine learning to report, for each request-response, the additional latency added by executing on the hypervisor.

CHAPTER 2

BACKGROUND

In their 1974 paper Popek and Goldberg state the classical definition of a hypervisor as having three properties: Fidelity, performance and safety [115].

Fidelity. Fidelity represents the concept that a hypervisor should portray an accurate representation of the underlying hardware, such that software can execute on the hypervisor without requiring modification, or being aware that it executes in a virtualised environment. As such, the results of software executing in a virtualised environment must be identical to those obtained when executing on physical hardware, barring any effects of different timing whilst executing on virtualised hardware.

Performance. The performance of a virtual machine must not be substantially slower than when executing on physical hardware. In particular, most instructions that execute must run unmodified, without trap-and-emulation techniques (trap-and-emulation is the only virtualisation technique that Popek and Goldberg consider).

Safety. Virtual machines must act independently, without the ability to interfere with other domains executing on the system. Particularly, virtual machines should not have direct access to shared hardware, with which they can modify the state of another virtual machine in a way that would not be expected of that machine executing on physical hardware.

In this dissertation I propose performance-analysis techniques that are designed to complement virtualisation, by either using code that virtualises well or by using techniques that interact with the hypervisor. As such, this work breaks the traditional definition of a hypervisor in that it no longer offers fidelity. In this chapter, I consider related work to argue that the difficulty of measuring the performance of virtual machines is exacerbated by the requirement of fidelity and that this requirement should be relaxed given the changing uses of hypervisors. Throughout the rest of this dissertation I use this argument to justify performance-analysis techniques that are tightly-coupled with the hypervisor.

2.1 Historical justification for hypervisor fidelity

In this section I consider the historical justification for hypervisors, especially for hypervisor fidelity. I later show that the use cases of hypervisors have changed and, as such, we should reconsider the hypervisor’s original design principles.

The concept of hypervisor fidelity, whilst formalised in 1974 [115], dates back to the start of research into virtual machines by IBM. IBM built early hypervisors that allowed multiple users to concurrently execute on a rare and expensive mainframe with the illusion of being the only user of the machine. That is, each user had the illusion of being the sole user of the machine’s hardware, with their operating system being the only one executing. The key issues that early hypervisors attempt to fix stem from OS/360 using the—now-common [75]—architecture of a machine executing a single kernel that is shared with every process executing on the system: (i) Different users are unable to execute different operating system versions. Due to the lack of availability of mainframes, users were unable to obtain another machine to execute their own operating system version. (ii) Users cannot develop new operating system features in isolation from other users. For instance, if a developer were to extend the operating system, but their code contained a bug, with OS/360 it is not possible to prevent this from affecting concurrent users. As traditional abstractions are lower-level than contemporary abstractions, it was commonplace for developers to regularly need to modify or extend their operating system.

CP-40 is considered to be the first hypervisor, being released in 1967 and able to concurrently execute fourteen virtual machines. As the complexity of hardware increased through the 1970s the use of hypervisors became more practical, and they featured in the development of OS/360 and OS/370 [62, 127]. Behind all IBM work is the control program (CP), which allows concurrent execution of operating systems, each of which has the illusion of executing on physical hardware [56]. The original versions of CP allow an unmodified operating system to execute in a virtualised environment in which CP configures the hardware such that whenever a virtual machine executes a privileged instruction the hardware induces a trap, which CP catches, decodes and emulates in a safe way. There were other early hypervisors, such as the FIGARO system, which was part of the Cambridge Multiple-Access System and had similar design goals [147]. As such, these early hypervisors do provide fidelity, in that the software that executes on them has the same side effects—ignoring timing effects—on both physical and virtual hardware.

2.2 Contemporary uses for virtualisation

Having shown the historical justification for hypervisor fidelity, I now argue that the use case for virtualisation is different from that of the 1960s and 1970s. As such, it is time to reassess the requirement of virtual machine fidelity, in particular to help developers measure the performance of their virtual machines. Rather than building performance tools that explain a subset of what can be viewed on a physical machine, due to limited access to performance counters, we should forgo hypervisor fidelity by building performance analysis techniques that are designed to execute on a virtual machine.

Compared with when hypervisors were pioneered, hardware is now cheaper and more readily available; as such, the original requirements for virtualisation no longer hold: (i) In contemporary computing users have access to many machines, so they are usually able to execute an operating system of choice on a different computer. (ii) The influx of additional hardware also means that development of operating system features can be performed on dedicated development hardware. Indeed, executing production services on the same hardware that is used for operating system development, even when a hypervisor is used, would be unconventional in the current era. In comparison to when virtualisation was pioneered, it is standard practice to have fleets of physical machines just testing changes to operating system source code. Moreover, higher-level abstractions reduce the need for most development work to involve modifying the operating system.

In the last ten years virtualisation has underpinned the move to cloud computing, which in turn has revolutionised computing [6]. A lower-bound indicator of the growth of cloud computing is that Amazon AWS alone has increased from nine million to twenty-eight million EC2 public IP addresses in the past two years [143]. The key benefit of the hypervisor in these cloud computing environments is allowing operators to provide virtual machines to their customers, so that multiple customers can share the same physical server without interference. In particular, hypervisors give a number of advantages to cloud providers:

Higher machine utilisation. By co-hosting virtual machines on a physical server, the utilisation of the physical server increases when compared with executing each service on a dedicated physical machine. Whilst higher utilisation was a key factor in the early work on hypervisors, this was because the mainframes that they executed on were scarce and highly-contested. However, for cloud providers, servers are readily available, but higher utilisation decreases power consumption, cooling, maintenance and real-estate expenditure. In order to increase utilisation, hypervisors now offer features such as memory overcommitting through ballooning [144] and pre-allocation [94]. Although such higher utilisation has remained a benefit of using a hypervisor, the reasons for desiring higher utilisation have changed; as such, the rôle of the hypervisor has changed. The downside to higher utilisation is that it risks starving virtual machines of resources, thereby reducing their performance. Operating system starvation is not a problem that exists when executing on bare metal; therefore, tools that do not forgo hypervisor fidelity cannot report this effect.

Creating virtual machines is fast and cheap. Users can spawn a new, booted virtual machine in less than one second [84]. This is not possible without a hypervisor, since fast boot-up is achieved by forking an already-booted virtual machine, such that the two have the same state. With physical machines, the closest alternatives are techniques such as PXE that aid in reducing the time between connecting a server and it being fully booted. However, for most use cases the main time cost in running a new physical server is actually in finding server hosting and obtaining a physical server. With hypervisors, there is no need for most users to purchase physical hosting and servers, as they can simply pay for a virtual machine from their cloud provider. Moreover, the economics of cloud computing often make it cheaper to execute in the cloud than to build a data centre [136]. This clearly differs from the original use case of a hypervisor, in which being able to rapidly spawn a new machine was not a desired feature.

Scalability to near-infinite computing resource on demand. Usage patterns of Internet-connected applications are highly-variable [119]. In order to respond to spikes in demand, applications need to be elastic, in that they need to execute using more machines during spikes to maintain a quality of service. Hypervisors allow scalability to up to 3 000 virtual machines in a 32-host pool [61]. In cloud computing environments, where hypervisor pools are less common, the number of virtual machines that can execute is bounded by economic factors. As virtual machines are fast to spawn, users can build more scalable software that responds to changes in demand by creating more virtual machines. Such requirements were never present in the early forms of virtualisation, as they operated before the creation of the Internet, so contemporary issues such as the ‘slashdot effect’ and ‘viral trends’ did not exist. Furthermore, the original workloads that executed on a hypervisor were non-interactive batch jobs; therefore, they had different performance requirements to contemporary clouds, where request latency is a key metric.

Live migration of virtual machines. Modern hypervisors can transparently migrate virtual machines between physical hosts [124] without downtime [34] and similarly migrate and load-balance [59] storage between repositories without downtime [94]. This allows system administrators to perform maintenance on physical machines without disrupting a service executing on the virtual machines, since they can first migrate the instance onto another host. As organisations rarely had more than one mainframe when hypervisors were initially designed, this was not a use case of the pioneering work. The downside of live migration is that if the virtual machine is migrated onto a highly-loaded or less powerful host then it may execute more slowly. However, this decrease in performance originates with the cloud provider, so is hard to detect with existing techniques.

High isolation compared with other virtualisation techniques. Kernel security vulnerabilities only affect the domain in which the vulnerability is used. Between 2011 and 2013 there were 147 such exploits for Linux [2]. Compared with other virtualisation techniques that share the same kernel, hypervisor exploits are rarer, with Xen having had just one privilege escalation vulnerability from paravirtualised guests [140]. Since the invention of hypervisors this requirement has increased: Attack vectors are now more readily exploited and there are more commercial requirements for isolation of services.

Backup and restore. There are advantages to providing backup and restore from outside of a domain [152], since it is fast [37], does not require operating system co-operation to access locked files, and cannot be disabled by malicious software. Backup and restore was not a concern for hypervisor design in the 1960s.

Accountability. Accountable virtual machines allow users to audit the software executing on remote hosts by having the software execute on top of a hypervisor that performs tamper-evident logging [64]. Using virtualisation for accountability is a new use-case for hypervisors that they were not originally designed for.

Emulating legacy software. Windows 7 and later versions contain a hypervisor to execute Windows XP. When the Windows instance is itself a virtual machine, the emulator executes using nested virtualisation [66]. Whilst nested virtual machines were considered in early work [147], this was mainly a point of academic enlightenment.

Emulating advances in time. As hypervisors emulate wall-clock time to their guests, they can be used to discover how software will behave at a future point in time [35] or when executing under future, faster hardware [109]. Emulating changes in time was not an original design goal of hypervisors.

I have described a number of ways in which hypervisors are used as part of mainstream cloud-computing environments. In particular, I have shown how the use cases for the hypervisor in 2015 differ from those in the 1960s and 1970s, when the classical definition of the hypervisor was developed. Due to this change in use case, it is reasonable to argue that strict adherence to an outdated definition of the hypervisor should be challenged. One of the recurring themes is the move from serving a batch-processing workload to serving a request-response system in which users need high scalability and low latency. Concurrently, virtual machines now execute in a less predictable environment, with untrusted parties, malicious actors and automated scheduling all acting in ways that affect the performance of virtual machines and that early virtual machines did not experience. As such, the importance of measuring performance has increased, such that fidelity now has lower utility than measuring the performance of virtual machines.

2.3 Virtualisation performance problems

Despite its popularity, a particular problem with virtualisation is that the performance of virtual machines is slower and more variable than the performance of physical machines, yet it is difficult to measure the performance of a virtual machine. As well as contention for shared resources [117], there are other sources of slow performance, which I now explore.

2.3.1 Privileged instructions

Under virtualisation certain instructions become more expensive, such as vmexit, which increases by a factor of between five and twenty-five under virtualisation [122]. Also, as AMD64 only has two usable rings, paravirtualised guests have a user space and kernel space that both execute in ring one, and the hypervisor has to mediate every system call. This makes system calls more expensive in virtual machines than on physical machines, although by how much varies depending on the hardware [31].

2.3.2 I/O

I/O on virtual machines involves a longer data path than on physical machines since the hypervisor has to map blocks from the virtual disks exposed to its guests to physical blocks on storage, which is often remote. I/O operations are a regular source of slow performance [57, 26, 100], being around 20% slower, depending on configuration. Furthermore, the hypervisor’s batching of I/O requests can lead to extreme arrival patterns [22].

2.3.3 Networking

Networking in virtual machines can be unpredictable [98]: When executing on a CPU-contended host compared with a CPU-uncontended host, throughput can decrease by up to 87% and round-trip time can increase from 10 ms to 67 ms [129]. On Xen, two causes of this are the back end of the split-driver being starved of CPU resource, as the driver domain is not scheduled, or the front end of the split-driver being starved, as the scheduler in the virtual machine does not schedule the driver during its scheduling quanta.

The effect of poor networking performance is that there are significant reductions in quality of service as observed by end-users in throughput and delay [26].

2.3.4 Increased contention

When executing as a virtual machine there is higher contention, caused by two sources: Other virtual machines being scheduled, and the hypervisor/domain zero executing. The hypervisor increases contention because switches to the hypervisor, through executing a vm-exit instruction, need to save the state of the virtual machine and restore the state of the next domain [3]. Other virtual machines also cause performance interference, especially for micro virtual machines, which execute on physical hosts with low priority to use the spare CPU cycles left by other virtual machines. Such micro virtual machines are serviced poorly and, to get maximum performance—for the instance type—virtual machines need to inject delays to be scheduled favourably [146].

2.3.5 Locking

Locking has long been known to be problematic on virtual machines. When operating systems are designed, programmers often protect data structures with mutexes and assume that they hold the mutex for a short period of time, as holding a mutex on a shared data structure for a long time is expensive [114]. However, when executing in a virtual machine there is the possibility of a vCPU being preempted whilst it holds a mutex, preventing other threads from making progress [40]. Another problem is lock scalability: unless locks are modified to perform better under a hypervisor, they scale poorly with the number of vCPUs [76].

34 2.3.6 Unpredictable timing

When executing inside a virtual machine, time becomes unpredictable as virtualised time sources are unreliable and behave poorly under live migration [19]. Also, operations that one expects to have a constant time can take an unpredictable amount of time. For instance, techniques such as kernel same-page merging can help reduce the memory overhead of executing in virtual machines by sharing identical pages between virtual machines [101]. However, when a virtual machine modifies a shared page the hypervisor traps and creates a copy of the page specifically for that virtual machine to modify. This makes page access times unpredictable from within the virtual machine [135].

2.3.7 Summary

Despite many advances, virtual machines remain slower and less predictable than physical machines. As it is unlikely that these issues will be completely removed, it is important that users of virtual machines are able to measure the performance of their virtual machine.

2.4 The changing state of hypervisor fidelity

Given the performance overhead of executing in a virtualised environment and the difficulty in measuring this performance in a virtual machine, I propose that virtual machines should forgo hypervisor fidelity for performance measurement techniques. Rather than treating the hypervisor as a physical machine for everything except the lowest layers of the kernel, performance measurement tools should be designed to execute well in a virtual environment and should co-operate with the hypervisor to maximise visibility of performance. Whilst this does involve changing the accepted use of the interface between virtual machines and hypervisors, I now show that changes to this interface have previously been used to ameliorate performance problems in the virtualisation domain.

2.4.1 Historical changes to hypervisor fidelity

Even in the earliest work on hypervisors, there was acceptance that pure virtualisation may not be practical. A concern with the early versions of CP is that they performed slowly, which is largely attributable to using trap-and-emulate to prevent virtual machines from executing privileged instructions and causing them to execute an emulated version. To address this, the evolution into OS/370 introduced the idea of a hypercall [150], in which the virtualised operating system sets up some state to communicate with CP and then uses the DIAGNOSE instruction to transition context into CP [36]. By introducing the concept of a hypercall, IBM acknowledged that building operating systems that adhere strictly to the definition of fidelity is not necessary. Rather, in cases where full emulation of physical hardware has a high cost, it is better to forgo fidelity by making the virtual machine aware that it is executing on a hypervisor and execute a hypercall rather than perform the expensive operation.

I argue that we have the same issue today, whereby current techniques for measuring the performance of a virtual machine execute the same code on virtual machines as they do on physical machines. Therefore, performance measurement tools have lower utility on virtual machines than physical machines as they use code that virtualises poorly and cannot report the cost incurred due to virtualisation. As such, we should reconsider whether applying the technique employed by IBM in 1973 to solve the problems of the day—namely poor performance—can solve the contemporary issue of it being difficult to measure the performance of virtual machines. In particular, we should consider using ‘paravirtualised performance measurement techniques’.

The invention of the hypercall created a debate that continued throughout the 1970s [56] regarding pure versus impure virtual machines, in which a pure virtual machine is a guest that runs unmodified code, whereas an impure virtual machine runs modified code. In particular, there is consideration of the position of the hypervisor interface, since the hypervisor can either simulate high-level actions, such as reading a line, or can simulate the individual instructions involved in performing the high-level action [16].

2.4.2 Recent changes to hypervisor fidelity

With the popularisation of (early versions of) x86, virtualisation became harder as the instruction set does not provide trap-and-emulate ability for privileged instructions such as SIDT, SGDT and SLDT [121]. Therefore, to virtualise traditional x86, one has to use binary translation, the process by which the instruction stream is scanned and privileged instructions are rewritten with function calls to emulating functions. Performing full binary translation is a slow process [78], so early x86-64 hypervisors were either slow or insecure [121]. Those that are slow fail the hypervisor definition as they do not provide the performance property. Furthermore, as Popek and Goldberg’s hypervisor definition is tightly-coupled with trap-and-emulate techniques in its formalisation of fidelity, such that virtual machines cannot execute a modified instruction stream, binary rewriting is not considered classically virtualisable [1].

To resolve the issues of virtualisation on traditional x86, Barham et al. built Xen, a hypervisor that uses paravirtualisation to emulate x86 with performance, strong isolation and unreduced functionality [12]. In using paravirtualisation, Xen requires that operating systems be modified to issue hypercalls rather than to execute with true fidelity when issuing privileged instructions. One contribution of Xen was to paravirtualise the memory management unit, in that guests’ page tables are mapped read-only and the guest has to issue a hypercall to update them. This design allows virtual machines to directly map virtual addresses to the addresses of the memory on the physical server (machine physical frames), rather than have shadow page tables that give the illusion of executing in an independent address space. In overcoming the shortcomings of x86 by forgoing hypervisor fidelity, Xen is much like my proposal of forgoing hypervisor fidelity to overcome the shortcomings of performance measurement of virtual machines.

More recent advances in the x86-64 instruction set undeniably restore a degree of fidelity to the hypervisor by allowing unmodified virtual machines to execute in a hardware virtual machine (HVM) container [141]. HVM containers extend the architecture of x86-64 so as to provide a privileged mode [123] (sometimes considered a ‘negative’ ring) in which the hypervisor executes, and to which calls to privileged instructions cause the processor to transition. Whilst this increase in fidelity does create some advantages, for instance operating systems can migrate between executing as a physical and virtual instance [83], I nevertheless still argue that this increase in fidelity only came when hardware had advanced sufficiently (for instance with Intel VT-x) such that fast and secure x86 virtualisation was no longer problematic. Should future hardware allow virtual machines to measure their performance to the same degree as physical machines, then restoring fidelity to measuring the performance of virtual machines may be reasonable. There is already limited evidence of hardware advances increasing the ability of a virtual machine to measure its performance [104].
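To make the paravirtualised MMU interface described above concrete, the sketch below shows roughly how a paravirtualised Linux guest asks Xen to update one of its read-only page-table entries through the mmu_update hypercall instead of writing the entry directly. It is a minimal illustration, not code from this dissertation: the machine address of the page-table entry and the new entry value are assumed to have been computed elsewhere, and batching and error handling are omitted.

/*
 * Sketch: updating a single page-table entry from a paravirtualised guest.
 * The guest cannot write the PTE directly because Xen maps guest page
 * tables read-only; instead it submits the update to the hypervisor for
 * validation.
 */
#include <linux/types.h>
#include <xen/interface/xen.h>
#include <asm/xen/hypercall.h>

static int update_guest_pte(u64 pte_machine_addr, u64 new_pte_val)
{
        struct mmu_update req = {
                /* Machine address of the PTE; low bits select the request type. */
                .ptr = pte_machine_addr | MMU_NORMAL_PT_UPDATE,
                .val = new_pte_val,
        };

        /* One request, no success count needed, applied to this domain. */
        return HYPERVISOR_mmu_update(&req, 1, NULL, DOMID_SELF);
}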

2.4.3 Current state of hypervisor fidelity

Despite the increase in hardware virtualisation, I still argue that it is commonplace for the software stacks that execute on the hypervisor to not exhibit strict fidelity. This is principally due to the process of re-hosting an application on infrastructure as a service, during which developers are encouraged to make use of properties of the cloud, such as the scalability of virtual machines [103]. As a result, within virtual machines there are differences in the software stack when compared with physical machines. As such, forgoing hypervisor fidelity in performance measurement techniques is not a radical move.

2.4.3.1 Installing guest additions

All high-performance hypervisors that use hardware virtualisation techniques still provide extensions to improve the performance of their guests: XenServer Guest Tools, VirtualBox Guest Additions, and VMware Tools are some examples. These typically provide drivers that allow the guest operating system to communicate directly with the hypervisor so that full emulation of devices is not required. However, installing such extensions reduces the fidelity of the virtual machine, since by using different drivers, the virtual machine executes differently on physical and virtual hardware.

2.4.3.2 Moving services into dedicated domains

There is a growing trend to use virtual machine introspection to provide services that would traditionally have been provided by processes or the operating system [24]. For example, Bitdefender performs malware detection from a separate, privileged domain, which prevents malware from attacking the malware detection program, as it is commonplace for viruses to attack antivirus mechanisms [88]. Furthermore, most commercial hypervisors now support virtual machine snapshotting, a feature typically performed by the filesystem. There are also proposals to move monitoring into a separate domain [82]. Given the trend of separating services out such that they execute outside of the original domain, I argue that hardware virtualisation does not achieve full fidelity, since if those operating systems were to execute on physical hardware, they would need reconfiguring such that they execute processes to perform all of these features.

2.4.3.3 Lack of transparency of HVM containers

Even when executing inside a hardware virtual machine container, which is supposed to provide fidelity, the interface with the hypervisor still differs from that provided by exclusive use of hardware. One demonstration of this difference in interface is malware that detects the presence of a hypervisor through irregularities in the availability of resources, such as CPU cycles, caches, and the TLB, and refuses to execute a payload [149]. Furthermore, the timing properties of a virtual machine differ from those of a physical machine: virtualisation overhead changes the time required to access hardware that is emulated by the hypervisor, hidden page faults are caused by access to hypervisor-protected pages, and virtualised instructions (such as cpuid) have different timing from non-virtualised instructions (NOP) [54]. Given that the interface to the hypervisor is leaky, I argue that we should acknowledge this difference throughout the software stack, rather than maintaining fidelity.

2.4.3.4 Hypervisor/operating system semantic gap

The performance of a virtual machine can be improved if the hypervisor is better able to predict the virtual machine’s actions. There are two main techniques for improving the prediction rates: monitoring the virtual machine with knowledge of its data structures, so as to be able to improve decisions and policies, which can increase the cache hit ratio of a virtual machine by up to 28% [73]; or moving functionality into the hypervisor from the guest [87]. The latter reduces fidelity and the former requires co-operation, therefore we observe deviation from the standard definition of a hypervisor.

2.4.4 Summary

I have now shown that since the advent of the hypervisor, forgoing hypervisor fidelity has been a common solution to solving problems in the realms of virtualisation. Even today, with hardware virtual machines, virtual machines do not strictly provide fidelity. My demonstration that forgoing hypervisor fidelity has successfully been used to solve past problems with virtualisation confirms my thesis that the use of the interface should change so as to improve the utility of performance measurement tools.

2.5 Rethinking operating system design for hypervi- sors

There is considerable research literature that reconsiders the rôle of the operating system, when executing in the cloud, from the ground up, which often forgoes fidelity to increase utility.

Library operating systems, such as OSv, recognise that in a typical cloud software stack there is a hypervisor, an operating system and a language runtime [80]. Each of these performs abstraction and protection, at the cost of an increased footprint and performance overhead, such as a 22% impact on the throughput of lighttpd. Library operating systems replace everything that executes ‘above’ the hypervisor with a single binary, so that the hypervisor performs abstraction and protection [80]. Similarly, Mirage is designed to execute on a hypervisor only, making use of the small hypervisor interface [91], thereby improving on Linux in terms of boot time, I/O throughput and memory footprint.

SR-IOV increases fidelity by letting operating systems directly interact with the network interface card, with the hardware ensuring isolation [41]. However, SR-IOV can be used in unconventional ways: Dune is a hypervisor-like project that uses hardware virtualisation features to allow userspace direct access to safe hardware features, such as ring protection, page tables and the TLB [14]. Belay et al. achieve this by using hardware extensions built for virtualisation but have their lowest layer of software still expose an abstraction of a process, rather than hardware. Furthermore, Arrakis [113] and IX [15] use SR-IOV to separate the control and data plane so as to increase networking throughput of commodity hardware.

The work that I present in this dissertation focuses on applying performance measurement techniques to mainstream operating systems in the cloud. As research operating systems are not yet mainstream I do not explicitly show the benefits that they would receive. However, the key techniques in all three of my contributions could be applied to such operating systems, without causing divergent behaviour between virtual and physical machines.

2.6 Virtual machine performance measurement

Having argued that the requirement for hypervisors to exhibit fidelity is overly restrictive and that forgoing hypervisor fidelity has previously been used to solve problems in the virtualisation domain, I now explore work related to virtual machine performance.

2.6.1 Kernel probing

Probing has a rich history that goes back to the dawn of computers. The first use of probing is believed to have been by Maurice Wilkes, who inserted sub-routines into code executing on the EDSAC. These sub-routines would print ‘distinctive’ symbols at intervals throughout a program so that the operator could determine an error [55]. Later computers, starting with the UNIVAC M-460, included programs such as DEBUG that let operators specify addresses at which to insert additional code that could be used for debugging [47].

Contemporary operating systems have a probing system to allow users to debug their software and measure its performance. Linux uses Kprobes [107] and Windows uses Detours [69]. NetBSD [106], FreeBSD [96], and OS X all use DTrace, which embodies a probing system in a wider instrumentation system. There has been further work on these systems to optimise them [68] as the benefits of fast probing have long been known [79]. However, with the exception of Windows Detours, these all use interrupt-based probing techniques.

Previous work has shown another technique for probing, based on jumps, which are often faster than executing interrupts [137, 138]. Windows Detours was the first of these jump-based probing systems that preserves the semantics of the target function as a callable subroutine [69]. However, whilst there is some benefit from using jump-based techniques on physical machines, I show that their utility when applied to virtual machines is much higher. This is due to interrupt-based techniques virtualising poorly.

There has been some consideration of changing the nature of operating system probing in the virtualised environment by disaggregating probe handlers into a separate domain [118]. However, this has not received popular uptake.

2.6.2 Kernel specialisation

Kernel specialisation is not a new concept: early work on the Synthesis kernel pioneered kernel specialisation by generating efficient kernel code that acts as fast-paths for applications [116]. The advantages of kernel specialisation are well known [23, 17]: profile-guided optimisation of Linux improves the kernel performance by up to 10% [151] and exokernels [45] remove kernel abstractions so that applications interact with hardware through fewer layers of indirection, thereby reducing kernel overheads. For instance, Xok is an operating system with an exokernel whereby a specialised web server has over four times the throughput of a non-specialised web server [74]. Indeed, the benefits of specialisation are a key feature of Barrelfish, an operating system redesign that allows kernel specialisation such that cores run different kernels [125], and of Dune, which allows applications access to privileged CPU features [14]. Another possible operating system redesign to allow kernel specialisation is using microkernels, since only a small set of features are then executed by an operating system mapped into every process; rather, user space services can provide competing specialised implementations of features [86].

In Chapter 4 I introduce Shadow Kernels, a technique that allows per-process kernel specialisation by having applications that acknowledge the presence of the hypervisor and execute code that causes the hypervisor to switch the underlying memory of the domain’s kernel. The key benefit of Shadow Kernels is to allow multiple kernel instruction streams to execute on a single machine. There do indeed exist techniques for executing multiple kernels already, however they all differ from Shadow Kernels. Executing processes inside virtual machines allows multiple kernels to execute on a single machine [36]. However, each kernel will still typically support multiple processes executing on it, whereas Shadow Kernels can target individual processes.

The technique used in Shadow Kernels of modifying kernel instruction streams is well-established. For instance, Ksplice modifies the kernel instruction stream to binary patch security updates into a kernel without rebooting the machine [7], but this is a global change that affects all processes, whereas Shadow Kernels can restrict that patch to an individual process. Furthermore, malware can use memory management tricks to hide itself from detection by unmapping memory containing the rootkit [133]. Shadow Kernels differs in that rather than hiding malware it allows multiple kernel instruction streams to coexist. Similarly, Mondrix uses changes to the MMU to provide isolation between Linux kernel modules [148], albeit with a performance overhead of up to 15%.

2.6.3 Performance interference

A key concern with executing virtual machines in the cloud is performance interference, whereby two or more virtual machines compete for resources. Hypervisors are designed to have strong performance isolation guarantees, by having coarse-grained scheduling and no sharing of data structures between virtualisation domains [12]. In particular, many services in the cloud—as well as in other circumstances [48]—are latency-sensitive in that they require low and predictable latency [32]. However, achieving predictable latency without performance isolation is hard. This lack of perfect performance isolation makes it difficult to virtualise some workloads [67]. Whilst executing in the cloud allows some detection of performance anomalies before deploying some services [134], this remains an unsolved problem in the general case.

2.6.3.1 Measurement

Researchers have long studied methods of reducing performance interference of operating systems, in particular with the rise of latency-sensitive applications such as video-streaming [66]. With the rise of hypervisors, there has been further work in reducing performance interference, whilst increasing utilisation of hardware, by using a custom scheduler that limits the resources consumed by virtual machines in their domain and in driver domains, such as domain zero [63].

However, in current cloud deployments, virtual machine workloads can interfere badly with each other: for instance, the IOPS available to a virtual machine can fluctuate wildly depending on the other virtual machines executing [60], and poor scheduling causes performance interference, for instance colocating a random and a sequential load reduces performance for the sequential load [58]. Some work improves on the performance guarantees in the cloud, for example with virtual datacentres that have guaranteed throughput. An implementation of a virtual datacentre is Pulsar, which modifies the hypervisors in the cloud to use a leaky bucket per virtual machine on shared resources to guarantee performance [4].

Whilst guaranteeing performance isolation is preferable, whenever the machine is saturated by its virtual machines there is necessarily performance interference, in which case monitoring and reporting the performance is possible. There are many ways of measuring the performance of an operating system. Modern operating systems, such as Linux, have a wealth of tools to help measure operating system performance. For instance, Linux has FTrace, perf, SystemTap [43], KLogger [46] and numerous domain-specific tools. Another method, originally implemented on a modified Digital UNIX 4.0D kernel, reports the resource consumption of resource containers, rather than of processes and threads [11].

However, none of these methods distinguishes poor application performance from the overheads of virtualisation. That is, these tools are unable to report if the virtual machine is starved of resources. Not only do these tools not inform users of virtualisation overhead, they are often unable to access the same set of hardware features as a physical machine to accurately report performance to domains.1 Xenoprof is currently the only attempt to provide Xen virtual machines with a way of measuring performance [99]. However, Xenoprof is incompatible with recent versions of Xen. The technique that I present in Chapter 5 differs in that it requires developers to annotate their programs to indicate the processing of requests—much like is required by X-Trace [51]—but then reports the overheads of virtualisation, rather than the performance of the virtual machine, and gives these details on a per-request basis. Calculating this overhead requires applications to have information about how the virtual machine in which they execute is scheduled. Having a hypervisor expose its inner state is similar to how Infokernels expose kernel internals across the interface with applications [8].

2.6.3.2 Modelling

There has been work performed by the modelling community that looks into performance interference between virtual machines. This work largely models which workloads interact badly with each other in order to build better virtual machine placement algorithms. This differs from the technique that I present in Chapter 5, which is a measurement technique for helping to measure the performance of clouds as they execute. An example of modelling performance interference is hALT, which uses machine learning, trained on a dataset from Google [120], to model which workloads cause performance interference [28]. Q-Clouds models CPU-bound virtual machines using a multiple-input multiple-output model whereby they take online feedback from an application, use this as an input to the model and use the output to place virtual machines more effectively [102]. TRACON is similar to Q-Clouds, but focusses on I/O-intensive workloads [28]. Casale et al. produce models of virtual machine disk performance, based on monitoring the hypervisor’s batching of I/O requests and the arrival queue [22]. CloudScope improves on modelling the performance of virtual machine interference by doing away with the need for machine learning or queuing-based models, modelling virtual machine performance using Markov chains to achieve a low-error model that is not tightly coupled with an application [25].

This work all differs from Soroban in that it models the performance of an entire virtual machine. The virtual machine being modelled is typically assumed to be in a steady state for a prolonged period of time (perhaps several minutes in length) and the model finds the best placement of virtual machines to minimise performance interference. However, Soroban is a measurement technique that reports the additional latency incurred in servicing a single request in a request-response system. That is, Soroban measures whether, during the servicing of a request, the virtual machine was scheduled out and reports the corresponding cost of this.

1 vPMU is an upcoming (as of 2015-09-17) feature for Xen and Linux.

2.6.3.3 Summary

I have shown that there is a field of work that considers how to instrument and measure the performance of machines but does not consider the rôle of the hypervisor in these measurements. Moreover, there is a separate field of work that models the performance of virtual machines based on factors like the number of executing virtual machines. This work therefore does consider the effect of the hypervisor on the performance of virtual machines.

Soroban lies in between these two fields: it allows per-request measurement of the performance overhead imposed on that request by the hypervisor. This allows the techniques commonly used by the measurement community to comprehend slow performance of data centres to also consider the rôle of the hypervisor in causing latency. This is something that the efforts into modelling cannot currently report.

2.7 Application to a broader context

Having argued for reducing the requirement of hypervisor fidelity, I now consider the implications of my thesis in a wider context. Specifically, I investigate the implications for containers and microkernels.

2.7.1 Containers

Given the rise in popularity of containers, it is natural to consider the relationship of my thesis to contemporary container systems. Containers are a lightweight virtualisation method whereby, rather than executing virtual machines on a hypervisor, users run services from a container within an operating system context. There are two key advantages of executing using containers rather than virtual machines. First, containers have a higher abstraction level as users only need to maintain their own application, rather than an entire operating system stack [145]. This therefore reduces the amount of work for system administrators, since the cloud provider manages operating system updates. Secondly, containers are more lightweight than virtual machines, with numbers suggesting that container density can be up to twelve times higher than virtual machine density [2], although there remain questions about how to schedule such a high density of containers. The lightweight nature of containers also makes scheduling more efficient since it only involves a context switch, rather than a vm-exit/entry event [72].

The key difference between virtual machines and containers is the interface: containers expose the interface of an operating system against which users can issue system calls, whereas hypervisors expose an interface that mirrors that of physical hardware. As containers do not try to exhibit fidelity, current performance measurement techniques have higher utility for containers than for virtual machines: containers are collections of processes executing on a shared system, and existing techniques explain the performance of processes executing on a shared system.

However, improvements to virtual machines are not orthogonal to improvements to containers since it is commonplace to execute containers inside virtual machines, as is performed by Amazon and Google [70]. As such, containerised applications executing on a virtual machine can use the techniques that I present in this dissertation to measure the performance of their virtual machine.

2.7.2 Microkernels

Microkernels are operating systems in which the kernel is intentionally kept minimal in terms of features and size [86]. Services that a monolithic operating system kernel provides are largely provided by user space services. Despite their popularity in some academic circles [81], there has been limited commercial uptake of microkernels. Indeed, some claim that the hypervisor is the microkernel ‘done right’ [65], since there are technical similarities: the hypervisor (in particular, Xen) is an intentionally small, verifiable codebase with a minimal interface. However, the key difference between a microkernel and a hypervisor is that a hypervisor exposes the interface of physical hardware, whereas a microkernel exposes an API—usually based on message passing—to its processes. Therefore, applications and services executing on a microkernel expect to share access to the underlying hardware with other applications and services, whereas virtual machines expect total access to the hardware.

As such, the abstraction level of a microkernel is higher and there is no requirement that processes executing on a microkernel exhibit fidelity with execution on hardware, so current performance measurement techniques have little to gain from the contributions I make in this dissertation.

2.8 Summary

In this chapter I have presented the classical definition of a hypervisor as providing fidelity, performance and safety. I then argue that this classical definition of a hypervisor should be reconsidered given the changing use case of hypervisors since their invention in the 1960s. In particular, I argue that the requirement of fidelity has previously been forgone to reduce some of the limitations of virtualisation, such as virtualising x86. The contemporary issue facing the cloud is that the performance of virtual machines is less predictable and harder to measure than the performance of physical machines. As such, I advocate using the same technique, of forgoing hypervisor fidelity, as an approach to improving the ease with which one can measure virtual machine performance. In the rest of this dissertation I show how, by forgoing fidelity of virtualisation, we can build software with higher utility that improves on the state of the art in kernel specialisation, tracing and monitoring.

CHAPTER 3

KAMPROBES: PROBING DESIGNED FOR VIRTUALISED OPERATING SYSTEMS

Probe points are a common measurement technique that underpins methods for evaluating the performance of software. For instance, dynamic instrumentation uses probe points to intercept a program’s control flow to execute an instrumentation handler. However, when a probe fires it incurs a probe effect, whereby the resources consumed executing the probe change the performance characteristics of the program being tested. In this chapter I show that since current probing mechanisms are designed for physical machines they rely on interrupts, which are 1.8 times more costly on a virtual machine than a physical machine. I therefore present Kamprobes, a (kernel) probing system that forgoes hypervisor fidelity by only using unprivileged instructions and techniques that perform well under virtualisation, so that Kamprobes fire with near-native speed. Whilst current state-of-the-art techniques take approximately twice as many cycles to execute in a virtual machine than on a physical machine, Kamprobes take just twelve cycles more to execute on a virtual machine than on a physical machine. As well as having near-native performance when virtualised, Kamprobes are also fast to execute, requiring 69 ± 16 cycles, compared with a best-case performance of 6 980 ± 869 cycles for the current state-of-the-art when virtualised. Moreover, the variability of Kamprobes in a virtual machine (σ = 16.2 cycles) is approximately the same as on a physical machine (σ = 14.3 cycles). This is an improvement when compared to current techniques, where under a virtual machine performance is very unpredictable, with a probe taking anywhere between 5 000 cycles and 40 000 cycles to fire.

In addition to virtualising well, Kamprobes fixes the scalability issues that are associated with using Kprobes—the probing system used by the Linux kernel. The cost of inserting and firing a Kprobe grows with O(n), where n is the number of probes already inserted, whereas the same operations take O(1) time with Kamprobes. Therefore, the best-case speedup of Kamprobes is unbounded.

Some work in this chapter is a result of a collaboration. Dr Ripduman Sohan initially suggested designing a low-overhead Linux probing mechanism and the initial implementation of Kamprobes was a result of ‘pair programming’ with Lucian Carata, James Snee, and myself. Throughout this chapter I acknowledge all intellectual contributions to the work from my collaborators.

3.1 Introduction

On both physical and virtual machines probes are a key low-level technique that has revolutionised understanding the performance of modern software. For a developer to understand the performance of their code they no longer have to modify the source program, recompile and execute the software. Rather, they can use a probing system such that whenever control flow hits a certain instruction a probe fires and executes a probe handler.

However, the key problem with using probes is that when they fire they consume resources, thereby causing a probe effect (a type of observer effect). This probe effect can be problematic: if probes are used to understand the performance of software, their presence can affect the timing of the system that they are intended to measure. Furthermore, the probe effect can change concurrency relationships, causing different interactions amongst system components depending on the presence of a probe.

The key to reducing the probe effect is to minimise the performance impact of the probes. However, current probing mechanisms are designed to execute with a low performance impact on physical machines. They therefore commonly insert interrupts into the instruction stream. When the interrupt executes on a physical machine, the processor executes an interrupt service routine at a reasonably low cost. However, on a virtual machine interrupts are more expensive because the interrupt causes a vm-exit, since the hypervisor takes the interrupt. The hypervisor’s interrupt service routine parses which virtual machine the interrupt is for and issues an upcall to the relevant virtual machine, which then runs its own interrupt service handler. As such, the probe effect on a virtual machine is higher than on a physical machine.

In this chapter, I reconsider the design of probing mechanisms for virtual machines with Kamprobes, a general probing system for operating systems that adhere to the System V AMD64 ABI [95]. In particular, Kamprobes relies on JMPQ and CALLQ x86-64 instructions that it inserts into the instruction stream to transfer the execution context from the original program to a probe. This therefore does not experience the high overheads of executing an interrupt in a virtual machine.

In my experiments the current Linux kernel probing technique takes a minimum of 6 980 ± 869 cycles to fire on a virtual machine, whereas Kamprobes take 69 ± 16 cycles to fire. A further advantage of Kamprobes—although unrelated to virtualisation—is that whereas the cost of firing a Kprobe scales poorly with the number of probes in the system (the cost of inserting and firing a probe both follow O(n)), Kamprobes scale with O(1) for both insertion and firing of probes. This results in it taking 6 980 ± 869 cycles to fire a Kprobe when there is only one Kprobe in the system but 33 600 ± 4 960 cycles to fire a Kprobe when there are 20 000 Kprobes inserted. This scalability makes Kamprobes appropriate for situations that require many probes, such as a function boundary tracer.

By forgoing hypervisor fidelity and designing software such that it minimises its use of privileged instructions, Kamprobes improves on the state of the art of probing virtual machines by enabling virtual machines to be probed with a near-identical probe effect to their physical counterparts.

3.2 Current probing techniques

I now explore the current techniques used for kernel probing by major operating systems: Linux, Windows, OS X, FreeBSD, and NetBSD. I show that, with the exception of Windows, they rely on interrupts to transfer execution from the target code to a probe handler. I later go on to show that interrupts virtualise poorly and require an expensive trampoline to execute that decodes the instruction pointer to determine the location of the probe handler.

3.2.1 Linux: Kprobes

Kprobes are the de facto method of inserting probes into the Linux kernel, having been merged into the Linus branch in June 2004. To add a Kprobe, a user calls the register_kprobe function, specifying an address to probe and a function pointer to the handler that is to be run when the probe fires. The register_kprobe function replaces the instruction at the given address with a software breakpoint. On x86-64 hardware this breakpoint is an int3 instruction. The original instruction is copied to the handler’s instruction stream so as not to alter the semantics of the kernel. When the kernel reaches the probed instruction a trap occurs, causing the CPU to save its registers and stack. Execution passes to an interrupt handler that hashes the instruction pointer at the time that the probe was fired, and uses the result as an index into a hash table to find the address of the corresponding probe handlers. The kernel executes the pre-handler, which contains the user’s instrumentation code. Then the original instruction, now copied into the probe handler, is single-stepped by using int1. Finally the Kprobe’s post-handler is fired, and control flow returns to the original program. Kretprobes are similar, but allow the user to also specify a return handler when probing a CALLQ instruction. The return handler executes when the function being called returns.

Kprobes and Kretprobes use two optimisations to reduce the overheads of firing probes. (i) Rather than inserting an interrupt, Kprobes will insert a JMPQ on any instruction that is at least five bytes long. Whilst this optimisation is similar to the technique used by Kamprobes, it still single-steps the original instruction using an int1 instruction, which is expensive on virtual machines as the hypervisor has to convert the interrupt to an upcall. Also, all JMPQs have the same target: the Kprobe dispatcher that hashes the instruction pointer. Therefore, they still have to decode which probe handler to execute. (ii) If a probe is inserted on the first address of a function then rather than inserting an interrupt, Kprobes inserts a jump into the Kprobes handler.

However, even with these optimisations, Kprobes still executes poorly in a virtual environment due to its continued use of int1, performing other operations such as disabling interrupts, and pressure on page tables.
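The shape of this interface is easiest to see in code. The following sketch shows a minimal kernel module registering a Kprobe with both a pre-handler and a post-handler; the probed symbol (do_sys_open) and the empty handler bodies are illustrative only and are not taken from the experiments in this dissertation.

/*
 * Minimal sketch of registering a Kprobe from a kernel module. The probed
 * symbol is purely illustrative.
 */
#include <linux/module.h>
#include <linux/kprobes.h>

static int my_pre_handler(struct kprobe *p, struct pt_regs *regs)
{
        /* Instrumentation code runs here, before the probed instruction. */
        return 0;
}

static void my_post_handler(struct kprobe *p, struct pt_regs *regs,
                            unsigned long flags)
{
        /* Runs after the original instruction has been single-stepped. */
}

static struct kprobe kp = {
        .symbol_name  = "do_sys_open",
        .pre_handler  = my_pre_handler,
        .post_handler = my_post_handler,
};

static int __init probe_init(void)
{
        return register_kprobe(&kp);
}

static void __exit probe_exit(void)
{
        unregister_kprobe(&kp);
}

module_init(probe_init);
module_exit(probe_exit);
MODULE_LICENSE("GPL");

Every firing of a probe registered this way goes through the int3 (and, for the single step, int1) path described above, which is what makes the mechanism expensive inside a virtual machine.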

3.2.2 Windows: Detours

Windows uses Detours as a probing system, which is the probing system most similar to Kamprobes [69]. Detours allocates memory that is used as a detour function and modifies the original instruction stream such that it contains a JMP instruction into the detour function. The detour function saves registers, executes a probe handler and then executes the target function. The target function returns to the detour function and then to the source function.

By using a jump instruction, Detours is the technique most similar to Kamprobes. However, Detours is tightly-coupled with the Microsoft ABI and can only be applied to the Windows user space and not the kernel. Also, it is not possible to register a Detours handler that executes after a probed function returns. That is, with Kretprobes and Kamprobes, one can insert a return probe that executes a pre-handler before a function executes and a post-handler that executes when the function returns. Probing the entry and exit of a function is useful for debugging the parameters to a function and observing its timing. Whilst one solution is to insert a second Detours probe immediately after the callsite, this is usually not possible, since Detours can only be inserted onto instructions of length five or larger and usually after a function call there are shorter instructions (e.g. MOVQ) that process the returned result.

3.2.3 FreeBSD, NetBSD, OS X: DTrace function boundary tracers

DTrace implementations for FreeBSD, NetBSD, and OS X all have a modular structure such that multiple probing providers can be used, depending on the context. One such provider is the function boundary tracer, which allows the user to execute a D script whenever a particular function enters or exits. The function boundary tracers for FreeBSD, NetBSD [106] and OS X1 insert an interrupt into the instruction stream and register an appropriate interrupt handler. When the interrupt fires they use similar techniques to Kprobes to find the location of the interrupt handler in a data structure, although they do not suffer from the scalability issues of Kprobes.

3.2.4 Summary

With the exception of Windows Detours—which is a user space tool rather than a kernel space tool—existing mechanisms for probing operating systems principally use interrupts to transfer program execution from the target code to a trampoline. This trampoline then has to decode the instruction pointer to find the relevant probe handler. Whilst Kprobes has a jump optimisation, this jump still executes the same codepath that was designed for interrupts; as such, whilst it has significantly lower overheads, it still performs more slowly in a virtual machine than on a physical machine.

Throughout the remainder of this chapter I compare Kprobes, or their Kretprobes extension, with Kamprobes. Detours is proprietary, expensive ($10 000), only for user space, and uses a different ABI to Kamprobes, thus making it inappropriate to benchmark against. The mechanisms used by OS X, FreeBSD and NetBSD to probe the kernel are tightly coupled with DTrace. As such, they do not provide a low-level interface to insert arbitrary code at a specific kernel address. Any benchmarking would unfairly disadvantage these tools due to the additional safety that they provide.

1 OS X support for virtualisation is unclear. It is therefore unfair to critique the use of an interrupt.

3.3 Experimental evidence against virtualising current probing techniques

Having argued that the current techniques used for kernel probing are suboptimal, I now experimentally demonstrate their inefficiencies.

3.3.1 Cost of virtualising Kprobes

I start by showing that whilst current techniques, such as Kprobes, have acceptable performance on physical machines, when they are virtualised they become substantially slower and have wildly unpredictable performance characteristics. This motivates my thesis that we ought to forgo hypervisor fidelity by building performance measurement techniques that complement those operations that virtualise well.

Experimental setup I created a Linux kernel module that repeatedly executed an empty function in a tight code loop, measuring the number of cycles that it takes to call the empty function. The kernel module also registered a Kretprobe on the call to the empty function. This Kretprobe then executed an empty pre-handler and an empty return handler before and after the call to the empty function. I used gcc’s noinline attribute to stop the compiler from inlining the function calls and verified using objdump that the compiler did not optimise out the calls to the empty function. As such, the experiment had the minimal amount of code required to measure the cost of executing a Kretprobe on a function. As the kernel used in this experiment was non pre-emptive,2 the scheduler did not preempt the experiment as it executed.

I repeated the experiment whilst increasing the number of Kprobes inserted into unused parts of the kernel (e.g. unused drivers) from 0 to 1 000. This showed the effect of a changing number of probes on the overheads of virtualisation without incurring an additional probe effect. For each number of probes, I measured 1 500 executions and discarded the first 500 to minimise the effects of cold caches. This experiment executed on an Intel Xeon E3-1230 V2 @ 3.3 GHz, running Ubuntu 14.10, with a Linux v3.19 kernel compiled from the Linus branch and Xen v4.6. I performed the experiment when executing both on bare metal and in a virtual machine. With the exception of the experiment the host was otherwise idle. All measurements followed Intel’s guidance on benchmarking using the cycle counter [108]; in particular I subtracted the mean cost of measuring the timestamp counter in each configuration from the results.
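The exact kernel module used in this experiment is not reproduced here; the sketch below only illustrates the general shape of such a measurement loop, serialising with CPUID before RDTSC and reading the end timestamp with RDTSCP as Intel's benchmarking guidance recommends. The function and variable names are illustrative.

/*
 * Sketch of a cycle-accurate measurement loop for calling an empty,
 * non-inlined function (with a Kretprobe registered on the call site).
 */
#include <linux/kernel.h>
#include <linux/types.h>

#define ITERATIONS 1500

static noinline void empty_function(void)
{
        /* Deliberately empty; noinline stops GCC removing the call. */
}

static inline u64 cycles_begin(void)
{
        unsigned int hi, lo;
        asm volatile("cpuid\n\t"            /* serialise */
                     "rdtsc\n\t"
                     "mov %%edx, %0\n\t"
                     "mov %%eax, %1"
                     : "=r" (hi), "=r" (lo)
                     :
                     : "%rax", "%rbx", "%rcx", "%rdx");
        return ((u64)hi << 32) | lo;
}

static inline u64 cycles_end(void)
{
        unsigned int hi, lo;
        asm volatile("rdtscp\n\t"           /* read TSC after earlier instructions retire */
                     "mov %%edx, %0\n\t"
                     "mov %%eax, %1\n\t"
                     "cpuid"                /* stop later instructions passing rdtscp */
                     : "=r" (hi), "=r" (lo)
                     :
                     : "%rax", "%rbx", "%rcx", "%rdx");
        return ((u64)hi << 32) | lo;
}

static void run_experiment(u64 *samples)
{
        int i;

        for (i = 0; i < ITERATIONS; i++) {
                u64 start = cycles_begin();
                empty_function();           /* the probed call */
                samples[i] = cycles_end() - start;
        }
        /* The first 500 samples are later discarded to warm the caches. */
}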

Results Figure 3.1 shows that the number of cycles required to execute a Kretprobe in a virtual machine is significantly higher and more variable than the number of cycles required to execute a Kretprobe on bare metal. This increase in cost causes a higher probe effect for virtual machines than when executing on bare metal. A further issue is that in a virtual machine Kretprobes have less predictable performance than on bare metal. This high variance of the probing system makes it hard for developers to measure the performance of their programs as it is difficult to distinguish between resources consumed by their application and resources consumed by the probing mechanism that they are using to measure the program performance. In the remainder of this section I explore why current probing mechanisms virtualise poorly.

2All common Linux distributions ship with non pre-emptive kernels.

[Figure 3.1 plots cycles (y-axis, 0–50 000) against the number of inserted probes (x-axis, 0–1 000), for bare metal and a virtual machine.]

Figure 3.1: When Kretprobes are virtualised they consume more cycles and become less predictable than when executing on bare metal.

[Figure 3.2 plots the cumulative frequency (y-axis) of the number of cycles required to take an interrupt (x-axis, 0–1 200 cycles) on bare metal, in Dom0 and in DomU.]

Figure 3.2: The number of cycles required to take an interrupt when executing on a hypervisor is substantially higher than when executing on bare metal. As shown in this figure, the number of cycles required is about double.

3.3.2 Cost of virtualised interrupts

I now show that the cost of an interrupt in a virtual machine is 1.81 times higher than the cost of an interrupt on bare metal. Indeed, interrupts are one of the least-virtualisable components of an operating system. As such, techniques that predominantly use interrupts execute slowly in the context of a virtual machine.

Experimental setup In this experiment I measured the number of cycles that it takes to fire 10 000 interrupts. I executed this experiment using a modified form of the Linux v3.19 kernel where I replaced the interrupt handler for int3 (do int3) with only a return statement. Therefore, all that I measured is the cost of performing the int3 instruction and not the cost of processing it. I repeated the experiment 600 times and discarded the results from the first 100 repeats to ensure that I did not measure unintended artifacts of cache or branch predictor behaviour. The hardware and software were identical to that described in Section 3.3.1.
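Again, the harness itself is not reproduced; the sketch below only illustrates its shape. It assumes, as described above, that the kernel's do_int3 handler has been reduced to an immediate return: on an unmodified kernel each int3 would instead be handled as an unexpected kernel breakpoint.

/*
 * Sketch: time a run of software breakpoints whose handler has been
 * stubbed out, so only the cost of taking the interrupt is measured.
 */
#include <linux/types.h>

#define INT3_FIRES 10000

static inline u64 rdtsc_serialised(void)
{
        unsigned int hi, lo;
        /* RDTSCP waits for earlier instructions to retire before reading the TSC. */
        asm volatile("rdtscp" : "=a" (lo), "=d" (hi) : : "rcx");
        return ((u64)hi << 32) | lo;
}

static u64 measure_int3_cycles(void)
{
        u64 start, end;
        int i;

        start = rdtsc_serialised();
        for (i = 0; i < INT3_FIRES; i++)
                asm volatile("int3");   /* breakpoint; handler returns immediately */
        end = rdtsc_serialised();

        return (end - start) / INT3_FIRES;
}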

Results Figure 3.2 shows that firing an interrupt in a virtual machine is substantially more expensive than firing an interrupt on bare metal. It takes a median of 1 040 cycles to fire an interrupt in an (unprivileged) virtual machine, including the time spent in the hypervisor; however, the distribution has a long tail to 2 450 cycles. This is due to the increased data path from the interrupt executing in the context of the hypervisor, as interrupts are privileged instructions and so require an upcall from the hypervisor. Under Xen on x86-64 the int3 instruction generates a software interrupt to indicate that the program has hit a breakpoint. This generates a #BP exception. Ordinarily, the operating system would inspect the interrupt descriptor table, determine the location of the interrupt service routine and then execute it. However, in a virtualised environment the interrupt forces the system to take a mode switch into the hypervisor, so that the hypervisor can mediate the interrupt, ensuring that it is intended for the correct domain. After the mode switch the hypervisor executes its own interrupt service routine, which checks to see if the hypervisor has inserted a software interrupt on this address. If the hypervisor has not inserted an interrupt for that address, Xen decodes which virtual machine the interrupt corresponds to, delivers the interrupt to the domain through an event channel and executes a mode switch.

This creates a substantial performance overhead, largely due to the mode switches, which cause the system to save register state and then emulate the single-instruction interrupt. Indeed, the cost of firing the interrupt alone is substantially higher than the 69 ± 16 cycles required to fire a Kamprobe. As such, no probing mechanism used in a virtual machine environment that relies on interrupts can be faster than Kamprobes.

3.3.3 Other causes of slower performance when virtualised

The exact costs of the other causes of slowdown are implementation dependent. However, I have already shown that a lower bound on the cost of using interrupts is fifteen times the cost of firing a Kamprobe. Other sources of slowdown from virtualisation include:

Updates to page tables, for instance to ensure that instructions are only single-stepped when the pre-handler fires, which requires modifying pages that are mapped read-only. As Xen has to mediate page table updates for virtual machines, such operations are more expensive than on physical machines.

Disabling interrupts which requires setting a Xen-readable software flag [12].

3.4 Kamprobes design

Having shown that current kernel probing techniques are suboptimal when executing in a virtual machine, I now present Kamprobes. The central idea behind Kamprobes is to provide a probing mechanism that forgoes hypervisor fidelity in the sense that it is designed to execute quickly and predictably on virtual machines. Kamprobes has the following properties:

1. No privileged instructions, so that Kamprobes executes on virtual machines without relying on the hypervisor to virtualise privileged instructions.

2. No locking, to avoid the problems of lock-holder pre-emption where a vCPU holding a lock is preempted, causing other vCPUs waiting on the lock to be unable to progress [53].

3. Low runtime performance overhead, when a probe is fired. When probing overheads are high there is a substantial probe effect that affects the results measured by the probe handlers.

4. Predictable performance overhead, so that the cost of firing a probe does not introduce noise into performance measurements taken in the probe handler.

5. Scalability, through a constant cost of inserting a probe, as non-constant costs prevent a probing system from being used where many probes are required, such as for a function-boundary tracer.

6. Non-lossy probes, in that probe handlers are guaranteed to fire if the probed instruction is hit. Probe systems that are lossy cannot be used in circumstances that require probes to be fired, such as comparing traces from a hypervisor with those from the virtual machine where there should be a bijection of events (e.g. scheduling events).

7. Return handler support, so that when probes are applied to a CALLQ instruction they can execute both a pre-handler to be called before the function call, and a return handler after the function returns. Return handlers are useful for profiling individual functions, which is a common technique for measuring virtual machine performance.

8. Zero performance overhead whilst disabled, since it is difficult to get popular uptake of a probing mechanism without this [20].

static void pre_handler(void) {
    ...
}
static void post_handler(void) {
    ...
}

kamprobes_register(0xffffffff8137e060, &pre_handler, &post_handler);
...
kamprobes_unregister(0xffffffff8137e060);

Figure 3.3: An example of the Kamprobes API

3.5 Implementation

Kamprobes has a highly-optimised implementation, with the code that executes on firing a probe having been hand-written in x86-64 machine code. This design is optimised such that it can probe function entry points, CALLQ instructions, and five NOP instructions, although well-known techniques can extend it to other instructions [137, 138]. The current implementation is built for Linux; however, it could be easily ported to other operating systems that use the System V x86-64 ABI and less easily ported to other ABIs. The implementation of Kamprobes consists of three parts: (i) an API for kernel modules to insert Kamprobes, (ii) an (out-of-tree) kernel module that (iii) rewrites the kernel instruction stream.

3.5.1 Kamprobes API

Figure 3.3 details the Kamprobes API, which has two public API functions: kamprobes_register and kamprobes_unregister. These functions are exposed through a C header file that can be #include-d by any kernel module. kamprobes_register is a ternary function that takes an address at which to insert a Kamprobe and two function pointers whose targets are the start of a pre-handler and a return handler. kamprobes_unregister takes the address of a Kamprobe and removes it.
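The header itself is not shown in this dissertation; the following sketch is one plausible set of declarations inferred from the usage in Figure 3.3. The exact parameter and return types in the real implementation may differ.

/*
 * Hypothetical kamprobes.h, reconstructed from Figure 3.3 for illustration.
 */
#ifndef KAMPROBES_H
#define KAMPROBES_H

typedef void (*kamprobe_handler_t)(void);

/* Insert a Kamprobe at the given kernel address. */
int kamprobes_register(unsigned long addr,
                       kamprobe_handler_t pre_handler,
                       kamprobe_handler_t return_handler);

/* Remove the Kamprobe previously inserted at the given address. */
int kamprobes_unregister(unsigned long addr);

#endif /* KAMPROBES_H */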

3.5.2 Kernel module

On initialisation, Kamprobes uses the __vmalloc_node_range function3—which is the mechanism used by insmod to allocate kernel memory that is both executable and within low memory—to allocate memory that will contain the Kamprobes instruction stream.

3.5.3 Changes to the x86-64 instruction stream

Registering a Kamprobe rewrites the kernel instruction stream, introducing an unprivileged JMPQ instruction rather than the privileged int1 and int3 instructions that other probing techniques use. The target of Kamprobe’s inserted JMPQ is a Kamprobe wrapper, a memory section to which Kamprobes writes an x86-64 instruction stream.

3.5.3.1 Inserting Kamprobes into an instruction stream

Kamprobes can currently be placed on three types of instruction: CALLQ, a func- tion entry point, and five contiguous NOP instructions. Each type of Kamprobe is registered slightly differently.

CALLQ instructions. Whenever a probe is registered, Kamprobes inspects the address of the probed instruction and determines the address of the callee. The Kamprobes register function then rewrites the original CALLQ instruction so that its target is no longer the callee, but is the start of a Kamprobe wrapper.

Function entry points. Most Linux kernels are now compiled with support for FTrace. This causes the compiler to insert a call to __fentry__ as the first instruction of every function. Usually, FTrace is disabled. Kamprobes therefore overwrites the FTrace trampoline, replacing it with a JMPQ into the Kamprobe wrapper. This differs from inserting a Kamprobe onto a CALLQ instruction, which calls into the Kamprobe wrapper, because when a function entry point probe fires a function call has already been made and a stack frame for the callee exists. Kamprobes therefore cannot generate an additional stack frame, as doing so would clobber data passed on the stack, such as the first parameters.

3 __vmalloc_node_range does not use the EXPORT_SYMBOL mechanism to expose itself to kernel modules. However, I use code (written by Lucian Carata) that scans the kernel’s memory to find the location of kernel symbols that are not exported.

Five NOP instructions. The central idea behind Kamprobes is to use a CALLQ or JMPQ instruction rather than an interrupt to avoid the overheads of an in- terrupt in the context of a virtual machine. The x86-64 instruction set is variable-width, with software interrupts (int3) being one byte long (0xcc), which allows them to replace any instruction. However, JMPQ instructions are five bytes long (0xe9, followed by a four-byte relative address). As such, Kamprobes cannot be placed on instructions that are fewer than five bytes long. If developers need to insert a Kamprobe between two specific instructions then they can recompile the kernel with five contiguous NOP instructions and insert a Kamprobe on those.
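As an illustration of the rewriting described in this section, the sketch below shows how a five-byte relative JMPQ (opcode 0xe9 followed by a 32-bit displacement) could be written over a probed instruction so that control flow diverts into a Kamprobe wrapper. It is a simplified sketch, not the Kamprobes implementation: making the target page writable, flushing the instruction cache and coping with concurrent execution of the patched code are all omitted, and the function name is illustrative.

/*
 * Sketch: overwrite a five-byte instruction with a relative jump to a
 * wrapper. The displacement is relative to the *next* instruction, i.e.
 * the probed address plus the length of the jump itself.
 */
#include <linux/types.h>
#include <linux/string.h>

#define JMPQ_OPCODE 0xe9
#define JMPQ_LENGTH 5

static void write_jmpq(u8 *probe_addr, const u8 *wrapper_addr)
{
        s32 rel32 = (s32)(wrapper_addr - (probe_addr + JMPQ_LENGTH));

        probe_addr[0] = JMPQ_OPCODE;
        memcpy(&probe_addr[1], &rel32, sizeof(rel32));
}

The same encoding constraint explains why Kamprobes cannot be placed on instructions shorter than five bytes, and why five contiguous NOP instructions are needed when probing an arbitrary point in the instruction stream.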

3.5.3.2 Kamprobe wrappers

Kamprobes wrappers are sections of memory to which Kamprobes writes an x86-64 instruction stream to, which call the relevant probe handlers. There is a bijection between Kamprobes and Kamprobes wrappers because each Kamprobe wrapper encodes the location of the Kamprobe. The contents of a Kamprobe wrapper changes depending on the type of in- struction that the Kamprobe is inserted on:

CALLQ instructions. Figure 3.4 shows a Kamprobes wrapper for a CALLQ instruc- tion and Figure 3.5 describes the steps of executing a CALLQ Kamprobe:

Save registers. The System V AMD64 ABI [95] specifies that registers may be used to pass parameters to functions. As such, Kamprobes takes care to ensure that it does not clobber the values passed in registers by calling a pre-handler. Kamprobes pushes the values in these registers

// Save registers
0xc042a000: push %rax
0xc042a001: push %rbx
0xc042a002: push %rdi
0xc042a003: push %rsi
0xc042a004: push %rdx
0xc042a005: push %rcx
0xc042a006: push %r8
0xc042a008: push %r9
0xc042a00a: push %r10
// Call pre-handler.
0xc042a00c: CALLQ 0xffffffffc03fcab0
// Pop registers
0xc042a011: pop %r10
0xc042a013: pop %r9
0xc042a015: pop %r8
0xc042a017: pop %rcx
0xc042a018: pop %rdx
0xc042a019: pop %rsi
0xc042a01a: pop %rdi
0xc042a01b: pop %rbx
0xc042a01c: pop %rax
// Push return address to stack.
0xc042a01d: movq $0xffffffffc042a02a,(%rsp)
// Jump to original function.
0xc042a025: JMPQ 0xffffffff8137e060

// Part 2. Executes when the original function returns.
// Push the return address onto the stack.
0xc042a02a: pushq $0xffffffff81002564
// Jump to post-handler.
0xc042a02f: JMPQ 0xffffffffc03fdab0

Figure 3.4: An example Kamprobe wrapper intercepting a call from blk_lookup_devt to disk_get_part. I omit the most significant bits of the opcode addresses for clarity.

Figure 3.5: Inserting a Kamprobe onto a CALLQ instruction replaces the target of the call site with a call to a Kamprobe wrapper. This executes the pre-handler, then performs the original function call. When this function returns it does so into the Kamprobe wrapper, which executes the post-handler and returns.

onto the stack to ensure that it does not trash their values in executing the pre-handler.

Call the pre-handler. When the user registers a Kamprobe, they pass a function pointer to the pre-handler that they wish to execute when the probe fires. This pre-handler is a void, nullary function, so executing it just requires a CALLQ instruction.

Restore registers. The pre-handler returns to this point. Having executed the pre-handler, Kamprobes pops register values from the stack so that the register file and stack are in the same state as they would have been had the pre-handler not been executed. This ensures that Kamprobes does not affect the parameters to the original function.

Modify the return address on the stack. Initially the return address on the stack points to the instruction after the instruction that has been probed. Before executing the original function, Kamprobes modifies this return address to point to ‘part 2’ of the Kamprobes wrapper. This change makes the control flow return to the Kamprobes wrapper when the original function executes a ret instruction.

Jump into the original function. Having ensured that the stack and register file are unmodified by Kamprobes, with the exception of the modified return address, Kamprobes can now execute the original function. When probing a CALLQ instruction, Kamprobes cannot use another CALLQ instruction to execute the original function: under the System V AMD64 ABI [95] the caller of a function prepares some of the stack for the callee, and the x86-64 CALLQ instruction pushes the current instruction pointer to the stack as a return address, which would make the offsets of the arguments on the stack, which are relative to the stack pointer, incorrect. As the code in the Kamprobe wrapper has already modified the return address, the original function returns to the wrapper, despite using a JMPQ instruction, rather than a CALLQ.

Push a return address to the stack. The original function returns here. After the original function has now executed, Kamprobes still needs to execute the return handler, and return execution to the original function. Kamprobes uses an optimisation to minimise the performance overhead of this step by modifying the call stack such that the Kamprobes wrapper calls the return handler. However, the return handler returns directly to the original program, without executing code from the wrapper.4 To perform this operation, Kamprobes calculates the address of the next instruction from the original instruction stream—by adding the width of a CALLQ instruction to the address of the original instruction being probed—and pushes this to the stack. Therefore, when the return handler returns, using a ret instruction, it immediately returns to the original code, without executing Kamprobes code.

Jump to return handler. Kamprobes JMPQs into the return handler, which is a void function so it does not modify the return register. When the return handler returns it does so to the address pushed onto the stack. A disadvantage of the return handler bypassing the Kamprobes wrapper is that it is legal according to the System V AMD64 ABI [95] for the return handler to clobber rax. In practice, using current versions of GCC to compile kernel modules, I have not observed functions that do clobber rax. Should additional flags or other compiler versions break this assumption then Kamprobes could be modified to preserve rax.

Function entry point. The key difference between Kamprobes on function entry points and those on CALLQ instructions is that Kamprobes on function entry points necessarily obtain an execution context through a JMPQ instruction, rather than a CALLQ instruction. This creates two issues: (i) the Kamprobe wrapper executes in a stack frame containing state that Kamprobes cannot modify; (ii) to execute a post-handler, Kamprobes needs to modify the return address so that all exit points from the function cause the post-handler to fire, yet the original return address needs to be retained so that the post-handler can return to the caller of the function entry point. Therefore, there are differences in the layout of the Kamprobe wrapper for function entry points: (i) When a function entry point Kamprobe first fires it copies the return address into a Kamprobes buffer. This is because, unlike in the CALLQ case, the Kamprobe wrapper needs to return execution to

4This optimisation was designed by Lucian Carata.

Figure 3.6: The control flow for Kamprobes on a function entry point differs from Kamprobes on a CALLQ instruction. This is caused by a stack frame for the callee function having already been created when the Kamprobe fires, which the Kamprobe wrapper needs to avoid trashing.

this address after the post-handler fires. (ii) The target function executes by JMPQing to five bytes into its definition. As the function caller has already created a stack frame for the callee, by executing a CALLQ instruction, Kamprobes cannot create an additional stack frame. By JMPQing five bytes into the function definition, Kamprobes avoids the infinite loop that would occur by calling the first byte in the function, which has been modified so as to JMPQ into the Kamprobe wrapper. As previous parts of the wrapper modify the return address, the function returns into the wrapper rather than to its caller. (iii) The post-handler is executed through a CALLQ instruction, rather than a JMPQ, so after executing the post-handler, control flow returns to the Kamprobes wrapper. The reason for this is that the second part of the Kamprobes wrapper executes after the function being probed has executed its own return. By executing a return, the original function destroys its own stack frame; therefore Kamprobes should not execute an additional return, as this would destroy its caller's stack frame (the stack frame of bar in Figure 3.6). Rather, Kamprobes creates a new stack frame for the post-handler by using a CALLQ instruction, which the post-handler destroys when it returns, and the Kamprobe wrapper JMPQs to the next instruction in the caller, thereby not destroying an additional stack frame.

Five contiguous NOP instructions. When a Kamprobe fires on five contiguous NOP instructions it can only invoke a pre-handler, as there is no function call involved. The modifications to the instruction stream are much like those for probing a function entry point, but do not modify the return address on the stack, in order not to invoke a return handler.

Kamprobes that are inserted onto five contiguous NOP instructions cannot fire a return probe, so the wrapper JMPQs back into the original instruction stream immediately after the pre-handler returns.

After writing the instruction stream to the Kamprobes wrapper, the register function remaps the page to be read-only and disables the non-executable (NX) bit, allowing the processor to execute the data in the Kamprobes wrapper.
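A minimal sketch of this final permission change is shown below, assuming the helpers available to x86-64 kernel modules around Linux v3.19 (set_memory_ro() and set_memory_x()); the real register function may manage page permissions differently.

    #include <linux/mm.h>
    #include <asm/cacheflush.h>   /* set_memory_ro()/set_memory_x() on v3.19-era x86;
                                     newer kernels declare them in <asm/set_memory.h> */

    /* Hypothetical helper: once the wrapper bytes have been written, make the
     * page read-only (no further writes) and executable (clear NX) so the CPU
     * may run the generated code. */
    static void seal_wrapper_page(void *wrapper)
    {
        unsigned long page = (unsigned long)wrapper & PAGE_MASK;

        set_memory_ro(page, 1);
        set_memory_x(page, 1);
    }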

3.6 Evaluation

I now evaluate Kamprobes, comparing them with the current state of the art in Linux kernel probing, Kprobes, and show that the time taken to fire a Kamprobe is within twelve cycles of native performance. Moreover, firing a Kamprobe takes approximately 10% of the cycles taken to fire an optimised Kprobe. I executed all experiments on Kprobes and Kamprobes on an Intel Xeon E3-1230 V2 @ 3.3 GHz, running Xen v4.6 with all virtual machines executing Ubuntu 14.10, with a Linux v3.19 kernel compiled from the Linus branch. When executing in a virtual machine, the virtual machine was the only one running, with all CPUs and all memory that were not taken by the hypervisor. All experiments followed Intel's guidance on benchmarking using the cycle counter [108] and used a paravirtualised timestamp counter.

3.6.1 Inserting probes

I now show that Kamprobes can be inserted substantially faster than Kprobes in a virtual machine. Whereas the cost of inserting a Kprobe scales with complexity O(n), where n is the number of probes already inserted into the system, the cost of inserting a Kamprobe scales with complexity O(1). This cost is important when inserting many probes into an operating system, such as for a function boundary tracer that inserts a probe into every kernel function.

Experimental setup To measure the cost of inserting a probe I manually identified a part of the kernel that did not execute on my hardware setup, as it was a driver for hardware not present on my machine. I inserted probes into this 'dead' kernel code to ensure that the already-inserted probes did not fire during the insertion phase and thereby degrade performance. I initially scanned the memory, finding instructions onto which a probe can be inserted. Then, I read the current wall-clock time, inserted n probes onto these instructions, read the wall-clock time again and recorded the delta. I repeated the experiment twenty-five times.
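A minimal sketch of the timing harness is below, using the kernel's ktime interface; insert_probe_at() is a hypothetical stand-in for the Kamprobes or Kprobes registration call.

    #include <linux/ktime.h>

    /* Hypothetical sketch: time how long it takes to insert n probes into
     * 'dead' kernel addresses.  insert_probe_at() stands in for the
     * Kamprobes/Kprobes registration call. */
    extern void insert_probe_at(void *addr);

    static u64 time_n_insertions(void **dead_addrs, int n)
    {
        int i;
        ktime_t start = ktime_get();

        for (i = 0; i < n; i++)
            insert_probe_at(dead_addrs[i]);

        return ktime_to_ns(ktime_sub(ktime_get(), start));
    }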

Figure 3.7: As the number of Kamprobes increases, the time taken to insert a Kamprobe remains constant. However, the time taken to insert a Kprobe increases linearly.

Results Figure 3.7 shows that inserting a single Kamprobe takes 1.46 ms ± 0.1 ms, whereas inserting a single Kprobe takes 0.051 s ± 0.215 s (the distribution is not normal). This is because of the complexity required by Kprobes in building hash-table entries that map instruction-pointer values to addresses, and in inserting the interrupt. Furthermore, the time taken to insert a Kprobe into the kernel grows with an O(n) relationship to the number of probes already inserted, whereas the time taken to insert a Kamprobe grows following O(1). As such, inserting more than 10 000 kernel probes becomes prohibitively expensive, as the Linux kernel watchdog begins to fire. Moreover, when inserting more than 10 000 probes the kernel becomes unstable and crashes regularly. Kamprobes, by contrast, scale linearly in total insertion time, so users can insert Kamprobes into far more call sites: it takes 0.035 s to insert 10 000 Kamprobes.

Whenever a Kprobe is inserted, the kernel checks whether there is an existing Kprobe at the same address by walking a hash table of active Kprobes, using the get_kprobe function [68]. This hash table has just sixty-four entries, so naturally cannot contain 10 000 probes. Whenever there is a collision in the hash table, Linux uses a linked list to store the metadata for each Kprobe. Therefore, as the number of probes increases beyond sixty-four, the cost of inserting a probe changes from O(1)—the cost of computing a hash—to O(n)—the cost of walking a linked list. Moreover, there are no guarantees that the linked list is in contiguous memory, which causes cache misses. The performance of Kamprobes is much more scalable, since my design has a constant cost of adding a probe: unlike with Kprobes, there are no data structures that need to be walked when inserting a probe; the old address is simply copied into the Kamprobe wrapper.
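The scaling behaviour follows from the shape of the lookup alone. The following is a simplified illustration of the hash-table-with-chaining pattern used by get_kprobe, not the kernel's actual data structure: with sixty-four buckets, 10 000 probes leave each bucket with a chain of roughly 150 entries to walk on every insertion.

    #include <stdint.h>
    #include <stddef.h>

    #define TABLE_BITS 6                       /* 64 buckets, as in Linux */
    #define TABLE_SIZE (1u << TABLE_BITS)

    struct probe {
        void *addr;
        struct probe *next;                    /* collision chain */
    };

    static struct probe *table[TABLE_SIZE];

    /* Simplified sketch of the lookup pattern: hash the address to a bucket,
     * then walk that bucket's linked list.  The walk is what turns insertion
     * from O(1) into O(n) once the table is saturated. */
    static struct probe *lookup(void *addr)
    {
        size_t bucket = ((uintptr_t)addr >> 4) & (TABLE_SIZE - 1);
        struct probe *p;

        for (p = table[bucket]; p != NULL; p = p->next)
            if (p->addr == addr)
                return p;
        return NULL;
    }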

3.6.2 Firing probes

I now measure the cost of firing Kprobes and compare it with Kamprobes. Minimising this cost is important to reduce the probe effect. In this experiment I consider the changing cost of each probing mechanism as the number of inserted probe points scales.

Experimental setup I used a microbenchmark that inserted n − 1 probes at unique addresses in unused parts of the kernel that the virtual machine does not touch; as such, these probes did not fire whilst the experiment executed. The microbenchmark then inserted one further probe, which I refer to as the target probe. The address onto which the microbenchmark inserted the target probe depended on the probing mechanism:

Unoptimised Kprobes can be inserted on any function in the kernel. The source code of my experiment read the timestamp counter (rdtscp5), executed a NOP and then re-read the timestamp counter. To measure the cost of firing a Kprobe, the microbenchmark inserted a Kprobe on the NOP instruction.

Optimised Kprobes use the FTrace mechanism in the preamble of a function. The microbenchmark therefore created the smallest valid function and inserted a Kprobe on the function. The function was the minimum function that conforms to the System V AMD64 ABI [95] (generated by asm("")). It contained five instructions: a five-byte NOP header for the FTrace mechanism, a return instruction and three instructions to set pointers. The experiment consisted of reading the timestamp counter (rdtsc), calling the empty function (with a Kprobe attached), reading the timestamp counter as soon as the function returns and taking the difference between the two.

Kamprobes used the same technique as optimised Kprobes.

After inserting the probes, the microbenchmark fired the target probe 12 000 times and measured the time taken to fire the probe. I dropped the first 2 000 measurements to avoid measuring phases where caches are cold. I repeated the entire experiment five times.
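A sketch of the measurement primitive itself is below, with probed_empty_function() standing in as a hypothetical name for the minimal ABI-conforming function onto which the probe is inserted; the overhead of the two timestamp reads is measured separately and subtracted.

    #include <stdint.h>

    /* rdtscp waits for earlier instructions to complete before reading the
     * timestamp counter, so the pair of reads brackets the probed call. */
    static inline uint64_t rdtscp(void)
    {
        uint32_t lo, hi, aux;

        __asm__ volatile("rdtscp" : "=a"(lo), "=d"(hi), "=c"(aux));
        return ((uint64_t)hi << 32) | lo;
    }

    extern void probed_empty_function(void);   /* hypothetical probed target */

    static uint64_t time_one_fire(void)
    {
        uint64_t start = rdtscp();

        probed_empty_function();                /* the probe fires here */
        return rdtscp() - start;
    }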

Results Figure 3.8 shows that the cost of firing a Kamprobe is substantially lower than the cost of firing a Kprobe. In the best case (with only one Kprobe in the kernel) it takes at least 6 980 ± 869 cycles to execute an empty Kprobe handler. In comparison, Kamprobes take 69 ± 16 cycles to fire. Moreover, the cost of firing a Kprobe scales linearly with the number of probes in the system, whereas Kamprobes have a constant cost of firing. This is due to Kamprobes encoding the location of the handler directly in the instruction stream, by inserting a JMPQ directly to the Kamprobes wrapper, whereas Kprobes insert an interrupt and have to decode the value of RIP using a hash table, which adds a layer of indirection. Whilst an empty Kamprobe is fast to fire, it still has a measurable cost.

5rdtscp reads the cycle counter whilst ensuring that earlier instructions have completed, so the read is not reordered by out-of-order execution.


Figure 3.8: Cycles to fire one probe against the number of registered probes, for unoptimised Kprobes, optimised Kprobes, and Kamprobes. Firing a Kamprobe takes 69 ± 16 cycles, whereas firing a Kprobe takes at least 9 000 cycles. Moreover, as the number of probes in the system increases, the cost of firing a Kprobe increases linearly, whereas Kamprobes have a constant cost to fire.

3.6.3 Kamprobes executing on bare metal

The main use case of Kamprobes is to provide a probing technique that is designed to execute on virtual machines. Whilst Kamprobes only uses features that virtualise well, it does not require a hypervisor to execute. I now evaluate the time it takes to fire a Kamprobe both in a virtual machine and on bare metal, and find that, for all practical purposes, Kamprobes are as fast and as predictable on virtual machines as they are on bare metal.

Experimental setup To compare the cost of firing a Kamprobe on bare metal against the cost of firing a Kamprobe in a virtual machine I built a microbenchmark. This microbenchmark inserted a Kamprobe onto a call to an empty function and registered an empty pre-handler and an empty return handler. It then executed this function call 2 000 000 times, reading the timestamp counter before and after each call. I discarded the results from the first 200 000 trials to ensure that I did not measure the system with cold caches. I subtracted the number of cycles that it takes to read the timestamp counters in each instance, as recommended by Intel's guidance on benchmarking using the cycle counter, to remove the probe effect incurred by reading the timestamp counter [108].

Results Figure 3.9 shows a cumulative frequency distribution of the time taken to execute a Kamprobe on bare metal and in a virtual machine. Whilst Kamprobes is significantly (p < 0.01) slower on a virtual machine than on bare metal, the difference is unsubstantial: at the median, firing a Kamprobe requires twelve additional cycles on a virtual machine when compared with bare metal. Given that this cost is low—approximately the cost of an L2 cache hit—it is sufficiently close to the cost on bare metal as to be insignificant in the overall cost of an experiment. Indeed, much of the increase in cycles taken to fire a Kamprobe in a virtual machine may be a measurement artefact, caused by the paravirtualisation of the timestamp counter in the virtual machine case. The paravirtualised timestamp counter executes at 1 GHz so that the slowest host that a virtual machine can migrate to should be clocked at 1 GHz or greater. However, the paravirtualised timestamp counter has inaccuracies. For instance,

measuring the paravirtualised timestamp counter takes 36 cycles, whereas measuring the physical timestamp counter takes 45 cycles; one would not expect the virtual timestamp counter to be faster than the physical one. Should errors in the paravirtualised timestamp give incorrect values for the cost of reading the timestamp counter, such that the virtualised counter in fact has the same cost as the physical counter, this would explain nine of the twelve cycles.

The unpredictability of the time it takes to fire a Kamprobe is no higher on virtual machines than on bare metal. By reducing the variability of probing on virtual machines, developers can measure the performance of their software when executing in a virtual machine with less experimental error than they could with current techniques.

Figure 3.9: Cumulative frequency of the number of cycles taken to execute a Kamprobe on bare metal and in a virtual machine. The number of cycles taken to execute a Kamprobe is very close to bare-metal performance and is highly predictable. Similar performance and predictability cannot be achieved with the interrupt-based mechanisms that are commonplace in mainstream operating systems.

3.7 Evaluation summary

I have shown that current techniques, such as Kprobes, execute at least 2.28 times slower in virtual machines than on bare metal. In particular, they all use interrupts, which I have shown take 1 040 ± 1 300 cycles to fire on a virtual machine without any processing of the interrupt, whereas a Kamprobe takes just 69 ± 16 cycles to fire.

3.8 Discussion

I now consider some of the drawbacks of using Kamprobes, when compared with existing techniques.

3.8.1 Backtraces

The aggressive optimisations used in Kamprobes cause the kernel to print incorrect stack traces when a probe handler crashes. In the case of probing a CALLQ instruction, Kamprobes replaces the target of the CALLQ instruction with a Kamprobe wrapper, so the stack frame is associated with the Kamprobes wrapper rather than the target function. The backtrace therefore shows the address of the Kamprobes wrapper rather than that of the target function.

3.8.2 FTrace compatibility

Under the current implementation it is only possible to run Kamprobes and FTrace together if FTrace is enabled before Kamprobes. Enabling FTrace after Kamprobes causes FTrace to replace the call into Kamprobes with its own call. This could be overcome by modifying FTrace to detect whether a Kamprobe is already present at a call site before rewriting it.

3.8.3 Instruction limitations

At present, Kamprobes only applies to a subset of instructions. Yet one may desire the ability to insert probes at arbitrary addresses, so as to probe arbitrary kernel instructions. The techniques for using jump-based probing on any instruction in a variable-length instruction set are well known [137]. Future work could undertake the engineering effort to extend Kamprobes to probe shorter instructions in the instruction set.

3.8.4 Applicability to other instruction sets and ABIs

Kamprobes currently only has an implementation for x86-64 and for operating systems that use the System V AMD64 ABI [95]. Porting the ideas behind Kamprobes to other ABIs would be moderately straightforward, requiring changes to the set of registers that are saved. Implementing Kamprobes for other instruction sets is also possible and, depending on the instruction set, may be more expressive than the x86-64 implementation. Presently, Kamprobes can only probe instructions that are five bytes in length. However, other instruction sets have fixed-width instructions, so inserting a Kamprobe is substantially easier as any instruction can be replaced with a call into a Kamprobe wrapper.

3.9 Conclusion

Current mainstream operating systems use the same probing mechanism both when they execute as a virtual machine and when they execute on bare metal. However, some of the techniques that they use, such as interrupts, perform more slowly and less predictably when executing on a hypervisor. As such, they induce a larger probe effect when used in a virtual machine than they do on a physical machine. This makes it more difficult to accurately measure the performance of kernels executing on a hypervisor. I have argued that this is the wrong approach and instead we should forgo hypervisor fidelity by building a probing system that is designed to perform well in a virtual machine. As such, I presented Kamprobes, an implementation of a jump-based probing system for operating systems that use the System V AMD64 ABI [95]. Kamprobes reduces the number of cycles required to execute a probe from 10 000 cycles (in the best case) to 75 cycles. Moreover, there are only modest differences between executing on bare metal and in a virtual machine for both performance (twelve cycles) and variability (two cycles).

CHAPTER 4

SHADOW KERNELS: A GENERAL MECHANISM FOR KERNEL SPECIALISATION IN EXISTING OPERATING SYSTEMS

Existing operating systems share a common kernel text section amongst all processes. It is not possible to perform kernel specialisation such that different applications execute text optimised for their kernel use, despite the benefits of kernel specialisation for performance and performance measurement, such as probing, profile-guided optimisation, exokernels, kernel fast-paths, and cheaper hardware access. Current specialisation primitives involve system-wide changes to kernel text, which can have an adverse impact on other processes sharing the kernel due to the global side effects. In this chapter, I present Shadow Kernels1: A primitive that allows multiple kernel text sections to coexist in a mainstream operating system executing in a virtual machine, by using the hypervisor's page tables to change the mapping of virtual memory as processes execute. A process executing in a virtual machine can specialise a page of the kernel by issuing hypercalls to remap the kernel virtual memory to different machine-physical pages. Each time that process is context-switched in or out, Shadow Kernels issues the hypercalls necessary to obtain the correct mapping of pages for that process. As such, processes execute different kernel instruction streams that are specialised for their own execution. In using the Shadow Kernels API, developers acknowledge that their application will execute on a hypervisor and forgo hypervisor fidelity to use the hypervisor's existing mechanisms for mapping pages into the guest's memory. As such, Shadow Kernels forgoes hypervisor fidelity to improve specialisation, which can be used to improve measurement of virtual machine performance. In Chapter 3 I showed that Kamprobes are a highly-efficient probing mechanism; in this chapter I show that, despite these low overheads, if Kamprobes are placed on hot kernel functions they can cause a significant performance

1The name ‘Shadow Kernels’ was conceived by Dr Ripduman Sohan.

impact, even with empty probe handlers. However, developers often only want to probe their operating system for a subset of execution contexts. In particular, developers often desire probing the kernel for a particular system call, process or resource container. Despite this, current mechanisms insert probes that fire each time an instruction is hit, with developers then wrapping if statements around their probe handler, which often make the probe handler return without executing any further code. Shadow Kernels allows developers to set probe points that only fire in a subset of the kernel's execution. For instance, a probe can be inserted that only executes when a particular process is scheduled in; therefore, probes do not fire for other processes. By forgoing hypervisor fidelity, Shadow Kernels does not need to make invasive changes to the operating system memory subsystem, has a small implementation (approximately 200 lines of code), and can be ported between operating systems that can issue hypercalls. The contributions in this chapter won the best paper award at APSys 2015 [29] and appear in SIGOPS Operating Systems Review (OSR) January 2016 [30]. Throughout this chapter I acknowledge the contributions of my co-authors on this paper. Minor modifications to this work are the result of shepherding by Dr Gernot Heiser.

4.1 Introduction

Traditional monolithic operating system design has a shared kernel that is mapped into the top of the address space of every process [123]. This design is used by all current major operating systems, including Linux, Windows, the BSDs, and OS X, as well as a number of research operating systems. This design has numerous advantages: the loose coupling of applications and the kernel allows most applications to execute the same kernel code without experiencing any noticeable performance, reliability, or usability issues. Also, shared code has a low memory footprint, there is a higher cache-hit rate, system calls are fast as they do not require a context switch, and shared state eases kernel design and implementation. At the same time, kernel specialisation has been shown to be beneficial [17]: Profile-guided optimisation of Linux can improve performance by up to 10% for some applications [151]; exokernels eliminate abstractions for applications

so that applications communicate more directly with hardware, thereby reducing kernel overheads [45]; and kernel instrumentation can be added that only fires when the kernel is executing on behalf of certain processes. Such kernel specialisation is often process-specific, in that the specialisations applied to one process may have an adverse effect on other processes. For instance, profile-guided optimisation of the kernel improves the performance of some applications and diminishes the performance of others. Similarly, removal of security checks may be desirable for trusted processes, but undesirable for non-trusted processes. Yet, current production operating systems do not provide a primitive for kernel specialisation on a per-process level. The shared kernel means that any changes to the kernel text or data have global effects; there is no way to isolate kernel modifications to individual processes. As such, it is not currently possible to execute an individual process with different kernel optimisations or instrumentation to the rest of the processes executing on the system. To provide a high-performance, useful and effective application augmentation primitive, it is important to have the ability to limit the scope of kernel specialisation and probing. A new low-level primitive is needed to support this: One that isolates kernel specialisation for a single process and allows for quick changes to its scope. To this end, I propose Shadow Kernels: Kernel variants with specialised text sections that are modified with the specialisation required, but share their data sections with the booted kernel. Non-specialised processes continue to run the original unmodified kernel instruction stream, whereas those that require specialisation are dynamically switched to execute the modified code of a shadow kernel. When a process requires kernel specialisation, it makes a call to the Shadow Kernels API, specifying the page numbers that it will specialise. The API issues hypercalls to the Xen hypervisor that cause Xen to remap the pages specified to newly-allocated memory that initially contains the same contents as the original kernel page. The process can then modify the contents of that page using existing techniques. When other processes execute, Shadow Kernels remaps the original kernel memory by issuing further hypercalls to Xen. Therefore, other processes do not execute the specialised instruction stream; rather, they execute the original kernel. In the case of specialisation through setting probes, this allows other processes not to incur a performance overhead.

A further benefit of Shadow Kernels is that they only modify the text section of the kernel, so the advantages of having a shared-state kernel remain. Also, since it is not necessary to modify the virtual address of function entry points, there is no effect on function pointer semantics. Shadow Kernels directly interfaces with the hypervisor to rewrite the memory mappings from virtual to machine-physical frames. Removing the dependencies on the hypervisor would be challenging. As such, by using Shadow Kernels, the software stack from the kernel to the application that spawns the shadow kernels is rearchitected to forgo hypervisor fidelity. An implication of this is that the hypervisor no longer has fidelity; the same software cannot execute on physical machines, let alone execute identically. However, given the performance benefits that I show for virtual machines that use Shadow Kernels, this lack of fidelity is clearly worthwhile. In this chapter, I evaluate Shadow Kernels and explore their use for restricting the visibility of probes such that they only fire for specific processes. I show that switching shadow pages has a cost that scales linearly with the number of specialised pages. As such, specialising the entire kernel is impractical: Doing so causes a 78% increase in the latency of serving a web load. However, for a small number of pages there are minimal overheads—8 000 cycles for eight pages—a cost that is offset by the performance speedup of the specialisation. For instance, Shadow Kernels removes the probe effect of inserting Kamprobes into hot codepaths of the kernel.

4.2 Motivation

I now discuss use cases of per-process kernel specialisation with Shadow Kernels.

4.2.1 Shadow Kernels for probing

Whilst there are numerous benefits of Shadow Kernels, in this chapter I predominantly focus on the application of Shadow Kernels to kernel instrumentation. A key advantage of modern probing systems, which has been crucial in their recent widespread adoption by major operating systems, is having zero probe effect whilst disabled [20]. That is, they dynamically modify the instruction stream to insert probes, rather than compile them in. If the probing system is disabled

then the instruction stream is not modified and the application executes as normal, without a significant performance degradation. This approach of rewriting the instruction stream is crucial since, even if probe points are wrapped in an if statement, there is an—albeit small—performance overhead. However, the current state of the art offers no such protection from the probe effect when enabled. Firing a probe—even with an efficient probing mechanism and without executing code in the probe handler—consumes CPU cycles. In particular, when code is shared between multiple applications—as is the case with the operating system kernel and shared libraries—the probes are fired for every application using the shared code. This impacts the performance of every application that is using the shared code, even when if guards are used to protect the probe point.

The crux of the issue is that kernel probing primitives rely on modifying the shared instruction stream, for instance by inserting a JMPQ or an int3. Every time the kernel executes a probed instruction, a probe is fired, even if instrumentation is only desired for a specific system call or process.

The Achilles' heel of this approach is that any process that executes the instrumented address or function calls into the instrumentation system regardless of whether it is required. It is currently impossible for users to restrict the scope of the instrumentation to a particular process. The unavoidable penalty of hitting the probe is incurred by every process each time it is executed, regardless of whether it is applicable to the executing process. This overhead is significant even if no action is taken once the probes fire. This is particularly problematic if the application being investigated consumes a minority of the system's CPU cycles, since if hot functions are probed—those that are an obvious cause of poor performance—every system call made by the rest of the applications on the system could become substantially slower.

With Shadow Kernels, probes can be set so that they only fire for an individual application, thereby leaving the performance of the well-behaving programs untouched. The overall probe effect of the added instrumentation is also reduced: Setting kernel probes on hot functions such as kmalloc or tcp_sendmsg on a busy server no longer degrades overall system performance.

4.2.2 Per-process kernel profile-guided optimisation

Recent work considers applying profile-guided optimisation to operating system kernels to improve performance [151]. Most of the gains of profile-guided optimisation come from improving code layout based on results from Ball–Larus path profiling [10], optimising branches, and speculatively estimating the values of expressions based on common runtime values. An unsolved issue with profile-guided optimisation is that the optimisation must be based on a representative workload. In particular, if the kernel is optimised based on one application then other applications executing on the same system often see a slowdown in performance. Yuan et al. show that profile-guided optimisation of the Linux kernel can improve the performance of some applications by 10%, and reduce the performance of others [151], even when executing a single application on a machine. As profile-guided optimisation often only affects the instruction stream, Shadow Kernels allow applications executing on the same machine to each execute with their own kernel that is optimised with profile-guided optimisation specific to that program.2 This allows a per-process training phase that generates a shadow kernel for each process. So long as the profile-guided optimisations do not modify the data sections—which can be ensured through compiler flags—each process can have its own shadow kernel. Each time that the scheduler schedules in a process, it remaps the kernel to the appropriate shadow kernel. This could be further extended to allow multiple shadow kernels per process.

4.2.3 Kernel optimisation and fast-paths

Kernel configuration options are often a tradeoff between performance and utility of features. For example, kernel options that provide debug modes for locks, schedulers and memory allocators add additional code to the instruction stream, which causes a performance degradation. With Shadow Kernels, configuration options that only affect the text section can be applied to individual processes, without system-wide effects. Two such fast-paths that can be applied to individual processes are: (i) removing

2Use case of profile-guided optimisation suggested by Nikilesh Balakrishnan.

security checks for trusted processes, and (ii) removing some concurrency operations when executing a process on a single core. (i) A key rôle of the operating system kernel is to perform security checks. However, applying these checks can consume a substantial amount of computational resources [111]. Often, some processes—such as system processes—are trusted, whereas others ought to be subject to the usual kernel security checks. Moreover, applications such as debuggers often need to subvert the usual security checks to introspect the memory of another process. However, with current kernel models, the same checks are applied to all processes. With Shadow Kernels, privileged processes can be mapped onto shadow kernels that contain exactly the security checks relevant to each application. (ii) Virtual machines operating in the cloud can be subject to vCPU hotplugging by the cloud provider. Should a virtual machine change from having multiple vCPUs to one vCPU, there is scope for optimisation by eliminating concurrency primitives from the kernel instruction stream. In particular, a key benefit of Shadow Kernels built using a hypervisor is that, with a few implementation changes, a privileged domain—controlled by the cloud provider—could trigger switches of shadow kernel as part of the hotplugging routine by rewriting the kernel text section. As such, when the virtual machine operates with multiple vCPUs, it could have an SMP instruction stream, and when it executes with a single vCPU it could execute with fewer concurrency mechanisms.

4.2.4 Kernel updates

Linux offers live kernel patching to apply a patch without rebooting [7, 93]. However, kernel functions that are currently on the stack frame of any thread cannot be patched. This is problematic when threads are in loops that prevent them from exiting functions that require patching: no part of the system is updated until the affected code is no longer being executed. With Shadow Kernels, those threads that do not have the function that is being updated on their stack can be updated, by executing in a shadow kernel. As other threads pop stack frames, if they are no longer executing the function that is being replaced, those threads can also be switched to a shadow kernel.

 1  #include
 2  shdw_descriptor sd;
 3  int err{shadow_create(&sd)};
 4
 5  if (err < 0) {
 6      return err;
 7  }
 8
 9  if (!shdw_switch(sd)) {
10      shdw_add_pages(sd, ...);
11      kamprobes_register(...);
12
13      // Use a shadow kernel for *all* processes on the system.
14      ...
15      // Switch back to the original kernel.
16      shdw_reset();
17  }
18  ...
19  // Now just use the shadow kernel for *this* process.
20  shdw_switch_pid(sd);

Figure 4.1: The Shadow Kernels API allows privileged applications to spawn shadow kernels and add pages to their shadow kernels.

4.3 Design and implementation

The key idea behind the implementation of Shadow Kernels is that a virtual machine can have multiple instruction streams available to it and switch between them by issuing hypercalls that cause the hypervisor to change which machine-physical frame number maps to each virtual frame number. My current implementation is built for Linux executing on Xen and has two components: A user space API and a kernel module.

4.3.1 User space API

A virtual machine operating system with Shadow Kernels support boots in the traditional manner, using an unmodified kernel and an unmodified hypervisor. An application that uses Shadow Kernels must execute with root privileges, by

design, as Shadow Kernels interacts with the hypervisor and modifies the kernel; it is not an operation that should be performed by unprivileged guests. Applications use the Shadow Kernels API to explicitly spawn and switch shadow kernels. This API, outlined in Figure 4.1, is a user space C library against which any user space program can link. Applications must #include the Shadow Kernels header file (1). This allows the application to create a shadow kernel (3). Creation of a shadow kernel is not guaranteed to succeed, and error codes are returned if there is insufficient memory (-ENOMEM) or the program is not executing as root (-EACCES). As such, developers should check the return code (5). Each shadow kernel has a shadow kernel handle, which can also be shared amongst processes, applied to all processes in a Linux container, or used within a resource container [11]. For instance, a system may have a shadow kernel with complete instrumentation that any process can use to get a function call graph. If the shadow kernel creation succeeds, the application can switch the entire operating system to using that shadow kernel (9). This can be used to temporarily change the entire operating system kernel before returning to the original kernel (15). When executing in a shadow kernel, an application can specify which pages should be part of that shadow kernel (10) and use existing methods to perform the kernel specialisation (11). If the specialisation modifies a page that has not been added to the shadow kernel, then the modification affects all processes unless they have their own shadow kernel that includes that page. Whilst using shdw_switch and shdw_reset allows developers to interact with Shadow Kernels at a low level, and to pass the shadow kernel handle (sd) between different processes, threads, or resource containers, it can be difficult to write software that consistently executes in the correct shadow kernel. I therefore provide an alternative API (20) that swaps shadow kernels whenever a process is scheduled in or out. This allows individual processes to specialise the kernel such that the specialisation is only active when the kernel executes on behalf of that particular process.

4.3.2 Linux kernel module

The core of Shadow Kernels is a Linux kernel module that can execute on an unmodified Linux kernel (tested on v3.19). A kernel module is needed to interpose

the scheduler so that shadow kernel switches can occur on each context switch. If this feature were not required, Shadow Kernels could be implemented in user space by using privcmd to issue hypercalls from user space.

Figure 4.2: An unsorted array tracks the pages in the shadow kernel that have been modified.

4.3.2.1 Module insertion

On initialisation, the kernel module registers a Linux char driver, through which the user space API communicates with the kernel module. This communication is performed using the Linux netlink mechanism, which makes all operations synchronous. There is also an asynchronous version of Shadow Kernels, whereby switching shadow kernel is deferred until the next system call is made, to avoid the overhead of an additional system call.

4.3.2.2 Initialisation of a shadow kernel

To initialise a shadow kernel, Shadow Kernels first allocates an unsorted array that stores the frame numbers of frames that have been modified in the

shadow kernel. Figure 4.2 shows the unsorted array that keeps track of the pages that have been modified in the shadow kernel. An unsorted array has desirable costs for the operations required for a shadow kernel: constant-time insert and next operations.
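A sketch of that bookkeeping is below: an append-only array of (virtual address, shadow machine frame) pairs. The structure and names are illustrative rather than the module's exact code, and MAX_SHADOW_PAGES is a hypothetical capacity.

    #include <linux/errno.h>

    #define MAX_SHADOW_PAGES 2048              /* hypothetical capacity */

    struct shadow_page {
        unsigned long vaddr;                   /* kernel virtual address shadowed */
        unsigned long shadow_mfn;              /* machine frame of the specialised copy */
    };

    struct shadow_kernel {
        struct shadow_page pages[MAX_SHADOW_PAGES];
        unsigned int count;
    };

    /* O(1) append; iterating the array during a switch costs O(1) per entry. */
    static int shadow_track(struct shadow_kernel *sk,
                            unsigned long vaddr, unsigned long mfn)
    {
        if (sk->count == MAX_SHADOW_PAGES)
            return -ENOSPC;

        sk->pages[sk->count].vaddr = vaddr;
        sk->pages[sk->count].shadow_mfn = mfn;
        sk->count++;
        return 0;
    }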

4.3.2.3 Adding pages to the shadow kernel

A call to shdw_add_pages allocates a physical page for each page that needs shadowing and memcpys the contents of the respective page from the booted kernel to the newly allocated page. Shadow Kernels then adds an entry into the shadow kernel handle that maps the virtual address of the page that is to be shadowed to the machine-physical frame number of the freshly-allocated page. If the shadow kernel handle corresponds to the currently active shadow kernel, then Shadow Kernels updates the virtual-to-machine-physical mappings, as described later in Section 4.3.2.4. It is not currently possible to "unshadow" pages, principally due to a lack of use case, so fragmentation of the bookkeeping structure is not currently an issue. However, this is a limitation of the current implementation, rather than a conceptual limitation.
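A minimal sketch of the per-page copy is below, assuming the standard kernel and Xen helpers (alloc_page(), page_address(), virt_to_mfn()); the helper name and error handling are illustrative, not the module's exact code.

    #include <linux/gfp.h>
    #include <linux/mm.h>
    #include <linux/string.h>
    #include <asm/xen/page.h>

    /* Hypothetical sketch of adding one page to a shadow kernel: allocate a
     * fresh frame, copy the live kernel text into it, and report its machine
     * frame number so a later switch can map it over the original. */
    static int shadow_one_page(unsigned long kernel_vaddr, unsigned long *shadow_mfn)
    {
        struct page *pg = alloc_page(GFP_KERNEL);

        if (!pg)
            return -ENOMEM;

        memcpy(page_address(pg), (void *)kernel_vaddr, PAGE_SIZE);
        *shadow_mfn = virt_to_mfn(page_address(pg));
        return 0;
    }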

4.3.2.4 Switching shadow kernel

Shadow kernels can be switched in one of two ways: either manually, by using a shdw_descriptor and calling shdw_switch; or by using shdw_switch_pid, which switches shadow kernel whenever a process is scheduled in or out. The two methods are similar; however, shdw_switch_pid interposes the Linux scheduler3 such that it calls shdw_switch whenever the scheduler schedules in or schedules out a process. Switching shadow kernels makes modifications to the Xen paravirtualised page tables. Under Xen, paravirtualised guests are aware that they are operating in a virtualised environment and, as such, maintain page tables that map virtual addresses to machine addresses, as opposed to physical addresses. This prevents the need for an additional layer of page tables that map physical addresses to machine addresses. However, to prevent a guest from mapping machine memory into its page tables that belongs to another domain—thereby circumventing

3I reuse Linux scheduler interposition code written by Lucian Carata and modified by James Snee.

memory isolation of guests—page tables are mapped into the guest address space with read-only permissions. All page table updates need to be made through hypercalls, with Xen ensuring that the machine frame number has sufficient permissions to access the page. This requires Shadow Kernels to perform two memory update hypercalls, bundled into a multicall to reduce the overheads of multiple vm-exits and vm-entries (a sketch of the update batch follows Figure 4.3):

1. MMU_NORMAL_PT_UPDATE updates the virtual-to-machine page table such that the newly allocated-and-specialised page is mapped into the same virtual address as the corresponding page from the current kernel.

2. MMU_MACHPHYS_UPDATE updates the machine-to-physical page table to correct the reverse mappings, as the physical address of a shadow page does not change when the hypervisor remaps the virtual-to-machine mappings.

Figure 4.3: A non-privileged domain (domU) allocates memory for each shadow kernel, copies the booted kernel into that memory and then unmaps the memory. To switch to a shadow kernel the domU updates its page tables by making two hypercalls to Xen.
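The sketch below shows the shape of that two-element update batch, assuming the caller has already computed the machine address of the page-table entry, the new machine frame number, the guest physical frame number, and the protection bits. The real module issues the updates from its multicall path; here they are simply passed as one HYPERVISOR_mmu_update() batch, which Xen also processes in a single hypercall.

    #include <linux/types.h>
    #include <asm/pgtable_types.h>
    #include <xen/interface/xen.h>
    #include <asm/xen/hypercall.h>

    /* Hypothetical sketch of one shadow-page remap.  The parameter values
     * (pte_machine_addr, new_mfn, guest_pfn, prot) are assumed to have been
     * computed by the caller. */
    static int remap_shadow_page(uint64_t pte_machine_addr, unsigned long new_mfn,
                                 unsigned long guest_pfn, pteval_t prot)
    {
        struct mmu_update req[2];
        int done = 0;

        /* 1. Point the kernel virtual address at the specialised frame. */
        req[0].ptr = pte_machine_addr | MMU_NORMAL_PT_UPDATE;
        req[0].val = ((uint64_t)new_mfn << PAGE_SHIFT) | prot;

        /* 2. Keep the machine-to-physical table consistent. */
        req[1].ptr = ((uint64_t)new_mfn << PAGE_SHIFT) | MMU_MACHPHYS_UPDATE;
        req[1].val = guest_pfn;

        return HYPERVISOR_mmu_update(req, 2, &done, DOMID_SELF);
    }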

4.3.2.5 Interaction with other kernel modules

Shadow Kernels is also compatible with kernel modules: each time a module is inserted, Shadow Kernels iterates over the page tables and performs a mapping, or unmapping, for each shadow kernel. My current implementation prevents

modules from being removed whenever any page of the module is added to a shadow kernel, by incrementing the count field on the module such that it cannot reach 0. This prevents a module being removed and Shadow Kernels later mapping a page back in from that module during a shadow kernel switch, since, if a new module were inserted at the same virtual address range as the old module, Shadow Kernels would corrupt the instruction stream. This limitation can be overcome by hooking into the module subsystem of the Linux kernel such that whenever a module is removed, it is also removed from all shadow kernels.
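A sketch of the pinning step is shown below, using the standard try_module_get() reference-count helper; the real implementation may manipulate the module's count differently.

    #include <linux/module.h>

    /* Hypothetical sketch: pin a module whose text has been added to a shadow
     * kernel so that it cannot be unloaded while a shadow mapping may still
     * point into it. */
    static int pin_shadowed_module(struct module *mod)
    {
        return try_module_get(mod) ? 0 : -ENOENT;
    }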

4.4 Evaluation

I now evaluate Shadow Kernels, showing that for a small number of pages that need specialising, the cost of the specialisation is low (835 ± 354 cycles to switch a shadow page). However, the cost scales linearly with the number of pages specialised. As such, it can cost up to 1.5 × 10⁶ cycles to specialise every page in the kernel, which I show has a significant performance effect on the throughput of a key-value store (memcached). I execute all experiments on an Intel Xeon E3-1230 V2 @ 3.3 GHz, running Ubuntu 15.04, with a Linux v3.19 kernel compiled from the Linus branch, and Xen v4.5. This machine has 32 KB L1 caches, 256 KB L2 caches, and an 8 MB L3 cache. I backport Xen vPMU patches to Linux (from a subsystem maintainer's Linux v4.3-next branch) and Xen (from v4.6-testing) to measure domain performance. Each experiment is performed in an unprivileged domain (domU), with one vCPU (pinned to a physical CPU), 8 GB of memory and with no other concurrent virtual machines (except domain zero). The domain is configured with a paravirtualised timestamp counter, which is recommended for 'both correctness and high performance' [92]. All experiments follow Intel's guidance on benchmarking using the cycle counter [108].

4.4.1 Creating a shadow kernel

I start by considering the cost of creating a shadow kernel and adding all pages in the original kernel to the shadow kernel. Creating a shadow kernel is an action that I expect to be infrequent, as it only occurs when an application requires new specialisation. By adding all pages, I bound the worst case of the time that it takes to create a shadow kernel. However, I later argue that in most

circumstances, users will add fewer pages, thereby further reducing the (already-negligible) cost. The results from this experiment show that the costs of creating a shadow kernel are sufficiently small that adding them to program startup time adds a non-problematic overhead.

Figure 4.4: Distribution of the wall-clock time and cycles taken to create a shadow kernel. The difference between the two distributions represents the cycles spent by the hypervisor in creating a shadow kernel.

Experimental setup To measure the cost of creating a shadow kernel I used a microbenchmark that repeatedly created a shadow kernel and added every page in the kernel to the shadow, but did not switch to it. I measured either the wall-clock time or the timestamp counter before and after creating a shadow kernel and took the delta between the two measurements. Both were measured because the timestamp counter reports the number of cycles spent in the virtual machine, whereas the wall-clock time also includes time spent with the hypervisor executing. In all experiments I mitigated the probe effect by subtracting the mean base time that it takes to read the timestamp counter or wall-clock time.

Results Figure 4.4 shows the distributions of cycles and wall-clock time taken to create a shadow kernel. Creating a shadow kernel is expected to be an operation performed rarely, for instance on program startup or when a user attaches

a debugger to a program. Therefore, it is not necessary for the implementation to be optimal. Nevertheless, in the worst case (adding every page in the kernel to the shadow kernel) it takes 820 ms ± 261 ms to create a shadow kernel. Given the rarity of creating a shadow kernel, this cost is not prohibitive. However, should this cost become limiting, an optimisation would be to remove the use of the Linux Contiguous Memory Allocator (CMA), which allocates the memory for the shadow kernel, rather than using kmalloc. Linux CMA is slower than kmalloc and is the cause of the slow performance. kmalloc on Linux has an upper limit of 4 MB, whereas the kernel text section measures approximately 8 MB. However, the virtual addresses are unmapped shortly after creation, so two 4 MB regions could be allocated and unmapped.

4.4.2 Switching shadow kernel

I now consider the costs of switching to an existing shadow kernel, an action typically performed on each context switch. This is the cost incurred whenever there is a switch of shadow kernel and so should be lower than the savings made by the specialisation. I specifically consider two measurements of the cost of switching shadow kernel: The direct cost of the cycles and wall-clock time that it takes to perform the switch to a shadow kernel, and the indirect cost of switching to a shadow kernel and causing cache evictions.

4.4.2.1 Switching time

Experimental setup To measure the cost of switching shadow kernel I used a microbenchmark that repeatedly switched to a shadow kernel with a varying number of pages. This is a user space program that used the Shadow Kernels API to communicate using netlink with the Shadow Kernels kernel module, causing the domain to issue hypercalls that triggered a switch of shadow kernel. I measured the combined cycles spent in the kernel and the hypervisor in switching shadow kernel.

Results Figure 4.5 shows the wall-clock time and cycles spent executing the microbenchmark, varying the number of specialised pages between 1 and 1927, the number of pages in the text section of the kernel. These are both linearly

(O(n)) related to the number of pages modified, since for each page Xen performs an unmap and a remap operation. However, such an O(n) relationship gives an overly-expensive cost (∼1.5 × 10⁶ cycles) for shadowing every page in the kernel. Unless the specialisation has both a substantial performance benefit and is called many times per scheduling quantum, it is unlikely that Shadow Kernels will improve performance if applied to every page in the kernel. I only envisage a small number of pages in a shadow kernel, which has a low performance overhead, as shown by the zoomed-in plots on the left-hand side of Figure 4.5. For a small number of pages the cost of switching to a shadow kernel is low. For instance, five pages can be shadowed in 4 100 cycles, which is lower than the overhead of needlessly firing 61 Kamprobes. It is reasonable to expect a Kamprobe to fire more than 61 times per scheduling quantum.

Figure 4.5: The time to switch to a shadow kernel is linearly related to the number of pages that are modified in the shadow kernel. Therefore, shadowing every page in the kernel is slow. I therefore only envisage shadow kernels that contain a few pages, which have low overheads, as shown by the left-hand plots.

4.4.2.2 Effects on caching

Caching is vital in ensuring that software executes in a performant manner. However, by modifying the page table entries of virtual machines, Shadow Kernels invalidates cache entries, since after a switch of shadow kernel some virtual-to-machine-physical mappings are no longer valid. Often, invalidating cache lines can cause poor performance. However, my experiments show that the effect of these cache evictions is negligible.

Experimental setup To measure the effects of Shadow Kernels on caching, I created a microbenchmark that switches to a shadow kernel for n of the most commonly used kernel pages and then sleeps for one second. There was also a background load of an SSH connection and other stock-Ubuntu programs executing on the same server. I used the Xen vPMU to measure the performance of the caches whilst profiling only the domain executing with Shadow Kernels. Only measuring the current domain ensures that I only report the additional performance overhead experienced by the domain and not the other virtual machines executing on the same server, such as domain zero. By waiting for one second I reveal the latent performance effect of any cache evictions that take place due to Shadow Kernels.


Figure 4.6: As the number of pages in the shadow kernel grows there is a modest increase in iTLB load-misses and L3 cache misses.

Results Figure 4.6 shows the effect of Shadow Kernels on caches and the TLB. I show the results of using a shadow kernel containing between zero4 and nine pages—the range that I expect Shadow Kernels to be used for—as well as the rest of the kernel, to show the scalability of Shadow Kernels. As the number of pages in the shadow kernel increases, there is a linear rise in the cycles spent performing the experiment, which we expect given that I have already shown that the cycles taken to create a shadow kernel grow linearly with the number of pages in the shadow kernel. As the number of pages increases there is a statistically-significant but modest increase in the number of iTLB load-misses and L3 cache misses. This is to be expected, as the shadow kernel switch invalidates entries in both the TLB and the cache. There is no significant increase in the number of L1 cache misses, as the L1 cache is sufficiently small that invalidating all of its entries causes too few additional misses to be observable.

4.4.3 Kamprobes and Shadow Kernels

I now show that by combining the high-performance probing of Kamprobes with per-process kernel specialisation using Shadow Kernels, we can restrict the scope of a firing probe to an individual process. As such, the overhead of probes firing on behalf of processes that are not the target of the instrumentation becomes insignificant or negligible. The overheads of switching shadow kernels grow linearly with the number of pages in the shadow kernel, whereas the overheads of firing probes that should not fire are proportional to the number of probes fired. How often a process switches shadow kernel is in turn correlated with how often the process is scheduled, since scheduler interposition triggers a switch of shadow kernel. As such, there is a tradeoff between these metrics: unwanted probe fires, the number of shadowed pages, and the time the process spends scheduled in.

Experimental setup I execute the operating system benchmarks from lmbench [97], a benchmarking tool for POSIX operating systems, under three conditions:

1. No probes and no Shadow Kernels. This is a baseline for performance.

4Zero pages represents Shadow Kernels disabled.

2. Kamprobes with empty handlers on the five hottest kernel functions (as measured with FTrace) when executing lmbench, but no Shadow Kernels. This represents the effect of repeatedly firing Kamprobes for a process that should not be specialised. Such a setup is a best-case scenario whereby a developer wishes to probe the kernel's interaction with a specific process on the machine but not the other processes. This would typically be done using an if statement based on the process PID. However, I omit that check so as to show the best-case performance of inserting a probe.

3. Kamprobes with empty handlers on hot kernel functions, but with Shadow Kernels restricting them to a single process (which executes approximately 0.4% of the CPU load). This shows the effect of using Shadow Kernels to restrict the specialisation to a single process.

I execute each lmbench experiment twenty times.5

Results Figure 4.7 shows the results of executing lmbench's process benchmarks. If Kamprobes—despite their minimal overheads—are inserted onto the hottest kernel functions there is a significant, and sometimes substantial, performance degradation in the processing of I/O operations (null I/O, stat, open close, and select), signal handling, and process forking. This is due to the Kamprobe probe handler repeatedly firing. However, when I apply Shadow Kernels to restrict the scope of the specialisation such that it only applies to one process, this overhead becomes insignificant in every case. The only experiment with a significant difference between executing with and without Shadow Kernels is the case of sh proc,6 which I believe to be due to these operations making heavy use of the hypervisor, as Xen has to mediate all the page table updates in process creation. By using Shadow Kernels, the hypervisor spends more time executing, which prevents the hypervisor's cache lines from being evicted, thus causing a speedup for the benchmark. Figure 4.8 shows the results of executing lmbench's context-switching benchmarks. Inserting Kamprobes on hot kernel functions does not cause a substantial—or, in some cases, even statistically significant—increase in the cost of performing a context switch.

5lmbench internally performs eleven repeats. 6sh proc executes fork followed by /bin/sh -c so that the system shell finds an executable from $PATH.


Figure 4.7: lmbench: Processes. Smaller is better. Inserting Kamprobes onto hot kernel functions increases the latency of system operations. However, by using Shadow Kernels the Kamprobes are not in the instruction stream of most kernel executions so have negligible performance impact.

[Figure 4.8: bar chart of time (µs) for the lmbench context-switch benchmarks (2p/8p/16p processes with 0K/16K/64K working sets) under the three conditions: no probes, Kamprobes, and Kamprobes with shadow kernels.]

Figure 4.8: lmbench: Context switching times (smaller is better). Context switches are comparatively rare, so the inclusion of Kamprobes does not have a large effect on context switch time. Moreover, Shadow Kernels does not have a negative performance impact.

[Figure 4.9: bar chart of time (µs) for the lmbench file and virtual memory benchmarks (0 KB file create, 0 KB file delete, 10 KB file create, 10 KB file delete, mmap latency, protection fault, page fault) under the three conditions: no probes, Kamprobes, and Kamprobes with shadow kernels.]

Figure 4.9: lmbench: File and virtual memory system latencies (smaller is better). Where Kamprobes have a significant impact on performance this can be mitigated by using Shadow Kernels.

This is because context switches consume a minority of CPU cycles, so their codepaths are not hot. However, Shadow Kernels can create a slight decrease in the time taken to perform a context switch. Figure 4.9 shows the results from lmbench's file and virtual memory benchmarks. There is little additional cost of firing Kamprobes for the protection fault, page fault, and file deletion metrics, as they do not exercise the kernel's hottest functions. Nor is there a significant cost of having used Shadow Kernels in these cases. File creation and mmap show a substantial increase in their execution time when Kamprobes are inserted into the kernel. However, by using Shadow Kernels this overhead is eliminated. We can also notice a decrease in the execution time of 0 KB file create and of mmap latency, which is also a page-table-heavy operation, and therefore Shadow Kernels helps keep the hypervisor's cache lines hot.

[Figure 4.10: scatter plot of lighttpd server-side response time (ms) against the number of shadow pages, from 0 to approximately 1900.]

Figure 4.10: As the number of pages in the shadow kernel increases, the server-side latency of lighttpd increases.

4.4.4 Application to web workload

Having shown a microbenchmark cost of switching to a shadow kernel, I now show the overheads when applied to a realistic workload.

Experimental setup I modified lighttpd to switch to a shadow kernel, with a varying number of pages in the shadow kernel, whenever it is scheduled in by the operating system scheduler. This setup represents the case of adding probes to the kernel that only fire when one process executes. Lighttpd served a static 217 KB file,7 chosen to be representative of a real-world workload. I increased the number of pages in the shadow kernel from 0 to 1920 in increments of ten, measuring the response time for 100 requests at each level, with a client concurrency level of one.

Results Figure 4.10 shows that as the number of pages in the shadow kernel increases, the server response time increases as well. The server-side median response time without using Shadow Kernels is 3.54 ms ± 0.55 ms, as can be observed in Figure 4.10. With a shadow kernel that remaps every page, the server response time is 4.34 ms ± 0.53 ms. Whilst in low-latency setups an additional overhead of 0.2 ms may be undesirable, for kernel specialisations that do not specialise all of the kernel text (the majority of which is for unused drivers) the overheads are lower. For instance, there is no significant increase in latency when fewer than 300 pages are specialised.

7 The homepage of https://edition.cnn.com/.

4.4.5 Evaluation summary

I have shown that Shadow Kernels are a performant technique for performing per-process specialisation of kernel code in a mainstream operating system. Due to the O(n) complexity of updating the MMU, it is currently only a feasible technique for a low number of pages. However, Shadow Kernels is the first work to use the Xen MMU in this way, so future work may improve this cost. As well as the cost of updating the Xen MMU, I have shown that Shadow Kernels affects the cache behaviour of the operating system, which can cause some unexpected performance effects [42].

4.5 Alternative approaches

There are alternative methods for reducing the overheads of kernel probing mechanisms:

Additional flow control. Whilst Shadow Kernels uses page table switching to remove specialisation from the kernel instruction stream, it is also possible to add additional if statements to the instruction stream, each of which checks a flag indicating whether the specialisation should be used (sketched below). There are several downsides to this approach. Firstly, there is an increase in the number of instructions that need to be executed, since the initial JMPQ instruction must be taken, then registers must be saved, the flag checked, and the registers restored. Secondly, adding such checks to the probing API is no different to adding a check to the start of the probe handler, which I have shown to have a significant overhead. Thirdly, the extra conditional branches may decrease the branch prediction rate; although if the flag is set on a per-process basis, one would expect the branch predictor to mispredict only the first branch after a context switch and to predict correctly from the second branch onwards, which largely eliminates this concern.
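A minimal sketch of this flag-check alternative follows; the identifiers are illustrative and do not correspond to the Kamprobes or Shadow Kernels implementations.

```c
#include <linux/types.h>

/* Illustrative flag for whether the specialisation is active; in practice it
 * might be held per-process rather than globally. Not the real implementation. */
static bool specialisation_enabled;

void instrumented_kernel_function(void)
{
	/* The initial JMPQ into this wrapper is always taken; registers are
	 * then saved, the flag checked, and registers restored, even when the
	 * specialisation is not wanted. */
	if (specialisation_enabled) {
		/* ... specialised probe body ... */
	}
	/* ... original function body continues ... */
}
```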

Repeated rewriting of the kernel binary. Shadow Kernels reduces the overhead of firing probes by updating the Xen page tables to allow processes to map different kernel binary instruction streams. A similar effect could be achieved by instrumenting the scheduler such that whenever a process is scheduled in, a tracepoint executes that memcpys a new binary over the same virtual address range as the current binary. The disadvantage of this is that to ensure correctness under concurrency, the tracepoint must first obtain a big lock (the stop_machine facility in the Linux kernel), which ensures that there are no kernel threads executing the instructions that are being poked. Obtaining such locks is expensive, especially as the number of processors increases, since they all need to rendezvous. Furthermore, unlike with Shadow Kernels, applications executing on a multicore machine cannot each map their own shadow kernel concurrently. Therefore, a second mechanism needs to be employed if two processes with different specialisations are executing concurrently on different cores.
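The sketch below illustrates the cost structure of that alternative; apart from stop_machine() itself the names are illustrative, and details such as making the kernel text writable and error handling are omitted.

```c
#include <linux/stop_machine.h>
#include <linux/string.h>

/* Illustrative only: swap a pre-built specialised text image into place while
 * every CPU is held in the stop_machine rendezvous. */
struct text_swap {
	void *dst;              /* live kernel text to overwrite (must be made writable) */
	const void *src;        /* pre-built specialised text */
	size_t len;
};

static int do_text_swap(void *data)
{
	struct text_swap *s = data;

	memcpy(s->dst, s->src, s->len); /* safe only because all CPUs are stopped */
	return 0;
}

static void switch_text_on_schedule_in(struct text_swap *s)
{
	/* Every schedule-in of the target process pays for a machine-wide
	 * rendezvous of all CPUs, unlike a Shadow Kernels page-table switch. */
	stop_machine(do_text_swap, s, NULL);
}
```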

Hypervisor-free implementation. Shadow Kernels is designed for paravirtualised virtual machines, as the Shadow Kernels module issues hypercalls to the hypervisor, causing it to rewrite the virtual machine's page tables. Shadow Kernels therefore cannot be applied to a bare-metal operating system. Whilst the central idea of Shadow Kernels, remapping the kernel instruction stream dynamically, can be applied to bare-metal machines, doing so would lose some key benefits. In particular, Xen-based Shadow Kernels are unobtrusive and can be applied to any paravirtualised operating system that can allocate physically-contiguous memory, by writing a small kernel module. This is because Xen manages the paravirtualised page tables and has a common hypercall interface to every operating system that runs on it. The implementation effort to write a bare-metal implementation of Shadow Kernels is much higher: as well as having to modify the core of the operating system (rather than writing a kernel module), developers need to make extensive and intrusive changes to the memory management subsystem. For instance, Linux assumes that its kernel's virtual memory is mapped at a fixed offset into physical memory. Previous efforts to relax this assumption slightly (such as grub adding support for listing bad memory addresses to which the kernel should not be mapped,8 whilst still enforcing a boot-time relationship between physical and virtual addresses) have been invasive, as operating systems typically make numerous assumptions about this mapping. For instance, the Linux __pa() macro, which maps kernel virtual addresses to physical addresses, uses simple arithmetic and bitwise operators to translate addresses.
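For reference, the classic form of these helpers is a fixed-offset translation along the following lines; the constant shown and the exact definitions vary by architecture and kernel version, so this is indicative rather than the precise kernel code.

```c
/* Classic fixed-offset mapping between kernel virtual and physical addresses.
 * The PAGE_OFFSET value is the traditional x86-64 direct-map base and is
 * illustrative; real kernels define these per architecture. */
#define PAGE_OFFSET  0xffff880000000000UL
#define __pa(vaddr)  ((unsigned long)(vaddr) - PAGE_OFFSET)
#define __va(paddr)  ((void *)((unsigned long)(paddr) + PAGE_OFFSET))
```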

4.6 Discussion

4.6.1 Modifications required to kernel debuggers

Currently Shadow Kernels do not work with kernel debuggers. Existing kernel debuggers, such as KDB, need to become aware of the different instruction streams that execute when other processes are scheduled in. Presently, applying a kernel debugger to an operating system executing with Shadow Kernels produces erroneous results and crashes the debugger, as the debugger assumes that the instruction stream that it reads (through /proc/kcore on Linux) is only modified from within the operating system, for instance by the insertion and removal of kernel modules. With Shadow Kernels, this assumption is wrong, as the hypervisor remaps the instruction stream based on a scheduler interposition layer. Adding support for Shadow Kernels to a kernel debugger does break the portability of Shadow Kernels in that, unlike the rest of my implementation, it requires substantial engineering effort that cannot be ported between operating systems that operate in a paravirtualised environment. Furthermore, it would require changes to the core of the operating system kernel, rather than inserting a new kernel module. That said, the use of debuggers already influences operating system design, since a debugger bypasses many of the security features, including process isolation, that are applied to typical programs.

4.6.2 Software guard extensions

Traditionally, virtual machines must trust the hypervisor on which they execute. A malicious or incompetent administrator can modify the memory of a virtual

8https://lwn.net/Articles/440319/

machine and corrupt its state, or read a virtual machine's memory without consent from the virtual machine. To prevent this, extensions have been proposed to x86-64 that protect the virtual machine from the hypervisor such that the hypervisor is unable to snoop on or modify the virtual machine. Haven is one system that executes using Intel Software Guard Extensions (SGX) [13]. SGX enclaves break Shadow Kernels as they prevent the hypervisor from performing remappings of the page tables 'behind the back' of the virtual machine. It is unclear whether SGX will become commonplace in cloud environments.

4.7 Conclusion

Current operating systems map a common kernel instruction stream into the address space of all processes executing on the system. Whilst this approach has many benefits, including low memory requirements, low complexity, and loose coupling between applications and the kernel, there are circumstances in which applications can gain from kernel specialisation. Primarily, I have considered probing as a use case of specialisation. However, I have also explored proposed benefits of per-process kernel specialisation including per-process profile-guided optimisation of the Linux kernel. Whilst previous work appreciates the benefits of kernel specialisation, it typically proposes radical operating system redesigns. I have proposed and shown an implementation of Shadow Kernels, an approach that allows multiple kernel text sections to coexist, implemented as a kernel module. With Shadow Kernels a mainstream operating system can use a minimal kernel module and the Xen paravirtual hypercall interface to remap the virtual to machine-physical mappings. Shadow Kernels is another example of the benefits that can be gained by forgoing hypervisor fidelity to improve performance measurement of a virtual machine. Without forgoing hypervisor fidelity, Shadow Kernels would require a substantial redesign of the operating system to ensure safe operation of the shadow kernel. However, by acknowledging the presence of the hypervisor and using its separation of machine-physical and guest-physical addresses, Shadow Kernels allows mainstream operating systems with paravirtual bindings to specialise the kernel on a per-process basis.

CHAPTER 5

SOROBAN: ATTRIBUTING LATENCY IN VIRTUALISED ENVIRONMENTS

In the previous chapters, I showed that by acknowledging the existence of the hypervisor, we can build higher-performance probing with a restricted probe effect. In this chapter, I show the benefits of building user-space applications that forgo hypervisor fidelity to report the performance impact of executing as a virtual machine. Soroban1 is a technique for measuring the virtualisation overheads experienced in a request-response system, such that for any individual request one can report how much of the latency of that request is attributable to the cloud provider and how much to their customer. Soroban supplies an API with which developers describe the semantics of starting and stopping the processing of a request. Soroban then uses a modified version of Xen that reports scheduling data to domains and uses these data as inputs to a Gaussian process that maps scheduling data to a virtualisation overhead. I demonstrate Soroban with lighttpd and show that Soroban can correctly differentiate between high latencies caused by the virtual machine executing on an under-provisioned host and high latencies caused by a high load on the lighttpd server. Moreover, Soroban is able to report when requests are serviced slowly due to a cloud-provider batch processing task, such as performing an antivirus scan. The contributions in this chapter are published in the Proceedings of HotCloud '15 [132]. The principal contribution of this paper is mine, but I acknowledge the contributions of my colleagues throughout this chapter. In the paper, James Snee ports the ideas that I present in this chapter to explain the performance variability of applications executing in containers sharing a kernel. Further details regarding the software architecture are available as a tech report [21].

1The name ‘Soroban’ was suggested by Tanika Mei.

5.1 Introduction

A key mechanism for providing low-cost computational power through cloud computing is to cohost multiple services on a single machine. Typically many consumers are cohosted, with arbitration between them performed by a hypervisor. Whilst cohosting services increases utilisation, it has downsides: (i) The hypervisor introduces a level of indirection, with multiple independent schedulers. This makes the service time of queues unpredictable, leading to volatile performance. (ii) Consumers are not exposed to details about the machine they are executing on that affect the quality of service: the number of other machines hosted, the time spent scheduled in, and contention on I/O. Therefore, consumers can be unaware of when changes from the cloud provider affect the performance of their service. Whilst efforts have improved schedulers [27] to maximise performance isolation [130], these effects are still present. The lack of performance isolation can manifest itself as variability in the latency of a request-response system, such as serving HTTP requests. Finding the cause of slow responses can be challenging, especially when hosted on a shared-hosting service. Yet, at the same time, a key benefit of cloud hosting is the ability to switch between different classes of hosting, depending on the performance characteristics required. Unlike in traditional systems, cloud software can treat the underlying hardware as a configuration problem. However, it is currently difficult to determine whether latency is induced by having too little resource allocated to a virtual machine, or container, or whether it is caused by the consumer's own software stack, such as a high load executing on the machine. Current techniques for understanding the performance of shared hosting platforms rely on benchmarking virtual machines to measure their performance. However, the utility of benchmarking is reduced by several factors: (i) Benchmarks are typically not representative [44, 126]. (ii) The state of the host machine constantly varies from factors such as domain creation and changing workloads on other domains. (iii) Benchmarks reveal a measurement of the throughput or latency of a system, but do not give a root-cause explanation of results. In the case of comprehending virtual machine performance, this means that benchmarks do not reveal whether the performance is bottlenecked on how the cloud provider schedules the virtual machine.

To this end, I present Soroban, a framework with which an application executing on a virtual machine using a request-response paradigm, such as an HTTP server, can measure its virtualisation overhead. By modifying the application such that it includes library calls that indicate the start and end of request processing, Soroban monitors the servicing of each request in the system and reports whether the latency in serving each individual request is due to the cloud provider or to the consumer software. Modifying applications such that they can determine the virtualisation overheads of their requests requires applications to forgo hypervisor fidelity in their design. By making these changes to their software, developers are able to monitor and report the performance impact and performance interference caused by executing on a shared hypervisor. As such, Soroban reduces a key problem with virtualisation, measuring performance, by exposing an interface from the hypervisor to higher levels of the software stack. Using a modified Xen hypervisor, the scheduling activity of a domain is shared, with that domain only, over shared memory. Soroban uses these data to train a Gaussian process by executing a synthetic setup in which the load on the hypervisor from other domains gradually increases, and Soroban monitors how the increased contention affects both the latency of servicing requests and the scheduling activity on the virtual machine. After this training phase the Gaussian process can attribute virtualisation overhead by monitoring the scheduling of the current domain only. By attributing latency solely using scheduling data, this method can distinguish increases in latency due to high hypervisor load from increases in latency due to high load within the current domain.

5.2 Motivation

Soroban is a technique that uses machine learning to determine how much of the latency of servicing a request in a request-response system is due to virtualisation overheads. There are different places in which such information might be useful; I now highlight motivating scenarios where Soroban can be applied in production cloud environments:

109 5.2.1 Performance monitoring

Understanding the performance of software is an ongoing challenge. When applications execute in the cloud this problem is exacerbated due to the role of the cloud provider. For instance, the cloud provider can reconfigure the underlying hardware, network topologies, storage paths, or hypervisor version, each of which is transparent to a virtual machine but may affect its performance. Current techniques for monitoring the performance of an application are unable to determine how much of the latency of an application is due to the virtualisation overhead. In order to measure the performance of a virtual machine, developers execute a set of benchmarks on the virtual machine, which are likely to have different performance characteristics to the application that the virtual machine executes, and are not run continuously to find changes in virtualisation overhead. Therefore, there is currently an information asymmetry: consumers know few details about how their virtual machine is configured, and comparing offerings between cloud providers is difficult. With Soroban, applications monitor the virtualisation overhead and report it in the natural accounting unit of the application, as defined using the Soroban API in the developer's application. Soroban can report this overhead such that if the virtualisation overhead increases, a site reliability engineer can raise a support issue with the cloud provider or switch to an alternative cloud provider [50, 5].

5.2.2 Virtualisation-aware timeouts

Many applications attempt to adjust their own behaviour based on the performance of the operating system, so as to maintain a quality of service. The typical way that they do this is by using timeouts that fire based on wall-clock time [112]. For instance, applications such as web servers may time out requests that have been processing for too long; the kernel watchdog fires if the kernel has not responded after a fixed timeout; and TCP uses timeouts to predict whether a packet has been dropped. All of these mechanisms are built on wall-clock time, so they make the assumption that only one operating system executes on the hardware. Therefore, the implied reason for timing out is that the load on the system is too high and so

they should back off. This assumption does not hold when executing in a virtual machine: if the domain is starved of resources these timeouts fire, causing the software stack to stop processing the request. However, backing off when a domain is starved of resources does not cause the throughput of the virtual machine to increase, as it would on a physical machine, but rather causes it to decrease, since the next task will likely also not be sufficiently serviced. With Soroban, timeouts could instead be parameterised based on how the virtual machine is serviced.
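A minimal sketch of such a virtualisation-aware timeout is given below. The srbn_overhead_ms() call is hypothetical (the API presented later in this chapter attributes overhead per request rather than exposing a live query), so this only illustrates the idea of discounting hypervisor-induced delay from the timeout budget.

```c
#include <stdbool.h>

/* Hypothetical query, stubbed here: virtualisation overhead (in ms) accrued so
 * far for the request identified by sd. Not part of the Soroban API presented
 * in this chapter. */
static double srbn_overhead_ms(int sd)
{
	(void)sd;
	return 0.0;     /* stub so that the sketch is self-contained */
}

/* Time out only on the delay that the consumer's own stack is responsible for. */
static bool request_timed_out(int sd, double elapsed_ms, double budget_ms)
{
	return (elapsed_ms - srbn_overhead_ms(sd)) > budget_ms;
}
```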

5.2.3 Dynamic allocation

Current allocation schemes used by cloud providers are usually statically defined. That is, the consumer picks an instance type when they create their virtual machine, which determines the resources bound to the virtual machine. However, unlike on physical machines, reconfiguring the hardware is a trivial operation in the cloud, as virtual machines can be rapidly ported to newer hardware, migrated to faster storage, or given extra vCPUs and memory. As such, the hypervisor changes the choice of hardware on which a software stack executes from being a procurement exercise to being a parameter in a configuration file. Presently, it is typical for a site reliability engineer to monitor metrics and decide whether the instance size should be reconfigured. However, with Soroban, this process can be made more declarative: consumers could specify the bounds on virtualisation overhead that are tolerable. Whenever the load increases beyond this point, the virtual machine can be dynamically allocated more resources. This allows the cloud provider to react to users in a more efficient manner by allocating resources according to price and demand.

5.2.4 QoS-based, fine-grained charging

Most current charging models are coarse-grained. Cloud providers offer instance types, where each instance type has a fixed cost per minute that it executes, regardless of the quality of service given to the virtual machine by the cloud provider. For example, a user pays the same amount to service two HTTP requests even if one takes twice as long to complete due to hypervisor

delays. Whilst some efforts have produced novel charging methods, such as spot servers [128], these currently charge based on the cloud provider's spare capacity, rather than the quality of service given to the virtual machine. Cloud providers could extend Soroban to explore flexible new pricing models that set price points as a function of user demand and overall system-imposed delay, leading to more accurate and representative charging models.

5.2.5 Diagnosing performance anomalies

Slow server responses are a principal component of end-to-end latency for client-server systems [33]. Soroban attributes slow responses to either the cloud provider or the customer software stack. Consumers can use this information to purchase more computing resource, or to focus their efforts on modifying their software. Soroban enables accurate pinpointing of performance anomalies at a request level, enabling providers and users to understand whether the hypervisor is responsible for occasionally slow requests.

5.3 Sources of virtualisation overhead

Having explored the advantages of Soroban in measuring virtualisation overhead, I now show the causes of virtualisation overhead. I demonstrate, in this section, that there is no trivial relationship between the scheduler and the time taken to serve a request. Rather, measuring virtualisation overhead requires a multivariate analysis that does not depend on the data being normally distributed. The findings of this section motivate my use of Gaussian processes to model virtualisation overhead. One reason for this is that servicing requests requires performing I/O, so requests are not entirely bottlenecked on the CPU. Therefore, if the hypervisor preempts the virtual machine whilst it is waiting for I/O, it may have no effect on the latency with which the request is serviced. This complexity in understanding the causes of virtualisation overhead demonstrates the need for cloud providers to provide Soroban.

Experimental setup A single physical machine executed a workload consisting of sixteen (domU) virtual machines running on Xen v4.6, with the Credit 2 scheduler. Domain zero had eight vCPUs, one of which was pinned to a pCPU.

Fifteen of these virtual machines executed a mix of I/O-intensive, CPU-intensive and idle background loads, chosen to represent a typical workload for a host operating in the cloud. The sixteenth virtual machine executed lighttpd serving a 500 KB static file. This size was chosen to be in the middle of the distribution of file sizes served over HTTP, representing an asset such as an image. I measured the number of cycles that it takes lighttpd to service each request and compared this with the cycles that the domain spent scheduled out during the service of that request. This shows the relationship between the hypervisor scheduling out a virtual machine and the additional latency caused by the scheduling. I also measured the number of times that the virtual machine executes a block hypercall, which causes the virtual machine to be preempted until an event arrives for the virtual machine. An identical server, connected over a two-hop 1 Gbps link via an uncontended switch, executed ApacheBench with a concurrency level of ten to generate requests for the lighttpd virtual machine. I executed this experiment on two Intel Xeon E3-1230 V2 @ 3.3 GHz machines, running a modified branch of Xen-unstable (forked on 2015-02-11). All virtual machines (including domain zero) executed Ubuntu 14.10, with a Linux kernel compiled from the Linus branch at v3.19.

Results Figure 5.1 shows that, contrary to what one might naïvely expect, there is no trivial relationship between the time that a request spends scheduled out and the impact of virtualisation. In Figure 5.1 I plot the line y = x. Any point along this line would indicate a request that was serviced whilst the virtual machine was scheduled in for zero cycles. Whilst servicing a request in zero cycles is clearly impossible, the horizontal distance between the line and a point on the graph shows the number of cycles that the virtual machine was scheduled in for whilst servicing the request. We can see three clusters:

Lowest cluster, coloured dark blue/black. Requests in this cluster are those whereby the virtual machine executing lighttpd spends little of the time scheduled out whilst processing the request. As such, there is no correlation between the time spent with the virtual machine scheduled out and the server-side latency in servicing the request.

Middle cluster, coloured pink and purple. The points in this cluster show weak,

113 108 1.0 ×

160

0.8 140

120

0.6 100

80

0.4 Number of blocks

60 Virtual machine scheduled-out (cycles) 40 0.2

20

0.0 0 0.8 1.0 1.2 1.4 1.6 108 Server-side latency (cycles) ×

Figure 5.1: There is no trivial relationship between the time a request spends scheduled out and the impact of virtualisation. Some requests spend a long time scheduled out but the server-side latency is unaffected.

positive correlation between the time spent scheduled out by Xen and the server-side latency. However, the correlation accounts for a small amount of the variation. The weak correlation is caused by the virtual machine not being immediately rescheduled after an event that triggers a block. The virtual machine issues blocks whenever it needs to load data from disk; these are parameterised by an event channel ID, such that whenever an event arrives on that channel, for instance a block being returned from disk, the domain should be rescheduled. By blocking, a virtual machine intentionally increases the number of cycles that it spends scheduled out. However, there is then a delay between the blocks arriving and the hypervisor scheduling the domain. As most of the time that the virtual machine is scheduled out it is waiting for I/O, we can observe that x cycles spent with the virtual machine scheduled out does not cause an increase in server-side latency of x cycles. Rather, for this cluster, it is the delta between blocks arriving and the virtual machine being rescheduled that is responsible for the weak relationship between server-side latency and cycles scheduled out. Therefore, the virtualisation overhead for these requests is the time between data arriving in the hypervisor and the virtual machine being rescheduled, not the time spent scheduled out.

Top cluster, coloured orange. This cluster shows very strong, positive correlation between the server-side latency and the cycles spent scheduled out by Xen. Importantly, this cluster is parallel to the line y = x, and there are no points on the graph that are closer to this line than those in this cluster. From this we can infer that for the points in this cluster, especially those towards the top of it, the time spent with the virtual machine scheduled in is dominated by the time spent executing the request. Therefore, this correlation explains most of the variation of the data, since the server-side latency of these requests is limited by the physical CPU quota that is dedicated to the domain. As such, there is strong correlation for these requests between time scheduled out and virtualisation overhead.

Given the complexities of these relationships, it is not possible to measure virtualisation overhead by considering the time spent scheduled out alone. Server-side latency is a multivariate problem; to understand how long it takes to execute a request in a virtual machine, one has to factor in multiple effects, such as the

number of blocks, and the cycles that the virtual machine spends scheduled out. In Soroban I therefore use machine learning to report the virtualisation overhead, since machine learning can be trained on the relationship between multiple variables.

5.4 Effect of virtualisation overhead on end-to-end latency

I now show that an increase in server-side latency corresponds with an increase in end-to-end latency: the overheads in processing a request correlate strongly with the end-to-end latency as experienced by the client.

Experimental setup I ran the same experiment as in Section 5.3 and measured the server-side latency with timers in lighttpd as well as the end-to-end latency that ApacheBench measures.

Results Figure 5.2 shows that as the number of worker virtual machines increases, both the server-side latency and the end-to-end latency increase. Crucially, the two distributions shift in very similar ways. Therefore, any results based on server-side latency can also be observed in the end-to-end latency of the request. We can observe that between zero and six worker virtual machines there is no impact on performance caused by executing more worker virtual machines. This is because the host has eight processors, so with domain zero executing on one of them and the target virtual machine executing on another, there remain six processors on which a worker virtual machine can execute, so there is no contention for physical CPUs. If the virtual machines had multiple vCPUs then I would not expect to see this flat portion, since the physical CPUs would become saturated more quickly. Beyond six virtual machines we see two changes to the distribution: (i) There is an increase in latency, as virtual machine scheduling becomes a bottleneck on the server-side latency; (ii) The distribution becomes wider, meaning that latency becomes less predictable. This is as one should expect, as CPU starvation causes queues to grow, thereby causing unpredictable response times.

[Figure 5.2: distributions of server-side latency (ms, top panel) and end-to-end latency (ms, bottom panel) against the number of worker virtual machines, from 0 to 20.]

Figure 5.2: As the number of worker virtual machines increases the server-side latency (top) and end-to-end (bottom) latency increase. Moreover, both distributions show similar shifts.

5.5 Attributing latency

I have already shown that there is a non-trivial relationship between the scheduling of a virtual machine and the observable latency in servicing requests. I now describe the approach of combining quantile-to-quantile measurements against bare metal with Gaussian processes to overcome these issues.2 A key benefit of the approach that I describe in this section is that it does not require guests to receive data about the scheduling of other virtual machines on the same host. In a cloud environment, it is unlikely that cloud providers would release information about co-hosted virtual machines. However, they may be willing to release data to a virtual machine regarding how it is scheduled. Whilst cloud providers may fear a larger attack vector for timing attacks, I do not believe that information from the scheduler would reveal substantially more information than can already be inferred. As shown in Figure 5.2, when a host is underprovisioned, two things happen to the distribution of server-side latency: (i) It moves to the right as the latency increases. (ii) It widens as the server-side latency becomes less predictable. It is based on this insight that I define the virtualisation overhead experienced by a request. I perform a quantile-to-quantile comparison of the distributions of request-response time by taking a request rv, which has latency tv and is in quantile q of its distribution. I then find a corresponding request, rbm, that is in quantile q of the bare-metal distribution. If this has latency tbm then I define the virtualisation overhead of this request to be tv − tbm. Figure 5.3 demonstrates this idea. The benefit of this approach is that it considers the full shape of the distribution in determining virtualisation overhead. If I were to use a simpler method, such as comparing a latency on a virtualised system with the mean latency on bare metal, it would report that the hypervisor had increased the latency of requests at the head of the virtualised distribution whenever there is substantial overlap between the two distributions. Such an expression for virtualisation overhead is insufficient to directly measure virtualisation overhead in a production environment since it relies on a constant distribution of virtualised latencies to make this quantile-to-quantile comparison [38].
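Written out explicitly (this formalisation is mine; the paragraph above defines the same quantity in words), the virtualisation overhead of a virtualised request rv with latency tv is:

```latex
% F_v  : cumulative distribution of virtualised request latencies
% F_bm : cumulative distribution of bare-metal request latencies
\[
  \mathrm{overhead}(r_v) \;=\; t_v - t_{bm},
  \qquad
  t_{bm} \;=\; F_{bm}^{-1}\bigl(F_v(t_v)\bigr),
\]
```

i.e. tbm is the bare-metal latency at the same quantile q = F_v(tv) as the virtualised request.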

2 Using a Gaussian process, as opposed to other types of machine learning, was suggested by Dr Ramsey M. Faragher.

Figure 5.3: When a server is virtualised, one expects the latency distribution of requests to become both slower and more variable, due to the cycles spent in the hypervisor and longer data paths than are typical on a bare-metal server. As the hypervisor becomes increasingly loaded the frequency distribution becomes slower and more variable. I report the virtualisation overhead of a request to be the quantile-to-quantile difference between executing on bare metal and executing in a virtual machine, with constant load on the hypervisor from other virtual machines. For clarity, in this graph I exaggerate the amount of virtualisation overhead that one would expect to see.

In a production system there are other virtual machines on the host, each of which has different load patterns. When other virtual machines increase their load it is possible for the host to become overcontended and therefore for the performance of some virtual machines on the host to be diminished. Of particular concern is that high load is often bursty and short-lived, for instance caused by an incast. As the shape of the virtualised distribution continually changes, it is difficult to pinpoint the quantile in which a datum lives. To report costs using a model that considers the distribution of requests and can distinguish between increased latency from the hypervisor and increased latency from other services operating within the virtual machine, Soroban uses supervised machine learning. A model, created with Gaussian processes, maps data from the hypervisor's scheduler to a measure of virtualisation overhead. Soroban initially calculates a ground-truth model by measuring the latency frequency distribution of servicing a workload on bare metal. This gives a ground-truth distribution as an upper bound on achievable performance for a virtualised instance. This is a one-time operation for each hardware/software combination. For large systems that execute in the cloud, I envisage this step being integrated into a continuous build or integration system. Following this, Soroban investigates the additional latency of the program when it is virtualised. Soroban measures the latency distribution of the workload whilst executing as the only virtual machine (domU) on the host, thereby obtaining a virtualised latency distribution that is usually slightly slower than the bare-metal latency distribution, due to the additional overheads of having a hypervisor and the (typically longer) data paths between the virtual machine and physical hardware. Soroban then repeatedly increases the load on the host by booting 'worker' virtual machines that execute a mix of CPU-intensive and I/O-intensive workloads. This means that the Soroban-enabled application repeatedly serves the same load, whilst the number of other guests on the server, contending for resources, increases. Each time a virtual machine boots, Soroban measures a fresh latency distribution. For each request that executes in this training phase Soroban calculates the quantile-to-quantile difference in latency and records all of the scheduling performed on the virtual machine by the hypervisor. Soroban then builds a feature vector for each request that maps the

5.5.1 Justification of Gaussian processes

Gaussian processes are a method of performing supervised machine learning in which a regression model is built using a training dataset and then used to pre- dict values for previously-unseen data. Whilst there are other types of machine learning, Gaussian processes have several advantages [90].3 During the training process Gaussian processes assume that the dependent variable—virtualisation overhead in the case of Soroban—comes from an under- lying distribution, made up of the variables in the feature vector. The training tries to find this distribution by creating a series of Gaussian functions across the multi-dimensional feature space that when summed together approximate the underlying distribution. Unlike most other machine learning techniques, Gaussian processes do not assume that the data, or their errors, are normally distributed, which is impor- tant as the Soroban data do not follow any recognisable distribution. For a mul- tivariate regression—as performed by Soroban as there are multiple dimensions to the Soroban feature vector—the training phase has a hyperparameter step that finds the best correlation length and measures how much of the variation is ex- plained by each dimension. For instance, at some values the cycles scheduled out explains most of the server-side latency, so Gaussian processes attaches a high

3The choice of using a Gaussian process was made by Dr Ramsey M. Faragher. All intellec- tual contributions in this section are his.

weight to this element of the feature vector. However, at low values the cycles scheduled out is a poor predictor of server-side latency and the number of blocks is a good predictor, so the Gaussian process will assign a higher weighting to the latter. Gaussian processes can also be computed quickly.
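For reference, the prediction step of standard Gaussian-process regression is given below; this is the textbook formulation rather than anything specific to Soroban's implementation.

```latex
% X, y : training feature vectors and their virtualisation overheads
% x_*  : feature vector of a new request; k : covariance (kernel) function
% K_{ij} = k(x_i, x_j),  (k_*)_i = k(x_i, x_*),  \sigma_n^2 : noise variance
\[
  \bar{f}_* = \mathbf{k}_*^{\top}\,(K + \sigma_n^2 I)^{-1}\,\mathbf{y},
  \qquad
  \operatorname{var}(f_*) = k(\mathbf{x}_*, \mathbf{x}_*)
    - \mathbf{k}_*^{\top}\,(K + \sigma_n^2 I)^{-1}\,\mathbf{k}_*.
\]
```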

5.5.2 Alternative approaches

There are alternative approaches to combining quantile-to-quantile measurements with supervised machine learning; however, these have disadvantages:

Quantile-to-quantile comparison of the entire virtualised distribution. For a request, r1, one could consider its position in the (possibly non-strict) subset of the distribution of virtualised requests. In other words, on servicing r1, Soroban could measure that it took x ms to execute and therefore falls into the yth percentile of requests executing in the cloud. The virtualisation overhead could then be measured by comparing the difference in latency between r1 and the request at percentile y on bare metal. The problem with this approach is that in cloud environments the load on the host from other virtual machines is unpredictable, as the load from other virtual machines changes and new virtual machines are scheduled on the same host. As such, the distribution of latencies is not constant. For instance, if r1 lies at the median point of the virtualised distribution then its latency is compared with the median latency on bare metal. If at a later time another request, r2, is serviced with the same latency as r1 and the hypervisor schedules the domain identically, yet between r1 and r2 the distribution has changed shape such that r2 is not at the median, the reported virtualisation overhead will change, despite the requests being served identically. Moreover, a key advantage of Soroban is the ability to distinguish increases in latency caused by the cloud provider from increases in latency caused from within a virtual machine. As a motivating example, consider the case of a software upgrade or an antivirus scan executing within a domain that compares against the entirety of the virtualised distribution. The

background task would cause the response latency to increase as the load on the virtual machine increases. Any approach that merely compares distributions with a best case is unable to distinguish increases in latency caused by virtualisation from other increases in latency.

Comparison with mean latency. If the latency of each request when executing in a virtual machine were compared with the mean latency when executing on bare metal, then we would observe artefacts based on the distribution of requests when executing on bare metal. When contention is lower than one vCPU per pCPU, virtualisation overheads are small and the distributions of end-to-end request latency overlap. As such, Soroban would report that requests that execute quickly in a virtual machine would execute slower on bare metal.4

5.6 Choice of feature vector elements

Soroban uses machine learning to build a model that maps a feature vector of hypervisor scheduling data to a measure of virtualisation overhead in servicing a request. To determine the basis vectors for this feature vector I performed an experiment in which I measured the performance of lighttpd under varying loads on the hypervisor to find those variables that were correlated with performance. It is the variables that showed correlation that I use as an input to the machine learning algorithm.

Experimental setup This experiment executed on the same operating system and hypervisor setup that I describe for the experiment in Section 5.3. Using a modified version of Xen, I measured various data from the Xen scheduler as each request executed. Throughout the experiment a single virtual machine executed lighttpd, which recorded the server-side latency of all requests that it served, as well as recording, for each request, all hypervisor scheduling events imposed on the virtual

4 I later show that Soroban occasionally does report a negative virtualisation overhead, but this is due to virtual machines offloading work to the hypervisor, so server-side latency decreases but end-to-end latency increases. However, a comparison with mean latency would cause Soroban to report requests as having a negative overhead simply because they are faster than the mean latency on bare metal.

machine. I refer to this virtual machine as the target virtual machine. A server two hops away on a 1 Gbps link via an uncontended switch executed ApacheBench with concurrency three, continually requesting a 10 MB file from the virtual machine, such that the target virtual machine was continuously servicing HTTP GET requests. Initially the target virtual machine was the only virtual machine executing on the host. However, a script executing on the same host as the target virtual machine repeatedly started worker virtual machines, which put additional load on the hypervisor. The purpose of creating additional load with worker virtual machines was to cause the server to become underprovisioned and therefore to service requests more slowly. After each worker virtual machine booted, lighttpd profiled its performance for ten seconds. I did not profile the target virtual machine whilst worker virtual machines were booting. Each worker virtual machine had 1 vCPU and 256 MB of RAM, and ran an identical operating system to the target virtual machine. After booting, the worker virtual machines executed stress -i 1 -c 1, which creates one CPU-intensive thread and one I/O-intensive thread.

Results Figure 5.4 shows the relationship between those metrics available from Xen that show correlation with the server-side latency in serving HTTP responses. From this graph, there is clear correlation between the server-side latency and the following variables:

Number of blocks. Xen allows virtual machines to issue blocks, whereby they are preempted until an event arrives on a specified event channel. Typical uses of block are to deschedule the virtual machine whilst waiting for blocks to return from disk or for data to send over a network.

Cycles scheduled out. This is a measurement of the number of cycles consumed in the processing of a request during which the virtual machine (but not necessarily the lighttpd process) is scheduled out. If the virtual machine is scheduled out whilst processing a request, one would expect that the request takes longer to be serviced.

Virtual machine pre-emptions. When there is contention for physical CPUs, the hypervisor can pre-empt a vCPU to allocate the corresponding physical

[Figure 5.4: scatter plots of server-side latency (ms) against the number of blocks, cycles scheduled out, virtual machine pre-emptions, and VM exits, with points coloured by the number of worker virtual machines.]

Figure 5.4: There are trends between the number of blocks, virtual machine pre-emptions, cycles scheduled out and the number of virtual machine exits. Furthermore, these metrics are not correlated with each other. Soroban therefore uses all of these metrics in its feature vector that is supplied to the Gaussian process.

CPU to another virtual machine. As the load on the hypervisor increases, the number of virtual machine pre-emptions also increases.

Number of vmexits. This is the total number of times that the virtual machine exits (through a vmexit) during the processing of a request. As such, it is partly explained by the number of virtual machine pre-emptions and blocks.

Metrics for which I cannot find strong correlation are: minimum/maximum/mean/median scheduler credit, number of upcalls, and minimum/maximum/median/mean number of events on the event channel when an upcall takes place. From this experiment, Soroban uses a feature vector with the following elements: number of blocks, cycles scheduled out, virtual machine pre-emptions and the number of scheduling events performed on the virtual machine.
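Concretely, one per-request feature vector therefore holds four counters, along the lines of the sketch below; the field names are mine rather than Soroban's.

```c
#include <stdint.h>

/* Illustrative layout of one per-request feature vector; field names are not
 * taken from the Soroban implementation. */
struct soroban_features {
	uint64_t blocks;            /* block hypercalls issued during the request */
	uint64_t cycles_sched_out;  /* cycles the domain spent scheduled out */
	uint64_t preemptions;       /* times a vCPU was pre-empted by the hypervisor */
	uint64_t sched_events;      /* scheduling events performed on the domain */
};
```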

5.7 Implementation

The current implementation of Soroban consists of four parts: A modified Xen hypervisor that reports data from hypervisor scheduling events, a Linux kernel module that allows applications to read the scheduling data that occur during their execution, a modified application (currently this is lighttpd), and Python data-processing scripts.

5.7.1 Xen modifications

Soroban requires a modified version of Xen that shares with each virtual machine data from the scheduler regarding how that virtual machine is scheduled. The modifications that I implement for Soroban are designed for guests that forgo hypervisor fidelity, since the modifications share the internal state of the hypervisor's scheduler with its guests. Only guests that acknowledge that they are executing on a hypervisor will therefore read this information. Currently, Soroban is based on Xen v4.6.

5.7.1.1 Exposing scheduler data

Soroban uses a statically and manually instrumented version of the Xen credit2 scheduler. The instrumentation executes every time the Xen schedule function runs, recording a struct of data from the scheduler. These structs are stored in a ring buffer.
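A hedged sketch of this recording step is shown below; the structure and field names are illustrative rather than those of the actual Soroban Xen patch.

```c
#include <stdint.h>

#define SRBN_RING_SLOTS 1024

/* Illustrative record and ring buffer for the per-domain scheduler data. */
struct srbn_sched_event {
	uint64_t tsc;         /* timestamp of the scheduling decision */
	uint32_t event_type;  /* e.g. scheduled in, scheduled out, blocked */
	uint32_t vcpu;        /* vCPU the decision applied to */
};

struct srbn_ring {
	uint64_t head;                                 /* next slot to write */
	struct srbn_sched_event slots[SRBN_RING_SLOTS];
};

/* Called from the instrumented schedule function: append, overwriting oldest. */
static inline void srbn_record(struct srbn_ring *ring,
			       const struct srbn_sched_event *ev)
{
	ring->slots[ring->head % SRBN_RING_SLOTS] = *ev;
	ring->head++;
}
```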

5.7.1.2 Sharing scheduler data between Xen and its virtual machines

Currently hypervisors do not export data to domains regarding the quality of service provided to them. This is in part a design choice: hypervisors, and Xen in particular, are designed to have a minimal codebase, much like a microkernel. The rationale behind this is that it should be possible to read the code, understand it all, and verify it for bugs; as the codebase increases in size, the attack surface increases. With Xen, there is a tendency to push as much behaviour as possible to privileged domains rather than into the hypervisor [31]. Xen therefore has no existing mechanism to allow a privileged domain access to the hypervisor's memory. Currently, Xen does allow sharing pages between virtual machines, through the grant reference mechanism, but there is no counterpart for sharing pages between the hypervisor and virtual machines. Xen shares one page with each domain for the shared_info struct, a data structure that contains event channels for each vCPU, architectural information, and timing information. Soroban needs each guest to read its scheduling activity from the hypervisor over shared memory. To share memory between Xen and its guests, I modify the create-domain hypercall to allocate pages from the hypervisor's heap. Each of these pages is shared with the domain, thereby allowing the domain to access the associated physical memory. The pages each have a machine frame number (MFN) that the hypervisor shares with the domain by appending it to the end of the shared_info struct. By using these machine frame numbers, virtual machines can map in the pages that have been shared with them.

5.7.2 Linux kernel module

A Linux kernel module allows processes that use Soroban to read the hypervisor scheduling events of their virtual machine. When the virtual machine inserts the kernel module it inspects the shared_info struct to find the machine frame numbers of the shared pages. As Xen paravirtualises the memory management unit (MMU), the domain can map the physical addresses given by the machine frame numbers. The domain's kernel uses the paging-mode-translate mechanism, which causes Xen to add the correct virtual-to-physical page table entries. Whilst the typical way in Linux to map physical addresses to kernel virtual addresses is to use ioremap, it is not possible to use this function on an unmodified Linux kernel to map Xen physical addresses, as ioremap bypasses the Xen paging translation. Soroban therefore uses a different method to map the machine frame numbers to virtual addresses: Soroban initially allocates standard kernel memory (using kmalloc), causing Xen to make virtual to physical address mappings for each of the pages. Soroban then returns the machine-physical addresses to Xen, whilst maintaining the virtual addresses, using a similar mechanism to a balloon driver: issuing hypercalls to decrease the virtual machine's memory reservation. Soroban can then remap the virtual addresses onto the addresses given by the machine frame numbers shared by Xen, by creating page table entries.

1 int sd = srbn_start();
2 ...
3 srbn_yield(sd);
4 ...
5 srbn_resume(sd);
6 ...
7 srbn_end(sd);

Figure 5.5: The Soroban API allows applications to be instrumented to mark the start and end of requests.

5.7.3 Application modifications

In order to use Soroban, applications need to use an API to indicate the start and end of each request.

5.7.3.1 Soroban API

Figure 5.5 shows the Soroban API. Services use srbn_start (1) and srbn_end (7) to start and stop recording the scheduling events associated with a request. It is expected that applications use these calls to signal a logical event where resource allocation for a particular action (e.g. serving an HTTP request) starts and stops.

A call to srbn_start (1) returns a unique token identifying the current application action to which measurements should be attributed. Services distinguish between simultaneous logical events through multiple calls to the start and end operations, thereby enabling differentiation based on service classification (e.g. per user). A single user-level thread can differentiate between logical actions via the srbn_yield (3) and srbn_resume (5) calls, which allow user-level applications to signal that subsequent system calls should be attributed to a different logical event [11]. I discuss the implications of requiring applications to use a dedicated API in Section 5.9.
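As a usage sketch, a single-threaded request loop might wrap each request as below; next_request() and handle_http_request() are placeholders for the application's own code, and the srbn_end return type is assumed.

```c
struct request;                                   /* placeholder request type */
struct request *next_request(void);               /* placeholder: wait for a request */
void handle_http_request(struct request *req);    /* placeholder: application work */

int  srbn_start(void);                            /* Soroban API (Figure 5.5) */
void srbn_end(int sd);                            /* return type assumed */

void serve_forever(void)
{
	for (;;) {
		struct request *req = next_request();

		int sd = srbn_start();   /* scheduling events now attributed to req */
		handle_http_request(req);
		srbn_end(sd);            /* stop attribution; overhead recorded */
	}
}
```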

5.7.3.2 Using the Soroban API

Applications that use Soroban must currently be modified with the Soroban API and recompiled. Presently, there is a branch of lighttpd that has the relevant modifications.5 The key challenge with adding Soroban to lighttpd is that the concurrency model, despite being single-threaded, is non-trivial and so requires a substantial engineering effort to instrument. Furthermore, expressing the request-processing pipeline, in which requests are serviced in multiple parts, requires additional code.

5.7.4 Data processing

Soroban currently calculates all measures of virtualisation overhead offline, but could be extended to allow applications to query their virtualisation overhead online. Soroban has a collection of Python scripts6 that read the data created by the Soroban API and use SciKit Learn [110] to perform machine learning.

5 Modifications to lighttpd were written by Lucian Carata.
6 Implementation of the data processing scripts was a collaboration between myself and Lucian Carata. The contributions of Lucian's scripts are: parsing the output of the kernel module and applying the Gaussian process to these data. Dr Ramsey M. Faragher helped in picking correct values for the parameters of the Gaussian process. All graphs in this dissertation are drawn using my own scripts. My scripts also perform the training phase of Soroban.

5.8 Evaluation

I executed all experiments on an Intel Xeon E3-1230 V2 @ 3.3 GHz, running a modified branch of Xen-unstable (forked on 2015-02-11), with all virtual machines (including domain zero) running Ubuntu 14.10, with a Linux kernel compiled from the Linus branch at v3.19. I used Xen v4.6 with the Xen Credit 2 scheduler, the next-generation Xen scheduler designed to reduce the latency in scheduling a vCPU after an event for that vCPU is issued. Domain zero had eight vCPUs, one of which was pinned. By using the Credit 2 scheduler I investigate the state-of-the-art in hypervisor scheduling that is designed to minimise performance interference. This setup minimised latency, to show a lower bound on the performance of Soroban. For cloud providers who use legacy versions of Xen, one would expect higher performance interference than I measured in this section, thereby further increasing the utility of Soroban.

5.8.1 Validation of model

I begin evaluating Soroban by considering how the use of a Gaussian process with a quantile-to-quantile difference based on bare-metal performance models the virtualisation overhead. In particular, I analyse how the model fits the training data that I supply, to evaluate how the Gaussian process fits the data and to show that quantile-to-quantile measurements are an appropriate measure of virtualisation overhead. In these experiments I trained Soroban using datasets produced using ApacheBench, with concurrency 50 (a reasonable load for a server executing as a virtual machine on a shared host), requesting a 500 KB file (a mid-range HTTP file size, representing the size of an image). Using this model, Soroban measured how much of the latency of each request in the training dataset is attributable to the virtualisation overhead. This experiment therefore demonstrates how Gaussian processes and quantile-to-quantile differences model virtualisation overhead. I later evaluate Soroban using unseen data.
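For concreteness, one way to write down the quantile-to-quantile training label is as follows; this is my own restatement of the construction, which is defined precisely where the model is introduced earlier in this chapter:

\[ v(l) \;=\; l \;-\; F_{\mathrm{BM}}^{-1}\bigl(F_{\mathrm{VM}}(l)\bigr), \]

where $l$ is the server-side latency of a training request measured in the virtual machine, $F_{\mathrm{VM}}$ is the empirical cumulative distribution of the virtualised training latencies, and $F_{\mathrm{BM}}^{-1}$ is the quantile function of the bare-metal latencies. The Gaussian process is then trained to map each request’s four-element scheduling feature vector to $v(l)$; a negative $v(l)$ arises when the virtualised latency at a given quantile lies below the bare-metal latency at the same quantile.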

5.8.1.1 Mapping scheduling data to virtualisation overhead

Experimental setup The model created by Soroban maps a feature vector with four elements to determine the virtualisation overhead. As such, it is a many-dimensional model. To explore this model, I project its dimensions onto two axes: The time associated with the virtualisation overhead and one of the four input variables.

Results Figure 5.6 shows for each unit vector of the Gaussian process the relationship between the measured values and the virtualisation overhead calculated by the Gaussian process. For each metric, in particular pre-emptions and cycles scheduled out, we see that the virtualisation overhead has a similar trend to the server-side latency. This shows that as the hypervisor performs more schedules on a virtual machine as it processes a request, the server-side latency increases. This trend is weak, as expected, as each graph is a projection of a multi-dimensional feature space: If each projection showed a strong trend between the input variable and the virtualisation overhead, the input variables would be correlated and therefore redundant. The similarities in trends are higher for cycles scheduled out and pre-emptions than for blocks or vm-exits, which is due to the relationship shown in Figure 5.4. In particular, blocks are only a good indicator of performance when there is low contention, as when contention is higher lighttpd becomes CPU-bound and so issues fewer blocks, and vm-exits are a noisier measure than the other variables. As such, the Gaussian process puts less weight on blocks at high values and on vm-exits.

In Figure 5.6, 7% of requests have a higher virtualisation overhead than server-side latency and 28% report a negative virtualisation overhead. As shown in Figure 5.7, the 7% of requests for which the virtualisation overhead is higher than the server-side latency occur when the host is highly contended, so the virtualisation overhead dominates the server-side latency. As the virtualisation overhead dominates, a small percentage error in the computed virtualisation overhead can cause the virtualisation overhead to be reported above the server-side latency. Whilst a virtualisation overhead that is larger than the server-side latency is incorrect, it occurs rarely and the conclusion from this result is still true: The hypervisor is dominating the server-side latency. Moreover, in a realistic scenario, one would not expect a hypervisor to be as heavily loaded as in this experiment.

[Figure 5.6 panels: server-side latency and virtualisation overhead (ms) plotted against cycles scheduled out, blocks, VM exits, and pre-emptions.]

Figure 5.6: Soroban’s machine learning as applied to its training data, which shows how Soroban models its input data.

[Figure 5.7: histogram of frequency against the number of worker virtual machines.]

Figure 5.7: The distribution of the number of worker virtual machines executing when Soroban associates more milliseconds of virtualisation overhead than the server-side latency. As this only occurs when there are many worker threads executing, the virtualisation overhead is high and dominates the server-side latency. As such, the error term on the prediction is high and the single-point estimate of latency is above the server-side latency.

5.8.1.2 Negative virtualisation overhead

We have seen that Soroban can predict negative virtualisation overhead. Whilst this initially seems counter-intuitive, I now show that negative virtualisation overheads are correct and represent the asynchronous computation offloaded onto the hypervisor and domain zero [26].

Experimental setup I used the same setup as described in Section 5.8 but measured the latency in serving 32 000 requests for the 500 KB file. I performed 32 000 requests to ensure that I measured the complete distribution of virtualisation overheads. I measured two different scenarios:

Bare metal (BM). Lighttpd executed on the (same) server but without virtualisation enabled.

One virtual machine (1VM). Lighttpd executed in a (domU) virtual machine on an otherwise uncontended host.

For both scenarios I measured the latency of each request in three ways:

Server-side latency (SS). Server-side latency is the latency that is observed by reading the counters from inside the virtual machine at the start and end of processing each request (a minimal sketch follows this list). However, server-side latency does not include the time spent executing work that is offloaded by the virtual machine to domain zero or the hypervisor.

End-to-end latency (EE). End-to-end latency is the latency that is measured as the ‘processing time’ by ApacheBench. Whilst end-to-end latency includes additional sources of error due to the overheads in networking, it does include the time executing computations that are offloaded to the hypervisor or domain zero.

Domain zero latency. Domain zero latency is the processing time, as measured by ApacheBench executing from domain zero and hence on the same physical server as lighttpd. By executing ApacheBench in domain zero, rather than on a host over the network, we can observe the total processing time of the request, without the interference of the network. Executing ApacheBench on the same physical host introduces the possibility of introducing a probe effect. However, I expect this to be minimal as the server has eight processors, of which lighttpd uses only one. As this domain zero latency measurement requires the presence of domain zero, I could not perform this experiment on bare metal.
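The server-side measurement is conceptually just a pair of in-guest timestamps around the request; the following minimal sketch is my own illustration of the idea (it is not the lighttpd instrumentation) and the handler pointer is a placeholder.

```c
#include <stdint.h>
#include <time.h>

/* Return the elapsed wall-clock nanoseconds spent processing one request,
 * as observed from inside the guest.  Work offloaded to domain zero or
 * the hypervisor after handle_request() returns is not captured, which is
 * exactly why server-side latency can under-measure the true cost. */
static uint64_t timed_request(void (*handle_request)(void))
{
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    handle_request();
    clock_gettime(CLOCK_MONOTONIC, &end);

    return (uint64_t)(end.tv_sec - start.tv_sec) * 1000000000ull
           + (uint64_t)(end.tv_nsec - start.tv_nsec);
}
```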

Results Whilst a negative virtualisation overhead might appear erroneous, Figure 5.8 shows that negative virtualisation overheads are in fact correct. Comparing the frequency distribution of bare-metal server-side latency with one-virtual-machine server-side latency, we see that at points in the distributions requests appear to execute faster in the virtualised case than the bare-metal case. This is confirmed by the measures of server-side virtualisation overhead. However, when I measure the end-to-end latency, we can see that requests do not execute faster when virtualised than on bare metal, as the virtualised distribution remains higher than the bare-metal distribution at any given point. We therefore see that the distribution of end-to-end virtualisation overhead is positive. Looking at the domain zero latency distributions, we again see that virtualisation has a slight negative effect on the performance of requests.

134 0.016 0.016 0.008 0.008

Frequency 0.000 0.000 0 100 200 300 0 100 200 300 BM SS latency (ms) 1VM SS latency (ms)

0.016 0.016 0.008 0.008

Frequency 0.000 0.000 0 100 200 300 0 100 200 300 BM EE latency (ms) 1VM EE latency (ms) 0.03 0.02 0.01 0.00 0 100 200 300 Dom0 latency (ms)

0.008 0.08 0.004 0.04

Frequency 0.000 0.00 200 0 200 400 100 50 0 50 100 − − − SS virt. overhead (ms) SS virt. overhead (%)

0.04 0.08 0.02 0.04

Frequency 0.00 0.00 200 0 200 400 100 50 0 50 100 − − − EE virt. overhead (ms) EE virt. overhead (%)

0.030 0.06 0.015 0.03

Frequency 0.000 0.00 200 0 200 400 100 50 0 50 100 − − − Dom0 virt. overhead (ms) Dom0 virt. overhead (%)

Figure 5.8: Measuring server-side (SS) latency appears to show that requests execute faster in a virtual machine (1VM) than on bare metal (BM). However, external validation from a two-hop link (EE) or from domain zero shows that this is a measurement error.

The apparent discrepancy of server-side latency reducing is a facet of the virtualisation process. Xen gives each virtual machine a virtual NIC, which uses the Xen front/back protocol to communicate with a bridged network executing in domain zero. This setup makes the performance of servicing requests, as measured using server-side latency, appear faster for two reasons: (i) The Xen net-front driver is simple, as its principal rôle is to put packets onto a ring buffer, so fewer cycles are required than when executing a classical driver. However, the drivers that execute in domain zero or on bare metal are usually more complex than net-front as they interact with hardware and involve executing more code. (ii) The net front/back driver mechanism acts as a buffer for packets between the virtual machine and the real hardware, since the virtual machine writes data to a ring buffer and then considers the packet sent. It is the rôle of domain zero to consume these packets from the ring buffer and send them on the NIC. As soon as packets are pushed onto the ring buffer, the virtual machine considers them sent; however, it can be several microseconds until the packets are dispatched by the physical NIC.

Due to the facets shown in Figure 5.8, it is therefore reasonable that Soroban reports negative virtualisation overheads for some requests when the load on the hypervisor is low. One can consider the negative overhead to represent the under-measurement of server-side latency due to executing in a virtualised environment.

The principal advantage of Soroban—as opposed to measuring performance directly—is that it reports how much of the measured server-side latency is due to virtualisation. Whilst I have shown that latency measured inside the virtual machine is a poor indicator of end-to-end performance, due to offloaded work, it is still a suitable performance metric on which to base the virtualisation overhead. In Figure 5.4 I show that increases in server-side latency have a corresponding increase in end-to-end latency. As such, by measuring the effect of the hypervisor on server-side latency, Soroban also reports the effect on end-to-end latency. However, this experiment does highlight a limitation of Soroban: It does not report the additional overhead on end-to-end performance caused by asynchronous, offloaded computations. For instance, Soroban cannot detect an increase in the time it takes for packets to be sent from the physical NIC, due to changes in configuration or load on the hypervisor’s network.

Whilst other approaches can measure more of the latency of the request, such

as measurements from external sources, they are less satisfactory. External measurements require extra infrastructure to maintain, and changes in the performance of the external monitoring tool affect the perceived service performance. Moreover, external measurements cannot measure the virtualisation overhead for every request that the server processes, as they only measure the requests that they generate. Another method is to modify the hypervisor and domain zero so that all network data are sent synchronously to the physical NIC; however, this would decrease performance.

In addition to explaining the negative overheads reported by Soroban, the results in this section also confirm the dangers in trusting benchmarks that execute in virtual machines without external validation of results.

5.8.2 Validating virtualisation overhead

Having explored how Gaussian processes and quantile-to-quantile measurements model the virtualisation overhead, I now use Soroban’s model to calculate the virtualisation overhead for previously-unseen data. A contribution of Soroban is the ability to determine if requests are slow due to a high virtualisation overhead or due to a high load in the target virtual machine. The experiment in this section validates Soroban’s ability to make this distinction by varying both the hypervisor load (using an increasing number of virtual machines on the same host) and the load within the virtual machine (increasing the concurrency level of ApacheBench). Increasing either of these parameters increases the server-side latency of processing requests, and I show that Soroban is able to distinguish which parameter causes the increase in latency.

Experimental setup I trained Soroban to measure the virtualisation overhead whilst lighttpd was serving a load generated using ApacheBench with concurrency level 50 and increased the number of worker virtual machines from 1 to 32. Soroban then used this model, which I describe in Section 5.8.1, to determine how much of the latency of requests is due to virtualisation overheads. I varied the number of concurrent virtual machines from 1 to 32, which executed a different load to those used for training to ensure that Soroban’s machine learning was not over-fitted. The virtual machines had two vCPUs and used a mix

of workloads, randomly chosen from stressing the CPU; ping; ApacheBench; stressing I/O; and a mixed I/O and CPU load. Using a mix of workloads on virtual machines more closely modelled realistic loads in public clouds. After each machine booted, I used ApacheBench executing on another machine, connected over a 2-hop 1 Gbps link through an uncontended switch, to send 3 000 requests to the target virtual machine, whilst increasing concurrency from 5 to 105 in increments of ten. Soroban then used its model to measure how much of the latency of each request was due to the virtualisation and how much was due to the high concurrency.

Results The left-hand column of Figure 5.9 shows that as either the number of concurrent virtual machines or the ApacheBench concurrency increases there is an increase in the server-side latency at all reported percentiles. The highest server-side latencies therefore occur when both the ApacheBench concurrency and the number of concurrent virtual machines are high.

In the middle column of Figure 5.9, we can see that the virtualisation overhead associated with each request increases as the server-side latency increases. This is expected, as requests that spend a long time executing have a higher interference from the hypervisor.

On the right-hand side of Figure 5.9, we see Soroban’s bare-metal projection of the requests. Bare-metal projection is Soroban’s prediction of the server-side latency if it were not affected by the virtualisation. Soroban measures this by taking each request’s latency and subtracting its virtualisation overhead. Looking from bottom to top of each heatmap there is no trend of change in latency, showing that the number of concurrent virtual machines does not correlate with Soroban’s projection of how long the request would take if executed on bare metal. However, we do notice that there is still a trend from left to right of each heatmap, showing that when requests execute slowly due to high concurrency, Soroban reports that they would still execute slowly on bare metal. It therefore follows that where server-side latency is not due to the cloud provider under-provisioning the virtual machine, Soroban does not report this as virtualisation overhead.

Whilst the overall trend of Figure 5.9 is that Soroban correctly reports the virtualisation overhead, there is some noise in the top right-hand corner of the bare-metal projected latencies, particularly at the 90th and 99th percentiles. This shows that as I extrapolate both the concurrent number of vCPUs and the concurrent number of requests to double the values with which they were trained, Soroban begins to become less reliable. This is caused by the large extrapolation of the Gaussian process beyond its training values.

[Figure 5.9 panels: heatmaps of server-side (SS) latency, hypervisor blame, and bare-metal (BM) projection at the 10th, 25th, 50th, 75th, 90th, and 99th percentiles, each plotted against ApacheBench (ab) concurrency and the number of concurrent worker virtual machines; the colour scale runs from 0.0 to 2.2.]

Figure 5.9: Increasing the ApacheBench concurrency or the number of concurrent virtual machines causes an increase in the server-side (SS) latency (left). Soroban is able to detect how much of this latency is attributable to the hypervisor’s increased load (middle). If we subtract that virtualisation overhead from the server-side latency, we observe that there is no trend of the number of concurrent virtual machines increasing the server-side latency.

5.8.3 Detecting increased load from the cloud provider

Cloud providers often perform routine maintenance tasks that reduce the quality of service offered to their customers. Such tasks include snapshots, backups, filesystem scrubs, and antivirus scans. Despite these tasks typically executing with reduced priority—often through nice—they can still cause a reduction in performance. In particular, tasks executed by the cloud provider can cause an increase in the tail latency of serving HTTP requests. With Soroban, consumers of cloud services can detect such a reduction in the quality of service that they receive and attribute this decrease in quality of service to the cloud provider, rather than to changes in their own load.

Experimental setup I executed sixteen virtual machines, each with a different workload, chosen randomly from idle, I/O intensive, CPU intensive and CPU bursty. Virtual machines executed a mixture of different workloads so as to mimic a cloud scenario in which clients have different resource usage patterns. Moreover, to obtain high utilisation, cloud providers should schedule virtual machines with different workloads on the same host [102]. This setup created a machine CPU utilisation of 30%. I then started another virtual machine executing lighttpd, serving a 500 KB file. A server on a two-hop 1 Gbps link via an uncontended switch ran ApacheBench to generate a continuous load for the lighttpd server. On one run, I executed the experiment with domain zero idle, representing a typical cloud scenario. I then repeated the experiment but with ClamAV performing an anti-virus scan of the filesystems of all virtual machines. Soroban, executing in the virtual machine serving lighttpd, then reported the additional latency caused by the virtualisation in each case.

[Figure 5.10: cumulative frequency against additional latency from virtualisation (ms), with one curve for ‘Low contention’ and one for ‘AV scan running’.]

Figure 5.10: As the cloud provider performs an action that worsens the quality of service delivered to a virtual machine—executing a ClamAV scan in this case—Soroban shows that there is an increase in tail latency caused by the cloud provider. Compared to when executing under a ‘realistic’ workload, the same setup executing at the same time as an antivirus scan has a longer tail latency.

Results Figure 5.10 shows that when executing ClamAV, Soroban attributes a significantly higher amount of virtualisation overhead than under normal circumstances. This is especially true when considering tail latency: The 90th percentile shifts from 36 ms to 47 ms of virtualisation overhead. This increase is expected, since ClamAV increases the I/O and CPU load, thereby causing starvation to lighttpd that affects the time that it takes to process a request.

5.8.4 Performance overheads of Soroban

I now show that Soroban has no significant impact on the performance of lighttpd.

Experimental setup I measured the performance of lighttpd when serving a 500 KB file to a separate server, connected over a 2-hop 1 Gbps link through an uncontended switch, executing ApacheBench with concurrency level 50 and issuing 50 000 HTTP GET requests. The virtual machine serving lighttpd was the only domU executing on the host. I repeated this experiment when using Soroban and when executing the same workload without the Soroban extensions

to the hypervisor, kernel or lighttpd. I compared the difference in throughput and end-to-end latency between executing Soroban and not executing Soroban. I did not include the cost of executing the Gaussian processes and plotting graphs as these are intended for offline analysis during which performance is not essential.

Results Student’s t-test shows that executing Soroban has no significant effect on end-to-end latency at the 1% significance level. This is due to the minimal amount of extra overhead imposed on the hypervisor or the operating system in measuring the actions of the hypervisor; compared with the cost of being scheduled out, the overhead of a few additional memcpys is negligible.

5.9 Discussion

In this chapter I have shown how Soroban determines the virtualisation overhead of servicing requests in a request-response system. I now explore some of the limitations of Soroban, indicating how future work might overcome these.

5.9.1 Increased programmer burden of program annotations

One limitation of Soroban is that it requires programs to be instrumented so as to mark the processing of a request, using the Soroban API. Whilst adding annotations to a program is a burden, it is not uncommon for software to require programs to be instrumented: SystemTap [43], DTrace [20], Pivot Tracing [89], and X-trace [51] all similarly require programs to be annotated to help users understand their performance. Moreover, large distributed systems are already built using standardised mechanisms from which the semantics of processing a request are often determinable. For instance, Google’s Dapper uses information from protocol buffers and Stubby to determine the performance of each node in servicing a request [131]. With engineering effort, it may be possible to extend the Soroban API so as to determine request processing from alternatives such as Apache Thrift or X-trace.

5.9.2 Scope of performance isolation considered by Soroban

Currently, Soroban detects a lack of performance isolation by using machine learning on data from the hypervisor scheduler. The rationale behind this is that if the host is under-provisioned for its current workload, data from the scheduler can reveal the under-provisioning. However, there are other sources of a lack of performance isolation, such as being migrated onto a machine with a slower CPU. In these cases, the virtual machine may not block or be pre-empted, so the virtual machine would execute continuously, thereby causing Soroban to report that the hypervisor has very low overhead on the processing of the request. However, the overhead would actually be high.

Existing work considers producing virtual data centres that are highly isolated and have a minimum guaranteed quality of service [130]. Soroban could tie in with these methods to report where the received quality of service differs from the minimum guaranteed quality of service.

5.9.3 Limitation to uptake

A concern that may prevent the uptake of Soroban by cloud providers is that it requires the hypervisor to expose data to virtual machines regarding how they are scheduled. Cloud providers currently treat such information as confidential, as it can be used to infer properties about their data centres. Also, the current situation creates information asymmetry, since cloud providers can determine more easily than their consumers whether a virtual machine is on an overly-contended host.

5.9.4 Improvements to machine learning

There are numerous approaches to improve the quality of the machine learning such that Soroban can more accurately measure the virtualisation overhead. Presently, Soroban is trained using a constant level of load on the target virtual machine and varying the number of virtual machines executing on the hypervisor. A more accurate model could be built by training Soroban on a workload that varies the load on both the hypervisor and the target virtual machine, although such a model would take much longer to produce.

5.10 Conclusion

In this chapter, I have presented Soroban, a technique for building applications that do not exhibit hypervisor fidelity and can therefore report the virtualisation overhead incurred in servicing requests in a request-response system. Soroban uses a modified version of the Xen hypervisor such that scheduling data are shared between Xen and its virtual machines. By applying supervised machine learning to the scheduler data, Soroban infers the difference in latency between the program executing on virtualised hardware as compared with how it would have executed on bare metal. With Soroban, cloud consumers can better measure the additional overheads incurred by executing in a virtual machine, and cloud providers can offer novel charging models, based on the service that they offer to virtual machines, rather than the resources assigned to the virtual machine.

CHAPTER 6

CONCLUSION

In this thesis I have argued that one can improve the techniques used to measure virtual machine performance by forgoing hypervisor fidelity. Measuring the performance of a virtual machine is currently hard: The combination of the hypervisor and other virtual machines makes performance both slower and less predictable than when executing on bare metal. Moreover, many of the techniques that developers use to measure performance when executing on bare metal do not work in virtual machines as they rely on non-virtualisable hardware. Given that virtual machine performance is slower, less predictable and harder to measure than performance on a physical server, I propose forgoing the requirement of hypervisor fidelity for tools that measure the performance of a virtual machine.

Historically, the argument for hypervisor fidelity stems from needing to provide the abstraction of a scarce, expensive mainframe to multiple users concurrently, where each user had the illusion of being the sole user. It was therefore important that the same software executed in the virtual machine as on a physical machine. However, the principal use case for hypervisors has changed. Hypervisors are now the building block of cloud computing, used to isolate untrusting users executing on commercial, off-the-shelf hardware. As such, it is time to reconsider the requirements of software that executes on a hypervisor. The interface between a software stack executing on a mainstream operating system and the hypervisor need not be fixed, and as such I propose that developers have tighter coupling of their performance tools with the hypervisor.

I have demonstrated the advantages of forgoing hypervisor fidelity in three ways:

6.1 Kamprobes

Probing mechanisms are the de facto way of measuring virtual machine performance. However, associated with probing mechanisms is the probe effect, whereby the additional overhead of executing a probe affects the performance of the software that the probes are trying to measure. To reduce the probe effect, probing mechanisms need to minimise the resources that they use.

Existing probing systems do not forgo hypervisor fidelity in their design. They are designed to execute on physical machines and are ported to virtual machines, without due consideration of the effects of virtualisation on their probe effect. The effect of this is that firing a Kprobe—the current state-of-the-art Linux probing mechanism—is twice as expensive on a virtual machine as on a physical machine. Moreover, the time taken to fire a Kprobe has much higher variability in a virtual machine than on a physical machine. The effect of these two factors is that virtual machines have a probe effect that is both higher and less predictable than on bare metal.

To this end, I have presented Kamprobes, a probing technique that reconsiders jump-based probing for the System V AMD64 ABI [95], without using any privileged instructions. By using only unprivileged instructions, Kamprobes avoids the overheads of requiring the hypervisor to emulate an interrupt through an upcall to the relevant domain.

As well as executing at near-native speeds when virtualised, I show that Kamprobes avoids the need to perform a lookup of the probe handler from the instruction pointer, and so scales substantially better than the current state-of-the-art, Kprobes. Specifically, the cost of firing a Kprobe is O(n) with the number of probes inserted, but the cost of firing a Kamprobe is O(1).

6.2 Shadow Kernels

Shadow Kernels is a technique that uses the hypervisor to allow per-process specialisation of a kernel instruction stream, one use case of which is to have probes that only fire when a specific process is executing. In a conventional operating system the kernel is mapped into the top of the address space of every process. The benefits of this approach are that it is relatively simple, has high

cache-hit rates and gives high performance. However, the downside is that every process has to execute the same instruction stream in the kernel. This prevents specialisation of the instruction stream to certain processes on the system. One such type of process specialisation that one cannot currently perform is probing the kernel’s interactions with a specific process, since when a probe is placed in the kernel instruction stream it will fire for every process.

Shadow Kernels avoids this problem by forgoing hypervisor fidelity and using the hypervisor to provide a performance measurement technique that cannot be used on bare metal. By issuing hypercalls, Shadow Kernels changes the virtual to machine-physical mappings of the kernel instruction stream as seen by different processes. A virtual machine already has two concepts of physical address: Guest-physical addresses and machine-physical addresses, whereby there is a mapping between the physical pages as considered by a guest and the physical pages as considered by the host. As such, the hypervisor can modify the guest-physical to machine-physical address mappings, without modifying the guest operating system.

I show that specialising a page costs 835 ± 354 cycles per page and that the total cost grows linearly (O(n)) with the number of pages specialised. As such, Shadow Kernels is appropriate when specialisation affects a small number of pages; however, the exact number depends on the effectiveness of specialisation. If Shadow Kernels restricts the scope of Kamprobes, then Shadow Kernels is effective if it prevents at least ten Kamprobes from firing. By using Shadow Kernels, probing hot functions of an operating system to understand their interaction with a specific process can have an insignificant cost for all other executing processes.

6.3 Soroban

Applications that execute in a virtual machine are typically slower and have higher variability in performance than the same application executing on a physical machine. Such changes to timing are inevitable from the addition of a hypervisor, especially when virtualisation is used to increase the utilisation of machines. Whilst efforts continue to close the performance gap between virtual and physical machines, Soroban takes an alternative approach by measuring and

reporting this virtualisation overhead. By building applications that acknowledge that they will execute on a virtual machine, and therefore linking against the Soroban library, applications can determine how much slower each request, in a request-response system, executes due to the additional overheads of virtualisation.

Soroban extends the Xen hypervisor to share with each domain data regarding when it is scheduled. This information is not typically shared with each virtual machine by Xen, so Soroban cannot execute on an unmodified version of Xen.

By using Soroban, applications are designed with the recognition that they execute in a virtualised environment. As such, they can determine the virtualisation overheads that the hypervisor imposes on them. With Soroban, cloud providers can provide evidence to their clients that the performance of their applications is diminished because they are operating on a low-cost service; clients can better distinguish between cloud providers or provide their provider with evidence of insufficient quality of service.

6.4 Future work

In this dissertation, I have explored the benefits of exposing an interface to the hypervisor to higher levels of the software stack, with a particular focus on the benefits of using the hypervisor when measuring an operating system. In this section I outline future work on forgoing hypervisor fidelity to measure the performance of operating systems.

6.4.1 Kamprobes

Kamprobes can currently only be applied to a limited subset of instructions, specifically function preambles, CALLQ, and five NOP instructions. The results from this proof-of-concept implementation show that jump-based probing is faster and more predictable than interrupt-based probing in a virtual machine. As such, future engineering work can extend Kamprobes to other types of instructions. There are well-known techniques that allow jump-based probing of other instruction types [137].

148 6.4.2 Shadow Kernels

Shadow Kernels restricts the scope of kernel specialisation to a single process. However, the ideas in this chapter may well also apply to other pieces of shared code, such as libc. The limitations of this are that only the instruction stream can be modified, but not the data structures, and that the performance overhead scales linearly with the number of pages that are specialised. As such, future work can address these two issues: To allow Shadow Kernels that modify data structures, Shadow Kernels could treat each shadow kernel as an independent kernel and use process migration, for instance with LinuxPMI, to ‘migrate’ the process between shadow kernels. To address the issue of scaling linearly, Shadow Kernels could use huge pages; however, there are compatibility issues between huge pages and virtual machine migration.

6.4.3 Soroban

As I explore in this dissertation, the predictability of the timing properties of software executing in a virtualised environment is lower than that of the equivalent software executing on a physical machine. A cause of much of this unpredictability is the multiple schedulers that operate independently: The hypervisor scheduler schedules domains, where the only knowledge that the scheduler has of the domains is the event channel, which corresponds to pending interrupts for the domain. After a domain is scheduled, the operating system scheduler schedules the process that should execute, without knowledge of executing in a virtual machine. As such, it does not consider the domain’s credits, the time remaining in its quantum as perceived by the hypervisor, the benefits of issuing block or yield hypercalls, or contention on shared resources. The effect of having two non-complementary schedulers is that the servicing of a particular process is dependent on the two schedulers’ interactions aligning correctly. When the schedulers align poorly, processes can be temporarily starved of resource. As such, future work in this area can consider how to build schedulers for hypervisors that co-operate with, but do not trust, domains to eliminate this multiple-scheduler problem. For instance, if the hypervisor were to advertise to each domain how long its current quantum lasts, the domain’s scheduler can perform I/O-bound requests towards the end of the quantum such that the I/O will be serviced before the domain is next

scheduled.

6.4.4 Other performance measurement techniques that forgo hypervisor fidelity

Kamprobes and Shadow Kernels provide virtual machines with a low-overhead probing mechanism that can be applied on a per-process basis. However, once a probe handler fires there is a substantial probe effect, in that the additional code that executes in the context of the probe handler changes the performance characteristics of the underlying machine. This is particularly troublesome as a key motivator for using kernel probing is to measure performance bottlenecks. Instrumains is a technique that would allow virtual machines to forgo hypervisor fidelity and issue hypercalls such that when a probe fires, the virtual machine is forked, with one copy executing the probe handler and the other not. The results from the handler can then be passed back to the original virtual machine for inspection. This uses the hypervisor’s performance isolation to prevent a probe handler from having a large probe effect on the performance of the target program. Soroban could then measure and report the effect that Instrumains does have through imperfect performance isolation.

6.5 Overview

In this dissertation I argue the thesis that despite virtual machines executing more slowly and with more variability than physical machines, current techniques for measuring the performance of virtual machines have lower utility than they do on physical machines. To reduce this problem, I propose forgoing hypervisor fidelity, a technique that has been used to solve previous problems in the virtualisation domain. As such, I present three techniques for helping to understand the performance of a virtual machine that forgo hypervisor fidelity. Firstly, Kamprobes is a probing system designed to execute on virtual machines by avoiding the use of privileged instructions, such as interrupts that have a high performance overhead on a virtual machine. Secondly, Shadow Kernels uses the hypervisor to remap the memory of its guests such that mainstream operating systems can transparently remap pages with minimal changes. Thirdly, Soroban allows applications

to monitor the effect of the virtualisation overhead on their own performance. Whilst the benefits of being able to migrate from a physical machine to a virtual machine are well-known, there are fewer benefits from building software that can be ported from virtual machines to physical machines. Given the difficulty in measuring the performance of virtual machines, the tradeoff between visibility and abstraction should be moved towards visibility. Conclusions derived about the performance of software executing in a virtual machine must be derived with an understanding that the software is not executing on bare metal. As such, techniques that attempt to explain the performance of virtualised software without consideration or utilisation of the hypervisor are insufficient. Rather, to support the cloud, developers need performance tools that report the overheads of the virtualisation and use the hypervisor to increase the information available to developers when debugging performance issues.

BIBLIOGRAPHY

[1] Keith Adams and Ole Agesen. A comparison of software and hardware techniques for x86 virtualization. SIGARCH Computer Architecture News, 34(5):2–13, 2006.

[2] Kavita Agarwal, Bhushan Jain, and Donald E. Porter. Containing the hype. In Proceedings of the 6th Asia-Pacific Workshop on Systems, APSys ’15, New York, NY, USA, 2015. ACM.

[3] Ole Agesen, Jim Mattson, Radu Rugina, and Jeffrey Sheldon. Software techniques for avoiding hardware virtualization exits. In Proceedings of the 2012 USENIX Conference on USENIX Annual Technical Conference, ATC ’12, pages 373–385, Berkeley, CA, USA, 2012. USENIX Association.

[4] Sebastian Angel, Hitesh Ballani, Thomas Karagiannis, Greg O’Shea, and Eno Thereska. End-to-end performance isolation through virtual data- centers. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, OSDI ’14, pages 233–248, Berkeley, CA, USA, 2014. USENIX Association.

[5] Danilo Ardagna, Elisabetta Di Nitto, Giuliano Casale, Dana Petcu, Parastoo Mohagheghi, Sébastien Mosser, Peter Matthews, Anke Gericke, Cyril Ballagny, Francesco D’Andria, Cosmin-Septimiu Nechifor, and Craig Sheridan. MODAClouds: A model-driven approach for the design and execution of applications on multiple clouds. In Proceedings of the 4th International Workshop on Modeling in Software Engineering, MiSE ’12, pages 50–56, Washington, DC, USA, 2012. IEEE Computer Society.

[6] Michael Armbrust, Armando Fox, Rean Griffith, Anthony D. Joseph, Randy Katz, Andy Konwinski, Gunho Lee, David Patterson, Ariel Rabkin, Ion Stoica, and Matei Zaharia. A view of cloud computing. Communica- tions of the ACM, 53(4):50–58, 2010.

[7] Jeff Arnold and M. Frans Kaashoek. Ksplice: Automatic rebootless kernel updates. In Proceedings of the 4th ACM European Conference on Computer Systems, EuroSys ’09, pages 187–198, New York, NY, USA, 2009. ACM.

[8] Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, Nathan C. Bur- nett, Timothy E. Denehy, Thomas J. Engle, Haryadi S. Gunawi, James A. Nugent, and Florentina I. Popovici. Transforming policies into mecha- nisms with Infokernel. In Proceedings of the Nineteenth ACM Sympo- sium on Operating Systems Principles, SOSP ’03, pages 90–105, New York, NY, USA, 2003. ACM.

[9] Raj Bala and Valdis Filks. Amazon Web Services shifts focus to predictable storage performance in the cloud, but at what cost? Gartner, 2015. https://www.gartner.com/doc/3116617. Online; accessed 2015-11-27.

[10] Thomas Ball and James R. Larus. Efficient path profiling. In Proceedings of the 29th Annual ACM/IEEE International Symposium on Microarchitecture, MICRO 29, pages 46–57, Washington, DC, USA, 1996. IEEE Computer Society.

[11] Gaurav Banga, Peter Druschel, and Jeffrey C. Mogul. Resource contain- ers: A new facility for resource management in server systems. In Pro- ceedings of the Third Symposium on Operating Systems Design and Im- plementation, OSDI ’99, pages 45–58, Berkeley, CA, USA, 1999. USENIX Association.

[12] Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf Neugebauer, Ian Pratt, and Andrew Warfield. Xen and the art of virtualization. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, SOSP ’03, pages 164–177, New York, NY, USA, 2003. ACM.

[13] Andrew Baumann, Marcus Peinado, and Galen Hunt. Shielding applica- tions from an untrusted cloud with Haven. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, OSDI ’14, pages 267–283, Berkeley, CA, USA, 2014. USENIX Associa- tion.

[14] Adam Belay, Andrea Bittau, Ali Mashtizadeh, David Terei, David Mazières, and Christos Kozyrakis. Dune: Safe user-level access to privileged CPU features. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation, OSDI ’12, pages 335–348, Berkeley, CA, USA, 2012. USENIX Association.

[15] Adam Belay, George Prekas, Ana Klimovic, Samuel Grossman, Christos Kozyrakis, and Edouard Bugnion. IX: A protected dataplane operating system for high throughput and low latency. In Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation, OSDI ’14, pages 49–65, Berkeley, CA, USA, 2014. USENIX Association.

[16] J. Bellino and Cl Hans. Virtual machine or virtual operating system? In Proceedings of the workshop on virtual computer systems, pages 20–29, New York, NY, USA, 1973. ACM.

[17] S. Bhatia, C. Consel, A. Le Meur, and C. Pu. Automatic specialization of protocol stacks in operating system kernels. In Proceedings of the 29th Annual IEEE International Conference on Local Computer Networks, pages 152–159, Washington, DC, USA, 2004. IEEE Computer Society.

[18] Stanislav Bratanov, Roman Belenov, and Nikita Manovich. Virtual ma- chines: A whole new world for performance analysis. SIGOPS Operating Systems Review, 43(2):46–55, 2009.

[19] Timothy Broomhead, Laurence Cremean, Julien Ridoux, and Darryl Veitch. Virtualize everything but time. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation, OSDI ’10, Berkeley, CA, USA, 2010. USENIX Association.

[20] Bryan M. Cantrill, Michael W. Shapiro, and Adam H. Leventhal. Dy- namic instrumentation of production systems. In Proceedings of the 2004 USENIX Conference on USENIX Annual Technical Conference, ATEC ’04, pages 15–28, Berkeley, CA, USA, 2004. USENIX Association.

[21] Lucian Carata, Oliver R. A. Chick, James Snee, Ripduman Sohan, Andrew Rice, and Andy Hopper. Resourceful: Fine-grained resource accounting for explaining service variability. Technical Report UCAM-CL-TR-859, University of Cambridge, Computer Laboratory, 2014.

[22] Giuliano Casale, Stephan Kraft, and Diwakar Krishnamurthy. A model of storage I/O performance interference in virtualized systems. In Pro- ceedings of the 31st International Conference on Systems Workshops, ICDCSW ’11, pages 34–39, 2011.

[23] Dominique Chanet, Bjorn De Sutter, Bruno De Bus, Ludo Van Put, and Koen De Bosschere. System-wide compaction and specialization of the Linux kernel. In Proceedings of the 2005 ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems, LCTES ’05, pages 95–104, New York, NY, USA, 2005. ACM.

[24] Peter M. Chen and Brian D. Noble. When virtual is better than real. In Proceedings of the Eighth Workshop on Hot Topics in Operating Systems, HotOS ’01, pages 133–138, Berkeley, CA, USA, 2001. USENIX Associa- tion.

[25] Xi Chen, Lukas Rupprecht, Rasha Osman, Peter Pietzuch, William Knot- tenbelt, and Felipe Franciosi. CloudScope: Diagnosing performance inter- ference for resource management in multi-tenant clouds. In Proceedings of the 23rd IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunications Systems, MASCOTS ’15, Washington, DC, USA, 2015. IEEE Computer Society.

[26] Ludmila Cherkasova and Rob Gardner. Measuring CPU overhead for I/O processing in the Xen virtual machine monitor. In Proceedings of the 2005 USENIX Conference on USENIX Annual Technical Conference, ATEC ’05, Berkeley, CA, USA, 2005. USENIX Association.

[27] Ludmila Cherkasova, Diwaker Gupta, and Amin Vahdat. Comparison of the three CPU schedulers in Xen. SIGMETRICS Performance Evaluation Review, 35(2):42–51, 2007.

[28] Ron C. Chiang and H. Howie Huang. Tracon: Interference-aware scheduling for data-intensive applications in virtualized environments. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’11, New York, NY, USA, 2011. ACM.

[29] Oliver R. A. Chick, Lucian Carata, James Snee, Nikilesh Balakrishnan, and Ripduman Sohan. Shadow Kernels: A general mechanism for kernel specialization in existing operating systems. In Proceedings of the 6th Asia-Pacific Workshop on Systems, APSys ’15, New York, NY, USA, 2015. ACM.

[30] Oliver R. A. Chick, Lucian Carata, James Snee, Nikilesh Balakrishnan, and Ripduman Sohan. Shadow Kernels: A general mechanism for kernel specialization in existing operating systems. SIGOPS Operating Systems Review, 2016.

[31] David Chisnall. The definitive guide to the Xen hypervisor. Pearson Edu- cation, 2008. ISBN: 9780521642989.

[32] Michael Chow, David Meisner, Jason Flinn, Daniel Peek, and Thomas F. Wenisch. The mystery machine: End-to-end performance analysis of large- scale internet services. In Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation, OSDI ’14, pages 217– 231, Berkeley, CA, USA, 2014. USENIX Association.

[33] Michael Chow, David Meisner, Jason Flinn, Daniel Peek, and Thomas F. Wenisch. The mystery machine: End-to-end performance analysis of large- scale internet services. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, OSDI ’14, pages 217– 231, Berkeley, CA, USA, 2014. USENIX Association.

[34] Christopher Clark, Keir Fraser, Steven Hand, Jacob Gorm Hansen, Eric Jul, Christian Limpach, Ian Pratt, and Andrew Warfield. Live migration of virtual machines. In Proceedings of the 2nd Conference on Symposium on Networked Systems Design and Implementation, NSDI ’05, pages 273– 286, Berkeley, CA, USA, 2005. USENIX Association.

[35] Jedidiah R. Crandall, Gary Wassermann, Daniela A. S. de Oliveira, Zhendong Su, S. Felix Wu, and Frederic T. Chong. Temporal search: Detecting hidden malware timebombs with virtual machines. SIGARCH Computer Architecture News, 34(5):25–36, 2006.

[36] Robert J. Creasy. The origin of the VM/370 time-sharing system. IBM Journal of Research and Development, 25(5):483–490, 1981.

[37] Lei Cui, Jianxin Li, Tianyu Wo, Bo Li, Renyu Yang, Yingjie Cao, and Jinpeng Huai. HotRestore: A fast restore system for virtual machine clus- ter. In Proceedings of the 28th USENIX Conference on Large Installation System Administration, LISA ’14, pages 10–25, Berkeley, CA, USA, 2014. USENIX Association.

[38] Augusto Born de Oliveira, Sebastian Fischmeister, Amer Diwan, Matthias Hauswirth, and Peter F. Sweeney. Why you should care about quantile re- gression. SIGARCH Computer Architecture News, 41(1):207–218, 2013.

[39] Jiang Dejun, Guillaume Pierre, and Chi-Hung Chi. EC2 performance anal- ysis for resource provisioning of service-oriented applications. In Proceed- ings of the 2009 International Conference on Service-oriented Comput- ing, ICSOC/ServiceWave ’09, pages 197–207, Berlin, Heidelberg, 2009. Springer-Verlag.

[40] Xiaoning Ding, Phillip B. Gibbons, Michael A. Kozuch, and Jianchen Shan. Gleaner: Mitigating the blocked-waiter wakeup problem for vir- tualized multicore applications. In Proceedings of the 2014 USENIX Conference on USENIX Annual Technical Conference, ATC ’14, pages 73–84, Berkeley, CA, USA, 2014. USENIX Association.

[41] Yaozu Dong, Xiaowei Yang, Jianhui Li, Guangdeng Liao, Kun Tian, and Haibing Guan. High performance network virtualization with SR-IOV. Journal on Parallel Distributed Computing, 72(11):1471–1480, 2012.

[42] Cort Dougan, Paul Mackerras, and Victor Yodaiken. Optimizing the idle task and other MMU tricks. In Proceedings of the Third Symposium on Operating Systems Design and Implementation, OSDI ’99, pages 229– 237, Berkeley, CA, USA, 1999. USENIX Association.

[43] Frank C. Eigler, Vara Prasad, Will Cohen, Hien Nguyen, Martin Hunt, Jim Keniston, and Brad Chen. Architecture of Systemtap: a Linux trace/probe tool. DOI: 10.1.1.109.1364, http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.109.1364&rep=rep1&type=pdf, 2005.

[44] Daniel Ellard and Margo Seltzer. NFS tricks and benchmarking traps. In Proceedings of the 2003 USENIX Conference on USENIX Annual Tech- nical Conference, ATEC ’03, Berkeley, CA, USA, 2003. USENIX Associa- tion.

[45] Dawson R. Engler, M. Frans Kaashoek, and James O’Toole, Jr. Exoker- nel: An operating system architecture for application-level resource man- agement. In Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles, SOSP ’95, pages 251–266, New York, NY, USA, 1995. ACM.

[46] Yoav Etsion, Dan Tsafrir, Scott Kirkpatrick, and Dror G. Feitelson. Fine grained kernel logging with KLogger: Experience and insights. In Proceed- ings of the 2nd ACM SIGOPS/EuroSys European Conference on Com- puter Systems 2007, EuroSys ’07, pages 259–272, New York, NY, USA, 2007. ACM.

[47] Thomas G. Evans and D. Lucille Darley. DEBUG—an extension to current online debugging techniques. Communications of the ACM, 8(5):321–326, 1965.

[48] Ramsey M. Faragher, Oliver R. A. Chick, Daniel T. Wagner, Timothy Goh, James Snee, and Brian Jones. Captain Buzz: An all-smartphone autonomous delta-wing drone. In Proceedings of the First Workshop on Micro Aerial Vehicle Networks, Systems, and Applications for Civilian Use, DroNet ’15, pages 27–32, New York, NY, USA, 2015. ACM.

[49] Benjamin Farley, Ari Juels, Venkatanathan Varadarajan, Thomas Risten- part, Kevin D. Bowers, and Michael M. Swift. More for your money: Exploiting performance heterogeneity in public clouds. In Proceedings of the Third ACM Symposium on Cloud Computing, SoCC ’12, New York, NY, USA, 2012. ACM.

[50] Alex Fishman, Mike Rapoport, Evgeny Budilovsky, and Izik Eidus. HVX: Virtualizing the cloud. In Proceedings of the 5th USENIX Workshop on Hot Topics in Cloud Computing, HotCloud 13, Berkeley, CA, USA, 2013. USENIX Association.

[51] Rodrigo Fonseca, George Porter, Randy H. Katz, Scott Shenker, and Ion Stoica. X-trace: A pervasive network tracing framework. In Proceedings of the 4th USENIX Conference on Networked Systems Design and Im- plementation, NSDI ’07, Berkeley, CA, USA, 2007. USENIX Association.

[52] Linux Foundation. The Linux Foundation releases report detailing Linux user trends among world’s largest companies. http://www.linuxfoundation.org/news-media/announcements/2014/12/linux-foundation-releases-report-detailing-linux-user-trends-among. Online; accessed 2015-10-30.

[53] Thomas Friebel. Preventing guests from spinning around. Xen Summit Boston, 2008. http://www-archive.xenproject.org/files/xensummitboston08/LHP.pdf. Online; accessed 2015-11-27.

[54] Tal Garfinkel, Keith Adams, Andrew Warfield, and Jason Franklin. Com- patibility is not transparency: VMM detection myths and realities. In Proceedings of the 11th USENIX Workshop on Hot Topics in Operating Systems, HotOS ’07, Berkeley, CA, USA, 2007. USENIX Association.

[55] S. Gill. The diagnosis of mistakes in programmes on the EDSAC. Pro- ceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 206(1087):538–554, 1951.

[56] Robert P. Goldberg. Survey of virtual machine research. Computer, 7(9):34–45, 1974.

[57] Abel Gordon, Nadav Amit, Nadav Har’El, Muli Ben-Yehuda, Alex Lan- dau, Assaf Schuster, and Dan Tsafrir. ELI: Bare-metal performance for I/O virtualization. SIGPLAN Notices, 47(4):411–422, 2012.

[58] Ajay Gulati, Chethan Kumar, and Irfan Ahmad. Storage workload characterization and consolidation in virtualized environments. In Proceedings of the Workshop on Virtualization Performance: Analysis, Characterization, and Tools, VPACT ’09, Washington, DC, USA, 2009. IEEE Computer Society.

[59] Ajay Gulati, Chethan Kumar, and Irfan Ahmad. Modeling workloads and devices for IO load balancing in virtualized environments. SIGMETRICS Performance Evaluation Review, 37(3):61–66, 2010.

[60] Ajay Gulati, Arif Merchant, and Peter J. Varman. mClock: Handling throughput variability for hypervisor IO scheduling. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Imple- mentation, OSDI ’10, Berkeley, CA, USA, 2010. USENIX Association.

[61] Ajay Gulati, Ganesha Shanmuganathan, Anne Holler, and Irfan Ahmad. Cloud-scale resource management: challenges and techniques. In Proceed- ings of the 3rd USENIX conference on Hot topics in cloud computing, Berkeley, CA, USA, 2011. USENIX Association.

[62] P. H. Gum. System/370 extended architecture: Facilities for virtual ma- chines. IBM Journal of Research and Development, 27(6):530–544, 1983.

[63] Diwaker Gupta, Ludmila Cherkasova, Rob Gardner, and Amin Vah- dat. Enforcing performance isolation across virtual machines in Xen. In Maarten van Steen and Michi Henning, editors, Proceedings of the 7th International Middleware Conference, volume 4290 of Lecture Notes in Computer Science, pages 342–362. Springer Berlin Heidelberg, 2006.

[64] Andreas Haeberlen, Paarijaat Aditya, Rodrigo Rodrigues, and Peter Druschel. Accountable virtual machines. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation, OSDI ’10, Berkeley, CA, USA, 2010. USENIX Association.

[65] Steven Hand, Andrew Warfield, Keir Fraser, Evangelos Kotsovinos, and Dan Magenheimer. Are virtual machine monitors microkernels done right? In Proceedings of the 10th Workshop on Hot Topics in Operating Systems, HotOS ’05, Berkeley, CA, USA, 2005. USENIX Association.

[66] Steven M. Hand. Self-paging in the Nemesis operating system. In Proceedings of the Third Symposium on Operating Systems Design and Implementation, OSDI ’99, pages 73–86, Berkeley, CA, USA, 1999. USENIX Association.

[67] Jin Heo and Reza Taheri. Virtualizing latency-sensitive applications: Where does the overhead come from? VMware Technical Journal, 4:21–30, 2013.

[68] Masami Hiramatsu. Scalability efforts for Kprobes or: How I learned to stop worrying and love a massive number of Kprobes. http://events.linuxfoundation.org/sites/events/files/slides/Handling%20the%20Massive%20Multiple%20Kprobes%20v2_1.pdf, 2014. Online; accessed 2015-09-14.

[69] Galen Hunt and Doug Brubacher. Detours: Binary interception of Win32 functions. In Proceedings of the 3rd Conference on USENIX Windows NT Symposium, WINSYM ’99, Berkeley, CA, USA, 1999. USENIX Association.

[70] Google Inc. Google Container Engine—Google Cloud Platform. https://cloud.google.com/container-engine/. Online; accessed 2015-10-12.

[71] VMware Inc. Timekeeping in VMware virtual machines. Technical report, VMware Inc., 2011.

[72] Intel. Intel 64 and IA-32 architectures software developer’s manual. 2010. http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf. Online; accessed 2015-11-27.

[73] Stephen T. Jones, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. Geiger: Monitoring the buffer cache in a virtual machine environment. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XII, pages 14–24, New York, NY, USA, 2006. ACM.

[74] M. Frans Kaashoek, Dawson R. Engler, Gregory R. Ganger, Héctor M. Briceño, Russell Hunt, David Mazières, Thomas Pinckney, Robert Grimm, John Jannotti, and Kenneth Mackenzie. Application performance and flexibility on Exokernel systems. In Proceedings of the Sixteenth ACM Symposium on Operating Systems Principles, SOSP ’97, pages 52–65, New York, NY, USA, 1997. ACM.

[75] Paul A. Karger and Roger R. Schell. Multics security evaluation, volume II: Vulnerability analysis. Technical report, DTIC Document, 1974.

[76] Sanidhya Kashyap, Changwoo Min, and Taesoo Kim. Scalability in the clouds!: A myth or reality? In Proceedings of the 6th Asia-Pacific Workshop on Systems, APSys ’15, New York, NY, USA, 2015. ACM.

[77] Roger Kay. Intel and AMD: The juggernaut vs. the squid. Forbes, 2014-11-25, 2014. http://www.forbes.com/sites/rogerkay/2014/11/25/intel-and-amd-the-juggernaut-vs-the-squid/. Online; accessed 2015-11-27.

[78] Piyus Kedia and Sorav Bansal. Fast dynamic binary translation for the kernel. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, SOSP ’13, pages 101–115, New York, NY, USA, 2013. ACM.

[79] Peter B. Kessler. Fast breakpoints: Design and implementation. In Proceedings of the ACM SIGPLAN 1990 Conference on Programming Language Design and Implementation, PLDI ’90, pages 78–84, New York, NY, USA, 1990. ACM.

[80] Avi Kivity, Dor Laor, Glauber Costa, Pekka Enberg, Nadav Har’El, Don Marti, and Vlad Zolotarov. OSv: Optimizing the operating system for virtual machines. In Proceedings of the 2014 USENIX Conference on USENIX Annual Technical Conference, ATC ’14, pages 61–72, Berkeley, CA, USA, 2014. USENIX Association.

[81] Gerwin Klein, Kevin Elphinstone, Gernot Heiser, June Andronick, David Cock, Philip Derrin, Dhammika Elkaduwe, Kai Engelhardt, Rafal Kolanski, Michael Norrish, Thomas Sewell, Harvey Tuch, and Simon Winwood. seL4: Formal verification of an OS kernel. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, SOSP ’09, pages 207–220, New York, NY, USA, 2009. ACM.

[82] Ricardo Koller, Canturk Isci, Sahil Suneja, and Eyal de Lara. Unified monitoring and analytics in the cloud. In Proceedings of the 7th USENIX Workshop on Hot Topics in Cloud Computing, HotCloud ’15, Berkeley, CA, USA, 2015. USENIX Association.

[83] Thawan Kooburat and Michael Swift. The best of both worlds with on-demand virtualization. In Proceedings of the 14th USENIX Workshop on Hot Topics in Operating Systems, HotOS ’13, Berkeley, CA, USA, 2013. USENIX Association.

[84] Horacio Andrés Lagar-Cavilla, Joseph Andrew Whitney, Adin Matthew Scannell, Philip Patchin, Stephen M. Rumble, Eyal de Lara, Michael Brudno, and Mahadev Satyanarayanan. SnowFlock: Rapid virtual machine cloning for cloud computing. In Proceedings of the 4th ACM European Conference on Computer Systems, EuroSys ’09, New York, NY, USA, 2009. ACM.

[85] Lydia Leong, Douglas Toombs, and Bob Gill. Magic quadrant for cloud infrastructure as a service, worldwide. Gartner, May 2015. https://www.gartner.com/doc/3056019. Online; accessed 2015-11-27.

[86] Jochen Liedtke. On micro-kernel construction. In Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles, SOSP ’95, pages 237–250, New York, NY, USA, 1995. ACM.

[87] Pin Lu and Kai Shen. Virtual machine memory access tracing with hypervisor exclusive cache. In Proceedings of the 2007 USENIX Conference on USENIX Annual Technical Conference, ATC ’07, Berkeley, CA, USA, 2007. USENIX Association.

[88] Sandor Lukacs, Dan H. Lutas, and Raul V. Tosa. Hypervisor-based enterprise endpoint protection, 2014. US Patent 8,910,238.

[89] Jonathan Mace, Ryan Roelke, and Rodrigo Fonseca. Pivot tracing: Dynamic causal monitoring for distributed systems. In Proceedings of the 25th Symposium on Operating Systems Principles, SOSP ’15, pages 378–393, New York, NY, USA, 2015. ACM.

[90] David J. C. MacKay. Information theory, inference and learning algorithms. Cambridge University Press, 2003. ISBN: 9780521642989.

[91] Anil Madhavapeddy, Richard Mortier, Charalampos Rotsos, David Scott, Balraj Singh, Thomas Gazagnaire, Steven Smith, Steven Hand, and Jon Crowcroft. Unikernels: Library operating systems for the cloud. SIGPLAN Notices, 48(4):461–472, 2013.

[92] Dan Magenheimer. Xen 4.6 TSC Mode how-to. http://xenbits.xen.org/docs/4.6-testing/misc/tscmode.txt. Online; accessed 2015-09-14.

[93] Kristis Makris and Kyung Dong Ryu. Dynamic and adaptive updates of non-quiescent subsystems in commodity operating system kernels. SIGOPS Operating Systems Review, 41(3):327–340, 2007.

[94] Ali Mashtizadeh, Emré Celebi, Tal Garfinkel, and Min Cai. The design and evolution of live storage migration in VMware ESX. In Proceedings of the 2011 USENIX Conference on USENIX Annual Technical Conference, ATC ’11, Berkeley, CA, USA, 2011. USENIX Association.

[95] Michael Matz, Jan Hubicka, Andreas Jaeger, and Mark Mitchell. System V application binary interface. AMD64 Architecture Processor Supplement, Draft v0.99, 2005.

[96] Marshall K. McKusick, George V. Neville-Neil, and Robert N. M. Watson. The Design and Implementation of the FreeBSD Operating System. Addison-Wesley, 2 edition, 2014. ISBN: 0321968972.

[97] Larry McVoy and Carl Staelin. Lmbench: Portable tools for performance analysis. In Proceedings of the 1996 USENIX Conference on USENIX Annual Technical Conference, ATEC ’96, pages 279–294, Berkeley, CA, USA, 1996. USENIX Association.

[98] Aravind Menon, Alan L. Cox, and Willy Zwaenepoel. Optimizing network virtualization in Xen. In Proceedings of the 2006 USENIX Conference on USENIX Annual Technical Conference, ATEC ’06, Berkeley, CA, USA, 2006. USENIX Association.

[99] Aravind Menon, Jose Renato Santos, Yoshio Turner, G. (John) Janakiraman, and Willy Zwaenepoel. Diagnosing performance overheads in the Xen virtual machine environment. In Proceedings of the 1st ACM/USENIX International Conference on Virtual Execution Environments, VEE ’05, pages 13–23, New York, NY, USA, 2005. ACM.

[100] Roger Pau Monne. Improving block protocol scalability with persistent grants. https://blog.xenproject.org/2012/11/23/improving-block-protocol-scalability-with-persistent-grants/, 2012. Online; accessed 2015-10-30.

[101] Yeji Nam, Dongwoo Lee, and Young Ik Eom. SELF: Improving the memory-sharing opportunity using virtual-machine self-hints in virtualized systems. In Proceedings of the 6th Asia-Pacific Workshop on Systems, APSys ’15, New York, NY, USA, 2015. ACM.

[102] Ripal Nathuji, Aman Kansal, and Alireza Ghaffarkhah. Q-clouds: Managing performance interference effects for QoS-aware clouds. In Proceedings of the 5th European Conference on Computer Systems, EuroSys ’10, pages 237–250, New York, NY, USA, 2010. ACM.

[103] Gartner Newsroom. Gartner identifies five ways to migrate applications to the cloud. May 2011. http://www.gartner.com/newsroom/id/1684114. Online; accessed 2015-11-27.

[104] Khang Nguyen. Benefits of Intel Cache Monitoring Technology in the Intel Xeon Processor E5 v3 Family. https://software.intel.com/en-us/blogs/2014/06/18/benefit-of-cache-monitoring, 2014. Online; accessed 2015-11-05.

[105] Ruslan Nikolaev and Godmar Back. Perfctr-Xen: A framework for performance counter virtualization. In Proceedings of the 7th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, VEE ’11, pages 15–26, New York, NY, USA, 2011. ACM.

[106] Ryota Ozaki. Porting DTrace to NetBSD/ARM. https://www.netbsd.org/~ozaki-r/pub/AsiaBSDCon2014-WIP-ozaki-r.pdf, 2014. Online; accessed 2015-08-31.

[107] Prasanna Panchamukhi. Kernel debugging with Kprobes. IBM developerWorks, 2004.

[108] Gabriele Paoloni. How to benchmark code execution times on Intel IA-32 and IA-64 instruction set architectures. Intel Corporation, September 2010. Online; accessed 2015-09-14.

[109] Dimosthenis Pediaditakis, Charalampos Rotsos, and Andrew W. Moore. Faithful reproduction of network experiments. In Proceedings of the Tenth ACM/IEEE Symposium on Architectures for Networking and Communications Systems, ANCS ’14, pages 41–52, New York, NY, USA, 2014. ACM.

[110] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[111] Simon Peter and Thomas Anderson. Arrakis: A case for the end of the empire. In Proceedings of the 14th USENIX Workshop on Hot Topics in Operating Systems, HotOS ’13, Berkeley, CA, USA, 2013. USENIX Association.

[112] Simon Peter, Andrew Baumann, Timothy Roscoe, Paul Barham, and Rebecca Isaacs. 30 seconds is not enough!: A study of operating system timer usage. In Proceedings of the 3rd ACM SIGOPS/EuroSys European Conference on Computer Systems 2008, EuroSys ’08, pages 205–218, New York, NY, USA, 2008. ACM.

[113] Simon Peter, Jialin Li, Irene Zhang, Dan R. K. Ports, Doug Woos, Arvind Krishnamurthy, Thomas Anderson, and Timothy Roscoe. Arrakis: The operating system is the control plane. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, OSDI ’14, pages 1–16, Berkeley, CA, USA, 2014. USENIX Association.

[114] Sean Peters, Adrian Danis, Kevin Elphinstone, and Gernot Heiser. For a microkernel, a big lock is fine. In Proceedings of the 6th Asia-Pacific Workshop on Systems, APSys ’15, New York, NY, USA, 2015. ACM.

[115] Gerald J. Popek and Robert P. Goldberg. Formal requirements for virtualizable third generation architectures. Communications of the ACM, 17(7):412–421, 1974.

[116] Calton Pu, Henry Massalin, and John Ioannidis. The Synthesis kernel. Computing, Springer Verlag, pages 11–33, 1988.

[117] Xing Pu, Ling Liu, Yiduo Mei, S. Sivathanu, Younggyun Koh, C. Pu, and Yuanda Cao. Who is your neighbor: Net I/O performance interference in virtualized clouds. IEEE Transactions on Services Computing, 6(3):314–329, 2013.

[118] Nguyen Anh Quynh and Kuniyasu Suzaki. Xenprobes, a lightweight user-space probing framework for Xen virtual machine. In Proceedings of the 2007 USENIX Conference on USENIX Annual Technical Conference, ATC ’07, Berkeley, CA, USA, 2007. USENIX Association.

[119] Charles Reiss, Alexey Tumanov, Gregory R. Ganger, Randy H. Katz, and Michael A. Kozuch. Heterogeneity and dynamicity of clouds at scale: Google trace analysis. In Proceedings of the Third ACM Symposium on Cloud Computing, SoCC ’12, New York, NY, USA, 2012. ACM.

[120] Charles Reiss, John Wilkes, and Joseph L. Hellerstein. Google cluster-usage traces: format + schema. Google Inc., Mountain View, CA, USA, Technical Report, 2011.

[121] John Scott Robin and Cynthia E. Irvine. Analysis of the Intel Pentium’s ability to support a secure virtual machine monitor. In Proceedings of the 9th Conference on USENIX Security Symposium, SSYM ’00, Berkeley, CA, USA, 2000. USENIX Association.

[122] Joanna Rutkowska and Alexander Tereshkin. Bluepilling the Xen hypervisor. Black Hat USA, 2008. http://invisiblethingslab.com/resources/bh08/part3.pdf. Online; accessed 2015-11-27.

[123] Jerome H. Saltzer and Michael D. Schroeder. The protection of information in computer systems. Proceedings of the IEEE, 63(9), 1975. IEEE Computer Society.

[124] Constantine P. Sapuntzakis, Ramesh Chandra, Ben Pfaff, Jim Chow, Monica S. Lam, and Mendel Rosenblum. Optimizing the migration of virtual computers. SIGOPS Operating Systems Review, 36(SI):377–390, 2002.

[125] Adrian Schüpbach, Simon Peter, Andrew Baumann, Timothy Roscoe, Paul Barham, Tim Harris, and Rebecca Isaacs. Embracing diversity in the Barrelfish manycore operating system. In Proceedings of the Workshop on Managed Many-Core Systems, New York, NY, USA, 2008. ACM.

[126] Malte Schwarzkopf, Derek G. Murray, and Steven Hand. The seven deadly sins of cloud computing research. In Proceedings of the 4th USENIX Workshop on Hot Topics in Cloud Computing, HotCloud ’12, Berkeley, CA, USA, 2012. USENIX Association.

[127] Love H. Seawright and Richard A. MacKinnon. VM/370: A study of multiplicity and usefulness. IBM Systems Journal, 18(1):4–17, 1979.

[128] Prateek Sharma, Stephen Lee, Tian Guo, David Irwin, and Prashant Shenoy. SpotCheck: Designing a derivative IaaS cloud on the spot market. In Proceedings of the Tenth European Conference on Computer Systems, EuroSys ’15, New York, NY, USA, 2015. ACM.

[129] Ryan Shea, Feng Wang, Haiyang Wang, and Jiangchuan Liu. A deep investigation into network performance in virtual machine based cloud environments. In Proceedings of the 2014 Annual Joint Conference of the IEEE Computer and Communications Societies. Technology: Emerging or Converging, InfoCom ’14, pages 1285–1293, 2014.

[130] Alan Shieh, Srikanth Kandula, Albert Greenberg, and Changhoon Kim. Seawall: Performance isolation for cloud datacenter networks. In Proceedings of the 2nd USENIX Workshop on Hot Topics in Cloud Computing, HotCloud ’10, Berkeley, CA, USA, 2010. USENIX Association.

[131] Benjamin H. Sigelman, Luiz André Barroso, Mike Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, and Chandan Shanbhag. Dapper, a large-scale distributed systems tracing infrastructure. Technical report, Google, Inc., 2010.

[132] James Snee, Lucian Carata, Oliver R. A. Chick, Ripduman Sohan, Ramsey M. Faragher, Andrew Rice, and Andy Hopper. Soroban: Attributing latency in virtualized environments. In Proceedings of the 7th USENIX Workshop on Hot Topics in Cloud Computing, HotCloud ’15, Berkeley, CA, USA, 2015. USENIX Association.

[133] Sherri Sparks and Jamie Butler. Shadow Walker: Raising the bar for rootkit detection. Black Hat Japan, pages 504–533, 2005.

[134] Riza O. Suminto, Agung Laksono, Anang D. Satria, Thanh Do, and Haryadi S. Gunawi. Towards pre-deployment detection of performance failures in cloud distributed systems. In Proceedings of the 7th USENIX Workshop on Hot Topics in Cloud Computing, HotCloud ’15, Berkeley, CA, USA, 2015. USENIX Association.

[135] Kuniyasu Suzaki, Kengo Iijima, Toshiki Yagi, and Cyrille Artho. Memory deduplication as a threat to the guest OS. In Proceedings of the Fourth European Workshop on System Security, EUROSEC ’11, New York, NY, USA, 2011. ACM.

[136] Byung Chul Tak, Bhuvan Urgaonkar, and Anand Sivasubramaniam. To move or not to move: The economics of cloud computing. In Proceedings of the 3rd USENIX Workshop on Hot Topics in Cloud Computing, HotCloud ’11, Berkeley, CA, USA, 2011. USENIX Association.

[137] Ariel Tamches and Barton P. Miller. Fine-grained dynamic instrumentation of commodity operating system kernels. In Proceedings of the Third Symposium on Operating Systems Design and Implementation, OSDI ’99, pages 117–130, Berkeley, CA, USA, 1999. USENIX Association.

[138] Ariel Tamches and Barton P. Miller. Using dynamic kernel instrumentation for kernel and application tuning. International Journal of High Performance Computing, 13(3):263–276, 1999.

[139] Xen.org Security Team. Security policy. http://www.xenproject.org/security-policy.html. Online; accessed 2015-10-30.

[140] Xen.org Security Team. XSA-148, CVE-2015-7835. Available from MITRE, CVE-ID CVE-2015-7835. http://xenbits.xen.org/xsa/advisory-148.html. Online; accessed 2015-10-30.

[141] Rich Uhlig, Gil Neiger, Dion Rodgers, Amy L. Santoni, Fernando Martins, Andrew V. Anderson, Steven M. Bennett, Alain Kägi, Felix H. Leung, and Larry Smith. Intel virtualization technology. Computer, 38(5):48–56, 2005.

[142] Kenneth van Surksum. Deploying extremely latency-sensitive applications in VMware vSphere 5.5. http://www.vmware.com/files/pdf/techpaper/latency-sensitive-perf-vsphere55.pdf, 2013. Online; accessed 2015-11-22.

[143] Various. Amazon EC2 IP address ranges. https://ip-ranges.amazonaws.com/ip-ranges.json. Online; accessed 2015-10-13.

[144] Carl A. Waldspurger. Memory resource management in VMware ESX Server. SIGOPS Operating Systems Review, 36(SI):181–194, 2002.

[145] Richard Watson. Docker democratizes virtualization for devops-minded developers and administrators. https://www.gartner.com/doc/3011919, 2015. Online; accessed 2015-11-22.

[146] Jiawei Wen, Lei Lu, Giuliano Casale, and Evgenia Smirni. Less can be more: micro-managing VMs in Amazon EC2. In Proceedings of the 8th IEEE International Conference on Cloud Computing, Cloud ’15, New York, NY, USA, 2015. IEEE Computer Society.

[147] Maurice V. Wilkes. Time Sharing Computer Systems. Elsevier Science Inc., 3 edition, 1975. ISBN: 0444195254.

[148] Emmett Witchel, Junghwan Rhee, and Krste Asanović. Mondrix: Memory isolation for Linux using Mondriaan memory protection. In Proceedings of the Twentieth ACM Symposium on Operating Systems Principles, SOSP ’05, pages 31–44, New York, NY, USA, 2005. ACM.

[149] Candid Wueest. Does malware still detect virtual machines? http://www.symantec.com/connect/blogs/does-malware-still-detect-virtual-machines. Online; accessed 2015-10-30.

[150] Carl J. Young. Extended architecture and hypervisor performance. In Proceedings of the Workshop on Virtual Computer Systems, pages 177–183, New York, NY, USA, 1973. ACM.

[151] Pengfei Yuan, Yao Guo, and Xiangqun Chen. Experiences in profile-guided operating system kernel optimization. In Proceedings of 5th Asia-Pacific Workshop on Systems, APSys ’14, New York, NY, USA, 2014. ACM.

[152] Irene Zhang, Tyler Denniston, Yury Baskakov, and Alex Garthwaite. Optimizing VM checkpointing for restore performance in VMware ESXi. In Proceedings of the 2013 USENIX Conference on USENIX Annual Technical Conference, ATC ’13, Berkeley, CA, USA, 2013. USENIX Association.
