Non-intrusive Virtual Systems Monitoring

by

Sahil Suneja

A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy, Graduate Department of Computer Science, University of Toronto

Copyright 2016 by Sahil Suneja

Abstract

Non-intrusive Virtual Systems Monitoring

Sahil Suneja
Doctor of Philosophy
Graduate Department of Computer Science
University of Toronto
2016

In this thesis, I discuss why existing intrusive systems monitoring approaches are not a good fit for the modern virtualized cloud, and describe two alternative out-of-band solutions that leverage virtualization for better systems monitoring.

My first solution employs VM Introspection (VMI) to gain access to a VM's runtime state from the virtualization layer. I develop new VMI techniques to efficiently expose VM memory state from outside the VM boundary, which can be readily employed in existing cloud platforms as they are designed to operate with no new modifications or dependencies. While there exist a variety of other competing alternatives, their latency, overhead, complexity and consistency trade-offs are not clear. Thus, I begin my thesis by addressing this gap: organizing the various existing VMI techniques into a taxonomy based upon their operational principles, and performing a thorough exploration of their trade-offs both qualitatively and quantitatively. I further present a deep dive into VMI consistency aspects to understand the sources of inconsistency in observed VM state, and show marginal benefits for consistency with commonly employed VMI solutions despite their prohibitive overheads.

Then, I present NFM (Near Field Monitoring), a new approach that decouples system execution from monitoring by pushing monitoring components out of the target systems' scope. By extending and combining VMI with a backend cloud analytics platform, NFM provides simple, standard interfaces to monitor running systems in the cloud that require no guest cooperation or modification, and have minimal effect on guest execution. By decoupling monitoring and analytics from target system context, NFM provides always-on monitoring, even when the target system is unresponsive.

My second solution, CIVIC (Cloning and Injection based VM Inspection and Customization), avoids NFM's functionality duplication effort and overcomes its VMI-related limitations arising out of its raw memory byte level visibility into the guest. CIVIC operates at a logical OS level and reuses the vast stock monitoring software codebase, but in a separate isolated environment, thus avoiding guest intrusion and interference hassles. CIVIC enables a broader usage scope in addition to NFM's passive (read-only) monitoring, by supporting actuation or on-the-fly introduction of new functionality. It restricts all impact and side-effects of such customization operations inside a live clone of the guest VM. New functionality over the replicated VM state is introduced using code injection.

I present four applications built on top of NFM using its 'systems as data' monitoring approach, to showcase its capabilities for across-systems and across-time analytics. I also highlight CIVIC's versatility in terms of enabling hotplugged and impact-free live customization, by employing it to monitor, inspect, troubleshoot and tune unmodified VMs.

Acknowledgements

I express my sincere gratitude to Prof. Eyal de Lara for steering me towards the successful completion of this voyage of exploration. I’ve learnt a great many things from him, and I could not have asked for a more supportive supervisor.

I am thankful to my committee members- Prof. Angela Demke Brown, Prof. Bianca Schroeder, and Prof. Ryan Johnson- for their guidance and backing.

I owe a great deal of gratitude to my mentors at IBM Research- Dr. Canturk Isci and Dr. Vasanth Bala- for their encouragement, collaboration and contribution to this work.

I also want to thank all members of the DCS Graduate Office for facilitating a smoothly functioning work environment. During the course of my journey I’ve been helped by so many of them, that I’m afraid I’ll miss out on any names if I start listing them here!

A note of thanks also to my fellow graduate students and faculty members. I’ve always been in awe of their brilliance and dedication, which has humbled me and motivated me to strive hard.

My sincere gratitude and respect to my parents- Mr. S.K. Suneja and Mrs. Vandna Suneja- for their love, affection and emotional support, encouraging me to put in my sincere efforts.

A very special thanks to my brother, Sagar Suneja, for pushing me forward during the last mile.

Thanks also to my friends for their help and support throughout my time in Toronto.

It would be unfair not to thank the numerous stackoverflow.com users for sharing their technical knowledge!

Finally, thank you, God, for all of the above!

Sahil Suneja

Contents

1 Introduction

2 Background and Related Work
  2.1 Monitoring Tasks
  2.2 Monitoring Techniques
  2.3 VMI Techniques
    2.3.1 Exposing VM State
    2.3.2 Exploiting VM State
  2.4 VMI Applications
  2.5 Other Candidate Techniques for Monitoring
    2.5.1 Concerns with Alternatives

3 Exploring VM Introspection: Techniques and Trade-offs
  3.1 VMI Taxonomy
  3.2 Qualitative Comparison
  3.3 Quantitative Comparison
    3.3.1 Maximum Monitoring Frequency
    3.3.2 Resource Cost on Host
    3.3.3 Impact on VM's Performance
    3.3.4 Real Workload Results
  3.4 Consistency of VM State
    3.4.1 Inconsistency Types
    3.4.2 Quantitative Evaluation
  3.5 Observations and Recommendations
  3.6 Summary

4 Near Field Monitoring
  4.1 NFM's Design
  4.2 Implementation
    4.2.1 Exposing VM State
    4.2.2 Exploiting VM State
    4.2.3 The Frame Datastore
    4.2.4 Application Architecture
  4.3 Prototype Applications
    4.3.1 TopoLog
    4.3.2 CTop
    4.3.3 RConsole
    4.3.4 PaVScan
  4.4 Evaluation
    4.4.1 Latency and Frequency of Monitoring
    4.4.2 Monitoring Accuracy
    4.4.3 Benefits of Holistic Knowledge
    4.4.4 Operational Efficiency Improvements
    4.4.5 Impact on VM's Performance
    4.4.6 Impact on Co-located VMs
    4.4.7 Space Overhead
  4.5 Summary

5 Cloning and Injection based VM Inspection and Customization
  5.1 CIVIC's Design
    5.1.1 Discussion
  5.2 Implementation
    5.2.1 Disk COW
    5.2.2 Live Migration
    5.2.3 COW Memory
    5.2.4 Hotplugging
    5.2.5 Code Injection
    5.2.6 Application Loader Script
  5.3 Performance Evaluation
    5.3.1 Memory Cost
    5.3.2 Clone Instantiation Time
    5.3.3 Impact on Source VM
  5.4 Applications
    5.4.1 Safe Agent Reuse
    5.4.2 Anomaly Detection
    5.4.3 Problem Diagnostics and Troubleshooting
    5.4.4 Autotuning-as-a-Service
  5.5 Conclusion

6 Conclusion and Future Work

Bibliography

List of Tables

3.1 Qualitative comparison of VMI techniques - empty cells in compatibility columns indicate functionality not advertised by the hypervisor, nor enabled by users

4.1 Key capabilities of the prototype applications

6.1 NFM vs. CIVIC

List of Figures

3.1 VMI Taxonomy: categorizing current implementations
3.2 Comparing maximum monitoring frequency across all KVM instances of VMI techniques
3.3 CPU used vs. maximum monitoring frequency
3.4 Comparing % degradation on x264 benchmark's frames-encoded/s as a function of monitoring frequency
3.5 Comparing % degradation on memory, disk and network throughput as a function of monitoring frequency
3.6 Comparing % degradation on Sysbench OLTP benchmark's transactions/s as a function of monitoring frequency
3.7 Comparing % degradation on httperf's metrics as a function of monitoring frequency
3.8 Observed inconsistency probabilities for all categories

4.1 Introspection and analytics architecture
4.2 VM and app connectivity discovered by Topology Analyzer for 4 VMs
4.3 VM Connectivity Matrix [Mbps]
4.4 Above: in-VM top; Below: CTop
4.5 RConsole captures datacpy's hidden listener connection
4.6 Measured crawling latencies and achievable monitoring frequencies (log scale)
4.7 CPU utilization: in-VM top vs. CTop
4.8 top vs. CTop: comparing LAMP processes across 3 VMs to explain httperf statistics
4.9 Httperf reply rate and connection drops with various virusscan configurations
4.10 Httperf over 10 rounds (each bounded between successive vertical lines); Virusscan starts between round 2-3
4.11 Impact on webserver VM with parallel out-of-band monitoring and management
4.12 State management overhead with delta frames

5.1 CIVIC's architecture; step-by-step description in Design Section
5.2 Measuring memory footprint of CIVIC's postcopy+COW clones and precopy clones, for different source VM sizes and memory use configurations
5.3 Measuring clone instantiation time for CIVIC's postcopy+COW clones and precopy clones, for different source VM sizes and memory use configurations. The expected curve for source size-independent postcopy clones should be a horizontal line but for experimental variation
5.4 Measuring CPU usage with collectd agent in source (left) and clone (right)
5.5 SAAD under CIVIC: enhancements on enabling debug mode in clones, in addition to stock SAAD capabilities (dashed )
5.6 Count as well as memory usage of PHP processes in a webserver, for different proportions of cached data with expired TTL. Compared across 3 different PHP versions with memory leaks, fixed between v5.1.6 to v5.6.10
5.7 Measuring PHP processes' memory usage via strace; leaks detected in versions 5.1.6 and 5.3.20
5.8 Webserver capacity variations with apache+kernel tuning, normalized to base capacity

Chapter 1

Introduction

Cloud computing and virtualization technologies are dramatically changing how IT systems operate. What used to be a relatively static environment, with fixed physical nodes, has quickly transformed into a highly dynamic environment, where (clusters of) virtual machines (VMs) are programmatically provisioned, started, replicated, stopped and deprovisioned with cloud APIs. VMs have become the processes of the cloud OS, with short lifetimes and a rapid proliferation trend [87]. While the nature of data center operations has changed, the management methodology of these (virtual) machines has not adapted appropriately. Tasks such as performance monitoring, compliance and security scans, and product discovery, amongst others, are carried out using re-purposed versions of tools originally developed for managing physical systems, or via newer virtualization-aware alternatives that require guest cooperation and accessibility. These approaches require a communication channel, i.e., a hook, into the running system, or the introduction of a software component, i.e., an agent, within the system runtime.

In this thesis, I discuss why existing intrusive monitoring approaches are not a good fit for the modern virtualized cloud, and describe two alternative solutions that leverage virtualization for better systems monitoring. The two solutions operate at different levels of visibility into the guest: (i) raw - a byte level memory view, and (ii) logical - an OS level view. I also develop new techniques for exposing the target VMs' memory (primarily for monitoring the VM from outside), and perform a comprehensive aggregation and comparative evaluation of existing alternatives.

For my first solution, I employ VM introspection (VMI) [71] to gain access to a VM's memory from the virtualization layer. In addition to my proposed mechanisms for exposing a VM's memory state, there exist several other VMI techniques that have been developed independently over the years, but there is no comprehensive framework that puts all these techniques in context, and compares and contrasts them. In the first part of my thesis, I present a thorough exploration of VMI techniques, and introduce a taxonomy for grouping them into different classes based upon four operation principles: (i) whether guest cooperation is required; (ii) whether the technique creates an exact point-in-time replica of the guest's memory; (iii) whether the guest has to be halted; and, (iv) the type of interface provided to access guest state. Then, I present the results of their qualitative and quantitative comparison across several dimensions such as latency, resource consumption and VM impact.

I further present a detailed exploration of the memory consistency aspects of VMI. Particularly, introspecting a live system while its state is changing may lead to inconsistencies in the observed VM state (non-existent or malformed data structures), causing the monitoring process to fail. I identify the various sources of such inconsistency, both intrinsic to the OS and extrinsic due to live introspection.

I show that, contrary to common expectation, pause-and-introspect based VMI techniques achieve very little to improve consistency despite their substantial performance impact. To conclude this part, I present a set of observations and suggestions based on my experience with the different VMI techniques.

Next, I present Near Field Monitoring (NFM), a new approach for system monitoring that leverages virtualization technology to decouple system monitoring from system context. NFM extends VMI techniques and combines them with a backend cloud analytics platform to perform monitoring and management functions without requiring access into, or cooperation from, the target systems. NFM crawls VM memory and disk state in an out-of-band manner from outside the guest's context, to collect system state which is then fed to the analytics backend. The monitoring functions simply query this systems data, instead of accessing and intruding on each running system. In stark contrast with existing techniques, NFM seamlessly works even when a system becomes unresponsive (always-on monitoring). Unlike the in-VM solutions that run within the guest context and compete for resources allocated to the guest VMs, NFM is non-intrusive, does not require the installation and configuration of any hooks or agents inside the guest, and does not steal guests' cycles. NFM is better suited for responding to the ephemeral nature of VMs and further opens up new opportunities for cloud analytics by decoupling VM execution and monitoring.

In the final part of my thesis, I present my second VM monitoring solution- CIVIC (Cloning and Injection based VM Inspection and Customization). CIVIC operates at a logical OS level rather than NFM's raw memory byte level visibility into the guest. The motivation behind CIVIC is to overcome the effort involved in bridging the semantic gap in NFM between the exposed raw VM memory view and the logical OS-level VM-internal view. Furthermore, monitoring VMs with memory introspection requires effort- either exposing an entire OS-like view (/proc etc.) for pre-existing monitoring software, or writing fresh monitors using introspection directly. CIVIC's OS-level operational environment allows reuse of the vast stock software codebase, without the corresponding guest intrusion and interference hassles. CIVIC follows the same principles as NFM of not enforcing guest cooperation or requiring any pre-requisites to be built into the VM. CIVIC first creates a live replica of the guest VM, including its runtime state, in a separate isolated sandbox environment (the clone). Then, it uses runtime code injection to introduce new userspace-level functionality over the replicated VM state. In addition to systems monitoring, this approach further enables deep inspection and customization, i.e., tuning or adding new functionality, without the fear of negatively impacting the original guest system.

The main contributions of my work are:

• Novel virtual systems monitoring techniques using VM introspection and ‘systems as documents’ abstraction (NFM), and VM cloning and code injection (CIVIC).

• End-to-end implementations of NFM and CIVIC on the KVM/QEMU hypervisor (NFM also on Xen).

• New methods for low-latency access to VMs' memory from unmodified KVM/QEMU hosts (with optional support for stricter memory-view consistency), enabling subsecond monitoring of unmodified guest systems over NFM.

• A taxonomy to organize existing and proposed VMI techniques for exposing VM memory.

• A comprehensive comparative evaluation of VMI techniques and a set of recommendations to aid users in selecting the approach best suited to their requirements and constraints.

• A detailed exploration of VMI memory consistency aspects, highlighting a consistency fallacy in terms of pause-and-introspect techniques failing to mitigate all forms of inconsistency in the observed VM state.

• Four applications over NFM to highlight its capabilities for across-systems and across-time analytics, while leveraging familiar paradigms from the data analytics domain such as document differencing and semantic annotations to analyze systems.

• A technique for injecting and running code (CIVIC) in a VM from the hypervisor, which is useful for troubleshooting VMs with a dysfunctional userspace environment (such as non-responsive SSH).

• Porting stock software atop CIVIC and demonstrating with four use-cases its ability to monitor, inspect, troubleshoot and tune unmodified VMs.

The rest of my thesis is organized as follows. In Chapter 2, I summarize existing monitoring techniques, VMI techniques and their typical use-cases, and other approaches that can provide NFM-like execution-monitoring decoupling, thus leading to my CIVIC solution. Chapter 3 organizes existing and proposed VMI techniques into my VMI taxonomy, compares them qualitatively and quantitatively, and characterizes potential memory state inconsistencies arising (partly) due to non-guest-synchronized out-of-band VM memory access. Chapter 4 describes and evaluates my NFM framework and four monitoring applications that I've built on top of NFM to demonstrate its potential and benefits. Chapter 5 describes and evaluates my CIVIC solution, and demonstrates its versatility with three use-cases in addition to basic systems monitoring. Finally, Chapter 6 summarizes my thesis and discusses future research directions.

Chapter 2

Background and Related Work

This Chapter first sets the data center monitoring context for this thesis, and then summarizes existing systems monitoring techniques. Next, it introduces VM Introspection (VMI) [71], the different existing VMI techniques and their typical use-cases thus far in literature. The NFM framework described in this thesis employs and extends VMI for touchless systems monitoring. Finally, it discusses other techniques that can provide NFM-like execution-monitoring decoupling while overcoming the effort involved in developing VMI based tools. This sets the scope for the second solution (CIVIC) described in this thesis.

2.1 Data Center Monitoring Tasks

In addition to running customer workloads, datacenter operation requires the execution of a variety of management and monitoring tasks that track the health of the system and ensure the efficient operation of the infrastructure. Some examples include:

• Resource Monitoring- Runtime tracking of a VM's system-wide and per-process resources: CPU, memory, network and disk.

• Compliance Conformity- Ensuring policy compliance such as patch levels, network and security configurations, detecting blacklisted applications, etc.

• VM Sizing and Consolidation- Tracking and cross-mapping application-level resource demands with VM configuration and demand to detect VM resizing requirements or opportunities, and guiding VM consolidation with application-level resource requirements.

• Anomaly Detection- Monitoring to detect rogue processes, system thrashing such as via process' memory usage patterns or swap daemon's activity, bottleneck nodes or spurious connections, amongst other anomalies.

• Inter-VM Network Topology- Cluster-level analytics for discovering communicating VMs, which can be used to guide VM placement to optimize network utilization.

• Patch Management- Identifying tightly coupled clusters for parallel patching to minimize overall service downtime for distributed applications.

• Security Scanning- Searching for signatures of known viruses or malware.


2.2 Monitoring Techniques

System monitoring has been a major part of enterprise IT operations. The various techniques employed today can be broadly categorized as follows:

1. Installing agents inside VMs for each monitoring task; the agents could be standalone or client components for a monitoring backend.

2. Remotely accessing the VMs using application specific hooks (e.g., over ssh) to observe system state.

3. Tracking limited black-box metrics collected by the virtualization layer.

4. Installing general purpose agents or backdoors inside the VMs, that provide generic in-VM information through the virtualization layer.

Existing cloud monitoring and management solutions employ one or more of the above methods to deliver their service. For example, Amazon's CloudWatch [10] service falls under the third category in its base operation, while it can be extended by the end users with in-VM data providers (as in the first category) to provide deeper VM-level information. Dell Quest/VKernel Foglight [55] is a combination of guest remote access, in-VM agents, and hypervisor level metrics exported by VM management consoles like VMware vCenter and Red Hat Enterprise Management. PHD Virtual's [133] basic VM-as-blackbox metrics use only hypervisor level information, while in-depth VM level metrics require running scripts / installing 'intelligent agents' inside the VMs. Reflex vWatch monitoring [139] uses information from VMware vCenter. VMware vCenter Operations Management Suite [170] is also a combination of hypervisor level metrics, together with an in-guest agent (vFabric Hyperic).

To mitigate the limitations, and in particular, intrusiveness of custom in-VM techniques, an emerging approach has been the use of general purpose agents or backdoors inside the VMs that supply in-VM state information to the virtualization layer (fourth category). Various virtualization extensions such as VMware tools [173], VMSafe API [176], Azure VM Agents [114], VMCI driver [172] and vShield endpoint [174] follow this approach. VMware VIX API [171] uses VMware Tools to track VM information and actuate in-VM operations through this interface. Several security solutions such as McAfee MOVE, TrendMicro DeepSecurity, Reflex vTrust, SADE [35], and CloudSec [86] use the VMSafe / vShield single agent approach.

There are important caveats with these monitoring approaches, covered in detail in Chapter 4 where I describe NFM. Briefly, in-VM solutions are only as good as the monitored system's operational health, and are susceptible to potentially unreliable and inaccurate views of the system. Their maintenance and lifecycle management has become unwieldy owing to cloud expansion and agility [69]. Other concerns include introducing runtime interference, guest modification, and security vulnerabilities due to their external-facing nature. Generic-agent based approaches, on the other hand, cause vendor lock-in due to guest specialization.

Alternate solutions include (i) Hypertection's [83] 'agentless' security solution, but their approach to access Hyper-V VMs' memory (only) is unknown, and (ii) hardware introspection based solutions, such as Litty and Lie's out-of-VM patch auditing [108] that can detect execution of unpatched applications, but they are limited in the kind of VM runtime state that can be inferred at the virtual hardware level (page bits, CPU registers, etc.), and rely on a functional VM environment, unlike memory introspection (described in the next Section).

NFM alleviates most of the concerns with existing solutions by combining and extending VM introspection with a backend 'systems-as-data'-based cloud analytics platform, and further enables new opportunities for cloud monitoring and analytics.

2.3 VMI Techniques

The guest's runtime information that in-VM components consume to carry out their monitoring functionality either resides in the VM's memory or disk. For out-of-band monitoring, this runtime state needs to be obtained from outside the guest's context. This virtual machine introspection (VMI) [71] process can be broken down into two steps: (i) exposing VM runtime state, i.e., getting an out-of-band handle on the VM's memory and disk, and (ii) exploiting VM runtime state, i.e., interpreting the exposed memory and disk images to reconstruct the guest's runtime information. Following is a summary of existing techniques to expose and exploit VM state. NFM deals mostly with memory introspection to extract volatile runtime state from the VM's memory.

2.3.1 Exposing VM State

For disk introspection, a disk handle depends upon whether a file-level or a block-level access is desired. The former is simply the VM's disk image, which exists as a regular file on the host system. The latter can be obtained by trapping virtual block device accesses [128, 141]. As for exposing VM memory state, some solutions employ hardware methods, such as System Management Mode (SMM) in x86 BIOS [181, 17], DMA [6, 113, 28, 20, 130], and system bus snooping [116, 101]. Others rely on software techniques, including in-VM kernel modules (Volatilux [63]), in-VM memory mapping (i.e., via /dev/mem), VMM-level memory snapshotting (QEMU pmemsave or migrate-to-file, or Xen's dump-core), remapping of guest address space inside a trusted VM (XenAccess [128], and LibVMI [24]), and hypervisor-facilitated API access to the guest's memory (VMWare VMSafe()). My NFM solution adds to this list in terms of a 'direct memory access' based memory handle for KVM/QEMU using Linux memory primitives. Also, Chapter 3's taxonomy organizes and contrasts all these VMI techniques based upon their operational principles.
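To make the VMM-level snapshotting route concrete, the sketch below uses the libvirt Python bindings to take a memory-only dump of a KVM/QEMU guest from the host. It is a minimal illustration under stated assumptions (libvirt-python installed, a guest named 'vm1', sufficient privileges), not the exact tooling used in this thesis, and the on-disk dump format depends on the installed libvirt/QEMU versions.

    import libvirt

    def dump_vm_memory(domain_name, out_path):
        # Connect to the local QEMU/KVM hypervisor and locate the guest.
        conn = libvirt.open("qemu:///system")
        try:
            dom = conn.lookupByName(domain_name)
            # Memory-only core dump: libvirt briefly pauses the guest while the
            # hypervisor writes its RAM contents to the given host-side file.
            dom.coreDumpWithFormat(out_path,
                                   libvirt.VIR_DOMAIN_CORE_DUMP_FORMAT_RAW,
                                   libvirt.VIR_DUMP_MEMORY_ONLY)
        finally:
            conn.close()

    if __name__ == "__main__":
        dump_vm_memory("vm1", "/tmp/vm1.mem")  # hypothetical domain name and path

The resulting image can then be handed to any of the state-exploitation techniques summarized in the next subsection.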

2.3.2 Exploiting VM State

Disk state extraction at a file-level translates to employing standard filesystem drivers and libraries to access the VM's disk image [140, 89]. Block-level disk introspection can be achieved by traversing filesystem-specific data structures on the virtual disk image so as to translate virtual disk writes into filesystem updates [141, 196].

On the memory side, several techniques have been developed to overcome the semantic gap between the exposed raw out-of-VM view and the logical OS-level in-VM view. IDetect [111] interprets exposed memory by accessing OS-specific, pre-determined kernel data structure offsets to extract specific system information. Other techniques for automatically detecting the offsets also exist, which rely on pattern matching, function disassembly heuristics, guest support for inserting kernel modules, or additional kernel debug information [63, 94, 14, 30, 103, 51]. Recent work has also proposed automated introspection solutions that do not require detailed knowledge of OS kernel internals. Virtuoso [58] collects in-VM training code traces to create out-of-VM introspection code. VM Space Traveler [67] uses instruction monitoring at the VMM layer to identify and redirect introspection related data to guest OS memory. The NFM framework described in this thesis employs IDetect-like kernel data structure traversal inside the exposed memory view.

In addition to VM state introspection, another important aspect of VMI is event introspection, i.e., the active monitoring of system events such as system calls and context switches. Such event introspection is typically facilitated by setting break or trap points on virtualized hardware such as registers and instructions [88, 73, 155, 132]. While NFM primarily targets passive monitoring (state introspection), its decoupled monitoring-execution design lends itself well to augmentation with active event monitoring. CIVIC uses similar event monitoring (at the kernel's schedule() function) to capture execution for code injection initiation.
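Returning to the file-level disk side mentioned at the start of this subsection, a library such as libguestfs packages the "standard filesystem drivers" route behind a simple API. The sketch below is a hedged illustration assuming its Python bindings and a hypothetical image path; it mounts the guest's filesystems read-only out-of-band and inspects a directory, without touching a running VM.

    import guestfs

    # File-level, out-of-band disk introspection over a guest's disk image.
    g = guestfs.GuestFS(python_return_dict=True)
    g.add_drive_opts("vm1.qcow2", readonly=1)      # hypothetical disk image path
    g.launch()

    for root in g.inspect_os():                    # detect installed OS roots
        print("OS:", g.inspect_get_product_name(root))
        mountpoints = g.inspect_get_mountpoints(root)
        # Mount shorter paths first ("/" before "/boot"), all read-only.
        for mp in sorted(mountpoints, key=len):
            g.mount_ro(mountpoints[mp], mp)
        print("/etc entries:", len(g.ls("/etc")))  # e.g., scan configuration files
        g.umount_all()

    g.shutdown()
    g.close()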

2.4 VMI Applications

Most previous VM introspection work focuses on the security and forensics domain. It is used by digital forensics investigators to get a VM memory snapshot to examine and inspect [29, 30, 76, 151, 61]. On the security side, VMI has been employed for kernel integrity monitoring [130, 81, 20], intrusion detection [71], anti-malware [57, 67, 89, 129, 23], firewall solutions [156], and information flow tracking for bolstering sensitive systems [79].

While VMI originally started as a security specific tool targeting slow-moving and small scale systems, NFM adapts it for real-time large scale cloud monitoring as well as across-systems and across-time analytics. Outside the security domain, IBMon [138] comes closest to NFM's approach, using memory introspection to estimate bandwidth resource use for VMM-bypass network devices. Other recent work employs VMI for information flow policy enforcement [19], application whitelisting [80], VM checkpointing [8], and memory deduplication [34].

2.5 Other Candidate Techniques for Monitoring

Even though VMI is a very powerful technique, it is fragile in that it requires deep OS-specific knowledge that might change, albeit slowly, across different OS versions. Additionally, it demands effort in terms of either exposing the entire OS-like view (/proc etc.) for already existing software, or writing fresh tools using introspection directly. Tools for automatic generation of VMI-based utilities [58] are much slower than native execution, require in-guest training and expert intervention, and still remain incompatible with existing software. One way to avoid these VMI related concerns is to operate at a logical OS-level and reuse the vast stock monitoring software codebase, but in a separate isolated environment. This still keeps the guest VM free from the intrusion and interference of monitoring agents.

Other non-VMI approaches that can potentially be used for virtual systems monitoring, while still providing a similar execution-monitoring decoupling as NFM leverages from VMI, can be categorized into (i) runtime redirection-based solutions, and (ii) VM replication techniques.

Redirection-based solutions operate from a secondary context such as a separate privileged VM or the hypervisor, and feed off of the guest VM's runtime state directly. This secondary context can be viewed as offering isolation, with access to the guest state being facilitated by employing techniques such as (i) process relocation outside the guest [155], (ii) kernel data redirection to the guest [67, 68, 146], and (iii) component implanting (syscall, function-call, module or process) inside the guest [73, 184, 69, 35, 27, 178].

The 'VM replication' set of methods enable creating a secondary VM similar to the guest VM, which can potentially be employed for monitoring on behalf of the original guest VM. These can further enable active monitoring (actuation) as compared to typical passive (read-only) monitoring with VMI. They provide greater diagnosis flexibility and isolation with their ability to reproduce a problematic occurrence in a (potentially more permissive) secondary VM. These methods can be subclassified based upon how the secondary VM is created from the primary: (i) a cold boot over a base image copy, (ii) from a point-in-time snapshot [169], (iii) live cloning with runtime state [48, 98, 162, 199, 115], or (iv) record-and-replay of non-deterministic inputs [60, 154, 36, 92].

VM replication techniques treat each VM as a whole, while redirection-based solutions operate at a fine grained syscall or instruction level. My second solution proposed in this thesis, CIVIC, sits in the middle of this spectrum of guest-view detail, and combines live VM cloning with runtime code injection.

2.5.1 Concerns with Alternatives

A major concern with redirection-based solutions is that most of them install handler components inside the guest VM itself. Such guest intrusion and interference would be unacceptable in a production VM. CIVIC, on the other hand, restricts all operations to clones. In many cases, only basic utilities (like ps, lsmod) are supported, and even then with heavy performance slowdowns [67, 68, 184, 69], one reason being reliance on binary translation. In other cases, the solution is not transparent to the utility software [73], which needs to be made aware of a non-standard runtime environment (e.g., recompilation with static linking and hypercalls). CIVIC, on the other hand, supports complex stock software with negligible runtime slowdown. Some solutions require separate OS fingerprinting to use the exact same version of the guest OS in the secondary VM [68, 146], while cloning gives that for free in CIVIC.

Amongst VM replication techniques, a cold boot over a base image copy suffers from full memory duplication, instantiation latency, as well as state rebuilding which can be costly (e.g., up to 15 minutes to warm up caches or load data [32, 148]). As for a secondary VM instantiated from a point-in-time snapshot, explicit state propagation, especially when the snapshot is not recent, costs additional developer effort, resource wastage (idle workers) and migration delays (consolidated idle VMs) [98]. Additionally, both of these categories may miss capturing exact point-in-time occurrences or anomalies. CIVIC circumvents all these concerns via on-demand replication of the target VM's live runtime state.

While other work has explored live VM cloning, unlike CIVIC, the clones do not perform new functionality and are typically employed for reliability and resource optimization such as high availability [48], fault tolerance [162], parallel worker forking [98, 115] and speeding up system testing [199]. Furthermore, they are based on the assumption of direct access to clones. Although this is valid when guest users themselves create replicas for self management, it becomes problematic when access is to be granted to the service providers in an as-a-service model. For the latter, it would either require enforcing guest cooperation, or defeat portability by installing vendor-specific software components in the guest like VMware Tools [173, 168], VMCI driver [172], and Azure VM Agents [114]. CIVIC instead employs code injection to avoid imposing such guest cooperation and vendor lock-in.

Finally, in case of record-and-replay based replication, new analysis in the middle of a recording is typically limited to the granularity of registers, instructions, memory addresses, page bits, and disk blocks. Although this enables interesting analysis like heap overflow detection and memory safety checking [36], it is tricky to realize analysis at OS- or application-level semantics as with CIVIC. Also unlike CIVIC, these systems do not support changes to the application state [167]. Further architectural requirements and constraints that CIVIC does not have to deal with include maintaining multi-threaded order, precise instruction/branch counting, and stricter constraints on the target architecture than virtualization imposes, amongst others [48].

This thesis brings all these bodies of work together into a common context. Chapter 3 presents a comprehensive aggregation and both qualitative and quantitative comparison of existing and proposed VMI techniques to expose VM state. Chapter 4 describes a new approach- NFM- that employs and extends VMI techniques to perform non-intrusive and out-of-band cloud monitoring. Chapter 5 describes an alternate out-of-band monitoring solution- CIVIC- that overcomes NFM's VMI related limitations by operating at a logical OS level rather than NFM's raw memory byte level visibility into the guest.

Chapter 3

Exploring VM Introspection: Techniques and Trade-offs

My first proposed monitoring solution- NFM- uses two custom mechanisms for exposing a VM's memory state by leveraging Linux's memory management primitives. In addition to these, there exist several other VMI techniques that have been developed independently over the years, but there is no comprehensive framework that puts all these techniques in context, and contrasts them. Understanding the trade-offs between the competing alternatives is crucial to the design of effective new applications, and would aid potential VMI users in deciding which of the myriad techniques to adopt as per their requirements and constraints.

In this Chapter, I present a thorough exploration of VMI techniques to expose VM state, and introduce a taxonomy for grouping them into different classes based upon their operation principles. Then, I present the results of their qualitative and quantitative comparison across the different taxonomy classes. The qualitative evaluation considers techniques available in VMware, Xen and KVM. The quantitative evaluation is restricted to a single hypervisor to minimize environmental variability. I use KVM as it has the highest coverage of VMI techniques and gives us more control, but its results can be extrapolated to similar VMware and Xen techniques that fall in the same taxonomy classes and thus follow the same underlying principles. Although the absolute performance of the different hypervisor-specific technique implementations may slightly differ even within the same class, the relative performance comparison across the taxonomy classes themselves still holds (sometimes differing by an order of magnitude each).

I further present a detailed exploration of VMI consistency aspects to understand the sources of inconsistency in observed VM state, and show that, contrary to common expectation, pause-and-introspect based VMI techniques achieve very little to improve consistency despite their substantial performance impact. This chapter reveals the stunning range of variations in performance, complexity and overhead with different VMI techniques, providing application developers different alternatives to choose from based on their desired levels of latency, frequency, overhead, liveness, consistency, and intrusiveness, constrained by their workloads, use-cases, resource budget and deployability flexibility. To conclude, I present a comprehensive set of observations and best practices for efficient, accurate and consistent VMI operation based on my experiences with these techniques.


[Figure 3.1 shows a decision tree that classifies current VMI implementations by guest cooperation, snapshotting, guest liveness, and memory access type (map, read, or interface based) into classes I-VIII.]

Figure 3.1: VMI Taxonomy: categorizing current implementations

3.1 VMI Taxonomy

I characterize VMI techniques based on four orthogonal dimensions: (i) Guest Cooperation, whether the technique involves cooperation from code running inside the guest VM; (ii) Snapshotting, whether the technique creates an exact point-in-time replica of the guest's memory; (iii) Guest Liveness, whether the technique halts the guest VM; and (iv) Memory Access Type, the type of interface provided to access guest state, which can be either via address space remapping, reads on a file descriptor, or through an interface provided by a VM manager or a debugger.

While there can be arbitrary combinations of these dimensions, in practice only a few are employed. Figure 3.1's taxonomy shows the specific attribute combinations that can categorize the current implementations for accessing in-VM memory state. Some of these methods are hypervisor-exposed, while others are specialized use cases, leverage low level memory management primitives, or are enabled by third party libraries. My 'direct memory access' techniques for NFM fall under the 'Live Memory Reads' and 'Guest-Halting Reads' classes. The proposed taxonomy is general and hypervisor-independent. The rest of this section describes the techniques' functionality.

I. Agent assisted access requiring guest cooperation

These techniques install agents or modules inside the guests to facilitate runtime state extraction from outside.

• VMWare VMSafe() [176], XenServer's XenGuestAgent [39, 38], QEMU's qemu-ga [137]: These techniques access a VM's memory directly via guest pseudo devices (/dev/mem) or interface with the guest OS via a pseudo filesystem (/proc) or kernel exported functions. The solutions then communicate either directly through their own custom in-VM agents [172, 174, 171, 55, 7], or mediated by the hypervisor [86, 35].
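For KVM/QEMU, the sketch below shows what the agent-assisted route can look like from the host side, using libvirt's qemu-ga binding. It assumes libvirt-python with the libvirt_qemu module, a guest named 'vm1' that already runs the qemu-guest-agent, and that the available guest-* commands vary with the agent version.

    import json
    import libvirt
    import libvirt_qemu

    conn = libvirt.open("qemu:///system")
    dom = conn.lookupByName("vm1")                     # hypothetical domain name

    # Ask the in-guest qemu-ga agent which commands it exposes; everything here
    # depends on the agent cooperating from inside the guest.
    reply = libvirt_qemu.qemuAgentCommand(
        dom, json.dumps({"execute": "guest-info"}), 5, 0)  # 5 s timeout, no flags
    supported = json.loads(reply)["return"]["supported_commands"]
    print(sorted(c["name"] for c in supported if c["enabled"]))

    conn.close()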

II. Halt Snap

These methods do not require guest cooperation and distinguish themselves by producing a full copy of the guest's memory image, while also pausing the guest to obtain a consistent snapshot.

• QEMU pmemsave, Xen dump-core, Libvirt/Virsh library's dump and save, VMWare vmss2core: These techniques dump the VM memory to a file. Example usages include Blacksheep [23] that clusters memory dumps based on similarity to detect rootkit infestation, and Crash [51] that uses them to obtain guest kernel core dumps.

• QEMU migrate to file: This functionality migrates a VM to a file instead of a physical host. It is essentially similar to memory dumping, but smarter in terms of the content that actually gets written (deduplication, skipping zero pages, etc.).

• LibVMI library’s shm-snapshot [24]: This approach creates a VM memory snapshot inside a shared memory virtual filesystem at host. It has been implemented for both Xen and KVM (QEMU modified). Access to the snapshot is mediated by LibVMI after internally mapping the memory resident (/dev/shm/*) file.

III. Live Snap

These methods obtain a consistent snapshot without pausing the guest.

• HotSnap [47] for QEMU/KVM and similar alternatives for Xen [160, 44, 82, 179, 98] use copy-on-write implementations to create consistent memory snapshots that do not halt the guest. These approaches modify the hypervisor due to lack of default support.

IV. Live Memory Mapping

These methods do not require guest cooperation or a guest memory image capture, and support introspection while the guest continues to run. Methods in this class provide a memory-mapped interface to access the guest state.

• Xen xc_map_foreign_range(), QEMU Pathogen [143]: These techniques map the target guest's memory into the address space of a privileged monitoring or introspection process. They are used in libraries such as XenAccess [128] and LibVMI [24], and in cloud monitoring solutions such as IBMon [138] and RTKDSM [80].

• QEMU large-pages (hugetlbfs) based and VMWare .vmem paging file backed VM memory: This approach involves mapping a file that backs a VM's memory into the monitoring or introspection process' address space. This is used in OSck [81] for monitoring guest kernel code and data integrity, and by Opferman [165] to expose a guest's video buffer as a virtual screen.

• QEMU and VMWare host physical memory access: This method maps the machine pages backing the VM's memory into the introspection process' address space. It leverages Linux memory primitives for translating the virtual address range backing the VM's memory inside the QEMU or vmware-vmx process (via the /proc/pid/maps pseudo-file) to its corresponding physical addresses (/proc/pid/pagemap file). Achieving this functionality is not straightforward in Xen; although the administrator domain can access the host physical memory, hypervisor cooperation is still needed to extract the guest's backing physical frames list (physical-to-machine (P2M) table).
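On a Linux host, the host-physical-memory variant above can be assembled from standard kernel interfaces. The sketch below is a simplification under strong assumptions - root privileges, an unrestricted /dev/mem (no STRICT_DEVMEM), a known QEMU process id, the guest RAM identified as the QEMU mapping whose size equals the configured VM memory, and a read confined to a single page - and is meant only to show the maps -> pagemap -> /dev/mem chain.

    import mmap
    import os
    import struct

    PAGE_SIZE = 4096

    def ram_mapping_start(qemu_pid, ram_bytes):
        # Heuristic: the VM's RAM is the mapping in the QEMU container process
        # whose size matches the configured guest memory size.
        with open("/proc/%d/maps" % qemu_pid) as f:
            for line in f:
                start, end = (int(x, 16) for x in line.split()[0].split("-"))
                if end - start == ram_bytes:
                    return start
        raise RuntimeError("guest RAM mapping not found")

    def host_pfn(qemu_pid, host_vaddr):
        # /proc/<pid>/pagemap: one little-endian 64-bit entry per virtual page;
        # bit 63 = page present, bits 0-54 = physical frame number (PFN).
        with open("/proc/%d/pagemap" % qemu_pid, "rb") as f:
            f.seek((host_vaddr // PAGE_SIZE) * 8)
            entry = struct.unpack("<Q", f.read(8))[0]
        if not (entry >> 63) & 1:
            raise RuntimeError("page not resident in host memory")
        return entry & ((1 << 55) - 1)

    def read_guest_phys(qemu_pid, ram_bytes, guest_paddr, length):
        # guest physical -> host virtual (inside QEMU) -> host physical -> /dev/mem.
        # Assumes a one-to-one layout for low guest physical addresses and that
        # the requested bytes do not cross a page boundary.
        hva = ram_mapping_start(qemu_pid, ram_bytes) + guest_paddr
        frame = host_pfn(qemu_pid, hva)
        with open("/dev/mem", "rb") as f:
            m = mmap.mmap(f.fileno(), PAGE_SIZE, prot=mmap.PROT_READ,
                          offset=frame * PAGE_SIZE)
            try:
                off = hva % PAGE_SIZE
                return m[off:off + length]
            finally:
                m.close()

    # Example with hypothetical values: first 16 bytes of a 1 GB guest's memory.
    # print(read_guest_phys(12345, 1 << 30, 0x0, 16))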

V. Live Memory Reads

Methods in this class also enable live introspection without perturbing the guest, but access guest state through a file descriptor-based interface.

• QEMU and VMWare direct VM memory access: These methods directly read a guest's memory contents from within the container process that runs the VM. This can be achieved in different ways: (i) Injecting a DLL into the vmware-vmx.exe container process in VMWare to read its .vmem RAM file [110], (ii) Using QEMU's native memory access interface by running the introspection inside QEMU itself [19], (iii) Leveraging Linux memory primitives- reading the QEMU process' memory pages at the hypervisor (via the /proc/pid/mem pseudo-file) indexed appropriately by the virtual address space backing the VM memory (/proc/pid/maps). My NFM technique uses the last of these methods; a minimal sketch of this read path follows the class listing below.

• LibVMI memory transfer channel: This technique requests guest memory contents over a socket based communication channel created in a modified QEMU container process, which are served by QEMU’s native guest memory access interface.
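A minimal sketch of that third, /proc based read path follows, under the same stated assumptions as the previous sketch (known QEMU pid, RAM mapping identified by its size, sufficient privileges to read another process' memory); the VM keeps running while plain pread() calls pull bytes out of its address space.

    import os

    def ram_mapping_start(qemu_pid, ram_bytes):
        # Same heuristic as before: find the QEMU mapping that backs guest RAM.
        with open("/proc/%d/maps" % qemu_pid) as f:
            for line in f:
                start, end = (int(x, 16) for x in line.split()[0].split("-"))
                if end - start == ram_bytes:
                    return start
        raise RuntimeError("guest RAM mapping not found")

    def read_guest_memory(qemu_pid, ram_bytes, guest_paddr, length):
        # Live, non-halting read: index /proc/<pid>/mem by the host virtual
        # address corresponding to the requested guest physical address.
        base = ram_mapping_start(qemu_pid, ram_bytes)
        fd = os.open("/proc/%d/mem" % qemu_pid, os.O_RDONLY)
        try:
            return os.pread(fd, length, base + guest_paddr)
        finally:
            os.close(fd)

    # Example with hypothetical values: dump the guest's first 4 KB page.
    # page0 = read_guest_memory(12345, 1 << 30, 0x0, 4096)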

VI. Live Interface Access

Methods in this class access guest state over an interface provided by a third party program, while the guest continues to run undisturbed.

• QEMU monitor’s xp [182] functionality that uses the hypervisor management interface to extract raw bytes at specified (pseudo) physical addresses.

VII. Guest-Halting Memory Map and Reads

These methods achieve coherent/consistent access to the guest memory by halting the guest while introspection takes place (pause-and-introspect), but do not create a separate memory image and access guest memory contents directly. While all the live memory map and read methods can be included in this category by also additionally pausing the guest, I only select NFM's direct read method as a representative in this study. Another example is OSck's [81] memory access after guest quiescing, employed when its integrity checker fails due to data races with the kernel.

• QEMU semilive direct access, which encompasses the guest memory reads (QEMU /proc/pid/mem) between ptrace() attach/detach calls. NFM uses this approach for cloud monitoring under strict consistency constraints.
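A hedged sketch of this semilive pattern is below: it briefly stops every thread of the QEMU process (VCPU threads included) with PTRACE_ATTACH, performs the read through /proc/<pid>/mem, and detaches so the guest resumes. It assumes ptrace permission over the QEMU process, a Linux host (on older kernels the wait call may additionally need the __WALL flag for non-leader threads), and a host virtual address already resolved as in the earlier sketches.

    import ctypes
    import os

    PTRACE_ATTACH = 16
    PTRACE_DETACH = 17
    libc = ctypes.CDLL("libc.so.6", use_errno=True)

    def semilive_read(qemu_pid, host_vaddr, length):
        # Pause-and-introspect: stop all QEMU threads so the guest cannot mutate
        # its memory mid-read, read, then detach to let it run again.
        tids = [int(t) for t in os.listdir("/proc/%d/task" % qemu_pid)]
        for tid in tids:
            if libc.ptrace(PTRACE_ATTACH, tid, None, None) != 0:
                raise OSError(ctypes.get_errno(), "PTRACE_ATTACH failed for %d" % tid)
            os.waitpid(tid, 0)            # wait until the thread is actually stopped
        try:
            fd = os.open("/proc/%d/mem" % qemu_pid, os.O_RDONLY)
            try:
                return os.pread(fd, length, host_vaddr)
            finally:
                os.close(fd)
        finally:
            for tid in tids:
                libc.ptrace(PTRACE_DETACH, tid, None, None)

    # Example with hypothetical values:
    # data = semilive_read(12345, 0x7f1234500000, 4096)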

VIII. Guest-Halting Interface Access

Methods in this class halt the guest and access guest state over an interface provided by a third party program.

• Xen's gdbsx, VMWare's debugStub, QEMU's gdbserver GDB stub for the VM: This technique involves attaching a debugger to the guest VM and accessing guest state over the debugger's interface. It is used in IVP [150] for verifying system integrity. LibVMI, when used without its QEMU patch, defaults to using this technique to access guest state. I use LibVMI's GDB-access version in my comparative evaluation of VMI techniques.

Technique class | Live | View consistency | Speed | CPU cost | VM perf hit | Xen | KVM/QEMU | VMWare
Agent assisted access | X | X (not /dev/mem) | Med | Med | Low | VM /dev/mem support or module installation | VM /dev/mem support or module installation | Default; special drivers/tools in VM
Halt Snap | | X | Low | High | High | Default; in-mem snap via library | Default; in-mem snap via library + hypervisor modification | Default
Live Snap | X | X | Med | Low | Low | Hypervisor modifications | Hypervisor modifications |
Live Memory Mapping | X | | Very High | Very Low | Very Low | Default | Hypervisor modifications; default file-backed mapping with special VM flags and hugepage host reservation; /dev/mem support for host phys mem access | Via library; default file-backed mapping; /dev/mem support for host phys mem access
Live Memory Reads | X | | High | Low | Very Low | | Compatible (via /proc); mem transfer channel via library + hypervisor mod. | Via library
Guest-Halting Map & Reads | | X | Med | Low | Med | Compatible (+ guest pause) | Compatible (+ guest pause) | Compatible (+ guest pause)
Live Interface Access | X | | Very Low | Very High | Low | | Default (via management interface) |
Guest-Halting Interface Access | | X | Very Low | Very High | Low | Default | Default + special VM initialization flags | Default + special VM config options

Table 3.1: Qualitative comparison of VMI techniques- empty cells in the compatibility columns indicate functionality not advertised by the hypervisor, nor enabled by users.

3.2 Qualitative Comparison

The various VMI techniques described in the previous section follow different operation principles and correspondingly exhibit different properties. Table 3.1 compares them in terms of the following properties:

• Guest Liveness: Whether the target VM continues to make progress normally, without interruption, during memory acquisition and subsequent VM state extraction?

• Memory view consistency: Whether the runtime state exposed by the method remains coherent with the guest’s actual state (Section 3.4)?

• Speed: How quickly can guest state be extracted with a particular method?

• Resource consumption on host: How heavy is a particular approach in terms of the CPU resources consumed by it, normalized to monitoring 1 VM at 1Hz (memory and disk cost is negligible for all but snapshotting methods).

• VM performance impact: How heavily does memory acquisition and state extraction hit the target VM’s workload?

• Compatibility: How much effort does deploying a particular technique cost in terms of its host and hypervisor compatibility- whether available as stock functionality, or requiring hypervisor modifications, third party library installation, or host specialization.

Table 3.1 only contrasts these properties qualitatively, while a detailed quantitative comparison follows in the next section. The compatibility columns in the table do not indicate whether a functionality is available or missing from a hypervisor, but rather whether the functionality has been 'shown' to work- either being advertised as a default feature by the hypervisor, or enabled by users via libraries or hypervisor modifications. The table features such empty 'compatibility' fields for Xen and VMWare. In the case of Xen, it exports the high performing 'live memory mapping' method (xc_map_foreign_range()) by default, which is thus the most popular amongst users, with other techniques featuring scarcely in literature. Similarly, for VMWare, the 'agent-assisted' VMSafe() technique is the default choice, with perhaps its closed source nature limiting the exploration or usage of other methods. KVM/QEMU on the other hand has the highest coverage of VMI techniques, probably because it is still young and under active community development.

As can be seen, no one technique can satisfy all properties at the same time, leading to different tradeoffs for different use cases. One tradeoff is between the conflicting goals of view consistency and guest liveness for almost all techniques. If the user, however, desires both, then he would either have to let go of guest independence by opting for the guest cooperation methods that run inside the guest OS scope, or choose a hardware assisted out-of-band approach using transactional memory [109]. COW-based live snapshotting seems to be a good compromise, providing an almost-live and consistent snapshot.

Another tradeoff is between a VMI technique's performance and generality in terms of requirements imposed on the host's runtime. For example, the live direct-reads method in KVM is sufficiently fast for practical monitoring applications and works out-of-box, but an order of magnitude higher speed can be achieved with live memory-mapping techniques by either enabling physical memory access on the host, or reserving large pages in host memory for the file-backed method. The latter, however, comes with a tradeoff of increasing system vulnerability (/dev/mem security concerns) and memory pressure (swapping concerns [158, 123]).

3.3 Quantitative Comparison

To quantitatively compare VMI techniques, I use a simple generic use case of periodic monitoring. This entails extracting at regular intervals generic runtime system information from the VM's memory: CPU, OS, modules, N/W interfaces, process list, memory usage, open files, open network connections and per-process virtual memory to file mappings. This runtime information is distributed into several in-memory kernel data structures for processes (task_struct), memory mapping (mm_struct), open files (files_struct), and network information (net_device), among others. These struct templates are overlaid over the exposed memory, and then traversed to read the various structure fields holding the relevant information [111], thereby converting the byte-level exposed memory view into structured runtime VM state. This translates to reading around 700KB of volatile runtime state from the VM's memory, spread across nearly 100K read/seek calls.
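To make the struct-overlay step concrete, the hedged sketch below walks the kernel's circular task list inside an exposed memory image to recover the process list. Every constant in it is an assumption for a particular guest kernel build: the init_task address would come from the guest's System.map, the field offsets from its kernel headers or debug info, and the virtual-to-physical translation uses the classic x86-64 bases while ignoring KASLR and other corner cases; the image is assumed to be a flat guest-physical dump (e.g., pmemsave output).

    import struct

    # Placeholder values for a specific (hypothetical) guest kernel build.
    PAGE_OFFSET     = 0xffff880000000000   # x86-64 direct-map base
    KERNEL_MAP_BASE = 0xffffffff80000000   # base of the kernel text/data mapping
    INIT_TASK_VADDR = 0xffffffff81c12440   # &init_task from the guest's System.map
    OFF_TASKS       = 0x238                # offsetof(struct task_struct, tasks)
    OFF_PID         = 0x2f0                # offsetof(struct task_struct, pid)
    OFF_COMM        = 0x4e8                # offsetof(struct task_struct, comm)

    def v2p(vaddr):
        # Kernel virtual -> guest physical for the two common kernel mappings.
        if vaddr >= KERNEL_MAP_BASE:
            return vaddr - KERNEL_MAP_BASE
        return vaddr - PAGE_OFFSET

    class MemImage:
        """Random-access reads over a flat guest-physical memory image."""
        def __init__(self, path):
            self.f = open(path, "rb")
        def read(self, paddr, length):
            self.f.seek(paddr)
            return self.f.read(length)
        def read_u32(self, paddr):
            return struct.unpack("<I", self.read(paddr, 4))[0]
        def read_u64(self, paddr):
            return struct.unpack("<Q", self.read(paddr, 8))[0]

    def process_list(img):
        # Walk the circular list threaded through task_struct.tasks, starting at
        # init_task, applying container_of() arithmetic at each hop.
        procs = []
        head = INIT_TASK_VADDR + OFF_TASKS
        ptr = img.read_u64(v2p(head))                  # init_task.tasks.next
        while ptr != head:
            task = ptr - OFF_TASKS                     # back to the task_struct base
            pid = img.read_u32(v2p(task + OFF_PID))
            comm = img.read(v2p(task + OFF_COMM), 16).split(b"\0", 1)[0]
            procs.append((pid, comm.decode("ascii", "replace")))
            ptr = img.read_u64(v2p(ptr))               # follow list_head.next
        return procs

    # Usage over an image produced by any of the exposure techniques of this chapter:
    # for pid, comm in process_list(MemImage("/tmp/vm1.mem")): print(pid, comm)

The live mapping and read techniques compared below perform this same kind of traversal, only against the running VM's memory instead of a static image.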

I compare the different VMI techniques along the following dimensions:

1. Maximum frequency of monitoring
2. Resource usage cost on host
3. Overhead caused to the VM's workload

Different benchmarks are run inside the VM to measure monitoring's impact when different resource components are stressed - CPU, disk, memory, network and the entire system as a whole. The different targeted benchmarks, as well as full system benchmarks, tested are as follows.

1. x264 CPU Benchmark: Each run measures x264 video encoding benchmark’s [120] (v1.7.0) frames encoded per second.

2. Bonnie++ Disk Benchmark: I measure bonnie++'s [145] (v1.96) disk read and write throughputs as it processes 4GB of data sequentially. The high performance virtio disk driver is loaded in the VM, and disk caching at the hypervisor is disabled so that true disk throughputs can be measured, which are verified by running iostat and iotop on the host. Host and VM caches are flushed across each of the 5 bonnie++ runs.

3. STREAM Memory Benchmark: I measure STREAM benchmark's [90] (v5.10) in-memory data copy throughput. I modified the STREAM code to also emit the 'average' sustained throughput across all the STREAM iterations (N=2500), along with the default 'best' throughput. The array size is chosen to be the default 10M elements in accordance with STREAM's guidelines of array size vs. cache memory size on the system. The memory throughputs observed inside the VM are additionally confirmed to be similar to when STREAM is run on the host.

4. Netperf Network Benchmark: I measure the network bandwidth when a netperf [142] server (v2.5.0) runs inside the VM, while another physical machine is used to drive TCP data transfer sessions (N=6). The high performance virtio network driver is loaded in the VM, and the network throughput recorded by the client is confirmed to be similar to when the netperf server runs on the host machine itself.

5. Full System OLTP Benchmark: Each run measures Sysbench OLTP database benchmark’s [9] (v0.4.12) throughput (transactions per second) and response time. The benchmark is configured to fire off 50K database transactions, which includes a mix of read and write queries, on a 1M row InnoDB table. Optimal values are ensured for InnoDB’s service thread count, cache size and concurrency handling, with the in-VM performance verified to be similar to on-host.

6. Full System Httperf Benchmark: I measure the incoming request rate that a webserver VM can service without any connection drops, as well as its average and 95th percentile response latency. A 512MB working set workload is setup in a webserver VM, from which it serves different 2KB random content files to 3 different httperf clients (v0.9.0) running on 3 separate machines. The file size is chosen to be 2KB so that the server is not network bound.

Experimental Setup: The host is an 8 core Intel Xeon E5472 @ 3GHz machine, with 16GB memory and Intel Vt-x support. The software stack includes Linux-3.8 host OS with KVM support, Linux 3.2 guest OS, libvirt 1.0.4, QEMU 1.6.2, and libvmi-master commit-b01b349 (for in-memory snapshot). In all experiments except the memory benchmark, the target VM has 1GB of memory and 1 VCPU. Larger memory impacts snapshotting techniques linearly, without any noticeable impact on other tech- niques as they are agnostic to VM size. Also, more VCPUs do not affect VMI performance much, except for generating some extra CPU-specific state in the guest OS that also becomes a candidate for state extraction. 1 VCPU is selected so as to minimize any CPU which could mask the impact of the Chapter 3. Exploring VM Introspection: Techniques and Trade-offs 17

Figure 3.2: Comparing maximum monitoring frequency across all KVM instances of VMI techniques

VMI techniques on the VM’s workload. However, in case of the memory benchmark, a multicore VM was necessary as the memory bandwidth was observed to increase with the number of cores, indicating a CPU bottleneck, with the peak bandwidth being recorded on employing 4 cores (almost twice as much as on 1 core; going beyond 4 had no further improvement).

Limitations: (i) Live snapshotting is not included in the quantitative evaluation because of the unavailability of a standalone implementation (patch or library) for the KVM testbed; its qualitative performance measures are borrowed from [82]. Live snapshotting is expected to have much better performance than its guest-halting counterpart, as indicated in Table 1. Quantitatively, while monitoring the target VM, live snapshotting is expected to achieve ∼5Hz of monitoring frequency, with about 10% CPU consumption on the host and a <13% hit on the VM's workload [82]. (ii) Guest cooperation methods are not explicitly compared in the remainder of this section. This is because the default qemu-ga guest agent implementation on KVM/QEMU is quite limited in its functionality. The absence of a dynamic exec capability in the agent means the generic monitoring process on the host has to read all relevant guest /proc/* files to extract logical OS-level state [7], which takes about 300ms per transfer over the agent's serial channel interface. This translates to a maximum monitoring frequency on the order of 0.01Hz, with <1% CPU consumption on host and guest. A better way would be for a custom agent to do the state extraction processing in-band and only transfer the relevant bits over to the host, along the lines of the oVirt guest agent [124]. Emulating this with the qemu agent, to extract the 700KB of generic VM runtime state, results in a maximum monitoring frequency on the order of 1Hz, with about 50% CPU consumption on the host and a 4.5% hit on the VM workload.

3.3.1 Maximum Monitoring Frequency

Figure 3.2 compares the maximum attainable frequency at which an idle VM can be monitored while employing the different VMI techniques. The monitoring frequency is calculated from the average running time of 1000 monitoring iterations. I use an optimized version of LibVMI in this study that skips per-iteration initialization/exit cycles. Disabling this optimization would add over 150ms of latency per iteration, thereby lowering the maximum monitoring frequency, most noticeably for the live memory transfer channel implementation.
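To make the measurement concrete, the sketch below shows one way to derive the maximum monitoring frequency from back-to-back iteration timings. It is a minimal sketch: crawl_once() is a hypothetical stand-in for a single introspection pass, not the actual monitoring code used in these experiments.

/* Sketch: derive the maximum monitoring frequency from the average latency
 * of N back-to-back introspection iterations. crawl_once() is a stub that
 * stands in for one full state-extraction pass over the target VM. */
#include <stdio.h>
#include <time.h>

#define ITERS 1000

static void crawl_once(void)
{
    /* placeholder: a real iteration would read the target VM's kernel
     * data structures here; pretend it takes ~2 ms */
    struct timespec t = { 0, 2 * 1000 * 1000 };
    nanosleep(&t, NULL);
}

int main(void)
{
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < ITERS; i++)
        crawl_once();
    clock_gettime(CLOCK_MONOTONIC, &end);

    double elapsed = (end.tv_sec - start.tv_sec) +
                     (end.tv_nsec - start.tv_nsec) / 1e9;
    double avg = elapsed / ITERS;            /* seconds per iteration */
    printf("avg latency: %.3f ms, max monitoring frequency: %.1f Hz\n",
           avg * 1e3, 1.0 / avg);
    return 0;
}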

Figure 3.3: CPU used vs. maximum monitoring frequency

Interestingly, when sorting the methods in increasing order of their maximum monitoring frequency, each pair of methods shows similar performance, which almost always jumps by an order of magnitude across the pairs. This is because the candidates per pair belong to the same taxonomy category and hence follow similar operational principles, except for interface-based methods where the frequency is limited by the interface latency. Amongst the live memory access methods, mapping is much superior to direct reads, primarily because of greater overheads in the latter (multiple read()/seek() calls vs. a single mmap()). The next best is guest-halting direct reads, which stun the VM periodically for a few milliseconds, while still being much faster than guest-halting snapshotting methods that halt the VM for a few seconds. Finally, the methods interfacing with the management layer and GDB are the slowest because of yet another layer of indirection. The maximum monitoring frequencies can vary with the workload inside the VM: depending upon how active the VM is, the amount of runtime state that exists in the VM changes, and with it the time required to extract this state. This can easily be observed in the maximum frequencies recorded with httperf in Section 3.3.4, which decrease by 4X due to a proportional increase in runtime state.
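The system-call difference behind this gap can be illustrated with a small sketch. It assumes, purely for illustration, that guest memory is reachable both as a byte-addressable pseudo-file that must be accessed with lseek()/read() pairs (in the spirit of the direct-read methods) and as a regular host file backing the VM's RAM that can be mapped once with mmap() (in the spirit of the file-backed mapping method); the paths and field offsets below are made up.

/* Sketch: per-field direct reads vs. a single memory mapping.
 * read_path: a byte-addressable view of guest memory (two syscalls/field).
 * map_path : a host file backing the VM's RAM (one mmap(), then pointers). */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define GUEST_RAM_SIZE (1UL << 30)   /* 1GB VM in the testbed; a robust
                                        version would fstat() the file */

static const off_t field_off[] = { 0x1000, 0x2230, 0x5a48 };  /* made up */

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <read_path> <map_path>\n", argv[0]);
        return 1;
    }
    uint64_t val;
    size_t nfields = sizeof(field_off) / sizeof(field_off[0]);

    /* Direct reads: one lseek()+read() pair per field. */
    int rfd = open(argv[1], O_RDONLY);
    for (size_t i = 0; i < nfields; i++) {
        if (lseek(rfd, field_off[i], SEEK_SET) < 0 ||
            read(rfd, &val, sizeof(val)) != sizeof(val))
            perror("direct read");            /* treated as a failed access */
        else
            printf("read  field@%#lx = %#lx\n",
                   (unsigned long)field_off[i], (unsigned long)val);
    }
    close(rfd);

    /* Memory mapping: a single mmap(), then plain pointer arithmetic. */
    int mfd = open(argv[2], O_RDONLY);
    uint8_t *ram = mmap(NULL, GUEST_RAM_SIZE, PROT_READ, MAP_SHARED, mfd, 0);
    if (ram != MAP_FAILED) {
        for (size_t i = 0; i < nfields; i++) {
            memcpy(&val, ram + field_off[i], sizeof(val));
            printf("mmap  field@%#lx = %#lx\n",
                   (unsigned long)field_off[i], (unsigned long)val);
        }
        munmap(ram, GUEST_RAM_SIZE);
    }
    close(mfd);
    return 0;
}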

3.3.2 Resource Cost on Host

Monitoring with almost all methods has a negligible space footprint, except for snapshotting techniques that consume space, on disk or in memory, equivalent to the VM's size. As for the CPU cost, Figure 3.3 plots the CPU resource usage on the host while an idle VM is monitored at the maximum frequency afforded by each technique. The CPU cost includes the CPU usage by both the monitoring process itself and the QEMU process running the VM, in cases where a method requires QEMU assistance (this explains why the live interface access method has a >100% CPU usage). The graph shows the same pairwise grouping of the methods as in the case of their maximum monitoring frequency. The exception here is that the live interface access method is much heavier than the guest-halting variety, although both deliver the same monitoring frequency. The previous frequency comparison graph (Figure 3.2) showed that the live memory mapping methods were an order of magnitude faster than live direct reads, which were themselves faster than guest-halting reads and snapshotting methods. This graph shows that the better performance does not come at an added cost, as all of these except halting-reads consume similar CPU resources. However, with the same CPU consumption, the methods situated further along the increasing X axis are more efficient in terms of normalized CPU cost per Hz. Hence, amongst the live methods having the same CPU consumption, the higher efficiency of guest memory mapping can be easily observed. Also, the lower CPU usage for the halting-reads method is misleading, as the graph does not plot the impact on the VM with each technique. So even though halting-reads can hit the 10Hz frequency while consuming <40% CPU, compared to live reads that consume 100% CPU for about 30Hz, it is still costlier because it stuns the VM periodically, thereby disturbing its workload heavily. The next section quantifies this impact.

3.3.3 Impact on VM’s Performance

Targeted workloads are run inside the VM stressing different resource components, and the percentage overhead on their corresponding performance metrics is measured. VM impact is measured from the lowest monitoring frequency of 0.01Hz, increasing in orders of 10 up to 10Hz or the maximum attainable frequency for each VMI technique. Each benchmark is run enough times (exact count reported within each corresponding subsection) to ensure sufficient monitoring iterations are performed for each method at each frequency. The graphs only plot the mean values while the error bars are omitted for readability (the experimental variation was within 5% of the means). The guest-reported benchmark performance metrics are used, after having ensured that (i) the VM's clock does not get skewed throughout the benchmarks' progress while being monitored with the different techniques, and that (ii) the benchmarks' metrics are similar to when run directly on the host. In the experiments, the VMI-based monitoring application runs on a separate host core, while I experiment with two different configurations mapping the VM's VCPU to the host's PCPU (see footnote 1). In the first 1VCPU-1PCPU configuration, I pin to a single host core the QEMU process that runs the main VM thread and all other helper threads that get spawned to serve the monitoring process's memory access requests. In the second 1VCPU-2PCPU configuration, I taskset the QEMU process to two host cores, the VM still having only one virtual core to itself. This is done to visualize the kind of overheads that would be seen if each technique were given unbounded CPU resources (a single extra core suffices; going beyond this has no additional effect).

(a) CPU Benchmark

Figure 3.4(a) plots the percentage degradation of x264's [120] frames encoded per second as a function of monitoring frequency for each technique. The rightmost points of each curve show the overheads at the maximum attainable monitoring frequency for that method. Each data point is obtained by averaging 10 x264 runs. As can be seen, there is minimal overhead on x264's framerate with the live methods (except the libVMI memory transfer channel implementation), while for the rest, the overhead decreases with decreasing monitoring frequency. The biggest hit is observed for methods that quiesce the VM, as expected. Figure 3.4(b) compares x264's performance degradation when each technique is given unbounded CPU resources in the 1VCPU-2PCPU taskset configuration.

Footnote 1: The hardware architecture influences the introspection application's as well as the VM's VCPU-PCPU core mapping. The chosen configuration ensures the least impact on the VM due to introspection.

[Figure 3.4 plots % degradation on the VM's workload (y axis) against monitoring frequency in Hz (x axis); panel (a) 1VCPU-1PCPU, panel (b) 1VCPU-2PCPU. Legend: Halt Snap | Memory dump; Halt Snap | In-memory snapshot; Guest-halting interface access; Live interface access; Guest-halting reads | SemiLive; Live reads | Memory xfer channel; Live reads | Direct memory access; Live map | Host phys mem access; Live map | File backed VM mem.]

Figure 3.4: Comparing % degradation on x264 benchmark’s frames-encoded/s as a function of monitoring frequency.

As a result, the VM overhead is greatly reduced for methods that spawn QEMU threads to extract VM state, as the main QEMU thread servicing the VM no longer has to contend for CPU with the helper threads spawned to serve the monitoring process's memory access requests. The halting-reads method, which wasn't using a full CPU to begin with, has no use for the extra CPU resources, and thus its VM overhead remains the same owing to the periodic VM stuns.

This is the only case where the performance of all candidate techniques is compared. The main focus is actually on how the categories themselves compare in terms of performance degradation of the target VM's workload. Hereafter, I only present results for one representative method from each category, namely memory dumps (guest-halting snapshotting), management interface (interface access), semilive direct access (halting-reads), QEMU direct memory access (live memory reads), and file-backed VM memory (live memory map). Although not explicitly shown, the omitted methods follow performance trends similar to their sibling candidates from the same taxonomy category. The interface access methods are also observed to exhibit similar performance.

(b) Memory, Disk and Network Benchmarks

Figure 3.5 plots the impact on the VM's memory, disk and network throughputs owing to VMI-based monitoring (averaged across 2500 STREAM iterations, 5 bonnie++ runs and 6 netperf runs respectively). Impact is mostly observed only for the methods that quiesce the VM, and does not improve markedly when extra CPU resources (1VCPU-2PCPU mapping) are provided to the techniques. This is because the CPU is not the bottleneck here, with the workloads either being limited by the memory subsystem, or bounded by network or disk IO. The degradation of the STREAM [90] benchmark's default 'best' (of all iterations) memory throughput was negligible even while monitoring with methods that quiesce the VM. However, the techniques' true impact can be seen in Figure 3.5(a), which compares the percentage degradation of STREAM's 'average' (across all iterations) memory throughput. In other words, the impact is only observed on the sustained bandwidth and not the instantaneous throughput. For the impact on bonnie++ [145] disk throughputs, separate curves for disk writes are only shown for the VM quiescing methods (Figure 3.5(b)), the rest being identical to those of reads, with the main noticeable difference being the minimal impact seen on the write throughput even with the halting-reads method. This can be attributed to the fact that the VM's CPU is not being utilized at its full capacity and spends a lot of time waiting for the disk to serve the write requests made by bonnie++. Hence, minor VM stunning doesn't hurt the benchmark so badly, as the work gets delegated to the disk. This, along with writeback caching in the kernel, also means that the worst-case per-block write latency (not shown in the graphs) does not see a big hit even for methods that quiesce the VM, while their worst-case read latency is an order of magnitude higher. Another interesting observation is the markedly higher impact on the disk throughputs with memory dumping, as compared to the CPU intensive benchmark, which moreover shows no improvement even when the monitoring frequency is reduced from 0.1Hz to 0.01Hz. Netperf's [142] network bandwidth also sees a similar hit with guest-halting snapshotting (Figure 3.5(c)), with its impact curves being very similar to those of the disk (read) throughputs. The difference in this case is that the overhead curve does not plateau out and eventually subsides to minimal impact at 0.01Hz. As demonstrated later in Section 3.3.4, these high overheads can be attributed to the backlog of pending IO requests that dumping (and hence VM quiescing) creates in the network and disk IO queues.

3.3.4 Real Workload Results

After characterizing VMI-based monitoring's impact on individual VM resources, I use two full system benchmarks to understand the impact on real world deployments - a database and a webserver. For these workloads, making extra resources available for monitoring does not have a marked effect (except for interface-based methods that require assistance from the QEMU process running the VM). This is due to the CPU slack that exists as the CPU waits for disk file fetches in case of the webserver, and for transactions to commit in order to maintain ACID compliance in the database workload. The results shown below are for the 1VCPU-2PCPU configuration, giving full resources to all techniques.

(a) Full System OLTP Benchmark

Figure 3.6 compares the percentage degradation of Sysbench OLTP's transaction throughput across the different VMI techniques as a function of their monitoring frequency. Each data point is obtained by averaging 6 runs, with the VM and host caches flushed and the database table recreated before each run.

(a) Impact on STREAM benchmark’s memory copy throughput

(b) Impact on bonnie++’s disk throughputs. Differing behavior on writes shown separately.

(c) Impact on netperf’s network transfer bandwidth

Figure 3.5: Comparing % degradation on memory, disk and network throughput as a function of monitoring frequency

[Figure 3.6 plots % degradation on the VM's workload (y axis) against monitoring frequency in Hz (x axis). Legend: Guest-halting snapshots; Interface access; Guest-halting reads; Live memory reads; Live memory mapping.]

Figure 3.6: Comparing % degradation on Sysbench OLTP benchmark's transactions/s as a function of monitoring frequency

The graph for transaction response time is similar and is omitted for brevity. The curves are nearly identical to those of the disk read benchmark, with the impact attributed to the backlogging in the database transaction queues. The next section presents a deeper inspection to understand these queue perturbations.

(b) Full System Httperf Benchmark

Figure 3.7(a) plots the impact on the webserver VM's sustainable request rate, as compared to the base case without any monitoring, for the different VMI techniques under different monitoring frequencies. Each data point in the graph is obtained by averaging 3 httperf [117] runs, each run lasting 320s. Amongst all the benchmarks, httperf is hit the worst by methods that quiesce the VM, even at low monitoring frequency, with the halting-reads method recording a ∼25% impact even at 1Hz. With memory dumping, as in the case of the disk and OLTP benchmarks, the impact on the sustainable request rate is not lowered even with extra CPU resources afforded to the QEMU process, or when the monitoring frequency is reduced from 0.1Hz to 0.01Hz. I explain this with an experiment later in this section. Also note the much lower maximum monitoring frequencies recorded for the different techniques as they monitor the httperf workload. The longer monitoring cycles arise because the amount of state extracted is far more than for the other benchmarks (∼4X), owing to the several apache processes running inside the VM. This also prevents the interface-based approaches from operating even at 0.01Hz, while the halting-reads method is unable to operate at its usual 10Hz (iteration runtime ∼150ms). The sustainable request rate is only one half of the story. Figure 3.7(b) also plots the impact on the webserver VM's average and 95th percentile response latencies. Shown are the overheads for the practical monitoring frequency of 0.1Hz for techniques that quiesce the VM, and for the maximum attainable monitoring frequencies for the other live methods. As can be seen, even if a user were willing to operate the webserver at 75% of its peak capacity, while snapshotting once every 10s for view-consistent introspection, they should be aware that the response times would shoot up by 100% on average, going beyond 200% in the worst case. The particular requests experiencing these massive degradations can be spotted in a server timeline graph (not shown), where the response times jump markedly for about 50s after a single dumping iteration (<2s).

(a) Impact on httperf’s sustainable request rate

(b) Impact on httperf’s response times

(c) httperf 1950 req/s + 1 round of memory dumping

Figure 3.7: Comparing % degradation on httperf's metrics as a function of monitoring frequency

Finally, I investigate why guest-halting snapshotting shows a horizontal impact curve from 0.1Hz to 0.01Hz in Figure 3.7(a), instead of a reduced impact on the server's capacity when snapshotting 10 times less frequently. As discussed above, when the webserver operates at 75% of its peak capacity (serving 1500 requests/s as opposed to 1950), the jump in response times eventually subsides after a single snapshotting iteration, and no requests are dropped. If the requests arrive at any rate greater than this, it is observed that a single <2s dumping cycle degrades the server capacity to ∼700 serviced requests/s, with several connection drops, and the server doesn't recover even after 15 minutes. Figure 3.7(c) visualizes this observation for 5 httperf rounds of 320s each, plotting the (i) server capacity (reply rate), (ii) average response time per request, and (iii) percentage of connections dropped (Error %). In the end, the server has to be 'refreshed' with an apache process restart, clearing up all the wait queues, to bring it back up to its base capacity. Because the server is operating at its peak capacity in this case, the wait queues are in a delicate balance with the incoming request rate. Any perturbation or further queuing introduced by a single VM quiescing cycle destroys this balance, creating a backlog of pending requests from which the webserver never seems to recover. The behavior is the same for any incoming rate >1500 requests/s. This is why even at a 0.01Hz monitoring frequency, the server can only handle 1500 requests/s at best. Note that the measurements are made from the clients' side in httperf, so the requests from a new round also get queued up behind the pending requests from an earlier round. Hence, from the clients' perspective, a possible eventual server capacity recovery is not observed without a complete server queue flush.

3.4 Consistency of VM State

A key concern with VMI techniques is the consistency of the observed VM state. In particular, introspecting a live system while its state is changing may lead to inconsistencies in the observed data structures (while the OS itself is not inconsistent, the observed inconsistencies arise because of the missing OS context within the VMI scope). An inconsistency during introspection may cause the monitoring process to fail while trying to access and interpret non-existent or malformed data. A common approach to mitigate inconsistencies is to pause/quiesce the system during introspection (the halting-reads method); the terms pause/quiesce/halt here all refer to the same guest state, an OS-agnostic pause of the VM process at the hypervisor, not to be confused with the possibly different interpretations from the point of view of the guest OS. This is considered a safe approach as the system does not alter its state while the data structures are interpreted [71, 109, 131, 75, 24]. It is therefore commonly employed for "safe" introspection despite its high overheads, as shown in the prior sections. In this section I present a deeper exploration of what these inconsistencies are, their likelihood, and when pause-and-introspect (PAI) solutions help. My investigation leads to some interesting key observations. First, I show that there are multiple forms of inconsistencies, both intrinsic to the VM state and extrinsic due to live introspection. Second, contrary to common expectation, PAI does not mitigate all forms of inconsistency.

3.4.1 Inconsistency Types

I use the same introspection process from Section 3.3, which accesses the various kernel data structure fields inside a target VM's memory to extract runtime system state from it. I capture inconsistencies by recording the read() or seek() failures in the introspection process, while the VM being monitored runs workloads (Section 3.4.2) that continuously alter system state. Each of these failures denotes an access to a malformed or non-existent data structure, such as incorrect, NULL or garbage field values, or seek offsets (see footnote 4). My usage of the term 'inconsistent' refers to observing conflicting views of a system during introspection. This can happen either because the system state changed during the same introspection cycle, or because different data structures reflect different views of the system. For instance, an introspection process starts at t1 and reads a data structure at t2 which points to a memory reference Mem[A]. Then, at t3, the system state changes and Mem[A] is invalidated. When the introspection process subsequently tries to access Mem[A] at t4, it encounters an inconsistency error (t1 < t2 < t3 < t4). Another form of inconsistency can exist within a single point-in-time snapshot itself (say just at t1), even when nothing changes in the system. I refer to these two cases as extrinsic and intrinsic inconsistencies, respectively. By tracing back the root cause of these failures, I was able to categorize every inconsistency observed as shown below. I further verified the exact causes of each inconsistency occurrence by running Crash [51] on a captured memory snapshot of the paused VM under inconsistency.
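The sketch below illustrates, with stubbed-out accessors, how such a failed pointer chase is recorded as an extrinsic inconsistency; read_guest() and all addresses are hypothetical, and the real introspection process performs this bookkeeping over the live memory handle.

/* Sketch: classify a failed pointer chase as an extrinsic inconsistency.
 * read_guest() is a stub that "fails" for one address to mimic state that
 * was recycled between t2 (pointer read) and t4 (dereference). */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static unsigned long inconsistencies;

/* Stub accessor: pretend address 0xdead000 was freed by the guest OS. */
static int read_guest(uint64_t addr, void *buf, size_t len)
{
    if (addr == 0xdead000)
        return -1;                      /* unreadable: object recycled */
    memset(buf, 0xab, len);             /* fake, but readable, bytes */
    return 0;
}

/* Chase a pointer fetched earlier in the same iteration (t2 -> t4). */
static void chase(uint64_t parent_addr)
{
    uint64_t child_addr;
    uint8_t child[64];

    if (read_guest(parent_addr, &child_addr, sizeof(child_addr)) < 0 ||
        child_addr == 0)
        goto bad;                        /* NULL/garbage parent field */

    child_addr = 0xdead000;              /* simulate t3: state invalidated */
    if (read_guest(child_addr, child, sizeof(child)) < 0)
        goto bad;                        /* stale reference (type II.A/II.B) */
    return;

bad:
    inconsistencies++;                   /* count it, skip the object */
}

int main(void)
{
    chase(0x1000);
    printf("inconsistencies this iteration: %lu\n", inconsistencies);
    return 0;
}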

I. Intrinsic Inconsistencies
This category of inconsistencies occurs due to different but related OS data structures being in inconsistent states themselves, for a short period, within the OS, and not because of live introspection. These inconsistencies persist even if PAI techniques are employed instead of live introspection. I subcategorize these into the following types:

I.A Zombie Tasks: For tasks marked as dead but not yet reaped by the parent, only certain basic task_struct fields are readable. Others, such as memory mapping information, open files and network connections, lead to inconsistency errors when accessed.

I.B Dying Tasks: For tasks that are in the process of dying but are not dead yet (marked "exiting" in their task_struct), their memory state might already be reclaimed by the OS. Therefore, although their state appears to still be available, accessing it can lead to NULL or incorrect values being read by the monitoring process.

I.C As-good-as-dead Tasks: These are a subset of 'dying tasks' (type I.B) that also have a NULL memory-info data structure (mm_struct), which means that not only are such a task's memory mappings unavailable, but any attempt to extract its open files / network connections list is also highly likely to fail.

I.D Fresh Tasks: For newly-created processes, not all of their data structures are initialized instantaneously. Therefore, accessing the fields of a fresh process may lead to transient read() errors, where the addresses read may be NULL or point to incorrect locations.

II. Extrinsic Inconsistencies
This second category of inconsistencies occurs during live introspection, and only these can be mitigated by PAI techniques. The reason for these inconsistencies is VM state changing during introspection. I subcategorize these into the following types:

Footnote 4: This methodology can treat an incorrect access as being legitimate, e.g., accessing a recycled memory location that now belongs to another object. These can be properly flagged by comparing multiple successively extracted views.

[Figure 3.8 plots the observed inconsistency rate (y axis, 0-2% scale) per subtype, grouped into Categories I and II for the cork and httperf workloads. Type I.A dominates, at 34.63% for httperf and 1.55% for cork; the remaining subtypes stay below about 1.3%.]

Figure 3.8: Observed inconsistency probabilities for all categories.

II.A Task Dies During Monitoring: For tasks that die while their data structures are being interpreted, data fields and addresses read after the task state is recycled lead to read()/seek() errors.

II.B Attributes Change During Monitoring: In this case, while the tasks themselves remain alive, attributes that point to other data structures might change, such as open files, sockets or network connections. Accessing these data structures based on stale or invalid memory references then leads to inconsistency errors.
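As an illustration of how the intrinsic categories above surface during a crawl, the sketch below screens an introspected task before its mm/files pointers are followed. The field names and flag values mirror common Linux conventions (EXIT_ZOMBIE, PF_EXITING) but are stated assumptions here; in practice they are taken from the target guest kernel's headers.

/* Sketch: screen a crawled task for intrinsic inconsistency risk (I.A-I.D)
 * before chasing its mm/files pointers. The field values are assumed to
 * have been copied out of the guest's task_struct by the crawler. */
#include <stdint.h>
#include <stdio.h>

#define EXIT_ZOMBIE 0x0020        /* assumed: mirrors include/linux/sched.h */
#define PF_EXITING  0x00000004

struct crawled_task {             /* subset of task_struct fields copied out */
    int      pid;
    uint64_t exit_state;          /* zombie/dead markers */
    uint64_t flags;               /* PF_* flags */
    uint64_t mm;                  /* guest address of mm_struct, 0 if NULL */
};

/* Decide how much of this task is safe to introspect further. */
static const char *classify(const struct crawled_task *t)
{
    if (t->exit_state & EXIT_ZOMBIE)
        return "I.A zombie: basic fields only, skip mm/files/net";
    if ((t->flags & PF_EXITING) && t->mm == 0)
        return "I.C as-good-as-dead: skip everything but basic fields";
    if (t->flags & PF_EXITING)
        return "I.B dying: reads may return NULL/garbage, retry next frame";
    if (t->mm == 0)
        return "kernel thread or I.D fresh task: no user mappings yet";
    return "live task: safe to walk mm_struct and files_struct";
}

int main(void)
{
    struct crawled_task t = { .pid = 4242, .exit_state = EXIT_ZOMBIE };
    printf("pid %d -> %s\n", t.pid, classify(&t));
    return 0;
}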

3.4.2 Quantitative Evaluation

I first create a benchmark, cork, that rapidly changes system state by forking and destroying processes at various rates, and use it with a process creation rate of 10Hz and a process lifetime of 1s. The occurrence probabilities of inconsistencies are quantified with two workloads: (i) the simple cork benchmark, which stresses the process create/delete dimension; and (ii) a webserver at its peak capacity serving incoming HTTP requests from three separate httperf clients for 2^18 different 2KB files, which stresses both the process and file/socket open/close dimensions. Figure 3.8 shows the observed probabilities for all the different inconsistency types for both benchmarks. These probabilities are computed from 3 separate runs, each of which repeats 10,000 introspection iterations (based on the live direct memory read approach) while the benchmarks are executed. The observed results are independent of the introspection frequency. As the figure shows, most inconsistencies are rather rare events (except for one corner case with httperf), and the majority of those observed fall into Category I. While not shown here, when the same experiments are performed with the halting-reads approach, all dynamic-state extrinsic inconsistencies of Category II disappear, while the Category I results remain similar. The quantitative evaluation shows some interesting trends. First, we see that Category II events are rather rare (less than 1%) even for these worst-case benchmarks. Therefore, for most cases PAI techniques produce limited return on investment for consistency. If strong consistency is desired regardless of cost, then PAI approaches do eliminate these dynamic-state inconsistencies. The cost of this can be up to a 35% degradation of the VM's workload with a guest-halting direct-reads approach for 10Hz monitoring, and 4% for 1Hz monitoring. Cork records more type II.A inconsistencies, whereas the webserver workload exhibits more of type II.B. This is because of the continuous closing and opening of files and sockets while serving requests in the webserver case. Both of these, however, occur infrequently, in only 0.4% of the iterations. Cork also exhibits type I.C and I.D inconsistencies for freshly created and removed tasks, as the OS context itself becomes temporarily inconsistent while updating task structures. One unexpected outcome of this evaluation is the very high rate of type I.A inconsistencies with the webserver, which also have a significant occurrence in cork. The amount of time state is kept around for zombie tasks varies with both system configuration and load, and can lead to substantial VMI errors (type I.A inconsistencies), as seen with the webserver. Zombies stay around until the parent process reads the child's exit status. If the parent process dies before doing so, then the system's init process periodically reaps the zombies. Under high load, the webserver forks several apache worker processes and takes a while before reaping them, leaving them in the zombie state for longer durations.
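For reference, a minimal sketch of a cork-like churn generator is shown below, using the fork rate and process lifetime quoted above; it is illustrative only, not the actual cork implementation.

/* Sketch of a cork-like churn generator: fork a short-lived child at a
 * fixed rate so that process create/exit events race with introspection.
 * Defaults match the configuration in the text: 10 forks/s, 1 s lifetime. */
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rate_hz  = (argc > 1) ? atoi(argv[1]) : 10;  /* forks per second */
    int lifetime = (argc > 2) ? atoi(argv[2]) : 1;   /* child lifetime (s) */
    if (rate_hz <= 0)
        rate_hz = 10;

    for (;;) {
        pid_t pid = fork();
        if (pid == 0) {                 /* child: live briefly, then exit */
            sleep(lifetime);
            _exit(0);
        } else if (pid > 0) {
            /* reap any children that have already exited (non-blocking),
             * leaving the rest around as short-lived zombies */
            while (waitpid(-1, NULL, WNOHANG) > 0)
                ;
        }
        usleep(1000000 / rate_hz);      /* pace the churn */
    }
    return 0;
}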

3.5 Observations and Recommendations

In this section, I summarize my observations and present suggestions to VMI users in selecting the technique best suited to their requirements and constraints.

• Broad Spectrum of Choices
There are several available VMI alternatives operating on different principles, ranging from dumping to memory-mapping. Their performance varies widely along several dimensions, such as speed, resource consumption, overhead on the VM's workload, view consistency, and more. These methods may be available out-of-box on different hypervisors or be enabled by third party libraries or hypervisor modifications, giving the user a choice between easy deployability and hypervisor specialization.

• Guest Cooperation vs. Out-of-band
If the user has sufficient resources allocated to their VMs, and installing in-VM components is acceptable, then guest cooperation is a great way of bridging VMI's semantic gap. If this isn't acceptable, or if the security and potential inaccuracy of in-VM entities is an additional concern, then the out-of-VM methods are a good alternative. The latter also help against vendor lock-in, if the user prefers uninterrupted functionality with VM mobility across hypervisors, without specializing their VMs for a particular hypervisor.

• VMI use-case
Some techniques are more suitable in certain scenarios. For example, high speed live methods are best for high frequency realtime monitoring such as process level resource monitoring, continuous validation, or best effort security monitoring. On the other hand, snapshotting techniques are useful when all that is needed is an (infrequent) point-in-time snapshot, as in a digital forensics investigation. For infrequent peeking into guest memory, a simple management interface access would suffice, while for guest debugging or crash troubleshooting, the guest-halting GDB-access interface would be the most suitable, freezing and inspecting the guest in its inconsistent state without any regard to performance or overhead. Where strict view consistency is desired within acceptable overhead, guest-halting memory mapping/reads would work well, such as for low frequency security scanning and compliance audits. Low frequency monitoring offers a lot more flexibility in terms of the choice of technique, except when the workloads are bound by specific resources, as discussed next.

• VM Workload
Along with the intended VMI use-case, the target VM's workload can also influence the choice of introspection technique. If the user's workload is not bound by a particular VM resource, then there is more flexibility in selecting the introspection technique as well as its speed (frequency), even for the techniques that quiesce the VM. Even if the workload is CPU-intensive or memory bound, it can still tolerate guest-halting snapshotting better than if it were IO bound (disk / network / transactions), because the latter is more sensitive to perturbation of the service queues, in which case snapshotting can be heavy even at very low monitoring frequencies. On the other hand, IO bound workloads can tolerate the lighter stuns of the guest-halting direct-reads method better than CPU intensive workloads, because the work gets handed off to other components while the CPU halts temporarily. But the halting-reads method's execution length, and hence the VM stun duration, depends on the amount of state to be extracted. So it might not be a good fit for an active VM with rapidly changing state (see the rapidly spawning apache processes in the httperf evaluation in Section 3.3.4), or for an application that accesses a large amount of memory, such as a virus scanner.

• Host/Hypervisor Specialization
Different hypervisors support different techniques out-of-box, some faster than others (a comparison across techniques, not across hypervisors). If the user has freedom of choice over hypervisor selection, e.g. if they are not vendor locked to a particular provider or constrained by enterprise policies, then they may choose the one offering the best technique: fastest or cheapest (in resource consumption). Otherwise, if the hypervisor selection is fixed, but the user still has control over the host resources or is willing to modify the hypervisor or install third party libraries, they can further optimize the available option to extract the best performance. For example, a 'direct memory access' method in KVM is sufficiently fast for practical monitoring applications and works out-of-box, yet an order of magnitude higher speed can be achieved by modifying QEMU, enabling physical memory access on the host, or reserving large pages in host memory for the file-backed method. The latter options, however, come with a tradeoff of increased system vulnerability and memory pressure. This work also shows that libraries or hypervisor modification may not be needed to extract high performance, as depicted by the QEMU direct access live method (enabled by leveraging Linux memory primitives) being more efficient than the LibVMI library's live transfer channel implementation, while the latter also requires QEMU modifications.

• Mapping over direct reads
Amongst the various methods compared in this study, the live methods are the best performing across several dimensions. Amongst these, guest memory mapping is much superior to direct memory reads (e.g. speeds on the order of 100Hz vs 10Hz), primarily because of greater system call overheads in the latter (multiple read()/seek() calls vs. a single mmap()). However, the previous observation's speed vs. hypervisor specialization tradeoff holds true here as well, at least for KVM.

• Guest-halting map/reads over snapshotting
For strict view-consistent monitoring and other VM-snapshot based use-cases, it is better to use halting-reads than halting-snapshot based approaches: although both techniques quiesce the target VM, the impact on the VM's workload is generally much lower with the former technique, and especially bearable at low monitoring frequencies. Also, as shown in the experiments, guest-halting snapshotting methods create backlogs in work queues, thereby heavily impacting performance. Live snapshotting, on the other hand, is a much better alternative, as indicated in Section 3.2's qualitative analysis and towards the end of Section 3.3 (under Limitations).

• Consistency vs. Liveness, Realtimeness, and VM performance
For almost all techniques, view consistency and guest liveness are conflicting goals. If the user desires both, then they either have to let go of guest independence by opting for the guest cooperation methods that run inside the guest OS scope, or choose a hardware assisted out-of-band approach using transactional memory [109]. One compromise option is COW snapshotting, which provides an almost-live and consistent snapshot. For the common non-live pause-and-introspect (PAI) based techniques (halting-reads), the maximum monitoring frequency can never equal that of the live methods, because that would mean the VM is paused all the time and is thus making no meaningful progress. Thus, for PAI techniques, there exists a consistency vs. realtimeness tradeoff in addition to the consistency vs. VM performance tradeoff, the latter evident in the high VM overheads with halting-reads. Consistency Fallacy: Furthermore, as my experiments indicate, PAI techniques, employed for "safe" introspection despite their high VM performance impact, do not mitigate all forms of inconsistency, which are very rare to begin with. There is thus a need to synchronize with the guest OS to determine guest states safe for introspection.

• Monitoring Overhead vs. Resource Usage
In the KVM/QEMU implementations of the guest-halting snapshotting and interface-based memory access methods, there exists a tradeoff between the resources available for monitoring and the impact on the VM being monitored, except when the target VM has CPU slack. This tradeoff does not hold for the live memory map/reads, which already have negligible VM overhead in the base case, nor for the halting-reads method, which doesn't consume a full CPU to begin with, its overhead stemming from the periodic VM stuns.

• Scalability of approaches
If the user targets several VMs to be monitored at once, another important metric to consider is scalability. Although an explicit comparison is not made, it is relatively straightforward to correlate a technique's maximum frequency with its CPU usage, and observe that the live memory map/read techniques, each consuming more or less a single CPU core on the host, would monitor the maximum number of VMs at 1Hz (ranging between 30 and 500 VMs per dedicated monitoring core).
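The packing estimate implied here can be written out as a back-of-the-envelope calculation. The sketch below uses two illustrative operating points chosen only to reproduce the 30-500 VMs/core range quoted above; they are not exact measured values.

/* Sketch: rough VM packing per dedicated monitoring core. If a technique
 * sustains f_max iterations/s while using c host cores, monitoring one VM
 * at f_target Hz costs about (f_target / f_max) * c cores. */
#include <stdio.h>

static double vms_per_core(double f_max_hz, double cores_used, double f_target_hz)
{
    return f_max_hz / (cores_used * f_target_hz);
}

int main(void)
{
    /* Illustrative operating points, each assumed to use ~1 core. */
    printf("live direct reads : ~%.0f VMs/core at 1 Hz\n",
           vms_per_core(30.0, 1.0, 1.0));
    printf("live memory map   : ~%.0f VMs/core at 1 Hz\n",
           vms_per_core(500.0, 1.0, 1.0));
    return 0;
}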

3.6 Summary

I presented a comparative evaluation of existing VMI techniques to aid VMI users in selecting the approach best suited to their requirements and constraints. I organized existing VMI techniques into a taxonomy based upon their operational principles. My quantitative and qualitative evaluation reveals that VMI techniques cover a broad spectrum of operating points. I showed that there is substantial difference in their operating frequencies, resource consumption on the host, and overheads on target systems. These methods may be available out-of-box on different hypervisors or can be enabled by third party libraries or hypervisor modifications, giving the user a choice between easy deployability and hypervisor specialization. I also demonstrated the various forms of intrinsic and extrinsic inconsistency in the observed VM state, and showed that pause-and-introspect based techniques have marginal benefits for consistency, despite their prohibitive overheads. Application developers therefore have different alternatives to choose from based on their desired levels of latency, frequency, overhead, consistency, intrusiveness, generality and practical deployability. I hope that my observations can benefit the community in understanding the trade-offs of different techniques, and in making further strides leveraging VMI for their applications.

Chapter 4

Near Field Monitoring

Near Field Monitoring (NFM) is a non-intrusive and out-of-band cloud monitoring approach. It addresses the following research questions, overcoming the limitations of existing monitoring techniques (Section 2.2):

• How to monitor and manage systems even when they become unresponsive or are compromised?
In-VM solutions are only as good as the monitored system's operational health. This poses an interesting conundrum, where the system information becomes unavailable exactly when it is most critical: when the VM hangs, is unresponsive or subverted. A recent outage [43] presents a prime example of the effect of this limitation, where a significant portion of Google's production systems became unresponsive due to a dynamic loader misconfiguration. As a result, none of the in-system agents could publish data outside, nor was it possible to log in to the impacted systems for manual diagnosis. Thus, it was extremely difficult to get system information when it was most crucial.

• How to perform IT operations (monitoring, compliance, etc.) without relying on guest cooperation or in-VM hooks?
In-VM solutions face an uphill battle against the emerging cloud operation principles of ephemeral, short-lived VMs, and against increasing VM proliferation. Their maintenance and lifecycle management has become a major pain point in enterprises [69]. Furthermore, both agents and hooks modify the target systems, interfere with their operation, consume end-user cycles and are prime candidates for security vulnerabilities due to their external-facing nature. A recent observation from [11] highlights how agent operation and maintenance issues can impact managed systems. In this case, an incomplete maintenance update for the DNS configuration in some of the agents, coupled with a memory leak issue, led to a performance degradation for part of Amazon's storage services.

Even with generic-agent-based approaches (Type 4, Section 2.2), the problems of guest cooperation and intrusion do not completely disappear, although they are mitigated to some extent. However, these approaches create a new challenge that goes against a key cloud principle: portability. Generic agents/hooks require specialization of VMs for the virtualization layer providing the monitoring APIs, which leads to vendor lock-in. Additionally, custom solutions need to be designed to work with each such agent provider.


• How to use virtualization technology to provide better system monitoring?
In-VM techniques are further susceptible to potentially unreliable and inaccurate views of the system and its resource use characteristics, as the guest's view of its environment and resources can deviate substantially from reality. For example, prior studies [53, 65] show that different "identical" EC2 instances have different CPU and IO characteristics. This limited knowledge about system status can also cause 'compliance storms', when large numbers of VMs simultaneously begin their conformity-check processes, swamping the possibly already heavily utilized hardware resources under high consolidation scenarios.

To address these challenges, NFM leverages virtualization technology to decouple system monitoring from execution. NFM employs and extends VM introspection (VMI) techniques to extract runtime system state, residing in the guest VM's disk and memory, from outside the guest's context (out-of-band). This collected system state is then fed into an analytics backend, which facilitates NFM's 'systems as data' monitoring approach. The monitoring functions simply query this systems data, instead of accessing and intruding on each running system. The analogy is that of a Google-like service running atop the cloud, enabling a query interface to seek live as well as historical information about the cloud. Cloud monitoring and analytics applications then simply act as the clients of this service. NFM's key differentiating value is its ability to provide monitoring and management capabilities without requiring guest system cooperation and without interfering with the guest's runtime environment. It does not require the installation and configuration of any hooks or agents in the target systems. Unlike the in-VM solutions that run within the guest context and compete for resources allocated to the guest VMs, NFM is non-intrusive and does not steal guests' cycles or interfere with their actual operation. By decoupling monitoring and analytics from the target system context, NFM is better suited for responding to the ephemeral nature of VMs, and enables always-on monitoring, even when the target system is unresponsive. This chapter describes NFM's implementation on the KVM/QEMU and Xen hypervisors, and new methods for low-latency, live access to a VM's memory from the hypervisor layer (with optional consistency support), as well as optimizations that enable subsecond monitoring of systems. Also described are four applications that I've built on top of the NFM framework based on actual enterprise use cases. These highlight how we can treat systems as documents and leverage familiar paradigms from the data analytics domain, such as document differencing and semantic annotations, to analyze systems. These include (i) a cloud topology discovery and evolution tracker application, (ii) a cloud-wide realtime resource monitor providing a more accurate and holistic view of guests' resource utilization, (iii) an out-of-VM console-like interface enabling administrators to query system state without having to log into guest systems, along with a handy "time travel" capability for forensic analysis of systems, and (iv) a hypervisor-paging aware out-of-VM virus scanner that demonstrates how across-stack knowledge of system state can dramatically improve the operational efficiency of common management applications like virus scanning. Finally, I present NFM's evaluation, which showcases its high accuracy, monitoring frequency, reliability and efficiency, as well as its low impact on monitored systems.

4.1 NFM’s Design

Architecturally, NFM’s overall framework, depicted in Figure 4.1, can be viewed as an introspection fron- tend and an analytics backend, which enables the VM execution - monitoring decoupling. The frontend Chapter 4. Near Field Monitoring 34

Figure 4.1: Introspection and analytics architecture

I have developed introspection techniques that build upon and extend traditional VMI approaches [71, 111] to gain an out-of-band view of VM runtime state. The analytics backend triggers VM introspection and accesses the exposed VM state via Crawl APIs. Through these APIs, the Crawl Logic accesses raw VM disk and memory structures. VM disk state is accessed only once, before crawling a VM for the first time, to extract a small amount of information on the VM's OS configuration. The Crawl Logic uses this information to access and parse raw VM memory to create a structured, logical view of the live VM state, referred to as a frame, representing a point-in-time view of the VM state. All frames of all VMs are stored in a Frame Datastore that the cloud monitoring and management applications run against, to perform their core functions as well as to run more complex, cloud-level analytics across time or across VMs. By decoupling monitoring and management from the VM runtime, these tasks can now proceed without interfering with VM execution. The VMs being monitored are never modified by NFM, nor is the hypervisor. NFM is designed to alleviate most of the issues with existing solutions (as discussed in Section 2.2). I follow three key principles to achieve this in a form that is tailored to the cloud operation principles:

1. Decoupled Execution and Monitoring: By decoupling target system execution from monitoring and analytics tasks, any implicit dependency between the two gets eliminated. NFM operates completely out of band, and can continue tracking system state even when a system is hung, unresponsive or compromised.

2. No Guest Interference or Enforced Cooperation: Guest context and cycles are precious and belong to the end user. Unlike most agent- or hook-based techniques that are explicitly disruptive to system operation, NFM’s design is completely non-intrusive. NFM does not interfere with guest system operation at all, nor does it require any guest cooperation or access to the monitored systems.

3. Vendor-agnostic Design: NFM's design is based on a generic introspection frontend, which provides standard crawl APIs. Its contract with the virtualization layer is only for base VMI functions, commonly available across hypervisors, exposing VM state in its rawest form (disk blocks and memory bytes). It does not require any custom APIs or backdoors between the VM and the virtualization layer in its design. All the data interpretation and custom intelligence

are performed at the analytics backend, which also means simplified management as opposed to maintaining in-VM agents or hooks across several VMs.

In addition to alleviating some of the limitations of existing solutions, NFM's design further opens up new opportunities for cloud monitoring and analytics. First, the decoupling of VM execution and monitoring inherently achieves computation offloading, where the monitoring / analysis computation is carried out outside the VM. This enables some heavy-handed complex analytics, such as full compliance scans, to be run with no impact on the actual systems. Second, many existing solutions actually track similar system features. By providing a single data provider (the backend datastore) for all analytics applications, NFM eliminates redundancy in information collection. Third, by shifting the scope of collected systems data from individual VMs to multiple cloud instances at the backend, NFM enables across-VM analytics, such as VM patterns or topology analysis, with simple analytics applications running against the datastore. Fourth, along with VM-level metrics, NFM is also exposed to host-level resource accounting measures, enabling it to derive a holistic and true view of VMs' resource utilization and demand characteristics. NFM supports arbitrarily complex monitoring and analytics functions on cloud VMs with no VM interference and no setup requisites for the users. The framework is designed to serve as the cornerstone of an analytics-as-a-service (AaaS) cloud offering, where users can seamlessly subscribe and unsubscribe to various out-of-the-box monitoring and analytics services, with no impact on their execution environments. These services span a wide range, from simple resource and security monitoring to across-the-cloud (anonymous) comparative systems analysis. Users would have the choice to opt into this service, paying for the cycles the hypervisor and the monitoring/analytics subsystem spends on their behalf. One consideration for such a service is privacy for end users; however, guests already relinquish the same level of control to the existing agents running inside their systems. I only argue in favour of the same level of trust, without the downside of having a potentially insecure foreign entity installed inside their runtime environment. One limitation of NFM is a by-product of its reliance on VMI to crawl VM memory state, which involves interpreting kernel data structures in memory. These data structures may vary across OS versions and may not be publicly documented for proprietary OSes. The tractability of this data structure interpretation is discussed in the implementation discussion (Section 4.2.2). Also, as NFM focuses on the OS-level system view, application-level (e.g., a MapReduce worker's health) or architectural (e.g., performance counter) state is not immediately observable via NFM unless exposed by the guest OS, although the hypervisor might allow measuring certain architectural state (e.g., perf counters) for a guest from the outside.

4.2 Implementation

I have implemented two real-system prototypes of NFM based on two virtualization platforms, Xen and KVM. Currently NFM only supports Linux guests; support for other OSes is discussed at the end of Section 4.2.2. The overall implementation can be broken into four parts: (i) exposing VM runtime state, i.e., accessing VM memory and disk state and making them available over the crawl APIs; (ii) exploiting VM runtime state, i.e., interpreting the memory and disk images with the Crawl Logic to reconstruct the guest's runtime information; (iii) persisting this VM state information in the Frame Datastore for the subscribed applications; and (iv) developing high level applications on top of the extracted VM runtime state.

4.2.1 Exposing VM State

Most hypervisors provide different methods to obtain a view of guest VM memory. VMware provides the VMSafe APIs [176] to access guest memory. Xen [22] provides a userspace routine (xc_map_foreign_range) via its Xen Control library (libxc) to re-map a guest VM's memory into a privileged VM. I use this technique to obtain a live read-only handle on the VM's memory. KVM [93] does not have default support for live memory handles. Although other techniques [24, 81] are able to provide guest memory access by modifying QEMU, my goal is to use stock hypervisors for generality. For KVM, options exist to dump VM memory to a file, via QEMU's pmemsave memory snapshotting, or migrate-to-file, or libvirt's dump. While the VM's memory can be exposed in this manner, the associated overheads are non-trivial, as the guest is paused for the duration of the dump. Therefore, for KVM, I have developed an alternative solution to acquire a live handle on VM memory, while inducing negligible overhead on the running VM. As KVM is part of a standard Linux environment, I leverage Linux memory management primitives and access VM memory via the QEMU process's /proc/<pid>/mem pseudo-file, indexed by the virtual address space backing the VM's memory as given in /proc/<pid>/maps. I employ similar techniques for exposing VM disk state. VM disk state is crawled only once to collect required persistent system information that is used to interpret raw memory data structures (Section 4.2.2). I use standard filesystem methods to access and expose VM disks, which are represented as regular files on the host system. In both prototypes, the obtained external view of raw VM memory and disk is wrapped and exposed as network-attached devices (FUSE over iSCSI) to the backend's Crawl Logic over common Memory and Disk Crawl APIs. Thus, the actual crawling and analytics components are completely decoupled from VM execution, and the upstream logic is transparent to the underlying virtualization technology. Enhancements. (i) While NFM's methods for exposing VM state generally have negligible latencies for most cloud monitoring and analytics applications, I have implemented optimizations to further curb these latencies. One notable optimization is selective memory extraction for applications that rely on near-realtime information. The key insight behind this is that most applications actually require access to a tiny fraction of the VM memory space, and therefore completeness can be traded off for speed. After an initial full crawl, the VM memory regions that hold the relevant information are identified, and in subsequent iterations only those sections are opportunistically crawled. For dump-based handles, the relevant distributed memory regions are batched together, to optimize the dump size and minimize dump cycles. Additional buffering heuristics are added to accommodate data structures, such as task lists, that grow dynamically. With such optimizations, the overall latency impact is further reduced by an order of magnitude. For example, for one of my realtime applications, CTop, this approach enables subsecond granularity realtime system monitoring even with heavyweight, dump-based methods, reducing overall latencies from a few seconds to milliseconds. (ii) NFM does not expect the guest OS to be in some steady state while extracting its live runtime state from outside the guest context.
Since any monitoring-specific changes to the guest are avoided and no guest cooperation is enforced, the lack of synchronization with the guest OS can potentially lead to an inconsistency of views between the actual runtime state that exists inside the VM and what gets reconstructed from the view exposed outside, e.g., a process terminating inside the guest while its memory mappings (mm_struct) are being traversed outside. To tackle this, I have built additional routines for (optional) consistency support for KVM guest memory access via ptrace() attach/detach on the QEMU container process. This trades off minor VM stuns for increased consistency. The experiments show that even with a heavy (10 times/s) fork and kill workload consuming significant CPU, memory and network resources inside a VM, inconsistency of extracted state occurs rarely, in about 2% of crawler iterations, while extracting the full system runtime state (Section 4.4.1). Chapter 3's Section 3.4 presents a detailed VMI memory consistency analysis.
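A condensed sketch of this KVM live-handle approach is shown below: it locates the QEMU mapping that backs guest RAM via /proc/<pid>/maps, services reads with pread() on /proc/<pid>/mem, and optionally brackets a read with ptrace() attach/detach for the consistency support described above. The largest-anonymous-mapping heuristic and the single-thread attach are simplifications for illustration; the actual frontend identifies the RAM region precisely and handles QEMU's multiple threads.

/* Sketch: live read-only handle on a KVM guest's memory from the host.
 * Assumptions: the monitoring process has ptrace-level privileges over the
 * QEMU process; guest RAM is approximated as QEMU's largest writable
 * anonymous mapping. */
#include <fcntl.h>
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Find [start, end) of the largest anonymous mapping in /proc/<pid>/maps. */
static int find_guest_ram(pid_t qemu_pid, uint64_t *start, uint64_t *end)
{
    char path[64], line[512];
    snprintf(path, sizeof(path), "/proc/%d/maps", qemu_pid);
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;

    uint64_t best = 0;
    while (fgets(line, sizeof(line), f)) {
        uint64_t lo, hi;
        char perms[5], map_name[256] = "";
        if (sscanf(line, "%" SCNx64 "-%" SCNx64 " %4s %*s %*s %*s %255s",
                   &lo, &hi, perms, map_name) < 3)
            continue;
        if (map_name[0] == '\0' && strchr(perms, 'w') && hi - lo > best) {
            best = hi - lo;
            *start = lo;
            *end = hi;
        }
    }
    fclose(f);
    return best ? 0 : -1;
}

/* Read guest-physical address gpa via /proc/<pid>/mem; if quiesce is set,
 * briefly stop QEMU with ptrace() for a consistent read. */
static ssize_t read_guest(pid_t qemu_pid, uint64_t ram_start, int quiesce,
                          uint64_t gpa, void *buf, size_t len)
{
    char path[64];
    snprintf(path, sizeof(path), "/proc/%d/mem", qemu_pid);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    if (quiesce && ptrace(PTRACE_ATTACH, qemu_pid, NULL, NULL) == 0)
        waitpid(qemu_pid, NULL, 0);           /* wait for the stop */

    ssize_t n = pread(fd, buf, len, ram_start + gpa);

    if (quiesce)
        ptrace(PTRACE_DETACH, qemu_pid, NULL, NULL);
    close(fd);
    return n;
}

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <qemu-pid>\n", argv[0]);
        return 1;
    }
    pid_t pid = atoi(argv[1]);
    uint64_t lo, hi;
    if (find_guest_ram(pid, &lo, &hi) < 0)
        return 1;

    uint8_t page[4096];
    if (read_guest(pid, lo, /*quiesce=*/0, 0x0, page, sizeof(page)) > 0)
        printf("read first guest page (RAM backed at %#" PRIx64 ")\n", lo);
    return 0;
}

With quiesce enabled on every iteration, this access pattern degenerates into a pause-and-introspect style read, at the cost of the periodic VM stuns quantified in Chapter 3.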

4.2.2 Exploiting VM State

The backend attaches to the VM view exposed by the frontend, and implements the Crawl Logic that performs the logical interpretation of this raw state, consisting of the disk block device and the memory byte array. Interpretation of the disk state is relatively well defined by leveraging standard filesystem drivers. The key challenge, however, is bridging the inherent semantic gap between the exposed raw VM memory state and the logical OS-level VM-internal view. Traditional in-VM approaches simply leverage guest OS context to extract system information, such as the /proc tree for process information. The Crawl Logic achieves the same function by translating the byte-level memory view into structured runtime VM state. VM runtime information is distributed across several in-memory kernel data structures for processes (task_struct), memory mappings (mm_struct), open files (files_struct), and network information (net_device), among others. These struct templates are overlaid on the exposed memory, and traversed to read the various structure fields holding the relevant information [111]. To correctly map these data structures, three important pieces of information are extracted from the VM disk image and/or a kernel repository:

1. Version, Architecture and Map Parameters: From the VM kernel log, the running kernel version and the target VM architecture (32- vs. 64-bit) are extracted for correctly sizing the data structures. Also read is the BIOS RAM map, for determining the amount of VM memory to map and the size and layout of the VM memory regions.

2. Starting Addresses for structs: To identify the starting addresses of various kernel structures, such as the initial task (init_task), the module list (modules), and the kernel's initial page table (init_level4_pgt), the System.map file for the guest kernel is read, which includes the kernel-exported addresses for these structures.

3. Field Offsets: After identifying entry points for kernel structures, the offsets to the relevant fields in these structures are calculated. The kernel source (or a vmlinux build image) and the build configuration are used to determine the offsets of the desired fields (a short sketch of steps 2 and 3 follows this list).
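A minimal sketch of the symbol and offset extraction, assuming a standard System.map layout; init_task is a real kernel export, while the helper itself is illustrative. The gdb route noted in the comment is the automation mentioned in the Discussion below.

```python
def kernel_symbol_address(system_map_path, symbol):
    """Look up an exported kernel symbol (e.g. 'init_task') in the guest's
    System.map, whose lines have the form '<address> <type> <name>'."""
    with open(system_map_path) as f:
        for line in f:
            addr, _type, name = line.split()[:3]
            if name == symbol:
                return int(addr, 16)
    raise KeyError(symbol)

# Field offsets (step 3) can be read off a debug-enabled vmlinux, e.g. in gdb:
#   (gdb) print/d &((struct task_struct *)0)->tasks
# and recorded once per kernel version/configuration.
```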

Given a live VM memory handle and the above-mentioned pieces of information, the Crawl Logic’s overall memory crawl process can be summarized as:

1. Reading a kernel-exported address (X), such as the initial process' address (symbol init_task).

2. Mapping the associated data structure template (struct task_struct) to the memory region at address X.

3. Reading the relevant structure member fields after adding their relative offsets to the starting address X.

4. Continuing with the next object by reading its address via linked list traversal (prev, next member fields, together with offset arithmetic).

In addition to kernel structure traversal, per-process page tables are also traversed inside the extracted memory view, to translate virtual addresses in a guest process' address space. The process' page directory is extracted from its mm_struct, and its page tables are traversed. The Crawl Logic can currently crawl the x86, x86 PAE, and x86_64 architectures, and also supports huge pages. After introspecting all the relevant data structures, the Crawl Logic creates a single structured document (frame) for the crawled VM. This crawl document captures a rich set of information on VM state that the cloud monitoring and analytics applications can build upon. The current frame format covers various VM features, including system, CPU, memory and process information, modules, address mappings, open files, network information and runtime resource use. This VM frame, with its corresponding timestamp and VM ID, is then loaded into the frame datastore described in the next section.
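As an illustration of the four crawl steps above, the following minimal sketch walks the kernel task list over the exposed memory view. The structure offsets and address-translation constants are placeholders standing in for the per-kernel values from the extraction step; read_phys is the memory handle of Section 4.2.1, and real kernels (KASLR, non-zero phys_base) need a more careful translation.

```python
# Placeholder constants: real values come from the offset-extraction step
# above and differ per kernel version/configuration.
OFF_TASKS = 0x2a8            # offsetof(struct task_struct, tasks)
OFF_PID   = 0x3c0            # offsetof(struct task_struct, pid)
OFF_COMM  = 0x4c0            # offsetof(struct task_struct, comm)
PAGE_OFFSET = 0xffff880000000000   # x86_64 direct-map base (pre-KASLR kernels)
KERNEL_MAP  = 0xffffffff80000000   # base of the kernel text/data mapping

def v2p(kvaddr):
    """Minimal kernel virtual-to-physical translation covering the two
    common x86_64 linear mappings (assumes phys_base = 0 and ignores
    vmalloc/module areas)."""
    if kvaddr >= KERNEL_MAP:
        return kvaddr - KERNEL_MAP
    return kvaddr - PAGE_OFFSET

def read_ptr(read_phys, kvaddr):
    return int.from_bytes(read_phys(v2p(kvaddr), 8), "little")

def list_processes(read_phys, init_task_addr):
    """Walk the circular task list starting at init_task, reading each
    task_struct's pid and comm from the exposed memory view.
    read_phys(paddr, length) is the memory handle of Section 4.2.1."""
    procs, task = [], init_task_addr
    while True:
        pid = int.from_bytes(read_phys(v2p(task) + OFF_PID, 4), "little")
        comm = read_phys(v2p(task) + OFF_COMM, 16).split(b"\0")[0].decode()
        procs.append((pid, comm))
        # tasks.next points at the *next* task's embedded list_head;
        # subtracting OFF_TASKS recovers the enclosing task_struct.
        task = read_ptr(read_phys, task + OFF_TASKS) - OFF_TASKS
        if task == init_task_addr:
            break
    return procs
```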

Discussion. (i) While the backend implementation is focused on Linux, NFM's applicability is not limited to a particular OS. Structure-offset extraction and VMI have been shown to work for Mac and Windows as well [16, 24, 119, 163]. Moreover, these OSes have fewer versions than Linux, and they change slowly. (ii) The tractability of kernel data structure traversal based introspection is corroborated by prior studies reporting modest data structure modifications across major Linux versions [126]. Indeed, my parsers for kernel versions 3.2 and 3.4.4 differ only in their network-device related data structures among all of the ones tracked. Furthermore, the standardization trends in enterprise clouds also work in favor of this approach, limiting the OS-version variability. Even Amazon EC2 has a small number of base VM image types (5 different Linux OS versions). (iii) NFM's correctness depends on the sanity of kernel data structures, which can be obfuscated by the VM user (e.g., custom kernel) or tampered with by rootkits [18, 25]. The former concern is automatically mitigated by the standardization and compliance policies of enterprise clouds, as well as the explicit user sign-up for the offered out-of-band monitoring service. The latter can be safeguarded against by using VMI-based countermeasures which ensure that kernel integrity is maintained [81, 20], based upon the assumption that the VM user is not an adversary to begin with. (iv) To demonstrate NFM's generality, the memory crawler employed in NFM's evaluation operates across a diverse configuration set: multi-architecture (x86 / x86_64), multi-core, variably sized (RAM) VMs running Linux kernels far apart in the version chain and from different vendors: RHEL6 / Linux 2.6.32, Fedora 14 / 2.6.35, Ubuntu 12.04 / 3.2.0, Fedora 16 / 3.4.4. (v) In my experience, generating the structure-field information manually takes a one-time effort of roughly an hour for a new OS version. This process can be automated by using crash [51] or gdb on debugging-information-enabled vmlinux kernel images. There also exist alternative OS-version-independent memory introspection techniques (see Section 2.3.2). (vi) Since its inception, NFM has been adapted to detect and monitor more than 1000 different system distributions (including distribution patches on top of official vanilla kernels), without requiring any manual configuration setup for target systems. This includes Linux kernel versions between 2.6.11 and 3.19 [95]. Out of the 96 data structure fields that are extracted to collect system information, only 6 have changed across these kernel versions spanning a decade (years 2005 to 2015).

                                                 TopoLog   CTop   RConsole   PaVScan
Across-system analytics                             ●        ●
Across-time analytics                               ●                 ●
Monitoring unresponsive or compromised systems               ●        ●
Deep, across the stack system knowledge                      ●                   ●

Table 4.1: Key capabilities of the prototype applications

4.2.3 The Frame Datastore

The frame datastore is a repository of historical as well as live VM runtime states of all of the guest VMs, sent to it as structured frame documents by the "crawler" VM running the Crawl Logic. Although intended to be an actual database system for a full cloud deployment, in most cases the crawler VM's file system is simply used as the current frame datastore. To keep the space overhead of maintaining VM state history (frames) scalable for cloud-scale deployments, incremental delta frames are kept across successive crawls of a VM over time. Depending upon whether a delta is computed against the base crawl or the most recent crawl, there is a tradeoff between ease of delta insertion/deletion and delta size: the former is trivial with base-crawl deltas, while the latter stays smaller with latest-crawl deltas. Section 4.4.7 evaluates this space overhead for maintaining system states across time.
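A minimal sketch of the two delta schemes, with frames modeled as flat key-value dictionaries (a simplification of the actual frame document format):

```python
def frame_delta(reference, current):
    """Delta frame: entries added or changed in `current` relative to
    `reference`, plus keys that disappeared."""
    return {
        "changed": {k: v for k, v in current.items() if reference.get(k) != v},
        "removed": [k for k in reference if k not in current],
    }

def apply_delta(reference, delta):
    """Reconstruct a full frame from a reference frame and one delta."""
    frame = dict(reference)
    frame.update(delta["changed"])
    for key in delta["removed"]:
        frame.pop(key, None)
    return frame

# Base-crawl deltas: any sample is recovered with a single apply_delta(base, d_i),
# but the deltas grow over time. Latest-crawl deltas stay small, but recovering
# sample i means replaying d_1 .. d_i over the base frame.
```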

4.2.4 Application Architecture

In the NFM framework, cloud monitoring applications act as clients of the Frame Datastore, building on top of the rich system state it maintains throughout the VMs' lifetimes, to perform cloud analytics across time (past vs. present system states) and space (across systems). Some of the target applications bypass the frame datastore for realtime operations (e.g., CTop resource monitoring), or interface directly against the raw VM memory view (e.g., PaVScan virus scanning). A benefit of treating system information as data documents (i.e., frames) is that it enables leveraging familiar paradigms from the data analytics domain, such as diff-ing systems just like diff-ing documents, and tagging systems with semantic annotations. Tagging point-in-time system states as 'healthy' and diff-ing them against faulty states potentially makes it easier and more efficient to troubleshoot system issues.

4.3 Prototype Applications

This section describes four concrete applications that I have built over NFM's cloud analytics framework. These applications highlight NFM's capabilities and target some interesting use cases. The fundamental principles for system analytics in the cloud are preserved across all applications: they all operate out-of-band with VM execution, are completely non-intrusive, and operate without requiring any guest cooperation. Table 4.1 shows the four applications (TopoLog, CTop, RConsole and PaVScan) and the key capabilities they highlight. TopoLog is a cloud topology discovery application that focuses on across-system analytics.

Figure 4.2: VM and app connectivity discovered by Topology Analyzer for 4 VMs

It discovers VM and application connectivity by analyzing and correlating frames of cloud instances. It can also provide across-time analytics by tracing the evolution of cloud topology over frame history. CTop is a cloud-wide, realtime resource monitor that can monitor even unresponsive systems. CTop also showcases how deep, across-the-stack system knowledge can enable more accurate and reliable monitoring of a system's resource utilization. RConsole is an out-of-VM, console-like interface that is mainly designed with cloud operators in mind. Its pseudo-console interface enables administrators to query system state without having to log into guest systems, even when the system is compromised. It also enables a handy "time travel" capability for forensic analysis of systems. PaVScan is a hypervisor-paging aware virus scanner, and is a prime example of how across-stack knowledge of system state, combining the in-VM view with the out-of-VM view, can dramatically improve the operational efficiency of common management applications like virus scanning.

4.3.1 TopoLog

TopoLog is a network and application topology analyzer for cloud instances. It discovers (i) the interconnectivity across VMs, (ii) communicating processes inside each VM, (iii) connectivity patterns across high-level applications, and (iv) (topology permitting) per-VM network flow statistics, without installing any hooks inside the VMs. TopoLog facilitates continuous validation in the cloud by ensuring that a desired topology for a distributed application (deployed as a pattern/set of VMs, e.g., via a Chef recipe [122]) is maintained. It detects unauthorized connections, bottleneck nodes and network resource use patterns. TopoLog offers other potential use cases, such as (i) optimizing inter- and intra-rack network use by identifying highly-communicating VMs and bringing them closer together, and (ii) simultaneous patching of interconnected VMs for minimal service downtime at the application level. For each VM, the topology analyzer extracts the per-process network information from the latest frame in the Frame Datastore. The extracted information (containing the socket type and state, the associated source and destination IP addresses, and the process owning the socket) is correlated across all, or a specified subset of, cloud instances to generate a connectivity graph.

                   L-VM      W-VM     D-VM     M-VM     Ext
LogAnalyticsVM     0.00    109.08     0.00     0.00     0.00   (L-VM)
WebServerVM        0.56      0.00    42.67     0.00     0.00   (W-VM)
DataStoreVM        0.00      0.86     0.00     0.22     0.00   (D-VM)
MasterCtlVM        0.00      0.00     0.14     0.00     0.00   (M-VM)
External           0.00      0.00     0.00     0.00     0.00

Figure 4.3: VM Connectivity Matrix [Mbps]

Higher-level application information for the communicating processes is discovered by traversing the process tree within each VM frame. These steps are sufficient to generate the cloud network and application topology. In addition to these, TopoLog further discovers network traffic statistics for each VM by extracting and comparing counts of received, transmitted and dropped packets/bytes across two different timestamped frames. Depending upon application knowledge and a particular topology, TopoLog can go one step further and estimate the weight of the connection edges in the topology graph. Since Linux does not maintain per-process data transfer statistics, these are estimated by converting the connectivity graph into a linear system of equations. For a VM-to-VM connectivity graph of N VMs, there exist at most N^2 − N potential unknowns (graph edges) and 2N equations (per-VM system-wide received and transmitted bytes). If the number of equations is sufficient to solve for the actual number of edges, the weight of each connection can be determined. To bring down the number of unknowns, domain knowledge is employed, such as information about connections with negligible or known constant weights. Figure 4.2 shows one such connectivity graph generated automatically by the topology analyzer for an application pattern composed of 4 VMs. This application includes (i) a MasterControl VM, which monitors each application component and serves data to a DataStore VM; (ii) a WebServer VM, which serves client-side requests over HTTP; (iii) a DataStore VM, which warehouses application data and receives updates from the MasterControl VM; and (iv) a LogAnalyzer VM, which downloads and analyzes the logs from the WebServer VM. As can be seen, the topology analyzer was able to discover all cluster connections, such as the masterControlVM having long-lived hooks into all other VMs over ssh, and feeding the dataStoreVM with data files. Also found was a connection that does not belong to the cluster, labeled as "exit to outside world". Note that although the connection between the dataStoreVM and the httpWebServerVM is detected as an ssh connection, by traversing the process tree inside the latter's frame, TopoLog can get the higher-level picture of this actually being a secure file transfer. A packet-sniffing based topology discovery would not be able to detect intra-host connections for colocated VMs, or capture application-level information for the communicating systems. Figure 4.3 further shows a snapshot of the derived network traffic statistics, as a VM connectivity matrix depicted as an intensity plot, for the same application pattern.

Figure 4.4: Above: in-VM top; Below: CTop

An entry M(r, c) in the matrix represents the rate of data transferred from VM_c to VM_r in Mbps. The row and column labels are identical (column labels are abbreviated in the plot). The connectivity matrix highlights the strongly-connected application components, which are the {WebServerVM → LogAnalyzerVM} and {DataStoreVM → WebServerVM} tuples in this case. This information is useful both for network-aware optimization of application resources and for continuous validation, by identifying unauthorized or abnormal application communication patterns.
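To make the edge-weight estimation concrete, a minimal sketch that sets up and solves the 2N-equation system in the least-squares sense; the vms, tx_totals, rx_totals and known_zero inputs are hypothetical stand-ins for values drawn from the frames.

```python
import numpy as np

def estimate_edge_weights(vms, tx_totals, rx_totals, known_zero=()):
    """Estimate per-edge traffic (src VM -> dst VM) from per-VM totals:
    2N equations (each VM's transmitted and received bytes) over the
    unknown edge weights, with edges in `known_zero` removed via domain
    knowledge. The result is exact only when the reduced system is
    sufficiently determined."""
    edges = [(i, j) for i in vms for j in vms
             if i != j and (i, j) not in known_zero]
    A = np.zeros((2 * len(vms), len(edges)))
    b = np.zeros(2 * len(vms))
    for row, vm in enumerate(vms):
        b[row] = tx_totals[vm]                  # bytes sent by vm
        b[len(vms) + row] = rx_totals[vm]       # bytes received by vm
        for col, (src, dst) in enumerate(edges):
            if src == vm:
                A[row, col] = 1.0
            if dst == vm:
                A[len(vms) + row, col] = 1.0
    weights, *_ = np.linalg.lstsq(A, b, rcond=None)
    return dict(zip(edges, weights))
```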

4.3.2 CTop

CTop is a cloud-wide, realtime consolidated resource monitoring application. CTop is of equivalent fidelity and time granularity as the standard, in-band Linux top utility, with two enhancements for cloud-centric monitoring rather than traditional, system-centric techniques. First, modern cloud applications typically span multiple VMs and hosts, requiring a distributed application-level view of resource use across system boundaries. CTop provides a single unified view of resource utilization across all applications and VMs distributed onto various physical machines within a cloud. Allowing for host and VM normalization (scaling), CTop dissolves VM and host boundaries to view individual processes as belonging to a single global cloud computer, with additional drill-down capabilities for restricting the view within hosts or VMs. Second, since CTop operates outside the VM scope, it is aware of both VM-level and hypervisor-level resource usage. Thus, it can provide a more accurate and holistic view of utilization than what the guest sees in its virtualized world. It appropriately normalizes a process' resource usage inside a VM to its corresponding usage on the host, or in terms of what the user paid for, allowing direct comparison of the overall application's processes across VMs.

In-VM Console:
Active connections (servers and established)
Proto  Local Address         Foreign Address       State
tcp    127.0.0.1:25          0.0.0.0:*             LISTEN
tcp    9.XX.XXX.110:52019    9.XX.XXX.109:22       ESTABLISHED
:
tcp    9.XX.XXX.110:22       9.XX.XXX.15:49845     ESTABLISHED

RConsole:
Active Internet connections
Proto  Local Address         Foreign Address       State            PID     Process
tcp    127.0.0.1:25          0.0.0.0:0             SS_UNCONNECTED   741     [sendmail]
tcp    9.XX.XXX.110:52019    9.XX.XXX.109:22       SS_CONNECTED     6177    [ssh]
:
tcp    9.XX.XXX.110:22       9.XX.XXX.15:49845     SS_CONNECTED     14894   [sshd]
tcp    0.0.0.0:2476          0.0.0.0:0             SS_UNCONNECTED   23304   [datacpy]

Figure 4.5: RConsole captures datacpy’s hidden listener connection

Equation 4.1 shows this normalization: the actual CPU usage of a VM V's process P on host H (CPU_H^P) is calculated from the CPU usage of P inside the VM (CPU_V^P), the overall CPU utilization of V (CPU_V^*), and the CPU usage of the VM on host H (CPU_H^V):

    CPU_H^P = (CPU_V^P / CPU_V^*) × CPU_H^V                  (4.1)

To achieve realtime monitoring, CTop directly uses the crawler to emit a custom frame on demand at the desired monitoring frequency, bypassing the Frame Datastore. Fields of the frame include per-process PID, PPID, command, virtual and physical memory usage, scheduling statistics and CPU runtime. CTop analyzes frames from successive monitoring iterations to generate a top-like per-process resource usage monitor. Figure 4.4 compares the output of CTop with the standard in-VM top for a single VM. In this case, the in-VM measures and CTop measures match, as there is no host-level contention. CTop's unified application-level resource utilization view allows for straightforward comparison of different instances of the same application across different VMs. This provides a simple form of problem diagnosis in the cloud, by tracking unexpected resource use variability among instances of an application. As shown in Section 4.4.3, CTop's holistic view of the operating environment helps explain performance and capacity conflicts that are hard to reason about when monitoring is bound to the VM scope. CTop's latency and accuracy are evaluated in Sections 4.4.1 and 4.4.2.
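For concreteness, a minimal sketch of the Equation 4.1 normalization, with a worked example in the comment:

```python
def normalize_cpu(proc_cpu_in_vm, vm_total_cpu_in_vm, vm_cpu_on_host):
    """Equation 4.1: scale a process' in-VM CPU share by the VM's actual
    CPU usage on the host."""
    return (proc_cpu_in_vm / vm_total_cpu_in_vm) * vm_cpu_on_host

# Example: a process using 60% of its VM's CPU, in a VM that is 100% busy
# internally but receives only a 30% CPU share on the host, accounts for
# (60 / 100) * 30 = 18% of host CPU.
normalize_cpu(60.0, 100.0, 30.0)   # -> 18.0
```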

4.3.3 RConsole

RConsole is an out-of-band "console-like" interface for VMs being monitored by NFM. It is a read-only interface, with no side effects on the running instances. RConsole supports basic system functions such as ls, lsmod, ps, netstat and ifconfig. It is designed for cloud operators, to provide visibility into running instances without requiring access into the systems. RConsole runs against the system state captured in the frames indexed into the Frame Datastore. It implements a sync API call to crawl the current live state of a VM and retrieve its most up-to-date state, and a seed API to retrieve a prior stored state of a VM, which enables traveling back in time to observe past system state. With RConsole, admins can perform simple security, compliance and configuration monitoring in an out-of-band fashion, without disrupting or accessing the running systems. As RConsole operates by interpreting raw VM memory structures rather than relying on in-VM OS functions, it is also more robust against certain security attacks that may compromise a guest OS. This is demonstrated by infecting VMs with the AverageCoder rootkit [64], which covertly starts and hides malicious processes, unauthorized users and open network connections from the guest OS. Figure 4.5 shows an example of this for network connections. Here the top box shows the (simplified) output of the standard in-VM netstat command, while the bottom box shows the output from RConsole's netstat. Both outputs remain mostly similar, except for one additional entry in RConsole: a malicious datacpy process with a listening connection on port 2476. In-VM netstat fails to discover this as it relies on compromised guest-OS-exported functions, while RConsole can easily capture it from the crawled VM state. More sophisticated attacks and VM introspection based counter-measures are well-established in prior studies [18, 25, 71, 20, 81, 177, 103, 59]. RConsole is also greatly useful in troubleshooting system issues owing to its ability to travel back and forth in time. Once an anomaly is detected, it can trace back system state to detect the root cause, and compare across systems to identify similar good and bad system configurations. RConsole is even able to inspect a VM that was made completely dysfunctional by forcing a kernel panic; the entire runtime state still exists in the VM's memory, which RConsole is able to retrieve to pinpoint the culprit process and module. This ability to analyze unresponsive systems also plays a critical role in dramatically improving time to resolution for certain problems, as in the Google outage example of Section 1. A similar example has also been recently observed in one of my collaborator's (IBM's) cloud deployments, where a network misconfiguration in one of the systems caused a serious routing problem in VM network traffic, rendering most of the VMs inaccessible. In both cases, RConsole's ability to perform simple troubleshooting operations (such as tracking network configurations via ifconfig) across time and across systems can (and indeed did for the latter) play a critical role in pinpointing the offending systems and configurations in a simpler and more efficient way.
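A minimal sketch of how RConsole's netstat view could be rendered from a crawled frame; the per-connection records and their field names are illustrative, not NFM's actual frame schema.

```python
def rconsole_netstat(frame):
    """Render a netstat-like listing from a crawled frame. The connection
    data comes from traversing each task's open sockets in the exposed
    memory, so listeners hidden from the in-guest netstat still appear."""
    rows = [f"{'Proto':<7}{'Local Address':<24}{'Foreign Address':<24}"
            f"{'State':<17}{'PID':<8}Process"]
    for c in frame["connections"]:
        rows.append(f"{c['proto']:<7}{c['local']:<24}{c['foreign']:<24}"
                    f"{c['state']:<17}{c['pid']:<8}[{c['comm']}]")
    return "\n".join(rows)
```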

4.3.4 PaVScan

PaVScan is a hypervisor-paging [180] aware virus scanner that operates outside a VM's context, working directly on the raw VM memory state. I have built PaVScan over the popular open source anti-virus project ClamAV [40], and used its publicly available virus signature database. PaVScan searches for signatures of known viruses inside the VM's memory using the Aho-Corasick algorithm for pattern matching, and works by building and traversing a finite state machine from the signature database. PaVScan bypasses the Frame Datastore and interfaces with the memory crawler directly to get the live handle on the target VM's memory. Once VM memory is exposed, the virus signatures are scanned directly over the raw memory. While an obvious main advantage of PaVScan is its ability to perform out-of-band virus scanning, this is not unique to this work [89]. The key differentiating aspect of PaVScan is that it tracks hypervisor paging, that is, the guest-agnostic reclaim of VM memory by the hypervisor. Using the crawler's interface to the hypervisor, PaVScan identifies which VM page frames are actually mapped in physical RAM and which are paged out to disk. It then scans only the RAM-backed VM memory, scanning the rest of the pages when they originally get paged out. This prevents unnecessary and costly page-ins from disk and ensures that the VM's working set does not get "thrashed" by the virus scan operation. PaVScan presents a prime example of how deep, across-the-stack knowledge of system state can crucially impact across-system performance in the cloud. Traditional in-VM scanning techniques (for viruses or otherwise) are limited by their guest-only scope.

Figure 4.6: Measured crawling latencies and achievable monitoring frequencies (log scale). [Plot: basic and full crawl latencies [ms] and the corresponding monitoring frequencies [Hz] for Xen and KVM, against the 10Hz target line.]

Their actions, oblivious to the broader operating environment view, can be severely detrimental to both the system itself that they are monitoring and other instances sharing the cloud. In the case of virus scanning, a typical in-VM scanner (or even a paging-unaware out-of-VM solution) will indiscriminately scan all memory pages, as neither the guest nor the scanner is aware of any guest pages that have been paged out. Every such page access will cause a page-in from swap and potentially a page-out of a currently-loaded page (so long as the hypervisor doesn't give this VM more memory), severely impacting the performance of other applications on the VM. Depending on overall resource allocation and resource scheduler actions, this paging behavior can further impact other instances sharing the same memory and I/O resources in the cloud. In contrast, PaVScan's paging-aware, out-of-VM scanning approach operates with negligible impact on the monitored systems, while providing the same level of system integrity. Section 4.4.4 compares PaVScan with an in-VM scanner.
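A minimal sketch of the paging-aware scanning loop; resident_pfns stands in for the hypervisor-provided residency information and scan_page for the ClamAV/Aho-Corasick matcher, both hypothetical interfaces here.

```python
def paging_aware_scan(read_phys, resident_pfns, scan_page, page_size=4096):
    """Scan only the guest page frames currently backed by host RAM.
    Paged-out frames are skipped, avoiding costly page-ins. A real
    scanner would also carry matcher state across contiguous resident
    pages so that signatures spanning page boundaries are not missed."""
    hits = []
    for pfn in sorted(resident_pfns):
        data = read_phys(pfn * page_size, page_size)
        hits.extend(scan_page(data))
    return hits
```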

4.4 Evaluation

To evaluate NFM’s performance, I use the cloud analytics applications to answer the following questions:

1. How frequently can runtime VM information be extracted with NFM?

2. How accurate is NFM’s out-of-band VM monitoring?

3. Can NFM perform better than existing in-VM techniques with its holistic view of the cloud operating environment?

4. How does out-of-VM monitoring improve operational efficiency in the cloud?

5. What is the overhead imposed on the VMs being monitored?

6. What is the impact of out-of-band monitoring on co-located VMs?

7. What is the space overhead of storing across-time (forensic) frame data in the datastore for each cloud instance?

Figure 4.7: CPU utilization: in-VM top vs. CTop. [Plot: CPU utilization [%] over time [s] for in-VM top and CTop.]

The experimental setup consists of physical systems with Intel Core i5 processors, 4-64GB of memory and VT-x hardware virtualization support. The hosts run Linux 3.7.9, Xen 4.2.1 and QEMU-KVM 1.4. The VMs run a variety of Fedora, RedHat and Ubuntu distributions (both 32- and 64-bit) in the experiments, to ensure that my crawlers work with a range of OS versions. The analytics backend runs as a specialized VM with baked-in datastore and API components. The benchmarks used in the experiments are bonnie++ v1.96 [145], x264 v1.7.0 [120], and httperf v0.9.0 [117].

4.4.1 Latency and Frequency of Monitoring

The amount of time it takes for the crawlers to extract runtime VM information varies with the richness of the desired information. This experiment measures the time required to extract basic process-level resource use information for the CTop application, as well as deeper, full system information including the system configuration, OS, module, process, file, CPU, memory, and network connection data. All times are averaged over several runs while varying configured VM memory and CPUs; the in-VM workloads emulate heavy process fork/kill behavior and stress the CPU, memory, disk and network (500 runs for each configuration). For Xen, there is a one-time-only operation for getting an initial handle on a target VM's memory (Section 4.2.1). This takes on average 0.37s per GB of VM memory. After this one-time operation, the crawler takes an average of 0.165 ms to extract the basic state and 4.5 ms for full state. For KVM, there is no separate memory handle acquisition step. The time taken to extract basic and full VM state information is 2.5 ms and 47.4 ms respectively. Figure 4.6 summarizes these results and highlights the corresponding achievable monitoring frequencies. As shown in the figure, full system crawls can be performed over 20 times/s (Hz) for KVM and 200 times/s for Xen (dashed horizontal lines). These times do not include the frame indexing time into the datastore, as this is off the critical path and is bypassed by the realtime monitoring applications, where latency matters. In either case, the results show that NFM can easily operate at a 10Hz monitoring frequency (solid horizontal line), which more than meets the practical requirements of most cloud applications. The crawler has a negligible memory footprint and its CPU usage is proportional to the desired monitoring frequency and the number of VMs being monitored. In the KVM setup, for example, the crawler utilizing a full CPU core can monitor 1 VM at 20Hz or 20 VMs at 1Hz. Thus, there also exists a tradeoff between the time granularity of monitoring and the number of VMs that can be monitored in parallel. Summary: The crawlers extract full system state in less than 50ms, and NFM's live out-of-VM monitoring can easily operate at a 10Hz frequency.

Figure 4.8: top vs. CTop: comparing LAMP processes across 3 VMs to explain httperf statistics.

4.4.2 Monitoring Accuracy

I use the CTop application here to validate NFM’s accuracy. A custom workload is run in the target VM, which dynamically varies its CPU and memory demand based on a configurable sinusoidal pattern. Figure 4.7 shows how well CTop tracks the CPU resource use for a VM process with respect to top. The memory results are similar. The slight variation in measurements is due to the inevitable misalignment of the update cycles / sampling time points, as the two solutions operate asynchronously. Overall, the out-of-VM CTop monitoring is very accurate and reliable. The average sample variation between in-VM top and CTop metrics is very low, ranging between 4% and 1% at different time scales. Summary: Out-of-VM monitors built atop NFM accurately track VM process and resource usage information, providing the same level of fidelity and time granularity as in-VM techniques.

4.4.3 Benefits of Holistic Knowledge

As previously discussed, NFM is privy to both in-VM and out-of-VM resource use measures. This unified, holistic view of systems enables significantly better resource management in the cloud. The quantitative elements of this capability are demonstrated here with a webserver application distributed across 3 identical instances, each running a LAMP stack. The VMs face a high incoming HTTP load that causes them to run at full utilization, the load originating from three separate httperf clients (one per server VM) generating identical request streams for 2MB files. While the VMs' configurations are identical, their current allocation of resources is not. Due to varying contention and priorities with respect to other colocated instances, the three VMs receive a 100%, 70% and 30% share of the CPU respectively. Figure 4.8 demonstrates the holistic view of application performance characteristics with CTop and contrasts this with the VM-only view of in-VM top.

Figure 4.9: Httperf reply rate and connection drops with various virusscan configurations. [Plot: reply rate (replies/s) and % connections dropped vs. request rate (requests/s), for in-VM virusscan, out-of-VM virusscan, and no virusscan.]

Figure 4.8 shows the httperf statistics observed for each of the three VMs (top chart), and the CPU utilization of the Apache, PHP and MySQL processes inside each of these VMs as measured by top (middle chart) and as derived by CTop (bottom chart). As seen, top's CPU utilization statistics look the same for all three VMs and thus cannot be used to explain the different httperf sustained request rates, observed bandwidth and response times across the three VMs. However, by using CTop's CPU utilization metrics (Section 4.3.2), which dissolve VM boundaries and normalize the utilization of all of the LAMP processes across the application's instances at the host level, a clear CPU utilization difference can be spotted across the LAMP processes in different VMs. Thus, although the CPU utilization of the LAMP processes looks very similar when viewed inside the VMs in isolation, the true picture is in fact very different, as captured by CTop. The true resource utilization across the application's distributed instances clearly explains the application-level performance variability in the httperf statistics. Summary: The unified, holistic view of the cloud environment enables accurate monitoring and interpretation of distributed application performance.

4.4.4 Operational Efficiency Improvements

I quantify the efficiency improvements achievable with NFM by evaluating virus scanning (representative of common scanning/inspection operations in the cloud) in the experimental framework. I use PaVScan as the out-of-VM scanner and install an identical in-VM scanner in the test VM, and compare their impact on the test VM's primary workload. The VM is configured with two VCPUs, where a web server runs as the primary application on one VCPU and the in-VM virusscan application runs on a separate VCPU to avoid introducing any CPU contention. Hypervisor paging is enabled on the host via Xenpaging, reclaiming 256MB from the VM's configured 1GB of memory. A medium-workload httperf server is set up on the VM with a working set of 256MB, from which it serves 2KB random-content files to 3 different httperf clients running on 3 separate machines. The file size is chosen to be 2KB so that the server is not network bound. The average base-case runtime for the virus scanner was 17.5s to scan the entire 1GB of VM memory. All httperf server statistics that follow are averaged over 5 runs. Figure 4.9 shows the reply rates that can be sustained by the webserver VM in this setup. Specifically, with the virusscanner turned off, the webserver VM is able to match an incoming request rate of up to 2400 requests/s without dropping any connections. PaVScan matches this reply rate very closely, reaching 2395 replies/s with only 0.27% connection drops.

Figure 4.10: Httperf over 10 rounds (each bounded between successive vertical lines); virusscan starts between rounds 2-3. [Plot series: response_time (ms), timeout_errors (#), avg_reply_rate (#/s) over time in seconds.]

On the other hand, httperf experiences a major performance hit with the in-VM scanner, where around 30% of connections are dropped at request rates higher than 900 requests/s. Even at 900 requests/s, a drop rate of 5% is observed, meaning that the actual sustainable rate is even lower. Effectively, there is a decrease in performance of more than 63% (from 2400 to 900 serviced requests/s) with in-VM scanning. Additionally, response times degrade by an order of magnitude. The virusscanner's own running time degrades from 17.5s to around 59s. Furthermore, the performance impact of the in-VM virus scanner lasts much longer than just the scan duration. Figure 4.10 shows that it takes a much longer time for httperf to recover after the in-VM virusscanner destroys httperf's file cache due to swapped page-ins. Shown are 10 httperf rounds (bounded by vertical lines in the figure), each servicing 256MB of file requests, fired at a rate of 2100 requests/s. The in-VM virusscanner starts between rounds 2-3 and completes by round 5. As can be seen, it takes about 7 rounds (∼460 seconds) for httperf to re-establish its file cache for sustaining the incoming request rate. For the entire 10 rounds, there is an 18.6% performance degradation in terms of serviced requests per second and 12.5% connection drops. Thus, in this hypervisor paging scenario, the in-VM scanner's impact is felt long after it exits. In contrast, the out-of-VM scanner shows only negligible impact on the guest during and after its execution. Summary: The ability to push common cloud operations out of the VMs' scope can lead to dramatic improvements in system performance compared to their in-VM counterparts.

4.4.5 Impact on VM’s Performance

I measure NFM's overhead on a target system's workload with three experiments. I expose the live VM state to (i) monitor the VM with CTop, (ii) hash the VM's memory, and (iii) scan the VM's memory with PaVScan. All experiments are repeated 5 times. The target VM and workload setup are similar to those in Section 4.4.4 (except that hypervisor paging is turned off). I use two workload configurations for the webserver VM: (i) a 256 MB working set to avoid guest swapping, serving all requests from memory without accessing the disk, and (ii) a 512 MB working set that involves the standard kswapd swap daemon. The httperf server is pushed to CPU saturation in both cases.

Figure 4.11: Impact on webserver VM with parallel out-of-band monitoring and management

The host system is never overloaded in the experiments. Figure 4.11 shows the performance impact on the VM for both workload configurations, when the three described out-of-VM monitoring applications are run against the same VM.

Realtime Monitoring: The impact of monitoring the webserver VM at a 10Hz frequency is measured, while extracting the full system state (Section 4.4.1) in addition to CTop's per-process state at each monitoring iteration. No drop is recorded in the VM's serviced request rate, but the response time degrades by about 2%, for the 256 MB working set only.

Hashing VM's Memory: As a stress-test benchmark for memory crawling, all of the VM's memory pages are hashed with the Mhash library's MD5 hashing [112]. Even this benchmark has no visible impact on the VM's httperf server capacity, as it continues to sustain the original request rate. However, the average response time degrades from 0.8 ms/request to 4.5 ms/request for the 256 MB working set scenario, while remaining within the timeout threshold. For the 512 MB working set, the response time degrades by only 1.5%.

VM Memory Scanning: The virus scanner prototype, which is close to a worst-case application, introduces a 2.9% degradation in the httperf sustainable request rate with an average response time of 3.3ms for the 256 MB working set. In this case, the VM runs at its absolute limit, continuously serving requests directly from memory. Interestingly, the higher httperf working set size of 512MB (involving kswapd) records no impact on the httperf server capacity with PaVScan running simultaneously, as the application occasionally requires new page-ins, which is the common case for most practical applications. Note that the 256MB scenario represents a transient phase; the 'no kswapd' constraint cannot be sustained beyond a few httperf rounds. Summary: For practical applications, the target VMs are not heavily impacted when subject to high-frequency monitoring and complex operations with NFM.

4.4.6 Impact on Co-located VMs

Here, I measure the performance overhead introduced on a VM (X) while other co-located VMs on the same host are being monitored. Since VM X itself is not being monitored, any impact on its performance can be regarded as a side effect of monitoring the other colocated VMs. I use Xen for a clean administrative-domain / guest-domain separation in this experiment. The memory crawler runs inside Dom0 itself, and monitors VMs at 10Hz, extracting the full system state from the VM at each iteration.

Figure 4.12: State management overhead with delta frames. [Plot: daily delta-frame sizes over 22 days, shown as a percentage of the base frame and (×10^-4) of VM memory size, for deltas w.r.t. the base image and w.r.t. the prior sample, covering dynamic resource info and core process info.]

Alongside Dom0, three other VMs run on their separate cores on a quad-core host. The impact on each VM is measured separately, while monitoring the other two VMs. The 3 VMs are put under stress, each running a different workload: (i) the CPU-bound x264 video encoding benchmark, (ii) the disk-bound bonnie++ disk benchmark, and (iii) a full-system stress workload simulated by an httperf webserver configured exactly as described in Section 4.4.5. By virtue of Xen's frontend/backend driver model, Dom0 is responsible for arbitrating the VMs' access to network and disk. Disk caching at the hypervisor is disabled so that true disk throughputs can be measured. The host system (including Dom0) itself is never overloaded, for repeatable and consistent measurements. While monitoring the remaining 2 VMs at a 10Hz frequency, and repeating each experiment 5 times, neither the bonnie++ VM nor the x264 VM sees any impact on its read/write throughput or frame rate, respectively, but the httperf server VM's maximum sustainable request rate drops by 2.2%. For the latter, as in Section 4.4.5, relaxing the artificial "no guest swapping" constraint and thus increasing the working set size from 256MB to 512MB results in no impact on VM performance. An important point to further note is that a 10Hz frequency is actually high for most cloud monitoring and analytics operations, and the crawler's impact can be further minimized by operating at lower but still largely practical frequencies of 1Hz or less. Moreover, NFM's decoupling of execution from monitoring favours handing these operations off to other lightly loaded or dedicated hosts, further minimizing crawler impact. Summary: NFM is lightweight and does not impose a heavy monitoring side-effect on the host's colocated VMs.

4.4.7 Space Overhead

This experiment evaluates the storage requirements for maintaining VM state history in the frame datastore to enable across-time system analytics. I use the two delta frame extraction approaches described in Section 4.2.3. Shown in Figure 4.12 are the delta frame sizes relative to the base frame and relative to the previous frame, for a daily crawl of a VM over a 3-week period, while running various routine tasks and additional end-user applications. When computing deltas over the base frame (top curve), the frame sizes grow from 2.5% to 3% of the base frame size. However, when deltas are computed over the most-recently crawled state, the frame sizes do not grow over time (lower curve), averaging around 1.5% of the full frame size. Also, as can be seen, the amount of information that must be kept for each VM for across-time analysis is minuscule compared to the actual VM sizes (delta frame sizes are only 0.00015% of the VM's 4GB memory size). In the experiments, the full frame sizes vary around 300KB-400KB and the delta frames are about 3KB-8KB. Scaling these numbers out for a hypothetical cloud deployment with 10,000 VMs, the overall space overhead for the datastore with daily system snapshots and a year-long time horizon amounts to 14GB to 33GB, which is quite manageable even without any potential across-VM optimizations. Summary: The overall space overhead of maintaining VM state across all instances and across time is manageable and is potentially scalable to large-scale cloud deployments.

4.5 Summary

This Chapter introduced Near Field Monitoring, a fundamentally different approach to systems monitoring and analytics in the cloud. I showed that traditional in-VM techniques and newer virtualization-aware alternatives are not a good fit for modern data centers, and addressed their limitations with my non-intrusive, out-of-band cloud monitoring and analytics framework. NFM decouples system execution from monitoring and analytics functions, and pushes these functions out of the systems' scope. The NFM framework provides always-on monitoring, even when the monitored systems are unresponsive or compromised, and works non-intrusively by eliminating any need for guest cooperation, modification or runtime interference. I described the implementation of NFM across two virtualization platforms, and presented four prototype applications built on top of the NFM framework to highlight its capabilities for across-systems and across-time analytics. The evaluations showed that cloud instances can be accurately and reliably monitored, in realtime, with minimal impact on both the monitored systems and the cloud infrastructure.

Chapter 5

Cloning and Injection based VM Inspection and Customization

This Chapter presents an alternate out-of-band monitoring solution, CIVIC, that overcomes NFM's limitations arising out of its raw memory byte-level visibility into the guest. Specifically, CIVIC's approach circumvents the requirements inherent to VMI-based monitoring, including the need for (i) a deep understanding of OS-specific kernel data structures to reconstruct logical information from the extracted raw memory view, (ii) exposing an entire OS-like view (/proc etc.) for pre-existing monitoring software, or (iii) writing fresh monitors using introspection directly. CIVIC avoids this fragile dependence on kernel data structures and the functionality duplication effort by operating at a logical OS level and reusing the vast stock monitoring software codebase, but in a separate isolated environment. As with NFM, CIVIC liberates the guest from the intrusion and interference of monitoring agents. However, CIVIC does not provide all of the benefits of NFM; for example, CIVIC lacks holistic knowledge by virtue of reusing traditional agents. On the other hand, it enables a broader usage scope beyond NFM's basic passive (read-only) monitoring, by supporting actuation or on-the-fly introduction of new functionality without the fear of negatively impacting the guest system. CIVIC stands for 'Cloning and Injection based VM Inspection and Customization'. The bigger goal CIVIC tries to achieve is enabling safe customization of a VM's internal behavior, similar to the flexibility that exists for VM-external components. Systems monitoring then becomes one of CIVIC's use-cases. Specifically, while virtualization technology has made it possible to reconfigure application resource allocation [175, 91, 180, 74, 190] and placement [98, 41, 183] at run time, this flexibility does not currently extend to the internal behavior of the VM. Once a VM is deployed in production, it typically becomes a hands-off entity in terms of restrictions on inspecting or customizing it, i.e., tuning it or adding new functionality to it. This stems from the risk of causing an intrusion or failure in an otherwise functional system that is running production workloads. If this risk can be mitigated, deep inspection of production VMs can lead to valuable insights that depend upon runtime state, and thus cannot be easily gathered during testing or staging of these systems. It also enables customization operations such as updating long-running applications without downtime, tuning configuration parameters to the current workload, or diagnosing the cause of recurring transient performance problems. Such live VM inspection and customization presents significant challenges. Real-world application behavior is a function of both the underlying system and software, as well as runtime system state.

Such state is impacted by incoming load, interaction with other running components, system configuration, and administrative policies, amongst other things. Recreating a given condition in another environment (like a more permissive VM) is both challenging and time-consuming. On the other hand, the risk associated with making changes to live VMs is not unfounded.

CIVIC enables safe inspection and customization of production VMs on-the-fly, without requiring any pre-requisites to be built into the VMs. It achieves this in two steps. First, it creates a live replica of the production VM (the source), including its runtime state, in a separate isolated sandbox environment (the clone). This liberates the source from the resource overheads, runtime interference and even installation of inspection or customization software and their associated dependencies ('safe customization'). Second, it uses runtime code injection to introduce new userspace-level functionality over the replicated VM state. This avoids enforcing any guest cooperation or modification, thereby enabling providers to offer VM customization as a service. An example is automatic tuning of configuration parameters for popular applications (e.g., Apache, MySQL) based on runtime load. Such a service would be transparently available to any VM running on the provider's infrastructure just by virtue of executing on a CIVIC-capable host.

CIVIC differs from existing alternatives that can provide such execution-customization decoupling in the following ways. As mentioned before, it enables efficient reuse of the vast stock software codebase, overcoming the software incompatibility and functionality duplication effort of VMI-based monitoring and inspection solutions [71, 80, 161]. Unlike most redirection-based solutions [155, 73, 184, 69, 68] that install in-guest handler components (and are slow), CIVIC restricts all customization operations to clones, and does not cause any guest intrusion and interference, which would be unacceptable in a production VM. CIVIC enables introducing new post-hoc customizations, unlike other VM replication solutions, which do not support introducing new functionality [48, 98] or analysis at OS- or application-level semantics [60, 36]. Further details can be found in Section 2.5.1.

CIVIC employs code injection to avoid causing vendor lock-in or imposing guest cooperation, which would otherwise be required when access to the clones is to be granted to the service providers in an as-a-service model. The former defeats cloud portability [149, 97] by causing guest specialization via installation of vendor-specific in-VM software components [173, 168, 172, 114]. The other option, i.e., enforcing guest cooperation by sharing credentials or granting login access to the cloud provider, aggravates manageability concerns at cloud scale [69], as well as auditing concerns [70], by giving the cloud admin read-write access to the user VM. On the other hand, granting permission to clone the VM only allows read access, and injection then enables write access only on the clone, without directly impacting the source. There is also a positive side-effect of using injection and not access as an entry route: it can potentially troubleshoot systems with a dysfunctional userspace from within the kernel itself, such as the recent Google outage [43] where login (SSH) became unavailable.
CIVIC's execution-customization decoupling approach enables experimentation with possibly intrusive, heavyweight, or speculative post-hoc customizations without the fear of negatively impacting the original guest system. I demonstrate four such use-cases in this Chapter. The first use-case is injecting and running monitoring agents inside the clone on behalf of the source, agents that are not desirable to install or run in the source VM itself. An example is the buggy agents in the Amazon Elastic Block Store (EBS) service that caused severe EBS performance degradation [11]. The second example attaches an anomaly detector to a live Cassandra [99] service on-the-fly, instead of baking it directly into the base service. The latter is spared the analysis' intrusion and interference, while the detector is allowed more aggressive analysis for improved accuracy. The third example enables a risk-free exploratory diagnosis of a webserver memory leak. Instead of further degrading source webserver capacity by troubleshooting the leak within it, the problematic runtime cache state gets replicated, and debugging tools introduced, in a webserver clone. Finally, the fourth use-case shows how CIVIC clones can be employed to perform faster, risk-free and on-the-fly webserver configuration-parameter autotuning. The evaluation shows that CIVIC is nimble and lightweight in terms of clone activation time (6.5s) as well as memory footprint (30MB transfer size, 80MB resident set size (RSS)), and has a low impact on the source/guest VM (<10%).

5.1 CIVIC’s Design

CIVIC builds on top of whole-system replication and live cloning constructs [48, 98] to swiftly create a low-overhead clone of the guest VM, which acts as a customization-enabling proxy system containing the original guest's runtime state. The desired application functionality is then added to the clone through runtime code injection from the hypervisor into the clone's kernel. This hotplugged functionality can range from realtime operations, such as periodic system health checks by injecting monitoring agents into the clone (Section 5.4.1), to deep inspection, such as diagnosing the root cause of memory leaks by introducing debugging tools into the clone (Section 5.4.3). Although CIVIC's customization might not be applicable to all scenarios, it caters to a wide range of use-cases, as discussed later in Section 5.1.1.

In CIVIC, the instant a clone is created, it shares the same runtime state as the source. After this stage, however, the clone operates independently and can deviate from the source, while the latter always runs unperturbed. CIVIC optionally enables freezing the clone's userspace, so that its inspected state refers to the source's runtime state at the time of clone creation. In the above examples, the clone remains static in the monitoring case, but runs live in the diagnostics example. CIVIC limits its customization operations to within the clone, and does not target merging the clone's state back into the source. It reports all customization outcomes as recommendations to the VM user. Propagating a validated customization to the source is application-specific. One possibility is to push the customization outcome (such as tuned configuration settings, a leak fix, or a validated patch) to the source using the same injection mechanism as in the clone. Another alternative is making the clone the new source, or an interim source while the original gets customized with the validated procedure.

As illustrated in Figure 5.1, CIVIC employs the following sequence of operations to enable VM customization without interfering with guest system operation, or requiring any sort of guest modifications or cooperation to access the target systems.

1. Disk COW: In CIVIC, both the source and the clone operate as two independent VMs sharing the same base disk image. Ensuring consistent disk access for both VMs requires (i) preserving a snapshot of the disk image at the time of cloning, to prevent the source's future changes to disk from being reflected on the clone, and (ii) creating a COW slice for the clone, to prevent disk state discrepancy in the opposite direction.

2. Live Migration: The clone in CIVIC is created by live migrating a snapshot of the source VM. A postcopy migration technique [77] is employed, which allows fetching memory on demand (pull-based approach) instead of transferring the full source VM memory at once (push-based approach). This minimizes the clone initiation latency, and limits the clone's resource consumption to only what is required for a particular customization application, as opposed to a full-blown copy.

Figure 5.1: CIVIC's architecture; step-by-step description in the Design section. [Diagram: source VM and its clone sharing COW disk and memory snapshots with on-demand memory transfer, a hotplugged persistent store carrying new app software, a clone-private NIC, and the original userspace optionally frozen in the clone.]

CIVIC does not enforce any restriction on the placement of clones: either locally on the same host as the guest VM, or on a separate host; the choice depends on user preference, host resource constraints, as well as the target use-case.

3. Memory COW: Since in CIVIC's adoption of postcopy migration the source remains active, there is also a requirement for COW memory on the source. This ensures that the clone's on-demand memory accesses remain consistent with the source's memory state at the time of cloning.

4. Disk and NIC Hotplugging: In CIVIC, the clone's modifications to its COW slice are not preserved on its exit. This also circumvents the consistency and management concerns of preserving, merging and updating a previous clone's disk state modifications into a future clone, because a clone's disk view is based upon the source's disk state at the time of cloning, which might differ from when a previous clone was created. But in many cases, customization applications across successive cloning iterations may require intermediate state from previous ones, such as configuration settings, log files, and licenses. To enable such state persistence, all clones of a particular source are hotplugged with the same additional disk, referred to as a persistent store. For network consistency, in terms of preventing IP/MAC conflicts, source-configured networking is disabled on the clone. However, there are cases where the clone requires network access, such as for the communication exchange (e.g., monitored data) between an application and its backend, as well as when traffic is partially or fully duplicated to the clone for tasks such as network filtering, runtime diagnostics and performance tuning. For such cases, the clone is hotplugged with its own NIC. Another alternative is to use network spoofing on the clone side, to let the OS or applications keep believing they are using the same network configuration as before.

5. Code Injection: With the clone VM set up, the next step is to introduce inspection or customization functionality into its runtime. Depending upon the use-case, the customization software could be a stock monitoring agent, a debugging tool, or simply a root shell capable of accessing files and listing processes and connections. However, it is not feasible or acceptable to expect or enforce all corresponding software components to be resident inside the guest. Even if new application software were to somehow find its way inside the clone, since I assume no guest cooperation as well as no artifacts in the guest, this poses a challenge in initiating the application without any helper scripts inside the guest, or the login credentials for clone shell access in the first place. To enable application entry into the clone, the hotplugged persistent store from the previous step also acts as a storehouse of all application software components: the binaries, related libraries, helper packages, configuration files, and an application loader script. Then, code injection from the hypervisor into the clone's kernel transfers control to the loader script. The script finally sets up the clone OS' operational environment and initiates inspection and customization. Section 5.2.6 details the script's operations, such as optionally freezing the replicated userspace in the clone to preserve its state for inspection.
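As a rough illustration of steps 4 and 5, the sketch below shells out to standard libvirt tooling for the hotplug operations; the domain and device names are placeholders, and the injection step is represented only by a stub since it is CIVIC-specific (Section 5.2.6).

```python
import subprocess

def run(cmd):
    subprocess.run(cmd, shell=True, check=True)

def inject_loader(clone_domain):
    """Placeholder for CIVIC's hypervisor-side kernel code injection, which
    transfers control to the loader script on the persistent store."""
    raise NotImplementedError("kernel code injection (Section 5.2.6)")

def activate_clone(clone_domain, persistent_store_img):
    """Steps 4-5 once the clone VM exists; steps 1-3 (COW disk/memory and
    postcopy cloning) are CIVIC-internal and omitted here."""
    # 4a. Hotplug the persistent store carrying agent binaries, libraries,
    #     configuration files and the loader script.
    run(f"virsh attach-disk {clone_domain} {persistent_store_img} vdb")
    # 4b. Hotplug a clone-private NIC (source-configured networking is off).
    run(f"virsh attach-interface {clone_domain} network default")
    # 5.  Inject control transfer into the clone's kernel to mount the
    #     persistent store and start the loader script.
    inject_loader(clone_domain)
```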

5.1.1 Discussion

Usage Scope: CIVIC enables a hotplugged customization-as-a-service offering in the cloud, with the end-user having the choice to opt into this service and bearing the cost for the service cycles spent, but without the associated cooperation, intrusion and interference hassles. Isolating inspection and customization impact in the clone allows for a wide range of potentially heavyweight, intrusive or speculative operations including: (i) systems monitoring such as compliance scans and virus scanning, (ii) sandboxing: experimenting with different procedures and applying the optimal configuration on to the original system, for example patch validation [167] and live diagnostics and remediation [153], (iii) deep inspection such as application debugging, analysis and optimization [36, 46], and (iv) proactive analytics such as network filtering [33], malware detection [73], and fault injection [100, 96].

Privacy: Cloning a user’s runtime environment can raise privacy concerns, but it provides the same kind of visibility and access as do VMI-based [71, 161] and redirection-based solutions [68, 69], or others that operate inside the original guest context itself. I argue in favour of the same kind of trust, with the added advantage of guest operations remaining free of any intrusion and interference from inspection and tuning workflows, similar to VMI. Also, trusting the hypervisor is a fundamental assumption common to all such hypervisor-facilitated solutions. This trust can be established using a self-serving cloud model [26], as well as cloud auditing [70].

Generality: While I have implemented CIVIC for Linux guests, it is equally applicable to Windows as well as Mac OSes. The only in-guest (clone-side) component in CIVIC’s design is code injection, which has been shown to work for other OSes as well [35, 13, 118, 157].

Side-effects: A side-effect of the clone’s independent existence arises when the source communicates with an external entity, say, a database backend. In this case, replicated requests from the clone may corrupt the backend. Depending upon the use-case, these side-effects may be handled by: (i) freezing the corresponding process in the clone, (ii) identifying the duplicates at the backend by assigning a unique IP to the clone, or (iii) synchronously cloning all components of the system (e.g., the webserver and the database) together, assuming the side-effect observer isn’t the external world.

5.2 Implementation

I now describe in detail the implementation specifics of CIVIC’s building blocks of post-copy live migration, copy-on-write disk and memory, device hotplugging, GDB access to the VM kernel, as well as kernel code injection from the hypervisor. The rationale behind these operations is already covered in Section 5.1. These operations are orchestrated by a userspace bash script running on the host machine. Although my implementation is on the KVM/QEMU platform, these underlying constructs exist for the Xen hypervisor as well [187, 186, 127, 185, 188].

5.2.1 Disk COW

In order to ensure consistent disk access, on the clone side I use QEMU’s redirect-on-write [2] feature to create a COW slice on top of the source’s disk, so that the clone’s writes are redirected to a separate location rather than the original image file. On the source side, I use a simplistic approach of running the source VM in QEMU’s snapshot mode, and enforce a write back (commit) of its disk state before cloning. A better alternative would be to employ a copy-on-first-write approach [62] on the source, wherein the source continues to write to the same disk image as before, but before a block gets overwritten, its original contents are saved to a separate location. Still, the explicit commit step above ensures that the proper impact on the source is incorporated in CIVIC’s evaluation.
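A minimal sketch of this disk setup using standard QEMU tooling is shown below. The image paths and VM names are placeholders, and the exact commit invocation depends on how the source VM's snapshot mode is configured.

```bash
#!/bin/bash
# Sketch: clone-side COW overlay plus source-side commit before cloning.
# Paths and VM names are placeholders.
SRC_IMG=/var/lib/images/source.qcow2
CLONE_OVERLAY=/var/lib/civic/clone-cow.qcow2

# Flush the source's snapshot-mode writes back into its base image
# ('commit all' is QEMU's HMP command, issued here via libvirt's monitor passthrough).
virsh qemu-monitor-command src-vm --hmp 'commit all'

# Create a qcow2 overlay backed by the (now up-to-date) source image;
# the clone's writes land in the overlay, leaving the base image untouched.
qemu-img create -f qcow2 -b "$SRC_IMG" "$CLONE_OVERLAY"
```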

5.2.2 Live Migration

My original design achieved source VM cloning by employing live pre-copy migration [41]. But due to its large memory footprint and high instantiation latency, as described before in Section 5.1, the current CIVIC implementation uses post-copy migration [77] instead, which fetches memory on demand similar to the VM Fork abstraction [98] in Xen. There exist different independent implementations of post-copy migration in QEMU [49, 78, 152, 144]. I use the latest flavour [49], which builds on top of Linux’s userfault kernel feature [5] that enables handling memory page faults in userspace. QEMU uses this functionality to trap the clone’s accesses to remote memory pages, fetch them from the source, and move them into the clone VM process’ address space using the remap_anon_pages system call [5]. I make two modifications to the default implementation. First, while in a typical post-copy setting the original VM is paused while the migrated VM fetches memory from it, in CIVIC the source is allowed to resume operations as the primary/production VM (COW enabled for consistency, see next subsection), while the secondary migrated instance operates as the proxy clone. Second, the default implementation also has simultaneous pre-copy iterations that make the clone’s memory footprint similar to the source’s. Thus, to make CIVIC clones lightweight, I minimize these pre-copy transfers by only allowing the initial state fetch during clone initiation via pre-copy, and thereafter switching to pure post-copy, i.e. memory-on-demand fetches.
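For reference, the sketch below shows how a post-copy clone could be launched through QEMU's monitor interface. Host names, ports and the capability spelling are illustrative; in the experimental QEMU snapshots of that era the capability was still named x-postcopy-ram, and it must also be enabled on the destination's monitor. CIVIC's modified QEMU additionally keeps the source running instead of pausing it.

```bash
#!/bin/bash
# Sketch: post-copy "migration" used as cloning. Names, ports and images are placeholders.

# Destination: start a receiving QEMU instance for the clone, waiting for the migration stream.
qemu-system-x86_64 -enable-kvm -m 2048 \
    -drive file=/var/lib/civic/clone-cow.qcow2,if=virtio \
    -incoming tcp:0:4444 &

# Source monitor: enable post-copy and kick off the transfer
# (the same capability must also be set on the destination's monitor; omitted here).
virsh qemu-monitor-command src-vm --hmp 'migrate_set_capability postcopy-ram on'
virsh qemu-monitor-command src-vm --hmp 'migrate -d tcp:127.0.0.1:4444'

# Switch from the brief initial pre-copy phase to pure memory-on-demand post-copy.
virsh qemu-monitor-command src-vm --hmp 'migrate_start_postcopy'
```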

5.2.3 COW Memory

The source memory COW implementation follows the common approach of saving the original memory pages to a holding area inside QEMU on source writes, and servicing the clone’s page-fetch requests based upon dirty flags: either from the source’s memory directly for an unset flag, or from the holding area otherwise. I assume hardware-assisted paging (EPT) support on the host so as to obtain a convenient trap point for the source’s memory accesses inside KVM (the alternate shadow-paging / soft-mmu code route would require multiple trap points and a non-trivial QEMU communication path). I build upon HotSnap’s implementation [47] to enable COW snapshotting in QEMU. Briefly, I trap different sources of writes to guest memory as follows:

1. Guest writes: Triggered by hardware on write faults by the guest. Write protection is set using QEMU’s cpu_physical_memory_set_dirty_tracking() routine, and KVM’s KVM_SET_USER_MEMORY_REGION with the KVM_MEM_LOG_DIRTY_PAGES flag. Caught in KVM’s handle_ept_violation() routine.

2. DMA writes: Triggered by DMA accesses to guest memory of the form DMA_DIRECTION_FROM_DEVICE inside QEMU; these do not trigger a page fault due to the memory-mapped interface.

3. KVM writes: Caused by direct writes by KVM into guest memory for operations such as clock updates (kvm_guest_time_update()), setting MSR registers (record_steal_time()), instruction emulation (emulator_write_phys()), etc. Captured before the kvm_write_guest() / kvm_write_guest_cached() calls in such routines.

4. QEMU writes: Includes all other QEMU-internal writes to the guest memory that update QEMU’s dirty memory bitmap, ram_list.dirty_memory[]. Triggered by a variety of sources such as vga, framebuffer, vhost, etc. Caught mostly inside immediate callers to QEMU’s cpu_physical_memory_set_dirty*() family of functions.

Depending upon the source of writes, I modify KVM to either inform QEMU to directly save the existing memory page contents before being dirtied, or save the original page contents locally and send them to QEMU via Linux’s copy_to_user().

5.2.4 Hotplugging

To enable IO communication on the clone with the outside world, as discussed in Section 5.1, the clone is hotplugged with its own NIC using QEMU’s device_add / host_net_add utilities, as well as with a persistent store disk using QEMU’s device_add / drive_add functionality. This requires the standard PCI hotplug modules (acpiphp, pci_hotplug) to be loaded into the guest, which are available by default in most distributions.
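A sketch of these hotplug operations through libvirt's monitor passthrough follows. Device and backend names (pstore, civicnet, the tap interface) are placeholders; the netdev_add spelling shown here is the newer equivalent of the legacy host_net_add command used in this work, and the exact option set varies across QEMU versions.

```bash
#!/bin/bash
# Sketch: hotplug a persistent-store disk and a clone-private NIC into the clone VM.
CLONE=src-vm-clone
PSTORE_IMG=/var/lib/civic/persistent-store.qcow2

# Persistent store: add a host-side drive, then expose it as a virtio disk in the guest.
virsh qemu-monitor-command "$CLONE" --hmp \
    "drive_add 0 file=$PSTORE_IMG,if=none,id=pstore"
virsh qemu-monitor-command "$CLONE" --hmp \
    'device_add virtio-blk-pci,drive=pstore,id=pstore-dev'

# Clone-private NIC: create a tap backend and attach a virtio NIC to it.
virsh qemu-monitor-command "$CLONE" --hmp \
    'netdev_add tap,id=civicnet,ifname=tap-clone,script=no,downscript=no'
virsh qemu-monitor-command "$CLONE" --hmp \
    'device_add virtio-net-pci,netdev=civicnet,id=clone-nic'
```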

5.2.5 Code Injection

I use code injection as a means to initiate analysis and customization operations inside the clone, to avoid requiring (i) guest modifications in terms of installing helper scripts inside the guest, as well as (ii) guest cooperation in terms of login credentials to access the clone’s shell. The basic goal with code injection is to run an application loader script residing in the hotplugged persistent store, inside the clone OS’s userspace from the hypervisor. This is achieved with the following sequence of operations:

1. Attach to the clone’s kernel using QEMU’s gdbserver GDB stub [136].

2. Inject machine code into an empty memory buffer. The code performs the following operations:

(a) Save registers
(b) Mount the hotplugged disk
(c) Exec the application loader script
(d) Restore registers
(e) Return to caller

3. Break at clone kernel’s schedule() function.

4. Redirect flow to injected code (replace NOP instruction by jmpq)

5. Restore control flow (restore original NOP instruction)

6. Detach GDB from clone.

Employing GDB simplifies the (static) code injection procedure, such as modifying (set) the clone’s memory directly, setting runtime break-points and corresponding commands to execute, as well as locating the addresses (info address) of the relevant kernel functions (alternatively, these can be read from the System.map symbol table file). An alternative to the GDB dependence is to inject code through QEMU’s dynamic binary translation engine together with its Tiny Code Generator (TCG) API [193]. Section 5.1.1 addresses the privacy concerns associated with this CIVIC stage. I now detail the technical specifics required for some of the above-mentioned steps.

Memory buffer (Step 2): In the current implementation, I use an empty memory buffer at 16MB + 1 page from the beginning of physical memory. Since the kernel text mapping starts at physical address 0, this buffer corresponds to the second page after the 16MB of ZONE_DMA. Across different kernel versions, I found this empty space to be sufficient for the 290 bytes of injected code. This is just for implementation convenience; alternatives to obtain an empty memory buffer include trapping or redirecting control to kmalloc() [35], or hotplugging extra memory to the clone [73, 191, 166].

Mount and Exec (Steps 2b, 2c): In order to use the filesystem hosted inside the persistent store, the hotplugged disk device must first be mounted inside the clone OS. Achieving this from kernelspace boils down to calling the do_mount() kernel function with the appropriate register and memory arguments set: device name, mount point, filesystem type. The next step after mounting the disk is to run the application loader script residing there, so as to set up the clone OS environment for customization operations. To run this script in the clone OS’ userspace, I invoke the kernel function call_usermodehelper_setup() with the path to the loader script as an argument, which in turn sets up a process structure to be ultimately fed as an argument to call_usermodehelper_exec(). The script can optionally be run with real-time priority [105, 106] for immediate execution. Directing control into userspace then allows the usual OS-exported functionality to be leveraged by the application software.

schedule() (Steps 3, 4, 5): My original implementation hijacked the system call handler in the clone’s kernel [135], to replace the current system call context (register contents containing the system call number and corresponding arguments) by that of mount or exec. I found this to be too intrusive, with inconsistencies and side-effects for the victim process, whose state should be preserved for possible inspection in the clone. I thus selected a redirection point inside the kernel’s schedule() function to maintain state consistency. To aid my cause further, I was able to leverage a NOP instruction (xchg %ax,%ax) inside it, which gets replaced with a jmpq to the injected code and is ultimately restored after the injected operations complete. The NOP instruction is again not a necessity but a convenience; alternatively, another instruction could be saved, replaced and restored while still following the same injection flow as above. The injected code also includes instructions to disable/enable appropriately as needed by the different injected operations. Also, although schedule() serves as a clean interception point for code injection, the clone may miss capturing some state, e.g. a task may die before schedule() gets called. Thus, CIVIC would benefit from a more robust code injection mechanism so as to control or capture clone deviation.
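As an illustration of driving such an injection sequence through QEMU's gdbstub, the sketch below scripts GDB non-interactively from the host. All addresses, byte patterns and file names are placeholders (real values would come from System.map or info address, and from the assembled injection blob); unlike CIVIC's actual flow, which breaks at schedule() and lets the injected code restore state, this sketch simply patches the redirection site while the debugger has the VM halted.

```bash
#!/bin/bash
# Sketch: hypervisor-side code injection into the clone's kernel via QEMU's gdbstub.
# Assumes the clone's QEMU exposes a gdbstub (e.g. started with "-gdb tcp::1234").
# All addresses, byte patterns and file names below are placeholders.

INJECT_ADDR=0x1001000                  # empty buffer: 16MB + 1 page into guest memory
CODE_BLOB=/var/lib/civic/inject.bin    # pre-assembled: save regs, mount pstore, exec loader, restore
JMP_SITE=0xffffffff816a0b10            # placeholder: address of the NOP inside schedule()
JMP_BYTES=0x00000000000025ff           # dummy value; a real run uses the actual jump encoding

gdb -batch \
    -ex 'set pagination off' \
    -ex 'target remote localhost:1234' \
    -ex "restore $CODE_BLOB binary $INJECT_ADDR" \
    -ex "set {unsigned long}$JMP_SITE = $JMP_BYTES" \
    -ex 'detach'
```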

5.2.6 Application Loader Script

The previous code injection stage results in an application loader script running inside the clone OS’s userspace. Controlling the clone, now an inspection- and customization-enabling proxy for the source, is straightforward using OS-exported functionality. Specifically, the loader script performs the following tasks (‘once’ below refers to operations performed only for the first clone):

1. [Optional:] Pause userspace processes (kill -STOP -1).

2. Disable source-configured networking (ifdown).

3. Enable networking on the hotplugged NIC (ifup).

4. Set up the clone runtime environment:

(a) [Once:] Mirror the root partition (‘/’) on the hotplugged persistent store disk, automatically via yum --installroot.
(b) Update executable paths: prepend the mirrored /bin to $PATH.
(c) Update the dynamic linker’s run-time bindings: prepend the mirrored /lib to the ldconfig search path.

5. [Once:] Install application software and redirect paths in config files to point to the persistent store’s ‘/’ sub-tree.

6. Run the application, either directly from the persistent store’s /bin or via symbolic links in ‘/’ pointing to the persistent store’s ‘/’.

The optional freezing of userspace processes enables preserving the source’s runtime state for inspection inside the clone. Section 5.3 shows how I use this feature to track the source’s process-level resource utilization via the clone as a proxy. The need for network reconfiguration is described in Section 5.1, and also highlighted in Section 5.4’s applications. Mirroring the root partition inside the hotplugged disk enables state persistence, as well as dependency resolution along the lines of chroot jails [104]. Application software installation typically involves downloading dependency packages, and if the installation process is allowed to run as is, these helper binaries and/or libraries will be installed inside the standard ‘/’ directory (/bin, /lib). Due to CIVIC’s design, these changes will not persist across cloning rounds. One way of avoiding this is to require all dependency packages to exist (or be installed) in the source. But this enforces guest cooperation and modification, which is against CIVIC’s design principles. A better alternative is to instead redirect installation of all software components inside the clone to a mirrored root partition inside the hotplugged persistent store, and to update the linker/loader bindings and executable paths correspondingly. A positive side-effect of such a chroot-like approach is that the source is spared from package pollution and potential dependency conflicts. The end product of this last CIVIC stage is a proxy clone completely set up for analysis and customization on behalf of the source, without any source intrusion or enforced cooperation.
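The following is a minimal sketch of such a loader script, under the assumption that the persistent store appears inside the clone as /dev/vdb1 and is mounted at /pstore. Device names, interface names, the package manager and the example agent are illustrative placeholders; a full implementation would guard the 'once' steps with a marker file rather than the simple directory check used here.

```bash
#!/bin/bash
# Sketch of CIVIC's application loader script (runs inside the clone's userspace).
# Device names, mount points and package names are placeholders.
set -e

PSTORE_DEV=/dev/vdb1      # hotplugged persistent store
PSTORE=/pstore

# 1. [Optional] Freeze the replicated userspace to preserve the source's state for inspection.
kill -STOP -1 || true

# 2./3. Drop the source-configured network, bring up the clone-private NIC.
ifdown eth0 || true
ifup eth1

# 4. Set up the clone runtime environment around the persistent store's mirrored root.
mount "$PSTORE_DEV" "$PSTORE"
if [ ! -d "$PSTORE/root/bin" ]; then                      # [Once] mirror '/' into the store
    yum --installroot="$PSTORE/root" -y groupinstall core
fi
export PATH="$PSTORE/root/bin:$PSTORE/root/usr/bin:$PATH"
echo "$PSTORE/root/lib $PSTORE/root/usr/lib" > /etc/ld.so.conf.d/civic.conf
ldconfig

# 5. [Once] Install the customization software into the mirrored root.
yum --installroot="$PSTORE/root" -y install collectd

# 6. Run the application from the persistent store.
"$PSTORE/root/usr/sbin/collectd" -C "$PSTORE/etc/collectd.conf"
```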

5.3 Performance Evaluation

I evaluate CIVIC’s performance by answering the following:

1. What is CIVIC’s memory footprint in terms of the clone’s memory usage, memory transferred and memory COW-ed?

2. How much time does it take to get the clone ready, including time spent during migration, hotplugging, and code injection?

3. What is the impact on the source VM in terms of downtime and workload performance degradation?

Along with CIVIC clones I also create precopy clones, just to observe and quantify the savings that postcopy offers CIVIC in my setup.

Setup: The host is a 4-core Intel i5 @ 2.8GHz machine, with 16GB memory and Intel VT-x and EPT hardware virtualization support. The software stack includes a Linux-3.19 host OS with KVM and userfault [12] support, and QEMU 2.1.50 with postcopy migration support [50]. Guest VMs are configured with 1 CPU core, {1G, 2G, 4G} RAM, and Linux OS {3.2/Ubuntu, 3.4.4/Fedora}. The reported metrics in the experiments below (as well as in the Applications Section 5.4) are averaged across at least 3 runs. The same host runs both the source and clone VMs (full copy, no page sharing), with the QEMU migration transfer rate set to 1Gbps. The host is assumed to have sufficient resources to run the clone VM. For high consolidation/contention scenarios, the clone can be run on a separate lightly-loaded host; the CIVIC orchestration script (Section 5.2) itself has negligible resource cost.

5.3.1 Memory Cost

I vary the memory load on the source while cloning it, and measure the clone’s memory usage, memory transferred and memory COW-ed. The measurements are made across three different source memory load configurations: fresh_idle, malloc_static, and dirty_dynamic. The first configuration refers to a freshly booted idle VM, while the static and dynamic memory-use configurations touch about 75% of the VM’s memory, with the latter continuously re-dirtying it. I use the stress utility [1] to achieve the target memory usage in the source. For the first two source configurations, the CIVIC clone’s memory consumption (measured as resident set size (RSS)) is <=80MB, the amount of memory COW-ed is <=2.7MB, and the amount of memory transferred during migration is around 30MB, irrespective of the source VM size. For the dirty_dynamic configuration, all metrics are equivalent to the working set size, i.e. ∼75% of the source VM size. This is when the memory dirtying workload is run as-is inside the clone VM as well; in an alternate scenario the process would optionally be frozen in the clone, leading to lean clones as in the other two configurations. Further reduction in the memory footprint for host-local clones can be achieved by augmenting CIVIC with a page sharing optimization. On the other hand, as is to be expected, the precopy clone’s RSS and transfer size increase linearly with the source VM size, as shown in Figure 5.2. Note that for the dirty_dynamic configuration, precopy migration over the network can only complete when the source’s memory dirty rate is less than the network bandwidth (it completed only for working set sizes <=35MB in my experiments).

Summary: CIVIC’s footprint depends upon the source’s working set size; for scenarios (such as monitoring) that aren’t heavily dependent on full system state, the clones are lightweight, with a 30MB transfer size and 80MB RSS.


Figure 5.2: Measuring memory footprint of CIVIC’s postcopy+COW clones and precopy clones, for different source VM sizes and memory use configurations


Figure 5.3: Measuring clone instantiation time for CIVIC’s postcopy+COW clones and precopy clones, for different source VM sizes and memory use configurations. The expected curve for source size-independent postcopy clones should be a horizontal line but for experimental variation.

5.3.2 Clone Instantiation Time

Figure 5.3 shows how the clone instantiation time varies with source VM size for the different source memory-use configurations described in Section 5.3.1. Compared are the end-to-end times including disk snapshotting, migration, hotplugging, and code injection costs, up until the application loader script execution inside the clone. CIVIC’s instantiation time is independent of the source VM size as well as its memory-use configuration, whereas the VM size affects precopy cloning time linearly. The results remain the same when the memory load on the source VM is replaced with a CPU-intensive workload (sysbench prime computation [9]). CIVIC’s cloning agility is limited by its dependence on the userspace QEMU and GDB components, which together take about 6.5s to get the clone up and running, with the stage-wise breakdown being about 0.2s for VM initialization, 4s for disk snapshotting and hotplug, and 2.3s for GDB code injection. The clone instantiation time can be reduced by having the persistent store and NIC for clone operations hotplugged (but inactive) ahead of time inside the source, but this introduces source modification, which is against CIVIC’s design principles. A better alternative is to reuse the same clone across successive rounds by only fetching the delta from the latest source state. This would take away the hotplugging and GDB overheads and lead to sub-second clones with a customization-ready environment. This serves as a possible future optimization to CIVIC.

Summary: CIVIC clones are ready in at most 6.5s, irrespective of the source VM’s working set size.

5.3.3 Impact on Source VM

To measure CIVIC’s impact on the source VM, I periodically clone it while running the following three workloads individually inside the source: (i) the x264 video encoding CPU benchmark [120] (v1.7.0) with ∼350MB of memory footprint, (ii) the bonnie++ disk benchmark [145] (v1.96) processing 4GB of data sequentially, and (iii) a full-system webserver benchmark, httperf [117] (v0.9.0), serving distinct 2KB random-data files worth 512MB in working set size to clients running on separate machines. Additional measures were taken to ensure true benchmark measurements, such as using high-performance virtio drivers in the guest, disabling disk caching at the hypervisor (and QEMU snapshot’s writeback caching for bonnie++), ensuring the network isn’t a bottleneck, and pushing the webserver to saturation. Each cloning iteration performs the periodic monitoring tasks of compliance scan, healthcheck and resource monitoring (Section 5.4.1), by tracking open files and connections, loaded modules, running applications, system logs and resource utilization metrics. CIVIC’s clone instantiation time (Section 5.3.2) limits the monitoring frequency to 0.1Hz. The average degradation of the source’s workload was observed to be 5.2% on x264’s framerate, 1.2% on bonnie++’s disk throughputs, and 10% on the webserver’s maximum sustainable capacity (i.e., serviced requests per second, without any connection drops), the latter attributable to work queue backlogging (Section 3.3.4, Chapter 3) due to minor VM stuns (see downtime below). In the case of monitoring, as well as inspection that isn’t heavily dependent on full system state, a majority of the source’s memory would not be transferred over to the clone (memory-on-demand). Thus, to additionally account for higher-level analysis and customization tasks like anomaly detection and autotuning (Sections 5.4.2, 5.4.4), I also let a cloned instance of the webserver operate in parallel with the source. In the case of httperf, during the cloning process both the source and the clone see 5-6% degradation on maximum sustainable capacity. Thereafter, both are able to operate at peak capacity as recorded for the source pre-cloning.

Finally, for measuring VM downtime, I use fping to fire ICMP ECHO_REQUEST packets at the source VM with a 100ms timeout, and count the failed, i.e. timeout-expired, requests. The VM downtime was recorded to be 0.4 seconds. Although the source impact is low for the 0.1Hz maximum cloning frequency supported by my current implementation, it would translate to a heavy impact at higher frequencies. For such high-frequency use-cases, the 0.4s source VM stuns per cloning iteration would need to be minimized. Also, a lower impact can be expected with the potential optimization of reusing the same clone instance across cloning rounds (Section 5.3.2).

Summary: CIVIC has a low impact on the source VM, reaching 10% degradation with continuous (0.1Hz) cloning.
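A downtime measurement along these lines can be scripted as below; the IP address and probe count are placeholders, and lost probes are simply multiplied by the probe period to approximate the stun duration.

```bash
#!/bin/bash
# Sketch: approximate source-VM downtime during cloning by counting dropped pings.
SRC_IP=192.168.122.10    # placeholder source VM address
PERIOD_MS=100            # probe every 100ms, with a 100ms per-probe timeout

# -q: summary only, -c: probe count, -p: period (ms), -t: per-probe timeout (ms)
summary=$(fping -q -c 600 -p "$PERIOD_MS" -t 100 "$SRC_IP" 2>&1)
sent=$(echo "$summary" | sed -n 's/.*xmt\/rcv\/%loss = \([0-9]*\)\/\([0-9]*\).*/\1/p')
recv=$(echo "$summary" | sed -n 's/.*xmt\/rcv\/%loss = \([0-9]*\)\/\([0-9]*\).*/\2/p')
echo "approx downtime: $(( (sent - recv) * PERIOD_MS )) ms"
```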


Figure 5.4: Measuring CPU usage with collectd agent in source (left) and clone (right)

5.4 Applications

This section highlights CIVIC’s versatility by describing how I have used CIVIC in different settings to facilitate a variety of inspection and customization operations. These scenarios don’t necessarily require CIVIC to operate, but in most cases CIVIC works better than other alternatives, as described below (more details in Section 2.5.1). First, the VMI-based solutions [80, 161] would be incompatible with the stock software employed to address these scenarios. Second, the redirection-based solutions [155, 184, 69] would be very slow and cause guest intrusion and interference by running handler components inside the source VM, which is precisely what some of these scenarios attempt to avoid. Finally, most can be achieved by live cloning solutions [48, 98], either by directly accessing the clones via credentials/SSH, or through backdoors or hooks installed beforehand in the source VM [173, 172]. But the former approach hurts cloud manageability [69] and auditing [70], and wouldn’t work when inspecting dysfunctional systems as in the Google outage example [43], while the latter approach defeats cloud portability [149, 97]. CIVIC’s injection-based clone access avoids imposing such guest cooperation and vendor lock-in.

5.4.1 Safe Agent Reuse

While NFM provides one alternative to agent-based monitoring, it needs to either replicate an OS-like view for pre-existing software, or create new monitors over its frames abstraction. Instead of this functionality duplication effort, CIVIC enables reusing the vast monitoring software codebase while providing similar isolation benefits by virtue of restricting agents to the clone. The tradeoff is a lack of the holistic knowledge that NFM tools possess, which makes them more accurate (Section 4.4.3). To provide simple visual evidence, I use process-level resource tracking as an example of agent-based monitoring. I have tested CIVIC successfully against three such agents: an internal custom agent, a closed-source enterprise-level agent, IBM Tivoli Endpoint Manager (BESClient [84]), and a popular open-source monitoring agent, collectd [66] (v4.10.4-1). The agents were run as-is, with their config files updated to point to the persistent store’s mirrored root partition for installing the agent software and associated dependencies (libraries and/or helper binaries). In this experiment, I use collectd to track the resource-use metrics for a custom source workload (the same as the one used in NFM’s accuracy evaluation, Section 4.4.2) that varies its CPU and memory utilization sinusoidally. The workload gets frozen in the clone, with its runtime state analyzed by collectd injected on successive cloning iterations. Housing this agent (one of eventually many such) in the clone avoids installing up to 77 packages on the source. The exact count depends upon whether just the core package (collectd-core) is installed as opposed to a full installation (collectd) of the daemon including the configuration, and also upon the number of dependent packages already installed on the system. To illustrate the performance of a CIVIC clone as a runtime monitoring proxy for the source, Figure 5.4 compares the workload’s CPU-usage tracking by collectd inside the clone with the expected curve had the agent run inside the source (memory graphs are similar; not shown). Section 5.3.2 discusses how the clone’s 0.1Hz monitoring frequency can be improved, in comparison to the 1Hz configured for in-source monitoring.
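As an illustration of this "agent in the clone" setup, the snippet below installs collectd into the persistent store's mirrored root and points its writable paths there. The mount point, release version and config directives are illustrative placeholders rather than CIVIC's exact configuration.

```bash
#!/bin/bash
# Sketch: install a stock monitoring agent (collectd) into the persistent store's
# mirrored root, keeping the source (and the clone's '/') free of new packages.
PSTORE_ROOT=/pstore/root        # placeholder mount point of the mirrored root

# Pull collectd and its dependencies into the mirrored root instead of '/'.
yum --installroot="$PSTORE_ROOT" --releasever=20 -y install collectd

# Redirect the agent's writable paths into the persistent store so data
# survives across cloning rounds (exact directives depend on the plugins used).
cat >> "$PSTORE_ROOT/etc/collectd.conf" <<'EOF'
BaseDir  "/pstore/var/lib/collectd"
PIDFile  "/pstore/var/run/collectd.pid"
Interval 10
EOF

# Launch the agent from the mirrored root inside the clone.
"$PSTORE_ROOT/usr/sbin/collectd" -C "$PSTORE_ROOT/etc/collectd.conf"
```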

5.4.2 Anomaly Detection

CIVIC enables a service model where secondary functionality gets offered as hotpluggable components available for user subscription, instead of baking it into the primary base service. Employing clones to provide the add-on functionality (such as a diagnostics framework) allows the latter to be as intrusive or destructive as need be, while also isolating possibly conflicting functionalities (or base service modifications) in their own separate environments (clones). I adopt an anomaly detection use-case to highlight this capability, by enabling SAAD [72] (Stage Aware Anomaly Detector) to be hotplugged to a base Cassandra service. The need to modify stock Cassandra makes SAAD a more intrusive example amongst other static analysis and instrumentation-based alternatives [102, 52, 197]. But all of these benefit when employed in CIVIC’s restriction-free clones, as described below. Also, SAAD serves as an example for a whole class of Java-based services that can be automatically ported over CIVIC by using JVM classloader-level class-hotswapping.

SAAD [72] is an efficient logging-based runtime anomaly detector that collects stage-level log summaries from storage servers, and proactively analyzes them to capture rare execution flows and unusually high flow durations. SAAD requires two modifications to be made to the base Cassandra codebase: (i) augmenting log messages with log IDs and stage indicators, and (ii) adding ID tracking and stage-synopsis forwarding functionality to the Java logger library. CIVIC enables a user to run an unmodified Cassandra service, while clones equipped with SAAD are added to the worker pool on-the-fly. CIVIC preserves the source’s replicated runtime state in the clone by avoiding a Cassandra service re-instantiation (i.e. without terminating stock Cassandra and initiating SAAD Cassandra). In order to achieve this, I utilize JVM classloader-level class-hotswapping to replace and reload running instances of the stock Cassandra classes and the Java logger library with their SAAD versions. I use the JRebel tool [194] to achieve hotswapping without making any changes to the JVM or the stock Cassandra code. Requests to all or some of the original Cassandra service’s worker nodes can be mirrored completely or partially, and redirected to the clone(s) under analysis, triggered possibly after suspicious info-level (default) log messages in the source. Along with the source service being spared from the analysis’ intrusion and interference, moving SAAD into the clone further allows more fine-grained data to be logged from within Cassandra itself, as well as importing global system state into the analysis. This can improve SAAD’s anomaly detection accuracy and quality. Examples include SAAD’s enhancement with (i) verbose logging and (ii) syscall tracing, as follows.

Verbose Logging: Figure 5.5 highlights SAAD’s capability enhancement when offered as a restriction-free service, in this particular case by enabling debug-level logging on-the-fly in the clone. Some of the new SAAD capabilities, such as false alarm reduction, benefit from multiple clones to test anomaly reproducibility for tuning SAAD’s prediction. Such verbose logging is not recommended in a production system since it causes heavy performance degradation, such as a 25.8% impact on Cassandra’s throughput in my experiments when operated against a typical YCSB (Yahoo! Cloud Serving Benchmark) [45] workload with 25% read, 25% update and 50% insert proportions.

Figure 5.5: SAAD under CIVIC: enhancements on enabling debug mode in clones, in addition to stock SAAD capabilities (dashed box). Stock capabilities: detects anomalous control flows; detects performance anomalies for normal flows; assists in avoiding fault-masking repercussions. Debug-mode enhancements: raises critical alarms only (reduced false positives); provides richer alarm context (method parameters, call graphs, stack traces); better root cause diagnosis (internal vs. external, software vs. hardware); suggests possible fixes.

Syscall Tracing: Since SAAD operates by prefixing log IDs to pre-existing log statements, its accuracy in pointing to the anomalous method can be off in case log statements are not prolific enough. In these cases (and generally for finer-granularity anomaly detection), CIVIC enables SAAD to incorporate syscall tracing in its analysis while limiting the associated overhead to the clone. Then, based on abnormal syscall patterns, SAAD’s anomaly detection can be improved. An example is Cassandra bug #5064 [31] (an ‘alter table’ command causing a hang), where the bug source isn’t trivial to pinpoint using stock SAAD due to sparse log messages, but has been shown to exhibit an abnormal syscall pattern ({sys_gettimeofday, sys_futex, sys_futex, sys_gettimeofday}) for synchronization from a buggy infinite loop [52]. While stock SAAD would only report a performance anomaly in a particular ‘stage’ but not the more precise identification of the anomalous method, the latter can be achieved by adding (potentially heavyweight) tracing support to SAAD.

5.4.3 Problem Diagnostics and Troubleshooting

CIVIC aids in analyzing system performance degradation without the fear of making the issue even worse and impacting performance further. Some issues are better studied with the appropriate profiling, debugging, or instrumentation tools such as gdb, sysstat, strace, or PIN. However, as mentioned before, these may be too intrusive and heavyweight for a production system. By replicating a troubled system’s runtime state inside a clone, and introducing debugging tools in it, CIVIC enables risk-free exploratory diagnosis while absorbing the associated impact and side-effects. Since this use-case requires an admin to perform diagnosis, a root shell can also be injected using CIVIC in cases where the admin (e.g., IT) is not the VM owner.

I highlight this by capturing and fixing PHP memory leaks (Bugs #45161 and #65458 [134]) in my custom webserver setting. The apache + PHP-FastCGI webserver serves incoming user requests for data that it caches from a backend server (a database) based upon time-to-live (TTL) values. When the TTLs expire, the webserver either fetches fresh data from the backend, or otherwise renews their TTL until the next synchronization cycle. Figure 5.6(a) shows two different memory usage patterns for different proportions of data with expired TTLs: 10% and 90%. Although the increased memory consumption in the latter case could be because of fresh data being fetched from the backend, the webserver logs indicate no such activity. Another cause could be the slightly higher number of PHP processes (together with a slightly higher RSS per process) in the latter case, as shown in Figure 5.6(a)’s bottom-most curves. However, this can only account for 35MB of extra memory usage, not the overall 170MB explosion seen between the top-most curves.

Figure 5.6: Count as well as memory usage of PHP processes in a webserver, for different proportions of cached data with expired TTL. Compared across 3 different PHP versions with memory leaks, fixed between v5.1.6 and v5.6.10. (Panels: (a) PHP 5.1.6, (b) PHP 5.3.20, (c) PHP 5.6.10.)

At this point, troubleshooting this apparent memory leak on the webserver VM itself could degrade its sustainable request rate by further polluting its memory cache and/or adding debugging/instrumentation load to the system. Furthermore, the production system might not have the instrumentation frameworks installed there and one wouldn’t want to perturb the environment even more. To enable risk-free diagnosis, CIVIC replicates the problematic runtime cache state into a webserver clone, introduces diagnostics tools, and redirects a fraction of incoming requests to it.

In this example, I attach strace to the apache/php processes to measure their memory usage trends. This essentially boils down to capturing the mmap()/munmap() and brk() system calls being made by the apache/php processes and recording their corresponding sizes [164]. I additionally confirmed that such memory usage measurement maps very closely to the usage recorded from inside a PHP application. Figure 5.7 plots the memory usage over time for 3 different PHP versions: v5.1.6, used in the webserver experiment of Figure 5.6(a); v5.3.20, which fixed one leak; and v5.6.10, with no leaks. For the leaks to show up, a connection does not even have to be made to the backend; simply setting up the URL and its headers via cURL/libcurl [4] suffices. Specifically, in the context of the webserver experiment under scrutiny, although new data was not being fetched from the backend, simply communicating with it for TTL renewal was enough to activate the memory leaks. The decreasing memory consumption across PHP versions helps with the troubleshooting case, wherein the webserver admin first uses strace to pinpoint the issue, then changes the PHP version (perhaps iteratively) till the problem (memory leak) gets resolved. The clone webserver’s memory usage across time on changing the PHP versions can be seen in Figure 5.6(b) and (c). The latter setting is what gets reported (or pushed) to the source to fix the issue. The diagnostics approach adopted in this particular example caused a further degradation of the clone webserver’s capacity by up to 12.5% and 20.5% for the two data-proportion cases, depending upon the tracing aggressiveness.
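A sketch of such an strace-based measurement, run inside the clone, is shown below; the process-matching pattern and output path are placeholders, and the awk post-processing only approximates the net allocation trend from mmap/munmap sizes.

```bash
#!/bin/bash
# Sketch: trace memory-management syscalls of the clone's PHP workers and
# approximate their allocation trend. Patterns and paths are placeholders.
OUT=/pstore/diag/php-mem.trace

# Attach to every php-cgi worker (-f follows children), tracing only the calls
# that grow or shrink the address space, with absolute timestamps.
strace -f -ttt -e trace=mmap,munmap,brk \
       $(pgrep -f php-cgi | sed 's/^/-p /') -o "$OUT"

# Rough post-processing: sum mmap'ed minus munmap'ed bytes across the trace.
awk -F'[(,)]' '
    /mmap\(/   { alloc += $3 }      # second argument of mmap is the length
    /munmap\(/ { alloc -= $3 }      # second argument of munmap is the length
    END { printf "net mmap delta: %.1f MB\n", alloc / (1024*1024) }
' "$OUT"
```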


Figure 5.7: Measuring PHP process’ memory usage via strace; Leaks detected in versions 5.1.6 and 5.3.20

5.4.4 Autotuning-as-a-Service

As a final demonstration of CIVIC’s usefulness, I propose to employ CIVIC clones to perform faster, risk-free and on-the-fly configuration parameter autotuning. I select webservers as target systems since untuned and badly tuned webservers are highly prevalent, and are the cause of approximately 40% of all web delays [54, 15, 121]. Furthermore, misconfigurations can also cause availability, performance, and security problems [198, 192]. Also, webserver tuning is an error-prone, time consuming (e.g. 25 minutes per run [147]), computationally intensive, and highly workload-dependent task [37, 195, 198, 56]. A contributing factor is the large search space; for example, apache has over 240 configurable parameters [198]. To achieve such runtime-specific tuning, webserver tuners employ different techniques such as heuristic searches [189], evolutionary strategies [147], simplex methods [198, 37], control system modeling [56], online tuning [159] and neural networks [195].

With CIVIC-enabled autotuning-as-a-service, users would simply need to deploy their webserver VMs (say, on Amazon’s cloud service) without having to predict what their traffic would look like, or worry about tuning their server themselves or via autotuners. With the webserver live, CIVIC would fork clone(s), carrying over the webserver’s runtime cache state onto the clone, followed by autotuner injection, and live request replication and redirection to the clone. Furthermore, multiple clones can be employed in parallel for a faster, more aggressive and less time-constrained space exploration. Host-local clones can help control the memory footprint by sharing cached webserver content pages. To quantify the potential gains from such a model, I run 3 different kinds of workloads against a base system configuration, and then tune the configuration to achieve the best performance per workload. The webserver workloads are as follows:

1. Httperf: A streaming type workload consisting of distinct 2KB files with a working set of 512 MB, with httperf [117] clients requesting 10 files per connection.

2. Surge: A web workload modeled to represent actual traffic serviced by commodity webservers [21]. It follows statistical models for file sizes, request sizes, file popularity, embedded references, temporal locality, and user think (off) times. The setup includes a 420MB working set spread across 20K files following a Zipf distribution.

3. Rubis: An auction website workload modeled after eBay [125], with typical user interactions involving browsing, bidding, buying or selling items, leaving comments, etc. All state is stored in a database server (MySQL, in experiments), which the webserver communicates with via PHP. The initial database consists of 42K items occupying 235MB, with the workload being Rubis’ default browsing+bidding mix containing 15% read-write interactions.

Figure 5.8: Webserver capacity variations with apache+kernel tuning, normalized to base capacity

In this experiment, the configuration space under tuning includes (i) apache (httpd.conf) parameters such as the core multi-processing module type, MaxClients, KeepAliveTimeout, ListenBacklog, etc., as well as (ii) kernel (sysctl) parameters such as net.ipv4.tcp_rmem, vm.swappiness, and fs.file-max, amongst others. The webserver VM is configured with 1 core and 1G RAM, while the incoming load is generated from separate client machines. The network subsystem was further ensured not to be a bottleneck to the webserver capacity. For each workload, Figure 5.8 plots the webserver’s capacity (serviced requests per second, without any connection drops, normalized to the base system configuration) for 4 configurations: base, the tuned configuration for the particular workload, as well as the best-performing configurations for the other two workloads. The results are similar for both a stock Fedora VM and a popular LAMP stack image from the Amazon AWS Marketplace (Bitnami LAMPStack) that comes pre-configured with tailored webserver components. As can be seen, no single configuration works best for all workloads, and the tuned configuration for one may not be ideal for another, leading to a performance range of 13% to over 500% of the base configuration in my experiments. Such extreme variations are not uncommon [3]. Finally, the base configuration works well only for the Rubis workload, where the throughput is constrained by the database server instead. This simple experiment illustrates the need for tuning, and the potential benefits to be had with CIVIC-enabled autotuning-as-a-service.
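To make the tuning knobs concrete, the sketch below applies one candidate apache + kernel configuration of the kind explored here and measures its sustained capacity; the specific values, paths and server address are illustrative placeholders, not the tuned settings reported in Figure 5.8.

```bash
#!/bin/bash
# Sketch: apply one candidate apache + kernel configuration to a (clone) webserver
# before replaying load against it. Values are illustrative placeholders.
HTTPD_CONF=/etc/httpd/conf/httpd.conf

# Kernel-side knobs (sysctl)
sysctl -w net.ipv4.tcp_rmem="4096 87380 6291456"
sysctl -w vm.swappiness=10
sysctl -w fs.file-max=200000

# Apache-side knobs (httpd.conf)
sed -i 's/^MaxClients.*/MaxClients 256/'           "$HTTPD_CONF"
sed -i 's/^KeepAliveTimeout.*/KeepAliveTimeout 2/' "$HTTPD_CONF"
sed -i 's/^ListenBacklog.*/ListenBacklog 1024/'    "$HTTPD_CONF"
apachectl -k graceful

# Measure the sustained capacity of this candidate with httperf
httperf --server 192.168.122.20 --port 80 --num-conns 5000 --rate 300 --timeout 5
```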

5.5 Conclusion

In this chapter, I’ve presented my CIVIC solution, which enables safe reconfiguration of a VM’s internal behavior, similar to the flexibility that exists for VM-external components of resource allocation and placement. All customization impact and side-effects are restricted to a runtime replica (the clone) of the original unmodified VM (the source). New functionality is introduced on-the-fly using runtime code injection into the clone. Not having pre-installed credentials, agents or hooks inside the original VM makes it possible for providers to offer VM customization as a service. I highlighted one such application with workload-specific live tuning of a webserver’s configuration parameters. I also demonstrated CIVIC’s versatility and benefits with three other use-cases. The evaluation showed that my approach is nimble and lightweight and has a low impact on the target systems.

Chapter 6

Conclusion and Future Work

In this thesis, I discussed why existing monitoring techniques are not a good fit for modern data centers, and presented two alternative solutions – NFM and CIVIC – that leverage virtualization for better systems monitoring. Both these solutions operate from outside the guest’s boundary, thereby decoupling monitoring from guest execution. They operate non-intrusively by eliminating any need for guest cooperation, modification or runtime interference. They enable an always-on monitoring environment supporting inspection of even dysfunctional or unresponsive systems. These have been designed in accordance with the ‘as-a-service’ cloud model, where the end users can seamlessly subscribe to various out-of-the-box monitoring, analytics and customization services, with no impact or setup requisites on their execution environments. Table 6.1 compares my two solutions along their salient features. As can be seen, both solutions have their own pros and cons. For example, CIVIC’s emphasis on stock software reuse means it misses out on the holistic knowledge (in-VM + host-level) that NFM tools incorporate. On the other hand, NFM tools perform passive or read-only monitoring, while CIVIC enables active inspection and fine tuning. A selection of one technique over the other depends on several factors such as target use-cases (passive monitoring vs. actuation), organizational host-hypervisor policies (modification authority and flexibility) and preference (software reuse vs. fresh tools support). Described in this thesis are four applications I’ve developed over NFM that showcase its ‘systems as data’ monitoring approach, and leverage familiar paradigms from the data analytics domain such as document differencing and semantic annotations to analyze systems. These include (i) a cloud topology discovery and evolution tracker application, (ii) a cloud-wide realtime resource monitor providing a more accurate and holistic view of guests’ resource utilization, (iii) an out-of-VM console-like interface enabling administrators to query system state without having to log into guest systems, as well as a handy “time travel” capability for forensic analysis of systems, and (iv) a hypervisor-paging aware out-of-VM virus scanner that demonstrates how across-stack knowledge of system state can dramatically improve the operational efficiency of common management applications like virus scan. I also highlighted CIVIC’s versatility in terms of stock-software-based, impact-free live customization, by demonstrating four of its use-cases: (i) attaching an intrusive anomaly detector to a live service, while also improving the detector’s capabilities, (ii) impact-heavy problem diagnostics and troubleshooting which would otherwise be prohibitive to perform on the guest VM, (iii) safe reuse of system monitoring agents that are not desirable to install or run in the guest VM, and (iv) automatic workload-specific live tuning of a webserver’s configuration parameters.


Properties                                            NFM                                 CIVIC
Guest View                                            Raw / byte-level                    Logical / OS-level
Underlying Technique                                  VMI                                 VM Cloning & Code Injection
Hypervisor Support                                    KVM and Xen                         KVM
Unmodified Guest VM                                   Yes                                 Yes
Unmodified Hypervisor                                 Yes                                 Partial (added basic COW and cloning constructs)
Guest-cooperation Free                                Yes                                 Yes
Always-on Monitoring                                  Yes                                 Partial (dysfunctional userspace manageable, not kernel)
Actuation Capability                                  No                                  Yes
Holistic Knowledge                                    Yes                                 No
Monitoring Frequency                                  High                                Low
Software Reuse                                        Partial (possible but hard)         Yes
OS-version Agnostic                                   No                                  Yes
Deployed in Production (IBM Research Compute Cloud)   Yes                                 No

Table 6.1: NFM vs. CIVIC

While exploring VM introspection techniques for NFM, I also developed new methods for low-latency live access to VMs’ memory from unmodified KVM/QEMU hosts, enabling subsecond monitoring of unmodified guests over NFM. Then, to perform a thorough comparative evaluation of these and existing VMI techniques, I organized them into a taxonomy based upon their operational principles: (i) whether guest cooperation is required; (ii) whether an exact point-in-time replica of the guest’s memory is created; (iii) whether the guest has to be halted; and (iv) the type of interface provided to access guest state. My quantitative and qualitative evaluation revealed that VMI techniques cover a broad spectrum of operating points. I showed that there are substantial differences in their speed (operating frequencies), resource consumption on the host, and overheads on target systems. These methods may be available out of the box on different hypervisors, or can be enabled by third-party libraries or hypervisor modifications, giving the user a choice between easy deployability and hypervisor specialization. Furthermore, higher performance may be extracted by modifying the hypervisor or host, yielding a performance vs. host/hypervisor specialization tradeoff. I also presented a detailed analysis of VMI’s memory consistency aspects, wherein I demonstrated the various forms of inconsistency in the observed VM state, both intrinsic to the OS and extrinsic due to live introspection. I showed that, contrary to common expectation, pause-and-introspect based techniques have marginal benefits for consistency despite their prohibitive overheads. I concluded my VMI exploration with a set of suggestions based on my experience with the different VMI techniques. I hope that my observations can help users in their choice of technique based on their use-cases, resource budget and deployability flexibility, tolerance for intrusiveness and workload impact, and requirement levels for latency, liveness and consistency.

Going forward, NFM has been extended to monitor popular ‘container’ application environments in a similar touchless manner as VMs [42]. Although the same kernel data structure traversal approach would suffice for container monitoring, an easier alternative is to monitor at a logical OS level by attaching to the container’s namespace (Linux’s setns() [107]). NFM has been deployed in IBM’s Research Compute Cloud, and is also included in IBM’s cloud platform offering [85]. NFM has been adapted to detect and monitor more than 1000 different system distributions (including distribution patches on top of official vanilla kernels), without requiring any manual configuration setup for target systems. This includes Linux kernel versions between 2.6.11 and 3.19 (years 2005 to 2015) [95]. There is further potential to employ NFM for better infrastructure management for cloud providers, by virtue of the in-VM grey-box knowledge that NFM possesses without the associated guest concerns (vendor lock-in due to guest specialization through in-VM hook installation). Examples include guiding VM sizing, placement and consolidation with in-VM and/or application-level resource utilization and demands. Further opportunities for interesting analytics arise from NFM’s treatment of systems as documents.
Diff-ing systems just like diff-ing documents can potentially make it easier to tackle modern-day cloud management concerns arising from aggressive cloud expansion, such as tackling system drift: tracking how a deployed system with an initial desired state deviates over time. Another example is across-the-cloud (anonymous) comparative systems analysis via similarity matching. Fingerprinting VMs to detect similarity can be used for flagging security risks when a ‘similar’ VM gets compromised, or for consolidating similar VMs together. As for CIVIC, two performance optimizations would enable more efficient operations: (i) reducing the memory footprint via page sharing for host-local clones, and (ii) improving the cloning frequency by reusing a single clone across successive rounds while fetching deltas from the latest source state. CIVIC opens up an interesting opportunity to allow systems monitoring, inspection, and diagnostics applications to incorporate more system-wide state into their behavior, and be as intrusive (e.g., tracing, profiling), destructive (e.g., fault injection, on-demand crash dump) or speculative (filtering incoming network packets, or observing behavior with different experimental procedures and selecting the optimal one) as need be, now that source impact is not a limiting factor (the impact is restricted inside the clone sandbox). To conclude, my work serves as a cornerstone for an analytics-as-a-service cloud offering with NFM and customization-as-a-service with CIVIC, and, as a contribution to the VMI community, organizes and contrasts existing (and proposed) VMI techniques to expose VM state.

Bibliography

[1] Amos Waterland. Stress. http://people.seas.harvard.edu/~apw/stress/.

[2] Anthony Liguori and Stefan Hajnoczi. QEMU Snapshots. http://wiki.qemu.org/Documentation/CreateSnapshot and http://wiki.qemu.org/Features/Snapshots2.

[3] Caleb Gilbert. Scaling Drupal: HTTP pipelining and benchmarking revisited. http://rocketmodule.com/blog/scaling-drupal-http-pipelining-and-benchmarking-revisited/.

[4] Daniel Stenberg. PHP cURL Manual. http://no1.php.net/manual/en/intro.curl.php.

[5] Jonathan Corbet and Andrea Arcangeli. Page faults in user space. http://lwn.net/Articles/615086/.

[6] Adam Boileau. Hit by a Bus: Physical Access Attacks with Firewire. RuxCon 2006. http://www.security-assessment.com/files/presentations/ab_firewire_rux2k6-final.pdf.

[7] Adam Litke. Use the Qemu guest agent with MOM. https://aglitke.wordpress.com/2011/08/26/use-the-qemu-guest-agent-with-memory-overcommitment-manager/.

[8] Ferrol Aderholdt, Fang Han, Stephen L. Scott, and Thomas Naughton. Efficient checkpointing of virtual machines using virtual machine introspection. In Cluster, Cloud and Grid Computing (CCGrid), 2014 14th IEEE/ACM International Symposium on, pages 414–423, May 2014.

[9] Alexey Kopytov. SysBench Manual. http://sysbench.sourceforge.net/docs/#database_mode.

[10] Amazon. CloudWatch. http://aws.amazon.com/cloudwatch/.

[11] Amazon. Summary of the October 22, 2012 AWS Service Event in the US-East Region. https://aws.amazon.com/message/680342/.

[12] Andrea Arcangeli. Linux Userfault. https://kernel.googlesource.com/pub/scm/linux/kernel/git/andrea/aa/+/userfault.

[13] Angelo Laub. Practical Mac OS X Insecurity. https://events.ccc.de/congress/2004/fahrplan/files/95-macosx-insecurity-paper.pdf.


[14] Anthony Desnos. Draugr - Live memory forensics on Linux. http://code.google.com/p/draugr/.

[15] Martin Arlitt, Diwakar Krishnamurthy, and Jerry Rolia. Characterizing the scalability of a large web-based shopping system. ACM Transactions on Internet Technology, 1(1):44–69, 2001.

[16] Mike Auty, Andrew Case, Michael Cohen, Brendan Dolan-Gavitt, Michael Hale Ligh, Jamie Levy, and AAron Walters. Volatility - An advanced memory forensics framework. http://code.google.com/p/volatility.

[17] Ahmed M. Azab, Peng Ning, Zhi Wang, Xuxian Jiang, Xiaolan Zhang, and Nathan C. Skalsky. Hypersentry: Enabling stealthy in-context measurement of hypervisor integrity. In Proceedings of the 17th ACM Conference on Computer and Communications Security, CCS ’10, pages 38–49, New York, NY, USA, 2010. ACM.

[18] S. Bahram, Xuxian Jiang, Zhi Wang, M. Grace, Jinku Li, D. Srinivasan, Junghwan Rhee, and Dongyan Xu. DKSM: Subverting Virtual Machine Introspection for Fun and Profit. In 29th IEEE Symposium on Reliable Distributed Systems (SRDS), pages 82–91, 2010.

[19] Mirza Basim Baig, Connor Fitzsimons, Suryanarayanan Balasubramanian, Radu Sion, and Donald E. Porter. CloudFlow: Cloud-wide Policy Enforcement Using Fast VM Introspection. In Proceedings of the 2014 IEEE International Conference on Cloud Engineering, IC2E ’14, pages 159–164, 2014.

[20] Arati Baliga, Vinod Ganapathy, and Liviu Iftode. Detecting kernel-level rootkits using data structure invariants. IEEE Trans. Dependable Secur. Comput., 8(5):670–684, September 2011.

[21] Paul Barford and Mark Crovella. Generating representative web workloads for network and server performance evaluation. In Proceedings of the 1998 ACM SIGMETRICS Joint International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS ’98/PERFORMANCE ’98, pages 151–160, New York, NY, USA, 1998. ACM.

[22] Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf Neugebauer, Ian Pratt, and Andrew Warfield. Xen and the art of virtualization. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, SOSP ’03, pages 164–177, 2003.

[23] Antonio Bianchi, Yan Shoshitaishvili, Christopher Kruegel, and Giovanni Vigna. Blacksheep: Detecting compromised hosts in homogeneous crowds. In Proceedings of the 2012 ACM Conference on Computer and Communications Security, CCS ’12, pages 341–352, New York, NY, USA, 2012. ACM.

[24] Bryan Payne. LibVMI Introduction - Vmitools - An introduction to LibVMI. http://code.google.com/p/vmitools/wiki/LibVMIIntroduction.

[25] James Butler and Greg Hoglund. VICE–catch the hookers. BlackHat USA Conference. 2004. http://www.blackhat.com/presentations/bh-usa-04/bh-us-04-butler/bh-us-04-butler.pdf.

[26] Shakeel Butt, H. Andrés Lagar-Cavilla, Abhinav Srivastava, and Vinod Ganapathy. Self-service cloud computing. In Proceedings of the 2012 ACM Conference on Computer and Communications Security, CCS ’12, pages 253–264, New York, NY, USA, 2012. ACM.

[27] Martim Carbone, Matthew Conover, Bruce Montague, and Wenke Lee. Secure and robust monitoring of virtual machines through guest-assisted introspection. In Proceedings of the 15th International Conference on Research in Attacks, Intrusions, and Defenses, RAID’12, pages 22–41, 2012.

[28] Brian D. Carrier and Joe Grand. A hardware-based memory acquisition procedure for digital investigations. Digital Investigation, 1(1):50–60, 2004.

[29] Andrew Case, Andrew Cristina, Lodovico Marziale, Golden G. Richard, and Vassil Roussev. Face: Automated digital evidence discovery and correlation. Digit. Investig., 5:S65–S75, September 2008.

[30] Andrew Case, Lodovico Marziale, and Golden G. Richard III. Dynamic recreation of kernel data structures for live forensics. Digital Investigation, 7, Supplement(0):S32–S40, 2010.

[31] Cassandra. Bug 5064: Alter table when it includes collections makes cqlsh hang. https://issues.apache.org/jira/browse/CASSANDRA-5064.

[32] Jin Chen, Saeed Ghanbari, Gokul Soundararajan, Francesco Iorio, Ali B. Hashemi, and Cristiana Amza. Ensemble: A tool for performance modeling of applications in cloud data centers. Cloud Computing, IEEE Transactions on, PP(99):1–1, 2015.

[33] Peter M. Chen and Brian D. Noble. When virtual is better than real. In Proceedings of the Eighth Workshop on Hot Topics in Operating Systems (HotOS), pages 133–138, 2001.

[34] Jui-Hao Chiang, Han-Lin Li, and Tzi-cker Chiueh. Introspection-based memory de-duplication and migration. In Proceedings of the 9th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, VEE ’13, pages 51–62, New York, NY, USA, 2013. ACM.

[35] Tzicker Chiueh, Matthew Conover, and Bruce Montague. Surreptitious deployment and execution of kernel agents in windows guests. Cluster Computing and the Grid, IEEE International Symposium on, pages 507–514, 2012.

[36] Jim Chow, Tal Garfinkel, and Peter M. Chen. Decoupling dynamic program analysis from execution in virtual environments. In USENIX 2008 Annual Technical Conference on Annual Technical Conference, pages 1–14, 2008.

[37] I-Hsin Chung and Jeffrey K. Hollingsworth. Automated cluster-based web service performance tuning. In Proceedings of the 13th IEEE International Symposium on High Performance Distributed Computing, HPDC ’04, pages 36–44, Washington, DC, USA, 2004. IEEE Computer Society.

[38] Citrix. Citrix XenServer 6.2.0 Virtual Machine User’s Guide. http://support.citrix.com/servlet/KbServlet/download/34971-102-704221/guest.pdf.

[39] Citrix, Inc. XenServer Windows PV Tools Guest Agent Service. https://github.com/xenserver/win-xenguestagent.

[40] ClamAV. Clam AntiVirus. http://www.clamav.net.

[41] Christopher Clark, Keir Fraser, Steven Hand, Jacob Gorm Hansen, Eric Jul, Christian Limpach, Ian Pratt, and Andrew Warfield. Live migration of virtual machines. In Proceedings of the 2nd conference on Symposium on Networked Systems Design & Implementation - Volume 2, pages 273–286. USENIX Association, 2005.

[42] CloudViz. GitHub: agentless-system-crawler. https://github.com/cloudviz/agentless-system-crawler.

[43] C. Colohan. The Scariest Outage Ever. CMU SDI/ISTC Seminar Series. http://www.pdl.cmu.edu/SDI/2012/083012b.html, 2012.

[44] P. Colp, C. Matthews, B. Aiello, and A. Warfield. VM Snapshots. http://www-archive.xenproject.org/files/xensummit_oracle09/VMSnapshots.pdf, Xen Summit 2009.

[45] Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. Benchmarking Cloud Serving Systems with YCSB. In Proceedings of the 1st ACM Symposium on Cloud Computing, SoCC ’10, pages 143–154, New York, NY, USA, 2010. ACM.

[46] John Criswell, Andrew Lenharth, Dinakar Dhurjati, and Vikram Adve. Secure virtual architecture: A safe execution environment for commodity operating systems. SIGOPS Oper. Syst. Rev., 41(6):351–366, October 2007.

[47] Lei Cui, Bo Li, Yangyang Zhang, and Jianxin Li. HotSnap: A Hot Distributed Snapshot System for Virtual Machine Cluster. In Proceedings of the 27th International Conference on Large Installation System Administration, LISA’13, pages 59–73, Berkeley, CA, USA, 2013. USENIX Association.

[48] Brendan Cully, Geoffrey Lefebvre, Dutch Meyer, Mike Feeley, Norm Hutchinson, and Andrew Warfield. Remus: High availability via asynchronous virtual machine replication. In Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation, pages 161–174. San Francisco, 2008.

[49] Dave Gilbert. PostCopyLiveMigration. http://wiki.qemu.org/Features/PostCopyLiveMigration.

[50] Dave Gilbert. PostCopyLiveMigration. https://github.com/orbitfp7/qemu/tree/wp3-postcopy.

[51] David Anderson. White Paper: Red Hat Crash Utility. http://people.redhat.com/anderson/crash_whitepaper/.

[52] Daniel J. Dean, Hiep Nguyen, Xiaohui Gu, Hui Zhang, Junghwan Rhee, Nipun Arora, and Geoff Jiang. Perfscope: Practical online server performance bug inference in production cloud computing infrastructures. In Proceedings of the ACM Symposium on Cloud Computing, SOCC ’14, pages 8:1–8:13, New York, NY, USA, 2014. ACM.

[53] Jiang Dejun, Guillaume Pierre, and Chi-Hung Chi. Ec2 performance analysis for resource provisioning of service-oriented applications. In Proceedings of the 2009 international conference on Service-oriented computing, ICSOC/ServiceWave’09, pages 197–207, 2009.

[54] Kemal Delic, Jeff Riley, Claudio Bartolini, and Adnan Salihbegovic. Knowledge-based self-management of apache web servers. In XXI International Symposium on Information, Communication and Automation technologies, page NA, 2007.

[55] Dell Quest/VKernel. Foglight for Virtualization. http://www.quest.com/foglight-for-virtualization-enterprise-edition/.

[56] Yixin Diao, Joseph L Hellerstein, Sujay Parekh, and Joseph P Bigus. Managing web server performance with autotune agents. IBM Systems Journal, 42(1):136–149, 2003.

[57] B Dolan-Gavitt, B Payne, and W Lee. Leveraging forensic tools for virtual machine introspection. Technical Report GT-CS-11-05, Georgia Institute of Technology, 2011.

[58] Brendan Dolan-Gavitt, Tim Leek, Michael Zhivich, Jonathon Giffin, and Wenke Lee. Virtuoso: Narrowing the Semantic Gap in Virtual Machine Introspection. In IEEE Security and Privacy ’11, pages 297–312.

[59] Brendan Dolan-Gavitt, Abhinav Srivastava, Patrick Traynor, and Jonathon Giffin. Robust signatures for kernel data structures. In Proceedings of the 16th ACM Conference on Computer and Communications Security, CCS ’09, pages 566–577, New York, NY, USA, 2009. ACM.

[60] George W. Dunlap, Samuel T. King, Sukru Cinar, Murtaza A. Basrai, and Peter M. Chen. Revirt: Enabling intrusion analysis through virtual-machine logging and replay. SIGOPS Oper. Syst. Rev., 36(SI):211–224, December 2002.

[61] Josiah Dykstra and Alan T. Sherman. Acquiring forensic evidence from infrastructure-as-a-service cloud computing: Exploring and evaluating tools, trust, and techniques. Digital Investigation, 9:S90–S98, 2012.

[62] EMC. VNX Snapshots White Paper. https://www.emc.com/collateral/software/white-papers/h10858-vnx-snapshots-wp.pdf.

[63] Emilien Girault. Volatilitux - Memory forensics framework to help analyze Linux physical memory dumps. http://code.google.com/p/volatilitux/.

[64] Matias F. Linux Rootkit Implementation. http://average-coder.blogspot.com/2011/12/linux-rootkit.html, 2011.

[65] Benjamin Farley, Ari Juels, Venkatanathan Varadarajan, Thomas Ristenpart, Kevin D. Bowers, and Michael M. Swift. More for your money: exploiting performance heterogeneity in public clouds. In Proceedings of the Third ACM Symposium on Cloud Computing, SoCC ’12, pages 20:1–20:14, 2012.

[66] Florian octo Forster. Collectd: The system statistics collection daemon. https://collectd.org/.

[67] Yangchun Fu and Zhiqiang Lin. Space Traveling across VM: Automatically Bridging the Semantic Gap in Virtual Machine Introspection via Online Kernel Data Redirection. In IEEE Security & Privacy ’12, pages 586–600.

[68] Yangchun Fu and Zhiqiang Lin. Exterior: Using a dual-vm based external shell for guest-os introspection, configuration, and recovery. In Proceedings of the 9th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, VEE ’13, pages 97–110, New York, NY, USA, 2013. ACM.

[69] Yangchun Fu, Junyuan Zeng, and Zhiqiang Lin. Hypershell: A practical hypervisor layer guest os shell for automated in-vm management. In Proceedings of the 2014 USENIX Conference on USENIX Annual Technical Conference, USENIX ATC’14, pages 85–96, Berkeley, CA, USA, 2014. USENIX Association.

[70] Afshar Ganjali and David Lie. Auditing cloud management using information flow tracking. In Proceedings of the Seventh ACM Workshop on Scalable Trusted Computing, STC ’12, pages 79–84, New York, NY, USA, 2012. ACM.

[71] Tal Garfinkel and Mendel Rosenblum. A Virtual Machine Introspection Based Architecture for Intrusion Detection. In Proc. Network and Distributed Systems Security Symposium, pages 191–206, 2003.

[72] Saeed Ghanbari, Ali B. Hashemi, and Cristiana Amza. Stage-aware anomaly detection through tracking log points. In Proceedings of the 15th International Middleware Conference, Middleware ’14, pages 253–264, New York, NY, USA, 2014. ACM.

[73] Zhongshu Gu, Zhui Deng, Dongyan Xu, and Xuxian Jiang. Process implanting: A new active introspection framework for virtualization. In Reliable Distributed Systems (SRDS), 2011 30th IEEE Symposium on, pages 147–156. IEEE, 2011.

[74] Ajay Gulati, Ganesha Shanmuganathan, Irfan Ahmad, Carl Waldspurger, and Mustafa Uysal. Pesto: Online storage performance management in virtualized datacenters. In Proceedings of the 2nd ACM Symposium on Cloud Computing, SOCC ’11, pages 19:1–19:14, New York, NY, USA, 2011. ACM.

[75] Brian Hay, Matt Bishop, and Kara Nance. Live analysis: Progress and challenges. Security & Privacy, IEEE, 7(2):30–37, 2009.

[76] Brian Hay and Kara Nance. Forensics examination of volatile system data using virtual introspection. SIGOPS Oper. Syst. Rev., 42(3):74–82, 2008.

[77] Michael R. Hines and Kartik Gopalan. Post-copy based live virtual machine migration using adaptive pre-paging and dynamic self-ballooning. In Proceedings of the 2009 ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, VEE ’09, pages 51–60, New York, NY, USA, 2009. ACM.

[78] Takahiro Hirofuchi and Isaku Yamahata. Yabusame. http://wiki.qemu.org/Features/PostCopyLiveMigrationYabusame.

[79] J. Hizver and Tzi-cker Chiueh. Automated discovery of credit card data flow for pci dss compliance. In Reliable Distributed Systems (SRDS), 2011 30th IEEE Symposium on, pages 51–58, Oct 2011.

[80] Jennia Hizver and Tzi-cker Chiueh. Real-time deep virtual machine introspection and its applications. In Proceedings of the 10th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, VEE ’14, pages 3–14, New York, NY, USA, 2014. ACM.

[81] Owen S. Hofmann, Alan M. Dunn, Sangman Kim, Indrajit Roy, and Emmett Witchel. Ensuring kernel integrity with OSck. In Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 279–290, 2011.

[82] Kai-Yuan Hou, Mustafa Uysal, Arif Merchant, Kang G Shin, and Sharad Singhal. Hydravm: Low-cost, transparent high availability for virtual machines. Technical report, HP Laboratories, 2011.

[83] Hypertection. Hypervisor-Based Antivirus. http://hypertection.com.

[84] IBM. BigFix / Endpoint Manager. https://github.com/bigfix/platform-releases.

[85] IBM. Bluemix Docs. https://www.ng.bluemix.net/docs/.

[86] Amani S. Ibrahim, James H. Hamlyn-Harris, John Grundy, and Mohamed Almorsy. CloudSec: A security monitoring appliance for Virtual Machines in IaaS cloud model. In Network and System Security (NSS), 2011 5th International Conference on, pages 113–120.

[87] Jack of all Clouds. Recounting EC2 One Year Later. http://www.jackofallclouds.com/2010/12/recounting-ec2/.

[88] Xuxian Jiang and Xinyuan Wang. “out-of-the-box” monitoring of vm-based high-interaction honeypots. In Proceedings of the 10th International Conference on Recent Advances in Intrusion Detection, RAID’07, pages 198–218, Berlin, Heidelberg, 2007. Springer-Verlag.

[89] Xuxian Jiang, Xinyuan Wang, and Dongyan Xu. Stealthy malware detection through VMM- based out-of-the-box semantic view reconstruction. In Proceedings of the 14th ACM Conference on Computer and Communications Security, pages 128–138, 2007.

[90] John D. McCalpin. Memory Bandwidth: Stream Benchmark. http://www.cs.virginia.edu/stream/.

[91] Hwanju Kim, Hyeontaek Lim, Jinkyu Jeong, Heeseung Jo, and Joonwon Lee. Task-aware virtual machine scheduling for i/o performance. In Proceedings of the 2009 ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, VEE ’09, pages 101–110, New York, NY, USA, 2009. ACM.

[92] Samuel T. King, George W. Dunlap, and Peter M. Chen. Debugging operating systems with time-traveling virtual machines. In Proceedings of the Annual Conference on USENIX Annual Technical Conference, ATEC ’05, pages 1–1, Berkeley, CA, USA, 2005. USENIX Association.

[93] Avi Kivity, Y Kamay, D Laor, U Lublin, and A Liguori. KVM: the Linux Virtual Machine Monitor. In OLS ’07: The 2007 Ottawa Linux Symposium, pages 225–230, 2007.

[94] Ivor Kollar. Forensic RAM dump image analyser. Master’s Thesis, Charles University in Prague, 2010. hysteria.sk/~niekt0/fmem/doc/foriana.pdf.

[95] Ricardo Koller, Canturk Isci, Sahil Suneja, and Eyal De Lara. Unified monitoring and analytics in the cloud. In Proceedings of the 7th USENIX Conference on Hot Topics in Cloud Computing, HotCloud’15, pages 10–10, Berkeley, CA, USA, 2015. USENIX Association.

[96] Konstantin Boudnik. Hadoop: Code Injection, Distributed Fault Injection. http://www.boudnik.org/~cos/docs/Hadoop-injection.pdf.

[97] Tobias Kurze, Markus Klems, David Bermbach, Alexander Lenk, Stefan Tai, and Marcel Kunze. Cloud federation. In Proceedings of the 2nd International Conference on Cloud Computing, GRIDs, and Virtualization, CLOUD COMPUTING 2011.

[98] Horacio Andrés Lagar-Cavilla, Joseph Andrew Whitney, Adin Matthew Scannell, Philip Patchin, Stephen M. Rumble, Eyal de Lara, Michael Brudno, and Mahadev Satyanarayanan. Snowflock: Rapid virtual machine cloning for cloud computing. In Proceedings of the 4th ACM European Conference on Computer Systems, EuroSys ’09, pages 1–12, New York, NY, USA, 2009. ACM.

[99] Avinash Lakshman and Prashant Malik. Cassandra: A decentralized structured storage system. SIGOPS Oper. Syst. Rev., 44(2):35–40, April 2010.

[100] M. Le and Y. Tamir. Fault injection in virtualized systems - challenges and applications. Dependable and Secure Computing, IEEE Transactions on, 12(3):284–297, May 2015.

[101] Hojoon Lee, Hyungon Moon, Daehee Jang, Kihwan Kim, Jihoon Lee, Yunheung Paek, and Brent ByungHoon Kang. Ki-mon: A hardware-assisted event-triggered monitoring platform for mutable kernel object. In Proceedings of the 22nd USENIX Conference on Security, SEC’13, pages 511–526, Berkeley, CA, USA, 2013. USENIX Association.

[102] Kaituo Li, Pallavi Joshi, Aarti Gupta, and Malay K. Ganai. Reprolite: A lightweight tool to quickly reproduce hard system bugs. In Proceedings of the ACM Symposium on Cloud Computing, SOCC ’14, pages 25:1–25:13, New York, NY, USA, 2014. ACM.

[103] Zhiqiang Lin, Junghwan Rhee, Xiangyu Zhang, Dongyan Xu, and Xuxian Jiang. SigGraph: Brute Force Scanning of Kernel Data Structure Instances Using Graph-based Signatures. In Proc. of the 18th Annual Network and Distributed System Security Symposium (NDSS’11), page NA, 2011.

[104] Linux man page. Chroot. http://linux.die.net/man/1/chroot.

[105] Linux man page. chrt - manipulate real-time attributes of a process. http://linux.die.net/man/1/chrt.

[106] Linux man page. sched_setscheduler - set scheduling policy/parameters. http://linux.die.net/man/2/sched_setscheduler.

[107] Linux Programmer's Manual. SETNS. http://man7.org/linux/man-pages/man2/setns.2.html.

[108] Lionel Litty and David Lie. Patch auditing in infrastructure as a service clouds. In Proceedings of the 7th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, VEE ’11, pages 145–156, New York, NY, USA, 2011. ACM.

[109] Yutao Liu, Yubin Xia, Haibing Guan, Binyu Zang, and Haibo Chen. Concurrent and consistent virtual machine introspection with hardware transactional memory. In High Performance Computer Architecture (HPCA), 2014 IEEE 20th International Symposium on, pages 416–427, Feb 2014.

[110] Marco Batista. VMInjector: DLL Injection tool to unlock guest VMs. https://github.com/batistam/VMInjector.

[111] Mariusz Burdach. Digital forensics of the physical memory. 2005. http://forensic.seccure.net/pdf/mburdach_digital_forensics_of_physical_memory.pdf.

[112] Nikos Mavroyanopoulos and Sascha Schumann. Mhash. http://mhash.sourceforge.net.

[113] Maximillian Dornseif. 0wned by an iPod. PacSec Applied Security Conference 2004. http://md.hudora.de/presentations/firewire/PacSec2004.pdf.

[114] Azure. VM Agent and Extensions. https://azure.microsoft.com/en-us/blog/vm-agent-and-extensions-part-2/.

[115] Michael J. Mior and Eyal de Lara. Flurrydb: A dynamically scalable relational database with virtual machine cloning. In Proceedings of the 4th Annual International Conference on Systems and Storage, SYSTOR ’11, pages 1:1–1:9, New York, NY, USA, 2011. ACM.

[116] Hyungon Moon, Hojoon Lee, Jihoon Lee, Kihwan Kim, Yunheung Paek, and Brent Byunghoon Kang. Vigilare: Toward snoop-based kernel integrity monitor. In Proceedings of the 2012 ACM Conference on Computer and Communications Security, CCS ’12, pages 28–37, New York, NY, USA, 2012. ACM.

[117] David Mosberger and Tai Jin. httperf - a tool for measuring web server performance. SIGMETRICS Perform. Eval. Rev., 26(3):31–37, 1998.

[118] Nemo. Abusing Mach on Mac OS X. http://uninformed.org/index.cgi?v=4&a=3.

[119] Nirsoft. Windows Vista Kernel Structures. http://www.nirsoft.net/kernel_struct/vista/.

[120] OpenBenchmarking/Phoronix. x264 Test Profile. http://openbenchmarking.org/test/pts/x264-1.7.0.

[121] David Oppenheimer, Archana Ganapathi, and David A. Patterson. Why do internet services fail, and what can be done about it? In Proceedings of the 4th Conference on USENIX Symposium on Internet Technologies and Systems - Volume 4, USITS’03, pages 1–1, Berkeley, CA, USA, 2003. USENIX Association.

[122] Opscode. Chef. http://www.opscode.com/chef/.

[123] Oracle’s Linux Blog. Performance Issues with Transparent Huge Pages. https://blogs.oracle.com/linux/entry/performance_issues_with_transparent_huge.

[124] oVirt. oVirt guest agent. http://www.ovirt.org/Category:Ovirt_guest_agent.

[125] OW2 Consortium. RUBiS: Rice University Bidding System. http://rubis.ow2.org/.

[126] Yoann Padioleau, Julia L. Lawall, and Gilles Muller. Understanding collateral evolution in linux device drivers. In Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems 2006, EuroSys ’06, pages 59–71, New York, NY, USA, 2006. ACM.

[127] Patrick Colp. VM Snapshots. http://www-archive.xenproject.org/files/xensummit_oracle09/VMSnapshots.pdf.

[128] B.D. Payne, M.D.P. de Carbone, and Wenke Lee. Secure and Flexible Monitoring of Virtual Machines. In Twenty-Third Annual Computer Security Applications Conference, pages 385–397, 2007.

[129] Bryan D. Payne, Martim Carbone, Monirul Sharif, and Wenke Lee. Lares: An architecture for secure active monitoring using virtualization. In Proceedings of the 2008 IEEE Symposium on Security and Privacy, SP ’08, pages 233–247, 2008.

[130] Nick L. Petroni, Jr., Timothy Fraser, Jesus Molina, and William A. Arbaugh. Copilot - a coprocessor-based kernel runtime integrity monitor. In Proceedings of the 13th Conference on USENIX Security Symposium - Volume 13, SSYM’04, pages 13–13, Berkeley, CA, USA, 2004. USENIX Association.

[131] Jonas Pfoh, Christian Schneider, and Claudia Eckert. A formal model for virtual machine introspection. In Proceedings of the 1st ACM Workshop on Virtual Machine Security, VMSec ’09, pages 1–10, New York, NY, USA, 2009. ACM.

[132] C. Pham, Z. Estrada, P. Cao, Z. Kalbarczyk, and R. K. Iyer. Reliability and security monitoring of virtual machines using hardware architectural invariants. In 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, pages 13–24, June 2014.

[133] PHD Virtual. Virtual Monitoring. http://www.phdvirtual.com/.

[134] PHP. Bug 45161 and 65458. https://bugs.php.net/bug.php?id=45161 and https://bugs.php.net/bug.php?id=65458.

[135] Boris Procházka, Tomáš Vojnar, and Martin Drahanský. Hijacking the linux kernel. In Sixth Doctoral Workshop on Mathematical and Engineering Methods in Computer Science (MEMICS’10), pages 85–92, 2010.

[136] QEMU. Documentation/Debugging: Using gdb. http://wiki.qemu.org/Documentation/Debugging.

[137] QEMU. Features/QAPI/GuestAgent. http://wiki.qemu.org/Features/QAPI/GuestAgent.

[138] Adit Ranadive, Ada Gavrilovska, and Karsten Schwan. Ibmon: monitoring vmm-bypass capable infiniband devices using memory introspection. In Proceedings of the 3rd ACM Workshop on System-level Virtualization for High Performance Computing, pages 25–32, 2009.

[139] Reflex. vWatch Monitoring. http://www.reflexsystems.com/Products/vWatch.

[140] Richard W.M. Jones. Guestfs. http://libguestfs.org/guestfs.3.html.

[141] Wolfgang Richter, Canturk Isci, Benjamin Gilbert, Jan Harkes, Vasanth Bala, and Mahadev Satyanarayanan. Agentless cloud-wide streaming of guest file system updates. In Proceedings of the 2014 IEEE International Conference on Cloud Engineering, IC2E ’14, pages 7–16, Washington, DC, USA, 2014. IEEE Computer Society.

[142] Rick Jones. Netperf Homepage. http://www.netperf.org/netperf/.

[143] Anthony Roberts, Richard McClatchey, Saad Liaquat, Nigel Edwards, and Mike Wray. Poster: Introducing pathogen: a real-time virtual machine introspection framework. In Proceedings of the 2013 ACM SIGSAC conference on Computer and communications security, CCS ’13, pages 1429–1432, New York, NY, USA, 2013. ACM.

[144] Christopher A. Rogers. Lightweight local cloning of kvm/qemu virtual machines. In ProQuest Dissertations and Theses: State University of New York at Binghamton, page 40, 2014.

[145] Russell Coker. Bonnie++. http://www.coker.com.au/bonnie++/.

[146] Alireza Saberi, Yangchun Fu, and Zhiqiang Lin. Hybrid-bridge: Efficiently bridging the semantic gap in virtual machine introspection via decoupled execution and training memoization. Proceedings of the 21st Annual Network and Distributed System Security Symposium (NDSS14), San Diego, CA, page NA, 2014.

[147] Anooshiravan Saboori, Guofei Jiang, and Haifeng Chen. Autotuning configurations in distributed systems for performance improvements using evolutionary strategies. In Proceedings of the 2008 The 28th International Conference on Distributed Computing Systems, ICDCS ’08, pages 769–776, Washington, DC, USA, 2008. IEEE Computer Society.

[148] Tudor-Ioan Salomie, Gustavo Alonso, Timothy Roscoe, and Kevin Elphinstone. Application level ballooning for efficient server consolidation. In Proceedings of the 8th ACM European Conference on Computer Systems, EuroSys ’13, pages 337–350, 2013.

[149] Benjamin Satzger, Waldemar Hummer, Christian Inzinger, Philipp Leitner, and Schahram Dustdar. Winds of change: From vendor lock-in to the meta cloud. IEEE Internet Computing, 17(1):69–73, January 2013.

[150] Joshua Schiffman, Hayawardh Vijayakumar, and Trent Jaeger. Verifying system integrity by proxy. In Proceedings of the 5th International Conference on Trust and Trustworthy Computing, TRUST’12, pages 179–200, Berlin, Heidelberg, 2012. Springer-Verlag.

[151] Andreas Schuster. Searching for processes and threads in memory dumps. Digit. Investig., 3:10–16, September 2006.

[152] Aidan Shribman and Benoit Hudzia. Pre-copy and post-copy vm live migration for memory intensive applications. In Euro-Par 2012: Parallel Processing Workshops, volume 7640 of Lecture Notes in Computer Science, pages 539–547. Springer Berlin Heidelberg, 2013.

[153] Luis Moura Silva, Javier Alonso, Paulo Silva, Jordi Torres, and Artur Andrzejak. Using virtualization to improve software rejuvenation. In Network Computing and Applications, 2007. NCA 2007. Sixth IEEE International Symposium on, pages 33–44. IEEE, 2007.

[154] Deepa Srinivasan and Xuxian Jiang. Time-traveling forensic analysis of vm-based high-interaction honeypots. In Security and Privacy in Communication Networks, pages 209–226. 2012.

[155] Deepa Srinivasan, Zhi Wang, Xuxian Jiang, and Dongyan Xu. Process out-grafting: An efficient “out-of-vm” approach for fine-grained process execution monitoring. In Proceedings of the 18th ACM Conference on Computer and Communications Security, CCS ’11, pages 363–374, New York, NY, USA, 2011. ACM.

[156] Abhinav Srivastava and Jonathon Giffin. Tamper-Resistant, Application-Aware Blocking of Malicious Network Connections. In Proceedings of the 11th international symposium on Recent Advances in Intrusion Detection, pages 39–58, 2008.

[157] Stanley Cen. Mac OS X Code Injection and Reverse Engineering. http://stanleycen.com/blog/mac-osx-code-injection/.

[158] Structured Data. Transparent Huge Pages and Hadoop Workloads. http://structureddata.org/2012/06/18/linux-6-transparent-huge-pages-and-hadoop-workloads/.

[159] Akiyoshi Sugiki, Kenji Kono, and Hideya Iwasaki. A practical approach to automatic parameter-tuning of web servers. In Proceedings of the 10th Asian Computing Science Conference on Advances in Computer Science: Data Management on the Web, ASIAN’05, pages 146–159, Berlin, Heidelberg, 2005. Springer-Verlag.

[160] Michael H. Sun and Douglas M. Blough. Fast, lightweight virtual machine checkpointing. Technical report, Georgia Institute of Technology, 2010.

[161] Sahil Suneja, Canturk Isci, Vasanth Bala, Eyal de Lara, and Todd Mummert. Non-intrusive, out-of-band and out-of-the-box systems monitoring in the cloud. In The 2014 ACM International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS ’14, pages 249–261, New York, NY, USA, 2014. ACM.

[162] Yoshi Tamura. Kemari: Fault tolerant vm synchronization based on kvm. http://www.linux-kvm.org/images/0/0d/0.5.kemari-kvm-forum-2010.pdf, 2010.

[163] S. Thomas, K.K. Sherly, and S. Dija. Extraction of memory forensic artifacts from windows 7 ram image. In Information and Communication Technologies (ICT), 2013 IEEE Conference on, pages 937–942, April 2013.

[164] Tim Starling. Measuring memory usage with strace. http://tstarling.com/blog/2010/06/measuring-memory-usage-with-strace/.

[165] Toby Opferman. Sharing Memory with the Virtual Machine. http://www.drdobbs.com/sharing-memory-with-the-virtual-machine/184402033.

[166] Vasilis Liaskovitis, Igor Mammedov, and Paolo Bonzini. ACPI memory hotplug. https://lists.gnu.org/archive/html/qemu-devel/2014-04/msg00734.html.

[167] Nicolas Viennot, Siddharth Nair, and Jason Nieh. Transparent mutable replay for multicore debugging and patch validation. In Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’13, pages 127–138, New York, NY, USA, 2013. ACM.

[168] VMware. Guest Operating System Customization Requirements. https://pubs.vmware.com/vsphere-51/index.jsp#com.vmware.vsphere.vm_admin.doc/GUID-E63B6FAA-8D35-428D-B40C-744769845906.html.

[169] VMware. Understanding Clones. https://www.vmware.com/support/ws5/doc/ws_clone_overview.html.

[170] VMware. vCenter Operations Management Suite. http://www.vmware.com/products/vcenter-operations-management/.

[171] VMware. VIX API Documentation. http://www.vmware.com/support/developer/vix-api/.

[172] VMware. VMCI Overview. http://pubs.vmware.com/vmci-sdk/.

[173] VMware. VMWare Tools. http://kb.vmware.com/kb/340.

[174] VMware. vShield Endpoint. http://www.vmware.com/products/vsphere/features-endpoint.

[175] VMware. vSphere 5 Documentation Center: CPU Hotplug. https://pubs.vmware.com/vsphere-50/topic/com.vmware.vsphere.vm_admin.doc_50/GUID-285BB774-CE69-4477-9011-598FEF1E9ACB.html.

[176] VMWare Inc. VMWare VMSafe security technology. http://www.vmware.com/company/news/releases/vmsafe_vmworld.html.

[177] Sebastian Vogl. A bottom-up Approach to VMI-based Kernel-level Rootkit Detection. PhD Thesis, Technische Universität München, 2010.

[178] Sebastian Vogl, Fatih Kilic, Christian Schneider, and Claudia Eckert. X-tier: Kernel module injection. In Javier Lopez, Xinyi Huang, and Ravi Sandhu, editors, Network and System Security, volume 7873 of Lecture Notes in Computer Science, pages 192–205. Springer Berlin Heidelberg, 2013.

[179] Michael Vrable, Justin Ma, Jay Chen, David Moore, Erik Vandekieft, Alex C. Snoeren, Geoffrey M. Voelker, and Stefan Savage. Scalability, fidelity, and containment in the potemkin virtual honeyfarm. In Proceedings of the Twentieth ACM Symposium on Operating Systems Principles, SOSP ’05, pages 148–162, New York, NY, USA, 2005. ACM.

[180] Carl A. Waldspurger. Memory Resource Management in VMware ESX Server. SIGOPS Oper. Syst. Rev., 36(SI):181–194, December 2002.

[181] Jiang Wang, Angelos Stavrou, and Anup Ghosh. Hypercheck: A hardware-assisted integrity monitor. In Proceedings of the 13th International Conference on Recent Advances in Intrusion Detection, RAID’10, pages 158–177, Berlin, Heidelberg, 2010. Springer-Verlag.

[182] Wikibooks. QEMU/Monitor. http://en.wikibooks.org/wiki/QEMU/Monitor.

[183] Timothy Wood, Prashant Shenoy, Arun Venkataramani, and Mazin Yousif. Black-box and gray-box strategies for virtual machine migration. In Proceedings of the 4th USENIX Conference on Networked Systems Design and Implementation, NSDI’07, pages 17–17, Berkeley, CA, USA, 2007. USENIX Association.

[184] Rui Wu, Ping Chen, Peng Liu, and Bing Mao. System call redirection: A practical approach to meeting real-world virtual machine introspection needs. In Proceedings of the 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN ’14, pages 574–585, Washington, DC, USA, 2014. IEEE Computer Society.

[185] Xen Project Blog. Debugging on xen. https://blog.xenproject.org/2009/10/21/debugging-on-xen/.

[186] Xen Project Wiki. Blktap. http://wiki.xenproject.org/wiki/Blktap.

[187] Xen Project Wiki. Migration. http://wiki.xenproject.org/wiki/Migration.

[188] Xen.org: Sean Dague, Daniel Stekloff, Reiner Sailer, and Stefan Berger. Xen Management. http://xenbits.xen.org/docs/4.3-testing/man/xm.1.html#block_devices.

[189] Bowei Xi, Zhen Liu, Mukund Raghavachari, Cathy H. Xia, and Li Zhang. A smart hill-climbing algorithm for application server configuration. In Proceedings of the 13th International Conference on World Wide Web, WWW ’04, pages 287–296, New York, NY, USA, 2004. ACM.

[190] Di Xie, Ning Ding, Y. Charlie Hu, and Ramana Kompella. The only constant is change: Incorporating time-varying network reservations in data centers. SIGCOMM Comput. Commun. Rev., 42(4):199–210, August 2012.

[191] Yasuaki Ishimatsu. Memory Hotplug. http://events.linuxfoundation.org/sites/events/files/lcjp13_ishimatsu.pdf.

[192] Zuoning Yin, Xiao Ma, Jing Zheng, Yuanyuan Zhou, Lakshmi N. Bairavasundaram, and Shankar Pasupathy. An empirical study on configuration errors in commercial and open source systems. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles, SOSP ’11, pages 159–172, New York, NY, USA, 2011. ACM.

[193] Junyuan Zeng, Yangchun Fu, and Zhiqiang Lin. Pemu: A pin highly compatible out-of-vm dynamic binary instrumentation framework. In Proceedings of the 11th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, VEE ’15, pages 147–160, New York, NY, USA, 2015. ACM.

[194] ZeroTurnaround. JRebel Java Plugin. http://zeroturnaround.com/software/jrebel/.

[195] Fan Zhang, Junwei Cao, Lianchen Liu, and Cheng Wu. Fast autotuning configurations of parameters in distributed computing systems using ordinal optimization. In Proceedings of the 2009 International Conference on Parallel Processing Workshops, ICPPW ’09, pages 190–197, Washington, DC, USA, 2009. IEEE Computer Society.

[196] Youhui Zhang, Yu Gu, Hongyi Wang, and Dongsheng Wang. Virtual-machine-based intrusion detection on file-aware block level storage. In 18th International Symposium on Computer Architecture and High Performance Computing, 2006. SBAC-PAD’06., pages 185–192. IEEE, 2006.

[197] Xu Zhao, Yongle Zhang, David Lion, Muhammad Faizan Ullah, Yu Luo, Ding Yuan, and Michael Stumm. Lprof: A non-intrusive request flow profiler for distributed systems. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, OSDI’14, pages 629–644, Berkeley, CA, USA, 2014. USENIX Association.

[198] Wei Zheng, Ricardo Bianchini, and Thu D. Nguyen. Automatic configuration of internet services. SIGOPS Oper. Syst. Rev., 41(3):219–229, March 2007.

[199] Junji Zhi, Sahil Suneja, and Eyal De Lara. The case for system testing with swift hierarchical vm fork. In Proceedings of the 6th USENIX Conference on Hot Topics in Cloud Computing, HotCloud’14, pages 19–19, 2014.