
Multi-hypervisor Virtual Machines: Enabling an Ecosystem of Hypervisor-level Services

Kartik Gopalan, Rohith Kugve, Hardik Bagdi, Yaohui Hu (Computer Science, Binghamton University)
Dan Williams, Nilton Bila (IBM T.J. Watson Research Center)

Published in the Proceedings of the 2017 USENIX Annual Technical Conference (USENIX ATC '17), July 12-14, 2017, Santa Clara, CA, USA. ISBN 978-1-931971-38-6. Open access to the Proceedings is sponsored by USENIX.
https://www.usenix.org/conference/atc17/technical-sessions/presentation/gopalan

Abstract

Public cloud software marketplaces already offer users a wealth of choice in operating systems, database management systems, financial software, and virtual networking, all deployable and configurable at the click of a button. Unfortunately, this level of customization has not extended to emerging hypervisor-level services, partly because traditional virtual machines (VMs) are fully controlled by only one hypervisor at a time. Currently, a VM in a cloud platform cannot concurrently use hypervisor-level services from multiple third parties in a compartmentalized manner. We propose the notion of a multi-hypervisor VM, which is an unmodified guest that can simultaneously use services from multiple coresident, but isolated, hypervisors. We present a new virtualization architecture, called Span virtualization, that leverages nesting to allow multiple hypervisors to concurrently control a guest's memory, virtual CPU, and I/O resources. Our prototype of Span virtualization on the KVM/QEMU platform enables a guest to use services such as introspection, network monitoring, guest mirroring, and hypervisor refresh, with performance comparable to traditional nested VMs.

1 Introduction

In recent years, a number of hypervisor-level services have been proposed, such as rootkit detection [61], live patching [19], intrusion detection [27], high availability [24], and virtual machine (VM) introspection [30, 53, 26, 49, 42, 60]. By running inside the hypervisor instead of the guest, these services can operate on multiple guests, while remaining transparent to the guests.

Recent years have also seen a rise in the number of specialized hypervisors that are tailored to provide VMs with specific services. For instance, McAfee Deep Defender [46] uses a micro-hypervisor called DeepSafe to improve guest security. SecVisor [56] provides code integrity for commodity guests. CloudVisor [68] guarantees guest privacy and integrity on untrusted clouds. RTS provides a Real-time Hypervisor [52] for real-time guests. These specialized hypervisors may not provide guests with the full slate of memory, virtual CPU (VCPU), and I/O management, but rely upon either another commodity hypervisor, or the guest itself, to fill in the missing services.

Currently there is no good way to expose multiple services to a guest. For a guest which needs multiple hypervisor-level services, the first option is for the single controlling hypervisor to bundle all services in its supervisor mode. Unfortunately, this approach leads to a "fat" feature-filled hypervisor that may no longer be trustworthy because it runs too many untrusted or mutually distrusting services. One could de-privilege some services to the hypervisor's user space as processes that control the guest indirectly via event interposition and system calls [63, 40]. However, public cloud providers would be reluctant to execute untrusted third-party services in the hypervisor's native user space due to a potentially large user-kernel interface.

The next option is to de-privilege the services further, running each in a Service VM with a full-fledged OS. For instance, rather than running a single Domain0 VM that bundles services for all guests, Xen [4] can use several disaggregated [23] service domains for resilience. Service domains, while currently trusted by Xen, could be adapted to run third-party untrusted services. A service VM has a less powerful interface to the hypervisor than a user space service. However, neither user space services nor Service VMs allow control over low-level guest resources, such as page mappings or VCPU scheduling, which require hypervisor privileges.

One could use nested virtualization [34, 10, 48, 29] to vertically stack hypervisor-level services, such that a trusted base hypervisor at layer-0 (L0) controls the physical hardware and runs a service hypervisor at layer-1 (L1), which fully or partially controls the guest at layer-2 (L2).

Nested virtualization is experiencing considerable interest [25, 31, 64, 68, 55, 53, 56, 39, 7, 65, 45]. For example, one can use nesting [21] to run McAfee Deep Defender [46], which does not provide full system and I/O virtualization, as a guest on XenDesktop [20], a full commodity hypervisor, so that guests can use the services of both. Similarly, Bromium [15] uses nesting on a Xen-based hypervisor for security. Ravello [2] and XenBlanket [66, 57] use nesting on public clouds for cross-cloud portability. However, vertical stacking reduces the degree of guest control and visibility to lower layers compared to the layer directly controlling the guest. Also, the overhead of nested virtualization beyond two layers can become rather high [10].

Instead, we propose Span virtualization, which provides horizontal layering of multiple hypervisor-level services. A Span VM, or a multi-hypervisor VM, is an unmodified guest whose resources (memory, VCPU, and I/O) can be simultaneously controlled by multiple coresident, but isolated, hypervisors. A base hypervisor at L0 provides a core set of services and uses nested virtualization to run multiple deprivileged service hypervisors at L1. Each L1 augments L0's services by adding/replacing one or more services. Since the L0 no longer needs to implement every conceivable service, L0's footprint can be smaller than a feature-filled hypervisor. Henceforth, we use the following terms:
• Guest or VM refers to a top-level VM, with qualifiers single-level, nested, and Span as needed.
• L1 refers to a service hypervisor at layer-1.
• L0 refers to the base hypervisor at layer-0.
• Hypervisor refers to the role of either L0 or any L1 in managing guest resources.

Figure 1: Span virtualization: V1 is a traditional single-level VM. V2 is a traditional nested VM. V3, V4 and V5 are multi-hypervisor VMs.

Figure 1 illustrates possible Span VM configurations. One L0 hypervisor runs multiple L1 hypervisors (H1, H2, H3, and H4) and multiple guests (V1, V2, V3, V4, and V5). V1 is a traditional single-level (non-nested) guest that runs on L0. V2 is a traditional nested guest that runs on only one hypervisor (H1). The rest are multi-hypervisor VMs. V3 runs on two hypervisors (L0 and H1). V4 runs on three hypervisors (L0, H2, and H3). V5 is a fully nested Span VM that runs on two L1s (H3 and H4). This paper makes the following contributions:
• We examine the solution space for providing multiple services to a common guest and identify the relative merits of possible solutions.
• We present the design of Span virtualization, which enables multiple L1s to concurrently control an unmodified guest's memory, VCPU, and I/O devices using a relatively simple interface with L0.
• We describe our implementation of Span virtualization by extending the nested virtualization support in KVM/QEMU [40] and show that Span virtualization can be implemented within existing hypervisors with moderate changes.
• We evaluate Span VMs running unmodified Linux and simultaneously using multiple L1 services including VM introspection, network monitoring, guest mirroring, and hypervisor refresh. We find that Span VMs perform comparably with nested VMs and within 0-20% of single-level VMs, across different configurations and benchmarks.

2 Solution Space

Table 1 compares possible solutions for providing multiple services to a guest. These are single-level virtualization, user space services, service VMs, nested virtualization, and Span.

First, like single-level and nested alternatives, Span virtualization provides L1s with control over the virtualized instruction set architecture (ISA) of the guest, which includes trapped instructions, memory mappings, VCPU scheduling, and I/O.

Unlike all alternatives, Span L1s support both full and partial guest control. Span L1s can range from full hypervisors that control all guest resources to specialized hypervisors that control only some guest resources.

Next, consider the impact of service failures. In single-level virtualization, failure of a privileged service impacts the L0 hypervisor, all coresident services, and all guests. For all other cases, the L0 hypervisor is protected from service failures because services are deprivileged. Furthermore, failure of a deprivileged service impacts only those guests to which the service is attached.

Next, consider the failure impact on coresident services. User space services are isolated by process-level isolation and hence protected from each other's failure. However, process-level isolation is only as strong as the user-level privileges with which the services run. Nested virtualization provides only one deprivileged service compartment. Hence services for the same guest must reside together in an L1, either in its user space or kernel. A service failure in a nested L1's kernel impacts all coresident services whereas a failure in its user space does not.

               Virtualized   Partial or     Impact on   Impact on Coresident         Impact on   Additional Performance
               ISA           Full Control   L0          Services                     Guests      Overheads
Single-level   Yes           Full           Fails       Fail                         All         None
User space     No            Partial        Protected   Protected                    Attached    Process switching
Service VM     No            Partial        Protected   Protected                    Attached    VM switching
Nested         Yes           Full           Protected   Protected in L1 user space   Attached    L1 switching + nesting
Span           Yes           Both           Protected   Protected                    Attached    L1 switching + nesting

Table 1: Alternatives for providing multiple services to a common guest, assuming one service per user space process, service VM, or Span L1.

Service VMs and Span virtualization isolate coresident services in individual VM-level compartments. Thus, failure of a service VM or Span L1 does not affect coresident services.

Finally, consider additional performance overhead over the single-level case. User space services introduce context switching overhead among processes. Service VMs introduce VM context switching overhead, which is more expensive. Nesting adds the overhead of emulating privileged guest operations in L1. Span virtualization uses nesting but supports partial guest control by L1s. Hence, nesting overhead applies only to the guest resources that an L1 controls.

3 Overview of Span Virtualization

The key design requirement for Span VMs is transparency. The guest OS and its applications should remain unmodified and oblivious to being simultaneously controlled by multiple hypervisors, which includes L0 and any attached L1s. Hence the guest sees a virtual resource abstraction that is indistinguishable from that of a traditional VM. For guest resources, we translate this requirement as follows.
• Memory: All hypervisors must have the same consistent view of the guest memory.
• VCPUs: All guest VCPUs must be controlled by one hypervisor at a given instant.
• I/O Devices: Different virtual I/O devices of the same guest may be controlled exclusively by different hypervisors at a given instant.
• Control Transfer: Control of guest VCPUs and/or virtual I/O devices can be transferred from one hypervisor to another, but only via L0.

Figure 2: High-level architecture for Span virtualization.

Figure 2 shows the high-level architecture. A Span guest begins as a single-level VM on L0. One or more L1s can then attach to one or more guest resources and optionally subscribe with L0 for specific guest events.

Guest Control Operations: The Guest Controller in L0 supervises control over a guest by multiple L1s through the following operations (a code sketch of this interface appears at the end of this section).
• [attach L1, Guest, Resource]: Gives L1 control over the Resource in Guest. Resources include guest memory, VCPU, and I/O devices. Control over memory is shared among multiple attached L1s, whereas control over guest VCPUs and virtual I/O devices is exclusive to an attached L1. Attaching to guest VCPUs or I/O device resources requires attaching to the guest memory resource.
• [detach L1, Guest, Resource]: Releases L1's control over Resource in Guest. Detaching from the guest memory resource requires detaching from guest VCPUs and I/O devices.
• [subscribe L1, Guest, Event, <GFN Range>]: Registers L1 with L0 to receive notifications of Event from Guest. The GFN Range option specifies the range of guest frame numbers on which to track the memory event. Presently we support only memory event subscription. Other guest events of interest could include SYSENTER instructions, port-mapped I/O, etc.
• [unsubscribe L1, Guest, Event, <GFN Range>]: Unsubscribes L1 from the Guest Event.

The Guest Controller also uses administrative policies to resolve a priori any potential conflicts over guest control by multiple L1s. While this paper focuses on mechanisms rather than specific policies, we note that the problem of conflict resolution among services is not unique to Span. Alternative techniques also need ways to prevent conflicting services from controlling the same guest.
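To make the attach/detach rules above concrete, the following is a minimal C sketch of the bookkeeping L0's Guest Controller could keep per (L1, guest) pair. The paper does not publish this interface, so the names (span_attachment, span_attach, span_detach) and the bitmask representation are hypothetical; only the dependency rules they encode come from the operations listed above.

#include <stdbool.h>
#include <stdint.h>

enum span_resource { SPAN_MEMORY = 1 << 0, SPAN_VCPU = 1 << 1, SPAN_IO = 1 << 2 };

/* Per-(L1, guest) attachment state kept by L0's Guest Controller. */
struct span_attachment {
    uint32_t l1_id;
    uint32_t guest_id;
    unsigned resources;   /* bitmask of enum span_resource values */
};

/* Memory control is shared; taking VCPUs or I/O devices requires memory first. */
bool span_attach(struct span_attachment *a, enum span_resource r)
{
    if ((r == SPAN_VCPU || r == SPAN_IO) && !(a->resources & SPAN_MEMORY))
        return false;
    a->resources |= r;
    return true;
}

/* Releasing memory requires having released VCPUs and I/O devices first. */
bool span_detach(struct span_attachment *a, enum span_resource r)
{
    if (r == SPAN_MEMORY && (a->resources & (SPAN_VCPU | SPAN_IO)))
        return false;
    a->resources &= ~(unsigned)r;
    return true;
}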

Isolation and Communication: Another design goal is to compartmentalize L1 services, from each other and from L0. First, L1s must have lower execution privilege compared to L0. Secondly, L1s must remain isolated from each other. These two goals are achieved by deprivileging L1s using nested virtualization and executing them as separate guests on L0. Finally, L1s must remain unaware of each other during execution. This goal is achieved by requiring L1s to receive any subscribed guest events that are generated on other L1s only via L0.

There are two ways that L0 communicates with L1s: implicitly via traps and explicitly via messages. Traps allow L0 to transparently intercept certain memory management operations by L1 on the guest. Explicit messages allow an L1 to directly request guest control from L0. An Event Processing module in L0 traps runtime updates to guest memory mappings by any L1 and synchronizes guest mappings across different L1s. The event processing module also relays guest memory faults that need to be handled by L1. A bidirectional Message Channel relays explicit messages between L0 and L1s, including attach/detach requests, memory event subscription/notification, guest I/O requests, and virtual interrupts. Some explicit messages, such as guest I/O requests and virtual interrupts, could be replaced with implicit traps. Our choice of which to use is largely based on ease of implementation on a case-by-case basis.

Continuous vs. Transient Control: Span virtualization allows L1's control over guest resources to be either continuous or transient. Continuous control means that an L1 exerts uninterrupted control over one or more guest resources for an extended period of time. For example, an intrusion detection service in L1 that must monitor guest system calls, VM exits, or network traffic would require continuous control of guest memory, VCPUs, and the network device. Transient control means that an L1 acquires full control over guest resources for a brief duration, provides a short service to the guest, and releases guest control back to L0. For instance, an L1 that periodically checkpoints the guest would need transient control of guest memory, VCPUs, and I/O devices.

4 Memory Management

A Span VM has a single guest physical address space which is mapped into the address space of all attached L1s. Thus any memory write on a guest page is immediately visible to all hypervisors controlling the guest. Note that all L1s have the same visibility into the guest memory due to the horizontal layering of Span virtualization, unlike the vertical stacking of nested virtualization, which somewhat obscures the guest to lower layers.

Figure 3: Memory translation for single-level, nested, and Span VMs. VA = Virtual Address; GPA = Guest Physical Address; L1PA = L1 Physical Address; HPA = Host Physical Address.

4.1 Traditional Memory Translation

In modern processors, hypervisors manage the physical memory that a guest can access using a virtualization feature called Extended Page Tables (EPT) [37], also called Nested Page Tables in AMD-V [5].

Single-level virtualization: Figure 3(a) shows that for single-level virtualization, the guest page tables map virtual addresses to guest physical addresses (VA to GPA in the figure). The hypervisor uses an EPT to map guest physical addresses to host physical addresses (GPA to HPA). Guest memory permissions are controlled by the combination of permissions in the guest page table and the EPT. Whenever the guest attempts to access a page that is either not present or protected in the EPT, the hardware generates an EPT fault and traps into the hypervisor, which handles the fault by mapping a new page, emulating an instruction, or taking other actions. On the other hand, the hypervisor grants complete control over the traditional paging hardware to the guest. A guest OS is free to maintain the mappings between its virtual and guest physical address space and update them as it sees fit, without trapping into the hypervisor.
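As a small illustration of the permission combination just described (this is a hypothetical sketch, not code from the prototype; effective_perms() and the PERM_* flags are names we introduce here), the rights the guest effectively holds on a page are the intersection of what its own page table grants and what the hypervisor's EPT grants:

#include <stdint.h>

#define PERM_R 0x1u
#define PERM_W 0x2u
#define PERM_X 0x4u

/* The rights the guest effectively holds on a page are the intersection of the
 * rights in its own page table entry and the rights in the hypervisor's EPT
 * entry; any access outside the EPT rights raises an EPT fault. */
static inline uint8_t effective_perms(uint8_t guest_pte_perms, uint8_t ept_perms)
{
    return guest_pte_perms & ept_perms;
}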

Nested virtualization: Figure 3(b) shows that for nested virtualization, the guest is similarly granted control over the traditional paging hardware to map virtual addresses to its guest physical address space. L1 maintains a Virtual EPT to map the guest pages to pages in L1's physical address space, or L1 pages. Finally, one more translation is required: L0 maintains EPT_L1 to map L1 pages to physical pages. However, x86 processors can translate only two levels of addresses in hardware, from guest virtual to guest physical to host physical address. Hence the Virtual EPT maintained by L1 needs to be shadowed by L0, meaning that the Virtual EPT and EPT_L1 must be compacted by L0 during runtime into a Shadow EPT that directly maps guest pages to physical pages. To accomplish this, manipulations to the Virtual EPT by L1 trigger traps to L0. Whenever L1 loads a Virtual EPT, L0 receives a trap and activates the appropriate Shadow EPT. This style of nested page table management is also called multi-dimensional paging [10].

EPT faults on guest memory can be due to (a) the guest accessing its own pages that have invalid Shadow EPT entries, and (b) the L1 directly accessing guest pages that have invalid EPT_L1 entries to perform tasks such as I/O processing and VM introspection (VMI). Both kinds of EPT faults are first intercepted by L0. L0 examines a Shadow EPT fault to further determine whether it is due to an invalid Virtual EPT entry; such faults are forwarded to L1 for processing. Otherwise, faults due to invalid EPT_L1 entries are handled by L0.

Finally, an L1 may modify the Virtual EPT it maintains for a guest in the course of performing its own memory management. However, since the Virtual EPT is shadowed by L0, all Virtual EPT modifications cause traps to L0 for validation and a Shadow EPT update.

4.2 Memory Translation for Span VMs

In Span virtualization, L0 extends nested EPT management to guests that are controlled by multiple hypervisors. Figure 3(c) shows that a Span guest has multiple Virtual EPTs, one per L1 that is attached to the guest. When an L1 acquires control over a guest's VCPUs, the L0 shadows the guest's Virtual EPT in the L1 to construct the corresponding Shadow EPT, which is used for memory translations. In addition, an EPT_Guest is maintained by L0 for direct guest execution on L0. A guest's memory mappings in Shadow EPTs, the EPT_Guest, and the EPT_L1 are kept synchronized by L0 upon page faults so that every attached hypervisor sees a consistent view of guest memory. Thus, a guest virtual address leads to the same host physical address irrespective of the Shadow EPT used for the translation.

Figure 4: Span memory management overview.

4.3 Memory Attach and Detach

A Span VM is initially created directly on L0 as a single-level guest for which the L0 constructs a regular EPT. To attach to the guest memory, a new L1 requests L0, via a hypercall, to map guest pages into its address space. Figure 4 illustrates that L1 reserves a range in the L1 physical address space for guest memory and then informs L0 of this range. Next, L1 constructs a Virtual EPT for the guest which is shadowed by L0, as in the nested case. Note that the reservation in L1 physical address space does not immediately allocate physical memory. Rather, physical memory is allocated lazily upon guest memory faults. L0 dynamically populates the reserved address range in L1 by adjusting the mappings in EPT_L1 and the Shadow EPT. A memory-detach operation correspondingly undoes the EPT_L1 mappings for the guest and releases the reserved L1 address range.

4.4 Synchronizing Guest Memory Maps

To enforce a consistent view of guest memory across all L1s, L0 synchronizes memory mappings upon two events: EPT faults and Virtual EPT modifications.

Fault handling for Span VMs extends the corresponding mechanism for nested VMs described earlier in Section 4.1. The key difference in the Span case is that L0 first checks if a host physical page has already been mapped to the faulting guest page. If so, the existing physical page mapping is used to resolve the fault, else a new physical page is allocated.

As with the nested case, modifications by an L1 to establish Virtual EPT mappings trap to a Virtual EPT trap handler in L0, shown in Figure 4. When the handler receives a trap due to a protection modification, it updates each corresponding EPT with the new least-permissive combination of page protection. Our current prototype allows protection modifications but disallows changes to established GPA-to-L1PA mappings to avoid having to change mappings in multiple EPTs.
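A hypothetical sketch of this least-permissive synchronization follows; the structure and function names are ours, not the prototype's. The point is only that after any attached hypervisor changes its requested protection for a guest page, L0 recomputes the intersection of all attached hypervisors' protections and writes that value into every EPT mapping the page.

#include <stdint.h>

#define PERM_R 0x1u
#define PERM_W 0x2u
#define PERM_X 0x4u

/* Per guest page: the protection each attached hypervisor currently requests
 * (one slot for L0's EPT_Guest view and one per attached L1). */
struct guest_page_maps {
    uint8_t *perms;
    int nmaps;
};

uint8_t least_permissive(const struct guest_page_maps *m)
{
    uint8_t p = PERM_R | PERM_W | PERM_X;
    for (int i = 0; i < m->nmaps; i++)
        p &= m->perms[i];          /* drop any right that some hypervisor denies */
    return p;
}

/* Called from L0's Virtual EPT trap handler after hypervisor `who` changes the
 * protection it wants for this guest page; the return value is what L0 writes
 * into every EPT entry (Shadow EPTs, EPT_Guest, EPT_L1s) that maps the page. */
uint8_t sync_protection(struct guest_page_maps *m, int who, uint8_t new_perms)
{
    m->perms[who] = new_perms;
    return least_permissive(m);
}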

4.5 Memory Event Subscription

An L1 attached to a guest may wish to monitor and control certain memory-related events of the guest to provide a service. For instance, an L1 that provides live checkpointing or guest mirroring may need to perform dirty page tracking, in which pages to which the guest writes are periodically recorded so they can be incrementally copied. As another example, an L1 performing intrusion detection using VM introspection might wish to monitor a guest's attempts to execute code from certain pages.

In Span virtualization, since multiple L1s can be attached to a guest, the L1 controlling the guest's VCPUs may differ from the L1s requiring the memory event notification. Hence L0 provides a Memory Event Subscription interface to enable L1s to independently subscribe to guest memory events. An L1 subscribes with L0, via the message channel, requesting notifications when a specific type of event occurs on certain pages of a given guest. When the L0 intercepts the subscribed events, it notifies all L1 subscribers via the message channel. Upon receiving the event notification, a memory event emulator in each L1, shown in Figure 4, processes the event and responds back to L0, either allowing or disallowing the guest's memory access which triggered the event. The response from the L1 also specifies whether L0 should maintain or discontinue the L1's event subscription on the guest page. For example, upon receiving a write event notification, an L1 that performs dirty page tracking will instruct L0 to allow the guest to write to the page, and cancel the subscription for future write events on the page, since the page has been recorded as being dirty. On the other hand, an intrusion detection service in L1 might disallow write events on guest pages containing kernel code and maintain future subscription. L0 concurrently delivers event notifications to all L1 subscribers. Guest memory access is allowed to proceed only if all subscribed L1s allow the event in their responses.

To intercept a subscribed memory event on a page, L0 applies the event's mask to the corresponding EPT_L1 entry of each L1 attached to the guest. Updating EPT_L1 prompts L0 to update the guest's Shadow EPT entry with the mask, to capture guest-triggered memory events. Updating EPT_L1 entries also captures the events resulting from direct accesses to guest memory by an L1 instead of the guest. For instance, to track write events on a guest page, the EPT entry could be marked read-only after saving the original permissions for later restoration.
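The L0-side delivery logic could look like the following sketch. This is an assumption-laden illustration rather than the prototype's code: the callback stands in for the message-channel round trip to each subscriber, and the allow/keep semantics mirror the dirty-tracking and intrusion-detection examples above.

#include <stdbool.h>
#include <stdint.h>

struct mem_event {
    uint64_t gfn;      /* guest frame number on which the event occurred */
    bool is_write;
};

struct subscriber {
    /* Returns true to allow the guest access; sets *keep to false to cancel the
     * subscription (e.g., a dirty-page tracker after recording the write). */
    bool (*on_event)(void *priv, const struct mem_event *ev, bool *keep);
    void *priv;
    bool active;
};

/* L0 side: deliver one guest memory event to every active subscriber; the
 * guest's access proceeds only if all of them allow it. */
bool deliver_event(struct subscriber *subs, int n, const struct mem_event *ev)
{
    bool allow = true;
    for (int i = 0; i < n; i++) {
        bool keep = true;
        if (!subs[i].active)
            continue;
        if (!subs[i].on_event(subs[i].priv, ev, &keep))
            allow = false;         /* e.g., VMI denies writes to kernel code pages */
        subs[i].active = keep;     /* dirty tracking cancels after the first write */
    }
    return allow;
}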

5 I/O Control

In this work, guests use paravirtual devices [54, 6], which provide better performance than device emulation [59] and provide greater physical device sharing among guests than direct device assignment [11, 12, 50]. For single-level virtualization, the guest OS runs a set of paravirtual frontend drivers, one for each virtual device, including block and network devices. The hypervisor runs the corresponding backend driver. The frontend and the backend drivers communicate via a shared ring buffer to issue I/O requests and receive responses. The frontend places an I/O request in the ring buffer and notifies the backend through a kick event, which triggers a VM exit to the hypervisor. The backend removes the I/O request from the ring buffer, completes the request, places the I/O response in the ring buffer, and injects an I/O completion interrupt to the guest. The interrupt handler in the frontend then picks up the I/O response from the ring buffer for processing. For nested guests, paravirtual drivers are used at both levels.

Figure 5: Paravirtual I/O for Span VMs. L1a controls the guest I/O device and L1b controls the VCPUs. Kicks from L1b and interrupts from L1a are forwarded via L0.

For Span guests, different L1s may control guest VCPUs and I/O devices. If the same L1 controls both guest VCPUs and the device backend, then I/O processing proceeds as in the nested case. Figure 5 illustrates the other case, when different L1s control guest VCPUs and backends. L1a controls the backend and L1b controls the guest VCPUs. The frontend in the guest and the backend in L1a exchange I/O requests and responses via the ring buffer. However, I/O kicks are generated by guest VCPUs controlled by L1b, which forwards the kicks to L1a. Likewise, L1a forwards any virtual interrupts from the backend to L1b, which injects the interrupt to the guest VCPUs. Kicks from the frontend and virtual interrupts from the backend are forwarded between L1s via L0 using the message channel, as sketched below.

6 VCPU Control

In single-level virtualization, L0 controls the scheduling of guest VCPUs. In nested virtualization, L0 delegates guest VCPU scheduling to an L1. The L1 schedules guest VCPUs on its own VCPUs and L0 schedules the L1's VCPUs on PCPUs. This hierarchical scheduling provides the L1 some degree of control over customized scheduling for its guests.

Span virtualization can leverage either single-level or nested VCPU scheduling depending on whether the L0 or an L1 controls a guest's VCPUs. Our current design requires that all VCPUs of a guest be controlled by one of the hypervisors at any instant. However, control over guest VCPUs can be transferred between hypervisors if needed. When L0 initiates a Span VM, it initializes all the VCPUs as it would for a single-level guest. After the guest boots up, the control of guest VCPUs can be transferred to/from an L1 using attach/detach operations.
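The kick and interrupt forwarding between L1a and L1b described in Section 5 and Figure 5 could be relayed by L0 roughly as in the sketch below. All names (span_msg, relay, send_to_l1) are hypothetical, and printf stands in for the UDP/hypercall message channel; only the routing rule, kicks to the backend-owning L1 and completion interrupts to the VCPU-owning L1, comes from the paper.

#include <stdio.h>

enum span_msg_type { MSG_IO_KICK, MSG_VIRQ };

struct span_msg {
    enum span_msg_type type;
    unsigned guest_id;
    unsigned queue;                /* virtqueue index for kicks, vector for interrupts */
};

/* Stand-in for the UDP/hypercall message channel between L0 and an L1. */
void send_to_l1(unsigned l1_id, const struct span_msg *m)
{
    printf("L0 -> L1#%u: %s (guest %u, queue/vector %u)\n", l1_id,
           m->type == MSG_IO_KICK ? "kick" : "virtual interrupt",
           m->guest_id, m->queue);
}

/* L0's relay rule: kicks trapped by the VCPU-owning L1 are sent to the
 * backend-owning L1; completion interrupts flow the other way. */
void relay(unsigned vcpu_l1, unsigned backend_l1, const struct span_msg *m)
{
    send_to_l1(m->type == MSG_IO_KICK ? backend_l1 : vcpu_l1, m);
}

int main(void)
{
    struct span_msg kick = { MSG_IO_KICK, 7, 0 };
    struct span_msg irq  = { MSG_VIRQ,    7, 1 };
    relay(2, 1, &kick);   /* guest kick, seen by L1b (id 2), forwarded to L1a (id 1) */
    relay(2, 1, &irq);    /* backend interrupt from L1a, forwarded to L1b */
    return 0;
}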

7 Implementation Details

Platform and Modifications: Our prototype supports running an unmodified Linux guest as a Span VM in modes V3, V4, and V5 from Figure 1. In our test setup, the guest runs Ubuntu 15.10 with Linux 4.2.0. The prototype for Span virtualization is implemented by modifying the KVM/QEMU nested virtualization support that is built into standard Linux distributions. Currently the implementation of L0 and all L1s uses modified KVM/QEMU hypervisors in Linux, specifically QEMU-1.2.0, kvm-kmod-3.14.2, and Linux 3.14.2. The modifications are different for the L0 and L1 layers. Ideally, we would prefer L1 to be unmodified to simplify its interface with L0. However, current hypervisors assume complete and exclusive guest control whereas Span allows L1s to exercise partial control over a subset of guest resources. Supporting partial guest control necessarily requires changes to L1 for attaching/detaching with a subset of guest resources and memory event subscription. In implementing L1 attach/detach operations on a guest, we tried, as much as possible, to reuse existing implementations of VM creation/termination operations.

Code size and memory footprint: Our implementation required about 2200 lines of code changes in KVM/QEMU, which is roughly 980+ lines in KVM and 500+ lines in QEMU for L0, 300+ in KVM and 200+ in QEMU for L1, and another 180+ in the virtio backend. We disabled unnecessary kernel components in both L0 and L1 implementations to reduce their footprint. When idle, L0 was observed to have 600MB usage at startup. When running an idle Span guest attached to an idle L1, L0's memory usage increased to 1756MB after excluding usage by the guest and the L1. The L1's initial memory usage, as measured from L0, was 1GB after excluding the guest footprint. This is an initial prototype to validate our ideas. The footprints of the L0 and L1 implementations could be further reduced using one of many lightweight Linux distributions [14].

Guest Controller: A user-level control process, called the Guest Controller, runs on the hypervisor alongside each guest. In KVM/QEMU, the Guest Controller is a QEMU process which assists the KVM hypervisor with various control tasks on a guest, including guest initialization, I/O emulation, checkpointing, and migration. Figure 6 shows the position of the Guest Controller in different virtualization models. In both single-level and nested virtualization, there is only one Guest Controller per guest, since each guest is completely controlled by one hypervisor. Additionally, in the nested case, each L1 has its own Guest Controller that runs on L0. In Span virtualization, each guest is associated with multiple Guest Controllers, one per attached hypervisor. For instance, the Span Guest in Figure 6 is associated with three Guest Controllers, one each on L0, L1a, and L1b. During attach/detach operations, the Guest Controller in an L1 initiates the mapping/unmapping of guest memory into the L1's address space and, if needed, acquires/releases control over the guest's VCPUs and virtual I/O devices.

Figure 6: Roles of QEMU (Guest Controller) and KVM (hypervisor) for Single-level, Nested, and Span VMs.

Paravirtual I/O Architecture: The Guest Controller also performs I/O emulation of virtual I/O devices controlled by its corresponding hypervisor. The paravirtual device model described in Section 5 is called virtio in KVM/QEMU [54]. For nested guests, the virtio drivers are used at two levels: once between L0 and each L1 and again between an L1 and the guest. This design is also called virtio-over-virtio. A kick is implemented in virtio as a software trap from the frontend leading to a VM exit to KVM, which delivers the kick to the Guest Controller as a signal. Upon I/O completion, the Guest Controller requests KVM to inject a virtual interrupt into the guest. Kicks and interrupts are forwarded across hypervisors using the message channel. Redirected interrupts are received and injected into the guest by a modified version of KVM's virtual IOAPIC code.

VCPU Control: The Guest Controllers in different hypervisors communicate with the Guest Controller in L0 to acquire or relinquish guest VCPU control. The Guest Controller represents each guest VCPU as a user space thread. A newly attached L1 hypervisor does not initialize guest VCPU state from scratch. Rather, the Guest Controller in the L1 accepts a checkpointed guest VCPU state from its counterpart in L0 using a technique similar to that used for live VM migration between physical hosts. After guest VCPU states are transferred from L0 to L1, the L1 Guest Controller resumes the guest VCPU threads while the L0 Guest Controller pauses its VCPU threads. A VCPU detach operation similarly transfers a checkpoint of guest VCPU states from L1 to L0. Transfer of guest VCPU states from one L1 to another is presently accomplished through a combination of detaching the source L1 from the guest VCPUs followed by attaching to the destination L1 (although a direct transfer could be potentially more efficient).
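A minimal sketch of the VCPU hand-off just described, assuming a simplified register-only VCPU state: the releasing Guest Controller pauses its VCPU threads, the state is checkpointed and copied (as in live migration), and the acquiring Guest Controller resumes. The types and function below are illustrative, not the prototype's.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MAX_VCPUS 16

struct vcpu_state {
    uint64_t regs[16];
    uint64_t rip, rflags;          /* stand-in for the full architectural state */
};

struct guest_controller {
    struct vcpu_state vcpus[MAX_VCPUS];
    int nvcpus;
    bool running;                  /* whether this controller's VCPU threads run */
};

/* Hand guest VCPUs from one Guest Controller to another: pause at the source,
 * ship a checkpoint of each VCPU's state, and resume at the destination. */
void transfer_vcpus(struct guest_controller *src, struct guest_controller *dst)
{
    src->running = false;
    for (int i = 0; i < src->nvcpus; i++)
        memcpy(&dst->vcpus[i], &src->vcpus[i], sizeof(struct vcpu_state));
    dst->nvcpus = src->nvcpus;
    dst->running = true;
}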

Message Channel: The message channel between L0 and each L1 is implemented using a combination of hypercalls and UDP messages. Hypercalls from an L1 to L0 are used for attach/detach operations on guest memory. UDP messages between an L1 and L0 are used for relaying I/O requests, device interrupts, memory subscription messages, and attach/detach operations on guest VCPUs and I/O devices. UDP messages are presently used for ease of implementation and will be replaced by better alternatives such as hypercalls, callbacks, or shared buffers.

8 Evaluation

We first demonstrate unmodified Span VMs that can simultaneously use services from multiple L1s. Next we investigate how Span guests perform compared to traditional single-level and nested guests. Our setup consists of a server containing dual six-core Xeon 2.10 GHz CPUs, 128GB memory, and 1Gbps Ethernet. The software configurations for L0, L1s, and Span guests are as described earlier in Section 7. Each data point is a mean (average) over at least five or more runs.

8.1 Span VM Usage Examples

We present three examples in which a Span VM transparently utilizes services from multiple L1s. An unmodified guest is controlled by three coresident hypervisors, namely, L0, L1a, and L1b.

Use Case 1 – Network Monitoring and VM Introspection: In the first use case, the two L1s passively examine the guest state, while L0 supervises resource control. L1a controls the guest's virtual network device whereas L1b controls the guest VCPUs. L1a performs network traffic monitoring by running the tcpdump tool to capture packets on the guest's virtual network interface. Here we use tcpdump as a stand-in for other more complex packet filtering and analysis tools.

L1b performs VM introspection (VMI) using a tool called Volatility [3] which continuously inspects a guest's memory using a utility such as pmemsave to extract an accurate list of all processes running inside the guest. The guest OS is infected by a rootkit, Kernel Beast [38], which can hide malicious activity and present an inaccurate process list to the compromised guest. Volatility, running in L1b, can nevertheless extract an accurate guest process list using VM introspection.

Figure 7: A screenshot of a Span VM simultaneously using services from two L1s (L1a: network monitoring; L1b: Volatility; guest infected with KBeast).

Figure 7 shows a screenshot, where the top window shows the tcpdump output in L1a, specifically the SSH traffic from the guest. The bottom right window shows that the rootkit KBeast in the guest OS hides a process evil, i.e. it prevents the process evil from being listed using the ps command in the guest. The bottom left window shows that Volatility, running in L1b, successfully detects the process evil hidden by the KBeast rootkit in the guest.

This use case highlights several salient features of our design. First, an unmodified guest executes correctly even though its resources are controlled by multiple hypervisors. Second, an L1 can transparently examine guest memory. Third, an L1 controlling a guest virtual device (here the network interface) can examine all I/O requests specific to the device even if the I/O requests are initiated from guest VCPUs controlled by another hypervisor. Thus an I/O device can be delegated to an L1 that does not control the guest VCPUs.

Use Case 2 – Guest Mirroring and VM Introspection: In this use case, we demonstrate an L1 that subscribes to guest memory events from L0. Hypervisors can provide a high availability service that protects unmodified guests from a failure of the physical machine. Solutions, such as Remus [24], typically work by continually transferring live incremental checkpoints of the guest state to a remote backup server, an operation that we call guest mirroring. When the primary VM fails, its backup image is activated, and the VM continues running as if failure never happened. To checkpoint incrementally, hypervisors typically use a feature called dirty page tracking. The hypervisor maintains a dirty bitmap, i.e. the set of pages that were dirtied since the last checkpoint. The dirty bitmap is constructed by marking all guest pages read-only in the EPT and recording dirtied pages upon write traps. The pages listed in the dirty bitmap are incrementally copied to the backup server.

As a first approximation of guest mirroring, we modified the pre-copy live migration code in KVM/QEMU to periodically copy all dirtied guest pages to a backup server at a given frequency. In our setup, L1a mirrors a Span guest while L1b runs Volatility and controls guest VCPUs. L1a uses memory event subscription to track write events, construct the dirty bitmap, and periodically transfer any dirty pages to the backup server. We measured the average bandwidth reported by the iPerf [1] benchmark running in the guest when L1a mirrors the guest memory at different frequencies. When guest mirroring happens every 12 seconds, iPerf delivers 800Mbps average bandwidth, which is about the same as with a nested guest. When guest mirroring happens every second, the average bandwidth drops to 600Mbps, indicating a 25% performance impact of event subscription at very high mirroring frequencies.
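Use Case 2's dirty-page tracking could be driven by the write-event subscription from Section 4.5 roughly as follows. This is a hypothetical sketch: the bitmap size assumes the 3GB guest used in the experiments, mark_dirty() stands in for the L1's write-event handler, and copy_page_to_backup() elides the actual network transfer.

#include <stdint.h>
#include <string.h>

#define GUEST_PAGES (3ull << 30 >> 12)   /* 3 GB guest at 4 KB pages, as in this setup */

static uint8_t dirty_bitmap[GUEST_PAGES / 8];

/* Invoked on a write-event notification from L0; the reply to L0 allows the
 * write and cancels the subscription for this page (it is already dirty). */
void mark_dirty(uint64_t gfn)
{
    dirty_bitmap[gfn / 8] |= (uint8_t)(1u << (gfn % 8));
}

void copy_page_to_backup(uint64_t gfn) { (void)gfn; /* network transfer elided */ }

/* Runs once per mirroring interval: send the dirtied pages to the backup
 * server, re-arm write subscriptions for them, and clear the bitmap. */
void mirror_epoch(void)
{
    for (uint64_t gfn = 0; gfn < GUEST_PAGES; gfn++)
        if (dirty_bitmap[gfn / 8] & (1u << (gfn % 8)))
            copy_page_to_backup(gfn);
    memset(dirty_bitmap, 0, sizeof(dirty_bitmap));
}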

Use Case 3 – Proactive Refresh: Hypervisor-level services may contain latent bugs, such as memory leaks, or other vulnerabilities that become worse over time, making a monolithic hypervisor unreliable for guests. Techniques like Microreboot [18] and ReHype [43] have been proposed to improve hypervisor availability, either proactively or post-failure. We have already seen how Span virtualization can compartmentalize unreliable hypervisor-level services in an isolated deprivileged L1. Here, we go one step further and proactively replace unreliable L1s with a fresh reliable instance while the guest and the base L0 hypervisor keep running. In our setup, an old L1 (L1a) was attached to a 3GB Span guest. To perform hypervisor refresh, we attached a new pre-booted replacement hypervisor (L1b) to the guest memory. Then L1a was detached from the guest by transferring guest VCPU and I/O devices to L1b via L0. In our implementation, the entire refresh operation from attaching L1b to detaching L1a completes on the average within 740ms. Of this, 670ms are spent in attaching L1b to guest memory while the guest is running. The remaining 70ms is the guest downtime due to the transfer of VCPU and I/O states. Thus Span virtualization achieves sub-second L1 refresh latency. If we attach the replacement L1b to guest memory well in advance, then the VCPU and I/O state transfer can be triggered on-demand by events, such as unusual memory pressure or CPU usage, yielding sub-100ms guest downtime and event response latency. In contrast, using pre-copy [22] to live migrate a guest from L1a to L1b can take several seconds depending on guest size and workload [65].

8.2 Macro Benchmarks

Here we compare the performance of macro benchmarks in a Span VM against a native host (no hypervisor), single-level, and nested guests. Table 2 shows the memory and processor assignments at each layer for each case. The guest always has 3GB memory and one VCPU. L0 always has 128GB and 12 physical CPU cores. In the nested configuration, L1 has 16GB memory and 8 VCPUs. The guest VCPU in the Span0 configuration is controlled by L0, and in Span1 by an L1. Finally, in both Span0 and Span1, L1a and L1b each have 8GB of memory and 4 VCPUs, so their sums match the L1 in the nested setting.

          L0                  L1                  L2
          Mem      CPUs       Mem      VCPUs      Mem     VCPUs
Host      128GB    12         N/A      N/A        N/A     N/A
Single    128GB    12         3GB      1          N/A     N/A
Nested    128GB    12         16GB     8          3GB     1
Span0     128GB    12         8GB      4          3GB     1 on L0
Span1     128GB    12         8GB      4          3GB     1 on L1

Table 2: Memory and CPU assignments for experiments.

The guest runs one of the following three benchmarks: (a) Kernbench [41] compiles the Linux kernel. (b) Quicksort sorts 400MB of data in memory. (c) iPerf [1] measures network bandwidth to another host. The benchmarks run in two modes: No-op Mode, when no hypervisor-level services run, and Service Mode, when network monitoring and VM introspection services run at either L0 or L1s. The figures report each benchmark's normalized performance against the best case and system-wide average CPU utilization, which is measured in L0 using the atop command each second during experiments.

Figure 8: No-op Mode: Normalized performance when no services run in host, L0, or L1s. The L0 controls the virtio block and network devices of the guest. (a) Kernbench (b) Quicksort (c) iPerf.

From Figures 8(a) and (b) and Figures 9(a) and (b), in both modes for Kernbench and Quicksort, Span0 performs comparably with the single-level setting and Span1 performs comparably with the nested setting, with similar CPU utilization.

For iPerf in No-op mode (Figure 8(c)), we observe that the Span1 guest experiences about 6% degradation over the nested guest, with notable bandwidth fluctuation and 7% more CPU utilization. This is because the guest's VCPU in Span1 is controlled by L1a, but the guest's network device is controlled by L0. Hence, guest I/O requests (kicks) and responses are forwarded from L1a to L0 via the message channel. The message channel is currently implemented using UDP messages, which compete with the guest's iPerf client traffic on the L1's virtio network interface with L0. We observed that if L1a controls the guest network device as well, then iPerf in the Span1 guest performs as well as in the nested guest.

Figure 9: Service Mode: Normalized performance with hypervisor-level services network monitoring and Volatility. For single-level, L0 runs both services. For nested, L1 runs both services. For Span0 and Span1, L1a runs network monitoring and controls the guest's network device; L1b runs Volatility; L0 controls the guest's block device. (a) Kernbench (b) Quicksort (c) iPerf.

For iPerf in Service mode (Figure 9(c)), nested, Span0, and Span1 guests perform about 14-15% worse than the single-level guest, due to the combined effect of virtio-over-virtio overhead and tcpdump running in L1a. Further, for Span0, the guest VCPU is controlled by L0 whereas the network device is controlled by L1a. Thus forwarding of I/O kicks and interrupts between L0 and L1a via the UDP-based message channel balances out any gains from having guest VCPUs run on L0.

Figure 8(c) shows that the average CPU utilization increases significantly for iPerf in No-op mode, from 2.7% for the native host to 100+% for the single-level and Span0 configurations and 180+% for the nested and Span1 configurations. The increase appears to be due to the virtio network device implementation in QEMU, since we observed this higher CPU utilization even with newer versions of (unmodified) QEMU (v2.7) and Linux (v4.4.2). Figures 8(c) and 9(c) also show higher CPU utilization for the nested and Span1 cases compared to the single-level case. This is because guest VCPUs are controlled by L1s in the nested and Span1 cases, making nested VM exits more expensive.

8.3 Micro Benchmarks

Attach Operation: Figure 10 shows the time taken to attach an L1 to a guest's memory, VCPUs, and I/O devices as the guest memory size is increased. The time taken to attach memory of a 1GB Span guest is about 220ms. Memory attach overhead increases with guest size because each page that L1 has allocated for Span needs to be remapped to the Span physical page in L0. Attaching VCPUs to one of the L1s takes about 50ms. Attaching virtual I/O devices takes 135ms. When I/O control has to be transferred between hypervisors, the VCPUs need to be paused. The VCPUs could be running on any of the L1s and hence L0 needs to coordinate pausing and resuming the VCPUs during the transfer. The detach operation for VCPUs and I/O devices has similar overhead.

Figure 10: Overhead of attaching an L1 to a guest (overhead in milliseconds versus Span VM memory size in MB, for memory, VCPU, and I/O attach).

Page Fault Servicing: Table 3 shows the latency of page fault handling and the message channel. We measured the average service times for EPT faults in Span at both levels of nesting. It takes on the average 3.3µs to resolve a fault against EPT_L1 and on the average 24.1µs to resolve a fault against the Virtual EPT. In contrast, the corresponding values measured for the nested case are 2.8µs and 23.3µs. For the single-level case, EPT-fault processing takes 2.4µs. The difference is due to the extra synchronization work in the EPT-fault handler in L0.

                      Single    Nested    Span
EPT Fault             2.4       2.8       3.3
Virtual EPT Fault     -         23.3      24.1
Shadow EPT Fault      -         3.7       4.1
Message Channel       -         -         53
Memory Event Notify   -         -         103.5

Table 3: Low-level latencies (µs) in Span virtualization.

Message Channel and Memory Events: The message channel is used in Span virtualization to exchange events and requests between L0 and L1s. It takes on the average 53µs to send a message between L0 and an L1. We also measured the overhead of notifying L1 subscribers from L0 for write events on a guest page. Without any subscribers, the write-fault processing takes on the average 3.5µs in L0.

Notifying the write event over the message channel from L0 to an L1 subscriber adds around 100µs, including a response from L1.

9 Related Work

Here, we review prior work on user space services, service VMs, and nested virtualization. We build upon the earlier discussion of their relative merits in Section 2.

User space services: Microkernels and library operating systems have a long history [44, 13, 28, 35] of providing OS services in user space. µDenali [63] allows programmers to use event interposition to extend the hypervisor with new user-level services such as disk and network I/O. In the KVM/QEMU [40] platform, each guest is associated with a dedicated user space management process, namely QEMU. A single QEMU process bundles multiple services for its guest such as VM launch/exit/pause, paravirtual I/O, migration, and checkpointing. One can associate different variants of QEMU with different guests, allowing some degree of service customization. However, QEMU's interface with the KVM hypervisor is large, consisting of system calls, signals, and shared buffers with the kernel, which increases the KVM hypervisor's exposure to potentially untrusted services. Also, while user space services can map guest memory and control paravirtual I/O, they lack direct control over low-level guest resources such as EPT mappings and VCPU scheduling, unlike nesting and Span.

Service VMs: Another option is to provide guest services via specialized Service VMs that run alongside the guest. For instance, the Xen [4] platform runs a trusted service VM called Dom0 which runs paravirtualized Linux, controls all guests via hypercalls to the Xen hypervisor, and provides guests with services related to lifecycle management and I/O. To avoid a single point of failure or vulnerability, the Xoar [47, 23] project proposed decomposing Dom0 into smaller service domains, one per service, that can be replaced or restarted. Possible support for third-party service domains has been discussed [16], but its status is unclear. Nova [58] minimizes the size of the hypervisor by implementing the VMM, device drivers, and special-purpose applications in user space. Self-service clouds [17] allows users to customize control over services used by their VMs on untrusted clouds. Services, such as storage and security, can be customized by privileged service domains, whereas the hypervisor controls all low-level guest resources, such as VCPUs and EPT mappings.

Nested Virtualization: Nested virtualization was originally proposed and refined in the 1970s [32, 33, 51, 8, 9, 48] and has experienced renewed interest in recent years [29, 34, 10]. Recent support [25], such as VMCS shadowing [62] and direct device assignment [67], aims to reduce nesting overheads related to VM exits and I/O. Nesting enables vertical stacking of two layers of hypervisor-level services. Third parties such as Ravello [2] and XenBlanket [66, 57] leverage nesting to offer hypervisor-level services (in an L1) over public cloud platforms (L0) such as EC2 and Azure, often pitching their service as a way to avoid lock-in with a cloud provider. However, this model also leads to a different level of lock-in, where a guest is unable to use services from more than one third party. Further, these third-party services are not fully trusted by the base hypervisor (L0) of the cloud provider, necessitating the use of nesting, rather than user space services. Span virtualization prevents guest lock-in at all levels by adding support for multiple third-party L1s to concurrently service a guest, while maintaining the isolation afforded by nesting.

Ephemeral virtualization [65] combines nesting and optimized live migration [22, 36] to enable transient control over a guest by L1s. L1s and L0 take turns exchanging full control over the guest by co-mapping its memory. In contrast, Span allows multiple L1s to concurrently exercise either full or partial control over a guest, in either continuous or transient modes.

10 Conclusions

A rich set of hypervisor-level services have been proposed in recent years, such as VM introspection, high availability, live patching, and migration. However, in modern cloud platforms, a sole controlling hypervisor continues to host all such services. Specifically, support for third parties to offer hypervisor-level services to guests is lacking. We presented a new approach, called Span virtualization, which leverages nesting to enable multiple coresident, but isolated, hypervisors to control and service a common guest. Our prototype of Span virtualization on the KVM/QEMU platform can support unmodified guests which simultaneously use multiple services that augment the base hypervisor's functionality. Span guests achieve performance comparable to traditional nested guests. Looking ahead, we believe that the opportunity for a cloud-based ecosystem of hypervisor-level services is large, including security services, cross-cloud portability, custom schedulers, virtual devices, and high availability.

11 Acknowledgement

This work was funded in part by the National Science Foundation via awards 1527338 and 1320689. We thank our shepherd, Nadav Amit, and all reviewers for insightful feedback; Umesh Deshpande, Spoorti Doddamani, Michael Hines, Hani Jamjoom, Siddhesh Phadke, and Piush Sinha, for discussions, implementation, and evaluation; and the Turtles Project [10] authors for inspiration.

References

[1] iPerf: The Network Bandwidth Measurement Tool. http://iperf.fr/.

[2] Ravello Systems. https://www.ravellosystems.com/.

[3] Volatility Framework. http://www.volatilityfoundation.org/.

[4] Xen Hypervisor. http://www.xen.org/.

[5] AMD. AMD Virtualization (AMD-V). http://www.amd.com/en-us/solutions/servers/virtualization.

[6] BARHAM, P., DRAGOVIC, B., FRASER, K., HAND, S., HARRIS, T., HO, A., NEUGEBAUER, R., PRATT, I., AND WARFIELD, A. Xen and the art of virtualization. In Proc. of SOSP (Bolton Landing, NY, USA, 2003), pp. 164–177.

[7] BEHAM, M., VLAD, M., AND REISER, H. Intrusion detection and honeypots in nested virtualization environments. In Proc. of 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) (Budapest, Hungary, June 2013).

[8] BELPAIRE, G., AND HSU, N.-T. Formal properties of recursive virtual machine architectures. In Proc. of SOSP (Austin, Texas, USA, 1975), pp. 89–96.

[9] BELPAIRE, G., AND HSU, N.-T. Hardware architecture for recursive virtual machines. In Proc. of Annual ACM Conference (1975), pp. 14–18.

[10] BEN-YEHUDA, M., DAY, M. D., DUBITZKY, Z., FACTOR, M., HAR'EL, N., GORDON, A., LIGUORI, A., WASSERMAN, O., AND YASSOUR, B.-A. The Turtles project: Design and implementation of nested virtualization. In Proc. of Operating Systems Design and Implementation (2010).

[11] BEN-YEHUDA, M., MASON, J., XENIDIS, J., KRIEGER, O., VAN DOORN, L., NAKAJIMA, J., MALLICK, A., AND WAHLIG, E. Utilizing IOMMUs for virtualization in Linux and Xen. In Ottawa Linux Symposium (July 2006).

[12] BEN-YEHUDA, M., XENIDIS, J., OSTROWSKI, M., RISTER, K., BRUEMMER, A., AND VAN DOORN, L. The price of safety: Evaluating IOMMU performance. In Proc. of Ottawa Linux Symposium (July 2007).

[13] BERSHAD, B. N., CHAMBERS, C., EGGERS, S., MAEDA, C., MCNAMEE, D., PARDYAK, P., SAVAGE, S., AND SIRER, E. G. SPIN: An extensible microkernel for application-specific operating system services. ACM SIGOPS Operating Systems Review 29, 1 (1995), 74–77.

[14] BHARTIYA, S. Best lightweight Linux distros for 2017. https://www.linux.com/news/best-lightweight-linux-distros-2017.

[15] BROMIUM. https://www.bromium.com.

[16] BULPIN, J. Whatever happened to XenServer's Windsor architecture? https://xenserver.org/blog/entry/whatever-happened-to-xenserver-s-windsor-architecture.html.

[17] BUTT, S., LAGAR-CAVILLA, H. A., SRIVASTAVA, A., AND GANAPATHY, V. Self-service cloud computing. In Proc. of ACM Conference on Computer and Communications Security (CCS) (Raleigh, NC, USA, 2012).

[18] CANDEA, G., KAWAMOTO, S., FUJIKI, Y., FRIEDMAN, G., AND FOX, A. Microreboot – a technique for cheap recovery. In Proc. of 6th USENIX Symposium on Operating Systems Design and Implementation (OSDI) (San Francisco, CA, USA, 2004), vol. 4, pp. 31–44.

[19] CHEN, H., CHEN, R., ZHANG, F., ZANG, B., AND YEW, P. Live updating operating systems using virtualization. In Proc. of ACM International Conference on Virtual Execution Environments (VEE) (Ottawa, Canada, June 2006).

[20] CITRIX. XenDesktop. https://www.citrix.com/products/xenapp-xendesktop/.

[21] CITRIX. XenDesktop and the Evolution of Hardware-Assisted Server Technologies. https://www.citrix.com/content/dam/citrix/en_us/documents/go/2015-edition-hosted-desktop.pdf.

[22] CLARK, C., FRASER, K., HAND, S., HANSEN, J., JUL, E., LIMPACH, C., PRATT, I., AND WARFIELD, A. Live migration of virtual machines. In Proc. of Networked Systems Design and Implementation (2005).

[23] COLP, P., NANAVATI, M., ZHU, J., AIELLO, W., COKER, G., DEEGAN, T., LOSCOCCO, P., AND WARFIELD, A. Breaking up is hard to do: Security and functionality in a commodity hypervisor. In Proc. of SOSP (2011), pp. 189–202.

[24] CULLY, B., LEFEBVRE, G., MEYER, D., FEELEY, M., HUTCHINSON, N., AND WARFIELD, A. Remus: High availability via asynchronous virtual machine replication. In Proc. of Networked Systems Design and Implementation (San Francisco, CA, USA, 2008).

[25] DAS, B., ZHANG, Y. Y., AND KISZKA, J. Nested virtualization: State of the art and future directions. In KVM Forum (2014).

[26] DINABURG, A., ROYAL, P., SHARIF, M., AND LEE, W. Ether: Malware analysis via hardware virtualization extensions. In Proc. of 15th ACM Conference on Computer and Communications Security (CCS) (Alexandria, VA, USA, 2008), pp. 51–62.

[27] DUNLAP, G. W., KING, S. T., CINAR, S., BASRAI, M. A., AND CHEN, P. M. ReVirt: Enabling Intrusion Analysis through Virtual-Machine Logging and Replay. In Proc. of 5th USENIX Symposium on Operating Systems Design and Implementation (OSDI) (Boston, MA, Dec. 2002).

[28] ENGLER, D. R., KAASHOEK, M. F., ET AL. Exokernel: An operating system architecture for application-level resource management. In ACM SIGOPS Operating Systems Review (1995), vol. 29(5), pp. 251–266.

[29] FORD, B., HIBLER, M., LEPREAU, J., TULLMANN, P., BACK, G., AND CLAWSON, S. Microkernels Meet Recursive Virtual Machines. In Proc. of OSDI (Seattle, Washington, USA, 1996), pp. 137–151.

[30] GARFINKEL, T., AND ROSENBLUM, M. A Virtual Machine Introspection Based Architecture for Intrusion Detection. In Network & Distributed Systems Security Symposium (San Diego, CA, USA, 2003).

[31] GEBHARDT, C., AND DALTON, C. LaLa: A Late Launch Application. In Workshop on Scalable Trusted Computing (Chicago, Illinois, USA, 2009), pp. 1–8.

[32] GOLDBERG, R. P. Architecture of Virtual Machines. In Proceedings of the Workshop on Virtual Computer Systems (Cambridge, MA, USA, 1973), pp. 74–112.

[33] GOLDBERG, R. P. Survey of Virtual Machine Research. Computer 7, 6 (1974), 34–45.

[34] GRAF, A., AND ROEDEL, J. Nesting the virtualized world. In Linux Plumbers Conference (Sept. 2009).

[35] HAND, S., WARFIELD, A., FRASER, K., KOTSOVINOS, E., AND MAGENHEIMER, D. J. Are virtual machine monitors microkernels done right? In Proc. of HotOS (2005).

[36] HINES, M., DESHPANDE, U., AND GOPALAN, K. Post-copy live migration of virtual machines. In SIGOPS Operating Systems Review (July 2009), 14–26.

[37] INTEL CORP. Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3: System Programming Guide. Order number 325384, April 2016.

[38] IPSECS. Kernel Beast. http://core.ipsecs.com/rootkit/kernel-rootkit/.

[39] JIANG, X., WANG, X., AND XU, D. Stealthy malware detection and monitoring through VMM-based "out-of-the-box" semantic view reconstruction. ACM Trans. Information Systems Security 13, 2 (Mar. 2010), 1–28.

[40] KIVITY, A., KAMAY, Y., LAOR, D., LUBLIN, U., AND LIGUORI, A. KVM: the Linux virtual machine monitor. In Proc. of Linux Symposium (June 2007).

[41] KOLIVAS, C. Kernbench. http://ck.kolivas.org/apps/kernbench/.

[42] KOURAI, K., AND CHIBA, S. HyperSpector: Virtual distributed monitoring environments for secure intrusion detection. In ACM/USENIX International Conference on Virtual Execution Environments (2005), pp. 197–207.

[43] LE, M., AND TAMIR, Y. ReHype: Enabling VM survival across hypervisor failures. In Proc. of ACM SIGPLAN Notices (2011), vol. 46, ACM, pp. 63–74.

[44] LIEDTKE, J. On micro-kernel construction. ACM SIGOPS Operating Systems Review 29, 5 (1995), 237–250.

[45] LOWELL, D. E., SAITO, Y., AND SAMBERG, E. J. Devirtualizable virtual machines enabling general, single-node, online maintenance. SIGARCH Comput. Archit. News 32, 5 (Oct. 2004), 211–223.

[46] MCAFEE. Root Out Rootkits: An Inside Look at McAfee Deep Defender. http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/mcafee-deep-defender-deepsafe-rootkit-protection-paper.pdf.

[47] MURRAY, D. G., MILOS, G., AND HAND, S. Improving Xen Security Through Disaggregation. In Proc. of Virtual Execution Environments (2008), pp. 151–160.

[48] OSISEK, D. L., JACKSON, K. M., AND GUM, P. H. ESA/390 interpretive-execution architecture, foundation for VM/ESA. IBM Systems Journal 30, 1 (Feb. 1991), 34–51.

[49] PAYNE, B. D., CARBONE, M., SHARIF, M., AND LEE, W. Lares: An Architecture for Secure Active Monitoring Using Virtualization. In Proc. of IEEE Symposium on Security and Privacy (2008), pp. 233–247.

[50] PCI SIG. Single Root I/O Virtualization and Sharing. http://www.pcisig.com/specifications/iov/single_root/.

[51] POPEK, G. J., AND GOLDBERG, R. P. Formal requirements for virtualizable third generation architectures. Communications of the ACM 17, 7 (July 1974), 412–421.

[52] REAL TIME SYSTEMS GMBH. RTS Real-Time Hypervisor. http://www.real-time-systems.com/products/index.php.

[53] RILEY, R., JIANG, X., AND XU, D. Guest-transparent prevention of kernel rootkits with VMM-based memory shadowing. In Proc. of Recent Advances in Intrusion Detection (2008), pp. 1–20.

[54] RUSSELL, R. Virtio: Towards a de-facto standard for virtual I/O devices. SIGOPS Operating Systems Review 42, 5 (July 2008), 95–103.

[55] RUTKOWSKA, J. Subverting Vista kernel for fun and profit. In Blackhat (Las Vegas, USA, Aug. 2006).

[56] SESHADRI, A., LUK, M., QU, N., AND PERRIG, A. SecVisor: A tiny hypervisor to provide lifetime kernel code integrity for commodity OSes. In Proc. of ACM SIGOPS Operating Systems Review (2007), vol. 41(6), pp. 335–350.

[57] SHEN, Z., JIA, Q., SELA, G.-E., RAINERO, B., SONG, W., VAN RENESSE, R., AND WEATHERSPOON, H. Follow the Sun Through the Clouds: Application Migration for Geographically Shifting Workloads. In Proceedings of the Seventh ACM Symposium on Cloud Computing (2016), pp. 141–154.

[58] STEINBERG, U., AND KAUER, B. NOVA: A Microhypervisor-based Secure Virtualization Architecture. In Proc. of EuroSys (2010), pp. 209–222.

[59] SUGERMAN, J., VENKITACHALAM, G., AND LIM, B.-H. Virtualizing I/O Devices on VMware Workstation's Hosted Virtual Machine Monitor. In Proc. of USENIX Annual Technical Conference (Monterey, CA, USA, 2002).

[60] SUNEJA, S., ISCI, C., BALA, V., DE LARA, E., AND MUMMERT, T. Non-intrusive, Out-of-band and Out-of-the-box Systems Monitoring in the Cloud. In Proc. of SIGMETRICS '14 (Austin, TX, USA, 2014).

[61] TOLDINAS, J., RUDZIKA, D., ŠTUIKYS, V., AND ZIBERKAS, G. Rootkit Detection Experiment within a Virtual Environment. Electronics and Electrical Engineering – Kaunas: Technologija, 8 (2009), 104.

[62] WASSERMAN, O. Nested Virtualization: Shadow Turtles. In KVM Forum (Edinburgh, UK, October 2013).

[63] WHITAKER, A., COX, R., AND SHAW, M. Constructing services with interposable virtual hardware. In Proc. of First USENIX Symposium on Networked Systems Design and Implementation (San Francisco, California, 2004).

[64] WIKIPEDIA. Phoenix HyperSpace. https://en.wikipedia.org/wiki/HyperSpace_(software).

[65] WILLIAMS, D., HU, Y., DESHPANDE, U., SINHA, P. K., BILA, N., GOPALAN, K., AND JAMJOOM, H. Enabling efficient hypervisor-as-a-service clouds with ephemeral virtualization. In Proc. of the 12th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE) (2016).

[66] WILLIAMS, D., JAMJOOM, H., AND WEATHERSPOON, H. The Xen-Blanket: Virtualize once, run everywhere. In Proc. of EuroSys (Bern, Switzerland, 2012).

[67] YASSOUR, B.-A., BEN-YEHUDA, M., AND WASSERMAN, O. Direct Device Assignment for Untrusted Fully-Virtualized Virtual Machines. Tech. rep., IBM Research, 2008.

[68] ZHANG, F., CHEN, J., CHEN, H., AND ZANG, B. CloudVisor: Retrofitting Protection of Virtual Machines in Multi-tenant Cloud with Nested Virtualization. In Proc. of the Twenty-Third ACM Symposium on Operating Systems Principles (2011), ACM, pp. 203–216.
