Multi-Hypervisor Virtual Machines: Enabling an Ecosystem of Hypervisor-level Services

Kartik Gopalan, Rohith Kugve, Hardik Bagdi, Yaohui Hu – Binghamton University
Dan Williams, Nilton Bila – IBM T.J. Watson Research Center

Funded in part by the NSF

Hypervisors

• A thin and secure layer in the cloud
-- or --
• Feature-filled cloud differentiators
  • Migration
  • Checkpointing
  • High availability
  • Live guest patching
  • Network monitoring
  • Intrusion detection
  • Other VMI

[Figure: Guest 1 running over a hypervisor that hosts Services A, B, and C]

Lots of third-party interest in hypervisor-level services

• Ravello
• XenBlanket
• McAfee Deep Defender
• SecVisor
• CloudVisor
• And more…

But limited support for third-party services from the base hypervisor.

How can a guest use multiple third-party hypervisor-level services?

Our Solution: Span

• One guest controlled by multiple coresident hypervisors.

[Figure: Guests 1 and 2 running over coresident L1 hypervisors offering Services A–D, all on the L0 hypervisor]

Outline

• Why multi-hypervisor virtual machines?
• Design of Span Virtualization
• Evaluations
• Related Work
• Conclusions and Future Work

Option 1: Fat hypervisor

• All services run at the most privileged level.
• But… the hypervisor cannot trust third-party services in privileged mode.

[Figure: Guests 1 and 2 over a hypervisor that contains Services A, B, and C]

Option 2: Native services

• Services run natively in the user space of the hypervisor
• Services control the guest indirectly via the hypervisor

[Figure: Services A, B, and C in hypervisor user space, alongside Guest 1]

• E.g. QEMU with KVM, uDenali
• But… potentially large user-kernel interface
  • Event interposition and system calls

Cloud providers are reluctant to run third-party services natively, even if in user space.

Option 3: Service VMs

• Run services inside deprivileged VMs
• Services control the guest indirectly via hypercalls and events

[Figure: A single Service VM hosting Services A, B, and C, alongside Guest 1]

• Single trusted Service VM
  • Runs all services
  • E.g. Domain0 in Xen

-- or --

Option 3: Service VMs

• Run services inside deprivileged VMs
• Services control the guest indirectly via hypercalls and events

[Figure: Multiple Service VMs hosting Services A, B, and C, alongside Guest 1]

• Multiple service VMs
  • One per service
  • Deprivileged and restartable
  • E.g. Service Domains in Xoar

Lack direct control over ISA-level guest state
  • Memory mappings, VCPU scheduling, port-mapped I/O, etc.

Option 4: Nested Virtualization

• Services run in a deprivileged L1 hypervisor, which runs on L0.
• Services control the guest at the virtualized ISA level.

• But… multiple services must reside in the same L1, i.e. a fat L1.

[Figure: Guests 1 and 2 over a single L1 hosting Services A, B, C, and D, which runs on L0]

• Vertically stack L1 hypervisors?
  • More than two levels of nesting is inefficient.

Our solution: Span Virtualization

• Allow multiple coresident L1s to concurrently control a common guest
  • i.e. horizontal layering of L1 hypervisors

• Guest is a multi-hypervisor virtual machine

• Each L1
  • Offers guest services that augment L0's services.
  • Controls one or more guest resources

[Figure: Guests 1 and 2 over coresident L1s offering Services A–D, on the L0 hypervisor]

Design Goals of Span Virtualization

• Guest Transparency
  • Guest remains unmodified

• Service Isolation
  • L1s controlling the same guest are unaware of each other.

Guest Control Operations

• L0 supervises which L1 controls which guest resource
  • Memory, VCPUs, and I/O

[Figure: Span guest (unmodified) over L1 hypervisor(s) and L0; each L1 maintains a Virtual Guest EPT]

• L0 and L1s communicate via traps/faults (implicit) and messages (explicit)

[Figure: L0 hypervisor connected to L1s via a message channel; L1 traps and guest faults flow to L0]

• Operations:
• Attach an L1 to a specified guest resource

• Detach an L1 from a guest resource

• Subscribe an attached L1 to receive guest events (currently memory events)

• Unsubscribe an L1 from a subscribed guest event (see the sketch below)
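To make the four operations concrete, here is a hedged sketch of what a control request on the message channel could look like; the names and layout (span_guest_ctl, SPAN_OP_ATTACH, span_handle_ctl) are illustrative assumptions, not the actual Span or KVM/QEMU interface.

```c
#include <stdint.h>

/* Guest resources an L1 can be given control over. */
enum span_resource { SPAN_RES_MEMORY, SPAN_RES_VCPU, SPAN_RES_IO_DEVICE };

/* Control operations an L1 sends to L0 over the message channel. */
enum span_op {
    SPAN_OP_ATTACH,       /* attach this L1 to a guest resource       */
    SPAN_OP_DETACH,       /* detach this L1 from a guest resource     */
    SPAN_OP_SUBSCRIBE,    /* subscribe to guest (memory) events       */
    SPAN_OP_UNSUBSCRIBE   /* cancel a previous event subscription     */
};

struct span_guest_ctl {
    uint32_t guest_id;          /* which Span guest                    */
    uint32_t l1_id;             /* requesting L1                       */
    enum span_op op;
    enum span_resource res;
    uint64_t arg;               /* op-specific: GPA range, device id…  */
};

/* Illustrative L0-side entry point: validate the request and apply it. */
int span_handle_ctl(const struct span_guest_ctl *req);
```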

Control over Guest Resources

• Guest Memory
  • Shared: All hypervisors have the same consistent view of guest memory

• Guest VCPUs
  • Exclusive: All guest VCPUs are controlled by one hypervisor at any instant

• Guest I/O devices
  • Exclusive: Different virtual I/O devices of a guest may be controlled by different hypervisors

• Control Transfer
  • Control over guest VCPUs and I/O devices can be transferred from one L1 to another via L0.

Memory Translation

Isolation and Communication: Another design goal is to compartmentalize L1 services, from each other and from L0. First, L1s must have lower execution privilege compared to L0. Secondly, L1s must remain isolated from each other. These two goals are achieved by deprivileging L1s using nested virtualization and executing them as separate guests on L0. Finally, L1s must remain unaware of each other during execution. This goal is achieved by requiring L1s to receive any subscribed guest events that are generated on other L1s only via L0.

There are two ways that L0 communicates with L1s: implicitly via traps and explicitly via messages. Traps allow L0 to transparently intercept certain memory management operations by L1 on the guest. Explicit messages allow an L1 to directly request guest control from L0. An Event Processing module in L0 traps runtime updates to guest memory mappings by any L1 and synchronizes guest mappings across different L1s. The event processing module also relays guest memory faults that need to be handled by L1. A bidirectional Message Channel relays explicit messages between L0 and L1s, including attach/detach requests, memory event subscription/notification, guest I/O requests, and virtual interrupts. Some explicit messages, such as guest I/O requests and virtual interrupts, could be replaced with implicit traps. Our choice of which to use is largely based on ease of implementation on a case-by-case basis.

Continuous vs. Transient Control: Span virtualization allows L1's control over guest resources to be either continuous or transient. Continuous control means that an L1 exerts uninterrupted control over one or more guest resources for an extended period of time. For example, an intrusion detection service in L1 that must monitor guest system calls, VM exits, or network traffic would require continuous control of guest memory, VCPUs, and the network device. Transient control means that an L1 acquires full control over guest resources for a brief duration, provides a short service to the guest, and releases guest control back to L0. For instance, an L1 that periodically checkpoints the guest would need transient control of guest memory, VCPUs, and I/O devices.

4 Memory Management

A Span VM has a single guest physical address space which is mapped into the address space of all attached L1s. Thus any memory write on a guest page is immediately visible to all hypervisors controlling the guest. Note that all L1s have the same visibility into the guest memory due to the horizontal layering of Span virtualization, unlike the vertical stacking of nested virtualization, which somewhat obscures the guest to lower layers.

Figure 3: Memory translation for single-level, nested, and Span VMs. VA = Virtual Address; GPA = Guest Physical Address; L1PA = L1 Physical Address; HPA = Host Physical Address.

4.1 Traditional Memory Translation

In modern x86 processors, hypervisors manage the physical memory that a guest can access using a virtualization feature called Extended Page Tables (EPT) [37], also called Nested Page Tables in AMD-V [5].

Single-level virtualization: Figure 3(a) shows that for single-level virtualization, the guest page tables map virtual addresses to guest physical addresses (VA to GPA in the figure). The hypervisor uses an EPT to map guest physical addresses to host physical addresses (GPA to HPA). Guest memory permissions are controlled by the combination of permissions in the guest page table and the EPT.

Whenever the guest attempts to access a page that is either not present or protected in the EPT, the hardware generates an EPT fault and traps into the hypervisor, which handles the fault by mapping a new page, emulating an instruction, or taking other actions. On the other hand, the hypervisor grants complete control over the traditional paging hardware to the guest. A guest OS is free to maintain the mappings between its virtual and guest physical address space and update them as it sees fit, without trapping into the hypervisor.

Nested virtualization: Figure 3(b) shows that for nested virtualization, the guest is similarly granted control over the traditional paging hardware to map virtual addresses to its guest physical address space. L1 maintains a Virtual EPT to map guest pages to pages in L1's physical address space, or L1 pages. Finally, one more translation is required: L0 maintains EPT_L1 to map L1 pages to physical pages. However, x86 processors can translate only two levels of addresses in hardware, from guest virtual to guest physical to host physical address. Hence the Virtual EPT maintained by L1 needs to be shadowed by L0, meaning that the Virtual EPT and EPT_L1 must be compacted by L0 during runtime into a Shadow EPT that directly maps guest pages to physical pages.
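The "compaction" described above can be pictured as composing the two software-maintained tables per guest page, keeping the least-permissive protections. The sketch below is a simplified illustration with invented helper names (virtual_ept_lookup, eptl1_lookup, shadow_ept_set); it is not KVM's multi-dimensional paging code.

```c
#include <stdint.h>
#include <stdbool.h>

typedef uint64_t gpa_t;    /* guest physical address */
typedef uint64_t l1pa_t;   /* L1 physical address    */
typedef uint64_t hpa_t;    /* host physical address  */

/* Assumed lookups into the two software-maintained tables (illustrative only). */
bool virtual_ept_lookup(gpa_t gpa, l1pa_t *l1pa, uint32_t *prot); /* L1's Virtual EPT: GPA -> L1PA */
bool eptl1_lookup(l1pa_t l1pa, hpa_t *hpa, uint32_t *prot);       /* L0's EPT_L1:      L1PA -> HPA */
void shadow_ept_set(gpa_t gpa, hpa_t hpa, uint32_t prot);         /* hardware-visible Shadow EPT   */

/* Compact one guest page: Shadow EPT[GPA] = EPT_L1[Virtual EPT[GPA]]. */
bool shadow_ept_compact_one(gpa_t gpa)
{
    l1pa_t l1pa;
    hpa_t hpa;
    uint32_t prot_vept, prot_eptl1;

    if (!virtual_ept_lookup(gpa, &l1pa, &prot_vept))
        return false;   /* no Virtual EPT entry: the fault is forwarded to L1            */
    if (!eptl1_lookup(l1pa, &hpa, &prot_eptl1))
        return false;   /* no EPT_L1 entry: L0 must first back the L1 page with memory   */

    shadow_ept_set(gpa, hpa, prot_vept & prot_eptl1);  /* least-permissive combination   */
    return true;
}
```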



Synchronizing Guest Memory Maps

• Span guest memory translation (guest physical to host physical) should be the same regardless of the translation path.
• L0 syncs the Shadow EPTs and EPT_Guest upon:
  • Guest faults
  • Virtual EPT modifications by L1
  • When L1 directly accesses guest memory
• L1s subscribe to guest memory events via L0
  • E.g. to track write events for dirty page tracking

Figure 4: Span memory management overview. (Guest VA to GPA via the guest page table; GPA to L1PA via each L1's Virtual EPT; L1PA to HPA via EPT_L1; L0's Virtual EPT trap handler, event subscriptions, and event notifications keep the Shadow EPTs, EPT_Guest, and EPT_L1 synchronized.)

To accomplish this, manipulations to the Virtual EPT by L1 trigger traps to L0. Whenever L1 loads a Virtual EPT, L0 receives a trap and activates the appropriate Shadow EPT. This style of nested page table management is also called multi-dimensional paging [10].

EPT faults on guest memory can be due to (a) the guest accessing its own pages that have invalid Shadow EPT entries, and (b) the L1 directly accessing guest pages that have invalid EPT_L1 entries to perform tasks such as I/O processing and VM introspection (VMI). Both kinds of EPT faults are first intercepted by L0. L0 examines a Shadow EPT fault to further determine whether it is due to an invalid Virtual EPT entry; such faults are forwarded to L1 for processing. Otherwise, faults due to invalid EPT_L1 entries are handled by L0.

Finally, an L1 may modify the Virtual EPT it maintains for a guest in the course of performing its own memory management. However, since the Virtual EPT is shadowed by L0, all Virtual EPT modifications cause traps to L0 for validation and a Shadow EPT update.

4.2 Memory Translation for Span VMs

In Span virtualization, L0 extends nested EPT management to guests that are controlled by multiple hypervisors. Figure 3(c) shows that a Span guest has multiple Virtual EPTs, one per L1 that is attached to the guest. When an L1 acquires control over a guest's VCPUs, L0 shadows the guest's Virtual EPT in the L1 to construct the corresponding Shadow EPT, which is used for memory translations. In addition, an EPT_Guest is maintained by L0 for direct guest execution on L0. A guest's memory mappings in the Shadow EPTs, the EPT_Guest, and the EPT_L1 are kept synchronized by L0 upon page faults so that every attached hypervisor sees a consistent view of guest memory. Thus, a guest virtual address leads to the same host physical address irrespective of the Shadow EPT used for the translation.

4.3 Memory Attach and Detach

A Span VM is initially created directly on L0 as a single-level guest for which L0 constructs a regular EPT. To attach to the guest memory, a new L1 requests L0, via a hypercall, to map guest pages into its address space. Figure 4 illustrates that L1 reserves a range in the L1 physical address space for guest memory and then informs L0 of this range. Next, L1 constructs a Virtual EPT for the guest, which is shadowed by L0 as in the nested case. Note that the reservation in L1 physical address space does not immediately allocate physical memory. Rather, physical memory is allocated lazily upon guest memory faults. L0 dynamically populates the reserved address range in L1 by adjusting the mappings in EPT_L1 and the Shadow EPT. A memory-detach operation correspondingly undoes the EPT_L1 mappings for the guest and releases the reserved L1 address range.

4.4 Synchronizing Guest Memory Maps

To enforce a consistent view of guest memory across all L1s, L0 synchronizes memory mappings upon two events: EPT faults and Virtual EPT modifications. Fault handling for Span VMs extends the corresponding mechanism for nested VMs described earlier in Section 4.1. The key difference in the Span case is that L0 first checks if a host physical page has already been mapped to the faulting guest page. If so, the existing physical page mapping is used to resolve the fault, else a new physical page is allocated.

As with the nested case, modifications by an L1 to establish Virtual EPT mappings trap to a Virtual EPT trap handler in L0, shown in Figure 4. When the handler receives a trap due to a protection modification, it updates each corresponding EPT_L1 with the new least-permissive combination of page protections. Our current prototype allows protection modifications but disallows changes to established GPA-to-L1PA mappings to avoid having to change mappings in multiple EPTs.

I/O Control

• We consider para-virtual I/O in this work

[Figure: Traditional para-virtual I/O — the guest frontend exchanges requests and responses with the hypervisor backend via an I/O ring buffer; kicks notify the backend, interrupts notify the frontend, and the backend performs native device I/O.]
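The request/response cycle in the figure can be read as a small frontend loop. The following is a generic illustration of the paravirtual pattern with made-up primitives (ring_push_request, kick_backend, and so on), not the actual virtio driver code.

```c
#include <stdint.h>
#include <stdbool.h>

struct io_req  { uint64_t sector; void *buf; uint32_t len; bool write; };
struct io_resp { int status; };

/* Assumed shared-ring and notification primitives (illustrative only). */
void ring_push_request(struct io_req *req);
bool ring_pop_response(struct io_resp *resp);
void kick_backend(void);                   /* frontend -> hypervisor: causes a VM exit   */
void wait_for_completion_interrupt(void);  /* hypervisor -> frontend: virtual interrupt  */

/* Frontend side of one paravirtual I/O operation. */
int frontend_do_io(struct io_req *req, struct io_resp *resp)
{
    ring_push_request(req);                /* 1. place the request in the shared ring    */
    kick_backend();                        /* 2. notify the backend via a kick           */
    wait_for_completion_interrupt();       /* 3. backend completes request, injects IRQ  */
    return ring_pop_response(resp) ? resp->status : -1;  /* 4. pick up the response      */
}
```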

4.5 Memory Event Subscription

An L1 attached to a guest may wish to monitor and control certain memory-related events of the guest to provide a service. For instance, an L1 that provides live checkpointing or guest mirroring may need to perform dirty page tracking, in which pages to which the guest writes are periodically recorded so they can be incrementally copied. As another example, an L1 performing intrusion detection using VM introspection might wish to monitor a guest's attempts to execute code from certain pages.

In Span virtualization, since multiple L1s can be attached to a guest, the L1 controlling the guest's VCPUs may differ from the L1s requiring the memory event notification. Hence L0 provides a Memory Event Subscription interface to enable L1s to independently subscribe to guest memory events. An L1 subscribes with L0, via the message channel, requesting notifications when a specific type of event occurs on certain pages of a given guest. When L0 intercepts the subscribed events, it notifies all L1 subscribers via the message channel. Upon receiving the event notification, a memory event emulator in each L1, shown in Figure 4, processes the event and responds back to L0, either allowing or disallowing the guest's memory access which triggered the event. The response from the L1 also specifies whether L0 should maintain or discontinue the L1's event subscription on the guest page. For example, upon receiving a write event notification, an L1 that performs dirty page tracking will instruct L0 to allow the guest to write to the page, and cancel the subscription for future write events on the page, since the page has been recorded as being dirty. On the other hand, an intrusion detection service in L1 might disallow write events on guest pages containing kernel code and maintain future subscription. L0 concurrently delivers event notifications to all L1 subscribers. Guest memory access is allowed to proceed only if all subscribed L1s allow the event in their responses.

To intercept a subscribed memory event on a page, L0 applies the event's mask to the corresponding EPT_L1 entry of each L1 attached to the guest. Updating EPT_L1 prompts L0 to update the guest's Shadow EPT entry with the mask, to capture guest-triggered memory events. Updating EPT_L1 entries also captures the events resulting from direct accesses to guest memory by an L1 instead of the guest. For instance, to track write events on a guest page, the EPT entry could be marked read-only after saving the original permissions for later restoration.

I/O Control

• We consider para-virtual I/O in this work
• A Span guest's I/O device and VCPUs may be controlled by different L1s

Figure 5: Paravirtual I/O for Span VMs. L1a controls the guest I/O device and L1b controls the VCPUs. Kicks from L1b and interrupts from L1a are forwarded via L0.

5 I/O Control

In this work, guests use paravirtual devices [54, 6], which provide better performance than device emulation [59] and provide greater physical device sharing among guests than direct device assignment [11, 12, 50]. For single-level virtualization, the guest OS runs a set of paravirtual frontend drivers, one for each virtual device, including block and network devices. The hypervisor runs the corresponding backend driver. The frontend and the backend drivers communicate via a shared ring buffer to issue I/O requests and receive responses. The frontend places an I/O request in the ring buffer and notifies the backend through a kick event, which triggers a VM exit to the hypervisor. The backend removes the I/O request from the ring buffer, completes the request, places the I/O response in the ring buffer, and injects an I/O completion interrupt into the guest. The interrupt handler in the frontend then picks up the I/O response from the ring buffer for processing. For nested guests, paravirtual drivers are used at both levels.

For Span guests, different L1s may control guest VCPUs and I/O devices. If the same L1 controls both the guest VCPUs and the device backend, then I/O processing proceeds as in the nested case. Figure 5 illustrates the other case, when different L1s control guest VCPUs and backends. L1a controls the backend and L1b controls the guest VCPUs. The frontend in the guest and the backend in L1a exchange I/O requests and responses via the ring buffer. However, I/O kicks are generated by guest VCPUs controlled by L1b, which forwards the kicks to L1a. Likewise, L1a forwards any virtual interrupts from the backend to L1b, which injects the interrupt into the guest VCPUs. Kicks from the frontend and virtual interrupts from the backend are forwarded between L1s via L0 using the message channel.

6 VCPU Control

In single-level virtualization, L0 controls the scheduling of guest VCPUs. In nested virtualization, L0 delegates guest VCPU scheduling to an L1. The L1 schedules guest VCPUs on its own VCPUs and L0 schedules the L1's VCPUs on PCPUs. This hierarchical scheduling provides the L1 some degree of control over customized scheduling for its guests.

Span virtualization can leverage either single-level or nested VCPU scheduling depending on whether the L0 or an L1 controls a guest's VCPUs. Our current design requires that all VCPUs of a guest be controlled by one of the hypervisors at any instant. However, control over guest VCPUs can be transferred between hypervisors if needed. When L0 initiates a Span VM, it initializes all the VCPUs as it would for a single-level guest. After the guest boots up, control of guest VCPUs can be transferred to/from an L1 using attach/detach operations.
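When the L1 controlling the guest's VCPUs differs from the L1 running the backend (L1b and L1a in Figure 5), kicks and interrupts detour through L0. Below is a hedged sketch of that relay logic, with invented message and helper names rather than the prototype's actual code.

```c
#include <stdint.h>

enum span_msg_type { SPAN_MSG_IO_KICK, SPAN_MSG_VIRQ };

struct span_msg {
    enum span_msg_type type;
    uint32_t guest_id;
    uint32_t device_id;   /* which virtio device the kick/interrupt targets */
    uint32_t vector;      /* interrupt vector, for SPAN_MSG_VIRQ            */
};

/* Assumed message-channel and lookup primitives (illustrative only). */
void msg_send_to_l0(const struct span_msg *m);
void msg_forward_to_l1(uint32_t l1_id, const struct span_msg *m);
uint32_t l1_controlling_backend(uint32_t guest_id, uint32_t device_id);
uint32_t l1_controlling_vcpus(uint32_t guest_id);

/* L1b side: forward a kick generated by a guest VCPU it controls. */
void l1_forward_kick(uint32_t guest_id, uint32_t device_id)
{
    struct span_msg m = { .type = SPAN_MSG_IO_KICK, .guest_id = guest_id,
                          .device_id = device_id, .vector = 0 };
    msg_send_to_l0(&m);
}

/* L0's relay: kicks go to the backend's L1, virtual interrupts to the VCPUs' L1. */
void l0_relay(const struct span_msg *m)
{
    if (m->type == SPAN_MSG_IO_KICK)
        msg_forward_to_l1(l1_controlling_backend(m->guest_id, m->device_id), m);
    else /* SPAN_MSG_VIRQ */
        msg_forward_to_l1(l1_controlling_vcpus(m->guest_id), m);
}
```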

VCPU Control

• Simple for now.

• All VCPUs controlled by one hypervisor
  • Either by L0 or one of the L1s

• Can we distribute VCPUs among L1s? • Possible, but no good reason why. • Requires expensive IPI forwarding across L1s • Complicates memory synchronization. Implementation Guest Guest tialization, I/O emulation, checkpointing, and migration. • Guest: Unmodified 15.10, 4.2 QEMU Span Guest QEMU Guest In L1a In L1b Figure 6 shows the position of the Guest Controller in Guest QEMU L1a L1b different virtualization models. In both single-level and • L0 and L1 L1 Guest Guest L1 L1a KVM KVM L1b nested virtualization, there is only one Guest Controller Guest KVM QEMU • QEMU 1.2 and Linux 3.14.2 QEMU QEMU QEMU QEMU • Modified nesting support in KVM/QEMU per guest, since each guest is completely controlled by L0 L0 KVM L0 • L0 : 980+ lines in KVM and 500+ lines in QEMU KVM KVM one hypervisor. Additionally, in the nested case, each L1 • L1: 300+ lines in KVM and 380+ in QEMU has its own Guest Controller that runs on L0. In Span vir- Single Nested Span tualization, each guest is associated with multiple Guest Figure 6: Roles of QEMU (Guest Controller) and KVM • Guest controller • Message channel Controllers, one per attached hypervisor. For instance, (hypervisor)• for Single-level, Nested, and Span VMs. • User space QEMU process For I/O kick and interrupt forwarding the Span Guest in Figure 6 is associated with three Guest • Currently using UDP messages and hypercalls • Guest initialization, I/O emulation, Control Controllers, one each on L0, L1a, and L1b. During at- Transfer, Migration, etc • Control Transfer7 Implementation Details tach/detach operations, the Guest Controller in an L1 ini- • Guest VCPUs and virtio devices can be transferred between L1s and L0 tiates the mapping/unmapping of guest memory into the • I/O: virtio-over-virtio Platform• Using attach/detach operations and Modifications: Our prototype supports • Direct assignment: future work running an unmodified Linux guest as a Span VM in L1’s address space and, if needed, acquires/releases con- modes V3, V4, and V5 from Figure 1. In our test setup, the trol over the guest’s VCPU and virtual I/O devices. guest runs Ubuntu 15.10 with Linux 4.2.0. The prototype Paravirtual I/O Architecture: The Guest Controller for Span virtualization is implemented by modifying the also performs I/O emulation of virtual I/O devices con- KVM/QEMU nested virtualization support that is built trolled by its corresponding hypervisor. The paravirtual into standard Linux distributions. Currently the imple- device model described in Section 5 is called virtio in mentation of L0 and all L1s uses modified KVM/QEMU KVM/QEMU [54]. For nested guests, the virtio drivers hypervisors in Linux, specifically QEMU-1.2.0, kvm- are used at two levels: once between L0 and each L1 and kmod-3.14.2 and Linux 3.14.2. The modifications are again between an L1 and the guest. This design is also different for the L0 and L1 layers. Ideally, we would called virtio-over-virtio. A kick is implemented in vir- prefer L1 to be unmodified to simplify its interface with tio as a trap from the frontend leading to a VM L0. However, current hypervisors assume complete and exit to KVM, which delivers the kick to the Guest Con- exclusive guest control whereas Span allows L1s to exer- troller as a signal. Upon I/O completion, the Guest Con- cise partial control over a subset of guest resources. Sup- troller requests KVM to inject a virtual interrupt into the porting partial guest control necessarily requires changes guest. 
Kicks and interrupts are forwarded across hyper- to L1 for attaching/detaching with a subset of guest re- visors using the message channel. Redirected interrupts sources and memory event subscription. In implement- are received and injected into the guest by a modified ing L1 attach/detach operations on a guest, we tried, as version of KVM’s virtual IOAPIC code. much as possible, to reuse existing implementations of VCPU Control: The Guest Controllers in different VM creation/termination operations. hypervisors communicate with the Guest Controller in Code size and memory footprint: Our implemen- L0 to acquire or relinquish guest VCPU control. The tation required about 2200 lines of code changes in Guest Controller represents each guest VCPU as a user KVM/QEMU, which is roughly 980+ lines in KVM and space thread. A newly attached L1 hypervisor does not 500+ lines in QEMU for L0, 300+ in KVM and 200+ in initialize guest VCPU state from scratch. Rather, the QEMU for L1, and another 180+ in the virtio backend. Guest Controller in the L1 accepts a checkpointed guest We disabled unnecessary kernel components in both L0 VCPU state from its counterpart in L0 using a technique and L1 implementations to reduce their footprint. When similar to that used for live VM migration between phys- idle, L0 was observed to have 600MB usage at startup. ical hosts. After guest VCPU states are transferred from When running an idle Span guest attached to an idle L1, L0 to L1, the L1 Guest Controller resumes the guest L0’s memory usage increased to 1756MB after exclud- VCPU threads while the L0 Guest Controller pauses ing usage by the guest and the L1. The L1’s initial mem- its VCPU threads. A VCPU detach operation similarly ory usage, as measured from L0, was 1GB after exclud- transfers a checkpoint of guest VCPU states from L1 to ing the guest footprint. This is an initial prototype to L0. Transfer of guest VCPU states from one L1 to an- validate our ideas. The footprints of L0 and L1 imple- other is presently accomplished through a combination mentations could be further reduced using one of many of detaching the source L1 from the guest VCPUs fol- lightweight Linux distributions [14]. lowed by attaching to the destination L1 (although a di- Guest Controller: A user-level control process, rect transfer could be potentially more efficient). called the Guest Controller, runs on the hypervisor along- Message Channel: The message channel between side each guest. In KVM/QEMU, the Guest Controller L0 and each L1 is implemented using a combination is a QEMU process which assists the KVM hypervisor of hypercalls and UDP messages. Hypercalls from an with various control tasks on a guest, including guest ini- L1 to L0 are used for attach/detach operations on guest
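Below is a hedged sketch of how the two transports described above could be combined on the L1 side; the hypercall wrapper, message layout, and UDP port are placeholders, not values from the prototype.

```c
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#define SPAN_UDP_PORT 9099   /* placeholder port; the prototype's value is not given */

struct span_msg { uint32_t type; uint32_t guest_id; uint64_t payload; };

/* Control-plane requests (attach/detach on guest memory) go through a hypercall
 * from L1 to L0; this wrapper is assumed, not a real KVM hypercall number. */
long span_hypercall(unsigned long nr, unsigned long a0, unsigned long a1);

/* Data-plane notifications (kicks, interrupts, memory events) go over UDP. */
static int span_udp_send(const char *l0_addr, const struct span_msg *m)
{
    struct sockaddr_in dst = { .sin_family = AF_INET, .sin_port = htons(SPAN_UDP_PORT) };
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd < 0)
        return -1;
    if (inet_pton(AF_INET, l0_addr, &dst.sin_addr) != 1) {
        close(fd);
        return -1;
    }
    ssize_t n = sendto(fd, m, sizeof(*m), 0, (struct sockaddr *)&dst, sizeof(dst));
    close(fd);
    return n == (ssize_t)sizeof(*m) ? 0 : -1;
}
```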

Example 1: Two L1s controlling one Guest

• Guest: Infected with rootkit
• L1a: Monitoring network traffic
• L1b: Running VMI (Volatility)

Figure 7: A screenshot of a Span VM simultaneously using services from two L1s. (L1a: network monitoring; guest infected with KBeast; L1b: Volatility.)

8 Evaluation

We first demonstrate unmodified Span VMs that can simultaneously use services from multiple L1s. Next we investigate how Span guests perform compared to traditional single-level and nested guests. Our setup consists of a server containing dual six-core Xeon 2.10 GHz CPUs, 128GB memory, and 1Gbps Ethernet. The software configurations for L0, L1s, and Span guests are as described earlier in Section 7. Each data point is a mean (average) over at least five or more runs.

8.1 Span VM Usage Examples

We present three examples in which a Span VM transparently utilizes services from multiple L1s. An unmodified guest is controlled by three coresident hypervisors, namely L0, L1a, and L1b.

Use Case 1 – Network Monitoring and VM Introspection: In the first use case, the two L1s passively examine the guest state, while L0 supervises resource control. L1a controls the guest's virtual network device whereas L1b controls the guest VCPUs. L1a performs network traffic monitoring by running the tcpdump tool to capture packets on the guest's virtual network interface. Here we use tcpdump as a stand-in for other more complex packet filtering and analysis tools. L1b performs VM introspection (VMI) using a tool called Volatility [3], which continuously inspects a guest's memory using a utility such as pmemsave to extract an accurate list of all processes running inside the guest. The guest OS is infected by a rootkit, Kernel Beast [38], which can hide malicious activity and present an inaccurate process list to the compromised guest. Volatility, running in L1b, can nevertheless extract an accurate guest process list using VM introspection.

Figure 7 shows a screenshot, where the top window shows the tcpdump output in L1a, specifically the SSH traffic from the guest. The bottom right window shows that the rootkit KBeast in the guest OS hides a process evil, i.e. it prevents the process evil from being listed using the ps command in the guest. The bottom left window shows that Volatility, running in L1b, successfully detects the process evil hidden by the KBeast rootkit in the guest.

This use case highlights several salient features of our design. First, an unmodified guest executes correctly even though its resources are controlled by multiple hypervisors. Second, an L1 can transparently examine guest memory. Third, an L1 controlling a guest virtual device (here the network interface) can examine all I/O requests specific to the device even if the I/O requests are initiated from guest VCPUs controlled by another hypervisor. Thus an I/O device can be delegated to an L1 that does not control the guest VCPUs.

Use Case 2 – Guest Mirroring and VM Introspection: In this use case, we demonstrate an L1 that subscribes to guest memory events from L0. Hypervisors can provide a high availability service that protects unmodified guests from a failure of the physical machine. Solutions such as Remus [24] typically work by continually transferring live incremental checkpoints of the guest state to a remote backup server, an operation that we call guest mirroring. When the primary VM fails, its backup image is activated, and the VM continues running as if failure never happened. To checkpoint incrementally, hypervisors typically use a feature called dirty page tracking. The hypervisor maintains a dirty bitmap, i.e. the set of pages that were dirtied since the last checkpoint. The dirty bitmap is constructed by marking all guest pages read-only in the EPT and recording dirtied pages upon write traps. The pages listed in the dirty bitmap are incrementally copied to the backup server.

As a first approximation of guest mirroring, we modified the pre-copy code in KVM/QEMU to periodically copy all dirtied guest pages to a backup server at a given frequency. In our setup, L1a mirrors a Span guest while L1b runs Volatility and controls guest VCPUs. L1a uses memory event subscription to track write events, construct the dirty bitmap, and periodically transfer any dirty pages to the backup server. We measured the average bandwidth reported by the iPerf [1] client benchmark running in the guest when L1a mirrors the guest memory at different frequencies. When guest mirroring happens every 12 seconds, iPerf delivers 800Mbps average bandwidth, which is about the same as with a nested guest. When guest mirroring happens every second, the average bandwidth drops to 600Mbps, indicating a 25% performance impact of event subscription at very high mirroring frequencies.
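Below is a hedged sketch of how a mirroring service in an L1 could use the write-event subscription described above to build its dirty bitmap; the guest size, page size, and helper names are illustrative assumptions.

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define GUEST_PAGES (3u * 1024 * 1024 * 1024 / 4096)   /* 3GB guest, 4KB pages (assumed) */

static uint8_t dirty_bitmap[GUEST_PAGES / 8];

static inline void mark_dirty(uint64_t gfn)
{
    dirty_bitmap[gfn / 8] |= 1u << (gfn % 8);
}

/* Assumed L0-facing helpers (illustrative only). */
void l0_subscribe_write_events(uint32_t guest_id);      /* pages are marked read-only in the EPTs */
void l0_reply(uint64_t gfn, bool allow, bool keep_subscription);

/* Called in L1 when L0 relays a write event for guest frame gfn. */
void on_guest_write_event(uint32_t guest_id, uint64_t gfn)
{
    mark_dirty(gfn);
    /* Allow the write and cancel future write events on this page:
     * it is already recorded as dirty for the next incremental copy. */
    l0_reply(gfn, /*allow=*/true, /*keep_subscription=*/false);
}

/* At each checkpoint interval: copy dirty pages, clear the bitmap, re-subscribe. */
void checkpoint_round(uint32_t guest_id)
{
    /* ... copy pages whose bit is set to the backup server ... */
    memset(dirty_bitmap, 0, sizeof(dirty_bitmap));
    l0_subscribe_write_events(guest_id);
}
```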

Example 2: Guest Mirroring

• L1a runs Volatility

• L1b runs guest mirroring: incremental copy to a remote node
  • Periodically copies dirty guest pages (see the sketch below)
  • Requires subscribing to write events on guest memory (memory event subscription via L0)
• Guest runs iPerf
  • ~800Mbps when mirrored every 12 seconds — same as standard nested
  • ~600Mbps when mirrored every 1 second
  • ~25% impact with high-frequency dirty page tracking
[Figure: guest running iPerf; L1a provides a VM introspection service and L1b a guest mirroring service, with memory event subscription via L0]
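A minimal sketch of the periodic dirty-page mirroring loop, with a toy dirty bitmap and stubbed-out transfer. The real implementation modifies KVM/QEMU's pre-copy code and obtains write events through L0's memory event subscription, so the sizes, helpers, and intervals below are illustrative only:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define PAGE_SIZE   4096UL
#define GUEST_PAGES 1024UL                  /* small demo "guest" */

static uint8_t dirty[GUEST_PAGES / 8];      /* dirty bitmap: one bit per guest page */
static uint8_t memory[GUEST_PAGES][PAGE_SIZE];

/* Stub for L0's write-event subscription: in the real system L0 traps guest
 * writes and reports dirtied pages; here we just pretend two pages changed. */
static void get_dirty_pages(void)
{
    dirty[7 / 8]  |= 1u << (7 % 8);
    dirty[42 / 8] |= 1u << (42 % 8);
}

/* Stub for the network transfer to the backup server. */
static void send_page(uint64_t gfn, const void *data, size_t len)
{
    (void)data;
    printf("mirroring gfn %llu (%zu bytes)\n", (unsigned long long)gfn, len);
}

int main(void)
{
    const unsigned interval_sec = 1;        /* 12s -> ~800Mbps iPerf, 1s -> ~600Mbps */

    for (int round = 0; round < 3; round++) {
        sleep(interval_sec);
        get_dirty_pages();
        /* Copy every page dirtied since the last round, then clear the bitmap. */
        for (uint64_t gfn = 0; gfn < GUEST_PAGES; gfn++)
            if (dirty[gfn / 8] & (1u << (gfn % 8)))
                send_page(gfn, memory[gfn], PAGE_SIZE);
        memset(dirty, 0, sizeof(dirty));
    }
    return 0;
}
```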

Example 3: Live Hypervisor Replacement

• Replace the hypervisor underneath a live guest
  • L1 runs a full hypervisor
  • L0 acts as a thin switching layer
• Replacement operation
  • Attach the new L1
  • Detach the old L1

• ~740ms replacement latency, including memory co-mapping
[Figure: L0 switches the guest from the old L1 hypervisor to the new L1 hypervisor]

• ~70ms guest downtime
  • During VCPU and I/O state transfer (see the sketch below)

Notes (Use Case 3 – Proactive Refresh, from the paper):
• Hypervisor-level services may contain latent bugs, such as memory leaks, or other vulnerabilities that worsen over time, making a monolithic hypervisor unreliable for guests. Techniques like Microreboot [18] and ReHype [43] improve hypervisor availability either proactively or post-failure. Span virtualization compartmentalizes unreliable services in isolated deprivileged L1s and, going one step further, can replace an unreliable L1 with a fresh instance while the guest and the base L0 hypervisor keep running.
• In the experiment, an old L1 (L1a) was attached to a 3GB Span guest. A new pre-booted replacement hypervisor (L1b) was attached to the guest memory, and L1a was then detached by transferring guest VCPUs and I/O devices to L1b via L0.
• The entire refresh operation, from attaching L1b to detaching L1a, completes on average within 740ms. Of this, about 670ms is spent attaching L1b to guest memory while the guest keeps running; the remaining ~70ms is guest downtime due to the transfer of VCPU and I/O state. Span virtualization thus achieves sub-second refresh latency.
• If the replacement L1b is attached to guest memory well in advance, the state transfer can be triggered on demand by events such as unusual memory pressure or CPU usage, yielding sub-100ms guest downtime and event response latency. In contrast, live-migrating a guest from L1a to L1b with pre-copy [22] can take several seconds depending on guest size and workload [65].
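A minimal sketch of that replacement sequence, coordinated by L0. The steps mirror the description above, but the handles, function names, and signatures are invented for illustration; the toy operations just print what a real L0 would do:

```c
#include <stdio.h>

typedef int l1_t;      /* illustrative handles, not real Span types */
typedef int guest_t;

static void attach_memory(l1_t l1, guest_t g) { printf("attach L1 %d to guest %d memory\n", l1, g); }
static void pause_vcpus(guest_t g)            { printf("pause guest %d VCPUs\n", g); }
static void transfer_vcpus_and_io(l1_t from, l1_t to, guest_t g)
                                              { printf("guest %d VCPU+I/O state: L1 %d -> L1 %d\n", g, from, to); }
static void resume_vcpus(guest_t g)           { printf("resume guest %d VCPUs\n", g); }
static void detach(l1_t l1, guest_t g)        { printf("detach L1 %d from guest %d\n", l1, g); }

/* Replace old_l1 with a pre-booted new_l1 underneath a running guest. */
static void replace_hypervisor(guest_t g, l1_t old_l1, l1_t new_l1)
{
    attach_memory(new_l1, g);                 /* ~670ms for a 3GB guest; guest keeps running */
    pause_vcpus(g);                           /* guest downtime window begins (~70ms total)  */
    transfer_vcpus_and_io(old_l1, new_l1, g);
    resume_vcpus(g);                          /* guest downtime window ends                  */
    detach(old_l1, g);
}

int main(void)
{
    replace_hypervisor(/* guest */ 1, /* old L1a */ 10, /* new L1b */ 11);
    return 0;
}
```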

Macrobenchmarks

Guest Workloads
• Kernbench [41]: repeatedly compiles the Linux kernel
• Quicksort: repeatedly sorts 400MB of data in memory
• iPerf [1]: measures network bandwidth to another host
(A rough stand-in for the Quicksort workload follows this slide.)

Hypervisor-level Services
• Network monitoring (tcpdump)
• VMI (Volatility)

Table 2: Memory and CPU assignments for experiments
                 L0                L1                L2 (guest)
                 Mem     CPUs      Mem     VCPUs     Mem     VCPUs
  Host           128GB   12        N/A     N/A       N/A     N/A
  Single         128GB   12        3GB     1         N/A     N/A
  Nested         128GB   12        16GB    8         3GB     1
  Span0          128GB   12        8GB     4         3GB     1 (VCPUs on L0)
  Span1          128GB   12        8GB     4         3GB     1 (VCPUs on L1)

Notes (Section 8.2, Macro Benchmarks — Span VM compared against a native host with no hypervisor, a single-level guest, and a nested guest):
• The guest always has 3GB memory and one VCPU; L0 always has 128GB and 12 physical CPU cores. In the nested configuration, L1 has 16GB memory and 8 VCPUs. The guest VCPU in Span0 is controlled by L0, and in Span1 by an L1. In both Span0 and Span1, L1a and L1b each have 8GB memory and 4 VCPUs, so their sum matches the L1 in the nested setting.
• The benchmarks run in two modes. In No-op Mode, no services run in the host, L0, or L1s, and L0 controls the guest's virtio block and network devices (Figure 8). In Service Mode, network monitoring and VM introspection run at either L0 or the L1s (Figure 9): for single-level, L0 runs both services; for nested, L1 runs both; for Span0 and Span1, L1a runs network monitoring and controls the guest's network device, L1b runs Volatility, and L0 controls the guest's block device.
• The figures report each benchmark's performance normalized against the best case, along with system-wide average CPU utilization measured in L0 with the atop command each second during the experiments. [Figures 8 and 9: bar charts of normalized performance and CPU utilization for Single, Nested, Span0, and Span1 under Kernbench, Quicksort, and iPerf.]
• Kernbench and Quicksort, both modes: Span0 performs comparably to the single-level setting and Span1 comparably to the nested setting, with similar CPU utilization.
• iPerf, No-op Mode (Figure 8(c)): the Span1 guest sees about 6% bandwidth degradation (with notable fluctuation) and 7% more CPU utilization than the nested guest, because the guest's VCPUs are controlled by L1a while its network device is controlled by L0, so guest I/O requests (kicks) and responses are forwarded between L1a and L0 over the UDP-based message channel, whose traffic also competes with the guest's iPerf traffic on L1's virtio network interface with L0. If L1a controls the guest network device as well, iPerf in the Span1 guest performs as well as in the nested guest.
• iPerf, Service Mode (Figure 9(c)): nested, Span0, and Span1 guests perform about 14–15% worse than the single-level guest, due to the combined effect of virtio-over-virtio overhead and tcpdump running in L1a. For Span0, the guest VCPU is controlled by L0 while the network device is controlled by L1a, so forwarding I/O kicks and interrupts over the message channel balances out any gains from running guest VCPUs on L0.
• CPU utilization for iPerf in No-op Mode rises significantly — from 2.7% on the native host to 100+% for the single-level and Span0 configurations and 180+% for the nested and Span1 configurations. The increase appears to be due to the virtio network device implementation in QEMU, since this higher utilization was observed even with newer, unmodified QEMU (v2.7) and Linux (v4.4.2). Nested and Span1 show higher utilization than single-level because guest VCPUs are controlled by L1s, making nested VM exits more expensive.
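As a rough illustration of the Quicksort workload (not the authors' benchmark harness; a single iteration only, with an arbitrary data size and RNG): allocate about 400MB of integers, sort them in memory, and report the elapsed time.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Roughly 400MB of 4-byte integers. */
#define N (400UL * 1024 * 1024 / sizeof(int))

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(void)
{
    int *data = malloc(N * sizeof(int));
    if (!data) { perror("malloc"); return 1; }

    srand(42);
    for (size_t i = 0; i < N; i++)
        data[i] = rand();

    clock_t t0 = clock();
    qsort(data, N, sizeof(int), cmp_int);   /* in-memory sort of ~400MB */
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
    printf("sorted %zu ints in %.2f s\n", (size_t)N, secs);

    free(data);
    return 0;
}
```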


Microbenchmarks

Attach operation [Figure 10: overhead of attaching an L1 to a guest, as guest memory size increases]
• Attaching the memory of a 1GB Span guest takes about 220ms; memory attach overhead grows with guest size because each page that L1 has allocated for the Span guest must be remapped to the Span physical page in L0
• Attaching VCPUs to one of the L1s takes about 50ms; attaching virtual I/O devices takes about 135ms
• When I/O control is transferred between hypervisors, the VCPUs must be paused; since they could be running on any of the L1s, L0 coordinates pausing and resuming them during the transfer
• The detach operation for VCPUs and I/O devices has similar overhead

Table 3: Low-level latencies in Span virtualization (µs)
                          Single   Nested   Span
  EPT Fault               2.4      2.8      3.3
  Virtual EPT Fault       -        23.3     24.1
  Shadow EPT Fault        -        3.7      4.1
  Message Channel         -        -        53
  Memory Event Notify     -        -        103.5

Page fault servicing
• In Span, resolving a fault against EPT_L1 takes 3.3µs on average and a fault against the Virtual EPT 24.1µs; the corresponding nested values are 2.8µs and 23.3µs, and single-level EPT-fault processing takes 2.4µs. The difference is the extra synchronization work in the EPT-fault handler in L0.

Message channel and memory events
• The message channel exchanges events and requests between L0 and the L1s; sending a message between L0 and an L1 takes about 53µs on average (a sketch of the UDP exchange follows below)
• Without any subscribers, write-fault processing takes about 3.5µs in L0; notifying an L1 subscriber of a write event on a guest page takes about 103.5µs (Table 3)
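The paper states only that the L0–L1 message channel currently uses UDP. Below is a minimal sketch of how an L1 might send a subscription message to L0 over such a channel; the message layout, port number, and field names are all invented for illustration.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

/* Invented message format: the real channel carries I/O requests, device
 * interrupts, memory subscription messages, and attach/detach operations. */
enum span_msg_type { MSG_ATTACH = 1, MSG_DETACH, MSG_SUBSCRIBE, MSG_MEM_EVENT, MSG_IO_KICK };

struct span_msg {
    uint32_t type;      /* enum span_msg_type */
    uint32_t guest_id;  /* which guest the message refers to */
    uint64_t arg;       /* e.g., guest frame number or device id */
};

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    /* Hypothetical L0 message-channel endpoint. */
    struct sockaddr_in l0 = { .sin_family = AF_INET, .sin_port = htons(7777) };
    inet_pton(AF_INET, "127.0.0.1", &l0.sin_addr);

    /* Example: L1 subscribes to write events on guest frame 0x1000 of guest 1. */
    struct span_msg msg = { .type = MSG_SUBSCRIBE, .guest_id = 1, .arg = 0x1000 };
    if (sendto(fd, &msg, sizeof(msg), 0, (const struct sockaddr *)&l0, sizeof(l0)) < 0)
        perror("sendto");

    close(fd);
    return 0;
}
```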

Related Work

• User space services • library OS, uDenali, KVM/QEMU, NOVA

• Service VMs • Dom0 in Xen, Xoar, Self-Service Cloud

• Nested virtualization • Belpaire & Hsu, Ford et al., Graf & Roedel, Turtles • Ravello, XenBlanket, Bromium, DeepDefender, Dichotomy

• Span virtualization is the first to provide multiple third-party hypervisor-level services to a common guest

Summary: Span Virtualization

• We introduced the concept of a multi-hypervisor virtual machine • that can be concurrently controlled by multiple coresident hypervisors

• Another tool in a cloud provider’s toolbox • to offer compartmentalized guest-facing third-party services

• Future work
  • Faster event notification and processing
  • Direct device assignment to L1s or the guest
  • Possible to support unmodified L1s? Requires L1s to support partial guest control; current L1s assume full control.

• Code to be released after porting to newer KVM/QEMU

Questions?

Backup slides

Comparison

Table 1: Alternatives for providing multiple services to a common guest, assuming one service per user space process, service VM, or Span L1.

                 Virtualized   Guest Control     Impact of Service Failure                          Additional Performance
                 ISA           (Partial/Full)    L0          Coresident Services         Guests     Overheads
  Single-level   Yes           Full              Fails       Fail                        All        None
  User space     No            Partial           Protected   Protected                   Attached   Process switching
  Service VM     No            Partial           Protected   Protected                   Attached   VM switching
  Nested         Yes           Full              Protected   Protected in L1 user space  Attached   L1 switching + nesting
  Span           Yes           Both              Protected   Protected                   Attached   L1 switching + nesting

• Control Transfer: control of guest VCPUs and/or virtual I/O devices can be transferred from one hypervisor to another, but only via L0.

Comparison notes (Table 1):
• Service VMs and Span virtualization isolate coresident services in individual VM-level compartments, so failure of a service VM or a Span L1 does not affect coresident services.
• Additional performance overhead over the single-level case: user space services introduce context-switching overhead among processes; service VMs introduce VM context-switching overhead, which is more expensive; nesting adds the overhead of emulating privileged guest operations in L1. Span virtualization uses nesting but supports partial guest control by L1s, so nesting overhead applies only to the guest resources that L1s control.

Overview of Span Virtualization
• The key design requirement for Span VMs is transparency: the guest OS and its applications remain unmodified and oblivious to being simultaneously controlled by multiple hypervisors, which include L0 and any attached L1s. For guest resources, this requirement translates as follows:
  • Memory: all hypervisors must have the same consistent view of the guest memory.
  • VCPUs: all guest VCPUs must be controlled by one hypervisor at a given instant.
  • I/O devices: different virtual I/O devices of the same guest may be controlled exclusively by different hypervisors at a given instant.

[Figure 2: High-level architecture for Span virtualization — an unmodified Span guest; L1 hypervisor(s) with a Virtual Guest EPT and memory, I/O, and VCPU managers; L0 with the Guest Controller, guest memory manager, I/O and VCPU managers, event processing, and a message channel to the L1s]
• A Span guest begins as a single-level VM on L0. One or more L1s can then attach to one or more guest resources and optionally subscribe with L0 for specific guest events.

Guest Control Operations — the Guest Controller in L0 supervises control over a guest by multiple L1s through the following operations (a sketch of this interface follows below):
• [attach L1, Guest, Resource]: gives L1 control over Resource in Guest. Resources include guest memory, VCPUs, and I/O devices. Control over memory is shared among multiple attached L1s, whereas control over guest VCPUs and virtual I/O devices is exclusive to an attached L1. Attaching to guest VCPUs or I/O devices requires attaching to guest memory.
• [detach L1, Guest, Resource]: releases L1's control over Resource in Guest. Detaching from the guest memory resource requires detaching from guest VCPUs and I/O devices.
• [subscribe L1, Guest, Event, <GFN range>]: registers L1 with L0 to receive Event from Guest. The GFN range specifies the range of frames in the guest address space on which to track the memory event. Presently only memory event subscription is supported; other guest events of interest could include SYSENTER instructions, port-mapped I/O, etc.
• [unsubscribe L1, Guest, Event, …]: unsubscribes L1 from the Guest event.
• The Guest Controller also uses administrative policies to resolve a priori any potential conflicts over guest control by multiple L1s. This paper focuses on mechanisms rather than specific policies; conflict resolution among services is not unique to Span, since alternative techniques also need ways to prevent conflicting services from controlling the same guest.
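A minimal sketch of the Guest Controller interface just listed. The operation names and the ordering rules (memory before VCPUs/I/O on attach, and the reverse on detach) come from the description above; the C types, signatures, and toy state tracking are invented for illustration.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

enum resource { RES_MEMORY, RES_VCPU, RES_IO_DEVICE };
enum event    { EV_MEM_WRITE };   /* only memory events are supported today */

#define MAX_L1     8
#define MAX_GUESTS 8

/* Toy per-(L1, guest) attachment state. */
static bool mem_attached[MAX_L1][MAX_GUESTS];
static bool dev_attached[MAX_L1][MAX_GUESTS];

/* attach: memory control is shared; VCPU/I/O control is exclusive to an
 * attached L1 and requires attaching to guest memory first. */
static bool span_attach(int l1, int guest, enum resource res)
{
    if (res == RES_MEMORY) { mem_attached[l1][guest] = true; return true; }
    if (!mem_attached[l1][guest]) return false;
    dev_attached[l1][guest] = true;
    return true;
}

/* detach: detaching from memory requires detaching VCPUs and I/O first. */
static bool span_detach(int l1, int guest, enum resource res)
{
    if (res != RES_MEMORY) { dev_attached[l1][guest] = false; return true; }
    if (dev_attached[l1][guest]) return false;
    mem_attached[l1][guest] = false;
    return true;
}

/* subscribe: track a memory event on a range of guest frame numbers. */
static bool span_subscribe(int l1, int guest, enum event ev, uint64_t gfn_lo, uint64_t gfn_hi)
{
    if (ev != EV_MEM_WRITE || !mem_attached[l1][guest]) return false;
    printf("L1 %d subscribed to guest %d writes on GFNs [%#llx, %#llx]\n",
           l1, guest, (unsigned long long)gfn_lo, (unsigned long long)gfn_hi);
    return true;
}

int main(void)
{
    span_attach(0, 0, RES_MEMORY);                     /* L1a attaches to guest memory   */
    span_attach(0, 0, RES_IO_DEVICE);                  /* ...then to a virtual I/O device */
    printf("subscribe ok: %d\n", span_subscribe(0, 0, EV_MEM_WRITE, 0x1000, 0x2000));
    printf("early memory detach ok: %d\n", span_detach(0, 0, RES_MEMORY)); /* 0: device still attached */
    span_detach(0, 0, RES_IO_DEVICE);
    printf("memory detach ok: %d\n", span_detach(0, 0, RES_MEMORY));       /* 1 */
    return 0;
}
```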

Continuous and Transient Control

[Figure: continuous control — L1a and L1b remain attached to the guest above the hypervisor; transient control — a single L1 attaches to and detaches from the guest as needed]

• Continuous control: L1s always attached to the guest
• Transient control: L1 attaches to / detaches from the guest as needed