Multi-Hypervisor Virtual Machines: Enabling an Ecosystem of Hypervisor-level Services

Kartik Gopalan, Rohith Kugve, Hardik Bagdi, Yaohui Hu – Binghamton University
Dan Williams, Nilton Bila – IBM T.J. Watson Research Center

Funded in part by the NSF

Hypervisors

• A thin and secure layer in the cloud
-- or --
• Feature-filled cloud differentiators
  • Migration
  • Checkpointing
  • High availability
  • Live guest patching
  • Network monitoring
  • Intrusion detection
  • Other VMI

[Figure: Guest 1 running over a hypervisor that hosts Services A, B, and C]

Lots of third-party interest in hypervisor-level services

• Ravello
• XenBlanket
• McAfee Deep Defender
• SecVisor
• CloudVisor
• And more…

But limited support for third-party services from the base hypervisor.

How can a guest use multiple third-party hypervisor-level services?

Our Solution: Span

• One guest controlled by multiple coresident hypervisors.

[Figure: Guests 1 and 2 running over coresident L1 hypervisors offering Services A–D, all on the L0 hypervisor]

Outline

• Why multi-hypervisor virtual machines?
• Design of Span Virtualization
• Evaluations
• Related Work
• Conclusions and Future Work

Option 1: Fat hypervisor

• All services run at the most privileged level.
• But… the hypervisor cannot trust third-party services in privileged mode.

[Figure: Guests 1 and 2 over a hypervisor that contains Services A, B, and C]

Option 2: Native services

• Services run natively in the user space of the hypervisor
• Services control the guest indirectly via the hypervisor

[Figure: Services A, B, and C in hypervisor user space, alongside Guest 1]

• E.g. QEMU with KVM, uDenali
• But… potentially large user-kernel interface
  • Event interposition and system calls

Cloud providers are reluctant to run third-party services natively, even if in user space.

Option 3: Service VMs

• Run services inside deprivileged VMs
• Services control the guest indirectly via hypercalls and events

[Figure: A single Service VM hosting Services A, B, and C, alongside Guest 1]

• Single trusted Service VM
  • Runs all services
  • E.g. Domain0 in Xen

-- or --

Option 3: Service VMs

• Run services inside deprivileged VMs
• Services control the guest indirectly via hypercalls and events

[Figure: Multiple Service VMs hosting Services A, B, and C, alongside Guest 1]

• Multiple service VMs
  • One per service
  • Deprivileged and restartable
  • E.g. Service Domains in Xoar

Lack direct control over ISA-level guest state
  • Memory mappings, VCPU scheduling, port-mapped I/O, etc.

Option 4: Nested Virtualization

• Services run in a deprivileged L1 hypervisor, which runs on L0.
• Services control the guest at the virtualized ISA level.

• But… multiple services must reside in the same L1, i.e. a fat L1.

[Figure: Guests 1 and 2 over a single L1 hosting Services A, B, C, and D, which runs on L0]

• Vertically stack L1 hypervisors?
  • More than two levels of nesting is inefficient.

Our solution: Span Virtualization

• Allow multiple coresident L1s to concurrently control a common guest
  • i.e. horizontal layering of L1 hypervisors

• Guest is a multi-hypervisor virtual machine

• Each L1
  • Offers guest services that augment L0's services.
  • Controls one or more guest resources

[Figure: Guests 1 and 2 over coresident L1s offering Services A–D, on the L0 hypervisor]

Design Goals of Span Virtualization

• Guest Transparency
  • Guest remains unmodified

• Service Isolation
  • L1s controlling the same guest are unaware of each other.

Guest Control Operations

• L0 supervises which L1 controls which guest resource
  • Memory, VCPUs, and I/O

[Figure: Span guest (unmodified) over L1 hypervisor(s) and L0; each L1 maintains a Virtual Guest EPT]

• L0 and L1s communicate via traps/faults (implicit) and messages (explicit)

[Figure: L0 hypervisor connected to L1s via a message channel; L1 traps and guest faults flow to L0]

• Operations:
• Attach an L1 to a specified guest resource

• Detach an L1 from a guest resource

• Subscribe an attached L1 to receive guest events (currently memory events)

• Unsubscribe an L1 from a subscribed guest event (see the sketch below)
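To make the four operations concrete, here is a hedged sketch of what a control request on the message channel could look like; the names and layout (span_guest_ctl, SPAN_OP_ATTACH, span_handle_ctl) are illustrative assumptions, not the actual Span or KVM/QEMU interface.

```c
#include <stdint.h>

/* Guest resources an L1 can be given control over. */
enum span_resource { SPAN_RES_MEMORY, SPAN_RES_VCPU, SPAN_RES_IO_DEVICE };

/* Control operations an L1 sends to L0 over the message channel. */
enum span_op {
    SPAN_OP_ATTACH,       /* attach this L1 to a guest resource       */
    SPAN_OP_DETACH,       /* detach this L1 from a guest resource     */
    SPAN_OP_SUBSCRIBE,    /* subscribe to guest (memory) events       */
    SPAN_OP_UNSUBSCRIBE   /* cancel a previous event subscription     */
};

struct span_guest_ctl {
    uint32_t guest_id;          /* which Span guest                    */
    uint32_t l1_id;             /* requesting L1                       */
    enum span_op op;
    enum span_resource res;
    uint64_t arg;               /* op-specific: GPA range, device id…  */
};

/* Illustrative L0-side entry point: validate the request and apply it. */
int span_handle_ctl(const struct span_guest_ctl *req);
```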

Control over Guest Resources

• Guest Memory
  • Shared: All hypervisors have the same consistent view of guest memory

• Guest VCPUs
  • Exclusive: All guest VCPUs are controlled by one hypervisor at any instant

• Guest I/O devices
  • Exclusive: Different virtual I/O devices of a guest may be controlled by different hypervisors

• Control Transfer
  • Control over guest VCPUs and I/O devices can be transferred from one L1 to another via L0.

Memory Translation

Isolation and Communication: Another design goal is to compartmentalize L1 services, from each other and from L0. First, L1s must have lower execution privilege compared to L0. Secondly, L1s must remain isolated from each other. These two goals are achieved by deprivileging L1s using nested virtualization and executing them as separate guests on L0. Finally, L1s must remain unaware of each other during execution. This goal is achieved by requiring L1s to receive any subscribed guest events that are generated on other L1s only via L0.

There are two ways that L0 communicates with L1s: implicitly via traps and explicitly via messages. Traps allow L0 to transparently intercept certain memory management operations by L1 on the guest. Explicit messages allow an L1 to directly request guest control from L0. An Event Processing module in L0 traps runtime updates to guest memory mappings by any L1 and synchronizes guest mappings across different L1s. The event processing module also relays guest memory faults that need to be handled by L1. A bidirectional Message Channel relays explicit messages between L0 and L1s, including attach/detach requests, memory event subscription/notification, guest I/O requests, and virtual interrupts. Some explicit messages, such as guest I/O requests and virtual interrupts, could be replaced with implicit traps. Our choice of which to use is largely based on ease of implementation on a case-by-case basis.

Continuous vs. Transient Control: Span virtualization allows L1's control over guest resources to be either continuous or transient. Continuous control means that an L1 exerts uninterrupted control over one or more guest resources for an extended period of time. For example, an intrusion detection service in L1 that must monitor guest system calls, VM exits, or network traffic would require continuous control of guest memory, VCPUs, and the network device. Transient control means that an L1 acquires full control over guest resources for a brief duration, provides a short service to the guest, and releases guest control back to L0. For instance, an L1 that periodically checkpoints the guest would need transient control of guest memory, VCPUs, and I/O devices.

4 Memory Management

A Span VM has a single guest physical address space which is mapped into the address space of all attached L1s. Thus any memory write on a guest page is immediately visible to all hypervisors controlling the guest. Note that all L1s have the same visibility into the guest memory due to the horizontal layering of Span virtualization, unlike the vertical stacking of nested virtualization, which somewhat obscures the guest to lower layers.

Figure 3: Memory translation for single-level, nested, and Span VMs. VA = Virtual Address; GPA = Guest Physical Address; L1PA = L1 Physical Address; HPA = Host Physical Address.

4.1 Traditional Memory Translation

In modern x86 processors, hypervisors manage the physical memory that a guest can access using a virtualization feature called Extended Page Tables (EPT) [37], also called Nested Page Tables in AMD-V [5].

Single-level virtualization: Figure 3(a) shows that for single-level virtualization, the guest page tables map virtual addresses to guest physical addresses (VA to GPA in the figure). The hypervisor uses an EPT to map guest physical addresses to host physical addresses (GPA to HPA). Guest memory permissions are controlled by the combination of permissions in the guest page table and the EPT.

Whenever the guest attempts to access a page that is either not present or protected in the EPT, the hardware generates an EPT fault and traps into the hypervisor, which handles the fault by mapping a new page, emulating an instruction, or taking other actions. On the other hand, the hypervisor grants complete control over the traditional paging hardware to the guest. A guest OS is free to maintain the mappings between its virtual and guest physical address space and update them as it sees fit, without trapping into the hypervisor.

Nested virtualization: Figure 3(b) shows that for nested virtualization, the guest is similarly granted control over the traditional paging hardware to map virtual addresses to its guest physical address space. L1 maintains a Virtual EPT to map guest pages to pages in L1's physical address space, or L1 pages. Finally, one more translation is required: L0 maintains EPT_L1 to map L1 pages to physical pages. However, x86 processors can translate only two levels of addresses in hardware, from guest virtual to guest physical to host physical address. Hence the Virtual EPT maintained by L1 needs to be shadowed by L0, meaning that the Virtual EPT and EPT_L1 must be compacted by L0 during runtime into a Shadow EPT that directly maps guest pages to physical pages.
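The "compaction" described above can be pictured as composing the two software-maintained tables per guest page, keeping the least-permissive protections. The sketch below is a simplified illustration with invented helper names (virtual_ept_lookup, eptl1_lookup, shadow_ept_set); it is not KVM's multi-dimensional paging code.

```c
#include <stdint.h>
#include <stdbool.h>

typedef uint64_t gpa_t;    /* guest physical address */
typedef uint64_t l1pa_t;   /* L1 physical address    */
typedef uint64_t hpa_t;    /* host physical address  */

/* Assumed lookups into the two software-maintained tables (illustrative only). */
bool virtual_ept_lookup(gpa_t gpa, l1pa_t *l1pa, uint32_t *prot); /* L1's Virtual EPT: GPA -> L1PA */
bool eptl1_lookup(l1pa_t l1pa, hpa_t *hpa, uint32_t *prot);       /* L0's EPT_L1:      L1PA -> HPA */
void shadow_ept_set(gpa_t gpa, hpa_t hpa, uint32_t prot);         /* hardware-visible Shadow EPT   */

/* Compact one guest page: Shadow EPT[GPA] = EPT_L1[Virtual EPT[GPA]]. */
bool shadow_ept_compact_one(gpa_t gpa)
{
    l1pa_t l1pa;
    hpa_t hpa;
    uint32_t prot_vept, prot_eptl1;

    if (!virtual_ept_lookup(gpa, &l1pa, &prot_vept))
        return false;   /* no Virtual EPT entry: the fault is forwarded to L1            */
    if (!eptl1_lookup(l1pa, &hpa, &prot_eptl1))
        return false;   /* no EPT_L1 entry: L0 must first back the L1 page with memory   */

    shadow_ept_set(gpa, hpa, prot_vept & prot_eptl1);  /* least-permissive combination   */
    return true;
}
```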



Synchronizing Guest Memory Maps

• Span guest memory translation (guest physical to host physical) should be the same regardless of the translation path.
• L0 syncs the Shadow EPTs and EPT_Guest upon:
  • Guest faults
  • Virtual EPT modifications by L1
  • When L1 directly accesses guest memory
• L1s subscribe to guest memory events via L0
  • E.g. to track write events for dirty page tracking

Figure 4: Span memory management overview. (Guest VA to GPA via the guest page table; GPA to L1PA via each L1's Virtual EPT; L1PA to HPA via EPT_L1; L0's Virtual EPT trap handler, event subscriptions, and event notifications keep the Shadow EPTs, EPT_Guest, and EPT_L1 synchronized.)

To accomplish this, manipulations to the Virtual EPT by L1 trigger traps to L0. Whenever L1 loads a Virtual EPT, L0 receives a trap and activates the appropriate Shadow EPT. This style of nested page table management is also called multi-dimensional paging [10].

EPT faults on guest memory can be due to (a) the guest accessing its own pages that have invalid Shadow EPT entries, and (b) the L1 directly accessing guest pages that have invalid EPT_L1 entries to perform tasks such as I/O processing and VM introspection (VMI). Both kinds of EPT faults are first intercepted by L0. L0 examines a Shadow EPT fault to further determine whether it is due to an invalid Virtual EPT entry; such faults are forwarded to L1 for processing. Otherwise, faults due to invalid EPT_L1 entries are handled by L0.

Finally, an L1 may modify the Virtual EPT it maintains for a guest in the course of performing its own memory management. However, since the Virtual EPT is shadowed by L0, all Virtual EPT modifications cause traps to L0 for validation and a Shadow EPT update.

4.2 Memory Translation for Span VMs

In Span virtualization, L0 extends nested EPT management to guests that are controlled by multiple hypervisors. Figure 3(c) shows that a Span guest has multiple Virtual EPTs, one per L1 that is attached to the guest. When an L1 acquires control over a guest's VCPUs, L0 shadows the guest's Virtual EPT in the L1 to construct the corresponding Shadow EPT, which is used for memory translations. In addition, an EPT_Guest is maintained by L0 for direct guest execution on L0. A guest's memory mappings in the Shadow EPTs, the EPT_Guest, and the EPT_L1 are kept synchronized by L0 upon page faults so that every attached hypervisor sees a consistent view of guest memory. Thus, a guest virtual address leads to the same host physical address irrespective of the Shadow EPT used for the translation.

4.3 Memory Attach and Detach

A Span VM is initially created directly on L0 as a single-level guest for which L0 constructs a regular EPT. To attach to the guest memory, a new L1 requests L0, via a hypercall, to map guest pages into its address space. Figure 4 illustrates that L1 reserves a range in the L1 physical address space for guest memory and then informs L0 of this range. Next, L1 constructs a Virtual EPT for the guest, which is shadowed by L0 as in the nested case. Note that the reservation in L1 physical address space does not immediately allocate physical memory. Rather, physical memory is allocated lazily upon guest memory faults. L0 dynamically populates the reserved address range in L1 by adjusting the mappings in EPT_L1 and the Shadow EPT. A memory-detach operation correspondingly undoes the EPT_L1 mappings for the guest and releases the reserved L1 address range.

4.4 Synchronizing Guest Memory Maps

To enforce a consistent view of guest memory across all L1s, L0 synchronizes memory mappings upon two events: EPT faults and Virtual EPT modifications. Fault handling for Span VMs extends the corresponding mechanism for nested VMs described earlier in Section 4.1. The key difference in the Span case is that L0 first checks if a host physical page has already been mapped to the faulting guest page. If so, the existing physical page mapping is used to resolve the fault, else a new physical page is allocated.

As with the nested case, modifications by an L1 to establish Virtual EPT mappings trap to a Virtual EPT trap handler in L0, shown in Figure 4. When the handler receives a trap due to a protection modification, it updates each corresponding EPT_L1 with the new least-permissive combination of page protections. Our current prototype allows protection modifications but disallows changes to established GPA-to-L1PA mappings to avoid having to change mappings in multiple EPTs.

I/O Control

• We consider para-virtual I/O in this work

[Figure: Traditional para-virtual I/O — the guest frontend exchanges requests and responses with the hypervisor backend via an I/O ring buffer; kicks notify the backend, interrupts notify the frontend, and the backend performs native device I/O.]
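The request/response cycle in the figure can be read as a small frontend loop. The following is a generic illustration of the paravirtual pattern with made-up primitives (ring_push_request, kick_backend, and so on), not the actual virtio driver code.

```c
#include <stdint.h>
#include <stdbool.h>

struct io_req  { uint64_t sector; void *buf; uint32_t len; bool write; };
struct io_resp { int status; };

/* Assumed shared-ring and notification primitives (illustrative only). */
void ring_push_request(struct io_req *req);
bool ring_pop_response(struct io_resp *resp);
void kick_backend(void);                   /* frontend -> hypervisor: causes a VM exit   */
void wait_for_completion_interrupt(void);  /* hypervisor -> frontend: virtual interrupt  */

/* Frontend side of one paravirtual I/O operation. */
int frontend_do_io(struct io_req *req, struct io_resp *resp)
{
    ring_push_request(req);                /* 1. place the request in the shared ring    */
    kick_backend();                        /* 2. notify the backend via a kick           */
    wait_for_completion_interrupt();       /* 3. backend completes request, injects IRQ  */
    return ring_pop_response(resp) ? resp->status : -1;  /* 4. pick up the response      */
}
```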

4.5 Memory Event Subscription

An L1 attached to a guest may wish to monitor and control certain memory-related events of the guest to provide a service. For instance, an L1 that provides live checkpointing or guest mirroring may need to perform dirty page tracking, in which pages to which the guest writes are periodically recorded so they can be incrementally copied. As another example, an L1 performing intrusion detection using VM introspection might wish to monitor a guest's attempts to execute code from certain pages.

In Span virtualization, since multiple L1s can be attached to a guest, the L1 controlling the guest's VCPUs may differ from the L1s requiring the memory event notification. Hence L0 provides a Memory Event Subscription interface to enable L1s to independently subscribe to guest memory events. An L1 subscribes with L0, via the message channel, requesting notifications when a specific type of event occurs on certain pages of a given guest. When L0 intercepts the subscribed events, it notifies all L1 subscribers via the message channel. Upon receiving the event notification, a memory event emulator in each L1, shown in Figure 4, processes the event and responds back to L0, either allowing or disallowing the guest's memory access which triggered the event. The response from the L1 also specifies whether L0 should maintain or discontinue the L1's event subscription on the guest page. For example, upon receiving a write event notification, an L1 that performs dirty page tracking will instruct L0 to allow the guest to write to the page, and cancel the subscription for future write events on the page, since the page has been recorded as being dirty. On the other hand, an intrusion detection service in L1 might disallow write events on guest pages containing kernel code and maintain future subscription. L0 concurrently delivers event notifications to all L1 subscribers. Guest memory access is allowed to proceed only if all subscribed L1s allow the event in their responses.

To intercept a subscribed memory event on a page, L0 applies the event's mask to the corresponding EPT_L1 entry of each L1 attached to the guest. Updating EPT_L1 prompts L0 to update the guest's Shadow EPT entry with the mask, to capture guest-triggered memory events. Updating EPT_L1 entries also captures the events resulting from direct accesses to guest memory by an L1 instead of the guest. For instance, to track write events on a guest page, the EPT entry could be marked read-only after saving the original permissions for later restoration.

I/O Control

• We consider para-virtual I/O in this work
• A Span guest's I/O device and VCPUs may be controlled by different L1s

Figure 5: Paravirtual I/O for Span VMs. L1a controls the guest I/O device and L1b controls the VCPUs. Kicks from L1b and interrupts from L1a are forwarded via L0.

5 I/O Control

In this work, guests use paravirtual devices [54, 6], which provide better performance than device emulation [59] and provide greater physical device sharing among guests than direct device assignment [11, 12, 50]. For single-level virtualization, the guest OS runs a set of paravirtual frontend drivers, one for each virtual device, including block and network devices. The hypervisor runs the corresponding backend driver. The frontend and the backend drivers communicate via a shared ring buffer to issue I/O requests and receive responses. The frontend places an I/O request in the ring buffer and notifies the backend through a kick event, which triggers a VM exit to the hypervisor. The backend removes the I/O request from the ring buffer, completes the request, places the I/O response in the ring buffer, and injects an I/O completion interrupt into the guest. The interrupt handler in the frontend then picks up the I/O response from the ring buffer for processing. For nested guests, paravirtual drivers are used at both levels.

For Span guests, different L1s may control guest VCPUs and I/O devices. If the same L1 controls both the guest VCPUs and the device backend, then I/O processing proceeds as in the nested case. Figure 5 illustrates the other case, when different L1s control guest VCPUs and backends. L1a controls the backend and L1b controls the guest VCPUs. The frontend in the guest and the backend in L1a exchange I/O requests and responses via the ring buffer. However, I/O kicks are generated by guest VCPUs controlled by L1b, which forwards the kicks to L1a. Likewise, L1a forwards any virtual interrupts from the backend to L1b, which injects the interrupt into the guest VCPUs. Kicks from the frontend and virtual interrupts from the backend are forwarded between L1s via L0 using the message channel.

6 VCPU Control

In single-level virtualization, L0 controls the scheduling of guest VCPUs. In nested virtualization, L0 delegates guest VCPU scheduling to an L1. The L1 schedules guest VCPUs on its own VCPUs and L0 schedules the L1's VCPUs on PCPUs. This hierarchical scheduling provides the L1 some degree of control over customized scheduling for its guests.

Span virtualization can leverage either single-level or nested VCPU scheduling depending on whether the L0 or an L1 controls a guest's VCPUs. Our current design requires that all VCPUs of a guest be controlled by one of the hypervisors at any instant. However, control over guest VCPUs can be transferred between hypervisors if needed. When L0 initiates a Span VM, it initializes all the VCPUs as it would for a single-level guest. After the guest boots up, control of guest VCPUs can be transferred to/from an L1 using attach/detach operations.
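When the L1 controlling the guest's VCPUs differs from the L1 running the backend (L1b and L1a in Figure 5), kicks and interrupts detour through L0. Below is a hedged sketch of that relay logic, with invented message and helper names rather than the prototype's actual code.

```c
#include <stdint.h>

enum span_msg_type { SPAN_MSG_IO_KICK, SPAN_MSG_VIRQ };

struct span_msg {
    enum span_msg_type type;
    uint32_t guest_id;
    uint32_t device_id;   /* which virtio device the kick/interrupt targets */
    uint32_t vector;      /* interrupt vector, for SPAN_MSG_VIRQ            */
};

/* Assumed message-channel and lookup primitives (illustrative only). */
void msg_send_to_l0(const struct span_msg *m);
void msg_forward_to_l1(uint32_t l1_id, const struct span_msg *m);
uint32_t l1_controlling_backend(uint32_t guest_id, uint32_t device_id);
uint32_t l1_controlling_vcpus(uint32_t guest_id);

/* L1b side: forward a kick generated by a guest VCPU it controls. */
void l1_forward_kick(uint32_t guest_id, uint32_t device_id)
{
    struct span_msg m = { .type = SPAN_MSG_IO_KICK, .guest_id = guest_id,
                          .device_id = device_id, .vector = 0 };
    msg_send_to_l0(&m);
}

/* L0's relay: kicks go to the backend's L1, virtual interrupts to the VCPUs' L1. */
void l0_relay(const struct span_msg *m)
{
    if (m->type == SPAN_MSG_IO_KICK)
        msg_forward_to_l1(l1_controlling_backend(m->guest_id, m->device_id), m);
    else /* SPAN_MSG_VIRQ */
        msg_forward_to_l1(l1_controlling_vcpus(m->guest_id), m);
}
```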

VCPU Control

• Simple for now.

• All VCPUs controlled by one hypervisor
  • Either by L0 or one of the L1s

• Can we distribute VCPUs among L1s? • Possible, but no good reason why. • Requires expensive IPI forwarding across L1s • Complicates memory synchronization. Implementation Guest Guest tialization, I/O emulation, checkpointing, and migration. • Guest: Unmodified 15.10, 4.2 QEMU Span Guest QEMU Guest In L1a In L1b Figure 6 shows the position of the Guest Controller in Guest QEMU L1a L1b different virtualization models. In both single-level and • L0 and L1 L1 Guest Guest L1 L1a KVM KVM L1b nested virtualization, there is only one Guest Controller Guest KVM QEMU • QEMU 1.2 and Linux 3.14.2 QEMU QEMU QEMU QEMU • Modified nesting support in KVM/QEMU per guest, since each guest is completely controlled by L0 L0 KVM L0 • L0 : 980+ lines in KVM and 500+ lines in QEMU KVM KVM one hypervisor. Additionally, in the nested case, each L1 • L1: 300+ lines in KVM and 380+ in QEMU has its own Guest Controller that runs on L0. In Span vir- Single Nested Span tualization, each guest is associated with multiple Guest Figure 6: Roles of QEMU (Guest Controller) and KVM • Guest controller • Message channel Controllers, one per attached hypervisor. For instance, (hypervisor)• for Single-level, Nested, and Span VMs. • User space QEMU process For I/O kick and interrupt forwarding the Span Guest in Figure 6 is associated with three Guest • Currently using UDP messages and hypercalls • Guest initialization, I/O emulation, Control Controllers, one each on L0, L1a, and L1b. During at- Transfer, Migration, etc • Control Transfer7 Implementation Details tach/detach operations, the Guest Controller in an L1 ini- • Guest VCPUs and virtio devices can be transferred between L1s and L0 tiates the mapping/unmapping of guest memory into the • I/O: virtio-over-virtio Platform• Using attach/detach operations and Modifications: Our prototype supports • Direct assignment: future work running an unmodified Linux guest as a Span VM in L1’s address space and, if needed, acquires/releases con- modes V3, V4, and V5 from Figure 1. In our test setup, the trol over the guest’s VCPU and virtual I/O devices. guest runs Ubuntu 15.10 with Linux 4.2.0. The prototype Paravirtual I/O Architecture: The Guest Controller for Span virtualization is implemented by modifying the also performs I/O emulation of virtual I/O devices con- KVM/QEMU nested virtualization support that is built trolled by its corresponding hypervisor. The paravirtual into standard Linux distributions. Currently the imple- device model described in Section 5 is called virtio in mentation of L0 and all L1s uses modified KVM/QEMU KVM/QEMU [54]. For nested guests, the virtio drivers hypervisors in Linux, specifically QEMU-1.2.0, kvm- are used at two levels: once between L0 and each L1 and kmod-3.14.2 and Linux 3.14.2. The modifications are again between an L1 and the guest. This design is also different for the L0 and L1 layers. Ideally, we would called virtio-over-virtio. A kick is implemented in vir- prefer L1 to be unmodified to simplify its interface with tio as a trap from the frontend leading to a VM L0. However, current hypervisors assume complete and exit to KVM, which delivers the kick to the Guest Con- exclusive guest control whereas Span allows L1s to exer- troller as a signal. Upon I/O completion, the Guest Con- cise partial control over a subset of guest resources. Sup- troller requests KVM to inject a virtual interrupt into the porting partial guest control necessarily requires changes guest. 
Kicks and interrupts are forwarded across hyper- to L1 for attaching/detaching with a subset of guest re- visors using the message channel. Redirected interrupts sources and memory event subscription. In implement- are received and injected into the guest by a modified ing L1 attach/detach operations on a guest, we tried, as version of KVM’s virtual IOAPIC code. much as possible, to reuse existing implementations of VCPU Control: The Guest Controllers in different VM creation/termination operations. hypervisors communicate with the Guest Controller in Code size and memory footprint: Our implemen- L0 to acquire or relinquish guest VCPU control. The tation required about 2200 lines of code changes in Guest Controller represents each guest VCPU as a user KVM/QEMU, which is roughly 980+ lines in KVM and space thread. A newly attached L1 hypervisor does not 500+ lines in QEMU for L0, 300+ in KVM and 200+ in initialize guest VCPU state from scratch. Rather, the QEMU for L1, and another 180+ in the virtio backend. Guest Controller in the L1 accepts a checkpointed guest We disabled unnecessary kernel components in both L0 VCPU state from its counterpart in L0 using a technique and L1 implementations to reduce their footprint. When similar to that used for live VM migration between phys- idle, L0 was observed to have 600MB usage at startup. ical hosts. After guest VCPU states are transferred from When running an idle Span guest attached to an idle L1, L0 to L1, the L1 Guest Controller resumes the guest L0’s memory usage increased to 1756MB after exclud- VCPU threads while the L0 Guest Controller pauses ing usage by the guest and the L1. The L1’s initial mem- its VCPU threads. A VCPU detach operation similarly ory usage, as measured from L0, was 1GB after exclud- transfers a checkpoint of guest VCPU states from L1 to ing the guest footprint. This is an initial prototype to L0. Transfer of guest VCPU states from one L1 to an- validate our ideas. The footprints of L0 and L1 imple- other is presently accomplished through a combination mentations could be further reduced using one of many of detaching the source L1 from the guest VCPUs fol- lightweight Linux distributions [14]. lowed by attaching to the destination L1 (although a di- Guest Controller: A user-level control process, rect transfer could be potentially more efficient). called the Guest Controller, runs on the hypervisor along- Message Channel: The message channel between side each guest. In KVM/QEMU, the Guest Controller L0 and each L1 is implemented using a combination is a QEMU process which assists the KVM hypervisor of hypercalls and UDP messages. Hypercalls from an with various control tasks on a guest, including guest ini- L1 to L0 are used for attach/detach operations on guest
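Below is a hedged sketch of how the two transports described above could be combined on the L1 side; the hypercall wrapper, message layout, and UDP port are placeholders, not values from the prototype.

```c
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#define SPAN_UDP_PORT 9099   /* placeholder port; the prototype's value is not given */

struct span_msg { uint32_t type; uint32_t guest_id; uint64_t payload; };

/* Control-plane requests (attach/detach on guest memory) go through a hypercall
 * from L1 to L0; this wrapper is assumed, not a real KVM hypercall number. */
long span_hypercall(unsigned long nr, unsigned long a0, unsigned long a1);

/* Data-plane notifications (kicks, interrupts, memory events) go over UDP. */
static int span_udp_send(const char *l0_addr, const struct span_msg *m)
{
    struct sockaddr_in dst = { .sin_family = AF_INET, .sin_port = htons(SPAN_UDP_PORT) };
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd < 0)
        return -1;
    if (inet_pton(AF_INET, l0_addr, &dst.sin_addr) != 1) {
        close(fd);
        return -1;
    }
    ssize_t n = sendto(fd, m, sizeof(*m), 0, (struct sockaddr *)&dst, sizeof(dst));
    close(fd);
    return n == (ssize_t)sizeof(*m) ? 0 : -1;
}
```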

Example 1: Two L1s controlling one Guest

• Guest: Infected with rootkit
• L1a: Monitoring network traffic
• L1b: Running VMI (Volatility)

Figure 7: A screenshot of a Span VM simultaneously using services from two L1s. (L1a: network monitoring; guest infected with KBeast; L1b: Volatility.)

8 Evaluation

We first demonstrate unmodified Span VMs that can simultaneously use services from multiple L1s. Next we investigate how Span guests perform compared to traditional single-level and nested guests. Our setup consists of a server containing dual six-core Xeon 2.10 GHz CPUs, 128GB memory, and 1Gbps Ethernet. The software configurations for L0, L1s, and Span guests are as described earlier in Section 7. Each data point is a mean (average) over at least five or more runs.

8.1 Span VM Usage Examples

We present three examples in which a Span VM transparently utilizes services from multiple L1s. An unmodified guest is controlled by three coresident hypervisors, namely L0, L1a, and L1b.

Use Case 1 – Network Monitoring and VM Introspection: In the first use case, the two L1s passively examine the guest state, while L0 supervises resource control. L1a controls the guest's virtual network device whereas L1b controls the guest VCPUs. L1a performs network traffic monitoring by running the tcpdump tool to capture packets on the guest's virtual network interface. Here we use tcpdump as a stand-in for other more complex packet filtering and analysis tools. L1b performs VM introspection (VMI) using a tool called Volatility [3], which continuously inspects a guest's memory using a utility such as pmemsave to extract an accurate list of all processes running inside the guest. The guest OS is infected by a rootkit, Kernel Beast [38], which can hide malicious activity and present an inaccurate process list to the compromised guest. Volatility, running in L1b, can nevertheless extract an accurate guest process list using VM introspection.

Figure 7 shows a screenshot, where the top window shows the tcpdump output in L1a, specifically the SSH traffic from the guest. The bottom right window shows that the rootkit KBeast in the guest OS hides a process evil, i.e. it prevents the process evil from being listed using the ps command in the guest. The bottom left window shows that Volatility, running in L1b, successfully detects the process evil hidden by the KBeast rootkit in the guest.

This use case highlights several salient features of our design. First, an unmodified guest executes correctly even though its resources are controlled by multiple hypervisors. Second, an L1 can transparently examine guest memory. Third, an L1 controlling a guest virtual device (here the network interface) can examine all I/O requests specific to the device even if the I/O requests are initiated from guest VCPUs controlled by another hypervisor. Thus an I/O device can be delegated to an L1 that does not control the guest VCPUs.

Use Case 2 – Guest Mirroring and VM Introspection: In this use case, we demonstrate an L1 that subscribes to guest memory events from L0. Hypervisors can provide a high availability service that protects unmodified guests from a failure of the physical machine. Solutions such as Remus [24] typically work by continually transferring live incremental checkpoints of the guest state to a remote backup server, an operation that we call guest mirroring. When the primary VM fails, its backup image is activated, and the VM continues running as if failure never happened. To checkpoint incrementally, hypervisors typically use a feature called dirty page tracking. The hypervisor maintains a dirty bitmap, i.e. the set of pages that were dirtied since the last checkpoint. The dirty bitmap is constructed by marking all guest pages read-only in the EPT and recording dirtied pages upon write traps. The pages listed in the dirty bitmap are incrementally copied to the backup server.

As a first approximation of guest mirroring, we modified the pre-copy code in KVM/QEMU to periodically copy all dirtied guest pages to a backup server at a given frequency. In our setup, L1a mirrors a Span guest while L1b runs Volatility and controls guest VCPUs. L1a uses memory event subscription to track write events, construct the dirty bitmap, and periodically transfer any dirty pages to the backup server. We measured the average bandwidth reported by the iPerf [1] client benchmark running in the guest when L1a mirrors the guest memory at different frequencies. When guest mirroring happens every 12 seconds, iPerf delivers 800Mbps average bandwidth, which is about the same as with a nested guest. When guest mirroring happens every second, the average bandwidth drops to 600Mbps, indicating a 25% performance impact of event subscription at very high mirroring frequencies.
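Below is a hedged sketch of how a mirroring service in an L1 could use the write-event subscription described above to build its dirty bitmap; the guest size, page size, and helper names are illustrative assumptions.

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define GUEST_PAGES (3u * 1024 * 1024 * 1024 / 4096)   /* 3GB guest, 4KB pages (assumed) */

static uint8_t dirty_bitmap[GUEST_PAGES / 8];

static inline void mark_dirty(uint64_t gfn)
{
    dirty_bitmap[gfn / 8] |= 1u << (gfn % 8);
}

/* Assumed L0-facing helpers (illustrative only). */
void l0_subscribe_write_events(uint32_t guest_id);      /* pages are marked read-only in the EPTs */
void l0_reply(uint64_t gfn, bool allow, bool keep_subscription);

/* Called in L1 when L0 relays a write event for guest frame gfn. */
void on_guest_write_event(uint32_t guest_id, uint64_t gfn)
{
    mark_dirty(gfn);
    /* Allow the write and cancel future write events on this page:
     * it is already recorded as dirty for the next incremental copy. */
    l0_reply(gfn, /*allow=*/true, /*keep_subscription=*/false);
}

/* At each checkpoint interval: copy dirty pages, clear the bitmap, re-subscribe. */
void checkpoint_round(uint32_t guest_id)
{
    /* ... copy pages whose bit is set to the backup server ... */
    memset(dirty_bitmap, 0, sizeof(dirty_bitmap));
    l0_subscribe_write_events(guest_id);
}
```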

Example 2: Guest Mirroring

• L1a runs Volatility

• L1b runs guest mirroring: incremental copy to a remote node
  • Periodically copies dirty guest pages (see the sketch below)
  • Requires subscribing to write events on guest memory (memory event subscription via L0)
• Guest runs iPerf
  • ~800Mbps when mirrored every 12 seconds — same as standard nested
  • ~600Mbps when mirrored every 1 second
  • ~25% impact with high-frequency dirty page tracking
[Figure: guest running iPerf; L1a provides a VM introspection service and L1b a guest mirroring service, with memory event subscription via L0]
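A minimal sketch of the periodic dirty-page mirroring loop, with a toy dirty bitmap and stubbed-out transfer. The real implementation modifies KVM/QEMU's pre-copy code and obtains write events through L0's memory event subscription, so the sizes, helpers, and intervals below are illustrative only:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define PAGE_SIZE   4096UL
#define GUEST_PAGES 1024UL                  /* small demo "guest" */

static uint8_t dirty[GUEST_PAGES / 8];      /* dirty bitmap: one bit per guest page */
static uint8_t memory[GUEST_PAGES][PAGE_SIZE];

/* Stub for L0's write-event subscription: in the real system L0 traps guest
 * writes and reports dirtied pages; here we just pretend two pages changed. */
static void get_dirty_pages(void)
{
    dirty[7 / 8]  |= 1u << (7 % 8);
    dirty[42 / 8] |= 1u << (42 % 8);
}

/* Stub for the network transfer to the backup server. */
static void send_page(uint64_t gfn, const void *data, size_t len)
{
    (void)data;
    printf("mirroring gfn %llu (%zu bytes)\n", (unsigned long long)gfn, len);
}

int main(void)
{
    const unsigned interval_sec = 1;        /* 12s -> ~800Mbps iPerf, 1s -> ~600Mbps */

    for (int round = 0; round < 3; round++) {
        sleep(interval_sec);
        get_dirty_pages();
        /* Copy every page dirtied since the last round, then clear the bitmap. */
        for (uint64_t gfn = 0; gfn < GUEST_PAGES; gfn++)
            if (dirty[gfn / 8] & (1u << (gfn % 8)))
                send_page(gfn, memory[gfn], PAGE_SIZE);
        memset(dirty, 0, sizeof(dirty));
    }
    return 0;
}
```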

Example 3: Live Hypervisor Replacement

• Replace the hypervisor underneath a live guest
  • L1 runs a full hypervisor
  • L0 acts as a thin switching layer
• Replacement operation
  • Attach the new L1
  • Detach the old L1

• ~740ms replacement latency, including memory co-mapping
[Figure: L0 switches the guest from the old L1 hypervisor to the new L1 hypervisor]

• ~70ms guest downtime
  • During VCPU and I/O state transfer (see the sketch below)

Notes (Use Case 3 – Proactive Refresh, from the paper):
• Hypervisor-level services may contain latent bugs, such as memory leaks, or other vulnerabilities that worsen over time, making a monolithic hypervisor unreliable for guests. Techniques like Microreboot [18] and ReHype [43] improve hypervisor availability either proactively or post-failure. Span virtualization compartmentalizes unreliable services in isolated deprivileged L1s and, going one step further, can replace an unreliable L1 with a fresh instance while the guest and the base L0 hypervisor keep running.
• In the experiment, an old L1 (L1a) was attached to a 3GB Span guest. A new pre-booted replacement hypervisor (L1b) was attached to the guest memory, and L1a was then detached by transferring guest VCPUs and I/O devices to L1b via L0.
• The entire refresh operation, from attaching L1b to detaching L1a, completes on average within 740ms. Of this, about 670ms is spent attaching L1b to guest memory while the guest keeps running; the remaining ~70ms is guest downtime due to the transfer of VCPU and I/O state. Span virtualization thus achieves sub-second refresh latency.
• If the replacement L1b is attached to guest memory well in advance, the state transfer can be triggered on demand by events such as unusual memory pressure or CPU usage, yielding sub-100ms guest downtime and event response latency. In contrast, live-migrating a guest from L1a to L1b with pre-copy [22] can take several seconds depending on guest size and workload [65].
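A minimal sketch of that replacement sequence, coordinated by L0. The steps mirror the description above, but the handles, function names, and signatures are invented for illustration; the toy operations just print what a real L0 would do:

```c
#include <stdio.h>

typedef int l1_t;      /* illustrative handles, not real Span types */
typedef int guest_t;

static void attach_memory(l1_t l1, guest_t g) { printf("attach L1 %d to guest %d memory\n", l1, g); }
static void pause_vcpus(guest_t g)            { printf("pause guest %d VCPUs\n", g); }
static void transfer_vcpus_and_io(l1_t from, l1_t to, guest_t g)
                                              { printf("guest %d VCPU+I/O state: L1 %d -> L1 %d\n", g, from, to); }
static void resume_vcpus(guest_t g)           { printf("resume guest %d VCPUs\n", g); }
static void detach(l1_t l1, guest_t g)        { printf("detach L1 %d from guest %d\n", l1, g); }

/* Replace old_l1 with a pre-booted new_l1 underneath a running guest. */
static void replace_hypervisor(guest_t g, l1_t old_l1, l1_t new_l1)
{
    attach_memory(new_l1, g);                 /* ~670ms for a 3GB guest; guest keeps running */
    pause_vcpus(g);                           /* guest downtime window begins (~70ms total)  */
    transfer_vcpus_and_io(old_l1, new_l1, g);
    resume_vcpus(g);                          /* guest downtime window ends                  */
    detach(old_l1, g);
}

int main(void)
{
    replace_hypervisor(/* guest */ 1, /* old L1a */ 10, /* new L1b */ 11);
    return 0;
}
```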

Macrobenchmarks

Guest Workloads
• Kernbench [41]: repeatedly compiles the Linux kernel
• Quicksort: repeatedly sorts 400MB of data in memory
• iPerf [1]: measures network bandwidth to another host
(A rough stand-in for the Quicksort workload follows this slide.)

Hypervisor-level Services
• Network monitoring (tcpdump)
• VMI (Volatility)

Table 2: Memory and CPU assignments for experiments
                 L0                L1                L2 (guest)
                 Mem     CPUs      Mem     VCPUs     Mem     VCPUs
  Host           128GB   12        N/A     N/A       N/A     N/A
  Single         128GB   12        3GB     1         N/A     N/A
  Nested         128GB   12        16GB    8         3GB     1
  Span0          128GB   12        8GB     4         3GB     1 (VCPUs on L0)
  Span1          128GB   12        8GB     4         3GB     1 (VCPUs on L1)

Notes (Section 8.2, Macro Benchmarks — Span VM compared against a native host with no hypervisor, a single-level guest, and a nested guest):
• The guest always has 3GB memory and one VCPU; L0 always has 128GB and 12 physical CPU cores. In the nested configuration, L1 has 16GB memory and 8 VCPUs. The guest VCPU in Span0 is controlled by L0, and in Span1 by an L1. In both Span0 and Span1, L1a and L1b each have 8GB memory and 4 VCPUs, so their sum matches the L1 in the nested setting.
• The benchmarks run in two modes. In No-op Mode, no services run in the host, L0, or L1s, and L0 controls the guest's virtio block and network devices (Figure 8). In Service Mode, network monitoring and VM introspection run at either L0 or the L1s (Figure 9): for single-level, L0 runs both services; for nested, L1 runs both; for Span0 and Span1, L1a runs network monitoring and controls the guest's network device, L1b runs Volatility, and L0 controls the guest's block device.
• The figures report each benchmark's performance normalized against the best case, along with system-wide average CPU utilization measured in L0 with the atop command each second during the experiments. [Figures 8 and 9: bar charts of normalized performance and CPU utilization for Single, Nested, Span0, and Span1 under Kernbench, Quicksort, and iPerf.]
• Kernbench and Quicksort, both modes: Span0 performs comparably to the single-level setting and Span1 comparably to the nested setting, with similar CPU utilization.
• iPerf, No-op Mode (Figure 8(c)): the Span1 guest sees about 6% bandwidth degradation (with notable fluctuation) and 7% more CPU utilization than the nested guest, because the guest's VCPUs are controlled by L1a while its network device is controlled by L0, so guest I/O requests (kicks) and responses are forwarded between L1a and L0 over the UDP-based message channel, whose traffic also competes with the guest's iPerf traffic on L1's virtio network interface with L0. If L1a controls the guest network device as well, iPerf in the Span1 guest performs as well as in the nested guest.
• iPerf, Service Mode (Figure 9(c)): nested, Span0, and Span1 guests perform about 14–15% worse than the single-level guest, due to the combined effect of virtio-over-virtio overhead and tcpdump running in L1a. For Span0, the guest VCPU is controlled by L0 while the network device is controlled by L1a, so forwarding I/O kicks and interrupts over the message channel balances out any gains from running guest VCPUs on L0.
• CPU utilization for iPerf in No-op Mode rises significantly — from 2.7% on the native host to 100+% for the single-level and Span0 configurations and 180+% for the nested and Span1 configurations. The increase appears to be due to the virtio network device implementation in QEMU, since this higher utilization was observed even with newer, unmodified QEMU (v2.7) and Linux (v4.4.2). Nested and Span1 show higher utilization than single-level because guest VCPUs are controlled by L1s, making nested VM exits more expensive.
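As a rough illustration of the Quicksort workload (not the authors' benchmark harness; a single iteration only, with an arbitrary data size and RNG): allocate about 400MB of integers, sort them in memory, and report the elapsed time.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Roughly 400MB of 4-byte integers. */
#define N (400UL * 1024 * 1024 / sizeof(int))

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(void)
{
    int *data = malloc(N * sizeof(int));
    if (!data) { perror("malloc"); return 1; }

    srand(42);
    for (size_t i = 0; i < N; i++)
        data[i] = rand();

    clock_t t0 = clock();
    qsort(data, N, sizeof(int), cmp_int);   /* in-memory sort of ~400MB */
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
    printf("sorted %zu ints in %.2f s\n", (size_t)N, secs);

    free(data);
    return 0;
}
```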


Microbenchmarks

Attach operation [Figure 10: overhead of attaching an L1 to a guest, as guest memory size increases]
• Attaching the memory of a 1GB Span guest takes about 220ms; memory attach overhead grows with guest size because each page that L1 has allocated for the Span guest must be remapped to the Span physical page in L0
• Attaching VCPUs to one of the L1s takes about 50ms; attaching virtual I/O devices takes about 135ms
• When I/O control is transferred between hypervisors, the VCPUs must be paused; since they could be running on any of the L1s, L0 coordinates pausing and resuming them during the transfer
• The detach operation for VCPUs and I/O devices has similar overhead

Table 3: Low-level latencies in Span virtualization (µs)
                          Single   Nested   Span
  EPT Fault               2.4      2.8      3.3
  Virtual EPT Fault       -        23.3     24.1
  Shadow EPT Fault        -        3.7      4.1
  Message Channel         -        -        53
  Memory Event Notify     -        -        103.5

Page fault servicing
• In Span, resolving a fault against EPT_L1 takes 3.3µs on average and a fault against the Virtual EPT 24.1µs; the corresponding nested values are 2.8µs and 23.3µs, and single-level EPT-fault processing takes 2.4µs. The difference is the extra synchronization work in the EPT-fault handler in L0.

Message channel and memory events
• The message channel exchanges events and requests between L0 and the L1s; sending a message between L0 and an L1 takes about 53µs on average (a sketch of the UDP exchange follows below)
• Without any subscribers, write-fault processing takes about 3.5µs in L0; notifying an L1 subscriber of a write event on a guest page takes about 103.5µs (Table 3)
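The paper states only that the L0–L1 message channel currently uses UDP. Below is a minimal sketch of how an L1 might send a subscription message to L0 over such a channel; the message layout, port number, and field names are all invented for illustration.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

/* Invented message format: the real channel carries I/O requests, device
 * interrupts, memory subscription messages, and attach/detach operations. */
enum span_msg_type { MSG_ATTACH = 1, MSG_DETACH, MSG_SUBSCRIBE, MSG_MEM_EVENT, MSG_IO_KICK };

struct span_msg {
    uint32_t type;      /* enum span_msg_type */
    uint32_t guest_id;  /* which guest the message refers to */
    uint64_t arg;       /* e.g., guest frame number or device id */
};

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    /* Hypothetical L0 message-channel endpoint. */
    struct sockaddr_in l0 = { .sin_family = AF_INET, .sin_port = htons(7777) };
    inet_pton(AF_INET, "127.0.0.1", &l0.sin_addr);

    /* Example: L1 subscribes to write events on guest frame 0x1000 of guest 1. */
    struct span_msg msg = { .type = MSG_SUBSCRIBE, .guest_id = 1, .arg = 0x1000 };
    if (sendto(fd, &msg, sizeof(msg), 0, (const struct sockaddr *)&l0, sizeof(l0)) < 0)
        perror("sendto");

    close(fd);
    return 0;
}
```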

Related Work

• User space services • library OS, uDenali, KVM/QEMU, NOVA

• Service VMs • Dom0 in Xen, Xoar, Self-Service Cloud

• Nested virtualization • Belpaire & Hsu, Ford et al., Graf & Roedel, Turtles • Ravello, XenBlanket, Bromium, DeepDefender, Dichotomy

• Span virtualization is the first to provide multiple third-party hypervisor-level services to a common guest

Summary: Span Virtualization

• We introduced the concept of a multi-hypervisor virtual machine • that can be concurrently controlled by multiple coresident hypervisors

• Another tool in a cloud provider’s toolbox • to offer compartmentalized guest-facing third-party services

• Future work
  • Faster event notification and processing
  • Direct device assignment to L1s or the guest
  • Possible to support unmodified L1s? Requires L1s to support partial guest control; current L1s assume full control.

• Code to be released after porting to newer KVM/QEMU

Questions?

Backup slides

Comparison

Table 1: Alternatives for providing multiple services to a common guest, assuming one service per user space process, service VM, or Span L1.

                 Virtualized   Guest Control     Impact of Service Failure                          Additional Performance
                 ISA           (Partial/Full)    L0          Coresident Services         Guests     Overheads
  Single-level   Yes           Full              Fails       Fail                        All        None
  User space     No            Partial           Protected   Protected                   Attached   Process switching
  Service VM     No            Partial           Protected   Protected                   Attached   VM switching
  Nested         Yes           Full              Protected   Protected in L1 user space  Attached   L1 switching + nesting
  Span           Yes           Both              Protected   Protected                   Attached   L1 switching + nesting

• Control Transfer: control of guest VCPUs and/or virtual I/O devices can be transferred from one hypervisor to another, but only via L0.

Comparison notes (Table 1):
• Service VMs and Span virtualization isolate coresident services in individual VM-level compartments, so failure of a service VM or a Span L1 does not affect coresident services.
• Additional performance overhead over the single-level case: user space services introduce context-switching overhead among processes; service VMs introduce VM context-switching overhead, which is more expensive; nesting adds the overhead of emulating privileged guest operations in L1. Span virtualization uses nesting but supports partial guest control by L1s, so nesting overhead applies only to the guest resources that L1s control.

Overview of Span Virtualization
• The key design requirement for Span VMs is transparency: the guest OS and its applications remain unmodified and oblivious to being simultaneously controlled by multiple hypervisors, which include L0 and any attached L1s. For guest resources, this requirement translates as follows:
  • Memory: all hypervisors must have the same consistent view of the guest memory.
  • VCPUs: all guest VCPUs must be controlled by one hypervisor at a given instant.
  • I/O devices: different virtual I/O devices of the same guest may be controlled exclusively by different hypervisors at a given instant.

[Figure 2: High-level architecture for Span virtualization — an unmodified Span guest; L1 hypervisor(s) with a Virtual Guest EPT and memory, I/O, and VCPU managers; L0 with the Guest Controller, guest memory manager, I/O and VCPU managers, event processing, and a message channel to the L1s]
• A Span guest begins as a single-level VM on L0. One or more L1s can then attach to one or more guest resources and optionally subscribe with L0 for specific guest events.

Guest Control Operations — the Guest Controller in L0 supervises control over a guest by multiple L1s through the following operations (a sketch of this interface follows below):
• [attach L1, Guest, Resource]: gives L1 control over Resource in Guest. Resources include guest memory, VCPUs, and I/O devices. Control over memory is shared among multiple attached L1s, whereas control over guest VCPUs and virtual I/O devices is exclusive to an attached L1. Attaching to guest VCPUs or I/O devices requires attaching to guest memory.
• [detach L1, Guest, Resource]: releases L1's control over Resource in Guest. Detaching from the guest memory resource requires detaching from guest VCPUs and I/O devices.
• [subscribe L1, Guest, Event, <GFN range>]: registers L1 with L0 to receive Event from Guest. The GFN range specifies the range of frames in the guest address space on which to track the memory event. Presently only memory event subscription is supported; other guest events of interest could include SYSENTER instructions, port-mapped I/O, etc.
• [unsubscribe L1, Guest, Event, …]: unsubscribes L1 from the Guest event.
• The Guest Controller also uses administrative policies to resolve a priori any potential conflicts over guest control by multiple L1s. This paper focuses on mechanisms rather than specific policies; conflict resolution among services is not unique to Span, since alternative techniques also need ways to prevent conflicting services from controlling the same guest.
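A minimal sketch of the Guest Controller interface just listed. The operation names and the ordering rules (memory before VCPUs/I/O on attach, and the reverse on detach) come from the description above; the C types, signatures, and toy state tracking are invented for illustration.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

enum resource { RES_MEMORY, RES_VCPU, RES_IO_DEVICE };
enum event    { EV_MEM_WRITE };   /* only memory events are supported today */

#define MAX_L1     8
#define MAX_GUESTS 8

/* Toy per-(L1, guest) attachment state. */
static bool mem_attached[MAX_L1][MAX_GUESTS];
static bool dev_attached[MAX_L1][MAX_GUESTS];

/* attach: memory control is shared; VCPU/I/O control is exclusive to an
 * attached L1 and requires attaching to guest memory first. */
static bool span_attach(int l1, int guest, enum resource res)
{
    if (res == RES_MEMORY) { mem_attached[l1][guest] = true; return true; }
    if (!mem_attached[l1][guest]) return false;
    dev_attached[l1][guest] = true;
    return true;
}

/* detach: detaching from memory requires detaching VCPUs and I/O first. */
static bool span_detach(int l1, int guest, enum resource res)
{
    if (res != RES_MEMORY) { dev_attached[l1][guest] = false; return true; }
    if (dev_attached[l1][guest]) return false;
    mem_attached[l1][guest] = false;
    return true;
}

/* subscribe: track a memory event on a range of guest frame numbers. */
static bool span_subscribe(int l1, int guest, enum event ev, uint64_t gfn_lo, uint64_t gfn_hi)
{
    if (ev != EV_MEM_WRITE || !mem_attached[l1][guest]) return false;
    printf("L1 %d subscribed to guest %d writes on GFNs [%#llx, %#llx]\n",
           l1, guest, (unsigned long long)gfn_lo, (unsigned long long)gfn_hi);
    return true;
}

int main(void)
{
    span_attach(0, 0, RES_MEMORY);                     /* L1a attaches to guest memory   */
    span_attach(0, 0, RES_IO_DEVICE);                  /* ...then to a virtual I/O device */
    printf("subscribe ok: %d\n", span_subscribe(0, 0, EV_MEM_WRITE, 0x1000, 0x2000));
    printf("early memory detach ok: %d\n", span_detach(0, 0, RES_MEMORY)); /* 0: device still attached */
    span_detach(0, 0, RES_IO_DEVICE);
    printf("memory detach ok: %d\n", span_detach(0, 0, RES_MEMORY));       /* 1 */
    return 0;
}
```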

Continuous and Transient Control

[Figure: continuous control — L1a and L1b remain attached to the guest above the hypervisor; transient control — a single L1 attaches to and detaches from the guest as needed]

• Continuous control: L1s always attached to the guest
• Transient control: L1 attaches to / detaches from the guest as needed