VMB: A General Purpose Monitor
Matthew Schulkind
Department of Computer Science, Columbia University
[email protected]

1 Introduction

VMB is an open-source x86 virtualizer. The goal is to support fast virtualization of all x86 operating systems without requiring any modifications to the guest operating system. Currently there are no open-source applications which fully virtualize an unmodified operating system; VMB will fill this niche. This will provide a framework to expand upon for anyone who wishes to do research on specific aspects of virtualization, such as optimizing network performance. The only host operating system supported currently is the Linux 2.4 kernel. VMB does not require much functionality at all from the host operating system other than allocating memory, so it should not be very hard to port to other host operating systems.

Currently work is being done to support the Linux 2.4 series as a guest OS, but no hacks specific to this guest operating system are being implemented. All virtualization support is general-purpose; the reason work is being focused on the Linux 2.4 series is that it allows a certain subset of x86 features to be implemented first. Once the Linux 2.4 series is supported, it should be trivial to support the 2.6 series, and relatively little development time should be required to support any flavor of BSD or Microsoft Windows[4] compared to the development time needed to support the Linux 2.4 kernel initially.

2 Related Work

2.1 VMWare Workstation

VMWare[10] is a company which makes various different commercial applications, all of which provide full x86 virtualization. VMWare Workstation[12] is the most similar to VMB in that it provides x86 virtualization with a hosted architecture, described in Section 3.5. The latest version is very functional, but it is a closed-source project with a fairly large price tag. There are currently no open-source projects which provide the same features that VMWare Workstation does, primarily the ability to fully virtualize an unmodified operating system.

2.2 Parallels Workstation

Parallels[5] Workstation is a product very similar in nature to VMWare Workstation. The website claims that it is the first virtualization product which runs on top of the Intel version of OS X. It also runs on top of Windows and Linux.

2.3 QEMU

QEMU[6] is a very functional open-source CPU emulator which can emulate a number of different processors, including the x86, on top of a variety of host systems. This is fundamentally different than the goal of VMB, which is to virtualize and not emulate.

2.4 QVM86

QVM86[8] is an open-source add-on module for QEMU which enables some guest code to run natively on the host CPU when an x86 guest is being run on an x86 host. Almost all of the problems encountered by virtualization programs such as VMWare and VMB are avoided by running all kernel-mode code through the QEMU emulation software. This simplifies the design a lot since QEMU is already very functional, but it also means that theoretically it will never be as fast as a full virtualization solution. The current status of this project is that it will boot Windows 98, but it is still unstable, and Linux boots up as well but may still have some minor bugs.

2.5 The QEMU Accelerator Module

The QEMU Accelerator Module[7], also known as kqemu, is very similar to QVM86 in that it is an add-on to QEMU which adds virtualization support. Unlike QVM86, the QEMU Accelerator Module is closed-source. The most recent version adds support to virtualize some guest kernel code in addition to the guest user code which QVM86 is able to virtualize. Even though some kernel code is virtualized, full virtualization is not provided due to some assumptions being made about the workings of the guest OS. These assumptions should only lead to unsupported functions in the guest OS and not a compromise of the host OS. The website claims that the assumptions should be safe for Windows and Linux guests, but other operating systems may not function correctly.
2.6 Xen

Xen[13] is an open-source virtualization project run by the University of Cambridge. It has similar goals to VMB, but Xen requires that the guest OS be modified to run under it. The modifications are also non-trivial in nature and require a fairly significant amount of effort to port an OS to run under Xen. Also, there are currently no plans to support Windows due to licensing issues with modifying Windows source code. VMB will eventually be able to run Windows since no modifications will be necessary. On the up side, because Xen requires the OS to be modified, there is almost no speed hit in many situations between running an OS under Xen and directly on the x86 hardware. Xen significantly outperforms the latest version of VMWare which allows for benchmarks to be published, and unofficially outperforms to some degree the latest version of VMWare including their ESX Server[11]. Work is in progress on a beta version of Xen which utilizes the new VT and Pacifica technologies, by Intel and AMD respectively, which provide support for hardware-assisted virtualization. This new version will eventually allow fast virtualization of unmodified operating systems, but currently the beta runs unacceptably slowly and requires hardware which is not available publicly.

Xen does not attempt to directly virtualize any part of the x86 architecture which can not be done efficiently. Instead of directly virtualizing these parts of the x86 architecture, Xen provides a similar interface to the OS. The OS must be modified to use this Xen-provided interface instead of those provided by the x86 architecture itself. This allows Xen to filter and emulate various different privileged functions, such as writing to a page table.

To handle the provided interfaces and virtualization of the rest of the x86 architecture, Xen uses a hypervisor layer. This is a very lightweight operating system which is started before the guest OS and is always mapped into the current process space. The hypervisor is mapped into all process spaces so that there does not need to be a full context switch and corresponding TLB flush to handle the virtualization of privileged instructions. One difficulty of keeping the hypervisor mapped into all processes is ensuring that the part of the address space allocated to the hypervisor is not also in use by the guest OS. Xen accomplishes this by reserving the upper 64MB of the address space purely for Xen. This task will not be quite as trivial for VMB since there will be no guaranteed area of the address space free.

To make sure that the guest OS can never run any privileged instructions without Xen's consent, the guest OS is run at ring 1 and the hypervisor runs at ring 0. This means that whenever the guest OS attempts to execute a privileged instruction, it will trap into the hypervisor. This also allows applications of the guest OS to still run at ring 3 so that they do not have enough privileges to modify the guest OS kernel directly.

3 Architecture

3.1 Overview

Figure 1 shows an overview of the VMB architecture. There are two distinct address spaces which can be switched to: that of the host kernel and that of the guest. The magic page, described in Section 4.1, is used to switch between these two address spaces. To allow the VMM to be able to catch faults which occur inside the guest, the Linux ptrace interface is used. The VMM and GUI run inside the host world so that they have access to the functionality provided by the host kernel. The guest runs inside the guest world so that it can be given control over its address space. The guest is always run in an unprivileged execution mode so that it can't directly access any privileged state. All attempts to access privileged state trap into the VMM through the ptrace interface and are handled appropriately. Typical IRET/interrupt pairs are used for control transfer between privileged and unprivileged code. The VMB kernel module and VMM communicate using a character device.

    # cd vmb/
    # insmod vmb.o
    # mkdir ramfs
    # mount -t ramfs none ramfs
    # dd if=/dev/zero of=ramfs/ram \
    > bs=1M count=64
    64+0 records in
    64+0 records out
    # dd if=/dev/zero of=ramfs/shadow \
    > bs=1M count=64
    64+0 records in
    64+0 records out
    # ./vmb

    Figure 2: Running VMB

Figure 2 shows the steps needed to start VMB after it has already been built. Both a floppy image and a hard drive image can be supplied. Currently the paths to the two images are hard-coded, but the two image locations could easily be read at run-time. Once the steps in Figure 2 are followed, one must type "run" in the command box at the bottom of the window and hit enter. Due to a GUI bug, one must click in the debug log box at the top of the window before typing text meant for the guest.

    [Figure 1 shows the guest world and host world address spaces joined by the magic page. The host world contains the host kernel, the VMB kernel module with its character device, and the VMM and GUI, which traces the guest via ptrace(). IRET/interrupt pairs transfer control, and the guest page table is loaded using the magic page.]

    Figure 1: VMB Block Diagram

3.2 Code Structure

The source code is organized into three main parts: a GUI, a virtualization component, and a kernel module to provide all the privileged operations needed by the virtualization component. The main focus of this project is on the virtualization component, as this is where most of the desired functionality is provided. The GUI as of right now is only primitive enough to provide the basic functions needed to test and debug the code.

3.3 Debugger

To help test the functionality of VMB and debug VMB itself, there is a pretty basic debugger built in. The debugger can perform basic tasks such as single-stepping code, examining memory, and printing out data structures like GDT entries. Breakpoints were implemented in a previous version, but currently they are not supported. In the future they will be implemented using the x86 debug registers.

3.4 Basic Concept of Virtualization

The idea of virtualization is to provide virtual copies of all hardware to the guest operating system. To the guest, it appears that all of the hardware that it is accessing is the real physical hardware, but by only providing a virtual copy, multiple operating systems can be run at the same time.

The virtual hardware is provided by what is called a Virtual Machine Monitor (VMM). The VMM makes every attempt to make sure that each guest operating system functions as though it has full unrestricted access to the physical hardware. One guest should not be able to affect the operation of another guest or even detect that it is not running on physical hardware. Some of the virtualization can be done with the help of hardware virtualization mechanisms, but some must also be done through software trickery.

3.5 Hosted Virtualization

To make things simpler, VMB uses a hosted virtualization architecture. This means that there is a host operating system, Linux 2.4 in this case, which VMB runs alongside. I say alongside here instead of under because when VMB executes the guest, the host operating system is fully unmapped from the running context. Using a hosted architecture means that I did not have to worry about bootstrapping the real x86 hardware or writing any device drivers to access physical hardware. VMB's dependencies on the Linux 2.4 kernel are discussed more fully in Section 3.10.

3.6 Original Real Mode Virtualization

The original design of VMB used virtual 86 mode for virtualizing guest real mode code. Virtualization using this method is fairly straight-forward. The virtual 86 mode extension of the x86 processor provides mechanisms to easily virtualize all hardware accessible through real mode. Linux provides easy access to virtual 86 mode through the use of the vm86 system call. Virtual 86 mode provides the ability to trap on the execution of all privileged instructions. It is also possible to select which software interrupts a trap should be generated for. This enables a VMM to only trap on the interrupts which need virtualization and let the guest OS directly process the remaining software interrupts.

When the guest tries to run a privileged instruction and a trap is generated into the VMM, the VMM must then emulate the functionality of the privileged instruction. For the most part, the emulation is fairly trivial, but there is quite a lot of hardware functionality that can be accessed through privileged instructions, particularly through BIOS interrupts and I/O ports. A significant portion of the early development effort was spent on figuring out what each interrupt and I/O port did and emulating them.

Originally, all BIOS interrupts were trapped on and handled by the VMM, but there are enough different interrupt functions that individually emulating each one got to be too much effort. Instead of trapping on interrupts, I decided to use the BIOS which is used in the Bochs[1] and QEMU[6] projects. This simplifies things greatly because now I just have to load the BIOS into the correct memory location, set the guest entry point to be the BIOS entry point, and all interrupts are handled for me. The BIOS interrupt service routines use I/O ports to directly access hardware, so these I/O ports must still be emulated.

This method of virtualizing has two main drawbacks:

1. When switching from protected mode to real mode, all code from the patcher, described in Section 4.3, must be unpatched and repatched, which is very slow. This is particularly a problem when GRUB[3] loads the Linux kernel into memory before booting it. Because GRUB is executing in real mode, it has only just over 1MB of directly addressable memory available, so it loads the kernel into extended memory. To copy data into extended memory, a real mode program must use a BIOS interrupt which handles the copying. The BIOS is able to copy data into extended memory by first switching the processor into protected mode, copying the data, and then switching back to real mode. Data can only be copied in 64KB chunks and the Linux kernel can be over 1MB, so quite a few real mode/protected mode switches occur while copying the entire kernel into extended memory.

2. This method of real mode virtualization will never be able to support big real mode, described in Section 10.2.1, because big real mode is a side effect of the existence of protected mode, and virtual 86 mode only emulates the original real mode environment.

3.7 Current Real Mode Virtualization

To solve both of the problems mentioned in the previous section, I devised a new method for virtualizing real mode. Instead of using virtual 86 mode, all real mode code is actually run using the same techniques as protected mode, which are described in the following section. All instructions which act differently in real mode and protected mode, such as PUSHF/POPF and INT, are patched and emulated to act as they would have in real mode. The real mode segmented memory model is emulated using GDT segment descriptors.

3.8 Protected Mode Virtualization

A protected mode OS must be given a full address space to work with and a virtualized page table which it has full control over. There is no easy way to do this on the x86 architecture. The x86 in its current incarnation has no real mechanisms to aid in the virtualization of protected mode operating systems. Both AMD and Intel have plans to release new technologies to help with virtualizing operating systems, but currently there is nothing available to the public.

To give the guest its own address space, the process is forked and then traced using the Linux ptrace interface. Tracing the process using the ptrace interface means that VMB can alter memory and handle exceptions occurring from a separate process and separate address space. Giving the guest its own process only allows for partial address space virtualization. To allow the guest to map memory over where the host kernel is already mapped requires the additional technique described in Section 4.1. Some instructions, mainly PUSHF/POPF and segment register reads and writes, must be patched and emulated so that the instructions act exactly as they should in the guest.

3.9 Kernel Module

VMB makes use of a kernel module to execute all privileged instructions needed for virtualization. The bulk of the kernel module is the address space virtualization code detailed in Section 4.1 and some support code for the interrupt flag tracking. To ensure that VMB gets a chance to switch address spaces upon entering the guest process, the kernel module dynamically patches the tail end of the Linux kernel code which returns into user-mode processes. This is further explained in Section 4.1.

3.10 Host Dependencies

In writing VMB, I have tried to rely on as little as possible from the host operating system so that it would be easy to port VMB to other host operating systems such as the Linux 2.6 kernel or Microsoft Windows[4]. This would also make it easier to remove the dependency on a host operating system altogether and have VMB host itself in a manner similar to how VMWare ESX Server[11] works.

Outside of functionality user-space processes normally expect, such as file I/O and a GUI display, VMB needs only three things from the host operating system:

1. A method to run privileged code, such as by loading a kernel module or device driver.

2. A way of pinning guest memory to avoid memory corruption, as described in Section 4.4.

3. A hook to execute code on every entry into the guest process. This is currently forced upon the host operating system more than it is provided by it.

4 Major Components

4.1 MMU Virtualization

Because any operating system expects to be able to manage its own address space, it was necessary to shrink down the stub which must be mapped into the guest's process as much as possible and to be able to relocate the stub at will. I was able to shrink the stub down to the size of one 4k page. The job of this stub is to switch contexts back to that of the original process with the host kernel mapped at the correct memory locations. With the stub being only the size of one page, the guest can use the entire address space except for one page. The page that is reserved can be changed at any time so as to allow the most flexibility. Throughout the code I call this stub the magic page, and all data structures contained in the magic page also have magic in front of their name.

The memory management unit (MMU) virtualization code has two distinct parts: the code used to enter the context of the guest and the code used to exit the context of the guest. Only the necessary parts of the code are contained in the magic page in order to keep the size down. Because of tight register usage requirements and tight size requirements, all of the address space virtualization code is written in pure assembly, which is assembled by NASM and then linked into the kernel module. Originally I started to write some of the code using inline GCC assembly, but GCC tried to optimize a lot of the code by changing register usage, making it a headache to make sure it only did what I wanted. Using NASM for all the code made things much simpler.

The basic outline of switching to the guest's context is as follows:

1. Load the magic IDT address into the IDTR.

2. Load the magic GDT address into the GDTR.

3. Map the magic page into the host kernel address space at the same address the magic page will be mapped into the guest address space. The only restriction on this address is that it must be different than the location of the VMB kernel module. The old mapping at this address must also be stored.

4. Copy the stack segment and magic stack offset into the magic TSS and then load the new TSS location. The magic TSS is not a full TSS structure; it is only large enough to contain up to the ring 0 stack segment and offset because the rest of the TSS is never needed.

5. Copy the top several entries on the stack to the magic stack and load the magic stack's offset into the ESP register.

6. Flush the TLB and jump to the magic page where it is mapped in the host kernel address space.

7. Load the magic page table and finally restore the context of the guest.

It is important that the magic page is mapped at the same address in the magic address space and in the original address space so that the page table can be changed while executing the same code path. It doesn't matter what is mapped at the address of the magic page in the host kernel address space before mapping the magic page there, since the original page is mapped back into place before ever having the possibility of being accessed.

To exit from the guest's context and back into the host kernel, a process approximately equal to the entrance procedure is executed in reverse. To ensure that VMB is the first to run when exiting the guest's process, all of the exception handlers in the magic IDT point to the magic exit code. The only way to exit the guest context is through some sort of exception, so this ensures nothing is missed. There are 256 different exceptions, each needing its own handler so that VMB can store which exception was called and execute the appropriate handler after the context is switched back into that of the host kernel. Because all 256 of these handlers have to fit into the four kilobyte magic page, it was important to keep each handler's size down as much as possible. To store the exception number, the PUSH BYTE assembly instruction was used, and the jumps to the exit code are all short jumps, which totals four bytes per handler. To keep all of the jumps as local one byte offsets, intermediate jump points had to be spaced throughout the exception handlers.

Because the guest is running inside a regular process, to make sure VMB executed the context switch into the guest's context and not the Linux kernel, I replaced the last few instructions of the restore_all code in the Linux kernel with a jump to a hook function inside the VMB kernel module. The function of the restore_all code is to reload the register state to what it was before the target process was previously interrupted. After the register state is restored, VMB checks to see if the guest process is the target process, and if it is, switches to the virtualized address space.

Because the VMB hook runs for every host kernel context switch, the method used to determine if the guest is the target context must be fast and access only a minimal amount of state. The easiest way I came up with was to compare the physical address of the page table of the process to the previously stored physical address of the guest's page table to see if they match. If they do in fact match and the code segment just loaded is not that of the kernel, then VMB executes the full context switch code. The guest's page table address should never change or be duplicated because it is allocated in the kernel's memory space, which is never swapped out or relocated.

The last piece of this puzzle is keeping the magic page table up to date with what the guest thinks it should look like. To do this efficiently, the magic page table is loaded on demand. After every write to the CR3 register, which holds the page table address and flushes the translation lookaside buffer (TLB) when written to, all non-global pages in the magic page table must be cleared. The way the magic page table is handled is analogous to the way the TLB works, except this TLB will hold all mappings. This means that there will not be problems with the magic page table getting out of sync with the guest's page table, because the guest will invalidate any TLB entries upon modifying its page table. When a page fault occurs in the guest context, it is handled in one of four ways:

1. If the page fault is a read fault and the page is mapped in the guest page table but not the magic page table, map the page as read-only in the magic page table and set the accessed flag in the guest page table.

2. If the page fault is a write fault, the page does not contain patched code, and the page is mapped as read/write in the guest page table, map the page as read/write in the magic page table and set the dirty and accessed flags in the guest page table.

3. If the page fault is a write fault and the page does contain patched code, unpatch the page, map it as read/write in the magic page table, and set the dirty and accessed flags in the guest page table.

4. If the page is not mapped in the guest page table, or the page fault was a write fault and the page is mapped read-only in the guest page table, pass the page fault on to the guest to handle.

Being able to attach a debugger to the QEMU emulator[6] was useful for this entire project, but it was especially useful for getting the MMU virtualization code working, because it is not possible to use a kernel debugger while playing around with things like the IDT and unmapping the host kernel from memory.

4.2 Original Interrupt Flag Tracking

In order to keep the mechanisms of VMB as simple as possible, the original guest interrupt flag tracking did not rely on being able to patch the guest code dynamically. When I wrote the tracking code, nothing else yet needed dynamic patching, so I tried avoiding its use here also.

The interrupt flag is normally kept in the EFLAGS register and tells the CPU to not execute maskable interrupts when the flag is cleared. This flag, just like everything else, must be virtualized inside VMB. VMB must know when to send the guest interrupts and also must make sure that the guest can't disable interrupts for the host computer.

My first attempt to track the state of the interrupt flag was just to track CLI and STI calls, which directly set and clear the interrupt flag. This is easy to do because when these two instructions are executed in an unprivileged mode, they cause a general protection fault (GPF), which is easy to catch and handle. This appeared to work, except it doesn't handle the very common case in the Linux kernel where the state of the interrupt flag is pushed onto the stack, interrupts are disabled, and then the state is popped off the stack instead of explicitly calling the STI instruction to enable interrupts.

The instructions which push the state of the interrupt flag onto the stack and then pop it off again are PUSHF and POPF respectively. These two instructions make tracking the interrupt flag difficult because, unlike CLI and STI, instead of causing a GPF when executing in an unprivileged mode, they just ignore the interrupt flag. To keep track of the interrupt flag being stored on the stack, whenever interrupts are disabled, VMB single-steps the guest and watches for PUSHF and POPF instructions going by. When a PUSHF is executed, it is assumed that the next POPF executed will pop the status of the interrupt flag as it was when the prior PUSHF was called. To make sure the interrupt flag is tracked correctly, until the number of POPF calls matches the number of PUSHF calls, the guest must be single-stepped even if interrupts are enabled. If a POPF is called when no PUSHF was recorded as being previously executed, interrupts are enabled, since the PUSHF was probably called previous to disabling interrupts.

Because single-stepping and switching back to the VMM's context is very expensive when executing large numbers of instructions, the kernel module is aware of whether it should be single-stepping the guest to watch for PUSHF and POPF instructions. If a debug exception occurs due to single-stepping, the module checks to see what the instruction is, and if it is not one being watched for, it immediately continues executing the guest awaiting the next debug exception. The code to check if the guest should be continued instead of trapping into the VMM is fully contained in the magic page, so the context does not switch at all. The only cost of single-stepping each instruction is about 8 extra instructions executed per instruction. This speeds up the single-stepping quite a bit.

The IRET instruction also has to be handled like a POPF, as it also restores the state of the interrupt flag from the stack. Because the Linux kernel uncompresses itself into memory with interrupts disabled, it ends up executing quite a few instructions that must be single-stepped, which gets very slow. Because Linux doesn't re-enable interrupts until it switches to protected mode, there is a hack in VMB to not single-step for the purposes of tracking the interrupt flag state until protected mode is entered. This speeds up the uncompressing code by at least ten fold. There is still a potential that this method will track the interrupt flag incorrectly in cases such as a PUSHF being called and then the value being thrown away. This would throw off the PUSHF and POPF pairing. With this method, I can't come up with any way to easily fix this, but it should work for situations which arise with the Linux kernel and also Windows NT based kernels, from what I'm told.

While developing the single stepping method, I encountered a very hard to track down bug caused by some checks the Linux kernel does for debug exceptions. If a debug exception occurs in a user-mode process with an address in kernel space above 0xC0000000, the debug exception is not passed on to the user-mode process. This caused a problem for VMB since the guest was able to have instructions executing in what appeared to be kernel space. After I tracked down that this was happening, I modified the kernel module so that it would set the debug exception's address to 0x00000000 before passing it on to the kernel. The actual address is stored inside the kernel module and is accessible to the user-mode component of VMB through the character device interface.

One feature of the Intel x86 specifications that caught my eye while working on this interrupt flag tracking code was the virtual interrupt flag support. Intel provides a method to track a virtual interrupt flag in the EFLAGS register which is modified when an unprivileged process executes a CLI or STI instruction and the virtual interrupt flag extension is enabled. This is instead of immediately throwing a general protection fault. This seems like a very useful feature that should eliminate all the code that I wrote to track the status of the interrupt flag, but the PUSHF, POPF, and IRET instructions were not considered at all for this extension. The only explanation I can come up with of why Intel would add such a feature to the x86 architecture and then not implement it fully enough to actually make it useful is political and not technical. I'm not sure what they have to gain by including a broken extension like this, but it would not have been hard to make it work fully. AMD also implements the broken extension to the same extent.

In order to correctly virtualize various other instructions, a method of dynamically patching guest code had to be developed. This allowed the VMM to simply trap on the execution of all PUSHF/POPF instructions, making the interrupt flag tracking trivial. As a result, the error-prone method of tracking the interrupt flag described in this section was scrapped.

4.3 Incremental Patcher

To allow guest code to be dynamically patched at run-time, an incremental patcher was developed. The idea is that any code which can be run by the guest must first be validated by the patcher and have breakpoints inserted at each point in the code which needs further consideration by the VMM but would not otherwise cause a trap into the VMM. I chose to use INT3 for inserting breakpoints into the guest code for two reasons: hardware breakpoints are limited in number, and the INT3 instruction is only one byte and can fit in the place of even a one byte instruction. Because there is a very blurry line between data and code on the x86 architecture and because the guest loads executable code into memory incrementally, the patching must also be done incrementally.

An example of the patcher running is shown in Figure 3. The original guest code is shown in Figure 3a. The ... represents additional instructions which do not need special attention. This block of code contains two instructions, PUSHF and POPF, which must be emulated by the VMM but would not otherwise cause a trap into the VMM. On the initial pass through the patcher, breakpoints are inserted in place of the PUSHF and POPF instructions, as shown in Figure 3b. Breakpoints are also inserted in place of the code branch instructions JZ and JMP so that the patcher will have a chance to patch the code at the destination of these branches before it is run. The patched code is contained in a shadow page which is mapped in place of the original guest page in the magic page table. The shadow page is mapped read-only so that the patcher can easily detect when the guest changes already patched code and handle the situation correctly.

    (a) Original Guest Code    (b) Initial Pass    (c) After executing JZ and JMP
    PUSHF                      INT3                INT3
    CLI                        CLI                 CLI
    MOV eax, 3                 MOV eax, 3          MOV eax, 3
    ADD ebx, 4                 ADD ebx, 4          ADD ebx, 4
    ...                        ...                 ...
    POPF                       INT3                INT3
    JZ other code              INT3                JZ other code
    JMP more code              INT3                JMP more code

    Figure 3: Step-by-step patching

To try to minimize the number of breakpoints which must be handled by the patcher code, all direct branch instructions, such as JMP or CALL, are unpatched after their first execution. Figure 3c shows the block of guest code after the JZ and JMP instructions have been executed for the first time. Although direct branches can be unpatched in this way, indirect branches, such as the very common RET instruction, cannot be unpatched, since their destination cannot be known until they are executed each time. If one of these instructions were unpatched, execution could enter an unpatched block of guest code.

To be able to handle various situations, such as the guest unmapping a code page from memory, a careful unpatcher had to be developed. At first it would seem that unpatching code would be as simple as remapping the original guest code page into the magic page table, but it is not quite this simple. The difficulty comes from the fact that all direct branches are unpatched after their first execution. The patcher has to maintain a mapping of all unpatched branches so that if a branch points to the page being unmapped, the branching instruction can be repatched.

4.4 Guest RAM and Shadow Memory

To ensure that the guest does not write all over random memory, it is necessary to pin the guest RAM in memory. The host kernel is not aware of the magic page, so if it swapped out a page which was mapped into the magic page table, it would not know to remove the mapping there. This would lead to the magic page table pointing to a physical page which may no longer be assigned to the guest RAM region. If the guest was to write to an address on a page with a stale mapping, instead of writing to its own RAM, it would be writing over some random host memory.

The easiest way to ensure that the guest RAM was not swapped out by the Linux 2.4 kernel was to map a file from a RamFS filesystem. RamFS files are guaranteed to never be swapped out, so this was the easiest method to pin the guest RAM in memory.

5 Peripheral Support

Because the guest must not be allowed to directly access any hardware on the host machine, all peripherals used by the guest must be emulated by the VMM. The list of emulated peripherals is as follows:

1. A standard VESA video interface, provided with the help of VGABIOS[9], a VGA BIOS developed for use in the Bochs[1] and QEMU[6] projects

2. A programmable interrupt controller (PIC) to allow the guest to selectively mask interrupts

3. An 82C54-compatible programmable interval timer to allow the guest access to accurate timing information

4. An 8042-compatible keyboard controller for keyboard input from the user

5. An ATAPI v7 compatible IDE controller for guest hard drive access; CD-ROM drives are not currently supported

6. An 82078-compatible floppy controller for guest floppy access

7. Partial support for an RS-232 serial port, which can be used to view output from a guest serial console

8. A PCI configuration controller which currently exposes an empty PCI bus.

6 Limitations

Apart from the prohibitive overhead detailed in Section 1, there are two functionally incorrect aspects of the virtualization:

1. There is no page-level protection provided to the guest. At the moment this is done for efficiency reasons. Once the x86 to x86 translator, described in Section 10.1.3, is written, there will be more options, but right now the only way to provide page-level protection to the guest would be to maintain a page table of privileged pages and a separate page table of unprivileged pages and constantly swap between the two. Linux does not rely on page-level protection for correct execution, only for protecting the kernel from malicious user-mode processes, so this was left out for now.

2. Because the incremental patcher must patch the guest code in memory, if the guest reads back code which has already been patched, it will see the patches in memory. For any sort of normal code, this will never be a problem, but it is technically incorrect, and a guest could easily tell that it is being executed inside a VM by examining code it has already executed. The x86 to x86 translator can also solve this problem, as described in Section 10.1.3.

Deciding when to deliver timer interrupts is also a problem, but this problem is as easily solved as the above two. An operating system relies on both the fact that a constant amount of wall-clock time has passed for every timer interrupt it receives and that a constant number of cycles are executed between each timer interrupt.
Be- rrently supported cause the guest can be swapped off the CPU by the host kernel, it is not possible to keep both facts true from the call. Currently the workaround code is commented out view of the guest. Either the wall-clock time can be kept and unmaintained since the newest version of the Linux constant or the number of cycles executed can be kept kernel as of now is unaffected. constant, but not both. I chose to keep the wall-clock time constant because for the most part, operating sys- 8 Current Status tems don’t rely on the number of cycles being executed The first name of this project was VMBear. This was to remain constant, but they rely heavily on the amount simply derived from Virtual Machine and then Bear of wall-clock time remaining constant in order to keep since it will hopefully be as strong (stable) as a bear. track of the system clock. It also has that added bonus of sounding like VMWare. This might not be the best name for the future, so I have 7 Kernel Bugs Encountered changed the name to just VMB. Unofficially, it’s pretty While developing VMB thus far, I encountered two ker- much still the same name, but hopefully it’ll be a better nel bugs. name and not sound so incredibly close to VMWare. vm86 VMB is by no means bug free, but it can now boot a 7.1 System Call minimal Linux 2.4 guest and allow the user to execute The first bug I encountered puzzled me for long enough commands within the guest. Booting a full Linux 2.4 that I ended up making a large design change to work setup with multiple init scripts and a real login prompt around it. At first, the bug caused any pthreads related works sometimes without crashing, but it takes some- functions to segmentation fault after he vm86 system where in the range of five to ten minutes to boot where call had been executed. as the minimal setup boots in under one minute. 
A more The pthreads calls were being made by the wxWidgets in-depth performance analysis is given in the next sec- library which is used to providethe GUI. To workaround tion. this problem, I split the application into a client and server part. The server would just run the virtualiza- 9 nbench Results tion and not execute any pthreads calls, while the client would communicate with the server over a FIFO and Test Relative Iter/sec provide the GUI interface. Numeric Sort 0.89% During routine software upgrades, I upgraded the ver- String Sort 6.89% sion of glibc I had on my system and this caused any Bitfield 71.91% glibc call to cause a segmentation fault instead of just FP Emulation 9.07% pthreads calls. The problem was now large enough that I Fourier 2.31% started to reinvestigate possible solutions. I finally got an Assignment 74.63% answer to what was causing these segmentations faults Idea 3.43% from the glibc development mailing list. Huffman 0.22% The problem turned out to be that the vm86 system Neural Net 12.89% call did not preserve the FS and GS segment registers. LU Decomposition 71.94% Apparently this is a bug known by some, but for what- ever reason it was never fixed in the 2.4 Linux kernel, Table 1: nbench results relative to native execution although it was fixed during the 2.5 development and so the 2.6 kernel works just fine also. The workaround for To measure the performance of software running in- this on the 2.4 kernel is rather simple, I just save the FS side a VMB guest, I chose the nbench[2] benchmark and GS segment registers before calling the vm86 sys- suite. All of the test are CPU bound and involve no I/O. tem call and then restore them after. I specifically avoided benchmarks with I/O tests because the IDE emulation codeis too slow to even warranta real 7.2 mmap System Call test. 
At first instead of adding the workaround for the vm86 Table 1 shows the results from the nbench benchmark in case other problems also popped up, I switched to the suite running inside a VMB guest relative to the nbench 2.6.11 kernel and discovered another bug. Memory re- benchmark suite running natively in the host environ- gions mapped using the mmap system call with a base ment. Measurements were taken by averaging three con- address within the first one kilobyte of memory are not secutive runs with warm caches. All of the result are less preserved across fork calls. The only kernel version I than 100% indicating that the benchmarks ran slower in- found to be affected was 2.6.11. To work around this, side the VMB guest than natively. The test machine was I unmapped the memory before the fork call and then a Dell Inspiron 2600 laptop with a 1.2GHz Intel Celeron remapped it and copied over the memory after the fork processor and 320MB of RAM running the Linux 2.4.32 kernel. The guest was running a Linux 2.4.21 kernel. because they only require a segment register load and Three of the ten tests show fairly decent performance, the TLB is never flushed. Reserving the top 4MB of the but the other seven tests show pretty abysmal perfor- guest address space should be sufficient to fit the entire mance. The Huffman test ran over 450 times slower than VMM. natively. I expected CPU bound workloads to take some- To accomplish this, a number of subtle challenges where in the range of a 50% performancehit, but nothing have to be solved: quite like 450 times slower. 1. With a normal user space application, outputting To better understand why some tests ran incredibly debug information is very easy by just using slow while others ran with acceptable speed, I used the printf. Even from a kernel module, one can RDTSC instruction to profile various part of the VMB use printk to output debug information. Once code. 
Figure 9 shows some results gathered by the pro- the VMM is mapped directly into the guest ad- filing code. The three main execution areas tracked were dress space, it won’t have access to printf or the guest, the monitor, and the context switching be- even glibc. A ring buffer will have to be setup tween the guest and the monitor. Further breakdowns to transfer debug messages between the VMM and are given for the time spent switching contexts and the the user space GUI. Also, code which previously time spent in the monitor. used glibc functions, STL data-types, or anything As expected, the RET instruction was the single else normally provided to a C program by libraries largest source of overhead. For the full run with pro- loaded in memory must be rewritten without them. file results shown in Figure 4a, handling of the RET 2. An ELF or object loader must be written to actually instruction accounts for 51.18% of the CPU time. By load the VMM into the guest address space and ap- comparing Figure 4b, which shows the profile results ply the necessary relocations needed by the VMM for the LU decomposition test that ran with acceptable code so that it functions correctly in the top 4MB performance, to Figure 4c, which shows the profile re- of memory. sults for the Huffman test that ran with abysmal perfor- 3. Because the guest must not be allowed to knowthat mance, it is obvious that the reason for the large gap in the VMM is mapped into the top 4MB of address performance is due to the Huffman test executing a sig- space, all writes and reads to guest pages which nificantly higher proportion of RET instructions than the should have been mapped in this 4MB will have to LU decomposition test. These are fairly promising re- be fully emulated. 
Linux shouldn’thave much trou- sults because the optimizations described in Section 10.1 ble running without this emulation since it won’t should be able to reduce the overhead for a RET instruc- use the upper 4MB of the address space unless up- tion to almost zero. wards of 800MB of RAM is provided to it, but for completeness this must be emulated. It is very pos- 10 Future Work sible that other operating systems such as Windows 10.1 Optimizations XP use this area of the address space. Attention has been payed to efficiency and speed wher- 10.1.2 Swapping Guest Memory ever possible during development, but there are still Even if most machines today have enough RAM to ded- many areas of VMB which can be greatly optimized. icate 128MB to a VM, it is not desirable to have such 10.1.1 Eliminate VMM Context Switches a large region of RAM reserved for a single use. Also, if multiple VMs are started at the same time, the RAM Currently, every time an instruction needs to be emu- usage scales linearly and would not allow for very many lated, four context switches are required: guest memory simultaneously running VMs. space to kernel memory space, kernel to the user space There are two ways to solve this problem: VMM, user space VMM back to the kernel, kernel mem- ory space back to the guest memory space. Although the 1. Dedicate a fixed amount of RAM to all VMs run- user/kernel context switches are not as expensive as the ning and let the VMM manage what is allocated to guest/kernel context switches because they don’t require each VM in addition to swapping out pages to disk a page table load, the code page is much longer than it when the RAM region is overcommitted. has to be just to emulated a single instruction. 2. Cooperate with the host OS in such a way that To work around this problem, it should be possible to pages allocated for guest RAM can be swapped out map the VMM directly into the guest address space so as they normally would. 
that only two very light weight context switches need to The first method is probably the better way to go as happen to emulate an instruction: guest to VMM, VMM long as there is a chunk of ram that the user is willing to guest. These context switches are very light-weight to dedicate to VMs. The second method would require Location CPU % Location CPU % Location CPU % Guest 18.67% Guest 67.57% Guest 8.96% Context switches 39.23% Context switches 14.99% Context switches 43.04% For RET 35.87% For RET 4.48% For RET 40.26% Other 3.36% Other 10.51% Other 2.78% Monitor 42.10% Monitor 17.45% Monitor 48.00% RET emulation 15.31% RET emulation 2.21% RET handling 17.90% Other emulation 1.11% Other emulation 3.54% Other emulation 0.71% Code patching 8.74% Code patching 3.47% Code patching 10.10% Other 16.94% Other 8.23% Other 19.29% (a) Full Run (b) LU Decomposition Only (c) Huffman Only
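The first method above amounts to a small paging policy inside the VMM: one fixed pool of host page frames shared by every running VM, with the least recently used page spilled to a backing store when the pool is overcommitted. A minimal sketch of the idea, in Python for brevity; all names here (`GuestRamPool`, `touch`, and so on) are hypothetical illustrations, not VMB code:

```python
# Sketch of option 1 above: a fixed pool of host page frames shared by all
# VMs, with LRU pages spilled to a backing store when the pool is full.
from collections import OrderedDict

PAGE_SIZE = 4096

class GuestRamPool:
    def __init__(self, num_frames):
        self.num_frames = num_frames
        self.resident = OrderedDict()  # (vm_id, gpfn) -> page bytes, LRU order
        self.swapped = {}              # (vm_id, gpfn) -> page bytes on "disk"

    def touch(self, vm_id, gpfn):
        """Return the page backing guest physical frame gpfn, faulting it in."""
        key = (vm_id, gpfn)
        if key in self.resident:
            self.resident.move_to_end(key)   # mark as most recently used
            return self.resident[key]
        if len(self.resident) >= self.num_frames:
            victim, data = self.resident.popitem(last=False)  # evict LRU page
            self.swapped[victim] = data
        # Fault the page back in from "disk", or hand out a fresh zero page.
        page = self.swapped.pop(key, bytearray(PAGE_SIZE))
        self.resident[key] = page
        return page
```

With a two-frame pool, touching three distinct guest pages forces the oldest one out to the backing store, and touching it again faults it back in with its contents intact, which is the behavior the VMM-managed approach needs.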

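To make concrete the RET cost visible in the Figure 4 profiles: under the incremental patcher, every indirect branch must trap into the VMM just to check that its destination block has already been patched, even when it was patched long ago. A toy model of that bookkeeping, with hypothetical names rather than VMB's actual structures:

```python
# Toy model of why RET is expensive under the incremental patcher: every
# indirect branch traps so the VMM can verify the destination is patched.
class IncrementalPatcher:
    def __init__(self):
        self.patched_blocks = set()  # guest addresses of already-patched blocks
        self.traps = 0               # VMM traps paid for indirect branches

    def patch_block(self, addr):
        # Stand-in for the real work: inserting INT3s, mapping a shadow page.
        self.patched_blocks.add(addr)

    def indirect_branch(self, dest):
        """Model a RET or indirect JMP: it always traps into the VMM."""
        self.traps += 1
        if dest not in self.patched_blocks:
            self.patch_block(dest)
        return dest
```

Even a branch whose destination was patched on an earlier pass still pays a full trap; this is exactly the per-RET check that the x86 to x86 translator of Section 10.1.3 would eliminate, since only already-translated code would be reachable.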
10.1.3 x86 to x86 Translator

There are two major drawbacks of the incremental patcher based virtualization scheme: all indirect branches and privileged instructions must trap into the VMM for handling, and the patched code is visible to the guest OS. To solve these problems, an x86 to x86 translator could be introduced which would rewrite the guest OS code in such a way that instructions which don't strictly need to trap into the VMM are implemented directly in the guest OS code, and the rewritten code is not visible to the guest OS.

Rewriting instructions would have to be considered on a case-by-case basis. The RET instruction accounts for most of the overhead incurred by patcher traps. It is executed quite regularly and is an indirect branch, so every time it is executed, the return address must be checked to make sure that it has already been patched. With the translator based approach, only previously translated code will be addressable, so the RET instruction will not have to be handled by the VMM and will instead execute directly. Other instructions will be optimized similarly.

To hide the patched or translated code from the guest OS, all reads of executable guest code must be shadowed. In the process of translating the guest code, all of the code addresses will be relocated to point into the area of memory used to hold translated code, but all data addresses must be left untouched. This may prove to be quite tricky, but it should certainly be doable. Any data page containing code which has been translated must still be marked read-only, just like with the incremental patcher based approach, so that self-modifying code can be handled.

10.1.4 Execute Guest User-Mode Code Directly

Whether the incremental patcher or the x86 to x86 translator is used, guest user-mode code should not have to be handled specially. It should be possible to execute this code directly with only minor precautions. The guest's shadow GDT and LDT would have to be set up such that the segment selectors are the same as the guest OS is expecting. The shadow GDT should also be set up such that the guest OS can't switch to any segments which would not normally be available to it.

10.2 Features

The goal of the VMB project is to provide an x86 virtualization framework to the open-source world so that any number of virtualization-related features can be developed without having to first duplicate all of the work I have done so far on VMB. This section describes a few of these possible features.

10.2.1 Big Real Mode

Big real mode is the name given to a trick used by many DOS programs when protected mode was first introduced and there were large preexisting real-mode code bases. The x86 maintains hidden registers which store the segment base addresses and segment limits, among other things. In normal real mode, these registers are not accessible in any way, so there is a limit of just over 1MB of accessible memory. By first switching into protected mode, loading segment descriptors from the GDT, and then switching back to real mode, the hidden registers could be left in a state which was only supposed to be valid in protected mode. This allowed real-mode programs to use a flat memory model and address 4GB of memory instead of just over 1MB. With the new real-mode virtualization architecture described in Section 3.7, it should be fairly easy to support big real mode. It is possible that big real mode already functions correctly, but no attention has been paid to the relevant details, and it has not been tested in any way.

10.2.2 Virtual 86 Mode

Even though its use is not very common, certain guests will require the use of virtual 86 mode. Support for virtual 86 mode should not be very hard to provide to the guest because it should be possible to just use the physical CPU's virtual 86 mode and chain the traps generated to the guest's trap handlers.

10.2.3 Other

Infinitely many other features could be added, including extended video mode support, network support, USB support, sound support, accelerated OpenGL or DirectX support, and guest SMP.

11 Conclusion

As can be seen by examining VMWare's[10] market share, demand for x86 virtualization technology is growing quickly in both the server and desktop markets. Currently there is no open-source solution to turn to for full x86 virtualization. A general purpose x86 virtualizer must run at near-native speeds in order to be truly useful. VMB is not at this point yet, but I believe I have reached the point at which I can start looking for other people to help out on the project. With a collaborative effort, I believe the goal of a fast, open-source, general purpose x86 virtualizer can be fully realized.

References

[1] Bochs IA-32 Emulator Project. (http://bochs.sourceforge.net/)

[2] BYTE Magazine's BYTEmark benchmark program. (http://www.tux.org/~mayer/linux/bmark.html)

[3] GRand Unified Bootloader. (http://www.gnu.org/software/grub/)

[4] Microsoft Windows. (http://www.microsoft.com/windows/default.mspx)

[5] Parallels Workstation. (http://www.parallels.com/en/products/workstation/)

[6] QEMU. (http://fabrice.bellard.free.fr/qemu/)

[7] QEMU Accelerator Module. (http://fabrice.bellard.free.fr/qemu/qemu-accel.html)

[8] QVM86. (http://savannah.nongnu.org/projects/qvm86/)

[9] Plex86/Bochs VGABios. (http://www.nongnu.org/vgabios/)

[10] VMWare. (http://www.vmware.com/)

[11] VMWare ESX Server. (http://www.vmware.com/products/esx/)

[12] VMWare Workstation. (http://www.vmware.com/products/ws/)

[13] Xen virtual machine monitor. (http://www.cl.cam.ac.uk/Research/SRG/netos/xen/)