Bhyvecon Tokyo 2014
Introduction to bhyve
Takuya ASADA / @syuu1228

What is bhyve?
• bhyve is a hypervisor introduced in FreeBSD
• Similar to Linux KVM; it runs on a host OS
• BSD licensed
• Developed by Peter Grehan and Neel Natu

bhyve features
• Requires Intel VT-x and EPT (Nehalem or later); AMD support is in progress
• Does not support BIOS/UEFI for now; UEFI support is in progress
• Minimal device emulation support: virtio-blk, virtio-net, COM port +α
• Supported guest OSes: FreeBSD/amd64, i386, Linux/x86_64, OpenBSD/amd64

How to use it?

    kldload vmm.ko
    /usr/sbin/bhyveload -m ${mem} -d ${disk} ${name}
    /usr/sbin/bhyve -c ${cpus} -m ${mem} \
        -s 0,hostbridge -s 2,virtio-blk,${disk} \
        -s 3,virtio-net,${tap} -s 31,lpc -l com1,stdio vm0

How to run Linux?
• The bhyve OS loader (/usr/sbin/bhyveload) only supports FreeBSD; you need another OS loader to boot other OSes
• grub2-bhyve is the solution
• It is a modified version of grub2 that runs on the host OS (FreeBSD)
• Can load Linux and OpenBSD
• Available in ports & pkg!

Virtualization in general

Difference between container and hypervisor
• Jail is a container: it virtualizes the OS environment at the kernel level
• bhyve is a hypervisor: it virtualizes a whole machine
• Totally different approaches

Container
• A process in a jail is just a normal process to the kernel
• The kernel does some tricks to isolate the environments of different jails
• Lightweight, low overhead
• All jails share one kernel → if the kernel panics, all jails die
• You cannot install another OS (no Windows, no Linux!)

Hypervisor
• A hypervisor virtualizes a whole machine
• From the guest OS, it looks like real hardware
• A virtual machine is a normal process for the host OS
• The guest does not share the host kernel; it is completely isolated
• You can run a full OS inside the VM → Windows! Linux!

How does a hypervisor virtualize a machine?
• To make a complete virtual machine, you need to virtualize the following:
• CPU
• Memory (address space)
• I/O

CPU Virtualization: Emulate the entire CPU?
• Like QEMU: an emulator interprets every guest instruction (e.g. mov dx,3FBh / mov al,128 / out dx,al) and performs the I/O on virtual devices
• You can emulate the entire CPU in a normal process
• Very slow; not a really useful choice for virtualization

CPU Virtualization: Direct execution?
• You want to run guest instructions directly on a real CPU, since you are virtualizing x86 on x86
• You need to avoid executing instructions that modify global system state or perform I/O (called sensitive instructions)
• If you execute these instructions on the real CPU, they may corrupt host OS state, e.g. by directly accessing a HW device

Performing I/O on a VM
• You need to prevent the VM from accessing real HW (e.g. a guest outb must not reach the real display)
• You need to prevent execution of the instruction
• You can trap such instructions by running the guest in a lower privilege level, so the outb is redirected to a virtual device instead
• However, on x86 some sensitive instructions are impossible to trap because they are non-privileged

Software techniques to virtualize x86
• Binary translation (old VMware): interpret & modify the guest OS's instructions on the fly → runs fast, but the implementation is very complex
• Paravirtualization (old Xen): modify the guest OS for the hypervisor → runs fast, but cannot run unmodified OSes
• We want an easier & better solution → HW-assisted virtualization!
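Before relying on VT-x, it can help to see how a host-side tool might verify that the CPU advertises the feature at all. The following is a minimal sketch, not taken from the slides, using the GCC/Clang <cpuid.h> helper; it only checks the VMX feature bit from CPUID leaf 1, while EPT support (which bhyve also requires) is reported separately through the VMX capability MSRs.

    /* Hedged sketch: check whether the CPU advertises Intel VT-x (VMX).
     * CPUID leaf 1 reports the VMX feature in ECX bit 5.  EPT support,
     * which bhyve also requires, is reported via VMX capability MSRs
     * and is not covered by this check. */
    #include <stdio.h>
    #include <cpuid.h>

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;

        if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
            puts("CPUID leaf 1 not available");
            return 1;
        }
        printf("VT-x (VMX): %s\n", (ecx & (1u << 5)) ? "yes" : "no");
        return 0;
    }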
Hardware assisted virtualization (Intel VT-x)
• New CPU modes: VMX root mode (hypervisor) and VMX non-root mode (guest), each with its own user (ring 3) and kernel (ring 0) levels
• The CPU enters the guest via VMEntry; if some event needs to be emulated by the hypervisor, the CPU stops the guest and exits back to the hypervisor → VMExit
• You don't need complex software techniques, and you don't have to modify the guest OS

Memory Virtualization
[Figure: each guest's page tables map process pages to guest-physical pages, and each guest's guest-physical memory occupies a different region of host-physical memory]
• If you run the guest OS natively, memory address translation becomes problematic
• If Guest B loads Page table A, virtual page 1 translates to host physical page 1, but you meant host physical page 5

Shadow Paging
[Figure: the hypervisor builds shadow page tables A' and B' that map guest-virtual pages directly to host-physical pages]
• Trap page table loads/modifications, build a "shadow page table", and hand the MMU the host physical page numbers
• A software trick that works well, but is slow

Nested Paging (Intel EPT)
[Figure: the guest page table maps guest-virtual to guest-physical pages, and the EPT maps guest-physical to host-physical pages, e.g. guest-physical page 1 → host-physical page 5]
• HW-assisted memory virtualization!
• You get a guest physical : host physical translation table (the EPT)
• The MMU translates addresses in two steps (nested); a toy sketch of this two-step lookup follows at the end of this part

I/O Virtualization
• To run unmodified OSes, you need to emulate all the devices found on real hardware
• SATA, NIC (e1000), USB (ehci), VGA (Cirrus), interrupt controllers (LAPIC, IO-APIC), clock (HPET), COM port…
• Emulating real devices is not very fast because it causes a lot of VMExits; not ideal for virtualization

Paravirtual I/O
• A virtual I/O device designed specifically for VM use
• Much faster than emulating real devices
• Requires a device driver in the guest OS
• De facto standard: virtio-blk, virtio-net

PCI Device passthrough
[Figure: a PCI device's DMA is translated by the IOMMU translation table, which maps guest-physical to host-physical pages just as the EPT does for the CPU]
• If you attach a real HW device to a VM, you will have a problem with DMA
• The device needs physical addresses for DMA, but the guest OS does not know the host physical addresses
• Address translator for devices: IOMMU (Intel VT-d)
• Translates guest physical to host physical using a translation table

bhyve internals

How does bhyve virtualize the machine?
• CPU: HW-assisted virtualization (Intel VT-x)
• Memory: HW-assisted memory virtualization (Intel EPT)
• I/O: virtio, PCI passthrough, +α
• Uses HW-assisted features throughout
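The two-step translation described under Nested Paging above can be illustrated with a toy model. The sketch below is not how the MMU or bhyve actually walk page tables (real tables are multi-level radix structures walked in hardware); it only shows the nested lookup, with mappings mirroring the Guest A example: guest-virtual page 1 → guest-physical page 1 → host-physical page 5.

    /* Toy sketch of nested (two-step) address translation.
     * Flat arrays stand in for the guest page table and the EPT. */
    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SHIFT 12                       /* 4 KiB pages */
    #define PAGE_MASK  ((1u << PAGE_SHIFT) - 1)

    static const unsigned guest_pt[] = { 0, 1, 2, 3 }; /* GVA page -> GPA page */
    static const unsigned ept[]      = { 4, 5, 6, 7 }; /* GPA page -> HPA page */

    static uint64_t translate(uint64_t gva)
    {
        unsigned gpa_page = guest_pt[gva >> PAGE_SHIFT]; /* step 1: guest page table */
        unsigned hpa_page = ept[gpa_page];               /* step 2: EPT */
        return ((uint64_t)hpa_page << PAGE_SHIFT) | (gva & PAGE_MASK);
    }

    int main(void)
    {
        uint64_t gva = (1ULL << PAGE_SHIFT) | 0x123;     /* guest-virtual page 1 */
        printf("GVA 0x%llx -> HPA 0x%llx\n",             /* prints 0x1123 -> 0x5123 */
               (unsigned long long)gva, (unsigned long long)translate(gva));
        return 0;
    }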
bhyve overview
[Figure: bhyveload, bhyvectl and bhyve run in userland on top of libvmmapi and talk to /dev/vmm/${vm_name} (vmm.ko) in the FreeBSD kernel via mmap/ioctl; 1. create the VM instance and load the guest kernel (bhyveload), 2. run the VM instance with its disk image, tap device and stdin/stdout guest console (bhyve), 3. destroy the VM instance]
• bhyveload: loads the guest OS
• bhyve: userland part of the hypervisor, emulates devices
• bhyvectl: a management tool
• libvmmapi: userland API
• vmm.ko: kernel part of the hypervisor

vmm.ko
• All VT-x features are accessible only in kernel mode; vmm.ko handles them
• The most important work of vmm.ko is CPU mode switching between hypervisor and guest
• Provides an interface for userland via /dev/vmm/${vmname}
• Each vmm device file contains the state of one VM instance

/dev/vmm/${vmname} interfaces
• create/destroy: device files can be created/destroyed via sysctl hw.vmm.create and hw.vmm.destroy
• read/write/mmap: the guest memory area can be accessed with standard syscalls (which means you can even dump guest memory with the dd command)
• ioctl: provides various operations on the VM

/dev/vmm/${vmname} ioctls
• VM_MAP_MEMORY: maps a guest memory area of the requested size
• VM_SET/GET_REGISTER: access registers
• VM_RUN: run the guest machine until a virtual device is accessed (or some other trap happens)

libvmmapi
• Wrapper library for the /dev/vmm operations
• vm_create(name) → sysctl("hw.vmm.create", name)
• vm_set_register(reg, val) → ioctl(VM_SET_REGISTER, reg, val)

bhyveload
• bhyve uses an OS loader instead of a BIOS/UEFI to load the guest OS
• The FreeBSD bootloader ported to userland: userboot
• bhyveload runs on the host OS to initialize the guest OS
• Once called, it does the following:
  • Parses the UFS on the disk image and finds the kernel
  • Loads the kernel into the guest memory area
  • Initializes the page table
  • Creates the GDT, IDT and LDT
  • Initializes special registers to get ready for 64-bit mode
• The guest machine can then start from the kernel entry point in 64-bit mode

bhyve
• The bhyve command is the userland part of the hypervisor
• It invokes ioctl(VM_RUN) to run the guest OS
• Emulates virtual devices
• Provides the user interface (no GUI for now)

main loop in bhyve

    while (1) {
        ioctl(VM_RUN, &vmexit);
        switch (vmexit.exit_code) {
        case IOPORT_ACCESS:
            emulate_device(vmexit.ioport);
            break;
        …
        }
    }

Q&A?
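As a closing illustration of the /dev/vmm interface described above, here is a hedged sketch that reads the start of a guest's memory through the per-VM device node, using only the standard read path the slides mention (the same thing the dd example does). The VM name "vm0" is assumed purely for illustration, and error handling is minimal.

    /* Hedged sketch: dump the first bytes of guest memory via
     * /dev/vmm/vm0, relying on the standard read interface described
     * in the slides.  "vm0" is an assumed, illustrative VM name. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        unsigned char buf[4096];
        int fd = open("/dev/vmm/vm0", O_RDONLY);    /* per-VM device node */
        if (fd < 0) {
            perror("open /dev/vmm/vm0");
            return 1;
        }
        /* Read from guest-physical address 0 onward. */
        ssize_t n = pread(fd, buf, sizeof(buf), 0);
        if (n < 0) {
            perror("pread");
            close(fd);
            return 1;
        }
        for (ssize_t i = 0; i + 3 < n; i += 16)     /* crude hex preview */
            printf("%08zx: %02x %02x %02x %02x\n",
                   (size_t)i, buf[i], buf[i + 1], buf[i + 2], buf[i + 3]);
        close(fd);
        return 0;
    }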