Study of Firecracker MicroVM

Madhur Jain
Khoury College of Computer Sciences
Northeastern University
Boston, MA
[email protected]

Abstract—Firecracker is a virtualization technology that makes use of the Kernel-based Virtual Machine (KVM). Firecracker belongs to a new virtualization class named micro-virtual machines (MicroVMs). Using Firecracker, we can launch lightweight MicroVMs in non-virtualized environments in a fraction of a second, at the same time offering the security and workload isolation provided by traditional VMs and the resource efficiency that comes with containers [1]. Firecracker aims to provide a slimmed-down MicroVM, comprising approximately 50K lines of Rust code, with a reduced attack surface for guest VMs. This report examines the internals of Firecracker and why it may be the next big thing in virtualization and cloud computing.

Index Terms—firecracker, microvm, Rust, VMM, QEMU, KVM

I. INTRODUCTION

Firecracker is a new open-source Virtual Machine Monitor (VMM) developed by AWS to handle serverless workloads; it has been running in production for AWS Lambda and AWS Fargate since 2018. Serverless workloads require isolation and security while also demanding container-like benefits such as fast boot times, and a lot of research has been done on the cold-start and warm-start behavior of VM instances for serverless workloads [17]. The idea behind Firecracker was to make use of the KVM module loaded in the host kernel while getting rid of the legacy devices that other virtualization technologies, such as Xen and VMware, offer. This way, Firecracker provides a VMM with a smaller memory footprint and improved performance.

Firecracker makes use of KVM, meaning it can run only on KVM-supported and KVM-enabled hosts with a corresponding Linux kernel (v4.14 or later). With the recommended guest kernel configuration, Firecracker claims a memory overhead of less than 5 MB per MicroVM, boots to application code within 125 ms, and supports the creation of up to 150 MicroVMs per second. The number of Firecracker MicroVMs running simultaneously on a single host is limited only by the availability of hardware resources [1].

Firecracker provides security and resource isolation by running the Firecracker userspace process inside a jailer process. The jailer sets up system resources that require elevated permissions (e.g., cgroups, chroot), drops privileges, and then exec()s into the Firecracker binary, which runs as an unprivileged process. Past this point, Firecracker can only access resources to which a privileged third party grants access (e.g., by copying a file into the chroot or passing in a file descriptor) [13]. Seccomp filters limit the system calls the Firecracker process can use. There are three levels of seccomp filtering, configurable by passing a command-line argument to the jailer: 0 (disabled), 1 (whitelists a set of trusted system calls by their identifiers), and 2 (whitelists a set of trusted system calls with trusted parameter values), the last being the most restrictive and the recommended one. The filters are loaded into the Firecracker process immediately before execution of the untrusted guest code starts [13].

Section II explains the high-level architecture of the Firecracker MicroVM, Section III dives deep into the boot sequence, Section IV describes the device model emulation provided, Section V discusses the scope for improvements, and Section VI presents the conclusions garnered through this study.

II. ARCHITECTURE OF FIRECRACKER

KVM is an enabler of the hardware virtualization extensions provided by vendors such as Intel and AMD (VMX and SVM, respectively). These extensions allow KVM to execute guest code directly on the host CPU. Three sets of ioctls make up the KVM API and are issued to control the various aspects of a virtual machine. The three classes that the ioctls belong to are [7]:
• System ioctls: query and set global attributes that affect the whole KVM subsystem. In addition, a system ioctl is used to create virtual machines.
• VM ioctls: query and set attributes that affect an entire virtual machine, for example, its memory layout. In addition, a VM ioctl is used to create virtual CPUs (vCPUs). VM ioctls must be issued from the same userspace process (address space) that was used to create the VM.
• vCPU ioctls: query and set attributes that control the operation of a single virtual CPU. vCPU ioctls must be issued from the same thread that was used to create the vCPU.
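To make the three ioctl classes concrete, the sketch below issues the simplest system ioctl, KVM_GET_API_VERSION, against /dev/kvm, and shows where the VM-creating and vCPU-creating ioctls fit. This is an illustrative Python sketch, not Firecracker code; the helper names are invented, the ioctl request numbers are taken from <linux/kvm.h>, and the version helper degrades to None on hosts without a usable /dev/kvm.

```python
import fcntl
import os

# ioctl request numbers from <linux/kvm.h>: _IO(KVMIO, nr) with KVMIO = 0xAE.
KVM_GET_API_VERSION = 0xAE00  # system ioctl: query the KVM API version
KVM_CREATE_VM       = 0xAE01  # system ioctl: returns an fd for a new VM
KVM_CREATE_VCPU     = 0xAE41  # VM ioctl: returns an fd for a new vCPU

def kvm_api_version(path="/dev/kvm"):
    """Issue the simplest system ioctl; return None if KVM is unavailable."""
    try:
        fd = os.open(path, os.O_RDWR)
    except OSError:               # no /dev/kvm, or no permission
        return None
    try:
        return fcntl.ioctl(fd, KVM_GET_API_VERSION)
    finally:
        os.close(fd)

def create_vm_and_vcpu(kvm_fd):
    """A system ioctl creates the VM fd; a VM ioctl on it creates vCPU 0."""
    vm_fd = fcntl.ioctl(kvm_fd, KVM_CREATE_VM, 0)
    vcpu_fd = fcntl.ioctl(vm_fd, KVM_CREATE_VCPU, 0)
    return vm_fd, vcpu_fd
```

On a KVM-enabled host, kvm_api_version() returns 12, the stable KVM API version; vCPU ioctls such as KVM_RUN would then be issued on the returned vcpu_fd, from the thread that created it.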

(arXiv:2005.12821v1 [cs.OS] 26 May 2020)

QEMU is an open-source machine emulator and virtualizer that can run KVM-accelerated virtual machines. Similar to QEMU, Firecracker uses a multithreaded, event-driven architecture. Each Firecracker process runs one and only one MicroVM, and the process consists of three main thread types: API, VMM, and vCPU. The API server thread is an HTTP server created to accept requests and perform actions on them. A vCPU thread is created for each virtual CPU; these are ordinary POSIX threads that drive KVM. To run guest code, a vCPU thread executes an ioctl with KVM_RUN as its argument.

Software running in root mode is the hypervisor. The hypervisor, or VMM, forms a new plane that runs in root mode, while the VMs run in non-root mode. KVM uses the hardware virtualization extensions to provide these different modes on the host CPUs.
In the case of Intel CPUs, VT-x provides the CPU virtualization and VT-d the IO virtualization. For vCPUs, VT-x provides the two modes of guest code execution: root and non-root. Whenever a VM attempts to execute an instruction that is prohibited in non-root mode, the vCPU immediately switches to root mode in a trap-like way. This is called a VMEXIT. The hypervisor deals with the reason for the VMEXIT and then executes VMRESUME to re-enter non-root mode for that VM instance. This interaction between root and non-root mode is the essence of hardware virtualization.

Fig. 1. Firecracker Design - Threads

The API server thread's structure comprises a socket connection for listening to requests on a port, an epoll file descriptor for listening to events on that socket, and a hashmap of connections. The hashmap holds token and connection pairs corresponding to the file descriptor of the stream, in this case the socket.

Traditionally, QEMU has made use of the select or poll system calls to monitor a set of file descriptors for new events [16]. These calls require a list of all open file descriptors to be maintained in the VMM structure, and each call walks through every descriptor to determine which ones have new events, taking O(N) time, where N is the number of file descriptors being monitored [14]. Firecracker instead takes the epoll approach, in which the host kernel maintains the list of file descriptors registered by the VM process and notifies the VMM whenever a new event occurs on any of them. This is a notification ("push") mechanism, whereas select/poll is a polling ("pull") mechanism; epoll additionally supports both level-triggered and edge-triggered operation. The epoll file descriptor created by Firecracker has the close-on-exec flag set, which means that if the Firecracker process forks, the descriptor is not shared with the child.

The API server exposes a number of REST APIs that can be used to configure the MicroVM. Once the MicroVM has been started, i.e., once the "InstanceStart" command has been received, the API server thread simply blocks on the epoll file descriptor until new events come in. Firecracker creates a channel (see Rust channels, which are comparable to Unix pipes) to enable communication between the API server thread and the VMM thread.

The VMM thread manages the entire MicroVM. Once created, it runs an event loop that takes the parsed requests one by one from the API server thread and dispatches them to the appropriate handlers. The handlers are defined according to the dispatch table set up by the event loop; for now, Firecracker supports the following handlers: Exit, Stdin, DeviceHandler, VMMActionRequest, and WriteMetrics. The dispatch table, managed through the epoll file descriptor, maintains a map of the file descriptors to be monitored and the kinds of events to monitor them for. When a vCPU thread creation request is received, the VMM spawns the required number of vCPUs using the KVM vCPU ioctls.
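The epoll-based event loop and dispatch table described above can be sketched with Python's stdlib epoll binding (Linux-only). The pipe here stands in for Firecracker's API socket, and the handler and table names are invented for illustration:

```python
import os
import select

# A pipe stands in for the API socket that Firecracker's VMM thread monitors.
r_fd, w_fd = os.pipe()

epoll = select.epoll()
epoll.register(r_fd, select.EPOLLIN)   # level-triggered by default

# A minimal dispatch table: file descriptor -> handler, mirroring the map of
# monitored fds and event kinds described above.
def on_readable(fd):
    return os.read(fd, 4096)

dispatch = {r_fd: on_readable}

os.write(w_fd, b"InstanceStart")       # simulate an incoming API request
events = epoll.poll(timeout=1)         # wakes only for ready descriptors
results = [dispatch[fd](fd) for fd, mask in events if mask & select.EPOLLIN]

epoll.close()
os.close(r_fd)
os.close(w_fd)
```

Unlike select/poll, the kernel tracks the registered descriptors itself, so the wakeup cost does not grow with the number of idle descriptors.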
III. FIRECRACKER BOOT SEQUENCE

Traditional PCs boot Linux with a BIOS and a bootloader. The primary responsibilities of the BIOS include starting the CPU in real mode and performing a Power-On Self Test (POST) before loading the bootloader. The BIOS determines the candidate boot devices, and once a bootable device is found, the bootloader is loaded into RAM and executed. Different systems execute different numbers of bootloader stages: LILO, for example, has a two-stage bootloader, while GRUB uses three stages. Multiple stages were needed because of the limitations of some of the older systems used to boot Linux.

The Linux kernel does not actually require a BIOS and a bootloader. Instead, Firecracker uses what is known as the Linux Boot Protocol [15]. Multiple versions of the Linux Boot Protocol standard exist; Firecracker follows the 64-bit Linux Boot Protocol. Thus, Firecracker can directly load the Linux kernel and mount the corresponding root file system.

A standard kernel image is a bzImage ("big" compressed image, usually larger than 512 KB), consisting of real-mode kernel setup code followed by the protected-mode kernel code. Instead of entering through the real-mode entry point, Firecracker boots directly into the 64-bit entry point located at offset 0x200 in the protected-mode kernel code. Firecracker loads an uncompressed Linux kernel as well as the init process, thereby saving the approximately 20 to 30 ms otherwise spent decompressing the kernel.

The Linux kernel can also be accompanied by another component, the initramfs [5]. There are four primary reasons to have an initramfs in an LFS environment: loading the rootfs from a network, loading it from an LVM logical volume, having an encrypted rootfs that requires a password, or the convenience of specifying the rootfs as a LABEL or UUID. Anything else usually means that the kernel was not configured properly. Firecracker needs none of the above-stated reasons for loading an initramfs before mounting the root file system.

        ~                        ~
        | Protected-mode kernel  |
100000  +------------------------+
        | I/O memory hole        |
0A0000  +------------------------+
        | Reserved for BIOS      |  Leave as much as possible unused
        ~                        ~
        | Command line           |  (Can also be below the X+10000 mark)
X+10000 +------------------------+
        | Stack/heap             |  For use by the kernel real-mode code.
X+08000 +------------------------+
        | Kernel setup           |  The kernel real-mode code.
        | Kernel boot sector     |  The kernel legacy boot sector.
X       +------------------------+
        | Boot loader            |  <- Boot sector entry point 0000:7C00
001000  +------------------------+
        | Reserved for MBR/BIOS  |
000800  +------------------------+
        | Typically used by MBR  |
000600  +------------------------+
        | BIOS use only          |
000000  +------------------------+

Fig. 2. Linux BzImage Memory Layout

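The header fields that locate the protected-mode kernel within a bzImage can be read as in the sketch below. Offsets follow the x86 Linux Boot Protocol documentation [15]; this is an illustrative parser, not Firecracker's loader (which loads an uncompressed vmlinux and skips this stage entirely):

```python
import struct

def parse_bzimage_header(image: bytes):
    """Locate the protected-mode kernel using the real-mode setup header."""
    if image[0x202:0x206] != b"HdrS":            # boot protocol magic
        raise ValueError("not a Linux boot protocol image")
    setup_sects = image[0x1F1] or 4              # 0 means 4, per the protocol
    version, = struct.unpack_from("<H", image, 0x206)
    pm_offset = (setup_sects + 1) * 512          # real-mode setup ends here
    return {
        "boot_protocol": (version >> 8, version & 0xFF),
        "protected_mode_offset": pm_offset,
        "entry64_offset": pm_offset + 0x200,     # the 64-bit entry point
    }
```

For a header with setup_sects = 15, for example, the protected-mode code starts at byte 8192 of the image and the 64-bit entry point at byte 8704.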
Since Firecracker needs none of these, it is recommended to avoid loading an initramfs at boot time, further reducing the overall boot time and the memory footprint of the kernel. If no initramfs is configured externally, Firecracker replaces it at boot time with a default empty, 134-byte initramfs.

IV. FIRECRACKER DEVICE MODEL

Up to this section, we have discussed the similar architectures and execution flows of Firecracker and QEMU. So what is different between QEMU and Firecracker? One of the main differences lies in device emulation. Only five device emulations are available in Firecracker: network, block devices, sockets, serial console, and a minimal keyboard controller, as shown in Figure 3. Firecracker does not support device emulations such as USB, GPU, and the 9P filesystem in order to provide increased security compared to other virtualization technologies; QEMU, on the other hand, makes most device model emulations available in its VMM. The careful reader will also notice that Firecracker does not use the vhost implementation in the host kernel, which offers more efficient IO by processing virtqueues in the kernel rather than in the userspace VMM.

An open specification for emulating device models in virtualization has been developed, named Virtio. Virtio is defined as a straightforward, efficient, standard mechanism that allows a guest OS to talk to a virtual device driver in much the same way the host OS would call an actual hardware device driver. It takes advantage of the fact that the guest can share memory with the host for IO. The general flow of the virtio specification [8] involves a front-end driver representing the virtual device in the guest and a corresponding device exposed by the hypervisor or VMM. A transport layer enables communication between the host and the guest; for this transport, Virtio employs a ring-buffer structure called a virtqueue. A virtqueue is a queue of guest-allocated buffers that the host interacts with, either by reading from or writing to them, and each device can have zero or more virtqueues. A back-end driver, present in the host kernel and connected to the virtqueue, completes the communication flow.

The Firecracker device model architecture using Virtio is shown in Figure 3. The following list describes the devices available within Firecracker:
• virtio-net: implementation of the network driver (tun/tap devices)
• virtio-blk: implementation of the block devices
• virtio-vsock: implementation of VM sockets, providing N:1 host-guest communication
• serial console: implementation of the legacy console device for serial (terminal) communication
• keyboard controller: implementation of the keyboard device, though only one function is implemented: Ctrl+Alt+Del to reboot/shut down the system.
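The virtqueue flow described above, in which the guest driver posts buffers and the host back end consumes them and returns results, can be modeled with a small toy class. This mimics only the logical flow; real virtio uses shared-memory descriptor, available, and used rings rather than Python objects, and the names here are invented for illustration:

```python
from collections import deque

class ToyVirtqueue:
    """Toy model of a virtqueue: an 'available' ring of guest-posted buffers
    and a 'used' ring of buffers completed by the host-side back end."""
    def __init__(self):
        self.available = deque()   # buffers posted by the guest front-end driver
        self.used = deque()        # buffers returned by the host back-end driver

    def guest_post(self, buf):
        self.available.append(buf)

    def device_process(self, handler):
        while self.available:
            self.used.append(handler(self.available.popleft()))

# A toy 'block device' back end that answers a request by uppercasing it.
vq = ToyVirtqueue()
vq.guest_post(bytearray(b"read sector 7"))
vq.device_process(lambda b: bytes(b).upper())
```

In real virtio, the rings live in guest memory shared with the host, which is what lets the back end consume requests without copying buffers through the VMM.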

Fig. 3. Firecracker Device Model

V. SCOPE FOR IMPROVEMENTS

Even with all the excellent features, near-native performance of guest code through KVM, faster boot times, and the lower memory footprint that results from supporting fewer device emulations, there still exist some areas of improvement that could make Firecracker more suitable for general use cases and not just serverless workloads:
• Support for virtio-fs: virtio-fs is an interface for efficient file sharing between the host and guest filesystems that avoids context switches (VMEXITs), thereby providing better performance. It is an upgrade of the existing virtio-9p interface for the same purpose, though more security research is required before it can be included in Firecracker.
• Increased IO performance: Tests comparing Firecracker, QEMU, and Cloud Hypervisor show limitations in Firecracker's virtio implementation and serial execution [1] [2].
• Larger number of device emulations: Currently, Firecracker can emulate only 10 devices, since each device gets its own IRQ [10].
• Support for attaching devices at runtime: Firecracker only allows devices to be specified at boot time; devices cannot be attached while the MicroVM is running.
• Hotplugging support: For any workload, it is beneficial to allow guest memory/CPU hotplugging at runtime in order to avoid interfering with the workload. Firecracker oversubscribes the memory allocated for the guest, but there is no way to expand the memory allocated to a running guest.
• Memory ballooning support: At present, Firecracker has no support for reclaiming unused memory from the guest, since no host-guest communication channel exists for this purpose. Together with hotplugging, ballooning would make it easy to dynamically add or remove memory/CPU at runtime, providing elasticity to the MicroVM [9].

REFERENCES

[1] "Firecracker: Lightweight Virtualization for Serverless Computing," AWS News Blog, https://aws.amazon.com/blogs/aws/firecracker-lightweight-virtualization-for-serverless-computing
[2] A. Agache, M. Brooker, A. Florescu, A. Iordache, A. Liguori, R. Neugebauer, P. Piwonka, and D.-M. Popa, "Firecracker: Lightweight Virtualization for Serverless Applications," https://www.usenix.org/system/files/nsdi20-paper-agache.pdf
[3] https://unixism.net/2019/10/how-aws-firecracker-works-a-deep-dive/
[4] https://developer.ibm.com/technologies/linux/articles/l-linuxboot/
[5] http://www.linuxfromscratch.org/blfs/view/cvs/postlfs/initramfs.html
[6] "Swift Birth and Quick Death: Enabling Fast Parallel Guest Boot and Destruction in the Xen Hypervisor," https://ssrg.ece.vt.edu/papers/vee2017.pdf
[7] P. Mukhedkar, H. Devassy Chirammal, and A. Vettathu, "Mastering KVM Virtualization," https://learning.oreilly.com/library/view/mastering-kvm-virtualization/9781784399054/ch02s04.html
[8] "Deep dive into Virtio-networking and vhost-net," https://www.redhat.com/en/blog/deep-dive-virtio-networking-and-vhost-net
[9] "Memory Ballooning Support," https://github.com/firecracker-microvm/firecracker/issues/1571
[10] "Enable IRQ sharing to remove the current limit of 10 devices," https://github.com/firecracker-microvm/firecracker/issues/1268
[11] "Building the virtualization stack of the future with rust-vmm," https://opensource.com/article/19/3/rust-virtual-machine
[12] rust-vmm, https://github.com/rust-vmm
[13] "Firecracker Design," https://github.com/firecracker-microvm/firecracker/blob/master/docs/design.md
[14] "The method to epoll's madness," https://medium.com/@copyconstruct/the-method-to-epolls-madness-d9d2d6378642
[15] "Linux 64-bit Boot Protocol," https://www.kernel.org/doc/html/latest/x86/boot.html#id1
[16] "QEMU Internals: Overall architecture and threading model," http://blog.vmsplice.net/2011/03/qemu-internals-overall-architecture-and.html
[17] "Cold start / warm start with AWS Lambda," https://blog.octo.com/en/cold-start-warm-start-with-aws-lambda/

VI. CONCLUSION / THOUGHTS

This paper reviews the implementation of a minimalist and modular VMM in the form of the Firecracker MicroVM. It identifies how Firecracker provides resource isolation and security through seccomp filters and the jailer process, and how it achieves faster boot times and a lower memory footprint through KVM and a minimal device model. Another point worth noting is that Firecracker embodies a modular approach to hypervisor design. This modular design approach has also led to the development of the community-driven, high-quality rust-vmm crates, which provide the core modules required to implement a hypervisor [11]. rust-vmm [12] is a community effort initiated by AWS; Amazon, along with Intel, Red Hat, and Google, is trying to provide a platform for building a hypervisor from scratch by consuming only the required modules from the rust-vmm crates. This approach also enables a plug-and-play architecture for hypervisors, something we have not seen so far.