Study of Firecracker MicroVM

Madhur Jain
Khoury College of Computer Sciences, Northeastern University, Boston, MA
[email protected]

arXiv:2005.12821v1 [cs.OS] 26 May 2020

Abstract—Firecracker is a virtualization technology that makes use of the Kernel Virtual Machine (KVM). Firecracker belongs to a new virtualization class named micro-virtual machines (MicroVMs). Using Firecracker, we can launch lightweight MicroVMs in non-virtualized environments in a fraction of a second, while offering the security and workload isolation provided by traditional VMs and the resource efficiency that comes with containers [1]. Firecracker aims to provide a slimmed-down MicroVM, comprising approximately 50K lines of Rust and a reduced attack surface for guest VMs. This report examines the internals of Firecracker and explains why Firecracker is the next big thing in virtualization and cloud computing.

Index Terms—firecracker, microvm, Rust, VMM, QEMU, KVM

I. INTRODUCTION

Firecracker is a new open source Virtual Machine Monitor (VMM) developed by AWS for serverless workloads. It is a virtualization technology that makes use of KVM, meaning it can only run on hosts with KVM support enabled and a Linux kernel v4.14 or later. With the recommended guest kernel configuration, Firecracker claims to offer a memory overhead of less than 5MB per MicroVM, to boot to application code within 125ms, and to allow the creation of up to 150 MicroVMs per second. The number of Firecracker MicroVMs running simultaneously on a single host is limited only by the availability of hardware resources. [1]

Firecracker provides security and resource isolation by running the Firecracker userspace process inside a jailer process. The jailer sets up system resources that require elevated permissions (e.g., cgroup, chroot), drops privileges, and then exec()s into the Firecracker binary, which then runs as an unprivileged process. Past this point, Firecracker can only access resources to which a privileged third party grants access (e.g., by copying a file into the chroot, or passing a file descriptor). [13]

Seccomp filters limit the system calls that the Firecracker process can use. There are three possible levels of seccomp filtering, configurable via a command line argument to the jailer: 0 (disabled), 1 (whitelists a set of trusted system calls by their identifiers) and 2 (whitelists a set of trusted system calls with trusted parameter values), the last being the most restrictive and the recommended one. The filters are loaded into the Firecracker process immediately before execution of the untrusted guest code starts. [13]

Firecracker was developed to handle serverless workloads and has been running in production for AWS Lambda and Fargate since 2018. Serverless workloads require isolation and security while keeping the container benefit of fast boot times. A lot of research has been done on the cold-start and warm-start behavior of VM instances for serverless workloads. [17] The idea behind Firecracker was to make use of the KVM module already loaded in the Linux kernel and to drop the legacy devices that other virtualization technologies such as Xen and VMware offer. This way, Firecracker can provide a VMM with a smaller memory footprint and improved performance.

Section 2 explains the high-level architecture of the Firecracker MicroVM, Section 3 dives deep into the boot sequence of Firecracker, Section 4 describes the device model emulation provided, and Section 5 presents the conclusions drawn from this study.

II. ARCHITECTURE OF FIRECRACKER

KVM exposes the hardware virtualization extensions provided by vendors such as Intel and AMD (VMX and SVM, respectively). These extensions allow KVM to execute guest code directly on the host CPU. The KVM API consists of three sets of ioctls, issued to control the various aspects of a virtual machine. The three classes the ioctls belong to are listed here (a sketch of all three levels follows the list) [7]:

• System IOCTLs: These query and set global attributes that affect the whole KVM subsystem. In addition, a system ioctl is used to create virtual machines.
• VM IOCTLs: These query and set attributes that affect an entire virtual machine, for example the memory layout. In addition, a VM ioctl is used to create virtual CPUs (vCPUs). VM ioctls must be issued from the same userspace process (address space) that was used to create the VM.
• vCPU IOCTLs: These query and set attributes that control the operation of a single virtual CPU. vCPU ioctls must be issued from the same thread that was used to create the vCPU.
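To make the three ioctl levels concrete, here is a minimal sketch using the kvm-ioctls and kvm-bindings crates from the rust-vmm project. It is an illustration of the KVM API layering only, not Firecracker code, and it assumes a host with /dev/kvm accessible and the libc crate as a dependency.

```rust
// Minimal sketch of the three KVM ioctl levels (system, VM, vCPU)
// using the `kvm-ioctls` and `kvm-bindings` crates. Illustrative
// only; the guest is given memory but no real code to run.
use kvm_bindings::kvm_userspace_memory_region;
use kvm_ioctls::Kvm;

fn main() -> Result<(), kvm_ioctls::Error> {
    // System level: open /dev/kvm and query a global attribute.
    let kvm = Kvm::new()?;
    println!("KVM API version: {}", kvm.get_api_version());

    // A system ioctl creates the VM; VM ioctls then configure it.
    let vm = kvm.create_vm()?;

    // VM level: define the guest physical memory layout from an
    // anonymous host mapping.
    let host_mem = unsafe {
        libc::mmap(
            std::ptr::null_mut(),
            0x10000,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_ANONYMOUS | libc::MAP_PRIVATE | libc::MAP_NORESERVE,
            -1,
            0,
        )
    };
    let region = kvm_userspace_memory_region {
        slot: 0,
        guest_phys_addr: 0x1000,
        memory_size: 0x10000,
        userspace_addr: host_mem as u64,
        flags: 0,
    };
    unsafe { vm.set_user_memory_region(region)? };

    // A VM ioctl creates a vCPU; vCPU ioctls then drive it.
    let vcpu = vm.create_vcpu(0)?;
    let mut regs = vcpu.get_regs()?; // vCPU level: query register state.
    regs.rip = 0x1000;               // point at the start of guest memory
    vcpu.set_regs(&regs)?;           // vCPU level: set register state.
    Ok(())
}
```

Each wrapper type maps onto one ioctl class: Kvm issues system ioctls, VmFd issues VM ioctls, and VcpuFd issues vCPU ioctls, mirroring the process/thread affinity rules in the list above.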
QEMU is an open source machine emulator and virtualizer used to run KVM-enabled virtual machines. Similar to QEMU, Firecracker uses a multithreaded, event-driven architecture. Each Firecracker process runs one and only one MicroVM. The process consists of three main threads: API, VMM and vCPU (see Fig. 1).

Fig. 1. Firecracker Design - Threads

The API server thread is an HTTP server created to accept requests and perform actions on those requests. The API server thread structure comprises a socket connection on which to listen for requests, an epoll fd to listen for events on the socket, and a hashmap of connections. The hashmap consists of token and connection pairs corresponding to the file descriptor of the stream, in this case the socket.

Traditionally, QEMU has used the select or poll system calls to maintain an event loop over file descriptors. [16] select and poll require a list of all open file descriptors to be maintained in the VMM structure; the VMM then walks each file descriptor to determine which fds have new events. This takes O(N) time, where N is the number of file descriptors being watched. [14]

Firecracker instead takes the epoll approach, in which the host kernel maintains the list of file descriptors for the VM process and notifies the VMM whenever a new event occurs on any of them. This is an edge-triggered mechanism ("pull"), whereas select/poll are level-triggered mechanisms ("push"). The epoll fd created by Firecracker has the close-on-exec flag set, so it is not inherited by any program that a fork of the Firecracker process subsequently exec()s.
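To illustrate the pattern, the sketch below builds an edge-triggered epoll loop with the close-on-exec flag using raw libc calls. It mirrors the mechanism described above but is not Firecracker's actual event loop; listener_fd stands for any already-opened, non-blocking file descriptor.

```rust
// Sketch of an edge-triggered epoll loop with close-on-exec, via raw
// libc calls. Illustrative of the pattern described above, not
// Firecracker's event loop. `listener_fd` is assumed to be an
// already-created non-blocking pollable fd.
use std::os::unix::io::RawFd;

fn run_event_loop(listener_fd: RawFd) -> std::io::Result<()> {
    // EPOLL_CLOEXEC: the epoll fd is not inherited across exec().
    let epfd = unsafe { libc::epoll_create1(libc::EPOLL_CLOEXEC) };
    if epfd < 0 {
        return Err(std::io::Error::last_os_error());
    }

    // EPOLLET selects edge-triggered notification: the kernel reports
    // an fd only when a new event arrives, instead of re-reporting
    // readiness on every wait as level-triggered select()/poll() do.
    let mut ev = libc::epoll_event {
        events: (libc::EPOLLIN | libc::EPOLLET) as u32,
        u64: listener_fd as u64, // token identifying this connection
    };
    if unsafe { libc::epoll_ctl(epfd, libc::EPOLL_CTL_ADD, listener_fd, &mut ev) } < 0 {
        return Err(std::io::Error::last_os_error());
    }

    let mut ready = [libc::epoll_event { events: 0, u64: 0 }; 16];
    loop {
        // Block until the kernel reports new events; the cost scales
        // with the number of ready fds, not all registered fds.
        let n = unsafe { libc::epoll_wait(epfd, ready.as_mut_ptr(), 16, -1) };
        if n < 0 {
            return Err(std::io::Error::last_os_error());
        }
        for e in &ready[..n as usize] {
            // Copy fields out (epoll_event may be a packed struct),
            // then dispatch on the token stored at registration time.
            let (events, token) = (e.events, e.u64);
            println!("events {:#x} on token {}", events, token);
        }
    }
}
```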
The API server exposes a number of requests as REST APIs, which can be used to configure the MicroVM. Once the MicroVM has been started, i.e., once the server receives the "InstanceStart" command, the API server thread simply blocks on the epoll file descriptor until new events come in. Firecracker creates a channel (see Rust channels) to enable communication between the API server thread and the VMM thread; Rust channels are comparable to Unix pipes. The VMM thread manages the entire MicroVM. Once the VMM thread is created, it runs an event loop that takes the parsed requests one by one from the API server thread and dispatches them to the appropriate handlers.

A vCPU thread is created for each virtual CPU. These vCPU threads are nothing but POSIX threads created by KVM. To run guest code, the vCPUs execute an ioctl with KVM_RUN as its argument.

Software running in root mode is the hypervisor. The hypervisor, or VMM, forms a new plane that runs in root mode while the VMs run in non-root mode. KVM uses the hardware virtualization extensions to provide these different modes on the host CPU. In the case of Intel CPUs, VT-x provides CPU virtualization and VT-d provides I/O virtualization. For vCPUs, VT-x provides the two modes of guest code execution: root and non-root. Whenever a VM attempts to execute an instruction that is prohibited in non-root mode, the vCPU immediately switches to root mode in a trap-like way. This is called a VMEXIT. The hypervisor deals with the reason for the VMEXIT and then executes VMRESUME to re-enter non-root mode for that VM instance. This interaction between root and non-root mode is the essence of hardware virtualization.

III. FIRECRACKER BOOT SEQUENCE

Traditional PCs boot Linux with a BIOS and a bootloader. The primary responsibilities of the BIOS include booting the CPU in real mode and performing a Power-On Self Test (POST) before loading the bootloader. The BIOS determines the candidate boot devices, and once a bootable device is found, the bootloader is loaded into RAM and executed. Different systems execute different numbers of bootloader stages: LILO, for example, has a two-stage bootloader, while GRUB has a three-stage bootloader. Multiple bootloader stages were needed because of the system limitations of some of the older devices used to boot Linux.

The Linux kernel does not actually require a BIOS or a bootloader. Instead, Firecracker uses what is known as the Linux Boot Protocol. [15] Multiple versions of the Linux Boot Protocol standard exist; Firecracker follows the 64-bit Linux Boot Protocol. Thus, Firecracker can directly load the Linux kernel and mount the corresponding root file system.

A standard Linux kernel image is a compressed bzImage ("big zImage", usually larger than 512KB), consisting of real-mode kernel code and protected-mode kernel code. Instead of booting into the entry point defined by the real-mode code, Firecracker boots directly into the 64-bit entry point located at offset 0x200 in the protected-mode kernel code. Firecracker loads an uncompressed Linux kernel, as well as the init process, thereby saving the approximately 20 to 30ms it would otherwise take to decompress the kernel.
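To make "directly loading the kernel" concrete, the toy sketch below reads the 64-bit entry point out of an uncompressed ELF kernel image (vmlinux). The helper is hypothetical and purely illustrative; a real loader such as Firecracker's also validates the image, copies its segments into guest memory, and fills in the boot parameters.

```rust
// Toy sketch: read the 64-bit entry point of an uncompressed ELF
// kernel image (vmlinux). Hypothetical helper for illustration only.
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};

fn elf64_entry_point(path: &str) -> std::io::Result<u64> {
    let mut f = File::open(path)?;

    // The first 4 bytes of any ELF file are the magic \x7fELF.
    let mut magic = [0u8; 4];
    f.read_exact(&mut magic)?;
    assert_eq!(&magic, b"\x7fELF", "not an ELF image");

    // In the ELF64 header, e_entry (the virtual address where guest
    // execution starts) is an 8-byte little-endian field at offset 24.
    let mut entry = [0u8; 8];
    f.seek(SeekFrom::Start(24))?;
    f.read_exact(&mut entry)?;
    Ok(u64::from_le_bytes(entry))
}

fn main() -> std::io::Result<()> {
    // The path is a placeholder; point it at an uncompressed vmlinux.
    let entry = elf64_entry_point("vmlinux")?;
    println!("64-bit kernel entry point: {:#x}", entry);
    Ok(())
}
```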
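Once the kernel is loaded and the vCPU registers point at the entry point, control passes to the run loop described in Section II: each vCPU thread issues KVM_RUN and services the resulting VMEXITs. Below is a minimal sketch using the kvm-ioctls crate; it is illustrative only, and vcpu is assumed to be a vCPU already configured as in the earlier sketch.

```rust
// Sketch of a vCPU run loop: issue KVM_RUN and dispatch on the VMEXIT
// reason. Illustrative only; `vcpu` is assumed to be a configured
// kvm_ioctls::VcpuFd (guest memory loaded, registers set).
use kvm_ioctls::{VcpuExit, VcpuFd};

fn run_vcpu(vcpu: &mut VcpuFd) -> Result<(), kvm_ioctls::Error> {
    loop {
        // vcpu.run() wraps the KVM_RUN ioctl: the CPU enters non-root
        // mode and executes guest code until a VMEXIT occurs.
        match vcpu.run()? {
            // Guest wrote to an I/O port (e.g., a serial device).
            VcpuExit::IoOut(port, data) => {
                println!("guest wrote {:?} to port {:#x}", data, port);
            }
            // Guest read from an I/O port; device emulation would
            // fill in `data` here.
            VcpuExit::IoIn(port, data) => {
                println!("guest read {} bytes from port {:#x}", data.len(), port);
            }
            // Guest touched unmapped MMIO; a real VMM emulates it.
            VcpuExit::MmioRead(addr, _) | VcpuExit::MmioWrite(addr, _) => {
                println!("MMIO access at {:#x}", addr);
            }
            // Guest executed HLT: stop this vCPU.
            VcpuExit::Hlt => return Ok(()),
            other => println!("unhandled exit: {:?}", other),
        }
    }
}
```

Each loop iteration corresponds to one root/non-root round trip: run() re-enters the guest (VMRESUME) and returns on the next VMEXIT, which the VMM then handles in userspace.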