Lupine Linux “A Linux in Unikernel Clothing”
Dan Williams (IBM)
With Hsuan-Chi (Austin) Kuo (UIUC + IBM), Ricardo Koller (IBM), Sibin Mohan (UIUC) Roadmap
• Context • Containers and isolation • Unikernels • Nabla containers • Lupine Linux • A linux in unikernel clothing • Concluding thoughts
2 Containers are great!
• Have changed how applications are packaged, deployed and developed • Normal processes, but “contained” • Namespaces, cgroups, chroot • Lightweight • Start quickly, “bare metal” • Easy image management (layered fs) • Tooling/orchestration ecosystem
3 But…
High level of • Large attack surface to the host abstraction (e.g., system • Limits adoption of container-first architecture calls)
app
• Fortunately, we know how to reduce attack surface! Host Kernel with namespacing (e.g., Linux) Containers
4 Deprivileging and unsharing kernel functionality
Low level of High level of • Virtual machines (VMs) abstraction abstraction • Guest kernel (e.g., virtual (e.g., system app hardware) calls) • Thin interface
Guest Kernel app (e.g., Linux)
Monitor Process (e.g., QEMU) Host Kernel with Host Kernel/Hypervisor namespacing (e.g., Linux/KVM) (e.g., Linux) VMs Containers
5 Deprivileging and unsharing kernel functionality
Low level of High level of • Virtual machines (VMs) abstraction abstraction • Guest kernel (e.g., virtual (e.g., system app hardware) calls) • Thin interface • Userspace kernel Guest Kernel app • Performance issues (e.g., Linux)
Monitor Process (e.g., QEMU) Host Kernel with Host Kernel/Hypervisor namespacing (e.g., Linux/KVM) (e.g., Linux) VMs Containers
6 But wait? Aren't VMs slow and heavyweight?
• Boot time? • Memory footprint?
• Especially for environments like serverless??!!
7 VMs are becoming lightweight Low level of abstraction • Thin monitors (e.g., virtual • e.g., AWS Firecracker app hardware) • Reduce complexity for performance (e.g., no PCI) Guest Kernel (e.g., Linux)
Monitor Process (e.g., QEMU) Host Kernel/Hypervisor (e.g., Linux/KVM) VMs
8 VMs are becoming lightweight Low level of abstraction • Thin monitors (e.g., virtual • e.g., AWS Firecracker app hardware) • Reduce complexity for performance (e.g., no PCI) Guest Kernel (e.g., Linux)
Monitor Process
Host Kernel/Hypervisor (e.g., Linux/KVM) VMs Firecracker boot times as reported in Agache et al., NSDI 2020
9 VMs are becoming lightweight Low level of abstraction • Thin monitors (e.g., virtual • e.g., AWS Firecracker app hardware) • Reduce complexity for performance (e.g., no PCI) Guest Kernel (e.g., Linux)
Monitor Process
Host Kernel/Hypervisor (e.g., Linux/KVM) VMs Firecracker boot times as reported in Agache et al., NSDI 2020 Manco et al., SOSP 2017 10 VMs are becoming lightweight Low level of abstraction • Thin monitors (e.g., virtual • e.g., AWS Firecracker app hardware) • Reduce complexity for performance (e.g., no PCI)
Guest Kernel • Thin guests? • Userspace: (e.g., Ubuntu --> Alpine Linux) Monitor Process
• Kernel configuration (e.g., TinyX, Lupine) Host Kernel/Hypervisor • Unikernels (e.g., Linux/KVM) VMs
KubeCon 2020 Eurosys 2020 11 Unikernels are thin guests to the extreme
• An application linked with library OS components • Run on virtual hardware (like) abstraction • Single CPU VM
• Language-specific • MirageOS (OCaml) • IncludeOS (C++)
• Legacy-oriented • Rumprun (NetBSD-based) • Hermitux • OSv Claim binary compatibility with Linux 12 Deprivileging and unsharing kernel functionality
Low level of High level of • Virtual machines (VMs) abstraction abstraction • Guest kernel (e.g., virtual (e.g., system app hardware) calls) • Thin interface • Userspace kernel Guest Kernel app • E.g., UML (e.g., Linux) • Performance issues Monitor Process • (e.g., QEMU) Library OS / unikernel Host Kernel with • Only-what-you need Host Kernel/Hypervisor namespacing • Lightweight (e.g., Linux/KVM) (e.g., Linux) VMs Containers
13 What we learned from Nabla containers
• Nabla containers are unikernels as Unikernel as Process processes Application • Can achieve or exceed lightweight HotCloud ’16, HotOS ‘17 libraries, runtimes characteristics of containers Library OS • Interfaces are what matter, not HotCloud ’18, SOCC ‘18 virtualization HW TCP/IP FS … • But we lose a lot: Generality Solo5 • Lupine Linux: applying unikernel techniques to Linux VMs Eurosys ’20
14 What we learned from Nabla containers
• Nabla containers are unikernels as Unikernel as Process processes Application • Can achieve or exceed lightweight HotCloud ’16, HotOS ‘17 libraries, runtimes characteristics of containers Library OS • Interfaces are what matter, not HotCloud ’18, SOCC ‘18 virtualization HW TCP/IP FS … • But we lose a lot: Generality Solo5 • Lupine Linux: applying unikernel Eurosys ’20 this techniques to Linux VMs talk
15 Lupine Linux Overview and Roadmap
• Introduction • Lupine Linux Application (container) Unikernel-like techniques App rootfs • Specialization Application manifest Specialization • System Call Overhead via Kconfig Elimination System Call • Putting it together Overhead Elimination Lupine Linux • Evaluation via KML Linux source “Unikernel” • Discussion • Related Work
16 Unikernels are great
• Small kernel size • Fast boot time • Performance • Security Unikernels are great… but
• Small kernel size • Lack full Linux support • Fast boot time • Hermitux: supports only 97 system calls • Performance • OSv: • application needs to be compiled with –PIE, can’t use TLS • Security • Static-linked applications are not supported • Fork() , execve() are not supported • Special files are not supported such as /proc • Signal mechanism is not complete • Rumprun: only 37 curated applications • Community is too small to keep it rolling Can Linux > be as small as
Lupine Linux “Unikernel” > boot as fast as > outperform unikernels? Can Linux > be as small as
Lupine Linux “Unikernel” > boot as fast as > outperform unikernels? • Spoiler alert: Yes! • 4MB image size • 23 ms boot time • Up to 33% higher throughput Lupine Linux Overview and Roadmap
• Introduction • Lupine Linux Application (container) Unikernel-like techniques App rootfs • Specialization Application manifest Specialization • System Call Overhead via Kconfig Elimination System Call • Putting it together Overhead Elimination Lupine Linux • Evaluation via KML Linux source “Unikernel” • Discussion • Related Work
21 Lupine Linux Overview and Roadmap
• Introduction • Lupine Linux Application (container) Unikernel-like techniques App rootfs • Specialization Application manifest Specialization • System Call Overhead via Kconfig Elimination System Call • Putting it together Overhead Elimination Lupine Linux • Evaluation via KML Linux source “Unikernel” • Discussion • Related Work
22 Unikernel technique #1: Specialization
• Unikernels include only what is needed
• Linux is very configurable • Kconfig • 16,000 options • Drivers • Filesystems • Processor features • ... Specializing Linux through configuration
• Start with Firecracker microvm configuration • Assuming unikernel-like workload, can remove even more! • Application-specific options • Multiprocessing • HW management
24 Application-specific options
• Example: system calls Option Enabled System Call(s) Furthermore, the kernel is usually run in a separate, more ADVISE_SYSCALLS madvise, fadvise64 privileged security domain than the application. As such, AIO io_setup, io_destroy, io_submit, io_cancel, io_getevents the kernel contains enhanced access control systems such • Kernel services BPF_SYSCALL bpf EPOLL epoll_ctl, epoll_create, epoll_wait, epoll_pwait as SELinux and functionality to guard the crossing from the • e.g., /proc, sysctl EVENTFD eventfd, eventfd2 application domain to the kernel domain, such as seccomp FANOTIFY fanotify_init, fanotify_mark lters, all of which are all unnecessary for unikernels More • Kernel library FHANDLE open_by_handle_at, name_to_handle_at importantly, security options with a severe impact on per- FILE_LOCKING ock formance are also unnecessary for this reason. For example, • Crypto routines FUTEX futex, set_robust_list, get_robust_list INOTIFY_USER inotify_init, inotify_add_watch, inotify_rm_watch KPTI (kernel page table isolation [9]) forbids the mapping • Compression routines SIGNALFD signalfd, signalfd4 of kernel pages into processes’ page table to mitigate the TIMERFD timerfd_create, timerfd_gettime, timerfd_settime Meltdown [39] vulnerability. This dramatically aects sys- • Debugging/information Table 1. Linux conguration options that enable/disable tem call performance; when testing with KPTI on Linux 5.0 system calls. we measured a 10x slowdown in system call latency. In total, we eliminated 12 conguration options due to the single security domain. 25 Linux is well equipped to run on multiple-processor sys- A Lupine kernel compiled for redis does not contain the tems. As a result, the kernel contains various options to in- AIO or EVENTFD-related system calls. clude and tune SMP and NUMA functionality. On the other In addition to the above, some applications expect other hand, since most unikernels do not support fork, the stan- services from the kernel, for instance, the /proc lesystem or dard approach to take advantage of multiple processors is to sysctl functionality. Moreover, the Linux kernel maintains run multiple unikernels. a substantial library that resides in the kernel because of its Finally, Linux contains facilities for dynamically loading traditional position as a more privileged security domain. functionality through modules. A single application facili- Unikernels do not maintain the traditional privilege separa- tates the creation of a kernel that contains all functionality tion but may make use of this functionality directly or indi- it needs at build time. rectly by using a protocol or service that needs it (e.g., cryp- Overall, we attribute the removal of 89 conguration op- tographic routines for IPsec). We marked 20 compression- tions to the single-process—“uni”—characteristics of uniker- related and 55 crypto-related options from the microVM nels as shown in Figure 4 (under "Multiple Processes"). In conguration as application-specic. Finally, Linux contains Section 5, we examine the relaxation of this property. signicant facilities for debugging; a Lupine unikernel can select up to 65 debugging and information-related kernel Unikernels are not intended for general hardware. De- conguration options from microVM’s conguration. fault congurations for Linux are intended to result in a In total, we classied approximately 311 conguration general-purpose system. Such a system is intimately involved options as application-specic as shown in Figure 4. In Sec- in managing hardware with congurable functionality to tion 4, we will evaluate the degree of application specializa- perform tasks, including power management, hotplug and tion via Linux kernel conguration (and its eects) achieved driving and interfacing with devices. Unikernels, which are in Lupine for common cloud applications. typically intended to run as virtual machines in the cloud, can leave many physical hardware management tasks to the 3.1.2 Unnecessary options. underlying host or hypervisor. Firecracker’s microVM ker- Some options in microVM’s conguration will, by deni- nel conguration demonstrates the rst step by eliminating tion, never be needed by any Lupine unikernel so they can many unnecessary drivers and architecture-specic cong- be safely eliminated. We categorize these options into two uration options (as shown in Figure 3). Lupine’s congura- groups: (1) those that stem from the single-process nature of tion goes further by classifying 150 conguration options— unikernels and (2) those that stem from the expected virtual including 24 options for power management that can be left hardware environment in the cloud. to the underlying host—as unnecessary for Lupine uniker- Unikernels are not intended for multiple processes. The nels as shown in Figure 4. Linux kernel is intended to run multiple processes, thus requir- ing congurable functionality for synchronization, sched- 3.2 Eliminating System Call Overhead uling and resource accounting. For example, cgroups and Kernel Mode Linux [41] is an existing patch to Linux that namespaces are specic mechanisms that limit, account for enables normal user processes to run in kernel mode, share and isolate resource utilization between processes or groups an address space with the kernel and eliminate the need for of processes. We classied about 20 conguration options expensive privilege transitions or context switches during related to cgroups and namespaces in Firecracker’s microVM system calls. Yet they are processes that, unlike kernel mod- conguration. ules, do not require any change to the programming model 5 Other assumptions from unikernels
• Unikernels are not intended for multiple processes • Related to isolating, accounting for processes • Cgroups, namespaces, SElinux, seccomp, KPTI • SMP, NUMA • Module support • Unikernels are not intended for general hardware • Intended to run as VMs in the cloud • microVM removes many drivers and arch-specific configs • Lupine removes more, including power mgmt
26 How to get an app-specific kernel config
• Start with lupine-base • Manual trial and error • Guided by application output • E.g., the futex facility returned an unexpected error code => CONFIG_FUTEX
• In general, this is a hard problem
27 Lupine Linux Overview and Roadmap
• Introduction • Lupine Linux Application (container) Unikernel-like techniques App rootfs • Specialization Application manifest Specialization • System Call Overhead via Kconfig Elimination System Call • Putting it together Overhead Elimination Lupine Linux • Evaluation via KML Linux source “Unikernel” • Discussion • Related Work
28 Unikernel technique #2: System call overhead elimination • Kernel Mode Linux (KML) • Non-upstream patch (latest Linux 4.0) • Execute unmodified apps in kernel mode • User program can directly access the kernel
Location exposed • Replace “syscall” instruction with “call” in libc e.g., musl via vsyscall - __asm__ __volatile__ ("syscall" : "=a"(ret) : + __asm__ __volatile__ ("call *%1" : "=a"(ret) : "r"(__kml), "a"(n), "D"(a1), "S"(a2), "d"(a3), "r"(r10), "r"(r8), "r"(r9) : "rcx", "r11", "memory"); • Requires relink for static binaries • Less invasive than build modifications for unikernels Lupine Linux Overview and Roadmap
• Introduction • Lupine Linux Application (container) Unikernel-like techniques App rootfs • Specialization Application manifest Specialization • System Call Overhead via Kconfig Elimination System Call • Putting it together Overhead Elimination Lupine Linux • Evaluation via KML Linux source “Unikernel” • Discussion • Related Work
30 Putting it all together
libc.so libm.so librarieslibc.so Unmodified app binary Linux kernel source
31 Putting it all together
Application-specific requirements (manifest) libc.so libm.so Application-specific librarieslibc.so Lupine config Unmodified app binary Linux kernel source Specialization
32 Putting it all together
Application-specific requirements (manifest) libc.so libm.so Application-specific librarieslibc.so Lupine config Unmodified app binary Linux kernel source Specialization
KML KML-enabled patch musl libc overhead System call elimination
33 Putting it all together
Application-specific requirements (manifest) libc.so libm.so Application-specific librarieslibc.so Lupine config Unmodified app binary Linux kernel source Specialization
KML KML-enabled patch musl libc overhead System call elimination
Application-specific Lupine kernel image
34 Remaining issues
• How to build a root filesystem for Linux • Container images are root filesystems already • Contains both application and necessary libraries
• How to start the (single) application • Linux kernel parameter “init” specifies first program, usually “/sbin/init” • Boot the kernel with “init=/app” • Caveats: • May need some simple setup (e.g., network) • Application-specific!
35 Putting it all together
Application-specific requirements (manifest) libc.so libm.so Application-specific librarieslibc.so Lupine config Unmodified app binary Linux kernel source Specialization
KML KML-enabled patch musl libc overhead System call elimination
Application-specific Lupine kernel image
36 Putting it all together
Application-specific requirements (manifest) Container image (alpine) libc.so Metadata libm.so Application-specific librarieslibc.so entrypoint Lupine config Unmodified env variables app binary Linux kernel source Specialization
KML KML-enabled patch musl libc overhead System call elimination
Application-specific Lupine kernel image
37 Putting it all together
Application-specific requirements (manifest) Container image (alpine) libc.so Metadata libm.so Application-specific librarieslibc.so Application- entrypoint Lupine config specific Unmodified env variables startup script app binary Linux kernel source (“init”) Specialization
KML KML-enabled patch musl libc overhead System call elimination
Application-specific Lupine kernel image Lupine app “rootfs”
38 Lupine Linux Overview and Roadmap
• Introduction • Lupine Linux Application (container) Unikernel-like techniques App rootfs • Specialization Application manifest Specialization • System Call Overhead via Kconfig Elimination System Call • Putting it together Overhead Elimination Lupine Linux • Evaluation via KML Linux source “Unikernel” • Discussion • Related Work
39 Evaluation setup
• Machine setup • CPU: Intel(R) Xeon(R) CPU E3-1270 v6 @ 3.80GHz • Mem: 16 GB • VM setup • Hypervisor : firecracker • 1 VCPU, 512 MB Mem • Guest: Linux 4.0 with and without KML patches improvement in macrobenchmarks, indicating that system # Options atop Name Downloads Description call overhead should not be a primary concern for uniker- lupine-base nel developers. Finally, we show that Lupine avoids major nginx 1.7 Web server 13 pitfalls of POSIX-like unikernels that stem from not being postgres 1.6 Database 10 Linux-based, including both the lack of support for unmod- httpd 1.4 Web server 13 ied applications and performance from highly-optimized node 1.2 Language runtime 5 code. redis 1.2 Key-value store 10 mongo 1.2 NOSQL database 11 4.1 Conguration diversity mysql 1.2 Database 9 traek 1.1 Edge router 8 Lupine attempts to mimic the only-what-you-need approach memcached 0.9 Key-value store 10 of unikernels in order to achieve some of their performance hello-world 0.9 C program “hello” 0 and security characteristics. In this subsection, we evaluate mariadb 0.8 Database 13 how much specialization of the Linux kernel occurs in prac- golang 0.6 Language runtime 0 tice when considering the most popular cloud applications. python 0.5 Language runtime 0 Unlike other unikernel approaches, Lupine poses no re- openjdk 0.5 Language runtime 0 strictions on applications and requires no application modi- rabbitmq 0.5 Message broker 12 cations, alternate build processes, or curated package lists. php 0.4 Language runtime 0 As a result, we were able to directly run the most popu- wordpress 0.4 PHP/mysql blog tool 9 lar cloud applications on Lupine unikernels. To determine Configurationhaproxy 0.4 DiversityLoad balancer 8 popularity, we used the 20 most downloaded container im- inuxdb 0.3 Time series database 11 elasticsearch 0.3 Search engine 12 ages from Docker Hub [2]. We nd that popularity follows a improvement in macrobenchmarks, indicating that system # Options atop Top twenty most popular applications on Docker Name Downloads Description power-law distribution: 20 applications account for 83% of Table 3. call overhead should not be a primary concern for uniker- lupine-base • Manually determined app-specificnel developers. configurations Finally, we show that Lupine avoids major nginx 1.7 Web server 13 Hub (by billions of downloads) and thepitfalls number of POSIX-like of unikernels additional that stem from not being postgres 1.6 Database 10 all downloads. Table 3 lists the applications. Linux-based, including both the lack of support for unmod- httpd 1.4 Web server 13 • conguration options each requires beyondied applications the and performancelupine-base from highly-optimized node 1.2 Language runtime 5 For each application, in place of an application manifest, 20 top apps on Docker hub (83%code. of all downloads) redis 1.2 Key-value store 10 9 mongo 1.2 NOSQL database 11 kernel conguration. mysql 1.2 Database 9 we carried out the following process to determine the mini- 4.1 Conguration diversity • Only 19 configuration options required to run all traek 1.1 Edge router 8 Lupine attempts to mimic the only-what-you-need approach memcached 0.9 Key-value store 10 mal viable conguration. First we ran the application as a of unikernels in order to achieve some of their performance hello-world 0.9 C program “hello” 0 20 applications: lupine-generaland security characteristics. In this subsection, we evaluate mariadb 0.8 Database 13 standard container to determine success criteria for the ap- how much specialization of the Linux kernel occurs in prac- golang 0.6 Language runtime 0 tice when considering the most popular cloud applications. python 0.5 Language runtime 0 Unlike other unikernel approaches, Lupine poses no re- openjdk 0.5 Language runtime 0 plication. While success criteria could include sophisticated strictions on applications and requires no application modi- rabbitmq 0.5 Message broker 12 20 cations, alternate build processes, or curated package lists. php 0.4 Language runtime 0 test suites or achieving performance targets, we limited our- As a result, we were able to directly run the most popu- wordpress 0.4 PHP/mysql blog tool 9 18 lar cloud applications on Lupine unikernels. To determine haproxy 0.4 Load balancer 8 selves to the following tests. Language runtimes like golang, popularity, we used the 20 most downloaded container im- inuxdb 0.3 Time series database 11 g optionsg 16 elasticsearch 0.3 Search engine 12