XOS: An Application-Defined Operating System for Datacenter Computing
Chen Zheng∗†, Lei Wang∗, Sally A. McKee‡, Lixin Zhang∗, Hainan Ye§ and Jianfeng Zhan∗† ∗State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences †University of Chinese Academy of Sciences, China ‡Chalmers University of Technology §Beijing Academy of Frontier Sciences and Technology
arXiv:1901.00825v1 [cs.OS] 3 Jan 2019

Abstract—Rapid growth of datacenter (DC) scale, urgency of cost control, increasing workload diversity, and huge software investment protection place unprecedented demands on operating system (OS) efficiency, scalability, performance isolation, and backward compatibility. Traditional OSes are not built to work with deep-hierarchy software stacks, large numbers of cores, tail latency guarantees, and the increasingly rich variety of applications seen in modern DCs, and thus they struggle to meet the demands of such workloads.

This paper presents XOS, an application-defined OS for modern DC servers. Our design moves resource management out of the OS kernel, supports customizable kernel subsystems in user space, and enables elastic partitioning of hardware resources. Specifically, XOS leverages modern hardware support for virtualization to move resource management functionality out of the conventional kernel and into user space, which lets applications achieve near bare-metal performance. We implement XOS on top of Linux to provide backward compatibility. XOS speeds up a set of DC workloads by up to 1.6× over our baseline Linux on a 24-core server, and outperforms the state-of-the-art Dune by up to 3.3× in terms of virtual memory management. In addition, XOS demonstrates good scalability and strong performance isolation.

Index Terms—Operating System, Datacenter, Application-defined, Scalability, Performance Isolation

I. INTRODUCTION

Modern DCs support increasingly diverse workloads that process ever-growing amounts of data. To increase resource utilization, DC servers deploy multiple applications together on one node, but the interference among these applications and the OS lowers individual performance and introduces unpredictability. Most state-of-the-practice and state-of-the-art OSes (including Linux) were designed for computing environments that lack the diversity of resources and workloads found in modern systems and DCs, and they present user-level software with abstracted interfaces to those hardware resources, whose policies are optimized for only a few specific classes of applications.

In contrast, DCs need streamlined OSes that can exploit modern multi-core processors, reduce or remove the overheads of resource allocation and management, and better support performance isolation. Although Linux has been widely adopted as a preferred OS for its excellent usability and programmability, it has limitations with respect to performance, scalability, and isolation. Such general-purpose, one-size-fits-all designs simply cannot meet the needs of all applications. The kernel traditionally controls resource abstraction and allocation, which hides resource-management details but lengthens application execution paths. Giving applications control over functionalities usually reserved for the kernel can streamline systems, improving both individual performance and system throughput [5].

There is a large and growing gap between what DC applications need and what commodity OSes provide. Efforts to bridge this gap range from bringing resource management into user space to bypassing the kernel for I/O operations, avoiding cache pollution by batching system calls, isolating performance via user-level cache control, and refactoring OS structures for better scalability on many-core platforms. Nonetheless, several open issues remain. Most approaches to constructing kernel subsystems in user space support only limited capabilities, leaving the traditional kernel responsible for expensive activities like switching contexts, managing virtual memory, and handling interrupts. Furthermore, many such approaches cannot securely expose hardware resources to user space: applications must load code into the kernel (which can affect system stability). Implementations based on virtual machines incur overheads introduced by the hypervisor layer. Finally, many innovative approaches break current programming paradigms, sacrificing support for legacy applications.

Ideally, DC applications should be able to finely control their resources, customize kernel policies for their own benefit, avoid interference from the kernel or other application processes, and scale well with the number of available cores.

From 2011 to 2016, in collaboration with Huawei, we carried out a large project to investigate different aspects of DC computing, ranging from benchmarking [6], [7], architecture [8], and hardware [9] to OS [4] and programming [10]. Specifically, we explored two different OS architectures for DC computing [11]–[13].¹ This paper is dedicated to XOS—an application-defined OS architecture that follows three main design principles:

• Resource management should be separated from the OS kernel, which then merely provides resource multiplexing and protection.

The corresponding author is Jianfeng Zhan. ¹The other OS architecture for DC computing is reported in [4].
• Applications should be able to define user-space kernel subsystems that give them direct resource access and customizable resource control.

• Physical resources should be partitioned such that applications have exclusive access to allocated resources.

Our OS architecture provides several benefits. Applications have direct control over hardware resources and direct access to kernel-specific functions; this streamlines execution paths and avoids both kernel overheads and those of crossing between user and kernel spaces. Resource-control policies can be tailored to the classes of applications coexisting on the same OS. And elastic resource partitioning enables better scalability and performance isolation by vastly reducing resource contention, both inside and outside the kernel.

We quantify these benefits using microbenchmarks that stress different aspects of the system plus several important DC workloads. XOS outperforms state-of-the-art systems like Linux and Dune/IX [14], [15] by up to 2.3× and 3.3×, respectively, for a virtual memory management microbenchmark. Compared to Linux, XOS speeds up datacenter workloads from BigDataBench [7] by up to 70%. Furthermore, XOS scales 7× better on a 24-core machine and performs 12× better when multiple memory-intensive microbenchmarks are deployed together. XOS keeps tail latency — i.e., the time it takes to complete requests falling into the 99th latency percentile — in check while achieving much higher resource utilization.

Fig. 1: The differences between the XOS nokernel model and other OS models: (a) monolithic; (b) microkernel; (c) exokernel [1]; (d) unikernel [2]; (e) multikernel [3]; (f) IFTS OS [4]; (g) XOS nokernel.

II. BACKGROUND AND RELATED WORK

XOS is inspired by and built on much previous work in OSes to minimize the kernel, reduce competition for resources, and specialize functionalities for the needs of specific workloads. Each of these design goals addresses some, but not all, of the requirements for DC workloads. Our application-defined OS model is made possible by hardware support for virtualizing CPU, memory, and I/O resources.

Hardware-assisted virtualization goes back as far as the IBM System/370, whose VM/370 [16] operating system presented each user (among hundreds or even thousands) with a separate virtual machine having its own address space and virtual devices. For modern CPUs, technologies like Intel VT-x [17] break conventional CPU privilege modes into two new modes: VMX root mode and VMX non-root mode.

For modern I/O devices, the single root input/output virtualization (SR-IOV) [18] standard allows a network interface (in this case PCI Express, or PCI-e) to be safely shared among several software entities. The kernel still manages physical device configuration, but each virtual device can be configured independently from user level.

OS architectures have leveraged virtualization to better meet design goals spanning the need to support increasing hardware heterogeneity, to provide better scalability, elasticity with respect to resource allocation, fault tolerance, customizability, and support for legacy applications. Most approaches break monolithic kernels into smaller components to specialize functionality and deliver more efficient performance and fault isolation by moving many traditionally privileged activities out of the kernel and into user space.

Early efforts like HURRICANE [19] and Hive [20] were designed for scalable shared-memory (especially NUMA) machines. They deliver greater scalability by implementing microkernels (Figure 1b) based on hierarchical clustering.

In contrast, exokernel architectures [1] (Figure 1c) support customizable application-level management of physical resources. A stripped-down kernel securely exports hardware resources to untrusted library operating systems running within the applications themselves. Modern variants use this approach to support Java workloads (IBM's Libra [21]) or Windows applications (Drawbridge [22]). As with Libra, forgoing a traditional Linux kernel that supports legacy applications requires adapting application software. Unikernels such as Mirage and OSv [23] (Figure 1d) represent exokernel variants built for cloud environments.
Early exokernels often loaded application extensions for resource provisioning into the kernel, which poses stability problems. Implementations relying on virtual machine monitors can unacceptably lengthen application execution paths. XOS avoids such overheads by leveraging virtualization hardware to securely control physical resources and by loading only streamlined, specialized kernel subsystems into user space.

The advent of increasingly high core-count machines and the anticipation of future exascale architectures have brought a new emphasis on problems of scalability. Corey [24] implements OS abstractions to allow applications to control inter-core resource sharing, which can significantly improve performance (and thus scalability) on multicore machines. Whereas domains in earlier exokernel/library-OS solutions often shared an OS image, replicated-kernel approaches like Hive's are increasingly being employed. The multikernel [3] runs directly on bare hardware, without a virtualization layer. The OS (Figure 1e) is composed of different CPU drivers, each running on a single core. The CPU drivers cooperate to maintain a consistent OS view.

Our other DC OS architecture [4] introduces the "isolated first, then share" (IFTS) OS model, in which hardware resources are split among disparate OS instances. The IFTS OS model decomposes the OS into a supervisor and several subOSes, and avoids shared kernel state between applications, which in turn reduces performance loss caused by contention.

Several systems streamline the OS by moving traditional functionalities into user space. For instance, solutions such as mTCP [25], Chronos [26], and MegaPipe [27] build user-space network stacks to lower or avoid protocol-processing overheads and to exploit common communication patterns. NoHype [28] eliminates the hypervisor to reduce OS noise. XOS shares the philosophy of minimizing the kernel to realize performance much closer to that of bare metal.

Dune [14]/IX [15] and Arrakis [29] employ approaches very similar to XOS. Dune uses nested paging to support user-level control over virtual memory. IX (Dune's successor) and Arrakis use hardware virtualization to separate resource management and scheduling functions from network processing. IX uses a full Linux kernel as the control plane (as in Software Defined Networking), leveraging memory-mapped I/O to implement pass-through access to the NIC and Intel's VT-x to provide three-way isolation between the control plane, network stack, and application. Arrakis uses Barrelfish as the control plane and exploits the IOMMU and SR-IOV to provide direct I/O access. XOS allows applications to directly access not just I/O subsystems but also hardware features such as memory management and exception handling.

III. THE XOS MODEL

The growing gap between OS capabilities and DC workload requirements necessitates that we rethink the design of OSes for modern DCs. We propose an application-defined OS model guided by three principles: 1) separation of resource management from the kernel; 2) application-defined kernel subsystems; and 3) elastic resource partitioning.

A. Separating Resource Management

We contend that layered kernel abstractions are the main causes of both resource contention and application interference. Achieving near bare-metal performance thus requires that we remove the kernel from the critical path of an application's execution.

In an application-defined OS (Figure 1g), the traditional role of the OS kernel is split. Applications take over OS kernel duties with respect to resource configuration, provisioning, and scheduling. This allows most kernel subsystems to be constructed in user space. The kernel retains the responsibility for resource allocation, multiplexing, and protection, but it no longer mediates every application operation. Reducing kernel involvement in application execution has several advantages. First, applications need not trap into kernel space. In current general-purpose OSes like Linux, applications must access resources through the kernel, lengthening their execution paths. The kernel may also interrupt application execution: for instance, a system call invocation typically raises a synchronous exception, which forces two transitions between user and kernel modes. Moreover, it flushes the processor pipeline twice and pollutes critical processor structures, such as the TLB, branch prediction tables, prefetch buffers, and private caches. When a system call competes for shared kernel structures, it may also stall the execution of other processes using those structures.

One challenge in separating resource management from the kernel is how to securely expose hardware to user space. Our model leverages modern virtualization-support hardware. Policies governing the handling of privilege levels, address translation, and exception triggers can enforce security when applications directly interact with hardware. They make it possible for an application-defined OS to give applications the ability to access all privileged instructions, interrupts, exceptions, and cross-kernel calls, and to have direct control over physical resources, including physical memory, CPU cores, I/O devices, and processor structures.

B. Application-Defined Kernel Subsystems

Our OS model is designed to separate all resource management from the kernel, including CPU, memory, I/O, and exceptions. Applications are allowed to customize their own kernel subsystems, choosing the types of hardware resources to expose in user space. Furthermore, applications running on the same node may implement different policies for a given kernel subsystem. Kernel services not built into user space are requested from the kernel just as in normal system calls.

Application-defined kernel subsystems in user space are a major feature of XOS. Applications know what resources they need and how they want to use them. Tradeoffs in traditional OS design are often made to address common application needs, but this leads to poor performance for applications without such "common" needs or behaviors. For example, data analysis workloads with fixed memory requirements will not always benefit from demand paging in the virtual memory subsystem, and network applications often need different network-stack optimizations for better throughput and latency.
Fig. 2: The XOS architecture. User-mode applications link against libc and POSIX APIs; per-application XOS runtimes provide kernel subsystems such as the pager, memory manager, network stack, message-based I/O, thread pool, and resource table; kernel mode retains the VMX module, file system, SYSCALL path, and the interrupt and exception handlers.

In this model, a cell is an XOS process that is granted exclusive resource ownership and that runs in VMX non-root mode.