Arrakis: The Operating System is the Control Plane
UW Technical Report UW-CSE-13-10-01, version 2.1, June 26, 2014

Simon Peter, Jialin Li, Irene Zhang, Dan R. K. Ports, Doug Woos, Arvind Krishnamurthy, Thomas Anderson (University of Washington); Timothy Roscoe (ETH Zurich)

Abstract

Recent device hardware trends enable a new approach to the design of network server operating systems. In a traditional operating system, the kernel mediates access to device hardware by server applications, to enforce process isolation as well as network and disk security. We have designed and implemented a new operating system, Arrakis, that splits the traditional role of the kernel in two. Applications have direct access to virtualized I/O devices, allowing most I/O operations to skip the kernel entirely, while the kernel is re-engineered to provide network and disk protection without kernel mediation of every operation. We describe the hardware and software changes needed to take advantage of this new abstraction, and we illustrate its power by showing 2-5x end-to-end latency and 9x throughput improvements for a popular persistent NoSQL store relative to a well-tuned Linux implementation.

1 Introduction

Reducing the overhead of the operating system process abstraction has been a longstanding goal of systems design. This issue has become particularly salient with modern client-server computing. Many servers spend much of their time executing operating system code: delivering interrupts, demultiplexing and copying network packets, and maintaining filesystem meta-data. Server applications often perform very simple functions, such as key-value table lookup and storage, yet traverse the OS kernel multiple times per client request. Further, the combination of high speed Ethernet and low latency persistent memories is widening the gap between what is possible running in kernel mode and what is available to applications.

These trends have led to a long line of research aimed at optimizing kernel code paths for various use cases: eliminating redundant copies in the kernel [41], reducing the overhead for large numbers of connections [25], protocol specialization [39], resource containers [8, 35], direct transfers between disk and network buffers [41], interrupt steering [42], hardware TCP acceleration, etc. Much of this has been adopted in mainline commercial OSes, and yet it has been a losing battle: we show that the network and file system stacks have latency and throughput many times worse than that achieved by the raw hardware.

Twenty years ago, researchers proposed streamlining packet handling for parallel computing over a network of workstations by mapping the network hardware directly into user space [17, 20, 50]. Although commercially unsuccessful at the time, the virtualization market has now led hardware vendors to revive the idea [6, 34, 44], and also extend it to disks [48, 49].

This paper explores the OS implications of removing the kernel from the data path for nearly all I/O operations. We argue that doing this must provide applications with the same security model as traditional designs; it is easy to get good performance by extending the trusted computing base to include application code, e.g., by allowing applications unfiltered direct access to the network or the disk.

We demonstrate that operating system protection is not contradictory with high performance. For our prototype implementation, a client request to the Redis persistent NoSQL store has 2x better read latency, 5x better write latency, and 9x better write throughput compared to Linux.

We make three specific contributions:

• We give an architecture for the division of labor between the device hardware, kernel, and runtime for direct network and disk I/O by unprivileged processes (§3).

• We implement prototypes of our model as a set of modifications to the open source Barrelfish operating system, running on commercially available multi-core computers and I/O device hardware (§3.6).

• We use these prototypes to quantify the potential benefits of user-level I/O for several widely used network services, including a key-value store, a NoSQL store, an IP-layer middlebox, and an HTTP load balancer (§4). We show that significant gains are possible in terms of both latency and scalability, relative to Linux, in many cases without modifying the application programming interface; additional gains are possible by changing the POSIX API (§4.3).

[Figure 1 and Figure 2 appear here. Figure 2 plots write and fsync duration in µs for ext2, ext3, ext4, and btrfs at 64B and 1KB write sizes.]

Figure 1: Linux networking architecture and workflow.

Figure 2: Overhead in µs of various Linux filesystem implementations, when conducting small, persistent writes.

2 Background

We first give a detailed breakdown of the OS and application overheads in network and storage operations today, followed by a discussion of current hardware technologies that support user-level networking and I/O virtualization.

To analyze the sources of overhead, we record timestamps at various stages of kernel and user-space processing. Our experiments are conducted on a six-machine cluster consisting of 6-core Intel Xeon E5-2430 (Sandy Bridge) systems at 2.2 GHz clock frequency executing Ubuntu Linux 13.04, with 1.5 MB L2 cache and 15 MB L3 cache, 4 GB of memory, an Intel X520 (82599-based) 10Gb Ethernet adapter, and an Intel MegaRAID RS3DC040 RAID controller with 1GB cache of flash-backed DRAM, exposing a 100GB Intel DC S3700 series SSD as one logical disk. All machines are connected to a 10Gb Dell PowerConnect 8024F Ethernet switch. One system (the server) executes the application under scrutiny, while the others act as clients.

2.1 Networking Stack Overheads

Consider a UDP echo server implemented as a Linux process. The server performs recvmsg and sendmsg calls in a loop, with no application-level processing, so it stresses packet processing in the OS. Figure 1 depicts the typical workflow for such an application. As Table 1 shows, operating system overhead for packet processing falls into four main categories:

• Network stack processing at the hardware, IP, and UDP layers.

• Scheduler overhead: waking up a process (if necessary), selecting it to run, and context switching to it.

• Kernel crossings: from kernel to user space and back.

• Copying of packet data: from the kernel to a user buffer on receive, and back on send.

Of the total 3.36 µs (see Table 1) spent processing each packet in Linux, nearly 70% is spent in the network stack. This work is mostly software demultiplexing and security checks. The kernel must validate the header of incoming packets, and must perform security checks on arguments provided by the application when it sends a packet.

Scheduler overhead depends significantly on whether the receiving process is currently running. If it is, only 5% of processing time is spent in the scheduler; if it is not, the time to context-switch to the server process from the idle process adds an extra 2.2 µs and a further 0.6 µs slowdown in other parts of the network stack.

Cache and lock contention issues on multicore systems add further overhead and are exacerbated by the fact that incoming messages can be delivered on different queues by the network card, causing them to be processed by different CPU cores—which may not be the same as the cores on which the user-level process is scheduled, as depicted in Figure 1. Advanced hardware support such as accelerated receive flow steering (aRFS) [4] aims to mitigate this cost, but these solutions themselves impose non-trivial setup costs [42].

By leveraging hardware support to remove kernel mediation from the data plane, Arrakis can eliminate certain categories of overhead entirely and minimize the effect of others. Table 1 also shows the corresponding overhead for two variants of Arrakis. Arrakis eliminates scheduling and kernel crossing overhead entirely, because packets are delivered directly to user space. Network stack processing is still required, of course, but it is greatly simplified: it is no longer necessary to demultiplex packets for different applications, and the user-level network stack need not validate parameters provided by the user as extensively as a kernel implementation must. Because each application has a separate network stack, and packets are delivered to cores where the application is running, lock contention and cache effects are reduced.

In Arrakis' network stack, the time to copy packet data to and from user-provided buffers dominates the processing cost, a consequence of the mismatch between the POSIX interface (Arrakis/P) and NIC packet queues. Arriving data is first placed by the network hardware into a network buffer and then copied into the location specified by the POSIX read call. Data to be transmitted is moved into a buffer that can be placed in the network hardware queue; the POSIX write can then return, allowing the user memory to be reused before the data is sent.
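For concreteness, the UDP echo server measured in Table 1 is essentially the following loop (a minimal sketch; socket creation, binding, and error handling are elided):

    /* Minimal UDP echo loop: receive a datagram and send it straight back.
       Assumes sock is an already-bound UDP socket. */
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    static void echo_loop(int sock)
    {
        char buf[2048];
        struct sockaddr_storage peer;

        for (;;) {
            struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
            struct msghdr msg = {
                .msg_name = &peer, .msg_namelen = sizeof(peer),
                .msg_iov = &iov, .msg_iovlen = 1,
            };
            ssize_t n = recvmsg(sock, &msg, 0);   /* kernel copies packet in  */
            if (n < 0)
                continue;
            iov.iov_len = (size_t)n;              /* echo exactly what arrived */
            sendmsg(sock, &msg, 0);               /* kernel copies packet out  */
        }
    }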

                          Linux                                  Arrakis
                          Receiver running   CPU idle            POSIX interface     Native interface
Network stack    in       1.26 (37.6%)       1.24 (20.0%)        0.32 (22.3%)        0.21 (55.3%)
                 out      1.05 (31.3%)       1.42 (22.9%)        0.27 (18.7%)        0.17 (44.7%)
Scheduler                 0.17 (5.0%)        2.40 (38.8%)        -                   -
Copy             in       0.24 (7.1%)        0.25 (4.0%)         0.27 (18.7%)        -
                 out      0.44 (13.2%)       0.55 (8.9%)         0.58 (40.3%)        -
Kernel crossing  return   0.10 (2.9%)        0.20 (3.3%)         -                   -
                 syscall  0.10 (2.9%)        0.13 (2.1%)         -                   -
Total                     3.36 (σ = 0.66)    6.19 (σ = 0.82)     1.44 (σ < 0.01)     0.38 (σ < 0.01)

Table 1: Sources of packet processing overhead in Linux and Arrakis. All times are averages over 1,000 samples, given in µs (and standard deviation for totals). Arrakis currently supports polling only. Thus, the CPU idle case does not apply.

Although researchers have investigated ways to eliminate this copy from kernel network stacks [41], as Table 1 shows, most of the overhead for a kernel-resident network stack is elsewhere. Once the overhead of traversing the kernel is removed, there is an opportunity to rethink the POSIX API for more streamlined networking. In addition to a POSIX compatible interface, Arrakis provides a native interface (Arrakis/N) which supports true zero-copy I/O.

2.2 Storage Stack Overheads

To illustrate the overhead of today's OS storage stacks, we conduct an experiment where we execute small write operations immediately followed by an fsync¹ system call in a tight loop of 10,000 iterations, measuring each operation's execution latency. We store the file system on a RAM disk, so the measured latencies represent purely CPU overhead.

The overheads shown in Figure 2 stem from data copying between user and kernel space, parameter and access control checks, block and inode allocation, virtualization (the VFS layer), snapshot maintenance (btrfs), as well as metadata updates, in many cases via a journal [49].

Considering both the near-ubiquity of caching RAID controllers and NVRAM-based storage adapters, OS storage stack overhead becomes a major factor. We measured average write latency to our RAID cache to be 25 µs, and PCIe-attached flash storage adapters, like Fusion-IO's ioDrive2, report hardware access latencies as low as 15 µs [22]. In comparison, OS storage stack overheads are high, between 40% and 2x for the extended filesystems, depending on journal use, and up to 5x for btrfs.

2.3 Application Overheads

What do these I/O stack overheads mean to operation latencies within a typical datacenter application? Consider the Redis [16] NoSQL store. Redis persists each write via an operational log (called append-only file)² and serves reads from an in-memory data structure.

To serve a read, Redis performs a series of operations: First, epoll is called to await data for reading, followed by recv to receive a request. After receiving, the (textual) request is parsed and the key looked up in memory. Once found, a response is prepared and then, after epoll is called again to check whether the socket is ready, sent to the client via send. For writes, Redis additionally marshals the operation into log format, writes the log, and waits for persistence (via the fsync call) before responding. Redis also spends minor amounts of time in accounting, access checks, replication (disabled in this experiment), and connection handling (Other row in Table 2).

Table 2 shows that a total of 76% of the latency in an average read hit on Linux is due to socket operations. In Arrakis, we reduce socket operation latency by 68%. Similarly, 90% of the latency of an average write on Linux is due to I/O operations. In Arrakis we reduce I/O latency by 82%.

We can also see that Arrakis reduces some application-level overheads. This is due to the better cache behavior of the user-level I/O stacks and the control/data plane separation evading all kernel crossings. Arrakis' write latency is still dominated by storage access latency (25 µs in our system). We expect the latency to improve as faster storage devices appear on the market.

2.4 Hardware I/O Virtualization

Single-Root I/O Virtualization (SR-IOV) [34] is a hardware technology intended to support high-speed I/O for multiple virtual machines sharing a single physical machine. An SR-IOV-capable I/O adapter appears on the PCIe interconnect as a single "physical function" (PCI parlance for a device) which can in turn dynamically create additional "virtual functions".

¹We also tried fdatasync, with negligible difference in latency.
²Redis also supports snapshot persistence because of the high per-operation overhead imposed by Linux.

                    Read hit                              Durable write
                    Linux             Arrakis/P           Linux                Arrakis/P
epoll()             2.42 (27.91%)     1.12 (27.52%)       2.64 (1.62%)         1.49 (4.73%)
recv()              0.98 (11.30%)     0.29 (7.13%)        1.55 (0.95%)         0.66 (2.09%)
send()              3.17 (36.56%)     0.71 (17.44%)       5.06 (3.10%)         0.33 (1.05%)
Parse input         0.85 (9.80%)      0.66 (16.22%)       2.34 (1.43%)         1.19 (3.78%)
Lookup/set key      0.10 (1.15%)      0.10 (2.46%)        1.03 (0.63%)         0.43 (1.36%)
Prepare response    0.60 (6.92%)      0.64 (15.72%)       0.59 (0.36%)         0.10 (0.32%)
Log marshaling      -                 -                   3.64 (2.23%)         2.43 (7.71%)
Write log           -                 -                   6.33 (3.88%)         0.10 (0.32%)
Persistence         -                 -                   137.84 (84.49%)      24.26 (76.99%)
Other               0.55 (6.34%)      0.46 (11.30%)       2.12 (1.30%)         0.52 (1.65%)
Total               8.67 (σ = 2.55)   4.07 (σ = 0.44)     163.14 (σ = 13.68)   31.51 (σ = 1.91)

Table 2: Overheads in the Redis NoSQL store for memory reads (hits) and durable writes (legend in Table 1).
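For orientation, the read and write paths measured in Table 2 correspond roughly to the following per-request sequence (a simplified sketch of a Redis-like server loop, not Redis' actual code; the command struct and helper functions are hypothetical):

    /* Sketch of the request path behind Table 2. The helpers below are
       hypothetical stand-ins for parsing, execution, and log marshaling. */
    #include <sys/types.h>
    #include <sys/epoll.h>
    #include <unistd.h>

    struct command { int is_write; /* parsed fields elided */ };
    struct command parse_request(const char *buf, ssize_t len);
    size_t execute_request(const struct command *c, char *resp);
    size_t marshal_log_entry(const struct command *c, char *out);

    static void serve_one_request(int epfd, int client, int logfd)
    {
        struct epoll_event ev;
        char req[4096], resp[4096], logbuf[4096];

        epoll_wait(epfd, &ev, 1, -1);                 /* await data for reading */
        ssize_t n = read(client, req, sizeof(req));   /* receive the request    */

        struct command c = parse_request(req, n);     /* parse textual request  */
        size_t rlen = execute_request(&c, resp);      /* look up / set the key  */

        if (c.is_write) {
            size_t llen = marshal_log_entry(&c, logbuf);  /* log marshaling     */
            write(logfd, logbuf, llen);               /* write the log          */
            fsync(logfd);                             /* wait for persistence   */
        }
        epoll_wait(epfd, &ev, 1, -1);                 /* check socket is ready  */
        write(client, resp, rlen);                    /* send the response      */
    }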

Each of these resembles a PCI device, which can be directly mapped into a different virtual machine, and access can be protected via an IOMMU (e.g., Intel's VT-d [30]). To the guest operating system, each virtual function can be programmed as if it was a regular physical device, with a normal device driver and an unchanged I/O stack. Privileged software with access to the physical hardware (such as Domain 0 in a Xen [9] installation) creates and deletes these virtual functions, and configures filters in the SR-IOV adapter to demultiplex hardware operations to different virtual functions and therefore different guest operating systems.

In Arrakis, we use SR-IOV, the IOMMU, and supporting adapters to provide direct application-level access to I/O devices. This is a modern implementation of an idea which was implemented twenty years ago with U-Net [50], but generalized to flash storage and Ethernet network adapters. To make user-level I/O stacks tractable, we need a hardware-independent device model and API that captures the important features of SR-IOV adapters [27, 36, 37, 47]; a hardware-specific device driver matches our API to the specifics of the particular device. We discuss this model in the next section, along with potential improvements to the existing hardware to better support user-level I/O.

Remote Direct Memory Access (RDMA) is another popular model for user-level networking [44]. RDMA gives applications the ability to read from or write to a region of virtual memory on a remote machine directly from user-space, bypassing the operating system kernel on both sides. The intended use case is for a parallel program to be able to directly read and modify its data structures even when they are stored on remote machines.

While RDMA provides the performance benefits of user-level networking to parallel applications, it is challenging to apply the model to a broader class of client-server applications [19]. Most importantly, RDMA is point-to-point. Each participant receives an authenticator providing it permission to read/write a particular region of memory. Since clients in client-server computing are not mutually trusted, the hardware would need to keep a separate region of memory for each active connection. Therefore we do not consider RDMA operations here.

3 Design and Implementation

Arrakis has the following design goals:

• Minimize kernel involvement for data-plane operations: Arrakis is designed to limit or remove kernel mediation for most I/O operations. I/O requests are routed to and from the application's address space without requiring kernel involvement and without sacrificing security and isolation properties.

• Transparency to the application programmer: Arrakis is designed to significantly improve performance without requiring modifications to applications written to the POSIX API. Additional performance gains are possible if the developer can modify the application to take advantage of the Arrakis native interface.

• Appropriate OS/hardware abstractions: Arrakis's abstractions should be sufficiently flexible to efficiently support a broad range of I/O patterns, scale well on multi-core systems, and support application requirements for locality and load balancing.

In the rest of this section, we show how we achieve these goals in the Arrakis architecture. We describe an ideal set of hardware facilities that should be present to take full advantage of this architecture, and we detail the design of the control plane and data plane interfaces that we provide to the application programmer. Finally, we describe our implementation of Arrakis based on the Barrelfish operating system.

3.1 Architecture Overview

Arrakis targets I/O hardware with support for virtualization, and Figure 3 shows the overall architecture. In this paper, we focus on hardware that can present multiple instances of itself to the operating system and the applications running on the node.

[Figure 3 diagram: applications with library OSes in user space, the control plane in the kernel, and VNICs/VSICs with VSAs in the NIC and storage controller.]

Figure 3: Arrakis architecture. The storage controller maps VSAs to physical storage.

For each of these virtualized device instances, the underlying physical device provides unique memory mapped register spaces, transmit/receive/command queues, and interrupts. The device exports a management interface that is accessible from the control plane in order to create or destroy virtualized device instances, associate individual instances with network flows or storage areas, and allocate shared resources to the different instances. Applications conduct I/O through a virtualized device instance without requiring kernel intervention. In order to perform these operations, applications rely on a user-level I/O stack, which is implemented as a library OS. The user-level I/O stack can be tailored to the application, as it can assume exclusive access to a virtualized device instance and also avoid performing demultiplexing operations and security checks that are already implemented in the hardware.

The user naming and protection model is unchanged. A global naming system is provided by the control plane. This is especially important for sharing stored data. Applications implement their own storage, while the control plane manages naming and coarse-grain allocation, by associating each application with the directories and files it manages. Other applications can still read those files by indirecting through the kernel, which hands the directory or read request to the appropriate application.

3.2 Hardware Model

A key element of our work is to develop a hardware-independent layer for virtualized I/O—that is, a device model providing an "ideal" set of hardware features. This device model captures the functionality required to implement in hardware the data plane operations of a traditional kernel. Our model resembles what is already provided by some hardware I/O adapters; we hope it will provide guidance as to what is needed to support secure user-level networking and storage.

In particular, we assume our network devices provide support for virtualization by presenting themselves as multiple virtual network interface cards (VNICs) and that they can also multiplex/demultiplex packets based on complex filter expressions, directly to queues that can be managed entirely in user space without the need for kernel intervention. Similarly, each storage controller exposes multiple virtual storage interface controllers (VSICs) in our model. Each VSIC provides independent storage command queues (e.g., of SCSI or ATA format) which are multiplexed by the hardware.

Associated with each such virtual interface card (VIC) are queues and rate limiters. VNICs also provide filters and VSICs provide virtual storage areas. We discuss these components below.

Queues: Each VIC contains multiple pairs of DMA queues for user-space send and receive. The exact form of these VIC queues could depend on the specifics of the I/O interface card. For example, it could support a scatter/gather interface to aggregate multiple physically-disjoint memory regions into a single data transfer. For NICs, it could also optionally support hardware checksum offload and TCP segmentation facilities. These features enable I/O to be handled more efficiently by performing additional work in hardware. In such cases, the Arrakis system offloads operations and thus further reduces the software I/O overhead.

Transmit and receive filters: A transmit filter is a predicate on network packet header fields which the hardware will use to determine whether to send the packet or discard it (possibly signaling an error either to the application or the OS). The transmit filter prevents applications from spoofing information such as IP addresses and VLAN tags and thus eliminates the need for kernel mediation to enforce these security checks. It can also be used to limit an application to only communicate with a pre-selected set of nodes, e.g., for information flow control.

A receive filter is a similar predicate that determines which packets received from the network will be delivered to a VNIC and to a specific queue associated with the target VNIC. The demultiplexing of packets to separate VNICs allows for isolation of flows, and the further demultiplexing into queues allows for efficient load-balancing through techniques such as receive side scaling. Installation of transmit and receive filters is a privileged operation performed via the kernel control plane.

Virtual storage areas: Storage controllers need to provide an interface via their physical function to create and delete virtual storage areas (VSAs), map them to extents of physical drives, and associate them with VSICs. An application can store multiple sub-directories and files in a single VSA, providing precise control over multi-object serialization constraints. A VSA can be dynamically assigned multiple contiguous physical storage extents. On flash, an extent can be an erasure block; on disk, it may be a cylinder group or portion of a cylinder.
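To make the hardware model concrete, the per-VIC state described above can be pictured roughly as follows. This is an illustrative rendering only; the C types are our own sketch of the model, not an actual Arrakis or device header:

    /* Illustrative sketch of the virtualized-device model (not real headers). */
    #include <stdint.h>
    #include <stdbool.h>

    struct vic_queue {            /* one DMA queue pair of a VNIC or VSIC       */
        uint32_t id;
        void    *descriptor_ring; /* packet-buffer or command descriptors       */
        int      doorbell_fd;     /* event endpoint for arrivals/completions    */
    };

    struct vic_filter {           /* transmit or receive predicate on headers   */
        bool     transmit;        /* direction                                  */
        uint32_t src_ip, src_mask;
        uint32_t dst_ip, dst_mask;
        uint16_t src_port, dst_port;  /* 0 acts as a wildcard                   */
        uint16_t vlan;
        uint32_t queue_id;        /* receive filters steer to a specific queue  */
    };

    struct rate_limiter {         /* bandwidth allocator attached to a queue    */
        uint64_t bytes_per_sec;
        uint32_t pacing_gap_us;   /* inter-packet pacing                        */
    };

    struct vsa_extent {           /* contiguous physical storage backing a VSA  */
        uint64_t phys_start;
        uint64_t length;
    };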

Each VSIC may support only one VSA, but in some cases support for multiple different VSAs can have benefits. For example, parallel writes to different storage devices, such as multiple flash chips, improve performance [40]. Existing VSAs may also be extended and shrunk. VSAs are the storage equivalent of NIC filters and ensure access protection without kernel mediation. VSAs can be implemented in hardware via page tables or lists of extents.

Bandwidth allocators: This includes support for resource allocation mechanisms such as rate limiters and pacing/traffic shaping of I/O. Once a frame has been removed from a transmit rate-limited or paced queue, the next time another frame could be fetched from that queue is regulated by the rate limits and the inter-packet pacing controls associated with the queue. Installation of these controls is also a privileged operation.

In addition, we assume that the I/O device supports an introspection interface that allows the control plane to query for resource limits (e.g., the number of queues) and check for the availability of hardware support for I/O processing (e.g., checksum calculations and segmentation).

Network cards that support SR-IOV have the key elements of this model: they allow the creation of multiple VNICs that each may have multiple send and receive queues, and support at least rudimentary transmit and receive filters. Not all NICs provide the rich filtering semantics we desire; for example, the Intel 82599 can filter only based on source or destination MAC addresses and VLAN tags, not arbitrary predicates on header fields. However, this capability is within reach: some network cards (e.g., Solarflare 10Gb adapters) can already filter packets on all header fields, and the hardware support required for more general VNIC transmit and receive filtering is closely related to that used for techniques like Receive-Side Scaling, which is ubiquitous in high-performance network cards.

Storage controllers have individual bits of the technology needed to provide the interface we describe. For example, RAID adapters have a translation layer that is able to provide virtual disks out of physical disk extents, and SSDs use a flash translation layer for wear-leveling. SCSI host-bus adapters support SR-IOV technology for virtualization [36, 37], which allows them to expose multiple VSICs, and the NVMe standard proposes multiple command queues for scalability [31]. Only the required protection mechanism is missing. We anticipate VSAs to be allocated in large chunks (gigabytes) and thus hardware protection mechanisms can be coarse-grained and light-weight.

Finally, the number of hardware-supported VICs might be limited. The 82599 [27] and SAS3008 [37] support 64. This number is adequate with respect to the capabilities of the rest of the hardware (e.g., the number of CPU cores) and we expect it to rise. The PCI working group has already ratified an addendum to SR-IOV that increases the supported number of virtual functions to 2048. To run more applications than the hardware supports, we can simply multiplex the remainder in the traditional way via the kernel to the physical function. Of course, only the first applications would benefit from kernel bypass, and this set of high-priority applications would need to be configured by the system administrator.

3.3 Control Plane Interface

The interface between an application and the Arrakis control plane is used to request resources from the system and direct I/O flows to and from user programs. The key abstractions presented by this interface are VICs, doorbells, filters, VSAs, and rate specifiers.

An application can create and delete VICs (subject to resource limitations or policies imposed by Arrakis), and also associate doorbells with particular events on particular VICs. A doorbell is an IPC communication end-point used to notify the application that an event (such as a packet arrival or I/O completion) has occurred; we discuss them further in the next section.

Filters have a type (transmit or receive) and a predicate which corresponds to a convex sub-volume of the packet header space (for example, obtained with a set of mask-and-compare operations). Filters can be used to specify ranges of IP addresses and port numbers that can be associated with valid packets transmitted/received at each VNIC. Filters are a better abstraction for our purposes than a conventional connection identifier (such as a TCP/IP 5-tuple), since they can encode a wider variety of communication patterns, as well as subsuming traditional port allocation and interface specification.

For example, in the "map" phase of a MapReduce job we would like the application to send to, and receive from, an entire class of machines using the same communication end-point, but nevertheless isolate the data comprising the shuffle from other data. Furthermore, filters as subsets of header space are flexible enough to support a range of application settings while being more straightforward to implement in hardware than mechanisms such as longest prefix matching.

As a second example, web servers with a high rate of incoming TCP connections can run into scalability problems processing connection requests [42]. In Arrakis, a single filter can safely express both a listening socket and all subsequent connections to that socket, allowing server-side TCP connection establishment to avoid kernel mediation.

Applications create a filter with a control plane operation. In practice this is usually wrapped in a higher-level call,

create_filter(flags, peerlist, servicelist) = filter, which sacrifices generality in aid of simplicity in the common case. It returns a new filter ID, filter; flags specifies the filter direction (transmit or receive) and whether the filter refers to the Ethernet, IP, TCP, or UDP level. peerlist is a list of accepted communication peers specified according to the filter type, and servicelist contains a list of accepted service addresses (e.g., port numbers) for the filter. Wildcards are permitted.

A filter ID is essentially a capability: it confers authority to send or receive packets satisfying its predicate. A filter ID can subsequently be assigned to a particular queue on a VNIC and causes Arrakis to configure the underlying hardware accordingly; thereafter that queue can be used to send or receive the corresponding packets.

Similarly, a VSA is acquired via an acquire_vsa(name) = VSA call. VSAs are given a global name, such that they can be found again when an application is restarted. The returned VSA is a capability that allows applications access to the corresponding VSA. The control plane stores capabilities and their association. If a non-existent VSA is acquired, its size is initially zero, but it can be resized via the resize_vsa(VSA, size) call.

As long as physical storage space is available, the interface allows creating new VSAs and extending existing ones. Deleting and resizing an existing VSA is always possible. Shrink and delete operations permanently delete the data contained in the VSA by overwriting it with zeroes, which can be done in the background by the control plane. Resizing might be done in units of a fixed size if more efficient given the hardware.

For each VSA, the control plane maintains a mapping of virtual storage blocks to physical ones and programs the hardware accordingly. As long as the storage hardware allows mappings to be made at a block or extent granularity, a regular physical memory allocation algorithm can be used without having to worry about fragmentation. Maintaining head locality for hard disks is outside the scope of our work. As we envision VSAs to be modified infrequently and by large amounts, the overhead of these operations does not significantly affect application performance.

Finally, a rate specifier can also be assigned to a queue, either to throttle incoming traffic (in the network receive case) or to pace outgoing packets and I/O requests. Rate specifiers and filters associated with a VIC queue can be updated dynamically, but all such updates require mediation from the Arrakis control plane.

3.4 Network Data Plane Interface

In Arrakis, applications send and receive network packets by directly communicating with hardware. The data plane interface is therefore implemented in an application library. The Arrakis library provides two interfaces to applications. We describe the native Arrakis interface, which departs slightly from the POSIX standard to support true zero-copy I/O; Arrakis also provides a POSIX compatibility layer that supports unmodified applications.

Applications send and receive packets on queues, which have previously been assigned filters as described above. While filters can include IP, TCP, and UDP field predicates, Arrakis does not require the hardware to perform protocol processing, only multiplexing. In our implementation, Arrakis provides a user-space network stack above the data plane interface.

Interaction with the network hardware is designed to maximize performance from the perspective of both latency and throughput. We maintain a clean separation between three aspects of packet transmission and reception.

Firstly, packets are transferred asynchronously between the network and main memory using conventional DMA techniques, using rings of packet buffer descriptors.

Secondly, the application transfers ownership of a transmit packet to the network hardware by enqueuing a chain of buffers onto the hardware descriptor rings, and acquires a received packet by the reverse process. This is performed by two VNIC driver functions. send_packet(queue, packet_array) sends a packet on a queue; the packet is specified by the scatter-gather array packet_array, and must conform to a filter already associated with the queue. receive_packet(queue) = packet receives a packet from a queue and returns a pointer to it. Both operations are asynchronous. packet_done(packet) returns ownership of a received packet to the VNIC.

For optimal performance, the Arrakis stack would interact with the hardware queues not through these calls but directly via compiler-generated, optimized code tailored to the NIC descriptor format. However, the implementation we report on in this paper uses function calls to the driver.

Thirdly, we handle asynchronous notification of events using doorbells associated with queues. Doorbells are delivered directly from hardware to user programs via hardware-virtualized interrupts when applications are running, and via the control plane to invoke the scheduler when applications are not running, in which case the higher latency is tolerable. Doorbells are exposed to Arrakis programs via regular event delivery mechanisms (e.g., a file descriptor event) and are fully integrated with existing I/O multiplexing interfaces (e.g., select). They are useful both to notify an application of the general availability of packets in receive queues and as a lightweight notification mechanism for I/O completion and the reception of packets in high-priority queues.

This design results in a protocol stack that decouples hardware from software as much as possible, using the descriptor rings as a buffer, maximizing throughput and minimizing overhead under high packet rates, yielding low latency.
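A sketch of how an application might drive this native interface is shown below. The function names follow §3.4; the C prototypes, the opaque struct types, the doorbell file descriptor, and handle_payload() are assumptions made for this illustration only:

    /* Sketch of a receive loop over the native data-plane interface. */
    #include <stddef.h>
    #include <sys/select.h>

    struct vnic_queue;                               /* opaque queue handle     */
    struct packet;                                   /* opaque packet buffer    */
    struct packet_array;                             /* scatter-gather array    */
    struct packet *receive_packet(struct vnic_queue *q);
    void send_packet(struct vnic_queue *q, struct packet_array *pa);
    void packet_done(struct packet *p);
    struct packet_array *handle_payload(struct packet *p);  /* application logic */

    void rx_loop(struct vnic_queue *q, int doorbell_fd)
    {
        for (;;) {
            fd_set rd;
            FD_ZERO(&rd);
            FD_SET(doorbell_fd, &rd);
            select(doorbell_fd + 1, &rd, NULL, NULL, NULL);  /* doorbell event  */

            struct packet *p;
            while ((p = receive_packet(q)) != NULL) {        /* asynchronous RX */
                struct packet_array *reply = handle_payload(p);
                if (reply != NULL)
                    send_packet(q, reply);   /* must match a filter on q        */
                packet_done(p);              /* return buffer ownership to VNIC */
            }
        }
    }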

On top of this native interface, Arrakis provides POSIX-compatible sockets. This compatibility layer allows Arrakis to support unmodified Linux applications. However, we show that performance gains can be achieved by using the asynchronous native interface.

3.5 Storage Data Plane Interface

The low-level storage API provides a set of commands to asynchronously read, write, and flush hardware caches at any offset and of arbitrary size (subject to block granularity) in a VSA, via a command queue in the associated VSIC. To do so, the caller provides an array of virtual memory ranges (address and size) in RAM to be read/written, the VSA identifier, queue number, and matching array of ranges (offset and size) within the VSA. The API implementation makes sure to enqueue the corresponding commands to the VSIC, coalescing and reordering commands if this makes sense to the underlying media. I/O completion events are reported using doorbells. On top of this, a POSIX-compliant file system is provided.

We have also designed a library of persistent data structures, TenaciousD, to take advantage of low-latency storage devices. Data structures have the benefit that their storage characteristics can be tailored to their usage model, for improved efficiency, as opposed to the simple read/write interface provided by filesystems. Their drawback is a lack of backwards-compatibility with the POSIX API. Our design goals for persistent data structures are that (1) operations are immediately persistent, (2) the structure is robust versus crash failures, and (3) operations have minimal latency.

We have designed a persistent log according to these goals and modified Redis to use it (cf. §4.4). The persistent log incorporates metadata directly in the data structure, which allows for early allocation and eliminates any generic metadata book-keeping service, such as a journal. Metadata is allocated once a new log entry is requested in RAM and can thus be persisted while the application fills the entry. The in-memory layout of the log is as stored, eliminating any marshaling that applications typically do to represent the durable version of their data. Furthermore, the log is optimized for linear access and efficiently prefetches entries upon read.

The log API includes operations to open and close a log, create log entries (for metadata allocation), append them to the log (for persistence), iterate through the log (for reading), and trim the log from the beginning. The API is asynchronous: an append operation returns immediately, and a callback can be registered to be called once it is persistent. This allows us to mask remaining write latencies, e.g., by allowing Redis to optimistically prepare a network response to the client while the log entry is persisted.

Log entries are allocated in multiples of the storage layer's block size and contain a header that denotes the true (byte-granularity) size of the entry and points to the offset of the next entry in a VSA. This allows entries to be written directly from memory, without additional marshaling. At the end of each entry is a marker that is used to determine whether an entry was fully written (empty VSA space is always zero). By issuing appropriate cache flush commands to the storage layer, the log ensures that markers are written after the rest of the entry (cf. [15]).

The log itself is identified by a header at the beginning of the VSA that contains an identifier, a version number, the number of log entries, the block size of the storage device, and an offset to the end of the log within the VSA. The log repairs a corrupted or outdated header lazily in the background upon opening, by looking for additional, complete entries from the purported end of the log.

3.6 Implementation

The Arrakis operating system is based upon a fork of the Barrelfish [10] multicore OS code base [1]. We added 33,786 lines of code to the Barrelfish code base in order to implement Arrakis. Barrelfish lends itself well to our approach, as it already provides a library OS. We could have also chosen to base Arrakis on the Xen [9] hypervisor or the Intel Data Plane Development Kit (DPDK) [28] running on Linux; both provide user-level access to the network interface via hardware virtualization. However, implementing a library OS from scratch on top of a monolithic OS would have been more time consuming than extending the Barrelfish library OS.

We extended Barrelfish with support for SR-IOV, which required modifying the existing PCI device manager to recognize and handle SR-IOV extended PCI capabilities. We implemented a physical function driver for the Intel 82599 10G Ethernet Adapter [27] that can initialize and manage a number of virtual functions. We also implemented a virtual function driver for the 82599, including support for extended Message Signaled Interrupts (MSI-X), which are used to deliver per-VNIC doorbell events to applications. Finally, we implemented drivers for the Intel IOMMU [30] and the Intel RS3 family of RAID controllers [29]. In addition—to support our benchmark applications—we added several POSIX APIs that were not implemented in the Barrelfish code base, such as POSIX threads, many functions of the POSIX sockets API, and the epoll interface found in Linux to allow scalable polling of a large number of file descriptors. Barrelfish already supports standalone user-mode device drivers, akin to those found in microkernels. We created shared library versions of the drivers, which we link to each application.
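Returning to the storage interface of §3.5, appending a record through the TenaciousD persistent log might look roughly as follows. The C names (log_open, log_entry_alloc, log_append, and so on) and types are assumptions chosen to mirror the operations listed there, not the library's actual prototypes:

    /* Hypothetical use of the TenaciousD log API (names are illustrative). */
    #include <stddef.h>
    #include <string.h>

    struct td_log;                                    /* opaque log handle      */
    struct td_entry;                                  /* opaque log entry       */
    struct td_log *log_open(const char *vsa_name);
    struct td_entry *log_entry_alloc(struct td_log *l, size_t size); /* metadata */
    void *log_entry_data(struct td_entry *e);         /* in-memory == on-disk   */
    void log_append(struct td_log *l, struct td_entry *e,
                    void (*persisted)(void *arg), void *arg);   /* asynchronous */
    void send_prepared_response(void *conn);          /* application callback   */

    static void on_durable(void *arg)
    {
        /* entry is persistent: release the prepared response to the client */
        send_prepared_response(arg);
    }

    void append_record(struct td_log *l, const void *rec, size_t len, void *conn)
    {
        struct td_entry *e = log_entry_alloc(l, len);  /* allocate entry early  */
        memcpy(log_entry_data(e), rec, len);           /* fill entry in place   */
        log_append(l, e, on_durable, conn);            /* returns immediately   */
    }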

We have developed our own user-level network stack, Arranet. Arranet is a shared library that interfaces directly with the virtual function device driver and provides the POSIX sockets API and Arrakis's native API to the application. Arranet is based in part on the low-level packet processing code of the lwIP network stack [38]. It has identical capabilities to lwIP, but supports hardware offload of layer 3 and 4 checksum operations and does not require any synchronization points or serialization of packet operations. We have also developed our own storage API layer, as described in §3.5, and our library of persistent data structures, TenaciousD.

3.7 Limitations and Future Work

Due to the limited filtering support of the 82599 NIC, our implementation uses a different MAC address for each VNIC, which we use to direct flows to applications and then do more fine-grain filtering in software, within applications. The availability of more general-purpose filters would eliminate this software overhead.

Arranet does not yet support interrupts from the NIC, and all network packet processing is performed by polling the network interface. This can have adverse effects on packet processing latency if network packets arrive when the network-handling application is not currently active. In such cases, Arranet does not notice the arrival and does not allow the application to act immediately. In many settings, server applications receive a continuous stream of requests, using interrupts only as a fallback. Our benchmark applications have this flavor, and so their results are not impacted by this limitation. Nevertheless, support for network interrupts is important future work.

Our implementation of the virtual function driver does not currently support the "transmit descriptor head write-back" feature of the 82599, which reduces the number of PCI bus transactions necessary for transmit operations. We expect to see a 5% network performance improvement from adding this support.

The RS3 RAID controller we used in our experiments does not support SR-IOV or VSAs. Hence, we use its physical function, which provides one hardware queue, and we map a VSA to each logical disk provided by the controller. We still use the IOMMU for protected access to application virtual memory, but the controller does not protect access to logical disks based on capabilities.

Our experience with the 82599 suggests that hardware I/O virtualization incurs negligible performance overhead versus direct access to the physical function. We expect this to be similar for storage controllers.

4 Evaluation

We evaluate Arrakis on three cloud application workloads: a typical, read-heavy load pattern observed in many large deployments of the memcached distributed object caching system, a write-heavy load pattern to the Redis persistent NoSQL store, and a workload consisting of a large number of individual client HTTP requests made to a farm of web servers via an HTTP load balancer and an IP-layer middlebox. We also examine the system under maximum load in a series of microbenchmarks and analyze performance crosstalk among multiple networked applications. Using these experiments, we seek to answer the following questions:

• What are the major contributors to performance overhead in Arrakis and how do they compare to those of Linux (presented in §2)?

• Does Arrakis provide better latency and throughput for real-world cloud applications? How does the throughput scale with the number of CPU cores for these workloads?

• Can Arrakis retain the benefits of user-level application execution and kernel enforcement, while providing high-performance packet-level network I/O?

• What additional performance gains are possible by departing from the POSIX interface?

We compare the performance of the following OS configurations: Linux kernel version 3.8 (Ubuntu version 13.04), Arrakis using the POSIX interface (Arrakis/P), and Arrakis using its native interface (Arrakis/N).

We tuned Linux network performance by installing the latest ixgbe device driver version 3.17.3 and disabling receive side scaling (RSS) when applications execute on only one processor. RSS spreads packets over several NIC receive queues, but incurs needless coherence overhead on a single core. The changes yield a throughput improvement of 10% over non-tuned Linux. We use the kernel-shipped MegaRAID driver version 6.600.18.00-rc1.

Linux uses a number of performance-enhancing features of the network hardware, which Arrakis does not currently support. Among these features is the use of direct processor cache access by the NIC, TCP and UDP segmentation offload, large receive offload, and network packet header splitting. All of these features can be implemented in Arrakis; thus, our performance comparison is weighted in favor of Linux.

4.1 Server-side Packet Processing Performance

We load the UDP echo benchmark from §2 on the server and use all other machines in the cluster as load generators. The load generators generate 1,024-byte UDP packets at a fixed rate and record the rate at which their echoes arrive. Each experiment exposes the server to maximum load for 20 seconds.

As Table 1 shows, compared to Linux, Arrakis eliminates two system calls, software demultiplexing overhead, socket buffer locks, and security checks. In Arrakis/N, we additionally eliminate two socket buffer copies.

Figure 4: Average UDP echo throughput for packets with 1024 byte payload. The top y-axis value shows theoretical maximum throughput on the 10G network. Error bars show min/max measured over 5 repeats of the experiment.

Figure 5: Average memcached transaction throughput and scalability. The top y-axis value shows the maximum throughput achievable in this experiment at 10Gb/s.

Arrakis/P incurs a total server-side overhead of 1.44 µs, 57% less than Linux. Arrakis/N is able to reduce this overhead to 0.38 µs.

Figure 4 shows the average throughput attained by each system. Arrakis/P achieves 2.3x the throughput of Linux. By departing from POSIX, Arrakis/N achieves 3.4x the throughput of Linux. Smaller payload sizes showed proportional improvement in throughput on both systems. To gauge how close Arrakis comes to the maximum possible throughput, we embedded a minimal echo server directly into the NIC device driver, eliminating any remaining API overhead and achieving 85% of the theoretical line rate (1.17M pps).

4.2 Memcached Key-Value Store

Memcached is an in-memory key-value store used by many cloud applications. It incurs a processing overhead of 2–3 µs for an average object fetch request, comparable to the overhead of OS kernel network processing.

We benchmark memcached version 1.4.15 by sending it requests at a constant rate via its binary UDP protocol, using a tool similar to the popular memslap [2] benchmark. We configure a workload pattern of 90% fetch and 10% store requests on a pre-generated range of 128 different keys of a fixed size of 64 bytes and a value size of 1,024 bytes, in line with real cloud deployments [7].

To measure network stack scalability for multiple cores, we vary the number of memcached server processes. Each server process executes independently on its own port number, such that measurements are not impacted by scalability bottlenecks in memcached itself, and we distribute load equally among the available memcached instances. On Linux, memcached processes share the kernel-level network stack. On Arrakis, each process obtains its own VNIC with an independent set of packet queues, each controlled by an independent instance of Arranet.

Figure 5 shows that memcached on Arrakis/P achieves 1.7x the throughput of Linux on one core, and attains near line-rate at 4 CPU cores. The slightly lower throughput on all 6 cores is due to contention with Barrelfish system management processes [10]. By contrast, Linux throughput nearly plateaus beyond two cores.

The throughput of a single memcached instance using threads instead of processes shows no noticeable difference to the multi-process scenario. This is not surprising, as memcached is optimized to scale well.

To conclude, the separation of network stack and application in Linux provides only limited information about the application's packet processing and poses difficulty assigning threads to CPU cores with packet data hot in their cache. The resulting cache misses and socket lock contention are responsible for much of the Linux overhead. In Arrakis, the application is in control of the whole packet processing flow: assignment of packets to packet queues, packet queues to cores, and finally the scheduling of its own threads on these cores. The network stack thus does not need to acquire any locks, and packet data is always available in the right processor cache.

Memcached is also an excellent example of the communication endpoint abstraction: we can create hardware filters to allow packet reception and transmission only between the memcached server and a designated list of client machines that are part of the cloud application. In the Linux case, we have to filter connections in the application.

4.3 Arrakis Native Interface Case Study

As a case study, we modified memcached to make use of Arrakis/N. In total, 74 lines of code were changed, with 11 pertaining to the receive side and 63 to the send side. On the receive side, the changes involve eliminating memcached's receive buffer and working directly with pointers to packet buffers provided by Arranet, as well as returning completed buffers to Arranet. The changes amount to an average performance increase of 9% over Arrakis/P. On the send side, changes include allocating a number of send buffers to allow buffering of responses until fully sent by the NIC, which now must be done within memcached itself. They also involve the addition of reference counts to hash table entries and send buffers to determine when it is safe to reuse buffers and hash table entries that might otherwise still be processed by the NIC.
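The send-side bookkeeping described above amounts to roughly the following reference-counting pattern (an illustrative sketch only, not memcached's actual code; the item/sendbuf types and the transmit-completion hook are assumptions):

    /* Sketch of buffer and hash-entry reuse tracking for zero-copy sends. */
    #include <stdatomic.h>

    struct item {                      /* hash table entry                       */
        atomic_int refs;               /* >0 while the NIC may still read it     */
        /* key, value, ... */
    };

    struct sendbuf {
        atomic_int refs;
        struct item *it;               /* response references the stored item    */
    };

    void item_free(struct item *it);
    void sendbuf_free(struct sendbuf *b);

    static void sendbuf_put(struct sendbuf *b)
    {
        if (atomic_fetch_sub(&b->refs, 1) == 1) {      /* last reference gone    */
            if (atomic_fetch_sub(&b->it->refs, 1) == 1)
                item_free(b->it);      /* safe: NIC no longer reads the entry    */
            sendbuf_free(b);           /* buffer may now be reused               */
        }
    }

    /* called from the transmit-completion path once the NIC has sent the buffer */
    void on_tx_complete(struct sendbuf *b)
    {
        sendbuf_put(b);
    }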

Figure 6: Average Redis transaction throughput.

Figure 7: Average HTTP transaction throughput and scalability of haproxy.
improves by another 1.6x, nearing Linux read throughput. Haproxy relies on cookies, which it inserts into the We conclude that the combination of data-plane net- HTTP stream to remember connection assignments to work and storage stacks can yield vast benefits in latency backend web servers under possible client re-connects. and throughput for both read and write-heavy workloads. This requires it to investigate the HTTP stream at least With a view to upcoming hardware improvements, these once for each new client request. Linux provides an opti- benefits will increase further. The tight integration of stor- mization called TCP splicing that allows applications to age and data structure in TenaciousD allows for a number forward traffic between two sockets without user-space of latency-saving techniques that eliminate marshaling involvement. This reduces the overhead of kernel cross- overhead, book-keeping of journals for filesystem meta- ings when connections are long-lived. We enable haproxy data, and can offset storage allocation overhead. Together, to use this feature on Linux when it decides that this is

11 We implemented a simple user-level load balancing 250 Linux middlebox, using raw IP sockets [5]. Just like haproxy, 200 Arrakis/P the middlebox balances an incoming TCP workload to a 150 set of back-end servers. Unlike haproxy, it is operating completely transparently to and without deep knowledge 100 of the higher layer protocols. It simply rewrites source and 50 destination IP addresses and TCP port numbers contained 0 in the packet headers to forward HTTP connections. It 1 2 4 Throughput [K transactions / s] uses a hash table to remember existing connection as- Number of CPU cores signments and understands when a TCP connection is Figure 8: Average HTTP transaction throughput and scalability active. Responses by the back-end web servers are inter- of the load balancing middlebox. Top y-axis value = 10Gb/s. cepted yet again and forwarded back to the corresponding clients. This is enough to provide the same load balanc- beneficial. ing capabilities as in the haproxy experiment. We repeat Finally, haproxy contains a feature known as “spec- the experiment from §4.5, replacing haproxy with our ulative epoll” (SEPOLL), which uses knowledge about middlebox. typical socket operation flows within the Linux kernel The simpler nature of the middlebox can be seen in to avoid calls to the epoll interface and optimize perfor- the throughput results in Figure 8. Both Linux and Ar- mance. Since the Arranet implementation differs from rakis perform better. Because the middlebox performs that of the Linux kernel network stack, we were not able less application-level work than haproxy, performance to use this interface on Arrakis, but speculate that this factors are largely due to OS-level network packet pro- feature could be ported to Arrakis to yield similar per- cessing and Arrakis’ benefit is thus more prominent, at formance benefits. To show the effect of the SEPOLL 2.6x that of Linux. We also see an interesting effect: the feature, we repeat the Linux benchmark both with and Linux implementation does not scale at all in this config- without it and show both results. uration. The reason for this are the raw IP sockets, which In Figure 7, we can see that Arrakis outperforms Linux carry no connection information. Without an indication in both regular and SEPOLL configurations, by a factor of of which connections to steer to which socket, each mid- 2.2x and 2x, respectively. Both systems show equivalent dlebox instance has to look at each incoming packet to scalability curves. The slightly less perfect scalability is determine whether it should handle it. This added over- due to the additional overhead induced by TCP connec- head outweighs any performance gained via parallelism. tion handling (SYN, ACK and FIN packets) that is not In Arrakis, we can configure the hardware filters to steer included in the figure. Note that Arrakis’s performance on packets based on packet header information and thus scale 6 CPUs is affected by background activity on Barrelfish. until we quickly hit the NIC throughput limit at 2 cores. Unfortunately, we do not own a multi-core machine large We conclude that Arrakis allows us to retain the safety, enough to investigate where we hit a scalability limit. abstraction, and management benefits of software devel- To conclude, connection oriented workloads require a opment at user-level, while vastly improving the perfor- higher number of system calls for setup (accept and mance of low level packet operations. Filters provide a setsockopt) and teardown (close). 
Figure 8: Average HTTP transaction throughput and scalability of the load balancing middlebox (throughput in K transactions/s on 1, 2, and 4 CPU cores; Linux vs. Arrakis/P). Top y-axis value = 10Gb/s.

The simpler nature of the middlebox can be seen in the throughput results in Figure 8. Both Linux and Arrakis perform better than in the haproxy experiment. Because the middlebox performs less application-level work than haproxy, performance is dominated by OS-level network packet processing, and Arrakis's benefit is thus more prominent, at 2.6x the throughput of Linux. We also see an interesting effect: the Linux implementation does not scale at all in this configuration. The reason is that raw IP sockets carry no connection information. Without an indication of which connections to steer to which socket, each middlebox instance has to look at each incoming packet to determine whether it should handle it. This added overhead outweighs any performance gained via parallelism. In Arrakis, we can configure the hardware filters to steer packets based on packet header information and thus scale, until we quickly hit the NIC throughput limit at 2 cores.

We conclude that Arrakis allows us to retain the safety, abstraction, and management benefits of software development at user level, while vastly improving the performance of low-level packet operations. Filters provide a versatile interface for steering packet workloads based on arbitrary information stored in packet headers, so that multi-core parallelism can be leveraged effectively, regardless of protocol specifics.

4.7 Performance Isolation

We show that QoS limits can be enforced in Arrakis by simulating a simple multi-tenant scenario with 5 memcached instances pinned to distinct cores, to minimize processor crosstalk. One tenant has an SLA that allows it to send up to 100Mb/s. The other tenants are not limited. We use rate specifiers in Arrakis to set the transmit rate limit of the VNIC of the limited process. On Linux, we use queuing disciplines [26] (specifically, HTB [18]) to rate limit the source port of the equivalent process.

We repeat the experiment from §4.2, plotting the throughput achieved by each memcached instance, shown in Figure 9. The bottom-most process (barely visible) is rate-limited to 100Mb/s in the experiment shown on the right-hand side of the figure. All runs remained within the error bars shown in Figure 5. When rate-limiting, a bit of the total throughput is lost for both OSes because clients keep sending packets at the same high rate; these packets consume network bandwidth even when they are later dropped due to the rate limit.

Figure 9: Memcached transaction throughput over 5 instances (colors), with and without rate limiting (throughput in K transactions/s; Arrakis/P and Linux, each with no limit and with a 100Mbit/s limit).

We conclude that it is possible to provide the same kind of QoS enforcement—in this case, rate limiting—in Arrakis as in Linux. Thus, we are able to retain the protection and policing benefits of user-level application execution, while providing improved network performance.

5 Related Work

SPIN [12] and Exokernel [23] reduced shared kernel components to allow each application to have customized operating system management. Nemesis [13] reduces shared components to provide more performance isolation for multimedia applications. Arrakis completely eliminates kernel-level I/O processing, while providing the benefits of customized OS I/O stacks and performance isolation.

We have previously proposed the idea of using hardware virtualization support to eliminate the OS from fast-path I/O operations [43]. In this respect, we follow on from previous work on Dune [11], which used nested paging to provide support for user-level control over virtual memory, and Exitless IPIs [24], which presented a technique to demultiplex hardware interrupts between virtual machines without mediation from the virtual machine monitor.

Netmap [45] implements fast packet-level I/O by doing DMAs directly from user-space packet processors. Sends and receives still require system calls, as the OS needs to do permission checks on every operation. Throughput is achieved at the expense of latency, by batching large numbers of reads and writes. Arrakis, on the other hand, focuses on low-latency I/O. Concurrently with our work, mTCP uses Intel's DPDK interface to implement a scalable user-level TCP [32]; mTCP focuses on scalable network stack design, while our focus is on the operating system API for general client-server applications. We expect the performance of Arranet and mTCP to be similar. OpenOnload [46] is a hybrid user- and kernel-level network stack. It is completely binary-compatible with existing Linux applications; to support this, it has to keep a significant amount of socket state in the kernel and supports only a traditional socket API. Arrakis, in contrast, allows applications to access the network hardware directly and does not impose API constraints.

Similarly, recent work has focused on reducing the overheads that traditional filesystems and block device drivers impose on persistent memory (PM). DFS [33] and PMFS [21] are filesystems designed for these devices. DFS relies on the flash storage layer for functionality traditionally implemented in the OS, such as block allocation. PMFS exploits the byte-addressability of PM, avoiding the block layer. Both DFS and PMFS are implemented as kernel-level filesystems, exposing POSIX interfaces. They focus on optimizing filesystem and device driver design for specific technologies, while Arrakis investigates how to allow applications fast, customized device access.

Moneta-D [14] is a hardware and software platform for fast, user-level I/O to solid-state devices. The hardware and operating system cooperate to track permissions on hardware extents, while a user-space driver communicates with the device through a virtual interface. Applications interact with the system through a traditional file system. Moneta-D is optimized for large files, since each open operation requires communication with the OS to check permissions; Arrakis does not have this issue, since applications have complete control over their VSAs. Aerie [49] proposes an architecture in which multiple processes communicate with a trusted user-space filesystem service to modify file metadata and perform lock operations, while directly accessing the hardware for reads and data-only writes. Arrakis provides more flexibility than Aerie, since storage solutions can be integrated tightly with applications rather than provided in a shared service, allowing for the development of higher-level abstractions, such as persistent data structures.

6 Conclusion

In this paper, we described and evaluated Arrakis, a new operating system designed to remove the kernel from the I/O data path without compromising process isolation. Unlike a traditional operating system, which mediates all I/O operations to enforce process isolation and resource limits, Arrakis uses device hardware to deliver I/O directly to a customized user-level library. The Arrakis kernel operates in the control plane, configuring the hardware to limit application misbehavior.

To demonstrate the practicality of our approach, we have implemented Arrakis on commercially available network and storage hardware and used it to benchmark several typical server workloads. We are able to show that protection and high performance are not contradictory: end-to-end client read and write latency to the Redis persistent NoSQL store is 2-5x lower and write throughput 9x higher on Arrakis than on a well-tuned Linux implementation.

References

[1] http://www.barrelfish.org/.
[2] http://www.libmemcached.org/.
[3] http://haproxy.1wt.eu.
[4] Scaling in the Linux networking stack. https://www.kernel.org/doc/Documentation/networking/scaling.txt.
[5] Linux IPv4 raw sockets, May 2012. http://man7.org/linux/man-pages/man7/raw.7.html.
[6] D. Abramson. Intel virtualization technology for directed I/O. Intel Technology Journal, 10(3):179–192, 2006.
[7] B. Atikoglu, Y. Xu, E. Frachtenberg, S. Jiang, and M. Paleczny. Workload analysis of a large-scale key-value store. In SIGMETRICS, 2012.
[8] G. Banga, P. Druschel, and J. C. Mogul. Resource containers: A new facility for resource management in server systems. In OSDI, 1999.
[9] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield. Xen and the art of virtualization. In SOSP, 2003.
[10] A. Baumann, P. Barham, P.-E. Dagand, T. Harris, R. Isaacs, S. Peter, T. Roscoe, A. Schüpbach, and A. Singhania. The multikernel: A new OS architecture for scalable multicore systems. In SOSP, 2009.
[11] A. Belay, A. Bittau, A. Mashtizadeh, D. Terei, D. Mazières, and C. Kozyrakis. Dune: Safe user-level access to privileged CPU features. In OSDI, 2012.
[12] B. N. Bershad, S. Savage, P. Pardyak, E. G. Sirer, M. E. Fiuczynski, D. Becker, C. Chambers, and S. Eggers. Extensibility, safety and performance in the SPIN operating system. In SOSP, 1995.
[13] R. Black, P. T. Barham, A. Donnelly, and N. Stratford. Protocol implementation in a vertically structured operating system. In LCN, 1997.
[14] A. M. Caulfield, T. I. Mollov, L. A. Eisner, A. De, J. Coburn, and S. Swanson. Providing safe, user space access to fast, solid state disks. In ASPLOS, 2012.
[15] V. Chidambaram, T. S. Pillai, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. Optimistic crash consistency. In SOSP, 2013.
[16] Citrusbyte. Redis. http://redis.io/.
[17] Compaq Computer Corp., Intel Corporation, and Microsoft Corporation. Virtual Interface Architecture Specification, version 1.0 edition, December 1997.
[18] M. Devera. HTB Linux queuing discipline manual – User Guide, May 2002. http://luxik.cdi.cz/~devik/qos/htb/userg.pdf.
[19] A. Dragojević, D. Narayanan, M. Castro, and O. Hodson. FaRM: Fast remote memory. In NSDI, 2014.
[20] P. Druschel, L. Peterson, and B. Davie. Experiences with a high-speed network adaptor: A software perspective. In SIGCOMM, 1994.
[21] S. R. Dulloor, S. Kumar, A. Keshavamurthy, P. Lantz, D. Reddy, R. Sankaran, and J. Jackson. System software for persistent memory. In EuroSys, 2014.
[22] Fusion-IO. ioDrive2 and ioDrive2 Duo Multi Level Cell, 2014. Product Datasheet. http://www.fusionio.com/load/-media-/2rezss/docsLibrary/FIO_DS_ioDrive2.pdf.
[23] G. R. Ganger, D. R. Engler, M. F. Kaashoek, H. M. Briceño, R. Hunt, and T. Pinckney. Fast and flexible application-level networking on exokernel systems. TOCS, 20(1):49–83, Feb 2002.
[24] A. Gordon, N. Amit, N. Har'El, M. Ben-Yehuda, A. Landau, A. Schuster, and D. Tsafrir. ELI: Bare-metal performance for I/O virtualization. In ASPLOS, 2012.
[25] S. Han, S. Marshall, B.-G. Chun, and S. Ratnasamy. MegaPipe: A new programming interface for scalable network I/O. In OSDI, 2012.
[26] B. Hubert. Linux advanced routing & traffic control HOWTO. http://www.lartc.org/howto/.
[27] Intel Corporation. Intel 82599 10 GbE Controller Datasheet, December 2010. Revision 2.6. http://www.intel.com/content/dam/www/public/us/en/documents/datasheets/82599-10-gbe-controller-datasheet.pdf.

[28] Intel Corporation. Intel Data Plane Development Kit (Intel DPDK) Programmer's Guide, Aug 2013. Reference Number: 326003-003.
[29] Intel Corporation. Intel RAID Controllers RS3DC080 and RS3DC040, Aug 2013. Product Brief. http://www.intel.com/content/dam/www/public/us/en/documents/product-briefs/raid-controller-rs3dc-brief.pdf.
[30] Intel Corporation. Intel virtualization technology for directed I/O architecture specification. Technical Report Order Number: D51397-006, Intel Corporation, Sep 2013.
[31] Intel Corporation. NVM Express, revision 1.1a edition, Sep 2013. http://www.nvmexpress.org/wp-content/uploads/NVM-Express-1_1a.pdf.
[32] E. Jeong, S. Woo, M. Jamshed, H. Jeong, S. Ihm, D. Han, and K. Park. mTCP: A highly scalable user-level TCP stack for multicore systems. In NSDI, 2014.
[33] W. K. Josephson, L. A. Bongo, K. Li, and D. Flynn. DFS: A file system for virtualized flash storage. Trans. Storage, 6(3):14:1–14:25, Sep 2010.
[34] P. Kutch. PCI-SIG SR-IOV primer: An introduction to SR-IOV technology. Intel application note, 321211-002, Jan 2011.
[35] I. M. Leslie, D. McAuley, R. Black, T. Roscoe, P. Barham, D. Evers, R. Fairbairns, and E. Hyden. The design and implementation of an operating system to support distributed multimedia applications. IEEE J. Sel. A. Commun., 14(7):1280–1297, Sep 1996.
[36] LSI Corporation. LSISAS2308 PCI Express to 8-Port 6Gb/s SAS/SATA Controller, Feb 2010. Product Brief. http://www.lsi.com/downloads/Public/SAS%20ICs/LSI_PB_SAS2308.pdf.
[37] LSI Corporation. LSISAS3008 PCI Express to 8-Port 12Gb/s SAS/SATA Controller, Feb 2014. Product Brief. http://www.lsi.com/downloads/Public/SAS%20ICs/LSI_PB_SAS3008.pdf.
[38] lwIP. http://savannah.nongnu.org/projects/lwip/.
[39] D. Mosberger and L. L. Peterson. Making paths explicit in the Scout operating system. In OSDI, 1996.
[40] J. Ouyang, S. Lin, S. Jiang, Z. Hou, Y. Wang, and Y. Wang. SDF: Software-defined flash for web-scale internet storage systems. In ASPLOS, 2014.
[41] V. S. Pai, P. Druschel, and W. Zwaenepoel. IO-Lite: A unified I/O buffering and caching system. In OSDI, 1999.
[42] A. Pesterev, J. Strauss, N. Zeldovich, and R. T. Morris. Improving network connection locality on multicore systems. In EuroSys, 2012.
[43] S. Peter and T. Anderson. Arrakis: A case for the end of the empire. In HotOS, 2013.
[44] RDMA Consortium. Architectural specifications for RDMA over TCP/IP. http://www.rdmaconsortium.org/.
[45] L. Rizzo. Netmap: A novel framework for fast packet I/O. In USENIX ATC, 2012.
[46] Solarflare Communications, Inc. OpenOnload. http://www.openonload.org/.
[47] Solarflare Communications, Inc. Solarflare SFN5122F Dual-Port 10GbE Enterprise Server Adapter, 2010.
[48] A. Trivedi, P. Stuedi, B. Metzler, R. Pletka, B. G. Fitch, and T. R. Gross. Unified high-performance I/O: One stack to rule them all. In HotOS, 2013.
[49] H. Volos, S. Nalli, S. Panneerselvam, V. Varadarajan, P. Saxena, and M. M. Swift. Aerie: Flexible file-system interfaces to storage-class memory. In EuroSys, 2014.
[50] T. von Eicken, A. Basu, V. Buch, and W. Vogels. U-Net: A user-level network interface for parallel and distributed computing. In SOSP, 1995.