LOS: A High Performance and Compatible User-level Network Operating System

Yukai Huang, Jinkun Geng, Du Lin, Bin Wang, Junfeng Li, Ruilin Ling, and Dan Li
Tsinghua University

ABSTRACT
With the ever growing speed of Ethernet NICs and more and more CPU cores on commodity X86 servers, the processing capability of the network stack in the kernel has become the bottleneck. Recently there is a trend of moving the network stack up to user level and bypassing the kernel. However, most of these stacks require changing the APIs or modifying the source code of applications, and hence are difficult to support legacy applications. In this work, we design and develop LOS, a user-level network operating system that not only gains high throughput and low latency by kernel-bypass technologies but also achieves compatibility with legacy applications. We successfully run Nginx and NetPIPE on top of LOS without touching the source code, and the experimental results show that LOS achieves significant throughput and latency gains compared with the Linux kernel.

CCS CONCEPTS
• Networks → Network architectures; Network design principles; Network components; Network performance evaluation;

KEYWORDS
user level, network operating system, compatibility, high throughput, low latency

ACM Reference format:
Yukai Huang, Jinkun Geng, Du Lin, Bin Wang, Junfeng Li, Ruilin Ling, and Dan Li. 2017. LOS: A High Performance and Compatible User-level Network Operating System. In Proceedings of APNET '17, Hong Kong, China, August 3–4, 2017, 7 pages. https://doi.org/10.1145/3106989.3106997

1 INTRODUCTION
The hardware capability of the commodity X86 platform is ever increasing. Servers with tens of CPU cores and 100Gb Ethernet NICs are widely commercialized. However, there is a considerable mismatch between the high-performance hardware of servers and the processing capability of the network stack in the Linux kernel. Therefore, in recent years a variety of high-performance network stacks have been designed and released [2, 7, 9, 10, 13, 14, 17–19]. Most of these stacks employ the idea of kernel-bypass networking and move the network stack up to user level by leveraging I/O libraries such as DPDK or netmap. Benefiting from technologies such as polling-mode network I/O, huge page memory, zero-copy, etc., these stacks show significantly better performance than the Linux kernel.

However, the network stacks mentioned above are usually not fully compatible with legacy applications. Some stacks change the network APIs (so as to provide new functionalities such as zero-copy), which requires applications to rewrite the interfaces, such as IX [2]. Some stacks retain similar APIs as the Linux kernel, but run the stack as part of the main application, such as mTCP [9]. Either way, the source code of applications has to be modified. This not only requires much effort to port legacy applications, but also requires developers to learn the new API semantics. What's more, most of these stacks manage a single application well, but fail to support multiple applications.

In this work, we seek to design a kernel-bypass network stack which not only gains high performance, but is also fully compatible with legacy applications. Here compatibility means that not only do applications not need to modify their code to run on top of the user-level stack, but also multiple independent applications can share the stack. Hence, the stack is actually a user-level network operating system. By doing so, we want to decouple the network-related operations from the Linux kernel and leave them to the user-level network operating system. Designing such a network operating system imposes several technical challenges, such as automatically taking over the network-related APIs, distinguishing the access of two FD (file descriptor) spaces, a new event notification scheme, detecting application crashes and recycling the resources, etc. We employ a variety of technologies to address these challenges. Besides, in order to keep the network operating system performing well, we not only use similar mechanisms as in existing works, such as polling-mode I/O, zero-copy in the stack and FD space separation, but also introduce new methods. We use a lockless shared-queue for IPC between the network operating system and applications, and run the user-level operating system and applications on different CPU cores, both of which show significant performance gains.

We have developed the single-core version of the user-level network operating system, which is called LOS. We successfully run Nginx and NetPIPE on top of LOS without touching the source code, and the experimental results show that LOS achieves 17∼41% higher throughput and 20∼55% lower latency compared with the Linux kernel.

The remainder of this paper is organized as follows. Section 2 introduces the related work. Section 3 describes the design details. Section 4 describes the implementation, and Section 5 presents the evaluation results. Finally, Section 6 concludes this paper.

2 RELATED WORK
High-performance packet processing libraries. So far, much effort has been devoted to the development of high-performance packet processing libraries [6, 10, 15, 17]. These libraries aim to accelerate network I/O by completely decoupling the network data path from the kernel, thus eliminating the overheads imposed by the heavy-weight kernel network stack. Our work in this paper is built on top of DPDK [6], which has very recently become an official open-source project of the Linux Foundation. DPDK provides fast packet processing techniques such as a polling mode driver (PMD) instead of per-packet interrupts, zero-copy packet sending and receiving in user space, pre-allocated rings and memory pools to avoid per-packet memory allocation, huge-page memory to reduce TLB misses, etc. Similar techniques can also be found in other libraries.

User-space TCP stacks. The main problem with Intel DPDK and other libraries is that they directly manipulate raw packets, rather than network connections and flows. In order to leverage these libraries to handle higher-level protocols such as TCP and UDP, developers have to implement a TCP/IP stack from scratch for their own applications, which not only requires a thorough understanding of TCP/IP protocols, but also considerably limits code reuse and maintenance. Therefore, it is reasonable to develop a general-purpose, high-performance user-level TCP/IP stack to benefit different applications. The representative works in the literature include mTCP [9] based on the Packet I/O Engine (PSIO) [10], IX [2] and Seastar [19] based on DPDK, UTCP [5] and Sandstorm [14] based on netmap [17], etc. However, all these works require updating applications to some extent, which raises the barrier for wide deployment. Moreover, they usually support a single application well rather than multiple applications.

Kernel stack optimizations. On another front, the literature never stops improving the network performance of the Linux kernel. The SO_REUSEPORT option [11] has been included in the Linux kernel since version 3.9 to make multi-threaded servers scale better on multicore systems. The basic idea is to reduce the contention on shared data structures (e.g. the socket data structure and the accept queue) by allowing multiple sockets on the same host to bind to the same port. For passive connections, Affinity-Accept [16] and MegaPipe [18] guarantee connection locality and improve load balance between cores by providing a listening socket with multiple per-core accept queues instead of a shared one. In addition, MegaPipe also proposed system call batching to amortize system call overhead and lightweight sockets to avoid VFS overhead. In a recent work, Fastsocket [21] achieves connection locality and scalability for both active and passive connections, and avoids lock contention by partitioning the global Listen Table/Established Table into local ones. However, the aforementioned optimizations fail to fully address the overheads incurred in the kernel stack, such as the overhead of interrupt handling on receiving packets, which can greatly affect the overall performance under high network load.

3 SYSTEM DESIGN
We design LOS, a user-level network operating system which not only gains higher throughput and lower latency by exploiting kernel-bypass technologies, but also achieves full compatibility with legacy applications. We expect that through our work more applications and more users can benefit from kernel-bypass networking to improve their performance.

3.1 Performance Improvement
We first describe our design on how to improve the network performance of LOS.

Typical optimization methods as in existing works. In the scenario of high-speed networks, it has been proved that polling-based I/O gains distinctive advantages over interrupt-based mechanisms [4]. Considering that DPDK is a widely-used polling-mode I/O library in the industry, in this work we build LOS on top of DPDK. DPDK also enables zero copy in the user-level network stack. We further decouple the socket FD assignment from the heavy VFS structure in the Linux operating system. Therefore, when allocating FDs for new sockets, we do not need to visit the VFS and thus many overheads are eliminated. This is especially helpful when serving highly concurrent short-lived connections.

The above technologies, namely polling-based I/O, zero-copy in the stack, and separating socket FDs from the VFS, are already widely used in existing works [2, 9, 10, 14, 17–19]. Since they have demonstrated considerable performance gains compared with traditional approaches, we introduce them in LOS. Besides, in LOS we adopt the following mechanisms to further improve the performance.

Lockless shared-queue for IPC between LOS and applications. LOS runs as a user-mode process on top of the Linux operating system, and thus shared-memory IPC is a natural choice to exchange information between LOS and applications. Compared with traditional network stacks in the kernel, it avoids the high overhead of system calls. However, since we design LOS as a general network operating system to support multiple applications, we need an efficient way to manage the communication between LOS and applications. If we build a single message queue to serve multiple applications, we have to use lock/unlock operations to deal with resource competition, which causes significant overhead. Otherwise, if we build different message queues for different applications, for each message the stack has to traverse all the message queues to find the right one; this cost is also very high, as shown by previous works [13].

Fortunately, DPDK's rte_ring allows multiple producers (the applications) to directly access a shared queue via CAS operations [6], which enables lockless access to a shared message queue and considerably improves the IPC performance. We build multiple message queues between LOS and applications, some for commands and some for socket information, but each queue is shared by all the applications on top of LOS. The details are described in Section 3.3.
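To make the lockless shared-queue design concrete, the sketch below shows how such a multi-producer command queue could be built with DPDK's rte_ring API. This is a minimal illustration rather than LOS source code: the ring name, size, and helper functions are assumptions, and it presumes that rte_eal_init() has already run in both the LOS (primary) and application (secondary) processes so that they share the same hugepage memory.

```c
#include <rte_ring.h>
#include <rte_errno.h>
#include <rte_lcore.h>

#define CMD_RING_NAME "los_cmd_ring"      /* hypothetical ring name */

/* LOS side (primary process): create the shared ring once at startup.
 * RING_F_SC_DEQ keeps dequeue single-consumer (only LOS drains the ring),
 * while enqueue stays multi-producer and lockless via CAS, so any number
 * of application processes can insert commands concurrently. */
struct rte_ring *los_create_cmd_ring(void)
{
    return rte_ring_create(CMD_RING_NAME, 4096 /* power of two */,
                           rte_socket_id(), RING_F_SC_DEQ);
}

/* Application side (secondary process): find the ring by name and enqueue
 * a command descriptor that lives in the shared hugepage memory. */
int app_post_command(void *cmd)
{
    struct rte_ring *r = rte_ring_lookup(CMD_RING_NAME);
    if (r == NULL)
        return -rte_errno;
    return rte_ring_mp_enqueue(r, cmd);   /* lockless multi-producer enqueue */
}

/* LOS polling loop: drain pending commands without taking any lock. */
void los_poll_commands(struct rte_ring *r, void (*handle)(void *))
{
    void *cmd;
    while (rte_ring_sc_dequeue(r, &cmd) == 0)
        handle(cmd);
}
```

A separate ring of the same form can be created per message type (commands, accept-ready notifications, etc.), matching the multiple shared queues described above.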
Running LOS and applications on separate cores. One feature of polling-based I/O is that the CPU core is always fully utilized by polling. If we put LOS and applications on the same core, the application performance will be adversely affected. Although one may expect that putting the network stack and the application on the same core reduces cache misses, our experiment shows that the negative impact caused by polling is much higher: when we run LOS and an Nginx worker together on one CPU core, the http response rate of the Nginx worker degrades disastrously compared with running them on separate cores.

Therefore, in our design we run LOS and applications on separate cores. LOS runs as a single process on a dedicated core, so it also avoids the cost of context switching. The separation is very easy to achieve in practice. Since the LOS core is fully loaded by polling, the Linux operating system will automatically assign the application processes to other cores. Actually, this kind of core separation is also suggested by previous works [13]. However, in this paper we validate this approach from a different motivation, and accordingly design a network operating system to decouple the network functionalities from the kernel.

3.2 Compatibility Design
The compatibility of LOS considers two aspects. First, applications do not need to change their APIs or modify their source code. Hence, we need a way to automatically redirect the network-related APIs to LOS, while retaining the other APIs to the Linux kernel. Given that LOS and the Linux kernel have two FD spaces without the awareness of applications, we also need to differentiate applications' access to the two FD spaces. Second, as a network operating system, LOS should be able to manage the status of applications. Blocking APIs and event notification should be realized without the participation of the Linux kernel. To support multiple applications, LOS needs to not only allocate network-related resources for new applications, but also recycle these resources for exited applications. When applications gracefully exit, it is easy for LOS to recycle the resources. However, for application crashes or abnormal exits, LOS requires a way to detect them.

Taking over network-related APIs. In order to support legacy applications without touching their source code, LOS needs to take over network-related APIs, while leaving other APIs to the Linux kernel. To take over network-related APIs, LOS provides these APIs with the same function signatures and compiles them into a dynamic library (*.so in Linux). During startup, we link the dynamic library to every application on LOS by setting the environment variable $LD_PRELOAD. In this way, we are able to intercept those APIs in glibc, including almost all network-related APIs. However, if applications call APIs other than networking ones, we have to switch the functions back to the Linux kernel. Thanks to the dlsym [12] tool, which obtains the addresses of symbols in glibc, LOS is able to capture the original glibc handlers and conduct the switching. In essence, the LD_PRELOAD mechanism and the dlsym tool are used purely for the redirection of function calls from applications, and there are no extra system calls involved. Therefore, it is expected that the trivial overheads caused by LD_PRELOAD and dlsym will not damage the high performance of LOS.
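As an illustration of this interception path, the fragment below sketches how a preloaded library can serve socket() from LOS while forwarding other calls to glibc through dlsym(RTLD_NEXT, ...). The los_socket()/los_read() helpers are hypothetical stand-ins for LOS internals, and the FD threshold refers to the FD-space split explained under "Distinguishing the access of two FD spaces" below.

```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <sys/socket.h>
#include <unistd.h>

#define LOS_FD_THRESHOLD ((1 << 30) - 1)   /* FD-space split, see below */

/* Hypothetical entry points into the LOS frontend. */
extern int los_socket(int domain, int type, int protocol);
extern ssize_t los_read(int fd, void *buf, size_t count);

/* socket() is always network-related, so the preloaded symbol simply
 * hands the call to LOS, which allocates an FD from its own space. */
int socket(int domain, int type, int protocol)
{
    return los_socket(domain, type, protocol);
}

/* read() may touch either FD space: LOS sockets are served by LOS,
 * everything else is forwarded to the original glibc implementation
 * obtained once via dlsym(RTLD_NEXT, ...). */
ssize_t read(int fd, void *buf, size_t count)
{
    static ssize_t (*real_read)(int, void *, size_t);

    if (fd > LOS_FD_THRESHOLD)                 /* LOS socket FD */
        return los_read(fd, buf, count);
    if (real_read == NULL)
        real_read = (ssize_t (*)(int, void *, size_t))
                        dlsym(RTLD_NEXT, "read");
    return real_read(fd, buf, count);
}
```

Compiled into a shared object (e.g. a hypothetical liblos.so) and activated by exporting LD_PRELOAD before the unmodified application starts, this redirection happens entirely in user space.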
Distinguishing the access of two FD spaces. Some APIs, such as read and write, can access both FD spaces in LOS and the Linux kernel. We need a way to distinguish the two cases and direct the API to the correct operating system. Note that although the FD space in LOS is for network sockets and that in the Linux kernel is for other file systems, the two FD spaces appear as a single FD space from the application's view. We note that the Linux kernel allocates FD numbers in a bottom-up way, i.e., from 0, 1, and upwards. Therefore, in LOS we allocate FD numbers in a top-down way, i.e., from 2^31 - 1 downwards. Given that the FD numbers in the two spaces do not collide with each other, we can use a simple way to distinguish the two FD spaces and the system will work well. In the current implementation, we split the space into two halves and set the FD number 2^30 - 1 as the threshold. If the FD number is less than the threshold, the API is directed to the Linux kernel; if the FD number is higher than the threshold, the API is directed to LOS. Considering that FD numbers are also recycled when an application exits, we believe each half of the FD space is large enough to hold all the network-related or unrelated FDs in practice.

Event notification. To support blocking APIs, such as send, recv, accept, connect, and epoll_wait, we have several options. As a possible solution, we can translate these blocking APIs into a non-blocking form and thus work in a polling mode. However, since we might run multiple applications on one CPU core, this mechanism may cause interference and unfairness between applications. Therefore, in LOS we still seek to maintain the blocking semantics for blocking APIs. Compared with the other APIs, epoll_wait is special, since it also relates to the FD space in the Linux kernel. So we first discuss how to realize epoll_wait in LOS.

When an application calls epoll_wait, it should be notified when there is either a network-related event or a network-unrelated event. However, LOS only cares about the network-related events. The network-unrelated events generated by the Linux kernel may come without the awareness of LOS. To take both kinds of events into consideration, our approach leverages the epoll_wait syscall to realize blocking semantics and aggregates all the network-related events as one kernel event for the epoll_wait syscall. The working process is as follows. When an application calls epoll_create, a kernel epoll (i.e., the epoll implemented by the Linux kernel) is created first; then a FIFO (named pipe) is built between LOS and the application, and the FIFO's FD is registered to the kernel epoll. Via epoll_ctl, non-network FDs are directly registered to the kernel epoll, while network FDs are stored by LOS in its own data structure. When the application calls epoll_wait, it is blocked by the epoll_wait syscall. The blockage will surely be broken when a network-unrelated event (i.e. kernel event) comes. When there is a network-related event in LOS, a kernel event is generated for the FIFO (LOS writes to the FIFO FD, thus changing its state so that the blockage caused by the epoll_wait syscall is broken), and the epoll_wait syscall returns. Then the application is also unblocked. In this way, we integrate the epoll_wait operation of the two FD spaces together, without touching the applications. The cost is affordable: when the network load is high, a single FIFO event causes the application to process a large number of network events, and hence the number of invoked epoll_wait syscalls is limited.

For the other blocking APIs, including send, recv, accept, and connect, we can simulate the process as follows: LOS lets the application go to sleep when there is no event, and an efficient event notification mechanism is required to notify the blocked application. We have tried semaphores and spin locks, but the performance is bad, probably due to their high overheads. An alternative choice is reliable signals (note that we need reliable signals instead of common signals, since a signal loss would make LOS crash). However, such a mechanism is not feasible due to the well-known thundering herd problem [3, 8]: considering multi-threaded applications, which create multiple threads to conduct network processing where each thread may call blocking APIs to sleep, when a signal is sent to wake up one of the threads, the other threads will also be woken up because they belong to the same process and the signal is addressed by the process number (pid). Although the thundering herd problem can be mitigated with some recent techniques, we believe that the cost is still higher than our current approach. Currently we use the same approach as above, namely, building a FIFO between LOS and the application and using the epoll_wait syscall to realize the blocking semantics.
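The following sketch illustrates the FIFO-based wakeup described above. The helper los_fetch_events() and the assumption that the FIFO is opened O_NONBLOCK are ours; the actual LOS code is not given in the paper.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical: copies pending LOS socket events from the shared-memory
 * ready queues into evs (at most max entries) and returns the count. */
extern int los_fetch_events(struct epoll_event *evs, int max);

/* LOS side: one byte on the FIFO makes its FD readable and unblocks the
 * application's epoll_wait syscall. */
void los_notify(int fifo_write_fd)
{
    char one = 1;
    (void)write(fifo_write_fd, &one, 1);
}

/* Application side (inside the preloaded library): the wrapped epoll_wait
 * blocks on the kernel epoll, which holds the non-network FDs plus the
 * FIFO read end, then merges kernel events with LOS events. */
int wrapped_epoll_wait(int kernel_epfd, int fifo_read_fd,
                       struct epoll_event *events, int maxevents, int timeout)
{
    int n = epoll_wait(kernel_epfd, events, maxevents, timeout);
    int out = 0, fifo_fired = 0;

    if (n <= 0)
        return n;
    for (int i = 0; i < n; i++) {
        if (events[i].data.fd != fifo_read_fd) {
            events[out++] = events[i];        /* genuine kernel event */
        } else {
            char buf[64];
            while (read(fifo_read_fd, buf, sizeof(buf)) > 0)
                ;                             /* drain the wakeup bytes */
            fifo_fired = 1;
        }
    }
    if (fifo_fired)                           /* one FIFO event stands for  */
        out += los_fetch_events(events + out, /* many pending LOS events    */
                                maxevents - out);
    return out;
}
```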
Resource recycling and fault detection. When an application gracefully exits, LOS will recycle the allocated resources, such as socket FDs, FIFO FDs, ring buffers, etc. But if an application crashes or exits in an abnormal way, LOS should be able to detect the status of the application and recycle its resources. Otherwise, the resources might be used up by this kind of abnormal application. Our design works as follows. During the startup of LOS, a kernel epoll is created and a UNIX domain socket is registered to it. When an application is launched, it establishes a connection with the UNIX domain socket, and LOS generates another UNIX domain socket (via the accept API) to handle the connection. The fresh UNIX domain socket generated by LOS is also registered to the kernel epoll to help monitor the connection. Since LOS runs in a polling mode, it periodically checks the states of these UNIX domain sockets after a certain number of loops by calling the epoll_wait syscall in a non-blocking mode. If all the applications work well, epoll_wait returns 0. A non-zero return value indicates that there is some fault with an application, since some FDs in the kernel epoll have changed their states (for an application which has exited abnormally, the corresponding UNIX domain socket is closed by the operating system, thus its state changes). In this way, LOS can detect the application fault and accordingly recycle the resources. One thing to notice is that the fault detection mechanism does not hamper the compatibility of LOS, because the mechanism is completely implemented inside LOS (the UNIX domain sockets are invisible to applications, and LOS is responsible for creating and managing them), and legacy applications do not need to call any extra APIs beyond their own intentions.

3.3 Architecture Design

[Figure 1: Architecture of LOS. The Light Protocol Stack (Protocol Process Module and Backend Module) and the application-side Frontend Module (POSIX API, epoll/FIFO support and program logic) communicate through shared hugepage memory, which holds the Command Queue, the RX/TX queues, and the per-operation ready queues (Accept/Close/TX/Recv Ready Queues), with DPDK underneath.]

Figure 1 shows an overview of the architecture design of LOS. Generally, LOS can be divided into three main components: the Frontend Module (FM), the Backend Module (BM), and the Protocol Process Module (PPM).

PPM is the core component, which undertakes the major processing in networking. We have implemented the processing logic of the middle three layers of the TCP/IP five-layer model (i.e. the data link layer, network layer and transport layer). Data packets are captured from the NIC with DPDK in a polling mode, then decomposed in PPM and transmitted to BM. Correspondingly, the packets to be transmitted are delivered from BM to PPM for packet encapsulation and final transmission to the NIC.

BM maintains all the message queues, which are shared by BM and FM for communication. As presented above, each message queue is shared by all the applications running on top of LOS, instead of serving a single application. The Command Queue is responsible for the communication of command operations (particularly, all the API calls issued by the applications). Other message queues, such as the Accept Ready Queue, Close Ready Queue, TX Ready Queue and RX Ready Queue, maintain the data information for the commands. Due to the page length limit, we omit the details of these queues in this paper.

FM serves as a middleware to bind applications with LOS. FM provides the POSIX API for applications and masks the original APIs provided by the kernel stack. When applications call these APIs, FM encapsulates the command information into a specific data structure and inserts it into the Command Queue. BM keeps polling the Command Queue and processes the commands sequentially. There is a field in the command data structure which maintains the return value of the operation. The return value field is set by BM and can also be accessed by FM to make applications aware of the status of the operation.
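As an illustration of this exchange between FM and BM, a command descriptor placed on the Command Queue could look like the sketch below. The field names and the set of opcodes are assumptions for illustration; the actual LOS layout is not published in the paper.

```c
#include <stdint.h>

enum los_cmd_op {                /* one opcode per intercepted API call */
    LOS_CMD_SOCKET,
    LOS_CMD_BIND,
    LOS_CMD_LISTEN,
    LOS_CMD_ACCEPT,
    LOS_CMD_CONNECT,
    LOS_CMD_SEND,
    LOS_CMD_RECV,
    LOS_CMD_CLOSE,
    LOS_CMD_EPOLL_CTL,
};

/* Lives in shared hugepage memory, so BM can fill in the return value in
 * place and FM can observe it without any system call. */
struct los_cmd {
    uint32_t op;                 /* enum los_cmd_op */
    int32_t  app_pid;            /* issuing application, used for recycling */
    int32_t  fd;                 /* LOS socket FD the call operates on */
    uint64_t arg[4];             /* call-specific arguments / buffer offsets */
    volatile int32_t done;       /* set by BM once the command is processed */
    volatile int64_t retval;     /* return value written by BM, read by FM */
};
```

Under this sketch, FM fills in op, fd and arg, enqueues a pointer to the descriptor on the command ring shown in Section 3.1, and waits (spinning briefly or blocking via the FIFO mechanism above) until done is set before handing retval back to the application.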

4 IMPLEMENTATION
So far we have implemented the single-core version of LOS, with a complete TCP/UDP/IP stack. It runs stably on Ubuntu 16.04 with kernel version 4.4.0 and DPDK 17.02. Most network APIs have been completed. Excluding the code from the DPDK I/O library and the code for the protocol stack (most of the protocol stack code is copied from the Linux kernel), our current implementation of LOS contains 16702 lines of C code. We will continue to work toward supporting a complete set of APIs.

5 EXPERIMENT AND EVALUATION
To validate the compatibility and performance of LOS, we conduct experiments to answer the following questions.

Compatibility. Can applications work properly on LOS without any modification of their source code? Based on the network APIs we have currently implemented, in the experiments we can successfully run two applications, Nginx and NetPIPE [7], on top of LOS without any code modification.

High throughput. To what extent can LOS improve an application's throughput, especially in the scenario of short-lived connections? As prior work has shown [21], the kernel stack gains poor performance when handling a heavy workload of short-lived connections. Therefore, we compare LOS and Linux kernel 4.4.0 under the scenario of short-lived connections in Section 5.1.

Low latency. To what extent can LOS reduce an application's end-to-end latency, when confronted with both high workload and low workload? End-to-end latency under high workload is considered to be an important metric of user experience. In Sections 5.1 and 5.2, we demonstrate the capability of LOS to reduce latency under high load and low load, respectively.

5.1 Nginx
Nginx is one of the most widely deployed web servers and reverse proxy servers. In this experiment, we focus on measuring its throughput and latency as a web server under a heavy workload of short-lived connections.

The evaluation environment consists of three machines connected by one switch. Two servers, equipped with two 16-core Intel Xeon E5-2683 v4 CPUs, 128GB memory and a dual-port Intel 82599ES 10G NIC, work as clients. The remaining one, with one 2-core Intel Core i7-4790K 4.00GHz CPU, 32GB memory and a dual-port Intel 82599ES 10G NIC, works as the server. The operating systems of the three machines are all Ubuntu 16.04 with kernel version 4.4.0. We have confirmed that the client side is not the bottleneck, so the experiments explore the upper-bound performance of the network stack in the server. The version of Nginx used in our experiment is 1.9.5.

To create such a high workload, we set the keepalive_timeout option to 0 in nginx.conf and use wrk [1] to generate http requests on the client machines, each of which fetches a simple 64B html file. We use RPS (i.e. requests per second) as the measurement of network performance.

Since LOS currently only has a single-core version, we use one core of the 2-core server machine to run LOS and the other to run Nginx in single-process mode. For the Linux kernel stack, we have designed two different schemes, which we name scheme1 and scheme2: scheme1 disables master_process, whereas scheme2 enables master_process with two worker processes. For scheme2, the worker_cpu_affinity option is set so that each worker process is assigned to a different core. Meanwhile, we have disabled the reuseport and accept_mutex options (the second scheme is the common configuration for Nginx on multicore machines). We conduct experiments to compare the RPS of the three schemes. The results are illustrated in Figure 2 and Figure 3.
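For reference, an nginx.conf fragment matching scheme2 might look like the following; the CPU masks and paths are assumptions based on the description above rather than the exact configuration used in the experiment.

```nginx
# Sketch of the scheme2 configuration described above (values illustrative).
master_process on;
worker_processes 2;
worker_cpu_affinity 01 10;     # pin each worker process to its own core

events {
    accept_mutex off;          # accept_mutex disabled, as in the experiment
}

http {
    keepalive_timeout 0;       # force short-lived connections
    server {
        listen 80;             # the reuseport listen option is not used
        location / {
            root /var/www/html;  # directory holding the 64B html file (path assumed)
        }
    }
}
```

Scheme1 differs only in setting master_process off, so a single Nginx process serves all requests; on the client side, wrk drives the load, with its thread count, connection count and test duration chosen so that the clients are never the bottleneck.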

[Figure 2: Comparison of RPS (requests/s) for scheme1, scheme2 and LOS]

[Figure 3: Comparison of average latency (ms) for scheme1, scheme2 and LOS]

For throughput, Figure 2 shows that LOS outperforms scheme1 by 41% and scheme2 by over 17%. As for latency, Figure 3 shows that LOS reduces the average latency by 63% compared to scheme1, and by 49% compared to scheme2. Furthermore, for the worst 10% of cases, LOS completes requests with a latency of 4.59ms, which is only slightly higher than its average latency of 4.52ms, whereas the latency of scheme1 and scheme2 rises sharply to 186.43ms and 188.74ms respectively. This phenomenon suggests that LOS not only reduces latency but also maintains good stability and predictability.

5.2 NetPIPE
NetPIPE [7] is a commonly used network performance evaluator. For different message sizes, NetPIPE performs the standard ping-pong test repeatedly and calculates the average latency [20]. The TCP module in NetPIPE version 3.7.2 is used in our experiment. We run the NetPIPE transmitter on the kernel network stack, and the NetPIPE receiver on both LOS and the kernel stack. The transmitter machine and the receiver machine are directly connected.

The transmitter machine is installed with Ubuntu 14.04 (kernel version 3.16), equipped with an Intel i7-4790K CPU and an Intel 82599ES 10G NIC, while the operating system on the receiver machine is Ubuntu 16.04 (kernel version 4.8), with an Intel i7-7700 CPU and an Intel X250 10G NIC. We have measured the latency with typical message sizes for common applications, and the results are shown in Figure 4.

From Figure 4, we see that the measured latency is reduced by more than 20% for all the message sizes. When the message size is 256 bytes, the latency reduction is more than 55%. This result indicates that, even under low network load, LOS significantly reduces the processing delay of each single packet, which is important for delay-sensitive applications.

[Figure 4: Comparison of Latencies]

6 CONCLUSION AND FUTURE WORK
In this paper, we propose LOS, a high performance and compatible user-level network operating system. To achieve high performance, LOS not only uses proven technologies from existing works, but also exploits novel methods such as lockless shared-queue IPC between the network stack and applications, and running LOS and applications on separate cores. To be compatible with legacy applications, LOS takes over the network-related APIs of applications and distinguishes the access of two FD spaces (network sockets and the kernel file system). Moreover, LOS employs an efficient event notification mechanism between the network stack and applications to support blocking APIs, and resource recycling for faulted applications.

Currently we have developed the single-core version of LOS and support most network APIs. It runs Nginx and NetPIPE well without touching their source code, and achieves higher throughput and lower latency than the Linux kernel stack. We are now developing the multi-core version, and in the future we will conduct a series of comparisons with existing works. Meanwhile, we will also strengthen the compatibility to support more applications.

ACKNOWLEDGMENTS
This work was supported by the National Key Basic Research Program of China (973 program) under Grant 2014CB347800, the National Key Research and Development Program of China under Grant 2016YFB1000200, and the NSFC funding under grants 61432002 and 61522205.

REFERENCES
[1] [n. d.]. wrk: Modern HTTP benchmarking tool. https://github.com/wg/wrk.
[2] Adam Belay, George Prekas, Ana Klimovic, Samuel Grossman, Christos Kozyrakis, and Edouard Bugnion. 2014. IX: A Protected Dataplane Operating System for High Throughput and Low Latency. In Proceedings of the USENIX OSDI 2014 Conference.
[3] Jonathan Corbet, Greg Kroah-Hartman, and Alessandro Rubini. 2005. Linux Device Drivers. O'Reilly, Sebastopol, CA.
[4] Denys Haryachyy. 2015. Ostinato and Intel DPDK 10G Data Rates Report on Intel CPU. PLVision. http://www.slideshare.net/plvision/10-g-packet-processing-with-dpdk.
[5] Michio Honda, Felipe Huici, Costin Raiciu, Joao Araujo, and Luigi Rizzo. 2014. Rekindling Network Protocol Innovation with User-level Stacks. SIGCOMM Comput. Commun. Rev. (2014).
[6] Intel. 2016. Intel DPDK: Data Plane Development Kit. http://dpdk.org.
[7] Roberto De Ioris. 2009. NetPIPE: A Network Protocol Independent Performance Evaluator. http://bitspjoule.org/netpipe/.
[8] Roberto De Ioris. 2016. Serializing accept(), AKA Thundering Herd, AKA the Zeeg Problem. http://uwsgi-docs.readthedocs.io/en/latest/articles/SerializingAccept.html.
[9] EunYoung Jeong, Shinae Wood, Muhammad Jamshed, Haewon Jeong, Sunghwan Ihm, Dongsu Han, and KyoungSoo Park. 2014. mTCP: a Highly Scalable User-level TCP Stack for Multicore Systems. In 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI 14).
[10] KAIST. 2012. Packet I/O Engine. http://shader.kaist.edu/packetshader/io_engine/.
[11] Michael Kerrisk. 2013. The SO_REUSEPORT socket option. https://lwn.net/Articles/542629/.
[12] Michael Kerrisk. 2017. Linux Programmer's Manual - dlsym. http://www.man7.org/linux/man-pages/man3/dlsym.3.html.
[13] Leah Shalev, Julian Satran, Eran Borovik, and Muli Ben-Yehuda. 2010. IsoStack: Highly Efficient Network Processing on Dedicated Cores. In Proceedings of the USENIX ATC 2010 Conference.
[14] Ilias Marinos, Robert N.M. Watson, and Mark Handley. 2014. Network Stack Specialization for Performance. In Proceedings of the 2014 ACM Conference on SIGCOMM.
[15] NTOP. 2017. PF_RING: High-speed packet capture, filtering and analysis. http://www.ntop.org/products/packet-capture/pf_ring/.
[16] Aleksey Pesterev, Jacob Strauss, Nickolai Zeldovich, and Robert T. Morris. 2012. Improving Network Connection Locality on Multicore Systems. In Proceedings of the 7th ACM European Conference on Computer Systems (EuroSys 12).
[17] Luigi Rizzo. 2012. netmap: A Novel Framework for Fast Packet I/O. In 2012 USENIX Annual Technical Conference (USENIX ATC 12).
[18] Sangjin Han, Scott Marshall, Byung-Gon Chun, and Sylvia Ratnasamy. 2012. MegaPipe: A New Programming Interface for Scalable Network I/O. In Proceedings of the USENIX OSDI 2012 Conference.
[19] ScyllaDB. 2017. Seastar. http://www.seastar-project.org.
[20] Quinn O. Snell, Armin R. Mikler, and John L. Gustafson. 1996. NetPIPE: A Network Protocol Independent Performance Evaluator. In IASTED International Conference on Intelligent Information Management and Systems.
[21] Xiaofeng Lin, Yu Chen, Xiaodong Li, Junjie Mao, Jiaquan He, Wei Xu, and Yuanchun Shi. 2016. Scalable Kernel TCP Design and Implementation for Short-Lived Connections. In Proceedings of the ACM ASPLOS 2016 Conference.