Scalable Kernel TCP Design and Implementation for Short-Lived Connections
Xiaofeng Lin (Sina Corporation, ZHIHU Corporation)   Yu Chen (Department of Computer Science and Technology, Tsinghua University)   Xiaodong Li (Sina Corporation, xiaodong2@staff.sina.com.cn)
Junjie Mao, Jiaquan He, Wei Xu, Yuanchun Shi (Department of Computer Science and Technology, Tsinghua University)

Abstract

With the rapid growth of network bandwidth, increases in CPU cores on a single machine, and application API models demanding more short-lived connections, a scalable TCP stack is performance-critical. Although many clean-slate designs have been proposed, production environments still call for a bottom-up parallel TCP stack design that is backward-compatible with existing applications.

We present Fastsocket, a BSD-socket-compatible and scalable kernel socket design, which achieves table-level connection partition in the TCP stack and guarantees connection locality for both passive and active connections. The Fastsocket architecture is a ground-up partition design, from NIC interrupts all the way up to applications, which naturally eliminates various lock contentions in the entire stack. Moreover, Fastsocket maintains the full functionality of the kernel TCP stack and a BSD-socket-compatible API, so applications need no modifications.

Our evaluations show that Fastsocket achieves a speedup of 20.4x on a 24-core machine under a workload of short-lived connections, outperforming state-of-the-art Linux kernel TCP implementations. When scaling up to 24 CPU cores, Fastsocket increases the throughput of Nginx and HAProxy by 267% and 621% respectively compared with the base Linux kernel. We also demonstrate that Fastsocket can achieve scalability and preserve the BSD socket API at the same time. Fastsocket is already deployed in the production environment of Sina WeiBo, serving 50 million daily active users and billions of requests per day.

Categories and Subject Descriptors  D.4.4 [Operating Systems]: Communications Management—Network communication

Keywords  TCP/IP; multicore system; operating system

1. Introduction

Recent years have witnessed a rapid growth of mobile devices and apps. These new mobile apps generate a large amount of traffic consisting of short-lived TCP connections [39]. HTTP is a typical source of short-lived TCP connections. For example, in Sina WeiBo, the typical request length of one heavily invoked HTTP interface is around 600 bytes and the corresponding response length is typically about 1200 bytes. Both the request and the response consume only a single IP packet. After the two-packet data transaction is done, the connection is closed, which closely matches the characteristics of a short-lived connection. In this situation, the speed of establishment and termination of TCP connections becomes crucial for server performance.
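To make the cost concrete, the following user-space sketch walks through the full server-side lifecycle of such a short-lived connection using the BSD socket API. It is an illustration only: the port, buffer sizes, and canned response are assumptions, not WeiBo code. With a one-packet request and a one-packet response, almost all of the per-connection work is in accept() and close(), i.e., TCP connection establishment/termination and socket FD setup/teardown inside the kernel.

/* Sketch of a short-lived-connection server loop (illustrative only).
 * Each connection carries one small request and one small response, so
 * per-connection cost is dominated by accept()/close(): TCP establishment
 * and termination plus socket FD setup and teardown in the kernel. */
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int listen_fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);                 /* hypothetical port */

    bind(listen_fd, (struct sockaddr *)&addr, sizeof(addr));
    listen(listen_fd, 1024);

    for (;;) {
        /* Establishment: a new TCB plus a new socket FD in VFS. */
        int conn_fd = accept(listen_fd, NULL, NULL);
        if (conn_fd < 0)
            continue;

        char req[2048], resp[1200];
        ssize_t n = read(conn_fd, req, sizeof(req));  /* ~600-byte request   */
        (void)n;                                      /* payload ignored here */
        memset(resp, 'x', sizeof(resp));
        write(conn_fd, resp, sizeof(resp));           /* ~1200-byte response */

        /* Termination: FIN exchange plus TCB and FD teardown. */
        close(conn_fd);
    }
}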
As the number of CPU cores in one machine increases, the scalability of the network stack plays a key role in the performance of network applications on multi-core platforms. For long-lived connections, the metadata management for new connections is not frequent enough to cause significant contention, so we do not observe scalability issues in the TCP stack in these cases. However, it is challenging to reach the same level of scalability for short-lived connections, since they involve frequent and expensive TCP connection establishment and termination, which results in severe contention on shared resources in TCP Control Block (TCB) management and the Virtual File System (VFS).

TCB Management  The TCP connection establishment and termination operations involve managing the global TCBs, which are represented as TCP sockets in the Linux kernel. All these sockets are managed in two global hash tables, known as the listen table and the established table. Since these two tables are shared system-wide, synchronization among the CPU cores is inevitable. In addition, the TCB of a connection is accessed and updated in two phases:

1. Incoming packets of the connection are received in interrupt context, through the network stack, on the CPU core that receives the NIC (Network Interface Controller) interrupt.

2. Further processing of the incoming packet payload in the user-level application and transmission of the outgoing packets are performed on the CPU core running the corresponding application.

Ideally, for a given connection, both phases are handled entirely on one CPU core, which maximizes CPU cache efficiency. We refer to this property as connection locality. Unfortunately, in the current Linux TCP stack, these two phases often run on different CPU cores, which seriously affects scalability.
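The contention described above can be seen in a simplified user-space model of the established table (a sketch of the idea only, not the actual Linux kernel data structures): with one system-wide table, every core that establishes or terminates a connection takes locks on shared buckets, whereas with one table per core, and packet steering that keeps each connection on its own core, the same operations never leave core-local state. The latter is the table-level partition direction that Fastsocket pursues.

/* Simplified model of TCB hash tables (illustrative, not kernel code). */
#include <pthread.h>
#include <stddef.h>

#define NR_BUCKETS 1024
#define NR_CORES   24

struct tcb {                       /* stand-in for a TCP control block */
    unsigned int hash;
    struct tcb *next;
};

struct tcb_table {
    struct tcb      *bucket[NR_BUCKETS];
    pthread_mutex_t  lock[NR_BUCKETS];
};

void table_init(struct tcb_table *tbl)
{
    for (int i = 0; i < NR_BUCKETS; i++) {
        tbl->bucket[i] = NULL;
        pthread_mutex_init(&tbl->lock[i], NULL);
    }
}

/* Shared design: one system-wide established table.  Cores handling
 * different connections still serialize on hot buckets and bounce the
 * lock cache lines between their private caches. */
struct tcb_table global_established;

void shared_insert(struct tcb *t)
{
    unsigned int b = t->hash % NR_BUCKETS;

    pthread_mutex_lock(&global_established.lock[b]);   /* cross-core contention */
    t->next = global_established.bucket[b];
    global_established.bucket[b] = t;
    pthread_mutex_unlock(&global_established.lock[b]);
}

/* Partitioned design: one table per core.  If every packet of a
 * connection is steered to the core that owns that connection
 * (connection locality), inserts, lookups and removals stay in the
 * local table and the locks are effectively uncontended. */
struct tcb_table per_core_established[NR_CORES];

void local_insert(int core, struct tcb *t)
{
    unsigned int b = t->hash % NR_BUCKETS;
    struct tcb_table *tbl = &per_core_established[core];

    pthread_mutex_lock(&tbl->lock[b]);                  /* core-local lock */
    t->next = tbl->bucket[b];
    tbl->bucket[b] = t;
    pthread_mutex_unlock(&tbl->lock[b]);
}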
VFS Abstraction  A socket is abstracted by VFS and exposed to user-level applications as a socket File Descriptor (FD). As a result, the performance of opening and closing a socket FD in VFS has a direct impact on the efficiency of TCP connection establishment and termination. However, there are severe synchronization overheads in processing VFS shared state, e.g., inode and dentry, which results in scalability bottlenecks.

Serious lock contention is observed on an 8-core production server running HAProxy as an HTTP load balancer for WeiBo to distribute client traffic to multiple servers. From profiling data collected by perf, we observed that spin locks consume 9% and 11% of total CPU cycles in TCB management and VFS respectively. Even worse, due to contention inside the network stack processing, there is a clear load imbalance across the CPU cores, even though packets are evenly distributed to these CPU cores by the NIC.

Production Environment Requirements  In a production environment, such as a data center, network server/proxy applications have three fundamental requirements:

1. Compatibility with TCP/IP-related RFCs: Different applications follow different TCP/IP-related RFCs. As a result, a TCP/IP implementation in the production environment also needs to support commonly accepted RFCs as far as possible.

2. Security: TCP may be attacked in a variety of ways (DDoS, connection hijacking, etc.). These attacks necessitate thorough security protection of TCP [13] in the OS kernel.

3. Resource Isolation and Sharing: The TCP/IP stack needs to support diverse existing NIC hardware resources, and should isolate or share data with other namespaces or sub-systems (disks, file systems, etc.).

Extensive research [12, 21, 22, 27, 30, 38] has focused on system support for enabling network-intensive applications to achieve performance close to the limits imposed by the hardware. Some work proposes a brand new I/O abstraction to make the TCP stack scalable [21]; however, applications have to be rewritten with the new API. Recent research projects [12, 22, 30] attempt to solve the TCP stack scalability problem using the modified embedded TCP stack lwIP [26] or custom-built user-level TCP stacks written from the ground up. Though these designs can improve the performance of the network stack, they do not fully replicate the kernel's advanced networking features (Firewall/Netfilter, TCP veto, IPsec, NAT and other defenses against security vulnerabilities, TCP optimizations for wireless networks, the cgroup/sendfile/splice mechanisms for resource isolation and sharing, etc.), which are heavily utilized by our performance-critical applications. Therefore, it currently remains an open challenge to completely solve the TCP stack scalability problem while keeping backward compatibility and full features to benefit existing network applications in the production environment.

In this paper, we present Fastsocket to address the aforementioned scalability and compatibility problems in a backward-compatible, incrementally deployable and maintainable way. This is achieved with three major changes: 1) partition the globally shared data structures, the listen table and the established table; 2) correctly steer incoming packets to achieve connection locality for any connection; and 3) provide a fast path for sockets in VFS to solve the scalability problems in VFS and wrap the aforementioned TCP designs to fully preserve the BSD socket API.

In the production environment, with Fastsocket, the contention in TCB management and VFS is eliminated and the effective capacity of the HAProxy server is increased by 53.5%. Our experimental evaluations show that in benchmarks of web servers and proxy servers handling short-lived connections, Fastsocket outperforms the base Linux kernel by 267% and 621% respectively.

This paper makes three key contributions:

1. We introduce a kernel network stack design which achieves table-level partition for TCB management and realizes connection locality for arbitrary connection types. Our partition scheme is complete and naturally solves all potential contentions along the network processing path, including the contentions we do not directly address, such as timer lock contention.

2. We demonstrate that the BSD socket API does not have to be an obstacle to network stack scalability for commodity hardware configurations. The evaluation using real-