Parallelizing Packet Processing in Container Overlay Networks
Jiaxin Lei∗ (Binghamton University), Manish Munikar∗ (The University of Texas at Arlington), Kun Suo∗ (Kennesaw State University), Hui Lu (Binghamton University), and Jia Rao (The University of Texas at Arlington)

∗Equal contribution.

Abstract

Container networking, which provides connectivity among containers on multiple hosts, is crucial to building and scaling container-based microservices. While overlay networks are widely adopted in production systems, they cause significant performance degradation in both throughput and latency compared to physical networks. This paper seeks to understand the bottlenecks of in-kernel networking when running container overlay networks. Through profiling and code analysis, we find that a prolonged data path, due to packet transformation in overlay networks, is the culprit of performance loss. Furthermore, existing scaling techniques in the Linux network stack are ineffective for parallelizing the prolonged data path of a single network flow.

We propose Falcon, a fast and balanced container networking approach that scales the packet processing pipeline in overlay networks. Falcon pipelines software interrupts associated with different network devices of a single flow on multiple cores, thereby preventing the serialized execution of excessive software interrupts from overloading a single core. Falcon further supports multiple network flows by effectively multiplexing and balancing software interrupts of different flows among available cores. We have developed a prototype of Falcon in Linux. Our evaluation with both micro-benchmarks and real-world applications demonstrates the effectiveness of Falcon, with significantly improved performance (by 300% for web serving) and reduced tail latency (by 53% for data caching).

ACM Reference Format:
Jiaxin Lei, Manish Munikar, Kun Suo, Hui Lu, and Jia Rao. 2021. Parallelizing Packet Processing in Container Overlay Networks. In Sixteenth European Conference on Computer Systems (EuroSys '21), April 26–28, 2021, Online, United Kingdom. ACM, New York, NY, USA, 16 pages. https://doi.org/10.1145/3447786.3456241

1 Introduction

Due to its high performance [38, 66], low overhead [36, 68], and widespread community support [53], container technology has increasingly been adopted in both private data centers and public clouds. A recent report from Datadog [1] reveals that customers quintupled the number of containers within their first nine months of container adoption. Google deploys containers in its cluster management and is reported to launch about 7,000 containers every second in its search service [11]. With containers, applications can be automatically and dynamically deployed across a cluster of physical or virtual machines (VMs) with orchestration tools such as Apache Mesos [2], Kubernetes [13], and Docker Swarm [6].

Container networks provide connectivity to distributed applications and are critical to building large-scale, container-based services. Overlay networks, e.g., Flannel [10], Weave [26], Calico [4], and Docker overlay [25], are widely adopted in container orchestrators [3, 6, 13]. Compared to other communication modes, overlay networks allow each container to have its own network namespace and private IP address, independent from the host network. In overlay networks, packets must be transformed from private IP addresses to public (host) IP addresses before transmission, and vice versa during reception. While network virtualization offers the flexibility to configure private networks without increasing the complexity of host network management, packet transformation imposes significant performance overhead. Compared to a physical network, container overlay networks can incur drastic throughput loss and suffer an order of magnitude longer tail latency [50, 68, 69, 74, 78].
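To make the cost of this packet transformation concrete, the sketch below shows what a VXLAN-based overlay (the example examined in the next paragraph) adds to every packet. The struct is an illustrative C rendering of the on-the-wire header defined in RFC 7348, not actual kernel code; the byte counts are for the common IPv4 case.

    #include <stdint.h>

    /* VXLAN header (RFC 7348): 8 bytes carried inside an outer
     * Ethernet/IP/UDP envelope. Illustrative layout, not kernel code. */
    struct vxlan_hdr {
        uint32_t flags_rsvd; /* 8 flag bits (I bit set when VNI is valid), 24 reserved */
        uint32_t vni_rsvd;   /* 24-bit VXLAN Network Identifier, 8 reserved bits */
    };

    /* Per-packet encapsulation overhead on transmission (~50 bytes, IPv4):
     *   outer Ethernet  14 B   host MAC addresses
     *   outer IPv4      20 B   host (public) IP addresses
     *   outer UDP        8 B   destination port 4789
     *   VXLAN header     8 B   VNI selects the overlay segment
     *   inner frame            the container's private-IP packet
     * Reception reverses the process (decapsulation) before the inner
     * packet enters the container's network namespace. */

Every one of these headers must be built, parsed, and validated in software on the data path, which is where the extra latency and CPU cost originate.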
The overhead of container overlay networks is largely due to a prolonged data path in packet processing. Overlay packets have to traverse the private overlay network stack and the host stack [78] for both packet transmission and reception. For instance, in a virtual extensible LAN (VXLAN) overlay, packets must go through a VXLAN device for IP transformation, i.e., adding or removing host network headers during transmission or reception; a virtual bridge for packet forwarding between the private and host stacks; and a virtual network device (veth) for gating a container's private network. The inclusion of multiple stages (devices) in the packet processing pipeline prolongs the critical path of a single network flow, which can only be processed on a single core.

[Figure 1. Illustration of container overlay networks. A packet from the sender container on Host 2 traverses its vNIC and the virtual bridge, is encapsulated at the VXLAN device, and leaves through the physical NIC (pNIC); on Host 1 it is decapsulated at the VXLAN device and forwarded through the bridge and vNIC to the receiver container. The path spans hardware (IRQ), kernel space (softirq), and user space, covering layer 2, layers 3 and 4, and layer 7.]

The existing mechanisms for parallelizing packet processing, such as Receive Packet Steering (RPS) [20], focus on distributing multiple flows (packets with different IPs or ports) onto separate cores and are therefore ineffective for accelerating a single flow. The prolonged data path inevitably adds delay to packet processing and causes latency spikes and significant throughput loss if computation overloads a core. To shorten the data path, the state of the art seeks to either eliminate packet transformation from the network stack [78] or offload the entire virtual switch and packet transformation to NIC hardware [19]. Though such software-bypassing or hardware-offloading networks achieve close-to-native performance, they undermine the flexibility of cloud management due to limited support and/or accessibility. For example, Slim [78] does not apply to connectionless protocols, while advanced hardware offloading is only available in high-end hardware [19].

This paper investigates how and to what extent the conventional network stack can be optimized for overlay networks. We seek to preserve the current design of overlay networks, i.e., constructing the overlay from the existing building blocks, such as virtual switches and virtual network devices, and realizing network virtualization through packet transformation. This helps to retain and support existing network and security policies and IT tools. Through comprehensive profiling and analysis, we identify previously unexploited parallelism within a single flow in overlay networks: overlay packets travel through multiple devices across the network stack, and the processing at each device is handled by a separate software interrupt (softirq). While the overhead of container overlay networks stems from excessive softirqs of one flow overloading a single core, these softirqs are executed asynchronously and their invocations can be interleaved. This discovery opens up new opportunities for parallelizing the softirq execution of a single flow across multiple cores.

We design and develop Falcon (fast and balanced container networking), a novel approach that parallelizes the data path of a single flow and balances the network processing pipelines of multiple flows in overlay networks. Falcon leverages multiple cores to process packets of a single flow at different network devices via a new hashing mechanism: it takes not only flow but also network device information into consideration, and can thus distinguish the packet processing stages associated with distinct network devices. Falcon uses in-kernel stage transition functions to move packets of a flow among multiple cores in sequence as they traverse overlay network devices, preserving the dependencies in the packet processing pipeline (i.e., no out-of-order delivery). Furthermore, to exploit parallelism within a heavyweight network device that overloads a single core, Falcon enables the softirqs of such a device to be further split across cores.
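The following sketch contrasts the two steering policies described above. It is a minimal illustration of the idea, not the authors' code: the helper names (flow_hash, rps_cpu, stage_cpu) and the mixing function are hypothetical, and real kernels compute the flow hash differently (e.g., a Toeplitz-style hash over the connection 4-tuple).

    #include <stdint.h>

    /* Toy flow key; real stacks hash the connection 4-tuple. */
    struct flow_key {
        uint32_t src_ip, dst_ip;
        uint16_t src_port, dst_port;
    };

    /* Simple 32-bit mixing step (hypothetical; for illustration only). */
    static uint32_t mix32(uint32_t h, uint32_t v)
    {
        h ^= v;
        h *= 0x9e3779b1u;       /* multiplicative mixing constant */
        return h ^ (h >> 16);
    }

    static uint32_t flow_hash(const struct flow_key *k)
    {
        uint32_t h = mix32(0, k->src_ip);
        h = mix32(h, k->dst_ip);
        return mix32(h, ((uint32_t)k->src_port << 16) | k->dst_port);
    }

    /* RPS-style steering: a flow hashes to one core, so every stage of
     * a single flow is serialized on that core. */
    static int rps_cpu(const struct flow_key *k, int nr_cpus)
    {
        return flow_hash(k) % nr_cpus;
    }

    /* Falcon-style steering (our reading of the design): mixing in a
     * per-device identifier lets the VXLAN, bridge, and veth stages of
     * the *same* flow map to different cores, while each (flow, device)
     * pair still maps to one fixed core, preserving per-stage ordering. */
    static int stage_cpu(const struct flow_key *k, uint32_t dev_id, int nr_cpus)
    {
        return mix32(flow_hash(k), dev_id) % nr_cpus;
    }

Because each (flow, device) pair still hashes to a fixed core, packets within a stage are never reordered; the in-kernel stage transition functions then hand each packet from one stage's core to the next as it crosses device boundaries.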
Though Falcon pipelines the processing stages of a packet on multiple cores, it does not require packet copying between these cores. Our experimental results show that the performance gain from parallelization significantly outweighs the cost of the lost locality. To summarize, this paper makes the following contributions:

• We perform a comprehensive study of the performance of container overlay networks and identify the main bottleneck to be the serialization of a large number of softirqs on a single core.
• We design and implement Falcon, which parallelizes the prolonged data path of a single flow in overlay networks. Unlike existing approaches that only parallelize softirqs at packet reception, Falcon allows softirqs to be parallelized at any stage of the processing pipeline.
• We evaluate the effectiveness of Falcon with both micro-benchmarks and real-world applications. Our results show that Falcon can significantly improve throughput (e.g., by up to 300% for web serving) and reduce latency (e.g., by up to 53% for data caching).

Road map: Section 2 provides background and motivating examples comparing a physical network with a container overlay network. Section 3 analyzes the causes of performance degradation in container networks. Section 4 presents the design of Falcon and Section 5 discusses its implementation. Section 6 shows the experimental results. Section 7 reviews related work and Section 8 concludes the paper.

2 Background and Motivation

In this section, we first describe the process of packet processing in the OS kernel. Then, we examine the performance