Tackling Parallelization Challenges of Kernel Network Stack for Container Overlay Networks
Jiaxin Lei*, Kun Suo+, Hui Lu*, and Jia Rao+
*State University of New York (SUNY) at Binghamton, +University of Texas at Arlington

Abstract

Overlay networks are the de facto networking technique for providing flexible, customized connectivity among distributed containers in the cloud. However, overlay networks also incur non-trivial overhead due to their complexity, resulting in significant network performance degradation of containers. In this paper, we perform a comprehensive empirical performance study of container overlay networks which identifies unrevealed, important parallelization bottlenecks of the kernel network stack that prevent container overlay networks from scaling. Our observations and root cause analysis cast light on optimizing the network stack of modern operating systems on multi-core systems to more efficiently support container overlay networks in light of high-speed network devices.

1 Introduction

As an alternative to virtual machine (VM) based virtualization, containers offer a lightweight, process-based virtualization method for managing, deploying and executing cloud applications. Lightweight containers lead to higher server consolidation density and lower operational cost in cloud data centers, making them widely adopted by industry: Google even claims that “everything at Google runs in containers” [4]. Further, containers have enabled a new cloud application architecture: services of a large-scale distributed application are packaged into separate containers and automatically and dynamically deployed across a cluster of physical or virtual machines with orchestration tools such as Apache Mesos [1], Kubernetes [12], and Docker Swarm [7].

Container overlay networks are the de facto networking technique for providing customized connectivity among these distributed containers. A variety of container overlay network approaches are available, such as Flannel [8], Weave [20], Calico [2] and Docker Overlay [6]. They are generally built upon a tunneling approach, which enables container traffic to travel across physical networks by encapsulating container packets with host headers (e.g., with the VxLAN protocol [19]). With this, containers belonging to the same virtual network can communicate in an isolated address space with their private IP addresses, while their packets are routed through “tunnels” using their hosts’ public IP addresses.

Constructing an overlay network on a container host can be achieved simply by stacking a pipeline of in-kernel network devices. For instance, for a VxLAN overlay, a virtual network device is created and assigned to a container’s network namespace, while a tunneling VxLAN network device is created for packet encapsulation/decapsulation. These two network devices are further connected via a virtual switch (e.g., Open vSwitch [15]). Such a container overlay network is also extensible: various network policies (e.g., isolation, rate limiting, and quality of service) can be easily added to either the virtual switch or the virtual network device of a container.
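To make this device pipeline concrete, the following sketch wires up a minimal VxLAN overlay for a single network namespace using the standard iproute2 tools, with a plain Linux bridge standing in for Open vSwitch. All device names, addresses, and the VNI are illustrative placeholders rather than values taken from the paper.

```python
#!/usr/bin/env python3
"""Minimal sketch of a VxLAN container overlay pipeline (requires root).

Assumptions (not from the paper): a plain Linux bridge replaces Open vSwitch,
the physical uplink is eth0, and all names and addresses are placeholders.
"""
import subprocess

def ip(*args):
    # Thin wrapper around the iproute2 "ip" command.
    subprocess.run(["ip", *args], check=True)

UPLINK = "eth0"             # physical NIC carrying the tunneled traffic
LOCAL_HOST_IP = "10.0.0.1"  # this host's underlay (public) address
VNI = "100"                 # VxLAN network identifier

# 1. A network namespace standing in for the container.
ip("netns", "add", "ns1")

# 2. A veth pair: one end inside the "container", one end on the host.
ip("link", "add", "veth-c", "type", "veth", "peer", "name", "veth-h")
ip("link", "set", "veth-c", "netns", "ns1")

# 3. A virtual switch (a Linux bridge here; Open vSwitch in many deployments).
ip("link", "add", "br0", "type", "bridge")
ip("link", "set", "veth-h", "master", "br0")

# 4. The VxLAN tunnel device that encapsulates/decapsulates container packets.
#    Real cross-host traffic additionally needs remote/FDB entries or a
#    multicast group so the tunnel knows its peers.
ip("link", "add", "vxlan100", "type", "vxlan", "id", VNI,
   "dstport", "4789", "local", LOCAL_HOST_IP, "dev", UPLINK)
ip("link", "set", "vxlan100", "master", "br0")

# 5. Bring everything up and give the container a private address.
for dev in ("br0", "veth-h", "vxlan100"):
    ip("link", "set", dev, "up")
ip("netns", "exec", "ns1", "ip", "addr", "add", "172.16.0.2/24", "dev", "veth-c")
ip("netns", "exec", "ns1", "ip", "link", "set", "veth-c", "up")
```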
Regardless of the above-mentioned advantages, container overlay networks incur additional, non-trivial overhead compared to the native host network (i.e., without overlays). Recent studies report that overlay networks achieve 50% less throughput than the native network and suffer much higher packet processing latency [28, 29]. The prolonged packet processing path in overlay networks can easily be identified as the main culprit. Indeed, in the VxLAN overlay example above, a packet traverses three different namespaces (i.e., container, overlay and host) and two kernel network stacks (container and host) on both the sending and receiving ends, leading to high per-packet processing cost and long end-to-end latency. However, our investigation reveals that the causes of the high overhead and low efficiency of container overlay networks are more complicated and multifaceted:

First, high-performance, high-speed physical network devices (e.g., 40 and 100 Gbps Ethernet) require the kernel to process each packet quickly (e.g., within 300 ns per packet on a 40 Gbps network link). However, as stated above, the prolonged packet path in container overlay networks slows down per-packet processing, as multiple network devices are involved. More critically, we observe that modern OSes only provide parallelization of packet processing at the per-flow level (instead of per-packet); thus, the maximum network throughput of a single container flow is limited by the processing capability of a single core (e.g., 6.4 Gbps for TCP in our case).

Further, the combination of multi-core CPUs and multi-queue network interface cards (NICs) allows packets of different flows to be routed to separate CPU cores for parallel processing. Unfortunately, container overlay networks are observed to scale poorly: the network throughput increases by only 4x with 6x the number of flows. In addition, under the same throughput (e.g., 40 Gbps with 80 flows), overlay networks consume much more CPU resources (e.g., 2~3 times). Our investigation finds that this severe scalability issue is largely due to the kernel's inefficient interplay among pipelined, asynchronous packet processing stages: an overlay packet traverses the contexts of one hardware interrupt, three software interrupts and the user-space process. With more flows, the hardware also becomes inefficient, with poor cache efficiency and high memory bandwidth consumption.

Last, research has long observed inefficiency in the kernel network stack for flows with small packet sizes. We observe that such inefficiency becomes more severe in container overlay networks, which achieve as low as 50% of the packet processing rate of the native host (e.g., for UDP packets). We find that, in addition to the prolonged packet processing path, the high interrupt request (IRQ) rate and the associated even higher software interrupt (softirq) rate (i.e., 3x the IRQ rate) impair overall system efficiency by frequently interrupting running processes, with increased context switch overhead.

In this paper, we perform a comprehensive empirical performance study of container overlay networks and identify the above-stated new, critical parallelization bottlenecks in the kernel network stack. We further deconstruct these bottlenecks to locate their root causes. We believe our observations and root cause analysis will cast light on optimizing the kernel network stack of modern operating systems to more efficiently support container overlay networks on high-speed network devices.

2 Background

Kernel Network Stack. Packet processing in modern OSes spans the NICs, kernel space, and user space. Taking packet reception as an example (Figure 1): When a packet arrives at the NIC, it is copied (via DMA) to the kernel ring buffer and triggers a hardware interrupt (IRQ). The kernel responds to the interrupt and starts the receiving path. The receiving process in the kernel is divided into two parts: the top half and the bottom half. The top half runs in the context of a hardware interrupt; it simply inserts the packet into the per-CPU packet queue and triggers the bottom half. The bottom half is executed in the form of a software interrupt (softirq), scheduled by the kernel at an appropriate later time, and is the main routine in which the packet is processed through the network protocol stack. After being processed at the various protocol layers, the packet is finally copied to the user-space buffer and passed to the application.

[Figure 1: Illustration of the data receiving path in the Linux kernel, from the multi-queue NIC (DMA, RX ring buffer, IRQ coalescing) through the IRQ and softirq handlers (NAPI, GRO, RPS), the Layer 3&4 network stack, the virtual bridge, veth, and VxLAN devices, up to container applications at Layer 7.]
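The bottom-half (softirq) activity described above can be observed directly from /proc/softirqs. The sketch below is a minimal, Linux-only example that samples the per-CPU NET_RX counters twice and prints the deltas; the one-second interval and the output format are arbitrary choices.

```python
#!/usr/bin/env python3
"""Minimal sketch: observe per-CPU NET_RX softirq activity on Linux.

Samples /proc/softirqs twice and prints the per-CPU delta of NET_RX, the
software interrupt that drives the receive path described above.
"""
import time

def read_net_rx():
    with open("/proc/softirqs") as f:
        lines = f.read().splitlines()
    cpus = lines[0].split()  # e.g. ["CPU0", "CPU1", ...]
    for line in lines[1:]:
        name, *counts = line.split()
        if name == "NET_RX:":
            return cpus, [int(c) for c in counts]
    raise RuntimeError("NET_RX row not found in /proc/softirqs")

cpus, before = read_net_rx()
time.sleep(1.0)
_, after = read_net_rx()

for cpu, b, a in zip(cpus, before, after):
    print(f"{cpu}: {a - b} NET_RX softirqs/s")
```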
Container Overlay Networks. Overlay networks are a common way to virtualize container networks and provide customized connectivity among distributed containers. Container overlay networks are generally based on a tunneling technique (e.g., VxLAN): when a container packet is sent, the overlay encapsulates it in a new packet with the (source and destination) host headers; when an encapsulated container packet is received, the overlay decapsulates it to recover the original packet and finally delivers it to the target container application by its private IP address.

As illustrated in Figure 1, the overlay network is created by adding additional devices, such as a VxLAN network device for packet encapsulation and decapsulation, virtual Ethernet ports (veth) serving as the network interfaces of containers, and a virtual bridge connecting all these devices. Intuitively, compared to the native host network, a container overlay network is more complex and has a longer data path. In the example of Figure 1, receiving one container packet raises one IRQ and three softirqs (from the host NIC, the VxLAN device, and the veth device, respectively). In consequence, the container packet traverses three network namespaces (host, overlay and container) and two network stacks (container and host). Inevitably, this leads to high packet processing overhead and low efficiency of container networking.

Optimizations for Packet Processing. There is a large body of work on optimizing the kernel for efficient packet processing. We categorize it into two groups:

(1) Mitigating per-packet processing overhead: Packet processing cost generally consists of two parts: per-packet cost and per-byte cost. In modern OSes, per-packet cost dominates packet processing. Thus, a number of optimizations have been proposed to mitigate per-packet processing overhead, including interrupt coalescing
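To make the interrupt coalescing knob concrete, the sketch below queries and adjusts a NIC's RX coalescing parameters through the standard ethtool utility. The device name and the rx-usecs/rx-frames values are placeholders, and whether these parameters are tunable at all depends on the NIC driver.

```python
#!/usr/bin/env python3
"""Minimal sketch: inspect and adjust NIC interrupt coalescing via ethtool.

Assumptions (not from the paper): the NIC is eth0, its driver exposes the
rx-usecs/rx-frames coalescing parameters, and ethtool is installed. Larger
values batch more packets per IRQ, trading latency for lower per-packet
interrupt overhead; the numbers below are placeholders, not recommendations.
"""
import subprocess

DEV = "eth0"  # placeholder device name

# Show the current coalescing settings.
subprocess.run(["ethtool", "-c", DEV], check=True)

# Coalesce RX interrupts: raise an IRQ after 50 microseconds or 64 frames,
# whichever comes first (exact semantics are driver-specific).
subprocess.run(
    ["ethtool", "-C", DEV, "rx-usecs", "50", "rx-frames", "64"],
    check=True,
)
```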