High Performance Packet Processing with Flexnic

High Performance Packet Processing with FlexNIC Antoine Kaufmann1 Simon Peter2 Naveen Kr. Sharma1 Thomas Anderson1 Arvind Krishnamurthy1 1University of Washington 2The University of Texas at Austin fantoinek,naveenks,tom,[email protected] [email protected] Abstract newer processors. A 40 Gbps interface can receive a cache- The recent surge of network I/O performance has put enor- line sized (64B) packet close to every 12 ns. At that speed, mous pressure on memory and software I/O processing sub- OS kernel bypass [5; 39] is a necessity but not enough by it- systems. We argue that the primary reason for high mem- self. Even a single last-level cache access in the packet data ory and processing overheads is the inefficient use of these handling path can prevent software from keeping up with ar- resources by current commodity network interface cards riving network traffic. (NICs). We claim that the primary reason for high memory and We propose FlexNIC, a flexible network DMA interface processing overheads is the inefficient use of these resources that can be used by operating systems and applications alike by current commodity network interface cards (NICs). NICs to reduce packet processing overheads. FlexNIC allows ser- communicate with software by accessing data in server vices to install packet processing rules into the NIC, which memory, either in DRAM or via a designated last-level cache then executes simple operations on packets while exchang- (e.g., via DDIO [21], DCA [20], or TPH [38]). Packet de- ing them with host memory. Thus, our proposal moves some scriptor queues instruct the NIC as to where in server mem- of the packet processing traditionally done in software to the ory it should place the next packet and from where to read NIC, where it can be done flexibly and at high speed. the next arriving packet for transmission. Except for sim- We quantify the potential benefits of FlexNIC by emulat- ple header splitting, no further modifications to packets are ing the proposed FlexNIC functionality with existing hard- made and only basic distinctions are made among packet ware or in software. We show that significant gains in appli- types. For example, it is possible to choose a virtual queue cation performance are possible, in terms of both latency and based on the TCP connection, but not, say, on the application throughput, for several widely used applications, including a key in a memcache lookup. key-value store, a stream processing system, and an intrusion This design introduces overhead in several ways. For detection system. example, even if software is only interested in a portion of each packet, the current interface requires NICs to transfer Keywords Network Interface Card; DMA; Flexible Net- packets in their entirety to host memory. No interface exists work Processing; Match-and-Action Processing to steer packets to the right location in the memory hierarchy based on application-level information, causing extraneous 1. Introduction cache traffic when a packet is not in the right cache or at Data center network bandwidth is growing steadily: 10 Gbps the right level. Finally, network processing code often does Ethernet is widespread, 25 and 40 Gbps are gaining trac- repetitive work, such as to check packet headers, even when tion, while 100 Gbps is nearly available [47]. This trend is it is clear what to do in the common case. straining the server computation and memory capabilities; The current approach to network I/O acceleration is to put I/O processing is likely to limit future server performance. fixed-function offload features into NICs [46]. While use- For example, last-level cache access latency in Intel Sandy ful, these offloads bypass application and OS concerns and Bridge processors is 15 ns [34] and has not improved in can thus only perform low-level functions, such as checksum processing and packet segmentation for commonly used, standard protocols (e.g., UDP and TCP). Higher-level offloads can constrain the system in other ways. For example, remote direct memory access (RDMA) [42], is difficult Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and to adapt to many-to-one client-server communication mod- the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author. els [32; 13], resulting in diminished benefits [25]. On some Copyright is held by the owner/author(s). ASPLOS ’16 April 2–6, 2016, Atlanta, Georgia, USA. ACM 978-1-4503-4091-5/16/04. DOI: http://dx.doi.org/10.1145/2872362.2872367 high-speed NICs, the need to pin physical memory resources • We use our prototype to quantify the potential bene- prevents common server consolidation techniques, such as fits of flexible network I/O for several widely used net- memory overcommit. worked applications, including a key-value store, data To address these shortcomings, we propose FlexNIC, stream processing, and intrusion detection (x3–5). a flexible DMA programming interface for network I/O. • We present a number of building blocks that are useful to FlexNIC allows applications and OSes to install packet pro- software developers wishing to use FlexNIC beyond the cessing rules into the NIC, which instruct it to execute sim- use cases evaluated in this paper (x2.3). ple operations on packets while exchanging them with host memory. These rules allow applications and the OS to exert Our paper is less concerned with a concrete hardware fine-grained control over how data is exchanged with the implementation of the FlexNIC model. While we provide host, including where to put each portion of it. The FlexNIC a sketch of a potential implementation, we assume that the programming model can improve packet processing per- model can be implemented in an inexpensive way at high formance while reducing memory system pressure at fast line-rates. Our assumption rests on the fact that OpenFlow network speeds. switches have demonstrated this capability with minimum- The idea is not far-fetched. Inexpensive top-of-rack Open- sized packets at aggregate per-chip rates of over a half terabit Flow switches support rule-based processing applied to ev- per second [47]. ery packet and are currently being enhanced to allow pro- grammable transformations on packet fields [9]. Fully pro- 2. FlexNIC Design and Implementation grammable NICs that can support packet processing also A number of design goals guide the development of FlexNIC: exist [35; 51; 10] and higher-level abstractions to program them are being proposed [22]. We build upon these ideas • Flexibility. We need a hardware programming model to provide a flexible, high-speed network I/O programming flexible enough to serve the offload requirements of net- interface for server systems. work applications that change at software development The value of our work is in helping guide next generation timescales. NIC hardware design. Is a more or less expressive program- • Line rate packet processing. At the same time the model ming model better? At one extreme, we can offload the entire may not slow down network I/O by allowing arbitrary application onto a network processor, but development costs processing. We need to be able to support the line rates will be higher. At the other extreme, fixed-function offloads, of tomorrow’s network interfaces (at least 100 Gb/s). such as Intel Flow Director, that match on only a small num- • Inexpensive hardware integration. The required addi- ber of packet fields and cannot modify packets in flight are tional hardware has to be economical, such that it fits the less expressive than FlexNIC, but forego some performance pricing model of commodity NICs. benefits, as we will show later. This paper presents the FlexNIC programming model, • Minimal application modification. While we assume which is expressive and yet allows for packet processing that applications are designed for scalability and high- at line rates, and quantifies the potential performance ben- performance, we do not assume that large modifications efits of flexible NIC packet steering and processing for a to existing software are required to make use of FlexNIC. set of typical server applications. When compared to a high- • OS Integration. The design needs to support the protec- performance user-level network stack, our prototype imple- tion and isolation guarantees provided at the OS, while mentation achieves 2.3× better throughput for a real-time allowing applications to install their own offloading prim- analytics platform modeled after Apache Storm, 60% better itives in a fast and flexible manner. throughput for an intrusion detection system, and 60% better latency for a key-value store. Further, FlexNIC is versatile. To provide the needed flexibility at fast line rates, we ap- Beyond the evaluated use cases, it is useful for high perfor- ply the reconfigurable match table (RMT) model recently mance resource virtualization and isolation, fast failover, and proposed for flexible switching chips [8] to the NIC DMA can provide high-level consistency for RDMA operations. interface. The RMT model processes packets through a se- We discuss these further use cases in x6. quence of match plus action (M+A) stages. We make four specific contributions: We make no assumptions about how the RMT model is implemented on the NIC. Firmware, FPGA, custom silicon, • We present the FlexNIC programming model that allows and network processor based approaches are all equally fea- flexible I/O processing by exposing a match-and-action sible, but incur different costs. However, we note that a typ- programming interface to applications and OSes (x2).

High Performance Packet Processing with Flexnic

Packet Processing at Wire Speed Using Network Processors

Packet Processing Execution Engine (PROX) - Performance Characterization for NFVI User Guide

Lightweight Internet Protocol Stack Ideal Choice for High-Performance HFT and Telecom Packet Processing Applications Lightweight Internet Protocol Stack

High Level Synthesis for Packet Processing Pipelines Cristian

Evaluating the Power of Flexible Packet Processing for Network Resource Allocation Naveen Kr

An Architecture for High-Speed Packet Switched Networks (Thesis)

ICMP for Ipv6

Packet Processing with a Noc-Enhanced FPGA

Pathminer Powered Predictable Packet Processing

Impressive Packet Processing Performance Enables Greater Workload Consolidation Single Intel® Xeon® Processor Platform Achieves Over 80 Mpps Throughput1

High Level Synthesis Tool for High Speed Packet Processing

IMPLEMENTING VOICE OVER IP Copyright 6 2003 by John Wiley & Sons, Inc