High Performance Packet Processing with FlexNIC

Antoine Kaufmann¹  Simon Peter²  Naveen Kr. Sharma¹  Thomas Anderson¹  Arvind Krishnamurthy¹
¹University of Washington  ²The University of Texas at Austin
{antoinek,naveenks,tom,arvind}@cs.washington.edu  [email protected]

Abstract

The recent surge of network I/O performance has put enormous pressure on memory and software I/O processing subsystems. We argue that the primary reason for high memory and processing overheads is the inefficient use of these resources by current commodity network interface cards (NICs).

We propose FlexNIC, a flexible network DMA interface that can be used by operating systems and applications alike to reduce packet processing overheads. FlexNIC allows services to install packet processing rules into the NIC, which then executes simple operations on packets while exchanging them with host memory. Thus, our proposal moves some of the packet processing traditionally done in software to the NIC, where it can be done flexibly and at high speed.

We quantify the potential benefits of FlexNIC by emulating the proposed FlexNIC functionality with existing hardware or in software. We show that significant gains in application performance are possible, in terms of both latency and throughput, for several widely used applications, including a key-value store, a stream processing system, and an intrusion detection system.

Categories and Subject Descriptors  C.2.1 [Computer-Communication Networks]: Network Architecture and Design

Keywords  Network Interface Card; DMA; Flexible Network Processing; Match-and-Action Processing

1. Introduction

Data center network bandwidth is growing steadily: 10 Gbps is widespread, 25 and 40 Gbps are gaining traction, while 100 Gbps is nearly available [47]. This trend is straining server computation and memory capabilities; I/O processing is likely to limit future server performance. For example, last-level cache access latency in Sandy Bridge processors is 15 ns [34] and has not improved in newer processors. A 40 Gbps interface can receive a cache-line sized (64B) packet close to every 12 ns. At that speed, OS kernel bypass [5; 39] is a necessity, but not enough by itself. Even a single last-level cache access in the packet data handling path can prevent software from keeping up with arriving network traffic.

We claim that the primary reason for high memory and processing overheads is the inefficient use of these resources by current commodity network interface cards (NICs). NICs communicate with software by accessing data in server memory, either in DRAM or via a designated last-level cache (e.g., via DDIO [21], DCA [20], or TPH [38]). Packet descriptor queues instruct the NIC as to where in server memory it should place the next packet and from where to read the next arriving packet for transmission. Except for simple header splitting, no further modifications to packets are made, and only basic distinctions are made among packet types. For example, it is possible to choose a virtual queue based on the TCP connection, but not, say, on the application key in a memcache lookup.

This design introduces overhead in several ways. For example, even if software is only interested in a portion of each packet, the current interface requires NICs to transfer packets in their entirety to host memory. No interface exists to steer packets to the right location in the memory hierarchy based on application-level information, causing extraneous cache traffic when a packet is not in the right cache or at the right level. Finally, network processing code often does repetitive work, such as to check packet headers, even when it is clear what to do in the common case.
The current approach to network I/O acceleration is to put fixed-function offload features into NICs [46]. While useful, these offloads bypass application and OS concerns and can thus only perform low-level functions, such as checksum processing and packet segmentation for commonly used, standard protocols (e.g., UDP and TCP). Higher-level offloads can constrain the system in other ways. For example, remote direct memory access (RDMA) [42] is difficult to adapt to many-to-one client-server communication models [32; 13], resulting in diminished benefits [25]. On some high-speed NICs, the need to pin physical memory resources prevents common server consolidation techniques, such as memory overcommit.

To address these shortcomings, we propose FlexNIC, a flexible DMA programming interface for network I/O. FlexNIC allows applications and OSes to install packet processing rules into the NIC, which instruct it to execute simple operations on packets while exchanging them with host memory. These rules allow applications and the OS to exert fine-grained control over how data is exchanged with the host, including where to put each portion of it. The FlexNIC programming model can improve packet processing performance while reducing memory system pressure at fast network speeds.

The idea is not far-fetched. Inexpensive top-of-rack OpenFlow switches support rule-based processing applied to every packet and are currently being enhanced to allow programmable transformations on packet fields [9]. Fully programmable NICs that can support packet processing also exist [35; 51; 10], and higher-level abstractions to program them are being proposed [22]. We build upon these ideas to provide a flexible, high-speed network I/O programming interface for server systems.
The value of our work is in helping guide next-generation NIC hardware design. Is a more or less expressive programming model better? We are aiming to better understand the middle ground. At one extreme, we can offload the entire application onto a fully programmable NIC, but development costs will be higher. At the other extreme, fixed-function offloads, such as Intel Flow Director, that match on only a small number of packet fields and cannot modify packets in flight are less expressive than FlexNIC, but forego some performance benefits, as we will show later.

This paper presents the FlexNIC programming model, which is expressive and yet allows for packet processing at line rates, and quantifies the potential performance benefits of flexible NIC packet steering and processing for a set of typical server applications. When compared to a high-performance user-level network stack, our prototype implementation achieves 2.3× better throughput for a real-time analytics platform modeled after Apache Storm, 60% better throughput for an intrusion detection system, and 60% better latency for a key-value store. Further, FlexNIC is versatile. Beyond the evaluated use cases, it is useful for high-performance resource virtualization and isolation, fast failover, and can provide high-level consistency for RDMA operations. We discuss these further use cases in §6.

We make four specific contributions:

• We present the FlexNIC programming model that allows flexible I/O processing by exposing a match-and-action programming interface to applications and OSes (§2).

• We implement a prototype of FlexNIC via a combination of existing NIC hardware features and software emulation using dedicated processor cores on commodity multi-core computers and I/O device hardware (§2.4).

• We use our prototype to quantify the potential benefits of flexible network I/O for several widely used networked applications, including a key-value store, data stream processing, and intrusion detection (§3–5).

• We present a number of building blocks that are useful to software developers wishing to use FlexNIC beyond the use cases evaluated in this paper (§2.3).

Our paper is less concerned with a concrete hardware implementation of the FlexNIC model. While we provide a sketch of a potential implementation, we assume that the model can be implemented in an inexpensive way at high line-rates. Our assumption rests on the fact that OpenFlow switches have demonstrated this capability with minimum-sized packets at aggregate per-chip rates of over half a terabit per second [47].

2. FlexNIC Design and Implementation

A number of design goals guide the development of FlexNIC:

• Flexibility. We need a hardware programming model flexible enough to serve the offload requirements of network applications that change at software development timescales.

• Line rate packet processing. At the same time, the model may not slow down network I/O by allowing arbitrary processing. We need to be able to support the line rates of tomorrow's network interfaces (at least 100 Gb/s).

• Inexpensive hardware integration. The required additional hardware has to be economical, such that it fits the pricing model of commodity NICs.

• Minimal application modification. While we assume that applications are designed for scalability and high performance, we do not assume that large modifications to existing software are required to make use of FlexNIC.

• OS integration. The design needs to support the protection and isolation guarantees provided by the OS, while allowing applications to install their own offloading primitives in a fast and flexible manner.

To provide the needed flexibility at fast line rates, we apply the reconfigurable match table (RMT) model recently proposed for flexible switching chips [8] to the NIC DMA interface. The RMT model processes packets through a sequence of match plus action (M+A) stages.

We make no assumptions about how the RMT model is implemented on the NIC. Firmware, FPGA, custom silicon, and network processor based approaches are all equally feasible, but incur different costs. However, we note that a typical commodity switch uses merchant silicon to support sixteen 40 Gbps links at a cost of about $10K per switch in volume, including the switching capacity, deep packet buffers, protocol handling, and a match plus action (M+A) table for route control.
In fact, it is generally believed that converting to the more flexible RMT model will reduce costs in the next generation of switches by reducing the need for specialized protocol processing [8]. As a result, we believe that adding line-rate FlexNIC support is both feasible and likely less expensive than adding full network processor or FPGA support.

In this section, we provide a design of FlexNIC, present a set of reusable building blocks implemented on FlexNIC, and describe how we emulate the proposed FlexNIC functionality with existing hardware or in software.

2.1 Motivating Example

To motivate FlexNIC and its design, we consider the performance bottlenecks for the popular Memcached key-value store [2] and how they could be alleviated with NIC support. Memcached is typically used to accelerate common web requests and provides a simple get and put interface to an associative array indexed by application-level keys. In these scenarios, service latency and throughput are of utmost importance.

Memory-efficient scaling. To scale request throughput, today's NICs offer receive-side scaling (RSS), an offload feature that distributes incoming packets to descriptor queues based on the client connection. Individual CPU cores are then assigned to each queue to scale performance with the number of cores. Additional mechanisms, such as Intel's FlowDirector [23], allow OSes to directly steer and migrate individual connections to specific queues. Linux does so based on the last local send operation of the connection, assuming the same core will send on the connection again.

Both approaches suffer from a number of performance drawbacks: (1) hot items are likely to be accessed by multiple clients, reducing cache effectiveness by replicating these items in multiple CPU caches; (2) when a hot item is modified, it causes synchronization overhead and global cache invalidations; (3) item access is not correlated with client connections, so connection-based steering is not going to help.

FlexNIC allows us to tailor these approaches to Memcached. Instead of assigning clients to server cores, we can partition the key space [30] and use a separate key space request queue per core. We can install rules that steer client requests to appropriate queues, based on a hash of the requested key in the packet. The hash can be computed by FlexNIC using the existing RSS hashing functionality. This approach maximizes cache utilization and minimizes cache coherence traffic. For skewed or hot items, we can use the NIC to balance the client load in a manner that suits both the application and the hardware (e.g., by dynamically re-partitioning or by sending hot requests to two cores that share the same cache and hence benefit from low latency sharing). A C sketch of this steering logic appears at the end of this subsection.

Streamlined request processing. Even if most client requests arrive well-formed at the server and are of a common type—say, GET requests—network stacks and Memcached have to inspect each packet to determine where the client payload starts, to parse the client command, and to extract the request ID. This incurs extra memory and processing overhead, as the NIC has to transfer the headers to the host just so that software can check and then discard them. Measurements using the Arrakis OS [39] showed that, assuming kernel bypass, network stack and application-level packet processing take half of the total server processing time for Memcached.

With FlexNIC, we can check and discard Memcached headers directly on the NIC before any transfer takes place and eliminate the server processing latency. To do so, we install a rule that identifies GET requests and transfers only the client ID and requested key to a dedicated fast-path request queue for GET requests. If the packet is not well-formed, the NIC can detect this and instead transfer it in the traditional way to a slow-path queue for software processing. Further, to support various client hardware architectures, Memcached has to convert certain packet fields from network to host byte order. We can instruct the NIC to carry out these simple transformations for us, before transferring the packets into host memory.
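To make the key-based steering concrete, the following C sketch mirrors in host code the dispatch FlexNIC would perform: hash the requested key and index a small redirection table of per-core queues. The CRC32 routine stands in for whatever hash primitive the NIC implements, and all names (queue_table, steer_get) are ours, not part of a FlexNIC API.

    #include <stdint.h>

    /* Hypothetical sketch of key-based steering (Section 2.1): hash the
     * Memcached key and pick a per-core queue. Host-side illustration of
     * the NIC logic, not a real FlexNIC interface. */

    #define QUEUE_TABLE_BITS 7                 /* 128-entry redirection table */
    #define QUEUE_TABLE_SIZE (1 << QUEUE_TABLE_BITS)

    static uint8_t queue_table[QUEUE_TABLE_SIZE]; /* hash prefix -> core queue */

    /* Any hash the NIC already provides works; CRC32 is one candidate. */
    static uint32_t key_hash(const uint8_t *key, uint32_t len) {
        uint32_t h = 0xffffffff;
        for (uint32_t i = 0; i < len; i++) {
            h ^= key[i];
            for (int b = 0; b < 8; b++)
                h = (h >> 1) ^ (0xedb88320 & -(h & 1));
        }
        return ~h;
    }

    /* Steer a GET request: the queue is chosen by key, not by connection. */
    static unsigned steer_get(const uint8_t *key, uint32_t len) {
        uint32_t h = key_hash(key, len);
        return queue_table[h & (QUEUE_TABLE_SIZE - 1)];
    }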
2.2 FlexNIC Model

We now briefly describe the RMT model used in switches and then discuss how to adapt it to support application-specific optimizations of the type described above. We provide hardware implementation details where they impact the programming model.

RMT in switches. RMT switches can be programmed with a set of rules that match on various parts of the packet and then apply data-driven modifications to it, all operating at line rate for the switched packets. This is implemented using two packet processing pipelines that are connected by a set of queues, allowing for packets to be replicated and then modified separately. Such a pipeline is shown in Figure 1. A packet enters the pipeline through the parser, which identifies all relevant packet fields as described by the software-defined parse graph. It extracts the specified fields into a field buffer (FB0) to be used in later processing stages. The relevant fields pass through the pipeline of M+A stages (MA1..MAm) and further field buffers (FB1..FBm). In a typical design, m = 32. An M+A stage matches on field buffer contents using a match table (of implementation-defined size), looking up a corresponding action, which is then applied as we forward to the next field buffer. Independent actions can be executed in parallel within one M+A stage. The deparser combines the modified fields with the original packet data received from the parser to get the final packet. To be able to operate at high line rates, multiple parser instances can be used. In addition to exact matches, RMT tables can also be configured to perform prefix, range, and wildcard matches. Finally, a limited amount of switch-internal SRAM can maintain state across packets and may be used while processing.

Figure 1. RMT switch pipeline.

Applying RMT to NICs. To gain the largest benefit from our approach, we enhance commodity NIC DMA capabilities by integrating three RMT pipelines with the DMA engine, as shown in Figure 2. This allows FlexNIC to process incoming and outgoing packets independently, as well as send new packets directly from the NIC in response to host DMA interactions.

Figure 2. RMT-enhanced NIC DMA architecture.

Similar to the switch model, incoming packets from any of the NIC's Ethernet ports (P1..Pn) first traverse the ingress pipeline, where they may be modified according to M+A rules. Packet fields can be added, stripped, or modified, potentially leaving a much smaller packet to be transferred to the host. Modified packets are stored in a NIC-internal packet buffer, and pointers to the corresponding buffer positions are added to a number of NIC-internal holding queues depending on the final destination. In addition to host memory, hardware implementations may choose to offer P1..Pn as final destinations, allowing response packets to be forwarded back out on the network without host interaction. From each holding queue, packets are dequeued and (depending on the final destination) processed separately in the DMA pipeline or the egress pipeline. Both can again apply modifications.

If host memory is the final destination, the DMA pipeline issues requests to the DMA controller for transferring data between host memory and the packet buffer. DMA parameters are passed to the DMA engine by adding a special DMA header to packets in the DMA pipeline, as shown in Table 1. This design is easily extended to include support for CPU cache steering [20; 38] and atomic operations [37] if supported by the PCIe chipset. We can use this design to exchange packets with host memory or to carry out more complex memory exchanges. Packet checksums are calculated on the final result using the existing checksum offload capabilities.

Field      Description
offset     Byte offset in the packet
length     Number of bytes to transfer
direction  From/To memory
memBase    Start address in memory
cache      Cache level to send data to
core       Core ID, if data should go to a cache
atomicOp   PCIe atomic operation [37]

Table 1. Header format for DMA requests.
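For illustration, the DMA header of Table 1 can be transcribed into a C struct. The field set follows the table exactly; the widths and the direction encoding are our assumptions, since the paper does not fix a binary layout.

    #include <stdint.h>

    /* Sketch of the per-packet DMA request header from Table 1. Field
     * widths and encodings are our own assumptions, not a specification. */
    enum dma_direction { DMA_TO_MEMORY = 0, DMA_FROM_MEMORY = 1 };

    struct flexnic_dma_hdr {
        uint16_t offset;     /* byte offset in the packet */
        uint16_t length;     /* number of bytes to transfer */
        uint8_t  direction;  /* enum dma_direction */
        uint8_t  cache;      /* cache level to send data to */
        uint8_t  core;       /* core ID, if data should go to a cache */
        uint8_t  atomic_op;  /* PCIe atomic operation, 0 = none [37] */
        uint64_t mem_base;   /* start address in host memory */
    };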
Control flow. In general, actions execute in parallel, but FlexNIC also allows us to define dependencies among actions, up to a hardware-defined limit. Compilers are responsible for rejecting code with too many dependencies for a particular hardware implementation. Dependencies are simply defined as "executes-before" relationships and may be chained. When mapping a rule set to a hardware pipeline, dependencies constrain which physical pipeline stages can be used for executing individual actions.

Such control flow dependencies are useful to restrict application-level access to packet filtering under kernel bypass [39; 5]. The OS inserts early ingress rules that assign incoming packets to a subset of the installed application rules by tagging packets with an identifier. Rule subsets are associated with applications by requiring them to match on the identifier. If such rules are inserted appropriately by the OS, then applications only receive packet flows deemed accessible by them (e.g., by matching on a specific MAC or IP address first).

Constraints. To make packet processing at high line rates feasible, the RMT model is explicitly not freely programmable, and several restrictions are imposed upon the user. For example, processing primitives are limited: multiplication, division, and floating point operations are typically not feasible. Hashing primitives, however, are available; this exposes the hardware that NICs use today for flow steering. Control flow mechanisms, such as loops and pointers, are also unavailable, and entries inside M+A tables cannot be updated on the data path. This precludes complex computations from being used on the data path. We expect the amount of stateful NIC memory to be constrained; in particular, our experimental results assume no per-flow state.

Programming language. The programming interface to FlexNIC is inspired by the P4 language [9] proposed for configuring programmable switches. Parsers, deparsers, packet headers, and M+A rules can be defined in much the same way, but we also allow handling NIC-internal SRAM in M+A stages, with additional actions for accessing this memory. Integration of the DMA controller, the other major change from RMT switches, does not require language extensions, but we add action verbs to simplify programming.
Example. Figure 3 shows the pseudo-code of a FlexNIC program that implements memory-efficient scaling as described in the motivating example of Section 2.1. This program makes use of all the required elements of our model. To execute this program, the FlexNIC parser extracts the referenced packet fields out of incoming packets and passes them to the M+A stages in the ingress pipeline. RMT tables in the ingress pipeline then determine whether the fields match the IF declarations in the program. The hash in line 4, which is needed to determine the destination queue, is computed for every incoming packet after parsing (for example, using the CRC mechanism used for checksums) and discarded if not needed later. Line 5 uses a match table to map ranges of hashes to queues of the cores responsible for those keys. q.head and q.tail in line 6 are maintained in the NIC-internal SRAM and are compared in the ingress pipeline. The action part is inserted into the DMA pipeline. The DMA pipeline generates packet headers to instruct the DMA engine to transfer the specified packet and metadata fields out of the pipeline buffers to the host memory address of the next queue entry. Finally, line 8 updates q.head in the NIC-internal SRAM in any pipeline stage of the DMA pipeline. A description of how the FlexKVS application integrates with the FlexNIC pipeline is deferred to Section 3.

Match:
  1: IF ip.type == UDP
  2: IF udp.port == flexkvs
  3: IF flexkvs.req == GET
  4: hash = HASH(flexkvs.key)
  5: q = queue[hash & k]
  6: IF q.head != q.tail
Action:
  7: DMA flexkvs.clientid, hash TO q.descriptor[q.head]
  8: q.head = (q.head + 1) % q.size

Figure 3. FlexNIC receive fast-path for FlexKVS: a rule matches GET requests for a particular key range and writes only the key hash together with a client identifier to a host descriptor queue. The queue tail is updated by FlexKVS via a register write.

2.3 Building Blocks

During the use case exploration of FlexNIC, a number of building blocks have crystallized which we believe are broadly applicable beyond the use cases presented in this paper. These building blocks provide easy, configurable access to particular functionality that FlexNIC is useful for. We present them in this section and will refer back to them in later sections.

Multiplexing: Multiplexing has proven valuable to accelerate the performance of our applications. On the receive path, the NIC has to be able to identify incoming packets based on arbitrary header fields, drop unneeded headers, and place packets into software-defined queues. On the send path, the NIC has to read packets from various application-defined packet queues, prepend the correct headers, and send them along a fixed number of connections. This building block is implemented using only ingress/egress M+A rules.

Flow and congestion control: Today's high-speed NICs either assume the application is trusted to implement congestion control, or, as in RDMA, enforce a specific model in hardware. Many protocols can be directly encoded in an RMT model with simple packet handling and minimal per-flow state. For example, for flow control, a standard pattern we use is to configure FlexNIC to automatically generate receiver-side acks using ingress M+A rules, in tandem with delivering the payload to a receive queue for application processing. If the application falls behind, the receive queue will fill, the ack is not generated, and the sender will stall.

For congestion control, enforcement needs to be in the kernel while packet processing is at user level. Many data centers configure their switches to mark an explicit congestion notification (ECN) bit in each packet to indicate imminent congestion. We configure FlexNIC to pull the ECN bits out before the packet stream reaches the application; we forward these to the host kernel on the sender to allow it to adjust its rate limits without needing to trust the application.

Other protocols, such as TCP, are stateful and require complex per-packet logic; although we believe it is possible to use FlexNIC to efficiently transmit and receive TCP packets, we leave that discussion for future work.

Hashing: Hashing is essential to the scalable operation of NICs today, and hashes can be just as useful when handling packets inside application software. However, they often need to be re-computed there, adding overhead. This overhead can be easily eliminated by relaying the hardware-computed hash to software. In FlexNIC, we allow flexible hashing on arbitrary packet fields and relay the hash in a software-defined packet header via ingress or DMA rules.
Filtering: In addition to multiplexing, filtering can eliminate software overheads that would otherwise be required to handle error cases, even if very few illegal packets arrive in practice. In FlexNIC, we can insert ingress M+A rules that drop unwanted packets or divert them to a separate descriptor queue for software processing.

2.4 Testbed Implementation

To quantify the benefits of FlexNIC, we leverage a combination of hardware and software techniques that we implement within a test cluster. Whenever possible, we re-use existing hardware functionality to achieve FlexNIC functionality. When this is impossible, we emulate the missing functionality in software on a number of dedicated processor cores. This limits the performance of the emulation to slower link speeds than would be possible with a hardware implementation of FlexNIC and thus favors the baseline in our comparison. We describe the hardware and software emulation in this section, starting with the baseline hardware used within our cluster.

Testbed cluster. Our evaluation cluster contains six machines consisting of 6-core Intel E5-2430 (Sandy Bridge) systems at 2.2 GHz with 18MB total cache space. Unless otherwise mentioned, hyperthreading is enabled, yielding 12 hyperthreads per machine. All systems have an Intel X520 (82599-based) dual-port 10Gb Ethernet adapter with both ports connected to the same 10Gb Dell PowerConnect 8024F Ethernet switch. We run Ubuntu Linux 14.04.
Hardware features re-used. We make use of the X520's receive-side scaling (RSS) capabilities to carry out fast, customizable steering and hashing in hardware. RSS in commodity NICs operates on a small number of fixed packet header fields, such as IP and TCP/UDP source and destination addresses/ports, and is thus not customizable. We attain limited customization capability by configuring RSS to operate on IPv6 addresses, which yields the largest contiguous packet area to operate upon—32 bytes—and then moving the relevant data into these fields. This is sufficient for our experiments, as MAC addresses and port numbers are enough for routing within our simple network. RSS computes a 32-bit hash on the 32 byte field, which it also writes to the receive descriptor for software to read. It then uses the hash as an index into a 128-entry redirection table that determines the destination queue for an incoming packet.

Software implementation. We implement other needed functionality in a software NIC extension that we call Soft-FlexNIC. Soft-FlexNIC implements flexible demultiplexing, congestion control, and a customizable DMA interface to the host. To do this, Soft-FlexNIC uses dedicated host cores (2 send and 2 receive cores were sufficient to handle 10Gb/s) and a shared memory interface to system software that mimics the hardware interface. For performance, Soft-FlexNIC makes use of batching, pipelining, and lock-free queueing. Batching and pipelining work as described in [19]. Lock-free queueing is used to allow scalable access to the emulated host descriptor queue from multiple Soft-FlexNIC threads by atomically reserving queue positions via a compare-and-swap operation (sketched below). We try to ensure that our emulation adequately approximates DMA interface overheads and NIC M+A parallelism, but our approach may be optimistic in modelling PCI round trips, as these cannot be emulated easily using CPU cache coherence interactions.
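The slot reservation can be illustrated in a few lines of C11: each sender thread claims the next queue position with a compare-and-swap and retries on contention. This is a minimal sketch under our own naming, not the actual Soft-FlexNIC code.

    #include <stdatomic.h>
    #include <stdint.h>

    /* Lock-free slot reservation: multiple Soft-FlexNIC threads claim
     * positions in the emulated descriptor queue via compare-and-swap. */
    struct emu_queue {
        _Atomic uint32_t tail;   /* next position to reserve */
        uint32_t         mask;   /* size - 1; size is a power of two */
    };

    static uint32_t reserve_slot(struct emu_queue *q) {
        uint32_t pos = atomic_load_explicit(&q->tail, memory_order_relaxed);
        /* Retry until our CAS claims a unique position; pos is reloaded
         * with the current tail on each failure. */
        while (!atomic_compare_exchange_weak_explicit(
                &q->tail, &pos, pos + 1,
                memory_order_acq_rel, memory_order_relaxed))
            ;
        return pos & q->mask;
    }

A fetch-and-add would serve equally well here; we show CAS because that is the operation the text names.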
tion adequately approximates DMA interface overheads and To avoid false sharing, each table entry has the size of a full NIC M+A parallelism, but our approach may be optimistic cache line, which leaves room for a spin-lock, multiple item in modelling PCI round trips, as these cannot be emulated pointers (five on x86-64), and the corresponding hashes. In- easily using CPU cache coherence interactions. cluding the hashes on the table entries avoids dereferencing Baseline. In order to provide an adequate comparison, we pointers and touching cache lines for non-matching items. If also run all software on top of the high-performance Extaris more items hash to an entry than there are available pointers user-level network stack [39] and make this our baseline. Ex- the additional items are chained in a new entry via the last taris runs minimal code to handle packets and is a zero-copy pointer. We use a power of two for the number of buckets in and scalable network stack. This eliminates the overheads the table. This allows us to use the lowest k bits of the hash to inherent in a kernel-level network stack and allows us to fo- choose a bucket, which is easy to implement in FlexNIC. k is cus on the improvements that are due to FlexNIC. chosen based on the demultiplexing queue table size loaded into the NIC (up to 128 entries in our prototype). 3. Case Study: Key-Value Store Item allocation. The item allocator in FlexKVS uses a We now describe the design of a key-value store, FlexKVS, log [44] for allocation as opposed to a slab allocator used that is compatible with Memcached [2], but whose per- in Memcached. This provides constant-time allocation and formance is optimized using the functionality provided by minimal cache access, improving overall request processing FlexNIC. To achieve performance close to the hardware time. To minimize synchronization, the log is divided into fixed-size segments. Each core has exactly one active seg- 8 Memcached ment that is used for satisfying allocation requests. A cen- FlexKVS/Linux 6 tralized segment pool is used to manage inactive segments, FlexKVS/Flow FlexKVS/Key from which new segments are allocated when the active seg- 4 ment is full. Synchronization for pool access is required, but 2 is infrequent enough to not cause noticeable overhead.

Item allocation. The item allocator in FlexKVS uses a log [44] for allocation, as opposed to the slab allocator used in Memcached. This provides constant-time allocation and minimal cache access, improving overall request processing time. To minimize synchronization, the log is divided into fixed-size segments. Each core has exactly one active segment that is used for satisfying allocation requests. A centralized segment pool is used to manage inactive segments, from which new segments are allocated when the active segment is full. Synchronization is required for pool access, but this is infrequent enough not to cause noticeable overhead.

Item deletion. To handle item deletions, each log segment includes a counter of the number of bytes that have been freed in the segment. When an item's reference count drops to zero, the item becomes inactive, and the corresponding segment header is looked up and the counter incremented. A background thread periodically scans segment headers for candidate segments to compact. When compacting, active items in the candidate segment are re-inserted into a new segment and inactive items deleted. After compaction, the segment is added to the free segment pool.

3.2 FlexNIC Implementation

The FlexNIC implementation consists of key-based steering and a custom DMA interface. We describe both.

Key-based steering. To implement key-based steering, we utilize the hashing and demultiplexing building blocks on the key field in the FlexKVS request packet, as shown in Figure 3. When enqueuing the packet to the appropriate receive queue based on the hash of the key, the NIC writes the hash into a special packet header for software to read. This hash value is used by FlexKVS for the hash table lookup.

Custom DMA interface. FlexNIC can also perform the item log append on SET requests, thereby enabling full zero-copy operation for both directions and removal of packet parsing for SET requests in FlexKVS. To do so, FlexKVS registers a small number (four in our prototype) of log segments per core via a special message enqueued on a descriptor queue. These log segments are then filled by the NIC as it processes SET requests. When a segment fills up, FlexNIC enqueues a message to notify FlexKVS to register more segments (a sketch of the registration message follows).
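The registration handshake might look as follows; the message layout, field widths, and names are hypothetical, chosen only to illustrate the four-segments-per-core flow described above.

    #include <stdint.h>

    /* Hypothetical sketch of the segment registration message: FlexKVS
     * donates log segments to the NIC via a descriptor queue entry, and
     * the NIC appends SET items to them. Not the actual interface. */
    #define SEGS_PER_CORE 4

    struct seg_register_msg {
        uint16_t core;                  /* core whose SET requests use these */
        uint16_t count;                 /* segments donated, <= SEGS_PER_CORE */
        uint64_t base[SEGS_PER_CORE];   /* base address of each segment */
        uint32_t size[SEGS_PER_CORE];   /* length of each segment in bytes */
    };

    /* Posted by FlexKVS on its descriptor queue; consumed by the NIC.
     * When a segment fills up, the NIC enqueues a notification so that
     * FlexKVS registers replacement segments. */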
FlexNIC still enqueues a message for each incoming request to the corresponding core, so remaining software processing can be done. For GET requests, this entails a hash table lookup. For SET requests, the hash table needs to be updated to point to the newly appended item.

Adapting FlexKVS to the custom DMA interface required adding 200 lines for interfacing with FlexNIC, adding 50 lines to the item allocator for managing NIC log segments, and modifications to request processing, reducing the original 500 lines to 150.

3.3 Performance Evaluation

We evaluate different levels of integration between FlexKVS and the NIC. The baseline runs on the Extaris network stack using UDP and RSS, similar to the Memcached configuration in Arrakis [39]. A more integrated version uses FlexNIC to perform the hash calculation and steer requests based on their key, and the third version adds the FlexNIC custom DMA interface. FlexKVS performance is reduced by hyperthreading, so we disable it for these experiments, leaving 6 cores per machine.

Key-based steering. We compare FlexKVS throughput with flow-based steering against key-based steering using FlexNIC. We use three machines for this experiment: one server for running FlexKVS and two client machines to generate load. One NIC port of each client machine is used, and the server is connected with both ports using link aggregation, yielding a 20 Gb/s link. The workload consists of 100,000 key-value pairs of 32 byte keys and 64 byte values, with a skewed access distribution (zipf, s = 0.9). The workload contains 90% GET requests and 10% SET requests. Throughput was measured over 2 minutes after 1 minute of warm-up.

Figure 4 shows the average attained throughput for key- and flow-based steering over an increasing number of server cores. In the case of one core, the performance improvement of key-based steering is due to the computation of the hash function on the NIC. As the number of cores increases, lock contention and cache coherence traffic cause increasing overhead for flow-based steering. Key-based steering avoids these scalability bottlenecks and offers 30-45% better throughput. Note that the throughput for 4+ cores with key-based steering is limited by PCIe bus bandwidth, as the workload is issuing a large number of small bus transactions. We thus stop the experiment after 5 cores. Modifying the NIC driver to use batching increases throughput to 13 million operations per second.

Figure 4. FlexKVS throughput scalability (million operations per second over the number of CPU cores) with flow-based and key-based steering. Results for Memcached and FlexKVS on Linux are also provided.

Specialized DMA interface. The second experiment measures the server-side request processing latency. Three different configurations are compared: flow-based steering, key-based steering, and key-based steering with the specialized DMA interface described above. We measure time spent from the point an incoming packet is seen by the network stack to the point the corresponding response is inserted into the NIC descriptor queue.

Table 2 shows the median and 90th percentile of the number of cycles per request measured over 100 million requests.

Steering          Flow   Key    DMA
Median            1110   690    440
90th Percentile   1400   1070   680

Table 2. FlexKVS request processing time rounded to 10 cycles.

Key-based steering reduces the number of CPU cycles by 38% over the baseline, matching the throughput results presented above. The custom DMA interface reduces this number by another 36%, leading to a cumulative reduction of 60%. These performance benefits can be attributed to three factors: 1) with FlexNIC, only limited protocol processing needs to be performed on packets; 2) receive buffer management is not required; and 3) log appends for SET requests are executed by FlexNIC.

We conclude that flexible hashing and demultiplexing combined with the FlexNIC DMA engine yield considerable performance improvements in terms of latency and throughput for key-value stores. FlexNIC can efficiently carry out packet processing, buffer management, and log data structure management without additional work on server CPUs.

4. Case Study: Real-time Analytics

Real-time analytics platforms are useful tools to gain instantaneous, dynamic insight into vast datasets that change frequently. To be considered "real-time", the system must be able to produce answers within a short timespan (typically within a minute) and process millions of dataset changes per second. To do so, analytics platforms utilize data stream processing techniques: a set of worker nodes runs continuously on a cluster of machines; data tuples containing updates stream through them according to a dataflow processing graph, known as a topology. Tuples are emitted and consumed worker-to-worker in the topology. Each worker can process and aggregate incoming tuples before emitting new tuples. Workers that emit tuples derived from an original data source are known as spouts.

In the example shown in Figure 5, consider processing a live feed of tweets to determine the current set of top-n tweeting users. First, tweets are injected as tuples into a set of counting workers to extract and then count the user name field within each tuple. The rest of the tuple is discarded. Counters are implemented with a sliding window. Periodically (every minute in our case), counters emit a tuple for each active user name with its count. Ranking workers sort incoming tuples by count. They emit the top-n counted users to a single aggregating ranker, producing the final output rank to the user.

Figure 5. Top-n Twitter users topology.

As shown in Figure 5, the system scales by replicating the counting and ranking workers and spreading incoming tuples over the replicas. This allows workers to process the data set in parallel. Tuples are flow controlled when sent among workers to minimize loss. Many implementations utilize the TCP protocol for this purpose.

We have implemented a real-time analytics platform, FlexStorm, following the design of Apache Storm [49]. Storm and its successor Heron [28] are deployed at large scale at Twitter. For high performance, we implement Storm's "at most once" tuple processing mode. In this mode, tuples are allowed to be dropped under overload, eliminating the need to track tuples through the topology. For efficiency, Storm and Heron make use of multicore machines and deploy multiple workers per machine. We replicate this behavior.

FlexStorm uses DCCP [27] for flow control. DCCP supports various congestion control mechanisms but, unlike TCP, is a packet-oriented protocol. This simplifies implementation in FlexNIC. We use TCP's flow-control mechanism within DCCP, similar to the proposal in [17], but using TCP's cumulative acknowledgements instead of acknowledgement vectors and no congestion control of acknowledgements.

As topologies are often densely interconnected, both systems reduce the number of required network connections from per-worker to per-machine connections. On each machine, a demultiplexer thread is introduced that receives all incoming tuples and forwards them to the correct executor for processing.
Similarly, outgoing tuples are first relayed to a multiplexer thread that batches tuples before sending them onto their destination connections for better performance. Figure 6 shows this setup, which we replicate in FlexStorm.

Figure 6. Storm worker design: 2 counters and 2 rankers run concurrently in this node. (De-)mux threads route incoming/outgoing tuples among network connections and workers.

4.1 FlexNIC Implementation

As we will see later, software demultiplexing has high overhead and quickly becomes a bottleneck. We can use FlexNIC to mitigate this overhead by demultiplexing tuples in the NIC. Demultiplexing works in the same way as for FlexKVS, but does not require hashing. We strip incoming packets of their headers and deliver the contained tuples to the appropriate worker's tuple queue via a lookup table that assigns destination worker identifiers to queues. However, our task is complicated by the fact that we have to enforce flow control.

To implement flow control at the receiver in FlexNIC, we acknowledge every incoming tuple immediately and explicitly, by crafting an appropriate acknowledgement using the incoming tuple as a template. Figure 7 shows the required M+A pseudocode. To craft the acknowledgement, we swap source and destination port numbers, set the packet type appropriately, copy the incoming sequence number into the acknowledgement field, and compute the DCCP checksum. Finally, we send the reply IP packet, which applies all the appropriate modifications to form an IP response, such as swapping Ethernet and IP source and destination addresses and computing the IP checksum.

Match:
  IF ip.type == DCCP
  IF dccp.dstport == FlexStorm
Action:
  SWAP(dccp.srcport, dccp.dstport)
  dccp.type = DCCP ACK
  dccp.ack = dccp.seq
  dccp.checksum = CHECKSUM(dccp)
  IP REPLY

Figure 7. Acknowledging incoming FlexStorm tuples in FlexNIC.

To make use of FlexNIC, we need to adapt FlexStorm to read from our custom queue format, which we optimize to minimize PCIe round-trips by marking whether a queue position is taken with a special field in each tuple's header (a sketch of this slot format follows below). To do so, we replace FlexStorm's per-worker tuple queue implementation with one that supports this format, requiring a change of 100 lines of code to replace 4 functions and their data structures.

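A minimal C sketch of such a queue slot, assuming a header layout of our own choosing: the NIC sets a valid flag after writing the tuple, and the consumer polls that flag instead of fetching a separate queue index across PCIe.

    #include <stdatomic.h>
    #include <stdint.h>

    /* Polled tuple queue slot: a flag in each tuple header marks whether
     * the position holds a valid tuple. Layout and names are our
     * illustration, not the actual FlexStorm format. */
    struct tuple_hdr {
        _Atomic uint8_t valid;    /* set by the NIC after the payload lands */
        uint8_t         worker;   /* destination worker identifier */
        uint16_t        length;   /* payload bytes following this header */
    };

    /* Consumer side: poll the head slot; claim it once it becomes valid. */
    static struct tuple_hdr *poll_tuple(struct tuple_hdr *slot) {
        if (!atomic_load_explicit(&slot->valid, memory_order_acquire))
            return 0;             /* nothing new in this position */
        return slot;              /* caller processes, then clears valid */
    }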
4.2 Evaluation

We evaluate the performance of Storm and various FlexStorm configurations on the top-n user topology. Our input workload is a stream of 476 million Twitter tweets collected between June–Dec 2009 [29]. Figure 8 and Table 3 show average achievable throughput and latency at peak load on this workload. Throughput is measured in tuples processed per second over a runtime of 20 seconds. Latency is also measured per tuple and is broken down into time spent in processing, and in input and output queues, as measured at user level within FlexStorm. For comparison, data center network delay is typically on the order of tens to hundreds of microseconds.

Figure 8. Average top-n tweeter throughput (million tuples per second) for Apache Storm, FlexStorm/Linux, FlexStorm/Extaris, and FlexStorm/FlexNIC in the Balanced and Grouped configurations. Error bars show min/max over 20 runs.

We first compare the performance of FlexStorm to that of Apache Storm running on Linux. We tune Apache Storm for high performance: we use "at most once" processing, disable all logging and debugging, and configure the optimum amount of worker threads (equal to the number of hyperthreads minus two multiplexing threads). We deploy both systems in an identical fashion on 3 machines of our evaluation cluster. In this configuration, Apache Storm decides to run 4 replicas each of spouts, counters, and intermediate rankers. Storm distributes replicas evenly over the existing hyperthreads, and we follow this distribution in FlexStorm (Balanced configuration). By spreading the workload evenly over all machines, the amount of tuples that need to be processed at each machine is reduced, relieving the demultiplexing thread somewhat. To show the maximum attainable benefit with FlexNIC, we also run FlexStorm configurations where all counting workers are executing on the same machine (Grouped configuration). The counting workers have to sustain the highest amount of tuples and thus exert the highest load on the system.

Without FlexNIC, the performance of Storm and FlexStorm is roughly equivalent. There is a slight improvement in FlexStorm due to its simplicity. Both systems are limited by Linux kernel network stack performance. Even though per-tuple processing time in FlexStorm is short, tuples spend several milliseconds in queues after reception and before emission. Queueing before emission is due to batching in the multiplexing thread, which is configured to batch up to 10 milliseconds of tuples before emission in FlexStorm (Apache Storm uses up to 500 milliseconds). Input queueing is minimal in FlexStorm as it is past the bottleneck of the Linux kernel, and thus packets are queued at a lower rate than they are removed from the queue.
FlexStorm performance is degraded to 34% when grouping all counting workers, as all tuples now go through a single bottleneck kernel network stack, as opposed to three.

Running all FlexStorm nodes on the Extaris network stack yields a 2× (Balanced) throughput improvement. Input queueing delay has increased, as tuples are queued at a higher rate. The increase is offset by a decrease in output queueing delay, as packets can be sent faster due to the high-performance network stack. Overall, tuple processing latency has decreased 16% versus Linux. The grouped configuration attains a speedup of 2.84× versus the equivalent Linux configuration. The bottleneck in both cases is the demultiplexer thread.

           Input     Processing   Output   Total
Linux      6.68 µs   0.6 µs       12 ms    12 ms
Extaris    4 ms      0.8 µs       6 ms     10 ms
FlexNIC    –         0.8 µs       6 ms     6 ms

Table 3. Average FlexStorm tuple processing time.

Running all FlexStorm nodes on FlexNIC yields a 2.14× (Balanced) performance improvement versus the Extaris version. Using FlexNIC has eliminated the input queue, and latency has decreased by 40% versus Extaris. The grouped configuration attains a speedup of 2.31×. We are now limited by the line-rate of our Ethernet network card.

We conclude that moving application-level packet demultiplexing functionality into the NIC yields performance benefits, while reducing the amount of time that tuples are held in queues. This provides the opportunity for tighter real-time processing guarantees under higher workloads using the same equipment. The additional requirement for receive-side flow control does not pose an implementation problem to FlexNIC. Finally, demultiplexing in the NIC is more efficient: it eliminates the need for additional demultiplexing threads, which can grow large under high line-rates.
We can use these results to predict what would happen at higher line-rates. The performance difference of roughly 2× between FlexNIC and a fast software implementation shows that we would require at least 2 fully utilized hyperthreads to perform demultiplexing for FlexStorm at a line-rate of 10Gb/s. As line-rate increases, this number increases proportionally, taking away threads that could otherwise be used to perform more analytics.

5. Case Study: Intrusion Detection

In this section we discuss how we leverage flexible packet steering to improve the throughput achieved by the Snort intrusion detection system [43]. Snort detects malicious activity, such as buffer overflow attacks, malware, and injection attacks, by scanning for suspicious patterns in individual packet flows.

Snort uses only a single thread for processing packets, but is commonly scaled up to multiple cores by running multiple Snort processes that receive packets from separate NIC hardware queues via RSS [19]. However, since many Snort patterns match only on subsets of the port space (for example, only on source or only on destination ports), we end up using these patterns on many cores, as RSS spreads connections by a 4-tuple of IP addresses and port numbers. Snort's working data structures generally grow to 10s of megabytes for production rule sets, leading to high cache pressure. Furthermore, the Toeplitz hash commonly used for RSS is not symmetric for the source and destination fields, meaning the two directions of a single flow can end up in different queues, which can be problematic for some stateful analyses where incoming and outgoing traffic is examined.

5.1 FlexNIC Implementation

Our approach to improving Snort's performance is similar to FlexKVS. We improve Snort's cache utilization by steering packets to cores based on expected pattern access, so as to avoid replicating the same state across caches. Internally, Snort groups rules into port groups, and each group is then compiled to a deterministic finite automaton that implements pattern matching for the rules in the group. When processing a packet, Snort first determines the relevant port groups and then executes all associated automatons.

We instrument Snort to record each distinct set of port groups matched by each packet (which we call flow groups) and aggregate the number of packets that match this set and the total time spent processing these packets. In our experience, this approach results in 30-100 flow groups. We use these aggregates to generate a partition of flow groups to Snort processes, balancing the load of different flow groups using a simple greedy allocation algorithm that starts by assigning the heaviest flow groups first (sketched below).
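The greedy allocation can be stated in a few lines of C: sort flow groups by measured processing time, then always assign the heaviest remaining group to the least loaded Snort process. Types and names are ours, and the load array bound is an assumption.

    #include <stdint.h>
    #include <stdlib.h>

    struct flow_group {
        uint32_t id;
        uint64_t cost;    /* aggregate processing time from the trace prefix */
    };

    static int by_cost_desc(const void *a, const void *b) {
        uint64_t ca = ((const struct flow_group *)a)->cost;
        uint64_t cb = ((const struct flow_group *)b)->cost;
        return (ca < cb) - (ca > cb);   /* descending by cost */
    }

    /* After the call, assign[i] is the Snort process for groups[i]
     * (groups ends up sorted by descending cost). */
    static void partition(struct flow_group *groups, int ngroups,
                          int nprocs, int *assign) {
        uint64_t load[64] = { 0 };      /* assumes nprocs <= 64 */

        qsort(groups, ngroups, sizeof(*groups), by_cost_desc);
        for (int i = 0; i < ngroups; i++) {
            int best = 0;               /* least loaded process so far */
            for (int p = 1; p < nprocs; p++)
                if (load[p] < load[best])
                    best = p;
            assign[i] = best;
            load[best] += groups[i].cost;
        }
    }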
This partitioning can then be used in FlexNIC to steer packets to individual Snort instances by creating a mapping table that maps packets to cores, similar to the approach in FlexStorm. We also remedy the issue with the Toeplitz hash by instructing FlexNIC to order the 4-tuple fields arithmetically by increasing value before calculating the hash, which eliminates the asymmetry.

5.2 Evaluation

We evaluate the FlexNIC-based packet steering by comparing it to basic hash-based steering. For our experiment, we use a 24GB pcap trace of the ICTF 2010 competition [1] that is replayed at 1Gbps. To obtain the flow group partition, we use a prefix of 1/50th of the trace, and then use the rest to evaluate performance. For this experiment, we run 4 Snort instances on 4 cores on one socket of a two-socket 12-core Intel Xeon L5640 (Westmere) system at 2.2 GHz with 14MB total cache space. The other socket is used to replay the trace via Soft-FlexNIC implementing either configuration.

Table 4 shows the throughput and L3 cache behavior for hashing and our improved FlexNIC steering. Throughput, both in terms of packets and bytes processed, increases by roughly 60%, meaning more of the packets on the network are received and processed without being dropped. This improvement is due to the improved cache locality: the number of accesses to the L3 cache per packet is reduced by 56% because they hit in the L2 or L1 cache, and the number of misses in the L3 cache per packet is also reduced by roughly 28%. These performance improvements were achieved without modifying Snort.

              Hashing                   FlexNIC
           Min     Avg     Max      Min     Avg     Max
Kpps       103.0   103.7   104.7    166.4   167.3   167.9
Mbps       435.9   439.8   444.6    710.4   715.6   718.6
Accesses   546.0   553.3   559.3    237.0   241.4   251.0
Misses     26.7    26.7    26.7     19.0    19.2    19.2

Table 4. Snort throughput and L3 cache behavior over 10 runs.

We conclude that flexible demultiplexing can improve application performance and cache utilization even without requiring application modification. Application memory access patterns can be analyzed separately from applications, and appropriate rules can be inserted into FlexNIC by the system administrator.

6. Other Applications

So far we discussed our approach in the context of single applications and provided case studies that quantify the application-level improvements made possible by FlexNIC. In this section, we identify additional use cases with some preliminary analysis of how FlexNIC can be of assistance in these scenarios. First, we discuss how to enhance and optimize OS support for virtualization (§6.1). Then, we present a number of additional use cases that focus on the use of flexible I/O to improve memory system performance (§6.2).

6.1 Virtualization Support

FlexNIC's functionality can aid virtualization in a number of different ways. We discuss below how FlexNIC can optimize the performance of a virtual switch, how it can provide better resource isolation guarantees, and how it can assist in transparent handling of resource exhaustion and failure scenarios.

Virtual switch. Open vSwitch (OvS) [40] is an open source virtual switch designed for networking in virtualized data center environments. OvS forwards packets among physical and virtual NIC ports, potentially modifying packets on the way. It exposes standard management interfaces and allows advanced forwarding functions to be programmed via OpenFlow [31] M+A rules, a more limited precursor to P4 [9]. OvS is commonly used in data centers today.
We conclude that today's data center environments require flexible packet handling at the edge, which in turn requires CPU resources. While the amount of CPU consumed is tenable at 10 Gb/s, it can quickly become prohibitive as line rates increase. FlexNIC allows offloading a large fraction of these overheads, freeing up valuable CPU resources to run application workloads.

Resource isolation. The ability to isolate and prioritize network flows is important in a range of load conditions [3]. A server receiving more requests than it can handle could decide to only accept specific request types, for example based on a priority specified in the request header. On servers that are not fully saturated, careful scheduling can reduce average latency, for example by prioritizing cheap requests over expensive requests in a certain ratio, thereby reducing the overall tail latency. Achieving isolation implies minimizing the resources that must be invested before a priority can be assigned or a request can be dropped [14; 33]. As network bandwidth increases, the cost of classifying requests in software can quickly become prohibitive. FlexNIC can prioritize requests in hardware and enqueue requests in different software queues based on priority, or even reject requests, possibly after notifying the client.
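A sketch of this classification step is shown below, assuming requests carry an application-defined priority byte at a known header offset; the field name, queue sizes, and two-class split are illustrative only.

    # Priority-based request steering: classify on an application-defined
    # header field, enqueue per class, and reject when a class overflows.

    from collections import deque

    HIGH, LOW = 0, 1
    QUEUE_LIMIT = 1024                 # per-class drop threshold (assumed)
    queues = [deque(), deque()]        # one software queue per priority class

    def classify_and_enqueue(pkt):
        """Steer a request by its priority header; False means rejected."""
        prio = HIGH if pkt['priority'] == 0 else LOW
        q = queues[prio]
        if len(q) >= QUEUE_LIMIT:
            return False               # drop/reject; optionally notify client
        q.append(pkt)
        return True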
Resource virtualization. Cloud services can be consolidated on a single server using resource virtualization. This allows the sharing of common hardware resources using techniques such as memory overcommit to save cost. However, traditional DMA operations require the corresponding host memory to be present, which is achieved by pinning that memory in the OS. With multi-queue NICs, kernel-bypass [39; 5], and RDMA, the amount of memory pinned permanently can be very large (cf. FaRM [13]), preventing effective server consolidation. Even without pinning, accesses to non-present pages are generally rare, so all that is required is a way to gracefully handle faults when they occur. Using FlexNIC, the OS can insert a rule that matches on DMA accesses to non-present regions and redirects them to a slow-path that implements these accesses in software. In this way, these faults are handled in a manner fully transparent to applications.

Fast failover. User-level services can fail. In that case, a hot standby or replica can be an excellent fail-over point if we can fail over quickly. Using FlexNIC, an OS that detects a failed user-level application can insert a rule to redirect traffic to a replica, even if it resides on another server. Redirection can be implemented by a simple rule that matches the application's packets and then forwards incoming packets by rewriting the headers and enqueuing them to be sent out through the egress pipeline. No application-level modifications are necessary for this approach and redirection can occur with minimal overhead.
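The redirect rule itself is small. The sketch below emulates it in software; the header field names and the replica addresses are hypothetical placeholders.

    # Failover redirect: match packets for the failed service, rewrite the
    # destination headers, and enqueue them on the egress pipeline.

    REPLICA_IP = '10.0.1.7'              # hypothetical replica address
    REPLICA_MAC = '52:54:00:aa:bb:cc'

    def failover_redirect(pkt, failed_ip):
        """Rewrite-and-resend rule installed by the OS after a failure."""
        if pkt['dst_ip'] != failed_ip:
            return pkt                   # no match: normal receive path
        pkt['dst_ip'] = REPLICA_IP       # rewrite headers toward the replica
        pkt['dst_mac'] = REPLICA_MAC
        pkt['egress'] = True             # send back out instead of delivering
        return pkt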
6.2 Flexible Memory Steering

FlexNIC's flexible packet processing provides additional opportunities for enhancing communication abstractions as well as optimizing memory system performance. We describe some of these use cases below.

Consistent RDMA. RDMA provides low latency access to remote memory. However, it does not support consistency of concurrently accessed data structures. This vastly complicates the design of applications sharing memory via RDMA (e.g., self-verifying data structures in Pilaf [32]). FlexNIC can be used to enhance RDMA with simple application-level consistency properties. For example, when concurrently writing to a hashtable, FlexNIC could check whether the target memory already contains an entry by atomically testing and setting a designated memory location [37] and, if so, either defer handling of the operation to the CPU or fail it.
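A minimal sketch of this gating check follows. It models the NIC's atomic test-and-set with a lock in host Python; the slot layout, class name, and claim policy are our assumptions for illustration.

    # Gate incoming writes on a per-slot "occupied" flag: an atomic
    # test-and-set decides between applying the write and deferring it.

    import threading

    class WriteGate:
        def __init__(self, num_slots):
            self.flags = bytearray(num_slots)   # 0 = empty, 1 = occupied
            self.lock = threading.Lock()        # stands in for a PCIe atomic [37]

        def try_claim(self, slot):
            """Atomically test-and-set; True means the write may proceed."""
            with self.lock:
                if self.flags[slot]:
                    return False    # occupied: defer to the CPU or fail
                self.flags[slot] = 1
                return True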
Cache flow control. A common problem when streaming data to caches (as in DDIO-based systems) is that the limited amount of available host cache memory can easily overflow with incoming network packets if software is not quick enough to process them. In this case, the cache spills older packets to DRAM, causing cache thrashing and performance collapse when software pulls them back into the cache to process them. To prevent this, we can implement a simple credit-based flow control scheme: the NIC increments an internal counter for each packet written to a host cache. If the counter is above a threshold, the NIC instead writes to DRAM. Software acknowledges finished packets by sending an application-defined packet via the NIC that decrements the counter. This ensures that packets stay in the cache for as long as needed to keep performance stable.
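The scheme amounts to a small state machine on the NIC, sketched below from the NIC's point of view; the credit limit and acknowledgment granularity are assumed values.

    # Credit-based cache flow control: steer packet writes to cache while
    # credits remain, spill to DRAM otherwise, and refill on software acks.

    CACHE_CREDITS = 256        # packets allowed in cache at once (assumed)

    class CacheFlowControl:
        def __init__(self, credits=CACHE_CREDITS):
            self.in_cache = 0
            self.credits = credits

        def steer(self, pkt):
            """Pick the DMA target for an incoming packet."""
            if self.in_cache < self.credits:
                self.in_cache += 1
                return 'cache'         # e.g., a DDIO write into the L3
            return 'dram'              # over threshold: avoid thrashing

        def on_software_ack(self, n=1):
            """Application-defined ack packet returns n credits."""
            self.in_cache = max(0, self.in_cache - n)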
GPU networking. General-purpose computation capabilities of modern GPUs are increasingly being exploited to accelerate network applications [18; 24; 48; 26]. So far, all these approaches require CPU involvement on the critical path for GPU-network communication. GPUnet [26] exploits P2P DMA to write packet payloads directly into GPU memory from the NIC, but the CPU is still required to forward packet notifications to/from the GPU using an in-memory ring buffer. Further, GPUnet relies on RDMA for offloading protocol processing, to minimize inefficient sequential processing on the GPU. With FlexNIC, notifications about received packets can be written directly to the ring buffer because arbitrary memory writes can be crafted, thereby removing the CPU from the critical path for receiving packets. In addition, FlexNIC's offload capabilities can enable the use of conventional protocols such as UDP by offloading header verification to hardware.
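To illustrate the notification path, the sketch below emulates the NIC crafting a fixed-size descriptor write into a ring buffer that a GPU kernel would poll. The descriptor layout, ring size, and function names are purely illustrative assumptions.

    # Emulated NIC-side notification write into a GPU-polled ring buffer.

    import struct

    RING_SLOTS = 1024
    DESC_FMT = '<QI4x'          # payload address (u64), length (u32), padding
    DESC_SIZE = struct.calcsize(DESC_FMT)

    ring = bytearray(RING_SLOTS * DESC_SIZE)   # stands in for GPU memory
    head = 0

    def notify_gpu(payload_addr, length):
        """Craft the descriptor write the NIC would perform on packet arrival."""
        global head
        slot = head % RING_SLOTS
        struct.pack_into(DESC_FMT, ring, slot * DESC_SIZE, payload_addr, length)
        head += 1               # the GPU kernel consumes slots in order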
7. Related Work

NIC-Software co-design. Previous attempts to improve NIC processing performance used new software interfaces to reduce the number of required PCIe transitions [16] and to enable kernel-bypass [41; 50; 15]. Remote direct memory access (RDMA) goes a step further, entirely bypassing the remote CPU for this specific use-case. Scale-out NUMA [36] extends the RDMA approach by integrating a remote memory access controller with the processor cache hierarchy that automatically translates certain CPU memory accesses into remote memory operations. Portals [4] is similar to RDMA, but adds a set of offloadable memory and packet send operations triggered upon matching packet arrival. A machine model has been specified to allow offload of various communication protocols using these operations [12]. Our approach builds upon these ideas by providing a NIC programming model that can leverage these features in an application-defined way to support efficient data delivery to the CPU, rather than complete offload to the NIC.

High-performance software design. High-performance key-value store implementations, such as HERD [25], Pilaf [32], and MICA [30], use NIC hardware features, such as RDMA, to gain performance. They require client modifications to be able to make use of these features. For example, they require clients to issue RDMA reads (Pilaf) or writes (HERD) to bypass network stacks, or for clients to compute hashes (Pilaf, MICA). FaRM [13] generalizes this approach to other distributed systems. In contrast, FlexNIC requires modifications only in the server's NIC.

Furthermore, FlexNIC allows logic to be implemented at the server side, reducing the risk of client-side attacks and allowing the use of server-side state without long round-trip delays. For example, FlexKVS can locally change the runtime key partitioning among cores without informing the client, something that would otherwise be impractical with tens of thousands of clients.

MICA [30] can be run either in exclusive read exclusive write (EREW) or concurrent read exclusive write (CREW) mode. EREW eliminates any cache coherence traffic between cores, while CREW allows reads to popular keys to be handled by multiple cores at the cost of cache coherence traffic for writes and reduced cache utilization, because the corresponding cache lines can be in multiple cores' private caches. FlexKVS can decide which mode to use for each individual key on the server side based on the key's access pattern, allowing EREW to be used for the majority of keys and CREW to be enabled for very read-heavy keys.

Programmable network hardware. In the wake of the software-defined networking trend, a rich set of customizable switch data planes has been proposed [9; 31]. For example, the P4 programming language proposal [9] allows users rich switching control based on arbitrary packet fields, independent of the underlying switch hardware. We adapt this idea to provide similar functionality for the NIC-host interface. SoftNIC [19] is an attempt to customize NIC functionality by using dedicated CPU cores running software extensions in the data path. We extend upon this approach in Soft-FlexNIC.

On the host, various programmable NIC architectures have been developed. Fully programmable network processors are commercially available [35; 10], as are FPGA-based solutions [51]. Previous research on offloading functionality to these NICs has focused primarily on entire applications, such as key-value storage [7; 11] and map-reduce functionality [45]. In contrast, FlexNIC allows for the flexible offload of performance-relevant packet processing functionality, while allowing the programmer the flexibility to keep the rest of the application code on the CPU, improving programmer productivity and shortening software development cycles.

Direct cache access. Data Direct I/O (DDIO) [21] attaches PCIe directly to an L3 cache. DDIO is software-transparent and as such does not easily support integration with higher levels of the cache hierarchy. Prior to DDIO, direct cache access (DCA) [20] supported cache prefetching via PCIe hints sent by devices and is now a PCI standard [38]. With FlexNIC support, DCA tags could be re-used by systems to fetch packets from the L3 into L1 caches. While not yet supported by commodity servers, tighter cache integration has been shown to be beneficial in systems that integrate NICs with CPUs [6].

8. Conclusion

As multicore servers scale, the boundaries between NIC and switch are blurring: NICs need to route packets to appropriate cores, but also transform and filter them to reduce software processing and memory subsystem overheads. FlexNIC provides this functionality by extending the NIC DMA interface. Performance evaluation for a number of applications shows throughput improvements of up to 2.3× when packets are demultiplexed based on application-level information, and up to 60% reduction in request processing time if the DMA interface is specialized for the application on a commodity 10 Gb/s network. These results increase proportionally with the network line-rate.

FlexNIC also integrates nicely with current software-defined networking trends. Core network switches can tag packets with actions that can be executed by FlexNIC. For example, this allows us to exert fine-grained server control, such as load balancing to individual server cores, at a programmable core network switch when this is more efficient.

Acknowledgments

This work was supported by NetApp, Google, and the National Science Foundation (CNS-0963754, CNS-1518702). We would like to thank the anonymous reviewers and our shepherd, Steve Reinhardt, for their comments and feedback. We also thank Taesoo Kim, Luis Ceze, Jacob Nelson, Timothy Roscoe, and John Ousterhout for their input on early drafts of this paper. Jialin Li and Naveen Kr. Sharma collaborated on a class project providing the basis for FlexNIC.

References

[1] http://ictf.cs.ucsb.edu/ictfdata/2010/dumps/.
[2] http://memcached.org/.
[3] G. Banga, P. Druschel, and J. C. Mogul. Resource containers: A new facility for resource management in server systems. In 3rd USENIX Symposium on Operating Systems Design and Implementation, OSDI, 1999.
[4] B. W. Barrett, R. Brightwell, S. Hemmert, K. Pedretti, K. Wheeler, K. Underwood, R. Riesen, A. B. Maccabee, and T. Hudson. The Portals 4.0.1 Network Programming Interface. Sandia National Laboratories, sand2013-3181 edition, Apr. 2013.
[5] A. Belay, G. Prekas, A. Klimovic, S. Grossman, C. Kozyrakis, and E. Bugnion. IX: A protected dataplane operating system for high throughput and low latency. In 11th USENIX Symposium on Operating Systems Design and Implementation, OSDI, 2014.
[6] N. L. Binkert, A. G. Saidi, and S. K. Reinhardt. Integrated network interfaces for high-bandwidth TCP/IP. In 12th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS, 2006.
[7] M. Blott, K. Karras, L. Liu, K. A. Vissers, J. Bär, and Z. István. Achieving 10Gbps line-rate key-value stores with FPGAs. In 5th USENIX Workshop on Hot Topics in Cloud Computing, HotCloud, 2013.
[8] P. Bosshart, G. Gibb, H.-S. Kim, G. Varghese, N. McKeown, M. Izzard, F. Mujica, and M. Horowitz. Forwarding metamorphosis: Fast programmable match-action processing in hardware for SDN. In 2013 ACM Conference on SIGCOMM, SIGCOMM, 2013.
[9] P. Bosshart, D. Daly, G. Gibb, M. Izzard, N. McKeown, J. Rexford, C. Schlesinger, D. Talayco, A. Vahdat, G. Varghese, and D. Walker. P4: Programming protocol-independent packet processors. SIGCOMM Computer Communication Review, 44(3):87–95, July 2014.
[10] Cavium Corporation. OCTEON II CN68XX multi-core MIPS64 processors. http://www.cavium.com/pdfFiles/CN68XX_PB_Rev1.pdf.
[11] S. R. Chalamalasetti, K. Lim, M. Wright, A. AuYoung, P. Ranganathan, and M. Margala. An FPGA Memcached appliance. In 21st ACM/SIGDA International Symposium on Field Programmable Gate Arrays, FPGA, 2013.
[12] S. Di Girolamo, P. Jolivet, K. Underwood, and T. Hoefler. Exploiting offload enabled network interfaces. In 23rd IEEE Symposium on High Performance Interconnects, HOTI, 2015.
[13] A. Dragojević, D. Narayanan, O. Hodson, and M. Castro. FaRM: Fast remote memory. In 11th USENIX Symposium on Networked Systems Design and Implementation, NSDI, 2014.
[14] P. Druschel and G. Banga. Lazy receiver processing (LRP): A network subsystem architecture for server systems. In 2nd USENIX Symposium on Operating Systems Design and Implementation, OSDI, 1996.
[15] P. Druschel, L. Peterson, and B. Davie. Experiences with a high-speed network adaptor: A software perspective. In 1994 ACM Conference on SIGCOMM, SIGCOMM, 1994.
[16] M. Flajslik and M. Rosenblum. Network interface design for low latency request-response protocols. In 2013 USENIX Annual Technical Conference, ATC, 2013.
[17] S. Floyd and E. Kohler. Profile for datagram congestion control protocol (DCCP) congestion control ID 2: TCP-like congestion control. RFC 4341, Mar. 2006.
[18] S. Han, K. Jang, K. Park, and S. Moon. PacketShader: A GPU-accelerated software router. In 2010 ACM Conference on SIGCOMM, SIGCOMM, 2010.
[19] S. Han, K. Jang, A. Panda, S. Palkar, D. Han, and S. Ratnasamy. SoftNIC: A software NIC to augment hardware. Technical Report UCB/EECS-2015-155, EECS Department, University of California, Berkeley, May 2015. http://www.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-155.html.
[20] R. Huggahalli, R. Iyer, and S. Tetrick. Direct cache access for high bandwidth network I/O. In 32nd Annual International Symposium on Computer Architecture, ISCA, 2005.
[21] Intel Corporation. Intel data direct I/O technology (Intel DDIO): A primer, Feb. 2012. Revision 1.0. http://www.intel.com/content/dam/www/public/us/en/documents/technology-briefs/data-direct-i-o-technology-brief.pdf.
[22] Intel Corporation. Flow APIs for hardware offloads. Open vSwitch Fall Conference Talk, Nov. 2014. http://openvswitch.org/support/ovscon2014/18/1430-hardware-based-packet-processing.pdf.
[23] Intel Corporation. Intel 82599 10 GbE controller datasheet, Oct. 2015. Revision 3.2. http://www.intel.com/content/dam/www/public/us/en/documents/datasheets/82599-10-gbe-controller-datasheet.pdf.
[24] K. Jang, S. Han, S. Han, S. Moon, and K. Park. SSLShader: Cheap SSL acceleration with commodity processors. In 8th USENIX Symposium on Networked Systems Design and Implementation, NSDI, 2011.
[25] A. Kalia, M. Kaminsky, and D. G. Andersen. Using RDMA efficiently for key-value services. In 2014 ACM Conference on SIGCOMM, SIGCOMM, 2014.
[26] S. Kim, S. Huh, X. Zhang, Y. Hu, A. Wated, E. Witchel, and M. Silberstein. GPUnet: Networking abstractions for GPU programs. In 11th USENIX Symposium on Operating Systems Design and Implementation, OSDI, 2014.
[27] E. Kohler, M. Handley, and S. Floyd. Datagram congestion control protocol (DCCP). RFC 4340, Mar. 2006.
[28] S. Kulkarni, N. Bhagat, M. Fu, V. Kedigehalli, C. Kellogg, S. Mittal, J. M. Patel, K. Ramasamy, and S. Taneja. Twitter Heron: Stream processing at scale. In 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD, 2015.
[29] J. Leskovec and A. Krevl. SNAP Datasets: Stanford large network dataset collection, June 2014. http://snap.stanford.edu/data.
[30] H. Lim, D. Han, D. G. Andersen, and M. Kaminsky. MICA: A holistic approach to fast in-memory key-value storage. In 11th USENIX Symposium on Networked Systems Design and Implementation, NSDI, 2014.
[31] N. McKeown, T. Anderson, H. Balakrishnan, G. Parulkar, L. Peterson, J. Rexford, S. Shenker, and J. Turner. OpenFlow: Enabling innovation in campus networks. SIGCOMM Computer Communication Review, 38(2):69–74, Mar. 2008.
[32] C. Mitchell, Y. Geng, and J. Li. Using one-sided RDMA reads to build a fast, CPU-efficient key-value store. In 2013 USENIX Annual Technical Conference, ATC, 2013.
[33] J. C. Mogul and K. K. Ramakrishnan. Eliminating receive livelock in an interrupt-driven kernel. ACM Transactions on Computer Systems, 15(3):217–252, Aug. 1997.
[34] D. Molka, D. Hackenberg, and R. Schöne. Main memory and cache performance of Intel Sandy Bridge and AMD Bulldozer. In 2014 Workshop on Memory Systems Performance and Correctness, MSPC, 2014.
[35] Netronome. NFP-6xxx flow processor. https://netronome.com/product/nfp-6xxx/.
[36] S. Novakovic, A. Daglis, E. Bugnion, B. Falsafi, and B. Grot. Scale-out NUMA. In 19th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS, 2014.
[37] PCI-SIG. Atomic operations. PCI-SIG Engineering Change Notice, Jan. 2008. https://www.pcisig.com/specifications/pciexpress/specifications/ECN_Atomic_Ops_080417.pdf.
[38] PCI-SIG. TLP processing hints. PCI-SIG Engineering Change Notice, Sept. 2008. https://www.pcisig.com/specifications/pciexpress/specifications/ECN_TPH_11Sept08.pdf.
[39] S. Peter, J. Li, I. Zhang, D. R. K. Ports, D. Woos, A. Krishnamurthy, T. Anderson, and T. Roscoe. Arrakis: The operating system is the control plane. In 11th USENIX Symposium on Operating Systems Design and Implementation, OSDI, 2014.
[40] B. Pfaff, J. Pettit, T. Koponen, E. Jackson, A. Zhou, J. Rajahalme, J. Gross, A. Wang, J. Stringer, P. Shelar, K. Amidon, and M. Casado. The design and implementation of Open vSwitch. In 12th USENIX Symposium on Networked Systems Design and Implementation, NSDI, 2015.
[41] I. Pratt and K. Fraser. Arsenic: A user-accessible Gigabit Ethernet interface. In 20th IEEE International Conference on Computer Communications, INFOCOM, 2001.
[42] RDMA Consortium. Architectural specifications for RDMA over TCP/IP. http://www.rdmaconsortium.org/.
[43] M. Roesch. Snort - lightweight intrusion detection for networks. In 13th USENIX Conference on System Administration, LISA, 1999.
[44] M. Rosenblum and J. K. Ousterhout. The design and implementation of a log-structured file system. ACM Transactions on Computer Systems, 10(1):26–52, Feb. 1992.
[45] Y. Shan, B. Wang, J. Yan, Y. Wang, N. Xu, and H. Yang. FPMR: MapReduce framework on FPGA. In 18th ACM/SIGDA International Symposium on Field Programmable Gate Arrays, FPGA, 2010.
[46] P. Shinde, A. Kaufmann, T. Roscoe, and S. Kaestle. We need to talk about NICs. In 14th Workshop on Hot Topics in Operating Systems, HOTOS, 2013.
[47] A. Singh, J. Ong, A. Agarwal, G. Anderson, A. Armistead, R. Bannon, S. Boving, G. Desai, B. Felderman, P. Germano, A. Kanagala, J. Provost, J. Simmons, E. Tanda, J. Wanderer, U. Hölzle, S. Stuart, and A. Vahdat. Jupiter rising: A decade of Clos topologies and centralized control in Google's datacenter network. In 2015 ACM Conference on SIGCOMM, SIGCOMM, 2015.
[48] W. Sun and R. Ricci. Fast and flexible: Parallel packet processing with GPUs and Click. In 9th ACM/IEEE Symposium on Architectures for Networking and Communications Systems, ANCS, 2013.
[49] A. Toshniwal, S. Taneja, A. Shukla, K. Ramasamy, J. M. Patel, S. Kulkarni, J. Jackson, K. Gade, M. Fu, J. Donham, N. Bhagat, S. Mittal, and D. Ryaboy. Storm@Twitter. In 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD, 2014.
[50] T. von Eicken, A. Basu, V. Buch, and W. Vogels. U-Net: A user-level network interface for parallel and distributed computing. In 15th ACM Symposium on Operating Systems Principles, SOSP, 1995.
[51] N. Zilberman, Y. Audzevich, G. Covington, and A. Moore. NetFPGA SUME: Toward 100 Gbps as research commodity. IEEE Micro, 34(5):32–41, Sept. 2014.