IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 25, NO. 1, FEBRUARY 2017

Design and Implementation of a Stateful Network Packet Processing Framework for GPUs

Giorgos Vasiliadis, Lazaros Koromilas, Michalis Polychronakis, and Sotiris Ioannidis

Abstract— Graphics processing units (GPUs) are a powerful platform for building high-speed network traffic processing applications using low-cost hardware. Existing systems tap the massively parallel architecture of GPUs to speed up certain computationally intensive tasks, such as cryptographic operations and pattern matching. However, they still suffer from significant overheads due to critical-path operations that are still being carried out on the CPU, and redundant inter-device data transfers. In this paper, we present GASPP, a programmable network traffic processing framework tailored to modern graphics processors. GASPP integrates optimized GPU-based implementations of a broad range of operations commonly used in network traffic processing applications, including the first purely GPU-based implementation of network flow tracking and TCP stream reassembly. GASPP also employs novel mechanisms for tackling control flow irregularities across SIMT threads, and for sharing memory context between the network interfaces and the GPU. Our evaluation shows that GASPP can achieve multigigabit traffic forwarding rates even for complex and computationally intensive network operations, such as stateful traffic classification, intrusion detection, and packet encryption. Especially when consolidating multiple network applications on the same system, GASPP achieves up to 16.2× speedup compared with different monolithic GPU-based implementations of the same applications.

Index Terms— Network packet processing, CUDA, GPU.

Manuscript received August 23, 2015; revised May 3, 2016; accepted July 10, 2016; approved by IEEE/ACM TRANSACTIONS ON NETWORKING Editor N. A. L. Reddy. Date of publication August 24, 2016; date of current version February 14, 2017. This work was supported in part by the European Commission through the H2020 ICT-32-2014 Project SHARCS under Grant 644571 and in part by the H2020 ICT-07-2014 Project RAPID under Grant 644312.
G. Vasiliadis is with Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha 5825, Qatar (e-mail: [email protected]).
L. Koromilas and S. Ioannidis are with the Foundation for Research & Technology–Hellas, Heraklion 71110, Greece (e-mail: [email protected]; [email protected]).
M. Polychronakis is with Stony Brook University, Stony Brook, NY 11794 USA (e-mail: [email protected]).
Digital Object Identifier 10.1109/TNET.2016.2597163

I. INTRODUCTION

Computer networks have been growing in size, complexity, and connection speeds [6], [13]. Especially in access and backbone links, speeds typically reach tens or hundreds of Gbit/s. At the same time, networking applications become diversified and traffic processing gets more sophisticated, requiring data-plane operations beyond the traditional operations of the lower networking layers, such as forwarding and routing. Coping with the increasing network capacity and complexity necessitates pushing the performance of network packet processing applications as high as possible.

Previously, programmable special-purpose hardware solutions, such as FPGAs, Network Processors (NPUs), and TCAMs, have greatly reduced both the cost and time to develop network traffic processing systems, and have been successfully used in routers [4], [7] and network intrusion detection systems [27], [34]. These systems offer a scalable method of processing network packets in high-speed environments. However, implementations based on special-purpose hardware are very difficult to extend and program, and this prohibits them from being widely adopted by the industry.

In contrast, the emergence of commodity many-core architectures, such as multicore CPUs and modern graphics processors (GPUs), has proven to be a good solution for accelerating many network applications, and has led to their successful deployment in high-speed environments [16], [19], [21], [22], [38]. Recent trends have shown that certain network packet processing operations can be implemented efficiently on GPU architectures. Typically, such operations are either computationally intensive (e.g., encryption [22]), memory-intensive (e.g., IP routing [19]), or both (e.g., intrusion detection and prevention [21], [36], [38]). These operations are facilitated by modern GPU architectures, which offer high computational throughput and hide excessive memory latencies.

Unfortunately, the lack of programming abstractions and libraries for GPU-based network traffic processing—even for simple tasks such as packet decoding and filtering—increases significantly the programming effort needed to build, extend, and maintain high-performance GPU-based network applications. More complex critical-path operations, such as flow tracking and TCP stream reassembly, currently still run on the CPU, negatively offsetting any performance gains by the offloaded GPU operations. The absence of adequate OS support also increases the cost of data transfers between the host and I/O devices. For example, packets have to be transferred from the network interface to the user-space context of the application, and from there to kernel space in order to be transferred to the GPU. Although programmers can explicitly optimize data movements, this increases the design complexity and code size of even simple GPU-based packet processing programs.

As a step towards tackling the above challenges, we present GASPP, a network traffic processing framework tailored to modern graphics processors. GASPP integrates into a purely GPU-powered implementation many of the most common operations used by different types of network traffic processing applications, including the first GPU-based implementation of network flow tracking and TCP stream reassembly.

By hiding complicated network processing tasks while providing a rich and expressive interface that exposes only the data that matters to applications, GASPP allows developers to build complex GPU-based network traffic processing applications in a flexible and efficient way.

We have developed and integrated into GASPP novel techniques for sharing memory context between network interfaces and the GPU to avoid redundant data movement, and for scheduling packets in an efficient way that increases the utilization of the GPU and the shared PCIe bus. Overall, GASPP allows applications to scale in terms of performance, and carry out on the CPU only infrequently occurring operations. The main contributions of our work are:
• We have designed, implemented, and evaluated GASPP, a novel GPU-based framework for high-performance network traffic processing, which eases the development of applications that process data at multiple layers of the protocol stack.
• We present the first (to the best of our knowledge) purely GPU-based implementation of flow state management and TCP stream reconstruction.
• We present a novel packet scheduling technique that tackles control flow irregularities and load imbalance across GPU threads.
• We present a zero-copy mechanism between the network interfaces and the GPU, which in certain cases avoids redundant memory copies, significantly increasing the throughput of cross-device data transfers.

II. MOTIVATION

A. The Need for Modularity

The rise of general-purpose computing on GPUs (GPGPU) and related frameworks, such as CUDA and OpenCL, has made the implementation of GPU-accelerated applications easier than ever. Unfortunately, the majority of GPU-assisted network applications follow a monolithic design, lacking both modularity and flexibility. As a result, building, maintaining, and extending such systems eventually becomes a real burden. In addition, the absence of libraries for core network processing operations—even for basic tasks like packet decoding or filtering—increases development costs even further. GASPP integrates a broad range of operations that different types of network applications rely on, with all the advantages of a GPU-powered implementation, into a single application development platform. This allows developers to focus on core application logic, alleviating the low-level technical challenges of data transfer to and from the GPU, packet batching, asynchronous execution, synchronization issues, connection state management, and so on.

B. Eliminating Redundant Data Transfers

A major drawback in current OSes is the number of data transfers required between the host and I/O devices. The problem of data transfers between the CPU and the GPU is well known in the GPGPU community, and several mechanisms have been introduced to hide the overhead of excessive data transfers. For example, page-locked memory buffers and asynchronous transfer mechanisms using DMA can decrease transfer latencies by overlapping communication with computation. However, an application may need to copy data from one page-locked memory area to another, which is actually the case in network traffic processing applications: a GPU-accelerated network application has to move packets from the NIC's memory to the GPU's memory. Addressing such redundancy for cross-device communication requires OS-level support and an appropriate programming interface.

C. The Need for Stateful Processing

Flow tracking and TCP stream reconstruction are mandatory operations required by a broad range of network applications. For instance, intrusion detection and traffic classification systems typically inspect the application-layer stream to identify patterns that span multiple packets and to thwart evasion attacks [15], [40]. Existing GPU-assisted network processing applications, however, just offload to the GPU certain data-parallel tasks, and are saturated by the many computationally heavy operations that are still being carried out on the CPU, such as network flow tracking, TCP stream reassembly, and protocol parsing [21], [38].

The most common approach for stateful processing is to buffer incoming packets, reassemble them, and deliver "chunks" of the reassembled stream to higher-level processing elements [10], [12]. A major drawback of this approach is that it requires several data copies and significant extra memory space. In Gigabit networks, where packet intervals can be as short as 1.25 µsec (in a 10GbE network, for an MTU of 1.5 KB), packet buffering requires large amounts of memory even for very short time windows. To address these challenges, the primary objectives of our GPU-based stateful processing implementation are: (i) process as many packets as possible on the fly (instead of buffering them), and (ii) ensure that packets of the same connection are processed in order.

Fig. 1. GASPP architecture.

III. DESIGN

The high-level design of GASPP is shown in Figure 1. Packets are transferred to the memory space of the GPU in batches. To prevent packets of the same connection from being processed by different GPU devices (which would break locality and require the sharing of state between different GPUs), the network traffic that is received by each NIC is transferred, statically, to a specific GPU device. The GPU then classifies the captured packets according to their protocol and processes them in parallel. For stateful protocols, connection state management and TCP stream reconstruction are supported for delivering a consistent application-layer byte stream.
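From the host side, this batched transfer and the double-buffered pipeline described in Section III-A can be sketched as follows. This is an illustration only: gaspp_kernel() and recv_next_pkt() are hypothetical placeholders for the GPU module pipeline and the netmap receive path, and the batch size and slot size are taken from values reported later in the paper.

/* Minimal sketch of a per-worker batching loop with double buffering.
 * gaspp_kernel() and recv_next_pkt() are illustrative placeholders.   */
#include <cuda_runtime.h>

#define SLOT_SIZE  1536      /* fixed-size slot per packet (Section V-A)   */
#define BATCH_PKTS 8192      /* packets per batch (cf. Section VII-C)      */

extern int recv_next_pkt(unsigned char *slot, int maxlen);  /* netmap Rx   */

__global__ void gaspp_kernel(unsigned char *pkts, int npkts)
{
    /* the registered module chain of Fig. 3 would run here */
}

void worker_loop(void)
{
    unsigned char *h_buf[2], *d_buf;
    cudaStream_t stream;

    cudaStreamCreate(&stream);
    for (int i = 0; i < 2; i++)         /* pinned buffers allow async DMA   */
        cudaHostAlloc((void **)&h_buf[i], BATCH_PKTS * SLOT_SIZE,
                      cudaHostAllocDefault);
    cudaMalloc((void **)&d_buf, BATCH_PKTS * SLOT_SIZE);

    for (int cur = 0; ; cur ^= 1) {
        int n = 0;
        while (n < BATCH_PKTS)          /* CPU fills one buffer ...         */
            if (recv_next_pkt(h_buf[cur] + n * SLOT_SIZE, SLOT_SIZE) > 0)
                n++;
        cudaStreamSynchronize(stream);  /* ... while the GPU works on the
                                           batch issued previously          */
        cudaMemcpyAsync(d_buf, h_buf[cur], (size_t)n * SLOT_SIZE,
                        cudaMemcpyHostToDevice, stream);
        gaspp_kernel<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n);
    }
}

In the actual system, packets arrive through per-worker netmap Rx-queues, and the host-to-device copy can be skipped altogether when the zero-copy path of Section V-A is used.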

becomes full, it is copied to the memory space of the GPU. While the GPU processes the packets of the first buffer, the CPU processes newly arrived packets, as shown in Figure 2. B. Packet Filtering Fig. 2. The I/O and processing . Rx and Tx denote the reception and transmission of network packets by the NIC, while the HtoD and DtoH denote Packet filtering is used to control what packets will be the transfer of the aforementioned packets to and from the GPU memory processed, and it is typically performed by the OS at an space. early stage, i.e., before they are delivered to the user-space are supported for delivering a consistent application-layer byte application. To perform packet filtering in GASPP, we load the stream. filter program to be executed at startup. We have implemented GASPP applications consist of modules that control all the BSD Packet Filter (BPF) [29] which is the most widely aspects of the traffic processing flow. Modules are represented used protocol-independent packet filter, and is tailored for effi- as GPU device functions, and take as input a network packet or cient execution on register-based processors. It uses a virtual stream chunk. Internally, each module is executed in parallel RISC (Reduced Instruction Set Computing) pseudomachine, on a batch of packets. After processing is completed, the to rapidly execute arbitrary low-level filter programs over packets are transferred back to the memory space of the host, incoming packet data, which internally are treated as an array and depending on the application, to the appropriate output of bytes. Supported instructions include various load, store, network interface. jump, comparison, and arithmetic operations, as well as a return statement which specifies if the packet is discarded A. Packet Capturing entirely or the portion of the packet to be saved. Modern network cards partition incoming traffic into several The pseudo-machine consists of an accumulator, an index Rx-queues [30] to avoid contention when multiple cores access register, a scratch memory store, and an implicit program the same 10GbE port. To scale with multi-core CPUs, GASPP , and is responsible for decoding each BPF instruction spawns one worker per CPU core. The worker threads and performing the requested action on the pseudo-machine are responsible for fetching packets from the Rx-queues and state. It is implemented as a separate GASPP processing mod- transferring them to the GPU. To avoid costly packet copies ule that will execute the resulting BPF code on each incoming and context between user and kernel space, we use packet. For simplicity, we use the same BPF program for all the netmap module [32]. different threads, however a user has the ability to program a Incoming traffic is transferred to the memory space of the separate module that will allow different GPU threads to use GPU in batches. As we discuss in Section VII-A, small trans- different BPF programs. fers result in significant PCIe throughput degradation, hence C. Processing Modules we batch lots of data together to reduce the PCIe transaction A central concept of NVIDIA’s CUDA [5] that has influ- overhead. In addition, we allocate a different buffer for each enced the design of GASPP is the organization of GPU worker thread, to avoid locking overheads. Whenever a buffer programs into kernels, which in essence are functions that are gets full, it is transferred to the GPU with a single operation. executed by groups of threads. 
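To make the mapping between packets and threads concrete, the following is a minimal, self-contained sketch of how such a kernel could dispatch one thread per packet and chain module calls (cf. Fig. 3). The structure layouts, the pkt_index convention, and the dispatch logic are illustrative assumptions rather than GASPP's actual code.

/* Sketch: one GPU thread per packet, registered modules run back-to-back.  */
struct ethhdr { unsigned char dst[6], src[6]; unsigned short proto; };
struct iphdr  { unsigned char ver_ihl, tos; unsigned short tot_len;
                unsigned char rest[16]; };          /* abbreviated header   */

__device__ unsigned int processEth(unsigned pktid, struct ethhdr *eth,
                                   unsigned int cxtkey)
{
    return cxtkey;                                  /* placeholder module   */
}

__device__ unsigned int processIP(unsigned pktid, struct ethhdr *eth,
                                  struct iphdr *ip, unsigned int cxtkey)
{
    return cxtkey;                                  /* placeholder module   */
}

__global__ void module_chain(unsigned char *pkt_buf, const int *pkt_index,
                             int npkts)
{
    int pktid = blockIdx.x * blockDim.x + threadIdx.x;  /* 1 thread/packet  */
    if (pktid >= npkts || pkt_index[pktid] < 0)
        return;                                     /* empty or dropped slot */

    unsigned char *pkt = pkt_buf + pkt_index[pktid];
    struct ethhdr *eth = (struct ethhdr *)pkt;
    struct iphdr  *ip  = (struct iphdr *)(pkt + sizeof(struct ethhdr));

    /* the context key returned by one module is handed to the next         */
    unsigned int cxtkey = 0;
    cxtkey = processEth(pktid, eth, cxtkey);
    processIP(pktid, eth, ip, cxtkey);
}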
GASPP allows users to specify Instead of having a separate CPU thread for interfacing with processing tasks on the incoming traffic by writing GASPP a GPU, each worker thread is responsible for transferring the modules, applicable on different protocol layers, which are data and launching the GPU kernel functions independently. then mapped into GPU device functions. Modules can be As of CUDA v4.0 and later, multiple CPU threads can share implemented according to the following prototypes: a single GPU context, in which requests to the GPU such as __device__ uint processEth(unsigned pktid, data transfers and kernel launches can overlap. ethhdr *eth, uint cxtkey); __device__ uint processIP(unsigned pktid, The buffer where network packets are collected is allocated ethhdr *eth, iphdr *ip, uint cxtkey); as a special type of memory, called page-locked or “pinned __device__ uint processUDP(unsigned pktid, ethhdr *eth, iphdr *ip, udphdr *udp, uint cxtkey); down” memory. Page-locked memory is a physical memory __device__ uint processTCP(unsigned pktid, area that does not map to the virtual address space, and ethhdr *eth, iphdr *ip, tcphdr *tcp, uint cxtkey); __device__ uint processStream(unsigned pktid, thus cannot be swapped out to secondary storage. The use of ethhdr *eth, iphdr *ip, tcphdr *tcp, uchar *chunk, this memory area results in higher data transfer throughput unsigned chunklen, uint cxtkey); between the host and the GPU device, because the GPU The framework is responsible for decoding incoming driver does not have to copy it to a non-pageable buffer packets and executing all registered process*() modules before transferring it to the GPU, neither swap it from disk. by passing the appropriate parameters. Packet decoding and Data transfers between page-locked memory and the GPU are stream reassembly is performed by the underlying system, performed through DMA, without occupying the CPU. eliminating any extra effort from the side of the developer. The forwarding path requires the transmission of network Each module is executed at the corresponding layer, with packets after processing is completed, and this is achieved pointer arguments to the encapsulated protocol headers. Both using a triple-pipeline solution for each CPU worker thread, packet headers and data can be modified by modules; however, as shown in Figure 2. Specifically, we use a double buffering the size of each packet can not be expanded more than the scheme that allows for pipelining the execution of CPU cores number of bytes that is used as padding when packets are and the GPUs. For each worker thread, when the first buffer stored back-to-back (16-bytes by default). Arguments also VASILIADIS et al.: DESIGN AND IMPLEMENTATION OF A STATEFUL NETWORK PACKET PROCESSING FRAMEWORK 613

bucket’s list. To enable GPU threads to add or remove nodes from the table in parallel, we associate an atomic lock with each bucket, so that only a single thread can make changes to a given bucket at a time. 2) Pattern Matching: Our framework provides a GPU-based API for matching fixed strings and regular expressions. We have ported a variant of the Aho-Corasick algorithm for string searching, and use a DFA-based implementation for regular expression matching. Both implementations have linear complexity over the input data, independent of the number of patterns to be searched. To utilize efficiently the GPU memory subsystem, packet payloads are accessed 16-bytes at a time, Fig. 3. GPU packet processing pipeline. The pipeline is executed by a using an int4 variable [39]. different thread for every incoming packet. 3) Cipher Operations: Currently, GASPP provides AES include a unique identifier for each packet and a user-defined (128-bit to 512-bit key sizes) and RSA (1024-bit and 2048-bit key that denotes the packet’s class (described in more detail in key sizes) functions for encryption and decryption, and sup- Section V-D). Currently, GASPP supports the most common ports all modes of AES (ECB, CTR, CFB and OFB). Again, network protocols, such as Ethernet, IP, TCP and UDP. Other packet contents are read and written 16-bytes at a time, as this protocols can easily be handled by explicitly parsing raw substantially improves GPU performance. The encryption and packets. Modules are executed per-packet in a data-parallel decryption process happens in-place and as packet lengths may fashion. If more than one modules have been registered, they be modified, the checksums for IP and TCP/UDP packets are are executed back-to-back in a packet processing pipeline, recomputed to be consistent. In cases where the NIC controller resulting in GPU module chains, as shown in Figure 3. supports checksum computation offloading, GASPP simply The processStream() modules are executed whenever forwards the altered packets to the NIC. a new normalized TCP chunk of data is available. These mod- 4) Network Packet Manipulation Functions: GASPP pro- ules are responsible for keeping internally the state between vides special functions for dropping network packets consecutive chunks—or, alternatively, for storing chunks in (Drop()), ignoring any subsequent registered user-defined global memory for future use—and continuing the processing modules (Ignore()), passing packets to the host for fur- from the last state of the previous chunk. For example, ther processing (ToLinux()), or writing their contents to a a pattern matching application can match the contents of the dump file (ToDump()). Each function updates accordingly current chunk and keep the state of its matching algorithm to a the packet index array, which holds the offsets where each global variable; on the arrival of the next chunk, the matching packet is stored in the packet buffer, and a separate “metadata” process will continue from the previously stored state. array. As modules are simple to write, we expect that users will IV. STATEFUL PROTOCOL ANALYSIS easily write new ones as needed using the function prototypes The stateful protocol analysis component of GASPP is described above. In fact, the complete implementation of designed with minimal complexity so as to maximize process- a module that simply passes packets from an input to an ing speed. This component is responsible for maintaining the output interface takes only a few lines of code. 
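As an illustration of this point, a pass-through module could be written as sketched below. It relies only on the prototypes of Section III-C and the manipulation functions listed above; the framework headers are assumed to provide the ethhdr/iphdr types and the exact signature of ToDump().

/* Sketch: the "few lines of code" pass-through module mentioned above.     */
__device__ uint processEth(unsigned pktid, ethhdr *eth, uint cxtkey)
{
    return cxtkey;    /* no Drop()/Ignore(): the packet is forwarded as-is  */
}

/* A slightly richer variant that also mirrors IP traffic to a dump file:   */
__device__ uint processIP(unsigned pktid, ethhdr *eth, iphdr *ip, uint cxtkey)
{
    ToDump(pktid);    /* assumed call: write this packet to the dump file   */
    return cxtkey;
}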
More complex state of TCP connections, and reconstructing the application- network applications (described in Section VI), such as NIDS, level byte stream by merging packet payloads and reordering L7 traffic classification, and packet encryption, require a few out-of-order packets. dozen lines of code. A. Flow Tracking D. API GASPP uses a connection table array stored in the To cover the needs of a broad range of network traffic global device memory of the GPU for keeping the state of processing applications, GASPP offers a rich GPU API with TCP connections. Each record is 17-byte long. A 4-byte hash data structures and algorithms for processing network packets. of the source and destination IP addresses and TCP ports is 1) Shared Key-Value Store: GASPP enables applications to used to handle collisions in the flow classifier. Connection access the processed data through a global key-value store. state is stored in a single-byte variable. The sequence numbers Data stored in an instance of the key-value store is persistent of the most recently received client and server segments across GPU kernel invocations, and is shared between the are stored in two 4-byte fields, and are updated every time host and the device. The key-value store is implemented as the next in-order segment arrives. Hash table collisions are a hash-table that allows insertion, acquisition, deletion, and handled using a locking chained hash table with linked lists updates of data objects from both the host and the device, (as described in detail in Section III-D). A 4-byte pointer allowing applications to easily access any data that needs to be points to the next record (if any). processed further or stored on disk. Internally, data objects are The connection table can easily fill up with adversarial hashed and mapped to a given bucket. If the bucket is already partially-established connections, benign connections that stay occupied, the thread allocates a new bucket node, using the idle for a long time, or connections that failed to termi- on-GPU malloc(), and inserts the entry at the front of the nate properly. For this reason, connection records that have 614 IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 25, NO. 1, FEBRUARY 2017

Fig. 5. Subsequent packets (dashed line) may arrive in-sequence ((a)–(d)) or out of order, creating holes in the reconstructed TCP stream ((e)–(f)).
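The consecutive-packet pairing of Section IV-B (illustrated in Fig. 4) can be sketched as two hashing passes over a batch. The hash function, table layout, and pktmeta structure below are illustrative assumptions; GASPP performs the equivalent pairing with a separate array and a memory barrier between the two passes, as described in the text.

/* Sketch: packet y follows packet x in a stream when
 * hash(4-tuple, seq_x + len_x) == hash(4-tuple, seq_y).
 * Collision handling is omitted; `table` must be pre-filled with -1.       */
#define PAIR_SLOTS (1 << 20)

struct pktmeta {
    unsigned int saddr, daddr, seq, len;
    unsigned short sport, dport;
};

__device__ unsigned int pair_hash(const struct pktmeta *m, unsigned int seq)
{
    unsigned int h = m->saddr ^ m->daddr ^
                     ((unsigned int)m->sport << 16 | m->dport) ^ seq;
    h ^= h >> 16; h *= 0x85ebca6bu; h ^= h >> 13;   /* simple bit mixing    */
    return h & (PAIR_SLOTS - 1);
}

/* pass 1: each thread publishes its packet id under hash(4-tuple, seq)     */
__global__ void publish_seq(int *table, const struct pktmeta *m, int npkts)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < npkts)
        table[pair_hash(&m[i], m[i].seq)] = i;
}

/* pass 2 (a separate launch acts as the barrier): each thread looks up
 * hash(4-tuple, seq + len) to find its in-order successor, or -1 if none   */
__global__ void link_next(const int *table, int *next_packet,
                          const struct pktmeta *m, int npkts)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < npkts)
        next_packet[i] = table[pair_hash(&m[i], m[i].seq + m[i].len)];
    /* the connection table is later updated with the last packet of each
       flow direction, i.e., the packet whose next_packet entry is -1       */
}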

of the next in-order packet will be y = next_packet[x]. Finally, the connection table is updated with the sequence number of the last packet of each flow direction, i.e., the Fig. 4. Ordering sequential TCP packets in parallel. The resulting next_packet packet x that does not have a next packet in the current batch. array contains the next in-order packet, if any (i.e., next_packet[A] = B). been idle for more than a certain timeout, set to 60 sec- C. Packet Reordering onds by default, are periodically removed. As current GPU Although batch processing handles out-of-order packets that devices do not provide support for measuring real-world time, are included in the same batch, it does not solve the problem in we resort to a separate GPU kernel that is initiated periodically the general case. A potential solution for in-line applications according to the timeout value. Its task is to simply mark each would be to just drop out-of-sequence packets, forcing the connection record by setting the first bit of the state variable. host to retransmit them. Whenever an expected packet would If a connection record is already marked, it is removed from be missing, subsequent packets would be actively dropped the table. A marked record is unmarked when a new packet until the missing packet arrives. Although this approach would for this connection is received before the timeout expires. ensure an in-order packet flow, it has several disadvantages. First, in situations where the percentage of out-of-order pack- B. Parallelizing TCP Stream Reassembly ets is high, performance will degrade. Second, if the endpoints Maintaining the state of incoming connections is simple are using selective retransmission and there is a high rate as long as the packets that are processed in parallel by of data loss in the network, connections would be rendered the GPU belong to different connections. Typically, however, unusable due to excessive packet drops. a batch of packets usually contains several packets of the same To deal with TCP sequence hole scenarios, GASPP only connection. It is thus important to ensure that the order of processes packets with sequence numbers less than or equal connection updates will be correct when processing packets to the connection’s current sequence number (Figure 5(a)Ð(d)). of the same connection in parallel. Received packets with no preceding packets in the current TCP reconstruction threads are synchronized through a batch and with sequence numbers larger than the ones stored separate array used for pairing threads that must process con- in the connection table imply sequence holes (Figure 5(e)Ð(f)), secutive packets. When a new batch is received, each thread and are copied in a separate buffer in global device memory. hashes its packet twice: once using hash(addr_s, addr_d, If a thread encounters an out-of-order packet (i.e., a packet port_s, port_d, seq), and a second time using hash(addr_s, with a sequence number larger than the sequence number addr_d, port_s, port_d, seq + len), as shown in Figure 4. stored in the connection table, with no preceding packet in the A memory barrier is used to guarantee that all threads have current batch after the hashing calculations of Section IV-B), finished hashing their packets. Using this scheme, two packets it traverses the next_packet array and marks as out-of- x and y are consecutive if: hashx(4-tuple, seq + len)= order all subsequent packets of the same flow contained in hashy(4-tuple, seq). 
The hash function is unidirectional to the current batch (if any). This allows the system to identify ensure that each stream direction is reconstructed separately. sequences of out-of-order packets, as the ones shown in the The SYN and SYN-ACK packets are paired by hashing the examples of Figure 5(e)Ð(f). The buffer size is configurable sequence and acknowledge numbers correspondingly. If both and can be up to several hundred MBs, depending on the the SYN and SYN-ACK packets are present, the state of the network needs. If the buffer contains any out-of-order packets, connection is changed to ESTABLISHED, otherwise if only these are processed right after a new batch of incoming packets the SYN packet is present, the state is set to SYN_RECEIVED. is processed. Having hashed all pairs of consecutive packets in the hash Although packets are copied using the very fast device- table, the next step is to create the proper packet ordering for to-device copy mechanism, with a memory bandwidth of each TCP stream using the next_packet array, as shown about 145 GB/s, an increased number of out-of-order packets in Figure 4. Each packet is uniquely identified by an id,which can have a major effect on overall performance. For this corresponds to the index where the packet is stored in the reason, by default we limit the number of out-of-order packets packet index array. The next_packet array is set at the that can be buffered to be equal to the available slots in a beginning of the current batch, and its cells contain the id batch of packets. This size is enough under normal conditions, of the next in-order packet (or -1 if it does not exist in the where out-of-order packets are quite rare [15], and it can be current batch), e.g., if x is the id of the current packet, the id configured as needed for other environments. If the percentage VASILIADIS et al.: DESIGN AND IMPLEMENTATION OF A STATEFUL NETWORK PACKET PROCESSING FRAMEWORK 615

trade-off, and as we show in Section VII-A, it is occasionally better to copy packets back-to-back into a second buffer and transfer that buffer to the GPU. GASPP dynamically switches to the optimal approach by monitoring the actual utilization of the slots.
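A minimal sketch of that switch is shown below, assuming the 1,536-byte slots described above and the 512/1536 (about 33%) break-even point measured in Section VII-A; the function names and the transfer() callback are illustrative, not GASPP's actual interface.

/* Sketch: transfer the shared slot buffer directly when it is dense,
 * otherwise compact packets back-to-back into a staging buffer first.      */
#include <string.h>

#define SLOT_SIZE 1536     /* one slot per packet in the shared NIC buffer  */

static void ship_batch(const unsigned char *slots, const int *pkt_len,
                       int npkts, size_t bytes_rx, unsigned char *staging,
                       void (*transfer)(const void *, size_t))
{
    /* bytes_rx is accumulated as packets arrive (at interrupt time), so
       occupancy is known without scanning the buffer                       */
    double occupancy = (double)bytes_rx / ((double)npkts * SLOT_SIZE);

    if (occupancy >= 512.0 / SLOT_SIZE) {
        /* dense batch: slots are mostly full, transfer them as-is          */
        transfer(slots, (size_t)npkts * SLOT_SIZE);
    } else {
        /* sparse batch: copy packets back-to-back on 16-byte boundaries
           and transfer only the bytes actually used                        */
        size_t off = 0;
        for (int i = 0; i < npkts; i++) {
            memcpy(staging + off, slots + (size_t)i * SLOT_SIZE, pkt_len[i]);
            off += ((size_t)pkt_len[i] + 15) & ~(size_t)15;
        }
        transfer(staging, off);
    }
}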

B. Exploiting GPU Memory Hierarchies GPUs are equipped with different memory hierarchies— including global memory, texture memory, constant memory, and various levels of caches— which offer different proper- Fig. 6. Normal (a) and zero-copy (b) data transfer between the NIC and ties and different performance characteristics. In GASPP, we the GPU. chose to use texture memory for storing the network packets, as it is optimized for 2D spatial locality and offers about of out-of-order packets exceeds this limit, our system starts to 20%-50% improvement over global memory, as we have drop out-of-order packets, causing the corresponding host to shown previously [39]. retransmit them. In addition, the user has the option to use the small program- mable that is equipped to all NVIDIA GPU models— V. O PTIMIZING PERFORMANCE referred to as shared memoryÐexplicitly inside each GASPP A. Inter-Device Data Transfer processing module. Still, we do not make use of shared The problem of data transfers between the CPU and the memory in any of our representative applications described GPU is well-known in the GPGPU community, as it results in Section VI, due to the following reasons. First, we found in redundant cross-device communication. The traditional that the tradeoff of staging the data to the shared memory approach is to exchange data using DMA between the mem- and the corresponding serialization that occurs when accessing ory regions assigned by the OS to each device. As shown the same bank of shared memory1 is not tolerated by the in Figure 6(a), network packets are transferred to the page- benefit that the shared memory can provide for our workloads. locked memory of the NIC, then copied to the page-locked Second, by not using the shared memory, the performance of memory of the GPU, and from there, they are finally data accesses that reside in global memory is boosted, since transferred to the GPU. shared memory resides in the same on-chip storage with the To avoid costly packet copies and context switches, GASPP hardware-managed cache. uses a single buffer for efficient data sharing between the NIC and the GPU, as shown in Figure 6(b), by adjusting C. Packet Decoding the netmap module [32]. The shared buffer is added to the Memory alignment is a major factor that affects the internal tracking mechanism of the CUDA driver to auto- packet decoding process, as GPU execution constrains mem- matically accelerate calls to functions, as it can be accessed ory accesses to be aligned for all data types. For example, directly by the GPU. The buffer is managed by GASPP int variables should be stored to addresses that are a multiple through the specification of a policy based on time and size of sizeof(int). Due to the layered nature of network constraints. This enables real-time applications to process protocols, however, several fields of encapsulated protocols are incoming packets whenever a timeout is triggered, instead not aligned when transferred to the memory space of the GPU. of waiting for buffers to fill up over a specified threshold. To overcome this issue, GASPP reads the packet headers from Per-packet buffer allocation overheads are reduced by trans- global memory, parses them using bitwise logic and shifting ferring several packets at a time. Buffers consist of fixed-size operations, and stores them in appropriately aligned structures. slots, with each slot corresponding to one packet in the hard- To optimize memory usage, input data is accessed in units ware queue. 
Slots are reused whenever the circular hardware of 16 bytes (using an int4 variable). queue wraps around. The size of each slot is 1,536 bytes, which is consistent with the NIC’s alignment requirements, D. Packet Scheduling and enough for the typical 1,518-byte maximum Ethernet Registered modules are scheduled on the GPU, per pro- frame size. tocol, in a serial fashion. Whenever a new batch of packets Although making the NIC’s packet queue directly acces- is available, it is processed in parallel using a number of sible to the GPU eliminates redundant copies, this does not threads equal to the number of packets in the batch (each always lead to better performance. As previous studies have thread processes a different packet). As shown in Figure 3, shown [19], [38] (we verify their results in Section VII-A), all registered modules for a certain protocol are executed contrary to NICs, current GPU implementations suffer from serially on decoded packets in a lockstep way. poor performance for small data transfers. To improve PCIe Network packets are processed by different threads, grouped throughput, we batch several packets and transfer them at together into logical units known as warps (in current NVIDIA once. However, the fixed-size partitioning of the NIC’s queue GPU architectures, 32 threads form a warp) and mapped to leads to redundant data transfers for traffic with many small 1The shared memory has an important feature called bank conflict: If packets. For example, a 64-byte packet consumes only 1/24th multiple addresses of a memory request map to the same memory bank, the of the available space in its slot. This introduces an interesting accesses are serialized. 616 IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 25, NO. 1, FEBRUARY 2017

SIMT units. As threads within the same warp have to execute the same instructions, load imbalance and code flow diver- gence within a warp can cause inefficiencies. This may occur under the following primary conditions: (i) when processing different transport-layer protocols (i.e., TCP and UDP) in the same warp, (ii) in full-packet processing applications when packet lengths within a warp differ significantly, and (iii) when different packets follow different processing paths, i.e., threads of the same warp execute different user-defined modules. As the received traffic mix is typically very dynamic, it is essential to find an appropriate mapping between threads and network packets at runtime. It is also crucial that the overhead of the mapping process is low, so as to not jeopardize overall performance. To that end, our basic strategy is to group the Fig. 7. Packet scheduling for eliminating control flow divergences and load packets of a batch according to their encapsulated transport- imbalances. Packet brightness represents packet size. layer protocol and their length. In addition, module developers bin all packets with length between 64 and 127, and so on. The can specify context keys to describe packets that belong to the packets in each bin are not further sorted though. Choosing same class, which should follow the same module execution the bit subset affects both the efficiency and effectiveness pipeline. A context key is a value returned by a user-defined of the radix sort algorithm. When the bit subset increases, module and is passed (as the final parameter) to the next our algorithm will produce a more fine-grained sorting, while registered module. GASPP uses context keys to further group at the same time the extra overhead of sorting will increase packets of the same class together and map them to threads accordingly. Moreover, instead of copying each packet to the of the same warp after each module execution, so that they appropriate (sorted) position, we simply change their order in execute concurrently. This gives developers the flexibility to the packet index array. We also attempted to relocate packets build complex packet processing pipelines that will be mapped by transposing the packet array on the GPU device memory, efficiently to the underlying GPU architecture at runtime. For so as to benefit from memory coalescing [5]. Unfortunately, example, suppose a programmer writes one module to inspect the overall cost of the corresponding data movements was all traffic destined to web servers differently from the traffic not amortized by the resulting memory coalescing gains. destined to other ports. This can be achieved using a top-level In Section VII-B we evaluate how grouping granularity affects module with a simple policy consisting of a simple check on GPU execution. the destination IP address, based on which it dispatches differ- 2 Using the above process, GASPP assigns dynamically to the ent packets to different modules, using different context keys. same warp any similar-sized packets meant to be processed To group a batch of packets on the GPU, we have adapted by the same module, as shown in Figure 7. Packets that a GPU-based radix sort implementation [1]. 
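The per-packet weights that drive this sort, described next, could be computed with a small kernel along the following lines. The field widths and names are illustrative assumptions; the text only specifies that the weight concatenates the ip_proto field, the context key returned by the previous module, and the packet length.

/* Sketch: per-packet sort weights for the grouping step of Section V-D.    */
__global__ void compute_weights(unsigned int *weight,
                                const unsigned char *ip_proto,
                                const unsigned int *cxtkey,
                                const unsigned short *pkt_len,
                                int npkts, int len_shift)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= npkts)
        return;

    /* dropping len_shift low bits of the length yields coarser bins,
       e.g., len_shift = 6 groups packets into 64-byte-wide bins            */
    unsigned int len_bin = pkt_len[i] >> len_shift;

    weight[i] = ((unsigned int)ip_proto[i] << 24)   /* transport protocol   */
              | ((cxtkey[i] & 0xff)        << 16)   /* packet class         */
              | (len_bin & 0xffff);                 /* (binned) length      */
}

The packet index array is then sorted by these weights with the adapted radix sort of [1], so that packets of the same protocol, class, and similar size end up in the same warps.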
Specifically, were discarded earlier, or of which the processing pipeline we assign a separate weight for each packet, consisting of the has been completed, are grouped and mapped to warps that byte concatenation of the ip_proto field of its IP header, the contain only idle threads—otherwise warps would contain value of the context key returned by the previously executed both idle and active threads, degrading the utilization of the module, and its length. Weights are calculated on the GPU SIMT processors. To prevent packet reordering from taking after each module execution using a separate thread for each place during packet forwarding, we also preserve the initial packet, and are used by the radix sort algorithm to group the (pre-sorted) packet index array. In Section VII-B we analyze packets. In particular, the radix algorithm sorts the packets by in detail how control flow divergence affects the performance grouping the corresponding weights based on the individual of the GPU, and show how our packet scheduling mechanisms bits which share the same significant position and value. That tackle irregular code execution at a fixed cost. is, the algorithm starts from the most significant bit (MSB) and moves towards the least significant bit (LSB). At each VI. DEVELOPING WITH GASPP step, packets are split in two bins (the 0s bin and the 1s bin) In this section we present simple examples of representative based on the value of the current bit. applications built using the GASPP framework. The stability property of radix sort algorithm allows us 1) L3/L4 : Firewalls operate at the network layer to group the packets in more coarse-grained granularities, (port-based) or the application layer (content-based). For our e.g., instead of fully sorting a batch of packets, we can group purposes, we have built a GASPP module that can drop traffic them to bins, in which only a subset of bits are compared. For based on Layer-3 and Layer-4 rules. An incoming packet is instance, by ignoring the five LSBs of the packets size, the filtered if the corresponding IP addresses and port numbers are packets are placed in bins of width equal to 64, i.e., the first bin found in the hash table; otherwise the packet is forwarded. contains all packets with length between 0 and 63, the second 2) L7 Traffic Classification: We have implemented a L7 traffic classification tool (similar to the L7-filter tool [2]) on top of GASPP. The tool dynamically loads the pattern 2We note that execution control flow divergence can still occur inside modules, hence they should be implemented according to the SIMT model of set of the L7-filter tool, in which each application-level GPUs to avoid performance penalties. protocol (HTTP, SMTP, etc.) is represented by a different VASILIADIS et al.: DESIGN AND IMPLEMENTATION OF A STATEFUL NETWORK PACKET PROCESSING FRAMEWORK 617 regular expression. At runtime, each incoming flow is matched TABLE I against each regular expression independently. In order to SUSTAINED PCIe THROUGHPUT (GBIT/s) FOR TRANSFERRING DATA match patterns that cross TCP segment boundaries that lie on TO A SINGLE GPU, FOR DIFFERENT BUFFER SIZES the same batch, each thread continues the processing to the next TCP segment (obtained from the next_packet array). The processing of the next TCP segment continues until a final or a fail DFA-state is reached, as suggested in [37]. 
TABLE II In addition, the DFA-state of the last TCP segment of the SUSTAINED THROUGHPUT (GBIT/s) FOR VARIOUS PACKET SIZES, current batch is stored in a global variable, so that on the WHEN BULK-TRANSFERRING DATA TO A SINGLE GPU arrival of the next stream chunk, the matching process contin- ues from the previously stored state. This allows the detection of regular expressions that span (potentially deliberately) not only multiple packets, but also two or more stream chunks. 3) Signature-Based Intrusion Detection: Modern NIDS, such as Snort [12], use a large number of regular expressions to determine whether a packet stream contains an attack vector or not. To reduce the number of packets that need to be matched against a regular expression, typical NIDS take advantage of the string matching engine and use it as a first-level filtering mechanism before proceeding to regular expression matching. We have implemented the same functionality on top of GASPP, using a different module for scanning each incom- Fig. 8. Data transfer throughput for different packet sizes when using two dual-port 10GbE NICs, compared to the maximum theoretical throughput ing traffic stream against all the fixed strings in a signature (Effective) that can be achieved at the data-link layer. set. Patterns that cross TCP segments are handled similarly to the L7 Traffic Classification module. Only the matching efficiently deliver all incoming packets to user space, regard- streams are further processed against the corresponding regular less of the packet size, by mapping the NIC’s DMA packet expressions set. buffer. However, small data transfers to the GPU incur signif- 4) AES: Encryption is used by protocols and services, such icant penalties. Table I shows that for transfers of less than as SSL, VPN, and IPsec, for securing communications by 4KB, the PCIe throughput falls below 7 Gbit/s. With a large authenticating and encrypting the IP packets of a communica- buffer though, the transfer rate to the GPU exceeds 45 Gbit/s, tion session. While stock protocol suites that are used to secure while the transfer rate from the GPU to the host decreases to communications, such as IPsec, actually use connectionless about 25 Gbit/s.3 integrity and data origin authentication, for simplicity, we only To overcome the low PCIe throughput, GASPP transfers encrypt all incoming packets using the AES-CBC algorithm batches of network packets to the GPU, instead of one at and a different 128-bit key for each connection. a time. However, as packets are placed in fixed-sized slots, VII. PERFORMANCE EVA L UAT I O N transferring many slots at once results in redundant data a) Hardware setup: Our base system is equipped with transfers when the slots are not fully occupied. As we can see two Intel Xeon E5520 Quad-core CPUs at 2.27GHz and in Table II, when traffic consists of small packets, the actual 12 GB of RAM (6 GB per NUMA domain). Each CPU is PCIe throughput drops drastically. Thus, it is better to copy connected to peripherals via a separate I/O hub, linked to small network packets sequentially into another buffer, rather several PCIe slots. Each I/O hub is connected to an NVIDIA than transfer the corresponding slots directly. Direct transfer GTX480 graphics card via a PCIe v2.0 x16 slot, and one Intel pays off only for packet sizes over 512 bytes (when buffer 82599EB with two 10 GbE ports, via a PCIe v2.0 8× slot. occupancy is over 512/1536 = 33.3%), achieving 47.8 Gbit/s The system runs Linux 3.5 with CUDA v5.0 installed. 
After for 1518-byte packets (a 2.3× speedup). experimentation, we have found that the best placement is Consequently, we adopted a simple selective offloading to have a GPU and a NIC on each NUMA node. We also scheme, whereby packets in the shared buffer are copied to place the GPU and NIC buffers in the same memory domain, another buffer sequentially (in 16-byte aligned boundaries) if as local memory accesses sustain lower latency and more the overall occupancy of the shared buffer is sparse. Otherwise, bandwidth compared to remote accesses. We also modified the the shared buffer is transferred directly to the GPU. Occupancy NIC driver to carefully place packet buffers on the respective is computed—without any additional overhead—by simply local memory domain. counting the number of bytes of the newly arrived packets For traffic generation we use a custom packet generator built every time a new interrupt is generated by the NIC. on top of netmap [32]. Test traffic consists of both synthetic Figure 8 shows the throughput for forwarding packets with traffic, as well as real traffic traces. all data transfers included, but without any GPU computations. We observe that the forwarding performance for 64-byte A. Data Transfer We evaluate the zero-copy mechanism by taking into 3The PCIe asymmetry in the data transfer throughput is related to the account the size of the transferred packets. The system can interconnection between the motherboard and the GPUs [19]. 618 IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 25, NO. 1, FEBRUARY 2017

Fig. 9. Average processing throughput, excluding PCIe transfers, sustained by the GPU to (a) filter network packets, (b) decode network packets, (c) maintain flow state and reassemble TCP streams, and (d) perform various network processing operations.

TABLE III the number of the produced instruction increases—almost STATIC FILTERS AND THEIR INSTRUCTION COUNTS linearly—the processing time. Still, the GPU is about 16.2Ð37.7 times faster than a single CPU core, and about 3.1Ð 4.8 times faster with all PCIe transfers included, when having a sufficient number of packets that are processed in parallel. 2) Packet Decoding: Decoding a packet according to its packets reaches 21 Gbit/s, out of the maximum 29.09 Gbit/s, protocols is one of the most basic packet processing oper- while for large packets it reaches the maximum full line rate. ations, and thus we use it as a base cost of our framework. We also observe that the GPU transfers of large packets are Figure 10(b) shows the GPU performance, with all PCIe trans- completely hidden on the Rx+GPU+Tx path, as they are fers included, for fully decoding incoming UDP packets into performed in parallel using the pipeline shown in Figure 2, and appropriately aligned structures, as described in Section V-C thus they do not affect overall performance. Unfortunately, this (throughput is very similar for TCP). As expected, the through- is not the case for small packets (less than 128-bytes), which put increases as the number of packets processed in parallel suffer an additional 2Ð9% performance hit due to memory increases. When decoding 64-byte packets, the GPU perfor- contention. mance with PCIe transfers included, reaches 48 Mpps, which is about 4.5 times faster than the computational throughput of B. Raw GPU Processing Throughput the tcpdump decoding process sustained by a single CPU Having examined data transfer costs, we now evaluate the core, when packets are read from memory. For 1518-byte computational performance of a single GPU—excluding all packets, the GPU sustains about 3.8 Mpps and matches the network I/O transfers—for packet decoding, connection state performance of 1.92 CPU cores. management, TCP stream reassembly, and some representative 3) Connection State Management and TCP Stream traffic processing applications. We plot both the performance Reassembly: In this experiment we measure the performance achieved by the GPU only (Figure 9), and the performance of maintaining connection state on the GPU, and the achieved with all PCIe transfers included (Figure 10). performance of reassembling the packets of TCP flows into 1) Packet Filtering: We use three sets of filters with increas- application-level streams. Figures 10(c) and 9(c) show the ing complexity, as shown in Table III, for static filtering packets processed per second for both operations, with and performance evaluation. The instruction numbers of these without PCIe transfers accordingly. Test traffic consists of filters are also listed. Figures 9(a) and 10(a) show the raw real HTTP connections with random IP addresses and TCP GPU performance for each filter. For both CPU and GPU, ports. Each connection fetches about 800KB from a server, VASILIADIS et al.: DESIGN AND IMPLEMENTATION OF A STATEFUL NETWORK PACKET PROCESSING FRAMEWORK 619

Fig. 10. Average processing throughput, including all PCIe transfers, sustained by the GPU to (a) filter network packets, (b) decode network packets, (c) maintain flow state and reassemble TCP streams, and (d) perform various network processing operations. and comprises about 870 packets (320 minimum-size ACKs, TABLE IV and 550 full-size data packets). We also use a trace-driven TIME SPENT (µsec) FOR TRAVERSING THE CONNECTION workload (“Equinix”) based on a trace captured by CAIDA’s TABLE AND REMOVING EXPIRED CONNECTIONS equinix-sanjose monitor [3], in which the average and median packet length is 606.2 and 81 bytes respectively. Keeping state and reassembling streams requires several hash table lookups and updates, which result to marginal overhead for a sufficient number of simultaneous TCP connec- tions and the Equinix trace; about 20Ð25% on the raw GPU performance sustained for packet decoding, that increases to 45Ð50% when the number of concurrent connections is low. The reason is that smaller numbers of concurrent connections connection expiration. The time spent to traverse the table is result to lower parallelism. To compare with a CPU implemen- constant when occupancy is lower than 100%, and analogous tation, we measure the equivalent functionality of the Libnids to the number of buckets; for larger values it increases due TCP reassembly library [10], when packets are read from to the extra overhead of iterating the chain lists. Having a memory. Although Libnids implements more specific cases of small hash table with a large load factor is better than a large the TCP stack processing, compared to GASPP, the network but sparsely populated table. For example, the time to traverse traces that we used for the evaluation enforce exactly the same a 1M-bucket table that contains up to 1M elements is about functionality to be exercised. We can see that the throughput 20× lower than a 16M-bucket table with the same number of of a single CPU core is about 94× lower than our GPU elements. If the occupancy is higher than 100% though, it is implementation, at about 0.55 Mpps (Figure 9(c)). Even when slightly better to use a 16M-bucket table. all PCIe data transfers are included (Figure 10(c)), the GPU 5) Packet Processing Applications: In this experiment we version achieves performance comparable to about 10 CPU measure the computational throughput of the GPU for the cores (assuming ideal parallelization). applications presented in Section VI. The NIDS is configured 4) Removing Expired Connections: Removal of expired to use all the content patterns (about 10,000 strings) of connections is very important for preventing the connection the latest Snort distribution [12], combined into a single table from becoming full with stale adversarial connections, Aho-Corasick state machine, and their corresponding pcre idle benign connections, or connections that failed to termi- regular expressions compiled into individual DFA state nate cleanly [40]. Table IV shows the GPU time spent for machines. The application-layer filter application (L7F) uses 620 IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 25, NO. 1, FEBRUARY 2017 the “best-quality” patterns (12 regular expressions for iden- tifying common services such as HTTP and SSH) of L7-filter [2], compiled into 12 different DFA state machines. The Firewall (FW) application uses 10,000 randomly gener- ated rules for blocking incoming and outgoing traffic based on certain TCP/UDP port numbers and IP addresses. 
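For reference, the rule check performed by the FW module of Section VI-1 can be sketched as follows. The rule layout, the device-side rule table (d_rules/d_nrules), and the exact-match scan are illustrative assumptions; GASPP stores the rules in a hash table rather than scanning them linearly.

/* Sketch of an L3/L4 filtering check (cf. Section VI-1).                   */
struct fw_rule { unsigned int saddr, daddr; unsigned short sport, dport; };

__device__ struct fw_rule *d_rules;   /* uploaded by the host at startup    */
__device__ int             d_nrules;

__device__ int rule_match(unsigned int saddr, unsigned int daddr,
                          unsigned short sport, unsigned short dport)
{
    for (int i = 0; i < d_nrules; i++)          /* linear scan for brevity   */
        if (d_rules[i].saddr == saddr && d_rules[i].daddr == daddr &&
            d_rules[i].sport == sport && d_rules[i].dport == dport)
            return 1;
    return 0;
}

__device__ uint processTCP(unsigned pktid, ethhdr *eth, iphdr *ip,
                           tcphdr *tcp, uint cxtkey)
{
    /* field names follow the usual Linux header conventions (assumed)      */
    if (rule_match(ip->saddr, ip->daddr, tcp->source, tcp->dest))
        Drop(pktid);                            /* blocked traffic           */
    return cxtkey;                              /* otherwise forwarded       */
}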
The test traffic consists of the HTTP-based traffic and the trace-driven Equinix workload described earlier. Note that the increased asymmetry in packet lengths and network protocols in the above traces is a stress-test workload for our data-parallel applications, given the SIMT architecture of GPUs [5]. Figure 10(d) shows the GPU throughput sustained by each application, including PCIe transfers, when packets are read from host memory. FW, as expected, has the highest through- put of about 8 Mpps—about 2.3 times higher than the equiv- alent single-core CPU execution. The throughput for NIDS is about 4.2Ð5.7 Mpps, and for L7F is about 1.45Ð1.73 Mpps. The large difference between the two applications is due to the fact that the NIDS shares the same Aho-Corasick state machine to initially search all packets (as we described in Section VI). In the common case, each packet will be Fig. 11. Performance gains of packet sorting on raw GPU execution time, when executing (i) a single processing module (Figures 11(a)-11(c)), and matched only once against a single DFA. In contrast, the L7F (ii) multiple modules simulteneously (Figure 11(d)). requires each packet to be explicitly matched against each of the 12 different regular expression DFAs for both CPU and sustains lower overhead and is able to decrease the negative GPU implementations. The corresponding single-core CPU effect to about 21%. Still, as we cannot know at runtime implementation of NIDS reaches about 0.1 Mpps, while L7F if processing will be heavyweight or not, it is not feasible reaches 0.01 Mpps. We also note that both applications are to predict if packet sorting is worth applying. As a result, explicitly forced to match all packets of all flows, even after quite lightweight workloads (as in FW) will perform worse, they have been successfully classified (worst-case analysis). although this lower performance will be hidden most of the Finally, AES has a throughput of about 1.1 Mpps, as it time by data transfer overlap (Figure 2). is more computationally intensive. The corresponding CPU Another important aspect is how control flow divergence implementation using the AES-NI [8] instruction set on a affects performance, e.g., when packets follow different mod- single core reaches about 0.51 Mpps.4 ule execution pipelines. To achieve this, we explicitly enforce 6) Packet Scheduling: In this experiment we measure how different packets of the same batch to be processed by different the packet scheduling technique, described in Section V-D, modules. Figure 11(d) shows the achieved speedup when affects the performance of different network applications. For applying packet scheduling over the baseline case of mapping test traffic we used the trace-driven Equinix workload. packets to thread warps without any reordering (network Figures 11(a)Ð11(c) show the performance gain of each order). We note that although the actual work of the modules is application for different packet batch sizes, under three group- the same every time (i.e., the same processing will be applied ing granularities: i) full sorting, ii) grouping packets in bins on each packet), it is executed by different code blocks, thus of width size equal to 64, and iii) grouping packets in bins execution is forced to diverge. We see that as the number of of width size equal to 128. We observe that full sorting different modules increases, our packet scheduling technique boosts the performance of full packet processing applications, achieves a significant speedup. 
C. GASPP Performance

1) Individual Applications: Figure 12 shows the sustained forwarding throughput and latency of individual GASPP-enabled applications for different batch sizes. We use four different traffic generators, equal to the number of available 10 GbE ports in our system. To prevent synchronization effects between the generators, the test workload consists of the HTTP-based traffic described earlier. For comparison, we also evaluate the corresponding CPU-based implementations running on a single core, on top of netmap.

Fig. 13. Sustained throughput for concurrently running applications.

The FW application can process all traffic delivered to the GPU, even for small batch sizes. NIDS, L7F, and AES, on the other hand, require larger batch sizes. The NIDS application requires batches of 8,192 packets to reach similar performance; equivalent performance would be achieved (assuming ideal parallelization) by 28.4 CPU cores. More computationally intensive applications, however, such as L7F and AES, cannot process all traffic. L7F reaches 19 Gbit/s at a batch size of 8,192 packets, and converges to 22.6 Gbit/s for larger sizes—about 205.1 times faster than a single CPU core. AES converges to about 15.8 Gbit/s, and matches the performance of 4.4 CPU cores with AES-NI support. As expected, latency increases linearly with the batch size, and for certain applications and large batch sizes it can reach tens of milliseconds (Figure 12(b)). Fortunately, a batch size of 8,192 packets allows for a reasonable latency for all applications, while it sufficiently utilizes the PCIe bus and the parallel capabilities of the GTX480 card (Figure 12(a)). For instance, NIDS, L7F, and FW have a latency of 3–5 ms, while AES, which suffers from an extra GPU-to-host data transfer, has a latency of 7.8 ms.

Fig. 12. Sustained traffic forwarding throughput (a) and latency (b) for GASPP-enabled applications.

2) Consolidated Applications: Consolidating multiple applications has the benefit of distributing the overhead of data transfer, packet decoding, state management, and stream reassembly across all applications, as all these operations are performed only once. Moreover, through the use of context keys, GASPP optimizes SIMT execution when packets of the same batch are processed by different applications.

Figure 13 shows the sustained throughput when running multiple GASPP applications. Applications are added in the following order: FW, NIDS, L7F, AES (in order of increasing overhead). We also force packets of different connections to follow different application processing paths. Specifically, we use the hash of each packet's 5-tuple to decide the order of execution. For example, one class of packets will be processed by application 1 and then application 2, while others will be processed by application 2 and then application 1; eventually, all packets are processed by all registered applications. For comparison, we also plot the performance of GASPP when packet scheduling is disabled (GASPP-nosched), and the performance of having multiple standalone applications running on the GPU and the CPU.

When combining the first two applications, the throughput remains at 33.9 Gbit/s. When adding L7F (x = 3), performance degrades to 18.3 Gbit/s; L7F alone reaches about 20 Gbit/s (Figure 12(a)). When adding AES (x = 4), performance drops to 8.5 Gbit/s, which is about 1.93× faster than GASPP-nosched. The achieved throughput when running multiple standalone GPU-based implementations is about 16.25× lower than GASPP, due to excessive data transfers.
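The per-connection ordering used in this experiment can be illustrated with the following device-side sketch, in which a hash of the packet's 5-tuple selects a rotation of the registered application pipeline, so that different connections traverse the applications in a different order while every packet still visits all of them. The hash function and structure names are placeholders and do not reflect GASPP's internal context-key implementation.

```
// Hypothetical sketch: derive a per-connection execution order from the
// packet's 5-tuple, so connections are spread across different pipelines.
#define NUM_APPS 4   // e.g., FW, NIDS, L7F, AES

struct five_tuple {
    unsigned int  src_ip, dst_ip;
    unsigned short src_port, dst_port;
    unsigned char proto;
};

__device__ unsigned int hash_tuple(const five_tuple &t)
{
    // Simple FNV-style mixing, for illustration only.
    unsigned int h = 2166136261u;
    h = (h ^ t.src_ip) * 16777619u;
    h = (h ^ t.dst_ip) * 16777619u;
    h = (h ^ ((unsigned int)t.src_port << 16 | t.dst_port)) * 16777619u;
    h = (h ^ t.proto) * 16777619u;
    return h;
}

// Fill 'order' with the sequence of application IDs this packet should
// follow: a rotation of 0..NUM_APPS-1 chosen by the connection hash.
__device__ void application_order(const five_tuple &t, int order[NUM_APPS])
{
    unsigned int start = hash_tuple(t) % NUM_APPS;
    for (int i = 0; i < NUM_APPS; i++)
        order[i] = (start + i) % NUM_APPS;
}
```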
VIII. LIMITATIONS

Typically, a GASPP developer will prefer to port functionality that is parallelizable and thus benefits from the GPU execution model. However, there may be parts of data processing operations that do not necessarily fit well on the GPU. In particular, middlebox functionality with complex conditional processing and frequent branching may require extra effort. The packet scheduling mechanisms described in Section V-D help accommodate such cases by forming groups of packets that will follow the same execution path and will not affect GPU execution. Still, (i) divergent workloads that perform quite lightweight processing (e.g., that process only a few bytes from each packet, such as the FW application), or (ii) workloads where it is not easy to know which packet will follow which execution path, may not be parallelized efficiently on top of GASPP. The reason is that in these cases the cost of grouping is much higher than the resulting benefits, while GASPP cannot predict at runtime whether packet scheduling is worthwhile. To overcome this, GASPP allows applications to selectively pass network packets and their metadata to the host CPU for further post-processing, as shown in Figure 1. As such, the correct way to handle workloads that are hard to build on top of GASPP is to offload them to the CPU. A limitation of this approach is that any subsequent processing that might be required also has to be carried out by the CPU, as the cost of transferring the data back to the GPU would be prohibitive.
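As a purely schematic example of such selective offloading (the verdict values, queue layout, and helper names below are hypothetical and not GASPP's actual API), a module could flag a packet for host processing by appending its index to a queue in host-visible memory, which the CPU drains once the GPU batch completes.

```
// Hypothetical illustration of selective CPU offload: a GPU module marks
// packets it cannot handle efficiently, and their indices are collected in
// a queue (in mapped, host-visible memory) that the CPU processes afterwards.
enum verdict { PROCESS_ON_GPU, PASS_TO_CPU };

struct cpu_queue {
    int *indices;            // packet indices destined for host processing
    unsigned int *count;     // number of queued packets (atomic counter)
};

__device__ void offload_to_cpu(cpu_queue q, int pkt_idx)
{
    unsigned int slot = atomicAdd(q.count, 1u);
    q.indices[slot] = pkt_idx;
}

// Inside a module kernel: decide per packet and enqueue the hard cases.
__device__ void run_module_on_packet(cpu_queue q, int pkt_idx, verdict v)
{
    if (v == PASS_TO_CPU)
        offload_to_cpu(q, pkt_idx);
    // ...otherwise continue GPU-side processing of packet pkt_idx...
}
```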
Another limitation of the current GASPP implementation is its relatively high packet processing latency. Due to the batch processing nature of GPUs, GASPP may not be suitable for protocols with hard real-time per-packet processing constraints.

IX. RELATED WORK

Click [31] is a popular modular software router that implements packet processing in units of modules, called elements. Click successfully demonstrates the need for, and the importance of, modularity in software routers. Several works focus on optimizing its performance using commodity multi-core CPUs [16]–[18]. In addition, many high-speed packet I/O libraries have been proposed. The PacketShader I/O engine (PSIO) is a batch-oriented packet I/O library that uses huge buffers to reduce kernel/user copy overheads. Similarly, netmap [32] and PF_RING [11] offer high-speed packet I/O to userspace applications. Intel DPDK [9] offers a framework for fast packet processing that accesses devices via polling to eliminate the overhead of interrupt processing. In GASPP, we use netmap for packet I/O, even though any other high-speed packet I/O library could be adapted for the same purpose.

GPUs provide a substantial performance boost to many network-related workloads, including intrusion detection [21], [36], [38], cryptography [20], [22], and IP routing [19]. The main difference of GASPP from these works is that we focus on building a software network processing framework entirely on the GPU, rather than offloading specific tasks. By essentially transforming the GPU into a network processor, GASPP decreases excess data movements and increases overall performance. In addition, it offers a more user-friendly environment for developers to build similar applications, with less effort and in a shorter time. Snap [35] is very close to our work. It uses netmap, adds a set of extensions to Click to integrate GPU elements, and also allows application developers to specify and optimize the order of offloading steps, such as data transfers and GPU kernel executions. GASPP uses similar techniques, but also implements packet transformation techniques to tackle GPU execution divergence at runtime, with very low overhead. These transformation techniques are orthogonal to the architecture of most previous GPU-accelerated packet processing systems [19]–[22], [25], [26], [35], [36], [38], and hence can be integrated into them to further increase their performance. Other works propose treating the CPU as the primary processor, since it offers low latency, and offloading processing tasks to the GPU (or other accelerators) only when they result in throughput benefits [25], [26]. More research is clearly needed to conclude whether GPU-only implementations are better than hybrid implementations, in which different processing units (such as the CPU) are opportunistically selected at runtime based on the current needs.

Many recent works also deal with GPU resource management in the OS [23], [33]. To reduce the number of scheduler invocations, Kato et al. [24] adopt an API-driven scheduler model. The main difference with these works is that we focus on building a framework that monopolizes the GPU(s), rather than providing a multi-tasking system that allows the OS itself to enforce fairness and prioritization. Finally, Zhang et al. [41] propose mechanisms for tackling irregularities in both control flows and memory references. A major difference from our packet scheduling approach is that our implementation runs solely on the GPU (and not on the CPU), with very low overhead. This allows us to harmonize the SIMD execution at runtime, after each module execution, without needing to transfer any data back to the CPU.

X. CONCLUSIONS

We have presented the design, implementation, and evaluation of GASPP, a flexible, efficient, and high-performance framework for network traffic processing applications. GASPP explores the design space of combining the massively parallel architecture of GPUs with 10 GbE network interfaces, and enables the easy integration of user-defined modules for execution at the corresponding L2–L7 network layers.

As part of our future work, we plan to investigate how to schedule module execution on the CPU, and how these executions will affect the overall performance of GASPP. We also plan to implement an opportunistic GPU offloading scheme, whereby packets with hard real-time processing constraints will be handled by the host CPU instead of the GPU, to reduce processing latency.

REFERENCES

[1] CUB, accessed on Aug. 13, 2016. [Online]. Available: http://nvlabs.github.io/cub/
[2] Application Layer Packet Classifier for Linux, accessed on Aug. 13, 2016. [Online]. Available: http://l7-filter.sourceforge.net/
[3] The CAIDA Anonymized Traces 2011 Dataset, accessed on Aug. 13, 2016. [Online]. Available: http://www.caida.org/data/passive/passive_2011_dataset.xml
[4] Cisco Taps Array Architecture for NPU, accessed on Aug. 13, 2016. [Online]. Available: http://www.eetimes.com/document.asp?doc_id=1150886
[5] CUDA C Programming Guide, accessed on Aug. 13, 2016. [Online]. Available: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
[6] Global Internet Usage, accessed on Aug. 13, 2016. [Online]. Available: http://en.wikipedia.org/wiki/Global_Internet_usage
[7] Huawei Launches NetEngine80 Core Router at Networld+Interop 2001 Exhibition in US, accessed on Aug. 13, 2016. [Online]. Available: http://www1.huawei.com/en/about-huawei/newsroom/product_launch/hw-090830-productlaunch.htm
[8] Intel Advanced Encryption Standard (AES) New Instructions Set, accessed on Aug. 13, 2016. [Online]. Available: https://software.intel.com/sites/default/files/article/165683/aes-wp-2012-09-22-v01.pdf
[9] DPDK Data Plane Development Kit, accessed on Aug. 13, 2016. [Online]. Available: http://dpdk.org
[10] Libnids, accessed on Aug. 13, 2016. [Online]. Available: http://libnids.sourceforge.net/
[11] PF_RING: High-Speed Packet Capture, Filtering and Analysis, accessed on Aug. 13, 2016. [Online]. Available: http://www.ntop.org/products/packet-capture/pf_ring/
[12] Snort IDS/IPS, accessed on Aug. 13, 2016. [Online]. Available: https://www.snort.org
[13] The Akamai State of the Internet Report, accessed on Aug. 13, 2016. [Online]. Available: http://www.akamai.com/stateoftheinternet/
[14] M. B. Anwer, M. Motiwala, M. B. Tariq, and N. Feamster, "SwitchBlade: A platform for rapid deployment of network protocols on programmable hardware," in Proc. ACM SIGCOMM Conf., 2010, pp. 183–194.
[15] S. Dharmapurikar and V. Paxson, "Robust TCP stream reassembly in the presence of adversaries," in Proc. 14th USENIX Security Symp., vol. 14, 2005, p. 5.

[16] M. Dobrescu et al., "RouteBricks: Exploiting parallelism to scale software routers," in Proc. 22nd ACM Symp. Oper. Syst. Principles, 2009, pp. 15–28.
[17] N. Egi et al., "Towards high performance virtual routers on commodity hardware," in Proc. ACM Int. Conf. Emerg. Netw. Experim. Technol. Conf., 2008, Art. no. 20.
[18] A. Ghodsi, V. Sekar, M. Zaharia, and I. Stoica, "Multi-resource fair queueing for packet processing," in Proc. ACM SIGCOMM Conf., 2012, pp. 1–12.
[19] S. Han, K. Jang, K. Park, and S. Moon, "PacketShader: A GPU-accelerated software router," in Proc. ACM SIGCOMM Conf., 2010, pp. 195–206.
[20] O. Harrison and J. Waldron, "Practical symmetric key cryptography on modern graphics hardware," in Proc. 17th USENIX Security Symp., Jul. 2008, pp. 195–209.
[21] M. A. Jamshed et al., "Kargus: A highly-scalable software-based intrusion detection system," in Proc. ACM Conf. Comput. Commun. Security, 2012, pp. 317–328.
[22] K. Jang, S. Han, S. Han, S. Moon, and K. Park, "SSLShader: Cheap SSL acceleration with commodity processors," in Proc. 8th USENIX Symp. Netw. Syst. Design Implement., Mar. 2011, pp. 1–14.
[23] S. Kato, K. Lakshmanan, R. Rajkumar, and Y. Ishikawa, "TimeGraph: GPU scheduling for real-time multi-tasking environments," in Proc. USENIX Annu. Tech. Conf., 2011, p. 2.
[24] S. Kato, M. McThrow, C. Maltzahn, and S. Brandt, "Gdev: First-class GPU resource management in the operating system," in Proc. USENIX Annu. Tech. Conf., 2012, p. 37.
[25] J. Kim et al., "NBA (Network Balancing Act): A high-performance packet processing framework for heterogeneous processors," in Proc. 10th Eur. Conf. Comput. Syst., 2015, Art. no. 22.
[26] L. Koromilas, G. Vasiliadis, I. Manousakis, and S. Ioannidis, "Efficient software packet processing on heterogeneous and asymmetric hardware architectures," in Proc. 10th ACM/IEEE Symp. Archit. Netw. Commun. Syst., 2014, pp. 207–218.
[27] R.-T. Liu, N.-F. Huang, C.-H. Chen, and C.-N. Kao, "A fast string-matching algorithm for network processor-based intrusion detection system," ACM Trans. Embedded Comput. Syst., vol. 3, no. 3, pp. 614–633, 2004.
[28] G. Lu et al., "ServerSwitch: A programmable and high performance platform for data center networks," in Proc. 8th USENIX Conf. Netw. Syst. Design Implement., 2011, pp. 15–28.
[29] S. McCanne and V. Jacobson, "The BSD packet filter: A new architecture for user-level packet capture," in Proc. Winter USENIX Conf., 1993, p. 2.
[30] Scalable Networking: Eliminating the Receive Processing Bottleneck—Introducing RSS, Microsoft Corporation, Albuquerque, NM, USA, 2005.
[31] R. Morris, E. Kohler, J. Jannotti, and M. F. Kaashoek, "The Click modular router," in Proc. 17th ACM Symp. Oper. Syst. Principles, 1999, pp. 217–231.
[32] L. Rizzo, "Netmap: A novel framework for fast packet I/O," in Proc. USENIX Annu. Tech. Conf., 2012, p. 9.
[33] C. J. Rossbach, J. Currey, M. Silberstein, B. Ray, and E. Witchel, "PTask: Operating system abstractions to manage GPUs as compute devices," in Proc. 23rd ACM Symp. Oper. Syst. Principles, 2011, pp. 233–248.
[34] I. Sourdis and D. Pnevmatikatos, "Pre-decoded CAMs for efficient and high-speed NIDS pattern matching," in Proc. 12th Annu. IEEE Symp. Field Program. Custom Comput. Mach., Apr. 2004, pp. 258–267.
[35] W. Sun and R. Ricci, "Fast and flexible: Parallel packet processing with GPUs and Click," in Proc. 9th ACM/IEEE Symp. Archit. Netw. Commun. Syst., 2013, pp. 25–36.
[36] G. Vasiliadis, S. Antonatos, M. Polychronakis, E. P. Markatos, and S. Ioannidis, "Gnort: High performance network intrusion detection using graphics processors," in Proc. 11th Int. Symp. Recent Adv. Intrusion Detection, 2008, pp. 116–134.
[37] G. Vasiliadis, M. Polychronakis, S. Antonatos, E. P. Markatos, and S. Ioannidis, "Regular expression matching on graphics hardware for intrusion detection," in Proc. 12th Int. Symp. Recent Adv. Intrusion Detection, 2009, pp. 265–283.
[38] G. Vasiliadis, M. Polychronakis, and S. Ioannidis, "MIDeA: A multi-parallel intrusion detection architecture," in Proc. 18th ACM Conf. Comput. Commun. Security, 2011, pp. 297–308.
[39] G. Vasiliadis, M. Polychronakis, and S. Ioannidis, "Parallelization and characterization of pattern matching using GPUs," in Proc. IEEE Int. Symp. Workload Characterization, Nov. 2011, pp. 216–225.
[40] M. Vutukuru, H. Balakrishnan, and V. Paxson, "Efficient and robust TCP stream normalization," in Proc. IEEE Symp. Security Privacy, May 2008, pp. 96–110.
[41] E. Z. Zhang, Y. Jiang, Z. Guo, K. Tian, and X. Shen, "On-the-fly elimination of dynamic irregularities for GPU computing," in Proc. 16th Int. Conf. Archit. Support Program. Lang. Oper. Syst., 2011, pp. 369–380.

Giorgos Vasiliadis is a Scientist with the Qatar Computing Research Institute, HBKU. He received the B.Sc., M.Sc., and Ph.D. degrees in computer science from the University of Crete, Greece, in 2006, 2008, and 2015, respectively, while working as a Research Assistant with the Distributed Computing Systems Laboratory, FORTH-ICS. His main research interests are in the areas of network systems and security.

Lazaros Koromilas received the B.Sc. and M.Sc. degrees in computer science from the University of Crete, Greece, in 2009 and 2012, respectively, while working as a Research Assistant with the Distributed Computing Systems Laboratory, FORTH-ICS. His research interests are in the areas of computer networks, operating systems, and security.

Michalis Polychronakis is an Assistant Professor with the Computer Science Department, Stony Brook University. His research interests are in the areas of network and system security, and network monitoring and measurement. He received the B.Sc., M.Sc., and Ph.D. degrees in computer science from the University of Crete, Greece, in 2003, 2005, and 2009, respectively, while working as a Research Assistant with the Distributed Computing Systems Laboratory, FORTH-ICS. Before joining Stony Brook, he was an Associate Research Scientist with Columbia University.

Sotiris Ioannidis received the B.Sc. degree in mathematics and the M.Sc. degree in computer science from the University of Crete in 1994 and 1996, respectively, the M.Sc. degree in computer science from the University of Rochester in 1998, and the Ph.D. degree from the University of Pennsylvania in 2005. He was a Research Scholar with the Stevens Institute of Technology until 2007. Since then, he has been a Principal Researcher with the Institute of Computer Science of the Foundation for Research and Technology–Hellas. His research interests are in the areas of systems and network security, security policy, privacy, and high-speed networks.