CUDA Based Rabin-Karp Pattern Matching for Deep Packet Inspection on a Multicore GPU

I. J. Computer Network and Information Security, 2015, 10, 70-77
Published Online September 2015 in MECS (http://www.mecs-press.org/)
DOI: 10.5815/ijcnis.2015.10.08

Jyotsna Sharma and Maninder Singh
Computer Science & Engineering Department, Thapar University, Patiala, INDIA
Email: {jyotsana.sharma,msingh}@thapar.edu

Abstract: This paper presents a study of the improvement in efficiency of Deep Packet Inspection based on the Rabin-Karp pattern-matching algorithm. An NVIDIA GPU is programmed with NVIDIA's general-purpose parallel computing architecture, CUDA, which leverages the parallel compute engine in NVIDIA GPUs to solve many complex computational problems more efficiently than a CPU. The proposed CUDA-based implementation on a multicore GPU outperforms an Intel quad-core processor and runs up to 14 times faster by executing the algorithm in parallel to search for the pattern in the text. The speedup may not sound exorbitant but is nonetheless significant, given that the experiments were conducted on real rather than synthetic data, and that the main expectation was optimal performance even under a huge increase in traffic, not just an improvement in speed on a few test cases.

Index Terms: CUDA, Deep Packet Inspection, Intrusion Detection, GPGPU, Network Forensics, Rabin Karp Pattern Matching

I. INTRODUCTION

The Graphics Processing Unit (GPU), which was primarily associated with processing graphics, is rapidly evolving towards a more flexible architecture. This has encouraged research and development in several computationally demanding non-graphics applications whose computations can leverage parallel execution [1]. General-Purpose Computing on GPU (GPGPU), also known as GPU Computing, exploits the capabilities of a GPU using APIs such as OpenCL and the Compute Unified Device Architecture (CUDA) [2]. The opportunity is that any algorithm can be implemented, not only graphics algorithms; the challenge is to obtain efficiency and high performance in the fields where GPUs replace or support traditional CPU computing.

A good candidate for GPGPU is Deep Packet Inspection (DPI), a technology in which the appliance looks within the application payload of the traffic by inspecting every byte of every packet, and detects intrusions that are more difficult to detect than simple network attacks [3]. The computational and storage demand for such inspection and analysis is quite high. A network intrusion detection system (NIDS) spends 75% of the overall processing time of each packet in pattern matching [4]. Under high load conditions, high processing abilities are needed to process the captured traffic. A vast majority of earlier research focuses on specialized hardware for the compute-intensive Deep Packet Inspection [5]. Application Specific Integrated Circuits (ASICs) [6], Field Programmable Gate Arrays (FPGAs) [7] and network processor units (NPUs) [8] provide fast discrimination of content within packets while also allowing for data classification. Engaging special hardware translates to higher costs, as the data and processing requirements of even medium-size networks soon increase exponentially. Some researchers have discussed an interesting entropy-based technique for fine-grained traffic analysis [9] [10]. Jeyanthi et al. presented an enhanced approach to this behavior-based detection mechanism, the “Enhanced Entropy” approach, which deploys trust credits to detect and outwit attackers at an early stage. The approaches seem good but call for methods such as bi-directional measurements for best performance, which impose additional computing overhead [11].

The task of pattern matching for DPI can be performed at a much more reasonable cost on GPUs, which are a necessary component of most computers these days and are therefore readily available. The task is split into parallel segments, so the string is searched much faster than on a CPU.

There are several pattern-matching algorithms for DPI, Aho-Corasick and Boyer-Moore being the most popular [12][13]. In this paper the Rabin Karp algorithm [30] has been selected over these and several other popular algorithms because it involves sequential accesses to memory in order to locate all the appearances of a pattern and is based on compute-intensive rolling hash calculations. The algorithm has delivered good performance and efficiency when executed on a multicore GPU (in this study, the NVIDIA GeForce 635M).

II. CONCEPTS

A. GPGPU

A GPU, capable of running thousands of lightweight threads in parallel, is designed for intensive, highly parallel computation, specifically for graphics rendering. Current-generation GPUs, designed to act as high-performance stream processors, derive their performance from parallelism at both the data and the instruction level. In a GPU's architecture, many more transistors are devoted to data processing and fewer to data caching [14].

Traditionally, GPUs were designed to perform very specific types of calculations on textures and primitive geometric objects, and it was very difficult to perform non-graphical operations on them. One of the first successful implementations of parallel processing on a GPU using non-graphical data was registered in 2003, after the appearance of shaders [15]. The latest generations of graphics cards incorporate new generations of GPUs, which are designed to perform both graphical and non-graphical operations. One of the pioneers in this area is NVIDIA [16], which rules the market with its parallel computing platform and API model, CUDA (Compute Unified Device Architecture) [17], and is gradually overtaking its major competitor, AMD. Others also have their own programmable interfaces for their technologies: the vendor-neutral, open-source OpenCL, which gives major competition to CUDA, as well as AMD ATI Stream, HMPP, RapidMind and PGI Accelerator [18].

B. NVIDIA GPUs

NVIDIA's GPUs offered the opportunity to utilize the GPU for GPGPU (General Purpose computing on Graphics Processing Units). In 2001, the NVIDIA GeForce 3 exposed the application developer to the internal instruction set of the floating-point vertex engine (VS & T/L stage). Later GPUs extended general programmability and floating-point capability to the pixel shader stages and exploited data independence. With the introduction of the GeForce 8 series, GPUs became a more generalized computing device. The unified shader architecture introduced in that series was advanced further in the Tesla microarchitecture behind the GeForce 200 series, which offered double-precision support for use in GPGPU applications. The successor GPU microarchitectures, viz. Fermi [19] for the GeForce 400 and GeForce 500 series, then Kepler [20] for the GeForce 600 and GeForce 700 series, and subsequently Maxwell [21] for the GeForce 800 and GeForce 900 series, have dramatically improved performance as well as energy efficiency over their predecessors. Very recently NVIDIA announced major architectural improvements in its yet-to-be-released Pascal architecture [22], such as unified memory, so that the CPU and GPU can both access main system memory as well as memory on the graphics card. The new GPU generation also boasts a feature called NVLink, which would allow data to flow between the CPU and GPU at 80 GB per second, compared to the 16 GB per second available presently.

C. CUDA

There are a variety of GPU programming models, the popular ones being the open-source OpenCL [23], the BSD-licensed BrookGPU [24], which is free, and CUDA [25], NVIDIA's proprietary framework. CUDA (Compute Unified Device Architecture) is well supported by NVIDIA and has hence become the most popular of all. [...] transfers to and from the GPU and efficient management of the GPU shared memory. Software developers can leverage the parallel throughput architecture of CUDA through parallel computing extensions to many popular high-level languages, such as C, C++ and FORTRAN, through CUDA-accelerated compute libraries, and through compiler directives. Support for Java, Python, Perl, Haskell, .NET, Ruby and other languages is also available.

The CUDA programming model uses a C-like language in which the CUDA threads execute on the device (GPU), which operates as a coprocessor to the host (CPU). The GPU and CPU program code exist in the same source file, with the GPU kernel code indicated by the __global__ qualifier in the function declaration. The host program launches the sequence of kernels. A kernel is organized as a hierarchy of threads. Threads are grouped into blocks, and blocks are grouped into a grid. Each thread has a unique local index in its block, and each block has a unique index in the grid; the kernel can use these indices to compute array subscripts. Threads in a single block are executed on a single multiprocessor, share the software data cache, and can synchronize and share data with other threads in the same block; a warp is always a subset of the threads of a single block. Threads in different blocks may be assigned to different multiprocessors concurrently, to the same multiprocessor concurrently (using multithreading), or to the same or different multiprocessors at different times, depending on how the blocks are scheduled dynamically. There is a physical limit on the size of a thread block, determined by the compute capability of the GPU. For the compute capability 2.1 GPU used in this work, a block may contain at most 1024 threads (32 warps), while a multiprocessor can keep up to 1536 threads (48 warps) resident.

Fig. 1. CUDA Programming Model (Source: www.nvidia.com)

The CUDA programming model assumes that the host and the device maintain their own separate memory spaces. The CUDA program manages device memory allocation and de-allocation as well as the data transfer between host and device memory [25]. CUDA devices provide access to several memory spaces, such as global memory, constant memory, texture memory, shared memory and registers, each with its own performance characteristics and limitations. Fig. 2 illustrates the memory architecture of a CUDA device, specifically the NVIDIA [...]
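
The two ideas described above, splitting the search into parallel segments and letting each thread derive its array subscripts from its block and thread indices, can be illustrated with a small sketch. This is not the authors' CUDA kernel: it is a sequential Python emulation in which two nested loops stand in for the grid of blocks and the threads within a block. The constants BASE and MOD and the names rolling_hash_matches and emulated_grid_search are assumptions made purely for illustration.

```python
# Sketch only: a sequential emulation of a CUDA-style launch, not real GPU code.
# BASE, MOD and both function names are illustrative assumptions.

BASE = 256          # assumed radix: one symbol per byte
MOD = 1_000_003     # assumed prime modulus for the hash


def rolling_hash_matches(text, pattern, start, stop):
    """Return all match offsets i with start <= i < stop (Rabin-Karp)."""
    m = len(pattern)
    # A window may begin no later than len(text) - m.
    stop = min(stop, len(text) - m + 1)
    if m == 0 or start >= stop:
        return []
    high = pow(BASE, m - 1, MOD)  # weight of the leading byte of a window
    p_hash = 0
    w_hash = 0
    for k in range(m):
        p_hash = (p_hash * BASE + pattern[k]) % MOD
        w_hash = (w_hash * BASE + text[start + k]) % MOD
    matches = []
    for i in range(start, stop):
        # Compare bytes only on a hash hit, to rule out collisions.
        if w_hash == p_hash and text[i:i + m] == pattern:
            matches.append(i)
        if i + 1 < stop:
            # Slide the window: drop text[i], append text[i + m].
            w_hash = ((w_hash - text[i] * high) * BASE + text[i + m]) % MOD
    return matches


def emulated_grid_search(text, pattern, grid_dim, block_dim):
    """Emulate grid_dim blocks of block_dim threads, one text segment each."""
    total_threads = grid_dim * block_dim
    positions = max(len(text) - len(pattern) + 1, 0)
    chunk = -(-positions // total_threads)  # ceiling division
    found = []
    for block_idx in range(grid_dim):           # stands in for the block index
        for thread_idx in range(block_dim):     # stands in for the thread index
            # Global thread id, computed as a CUDA kernel typically does.
            tid = block_idx * block_dim + thread_idx
            start = tid * chunk
            found.extend(rolling_hash_matches(text, pattern, start, start + chunk))
    return found
```

Calling emulated_grid_search(payload, signature, 2, 4), with payload and signature as bytes objects, returns the offsets of every occurrence of the signature in ascending order. Each emulated thread owns a range of starting positions but may read up to len(pattern) - 1 bytes past the end of its range, so matches that straddle a segment boundary are still found.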
