An Impulse-C Hardware Accelerator for Packet Classification Based on Fine/Coarse Grain Optimization
Total Page:16
File Type:pdf, Size:1020Kb
Hindawi Publishing Corporation International Journal of Reconfigurable Computing Volume 2013, Article ID 130765, 23 pages http://dx.doi.org/10.1155/2013/130765 Research Article An Impulse-C Hardware Accelerator for Packet Classification Based on Fine/Coarse Grain Optimization O. Ahmed, S. Areibi, R. Collier, and G. Grewal Faculty of Engineering and Computer Science, University of Guelph, Guelph, ON, Canada Correspondence should be addressed to S. Areibi; [email protected] Received 26 March 2013; Revised 10 June 2013; Accepted 10 July 2013 Academic Editor: Walter Stechele Copyright © 2013 O. Ahmed et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Current software-based packet classification algorithms exhibit relatively poor performance, prompting many researchers to concentrate on novel frameworks and architectures that employ both hardware and software components. The Packet Classification with Incremental Update (PCIU) algorithm, Ahmed et al. (2010), is a novel and efficient packet classification algorithm witha unique incremental update capability that demonstrated excellent results and was shown to be scalable for many different tasks and clients. While a pure software implementation can generate powerful results on a server machine, an embedded solution may be more desirable for some applications and clients. Embedded, specialized hardware accelerator based solutions are typically much more efficient in speed, cost, and size than solutions that are implemented on general-purpose processor systems. This paper seeks to explore the design space of translating the PCIU algorithm into hardware by utilizing several optimization techniques, ranging from fine grain to coarse grain and parallel coarse grain approaches. The paper presents a detailed implementation of a hardware accelerator of the PCIU based on an Electronic System Level (ESL) approach. Results obtained indicate that the hardware accelerator achieves on average 27x speedup over a state-of-the-art Xeon processor. 1. Introduction software components. The continuous explosive growth of Internet traffic will ultimately require that future packet clas- The task of packet classification entails the matching of sification algorithms are implemented with a purely hardware an incoming packet with rules (established in an exist- approach. ing classifier) to determine the type of action that would With our recent work [1] on the packet classification be appropriate. Although this problem has been studied problem, we proposed a novel algorithm, called “Packet Clas- extensively, the fast emergence of new network applications, sification with an Incremental Update” (PCIU), that exhibited coupledwiththerapidgrowthoftheInternet,hasintroduced a substantial improvement over previous approaches. The many new challenges, and the research community remains PCIU provides lower preprocessing time, lower memory motivated to design novel and efficient packet classification consumption, greater ease for incremental rule update, and solutions. Packet classification plays a crucial role for a faster classification time (when compared to other state-of- number of network services, including, but not limited to, the-art algorithms), with the maximum memory required by policy-based routing, traffic billing, and preventing unautho- PCIU for accommodating 10,000 rules requiring less than rized access using firewalls. Moreover, packet classification 2.5 MB for the worst case. In this paper, we attempt to give algorithms that will scale to large, multifield databases are detailed explanation of the implementation of the PCIU becoming essential for a variety of applications, including algorithm that was previously published in [1] in order for the load balancers, network security appliances, and quality of readers to reproduce the work. We are releasing the design service filtering. Unfortunately, the current, software-based and implementation publicly to assist those interested in packet classification algorithms exhibit relatively poor perfor- implementing the PCIU algorithm along with benchmarks mance, prompting many researchers to concentrate on novel [2]. Furthermore, we propose enhancements to the PCIU frameworks and architectures that employ both hardware and [1] in this paper by making it more accessible, for a variety 2 International Journal of Reconfigurable Computing of applications, by way of a hardware implementation. Field hardware. Therefore, designers are encouraged to manually Programmable Gate Arrays (FPGAs) are considered to be restructure their code to optimize the resulting hardware. excellent platforms and candidates for mapping packet clas- Typically, this is done by applying various transformations to sification to hardware. FPGAs provide an excellent trade off theoriginalsourcecode.However,thesheervolumeofthese between reprogrammability and performance compared with language-level transformations leads to a whole design space traditional Application Specific Integrated Circutis (ASICs). of potential solutions, all based on different optimizations. The term flexibility refers to the concept of reprogramming The main goal of this paper is (i) to improve the run-time the FPGA as the algorithm is modified and updated. The performance of the PCIU packet classification algorithm performance on the other hand refers to the performance that by parallelizing the algorithm and eventually mapping it can be achieved by exploiting parallelism at the bit level in onto an FPGA, and (ii) to perform an empirical study to addition to instruction and task level. FPGAs are considered determine the overall effectiveness of different language-level to be a good fit for classification since the target is “embedded transformations when using Impulse-C to implement the systems” which with typically to consume less power than PCIU algorithm. current state-of-the-art general-purpose processors. In addi- The main contributions of this paper can be clearly stated tion to reducing power consumption “Embedded systems”, as follows. attempt to increase reliability and decrease operating cost. Embedded systems are specialized HW/SW computer sys- (1) The majority of networking applications are beyond tems, that are custom-designed to perform an often highly the capabilities of general-purpose processors since real-time constrained task in generally small form factor current networking trends are pushing towards com- designs.AnFPGAisalsoanexcellentcandidateforrun plex protocols that provide additional and improved time dynamic reconfiguration where only some parts of the network services. The Impulse-C based implemen- algorithm can be present while others are swapped in and out tation proposed in this work achieves substantial as required. This enables any classification based algorithm to speedup (27x) over a pure software implementation consume less power and also to fit into FPGAs with different running on a state-of-the-art Xeon processor. sizes. (2) An extensive experimental analysis is performed in When mapping any algorithm onto a reconfigurable which all possible combinations of optimizations are computing platform such as FPGAs, an important step considered. To the best of our knowledge, this is the involves using an appropriate language for design entry and first paper to propose such extensive exploration for hardware synthesis. VHDL and Verilog are two popular fine-grained optimization of Impulse-C. The explo- hardware description languages (HDL) used in both industry ration performed can be easily extended to similar and academia. The main advantage of these languages is the applications that utilize ESL based approaches. efficient hardware produced via synthesis since they describe (3) In addition to a full factorial experiment that will the hardware at the register-transfer level. However, designers allow us to test interactions between different com- consume quite a substantial amount of time dealing with binations of language-level transformation based on structural details of the actual hardware. On the other hand, fine grain optimization (FGO), the authors seek higher level languages or Electronic System Level (ESL) based to further improve performance via Coarse Grain languagessuchasHandel-C[3], Impulse-C [4], and Catapult- (CGO) and Parallel Coarse Grain Optimization C[5] have started to gain popularity as an alternative to (PCGO) by exploiting both data parallelism and VHDL and Verilog for the purpose of hardware acceleration pipelining. of software-based applications. One of the goals of these languages is to enable designers to focus their attention at The remainder of this paper is organized into six sections. higher level of abstraction; that is, on the algorithm to be Section 2 provides an overview of the packet classification mapped into hardware rather than the lower level details of problem, along with necessary background. Section 3 pro- the circuit to be built. Designers can start with automatic vides a brief overview of the most significant work published compilation and then focus their efforts on improving loops in the field of packet classification. In Section 4,thePCIU and constructs to further enhance the performance of the algorithm [1] is described briefly along with the different hardware accelerator. The designer’s required effort is, there- stages of preprocessing, classification, and updating.