16th International Symposium on Field-Programmable Custom Computing Machines

A SRAM-based Architecture for Trie-based IP Lookup Using FPGA

Hoang Le, Weirong Jiang, Viktor K. Prasanna
Ming Hsieh Department of Electrical Engineering
University of Southern California
Los Angeles, CA 90089, USA
{hoangle, weirongj, prasanna}@usc.edu

(Supported by the United States National Science Foundation under grant No. CCR-0702784.)

Abstract

Internet Protocol (IP) lookup in routers can be implemented by some form of tree traversal. Pipelining can dramatically improve the search throughput; however, it results in unbalanced memory allocation over the pipeline stages, which has been identified as a major challenge for pipelined solutions. In this paper, an IP lookup rate of 325 MLPS (million lookups per second) is achieved using a novel SRAM-based bidirectional optimized linear pipeline architecture on Field Programmable Gate Array, named BiOLP, for tree-based search engines in IP routers. BiOLP also achieves a perfectly balanced memory distribution over the pipeline stages. Moreover, by employing caching to exploit Internet traffic locality, BiOLP can achieve a throughput of up to 1.3 GLPS (billion lookups per second). It also maintains packet input order and supports route updates without blocking subsequent incoming packets.

Keywords: IP Address Lookup, Longest Prefix Matching, Reconfigurable Hardware, Field Programmable Gate Array (FPGA).

1. Introduction

With the rapid growth of the Internet, the design of high speed IP routers has been a major area of research. Advances in optical networking technology are pushing link rates in high speed IP routers beyond OC-768 (40 Gbps). Such high rates demand that packet forwarding in IP routers be performed in hardware. For instance, a 40 Gbps link requires that a packet be forwarded every 8 ns, i.e. a throughput of 125 million packets per second (MPPS), for minimum-size (40-byte) packets. Such throughput is impossible using existing software-based solutions [1], [2].

Most hardware-based solutions for high speed packet forwarding in routers fall into two main categories: ternary content addressable memory (TCAM)-based and dynamic/static random access memory (DRAM/SRAM)-based solutions. Although TCAM-based engines can retrieve results in just one clock cycle, their throughput is limited by the relatively low speed of TCAMs. They are expensive and offer little flexibility to adapt to new addressing and routing protocols [3]. As shown in Table 1, SRAMs outperform TCAMs with respect to speed, density, and power consumption.

Table 1: Comparison of TCAM and SRAM technologies

                                  TCAM (18Mb chip)   SRAM (18Mb chip)
  Maximum clock rate (MHz)        266 [4]            400 [5], [6]
  # of transistors per bit [7]    16                 6
  Power consumption (Watts)       12 ∼ 15 [8]        ≈ 0.1 [9]

Since SRAM-based solutions rely on some form of tree traversal, they require multiple cycles to perform a single lookup. Several researchers have explored pipelining to improve the throughput. A simple pipelining approach is to map each tree level onto a pipeline stage with its own memory and processing logic, so that one packet can be processed every clock cycle. However, this approach results in unbalanced tree node distribution over the pipeline stages, which has been identified as a dominant issue for pipelined architectures [10]. In an unbalanced pipeline, the "fattest" stage, which stores the largest number of tree nodes, becomes a bottleneck. It adversely affects the overall performance of the pipeline in several ways. First, more time is needed to access the larger local memory, which reduces the global clock rate. Second, a fat stage receives many updates, since the number of updates to a stage is proportional to the number of tree nodes stored in it; during intensive route/rule insertion, the fattest stage may even overflow its memory. Furthermore, since it is unclear at hardware design time which stage will be the fattest, memory of the maximum size must be allocated for every stage. This over-provisioning results in memory wastage [11].
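To make the imbalance concrete, the following sketch (our illustration, not code from the paper) builds a uni-bit trie in software and counts the nodes on each level; if level i were mapped to pipeline stage i, the largest count would dictate the memory the fattest stage must provision. The prefix sample is the small set used later in Figure 1(a); all identifiers are ours.

#include <cstdio>
#include <string>
#include <vector>

struct Node { Node* child[2] = {nullptr, nullptr}; };

// Add one prefix, creating the path of trie nodes bit by bit.
static void insertPrefix(Node* root, const std::string& bits) {
    Node* n = root;
    for (char b : bits) {
        int i = b - '0';
        if (!n->child[i]) n->child[i] = new Node();
        n = n->child[i];
    }
}

// Count how many nodes live on each level; level i maps to stage i.
static void countLevels(const Node* n, int level, std::vector<int>& cnt) {
    if (!n) return;
    if ((int)cnt.size() <= level) cnt.resize(level + 1, 0);
    cnt[level]++;
    countLevels(n->child[0], level + 1, cnt);
    countLevels(n->child[1], level + 1, cnt);
}

int main() {
    // Sample prefixes (the Figure 1(a) set, without the trailing '*').
    const char* prefixes[] = {"0", "000", "010", "01001",
                              "01011", "011", "110", "111"};
    Node root;
    for (const char* p : prefixes) insertPrefix(&root, p);

    std::vector<int> cnt;
    countLevels(&root, 0, cnt);
    for (int l = 0; l < (int)cnt.size(); ++l)
        std::printf("level %d: %d node(s)\n", l, cnt[l]);
    // The largest count is the memory the "fattest" stage must provision.
}

Even for this toy set, level 3 holds five of the fifteen nodes; on real routing tables, where prefix lengths cluster around /16 to /24, the skew across levels is far more pronounced.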
To balance the memory distribution across stages, several novel pipeline architectures have been proposed [11-13]. However, none of them achieves a perfectly balanced memory distribution over the stages. Moreover, some of them use non-linear structures and therefore suffer throughput degradation.

The key issues for any new IP lookup architecture are high throughput, the maximum size of the supported routing table, incremental updates, in-order packet output, and power consumption. To address these challenges, we propose and implement a SRAM-based bidirectional optimized linear pipeline architecture, named BiOLP, for trie-based IP lookup engines in routers on FPGA. This paper makes the following contributions:

• To the best of our knowledge, this architecture is the first trie-based design that uses only on-chip FPGA resources to support a fairly large routing table, Mae-West (rrc08, 20040901) [14].
• This is also the first architecture that employs IP caching on FPGA without using TCAM to effectively exploit Internet traffic locality. A high throughput of more than one packet per clock cycle is obtained.
• The implementation results show a throughput of 325 MLPS for the non-cache-based design, which, to the best of our knowledge, makes it the fastest IP lookup engine on FPGA, and of 1.3 GLPS for the cache-based design. This is a promising solution for next generation IP routers.

The rest of the paper is organized as follows: Section 2 covers the background and related work; Section 3 introduces the BiOLP architecture; Section 4 describes the BiOLP implementation; Section 5 presents implementation results; and Section 6 concludes the paper.

2. Background and Related Work

2.1. Trie-based IP Lookup

The nature of IP lookup is longest prefix matching (LPM). The most common data structure in algorithmic solutions for performing LPM is some form of trie [1]. A trie is a binary tree in which a prefix is represented by a node. The value of the prefix corresponds to the path from the root of the tree to the node representing the prefix, and branching decisions are made based on consecutive bits of the prefix. A trie is called a uni-bit trie if only one bit is used to make a branching decision at a time. The prefix set in Figure 1(a) corresponds to the uni-bit trie in Figure 1(b). For example, the prefix "010*" corresponds to the path starting at the root and ending in node P3: first a left turn (0), then a right turn (1), and finally a turn to the left (0). Each trie node contains two fields: the represented prefix and the pointer to the child nodes. By using the optimization called leaf-pushing [15], each node needs only one field: either the pointer to the next-hop address or the pointer to the child nodes. Figure 1(c) shows the leaf-pushed uni-bit trie derived from Figure 1(b). For simplicity, we consider only the leaf-pushed uni-bit trie in this paper, though our ideas also apply to other forms of tries [16].

Given a leaf-pushed uni-bit trie, IP lookup is performed by traversing the trie according to the bits in the IP address. When a leaf is reached, the prefix associated with the leaf is the longest matched prefix for that IP address.

[Figure 1: Prefix set; Uni-bit trie; Leaf-pushed uni-bit trie. Panel (a) lists the prefix set: 0* (P1), 000* (P2), 010* (P3), 01001* (P4), 01011* (P5), 011* (P6), 110* (P7), 111* (P8). Panels (b) and (c) are diagrams of the corresponding uni-bit trie and its leaf-pushed form.]
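As a concrete illustration of Section 2.1 (our sketch, not the paper's hardware design), the C++ below builds the uni-bit trie for the Figure 1(a) prefix set, applies leaf-pushing so that every node carries either a next-hop label or child pointers, and resolves the longest prefix match by walking to a leaf. All function and field names are ours.

#include <cstdio>
#include <string>

struct Node {
    Node* child[2] = {nullptr, nullptr};
    const char* nextHop = nullptr;              // non-null at prefix nodes
    bool isLeaf() const { return !child[0] && !child[1]; }
};

// Insert one prefix (as a bit string) with its next-hop label.
static void insertPrefix(Node* root, const std::string& bits, const char* nh) {
    Node* n = root;
    for (char b : bits) {
        int i = b - '0';
        if (!n->child[i]) n->child[i] = new Node();
        n = n->child[i];
    }
    n->nextHop = nh;
}

// Leaf-pushing [15]: push the closest enclosing prefix downward so each
// node holds either a next hop (leaf) or child pointers, never both.
static void leafPush(Node* n, const char* inherited) {
    if (!n) return;
    const char* cur = n->nextHop ? n->nextHop : inherited;
    if (n->isLeaf()) { n->nextHop = cur; return; }
    n->nextHop = nullptr;
    for (int i = 0; i < 2; ++i) {
        if (!n->child[i] && cur) n->child[i] = new Node();  // fill the hole
        leafPush(n->child[i], cur);
    }
}

// Walk one bit per level until a leaf; its label is the longest match.
static const char* lookup(const Node* root, const std::string& addrBits) {
    const Node* n = root;
    std::string::size_type i = 0;
    while (n && !n->isLeaf() && i < addrBits.size())
        n = n->child[addrBits[i++] - '0'];
    return (n && n->isLeaf()) ? n->nextHop : nullptr;
}

int main() {
    const struct { const char* bits; const char* name; } tbl[] = {
        {"0", "P1"},     {"000", "P2"}, {"010", "P3"}, {"01001", "P4"},
        {"01011", "P5"}, {"011", "P6"}, {"110", "P7"}, {"111", "P8"}};
    Node root;
    for (const auto& e : tbl) insertPrefix(&root, e.bits, e.name);
    leafPush(&root, nullptr);
    std::printf("%s\n", lookup(&root, "01000000"));   // prints P3
    std::printf("%s\n", lookup(&root, "01001000"));   // prints P4
}

Note how the address 01000000 falls into a leaf that leaf-pushing filled with P3, the closest enclosing prefix, while 01001000 reaches the deeper P4 leaf, exactly the behavior shown in Figure 1(c).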
2.2. Related Work

Since this work is based entirely on FPGA, we cover only related work in the same domain. As mentioned above, there are two types of architectures on FPGA: TCAM-based and SRAM-based. Each has its own advantages and disadvantages.

2.2.1. TCAM-based Architectures

TCAM is used in some architectures to reduce the complexity of the design. However, it also lowers the clock speed and increases the power consumption of the entire system. Song et al. [17] introduce an architecture called BV-TCAM, which combines TCAM with the Bit Vector (BV) algorithm to effectively compress the data representations and boost throughput. This design can handle a lookup rate of only about 30 MLPS.

Kasnavi et al. [18] propose a similar cache-based IP address lookup architecture, comprised of a non-blocking Multizone Pipelined Cache (MPC) and a hardware-supported IP routing lookup method. They use a two-stage pipeline for a half-prefix/half-full address IP cache that results in lower activity than conventional caches. Another architecture, with a lookup speed of 66 MLPS, has also been proposed; that design requires a commodity Random Access Memory (RAM), and the achieved lookup rate is fairly low.

2.2.2. SRAM-based Architectures

As mentioned above, pipelining can dramatically improve the throughput of tree traversal. A straightforward way to pipeline a tree is to assign each tree level to a different stage, so that a packet can be processed every clock cycle. However, this simple pipeline scheme results in unbalanced memory distribution, leading to low throughput and inefficient memory allocation.
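To illustrate the level-per-stage scheme just described, here is a toy cycle-level model (our sketch, not any of the cited designs): stage l holds the node memory of trie level l, a packet consumes one address bit per stage, and once the pipeline fills, one lookup completes per clock. The two-level leaf-pushed trie, the Entry/Flight types, and the hop numbering are all illustrative assumptions.

#include <cstdio>
#include <cstdint>
#include <vector>

// One stage = the node memory of one trie level. child[0] < 0 marks a leaf,
// in which case 'hop' is the next-hop id (0 = no matching prefix).
struct Entry { int child[2]; int hop; };
struct Flight { uint32_t addr; int node; int hop; bool valid; };

int main() {
    // Leaf-pushed toy trie over 2-bit addresses: 0* -> hop 1, 11* -> hop 2.
    std::vector<std::vector<Entry>> mem = {
        { {{0, 1}, 0} },                      // stage 0: root
        { {{-1, -1}, 1}, {{0, 1}, 0} },       // stage 1: leaf "0" | node "1"
        { {{-1, -1}, 0}, {{-1, -1}, 2} },     // stage 2: "10" miss | "11" hop 2
    };
    const int depth = (int)mem.size();
    uint32_t addrs[] = {0, 3, 2};             // addresses 00, 11, 10
    std::vector<Flight> pipe(depth + 1, Flight{0, 0, 0, false});

    for (int cycle = 0; cycle < 6; ++cycle) {
        // Shift the pipeline: every in-flight packet advances one stage.
        for (int l = depth; l > 0; --l) pipe[l] = pipe[l - 1];
        pipe[0] = (cycle < 3) ? Flight{addrs[cycle], 0, 0, true}
                              : Flight{0, 0, 0, false};
        // Each stage performs one read of its own local memory in parallel.
        for (int l = 0; l < depth; ++l) {
            Flight& f = pipe[l];
            if (!f.valid || f.node < 0) continue;      // idle or resolved
            const Entry& e = mem[l][f.node];
            if (e.child[0] < 0) { f.hop = e.hop; f.node = -1; }       // leaf
            else f.node = e.child[(f.addr >> (1 - l)) & 1];  // follow addr bit
        }
        if (pipe[depth].valid)
            std::printf("cycle %d: next hop %d\n", cycle, pipe[depth].hop);
    }
}

Running it prints one next-hop result per cycle from cycle 3 onward. The imbalance problem discussed above arises precisely because nothing constrains how many entries each stage's mem[l] must hold: the fattest level fixes the size of its stage's memory.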