Compact Forest: Scalable architecture for IP Lookup on FPGAs

Oğuzhan Erdem — Electrical and Electronics Engineering, Trakya University, Edirne, TURKEY 22030. Email: [email protected]
Aydin Carus — Computer Engineering, Trakya University, Edirne, TURKEY 22030. Email: [email protected]
Hoang Le — Electrical Engineering, University of Southern California, Los Angeles, USA 90007. Email: [email protected]

Abstract—Memory efficiency with compact data structures for Internet Protocol (IP) lookup has recently regained much interest in the research community. In this paper, we revisit the classic trie-based approach to the longest prefix matching (LPM) problem used in IP lookup. Among the available implementation platforms, the Field Programmable Gate Array (FPGA) is a prevailing choice for SRAM-based pipelined architectures for high-speed IP lookup because of its abundant parallelism and other desirable features. However, due to the limited on-chip memory and number of I/O pins of FPGAs, state-of-the-art designs cannot support the large routing tables, consisting of over 350K prefixes, found in backbone routers.

We propose a search algorithm and data structure, denoted Compact Trie (CT), for IP lookup. Our algorithm demonstrates a substantial reduction in memory footprint compared with state-of-the-art solutions. A parallel architecture on FPGAs, named Compact Trie Forest (CTF), is introduced to support the data structure. Along with pipelining techniques, our optimized architecture also employs multiple memory banks in each stage to further reduce memory and resource redundancy. Implementation on a state-of-the-art FPGA device shows that the proposed architecture can support large routing tables consisting of up to 703K IPv4 or 418K IPv6 prefixes. The post place-and-route result shows that our architecture can sustain a throughput of 420 million lookups per second (MLPS), or 135 Gbps for the minimum packet size of 40 Bytes. This result surpasses the worst-case 150 MLPS required by the standardized 100GbE line cards.

I. INTRODUCTION

Most hardware-based solutions for network routers fall into two main categories: Ternary Content Addressable Memory (TCAM)-based and dynamic/static random access memory (DRAM/SRAM)-based solutions. In TCAM-based solutions, each prefix is stored in a word, and an incoming IP address is compared in parallel with all the entries in TCAM in one clock cycle. TCAM-based solutions are simple, and therefore are the de-facto solution in today's routers. However, TCAMs are expensive, power-hungry, and offer little adaptability to new addressing and routing protocols. On the other hand, SRAM has higher density, lower power consumption, and higher speed. The common data structure in SRAM-based solutions is some form of tree, in which multiple memory accesses are required to find the search result. Therefore, FPGA-based pipelining techniques are used to improve the throughput. However, pipelined hardware implementations of these algorithms suffer from inefficient memory usage due to the unbalanced mapping of the tree onto the pipeline stages.

We propose a compact trie forest for trie-based IP lookup. The search data structure is realized on a scalable, high-throughput, SRAM-based linear pipeline architecture. Additionally, the design exploits the dual-ported feature of on-chip memory on FPGAs to achieve high throughput. This paper makes the following contributions:

1) A Compact Trie (CT) structure that achieves better memory efficiency than the traditional binary trie (Section III).
2) A Compact Trie Forest (CTF) consisting of multiple CTs to eliminate the backtracking problem in CTs. These CTs are searched in parallel for high-performance IP lookup, taking advantage of the abundant parallelism provided by state-of-the-art FPGAs (Section III).
3) A linear pipelined SRAM-based architecture that can be easily implemented in hardware. Our optimized architecture also employs multiple memory banks to further improve memory efficiency (Section IV).
4) A design that can support up to 703K IPv4 and 418K IPv6 prefixes, using a state-of-the-art FPGA device. The post place-and-route result shows that our architecture can sustain a throughput of 420 million lookups per second, or 135 Gbps for the minimum packet size of 40 Bytes (Section V).

The rest of the paper is organized as follows. Section II covers the background and overviews the existing solutions for IP lookup. Section III presents in detail the algorithms and data structures for CTF. Section IV introduces the proposed architecture and its implementation on FPGA. Section V presents the experimental setup and implementation results. Section VI concludes the paper.

II. BACKGROUND

A. IP Lookup Overview

IP packet forwarding, or simply IP lookup, is a classic problem. In computer networking, a routing table is a data table stored in a router or a networked computer.

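As a concrete reference point, longest prefix matching over a small table can be sketched in a few lines. This is a minimal illustration of the LPM semantics only, using some of the sample prefixes from Fig. 1; it is not the lookup algorithm proposed in this paper.

```python
# A minimal sketch of longest prefix matching (LPM), using a few of the
# sample prefixes from Fig. 1. Prefixes and addresses are bit strings.
TABLE = {"11": "P1", "111": "P2", "0101": "P3", "00101": "P4"}

def lpm(addr_bits):
    """Return the next hop of the longest prefix matching addr_bits, or None."""
    best = None
    for prefix, nhop in TABLE.items():
        if addr_bits.startswith(prefix) and (best is None or len(prefix) > len(best[0])):
            best = (prefix, nhop)
    return best[1] if best else None

print(lpm("11110000"))  # "111" is longer than "11", so P2
```

A real router additionally falls back to a default route when no entry matches; here the sketch simply returns None.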
978-1-4673-2921-7/12/$31.00 © 2012 IEEE

The routing table stores the routes, and the metrics associated with those routes, such as next-hop routing indices, to particular network destinations. The IP lookup problem is referred to as "longest prefix matching" (LPM), which is used by routers in IP networking to select an entry from the given routing table. To determine the outgoing port for a given address, the longest matching prefix among all the prefixes needs to be determined. Routing tables often contain a default route in case matches with all other entries fail.

B. Related Work

Various hardware-based IP lookup solutions on FPGAs have been proposed in recent years. In general, these FPGA-based approaches can be classified into four categories: (1) linear pattern search in TCAM [1], [2], (2) hash-based solutions [3], [4], (3) binary bit traversal in pipelined tries [5]–[8], and (4) binary value search in pipelined trees [9], [10].

TCAM-based solutions are simple, but they are expensive, power-hungry, and offer little adaptability to new addressing and routing protocols. Hash-based IP lookup schemes have several disadvantages: (a) a large number of different hash tables may be required to store routing prefixes of different lengths, (b) the use of separate hash functions for each length is impractical, (c) it is hard to find perfect hash functions that minimize bin overflows, and (d) additional memory (CAM, etc.) needs to be reserved for resolving overflows.

The most common and simple data structure for IP lookup is the binary trie. Trie-based solutions achieve good throughput performance and support quick prefix updates. In such a trie, the path from the root to a node represents a prefix in a routing table. Fig. 1 illustrates a sample prefix table and its corresponding binary trie; each black node corresponds to a prefix. Multiple memory accesses are required to find the longest matched prefix, so pipelining techniques are used to improve the throughput. However, pipelined hardware implementations of this algorithm suffer from inefficient memory usage due to the unbalanced mapping of the tree onto the pipeline stages.

Figure 1. (a) A sample prefix table (b) The corresponding binary trie

In pipelined tree-based IP lookup, each routing prefix is either converted to an L-bit value range, or expanded to some pre-defined lengths to compare directly with the input address. In a binary search tree (BST), each node has a value (prefix) and an associated next-hop index; the left subtree of a node contains only values less than or equal to the node's value, and the right subtree contains values greater than the node's value. Pipelined BST-based IP lookup solutions are limited by the complex pre-processing required to convert the routing prefixes into exclusive ranges and sort them. This structure also makes incremental updates difficult.

III. ALGORITHM AND DATA STRUCTURE

A. Definitions and Notations

The following notations are used throughout the paper: MSB - Most Significant Bit, LSB - Least Significant Bit, MSSB - Most Significant Set Bit, LSSB - Least Significant Set Bit, MSRB - Most Significant Reset Bit, LSRB - Least Significant Reset Bit. For instance, the prefix 00110101∗ has 0 and 1 as its MSB and LSB values, and 2, 7, 0 and 6 as its MSSB, LSSB, MSRB and LSRB positions, respectively.

Definition: A prefix node in a trie is any node for which the path from the root of the trie corresponds to an entry in the routing table. If no valid prefix is stored in a trie node, it is called a non-prefix node.

Definition: The active part (AP) of a prefix is the bit string between
1) the MSSB and LSSB bits for (MSB, LSB) = (0,0)
2) the MSSB and LSRB bits for (MSB, LSB) = (0,1)
3) the MSRB and LSSB bits for (MSB, LSB) = (1,0)
4) the MSRB and LSRB bits for (MSB, LSB) = (1,1)
of the prefix, excluding both the MSS(R)B and LSS(R)B bits. For example, the active parts of the prefixes 011011∗ and 00101100∗ are 1 and 01, respectively.

Definition: If two prefixes have the same active part, they are called conflicted prefixes. For instance, the prefixes 011001∗ and 0011010∗ are conflicted because they both have the active part 10.

B. Prefix Table Conversion

A prefix p can be expressed as the concatenation of three substrings x, y and z, such that p = xyz. In this notation, x is a string composed of only 0's followed by a single 1 (the MSSB), or only 1's followed by a single 0 (the MSRB). z is a string composed of a 1 (the LSSB) followed by only 0's, or a 0 (the LSRB) followed by only 1's. Alternatively, the prefix p = xyz can be represented as the triplet {|x|, y, |z|}. For example, 000010010100∗ can be represented as {5, 0010∗, 3}, with 5 and 3 being the lengths of the preceding and succeeding runs of zeros plus one. Hence, the information lost by removing the preceding and succeeding zeros is recovered from their lengths. Using this representation, the input prefix table is converted to a compact prefix table (C-PT). The fields |x| and |z| can each be stored in ⌈log2 W⌉ bits, where W denotes the length of an IP address (32 in IPv4, and 64 in IPv6).

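The conversion described above can be sketched as follows. This is a best-effort rendering of the decomposition rules from this subsection, not the authors' code:

```python
def decompose(prefix):
    """Split a prefix p into (x, y, z) per Section III-B.

    x: the leading 0s up to and including the MSSB (or leading 1s up to the MSRB);
    z: the LSSB followed by trailing 0s (or the LSRB followed by trailing 1s);
    y: everything in between, i.e. the active part.
    Prefixes whose bits are all equal (e.g. "11") cannot be decomposed.
    """
    msb, lsb = prefix[0], prefix[-1]
    # x ends at the first bit that differs from the MSB
    i = prefix.index("1" if msb == "0" else "0")
    # z starts at the last bit that differs from the LSB
    j = len(prefix) - 1 - prefix[::-1].index("1" if lsb == "0" else "0")
    x, y, z = prefix[: i + 1], prefix[i + 1 : j], prefix[j:]
    return x, y, z

x, y, z = decompose("000010010100")
print((len(x), y, len(z)))  # the triplet (5, '0010', 3), as in the text
```

As a sanity check, the conflicted prefixes 011001∗ and 0011010∗ from Section III-A both yield y = "10" under this decomposition.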
The sample compact prefix table is shown in Table I.

Table I
COMPACT PREFIX TABLE

PT_init          |  PT                  |  C-PT
Prefix     NHop  |  x     y     z      |  MSB,LSB  |x|  y     |z|  NHop
11*        P1    |  -     11*   -      |  -,-      -    11*   -    P1
111*       P2    |  -     111*  -      |  -,-      -    111*  -    P2
0101*      P3    |  01    *     01     |  0,1      2    *     2    P3
00101*     P4    |  001   *     01     |  0,1      3    *     2    P4
01001*     P5    |  01    0*    01     |  0,1      2    0*    2    P5
01110*     P6    |  01    1*    10     |  0,0      2    1*    2    P6
10001*     P7    |  10    0*    01     |  1,1      2    0*    2    P7
11001*     P8    |  110   *     01     |  1,1      3    *     2    P8
11010*     P9    |  110   *     10     |  1,0      3    *     2    P9
000101*    P10   |  0001  *     01     |  0,1      4    *     2    P10
001010*    P11   |  001   0*    10     |  0,0      3    0*    2    P11
010001*    P12   |  01    00*   01     |  0,1      2    00*   2    P12
011101*    P13   |  01    11*   01     |  0,1      2    11*   2    P13
011110*    P14   |  01    11*   10     |  0,0      2    11*   2    P14
100011*    P15   |  10    0*    011    |  1,1      2    0*    3    P15
100110*    P16   |  10    01*   10     |  1,0      2    01*   2    P16
110001*    P17   |  110   0*    01     |  1,1      3    0*    2    P17
111010*    P18   |  1110  *     10     |  1,0      4    *     2    P18

C. Compact Trie Structure

A compact prefix table can be represented by a binary trie constructed using only the active part (AP_prefix), i.e., the y substring, of the prefixes. Two conflicted prefixes are differentiated from each other by their |x|, |z|, MSB, LSB values, or any combination of them. For instance, the two conflicted prefixes 01001∗ and 100100∗ have the same |x| = 2, but different |z| (2 and 3), MSB (0 and 1) and LSB (1 and 0) values. The resulting trie is shallower and denser than a traditional binary trie; we call it a compact trie (CT), shown in Fig. 2a. In a CT, trie traversal is similar to that of the binary trie, but the matching property is similar to that of the BST. As previously stated, in addition to the child pointers and the next-hop information fields, extra information (the |x|, |z|, MSB and LSB values) is stored at each node to differentiate the conflicted prefixes.

However, the variable number of conflicted prefixes at each node results in memory inefficiency, since in hardware implementations the size of every node is determined by that of the largest node. To solve this problem, an auxiliary data structure can be constructed for the conflicted prefixes; yet a secondary search is then required at each node. Alternatively, backtracking can be employed to obtain the correct next-hop result. In backtracking, the search proceeds in the backward direction after it fails to find a match in the forward direction. Backtracking in hardware pipelining requires either stalling the pipeline or duplicating the memory for backward searching; therefore, it is not desirable.

We develop a novel approach to solve the backtracking problem in the CT implementation. We set a limit Ptrie on the number of conflicted prefixes per node, and move the excess conflicted prefixes to a newly generated CT. The final data structure is called a compact trie forest (CTF). IP lookup in a CTF is performed in parallel in all the tries, and the results are fed into a priority encoder. In a CTF, backtracking is eliminated because each node stores at most Ptrie prefix(es), and matching results can be resolved at each node; hence, the search simply moves in the forward direction. The design parameter Ptrie is a trade-off between memory efficiency and the number of CTs in the forest. Our analysis of the real routing tables collected from [11] shows that the number of compact tries is no more than 9 for Ptrie = 2. For the prefixes that cannot be decomposed, such as 11∗ and 111∗ (less than 1% of all prefixes), a single traditional binary trie with no compaction is used. Fig. 2b shows the forest structure composed of CTs for the sample prefixes in Fig. 1, for Ptrie = 2.

Figure 2. (a) Compact Trie (CT) (b) Compact Trie Forest (CTF)

D. IP Lookup Algorithm

For each incoming packet, the destination IP address is extracted. The active part APkey is obtained from the IP address and searched in all the CTs in parallel. The search starts from the root node, and at each node visited, APkey is left-shifted by one bit. The direction of traversal is determined by the most significant bit of APkey (left if 0, right otherwise), as in a binary trie. The outputs from all tries are compared, and the longest match is returned as the final result. The search algorithm for a CT is presented in Alg. 1; the notations used in the algorithm are listed in Table II.

Table II
LIST OF NOTATIONS USED IN THE ALGORITHM

Notation        Meaning
MSB(LSB)key     Most (Least) Significant Bit of the key
MSB(LSB)prefix  Most (Least) Significant Bit of the prefix
APkey           Active part of the IP address
Mkey            MSSB position (for MSBkey = 0) or MSRB position (for MSBkey = 1) of the key
NHIkey          Next-hop information for the key
Lkey^int        Number of consecutive zeros (ones) after the most significant set (reset) bit in APkey, for LSBprefix = 0 (LSBprefix = 1)
B0              Most significant bit value in APkey
Mprefix         |x| of the prefix
Lprefix         |z| of the prefix
NHIprefix       Next-hop information of the prefix

The prefix update complexity of CTF is the same as that of a binary trie; due to space limitations, however, the update procedure is not presented in this paper.

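The spill rule that turns a CT into a CTF can be sketched as follows. This is an illustrative model only: the paper's CTs are real binary tries over the active part, which we flatten here into dictionaries keyed by y, and `P_TRIE` and `build_forest` are our own hypothetical names.

```python
# Sketch of compact-trie-forest construction: each prefix is keyed by its
# active part y; at most P_TRIE conflicted prefixes may share a node, and
# any excess spills into a newly created trie.
P_TRIE = 2

def build_forest(prefixes):
    """prefixes: list of (y_active_part, next_hop). Returns a list of tries,
    each modeled as a dict mapping an active part to a list of next hops."""
    forest = []
    for y, nhop in prefixes:
        for trie in forest:
            if len(trie.setdefault(y, [])) < P_TRIE:
                trie[y].append(nhop)
                break
        else:  # every existing trie is full at node y -> start a new CT
            forest.append({y: [nhop]})
    return forest

# The six sample prefixes with an empty active part spread over three CTs,
# mirroring the root nodes of Fig. 2b.
forest = build_forest([("", p) for p in ["P3", "P4", "P8", "P9", "P10", "P18"]])
print(len(forest))  # 3
```

This makes the Ptrie trade-off visible: a smaller limit bounds the per-node work (no backtracking) but grows the number of tries, and hence pipelines.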
Algorithm 1: IP lookup
Input: Destination IP address
Output: NHIkey
 1: Find Mkey based on MSBkey and extract the active part APkey
 2: Start the search from the root node; traverse the trie nodes using B0
 3: while current node is not a leaf node do
 4:     shift APkey one bit left
 5:     if current node is a prefix node then
 6:         if MSBkey != MSBprefix then
 7:             no match
 8:         else if B0 == LSBprefix then
 9:             no match
10:         else if Mkey != Mprefix then
11:             no match
12:         else if Lprefix == 0 then
13:             match; update NHIkey = NHIprefix
14:         else if Lkey^int(LSBprefix) >= Lprefix then
15:             match; update NHIkey = NHIprefix
16:         else
17:             no match
18:         end if
19:         Update current node
20:     end if
21: end while
22: return NHIkey

Figure 4. A basic stage of the pipeline for the trie lookup. (MSS(R)BP: Most Significant Set (Reset) Bit Position; NHI: next-hop information; MSB: Most Significant Bit; LSB: Least Significant Bit)

Table III
LIST OF NOTATIONS USED IN MEMORY SIZE CALCULATIONS

Notation  Meaning
MCT       Total memory size of a single compact trie
MCTF      Total memory size of the compact trie forest
Ntrie     Total number of nodes
P         Number of pointer bits
NHI       Number of bits to store the next-hop information
M         Number of bits to store Mprefix (⌈log2 W⌉)
L         Number of bits to store Lprefix (⌈log2 W⌉)
S         Number of bits to store MSBprefix and LSBprefix
W         IP address length (32 for IPv4 and 128 for IPv6)

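The per-node matching test of Alg. 1 (lines 6–18) can be sketched in a few lines. The field names (`msb`, `b0`, `m`, `l_int`, `l`, `nhi`) are our own shorthand for the Table II notations, and the exact form of the B0 test reflects our reading of the algorithm:

```python
def node_match(key, node):
    """Per-node match test of Alg. 1 (lines 6-18).

    `key` and `node` are dicts carrying the Table II fields.
    Returns the prefix's next-hop information on a match, else None.
    """
    if key["msb"] != node["msb"]:      # line 6: MSBs must agree
        return None
    if key["b0"] == node["lsb"]:       # line 8: B0 vs. LSBprefix test
        return None
    if key["m"] != node["m"]:          # line 10: |x| values must agree
        return None
    if node["l"] == 0:                 # line 12: no |z| constraint
        return node["nhi"]
    if key["l_int"] >= node["l"]:      # line 14: enough trailing run bits
        return node["nhi"]
    return None                        # line 17

key = {"msb": 0, "b0": 1, "m": 2, "l_int": 3}
node = {"msb": 0, "lsb": 0, "m": 2, "l": 2, "nhi": "P6"}
print(node_match(key, node))  # P6
```

In hardware this test is one match module per stored prefix, which is why the number of match modules grows with Ptrie.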
Figure 3. Block diagram of the IP lookup architecture. (MSSBP/MSRBP: Most Significant Set/Reset Bit Position; LSSBP/LSRBP: Least Significant Set/Reset Bit Position)

IV. ARCHITECTURE AND IMPLEMENTATION ON FPGA

A. Architecture

We use pipelining to improve the throughput. The number of pipelines is equal to the number of CTs, and the number of pipeline stages is determined by the height of the CTs used. Fig. 3 shows the overall architecture of the proposed IP lookup engine.

The IP address extracted from the incoming packet is routed to all pipelines, and the searches are performed in parallel in the forest. The results are fed through a priority resolver, which selects the next-hop index of the longest matched prefix. In Fig. 3, the MSS(R)BP and LSS(R)BP blocks calculate the most and least significant set/reset bit positions of the IP address, respectively. The active part (AP) of the IP address is extracted by the left barrel shifter. The length of the AP is used to terminate the search early, once its last bit has been checked.

There are at most W stages in the pipelines. Fig. 4 presents a single pipeline stage. For Ptrie = 1, each stage includes one match module, whose operation is described in Alg. 1, and a BRAM, where the trie nodes are stored. The number of match modules and the width of the BRAM entries increase linearly with the Ptrie value. This provides a trade-off between the memory and resource requirements: small Ptrie values improve memory efficiency, but increase the number of compact tries and, in turn, the number of pipelines in the architecture. To take advantage of the dual-ported feature of the BRAM in FPGAs, the architecture is configured as dual linear pipelines to double the lookup rate. At each stage, the memory has dual Read/Write ports, so two packets can be input every clock cycle.

B. Implementation on FPGAs

Table III lists the notations used in the analysis. The memory consumption of the CTF can be calculated using Equ. 1 and 2:

MCT = Ntrie × (2P + M + L + NHI + S)   (1)
MCTF = Σi MCTi   (2)

We assign M = ⌈log2 W⌉ = 5, L = ⌈log2 W⌉ = 5 and S = 2. We set NHI = 8 to support up to 256 next-hop entries, and P = 16 to support up to 64K nodes in each level. Substituting these values into Equ. 1 gives MCT = 52 Ntrie bits. We observed that the memory consumption increases linearly with the number of prefixes. Therefore, a state-of-the-art FPGA device with 36 Mb of BRAM (e.g. a Xilinx Virtex-6) can support up to 703K IPv4 or 418K IPv6 prefixes without using external SRAM (assuming the same prefix distribution).

Table IV
NUMBER OF NODES PER LEVEL IN CT0 OF THE CTF CONSTRUCTED USING PREFIX TABLE RRC00

Level    0  1  2  3  4   5   6   7    8    9    10    11    12    13    14     15
Node(0)  1  0  0  0  0   0   0   1    1    9    36    77    291   1505  3439   6970
Node(1)  0  0  0  0  0   2   1   1    15   25   58    166   547   1386  2740   5565
Node(2)  0  2  4  8  16  30  63  126  240  478  924   1777  3093  4246  6484   8313
Total    1  2  4  8  16  32  64  128  256  512  1018  2020  3931  7137  12663  20848

Level    16     17     18     19     20     21    22   23   24   25   26  27  28  29  30  31
Node(0)  11496  15772  16774  6552   767    524   274  153  90   56   32  10  0   0   0   0
Node(1)  9037   13925  19935  28908  13678  599   419  173  82   49   32  19  10  0   0   0
Node(2)  10270  12532  15643  21537  11563  130   108  27   9    9    3   6   1   0   0   0
Total    30803  42229  52352  56997  26008  1253  801  353  181  114  67  35  11  0   0   0

Table V
NUMBER OF PREFIXES OF REAL-LIFE AND SYNTHETIC ROUTING TABLES

IPv4        rrc00    rrc01    rrc02    rrc03    rrc04    rrc05    rrc06    rrc07    rrc10    rrc11
IPv6        rrc00_6  rrc01_6  rrc02_6  rrc03_6  rrc04_6  rrc05_6  rrc06_6  rrc07_6  rrc10_6  rrc11_6
# prefixes  332118   324172   272743   321617   347232   322997   321577   322557   319952   323668

Table VI
NUMBER OF NODES (K) / TOTAL MEMORY (MBIT), IPV4

        rrc00      rrc01      rrc02      rrc03      rrc04      rrc05      rrc06      rrc07      rrc10      rrc11
BT      807/34.7   789/33.9   679/29.2   780/33.5   845/36.3   783/33.7   782/33.6   784/33.7   787/33.8   775/33.3
BTPC    597/46.0   583/45.0   493/38.1   579/44.6   628/48.5   581/44.8   579/44.7   581/44.8   576/44.4   583/44.9
BST     453/35.8   440/34.8   380/30.1   434/34.3   474/37.5   436/34.4   436/34.5   437/34.5   431/34.1   438/34.7
DBPC    530/25.8   519/25.3   446/21.7   513/25.1   545/26.6   514/25.1   514/25.1   516/25.2   518/25.3   509/24.8
CTF     294/19.9   288/19.4   249/16.7   283/19.1   304/20.5   284/19.1   284/19.2   285/19.2   286/19.3   281/18.9
CTFopt  294/16.6   288/16.35  249/13.84  283/15.99  304/17.38  284/16.05  284/16.04  285/16.13  286/15.89  281/16.19

Table VII
NUMBER OF NODES (K) / TOTAL MEMORY (MBIT), IPV6

                     rrc00_6   rrc01_6   rrc02_6   rrc03_6   rrc04_6   rrc05_6   rrc06_6   rrc07_6   rrc10_6   rrc11_6
BT                   7610/341  7476/335  7367/330  8094/363  7391/332  7376/331  7407/332  7327/329  7724/347  7484/336
BTPC                 648/72    633/70    628/69    677/75    630/70    627/69    629/70    624/69    654/72    636/70
DBPC                 698/47    686/46    676/45    743/50    678/45    677/45    680/46    672/45    709/48    687/46
CTFopt (path comp.)  212/27.9  208/27.3  206/27.0  222/29.2  207/27.2  206/27.1  206/27.1  205/26.9  214/28.0  208/27.2

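The per-node bit budget of Equ. 1 can be double-checked with a few lines, using the bit widths chosen in Section IV-B:

```python
# Per-node bit budget from Equ. 1, with the widths chosen in Section IV-B.
P, M, L, NHI, S = 16, 5, 5, 8, 2      # pointers, |x| field, |z| field, next hop, MSB+LSB
node_bits = 2 * P + M + L + NHI + S   # Equ. 1: bits per trie node
print(node_bits)                      # 52, i.e. MCT = 52 * Ntrie

# Rough upper bound on node count for a device with 36 Mb of on-chip BRAM
# (a simplified ceiling; the real capacity depends on the prefix distribution
# and BRAM bank granularity).
bram_bits = 36 * 2**20
print(bram_bits // node_bits)
```

The linear growth of memory with the number of prefixes then translates this per-node figure directly into the 703K IPv4 / 418K IPv6 capacity claims.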
C. Optimization

In existing pipelined architectures, each stage uses a single bank of BRAM to store the trie nodes. In our CTF, the trie nodes do not have a fixed size, because the number of conflicted prefixes varies. Due to this variation, the BRAM entry size is determined by the largest trie node, resulting in poor memory utilization. Therefore, multiple BRAM banks are employed to improve the memory efficiency.

Table IV shows the node distribution of the largest CT (CT0) in the CTF constructed using the real routing table rrc00 from [11]. CT0 contains approximately 90% of the total number of nodes in the overall structure. In this trie, Ptrie = 2, so a trie node can be one of three types: non-prefix nodes (Node(0)), prefix nodes with one next-hop entry (Node(1)), and prefix nodes with two next-hop entries (Node(2)). Each column corresponds to a level in the compact trie. Rows 1, 2 and 3 show the total numbers of non-prefix, one-next-hop, and two-next-hop nodes in the corresponding level, respectively; the last row gives the total number of nodes. The results confirm that, in the first 11 levels, most of the nodes have two next-hop entries; hence, a single BRAM supporting a two-next-hop entry format is enough. Similarly, the last 8 levels have very few nodes, so a single BRAM supporting two next-hop entries per level suffices there as well. For the remaining stages, on the other hand, using multiple BRAMs clearly improves memory efficiency.

V. PERFORMANCE EVALUATION

A. Experimental Setup

Ten experimental IPv4 core routing tables were collected from Project RIS [11] on 06/03/2010 and used to evaluate our algorithm in a realistic networking environment. Additionally, from these core routing tables, we generated the corresponding IPv6 routing tables using the same method as in [15]; the IPv4-to-IPv6 prefix mapping is one-to-one. The numbers of prefixes of the experimental routing tables are shown in Table V.

B. Memory Requirement

Fig. 5 shows the node distribution among the CTs in the forest for the real tables. The results indicate that the majority of nodes and prefixes are stored in CT0.

Figure 5. Number of nodes per trie in the forest

Table VI shows the memory requirements of the state-of-the-art algorithms for the real tables. The candidates are the binary trie (BT), the binary trie with path compression (BTPC), the BST (with leaf pushing), and Distance-Bounded Path Compression (DBPC). The results are presented in an A/B format, where A is the total number of nodes and B is the total memory size (in Mbits). The results in Table VI show that our data structure achieves a substantial memory reduction for IPv4. The performance of our approach for IPv6 was also evaluated with path compression: it achieved a memory reduction of 12.2× over the binary trie and 2.6× over the path-compressed trie.

Table VIII
PERFORMANCE COMPARISON

Architecture    # prefix (K)  Mem. eff. (Bits/prefix)  Throughput (Gbps)  Throughput eff. (Gbps/B)
CTF             324           52.46                    135                2.57
RCST [12]       80            57.15                    135                2.36
Flashtrie [3]   310           39.27                    80                 2.03
DBPC [13]       324           82                       150                1.83
BST [9]         324           113.21                   175                1.55
POLP [6]        80            120                      64                 0.53
Ring [14]       80            120                      64                 0.53
TreeBitmap [5]  310           69.9                     12.8               0.18

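The headline throughput figures quoted for the dual-pipeline configuration (420 MLPS and roughly 135 Gbps at 40-byte packets, as stated in the abstract) reduce to simple arithmetic:

```python
clock_hz = 210e6          # post place-and-route clock (4.750 ns period)
lookups = 2 * clock_hz    # dual pipelines: two lookups per cycle
print(lookups / 1e6)      # 420.0 MLPS

# One minimum-size packet (40 bytes) per lookup:
gbps = lookups * 40 * 8 / 1e9
print(gbps)               # 134.4 Gbps, reported as ~135 Gbps in the paper
```

The 150 MLPS line-card requirement corresponds to 100 Gbps divided by the 40-byte (plus framing) minimum packet; the design's 420 MLPS clears it with headroom.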
C. Throughput

The proposed hardware design was simulated and implemented in Verilog, using Xilinx ISE 12.4, with a Xilinx Virtex-6 XC6VSX475T (−2 speed grade) as the target device. With a clock period of 4.750 ns, the design is capable of running at 210 MHz while utilizing less than 15% of the total logic resources. Using the dual-pipeline configuration, the architecture can support 420 million lookups per second (MLPS), or 135 Gbps for the minimum packet size of 40 Bytes. This result surpasses the worst-case 150 MLPS required by the standardized 100GbE line cards.

D. Performance Comparison

The performance of CTF is compared with state-of-the-art IP lookup approaches with respect to memory efficiency (in Bits/prefix) and throughput (in Gbps). We also use throughput efficiency (in Gbps/Bits), the ratio of the throughput to the memory efficiency, to evaluate the time-storage trade-off of the various designs. Columns 4 and 5 of Table VIII give the throughput and throughput efficiency of the state-of-the-art hardware-based architectures. Column 4 shows that our design achieves good throughput performance; as shown in Column 5, our scheme also outperforms the existing schemes with respect to throughput efficiency.

VI. CONCLUSION

In this paper, we proposed a compact trie forest data structure for IP lookup. The proposed structure achieves substantial memory savings without the need for backtracking. We also designed and implemented a high-throughput, linear-pipeline architecture to support the proposed data structure on FPGAs. Furthermore, we optimized the existing architecture to support multiple banks per pipeline stage to increase memory efficiency. Our algorithm can therefore be used to improve the performance (throughput and memory efficiency) of trie-based IPv4/v6 lookup schemes, satisfying internet link rates up to and beyond 100 Gbps at core routers with a compact memory footprint that can fit in the on-chip caches of multi-core and network processors. In the future, we plan to extend the algorithm to virtual routers to improve their memory efficiency.

REFERENCES

[1] D. Lin, Y. Zhang, C. Hu, B. Liu, X. Zhang, and D. Pao, "Route table partitioning and load balancing for parallel searching with TCAMs," in Proc. IPDPS, 2007, pp. 1–10.
[2] M. J. Akhbarizadeh, M. Nourani, R. Panigrahy, and S. Sharma, "A TCAM-based parallel architecture for high-speed packet forwarding," IEEE Trans. Comput., vol. 56, no. 1, pp. 58–72, 2007.
[3] M. Bando and J. Chao, "Flashtrie: Hash-based prefix-compressed trie for IP route lookup beyond 100Gbps," in Proc. INFOCOM, 2010.
[4] M. Waldvogel, G. Varghese, J. Turner, and B. Plattner, "Scalable high speed IP routing lookups," in Proc. SIGCOMM, 1997, pp. 25–38.
[5] W. Eatherton, G. Varghese, and Z. Dittia, "Tree bitmap: Hardware/software IP lookups with incremental updates," SIGCOMM Comput. Commun. Rev., vol. 34, no. 2, pp. 97–122, 2004.
[6] W. Jiang and V. K. Prasanna, "A memory-balanced linear pipeline architecture for trie-based IP lookup," in Proc. Hot Interconnects (HotI '07), 2007, pp. 83–90.
[7] W. Lu and S. Sahni, "Packet forwarding using pipelined multibit tries," in Proc. ISCC, 2006, pp. 802–807.
[8] H. Song, J. Turner, and J. Lockwood, "Shape shifting tries for faster IP route lookup," in Proc. ICNP, 2005, pp. 358–367.
[9] H. Le and V. K. Prasanna, "Scalable high throughput and power efficient IP-lookup on FPGA," in Proc. 17th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), 2009.
[10] H. Lu and S. Sahni, "A B-tree dynamic router-table design," IEEE Trans. Comput., vol. 54, no. 7, pp. 813–824, 2005.
[11] "RIS Raw Data," http://data.ris.ripe.net.
[12] H. Fadishei, M. S. Zamani, and M. Sabaei, "A novel reconfigurable hardware architecture for IP address lookup," in Proc. ANCS, 2005, pp. 81–90.
[13] H. Le, W. Jiang, and V. K. Prasanna, "Memory-efficient IPv4/v6 lookup on FPGAs using distance-bounded path compression," in Proc. IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), 2011, pp. 242–249.
[14] F. Baboescu, D. M. Tullsen, G. Rosu, and S. Singh, "A tree based router search engine architecture with single port memories," in Proc. ISCA, 2005, pp. 123–133.
[15] M. Wang, S. Deering, T. Hain, and L. Dunn, "Non-random generator for IPv6 tables," in Proc. 12th Annual IEEE Symposium on High Performance Interconnects, 2004, pp. 35–40.