This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. IEEE TRANSACTIONS ON COMPUTERS

On Adding Bloom Filters to Longest Prefix Matching

Hyesook Lim, Member, IEEE, Kyuhee Lim, Nara Lee, and Kyong-hye Park, Student Members, IEEE


Abstract—High-speed IP address lookup is essential to achieve wire-speed packet forwarding in Internet routers. Ternary content addressable memory (TCAM) technology has been adopted to solve the IP address lookup problem because of its ability to perform fast parallel matching. However, the applicability of TCAMs presents difficulties due to cost and power dissipation issues. Various algorithms and hardware architectures have been proposed to perform the IP address lookup using ordinary memories such as SRAMs or DRAMs without using TCAMs. Among these, we focus on two efficient algorithms providing high-speed IP address lookup: parallel multiple-hashing and binary search on levels. This paper shows how effectively an on-chip Bloom filter can improve those algorithms. A performance evaluation using actual backbone routing data with 15,000-220,000 prefixes shows that, by adding a Bloom filter, the complicated hardware for parallel access is removed without a search performance penalty in the parallel multiple-hashing algorithm. Search speed is improved by 30-40% by adding a Bloom filter to the binary search on levels algorithm.

Index Terms—Internet, router, IP address lookup, longest prefix matching, Bloom filter, multi-hashing, binary search on levels, leaf pushing.

Manuscript received 5 Nov. 2011. The authors are with the Department of Electronics Engineering, Ewha W. University, Seoul, Korea (e-mail: [email protected]).

1 INTRODUCTION

Address lookup determines an output port using the destination IP address of an incoming packet. The address aggregation technology currently used for the Internet is a bitwise prefix matching scheme called classless inter-domain routing (CIDR), which uses variable-length subnet masking to allow arbitrary-length prefixes. An IP address is said to match a prefix if the most significant l bits of the address and an l-bit prefix are the same. When an IP address matches more than one prefix, the longest matching prefix is selected as the best matching prefix (BMP) [1]-[4]. IP address lookup is one of the most challenging operations in router design because of the amount of traffic and the number of networks, which have increased dramatically in recent years.

Using application-specific integrated circuits (ASICs) with off-chip ternary content addressable memories (TCAMs) has been the best solution to provide wire-speed packet forwarding. However, TCAMs have some limitations [5]. TCAMs consume 150 times more power per bit than SRAMs, and account for around 30-40% of the total line card power. As line cards are stacked together, TCAMs impose a high cost on the cooling system. System vendors are willing to accept some latency penalty if the power of a line card can be lowered [6]. TCAMs also cost about 30 times more per bit of storage than DDR SRAMs. Various algorithms have been studied to replace TCAMs with ordinary memories such as SRAMs or DRAMs [1]-[4], [6]-[26].

A fast on-chip SRAM is often used in several applications, so that critical data is stored there with a guaranteed fast access time [27], since an access to off-chip memory (usually DRAM) requires a longer access time, 10-20 times slower than an on-chip memory access. It is important to partition the data properly, so that a small part of the data is stored in on-chip memories and most of the data is stored in slower but higher-capacity off-chip memories.

Several metrics are used for evaluating the performance of IP address lookup algorithms and architectures. Since IP address lookup should be performed at wire-speed for every incoming packet, which can mean a hundred million packets per second, search performance is the most important metric. Search performance is often measured by the number of off-chip memory accesses. The next metric is the required memory size for storing a routing table. The incremental update of a routing table is also an important metric. Scalability for large routing data sets and migration to IPv6 should also be considered. The performance in these metrics depends on data structures and search algorithms, and thus it is essential to have an efficient structure and search algorithm to provide high-performance IP address lookup.

IP address lookup algorithms can be roughly categorized into trie (or tree)-based algorithms [2]-[4], [8], [13]-[26], hashing-based algorithms [9]-[11], and bitmap-based algorithms [6], [12]. Recently, dynamic programming-based approaches have been proposed to improve the search performance and/or storage performance [15]-[21].

Hashing is a well-defined procedure for turning each key into a smaller integer called a hash index, which serves as a pointer into an array. Hashing has been

Digital Object Identifier 10.1109/TC.2012.193    0018-9340/12/$31.00 © 2012 IEEE


used mostly in search algorithms to quickly locate a data record for a given search key. For the IP address lookup, hashing is applied to each length of prefixes, and the longest prefix among the matched prefixes is selected as the best match [9]-[11]. Among trie-based algorithms, binary search on hash tables organized by prefix lengths [13]-[15] provides the best IP address lookup performance.

A Bloom filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set. Bloom filters have been popularly applied to network algorithms [7], [26], [28]-[31]. This paper shows how effectively an on-chip Bloom filter can improve the search performance of known efficient IP address lookup algorithms.

This paper is organized as follows. Section 2 describes the Bloom filter theory. Section 3 introduces two different algorithms providing high-speed IP address lookup: parallel multiple-hashing and binary search on levels. Section 4 describes our proposed method to improve those algorithms using a Bloom filter. Section 5 shows performance evaluation results, and Section 6 concludes the paper.

2 BLOOM FILTER THEORY

A Bloom filter is basically a bit-vector used to represent the membership information of a set of elements. A Bloom filter that represents a set S = {x1, x2, ..., xn} of n elements is described by an array of m bits, initially all set to 0. A Bloom filter supports two different operations: programming and querying. In programming, for an element x in the set S, k different hash functions are computed in such a way that each resulting hash index h_i(x) is in the range 0 <= h_i(x) <= m-1 for i = 1, ..., k. Then all the bit-locations corresponding to the k hash indices are set to 1 in the Bloom filter. The pseudo-code to program a Bloom filter for an element x is as follows [7]:

BFProgramming(x)
    for (i = 1 to k)
        BF[h_i(x)] = 1;

A query is performed to test whether an element y is in S. For an input y, k hash indices are generated using the same hash functions that were used to program the filter. The bit-locations in the Bloom filter corresponding to the hash indices are checked. If at least one of the locations is 0, then y is absolutely not a member of the set S, and the result is termed a negative. If all the hash index locations are set to 1, then the input may be a member of the set, and the result is termed a positive. The querying procedure is as follows [7]:

BFQuery(y)
    for (i = 1 to k)
        if (BF[h_i(y)] == 0) return negative;
    return positive;

However, a positive does not mean that all those bit-locations were set only by the element currently under query; there is a possibility that those locations were set by some other elements in the set. This type of positive result is termed a false positive. It is important to properly control the rate of false positives in designing a Bloom filter. For a given ratio m/n, it is known that the false positive probability is minimized when the number of hash functions k satisfies the following relationship [7]:

    k = (m/n) ln 2        (1)

On the whole, a Bloom filter may produce false positives but never false negatives.

3 RELATED WORKS

The IP address lookup problem can be defined formally as follows [8]. Let P = {P1, P2, ..., PN} be a set of routing prefixes, where N is the number of prefixes. Let A be an incoming IP address and S(A, l) be the substring of the most significant l bits of A. Let n(Pi) be the length of a prefix Pi. A is defined to match Pi if S(A, n(Pi)) = Pi. Let M(A) be the set of prefixes in P that A matches; then M(A) = {Pi in P : S(A, n(Pi)) = Pi}. The longest prefix matching (LPM) problem is to find the prefix Pj in M(A) such that n(Pj) > n(Pi) for all Pi in M(A), i != j. Once the longest matching prefix Pj is determined, the input packet is forwarded to the output port directed by the prefix Pj.

3.1 IP Address Lookup Algorithms Using Bloom Filters

The IP address lookup architecture proposed by Dharmapurikar et al. was the first algorithm employing a Bloom filter [7]. It performs parallel queries on W Bloom filters, sorted by prefix length, to determine the possible lengths of a prefix match, where W is 32 in IPv4. For a given IP address, off-chip hash tables are probed for the prefix lengths that turn out to be positive in the Bloom filters, starting from the longest prefix. This architecture has a high implementation complexity because of the Bloom filters as well as the hash tables in each prefix length. Depending on the prefix distribution, the sizes of the Bloom filters and the sizes of the hash tables can be highly skewed.

In order to bound the worst-case search performance by limiting the number of distinct prefix lengths, which is the same as the worst-case number of probes, controlled prefix expansion (CPE) [22] is suggested in the paper. However, prefix replication is inevitable in CPE. Moreover, the naive hash table employed in this architecture incurs collisions, and resolving the collisions using chaining adversely affects the worst-case lookup-rate guarantees that routers should provide [30], [32].
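The programming and querying operations of Section 2 can be sketched as executable code. This is a minimal illustration only: the array size, the number of hash functions, and the SHA-256-based index derivation are assumptions standing in for the CRC-based hash generator used later in this paper.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash functions over an m-bit array."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _indices(self, x):
        # Derive k indices from a digest; a stand-in for the paper's CRC generator.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{x}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.m

    def program(self, x):  # BFProgramming(x)
        for h in self._indices(x):
            self.bits[h] = 1

    def query(self, y):  # BFQuery(y): True = positive, False = negative
        return all(self.bits[h] == 1 for h in self._indices(y))

bf = BloomFilter(m=64, k=2)
for prefix in ["1*", "00*", "010*", "111*"]:
    bf.program(prefix)

assert bf.query("010*")  # a programmed element is always positive
```

A programmed element can never produce a negative, while an element outside the set may still produce a (false) positive, matching the asymmetry described above.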


3.2 Parallel Multiple-Hashing

A hash function is used to map the search key to a hash index. In general, a hash function may map several different keys to the same hash index. This is called a collision. Collisions are an intrinsic problem of hashing. Broder et al. proposed to use multiple hash functions to reduce collisions [9]. Instead of searching for a perfect hash function in which each distinguishable search key is mapped to a different hash index, a multiple-hashing architecture [9] uses multiple hash functions for each search key. The number of hash tables is equal to the number of hash indices. Assuming two hash indices, the corresponding two hash tables are named the left table and the right table. Each slot of a hash table should contain a set of entries, and for this reason, each slot of a hash table is often called a bucket.

In storing a given prefix, two hash indices are obtained: the hash index from hash function 1 is used to access the left table, and the index from hash function 2 is used to access the right table. Comparing the number of loads already stored in the two accessed buckets, the prefix is stored in the bucket with the smaller number of loads. Through multiple-hashing, prefixes are distributed more evenly across the hash tables. The number of collisions can be controlled by three parameters: the number of hash tables, the number of buckets in a table, and the number of entries in a bucket.

To apply multiple-hashing to an IP address lookup problem with variable-length prefixes, the parallel multiple-hashing (PMH) architecture [10] constructs a separate multi-hash table for each group of prefixes with a distinct length and, additionally, an overflow TCAM. A prefix is stored into the overflow TCAM when both buckets are already full. Figure 1 shows the overall PMH architecture. Multiple hash tables (here two) are constructed for each length, and prefixes of each length are stored into either a left table entry or a right table entry of the corresponding length.

Fig. 1. Parallel multiple-hashing architecture.

The search procedure is as follows. For a given input address, hash indices for all possible lengths are obtained. Using these hash indices, the multi-hashing tables are accessed in parallel, and matching prefixes in each length (if they exist) are returned. The overflow TCAM is also assumed to be accessed in parallel. Among the returned prefixes, the longest matching prefix is selected by the priority encoder. By the parallel access of the multi-hashing tables in every length, the best matching prefix (BMP) is obtained in a single access cycle. However, since the tables of each length should be implemented in separate memories for parallel access, and the sizes of the tables can be highly skewed depending on the prefix distribution, the implementation complexity can become very high.

3.3 Binary Search on Trie Levels by Waldvogel

A binary trie is a tree-based structure which applies linear search on prefix length [4]. Each prefix resides in a node of the trie, in which the level and the path of the node from the root node correspond to the prefix length and the prefix value, respectively. Figure 2 shows the binary trie for an example set of prefixes P = {1*, 00*, 010*, 111*, 1101*, 11111*, 110101*}. In Fig. 2, black nodes represent prefixes, and white nodes represent empty internal nodes. At each node, the search proceeds to the left or right according to sequential inspection of the address bits starting from the most significant bit. If the bit is 0, the search proceeds to the left child; otherwise it proceeds to the right child, until it reaches a leaf node. The binary trie structure is simple and easy to implement. However, the search performance of the binary trie is linearly related to the length of the IP address, since each bit is examined one at a time.

Fig. 2. The binary trie for an example set of prefixes.

As an attempt to improve the search performance of the trie, algorithms performing binary search on trie levels were proposed [13], [14]. The binary search on level structure proposed by Waldvogel et al. [13] separates the binary trie according to the level of the trie and stores the nodes included in each level in a hash table. Binary search is performed on the hash tables of the levels. When accessing the medium-level hash table, if there is a match, the search space becomes the longer half;


otherwise, the search space becomes the shorter half. Figure 3 shows Waldvogel's binary search on length (W-BSL) structure with the denotation of access levels [13]. Level 3 is the first level of access, levels 1 and 5 are the second levels of access, and levels 2, 4, and 6 are the last levels of access. The W-BSL structure uses pre-computed markers and BMPs. Markers are pre-computed in the internal nodes if there is a longer prefix in the levels accessed later. A pre-computed BMP is maintained at each marker, and the pre-computed BMP is returned when there is no match in the longer levels. The markers and the BMPs are not maintained for the last level of access. They are pre-computed for nodes of preceding levels of access, as shown in Fig. 3.

Fig. 3. Waldvogel's binary search on lengths (W-BSL).

As a search example, for a 6-bit input 111000, the most significant 3 bits, which are 111, are used for hashing in accessing level 3. Since the input matches P5, it is remembered as the current BMP, and the search goes to level 5. In level 5, the most significant 5 bits of the input, 11100, do not match any node. The search goes to level 4 and does not match there either. Since level 4 is the last level of access, the search is over and prefix P5 is returned as the BMP. Using the binary search on trie levels, three memory accesses were performed to find the longest matching prefix for this input.

Waldvogel's algorithm provides O(log l_dist) hash-table accesses, where l_dist is the number of distinct prefix lengths. Srinivasan and Varghese have proposed to improve the search performance by the use of controlled prefix expansion, reducing the value of l_dist [22]. Kim and Sahni have proposed to optimize the storage requirement by selecting the prefix lengths that minimize the number of markers and pre-computed BMPs when l_dist is given [15].

3.4 Binary Search on Trie Levels in a Leaf-Pushed Trie

W-BSL requires complex pre-computation because of the prefix nesting relationship in which one prefix becomes a substring of another prefix. That is, since each node can have prefixes in both upper and lower levels, the markers and their BMPs should be pre-computed. If every prefix is disjoint from every other prefix, the prefixes are located only in leaves and are free from the prefix nesting relationship. Hence, the binary search on trie levels for a set of disjoint prefixes can be performed without pre-computation.

Lim's binary search on level structure (L-BSL) [14] uses leaf-pushing [22] to make every prefix disjoint. Figure 4 shows the leaf-pushed binary trie for the same set of prefixes. Leaf-pushed nodes are connected to the trie by dotted edges. The levels of access for performing the binary search on levels are also shown.

Fig. 4. Lim's binary search on lengths (L-BSL).

Assume a 6-bit input 110110. Note that we use a different input example from the W-BSL to show the search procedure for the L-BSL in detail. Since the first level of access is 4, the most significant 4 bits, which are 1101, are used for hashing. As the search encounters an internal node, it proceeds to a longer level. In level 6, the most significant 6 bits of the input, 110110, do not match any node. Hence the search proceeds to a shorter level, which is level 5. The input matches prefix P4 in level 5. The prefix P4 is returned and the search is over. The L-BSL finishes a search either when a match to a prefix occurs or when it reaches the last level of access, while the W-BSL always finishes a search when it reaches the last level of access.

4 THE PROPOSED ARCHITECTURES

4.1 Adding a Bloom Filter to the Parallel Multiple-Hashing Architecture (PMH-BF)

In this subsection, we propose to add an on-chip Bloom filter to the PMH architecture in order to reduce the implementation complexity. In implementing hash tables, previous architectures, such as [7], [9], [10], [11], [13] and [32], require separate hash tables for each length, and this increases the implementation complexity by requiring multiple variable-sized memories. To reduce the required number of memories by reducing the number of distinct prefix lengths, [7] suggested using CPE. However, CPE causes prefix replications and increases the


memory requirement. In [32], prefix collapsing is suggested. However, prefix collapsing increases the number of collisions in hashing. On the whole, in using hashing for IP address lookup, it is essential to reduce the required number of off-chip memories. If there is no need for parallel access in each prefix length, the hash table can be designed to accommodate prefixes of various lengths within a single table.

We describe the proposed architecture using the same example set. We use a single hash function based on a cyclic redundancy check (CRC) generator to obtain multiple hash indices for prefixes of every length [29]. The CRC generator is composed of shift-right registers with XOR logic, and hence it is easy to implement. There is an advantage to using a CRC generator as a hash generator: hash indices can be obtained consistently for prefixes of various lengths. Figure 5 shows an example of an 8-bit CRC generator. All the registers of the CRC generator are initially set to 0. Once a prefix of arbitrary length is serially entered into the CRC generator and XORed with the stored register values, a fixed-length scrambled code is obtained. By selecting a set of registers, or multiple sets of registers, from the scrambled code, we obtain as many hash indices as desired for any length.

In Fig. 6, we show the overall structure of our proposed architecture. The on-chip Bloom filter was programmed using the BF indices in TABLE 1. The multi-hashing table, which is composed of two hash tables, has 8 buckets and 2 entries per bucket storing the prefixes. Prefixes were stored into the multi-hashing table using the indices shown in TABLE 1. In this example, there is no overflow.

Fig. 6. Our proposed multiple hashing architecture (PMH- BF).
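The index-extraction scheme described above can be illustrated with a serial CRC-8 computation. The generator polynomial below (x^8 + x^2 + x + 1, a common CRC-8 choice) is an assumption, since the paper does not name the polynomial of Fig. 5; the codes produced here will therefore not necessarily match those in TABLE 1, and only the extraction mechanics are the point.

```python
def crc8_bits(bits, poly=0x07):
    """Serial (bit-at-a-time) CRC-8 over a string of '0'/'1' characters.
    Registers start at 0; each input bit is XORed into the feedback path,
    mirroring the shift-register-with-XOR generator described in the paper.
    The polynomial 0x07 is an assumption, not taken from the paper."""
    reg = 0
    for b in bits:
        feedback = ((reg >> 7) & 1) ^ int(b)
        reg = (reg << 1) & 0xFF
        if feedback:
            reg ^= poly
    return reg

def extract_indices(prefix_bits):
    """Select register subsets from the 8-bit scrambled code:
    first 4 / last 4 bits for a 16-bit Bloom filter,
    first 3 / last 3 bits for two 8-bucket hash tables."""
    code = crc8_bits(prefix_bits)
    bf_idx = (code >> 4, code & 0x0F)
    ht_idx = (code >> 5, code & 0x07)
    return code, bf_idx, ht_idx

# A prefix of any length yields one fixed-length code and several indices.
code, bf_idx, ht_idx = extract_indices("1101")
```

Because the code length is fixed regardless of how many bits are shifted in, the same generator serves every prefix length, which is exactly the consistency advantage noted above.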

Fig. 5. CRC-8 generator.

Let n, m, and k be the number of prefixes, the number of bits in a Bloom filter, and the number of hash indices, respectively. As an example case, we set the Bloom filter size m to 16. The number of hash indices k should be derived from Eq. (1), and here we set k to 2. Since m is 16, we need 4 bits for each hash index. We arbitrarily select the first 4 bits and the last 4 bits of the 8-bit CRC codes as the Bloom filter indices. Assuming an 8-bucket multi-hashing table, we can also obtain hash indices for the multi-hashing table from the CRC code. We arbitrarily select the first 3 bits and the last 3 bits of the 8-bit CRC codes as the multi-hashing indices. TABLE 1 shows the CRC codes for the example set of prefixes and the selected indices.

TABLE 1
CRC code and Bloom filter index for each prefix

    Prefix     CRC Code    BF Indices    Hash Table Indices
    1*         10101011    10, 11        5, 3
    00*        00000000    0, 0          0, 0
    010*       11010101    13, 5         6, 5
    111*       10010100    9, 4          4, 4
    1101*      00110100    3, 4          1, 4
    11111*     01011011    5, 11         2, 3
    110101*    10100110    10, 6         5, 6

The search procedure for the proposed algorithm is summarized in the following pseudo-code. Let A be the destination address of a given input packet, and S(A, l) be the substring of the most significant l bits of A. Let n(x) be the length of an element x.

SearchMHB(A) {
    TCAM_BMP = TCAM_search(A);
    if (TCAM_BMP != NULL) len = n(TCAM_BMP);
    else len = 0;
    for (l = W to len+1) {
        if (valid[l] == 1) {  // l is a valid length
            inString = S(A, l);
            CRC = crc_gen(inString);
            for (i = 1 to k) bf_idx[i] = extract(CRC);
            rst = probe_BF(bf_idx[1], ..., bf_idx[k]);
            if (rst == positive) {
                for (i = 1 to 2) h_idx[i] = extract(CRC);
                BMP = hash_table(inString, h_idx[1], h_idx[2]);
                if (BMP != NULL) return BMP;
            }
        }
    }
    return TCAM_BMP;
}

In this example set, the valid levels are lengths 6, 5, 4, 3, 2, and 1. As a search example of the 6-bit input address


111000, the CRC code generated for this input using the 8-bit CRC generator shown in Fig. 5 is 10010010. Hence, by selecting the first 4 bits and the last 4 bits, we have BF indices 9 and 2. The Bloom filter shown in Fig. 6 produces a negative, since one of the Bloom filter bits is 0, and hence the hash table is not accessed. Reducing the input by 1 bit, 11100 is entered, and the generated code is 00100101. The BF indices are 2 and 5, the Bloom filter produces a negative, and hence the hash table is not accessed. Next, the input 1110 is tried. The generated CRC is 01001010, and the BF indices are 4 and 10. The Bloom filter produces a positive, and the hash table is accessed using the two hash indices obtained from the first 3 bits and the last 3 bits of the CRC code, which are 2 and 2. There is a match neither in bucket 2 of the left table nor in bucket 2 of the right table for 1110, so it turns out that the Bloom filter produced a false positive. Next, the input 111 is tried. The generated CRC is 10010100, and the BF indices are 9 and 4. The Bloom filter produces a positive, and the hash table is accessed. We obtain a matched prefix in bucket 4 of the left table, and the search is over. The search for a given input is terminated when a true positive occurs. In this example, the Bloom filter was accessed 4 times, for lengths 6, 5, 4, and 3, and it generated 2 negatives, 1 false positive, and 1 true positive. The number of hash table accesses is 2, which is the same as the number of Bloom filter positives.

As will be shown in the simulation section, the false positive rate can be reduced to less than 0.3% by increasing the size of the Bloom filter to 16 times the number of prefixes. The parallel accesses to hash tables in each length are not necessary in the proposed architecture, since the Bloom filter filters out the lengths of the input that do not have a matching prefix. The Bloom filter is small enough to be implemented in a fast embedded memory in a chip. Hence the implementation complexity is significantly reduced by adding a simple Bloom filter, without sacrificing the search performance.

4.2 Adding a Bloom Filter to Binary Search on Trie Levels by Waldvogel (WBSL-BF)

In this subsection, we propose to add an on-chip Bloom filter to Waldvogel's binary search on level algorithm in order to improve the search performance. The preliminary version of this proposal was presented in [26]. Here, the proposed algorithm is described in detail in the context of adding a Bloom filter. The role of the Bloom filter is to filter out the substrings of each input that do not have a node in the binary trie.

TABLE 2 shows the CRC codes and Bloom filter indices for every node in the levels of access (levels 1 through 6) for the W-BSL trie shown in Fig. 3. We use the CRC-8 generator of Fig. 5 to obtain the BF indices. Figure 7 shows the WBSL-BF trie, which has a Bloom filter programmed using these indices.

TABLE 2
CRC codes and Bloom filter indices for WBSL-BF

    Prefix     CRC Code    BF Indices
    0*         00000000    0, 0
    1*         10101011    10, 11
    00*        00000000    0, 0
    01*        10101011    10, 11
    11*        01111110    7, 14
    010*       11010101    13, 5
    110*       00111111    3, 15
    111*       10010100    9, 4
    1101*      00110100    3, 4
    1111*      11100001    14, 1
    11010*     00011010    1, 10
    11111*     01011011    5, 11
    110101*    10100110    10, 6

Fig. 7. Adding a Bloom filter to W-BSL (WBSL-BF).

SearchWBSL(A) {
    TCAM_BMP = TCAM_search(A);
    low = min_level; high = max_level;
    while (low <= high) {
        accLevel = floor((low + high) / 2);
        inString = S(A, accLevel);
        CRC = crc_gen(inString);
        for (i = 1 to k) bf_idx[i] = extract(CRC);
        rst = probe_BF(bf_idx[1], ..., bf_idx[k]);
        if (rst == negative) high = accLevel - 1;
        else {  // positive
            h_idx = extract(CRC);
            node = access_hash_table(inString, h_idx);
            if (node == NULL) high = accLevel - 1;
            else {  // a node exists
                low = accLevel + 1;
                if (prefix node or bmp node) tree_BMP = node.BMP;
            }
        }
    }
    if (n(TCAM_BMP) > n(tree_BMP)) return TCAM_BMP;
    else return tree_BMP;
}


As a search example for the input 111000, the CRC code generated for the 3-bit substring 111 is 10010100. Hence, we have BF indices 9 and 4. The Bloom filter shown in Fig. 7 produces a positive, and hence the hash table is accessed and P5 is obtained as the current best match. For the 5-bit substring of the input, 11100, the CRC code is 00100101, so the BF indices are 2 and 5, and the result is a negative. Hence the hash table is not accessed, and the search space becomes the shorter lengths. For the 4-bit substring of the input, 1110, the CRC code is 01001010, so the BF indices are 4 and 10, and the result is a positive. Hence the hash table is accessed. There is no match, so it turns out to be a false positive. P5 is returned as the BMP. Compared with the search procedure in W-BSL, the number of off-chip hash table accesses is reduced from 3 to 2 because of the Bloom filter, which produced one negative.

Fig. 8. Adding a Bloom filter to L-BSL (LBSL-BF).
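The WBSL-BF search loop can be sketched as follows. The Bloom filter and the off-chip hash table are modeled as a plain Python set and dict (an idealized filter with no false positives), and the node set with its pre-computed marker BMPs is our own illustrative encoding of the Fig. 3 trie, not taken verbatim from the paper. The point is the control flow: a negative prunes the longer half with no off-chip access, a table miss (false positive) does the same after one access, and a found node moves the search toward longer levels.

```python
def bsl_bf_search(addr, min_level, max_level, bloom, table):
    """Binary search on trie levels with an on-chip Bloom pre-filter.
    bloom: set of substrings the filter reports positive for.
    table: dict substring -> pre-computed BMP (None for markers with no BMP)."""
    low, high = min_level, max_level
    best, accesses = None, 0
    while low <= high:
        level = (low + high) // 2
        sub = addr[:level]
        if sub not in bloom:       # negative: skip the off-chip table entirely
            high = level - 1
            continue
        accesses += 1              # positive: one off-chip hash table probe
        if sub not in table:       # miss: the positive was false
            high = level - 1
        else:
            if table[sub] is not None:
                best = table[sub]  # remember the pre-computed BMP
            low = level + 1
    return best, accesses

# Illustrative node set for P = {1*, 00*, 010*, 111*, 1101*, 11111*, 110101*};
# marker BMPs (e.g. "110" -> "1*") are computed here for the example.
nodes = {
    "0": None, "1": "1*", "00": "00*", "01": None, "11": "1*",
    "010": "010*", "110": "1*", "111": "111*",
    "1101": "1101*", "1111": "111*",
    "11010": "1101*", "11111": "11111*",
    "110101": "110101*",
}
best, accesses = bsl_bf_search("111000", 1, 6, set(nodes), nodes)
```

With this idealized filter the input 111000 needs only one table access; a real Bloom filter can add false-positive accesses, as in the paper's example where two accesses were needed.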

4.3 Adding a Bloom Filter to Binary Search on Trie Levels in a Leaf-Pushed Trie (LBSL-BF)

TABLE 3 shows the CRC codes and Bloom filter indices for every node in the levels of access (levels 2 through 6) for the L-BSL trie shown in Fig. 4. Figure 8 shows the LBSL-BF trie, which has a Bloom filter programmed using the Bloom filter indices. The search procedure differs from SearchWBSL only in how a node returned by the hash table is handled:

    h_idx = extract(CRC);
    node = access_hash_table(inString, h_idx);
    if (node == NULL) high = accLevel - 1;
    else if (node.type == internal) low = accLevel + 1;
    else {  // prefix node
        tree_BMP = node.BMP;
        if (n(TCAM_BMP) > n(tree_BMP)) return TCAM_BMP;
        else return tree_BMP;
    }


Throughout our simulation, hash indices were consistently generated using a 32-bit CRC generator [29]. The number of overflows depends on several factors, such as the number of hash tables, the number of buckets in each hash table, and the number of entries in each bucket. For N prefixes, we have two hash tables, each of which has N' = 2^⌈log2 N⌉ buckets. Each bucket has two entries, and each entry has 46 bits. For the 5 different prefix sets, there is no overflow except for a single overflow in PORT80. We assume that overflow prefixes are stored in a TCAM. Figure 9 shows the entry structure of the multi-hashing table. Hence, the memory requirement is 4 x 46 x N' bits.

Fig. 9. Entry structure of the multi-hashing table

From our simulation, we found that the average number of hash table accesses (Havg) was not as low as expected because of the false positives of the Bloom filter. The unexpected false positives are caused by prefixes that have the same bit pattern but different lengths. To eliminate these false positives, the hash key of the Bloom filter should carry more information than the prefix value itself. We therefore padded each prefix with zeros to 32 bits and appended 6 bits of prefix length information, so that each hash key is 38 bits long.

Performance evaluation results for the prefix sets are shown in TABLE 4. The number of prefixes in each routing set is shown inside the parentheses. The number of input traces generated for simulation is 3 times the number of prefixes. The size of the Bloom filter, M, is set proportional to N'. The number of hash indices, K, is calculated using Eq. (1); the result is 2, 2, 3, 6, and 11, respectively, for M = N, 2N, 4N, 8N, and 16N.

An input IP address is probed only for the distinct lengths that exist in the routing data, starting from the longest length. If a positive is returned from the Bloom filter, the hash table is probed. If there is a match in the hash table, the search is over; if not, the same procedure is repeated for the next longest length, and so on. The maximum number of Bloom filter queries (Qmax) can be up to the number of distinct lengths that exist in the routing data. Since Bloom filter querying stops when the longest matching prefix is found, the average number of Bloom filter queries (Qavg) is smaller than the maximum. As the size of the Bloom filter increases, the maximum number of hash table probes (Hmax) and the average number of hash table probes (Havg) decrease as expected, since the number of negatives from the Bloom filter increases and the number of false positives decreases.

The hash table access rate in the last column is the average number of hash table probes divided by the average number of Bloom filter queries, i.e., Havg/Qavg. As the size of the Bloom filter increases from N to 16N, the hash table access rate is reduced exponentially. This means that the Bloom filter effectively avoids unnecessary memory accesses to the off-chip hash table as its size increases. When the size of the Bloom filter is 16N, the maximum number of hash table probes (Hmax) is bigger than one, but the average number of hash table accesses (Havg) is only 1.

For a given input trace, if the Bloom filter produces false positives, the number of hash table probes becomes more than one. For the Bloom filter size 16N, TABLE 5 shows the total number of input traces injected into our simulation and the number of inputs that caused at least one false positive. As shown, the number of inputs with at least one false positive is very small compared with the total number of input traces. Hence the fractional part of the average number of hash table accesses (Havg) is zero up to the second digit after the decimal point. This means that most of the false positives are removed when the size of the Bloom filter is sufficiently large, and each IP address lookup requires only one off-chip hash table access on average. Hence the complex hardware for parallel access and the multiple separate memories required in the previous architecture are effectively removed by adding a simple on-chip Bloom filter of up to 512 Kbytes for about 200K prefixes.

TABLE 6 shows the performance comparison with and without a Bloom filter for the PMH algorithm. When there is no Bloom filter, the average number of off-chip memory accesses for an IP address lookup using the PMH algorithm is 6.77 to 11.96, and the maximum number is 22 to 30. In our PMH-BF algorithm, the size of the Bloom filter is 32 Kbytes to 512 Kbytes, which is 16 times N', where N' = 2^⌈log2 N⌉ for N prefixes. For each given input, off-chip hash table accesses are avoided for every length that returns a negative in the Bloom filter query. A negative means that there is no prefix corresponding to that specific length of the input. The average number of off-chip memory accesses for an IP address lookup becomes 1.00 in every case, and the maximum number is 2 or 3.

Simulations have been performed to compare the search performance of our proposed PMH-BF algorithm with that of Dharmapurikar's algorithm (D-BF) [7]. The D-BF algorithm applies controlled prefix expansion to bound the worst-case search performance to 3 hash table accesses. A direct lookup array is used for prefixes of length less than or equal to 20 bits. Prefixes of length 21 to 23 are expanded to length 24, and prefixes of length 25 to 31 are expanded to length 32. The D-BF algorithm requires two Bloom filters: one for prefixes of length 24 and the other for prefixes of length 32. For a fair comparison, we implemented both algorithms under the same constraints in terms of the memory amount for Bloom filters, the memory amount for hash table implementation, the hash functions, and the handling of collided hash keys.
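The PMH-BF search procedure described above (38-bit keys built from zero-padded prefixes plus 6 length bits, queried longest-first against an on-chip Bloom filter before any off-chip probe) can be sketched as follows. This is an illustrative sketch, not the authors' implementation: a SHA-256-based index generator stands in for the paper's 32-bit CRC hashes, and a Python dict plays the role of the off-chip multi-hashing table.

```python
import hashlib

class BloomFilter:
    """Simplified stand-in for the on-chip Bloom filter: k hash indices
    over an m-bit array (SHA-256 replaces the paper's CRC hashes)."""
    def __init__(self, m_bits, k):
        self.m, self.k, self.bits = m_bits, k, bytearray(m_bits)

    def _indices(self, key):
        for i in range(self.k):
            h = hashlib.sha256(key + bytes([i])).digest()
            yield int.from_bytes(h[:4], "big") % self.m

    def program(self, key):
        for idx in self._indices(key):
            self.bits[idx] = 1

    def query(self, key):
        return all(self.bits[idx] for idx in self._indices(key))

def hash_key(prefix_bits, length):
    """38-bit key: prefix zero-padded to 32 bits, then 6 bits of length."""
    padded = prefix_bits << (32 - length)          # zero-pad to 32 bits
    return ((padded << 6) | length).to_bytes(5, "big")

def lookup(addr, lengths, bf, hash_table):
    """Probe distinct prefix lengths longest-first; skip the off-chip
    table whenever the Bloom filter answers negative."""
    for l in sorted(lengths, reverse=True):
        key = hash_key(addr >> (32 - l), l)
        if bf.query(key):                  # positive: probe off-chip table
            entry = hash_table.get(key)
            if entry is not None:          # true positive: longest match
                return entry
            # false positive: fall through to the next shorter length
    return None                            # no match (default route)
```

A false positive only costs one wasted table probe; the loop then continues with the next shorter length, so correctness never depends on the filter.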


TABLE 4 Performance evaluation results of PMH-BF

Routing Data (N)    N' = Buckets    MHT Size   Qmax   Qavg     M     BF Size   K    Hmax   Havg   HT Access Rate
                    per table       (Mbyte)                          (Kbyte)                      (Havg/Qavg)
MAE-WEST (14553)    16384 (2^14)    0.368      22     8.16     N     2         2    20     5.95   0.73
                                                               2N    4         2    13     3.51   0.43
                                                               4N    8         3    9      1.84   0.23
                                                               8N    16        6    4      1.10   0.14
                                                               16N   32        11   2      1.00   0.12
MAE-EAST (39464)    65536 (2^16)    1.472      22     7.88     N     8         2    17     4.46   0.57
                                                               2N    16        2    12     2.44   0.31
                                                               4N    32        3    6      1.35   0.17
                                                               8N    64        6    3      1.02   0.13
                                                               16N   128       11   2      1.00   0.13
PORT80 (112310)     131072 (2^17)   2.944      25     11.96    N     16        2    22     8.35   0.70
                                                               2N    32        2    16     4.65   0.39
                                                               4N    64        3    11     2.18   0.18
                                                               8N    128       6    5      1.12   0.09
                                                               16N   256       11   3      1.00   0.08
Grouptlcom (170601) 262144 (2^18)   5.888      20     6.77     N     32        2    17     4.08   0.60
                                                               2N    64        2    12     2.33   0.34
                                                               4N    128       3    7      1.34   0.20
                                                               8N    256       6    3      1.02   0.15
                                                               16N   512       11   2      1.00   0.15
Telstra (227223)    262144 (2^18)   5.888      25     9.43     N     32        2    23     6.79   0.72
                                                               2N    64        2    18     3.84   0.41
                                                               4N    128       3    13     1.94   0.21
                                                               8N    256       6    6      1.10   0.12
                                                               16N   512       11   3      1.00   0.11
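The K column above follows standard Bloom filter dimensioning: for m bits and n programmed keys, k = (m/n) ln 2 hash functions minimize the false-positive probability p ≈ (1 - e^(-kn/m))^k. This is presumably what Eq. (1) computes, though the exact rounding used for the smallest filters may differ. A quick sanity check:

```python
import math

def optimal_k(m_bits, n_keys):
    # k = (m/n) * ln 2 minimizes the false-positive rate of a Bloom filter
    # with m bits and n keys (rounded to a whole number of hash functions).
    return max(1, round((m_bits / n_keys) * math.log(2)))

def false_positive_rate(m_bits, n_keys, k):
    # Classic approximation: p = (1 - e^(-k*n/m))^k.
    return (1.0 - math.exp(-k * n_keys / m_bits)) ** k

# For the largest setting in TABLE 4, m = 16n bits, eleven hash functions
# are optimal and the false-positive probability falls below 0.05%, which
# is consistent with Havg converging to 1.00 in the 16N rows.
for ratio in (4, 8, 16):
    k = optimal_k(ratio * 1000, 1000)
    print(ratio, k, false_positive_rate(ratio * 1000, 1000, k))
```

The computed k for m/n = 4, 8, and 16 (3, 6, and 11) matches the K column for the 4N, 8N, and 16N rows.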

TABLE 5 Total number of input traces and the number of inputs with at least one false positive

                                     MAE-WEST   MAE-EAST   PORT80     Grouptlcom   Telstra
No. of inputs                        43659      118392     336930     511803       681669
No. of inputs with false positives   88         16         867        82           1451
Rate                                 0.002016   0.000135   0.002573   0.000160     0.002129

TABLE 6 Comparison in search performance with and without Bloom filter for PMH

             Without Bloom Filter               With Bloom Filter (PMH-BF)
Prefix Set   Avg. Mem. Access  Max. Mem. Access  BF Size (Kbyte)  Avg. Mem. Access  Max. Mem. Access
MAE-WEST     8.16              22                32               1.00              2
MAE-EAST     7.88              22                128              1.00              2
PORT80       11.96             25                256              1.00              3
Grouptlcom   6.77              20                512              1.00              2
Telstra      9.51              30                512              1.00              3

In generating hash indices for a Bloom filter, the ANSI C function rand() was used for both algorithms, as suggested in [7]. Each hash key used as an input to rand() in the D-BF algorithm is less than or equal to 32 bits. In generating hash keys for our proposed algorithm, 6 bits of prefix length information were attached after each zero-padded prefix, so that each hash key is 38 bits long.

TABLE 7 shows the result. The memory size for the Bloom filters was fixed at 16N bits for N prefixes. Since our proposed PMH-BF has a single Bloom filter, the 16N bits were used for that one Bloom filter. For the D-BF algorithm, the bits were allocated to its two Bloom filters in proportion to the number of prefixes.

Multi-hashing tables were used for both algorithms. The memory amount for the hash table implementation was determined as follows. For the D-BF algorithm, we allocated 1 Mbyte for the direct lookup array. The number of hash table entries for the other prefixes was determined as 2(N24 + N32), where N24 is the number of prefixes with length 24 and N32 is the number of prefixes with length 32. The same amount of memory required for the implementation of the D-BF algorithm was allocated to the multi-hashing table of the PMH-BF.

TABLE 7 shows that the D-BF algorithm has a large prefix replication factor; the number of prefixes stored in the hash table is greater than the number of original prefixes. Hence the D-BF algorithm shows slightly worse search performance than our proposed algorithm, both on average and in the worst case. It was assumed that collided prefixes are connected by a linked list; a perfect hash function for the given set of prefixes was not assumed in this simulation. Hence the worst-case number of hash table accesses is not bounded by 3 for [7], as shown for Grouptlcom and Telstra.

5.2 Search Performance Improvement by Adding a Bloom Filter to W-BSL

In implementing binary search on levels algorithms, if there is no prefix in a level of the binary trie, that level is an invalid level, and nodes in invalid levels are not stored in the Bloom filter or the hash table. Every node in a valid level, including both prefix nodes and internal nodes, is stored in the Bloom filter and the hash table. Throughout this simulation, we assume a perfect hash function for storing the nodes of the trie in the off-chip hash table. The worst-case number of off-chip memory accesses in the W-BSL algorithm is ⌈log2(W + 1)⌉, which is 6 for W = 32.

TABLE 8 shows the performance comparison for the W-BSL algorithm. The number of nodes represents the total number of nodes stored. When there is no Bloom filter, the average number of off-chip memory accesses for an IP address lookup using the W-BSL algorithm is 4.33 to 4.78. The simulation result shown in TABLE 8 is for a Bloom filter of size 8U', where U' = 2^⌈log2 U⌉ and U is the total number of nodes. The size of the Bloom filter is 16 to 64 Kbytes. The average number of Bloom filter queries is equal to the average number of memory accesses when there is no Bloom filter. For each given input, the off-chip hash table access is avoided for every length that returns a negative in the Bloom filter query. A negative means that there is no node in the trie corresponding to that specific length of the input. The average number of off-chip memory accesses for an IP address lookup with the WBSL-BF became 2.50 to 3.19; hence the number of memory accesses was reduced by around 40% by adding a Bloom filter.

5.3 Search Performance Improvement by Adding a Bloom Filter to L-BSL

TABLE 9 shows the performance evaluation results for the L-BSL algorithm. As in the W-BSL case, if there is no prefix in a level of the leaf-pushed trie, the level is an invalid level, and nodes in invalid levels are not stored in the Bloom filter or the hash table. The number of nodes represents the total number of nodes, including prefix nodes and internal nodes, in valid levels. The number inside the parentheses represents the number of prefix nodes after leaf-pushing, which is slightly larger than the number of original prefix nodes. When there is no Bloom filter, the average number of off-chip memory accesses for an IP address lookup is 3.57 to 4.49. The search performance of the L-BSL is slightly better than that of the W-BSL, since a search can terminate as soon as a prefix node is encountered, even if it is not the last level of access.

The simulation result shown is also for a Bloom filter of size 8U'. The size of the Bloom filter is 16 to 128 Kbytes. The average number of off-chip memory accesses for an IP address lookup with the LBSL-BF became 2.62 to 3.47; hence the number of memory accesses was reduced by approximately 30% by adding a Bloom filter. The performance improvement is smaller for the LBSL-BF than for the WBSL-BF: because of leaf-pushing, many internal nodes are created, and hence fewer Bloom filter negatives are produced in the LBSL-BF.

5.4 Search Performance Comparison with Other Algorithms

This section presents simulation results comparing the proposed algorithms with other algorithms in terms of required memory amount and search performance. The algorithms in the comparison are the binary trie (B-Trie) [4], priority trie (P-Trie) [23], binary search on range (BSR) [24], binary search with prefix vector (BST-PV) [8], Waldvogel's BSL (W-BSL) [13], Lim's BSL (L-BSL) [14], the logW-Elevator algorithm (logW-E) [25], Dharmapurikar's algorithm (D-BF) [7], and the proposed algorithms (PMH-BF, WBSL-BF, LBSL-BF). The details of the B-Trie, P-Trie, BSR, BST-PV, and logW-E algorithms can be found in [4]. Our simulation used the same prefix sets as those used in [4].

TABLE 10 shows the required memory amount for each algorithm. For the algorithms requiring a Bloom filter, which are the D-BF, PMH-BF, WBSL-BF, and LBSL-BF, the required memory amounts for the Bloom filters are also shown. The sizes of the Bloom filters are reasonably small, so each Bloom filter can be embedded on-chip. The memory amount for the hash table implementation of the WBSL-BF and the LBSL-BF is the same as that of the W-BSL and the L-BSL, respectively. Algorithms requiring a multi-hashing table, such as the D-BF and the PMH-BF, generally consume more memory than the other algorithms, but they provide better search performance, as shown next.

Figures 10 and 11 show the worst-case search performance and the average-case search performance, respectively. The D-BF algorithm and the proposed PMH-BF algorithm provide the best performance in both the worst case and the average case. The WBSL-BF and the LBSL-BF are the next best. The search performance of the known algorithms was effectively improved by adding an on-chip Bloom filter.

6 CONCLUSION

This paper shows how effectively an on-chip Bloom filter can improve the search performance of known efficient IP address lookup algorithms. The parallel multiple-hashing architecture provides high-speed IP address lookup with a single access cycle of off-chip memory, but it requires complicated hardware for parallel accesses to the separate memories storing the prefixes of each length. This paper shows how to avoid the parallel access to


TABLE 7 Search performance comparison with [7]

              BF Size   HT Size   [7] (D-BF)                          Proposed (PMH-BF)
Routing Data  (Kbyte)   (Mbyte)   Replication factor  Hmax  Havg      Replication factor  Hmax  Havg
MAE-WEST      28.4      1.26      73.3                3     1.006     1                   3     1.003
MAE-EAST      77.1      1.77      27.9                3     1.007     1                   3     1.003
PORT80        219.4     3.53      10.8                3     1.010     1                   3     1.006
Grouptlcom    333.2     4.94      7.6                 4     1.014     1                   3     1.003
Telstra       443.8     10.19     7.1                 4     1.051     1                   3     1.006
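The prefix replication factor in TABLE 7 comes from controlled prefix expansion: each original prefix of length 21 to 23 becomes up to 8 prefixes of length 24, and each of length 25 to 31 becomes up to 128 prefixes of length 32. A hedged sketch of this expansion follows; the length boundaries match the D-BF configuration described above, but the flat dict layout is a simplification for illustration.

```python
def expand_prefixes(prefixes):
    """Controlled prefix expansion to lengths 24 and 32. Prefixes of length
    <= 20 are assumed to go to the 1-Mbyte direct lookup array instead.
    `prefixes` is a list of (bits, length) pairs; the result maps each
    expanded (bits, target_length) pair to the original prefix it came
    from, with longer (more specific) originals overwriting shorter ones."""
    expanded = {}
    for bits, length in sorted(prefixes, key=lambda p: p[1]):  # shortest first
        if length <= 20:
            continue                       # handled by the direct lookup array
        target = 24 if length <= 24 else 32
        shift = target - length
        for suffix in range(1 << shift):   # one expanded prefix per suffix
            expanded[((bits << shift) | suffix, target)] = (bits, length)
    return expanded

def replication_factor(prefixes):
    """Stored entries divided by original (length > 20) prefixes."""
    stored = len(expand_prefixes(prefixes))
    originals = sum(1 for _, l in prefixes if l > 20)
    return stored / originals
```

Processing shortest prefixes first lets a more specific original overwrite the expansions it overlaps, which preserves longest-match semantics among the expanded entries.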

TABLE 8 Comparison in average search performance with and without Bloom filter for W-BSL

             Trie Characteristics                            Without Bloom Filter     With Bloom Filter (WBSL-BF)
Prefix Set   No. of Inputs   No. of Nodes   Valid Levels     Max. Access  Avg. Access  BF Size (Kbyte)  Avg. Access
MAE-WEST     43659           76708          22               6            4.36         16               2.50
MAE-EAST     118392          172418         22               6            4.33         32               2.53
PORT80       336930          225050         25               6            4.72         32               2.78
Grouptlcom   511803          314986         20               6            4.67         64               3.15
Telstra      681669          452732         25               6            4.78         64               3.19

TABLE 9 Comparison in average search performance with and without Bloom filter for L-BSL

             Trie Characteristics                                 Without Bloom Filter     With Bloom Filter (LBSL-BF)
Prefix Set   No. of Inputs  No. of Nodes      Valid Levels        Max. Access  Avg. Access  BF Size (Kbyte)  Max. Access  Avg. Access
                            (prefix nodes)
MAE-WEST     43659          82156 (19968)     23                  6            4.06         16               4            2.98
MAE-EAST     118392         191757 (59377)    20                  6            4.49         32               4            3.47
PORT80       336930         299899 (145267)   25                  6            3.73         64               4            2.62
Grouptlcom   511803         411122 (203093)   21                  6            3.57         64               4            2.82
Telstra      681669         576370 (285741)   25                  6            3.95         128              4            3.04
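The mechanism evaluated in TABLES 8 and 9 (binary search over the valid trie levels, with the on-chip filter consulted before each off-chip probe) can be sketched as below. This is a simplified illustration under stated assumptions: markers and best-matching-prefix ("bmp") values are assumed to be precomputed into the node entries as binary search on levels requires, and a plain set stands in for the Bloom filter. The search tolerates false positives either way, since every filter positive is verified in the hash table.

```python
class OnChipFilter:
    """Stand-in for the on-chip Bloom filter: answers membership queries,
    possibly with false positives (a set yields none, which is harmless
    here because every positive is verified off-chip)."""
    def __init__(self):
        self.keys = set()

    def program(self, key):
        self.keys.add(key)

    def query(self, key):
        return key in self.keys

def bsl_lookup(addr, valid_levels, bf, hash_table):
    """Binary search on the sorted list of valid trie levels.
    hash_table maps (level, prefix_bits) -> {"bmp": best_match_or_None}."""
    lo, hi, best = 0, len(valid_levels) - 1, None
    while lo <= hi:
        mid = (lo + hi) // 2
        level = valid_levels[mid]
        key = (level, addr >> (32 - level))
        if not bf.query(key):        # on-chip negative: no node at this level;
            hi = mid - 1             # search shorter levels, no off-chip access
            continue
        node = hash_table.get(key)   # off-chip hash table probe
        if node is None:             # false positive from the filter
            hi = mid - 1
        else:
            if node["bmp"] is not None:
                best = node["bmp"]   # remember the best match seen so far
            lo = mid + 1             # a node exists: a longer match may exist
    return best
```

Each filter negative replaces one off-chip probe with an on-chip query, which is exactly where the 30-40% access reduction in the tables comes from.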

TABLE 10 Memory requirement (Mbyte)

             B-Trie  P-Trie  BSR   BST-PV  logW-E  W-BSL  L-BSL  D-BF           PMH-BF         WBSL-BF       LBSL-BF
Prefix Set                                                       BF*    HT      BF*    HT      BF*   HT      BF*   HT
Mae-West     0.45    0.14    0.17  0.34    1.01    0.45   0.40   28.4   1.26    28.4   1.26    16    0.45    16    0.40
Mae-East     0.99    0.39    0.46  0.92    2.58    0.99   0.71   77.1   1.77    77.1   1.77    32    0.99    32    0.71
PORT80       1.29    1.07    0.99  1.65    3.03    1.29   1.43   219.4  3.53    219.4  3.53    32    1.29    64    1.43
Grouptlcom   1.80    1.67    1.50  2.50    4.36    1.80   1.96   333.2  4.94    333.2  4.94    64    1.80    64    1.96
Telstra      2.59    2.22    2.00  3.77    5.70    2.59   2.75   443.8  10.19   443.8  10.19   64    2.59    128   2.75

*Each Bloom filter size is in Kbyte.

off-chip memories by adding a small on-chip Bloom filter. For a given input, the Bloom filter is queried first, starting from the longest length. If the result is a negative, access to the off-chip hash table is avoided for that specific length. The off-chip hash table is accessed only on a positive result from the Bloom filter, and when the positive turns out to be true, the search for the input is finished. It is shown that, by properly controlling the false positive rate, the proposed architecture provides average search performance comparable to that of the parallel multiple-hashing architecture. The proposed architecture requires much less hardware, since it has only a small on-chip Bloom filter and a single multi-hashing table and does not require complicated hardware or separate memories for parallel access.

Among trie-based algorithms, algorithms based on binary search on trie levels provide the best search performance, since their performance is proportional to O(log ldist), where ldist is the number of distinct prefix lengths. This paper shows how to further improve the search performance of those algorithms by adding a simple on-chip Bloom filter. For each given input, the Bloom filter is queried first for the current level of access. If the result is a negative, there is no node at that level of the trie, and the search can proceed to a shorter level without accessing the off-chip hash table. It is shown that the average search performance is improved by 30-40% by effectively avoiding off-chip hash table accesses when there is no node in the current level.

Multi-bit tries with controlled prefix expansion [22], [33]-[34] provide better search performance than binary tries by reducing the number of distinct levels. Binary search on trie levels can be applied to multi-bit tries without loss of generality, and hence


Fig. 10. Worst-case number of memory accesses for each algorithm. (a) Mae-West (14553 prefixes). (b) Mae-East (39464 prefixes). (c) PORT80 (112310 prefixes). (d) Grouptlcom (170601 prefixes). (e) Telstra (227223 prefixes).

Fig. 11. Average number of memory accesses for each algorithm. (a) Mae-West (14553 prefixes). (b) Mae-East (39464 prefixes). (c) PORT80 (112310 prefixes). (d) Grouptlcom (170601 prefixes). (e) Telstra (227223 prefixes).


our proposed approach using an on-chip Bloom filter can also be applied to binary search on the levels of a multi-bit trie. We believe that the Bloom filter is a simple but extremely powerful data structure that can improve the performance of many other applications as well [31], and we are actively seeking such applications.

ACKNOWLEDGMENTS

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MEST) (2012-005945). This research was also supported by the MKE (The Ministry of Knowledge Economy), Korea, under the ITRC support program supervised by the NIPA (NIPA-2012-H0301-12-4004). The preparation of this paper would not have been possible without the efforts of our students in the SoC Design Lab at Ewha W. University on simulations. We are particularly grateful to Jungwon Lee and Youngju Choi.

REFERENCES

[1] H. Jonathan Chao, "Next generation routers," Proceedings of the IEEE, vol. 90, no. 9, pp. 1518-1588, Sept. 2002.
[2] M. A. Ruiz-Sanchez, E. M. Biersack, and W. Dabbous, "Survey and taxonomy of IP address lookup algorithms," IEEE Network, vol. 15, no. 2, pp. 8-23, March/April 2001.
[3] S. Sahni, K. Kim, and H. Lu, "Data structures for one-dimensional packet classification using most-specific-rule matching," International Journal on Foundations of Computer Science, vol. 14, no. 3, pp. 337-358, 2003.
[4] H. Lim and N. Lee, "Survey and proposal on binary search algorithms for longest prefix match," IEEE Communications Surveys and Tutorials, pp. 1-17, 2012 (IEEE early access).
[5] F. Yu, R. H. Katz, and T. V. Lakshman, "Efficient multimatch packet classification and lookup with TCAM," IEEE Micro, vol. 25, no. 1, pp. 50-59, Jan./Feb. 2005.
[6] H. Lu and S. Sahni, "Dynamic tree bitmap for IP lookup and update," International Conference on Networking, 2007.
[7] S. Dharmapurikar, P. Krishnamurthy, and D. Taylor, "Longest prefix matching using Bloom filters," IEEE/ACM Trans. Networking, vol. 14, no. 2, pp. 397-409, Feb. 2006.
[8] H. Lim, H. Kim, and C. Yim, "IP address lookup for Internet routers using balanced binary search with prefix vector," IEEE Trans. on Communications, vol. 57, no. 3, pp. 618-621, Mar. 2009.
[9] A. Broder and M. Mitzenmacher, "Using multiple hash functions to improve IP lookups," IEEE Infocom, vol. 3, pp. 1454-1463, 2001.
[10] H. Lim and Y. J. Jung, "A parallel multiple hashing architecture for IP address lookup," IEEE HPSR, pp. 91-95, 2004.
[11] H. Lim, J. Seo, and Y. Jung, "High speed IP address lookup architecture using hashing," IEEE Communications Letters, vol. 7, no. 10, pp. 502-504, Oct. 2003.
[12] W. Eatherton, G. Varghese, and Z. Dittia, "Tree bitmap: Hardware/software IP lookups with incremental updates," ACM SIGCOMM Computer Communications Review, vol. 34, no. 2, pp. 97-122, Apr. 2004.
[13] M. Waldvogel, G. Varghese, J. Turner, and B. Plattner, "Scalable high speed IP routing lookups," Proc. ACM SIGCOMM, 1997, pp. 25-35.
[14] J. H. Mun, H. Lim, and C. Yim, "Binary search on prefix lengths for IP address lookup," IEEE Communications Letters, vol. 10, no. 6, pp. 492-494, June 2006.
[15] K. Kim and S. Sahni, "IP lookup by binary search on prefix length," Journal of Interconnection Networks, vol. 3, pp. 105-128, 2002.
[16] W. Lu and S. Sahni, "Succinct representation of static packet classifiers," IEEE/ACM Transactions on Networking, vol. 17, no. 3, pp. 803-816, 2009.
[17] W. Lu and S. Sahni, "Recursively partitioned static router tables," IEEE Transactions on Computers, vol. 59, no. 12, pp. 1683-1690, 2010.
[18] S. Sahni and K. Kim, "An O(log n) dynamic router table design," IEEE Trans. on Computers, vol. 53, no. 3, pp. 351-363, Mar. 2004.
[19] H. Lu and S. Sahni, "O(log n) dynamic router-tables for prefixes and ranges," IEEE Trans. on Computers, vol. 53, no. 10, pp. 1217-1230, Oct. 2004.
[20] W. Lu and S. Sahni, "Packet classification using space-efficient pipelined multi-bit tries," IEEE Trans. on Computers, vol. 57, no. 5, pp. 591-605, May 2008.
[21] K. Kim and S. Sahni, "Efficient construction of pipelined multibit-trie router-tables," IEEE Trans. on Computers, vol. 56, no. 1, pp. 32-43, Jan. 2007.
[22] V. Srinivasan and G. Varghese, "Fast address lookups using controlled prefix expansion," ACM Transactions on Computer Systems, vol. 17, no. 1, pp. 1-40, Feb. 1999.
[23] H. Lim, C. Yim, and E. E. Swartzlander, Jr., "Priority trie for IP address lookup," IEEE Trans. on Computers, vol. 59, no. 6, pp. 784-794, Jun. 2010.
[24] B. Lampson, B. Srinivasan, and G. Varghese, "IP lookups using multiway and multicolumn search," IEEE/ACM Trans. Networking, vol. 7, no. 3, pp. 324-334, 1999.
[25] R. Sangireddy, N. Futamura, S. Aluru, and A. K. Somani, "Scalable, memory efficient, high-speed algorithms for IP lookups," IEEE/ACM Trans. on Networking, vol. 13, no. 4, pp. 802-812, Aug. 2005.
[26] K. Lim, K. Park, and H. Lim, "Binary search on levels using a Bloom filter for IPv6 address lookup," IEEE/ACM ANCS, 2009, pp. 185-186.
[27] P. Panda, N. Dutt, and A. Nicolau, "On-chip vs. off-chip memory: The data partitioning problem in embedded processor-based systems," ACM Transactions on Design Automation of Electronic Systems, vol. 5, no. 3, pp. 682-704, July 2000.
[28] H. Lim and S. Kim, "Tuple pruning using Bloom filters for packet classification," IEEE Micro, vol. 30, no. 3, pp. 784-794, May/June 2010.
[29] A. G. Alagu Priya and H. Lim, "Hierarchical packet classification using a Bloom filter and rule-priority tries," Computer Communications, vol. 33, no. 10, pp. 1215-1226, Jun. 2010.
[30] H. Song, S. Dharmapurikar, J. Turner, and J. Lockwood, "Fast hash table lookup using extended Bloom filter: An aid to network processing," Proc. ACM SIGCOMM, Aug. 2005.
[31] S. Taroma, C. E. Rothenberg, and E. Lagerspetz, "Theory and practice of Bloom filters for distributed systems," IEEE Communications Surveys and Tutorials, vol. 14, no. 1, pp. 131-155, first quarter 2012.
[32] J. Hasan, S. Cadambi, V. Jakkula, and S. Chakradhar, "Chisel: A storage-efficient, collision-free hash-based network processing architecture," Proc. ISCA, pp. 203-215, 2006.
[33] W. Lu and S. Sahni, "Packet forwarding using pipelined multibit tries," IEEE Symposium on Computers and Communications, pp. 802-807, May 2006.
[34] W. Lu and S. Sahni, "Packet classification using pipelined two-dimensional multibit tries," IEEE Symposium on Computers and Communications, pp. 808-813, May 2006.
[35] http://www.potaroo.net

Hyesook Lim (M'91) received B.S. and M.S. degrees from the Department of Control and Instrumentation Engineering at Seoul National University, Seoul, Korea, in 1986 and 1991, respectively, and the Ph.D. degree from the University of Texas at Austin in 1996. From 1996 to 2000, she was a member of technical staff at Bell Labs, Lucent Technologies, Murray Hill, New Jersey. From 2000 to 2002, she worked for Cisco Systems, San Jose, California. She is currently a professor in the Department of Electronics Engineering, Ewha Womans University, Seoul, Korea. Her research interests include router design issues such as address lookup and packet classification, Bloom filter application to various distributed algorithms, and the hardware implementation of various network algorithms.


Kyuhee Lim received a B.S. degree from the Department of Electronics Engineering at Ewha Womans University, Seoul, Korea, in 2005. From 2005 to 2009, she was employed at Hynix Semiconductor, Korea, where she worked on memory design. She is currently pursuing a Ph.D. degree at the same university. Her research interests include address lookup and packet classification algorithms and TCAM architecture design.

Nara Lee received a B.S. degree and an M.S. degree from the Department of Electronics En- gineering at Ewha Womans University, Seoul, Korea, in 2009 and 2012, respectively. Her re- search interests include various network algo- rithms such as IP address lookup and packet classification, web caching, and Bloom filter ap- plication to various distributed algorithms.

Kyong-hye Park received a B.S. degree and an M.S. degree from the Department of Electronics Engineering at Ewha Womans University, Seoul, Korea, in 2007 and 2009, respectively. She works for the Mobile Communication Business Unit, Samsung Electronics, Korea, where she is currently developing Android handsets.