
A Hybrid Finite Automaton for Practical Deep Packet Inspection

Michela Becchi
Washington University
Computer Science and Engineering
St. Louis, MO 63130-4899
+1-314-935-4306
[email protected]

Patrick Crowley
Washington University
Computer Science and Engineering
St. Louis, MO 63130-4899
+1-314-935-9186
[email protected]

ABSTRACT
Deterministic finite automata (DFAs) are widely used to perform regular expression matching in linear time. Several techniques have been proposed to compress DFAs in order to reduce memory requirements. Unfortunately, many real-world IDS regular expressions include complex terms that result in an exponential increase in the number of DFA states. Since all recent proposals use an initial DFA as a starting point, they cannot be used as comprehensive regular expression representations in an IDS.

In this work we propose a hybrid automaton which addresses this issue by combining the benefits of deterministic and non-deterministic finite automata. We test our proposal on Snort rule-sets and we validate it on real traffic traces. Finally, we address and analyze the worst-case behavior of our scheme and compare it to traditional ones.

Categories and Subject Descriptors
C.2.0 [Computer Communication Networks]: General – Security and protection (e.g., firewalls)

General Terms
Algorithms, Performance, Design, Security.

Keywords
Deep packet inspection, DFA, NFA, regular expressions.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
CoNEXT 2007, December 10-13, 2007, New York, NY, U.S.A.
(c) 2007 ACM 978-1-59593-770-4/07/0012 $5.00.

1. INTRODUCTION
Increasingly, network packets are classified not only by the fields of their headers, but also by the content of their payloads. In particular, signature-based deep packet inspection has taken root as a dominant security mechanism in networking devices and computer systems. Most popular software tools (including Snort [6][7] and Bro [10]) and devices (including the Cisco family of Security Appliances [8] and the Citrix Application Firewall [9]) use regular expressions to describe payload patterns. While more expressive than simple patterns of exact-match strings, and therefore able to describe a wider variety of payload signatures [12], regular expression implementations demand far greater memory space and bandwidth. As a result of these trends, there has been a considerable amount of recent work on implementing regular expressions for use in high-speed networking applications, particularly with representations based on deterministic finite automata (DFAs).

DFAs have attractive properties that explain the attention they have received. Foremost, they have a predictable memory bandwidth requirement. In fact, processing an input string involves one DFA state traversal per character, which translates into a deterministic number of memory accesses. Moreover, it has long been established that, for any given regular expression, a DFA with a minimum number of states can be determined [4][5]. Even so, DFAs corresponding to large sets of regular expressions, each one representing a different rule, can be prohibitively large.

Recent work has tackled this problem in two ways. First, since an explosion in states can occur when many rules are grouped together into a single DFA, Yu et al. [15] have proposed segregating rules into multiple groups and evaluating the corresponding DFAs concurrently. This solution decreases memory space requirements, sometimes dramatically, but increases memory bandwidth linearly with the number of active DFAs. The second approach, proposed by Kumar et al. [15], aims at reducing the memory space requirement of any given DFA and is based on two observations. First, the memory space required to store a DFA strictly depends on the number of transitions between states. Second, many states in DFAs have identical sets of outgoing transitions. Substantial space savings in excess of 90% are achievable in current rule-sets when this redundancy is exploited. The proposed compression technique trades off memory storage requirements with processing time.

Unfortunately, DFAs are infeasible for regular expressions found in the most frequently used rule-sets. Specifically, when repeated wildcards are present in a regular expression, it may be impossible to build a DFA with a reasonable number of states. For example, the regular expression “prefix.{100}suffix”, which matches if and only if “prefix” is separated from “suffix” by 100 characters, would require well over 1 million states to be represented in a DFA. Since, as we will see, such constructs occur frequently within popular security rule-sets, DFA-based approaches, including the recent work described above, are infeasible as comprehensive solutions.

As an alternative, one could consider using a solution based on non-deterministic finite automata (NFAs) [22]. The number of NFA states required to represent a regular expression is on the order of the number of characters present in the regular expression itself. As an example, the regular expression above would require just 101 + (# chars in prefix) + (# chars in suffix) NFA states. Therefore, an NFA-based representation would alleviate the memory storage problem. However, an NFA may lead to a variable, and potentially large, memory bandwidth requirement. In fact, multiple NFA states can be active in parallel and each input character can trigger multiple state transitions, and therefore require multiple parallel memory operations. In the worst case, all NFA states can be active concurrently, requiring a prohibitive amount of memory bandwidth.

In this paper we propose a hybrid DFA-NFA finite automaton (Hybrid-FA), a solution bringing together the strengths of both DFAs and NFAs. When constructing a hybrid-FA, any nodes that would contribute to state explosion retain an NFA encoding, while the rest are transformed into DFA nodes. The result is a data structure with size nearly that of an NFA, but with the predictable and small memory bandwidth requirements of a DFA. We evaluate the hybrid-FA structure by comparing it to both DFA and NFA representations on rule-sets from the popular security package Snort. The primary contribution of the hybrid-FA is that entirely new classes of regular expressions can be implemented in fast networking contexts.

The remainder of this paper is organized as follows. Additional background and motivation are presented in Section 2. Section 3 describes the conditions of DFA state explosion in greater detail. The hybrid-FA structure is introduced in Section 4. Extensions to establish correctness and worst-case bounds are presented in Section 5. Section 6 provides a brief discussion on implementation issues and alternatives. Experimental results are found in Section 7. Further discussion of related work is found in Section 8. The paper concludes with discussion in Section 9.

2. BACKGROUND AND MOTIVATION
Several techniques for minimizing the memory requirements of DFAs representing sets of regular expressions have been recently proposed [15][17][18]. These proposals have some common properties; specifically:
a) They are based on the assumption that a DFA can be computed and is given as input.
b) They exploit the observation that DFAs corresponding to rule-sets derived from commonly used Network Intrusion Detection Systems (NIDS) have significant state transition redundancy.

The second aspect, i.e., the presence of significant transition redundancy, can be easily explained as follows. Regular expressions used within NIDS typically consist of sets of patterns containing: simple strings, character ranges, wildcards, indefinitely repeating sub-patterns, and sub-patterns repeated a discrete number of times. Notably, nested repetitions (which cause loop-backs in the corresponding NFAs and DFAs) are not found in practice. One can think about compiling together the set of regular expressions corresponding to several rules by first building an NFA which represents the disjunction of the NFAs of the single regular expressions, and then converting it to a DFA through the well-known subset construction procedure [4]. Such NFAs will typically have a tree-like structure (with the exception, as we will see, of a few loops and backward transitions), where the root corresponds to the starting state and the leaves to the accepting states. If common prefixes are collapsed, the root and the nodes at the first levels will have several outgoing transitions, whereas, moving towards the leaves, the tree will tend to become “skinny”, i.e., to consist of long chains of nodes. Moreover, except for the first levels of the NFA and for the transitions representing wildcards and large character ranges, most nodes will have outgoing transitions defined only for a few characters. When building the corresponding DFA, missing transitions on the NFA typically translate into backward transitions to the nodes at the first levels of the hierarchy (or intermediate nodes when the represented regular expression contains dot-star conditions or repetitions of wide character ranges). Thus, a restricted number of nodes in the final DFA tend to be the target of most transitions.
Figure 1: DFAs representing the following RegEx: abcd (1) and ab.*cd (2). Transitions to state 0 are omitted.

Taking advantage of this redundancy enables very effective memory compression techniques for a given DFA, but does not address a major problem: due to state explosion during NFA-to-DFA transformation, DFAs cannot be built for many individual regular expressions and sets of expressions in NIDS rule-sets. Theoretically speaking, during subset construction, an exponential growth in the number of states can take place. In the case of large NFAs, this can make DFA construction infeasible. In practice, if subset construction is performed “lazily” (i.e., new DFA states are created only when they happen to be targets of any other state), there are only a few recognizable cases where this can happen. To this end, we are interested in two distinct situations:
a) State blow-up happens when compiling a single regular expression in isolation.
b) Given two regular expressions RE1 and RE2, the corresponding DFA1 and DFA2 can be built without incurring state explosion. However, when compiling RE1 and RE2 into a unique DFA12, either the number of states in DFA12 is significantly greater than the sum of DFA1 and DFA2, or DFA12 cannot be built at all, due to exponential state explosion.

Clearly, in the first case, DFAs are not a feasible representation of the given regular expression. The second case can be treated by keeping the two DFAs separated and operating them in parallel (i.e., trading memory space for bandwidth). However, given a set of regular expressions, it would be beneficial to be able to predict this situation without testing all possible combinations.

The goals of this work are the following:
• Explore two distinct conditions which lead to state blow-up during subset construction.
• Propose a hybrid automaton which deals with the above problems in a unified way.
• Refine the proposal in order to provide an acceptable worst-case bound on the memory bandwidth requirement.

As previously mentioned, the proposal and the results focus on practical data-sets from Snort NIDS. However, these dot-star and counting constraint terms are not unique to the Snort rule-sets. Via personal communication with colleagues at Cisco Systems Inc., we have learned of proprietary IDS regular expression rule-sets in which 14% and 1% of the rules include dot-star and counting terms, respectively.

3. STATE BLOW-UP
Given an NFA with N states, the corresponding DFA can consist of potentially 2^N states [4]. In practice, this upper bound is never reached and, in most cases, the number of states in the DFA is comparable to that of the corresponding NFA. If the NFA represents simple patterns and common prefixes and suffixes have not been collapsed, a state-minimized DFA can actually have slightly fewer states. However, there are common conditions which can bring the number of DFA states close to the theoretical upper bound. An analysis of those conditions within DFAs representing single regular expressions is presented in [15]. Here we want to focus on two patterns which occur frequently in practical data-sets, namely “dot-star” conditions and “counting constraints.”

3.1 “Dot-star” conditions
A dot-star condition is a sub-pattern of the type “.*”, meaning “a wildcard repeated any number of times.” As an extension, we include in this category sub-patterns of the form “[^c1c2...ck]*”, where the repetition involves a large range of characters (namely, all characters but c1, c2,..., ck). While excluding characters from the repetition introduces some additional issues that we will discuss later, this feature exhibits the same characteristics as a pure “.*” condition in terms of state blow-up.

Dot-star conditions are common in practical data-sets. Their primary use is to detect occurrences of sub-patterns separated by an arbitrary number of characters. In the case of Snort rules, many regular expressions use a “[^\n\r]*” term to search for an occurrence of the prefix sub-pattern in the same line of text as the suffix sub-pattern. Multiple dot-star terms can appear within the same expression.

For example, the Snort spyware rule “User-Agent\x3A[^\r\n]*ZC-Bridge” looks for an occurrence of the sub-pattern “ZC-Bridge” only if “User-Agent\x3A” has been previously detected and no carriage return or new line character occurred in between. That is, the two sub-patterns must occur in the given order and on the same line, and may be separated by an arbitrary number of characters.

In practical rule-sets, dot-star conditions do not cause state blow-up when individual regular expressions are compiled in isolation. In fact, those patterns affect the transitions in the DFA but not the number of states. Figure 1, which compares DFAs accepting regular expressions abcd and ab.*cd, illustrates this fact. It can be observed that the number of states is the same in the two cases; the transitions are “moved toward” the tail of the DFA in the second one.

However, dot-star conditions add complexity when distinct regular expressions are compiled together (note that the same condition would arise in the case of single regular expressions consisting of disjunctions of complex sub-expressions). To see why, assume that we compile together expressions RE1 and RE2 (that is, build a DFA for the expression “.*(RE1|RE2)”) and that RE1 contains a “.*” term and RE2 does not. Since the dot-star term in RE1 can match any string, including all those strings matching RE2, a properly formed combined DFA will have additional states to determine a match of RE2 within the “.*” pattern belonging to RE1. This condition effectively duplicates the sub-DFA representing RE2 within the sub-DFA for RE1.

Figure 2 illustrates this situation in the composite DFA for regular expressions “ab.*cd” and “efgh”. Notice that the sub-DFA matching “efgh” is replicated: first in states 2, 4, 7 and 10, and second in states 6, 9, 11 and 12. The second replica originates from state 3, which derives from expanding the dot-star condition.

Figure 2: DFA representing (1) ab.*cd and (2) efgh. In the accepting states, the number following the “/” represents the accepted regular expression.

If the regular expressions compiled together contain common sub-patterns, the replication may involve only sub-expressions. However, in general, a sub-DFA will be replicated once for every occurrence of a dot-star term in other regular expressions. Thus, dot-star terms create linear increases in the number of DFA states.

3.2 Counting constraints
A counting constraint corresponds to the repetition of a sub-pattern for a given number of times, and is expressed in the form sub-pattern{n,m} where n and m are the minimum and the maximum cardinality of the repetition. If n and m are equal, the counting constraint is expressed in the form sub-pattern{n}.

The most frequent situation occurring in practical data-sets corresponds to the repetition of only one character, which can be a specific symbol, a wildcard or a character within a range. Since we are interested only in situations causing state blow-up, we will restrict ourselves to repetitions of wildcards and of large character ranges.

In Snort, regular expressions with counting constraints are commonly used to detect buffer overflow situations. Again, “[^\n\r]{n}”-like sub-expressions are utilized in order to split the text string on a line basis. As an example, the Snort rule “AUTH\s[^\n]{100}” would detect an IMAP authentication overflow attempt, where the buffer is 100 characters long and a new line terminates the authentication string.

In contrast to dot-star terms, counting constraints on wildcards and large character ranges cause exponential state blow-up when creating DFAs even for single regular expressions. This can be explained as follows: when expanding the counting constraint, all possible occurrences of the regular expression prefix must be considered at each instance of the wildcard or character range. Figure 3 illustrates this fact on the simple regular expression “ab.{3}cd”. Clearly, the size of the DFA grows rapidly with the cardinality of the counting constraints.

Figure 3: DFA representing regular expression ab.{3}cd.

The situation gets dramatically worse when multiple regular expressions are compiled together in a combined DFA. In this situation, one has to also consider all possible occurrences of the other regular expressions within the one having the counting constraint. We note that counting constraints in typical data-sets consist of at least 100 repetitions; it is therefore impossible to build reasonable DFAs for such rules, much less for groups of them.

4. HYBRID-FA
One obvious way to keep the size of the automaton contained when transforming an NFA into a DFA is to interrupt the subset construction operation at those NFA states whose expansion would cause state explosion to happen. In the two specific cases described above, the critical states can be easily determined. In fact, they correspond to the first state of the dot-star constraint (that is, the one with the auto-loop) and the initial state of the repetition sub-expression.

The outcome of interrupting subset construction at an intermediate state will be a hybrid automaton (which we will call hybrid-FA), consisting of not expanded NFA-like states, DFA-like states and “border” states. The latter can be considered as being part of both a DFA and of an NFA.

Figure 4 shows a small example where subset construction is interrupted at NFA state 2. State numbering in the hybrid-FA reflects the subset construction operation. Since, for instance, processing symbol a in NFA state 0 leads to NFA states 0 and 1, processing the same character in (DFA) state 0 of the hybrid automaton leads to a (DFA) state tagged 0-1. It can be noted that the border state 0-2-5 has two distinct outgoing transitions on character c: one falling into the NFA-part and one into the DFA-part. Moreover, its sub-state 2 is ignored when computing the transition targets to the DFA-part.

Figure 4: NFA and hybrid-FA for regular expressions abcd and bce. Subset construction is interrupted at NFA-state 2. Within the hybrid-FA, the DFA part is solid whereas the NFA part is dashed.

If we restrict ourselves to regular expressions consisting of sequences of sub-patterns possibly separated by dot-star conditions and counting constraints, start subset construction at the NFA initial state, and interrupt it as just described, then the resulting hybrid-FA will exhibit some useful properties. Specifically: i) the starting state will be a DFA-state; ii) the NFA part of the automaton will remain inactive until a border state is reached; and iii) there will be no backwards activation of the DFA coming from the NFA.

In order to better illustrate these concepts, let us consider two examples: the first one containing a “.*” sub-expression and the second one a counting constraint. The goal of the discussion will be two-fold. First, we want to show the characteristics of a hybrid-FA compared to the corresponding DFA and NFA. Second, we want to give an intuition about how the traversal of a hybrid-FA works.

4.1 “Dot-star” regular expressions
Figure 5 shows the NFA representing regular expressions ab.*cd, cefc, cad and efb. The double-circled states are accepting states; within them, the number following the slash indicates the accepted regular expression. State 0 is the initial state. The NFA is reduced by merging common prefixes.

Figure 5: NFA for RegEx: (1) ab.*cd, (2) cefc, (3) cad, (4) efb.

The corresponding DFA (which we don’t show for readability) has 21 states. For the reasons explained above, the dot-star term in the first regular expression leads to a replication of the portion of DFA devoted to the second, third and fourth regular expressions. Clearly, such state replication would increase with the number of regular expressions contained in the data set. Moreover, the situation would worsen if the number of regular expressions containing dot-star conditions also increased.

We note that no state explosion would occur if the regular expressions containing dot-star conditions were compiled into separate DFAs; while this would avoid state explosion, it would trade space for bandwidth. In fact, memory bandwidth requirements increase linearly with the number of concurrent DFAs (each DFA makes one state transition for each character).

While reducing the number of states, an NFA representation can increase memory bandwidth requirements. Specifically, the non-determinism inherent in an NFA implies that many states may be active at once. Unlike a DFA, an NFA can make multiple state transitions when consuming a single input character.

The dynamic memory bandwidth needed by an NFA representation depends on the size of the active state set, that is, the set of states active in parallel. In fact, the number of active states determines the number of memory accesses required to make state transitions for each input character processed. In theory, processing a character in an NFA requires O(N_NFA) memory operations, where N_NFA is the total number of states. In practical cases, however, the active set size is much lower than N_NFA. Operationally, the number of active states tends to increase if any current state has several outgoing transitions on the given input character. Conversely, it tends to decrease if an active state has no transitions defined on the current input character. An important special case is represented by states having wildcard transitions back to themselves (e.g., states 0 and 2 in Figure 5); these states are stable: once they are visited, they will never leave the active set.

An example of traversal of the NFA in Figure 5 with input string “baacabcacefcde” is shown in Table 1. In this example, the states in the active set are never more than 5 out of 14.

Table 1: NFA traversal example. Stable states are represented in bold; accepting states are underlined.
Input:   b  a  a  c  a  b  c  a  c  e  f  c  d  e
Active:  0  0  0  0  0  0  0  0  0  0  0  0  0  0
         -  1  1  5  1  2  5  1  5  11 12 5  2  11
         -  -  -  -  9  -  2  9  2  6  7  8  4  2
         -  -  -  -  -  -  3  2  3  2  2  2  -  -
         -  -  -  -  -  -  -  -  -  -  -  3  -  -

Let us now consider the hybrid-FA for the given regular expressions (Figure 6). Subset construction is interrupted at state 2, which would cause state explosion to happen. As can be seen, the second, third and fourth regular expressions are completely matched within the DFA part. On the other hand, the first regular expression is matched within the DFA part only up to the second character. Beyond that, the matching operation is performed in an NFA. Note that the number of states in the hybrid-FA does not exceed that of the NFA. Finally, the matching operation involves one state traversal per character as long as the border state 2 is not traversed. In other words, as long as the prefix of the first regular expression “ab” is not matched, processing is restricted to the DFA portion.

Figure 6: Hybrid-FA for (1) ab.*cd, (2) cefc, (3) cad, (4) efb. Transitions to state 0 in the DFA part are omitted for readability. The DFA part is solid, the NFA part is dashed and the boundary state is red.

An example of hybrid-FA traversal with text string “baacabcacefcde” is shown in Table 2. State 0 is no longer a “stable” state, but state 2 is. As can be seen, the active set will contain only one state until the border state 2 is traversed. One and only one activation of the DFA is possible; conversely, the NFA can have several parallel activations. Note that the size of the active set is in general lower than what we have with the pure-NFA counterpart (with at most 3 versus 5 states).

Table 2: Hybrid-FA traversal example. Stable states are represented in bold; accepting states are underlined.
Input:   b  a  a  c  a  b  c  a  c  e  f  c  d  e
Active:  0  1  1  5  9  2  5  9  5  6  7  8  0  11
         -  -  -  -  -  -  2  2  2  2  2  2  2  2
         -  -  -  -  -  -  3  -  3  -  -  3  4  -

4.2 Regular expressions with counting constraints on wildcards
Figure 7 represents the hybrid-FA for a small dataset containing a regular expression with a wildcard repeated exactly 3 times. The reader can easily draw the corresponding NFA, also consisting of 19 states. The state-minimized DFA (not shown for readability) has 46 states. In this case subset construction is interrupted at the state which immediately precedes the counting constraint.

Figure 7: Hybrid-FA for RegEx: (1) ab.{3}cd, (2) cefc, (3) cad, (4) efb.

An example of traversal of the hybrid-FA with text string “baacababcefcde” is shown in Table 3. Notice that the hybrid-FA does not have any stable states. Again, the DFA is always active and there is a single activation of it during the whole matching operation. In contrast, the NFA part can have several parallel activations, one for each border state traversal. Note that, if this were not the case, the match reported on state 4 would not have been detected. Finally, the active set size is less than that of the NFA counterpart (whose maximum value is 6).

Table 3: Hybrid-FA traversal example. Stable states are represented in bold.
Input:   b  a  a  c  a  b  a  b  c  e  f  c  d  e
Active:  0  1  1  5  9  2  1  2  5  6  7  8  0  11
         -  -  -  -  -  -  14 15 14 15 16 3  4  -
         -  -  -  -  -  -  -  -  16 -  -  -  -  -

5. IMPROVING THE WORST CASE
As mentioned, the hybrid-FA consisting of a head-DFA and many tail-NFAs represents a compromise between a pure DFA and a pure NFA solution, and allows dealing with situations where a DFA is infeasible. In particular, this solution trades memory occupancy (number of states) with processing time/memory bandwidth requirements (size of the active set).

While the described automaton can provide satisfactory average case performance and improves the worst case as compared to a pure NFA, the worst-case bound can still be unacceptable. In fact, as has been pointed out:
• The head-DFA is always active in one and only one state;
• Each tail-NFA is activated each time the border state is reached. Moreover, every activation may involve several states.

Therefore, the theoretical worst case is represented by the number of NFA states present in the hybrid automaton plus one (the DFA active state).

In this section we explore two techniques to further reduce the worst-case bound: one suitable for dot-star conditions and the other applicable to counting constraints.

5.1 Tail-DFAs
The first obvious way to limit the worst-case active set size is to transform the tail-NFAs into tail-DFAs, as exemplified in Figure 8. In fact, this will ensure that, for every activation, each tail-automaton will be active only in one state.
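The NFA and hybrid-FA traversals of Section 4 can be replayed with a short Python sketch. It is an illustration under stated assumptions rather than the authors' implementation: the transition lists are read off Figures 5 and 7 and Tables 1 to 3, the helper names (make_nfa, step, nfa_sizes, hybrid_sizes) are hypothetical, and the head-DFA state is represented on the fly by the set of NFA states it stands for instead of being precompiled.

```python
ANY = None     # wildcard label
BORDER = 2     # NFA state at which subset construction is interrupted


def make_nfa(gap):
    """NFA for (1) ab<gap>cd, (2) cefc, (3) cad, (4) efb with merged prefixes.

    gap is ".*" (Figure 5) or an integer n standing for ".{n}" (Figure 7).
    """
    nfa = {
        0: [(ANY, 0), ("a", 1), ("c", 5), ("e", 11)],
        1: [("b", 2)],
        3: [("d", 4)],
        5: [("e", 6), ("a", 9)],
        6: [("f", 7)],
        7: [("c", 8)],
        9: [("d", 10)],
        11: [("f", 12)],
        12: [("b", 13)],
    }
    if gap == ".*":
        nfa[2] = [(ANY, 2), ("c", 3)]
    else:
        chain = [2] + list(range(14, 14 + gap))    # 2 -> 14 -> 15 -> ...
        for src, dst in zip(chain, chain[1:]):
            nfa[src] = [(ANY, dst)]
        nfa[chain[-1]] = [("c", 3)]
    return nfa


def step(nfa, states, ch):
    """All NFA states reachable from `states` on character ch."""
    return {dst for s in states for label, dst in nfa.get(s, [])
            if label is ANY or label == ch}


def nfa_sizes(nfa, text):
    """Active set size after each character in a pure NFA traversal."""
    active, sizes = {0}, []
    for ch in text:
        active = step(nfa, active, ch)
        sizes.append(len(active))
    return sizes


def hybrid_sizes(nfa, text):
    """Active set size in a hybrid traversal: one head state plus tail states."""
    head, tail, sizes = {0}, set(), []
    for ch in text:
        seeds = tail | ({BORDER} & head)        # border traversal activates the tail
        tail = step(nfa, seeds, ch)
        head = step(nfa, head - {BORDER}, ch)   # sub-state 2 is ignored by the head
        sizes.append(1 + len(tail))             # the head counts as a single DFA state
    return sizes


dotstar, counting = make_nfa(".*"), make_nfa(3)
print(max(nfa_sizes(dotstar, "baacabcacefcde")))
print(max(hybrid_sizes(dotstar, "baacabcacefcde")))
print(hybrid_sizes(counting, "baacababcefcde"))
```

On the input of Tables 1 and 2 the sketch reports a maximum active set of 5 states for the NFA and 3 for the hybrid-FA; on the input of Table 3 the per-character sizes peak at 3, when two tail activations overlap.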

While this technique can be applied to any hybrid-FA, it is effective only in the case of dot-star conditions. In the general case, the number of parallel activations of a tail-DFA depends on the number of times the border state is traversed. Even if, for a given DFA, it is possible to compute the minimum number of characters to be processed between two consecutive border-state traversals, this measure is DFA dependent and not likely to provide satisfactory bounds.

In the context of NIDS we are interested in determining the set of rules to be fired on a packet. Thus, it is enough to detect only one (the first) possible match of each regular expression. This will allow us to show that, for most dot-star conditions, a single tail-DFA activation is sufficient to traverse the automaton correctly and detect all matching expressions. This in turn allows us to limit the worst-case bound on memory bandwidth and processing time to the number of sub-DFAs the hybrid-FA is decomposed into.

In the remainder of this section we justify this claim. The reader interested only in the main results can skip to Section 5.2.

To demonstrate the above property, we distinguish pure wildcard repetitions from [^x]*-like conditions. Note that the following discussion extends directly to the more general case [^c1c2...ck]*.

Wildcard repetitions (.*): Let us consider a regular expression of the form sub_pattern1.*sub_pattern2. This means, "try to match sub_pattern2 if and only if sub_pattern1 did previously occur in the text string". Operationally, the head-DFA will recognize .*sub_pattern1 and the tail-DFA will match .*sub_pattern2. The activation of the tail-DFA will occur upon border-state traversal. This, in turn, will happen once sub_pattern1 is matched. Since, upon matching of sub_pattern1, we are interested only in the first occurrence of sub_pattern2, we may ignore any subsequent activation of the tail-DFA. Also note that, since the tail-DFA represents a regular expression starting with ".*", it will not contain any "dead-states" (that is, any stable state which, once reached, prevents any progress).

[^x]*-like conditions: Let us consider a regular expression of the form sub_pattern1[^x]*sub_pattern2. This means, "try to match sub_pattern2 if and only if sub_pattern1 did previously occur in the text string and the two sub-patterns are not separated by character x". Again, the head-DFA will recognize .*sub_pattern1 whereas the tail-DFA will match [^x]*sub_pattern2.

In this case, the tail-DFA will have a dead-state which can be reached on character x from some tail-DFA states. We can safely assume that reaching the dead-state is equivalent to deactivating the tail-DFA.

There are two possibilities: x may or may not appear in sub_pattern2. Let us consider those two cases separately.

• x ∉ sub_pattern2: All the states in the tail-DFA will have a transition to the dead-state on character x. This situation is exemplified in Figure 9, where the NFA and the DFA corresponding to [^x]*abc are represented. Let us assume that the border state is reached while the tail-DFA is active. There are two sub-cases:
  o x is the last character processed (e.g.: ax[^x]*abc). In this case the former activation of the tail-DFA will die, and the new activation will be the only one in place.
  o x is not the last character processed (e.g.: ad[^x]*abc). Since we are interested in the first match of sub_pattern2, we can safely ignore the second activation. Notice that, by doing so, we do not risk missing matches. In fact, let us assume that an occurrence of x followed, which would deactivate the tail-DFA. Since such an occurrence would also follow the potential second activation, it would invalidate it as well. Therefore, ignoring the second activation is, in this case, safe.

• x ∈ sub_pattern2: In this case some tail-DFA states will have a transition to the dead state on character x, but some will not. Therefore, depending on the current state, an occurrence of character x can cause either a deactivation of the tail-DFA or progress in the match of sub_pattern2. This fact is exemplified in Figure 10, where the NFA and the DFA corresponding to [^x]*axb are represented. Note that all states starting from 3 have mismatching transitions leading to state Ø. In this situation, it is in general not true that a single activation of the tail-DFA is always sufficient to preserve correct operation. If the border state is traversed when the tail-DFA is active, discarding one of the two activations is unsafe. In fact, the next transition could invalidate the first one while keeping the second alive. One simple example is given by regular expression ax[^x]*axb and string axaxaxb.

From the above discussion it should be clear that, in the case of [^x]*-like conditions, keeping only one activation of the tail-DFA is guaranteed to preserve correctness only if the sub-expression following the repetition does not contain the characters excluded from the repetition itself. However, there are a few exceptions to this general rule which represent common cases in the Snort rule-set.

.* masking: Let us consider rules where the part of the regular expression following the first [^x]* condition is a

complex sub-pattern containing a ".*" repetition. In other words, let us consider regular expressions of the type sub_pattern1[^x]*sub_pattern2.*sub_pattern3. The ".*" condition will "mask" all occurrences of x in sub_pattern3. Therefore, if x does not occur in sub_pattern2, then keeping at most a single activation of the tail-DFA will preserve correct operation.

Overlapping tail-DFA activations: A second case which occurs frequently in Snort rule-sets can be described as follows: sub_pattern1 is a simple string, and tracing it from any state of the tail-DFA always leads to its entry state. In this case, one can ensure that a new activation of the tail-DFA will take place either if such DFA is inactive, or if it finds itself in the entry state. Therefore, two consecutive activations will always overlap.

The argument above, which refers to regular expressions in isolation, can be easily extended to groups of regular expressions sharing a common prefix (at least up to and including the dot-star repetition).

Figure 8: Hybrid-FA exemplification. Panels: (a) head-DFA with tail-NFA 1 to 3; (b) head-DFA with tail-DFA 1 to 3.

Figure 9: NFA (1) and DFA (2) for regular expression [^x]*abc. The state numeration in the DFA reflects subset construction. State Ø is the dead state. Missing transitions in DFA lead to state 1.

Figure 10: NFA (1) and DFA (2) for regular expression [^x]*axb. The state numeration in the DFA reflects subset construction. State Ø is the dead state. All the transitions are represented.

5.2 Counter mechanism
Even if applicable, tail-DFAs would not be effective in addressing counting constraints. In fact, for correct operation, a new activation of the tail-DFA is required each time the border state is traversed. To gain an intuition about this fact one can consider the simple regular expression ab.{3}cd, whose head- and tail-DFAs are represented in Figure 11, and the text string ababxyzcd. Ignoring the second tail-DFA activation would in this example lead to missing the match on the last character.

Since, in the worst case, the tail-automaton can be activated every cycle, the bound does not improve with a DFA solution. We will therefore devise a mechanism to limit the number of state traversals starting from a tail-NFA. For an exhaustive discussion of a general methodology to handle this case we refer the reader to our technical report [20].

Figure 11: head-DFA (1) and tail-DFA (2) for regular expression ab.{3}cd. The missing transitions in the head-DFA are to state 1. The state numbering is according to subset construction.

Figure 12: NFA corresponding to regular expression .{n}suffix. The chain from state b+1 to state b+n consists of n states.

Let us first consider counting constraints of the form ".{n}", where the wildcard is repeated exactly n times. Figure 12 shows the NFA for the generic .{n}suffix regular expression. As can be seen, the NFA consists of n-1 similar states (from b+1 to b+n-1), each having all outgoing transitions directed towards the next state of the chain. Those states simply operate as a counter. The last state of the sequence, b+n, is the first one whose outgoing transitions represent progress information within the suffix.

The same information could be simply stored through an auto-decrementing counter and a pointer to state b+n. The counter can be activated and set to n when the border state is reached. At each character processed, the counter gets auto-decremented. Only when the counter reaches zero is the state associated to the corresponding pointer accessed.

The worst case is characterized by n active counter instances plus the size of the suffix-NFA. However, the counters can be kept in on-chip memory, and do not involve real state traversals. Moreover, as we point out in [20], with a proper representation the update and query of at most two counter instances suffice for correct operation.

The [^c1c2...ck]{n} condition can be treated in a similar way; in this case the counter should be associated with the set of characters c1c2...ck which would cause its de-allocation.

A special case which is very common in practice is the one where the counting constraint is located at the end of the regular expression. In this situation, a single counter instance always suffices, independent of the number of times the border state is traversed. In fact, in the case of wildcard repetitions, the occurrence of the first n wildcards will determine a match. In the case of [^c1c2...ck]{n}-like counting constraints, an occurrence of an invalidating character ci within n characters from the oldest tail-NFA activation would also be within n characters from any newer parallel one. Therefore, it is in this case safe to ignore subsequent activations of the tail-NFA, thus keeping at most one active counter.

Counting constraints of the form .{n,} and [^c1c2...ck]{n,}, where at least n occurrences of the wildcard/character range are of interest, can be treated as a direct generalization of the above. In the NFA of Figure 12, this would correspond to adding an auto-loop to state b+n on the same character range as in the repetition. Thus: i) the counter mechanism can also be applied to states from b+1 to b+n-1; ii) additionally, the suffix (of which state b+n is the entry state) can be converted to a DFA. Again, in the case of wildcard repetitions, or if the invalidating characters c1c2...ck do not appear in the suffix, a single activation of the suffix-DFA always guarantees proper operation.

Finally, cases .{n,m} and [^c1c2...ck]{n,m} can be treated as follows. Upon traversal of the border state b, the auto-decrementing counter is set to m, and a fixed value m-n is associated to it. Once again, the counter will be dropped once it reaches zero (or upon occurrence of any invalidating character ci). However, state b+m is accessed for every value of the counter less than or equal to m-n. The considerations above about the worst case apply to this situation as well.

In conclusion, if the data-set contains N_T counting constraints located at the end of the corresponding regular expression and N_NT counting constraints in intermediate positions, then the worst case bound on memory bandwidth is reduced to N_T + 2N_NT + 1 ("one" representing the head-DFA activation) memory accesses per character processed.

6. MEMORY LAYOUT
One important point to address in implementing the proposed scheme is how to lay out the data structure representing the above automaton so as to limit the memory requirement and allow an efficient state traversal.

As far as the DFA part (head-DFA and possible tail-DFAs) is concerned, any compression technique proposed in the literature [15][17][18][19] can be reused.

Let us now address the encoding of the NFA portion of the automaton. Content addressing, a technique proposed in [16] in the context of DFAs, can be adapted to NFAs. The goal is to limit the number of memory accesses when processing a state without any transitions defined on the current input character. Specifically, one can observe that most NFA states have transitions defined either on a very small set of characters, or on all but one or two characters. Thus, by encoding in the state identifier the information about the set of symbols a transition is (or is not) defined on, it is possible to limit the number of memory accesses below the active set size. For details the interested reader can refer to [15].

Finally, border and counter states should be treated in a special way: the former imply the need for pointers from the DFA to the corresponding NFA entries, whereas for the latter the information listed in Section 5.2 must be stored.

7. EXPERIMENTAL RESULTS
In this section we validate the proposal on rule-sets from the Snort IDS.

7.1 Rule-sets
The rule-sets considered have been taken from the Snort IDS [7]. Specifically, since some Snort rules only use exact-match strings, in this paper we only consider those having a Perl Compatible Regular Expression (PCRE) in their firing condition.

Table 4: Summary of Snort rule-sets

                        Header                                                                   Characteristics
Rule-set  Nr. of rules  Protocol  Source IP       Src. Port  Destination IP  Destination Port    .* and [^x]*  .{n,m}
Group1    329           Tcp       $HOME_NET       any        $EXTERNAL_NET   $HTTP_PORTS/any     283           -
Group2    40            Tcp       $HOME_NET       any        $EXTERNAL_NET   25/any              24            -
Group3    18            Tcp       $EXTERNAL_NET   any        $HOME_NET       7777:7778/any       5             10
Group4    45            Tcp       $EXTERNAL_NET   any        $HOME_NET       143/any             24            19
Group5    20            Tcp       $EXTERNAL_NET   any        $HOME_NET       119/any             6             11
Group6    24            Tcp       $EXTERNAL_NET   any        $HOME_NET       110/any             7             12

As mentioned before, the rules under consideration do not exhibit the whole expressive power of regular expressions. Rather, they can normally be decomposed into sequences of simple sub-patterns separated by dot-star conditions (either in the pure .* or in the [^c1c2...ck]* form) and counting constraints on wildcards and character ranges.

Character ranges (and their repetition) are very common within sub-patterns. Specifically, they appear either in the form [c1-ck], or as special escape sequences: \s (all space characters), \S (all but space characters), \d (digits), \D (all but digits), \w (alphanumeric characters) and \W (all but alphanumeric characters).

Nested repetitions and disjunctions of complex sub-patterns (e.g.: patterns containing dot-star conditions or wildcard repetitions) have not been observed in the rule-sets. We expect that these more general types of patterns will be the subject of important future work, but we do not consider them here since they are not found in our rule-sets.

Of 982 distinct regular expressions: 25% contain long counting constraints, generally located at the end of the regular expressions, 11.4% contain .* conditions and 54.89% [^c1c2...ck]* conditions.
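Since more than half of the expressions contain [^c1c2...ck]* conditions, the single-activation argument of Section 5.1 is worth checking on a toy case. The following Python sketch is an illustration under stated assumptions, not the implementation used in the experiments: it handles only expressions of the form sub_pattern1[^x]*sub_pattern2 with literal, non-empty sub-patterns, it models the tail automaton as the active set of an NFA rather than as a compiled DFA, and the names tail_step and first_match are hypothetical.

```python
def tail_step(states, ch, excluded, sub2):
    """Advance the active set of the NFA for [^excluded]*sub2 on one character.

    State 0 is the entry state (self-loop of the repetition); state i means
    that the first i characters of sub2 have been matched.
    """
    nxt = set()
    for s in states:
        if s == 0 and ch not in excluded:
            nxt.add(0)                    # stay inside the [^x]* repetition
        if s < len(sub2) and sub2[s] == ch:
            nxt.add(s + 1)                # progress within sub_pattern2
    return nxt                            # an empty set plays the role of the dead state


def first_match(text, sub1, excluded, sub2, single_activation):
    """Return the end offset of the first match of sub1[^excluded]*sub2, or None."""
    tails = []                            # one active set per live tail activation
    for i, ch in enumerate(text):
        tails = [tail_step(t, ch, excluded, sub2) for t in tails]
        if any(len(sub2) in t for t in tails):
            return i + 1
        tails = [t for t in tails if t]   # drop activations that reached the dead state
        # Border state: the head has just recognized .*sub1
        if text.endswith(sub1, 0, i + 1):
            if not (single_activation and tails):
                tails.append({0})
    return None


if __name__ == "__main__":
    # x does not occur in sub_pattern2: one activation is enough.
    print(first_match("adadabc", "ad", "x", "abc", single_activation=True))    # 7
    # x occurs in sub_pattern2: discarding the second activation loses the match.
    print(first_match("axaxaxb", "ax", "x", "axb", single_activation=False))   # 7
    print(first_match("axaxaxb", "ax", "x", "axb", single_activation=True))    # None
```

The last two calls reproduce the example ax[^x]*axb on string axaxaxb: when the border state is traversed for the second time the first activation is still alive, so the single-activation policy ignores the new one, and the first activation then dies on the next character.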
A large part of Snort rules start with character "^", which normally forces the match operation only at the beginning of the text string (i.e., of the packet payload). This could theoretically decrease the complexity of the corresponding DFA, and avoid state explosion when regular expressions with counting constraints are compiled in isolation. Unfortunately, nearly all Snort PCREs use the "m" modifier. Combined with symbol "^" at the beginning of the regular expression, this forces the match operation not only at the beginning of the text string, but also at the beginning of each line. In other words, the m modifier acts on regular expression ^pattern transforming it into ^pattern|([\n\r]pattern). This, in turn, keeps the complexity of the resulting DFA high.

The Snort IDS performs packet payload inspection only after header filtering (i.e. packet classification). Therefore, we clustered rules with a common header and performed experiments on some of the largest groups. A summary of the derived rule-sets is presented in Table 4.

7.2 Memory storage requirement
In this section, we study the memory storage requirement of the different rule-sets by generating the corresponding automata. As can be observed in the second and in the last two columns of Table 4, the rule-sets differ in the number of regular expressions, dot-star conditions and counting constraints they include.

Rule-sets group1 and group2 do not contain counting constraints, whereas group3-group6 do. The counting constraints encountered consist of 20 to 1024 repetitions of large character ranges; however, they are always located at the end of the corresponding regular expressions (and can therefore take full advantage of the counter mechanism). Moreover, the [^x]*-like conditions present in the rule-sets make it possible to build tail-DFAs which can be safely traversed with at most one activation.

Table 5 summarizes the sizes of the NFA, DFA and hybrid-FA corresponding to the given rule-sets. Consider the following observations.

First, a DFA solution is never feasible in the case of rule-sets containing counting constraints. In fact, because of the high number of repetitions, exponential state explosion is observed even if DFAs are generated in isolation for each of the regular expressions. Therefore, in those cases, N-A (not applicable) is indicated in the table (experimentally, subset construction was aborted after generation of 10 million states).

For rule-sets group1 and group2, a DFA solution is possible but the regular expressions must be distributed across multiple DFAs in order to avoid state blow-up. Rule partitioning is performed according to heuristics, as follows. First, rules containing a dot-star condition and sharing the prefix to it are compiled together. Second, rules containing multiple dot-star conditions are compiled in isolation or in combination with just a few other rules. In effect, as explained in Section 3, each dot-star condition tends to generate a replication of the DFA being merged with the current regular expression.

When creating the hybrid-FA, tail-DFAs and the counter mechanism have been used in order to limit the worst case bound. Because of the varying complexity of the rule-sets, dot-star conditions have been treated in different ways: some of them have been expanded through subset construction, and some have been made border-states. Specifically, the goal was to keep the head-DFA below 50,000 states. This threshold was selected as a good head-DFA target size because proposed DFA compression techniques [15][17][19] can encode those states in around 2MB, a size that can be realized in on-chip memory in an ASIC or microprocessor. We can create a head-DFA of any specific number of states by expanding the head-DFA in a greedy fashion until the target size has been reached; thereafter, all dot-star conditions become border-states and lead to tail-FAs. As a result, all dot-star conditions in group1 have been moved to tail-DFAs, the ones in groups 3-6 have been expanded in the head-DFA, and a mixed solution has been adopted for group2.

In the case of rule-set group2, two different DFA groupings have been tested: the first consisting of the same number of DFAs as the hybrid-FA, and the second consisting of one less DFA. As one could expect, in the first scenario the overall number of states in the two automata is similar, whereas in the second one the pure DFA solution pays for the better worst case performance bound with a higher memory occupancy (specifically, it requires 50% more states).

In the case of rule-set group1, the pure-DFA and the hybrid-FA solutions have comparable size. But, as will be pointed out in the next section, the hybrid automaton is preferable in terms of average case memory bandwidth.

For rule-sets group3-6, the DFA cannot be constructed at all due to exponential state blow-up, while the hybrid-FA solution has an easily realizable size. Also, it is worth noticing the reduction in the number of states when moving from an NFA to a hybrid-FA, which is due to removing the long chains of counting states.

Table 5: Automata sizes for corresponding rule-sets.

          NFA        DFA                     Hybrid-FA
Rule-set  # states   # DFAs   Total states   # tail-FAs   head-DFA states   Total tail states
Group1    15679      31       71234          30           40461             30321
Group2    1036       3        22651          2            20724             1905
                     2        31521
Group3    8871       N-A      N-A            10           514               -
Group4    3119       N-A      N-A            19           2560              -
Group5    5205       N-A      N-A            11           2485              -
Group6    1952       N-A      N-A            12           4878              -
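The chains of counting states removed from the NFA are replaced by the auto-decrementing counters of Section 5.2. As a rough illustration, the Python sketch below matches prefix.{n}suffix with literal prefix and suffix: a counter is set to n at each border-state traversal, decremented on every character, and the suffix is entered only when the counter reaches zero. The function name match_counting and the single flag are hypothetical, the suffix is matched as a plain string instead of a compiled automaton, n is assumed to be at least 1, and no attempt is made to reproduce the two-counter representation of [20].

```python
def match_counting(text, prefix, n, suffix, single=False):
    """Return the end offset of the first match of prefix.{n}suffix, or None.

    With single=True, a border-state traversal is ignored while a counter
    is still pending, so at most one counter instance is kept.
    """
    counters = []   # remaining wildcard characters for each pending activation
    active = []     # progress (index into suffix) of each running suffix match
    for i, ch in enumerate(text):
        # Advance the suffix matches started earlier; mismatches are dropped.
        active = [p + 1 for p in active if suffix[p] == ch]
        # Auto-decrement: the current character is consumed as a wildcard.
        counters = [c - 1 for c in counters]
        if 0 in counters:
            active.append(0)              # counter expired: enter state b+n
        counters = [c for c in counters if c > 0]
        if len(suffix) in active:
            return i + 1
        # Border state: the head has just recognized .*prefix
        if text.endswith(prefix, 0, i + 1) and not (single and counters):
            counters.append(n)
    return None


if __name__ == "__main__":
    # ab.{3}cd on ababxyzcd: the second activation is the one that matches.
    print(match_counting("ababxyzcd", "ab", 3, "cd"))                # 9
    print(match_counting("ababxyzcd", "ab", 3, "cd", single=True))   # None
    # Counting constraint at the end of the expression: one counter suffices.
    print(match_counting("ababxyz", "ab", 3, "", single=True))       # 5
```

The first two calls replay the ab.{3}cd example of Section 5.2. The third call, with an empty suffix, corresponds to the case found in rule-sets group3-6, where the constraint closes the expression and a single counter instance gives the same first match.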

In terms of absolute memory occupancy, the use of default transitions [15][19] and of content addressing [17] to encode the hybrid-FA leads to storage requirements varying from 21KB (group3) up to 3MB (group1). In fact, the former technique allows eliminating around 98-99% of the DFA transitions, while the latter implies the use of 64-bit wide state identifiers. Notice that this range makes it possible to accommodate the automaton data structures in on-chip memory [26].

7.3 Memory bandwidth requirement
The memory bandwidth requirement can be expressed in terms of the number of memory operations to be performed for each input character processed. In this section, we compare the different automata in both worst case and average case behavior.

7.3.1 Worst case behavior
The worst-case memory bandwidth requirement can be seen in Table 5. In the case of NFAs, the worst-case bandwidth corresponds to the number of states, which is reported in the second column of the table. For example, 1036 states correspond to 1036 concurrent memory operations to implement state transitions for each input character processed. This bound, even if rarely achieved, is clearly unacceptable.

DFA solutions have a worst case bound corresponding to the number of DFAs needed to represent the regular expressions (i.e.: the number of groups the rule-set is decomposed into). This value (column 3) is attractive when those solutions are feasible, that is, when the regular expressions do not contain large counting constraints.

In the case of hybrid-FAs, the worst case bound is equal to 1 plus the number of tail-DFAs (each tail-DFA being a simple counter for rule-sets group3-6). For data-sets group1-2, this coincides with the worst case bound of the DFA counterpart. In the case of counting constraints, this value is far less than that of the NFA solution, and depends only on the number of regular expressions (as opposed to the number of states).

7.3.2 Average case behavior
To evaluate the average case memory bandwidth, we compare the behavior of the different solutions on real traffic. To this end, we perform simulations using twelve packet traces downloaded from [25], of size varying from about 17MB to about 264MB.

Table 6: Active vector sizes for Snort rule-sets

          NFA            Hybrid-FA
Rule-set  Avg    Max     Avg     Max   Worst case
Group1    1.15   34      1.009   5     32
Group2    1.06   13      1.001   2     3
Group3    1.04   4       1.002   2     11
Group4    2.45   12      1.001   2     20
Group5    1.04   5       1.001   2     12
Group6    2.99   6       1.088   2     13

Table 6 reports statistics about the size of the active vector across the different data-sets. The average values have been derived by first computing, for each trace, the weighted average of the active vector size across the simulation interval. Then, the values obtained for different traces on the same rule-set have been averaged again. For a given rule-set, we have not observed a substantial variance across the traces. The maximum value displayed is, in all cases, the maximum active vector size achieved for a particular rule-set across all traces and all simulations.

As can be seen, the average behavior of the NFA solution is far better than what the worst case would indicate. This is due to the fact that only a few rules are matched, and dead branches in the NFA are often taken.

The hybrid-FA outperforms the NFA both in terms of average behavior and maximum active vector size. In fact, the automaton traversal remains for the most part within the head-DFA. Note that a value of 1 could be achieved only if it were possible to compile all the regular expressions into a single DFA.

Finally, since the average case behavior of a DFA solution is the same as its worst case, the hybrid-FA also outperforms the DFA solution with regular expression grouping adopted for the group1 and group2 data-sets.

8. RELATED WORK
Regular expression matching at line rate has been recognized as an important problem, and has been considered in related work. The prior work in this area focuses on two distinct directions: FPGA based implementations [22][23][24] and general-purpose or software oriented approaches [6][15][17][18][19]. Our work falls into the second category, although one could offload the tail-automaton operation to an FPGA.

As mentioned, memory compression techniques allowing an efficient representation of generic DFAs have been presented in [15][17][18][19]. However, such proposals assume that the DFA is given a priori. By contrast, in this work we address the case where a DFA is either practically unfeasible or not a suitable representation of the regular expressions of interest.

Our work has a practical character in that it does not address generic regular expressions, but particular subclasses which are common in broadly used NIDS [6][7][8]. To this end, our work has commonalities with the one presented in [15], which proposes rewriting rules to simplify the DFA in the case of common patterns. However, our focus is different in that we concentrate on the automaton rather than on modifying the input patterns.

It is worthwhile to compare and contrast the hybrid-FAs presented in this work and lazy-DFAs [12]. The two proposals share the common idea of partially performing subset construction on an NFA. However, lazy-DFAs assume that subset construction is done dynamically depending on the input string (that is, on the incoming packets' payload). Specifically, the NFA paths covered by the input string are dynamically converted to DFA. While this may be helpful in the average case, it does not address the worst case, which is of primary interest in the context of NIDS. Moreover, this solution is suitable only for a software implementation. Therefore, we assume that the partial subset construction is done statically a priori so as to prevent state explosion from happening. Moreover, we introduce refinements to further bound the worst case.

9. CONCLUDING REMARKS
Regular expression matching is an important task in modern NIDS. Recent proposals have in their experimental evaluations drawn selectively from regular expression rule-sets to avoid troublesome rules. For example, fully 25% of the regular expressions in the current Snort rule-set include counting constraints for which no DFA can be constructed using a reasonable amount of memory, such as that normally found in a workstation or PC. In all prior work we have seen, these rules have been excluded from discussion or evaluation, presumably for this reason.

The primary contribution of this work is the hybrid-FA, which is, to our knowledge, the first automaton that is capable of evaluating all the regular-expression types found in common NIDS systems such as Snort and can be implemented efficiently in practical high-speed systems. The key characteristics of a hybrid-FA are: a modest memory storage requirement comparable to that of an NFA solution, an average case memory bandwidth requirement similar to that of a single DFA solution (although the DFA would be unfeasibly large), and a worst case memory bandwidth linear in the number of regular expressions containing counting constraints and dot-star conditions (and, notably, independent of the number of states in the automaton).

10. ACKNOWLEDGEMENTS
This work has been supported by National Science Foundation grants CCF-0430012 and CCF-0427794, and by gifts from Intel and Cisco Systems.

REFERENCES
[1] A. V. Aho and M. J. Corasick, "Efficient String Matching: An Aid to Bibliographic Search," Communications of the ACM, pp. 333–340, 1975.
[2] B. Commentz-Walter, "A string matching algorithm fast on the average," in ICALP, July 1979.
[3] S. Wu and U. Manber, "A fast algorithm for multi-pattern searching," Tech. Report TR-94-17, Univ of Arizona, 1994.
[4] J. E. Hopcroft and J. D. Ullman, "Introduction to Automata Theory, Languages, and Computation," Addison Wesley, 1979.
[5] J. Hopcroft, "An nlogn algorithm for minimizing states in a finite automaton," in Theory of Machines and Computation, J. Kohavi, Ed. New York: Academic, 1971, pp. 189-196.
[6] M. Roesch, "Snort: Lightweight Intrusion Detection for Networks," in System Administration Conf., 1999.
[7] Snort: http://www.Snort.org/
[8] Cisco Systems. Cisco ASA 5505 Adaptive Security Appliance. http://www.cisco.com. 2007.
[9] Citrix Systems. Citrix Application Firewall. http://www.citrix.com. 2007.
[10] Bro: http://bro-ids.org/
[11] Vern Paxson et al., "Flex: A fast scanner generator," http://www.gnu.org/software/flex/
[12] R. Sommer and V. Paxson, "Enhancing byte-level network intrusion detection signatures with context," in CCS 2003.
[13] N. Tuck, T. Sherwood, B. Calder, and G. Varghese, "Deterministic memory-efficient string matching algorithms for intrusion detection," in Infocom 2004.
[14] L. Tan and T. Sherwood, "A High Throughput String Matching Architecture for Intrusion Detection and Prevention," ISCA 2005.
[15] F. Yu, Z. Chen, Y. Diao, T. V. Lakshman, and R. H. Katz, "Fast and Memory-Efficient Regular Expression Matching for Deep Packet Inspection," in ANCS 2006.
[16] S. Kumar et al., "Algorithms to Accelerate Multiple Regular Expressions Matching for Deep Packet Inspection," in ACM SIGCOMM, Sept 2006.
[17] S. Kumar et al., "Advanced Algorithms for Fast and Scalable Deep Packet Inspection," in ANCS 2006.
[18] M. Becchi and S. Cadambi, "Memory-Efficient Regular Expression Search Using State Merging," in INFOCOM 2007.
[19] M. Becchi and P. Crowley, "An Improved Algorithm to Accelerate Regular Expression Evaluation," in ANCS 2007.
[20] M. Becchi and P. Crowley, "Addressing complex regular expressions through counting automata," Washington University Tech. Report, July 2007.
[21] R. W. Floyd and J. D. Ullman, "The Compilation of Regular Expressions into Integrated Circuits," Journal of ACM, vol. 29, no. 3, pp. 603-622, July 1982.
[22] R. Sidhu and V. K. Prasanna, "Fast Regular Expression Matching using FPGAs," in FCCM 2001.
[23] C. R. Clark and D. E. Schimmel, "Efficient reconfigurable logic circuit for matching complex network intrusion detection patterns," in FLP 2003.
[24] J. Moscola et al., "Implementation of a content-scanning module for an internet firewall," in FCCM, USA, April 2003.
[25] Internet traffic traces: http://cctf.shmoo.com/
[26] Cu-11 standard cell/gate array ASIC, IBM. www.ibm.com