A Hybrid Finite Automaton for Practical Deep Packet Inspection

Michela Becchi
Washington University
Computer Science and Engineering
St. Louis, MO 63130-4899
+1-314-935-4306
[email protected]

Patrick Crowley
Washington University
Computer Science and Engineering
St. Louis, MO 63130-4899
+1-314-935-9186
[email protected]

ABSTRACT
Deterministic finite automata (DFAs) are widely used to perform regular expression matching in linear time. Several techniques have been proposed to compress DFAs in order to reduce memory requirements. Unfortunately, many real-world IDS regular expressions include complex terms that result in an exponential increase in the number of DFA states. Since all recent proposals use an initial DFA as a starting point, they cannot be used as comprehensive regular expression representations in an IDS.

In this work we propose a hybrid automaton which addresses this issue by combining the benefits of deterministic and non-deterministic finite automata. We test our proposal on Snort rule-sets and we validate it on real traffic traces. Finally, we address and analyze the worst-case behavior of our scheme and compare it to traditional ones.

Categories and Subject Descriptors
C.2.0 [Computer Communication Networks]: General – Security and protection (e.g., firewalls)

General Terms
Algorithms, Performance, Design, Security.

Keywords
Deep packet inspection, DFA, NFA, regular expressions.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
CoNEXT 2007, December 10-13, 2007, New York, NY, U.S.A.
(c) 2007 ACM 978-1-59593-770-4 07/0012 $5.00.

1. INTRODUCTION
Increasingly, network packets are classified not only by the fields of their headers, but also by the content of their payloads. In particular, signature-based deep packet inspection has taken root as a dominant security mechanism in networking devices and computer systems. Most popular software tools (including Snort [6][7] and Bro [10]) and devices (including the Cisco family of Security Appliances [8] and the Citrix Application Firewall [9]) use regular expressions to describe payload patterns. While more expressive than simple patterns of exact-match strings, and therefore able to describe a wider variety of payload signatures [12], regular expression implementations demand far greater memory space and bandwidth. As a result of these trends, there has been a considerable amount of recent work on implementing regular expressions for use in high-speed networking applications, particularly with representations based on deterministic finite automata (DFAs).

DFAs have attractive properties that explain the attention they have received. Foremost, they have a predictable memory bandwidth requirement. In fact, processing an input string involves one DFA state traversal per character, which translates into a deterministic number of memory accesses. Moreover, it has long been established that, for any given regular expression, a DFA with a minimum number of states can be determined [4][5]. Even so, DFAs corresponding to large sets of regular expressions, each one representing a different rule, can be prohibitively large.

Recent work has tackled this problem in two ways. First, since an explosion in states can occur when many rules are grouped together into a single DFA, Yu et al. [15] have proposed segregating rules into multiple groups and evaluating the corresponding DFAs concurrently. This solution decreases memory space requirements, sometimes dramatically, but increases memory bandwidth linearly with the number of active DFAs. The second approach, proposed by Kumar et al. [15], aims at reducing the memory space requirement of any given DFA and is based on two observations. First, the memory space required to store a DFA strictly depends on the number of transitions between states. Second, many states in DFAs have identical sets of outgoing transitions. Space savings in excess of 90% are achievable in current rule-sets when this redundancy is exploited. The proposed compression technique trades off memory storage requirements against processing time.

Unfortunately, DFAs are infeasible for regular expressions found in the most frequently used rule-sets. Specifically, when repeated wildcards are present in a regular expression, it may be impossible to build a DFA with a reasonable number of states. For example, the regular expression “prefix.{100}suffix”, which matches if and only if “prefix” is separated from “suffix” by 100 characters, would require well over 1 million states to be represented in a DFA. Since, as we will see, such constructs occur frequently within popular security rule-sets, DFA-based approaches, including the recent work described above, are infeasible as comprehensive solutions.

As an alternative, one could consider using a solution based on non-deterministic finite automata (NFAs) [22]. The number of NFA states required to represent a regular expression is on the order of the number of characters present in the regular expression itself. As an example, the regular expression above would require just 101 + (# chars in prefix) + (# chars in suffix) NFA states. Therefore, an NFA-based representation would alleviate the memory storage problem. However, an NFA may lead to a variable, and potentially large, memory bandwidth requirement. In fact, multiple NFA states can be active in parallel and each input character can trigger multiple state transitions, and therefore require multiple parallel memory operations. In the worst case, all NFA states can be active concurrently, requiring a prohibitive amount of memory bandwidth.

In this paper we propose a hybrid DFA-NFA finite automaton (Hybrid-FA), a solution bringing together the strengths of both DFAs and NFAs. When constructing a hybrid-FA, any nodes that would contribute to state explosion retain an NFA encoding, while the rest are transformed into DFA nodes. The result is a data structure with size nearly that of an NFA, but with the predictable and small memory bandwidth requirements of a DFA.

We evaluate the hybrid-FA structure by comparing it to both DFA and NFA representations on rule-sets from the popular security package Snort. The primary contribution of the hybrid-FA is that entirely new classes of regular expressions can be implemented in fast networking contexts.

The remainder of this paper is organized as follows. Additional background and motivation are presented in Section 2. Section 3 describes the conditions of DFA state explosion in greater detail. The hybrid-FA structure is introduced in Section 4. Extensions to establish correctness and worst-case bounds are presented in Section 5. Section 6 provides a brief discussion on implementation issues and alternatives. Experimental results are found in Section 7. Further discussion of related work is found in Section 8. The paper concludes with discussion in Section 9.

2. BACKGROUND AND MOTIVATION
Several techniques for minimizing the memory requirements of DFAs representing sets of regular expressions have been recently proposed [15][17][18]. These proposals have some common properties; specifically:

a) They are based on the assumption that a DFA can be computed and is given as input.

b) They exploit the observation that DFAs corresponding to rule-sets derived from commonly used Network Intrusion Detection Systems (NIDS) have significant state transition redundancy.

The second aspect (i.e., the presence of significant transition redundancy) can be easily explained as follows. Regular expressions used within NIDS typically consist of sets of patterns containing: simple strings, character ranges, wildcards, indefinitely repeating sub-patterns, and sub-patterns repeated a discrete number of times. Notably, nested repetitions, which cause loop-backs in the corresponding NFAs and DFAs, are not found in practice. One can think about compiling together the set of regular expressions corresponding to several rules by first building an NFA which represents the disjunction of the NFAs of the single regular expressions, and then converting it to a DFA through the well-known subset construction procedure [4]. Such NFAs will typically have a tree-like structure (with the exception, as we will see, of a few loops and backward transitions), where the root corresponds to the starting state and the leaves to the accepting states. If common prefixes are collapsed, the root and the nodes at the first levels will have several outgoing transitions, whereas, moving towards the leaves, the tree will tend to become “skinny”, i.e., to consist of long chains of nodes. Moreover, except for the first levels of the NFA and for the transitions representing wildcards and large character ranges, most nodes will have outgoing transitions defined only for a few characters. When building the corresponding DFA, missing transitions on the NFA typically translate into backward transitions to the nodes at the first levels of the hierarchy (or intermediate nodes when the represented regular expression contains dot-star conditions or repetitions of wide character ranges). Thus, a restricted number of nodes in the final DFA tend to be the target of most transitions.

[Figure 1: DFAs representing the following RegEx: abcd (1) and ab.*cd (2). Transitions to state 0 are omitted.]

Taking advantage of this redundancy enables very effective memory compression techniques for a given DFA, but does not address a major problem. Namely that, due to state explosion during NFA-to-DFA transformation, DFAs [...] states. However, there are common conditions which can bring the number of DFA states close to the theoretical upper bound. An analysis of those conditions within DFAs representing single regular expressions is presented in [15].
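The gap between NFA and DFA sizes for patterns with a counted wildcard run, such as prefix.{100}suffix, can be illustrated with a minimal sketch of the subset construction. This is not the authors' implementation. It assumes an unanchored match (the NFA start state loops on every character), an accepting state with no outgoing transitions, and a single placeholder symbol for all characters that do not appear in the pattern. The names build_pattern, nfa_state_count, dfa_state_count, WILDCARD and OTHER are hypothetical.

```python
from collections import deque

WILDCARD = None      # pattern element matching any character (assumption)
OTHER = object()     # stands for every character not named in the pattern


def build_pattern(prefix, gap, suffix):
    """Pattern elements for prefix.{gap}suffix: literals plus `gap` wildcards."""
    return list(prefix) + [WILDCARD] * gap + list(suffix)


def nfa_state_count(pattern):
    """One start state plus one NFA state per pattern element."""
    return len(pattern) + 1


def dfa_state_count(pattern):
    """Count the DFA states reachable through subset construction.

    NFA state i moves to i + 1 on pattern[i]; state 0 also loops on every
    character so that a match may start at any offset in the payload.
    """
    last = len(pattern)  # index of the accepting NFA state
    alphabet = {c for c in pattern if c is not WILDCARD} | {OTHER}
    start = frozenset({0})
    seen = {start}
    queue = deque([start])
    while queue:
        subset = queue.popleft()
        for symbol in alphabet:
            target = {0}  # the start state stays active via its self-loop
            for state in subset:
                if state < last and pattern[state] in (WILDCARD, symbol):
                    target.add(state + 1)
            target = frozenset(target)
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return len(seen)


if __name__ == "__main__":
    # Columns: wildcard count, NFA states, reachable DFA states.
    for gap in range(0, 9, 2):
        pattern = build_pattern("a", gap, "b")
        print(gap, nfa_state_count(pattern), dfa_state_count(pattern))
```

Under these assumptions, each added wildcard in the pattern a.{n}b adds one NFA state but doubles the number of reachable DFA states, because the DFA must remember which of the recent input positions could have started a match. Small gap values keep the run short; the same growth is what makes a gap of 100 impractical to build as a DFA.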
