REAPR: Reconfigurable Engine for Automata Processing
Ted Xie1, Vinh Dang2, Jack Wadden2, Kevin Skadron2, Mircea Stan1
1Department of Electrical and Computer Engineering, University of Virginia
2Department of Computer Science, University of Virginia
{ted.xie, vqd8a, wadden, skadron, [email protected]}

Abstract— Finite automata have proven their usefulness in high-profile domains ranging from network security to machine learning. While prior work focused on their applicability to purely regular expression workloads such as antivirus and network security rulesets, recent research has shown that automata can accelerate algorithms in other areas such as machine learning and even particle physics. Unfortunately, their emulation on traditional CPU architectures is fundamentally slow and further bottlenecked by memory. In this paper, we present REAPR: Reconfigurable Engine for Automata PRocessing, a flexible framework that synthesizes RTL for automata processing applications as well as I/O to handle data transfer to and from the kernel. We show that even with memory and control flow overheads, FPGAs still enable extremely high-throughput computation of automata workloads compared to other architectures.

I. INTRODUCTION

Many modern applications essential to domains such as high-performance computing, network security, and machine learning benefit from an approach based on finite automata. For instance, in the case of network security, regular expressions used for deep packet inspection are represented in memory as their equivalent nondeterministic finite automata; RegEx engines such as Intel HyperScan [1] perform automata transformations to maintain excellent performance in terms of both runtime and memory utilization. In the field of machine learning, Tracy et al. [2] have shown that automata can even be used for a high-performance implementation of the Random Forest classification algorithm.

Previously, engineers have relied on Moore's Law scaling for computing power to keep up with demands for these workloads, despite the fact that traditional von Neumann architectures are poorly suited for automata processing. Now in the "post-Moore" era, it is no longer prudent to rely on advances in manufacturing for improvements in computational throughput. The unsuitability of traditional CPU architectures for automata processing is amplified by the increasing demand for such applications; heterogeneous accelerator-based solutions are the way forward.

Specifically, prior work [3] [4] has shown that digital circuits enable a one-to-one spatial mapping between automata states and circuit components such as registers and wires. Due to the volatility of many automata processing workloads such as network security and machine learning, in which the rulesets and algorithms change frequently, non-reprogrammable application-specific integrated circuits (ASICs) are too inflexible and too costly to be commercially viable; therefore, reconfigurable platforms such as FPGAs and the Micron Automata Processor [4] are an ideal target for these workloads.

Our goal in this paper is to develop a high-throughput and scalable engine for enabling spatial automata processing on FPGAs, as opposed to prior work which focused purely on regular expressions. To achieve this goal, we have developed REAPR (Reconfigurable Engine for Automata PRocessing), a flexible and parameterizable tool that generates RTL to emulate finite automata.

The major contributions of REAPR are the following:
• The first effort to characterize automata performance for applications other than regular expressions on the FPGA using the ANMLZoo benchmark suite [5];
• A head-to-head comparison of FPGA computational throughput against a best-effort CPU automata processing engine (Intel HyperScan) and the Micron Automata Processor;
• The first automata-to-RTL translation tool with I/O generation capability;
• To our knowledge, the first attempt to run an automata processing application (Random Forest) on actual hardware by using AXI and PCI-Express to enable CPU-FPGA communication;
• A maximally-sized Levenshtein automaton to determine peak state capacity with complex routing.

II. PRIOR WORK

A. CPU Automata Engines

In a nondeterministic finite automaton (NFA), symbols from the input stream are broadcast to each state simultaneously, and each state connects to several other states, each of which may or may not activate depending on whether a given state matches the incoming symbol. For each symbol, an NFA engine must determine the next set of activated states, which involves a linear-time scan of the adjacency lists of all states in the current activated set. In the worst case, the adjacency list may contain nearly all of the states in the automaton; therefore, the run-time on a CPU for simulating an m-state automaton on n symbols is O(n · m), and for non-trivial data with non-trivial automata (n = m), the overall runtime is quadratic. CPU NFA processing is additionally hampered by the so-called "memory wall" due to the NFA's pointer-chasing execution model, and therefore it is desirable to drastically reduce the number of memory accesses per input item. In order to mask memory latency, state-of-the-art NFA engines such as Intel HyperScan [1] perform SIMD vector operations to execute as many state transitions as possible for a given memory transaction. Even so, such optimizations cannot escape the fact that sequential von Neumann architectures are fundamentally ill-suited for these types of workloads.

In order to improve the run-time complexity of automata traversals, some regular expression engines transform the NFA into its equivalent deterministic finite automaton (DFA). A DFA has only one state active for any given symbol cycle and is functionally equivalent to an NFA; this is achieved through a process known as subset construction, which involves enumerating all possible paths through an NFA. Converting an NFA to a DFA has the benefit of reducing the runtime to O(n) for n symbols (note that now the runtime is independent of automaton size) and requires only one memory access per input symbol, but frequently causes an exponential increase in the number of states necessary; this phenomenon is often referred to as state explosion. Subset construction for large automata incurs a huge memory footprint, which may actually cause performance degradation due to memory overhead in von Neumann machines.

Prior work by Becchi [6] attempted to leverage the best of both types of finite automata (the spatial density of NFA and the temporal density of DFA). By intercepting the subset construction algorithm and not expanding paths that would result in a state explosion, Becchi achieved a 98-99% reduction in memory capacity requirement and up to a 10x reduction in memory transactions.

B. Automata on FPGA

1) NFA: Past implementations of NFAs on FPGA [3] [7] focused on synthesizing only regular expression matching circuits for applications such as antivirus file scanning and network intrusion detection. REAPR extends this prior work by focusing on a more diverse set of finite automata to address the fact that the workload for automata processing is much richer and more diverse than regular expressions. We extend the underlying approach for NFA RTL generation from prior work, adapt it for other NFA applications, and detail our process in Section IV.

2) DFA: Several efforts [8] in accelerating automata processing with FPGAs use Aho-Corasick DFAs as the underlying data structure.

C. Micron Automata Processor

1) Architecture: The Micron Automata Processor (AP) [4] implements the automata processing process directly in hardware. Its fundamental building block is the State Transition Element (STE), which is the hardware realization of an automaton state as well as the next-state transition logic. The first-generation AP packs roughly 49,000 states per chip, and with 32 chips per board, one AP card holds nearly 1.6 million states. REAPR implements automata states in a similar manner but is more extensible due to the flexibility of the FPGA.

2) Programming Interface: In addition to a multi-lingual (C, Python, Java) SDK, Micron offers the Automata Network Markup Language (ANML), a format based upon XML for describing the interconnectivity of AP components such as STEs, booleans, and counters. Developers can either generate their own ANML files or generate them from regular expressions using the SDK's apcompile command [9].

D. Other Architectures

Several past efforts have proposed modifications to existing von Neumann architectures to specifically increase the performance of automata processing workloads. HARE (Hardware Accelerator for Regular Expressions) [10] uses an array of parallel modified RISC processors to emulate the Aho-Corasick DFA representation of regular expression rulesets. The Unified Automata Processor (UAP) [11] also uses an array of parallel processors to execute automata transitions and can emulate any automaton, not just Aho-Corasick. However, because these works are 1) not FPGA-centric (both are ASICs), 2) based on the von Neumann model and not spatially distributed like REAPR, and 3) confined to a limited set of just regular expressions (as opposed to general automata applications), we do not directly compare to them.

III. BENCHMARKS

We synthesize the ANMLZoo [5] automata benchmark suite developed by Wadden et al. to determine the efficiency of REAPR. ANMLZoo contains several applications falling into three broad categories: regular expressions, widgets, and mesh. The applications, along with their categories, are listed below in Table I. Detailed descriptions of these benchmarks can be found in the ANMLZoo paper [5].

Benchmark Name | Category | States
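The per-symbol adjacency-list scan that makes CPU NFA emulation quadratic (Section II-A) can be sketched in software. The following is a minimal, hypothetical Python model, not REAPR's or HyperScan's implementation; the class and method names are illustrative:

```python
from collections import defaultdict

class NFA:
    """Minimal NFA with explicit adjacency lists (illustrative only)."""

    def __init__(self, start_states, accept_states):
        self.start = set(start_states)
        self.accepts = set(accept_states)
        self.edges = defaultdict(list)  # state -> [(symbol_set, next_state)]

    def add_edge(self, src, symbols, dst):
        self.edges[src].append((set(symbols), dst))

    def run(self, data):
        """Return (position, state) pairs where an accept state activates."""
        active = set(self.start)
        reports = []
        for i, sym in enumerate(data):
            nxt = set()
            # Linear-time scan of the adjacency lists of the activated set:
            # with up to m active states this is O(m) work per symbol,
            # hence O(n*m) over an n-symbol stream.
            for state in active:
                for symbols, dst in self.edges[state]:
                    if sym in symbols:
                        nxt.add(dst)
            active = nxt
            reports.extend((i, s) for s in active & self.accepts)
        return reports

# Three-state NFA for the pattern "ab": 0 --a--> 1 --b--> 2 (accept)
nfa = NFA(start_states=[0], accept_states=[2])
nfa.add_edge(0, "a", 1)
nfa.add_edge(1, "b", 2)
```

Here `nfa.run("ab")` reports the accept state firing at position 1; the nested loop over `active` is exactly the memory-bound pointer chasing the text describes.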
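Subset construction and its state explosion can also be made concrete. Below is a hedged Python sketch under simplifying assumptions (no epsilon transitions; all names are illustrative): the resulting DFA needs exactly one table lookup per input symbol, but even a 4-state NFA for the pattern (a|b)*a(a|b)(a|b) expands to 8 DFA states, and for (a|b)*a(a|b)^k the DFA grows as 2^(k+1).

```python
from collections import deque

def subset_construct(nfa_starts, nfa_edges, alphabet):
    """nfa_edges: {state: {symbol: set_of_next_states}} (no epsilon moves)."""
    start = frozenset(nfa_starts)
    dfa = {}  # frozenset of NFA states -> {symbol: frozenset}
    work = deque([start])
    while work:
        cur = work.popleft()
        if cur in dfa:
            continue
        dfa[cur] = {}
        for sym in alphabet:
            # A DFA state is the set of NFA states reachable on this symbol.
            nxt = frozenset(n for s in cur
                            for n in nfa_edges.get(s, {}).get(sym, ()))
            dfa[cur][sym] = nxt
            work.append(nxt)
    return start, dfa

def dfa_run(start, dfa, data):
    cur = start
    for sym in data:  # exactly one transition-table lookup per symbol: O(n)
        cur = dfa[cur][sym]
    return cur

# 4-state NFA for (a|b)*a(a|b)(a|b); NFA state 3 is accepting.
edges = {
    0: {"a": {0, 1}, "b": {0}},
    1: {"a": {2}, "b": {2}},
    2: {"a": {3}, "b": {3}},
}
start, dfa = subset_construct({0}, edges, "ab")
```

After construction, `len(dfa)` is 8 versus the NFA's 4 states, illustrating why large rulesets converted this way incur the huge memory footprints the text describes.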
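The one-to-one spatial mapping exploited by FPGAs and the AP can be mimicked with bit vectors: one bit per state (standing in for a flip-flop or STE), the input symbol broadcast to every state at once, and all states updated in the same step, with no per-state adjacency walk. This is an illustrative Python sketch, not REAPR's actual RTL; the start-of-data enable semantics and function names are assumptions made for simplicity.

```python
def build_match_table(char_classes):
    """match[c] has bit i set iff state i's character class contains chr(c)."""
    match = [0] * 256
    for i, symbols in enumerate(char_classes):
        for ch in symbols:
            match[ord(ch)] |= 1 << i
    return match

def run_spatial(data, char_classes, preds, start_mask, report_mask):
    """preds[i]: bitmask of states feeding state i (the routing network)."""
    match = build_match_table(char_classes)
    enabled = start_mask  # start states are enabled at start-of-data
    reports = []
    for pos, sym in enumerate(data):
        # Symbol broadcast: every enabled state whose character class
        # matches activates -- all states update in one "cycle".
        active = enabled & match[ord(sym)]
        if active & report_mask:
            reports.append(pos)
        # Next-state logic: state i is enabled iff any predecessor is active.
        enabled = 0
        for i, pmask in enumerate(preds):
            if active & pmask:
                enabled |= 1 << i
    return reports

# Pattern "ab" anchored at start-of-data: state 0 matches 'a'; state 1
# (the reporting state) matches 'b' and is fed only by state 0.
reports = run_spatial("ab", ["a", "b"], preds=[0, 0b01],
                      start_mask=0b01, report_mask=0b10)
```

The match table plays the role of the STE's stored character class, and `preds` plays the role of the wiring between states; the cost per symbol is fixed regardless of how many states are active, which is the core advantage of the spatial approach.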