An Extensible System-On-Chip ------

ABSTRACT
A single-chip firewall has been implemented that performs packet filtering, content scanning, and per-flow queuing of Internet packets at Gigabit/second rates. All of the packet processing operations are performed using reconfigurable hardware within a single Xilinx Virtex XCV2000E Field Programmable Gate Array (FPGA). The SOC firewall processes headers of Internet packets in hardware with layered protocol wrappers. The firewall filters packets using rules stored in Content Addressable Memories (CAMs). The firewall scans payloads of packets for keywords using a hardware-based regular expression matching circuit. Lastly, the SOC firewall integrates a per-flow queuing module to mitigate the effect of Denial of Service attacks. Additional features can be added to the firewall by dynamic reconfiguration of FPGA hardware.

[Figure 1: Internet Firewall Configuration. Labels: Internet, fiber, backbone switch, firewall, switch, internal hosts (PC 1, PC 2).]

Categories and Subject Descriptors
I.5.3 [Pattern Recognition]: Design Methodology; B.4.1 [Data Communications]: Input/Output Devices; C.2.1 [- Communication Networks]: and Design

General Terms
Design, Experimentation,

Keywords
System On Chip, FPGA, Internet, Firewall, Packet Scanning, Per-Flow Queuing, Network Intrusion Detection

1. INTRODUCTION
As the Internet has grown, demand for network security has significantly increased. Internet-connected machines are continuously the target of malicious attacks from machines located around the world. Internal hosts can be protected from remote attacks by filtering traffic through a firewall. As shown in Figure 1, firewalls typically reside between the backbone switches and the internal hosts. Firewalls drop packets that are known to be malicious and rate-limit traffic flows that attempt to transmit excessively large amounts of traffic. By placing multiple firewalls throughout a network, individual subnets can be isolated from each other and be protected from other hosts on the Internet.

Recently, new types of firewalls have been introduced with an increasing set of features. While some types of attacks have been thwarted by dropping packets based on the value of packet headers, new types of firewalls must scan the bytes in the payload of the packets as well. Further, new types of firewalls need to defend internal hosts from Denial of Service (DoS) attacks, which occur when remote machines flood traffic to a victim at high rates [1]. Few existing firewalls have the ability to scan the full packet payload or provide protection against DoS attacks. Of the systems that do, most run in software and are not fast enough to perform those functions at high speeds [3]. There exists a need for hardware-accelerated packet processing firewalls that maintain high throughput.

Custom Integrated Circuits (ICs) can be used to implement firewall functions at Gigabit/second rates. They achieve high throughput by performing operations in parallel and by processing packets in deep pipelines. In the past, hardware-based packet processing systems required multiple ASICs to filter and forward packets in hardware. Today, an FPGA with tens of millions of transistors can implement a firewall as a single System On Chip (SOC). A challenge in building firewalls is to make the device capable of protecting against both current and future threats [6]. Reconfigurable hardware provides the logic density to implement a complex firewall while maintaining the flexibility to reconfigure and implement new functions.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Design Automation Conference ‘03, June 2-6, 2003, Anaheim, CA.
Copyright 2003 ACM 1-58113-000-0/00/0000…$5.00.

2. SYSTEM ON CHIP FIREWALL
A System-On-Chip Internet firewall has been implemented on a Xilinx Virtex XCV2000E FPGA. To protect against current threats, the SOC firewall integrates circuits to filter headers, scan payloads, and buffer traffic. To protect against future threats, the SOC is extensible, allowing insertion of new packet processing hardware modules.

[Figure 2: Block Diagram of System-On-Chip Firewall. Blocks between data input and data output: layered protocol wrappers, payload scanner, CAM filter, flow buffer, queue manager, packet scheduler, free list manager, and the SDRAM and SRAM controllers (interfaces to off-chip memories).]

The top-level architecture of the System On Chip firewall is shown in Figure 2. When data first enters the SOC, a set of layered protocol wrappers parses the headers of the Internet packets. Next, the payload scanner examines the content of the packets to identify keywords and/or regular expressions. Next, the CAM filter compares the fields in the header of the packet with a set of rules stored in Ternary Content Addressable Memory (TCAM). Some rules can cause the CAM filters to outright drop packets, while other rules are used to classify the packet and assign it a flow identifier. After classification, the queue manager schedules when packets are transmitted from the flow buffer, which stores the packet in off-chip memory. Once scheduled, data is read from the flow buffer and transmitted out of the firewall. Additional features can be added to the system by inserting blocks along the data processing path.

2.1 Header Processing
Internet packets contain both a header and a payload. The header contains multiple fields that specify the type of packet, the protocol of the packet, where the packet has come from, where the packet is destined, the length of the packet, and other options relevant to the Internet protocols.

2.1.1 Layered Protocol Wrappers
To simplify the processing of the protocol fields on the SOC firewall, a set of layered protocol wrappers was implemented to process protocols at multiple layers [2]. At the lowest layer, data is segmented and reassembled from short cells into complete frames. At another layer of the protocol stack, the fields of the Internet Protocol (IP) packets are computed and verified. At the highest level of the protocol processing, the user-level data is separated from the headers and transport fields used by the network.

2.1.2 Content Addressable Memory Filters
Once the header has been processed, a Ternary Content Addressable Memory (TCAM) classifies packets as belonging to a specific flow. A diagram of a two-entry TCAM is shown in Figure 3. When a packet arrives, the packet's source address, destination address, source port, destination port, and protocol are simultaneously compared to the value fields in all of the rows of the TCAM. After the bits are compared, a mask register is applied to the vector to select which bits of each row must match and which bits can be ignored. If all of the values match in all of the bit locations that are unmasked, then that row of the TCAM is considered to be a match. The flow identifier associated with the highest-priority matching TCAM rule is then assigned to the packet.

[Figure 3: Ternary CAM Filter. Each row pairs a value with a mask (CAM_VALUE_1 and CAM_MASK_1, CAM_VALUE_2 and CAM_MASK_2, and so on) over the fields Content, Src IP, Dest IP, Src Port, Dest Port, and Proto. Row bit positions are labeled 111, 103, and 0.]

2.2 Payload Processing
Many types of Internet traffic cannot be classified by examination of the packet headers. For example, the KaZaA program sometimes disguises packet headers to appear as though they were being sent from a web server. For network administrators who care about the security of their networks, it is important to be able to classify packets based on their content rather than just the values that appear in the packet headers.

2.2.1 Regular Expression Matching
In order to scan the payload of packets, a regular expression matching circuit was implemented. Regular expressions provide a shorthand means to specify the value of a string, a wildcard character (specified by ‘?’), or a string of multiple characters (specified by ‘*’). For example, the string “{A|a}lbert ? {E|e}instein” matches all four case variations of the name Albert Einstein and allows the middle initial to be an arbitrary character.

2.2.2 Implementation of the Payload Scanner
To generate high-speed hardware that searches for the regular expression, a design flow was created to automatically generate finite state machines from the specification of regular expressions. A match is detected when the sequence of arriving bytes causes the state machine to reach a matching state. In order to scan for multiple regular expressions, a sequence of scanning engines is instantiated. In order to achieve higher performance, pipelines can operate in parallel. A payload scanner searching for eight Regular Expressions (RE1-RE8) using four parallel search flows is illustrated in Figure 4.

[Figure 4: Regular Expression (RE) Payload Scanner. A flow dispatcher distributes incoming packets across four parallel search flows, each a pipeline of regular expression scanning engines RE1 through RE8; a flow collector merges the outgoing packets.]

2.2.3 Application using the Payload Scanner
A payload processing circuit has been implemented on the SOC firewall that scans for unwanted messages, commonly referred to as SPAM. By scanning packets as they pass through the network, it is possible to identify SPAM before the message is forwarded to the endpoint host. To implement the SPAM filter, eight categories of phrases were identified that included terms that commonly appear in SPAM, such as “CALL NOW” and “MAKE MONEY FAST”. In total, a set of 34 case-insensitive regular expressions was specified. The terms were then compiled into hardware and programmed into the FPGA to scan packets as they passed through the firewall. Whenever a regular expression search engine found specific content, a bit was set in the content vector, as shown in Figure 4. A rule was then programmed into the TCAM to drop every message that contained SPAM.

2.3 Flow Buffering
To provide Quality of Service (QoS) for traffic that passes through the network, the SOC firewall performs both class-based and per-flow queuing. Class-based queuing allows certain types of traffic to receive better service than other traffic. Per-flow queuing ensures that no single traffic flow can consume all of the network's bandwidth.

To support the multiple classes of service, traffic flows are organized by the firewall into four classes. Traffic from flows in certain classes is given priority over traffic from flows in other classes. Multiple linked lists of packets are managed to support per-flow queuing. All management of queues and tracking of free memory space is computed in the FPGA hardware using constant-time, linked-list data structures maintained in SRAM.

A diagram of the flow buffer and queue manager is shown in Figure 5. The queue manager includes circuits to en-queue traffic flows, de-queue traffic flows, and schedule flows for transmission. Within the scheduler, four separate queues of flows are maintained (one for each class).

When a packet arrives, the packet's data is delivered to the flow buffer and the packet's flow identifier is passed to the En-queue FSM in the Queue Manager. Using the flow ID, the En-queue FSM reads SRAM to retrieve the flow's state. As shown in Figure 6, each entry of the flow's state table contains a pointer to the head and tail of a linked list of stored packets in SDRAM as well as counters to track the number of packets read and written to that flow.

Meanwhile, the flow buffer stores the packet in memory at the location specified by the flow's tail pointer. The flow buffer includes a controller to store the packet in Synchronous Dynamic Random Access Memory (SDRAM) [4]. After writing a packet to the next available memory location, the value of the tail pointer is passed to the queue manager to identify the next available free memory location.

Within each class of traffic, the queue manager performs round-robin queuing of individual flows. When the first packet of a flow arrives that has no packets already buffered, the flow identifier is inserted into a scheduling queue for that packet's class of service and the flow state table is updated. When another packet of a flow that is already scheduled arrives, the packet is simply appended to the linked list and the packet write count is incremented.

To transmit packets, the de-queue FSM reads a flow ID from the scheduler. The scheduler de-queues the next available flow from the highest-priority class of traffic. The de-queue FSM then reads the flow state to obtain a pointer to the head of the flow buffer's packet storage. Also, the flow identifier is removed from the head of the scheduler's queue. The flow identifier re-enters the tail of the same queue if that flow has additional packets to transmit.

3. EXTENSIBLE FEATURES
By implementing the firewall in an FPGA, new functions can be added by inserting new packet processing modules along the data paths within the SOC.
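
The ternary match of Section 2.1.2 can be modeled in a few lines of software. The Python sketch below is illustrative only. The field widths (8-bit content vector, 32-bit addresses, 16-bit ports, 8-bit protocol), the packing order, the convention that a mask bit of 1 means "must match", and the rule that earlier rows have higher priority are assumptions, not details taken from the paper. The names `pack_key` and `tcam_lookup` are hypothetical.

```python
def pack_key(content, src_ip, dst_ip, src_port, dst_port, proto):
    """Pack the content vector and header fields into one 112-bit integer.

    Assumed layout, most significant field first:
    content(8) | src_ip(32) | dst_ip(32) | src_port(16) | dst_port(16) | proto(8)
    """
    key = content & 0xFF
    key = (key << 32) | (src_ip & 0xFFFFFFFF)
    key = (key << 32) | (dst_ip & 0xFFFFFFFF)
    key = (key << 16) | (src_port & 0xFFFF)
    key = (key << 16) | (dst_port & 0xFFFF)
    key = (key << 8) | (proto & 0xFF)
    return key


def tcam_lookup(rows, key):
    """Return the flow ID of the first matching row, or None.

    rows is a list of (value, mask, flow_id) tuples. A row matches when
    every bit selected by its mask is equal in the key and the value.
    The hardware compares all rows at once; this loop is sequential and
    treats earlier rows as higher priority (an assumption).
    """
    for value, mask, flow_id in rows:
        if ((key ^ value) & mask) == 0:
            return flow_id
    return None


# Hypothetical two-entry rule table.
SPAM_BIT = 0x01
rows = [
    # Row 1: any packet whose content vector has the SPAM bit set.
    (pack_key(SPAM_BIT, 0, 0, 0, 0, 0), pack_key(SPAM_BIT, 0, 0, 0, 0, 0), "DROP"),
    # Row 2: protocol 6 (TCP) to destination port 80, any addresses.
    (pack_key(0, 0, 0, 0, 80, 6), pack_key(0, 0, 0, 0, 0xFFFF, 0xFF), 1),
]

web = pack_key(0, 0x0A000001, 0xC0A80001, 1234, 80, 6)
spam = pack_key(SPAM_BIT, 0x0A000001, 0xC0A80001, 1234, 80, 6)
dns = pack_key(0, 0x0A000001, 0xC0A80001, 1234, 53, 17)

print(tcam_lookup(rows, web))   # 1
print(tcam_lookup(rows, spam))  # DROP
print(tcam_lookup(rows, dns))   # None
```

Placing the content-vector rule first mirrors the SPAM example of Section 2.2.3, where a bit set by the payload scanner causes the packet to be dropped regardless of its header fields.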

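The en-queue and de-queue behavior described in Section 2.3 can likewise be sketched in software. This Python model is a simplification under stated assumptions: `QueueManager` is a hypothetical name, `collections.deque` objects stand in for the linked lists that the hardware keeps in SRAM and SDRAM, and class 0 is assumed to be the highest-priority class. The read and write counters of the flow state table are omitted.

```python
from collections import deque

NUM_CLASSES = 4  # the text describes four classes of service


class QueueManager:
    """Software model of class-based, per-flow round-robin queuing."""

    def __init__(self):
        # One scheduling queue of flow IDs per class of service.
        self.sched = [deque() for _ in range(NUM_CLASSES)]
        # Per-flow packet lists, standing in for the flow buffer.
        self.flows = {}

    def enqueue(self, flow_id, service_class, packet):
        packets = self.flows.setdefault(flow_id, deque())
        if not packets:
            # First buffered packet of this flow: schedule the flow.
            self.sched[service_class].append(flow_id)
        # Otherwise the flow is already scheduled; just append the packet.
        packets.append(packet)

    def dequeue(self):
        # Serve the highest-priority class that has a scheduled flow.
        for queue in self.sched:
            if queue:
                flow_id = queue.popleft()
                packets = self.flows[flow_id]
                packet = packets.popleft()
                if packets:
                    # More packets waiting: re-enter the tail of the same queue.
                    queue.append(flow_id)
                return flow_id, packet
        return None  # nothing buffered


qm = QueueManager()
qm.enqueue("A", 1, "a1")
qm.enqueue("A", 1, "a2")
qm.enqueue("B", 1, "b1")
qm.enqueue("C", 0, "c1")
order = [qm.dequeue() for _ in range(5)]
print(order)
# [('C', 'c1'), ('A', 'a1'), ('B', 'b1'), ('A', 'a2'), None]
```

In the example, flow C is served first because it belongs to the higher-priority class, and flows A and B then alternate within their shared class, so the two packets of flow A cannot starve flow B.
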
[Figure 5: Flow Buffer and Queue Manager. The flow buffer holds linked lists of packets with head, tail, and next pointers. The queue manager contains the Enqueue FSM, the DeQueue FSM, and a scheduler with four queues (P0 through P3), each with a queue re-entry path, and connects to the SRAM controller.]

3.1.1 Other Modules developed for the SOC Firewall
In addition to the core features described in this paper, several other modules have been prototyped on the SOC firewall. Extensible modules have been implemented that perform the following functions:
· Virus blocking
· Content filtering
· Denial of Service Protection
· Decryption of AES and 3DES packets
· Bitmap image filtering
· Network Address Translation (NAT)
· Domain Name Service (DNS) caching
· IP Version 6 (IPV6) tunneling
· Resource Reservation Protocol (RSVP)

Some or all of these modules can be compiled into a top-level implementation of the SOC firewall. All modules use standard interfaces to enable integration of components in a common System on Chip.

3.1.2 Open Interfaces on the SOC Firewall
To facilitate the integration of these extensible modules into the SOC firewall, standard interfaces were defined on the top-level of the SOC firewall circuit. A diagram of the top-level of the firewall with one extensible module is shown in Figure 7. Each of the red lines indicates an interface where an extensible module can be attached. The module that inserts between the payload scanner and the CAM filter, for example, could be used to check for other properties of the packet before the packet is forwarded to the CAM table for lookup. The interfaces to the SRAM and SDRAM controller allow the module to access external memory, if needed. Other modules can integrate into some or all of the other interfaces. The SOC firewall is typically built with all of the core features and a mixture of one or more extensible modules. The number of extensible modules that can be compiled into the SOC firewall is only limited by the size of the FPGA.

[Figure 7: Extensible Interfaces on the SOC Firewall (Shown in Red). The block diagram of Figure 2 with an extensible module inserted between the payload scanner and the CAM filter, and with two SDRAM controllers (SDRAM 1, SDRAM 2) and two SRAM controllers (SRAM 1, SRAM 2) as interfaces to off-chip memories. Legend: new component, new connectivity, available interface.]

4. DESIGN FLOW
The SOC firewall quickly synthesizes into hardware through the use of multiple design automation tools. The design flow shown in Figure 8 depicts how to compile, verify, synthesize, place and route, and test the operation of the SOC firewall. The total time to iterate from source code modification to in-system testing of the SOC firewall with actual network traffic is approximately one half hour.

[Figure 8: CAD Flow for the SOC Firewall. Steps: compile circuit (Mentor vcom); verify functionality (ModelSim); synthesize to gates (Synplicity); constrain placement; place and route (Xilinx); generate bitstream (Xilinx); upload bitfile for in-system testing (FPX); test module with actual network traffic; verify packets dropped or delayed.]

4.1.1 Verification of the SOC Firewall
The first step in building the SOC firewall involves compiling the VHDL source code. Once compiled, a quick simulation of a few packets that contain a mixture of proper and malicious traffic is used to verify that the circuit can correctly process packets. To save time, exhaustive testing is performed at speed in hardware after synthesis.

4.1.2 Synthesis of the SOC Firewall
The firewall synthesizes to hardware circuit components using Synplicity's Synplify Pro tool. The resulting EDIF netlist is generated within a few minutes and then fed into Xilinx's backend design flow. Pin locations specified in a constraint file are used to map the pins of the SOC firewall to appropriate pins of the Xilinx Virtex XCV2000E FPGA. Next, the place and route tool is run to implement the FPGA circuit. Lastly, the resulting bitstream is generated, which contains the configuration data. The placement and routing phase of the design flow typically runs for 20-25 minutes on a Gigahertz-class processor.

4.1.3 Testing the SOC Firewall on the FPX Platform
The Field Programmable Port Extender (FPX) platform was used to evaluate the performance of the SOC firewall with real Internet traffic. The FPX is an open hardware platform that includes two multi-gigabit/second network interfaces and dynamically reconfigurable hardware that can be reprogrammed over the network [7]. To implement the SOC firewall, the bitfile for the circuit was uploaded into the Virtex XCV2000E on the FPX for in-system testing [8].

To verify the operation of the SOC firewall with high throughput traffic, actual network traffic was sent to the hardware over the network from remote hosts. Malicious packets were dropped, the SPAM was rate-limited, and all other flows received a fair share of bandwidth.

[Figure 9: Implementation on FPX Platform]

5. RESULTS
The core components of the SOC firewall, including the layered protocol wrappers, TCAM packet filters, a pipeline of regular expression matching engines to detect SPAM, and the per-flow packet buffer with the SDRAM controller, were synthesized into the Reprogrammable Application Device (RAD) of the Field-programmable Port Extender (FPX).

5.1 Device Utilization
Placement was constrained using Synplicity's Amplify tool to lock the location of modules into specific regions of the FPGA. A view of the placed and routed Xilinx Virtex XCV2000E is shown in Figure 9. Note that the center region of the chip was left available for insertion of extensible modules.

[Figure 9: FPGA Layout of the SOC Firewall. Labeled regions: Memory Controller, Layered Protocol Wrappers, Per-flow Queuing, Region for Extensible Plug-in Modules, Packet Store Manager, Regular Expression Payload Filtering, TCAM Header Filtering.]

The results after place and route for the synthesized SOC Firewall on the Xilinx Virtex XCV2000E are listed in Table 1. The core logic occupied 43% of the logic and 39% of the block RAMs.

Table 1. SOC Firewall Implementation

Resource         Virtex XCV2000E Device Utilization    Utilization Percentage
Logic Slices     8342 out of 19200                     43%
BlockRAMs        63 out of 160                         39%
External IOBs    286 out of 512                        55%

5.2 Throughput
The components of the SOC firewall that implement the protocol wrappers, CAM filter, flow buffer, and queue manager were synthesized and operated at 62.5 MHz. Each of these components processes 32 bits of data in every cycle, thus giving the SOC firewall a throughput of 32*62.5MHz = 2 Gigabits/second.

The regular expression scanning circuit for the SPAM pipeline synthesized at 37 MHz. Given that each pipeline processes 8 bits of data per cycle, the throughput of the SPAM filter is 8*37MHz = 296 Megabits/second per pipeline. By running 8 SPAM pipelines in parallel, the payload matching circuit achieves a throughput of 8*8*37MHz = 2.368 Gigabits/second.

6. CONCLUSIONS
An extensible firewall has been implemented as a reconfigurable System On Chip (SOC). In addition to the standard features implemented by other Internet firewalls, the SOC firewall can also perform high-throughput payload scanning and implement per-flow queuing. The circuit was implemented on a Xilinx XCV2000E FPGA. The resulting bitfile was tested on the Field Programmable Port Extender (FPX) network platform.

By using parallel hardware and deeply pipelined circuits, the SOC firewall can process protocol headers with TCAMs and the entire payload using regular expression matching at 2 Gigabits/second. Denial of Service attacks are mitigated through the use of class-based and per-flow queuing. A region of gates in the FPGA was left available to be used for extensible plug-in modules. By adding modules to the hardware, the firewall can protect a network against additional threats.

7. ACKNOWLEDGMENTS
--- for contribution to the NCHARGE control software, ---- for contribution to the SDRAM flow buffer, --- for work on the Network Interface Device. --- for work on the regular expression matching circuit and enhancements to the protocol wrappers.

8. REFERENCES
[1] Jose Brustoloni, Protecting Electronic Commerce from Distributed Denial-of-Service Attacks, Proceedings of the 11th International Conference (WWW2002), ACM, Honolulu, HI, May 2002.
[2] Florian Braun, John Lockwood, and Marcel Waldvogel, Protocol Wrappers for Layered Network Packet Processing in Reconfigurable Hardware, IEEE Micro, Volume 22, Number 3, Feb 2002, pp. 66-74.
[3] R. Franklin, D. Carver, B. L. Hutchings, Assisting Network Intrusion Detection with Reconfigurable Hardware, FCCM'02.
[4] Sarang Dharmapurikar and John Lockwood, Synthesizable Design of a Multi-Module Memory Controller, Washington University, Department of , Technical Report WUCS-01-26, October, 2001.
[5] Edson L. Horta, John W. Lockwood, David E. Taylor, David Parlour, Dynamic Hardware Plugins in an FPGA with Partial Run-time Reconfiguration, Design Automation Conference (DAC), New Orleans, LA, June 10-14, 2002, Paper 24.2.
[6] John W. Lockwood, Evolvable Internet Hardware Platforms, NASA/DoD Workshop on Evolvable Hardware (EHW'01), Long Beach, CA, July 12-14, 2001, pp. 271-279.
[7] John W. Lockwood, Jon S. Turner, David E. Taylor, Field Programmable Port Extender (FPX) for Distributed Routing and Queuing, ACM International Symposium on Field Programmable Gate Arrays (FPGA'2000), Monterey, CA, February 2000, pp. 137-144.
[8] Todd Sproull, John W. Lockwood, David E. Taylor, Control and Configuration Software for a Reconfigurable Networking Hardware Platform, IEEE Symposium on Field-Programmable Custom Machines (FCCM), Napa, CA, April 24, 2002.
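
As a cross-check of the throughput figures in Section 5.2, each rate is the product of bits processed per cycle, clock frequency, and the number of parallel pipelines. The short Python check below reproduces that arithmetic with the numbers reported in the text; the function name `throughput_mbps` is hypothetical.

```python
def throughput_mbps(bits_per_cycle, clock_mhz, pipelines=1):
    """Throughput in Megabits/second: bits per cycle x MHz x pipelines."""
    return bits_per_cycle * clock_mhz * pipelines


core = throughput_mbps(32, 62.5)          # wrappers, CAM filter, buffer, queue manager
one_scanner = throughput_mbps(8, 37)      # a single SPAM scanning pipeline
all_scanners = throughput_mbps(8, 37, 8)  # eight pipelines in parallel

print(core, one_scanner, all_scanners)    # 2000.0 296 2368
```

With eight pipelines the payload scanner (2368 Mb/s) keeps pace with the 2000 Mb/s data path, which is consistent with the 2 Gigabits/second figure quoted in the conclusions.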