A FLEXIBLE PLATFORM FOR NETWORK PROCESSING

Kurtis B. Kredo II, Dr. Albert A. Liddicoat, Dr. Hugh M. Smith, Dr. Phillip L. Nico
California Polytechnic State University, San Luis Obispo
1 Grand Avenue, EE Department, San Luis Obispo, CA 93407, United States
[email protected], (aliddico, husmith, and pnico)@calpoly.edu

ABSTRACT

Much of the current research in computer networks focuses on providing increasing levels of functionality at very high bandwidths. Traditional implementations using application specific integrated circuits (ASICs) can process data very quickly, but do not allow modification when protocols or algorithms change. Software-based implementations provide the ability to change functionality very easily, but often cannot support high bandwidths. The third generation Cal Poly Intelligent Network Interface Card (CiNIC), presented in this paper, combines the speed of a hardware implementation with the flexibility of a software-based system by using field programmable gate arrays (FPGAs) and a hardcore processor to perform network protocol processing. Utilizing the CiNIC within a network device allows developers and researchers to implement additional functionality in various ways. The CiNIC platform has been developed for flexibility and may be used for a broad range of research and development projects including hardware/software co-design, embedded systems, and distributed systems.

KEY WORDS

Network Interfaces, Computer Networks, Protocol Offloading, Reprogrammable Devices, Intelligent NIC

1. Introduction

Current advances in fiber optic technology have made large amounts of bandwidth available in computer networks as users have placed increasing demands upon network devices [1, 2, 3]. Providing hardware that can utilize these speeds has often been the sole domain of application specific integrated circuits (ASICs). However, ASICs are not flexible enough to be used within network devices that must be updated often or within systems that change in functionality over time. In addition to operating at very high bandwidths, network devices are being called upon to do more processing than the standard protocols. Implementations of these functions often have to change; for example, as service parameters change for quality of service implementations and new algorithms or techniques are developed for security applications. Software-based network devices provide a programmable base to support these advanced and changing functions, but in most cases they are unable to support very high data rates [3, 4].

Current research generally uses one of two technologies to perform network processing at high bandwidths. Software-based implementations often use a network processor, such as the IXP1200, to implement the required functionality. Alternatively, some researchers focus development on reprogrammable hardware-based systems that utilize Field Programmable Gate Arrays (FPGAs). Researchers have investigated other technologies; however, network processors and FPGAs are the predominant implementation devices.

A review of current work and an overview of the project described in this paper are provided in the next two subsections. Sections 2 and 3 describe the architecture of the CiNIC presented in this paper, and conclusions are provided in Section 4.

1.1 Current Work

At least two research groups, one from the Georgia Institute of Technology and another from Princeton University, have used Intel IXP network processors for research projects. The authors in [5] describe a software programmable router that uses an Intel IXP1200 network processor and a Pentium III general-purpose processor. Computational resources are logically divided into three functional layers, with the lowest level (the IXP1200 MicroEngines) performing common data plane processing and the Pentium III at the highest level performing control plane processing. The middle layer contains the StrongARM processor present in the IXP1200. Additionally, the authors in [4] describe an IXP1200-based network co-processor card used for traffic modification within a storage area network. Utilizing the co-processor improved performance while using a lightweight messaging system.

Other researchers have used reprogrammable hardware to perform the network processing. Moving the implementation from software to reprogrammable hardware provides system designers with a greater degree of flexibility, which allows many operations, such as encryption, to occur much more rapidly. Two projects of note are the Field Programmable Port Extender [3], which provides pluggable processing modules within an advanced switch, and a reprogrammable network interface presented in [6]. However, there are several other research projects that focus on development utilizing reprogrammable hardware [7, 8, 9].

1.2 The CiNIC Project

The Cal Poly Intelligent Network Interface Card (CiNIC) project is part of the ongoing research of the Network Performance Research Laboratory (NetPRL) at Cal Poly. The CiNIC project focuses on improving network performance by implementing additional functionality, beyond a standard system, within the CiNIC and by offloading network protocol processing from the Host system. Two generations of the CiNIC have already been developed, and this paper discusses the third and current generation.

Each of the past CiNIC generations has provided a platform for development and has informed future designs by showing the strengths and weaknesses of the respective architectures. The first generation CiNIC performed network processing within an Intel StrongARM processor. A complete embedded computer system was dedicated to the first generation CiNIC, including an EBSA 285 processor board, a separate hard drive, and a standard NIC. The second generation CiNIC deviated from the previous generation by performing all network processing within Altera's Nios softcore processor. System size was also reduced to a single PCI card inserted into the host system. Based on past results, a combination of these architectures was used for the third generation CiNIC.

Taking the best of the previous architectures, the third generation CiNIC combines the flexibility and speed of reprogrammable hardware with the raw computational power of a hardcore processor. Each type of hardware plays a different role in performing the network protocol processing in the third generation CiNIC. For common processing, two FPGAs provide a platform for implementing functionality either directly in hardware through hardware description language (HDL) modules or in software that runs on Xilinx's MicroBlaze softcore processor. More computationally intensive or out-of-band processing occurs on an IBM PowerPC processor. Providing multiple implementation possibilities also makes the third generation CiNIC the most flexible CiNIC architecture to date and allows easy comparison between different implementations of the same functionality.

Several goals guided the design of the third generation CiNIC:
· Provide functionality and flexibility unavailable in traditional host and NIC architectures.
· Reduce the host computer's computational load by offloading network processing to the CiNIC.
· Minimize the latency introduced by the CiNIC.
· Provide an architecture that is no more difficult to use than a standard NIC and the set of applications implemented on the co-processor.
· Focus development on high-end network devices.
· Design to reduce complexity and promote flexibility.

Providing a general and flexible architecture enables researchers to implement functionality in ways that are not possible using other platforms, both commercial and academic. Platforms utilizing an architecture that provides reprogrammable hardware and a processor within a single system-on-chip package (e.g., the Xilinx VirtexII Pro FPGAs) approach the CiNIC's flexibility, but do not provide the computational resources available from a separate hardcore processor. Additionally, since the third generation CiNIC has a general architecture, several research areas beyond network processing, such as hardware/software co-design, embedded systems, and distributed systems, would benefit from the architecture.

Additionally, since development using intelligent NICs differs from development using other devices, work has been done to introduce new interfaces to the Host. Projects such as VIA [10] and SPINE [11] have provided new software interfaces between the Host system or user application and the intelligent NIC. In a similar fashion, a new hardware- and operating-system-independent interface was developed along with the third generation CiNIC, which provides a minimal generic interface to any intelligent NIC. Thomas details the new interface in [12] and provides examples of how to map the interface to the standard socket API.

2. CiNIC Architecture

The third generation of the CiNIC (afterwards referred to simply as the CiNIC) departs from the previous generations in that it integrates both the reprogrammable flexibility of FPGAs and the computational power of a hardcore processor. Combining flexibility and computational power allows the CiNIC to perform the standard network protocol processing while adding functionality implemented either in hardware within the FPGAs or in software running on the processor. The entire system is contained on a PCI form factor card for use within a host system; however, there are no restrictions that prevent using the CiNIC as a standalone system if PCI functionality is not required.

Two Xilinx Virtex1000 FPGAs provide the basis for network processing, with one FPGA containing the MicroBlaze softcore processor and its peripherals and the other FPGA containing custom HDL modules. Software running on MicroBlaze performs the transport layer functionality along with interfacing to the Host system while maintaining connection state for Host processes. The HDL modules within the lower FPGA, collectively called the CiNIC stack, perform the Internet Protocol (IP) and Ethernet processing. Each module within the lower FPGA is separated by a pair of 32-bit wide FIFOs, which allow data to be passed easily and without inter-module synchronization.

Figure 1. CiNIC Hardware Architecture Diagram (Host System and PCI bus; main system containing the upper and lower FPGAs with their memories, SystemACE configuration logic, Ethernet IC, and external ports; Processor Island containing the PowerPC and its peripherals, connected over the External Peripheral Bus)
The system was designed so that traffic arriving from the network is processed by each module serially before being passed to the Host across the PCI bus. Outbound traffic follows the reverse path. Both FPGAs interface to the PowerPC through the processor's External Peripheral Bus (EXB).

2.1 Hardware Architecture

The current CiNIC consists of two subsystems, as shown in Figure 1: the main system, which contains the FPGAs and peripherals, and the Processor Island (PI) [13], which contains the PowerPC processor and its peripherals. Since the CiNIC is intended to be a component within a Host system, both subsystems are located on a PCI card with the PCI fingers connected to the upper FPGA.

Most of the network processing occurs within the two FPGAs of the main system, which are connected in series between the PCI bus and the Ethernet physical layer integrated circuit (IC). Within the lower FPGA, VHDL modules perform the lower network protocol layer functionality; specifically, the IP and Ethernet protocols along with interfacing to the Ethernet IC. Modules within the upper FPGA perform the transport layer processing, maintain state for Host connections, and interface to the PCI bus through software running on the MicroBlaze softcore processor. Additionally, both FPGAs interface to the PowerPC's EXB so that communication between the subsystems may occur.

Additional hardware resources available to the FPGAs in the main system include several memories, external ports, and basic I/O interfaces. There are two SODIMM SDRAM memory sockets connected to the upper FPGA for use by MicroBlaze. One memory holds the packet data during processing, and a portion of this memory is mapped into the Host's address space. The other memory holds instructions for programs running on MicroBlaze. Separate memories are used within the CiNIC so that instruction fetches do not place additional load on the data side of the OPB and decrease the bandwidth available for data flow. The separation of data and instruction memories is possible due to MicroBlaze's Harvard architecture. 16 Megabytes (MB) of Flash memory is also connected to the upper FPGA for storage of program code or other data that must be retained when the system is powered down. A final memory available on the main system is 1 MB of SRAM located on the EXB. External ports available to the main system are a serial port connected to the upper FPGA and utilized by MicroBlaze, and an Ethernet port connected to the lower FPGA. Modules within either FPGA can act as a master, slave, or both upon the PowerPC's EXB. For system development and use there are 16 LEDs, four seven-segment displays, and 16 switches connected to the lower FPGA as general-purpose I/O devices; these are not pictured in Figure 1.

Advanced or intensive processing occurs within the PI. The PI hosts several external interfaces to the PowerPC, such as a serial port and an Ethernet port. Built into the PowerPC are interfaces to DDR SDRAM and SRAM memories. The PI also contains additional SRAM and bootable Flash memories located on the EXB, which are also accessible to the FPGAs of the main system.
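To make the shared-memory arrangement concrete, the sketch below shows how a Host-side program could map the CiNIC's packet data memory and read a word from it once a driver exposes the region. This is a minimal illustration only: the device node "/dev/cinic", the mapping size, and the offsets are hypothetical, and the actual Host interface is the one detailed in [12].

```c
/*
 * Hypothetical Host-side view of the CiNIC shared packet-data memory.
 * The device node, mapping size, and offsets are illustrative only;
 * the real Host interface is described in [12].
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define CINIC_SHARED_MEM_SIZE (4 * 1024 * 1024)  /* assumed 4 MB window */

int main(void)
{
    /* "/dev/cinic" stands in for whatever node a CiNIC driver provides. */
    int fd = open("/dev/cinic", O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Map the data memory (on the OPB, reached through the PCI-to-OPB
     * bridge) into this process's address space. */
    volatile uint32_t *shared = mmap(NULL, CINIC_SHARED_MEM_SIZE,
                                     PROT_READ | PROT_WRITE, MAP_SHARED,
                                     fd, 0);
    if (shared == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return 1;
    }

    /* Because the memory appears local to the operating system, packet
     * data can be copied straight between a slot and a user buffer. */
    printf("first word of slot memory: 0x%08x\n", (unsigned)shared[0]);

    munmap((void *)shared, CINIC_SHARED_MEM_SIZE);
    close(fd);
    return 0;
}
```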

Figure 2. CiNIC Functional Architecture Diagram (upper FPGA: MicroBlaze with instruction and data BlockRAMs on the Local Memory Bus, SDRAM, interrupt, and PCI controllers, timers, the OPB Bridge Module, and PowerPC interface logic on the On-Chip Peripheral Bus; lower FPGA: IP Module, MAC Module, PHY Module, and PowerPC interface logic connecting to the Ethernet IC)

A Xilinx SystemACE Soft Controller (SystemACE SC), comprised of a small VirtexE FPGA, a Xilinx PROM, and a Flash memory, configures the CiNIC FPGAs upon power up or reset. The VirtexE acts as a controller to program the two main FPGAs with the desired images, which are stored in Flash memory. The main FPGAs may be programmed in several ways, with the configuration mode determined by three system switches. Additionally, the SystemACE SC allows the storage of up to eight system images, so the system may be reprogrammed simply by changing the system image selection switches and resetting the system.

2.2 Functional Architecture

Functionally, the CiNIC is broken into five main modules that perform different processing and are organized into a stack similar to the OSI Reference Model. The main modules are: the interface to the Ethernet physical layer IC (PHY Module), the Ethernet MAC layer module (MAC Module), the IP layer module (IP Module), the bridge between the stack FIFOs and the On-chip Peripheral Bus (OPB Bridge Module), and MicroBlaze. Figure 2 diagrams the functional architecture of the main system. The PI could be considered a sixth module, but it does not perform a predefined role in the main-path packet processing.

FIFOs implemented within the lower FPGA connect four of the modules together and allow each module to operate independently without per-cycle synchronization of data. Each FIFO can be generated independently with different specifications using the Xilinx CoreGenerator software. The FIFOs provide two interfaces, one read and one write, which are on different clock domains to allow each module to run at different clock frequencies. BlockRAM (BRAM) memory built into the FPGA is used to implement the FIFOs and allows much faster operation than using external memory.
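As a rough model of the decoupling these FIFOs provide, the C sketch below treats two neighboring stack modules as producer and consumer of frames through a small ring buffer. The fifo and cinic_frame types, their sizes, and the processing callback are hypothetical; the actual modules are VHDL components and the FIFOs are BRAM cores on the lower FPGA [14].

```c
/*
 * Conceptual model of the FIFO coupling between two CiNIC stack modules.
 * Types, depths, and names are assumptions; the real FIFOs are BRAM
 * cores between VHDL modules on the lower FPGA [14].
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define FIFO_DEPTH  8
#define FRAME_WORDS 16

typedef struct {
    uint32_t words[FRAME_WORDS]; /* CiNIC header, data words, CiNIC trailer */
    uint32_t length;             /* number of valid 32-bit words            */
} cinic_frame;

typedef struct {
    cinic_frame slots[FIFO_DEPTH];
    unsigned head, tail;         /* advanced only by producer / consumer    */
} fifo;

static bool fifo_push(fifo *f, const cinic_frame *in)
{
    if ((f->head + 1) % FIFO_DEPTH == f->tail)
        return false;            /* full: upstream module stalls            */
    f->slots[f->head] = *in;
    f->head = (f->head + 1) % FIFO_DEPTH;
    return true;
}

static bool fifo_pop(fifo *f, cinic_frame *out)
{
    if (f->tail == f->head)
        return false;            /* empty: downstream module waits          */
    *out = f->slots[f->tail];
    f->tail = (f->tail + 1) % FIFO_DEPTH;
    return true;
}

/* One step of a module: take a frame from the upstream FIFO, transform it,
 * and pass it downstream.  Modules never synchronize with each other
 * directly; they interact only through the FIFOs. */
static void module_step(fifo *upstream, fifo *downstream,
                        void (*process)(cinic_frame *))
{
    cinic_frame frame;
    if (fifo_pop(upstream, &frame)) {
        process(&frame);
        fifo_push(downstream, &frame);
    }
}

static void ip_process(cinic_frame *frame)
{
    (void)frame;                 /* placeholder for IP Module work          */
}

int main(void)
{
    fifo mac_to_ip = {0}, ip_to_bridge = {0};
    cinic_frame inbound = { .length = 4 };

    fifo_push(&mac_to_ip, &inbound);                  /* frame arrives      */
    module_step(&mac_to_ip, &ip_to_bridge, ip_process);
    printf("frames waiting for the next module: %u\n",
           (ip_to_bridge.head - ip_to_bridge.tail) % FIFO_DEPTH);
    return 0;
}
```

Because each hardware FIFO's read and write sides are on separate clock domains, each module can also be clocked independently; the software model above captures only the decoupled hand-off.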

The module functionality is similar to that specified by the OSI Reference Model with several exceptions. The PHY Module, which implements part of the data link layer, interfaces to the Ethernet physical layer IC, verifies and generates the Frame Check Sequence (FCS) for each frame, and performs the Medium Access Control (MAC) protocol used by Ethernet (CSMA/CD). The rest of the data link layer functionality is performed by the MAC Module, which generates Ethernet headers, ensures the proper frame length, and generates Address Resolution Protocol (ARP) replies for valid inbound ARP requests. Similar processing is done by the IP Module, which generates IP headers and verifies Transmission Control Protocol (TCP), User Datagram Protocol (UDP), and Internet Control Message Protocol (ICMP) checksums [14]. Communication between the stack FIFOs and the OPB is accomplished through the OPB Bridge Module. Finally, MicroBlaze performs the transport layer processing, namely UDP and TCP, along with interfacing to the Host system [15].

However, not all of the protocol processing may be performed by the CiNIC without limiting the Host system's architecture or functionality. IP routing and ARP processing still occur on a Host system using the CiNIC. Moving IP routing to the CiNIC would complicate host interfacing by requiring the CiNIC to have knowledge of all network interfaces, consume extra system resources as additional commands or packets are transferred across the PCI bus, or limit the system to only those interfaces present on the CiNIC. Moving ARP request generation onto the CiNIC would require additional resources for storage of the ARP table and of packets waiting for an ARP reply, and would prevent the serial processing of packets within the memory queues (see Section 3).

Communication between the modules occurs through messages encapsulated by a CiNIC header and trailer. Operation fields within the CiNIC header indicate whether the message is a packet to process or a command to perform. Distinct operation fields are included for the IP, MAC, and PHY Modules, but operation fields are not provided for the OPB Bridge Module and MicroBlaze since they process none or all of the packets, respectively. If the operation field specifies that the message is a command, then a separate command field contains the code corresponding to the command to perform. Commands are used to control system operation for such functionality as changing the local IP address and disabling the network interface. A final field of the CiNIC header is the byte enable, which is used by the PHY Module to transmit only the proper number of data bytes. The CiNIC trailer is always located after the last data word and contains several error and state fields. These fields are used by MicroBlaze to determine if the frame is valid. Fields indicate if the FCS, IP, ICMP, TCP, or UDP checksums fail, along with flags indicating if the frame has the proper IP address and is the proper length.

Modules do not use the CiNIC trailer of outbound frames since it is assumed the CiNIC does not introduce errors; in a similar way, modules do not use the CiNIC header of inbound frames since each module must examine every inbound frame. Optional header words located between the CiNIC header and the first data word are used by the IP and MAC Modules when generating protocol headers for outbound frames. MicroBlaze provides frame-specific information, such as destination addresses, within the optional headers.
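The paper names the fields of the CiNIC header and trailer but not their encoding; purely as an illustration, the C sketch below represents those fields in one plausible way. The field names, widths, and flag positions are assumptions, not the actual CiNIC format.

```c
/*
 * Illustrative representation of the CiNIC message encapsulation.
 * Field names, widths, and flag values are hypothetical; only the set
 * of fields comes from the text above.
 */
#include <stdint.h>

/* CiNIC header: per-module operation fields, a command code, and the
 * byte enable used by the PHY Module on the last data word.  Optional
 * header words for protocol-header generation would follow this. */
typedef struct {
    uint8_t ip_op;       /* operation for the IP Module                  */
    uint8_t mac_op;      /* operation for the MAC Module                 */
    uint8_t phy_op;      /* operation for the PHY Module                 */
    uint8_t command;     /* command code, used when an operation field
                            marks the message as a command               */
    uint8_t byte_enable; /* how many bytes of the final word to transmit */
} cinic_header;

/* CiNIC trailer: error and state flags examined by MicroBlaze. */
enum cinic_trailer_flags {
    CINIC_ERR_FCS    = 1 << 0,  /* Ethernet FCS check failed           */
    CINIC_ERR_IP     = 1 << 1,  /* IP checksum failed                  */
    CINIC_ERR_ICMP   = 1 << 2,
    CINIC_ERR_TCP    = 1 << 3,
    CINIC_ERR_UDP    = 1 << 4,
    CINIC_ADDR_MATCH = 1 << 5,  /* frame carries the proper IP address */
    CINIC_LEN_OK     = 1 << 6,  /* frame has the proper length         */
};

typedef struct {
    uint32_t flags;             /* combination of cinic_trailer_flags  */
} cinic_trailer;

/* MicroBlaze-style validity check on an inbound frame. */
static inline int cinic_frame_valid(const cinic_trailer *t)
{
    const uint32_t errors = CINIC_ERR_FCS | CINIC_ERR_IP | CINIC_ERR_ICMP |
                            CINIC_ERR_TCP | CINIC_ERR_UDP;
    return (t->flags & errors) == 0 &&
           (t->flags & CINIC_ADDR_MATCH) != 0 &&
           (t->flags & CINIC_LEN_OK) != 0;
}
```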
3. Memory Architecture

To facilitate data transfer between the CiNIC and Host system, a section of data memory physically located on the CiNIC board is mapped into the Host's address space. All data transferred between the Host and the CiNIC occurs through this shared memory. The memory is located on the OPB, with Host accesses occurring through the PCI-to-OPB Bridge intellectual property core and MicroBlaze accesses occurring as normal data memory accesses. With a memory-mapped architecture the Host has the ability to copy data directly from the CiNIC into the address space of the destination user process, since the Host operating system believes the memory to be located locally. While not a zero-copy architecture, the CiNIC architecture does remove one copy present in traditional systems.

The packet data memory is divided into a series of independent queues, each holding a different traffic flow. There are two primary queues for inbound data transfer (toward the Host) and two primary queues for outbound data transfer (toward the network). Each direction of traffic has a queue for data that go completely from Host to stack, or vice versa, and another queue for data that originate in MicroBlaze. For example, the two inbound queues are: a queue for frames that originate from the network and are transmitted up the stack, processed by MicroBlaze, and then read by the Host; and another queue for commands or command responses generated by MicroBlaze for the Host to process. Two optional queues are used for TCP processing within MicroBlaze. One TCP queue stores inbound TCP segments and the other stores outbound TCP segments. Separate TCP queues are used because of the possible out-of-order processing of TCP segments. Moving the out-of-order processing into separate queues allows the other queues to be handled strictly in order.

Queues are further divided into equally-sized slots, with each slot containing a single command or packet. Therefore, command and packet data, including the necessary control structures, are limited to the slot size. MicroBlaze, the OPB Bridge Module, and the Host system access the slots according to the Slot Protocol, which also provides for interrupt coalescing.
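A minimal sketch of how this queue and slot organization might be laid out in the shared data memory follows; the slot size, queue depth, and field names are assumptions, while the set of queues and the one-command-or-packet-per-slot rule come from the description above.

```c
/*
 * Illustrative layout of the CiNIC packet-data memory as independent
 * queues of fixed-size slots.  Sizes and names are assumptions.
 */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define SLOT_SIZE_BYTES 2048     /* assumed fixed slot size */
#define SLOTS_PER_QUEUE 32       /* assumed queue depth     */

/* Each slot holds exactly one command or one packet; the data and any
 * control structures must fit within the slot size. */
typedef struct {
    uint8_t data[SLOT_SIZE_BYTES];
} slot;

typedef struct {
    slot slots[SLOTS_PER_QUEUE];
} queue;

/* The queues named in the text: two primary queues per direction plus
 * two optional queues that isolate possibly out-of-order TCP segment
 * processing so the remaining queues stay strictly in order. */
typedef struct {
    queue inbound_network_to_host;   /* network -> stack -> MicroBlaze -> Host */
    queue inbound_from_microblaze;   /* commands/responses for the Host        */
    queue outbound_host_to_network;  /* Host -> MicroBlaze -> stack -> network */
    queue outbound_from_microblaze;  /* data originating in MicroBlaze         */
    queue tcp_inbound;               /* optional: inbound TCP segments         */
    queue tcp_outbound;              /* optional: outbound TCP segments        */
} cinic_packet_memory;

int main(void)
{
    /* Host, MicroBlaze, and the OPB Bridge Module can all locate a slot
     * from the same offsets within the shared region. */
    printf("inbound TCP queue starts %zu bytes into the region\n",
           offsetof(cinic_packet_memory, tcp_inbound));
    return 0;
}
```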

The Slot Protocol provides a mechanism that allows slot data to be passed between entities without having to copy data between memory locations while preventing data corruption. Two control bits associated with each slot determine which entity has ownership of the slot. When the data source initially enters data into the slot, all the control bits are at their default, cleared values. When the “Ready for MicroBlaze” bit is set, MicroBlaze may process the slot data before passing control of the slot to the data receiver. Setting the “Ready for Receiver” bit gives control of the slot to the receiving entity. After the slot data has been processed by the receiving entity, all slot control bits are cleared and the slot is once again under the control of the data source. A third control bit, the “To Receiver” bit, associated with each slot indicates if the slot should be processed by the receiving entity. If the bit is set, the receiving entity processes the slot data normally; if the bit is not set, the receiving entity immediately clears all of the slot control bits and ignores the data in the slot.

As mentioned above, the Slot Protocol provides a mechanism for interrupt coalescing. Interrupts are used between modules to notify each other when slots are ready for processing. However, it is not necessary in all cases to interrupt on every packet or command that becomes available. To reduce the number of interrupts generated by the modules, a Pending bit is associated with the Host and with the OPB Bridge Module, and two Pending bits are associated with MicroBlaze. A set Pending bit indicates that the module has been interrupted previously but has not yet begun processing slot data. Before interrupting another module, the Pending bit is checked to determine if it is necessary to generate the interrupt. It is important that the slot control bits are set before examining the Pending bit for modules that cause interrupts. For modules that are about to begin processing slots, it is important to clear the Pending bit before processing the first slot. If the slot control and Pending bits are not processed in this order, it is possible for slot processing to be postponed until the next interrupt.
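As a software sketch of the ownership handoff and interrupt coalescing just described, the C fragment below models the slot control bits and the Pending-bit checks. The bit positions and function names are hypothetical; only the ordering rules (set the control bits before testing Pending, clear Pending before processing the first slot) are taken from the text.

```c
/*
 * Sketch of the Slot Protocol handoff and Pending-bit interrupt
 * coalescing.  Bit positions and function names are hypothetical.
 */
#include <stdbool.h>
#include <stdint.h>

enum slot_bits {
    READY_FOR_MICROBLAZE = 1u << 0, /* MicroBlaze may process the slot     */
    READY_FOR_RECEIVER   = 1u << 1, /* receiving entity now owns the slot  */
    TO_RECEIVER          = 1u << 2, /* receiver should actually process it */
};

typedef struct {
    volatile uint32_t control;      /* slot control bits in shared memory  */
    /* ... slot data follows ... */
} slot_ctrl;

/* Data source fills a slot and hands it to MicroBlaze.  The control bits
 * are set BEFORE the Pending bit is examined; otherwise processing could
 * be postponed until the next interrupt. */
void source_publish(slot_ctrl *s, volatile uint32_t *mb_pending,
                    void (*interrupt_microblaze)(void))
{
    s->control = READY_FOR_MICROBLAZE;
    if (!*mb_pending) {             /* MicroBlaze not already interrupted  */
        *mb_pending = 1;
        interrupt_microblaze();
    }
}

/* MicroBlaze finishes its processing; 'deliver' says whether the receiving
 * entity should process the slot or simply reclaim it. */
void microblaze_forward(slot_ctrl *s, bool deliver)
{
    if (deliver)
        s->control |= TO_RECEIVER;
    s->control |= READY_FOR_RECEIVER;
}

/* Receiving entity: clear its Pending bit before touching the first slot,
 * then drain every ready slot, ignoring any whose To Receiver bit is clear. */
void receiver_drain(slot_ctrl *slots, unsigned nslots,
                    volatile uint32_t *own_pending,
                    void (*consume)(slot_ctrl *))
{
    *own_pending = 0;
    for (unsigned i = 0; i < nslots; i++) {
        if (!(slots[i].control & READY_FOR_RECEIVER))
            continue;
        if (slots[i].control & TO_RECEIVER)
            consume(&slots[i]);
        slots[i].control = 0;       /* slot returns to the data source     */
    }
}
```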
4. Conclusion

The architecture outlined in this paper presents a flexible and general foundation for network processing and other projects. Specifically, providing reprogrammable hardware and a hardcore processor allows developers using the CiNIC to implement functionality in many different ways. Additionally, an analysis of the architecture has been performed through simulations [16] and revealed that the only system limitations on 100 Mbps networks are those imposed by the FPGAs chosen for system production.

Uses for the CiNIC described here include research and development of network processing and protocol offloading, which is the focus of the CiNIC project, along with hardware/software co-design, embedded system, and distributed system projects. Additionally, the generality and flexibility of the architecture allow for many other uses where reprogrammable hardware and a hardcore processor would be advantageous.

References:

[1] P. Crowley, M. Fiuczynski, J.-L. Baer, & B. Bershad, Characterizing Processor Architectures for Programmable Network Interfaces. Proc. 2000 International Conference on Supercomputing, Santa Fe, NM, 2000.
[2] S. Karlin & L. Peterson, VERA: An Extensible Router Architecture. Computer Networks, 38(3), 2002.
[3] J. Lockwood, N. Naufel, J. Turner, & D. Taylor, Reprogrammable Network Packet Processing on the Field Programmable Port Extender (FPX). Proc. ACM International Symposium on Field Programmable Gate Arrays, Monterey, CA, 2001, 87-93.
[4] K. Mackenzie, W. Shi, A. McDonald, & I. Ganev, An Intel IXP1200-based Network Interface. Proc. 2nd Annual Workshop on Novel Uses of System Area Networks, Anaheim, CA, 2003.
[5] T. Spalink, S. Karlin, L. Peterson, & Y. Gottlieb, Building a Robust Software-Based Router Using Network Processors. Proc. 18th ACM Symposium on Operating Systems Principles, Banff, Alberta, Canada, 2001.
[6] K. Underwood, R. Sass, & W. Ligon III, A Reconfigurable Extension to the Network Interface of Beowulf Clusters. Proc. IEEE Conference on Cluster Computing, Los Angeles, CA, 2001.
[7] M. Wong & A. Liddicoat, FPGA Based Encryption/Decryption Network Co-Processor. Proc. International Symposium on Low-Power and High-Speed Chips, Yokohama, Japan, April 2004.
[8] Network FPGA Project Homepage. http://yuba.stanford.edu/NetFPGA/, June 2004.
[9] Gigabit FPGA-based NIC Homepage. http://cmclab.rice.edu/projects/giganic/, June 2004.
[10] Compaq Computer Corporation, Intel Corporation, & Microsoft Corporation, Virtual Interface Architecture Specification, Version 1.0, 1997.
[11] M. Fiuczynski, R. P. Martin, T. Owa, & B. N. Bershad, SPINE: A Safe Programmable and Integrated Network Environment. Proc. 8th ACM SIGOPS European Workshop, Sintra, Portugal, 1998.
[12] S. Thomas, The Third Generation CiNIC Host Interface. CSC Master's Thesis, California Polytechnic State University, San Luis Obispo, June 2004.
[13] A. Angeles, Evaluation of Hardware and Software Requirements for a Processor Island Subsystem on CiNIC. EE Master's Thesis, California Polytechnic State University, San Luis Obispo, June 2004.
[14] J. Chiang, MAC and IP Protocol Implementation for the Third Generation CiNIC. CPE Senior Project Report, California Polytechnic State University, San Luis Obispo, June 2004.
[15] S. Gee, The Third Generation CiNIC Transport Layer Protocols. CPE Senior Project Report, California Polytechnic State University, San Luis Obispo, June 2004.
[16] K. Kredo II, Design and Development of the Third Generation CiNIC. EE Master's Thesis, California Polytechnic State University, San Luis Obispo, June 2004.
