A Versatile UDP/IP based PC ↔ FPGA Communication Platform

Nikolaos Alachiotis, Simon A. Berger, Alexandros Stamatakis The Exelixis Lab, Scientific Computing Group Heidelberg Institute for Theoretical Studies Heidelberg, Germany Emails: {Nikolaos.Alachiotis,Simon.Berger,Alexandros.Stamatakis}@h-its.org

Abstract—We present a substantially improved version of Nonetheless, this novel and more flexible UDP/IP core still our popular UDP/IP core for simple and fast PC ↔ FPGA outperforms all competing UDP/IP core implementations in communication over Gigabit . We provide a novel terms of speed and resource requirements. feature to automatically configure (previously hard-coded) internal settings on the FPGA. Thereby, we substantially reduce Since February 2010, the proof-of-concept implementa- the installation overhead when a FPGA shall communicate with tion has been downloaded over 6,000 times from OpenCores. several different PCs. The UDP/IP core is designed to occupy org (http://opencores.org/project,udp ip core) and from a minimum amount of hardware resources on the FPGA. On our software page (http://www.exelixis-lab.org/countIPv4. the PC side, this new automatic configuration protocol can be php). Note that both Xilinx [2] and Altera [3] provide used and invoked via a C software interface which provides convenient functions for setting up the connection to the FPGA PCI Express-based solutions for PC ↔ FPGA communi- device and sending/retrieving arrays of common C data types cation. However, the interest our initial core has gener- to/from the UDP/IP core on the FPGA. The initial UDP/IP core ated underlines the apparent lack of a standardized, open- version is available under the LGPL license at http://opencores. source Ethernet-based communication mechanism. Thus, in org/project,udp ip core while the improved version of the addition to the improved UDP/IP core architecture, we core, including the C software interface (also under LGPL), is available at http://opencores.org/project,pc fpga com. also present an implementation of a comprehensive open- source PC ↔ FPGA communication platform that relies Keywords-FPGA; UDP/IP; PC-FPGA communication; on Gigabit Ethernet. This platform deploys the versatile version of the UDP/IP core and implements a simplified I.INTRODUCTION communication protocol that facilitates usage and integration FPGAs are commonly deployed as accelerator devices or of the send/receive paradigm for common C data types for verifying architectures. Irrespective of their particular and arrays. Our platform hides the complexity inherent to use, a machinery for moving data to/from the device is establishing a PC ↔ FPGA connection and to synchronizing required. Recently, we presented a proof-of-concept imple- data transmissions in both directions. The versatile UDP/IP mentation of a highly efficient UDP/IP core architecture that core and the communication platform are available for is optimized for point-to-point communication [1]. We found download at http://opencores.org/project,pc fpga com. that combining the IPv4 and UDP protocols represents an The remainder of this paper is organized as follows: efficient solution for establishing a connection between a Section II provides a short overview of related work on PC and a FPGA, with respect to reconfigurable hardware UDP/IP and TCP/IP cores as well as PC ↔ FPGA com- utilization. We achieve low resource utilization because the munication platforms. In Section III, we describe the design architecture employs a simplified method for calculating the of the novel and more versatile UDP/IP core. The PC ↔ IPv4 Header Checksum based on a pre-calculated template. FPGA communication platform is introduced in Section IV. Here, we present an improved UDP/IP core implemen- Section V covers implementation and verification details as tation and a comprehensive Ethernet-based communication well as a performance evaluation of the UDP/IP core and the platform. A key property of the UDP/IP core is that all PC ↔ FPGA platform. Section V also includes application static header fields required for point-to-point communica- examples to demonstrate the functionality of the proposed tion are stored in a lookup table. In the proof-of-concept solution. We conclude in Section VI and address directions implementation, this required the user to manually initialize of future work. the lookup table before configuring the device. The improved architecture we present here entails additional control logic II. RELATED WORK that retrieves all static fields of an incoming initialization In [4], L¨ofgren et al. present three different (commer- packet from the PC. This data is then stored in the respective cially available) stand-alone UDP/IP cores and denote them lookup table. This versatility of the new UDP/IP core as minimum, medium, and advanced implementations. The comes at a cost: the additional logic occupies approximately medium and advanced IP cores offer additional protocols 56% more hardware area than the proof-of-concept core. which are not necessary for establishing a direct PC ↔ to from user interface FPGA communication. Thus, we do not compare these EMAC TRANSMITTER HLUT wraddr,wren,din heavy-weight cores with our implementation because such reg a comparison would be unfair. The minimum IP core offers wren,value the same functionality as our Gigabit-speed UDP/IP core but RST_U IPv4 Header Checksum rst HLUT requires 2.8 times more hardware resources while operating Register CONTROL at 10/100 Mbps only. CONFIG_U from In [5], K¨uhn et al. used UDP/IP to implement PC ↔ EMAC config FPGA communication over Gigabit Ethernet. To implement RECEIVER to user interface the communication mechanism, the authors deployed an operating system (Linux) that was running on an embedded Figure 1. Top-level design of the advanced UDP/IP core. PowerPC processor. While using a processor that handles the communication represents an elegant and straightforward approach, not all reconfigurable devices contain such hard- coded processors. When such a processor is not required for of the 802.3 MAC frame are the Destination Address, the conducting additional computations, deploying a soft-coded Source Address, and the Ethernet Type. The IPv4 header sec- processor can lead to excessive hardware utilization on the tion contains the following fields: Version, Header Length, device. Our novel communication platform essentially offers Differentiated Services, Total Length, Identification, Flags, the same functionality but does not require an embedded Fragment Offset, Time to Live, Protocol, Header Checksum, processor. Source Address, and Destination Address [11]. The UDP Dollas et al. [6] presented an architecture for an open header section consists of the Source Port, the Destination TCP/IP core that supports all necessary protocols (ARP, Port, the Length, and the Checksum [12]. When only a ICMP, UDP) typically required for real-world applications. direct point-to-point connection between a PC and a FPGA The architectural complexity that is induced by the large is considered, all of the above header fields are constant number of supported protocols would yield a comparison to with the exception of the IPv4 Total Length and Header our light-weight design unfair. Note that a full TCP/IP core is Checksum fields as well as the UDP Length field. The idea not necessary for direct PC ↔ FPGA communication since of storing constant fields of UDP point-to-point connections this can be implemented using less complex protocols and in lookup tables has been used before [4]. However, the a lower amount of resources. required hardware resources are further reduced in our design because the calculation of the IPv4 checksum field In [7], Lin et al. presented a PC ↔ FPGA based commu- nication platform for verification and fast prototyping. The can be implemented by a single subtraction. platform communicates via a PCI interface and is intended During a transmission of an Ethernet packet from the to facilitate verification of multimedia applications. There FPGA to the PC, all static/constant fields are retrieved exist similar platforms as the one presented by Lin et al. that directly from the HLUT. The three variable fields (Total target specific application types [8], [9]. Our PC ↔ FPGA Length, Header Checksum, Length) can be calculated at communication platform offers a mechanism for integrating low cost using two additions and one subtraction. The communication routines into any kind of application and initial architecture contained three hard-coded values that allows for transferring arrays of the basic IEEE-754 data were used as the first operand of the two adders and the types (e.g., characters, short integers, integers, floats, long subtracter, while the second operand in all three operations integers, and doubles). was the length of the user data. The hard-coded values were Recently, Lieber and Hutchings [10] presented an open- set to the minimum IPv4 Total Length and UDP Length source PC ↔ FPGA communication framework for Xilinx values, that is, the offsets to the respective header sections FPGA boards achieving a maximum throughput of 504 and the IPv4 Header Checksum field of an Ethernet packet Mbps. Our PC ↔ FPGA communication platform provides without any user data attached to it. Based on the amount similar functionality and can operate at Gigabit speed (1000 of data (number of bytes) transferred with the packet, the Mbps) while requiring 50% less reconfigurable resources. corresponding length fields are calculated by two additions. The subtraction is used to calculate the correct IPv4 Header III. VERSATILE UDP/IP CORE ARCHITECTURE Checksum as a function of the IPv4 Total Length field. A In the following, we briefly review the basic concept of more detailed description of this procedure and a proof that the initial UDP/IP core and describe the enhancements in the a single subtraction is sufficient can be found in [1]. new version. As already mentioned, all static header fields Figure 1 illustrates the versatile UDP/IP core architecture. required for direct point-to-point communication are stored The block diagram of the TRANSMITTER and a description in a lookup table (henceforth denoted as HLUT). The fields of the RECEIVER components are provided in [1]. The new stored in HLUT form part of the 802.3 MAC frame, the IPv4 RST U, CONFIG U, and HLUT CONTROL components header section, and the UDP header section. The main fields have been integrated to increase the flexibility of the core. C−Pack R−Pack D−Pack The RST U component monitors the first byte of the user data in every incoming packet. The purpose of this unit Request Data Header Configuration Data Type Section is to detect a dedicated byte-code that triggers a reset of Code Code Type the HLUT CONTROL. A similar function (also using a Data Data Length dedicated byte-code) is implemented in the CONFIG U User Data Section component for detecting a configuration enable code that triggers the HLUT initialization process. The demand for Data Length receiving the reset and configuration enable codes in distinct packets represents a less error-prone solution, especially with Figure 2. Supported packets in the Configuration and Data Protocol. respect to the fact that the UDP/IP core might not always be used with the provided software interface. The HLUT CONTROL is implemented as a 5-state FSM In Section IV-B, the hardware interface is described, while (Finite State Machine). When the device is configured, the Section IV-C covers the software interface. FSM transitions from the default reset state to the idle state. A. Configuration and Data Protocol It will remain in idle state until a configuration enable signal arrives from the PC. Upon arrival, the FSM transitions We propose a Configuration and Data Protocol (hence- to pre-configuration state and waits for the next incoming forth denoted as CDP) which offers a simple communi- packet. This packet is used as reference packet to retrieve cation model for configuring the system and transferring the required static fields and must not contain any user data variable size user data. However, in analogy to the User since the IPv4 Header Checksum value of the packet will Datagram Protocol (UDP [12]), it does not provide reliability be stored in a respective 16-bit register. When this empty and data integrity mechanisms. There exist three types packet has arrived, the FSM then switches to configuration of packets: configuration packets, data packets, and data state. In this state, all incoming header fields are redirected request/retrieval packets. In CDP, we henceforth denote a to the HLUT that resides in the TRANSMITTER along configuration packet as C-Pack, a data request packet as R- with write addresses and write-enable signals. In addition Pack, and a data packet as D-Pack. Figure 2 illustrates the to controlling the HLUT initialization process, the HLUT three packet formats. CONTROL component also stores the value of the IPv4 A C-Pack consists of a header field with a length of a Header Checksum (see above). After these initialization single byte that contains a configuration code. Two config- steps, the FSM transitions to the lock state which then allows uration codes are supported in order to reset and enable the the TRANSMITTER to transmit regular user data. To pre- configuration process of the UDP/IP core (see Section III). vent the transmission of faulty packets, the TRANSMITTER A R-Pack consists of a three-byte-long header field that can not initiate any transmissions to the PC if the HLUT contains a data request code, a data type, and a data length CONTROL is not in the lock state. field. Using the R-Pack, the software on the PC can initiate a These additional components, allow to setup and use the transmission of data from the FPGA back to the PC. Finally, UDP/IP core for communication with any PC in a seamless the D-Pack consists of a byte-long Data Type header field manner, that is, without requiring manual code changes. The followed by the user data. only requirement for the platform to work properly is that the PC must transmit three packets to the FPGA that contain B. Hardware Interface the reset code, the configuration code, and one packet that On the FPGA side, our implementation (Figure 3) pro- does not contain any data. This can be easily encapsulated vides a well-defined interface that allows the reconfigurable in an initialization function. We describe the implementation architecture to transmit and receive all supported data types. of such a general communication platform that hides the Furthermore this interface provides all the required connec- complexity of the underlying design (based on this improved tions to the Media Access Controller (MAC). version of the UDP/IP core) in the following section. For receiving packets, two active-high signals indicate the start and end of data. Within the time-frame defined by these IV. PC ↔ FPGA COMMUNICATION PLATFORM signals, a valid data signal is deployed to denote the clock In the following, we describe the software/hardware in- cycles during which the data buses contain valid user data. terface and a simple communication protocol that provide There exist four data buses with different lengths that allow (in conjunction with the UDP/IP core) a complete PC ↔ for receiving character data over the 8-bit bus, short integer FPGA communication solution. The key design objective data over the 16-bit bus, integers and floats over the 32-bit is to hide the complexity of (i) setting up a PC ↔ FPGA bus, and long integers and doubles over the 64-bit bus. An connection and (ii) transmitting and receiving variable data additional 3-bit data type signal indicates which of the four types. In Section IV-A, we introduce a simplified protocol data buses contains valid data at each point in time. Three for configuring the UDP/IP core and transmitting user data. bits are required to select among buses and to distinguish type FPGA2PC user enable For sending data from the PC to the FPGA, the software interface splits up the arrays provided by the user into sep- HEADER PACK REG SIZE length arate UDP packets and can optionally also perform endian swapping [our current target PC platform (x86 32- or 64-bit FPGA2PC Linux) uses little endian]. For receiving data, the software FSM RDADDR rdaddr GEN implementation can concatenate multiple UDP packets into bus8 larger arrays (including endian conversion) which are then bus16 DATA bus32 returned to the application that invoked the receive function. SEL bus64 In contrast to the functions for sending data to the FPGA, UDP/IP the receive functions make use of a FIFO receive buffer CORE HEADER TYPE type and a background reader thread to improve performance CHECK CNTRL (see below). The background reader thread constantly reads TO/FROM EMAC incoming UDP packets and stores them in a FIFO list. VALID vld_o CNTRL The array splitting and re-assembly part works as fol- bus8 lows: The Ethernet standard defines a maximum packet size DATA bus16 (MTU=maximum transmission unit) that has to be supported bus32 SEL bus64 by standardized Ethernet equipment. Thus, in the worst case, PC2FPGA an IP packet can carry a maximum payload of only 1,500 bytes. Note that, for specific hardware interfaces, the MTU Figure 3. Hardware design of the communication platform. can potentially be larger. The Intel network interface we used in our tests supports a MTU of 9,000 bytes. Because the MTU is hardware-dependent, we provide an option to between the respective data types (e.g., 32-bit integers or explicitly set the appropriate MTU. If the user intends to floats on the 32-bit bus). transmit an array that exceeds the MTU (e.g., 1,000 double To transmit data from the reconfigurable architecture to values, which require 8,000 bytes while the MTU is 1,500 the PC, a transmission enable pulse is required. In synchrony bytes), the data is split into separate UDP packets which with the enable pulse, the input type port must contain the are sent one-by-one to the FPGA. Conversely, if the user respective Data Type code (e.g., character, short integer, requests to receive 1,000 double values, the library expects integer, etc.) and the length port must contain the number of the data to be split into sufficiently small UDP packets by data items to be sent. The address generator component will the FPGA. In this case, the PC interface will read as many then compute respective read addresses on the rdaddr bus UDP packets as required to complete the receive request and that must be connected to on-chip memory with a latency of re-assemble the array from the individual packets. one clock cycle. The Data Sel component selects between The background reader thread has been introduced to the four data input buses to transmit the specified data type. alleviate problems that can occur when the PC is not able to Depending on the data transmission type (e.g., character, receive UDP packets that are sent by the FPGA fast enough. short integer, integer, etc.), the read addresses are generated This can happen in the following scenario (assuming a at different speeds. Therefore, the user design (reconfig- single-threaded implementation on the PC): The PC sends a urable architecture) is notified when the transmission has UDP packet to the FPGA for retrieving a data array that can been completed by a transmission over signal. For the sake consist of multiple UDP packets. After the request packet of simplicity, the transmission over signal as well as the start (R-Pack) has been sent, a system call for receiving the first and end of data signals have been omitted from Figure 3. reply packet is issued on the PC. Our UDP/IP core has very short latencies between receiving a R-Pack and sending C. Software Interface back the reply packets. Therefore it is possible, and indeed The software interface offers three categories of functions quite common according to our experiences, that the reply that are used for configuration, sending data, and receiving packets from the FPGA are already on their way before data. The configuration (initialization) functions are used to the PC is ready to receive them. When using a standard establish an initial connection with the FPGA by transmit- Linux configuration and consumer network hardware, only ting C-Packs. The data transmission functions are used for limited buffer space for incoming UDP packets is typically sending arrays of C data types (characters, short integers, available. Thus, it is likely that some of the reply packets integers, floats, long integers, and doubles) to the FPGA. from the FPGA are discarded before they can actually be This is accomplished by transmitting D-Packs. The third and read by the user process on the PC. To solve this problem, final class of functions allows for receiving basic data type it is necessary to already issue the system call of the receive arrays. R-Packs can be transmitted using the respective data function prior to sending an R-Pack to the FPGA. The reception function. background reader approach solves this problem as follows: ↔ A dedicated thread is started by the background reader UDP/IP Core PC FPGA Initial Versatile Platform mechanism upon initialization of the communication library. Slice Registers 79 127 408 This thread constantly reads incoming packets and stores Slice LUTs 155 195 562 them in a sufficiently large FIFO buffer. When the user sends Occupied Slices 67 105 272 a R-Pack, the background reader thread will already have Frequency (MHz) 261 262 181 issued the receive call and can therefore reliably receive the reply packets. The user program can then retrieve the Table I packets from the FIFO buffer in the same order as they were RESOURCES OCCUPIED ON A VIRTEX 5 SX95T-1 FPGA BY THE INITIAL AND VERSATILE UDP/IP CORE IMPLEMENTATIONS AND THE received by the background reader thread. The background PC ↔ FPGA PLATFORM HARDWARE PART (INCL. UDP/IP CORE). reader thread works transparently in the background and does therefore not require any multi-threaded programming in the application that uses the communication interface. L¨ofgren core our UDP/IP core The background reader is implemented in C++ using the Xilinx Slices 517 184 boost (www.boost.org) libraries for multi-threading. To en- Xilinx BRAMS 3 0 sure portability, we provide an ANSI-C standard-compliant FMax(MHz) 90.7 128.8 Duplex Mode FULL FULL application interface for the background reader. The re- Length(Bytes) 256 1472* maining functions for FPGA initialization and sending data Speed(Mbps) 10/100 10/100/1000 packets are directly implemented in ANSI-C. Table II V. EXPERIMENTAL RESULTS PERFORMANCE COMPARISON ON A SPARTAN3 XC3S200-4 FPGA: UDP/IP CORE [4] VS OUR IMPROVED IMPLEMENTATION.*LIMITATION Initially, we verify the functionality of the enhanced IMPOSED BY THE EMAC CONFIGURATION (XILINX EMAC WRAPPER UDP/IP core and the PC ↔ FPGA hardware interface FILES GENERATED BY CORE GENERATOR). (Section V-A). Thereafter, we provide a performance and resource utilization comparison (Section V-B) with the previous version of our UDP/IP core and the minimum commercial UDP/IP core [4]. Finally, we describe three we mapped the UDP/IP core on the same FPGA (Spartan application test cases that demonstrate the functionality of 3 XC3S200-4) that was used by L¨ofgren et al. in their our communication platform (Section V-C). paper. On this FPGA, our UDP/IP core architecture, unlike the commercial design, can operate at Gigabit speed while A. Verification occupying almost 2.8 times less hardware resources. To verify the correctness of the enhanced UDP/IP core C. Test Applications implementation, we performed post-place-and-route simu- lations and tests on a real FPGA board (HTG-V5-PCIE To demonstrate the potential of our communication plat- development platform) connected to a DELL Latitude e4300 form, we implemented the following three test cases: notebook running Linux via a standard CAT5 twisted-pair Basic Test: In the first experiment, we offload a simple Ethernet cable. We used Chipscope Pro Analyzer to monitor for-loop that calculates the natural logarithm of the values the input and output ports of the UDP/IP core which were stored in an array of floats to the FPGA. We used the connected to the EMAC Local Link Wrapper. In addition, respective platform function to send the input array of floats we monitored the incoming and outgoing signals of the to the FPGA. On the FPGA, we connected the 32-bit output PC2FPGA and FPGA2PC components (see Figure 3) to bus to a LAU (Logarithm Approximation Unit [13], [14]) for verify correctness of the hardware interface on the FPGA. computing the logarithms on the FPGA. The output values of the LAU are stored in a memory block on the FPGA. B. Resource Utilization Finally, the respective software function for requesting an The communication system was mapped on a Virtex 5 array of floating-point values was used to retrieve this array SX95T-1 FPGA. Table I shows the resources occupied by the from the FPGA. proof-of-concept and enhanced versions of our UDP/IP core. Phylogenetic Alignment Kernel: In the second experi- The table shows that the enhanced UDP/IP core occupies ment, we focus on verifying a more complex reconfigurable 56% more FPGA slices than the initial implementation. architecture that we designed for accelerating a phylogeny- The additional hardware resources required reflect the price aware short read alignment kernel [15]. Here, we used the that has to be paid for increasing flexibility. Nevertheless, respective communication functions for sending character Table II also shows that our improved UDP/IP core still arrays in order to transmit bit-encoded DNA sequences from outperforms the commercial solution [4] with respect to the PC to the FPGA. The results of this phylogeny-aware hardware resource utilization. To conduct a fair comparison short read alignment process are 16-bit integers/scores that between our implementation and the one described in [4], are subsequently retrieved by the PC through the 16-bit FPGA2PC data bus. [5] W. K¨uhn, C. Gilardi, D. Kirschner, J. Lang, S. Lange, M. Liu, Phylogenetic Parsimony Kernel: The reconfigurable T. Perez, S. Yang, L. Schmitt, D. Jin, L. Li, Z. Liu, Y. Lu, architecture we used for the third experiment was de- Q. Wang, S. Wei, H. Xu, D. Zhao, K. Korcyl, J. Otwinowski, P. Salabura, I. Konorov, and A. Mann, “Fpga based compute signed to accelerate the parsimony kernel [16] for build- nodes for high level triggering in panda,” in Journal of ing phylogenetic trees as implemented in the open-source Physics: Conference Series, vol. 119, 2008, pp. 22–27. Parsimonator code [17]. The data transfers to the FPGA are similar to those in the alignment kernel (see above), since [6] A. Dollas, I. Ermis, I. Koidis, I. Zisis, and C. Kachris, “An we need to transfer bit-encoded DNA character arrays to open tcp/ip core for reconfigurable logic,” in Proceedings of the 13th Annual IEEE Symposium on Field-Programmable the FPGA. Thus, the 8-bit bus provided by the PC2FPGA Custom Computing Machines (FCCM’05), 2005, pp. 297– component was used to receive nucleotide sequence data 298. as well as to trigger parsimony computation requests on the FPGA. The resulting parsimony scores (32-bit integer [7] Y.-L. Lin, C.-P. Young, Y.-J. Chang, Y.-H. Chung, and S. A.W.Y, “Versatile pc/fpga based verification/fast prototyp- values) are then retrieved by the parsimonator software on ing platform with multimedia applications,” in Instrumenta- the PC through the FPGA2PC 32-bit bus. tion and Measurement Technology Conference, 2004. IMTC In these test scenarios (as for the phylogenetic kernels), 04. Proceedings of the 21st IEEE, vol. 2, May 2004, pp. thousands of input arguments and hundreds of scores were 1490–1495. reliably transmitted back and forth within a few seconds [8] D.-S. Kang, S. Y. Hwang, K.-S. Jhang, and K. Yi, “A low cost using our communication platform, thereby allowing for and interactive rapid prototyping platform for digital system extensive real-world testing of our reconfigurable systems. design education,” Microelectronics Systems Education, IEEE International Conference on/Multimedia Software Engineer- VI. CONCLUSION ing, International Symposium on, vol. 0, pp. 95–96, 2007. We presented a significantly enhanced version of our widely-used open-source UDP/IP core for efficient direct [9] P. Schumacher, M. Mattavelli, A. Chirila-Rus, and R. Turney, “A software/hardware platform for rapid prototyping of video PC ↔ FPGA communication. The improved version allows and multimedia designs,” System-on-Chip for Real-Time Ap- for automatic configuration of the UDP/IP core. In addi- plications, International Workshop on, vol. 0, pp. 30–33, tion, we introduce a light-weight communication protocol 2005. and provide an appropriate software/hardware interface and [10] P. Lieber and B. Hutchings, “FPGA Communication Frame- communication library implementation. This library allows work,” in Field-Programmable Custom Computing Machines for easy integration with C or C++ codes. Our versatile (FCCM), 2011 IEEE 19th Annual International Symposium hardware/software interface completely hides the complexity on. IEEE, 2011, pp. 69–72. inherent to establishing communication (configuring the UDP/IP core) and transmitting arrays of variable size con- [11] J. Postel, “,” RFC 791 (Standard), Internet Engineering Task Force, Updated by RFC 1349, September taining standard C data types such as characters, short 1981. integers, integers, floats, long integers, and doubles. Future work will focus on providing a more generic [12] ——, “,” RFC 768 (Standard), Internet implementation that can be mapped to any device by any Engineering Task Force, August 1980. vendor. We also plan to extend the communication protocol [13] N. Alachiotis and A. Stamatakis, “Efficient floating-point to compensate for packet loss and/or corruption by means logarithm unit for FPGAs,” in Parallel & Distributed Pro- of automatic packet re-transmission (e.g., using time-outs). cessing, Workshops and Phd Forum (IPDPSW), 2010 IEEE International Symposium on. IEEE, 2010, pp. 1–8. REFERENCES [14] ——, “A Vector-Like Reconfigurable Floating-Point Unit [1] N. Alachiotis, S. A. Berger, and A. Stamatakis, “Efficient PC- for the Logarithm,” International Journal of Reconfigurable FPGA Communication over Gigabit Ethernet,” in CIT, 2010, Computing, 2011. pp. 1727–1734. [2] Xilinx, “Bus Master DMA Performance Demonstration [15] N. Alachiotis, S. Berger, and A. Stamatakis, “Accelerating Reference Design for the Xilinx Endpoint PCI Express Phylogeny-Aware Short DNA Read Alignment with FP- Field-Programmable Custom Computing Machines Solutions.” [Online]. Available: http://www.xilinx.com/ GAs,” in (FCCM), 2011 IEEE 19th Annual International Symposium support/documentation/application notes/xapp1052.pdf on. IEEE, 2011, pp. 226–233. [3] Altera, “PCI Express Compiler: x1, x4, and x8 MegaCore Functions.” [Online]. Available: http://www.altera.com/ [16] N. Alachiotis and A. Stamatakis, “FPGA Acceleration of products/ip/iup/pci-express/m-alt-pcie8.html the Phylogenetic Parsimony Kernel?” in Field Programmable Logic and Applications (FPL), 2011 International Conference [4] A. L¨ofgren, L. Lodesten, S. Sj¨oholm, and H. Hansson, on. IEEE, 2011, pp. 417–422. “An analysis of FPGA-based UDP/IP stack parallelism for embedded Ethernet connectivity,” in Proceedings of the 23rd [17] A. Stamatakis, Parsimonator, http://www.exelixis- IEEE NORCHIP Conference, November 2008, pp. 94–97. lab.org/software.html.