10Gbps TCP/IP Streams from the FPGA for the CMS DAQ Eventbuilder Network

TWEPP 2013
10Gbps TCP/IP streams from the FPGA for the CMS DAQ Eventbuilder Network
Petr Žejdl, Dominique Gigi on behalf of the CMS DAQ Group
26 September 2013

Outline
● CMS DAQ Readout System
  – Upgrade
  – DAQ2 Proposed Layout
● TCP/IP
  – Overview, Introduction
  – Simplification
  – Implementation
● FEROL
  – Introduction, block diagram
  – Modes of operation
  – TCP Engine
● Measurements
  – Point-2-point Measurements
  – Stream/Link Aggregation
● Summary

Current CMS DAQ Readout System
● Current system based on SLINK64 and Myrinet network
  – A sender (FED) card implementing an electrical LVDS link running at 400 MByte/s (3.2 Gbit/s)
  – A receiver (FRL) card
    ● Receives the SLINK data and performs CRC checking
    ● Interfaces to commercial Myrinet hardware
    ● Myrinet NIC runs custom firmware designed by DAQ group
[Figure labels: Detector Front-End Driver (FED); SLINK64 mezzanine; cable up to 10 m, 400 MB/s; Front-end Readout Link (FRL); 1 or 2 optical links to Myrinet NIC]

[Photo: SLINK cables going into FRLs]

Motivation for the Upgrade
● End-of-life of almost all PC and networking equipment
  – Hardware is more than 5 years old
  – The system was purchased in 2006 and installed in 2007
  – Myrinet PCI-X cards and PCs with PCI-X slots are difficult to buy today
● Benefit from technology evolution
  – New PCs with multicore CPUs and NUMA architecture
  – 10/40 Gbit/s Ethernet and 56 Gbit/s IB FDR network equipment
● New uTCA based FEDs will be in operation after LS1
  – DAQ group developed a point-2-point optical link: SlinkXpress
    ● Simple interface to custom readout electronics
    ● Reliable link, data are retransmitted in case of error
    ● Current implementation runs at up to 6.3 Gbit/s or at 10 Gbit/s
    ● IP core is available for Altera and Xilinx FPGAs

Requirements for Subsystem Readout
● A new link to replace the Myrinet network is required
● Requirements:
  – L1 trigger rate up to 100 kHz
  – Sufficient bandwidth
    ● Legacy S-link (electrical LVDS) FEDs with 3.2 Gbit/s (400 MByte/s)
    ● New (uTCA, optical link based) FEDs with 6 Gbit/s (in future 10 Gbit/s)
  – Reliable (loss-less) connection between underground and surface
● The new readout link discussed in this presentation is the replacement for the Myrinet network

DAQ2 Proposed Layout
[Diagram labels: S-link/Optical; Custom Hardware; 10 Gbit/s Ethernet; Underground; Surface; 40 Gbit/s Ethernet; Commercial Hardware; 56 Gbit/s Infiniband; 40/10/1 Gbit/s Ethernet]

DAQ2 Proposed Layout (2)
[Diagram with the same labels as the first layout slide]

FEROL Introduction
● Front-End Readout Optical Link (FEROL)
  – Interface between custom and commercial hardware/network
  – Replaces the Myrinet NIC with a custom FPGA based NIC card
● Input:
  – Legacy S-link input via FRL
  – SlinkXpress interface
    ● 2x optical 6 Gbit/s interface
    ● 1x optical 10 Gbit/s interface
● Output:
  – Optical 10 Gbit/s Ethernet link
  – Optional second 10 Gbit/s Ethernet link
  – Runs a standard protocol: TCP/IP over 10 Gbit/s Ethernet

TCP/IP
● Benefits of using TCP/IP
  – TCP/IP guarantees reliable and in-order data delivery
    ● Retransmissions deal with packet loss
    ● Flow control respects the occupancy of the buffers in a receiving PC
    ● Congestion control allows transmitting multiple streams on the same link (link aggregation)
  – Standard and well known protocol suite (almost)
  – Implemented in all mainstream operating systems
  – Debugging and monitoring tools widely available (tcpdump, wireshark, iperf, …)
  – Network composed from off-the-shelf hardware, multiple vendors
● Don't re-invent a reliable network but make use of available software and
commercial hardware

TCP Implementation
● In principle a very difficult task for an FPGA
  – TCP/IP is a general purpose protocol suite
  – Even for a PC, TCP/IP is a very resource hungry protocol
  – ~15 000 lines of C code in the Linux kernel for TCP alone
● Considerations
  – CMS DAQ network has a fixed topology
  – The data traffic goes only in one direction, from FEROL to Readout Unit (PC)
  – The aggregated readout network throughput is sufficient (by design) to avoid packet congestion and packet loss
● Can we simplify the TCP?

TCP Implementation (2)
● Robustness Principle [RFC 793]
  – TCP implementations will follow a general principle of robustness: Be conservative in what you do, be liberal in what you accept from others.
● Following the robustness principle, we simplified the TCP sender. The receiving PC (with a full TCP/IP stack) will handle the rest
  – FEROL is a client, PC is a server
  – FEROL opens a TCP connection
  – FEROL sends the data to the PC, data flows in one direction from the client to the server
    ● Acknowledge packets are sent back, they are part of the protocol
  – TCP connection is aborted instead of closed. Connection abort is unreliable and should be initiated by the server (PC).
  – Use simple congestion control

TCP Implementation (3)

TCP Implementation (4)
We don't listen (we are only a client) / we don't receive any data

TCP Implementation (5)
ABORT/RST
We do a connection abort instead of a connection close

TCP Implementation (6)
ABORT/RST
Final State Diagram

But not so simple...
Implementation and Simplifications
● Implemented
  – Nagle's algorithm (data merging to utilize MTU)
  – MTU Jumbo frame support up to 9000 bytes
  – Window scaling (understands window sizes greater than 64 KB)
  – Silly window avoidance (do not send when the receiver's window is small)
  – Six TCP/IP timers reduced to three timers implemented by one counter
    ● Connection-establishment timer, Retransmission timer, Persist timer
● Complex congestion control reduced to
  – Exponential back-off: double the retransmit timeout if a packet is not acknowledged
  – Fast-retransmit: if only a single segment was lost, retransmit immediately without waiting for the timeout
● Not implemented (not necessary)
  – Timestamps, Selective acknowledgements, Out of band data (urgent data)
  – Server part and data reception (FEROL is a client and opens the TCP/IP connection)

FEROL TCP/IP Software Emulator
● Software implementation of simplified TCP/IP
  – For protocol verification and testing before implementing in hardware (e.g. verification of the TCP congestion control)
  – Runs as a user space program
    ● The TCP/IP packets must bypass the Linux kernel; otherwise they interfere with the Linux TCP/IP stack.
    ● Based on PF_RING*
  – Received packets are stored in a circular buffer and read from user space
*http://www.ntop.org/products/pf_ring/

Is congestion control important?
[Figure: PCs connected by 5 x 10 Gb/s lines into one 10 Gb/s line; labelled values in Gb/s: X, 5.29, 2.0, 0.89]
Senders: 2048 bytes @ 125 kHz ~ 2.048 Gb/s each
5 x 2.048 = 10.24 Gb/s
A little bit of congestion: all bandwidth will be eaten up by buffers being re-sent due to a temporary congestion. Without congestion control the link is not able to recover from this state, even though the link itself works flawlessly.
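The offered-load arithmetic on the congestion slide, and the exponential back-off rule listed among the implemented features, can be checked with a short sketch. This is an illustration only, not the FEROL firmware or emulator code: the function names, the 1 s initial timeout and the 64 s cap are assumptions chosen for the example and do not come from the slides.

```python
def offered_load_gbps(senders, fragment_bytes, rate_hz):
    """Aggregate offered load in Gbit/s for a set of identical senders."""
    return senders * fragment_bytes * 8 * rate_hz / 1e9


def backoff_schedule(initial_rto_s, retries, max_rto_s=64.0):
    """Retransmit timeouts under exponential back-off.

    The timeout doubles each time a segment is not acknowledged.
    initial_rto_s and max_rto_s are hypothetical values, not taken
    from the slides.
    """
    timeouts = []
    rto = initial_rto_s
    for _ in range(retries):
        timeouts.append(rto)
        rto = min(rto * 2, max_rto_s)
    return timeouts


if __name__ == "__main__":
    # Five senders, 2048-byte fragments at 125 kHz, as on the slide.
    load = offered_load_gbps(senders=5, fragment_bytes=2048, rate_hz=125_000)
    print(f"offered load: {load:.2f} Gb/s on a 10 Gb/s link")
    print("timeouts (s):", backoff_schedule(1.0, 5))
```

The offered load exceeds the 10 Gb/s line by only a few percent, which is the "little bit of congestion" the slide refers to; backing off the retransmit timer is what gives the link room to drain.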
Link Aggregation (2 links into 1 link, 8 streams into 1 PC)
● 2 streams aggregated into one 10GE link
● 8 threads receiving data (1 thread per stream)
● Linux TCP stack compared to the FEROL simplified TCP
[Figure labels: Dell R310 machines with 10GE NICs; 10 Gbit/s Ethernet optical connections; Brocade switch; Dell R620; two links (50% each) aggregated into one]

Stream Aggregation (8 streams to 1 PC)
[Figure panels: FEROL TCP Emulator; Linux Sockets]

FEROL Hardware Architecture
Hardware
  – Altera Arria II GX FPGA
  – Vitesse transceiver 10GE / XAUI
  – QDR memory (16 MBytes)
  – DDR2 memory (512 MBytes)
Interfaces
  – FED/SlinkXpress interface
    ● 2x optical 6 Gbit/s
    ● 1x optical 10 Gbit/s
  – DAQ interface
    ● 1x optical 10 Gbit/s Ethernet

FEROL Operation Modes
Mode with 2x 6 Gbit/s or legacy input
● Input
  – 2x SlinkXpress 6 Gbit/s FED input
  – Legacy S-LINK data through PCI-X
● Output
  – 1x 10 Gbit/s Ethernet / optional second 10 Gbit/s Ethernet link
  – 2x TCP streams
    ● Memory buffer is divided in two, one per stream
● Data fragments
  – Internal generator at 10 Gbit/s speed
  – PCI-X bus with maximum 6.4 Gbit/s
  – SlinkXpress with maximum 2x 5.04 Gbit/s
Mode with 1x 10 Gbit/s input
● Input
  – 1x SlinkXpress 10 Gbit/s FED input
● Output
  – 1x 10 Gbit/s Ethernet
  – 1x TCP stream
    ● Memory buffer is used by one stream
● Data fragments
  – Internal generator at 10 Gbit/s speed
  – SlinkXpress with maximum 10 Gbit/s

FEROL TCP Core
● Several blocks handling different protocols (ARP/ICMP/TCP)
● TCP payload is stored in 64-bit words
● TCP sequence processed in multiples of 8 bytes (64 bits)
● ICMP (PING) is limited to 128 bytes of payload
● IP address is static and assigned by control
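One consequence of handling the TCP sequence in multiples of 8 bytes can be illustrated with segment sizing. The sketch below is an assumption about how such a constraint might be applied, not a description of the FEROL core: it assumes 20-byte IPv4 and TCP headers without options and rounds the payload down to a whole number of 64-bit words. The function name is hypothetical.

```python
WORD_BYTES = 8  # 64-bit words, as stated on the TCP core slide


def aligned_payload_bytes(mtu, ip_header=20, tcp_header=20):
    """Largest TCP payload that fits in the MTU and is a multiple of 8 bytes.

    Header sizes assume no IP or TCP options (an assumption of this sketch).
    """
    room = mtu - ip_header - tcp_header
    if room < WORD_BYTES:
        raise ValueError("MTU too small for one 64-bit word of payload")
    return (room // WORD_BYTES) * WORD_BYTES


if __name__ == "__main__":
    for mtu in (1500, 9000):
        print(mtu, aligned_payload_bytes(mtu))
```

Under these assumptions a 9000-byte jumbo frame leaves 8960 bytes of payload, which is already word-aligned, while a 1500-byte MTU would leave four bytes per segment unused.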
