TWEPP 2013

10Gbps TCP/IP streams from the FPGA for the CMS DAQ Eventbuilder Network

Petr Žejdl, Dominique Gigi on behalf of the CMS DAQ Group

26 September 2013

Outline

● CMS DAQ Readout System – Upgrade – DAQ2 Proposed Layout

● TCP/IP – Overview, Introduction – Simplification – Implementation

● FEROL – Introduction, block diagram – Modes of operation – TCP Engine

● Measurements – Point-2-point measurements – Stream/Link Aggregation

● Summary

Current CMS DAQ Readout System

● Current system based on SLINK64 and Myrinet network
– A sender (FED) card implementing an electrical LVDS link running at 400 MByte/s (3.2 Gbit/s)
– A receiver (FRL) card
  ● Receives the SLINK data and performs CRC checking
  ● Interfaces to commercial Myrinet hardware
  ● Myrinet NIC runs custom firmware designed by the DAQ group

[Diagram: Detector Front-End Driver (FED) mezzanine → SLINK64 cable (up to 10 m, 400 MB/s) → Front-end Readout Link (FRL) → 1 or 2 optical links to Myrinet NIC]

SLINK cables going into FRLs

Motivation for the Upgrade

● End-of-life of almost all PC and networking equipment

– Hardware is more than 5 years old

– The system was purchased in 2006 and installed in 2007

– Myrinet PCI-X cards and PCs with a PCI-X slot are difficult to buy today

● Benefit from technology evolution

– New PCs with multicore CPUs and NUMA architecture

– 10/40 Gbit/s and 56 Gbit/s IB FDR network equipment

● New uTCA based FEDs will be in operation after LS1
– DAQ group developed a point-2-point optical link – SlinkXpress
  ● Simple interface to custom readout electronics
  ● Reliable link, data are retransmitted in case of error
  ● Current implementation allows running at up to 6.3 Gbit/s or at 10 Gbit/s
  ● IP Core is available for Altera and Xilinx FPGAs

Requirements for Subsystem Readout

● A new link to replace the Myrinet network is required

● Requirements:

– L1 trigger rate up to 100 kHz

– Sufficient bandwidth

● Legacy S-link (electrical LVDS) FEDs with 3.2 Gbit/s (400 MByte/s)

● New (uTCA, optical link based) FEDs with 6 Gbit/s (in future 10 Gbit/s)

– Reliable (loss-less) connection between underground and surface

● The new readout link discussed in this presentation is the replacement for the Myrinet network

DAQ2 Proposed Layout

[Layout diagram: S-link / custom optical link hardware (underground); 10 Gbit/s Ethernet links from underground to the surface; commercial hardware on the surface: 40 Gbit/s Ethernet, 56 Gbit/s Infiniband, 40/10/1 Gbit/s Ethernet]

DAQ2 Proposed Layout (2)

[Same layout diagram as on the previous slide]

FEROL Introduction

● Front-End Readout Optical Link (FEROL)

– Interface between custom and commercial hardware/network

– Replace Myrinet NIC with custom FPGA based NIC card

● Input:

– Legacy S-link input via FRL

– SlinkXpress interface

● 2x optical 6 Gbit/s interface

● 1x optical 10 Gbit/s interface

● Output:

– Optical 10 Gbit/s Ethernet link

– Optional second 10 Gbit/s Ethernet link

– Runs a standard protocol: TCP/IP over 10Gbit/s Ethernet

TCP/IP

● Benefits of using TCP/IP

– TCP/IP guarantees a reliable and in-order data delivery

● Retransmissions deal with packet loss

● Flow control respects the occupancy of the buffers in a receiving PC

● Congestion control allows transmitting multiple streams on the same link (link aggregation)

– Standard and well known protocol suite (almost)

– Implemented in all mainstream operating systems

– Debugging and monitoring tools widely available (tcpdump, wireshark, iperf, …)

– Network composed of off-the-shelf hardware from multiple vendors

● Don't re-invent a reliable network but make use of available software and commercial hardware

TCP Implementation

● In principle a very difficult task for an FPGA

– TCP/IP is a general purpose protocol suite

– Even for a PC, TCP/IP is a very resource-hungry protocol

– ~15,000 lines of C code in the kernel for TCP alone

● Consideration

– The CMS DAQ network has a fixed topology

– The data traffic goes only in one direction, from FEROL to the Readout Unit (PC)

– The aggregated readout network throughput is sufficient (by design) to avoid packet congestion and packet loss

● Can we simplify TCP?

TCP Implementation (2)

● Robustness Principle [RFC 793]

– TCP implementations will follow a general principle of robustness: Be conservative in what you do, be liberal in what you accept from others.

● According to the robustness principle we simplified the TCP sender; the receiving PC (with a full TCP/IP stack) will handle the rest

– FEROL is a client, PC is a server

– FEROL opens a TCP connection

– FEROL sends the data to the PC; data flows in one direction, from the client to the server

● Acknowledgement packets are sent back; they are part of the protocol
– The TCP connection is aborted instead of closed. A connection abort is unreliable and should be initiated by the server (PC). (A minimal sketch of the resulting sender state handling follows this list.)

– Use simple congestion control
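A minimal sketch of the reduced sender-side state handling implied by these simplifications (plain C; the names and structure are illustrative, not the actual FEROL firmware):

#include <stdbool.h>

/* Only three TCP states are used:
 * CLOSED -> SYN_SENT -> ESTABLISHED, torn down with RST (abort). */
enum tcp_state { CLOSED, SYN_SENT, ESTABLISHED };

struct tcp_conn {
    enum tcp_state state;
    unsigned int   snd_una;   /* oldest unacknowledged sequence number */
};

/* Called for every segment received back from the PC (the server). */
void on_segment(struct tcp_conn *c, bool syn, bool ack, bool rst, unsigned int ack_no)
{
    if (rst) {                       /* peer aborted: drop straight to CLOSED */
        c->state = CLOSED;
        return;
    }
    switch (c->state) {
    case SYN_SENT:
        if (syn && ack)              /* SYN+ACK completes the 3-way handshake */
            c->state = ESTABLISHED;
        break;
    case ESTABLISHED:
        if (ack)
            c->snd_una = ack_no;     /* advance the acknowledged-data pointer */
        break;
    case CLOSED:                     /* we never LISTEN and never receive data */
        break;
    }
}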

TCP Implementation (3)

TCP Implementation (4)

We don't listen (we are only a client) / we don't receive any data

TCP Implementation (5)

ABORT/RST

We do a connection abort instead of a connection close

TCP Implementation (6)

ABORT/RST

Final State Diagram

But not so simple...

Implementation and Simplifications

● Implemented

– Nagle's algorithm (data merging to utilize MTU)

– MTU Jumbo frame support up to 9000 bytes

– Window scaling (understands window sizes greater than 64KB)

– Silly window avoidance (not to send when receiver's window is small)

– Six TCP/IP Timers reduced to three timers implemented by one counter

● Connection-establishment timer, Retransmission timer, Persist timer

● Complex congestion control reduced to (a minimal sketch follows at the end of this slide):

– Exponential back-off: double the retransmission timeout if a packet is not acknowledged

– Fast retransmit: if only a single segment was lost, retransmit it immediately without waiting for the timeout

● Not implemented (not necessary)

– Timestamps, Selective acknowledgements, Out of band data (urgent data)

– Server part and data reception (FEROL is client and opens TCP/IP connection)
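A minimal sketch of the reduced congestion handling described above (plain C; the constants and the retransmit helper are illustrative assumptions, not the FEROL firmware):

void retransmit_oldest_segment(void);  /* assumed helper: resend oldest unacked segment */

#define RTO_INITIAL_MS    200          /* illustrative initial retransmission timeout */
#define RTO_MAX_MS        60000        /* cap for the exponential back-off            */
#define DUP_ACK_THRESHOLD 3            /* duplicate ACKs that trigger fast retransmit */

struct tx_state {
    unsigned int rto_ms;     /* current retransmission timeout             */
    unsigned int dup_acks;   /* consecutive duplicate ACKs seen            */
    unsigned int last_ack;   /* highest acknowledgement number seen so far */
};

/* Retransmission timer expired without new ACKs: exponential back-off. */
void on_rto_expired(struct tx_state *s)
{
    s->rto_ms *= 2;
    if (s->rto_ms > RTO_MAX_MS)
        s->rto_ms = RTO_MAX_MS;
    retransmit_oldest_segment();
}

/* ACK segment received from the PC. */
void on_ack(struct tx_state *s, unsigned int ack_no)
{
    if (ack_no == s->last_ack) {
        /* Duplicate ACK: the receiver is still missing one segment. */
        if (++s->dup_acks == DUP_ACK_THRESHOLD) {
            retransmit_oldest_segment();   /* fast retransmit, no timeout wait */
            s->dup_acks = 0;
        }
    } else {
        s->last_ack = ack_no;              /* new data acknowledged            */
        s->dup_acks = 0;
        s->rto_ms   = RTO_INITIAL_MS;      /* restart from the base timeout    */
    }
}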

FEROL TCP/IP Software Emulator

● Software implementation of the simplified TCP/IP
– For protocol verification and testing before implementing in hardware (e.g. verification of the TCP congestion control)
– Runs as a user-space program
● The TCP/IP packets must bypass the Linux kernel, otherwise they would interfere with the Linux TCP/IP stack
● Based on PF_RING* – received packets are stored in a circular buffer and read from user space (a minimal capture sketch follows below)

*http://www.ntop.org/products/pf_ring/
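A minimal sketch of how such an emulator can read raw packets past the kernel using the public PF_RING user-space API (the device name and packet handling are placeholders; error handling is trimmed):

#include <pfring.h>
#include <stdio.h>

int main(void)
{
    /* Open the interface in promiscuous mode; packets read here never enter
     * the normal Linux TCP/IP stack. The device name is an example. */
    pfring *ring = pfring_open("eth2", 9000 /* snaplen */, PF_RING_PROMISC);
    if (ring == NULL)
        return 1;

    pfring_set_application_name(ring, "ferol-tcp-emulator");
    pfring_enable_ring(ring);

    for (int i = 0; i < 1000; i++) {
        u_char *pkt;
        struct pfring_pkthdr hdr;
        /* Blocking read from the PF_RING circular buffer. */
        if (pfring_recv(ring, &pkt, 0, &hdr, 1) > 0) {
            /* Here the emulator would parse Ethernet/IP/TCP headers and feed
             * incoming ACK segments into its simplified TCP engine. */
            printf("captured %u bytes\n", hdr.len);
        }
    }

    pfring_close(ring);
    return 0;
}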

Is congestion control important?

[Diagram: five sender PCs feed 10 Gb/s lines into a single 10 Gb/s link towards one receiving PC; per-stream throughput figures of 5.29, 2.0 and 0.89 Gb/s appear in the diagram]

Senders: 2048 bytes @ 125 kHz ≈ 2.048 Gb/s each; 5 × 2.048 Gb/s = 10.24 Gb/s offered load

With even a little congestion, all the bandwidth is eaten up by buffers being re-sent due to the temporary congestion: without congestion control the link is not able to recover from this state, even though the link itself works flawlessly.

Link Aggregation (2 links into 1 link, 8 streams into 1 PC)

[Test setup: eight Dell R310 sender PCs with 10GE NICs, optical 10 Gbit/s Ethernet connections through a Brocade switch; two links (each loaded at 50%) aggregated into one link towards a Dell R620 receiver]

● 2 streams aggregated into one 10GE link ● 8 threads receiving data (1 thread per stream)

● Linux TCP stack compared to the FEROL simplified TCP (a sketch of the receiving side follows below)
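A minimal sketch of the receiving side used in such a test (standard Linux sockets, one POSIX thread per accepted stream; the port number and buffer size are illustrative, not the real DAQ configuration):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define PORT      10000     /* example port                 */
#define N_STREAMS 8         /* one TCP stream per sender    */

static void *stream_reader(void *arg)
{
    int fd = *(int *)arg;
    free(arg);
    char buf[65536];
    /* Drain the stream; a real Readout Unit would reassemble event fragments. */
    while (read(fd, buf, sizeof buf) > 0)
        ;
    close(fd);
    return NULL;
}

int main(void)
{
    int srv = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port        = htons(PORT);
    bind(srv, (struct sockaddr *)&addr, sizeof addr);
    listen(srv, N_STREAMS);

    pthread_t tid[N_STREAMS];
    for (int i = 0; i < N_STREAMS; i++) {
        int *fd = malloc(sizeof *fd);
        *fd = accept(srv, NULL, NULL);          /* one stream per thread */
        pthread_create(&tid[i], NULL, stream_reader, fd);
    }
    for (int i = 0; i < N_STREAMS; i++)
        pthread_join(tid[i], NULL);
    close(srv);
    return 0;
}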

Stream Aggregation (8 streams to 1 PC)

[Plot: throughput of the FEROL TCP emulator compared with Linux sockets]

FEROL Hardware Architecture

Hardware
– Altera Arria II GX FPGA
– Vitesse transceiver 10GE / XAUI
– QDR memory (16 MBytes)
– DDR2 memory (512 MBytes)

Interfaces
– FED/SlinkXpress interface: 2x optical 6 Gbit/s, 1x optical 10 Gbit/s
– DAQ interface: 1x optical 10 Gbit/s Ethernet

FEROL Operation Modes

Mode A (two TCP streams)
● Input
– 2x SlinkXpress 6 Gbit/s FED input
– Legacy S-LINK data through PCI-X
● Output
– 1x 10 Gbit/s Ethernet / optional second 10 Gbit/s Ethernet link
– 2x TCP streams
● Memory buffer is divided in two, one per stream
● Data fragments
– Internal generator at 10 Gbit/s speed
– PCI-X bus with maximum 6.4 Gbit/s
– SlinkXpress with maximum 2x 5.04 Gbit/s

Mode B (single TCP stream)
● Input
– 1x SlinkXpress 10 Gbit/s FED input
● Output
– 1x 10 Gbit/s Ethernet
– 1x TCP stream
● Memory buffer is used by one stream
● Data fragments
– Internal generator at 10 Gbit/s speed
– SlinkXpress with maximum 10 Gbit/s

FEROL TCP Core

● Several blocks handle the different protocols (ARP/ICMP/TCP)
● TCP payload is stored in 64-bit words
● TCP sequence is processed in multiples of 8 bytes (64 bits)
● ICMP (PING) is limited to 128 bytes of payload
● IP address is static and assigned by the control SW
● MAC address is kept in EEPROM memory

FPGA Resource Utilization

FPGA floorplan:
– 16% JTAG + Flash access + debugging
– 1% QDR memory logic
– 3% DDR memory logic
– 8% SlinkXpress 6 Gbit/s logic
– 3% PCI-X logic
– 3% Internal FED event generators
– 20% TCP core + 10 Gbit/s logic
– 3% One TCP stream

Altera Arria II GX 125 FPGA
● Logic utilization 54% (53611 / 99280 ALUTs)
● Internal memory 37% (2.5 Mbits)
● GXB transceivers: 10 of 12

Production

● 21 FEROL prototypes built – 16 FEROLs installed in the CMS online computing room

● Production of 650 FEROLs has been launched

● September – Pre-series of 32 boards started

● November – Pre-series tests

● The remaining 618 boards will be produced after the validation of the pre-series.

● Installation 100 meters below ground in April

Point-2-Point 10GE Measurements

[Setup: FEROL (FPGA) on the FRL, optical 10 Gbit/s Ethernet connection to a PC with a 10GE NIC]

● DELL PowerEdge C6100 as a receiver PC – CPU Xeon X5650 @ 2.67 GHz

– Myricom 10GE NIC

Point-2-Point 10GE Measurements

● Linux TCP Sockets compared to FEROL “TCP/IP” implementation

● Throughput measured with different fragment sizes

● One CPU core utilized at 85%

Link Aggregation (8 links into 1 link, 16 streams into 1 PC)

[Setup: FEROLs 01–08 connected through a Mellanox SX1036 40 Gbit/s Ethernet switch to a DELL PowerEdge R720 with a Mellanox 40 Gbit/s Ethernet NIC]

Tests
● Link/stream aggregation test with up to 16 streams
● Stability – 16 streams running at 100 kHz with 2 kB data fragments (~26 Gbit/s) – no backpressure (FEROL buffer overflow) for 125 hours (5 days)

Stream Aggregation (16 Streams to 1 PC)

Throughput for more than 12 streams is constant if hyper-threading is enabled

Summary

● Simplified TCP/IP is working!

– Verified by software emulation

– Tested with a real FEROL, reaching a maximum throughput of 9.70 Gbps over the 10GE interface

● Stream aggregation (congestion control) is working

– Test with 8 devices producing up to 16 streams

– Maximum aggregated throughput is 39.66 Gbps over 40GE interface

– Stable streams (at 26 Gbit/s) over 5 days, no problems found

● Observations

– The maximum performance is sensitive to the PC configuration

– BIOS settings: Hyper-threading on/off

– OS settings: TCP socket buffer settings, IRQ affinities

– User process settings: CPU core affinities (a minimal tuning sketch follows below)
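A minimal sketch of the per-socket and per-thread tuning these observations refer to (SO_RCVBUF and CPU affinity via standard Linux calls; the function name and the values are illustrative, the real settings were tuned per machine):

#define _GNU_SOURCE            /* for CPU_SET / sched_setaffinity */
#include <sched.h>
#include <sys/socket.h>

/* Enlarge the TCP receive buffer of one stream socket and pin the calling
 * (reader) thread to a given CPU core. */
int tune_stream(int fd, int core)
{
    int rcvbuf = 8 * 1024 * 1024;          /* 8 MB socket receive buffer (example) */
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof rcvbuf) < 0)
        return -1;

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return sched_setaffinity(0, sizeof set, &set);   /* 0 = calling thread */
}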

Summary...

To the best of our knowledge, this is the first TCP/IP hardware implementation running at 10 Gbit/s in High Energy Physics!

Thank You


Backup Slides

LabVIEW-based control & monitoring

First TCP Stream (FED0) Second TCP Stream (FED1)

Web-based control & monitoring application

FEROL TCP/IP Software Emulator

● For protocol verification and testing we developed a FEROL TCP/IP emulator implementing the simplified TCP/IP in software as a user-space program

● The “emulated” TCP/IP streams bypass the Linux kernel (PF_RING)

Dell PowerEdge R310 – Intel Xeon X3450 @ 2.67 GHz (8 MB cache, 4 cores); Dell PowerEdge C6100 – Intel Xeon X5650 @ 2.67 GHz (12 MB cache, 6 cores)

Opening and closing TCP connections

● Opening connection (one way/client)

– Standard 3-way handshake is used

– States: CLOSED → SYN_SENT → ESTABLISHED

– States not used: LISTEN, SYN_RECEIVED

● Connection closing

– RST is sent (connection is aborted)

– States: ESTABLISHED → CLOSED

– States not used: FIN_WAIT_1, FIN_WAIT_2, TIME_WAIT, CLOSING, CLOSE_WAIT, LAST_ACK

Connection abort with the Linux TCP/IP stack:

struct linger so_linger;
so_linger.l_onoff  = 1;   /* enable SO_LINGER                                */
so_linger.l_linger = 0;   /* linger time 0: close() aborts and sends a RST   */
setsockopt(s, SOL_SOCKET, SO_LINGER, &so_linger, sizeof so_linger);

FEROL TCP/IP Stream Aggregation

● TCP/IP allows sending multiple streams over the same link; reliability and congestion are handled by TCP/IP

● n streams (here 8/16) are concentrated in one 40 GE switch (the aggregation ratio depends on the rate requirements)

● Aggregated streams are sent through one 40 GE interface to the RU PC

● Fewer networking devices and PCs are required

● When a PC dies the network is re-configured (e.g., the aggregation ratio is changed or a 'hot spare' PC is used)


TCP and Data Fragments


● A TCP connection is a stream connection ● Data (from TCP's point of view) have no beginning and no end – no relationship to fragments

● How to distinguish fragments? – An additional header is used (it contains the length and fragment Start/End marks; a sketch follows below) – Used by the receiving PC to reassemble fragments and to limit the segment length

[Diagram: small fragments are each wrapped as HDR(S,E) + payload (Fragment 1, F2, F3); a large fragment is split across segments as HDR(S) + first part of Fragment 4 and HDR(E) + last part of Fragment 4]
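A sketch of what such a fragment header could look like (field names, widths and flag encoding are illustrative assumptions; this is not the actual FEROL header layout):

#include <stdint.h>

#define FRAG_FLAG_START 0x1   /* this piece starts a new event fragment */
#define FRAG_FLAG_END   0x2   /* this piece ends the event fragment     */

/* Hypothetical header prepended to each piece of payload in the TCP byte
 * stream; the receiving PC uses it to find fragment boundaries. */
struct frag_hdr {
    uint32_t length;          /* payload bytes following this header    */
    uint16_t flags;           /* combination of FRAG_FLAG_START / _END  */
    uint16_t source_id;       /* e.g. which FED the fragment came from  */
};

/* A small fragment is sent as one header with both flags set; a fragment
 * larger than the segment limit is split as [hdr S] ... [hdr E]. */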

SlinkXpress (for AMC13)

Input link for CMS DAQ

The protocol was tested with the HCAL AMC13 (ver1):
– 5 Gb/s (8b/10b encoding) with FEROL
– up to 6.3 Gb/s (8b/10b encoding) with FEROL
– 10 Gb/s (XAUI interface) exists
– 10 Gb/s with 64b/66b encoding (to be tested)
