TWEPP 2013
10Gbps TCP/IP streams from the FPGA for the CMS DAQ Eventbuilder Network
Petr Žejdl, Dominique Gigi on behalf of the CMS DAQ Group
26 September 2013

Outline
● CMS DAQ Readout System – Upgrade – DAQ2 Proposed Layout
● TCP/IP – Overview, Introduction – Simplification – Implementation
● FEROL – Introduction, block diagram – Modes of operation – TCP Engine
● Measurements – Point-2-point Measurements – Stream/Link Aggregation
● Summary
TWEPP 2013 - 10Gbps TCP/IP streams from the FPGA

Current CMS DAQ Readout System
● Current system based on SLINK64 and Myrinet network
– A sender card: the Detector Front-End Driver (FED) mezzanine, implementing an electrical LVDS link (SLINK64 cable, up to 10 m) running at 400 MByte/s (3.2 Gbit/s)
– A receiver card: the Front-end Readout Link (FRL)
● Receives the SLINK data and performs CRC checking
● Interfaces to commercial Myrinet hardware
● Myrinet NIC runs custom firmware designed by the DAQ group
– 1 or 2 optical links from the FRL to the Myrinet NIC

SLINK cables going into FRLs (photo)
Motivation for the Upgrade
● End-of-life of almost all PC and networking equipment
– Hardware is more than 5 years old
– The system was purchased in 2006 and installed in 2007
– Myrinet PCI-X cards and PCs with PCI-X slots are difficult to buy today
● Benefit from technology evolution
– New PCs with multicore CPUs and NUMA architecture
– 10/40 Gbit/s Ethernet and 56 Gbit/s IB FDR network equipment
● New uTCA based FEDs will be in operation after LS1
– DAQ group developed a point-2-point optical link – SlinkXpress
● Simple interface to custom readout electronics
● Reliable link, data are retransmitted in case of error
● Current implementation runs at up to 6.3 Gbit/s or at 10 Gbit/s
● IP Core is available for Altera and Xilinx FPGAs
Requirements for Subsystem Readout
● A new link to replace the Myrinet network is required
● Requirements:
– L1 trigger rate up to 100 kHz
– Sufficient bandwidth
● Legacy S-link (electrical LVDS) FEDs with 3.2 Gbit/s (400 MByte/s)
● New (uTCA, optical link based) FEDs with 6 Gbit/s (in future 10 Gbit/s)
– Reliable (loss-less) connection between underground and surface
● The new readout link discussed in this presentation is the replacement for the Myrinet network
DAQ2 Proposed Layout
[Diagram: underground custom hardware (S-link / custom optical links) feeding 10 Gbit/s Ethernet to the surface; commercial hardware at the surface: 40 Gbit/s Ethernet, 56 Gbit/s Infiniband, 40/10/1 Gbit/s Ethernet]
DAQ2 Proposed Layout (2)
[Same layout diagram as on the previous slide]
FEROL Introduction
● Front-End Readout Optical Link (FEROL)
– Interface between custom and commercial hardware/network
– Replaces the Myrinet NIC with a custom FPGA-based NIC card
● Input:
– Legacy S-link input via FRL
– SlinkXpress interface
● 2x optical 6 Gbit/s interface
● 1x optical 10 Gbit/s interface
● Output:
– Optical 10 Gbit/s Ethernet link
– Optional second 10 Gbit/s Ethernet link
– Runs a standard protocol: TCP/IP over 10Gbit/s Ethernet
TCP/IP
● Benefits of using TCP/IP
– TCP/IP guarantees a reliable and in-order data delivery
● Retransmissions deal with packet loss
● Flow control respects the occupancy of the buffers in a receiving PC
● Congestion control allows transmitting multiple streams on the same link (link aggregation)
– Standard and well known protocol suite (almost)
– Implemented in all mainstream operating systems
– Debugging and monitoring tools widely available (tcpdump, wireshark, iperf, …)
– Network composed from off-the-shelf hardware, multiple vendors
● Don't re-invent a reliable network but make use of available software and commercial hardware
TCP Implementation
● In principle a very difficult task for an FPGA
– TCP/IP is a general purpose protocol suite
– Even for a PC, TCP/IP is a very resource-hungry protocol
– ~15 000 lines of C code in the Linux kernel for TCP alone
● Consideration
– CMS DAQ network has a fixed topology
– The data traffc goes only in one direction from FEROL to Readout Unit (PC)
– The aggregated readout network throughput is sufficient (by design) to avoid packet congestion and packet loss
● Can we simplify TCP?
TCP Implementation (2)
● Robustness Principle [RFC 793]
– TCP implementations will follow a general principle of robustness: Be conservative in what you do, be liberal in what you accept from others.
● According to the robustness principle we simplified the TCP sender. The receiving PC (with a full TCP/IP stack) will handle the rest
– FEROL is a client, PC is a server
– FEROL opens a TCP connection
– FEROL sends the data to the PC; data flows in one direction, from client to server
● Acknowledgement packets are sent back; they are part of the protocol
– TCP connection is aborted instead of closed. Connection abort is unreliable and should be initiated by the server (PC).
– Use simple congestion control
TCP Implementation (3)–(6)
[State diagrams of the simplified TCP engine]
– We don't listen (we are only a client) / we don't receive any data
– We do a connection abort (ABORT/RST) instead of a connection close
– Final state diagram: CLOSED → SYN_SENT → ESTABLISHED → (ABORT/RST) → CLOSED
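The reduced state diagram above can be written as a small transition function. This is an illustrative sketch in software, not the FEROL firmware itself; the state and event names are assumptions chosen to match the diagram.

```c
/* Reduced state set of the simplified sender: LISTEN, SYN_RECEIVED,
   FIN_WAIT_*, TIME_WAIT, CLOSING, CLOSE_WAIT and LAST_ACK are never
   entered, because we only open as a client and abort with RST. */
enum tcp_state { CLOSED, SYN_SENT, ESTABLISHED };
enum tcp_event { EV_OPEN, EV_SYN_ACK_RCVD, EV_ABORT };

/* Returns the next state; events that are invalid in the current
   state leave it unchanged. */
enum tcp_state tcp_next(enum tcp_state s, enum tcp_event e)
{
    switch (s) {
    case CLOSED:      return (e == EV_OPEN)         ? SYN_SENT    : s; /* send SYN */
    case SYN_SENT:    return (e == EV_SYN_ACK_RCVD) ? ESTABLISHED : s; /* send ACK */
    case ESTABLISHED: return (e == EV_ABORT)        ? CLOSED      : s; /* send RST */
    }
    return s;
}
```

Only three states and three events remain, which is what makes the engine feasible in FPGA logic.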
But not so simple...
Implementation and Simplifications
● Implemented
– Nagle's algorithm (data merging to utilize MTU)
– MTU Jumbo frame support up to 9000 bytes
– Window scaling (understands window sizes greater than 64KB)
– Silly window avoidance (not to send when receiver's window is small)
– Six TCP timers reduced to three, implemented by one counter
● Connection-establishment timer, Retransmission timer, Persist timer
● Complex congestion control reduced to
– Exponential back-off: double the retransmit timeout if a packet is not acknowledged
– Fast-retransmit: if only a single segment was lost, retransmit immediately without waiting for the timeout
● Not implemented (not necessary)
– Timestamps, Selective acknowledgements, Out of band data (urgent data)
– Server part and data reception (FEROL is client and opens TCP/IP connection)
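The two congestion-control mechanisms kept above can be sketched in a few lines. The constants (initial and maximum timeout, duplicate-ACK threshold) are illustrative assumptions, not values from the FEROL design:

```c
/* Illustrative constants; the real retransmission timeouts and the
   duplicate-ACK threshold are not specified on the slide. */
enum { RTO_MAX_MS = 60000, DUP_ACK_THRESHOLD = 3 };

/* Exponential back-off: double the retransmission timeout each time a
   retransmitted segment is again not acknowledged, up to a ceiling. */
unsigned backoff(unsigned rto_ms)
{
    return (rto_ms * 2 > RTO_MAX_MS) ? RTO_MAX_MS : rto_ms * 2;
}

/* Fast-retransmit: a run of duplicate ACKs for the same sequence number
   signals a single lost segment; resend it immediately instead of
   waiting for the retransmission timer to fire. */
int should_fast_retransmit(unsigned dup_acks)
{
    return dup_acks >= DUP_ACK_THRESHOLD;
}
```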
FEROL TCP/IP Software Emulator
● Software implementation of the simplified TCP/IP
– For protocol verification and testing before implementing in hardware (e.g. verification of the TCP congestion control)
– Runs as a user space program
● For TCP/IP packets it is important to bypass the Linux kernel, otherwise they interfere with the Linux TCP/IP stack
● Based on PF_RING*
– Received packets are stored in a circular buffer and read from user space
*http://www.ntop.org/products/pf_ring/
Is congestion control important?
[Diagram: five sender PCs on 10 Gb/s lines aggregated into one 10 Gb/s link; the figure shows throughputs of 5.29, 2.0 and 0.89 Gb/s]
Senders: 2048 bytes @ 125 kHz ≈ 2.048 Gb/s each; 5 × 2.048 = 10.24 Gb/s offered to a 10 Gb/s link
A little bit of congestion and all bandwidth is eaten up by buffers being re-sent due to the temporary congestion: without congestion control the link is not able to recover from this state even though the link itself works flawlessly.
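The oversubscription on the slide can be checked directly: each sender offers fragment size × trigger rate, and five senders together exceed the 10 Gb/s link.

```c
/* Offered load of one sender in Gbit/s: fragment size (bytes) times
   trigger rate (Hz), converted from bytes to bits. */
double load_gbps(unsigned frag_bytes, unsigned rate_hz)
{
    return (double)frag_bytes * 8.0 * (double)rate_hz / 1e9;
}
```

With 2048-byte fragments at 125 kHz each sender offers 2.048 Gb/s, so five senders offer 10.24 Gb/s into a 10 Gb/s link.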
Link Aggregation (2 links into 1 link, 8 streams into 1 PC)
[Diagram: eight Dell R310 senders connected by optical 10 Gbit/s Ethernet links through a Brocade switch to a Dell R620 receiver with a 10GE NIC; two links, each loaded at 50%, aggregated into one]
● 2 streams aggregated into one 10GE link
● 8 threads receiving data (1 thread per stream)
● Linux TCP stack compared to the FEROL simplified TCP
Stream Aggregation (8 streams to 1 PC)
[Plot: throughput of the FEROL TCP Emulator vs Linux Sockets]
FEROL Hardware Architecture
● Hardware
– Altera Arria II GX FPGA
– Vitesse transceiver 10GE / XAUI
– QDR Memory (16 MBytes)
– DDR2 Memory (512 MBytes)
● Interfaces
– FED/SlinkXpress interface: 2x optical 6 Gbit/s, 1x optical 10 Gbit/s
– DAQ interface: 1x optical 10 Gbit/s Ethernet
FEROL Operation Modes
Two-stream mode:
● Input
– 2x SlinkXpress 6 Gbit/s FED input
– Legacy S-LINK data through PCI-X
● Output
– 1x 10 Gbit/s Ethernet / optional second 10 Gbit/s Ethernet link
– 2x TCP streams (memory buffer is divided in two, one per stream)
● Data fragments
– Internal generator at 10 Gbit/s speed
– PCI-X bus with maximum 6.4 Gbit/s
– SlinkXpress with maximum 2x 5.04 Gbit/s

One-stream mode:
● Input
– 1x SlinkXpress 10 Gbit/s FED input
● Output
– 1x 10 Gbit/s Ethernet
– 1x TCP stream (memory buffer is used by one stream)
● Data fragments
– Internal generator at 10 Gbit/s speed
– SlinkXpress with maximum 10 Gbit/s

FEROL TCP Core
● Several blocks handling different protocols (ARP/ICMP/TCP)
● TCP payload is stored in 64-bit words
● TCP sequence processed in multiples of 8 (64 bits)
● ICMP (PING) is limited to 128 bytes of payload
● IP address is static and assigned by control SW
● MAC address is kept in EEPROM memory
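Any TCP/IP engine, in RTL or software, must compute the standard Internet checksum over the IP header and (with a pseudo-header) the TCP segment. As a reference for what the hardware has to reproduce, here is a software sketch of the RFC 1071 one's-complement sum; this is not the FEROL implementation:

```c
#include <stddef.h>
#include <stdint.h>

/* RFC 1071 Internet checksum: add the data as 16-bit big-endian words
   in one's-complement arithmetic, then complement the result. */
uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    while (len > 1) {                       /* sum 16-bit words */
        sum += (uint32_t)data[0] << 8 | data[1];
        data += 2;
        len  -= 2;
    }
    if (len == 1)                           /* odd trailing byte, zero-padded */
        sum += (uint32_t)data[0] << 8;
    while (sum >> 16)                       /* fold carries into low 16 bits */
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}
```

In an FPGA the same fold-the-carry structure maps naturally onto a pipelined adder over the 64-bit data words the core already uses.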
FPGA Resource Utilization
FPGA Floorplan:
– 16% JTAG + Flash access + debugging
– 1% QDR memory logic
– 3% DDR memory logic
– 8% SlinkXpress 6 Gbit/s logic
– 3% PCI-X logic
– 3% Internal FED event generators
– 20% TCP Core + 10 Gbit/s logic
– 3% One TCP stream

Altera Arria II GX 125 FPGA
● Logic utilization 54% (53611 / 99280 ALUTs)
● Internal memory 37% (2.5 Mbits)
● GXB transceivers 10 of 12
Production
● 21 FEROL prototypes built
– 16 FEROLs installed in the CMS online computing room
● Production of 650 FEROLs is launched
● September – Pre-series of 32 boards started
● November – Pre-series tests
● The remaining 618 boards will be produced after the validation of the pre-series.
● Installation 100 meters below ground in April
Point-2-Point 10GE Measurements
[Diagram: FRL + FEROL (FPGA) → 10 Gbit/s Ethernet (optical connection) → PC with 10GE NIC]
● DELL PowerEdge C6100 as a receiver PC – CPU Intel Xeon X5650 @ 2.67 GHz
– Myricom 10GE NIC
Point-2-Point 10GE Measurements
● Linux TCP Sockets compared to FEROL “TCP/IP” implementation
● Throughput measured with different fragment sizes
● One CPU core utilized at 85%
Link Aggregation (8 links into 1 link, 16 streams into 1 PC)
Mellanox SX1036 – 40 Gbit/s Ethernet Switch
[Diagram: FEROLs 01–08 connected through the Mellanox switch to a DELL PowerEdge R720 with a Mellanox 40 Gbit/s Ethernet NIC]
Tests:
● Link/stream aggregation test up to 16 streams
● Stability
– 16 streams running at 100 kHz with 2 kB data fragments (~26 Gbit/s)
– No backpressure (FEROL buffer overflow) for 125 hours (5 days)

Stream aggregation (16 Streams to 1 PC)
Throughput for more than 12 streams is constant if hyper-threading is enabled
Summary
● Simplified TCP/IP is working!
– Verified by software emulation
– Tested with a real FEROL: maximum throughput 9.70 Gbps over the 10GE interface
● Stream aggregation (congestion control) is working
– Test with 8 devices producing up to 16 streams
– Maximum aggregated throughput is 39.66 Gbps over 40GE interface
– Stable streams (at 26 Gbit/s) over 5 days, no problems found
● Observations
– The maximum performance is sensitive to the PC configuration
– BIOS settings: Hyper-threading on/off
– OS settings: TCP socket buffer settings, IRQ affinities
– User process settings: CPU core affinities
Summary...
To the best of our knowledge, this is the first TCP/IP hardware implementation running at 10 Gbit/s in High Energy Physics!
Thank You
Backup Slides
LabVIEW based control & monitoring
First TCP Stream (FED0) Second TCP Stream (FED1)
Web based control & monitoring application
FEROL TCP/IP Software Emulator
● For protocol verification and testing we developed the FEROL TCP/IP emulator, implementing the simplified TCP/IP in software as a user space program
● The “emulated” TCP/IP streams are bypassing Linux kernel (PF_RING)
● Dell PowerEdge R310 – Intel Xeon X3450 @ 2.67 GHz (8 MB cache, 4 cores)
● Dell PowerEdge C6100 – Intel Xeon X5650 @ 2.67 GHz (12 MB cache, 6 cores)

Opening and closing TCP connections
● Opening connection (one way/client)
– Standard 3-way handshake is used
– States: CLOSED → SYN_SENT → ESTABLISHED
– States not used: LISTEN, SYN_RECEIVED
● Connection closing
– RST is sent (connection is aborted)
– States: ESTABLISHED → CLOSED
– States not used: FIN_WAIT_1, FIN_WAIT_2, TIME_WAIT, CLOSING, CLOSE_WAIT, LAST_ACK
Connection abort with the Linux TCP/IP stack:

struct linger so_linger;
so_linger.l_onoff  = 1;   /* enable SO_LINGER */
so_linger.l_linger = 0;   /* timeout 0: close() sends RST */
setsockopt(s, SOL_SOCKET, SO_LINGER, &so_linger, sizeof so_linger);
FEROL TCP/IP Stream Aggregation
● TCP/IP allows sending multiple streams over the same link; reliability and congestion are handled by TCP/IP
● n streams (here 8/16) are concentrated in one 40 GE switch (aggregation ratio depends on the rate requirements)
● Aggregated streams are sent through one 40 GE interface to the RU PC
● Fewer networking devices and PCs are required
● When a PC dies the network is re-configured (e.g., aggregation ratio is changed or a 'hot spare' PC is used)
TCP and Data Fragments
● A TCP connection is a stream connection
● Data (from the TCP point of view) have no beginning and no end – no relationship to fragments
● How to distinguish fragments?
– An additional header is used (contains the length and fragment Start/End marks)
– Used by the receiving PC to reassemble fragments and to limit the segment length
[Diagram: HDR(S,E) Fragment 1 | HDR(S,E) F2 | HDR(S,E) F3
HDR(S) Fragment 4 | HDR(E) Fragment 4 | HDR(S,E) F4]
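A header carrying a length plus Start/End marks can be encoded like this. The slide does not give the FEROL header's bit layout, so the field positions below are purely illustrative assumptions:

```c
#include <stdint.h>

/* Hypothetical 32-bit fragment header: bit 31 = Start mark, bit 30 =
   End mark, low 24 bits = payload length in bytes. The real FEROL
   header carries the same information in an unspecified layout. */
enum { FRAG_START = 1u << 31, FRAG_END = 1u << 30 };

uint32_t frag_hdr_pack(int start, int end, uint32_t len_bytes)
{
    uint32_t h = len_bytes & 0x00FFFFFFu;   /* length field */
    if (start) h |= FRAG_START;             /* first piece of a fragment */
    if (end)   h |= FRAG_END;               /* last piece of a fragment  */
    return h;
}

uint32_t frag_hdr_len(uint32_t h)   { return h & 0x00FFFFFFu; }
int      frag_hdr_start(uint32_t h) { return (h & FRAG_START) != 0; }
int      frag_hdr_end(uint32_t h)   { return (h & FRAG_END)   != 0; }
```

A small fragment is sent as one piece with both marks set (S,E); a large fragment is split into a piece with only S and a piece with only E, and the receiver reassembles by concatenating until it sees E.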
SlinkXpress (for AMC13)
Input link for CMS DAQ
The protocol was tested with the HCAL AMC13 (ver1):
– 5 Gb (8b/10b encoding) with FEROL
– up to 6.3 Gb (8b/10b encoding) with FEROL
– 10 Gb (XAUI interface) exists
– 10 Gb (64b/66b encoding) to be tested