TWEPP 2013

10Gbps TCP/IP streams from the FPGA for the CMS DAQ Eventbuilder Network

Petr Žejdl, Dominique Gigi on behalf of the CMS DAQ Group

26 September 2013

Outline

● CMS DAQ Readout System – Upgrade – DAQ2 Proposed Layout

● TCP/IP – Overview, Introduction – Simplification – Implementation

● FEROL – Introduction, block diagram – Modes of operation – TCP Engine

● Measurements – Point-2-point measurements – Stream/Link Aggregation

● Summary

Current CMS DAQ Readout System

● Current system based on SLINK64 and Myrinet network
– A sender (FED) card implementing an electrical LVDS link running at 400 MByte/s (3.2 Gbit/s)
– A receiver (FRL) card
  ● Receives the SLINK data and performs CRC checking
  ● Interfaces to commercial Myrinet hardware
  ● Myrinet NIC runs custom firmware designed by the DAQ group

[Diagram: Detector Front-End Driver (FED) mezzanine → SLINK64 cable (up to 10 m, 400 MB/s) → Front-end Readout Link (FRL) → 1 or 2 optical links to Myrinet NIC]

SLINK cables going into FRLs

Motivation for the Upgrade

● End-of-life of almost all PC and networking equipment

– Hardware is more than 5 years old

– The system was purchased in 2006 and installed in 2007

– Myrinet PCI-X cards and PCs with a PCI-X slot are difficult to buy today

● Benefit from technology evolution

– New PCs with multicore CPUs and NUMA architecture

– 10/40 Gbit/s and 56 Gbit/s IB FDR network equipment

● New uTCA based FEDs will be in operation after LS1
– DAQ group developed a point-2-point optical link – SlinkXpress
  ● Simple interface to custom readout electronics
  ● Reliable link, data are retransmitted in case of error
  ● Current implementation allows running at up to 6.3 Gbit/s or at 10 Gbit/s
  ● IP Core is available for Altera and Xilinx FPGAs

Requirements for Subsystem Readout

● A new link to replace the Myrinet network is required

● Requirements:

– L1 trigger rate up to 100 kHz

– Sufficient bandwidth

● Legacy S-link (electrical LVDS) FEDs with 3.2 Gbit/s (400 MByte/s)

● New (uTCA, optical link based) FEDs with 6 Gbit/s (in future 10 Gbit/s)

– Reliable (loss-less) connection between underground and surface

● The new readout link discussed in this presentation is the replacement for the Myrinet network

DAQ2 Proposed Layout

[Layout diagram: S-link / custom optical link hardware (underground); 10 Gbit/s Ethernet links from underground to the surface; commercial hardware on the surface: 40 Gbit/s Ethernet, 56 Gbit/s Infiniband, 40/10/1 Gbit/s Ethernet]

DAQ2 Proposed Layout (2)

[Same layout diagram as on the previous slide]

FEROL Introduction

● Front-End Readout Optical Link (FEROL)

– Interface between custom and commercial hardware/network

– Replace Myrinet NIC with custom FPGA based NIC card

● Input:

– Legacy S-link input via FRL

– SlinkXpress interface

● 2x optical 6 Gbit/s interface

● 1x optical 10 Gbit/s interface

● Output:

– Optical 10 Gbit/s Ethernet link

– Optional second 10 Gbit/s Ethernet link

– Runs a standard protocol: TCP/IP over 10Gbit/s Ethernet

TCP/IP

● Benefits of using TCP/IP

– TCP/IP guarantees a reliable and in-order data delivery

● Retransmissions deal with packet loss

● Flow control respects the occupancy of the buffers in a receiving PC

● Congestion control allows transmitting multiple streams on the same link (link aggregation)

– Standard and well known protocol suite (almost)

– Implemented in all mainstream operating systems

– Debugging and monitoring tools widely available (tcpdump, wireshark, iperf, …)

– Network composed of off-the-shelf hardware from multiple vendors

● Don't re-invent a reliable network but make use of available software and commercial hardware

TCP Implementation

● In principle a very difficult task for an FPGA

– TCP/IP is a general purpose protocol suite

– Even for a PC, TCP/IP is a very resource-hungry protocol

– ~15,000 lines of C code in the kernel for TCP alone

● Consideration

– The CMS DAQ network has a fixed topology

– The data traffic goes only in one direction, from FEROL to the Readout Unit (PC)

– The aggregated readout network throughput is sufficient (by design) to avoid packet congestion and packet loss

● Can we simplify TCP?

TCP Implementation (2)

● Robustness Principle [RFC 793]

– TCP implementations will follow a general principle of robustness: Be conservative in what you do, be liberal in what you accept from others.

● According to the robustness principle we simplified the TCP sender; the receiving PC (with a full TCP/IP stack) will handle the rest

– FEROL is a client, PC is a server

– FEROL opens a TCP connection

– FEROL sends the data to the PC; data flows in one direction, from the client to the server

● Acknowledgement packets are sent back; they are part of the protocol
– The TCP connection is aborted instead of closed. A connection abort is unreliable and should be initiated by the server (PC). (A minimal sketch of the resulting sender state handling follows this list.)

– Use simple congestion control
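A minimal sketch of the reduced sender-side state handling implied by these simplifications (plain C; the names and structure are illustrative, not the actual FEROL firmware):

#include <stdbool.h>

/* Only three TCP states are used:
 * CLOSED -> SYN_SENT -> ESTABLISHED, torn down with RST (abort). */
enum tcp_state { CLOSED, SYN_SENT, ESTABLISHED };

struct tcp_conn {
    enum tcp_state state;
    unsigned int   snd_una;   /* oldest unacknowledged sequence number */
};

/* Called for every segment received back from the PC (the server). */
void on_segment(struct tcp_conn *c, bool syn, bool ack, bool rst, unsigned int ack_no)
{
    if (rst) {                       /* peer aborted: drop straight to CLOSED */
        c->state = CLOSED;
        return;
    }
    switch (c->state) {
    case SYN_SENT:
        if (syn && ack)              /* SYN+ACK completes the 3-way handshake */
            c->state = ESTABLISHED;
        break;
    case ESTABLISHED:
        if (ack)
            c->snd_una = ack_no;     /* advance the acknowledged-data pointer */
        break;
    case CLOSED:                     /* we never LISTEN and never receive data */
        break;
    }
}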

TCP Implementation (3)

TCP Implementation (4)

We don't listen (we are only a client) / we don't receive any data

TCP Implementation (5)

ABORT/RST

We do a connection abort instead of a connection close

TCP Implementation (6)

ABORT/RST

Final State Diagram

But not so simple...

Implementation and Simplifications

● Implemented

– Nagle's algorithm (data merging to utilize MTU)

– MTU Jumbo frame support up to 9000 bytes

– Window scaling (understands window sizes greater than 64KB)

– Silly window avoidance (not to send when receiver's window is small)

– Six TCP/IP Timers reduced to three timers implemented by one counter

● Connection-establishment timer, Retransmission timer, Persist timer

● Complex congestion control reduced to (a minimal sketch follows at the end of this slide):

– Exponential back-off: double the retransmission timeout if a packet is not acknowledged

– Fast retransmit: if only a single segment was lost, retransmit it immediately without waiting for the timeout

● Not implemented (not necessary)

– Timestamps, Selective acknowledgements, Out of band data (urgent data)

– Server part and data reception (FEROL is client and opens TCP/IP connection)
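A minimal sketch of the reduced congestion handling described above (plain C; the constants and the retransmit helper are illustrative assumptions, not the FEROL firmware):

void retransmit_oldest_segment(void);  /* assumed helper: resend oldest unacked segment */

#define RTO_INITIAL_MS    200          /* illustrative initial retransmission timeout */
#define RTO_MAX_MS        60000        /* cap for the exponential back-off            */
#define DUP_ACK_THRESHOLD 3            /* duplicate ACKs that trigger fast retransmit */

struct tx_state {
    unsigned int rto_ms;     /* current retransmission timeout             */
    unsigned int dup_acks;   /* consecutive duplicate ACKs seen            */
    unsigned int last_ack;   /* highest acknowledgement number seen so far */
};

/* Retransmission timer expired without new ACKs: exponential back-off. */
void on_rto_expired(struct tx_state *s)
{
    s->rto_ms *= 2;
    if (s->rto_ms > RTO_MAX_MS)
        s->rto_ms = RTO_MAX_MS;
    retransmit_oldest_segment();
}

/* ACK segment received from the PC. */
void on_ack(struct tx_state *s, unsigned int ack_no)
{
    if (ack_no == s->last_ack) {
        /* Duplicate ACK: the receiver is still missing one segment. */
        if (++s->dup_acks == DUP_ACK_THRESHOLD) {
            retransmit_oldest_segment();   /* fast retransmit, no timeout wait */
            s->dup_acks = 0;
        }
    } else {
        s->last_ack = ack_no;              /* new data acknowledged            */
        s->dup_acks = 0;
        s->rto_ms   = RTO_INITIAL_MS;      /* restart from the base timeout    */
    }
}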

FEROL TCP/IP Software Emulator

● Software implementation of the simplified TCP/IP
– For protocol verification and testing before implementing in hardware (e.g. verification of the TCP congestion control)
– Runs as a user-space program
● The TCP/IP packets must bypass the Linux kernel, otherwise they would interfere with the Linux TCP/IP stack
● Based on PF_RING* – received packets are stored in a circular buffer and read from user space (a minimal capture sketch follows below)

*http://www.ntop.org/products/pf_ring/
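A minimal sketch of how such an emulator can read raw packets past the kernel using the public PF_RING user-space API (the device name and packet handling are placeholders; error handling is trimmed):

#include <pfring.h>
#include <stdio.h>

int main(void)
{
    /* Open the interface in promiscuous mode; packets read here never enter
     * the normal Linux TCP/IP stack. The device name is an example. */
    pfring *ring = pfring_open("eth2", 9000 /* snaplen */, PF_RING_PROMISC);
    if (ring == NULL)
        return 1;

    pfring_set_application_name(ring, "ferol-tcp-emulator");
    pfring_enable_ring(ring);

    for (int i = 0; i < 1000; i++) {
        u_char *pkt;
        struct pfring_pkthdr hdr;
        /* Blocking read from the PF_RING circular buffer. */
        if (pfring_recv(ring, &pkt, 0, &hdr, 1) > 0) {
            /* Here the emulator would parse Ethernet/IP/TCP headers and feed
             * incoming ACK segments into its simplified TCP engine. */
            printf("captured %u bytes\n", hdr.len);
        }
    }

    pfring_close(ring);
    return 0;
}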

Is congestion control important?

[Diagram: five sender PCs feed 10 Gb/s lines into a single 10 Gb/s link towards one receiving PC; per-stream throughput figures of 5.29, 2.0 and 0.89 Gb/s appear in the diagram]

Senders: 2048 bytes @ 125 kHz ≈ 2.048 Gb/s each; 5 × 2.048 Gb/s = 10.24 Gb/s offered load

With even a little congestion, all the bandwidth is eaten up by buffers being re-sent due to the temporary congestion: without congestion control the link is not able to recover from this state, even though the link itself works flawlessly.

Link Aggregation (2 links into 1 link, 8 streams into 1 PC)

[Test setup: eight Dell R310 sender PCs with 10GE NICs, optical 10 Gbit/s Ethernet connections through a Brocade switch; two links (each loaded at 50%) aggregated into one link towards a Dell R620 receiver]

● 2 streams aggregated into one 10GE link ● 8 threads receiving data (1 thread per stream)

● Linux TCP stack compared to the FEROL simplified TCP (a sketch of the receiving side follows below)
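A minimal sketch of the receiving side used in such a test (standard Linux sockets, one POSIX thread per accepted stream; the port number and buffer size are illustrative, not the real DAQ configuration):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define PORT      10000     /* example port                 */
#define N_STREAMS 8         /* one TCP stream per sender    */

static void *stream_reader(void *arg)
{
    int fd = *(int *)arg;
    free(arg);
    char buf[65536];
    /* Drain the stream; a real Readout Unit would reassemble event fragments. */
    while (read(fd, buf, sizeof buf) > 0)
        ;
    close(fd);
    return NULL;
}

int main(void)
{
    int srv = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port        = htons(PORT);
    bind(srv, (struct sockaddr *)&addr, sizeof addr);
    listen(srv, N_STREAMS);

    pthread_t tid[N_STREAMS];
    for (int i = 0; i < N_STREAMS; i++) {
        int *fd = malloc(sizeof *fd);
        *fd = accept(srv, NULL, NULL);          /* one stream per thread */
        pthread_create(&tid[i], NULL, stream_reader, fd);
    }
    for (int i = 0; i < N_STREAMS; i++)
        pthread_join(tid[i], NULL);
    close(srv);
    return 0;
}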

Stream Aggregation (8 streams to 1 PC)

[Plot: throughput of the FEROL TCP emulator compared with Linux sockets]

FEROL Hardware Architecture

Hardware
– Altera Arria II GX FPGA
– Vitesse transceiver 10GE / XAUI
– QDR memory (16 MBytes)
– DDR2 memory (512 MBytes)

Interfaces
– FED/SlinkXpress interface: 2x optical 6 Gbit/s, 1x optical 10 Gbit/s
– DAQ interface: 1x optical 10 Gbit/s Ethernet

FEROL Operation Modes

Mode A (two TCP streams)
● Input
– 2x SlinkXpress 6 Gbit/s FED input
– Legacy S-LINK data through PCI-X
● Output
– 1x 10 Gbit/s Ethernet / optional second 10 Gbit/s Ethernet link
– 2x TCP streams
● Memory buffer is divided in two, one per stream
● Data fragments
– Internal generator at 10 Gbit/s speed
– PCI-X bus with maximum 6.4 Gbit/s
– SlinkXpress with maximum 2x 5.04 Gbit/s

Mode B (single TCP stream)
● Input
– 1x SlinkXpress 10 Gbit/s FED input
● Output
– 1x 10 Gbit/s Ethernet
– 1x TCP stream
● Memory buffer is used by one stream
● Data fragments
– Internal generator at 10 Gbit/s speed
– SlinkXpress with maximum 10 Gbit/s

FEROL TCP Core

● Several blocks handle the different protocols (ARP/ICMP/TCP)
● TCP payload is stored in 64-bit words
● TCP sequence is processed in multiples of 8 bytes (64 bits)
● ICMP (PING) is limited to 128 bytes of payload
● IP address is static and assigned by the control SW
● MAC address is kept in EEPROM memory

FPGA Resource Utilization

FPGA floorplan:
– 16% JTAG + Flash access + debugging
– 1% QDR memory logic
– 3% DDR memory logic
– 8% SlinkXpress 6 Gbit/s logic
– 3% PCI-X logic
– 3% Internal FED event generators
– 20% TCP core + 10 Gbit/s logic
– 3% One TCP stream

Altera Arria II GX 125 FPGA
● Logic utilization 54% (53611 / 99280 ALUTs)
● Internal memory 37% (2.5 Mbits)
● GXB transceivers: 10 of 12

Production

● 21 FEROL prototypes built – 16 FEROLs installed in the CMS online computing room

● Production of 650 FEROLs has been launched

● September – Pre-series of 32 boards started

● November – Pre-series tests

● The remaining 618 boards will be produced after the validation of the pre-series.

● Installation 100 meters below ground in April

Point-2-Point 10GE Measurements

[Setup: FEROL (FPGA) on the FRL, optical 10 Gbit/s Ethernet connection to a PC with a 10GE NIC]

● DELL PowerEdge C6100 as a receiver PC – CPU Xeon X5650 @ 2.67 GHz

– Myricom 10GE NIC

Point-2-Point 10GE Measurements

● Linux TCP Sockets compared to FEROL “TCP/IP” implementation

● Throughput measured with different fragment sizes

● One CPU core utilized at 85%

Link Aggregation (8 links into 1 link, 16 streams into 1 PC)

[Setup: FEROLs 01–08 connected through a Mellanox SX1036 40 Gbit/s Ethernet switch to a DELL PowerEdge R720 with a Mellanox 40 Gbit/s Ethernet NIC]

Tests
● Link/stream aggregation test with up to 16 streams
● Stability – 16 streams running at 100 kHz with 2 kB data fragments (~26 Gbit/s) – no backpressure (FEROL buffer overflow) for 125 hours (5 days)

Stream Aggregation (16 Streams to 1 PC)

Throughput for more than 12 streams is constant if hyper-threading is enabled

Summary

● Simplified TCP/IP is working!

– Verified by software emulation

– Tested with a real FEROL, reaching a maximum throughput of 9.70 Gbps over the 10GE interface

● Stream aggregation (congestion control) is working

– Test with 8 devices producing up to 16 streams

– Maximum aggregated throughput is 39.66 Gbps over 40GE interface

– Stable streams (at 26 Gbit/s) over 5 days, no problems found

● Observations

– The maximum performance is sensitive to the PC configuration

– BIOS settings: Hyper-threading on/off

– OS settings: TCP socket buffer settings, IRQ affinities

– User process settings: CPU core affinities (a minimal tuning sketch follows below)
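A minimal sketch of the per-socket and per-thread tuning these observations refer to (SO_RCVBUF and CPU affinity via standard Linux calls; the function name and the values are illustrative, the real settings were tuned per machine):

#define _GNU_SOURCE            /* for CPU_SET / sched_setaffinity */
#include <sched.h>
#include <sys/socket.h>

/* Enlarge the TCP receive buffer of one stream socket and pin the calling
 * (reader) thread to a given CPU core. */
int tune_stream(int fd, int core)
{
    int rcvbuf = 8 * 1024 * 1024;          /* 8 MB socket receive buffer (example) */
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof rcvbuf) < 0)
        return -1;

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return sched_setaffinity(0, sizeof set, &set);   /* 0 = calling thread */
}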

Summary...

To the best of our knowledge, this is the first TCP/IP hardware implementation running at 10 Gbit/s in High Energy Physics!

Thank You


Backup Slides

LabVIEW-based control & monitoring

First TCP Stream (FED0) Second TCP Stream (FED1)

Web-based control & monitoring application

FEROL TCP/IP Software Emulator

● For protocol verification and testing we developed a FEROL TCP/IP emulator implementing the simplified TCP/IP in software as a user-space program

● The “emulated” TCP/IP streams bypass the Linux kernel (PF_RING)

Dell PowerEdge R310 – Intel Xeon X3450 @ 2.67 GHz (8 MB cache, 4 cores); Dell PowerEdge C6100 – Intel Xeon X5650 @ 2.67 GHz (12 MB cache, 6 cores)

Opening and closing TCP connections

● Opening connection (one way/client)

– Standard 3-way handshake is used

– States: CLOSED → SYN_SENT → ESTABLISHED

– States not used: LISTEN, SYN_RECEIVED

● Connection closing

– RST is sent (connection is aborted)

– States: ESTABLISHED → CLOSED

– States not used: FIN_WAIT_1, FIN_WAIT_2, TIME_WAIT, CLOSING, CLOSE_WAIT, LAST_ACK

Connection abort with the Linux TCP/IP stack:

struct linger so_linger;
so_linger.l_onoff  = 1;   /* enable SO_LINGER                                */
so_linger.l_linger = 0;   /* linger time 0: close() aborts and sends a RST   */
setsockopt(s, SOL_SOCKET, SO_LINGER, &so_linger, sizeof so_linger);

FEROL TCP/IP Stream Aggregation

● TCP/IP allows sending multiple streams over the same link; reliability and congestion are handled by TCP/IP

● n streams (here 8/16) are concentrated in one 40 GE switch (the aggregation ratio depends on the rate requirements)

● Aggregated streams are sent through one 40 GE interface to the RU PC

● Fewer networking devices and PCs are required

● When a PC dies the network is re-configured (e.g., the aggregation ratio is changed or a 'hot spare' PC is used)


TCP and Data Fragments


● A TCP connection is a stream connection ● Data (from TCP's point of view) have no beginning and no end – no relationship to fragments

● How to distinguish fragments? – An additional header is used (it contains the length and fragment Start/End marks; a sketch follows below) – Used by the receiving PC to reassemble fragments and to limit the segment length

[Diagram: small fragments are each wrapped as HDR(S,E) + payload (Fragment 1, F2, F3); a large fragment is split across segments as HDR(S) + first part of Fragment 4 and HDR(E) + last part of Fragment 4]
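A sketch of what such a fragment header could look like (field names, widths and flag encoding are illustrative assumptions; this is not the actual FEROL header layout):

#include <stdint.h>

#define FRAG_FLAG_START 0x1   /* this piece starts a new event fragment */
#define FRAG_FLAG_END   0x2   /* this piece ends the event fragment     */

/* Hypothetical header prepended to each piece of payload in the TCP byte
 * stream; the receiving PC uses it to find fragment boundaries. */
struct frag_hdr {
    uint32_t length;          /* payload bytes following this header    */
    uint16_t flags;           /* combination of FRAG_FLAG_START / _END  */
    uint16_t source_id;       /* e.g. which FED the fragment came from  */
};

/* A small fragment is sent as one header with both flags set; a fragment
 * larger than the segment limit is split as [hdr S] ... [hdr E]. */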

SlinkXpress (for AMC13)

Input link for CMS DAQ

The protocol was tested with the HCAL AMC13 (ver1):
– 5 Gb/s (8b/10b encoding) with FEROL
– up to 6.3 Gb/s (8b/10b encoding) with FEROL
– 10 Gb/s (XAUI interface) exists
– 10 Gb/s with 64b/66b encoding (to be tested)
