Throughput Issues for High-Speed Wide-Area Networks

Brian L. Tierney, ESnet, Lawrence Berkeley National Laboratory
http://fasterdata.es.net/
(HEPIX, Oct 30, 2009)

Why does the Network seem so slow?

TCP Performance Issues

• Getting good TCP performance over high-bandwidth networks is not easy!
• You must keep the TCP window full
  – the size of the window is directly related to the network latency
• Easy to compute max throughput:
  – Throughput = buffer size / latency
  – e.g.: 64 KB buffer / 40 ms path = 1.6 MBytes (12.8 Mbits) / sec

Setting the TCP buffer sizes
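The window-limited throughput bound above can be checked with a couple of lines of arithmetic (a sketch; `max_throughput_bytes_per_sec` is a hypothetical helper name, and the numbers are the 64 KB / 40 ms example):

```python
def max_throughput_bytes_per_sec(buffer_bytes, rtt_sec):
    """TCP throughput is bounded by window size divided by round-trip time."""
    return buffer_bytes / rtt_sec

# 64 KB buffer over a 40 ms path:
bps = max_throughput_bytes_per_sec(64 * 1000, 0.040)
print(round(bps / 1e6, 1))      # 1.6 (MBytes/sec)
print(round(bps * 8 / 1e6, 1))  # 12.8 (Mbits/sec)
```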

• It is critical to use the optimal TCP send and receive socket buffer sizes for the link you are using.
  – Recommended size to fill the pipe
    • 2 x Bandwidth Delay Product (BDP)
  – Recommended size to leave some bandwidth for others
    • around 20% of (2 x BDP) = .4 * BDP
• Default TCP buffer sizes are way too small for today's high speed networks
  – Until recently, default TCP send/receive buffers were typically 64 KB
  – tuned buffer to fill LBL to BNL link: 10 MB
    • 150X bigger than the default buffer size
  – with default TCP buffers, you can only get a small % of the available bandwidth!

Buffer Size Example

• Optimal Buffer size formula:
  – buffer size = 20% * (2 * bandwidth * RTT)
• ping time (RTT) = 50 ms
• Narrow link = 500 Mbps (62 MBytes/sec)
  – e.g.: the end-to-end network consists of all 1000 BT and OC-12 (622 Mbps)
• TCP buffers should be:
  – .05 sec * 62 * 2 * 20% = 1.24 MBytes

Sample Buffer Sizes
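The arithmetic above can be reproduced directly (a sketch; `recommended_buffer_bytes` is a hypothetical name for the 20%-of-2×BDP rule):

```python
def recommended_buffer_bytes(bw_bytes_per_sec, rtt_sec, share=0.20):
    """20% of (2 x BDP): fills about `share` of the pipe, leaving room for others."""
    return share * 2 * bw_bytes_per_sec * rtt_sec

# Narrow link = 500 Mbps (62 MBytes/sec), RTT = 50 ms:
print(round(recommended_buffer_bytes(62e6, 0.050) / 1e6, 2))  # 1.24 (MBytes)
```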

• LBL to (50% of pipe)
  – SLAC (RTT = 2 ms, narrow link = 1000 Mbps): 256 KB
  – BNL (RTT = 80 ms, narrow link = 1000 Mbps): 10 MB
  – CERN (RTT = 165 ms, narrow link = 1000 Mbps): 20.6 MB
• Note: default buffer size is usually only 64 KB, and default maximum autotuning buffer size is often only 256 KB
  – e.g.: FreeBSD 7.2 autotuning default max = 256 KB
  – 10-150 times too small!
• Home DSL, US to Europe (RTT = 150 ms, narrow link = 2 Mbps): 38 KB
  – Default buffers are OK.

Socket Buffer Autotuning
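The sizes in the list above follow from one bandwidth-delay product (which fills ~50% of the pipe, given the 2 x BDP rule); a quick check, using a hypothetical helper name:

```python
def bdp_bytes(link_mbps, rtt_ms):
    """One bandwidth-delay product: link rate (bytes/sec) times RTT (sec)."""
    return (link_mbps * 1e6 / 8) * (rtt_ms / 1e3)

# Prints roughly 250 KB, 10000 KB, 20625 KB, and 38 KB - matching the list above.
for site, rtt_ms, link_mbps in [("SLAC", 2, 1000), ("BNL", 80, 1000),
                                ("CERN", 165, 1000), ("DSL US-Europe", 150, 2)]:
    print(site, round(bdp_bytes(link_mbps, rtt_ms) / 1e3), "KB")
```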

• To solve the buffer tuning problem, based on work at LANL and PSC, the Linux OS added TCP buffer autotuning
  – Sender-side TCP buffer autotuning introduced in Linux 2.4
  – Receiver-side autotuning added in Linux 2.6
• Most OS's now include TCP autotuning
  – TCP send buffer starts at 64 KB
  – As the data transfer takes place, the buffer size is continuously re-adjusted up to the max autotune size
• Current OS autotuning default maximum buffers
  – Linux 2.6: 256K to 4MB, depending on version
  – FreeBSD 7: 256K
  – Windows Vista: 16M
  – Mac OSX 10.5: 8M

Autotuning Settings
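Applications can also bypass autotuning and request explicit buffers per socket; a minimal sketch (note that on Linux, setting SO_RCVBUF disables receive autotuning for that socket, and the kernel clamps requests to net.core.rmem_max / wmem_max):

```python
import socket

# Request 10 MB socket buffers; do this before connect() so the TCP
# window scale option is negotiated to match.
BUF_SIZE = 10 * 1024 * 1024

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, BUF_SIZE)
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, BUF_SIZE)

# The kernel may clamp (or, on Linux, double) the requested value:
print(s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))
s.close()
```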

• Linux 2.6
  net.core.rmem_max = 16777216
  net.core.wmem_max = 16777216
  # autotuning min, default, and max number of bytes to use
  net.ipv4.tcp_rmem = 4096 87380 16777216
  net.ipv4.tcp_wmem = 4096 65536 16777216
• FreeBSD 7.0
  net.inet.tcp.sendbuf_auto=1
  net.inet.tcp.recvbuf_auto=1
  net.inet.tcp.sendbuf_max=16777216
  net.inet.tcp.recvbuf_max=16777216
• Windows Vista
  netsh interface tcp set global autotuninglevel=normal
  – max buffer fixed at 16 MB
• OSX 10.5 ("Self-Tuning TCP")
  kern.ipc.maxsockbuf=16777216
• For more info, see: http://fasterdata.es.net/TCP-Tuning/

More Problems: TCP congestion control

Path = LBL to CERN (Geneva), OC-3 (in 2000), RTT = 150 ms
average BW = 30 Mbps

Parallel Streams Can Help

Parallel Streams Issues

• Potentially unfair
• Places more load on the end hosts
• But they are necessary when you don't have root access, and can't convince the sysadmin to increase the max TCP buffers
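The benefit follows from the window-limit formula above: when buffers are too small for the path, each stream contributes its own window's worth of throughput. A back-of-the-envelope sketch (hypothetical helper; real streams also interact through congestion control):

```python
def aggregate_mbps(n_streams, buffer_bytes, rtt_sec, path_mbps):
    """Each window-limited stream gets ~buffer/RTT; the path caps the total."""
    per_stream_mbps = buffer_bytes / rtt_sec * 8 / 1e6
    return min(n_streams * per_stream_mbps, path_mbps)

# 64 KB default buffers, 80 ms RTT (e.g. LBL to BNL), 1 Gbps path:
for n in (1, 4, 16):
    print(n, "streams:", round(aggregate_mbps(n, 64 * 1024, 0.080, 1000), 1), "Mbps")
    # prints ~6.6, ~26.2, and ~104.9 Mbps respectively
```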

graph from Tom Dunigan, ORNL

TCP Response Function

• Well known fact that TCP Reno does not scale to high-speed networks
• Average TCP congestion window = 1.2 / sqrt(p) segments
  – p = packet loss rate
  – see: http://www.icir.org/floyd/hstcp.html
• What this means:
  – For a TCP connection with 1500-byte packets and a 100 ms round-trip time
    • filling a 10 Gbps pipe would require a congestion window of 83,333 packets,
    • and a packet drop rate of at most one drop every 5,000,000,000 packets.
  – requires at most one packet loss every 6000 s, or 1h:40m, to keep the pipe full

Proposed TCP Modifications
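The numbers above fall out of inverting the Reno response function (average cwnd = 1.2/sqrt(p), from the Floyd reference); a sketch:

```python
# Invert avg_cwnd = 1.2 / sqrt(p) to find the loss rate p a window requires.
def required_loss_rate(window_segments):
    return (1.2 / window_segments) ** 2

# 10 Gbps pipe, 100 ms RTT, 1500-byte packets:
window = 10e9 * 0.100 / (1500 * 8)   # segments needed in flight
p = required_loss_rate(window)
pkts_per_sec = 10e9 / (1500 * 8)

print(round(window))                 # 83333 segments
print(round(1 / p / 1e9, 1))         # ~4.8e9 packets between drops (the slide rounds to 5e9)
print(round(1 / p / pkts_per_sec))   # ~5787 s between drops (the slide rounds to 6000 s)
```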

• Many proposed alternate congestion control algorithms:
  – BIC/CUBIC
  – HTCP (Hamilton TCP)
  – Scalable TCP
  – Compound TCP
  – And several more

Linux 2.6.12 Results

[Figure: TCP throughput (Mbps, 0-800) vs. time slot (5-second intervals), RTT = 67 ms, comparing Linux 2.6 with BIC TCP, Linux 2.4, and Linux 2.6 with BIC off. Note that BIC reaches max throughput MUCH faster.]

Comparison of Various TCP Congestion Control Algorithms

Selecting TCP Congestion Control in Linux

• To determine current configuration:
  sysctl -a | grep congestion
  net.ipv4.tcp_congestion_control = cubic
  net.ipv4.tcp_available_congestion_control = cubic reno
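In addition to the system-wide default, a Linux application can pick any loaded algorithm for a single socket via the TCP_CONGESTION socket option (exposed in Python 3.6+; a sketch, guarded since the constant only exists on Linux):

```python
import socket

def set_congestion_control(sock, algorithm):
    """Select a TCP congestion control algorithm for one socket (Linux only)."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, algorithm.encode())

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
if hasattr(socket, "TCP_CONGESTION"):
    set_congestion_control(s, "cubic")  # any name from tcp_available_congestion_control
    raw = s.getsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, 16)
    print(raw.split(b"\x00")[0].decode())
s.close()
```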

• Use /etc/sysctl.conf to set to any available congestion control.
• Supported options (may need to be enabled in your kernel):
  – CUBIC, BIC, HTCP, HSTCP, STCP, LTCP, more..

Summary so far

• To optimize TCP throughput, do the following:
  – Use a newer OS that supports TCP buffer autotuning
  – Increase the maximum TCP autotuning buffer size
  – Use a few parallel streams if possible
  – Use a modern congestion control algorithm

PERFSONAR

High Performance Networking

• Most of the R&E community has access to 10 Gbps networks.
• Naive users with the right tools should be able to easily get:
  – 200 Mbps per stream between properly maintained systems
  – 2 Gbps aggregate rates between significant computing resources
• Most users are not experiencing this level of performance
  – "There is widespread frustration involved in the transfer of large data sets between facilities, or from the data's facility of origin back to the researcher's home institution." (From the BES network requirements workshop: http://www.es.net/pub/esnet-doc/BES-Net-Req-Workshop-2007-Final-Report.pdf)
• We can increase performance by measuring the network and reporting problems!

Network Troubleshooting is a Multi-Domain Problem

[Diagram: end-to-end path from a source campus (S), through a regional network and an education backbone (e.g. JET), to a destination campus (D).]

Where are common problems?

[Diagram: the same multi-domain path. Latency-dependent problems occur inside domains with small RTT; congested or faulty links occur between domains.]

Soft Network Failures

• Soft failures are where basic connectivity functions, but high performance is not possible.
• TCP was intentionally designed to hide all transmission errors from the user:
  – "As long as the TCPs continue to function properly and the internet system does not become completely partitioned, no transmission errors will affect the users." (From IEN 129, RFC 716)
• Some soft failures only affect high bandwidth, long RTT flows.
• Hard failures are easy to detect & fix; soft failures can lie hidden for years.

Common Soft Failures

• Small Queue Tail Drop
  – Switches not able to handle the long packet trains prevalent in long RTT sessions and cross traffic at the same time
• Un-intentional Rate Limiting
  – Process switching on Cisco 6500 devices due to faults, ACLs, or mis-configuration
  – Security devices
    • e.g.: 10X improvement by turning off Cisco Reflexive ACL
• Random Packet Loss
  – Bad fibers or connectors
  – Low light levels due to amps/interfaces failing
  – Duplex mismatch

Local testing will not find all problems

Performance is good when RTT is < 20 ms; performance is poor when RTT exceeds 20 ms

[Diagram: path from a source campus (S) through a regional network containing a switch with small buffers, across the R&E backbone and a second regional network, to a destination campus (D).]

Addressing the Problem: perfSONAR

• Developing an open web-services based framework for collecting, managing and sharing network measurements
• Deploying the framework across the science community
• Encouraging people to deploy 'known good' measurement points near domain boundaries
• Using the framework to find & correct soft network failures.

perfSONAR Deployments

• Internet2
• APAN
• ESnet
• GLORIAD
• Argonne National Lab
• JGN2PLUS
• Brookhaven National Lab
• KISTI Korea
• Fermilab
• Monash University, Melbourne
• National Energy Research Scientific Computing Center
• NCHC, HsinChu, Taiwan
• Simon Fraser University
• Pacific Northwest National Lab
• GEANT
• University of Michigan, Ann Arbor
• GARR
• Indiana University
• HUNGARNET
• Boston University
• PIONEER
• University of Texas Arlington
• SWITCH
• Oklahoma University, Norman
• CCIN2P3
• Michigan Information Technology Center
• CERN
• William & Mary
• CNAF
• University of Wisconsin Madison
• DE-KIT
• Southern Methodist University, Dallas
• NIKHEF/SARA
• University of Texas Austin
• PIC
• Vanderbilt University
• RAL
• TRIUMF

perfSONAR Architecture

• Interoperable network measurement middleware (SOA):
  – Modular
  – Web services-based
  – Decentralized
  – Locally controlled
• Integrates:
  – Network measurement tools and data archives
  – Data manipulation
  – Information Services
    • Discovery
    • Topology
    • Authentication and authorization
• Based on:
  – Open Grid Forum (OGF) Network Measurement Working Group (NM-WG) schema
  – Formalizing the specification of perfSONAR protocols in the OGF NMC-WG
  – Network topology description work in the OGF NML-WG

Main perfSONAR Services

• Lookup Service
  – gLS – Global service used to find services
  – hLS – Home service for registering local perfSONAR metadata
• Measurement Archives
  – SNMP MA – Interface data
  – pSB MA – Scheduled bandwidth and latency data
• Measurement Points
  – BWCTL
  – OWAMP
  – PingER
• Troubleshooting Tools
  – NDT
  – NPAD
• Topology Service

perfSONAR-PS Deployment Status

• Currently deployed in over 130 locations
  – 45 bwctl and owamp servers
  – 40 pSB MA
  – 15 SNMP MA
  – 5 gLS and 135 hLS
• US Atlas Deployment
  – Monitoring all "Tier 1 to Tier 2" connections
• For current list of services, see:
  – http://www.perfsonar.net/activeServices.html

Deploying a perfSONAR measurement host in under 30 minutes

• Using the pS Performance Toolkit is very simple
  – Boot from CD
  – Use command line tool to configure
    • which disk partition to use for persistent data
    • Network address and DNS
    • User and root passwords
  – Use Web GUI to configure
    • Select which services to run
    • Select remote measurement points for bandwidth and latency tests
    • Configure Cacti to collect SNMP data from key router interfaces

SAMPLE RESULTS

Example: US Atlas

• LHC Tier 1 to Tier 2 Center data transfer problem
  – Couldn't exceed 1 Gbps across a 10GE end-to-end path that included 5 administrative domains
  – Used perfSONAR tools to localize the problem
  – Identified problem device
    • An unrelated domain had leaked a full routing table to the router for a short time, causing FIB corruption. The routing problem was fixed, but the router started process switching some flows after that.
  – Fixed
    • Rebooting the device fixed the symptoms of the problem
    • Better BGP filters configured to prevent recurrence

Sample Results: Fixed a soft failure

Sample Results: Bulk Data Transfer between DOE Supercomputer Centers

• Users were having problems moving data between supercomputer centers
  – One user was: "waiting more than an entire workday for a 33 GB input file"
• perfSONAR measurement tools were installed
  – Regularly scheduled measurements were started
• Numerous choke points were identified & corrected
• Dedicated wide area transfer nodes were set up
  – Tuned for wide area transfers
  – Now moving 40 TB in less than 3 days

Importance of Regular Testing

• You can't wait for users to report problems and then fix them
• Soft failures can go unreported for years
• Problems that get fixed have a way of coming back
  – System defaults coming back after a hardware/software upgrade
  – New employees not knowing why the previous employee set things up that way
• perfSONAR makes it easy to collect, archive, and alert on throughput information

Example: NERSC <-> ORNL Testing

More Information

• http://fasterdata.es.net/

• email: [email protected]

• Also see:
  – http://fasterdata.es.net/talks/Bulk-transfer-tutorial.pdf
  – ://plone3.fnal.gov/P0/WAN/netperf/methodology/