Fixing the Internet: Congestion Control and Buffer Bloat

Influential OS Research
Julian Stecklina

Outline

● TCP Protocol
● TCP Implementation
● Queuing Algorithms
● Buffer Bloat

Unreliable Packet Networks

A packet can be:
● damaged,
● dropped,
● duplicated,
● reordered with other packets.

Need an abstraction to deal with this!

Transmission Control Protocol

Provides a bidirectional channel. Each direction is a reliable byte stream over an unreliable, best-effort packet network.

TCPv4 was specified in 1981 [RFC 793]. Later changes are essentially backward compatible.

Basis for advanced protocols (MPTCP, SCTP).

TCP Header

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          Source Port          |       Destination Port        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Sequence Number                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Acknowledgment Number                      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Data |           |U|A|P|R|S|F|                               |
| Offset| Reserved  |R|C|S|S|Y|I|            Window             |
|       |           |G|K|H|T|N|N|                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Checksum            |         Urgent Pointer        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Options                    |    Padding    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                             data                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Sequence Number: position in the byte stream of the first byte in this packet.
Acknowledgment Number: byte stream received correctly up to this byte.
Window: can receive this many bytes additionally.

TCP Header Format [RFC 793]
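As a rough illustration, the fixed part of this header maps to a C struct along these lines (a sketch: all multi-byte fields are in network byte order, and data offset/flags are kept as plain bytes to sidestep endianness-dependent bit-field layout):

    #include <stdint.h>

    /* Sketch of the fixed 20-byte TCP header from RFC 793. */
    struct tcp_header {
        uint16_t source_port;
        uint16_t destination_port;
        uint32_t sequence_number;       /* first byte of this segment in the stream */
        uint32_t acknowledgment_number; /* next byte expected from the peer         */
        uint8_t  data_offset;           /* upper 4 bits: header length in words     */
        uint8_t  flags;                 /* URG, ACK, PSH, RST, SYN, FIN             */
        uint16_t window;                /* how many more bytes we can receive       */
        uint16_t checksum;
        uint16_t urgent_pointer;
        /* options and padding follow if data_offset > 5 */
    };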

Receive Window

[Figure: the receive window within the 2^32-wide sequence number space, starting at the initial sequence number. Data before the window has been received, ACKed, and delivered, or waits for a recv() call by the application; delivered data can be discarded. The announced window is the available space in the receive buffer; received packets that fill it still need to be ACKed.]

Send Window

[Figure: the send window within the 2^32-wide sequence number space, starting at the initial sequence number. Data that was sent and ACKed can be discarded; sent data waits for ACKs; data within the receive window announced by the peer waits to be transmitted; data beyond that sits in the send buffer, ready to be sent, but can't be sent yet.]
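A minimal sketch of the sender-side bookkeeping the figure implies (the snd_* names follow common BSD/Linux convention and are not from the slides):

    #include <stdint.h>

    /* Sender-side window state (sketch). */
    struct send_window {
        uint32_t snd_una; /* oldest sent-but-unacknowledged sequence number */
        uint32_t snd_nxt; /* next sequence number to send                   */
        uint32_t snd_wnd; /* receive window last announced by the peer      */
    };

    /* Bytes in flight; sequence arithmetic wraps modulo 2^32. */
    static uint32_t bytes_in_flight(const struct send_window *w)
    {
        return w->snd_nxt - w->snd_una;
    }

    /* Bytes we may still send without overrunning the announced window. */
    static uint32_t send_budget(const struct send_window *w)
    {
        uint32_t inflight = bytes_in_flight(w);
        return inflight < w->snd_wnd ? w->snd_wnd - inflight : 0;
    }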

Congestion was recognized as a problem in the Internet as early as 1984 [RFC 896].

By 1986, ‘congestion collapses’ were observed, where throughput dropped by orders of magnitude. Especially bad when networks of wildly varying speeds are connected.

How much can we send?

“The internet is not something you just dump something on. It’s not a big truck. It’s, it’s series of tubes. [...] when you put your message in it, it gets in line, it’s gonna be delayed by anyone who puts into that tube enormous amounts of material [...]”
Ted Stevens (former US Senator of Alaska)

Data is sent faster than the slow middle link can handle.

[Figure: retransmissions in a TCP/IP connection without congestion avoidance; two fast networks are connected by a slow link. The same data is (re)transmitted up to 5x. [Jacobson 1988]]

Network Congestion

A connection should reach an equilibrium where the ‘conservation of packets’ principle holds: a new packet is not put into the network until an old one leaves. [Jacobson 1988]

How to get to equilibrium? → Slow Start

How to maintain it? → Round-trip Timing → Congestion Avoidance

[Figure: “self-clocking” in a TCP connection; the number of packets in flight is constant. [Jacobson 1988]]

Bandwidth Delay Product (BDP)

BDP = packet delay ⋅ bandwidth

The ideal amount of unacknowledged data in flight; in TCP speak, usually called the ‘pipe size’.
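A worked example (the link numbers are illustrative, chosen to match the 16 Mbit/s link used later):

    #include <stdio.h>

    /* BDP = packet delay * bandwidth. */
    int main(void)
    {
        double bandwidth_bps = 16e6; /* 16 Mbit/s link (illustrative)    */
        double delay_s       = 0.1;  /* 100 ms round trip (illustrative) */

        double bdp_bytes = bandwidth_bps * delay_s / 8.0;
        printf("BDP: %.0f bytes (~%.0f KB)\n", bdp_bytes, bdp_bytes / 1000.0);
        /* Prints: BDP: 200000 bytes (~200 KB) */
        return 0;
    }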

Round-Trip Time Estimation

Need RTT estimation to know when to retransmit data. Retransmitting earlier violates the conservation rule; retransmitting later wastes bandwidth.

Old method [RFC 793], roughly:

R ← αR + (1 − α)M
RTO ← βR

where M is the last measured RTT, R the smoothed RTT, α a smoothing factor, β a scaling factor, and RTO the retransmission timeout.

Retransmission Death Spiral

β in [1.3, 2.0] was suggested, but a loaded network exhibits more variance in RTTs.

Early retransmissions → More traffic → More RTT variance → Collapse

A solution is to estimate β based on observed variance.

Estimating RTT Variation

Error: e ← M − R
RTT estimation: R ← R + αe (same as before)
Variance estimation*: v ← v + α′(|e| − v)

RTO ← R + 4v
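In C, the estimator looks roughly like this (the gains α = 1/8 and α′ = 1/4 are the values commonly used in implementations, assumed here):

    /* RTT/RTO estimation with mean-deviation tracking (sketch).
     * m is the newly measured RTT; all times in the same unit. */
    struct rtt_estimator {
        double r;   /* smoothed RTT R         */
        double v;   /* mean deviation v       */
        double rto; /* retransmission timeout */
    };

    static void rtt_update(struct rtt_estimator *est, double m)
    {
        const double a  = 0.125; /* alpha, assumed 1/8  */
        const double av = 0.25;  /* alpha', assumed 1/4 */

        double e = m - est->r;                      /* e <- M - R         */
        est->r += a * e;                            /* R <- R + alpha*e   */
        est->v += av * ((e < 0 ? -e : e) - est->v); /* v <- v + a'(|e|-v) */
        est->rto = est->r + 4.0 * est->v;           /* RTO <- R + 4v      */
    }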

RTO adapts to changes in RTT variance. No early retransmissions in most scenarios.

*Computes the mean deviation, which is smaller than the standard deviation.

Slow Start

Probe the available bandwidth by introducing a congestion window (cwnd). Only send min(cwnd, rwnd) bytes.

Congestion window starts at 1* packet. Increase by 1 per received ACK.

Effectively doubles the bandwidth per round-trip. No need to change the wire protocol.

*Today we usually start with 2 or more packets, and ACKs are generated for every two packets.

Congestion Avoidance

With proper RTO, timeout fires for lost packets. Packets are lost due to damage or congestion.

TCP model assumes rare packet damage. Timeout indicates network congestion.

When congestion occurs, half the current window size is remembered as ssthresh, cwnd is reset to its initial value, and slow start takes over again. (4.3BSD Tahoe TCP)

If cwnd > ssthresh, slow start stops and the window grows linearly by at most 1 segment per RTT. [Stevens 1994]

[Figure: TCP/IP with Slow Start and Congestion Avoidance. [Jacobson 1988]]
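Slow start, congestion avoidance, and the Tahoe reaction to a timeout fit into a few lines (a sketch in segment units; real stacks count bytes):

    /* Tahoe-style congestion window management (sketch, in segments). */
    struct cwnd_state {
        double cwnd;     /* congestion window    */
        double ssthresh; /* slow start threshold */
    };

    /* On each ACK of new data. */
    static void on_ack(struct cwnd_state *s)
    {
        if (s->cwnd < s->ssthresh)
            s->cwnd += 1.0;           /* slow start: +1 per ACK, ~doubles per RTT  */
        else
            s->cwnd += 1.0 / s->cwnd; /* congestion avoidance: ~+1 segment per RTT */
    }

    /* On retransmission timeout (taken as a congestion signal). */
    static void on_timeout(struct cwnd_state *s)
    {
        s->ssthresh = s->cwnd / 2.0;  /* remember half the current window */
        s->cwnd = 1.0;                /* back to the initial window; slow start again */
    }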

Recap: TCP

Two phases of operation:
● Slow Start, with exponential window growth, on a new connection or after a timeout.
● Congestion Avoidance, with linear window growth.

RTT estimation is crucial to maintain ‘conservation of packets’. Packet loss behavior is improved with Fast Retransmit and Fast Recovery.

Assumptions on Internet Routers

Routers/switches in the Internet are dumb and best-effort. Packets may be:
● dropped
● damaged
● reordered

Transient congestion is handled via buffers. Routers know in advance that congestion is likely.

Global Synchronization

With a FIFO queue and tail drop, congestion causes multiple TCP connections to back off and slow-start at the same time.

The connections synchronize and cause congestion again.

Available bandwidth is used inefficiently when Global Synchronization occurs.

RED: Random Early Detection

Probabilistically mark packets when the queue is filling up. Drop instead if the protocol doesn’t support marking. [Floyd 1993]:

[Figure: switch buffer fill level from empty to full. Below min_th, no packets are marked; between min_th and max_th, the marking rate rises from marking few to marking more packets; above max_th, all packets are marked.]

RED: Mark Probability

When the average queue size is between min_th and max_th:

p_b = max_p · (avg − min_th) / (max_th − min_th)

Avoid global synchronization by keeping a count of packets since the last mark:

p_a = p_b / (1 − count · p_b)

At p_b = 0.01, at least every 100th packet will be marked or dropped.
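The resulting per-packet decision, sketched in C (avg is the average queue size, maintained elsewhere as an EWMA; thresholds and the use of rand() are illustrative):

    #include <stdlib.h>

    struct red_state {
        double min_th, max_th; /* queue-size thresholds       */
        double max_p;          /* maximum marking probability */
        int    count;          /* packets since the last mark */
    };

    /* Returns 1 if the arriving packet should be marked (or dropped). */
    static int red_should_mark(struct red_state *s, double avg)
    {
        if (avg < s->min_th) {          /* short queue: never mark */
            s->count = 0;
            return 0;
        }
        if (avg >= s->max_th)           /* long queue: always mark */
            return 1;

        double pb = s->max_p * (avg - s->min_th) / (s->max_th - s->min_th);
        double denom = 1.0 - s->count * pb;

        /* p_a grows with count, forcing a mark after at most 1/p_b packets. */
        if (denom <= 0.0 || (double)rand() / RAND_MAX < pb / denom) {
            s->count = 0;
            return 1;
        }
        s->count++;
        return 0;
    }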

Explicit Congestion Notification (ECN)

Dropping packets while the switch still has buffer space is wasteful. ECN introduces marking of TCP packets instead [RFC 3168].

A Congestion Experienced (CE) mark is treated like packet loss. Not widely deployed due to compatibility concerns.

Recap: Queue Management

Simple FIFO queues in switches lead to global synchronization and poor performance. Active Queue Management (AQM) is needed.

RED is an example of an AQM system with probabilistic marking/dropping of packets according to the average queue size.

Buffer Bloat

AQM is complicated, so... use the el cheapo solution: large buffers.

“AQM is MIA” [Gettys 2011]

TCP fills the pipe! The buffer in front of the slowest link will be filled.

Link buffers are inflated → “bufferbloat”.

[Figure: a path from a home WiFi network over a DSL/cable modem and an ISP switch to a server. The problematic buffer sits in front of the slowest link: at the WiFi hop if WiFi is slow, at the modem if DSL/cable is slow, at the ISP if Netflix has a new Breaking Bad episode. But the slowest link and its bandwidth change!]

[Figure: buffer at the slowest link during startup. After one RTT it is clear that the queue will never be empty: the congestion window is larger than the BDP! [Nichols 2012]]

Standing Queues

When the congestion window is larger than BDP, standing queues develop.

Standing queues add needless latency: a 1 MiB standing buffer at 16 Mbit/s link speed → ~0.5 s of latency (1 MiB ≈ 8.4 Mbit; 8.4 Mbit ÷ 16 Mbit/s ≈ 0.52 s).

Latency vs. Fairness

The RTT determines the ‘reaction time’ of a TCP connection. A long-lived connection causing large queueing latency essentially starves short-lived connections.

Short connections spend most of their time in Slow Start, and their ramp-up speed is determined by the RTT.

‘Good’ vs. ‘Bad’ Queue

How do we decide how much queueing is bad? [Nichols 2012]

‘Good’ queue:
● absorbs temporary changes in packet arrival
● resolves after roughly one RTT

‘Bad’ queue:
● adds needless latency
● persists for several RTTs

[Figure: queue length over time in the standing queue situation. [Nichols 2012]]

Controlled Delay (CoDel) Queueing

Idea: Control delay of queue.

t: target queue delay (5 ms)
i: measuring interval (100 ms)

When the minimum queue delay seen by packets within the measuring interval exceeds t for longer than i, start dropping packets. Details in [Nichols 2012].
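The detection half of this rule can be sketched as follows (per-packet sojourn times in ms; CoDel’s actual schedule for spacing successive drops is in [Nichols 2012]):

    #include <stdbool.h>
    #include <stdint.h>

    #define CODEL_TARGET_MS   5   /* t: target queue delay */
    #define CODEL_INTERVAL_MS 100 /* i: measuring interval */

    struct codel_state {
        uint64_t first_above; /* when delay first exceeded t; 0 = it hasn't */
        bool     dropping;
    };

    /* Called per dequeued packet with the time it spent in the queue. */
    static bool codel_should_drop(struct codel_state *s,
                                  uint64_t now_ms, uint64_t sojourn_ms)
    {
        if (sojourn_ms < CODEL_TARGET_MS) {
            s->first_above = 0;      /* min delay back under target: reset */
            s->dropping = false;
            return false;
        }
        if (s->first_above == 0) {
            s->first_above = now_ms; /* start timing the excursion */
            return false;
        }
        if (now_ms - s->first_above >= CODEL_INTERVAL_MS)
            s->dropping = true;      /* above target for a whole interval */
        return s->dropping;
    }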

TCP Implementation

*BSD and Linux implementations are largely similar. The kernel implements the complete protocol stack.

The BSD sockets API maps TCP connections to file descriptors for applications:
● bind/connect/send/recv/…
● read/write
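A minimal client-side sketch of that API (error handling abbreviated; address and port are placeholders):

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        /* To the application, a TCP connection is just a file descriptor. */
        int fd = socket(AF_INET, SOCK_STREAM, 0);

        struct sockaddr_in addr = { 0 };
        addr.sin_family = AF_INET;
        addr.sin_port   = htons(7777);                   /* placeholder port */
        inet_pton(AF_INET, "192.0.2.1", &addr.sin_addr); /* placeholder host */

        if (fd >= 0 && connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0) {
            char buf[128];
            write(fd, "hello\n", 6); /* plain read/write works on the fd */
            read(fd, buf, sizeof(buf));
        }
        close(fd);
        return 0;
    }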

[Figure: send buffering in a typical TCP/IP stack. The application in user space writes into the socket layer’s socket queue; the kernel’s TCP/IP stack feeds the network driver’s device queue, which the hardware drains.]

Device Queue

A fixed-size queue in front of the driver, usually limited in the number of packets.

1000 64-byte packets at 54 Mbit/s: ~9.4 ms
1000 1518-byte packets at 1 Mbit/s: ~12 s
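Both numbers follow directly from queue size, packet size, and link speed:

    #include <stdio.h>

    /* Drain time of a packet-limited device queue. */
    static double drain_time_s(int packets, int bytes_each, double link_bps)
    {
        return packets * bytes_each * 8.0 / link_bps;
    }

    int main(void)
    {
        printf("%.1f ms\n", drain_time_s(1000,   64, 54e6) * 1000.0); /* ~9.5 ms */
        printf("%.1f s\n",  drain_time_s(1000, 1518,  1e6));          /* ~12.1 s */
        return 0;
    }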

One queue size doesn’t fit all.

Linux: Byte Queue Limits

Dynamically adapt maximum queue length of NIC measured in bytes:

● Increase the maximum quickly if the NIC is starved.
● Decrease the maximum slowly if the NIC cannot keep up.
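A sketch of such an asymmetric adjustment policy (the conditions and constants are illustrative, not the actual Linux BQL algorithm):

    /* Byte-based device queue limit, adjusted from observed NIC behavior
     * (illustrative policy only). */
    struct bql_state {
        unsigned int limit;  /* current maximum of queued bytes */
        unsigned int queued; /* bytes currently queued          */
    };

    /* Called after each transmit completion. */
    static void bql_adjust(struct bql_state *s, int nic_was_starved)
    {
        if (nic_was_starved)
            s->limit += s->limit / 2;  /* starved: grow quickly           */
        else if (s->queued >= s->limit)
            s->limit -= s->limit / 16; /* queue stays full: shrink slowly */
    }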

Implemented in Linux 3.3. [Hruby 2012] Experimental CoDel support is built on top.

Linux: TCP Small Queues

Linux’s TCP/IP stack has several queues between the application and the hardware.

Introduce a configurable global limit on data queued for a single connection:

/proc/sys/net/ipv4/tcp_limit_output_bytes (default: 128K)

A first step toward more comprehensive queue management.

Wrap up

Effective flow / congestion control in TCP needs:

● the protocol (ECN, SACK, …)
● the protocol implementation (cwnd, ...)
● the OS implementation (BQL, TCP small queues)

Sidenote: InfiniBand Networks

InfiniBand is an HPC / datacenter interconnect.

It uses credit-based flow control: data is never sent unless a receive buffer is available at the receiver.
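The principle (not InfiniBand’s actual link protocol) is a per-link credit counter, where one credit stands for one free receive buffer:

    /* Credit-based flow control, sketch of the principle. */
    struct link {
        unsigned int credits; /* receive buffers known to be free at the peer */
    };

    /* Sender: transmit only while a credit (receive buffer) is available. */
    static int try_send(struct link *l /* , packet */)
    {
        if (l->credits == 0)
            return 0;     /* would overflow the receiver: don't send */
        l->credits--;     /* consume one receive buffer              */
        /* ... put the packet on the wire ... */
        return 1;
    }

    /* Receiver hands back a credit once it frees a buffer. */
    static void credit_returned(struct link *l)
    {
        l->credits++;
    }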

Switches need less buffer space. Very low latency.

References

[Jacobson 1988] Congestion Avoidance and Control
[Floyd 1993] Random Early Detection Gateways for Congestion Avoidance
[Gettys 2011] Bufferbloat: Dark Buffers in the Internet
[Nichols 2012] Controlling Queue Delay

CeroWRT: bufferbloat-reducing router firmware based on OpenWRT.