TCP Flow Control and Congestion Control

EECS 489 Computer Networks http://www.eecs.umich.edu/courses/eecs489/w07 Z. Morley Mao Monday Feb 5, 2007

Acknowledgement: some slides taken from Kurose & Ross and Katz & Stoica. (Mao W07)

TCP Flow Control

Flow control: the sender won't overflow the receiver's buffer by transmitting too much, too fast.
- The receive side of a TCP connection has a receive buffer.
- The app process may be slow at reading from the buffer.
- Speed-matching service: matching the send rate to the receiving app's drain rate.

TCP Flow Control: how it works

- The receiver advertises spare room by including the value of RcvWindow in segments.
- The sender limits unACKed data to RcvWindow.
  - guarantees the receive buffer doesn't overflow
- Spare room in buffer = RcvWindow = RcvBuffer - [LastByteRcvd - LastByteRead]
  (suppose the TCP receiver discards out-of-order segments)
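The spare-room formula can be sketched directly; the buffer size and byte counters below are hypothetical example values, not fields of any real TCP stack:

```python
def rcv_window(rcv_buffer, last_byte_rcvd, last_byte_read):
    """Spare room the receiver advertises: RcvBuffer - [LastByteRcvd - LastByteRead]."""
    return rcv_buffer - (last_byte_rcvd - last_byte_read)

# Receiver has a 64 KB buffer; 20 KB received but only 12 KB read so far:
window = rcv_window(64 * 1024, 20 * 1024, 12 * 1024)
print(window)  # 57344 bytes of spare room (56 KB)
```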

TCP Connection Management

Recall: TCP sender and receiver establish a "connection" before exchanging data segments.
- initialize TCP variables: seq. #s; buffers, flow-control info (e.g. RcvWindow)
- client: connection initiator
  Socket clientSocket = new Socket("hostname", "port number");
- server: contacted by client
  Socket connectionSocket = welcomeSocket.accept();

Three-way handshake:
Step 1: client host sends TCP SYN segment to server
- specifies initial seq #
- no data
Step 2: server host receives SYN, replies with SYNACK segment
- server allocates buffers
- specifies server initial seq. #
Step 3: client receives SYNACK, replies with ACK segment, which may contain data

TCP Connection Management (cont.)

Closing a connection: the client closes the socket: clientSocket.close();
Step 1: client end system sends TCP FIN control segment to server.
Step 2: server receives FIN, replies with ACK. Closes connection, sends FIN.
(timeline: client FIN -> server ACK; server FIN -> client ACK; client then enters timed wait before closing)

TCP Connection Management (cont.)

Step 3: client receives FIN, replies with ACK.
- Enters "timed wait": will respond with ACK to received FINs.

Step 4: server receives ACK. Connection closed.

Note: with a small modification, can handle simultaneous FINs.

TCP Connection Management (cont)

TCP server lifecycle

TCP client lifecycle

Principles of Congestion Control

Congestion:
- informally: "too many sources sending too much data too fast for the network to handle"
- different from flow control!
- manifestations:
  - lost packets (buffer overflow at routers)
  - long delays (queueing in router buffers)
- a top-10 problem!

Causes/costs of congestion: scenario 1

- two senders, two receivers
- one router, infinite output-link buffers
- no retransmission
- large delays when congested
- maximum achievable throughput
(figure: Hosts A and B each send λin of original data into a shared router with unlimited output-link buffers; λout emerges)

Causes/costs of congestion: scenario 2

- one router, finite buffers
- sender retransmission of lost packets
(figure: Hosts A and B send λin of original data, plus retransmitted data for a total of λ'in, into finite shared output-link buffers; λout emerges)

Causes/costs of congestion: scenario 2

- always: λin = λout (goodput)
- "perfect" retransmission only when loss: λ'in > λout
- retransmission of delayed (not lost) packets makes λ'in larger (than in the perfect case) for the same λout
(plots a-c: λout vs. λ'in, with reference levels R/2, R/3, and R/4)

"costs" of congestion:
- more work (retransmissions) for a given "goodput"
- unneeded retransmissions: the link carries multiple copies of a packet

Causes/costs of congestion: scenario 3

- four senders
- multihop paths
- timeout/retransmit

Q: what happens as λin and λ'in increase?

(figure: Hosts A and B send λin of original data, plus retransmitted data for a total of λ'in, through multihop paths with finite shared output-link buffers; λout emerges)

Causes/costs of congestion: scenario 3

(plot: Host A's λout vs. λ'in)

Another "cost" of congestion: when a packet is dropped, any upstream transmission capacity used for that packet was wasted!

Approaches towards congestion control

Two broad approaches towards congestion control:

End-end congestion control:
- no explicit feedback from the network
- congestion inferred from end-system observed loss, delay
- approach taken by TCP

Network-assisted congestion control:
- routers provide feedback to end systems
  - single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  - explicit rate sender should send at

Case study: ATM ABR congestion control

ABR: available bit rate:
- "elastic service"
- if sender's path "underloaded": sender should use available bandwidth
- if sender's path congested: sender throttled to minimum guaranteed rate

RM (resource management) cells:
- sent by sender, interspersed with data cells
- bits in RM cell set by switches ("network-assisted")
  - NI bit: no increase in rate (mild congestion)
  - CI bit: congestion indication
- RM cells returned to sender by receiver, with bits intact

Case study: ATM ABR congestion control

- two-byte ER (explicit rate) field in RM cell
  - congested switch may lower the ER value in the cell
  - sender's send rate is thus the minimum supportable rate on the path
- EFCI bit in data cells: set to 1 by a congested switch
  - if the data cell preceding an RM cell has EFCI set, the receiver sets the CI bit in the returned RM cell

TCP Congestion Control

- end-end control (no network assistance)
- sender limits transmission: LastByteSent - LastByteAcked ≤ CongWin
- Roughly, rate = CongWin / RTT bytes/sec
- CongWin is dynamic, a function of perceived network congestion

How does the sender perceive congestion?
- loss event = timeout or 3 duplicate ACKs
- TCP sender reduces rate (CongWin) after a loss event

Three mechanisms:
- AIMD
- slow start
- conservative after timeout events

TCP AIMD

Multiplicative decrease: cut CongWin in half after a loss event.
Additive increase: increase CongWin by 1 MSS every RTT in the absence of loss events: probing.
(figure: congestion window of a long-lived TCP connection tracing a sawtooth over time, between 8, 16 and 24 Kbytes)
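The sawtooth can be reproduced with a toy simulation; the capacity of 24 segments and starting window of 12 are arbitrary assumed values, with loss injected whenever the window reaches capacity:

```python
def aimd(rtts, capacity=24, start=12):
    """Additive increase (+1 MSS per RTT), multiplicative decrease (halve on loss)."""
    win, history = start, []
    for _ in range(rtts):
        history.append(win)
        if win >= capacity:      # pretend the network drops a packet here
            win = win // 2       # multiplicative decrease
        else:
            win += 1             # additive increase: +1 MSS per RTT
    return history

print(aimd(16))  # window climbs 12..24, halves to 12, climbs again
```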

TCP Slow Start

- When connection begins, CongWin = 1 MSS
  - Example: MSS = 500 bytes & RTT = 200 msec → initial rate = 20 kbps
- Available bandwidth may be >> MSS/RTT
  - desirable to quickly ramp up to a respectable rate
- When connection begins, increase rate exponentially fast until first loss event

TCP Slow Start (more)

- When connection begins, increase rate exponentially until first loss event:
  - double CongWin every RTT
  - done by incrementing CongWin for every ACK received
- Summary: initial rate is slow but ramps up exponentially fast
(figure: Host A sends one segment, then two, then four, one window per RTT)
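The per-ACK increment above yields per-RTT doubling; a minimal sketch (window in MSS units, assuming every segment in a window is ACKed before the next round begins):

```python
def slow_start_rounds(rounds):
    """Each ACK adds 1 MSS, so a window of w segments grows to 2w in one RTT."""
    congwin = 1                      # CongWin starts at 1 MSS
    sizes = []
    for _ in range(rounds):
        sizes.append(congwin)
        acks = congwin               # one ACK per segment sent this RTT
        congwin += acks              # +1 MSS per ACK  =>  doubling per RTT
    return sizes

print(slow_start_rounds(5))  # [1, 2, 4, 8, 16]
```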

Refinement

Philosophy: 3 dup ACKs indicate the network is capable of delivering some segments; a timeout before 3 dup ACKs is "more alarming".
- After 3 dup ACKs:
  - CongWin is cut in half
  - window then grows linearly
- But after a timeout event:
  - CongWin instead set to 1 MSS
  - window then grows exponentially to a threshold, then grows linearly

Refinement (more)

Q: When should the exponential increase switch to linear? A: When CongWin gets to 1/2 of its value before timeout.

Implementation:
- Variable Threshold
- At a loss event, Threshold is set to 1/2 of CongWin just before the loss event

Summary: TCP Congestion Control

- When CongWin is below Threshold, sender is in slow-start phase, window grows exponentially.

- When CongWin is above Threshold, sender is in congestion-avoidance phase, window grows linearly.

- When a triple duplicate ACK occurs, Threshold set to CongWin/2 and CongWin set to Threshold.

- When timeout occurs, Threshold set to CongWin/2 and CongWin is set to 1 MSS.

TCP sender congestion control

Event: ACK receipt for previously unacked data
  State: Slow Start (SS)
  Action: CongWin = CongWin + MSS; if (CongWin > Threshold) set state to "Congestion Avoidance"
  Commentary: resulting in a doubling of CongWin every RTT

Event: ACK receipt for previously unacked data
  State: Congestion Avoidance (CA)
  Action: CongWin = CongWin + MSS * (MSS/CongWin)
  Commentary: additive increase, resulting in increase of CongWin by 1 MSS every RTT

Event: loss event detected by triple duplicate ACK
  State: SS or CA
  Action: Threshold = CongWin/2; CongWin = Threshold; set state to "Congestion Avoidance"
  Commentary: fast recovery, implementing multiplicative decrease. CongWin will not drop below 1 MSS.

Event: timeout
  State: SS or CA
  Action: Threshold = CongWin/2; CongWin = 1 MSS; set state to "Slow Start"
  Commentary: enter slow start

Event: duplicate ACK
  State: SS or CA
  Action: increment duplicate ACK count for segment being acked
  Commentary: CongWin and Threshold not changed
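The event table can be sketched as a small state machine. This is a simplified model of the rules above (window in MSS units, congestion-avoidance growth approximated per ACK), not a real TCP implementation:

```python
class TcpSender:
    """Simplified CongWin/Threshold dynamics from the event table (units: MSS)."""
    def __init__(self, threshold=8):
        self.state = "SS"           # slow start
        self.congwin = 1.0
        self.threshold = float(threshold)

    def on_new_ack(self):
        if self.state == "SS":
            self.congwin += 1       # +1 MSS per ACK: doubles CongWin each RTT
            if self.congwin > self.threshold:
                self.state = "CA"   # switch to congestion avoidance
        else:
            self.congwin += 1 / self.congwin   # ~ +1 MSS per RTT overall

    def on_triple_dup_ack(self):
        self.threshold = self.congwin / 2
        self.congwin = self.threshold          # multiplicative decrease
        self.state = "CA"

    def on_timeout(self):
        self.threshold = self.congwin / 2
        self.congwin = 1.0                     # back to 1 MSS
        self.state = "SS"

s = TcpSender(threshold=8)
for _ in range(8):                  # eight new ACKs: CongWin 1 -> 9, crossing 8
    s.on_new_ack()
print(s.state, s.congwin)           # CA 9.0
s.on_timeout()
print(s.state, s.congwin, s.threshold)  # SS 1.0 4.5
```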

TCP throughput

- What's the average throughput of TCP as a function of window size and RTT?
  - ignore slow start
- Let W be the window size when loss occurs.
- When the window is W, throughput is W/RTT.
- Just after loss, window drops to W/2, throughput to W/(2·RTT).
- Average throughput: 0.75 W/RTT
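The 0.75·W/RTT figure is just the midpoint of a sawtooth oscillating between W/2 and W. A quick check, with assumed example numbers for W and RTT:

```python
def avg_tcp_throughput(w_bytes, rtt_s):
    """Window oscillates between W/2 and W, so the mean rate is 0.75 * W / RTT."""
    return 0.75 * w_bytes / rtt_s

# Hypothetical example: W = 64 KB window at loss, RTT = 0.5 s
print(avg_tcp_throughput(64_000, 0.5))  # 96000.0 bytes/sec
```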

TCP Futures

- Example: 1500-byte segments, 100 ms RTT, want 10 Gbps throughput
- Requires window size W = 83,333 in-flight segments
- Throughput in terms of loss rate: throughput = (1.22 · MSS) / (RTT · √L)
- ➜ L = 2·10^-10  Wow
- New versions of TCP for high-speed needed!
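Both numbers on this slide follow from the formula: W is the bandwidth-delay product in segments, and L comes from inverting throughput = 1.22·MSS/(RTT·√L):

```python
mss_bits = 1500 * 8        # 1500-byte segments
rtt = 0.100                # 100 ms
target = 10e9              # 10 Gbps

# In-flight segments needed: bandwidth-delay product / segment size
w = target * rtt / mss_bits
print(round(w))            # 83333

# Invert throughput = 1.22 * MSS / (RTT * sqrt(L)) for the loss rate L
loss = (1.22 * mss_bits / (rtt * target)) ** 2
print(loss)                # ~2e-10
```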

TCP Fairness

Fairness goal: if K TCP sessions share the same bottleneck link of bandwidth R, each should have an average rate of R/K.

(figure: TCP connections 1 and 2 share a bottleneck router of link capacity R)

Why is TCP fair?

Two competing sessions:
- additive increase gives a slope of 1 as throughput increases
- multiplicative decrease decreases throughput proportionally

(figure: Connection 2 throughput vs. Connection 1 throughput, both bounded by R; repeated rounds of "congestion avoidance: additive increase" followed by "loss: decrease window by factor of 2" move the operating point toward the equal-bandwidth-share line)

Fairness (more)

Fairness and UDP:
- Multimedia apps often do not use TCP
  - do not want rate throttled by congestion control
- Instead use UDP: pump audio/video at constant rate, tolerate packet loss
- Research area: TCP friendly

Fairness and parallel TCP connections:
- nothing prevents an app from opening parallel connections between 2 hosts
- Web browsers do this
- Example: link of rate R supporting 9 connections;
  - new app asks for 1 TCP, gets rate R/10
  - new app asks for 11 TCPs, gets R/2!

Delay modeling

Q: How long does it take to receive an object from a Web server after sending a request?

Ignoring congestion, delay is influenced by:
- TCP connection establishment
- data transmission delay
- slow start

Notation, assumptions:
- assume one link between client and server of rate R
- S: MSS (bits)
- O: object size (bits)
- no retransmissions (no loss, no corruption)

Window size:
- first assume: fixed congestion window, W segments
- then dynamic window, modeling slow start

TCP Delay Modeling: Slow Start (1)

Now suppose window grows according to slow start

Will show that the delay for one object is:

Latency = 2·RTT + O/R + P·[RTT + S/R] − (2^P − 1)·S/R

where P is the number of times TCP idles at server:

P = min{Q,K −1}

- where Q is the number of times the server idles if the object were of infinite size.

- and K is the number of windows that cover the object.

TCP Delay Modeling: Slow Start (2)

Delay components:
- 2 RTT for connection establishment and request
- O/R to transmit object
- time server idles due to slow start

Server idles P = min{K−1, Q} times.

Example: O/S = 15 segments, K = 4 windows, Q = 2, so P = min{K−1, Q} = 2: the server idles P = 2 times.
(figure: timeline from "initiate TCP connection" and "request object" through windows of size S/R, 2S/R, 4S/R, 8S/R to complete transmission; object delivered at the client)

TCP Delay Modeling (3)

S/R + RTT = time from when the server starts to send a segment until the server receives an acknowledgement

2^(k−1)·S/R = time to transmit the k-th window

[S/R + RTT − 2^(k−1)·S/R]+ = idle time after the k-th window

delay = O/R + 2·RTT + Σ_{p=1..P} idleTime_p
      = O/R + 2·RTT + Σ_{k=1..P} [S/R + RTT − 2^(k−1)·S/R]
      = O/R + 2·RTT + P·[RTT + S/R] − (2^P − 1)·S/R

TCP Delay Modeling (4)

Recall K = number of windows that cover object

How do we calculate K?

K = min{k : 2^0·S + 2^1·S + … + 2^(k−1)·S ≥ O}
  = min{k : 2^0 + 2^1 + … + 2^(k−1) ≥ O/S}
  = min{k : 2^k − 1 ≥ O/S}
  = min{k : k ≥ log2(O/S + 1)}
  = ⌈log2(O/S + 1)⌉
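The pieces of the model (K, Q, P, and the latency formula) fit together in a short function; the O, S, R, RTT values in the example are assumptions chosen to reproduce the slide's O/S = 15, K = 4, Q = 2, P = 2 case:

```python
from math import ceil, log2

def slow_start_latency(O, S, R, RTT):
    """Latency = 2*RTT + O/R + P*(RTT + S/R) - (2**P - 1)*S/R."""
    K = ceil(log2(O / S + 1))                 # windows that cover the object
    # Q: number of idle periods if the object were infinitely large
    Q, k = 0, 1
    while S / R + RTT > (2 ** (k - 1)) * S / R:
        Q += 1
        k += 1
    P = min(Q, K - 1)                         # times the server actually idles
    latency = 2 * RTT + O / R + P * (RTT + S / R) - (2 ** P - 1) * S / R
    return latency, K, Q, P

# O/S = 15 segments; S/R = 10 ms, RTT = 20 ms
lat, K, Q, P = slow_start_latency(O=150_000, S=10_000, R=1_000_000, RTT=0.02)
print(K, Q, P)          # 4 2 2
print(round(lat, 4))    # 0.22 seconds
```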

Calculation of Q, number of idles for infinite-size object, is similar (see HW).

HTTP Modeling

- Assume a Web page consists of:
  - 1 base HTML page (of size O bits)
  - M images (each of size O bits)
- Non-persistent HTTP:
  - M+1 TCP connections in series
  - Response time = (M+1)·O/R + (M+1)·2·RTT + sum of idle times
- Persistent HTTP:
  - 2 RTT to request and receive base HTML file
  - 1 RTT to request and receive M images
  - Response time = (M+1)·O/R + 3·RTT + sum of idle times
- Non-persistent HTTP with X parallel connections:
  - suppose M/X is an integer
  - 1 TCP connection for base file
  - M/X sets of parallel connections for images
  - Response time = (M+1)·O/R + (M/X + 1)·2·RTT + sum of idle times
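Ignoring the "sum of idle times" term in each case, the three formulas compare as follows (the example values match the slides' second scenario: O = 5 Kbytes, M = 10, X = 5, R = 100 Kbps, RTT = 1 s):

```python
def response_times(O, R, RTT, M, X):
    """RTT terms from the three formulas, omitting the 'sum of idle times'."""
    transmit = (M + 1) * O / R
    return {
        "non-persistent":          transmit + (M + 1) * 2 * RTT,
        "persistent":              transmit + 3 * RTT,
        "parallel non-persistent": transmit + (M // X + 1) * 2 * RTT,
    }

times = response_times(O=5 * 8 * 1024, R=100_000, RTT=1.0, M=10, X=5)
for name, t in times.items():
    print(f"{name}: {t:.2f} s")
```

With a 1 s RTT the persistent case wins clearly, mirroring the charts that follow.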

HTTP Response time (in seconds): RTT = 100 msec, O = 5 Kbytes, M = 10 and X = 5
(chart: response time at link rates 28 Kbps, 100 Kbps, 1 Mbps, 10 Mbps for non-persistent, persistent, and parallel non-persistent HTTP)
For low bandwidth, connection & response time are dominated by transmission time. Persistent connections give only a minor improvement over parallel connections.

HTTP Response time (in seconds)

RTT = 1 sec, O = 5 Kbytes, M = 10 and X = 5
(chart: response time at link rates 28 Kbps, 100 Kbps, 1 Mbps, 10 Mbps for non-persistent, persistent, and parallel non-persistent HTTP)
For larger RTT, response time is dominated by TCP establishment & slow-start delays. Persistent connections now give an important improvement, particularly in high delay·bandwidth networks.

Issues to Think About

- What about short flows? (setting initial cwnd)
  - most flows are short
  - most bytes are in long flows

- How does this work over wireless links?
  - packet reordering fools fast retransmit
  - loss is not always congestion related

- High speeds?
  - to reach 10 Gbps, packet losses can occur at most every 90 minutes!

- Fairness: how do flows with different RTTs share a link?

Security issues with TCP

- Example attacks:
  - sequence number spoofing
  - routing attacks
  - source address spoofing
  - authentication attacks

Network Layer

Goals:
- understand principles behind network-layer services:
  - routing (path selection)
  - dealing with scale
  - how a router works
  - advanced topics: IPv6, mobility
- instantiation and implementation in the Internet

Network layer

- transport segment from sending to receiving host
- on sending side, encapsulates segments into datagrams
- on receiving side, delivers segments to the transport layer
- network-layer protocols in every host, router
- a router examines header fields in all IP datagrams passing through it
(figure: full application/transport/network/data-link/physical stacks at the end hosts; network/data-link/physical at each router along the path)

Key Network-Layer Functions

- forwarding: move packets from a router's input to the appropriate router output
- routing: determine the route taken by packets from source to dest.

Analogy:
- routing: process of planning a trip from source to dest
- forwarding: process of getting through a single interchange

Interplay between routing and forwarding

(figure: the routing algorithm fills in the local forwarding table; a packet arriving with header value 0111 is forwarded on output link 2)

local forwarding table:
  header value   output link
  0100           3
  0101           2
  0111           2
  1001           1

Connection setup

- 3rd important function in some network architectures:
  - ATM, frame relay, X.25
- before datagrams flow, two hosts and intervening routers establish a virtual connection
  - routers get involved
- network vs. transport layer connection service:
  - network: between two hosts
  - transport: between two processes

Network service model

Q: What service model for the "channel" transporting datagrams from sender to receiver?

Example services for individual datagrams:
- guaranteed delivery
- guaranteed delivery with less than 40 msec delay

Example services for a flow of datagrams:
- in-order datagram delivery
- guaranteed minimum bandwidth to flow
- restrictions on changes in inter-packet spacing

Network layer service models:

Network Architecture | Service Model | Bandwidth guarantee | Loss | Order | Timing | Congestion feedback
Internet | best effort | none | no | no | no | no (inferred via loss)
ATM | CBR | constant rate | yes | yes | yes | no congestion
ATM | VBR | guaranteed rate | yes | yes | yes | no congestion
ATM | ABR | guaranteed minimum | no | yes | no | yes
ATM | UBR | none | no | yes | no | no

Network layer connection and connection-less service

- A datagram network provides network-layer connectionless service.
- A VC network provides network-layer connection service.
- Analogous to the transport-layer services, but:
  - service: host-to-host
  - no choice: the network provides one or the other
  - implementation: in the core

Virtual circuits

"Source-to-dest path behaves much like a telephone circuit"
- performance-wise
- network actions along the source-to-dest path

- call setup, teardown for each call before data can flow
- each packet carries a VC identifier (not the destination host address)
- every router on the source-dest path maintains "state" for each passing connection
- link, router resources (bandwidth, buffers) may be allocated to the VC

VC implementation

A VC consists of:
1. a path from source to destination
2. VC numbers, one number for each link along the path
3. entries in forwarding tables in routers along the path

- A packet belonging to a VC carries a VC number.
- The VC number must be changed on each link.
  - the new VC number comes from the forwarding table

Forwarding table

(figure: a packet's VC number is rewritten from 12 to 22 to 32 as it crosses successive links; router interfaces are numbered 1, 2, 3)

Forwarding table in the northwest router:
Incoming interface | Incoming VC # | Outgoing interface | Outgoing VC #
1 | 12 | 2 | 22
2 | 63 | 1 | 18
3 | 7 | 2 | 17
1 | 97 | 3 | 87
… | … | … | …

Routers maintain connection state information!
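The per-router state can be sketched as a dict keyed on (incoming interface, incoming VC#); the entries mirror the northwest router's table on this slide:

```python
# (in_interface, in_vc) -> (out_interface, out_vc), as in the slide's table
vc_table = {
    (1, 12): (2, 22),
    (2, 63): (1, 18),
    (3, 7):  (2, 17),
    (1, 97): (3, 87),
}

def forward(in_interface, in_vc):
    """Look up the outgoing link and rewrite the VC number for this hop."""
    out_interface, out_vc = vc_table[(in_interface, in_vc)]
    return out_interface, out_vc

print(forward(1, 12))  # (2, 22): packet leaves interface 2 carrying VC# 22
```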

Virtual circuits: signaling protocols

- used to set up, maintain, and tear down a VC
- used in ATM, frame relay, X.25
- not used in today's Internet

(figure: 1. initiate call → 2. incoming call → 3. accept call → 4. call connected → 5. data flow begins → 6. receive data)

Datagram networks

- no call setup at the network layer
- routers: no state about end-to-end connections
  - no network-level concept of "connection"
- packets forwarded using the destination host address
  - packets between the same source-dest pair may take different paths

(figure: sender just 1. sends data; receiver 2. receives data)

Forwarding table (4 billion possible entries)

Destination Address Range Link Interface

11001000 00010111 00010000 00000000 through 0 11001000 00010111 00010111 11111111

11001000 00010111 00011000 00000000 through 1 11001000 00010111 00011000 11111111

11001000 00010111 00011001 00000000 through 2 11001000 00010111 00011111 11111111

otherwise 3

Longest prefix matching

Prefix Match                  Link Interface
11001000 00010111 00010       0
11001000 00010111 00011000    1
11001000 00010111 00011       2
otherwise                     3

Examples

DA: 11001000 00010111 00010110 10100001 Which interface?

DA: 11001000 00010111 00011000 10101010 Which interface?
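The two examples can be checked with a sketch of longest-prefix matching over the table above (prefixes as bit strings; a linear scan for clarity, not the trie structures real routers use):

```python
TABLE = [
    ("11001000 00010111 00010".replace(" ", ""),    0),
    ("11001000 00010111 00011000".replace(" ", ""), 1),
    ("11001000 00010111 00011".replace(" ", ""),    2),
]

def lookup(addr_bits, default=3):
    """Pick the matching prefix with the most bits; 'otherwise' -> interface 3."""
    best_len, best_iface = -1, default
    for prefix, iface in TABLE:
        if addr_bits.startswith(prefix) and len(prefix) > best_len:
            best_len, best_iface = len(prefix), iface
    return best_iface

da1 = "11001000 00010111 00010110 10100001".replace(" ", "")
da2 = "11001000 00010111 00011000 10101010".replace(" ", "")
print(lookup(da1))  # 0  (only the 21-bit prefix ...00010 matches)
print(lookup(da2))  # 1  (the 24-bit ...00011000 beats the 21-bit ...00011)
```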

Datagram or VC network: why?

Internet:
- data exchange among computers: "elastic" service, no strict timing req.
- "smart" end systems (computers): can adapt, perform control, error recovery; simple inside the network, complexity at the "edge"
- many link types: different characteristics; uniform service difficult

ATM:
- evolved from telephony
- human conversation: strict timing, reliability requirements; need for guaranteed service
- "dumb" end systems: telephones; complexity inside the network
