Outline

• Transport introduction

15-744 Computer Networking • Error recovery &

• TCP flow control/connection setup/data transfer Review 2 – Transport Protocols • TCP reliability

• Congestion sources and collapse

• Congestion control basics

2

Transport Protocols Functionality Split

• Network provides best-effort delivery • Lowest level end-to- end protocol. • End-systems implement many functions • Header generated by 7 7 • Reliability sender is interpreted 6 6 • In-order delivery only by the destination 5 5 • Demultiplexing • Routers view transport • Message boundaries header as part of the Transport Transport payload • Connection abstraction IP IP IP • Not always true… • Congestion control Datalink 2 2 Datalink • Firewalls • … Physical 1 1 Physical

router

3 4

1 Transport Protocols UDP: User Datagram Protocol [RFC 768]

• UDP provides just integrity and demux • “No frills,” “bare bones” Why is there a UDP? • TCP adds… Internet transport protocol • No connection establishment • Connection-oriented (which can add delay) • “Best effort” service, • Reliable • Simple: no connection state UDP segments may be: at sender, receiver • Ordered • Lost • Small header • Byte-stream • Delivered out of order to • No congestion control: UDP • Full duplex app can blast away as fast as desired • Flow and congestion controlled • Connectionless: • DCCP, RTP, SCTP -- not widely used. • No handshaking between UDP sender, receiver • Each UDP segment handled independently of others 5 6

UDP, cont. UDP Checksum

• Often used for streaming 32 bits Goal: detect “errors” (e.g., flipped bits) in transmitted multimedia apps Length, in Source port # Dest port # segment – optional use! • Loss tolerant bytes of UDP Length Checksum segment, Sender: Receiver: • Rate sensitive including • Treat segment contents as • Compute checksum of • Other UDP uses header sequence of 16-bit integers (why?): received segment • Checksum: addition (1’s • Check if computed checksum • DNS Application complement sum) of segment equals checksum field value: • Reliable transfer data contents (message) • Sender puts checksum value • NO - error detected over UDP into UDP checksum field • YES - no error detected • Must be at But maybe errors application layer UDP segment format nonetheless? • Application-specific error recovery

7 8

2 High-Level TCP Characteristics TCP Header

• Protocol implemented entirely at the ends Source port Destination port • Fate sharing (on IP) Sequence number • Protocol has evolved over time and will continue Flags: SYN Acknowledgement to do so FIN • Nearly impossible to change the header RESET HdrLen 0 Flags Advertised window PUSH • Use options to add information to the header URG Checksum Urgent pointer ACK • Change processing at endpoints Options (variable) • Backward compatibility is what makes it TCP Data

9 10

Evolution of TCP TCP Through the 1990s

1984 1994 1975 Nagel’s algorithm 1996 T/TCP Three-way handshake to reduce overhead 1987 SACK TCP (Braden) Raymond Tomlinson of small packets; Karn’s algorithm 1990 (Floyd et al) Transaction In SIGCOMM 75 predicts congestion to better estimate 4.3BSD Reno Selective TCP collapse round-trip time fast retransmit Acknowledgement delayed ACK’s 1983 BSD Unix 4.2 1986 1988 1993 1994 1996 1996 1974 supports TCP/IP Congestion Van Jacobson’s TCP Vegas ECN Hoe FACK TCP TCP described by collapse algorithms (Brakmo et al) (Floyd) NewReno startup (Mathis et al) Vint Cerf and Bob Kahn observed congestion avoidance delay-based Explicit and loss recovery extension to SACK In IEEE Trans Comm 1982 and congestion control congestion avoidance Congestion TCP & IP (most implemented in Notification RFC 793 & 791 4.3BSD Tahoe)

1975 1980 1985 1990 1993 1994 1996

11 12

3 Outline Stop and Wait

• Transport introduction • ARQ • Receiver sends acknowledgement (ACK) • Error recovery & flow control when it receives packet • Sender waits for ACK and Sender Receiver timeouts if it does not • TCP flow control/connection setup/data transfer arrive within some time period • Simplest ARQ protocol • TCP reliability • Send a packet, stop and Timeout wait until ACK arrives • Congestion sources and collapse • Performance Time • Can only send one • Congestion control basics packet per round trip

13 14

Recovering from Error How to Recognize Resends?

• Use sequence numbers • both packets and acks • Sequence # in packet is finite  How big should it be? Timeout Time Timeout Timeout • For stop and wait? • One bit – won’t send seq #1 until received ACK for seq #0 Timeout Timeout Timeout

ACK lost Packet lost Early timeout DUPLICATE PACKETS!!!

15 16

4 How to Keep the Pipe Full? Sliding Window

• Send multiple packets without • Reliable, ordered delivery waiting for first to be acked • Receiver has to hold onto a packet until all prior • Number of pkts in flight = window: Flow control packets have arrived • Reliable, unordered delivery • Why might this be difficult for just parallel stop & wait? • Several parallel stop & waits • Sender must prevent buffer overflow at receiver • Send new packet after each ack • Circular buffer at sender and receiver • Sender keeps list of unack’ed packets; resends after timeout • Packets in transit  buffer size • Receiver same as stop & wait • Advance when sender and receiver agree packets at • How large a window is needed? beginning have been received • Suppose 10Mbps link, 4ms delay, 500byte pkts •1? 10? 20? • Round trip delay * bandwidth = capacity of pipe 17 18

Sender/Receiver State Sequence Numbers

• How large do sequence numbers need to be? Sender Receiver • Must be able to detect wrap-around • Depends on sender/receiver window size Max ACK received Next seqnum Next expected Max acceptable • E.g.

…………• Max seq = 7, send win=recv win=7 • If pkts 0..6 are sent succesfully and all acks lost Sender window Receiver window • Receiver expects 7,0..5, sender retransmits old 0..6!!! • Max sequence must be  send window + recv window Sent & Acked Sent Not Acked Received & Acked Acceptable Packet

OK to Send Not Usable Not Usable

19 20

5 Window Sliding – Common Case Loss Recovery

• On reception of new ACK (i.e. ACK for something that was • On reception of out-of-order packet not acked earlier) • Send nothing (wait for source to timeout) • Increase sequence of max ACK received • Cumulative ACK (helps source identify loss) • Send next packet • Timeout (Go-Back-N recovery) • On reception of new in-order data packet (next expected) • Set timer upon transmission of packet • Hand packet to application • Retransmit all unacknowledged packets • Send cumulative ACK – acknowledges reception of all packets up to sequence number • Performance during loss recovery • Increase sequence of max acceptable packet • No longer have an entire window in transit • Can have much more clever loss recovery

21 22

Important Lessons Good Ideas So Far…

• Transport service • Flow control • UDP  mostly just IP service • Stop & wait • TCP  congestion controlled, reliable, byte stream • Parallel stop & wait • Types of ARQ protocols • Sliding window • Stop-and-wait  slow, simple • Loss recovery • Go-back-n  can keep link utilized (except w/ losses) • Timeouts • Selective repeat  efficient loss recovery -- used in SACK • Acknowledgement-driven recovery (selective repeat or • Sliding window flow control cumulative acknowledgement) • Addresses buffering issues and keeps link utilized

23 24

6 Outline More on Sequence Numbers

• Transport introduction • 32 Bits, Unsigned  for bytes not packets!

• Error recovery & flow control • Why So Big? • For sliding window, must have • TCP flow control/connection setup/data transfer • |Sequence Space| > |Sending Window| + |Receiving Window| • TCP reliability • No problem • Also, want to guard against stray packets • Congestion sources and collapse • With IP, packets have maximum lifetime of 120s • Sequence number would wrap around in this time at 286Mbps • Congestion control basics

25 26

TCP Flow Control Window Flow Control: Send Side

• TCP is a sliding window protocol • For window size n, can send up to n bytes without receiving an acknowledgement window • When the data is acknowledged then the window slides forward • Each packet advertises a window size • Indicates number of bytes the receiver has space for Sent and acked Sent but not acked Not yet sent • Original TCP always sent entire window • Congestion control now limits this Next to be sent

27 28

7 Window Flow Control: Send Side Performance Considerations

Packet Sent Packet Received • The window size can be controlled by receiving

Source Port Dest. Port Source Port Dest. Port application Sequence Number Sequence Number • Can change the socket buffer size from a default (e.g. 8Kbytes) to a maximum value (e.g. 64 Kbytes) Acknowledgment Acknowledgment HL/Flags Window HL/Flags Window • The window size field in the TCP header limits the D. Checksum Urgent Pointer D. Checksum Urgent Pointer window that the receiver can advertise Options… Options... • 16 bits  64 KBytes • 10 msec RTT  51 Mbit/second • 100 msec RTT  5 Mbit/second App write • TCP options to get around 64KB limit  scales window size

acknowledged sent to be sent outside window

29 30

Establishing Connection: Outline Three-Way handshake

• Each side notifies other of • Transport introduction starting sequence number it SYN: SeqC will use for sending • Error recovery & flow control • Why not simply chose 0? • Must avoid overlap with earlier ACK: SeqC+1 incarnation SYN: SeqS • TCP flow control/connection setup/data transfer • Security issues

• Each side acknowledges ACK: SeqS+1 • TCP reliability other’s sequence number • SYN-ACK: Acknowledge • Congestion sources and collapse sequence number + 1

• Can combine second SYN Client Server • Congestion control basics with first ACK

31 32

8 Reliability Challenges TCP = Go-Back-N Variant

• Congestion related losses • Sliding window with cumulative acks • Receiver can only return a single “ack” sequence number to • Variable packet delays the sender. • What should the timeout be? • Acknowledges all bytes with a lower sequence number • Starting point for retransmission • Reordering of packets • Duplicate acks sent when out-of-order packet received • How to tell the difference between a delayed packet • But: sender only retransmits a single packet. and a lost one? • Reason??? • Only one that it knows is lost • Network is congested  shouldn’t overload it • Error control is based on byte sequences, not packets. • Retransmitted packet can be different from the original lost packet – Why?

33 34

Round-trip Time Estimation Original TCP Round-trip Estimator

• Wait at least one RTT before retransmitting • Round trip times exponentially averaged: 2.5 • Importance of accurate RTT estimators: • New RTT =  (old RTT) + • Low RTT estimate (1 - ) (new sample) 2 • unneeded retransmissions • Recommended value for 1.5 • High RTT estimate : 0.8 - 0.9 1 • 0.875 for most TCP’s • poor throughput 0.5 • RTT estimator must adapt to change in RTT 0 • But not too fast, or too slow! • Retransmit timer set to (b * RTT), where b = 2 • Spurious timeouts • Every time timer expires, RTO exponentially backed-off • “Conservation of packets” principle – never more than a • Not good at preventing premature timeouts window worth of packets in flight

35 36

9 RTT Sample Ambiguity Jacobson’s Retransmission Timeout

ABAB • Key observation: X • At high loads round trip variance is high RTO RTO Sample Sample • Solution: RTT RTT • Base RTO on RTT and standard deviation • RTO = RTT + 4 * rttvar • new_rttvar =  * dev + (1- ) old_rttvar • Karn’s RTT Estimator • Dev = linear deviation • If a segment has been retransmitted: • Inappropriately named – actually smoothed linear • Don’t count RTT sample on ACKs for this segment deviation • Keep backed off time-out for next packet • Reuse RTT estimate only after one successful transmission

37 38

Timestamp Extension Timer Granularity

• Used to improve timeout mechanism by more • Many TCP implementations set RTO in multiples accurate measurement of RTT of 200,500,1000ms • When sending a packet, insert current time into • Why? option • Avoid spurious timeouts – RTTs can vary quickly due to • 4 bytes for time, 4 bytes for echo a received timestamp cross traffic • Receiver echoes timestamp in ACK • Reduce timer expensive timer interrupts on hosts • Actually will echo whatever is in timestamp • What happens for the first couple of packets? • Removes retransmission ambiguity • Pick a very conservative value (seconds) • Can get RTT sample on any packet

39 40

10 Fast Retransmit -- Avoiding Timeouts Fast Retransmit

• What are duplicate acks (dupacks)? • Repeated acks for the same sequence • When can duplicate acks occur? • Loss • Packet re-ordering X Retransmission • Window update – advertisement of new flow control window Sequence No Duplicate Acks • Assume re-ordering is infrequent and not of large magnitude • Use receipt of 3 or more duplicate acks as indication of loss • Don’t wait for timeout to retransmit packet

Packets Acks Time

41 42

TCP (Reno variant) SACK

• Basic problem is that cumulative acks provide little X X information X • Selective acknowledgement (SACK) essentially Now what? - timeout X adds a bitmask of packets received Sequence No • Implemented as a TCP option • Encoded as a set of received byte ranges (max of 4 ranges/often max of 3) • When to retransmit? • Still need to deal with reordering  wait for out of order Packets by 3pkts Acks Time

43 44

11 SACK Performance Issues

• Timeout >> fast rexmit X X X • Need 3 dupacks/sacks Now what? – send X retransmissions as soon Sequence No as detected • Not great for small transfers • Don’t have 3 packets outstanding

• What are real loss patterns like? Packets Acks Time

45 46

Important Lessons Outline

• Three-way TCP Handshake • Transport introduction • TCP timeout calculation  how is RTT estimated • Error recovery & flow control

• Modern TCP loss recovery • TCP flow control/connection setup/data transfer • Why are timeouts bad? • How to avoid them?  e.g. fast retransmit • TCP reliability

• Congestion sources and collapse

• Congestion control basics

47 48

12 Congestion Causes & Costs of Congestion

• Four senders – multihop paths Q: What happens as rate 10 Mbps increases? 1.5 Mbps • Timeout/retransmit

100 Mbps

• Different sources compete for resources inside network • Why is it a problem? • Sources are unaware of current state of resource • Sources are unaware of each other • In many situations will result in < 1.5 Mbps of throughput (congestion collapse)

49 50

Causes & Costs of Congestion Congestion Collapse

• Definition: Increase in network load results in decrease of useful work done • Many possible causes • Spurious retransmissions of packets still in flight • Classical congestion collapse • How can this happen with packet conservation • Solution: better timers and TCP congestion control • When packet dropped, any “upstream • Undelivered packets • Packets consume resources and are dropped elsewhere in transmission capacity used for that packet network was wasted! • Solution: congestion control for ALL traffic • Etc..

51 52

13 Other Congestion Collapse Causes Where to Prevent Collapse?

• Fragments • Can end hosts prevent problem? • Mismatch of transmission and retransmission units • Yes, but must trust end hosts to do right thing • Solutions • E.g., sending host must adjust amount of data it puts in • Make network drop all fragments of a packet (early packet the network based on detected congestion discard in ATM) • Can routers prevent collapse? • Do path MTU discovery • No, not all forms of collapse • Control traffic • Doesn’t mean they can’t help • Large percentage of traffic is for control • Sending accurate congestion signals • Headers, routing messages, DNS, etc. • Isolating well-behaved from ill-behaved sources • Stale or unwanted packets • Packets that are delayed on long queues • “Push” data that is never used

53 54

Congestion Control and Avoidance Approaches For Congestion Control

• A mechanism which: • Two broad approaches towards congestion control: • Uses network resources efficiently • Preserves fair network resource allocation End-to-end Network-assisted • Prevents or avoids collapse • Congestion collapse is not just a theory • No explicit feedback from • Routers provide feedback network to end systems • Has been frequently observed in many networks • Congestion inferred from • Explicit rate sender should end-sys tem observed send at loss, delay • Single bit indicating congestion (SNA, DEC bit, • Approach taken by TCP TCP/IP ECN, ATM) • Problem: makes routers complicated

55 56

14 Example: TCP Congestion Control Outline

• Very simple mechanisms in network • Transport introduction • FIFO scheduling with shared buffer pool • Feedback through packet drops • Error recovery & flow control • TCP interprets packet drops as signs of congestion and slows down • TCP flow control/connection setup/data transfer • This is an assumption: packet drops are not a sign of congestion in all networks • E.g. wireless networks • TCP reliability • Periodically probes the network to check whether more bandwidth has become available. • Congestion sources and collapse

• Congestion control basics

57 58

Objectives Basic Control Model

• Simple router behavior • Let’s assume window-based control • Distributedness • Reduce window when congestion is perceived • Efficiency: X = x (t) • How is congestion signaled? knee i • Either mark or drop packets 2 2 • Fairness: (xi) /n(xi ) • When is a router congested? • Power: (throughput/delay) • Drop tail queues – when queue is full • Average queue length – at some threshold • Convergence: control system must be stable • Increase window otherwise • Probe for available bandwidth – how?

59 60

15 Linear Control Phase plots

• Many different possibilities for reaction to • Simple way to visualize behavior of competing congestion and probing connections over time • Examine simple linear controls

• Window(t + 1) = a + b Window(t) Fairness Line

• Different ai/bi for increase and ad/bd for decrease

• Supports various reaction to signals User 2’s Allocation

• Increase/decrease additively x2 • Increased/decrease multiplicatively • Which of the four combinations is optimal? Efficiency Line

User 1’s Allocation x1

61 62

Phase plots Additive Increase/Decrease

• What are desirable properties? • Both X1 and X2 increase/decrease by the same amount over time • What if flows are not equal? • Additive increase improves fairness and additive decrease reduces fairness

Fairness Line Fairness Line Overload T1 User 2’s Allocation Optimal point User 2’s x2 Allocation T0 x2 Underutilization

Efficiency Line Efficiency Line

User 1’s Allocation x1

User 1’s Allocation x1

63 64

16 Multiplicative Increase/Decrease Convergence to Efficiency

• Both X1 and X2 increase by the same factor over time • Extension from origin – constant fairness

Fairness Line Fairness Line T1 xH

User 2’s User 2’s Allocation Allocation x x 2 T0 2

Efficiency Line Efficiency Line

User 1’s Allocation x1 User 1’s Allocation x1

65 66

Distributed Convergence to Efficiency Convergence to Fairness

a=0 b=1 Fairness Line Fairness Line

xH xH

User 2’s User 2’s Allocation Allocation

x2 x2

xH’

Efficiency Line Efficiency Line

User 1’s Allocation x1 User 1’s Allocation x1

67 68

17 Convergence to Efficiency & Fairness Increase

Fairness Line Fairness Line

xH

User 2’s User 2’s Allocation Allocation

x2 x2

xH’

xL Efficiency Line Efficiency Line

User 1’s Allocation x1 User 1’s Allocation x1

69 70

Constraints What is the Right Choice?

• Distributed efficiency • Constraints limit us to AIMD • I.e.,  Window(t+1) >  Window(t) during increase • Can have multiplicative term in increase (MAIMD)

• ai > 0 & bi ≥ 1 • AIMD moves towards optimal point • Similarly, ad < 0 & bd ≤ 1

• Must never decrease fairness Fairness Line x1 • a & b’s must be ≥ 0

x0 • ai/bi > 0 and ad/bd ≥ 0 User 2’s Allocation x2 • Full constraints x2

• ad = 0, 0 ≤ bd <1, ai > 0 and bi ≥ 1

Efficiency Line

User 1’s Allocation x1

71 72

18 Questions TCP Congestion Control

• Congestion Control • Fairness – why not support skew  AIMD/GAIMD • RED analysis • More bits of feedback  DECbit, XCP, Vegas • Assigned Reading • [FJ93] Random Early Detection Gateways for • Guess # of users  hard in async system, look at Congestion Avoidance loss rate? • [TFRC] Equation-Based Congestion Control for Unicast • Stateless vs. stateful design Applications (two sections) • Wired vs. wireless • Non-linear controls  Bionomial

73 74

19