15-744: Computer Networking
L-5 Fair Queuing

Fair Queuing
• Fair queuing
• Core-stateless fair queuing
• Assigned reading
  • [DKS90] Analysis and Simulation of a Fair Queueing Algorithm, Internetworking: Research and Experience
  • [SSZ98] Core-Stateless Fair Queueing: Achieving Approximately Fair Allocations in High Speed Networks


Overview
• TCP and queues
• Queuing disciplines
• RED
• Fair-queuing
• Core-stateless FQ
• XCP

Example
• 10 Gb/s linecard
  • Requires 300 Mbytes of buffering
  • Read and write a 40-byte packet every 32 ns
• Memory technologies
  • DRAM: requires 4 devices, but too slow
  • SRAM: requires 80 devices, 1 kW, $2000
• Problem gets harder at 40 Gb/s
  • Hence RLDRAM, FCRAM, etc.


Rule-of-thumb
• Rule-of-thumb for router buffer sizing: buffer = bandwidth × round-trip time
• Rule-of-thumb makes sense for one flow
• Typical backbone link has > 20,000 flows
• Does the rule-of-thumb still hold?

If flows are synchronized
• Aggregate window has the same dynamics as a single flow
• Therefore buffer occupancy has the same dynamics
• Rule-of-thumb still holds


If flows are not synchronized
• [Figure: probability distribution of buffer occupancy for desynchronized flows; required buffer size B]

Central Limit Theorem
• CLT tells us that the more variables (congestion windows of flows) we have, the narrower the Gaussian (fluctuation of the sum of windows)
• Width of the Gaussian decreases with 1/√n
• Buffer size should also decrease with 1/√n


Required buffer size
• Required buffer ≈ 2T × C / √n (the rule-of-thumb divided by √n)
• [Figure: simulation of required buffer size vs. number of flows]
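A quick numeric check of the two sizing rules above (a Python sketch; the 10 Gb/s, 250 ms RTT, and 20,000-flow figures are illustrative assumptions consistent with the earlier example slide):

```python
from math import sqrt

def buffer_rule_of_thumb(capacity_bps, rtt_s):
    """Classic rule: buffer = bandwidth-delay product, returned in bytes."""
    return capacity_bps * rtt_s / 8

def buffer_desynchronized(capacity_bps, rtt_s, n_flows):
    """Buffer needed when flows are not synchronized: BDP / sqrt(n)."""
    return buffer_rule_of_thumb(capacity_bps, rtt_s) / sqrt(n_flows)

C, RTT, N = 10e9, 0.25, 20_000
print(buffer_rule_of_thumb(C, RTT) / 1e6)     # ~312 MB, the "300 Mbytes" above
print(buffer_desynchronized(C, RTT, N) / 1e6) # ~2.2 MB with 20,000 flows
```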

Overview
• TCP and queues
• Queuing disciplines
• RED
• Fair-queuing
• Core-stateless FQ
• XCP


Queuing Disciplines
• Each router must implement some queuing discipline
• Queuing allocates both bandwidth and buffer space:
  • Bandwidth: which packet to serve (transmit) next
  • Buffer space: which packet to drop next (when required)
• Queuing also affects latency

Packet Drop Dimensions
• Aggregation: per-connection state … class-based queuing … single class
• Drop position: head, tail, or random location
• When to drop: early drop … overflow drop


Typical Queuing
• FIFO + drop-tail
  • Simplest choice
  • Used widely in the Internet
• FIFO (first-in-first-out)
  • Implies single class of traffic
• Drop-tail
  • Arriving packets get dropped when the queue is full, regardless of flow or importance
• Important distinction:
  • FIFO: scheduling discipline
  • Drop-tail: drop policy

FIFO + Drop-tail Problems
• Leaves responsibility of congestion control to edges (e.g., TCP)
• Does not separate between different flows
• No policing: send more packets → get more service
• Synchronization: end hosts react to the same events


Active Queue Management
• Design active router queue management to aid congestion control
• Why?
  • Routers can distinguish between propagation and persistent queuing delays
  • Routers can decide on transient congestion, based on workload

Active Queue Designs
• Modify both router and hosts
  • DECbit – congestion bit in packet header
• Modify router, hosts use TCP
  • Fair queuing
    • Per-connection buffer allocation
  • RED (Random Early Detection)
    • Drop packet or set bit in packet header as soon as congestion is starting


Overview
• TCP and queues
• Queuing disciplines
• RED
• Fair-queuing
• Core-stateless FQ
• XCP

Internet Problems
• Full queues
  • Routers are forced to have large queues to maintain high utilization
  • TCP detects congestion from loss
  • Forces the network to have long-standing queues in steady state
• Lock-out problem
  • Drop-tail routers treat bursty traffic poorly
  • Traffic gets synchronized easily → allows a few flows to monopolize the queue space


Design Objectives
• Keep throughput high and delay low
• Accommodate bursts
• Queue size should reflect ability to accept bursts rather than steady-state queuing
• Improve TCP performance with minimal hardware changes

Lock-out Problem
• Random drop
  • Packet arriving when queue is full causes some random packet to be dropped
• Drop front
  • On full queue, drop packet at head of queue
• Random drop and drop front solve the lock-out problem but not the full-queues problem


Full Queues Problem
• Drop packets before queue becomes full (early drop)
• Intuition: notify senders of incipient congestion
• Example: early random drop (ERD):
  • If qlen > drop level, drop each new packet with fixed probability p
  • Does not control misbehaving users

Random Early Detection (RED)
• Detect incipient congestion, allow bursts
• Keep power (throughput/delay) high
  • Keep average queue size low
  • Assume hosts respond to lost packets
• Avoid window synchronization
  • Randomly mark packets
• Avoid bias against bursty traffic
• Some protection against ill-behaved users


RED Algorithm
• Maintain running average of queue length
• If avgq < minth, do nothing
  • Low queuing, send packets through
• If avgq > maxth, drop packet
  • Protection from misbehaving sources
• Else, mark packet in a manner proportional to queue length
  • Notify sources of incipient congestion

RED Operation
• [Figure: queue with min thresh and max thresh marks; P(drop) is 0 below minth, rises linearly to maxP at maxth, then jumps to 1.0, as a function of average queue length]


RED Algorithm
• Maintain running average of queue length
  • Byte mode vs. packet mode – why?
• For each packet arrival
  • Calculate average queue size (avgq)
  • If minth ≤ avgq < maxth
    • Calculate probability Pa
    • With probability Pa, mark the arriving packet
  • Else if maxth ≤ avgq
    • Mark the arriving packet

Queue Estimation
• Standard EWMA: avgq = (1 - wq) avgq + wq qlen
  • Special fix for idle periods – why?
• Upper bound on wq depends on minth
  • Want to ignore transient congestion
  • Can calculate the queue average if a burst arrives
  • Set wq such that a certain burst size does not exceed minth
• Lower bound on wq to detect congestion relatively quickly
• Typical wq = 0.002
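The marking decision and EWMA above can be summarized in a short sketch (Python; parameter values are the usual illustrative defaults, and the count-based adjustment of Pa follows the [FJ93] description):

```python
import random

class RED:
    """Minimal RED sketch: EWMA queue estimate plus probabilistic marking."""
    def __init__(self, min_th=5, max_th=15, max_p=0.02, w_q=0.002):
        self.min_th, self.max_th = min_th, max_th
        self.max_p, self.w_q = max_p, w_q
        self.avg_q = 0.0
        self.count = 0   # packets since last mark; spreads marks out over time

    def on_packet_arrival(self, qlen):
        # EWMA of the instantaneous queue length
        self.avg_q = (1 - self.w_q) * self.avg_q + self.w_q * qlen

        if self.avg_q < self.min_th:
            self.count = 0
            return "enqueue"
        if self.avg_q >= self.max_th:
            self.count = 0
            return "mark"                      # or drop
        # Between the thresholds: marking probability grows with the average queue
        p_b = self.max_p * (self.avg_q - self.min_th) / (self.max_th - self.min_th)
        p_a = p_b / max(1 - self.count * p_b, 1e-9)
        self.count += 1
        if random.random() < p_a:
            self.count = 0
            return "mark"
        return "enqueue"
```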


Thresholds
• minth determined by the utilization requirement
  • Tradeoff between queuing delay and utilization
• Relationship between maxth and minth
  • Want to ensure that feedback has enough time to make a difference in load
  • Depends on the average queue increase in one RTT
  • Paper suggests a ratio of 2
  • Current rule of thumb is a factor of 3

Packet Marking
• maxp is reflective of typical loss rates
  • Paper uses 0.02
  • 0.1 is a more realistic value
• If the network needs marking of 20-30%, then you need to buy a better link!
• Gentle variant of RED (recommended)
  • Vary drop rate from maxp to 1 as avgq varies from maxth to 2 × maxth
  • More robust to the setting of maxth and maxp


Extending RED for Flow Isolation
• Problem: what to do with non-cooperative flows?
• Fair queuing achieves isolation using per-flow state – expensive at backbone routers
  • How can we isolate unresponsive flows without per-flow state?
• RED penalty box
  • Monitor history for packet drops, identify flows that use disproportionate bandwidth
  • Isolate and punish those flows

Stochastic Fair Blue
• Same objective as the RED penalty box
  • Identify and penalize misbehaving flows
• Create L hashes with N bins each
  • Each bin keeps track of a separate marking rate (pm)
  • Rate is updated using the standard technique and bin size
  • Flow uses the minimum pm of all L bins it belongs to
    • Non-misbehaving flows hopefully belong to at least one bin without a bad flow
    • Large numbers of bad flows may cause false positives
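A minimal sketch of the bin structure just described (Python; the salted SHA-256 hashing and the L = 8, N = 16 sizes are illustrative assumptions, and the per-bin rate update is omitted):

```python
import hashlib

L_LEVELS, N_BINS = 8, 16

def bins_for_flow(flow_id: str):
    """Map a flow to one bin per hash level, using salted hashes."""
    idxs = []
    for level in range(L_LEVELS):
        h = hashlib.sha256(f"{level}:{flow_id}".encode()).digest()
        idxs.append((level, int.from_bytes(h[:4], "big") % N_BINS))
    return idxs

# p_mark[level][bin] is the per-bin marking probability, maintained by the
# router from each bin's occupancy (update rule elided in this sketch).
p_mark = [[0.0] * N_BINS for _ in range(L_LEVELS)]

def marking_prob(flow_id: str) -> float:
    # A flow is marked at the minimum rate across its bins, so a well-behaved
    # flow escapes punishment if it shares even one bin with no bad flow.
    return min(p_mark[level][b] for level, b in bins_for_flow(flow_id))
```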


Stochastic Fair Blue
• False positives can continuously penalize the same flow
• Solution: moving hash function over time
  • Bad flow no longer shares a bin with the same flows
  • If history is reset, does the bad flow get to make trouble until detected again?
    • No, can perform hash warmup in background

Overview
• TCP and queues
• Queuing disciplines
• RED
• Fair-queuing
• Core-stateless FQ
• XCP


Fairness Goals
• Allocate resources fairly
• Isolate ill-behaved users
  • Router does not send explicit feedback to source
  • Still needs e2e congestion control

What is Fairness?
• At what granularity?
  • Flows, connections, domains?
• What if users have different RTTs/links/etc.?
  • Should it share a link fairly or be TCP fair?
• Maximize fairness index?
  • Fairness = (Σ xi)² / (n Σ xi²), ranging from 1/n (worst) to 1 (perfectly fair)
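As a worked example of the fairness index above (a small Python sketch, not from the slides):

```python
def fairness_index(x):
    """Jain's fairness index: (sum x_i)^2 / (n * sum x_i^2)."""
    n = len(x)
    return sum(x) ** 2 / (n * sum(v * v for v in x))

print(fairness_index([10, 10, 10, 10]))   # 1.0  -> perfectly fair
print(fairness_index([40, 0, 0, 0]))      # 0.25 -> one flow hogs the link
```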


Max-min Fairness
• Allocate users with "small" demands what they want; evenly divide unused resources among "big" users
• Formally:
  • Resources allocated in terms of increasing demand
  • No source gets a resource share larger than its demand
  • Sources with unsatisfied demands get equal shares of the resource

Max-min Fairness Example
• Assume sources 1..n, with resource demands X1..Xn in ascending order
• Assume channel capacity C
  • Give C/n to X1; if this is more than X1 wants, divide the excess (C/n - X1) among the other sources: each gets C/n + (C/n - X1)/(n-1)
  • If this is larger than what X2 wants, repeat the process
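A progressive-filling sketch of the procedure above (Python; the demand values in the usage example are illustrative, and allocations are returned in ascending-demand order):

```python
def max_min_allocation(demands, capacity):
    """Allocate capacity max-min fairly among the given demands."""
    demands = sorted(demands)
    alloc, remaining, left = [], capacity, len(demands)
    for d in demands:
        share = remaining / left    # equal split of what is left
        give = min(d, share)        # never exceed a source's demand
        alloc.append(give)
        remaining -= give
        left -= 1
    return alloc

# Demands 2, 2.6, 4, 5 on a link of capacity 10 -> [2, 2.6, 2.7, 2.7]
print(max_min_allocation([2, 2.6, 4, 5], 10))
```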


Implementing max-min Fairness
• Generalized processor sharing
  • Fluid fairness
  • Bitwise round robin among all queues
• Why not simple round robin?
  • Variable packet length → can get more service by sending bigger packets
  • Unfair instantaneous service rate
    • What if a packet arrives just before/after another departs?

Bit-by-bit RR
• Single flow: clock ticks when a bit is transmitted. For packet i:
  • Pi = length, Ai = arrival time, Si = begin transmit time, Fi = finish transmit time
  • Fi = Si + Pi = max(Fi-1, Ai) + Pi
• Multiple flows: clock ticks when a bit from all active flows is transmitted → round number
  • Can calculate Fi for each packet if the number of flows is known at all times
  • This can be complicated


Bit-by-bit RR Illustration
• Not feasible to interleave bits on real networks
  • FQ simulates bit-by-bit RR

Fair Queuing
• Mapping bit-by-bit schedule onto packet transmission schedule
• Transmit the packet with the lowest Fi at any given time
  • How do you compute Fi?
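One way to compute Fi is the self-clocked approximation covered in the extra slides: use the finish tag of the packet in service as the virtual time. A minimal Python sketch under that assumption (exact round-number tracking is the complicated part the slides allude to):

```python
import heapq

class FairQueuingSketch:
    """Simplified FQ sketch: per-flow finish tags, serve the smallest tag."""
    def __init__(self):
        self.heap = []            # (finish_tag, seq, flow, length)
        self.last_finish = {}     # finish tag of the last packet enqueued per flow
        self.virtual_time = 0.0   # advanced as packets are served
        self.seq = 0

    def enqueue(self, flow, length):
        start = max(self.last_finish.get(flow, 0.0), self.virtual_time)
        finish = start + length                  # F_i = max(F_{i-1}, A_i) + P_i
        self.last_finish[flow] = finish
        heapq.heappush(self.heap, (finish, self.seq, flow, length))
        self.seq += 1

    def dequeue(self):
        finish, _, flow, length = heapq.heappop(self.heap)  # lowest finish tag
        self.virtual_time = finish
        return flow, length
```

Note how a flow sending big packets no longer gains: its finish tags advance faster, so other flows get served in between.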


FQ Illustration
• [Figure: flows 1..n, each with its own input queue (I/P), feeding a single output link (O/P)]
• Variation: Weighted Fair Queuing (WFQ)

Bit-by-bit RR Example
• [Figure: two flows with arriving packets tagged F=10, F=8, F=5, F=2; the output transmits the packet with the lowest finish tag next]
• Cannot preempt the packet currently being transmitted


Fair Queuing Tradeoffs
• FQ can control congestion by monitoring flows
  • Non-adaptive flows can still be a problem – why?
• Complex state
  • Must keep a queue per flow
    • Hard in routers with many flows (e.g., backbone routers)
    • Flow aggregation is a possibility (e.g., do fairness per domain)
• Complex computation
  • Classification into flows may be hard
  • Must keep queues sorted by finish times
  • Finish times change whenever the flow count changes

Overview
• TCP and queues
• Queuing disciplines
• RED
• Fair-queuing
• Core-stateless FQ
• XCP


Core-Stateless Fair Queuing
• Key problem with FQ is core routers
  • Must maintain state for 1000's of flows
  • Must update state at Gbps line speeds
• CSFQ (Core-Stateless FQ) objectives
  • Edge routers should do complex tasks since they have fewer flows
  • Core routers can do simple tasks
    • No per-flow state/processing → this means that core routers can only decide on dropping packets, not on the order of processing
    • Can only provide max-min bandwidth fairness, not delay allocation

Core-Stateless Fair Queuing
• Edge routers keep state about flows and do computation when a packet arrives
• DPS (Dynamic Packet State)
  • Edge routers label packets with the result of state lookup and computation
• Core routers use DPS and local measurements to control processing of packets


Edge Router Behavior
• Monitor each flow i to measure its arrival rate (ri)
  • EWMA of rate
    • Non-constant EWMA constant
    • e^(-T/K) where T = current interarrival time, K = constant
    • Helps adapt to different packet sizes and arrival patterns
• Rate is attached to each packet

Core Router Behavior
• Keep track of fair share rate α
  • Increasing α does not increase the load (F) by N × α
  • F(α) = Σi min(ri, α) → what does this look like?
• Periodically update α
  • Keep track of current arrival rate
  • Only update α if the entire period was congested or uncongested
• Drop probability for a packet = max(1 - α/ri, 0)
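A sketch of the two behaviors above (Python; the function names and the K = 0.1 constant are illustrative, and the labeled rate ri is assumed to be carried in the packet header via DPS):

```python
import math
import random

def edge_rate_update(r_old, pkt_len, interarrival, K=0.1):
    """Edge router: EWMA rate estimate with a non-constant weight e^(-T/K)."""
    w = math.exp(-interarrival / K)
    return (1 - w) * (pkt_len / interarrival) + w * r_old

def csfq_core_forward(rate_label, alpha):
    """Core router: drop with probability max(1 - alpha/r, 0), no per-flow state.

    rate_label is the flow rate r_i stamped by the edge router; alpha is the
    core router's current fair-share estimate.
    """
    drop_prob = max(1.0 - alpha / rate_label, 0.0)
    return random.random() >= drop_prob   # True -> forward, False -> drop

# A flow labeled at 3x the fair share gets through only ~1/3 of the time,
# so its rate past the core router is cut down to roughly alpha.
```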


F vs. Alpha
• [Figure: F(α) = Σi min(ri, α) plotted against α; concave and piecewise-linear with breakpoints at r1, r2, r3; the link capacity C determines how the old α moves to the new α]

Estimating Fair Share
• Need F(α) = capacity = C
  • Can't keep a map of F(α) values → would require per-flow state
  • Since F(α) is concave and piecewise-linear:
    • F(0) = 0 and F(αold) = current accepted rate = Fc
    • Approximate F as linear: F(α) ≈ (Fc / αold) × α
    • Setting F(αnew) = C → αnew = αold × C / Fc
• What if a mistake was made?
  • Forced into dropping packets due to buffer capacity
  • When the queue overflows, α is decreased slightly
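The α update above as a one-line computation (Python sketch with illustrative numbers):

```python
def update_alpha(alpha_old, accepted_rate_Fc, capacity_C):
    """Fair-share update: treat F as linear through (alpha_old, Fc)."""
    # F(alpha) ~ Fc * alpha / alpha_old, so solving F(alpha_new) = C gives:
    return alpha_old * capacity_C / accepted_rate_Fc

# If the router accepted 120 Mb/s on a 100 Mb/s link with alpha_old = 10 Mb/s,
# the new fair-share estimate drops to 10 * 100/120 ~ 8.3 Mb/s.
print(update_alpha(10.0, 120.0, 100.0))
```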


Other Issues
• Punishing fire-hoses – why?
  • Easy to keep track of in a FQ scheme
• What are the real edges in such a scheme?
• Must trust edges to mark traffic accurately
  • Could do some statistical sampling to see if the edge was marking accurately

Overview
• TCP and queues
• Queuing disciplines
• RED
• Fair-queuing
• Core-stateless FQ
• XCP


How does XCP Work?
• [Figure: each packet carries a congestion header with Round Trip Time, Congestion Window, and Feedback fields]
• The sender fills in its RTT and congestion window and an initial Feedback (e.g., +0.1 packet); routers along the path may reduce the Feedback field (e.g., to -0.3 packet)


How does XCP Work?
• Congestion Window = Congestion Window + Feedback
• XCP extends ECN and CSFQ
• Routers compute feedback without any per-flow state

How Does an XCP Router Compute the Feedback?
• Congestion Controller
  • Goal: matches input traffic to link capacity & drains the queue
  • Looks at aggregate traffic & queue
  • MIMD
  • Algorithm: aggregate traffic changes by Δ
    • Δ ~ spare bandwidth, Δ ~ −queue size
    • So, Δ = α · davg · Spare − β · Queue
• Fairness Controller
  • Goal: divides Δ between flows to converge to fairness
  • Looks at a flow's state in the congestion header
  • AIMD
  • Algorithm:
    • If Δ > 0 ⇒ divide Δ equally between flows
    • If Δ < 0 ⇒ divide Δ between flows proportionally to their current rates
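The congestion controller's aggregate feedback as a sketch (Python; α = 0.4 and β = 0.226 are the constants suggested in [KHR02], and the per-flow apportioning done by the fairness controller is omitted):

```python
def xcp_aggregate_feedback(spare_bw, queue_bytes, avg_rtt,
                           alpha=0.4, beta=0.226):
    """Aggregate feedback: delta = alpha * d * Spare - beta * Queue.

    spare_bw in bytes/s, queue_bytes is the persistent queue,
    avg_rtt in seconds; the result is bytes of window to add or remove
    across all flows over the next control interval.
    """
    return alpha * avg_rtt * spare_bw - beta * queue_bytes

# Spare capacity pushes the feedback positive; a standing queue pulls it negative.
print(xcp_aggregate_feedback(spare_bw=1_000_000, queue_bytes=50_000, avg_rtt=0.1))
```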


Getting the devil out of the details …
• Congestion Controller
  • Algorithm: Δ = α · davg · Spare − β · Queue
  • Theorem: the system converges to optimal utilization (i.e., is stable) for any link bandwidth, delay, and number of sources if 0 < α < π/(4√2) and β = α²√2 (proof based on the Nyquist criterion)
  • No parameter tuning
• Fairness Controller
  • If Δ > 0 ⇒ divide Δ equally between flows
  • If Δ < 0 ⇒ divide Δ between flows proportionally to their current rates
  • Need to estimate the number of flows N from header values over a counting interval T
    • RTTpkt: round trip time in header
    • Cwndpkt: congestion window in header
    • T: counting interval
  • No per-flow state

Discussion
• RED
  • Parameter settings
• RED vs. FQ
  • How much do we need per-flow tracking? At what cost?
• FQ vs. XCP/CSFQ
  • Is coarse-grained fairness sufficient?
  • Misbehaving routers/trusting the edge
  • Deployment (and incentives)
  • How painful is FQ?
• XCP vs. CSFQ
  • What are the key differences?
  • Granularity of fairness
• Mechanism vs. policy → will see this in QoS


Important Lessons
• How does TCP implement AIMD?
  • Sliding window, slow start & ack clocking
  • How to maintain ack clocking during loss recovery → fast recovery
• How does TCP fully utilize a link?
  • Role of router buffers
• TCP alternatives
  • TCP being used in new/unexpected ways
  • Key changes needed

Lessons
• Fairness and isolation in routers
  • Why is this hard?
  • What does it achieve – e.g., do we still need congestion control?
• Routers
  • FIFO, drop-tail interacts poorly with TCP
  • Various schemes to desynchronize flows and control loss rate (e.g., RED)
• Fair-queuing
  • Clean resource allocation to flows
  • Complex packet classification and scheduling
• Core-stateless FQ & XCP
  • Coarse-grain fairness
  • Carrying packet state can reduce complexity


Next Lecture: TCP & Routers
• RED
• XCP
• Assigned reading
  • [FJ93] Random Early Detection Gateways for Congestion Avoidance
  • [KHR02] Congestion Control for High Bandwidth-Delay Product Networks

EXTRA SLIDES
The rest of the slides are FYI


Overview
• Fairness
• Fair-queuing
• Core-stateless FQ
• Other FQ variants

Delay Allocation in FQ
• Reduce delay for flows using less than their fair share
  • Advance finish times for sources whose queues drain temporarily
• Schedule based on Bi instead of Fi
  • Fi = Pi + max(Fi-1, Ai) → Bi = Pi + max(Fi-1, Ai − δ)
  • If Ai < Fi-1, the conversation is active and δ has no effect
  • If Ai > Fi-1, the conversation is inactive and δ determines how much history to take into account
    • Infrequent senders do better when history is used


Stochastic Fair Queuing
• Compute a hash on each packet
• Instead of a per-flow queue, have a queue per hash bin
  • An aggressive flow steals traffic from other flows in the same hash bin
• Queues serviced in round-robin fashion
  • Has problems with packet size unfairness
• Memory allocation across all queues
  • When there are no free buffers, drop a packet from the longest queue

Deficit Round Robin
• Each queue is allowed to send Q bytes per round
• If Q bytes are not sent (because the packet is too large), the queue's deficit counter keeps track of the unused portion
• If the queue is empty, the deficit counter is reset to 0
• Uses hash bins like Stochastic FQ
• Similar behavior as FQ but computationally simpler
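A minimal deficit-round-robin sketch matching the description above (Python; the queue contents and the 500-byte quantum in the usage example are illustrative):

```python
from collections import deque

def drr_schedule(queues, quantum):
    """Drain per-bin packet queues (deques of packet sizes) with DRR."""
    deficits = [0] * len(queues)
    sent = []
    while any(queues):
        for i, q in enumerate(queues):
            if not q:
                deficits[i] = 0            # an empty queue forfeits its deficit
                continue
            deficits[i] += quantum         # each round adds one quantum of credit
            while q and q[0] <= deficits[i]:
                pkt = q.popleft()
                deficits[i] -= pkt         # spend credit on the packet just sent
                sent.append((i, pkt))
    return sent

# Queue 0 sends 1500-byte packets, queue 1 sends 300-byte packets; with a
# 500-byte quantum both still get roughly equal bytes over time.
print(drr_schedule([deque([1500, 1500]), deque([300, 300, 300])], 500))
```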


Self-clocked Fair Queuing
• Virtual time to make computation of finish time easier
• Problem with basic FQ
  • Need to be able to know which flows are really backlogged
    • They may not have a packet queued because they were serviced earlier in the mapping of bit-by-bit to packet
  • This is necessary to know how bits sent map onto rounds
  • Mapping of real time to round is piecewise linear → however the slope can change often

Self-clocked FQ
• Use the finish time of the packet being serviced as the virtual time
  • The difference between this virtual time and the real round number can be unbounded
• Amount of service to backlogged flows is bounded by a factor of 2


Start-time Fair Queuing
• Packets are scheduled in order of their start times, not finish times
• Self-clocked → virtual time = start time of the packet in service
• Main advantage → can handle variable-rate service better than other schemes

TCP Modeling
• Given the congestion behavior of TCP, can we predict what type of performance we should get?
• What are the important factors?
  • Loss rate
    • Affects how often the window is reduced
  • RTT
    • Affects increase rate and relates BW to window
  • RTO
    • Affects performance during loss recovery
  • MSS
    • Affects increase rate


Overall TCP Behavior
• Let's concentrate on steady-state behavior with no timeouts and perfect loss recovery
• [Figure: window sawtooth oscillating between W/2 and W over time]

Simple TCP Model
• Some additional assumptions
  • Fixed RTT
  • No delayed ACKs
• In steady state, TCP loses a packet each time the window reaches W packets
  • Window drops to W/2 packets
  • Each RTT the window increases by 1 packet → W/2 × RTT before the next loss
• BW = MSS × avg window / RTT
  • = MSS × (W + W/2) / (2 × RTT)
  • = 0.75 × MSS × W / RTT


Simple Loss Model
• What was the loss rate?
  • Packets transferred between losses = avg BW × time = (0.75 W / RTT) × (W/2 × RTT) = 3W²/8
  • 1 packet lost → loss rate p = 8 / (3W²)
  • W = sqrt(8 / (3p))
• BW = 0.75 × MSS × W / RTT
  • BW = MSS / (RTT × sqrt(2p/3))

TCP Friendliness
• What does it mean to be TCP friendly?
  • TCP is not going away
  • Any new congestion control must compete with TCP flows
    • Should not clobber TCP flows and grab the bulk of the link
    • Should also be able to hold its own, i.e., grab its fair share, or it will never become popular
• How is this quantified/shown?
  • Has evolved into evaluating loss/throughput behavior
  • If it shows 1/sqrt(p) behavior it is ok
  • But is this really true?
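The 1/sqrt(p) relationship above as a quick computation (Python sketch; the MSS, RTT, and loss-rate values are illustrative):

```python
from math import sqrt

def tcp_throughput(mss_bytes, rtt_s, loss_rate):
    """Simple-model throughput: BW = MSS / (RTT * sqrt(2p/3)), in bytes/s."""
    return mss_bytes / (rtt_s * sqrt(2 * loss_rate / 3))

# 1460-byte MSS, 100 ms RTT, 1% loss -> roughly 179 KB/s (~1.4 Mb/s)
print(tcp_throughput(1460, 0.100, 0.01))
```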

