
Carousel: Scalable Traffic Shaping at End Hosts

Ahmed Saeed* (Georgia Institute of Technology), Nandita Dukkipati (Google, Inc.), Vytautas Valancius (Google, Inc.), Vinh The Lam (Google, Inc.), Carlo Contavalli (Google, Inc.), Amin Vahdat (Google, Inc.)

*Work done while at Google.

ABSTRACT

Traffic shaping, including pacing and rate limiting, is fundamental to the correct and efficient operation of both datacenter and wide area networks. Sample use cases include policy-based bandwidth allocation to flow aggregates, rate-based congestion control algorithms, and packet pacing to avoid bursty transmissions that can overwhelm router buffers. Driven by the need to scale to millions of flows and to apply complex policies, traffic shaping is moving from network switches into the end hosts, typically implemented in software in the kernel networking stack.

In this paper, we show that the performance overhead of end-host traffic shaping is substantial and limits overall system scalability as we move to thousands of individual traffic classes per server. Measurements from production servers show that shaping at hosts consumes considerable CPU and memory, unnecessarily drops packets, suffers from head of line blocking and inaccuracy, and does not provide backpressure up the stack. We present Carousel, a framework that scales to tens of thousands of policies and flows per server, built from the synthesis of three key ideas: i) a single queue shaper using time as the basis for releasing packets, ii) fine-grained, just-in-time freeing of resources in higher layers coupled to actual packet departures, and iii) one shaper per CPU core, with lock-free coordination. Our production experience in serving video traffic at a Cloud service provider shows that Carousel shapes traffic accurately while improving overall machine CPU utilization by 8% (an improvement of 20% in the CPU utilization attributed to networking) relative to state-of-art deployments. It also conforms 10 times more accurately to target rates, and consumes two orders of magnitude less memory than existing approaches.

CCS CONCEPTS

• Networks → Packet scheduling;

KEYWORDS

Traffic shaping, Rate-limiters, Pacing, Timing Wheel, Backpressure

ACM Reference format:
Ahmed Saeed, Nandita Dukkipati, Vytautas Valancius, Vinh The Lam, Carlo Contavalli, and Amin Vahdat. 2017. Carousel: Scalable Traffic Shaping at End Hosts. In Proceedings of SIGCOMM '17, Los Angeles, CA, USA, August 21-25, 2017, 14 pages. DOI: 10.1145/3098822.3098852

1 INTRODUCTION

Network bandwidth, especially across the WAN, is a constrained resource that is expensive to overprovision [25, 28, 32]. This creates an incentive to shape traffic based on the priority of the application during times of congestion, according to operator policy. Further, as networks run at higher levels of utilization, accurate shaping to a target rate is increasingly important to efficient network operation. Bursty transmissions that deviate from a flow's target rate can lead to: i) packet loss, ii) less accurate bandwidth calculations for competing flows, and iii) increasing round trip times. Packet loss reduces goodput and confuses transport protocols attempting to disambiguate fair-share available-capacity signals from bursty traffic sources. One could argue that deep buffers are the solution, but we find that the resulting increased latency leads to poor experience for users. Worse, high latency reduces application performance in common cases where compute is blocked on, for example, the completion of an RPC. Similarly, the performance of consistent storage systems is dependent on network round trip times.

In this paper, we use traffic shaping broadly to refer to either pacing or rate limiting, where pacing refers to injecting inter-packet gaps to smooth traffic within a single connection, and rate limiting refers to enforcing a rate on a flow-aggregate consisting of one or more individual connections.

While traffic shaping has historically targeted wide area networks, two recent trends bring it to the forefront for datacenter communications, which use end-host based shaping. The first trend is the use of fine grained pacing by rate-based congestion control algorithms such as BBR [17] and TIMELY [36]. BBR and TIMELY's use of rate control is motivated by studies that show pacing flows can reduce packet drops for video-serving traffic [23] and for incast communication patterns [36]. Incast arises from common datacenter applications that simultaneously communicate with thousands of servers. Even if receiver bandwidth is allocated perfectly among senders at coarse granularity, simultaneous bursting to NIC line rate from even a small subset of senders can overwhelm the receiver's network capacity. Furthermore, traffic from end hosts is increasingly bursty, due to heavy batching and aggregation. Techniques such as NIC offloads optimize for server CPU efficiency — e.g., Transmission Segmentation Offload [20] and Generic Receive Offload [3]. Pacing reduces burstiness, which in turn reduces packet drops at shallow-buffered switches [22] and improves network utilization [14]. It is for these reasons that BBR and TIMELY rely on pacing packets as a key technique for precise rate control of thousands of flows per server.

The second trend is the need for network traffic isolation across competing applications or Virtual Machines (VMs) in the Cloud. The emergence of Cloud Computing means that individual servers may host hundreds or perhaps even thousands of VMs. Each virtual endpoint can in turn communicate with thousands of remote VMs, internal services within and between datacenters, and the Internet at large, resulting in millions of flows with overlapping bottlenecks and bandwidth allocation policies. Providers use predictable and scalable bandwidth arbitration systems to assign rates to flow aggregates, which are then enforced at end systems. Such an arbitration can be performed by centralized entities, e.g., Bandwidth Enforcer [32] and SWAN [25], or distributed systems, e.g., EyeQ [30].

For example, consider 500 VMs on a host providing Cloud services, where each VM communicates on average with 50 other virtual endpoints. The provider, with no help from the guest operating system, must isolate these 25K VM-to-endpoint flows, with bandwidth allocated according to some policy. Otherwise, inadequate network isolation increases performance unpredictability and can make Cloud services unavailable [11, 30].

Traditionally, network switches and routers have implemented traffic shaping. However, inside a datacenter, shaping in middleboxes is not an easy option. It is expensive in buffering and latency, and middleboxes lack the necessary state to enforce the right rate. Moreover, shaping in the middle does not help when bottlenecks are at network edges, such as the host network stack, hypervisor, or NIC. Finally, when packet buffering is necessary, it is much cheaper and more scalable to buffer near the application.

Therefore, the need to scale to millions of flows per server while applying complex policy means that traffic shaping must largely be implemented in end hosts. The efficiency and effectiveness of this end-host traffic shaping is increasingly critical to the operation of both datacenter and wide area networks. Unfortunately, existing implementations were built for very different requirements, e.g., only for WAN flows, a small number of traffic classes, modest accuracy requirements, and simple policies. Through measurement of video traffic in a large-scale Cloud provider, we show that the performance of end-host traffic shaping is a primary impediment to scaling a virtualized network infrastructure. Existing end-host rate limiting consumes substantial CPU and memory; e.g., shaping in the Linux networking stack uses 10% of server CPU to perform pacing (§3); shaping in the hypervisor unnecessarily drops packets, suffers from head of line blocking and inaccuracy, and does not provide backpressure up the stack (§3).

We present the design and implementation of Carousel, an improvement on existing, kernel-based traffic shaping mechanisms. Carousel scales to tens of thousands of flows and traffic classes, and supports complex bandwidth-allocation mechanisms for both WAN and datacenter communications. We design Carousel around three core techniques: i) a single queue shaper using time as the basis for scheduling, specifically, the Timing Wheel [43], ii) coupling actual packet transmission to the freeing of resources in higher layers, and iii) each shaper runs on a separate CPU core, which could be as few as one CPU. We use lock-free coordination across cores. Our production experience in serving video traffic at a Cloud service provider shows that Carousel shapes traffic effectively while improving the overall machine CPU utilization by 8% relative to state-of-art implementations, and with an improvement of 20% in the CPU utilization attributed to networking. It also conforms 10 times more accurately to target rates, and consumes two orders of magnitude less memory than existing approaches. While we implement Carousel in a software NIC, it is designed for hardware offload. By the novel combination of previously-known techniques, Carousel makes a significant advance on the state-of-the-art shapers in terms of efficiency, accuracy and scalability.

2 TRAFFIC SHAPERS IN PRACTICE

In the previous section we established the need for at-scale shaping at end hosts: first, modern congestion control algorithms, such as BBR and TIMELY, use pacing to smooth bursty video traffic, and to handle large-scale incasts; second, traffic isolation in Cloud is critically dependent on efficient shapers. Before presenting details of Carousel, we first present an overview of the shapers prevalently used in practice.

Nearly all of rate limiting at end hosts is performed in software. Figure 1 shows the typical rate limiter architecture, which we broadly term as pre-filtering with multiple token bucket queues. It relies on a classifier, multiple queues, token bucket shapers and/or a scheduler processing packets from each queue. The classifier divides packets into different queues, each queue representing a different traffic aggregate.¹ A queue has an associated traffic shaper that paces packets from that queue as necessary. Traffic shapers are synonymous with token buckets or one of its variants. A scheduler services queues in round-robin order or per service-class priorities.

Figure 1: Token Bucket Architecture: pre-filtering with multiple token bucket queues. Packets from socket buffers in the host OS or guest VM pass through a classifier into multi-queue token buckets (e.g., Rate1, Rate2, Rate3) that rate limit flow aggregates; a scheduler drains the queues toward the NIC.

This design avoids head of line blocking, through a separate queue per traffic aggregate: when a token bucket delays a packet, all packets in the same queue will be delayed, but not packets in other queues. Other mechanisms, such as TCP Small Queues (TSQ) [4] in Linux, provide backpressure and reduce drops in the network stack.

Below, we describe three commonly used shapers in end-host stacks, hypervisors and NICs: 1) Policers; 2) Hierarchical Token Bucket; and 3) FQ/pacing. They are all based on per-traffic-aggregate queues and a token bucket associated with each queue for rate limiting.

2.1 Policers

Policers are the simplest way to enforce rates. Policer implementations use token buckets with zero buffering per queue. A token bucket policer consists of a counter representing the number of currently available tokens. Sending a packet requires checking the request against the available tokens in the queue. The policer drops the packet if too few tokens are available. Otherwise, the policer forwards the packet, reducing the available tokens accordingly. The counter is refilled according to a target rate and capped by a maximum burst. The rate represents the average target rate, and the burst size represents the tolerance for jitter, or for short-term deviation from the target rate conformance.

¹Traffic aggregate refers to a collection of individual TCP flows being grouped on a shaping policy.
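To make the policer behavior described above concrete, the following is a minimal sketch (not the production implementation; the class and parameter names such as TokenBucketPolicer are illustrative): tokens refill at the target rate, are capped by the burst size, and a packet is forwarded only if enough tokens are available, otherwise it is dropped without queueing.

```cpp
#include <algorithm>
#include <cstdint>

// Minimal token-bucket policer sketch: refill at `rate_bps`, cap at
// `burst_bytes`, and drop packets that find too few tokens. All names
// and the nanosecond clock parameter are illustrative assumptions.
class TokenBucketPolicer {
 public:
  TokenBucketPolicer(double rate_bps, double burst_bytes)
      : rate_Bps_(rate_bps / 8.0), burst_bytes_(burst_bytes),
        tokens_(burst_bytes) {}

  // Returns true if the packet may be forwarded, false if it is dropped.
  bool Allow(uint32_t packet_bytes, uint64_t now_ns) {
    Refill(now_ns);
    if (tokens_ < packet_bytes) return false;  // policer drops: no buffering
    tokens_ -= packet_bytes;
    return true;
  }

 private:
  void Refill(uint64_t now_ns) {
    if (last_refill_ns_ == 0) { last_refill_ns_ = now_ns; return; }
    double elapsed_s = (now_ns - last_refill_ns_) * 1e-9;
    tokens_ = std::min(burst_bytes_, tokens_ + elapsed_s * rate_Bps_);
    last_refill_ns_ = now_ns;
  }

  const double rate_Bps_;     // average target rate, in bytes per second
  const double burst_bytes_;  // tolerance for short-term deviation
  double tokens_;
  uint64_t last_refill_ns_ = 0;
};
```

As the text notes, the rate sets the long-term average while the burst bounds how far a flow can deviate from it over short intervals.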

Figure 2: Architecture of Linux rate limiters. (a) HTB: packets traverse the HTB tree from the root through classes (e.g., C1, C2, C3 and children C1:1, C1:2, C1:3); internal nodes handle traffic accounting, and leaf nodes shape traffic with token buckets. (b) FQ/pacing: flow state lookup uses a hashtable of red-black trees on flow ids; paced flows carry a time_next_packet timestamp (e.g., t=now+5), flow delays are kept in a red-black tree on timestamps, and a scheduler polls active queues to add packets to the NIC queue.

Figure 3: Policing performance for target throughput 1Gbps with different burst sizes (throughput in Mbps versus RTT in ms). Policers exhibit poor rate conformance, especially at high RTT. The different lines in the plot correspond to the configured burst size in the token bucket (1ms, 10ms, 100ms, and 1000ms bursts).

2.2 Hierarchical Token Bucket (HTB)

Unlike Policers that have zero queueing, traffic shapers buffer packets waiting for tokens, rather than drop them. In practice, related token buckets are grouped in a complex structure to support advanced traffic management schemes. Hierarchical Token Bucket (HTB) [2] in the Linux kernel Queueing Discipline (Qdisc) uses a tree organization of shapers as shown in Figure 2a. HTB classifies incoming packets into one of several traffic classes, each associated with a token bucket shaper at a leaf in the tree. The hierarchy in HTB is for borrowing between leaf nodes with a common parent to allow for work-conserving scheduling.

The host enforcer in the Bandwidth Enforcer (BwE) [32] system is a large-scale deployment of HTB. Each HTB leaf class has a one-to-one mapping to a specific BwE task flow group – typically an aggregate of multiple TCP flows. The number of HTB leaf classes is directly proportional to the number of QoS classes × destination clusters × the number of tasks.

To avoid unbounded queue growth, shapers should be used with a mechanism to backpressure the traffic source. The simplest form of backpressure is to drop packets; however, drops are also a coarse signal to a transport like TCP. Linux employs more fine grained backpressure. For example, the HTB Qdisc leverages TCP Small Queues (TSQ) [4] to limit two outstanding packets for a single TCP flow within the TCP/IP stack, waiting for actual NIC transmission before enqueuing further packets from the same flow.² In the presence of flow aggregates, the number of enqueued packets equals the number of individual TCP flows × TSQ limit. If HTB queues are full, subsequent arriving packets will be dropped.

²The limit is controlled via a knob whose default value of 128KB yields two full-sized 64KB TSO segments.

2.3 FQ/pacing

FQ/pacing [5] is a Linux Qdisc used for packet pacing along with fair queueing for egress TCP and UDP flows. Figure 2b shows the structure of FQ/pacing. The FQ scheduler tracks per-flow state in an array of Red-Black (RB) trees indexed on flow hash IDs. A deficit round robin (DRR) scheduler [41] fetches outgoing packets from the active flows. A garbage collector deletes inactive flows.

Maintaining per-flow state provides the opportunity to support egress pacing of the active flows. In particular, a TCP connection sends at a rate approximately equal to cwnd/RTT, where cwnd is the congestion window and RTT is the round-trip delay. Since cwnd and RTT are only best-effort estimates, the Linux TCP stack conservatively sets the flow pacing rate to 2 × cwnd/RTT [6]. FQ/pacing enforces the pacing rate via a leaky bucket queue [8], where sending timestamps are computed from packet length and current pacing rate. Flows with next-packet timestamps far into the future are kept in a separate RB tree indexed on those timestamps. We note that pacing is enforced at the granularity of TCP Segmentation Offload (TSO) segments.

The Linux TCP stack further employs the pacing rate to automatically size packets meant for TCP Segmentation Offloading. The goal is to have at least one TSO packet every 1ms, to trigger more frequent packet sends, and hence better acknowledgment clocking and fewer microbursts for flows with low rates. TSO autosizing, FQ, and pacing together achieve traffic shaping in the Linux stack [18].
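As a rough illustration of the leaky-bucket pacing just described — a sketch under the stated 2 × cwnd/RTT heuristic, not the FQ Qdisc's actual code, and with invented names such as PacedFlow and ScheduleSend — the per-flow send time advances by packet length divided by the current pacing rate:

```cpp
#include <cstdint>

// Sketch of per-flow pacing-timestamp bookkeeping in the spirit of
// FQ/pacing: the next send time advances by len/pacing_rate for each
// (TSO-sized) packet. Field and function names are illustrative.
struct PacedFlow {
  double pacing_rate_Bps = 1.0;  // e.g., 2 * cwnd / RTT, in bytes/sec (>0)
  uint64_t next_send_ns = 0;     // "time_next_packet" in FQ terminology
};

// Conservative pacing-rate estimate in the spirit of the Linux TCP stack.
inline double PacingRateBps(double cwnd_bytes, double rtt_s) {
  return 2.0 * cwnd_bytes / rtt_s;
}

// Returns the earliest time this packet may leave and advances the flow's
// schedule; packets whose send time lies in the future would be parked in
// the timestamp-indexed RB tree by FQ.
inline uint64_t ScheduleSend(PacedFlow& f, uint32_t len_bytes,
                             uint64_t now_ns) {
  uint64_t gap_ns =
      static_cast<uint64_t>(len_bytes / f.pacing_rate_Bps * 1e9);
  uint64_t send_at = (f.next_send_ns > now_ns) ? f.next_send_ns : now_ns;
  f.next_send_ns = send_at + gap_ns;  // leaky-bucket style advance
  return send_at;
}
```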
In practice, FQ/pacing and HTB are used either individually or in tandem, each for its specific purpose. HTB rate limits flow aggregates, but scales poorly with the number of rate limited aggregates and the packet rate (§3). FQ/pacing is used for pacing individual TCP connections, but this solution does not support flow aggregates. We use Policers in hypervisor switch deployments where HTB or FQ/pacing are too CPU intensive to be useful.

3 THE COST OF SHAPING

Traffic shaping is a requirement for the efficient and correct operation of production networks, but uses CPU cycles and memory on datacenter servers that can otherwise be used for applications. In this section, we quantify these tradeoffs at scale in a large Cloud service.

Policers: Among the three shapers, Policers are the simplest and cheapest. They have low memory overhead since they are bufferless, and they have small CPU overhead because there is no need to schedule or manage queues. However, Policers have poor conformance to the target rate.

Figure 4: Impact of FQ/pacing on a popular video service. Retransmission rate on paced connections is 40% lower than that on non-paced connections (lower plot). Pacing consumes 10% of total machine CPU at the median and the tail (top plot). The retransmission rates and CPU usage are recorded over 30 second time intervals over the experiment period.

Figure 5: Pacing impact on a process generating QUIC traffic.

Figure 6: TCP-RR rate with and without HTB. HTB is 30% lower due to locking overhead.

Figure 3 shows that as round-trip time (RTT) increases, Policers deviate by 10x from the target rate, for two reasons: first, by dropping non-conformant packets, Policers lose the opportunity to schedule them at a future time, and resort to emulating a target rate with on/off behavior. Second, Policers trigger poor behavior from TCP, where even modest packet loss leads to low throughput [21] and wasted upstream work.

To operate over a large bandwidth-delay product, Policers require a large configured burst size, e.g., on the order of the flow round-trip time. However, large bursts are undesirable, due to poor rate conformance at small time scales, and tail dropping at downstream switches with shallow buffers. Even with a burst size as large as one second, Figure 3 shows that the achieved rate can be 5x below the target rate, for an RTT of 100ms using TCP CUBIC.

Pacing: FQ/pacing provides good rate conformance, deviating at most 6% from the target rate (§7), but at the expense of CPU cost. To understand the impact on packet loss and CPU usage, we ran an experiment on production video servers where we turn off FQ/pacing on 10 servers, each serving 37Gbps at peak across tens of thousands of flows. We instead configured a simple PFIFO queueing discipline typically used as the default low overhead Qdisc [1]. We ensured that the experiment and baseline servers have the same machine configuration, are subject to similar traffic load, including being under the same load balancers, and ran the experiments simultaneously for one week.

Figure 4 compares the retransmission rate and CPU utilization with and without FQ/pacing (10 servers in each setting). The upside of pacing is that the retransmission rate on paced connections is 40% lower than that on non-paced connections. The loss-rate reduction comes from reducing self-inflicted losses, e.g., from TSO or large congestion windows. The flip side is that pacing consumes 10% of total CPU on the machine at the median and the tail (90th and 99th percentiles); e.g., on a 64 core server, we could save up to 6 cores through more efficient rate limiting.

To demonstrate that the cost of pacing is not restricted to the FQ/pacing Qdisc or its specific implementation, we conduct an experiment with QUIC [34] running over UDP traffic. QUIC is a transport protocol designed to improve performance for HTTPS traffic. It is globally deployed at Google on thousands of servers and is used to serve YouTube video traffic. QUIC runs in user space with a packet pacing implementation that performs MTU-sized pacing independent from the kernel's FQ/pacing. We experimented with QUIC's pacing turned ON and OFF. We obtained similar results with QUIC/UDP experiments as we obtained with the kernel's FQ/pacing. Figure 5 shows the CPU consumption of a process generating approximately 8Gbps of QUIC traffic at peak. With pacing enabled, process CPU utilization at the 99th percentile jumped from 0.15 to 0.2 of machine CPU – a 30% increase, due to arming and serving timers for each QUIC session. Pacing lowered retransmission rates from 2.6% to 1.6%.

HTB: Like FQ/pacing, HTB can also provide good quality rate conformance, with a deviation of 5% from target rate (§7), albeit at the cost of high CPU consumption. CPU usage of HTB grows linearly with the packets per second (PPS) rate. Figure 6 shows an experiment with Netperf TCP-RR, request-response ping-pong traffic of 1 byte in each direction, running with and without HTB. This experiment measures the overhead of HTB's shaping architecture. We chose the 1-byte TCP-RR because it mostly avoids overheads other than those specifically related to packets per second. In production workloads, it is desirable for shapers (and more generally networking stacks) to achieve high PPS rates while using minimal CPU. As the offered PPS increases in Figure 6 (via the increase in the number of TCP-RR connections on the x-axis), HTB saturates the machine CPU at 600K transactions/sec. Without HTB, the saturation rate is 33% higher at 800K transactions/sec.

The main reason for HTB's CPU overhead is the global Qdisc lock acquired on every packet enqueue. Contention for the lock grows with PPS. [15] details the locking overhead in Linux Qdisc, and the increasing problems posed as links scale to 100Gbps. Figure 7 shows HTB statistics from one of our busiest production clusters over a 24-hour period. Acquiring locks in HTB can take up to 1s at the 99th percentile, directly impacting CPU usage, rate conformance, and tail latency. The number of HTB classes at any instant on a server is tens of thousands at the 90th percentile, with 1000-2000 actively rate limited.




CPU/Memory Ideal 0 0 Number of classes Low 0 4 8 12 16 20 0 4 8 12 16 20 Number of active flows Low Rate conformance High Time (hours) Time (hours) Figure É: Policers have low CPU usage but poor rate conformance. HTB (a) Total number of HTB classes and active žows and FQ/pacing have good shaping properties but a high CPU cost. 1500 99%-ile 95%-ile median mean 1200 900 packets within a queue may be dropped even while there is plenty of 600 room in the global pool. 300 Summary: e problem with HTB and FQ/pacing is more than 0

Lock Time Per Host (ms) 0 4 8 12 16 20 just accuracy or timer granularities. e CPU ine›ciency of HTB Time (hours) and FQ/pacing is intrinsic to these mechanisms, and not the result of (b) HTB lock acquire time poor implementations. e architecture of these shapers is based on synchronization across multiple queues and cores, which increases Figure ß: HTB statistics over one day from a busy production cluster. the CPU cost: Õ) e token bucket architecture of FQ/pacing and HTB necessi- 1 0.8 tates multiple queues, with one queue per rate limit to avoid head 0.6 of line blocking. e cost of maintaining these queues grows, some-

CDF 0.4 times super-linearly, with the number of such queues. CPU costs 0.2 stem from multiple factors, especially the need to poll queues for 0 10000 12000 14000 16000 18000 20000 22000 packets to process. Commonly, schedulers either use simple round- Number of packets robin algorithms, where the cost of polling is linear in the number of queues, or more complex algorithms tracking active queues. ó) Synchronization across multiple cores: Additionally, the cost Figure ˜: Bušered packets of a VM at a shaper in hypervisor. e bump on multi-CPU systems is dominated by locking and/or contention in the curve is because there are two internal limits on the number of overhead when sharing queues and associated rate limiters between bušered packets: ÕäK is a local per-VM limit, and óÕK is at the global level for all VMs. ere are packet drops when the ÕäK limit is exceeded, which CPUs. HTB and FQ/pacing acquire a global lock on a per-packet impacts the slope of the curve. basis, whose contention gets worse as the number of CPUs increases. Solving these problems requires a fundamentally dišerent ap- Memory: ere is also a memory cost related to tra›c shaping: the proach, such as the one we describe in the following sections. Today, memory required to queue packets at the shaper, and the memory to practitioners have a di›cult choice amongst rate limiters: Policers maintain the data structures implementing the shaper. Without back- have high CPU/memory e›ciency but unacceptable shaping perfor- pressure to higher level transports like TCP, the memory required to mance. HTB and FQ/pacing have good shaping properties, but fairly queue packets grows with the number of žows × congestion window, high CPU and memory costs. Figure É summarizes this tradeoš. ÕþÛ and reaches a point where packets need to be dropped. Figure ˜ shows CPU overhead from accurate shaping can be worth the additional the number of packets at a rate limiter in the absence of backpressure network e›ciency, especially for WAN transfers. Now, the question for a Cloud VM that has a single žow being shaped at óGbps over becomes whether we can get the same capability with substantially th less CPU overhead. a ¢þms RTT path. In the ÉÉ percentile, the queue of the shaper has óÕ thousand MTU size packets (or ìó MB of memory) because of CUBIC TCP’s large congestion window, adding Õóþms of latency ¦ CAROUSEL DESIGN PRINCIPLES from the additional bušering. In the presence of backpressure (§¢), Our work begins with the observation that a unied, accurate, and there will be at most two TSO size worth of packets or ˜¢ MTU sized CPU-e›cient tra›c shaping mechanism can be used in a variety of packetsì (or Õó˜KB of memory), i.e., two orders of magnitude less important settings. Our design principles follow naturally from the memory. following requirements: A second memory cost results from maintaining the shaper’s data Õ) Work compatibly with higher level congestion control mechanisms structures. e challenge is partitioning and constraining resources such as TCP. such that there are no drops during normal operation; e.g., queues a)Pace packets correctly: avoid bursts and unnecessary delays. in HTB classes can be congured to grow up to ÕäK packets, and b) Provide backpressure and avoid packet drops: delaying pack- FQ/pacing has a global limit of ÕþK packets and a per queue limit ets because of rate limiting should quickly result in slowing down of Õþþþ packets, aŸer which the packets are dropped even if there is the end application. 
Each additional packet generated by the application will otherwise need to be buffered or dropped in the shaper, wasting memory and CPU to either queue or regenerate the packet. Additionally, loss-based congestion control algorithms often react to packet loss by significantly slowing down the corresponding flows, requiring more time to reach the original bandwidth.

³We assume a maximum TSO size of 64KB and an MTU size of 1.5KB.

c) Avoid head of line blocking: shaping packets belonging to a traffic aggregate should not delay packets from other aggregates.

2) Use CPU and memory efficiently: Rate limiting should consume a very small portion of the server CPU and memory. As implied by §3, this means avoiding the processing overhead of multiple queues, as well as the associated synchronization costs across CPUs in multi-core systems.

We find that these requirements are satisfied by a shaper based on three simple tenets:

1. Single Queue Shaping (§5.1): Relying on a single queue to shape packets alleviates the inefficiency and overhead of multi-queue systems which use token buckets, because these require a queue per rate limit. We employ a single queue indexed by time, e.g., a Calendar Queue [16] or Timing Wheel [43]. We insert all packets from different flows into the same queue and extract them based on their transmission schedule. We set a send-time timestamp for each packet, based on a combination of its flow's rate limit, pacing rate, and bandwidth sharing policy.

2. Deferred Completions (§5.2): The sender must limit the number of packets in flight per flow, potentially using existing mechanisms such as TCP Small Queues or the congestion window. Consequently, there must be a mechanism for the network stack to ask the sender to queue more packets for a specific flow, e.g., completions or acknowledgments. With appropriate backpressure such as the Deferred Completions described in §5.2, we reduce both HoL blocking and memory pressure. Deferred Completions is a generalization of the layer 4 mechanism of TCP Small Queues to a more general layer 3 mechanism in a hypervisor/container software switch.

3. Silos of one shaper per core (§5.3): Siloing the single queue shaper to a core alleviates the CPU inefficiency that results from locking and synchronization. We scale to multi-core systems by using independent single-queue shapers, one shaper per CPU. Note that a system can have as few as one shaper assigned to a specific CPU, i.e., not every CPU requires a shaper. To implement shared rates across cores we use a re-balancing algorithm that periodically redistributes the rates across CPUs.

Figure 10: Carousel architecture. Packets from socket buffers in the host OS or guest VM are timestamped based on policy and rate, held in a time-based queue ordered on those timestamps, and released to the NIC; completions are reported back to the source.

Figure 10 illustrates the architecture using the Timing Wheel and Deferred Completions. Deferred Completions addresses the first of the requirements on working compatibly with congestion control. Deferred Completions allows the shaper to slow down the source, and hence avoids unnecessary drops and mitigates head of line blocking. Additionally, Deferred Completions also allows the shaper to buffer fewer packets per flow, reducing memory requirements. The use of a single Timing Wheel per core makes the system CPU-efficient, thus addressing the second requirement.

Carousel is more accurate and CPU-efficient than state-of-art token-bucket shapers. Carousel's architecture of a single queue per core, coupled with Deferred Completions, can be implemented anywhere below the transport, in either software or NIC hardware.

To demonstrate the benefits of our approach, we explore shaping egress traffic in §5, using Deferred Completions for backpressure, and shaping ingress incast traffic at the receiver in §6, using ACK deferral for backpressure.

5 THE CAROUSEL SYSTEM

Carousel shaping consists of three sequential steps for each packet. First, the network stack computes a timestamp for the packet. Second, the stack enqueues the packet in a queue indexed by these timestamps. Finally, when the packet's transmission deadline passes, the stack dequeues the packet and delivers a completion event, which can then trigger the transmission of more packets.

We implement Carousel in a software NIC in our servers. Software NICs dedicate one or more CPU cores for packet processing, as in SoftNIC and FlexNIC [24, 31]. Our software NIC operates in busy-wait polling mode. Polling allows us to check packet timestamps at high frequency, enabling accurate conformance with the packets' schedules. Polling also assists in CPU-efficient shaping, by alleviating the need for locking, and for timers for releasing packets.

We note that Software NICs are not a requirement for Carousel, and it is possible to implement Carousel in the kernel. However, an implementation in the kernel would not be a simple modification of one of the existing Queuing Disciplines [10], and would require re-engineering the kernel's Qdisc path. This is because the kernel is structured around parallelism at the packet level, with per-packet lock acquisitions, which does not scale efficiently to high packet rates. Our approach is to replicate the shaping-related data structure per core, and to achieve parallelism at the level of flows, not at the level of a shared shaper across cores. Since Software NICs already use dedicated cores, it was easy to rapidly develop and test a new shaper like Carousel, which is a testament to their value.
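The three sequential steps above can be summarized in a short skeleton. This is only a sketch of the control flow under assumed names (CarouselShaper, Enqueue, Poll); the production system runs inside the busy-polling software NIC and uses the O(1) Timing Wheel described next rather than the ordered map used here as a stand-in.

```cpp
#include <cstdint>
#include <map>

// Skeleton of the per-packet Carousel path from Section 5:
// (1) compute a timestamp, (2) enqueue into a time-indexed queue,
// (3) on the busy-poll loop, release due packets and report completions.
struct Packet { uint64_t release_ns = 0; uint32_t len = 0; };

class CarouselShaper {
 public:
  // Steps 1 and 2: record the computed timestamp and index the packet by it.
  void Enqueue(Packet p, uint64_t timestamp_ns) {
    p.release_ns = timestamp_ns;
    queue_.emplace(timestamp_ns, p);  // stand-in for the Timing Wheel
  }

  // Step 3: called from the busy-wait polling loop of the software NIC.
  // Releases every packet whose deadline has passed and reports a
  // completion so the source may enqueue more packets.
  template <typename TxFn, typename CompletionFn>
  void Poll(uint64_t now_ns, TxFn transmit, CompletionFn report_completion) {
    while (!queue_.empty() && queue_.begin()->first <= now_ns) {
      Packet p = queue_.begin()->second;
      queue_.erase(queue_.begin());
      transmit(p);            // hand the packet to the wire
      report_completion(p);   // Deferred Completion back to the stack
    }
  }

 private:
  // Ordered by release time; the real design replaces this with an
  // O(1) time-slotted Timing Wheel (Section 5.1.2).
  std::multimap<uint64_t, Packet> queue_;
};
```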
5.1 Single Queue Shaping

We examine the two steps of emulating the behavior of a multi-queue architecture using a single time-indexed queue while improving CPU efficiency.

5.1.1 Timestamp Calculation. Timestamp calculation is the first step of traffic shaping in Carousel. The timestamp of a packet determines its release time, i.e., the time at which the packet should be transmitted to the wire. A single packet can be controlled by multiple shaping policies, e.g., transport protocol pacing, a per-flow rate limit, and an aggregate limit for the flow's destination. In typical settings, these policies are enforced independently, each at the point where the policy is implemented in the networking stack. Carousel requires that at each point a packet is timestamped with a transmission time according to the rate at that point, rather than enforcing the rate there. Hence, these policies are applied sequentially as the packet trickles down the layers of the networking stack. For example, the TCP stack first timestamps each packet with a release time to introduce inter-packet gaps within a connection (pacing); subsequently, different rate limiters in the packet's path modify the packet's timestamp to conform to one or more rates corresponding to different traffic aggregates.

The timestamping algorithm operates in two steps: 1) timestamp calculation based on individual policies, such as pacing and rate limiting, and 2) consolidation of the different timestamps, so that the packet's final transmission time adheres to the overall desired behavior. The packet's final timestamp can be viewed as the Earliest Release Time (ERT) of the packet to the wire.

Setting Timestamps: Both pacing and rate limiting use the same technique for calculating timestamps. Timestamps can be set either as relative timestamps, i.e., transmit after a certain interval relative to now, or as absolute wall time timestamps, i.e., transmit at a certain time. In our implementation, we use the latter. Both pacing and rate limiting force a packet to adhere to a certain rate: 2 × cwnd/RTT for pacing, and a target rate for rate limiting. In our implementation, the timestamp for pacing is set by the sender's TCP, and that for the target rate is set by a module in the Software NIC, which receives external rates from centralized entities. Policy i enforces a rate R_i and keeps track of the latest timestamp it creates, LTS_i. The timestamp for the j-th packet P_j^i going through the i-th policy is then calculated as: LTS'_i = LTS_i + len(P_j^i) / R_i. Timestamps are thus advanced on every packet. Note that packets going through a policy can belong to the same flow, e.g., in the case of pacing, or to multiple flows, e.g., in the case of rate limiting flow aggregates. Packets that are neither paced nor rate limited by any policy carry a timestamp of zero.

Consolidating Timestamps: Consolidation is straightforward, since larger timestamps represent smaller target rates. In the course of a packet's traversal through multiple policies, we choose the largest of the timestamps, which avoids violating any of the policies associated with the packet. Note that the effective final calculation of a flow's rate is equivalent to a series of token buckets, as the slowest token bucket ends up dominating and slowing a flow to its rate. We also note that Carousel is not a generic scheduler. Hence, this consolidation approach will not work for all scheduling policies (further discussed in Section 8).

5.1.2 Single Time-indexed Queue. One of the main components of Carousel is the Timing Wheel [43], an efficient queue slotted by time and a special case of the more generic Calendar Queue [16]. The key difference is that the Calendar Queue has O(1) amortized insertion and extraction but can be O(N) in skewed cases. The constant for O(1) in the Timing Wheel is smaller than that of the Calendar Queue. A Calendar Queue would provide exact packet ordering based on timestamps even within a time slot, a property that we don't strictly require for shaping purposes.

A Timing Wheel is an array of lists where each entry is a timestamped packet. The array represents the time slots from now till the preconfigured time horizon, where each slot represents a certain time range of size g_min within the overall horizon, i.e., the minimum time granularity. The array is a circular representation of time: once a slot becomes older than now and all its elements are dequeued, the slot is updated to represent now + horizon. The number of slots in the array is calculated as horizon/g_min. Furthermore, we allow extracting packets every g_min. Within the g_min period, packets are extracted in FIFO order. The Timing Wheel is suitable as the single time-indexed queue operated by Carousel. We seek to achieve line rate for the aggregate of tens of thousands of flows. Carousel relies on a single queue to maintain line rate, i.e., supporting tens of thousands of entries with minimal enqueue and dequeue overhead. The Timing Wheel's O(1) operations make it a perfect design choice.

With the Timing Wheel, Carousel keeps track of a variable now representing the current time. Packets with timestamps older than now do not need to be queued and should be sent immediately. Furthermore, Carousel allows for configuring a horizon, the maximum interval between now and the furthermost time at which a packet can be queued. In particular, if r_min represents the minimum supported rate, and l_max is the maximum number of packets in the queue belonging to the same aggregate that is limited to r_min, then the following relation holds: horizon = l_max / r_min seconds. As Carousel is implemented within a busy polling system, the Software NIC can visit Carousel with a maximum frequency of f_max, due to having to execute other functions associated with a NIC. This means that it is unnecessary to have time slots in the queue representing a range of time smaller than 1/f_max, as this granularity will not be supported. Hence, the minimum granularity of time slots is g_min = 1/f_max.

An example configuration of the Timing Wheel is for a minimum rate of 1.5Mbps for a 1500B packet size, slot granularity of 8µs, and a time horizon of 4 seconds. This configuration yields 500K slots, calculated as horizon/g_min. The minimum rate limit or pacing rate supported is one packet sent per time horizon, which is 1.5Mbps for a 1500B packet size. The maximum supported per-flow rate is 1.5Gbps. This cap on the maximum rate is due to the minimum supported gap between two packets of 8µs and the packet size of 1500B. Higher rates can be supported by either increasing the size of packets scheduled within a slot (e.g., scheduling a 4KB packet per slot increases the maximum supported rate to 4Gbps), or by decreasing the slot granularity.
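The two-step timestamping of §5.1.1 and the slot mapping of §5.1.2 can be sketched as follows. This is an illustration under assumed units (nanoseconds, bytes per second) and invented names (ShapingPolicy, EarliestReleaseTime, SlotIndex); it mirrors the LTS'_i = LTS_i + len/R_i update and the max-based consolidation, not the production code.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Each policy advances its own latest timestamp by len/R_i; the packet's
// final Earliest Release Time is the maximum across all applicable
// policies, so the slowest policy dominates.
struct ShapingPolicy {
  double rate_Bps;          // R_i: pacing rate or target rate (bytes/sec)
  uint64_t last_ts_ns = 0;  // LTS_i: latest timestamp this policy issued

  uint64_t NextTimestamp(uint32_t len_bytes) {
    // LTS'_i = LTS_i + len(P)/R_i; a result in the past simply means
    // "send immediately" once it reaches the Timing Wheel.
    last_ts_ns += static_cast<uint64_t>(len_bytes / rate_Bps * 1e9);
    return last_ts_ns;
  }
};

// Consolidation: take the largest timestamp so that no policy's rate is
// violated. A result of zero (no policies) means "send immediately".
inline uint64_t EarliestReleaseTime(std::vector<ShapingPolicy*>& policies,
                                    uint32_t len_bytes) {
  uint64_t ert_ns = 0;
  for (ShapingPolicy* p : policies)
    ert_ns = std::max(ert_ns, p->NextTimestamp(len_bytes));
  return ert_ns;
}

// Mapping the final timestamp to a Timing Wheel slot of granularity
// g_min, clamped to the horizon (the "insert at the end slot" option).
inline uint64_t SlotOffset(uint64_t ert_ns, uint64_t now_ns,
                           uint64_t gmin_ns, uint64_t num_slots) {
  uint64_t ahead = (ert_ns > now_ns) ? (ert_ns - now_ns) / gmin_ns : 0;
  return std::min(ahead, num_slots - 1);
}
```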

We note that Token Bucket rate limiters have an explicit configuration of burst size that determines the largest line rate burst for a flow. Such a burst setting can be realized in Carousel via timestamping, such that no more than a burst of packets is transmitted at line rate. For example, a train of consecutive packets of a total size of burst can be configured to be transmitted at the same time or within the same timeslot. This will create a burst of that size.

Algorithm 1 presents a concrete implementation of a Timing Wheel. There are only two operations: Insert places a packet at a slot determined by the timestamp, and ExtractNext retrieves the first enqueued packet with a timestamp older than the current time, now. This allows the Timing Wheel to act as a FIFO if all packets have a timestamp of zero. All packets with a timestamp older than now are inserted in the slot with the smallest time.

Algorithm 1 Timing Wheel implementation.
1: procedure Insert(Packet, Ts)
2:   Ts = Ts / Granularity
3:   if Ts <= FrontTimestamp then
4:     Ts = FrontTimestamp
5:   else if Ts > FrontTimestamp + NumberOfSlots - 1 then
6:     Ts = FrontTimestamp + NumberOfSlots - 1
7:   end if
8:   TW[Ts % NumberOfSlots].append(Packet)
9: end procedure
10: procedure ExtractNext(now)
11:   now = now / Granularity
12:   while now >= FrontTimestamp do
13:     if TW[now % NumberOfSlots].empty() then
14:       FrontTimestamp += Granularity
15:     else
16:       return TW[now % NumberOfSlots].PopFront()
17:     end if
18:   end while
19:   return Null
20: end procedure

We have two options for packets with a timestamp beyond the horizon. In the first option, the packets are inserted in the end slot that represents the timing wheel horizon. This approach is desirable when only flow pacing is the goal, as it doesn't drop packets but instead results in a temporary rate overshoot because of bunched packets at the horizon. The second approach is to drop packets beyond the horizon, which is desired when imposing a hard rate limit on a flow and an overshoot is undesirable.

A straightforward Timing Wheel implementation supports O(1) insertion and extraction where the list representing a slot is a dynamic list, e.g., std::list in C++. Memory is allocated and deallocated on every insertion and deletion within the dynamic list. Such memory management introduces significant cost per packet. Hence, we introduce a Global Pool of memory: a free list of nodes to hold packet references. The Timing Wheel allocates from the free list when packets are enqueued. To further increase efficiency, each free list node can hold references to n packets. Preallocated nodes are assigned to different time slots based on need. This eliminates the cost of allocation and deallocation of memory per packet, and amortizes the cost of moving nodes around over n packets. Freed nodes return to the global pool for reuse by other slots based on need. §7 quantifies the Timing Wheel performance.

5.2 Deferred Completions

Figure 11: Illustration of completion signaling: (a) Immediate Completions, (b) Deferred Completions. In (a), a large backlog can build up and substantially overwhelm the buffer, leading to drops and head-of-line blocking. In (b), only two packets are allowed in the shaper, and once a packet is released another is added. Indexing packets based on time allows for interleaving packets.

Carousel provides backpressure to the higher layer transport without drops, by bounding the number of enqueued packets per flow. Otherwise, the queue can grow substantially, causing packet drops and head of line blocking. Thus, we need mechanisms in the software NIC or the hypervisor to properly signal between Carousel and the source. We study two such approaches in detail: Deferred Completions and delay-based congestion control.

A completion is a signal reported from the driver to the transport layer in the kernel signaling that a packet left the networking stack (i.e., completed). A completion allows the kernel stack to send more packets to the NIC. The number of packets sent by the stack to the NIC is transport dependent; e.g., in the case of TCP, TSQ [4] ensures that the number of packets outstanding between the stack and the generation of a completion is at most two segments. Unreliable protocols like UDP could in theory generate an arbitrary number of packets. In practice, however, UDP sockets are limited to 128KB of outstanding send data.

The idea behind Deferred Completions, as shown in Figure 11, is holding the completion signal until Carousel transmits the packet to the wire, preventing the stack from enqueueing more packets to the NIC or the hypervisor. In Figure 11a, the driver in the guest stack generates completion signals at enqueue time, thus overwhelming the Timing Wheel with as much as a congestion window's worth of packets per TCP flow, leading to drops past the Timing Wheel horizon. In Figure 11b, Carousel in the Software NIC returns the completion signal to the guest stack only after the packet is dequeued from the Timing Wheel, and thus strictly bounds the number of packets in the shaper, since the transport stack relies on completion signals to enqueue more packets to Qdiscs.

Implementing Deferred Completions requires modifying the way completions are currently implemented in the driver. Completions are currently delivered by the driver in the order in which packets arrive at the NIC, regardless of their application or flow. Carousel can transmit packets in an order different from their arrival because of shaping. This means that a flow could have already transmitted packets to the wire, but be blocked because the transport layer has not received its completion yet. Consider a scenario of two competing flows, where the rate limit of one flow is half of the other. Both flows deliver packets to the shaper at the same arrival rate. However, packets of the faster flow will be transmitted at a higher rate. This means that packets from the faster flow can arrive at the shaper later than the packets from the slow flow and yet be transmitted first. The delivery rate of completion signals needs to match the packet transmission rate, or else the fast flow will be unnecessarily backpressured and slowed down. Ordered completion delivery can cause head of line blocking, hence we implemented out-of-order completions in the driver of the guest kernel.

We implement out-of-order completions by tracking all the packets still held by Carousel in a hashmap in the guest kernel driver. On reception of a packet from the guest transport stack, the Software NIC enqueues the reference to that packet in the hashmap. When Carousel releases the packet to the network, the Software NIC generates a completion signal, which is delivered to the guest transport stack and allows removal of the packet from the hashmap. The removal of the packet from the hashmap is propagated up in the stack, allowing TSQ to enqueue another packet [19]. This requires changing both the driver and its interface in the guest kernel, i.e., virtio [40].
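A minimal sketch of this out-of-order completion bookkeeping is shown below. It only illustrates the hashmap-based tracking described above; the types, the packet identifier, and names such as CompletionTracker are assumptions for the example, not the actual driver code.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Sketch of Section 5.2's out-of-order completion tracking: packet
// references handed to Carousel are kept in a hash map keyed by an
// identifier, and a completion is generated only when the shaper
// releases the packet, regardless of arrival order.
struct PacketRef { /* pointer/handle to guest packet memory, omitted */ };

class CompletionTracker {
 public:
  // Called when the guest transport stack hands a packet to the
  // Software NIC; the packet is now "held by Carousel".
  void OnEnqueue(uint64_t packet_id, PacketRef ref) {
    inflight_[packet_id] = ref;
  }

  // Called when Carousel transmits the packet to the wire. Returns true
  // if a completion should be delivered to the guest so that TSQ lets
  // the flow enqueue its next packet.
  bool OnRelease(uint64_t packet_id, PacketRef* out) {
    auto it = inflight_.find(packet_id);
    if (it == inflight_.end()) return false;
    *out = it->second;
    inflight_.erase(it);  // completions may be out of arrival order
    return true;
  }

  std::size_t Inflight() const { return inflight_.size(); }

 private:
  std::unordered_map<uint64_t, PacketRef> inflight_;
};
```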
Carousel is implemented within a Software NIC that relies on zero-copy to reduce memory overhead within the NIC. This becomes especially important in the presence of Carousel, because when the Software NIC shapes traffic, it holds more packets than it would without any shaping. Hence, the Timing Wheel carries references (pointers) to packets and not the actual packets themselves. The memory cost incurred is thus roughly the cost of the number of references held in Carousel, while the packet memory is fully attributed to the sending application until the packet is sent to the wire (completed). In practice, this means that for every one million packets outstanding in the Timing Wheel, our software NIC needs to allocate only on the order of 8MB of RAM. The combination of Deferred Completions and holding packet references provides backpressure while using only a small memory footprint.

Delay-based Congestion Control (CC) is an alternative form of backpressure to Deferred Completions. Delay-based CC senses the delay introduced in the shaper, and modulates TCP's congestion window to a smaller value as the RTT increases. The congestion window serves as a limit on the number of packets in the Timing Wheel. We note that delay-based CC should discount the pacing component of the delay when computing the RTT. Operating with an RTT that includes pacing delay introduces undesirable positive feedback loops that reduce throughput. On the other hand, for Retransmission Timeout (RTO) computations, it is crucial that pacing delay be included, so as to avoid premature firing of the RTO timer. §7 provides a comparison across backpressure mechanisms.

5.3 Scaling Carousel with multiple cores

So far we have described the use of a single Timing Wheel per CPU core to shape traffic. Using a single shaper in multi-core systems incurs contention costs that would make the solution impractical. It is easy to scale Carousel by using independent Timing Wheels, one per CPU. In our approach, each TCP connection is hashed to a single Timing Wheel. Using one shaper per core suffices for pacing or rate limiting individual flows, and allows lock-free queuing of packets to a NIC instance on a specific core.

However, this simple approach does not work for the case when we want to shape multiple TCP connections by a single aggregate rate, because the individual TCP connections land on different cores (through hashing) and hence are shaped by different Timing Wheels. We resolve this with two components:

1) NBA: Our implementation uses a NIC-level bandwidth allocator (NBA) that accepts a total rate limit for flow-aggregates and redistributes each rate limit periodically to individual per-core Timestampers. NBA uses a simple water-filling algorithm to achieve work-conservation and max-min fairness (a sketch follows below). The individual Timestampers provide NBA with up-to-date flow usage data as a sequence of (byte count, timestamp of the byte count measurement) pairs; rate assignments from NBA to Timestampers are in bytes per second. The water-filling algorithm ensures that if flows on one core are using less than their fair share, then the extra is given to flows on the remaining cores. If flows on any core have zero demand, then the central algorithm still reserves 1% of the aggregate rate, to provide headroom for ramping up.

2) Communication between NBA and per-core Timestampers: Usage updates by Timestampers and the dissemination of rates by NBA are performed periodically in a lazy manner. We chose lazy updates to avoid the overhead of locking the data path with every update. The frequency of updates is a tradeoff between stability and fast convergence of computed rates; we currently use ≈100ms.
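The following is a hedged sketch of one plausible NBA-style redistribution for a single flow-aggregate. The text states the goals — max-min fairness, work conservation, and a small reservation for idle cores — but not this exact code; the function name and the details of the loop are illustrative assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Redistribute `aggregate_Bps` across per-core Timestampers given their
// measured demands (bytes/sec). Idle cores keep a 1% reservation as
// ramp-up headroom; the rest is water-filled over cores with demand.
std::vector<double> RedistributeAggregate(double aggregate_Bps,
                                          const std::vector<double>& demand) {
  const std::size_t n = demand.size();
  std::vector<double> rate(n, 0.0);
  if (n == 0) return rate;

  double remaining = aggregate_Bps;
  std::vector<std::size_t> active;      // cores with non-zero demand
  for (std::size_t i = 0; i < n; ++i) {
    if (demand[i] > 0.0) {
      active.push_back(i);
    } else {
      rate[i] = 0.01 * aggregate_Bps;   // 1% headroom for ramping up
      remaining -= rate[i];
    }
  }
  remaining = std::max(0.0, remaining);

  // Water-filling: repeatedly hand out an equal share, capping cores
  // whose demand is already satisfied below that share.
  std::vector<std::size_t> unsat = active;
  while (!unsat.empty() && remaining > 0.0) {
    double share = remaining / unsat.size();
    std::vector<std::size_t> next;
    bool someone_capped = false;
    for (std::size_t i : unsat) {
      double need = demand[i] - rate[i];
      if (need <= share) {              // demand met below fair share
        rate[i] = demand[i];
        remaining -= need;
        someone_capped = true;
      } else {
        next.push_back(i);
      }
    }
    if (!someone_capped) {              // all remaining cores want more
      for (std::size_t i : next) rate[i] += share;
      remaining = 0.0;
    }
    unsat.swap(next);
  }

  // Work conservation: any leftover (all demands met) is spread over the
  // cores that had demand so the aggregate limit can still be used.
  if (remaining > 0.0 && !active.empty()) {
    for (std::size_t i : active) rate[i] += remaining / active.size();
  }
  return rate;
}
```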
6 CAROUSEL AT THE RECEIVER

We apply the central tenets discussed in §4 to show how a receiver can shape flows on the ingress side. Our goal is to allow overwhelmed receivers to handle incast traffic through a fair division of bandwidth amongst flows [12]. The direct way of shaping ingress traffic is to queue incoming data packets, which can consume considerable memory. Instead, we shape the acknowledgements reported to the sender. Controlling the rate of acknowledged data allows for fine grained control over the sender's rate.

The ingress shaping algorithm modifies the acknowledgment sequence numbers in packets' headers, originally calculated by TCP, to acknowledge packets at the configured target rate. We apply Carousel as an extension of DCTCP without modifying its behavior. The algorithm keeps track of a per-flow Last Acked Sequence Number (SNa), Latest Ack Time (Ta), and Last Received Sequence Number (SNr). The first two variables keep track of the sequence number of the last-acknowledged bytes and the time that acknowledgement was sent. For an outgoing packet from the receiver to the sender, we check the acknowledgement number in the packet. We use that to update SNr. Then, we change that number based on the following formula: NewAckSeqNumber = min((now − Ta) × Rate + SNa, SNr). We update Ta to now and SNa to NewAckSeqNumber, if SNa is smaller than NewAckSeqNumber. Acknowledgments are generated by Carousel when no packet has been generated by the upper layer for the time MTU/Rate. The variable Rate is set either through a bandwidth allocation system, or it is set as the total available bandwidth divided by the number of flows.

7 EVALUATION

We evaluate Carousel in small-scale microbenchmarks where all shaped traffic is between machines within the same rack, and in production experiments on machines serving video content to tens of thousands of flows per server. We compare the overhead of Carousel to HTB and FQ/pacing in similar settings and demonstrate that Carousel is 8% more efficient in overall machine CPU utilization (20% more efficient in CPU utilization attributable to networking), and 6% better in terms of rate conformance, while maintaining similar or lower levels of TCP retransmission rates. Note that the video serving system is just for the convenience of production experiments in this section; the same shaping mechanisms are also deployed in our network operations, such as enforcing Bandwidth Enforcer (BwE) rates.
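Before turning to the evaluation, the following sketch makes the §6 acknowledgement-shaping formula, NewAckSeqNumber = min((now − Ta) × Rate + SNa, SNr), concrete. Field names mirror the variables in the text; the struct name, the nanosecond clock, and the omission of 32-bit sequence wraparound are simplifying assumptions for illustration.

```cpp
#include <algorithm>
#include <cstdint>

// Sketch of the Section 6 receiver-side ingress shaper applied to an
// outgoing ACK: data is acknowledged at no more than rate_Bps.
struct IngressShapedFlow {
  double rate_Bps = 0;      // configured rate, or fair share of capacity
  uint64_t sn_acked = 0;    // SNa: last acknowledged sequence number
  uint64_t sn_recv = 0;     // SNr: last received sequence number
  uint64_t t_ack_ns = 0;    // Ta: time the last acknowledgement was sent

  // Rewrites the ACK number carried in an outgoing packet.
  uint64_t ShapeAck(uint64_t ack_in_pkt, uint64_t now_ns) {
    sn_recv = std::max(sn_recv, ack_in_pkt);
    double elapsed_s = (now_ns - t_ack_ns) * 1e-9;
    uint64_t budget =
        sn_acked + static_cast<uint64_t>(elapsed_s * rate_Bps);
    uint64_t new_ack = std::min(budget, sn_recv);   // the paper's formula
    if (new_ack > sn_acked) {   // only advance SNa and Ta when ACK moves
      sn_acked = new_ack;
      t_ack_ns = now_ns;
    }
    return new_ack;
  }
};
```

Per the text, the receiver also generates a standalone acknowledgement if the upper layer has been silent for MTU/Rate, so the acknowledgement clock keeps running even without reverse-direction data.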

600 DC - Median TCP Vegas - Median Carousel DC - 99th Pct TCP Vegas - 99th Pct 450 HTB 200 300 FQ 160 150 120 0 80

Absolute Deviation 0 4000 8000 40 in Shaper from Target Rate (Mbps) Set Rate (Mbps) 0 0 1000 2000 3000 4000 5000 6000 7000 8000 Number of Packets Aggregate Rate (Mbps) Figure Õó: Comparison between Carousel, HTB, and FQ in their rate con- formance for a single žow. Figure Õ¦: A comparison of Deferred Completion (DC) and delay-based congestion control (TCP Vegas) backpressure, for Ÿeen žows when vary- 5000 ing the aggregate rate limits. 4500 Carousel 4000 HTB 3500 FQ 3000 (Mbps) 2500 which re at lower rates than needed for accuracy at high rates. For

Achieved Rate 0 200 400 600 800 1000 instance, HTB’s timer res once every Õþms. Carousel relies on ac- Number of Flows curate scheduling of packets in busy-wait polling SoŸware NIC, by placing them in a Timing Wheel slot that transmits a packet within a Figure Õì: Comparison between Carousel, HTB, and HTB/FQ showing slot size of its scheduled time. the ešect of the number of žows load on their rate conformance to a target Vary the number of žows. We vary the number of žows in our rate of ¢ Gbps. experiments between one and ÕþK, typical for user facing servers. Figure Õì shows the ešect of varying the number of žows until Õþþþ ß.Õ Microbenchmark žows. Carousel is not impacted as it relies on the Timing Wheel Experiments setup: We conduct experiments for egress tra›c shap- with O(Õ) insertion and extraction times. HTB maintains a constant ing between two servers sharing the same top of rack switch. e deviation of ¢Û from the target rate. However, FQ/pacing conformity servers are equipped with soŸware NIC on which we implemented signicantly drops beyond óþþ žows. is is due to the limited num- Carousel to shape tra›c. For baseline, servers run Linux with HTB ber of per-queue and global packet limits. However, increasing this congured with ÕäK packets per queue; FQ/pacing is congured with number signicantly increases its CPU overhead due to its reliance a global limit of ÕþK packets and a žow limit of Õþþþ packets. ese on RB-trees for maintaining žows and packets schedules [¢]. settings follow from best practices. We generate tra›c with neper 7.1.2 Memory Efficiency. We compare memory e›ciency of [ß], a network performance measurement tool that allows for gener- Deferred Completions and delay-based CC as backpressure schemes ating large volumes of tra›c with up to thousands of žows per server. (§¢.ó). We use the metric of number of packets held in the shaper, We vary RTT using netem. All reported results are for an emulated which directly režects the memory allocation expected from the RTT of ì¢ms. We found the impact of RTT on Carousel to be negligi- shaper along with the possibility of packet drops as memory demand ble. Experiments are run for ìþ seconds each and enough number of exceeds allocation. We collect samples of the number of packets every runs to make the standard deviation ÕÛ or smaller. Unless otherwise time a new packet is extracted from the shaper. mentioned, the Timing Wheel granularity is two microseconds with Change target rate. Figure Õ¦ shows that with Deferred Comple- a horizon of two seconds. tions, the average of number of packets in the Carousel is not ašected 7.1.1 Rate Conformance. We measure rate conformance as by changes in the target rate. While delay-based CC maintains similar the absolute deviation from the target rate. is metric represents average number of packets like in Deferred Completions, it has twice the deviation in bytes per second due to shaping errors. We measure the number of packets in the ÉÉth percentile. is is because delay absolute deviation from target rate by collecting achieved rate samples variations can grow the congestion window in Vegas to large values, once every Õþþ milliseconds. We collect samples at the output of while Deferred Completions maintains a strict bound regardless of the shaper, i.e., on the sender, as we want to focus on the shaper the variations introduced by congestion control. e bump in ÉÉth behavior and avoid factoring in network impact on tra›c. 

Change target rate. We investigate rate conformance for a single TCP connection. As the target rate is varied, we find that HTB and FQ have a consistent 5% and 6% deviation, respectively, between the target and achieved rates (Figure 12). This is due to their reliance on timers, which fire at lower rates than needed for accuracy at high rates. For instance, HTB's timer fires once every 10ms. Carousel relies on accurate scheduling of packets in a busy-wait polling software NIC, placing each packet in a Timing Wheel slot that transmits it within a slot size of its scheduled time.

Figure 12: Comparison between Carousel, HTB, and FQ in their rate conformance for a single flow.

Vary the number of flows. We vary the number of flows in our experiments between one and 10K, typical for user-facing servers. Figure 13 shows the effect of varying the number of flows up to 1000 flows. Carousel is not impacted, as it relies on the Timing Wheel with O(1) insertion and extraction times. HTB maintains a constant deviation of 5% from the target rate. FQ/pacing conformity, however, drops significantly beyond 200 flows. This is due to its limited per-queue and global packet limits; increasing these limits significantly increases its CPU overhead because of its reliance on RB-trees for maintaining flow and packet schedules [5].

Figure 13: Comparison between Carousel, HTB, and FQ showing the effect of the number of flows on their rate conformance to a target rate of 5 Gbps.

7.1.2 Memory Efficiency. We compare the memory efficiency of Deferred Completions and delay-based CC as backpressure schemes (§5.2). We use the number of packets held in the shaper as the metric, which directly reflects the memory allocation expected from the shaper, along with the possibility of packet drops when memory demand exceeds allocation. We collect a sample of the number of packets every time a new packet is extracted from the shaper.

Change target rate. Figure 14 shows that with Deferred Completions, the average number of packets in Carousel is not affected by changes in the target rate. While delay-based CC maintains a similar average number of packets to Deferred Completions, it holds twice as many packets at the 99th percentile. This is because delay variations can grow the congestion window in Vegas to large values, while Deferred Completions maintains a strict bound regardless of the variations introduced by congestion control. The bump in the 99th percentile at the 1Gbps rate for Deferred Completions results from an interaction of TSQ and TSO autosizing. Deferred Completions behavior is consistent regardless of the congestion control algorithm used in TCP; our Deferred Completions experiments used TCP Cubic in Linux.

Figure 14: A comparison of Deferred Completion (DC) and delay-based congestion control (TCP Vegas) backpressure, for fifteen flows when varying the aggregate rate limits.

Vary number of flows. In Figure 15, we compare the number of packets held in the shaper for a fixed target rate while varying the number of flows. In the absence of Deferred Completions, the flows continually send until the shaper's memory of 70K packets is completely filled. With Deferred Completions, the number of packets held in the shaper is deterministic, at approximately twice the number of active flows. This is also reflected in the 99th percentile of the number of packets held. For delay-based CC, the median number of packets is higher than the 99th-percentile behavior of Deferred Completions. Furthermore, the 99th percentile of delay-based CC is less predictable than that of Deferred Completions, because it can be affected by variations in RTT. We did not observe any significant difference in rate conformance between the two approaches. We find that Deferred Completions is the better backpressure approach because of its predictable behavior.

Figure 15: A comparison of Deferred Completion (DC) and delay-based congestion control (TCP Vegas) backpressure, while varying the number of flows with an aggregate rate limit of 5Gbps.

There is no hard limit on the number of individual connections or flow aggregates that Carousel can handle. Carousel operates on references (pointers) to packets, which are small. We do have to allocate memory to hold the references, but in practice this memory is a small fraction of the memory available on the machine.
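
The bound of roughly two packets per flow follows from coupling shaper occupancy to completions: the stack may hand the shaper another packet for a flow only after an earlier one has actually left the Timing Wheel. The sketch below is our own minimal illustration of that coupling, not the production implementation; the per-flow budget of two and all names are assumptions.

    #include <atomic>

    // Illustrative per-flow gate for Deferred Completions (hypothetical names).
    // A flow may have at most kBudget packets inside the shaper; the completion
    // callback, run when the Timing Wheel actually emits a packet, reopens the
    // gate. This is what bounds shaper occupancy to ~2x the number of active flows.
    struct FlowCompletionGate {
      static constexpr int kBudget = 2;      // assumed per-flow budget
      std::atomic<int> in_shaper{0};

      bool TryEnqueue() {                    // call before handing a packet down
        int cur = in_shaper.load();
        while (cur < kBudget) {
          if (in_shaper.compare_exchange_weak(cur, cur + 1)) return true;
        }
        return false;                        // stack holds the packet: backpressure
      }
      void OnTransmitComplete() {            // call when the shaper releases the packet
        in_shaper.fetch_sub(1);
      }
    };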

7.1.3 Impact of Timing Wheel Parameters. In §5, we chose the Timing Wheel for its high efficiency and predictable behavior. Here we study the effect of its different parameters on the behavior of Carousel.

Slot Granularity. Slot granularity controls the burst size and also determines the smallest supported gap between packets, which bounds the maximum rate at which Carousel can pace. Furthermore, for a certain granularity to be supported, the Timing Wheel has to be dequeued at least at that rate. We investigate the effect of different granularities on rate conformance. Figure 16 shows that a granularity of 8-16µs suffices to maintain a small deviation from the target rate. We also find that rate conformity at larger slot granularities, such as 4ms, deviates by 300 Mbps, which is comparable to the deviation of HTB and FQ at the same target rate shown in Figure 12.

Figure 16: Rate conformance of Carousel for a target rate of 5Gbps at different slot granularities.

Timing Wheel Horizon. The horizon represents the maximum delay a packet can encounter in the shaper (and hence the minimum supported rate), and is the product of the slot granularity and the number of slots. Figure 17 shows the minimum rate and the number of slots needed for different horizon values, all for a granularity of two microseconds. Note that the results here assume that only one flow is shaped by Carousel, which means the horizon only needs to be large enough to accommodate the time gap between two packets.

Figure 17: An analytical view of the minimum supported rate for a specific Timing Wheel horizon, along with the required number of slots, for a slot size of two microseconds.

Deviation between Timestamp and Transmission Time. Figure 18 shows a CDF of the difference, in nanoseconds, between the scheduled transmission time of a packet and its actual transmission time. This metric represents how accurately the Timing Wheel adheres to its schedule. Over 90% of packets have deviations smaller than the slot granularity of two microseconds; these deviations are caused by the rounding-down error when mapping packets whose schedules are calculated in nanoseconds to slots of 2µs. Deviations exceeding a slot size, caused by excessive queuing delay between the timestamper and the shaper, are rare.

Figure 18: A CDF of the deviation between a packet timestamp and its transmission time.

CPU Overhead of Timing Wheel. For a system attempting to achieve line rate, every nanosecond of per-packet processing counts. The Timing Wheel is the most resource-consuming component in the system, as the remaining operations merely change parts of the packet metadata. We explore two parameters that can affect the per-packet delay in the Timing Wheel implementation: 1) the data structure used to represent a slot, and 2) the number of packets held in the Timing Wheel. Table 1 shows the impact of both parameters. Because of the O(1) insertion and extraction overhead, the Timing Wheel is not affected by the number of packets or flows; the factor affecting the per-packet delay is the overhead of inserting and extracting packets from each slot. We find that implementing a slot data structure that avoids allocating memory for every insertion, i.e., the Global Pool (described in §5.1.2), reduces the overhead of C++ std::list by 50%.

Table 1: Overhead per packet of different Timing Wheel implementations.
Number of Packets in Shaper     1000    4000    32000    256000    2000000
std::list (ns per packet)         22      21       21        21         22
Global Pool (ns per packet)       12      11       11        11         11
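
To make the slot arithmetic, the horizon clamp, and the allocation-free Global Pool idea concrete, the following simplified, single-threaded sketch uses assumed names and types; it is not the Carousel implementation and omits the per-core, lock-free details.

    #include <cstdint>
    #include <vector>

    // Illustrative time-indexed shaper queue backed by a preallocated node pool.
    class TimingWheelSketch {
     public:
      TimingWheelSketch(uint64_t granularity_ns, uint64_t num_slots,
                        size_t capacity, uint64_t start_ns)
          : granularity_ns_(granularity_ns), num_slots_(num_slots),
            heads_(num_slots, -1), pool_(capacity),
            next_slot_(start_ns / granularity_ns) {
        // Build the free list once; Enqueue never allocates afterwards.
        for (size_t i = 0; i < capacity; ++i) {
          pool_[i].next = free_head_;
          free_head_ = static_cast<int>(i);
        }
      }

      // O(1) insert. Nanosecond timestamps round down to a slot (the sub-slot
      // deviation of Figure 18); times beyond the horizon clamp to the last slot.
      bool Enqueue(const void* pkt, uint64_t ts_ns) {
        if (free_head_ < 0) return false;            // full: backpressure, not drops
        uint64_t slot = ts_ns / granularity_ns_;
        if (slot < next_slot_) slot = next_slot_;    // already due
        if (slot >= next_slot_ + num_slots_) slot = next_slot_ + num_slots_ - 1;
        int n = free_head_;
        free_head_ = pool_[n].next;
        pool_[n].pkt = pkt;
        pool_[n].next = heads_[slot % num_slots_];
        heads_[slot % num_slots_] = n;
        return true;
      }

      // Called from the busy-polling loop; O(1) work per extracted packet.
      template <typename Emit>
      void AdvanceTo(uint64_t now_ns, Emit emit) {
        while (next_slot_ * granularity_ns_ <= now_ns) {
          int& head = heads_[next_slot_ % num_slots_];
          while (head != -1) {
            int n = head;
            head = pool_[n].next;
            emit(pool_[n].pkt);                      // transmit + completion upstream
            pool_[n].next = free_head_;              // return the node to the pool
            free_head_ = n;
          }
          ++next_slot_;
        }
      }

     private:
      struct Node { const void* pkt = nullptr; int next = -1; };
      uint64_t granularity_ns_;
      uint64_t num_slots_;
      std::vector<int> heads_;
      std::vector<Node> pool_;
      int free_head_ = -1;
      uint64_t next_slot_;
    };

Since the horizon is the product of granularity and slot count, and a packet can wait at most one horizon, the lowest rate a single flow can be held to is roughly one packet per horizon; this is the trade-off Figure 17 quantifies.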

7.1.4 Carousel Impact at the Receiver. We evaluate the receiver-based shaper in a 100:1 incast scenario: 10 machines with 10 connections each send incast traffic to one machine; all machines are located in the same rack. The unloaded network RTT is 10µs, the packet size is 1500 bytes, and each experiment is run for 100 seconds. Senders use DCTCP and Carousel to limit the throughput of individual incast flows. Figure 19 shows the rate conformance of Carousel for varying aggregate rates enforced at the receiver. The deviation of the aggregate achieved rate is within 1% of the target rate.

Figure 19: Comparing receiver-side rate limiting with the target rate.

7.2 Production Experience
To evaluate Carousel with real production loads, we choose 25 servers in five geographically diverse locations. The servers are deployed in two US locations (East and West Coast) and three European locations (Western and Eastern Europe). They serve streaming video traffic from a service provider to end clients. Each server generates up to 38Gbps of traffic, serving as many as 50,000 concurrently active sessions. For comparison, the servers support both conventional FQ/pacing in the Linux kernel and Carousel implemented on a software NIC like SoftNIC [24].

The metrics we use are: 1) retransmission rates, 2) server CPU efficiency, and 3) software NIC efficiency. To measure server CPU efficiency, we use the Gbps/CPU metric, which is the total amount of egress traffic divided by the total number of CPUs on a server and then divided by their utilization. For example, if a server sends 18 Gbps with 72 CPUs that are 50% utilized on average, it is said to have 18 / (72 * 0.5) = 0.5 Gbps/CPU efficiency. We present only the data for periods when server CPUs are more than 50% utilized (approximately 6 hours every day). Peak periods are when efficiency matters most, because server capacity must be provisioned for such peaks. Gbps/CPU is better than direct CPU load for measuring CPU efficiency, because our production system routes new client requests to the least loaded machines to maintain high CPU utilization at all machines. This load-balancing behavior means that more efficient machines will receive more client requests.
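
The Gbps/CPU metric is plain arithmetic; a small helper reproducing the worked example above might look as follows (illustrative only).

    #include <cstdio>

    // Gbps/CPU: egress traffic normalized by CPU count and average utilization.
    double GbpsPerCpu(double egress_gbps, int num_cpus, double avg_utilization) {
      return egress_gbps / (num_cpus * avg_utilization);
    }

    int main() {
      // Example from the text: 18 Gbps on 72 CPUs at 50% utilization = 0.5 Gbps/CPU.
      std::printf("%.2f Gbps/CPU\n", GbpsPerCpu(18.0, 72, 0.5));
      return 0;
    }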

CPU efficiency. In the interest of space, we do not show results from all sites. Figure 20 shows the impact of Carousel on the Gbps/CPU metric for five machines in the West Coast site. Figure 21 compares two machines, one using Carousel and the other using FQ/pacing. We pick these two machines specifically because they have similar CPU utilization during our measurement period. This provides a fair comparison of Gbps/CPU by unifying the denominator, focusing on how efficiently the CPU is used to push data. At the 50th and 90th percentiles, Carousel is 6.4% and 8.2% more efficient (in Gbps/CPU) at serving traffic. On 72-CPU machines this translates to saving an average of 4.6 CPUs per machine. Considering that networking accounts for ≈40% of the machine CPU, the savings are 16% in the median and 20% in the tail for CPU attributable to networking. The two machines exhibit similar retransmission rates, which shows that Carousel preserves the value of pacing while significantly reducing its cost.

Figure 20: Immediate improvement in CPU efficiency after switching to Carousel on all five machines in a US West Coast site.

Figure 21: Comparison between FQ and Carousel for two machines serving large volumes of video traffic and exhibiting similar CPU loads, showing that Carousel can push more traffic for the same CPU utilization.

Carousel on Software NIC. We examine the overhead of Carousel on the software NIC. Surprisingly, we find that moving pacing to the software NIC in fact improves its performance. The software NIC operates in spin-polling mode, consuming 100% of a fixed number of CPUs assigned to it. The NIC reports the CPU cycles it spent in a spin loop and whether it did any useful work in that particular loop (i.e., it received a batch with at least one packet, or a packet was emitted by the timing wheel); the CPU cycles spent in useful loops divided by the total CPU cycles in all loops is the software NIC utilization level. To measure software NIC efficiency, we normalize observed throughput by the software NIC's self-reported utilization, deriving a Gbps/SoftNIC metric. Figure 22 shows the Gbps/SoftNIC metric with pacing enforced by FQ/pacing and by Carousel. We see that moving pacing to the software NIC improves its efficiency by 12% (32.2 Gbps vs 28.8 Gbps per software NIC). The improved efficiency comes from larger batches flowing from the kernel TCP stack to the software NIC once pacing moves out of the kernel. Figure 23 shows cumulative batch sizes, including empty batches, for machines serving production traffic with pacing enforced in the kernel and with pacing enforced in the software NIC. On average, non-empty packet batch sizes increased by 57% (from an average batch size of 2.1 to an average batch size of 3.3).

Figure 22: Software NIC efficiency comparison with pacing enforced in the kernel vs pacing enforced in the software NIC. Bandwidth here is normalized to a unit of software NIC self-reported utilization.

Figure 23: Cumulative batch sizes (in TSO packets) that the software NIC receives from the kernel when pacing is enforced in the kernel vs when pacing is enforced in the software NIC.
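
The software NIC's self-reported utilization and the derived Gbps/SoftNIC metric can be expressed directly from the loop accounting described above; the counter names in this sketch are our own assumptions.

    #include <cstdint>

    // Utilization = cycles spent in useful spin loops / all spin-loop cycles;
    // Gbps/SoftNIC normalizes egress throughput by that utilization.
    struct SoftNicLoopCounters {
      uint64_t useful_loop_cycles = 0;   // loops that received or emitted packets
      uint64_t total_loop_cycles = 0;    // every spin-loop iteration
    };

    double SoftNicUtilization(const SoftNicLoopCounters& c) {
      return c.total_loop_cycles == 0
                 ? 0.0
                 : static_cast<double>(c.useful_loop_cycles) / c.total_loop_cycles;
    }

    double GbpsPerSoftNic(double egress_gbps, const SoftNicLoopCounters& c) {
      double u = SoftNicUtilization(c);
      return u > 0.0 ? egress_gbps / u : 0.0;
    }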

8 DISCUSSION
Application of Carousel in Hardware NICs: Some hardware devices, such as Intel NICs [26, 27], provide support for rate limiting. The common approach requires one hardware queue per rate limiter, where the hardware queue is bound to one or more transmit queues in the host OS. This approach heads in the direction of ever more transmit queues in hardware, which does not scale because of silicon constraints. Hence, hardware rate limiters remain restricted to custom applications, rather than serving as a normal offload that the OS can use.

With Carousel, the single-queue approach is a sharp departure from the trend of needing ever more hardware queues, and can be implemented as follows: 1) the hardware is configured with one transmit queue per CPU to avoid contention between the OS and the NIC; 2) classification and rates are configured on the hardware via an API; 3) the hardware consumes input packets, generates per-packet timestamps based on the configured rates, and enqueues them into a Timing Wheel; 4) it dequeues packets that are due; and 5) it returns a completion event for each packet that is sent.

Limitations: Carousel does not serve as a generic packet scheduler that can realize arbitrary packet schedules, because the Timing Wheel does not support changing the timestamps of already enqueued packets. Some schedules, such as Weighted Fair Queueing, are supportable with the existing O(1) Enqueue/Dequeue operations, while preemptive schedules such as strict priorities are not.

A second challenge stems from the use of a single-queue shaper with timestamping of packets: how to make timestamping consistent when the timestamps are not set in one centralized place. One approach is to standardize all sources to set timestamps in nanoseconds (or microseconds) from the Unix epoch. Other challenges include handling timestamps from unsynchronized CPU clocks, and timestamps of different granularities.
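
As one concrete form of the standardization suggested above, every timestamper could emit nanoseconds since the Unix epoch from the system's real-time clock. This snippet is our own illustration of the convention, not code from Carousel.

    #include <cstdint>
    #include <time.h>

    // Nanoseconds since the Unix epoch: a candidate common unit for packet
    // timestamps set by different layers (TCP, policers, the pacing engine).
    uint64_t UnixEpochNanos() {
      struct timespec ts;
      clock_gettime(CLOCK_REALTIME, &ts);   // wall clock, epoch-based
      return static_cast<uint64_t>(ts.tv_sec) * 1000000000ull +
             static_cast<uint64_t>(ts.tv_nsec);
    }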

9 RELATED WORK
Improving shaping efficiency: SENIC and vShaper introduce hybrid software/hardware shapers to reduce CPU overhead. SENIC allows the host's CPU to classify and enqueue packets while the NIC is responsible for pulling packets when it is time to transmit them [38]. vShaper allows queues to be shared across different flows by estimating the share each will take based on its demand [33]. In contrast, Carousel relies on a single queue that can be implemented efficiently in either hardware or software. Pacing was first proposed with a per-flow timer, which made it unattractive due to excessive CPU overhead [14]. FQ/pacing in the OS (§5) uses only one timer and still imposes non-trivial CPU overhead (§3). By relying on a time-based queue, Carousel implements shaping more efficiently and accurately, as packets are extracted based on their release times rather than by spending CPU cycles checking different queues for packets that are due.

New data structures for packet scheduling: New data structures in the recent literature implement scheduling at line rate while saving memory and CPU. UPS presents a Least Slack Time First-based packet scheduling algorithm for routers [35]. The Push-In First-Out Queue (PIFO) [42] is a priority queue that allows packets to be enqueued into an arbitrary location in the queue (thus enabling programmable packet scheduling), but only dequeued from the head. The Timing Wheel is also a particular kind of PIFO. PIFO and UPS focus on scheduling in switches, which requires work-conserving algorithms; Carousel focuses on pacing and rate limiting at end hosts, which are non-work-conserving. Carousel is similar to the seminal VirtualClock work [44] in its reliance on transmission time to control traffic. However, the focus in VirtualClock is on setting timestamps for specific rates that can be specified by a network controller. Carousel presents a data structure that can efficiently implement such techniques, and extends timestamping to both pacing and rate limiting. Furthermore, Carousel relies on Deferred Completions to limit queue length and avoid harmful interactions among flows.

Consumers of shaping: Recent congestion control and bandwidth allocation systems [13, 30, 32, 37, 39] include an implementation of shaping, e.g., FQ/pacing for BBR [17], a priority queue in TIMELY [36], a hardware token bucket queue in HULL [9], injected void packets in Silo [29], and HTB [2] for BwE [32]. Most proposed techniques rely on variations of token buckets to enforce rates. Carousel proposes a system that combines pacing and rate limiting while reducing memory and CPU overhead. TIMELY and HULL both strongly motivate a hardware implementation of pacers due to the overhead of software pacing. However, TIMELY presents a software pacer that works on a per-flow basis. This requires keeping track of flow information at the pacer, which can be a large overhead in the data path. Carousel requires each packet only to be annotated with a timestamp, which reduces the pacer overhead, since the notion of flows can be kept where it is actually required, e.g., in the TCP layer.

10 CONCLUSION
Even though traffic shaping is fundamental to the correct and efficient operation of datacenters and WANs, deployment at scale has been difficult due to prohibitive CPU and memory overheads. We showed in this paper that two techniques can help overcome scaling and efficiency concerns: a single time-indexed queue per CPU core for packets, and management of higher-layer resources with Deferred Completions. Scaling is further helped by a multicore-aware design. Carousel is able to individually shape tens of thousands of flows on a single server with modest resource consumption.

We believe the most exciting consequence of our work will be the creation of novel policies for pacing, bandwidth allocation, incast handling, and mitigation against DoS attacks. No longer are we restricted to cherry-picking a handful of flows to shape. We are confident that our results with Carousel make a strong technical case for networking stacks and NIC vendors to invest in its basic abstractions.

11 ACKNOWLEDGMENTS
We thank Ken Durden and Luigi Rizzo for their direct contributions to this work. We thank Yuchung Cheng, Brutus Curtis, Eric Dumazet, Van Jacobson and Erik Rubow for their insightful discussions; Marco Paganini for assisting in several rounds of Carousel experiments on Bandaid servers; Ian Swett and Jeremy Dorfman for conducting pacing experiments with QUIC; and David Wetherall and Daniel Eisenbud for their ongoing support of congestion control work. We thank our technical reviewers, Rebecca Isaacs, Jeff Mogul, Dina Papagiannaki and Parthasarathy Ranganathan, for providing useful feedback. We thank our shepherd, Andreas Haeberlen, and the anonymous SIGCOMM reviewers for their detailed and excellent feedback.

REFERENCES
[1] 2002. pfifo-tc: PFIFO Qdisc. https://linux.die.net/man/8/tc-pfifo/.
[2] 2003. Linux Hierarchical Token Bucket. http://luxik.cdi.cz/~devik/qos/htb/.
[3] 2008. net: Generic receive offload. https://lwn.net/Articles/358910/.
[4] 2012. tcp: TCP Small Queues. https://lwn.net/Articles/506237/.
[5] 2013. pkt_sched: fq: Fair Queue packet scheduler. https://lwn.net/Articles/564825/.
[6] 2013. tcp: TSO packets automatic sizing. https://lwn.net/Articles/564979/.
[7] 2016. neper: a Linux networking performance tool. https://github.com/google/neper.
[8] 2017. Leaky bucket as a queue. https://en.wikipedia.org/wiki/Leaky_bucket.
[9] Mohammad Alizadeh, Abdul Kabbani, Tom Edsall, Balaji Prabhakar, Amin Vahdat, and Masato Yasuda. 2012. Less Is More: Trading a Little Bandwidth for Ultra-Low Latency in the Data Center. In NSDI '12. 253-266.
[10] W. Almesberger, J. H. Salim, and A. Kuznetsov. 1999. Differentiated services on Linux. In GLOBECOM '99. 831-836 vol. 1b.
[11] Michael Armbrust, Armando Fox, Rean Griffith, Anthony D. Joseph, Randy Katz, Andy Konwinski, Gunho Lee, David Patterson, Ariel Rabkin, Ion Stoica, and Matei Zaharia. 2010. A View of Cloud Computing. Commun. ACM 53, 4 (2010), 50-58.
[12] Wei Bai, Kai Chen, Haitao Wu, Wuwei Lan, and Yangming Zhao. 2014. PAC: taming TCP Incast congestion using proactive ACK control. In ICNP '14. 385-396.
[13] Hitesh Ballani, Paolo Costa, Thomas Karagiannis, and Ant Rowstron. 2011. Towards Predictable Datacenter Networks. SIGCOMM Comput. Commun. Rev. 41, 4 (Aug. 2011), 242-253.
[14] Neda Beheshti, Yashar Ganjali, Monia Ghobadi, Nick McKeown, and Geoff Salmon. 2008. Experimental Study of Router Buffer Sizing. In IMC '08.
[15] Jesper Dangaard Brouer. 2015. Network stack challenges at increasing speeds: The 100Gbit/s challenge. LinuxCon. http://events.linuxfoundation.org/sites/events/files/slides/net_stack_challenges_100G_1.pdf.
[16] Randy Brown. 1988. Calendar queues: a fast O(1) priority queue implementation for the simulation event set problem. Commun. ACM 31, 10 (1988), 1220-1227.
[17] Neal Cardwell, Yuchung Cheng, C. Stephen Gunn, Van Jacobson, and Soheil Yeganeh. 2016. BBR: Congestion-Based Congestion Control. ACM Queue (2016), 50:20-50:53.
[18] Yuchung Cheng and Neal Cardwell. 2016. Making Linux TCP Fast. In Netdev Conference.
[19] David D. Clark. 1985. The structuring of systems using upcalls. In SOSP '85. 171-180.
[20] Glenn William Connery, W. Paul Sherer, Gary Jaszewski, and James S. Binder. 1999. Offload of TCP segmentation to a smart adapter. US Patent 5,937,169 (August 1999).
[21] Tobias Flach, Pavlos Papageorge, Andreas Terzis, Luis Pedrosa, Yuchung Cheng, Tayeb Karim, Ethan Katz-Bassett, and Ramesh Govindan. 2016. An Internet-Wide Analysis of Traffic Policing. In SIGCOMM '16. 468-482.
[22] Yashar Ganjali and Nick McKeown. 2006. Update on Buffer Sizing in Internet Routers. SIGCOMM Comput. Commun. Rev. (2006), 67-70.
[23] Monia Ghobadi, Yuchung Cheng, Ankur Jain, and Matt Mathis. 2012. Trickle: Rate Limiting YouTube Video Streaming. In USENIX ATC '12. 191-196.
[24] Sangjin Han, Keon Jang, Aurojit Panda, Shoumik Palkar, Dongsu Han, and Sylvia Ratnasamy. 2015. SoftNIC: A Software NIC to Augment Hardware. Technical Report UCB/EECS-2015-155, EECS Department, University of California, Berkeley.
[25] Chi-Yao Hong, Srikanth Kandula, Ratul Mahajan, Ming Zhang, Vijay Gill, Mohan Nanduri, and Roger Wattenhofer. 2013. Achieving High Utilization with Software-driven WAN. SIGCOMM Comput. Commun. Rev. 43, 4 (2013), 15-26.
[26] Intel Networking Division. 2016. 82599 10 GbE Controller Datasheet.
[27] Intel Networking Division. 2016. Ethernet Switch FM10000.
[28] Sushant Jain, Alok Kumar, Subhasree Mandal, Joon Ong, Leon Poutievski, Arjun Singh, Subbaiah Venkata, Jim Wanderer, Junlan Zhou, Min Zhu, Jon Zolla, Urs Hölzle, Stephen Stuart, and Amin Vahdat. 2013. B4: Experience with a Globally-deployed Software Defined WAN. In SIGCOMM '13. 3-14.
[29] Keon Jang, Justine Sherry, Hitesh Ballani, and Toby Moncaster. 2015. Silo: Predictable Message Latency in the Cloud. In SIGCOMM '15. 435-448.
[30] Vimalkumar Jeyakumar, Mohammad Alizadeh, David Mazières, Balaji Prabhakar, Albert Greenberg, and Changhoon Kim. 2013. EyeQ: Practical Network Performance Isolation at the Edge. In NSDI '13. 297-311.
[31] Antoine Kaufmann, Simon Peter, Naveen Kr. Sharma, Thomas Anderson, and Arvind Krishnamurthy. 2016. High Performance Packet Processing with FlexNIC. In ASPLOS '16. 67-81.
[32] Alok Kumar, Sushant Jain, Uday Naik, Anand Raghuraman, Nikhil Kasinadhuni, Enrique Cauich Zermeno, C. Stephen Gunn, Jing Ai, Björn Carlin, Mihai Amarandei-Stavila, Mathieu Robin, Aspi Siganporia, Stephen Stuart, and Amin Vahdat. 2015. BwE: Flexible, Hierarchical Bandwidth Allocation for WAN Distributed Computing. In SIGCOMM '15. 1-14.
[33] Gautam Kumar, Srikanth Kandula, Peter Bodik, and Ishai Menache. 2013. Virtualizing Traffic Shapers for Practical Resource Allocation. In 5th USENIX Workshop on Hot Topics in Cloud Computing.
[34] Adam Langley, Alistair Riddoch, Alyssa Wilk, Antonio Vicente, Charles Krasic, Dan Zhang, Fan Yang, Fedor Kouranov, Ian Swett, Janardhan Iyengar, Jeff Bailey, Jeremy Dorfman, Jim Roskind, Joanna Kulik, Patrik Westin, Raman Tenneti, Robbie Shade, Ryan Hamilton, Victor Vasiliev, Wan-Teh Chang, and Zhongyi Shi. 2017. The QUIC Transport Protocol: Design and Internet-Scale Deployment. In SIGCOMM '17.
[35] Radhika Mittal, Rachit Agarwal, Sylvia Ratnasamy, and Scott Shenker. 2016. Universal Packet Scheduling. In NSDI '16. 501-521.
[36] Radhika Mittal, Vinh The Lam, Nandita Dukkipati, Emily Blem, Hassan Wassel, Monia Ghobadi, Amin Vahdat, Yaogong Wang, David Wetherall, and David Zats. 2015. TIMELY: RTT-based Congestion Control for the Datacenter. In SIGCOMM '15. 537-550.
[37] Lucian Popa, Praveen Yalagandula, Sujata Banerjee, Jeffrey C. Mogul, Yoshio Turner, and Jose Renato Santos. 2013. ElasticSwitch: Practical Work-conserving Bandwidth Guarantees for Cloud Computing. SIGCOMM Comput. Commun. Rev. 43, 4 (Aug. 2013), 351-362.
[38] Sivasankar Radhakrishnan, Yilong Geng, Vimalkumar Jeyakumar, Abdul Kabbani, George Porter, and Amin Vahdat. 2014. SENIC: Scalable NIC for End-Host Rate Limiting. In NSDI '14. 197-210.
[39] Barath Raghavan, Kashi Vishwanath, Sriram Ramabhadran, Kenneth Yocum, and Alex C. Snoeren. 2007. Cloud Control with Distributed Rate Limiting. In SIGCOMM '07. 337-348.
[40] Rusty Russell. 2008. virtio: towards a de-facto standard for virtual I/O devices. ACM SIGOPS Operating Systems Review 42, 5 (2008), 95-103.
[41] M. Shreedhar and George Varghese. 1995. Efficient Fair Queueing Using Deficit Round Robin. In SIGCOMM '95. 375-385.
[42] Anirudh Sivaraman, Suvinay Subramanian, Mohammad Alizadeh, Sharad Chole, Shang-Tse Chuang, Anurag Agrawal, Hari Balakrishnan, Tom Edsall, Sachin Katti, and Nick McKeown. 2016. Programmable Packet Scheduling at Line Rate. In SIGCOMM '16. 44-57.
[43] G. Varghese and T. Lauck. 1987. Hashed and Hierarchical Timing Wheels: Data Structures for the Efficient Implementation of a Timer Facility. In SOSP '87. 25-38.
[44] L. Zhang. 1990. Virtual Clock: A New Traffic Control Algorithm for Packet Switching Networks. In SIGCOMM '90. 19-29.