
Carousel: Scalable Traffic Shaping at End Hosts

Ahmed Saeed* (Georgia Institute of Technology), Nandita Dukkipati (Google, Inc.), Vytautas Valancius (Google, Inc.), Vinh The Lam (Google, Inc.), Carlo Contavalli (Google, Inc.), Amin Vahdat (Google, Inc.)

*Work done while at Google.

ABSTRACT

Traffic shaping, including pacing and rate limiting, is fundamental to the correct and efficient operation of both datacenter and wide area networks. Sample use cases include policy-based bandwidth allocation to flow aggregates, rate-based congestion control algorithms, and packet pacing to avoid bursty transmissions that can overwhelm router buffers. Driven by the need to scale to millions of flows and to apply complex policies, traffic shaping is moving from network switches into the end hosts, typically implemented in software in the kernel networking stack.

In this paper, we show that the performance overhead of end-host traffic shaping is substantial and limits overall system scalability as we move to thousands of individual traffic classes per server. Measurements from production servers show that shaping at hosts consumes considerable CPU and memory, unnecessarily drops packets, suffers from head of line blocking and inaccuracy, and does not provide backpressure up the stack. We present Carousel, a framework that scales to tens of thousands of policies and flows per server, built from the synthesis of three key ideas: i) a single queue shaper using time as the basis for releasing packets, ii) fine-grained, just-in-time freeing of resources in higher layers coupled to actual packet departures, and iii) one shaper per CPU core, with lock-free coordination. Our production experience in serving video traffic at a Cloud service provider shows that Carousel shapes traffic accurately while improving overall machine CPU utilization by 8% (an improvement of 20% in the CPU utilization attributed to networking) relative to state-of-art deployments. It also conforms 10 times more accurately to target rates, and consumes two orders of magnitude less memory than existing approaches.

CCS CONCEPTS

• Networks → Packet scheduling;

KEYWORDS

Traffic shaping, Rate-limiters, Pacing, Timing Wheel, Backpressure

ACM Reference format:
Ahmed Saeed, Nandita Dukkipati, Vytautas Valancius, Vinh The Lam, Carlo Contavalli, and Amin Vahdat. 2017. Carousel: Scalable Traffic Shaping at End Hosts. In Proceedings of SIGCOMM '17, Los Angeles, CA, USA, August 21-25, 2017, 14 pages. DOI: 10.1145/3098822.3098852

1 INTRODUCTION

Network bandwidth, especially across the WAN, is a constrained resource that is expensive to overprovision [25, 28, 32]. This creates an incentive to shape traffic based on the priority of the application during times of congestion, according to operator policy. Further, as networks run at higher levels of utilization, accurate shaping to a target rate is increasingly important to efficient network operation. Bursty transmissions that deviate from a flow's target rate can lead to: i) packet loss, ii) less accurate bandwidth calculations for competing flows, and iii) increasing round trip times. Packet loss reduces goodput and confuses transport protocols attempting to disambiguate fair-share available-capacity signals from bursty traffic sources. One could argue that deep buffers are the solution, but we find that the resulting increased latency leads to poor experience for users. Worse, high latency reduces application performance in common cases where compute is blocked on, for example, the completion of an RPC. Similarly, the performance of consistent storage systems is dependent on network round trip times.

In this paper, we use traffic shaping broadly to refer to either pacing or rate limiting, where pacing refers to injecting inter-packet gaps to smooth traffic within a single connection, and rate limiting refers to enforcing a rate on a flow-aggregate consisting of one or more individual connections.

While traffic shaping has historically targeted wide area networks, two recent trends bring it to the forefront for datacenter communications, which use end-host based shaping. The first trend is the use of fine grained pacing by rate-based congestion control algorithms such as BBR [17] and TIMELY [36]. BBR and TIMELY's use of rate control is motivated by studies that show pacing flows can reduce packet drops for video-serving traffic [23] and for incast communication patterns [36]. Incast arises from common datacenter applications that simultaneously communicate with thousands of servers. Even if receiver bandwidth is allocated perfectly among senders at coarse granularity, simultaneous bursting to NIC line rate from even a small subset of senders can overwhelm the receiver's network capacity. Furthermore, traffic from end hosts is increasingly bursty, due to heavy batching and aggregation. Techniques such as NIC offloads optimize for server CPU efficiency — e.g., Transmission Segmentation Offload [20] and Generic Receive Offload [3]. Pacing reduces burstiness, which in turn reduces packet drops at shallow-buffered switches [22] and improves network utilization [14]. It is for these reasons that BBR and TIMELY rely on pacing packets as a key technique for precise rate control of thousands of flows per server.

The second trend is the need for network traffic isolation across competing applications or Virtual Machines (VMs) in the Cloud. The emergence of Cloud Computing means that individual servers may host hundreds or perhaps even thousands of VMs. Each virtual endpoint can in turn communicate with thousands of remote VMs, internal services within and between datacenters, and the Internet at large, resulting in millions of flows with overlapping bottlenecks and bandwidth allocation policies. Providers use predictable and scalable bandwidth arbitration systems to assign rates to flow aggregates, which are then enforced at end systems. Such an arbitration can be performed by centralized entities, e.g., Bandwidth Enforcer [32] and SWAN [25], or distributed systems, e.g., EyeQ [30].

For example, consider 500 VMs on a host providing Cloud services, where each VM communicates on average with 50 other virtual endpoints. The provider, with no help from the guest operating system, must isolate these 25K VM-to-endpoint flows, with bandwidth allocated according to some policy. Otherwise, inadequate network isolation increases performance unpredictability and can make Cloud services unavailable [11, 30].

Traditionally, network switches and routers have implemented traffic shaping. However, inside a datacenter, shaping in middleboxes is not an easy option. It is expensive in buffering and latency, and middleboxes lack the necessary state to enforce the right rate. Moreover, shaping in the middle does not help when bottlenecks are at network edges, such as the host network stack, hypervisor, or NIC. Finally, when packet buffering is necessary, it is much cheaper and more scalable to buffer near the application.

Therefore, the need to scale to millions of flows per server while applying complex policy means that traffic shaping must largely be implemented in end hosts. The efficiency and effectiveness of this end-host traffic shaping is increasingly critical to the operation of both datacenter and wide area networks. Unfortunately, existing implementations were built for very different requirements, e.g., only for WAN flows, a small number of traffic classes, modest accuracy requirements, and simple policies. Through measurement of video traffic in a large-scale Cloud provider, we show that the performance of end-host traffic shaping is a primary impediment to scaling a virtualized network infrastructure. Existing end-host rate limiting consumes substantial CPU and memory; e.g., shaping in the Linux networking stack uses 10% of server CPU to perform pacing (§3); shaping in the hypervisor unnecessarily drops packets, suffers from head of line blocking and inaccuracy, and does not provide backpressure up the stack (§3).

We present the design and implementation of Carousel, an improvement on existing, kernel-based traffic shaping mechanisms. Carousel scales to tens of thousands of flows and traffic classes, and supports complex bandwidth-allocation mechanisms for both WAN and datacenter communications. We design Carousel around three core techniques: i) a single queue shaper using time as the basis for scheduling, specifically, the Timing Wheel [43], ii) coupling actual packet transmission to the freeing of resources in higher layers, and iii) each shaper runs on a separate CPU core, which could be as few as one CPU. We use lock-free coordination across cores. Our production experience in serving video traffic at a Cloud service provider shows that Carousel shapes traffic effectively while improving the overall machine CPU utilization by 8% relative to state-of-art implementations, and with an improvement of 20% in the CPU utilization attributed to networking. It also conforms 10 times more accurately to target rates, and consumes two orders of magnitude less memory than existing approaches. While we implement Carousel in a software NIC, it is designed for hardware offload. By the novel combination of previously-known techniques, Carousel makes a significant advance on the state-of-the-art shapers in terms of efficiency, accuracy and scalability.

2 TRAFFIC SHAPERS IN PRACTICE

In the previous section we established the need for at-scale shaping at end hosts: first, modern congestion control algorithms, such as BBR and TIMELY, use pacing to smooth bursty video traffic, and to handle large-scale incasts; second, traffic isolation in Cloud is critically dependent on efficient shapers. Before presenting details of Carousel, we first present an overview of the shapers prevalently used in practice.

Nearly all of rate limiting at end hosts is performed in software. Figure 1 shows the typical rate limiter architecture, which we broadly term as pre-filtering with multiple token bucket queues. It relies on a classifier, multiple queues, token bucket shapers and/or a scheduler processing packets from each queue. The classifier divides packets into different queues, each queue representing a different traffic aggregate.¹ A queue has an associated traffic shaper that paces packets from that queue as necessary. Traffic shapers are synonymous with token buckets or one of its variants. A scheduler services queues in round-robin order or per service-class priorities.

Figure 1: Token Bucket Architecture: pre-filtering with multiple token bucket queues. Packets from socket buffers in the host OS or guest VM pass through a classifier into multi-queue token buckets (e.g., Rate1, Rate2, Rate3) that rate limit flow aggregates; a scheduler drains the queues toward the NIC.

This design avoids head of line blocking, through a separate queue per traffic aggregate: when a token bucket delays a packet, all packets in the same queue will be delayed, but not packets in other queues. Other mechanisms, such as TCP Small Queues (TSQ) [4] in Linux, provide backpressure and reduce drops in the network stack.

Below, we describe three commonly used shapers in end-host stacks, hypervisors and NICs: 1) Policers; 2) Hierarchical Token Bucket; and 3) FQ/pacing. They are all based on per-traffic-aggregate queues and a token bucket associated with each queue for rate limiting.

2.1 Policers

Policers are the simplest way to enforce rates. Policer implementations use token buckets with zero buffering per queue. A token bucket policer consists of a counter representing the number of currently available tokens. Sending a packet requires checking the request against the available tokens in the queue. The policer drops the packet if too few tokens are available. Otherwise, the policer forwards the packet, reducing the available tokens accordingly. The counter is refilled according to a target rate and capped by a maximum burst. The rate represents the average target rate, and the burst size represents the tolerance for jitter, or for short-term deviation from the target rate conformance.

¹Traffic aggregate refers to a collection of individual TCP flows being grouped on a shaping policy.
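To make the policer behavior described above concrete, the following is a minimal sketch (not the production implementation; the class and parameter names such as TokenBucketPolicer are illustrative): tokens refill at the target rate, are capped by the burst size, and a packet is forwarded only if enough tokens are available, otherwise it is dropped without queueing.

```cpp
#include <algorithm>
#include <cstdint>

// Minimal token-bucket policer sketch: refill at `rate_bps`, cap at
// `burst_bytes`, and drop packets that find too few tokens. All names
// and the nanosecond clock parameter are illustrative assumptions.
class TokenBucketPolicer {
 public:
  TokenBucketPolicer(double rate_bps, double burst_bytes)
      : rate_Bps_(rate_bps / 8.0), burst_bytes_(burst_bytes),
        tokens_(burst_bytes) {}

  // Returns true if the packet may be forwarded, false if it is dropped.
  bool Allow(uint32_t packet_bytes, uint64_t now_ns) {
    Refill(now_ns);
    if (tokens_ < packet_bytes) return false;  // policer drops: no buffering
    tokens_ -= packet_bytes;
    return true;
  }

 private:
  void Refill(uint64_t now_ns) {
    if (last_refill_ns_ == 0) { last_refill_ns_ = now_ns; return; }
    double elapsed_s = (now_ns - last_refill_ns_) * 1e-9;
    tokens_ = std::min(burst_bytes_, tokens_ + elapsed_s * rate_Bps_);
    last_refill_ns_ = now_ns;
  }

  const double rate_Bps_;     // average target rate, in bytes per second
  const double burst_bytes_;  // tolerance for short-term deviation
  double tokens_;
  uint64_t last_refill_ns_ = 0;
};
```

As the text notes, the rate sets the long-term average while the burst bounds how far a flow can deviate from it over short intervals.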

Figure 2: Architecture of Linux rate limiters. (a) HTB: packets traverse the HTB tree from the root through classes (e.g., C1, C2, C3 and children C1:1, C1:2, C1:3); internal nodes handle traffic accounting, and leaf nodes shape traffic with token buckets. (b) FQ/pacing: flow state lookup uses a hashtable of red-black trees on flow ids; paced flows carry a time_next_packet timestamp (e.g., t=now+5), flow delays are kept in a red-black tree on timestamps, and a scheduler polls active queues to add packets to the NIC queue.

Figure 3: Policing performance for target throughput 1Gbps with different burst sizes (throughput in Mbps versus RTT in ms). Policers exhibit poor rate conformance, especially at high RTT. The different lines in the plot correspond to the configured burst size in the token bucket (1ms, 10ms, 100ms, and 1000ms bursts).

2.2 Hierarchical Token Bucket (HTB)

Unlike Policers that have zero queueing, traffic shapers buffer packets waiting for tokens, rather than drop them. In practice, related token buckets are grouped in a complex structure to support advanced traffic management schemes. Hierarchical Token Bucket (HTB) [2] in the Linux kernel Queueing Discipline (Qdisc) uses a tree organization of shapers as shown in Figure 2a. HTB classifies incoming packets into one of several traffic classes, each associated with a token bucket shaper at a leaf in the tree. The hierarchy in HTB is for borrowing between leaf nodes with a common parent to allow for work-conserving scheduling.

The host enforcer in the Bandwidth Enforcer (BwE) [32] system is a large-scale deployment of HTB. Each HTB leaf class has a one-to-one mapping to a specific BwE task flow group – typically an aggregate of multiple TCP flows. The number of HTB leaf classes is directly proportional to the number of QoS classes × destination clusters × the number of tasks.

To avoid unbounded queue growth, shapers should be used with a mechanism to backpressure the traffic source. The simplest form of backpressure is to drop packets; however, drops are also a coarse signal to a transport like TCP. Linux employs more fine grained backpressure. For example, the HTB Qdisc leverages TCP Small Queues (TSQ) [4] to limit two outstanding packets for a single TCP flow within the TCP/IP stack, waiting for actual NIC transmission before enqueuing further packets from the same flow.² In the presence of flow aggregates, the number of enqueued packets equals the number of individual TCP flows × TSQ limit. If HTB queues are full, subsequent arriving packets will be dropped.

²The limit is controlled via a knob whose default value of 128KB yields two full-sized 64KB TSO segments.

2.3 FQ/pacing

FQ/pacing [5] is a Linux Qdisc used for packet pacing along with fair queueing for egress TCP and UDP flows. Figure 2b shows the structure of FQ/pacing. The FQ scheduler tracks per-flow state in an array of Red-Black (RB) trees indexed on flow hash IDs. A deficit round robin (DRR) scheduler [41] fetches outgoing packets from the active flows. A garbage collector deletes inactive flows.

Maintaining per-flow state provides the opportunity to support egress pacing of the active flows. In particular, a TCP connection sends at a rate approximately equal to cwnd/RTT, where cwnd is the congestion window and RTT is the round-trip delay. Since cwnd and RTT are only best-effort estimates, the Linux TCP stack conservatively sets the flow pacing rate to 2 × cwnd/RTT [6]. FQ/pacing enforces the pacing rate via a leaky bucket queue [8], where sending timestamps are computed from packet length and current pacing rate. Flows with next-packet timestamps far into the future are kept in a separate RB tree indexed on those timestamps. We note that pacing is enforced at the granularity of TCP Segmentation Offload (TSO) segments.

The Linux TCP stack further employs the pacing rate to automatically size packets meant for TCP Segmentation Offloading. The goal is to have at least one TSO packet every 1ms, to trigger more frequent packet sends, and hence better acknowledgment clocking and fewer microbursts for flows with low rates. TSO autosizing, FQ, and pacing together achieve traffic shaping in the Linux stack [18].
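As a rough illustration of the leaky-bucket pacing just described — a sketch under the stated 2 × cwnd/RTT heuristic, not the FQ Qdisc's actual code, and with invented names such as PacedFlow and ScheduleSend — the per-flow send time advances by packet length divided by the current pacing rate:

```cpp
#include <cstdint>

// Sketch of per-flow pacing-timestamp bookkeeping in the spirit of
// FQ/pacing: the next send time advances by len/pacing_rate for each
// (TSO-sized) packet. Field and function names are illustrative.
struct PacedFlow {
  double pacing_rate_Bps = 1.0;  // e.g., 2 * cwnd / RTT, in bytes/sec (>0)
  uint64_t next_send_ns = 0;     // "time_next_packet" in FQ terminology
};

// Conservative pacing-rate estimate in the spirit of the Linux TCP stack.
inline double PacingRateBps(double cwnd_bytes, double rtt_s) {
  return 2.0 * cwnd_bytes / rtt_s;
}

// Returns the earliest time this packet may leave and advances the flow's
// schedule; packets whose send time lies in the future would be parked in
// the timestamp-indexed RB tree by FQ.
inline uint64_t ScheduleSend(PacedFlow& f, uint32_t len_bytes,
                             uint64_t now_ns) {
  uint64_t gap_ns =
      static_cast<uint64_t>(len_bytes / f.pacing_rate_Bps * 1e9);
  uint64_t send_at = (f.next_send_ns > now_ns) ? f.next_send_ns : now_ns;
  f.next_send_ns = send_at + gap_ns;  // leaky-bucket style advance
  return send_at;
}
```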
In practice, FQ/pacing and HTB are used either individually or in tandem, each for its specific purpose. HTB rate limits flow aggregates, but scales poorly with the number of rate limited aggregates and the packet rate (§3). FQ/pacing is used for pacing individual TCP connections, but this solution does not support flow aggregates. We use Policers in hypervisor switch deployments where HTB or FQ/pacing are too CPU intensive to be useful.

3 THE COST OF SHAPING

Traffic shaping is a requirement for the efficient and correct operation of production networks, but uses CPU cycles and memory on datacenter servers that can otherwise be used for applications. In this section, we quantify these tradeoffs at scale in a large Cloud service.

Policers: Among the three shapers, Policers are the simplest and cheapest. They have low memory overhead since they are bufferless, and they have small CPU overhead because there is no need to schedule or manage queues. However, Policers have poor conformance to the target rate.

Figure 4: Impact of FQ/pacing on a popular video service. Retransmission rate on paced connections is 40% lower than that on non-paced connections (lower plot). Pacing consumes 10% of total machine CPU at the median and the tail (top plot). The retransmission rates and CPU usage are recorded over 30 second time intervals over the experiment period.

Figure 5: Pacing impact on a process generating QUIC traffic.

Figure 6: TCP-RR rate with and without HTB. HTB is 30% lower due to locking overhead.

Figure 3 shows that as round-trip time (RTT) increases, Policers deviate by 10x from the target rate, for two reasons: first, by dropping non-conformant packets, Policers lose the opportunity to schedule them at a future time, and resort to emulating a target rate with on/off behavior. Second, Policers trigger poor behavior from TCP, where even modest packet loss leads to low throughput [21] and wasted upstream work.

To operate over a large bandwidth-delay product, Policers require a large configured burst size, e.g., on the order of the flow round-trip time. However, large bursts are undesirable, due to poor rate conformance at small time scales, and tail dropping at downstream switches with shallow buffers. Even with a burst size as large as one second, Figure 3 shows that the achieved rate can be 5x below the target rate, for an RTT of 100ms using TCP CUBIC.

Pacing: FQ/pacing provides good rate conformance, deviating at most 6% from the target rate (§7), but at the expense of CPU cost. To understand the impact on packet loss and CPU usage, we ran an experiment on production video servers where we turn off FQ/pacing on 10 servers, each serving 37Gbps at peak across tens of thousands of flows. We instead configured a simple PFIFO queueing discipline typically used as the default low overhead Qdisc [1]. We ensured that the experiment and baseline servers have the same machine configuration, are subject to similar traffic load, including being under the same load balancers, and ran the experiments simultaneously for one week.

Figure 4 compares the retransmission rate and CPU utilization with and without FQ/pacing (10 servers in each setting). The upside of pacing is that the retransmission rate on paced connections is 40% lower than that on non-paced connections. The loss-rate reduction comes from reducing self-inflicted losses, e.g., from TSO or large congestion windows. The flip side is that pacing consumes 10% of total CPU on the machine at the median and the tail (90th and 99th percentiles); e.g., on a 64 core server, we could save up to 6 cores through more efficient rate limiting.

To demonstrate that the cost of pacing is not restricted to the FQ/pacing Qdisc or its specific implementation, we conduct an experiment with QUIC [34] running over UDP traffic. QUIC is a transport protocol designed to improve performance for HTTPS traffic. It is globally deployed at Google on thousands of servers and is used to serve YouTube video traffic. QUIC runs in user space with a packet pacing implementation that performs MTU-sized pacing independent from the kernel's FQ/pacing. We experimented with QUIC's pacing turned ON and OFF. We obtained similar results with QUIC/UDP experiments as we obtained with the kernel's FQ/pacing. Figure 5 shows the CPU consumption of a process generating approximately 8Gbps of QUIC traffic at peak. With pacing enabled, process CPU utilization at the 99th percentile jumped from 0.15 to 0.2 of machine CPU – a 30% increase, due to arming and serving timers for each QUIC session. Pacing lowered retransmission rates from 2.6% to 1.6%.

HTB: Like FQ/pacing, HTB can also provide good quality rate conformance, with a deviation of 5% from target rate (§7), albeit at the cost of high CPU consumption. CPU usage of HTB grows linearly with the packets per second (PPS) rate. Figure 6 shows an experiment with Netperf TCP-RR, request-response ping-pong traffic of 1 byte in each direction, running with and without HTB. This experiment measures the overhead of HTB's shaping architecture. We chose the 1-byte TCP-RR because it mostly avoids overheads other than those specifically related to packets per second. In production workloads, it is desirable for shapers (and more generally networking stacks) to achieve high PPS rates while using minimal CPU. As the offered PPS increases in Figure 6 (via the increase in the number of TCP-RR connections on the x-axis), HTB saturates the machine CPU at 600K transactions/sec. Without HTB, the saturation rate is 33% higher at 800K transactions/sec.

The main reason for HTB's CPU overhead is the global Qdisc lock acquired on every packet enqueue. Contention for the lock grows with PPS. [15] details the locking overhead in Linux Qdisc, and the increasing problems posed as links scale to 100Gbps. Figure 7 shows HTB statistics from one of our busiest production clusters over a 24-hour period. Acquiring locks in HTB can take up to 1s at the 99th percentile, directly impacting CPU usage, rate conformance, and tail latency. The number of HTB classes at any instant on a server is tens of thousands at the 90th percentile, with 1000-2000 actively rate limited.




CPU/Memory Ideal 0 0 Number of classes Low 0 4 8 12 16 20 0 4 8 12 16 20 Number of active flows Low Rate conformance High Time (hours) Time (hours) Figure É: Policers have low CPU usage but poor rate conformance. HTB (a) Total number of HTB classes and active žows and FQ/pacing have good shaping properties but a high CPU cost. 1500 99%-ile 95%-ile median mean 1200 900 packets within a queue may be dropped even while there is plenty of 600 room in the global pool. 300 Summary: e problem with HTB and FQ/pacing is more than 0

Lock Time Per Host (ms) 0 4 8 12 16 20 just accuracy or timer granularities. e CPU ine›ciency of HTB Time (hours) and FQ/pacing is intrinsic to these mechanisms, and not the result of (b) HTB lock acquire time poor implementations. e architecture of these shapers is based on synchronization across multiple queues and cores, which increases Figure ß: HTB statistics over one day from a busy production cluster. the CPU cost: Õ) e token bucket architecture of FQ/pacing and HTB necessi- 1 0.8 tates multiple queues, with one queue per rate limit to avoid head 0.6 of line blocking. e cost of maintaining these queues grows, some-

CDF 0.4 times super-linearly, with the number of such queues. CPU costs 0.2 stem from multiple factors, especially the need to poll queues for 0 10000 12000 14000 16000 18000 20000 22000 packets to process. Commonly, schedulers either use simple round- Number of packets robin algorithms, where the cost of polling is linear in the number of queues, or more complex algorithms tracking active queues. ó) Synchronization across multiple cores: Additionally, the cost Figure ˜: Bušered packets of a VM at a shaper in hypervisor. e bump on multi-CPU systems is dominated by locking and/or contention in the curve is because there are two internal limits on the number of overhead when sharing queues and associated rate limiters between bušered packets: ÕäK is a local per-VM limit, and óÕK is at the global level for all VMs. ere are packet drops when the ÕäK limit is exceeded, which CPUs. HTB and FQ/pacing acquire a global lock on a per-packet impacts the slope of the curve. basis, whose contention gets worse as the number of CPUs increases. Solving these problems requires a fundamentally dišerent ap- Memory: ere is also a memory cost related to tra›c shaping: the proach, such as the one we describe in the following sections. Today, memory required to queue packets at the shaper, and the memory to practitioners have a di›cult choice amongst rate limiters: Policers maintain the data structures implementing the shaper. Without back- have high CPU/memory e›ciency but unacceptable shaping perfor- pressure to higher level transports like TCP, the memory required to mance. HTB and FQ/pacing have good shaping properties, but fairly queue packets grows with the number of žows × congestion window, high CPU and memory costs. Figure É summarizes this tradeoš. ÕþÛ and reaches a point where packets need to be dropped. Figure ˜ shows CPU overhead from accurate shaping can be worth the additional the number of packets at a rate limiter in the absence of backpressure network e›ciency, especially for WAN transfers. Now, the question for a Cloud VM that has a single žow being shaped at óGbps over becomes whether we can get the same capability with substantially th less CPU overhead. a ¢þms RTT path. In the ÉÉ percentile, the queue of the shaper has óÕ thousand MTU size packets (or ìó MB of memory) because of CUBIC TCP’s large congestion window, adding Õóþms of latency ¦ CAROUSEL DESIGN PRINCIPLES from the additional bušering. In the presence of backpressure (§¢), Our work begins with the observation that a unied, accurate, and there will be at most two TSO size worth of packets or ˜¢ MTU sized CPU-e›cient tra›c shaping mechanism can be used in a variety of packetsì (or Õó˜KB of memory), i.e., two orders of magnitude less important settings. Our design principles follow naturally from the memory. following requirements: A second memory cost results from maintaining the shaper’s data Õ) Work compatibly with higher level congestion control mechanisms structures. e challenge is partitioning and constraining resources such as TCP. such that there are no drops during normal operation; e.g., queues a)Pace packets correctly: avoid bursts and unnecessary delays. in HTB classes can be congured to grow up to ÕäK packets, and b) Provide backpressure and avoid packet drops: delaying pack- FQ/pacing has a global limit of ÕþK packets and a per queue limit ets because of rate limiting should quickly result in slowing down of Õþþþ packets, aŸer which the packets are dropped even if there is the end application. 
Each additional packet generated by the application will otherwise need to be buffered or dropped in the shaper, wasting memory and CPU to either queue or regenerate the packet. Additionally, loss-based congestion control algorithms often react to packet loss by significantly slowing down the corresponding flows, requiring more time to reach the original bandwidth.

³We assume a maximum TSO size of 64KB and an MTU size of 1.5KB.

c) Avoid head of line blocking: shaping packets belonging to a traffic aggregate should not delay packets from other aggregates.

2) Use CPU and memory efficiently: Rate limiting should consume a very small portion of the server CPU and memory. As implied by §3, this means avoiding the processing overhead of multiple queues, as well as the associated synchronization costs across CPUs in multi-core systems.

We find that these requirements are satisfied by a shaper based on three simple tenets:

1. Single Queue Shaping (§5.1): Relying on a single queue to shape packets alleviates the inefficiency and overhead of multi-queue systems which use token buckets, because these require a queue per rate limit. We employ a single queue indexed by time, e.g., a Calendar Queue [16] or Timing Wheel [43]. We insert all packets from different flows into the same queue and extract them based on their transmission schedule. We set a send-time timestamp for each packet, based on a combination of its flow's rate limit, pacing rate, and bandwidth sharing policy.

2. Deferred Completions (§5.2): The sender must limit the number of packets in flight per flow, potentially using existing mechanisms such as TCP Small Queues or the congestion window. Consequently, there must be a mechanism for the network stack to ask the sender to queue more packets for a specific flow, e.g., completions or acknowledgments. With appropriate backpressure such as the Deferred Completions described in §5.2, we reduce both HoL blocking and memory pressure. Deferred Completions is a generalization of the layer 4 mechanism of TCP Small Queues to a more general layer 3 mechanism in a hypervisor/container software switch.

3. Silos of one shaper per core (§5.3): Siloing the single queue shaper to a core alleviates the CPU inefficiency that results from locking and synchronization. We scale to multi-core systems by using independent single-queue shapers, one shaper per CPU. Note that a system can have as few as one shaper assigned to a specific CPU, i.e., not every CPU requires a shaper. To implement shared rates across cores we use a re-balancing algorithm that periodically redistributes the rates across CPUs.

Figure 10: Carousel architecture. Packets from socket buffers in the host OS or guest VM are timestamped based on policy and rate, held in a time-based queue ordered on those timestamps, and released to the NIC; completions are reported back to the source.

Figure 10 illustrates the architecture using the Timing Wheel and Deferred Completions. Deferred Completions addresses the first of the requirements on working compatibly with congestion control. Deferred Completions allows the shaper to slow down the source, and hence avoids unnecessary drops and mitigates head of line blocking. Additionally, Deferred Completions also allows the shaper to buffer fewer packets per flow, reducing memory requirements. The use of a single Timing Wheel per core makes the system CPU-efficient, thus addressing the second requirement.

Carousel is more accurate and CPU-efficient than state-of-art token-bucket shapers. Carousel's architecture of a single queue per core, coupled with Deferred Completions, can be implemented anywhere below the transport, in either software or NIC hardware.

To demonstrate the benefits of our approach, we explore shaping egress traffic in §5, using Deferred Completions for backpressure, and shaping ingress incast traffic at the receiver in §6, using ACK deferral for backpressure.

5 THE CAROUSEL SYSTEM

Carousel shaping consists of three sequential steps for each packet. First, the network stack computes a timestamp for the packet. Second, the stack enqueues the packet in a queue indexed by these timestamps. Finally, when the packet's transmission deadline passes, the stack dequeues the packet and delivers a completion event, which can then trigger the transmission of more packets.

We implement Carousel in a software NIC in our servers. Software NICs dedicate one or more CPU cores for packet processing, as in SoftNIC and FlexNIC [24, 31]. Our software NIC operates in busy-wait polling mode. Polling allows us to check packet timestamps at high frequency, enabling accurate conformance with the packets' schedules. Polling also assists in CPU-efficient shaping, by alleviating the need for locking, and for timers for releasing packets.

We note that Software NICs are not a requirement for Carousel, and it is possible to implement Carousel in the kernel. However, an implementation in the kernel would not be a simple modification of one of the existing Queuing Disciplines [10], and would require re-engineering the kernel's Qdisc path. This is because the kernel is structured around parallelism at the packet level, with per-packet lock acquisitions, which does not scale efficiently to high packet rates. Our approach is to replicate the shaping-related data structure per core, and to achieve parallelism at the level of flows, not at the level of a shared shaper across cores. Since Software NICs already use dedicated cores, it was easy to rapidly develop and test a new shaper like Carousel, which is a testament to their value.
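The three sequential steps above can be summarized in a short skeleton. This is only a sketch of the control flow under assumed names (CarouselShaper, Enqueue, Poll); the production system runs inside the busy-polling software NIC and uses the O(1) Timing Wheel described next rather than the ordered map used here as a stand-in.

```cpp
#include <cstdint>
#include <map>

// Skeleton of the per-packet Carousel path from Section 5:
// (1) compute a timestamp, (2) enqueue into a time-indexed queue,
// (3) on the busy-poll loop, release due packets and report completions.
struct Packet { uint64_t release_ns = 0; uint32_t len = 0; };

class CarouselShaper {
 public:
  // Steps 1 and 2: record the computed timestamp and index the packet by it.
  void Enqueue(Packet p, uint64_t timestamp_ns) {
    p.release_ns = timestamp_ns;
    queue_.emplace(timestamp_ns, p);  // stand-in for the Timing Wheel
  }

  // Step 3: called from the busy-wait polling loop of the software NIC.
  // Releases every packet whose deadline has passed and reports a
  // completion so the source may enqueue more packets.
  template <typename TxFn, typename CompletionFn>
  void Poll(uint64_t now_ns, TxFn transmit, CompletionFn report_completion) {
    while (!queue_.empty() && queue_.begin()->first <= now_ns) {
      Packet p = queue_.begin()->second;
      queue_.erase(queue_.begin());
      transmit(p);            // hand the packet to the wire
      report_completion(p);   // Deferred Completion back to the stack
    }
  }

 private:
  // Ordered by release time; the real design replaces this with an
  // O(1) time-slotted Timing Wheel (Section 5.1.2).
  std::multimap<uint64_t, Packet> queue_;
};
```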
5.1 Single Queue Shaping

We examine the two steps of emulating the behavior of a multi-queue architecture using a single time-indexed queue while improving CPU efficiency.

5.1.1 Timestamp Calculation. Timestamp calculation is the first step of traffic shaping in Carousel. The timestamp of a packet determines its release time, i.e., the time at which the packet should be transmitted to the wire. A single packet can be controlled by multiple shaping policies, e.g., transport protocol pacing, a per-flow rate limit, and an aggregate limit for the flow's destination. In typical settings, these policies are enforced independently, each at the point where the policy is implemented in the networking stack. Carousel requires that at each point a packet is timestamped with a transmission time according to the rate at that point, rather than enforcing the rate there. Hence, these policies are applied sequentially as the packet trickles down the layers of the networking stack. For example, the TCP stack first timestamps each packet with a release time to introduce inter-packet gaps within a connection (pacing); subsequently, different rate limiters in the packet's path modify the packet's timestamp to conform to one or more rates corresponding to different traffic aggregates.

The timestamping algorithm operates in two steps: 1) timestamp calculation based on individual policies, such as pacing and rate limiting, and 2) consolidation of the different timestamps, so that the packet's final transmission time adheres to the overall desired behavior. The packet's final timestamp can be viewed as the Earliest Release Time (ERT) of the packet to the wire.

Setting Timestamps: Both pacing and rate limiting use the same technique for calculating timestamps. Timestamps can be set either as relative timestamps, i.e., transmit after a certain interval relative to now, or as absolute wall time timestamps, i.e., transmit at a certain time. In our implementation, we use the latter. Both pacing and rate limiting force a packet to adhere to a certain rate: 2 × cwnd/RTT for pacing, and a target rate for rate limiting. In our implementation, the timestamp for pacing is set by the sender's TCP, and that for the target rate is set by a module in the Software NIC, which receives external rates from centralized entities. Policy i enforces a rate R_i and keeps track of the latest timestamp it creates, LTS_i. The timestamp for the j-th packet P_j^i going through the i-th policy is then calculated as: LTS'_i = LTS_i + len(P_j^i) / R_i. Timestamps are thus advanced on every packet. Note that packets going through a policy can belong to the same flow, e.g., in the case of pacing, or to multiple flows, e.g., in the case of rate limiting flow aggregates. Packets that are neither paced nor rate limited by any policy carry a timestamp of zero.

Consolidating Timestamps: Consolidation is straightforward, since larger timestamps represent smaller target rates. In the course of a packet's traversal through multiple policies, we choose the largest of the timestamps, which avoids violating any of the policies associated with the packet. Note that the effective final calculation of a flow's rate is equivalent to a series of token buckets, as the slowest token bucket ends up dominating and slowing a flow to its rate. We also note that Carousel is not a generic scheduler. Hence, this consolidation approach will not work for all scheduling policies (further discussed in Section 8).

5.1.2 Single Time-indexed Queue. One of the main components of Carousel is the Timing Wheel [43], an efficient queue slotted by time and a special case of the more generic Calendar Queue [16]. The key difference is that the Calendar Queue has O(1) amortized insertion and extraction but can be O(N) in skewed cases. The constant for O(1) in the Timing Wheel is smaller than that of the Calendar Queue. A Calendar Queue would provide exact packet ordering based on timestamps even within a time slot, a property that we don't strictly require for shaping purposes.

A Timing Wheel is an array of lists where each entry is a timestamped packet. The array represents the time slots from now till the preconfigured time horizon, where each slot represents a certain time range of size g_min within the overall horizon, i.e., the minimum time granularity. The array is a circular representation of time: once a slot becomes older than now and all its elements are dequeued, the slot is updated to represent now + horizon. The number of slots in the array is calculated as horizon/g_min. Furthermore, we allow extracting packets every g_min. Within the g_min period, packets are extracted in FIFO order. The Timing Wheel is suitable as the single time-indexed queue operated by Carousel. We seek to achieve line rate for the aggregate of tens of thousands of flows. Carousel relies on a single queue to maintain line rate, i.e., supporting tens of thousands of entries with minimal enqueue and dequeue overhead. The Timing Wheel's O(1) operations make it a perfect design choice.

With the Timing Wheel, Carousel keeps track of a variable now representing the current time. Packets with timestamps older than now do not need to be queued and should be sent immediately. Furthermore, Carousel allows for configuring a horizon, the maximum interval between now and the furthermost time at which a packet can be queued. In particular, if r_min represents the minimum supported rate, and l_max is the maximum number of packets in the queue belonging to the same aggregate that is limited to r_min, then the following relation holds: horizon = l_max / r_min seconds. As Carousel is implemented within a busy polling system, the Software NIC can visit Carousel with a maximum frequency of f_max, due to having to execute other functions associated with a NIC. This means that it is unnecessary to have time slots in the queue representing a range of time smaller than 1/f_max, as this granularity will not be supported. Hence, the minimum granularity of time slots is g_min = 1/f_max.

An example configuration of the Timing Wheel is for a minimum rate of 1.5Mbps for a 1500B packet size, slot granularity of 8µs, and a time horizon of 4 seconds. This configuration yields 500K slots, calculated as horizon/g_min. The minimum rate limit or pacing rate supported is one packet sent per time horizon, which is 1.5Mbps for a 1500B packet size. The maximum supported per-flow rate is 1.5Gbps. This cap on the maximum rate is due to the minimum supported gap between two packets of 8µs and the packet size of 1500B. Higher rates can be supported by either increasing the size of packets scheduled within a slot (e.g., scheduling a 4KB packet per slot increases the maximum supported rate to 4Gbps), or by decreasing the slot granularity.
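The two-step timestamping of §5.1.1 and the slot mapping of §5.1.2 can be sketched as follows. This is an illustration under assumed units (nanoseconds, bytes per second) and invented names (ShapingPolicy, EarliestReleaseTime, SlotIndex); it mirrors the LTS'_i = LTS_i + len/R_i update and the max-based consolidation, not the production code.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Each policy advances its own latest timestamp by len/R_i; the packet's
// final Earliest Release Time is the maximum across all applicable
// policies, so the slowest policy dominates.
struct ShapingPolicy {
  double rate_Bps;          // R_i: pacing rate or target rate (bytes/sec)
  uint64_t last_ts_ns = 0;  // LTS_i: latest timestamp this policy issued

  uint64_t NextTimestamp(uint32_t len_bytes) {
    // LTS'_i = LTS_i + len(P)/R_i; a result in the past simply means
    // "send immediately" once it reaches the Timing Wheel.
    last_ts_ns += static_cast<uint64_t>(len_bytes / rate_Bps * 1e9);
    return last_ts_ns;
  }
};

// Consolidation: take the largest timestamp so that no policy's rate is
// violated. A result of zero (no policies) means "send immediately".
inline uint64_t EarliestReleaseTime(std::vector<ShapingPolicy*>& policies,
                                    uint32_t len_bytes) {
  uint64_t ert_ns = 0;
  for (ShapingPolicy* p : policies)
    ert_ns = std::max(ert_ns, p->NextTimestamp(len_bytes));
  return ert_ns;
}

// Mapping the final timestamp to a Timing Wheel slot of granularity
// g_min, clamped to the horizon (the "insert at the end slot" option).
inline uint64_t SlotOffset(uint64_t ert_ns, uint64_t now_ns,
                           uint64_t gmin_ns, uint64_t num_slots) {
  uint64_t ahead = (ert_ns > now_ns) ? (ert_ns - now_ns) / gmin_ns : 0;
  return std::min(ahead, num_slots - 1);
}
```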

We note that Token Bucket rate limiters have an explicit configuration of burst size that determines the largest line rate burst for a flow. Such a burst setting can be realized in Carousel via timestamping, such that no more than a burst of packets is transmitted at line rate. For example, a train of consecutive packets of a total size of burst can be configured to be transmitted at the same time or within the same timeslot. This will create a burst of that size.

Algorithm 1 presents a concrete implementation of a Timing Wheel. There are only two operations: Insert places a packet at a slot determined by the timestamp, and ExtractNext retrieves the first enqueued packet with a timestamp older than the current time, now. This allows the Timing Wheel to act as a FIFO if all packets have a timestamp of zero. All packets with a timestamp older than now are inserted in the slot with the smallest time.

Algorithm 1 Timing Wheel implementation.
1: procedure Insert(Packet, Ts)
2:   Ts = Ts / Granularity
3:   if Ts <= FrontTimestamp then
4:     Ts = FrontTimestamp
5:   else if Ts > FrontTimestamp + NumberOfSlots - 1 then
6:     Ts = FrontTimestamp + NumberOfSlots - 1
7:   end if
8:   TW[Ts % NumberOfSlots].append(Packet)
9: end procedure
10: procedure ExtractNext(now)
11:   now = now / Granularity
12:   while now >= FrontTimestamp do
13:     if TW[now % NumberOfSlots].empty() then
14:       FrontTimestamp += Granularity
15:     else
16:       return TW[now % NumberOfSlots].PopFront()
17:     end if
18:   end while
19:   return Null
20: end procedure

We have two options for packets with a timestamp beyond the horizon. In the first option, the packets are inserted in the end slot that represents the timing wheel horizon. This approach is desirable when only flow pacing is the goal, as it doesn't drop packets but instead results in a temporary rate overshoot because of bunched packets at the horizon. The second approach is to drop packets beyond the horizon, which is desired when imposing a hard rate limit on a flow and an overshoot is undesirable.

A straightforward Timing Wheel implementation supports O(1) insertion and extraction where the list representing a slot is a dynamic list, e.g., std::list in C++. Memory is allocated and deallocated on every insertion and deletion within the dynamic list. Such memory management introduces significant cost per packet. Hence, we introduce a Global Pool of memory: a free list of nodes to hold packet references. The Timing Wheel allocates from the free list when packets are enqueued. To further increase efficiency, each free list node can hold references to n packets. Preallocated nodes are assigned to different time slots based on need. This eliminates the cost of allocation and deallocation of memory per packet, and amortizes the cost of moving nodes around over n packets. Freed nodes return to the global pool for reuse by other slots based on need. §7 quantifies the Timing Wheel performance.

5.2 Deferred Completions

Figure 11: Illustration of completion signaling: (a) Immediate Completions, (b) Deferred Completions. In (a), a large backlog can build up and substantially overwhelm the buffer, leading to drops and head-of-line blocking. In (b), only two packets are allowed in the shaper, and once a packet is released another is added. Indexing packets based on time allows for interleaving packets.

Carousel provides backpressure to the higher layer transport without drops, by bounding the number of enqueued packets per flow. Otherwise, the queue can grow substantially, causing packet drops and head of line blocking. Thus, we need mechanisms in the software NIC or the hypervisor to properly signal between Carousel and the source. We study two such approaches in detail: Deferred Completions and delay-based congestion control.

A completion is a signal reported from the driver to the transport layer in the kernel signaling that a packet left the networking stack (i.e., completed). A completion allows the kernel stack to send more packets to the NIC. The number of packets sent by the stack to the NIC is transport dependent; e.g., in the case of TCP, TSQ [4] ensures that the number of packets outstanding between the stack and the generation of a completion is at most two segments. Unreliable protocols like UDP could in theory generate an arbitrary number of packets. In practice, however, UDP sockets are limited to 128KB of outstanding send data.

The idea behind Deferred Completions, as shown in Figure 11, is holding the completion signal until Carousel transmits the packet to the wire, preventing the stack from enqueueing more packets to the NIC or the hypervisor. In Figure 11a, the driver in the guest stack generates completion signals at enqueue time, thus overwhelming the Timing Wheel with as much as a congestion window's worth of packets per TCP flow, leading to drops past the Timing Wheel horizon. In Figure 11b, Carousel in the Software NIC returns the completion signal to the guest stack only after the packet is dequeued from the Timing Wheel, and thus strictly bounds the number of packets in the shaper, since the transport stack relies on completion signals to enqueue more packets to Qdiscs.

Implementing Deferred Completions requires modifying the way completions are currently implemented in the driver. Completions are currently delivered by the driver in the order in which packets arrive at the NIC, regardless of their application or flow. Carousel can transmit packets in an order different from their arrival because of shaping. This means that a flow could have already transmitted packets to the wire, but be blocked because the transport layer has not received its completion yet. Consider a scenario of two competing flows, where the rate limit of one flow is half of the other. Both flows deliver packets to the shaper at the same arrival rate. However, packets of the faster flow will be transmitted at a higher rate. This means that packets from the faster flow can arrive at the shaper later than the packets from the slow flow and yet be transmitted first. The delivery rate of completion signals needs to match the packet transmission rate, or else the fast flow will be unnecessarily backpressured and slowed down. Ordered completion delivery can cause head of line blocking, hence we implemented out-of-order completions in the driver of the guest kernel.

We implement out-of-order completions by tracking all the packets still held by Carousel in a hashmap in the guest kernel driver. On reception of a packet from the guest transport stack, the Software NIC enqueues the reference to that packet in the hashmap. When Carousel releases the packet to the network, the Software NIC generates a completion signal, which is delivered to the guest transport stack and allows removal of the packet from the hashmap. The removal of the packet from the hashmap is propagated up in the stack, allowing TSQ to enqueue another packet [19]. This requires changing both the driver and its interface in the guest kernel, i.e., virtio [40].
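A minimal sketch of this out-of-order completion bookkeeping is shown below. It only illustrates the hashmap-based tracking described above; the types, the packet identifier, and names such as CompletionTracker are assumptions for the example, not the actual driver code.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Sketch of Section 5.2's out-of-order completion tracking: packet
// references handed to Carousel are kept in a hash map keyed by an
// identifier, and a completion is generated only when the shaper
// releases the packet, regardless of arrival order.
struct PacketRef { /* pointer/handle to guest packet memory, omitted */ };

class CompletionTracker {
 public:
  // Called when the guest transport stack hands a packet to the
  // Software NIC; the packet is now "held by Carousel".
  void OnEnqueue(uint64_t packet_id, PacketRef ref) {
    inflight_[packet_id] = ref;
  }

  // Called when Carousel transmits the packet to the wire. Returns true
  // if a completion should be delivered to the guest so that TSQ lets
  // the flow enqueue its next packet.
  bool OnRelease(uint64_t packet_id, PacketRef* out) {
    auto it = inflight_.find(packet_id);
    if (it == inflight_.end()) return false;
    *out = it->second;
    inflight_.erase(it);  // completions may be out of arrival order
    return true;
  }

  std::size_t Inflight() const { return inflight_.size(); }

 private:
  std::unordered_map<uint64_t, PacketRef> inflight_;
};
```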
Carousel is implemented within a Software NIC that relies on zero-copy to reduce memory overhead within the NIC. This becomes especially important in the presence of Carousel, because when the Software NIC shapes traffic, it holds more packets than it would without any shaping. Hence, the Timing Wheel carries references (pointers) to packets and not the actual packets themselves. The memory cost incurred is thus roughly the cost of the number of references held in Carousel, while the packet memory is fully attributed to the sending application until the packet is sent to the wire (completed). In practice, this means that for every one million packets outstanding in the Timing Wheel, our software NIC needs to allocate only on the order of 8MB of RAM. The combination of Deferred Completions and holding packet references provides backpressure while using only a small memory footprint.

Delay-based Congestion Control (CC) is an alternative form of backpressure to Deferred Completions. Delay-based CC senses the delay introduced in the shaper, and modulates TCP's congestion window to a smaller value as the RTT increases. The congestion window serves as a limit on the number of packets in the Timing Wheel. We note that delay-based CC should discount the pacing component of the delay when computing the RTT. Operating with an RTT that includes pacing delay introduces undesirable positive feedback loops that reduce throughput. On the other hand, for Retransmission Timeout (RTO) computations, it is crucial that pacing delay be included, so as to avoid premature firing of the RTO timer. §7 provides a comparison across backpressure mechanisms.

5.3 Scaling Carousel with multiple cores

So far we have described the use of a single Timing Wheel per CPU core to shape traffic. Using a single shaper in multi-core systems incurs contention costs that would make the solution impractical. It is easy to scale Carousel by using independent Timing Wheels, one per CPU. In our approach, each TCP connection is hashed to a single Timing Wheel. Using one shaper per core suffices for pacing or rate limiting individual flows, and allows lock-free queuing of packets to a NIC instance on a specific core.

However, this simple approach does not work for the case when we want to shape multiple TCP connections by a single aggregate rate, because the individual TCP connections land on different cores (through hashing) and hence are shaped by different Timing Wheels. We resolve this with two components:

1) NBA: Our implementation uses a NIC-level bandwidth allocator (NBA) that accepts a total rate limit for flow-aggregates and redistributes each rate limit periodically to individual per-core Timestampers. NBA uses a simple water-filling algorithm to achieve work-conservation and max-min fairness (a sketch follows below). The individual Timestampers provide NBA with up-to-date flow usage data as a sequence of (byte count, timestamp of the byte count measurement) pairs; rate assignments from NBA to Timestampers are in bytes per second. The water-filling algorithm ensures that if flows on one core are using less than their fair share, then the extra is given to flows on the remaining cores. If flows on any core have zero demand, then the central algorithm still reserves 1% of the aggregate rate, to provide headroom for ramping up.

2) Communication between NBA and per-core Timestampers: Usage updates by Timestampers and the dissemination of rates by NBA are performed periodically in a lazy manner. We chose lazy updates to avoid the overhead of locking the data path with every update. The frequency of updates is a tradeoff between stability and fast convergence of computed rates; we currently use ≈100ms.
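The following is a hedged sketch of one plausible NBA-style redistribution for a single flow-aggregate. The text states the goals — max-min fairness, work conservation, and a small reservation for idle cores — but not this exact code; the function name and the details of the loop are illustrative assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Redistribute `aggregate_Bps` across per-core Timestampers given their
// measured demands (bytes/sec). Idle cores keep a 1% reservation as
// ramp-up headroom; the rest is water-filled over cores with demand.
std::vector<double> RedistributeAggregate(double aggregate_Bps,
                                          const std::vector<double>& demand) {
  const std::size_t n = demand.size();
  std::vector<double> rate(n, 0.0);
  if (n == 0) return rate;

  double remaining = aggregate_Bps;
  std::vector<std::size_t> active;      // cores with non-zero demand
  for (std::size_t i = 0; i < n; ++i) {
    if (demand[i] > 0.0) {
      active.push_back(i);
    } else {
      rate[i] = 0.01 * aggregate_Bps;   // 1% headroom for ramping up
      remaining -= rate[i];
    }
  }
  remaining = std::max(0.0, remaining);

  // Water-filling: repeatedly hand out an equal share, capping cores
  // whose demand is already satisfied below that share.
  std::vector<std::size_t> unsat = active;
  while (!unsat.empty() && remaining > 0.0) {
    double share = remaining / unsat.size();
    std::vector<std::size_t> next;
    bool someone_capped = false;
    for (std::size_t i : unsat) {
      double need = demand[i] - rate[i];
      if (need <= share) {              // demand met below fair share
        rate[i] = demand[i];
        remaining -= need;
        someone_capped = true;
      } else {
        next.push_back(i);
      }
    }
    if (!someone_capped) {              // all remaining cores want more
      for (std::size_t i : next) rate[i] += share;
      remaining = 0.0;
    }
    unsat.swap(next);
  }

  // Work conservation: any leftover (all demands met) is spread over the
  // cores that had demand so the aggregate limit can still be used.
  if (remaining > 0.0 && !active.empty()) {
    for (std::size_t i : active) rate[i] += remaining / active.size();
  }
  return rate;
}
```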
6 CAROUSEL AT THE RECEIVER

We apply the central tenets discussed in §4 to show how a receiver can shape flows on the ingress side. Our goal is to allow overwhelmed receivers to handle incast traffic through a fair division of bandwidth amongst flows [12]. The direct way of shaping ingress traffic is to queue incoming data packets, which can consume considerable memory. Instead, we shape the acknowledgements reported to the sender. Controlling the rate of acknowledged data allows for fine grained control over the sender's rate.

The ingress shaping algorithm modifies the acknowledgment sequence numbers in packets' headers, originally calculated by TCP, to acknowledge packets at the configured target rate. We apply Carousel as an extension of DCTCP without modifying its behavior. The algorithm keeps track of a per-flow Last Acked Sequence Number (SNa), Latest Ack Time (Ta), and Last Received Sequence Number (SNr). The first two variables keep track of the sequence number of the last-acknowledged bytes and the time that acknowledgement was sent. For an outgoing packet from the receiver to the sender, we check the acknowledgement number in the packet. We use that to update SNr. Then, we change that number based on the following formula: NewAckSeqNumber = min((now − Ta) × Rate + SNa, SNr). We update Ta to now and SNa to NewAckSeqNumber, if SNa is smaller than NewAckSeqNumber. Acknowledgments are generated by Carousel when no packet has been generated by the upper layer for the time MTU/Rate. The variable Rate is set either through a bandwidth allocation system, or it is set as the total available bandwidth divided by the number of flows.

7 EVALUATION

We evaluate Carousel in small-scale microbenchmarks where all shaped traffic is between machines within the same rack, and in production experiments on machines serving video content to tens of thousands of flows per server. We compare the overhead of Carousel to HTB and FQ/pacing in similar settings and demonstrate that Carousel is 8% more efficient in overall machine CPU utilization (20% more efficient in CPU utilization attributable to networking), and 6% better in terms of rate conformance, while maintaining similar or lower levels of TCP retransmission rates. Note that the video serving system is just for the convenience of production experiments in this section; the same shaping mechanisms are also deployed in our network operations, such as enforcing Bandwidth Enforcer (BwE) rates.
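Before turning to the evaluation, the following sketch makes the §6 acknowledgement-shaping formula, NewAckSeqNumber = min((now − Ta) × Rate + SNa, SNr), concrete. Field names mirror the variables in the text; the struct name, the nanosecond clock, and the omission of 32-bit sequence wraparound are simplifying assumptions for illustration.

```cpp
#include <algorithm>
#include <cstdint>

// Sketch of the Section 6 receiver-side ingress shaper applied to an
// outgoing ACK: data is acknowledged at no more than rate_Bps.
struct IngressShapedFlow {
  double rate_Bps = 0;      // configured rate, or fair share of capacity
  uint64_t sn_acked = 0;    // SNa: last acknowledged sequence number
  uint64_t sn_recv = 0;     // SNr: last received sequence number
  uint64_t t_ack_ns = 0;    // Ta: time the last acknowledgement was sent

  // Rewrites the ACK number carried in an outgoing packet.
  uint64_t ShapeAck(uint64_t ack_in_pkt, uint64_t now_ns) {
    sn_recv = std::max(sn_recv, ack_in_pkt);
    double elapsed_s = (now_ns - t_ack_ns) * 1e-9;
    uint64_t budget =
        sn_acked + static_cast<uint64_t>(elapsed_s * rate_Bps);
    uint64_t new_ack = std::min(budget, sn_recv);   // the paper's formula
    if (new_ack > sn_acked) {   // only advance SNa and Ta when ACK moves
      sn_acked = new_ack;
      t_ack_ns = now_ns;
    }
    return new_ack;
  }
};
```

Per the text, the receiver also generates a standalone acknowledgement if the upper layer has been silent for MTU/Rate, so the acknowledgement clock keeps running even without reverse-direction data.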

600 DC - Median TCP Vegas - Median Carousel DC - 99th Pct TCP Vegas - 99th Pct 450 HTB 200 300 FQ 160 150 120 0 80

Absolute Deviation 0 4000 8000 40 in Shaper from Target Rate (Mbps) Set Rate (Mbps) 0 0 1000 2000 3000 4000 5000 6000 7000 8000 Number of Packets Aggregate Rate (Mbps) Figure Õó: Comparison between Carousel, HTB, and FQ in their rate con- formance for a single žow. Figure Õ¦: A comparison of Deferred Completion (DC) and delay-based congestion control (TCP Vegas) backpressure, for Ÿeen žows when vary- 5000 ing the aggregate rate limits. 4500 Carousel 4000 HTB 3500 FQ 3000 (Mbps) 2500 which re at lower rates than needed for accuracy at high rates. For

Achieved Rate 0 200 400 600 800 1000 instance, HTB’s timer res once every Õþms. Carousel relies on ac- Number of Flows curate scheduling of packets in busy-wait polling SoŸware NIC, by placing them in a Timing Wheel slot that transmits a packet within a Figure Õì: Comparison between Carousel, HTB, and HTB/FQ showing slot size of its scheduled time. the ešect of the number of žows load on their rate conformance to a target Vary the number of žows. We vary the number of žows in our rate of ¢ Gbps. experiments between one and ÕþK, typical for user facing servers. Figure Õì shows the ešect of varying the number of žows until Õþþþ ß.Õ Microbenchmark žows. Carousel is not impacted as it relies on the Timing Wheel Experiments setup: We conduct experiments for egress tra›c shap- with O(Õ) insertion and extraction times. HTB maintains a constant ing between two servers sharing the same top of rack switch. e deviation of ¢Û from the target rate. However, FQ/pacing conformity servers are equipped with soŸware NIC on which we implemented signicantly drops beyond óþþ žows. is is due to the limited num- Carousel to shape tra›c. For baseline, servers run Linux with HTB ber of per-queue and global packet limits. However, increasing this congured with ÕäK packets per queue; FQ/pacing is congured with number signicantly increases its CPU overhead due to its reliance a global limit of ÕþK packets and a žow limit of Õþþþ packets. ese on RB-trees for maintaining žows and packets schedules [¢]. settings follow from best practices. We generate tra›c with neper 7.1.2 Memory Efficiency. We compare memory e›ciency of [ß], a network performance measurement tool that allows for gener- Deferred Completions and delay-based CC as backpressure schemes ating large volumes of tra›c with up to thousands of žows per server. (§¢.ó). We use the metric of number of packets held in the shaper, We vary RTT using netem. All reported results are for an emulated which directly režects the memory allocation expected from the RTT of ì¢ms. We found the impact of RTT on Carousel to be negligi- shaper along with the possibility of packet drops as memory demand ble. Experiments are run for ìþ seconds each and enough number of exceeds allocation. We collect samples of the number of packets every runs to make the standard deviation ÕÛ or smaller. Unless otherwise time a new packet is extracted from the shaper. mentioned, the Timing Wheel granularity is two microseconds with Change target rate. Figure Õ¦ shows that with Deferred Comple- a horizon of two seconds. tions, the average of number of packets in the Carousel is not ašected 7.1.1 Rate Conformance. We measure rate conformance as by changes in the target rate. While delay-based CC maintains similar the absolute deviation from the target rate. is metric represents average number of packets like in Deferred Completions, it has twice the deviation in bytes per second due to shaping errors. We measure the number of packets in the ÉÉth percentile. is is because delay absolute deviation from target rate by collecting achieved rate samples variations can grow the congestion window in Vegas to large values, once every Õþþ milliseconds. We collect samples at the output of while Deferred Completions maintains a strict bound regardless of the shaper, i.e., on the sender, as we want to focus on the shaper the variations introduced by congestion control. e bump in ÉÉth behavior and avoid factoring in network impact on tra›c. 

Change target rate. We investigate rate conformance for a single TCP connection. As the target rate is varied, we find that HTB and FQ have a consistent 5% and 6% deviation, respectively, between the target and achieved rates (Figure 12). This is due to their reliance on timers, which fire at lower rates than needed for accuracy at high rates. For instance, HTB's timer fires once every 10ms. Carousel relies on accurate scheduling of packets in a busy-wait polling software NIC, placing each packet in a Timing Wheel slot that transmits it within a slot size of its scheduled time.

Figure 12: Comparison between Carousel, HTB, and FQ in their rate conformance for a single flow.

Vary the number of flows. We vary the number of flows in our experiments between one and 10K, typical for user-facing servers. Figure 13 shows the effect of varying the number of flows up to 1000 flows. Carousel is not impacted, as it relies on the Timing Wheel with O(1) insertion and extraction times. HTB maintains a constant deviation of 5% from the target rate. FQ/pacing conformity, however, drops significantly beyond 200 flows. This is due to its limited per-queue and global packet limits; increasing these limits significantly increases its CPU overhead because of its reliance on RB-trees for maintaining flow and packet schedules [5].

Figure 13: Comparison between Carousel, HTB, and FQ showing the effect of the number of flows on their rate conformance to a target rate of 5 Gbps.

7.1.2 Memory Efficiency. We compare the memory efficiency of Deferred Completions and delay-based CC as backpressure schemes (§5.2). We use the number of packets held in the shaper as the metric, which directly reflects the memory allocation expected from the shaper, along with the possibility of packet drops when memory demand exceeds allocation. We collect a sample of the number of packets every time a new packet is extracted from the shaper.

Change target rate. Figure 14 shows that with Deferred Completions, the average number of packets in Carousel is not affected by changes in the target rate. While delay-based CC maintains a similar average number of packets to Deferred Completions, it holds twice as many packets at the 99th percentile. This is because delay variations can grow the congestion window in Vegas to large values, while Deferred Completions maintains a strict bound regardless of the variations introduced by congestion control. The bump in the 99th percentile at the 1Gbps rate for Deferred Completions results from an interaction of TSQ and TSO autosizing. Deferred Completions behavior is consistent regardless of the congestion control algorithm used in TCP; our Deferred Completions experiments used TCP Cubic in Linux.

Figure 14: A comparison of Deferred Completion (DC) and delay-based congestion control (TCP Vegas) backpressure, for fifteen flows when varying the aggregate rate limits.

Vary number of flows. In Figure 15, we compare the number of packets held in the shaper for a fixed target rate while varying the number of flows. In the absence of Deferred Completions, the flows continually send until the shaper's memory of 70K packets is completely filled. With Deferred Completions, the number of packets held in the shaper is deterministic, at approximately twice the number of active flows. This is also reflected in the 99th percentile of the number of packets held. For delay-based CC, the median number of packets is higher than the 99th-percentile behavior of Deferred Completions. Furthermore, the 99th percentile of delay-based CC is less predictable than that of Deferred Completions, because it can be affected by variations in RTT. We did not observe any significant difference in rate conformance between the two approaches. We find that Deferred Completions is the better backpressure approach because of its predictable behavior.

Figure 15: A comparison of Deferred Completion (DC) and delay-based congestion control (TCP Vegas) backpressure, while varying the number of flows with an aggregate rate limit of 5Gbps.

There is no hard limit on the number of individual connections or flow aggregates that Carousel can handle. Carousel operates on references (pointers) to packets, which are small. We do have to allocate memory to hold the references, but in practice this memory is a small fraction of the memory available on the machine.
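
The bound of roughly two packets per flow follows from coupling shaper occupancy to completions: the stack may hand the shaper another packet for a flow only after an earlier one has actually left the Timing Wheel. The sketch below is our own minimal illustration of that coupling, not the production implementation; the per-flow budget of two and all names are assumptions.

    #include <atomic>

    // Illustrative per-flow gate for Deferred Completions (hypothetical names).
    // A flow may have at most kBudget packets inside the shaper; the completion
    // callback, run when the Timing Wheel actually emits a packet, reopens the
    // gate. This is what bounds shaper occupancy to ~2x the number of active flows.
    struct FlowCompletionGate {
      static constexpr int kBudget = 2;      // assumed per-flow budget
      std::atomic<int> in_shaper{0};

      bool TryEnqueue() {                    // call before handing a packet down
        int cur = in_shaper.load();
        while (cur < kBudget) {
          if (in_shaper.compare_exchange_weak(cur, cur + 1)) return true;
        }
        return false;                        // stack holds the packet: backpressure
      }
      void OnTransmitComplete() {            // call when the shaper releases the packet
        in_shaper.fetch_sub(1);
      }
    };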

7.1.3 Impact of Timing Wheel Parameters. In §5, we chose the Timing Wheel for its high efficiency and predictable behavior. Here we study the effect of its different parameters on the behavior of Carousel.

Slot Granularity. Slot granularity controls the burst size and also determines the smallest supported gap between packets, which bounds the maximum rate at which Carousel can pace. Furthermore, for a certain granularity to be supported, the Timing Wheel has to be dequeued at least at that rate. We investigate the effect of different granularities on rate conformance. Figure 16 shows that a granularity of 8-16µs suffices to maintain a small deviation from the target rate. We also find that rate conformity at larger slot granularities, such as 4ms, deviates by 300 Mbps, which is comparable to the deviation of HTB and FQ at the same target rate shown in Figure 12.

Figure 16: Rate conformance of Carousel for a target rate of 5Gbps at different slot granularities.

Timing Wheel Horizon. The horizon represents the maximum delay a packet can encounter in the shaper (and hence the minimum supported rate), and is the product of the slot granularity and the number of slots. Figure 17 shows the minimum rate and the number of slots needed for different horizon values, all for a granularity of two microseconds. Note that the results here assume that only one flow is shaped by Carousel, which means the horizon only needs to be large enough to accommodate the time gap between two packets.

Figure 17: An analytical view of the minimum supported rate for a specific Timing Wheel horizon, along with the required number of slots, for a slot size of two microseconds.

Deviation between Timestamp and Transmission Time. Figure 18 shows a CDF of the difference, in nanoseconds, between the scheduled transmission time of a packet and its actual transmission time. This metric represents how accurately the Timing Wheel adheres to its schedule. Over 90% of packets have deviations smaller than the slot granularity of two microseconds; these deviations are caused by the rounding-down error when mapping packets whose schedules are calculated in nanoseconds to slots of 2µs. Deviations exceeding a slot size, caused by excessive queuing delay between the timestamper and the shaper, are rare.

Figure 18: A CDF of the deviation between a packet timestamp and its transmission time.

CPU Overhead of Timing Wheel. For a system attempting to achieve line rate, every nanosecond of per-packet processing counts. The Timing Wheel is the most resource-consuming component in the system, as the remaining operations merely change parts of the packet metadata. We explore two parameters that can affect the per-packet delay in the Timing Wheel implementation: 1) the data structure used to represent a slot, and 2) the number of packets held in the Timing Wheel. Table 1 shows the impact of both parameters. Because of the O(1) insertion and extraction overhead, the Timing Wheel is not affected by the number of packets or flows; the factor affecting the per-packet delay is the overhead of inserting and extracting packets from each slot. We find that implementing a slot data structure that avoids allocating memory for every insertion, i.e., the Global Pool (described in §5.1.2), reduces the overhead of C++ std::list by 50%.

Table 1: Overhead per packet of different Timing Wheel implementations.
Number of Packets in Shaper     1000    4000    32000    256000    2000000
std::list (ns per packet)         22      21       21        21         22
Global Pool (ns per packet)       12      11       11        11         11
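
To make the slot arithmetic, the horizon clamp, and the allocation-free Global Pool idea concrete, the following simplified, single-threaded sketch uses assumed names and types; it is not the Carousel implementation and omits the per-core, lock-free details.

    #include <cstdint>
    #include <vector>

    // Illustrative time-indexed shaper queue backed by a preallocated node pool.
    class TimingWheelSketch {
     public:
      TimingWheelSketch(uint64_t granularity_ns, uint64_t num_slots,
                        size_t capacity, uint64_t start_ns)
          : granularity_ns_(granularity_ns), num_slots_(num_slots),
            heads_(num_slots, -1), pool_(capacity),
            next_slot_(start_ns / granularity_ns) {
        // Build the free list once; Enqueue never allocates afterwards.
        for (size_t i = 0; i < capacity; ++i) {
          pool_[i].next = free_head_;
          free_head_ = static_cast<int>(i);
        }
      }

      // O(1) insert. Nanosecond timestamps round down to a slot (the sub-slot
      // deviation of Figure 18); times beyond the horizon clamp to the last slot.
      bool Enqueue(const void* pkt, uint64_t ts_ns) {
        if (free_head_ < 0) return false;            // full: backpressure, not drops
        uint64_t slot = ts_ns / granularity_ns_;
        if (slot < next_slot_) slot = next_slot_;    // already due
        if (slot >= next_slot_ + num_slots_) slot = next_slot_ + num_slots_ - 1;
        int n = free_head_;
        free_head_ = pool_[n].next;
        pool_[n].pkt = pkt;
        pool_[n].next = heads_[slot % num_slots_];
        heads_[slot % num_slots_] = n;
        return true;
      }

      // Called from the busy-polling loop; O(1) work per extracted packet.
      template <typename Emit>
      void AdvanceTo(uint64_t now_ns, Emit emit) {
        while (next_slot_ * granularity_ns_ <= now_ns) {
          int& head = heads_[next_slot_ % num_slots_];
          while (head != -1) {
            int n = head;
            head = pool_[n].next;
            emit(pool_[n].pkt);                      // transmit + completion upstream
            pool_[n].next = free_head_;              // return the node to the pool
            free_head_ = n;
          }
          ++next_slot_;
        }
      }

     private:
      struct Node { const void* pkt = nullptr; int next = -1; };
      uint64_t granularity_ns_;
      uint64_t num_slots_;
      std::vector<int> heads_;
      std::vector<Node> pool_;
      int free_head_ = -1;
      uint64_t next_slot_;
    };

Since the horizon is the product of granularity and slot count, and a packet can wait at most one horizon, the lowest rate a single flow can be held to is roughly one packet per horizon; this is the trade-off Figure 17 quantifies.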

7.1.4 Carousel Impact at the Receiver. We evaluate the receiver-based shaper in a 100:1 incast scenario: 10 machines with 10 connections each send incast traffic to one machine; all machines are located in the same rack. The unloaded network RTT is 10µs, the packet size is 1500 bytes, and each experiment is run for 100 seconds. Senders use DCTCP and Carousel to limit the throughput of individual incast flows. Figure 19 shows the rate conformance of Carousel for varying aggregate rates enforced at the receiver. The deviation of the aggregate achieved rate is within 1% of the target rate.

Figure 19: Comparing receiver-side rate limiting with the target rate.

7.2 Production Experience
To evaluate Carousel with real production loads, we choose 25 servers in five geographically diverse locations. The servers are deployed in two US locations (East and West Coast) and three European locations (Western and Eastern Europe). They serve streaming video traffic from a service provider to end clients. Each server generates up to 38Gbps of traffic, serving as many as 50,000 concurrently active sessions. For comparison, the servers support both conventional FQ/pacing in the Linux kernel and Carousel implemented on a software NIC like SoftNIC [24].

The metrics we use are: 1) retransmission rates, 2) server CPU efficiency, and 3) software NIC efficiency. To measure server CPU efficiency, we use the Gbps/CPU metric, which is the total amount of egress traffic divided by the total number of CPUs on a server and then divided by their utilization. For example, if a server sends 18 Gbps with 72 CPUs that are 50% utilized on average, it is said to have 18 / (72 * 0.5) = 0.5 Gbps/CPU efficiency. We present only the data for periods when server CPUs are more than 50% utilized (approximately 6 hours every day). Peak periods are when efficiency matters most, because server capacity must be provisioned for such peaks. Gbps/CPU is better than direct CPU load for measuring CPU efficiency, because our production system routes new client requests to the least loaded machines to maintain high CPU utilization at all machines. This load-balancing behavior means that more efficient machines will receive more client requests.
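
The Gbps/CPU metric is plain arithmetic; a small helper reproducing the worked example above might look as follows (illustrative only).

    #include <cstdio>

    // Gbps/CPU: egress traffic normalized by CPU count and average utilization.
    double GbpsPerCpu(double egress_gbps, int num_cpus, double avg_utilization) {
      return egress_gbps / (num_cpus * avg_utilization);
    }

    int main() {
      // Example from the text: 18 Gbps on 72 CPUs at 50% utilization = 0.5 Gbps/CPU.
      std::printf("%.2f Gbps/CPU\n", GbpsPerCpu(18.0, 72, 0.5));
      return 0;
    }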

CPU efficiency. In the interest of space, we do not show results from all sites. Figure 20 shows the impact of Carousel on the Gbps/CPU metric for five machines in the West Coast site. Figure 21 compares two machines, one using Carousel and the other using FQ/pacing. We pick these two machines specifically because they have similar CPU utilization during our measurement period. This provides a fair comparison of Gbps/CPU by unifying the denominator, focusing on how efficiently the CPU is used to push data. At the 50th and 90th percentiles, Carousel is 6.4% and 8.2% more efficient (in Gbps/CPU) at serving traffic. On 72-CPU machines this translates to saving an average of 4.6 CPUs per machine. Considering that networking accounts for ≈40% of the machine CPU, the savings are 16% in the median and 20% in the tail for CPU attributable to networking. The two machines exhibit similar retransmission rates, which shows that Carousel preserves the value of pacing while significantly reducing its cost.

Figure 20: Immediate improvement in CPU efficiency after switching to Carousel on all five machines in a US West Coast site.

Figure 21: Comparison between FQ and Carousel for two machines serving large volumes of video traffic and exhibiting similar CPU loads, showing that Carousel can push more traffic for the same CPU utilization.

Carousel on Software NIC. We examine the overhead of Carousel on the software NIC. Surprisingly, we find that moving pacing to the software NIC in fact improves its performance. The software NIC operates in spin-polling mode, consuming 100% of a fixed number of CPUs assigned to it. The NIC reports the CPU cycles it spent in a spin loop and whether it did any useful work in that particular loop (i.e., it received a batch with at least one packet, or a packet was emitted by the timing wheel); the CPU cycles spent in useful loops divided by the total CPU cycles in all loops is the software NIC utilization level. To measure software NIC efficiency, we normalize observed throughput by the software NIC's self-reported utilization, deriving a Gbps/SoftNIC metric. Figure 22 shows the Gbps/SoftNIC metric with pacing enforced by FQ/pacing and by Carousel. We see that moving pacing to the software NIC improves its efficiency by 12% (32.2 Gbps vs 28.8 Gbps per software NIC). The improved efficiency comes from larger batches flowing from the kernel TCP stack to the software NIC once pacing moves out of the kernel. Figure 23 shows cumulative batch sizes, including empty batches, for machines serving production traffic with pacing enforced in the kernel and with pacing enforced in the software NIC. On average, non-empty packet batch sizes increased by 57% (from an average batch size of 2.1 to an average batch size of 3.3).

Figure 22: Software NIC efficiency comparison with pacing enforced in the kernel vs pacing enforced in the software NIC. Bandwidth here is normalized to a unit of software NIC self-reported utilization.

Figure 23: Cumulative batch sizes (in TSO packets) that the software NIC receives from the kernel when pacing is enforced in the kernel vs when pacing is enforced in the software NIC.
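
The software NIC's self-reported utilization and the derived Gbps/SoftNIC metric can be expressed directly from the loop accounting described above; the counter names in this sketch are our own assumptions.

    #include <cstdint>

    // Utilization = cycles spent in useful spin loops / all spin-loop cycles;
    // Gbps/SoftNIC normalizes egress throughput by that utilization.
    struct SoftNicLoopCounters {
      uint64_t useful_loop_cycles = 0;   // loops that received or emitted packets
      uint64_t total_loop_cycles = 0;    // every spin-loop iteration
    };

    double SoftNicUtilization(const SoftNicLoopCounters& c) {
      return c.total_loop_cycles == 0
                 ? 0.0
                 : static_cast<double>(c.useful_loop_cycles) / c.total_loop_cycles;
    }

    double GbpsPerSoftNic(double egress_gbps, const SoftNicLoopCounters& c) {
      double u = SoftNicUtilization(c);
      return u > 0.0 ? egress_gbps / u : 0.0;
    }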

8 DISCUSSION
Application of Carousel in Hardware NICs: Some hardware devices, such as Intel NICs [26, 27], provide support for rate limiting. The common approach requires one hardware queue per rate limiter, where the hardware queue is bound to one or more transmit queues in the host OS. This approach heads in the direction of ever more transmit queues in hardware, which does not scale because of silicon constraints. Hence, hardware rate limiters remain restricted to custom applications, rather than serving as a normal offload that the OS can use.

With Carousel, the single-queue approach is a sharp departure from the trend of needing ever more hardware queues, and can be implemented as follows: 1) the hardware is configured with one transmit queue per CPU to avoid contention between the OS and the NIC; 2) classification and rates are configured on the hardware via an API; 3) the hardware consumes input packets, generates per-packet timestamps based on the configured rates, and enqueues them into a Timing Wheel; 4) it dequeues packets that are due; and 5) it returns a completion event for each packet that is sent.

Limitations: Carousel does not serve as a generic packet scheduler that can realize arbitrary packet schedules, because the Timing Wheel does not support changing the timestamps of already enqueued packets. Some schedules, such as Weighted Fair Queueing, are supportable with the existing O(1) Enqueue/Dequeue operations, while preemptive schedules such as strict priorities are not.

A second challenge stems from the use of a single-queue shaper with timestamping of packets: how to make timestamping consistent when the timestamps are not set in one centralized place. One approach is to standardize all sources to set timestamps in nanoseconds (or microseconds) from the Unix epoch. Other challenges include handling timestamps from unsynchronized CPU clocks, and timestamps of different granularities.
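
As one concrete form of the standardization suggested above, every timestamper could emit nanoseconds since the Unix epoch from the system's real-time clock. This snippet is our own illustration of the convention, not code from Carousel.

    #include <cstdint>
    #include <time.h>

    // Nanoseconds since the Unix epoch: a candidate common unit for packet
    // timestamps set by different layers (TCP, policers, the pacing engine).
    uint64_t UnixEpochNanos() {
      struct timespec ts;
      clock_gettime(CLOCK_REALTIME, &ts);   // wall clock, epoch-based
      return static_cast<uint64_t>(ts.tv_sec) * 1000000000ull +
             static_cast<uint64_t>(ts.tv_nsec);
    }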

9 RELATED WORK
Improving shaping efficiency: SENIC and vShaper introduce hybrid software/hardware shapers to reduce CPU overhead. SENIC allows the host's CPU to classify and enqueue packets while the NIC is responsible for pulling packets when it is time to transmit them [38]. vShaper allows queues to be shared across different flows by estimating the share each will take based on its demand [33]. In contrast, Carousel relies on a single queue that can be implemented efficiently in either hardware or software. Pacing was first proposed with a per-flow timer, which made it unattractive due to excessive CPU overhead [14]. FQ/pacing in the OS (§5) uses only one timer and still imposes non-trivial CPU overhead (§3). By relying on a time-based queue, Carousel implements shaping more efficiently and accurately, as packets are extracted based on their release times rather than by spending CPU cycles checking different queues for packets that are due.

New data structures for packet scheduling: New data structures in the recent literature implement scheduling at line rate while saving memory and CPU. UPS presents a Least Slack Time First-based packet scheduling algorithm for routers [35]. The Push-In First-Out Queue (PIFO) [42] is a priority queue that allows packets to be enqueued into an arbitrary location in the queue (thus enabling programmable packet scheduling), but only dequeued from the head. The Timing Wheel is also a particular kind of PIFO. PIFO and UPS focus on scheduling in switches, which requires work-conserving algorithms; Carousel focuses on pacing and rate limiting at end hosts, which are non-work-conserving. Carousel is similar to the seminal VirtualClock work [44] in its reliance on transmission time to control traffic. However, the focus in VirtualClock is on setting timestamps for specific rates that can be specified by a network controller. Carousel presents a data structure that can efficiently implement such techniques, and extends timestamping to both pacing and rate limiting. Furthermore, Carousel relies on Deferred Completions to limit queue length and avoid harmful interactions among flows.

Consumers of shaping: Recent congestion control and bandwidth allocation systems [13, 30, 32, 37, 39] include an implementation of shaping, e.g., FQ/pacing for BBR [17], a priority queue in TIMELY [36], a hardware token bucket queue in HULL [9], injected void packets in Silo [29], and HTB [2] for BwE [32]. Most proposed techniques rely on variations of token buckets to enforce rates. Carousel proposes a system that combines pacing and rate limiting while reducing memory and CPU overhead. TIMELY and HULL both strongly motivate a hardware implementation of pacers due to the overhead of software pacing. However, TIMELY presents a software pacer that works on a per-flow basis. This requires keeping track of flow information at the pacer, which can be a large overhead in the data path. Carousel requires each packet only to be annotated with a timestamp, which reduces the pacer overhead, since the notion of flows can be kept where it is actually required, e.g., in the TCP layer.

10 CONCLUSION
Even though traffic shaping is fundamental to the correct and efficient operation of datacenters and WANs, deployment at scale has been difficult due to prohibitive CPU and memory overheads. We showed in this paper that two techniques can help overcome scaling and efficiency concerns: a single time-indexed queue per CPU core for packets, and management of higher-layer resources with Deferred Completions. Scaling is further helped by a multicore-aware design. Carousel is able to individually shape tens of thousands of flows on a single server with modest resource consumption.

We believe the most exciting consequence of our work will be the creation of novel policies for pacing, bandwidth allocation, incast handling, and mitigation against DoS attacks. No longer are we restricted to cherry-picking a handful of flows to shape. We are confident that our results with Carousel make a strong technical case for networking stacks and NIC vendors to invest in its basic abstractions.

11 ACKNOWLEDGMENTS
We thank Ken Durden and Luigi Rizzo for their direct contributions to this work. We thank Yuchung Cheng, Brutus Curtis, Eric Dumazet, Van Jacobson and Erik Rubow for their insightful discussions; Marco Paganini for assisting in several rounds of Carousel experiments on Bandaid servers; Ian Swett and Jeremy Dorfman for conducting pacing experiments with QUIC; and David Wetherall and Daniel Eisenbud for their ongoing support of congestion control work. We thank our technical reviewers, Rebecca Isaacs, Jeff Mogul, Dina Papagiannaki and Parthasarathy Ranganathan, for providing useful feedback. We thank our shepherd, Andreas Haeberlen, and the anonymous SIGCOMM reviewers for their detailed and excellent feedback.

REFERENCES
[1] 2002. pfifo-tc: PFIFO Qdisc. https://linux.die.net/man/8/tc-pfifo/.
[2] 2003. Linux Hierarchical Token Bucket. http://luxik.cdi.cz/~devik/qos/htb/.
[3] 2008. net: Generic receive offload. https://lwn.net/Articles/358910/.
[4] 2012. tcp: TCP Small Queues. https://lwn.net/Articles/506237/.
[5] 2013. pkt_sched: fq: Fair Queue packet scheduler. https://lwn.net/Articles/564825/.
[6] 2013. tcp: TSO packets automatic sizing. https://lwn.net/Articles/564979/.
[7] 2016. neper: a Linux networking performance tool. https://github.com/google/neper.
[8] 2017. Leaky bucket as a queue. https://en.wikipedia.org/wiki/Leaky_bucket.
[9] Mohammad Alizadeh, Abdul Kabbani, Tom Edsall, Balaji Prabhakar, Amin Vahdat, and Masato Yasuda. 2012. Less Is More: Trading a Little Bandwidth for Ultra-Low Latency in the Data Center. In NSDI '12. 253-266.
[10] W. Almesberger, J. H. Salim, and A. Kuznetsov. 1999. Differentiated services on Linux. In GLOBECOM '99. 831-836 vol. 1b.
[11] Michael Armbrust, Armando Fox, Rean Griffith, Anthony D. Joseph, Randy Katz, Andy Konwinski, Gunho Lee, David Patterson, Ariel Rabkin, Ion Stoica, and Matei Zaharia. 2010. A View of Cloud Computing. Commun. ACM 53, 4 (2010), 50-58.
[12] Wei Bai, Kai Chen, Haitao Wu, Wuwei Lan, and Yangming Zhao. 2014. PAC: taming TCP Incast congestion using proactive ACK control. In ICNP '14. 385-396.
[13] Hitesh Ballani, Paolo Costa, Thomas Karagiannis, and Ant Rowstron. 2011. Towards Predictable Datacenter Networks. SIGCOMM Comput. Commun. Rev. 41, 4 (Aug. 2011), 242-253.
[14] Neda Beheshti, Yashar Ganjali, Monia Ghobadi, Nick McKeown, and Geoff Salmon. 2008. Experimental Study of Router Buffer Sizing. In IMC '08.
[15] Jesper Dangaard Brouer. 2015. Network stack challenges at increasing speeds: The 100Gbit/s challenge. LinuxCon. http://events.linuxfoundation.org/sites/events/files/slides/net_stack_challenges_100G_1.pdf.
[16] Randy Brown. 1988. Calendar queues: a fast O(1) priority queue implementation for the simulation event set problem. Commun. ACM 31, 10 (1988), 1220-1227.
[17] Neal Cardwell, Yuchung Cheng, C. Stephen Gunn, Van Jacobson, and Soheil Yeganeh. 2016. BBR: Congestion-Based Congestion Control. ACM Queue (2016), 50:20-50:53.
[18] Yuchung Cheng and Neal Cardwell. 2016. Making Linux TCP Fast. In Netdev Conference.
[19] David D. Clark. 1985. The structuring of systems using upcalls. In SOSP '85. 171-180.
[20] Glenn William Connery, W. Paul Sherer, Gary Jaszewski, and James S. Binder. 1999. Offload of TCP segmentation to a smart adapter. US Patent 5,937,169 (August 1999).
[21] Tobias Flach, Pavlos Papageorge, Andreas Terzis, Luis Pedrosa, Yuchung Cheng, Tayeb Karim, Ethan Katz-Bassett, and Ramesh Govindan. 2016. An Internet-Wide Analysis of Traffic Policing. In SIGCOMM '16. 468-482.
[22] Yashar Ganjali and Nick McKeown. 2006. Update on Buffer Sizing in Internet Routers. SIGCOMM Comput. Commun. Rev. (2006), 67-70.
[23] Monia Ghobadi, Yuchung Cheng, Ankur Jain, and Matt Mathis. 2012. Trickle: Rate Limiting YouTube Video Streaming. In USENIX ATC '12. 191-196.
[24] Sangjin Han, Keon Jang, Aurojit Panda, Shoumik Palkar, Dongsu Han, and Sylvia Ratnasamy. 2015. SoftNIC: A Software NIC to Augment Hardware. Technical Report UCB/EECS-2015-155, EECS Department, University of California, Berkeley.
[25] Chi-Yao Hong, Srikanth Kandula, Ratul Mahajan, Ming Zhang, Vijay Gill, Mohan Nanduri, and Roger Wattenhofer. 2013. Achieving High Utilization with Software-driven WAN. SIGCOMM Comput. Commun. Rev. 43, 4 (2013), 15-26.
[26] Intel Networking Division. 2016. 82599 10 GbE Controller Datasheet.
[27] Intel Networking Division. 2016. Ethernet Switch FM10000.
[28] Sushant Jain, Alok Kumar, Subhasree Mandal, Joon Ong, Leon Poutievski, Arjun Singh, Subbaiah Venkata, Jim Wanderer, Junlan Zhou, Min Zhu, Jon Zolla, Urs Hölzle, Stephen Stuart, and Amin Vahdat. 2013. B4: Experience with a Globally-deployed Software Defined WAN. In SIGCOMM '13. 3-14.
[29] Keon Jang, Justine Sherry, Hitesh Ballani, and Toby Moncaster. 2015. Silo: Predictable Message Latency in the Cloud. In SIGCOMM '15. 435-448.
[30] Vimalkumar Jeyakumar, Mohammad Alizadeh, David Mazières, Balaji Prabhakar, Albert Greenberg, and Changhoon Kim. 2013. EyeQ: Practical Network Performance Isolation at the Edge. In NSDI '13. 297-311.
[31] Antoine Kaufmann, Simon Peter, Naveen Kr. Sharma, Thomas Anderson, and Arvind Krishnamurthy. 2016. High Performance Packet Processing with FlexNIC. In ASPLOS '16. 67-81.
[32] Alok Kumar, Sushant Jain, Uday Naik, Anand Raghuraman, Nikhil Kasinadhuni, Enrique Cauich Zermeno, C. Stephen Gunn, Jing Ai, Björn Carlin, Mihai Amarandei-Stavila, Mathieu Robin, Aspi Siganporia, Stephen Stuart, and Amin Vahdat. 2015. BwE: Flexible, Hierarchical Bandwidth Allocation for WAN Distributed Computing. In SIGCOMM '15. 1-14.
[33] Gautam Kumar, Srikanth Kandula, Peter Bodik, and Ishai Menache. 2013. Virtualizing Traffic Shapers for Practical Resource Allocation. In 5th USENIX Workshop on Hot Topics in Cloud Computing.
[34] Adam Langley, Alistair Riddoch, Alyssa Wilk, Antonio Vicente, Charles Krasic, Dan Zhang, Fan Yang, Fedor Kouranov, Ian Swett, Janardhan Iyengar, Jeff Bailey, Jeremy Dorfman, Jim Roskind, Joanna Kulik, Patrik Westin, Raman Tenneti, Robbie Shade, Ryan Hamilton, Victor Vasiliev, Wan-Teh Chang, and Zhongyi Shi. 2017. The QUIC Transport Protocol: Design and Internet-Scale Deployment. In SIGCOMM '17.
[35] Radhika Mittal, Rachit Agarwal, Sylvia Ratnasamy, and Scott Shenker. 2016. Universal Packet Scheduling. In NSDI '16. 501-521.
[36] Radhika Mittal, Vinh The Lam, Nandita Dukkipati, Emily Blem, Hassan Wassel, Monia Ghobadi, Amin Vahdat, Yaogong Wang, David Wetherall, and David Zats. 2015. TIMELY: RTT-based Congestion Control for the Datacenter. In SIGCOMM '15. 537-550.
[37] Lucian Popa, Praveen Yalagandula, Sujata Banerjee, Jeffrey C. Mogul, Yoshio Turner, and Jose Renato Santos. 2013. ElasticSwitch: Practical Work-conserving Bandwidth Guarantees for Cloud Computing. SIGCOMM Comput. Commun. Rev. 43, 4 (Aug. 2013), 351-362.
[38] Sivasankar Radhakrishnan, Yilong Geng, Vimalkumar Jeyakumar, Abdul Kabbani, George Porter, and Amin Vahdat. 2014. SENIC: Scalable NIC for End-Host Rate Limiting. In NSDI '14. 197-210.
[39] Barath Raghavan, Kashi Vishwanath, Sriram Ramabhadran, Kenneth Yocum, and Alex C. Snoeren. 2007. Cloud Control with Distributed Rate Limiting. In SIGCOMM '07. 337-348.
[40] Rusty Russell. 2008. virtio: towards a de-facto standard for virtual I/O devices. ACM SIGOPS Operating Systems Review 42, 5 (2008), 95-103.
[41] M. Shreedhar and George Varghese. 1995. Efficient Fair Queueing Using Deficit Round Robin. In SIGCOMM '95. 375-385.
[42] Anirudh Sivaraman, Suvinay Subramanian, Mohammad Alizadeh, Sharad Chole, Shang-Tse Chuang, Anurag Agrawal, Hari Balakrishnan, Tom Edsall, Sachin Katti, and Nick McKeown. 2016. Programmable Packet Scheduling at Line Rate. In SIGCOMM '16. 44-57.
[43] G. Varghese and T. Lauck. 1987. Hashed and Hierarchical Timing Wheels: Data Structures for the Efficient Implementation of a Timer Facility. In SOSP '87. 25-38.
[44] L. Zhang. 1990. Virtual Clock: A New Traffic Control Algorithm for Packet Switching Networks. In SIGCOMM '90. 19-29.