<<

BwE: Flexible, Hierarchical Bandwidth Allocation for WAN Distributed Computing

Alok Kumar Sushant Jain Uday Naik Anand Raghuraman Nikhil Kasinadhuni Enrique Cauich Zermeno C. Stephen Gunn Jing Ai Björn Carlin Mihai Amarandei-Stavila Mathieu Robin Aspi Siganporia Stephen Stuart Amin Vahdat

Google Inc. [email protected]

ABSTRACT Keywords WAN bandwidth remains a constrained resource that is eco- Bandwidth Allocation; Wide-Area Networks; Soware- nomically infeasible to substantially overprovision. Hence, Deûned Network; Max-Min Fair it is important to allocate capacity according to service pri- ority and based on the incremental value of additional allo- 1. INTRODUCTION cation. For example, it may be the highest priority for one service to receive Ï«Gb/s of bandwidth but upon reaching TCP-based bandwidth allocation to individual ows con- such an allocation, incremental priority may drop sharply tending for bandwidth on bottleneck links has served the In- favoring allocation to other services. Motivated by the ob- ternet well for decades. However, this model of bandwidth servation that individual ows with ûxed priority may not allocation assumes all ows are of equal priority and that all be the ideal basis for bandwidth allocation, we present the ows beneût equally from any incremental share of available design and implementation of Bandwidth Enforcer (BwE), bandwidth. It implicitly assumes a client-server communi- a global, hierarchical bandwidth allocation infrastructure. cation model where a TCP ow captures the communication BwE supports: i) service-level bandwidth allocation follow- needs of an application communicating across the . ing prioritized bandwidth functions where a service can rep- his paper re-examines bandwidth allocation for an im- resent an arbitrary collection of ows, ii) independent alloca- portant, emerging trend, distributed computing running tion and delegation policies according to user-deûned hier- across dedicated private WANs in support of cloud comput- archy, all accounting for a global view of bandwidth and fail- ing and service providers. housands of simultaneous such ure conditions, iii) multi-path forwarding common in traõc- applications run across multiple global data centers, with engineered networks, and iv) a central administrative point thousands of processes in each data center, each potentially to override (perhaps faulty) policy during exceptional con- maintaining thousands of individual active connections to ditions. BwE has delivered more service-eõcient bandwidth remote servers. WAN traõc engineering means that site-pair utilization and simpler management in production for mul- communication follows diòerent network paths, each with tiple years. diòerent bottlenecks. Individual services have vastly diòer- ent bandwidth, latency, and loss requirements. We present a new WAN bandwidth allocation mechanism CCS Concepts supporting distributed computing and data transfer. BwE provides work-conserving bandwidth allocation, hierarchi- •Networks → Network resources allocation; Network man- agement; cal fairness with exible policy among competing services, and Service Level Objective (SLO) targets that independently account for bandwidth, latency, and loss. BwE’s key insight is that routers are the wrong place to map Permission to make digital or hard copies of part or all of this work for personal policy designs about bandwidth allocation onto per-packet or classroom use is granted without fee provided that copies are not made or behavior. Routers cannot support the scale and complex- distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of ity of the necessary mappings, oen because the semantics this work must be honored. For all other uses, contact the owner/author(s). 
of these mappings cannot be captured in individual packets. SIGCOMM ’15 August 17-21, 2015, London, United Kingdom Instead, following the End-to-End Argument[†Ê], we push © 2015 Copyright held by the owner/author(s). all such mapping to the source host machines. Hosts rate ACM ISBN 978-1-4503-3542-3/15/08. limit their outgoing traõc and mark packets using the DSCP DOI: http://dx.doi.org/Ï«.ÏÏ/†;Ê•E.†;Ê;;Ê ûeld. Routers use the DSCP marking to determine which

1 path to use for a packet and which packets to drop when congested. We use global knowledge of and link utilization as input to a hierarchy of bandwidth en- forcers, ranging from a global enforcer down to enforcers on each host. Bandwidth allocations and packet marking pol- icy ows down the hierarchy while measures of demand ow up, starting with end hosts. he architecture allows us to de- couple the aggregate bandwidth allocated to a ow from the handling of the ow at the routers. BwE allocates bandwidth to competing applications based bandwidth functions on exible policy conûgured by cap- Figure Ï: WANNetwork Model. turing application priority and incremental utility from ad- ditional bandwidth in diòerent bandwidth regions. BwE 2. BACKGROUND supports hierarchical bandwidth allocation and delegation We begin by describing our WAN environment and high- among services while simultaneously accounting for multi- light the challenges we faced with existing bandwidth alloca- path WAN communication. BwE is the principal bandwidth tion mechanisms. housands of individual applications and allocation mechanism for one of the largest private WANs services run across dozens of wide area sites each containing and has run in production for multiple years across hundreds multiple clusters. Host machines within a cluster share a com- of thousands of end points. he systems contributions of our S S mon LAN. Figure Ï shows an example WAN with sites Ï, † work include: S CÏ C† S and t; Ï and Ï are clusters within site Ï. We host a combination of interactive web services, e.g. ● Leveraging concepts from Soware Deûned Network- ing, we build a uniûed, hierarchical control plane for search and web mail, streaming video, batch-style data pro- bandwidth management extending to all end hosts. In cessing, e.g., MapReduce [Ït], and large-scale data transfer particular, hosts report per-user and per-task demands services, e.g., index copy from one site to another. Cluster to the control plane and rate shape a subset of ows. management soware maps services to hosts independently; we cannot leverage IP address aggregation/preûx to identify ● We integrate BwE into existing WAN traõc engineer- a service. However, we can install control soware on hosts ing (TE) [Ï;, ÏÏ, φ] mechanisms including MPLS Auto- and leverage a control protocol running outside of routers. Bandwidth [††] and a custom SDN infrastructure. BwE We started with traditional mechanisms for bandwidth al- takes WAN pathing decisions made by a TE service location such as TCP, QoS and MPLS tunnels. However these and re-allocates the available site-to-site capacity, split proved inadequate for a variety of reasons: across multiple paths, among competing applications. At the same time, we beneût from the reverse integra- ● Granularity and Scale: Our network and service capac- tion: using BwE measures of prioritized application de- ity planners need to reason with bandwidth allocations mand as input to TE pathing algorithms (Section .t.Ï). at diòerent aggregation levels. For example, a prod- uct group may need a speciûed minimum of site-to-site ● We implement hierarchical max-min fair bandwidth al- bandwidth across all services within the product area. location to exibly-deûned FlowGroups contending for In other cases, individual users or services may require resources across multiple paths and at diòerent levels of a bandwidth guarantee between a speciûc pair of clus- network abstraction. he bandwidth allocation mech- ters. 
We need to scale bandwidth management to thou- anism is both work-conserving and exible enough to sands of individual services, and product groups across implement a range of network sharing policies. dozens of sites each containing multiple clusters. We need a way to classify and aggregate individual ows In sum,BwE delivers a number of compelling advantages. into arbitrary groups based on conûgured policy. TCP First, it provides isolation among competing services, deliv- fairness is at a -tuple ow granularity. On a congested ering plentiful capacity in the common case while maintain- link, an application gets bandwidth proportional to the ing required capacity under failure and maintenance scenar- number of active ows it sends across the links.Our ios. Capacity available to one service is largely independent of services require guaranteed bandwidth allocation inde- the behavior of other services. Second, administrators have a pendent of the number of active TCP ows. Router QoS single point for specifying allocation policy. While pathing, and MPLS tunnels do not scale to the number of service RTT, and capacity can shi substantially,BwE continues to classes we must support and they do not provide suõ- allocate bandwidth according to policy. Finally,BwE enables cient exibility in allocation policy (see below). the WAN to run at higher levels of utilization. By tightly inte- grating loss-insensitive ûle transfer protocols running at low ● Multipath Forwarding: For eõciency, wide area packet priority with BwE, we run many of our WAN links at •«ì forwarding follows multiple paths through the net- utilization. work, possibly with each path of varying capac- ity. Routers hash individual service ows to one of the available paths based on packet header content.

2 Any bandwidth allocation from one site to another must simultaneously account for multiple source/des- tination paths whereas existing bandwidth allocation mechanisms—TCP, router QoS, MPLS tunnels—focus on diòerent granularity (ows, links, single paths re- spectively). ● Flexible and Hierarchical Allocation Policy: We found Figure †: Reduction in TCP packet loss aer BwE was de- simple weighted bandwidth allocation to be inade- ployed. Y-axis denotes packet loss in percentage. Diòerent quate. For example, we may want to give a high pri- lines correspond to diòerent QoS classes (BEÏ denoting best ority user a weight of Ï«.« until it has been allocated Ï eòort, and AFÏ/AF† denoting higher QoS classes.) Gb/s, a weight of Ï.« until it is allocated † Gb/s and a weight of «.Ï for all bandwidth beyond that. Further, 3. ABSTRACTIONS AND CONCEPTS bandwidth allocation should be hierarchical such that bandwidth allocated to a single product group can be 3.1 Traffic Aggregates or FlowGroups subdivided to multiple users, which in turn may be hi- Individual services or users run jobs consisting of multiple erarchically allocated to applications, individual hosts tasks and ûnally ows. Diòerent allocation policies should . Each task may contain multiple Linux processes and be available at each level of the hierarchy. runs in a Linux container that provides resource accounting, isolation and information about user_name, job_name and ● Delegation or Attribution: Applications increasingly task_name. We modiûed the Linux networking stack to mark leverage computation and communication from a vari- the per-packet socket buòer structure to uniquely identify the ety of infrastructure services. Consider the case where originating container running the task. his allows BwE to a service writes data to a storage service, which in distinguish between traõc from diòerent tasks running on turn replicates the content to multiple WAN sites for the same host machine. availability. Since the storage service acts on behalf of BwE further classiûes task traõc based on destination clus- thousands of other services, its bandwidth should be ter address. Optionally, tasks use setsockopt() to indicate charged to the originating user. Bandwidth delegation other classiûcation information, e.g. for bandwidth delega- provides diòerential treatment across users sharing a tion. Delegation allows a task belonging to a shared infras- service, avoids head of line blocking across traõc for tructure service to attribute its bandwidth to the user whose diòerent users, and ensures that the same policies are request caused it to generate traõc. For example, a copy ser- applied across the network for a user’s traõc. vice can delegate bandwidth charges for a speciûc ûle transfer to the user requesting the transfer. We designed BwE to address the challenges and require- For scalability, baseline TCP regulates bandwidth for most ments described above around the principle that bandwidth application ows. BwE dynamically selects the subset of ows allocation should be extended all the way to end hosts. While accounting for most of the demand to enforce. Using TCP historically we have looked to routers with increasingly so- as the baseline also provides a convenient fallback for band- phisticated ASICs and control protocols for WAN bandwidth width allocation in the face of a BwE system failure. 
allocation, we argue that this design point has resulted sim- BwE allocates bandwidth among FlowGroups at various ply from lack of control over end hosts on the part of net- granularities, deûned below. work service providers. Assuming such access is available, we ûnd that the following functionality can be supported with ● Task FlowGroup or task-fg: . his FlowGroup is the ûnest unit groups, ii) exibly sub-dividing aggregate bandwidth alloca- of bandwidth measurement and enforcement. tions back to individual ows, iii) accounting for delegation job-fg of resource charging from one service to another, and iv) ex- ● Job FlowGroup or : Bandwidth usage across all task-fgs belonging to the same job is aggregated into pressing and enforcing exible max-min bandwidth sharing job-fg policies. On the contrary, existing routers must inherently a : . to map packets to one of a small number of service classes or ● User FlowGroup or user-fg: Bandwidth usage across tunnels. all job-fgs belonging to the same user is aggre- Figure † shows an instance of very high loss in multiple gated into a user-fg: . congestion control was not eòective and the loss remained high until we turned on admission control on hosts. ● Cluster FlowGroup or cluster-fg: Bandwidth usage across all user-fg belonging to same user_aggregate and belonging to same cluster-pair is combined into a cluster-fg:

3 f g f g destination_cluster>. he user_aggregate corresponds (a) Ï (b) † to an arbitrary grouping of users, typically by business Allocation Weight Bandwidth Allocation Weight Bandwidth group or product. his mapping is deûned in BwE con- Level (Gbps) Level (Gbps) Guaranteed « « Guaranteed Ï« Ï« ûguration (Section t.†). †« Ï« Best-Effort Ï« ∞ Best-Effort  ∞ ● Site FlowGroup or site-fg: Bandwidth usage for cluster- fgs belonging to the same site-pair is combined into a Table Ï:BwEConûguration Example. site-fg: . BwE creates a set of trees of FlowGroups with parent-child 25 25 relationships starting with site-fg at the root to cluster-fg, user- 20 20 15 15 fg, job-fg and eventually task-fg at the leaf. We measure band- 10 10 width usage at task-fg granularity in the host and aggregate site-fg 5 5 to the level. BwE estimates demand (Section E.Ï) for (Gbps) Bandwidth 0 (Gbps) Bandwidth 0 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 each FlowGroup based on its historical usage. BwE allocates Fair Share Fair Share bandwidth to site-fgs, which is redistributed down to task-fgs (a) f g (b) f g and enforced in the host kernel. Beyond rate limiting, the hi- Ï † erarchy can also be used to perform other actions on a ow group such as DSCP remarking. All measurements and rate Figure t:Example Bandwidth Functions. limiting are done on packet transmit. BwE policies are deûned at site-fg and user-fg level. Mea- he BwE conûguration maps users to user_aggregates. user-fg site-fg surement and enforcement happen at task-fg level. Other lev- Mapping from to can be derived from this. he site-fg els are required to scale the system by enabling distributed BwE conûguration policies describe how s share the user-fg site-fg execution of BwE across multiple machines in Google data- network and also describe how s within a share site-fg centers. bandwidth allocated to the . For all FlowGroups in a level of hierarchy, the BwE conûguration deûnes: Ï) band- 3.2 Bandwidth Sharing Policies width for each allocation level and †) within each allocation level, weight of the FlowGroup that can change based on allo- 3.2.1 Requirements cated bandwidth. An example of a BwE conûguration for the f g f g Our WAN(Figure Ï) is divided in two levels, the inter- relative priority for two FlowGroups, Ï and † is shown site network and the inter-cluster network. he links in the in Table Ï. l l l inter-site network ( ;, Ê and • in the ûgure) are the most expensive. Aggregated demands on these links are easier 3.2.3 Bandwidth Functions to predict.Hence, our WAN is provisioned at the inter- he conûgured sharing policies are represented inside BwE site network. Product groups (user_aggregates) create band- as bandwidth functionsÏ. A bandwidth function [Ï;] speciûes width requirements for each site-pair. For a site-pair, de- the bandwidth allocation to a FlowGroup as a function of pending on the network capacity and its business priority, its relative priority on an arbitrary, dimensionless measure of each user_aggregate gets approved bandwidth at several allo- available fair share capacity, which we call fair share. fair share cation levels. Allocation levels are in strict priority order, ex, is an abstract measure and is only used for internal computa- Guaranteed allocation level should be fully satisûed before al- tion by the allocation algorithm. Based on the conûg, every locating to Best-Eòort allocation level. 
Allocated bandwidth site-fg and user-fg is assigned a piece-wise linear monotonic of a user_aggregate for a site-pair is further divided to all its bandwidth function (e.g. Figure t). It is capped at the dynamic member users. estimated demand (Section E.Ï) of the FlowGroup. hey can Even though provisioning and sharing of the inter-site net- also be aggregated to create bandwidth functions at the higher work is the most important, several links not in the inter-site levels (Section t.†.). network may also get congested and there is a need to share he fair share dimension can be partitioned into regions their bandwidth fairly during congestion. We assign weights (corresponding to allocation levels in the BwE conûgura- to the users that are used to subdivide their user_aggregate’s tion) of strict priority. Within each region, the slope† of a allocated bandwidth in the inter-cluster network. To allow FlowGroup’s bandwidth function deûnes its relative priority more ûne grained control, we allow weights to change based or weight. Once the bandwidth reaches the maximum ap- on allocated bandwidth as well as to be overridden to a non- proved for the FlowGroup in a region, the bandwidth function default value for some cluster-pairs. attens (« slope) until the start of the next region. Once the 3.2.2 Configuration ÏBandwidth functions are similar to utility functions [Ê, E] Network administrators conûgure BwE sharing policies except that these are derived from static conûgured pol- icy (Section t.†) indicating network fair share rather than through centralized conûguration. BwE conûguration spec- application-speciûed utility as a function of allocated band- iûes a ûxed number of strict priority allocation levels, e.g., width. there may be two levels corresponding to Guaranteed and †Slope can be a multiple of weight as long as the same multi- Best-Eòort traõc. ple is used for all FlowGroups.

4 25.0 5.0 fg1(left) 22.5 4.5 fg2(left) fair share(right) 20.0 4.0

17.5 3.5

15.0 3.0

12.5 2.5

10.0 2.0 Fair Share Fair 7.5 1.5

5.0 1.0

2.5 0.5 Allocated Bandwidth (Gbps) 0.0 0.0 0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0 22.5 25.0 27.5 30.0 32.5 35.0 37.5 40.0 Link Available Bandwidth (Gbps)

Figure : Bandwidth Sharing on a Bottleneck Link.

Figure :BwEArchitecture. bandwidth function reaches the FlowGroup’s estimated de- chy (job-fgs and task-fgs) equally (no weights) based on their mand, it becomes at from that point for all the following estimated demands. regions. Figure t shows example bandwidth functions for two Flow- 3.2.4 Bandwidth Function Aggregation Groups, f g and f g , based on BwE conûguration as deûned Ï † Bandwidth Functions can be aggregated from one Flow- in Table Ï. here are two regions of fair share: Guaranteed («- Group level to another higher level. We require such aggre- †) and Best-Eòort (†- ). he endpoints for each region are ∞ gation when input conûguration deûnes a bandwidth func- system-level constants deûned in BwE conûguration. BwE’s tion at a ûner granularity, but the BwE algorithm runs over estimated demand of f g is ÏGbps and hence, its bandwidth Ï coarser granularity FlowGroups. For example, BwE’s input function attens past that point. Similarly, f g ’s estimated de- † conûguration provides bandwidth function at user-fg level, mand is †«Gbps. while BwE (Section .Ï) runs across cluster-fgs. In this case, We present a scenario where f g and f g are sharing one Ï † we aggregate user-fgs bandwidth functions to create a cluster- constrained link in the network. he goal of the BwE algo- fg bandwidth function. We create aggregated bandwidth func- rithm is to allocate the bandwidth of the constrained link tions for a FlowGroup by adding bandwidth value for each such the following constraints are satisûed: Ï) f g and f g get Ï † value of fair share for all its children. maximum possible but equal fair share, and †) sum of their al- fair share located bandwidth corresponding to the allocated 4. SYSTEM DESIGN is less than or equal to the available bandwidth of the link. Figure  shows the output of the BwE allocation algorithm BwE consists of a hierarchy of components that aggregate (Section .t) with varying link’s available bandwidth shown network usage statistics and enforce bandwidth allocations. on the x-axis. he allocated fair share to the FlowGroups is BwE obtains topology and other network state from a net- shown on the right y-axis and the corresponding bandwidth work model server and bandwidth sharing policies from an allocated to the FlowGroups is shown on the le y-axis. Note administrator-speciûed conûguration. Figure  shows the that the constraints above are always satisûed at each snap- functional components in BwE. shot of link’s available bandwidth. One can verify using this 4.1 Host Enforcer graph that the prioritization as deûned by Table Ï is respected. One of BwE’s principal responsibilities is to dynamically At the lowest level of the BwE hierarchy, the Host Enforcer determine the level of contention for a particular resource runs as a user space daemon on end hosts. Every ûve sec- (bandwidth) and to then assign the resource to all compet- onds, it reports bandwidth usage of local application’s tasks- ing FlowGroups based on current contention. Higher val- fgs to the Job Enforcer. In response, it receives bandwidth ues of fair share indicate lower levels of resource contention allocations for its task-fgs from the Job Enforcer. he Host and correspondingly higher levels of bandwidth that can po- Enforcer collects measurements and enforces bandwidth al- tentially be assigned to a FlowGroup. Actual consumption locations using the HTB (Hierarchical Token Bucket) queu- is capped by current FlowGroup estimated demand, making ing discipline in Linux. 
the allocation work-conserving (do not waste any available bandwidth if there is demand). 4.2 Job Enforcer he objective of BwE is the max-min fair [E] allocation of Job Enforcers aggregate usages from task-fgs to job-fgs and fair share to competing site-fgs and then the max-min fair report job-fgs’ usages every Ï« seconds to the Cluster En- allocation of fair share to user-fgs within a site-fg. For each forcer. In response, the Job Enforcer receives job-fgs’ band- user-fg, maximize the utilization of the allocated bandwidth width allocations from the Cluster Enforcer. he Job Enforcer to the user-fg by subdividing it to the lower levels of hierar- ensures that for each job-fg, bandwidth usage does not ex-

5 ceed its assigned allocation. To do so, it redistributes the as- ters to multiple other clusters, with each cluster pair utilizing signed job-fg allocation among the constituent task-fgs using multiple paths.Hence, the bandwidth allocation must simul- the WaterFill algorithm (Section .). taneously account for multiple potential bottlenecks. Here, we present an adaptation of the traditional max-min 4.3 Cluster Enforcer fairness objective for FlowGroups sharing a bottleneck link he Cluster Enforcer manages two levels of FlowGroup ag- to multipath cluster-to-cluster communication. We designed MultiPath Fair Allocation (MPFA) gregation - user-fgs to job-fgs and cluster-fgs to user-fgs.It a centralized algorithm to aggregates usages from job-fgs to user-fgs and computes user- determine global max-min fairness. We present a simpler fgs’ bandwidth functions based on input from a conûguration version of the problem with a single layer of FlowGroups ûle.It aggregates the user-fgs’ bandwidth functions (capped (Section .†) and then extend it to multiple layers of Flow- at their estimated demand) to cluster-fg bandwidth functions Groups with diòerent network abstractions in hierarchical MPFA (Section t.†.), reporting them every Ï seconds to the Global (Section .). cluster- Enforcer. In response, the Cluster Enforcer receives 5.1 Inputs and Outputs fgs’ bandwidth allocations, which it redistributes among user- task-fg band- fgs and subsequently to job-fgs (Section .). Inputs to the BwE algorithm are s’ demands, width functions of user-fgs and site-fgs and network paths for 4.4 Network Model Server cluster-fgs and site-fgs. We aggregate task-fgs’ demands all the site-fg user-fg bandwidth function he Network Model Server builds the abstract network way up to s and aggregate s’ s to cluster-fgs’ bandwidth functions (Section t.†.). We run model for BwE.Network information is collected by stan- MPFA site-fg cluster- dard monitoring mechanisms (such as SNMP). Freshness is global hierarchical (Section .) on s and fgs that results in cluster-fgs’ allocations. hen, we distribute critical since paths change dynamically. BwE targets getting cluster-fg task-fg an update every t« seconds. he consistency of the model is s’ allocations to s (Section .), which are en- veriûed using independent mechanisms such as traceroute. forced at the hosts. 4.5 Global Enforcer 5.2 MPFA Problem Inputs for MPFA are: he Global Enforcer sits at the top of the Bandwidth En- forcer hierarchy.It divides available bandwidth capacity on Ï. Set of n FlowGroups, F = { fi , ∀i S Ï ≤ i ≤ n} where the network between diòerent clusters. he Global Enforcer FlowGroups are deûned in Section t.Ï. Each fi has an bandwidth function bandwidth function B f B f takes the following inputs: i) s from the associated (Section t.†.t), i . i Cluster Enforcers summarizing priority across all users at maps fair share to bandwidth for fi .If fi is allocated fair cluster-fg level, ii) global conûguration describing the shar- share of s, then it should be allocated bandwidth equal site-fg B f s ing policies at level, and iii) network topology, link ca- to i ( ). pacity, link utilization and drop statistics from the network m lk k k m lk model server. A small fraction of ows going over a link may †. Set of links, L = { , ∀ S Ï ≤ ≤ }. Each link has an associated allocatable capacity cl . not be under BwE control. To handle this, for every link we k dark bandwidth f also compute . his is the amount of traõc n f fi p i t. Set of i paths for each . 
Each path, j , has an as- going over the link which BwE is unaware of. his may be due f w i j n f to packet header overhead (particularly tunneling in network sociated weight, j , where Ï ≤ ≤ i and for each f f i i routers) or various failure conditions where BwE has incom- fi j n w p , ∑Ï≤ ≤ f j = Ï. Each path, j , is a set of links, i.e, f i plete information. Dark bandwidth is the smoothed value of i p j . (actual link usage - BwE reported link usage), and link al- ⊆ L locatable capacity is (link capacity - dark bandwidth). BwE We deûne the fraction of fi that traverse lk as FR( fi , lk ). f f reported link usage is computed by taking the set of ows i i his is calculated as the sum of weights, w j , for all paths, p j , (and their current usage) reported to the Global Enforcer by f f l p i Cluster Enforcers, and mapping them to the paths and links for the FlowGroup, i , such that k ∈ j . f for those ows. Given these inputs, the Global Enforcer runs FR fi lk w i MPFA cluster-fg ( , ) = Q j hierarchical (Section .t) to compute s’ band- ≤ j≤n Sl ∈p fi Ï fi k j width allocations and sends these allocations to Cluster En- he output of MPFA is the max-min fair share alloca- forcers. s f fi tion i to each FlowGroup, , such that ascending sorted s f s f s f ( Ï , † , ... , n ) is maximized in lexicographical order. Such 5. BWE ALLOCATION ALGORITHM maximization is subject to the constraint of satisfying capac- l One of the challenges we faced was deûning the optimiza- ity constraints for all links, k . FR fi lk B f s f cl Q ( , ) × i ( i ) ≤ k tion objective for bandwidth allocation to individual ows. f ∀ i First, we did not wish to allocate bandwidth among compet- ing -tuple ows but rather to competing FlowGroups. Sec- 5.3 MPFA Algorithm ond, services do not compete for bandwidth at a single bottle- he MPFA algorithm (Algorithm Ï) can be described in the neck link because services communicate from multiple clus- following high-level steps:

6 Input: f i i n FlowGroups, F ∶ { i , ∀ S Ï ≤ ≤ }; l k k m Links, L ∶ { k , ∀ S Ï ≤ ≤ }; l c k k m Allocatable capacities for ∀ k : { l , ∀ S Ï ≤ ≤ }; bandwidth function f B k for i : fi ; f l f l // ∀ i , ∀ k , Fraction of i traversing link, k FR f l Function, ( i , k ) : Output is a fraction ≤ Ï; Output: f s i i n Allocated fair share for ∀ i :{ fi , ∀ S Ï ≤ ≤ }; Figure E: MPFA Example. b Bottleneck Links , L ← ∅; l bandwidth func- Frozen FlowGroups, F f ← ∅; Ï. For each link, k , calculate the link’s f s tion B bandwidth functions foreach i do fi ← ∞; , l , by aggregating of all l k // Calculate bandwidth function for each k non-frozent FlowGroups, fi , in appropriate fractions, l s B s FR f l B s foreach k do ∀ , l ( ) ← ∀ ( i , k ) × f ( ) ; FR f l B fair share s k ∑ fi i ( i , k ). l maps , , to allocation on the l l b f f f ) l k l while (∃ k S k ∉ L ) ∧ (∃ i S i ∉ F do link k when all FlowGroups traversing k are allocated Bottleneck link, l null; fair share s b ← the of . Min Bottleneck fair share, smin ← ∞; b foreach l b do †. Find the bottleneck fair share, sl , for each remaining k ∉ L k sb c B sb Find S l = l ( ); (not bottlenecked yet) link, lk , by ûnding the fair share lk k k lk sb smin smin sb l l cl bandwidth if lk < then ← lk ; b ← k ; corresponding to its capacity, k in the link’s function Bl bandwidth function l null l b , k . Since is a piece-wise if b ≠ then Add b to L ; linear monotonic function, ûnding fair share for a given else break; f l capacity can be achieved by a binary search of the in- // Freeze i taking the bottleneck link, b f FR f l f f teresting points (points where the slope of the function foreach i S ( i , b ) > « ∧ i ∉ F do f f s smin changes). Add i to F ; fi ← ; // Remove allocated bandwidth from B fi t. he link, lb, with the minimum bottleneck fair share, s, B s max «, B s B smin ; min ∀ fi ( ) ← ( fi ( ) − fi ( )) s // Subtract B from B for all its links is the next bottleneck. If the minimum bottleneck fi lk fair share l FR f l l b equals ∞, then terminate. foreach k S ( i , k ) > « ∧ k ∉ L do s B s B s FR f l B s ∀ , lk ( ) ← lk ( ) − ( i , k ) × fi ( ); . Mark the link, lb, as a bottleneck link. Freeze all Flow- Groups, fi , with non-zero fraction, FR fi , lb , on lb. ( ) Algorithm Ï: MPFA Algorithm Frozen FlowGroups are not considered to ûnd further bottleneck links. Subtract frozen FlowGroups’ band- min width functions beyond the bottleneck fair share,s from all remaining links. ⎧ †.;s « s • «.; min ÏÊ, s ⎪ ∶ ≤ < Bl s ( ( )) ⎪ s s † ( ) = ‹ s  = ⎨ «.; + ÏÊ ∶ • ≤ < ÏÊ . If any link is not a bottleneck, continue to step †. + min(ÏÊ, † ) ⎪ s ⎩⎪ tÏ. ∶ ≥ ÏÊ Figure E shows an example of the allocation algorithm with «.†s « s ÏÊ l l l Bl s s ∶ ≤ < three links, Ï, † and t, with capacity Ït, Ït and  respectively. t ( ) = «.†(min(ÏÊ, )) = › . ∶ s ≥ ÏÊ Assume all bandwidth numbers are in Gbps for this exam- b f l Next, we ûnd bottleneck fair share, sl for each link, lk , ple. here are three FlowGroups: Ï) Ï takes two paths ( † and k l l f B sb c sb sb Ï → t) with weights «.; and «.† respectively, †) † takes such that l ( l ) = l . his results in l = , l ≈ .;†, l f l k k k Ï † one path ( †), and t) t taking one path ( Ï). All FlowGroups b sl ÏE. his makes l the bottleneck link and freezes both f f f t = Ï have demand of ÏÊ. Assume Ï, † and t have weights of Ï, † f f fair share l and t respectively, corresponding bandwidth functions of the Ï and t at of . Ï will not further participate in MPFA f fair share Bl Bl B f s s B f s s . 
Since Ï is frozen at of , † and t need to FlowGroups are: Ï ( ) = min(ÏÊ, ), † ( ) = min(ÏÊ, † ) B f fair share B f s s be updated to not account for Ï beyond of . he and t ( ) = min(ÏÊ, t ). f updated functions are: Based on that paths, fraction of FlowGroups( i ) travers- s s l FR f l FR f l ⎪⎧ †.; ∶ « ≤ <  ing Links( k ) are: ( Ï, Ï) = «.†, ( Ï, †) = «.;, ⎪ Bl s ⎪ s s FR f l FR f l FR f l † ( ) = ⎨ † + t ∶  ≤ < • ( Ï, t) = «.†, ( †, †) = Ï and ( t, Ï) = Ï. ⎪ †Ï s • We calculate bandwidth function for links as: ⎩⎪ ∶ ≥ ⎧ s s s ⎪ t.† ∶ « ≤ < E «.†s « s  «.†(min(ÏÊ, )) ⎪ Bl s ∶ ≤ < Bl s s s t ( ) = › s Ï ( ) = ‹ s  = ⎨ «.† + ÏÊ ∶ E ≤ < ÏÊ Ï  + min(ÏÊ, t ) ⎪ s ∶ ≥ ⎪ ††. ∶ ≥ ÏÊ b b ⎩ We recalculate sl and sl based on the new values for Bl t † t † A frozen FlowGroup is a FlowGroup that is already bottle- b b MPFA and Bl . his results in sl  and sl . l is the next bot- necked at a link and does not participate in the algo- t † = t = ∞ † fair share f fair share rithm run any further. tleneck with of . † is now frozen at the of

7 . Since all FlowGroups are frozen, MPFA terminates. he û- f f f fair share nal allocation to ( Ï, †, t) in is (, , ), translating to (Gbps, Ï«Gbps, φGbps) using the corresponding band- width functions l l . his allocation ûlls bottleneck links, Ï and † completely and fair share allocation (, , ) is max-min fair with the given pathing constraints. No FlowGroup’s alloca- tion can be increased without penalizing other FlowGroups with lower or equal fair share. 5.3.1 Interaction with Traffic Engineering (TE) Figure ;:Allocation Using WaterFill. he BwE algorithm takes paths and their weights as input. A separate system, TE[Ï;, ÏÏ, φ], is responsible for ûnding up the bandwidth allocation for each user-fg using its band- optimal pathing that improves BwE allocation. Both BwE width function. and TE are trying to optimize network in a fair Bandwidth distribution from a user-fg to job-fgs and from way and input ows are known in advance. However, the a job-fg to task-fgs is simple max-min fair allocation of one key diòerence is that in BwE problem formulation, paths and resource to several competing FlowGroups using a WaterFill their weights are input constraints, where-as for TE[Ï;, ÏÏ, as shown in Figure ;. WaterFill calculates the water level cor- φ], paths and their weights are output. In our network, we responding to the maximum allocation to any FlowGroup. treat TE and BwE as independent problems. he allocation to each child FlowGroup is TE has more degrees of freedom and hence can achieve min(demand, waterlevel).If there is excess band- higher fairness. In the above example, the ûnal allocation width still remaining aer running the WaterFill, it is f l can be more max-min fair if Ï only uses the path †. In this divided among the FlowGroups as bonus bandwidth. Since case, MPFA will allocate fair share to ow groups ≈ (.tt, .tt, some (or a majority) of the FlowGroups will not use the ,tt) with corresponding bandwidth of (.ttGbps,Ê.EEGbps, bonus assigned to them, the bonus is over-allocated by a ÏtGbps). Hence, a good traõc engineering solution results in conûgurable scaling factor. better (more max-min fair) BwE allocations. We run TE[Ï;] and BwE independently because they work at diòerent time-scales and diòerent topology granularity. 5.5 Hierarchical MPFA Since TE is more complex, we aggregate topology to site- Next, we describe hierarchical MPFA, which reconciles the level where-as for BwE, we are able to run at a more gran- complexity between site-fg and cluster-fg level allocation. he ular cluster-level topology. TE re-optimizes network less of- fairness goal is to allocate max-min fair share to site-fg re- ten because changing network paths may result in packet re- specting bandwidth functions and simultaneously observing ordering, transient loss [ÏE] and resulting changes inter-site and intra-site topological constraints (Figure Ï). Be- may add signiûcant load to network routers. Separation of cause not all cluster-fgs within a site-fg share the same WAN TE and BwE also gives us operational exibility. he fact paths, individual cluster-fgs within a site-fg may bottleneck that both systems have the same higher level objective func- on diòerent intra-site links. tion helps ensure that their decisions are aligned and eõcient. We motivate hierarchical fairness using an example based l Even though in our network we run these independently the on Figure Ï. All links have Ï««Gbps capacity, except Ï l site-fg s f S possibility of having a single system to do both can not be (Gbps) and • («Gbps). 
here are two s, Ï from Ï S s f S S s f cluster-fg c f ruled out in future. to t and † from † to t. Ï consists of s: Ï CÏ CÏ c f C† CÏ s f cluster- from Ï to t and † from Ï to t. † consists of a 5.4 Allocation Distribution fg c f CÏ CÏ site-fg : t from † to t. All s have equal weights and for MPFA site-fg cluster-fg c f allocates bandwidth to the highest level of aggrega- each , all its member s have equal weights. Ï site-fg c f c f tions, s. his allocation needs to be distributed to lower and t have Ï««Gbps of demand while † has a Gbps de- cluster- MPFA site-fg s f s f levels of aggregation. Distribution of allocation from mand.If we run naively on s, then Ï and † fg l to lower levels is simpler since the network abstraction will be allocated †«Gbps each due to the bottleneck link, •. s f c f does not change and the set of paths remains the same during However, when we further subdivide Ï’s †«Gbps among Ï c f c f l de-aggregation. We describe such distributions in this sec- and †, Ï only receives Gbps due to the bottleneck link Ï site-fg cluster-fg c f c f s f tion. he distribution from to is more com- while † only has demand of Gbps. t receives all of †’s plex since the network abstraction changes from site-level to †«Gbps allocation. MPFA l cluster-level (Figure Ï), requiring an extension of to With this naive approach, the ûnal total allocation on • is MPFA c f Hierarchical (Section .) to allocate bandwidth di- t«Gbps wasting Ï«Gbps, where t could have used the extra rectly to cluster-fgs while honoring fairness and network ab- Ï«Gbps. Allocation at the site level must account for indepen- stractions at site-fg and cluster-fg level. dent bottlenecks in the topology one layer down. Hence, we To distribute allocation from a cluster-fg to user-fgs, we cal- present an eõcient hierarchical MPFA to allocate max-min culate the aggregated bandwidth functions for the cluster-fgs fair bandwidth among site-fgs while accounting for cluster- u (Section t.†.) and determine the fair share, s , correspond- level topology and fairness among cluster-fgs. u ing to the cluster-fg’s bandwidth allocation. We use s to look he goals of hierarchical MPFA are:

8 e fair share site-fg Bcf (Input) B (Output) ● Ensure max-min fairness of across 1 cf1 site-fg bandwidth function 30.0 30.0 0 based on s’ s. 25.0 1 0.2 25.0 22.5 0.067 22.5 20.0 20.0 site-fg fair share Apply TÏ 0.167 Within a , ensure max-min fairness of 15.0 on fair share 15.0 ● 0.75 across cluster-fgs using cluster-fgs’ bandwidth functions. 2 Ô⇒

Bandwidth (Gbps) Bandwidth 0.0 0.0 (Gbps) Bandwidth 0 0 10 15 20 20 50

he algorithm should be work-conserving. 7.5

● 100 110 12.5 87.5 Fair Share Fair Share ● he algorithm should not over-allocate any link in the network, hence, should enforce capacity constraints of + e Bcf (Input) B (Output) intra-site and inter-site links. 2 cf2 30.0 30.0 0 0 25.0 25.0 MPFA MPFA cluster- 0.133 For hierarchical , we must run on all 20.0 20.0 Apply TÏ 0.167 fg 15.0 on fair share 15.0 s to ensure that bottleneck links are fully utilized and en- 2 Ô⇒ 0.75 forced. To do so, we must create eòective bandwidth func- tion cluster-fg site-fg (Gbps) Bandwidth 0.0 0.0 (Gbps) Bandwidth

s for s such that the fairness among s and 0 0 10 15 20 20 50 7.5 100 110 site-fg 12.5 87.5 fairness within a are honored. Fair Share Fair Share We enhance MPFA in the following way. In addition to bandwidth function Bc f cluster-fg c fi = , i , for , , we further con- a B (Input) B sf (Calculated) sf1 bandwidth function Bs f site-fg s fx 1 sider the , for , . Using 55.0 Calculate TÏ: 55.0 x 1 0 ∀ 50.0 47.5 Bandwidth i Bc f x Bs f bandwidth func- 3 0.2 ∀ , i and ∀ , x , we derive the eòective map fair share e 40.0 a → 0.33 40.0 tion B c f Bs f Bs fÏ , c f , for i . 30.0 Ï 30.0 i T (;.) = †« e 4 Ï 1.5 B Bc f fair share TÏ(Ï«) = « We create c f by transforming i along the di- i TÏ(φ.) = Ê;.

c f (Gbps) Bandwidth T (Ï) = Ï«« (Gbps) Bandwidth mension while preserving the relative priorities of i with 0.0 Ï 0.0 0 TÏ(> Ï) = ∞ 0 10 15 20 20 50 7.5 100 110 12.5 87.5 respect to each other. We call bandwidth values of diòerent Fair Share Fair Share c fi as equivalent if they map to the same fair share based on their respective bandwidth functions. To preserve rela- Figure Ê: Bandwidth Function Transformation Example tive priorities of ∀c fi ∈ s fx , the set of equivalent bandwidth bandwidth values should be identical before and aer the Again, just applying the transformation at the inter- functions transformation. Any transformation applied in fair share esting points (points where the slope of the function should preserve this property as long as the same trans- changes) is suõcient. formation is applied to all c fi ∈ s fx . Allocated bandwidth to each c fi on a given available capacity (e.g. Figure ) should An example of creating eòective bandwidth function is be unchanged due to such transformation. In addition, we shown in Figure Ê. MPFA algorithm as described in Sec- must ûnd a transformation such that when all c fi ∈ s fx use tion .t is run over cluster-fgs as FlowGroups with their ef- e bandwidth function their eòective (transformed) bandwidth functions, Bc f , they fective s to achieve hierarchical fairness. i e MPFA can together exactly replace s fx . his means that when Bc f When we run hierarchical in the topology shown e i in Figure Ï, the allocation to c f increases to t«Gbps, fully are added together, it equals Bs f . s, ∀iSc f ∈s f Bc f s t x ∀ ∑ i x i ( ) = l c f using bottleneck link •. However, if † has higher demand Bs f s x ( ). c f Be (say Ï««Gbps), then it will not receive beneût of Ï being he steps to create c f are: s f i bottlenecked early and Ï will not receive its full fair share bandwidth function Ï. For each site-fg, s fx , create aggregated bandwidth func- of †«Gbps. To resolve this, we rerun the a site-fg cluster- tion, Bs f (Section t.†.): transformation for a when any of its member x fgs is frozen due to an intra-site bottleneck link. a s Bs f s Bc f s ∀ , x ( ) = Q i ( ) ∀c f ∈s f i x 6. SYSTEM IMPLEMENTATION a his section describes various insights, design and imple- †. Find a transformation function of fair share from Bs f x mentation considerations that made BwE a practical and use- to Bs f . he transformation function, Tx is deûned as: x ful system. a Tx s s B s Bs f s ( ) = ¯ S s f ( ) = x (¯) x 6.1 Demand Estimation Note that since bandwidth function is piece-wise linear Estimating demand correctly is important for fair alloca- monotonic function, just ûnd Tx (s) for values for in- a tion and high network utilization. Estimated demand should teresting points (where slope changes in either Bs f or x be greater than current usage to allow each FlowGroup to Bs f ). x ramp its bandwidth use. But high estimated demand (com- t. For each c fi s fx , apply Tx on fair share dimension of pared to usage) of a high priority FlowGroup can waste band- e∈ Bc f Bc f width. In our experience, asking users to estimate their de- i to get i . e mand is untenable because user estimates are wildly inaccu- Bc f Tx s Bc f s i ( ( )) = i ( ) rate.Hence, BwE employs actual, near real-time measure-

9 ments of application usage to estimate demand. BwE es- timates FlowGroup demand by maintaining usage history: Demand usage scale min demand = max(max∆t( ) × , _ ) We take the peak of a FlowGroup’s usage across ∆t time interval, multiply it with a factor scale > Ï and take the max with min_demand. Without the concept of min_demand, small ows (few Kbps) would ramp to their real demand too slowly. Empirically, we found that ∆t = φ«s, scale = Ï.Ï and min_demand = Ï«Mbps works well for user-fg for our net- work applications. We use diòerent values of min_demand at diòerent levels of the hierarchy. Figure •: Improving Network Utilization 6.2 WaterFill Allocation For Bursty Flows fgs. Because multiple such runs across all FlowGroups does he demands used in our WaterFill algorithm (Section .) not scale, we run one instance of the global algorithm and are based on peak historical usage and diòerent child Flow- pass to Cluster Enforcers the bandwidth function for the most Groups can peak at diòerent times. his results in demand constrained link for each FlowGroup. Assuming the most over-estimation and subsequently the WaterFill allocations constrained link does not change, the Cluster Enforcer can can be too conservative. To account for burstiness and the re- eõciently calculate allocation for a FlowGroup with ∞ de- sulting statistical , we estimate a burstiness factor mand in the constrained link, assuming it becomes the bot- (≥ Ï) for each FlowGroup based on its demand and sum of its tleneck. children’s demand: estimated demand burstiness f actor ∑∀chil dren 6.4 Improving Network Utilization = parent′s estimated demand BwE allows network administrators to increase link uti- Since estimated demand is based on peak historical usage lization by deploying high throughput NETBLT [Ï«]-like (Section E.Ï), the burstiness factor of a FlowGroup is a mea- protocols for copy traõc. BwE is responsible for determin- sure of sum of peak usages of children divided by peak of sum ing the ow transmission rate for these protocols. We mark of usages of the children. We multiply a FlowGroup’s alloca- packets for such copy ows with low priority DSCP values so tion by its burstiness factor before running the WaterFill. his that they absorb most of the transient loss. allows its children to burst as long as they are not bursting To ensure that the system achieves high utilization (>•«ì) together.If a FlowGroup’s children burst at uncoordinated without aòecting latency/loss sensitive ows such as web and times, then the burstiness factor is high, otherwise the value video traõc, the BwE Global Enforcer supports two rounds will be close to Ï. of allocation. 6.3 Fair Allocation for Satisfied Flow- Groups ● In the ûrst round, link capacities are set conservatively (for example at •«ì of actual capacity). All traõc types Satisûed FlowGroup A is one whose demand is less than are allowed to participate in this round of allocation. or equal to its allocation. Initially, we throttled each satis- ûed FlowGroup strictly to its estimated demand. However ● In the second round, the Global Enforcer allocates only we found that latency sensitive applications could not ramp copy traõc, but it scales up the links aggressively, e.g., fast enough to their fair share on a congested link. We next to Ï«ì of link capacity. eliminated throttling allocations for all satisûed FlowGroups. 
However, this lead to oscillations in system behavior as a ● We also adjust link scaling factors depending on loss FlowGroup switched between throttled and unthrottled each on the link. If a link shows loss for higher QoS classes, time its usage increased. we reduce the scaling factor. his allows us to better Our current approach is to assign satisûed FlowGroups a achieve a balance between loss and utilization on a link. stable allocation that reects the fair share at inûnite demand. his allocation is a FlowGroup’s allocation if its demand grew Figure • shows link utilization increasing from Ê«ì to •Êì to inûnity while demand for other FlowGroups remained as we adjust the link capacity. he corresponding loss for the same. When a high priority satisûed FlowGroup’s usage copy traõc also increases to an average †ì loss with no in- increases, it will ramp almost immediately to its fair share. creases in loss for loss-sensitive traõc. Other low-priority FlowGroups will be throttled at the next iteration of the BwE control loop. his implies that the ca- 6.5 Redundancy and Failure Handling pacity of a constrained link is oversubscribed and can result For scale and fault tolerance, we run multiple replicas at in transient loss if a FlowGroup’s usage suddenly increases. each level of the BwE hierarchy. here are N live and M cold he naive approach for implementing user-fg allocation standby Job Enforcers in each cluster. Hosts report all task-fgs involves running our global allocation algorithm multiple belonging to the same job-fg to the same Job Enforcer, shard- times for each FlowGroup, assigning inûnite demand to the ing diòerent job-fgs across Job Enforcers by hashing .

10 Cluster Enforcers run as master/hot standby pairs. Job En- forcers report all information to both. Both instances in- 1.6M 180.0M 150.0M dependently run the allocation algorithm and return band- 1.2M 120.0M 800.0k 90.0M width allocations to Job Enforcers. he Job Enforcers enforce 60.0M 400.0k 30.0M the bandwidth allocations received from the master.If the 50.0k 0.0 task flow groups master is unreachable Job Enforcers switch to the allocations Feb 2012 Jul 2012 Jan 2013 Jul 2013 Dec 2013 user/job/cluster flowgroups user flow groups (left) cluster flow groups (left) received from the standby. We employ a similar redundancy job flow groups (left) task flow groups (right) approach between Global Enforcers and Cluster Enforcers. Communication between BwE components is high prior- Figure Ï«:FlowGroup Counts ity and is not enforced. However, there can be edge scenar- 700.0M 550 ios where BwE components are unable to communicate with 600.0M 500 500.0M 450 each other. Some examples are: BwE job failures (e.g. binaries 400.0M 400 350 300.0M Bits/s

300 Cores go into a crash loop) causing hosts to stop receiving updated 200.0M 250 100.0M 200 bandwidth allocations, network routing failures preventing 0.0 150 Cluster Enforcers from receiving allocation from the Global Feb 2012 Jul 2012 Jan 2013 Jul 2013 Dec 2013 Cores (right) Control Traffic - Total Enforcers, or the network model becoming stale. Control Traffic - WAN he general strategy for handling these failures is that we continue to use last known state (bandwidth allocations or Figure ÏÏ: Resource overhead (Control System) capacity) for several minutes. For longer/sustained failures, in most cases we eliminate allocations and rely on QoS and ûne-grained bandwidth allocation and decreasing reporting TCP congestion management. For some traõc patterns such interval for enforcement frequency. as copy-traõc we set a low static allocation. We have found Figure Ï« shows growth in FlowGroups over time. As ex- this design pattern of defense by falling back to sub-optimal pected, as we go up in the hierarchy the number of Flow- but still operable baseline systems invaluable to building ro- Groups drops signiûcantly, allowing us to scale global com- bust network infrastructure. ponents. Figure ÏÏ shows the amount of resources used by our distributed deployment (excepting per-host overhead). It 7. EVALUATION also shows the communication overhead of the control plane. We can conclude that the overall cost is very small relative to enforcing traõc generated by hundreds of thousands of cores 7.1 Micro-benchmarks on Test Jobs using terabits/sec of bandwidth capacity. We begin with some micro-benchmarks of the live BwE Table † shows the number of FlowGroups on a congested system to establish its baseline behavior. Figure φ(a) demon- link at one point in time relative to all outgoing ow groups strates BwE fairness across users running diòerent number from a major cluster enforcer.It gives an idea of overall scale of TCP connections. Two users send traõc across a network in terms of the number of competing entities. here are por- conûgured to have Ï««Mbps of available capacity between the tions of the network with millions of competing FlowGroups. source/destination clusters. UserÏ has two connections and a Table t shows our algorithm run time at various levels in the weight of one. We vary the number of connections for User† hierarchy. We show max and average (across multiple in- (shown on the x-axis) and its BwE assigned weight. he graph stances) for each level except global. Overall, our goal is to shows the throughput ratio is equivalent to the users weight enforce large ows in a few minutes, which we are able to ratio independent of the number of competing TCP ows. achieve. he table also shows that the frequency of collecting Next we show how quickly BwE can enforce bandwidth al- and distributing data is a major contributing factor to reac- locations with and without the inûnite demand feature (Sec- tion time. tion E.t). In this scenario there are † users on a simulated Ï«« Mbps link. Initially, UserÏ has weight of t and User† has site cluster user job task One Congested Link ttE t.«k «k ««k φ;Ïk weight of Ï. At φ«s, we change the weight of User† to φ. In Largest Cluster Enforcer  t.k Etk ÏEk ÏEE«k Figure φ(b), where the inûnite demand feature is disabled, Avg across cluster enforcers t Ï.k ††k E«k E•Ek we observe that BwE converges at Ê«s. 
In Figure φ(c), where Global Ï• ;.k Eʆk Ïʆk ϕ«ÊÊk *-fg inûnite demand feature is enabled, we observe it converges at Table †: Number of s (at various levels in BwE) for a con- ÏE«s. his demonstrates BwE can enforce bandwidth alloca- gested link and for a large cluster enforcer. tions and converge in intervals of tens of seconds. his delay Algo Run-time Algo Reporting is reasonable for our production WAN network since large Max(s) Mean(s) Interval(s) Interval(s) bandwidth consumers are primarily copy traõc. Global Enforcer t - Ï« Ï« Cluster Enforcer .ÏE .Ï  Ï« Job Enforcer <«.«Ï <«.«Ï   7.2 System Scale Table t:Algorithm run time and feedback cycle in seconds. A signiûcant challenge for BwE deployment is the system’s Algorithm interval is how frequently algorithm is invoked sheer scale. Apart from organic growth to ows and network and Reporting interval is the duration between two reports scale other reasons that aòect system scale were supporting from the children in BwE hierarchy.

11 (a) BwE Fairness (b) Allocation Capped at Demand (c) Allocation not Capped at Demand

Figure φ:BwE compliance

Min Max Mean exploring abstractions where we can provide fair share to all Number of user-fg tì ÏÏì ;ì Number of job-fg tì Ï«ì Eì users across all bottleneck links irrespective of the number Usage •ì ••ì •;ì and placement of communicating cluster pairs.Other areas Table : Percentage of FlowGroups enforced. of research include improving the reaction time of the control system while scaling to a large number of FlowGroups, pro- user-fg job-fg viding fairness at longer timescales (hours or days) and in- BwE tracks about ««k s and millions of s, but cluding ow deadlines in the objective function for fairness. only enforces a small fraction of these.Processing for unen- forced FlowGroups is lightweight at higher levels of the BwE 8.1 Broader Applicability hierarchy allowing the system to scale. Table  shows the frac- Our work is motivated by the observation that per-ow tion of enforced ows and the fraction of total usage they rep- bandwidth allocation is no longer the ideal abstraction for resent. BwE only enforces Ï«ì of the ows but these ows ac- emerging WAN use cases. We have seen substantial bene- count for •ì of the traõc. We also found that for congested ût of BwE to applications for WAN distributed computing links, those with more than Ê«ì utilization, less than Ïì of and believe that it is also applicable to a number of emerg- the utilization belonged to unenforced ows. his indicates ing application classes in the broader Internet. For example, BwE is able to focus its work on the subset of ows that most video streaming services for collaboration or entertainment contribute to utilization and any congestion. increasingly dominate WAN communication. hese appli- We introduced a number of system optimizations to cations have well-understood bandwidth requirements with address growth along the following dimensions: Ï) Flow step functions in additional utility from incremental band- Groups: Organic growth and increase in speciûcity (for e.g., width allocation. Consider that a Ê«P video stream may delegation). For example, the overall design has been serv- receive no incremental beneût from an additional Ï««Kbps ing us through a growth of Ï«x from BwE’s inception (†«M to of bandwidth; only suõcient additional bandwidth to enable †««M). †)Paths: Traõc Engineering introduced new paths in ;†«P streaming is useful. Finally, homes and businesses are the system. t) Links: Organic network Growth ) Reporting trending toward multiple simultaneous video streams with frequency: there is a balance between enforcement accuracy known relative priority and incremental bandwidth utility, all and resource overhead. ) Bottleneck links: he number of sharing a single bottleneck with known capacity. bottleneck links aòects the overall run time of the algorithm Next, consider the move toward an Internet of hings [†•] on the Global Enforcer. where hundreds of devices in a home or business may have varying wide-area communication requirements. hese ap- 8. DISCUSSION AND FUTURE WORK plications may range from home automation, to security, BwE requires accurate network modeling since it is a key health monitoring, to backup. For instance, home security input to the allocation problem. his is challenging in an en- may have moderate bandwidth requirements but be of the vironment where devices fail oen and new technologies are highest priority. Remote backup may have substantial, sus- being introduced rapidly. In many cases, we lack standard tained bandwidth requirements. 
We introduced a number of system optimizations to address growth along the following dimensions: 1) FlowGroups: organic growth and an increase in specificity (e.g., delegation); the overall design has served us through a 10x growth since BwE's inception (20M to 200M). 2) Paths: Traffic Engineering introduced new paths into the system. 3) Links: organic network growth. 4) Reporting frequency: there is a balance between enforcement accuracy and resource overhead. 5) Bottleneck links: the number of bottleneck links affects the overall run time of the algorithm on the Global Enforcer.

8. DISCUSSION AND FUTURE WORK

BwE requires accurate network modeling since it is a key input to the allocation problem. This is challenging in an environment where devices fail often and new technologies are introduced rapidly. In many cases, we lack standard APIs to expose network topology, routing, and pathing information. With Software Defined Networking, we hope to see improvements in this area. Another challenge is that, for scalability, we often combine a symmetric full mesh topology into a single abstract link. This assumption however breaks during network failures, and handling these edge cases continues to be a challenge.
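The aggregation step can be illustrated with a minimal sketch, assuming a simple representation of parallel links between a site pair (the data structures and numbers below are invented for illustration, not taken from BwE's topology model):

    # Minimal sketch: collapse parallel physical links between a site pair
    # into a single abstract link. Names and capacities are assumptions.

    from dataclasses import dataclass

    @dataclass
    class PhysicalLink:
        src: str
        dst: str
        capacity_gbps: float
        healthy: bool = True

    def abstract_capacity(links, src, dst):
        """Abstract link capacity = sum of healthy parallel links for (src, dst).
        Failures make the mesh asymmetric, which is why the text calls out
        failure handling as a continuing challenge."""
        return sum(l.capacity_gbps
                   for l in links
                   if l.src == src and l.dst == dst and l.healthy)

    mesh = [PhysicalLink("A", "B", 100.0), PhysicalLink("A", "B", 100.0),
            PhysicalLink("A", "B", 100.0, healthy=False)]
    print(abstract_capacity(mesh, "A", "B"))  # 200.0 after one member fails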
Our FlowGroup abstraction is limited in that it allows users sending from multiple source/destination cluster pairs over the same bottleneck link to gain an advantage. We are exploring abstractions where we can provide a fair share to all users across all bottleneck links irrespective of the number and placement of communicating cluster pairs. Other areas of research include improving the reaction time of the control system while scaling to a large number of FlowGroups, providing fairness at longer timescales (hours or days), and including flow deadlines in the objective function for fairness.

8.1 Broader Applicability

Our work is motivated by the observation that per-flow bandwidth allocation is no longer the ideal abstraction for emerging WAN use cases. We have seen substantial benefit of BwE to applications for WAN distributed computing and believe that it is also applicable to a number of emerging application classes in the broader Internet. For example, video streaming services for collaboration or entertainment increasingly dominate WAN communication. These applications have well-understood bandwidth requirements with step functions in the additional utility from incremental bandwidth allocation. Consider that a 480P video stream may receive no incremental benefit from an additional 100 Kbps of bandwidth; only sufficient additional bandwidth to enable 720P streaming is useful. Finally, homes and businesses are trending toward multiple simultaneous video streams with known relative priority and incremental bandwidth utility, all sharing a single bottleneck with known capacity.
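Step-shaped requirements of this kind are a natural fit for the kind of bandwidth-function policy BwE allocates against. The sketch below is a hypothetical encoding of the video example as a step utility function; the bitrates and utility values are assumed for illustration and do not come from any BwE configuration:

    # Illustrative step "bandwidth function" for a video stream.
    # The (bandwidth, utility) steps are assumed example values.

    STEPS = [            # (required Mbps, utility gained at that step)
        (1.0, 10),       # enough for 480P
        (3.0, 20),       # enough for 720P
        (6.0, 25),       # enough for 1080P
    ]

    def utility(alloc_mbps):
        """Utility is flat between steps: extra bandwidth that does not
        reach the next step buys nothing, mirroring the 480P/720P example."""
        return sum(gain for need, gain in STEPS if alloc_mbps >= need)

    assert utility(1.1) == utility(2.9)   # a few hundred Kbps beyond 480P adds nothing
    assert utility(3.0) > utility(2.9)    # only reaching the 720P step helps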
Next, consider the move toward an Internet of Things [29], where hundreds of devices in a home or business may have varying wide-area communication requirements. These applications may range from home automation and security to health monitoring and backup. For instance, home security may have moderate bandwidth requirements but be of the highest priority. Remote backup may have substantial, sustained bandwidth requirements. However, the backup does not have a hard deadline and is largely insensitive to packet loss. Investigating BwE-based mechanisms for fair allocation based on an understanding of relative application utility in response to additional bandwidth is an interesting area of future work.

9. RELATED WORK

This paper focuses on allocating bandwidth among users in emerging multi-datacenter WAN environments. Given the generality of the problem, we necessarily build on a rich body of related efforts, including utility functions [8], weighted fair sharing [2, 11, 12, 8] and host-based admission control [9]. We extend existing Utility max-min approaches [8] for multi-path routing and hierarchical fairness.

Weighted queuing is one common bandwidth allocation paradigm. While a good starting point, we find that weights are insufficient for delivering user guarantees. Relative to weighted queuing, we focus on rate limiting based on demand estimation. BwE control is centralized and protocol agnostic, i.e., general to TCP and UDP. This is in contrast to DRL [25], which solves the problem via distributed rate control for TCP while not accounting for network capacity explicitly.

NetShare [20] and Seawall [30] also use weighted bandwidth allocation mechanisms. Seawall in particular achieves per-link proportional fairness. We have found max-min fairness to be more practical because it provides better isolation.
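For concreteness, the textbook water-filling computation below shows the single-link max-min fair allocation referred to here. It is a generic illustration with made-up demands, not BwE's multi-path, hierarchical allocator:

    # Textbook water-filling for max-min fair allocation on one link.
    # Demands and capacity are made-up example numbers.

    def max_min_fair(capacity, demands):
        """Return per-user allocations that are max-min fair on a single link."""
        alloc = {u: 0.0 for u in demands}
        active = set(demands)
        cap = capacity
        while active:
            share = cap / len(active)                     # split what is left evenly
            capped = {u for u in active if demands[u] <= share}
            if not capped:                                # everyone bottlenecked here
                for u in active:
                    alloc[u] = share
                break
            for u in capped:                              # satisfy small demands fully
                alloc[u] = demands[u]
                cap -= demands[u]
            active -= capped
        return alloc

    print(max_min_fair(10.0, {"A": 2.0, "B": 5.0, "C": 8.0}))
    # {'A': 2.0, 'B': 4.0, 'C': 4.0}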
Gatekeeper [26] also employs host-based hierarchical token buckets to share bandwidth among data center tenants, emphasizing work-conserving allocation. Gatekeeper however assumes a simplified topology for every tenant. BwE considers complex topologies, multi-path forwarding, centralized bandwidth allocation, and a range of flexible bandwidth allocation mechanisms. SecondNet [15] provides pair-wise bandwidth guarantees but requires accurate user-provided demands and is not work conserving.

Oktopus [4] proposes a datacenter tree topology with specified edge capacity. While suitable for datacenters, it is not a natural fit for the WAN, where user demands vary based on source-destination pairs. The BwE abstraction is more fine-grained, with associated implementation and configuration challenges. Oktopus also ties the problem of bandwidth allocation to VM placement. Our work however does not affect computation placement but rather takes the source of demand as fixed. We believe there are opportunities to apply such joint optimization to BwE. Datacenter bandwidth sharing efforts such as ElasticSwitch [24], FairCloud [23] and EyeQ [18] focus on a hose model for tenants. EyeQ uses ECN to detect congestion at the edge and assumes a congestion-free core. In contrast, our flow-wise bandwidth sharing model allows aggregation across users and is explicitly focused on a congested core.

Recent efforts such as SWAN [16] and B4 [17] are closely related but focus on the network and routing infrastructure to effectively scale and utilize emerging WAN environments. In particular, they focus on employing Software Defined Networking constructs for controlling and efficiently scaling the network. Our work is complementary and focuses on enforcing policies given an existing multipath routing configuration. Jointly optimizing network routing and bandwidth allocation is an area for future investigation. TEMPUS [19] focuses on optimizing network utilization by accounting for deadlines of long-lived flows. Our bandwidth sharing model applies to non-long-lived flows as well and does not require deadlines to be known ahead of time. Flow deadlines open up the possibility of further optimization (for example, by smoothing bandwidth allocation over a longer period of time), and that remains an area of future work for us.

Work in RSVP, Differentiated Services and Traffic Engineering [14, 27, 7, 22, 21, 1, 5] overlaps in terms of goals. However, these approaches are network centric, assuming that host control is not possible. In some sense, we take the opposite approach, considering an orthogonal hierarchical control infrastructure that leverages host-based demand measurement and enforcement.

Congestion Manager [3] is an inspiration for our work on BwE, enabling a range of flexible bandwidth allocation policies for individual flows based on an understanding of application requirements. However, Congestion Manager still manages bandwidth at the granularity of individual hosts, whereas we focus on the infrastructure and algorithms for bandwidth allocation in a large-scale distributed computing WAN environment.

10. CONCLUSIONS

In this paper, we present Bandwidth Enforcer (BwE), our mechanism for WAN bandwidth allocation. BwE allocates bandwidth to competing applications based on flexible policy configured by bandwidth functions. BwE supports hierarchical bandwidth allocation and delegation among services while simultaneously accounting for multi-path WAN communication.

Based on multiple years of production experience, we summarize a number of important benefits to our WAN. First, BwE provides isolation among competing services, delivering plentiful capacity in the common case while maintaining required capacity under failure and maintenance scenarios. Second, we provide a single point for specifying allocation policy to administrators. While pathing, RTT, and capacity can shift substantially, BwE continues to allocate bandwidth according to policy. Finally, BwE enables the WAN to run at higher levels of utilization than before. By tightly integrating new loss-insensitive file transfer protocols running at low priority with BwE, we run many of our WAN links at 90% utilization.

Acknowledgements

Many teams within Google collaborated towards the success of the BwE project. We would like to acknowledge the BwE development, test and operations groups including Aaron Racine, Alex Docauer, Alex Perry, Anand Kanagala, Andrew McGregor, Angus Lees, Deepak Nulu, Dmitri Nikulin, Eric Yan, Jon Zolla, Kai Song, Kirill Mendelev, Mahesh Kallahalla, Matthew Class, Michael Frumkin, Michael O'Reilly, Ming Xu, Mukta Gupta, Nan Hua, Nikhil Panpalia, Phong Chuong, Rajiv Ranjan, Richard Alimi, Sankalp Singh, Subbaiah Venkata, Vijay Chandramohan, Xijie Zeng, Yuanbo Zhu, Aamer Mahmood, Ben Treynor, Bikash Koley and Urs Hölzle for their significant contributions to the project. We would also like to thank our shepherd, Srikanth Kandula, and the anonymous SIGCOMM reviewers for their useful feedback.

11. REFERENCES

[1] Wikipedia: Differentiated services. http://en.wikipedia.org/wiki/Differentiated_services.

[2] Allalouf, M., and Shavitt, Y. Centralized and Distributed Algorithms for Routing and Weighted Max-Min Fair Bandwidth Allocation. IEEE/ACM Trans. Networking 16, 5 (2008), 1015–1024.
[3] Balakrishnan, H., Rahul, H. S., and Seshan, S. An integrated congestion management architecture for Internet hosts. In Proc. ACM SIGCOMM (1999), pp. 175–187.
[4] Ballani, H., Costa, P., Karagiannis, T., and Rowstron, A. Towards predictable datacenter networks. In SIGCOMM (2011).
[5] Blake, S., Black, D., Carlson, M., Davies, E., Wang, Z., and Weiss, W. An Architecture for Differentiated Service. RFC 2475 (Informational), December 1998. Updated by RFC 3260.
[6] Boudec, J.-Y. Rate adaptation, congestion control and fairness: A tutorial, 2000.
[7] Braden, R., Zhang, L., Berson, S., Herzog, S., and Jamin, S. Resource ReSerVation Protocol (RSVP) – Version 1 Functional Specification. RFC 2205 (Proposed Standard), September 1997. Updated by RFCs 2750, 3936, 4495.
[8] Cao, Z., and Zegura, E. W. Utility max-min: An application-oriented bandwidth allocation scheme. In INFOCOM (1999).
[9] Choi, B.-K., and Bettati, R. Endpoint admission control: network based approach. In Distributed Computing Systems, 2001. 21st International Conference on (Apr 2001), pp. 227–235.
[10] Clark, D. D., Lambert, M. L., and Zhang, L. NETBLT: A high throughput transport protocol. In Proceedings of the ACM Workshop on Frontiers in Computer Communications Technology (New York, NY, USA, 1988), SIGCOMM '87, ACM, pp. 353–359.
[11] Danna, E., Hassidim, A., Kaplan, H., Kumar, A., Mansour, Y., Raz, D., and Segalov, M. Upward Max Min Fairness. In INFOCOM (2012), pp. 837–845.
[12] Danna, E., Mandal, S., and Singh, A. A Practical Algorithm for Balancing the Max-min Fairness and Throughput Objectives in Traffic Engineering. In Proc. INFOCOM (March 2012), pp. 846–854.
[13] Dean, J., and Ghemawat, S. MapReduce: Simplified data processing on large clusters. Commun. ACM 51, 1 (January 2008), 107–113.
[14] Fortz, B., Rexford, J., and Thorup, M. Traffic Engineering with Traditional IP Routing Protocols. IEEE Communications Magazine 40 (2002), 118–124.
[15] Guo, C., Lu, G., Wang, H. J., Yang, S., Kong, C., Sun, P., Wu, W., and Zhang, Y. SecondNet: A data center network virtualization architecture with bandwidth guarantees. In CoNEXT (2010).
[16] Hong, C.-Y., Kandula, S., Mahajan, R., Zhang, M., Gill, V., Nanduri, M., and Wattenhofer, R. Have Your Network and Use It Fully Too: Achieving High Utilization in Inter-Datacenter WANs. In Proc. SIGCOMM (August 2013).
[17] Jain, S., Kumar, A., Mandal, S., Ong, J., Poutievski, L., Singh, A., Venkata, S., Wanderer, J., Zhou, J., Zhu, M., Zolla, J., Hölzle, U., Stuart, S., and Vahdat, A. B4: Experience with a Globally-Deployed Software Defined WAN. In Proceedings of ACM SIGCOMM 2013 (2013), ACM, pp. 3–14.
[18] Jeyakumar, V., Alizadeh, M., Mazières, D., Prabhakar, B., Kim, C., and Greenberg, A. EyeQ: Practical network performance isolation at the edge. In Proc. of NSDI (2013), USENIX Association, pp. 297–312.
[19] Kandula, S., Menache, I., Schwartz, R., and Babbula, S. R. Calendaring for wide area networks. In Proc. SIGCOMM (August 2014).
[20] Lam, T., Radhakrishnan, S., Vahdat, A., and Varghese, G. NetShare: Virtualizing data center networks across services. Tech. rep., 2010.
[21] Minei, I., and Lucek, J. MPLS-Enabled Applications: Emerging Developments and New Technologies. Wiley Series on Communications Networking & Distributed Systems. Wiley, 2008.
[22] Osborne, E., and Simha, A. Traffic Engineering with MPLS (Paperback). Networking Technology Series. Cisco Press, 2002.
[23] Popa, L., Krishnamurthy, A., Ratnasamy, S., and Stoica, I. FairCloud: Sharing the network in cloud computing. In Proceedings of the 10th ACM Workshop on Hot Topics in Networks (New York, NY, USA, 2011), HotNets-X, ACM, pp. 22:1–22:6.
[24] Popa, L., Yalagandula, P., Banerjee, S., Mogul, J. C., Turner, Y., and Santos, J. R. ElasticSwitch: Practical work-conserving bandwidth guarantees for cloud computing. In Proceedings of the ACM SIGCOMM 2013 Conference on SIGCOMM (New York, NY, USA, 2013), SIGCOMM '13, ACM, pp. 351–362.
[25] Raghavan, B., Vishwanath, K., Ramabhadran, S., Yocum, K., and Snoeren, A. C. Cloud control with distributed rate limiting. In SIGCOMM (2007).
[26] Rodrigues, H., Santos, J., Turner, Y., Soares, P., and Guedes, D. Gatekeeper: Supporting bandwidth guarantees for multi-tenant datacenter networks. In Workshop on I/O Virtualization (2011).
[27] Roughan, M., Thorup, M., and Zhang, Y. Traffic Engineering with Estimated Traffic Matrices. In Proc. IMC (2003), pp. 248–258.
[28] Saltzer, J. H., Reed, D. P., and Clark, D. D. End-to-end arguments in system design. ACM Trans. Comput. Syst. 2, 4 (November 1984), 277–288.
[29] Sarma, S., Brock, D. L., and Ashton, K. The networked physical world – proposals for engineering the next generation of computing, commerce & automatic identification. White Paper, Auto-ID Center, MIT (2000).
[30] Shieh, A., Kandula, S., Greenberg, A., Kim, C., and Saha, B. Sharing the data center network. In NSDI (2011).
