<<

CONGA: Distributed Congestion-Aware Load Balancing for Datacenters

Mohammad Alizadeh, Tom Edsall, Sarang Dharmapurikar, Ramanan Vaidyanathan, Kevin Chu, Andy Fingerhut, Vinh The Lam (Google), Francis Matus, Rong Pan, Navindra Yadav, George Varghese (Microsoft) Cisco Systems

ABSTRACT to paths, hash collisions can cause significant imbalance if there are We present the design, implementation, and evaluation of CONGA, a few large flows. More importantly, ECMP uses a purely local de- a network-based distributed congestion-aware load balancing mech- cision to split traffic among equal cost paths without knowledge of anism for datacenters. CONGA exploits recent trends including potential downstream congestion on each path. Thus ECMP fares the use of regular Clos topologies and overlays for network vir- poorly with asymmetry caused by link failures that occur frequently tualization. It splits TCP flows into flowlets, estimates real-time and are disruptive in datacenters [17, 34]. For instance, the recent congestion on fabric paths, and allocates flowlets to paths based study by Gill et al. [17] shows that failures can reduce delivered on feedback from remote switches. This enables CONGA to effi- traffic by up to 40% despite built-in redundancy. ciently balance load and seamlessly handle asymmetry, without re- Broadly speaking, the prior work on addressing ECMP’s short- quiring any TCP modifications. CONGA has been implemented in comings can be classified as either centralized scheduling (e.g., custom ASICs as part of a new datacenter fabric. In testbed exper- Hedera [2]), local switch mechanisms (e.g., Flare [27]), or host- iments, CONGA has 5× better flow completion times than ECMP based transport protocols (e.g., MPTCP [41]). These approaches even with a single link failure and achieves 2–8× better through- all have important drawbacks. Centralized schemes are too slow put than MPTCP in Incast scenarios. Further, the Price of Anar- for the traffic volatility in datacenters [28, 8] and local congestion- chy for CONGA is provably small in Leaf-Spine topologies; hence aware mechanisms are suboptimal and can perform even worse CONGA is nearly as effective as a centralized scheduler while be- than ECMP with asymmetry (§2.4). Host-based methods such as ing able to react to congestion in microseconds. Our main thesis MPTCP are challenging to deploy because network operators often is that datacenter fabric load balancing is best done in the network, do not control the end-host stack (e.g., in a public cloud) and even and requires global schemes such as CONGA to handle asymmetry. when they do, some high performance applications (such as low latency storage systems [39, 7]) bypass the kernel and implement Categories and Subject Descriptors: C.2.1 [Computer-Communication their own transport. Further, host-based load balancing adds more Networks]: Network Architecture and Design Keywords: Datacenter fabric; Load balancing; Distributed complexity to an already complex transport layer burdened by new requirements such as low latency and burst tolerance [4] in data- centers. As our experiments with MPTCP show, this can make for 1. INTRODUCTION brittle performance (§5). Datacenter networks being deployed by cloud providers as well Thus from a philosophical standpoint it is worth asking: Can as enterprises must provide large bisection bandwidth to support load balancing be done in the network without adding to the com- an ever increasing array of applications, ranging from financial ser- plexity of the transport layer? Can such a network-based approach vices to big-data analytics. 
They also must provide agility, enabling compute globally optimal allocations, and yet be implementable in any application to be deployed at any server, in order to realize a realizable and distributed fashion to allow rapid reaction in mi- operational efficiency and reduce costs. Seminal papers such as croseconds? Can such a mechanism be deployed today using stan- VL2 [18] and Portland [1] showed how to achieve this with Clos dard encapsulation formats? We seek to answer these questions topologies, Equal Cost MultiPath (ECMP) load balancing, and the in this paper with a new scheme called CONGA (for Congestion decoupling of endpoint addresses from their location. These de- Aware Balancing). CONGA has been implemented in custom ASICs sign principles are followed by next generation overlay technolo- for a major new datacenter fabric product line. While we report on gies that accomplish the same goals using standard encapsulations lab experiments using working hardware together with simulations such as VXLAN [35] and NVGRE [45]. and mathematical analysis, customer trials are scheduled in a few However, it is well known [2, 41, 9, 27, 44, 10] that ECMP can months as of the time of this writing. balance load poorly. First, because ECMP randomly hashes flows Figure 1 surveys the design space for load balancing and places CONGA in context by following the thick red lines through the de- sign tree. At the highest level, CONGA is a distributed scheme to Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed allow rapid round-trip timescale reaction to congestion to cope with for profit or commercial advantage and that copies bear this notice and the full cita- bursty datacenter traffic [28, 8]. CONGA is implemented within the tion on the first page. Copyrights for components of this work owned by others than network to avoid the deployment issues of host-based methods and ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or re- additional complexity in the transport layer. To deal with asymme- publish, to post on servers or to redistribute to lists, requires prior specific permission try, unlike earlier proposals such as Flare [27] and LocalFlow [44] and/or a fee. Request permissions from [email protected]. that only use local information, CONGA uses global congestion SIGCOMM’14, August 17–22, 2014, Chicago, IL, USA. Copyright 2014 ACM 978-1-4503-2836-4/14/08 ...$15.00. information, a design choice justified in detail in §2.4. http://dx.doi.org/10.1145/2619239.2626316 . analysis. We also prove that load balancing behavior and the Centralized Distributed (e.g., Hedera, B4, SWAN) effectiveness of flowlets depends on the coefficient of varia- slow to react for DCs tion of the flow size distribution. In-Network Host-Based (e.g., MPTCP) Hard to deploy, increases Incast Local Global, Congeson-Aware 2. DESIGN DECISIONS (Stateless or Congeson-Aware) (e.g., ECMP, Flare, LocalFlow) This section describes the insights that inform CONGA’s major Poor with asymmetry, General, Link State design decisions. We begin with the desired properties that have especially with TCP traffic Leaf to Leaf Complex, hard to Near-opmal for 2-er, guided our design. We then revisit the design decisions shown in deploy deployable with an overlay Figure 1 in more detail from the top down.

Per Flow Per Flowlet (CONGA) Per Packet Subopmal for “heavy” No TCP modificaons, Opmal, needs 2.1 Desired Properties flow distribuons (with large flows) resilient to flow distribuon reordering-resilient TCP CONGA is an in-network congestion-aware load balancing mech- Figure 1: Design space for load balancing. anism for datacenter fabrics. In designing CONGA, we targeted a solution with a number of key properties:

Next, the most general design would sense congestion on every 1. Responsive: Datacenter traffic is very volatile and bursty [18, link and send generalized link state packets to compute congestion- 28, 8] and switch buffers are shallow [4]. Thus, with CONGA, sensitive routes [16, 48, 51, 36]. However, this is an N-party pro- we aim for rapid round-trip timescale (e.g., 10s of microsec- tocol with complex control loops, likely to be fragile and hard to onds) reaction to congestion. deploy. Recall that the early ARPANET moved briefly to such congestion-sensitive routing and then returned to static routing, cit- 2. Transport independent: As a network mechanism, CONGA ing instability as a reason [31]. CONGA instead uses a 2-party must be oblivious to the transport protocol at the end-host “leaf-to-leaf” mechanism to convey path-wise congestion metrics (TCP, UDP, etc). Importantly, it should not require any modi- between pairs of top-of-the-rack switches (also termed leaf switches) fications to TCP. in a datacenter fabric. The leaf-to-leaf scheme is provably near- optimal in typical 2-tier Clos topologies (henceforth called Leaf- 3. Robust to asymmetry: CONGA must handle asymmetry due Spine), simple to analyze, and easy to deploy. In fact, it is de- to link failures (which have been shown to be frequent and ployable in datacenters today with standard overlay encapsulations disruptive in datacenters [17, 34]) or high bandwidth flows (VXLAN [35] in our implementation) which are already being used that are not well balanced. to enable workload agility [33]. 4. Incrementally deployable: It should be possible to apply With the availability of very high-density switching platforms CONGA to only a subset of the traffic and only a subset of for the spine (or core) with 100s of 40Gbps ports, a 2-tier fabric can the switches. scale upwards of 20,000 10Gbps ports.1 This design point covers the needs of the overwhelming majority of enterprise datacenters, 5. Optimized for Leaf-Spine topology: CONGA must work which are the primary deployment environments for CONGA. optimally for 2-tier Leaf-Spine topologies (Figure 4) that cover Finally, in the lowest branch of the design tree, CONGA is con- the needs of most enterprise datacenter deployments, though structed to work with flowlets [27] to achieve a higher granularity it should also benefit larger topologies. of control and resilience to the flow size distribution while not re- quiring any modifications to TCP. Of course, CONGA could also 2.2 Why Distributed Load Balancing? be made to operate per packet by using a very small flowlet inac- The distributed approach we advocate is in stark contrast to re- tivity gap (see §3.4) to perform optimally with a future reordering- cently proposed centralized traffic engineering designs [2, 9, 23, resilient TCP. 21]. This is because of two important features of datacenters. First, In summary, our major contributions are: datacenter traffic is very bursty and unpredictable [18, 28, 8]. CONGA • We design (§3) and implement (§4) CONGA, a distributed reacts to congestion at RTT timescales (∼100µs) making it more congestion-aware load balancing mechanism for datacenters. adept at handling high volatility than a centralized scheduler. For CONGA is immediately deployable, robust to asymmetries example, the Hedera scheduler in [2] runs every 5 seconds; but it caused by link failures, reacts to congestion in microseconds, would need to run every 100ms to approach the performance of and requires no end-host modifications. 
a distributed solution such as MPTCP [41], which is itself out- • We extensively evaluate (§5) CONGA with a hardware testbed performed by CONGA (§5). Second, datacenters use very regu- and packet-level simulations. We show that even with a single lar topologies. For instance, in the common Leaf-Spine topology link failure, CONGA achieves more than 5× better flow com- (Figure 4), all paths are exactly two-hops. As our experiments (§5) pletion time and 2× better job completion time respectively and analysis (§6) show, distributed decisions are close to optimal in for a realistic datacenter workload and a standard Hadoop Dis- such regular topologies. tributed File System benchmark. CONGA is at least as good Of course, a centralized approach is appropriate for WANs where as MPTCP for load balancing while outperforming MPTCP traffic is stable and predictable and the topology is arbitrary. For by 2–8× in Incast [47, 12] scenarios. example, Google’s inter-datacenter traffic engineering algorithm needs to run just 540 times per day [23]. • We analyze (§6) CONGA and show that it is nearly optimal in 2-tier Leaf-Spine topologies using “Price of Anarchy” [40] 2.3 Why In-Network Load Balancing? 1For example, spine switches with 576 40Gbps ports can be paired Continuing our exploration of the design space (Figure 1), the with typical 48-port leaf switches to enable non-blocking fabrics next question is where should datacenter fabric load balancing be with 27,648 10Gbps ports. implemented — the transport layer at the end-hosts or the network? S0 S0 S0 40 40 80 80 80 80 80 80 L0 S0 L0 S0 50 40 66.6 40 40 40 40 L0 L1 L0 L1 L0 L1 40 L2 40 L2 40 40 33.3 80 40 80 40 80 40 40 40 40 40 S1 S1 S1 L1 S1 L1 S1 40 40 (a) Static (ECMP) (b) Congestion-Aware: (c) Congestion-Aware: Local Only Global (CONGA) (a) L0L2=0, L1L2=80 (b) L0L2=40, L1L2=40

Figure 2: Congestion-aware load balancing needs non-local in- Figure 3: Optimal traffic split in asymmetric topologies de- formation with asymmetry. Here, L0 has 100Gbps of TCP traf- pends on the traffic matrix. Here, the L1→L2 flow must adjust fic to L1, and the (S1, L1) link has half the capacity of the its traffic through S0 based on the amount of L0→L2 traffic. other links. Such cases occur in practice with link-aggregation (which is very common), for instance, if a link fails in a fabric Per-link Congeson with two 40Gbps links connecting each leaf and spine switch. Measurement

Spine Tier Congeson The state-of-the-art multipath transport protocol, MPTCP [41], Feedback splits each connection into multiple (e.g., 8) sub-flows and bal- ances traffic across the sub-flows based on perceived congestion. Overlay LB Decision Network Our experiments (§5) show that while MPTCP is effective for load 0 1 2 3 balancing, its use of multiple sub-flows actually increases conges- Source ? Destination tion at the edge of the network and degrades performance in Incast Leaf Leaf scenarios (Figure 13). Essentially, MPTCP increases the burstiness Gap of traffic as more sub-flows contend at the fabric’s access links. Note that this occurs despite MPTCP’s coupled congestion control Flowlet Detecon algorithm [50] which is designed to handle shared bottlenecks, be- Figure 4: CONGA architecture. The source leaf switch de- cause while the coupled congestion algorithm works if flows are tects flowlets and routes them via the least congested path to in steady-state, in realistic datacenter workloads, many flows are the destination using congestion metrics obtained from leaf-to- short-lived and transient [18]. leaf feedback. Note that the topology could be asymmetric. The larger architectural point however is that datacenter fabric load balancing is too specific to be implemented in the transport stack. Datacenter fabrics are highly-engineered, homogenous sys- course, global congestion-aware load balancing (as in CONGA) tems [1, 34]. They are designed from the onset to behave like a does not have this issue. giant switch [18, 1, 25], much like the internal fabric within large The reader may wonder if asymmetry can be handled by some modular switches. Binding the fabric’s load balancing behavior to form of oblivious routing [32] such as weighted random load bal- the transport stack which already needs to balance multiple impor- ancing with weights chosen according to the topology. For in- tant requirements (e.g., high throughput, low latency, and burst tol- stance, in the above example we could give the lower path half erance [4]) is architecturally unappealing. Further, some datacenter the weight of the upper path and achieve the same traffic split as applications such as high performance storage systems bypass the CONGA. While this works in this case, it fails in general because kernel altogether [39, 7] and hence cannot use MPTCP. the “right” traffic splits in asymmetric topologies depend also on the traffic matrix, as shown by the example in Figure 3. Here, de- 2.4 Why Global Congestion Awareness? pending on how much L0→L2 traffic there is, the optimal split for → Next, we consider local versus global schemes (Figure 1). Han- L1 L2 traffic changes. Hence, static weights cannot handle both → dling asymmetry essentially requires non-local knowledge about cases in Figure 3. Note that in this example, the two L1 L2 paths downstream congestion at the switches. With asymmetry, a switch are symmetric when considered in isolation. But because of an → cannot simply balance traffic based on the congestion of its local asymmetry in another part of the network, the L0 L2 traffic cre- bandwidth asymmetry → links. In fact, this may lead to even worse performance than a static ates a for the L1 L2 traffic that can only scheme such as ECMP (which does not consider congestion at all) be detected by considering non-local congestion. Note also that a because of poor interaction with TCP’s control loop. 
local congestion-aware mechanism could actually perform worse As an illustration, consider the simple asymmetric scenario in than ECMP; for example, in the scenario in part (b). Figure 2. Leaf L0 has 100Gbps of TCP traffic demand to Leaf 2.5 Why Leaf-to-Leaf Feedback? L1. Static ECMP splits the flows equally, achieving a throughput of 90Gbps because the flows on the lower path are bottlenecked At the heart of CONGA is a leaf-to-leaf feedback mechanism at the 40Gbps link (S1, L1). Local congestion-aware load balanc- that conveys real-time path congestion metrics to the leaf switches. ing is actually worse with a throughput of 80Gbps. This is be- The leaves use these metrics to make congestion-aware load bal- cause as TCP slows down the flows on the lower path, the link ancing decisions based on the global (fabric-wide) state of conges- (L0, S1) appears less congested. Hence, paradoxically, the local tion (Figure 4). We now argue (following Figure 1) why leaf-to-leaf scheme shifts more traffic to the lower link until the throughput congestion signaling is simple and natural for modern data centers. on the upper link is also 40 Gbps. This example illustrates a fun- Overlay network: CONGA operates in an overlay network con- damental limitation of any local scheme (such as Flare [27], Lo- sisting of “tunnels” between the fabric’s leaf switches. When an calFlow [44], and packet-spraying [10]) that strictly enforces an endpoint (server or VM) sends a packet to the fabric, the source equal traffic split without regard for downstream congestion. Of leaf, or source tunnel endpoint (TEP), looks up the destination end- point’s address (either MAC or IP) to determine to which leaf (des- 1 Flow (250ms) tination TEP) the packet needs to be sent.2 It then encapsulates 0.8 Flowlet (500μs) the packet — with outer source and destination addresses set to the 0.6 Flowlet (100μs) source and destination TEP addresses — and sends it to the spine. 0.4 The spine switches forward the packet to its destination leaf based 0.2 entirely on the outer header. Once the packet arrives at the desti- 0

nation leaf, it decapsulates the original packet and delivers it to the Fracon of Data Bytes 1.E+01 1.E+03 1.E+05 1.E+07 1.E+09 intended recipient. Size (Bytes) Overlays are deployed today to virtualize the physical network and enable multi-tenancy and agility by decoupling endpoint iden- Figure 5: Distribution of data bytes across transfer sizes for tifiers from their location (see [38, 22, 33] for more details). How- different flowlet inactivity gaps. ever, the overlay also provides the ideal conduit for CONGA’s leaf- to-leaf congestion feedback mechanism. Specifically, CONGA lever- flows through the network core. The cluster supports over 2000 ages two key properties of the overlay: (i) The source leaf knows diverse enterprise applications, including web, data-base, security, the ultimate destination leaf for each packet, in contrast to standard and business intelligence services. The captures are obtained by IP forwarding where the switches only know the next-hops. (ii) having a few leaf switches mirror traffic to an analyzer without dis- The encapsulated packet has an overlay header (VXLAN [35] in rupting production traffic. Overall, we analyze more than 150GB our implementation) which can be used to carry congestion metrics of compressed packet trace data. between the leaf switches. Flowlet size: Figure 5 shows the distribution of the data bytes ver- Congestion feedback: The high-level mechanism is as follows. sus flowlet size for three choices of flowlet inactivity gap: 250ms, Each packet carries a congestion metric in the overlay header that 500µs, and 100µs. Since it is unlikely that we would see a gap represents the extent of congestion the packet experiences as it larger than 250ms in the same application-level flow, the line “Flow traverses through the fabric. The metric is updated hop-by-hop (250ms)” essentially corresponds to how the bytes are spread across and indicates the utilization of the most congested link along the flows. The plot shows that balancing flowlets gives significantly packet’s path. This information is stored at the destination leaf on more fine-grained control than balancing flows. Even with an in- a per source leaf, per path basis and is opportunistically fed back activity gap of 500µs, which is quite large and poses little risk of to the source leaf by piggybacking on packets in the reverse di- packet reordering in datacenters, we see nearly two orders of mag- rection. There may be, in general, 100s of paths in a multi-tier nitude reduction in the size of transfers that cover most of the data: topology. Hence, to reduce state, the destination leaf aggregates 50% of the bytes are in flows larger than ∼30MB, but this number congestion metrics for one or more paths based on a generic iden- reduces to ∼500KB for “Flowlet (500µs)”. tifier called the Load Balancing Tag that the source leaf inserts in packets (see §3 for details). Flowlet concurrency: Since the required flowlet inactivity gap is very small in datacenters, we expect there to be a small number 2.6 Why Flowlet Switching for Datacenters? of concurrent flowlets at any given time. Thus, the implementa- CONGA also employs flowlet switching, an idea first introduced tion cost for tracking flowlets should be low. To quantify this, we by Kandula et al. [27]. Flowlets are bursts of packets from a flow measure the distribution of the number of distinct 5-tuples in our that are separated by large enough gaps (see Figure 4). Specifi- packet trace over 1ms intervals. 
We find that the number of dis- cally, if the idle interval between two bursts of packets is larger tinct 5-tuples (and thus flowlets) is small, with a median of 130 and than the maximum difference in latency among the paths, then the a maximum under 300. Normalizing these numbers to the aver- second burst can be sent along a different path than the first with- age throughput in our trace (∼15Gbps), we estimate that even for out reordering packets. Thus flowlets provide a higher granularity a very heavily loaded leaf switch with say 400Gbps of traffic, the alternative to flows for load balancing (without causing reordering). number of concurrent flowlets would be less than 8K. Maintaining a table for tracking 64K flowlets is feasible at low cost (§3.4). 2.6.1 Measurement analysis Flowlet switching has been shown to be an effective technique 3. DESIGN for fine-grained load balancing across Internet paths [27], but how Figure 6 shows the system diagram of CONGA. The majority of does it perform in datacenters? On the one hand, the very high the functionality resides at the leaf switches. The source leaf makes bandwidth of internal datacenter flows would seem to suggest that load balancing decisions based on per uplink congestion metrics, the gaps needed for flowlets may be rare, limiting the applicability derived by taking the maximum of the local congestion at the uplink of flowlet switching. On the other hand, datacenter traffic is known and the remote congestion for the path (or paths) to the destination to be extremely bursty at short timescales (e.g., 10–100s of mi- leaf that originate at the uplink. The remote metrics are stored in the croseconds) for a variety of reasons such as NIC hardware offloads Congestion-To-Leaf Table on a per destination leaf, per uplink ba- designed to support high link rates [29]. Since very small flowlet sis and convey the maximum congestion for all the links along the gaps suffice in datacenters to maintain packet order (because the path. The remote metrics are obtained via feedback from the des- network latency is very low) such burstiness could provide suffi- tination leaf switch, which opportunistically piggybacks values in cient flowlet opportunities. its Congestion-From-Leaf Table back to the source leaf. CONGA We study the applicability of flowlet switching in datacenters us- measures congestion using the Discounting Rate Estimator (DRE), ing measurements from actual production datacenters. We instru- a simple module present at each fabric link. ment a production cluster with over 4500 virtualized and bare metal Load balancing decisions are made on the first packet of each hosts across ∼30 racks of servers to obtain packet traces of traffic flowlet. Subsequent packets use the same uplink as long as the 2How the mapping between endpoint identifiers and their locations flowlet remains active (there is not a sufficiently long gap). The (leaf switches) is obtained and propagated is beyond the scope of source leaf uses the Flowlet Table to keep track of active flowlets this paper. and their chosen uplinks. A→B To-Leaf Table at each leaf switch. We now describe the sequence LBTag =2 CE=4 of events involved (refer to Figure 6 for an example). Per-link DREs Forward in Spine B→A Path Pkt 1. The source leaf sends the packet to the fabric with the LBTag FB_LBTag=1 FB_Metric=5 field set to the uplink port taken by the packet. It also sets the Reverse CE field to zero. Path Pkt Leaf A Leaf B 3 (Sender) 0 1 2 3 (Receiver) 2. 
The packet is routed through the fabric to the destination leaf. Per-uplink As it traverses each link, its CE field is updated if the link’s Uplink DREs LBTag congestion metric (given by the DRE) is larger than the current 0 1 k-1 0 1 k-1 value in the packet. B 2 5 3 A 2 5 3 Leaf 3. The CE field of the packet received at the destination leaf gives

Dest ? LB Decision Source Leaf the maximum link congestion along the packet’s path. This Congeson-To-Leaf Flowlet Congeson-From-Leaf Table Table Table needs to be fed back to the source leaf. But since a packet may not be immediately available in the reverse direction, the Figure 6: CONGA system diagram. destination leaf stores the metric in the Congestion-From-Leaf Table (on a per source leaf, per LBTag basis) while it waits for 3.1 Packet format an opportunity to piggyback the feedback. CONGA leverages the VXLAN [35] encapsulation format used 4. When a packet is sent in the reverse direction, one metric from for the overlay to carry the following state: the Congestion-From-Leaf Table is inserted in its FB_LBTag and FB_Metric fields for the source leaf. The metric is cho- • LBTag (4 bits): This field partially identifies the packet’s sen in round-robin fashion while, as an optimization, favoring path. It is set by the source leaf to the (switch-local) port num- those metrics whose values have changed since the last time ber of the uplink the packet is sent on and is used by the des- they were fed back. tination leaf to aggregate congestion metrics before they are fed back to the source. For example, in Figure 6, the LBTag 5. Finally, the source leaf parses the feedback in the reverse packet is 2 for both blue paths. Note that 4 bits is sufficient because and updates the Congestion-To-Leaf Table. the maximum number of leaf uplinks in our implementation is 12 for a non-oversubscribed configuration with 48 × 10Gbps It is important to note that though we have described the forward server-facing ports and 12 × 40Gbps uplinks. and reverse packets separately for simplicity, every packet simul- taneously carries both a metric for its forward path and a feedback • CE (3 bits): This field is used by switches along the packet’s metric. Also, while we could generate explicit feedback packets, path to convey the extent of congestion. we decided to use piggybacking because we only need a very small number of packets for feedback. In fact, all metrics between a pair • FB_LBTag (4 bits) and FB_Metric (3 bits): These two fields of leaf switches can be conveyed in at-most 12 packets (because are used by destination leaves to piggyback congestion infor- there are 12 distinct LBTag values), with the average case being mation back to the source leaves. FB_LBTag indicates the much smaller because the metrics only change at network round- LBTag the feedback is for and FB_Metric provides its associ- trip timescales, not packet-timescales (see the discussion in §3.6 ated congestion metric. regarding the DRE time constant). 3.2 Discounting Rate Estimator (DRE) Metric aging: A potential issue with not having explicit feedback packets is that the metrics may become stale if sufficient traffic does The DRE is a simple module for measuring the load of a link. not exit for piggybacking. To handle this, a simple aging mecha- The DRE maintains a register, X, which is incremented for each nism is added where a metric that has not been updated for a long packet sent over the link by the packet size in bytes, and is decre- time (e.g., 10ms) gradually decays to zero. This also guarantees mented periodically (every T ) with a multiplicative factor α be- dre that a path that appears to be congested is eventually probed again. tween 0 and 1: X ← X × (1 − α). 
It is easy to show that X is proportional to the rate of traffic over the link; more precisely, if 3.4 Flowlet Detection the traffic rate is R, then X ≈ R · τ, where τ = T /α. The dre Flowlets are detected and tracked in the leaf switches using the DRE algorithm is essentially a first-order low pass filter applied to −1 Flowlet Table. Each entry of the table consists of a port number, a packet arrivals, with a (1 − e ) rise time of τ. The congestion valid bit, and an age bit. When a packet arrives, we lookup an entry metric for the link is obtained by quantizing X/Cτ to 3 bits (C is based on a hash of its 5-tuple. If the entry is valid (valid_bit == the link speed). 1), the flowlet is active and the packet is sent on the port indicated The DRE algorithm is similar to the widely used Exponential in the entry. If the entry is not valid, the incoming packet starts Weighted Moving Average (EWMA) mechanism. However, DRE a new flowlet. In this case, we make a load balancing decision has two key advantages over EWMA: (i) it can be implemented (as described below) and cache the result in the table for use by with just one register (whereas EWMA requires two); and (ii) the subsequent packets. We also set the valid bit. DRE reacts more quickly to traffic bursts (because increments take Flowlet entries time out using the age bit. Each incoming packet place immediately upon packet arrivals) while retaining memory of resets the age bit. A timer periodically (every T seconds) checks past bursts. In the interest of space, we omit the details. fl the age bit before setting it. If the age bit is set when the timer 3.3 Congestion Feedback checks it, then there have not been any packets for that entry in the CONGA uses a feedback loop between the source and destina- 3If there are multiple valid next hops to the destination (as in Fig- tion leaf switches to populate the remote metrics in the Congestion- ure 6), the spine switches pick one using standard ECMP hashing. Spine 0 Spine 1 Spine 0 Spine 1 last Tfl seconds and the entry expires (the valid bit is set to zero). Note that a single age bit allows us to detect gaps between Tfl and Link Failure 2Tfl. While not as accurate as using full timestamps, this requires far fewer bits allowing us to maintain a very large number of entries 4x40Gbps 4x40Gbps 4x40Gbps 3x40Gbps in the table (64K in our implementation) at low cost. Leaf 0 Leaf 1 Leaf 0 Leaf 1 Remark 1. Although with hashing, flows may collide in the Flowlet 32x10Gbps 32x10Gbps 32x10Gbps 32x10Gbps Table, this is not a big concern. Collisions simply imply some load (a) Baseline (no failure) (b) With link failure balancing opportunities are lost which does not matter as long as this does not occur too frequently. Figure 7: Topologies used in testbed experiments.

1 Flow Size 1 Flow Size 3.5 Load Balancing Decision Logic 0.8 Bytes 0.8 Bytes Load balancing decisions are made on the first packet of each 0.6 0.6 0.4 0.4 CDF flowlet (other packets use the port cached in the Flowlet Table). CDF For a new flowlet, we pick the uplink port that minimizes the max- 0.2 0.2 0 0 imum of the local metric (from the local DREs) and the remote 1.E+01 1.E+03 1.E+05 1.E+07 1.E+09 1.E+01 1.E+03 1.E+05 1.E+07 1.E+09 metric (from the Congestion-To-Leaf Table). If multiple uplinks Size (Bytes) Size (Bytes) are equally good, one is chosen at random with preference given (a) Enterprise workload (b) Data-mining workload to the port cached in the (invalid) entry in the Flowlet Table; i.e., a Figure 8: Empirical traffic distributions. The Bytes CDF shows flow only moves if there is a strictly better uplink than the one its the distribution of traffic bytes across different flow sizes. last flowlet took. 3.6 Parameter Choices 5. EVALUATION CONGA has three main parameters: (i) Q, the number of bits for In this section, we evaluate CONGA’s performance with a real quantizing congestion metrics (§3.1); (ii) τ, the DRE time constant, hardware testbed as well as large-scale simulations. Our testbed ex- given by Tdre/α (§3.2); and (iii) Tfl, the flowlet inactivity timeout periments illustrate CONGA’s good performance for realistic em- (§3.4). We set these parameters experimentally. A control-theoretic pirical workloads, Incast micro-benchmarks, and actual applica- analysis of CONGA is beyond the scope of this paper, but our pa- tions. Our detailed packet-level simulations confirm that CONGA rameter choices strike a balance between the stability and respon- scales to large topologies. siveness of CONGA’s control loop while taking into consideration practical matters such as interaction with TCP and implementation Schemes compared: We compare CONGA, CONGA-Flow, ECMP, cost (header requirements, table size, etc). and MPTCP [41], the state-of-the-art multipath transport proto- The parameters Q and τ determine the “gain” of CONGA’s con- col. CONGA and CONGA-Flow differ only in their choice of trol loop and exhibit important tradeoffs. A large Q improves con- the flowlet inactivity timeout. CONGA uses the default value: gestion metric accuracy, but if too large can make the leaf switches Tfl = 500µs. CONGA-Flow however uses Tfl = 13ms which is over-react to minor differences in congestion causing oscillatory greater than the maximum path latency in our testbed and ensures behavior. Similarly, a small τ makes the DRE respond quicker, but no packet reordering. CONGA-Flow’s large timeout effectively also makes it more susceptible to noise from transient traffic bursts. implies one load balancing decision per flow, similar to ECMP. Intuitively, τ should be set larger than the network RTT to filter the Of course, in contrast to ECMP, decisions in CONGA-Flow are informed by the path congestion metrics. The rest of the parame- sub-RTT traffic bursts of TCP. The flowlet timeout, Tfl, can be set to the maximum possible leaf-to-leaf latency to guarantee no ters for both variants of CONGA are set as described in §3.6. For packet reordering. This value can be rather large though because of MPTCP, we use MPTCP kernel v0.87 available on the web [37]. worst-case queueing delays (e.g., 13ms in our testbed), essentially We configure MPTCP to use 8 sub-flows for each TCP connection, as Raiciu et al. [41] recommend. disabling flowlets. 
Reducing Tfl presents a compromise between more packet reordering versus less congestion (fewer packet drops, 5.1 Testbed lower latency) due to better load balancing. Overall, we have found CONGA’s performance to be fairly ro- Our testbed consists of 64 servers and four switches (two leaves and two spines). As shown in Figure 7, the servers are organized bust with: Q = 3 to 6, τ = 100µs to 500µs, and Tfl = 300µs to 1ms. The default parameter values for our implementation are: in two racks (32 servers each) and attach to the leaf switches with 10Gbps links. In the baseline topology (Figure 7(a)), the leaf switches Q = 3, τ = 160µs, and Tfl = 500µs. connect to each spine switch with two 40Gbps uplinks. Note that there is a 2:1 oversubscription at the Leaf level, typical of today’s 4. IMPLEMENTATION datacenter deployments. We also consider the asymmetric topology CONGA has been implemented in custom switching ASICs for in Figure 7(b) where one of the links between Leaf 1 and Spine 1 is a major datacenter fabric product line. The implementation in- down. The servers have 12-core Intel Xeon X5670 2.93GHz CPUs, cludes ASICs for both the leaf and spine nodes. The Leaf ASIC 128GB of RAM, and 3 2TB 7200RPM HDDs. implements flowlet detection, congestion metric tables, DREs, and the leaf-to-leaf feedback mechanism. The Spine ASIC implements 5.2 Empirical Workloads the DRE and its associated congestion marking mechanism. The We begin with experiments with realistic workloads based on ASICs provide state-of-the-art switching capacities. For instance, empirically observed traffic patterns in deployed datacenters. Specif- the Leaf ASIC has a non-blocking switching capacity of 960Gbps ically, we consider the two flow size distributions shown in Fig- in 28nm technology. CONGA’s implementation requires roughly ure 8. The first distribution is derived from packet traces from our 2.4M gates and 2.8Mbits of memory in the Leaf ASIC, and con- own datacenters (§2.6) and represents a large enterprise workload. sumes negligible die area (< 2%). The second distribution is from a large cluster running data mining 6 1.6 1.2 ECMP 5 CONGA-Flow 1.4 1 1.2 4 CONGA 0.8 MPTCP 1 3 0.8 ECMP 0.6 ECMP 2 0.6 CONGA-Flow 0.4 CONGA-Flow 0.4 CONGA CONGA 1 0.2 0.2 MPTCP MPTCP FCT (Norm. to ECMP) FCT (Norm. to ECMP) FCT (Norm. to Opmal) 0 0 0 10 20 30 40 50 60 70 80 90 10 20 30 40 50 60 70 80 90 10 20 30 40 50 60 70 80 90 Load (%) Load (%) Load (%) (a) Overall Average FCT (b) Small Flows (<100KB) (c) Large Flows (> 10MB) Figure 9: FCT statistics for the enterprise workload with the baseline topology. Note that part (a) is normalized to the optimal FCT, while parts (b) and (c) are normalized to the value achieved by ECMP. The results are the average of 5 runs.

12 1.4 1.2 ECMP 1.2 10 CONGA-Flow 1 8 CONGA 1 0.8 MPTCP 0.8 6 0.6 0.6 ECMP ECMP CONGA-Flow CONGA-Flow 4 0.4 0.4 CONGA CONGA 2 0.2 0.2 FCT (Norm. to ECMP) MPTCP FCT (Norm. to ECMP) MPTCP FCT (Norm. to Opmal) 0 0 0 10 20 30 40 50 60 70 80 90 10 20 30 40 50 60 70 80 90 10 20 30 40 50 60 70 80 90 Load (%) Load (%) Load (%) (a) Overall Average FCT (b) Small Flows (< 100KB) (c) Large Flows (> 10MB) Figure 10: FCT statistics for the data-mining workload with the baseline topology. jobs [18]. Note that both distributions are heavy-tailed: A small erage FCT than ECMP. Similar to the enterprise workload, we find fraction of the flows contribute most of the data. Particularly, the a degradation in FCT for small flows with MPTCP compared to the data-mining distribution has a very heavy tail with 95% of all data other schemes. bytes belonging to ∼3.6% of flows that are larger than 35MB. Analysis: The above results suggest a tradeoff with MPTCP be- We develop a simple client-server program to generate the traffic. tween achieving good load balancing in the core of the fabric and The clients initiate 3 long-lived TCP connections to each server managing congestion at the edge. Essentially, while using 8 sub- and request flows according to a Poisson process from randomly flows per connection helps MPTCP achieve better load balancing, chosen servers. The request rate is chosen depending on the desired it also increases congestion at the edge links because the multiple offered load and the flow sizes are sampled from one of the above sub-flows cause more burstiness. Further, as observed by Raiciu et distributions. All 64 nodes run both client and server processes. al. [41], the small sub-flow window sizes for short flows increases Since we are interested in stressing the fabric’s load balancing, we the chance of timeouts. This hurts the performance of short flows configure the clients under Leaf 0 to only use the servers under which are more sensitive to additional latency and packet drops. Leaf 1 and vice-versa, thereby ensuring that all traffic traverses the CONGA on the other hand does not have this problem since it does Spine. Similar to prior work [20, 5, 13], we use the flow completion not change the congestion control behavior of TCP. time (FCT) as our main performance metric. Another interesting observation is the distinction between the en- 5.2.1 Baseline terprise and data-mining workloads. In the enterprise workload, ECMP actually does quite well leaving little room for improve- Figures 9 and 10 show the results for the two workloads with ment by the more sophisticated schemes. But for the data-mining the baseline topology (Figure 7(a)). Part (a) of each figure shows workload, ECMP is noticeably worse. This is because the enter- the overall average FCT for each scheme at traffic loads between prise workload is less “heavy”; i.e., it has fewer large flows. In 10–90%. The values here are normalized to the optimal FCT that particular, ∼50% of all data bytes in the enterprise workload are is achievable in an idle network. In parts (b) and (c), we break from flows that are smaller than 35MB (Figure 8(a)). But in the down the FCT for each scheme for small (< 100KB) and large data mining workload, flows smaller than 35MB contribute only (> 10MB) flows, normalized to the value achieved by ECMP. Each ∼5% of all bytes (Figure 8(b)). Hence, the data-mining workload data point is the average of 5 runs with the error bars showing the is more challenging to handle from a load balancing perspective. range. 
(Note that the error bars are not visible at all data points.) We quantify the impact of the workload analytically in §6.2. Enterprise workload: We find that the overall average FCT is sim- ilar for all schemes in the enterprise workload, except for MPTCP 5.2.2 Impact of link failure which is up to ∼25% worse than the others. MPTCP’s higher over- We now repeat both workloads for the asymmetric topology when all average FCT is because of the small flows, for which it is up a fabric link fails (Figure 7(b)). The overall average FCT for both to ∼50% worse than ECMP. CONGA and similarly CONGA-Flow workloads is shown in Figure 11. Note that since in this case the are slightly worse for small flows (∼12–19% at 50–80% load), but bisection bandwidth between Leaf 0 and Leaf 1 is 75% of the orig- improve FCT for large flows by as much as ∼20%. inal capacity (3 × 40Gbps links instead of 4), we only consider Data-mining workload: For the data-mining workload, ECMP is offered loads up to 70%. As expected, ECMP’s performance dras- noticeably worse than the other schemes at the higher load levels. tically deteriorates as the offered load increases beyond 50%. This Both CONGA and MPTCP achieve up to ∼35% better overall av- is because with ECMP, half of the traffic from Leaf 0 to Leaf 1 10 20 1 ECMP ECMP CONGA-Flow CONGA-Flow 0.8 8 15 CONGA CONGA 0.6 6 MPTCP MPTCP ECMP 10 CDF 0.4 CONGA-Flow 4 CONGA 0.2 2 5 MPTCP 0 FCT (Norm. to Opmal) FCT (Norm. to Opmal) 0 0 0 2 4 6 8 10 20 30 40 50 60 70 10 20 30 40 50 60 70 Queue Length (MBytes) Load (%) Load (%) (a) Enterprise workload: Avg FCT (b) Data-mining workload: Avg FCT (c) Queue length at hotspot Figure 11: Impact of link failure. Parts (a) and (b) show that overall average FCT for both workloads. Part (c) shows the CDF of queue length at the hotspot port [Spine1→Leaf1] for the data-mining workload at 60% load.

1 The throughput imbalance is defined as the maximum throughput 0.8 ECMP (among the 4 uplinks) minus the minimum divided by the average: 0.6 CONGA-Flow

CDF (MAX − MIN)/AV G. This is calculated from synchronous 0.4 CONGA 0.2 MPTCP samples of the throughput of the 4 uplinks over 10ms intervals, 0 measured using a special debugging module in the ASIC. 0 50 100 150 200 The results confirm that CONGA is at least as efficient as MPTCP Throughput Imbalance (MAX – MIN)/AVG (%) for load balancing (without any TCP modifications) and is signifi- (a) Enterprise workload cantly better than ECMP. CONGA’s throughput imbalance is even 1 lower than MPTCP for the enterprise workload. With CONGA- 0.8 ECMP Flow, the throughput imbalance is slightly better than MPTCP in 0.6 CONGA-Flow

CDF the enterprise workload, but worse in the data-mining workload. 0.4 CONGA 0.2 MPTCP 0 5.3 Incast 0 50 100 150 200 Our results in the previous section suggest that MPTCP increases Throughput Imbalance (MAX – MIN)/AVG (%) congestion at the edge links because it opens 8 sub-flows per con- (b) Data-mining workload nection. We now dig deeper into this issue for Incast traffic patterns Figure 12: Extent of imbalance between throughput of leaf up- that are common in datacenters [4]. links for both workloads at 60% load. We use a simple application that generates the standard Incast traffic pattern considered in prior work [47, 12, 4]. A client pro- is sent through Spine 1 (see Figure 7(b)). Therefore the single cess residing on one of the servers repeatedly makes requests for a [Spine1→Leaf1] link must carry twice as much traffic as each of 10MB file striped across N other servers. The servers each respond the [Spine0→Leaf1] links and becomes oversubscribed when the with 10MB/N of data in a synchronized fashion causing Incast. We offered load exceeds 50%. At this point, the network effectively measure the effective throughput at the client for different “fan-in” cannot handle the offered load and becomes unstable. values, N, ranging from 1 to 63. Note that this experiment does The other adaptive schemes are significantly better than ECMP not stress the fabric load balancing since the total traffic crossing with the asymmetric topology because they shift traffic to the lesser the fabric is limited by the client’s 10Gbps access link. The perfor- congested paths (through Spine 0) and can thus handle the offered mance here is predominantly determined by the Incast behavior for load. CONGA is particularly robust to the link failure, achieving the TCP or MPTCP transports. up to ∼30% lower overall average FCT than MPTCP in the en- The results are shown in Figure 13. We consider two mini- terprise workload and close to 2× lower in the data-mining work- mum retransmission timeout (minRT O) values: 200ms (the de- load at 70% load. CONGA-Flow is also better than MPTCP even fault value in Linux) and 1ms (recommended by Vasudevan et al. [47] though it does not split flows. This is because CONGA proac- to cope with Incast); and two packet sizes (MTU): 1500B (the tively detects high utilization paths before congestion occurs and standard Ethernet frame size) and 9000B (jumbo frames). The plots adjusts traffic accordingly, whereas MPTCP is reactive and only confirm that MPTCP significantly degrades performance in Incast makes adjustments when sub-flows on congested paths experience scenarios. For instance, with minRT O = 200ms, MPTCP’s through- packet drops. This is evident from Figure 11(c), which compares put degrades to less than 30% for large fan-in with 1500B packets the queue occupancy at the hotspot port [Spine1→Leaf1] for the and just 5% with 9000B packets. Reducing minRT O to 1ms mit- different schemes for the data-mining workload at 60% load. We igates the problem to some extent with standard packets, but even see that CONGA controls the queue occupancy at the hotspot much reducing minRT O does not prevent significant throughput loss more effectively than MPTCP, for instance, achieving a 4× smaller with jumbo frames. CONGA +TCP achieves 2–8× better through- 90th percentile queue occupancy. put than MPTCP in similar settings. 
5.2.3 Load balancing efficiency 5.4 HDFS Benchmark The FCT results of the previous section show end-to-end per- Our final testbed experiment evaluates the end-to-end impact of formance which is impacted by a variety of factors, including load CONGA on a real application. We set up a 64 node Hadoop clus- balancing. We now focus specifically on CONGA’s load balanc- ter (Cloudera distribution hadoop-0.20.2-cdh3u5) with 1 NameN- ing efficiency. Figure 12 shows the CDF of the throughput im- ode and 63 DataNodes and run the standard TestDFSIO bench- balance across the 4 uplinks at Leaf 0 in the baseline topology mark included in the Apache Hadoop distribution [19]. The bench- (without link failure) for both workloads at the 60% load level. mark tests the overall IO performance of the cluster by spawning MTU 1500 Symmetric Topology

100 1000 ECMP CONGA MPTCP 80 800 60 600 CONGA+TCP (200ms) 400 40 CONGA+TCP (1ms) 200 20 MPTCP (200ms) Throughput (%) MPTCP (1ms) Comp. Time (sec) 0 0 0 5 10 15 20 25 30 35 40 0 10 20 30 40 50 60 MTU 9000 Fanout (# of senders) Asymmetric Topology Trial Number (a) MTU = 1500B (a) Baseline Topology (w/o link failure) 100 1000 ECMP CONGA MPTCP 80 800 60 600 CONGA+TCP (200ms) 400 40 CONGA+TCP (1ms) 200 20 MPTCP (200ms) Comp. Time (sec) Throughput (%) MPTCP (1ms) 0 0 0 5 10 15 20 25 30 35 40 0 10 20 30 40 50 60 Trial Number Fanout (# of senders) (b) MTU = 9000B (b) Asymmetric Topology (with link failure) Figure 13: MPTCP significantly degrades performance in In- Figure 14: HDFS IO benchmark. cast scenarios, especially with large packet sizes (MTU). 1 1 0.8 0.8 a MapReduce job which writes a 1TB file into HDFS (with 3-way replication) and measuring the total job completion time. We run 0.6 0.6 0.4 0.4 the benchmark 40 times for both of the topologies in Figure 7 (with ECMP ECMP 0.2 0.2 and without link failure). We found in our experiments that the CONGA CONGA TestDFSIO benchmark is disk-bound in our setup and does not pro- 0 0 FCT (Normalized to ECMP) duce enough traffic on its own to stress the network. Therefore, in 30 40 50 60 70 80 FCT (Normalized to ECMP) 30 40 50 60 70 80 the absence of servers with more or faster disks, we also generate Load (%) Load (%) some background traffic using the empirical enterprise workload (a) 10Gbps access links (b) 40Gbps access links described in §5.2. Figure 15: Overall average FCT for a simulated workload Figure 14 shows the job completion times for all trials. We based on web search [4] for two topologies with 40Gbps fab- find that for the baseline topology (without failures), ECMP and ric links, 3:1 oversubscription, and: (a) 384 10Gbps servers, CONGA have nearly identical performance. MPTCP has some out- (b) 96 40Gbps servers. lier trials with much higher job completion times. For the asymmet- ric topology with link failure, the job completion times for ECMP 5:1. We find qualitatively similar results to our testbed experiments are nearly twice as large as without the link failure. But CONGA is at all scales. CONGA achieves ∼5–40% better FCT than ECMP robust; comparing Figures 14 (a) and (b) shows that the link failure in symmetric topologies, with the benefit larger for heavy flow size has almost no impact on the job completion times with CONGA. distributions, high load, and high oversubscription ratios. The performance with MPTCP is very volatile with the link fail- Varying link speeds: CONGA’s improvement over ECMP is more ure. Though we cannot ascertain the exact root cause of this, we pronounced and occurs at lower load levels the closer the access believe it is because of MPTCP’s difficulties with Incast since the link speed (at the servers) is to the fabric link speed. For exam- TestDFSIO benchmark creates a large number of concurrent trans- ple, in topologies with 40Gbps links everywhere, CONGA achieves fers between the servers. 30% better FCT than ECMP even at 30% traffic load, while for 5.5 Large-Scale Simulations topologies with 10Gbps edge and 40Gbps fabric links, the improve- ment at 30% load is typically 5–10% (Figure 15). This is because Our experimental results were for a comparatively small topol- in the latter case, each fabric link can support multiple (at least 4) ogy (64 servers, 4 switches), 10Gbps access links, one link failure, flows without congestion. Therefore, the impact of hash collisions and 2:1 oversubscription. 
But in developing CONGA, we also did in ECMP is less pronounced in this case (see also [41]). extensive packet-level simulations, including exploring CONGA’s parameter space (§3.6), different workloads, larger topologies, vary- Varying asymmetry: Across tens of asymmetrical scenarios we ing oversubscription ratios and link speeds, and varying degrees of simulated (e.g., with different number of random failures, fabrics asymmetry. We briefly summarize our main findings. with both 10Gbps and 40Gbps links, etc) CONGA achieves near optimal traffic balance. For example, in the representative scenario Note: We used OMNET++ [46] and the Network Simulation Cra- shown in Figure 16, the queues at links 10 and 12 in the spine dle [24] to port the actual Linux TCP source code (from kernel (which are adjacent to the failed link 11) are ∼10× larger with 2.6.26) to our simulator. Despite its challenges, we felt that cap- ECMP than CONGA. turing exact TCP behavior was crucial to evaluate CONGA, espe- cially to accurately model flowlets. Since an OMNET++ model of 6. ANALYSIS MPTCP is unavailable and we did experimental comparisons, we In this section we complement our experimental results with anal- did not simulate MPTCP. ysis of CONGA along two axes. First, we consider a game-theoretic Varying fabric size and oversubscription: We have simulated model known as the bottleneck routing game [6] in the context of realistic datacenter workloads (including those presented in §5.2) CONGA to quantify its Price of Anarchy [40]. Next, we consider for fabrics with as many as 384 servers, 8 leaf switches, and 12 a stochastic model that gives insights into the impact of the traffic spine switches, and for oversubscription ratios ranging from 1:1 to distribution and flowlets on load balancing performance. 300 Spine Downlink Queues ECMP ) 250 CONGA (2) 200 some failed links (3) pkts 150 (2) (1) 100 50 (1) (3)

Length ( 0 1 11 21 31 Link # 41 51 61 71 Figure 17: Example with PoA of 2. All edges have capacity 1 300 Leaf Uplink Queues ECMP 250 and each pair of adjacent leaves send 1 unit of traffic to each 200 some failed links CONGA other. The shown Nash flow (with traffic along the solid edges) 150 100 has a network bottleneck of 1. However, if all users instead used 50 the dashed paths, the bottleneck would be 1/2.

Length (pkts) 0 1 11 21 31 Link # 41 51 61 71 the network bottleneck for a Nash flow and the minimum network bottleneck achievable by any flow. We have the following theorem. Figure 16: Multiple link failure scenario in a 288-port fabric with 6 leaves and 4 spines. Each leaf and spine connect with Theorem 1. The PoA for the CONGA game is 2. 3 × 40Gbps links and 9 randomly chosen links fail. The plots show the average queue length at all fabric ports for a web PROOF. An upper bound of 2 can be derived for the PoA using search workload at 60% load. CONGA balances traffic sig- a similar argument to that of Corollary 2 (Section V-B) in Banner et nificantly better than ECMP. Note that the improvement over al. [6]. The proof leverages the fact that the paths in the Leaf-Spine ECMP is much larger at the (remote) spine downlinks because network have length 2. The corresponding lower bound of 2 for the ECMP spreads load equally on the leaf uplinks. PoA follows from the example shown in Figure 17. Refer to the longer version of this paper for details [3].

6.1 The Price of Anarchy for CONGA Theorem 1 proves that the network bottleneck with CONGA is at In CONGA, the leaf switches make independent load balancing most twice the optimal. It is important to note that this is the worst- decisions to minimize the congestion for their own traffic. This un- case and can only occur in very specific (and artificial) scenarios coordinated or “selfish” behavior can potentially cause inefficiency such as the example in Figure 17. As our experimental evaluation in comparison to having centralized coordination. This inefficiency shows (§5), in practice the performance of CONGA is much closer is known as the Price of Anarchy (PoA) [40] and has been the sub- to optimal. ject of numerous studies in the theoretical literature (see [43] for a Remark 2. If in addition to using paths with the smallest conges- survey). In this section we analyze the PoA for CONGA using the tion metric, if the leaf switches also use only paths with the fewest bottleneck routing game model introduced by Banner et al. [6]. number of bottlenecks, then the PoA would be 1 [6], i.e., optimal. Model: We consider a two-tier Leaf-Spine network modeled as a For example, the flow in Figure 17 does not satisfy this condition complete bipartite graph. The leaf and spine nodes are connected since traffic is routed along paths with two bottlenecks even though with links of arbitrary capacities {ce}. The network is shared by alternate paths with a single bottleneck exist. Incorporating this cri- a set of users U. Each user u ∈ U has a traffic demand γu, to teria into CONGA would require additional bits to track the num- be sent from source leaf, su, to destination leaf, tu. A user must ber of bottlenecks along the packet’s path. Since our evaluation decide how to split its traffic along the paths P (su,tu) through the (§5) has not shown much potential gains, we have not added this u different spine nodes. Denote by fp the flow of user u on path refinement to our implementation. u p. The set of all user flows, f = {fp }, is termed the network P u 6.2 Impact of Workload on Load Balancing flow. A network flow is feasible if p∈P (su,tu) fp = γu for all P u u ∈ U. Let fp = fp be the total flow on a path and fe = Our evaluation showed that for some workloads ECMP actually P u∈U p|e∈p fp be the total flow on a link. Also, let ρe(fe) = fe/ce performs quite well and the benefits of more sophisticated schemes be the utilization of link e. We define the network bottleneck for a (such as MPTCP and CONGA) are limited in symmetric topolo- network flow to be the utilization of the most congested link; i.e., gies. This raises the question: How does the workload affect load B(f) , maxe∈E ρe(fe). Similarly, we define the bottleneck for balancing performance? Importantly, when are flowlets helpful? user u to be the utilization of the most congested link it uses; i.e., We now study these questions using a simple stochastic model. b (f) max u ρ (f ) Flows with an arbitrary size distribution, S, arrive according to a u , e∈E|fe >0 e e . We model CONGA as a non-cooperative game where each user Poisson process of rate λ and are spread across n links. For each selfishly routes its traffic to minimize its own bottleneck; i.e., it only flow, a random link is chosen. Let N k(t) and Ak(t) respectively sends traffic along the paths with the smallest bottleneck available denote the number of flows and the total amount of traffic sent on k PNk(t) k to it. 
6.2 Impact of Workload on Load Balancing

Our evaluation showed that for some workloads ECMP actually performs quite well and the benefits of more sophisticated schemes (such as MPTCP and CONGA) are limited in symmetric topologies. This raises the question: How does the workload affect load balancing performance? Importantly, when are flowlets helpful? We now study these questions using a simple stochastic model.

Flows with an arbitrary size distribution, $S$, arrive according to a Poisson process of rate $\lambda$ and are spread across $n$ links. For each flow, a random link is chosen. Let $N^k(t)$ and $A^k(t)$ respectively denote the number of flows and the total amount of traffic sent on link $k$ in the time interval $(0, t)$; i.e., $A^k(t) = \sum_{i=1}^{N^k(t)} s_i^k$, where $s_i^k$ is the size of the $i$-th flow assigned to link $k$. Let $Y(t) = \min_{1 \leq k \leq n} A^k(t)$ and $Z(t) = \max_{1 \leq k \leq n} A^k(t)$, and define the traffic imbalance:
\[ \chi(t) \triangleq \frac{Z(t) - Y(t)}{\frac{\lambda}{n} E(S)\, t}. \]
This is basically the deviation between the traffic on the maximum and minimum loaded links, normalized by the expected traffic on each link. The following theorem bounds the expected traffic imbalance at time $t$.

Theorem 2. Assume $E(e^{\theta S}) < \infty$ in a neighborhood of zero: $\theta \in (-\theta_0, \theta_0)$. Then for $t$ sufficiently large:
\[ E(\chi(t)) \leq \frac{1}{\sqrt{\lambda_e t}} + O\!\left(\frac{1}{t}\right), \qquad (1) \]
where:
\[ \lambda_e = \frac{\lambda}{8 n \log n \left(1 + \left(\sigma_S / E(S)\right)^2\right)}. \qquad (2) \]
Here $\sigma_S$ is the standard deviation of $S$.

Note: Kandula et al. [27] prove a similar result for the deviation of traffic on a single link from its expectation, while Theorem 2 bounds the traffic imbalance across all links.

The proof follows standard probabilistic arguments and is given in the longer version of this paper [3]. Theorem 2 shows that the traffic imbalance with randomized load balancing vanishes like $1/\sqrt{t}$ for large $t$. Moreover, the leading term is determined by the effective arrival rate, $\lambda_e$, which depends on two factors: (i) the per-link flow arrival rate, $\lambda/n$; and (ii) the coefficient of variation of the flow size distribution, $\sigma_S / E(S)$. Theorem 2 quantifies the intuition that workloads that consist mostly of small flows are easier to handle than workloads with a few large flows. Indeed, it is for the latter “heavy” workloads that we can anticipate flowlets to make a difference. This is consistent with our experiments, which show that ECMP does well for the enterprise workload (Figure 9), but for the heavier data mining workload CONGA is much better than ECMP and is even notably better than CONGA-Flow (Figure 10).
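The scaling in Theorem 2 is easy to probe numerically. The following self-contained Python sketch simulates the model above, Poisson flow arrivals placed on uniformly random links, and estimates E(χ(t)) for an all-small-flow workload and for a mix with rare large flows. The specific rates, link count, and size distributions are arbitrary choices for illustration; they are not the paper's workloads or simulator.

# Monte Carlo sketch of the traffic-imbalance model of Section 6.2 (illustrative only).
import random

def sample_chi(t, lam, n, draw_size, mean_size):
    """One sample of chi(t): flows arrive as a Poisson process of rate lam over (0, t),
    each is placed on a uniformly random one of n links; chi is the max-min traffic
    gap normalized by the expected per-link traffic (lam/n) * E(S) * t."""
    traffic = [0.0] * n
    clock = random.expovariate(lam)                  # exponential inter-arrival times
    while clock < t:
        traffic[random.randrange(n)] += draw_size()
        clock += random.expovariate(lam)
    return (max(traffic) - min(traffic)) / ((lam / n) * mean_size * t)

def avg_chi(t, lam, n, draw_size, mean_size, runs=100):
    return sum(sample_chi(t, lam, n, draw_size, mean_size) for _ in range(runs)) / runs

n, lam = 12, 20.0                                    # 12 links, 20 flows/sec in aggregate
light = lambda: 1.0                                  # all-small flows: sigma_S/E(S) = 0
heavy = lambda: 100.0 if random.random() < 0.01 else 1.0   # rare elephants: large sigma_S/E(S)
heavy_mean = 0.01 * 100.0 + 0.99 * 1.0               # E(S) for the heavy mix

for t in (10.0, 100.0, 1000.0):
    print(t, round(avg_chi(t, lam, n, light, 1.0), 3),
             round(avg_chi(t, lam, n, heavy, heavy_mean), 3))
# Both columns shrink with t (roughly like 1/sqrt(t)), and the heavy mix is always
# worse -- the effect of the (sigma_S/E(S))^2 term in the effective rate lambda_e.

Flowlet switching can be seen as a way of moving the "heavy" column toward the "light" one: it slices large flows into many smaller units that can be spread independently.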
7. DISCUSSION

We make a few remarks on aspects not covered thus far.

Incremental deployment: An interesting consequence of CONGA’s resilience to asymmetry is that it does not need to be applied to all traffic. Traffic that is not controlled by CONGA simply creates bandwidth asymmetry to which (like topology asymmetry) CONGA can adapt. This facilitates incremental deployment since some leaf switches can use ECMP or any other scheme. Regardless, CONGA reduces fabric congestion to the benefit of all traffic.

Larger topologies: While 2-tier Leaf-Spine topologies suffice for most enterprise deployments, the largest mega datacenters require networks with 3 or more tiers. CONGA may not achieve the optimal traffic balance in such cases since it only controls the load balancing decision at the leaf switches (recall that the spine switches use ECMP). However, large datacenter networks are typically organized as multiple pods, each of which is a 2-tier Clos [1, 18]. Therefore, CONGA is beneficial even in these cases since it balances the traffic within each pod optimally, which also reduces congestion for inter-pod traffic. Moreover, even for inter-pod traffic, CONGA makes better decisions than ECMP at the first hop.

A possible approach to generalizing CONGA to larger topologies is to pick a sufficient number of “good” paths between each pair of leaf switches and use leaf-to-leaf feedback to balance traffic across them. While we cannot cover all paths in general (as we do in the 2-tier case), the theoretical literature suggests that simple path selection policies such as periodically sampling a random path and retaining the best paths may perform very well [30]. We leave to future work a full exploration of CONGA in larger topologies.

Other path metrics: We used the max of the link congestion metrics as the path metric, but other choices such as the sum of the link metrics are also possible and can easily be incorporated in CONGA. Indeed, in theory, the sum metric gives a Price of Anarchy (PoA) of 4/3 in arbitrary topologies [43]. Of course, the PoA is for the worst-case (adversarial) scenario. Hence, while the PoA is better for the sum metric than the max, like prior work (e.g., TeXCP [26]) we used max because it emphasizes the bottleneck link and is also easier to implement; the sum metric requires extra bits in the packet header to prevent overflow when doing additions.
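To illustrate the implementation trade-off mentioned under "Other path metrics", the sketch below contrasts the two aggregations for a quantized per-hop congestion value carried in a packet header. The 3-bit field width and the saturating add are illustrative assumptions of this sketch, not CONGA's actual packet format.

# Max vs. sum aggregation of per-hop congestion metrics (illustrative sketch only;
# the 3-bit field and saturation behavior are assumptions, not the CONGA header).
CE_BITS = 3
CE_MAX = (1 << CE_BITS) - 1          # congestion metrics quantized to 0..7

def path_metric_max(link_metrics):
    """Max aggregation: each hop overwrites the carried field with
    max(field, local congestion); the result never exceeds the field width."""
    carried = 0
    for m in link_metrics:
        carried = max(carried, m)
    return carried

def path_metric_sum(link_metrics):
    """Sum aggregation: each hop adds its local congestion; without extra header
    bits the running total must saturate (or wrap) at the field maximum."""
    carried = 0
    for m in link_metrics:
        carried = min(carried + m, CE_MAX)     # saturating add to stay within CE_BITS
    return carried

hops = [5, 6]                         # quantized congestion of the two links on a 2-hop path
print(path_metric_max(hops))          # 6 -> reflects the most congested (bottleneck) link
print(path_metric_sum(hops))          # 7 -> saturates; distinct paths can alias to the same value

The max aggregation never needs more bits than a single link metric and directly reflects the most congested hop, which is why it is the simpler fit for a hardware pipeline; a faithful sum would need a wider header field, or else saturate and alias paths as in the second print.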
8. RELATED WORK

We briefly discuss related work that has informed and inspired our design, especially work not previously discussed.

Traditional traffic engineering mechanisms for wide-area networks [15, 14, 42, 49] use centralized algorithms operating at coarse timescales (hours) based on long-term estimates of traffic matrices. More recently, B4 [23] and SWAN [21] have shown near optimal traffic engineering for inter-datacenter WANs. These systems leverage the relative predictability of WAN traffic (operating over minutes) and are not designed for highly volatile datacenter networks. CONGA is conceptually similar to TeXCP [26], which dynamically load balances traffic between ingress-egress routers (in a wide-area network) based on path-wise congestion metrics. CONGA however is significantly simpler (while also being near optimal) so that it can be implemented directly in switch hardware and operate at microsecond time-scales, unlike TeXCP which is designed to be implemented in router software.

Besides Hedera [2], work such as MicroTE [9] proposes centralized load balancing for datacenters and has the same issues with handling traffic volatility. F10 [34] also uses a centralized scheduler, but uses a novel network topology to optimize failure recovery, not load balancing. The Incast issues we point out for MPTCP [41] can potentially be mitigated by additional mechanisms such as explicit congestion signals (e.g., XMP [11]), but complex transports are challenging to validate and deploy in practice.

A number of papers target the coarse granularity of flow-based balancing in ECMP. Besides Flare [27], LocalFlow [44] proposes spatial flow splitting based on TCP sequence numbers and DRB [10] proposes efficient per-packet round-robin load balancing. As explained in §2.4, such local schemes may interact poorly with TCP in the presence of asymmetry and perform worse than ECMP. DeTail [52] proposes a per-packet adaptive routing scheme that handles asymmetry using layer-2 back-pressure, but requires end-host modifications to deal with packet reordering.

9. CONCLUSION

The main thesis of this paper is that datacenter load balancing is best done in the network instead of the transport layer, and requires global congestion-awareness to handle asymmetry. Through extensive evaluation with a real testbed, we demonstrate that CONGA provides better performance than MPTCP [41], the state-of-the-art multipath transport for datacenter load balancing, without important drawbacks such as complexity and rigidity at the transport layer and poor Incast performance. Further, unlike local schemes such as ECMP and Flare [27], CONGA seamlessly handles asymmetries in the topology or network traffic. CONGA’s resilience to asymmetry also paid unexpected dividends when it came to incremental deployment. Even if some switches within the fabric use ECMP (or any other mechanism), CONGA’s traffic can work around bandwidth asymmetries and benefit all traffic.

CONGA leverages an existing datacenter overlay to implement a leaf-to-leaf feedback loop and can be deployed without any modifications to the TCP stack. While leaf-to-leaf feedback is not always the best strategy, it is near-optimal for 2-tier Leaf-Spine fabrics that suffice for the vast majority of deployed datacenters. It is also simple enough to implement in hardware, as proven by our implementation in custom silicon.

In summary, CONGA senses the distant drumbeats of remote congestion and orchestrates flowlets to disperse evenly through the fabric. We leave to future work the task of designing more intricate rhythms for more general topologies.

Acknowledgments: We are grateful to Georges Akis, Luca Cafiero, Krishna Doddapaneni, Mike Herbert, Prem Jain, Soni Jiandani, Mario Mazzola, Sameer Merchant, Satyam Sinha, Michael Smith, and Pauline Shuen of Insieme Networks for their valuable feedback during the development of CONGA. We would also like to thank Vimalkumar Jeyakumar, Peter Newman, Michael Smith, our shepherd Teemu Koponen, and the anonymous SIGCOMM reviewers whose comments helped us improve the paper.
10. REFERENCES

[1] M. Al-Fares, A. Loukissas, and A. Vahdat. A scalable, commodity data center network architecture. In SIGCOMM, 2008.
[2] M. Al-Fares, S. Radhakrishnan, B. Raghavan, N. Huang, and A. Vahdat. Hedera: Dynamic Flow Scheduling for Data Center Networks. In NSDI, 2010.
[3] M. Alizadeh et al. CONGA: Distributed Congestion-Aware Load Balancing for Datacenters. http://simula.stanford.edu/~alizade/papers/conga-techreport.pdf.
[4] M. Alizadeh et al. Data center TCP (DCTCP). In SIGCOMM, 2010.
[5] M. Alizadeh et al. pFabric: Minimal Near-optimal Datacenter Transport. In SIGCOMM, 2013.
[6] R. Banner and A. Orda. Bottleneck Routing Games in Communication Networks. IEEE Journal on Selected Areas in Communications, 25(6):1173–1179, 2007.
[7] M. Beck and M. Kagan. Performance Evaluation of the RDMA over Ethernet (RoCE) Standard in Enterprise Data Centers Infrastructure. In DC-CaVES, 2011.
[8] T. Benson, A. Akella, and D. A. Maltz. Network Traffic Characteristics of Data Centers in the Wild. In IMC, 2010.
[9] T. Benson, A. Anand, A. Akella, and M. Zhang. MicroTE: Fine Grained Traffic Engineering for Data Centers. In CoNEXT, 2011.
[10] J. Cao et al. Per-packet Load-balanced, Low-latency Routing for Clos-based Data Center Networks. In CoNEXT, 2013.
[11] Y. Cao, M. Xu, X. Fu, and E. Dong. Explicit Multipath Congestion Control for Data Center Networks. In CoNEXT, 2013.
[12] Y. Chen, R. Griffith, J. Liu, R. H. Katz, and A. D. Joseph. Understanding TCP Incast Throughput Collapse in Datacenter Networks. In WREN, 2009.
[13] N. Dukkipati and N. McKeown. Why Flow-completion Time is the Right Metric for Congestion Control. SIGCOMM Comput. Commun. Rev., 2006.
[14] A. Elwalid, C. Jin, S. Low, and I. Widjaja. MATE: MPLS Adaptive Traffic Engineering. In INFOCOM, 2001.
[15] B. Fortz and M. Thorup. Internet Traffic Engineering by Optimizing OSPF Weights. In INFOCOM, 2000.
[16] R. Gallager. A Minimum Delay Routing Algorithm Using Distributed Computation. IEEE Transactions on Communications, 1977.
[17] P. Gill, N. Jain, and N. Nagappan. Understanding Network Failures in Data Centers: Measurement, Analysis, and Implications. In SIGCOMM, 2011.
[18] A. Greenberg et al. VL2: A Scalable and Flexible Data Center Network. In SIGCOMM, 2009.
[19] Apache Hadoop. http://hadoop.apache.org/.
[20] C.-Y. Hong, M. Caesar, and P. B. Godfrey. Finishing Flows Quickly with Preemptive Scheduling. In SIGCOMM, 2012.
[21] C.-Y. Hong et al. Achieving High Utilization with Software-driven WAN. In SIGCOMM, 2013.
[22] R. Jain and S. Paul. Network Virtualization and Software Defined Networking for Cloud Computing: A Survey. IEEE Communications Magazine, 51(11):24–31, 2013.
[23] S. Jain et al. B4: Experience with a Globally-deployed Software Defined WAN. In SIGCOMM, 2013.
[24] S. Jansen and A. McGregor. Performance, Validation and Testing with the Network Simulation Cradle. In MASCOTS, 2006.
[25] V. Jeyakumar et al. EyeQ: Practical Network Performance Isolation at the Edge. In NSDI, 2013.
[26] S. Kandula, D. Katabi, B. Davie, and A. Charny. Walking the Tightrope: Responsive Yet Stable Traffic Engineering. In SIGCOMM, 2005.
[27] S. Kandula, D. Katabi, S. Sinha, and A. Berger. Dynamic Load Balancing Without Packet Reordering. SIGCOMM Comput. Commun. Rev., 37(2):51–62, Mar. 2007.
[28] S. Kandula, S. Sengupta, A. Greenberg, P. Patel, and R. Chaiken. The Nature of Data Center Traffic: Measurements & Analysis. In IMC, 2009.
[29] R. Kapoor et al. Bullet Trains: A Study of NIC Burst Behavior at Microsecond Timescales. In CoNEXT, 2013.
[30] P. Key, L. Massoulié, and D. Towsley. Path Selection and Multipath Congestion Control. Commun. ACM, 54(1):109–116, Jan. 2011.
[31] A. Khanna and J. Zinky. The Revised ARPANET Routing Metric. In SIGCOMM, 1989.
[32] M. Kodialam, T. V. Lakshman, J. B. Orlin, and S. Sengupta. Oblivious Routing of Highly Variable Traffic in Service Overlays and IP Backbones. IEEE/ACM Trans. Netw., 17(2):459–472, Apr. 2009.
[33] T. Koponen et al. Network Virtualization in Multi-tenant Datacenters. In NSDI, 2014.
[34] V. Liu, D. Halperin, A. Krishnamurthy, and T. Anderson. F10: A Fault-tolerant Engineered Network. In NSDI, 2013.
[35] M. Mahalingam et al. VXLAN: A Framework for Overlaying Virtualized Layer 2 Networks over Layer 3 Networks. http://tools.ietf.org/html/draft-mahalingam-dutt-dcops-vxlan-06, 2013.
[36] N. Michael, A. Tang, and D. Xu. Optimal Link-state Hop-by-hop Routing. In ICNP, 2013.
[37] MultiPath TCP - Linux Kernel Implementation. http://www.multipath-tcp.org/.
[38] T. Narten et al. Problem Statement: Overlays for Network Virtualization. http://tools.ietf.org/html/draft-ietf-nvo3-overlay-problem-statement-04, 2013.
[39] J. Ousterhout et al. The Case for RAMCloud. Commun. ACM, 54, July 2011.
[40] C. Papadimitriou. Algorithms, Games, and the Internet. In Proc. of STOC, 2001.
[41] C. Raiciu et al. Improving Datacenter Performance and Robustness with Multipath TCP. In SIGCOMM, 2011.
[42] M. Roughan, M. Thorup, and Y. Zhang. Traffic Engineering with Estimated Traffic Matrices. In IMC, 2003.
[43] T. Roughgarden. Selfish Routing and the Price of Anarchy. The MIT Press, 2005.
[44] S. Sen, D. Shue, S. Ihm, and M. J. Freedman. Scalable, Optimal Flow Routing in Datacenters via Local Link Balancing. In CoNEXT, 2013.
[45] M. Sridharan et al. NVGRE: Network Virtualization using Generic Routing Encapsulation. http://tools.ietf.org/html/draft-sridharan-virtualization-nvgre-03, 2013.
[46] A. Varga et al. The OMNeT++ Discrete Event Simulation System. In ESM, 2001.
[47] V. Vasudevan et al. Safe and Effective Fine-grained TCP Retransmissions for Datacenter Communication. In SIGCOMM, 2009.
[48] S. Vutukury and J. J. Garcia-Luna-Aceves. A Simple Approximation to Minimum-delay Routing. In SIGCOMM, 1999.
[49] H. Wang et al. COPE: Traffic Engineering in Dynamic Networks. In SIGCOMM, 2006.
[50] D. Wischik, C. Raiciu, A. Greenhalgh, and M. Handley. Design, Implementation and Evaluation of Congestion Control for Multipath TCP. In NSDI, 2011.
[51] D. Xu, M. Chiang, and J. Rexford. Link-state Routing with Hop-by-hop Forwarding Can Achieve Optimal Traffic Engineering. IEEE/ACM Trans. Netw., 19(6):1717–1730, Dec. 2011.
[52] D. Zats, T. Das, P. Mohan, D. Borthakur, and R. H. Katz. DeTail: Reducing the Flow Completion Time Tail in Datacenter Networks. In SIGCOMM, 2012.