High-Performance Routing with Multipathing and Path Diversity in Data Centers and HPC Networks

Maciej Besta1, Jens Domke2, Marcel Schneider1, Marek Konieczny3, Salvatore Di Girolamo1, Timo Schneider1, Ankit Singla1, Torsten Hoefler1
1Department of Computer Science, ETH Zurich; 2RIKEN Center for Computational Science (R-CCS); 3Faculty of Computer Science, Electronics and Telecommunications, AGH-UST

Abstract—The recent line of research into topology design focuses on lowering network diameter. Many low-diameter topologies such as Slim Fly or Jellyfish that substantially reduce cost, power consumption, and latency have been proposed. A key challenge in realizing the benefits of these topologies is routing. On the one hand, these networks provide shorter path lengths than established topologies such as Clos or torus, leading to performance improvements. On the other hand, the number of shortest paths between each pair of endpoints is much smaller than in Clos, while there is a large number of non-minimal paths between router pairs. This hampers or even makes it impossible to use established multipath routing schemes such as ECMP. In this work, to facilitate high-performance routing in modern networks, we analyze existing routing protocols and architectures, focusing on how well they exploit the diversity of minimal and non-minimal paths. We first develop a taxonomy of different forms of support for multipathing and overall path diversity. Then, we analyze how existing routing schemes support this diversity. Among others, we consider multipathing with both shortest and non-shortest paths, support for disjoint paths, and enabling adaptivity. To address the ongoing convergence of the HPC and “Big Data” domains, we consider routing protocols developed for both HPC systems and for data centers as well as general clusters. Thus, we cover architectures and protocols based on Ethernet, InfiniBand, and other HPC networks. Our review will foster developing future high-performance multipath routing protocols in supercomputers and data centers.

This is an extended version of a paper published at IEEE TPDS 2021 under the same title (arXiv:2007.03776v3 [cs.NI], 30 Oct 2020).

1 INTRODUCTION AND MOTIVATION

Fat tree [141] and related networks such as Clos [59] are the most commonly deployed topologies in data centers and supercomputers today, dominating the landscape of Ethernet clusters [166], [100], [219]. However, many low-diameter topologies such as Slim Fly or Jellyfish that substantially reduce cost, power consumption, and latency have been proposed. These networks improve the cost-performance tradeoff compared to fat trees. For instance, Slim Fly is ≈2× more cost- and power-efficient at scale than fat trees, simultaneously delivering ≈25% lower latency [35].

A key challenge in realizing the benefits of these topologies is routing. On one hand, due to their lower diameters, these networks provide shorter path lengths than fat trees and other traditional topologies such as torus. However, as illustrated by our recent research efforts [43], the number of shortest paths between each pair of endpoints is much smaller than in fat trees. Selected results are illustrated in Figure 1. In this figure, we compare established three-level fat trees (FT3) with representative modern low-diameter networks: Slim Fly (SF) [35], [33] (a variant with diameter 2), Dragonfly (DF) [137] (the “balanced” variant with diameter 3), Jellyfish (JF) [203] (with diameter 3), Xpander (XP) [219] (with diameter ≤ 3), and HyperX (Hamming graph) (HX) [4] that generalizes Flattened Butterflies (FBF) [136] with diameter 3. As observed [43], “in DF and SF, most routers are connected with one minimal path. In XP, more than 30% of routers are connected with one minimal path.” In the corresponding equivalent Jellyfish networks (i.e., random Jellyfish networks constructed using the same number of identical routers as in the corresponding non-random topology), “the results are more leveled out, but pairs of routers with one shortest path in-between still form large fractions. FT3 and HX show the highest diversity.” We conclude that in all the considered low-diameter topologies, shortest paths fall short: at least a large fraction of router pairs are connected with only one shortest path.

[Figure 1 here: per-topology histograms of the lengths of minimal paths (top) and the diversity, i.e., counts, of minimal paths (bottom) for Dragonfly, Fat tree, HyperX, Slim Fly, and Xpander, each with its equivalent Jellyfish variant; the randomness in Jellyfish “smooths out” the distributions of minimal path diversities. All topologies have a comparable number of endpoints (≈10,000).]

Fig. 1: Distributions of lengths and counts of shortest paths in low-diameter topologies and in fat trees. When analyzing counts of minimal paths between a router pair, we consider disjoint paths (no shared links). An equivalent Jellyfish network is constructed using the same number of identical routers as in the corresponding non-random topology (a plot taken from our past work [43]).

Simultaneously, these low-diameter topologies offer high diversity of non-minimal paths [43]. They provide at least three disjoint “almost”-minimal paths (i.e., paths that are one hop longer than their corresponding shortest paths) per router pair (for the majority of pairs). For example, in Slim Fly (that has the diameter of 2), 99% of router pairs are connected with multiple non-minimal paths of length 3 [43].

The above properties of low-diameter networks place unprecedented design challenges on performance-conscious routing protocols. First, as shortest paths fall short, one must resort to non-minimal routing, which is usually more complex than minimal routing. Second, as topologies lower their diameter, their link count is also reduced. Thus, even if they do indeed offer more than one non-minimal path between pairs of routers, the corresponding routing protocol must carefully use these paths in order not to congest the network (i.e., the path diversity is still a scarce resource demanding careful examination and use). Third, a shortage of shortest paths means that one cannot use established multipath routing1 schemes such as Equal-Cost Multi-Path (ECMP) [106], which usually assume that different paths between communicating entities are minimal and have equal lengths. Restricting traffic to these paths does not utilize the path diversity of low-diameter networks.

In this work, to facilitate overcoming these challenges and to propel designing high-performance routing for modern interconnects, we develop a taxonomy of different forms of support for path diversity by a routing design. These forms of support include (1) enabling multipathing using both (2) shortest and (3) non-shortest paths, (4) explicit consideration of disjoint paths, (5) support for adaptive load balancing across these paths, and (6) genericness (i.e., being applicable to different topologies). We also discuss additional aspects, for example whether a given design uses multipathing to enhance its resilience, performance, or both. Then, we use this taxonomy to categorize and analyze a wide selection of existing routing designs. Here, we consider two fundamental classes of routing designs: simple routing building blocks (e.g., ECMP [106] or Network Address Aliasing (NAA)) and routing architectures (e.g., PortLand [166] or PARX [73]). While analyzing respective routing architectures, we include and investigate the architectural and technological details of these designs, for example whether a given scheme is based on the simple Ethernet architecture, the full TCP/IP stack, the InfiniBand (IB) stack, or other HPC designs. This enables network architects and protocol designers to gain insights into supporting path diversity in the presence of different technological constraints.

We consider protocols and architectures that originated in both the HPC and data center as well as general networking communities. This is because all these environments are important in today’s large-scale networking landscape. While the most powerful Top500 systems use vendor-specific or InfiniBand (IB) interconnects, more than half of the Top500 machines [74] (e.g., in the June 2019 or in the November 2019 issues) are based on Ethernet, see Figure 2. We observe similar numbers for the Green500 list. The importance of Ethernet is increased by the “convergence of HPC and Big Data”, with cloud providers and data center operators aggressively aiming for high-bandwidth and low-latency fabrics [219], [100], [223]. Another example is Mellanox, with its Ethernet sales for the 3rd quarter of 2017 being higher than those for IB [181]. Similar trends are observed in more recent numbers: “Sales of Ethernet adapter products increased 112% year-over-year (...) we are shipping 400 Gbps Ethernet switches” [233]. At the same time, IB’s sales have been growing by 27% year-over-year, “led by strong demand for the HDR 200 gigabit solutions” [233]. Thus, our analysis can facilitate developing multipath routing both in IB-based supercomputers and in a broad landscape of cloud computing infrastructure such as data centers.

[Figure 2 here: share (percentage) of Ethernet, Infiniband, Myrinet, Omnipath, Custom, and Proprietary interconnects across Top500 list issues, 2010–2018.]

Fig. 2: The share of different interconnect technologies in the Top500 systems (a plot taken from our past work [43]).

In general, we provide the following contributions:
• We provide the first taxonomy of networking architectures and the associated routing protocols, focusing on the offered support for path diversity and multipathing.
• We use our taxonomy to categorize a wide selection of routing designs for data centers and supercomputers.
• We investigate the relationships between support for path diversity and architectural and technological details of different routing protocols.
• We discuss in detail the design of representative protocols.
• We are the first to analyze multipathing schemes related to both supercomputers and the High-Performance Computing (HPC) community (e.g., the InfiniBand stack) and to data centers (e.g., the TCP/IP stack).

1Multipath routing indicates a routing protocol that uses more than one path in the network, for at least one pair of communicating endpoints. We consider multipathing both within a single flow/message (e.g., as in spraying single packets across multiple paths, cf. § 4.6), and multipathing across flows/messages (e.g., as in standard ECMP, where different flows follow different paths, cf. § 4.4). Path diversity indicates whether a given network topology offers multiple paths between different routers (i.e., has potential for speedups from multipath routing).

Complementary Analyses. There exist surveys on multipathing [183], [5], [243], [15], [217], [142], [202]. Yet, none focuses on multipathing and path diversity offered by routing in data centers or supercomputers. For example, Lee and Choi describe multipathing in the general Internet and telecommunication networks [139].

[Figure 3 here: illustration of example modern low-diameter topologies (Slim Fly, Jellyfish, Xpander, Dragonfly, Flattened Butterfly, HyperX), multistage networks (3-level fat tree and Clos), and example traditional high-diameter topologies (mesh, torus, 3-level tree), annotating routers, groups and pods, attached servers/endpoints, and example minimal and non-minimal paths. Notable annotations: in a fat tree there is a unique path between any edge router and any core router, while in Clos there is more than one such path; in a basic Dragonfly variant, routers in a group are fully connected, although in general the intra-group cabling pattern may vary; Slim Fly groups are not necessarily fully intra-connected; in multistage networks, only edge routers are attached to servers, whereas in low-diameter topologies each router connects to a certain number of servers.]

Fig. 3: Illustration of network topologies related to the routing protocols and schemes considered in this work. Red color indicates an example shortest path between routers. Green color indicates example alternative non-minimal paths. Blue color illustrates grouping of routers.

Li et al. [142] also focus on the general Internet, covering aspects of multipath transmission related to all TCP/IP stack layers. Singh et al. [202] cover only a few multipath routing schemes used in data centers, focusing on a broad Internet setting. Moreover, some works are dedicated to performance evaluations of a few schemes for multipathing [3], [109]. Next, different works are dedicated to multipathing in sensor networks [15], [5], [183], [243]. Finally, there are analyses of other aspects of data center networking, for example energy efficiency [200], [26], optical interconnects [124], network virtualization [117], [25], overall routing [56], general data center networking with focus on traffic control in TCP [168], low-latency data centers [145], the TCP incast problem [188], bandwidth allocation [57], transport control [237], general data center networking for clouds [226], congestion management [120], reconfigurable data center networks [81], and transport protocols [207], [182]. We complement all these works, focusing solely on multipath routing in supercomputers, data centers, and small clusters. As opposed to other works with broad focus, we specifically target the performance aspects of multipathing and path diversity. Our survey is the first to deliver a taxonomy of the path diversity features of routing schemes, to categorize existing routing protocols based on this taxonomy, and to consider both traditional TCP/IP and Ethernet designs, but also protocols and concepts traditionally associated with HPC, for example multipathing in InfiniBand [179], [87].

2 FUNDAMENTAL NOTIONS

We first outline fundamental notions: network topologies, network stacks, and associated routing concepts and designs. While we do not conduct any theoretical investigation, we state – for clarity – a network model used implicitly in this work. We model an interconnection network as an undirected graph G = (V, E); V and E are sets of routers, also referred to as nodes (|V| = Nr), and full-duplex inter-router physical links. Endpoints (also referred to as servers or compute nodes) are not modeled explicitly.
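For illustration, the following sketch (ours, not part of the paper’s artifact) instantiates this model as an adjacency dictionary and counts the minimal paths between a router pair. Note one simplification: Figure 1 counts link-disjoint minimal paths, whereas this simple dynamic program counts all minimal paths, an upper bound on the disjoint count.

```python
from collections import deque

def bfs_distances(adj, src):
    """Hop distances from src in an undirected graph given as an adjacency dict."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def count_shortest_paths(adj, s, t):
    """Return (minimal path length, number of distinct minimal paths) from s to t."""
    dist = bfs_distances(adj, s)
    ways = {s: 1}
    # Process nodes in increasing distance; on the shortest-path DAG, the
    # predecessors of v are exactly its neighbors one hop closer to s.
    for v in sorted(dist, key=dist.get):
        if v == s:
            continue
        ways[v] = sum(ways[u] for u in adj[v] if dist.get(u) == dist[v] - 1)
    return dist.get(t), ways.get(t, 0)

# A 4-router ring 0-1-2-3-0 plus the chord 0-2.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2}}
print(count_shortest_paths(adj, 1, 3))  # (2, 2): two minimal paths, 1-0-3 and 1-2-3
```

Running such a count over all router pairs of a topology yields exactly the kind of per-pair diversity distribution shown in Figure 1.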

2.1 Network Topologies

We consider routing in different network topologies. The most important associated topologies are shown in Figure 3. We only briefly describe the structure that is used by routing architectures to enable multipathing (a detailed analysis of the respective topologies in terms of their path diversity is available elsewhere [43]). In most networks, routers form groups that are intra-connected with the same pattern of cables. We indicate such groups with the blue color.

Many routing designs are related to fat trees (FT) [141] and Clos (CL) [59]. In these networks (broadly referred to as “multistage topologies (MS)”), a certain fraction of routers is attached to endpoints while the remaining routers are only dedicated to forwarding traffic. A common realization of these networks consists of three stages (layers) of routers: edge (leaf) routers, aggregation (access) routers, and core (spine, border) routers. Edge and aggregation routers are additionally grouped into pods, to facilitate physical layout (cf. Fig. 3). Only edge routers connect to endpoints. Aggregation and core routers only forward traffic; they enable multipathing. The exact form of multipathing depends on the topology variant. Consider a pair of communicating edge routers (located in different pods/groups). In fat trees, multipathing is enabled by selecting different core routers and different aggregation routers to forward traffic between the same communicating pair of edge routers. Importantly, after fixing the core router, there is a unique path between the communicating edge routers. In Clos, in addition to such multipathing enabled by selecting different core routers, one can also use different paths between a specific edge and core router. Finally, simple trees are similar to fat trees in that fixing different core routers enables multipathing; still, one cannot multipath by using different aggregation routers.

The most important modern low-diameter networks are Slim Fly (SF) [35], Dragonfly (DF) [137], Jellyfish (JF) [203], Xpander (XP) [219], and HyperX (Hamming graph) (HX) [4]. Other proposed topologies in this family include Flexfly [227], Galaxyfly [140], Megafly [77], projective topologies [52], HHS [22], and others [184], [128], [172]. All these networks have different structure and thus different potential for multipathing [43]; in Figure 3, we illustrate example paths between a pair of routers. Importantly, in most of these networks, unlike in fat trees, different paths between two endpoints usually have different lengths [43]. Finally, many routing designs can be used with any topology, including traditional ones such as meshes.

2.2 Routing Concepts

We often refer to three interrelated sub-problems of routing: P path selection, R routing itself, and L load balancing. Path selection determines which paths can be used for sending a given packet. Routing itself answers the question of how the packet finds a way to its destination. Load balancing determines which path (out of the identified alternatives) should be used for sending a packet to maximize performance and minimize congestion.

2.3 Routing Schemes

We consider routing schemes (designs) that can be loosely grouped into specific protocols (e.g., OSPF [162]), architectures (e.g., PortLand [166]), and general strategies and techniques (e.g., ECMP [106] or spanning trees [174]). Overall, a protocol or a strategy often addresses a specific networking problem, rarely more than one. Contrarily, a routing architecture usually delivers a complete routing solution; it often addresses more than one, and often all, of the above-described problems. All these designs are almost always developed in the context of a specific network stack, also referred to as network architecture, which we describe next.

2.4 Network Stacks

We focus on data centers and high-performance systems. Thus, we target Ethernet & TCP/IP, and traditional HPC networks (InfiniBand, Myrinet, OmniPath, and others).

2.4.1 Ethernet & TCP/IP

In the TCP/IP protocol stack, two layers of addressing are used. On Layer 2 (L2), Ethernet (MAC) addresses are used to uniquely identify endpoints, while on Layer 3 (L3), IP addresses are assigned to endpoints. Historically, the Ethernet layer is not supposed to be routable: MAC addresses are only used within a bus-like topology where no routing is required. In contrast, the IP layer is designed to be routable, with a hierarchical structure that allows scalable routing over a worldwide network (the Internet). More recently, vendors started to provide routing abilities on the Ethernet layer for pragmatic reasons: since the Ethernet layer is effectively transparent to the software running on the endpoints, such solutions are easy to deploy. Additionally, the Ethernet interconnect of a cluster can usually be considered homogeneous, while the IP layer is used to route between networks and needs to be highly interoperable.

Since Ethernet was not designed to be routable, there are several restrictions on routing protocols for Ethernet. First, the network cannot modify any fields in the packets (control-data plane separation is key in self-configuring Ethernet devices); there is no mechanism like the TTL field in the IP header that allows the network to detect cyclic routing. Second, Ethernet devices come with pre-configured, effectively random addresses. This implies that there is no structure in the addresses that would allow for a scalable routing implementation: each switch needs to keep a lookup table with entries for each endpoint in the network. Third, since the network is expected to self-configure, Ethernet routing schemes must be robust to the addition and removal of links. These restrictions shape many routing schemes for Ethernet: spanning trees are commonly used to guarantee loop-freedom under any circumstances, and more advanced schemes often rely on wrapping Ethernet frames into a format more suitable for routing at the edge switches [166].

Another intricacy of the TCP/IP stack is that flow control is only implemented in Layer 4 (L4), the transport layer. This means that the network is not supposed to be aware of, and responsible for, load balancing and resource sharing; rather, it should deliver packets to the destination on a best-effort basis. In practice, most advanced routing schemes violate this separation and are aware of TCP flows, even though flow control is still left to the endpoint software [100].
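In practice, such flow awareness usually means that a switch classifies packets by their L3/L4 five-tuple, so that all packets of one TCP flow take the same path (the basis of ECMP-style schemes discussed later in § 4). A minimal sketch of this idea (our illustration; the hash is a generic placeholder, not any specific switch implementation):

```python
import hashlib

def pick_next_hop(ports, src_ip, dst_ip, proto, src_port, dst_port):
    """ECMP-style selection: hash the five-tuple and pick one of the
    equal-cost output ports. All packets of a flow hash identically,
    so a single flow is never reordered across paths."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    digest = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return ports[digest % len(ports)]

ports = ["eth0", "eth1", "eth2", "eth3"]
p1 = pick_next_hop(ports, "10.0.0.1", "10.0.1.7", 6, 40000, 80)
p2 = pick_next_hop(ports, "10.0.0.1", "10.0.1.7", 6, 40000, 80)
assert p1 == p2  # the same flow always maps to the same port
```

This determinism is exactly why ECMP preserves in-order delivery per flow, and also why it cannot spread a single large flow across multiple paths.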

flow control is still left to the endpoint software [100]. control mechanism to throttle ingest traffic. However, ad- Many practical problems are caused by the interaction of hering to the correct service levels or actually throttling the TCP flow control with decisions in the routing layer, and packet generation is left to the discretion of the endpoints. such problems are often discussed together with routing Similarly, in sacrifice for lowest latency and highest band- schemes, even though they are completely independent of width, IB switches have limited support for common capa- the network topology (e.g., the TCP incast problem). bilities found in Ethernet, for example VLANs, firewalling, Traditional Ethernet is lossy: when packet buffers are or other security-relevant functionality. Consequently, some full, packets are dropped. Priority Flow Control (PFC) [66] of these have been implemented in software at the end- addresses this by allowing a switch to notify another (up- points on top of the IB transport protocol, e.g., TCP/IP via stream) switch with special “pause” frames to stop sending IPoIB, whenever the HPC community deemed it necessary. frames until further notice, if the buffer occupancy in the first switch is above a certain threshold. Another extension 2.4.3 Other HPC Network Designs of Ethernet towards technologies traditionally associated Cray’s Aries [14] is a packet-switched interconnect designed with HPC is the incorporation of Remote Direct Memory for high performance and deployed on the Cray XC systems. Access (RDMA) using the RDMA over Converged Ethernet Aries adopts a dragonfly topology, where nodes within (RoCE) [112] protocol, which enriches the Ethernet with the groups are interconnected with a two-dimensional all-to- RDMA communication semantics. all structure (i.e., routers in one dragonfly group effectively form a flattened butterfly, cf. Figure 3). 
Being designed for 2.4.2 InfiniBand high-performance systems, it allows nodes to communicate with RDMA operations (i.e., put, get, and atomic opera- The InfiniBand (IB) architecture is a switched fabric design tions). The routing is destination-based, and the network and is intended for high-performance and system area addresses are tuples composed by a node identifier (18-bit, network (SAN) deployment scales. Up to 49,151 endpoints max 262,144 nodes), the memory domain handle (12-bit) (physical or virtual), addressed by a 16 bit local identifier that identifies a memory segment in the remote node, and (LID), can be arranged in a so called subnet, while the an offset (40-bit) within this segment. The Aries switches remaining address space is reserved for multicast operations employ wormhole routing [62] to minimize the per-switch within a subnet. Similar to the modern datacenter Ethernet required resources. Aries does not support VLANs or QoS (L2) solutions, these IB subnets are routable to a limited mechanisms, and its stack design does not match that of extent with switches supporting unicast and multicast for- Ethernet. Thus, we define the (software) layer at which the warding tables, flow control, and other features which do Aries routing operates as proprietary. not require modification of in-flight packet headers. Theo- Slingshot [1] is the next-generation Cray network. It retically, multiple subnets can be connected by IB routers implements a DF topology with fully-connected groups. — performing the address translation between the subnets Slingshot can switch two types of traffic: RoCE (using L3) — to create larger SANs (effectively L3 domains), but this and proprietary. Being able to manage RoCE traffic, a Sling- impedes performance due to the additionally required global shot system can be interfaced directly to data centers, while routing header (GRH) and is rarely used in practice. 
the proprietary traffic (similar to Aries, i.e., RDMA-based IB natively supports RDMA and atomic operations. The and small-packet headers) can be generated from within necessary (for high performance) lossless packet forwarding the system, preserving high performance. Cray Slingshot within IB subnets is realized through link-level, credit-based supports VLANs, QoS, and endpoint congestion mitigation. flow control [61]. Software-based and latency impeding IBM’s PERCS [19] is a two-level direct interconnection solutions to achieve reliable transmissions, as for example in network designed to achieve high bisection bandwidth TCP, are therefore not required. While switches have the ca- and avoid external switches. Groups of 32 compute nodes pability to drop deadlocked packets that reside for extended (made of four IBM POWER7 chips) are fully connected time periods in their buffers, they cannot identify livelocks, and organized in supernodes. Each supernode has 512 links such as looping unicast or multicast packets induced by connecting it to other supernodes. Depending on the system cyclic routing. Hence, the correct and acyclic routing config- size (max 512 supernodes), each supernode pair can be uration is offloaded to a centralized controller, called subnet connected with one or multiple links. The PERCS Hub manager, which configures connected IB devices, calculates Chip connects the POWER7 chips within a compute node the forwarding tables with implemented topology-agnostic between themselves and with the rest of the network. The or topology-aware routing algorithms, and monitors the net- Hub Chip participates to the cache coherence protocol and work for failures. Therefore, most routing algorithms either is able to fetch/inject data directly from the processors’ focus on minimal path length to guarantee loop-freedom, or L3 caches. 
PERCS supports RDMA, hardware-accelerated are derivatives of the Up*/Down* routing protocol [152], collective operations, direct-cache (L3) network access, and [78] which can be viewed as a generalization of the spanning enables applications to switch between different routing tree protocol of Ethernet networks. Besides this oblivious, modes. Similarly to Aries, PERCS routing operates on a destination-based routing approach, IB also supports source- proprietary stack. based routing, but unfortunately only for a limited traffic We also summarize other HPC oriented proprietary class reserved for certain management packets. interconnects. Some of them are no longer manufactured; The subnet manager can configure the InfiniBand net- we include them for the completeness of our discussion of work with a few flow control features, such as quality-of- path diversity. Myricom’s Myrinet [47] is a local area mas- service to prioritize traffic classes over others or congestion sively parallel processor network, designed to connect thou- 6

sands of small compute nodes. A more recent development, Myrinet Express (MX) [86], provides more functionalities in its network interface cards (NICs). Open-MX [92] is a communication layer that offers the MX API on top of the Ethernet hardware. Quadrics' QsNet [178], [177] integrates local memories of compute nodes into a single global virtual address space. Moreover, Intel introduced OmniPath [46], an architecture for a tight integration of CPU, memory, and storage units. Other HPC interconnects are Atos' Bull eXascale Interconnect (BXI) [67] and EXTOLL's interconnect [165]. Many of these architectures feature some form of programmable NICs [47], [177]. Finally, there exist routing protocols for specific low-diameter topologies, for example for SF [234] or DF [153]. However, they usually do not support multipathing or non-minimal routing.

2.5 Focus of This Work

In our investigation, we focus on routing. Thus, in the Ethernet and TCP/IP landscape, we focus on designs associated with Layer 2 (L2, Data Link Layer) and Layer 3 (L3, Internet Layer), cf. § 2.4.1. As most congestion control and load balancing schemes are related to higher layers, we only describe such schemes whenever they are parts of the associated L2 or L3 designs. In the InfiniBand landscape, we focus on the subnet and L3 related schemes, cf. § 2.4.2.

3 TAXONOMY OF ROUTING SCHEMES

We first identify criteria for categorizing the considered routing designs. We focus on how well these designs utilize path diversity. These criteria are used in Tables 1–2. Specifically, we analyze whether a given scheme enables using (1) arbitrary shortest paths and (2) arbitrary non-minimal paths. Moreover, we consider whether a studied scheme enables (3) multipathing (between two hosts) and whether these paths can be (4) disjoint. Finally, we investigate (5) the support for adaptive load balancing across exposed paths between router pairs and (6) compatibility with an arbitrary topology. In addition, we also indicate the location of each routing scheme in the networking stack². We also indicate whether a given multipathing scheme focuses on performance or resilience (i.e., to provide backup paths in the event of failures). Next, we identify whether supported paths come with certain restrictions, e.g., whether they are offered only within a spanning tree. Finally, we also broadly categorize the analyzed routing schemes into basic and complex ones. The former are usually specific protocols or classes of protocols, used as building blocks of the latter.

4 SIMPLE ROUTING BUILDING BLOCKS

We now present simple routing schemes, summarized in Table 1, that are usually used as building blocks for more complex routing designs. For each described scheme, we indicate what aspects of routing (as described in § 2.2) it focuses on: P path selection, R routing itself, or L load balancing. We consider both general classes of schemes (e.g., overall destination-based routing) and specific protocols (e.g., Valiant routing [220]).

Note that, in addition to schemes focusing on multipathing, we also describe designs that do not explicitly enable it. This is because these designs are often used as key building blocks of architectures that provide multipathing. An example is a simple spanning tree mechanism that – on its own – does not enable any form of multipathing, but is a basis of numerous designs that enable it [16], [209].

4.1 Destination-Based Routing Protocols

The most common approach to routing is the class of destination-based routing schemes. Each router holds a routing table that maps any destination address to a next-hop output port. No information apart from the destination address is used, and the packet does not need to be modified in transit. In this setup, it is important to differentiate the physical network topology (typically modeled as an undirected graph, since all practically used network technologies use full-duplex links, cf. § 2.1) from the routing graph, which is naturally directed in destination-based schemes. In the routing graph, there is an edge from node a to node b iff there is a routing table entry at a indicating b as the next hop destination.

Typically, the lookup table is implemented using longest-prefix matching, which allows entries with an identical address prefix and identical output port to be compressed into one table slot. This method is especially well suited to hierarchically organized networks. In general, longest-prefix matching is not required: it is feasible and common to keep uncompressed routing tables, e.g., in Ethernet routing.

Simple destination-based routing protocols can only provide a single path between any source and destination, but this path can be non-minimal. For non-minimal paths, special care must be taken to not cause cyclic routing: this can happen when the routing tables of different routers are not consistent, cf. property preserving network updates [70]. In a configuration without routing cycles, the routing graph for a fixed destination node is a tree rooted in the destination.

4.2 Source Routing (SR)

Another routing scheme is source routing (SR). Here, the route from source to destination is computed at the source, and then attached to the packet before it is injected into the network. Each switch then reads (and possibly removes) the next hop entry from the route, and forwards the packet there. Compared to destination-based routing, this allows for far more flexible path selection [121]. Yet, now the endpoints need to be aware of the network topology to make viable routing choices.

Source routing is rarely deployed in practice. Still, it could enable superior routing decisions (compared to destination-based routing) in terms of utilizing path diversity, as endpoints know the physical topology. There are recent proposals on how to deploy source routing in practice, for example with the help of OpenFlow [121], or with packet encapsulation (IP-in-IP or MAC-in-MAC) [95], [111], [94]. Source routing can also be achieved to some degree with Multiprotocol Label Switching (MPLS) [190], a technique in which a router forwards packets based on path labels instead of network addresses (i.e., the MPLS label assigned to a packet can represent a path to be chosen [190], [232]).

² We consider protocols in both Data Link (L2) and Network (L3) layers. However, we abstract away hardware details and use the term "router" for both L2 switches and L3 routers, unless describing a specific switching protocol (to avoid confusion).
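The two forwarding modes of § 4.1–4.2 can be contrasted in a minimal sketch. The topology, router names, and table contents below are purely illustrative (a hypothetical four-router line A–B–C–D), not taken from any concrete system:

```python
# Destination-based forwarding (Sec. 4.1): each router maps a destination
# address to a next hop; the packet itself is never modified in transit.
tables = {
    "A": {"D": "B"},  # at A, packets for D leave towards B
    "B": {"D": "C"},
    "C": {"D": "D"},
}

def forward_destination_based(src, dst):
    """Follow per-router tables; lookups use only the destination address."""
    path, cur = [src], src
    while cur != dst:
        cur = tables[cur][dst]
        path.append(cur)
    return path

# Source routing (Sec. 4.2): the full route is attached to the packet at
# the source; each switch reads (and removes) the next-hop entry.
def forward_source_routed(src, route):
    path, cur = [src], src
    remaining = list(route)
    while remaining:
        cur = remaining.pop(0)
        path.append(cur)
    return path

print(forward_destination_based("A", "D"))          # ['A', 'B', 'C', 'D']
print(forward_source_routed("A", ["B", "C", "D"]))  # ['A', 'B', 'C', 'D']
```

Note how the destination-based variant can only ever yield the single path encoded in the tables, whereas the source-routed variant follows whatever (possibly non-minimal) route the endpoint attaches.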

Routing scheme (name, abbreviation, reference) | Concepts (§ 2.2) | Stack layer (§ 2.4) | SP NP MP DP ALB AT | Additional remarks and clarifications

General routing building blocks (classes of routing schemes):
Simple destination-based routing | R | L2, L3 | - Ŏ* é é é - | *Care must be taken not to cause cyclic dependencies.
Simple source-based routing (SR) | — | L2, L3 | - - Ŏ* Ŏ* é - | Source routing is difficult to deploy in practice, but it is more flexible than destination-based routing. *As endpoints know the physical topology, multipathing should be easier to realize than in destination-based routing.
Simple minimal routing | P | L2, L3 | - é é é é - | Easy to deploy; numerous designs fall into this category.

Specific routing building blocks (concrete protocols or concrete protocol families):
Equal-Cost Multipathing (ECMP) [106] | L | L3 | - é - é é - | In ECMP, all routing decisions are local to each switch.
Spanning Trees (ST) [174] | — | L2 | Ŏ* Ŏ* é é é - | *The ST protocol offers shortest paths, but only within one spanning tree.
Packet Spraying (PR) [69] | — | L2, L3 | - é - é é - | One selects output ports with round-robin [69] or randomization [195].
Virtual LANs (VLANs) | — | L2 | Ŏ* Ŏ* é é é - | *VLANs by themselves do not focus on multipathing, and they inherit spanning tree limitations, but they are a key part of multipathing architectures.
IP routing protocols | — | L2, L3 | - é é é é - | Examples are OSPF [162], IS-IS [170], EIGRP [173].
Location–Identification Separation (LIS) | — | L2, L3 | -* -* é* é* é - | *LIS by itself does not focus on multipathing and path diversity, but it may facilitate developing a multipathing architecture.
Valiant load balancing (VLB) [220] | — | L2, L3 | é - é é é - | —
UGAL [137] | — | L2, L3 | - - Ŏ* é - - | UGAL means Universal Globally-Adaptive Load balanced routing.
Network Address Aliasing (NAA) | — | L3, subnet | -* -* -* -* -* - | NAA is based on IP aliasing in Ethernet networks [180] and virtual ports via LID mask control (LMC) in InfiniBand [113, Sec. 7.11.1]. *Depending on how a derived scheme implements it.
Multi-railing | — | L2, L3, subnet | -* -* - -* -* - | *Depending on how a derived scheme implements it.
Multi-planes | — | L2, L3, subnet | -* -* - - -* - | *Depending on how a derived scheme implements it.

TABLE 1: Comparison of simple routing building blocks (often used as parts of more complex routing schemes in Table 2). Rows are sorted chronologically. We focus on how well the compared schemes utilize path diversity. "Concepts" indicates the associated routing concepts described in § 2.2. "Stack layer" indicates the location of each routing scheme in the TCP/IP or InfiniBand stack (cf. § 2.4). SP, NP, MP, DP, ALB, and AT illustrate whether a given routing scheme supports various aspects of path diversity. Specifically — SP: a given scheme enables using arbitrary shortest paths. NP: a given scheme enables using arbitrary non-minimal paths. MP: a given scheme enables multipathing (between two hosts). DP: a given scheme considers disjoint paths. ALB: a given scheme offers adaptive load balancing. AT: a given scheme works with an arbitrary topology. -: a given scheme offers a given feature. Ŏ: a given scheme offers a given feature in a limited way. é: a given scheme does not offer a given feature. *Explanations in remarks.

4.3 Minimal Routing Protocols

A common approach to path selection is to only use minimal paths: paths that are no longer than the shortest path between their endpoints. Minimal paths are preferable for routing because they minimize the network resources consumed for a given volume of traffic, which is crucial to achieve good performance at high load.

An additional advantage of minimal paths is that they guarantee loop-free routing in destination-based routing schemes. For a known, fixed topology, the routing tables can be configured to always send packets along shortest paths. Since every hop along any shortest path will decrease the shortest-path distance to the destination by one, the packet always reaches its destination in a finite number of steps.

To construct shortest-path routing tables, a variation of the Floyd-Warshall all-pairs shortest path algorithm [80] can be used. Here, besides the shortest-path distance for all router pairs, one also records the out-edge at a given router (i.e., the output port) for the first step of a shortest path to any other router. Other schemes are also applicable, for example an algorithm by Suurballe and Tarjan for finding shortest pairs of edge-disjoint paths [212].

Basic minimal routing does not consider multipathing. However, schemes such as Equal-Cost Multipathing (ECMP) extend minimal routing to multipathing (§ 4.4).

4.4 Equal-Cost Multipathing (ECMP)

Equal-Cost Multipathing [106] is an extension of simple destination-based routing that specifically exploits the properties of minimal paths. Instead of having only one entry per destination in the routing tables, multiple next-hop options are stored. In practice, ECMP is used with minimal paths, because using non-minimal ones may lead to routing loops. Now, any router can make an arbitrary choice among these next-hop options. The resulting routing will still be loop-free and only use minimal paths.

ECMP allows the use of a greater variety of paths compared to simple destination-based routing. Since there may now be multiple possible paths between any pair of nodes, a mechanism for load balancing is needed. Typically, ECMP is used with a simple, oblivious scheme similar to packet spraying (§ 4.6), but on a per-flow level to prevent packet reordering [58]: each switch chooses a pseudo-random next-hop port among the shortest paths based on a hash computed from the flow parameters, aiming to obtain an even distribution of load over all minimal paths (some variations of such a simple per-flow scheme were proposed, for example Table-based Hashing [208] or FastSwitching [244]). Yet, random assignments do not imply uniform load balancing in general, and more advanced schemes such as Weighted Cost Multipathing (WCMP) [238], [241] aim to improve this. In addition, ECMP natively does not support adaptive load balancing. This is addressed by many network architectures described in Section 5 and by direct extensions of ECMP, such as Congestion-Triggered Multipathing (CTMP) [205] or Table-based Hashing with Reassignments (THR) [58].

4.5 Spanning Trees (ST)

Another approach to path selection is to restrict the topology to a spanning tree. Then, the routing graph becomes a tree of bi-directional edges, which guarantees the absence of cycles as long as no router forwards packets back on the link the packet arrived on. This can be easily enforced by each router without any global coordination. Spanning tree based solutions are popular for auto-configuring protocols on changing topologies. However, simple spanning tree-based routing can leave some links completely unused if

the network topology is not a tree. Moreover, shortest paths within a spanning tree are not necessarily shortest when considering the whole topology. Spanning tree based solutions are an alternative to minimal routing to ensure loop-free routing in destination-based routing systems. They allow for non-minimal paths at the cost of not using network resources efficiently and have been used as a building block in schemes like SPAIN [163]. A single spanning tree does not enable multipathing between two endpoints. However, as we discuss in Section 5, different network architectures use spanning trees to enable multipathing [209].

4.6 Packet Spraying

A fundamental concept for L load balancing is per-packet load balancing. In the basic variant, random packet spraying [69], each packet is sent over a randomly chosen path selected from a (static) set of possible paths. The key difference from ECMP is that modern ECMP spreads flows, not packets. Typically, packet spraying is applied to multistage networks, where many equal-length paths are available and a random path among these can be chosen by selecting a random upstream port at each router. Thus, simple packet spraying natively considers, enables, and uses multipathing.

In TCP/IP architectures, per-packet load balancing is often not considered due to the negative effects of packet reordering on TCP flow control; but these effects can still be reduced in various ways [69], [100], for example by spraying not single packets but series of packets, such as flowlets [223] or flowcells [102]. Moreover, basic random packet spraying is an oblivious load balancing method, as it does not use any information about network congestion. However, in some topologies, for example in fat trees, it can still guarantee optimal performance as long as it is used for all flows. Unfortunately, this is no longer true as soon as the topology loses its symmetry due to link failures [241].

4.7 Virtual LANs (VLANs)

Virtual LANs (VLANs) [147] were originally used for isolating Ethernet broadcast domains. They have recently been used to implement multipathing. Specifically, once a VLAN is assigned to a given spanning tree, changing the VLAN tag in a frame results in sending this frame over a different path, associated with a different spanning tree (imposed on the same physical topology). Thus, VLANs – in the context of multipathing – primarily address path selection P.

4.8 Simple IP Routing

We explicitly distinguish a class of established IP routing protocols R, such as OSPF [162] or IS-IS [170]. They are often used as parts of network architectures. Despite being generic (i.e., they can be used with any topology), they do not natively support multipathing.

4.9 Location–Identification Separation (LIS)

In Location–Identification Separation (LIS), used in some architectures, a routing scheme separates the physical location of a given endpoint from its logical identifier. In this approach, the logical identifier of a given endpoint (e.g., its IP address used in an application) does not necessarily indicate the physical location of this endpoint in the network. A mapping between identifiers and addresses can be stored in a distributed hashtable (DHT) maintained by switches [135] or hosts, or it can be provided by a directory service (e.g., using DNS) [94]. This approach enables more scalable routing [76]. Importantly, it may facilitate multipathing by – for example – maintaining multiple virtual topologies defined by different mappings in DHTs [118].

4.10 Valiant Load Balancing (VLB)

To facilitate non-minimal routing, additional information apart from the destination address can be incorporated into a destination-based routing protocol. An established and common approach is Valiant routing [220], where this additional information is an arbitrary intermediate router R that can be selected at the source endpoint. The routing is divided into two parts: first, the packet is minimally routed to R; then, it is minimally routed to the actual destination. VLB has aspects of source routing, namely the choice of R and the modification of the packet in flight, while most of the routing work is done in a destination-based way. As such, VLB natively does not consider multipathing. VLB also incorporates a specific path selection (by selecting the intermediate node randomly). This also provides simple, oblivious load balancing.

4.11 Universal Globally-Adaptive Load Balanced (UGAL)

Universal Globally-Adaptive Load balanced routing (UGAL) [137] is an extension of VLB that enables more advantageous routing decisions in the context of load balancing. Specifically, when a packet is to be routed, UGAL either selects a path determined by VLB, or a minimal one. The decision usually depends on the congestion in the network. Consequently, UGAL considers multipathing in its design: consecutive packets may be routed using different paths.

4.12 Network Address Aliasing (NAA)

Network Address Aliasing (NAA) is a building block to support multipathing, especially in InfiniBand-based networks. NAA, also known as IP aliasing in Ethernet networks [180] or port virtualization via LID mask control (LMC) in InfiniBand [113, Sec. 7.11.1], is a technique that assigns multiple identifiers to the same network endpoint. This allows routing protocols to increase the path diversity between two endpoints, and it has been used both for fail-over (enhancing resilience) [225] and for load balancing the traffic (enhancing performance) [73]. In particular, due to the destination-based routing — where a path is only defined by the given destination address, as mandated by the InfiniBand standard [113] — this address aliasing is the only standard-conforming and software-based solution to enable multiple disjoint paths between an IB source and a destination port.

4.13 Multi-Railing and Multi-Planes

Various HPC systems employ multi-railing: using multiple injection ports per node into a single topology [97], [229].

Another common scheme is multi-plane topologies, where nodes are connected to a set of disjoint topologies, either similar [96] or different [156]. This is used to increase path diversity and available throughput. However, this increased level of complexity also comes with additional challenges for the routing protocols to utilize the hardware efficiently.

5 ROUTING PROTOCOLS AND ARCHITECTURES

We now describe representative networking architectures, focusing on their support for path diversity and multipathing³, according to the taxonomy described in Section 3. Table 2 illustrates the considered architectures and the associated protocols. Symbols "-", "Ŏ", and "é" indicate that a given design offers a given feature, offers a given feature in a limited way, and does not offer a given feature, respectively.

We broadly group the considered designs into three classes. First (§ 5.1), we describe schemes that belong to the Ethernet and TCP/IP landscape and were introduced for the Internet or for small clusters, most often for the purpose of increasing resilience, with performance being only a secondary target. Despite the fact that these schemes originally did not target data centers, we include them as many of these designs were incorporated or used in some way in the data center context. Second, we incorporate Ethernet and TCP/IP related designs that are specifically targeted at data centers or supercomputers (§ 5.2). The last class is dedicated to designs related to InfiniBand (§ 5.3).

5.1 Ethernet & TCP/IP (Clusters, General Networks)

In the first part of Table 2, we illustrate the Ethernet and TCP/IP schemes that are associated with small clusters and general networks. Chronologically, the considered schemes were proposed between 1999 and 2010 (with VIRO from 2011 and MLAG from 2014 being exceptions).

Multiple Spanning Trees (MSTP) [16], [65] extends the STP protocol and enables creating and managing multiple spanning trees over the same physical network. This is done by assigning different VLANs to different spanning trees; thus, frames/packets belonging to different VLANs can traverse different paths in the network. There exist Cisco implementations of MSTP, for example Per-VLAN Spanning Tree (PVST) and Multiple-VLAN Spanning Tree (MVST).

Table-based Hashing with Reassignments (THR) [58] extends ECMP with a simple form of load balancing: it selectively reassigns some active flows based on load sharing statistics.

Global Open Ethernet (GOE) [114], [115] provides virtual private network (VPN) services in metro-area networks (MANs) using Ethernet. Its routing protocol, per-destination multiple rapid spanning tree protocol (PD-MRSTP), combines MSTP [16] (for using multiple spanning trees for different VLANs) and RSTP [17] (for quick failure recovery).

Viking [197] is very similar to GOE. It also relies on MSTP to explicitly seek faster failure recovery and more throughput by using a VLAN per spanning tree, which enables redundant switching paths between endpoints.

TeXCP [125] is a distributed Traffic Engineering (TE) protocol for balancing traffic in intra-domains of ISP operations. It focuses on algorithms for path selection and load balancing, and briefly discusses a suggested implementation that relies on protocols such as RSVP-TE [21] to deploy paths in routers. TeXCP is similar to another protocol called MATE [75].

TRansparent Interconnection of Lots of Links (TRILL) [214] and Shortest Path Bridging (SPB) [13] are similar schemes that both rely on link state routing to, among others, enable multipathing based on multiple trees and ECMP.

Ethernet on Air [191] uses the approach introduced by SEATTLE [135] to eliminate flooding in the switched network. They both rely on LIS and distributed hashtables (DHTs), implemented in switches, to map endpoints to the switches connecting these endpoints to the network. Here, Ethernet on Air uses its DHT to construct a routing substrate in the form of a Directed Acyclic Graph (DAG) between switches. Different paths in this DAG can be used for multipathing.

VIRO [118] is similar in relying on DHT-style routing. It mentions multipathing as a possible feature enabled by multiple virtual topologies built on top of a single physical network. Finally, MLAG [210] and MC-LAG [210] enable multipathing through link aggregation.

First, many of these designs enable shortest paths, but a non-negligible number is limited in this respect by the used spanning tree protocol (i.e., the used shortest paths are not shortest with respect to the underlying physical topology). A large number of protocols alleviates this with different strategies. For example, SEATTLE, Ethernet on Air, and VIRO use DHTs that virtualize the physical topology, enabling shortest paths. Other schemes, such as SmartBridge [189] or RBridges [175], directly enhance the spanning tree protocol (in various ways) to enable shortest paths. Second, many protocols also support multipathing. The two most common mechanisms for this are either ECMP (e.g., in AMP or THR) or multiple spanning trees combined with VLAN tagging (e.g., in MSTP or GOE). However, almost no schemes explicitly support non-minimal paths⁴, disjoint paths, or adaptive load balancing. Yet, they all work on arbitrary topologies.

All these features are mainly dictated by the purpose and origin of these architectures and protocols. Specifically, most of them were developed with the main goal of resilience to failures, not higher performance. This explains – for example – the almost complete lack of support for adaptive load balancing in response to network congestion. Moreover, they are all restricted by the technological constraints of general Ethernet and TCP/IP related equipment and protocols, which are historically designed for the general Internet setting. Thus, they have to support any network topology. Simultaneously, many such protocols were based on spanning trees. This dictates the nature of multipathing support in these protocols, often using some form of multiple spanning trees (MSTP, GOE, Viking) or "shortcutting" spanning trees (VIRO).

5.2 Ethernet & TCP/IP (Data Centers, Supercomputers)

The designs associated with data centers and supercomputers are listed in the second part of Table 2.

³ We encourage participation in this survey. In case the reader possesses additional information relevant for the contents, the authors welcome the input. We also encourage the reader to send us any other information that they deem important, e.g., architectures not mentioned in the current survey version.
⁴ While schemes based on spanning trees strictly speaking enable non-minimal paths, this is not a mechanism for path diversity per se, but a limitation dictated by the fact that the used spanning trees often do not enable shortest paths.

Routing scheme | Stack layer | SP NP MP DP ALB AT | Scheme used | Additional remarks and clarifications

Related to Ethernet and TCP/IP (small clusters and general networks):
OSPF-OMP (OMP) [224] | L3 | - é - é é - | OSPF | Cisco's enhancement of OSPF to the multipathing setting. Packets from the same flow are forwarded using the same path.
MPA [164] | L3 | Ŏ* Ŏ* Ŏ* é é - | — | *MPA only focuses on algorithms for generating routing paths.
SmartBridge [189] | L2 | - é é é é - | ST | SmartBridge improves ST; packets are sent between hosts using the shortest possible path in the network.
MSTP [16], [65] | L2 | Ŏ* é* - é é - | ST+VLAN | *Shortest paths are offered only within spanning trees.
STAR [150] | L2 | - é é é é - | ST | STAR improves ST; frames are forwarded over alternate paths that are shorter than their corresponding ST path.
LSOM [84] | L2 | - é é é é - | — | LSOM supports mesh networks also in MAN. LSA manages state of links.
AMP [93] | L3 | - é - é - - | ECMP, OMP | AMP extends ECMP and OSPF-OMP.
RBridges [175] | L2 | - é é é é - | — | —
THR [58] | L3 | - é - é Ŏ - | ECMP | Table-based Hashing with Reassignments (THR) extends ECMP; it selectively reassigns some active flows based on load sharing statistics.
GOE [114] | L2 | Ŏ* é* - é é - | ST+VLAN | *Shortest paths are offered only within spanning trees. One spanning tree per VLAN is used. Focus on resilience.
Viking [197] | L2 | Ŏ* é* - é é** - | ST+VLAN | *Shortest paths are offered only within spanning trees. One spanning tree per VLAN is used. **Viking uses elaborate load balancing, but it is static.
TeXCP [125] | L3 | - Ŏ - Ŏ -* - | — | Routing in ISPs; paths are computed offline. *Load balancing selects paths based on congestion and failures.
CTMP [205] | L3* | - é - - - - | ECMP | The scheme focuses on generating paths and on adaptive load balancing. It extends ECMP. *Path generation is agnostic to the layer.
SEATTLE [135] | L2 | - é é é é - | LIS (DHT) | Packets traverse the shortest paths.
SPB [13], TRILL [214] | L2 | - é* - é é - | — | —
Ethernet on Air [191] | L2 | - é Ŏ* é é - | LIS (DHT) | *Multipathing is used only for resilience.
VIRO [118] | L2–L3 | - - Ŏ* é é - | LIS (DHT) | *Multipathing could be enabled by using multiple virtual networks over the same underlying physical topology.
MLAG [210], MC-LAG [210] | L2 | Ŏ* é* Ŏ** é é - | — | *Not all shortest paths are enabled; **multipathing only for resilience.

Related to Ethernet and TCP/IP (data centers, supercomputers):
DCell [99] | L2–L3 | é - é é é é (RL)* | — | *DCell comes with a specific topology that consists of layers of routers.
Monsoon [95] | L2, L3 | Ŏ* é* Ŏ** é é é (CL)† | VLB, SR, ECMP | *VLB is used in groups of edge routers. **ECMP is used only between border and access routers.
Work by Al-Fares et al. [7] | L3 | - é - - - é (FT) | — | —
PortLand [166] | L2 | - é -* é é é (FT) | ECMP | —
MOOSE [194] | L2 | - é Ŏ* é é - | OSPF-OMP**, LIS | *Only a brief discussion on augmenting the frame format for multipathing. **Only mentioned as a possible mechanism for multipathing in MOOSE.
BCube [98] | L2–L3 | - é - - é é (RL)* | — | *BCube comes with a specific topology that consists of layers of routers.
VL2 [94] | L3* | - é - é Ŏ** é (CL) | LIS, VLB, ECMP | *L3 is used but L2 semantics are offered. **TCP congestion control.
SPAIN [163] | L2 | Ŏ* Ŏ* - - é - | ST+VLAN | *SPAIN uses one ST per VLAN. Path diversity is limited by #VLANs supported in L2 switches.
Work by van der Linden et al. [222] | L3 | - é - Ŏ* Ŏ* - | ECMP | *These aspects are only mentioned. The whole design extends ECMP.
Work by Suchara et al. [211] | L3 | - Ŏ* - - Ŏ** - | — | *Support is implicit. **Paths are precomputed based on predicted traffic. The design focuses on fault tolerance but also considers performance.
PAST [209] | L2 | Ŏ* Ŏ* Ŏ** - é - | ST+VLAN, VLB | *PAST enables either shortest or non-minimal paths. **Limited or no multipathing.
Shadow MACs [2] | L2 | - é* - é é - | — | *Non-minimal paths are mentioned only in the context of resilience.
WCMP for DC [241] | L3 | -* é -* é é é (MS)** | ECMP | WCMP uses OpenFlow [157]. *WCMP extends ECMP with hashing of flows based on link capacity. **Applicable to simple 2-stage networks.
Flexible fabric [121] | L3 | - Ŏ* é** é é Ŏ† | SR | *Non-minimal paths considered for resilience only. **Only mentioned. †Main focus is placed on leaf-spine and fat trees.
XPath [107] | L3 | - Ŏ* - - Ŏ** - | — | *Unclear scaling behavior. **XPath relies on default congestion control.
Adaptive load balancing* | L3 | - é - é - é (MS) | PR | *Examples are DRILL [90] or DRB [53].
ECMP-VLB [127] | L3 | - - - é - - (XP)* | ECMP, VLB | *Focus on the Xpander network.
FatPaths [43] | L2–L3 | -* -* - - - -** | PR***, ECMP†, VLAN† | *Simultaneous use of shortest and non-minimal paths. **Generally applicable, but the main focus is on low-diameter topologies. ***FatPaths sprays packets grouped in flowlets. †Only briefly described.

Related to InfiniBand and other traditionally HPC-related designs (data centers, supercomputers):
Shortest path schemes* | subnet | é é Ŏ** é é Ŏ | D-free | *They include Min-Hop [158], (DF-)SSSP [105], [72], and Nue [71]. **Only when combined with NAA.
MUD [152], [78] | subnet | Ŏ Ŏ Ŏ é é é* | D-free | *Original proposals disregarded IB's destination-based routing criteria; hence, applicability is limited without NAA.
LASH-TOR [204] | subnet | é é Ŏ é é é* | D-free | *Original proposals disregarded IB's destination-based routing criteria; hence, applicability is limited without NAA.
Multi-Routing [167] | subnet | Ŏ* é - - Ŏ** Ŏ* | — | *Depends on #{network planes} and/or selected routing schemes. **Must be implemented in an upper layer protocol, like MPI.
Adaptive Routing [159] | subnet | — — - — - Ŏ | — | Proprietary Mellanox extensions are outside of the InfiniBand specification.
SAR [70] | subnet | é é é é Ŏ* é (FT) | NAA, D-free | *Theoretically in Phase 2 & 4 of 'Property Preserving Network Update'.
PARX [73] | subnet | Ŏ Ŏ - é Ŏ* é (HX) | NAA, D-free | *Implemented via an upper layer protocol, e.g., a modified MPI library.
Cray's Aries [14] | propr. | - - - é -* é (DF) | UGAL, D-free | *Link congestion information is propagated through the network and used to decide between minimal and non-minimal paths.
Cray's Slingshot [1] | L3 or propr. | - - - é -* é (DF) | UGAL, D-free | *Similar to Aries; adds endpoint congestion mitigation.
Myricom's Myrinet [47] | propr. | - - ä ä ä - | SR, D-free | —
Atos' BXI [67] | propr. | - ä ä ä - - | D-free | —
EXTOLL's architecture [165] | propr. | - ä ä ä - - | — | —
Intel's OmniPath [46] | propr. | - - Ŏ* Ŏ* - - | D-free | *No built-in support for enforcing packet ordering across different paths.
Quadrics' QsNet [178], [177] | propr. | - - Ŏ* Ŏ* - é (FT) | SR | *Unclear details on how to use multipathing in practice.
IBM's PERCS [19] | propr. | - - - é Ŏ* é (DF) | UGAL, D-free | *Routing modes can be set on a per-packet basis.

TABLE 2: Routing architectures. Rows are sorted chronologically and then by topology/multipathing support. "Scheme used" indicates incorporated building blocks from Table 1. "Stack layer" indicates the location of a given scheme in the TCP/IP or InfiniBand stack (cf. § 2.4). SP, NP, MP, DP, ALB, and AT illustrate whether a given routing scheme supports various aspects of path diversity. Specifically — SP: a given scheme enables using arbitrary shortest paths. NP: a given scheme enables using arbitrary non-minimal paths. MP: a given scheme enables multipathing (between two hosts). DP: a given scheme considers disjoint (no shared links) paths. ALB: a given scheme offers adaptive load balancing. AT: a given scheme works with an arbitrary topology. -: a given scheme offers a given feature. Ŏ: a given scheme offers a given feature in a limited way. é: a given scheme does not offer a given feature. *Explanations in remarks. MS, FT, CL, XP, and HX are symbols of topologies described in § 2.1. RL is a specific type of a network called a "recursive layered" design, described in § 5.2.3. "ä": unknown. "D-free": deadlock-free.
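Several entries in Table 2 (e.g., Cray's Aries and Slingshot, IBM's PERCS) build on UGAL (§ 4.11), which picks between a minimal path and a Valiant-style non-minimal path based on congestion. The following toy sketch uses queue occupancy as a stand-in congestion signal and a simple hops-times-queueing cost; the cost model and all parameters are illustrative assumptions, not the formula of any specific system:

```python
def ugal_choose(min_hops, min_queue, valiant_hops, valiant_queue):
    """UGAL-style choice: compare estimated costs of the minimal path and
    a (randomly chosen) Valiant path, each weighted by local queue
    occupancy, and take the cheaper one. A common simplification."""
    min_cost = min_hops * (1 + min_queue)
    valiant_cost = valiant_hops * (1 + valiant_queue)
    return "minimal" if min_cost <= valiant_cost else "valiant"

# Uncongested minimal path: take it, even though a Valiant path exists.
print(ugal_choose(min_hops=2, min_queue=0, valiant_hops=4, valiant_queue=0))
# -> minimal

# Heavily congested minimal path: divert via a random intermediate router.
print(ugal_choose(min_hops=2, min_queue=10, valiant_hops=4, valiant_queue=0))
# -> valiant
```

Since the decision is made per packet (or per flowlet), consecutive packets of one flow may take different paths, which is exactly why UGAL counts as a multipathing building block in Table 1.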

5.2.1 Multistage (Fat Tree, Clos, Leaf-Spine) Designs

One distinctive group of architectures targets multistage topologies. A common key feature of all these designs is multipathing based on multiple paths of equal lengths leading via core routers (cf. § 2.1). Common building blocks are ECMP, VLB, and PR; however, the details of how exactly these blocks are deployed may vary depending on, for example, the specific targeted topology (e.g., fat tree vs. leaf-spine), the targeted stack (e.g., bare L2 Ethernet vs. the L3 IP setting), or whether a given design uses off-the-shelf equipment or instead proposes hardware modifications. Importantly, these designs focus on multipathing with shortest paths, because multistage networks offer a rich supply of such paths. They often offer some form of load balancing.

Monsoon [95] provides a hybrid L2–L3 Clos design in which all endpoints in a datacenter form a large single L2 domain. L2 switches may form multiple layers, but the last two layers (access and border) consist of L3 routers. ECMP is used for multipathing between access and border routers. All L2 layers use multipathing based on selecting a random intermediate switch in the uppermost L2 layer (with VLB). To implement this, Monsoon relies on switches that support MAC-in-MAC tunneling (encapsulation) [111], so that one may forward a frame via an intermediate switch.

PortLand [166] uses fat trees and provides a complete L2 design; it simply assumes standard ECMP for multipathing.

Al-Fares et al. [7] also focus on fat trees. They provide a complete design based on L3 routing. While they only briefly mention multipathing, they use an interesting solution for spreading traffic over core routers. Specifically, they propose that each router maintains a two-level routing table. A destination address in a packet may be matched based on its prefix ("level 1"); this matching takes place when a packet is sent to an endpoint in the same pod. If a packet goes to a different pod, the address hits a special entry leading to routing table "level 2". In this level, matching uses the address suffix ("right-hand" matching). The key observation is that, while simple prefix matching would force packets (sent to the same subnet) to use the same core router, suffix matching enables selecting different core routers. The authors propose to implement such routing tables with ternary content-addressable memories (TCAMs).
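To make the two-level lookup of Al-Fares et al. concrete, the following sketch (simplified tuple addresses and a hypothetical pod layout, not the authors' code) shows how prefix matching pins intra-pod traffic while suffix matching spreads inter-pod traffic over different core routers:

```python
# Sketch of a two-level routing table (prefix + suffix matching), in the
# spirit of Al-Fares et al. [7]. Addresses are simplified to
# (pod, switch, host) tuples instead of real IPs; all names here are
# illustrative assumptions, not part of the original design.

def route(my_pod, dst, num_cores):
    pod, switch, host = dst
    if pod == my_pod:
        # "Level 1": prefix match -- the packet stays inside the pod.
        return ("downlink", switch)
    # "Level 2": suffix match -- the host ID selects the core router, so
    # hosts in the same destination subnet still use different cores.
    return ("core", host % num_cores)

# Two packets to the same remote subnet take different core routers:
assert route(0, (2, 1, 0), num_cores=4) == ("core", 0)
assert route(0, (2, 1, 1), num_cores=4) == ("core", 1)
# Intra-pod traffic is matched on the prefix instead:
assert route(2, (2, 1, 1), num_cores=4) == ("downlink", 1)
```

The point of the suffix ("right-hand") match is visible in the last component: it varies per host, whereas the prefix is shared by the whole subnet.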
VL2 [94] targets Clos and provides a design in which the infrastructure uses L3 but the services are offered L2 semantics. VL2 combines ECMP and VLB for multipathing. To send a packet, a random core router is selected (VLB); ECMP is then used to further spread the load across available redundant paths. Using an intermediate core router in VLB is implemented with IP-in-IP encapsulation.

There is a large number of load balancing schemes for multistage networks. The majority focus on transport-layer details and are outside the scope of this work; we outline them in Section 6 and coarsely summarize them in Table 2. An example design, DRB [53], offers round-robin packet spraying, and it also discusses how to route such packets in Clos via core routers using IP-in-IP encapsulation.

5.2.2 General Network Designs

There are also architectures that focus on general topologies; some of them are tuned for certain classes of networks but may in principle work on any topology [43]. In contrast to architectures for multistage networks, designs for general networks rarely consider ECMP, because it is difficult to use ECMP in the context of a general topology without the guarantee of a rich number of redundant shortest paths, common in Clos or in a fat tree. Instead, they often resort to some combination of ST and VLANs.

SPAIN [163] is an L2 architecture that focuses on using commodity off-the-shelf switches. To enable multipathing in an arbitrary network, SPAIN (1) precomputes a set of redundant paths for different endpoint pairs, (2) merges these paths into trees, and (3) maps each such tree into a separate VLAN. Different VLANs may then be used for multipathing between endpoint pairs, assuming the used switches support VLANs. While SPAIN relies on TCP congestion control for reacting to failures, it does not offer any specific scheme for load balancing for more performance.

MOOSE [194] addresses the limited scalability of Ethernet; it simply relies on orthogonal designs such as OSPF-OMP for multipathing.

PAST [209] is a complete L2 architecture for general networks. Its key idea is to use a single spanning tree per endpoint. As such, it does not explicitly focus on ensuring multipathing between pairs of endpoints, instead focusing on providing path diversity at the granularity of a destination endpoint, by enabling the computation of different spanning trees depending on bandwidth requirements, the considered topology, etc. It enables shortest paths, but also supports VLB by offering algorithms for deriving spanning trees where paths to the root of a tree are not necessarily minimal. PAST relies on ST and VLANs for its implementation.

There are also works that focus on encoding the diversity of paths available in different networks. For example, Jyothi et al. [121] discuss encoding arbitrary paths in a data center with OpenFlow to enable a flexible fabric, XPath [107] compresses path information in a data center so that it can be aggregated into a practical number of routing entries, and van der Linden et al. [222] discuss how to effectively enable source routing by appropriately transforming selected fields of packet headers to ensure that ECMP hashing results in the desired path selection.

Some recent architectures focus on high-performance routing in low-diameter networks. ECMP-VLB is a simple routing scheme suggested for Xpander topologies [127] that, as the name suggests, combines the advantages of ECMP and VLB. Finally, FatPaths [43] targets general low-diameter networks. It (1) divides physical links into layers that form acyclic directed graphs and (2) uses paths in different layers for multipathing. Packets are sprayed over such layers using flowlets. FatPaths discusses an implementation based on address space partitioning, VLANs, or ECMP.

5.2.3 Recursive Networks

Some architectures, besides routing, also come with novel "recursive" topologies [99], [98]. The key design choice in these architectures to obtain path diversity is to use multiple NICs per server and to connect servers to one another.
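SPAIN's offline stage (§ 5.2.2) can be made concrete with a small sketch: redundant paths are greedily packed into cycle-free edge sets, and each such set becomes one VLAN. The packing heuristic below is a simplified illustration, not SPAIN's actual algorithm:

```python
# Sketch of SPAIN-style path-to-VLAN mapping [163]: precompute redundant
# paths, merge them into cycle-free trees, and map each tree onto its own
# VLAN. The greedy merging heuristic is an illustrative simplification.

def acyclic(edges):
    """Union-find cycle check over a set of undirected edges."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra == rb:
            return False  # adding (a, b) would close a cycle
        parent[ra] = rb
    return True

def build_vlans(paths):
    """Greedily pack paths into VLANs whose edge union stays a forest."""
    vlans = []
    for path in paths:
        edges = {tuple(sorted(e)) for e in zip(path, path[1:])}
        target = next((v for v in vlans if acyclic(v | edges)), None)
        if target is None:
            vlans.append(set(edges))  # open a new VLAN (new spanning tree)
        else:
            target |= edges
    return {i + 1: v for i, v in enumerate(vlans)}

# Two disjoint A-D paths land in different VLANs; a sender may then
# spread its A-D traffic across VLAN 1 and VLAN 2.
vlans = build_vlans([["A", "B", "D"], ["A", "C", "D"]])
assert len(vlans) == 2
```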

5.3 InfiniBand

We now describe the IB landscape. We omit a line of common routing protocols based on shortest paths, as they are not directly related to multipathing, although their implementations in the IB fabric manager natively support NAA; these routings are MinHop [158], SSSP [105], Deadlock-Free SSSP (DFSSSP) [72], and a DFSSSP variant called Nue [71].

5.3.1 Multi-Up*/Down* (MUD) Routing

Numerous variations of Multi-Up*/Down* routing have been proposed, e.g., [152], [78], to overcome the bottlenecks and limitations of Up*/Down*. The idea is to utilize a set of Up*/Down* spanning trees—each starting from a different root node—and to choose a path depending on certain criteria. For example, Flich et al. [78] proposed to select the two roots which give either the highest number of non-minimal or the highest number of minimal paths, and then to randomly select from those two trees for each source-destination pair. Similarly, Lysne et al. [152] proposed to identify multiple root nodes (by maximizing the minimal distance between them), and to load-balance the traffic across the resulting spanning trees to avoid the usual bottleneck near a single root. Both approaches require NAA to work with InfiniBand.

5.3.2 LASH-Transition Oriented Routing (LASH-TOR)

The goal of LASH-TOR [204] is not directly path diversity; rather, path diversity is a byproduct of how the routing tries to ensure deadlock-freedom (an essential feature in lossless networks) under resource constraints. LASH-TOR uses LAyered Shortest Path routing for the majority of source-destination pairs, and Up*/Down* as a fall-back when LASH would exceed the available virtual channels. Hence, assuming NAA to separate the LASH paths (minimal) from the Up*/Down* paths (potentially non-minimal), one can gain limited path diversity in InfiniBand.

5.3.3 Multi-Routing

Multi-routing can be viewed as an extension of the multi-plane designs outlined in § 2.1. In preliminary experiments, researchers have investigated whether the use of different routing algorithms on similar network planes can yield an observable performance gain [167]. Theoretically, in addition to the increased, non-overlapping path diversity resulting from the multi-plane design, utilizing different routing algorithms within each plane can yield benefits for certain traffic patterns and load balancing schemes, which would otherwise be hidden when the same routing is used everywhere.

5.3.4 Adaptive Routing (AR)

For completeness, we also list Mellanox's adaptive routing implementation for InfiniBand, since it (theoretically) increases path diversity and offers load balancing within the more recent Mellanox-based InfiniBand networks [159]. However, to this date, this technology is proprietary and outside of the IB specifications. Furthermore, Mellanox's AR only supports a limited set of topologies (tori-like, Clos-like, and their Dragonfly variation).

5.3.5 Scheduling-Aware Routing (SAR)

Similar to LASH-TOR, the path diversity offered by SAR was not intended as a multipathing or load balancing feature [70]. Using NAA with LMC = 1, SAR employs a primary set of shortest paths, calculated with a modified DFSSSP routing [72], and a secondary set of paths, calculated with the Up*/Down* routing algorithm. Whenever SAR reroutes the network to adapt to the currently running HPC applications, the network traffic must temporarily switch to the fixed secondary paths to avoid potential deadlocks during the deployment of the new primary forwarding rules. Hence, during each deployment, there is a short time frame where multipathing is intended, but (theoretically) the message passing layer could also utilize both the primary and secondary paths simultaneously, outside of the deployment window, without breaking SAR's validity.

5.3.6 Pattern-Aware Routing for HyperX (PARX)

PARX is the only known, and practically demonstrated, routing for InfiniBand which intentionally enforces the generation of minimal and non-minimal paths, and mixes the usage of both for load-balancing reasons [73], while still adhering to the IB specifications. The idea of this routing is to emulate AR capabilities with non-AR techniques/technologies, to overcome the bottlenecks on the shortest path between IB switches located in the same dimension of the HyperX topology. PARX for a 2D HyperX, with NAA and LMC = 2, offers between 2 and 4 disjoint paths, and adaptively selects minimal or non-minimal routes depending on the message size, to optimize for either message latency (with short payloads) or throughput (for large messages).
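The PARX idea can be illustrated with a sketch (not the published implementation): with LMC = 2, each port answers to 2^LMC = 4 consecutive LIDs, and the fabric manager can bind each LID to a different path; the sender then picks a destination LID according to message size. The threshold value and LID layout below are made-up parameters:

```python
# Sketch of PARX-style path selection [73]. With LMC = 2, a destination
# port owns 2**LMC = 4 consecutive LIDs, each of which the fabric manager
# may bind to a different (minimal or non-minimal) path. The threshold
# and the LID-to-path layout are illustrative assumptions.

LMC = 2
THRESHOLD = 16 * 1024  # bytes; hypothetical latency/throughput crossover

def select_dlid(base_lid, msg_bytes, seq):
    n = 2 ** LMC                 # LIDs available per destination port
    if msg_bytes < THRESHOLD:
        return base_lid          # short payload: minimal path, low latency
    # large message: rotate over the remaining, possibly non-minimal paths
    return base_lid + 1 + seq % (n - 1)

assert select_dlid(100, 512, seq=0) == 100       # latency-bound message
assert select_dlid(100, 1 << 20, seq=0) == 101   # throughput-bound message
assert select_dlid(100, 1 << 20, seq=1) == 102   # spread over more paths
```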
5.4 Other HPC Network Designs

Cray's Aries and Slingshot adopt adaptive UGAL routing to distribute the load across the network. When using minimal paths, packets are sent directly to the dragonfly destination group. With non-minimal paths, instead, packets are first minimally routed to an intermediate group, and then minimally routed to the destination group. Within a group, packets are always minimally routed. Routing decisions are taken on a per-packet basis. They consist of selecting a number of minimal and non-minimal paths, evaluating the load on these paths, and finally selecting one. The load is estimated using link load information propagated through the network [137]. Applications can select different "biasing levels" for the adaptive routing (e.g., a bias towards minimal routing), or disable the adaptive routing and always use minimal or non-minimal paths.

In IBM's PERCS, shortest path lengths vary between one and three hops (i.e., route within the source supernode; reach the destination supernode; route within the destination supernode). Non-minimal paths can be derived by minimally routing packets towards an intermediate supernode. The maximum non-minimal path length is five hops. As pairs of supernodes can be connected by more than one link, multiple shortest paths can exist. PERCS provides three routing modes that can be selected by applications on a per-packet basis: non-minimal, with the application defining the intermediate supernode; round-robin, with the hardware selecting among the multiple routes in a round-robin manner; and randomized (only for non-minimal paths), where the hardware randomly chooses an intermediate supernode.

Quadrics' QsNet [178], [177] is a source-routed interconnect that enables, to some extent, multipathing between two endpoints, and comes with adaptivity in switches. Specifically, a single routing table (deployed in a QsNet NIC called "Elan") translates a processor ID to a specification of a path in the network. As QsNet enables loading several routing tables, one could encode different paths in different routing tables. Finally, QsNet offers hardware support for broadcasts, and for multicasts to physically contiguous QsNet endpoints.
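The UGAL-style decision used by Aries and Slingshot (described above) can be sketched as follows; queue occupancies stand in for the propagated link-load estimates, and the candidate sets and bias are illustrative, not Cray's exact algorithm:

```python
# Sketch of a per-packet UGAL choice: compare the estimated cost of the
# best minimal path with that of the best non-minimal (Valiant) path.
# Cost = queue occupancy x hop count, a common textbook formulation.

def ugal_choose(min_paths, nonmin_paths, bias=0):
    """Each path is (queue_occupancy, hops); returns the chosen path.
    bias > 0 favors minimal routing, akin to Slingshot's biasing levels."""
    cost = lambda p: p[0] * p[1]
    best_min = min(min_paths, key=cost)
    best_non = min(nonmin_paths, key=cost)
    if cost(best_min) <= cost(best_non) + bias:
        return ("minimal", best_min)
    return ("nonminimal", best_non)

# A congested minimal path loses to a lightly loaded Valiant detour:
kind, _ = ugal_choose(min_paths=[(90, 3)], nonmin_paths=[(10, 5)])
assert kind == "nonminimal"
# ... unless the application biases the choice towards minimal routing:
kind, _ = ugal_choose(min_paths=[(90, 3)], nonmin_paths=[(10, 5)], bias=300)
assert kind == "minimal"
```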

Intel's OmniPath [46] offers two mechanisms for multipathing between any two endpoints: different paths in the fabric, or different virtual lanes within the same physical route. However, the OmniPath architecture itself does not prescribe specific mechanisms to select a specific path. Moreover, it does not provide any scheme for ensuring packet ordering. Thus, when such ordering is needed, the packets must use the same path, or the user must provide some other scheme for maintaining the right ordering.

Finally, the specifications of Myricom's Myrinet [47] or Open-MX [92], Atos' BXI [67], and EXTOLL's interconnect [165] do not disclose details on their support for multipathing. Myrinet does use source routing and works on arbitrary topologies. Both BXI and EXTOLL's designs offer adaptive routing to mitigate congestion, but it is unclear if multipathing is used.

6 RELATED ASPECTS OF NETWORKING

Congestion control and load balancing are strongly related to the transport layer (L4). This area was extensively covered in surveys on overall networking [155], [51], [187], [151], [116], [228], mobile or ad hoc environments [146], [201], [196], [79], [240], [63], [88], [154], and more recent cloud and data center networks [230], [199], [213], [239], [231], [20], [89], [198], [168], [188], [237], [226], [120], [81], [207], [182]. Thus, we do not focus on these aspects of networking, and we only mention them whenever necessary. However, as they are related to many of the considered routing schemes, we cite the respective works as a reference point for the reader. Many schemes for load balancing and congestion control were proposed in recent years [149], [29], [171], [169], [176], [144], [160], [54], [11], [103], [242], [100], [185], [23], [12], [221], [148], [110], [161], [119], [24], [9], [49], [50], [108], [129], [85], [236]. Such adaptive load balancing can be implemented using flows [60], [186], [195], [218], [30], [241], [8], [123], [106], flowcells (fixed-size packet series) [102], flowlets (variable-size packet series) [131], [10], [223], [130], [126], and single packets [235], [100], [69], [53], [176], [185], [90]. In data centers, load balancing most often focuses on flow- and flowlet-based adaptivity. This is because the targeted stack is often based on TCP, which suffers performance degradation whenever packets become reordered. In contrast, HPC networks usually use packet-level adaptivity, and research focuses on choosing good congestion signals, often with hardware modifications [82], [83].
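Flowlet-based adaptivity hinges on a simple observation: if the idle gap between two packet bursts of a flow exceeds the maximum delay difference between paths, the bursts can take different paths without causing reordering. A minimal detector sketch (the gap value is an assumption, not taken from any particular system):

```python
# Sketch of flowlet detection as used by flowlet-based load balancers
# (e.g., [131], [223]): a packet starts a new flowlet when the flow has
# been idle for longer than delta, the worst-case path-delay skew.

DELTA = 0.5e-3  # 500 us; illustrative, must exceed the max delay skew

def flowlet_ids(timestamps, delta=DELTA):
    """Map sorted packet timestamps of one flow to flowlet IDs."""
    ids, fid, last = [], 0, None
    for t in timestamps:
        if last is not None and t - last > delta:
            fid += 1  # idle gap long enough: safe to switch paths
        ids.append(fid)
        last = t
    return ids

# Three back-to-back packets form one flowlet; after a 9.8 ms pause,
# the next burst may be steered onto a different path:
assert flowlet_ids([0.0, 0.0001, 0.0002, 0.01, 0.0101]) == [0, 0, 0, 1, 1]
```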
Similarly to congestion control, we exclude flow control from our focus, as it is also usually implemented within L4.

Some works analyze various properties of low-diameter topologies, for example path length, throughput, and bandwidth [219], [128], [122], [203], [127], [33], [143], [134], [101], [132], [216], [77], [133], [22], [215], [6]. Such works could be used together with our multipathing analysis when developing routing protocols and architectures that take advantage of different properties of a given topology.

7 CHALLENGES

There are many challenges related to multipathing and path diversity support in HPC systems and data centers.

First, we predict a rich line of future routing protocols and networking architectures targeting recent low-diameter topologies. Some of the first examples are the FatPaths architecture [43] or the PARX routing [73]. However, more research is required to understand how to fully use the potential behind such networks, especially considering more effective congestion control and different technological constraints in existing networking stacks.

Moreover, little research exists on routing schemes suited specifically for particular types of workloads, for example deep learning [27], linear algebra computations [39], [40], [206], [138], graph processing [36], [33], [42], [45], [38], [91], [32], [41], and other distributed workloads [37], [34], [87] and algorithms [193], [192]. For example, as some workloads (e.g., in deep learning [28]) have more predictable communication patterns, one could try to gain speedups with multipath routing based on structural network properties that are static or change slowly. Contrarily, when routing data-driven workloads such as graph computations, one could bias more aggressively towards adaptive multipathing, for example with flowlets [43], [223].

Finally, we expect the growing importance of various schemes enabling programmable routing and transport [18], [55]. Here, one line of research will probably heavily depend on OpenFlow [157] and, especially, P4 [48]. It is also interesting to investigate how to use FPGAs [31], [44], [64] or "smart NICs" [68], [104], [55] in the context of multipathing.

8 CONCLUSION

Developing high-performance routing protocols and networking architectures for HPC systems and data centers is an important research area. Multipathing and overall support for path diversity are an important part of such designs, and specifically one of the enablers of high performance. The importance of routing is increased by the prevalence of communication-intensive workloads that put pressure on the interconnect, such as graph analytics or deep learning.

Many networking architectures and routing protocols have been developed. They offer different forms of support for multipathing, they are related to different parts of various networking stacks, and they are based on miscellaneous classes of simple routing building blocks or design principles. To propel research into future developments in the area of high-performance routing, we present the first analysis and taxonomy of the rich landscape of multipathing and path diversity support in the routing designs of supercomputers and data centers. We identify basic building blocks, crystallize fundamental concepts, list and categorize existing architectures and protocols, and discuss key design choices, focusing on the support for different forms of multipathing and path diversity. Our analysis can be used by network architects, system developers, and routing protocol designers who want to understand how to maximize the performance of their developments in the context of bare Ethernet, full TCP/IP, or InfiniBand and other HPC stacks.

Acknowledgment: This work was supported by JSPS KAKENHI Grant Number JP19H04119.

REFERENCES
[1] Slingshot: The Interconnect for the Exascale Era. Cray Inc. https://www.cray.com/sites/default/files/Slingshot-The-Interconnect-for-the-Exascale-Era.pdf.
[2] K. Agarwal, C. Dixon, E. Rozner, and J. B. Carter. Shadow MACs: scalable label-switching for commodity Ethernet. In HotSDN'14, pages 157–162, 2014.
[3] S. Aggarwal and P. Mittal. Performance evaluation of single path and multipath regarding bandwidth and delay. Intl. J. Comp. App., 145(9), 2016.
[4] J. H. Ahn, N. Binkert, A. Davis, M. McLaren, and R. S. Schreiber. HyperX: topology, routing, and packaging of efficient large-scale networks. In ACM/IEEE Supercomputing, page 41, 2009.
[5] H. D. E. Al-Ariki and M. S. Swamy. A survey and analysis of multipath routing protocols in wireless multimedia sensor networks. Wireless Networks, 23(6), 2017.
[6] F. Al Faisal, M. H. Rahman, and Y. Inoguchi. A new power efficient high performance interconnection network for many-core processors. Journal of Parallel and Distributed Computing, 101:92–102, 2017.
[7] M. Al-Fares, A. Loukissas, and A. Vahdat. A scalable, commodity data center network architecture. In ACM SIGCOMM, pages 63–74, 2008.
[8] M. Al-Fares, S. Radhakrishnan, B. Raghavan, N. Huang, and A. Vahdat. Hedera: Dynamic flow scheduling for data center networks. In NSDI, volume 10, pages 19–19, 2010.
[9] M. Alasmar, G. Parisis, and J. Crowcroft. PolyRaptor: embracing path and data redundancy in data centres for efficient data transport. In Proceedings of the ACM SIGCOMM 2018 Conference on Posters and Demos, pages 69–71. ACM, 2018.
[10] M. Alizadeh, T. Edsall, S. Dharmapurikar, R. Vaidyanathan, K. Chu, A. Fingerhut, F. Matus, R. Pan, N. Yadav, G. Varghese, et al. CONGA: Distributed congestion-aware load balancing for datacenters. In Proceedings of the 2014 ACM Conference on SIGCOMM, pages 503–514. ACM, 2014.
[11] M. Alizadeh, A. Greenberg, D. A. Maltz, J. Padhye, P. Patel, B. Prabhakar, S. Sengupta, and M. Sridharan. Data center TCP (DCTCP). ACM SIGCOMM Computer Communication Review, 41(4):63–74, 2011.
[12] M. Alizadeh, S. Yang, M. Sharif, S. Katti, N. McKeown, B. Prabhakar, and S. Shenker. pFabric: Minimal near-optimal datacenter transport. ACM SIGCOMM Computer Communication Review, 43(4):435–446, 2013.
[13] D. Allan, P. Ashwood-Smith, N. Bragg, J. Farkas, D. Fedyk, M. Ouellete, M. Seaman, and P. Unbehagen. Shortest path bridging: Efficient control of larger Ethernet networks. IEEE Communications Magazine, 48(10), 2010.
[14] B. Alverson, E. Froese, L. Kaplan, and D. Roweth. Cray XC series network. Cray Inc., White Paper WP-Aries01-1112, 2012.
[15] A. A. Anasane and R. A. Satao. A survey on various multipath routing protocols in wireless sensor networks. Procedia Computer Science, 79:610–615, 2016.
[16] ANSI/IEEE. Amendment 3 to 802.1q virtual bridged local area networks: Multiple spanning trees. ANSI/IEEE Draft Standard P802.1s/D11.2, 2001.
[17] ANSI/IEEE. Virtual bridged local area networks amendment 4: Provider bridges. ANSI/IEEE Draft Standard P802.1ad/D1, 2003.
[18] M. T. Arashloo, A. Lavrov, M. Ghobadi, J. Rexford, D. Walker, and D. Wentzlaff. Enabling programmable transport protocols in high-speed NICs. In NSDI, 2020.
[19] B. Arimilli, R. Arimilli, V. Chung, S. Clark, W. Denzel, B. Drerup, T. Hoefler, J. Joyner, J. Lewis, J. Li, N. Ni, and R. Rajamony. The PERCS High-Performance Interconnect. In Hot Interconnects. IEEE, Aug. 2010.
[20] M. Aruna, D. Bhanu, and R. Punithagowri. A survey on load balancing algorithms in cloud environment. International Journal of Computer Applications, 82(16), 2013.
[21] D. Awduche, L. Berger, D. Gan, T. Li, V. Srinivasan, and G. Swallow. RSVP-TE: extensions to RSVP for LSP tunnels, 2001.
[22] S. Azizi, N. Hashemi, and A. Khonsari. HHS: an efficient network topology for large-scale data centers. The Journal of Supercomputing, 72(3):874–899, 2016.
[23] W. Bai, L. Chen, K. Chen, D. Han, C. Tian, and W. Sun. PIAS: practical information-agnostic flow scheduling for data center networks. In Proceedings of the 13th ACM Workshop on Hot Topics in Networks, HotNets-XIII, pages 25:1–25:7, 2014.
[24] B. G. Banavalikar, C. M. DeCusatis, M. Gusat, K. G. Kamble, and R. J. Recio. Credit-based flow control in lossless Ethernet networks, Jan. 12, 2016. US Patent 9,237,111.
[25] M. F. Bari, R. Boutaba, R. Esteves, L. Z. Granville, M. Podlesny, M. G. Rabbani, Q. Zhang, and M. F. Zhani. Data center network virtualization: A survey. IEEE Communications Surveys & Tutorials, 15(2):909–928, 2012.
[26] A. Beloglazov, R. Buyya, Y. C. Lee, and A. Zomaya. A taxonomy and survey of energy-efficient data centers and cloud computing systems. In Advances in Computers, volume 82, pages 47–111. Elsevier, 2011.
[27] T. Ben-Nun, M. Besta, S. Huber, A. N. Ziogas, D. Peter, and T. Hoefler. A modular benchmarking infrastructure for high-performance and reproducible deep learning. arXiv preprint arXiv:1901.10183, 2019.
[28] T. Ben-Nun and T. Hoefler. Demystifying parallel and distributed deep learning: An in-depth concurrency analysis. ACM CSUR, 2019.
[29] C. H. Benet, A. J. Kassler, T. Benson, and G. Pongracz. MP-HULA: Multipath transport aware load balancing using programmable data planes. In Proceedings of the 2018 Morning Workshop on In-Network Computing, pages 7–13. ACM, 2018.
[30] T. Benson, A. Anand, A. Akella, and M. Zhang. MicroTE: Fine grained traffic engineering for data centers. In Proceedings of the Seventh Conference on Emerging Networking Experiments and Technologies, page 8. ACM, 2011.
[31] M. Besta, M. Fischer, T. Ben-Nun, J. De Fine Licht, and T. Hoefler. Substream-centric maximum matchings on FPGA. In ACM/SIGDA FPGA, pages 152–161, 2019.
[32] M. Besta, M. Fischer, V. Kalavri, M. Kapralov, and T. Hoefler. Practice of streaming and dynamic graphs: Concepts, models, systems, and parallelism. arXiv preprint arXiv:1912.12740, 2019.
[33] M. Besta, S. M. Hassan, S. Yalamanchili, R. Ausavarungnirun, O. Mutlu, and T. Hoefler. Slim NoC: A low-diameter on-chip network topology for high energy efficiency and scalability. In ACM SIGPLAN Notices, 2018.
[34] M. Besta and T. Hoefler. Fault tolerance for remote memory access programming models. In ACM HPDC, pages 37–48, 2014.
[35] M. Besta and T. Hoefler. Slim Fly: A Cost Effective Low-Diameter Network Topology. In ACM/IEEE Supercomputing, Nov. 2014.
[36] M. Besta and T. Hoefler. Accelerating irregular computations with hardware transactional memory and active messages. In ACM HPDC, 2015.
[37] M. Besta and T. Hoefler. Active access: A mechanism for high-performance distributed data-centric computations. In ACM ICS, 2015.
[38] M. Besta and T. Hoefler. Survey and taxonomy of lossless graph compression and space-efficient graph representations. arXiv preprint arXiv:1806.01799, 2018.
[39] M. Besta, R. Kanakagiri, H. Mustafa, M. Karasikov, G. Rätsch, T. Hoefler, and E. Solomonik. Communication-efficient Jaccard similarity for high-performance distributed genome comparisons. In IEEE IPDPS, 2020.
[40] M. Besta, F. Marending, E. Solomonik, and T. Hoefler. SlimSell: A vectorizable graph representation for breadth-first search. In IEEE IPDPS, pages 32–41, 2017.
[41] M. Besta, E. Peter, R. Gerstenberger, M. Fischer, M. Podstawski, C. Barthels, G. Alonso, and T. Hoefler. Demystifying graph databases: Analysis and taxonomy of data organization, system designs, and graph queries. arXiv preprint arXiv:1910.09017, 2019.
[42] M. Besta, M. Podstawski, L. Groner, E. Solomonik, and T. Hoefler. To push or to pull: On reducing communication and synchronization in graph computations. In ACM HPDC, 2017.
[43] M. Besta, M. Schneider, K. Cynk, M. Konieczny, E. Henriksson, S. Di Girolamo, A. Singla, and T. Hoefler. FatPaths: Routing in supercomputers and data centers when shortest paths fall short. In ACM/IEEE Supercomputing, 2020.
[44] M. Besta, D. Stanojevic, J. D. F. Licht, T. Ben-Nun, and T. Hoefler. Graph processing on FPGAs: Taxonomy, survey, challenges. arXiv preprint arXiv:1903.06697, 2019.
[45] M. Besta, D. Stanojevic, T. Zivic, J. Singh, M. Hoerold, and T. Hoefler. Log(Graph): A near-optimal high-performance graph representation. In ACM PACT, pages 1–13, 2018.
[46] M. S. Birrittella et al. Intel® Omni-Path architecture: Enabling scalable, high performance fabrics. In IEEE HOTI, 2015.
[47] N. J. Boden et al. Myrinet: A gigabit-per-second local area network. IEEE Micro, 1995.

[48] P. Bosshart, D. Daly, G. Gibb, M. Izzard, N. McKeown, J. Rexford, C. Schlesinger, D. Talayco, A. Vahdat, G. Varghese, et al. P4: Programming protocol-independent packet processors. ACM SIGCOMM Computer Communication Review, 44(3):87–95, 2014.
[49] M. Bredel, Z. Bozakov, A. Barczyk, and H. Newman. Flow-based load balancing in multipathed layer-2 networks using OpenFlow and multipath-TCP. In Hot Topics in Software Defined Networking, pages 213–214. ACM, 2014.
[50] M. Caesar, M. Casado, T. Koponen, J. Rexford, and S. Shenker. Dynamic route recomputation considered harmful. ACM SIGCOMM Computer Communication Review, 40(2):66–71, 2010.
[51] C. Callegari, S. Giordano, M. Pagano, and T. Pepe. A survey of congestion control mechanisms in Linux TCP. In International Conference on Distributed Computer and Communication Networks, pages 28–42. Springer, 2013.
[52] C. Camarero, C. Martínez, E. Vallejo, and R. Beivide. Projective networks: Topologies for large parallel computer systems. IEEE TPDS, 28(7):2003–2016, 2016.
[53] J. Cao, R. Xia, P. Yang, C. Guo, G. Lu, L. Yuan, Y. Zheng, H. Wu, Y. Xiong, and D. Maltz. Per-packet load-balanced, low-latency routing for Clos-based data center networks. In ACM CoNEXT, pages 49–60, 2013.
[54] N. Cardwell, Y. Cheng, C. S. Gunn, S. H. Yeganeh, and V. Jacobson. BBR: congestion-based congestion control. ACM Queue, 14(5):20–53, 2016.
[55] A. Caulfield, P. Costa, and M. Ghobadi. Beyond SmartNICs: Towards a fully programmable cloud. In IEEE HPSR, pages 1–6. IEEE, 2018.
[56] K. Chen, C. Hu, X. Zhang, K. Zheng, Y. Chen, and A. V. Vasilakos. Survey on routing in data centers: insights and future directions. IEEE Network, 25(4):6–10, 2011.
[57] L. Chen, B. Li, and B. Li. Allocating bandwidth in datacenter networks: A survey. Journal of Computer Science and Technology, 29(5):910–917, 2014.
[58] T. W. Chim and K. L. Yeung. Traffic distribution over equal-cost-multi-paths. In IEEE International Conference on Communications, volume 2, pages 1207–1211, 2004.
[59] C. Clos. A study of non-blocking switching networks. Bell Labs Technical Journal, 32(2):406–424, 1953.
[60] A. R. Curtis, W. Kim, and P. Yalagandula. Mahout: Low-overhead datacenter traffic management using end-host-based elephant detection. In INFOCOM, 2011 Proceedings IEEE, pages 1629–1637. IEEE, 2011.
[61] W. Dally and B. Towles. Principles and Practices of Interconnection Networks. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2003.
[62] W. J. Dally and B. P. Towles. Principles and Practices of Interconnection Networks. Elsevier, 2004.
[63] E. Dashkova and A. Gurtov. Survey on congestion control mechanisms for wireless sensor networks. In Internet of Things, Smart Spaces, and Next Generation Networking, pages 75–85. Springer, 2012.
[64] J. de Fine Licht et al. Transformations of high-level synthesis codes for high-performance computing. arXiv:1805.08288, 2018.
[65] A. F. De Sousa. Improving load balance and resilience of Ethernet carrier networks with IEEE 802.1s multiple spanning tree protocol. In IEEE ICN/ICONS/MCL, 2006.
[66] C. DeCusatis. Handbook of Fiber Optic Data Communication: A Practical Guide to Optical Networking. Academic Press, 2013.
[67] S. Derradji et al. The BXI interconnect architecture. In IEEE HOTI, 2015.
[68] S. Di Girolamo, K. Taranov, A. Kurth, M. Schaffner, T. Schneider, J. Beránek, M. Besta, L. Benini, D. Roweth, and T. Hoefler. Network-accelerated non-contiguous memory transfers. arXiv preprint arXiv:1908.08590, 2019.
[69] A. Dixit, P. Prakash, Y. C. Hu, and R. R. Kompella. On the impact of packet spraying in data center networks. In INFOCOM, 2013 Proceedings IEEE, pages 2130–2138. IEEE, 2013.
[70] J. Domke and T. Hoefler. Scheduling-Aware Routing for Supercomputers. In ACM/IEEE Supercomputing, 2016.
[71] J. Domke, T. Hoefler, and S. Matsuoka. Routing on the Dependency Graph: A New Approach to Deadlock-Free High-Performance Routing. In ACM HPDC, 2016.
[72] J. Domke, T. Hoefler, and W. Nagel. Deadlock-Free Oblivious Routing for Arbitrary Topologies. In IEEE IPDPS, 2011.
[73] J. Domke, S. Matsuoka, I. R. Ivanov, Y. Tsushima, T. Yuki, A. Nomura, S. Miura, N. McDonald, D. L. Floyd, and N. Dubé. HyperX Topology: First At-Scale Implementation and Comparison to the Fat-Tree. In ACM/IEEE Supercomputing, 2019.
[74] J. J. Dongarra, H. W. Meuer, E. Strohmaier, et al. Top500 supercomputer sites. Supercomputer, 13:89–111, 1997.
[75] A. Elwalid, C. Jin, S. Low, and I. Widjaja. MATE: MPLS adaptive traffic engineering. In IEEE INFOCOM, 2001.
[76] D. Farinacci, D. Lewis, D. Meyer, and V. Fuller. The Locator/ID Separation Protocol (LISP). RFC 6830, 2013.
[77] M. Flajslik et al. Megafly: A topology for exascale systems. In International Conference on High Performance Computing, pages 289–310. Springer, 2018.
[78] J. Flich, P. López, J. C. Sancho, A. Robles, and J. Duato. Improving InfiniBand Routing Through Multiple Virtual Networks. In ISHPC, 2002.
[79] D. J. Flora, V. Kavitha, and M. Muthuselvi. A survey on congestion control techniques in wireless sensor networks. In ICETECT, pages 1146–1149. IEEE, 2011.
[80] R. W. Floyd. Algorithm 97: Shortest path. Communications of the ACM, 5(6):345, 1962.
[81] K.-T. Foerster and S. Schmid. Survey of reconfigurable data center networks: Enablers, algorithms, complexity. ACM SIGACT News, 50(2):62–79, 2019.
[82] M. Garcia, E. Vallejo, R. Beivide, M. Odriozola, C. Camarero, M. Valero, G. Rodriguez, J. Labarta, and C. Minkenberg. On-the-fly adaptive routing in high-radix hierarchical networks. In 41st International Conference on Parallel Processing, ICPP, pages 279–288, 2012.
[83] M. Garcia, E. Vallejo, R. Beivide, M. Odriozola, and M. Valero. Efficient routing mechanisms for dragonfly networks. In 42nd International Conference on Parallel Processing, ICPP, pages 582–592, 2013.
[84] R. Garcia et al. LSOM: A link state protocol over MAC addresses for metropolitan backbones using optical Ethernet switches. In IEEE NCA, 2003.
[85] Y. Geng, V. Jeyakumar, A. Kabbani, and M. Alizadeh. Juggler: a practical reordering resilient network stack for datacenters. In Proceedings of the Eleventh European Conference on Computer Systems, pages 1–16, 2016.
[86] P. Geoffray. Myrinet eXpress (MX): Is your interconnect smart? In IEEE HPC Asia, 2004.
[87] R. Gerstenberger et al. Enabling Highly-scalable Remote Memory Access Programming with MPI-3 One Sided. In Proc. of ACM/IEEE Supercomputing, 2013.
[88] A. Ghaffari. Congestion control mechanisms in wireless sensor networks: A survey. Journal of Network and Computer Applications, 52:101–115, 2015.
[89] E. J. Ghomi, A. M. Rahmani, and N. N. Qader. Load-balancing algorithms in cloud computing: A survey. Journal of Network and Computer Applications, 88:50–71, 2017.
[90] S. Ghorbani, Z. Yang, P. Godfrey, Y. Ganjali, and A. Firoozshahian. DRILL: Micro load balancing for low-latency data center networks. In ACM SIGCOMM, 2017.
[91] L. Gianinazzi, P. Kalvoda, A. De Palma, M. Besta, and T. Hoefler. Communication-avoiding parallel minimum cuts and connected components. In ACM SIGPLAN Notices, volume 53, pages 219–232. ACM, 2018.
[92] B. Goglin. Design and implementation of Open-MX: High-performance message passing over generic Ethernet hardware. In IEEE IPDPS, 2008.
[93] I. Gojmerac, T. Ziegler, and P. Reichl. Adaptive multipath routing based on local distribution of link load information. In Springer QofIS, 2003.
[94] A. Greenberg, J. R. Hamilton, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. A. Maltz, P. Patel, and S. Sengupta. VL2: a scalable and flexible data center network. ACM SIGCOMM Computer Communication Review, 39(4):51–62, 2009.
[95] A. Greenberg, P. Lahiri, D. A. Maltz, P. Patel, and S. Sengupta. Towards a next generation data center architecture: scalability and commoditization. In ACM PRESTO, 2008.
[96] GSIC, Tokyo Institute of Technology. TSUBAME2.5 Hardware and Software, Nov. 2013.
[97] GSIC, Tokyo Institute of Technology. TSUBAME3.0 Hardware and Software Specifications, July 2017.
[98] C. Guo, G. Lu, D. Li, H. Wu, X. Zhang, Y. Shi, C. Tian, Y. Zhang, and S. Lu. BCube: a high performance, server-centric network architecture for modular data centers. ACM SIGCOMM CCR, 39(4):63–74, 2009.

[99] C. Guo, H. Wu, K. Tan, L. Shi, Y. Zhang, and S. Lu. DCell: a scalable and fault-tolerant network structure for data centers. In ACM SIGCOMM Computer Communication Review, volume 38, pages 75–86. ACM, 2008.
[100] M. Handley, C. Raiciu, A. Agache, A. Voinescu, A. W. Moore, G. Antichi, and M. Wojcik. Re-architecting datacenter networks and stacks for low latency and high performance. In ACM SIGCOMM, 2017.
[101] V. Harsh, S. A. Jyothi, I. Singh, and P. Godfrey. Expander datacenters: From theory to practice. arXiv preprint arXiv:1811.00212, 2018.
[102] K. He, E. Rozner, K. Agarwal, W. Felter, J. B. Carter, and A. Akella. Presto: Edge-based load balancing for fast datacenter networks. In ACM SIGCOMM, 2015.
[103] K. He, E. Rozner, K. Agarwal, Y. J. Gu, W. Felter, J. B. Carter, and A. Akella. AC/DC TCP: virtual congestion control enforcement for datacenter networks. In Proceedings of the 2016 conference on ACM SIGCOMM, pages 244–257, 2016.
[104] T. Hoefler, S. Di Girolamo, K. Taranov, R. E. Grant, and R. Brightwell. sPIN: High-performance streaming processing in the network. In ACM/IEEE Supercomputing, 2017.
[105] T. Hoefler, T. Schneider, and A. Lumsdaine. Optimized Routing for Large-Scale InfiniBand Networks. In IEEE HOTI, 2009.
[106] C. Hopps. RFC 2992: Analysis of an Equal-Cost Multi-Path Algorithm, 2000.
[107] S. Hu, K. Chen, H. Wu, W. Bai, C. Lan, H. Wang, H. Zhao, and C. Guo. Explicit path control in commodity data centers: Design and applications. IEEE/ACM Transactions on Networking, 24(5):2768–2781, 2016.
[108] J. Huang et al. Tuning high flow concurrency for MPTCP in data center networks. Journal of Cloud Computing, 9(1):1–15, 2020.
[109] X. Huang and Y. Fang. Performance study of node-disjoint multipath routing in vehicular ad hoc networks. IEEE Transactions on Vehicular Technology, 2009.
[110] J. Hwang, J. Yoo, and N. Choi. Deadline and incast aware TCP for cloud data center networks. Computer Networks, 68:20–34, 2014.
[111] IEEE. IEEE 802.1ah standard. http://www.ieee802.org/1/pages/802.1ah.html, 2008.
[112] InfiniBand Trade Association and others. RoCEv2, 2014.
[113] InfiniBand® Trade Association. InfiniBand™ Architecture Specification Volume 1 Release 1.3 (General Specifications), Mar. 2015.
[114] A. Iwata, Y. Hidaka, M. Umayabashi, N. Enomoto, and A. Arutaki. Global open ethernet (GOE) system and its performance evaluation. IEEE Journal on Selected Areas in Communications, 22(8):1432–1442, 2004.
[115] A. Iwata. Global optical ethernet architecture as cost-effective scalable VPN solution. In NFOEC, 2002.
[116] R. Jain. Congestion control and traffic management in ATM networks: Recent advances and a survey. Computer Networks and ISDN Systems, 28(13):1723–1738, 1996.
[117] R. Jain and S. Paul. Network virtualization and software defined networking for cloud computing: a survey. IEEE Communications Magazine, 51(11):24–31, 2013.
[118] S. Jain et al. VIRO: A scalable, robust and namespace independent virtual id routing for future networks. In IEEE INFOCOM, 2011.
[119] J. Jiang, R. Jain, and C. So-In. An explicit rate control framework for lossless ethernet operation. In Communications, 2008. ICC’08. IEEE International Conference on, pages 5914–5918. IEEE, 2008.
[120] R. P. Joglekar and P. Game. Managing congestion in data center network using congestion notification algorithms. IRJET, 2016.
[121] S. A. Jyothi, M. Dong, and P. Godfrey. Towards a flexible data center fabric with source routing. In ACM SOSR, 2015.
[122] S. A. Jyothi, A. Singla, P. B. Godfrey, and A. Kolla. Measuring and understanding throughput of network topologies. In ACM/IEEE Supercomputing, 2016.
[123] A. Kabbani, B. Vamanan, J. Hasan, and F. Duchene. FlowBender: Flow-level Adaptive Routing for Improved Latency and Throughput in Datacenter Networks. In Proceedings of the 10th ACM International on Conference on emerging Networking Experiments and Technologies, pages 149–160. ACM, 2014.
[124] C. Kachris and I. Tomkos. A survey on optical interconnects for data centers. IEEE Communications Surveys & Tutorials, 14(4):1021–1036, 2012.
[125] S. Kandula, D. Katabi, B. Davie, and A. Charny. Walking the tightrope: Responsive yet stable traffic engineering. In ACM SIGCOMM CCR, volume 35, pages 253–264. ACM, 2005.
[126] S. Kandula, D. Katabi, S. Sinha, and A. Berger. Dynamic load balancing without packet reordering. ACM SIGCOMM Computer Communication Review, 37(2):51–62, 2007.
[127] S. Kassing, A. Valadarsky, G. Shahaf, M. Schapira, and A. Singla. Beyond fat-trees without antennae, mirrors, and disco-balls. In ACM SIGCOMM, pages 281–294, 2017.
[128] G. Kathareios, C. Minkenberg, B. Prisacari, G. Rodriguez, and T. Hoefler. Cost-effective diameter-two topologies: Analysis and evaluation. In ACM/IEEE Supercomputing, 2015.
[129] N. Katta, A. Ghag, M. Hira, I. Keslassy, A. Bergman, C. Kim, and J. Rexford. Clove: Congestion-aware load balancing at the virtual edge. In Proceedings of the 13th International Conference on emerging Networking EXperiments and Technologies, pages 323–335, 2017.
[130] N. Katta, M. Hira, C. Kim, A. Sivaraman, and J. Rexford. HULA: Scalable load balancing using programmable data planes. In Proceedings of the Symposium on SDN Research, page 10. ACM, 2016.
[131] N. P. Katta, M. Hira, A. Ghag, C. Kim, I. Keslassy, and J. Rexford. CLOVE: how I learned to stop worrying about the core and love the edge. In Proceedings of the 15th ACM Workshop on Hot Topics in Networks, HotNets, pages 155–161, 2016.
[132] R. Kawano, H. Nakahara, I. Fujiwara, H. Matsutani, M. Koibuchi, and H. Amano. Loren: A scalable routing method for layout-conscious random topologies. In 2016 Fourth International Symposium on Computing and Networking (CANDAR), pages 9–18. IEEE, 2016.
[133] R. Kawano, H. Nakahara, I. Fujiwara, H. Matsutani, M. Koibuchi, and H. Amano. A layout-oriented routing method for low-latency HPC networks. IEICE Transactions on Information and Systems, 100(12):2796–2807, 2017.
[134] R. Kawano, R. Yasudo, H. Matsutani, and H. Amano. k-optimized path routing for high-throughput data center networks. In 2018 Sixth International Symposium on Computing and Networking (CANDAR), pages 99–105. IEEE, 2018.
[135] C. Kim, M. Caesar, and J. Rexford. Floodless in SEATTLE: a scalable ethernet architecture for large enterprises. In ACM SIGCOMM, pages 3–14, 2008.
[136] J. Kim, W. J. Dally, and D. Abts. Flattened butterfly: a cost-efficient topology for high-radix networks. In ACM SIGARCH Comp. Arch. News, 2007.
[137] J. Kim, W. J. Dally, S. Scott, and D. Abts. Technology-driven, highly-scalable dragonfly topology. In 2008 International Symposium on Computer Architecture, pages 77–88. IEEE, 2008.
[138] G. Kwasniewski, M. Kabić, M. Besta, J. VandeVondele, R. Solcà, and T. Hoefler. Red-blue pebbling revisited: near optimal parallel matrix-matrix multiplication. In ACM/IEEE Supercomputing, page 24. ACM, 2019.
[139] G. M. Lee and J. Choi. A survey of multipath routing for traffic engineering. Information and Communications University, Korea, 2002.
[140] F. Lei, D. Dong, X. Liao, X. Su, and C. Li. Galaxyfly: A novel family of flexible-radix low-diameter topologies for large-scale interconnection networks. In ACM ICS, 2016.
[141] C. E. Leiserson, Z. S. Abuhamdeh, D. C. Douglas, C. R. Feynman, M. N. Ganmukhi, J. V. Hill, W. D. Hillis, B. C. Kuszmaul, M. A. S. Pierre, D. S. Wells, M. C. Wong-Chan, S. Yang, and R. Zak. The network architecture of the connection machine CM-5. J. Parallel Distrib. Comput., 33(2):145–158, 1996.
[142] M. Li, A. Lukyanenko, Z. Ou, A. Ylä-Jääski, S. Tarkoma, M. Coudron, and S. Secci. Multipath transmission for the internet: A survey. IEEE Communications Surveys & Tutorials, 18(4):2887–2925, 2016.
[143] S. Li, P.-C. Huang, and B. Jacob. Exascale interconnect topology characterization and parameter exploration. In HPCC/SmartCity/DSS, pages 810–819. IEEE, 2018.
[144] Y. Li and D. Pan. OpenFlow based load balancing for fat-tree networks with multipath support. In Proc. 12th IEEE International Conference on Communications (ICC’13), Budapest, Hungary, pages 1–5, 2013.
[145] S. Liu, H. Xu, and Z. Cai. Low latency datacenter networking: A short survey. arXiv preprint arXiv:1312.3455, 2013.
[146] C. Lochert, B. Scheuermann, and M. Mauve. A survey on congestion control for mobile ad hoc networks. Wireless Communications and Mobile Computing, 7(5):655–676, 2007.
[147] LS Committee and others. IEEE standard for local and metropolitan area networks—virtual bridged local area networks. IEEE Std, 802, 2006.

[148] Y. Lu. SED: An SDN-based explicit-deadline-aware TCP for cloud data center networks. Tsinghua Science and Technology, 21(5):491–499, 2016.
[149] Y. Lu, G. Chen, B. Li, K. Tan, Y. Xiong, P. Cheng, J. Zhang, E. Chen, and T. Moscibroda. Multi-path transport for RDMA in datacenters. In 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18), pages 357–371, 2018.
[150] K.-S. Lui, W. C. Lee, and K. Nahrstedt. STAR: a transparent spanning tree bridge protocol with alternate routing. ACM SIGCOMM CCR, 32(3):33–46, 2002.
[151] W.-M. Luo, C. Lin, and B.-P. Yan. A survey of congestion control in the internet. Chinese Journal of Computers (Chinese Edition), 24(1):1–18, 2001.
[152] O. Lysne and T. Skeie. Load Balancing of Irregular System Area Networks Through Multiple Roots. In CIC. CSREA Press, 2001.
[153] G. Maglione-Mathey et al. Scalable deadlock-free deterministic minimal-path routing engine for infiniband-based dragonfly networks. IEEE TPDS, 2017.
[154] G. Maheshwari, M. Gour, and U. K. Chourasia. A survey on congestion control in MANET. International Journal of Computer Science and Information Technologies (IJCSIT), 5(2):998–1001, 2014.
[155] A. Matrawy and I. Lambadaris. A survey of congestion control schemes for multicast video applications. IEEE Communications Surveys & Tutorials, 5(2):22–31, 2003.
[156] S. Matsuoka. A64FX and Fugaku: A Game Changing, HPC / AI Optimized Arm CPU for Exascale, Sept. 2019.
[157] N. McKeown, T. Anderson, H. Balakrishnan, G. Parulkar, L. Peterson, J. Rexford, S. Shenker, and J. Turner. OpenFlow: enabling innovation in campus networks. ACM SIGCOMM Computer Communication Review, 38(2):69–74, 2008.
[158] Mellanox Technologies. Mellanox OFED for Linux User Manual Rev. 2.0-3.0.0, Aug. 2013.
[159] Mellanox Technologies. How To Configure Adaptive Routing and SHIELD (New), Nov. 2019.
[160] R. Mittal, V. T. Lam, N. Dukkipati, E. R. Blem, H. M. G. Wassel, M. Ghobadi, A. Vahdat, Y. Wang, D. Wetherall, and D. Zats. TIMELY: RTT-based congestion control for the datacenter. In Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication, SIGCOMM, pages 537–550, 2015.
[161] B. Montazeri, Y. Li, M. Alizadeh, and J. Ousterhout. Homa: A receiver-driven low-latency transport protocol using network priorities. arXiv preprint arXiv:1803.09615, 2018.
[162] J. Moy. OSPF version 2. Technical report, 1997.
[163] J. Mudigonda, P. Yalagandula, M. Al-Fares, and J. C. Mogul. SPAIN: COTS Data-Center Ethernet for Multipathing over Arbitrary Topologies. In NSDI, pages 265–280, 2010.
[164] P. Narvaez, K.-Y. Siu, and H.-Y. Tzeng. Efficient algorithms for multi-path link-state routing. 1999.
[165] S. Neuwirth et al. Scalable communication architecture for network-attached accelerators. In IEEE HPCA, 2015.
[166] R. Niranjan Mysore, A. Pamboris, N. Farrington, N. Huang, P. Miri, S. Radhakrishnan, V. Subramanya, and A. Vahdat. PortLand: a scalable fault-tolerant layer 2 data center network fabric. ACM SIGCOMM CCR, 39(4):39–50, 2009.
[167] A. Nomura et al. Performance Evaluation of Multi-rail InfiniBand Network in TSUBAME2.0 (in Japanese). IPSJ SIG Technical Report, 2012, 2012.
[168] M. Noormohammadpour and C. S. Raghavendra. Datacenter traffic control: Understanding techniques and tradeoffs. IEEE Comm. Surveys & Tutorials, 2017.
[169] V. Olteanu and C. Raiciu. Datacenter scale load balancing for multipath transport. In Proceedings of the 2016 workshop on Hot topics in Middleboxes and Network Function Virtualization, pages 20–25, 2016.
[170] D. Oran. OSI IS-IS intra-domain routing protocol. Technical report, 1990.
[171] M. Park, S. Sohn, K. Kwon, and T. T. Kwon. MaxPass: Credit-based multipath transmission for load balancing in data centers. Journal of Communications and Networks, 2019.
[172] R. Peñaranda et al. The k-ary n-direct s-indirect family of topologies for large-scale interconnection networks. The Journal of Supercomputing, 2016.
[173] I. Pepelnjak. EIGRP network design solutions. Cisco Press, 1999.
[174] R. Perlman. An algorithm for distributed computation of a spanning tree in an extended LAN. In ACM SIGCOMM CCR, volume 15, pages 44–53. ACM, 1985.
[175] R. Perlman. Rbridges: transparent routing. In IEEE INFOCOM, 2004.
[176] J. Perry, A. Ousterhout, H. Balakrishnan, D. Shah, and H. Fugal. Fastpass: A centralized zero-queue datacenter network. ACM SIGCOMM Computer Communication Review, 44(4):307–318, 2015.
[177] F. Petrini et al. The Quadrics network: High-performance clustering technology. IEEE Micro, 2002.
[178] F. Petrini et al. Performance evaluation of the Quadrics interconnection network. Cluster Computing, 2003.
[179] G. F. Pfister. An introduction to the InfiniBand architecture. High Performance Mass Storage and Parallel I/O, 42:617–632, 2001.
[180] J.-M. Pittet. IP and ARP over HIPPI-6400 (GSN). RFC 2835, May 2000.
[181] The Next Platform. The tug of war between InfiniBand and Ethernet. https://www.nextplatform.com/2017/10/30/tug-war-infiniband-ethernet/.
[182] M. Polese, F. Chiariotti, E. Bonetto, F. Rigotto, A. Zanella, and M. Zorzi. A survey on recent advances in transport layer protocols. IEEE Communications Surveys & Tutorials, 21(4):3584–3608, 2019.
[183] M. Radi, B. Dezfouli, K. A. Bakar, and M. Lee. Multipath routing in wireless sensor networks: survey and research challenges. Sensors, 12(1):650–685, 2012.
[184] M. S. Rahman, M. A. Mollah, P. Faizian, and X. Yuan. Load-balanced Slim Fly networks. In Proceedings of the 47th International Conference on Parallel Processing, pages 1–10, 2018.
[185] C. Raiciu, S. Barre, C. Pluntke, A. Greenhalgh, D. Wischik, and M. Handley. Improving datacenter performance and robustness with multipath TCP. In Proceedings of the ACM SIGCOMM 2011 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, pages 266–277, 2011.
[186] J. Rasley, B. Stephens, C. Dixon, E. Rozner, W. Felter, K. Agarwal, J. Carter, and R. Fonseca. Planck: Millisecond-scale monitoring and control for commodity networks. In ACM SIGCOMM Computer Communication Review, volume 44, pages 407–418. ACM, 2014.
[187] K. S. Reddy and L. C. Reddy. A survey on congestion control mechanisms in high speed networks. IJCSNS International Journal of Computer Science and Network Security, 8(1):187–195, 2008.
[188] Y. Ren, Y. Zhao, P. Liu, K. Dou, and J. Li. A survey on TCP incast in data center networks. International Journal of Communication Systems, 27(8):1160–1172, 2014.
[189] T. L. Rodeheffer, C. A. Thekkath, and D. C. Anderson. SmartBridge: A scalable bridge architecture. ACM SIGCOMM CCR, 30(4):205–216, 2000.
[190] E. Rosen, A. Viswanathan, R. Callon, et al. Multiprotocol label switching architecture. RFC 3031, Jan. 2001.
[191] D. Sampath, S. Agarwal, and J. Garcia-Luna-Aceves. ’Ethernet on Air’: Scalable routing in very large ethernet-based networks. In IEEE ICDCS, 2010.
[192] P. Schmid, M. Besta, and T. Hoefler. High-performance distributed RMA locks. In ACM HPDC, pages 19–30, 2016.
[193] H. Schweizer, M. Besta, and T. Hoefler. Evaluating the cost of atomic operations on modern architectures. In IEEE PACT, pages 445–456, 2015.
[194] M. Scott, A. Moore, and J. Crowcroft. Addressing the scalability of ethernet with MOOSE. In Proc. DC CAVES Workshop, 2009.
[195] S. Sen, D. Shue, S. Ihm, and M. J. Freedman. Scalable, optimal flow routing in datacenters via local link balancing. In CoNEXT, 2013.
[196] C. Sergiou, P. Antoniou, and V. Vassiliou. A comprehensive survey of congestion control protocols in wireless sensor networks. IEEE Communications Surveys & Tutorials, 16(4):1839–1859, 2014.
[197] S. Sharma, K. Gopalan, S. Nanda, and T.-c. Chiueh. Viking: A multi-spanning-tree ethernet architecture for metropolitan area and cluster networks. In IEEE INFOCOM, 2004.
[198] S. B. Shaw and A. Singh. A survey on scheduling and load balancing techniques in cloud computing environment. In 2014 International Conference on Computer and Communication Technology (ICCCT), pages 87–95. IEEE, 2014.
[199] H. Shoja, H. Nahid, and R. Azizi. A comparative survey on load balancing algorithms in cloud computing. In Fifth International Conference on Computing, Communications and Networking Technologies (ICCCNT), pages 1–5. IEEE, 2014.
[200] J. Shuja, K. Bilal, S. A. Madani, M. Othman, R. Ranjan, P. Balaji, and S. U. Khan. Survey of techniques and architectures for designing energy-efficient data centers. IEEE Systems Journal, 10(2):507–519, 2014.
[201] A. P. Silva, S. Burleigh, C. M. Hirata, and K. Obraczka. A survey on congestion control for delay and disruption tolerant networks. Ad Hoc Networks, 25:480–494, 2015.
[202] S. K. Singh, T. Das, and A. Jukan. A survey on internet multipath routing and provisioning. IEEE Communications Surveys & Tutorials, 17(4):2157–2175, 2015.
[203] A. Singla, C.-Y. Hong, L. Popa, and P. B. Godfrey. Jellyfish: Networking data centers randomly. In 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2012.
[204] T. Skeie, O. Lysne, J. Flich, P. López, A. Robles, and J. Duato. LASH-TOR: A Generic Transition-Oriented Routing Algorithm. In ICPADS, page 595. IEEE Computer Society, 2004.
[205] S. Sohn, B. L. Mark, and J. T. Brassil. Congestion-triggered multipath routing based on shortest path information. In IEEE ICCCN, 2006.
[206] E. Solomonik, M. Besta, F. Vella, and T. Hoefler. Scaling betweenness centrality using communication-efficient sparse matrix multiplication. In ACM/IEEE Supercomputing, page 47, 2017.
[207] P. Sreekumari and J.-i. Jung. Transport protocols for data center networks: a survey of issues, solutions and challenges. Photonic Network Comm., 31(1), 2016.
[208] A. Sridharan et al. Achieving near-optimal traffic engineering solutions for current OSPF/IS-IS networks. IEEE/ACM TON, 13(2):234–247, 2005.
[209] B. Stephens, A. Cox, W. Felter, C. Dixon, and J. Carter. PAST: Scalable Ethernet for data centers. In ACM CoNEXT, 2012.
[210] K. Subramanian. Multi-chassis link aggregation on network devices, June 24 2014. US Patent 8,761,005.
[211] M. Suchara, D. Xu, R. Doverspike, D. Johnson, and J. Rexford. Network architecture for joint failure recovery and traffic engineering. In ACM SIGMETRICS, 2011.
[212] J. W. Suurballe and R. E. Tarjan. A quick method for finding shortest pairs of disjoint paths. Networks, 14(2):325–336, 1984.
[213] A. Thakur and M. S. Goraya. A taxonomic survey on load balancing in cloud. Journal of Network and Computer Applications, 98:43–57, 2017.
[214] J. Touch and R. Perlman. Transparent interconnection of lots of links (TRILL): Problem and applicability statement. Technical report, 2009.
[215] N. T. Truong, I. Fujiwara, M. Koibuchi, and K.-V. Nguyen. Distributed shortcut networks: Low-latency low-degree non-random topologies targeting the diameter and cable length trade-off. IEEE Transactions on Parallel and Distributed Systems, 28(4):989–1001, 2016.
[216] T.-N. Truong, K.-V. Nguyen, I. Fujiwara, and M. Koibuchi. Layout-conscious expandable topology for low-degree interconnection networks. IEICE Transactions on Information and Systems, 99(5):1275–1284, 2016.
[217] J. Tsai and T. Moors. A review of multipath routing protocols: From wireless ad hoc to mesh networks. In ACoRN workshop on wireless multihop networking, 2006.
[218] F. P. Tso, G. Hamilton, R. Weber, C. Perkins, and D. P. Pezaros. Longer is better: Exploiting path diversity in data center networks. In IEEE 33rd International Conference on Distributed Computing Systems, ICDCS, pages 430–439, 2013.
[219] A. Valadarsky, M. Dinitz, and M. Schapira. Xpander: Unveiling the secrets of high-performance datacenters. In ACM HotNets, 2015.
[220] L. Valiant. A scheme for fast parallel communication. SIAM Journal on Computing, 11(2):350–361, 1982.
[221] B. Vamanan, J. Hasan, and T. Vijaykumar. Deadline-aware datacenter TCP (D2TCP). ACM SIGCOMM Computer Communication Review, 42(4):115–126, 2012.
[222] S. Van der Linden, G. Detal, and O. Bonaventure. Revisiting next-hop selection in multipath networks. In ACM SIGCOMM CCR, volume 41, 2011.
[223] E. Vanini, R. Pan, M. Alizadeh, P. Taheri, and T. Edsall. Let it flow: Resilient asymmetric load balancing with flowlet switching. In NSDI, pages 407–420, 2017.
[224] C. Villamizar. OSPF optimized multipath (OSPF-OMP). 1999.
[225] A. Vishnu, A. Mamidala, S. Narravula, and D. Panda. Automatic Path Migration over InfiniBand: Early Experiences. In IEEE IPDPS, 2007.
[226] B. Wang, Z. Qi, R. Ma, H. Guan, and A. V. Vasilakos. A survey on data center networking for cloud computing. Computer Networks, 91:528–547, 2015.
[227] K. Wen, P. Samadi, S. Rumley, C. P. Chen, Y. Shen, M. Bahadori, K. Bergman, and J. Wilke. Flexfly: Enabling a reconfigurable dragonfly through silicon photonics. In ACM/IEEE Supercomputing, 2016.
[228] J. Widmer, R. Denda, and M. Mauve. A survey on TCP-friendly congestion control. IEEE Network, 15(3):28–37, 2001.
[229] N. Wolfe, M. Mubarak, N. Jain, J. Domke, A. Bhatele, C. D. Carothers, and R. B. Ross. Preliminary Performance Analysis of Multi-rail Fat-tree Networks. In IEEE/ACM CCGrid, 2017.
[230] C. Xu, J. Zhao, and G.-M. Muntean. Congestion control design for multipath transport protocols: A survey. IEEE Communications Surveys & Tutorials, 2016.
[231] M. Xu, W. Tian, and R. Buyya. A survey on load balancing algorithms for virtual machines placement in cloud computing. Concurrency and Computation: Practice and Experience, 29(12):e4123, 2017.
[232] X. Xu et al. Unified source routing instructions using MPLS label stack. IETF MPLS Working Group draft, Internet Eng. Task Force, 2017.
[233] Yahoo Finance. Mellanox Delivers Record First Quarter 2020 Financial Results. https://finance.yahoo.com/news/mellanox-delivers-record-first-quarter-200500726.htm.
[234] P. Yébenes et al. Improving non-minimal and adaptive routing algorithms in Slim Fly networks. In IEEE HOTI, 2017.
[235] D. Zats, T. Das, P. Mohan, D. Borthakur, and R. H. Katz. DeTail: reducing the flow completion time tail in datacenter networks. In ACM SIGCOMM, pages 139–150, 2012.
[236] H. Zhang, J. Zhang, W. Bai, K. Chen, and M. Chowdhury. Resilient datacenter load balancing in the wild. In Proceedings of the Conference of the ACM Special Interest Group on Data Communication, pages 253–266, 2017.
[237] J. Zhang, F. Ren, and C. Lin. Survey on transport control in data center networks. IEEE Network, 27(4):22–26, 2013.
[238] J. Zhang, K. Xi, L. Zhang, and H. J. Chao. Optimizing network performance using weighted multipath routing. In IEEE ICCCN, 2012.
[239] J. Zhang, F. R. Yu, S. Wang, T. Huang, Z. Liu, and Y. Liu. Load balancing in data center networks: A survey. IEEE Communications Surveys & Tutorials, 20(3):2324–2352, 2018.
[240] J. Zhao, L. Wang, S. Li, X. Liu, Z. Yuan, and Z. Gao. A survey of congestion control mechanisms in wireless sensor networks. In 2010 Sixth International Conference on Intelligent Information Hiding and Multimedia Signal Processing, pages 719–722. IEEE, 2010.
[241] J. Zhou, M. Tewari, M. Zhu, A. Kabbani, L. Poutievski, A. Singh, and A. Vahdat. WCMP: Weighted cost multipathing for improved fairness in data centers. In ACM EuroSys, 2014.
[242] D. Zhuo, Q. Zhang, V. Liu, A. Krishnamurthy, and T. Anderson. Rack-level congestion control. In Proceedings of the 15th ACM Workshop on Hot Topics in Networks, pages 148–154. ACM, 2016.
[243] S. M. Zin, N. B. Anuar, M. L. M. Kiah, and I. Ahmedy. Survey of secure multipath routing protocols for WSNs. Journal of Network and Computer Applications, 55:123–153, 2015.
[244] A. Zinin. Cisco IP routing: Packet forwarding and intra-domain routing protocols, 2002.