ABSTRACT

Designing Hybrid Data Center Networks for High Performance and Fault Tolerance

by

Dingming Wu

Data center networks are critical infrastructure behind today’s cloud services. High-performance data center network (DCN) architectures have received extensive attention in the past decade, with multiple interesting DCN architectures proposed over the years. In the early days, data center networks were built mostly on electrical packet switches, with Clos, or multi-rooted tree, structures. These networks fall short in 1) providing high bandwidth for emerging non-uniform traffic patterns, e.g., multicast transmissions and skewed communicating server pairs, and 2) recovering from various network component failures, e.g., link failures and switch failures. More recently, optical technologies have gained high popularity due to their ultra-simple and flexible connectivity provisioning, low power, and low cost. However, the advantages of optical networking come at a price: switching speeds that are orders of magnitude lower than those of electrical switches. This thesis explores the design space of hybrid electrical and optical network architectures for modern data centers. It seeks a delicate balance between performance, fault-tolerance, scalability, and cost through coordinated use of both electrical and optical components in the network. We have developed several approaches to achieving these goals from different angles. First, we used optical splitters as key building blocks to improve multicast transmission performance. We built an unconventional optical multicast architecture, called HyperOptics, that provides orders-of-magnitude throughput improvements for multicast transmissions. Second, we developed a failure-tolerant network, called ShareBackup, by embedding optical switches into Clos networks. ShareBackup, for the first time, achieves network-wide full-capacity failure recovery in milliseconds. Third, we proposed to enable topology programmability at runtime by inserting optical switches at the network edge. Our system, called RDC, breaks the bandwidth boundaries between servers and dynamically optimizes its topology according to traffic patterns. Through these three works, we demonstrate the high potential of hybrid data center network architectures for high performance and fault-tolerance.

Acknowledgments

I would first like to thank my advisor, Prof. T. S. Eugene Ng, for his support, encouragement, and precious advice at both the academic and the personal levels. I am thankful that he pushed me and gave me directions when I did not know what to do. I am thankful that he supported and helped me to explore various external opportunities when I could not make further progress. I am thankful that he believed in me and encouraged me when I had doubts in myself. Without his mentorship, I could not be here today. I also thank my thesis committee members, Ang Chen and Lin Zhong. They have given me much valuable feedback and many suggestions on my research, job search, as well as this thesis. Special thanks are owed to Ang, collaborations with whom produced one of the major results in this thesis. I want to thank my internship mentors, Yingnong Dang and Guohui Wang. They taught me by example how to be productive and professional in the industry. They also gave me a lot of suggestions that helped me grow and become technically mature. More importantly, I thank them for providing me with memorable experiences in Seattle and Silicon Valley. I am grateful to have many fantastic fellow students in the CS department. In particular, I thank my groupmates in the BOLD lab: Xiaoye Steven Sun, Yiting Xia, Simbarashe Dzinamarira, Xin Sunny Huang, Sushovan Das, Afsaneh Rahbar, and Weitao Wang. They helped me a lot in shaping the ideas behind this thesis as well as in conducting the experimental studies. I also thank the fellow network-systems students for their productive discussions with me: Kuo-Feng Hsu, Jiarong Xing, Qiao Kang, and Xinyu Wu.

Graduate studies are stressful; I am thankful for the company of many of my friends at Rice. I treasure the friendship with Xiaoye Sun, Lechen Yu, Beidi Chen, Chen Luo, Terry Tang, and Simbarashe Dzinamarira. I am grateful that they were always there when I wanted to share my ups and downs. They have made these years an enjoyable journey for me. Finally, I would like to give my special thanks to my parents, Ping Xiao and Tongxin Wu, who have always trusted me and supported me along the way.

Contents

Abstract
Acknowledgments
List of Illustrations
List of Tables

1 Introduction
   1.1 Data Center Networks: State of the Art and Challenges
   1.2 Thesis Contributions
   1.3 Thesis Organization

2 Background and Related Work
   2.1 Optics in Data Centers
   2.2 Hybrid Data Center Network Architectures
   2.3 Supporting Multicast Transmissions in Data Center Networks
   2.4 Failure Recovery in Data Center Networks

3 HyperOptics: Integrating An Optical Multicast Architecture in Data Centers
   3.1 Introduction
   3.2 HyperOptics Architecture
      3.2.1 ToR Connectivity Design
      3.2.2 Routing and Relay Set Computation
      3.2.3 Analysis
      3.2.4 System Overview
      3.2.5 Multicast Scheduling
   3.3 Evaluation
      3.3.1 Compared networks
      3.3.2 Simulation setup
      3.3.3 Effect of Splitter Fanout Used in OCS
      3.3.4 Performance Comparison
   3.4 Summary

4 Sharebackup: Masking Data Center Network Failures from Application Performance
   4.1 Introduction
   4.2 Related Work
   4.3 Network Architecture
   4.4 Control Plane
      4.4.1 Fast Failure Detection and Recovery
      4.4.2 Distributed Network Controllers
      4.4.3 Offline Auto-Diagnosis
      4.4.4 Live Impersonation of Failed Switch
   4.5 Discussion
      4.5.1 Control System Failures
      4.5.2 Cost Analysis
      4.5.3 Benefits to Network Management
      4.5.4 Alternatives in the Design Space
   4.6 Implementation and Evaluation
      4.6.1 Testbed
      4.6.2 Simulation
      4.6.3 Experimental Setup
      4.6.4 Transient State Analysis
      4.6.5 Responsiveness of Control Plane
      4.6.6 Bandwidth Advantage
      4.6.7 Transmission Performance at Scale
      4.6.8 Benefits to Real Applications
   4.7 Summary

5 RDC: Relieving Data Center Network Congestion with Topological Reconfigurability at the Edge
   5.1 Introduction
   5.2 The Case for Rackless Data Centers
      5.2.1 Observation #1: Pod locality
      5.2.2 Observation #2: Inter-pod imbalance
      5.2.3 Observation #3: Predictable patterns
      5.2.4 Understanding the power of racklessness
   5.3 The RDC Architecture
      5.3.1 Building block: Circuit switching
      5.3.2 Connectivity structure
      5.3.3 The pod controller
      5.3.4 Routing
   5.4 RDC Control Algorithms
      5.4.1 Traffic localization
      5.4.2 Uplink load-balancing
      5.4.3 Application-driven optimizations
      5.4.4 Advanced control algorithms
   5.5 Discussions
      5.5.1 Recovering from ToR failures
      5.5.2 Wiring and incremental deployment
      5.5.3 Cost analysis
      5.5.4 Handling circuit switch failures
      5.5.5 Scaling
      5.5.6 Alternatives in the design space
   5.6 Implementation and Evaluation
      5.6.1 Platforms
      5.6.2 TCP transient state
      5.6.3 Throughput benchmark
      5.6.4 Control loop latency
      5.6.5 Transmission performance at scale
      5.6.6 Real-world applications
   5.7 Related Work
   5.8 Summary

6 Conclusions and Future Work

Bibliography

Illustrations

1.1 An example of three-layer Clos network using homogeneous commodity switches.

2.1 An illustration of the optical splitter. (a) is a logical representation of an optical splitter with 1 input port and k output ports. (b) shows a real splitter device with k = 8.
2.2 An illustration of the optical circuit switch (OCS). (a) is a logical representation of an OCS with 8 input ports and 8 output ports, or an 8×8 OCS. (b) shows a real Calient 312×312 OCS.

3.1 An example of HyperOptics connectivity with splitter fanout k = 3. The connectivity of t_0, t_3, t_4 and t_6 is shown in the figure. All other ToRs are interconnected to their neighbors in a similar way. The table on the bottom demonstrates the connectivity of all ToRs.

3.2 Two broadcast trees originating from t_i and t_{i+2^{k-1}}. Solid circles are relays. The union of the relays and the last neighbor of each relay, shown by squares, forms a complete set of all ToRs. The two broadcasts have disjoint relay sets.
3.3 An overview of the HyperOptics system.
3.4 Comparison of the average total FCT of all three networks. The speedups of HyperOptics over the OCS network are labeled in the figure.

3.5 Computation time of HyperOptics’ control plane under different numbers of multicast requests.

4.1 A k = 6 and n = 1 ShareBackup network. (a), (b), (c) correspond to shaded areas in (d). Devices are labeled according to the notations in Table 4.1. Edge and aggregation switches are marked by their in-Pod indices; core switches and hosts are marked by their global indices. Switches in the same failure group are packed together, and they share a backup switch shown in stripes on the side. Circuit switches are inserted into adjacent layers of switches/hosts. The connectivity in shade is the basic building block for shareable backup. The crossed switch and connections represent example switch and link failures. Switches involved in failures are each replaced by a backup switch with the new circuit switch configurations shown at the bottom, where connections regarding the original red round ports reconnect to the new black square ports.
4.2 Communication protocol in the control system. (a): Failure detection and recovery. (b): Diagnosis of link failure.
4.3 Circuit switch configurations for diagnosis of link failures shown by examples (b) and (c) in Figure 4.1. Circuit switches in a Pod are chained up using the side ports. Only “suspect switches” on both sides of the failed link and some related backup switches are shown. Through configurations 1, 2, and 3, the “suspect interface” on both “suspect switches” associated with the failure can connect to 3 different interfaces on one or multiple other switches.
4.4 A testbed of k = 4, n = 1 ShareBackup with 2 Pods.
4.5 Trace of TCP sequence number during failure recovery.
4.6 TCP congestion window size during failure recovery.

4.7 iPerf throughput of 8 flows saturating all links on testbed.
4.8 Minimum flow throughput under edge-aggregation link failures normalized against the no-failure case on the LP simulator with global optimal routing for all networks.
4.9 CDF of completion time slowdowns on packet simulator.
4.10 Percentage of jobs missing deadlines on packet simulator.
4.11 Performance of the Spark Word2Vec and Tez Sort applications with a single edge (edge-host) or network (core-aggregation and aggregation-edge) link failure on the testbed.
4.12 CDF of query latency in the Spark TPC-H application with a single edge (edge-host) or network (core-aggregation and aggregation-edge) link failure on the testbed.

5.1 Traffic patterns from the Facebook traces. (a) is the rack-level traffic heatmap of a representative frontend pod. (b) shows the heatmap after regrouping servers in (a). (c) and (d) plot the sorted load of inter-pod traffic across racks in a representative database pod, before and after server regrouping, respectively. (d) shows the traffic stability over different time scales.
5.2 Aggregated server throughput before and after server regrouping for (a) one-to-one, (b) one-to-many (many-to-one) and (c) many-to-many traffic patterns. Source servers are colored in white and destination servers are colored in gray.
5.3 RDC architecture and control plane overview. (a) is an example of the RDC network topology. Circuit switches are inserted at the edge between servers and ToR switches. Connectivity for aggregation switches (agg.) and core switches remains the same as in traditional Clos networks. (b) presents an overview of the control plane.

5.4 An example of RDC’s 0/1 rule update on an OpenFlow-enabled ToR switch.
5.5 Packaging design of an RDC pod.
5.6 An RDC prototype with 4 racks and 16 servers.
5.7 RDC reconfiguration improves throughput significantly (b) with negligible TCP disruption (a).
5.8 Performance comparison using Cache traffic. Network load is 50%. (a) CDF of flow completion times. (b) Distribution of path lengths. More results under different settings are in Appendix ??.
5.9 Performance comparison using packet simulations under different traffic workloads and network loads.
5.10 Application performance improvements of RDC compared with the 4:1 oversubscribed network (4:1-o.s.) and the non-blocking network (NBLK). (a) The HDFS read/write traffic pattern. (b) The HDFS transfer time. (c)-(d) Memcached query throughput and latency. (e) The DMM traffic pattern. (f) Average shift time and communication time.

Tables

3.1 Average FCT of 256 random multicasts on the OCS network under ToR size 256. The OCS uses splitters with fanout varying from 2 to 8.

4.1 List of notations
4.2 Road map of experiment setups
4.3 Break-down of failure recovery and diagnosis delay (ms)
4.4 Percentage of impacted flows/coflows in Figure 4.9

5.1 Cost estimates of network components and their quantities needed by a) an RDC pod with 4:1 oversubscription, b) a 4:1 oversubscribed packet switching network (4:1-o.s.), c) a rack-based hybrid circuit/packet switching network with 4:1 oversubscribed circuit bandwidth (Hybrid-1) and d) with non-blocking circuit bandwidth (Hybrid-2), and finally e) a non-blocking packet switching network (NBLK). “p.c.” stands for private communication.
5.2 Break-down of control loop latency (ms) for traffic localization (TL) and uplink load-balancing (ULB).

Chapter 1

Introduction

In recent years, data centers have become one of the most important pieces of infrastructure in our modern digital life. Popular Internet services and applications such as instant messengers, web search, social networks, e-commerce, video streaming, and the Internet of Things are hosted in data centers. These services and applications generate a huge amount of data every day. For example, there are 350 million photos and videos uploaded, 5.7 billion comments and likes produced, and 4.75 billion pieces of content shared on Facebook each day [1]. Services at such a large scale require enormous computing and storage capability and become a natural fit for massive computing infrastructure. As a result, more and more companies and organizations are building hyper-scale data centers around the globe to meet the ever-increasing demands of their services and applications [2].

By aggregating computing and storage resources in a centralized warehouse-scale facility, data centers allow operators to amortize the capital and operational expenditures while tremendously reducing the management and maintenance complexity. From the perspective of cloud applications, the eventual goal of designing a data center is to provide an always-on, unified platform with almost unlimited compute and storage capacity. Behind this “unified platform”, however, are tens or hundreds of thousands of servers interconnected by high-capacity networks. Modern cloud-scale applications hosted on such a platform tend to be divided into a large number of tasks distributed across many servers, each responsible for a relatively small amount of work. The result of this work division is an increasing demand for high-performance data


Figure 1.1 : An example of three-layer Clos network using homogeneous commodity switches.

transfer among the servers. Nowadays, network traffic in data centers is growing at a speed one order of magnitude faster than global Internet traffic [3]. According to estimates, the amount of data transferred in data centers had already reached 6.8 ZB in 2016 and will triple to 20.6 ZB by 2021 [3]. Essentially, data center networks have become a critical component for application performance and availability. On the one hand, emerging data-intensive applications such as Hadoop [4], Spark [5] and Tez [6] require ultra-high communication throughput between servers and have been continuously driving the design of networks with higher bandwidth. On the other hand, user-facing services such as web search and online transactions must achieve low latency and high availability, typically aiming for at least 99.99% uptime (“four nines,” or about an hour of downtime per year), requiring a network that is resilient to traffic congestion and can recover from failures in minimal time.

1.1 Data Center Networks: State of the Art and Challenges

Considerable efforts have been made on designing DCN architectures to keep pace with the high traffic demand of services and applications. Over the years, networks constructed with multi-stage folded-Clos topologies using commodity Ethernet switches have gained the highest popularity [7–10]. Typical Clos networks have two or three layers of switches that form a multi-rooted tree, where Top-of-Rack (ToR) switches are the leaves, the core switches are the roots, and an optional layer of aggregation switches sits in between. Figure 1.1 shows one example of a three-layer Clos network. Real-world data center networks are usually over-subscribed for economic reasons. Depending on the number of uplinks for each rack, the over-subscription ratio can vary from as low as 5:1 to as high as 20:1 [8].

Early generations of Clos networks are based on standard server racks. Each rack typically contains 20-40 servers, which are connected to a single ToR switch using 1 Gbps links. The ToR switches are then connected to the core switches via 2-4 10 Gbps links. To provide high aggregate bandwidth between racks, core switches require switching chips with high port-count and high backplane capacity. The primary challenge for these networks is that they do not scale: network capacity is upgraded by replacing the existing core switches with ones that have even higher port-count and backplane bandwidth, which incurs non-linear cost increases with network size. Rather than scaling up, recent evolutions of DCNs scale out by employing a large number of inexpensive and homogeneous switches in the network core. The aggregate network bandwidth can be upgraded by adding commodity switches horizontally. FatTree [7] is a typical example of these networks; it uses switches of port-count k to support full bisection bandwidth for k^3/4 servers.

As shown in Figure 1.1, in Clos networks there can be multiple shortest paths between two servers under different ToRs. This, in turn, requires a multi-path routing technique, such as ECMP [11]. To take advantage of multiple paths, ECMP performs flow-based hashing on the five-tuple and uniformly randomly chooses a path from a pool of multiple equal-cost candidates. ECMP effectively improves traffic load balancing among network links and reduces network congestion. Clos networks with multipath routing have been successful at addressing the bandwidth bottleneck and scalability issues in data center networking for the past decade [10]. However, as traffic workloads continue to evolve, many problems of the existing Clos network remain:

• Handling group communications. Clos networks are best for uniform server-to-server communications. However, many new applications rely on more skewed communication patterns, such as multicast transmissions. Multicast transmissions are common in many data center applications, such as model training in distributed machine learning [12] and data replication in Hadoop File Systems [13]. However, Clos networks cannot efficiently handle these transmissions. Due to the high complexity of managing multicast groups on such networks, many applications handle multicast transmission by simply sending the same copy of data multiple times to different servers [4], causing huge network resource wastage and degraded application performance.

• Bandwidth disparity. A Clos network is typically over-subscribed at the core, with typical over-subscription ratios somewhere between 5:1 and 20:1 [8]. This means that the bandwidths between servers can vary drastically, depending on how “close” they are on the physical topology. Servers communicating within the same rack enjoy full bisection bandwidth, but servers communicating across racks have much lower bandwidths. Some proposals try to provide full bisection bandwidth by adding more switches in the core [7, 9]. However, due to the non-uniform traffic pattern, this usually results in a network with low utilization and a significantly higher cost.

• Fault-tolerance. Despite the fact that individual switch and link failures are relatively rare, with 99.9% uptime [14], a large data center network may have thousands of switches and hundreds of thousands of links. The chance that some switch or link has failed is therefore very high. The state-of-the-art failure recovery solutions in data center networks are all more-or-less based on rerouting. While rerouting maintains connectivity, traffic may experience higher bandwidth contention due to the network capacity loss, and eventually, application performance suffers. What is missing is a truly fault-tolerant network that preserves application performance under failures.

1.2 Thesis Contributions

In this thesis, we argue that many of the challenges faced by Clos networks can be addressed by augmenting the existing physical network architecture with low-cost optical switching devices. We present the design and implementation of multiple systems that address the three challenges mentioned in the previous section, i.e., supporting group communication, removing bandwidth disparity, and improving fault-tolerance. One common goal behind these systems is to solve each of the above problems while retaining as many of the existing benefits of current Clos networks as possible, and without significantly increasing the hardware cost. In particular, this thesis makes the following contributions.

• First, we propose a network architecture, called HyperOptics, that augments the existing Clos network with a separate ToR-to-ToR network dedicated to multicast transmissions. A key contribution of HyperOptics is its novel connectivity design for the ToRs that leverages physical-layer optical splitting technology to form a static, switchless topology. The topology of HyperOptics resembles the logical connectivity graph of traditional distributed hash tables [15]. We show that by offloading multicast traffic from the Clos network to HyperOptics, we can achieve line-rate multicast transmissions. Compared to a design using optical circuit switches, HyperOptics achieves ultra-low-latency transmissions due to its switchless topology. Despite these performance improvements, the overall cost of HyperOptics remains comparable with that of a network using optical circuit switches.

• Our second contribution is a fault-tolerant network architecture, called ShareBackup, that preserves application performance under various network failures. ShareBackup uses a small number of backup switches shared network-wide to repair failures on demand, so that the network quickly recovers to its full capacity without applications noticing the failures. Under the hood, this is enabled by inserting circuit switches between adjacent layers of the Ethernet switches and connecting both the regular switches and backup switches to the circuit switches. ShareBackup avoids the complications and ineffectiveness of rerouting by fast fail-over to backup switches at the physical layer. We implement ShareBackup on a hardware testbed. Its failure recovery is as fast as the underlying circuit switching hardware, causing no disruption to routing, and it accelerates big data applications by multi-fold factors under failures. We also use large-scale simulations with real data center traffic and failure models. In all our experiments, the results for ShareBackup show little difference from the no-failure case.

• Our third contribution is the design and implementation of a “rackless” data center network, or RDC, that aims to remove the bandwidth disparity between intra-rack and inter-rack communications. RDC is a pod-centric DCN architecture that breaks the traditional rack boundaries in a pod and creates uniform high bandwidth for servers regardless of their topological locations. RDC creates the illusion that servers can move freely among ToR switches in response to traffic pattern changes. Rather than optimizing the workloads based on the topology, RDC optimizes the topology to suit the changing workloads. RDC is implemented by inserting circuit switches between the edge switches and the servers. It reconfigures the circuits on demand to form different connectivity patterns. We have performed extensive evaluations of RDC both on hardware testbeds and in packet-level simulations. RDC can achieve an aggregate server throughput close to that of a non-blocking network, with an average path length 35% shorter. On realistic applications such as HDFS, Memcached, and MPI workloads, RDC can improve job completion times by 1.1-2.7×.

1.3 Thesis Organization

The rest of this thesis is organized as follows. Chapter 2 provides a brief introduction to the background of hybrid data center networks and the related work. Chapters 3, 4, and 5 present the three major contributions introduced above: HyperOptics, ShareBackup, and RDC. We conclude the thesis in Chapter 6.

Chapter 2

Background and Related Work

In this chapter, we discuss the background of optical technologies used in data centers and the related work.

2.1 Optics in Data Centers

Due to their large bandwidth capacity and low power consumption, a significant amount of research has been carried out in the past few years to develop optical interconnects and networking technologies for data centers [16–20]. In fact, it has been estimated that 70% energy savings can be obtained if data center infrastructure moves toward a fully optical network [21]. Optical interconnect schemes in data centers mainly rely on a mixture of active and passive optical devices to provide switching, routing, and interconnection. Among the many optical devices that offer such functionalities, we focus on two types of devices, optical splitters and optical circuit switches, which provide passive line-rate multicast transmission and reconfigurable point-to-point connectivity at high switching speed, respectively.

Optical splitters. As passive physical-layer devices, optical splitters can duplicate the optical signal coming from the input port into multiple copies and transmit them out via the output ports. Under the hood, this is implemented by physical optical power splitting. As illustrated in Figure 2.1, a key parameter of an optical splitter is its fanout, denoted by k, which represents the maximum number of copies into which the input signal can be duplicated. Due to power splitting, the power of each output optical signal is a fraction of the input power. For example, there is a 3 dB power loss for k = 2 splitters. To ensure the output power is above the sensitivity of an


Figure 2.1 : An illustration of the optical splitter. (a) is a logical representation of an optical splitter with 1 input port and k output ports. (b) shows a real splitter device with k = 8.

Figure 2.2 : An illustration of the optical circuit switch (OCS). (a) is a logical representation of an OCS with 8 input ports and 8 output ports, or an 8×8 OCS. (b) shows a real Calient 312×312 OCS.

optical receiver, a splitter with a higher fanout requires higher input power. Optical splitters are manufactured with Planar Lightwave Circuit (PLC) technology and are commercially available up to fanout 32 [22].

Optical circuit switches. Optical circuit switches are physical-layer devices that have traditionally been used in wide-area backbone networks and SONET cross-connects to support telecommunications traffic loads. They provide bipartite connectivity between the input ports and the output ports. Figure 2.2 shows the logical representation of an 8×8 OCS and a commercially available Calient OCS with 312 input ports and 312 output ports. In an OCS, setting up a circuit (or switching) is typically accomplished through the use of Micro-Electro-Mechanical Systems (MEMS) mirror arrays that physically direct the light beams to establish an optical data path between any input and any output port [23]. Once a circuit has been established, communication between the endpoints occurs at very high bandwidth and very low latency. Since all data paths are optical, no optical-to-electrical-to-optical (O-E-O) conversions are needed. This results in tremendous savings in optical transceivers at the endpoints and in overall power consumption compared to packet-switching counterparts. There are two key parameters of an OCS: port-count and circuit setup delay. Due to the mechanical nature of light redirection, the circuit setup delay varies from several milliseconds to tens of milliseconds [24, 25], which is several orders of magnitude slower than an Ethernet packet switch. OCSes with a few hundred ports are commercially available [24], and ones with 1100 ports have already been fabricated [26].
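To make the power-splitting loss mentioned above concrete, here is a minimal calculation. It assumes an ideal 1×k splitter that divides the input power evenly across its k outputs and ignores the excess insertion loss of real devices:

    % Ideal even power split across k output ports
    \text{per-port loss}(k) = 10 \log_{10} k \ \text{dB}, \qquad
    10\log_{10} 2 \approx 3\ \text{dB}, \quad
    10\log_{10} 8 \approx 9\ \text{dB}, \quad
    10\log_{10} 32 \approx 15\ \text{dB}.

This is consistent with the 3 dB figure for k = 2 quoted above, and it explains why splitters with higher fanout need higher input power to stay above the receiver sensitivity.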

2.2 Hybrid Data Center Network Architectures

Many hybrid electrical/optical [17, 18, 27, 28] and hybrid electrical/wireless [29, 30] DCN architectures have been proposed during the past decade. These works all share a similar motivation and take a similar generic approach: creating on-demand connectivity between servers to mitigate the bandwidth bottleneck of DCNs under non-uniform traffic patterns. They differ only in the underlying physical technologies and the algorithms for setting up those on-demand links. For example, Helios [18] and c-Through [17] propose to augment the existing electrical network with a separate network core using optical circuit switches. The electrical network is used for all-to-all low-volume communication, while the optical network is used for long-lived, high-bandwidth communication. Similarly, the work by Zhou et al. [29] and Flyways [28] meet bursty bandwidth demands by creating 60 GHz wireless links instead of optical links. One key difference of our work on ShareBackup and RDC is that we embed the optical devices into the existing Clos topologies, instead of adding a separate network with its own architecture and traffic scheduling techniques. Our work promotes the integration of optical switching and packet switching in a more organic way and allows a coherent control system design for a single network architecture. We show in this thesis that this integration can solve many of the issues of Clos networks, such as bandwidth disparity and fault tolerance, while keeping their existing advantages.

2.3 Supporting Multicast Transmissions in Data Center Networks

Multicast transmission has long been a performance bottleneck in data center networks. The traditional in-network multicast solution, IP multicast, is problematic in many senses. First, it does not scale with aggregate hardware resources in the number of supported multicast groups. Second, existing protocols for IP multicast routing and group management are not able to construct multicast trees in milliseconds, as required by many data analytics applications. Third, IP multicast lacks a universally accepted congestion control mechanism that is friendly to unicast flows. Even though some recent efforts have been made to improve its scalability in the datacenter context [31–33], the other problems have caused most network operators to eschew its use.

As a result, in today’s data center networks, the task of supporting multicast transmission is delegated to the applications. Applications themselves have to build and maintain an overlay network over the physical network. For instance, Spark supports multicast using a BitTorrent-style overlay among the recipient servers [5]. However, BitTorrent suffers from suboptimal multicast trees that introduce duplicate transmissions in the physical network. Due to the large volume of multicast traffic, repetitive data transmissions add a significant burden to the network. In fact, it has also been shown that when the overlay gets large, its throughput can collapse even when overlay forwarding experiences infrequent short delays [34]. The Hadoop Distributed File System (HDFS) improves on the BitTorrent-style overlay by sending the data to multiple receivers in a pipelined fashion, in which each data block is transmitted from the source to the first receiver, then from the first receiver to the second receiver, and so on. The reception of a data block is considered successful only when the last receiver has received the block. Although this pipelined data transfer relieves the link stress, it causes high latency for the transfer of each data block.

More recently, researchers have recognized the advantages of leveraging optical splitters for data multicast and have proposed several optical multicast network architectures and control systems [35–38]. These systems are more efficient than the application overlays because optical data duplication is performed in the physical layer in a bandwidth-transparent way: data can be transmitted to multiple receivers as fast as the sender’s transceiver speed without causing extra link stress. Also, a separate optical network avoids interference of multicast traffic with unicast traffic and greatly relieves the burden on the unicast network. However, an important drawback common to the above systems is that they require a high port-count optical circuit switch (OCS) acting as the switching substrate for the optical splitters. Using an OCS for multicast transmission is problematic. First, today’s high port-count OCSes can only switch at a speed of milliseconds to tens of milliseconds. Second, optical circuit switches are high-cost active devices. For example, our recent quotes from vendors show that an OCS port can be two orders of magnitude more expensive than an optical splitter port. Different from these systems, our solution in HyperOptics adopts a switchless design that directly interconnects the ToR switches to form a regular graph. Such a design has two major advantages compared to the OCS-based multicast systems. First, it can provide high bandwidth even at the packet granularity because the slow circuit switching delay is completely eliminated. Second, it scales well in the number of ToRs because the constraint of the OCS port-count no longer exists.

2.4 Failure Recovery in Data Center Networks

Clos topologies provide extensive path diversity and redundancy for end hosts. Connectivity between end-host pairs can still be established even under a considerable number of device or link failures in the network core. To use this path richness for failure recovery, previous studies have explored the tradeoff between the speed and quality of different rerouting approaches. Both fast local rerouting approaches, e.g., F10 [39], and optimal global rerouting approaches, e.g., Portland [9] and Hedera [40], have been proposed to handle network failures. In F10, the authors show that link failures can be recovered in 10 milliseconds by conducting fast local rerouting on a variation of the traditional FatTree topology. While this rerouting strategy can quickly find an alternative path under failures, the detour path can be significantly longer than the original one due to traffic bouncing between adjacent layers in the network. We refer to this problem as path dilation in this thesis. Path dilation not only enlarges latency and lowers throughput for the rerouted flows; more importantly, it may cause some links to become drastically congested, resulting in significant load imbalance in the network. F10 also proposes a traffic push-back strategy and a centralized load-balancer to mitigate this problem. However, these two schemes have to operate on a much larger time scale due to their high overhead. Further, to achieve fast rerouting, the rerouting rules have to be pre-installed on each of the switches, which has already been shown to have scalability issues because the number of rules increases quadratically with the number of servers [41].

Compared to the local rerouting approaches, designs based on centralized network monitoring, e.g., Portland [9] and Hedera [40], have the advantage of performing optimal rerouting and load balancing based on global network topology and traffic load information. However, this optimality comes at the cost of significantly reduced responsiveness to failures. After a failure is detected and reported to the monitor through heartbeats, the monitor needs to produce a new set of routing rules based on the changed network topology and install these rules on (potentially) all switches at runtime. In addition to the communication delay and computation delay of the centralized monitor, it takes even longer for the new rules to be installed and take effect.

Beyond the data center scenario, packet rerouting strategies have existed for a long time in the Internet and ISP networks. One class of work uses pre-computed backup routes when the primary routes fail. For example, Packet Re-cycling [42] associates each unidirectional link in the network with a unidirectional cycle that can be used to bypass it if it fails, through a process called cycle following. Packet Re-cycling can tolerate multiple failures in the network. If further failures are encountered on links along the backup path, the backup paths of these links are guaranteed to avoid previously encountered failures. R-BGP [43] pre-computes a few strategically chosen failover paths during BGP convergence and provides some provable guarantees, such as loop prevention, as long as a policy-compliant path to the destination exists. Other works using pre-computed backup routes include IP restoration [44] and MPLS Fast-Reroute [45]. However, one major issue of using backup routes is that they do not scale in terms of the required forwarding table entries on network devices, which limits their resilience to only a few failures in the network [41, 46].
FCP [46] is another rerouting paradigm, which aims to completely eliminate the convergence process upon failures. All routers in the network have a consistent view of the potential set of links. FCP allows data packets to carry information about failed links. It ensures that when a packet arrives at a router, that router knows about any relevant failures on the packet’s previous path, thereby eliminating the need for the routing protocol to propagate failure information to all routers immediately. However, the techniques in FCP require nontrivial computation resources on each router, because the router must re-compute the shortest path available to a given destination for each failure combination it receives within a failure-carrying packet. In addition, FCP adds considerable overhead to each packet header.

While all these rerouting schemes are resilient to failed connections, they do not really fix the failures and restore the network capacity. Some hardware failures

may have long downtimes, and consequently, network traffic will have to suffer from bandwidth loss for an extended period of time. Our work on ShareBackup asks a very different question: can we recover the network capacity immediately upon failures, without traffic disturbance? An affirmative answer would suggest that we can completely mask the failure impact from application performance. In Chapter 4, we will detail the design and implementation of ShareBackup and demonstrate that it is an economical and effective way to mask failures from application performance.

Chapter 3

HyperOptics: Integrating An Optical Multicast Architecture in Data Centers

In this chapter we present HyperOptics, a static and switchless optical network designed for multicast transmissions in data centers. The key building block of HyperOptics is the optical splitter, which can passively duplicate optical signals into multiple copies at line rate. Building on top of optical splitters, HyperOptics directly interconnects ToR switches to form a novel connectivity structure, a k-regular graph, where k is the fanout of a splitter. We analytically show that this architecture is scalable and efficient for multicasts. Simulations show that running multicasts on HyperOptics can on average be multi-fold faster than the state-of-the-art solution.

3.1 Introduction

As datacenters scale up, online services and data-intensive computation jobs running on them have an increasing need for fast data replication from one source machine to multiple destination machines, i.e., the multicast service. Apart from traditional multicast applications such as simultaneous server OS installation and upgrade [47], data chunk replication in distributed file systems [4, 48, 49], and cache consistency checks on a large number of nodes [50], recent distributed machine learning workloads also see a huge demand for multicast services. The explosion of data allows the learning of powerful and complex models with 10^9 to 10^12 parameters [51, 52], for which broadcasting the model parameters alone poses a challenge for the underlying network. Some learning algorithms require the processed intermediate data to be duplicated across different nodes. For example, the Latent Dirichlet Allocation algorithm for text

mining needs to multicast the word distribution data in every iteration [53]. A few thousand iterations of LDA with 1 GB of data per iteration would easily cause over 1 TB of multicast data transfer in today’s datacenters. Reducing the multicast delay would significantly accelerate such machine learning jobs. However, multicast services are still not natively supported by current datacenters. The most established solution is IP multicast, which was originally designed for the Internet. Even though some efforts have been made to improve its scalability in the datacenter context [31–33], the complex dynamic multicast tree building and maintenance, the potentially high packet loss rate and costly loss recovery, and the lack of satisfactory congestion control have caused most network operators to eschew its use.

On the other hand, as data sizes continue to grow, there is an increasing trend toward deploying a high-bandwidth (40/100 Gbps) network core for datacenters [54]. However, high data rate transmissions are not feasible over even modest-length electrical links. For example, data transmissions on traditional twinax copper cable propagate at most 7 m at 40 Gbps due to power limitations [55]. Optical communication technologies are well suited to such high-bandwidth networks. The advantages of optical devices and links, such as data rate transparency, lower power consumption, less heat dissipation, lower bit-error rate, and lower cost, have been noted or already exploited by the industry [56]. As datacenters gradually evolve from electrical to optical, we believe a system design that fully leverages the key physical features of optical technologies is necessary for future datacenters.

In this chapter, we propose HyperOptics, a novel optical multicast architecture for datacenters. HyperOptics follows recent efforts such as [18, 20, 36, 37, 57, 58] that augment the traditional electrical network with a high-speed optical network, but HyperOptics dedicates the optical network to multicast transmissions. The existing optical network proposals usually employ an Optical Circuit Switch (OCS) to provide configurable connectivity for ToRs. The switching speed of today’s large port-count

OCSes is, however, orders of magnitude slower (about tens of milliseconds) than that of packet switches. In [20], the authors propose a specific implementation of an OCS that is capable of switching in microseconds, but it cannot scale to a large port-count due to the limited number of available optical wavelengths. Also, OCSes are high-cost devices. According to our recent quote from a vendor, a 192-port OCS would cost 365 K USD. All these problems of OCSes motivate us to design an optical network that eliminates their use and directly interconnects the racks with low-cost optical splitters.

The design of HyperOptics is inspired by Chord’s [15] way of organizing peer nodes in traditional overlay networks. Each ToR in HyperOptics can talk to multiple neighbor ToRs simultaneously via passive optical splitters, by which the ToRs form the connectivity of a regular graph. We identify two main advantages of HyperOptics over the OCS architecture. First, HyperOptics can provide high bandwidth even at the packet granularity because the slow circuit switching delay is completely eliminated. Second, unlike the existing OCS architecture, HyperOptics scales well in the number of ToRs because the constraint of the OCS port-count no longer exists in HyperOptics.

We show that HyperOptics is well suited for high-throughput and low-latency multicast transmissions. Data from one ToR can be physically duplicated via an optical splitter to multiple ToRs at line speed. For multicasts with large group sizes, data is relayed by some intermediate ToRs. Due to the path flexibility of regular graphs, we show that the maximum path length for any multicast is bounded by log n, where n is the number of ToRs. Another distinguishing property of HyperOptics is that it can support 2 simultaneously active multicasts with maximal group size. To take full advantage of the underlying optical technologies, we propose a centralized control plane that manages the routing policy and multicast scheduling. Preliminary simulations show that HyperOptics can on average be 2.1× faster than the OCS architecture for multicast services.

3.2 HyperOptics Architecture

We first introduce the connectivity structure of the ToRs and then discuss the routing strategy under the given network architecture. Next, we analyze the multicast performance and the wiring complexity of HyperOptics. Finally, we present an overview of the system.

3.2.1 ToR Connectivity Design

We assume that all splitters have the same fanout k and the number of ToRs is n = 2^k. In our model, optical signals can only pass through the splitters in one direction, i.e., from the input port to the output ports. The ToRs are interconnected as a special k-regular graph. The only difference of HyperOptics from a general k-regular graph is that a node (ToR) can only send the same data to its k neighbors simultaneously. This limitation comes from the fact that the splitter just passively duplicates the input signal on its output ports.

Assume that the ToRs are denoted t_0, t_1, ..., t_{n-1}. All ToRs are logically organized on a circle modulo 2^k. Each ToR t_i is connected to the input port of splitter s_i. The k output ports of s_i are connected to t_{i+2^0}, t_{i+2^1}, ..., t_{i+2^{k-1}}, respectively. Note that the gap between t_i's two consecutive neighbors increases exponentially, which is very similar to how Chord [15] organizes peer nodes in overlay networks. Since all ToRs are on a logical circle, the operations above are all modulo 2^k. For example, if k = 3 and n = 8, the third neighbor of t_4 is t_{4+2^2} = t_0. An example of HyperOptics with k = 3 is given in Fig. 3.1. We only show the connectivity of t_0, t_3, t_4 and t_6 in the figure. The other ToRs are connected in a similar way, e.g., t_1 is connected to t_2, t_3, t_5. The full connectivity of the architecture is shown in the table at the bottom of the figure.
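To make the connectivity rule concrete, the following short Python sketch (illustrative code, not from the thesis) builds the neighbor table for n = 2^k ToRs, where ToR i drives a splitter whose outputs reach t_{(i+2^0) mod n}, ..., t_{(i+2^{k-1}) mod n}. For k = 3 it reproduces the table in Figure 3.1.

    def hyperoptics_neighbors(k):
        """Return {ToR index: list of its k neighbors} for n = 2**k ToRs.

        ToR i drives a 1-to-k splitter whose outputs connect to
        t_{(i + 2^j) mod n} for j = 0, ..., k-1.
        """
        n = 2 ** k
        return {i: [(i + 2 ** j) % n for j in range(k)] for i in range(n)}

    if __name__ == "__main__":
        table = hyperoptics_neighbors(3)   # 8 ToRs, splitter fanout 3
        print(table[0])                    # [1, 2, 4]
        print(table[4])                    # [5, 6, 0], i.e., t_4's third neighbor is t_0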

3.2.2 Routing and Relay Set Computation

Routing traffic to indirect destinations requires relays. For example, in Fig. 3.1, a possible path from t_0 to t_7, shown as dashed lines, is t_0 → t_4 → t_6 → t_7, where t_4 and t_6 are relays.

[Figure 3.1: each ToR t_i drives a 1×3 splitter whose outputs reach its three neighbors. The connectivity table at the bottom of the figure is reproduced below.]

    Source ToR:  0  1  2  3  4  5  6  7
    Neighbors:   1  2  3  4  5  6  7  0
                 2  3  4  5  6  7  0  1
                 4  5  6  7  0  1  2  3

Figure 3.1 : An example of HyperOptics connectivity with splitter fanout k = 3. The connectivity of t_0, t_3, t_4 and t_6 is shown in the figure. All other ToRs are interconnected to their neighbors in a similar way. The table on the bottom demonstrates the connectivity of all ToRs.

There may exist multiple paths between each ToR pair. The relay set of a multicast is mainly determined by the routing strategy of HyperOptics. For simplicity, we propose a best-effort routing strategy for HyperOptics. We note that our routing strategy might not be optimal and there is room for improvement, but it already provides satisfactory gains, as we will show in Sec. 3.3. For a single source-destination pair, best-effort routing always designates the neighbor that is nearest to the destination as the next relay. Also, we ensure that the index of the next relay is logically smaller than the destination. Mathematically, given a destination t_j, a relay ToR t_i will specify t_{i+2^{floor(log2(j-i))}} as the next relay, where indices are taken modulo n. The routing algorithm recursively computes the remaining relays as if the next relay were the source. For example, consider the traffic from t_0 to t_7 in Fig. 3.1: the next relay for t_0 is t_4, because t_4 is the neighbor of t_0 that is nearest to t_7, and the next relay for t_4 is t_6. Hence, the relay set for the path from t_0 to t_7 is {t_4, t_6}. Note that the next relay of t_4 is not t_0, because t_0 has logically passed the destination t_7. For multicasts, best-effort routing computes the relay set for each individual destination and then returns the union of all relay sets.
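The per-destination rule above can be sketched in a few lines of Python (illustrative code, not from the thesis): from the current ToR, jump to the neighbor at offset 2^{floor(log2(gap))}, which never passes the destination, and take the union of the intermediate hops over all destinations of a multicast.

    def relay_set(src, dests, n):
        """Best-effort relay set: union of intermediate ToRs over all destinations.

        From the current ToR, the next hop is cur + 2^floor(log2(gap)) mod n,
        where gap = (dest - cur) mod n; the relay therefore never passes the
        destination, and gap strictly decreases until it reaches zero.
        """
        relays = set()
        for dest in dests:
            cur = src
            while cur != dest:
                gap = (dest - cur) % n
                nxt = (cur + (1 << (gap.bit_length() - 1))) % n  # 2^floor(log2(gap))
                if nxt != dest:
                    relays.add(nxt)
                cur = nxt
        return relays

    if __name__ == "__main__":
        # Single destination t_7 from t_0 with n = 8 gives relays {4, 6}, as in the text.
        print(relay_set(0, [7], 8))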

3.2.3 Analysis

We now analyze the multicast performance under the design of HyperOptics and compare the cost of HyperOptics with that of traditional OCS networks.

Multicast hop-count: The hop-count of a multicast characterizes the minimum latency of a packet traversing from the source to the destination. The following lemma gives the worst-case and average hop-count of a multicast in our architecture.

Lemma 1. The maximum hop-count of a multicast under best-effort routing is upper-bounded by log n, and the average hop-count is (log n)/2.

Proof. All ToRs in HyperOptics are logically equal. Without loss of generality, we consider a multicast originating from t_0. The k direct neighbors of t_0 are ToRs 2^0, ..., 2^{k-1}; these IDs differ from 0 by only one bit. Similarly, the IDs of ToRs that are two hops away from t_0 differ by two bits. The farthest ToR differs by k bits. In best-effort routing, traversing a hop is equivalent to flipping the most significant bit of the source ToR's ID that differs from the corresponding bit of the destination's ID. Therefore, the largest hop-count is k = log n. The number of ToRs that are j hops away from t_0 is C(k, j), for 1 ≤ j ≤ k, where C(k, j) is the binomial coefficient. The average hop-count is (sum_{j=1}^{k} j·C(k, j)) / (sum_{j=1}^{k} C(k, j)) ≈ k·2^{k-1} / 2^k = k/2 = (log n)/2.

For one hop, the signal decoding and packet processing can be done in sub-nanosecond time [59]. Therefore, for a datacenter with 1 K racks, the average latency for a multicast is less than 0.5 × log 1000 × 1 ns ≈ 5.0 ns. In the following, we simply assume that the multicast latency is negligible.

Simultaneously active multicasts: Each ToR in HyperOptics has k direct neighbors. In an extreme case where all group members of a multicast are the source's direct neighbors, HyperOptics could support n active multicasts simultaneously. In another extreme case where multicasts' group sizes are maximal and need the largest number of relays, the number of simultaneously active multicasts would be much smaller.


Figure 3.2 : Two broadcast trees originating from t_i and t_{i+2^{k-1}}. Solid circles are relays. The union of the relays and the last neighbor of each relay, shown by squares, forms a complete set of all ToRs. The two broadcasts have disjoint relay sets.

However, the following lemma shows that HyperOptics still has the capability of servicing multiple multicasts simultaneously in the worst case.

Lemma 2. HyperOptics can simultaneously service two one-to-all multicasts.

Proof. We consider two broadcast sources, ToR i and ToR i + 2^{k-1}. Under best-effort routing, we draw the two broadcast trees in Fig. 3.2, where solid circles are relays and squares are the last neighbor of each relay. As can be seen, the relay set and the last neighbor of each relay form a complete set of all ToRs for each broadcast. ToR i's relay set is {i, i+1, i+2, ..., i+2^{k-1}-1}, while ToR i+2^{k-1}'s relay set is {i+2^{k-1}, i+2^{k-1}+1, i+2^{k-1}+2, ..., i-1}. The two relay sets are disjoint, and therefore both broadcasts can be active simultaneously.

ToR port-count: In HyperOptics, each ToR is connected to the input port of a 1×k splitter. One splitter takes up k + 1 ports across the ToRs. The average number of occupied ports on each ToR is therefore n(k+1)/n = k + 1 = 1 + log n.

Cost: Even though HyperOptics does not use an OCS, it occupies more ToR ports than the OCS network. The per-port OCS cost is 1.5 K USD, derived from our recent vendor quote (365 K USD for a 192-port switch) with a 20% discount factored in. The per-port costs of a ToR and a transceiver are 100 USD and 200 USD, respectively, from [?, 60]. Splitters are very inexpensive at 5 USD per port [61]. For a medium-size datacenter with 128 ToRs, where each ToR is connected to other ToRs via 40 Gbps links, the total networking cost for HyperOptics is approximately 0.31 M USD. The total cost of the OCS network using a commercially available 192-port OCS is comparable at 0.33 M USD. For a datacenter with 256 racks, the total costs for HyperOptics and the OCS network using a 320-port OCS become 0.69 M and 0.56 M USD, respectively. HyperOptics is thus cost-comparable with the OCS architecture under the current prices of the different network elements.

Wiring complexity: The total number of fibers needed to interconnect the ToRs is n log n, but many of them are short fibers that only go across a few racks. In a datacenter with 2^k racks, the k fibers from each ToR go across 2^0, 2^1, ..., 2^{k-1} racks, respectively. For instance, in a datacenter with 256 racks, only 2 fibers per ToR go across more than 50 racks. The total number of long fibers that go across more than 50 racks is 2 × 256 = 512. For large datacenters with thousands of racks, we envisage that the ToRs are packaged into Pods. Pods can be wired in the same way as if one Pod were a single virtual ToR. This hierarchical organization of ToRs would significantly reduce the number of global fibers. A systematic study of this hierarchical design is our future work.
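The cost figures above can be roughly reproduced by the following back-of-the-envelope sketch. The accounting is our own reading of the per-port prices quoted in the text (ToR port, transceiver, splitter port, OCS port); the thesis's actual breakdown may differ slightly.

    def hyperoptics_cost(n, tor_port=100, transceiver=200, splitter_port=5):
        """Approximate networking cost in USD for n = 2**k ToRs.

        Each ToR dedicates k+1 ports (one splitter input plus k incoming links),
        each carrying a transceiver; splitters contribute n*(k+1) cheap ports.
        """
        k = n.bit_length() - 1                     # fanout, assuming n = 2**k
        ports = n * (k + 1)
        return ports * (tor_port + transceiver) + ports * splitter_port

    def ocs_network_cost(n, ocs_ports, ocs_port=1500, tor_port=100, transceiver=200):
        """Approximate cost of the OCS-based network (splitter cost omitted)."""
        return n * (tor_port + transceiver) + ocs_ports * ocs_port

    if __name__ == "__main__":
        print(hyperoptics_cost(128), ocs_network_cost(128, 192))   # ~0.31 M vs ~0.33 M USD
        print(hyperoptics_cost(256), ocs_network_cost(256, 320))   # ~0.70 M vs ~0.56 M USD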

3.2.4 System Overview

Fig. 3.3 shows an overview of the HyperOptics architecture. Our current design of HyperOptics assumes that the network core bandwidth can be fully utilized. This assumption holds when the link bandwidth between a server and its ToR is the same as the inter-rack bandwidth, or when a server's bandwidth is lower than the inter-rack bandwidth but multiple sources within a rack share the same destination set.

Figure 3.3 : An overview of the HyperOptics system. Servers send multicast requests (id, s, D, f) and finish commands to the HyperOptics manager, which replies with start commands and installs on each ToR i the list of multicast IDs that require ToR i as a relay.

The work-flow of HyperOptics is as follows. The manager first receives multicast requests from source servers. A multicast request contains the request ID, the source server, the destination servers, and the flow size. The manager then computes the relay set for each request and sends to each ToR i a list of IDs of the multicasts that require ToR i as a relay. All multicast data packets carry a multicast ID in their headers. During the service period, when ToR i receives a packet, it reads the packet header, checks whether it is a relay for the packet, and relays the packet if it is. Note that this rule installation process is conducted only once before each scheduling cycle. Since relays are non-sharable resources for a multicast, multicasts that require common relays must be serviced sequentially. The HyperOptics manager computes a schedule for all requests, which we will discuss in the next section. Every time a server finishes sending its multicast traffic, it sends a finish message to the manager; the manager then checks whether it is time to schedule the next batch of multicasts. If so, the manager sends a start message to the source servers of the next batch. Rules for the current scheduling cycle are deleted on the ToRs before the next cycle begins.
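As an illustrative sketch of the rule-installation step (hypothetical helper names; relay_set is the routine sketched in Section 3.2.2), the manager can invert the per-request relay sets into the per-ToR lists of multicast IDs that it pushes out before each scheduling cycle:

    from collections import defaultdict

    def build_relay_rules(requests, n):
        """Map each ToR index to the list of multicast IDs it must relay.

        `requests` is a list of (mcast_id, src_tor, dest_tors, flow_size)
        tuples, mirroring the (id, s, D, f) requests described above.
        """
        rules = defaultdict(list)
        for mcast_id, src, dests, _size in requests:
            for tor in relay_set(src, dests, n):   # relay_set(): see the sketch in Sec. 3.2.2
                rules[tor].append(mcast_id)
        return dict(rules)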

3.2.5 Multicast Scheduling

Given an input of n multicast requests, we now consider how to schedule these multicasts such that the overall delay is minimized. We formulate this problem as

a max vertex coloring problem [62], where a vertex corresponds to a multicast and the edges correspond to the conflict relations among multicasts, i.e., if two multicasts have common relays, there is an edge between them. The weight of a vertex corresponds to the flow size of the multicast. Max vertex coloring has been shown to be strongly NP-hard [63]. We therefore focus on efficient heuristics. HyperOptics adopts a heuristic called Weight-based First Fit (WFF), in which the vertices are first sorted in a non-increasing order of their weights. WFF then scans the vertices and assigns each vertex the least-index color that is consistent with its already colored neighbors. The WFF heuristic is a specific version of the online coloring method for general graph coloring problems whose approximation ratio is analyzed in [64]. The time complexity of WFF is Θ(|V|^2).
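A compact version of WFF is sketched below. The conflict-matrix representation is chosen for brevity and is not necessarily the one used in our implementation.

    #include <algorithm>
    #include <numeric>
    #include <vector>

    // Weight-based First Fit (WFF). conflict[u][v] is true iff multicasts u and v
    // share a relay and therefore cannot be serviced concurrently; weight[u] is the
    // flow size of multicast u. Returns, for each multicast, the index of the batch
    // (color) it is assigned to; multicasts with the same color form one batch.
    std::vector<int> wff(const std::vector<std::vector<bool>>& conflict,
                         const std::vector<double>& weight) {
        const int n = static_cast<int>(weight.size());
        std::vector<int> order(n);
        std::iota(order.begin(), order.end(), 0);
        // Sort vertices by non-increasing weight.
        std::sort(order.begin(), order.end(),
                  [&](int a, int b) { return weight[a] > weight[b]; });

        std::vector<int> color(n, -1);
        for (int u : order) {
            // Mark the colors already used by u's colored neighbors.
            std::vector<bool> used(n, false);
            for (int v = 0; v < n; ++v)
                if (conflict[u][v] && color[v] >= 0) used[color[v]] = true;
            // Assign the least-index color not used by any neighbor.
            int c = 0;
            while (used[c]) ++c;
            color[u] = c;
        }
        return color;
    }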

3.3 Evaluation

In HyperOptics, the inter-rack link bandwidth is 40 Gbps. We simulate the following two networks to compare with HyperOptics.

3.3.1 Compared networks

OCS network: Each ToR is connected to an OCS via a 40 Gbps link. The OCS has 320 ports, among which some are occupied by the ToRs, and the remaining ports are reserved for optical splitters. The number of splitters varies with the fanout of each splitter. The maximum group size achieved by cascading m 1:k splitters is k + (m-1)(k-1). We assume the OCS reconfiguration delay is 25 ms, according to commercially available products [24]. As discussed in Sec. 3.2.3, the total cost of this network is comparable to that of HyperOptics.

Conceptual OCS network: We assume the Conceptual OCS has zero reconfiguration delay and sufficient port count to support arbitrary multicast group sizes. The other configurations are the same as in the OCS network. This network is not feasible in practice; it only serves as a comparison baseline to isolate the effect of the different design components of HyperOptics.
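For example, with 1:4 splitters (k = 4), each additional splitter consumes one existing output while contributing four new ones, a net gain of k - 1 = 3 receivers:

\[
k + (m-1)(k-1) \;\big|_{\,k=4,\;m=3} \;=\; 4 + 2 \times 3 \;=\; 10 ,
\]

so three cascaded 1:4 splitters can reach a multicast group of up to 10 receivers.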

3.3.2 Simulation setup

The control plane delay consists of the scheduling algorithm computation time, the rule installation time, and the control message transmission time between the manager and the servers. The computation time (measured at run time) and the rule installation time (about 8.7 ms [37]) are one-time overheads for each scheduling cycle. The control messages between the manager and the servers can be implemented using any existing RPC solution, and their delay has been shown to be less than 2 ms [36,37]. We assume that in a scheduling cycle every rack has exactly one server that generates a multicast request (id, i, D, f) with itself being the source, D being a random subset of servers in other racks as receivers (each rack has a 50% chance of having some receivers for each source), and f being a random flow size between 10 MB and 1 GB. The number of requests is thus equal to the number of ToRs. We repeat the experiment 500 times and report the average result. This traffic pattern helps us evaluate the network core capacity of the HyperOptics architecture. Note that the group size of a multicast is constrained in the OCS network due to the limited number of ports available for splitters. For a fair evaluation, we make sure that all multicast group sizes are no larger than the largest group size that the OCS network can support. We apply the WFF scheduling algorithm on both HyperOptics and the OCS network and compare the total flow completion time (FCT) of the multicasts. The conflict relations of multicasts are only slightly different in the OCS network than in HyperOptics: in the OCS network, multicasts conflict when they share some destinations, when there are not enough splitter resources to service them simultaneously, or when one multicast's source is another multicast's destination.
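The synthetic request generation can be sketched as follows; the struct layout and the rack-granularity receiver sets are simplifications for illustration.

    #include <cstdint>
    #include <random>
    #include <vector>

    struct Request {            // (id, source rack, destination racks, flow size)
        int id, src;
        std::vector<int> dsts;
        double flowBytes;
    };

    // One scheduling cycle of synthetic requests: every rack sources exactly one
    // multicast; every other rack is included as a destination with probability 0.5;
    // flow sizes are uniform in [10 MB, 1 GB].
    std::vector<Request> generateCycle(int numRacks, std::mt19937& rng) {
        std::bernoulli_distribution pickRack(0.5);
        std::uniform_real_distribution<double> size(10e6, 1e9);
        std::vector<Request> reqs;
        for (int r = 0; r < numRacks; ++r) {
            Request q{r, r, {}, size(rng)};
            for (int d = 0; d < numRacks; ++d)
                if (d != r && pickRack(rng)) q.dsts.push_back(d);
            reqs.push_back(q);
        }
        return reqs;
    }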

Figure 3.4 : Comparison of the average total FCT of all three networks. The speedups of HyperOptics over the OCS network are labeled in the figure. (The plot shows the average total FCT in seconds, split into OCS switching/control message overhead and service delay, for the OCS, Conceptual OCS, and HyperOptics networks with 16 to 256 ToRs/multicast requests; the labeled speedups range from 1.95× to 2.40×.)

Splitter fanout    2      4      6      8
FCT (s)            17.5   17.8   17.7   17.8

Table 3.1 : Average FCT of 256 random multicasts on the OCS network with 256 ToRs. The OCS uses splitters with fanouts varying from 2 to 8.

3.3.3 Effect of Splitter Fanout Used in OCS

The overall multicast delay for the OCS network might vary as the splitter fanout changes. Table 3.1 shows the average FCT of 256 random multicasts on the OCS network with varying splitter fanout. The FCT remains nearly constant. Intuitively, a smaller/larger splitter fanout yields better results when the multicast group size is small/large. In the following experiments, we always report the best result of the OCS network across the various splitter fanouts.

3.3.4 Performance Comparison

Fig. 3.4 shows the average FCT for the three networks. The speedup of HyperOptics over the OCS network is 2.13× on average, and it increases with the number of ToRs. We identify two reasons for HyperOptics' advantage. First, because HyperOptics does not use an OCS, the high reconfiguration delay (incurred every time the circuits need to change) is completely eliminated. As can be seen, the overhead of the OCS network, mainly the OCS reconfiguration delay, is on average 24× larger than the overhead of HyperOptics, which consists only of the 2 ms control message delay between the manager and the servers. Second, in the OCS network, a ToR can only receive traffic from one other ToR at a time. As a result, multicast requests that share common destinations must be serviced sequentially. In HyperOptics, by contrast, the ToRs are interconnected in a log n-regular graph, so each ToR can receive traffic from log n other ToRs simultaneously. We observe that even the Conceptual OCS network is still 1.8× slower than HyperOptics. This shows that the unique connectivity structure of HyperOptics alone leads to a significant FCT improvement.

Computation Time of Control Algorithm

We run our C++ implementation of WFF scheduling on a 3.4 GHz, 4 GB RAM Linux machine. As shown in Fig. 3.5, the computation takes less than 80 ms for 600 requests and less than 18 ms for 256 requests. Moreover, this time is a one-time overhead per scheduling cycle; the manager does not need to recompute the schedule during the service period. The results in Fig. 3.5 demonstrate that the HyperOptics manager remains responsive when handling a large number of requests.

3.4 Summary

We have presented HyperOptics, a multicast architecture for datacenters. A key contribution of HyperOptics is its novel connectivity design for the ToRs that leverages


Figure 3.5 : Computation time of HyperOptics' control plane for different numbers of multicast requests.

the physical-layer optical splitting technology. HyperOptics achieves high throughput and overcomes the high switching delay of the OCS. We show that the overall cost of HyperOptics is comparable with that of the OCS network, yet HyperOptics is on average 2.1× faster than the OCS network for multicast services. Our current routing and scheduling techniques in HyperOptics are quite basic and leave much room for improvement. Our next step is to explore alternative routing and scheduling techniques to fully exploit the HyperOptics architecture.

Chapter 4

Sharebackup: Masking Data Center Network Failures from Application Performance

In this chapter, we present a system called ShareBackup as an economical and effective way to mask failures from application performance. ShareBackup uses a small number of backup switches, shared network-wide, to repair failures on demand so that the network quickly recovers to its full capacity without applications noticing the failures. This approach avoids the complications and ineffectiveness of rerouting. We propose ShareBackup as a prototype architecture to realize this concept and present its detailed design. We implement ShareBackup on a hardware testbed. Its failure recovery takes merely 0.73 ms, causing no disruption to routing, and it accelerates Spark and Tez jobs by up to 4.1× under failures. Large-scale simulations with real data center traffic and a real failure model show that ShareBackup reduces the percentage of job flows prolonged by failures from 47.2% to as little as 0.78%. In all our experiments, the results for ShareBackup differ little from the no-failure case.

4.1 Introduction

The ultimate goal of failure recovery in data center networks is to preserve application performance. In this chapter, we propose shareable backup as a ground-breaking solution towards that goal. Shareable backup allows the entire data center to share a pool of backup switches. If any switch in the network fails, a backup switch can be brought online to replace it. The failover should be fast enough to avoid disruption to applications. With the power of shareable backup, it is possible for the first time to repair failures instantly instead of making do with a crippled network.

Shareable backup is a natural quest given the ineptness of rerouting, the mainstream solution to fault tolerance in data center networks [7–9,39,65–69]. While rerouting maintains connectivity, bandwidth is nonetheless degraded under failures. The rerouted traffic may contend with other traffic originally on the path, thus extending the effect of a failure to a wider range of the network. Routing convergence is known to be slow [70], and even path re-computation on a centralized management entity is expensive [9,68]. This latency is especially harmful to interactive applications with rigid deadlines. Rerouting also risks misconfigurations when updating routing tables, which may cause the network to dysfunction, not to mention other overheads, e.g., slow failure propagation, longer alternative paths, and excessive state exchange. With all these factors, application performance may be jeopardized drastically. According to a failure study of a path-rich production data center, 10% less traffic is delivered for the median case of the analyzed failures, and 40% less for the worst 20% of failures [71]. Injecting these failures into our simulation of a real data center setting (Section 4.6.7), 42% of jobs are slowed down by at least 3× (Figure 4.9(b)), 51% of jobs miss deadlines (Figure 4.10(b)), and 21.3% of flows not on the path of failure still get affected because of rerouting (Table 4.4).

Shareable backup is desirable for its cost-effectiveness. The pool of backup switches need not be large in practice, because failures in data centers are rare and transient. The above failure study shows that most devices have over 99.99% availability and that failures usually last for only a few minutes [71]. With shareable backup, we for the first time achieve network-wide backup at low cost, which is impossible with the traditional 1:1 backup that requires a dedicated spare for each switch.

Shareable backup is achievable with circuit switches, which have been used to facilitate physical-layer topology adaptation in many novel network architectures [16–18, 20, 72–75]. Theoretically, if the pool of backup switches and all the switches in the network are connected to a circuit switch, any switch can be replaced as we change the circuit switch connections. However, a circuit switch has limited port

count, and layering multiple circuit switches to scale up increases insertion loss. Rather than scaling up, recent proposals scale out low-cost, modest-size circuit switches by placing them in a distributed fashion across the network [74,75]. We adopt this approach to partition the network into smaller failure groups and realize shareable backup in each group. In this work, we design a prototype architecture, named ShareBackup, to explore the feasibility of shareable backup on fat-tree [7], a typical network topology found in data centers [10,76]. We have implemented ShareBackup and its competing solutions on a hardware testbed, a Linear Programming simulator, and a packet-level simulator. We have conducted extensive evaluations covering TCP convergence, control system latency, bandwidth capacity, transmission performance at scale with real traffic and a real failure model, and benefits to Spark and Tez jobs on the testbed. The key properties of ShareBackup are: (1) failure recovery takes only 0.73 ms, hardware and control system latencies combined; (2) it restores bandwidth to full capacity after failures, and routing is not disturbed; (3) in all our experiments, its performance difference from the no-failure case is negligible, proving its ability to mask failures from application performance; (4) under failures, it accelerates Spark and Tez jobs by up to 4.1× and reduces the percentage of job flows slowed down by failures in the large-scale simulation from 47.2% to 0.78%.

4.2 Related Work

Data center network architectures rely on rich redundant paths for failure resilience [7,8,65–68]. Among them, fat-tree is the most popular in practical use [7]. ShareBackup builds on top of fat-tree, so it is related to other proposals that enhance the fault tolerance of fat-tree networks. PortLand reroutes traffic to globally optimal paths based on a central view of the network at the fabric manager [9]. F10 reduces delays from failure propagation and path re-computation by local rerouting at switches [39], at the cost of longer paths. It also adjusts the wiring of fat-tree to form AB fat-tree, which provides

diverse paths for local rerouting. Aspen Tree adds different degrees of redundancy to fat-tree to tune the local rerouting path length [69]. It either partitions the network or adds extra switches to obtain more paths. If the host count is kept, it requires at least one more layer, or 40% more switches. ShareBackup takes a completely different approach: instead of rerouting, it deploys backup switches in the physical layer. We compare against PortLand, F10, and Aspen Tree in the evaluations to explore interesting properties of ShareBackup. Besides architectural solutions, many works have tackled failures in data centers from different angles. NetPilot and CorrOpt give operational guidance for manually mitigating the effect of failures [77,78]. ShareBackup instead automatically replaces failed switches to restore the full capacity of the network. Its recovery speed is also significantly faster, e.g., sub-ms vs. tens of minutes. Subways and the Hot Standby Router Protocol suggest multi-homing hosts to several switches to avoid a single point of failure [79,80], which consumes more ports on the hosts and switches. ShareBackup provides more efficient redundancy at the network edge without multi-homing, and we invent a more lightweight VLAN-based solution to make backup switches hot standbys with no additional latency (Section 4.4.4). In the context of rerouting, there is a large body of work on local fast failover [46,81–87], some of which causes an explosion of backup routes; Plinko accordingly introduces a forwarding table compression algorithm [88]. ShareBackup does not depend on rerouting for failure recovery, so it avoids these complications and its forwarding tables are intrinsically small. On the application level, Bodik et al. propose intelligent service placement for both fault tolerance and traffic locality [89], and computation frameworks such as Spark [5] and Tez [6] restart tasks elsewhere when workers are lost. ShareBackup provides a more reliable network, so service placement has fewer constraints. Our experiment in Section 4.6.8 shows that application-level resilience is insufficient: performance is degraded multi-fold if hosts are disconnected. Thus, in-network failure recovery is extremely important.

Table 4.1 : List of notations

Notation    Meaning
k           Fat-tree parameter: switch port count and # Pods [7]
n           # backup switches shared by k/2 switches per failure group
Hj          The jth host
Ei,j        The jth Edge switch in the ith Pod
Ai,j        The jth Aggregation switch in the ith Pod
Cj          The jth Core switch
CSl,i,j     The jth Circuit Switch in the ith Pod on the lth layer
FGl,u       The uth Failure Group on the lth layer
BSl,u,v     The vth Backup Switch in FGl,u
UPp         The pth UPward-facing port of a circuit switch
DOWNp       The pth DOWNward-facing port of a circuit switch

4.3 Network Architecture

Algorithm 1 ShareBackup wiring algorithm

(d) A k = 6 fat-tree network

Figure 4.1 : A k = 6 and n = 1 ShareBackup network. (a), (b), (c) correspond to the shaded areas in (d). Devices are labeled according to the notations in Table 4.1. Edge and aggregation switches are marked by their in-Pod indices; core switches and hosts are marked by their global indices. Switches in the same failure group are packed together and share a backup switch shown in stripes on the side. Circuit switches are inserted between adjacent layers of switches/hosts. The connectivity in shade is the basic building block for shareable backup. The crossed-out switch and connections represent example switch and link failures. Switches involved in failures are each replaced by a backup switch, with the new circuit switch configurations shown at the bottom, where connections on the original red round ports reconnect to the new black square ports.

ShareBackup has stringent requirements on cost and failure recovery delay, which guide our choice of circuit switch technologies. No existing circuit switch has enough ports to connect to all switches in the data center plus the pool of backup switches. Cascading multiple circuit switches wastes many intermediate ports and increases insertion loss, thus

requiring more powerful (and expensive) transceivers, and causes a large end-to-end switching delay that slows down failure recovery. Instead, recent works promote partial configurability in small network regions using circuit switches with considerably lower per-port cost and switching delay [74,75]. For instance, a commercial 160-port 10Gbps electrical crosspoint switch costs $3 per port and has 70ns switching delay [74]; a 32-port 25Gbps optical 2D-MEMS switch has been developed, with 40µs switching delay at an estimated cost of $10 per port [90,91]. These technologies meet our demand. These targeted circuit switches have modest port counts, so we divide the network into smaller failure groups and deploy them in each group. Measurement studies show that failures in data centers are rare, uncorrelated, and spatially dispersed [71,78]. ShareBackup's distributed design is a good match for these characteristics and can provide good coverage. Fat-tree has k/2 edge and k/2 aggregation switches per Pod. To align with the architecture, we cluster k/2 switches into a failure group and allow them to share n backup switches. All switches in a failure group, including the backup switches, must connect to the same circuit switch with the same wiring pattern. In this way, a backup switch can be brought online at run time to replace any failed switch, or the failed links associated with it, within its failure group. This circuit switch should have at least (k/2) × (k/2 + n) ports, which may exceed the port count of the targeted circuit switches for a large data center. We therefore combine k/2 individual circuit switches side by side and design a wiring pattern to achieve equivalent functionality. Figure 4.1 gives intuitions about the architecture design. Algorithm 1 shows the wiring plan, with notations listed in Table 4.1. Figure 4.1(a) illustrates the basic building block for ShareBackup. The edge switches in the same Pod form a failure group (line 2 in Algorithm 1). We place k/2 units of (k/2 + n + 2)-by-(k/2 + n + 2) circuit switches between the edge switches and the hosts*. Every switch, regular and backup alike, connects to each of these k/2 circuit switches with a link (lines 5 and 8). As shown in Figure 4.3, these k/2 circuit switches are

* Each circuit switch has 2 side ports.

chained together via their 2 side ports, which is omitted in Figure 4.1 for simplicity. Hosts connect to the edge switches via straight-through connections on the intermediate circuit switches (lines 4 and 6). The ports to backup switches are not connected internally. When a switch is down, the internal connections to it on all the circuit switches are reconfigured to connect to a backup switch, which replaces the failed switch completely. A switch whose links are down is replaced in the same manner so as to fix the link failures. In Figure 4.1(b), the aggregation switches in the same Pod form a failure group (line 10). Edge and aggregation switches in their failure groups repeat the building block of connectivity in Figure 4.1(a) with another set of k/2 circuit switches (lines 12, 13, 16, and 17). In a fat-tree Pod, an edge/aggregation switch connects to each and every aggregation/edge switch, so we use a rotational wiring pattern in the circuit switches (line 14) to achieve this shuffle connectivity, i.e., the different internal connections on

CS2,1,0 to CS2,1,2. Similarly, the aggregation switches in each failure group shown in Figure 4.1(c) are connected upward to k/2 circuit switches with the wiring pattern of the building block. As the fat-tree example shows, the connections from the aggregation switches in each Pod iterate through all the core switches in consecutive order. Because the aggregation switches are already connected to the circuit switches, we wire up the core switches and the circuit switches to achieve the fat-tree connectivity. In the Figure 4.1(c) example, core switches C0, C1, C2 connect to the first aggregation switch in each Pod (A3,0, A4,0, A5,0) through different circuit switches in the Pod; C3, C4, C5 to the second aggregation switch in each Pod (A3,1, A4,1, A5,1); and C6, C7, C8 to the third (A3,2, A4,2, A5,2). To summarize this pattern, the core switches connect to k/2 circuit switches with a stride of k/2, and we set up straight-through connections in the circuit switches. As in the building block for shareable backup in Figure 4.1(a), only switches connected to the same set of circuit switches can be put into a failure group. As a result, core switches whose indices are in intervals of k/2 form a failure group. We

give each failure group n backup switches and connect them in the same way as regular switches. In fat-tree, edge and aggregation switches are packaged into Pods for ease of deployment. In each ShareBackup Pod, there are n additional edge switches and n additional aggregation switches as backups, and 3 sets of k/2 circuit switches between adjacent layers of switches and hosts. It is straightforward to package the backup switches and the circuit switches into the original fat-tree Pods with simple changes of wiring, as shown in Figure 4.1. The core switches and the backup core switches each connect to every Pod with one link. In practice, the core switches can be placed as in the original fat-tree, followed by the backup core switches; the reordering in Figure 4.1(c) is unnecessary. By streamlining the connectors from within each Pod, we can maintain the original Pod-host and Pod-core wiring patterns of fat-tree.
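To illustrate the building block of Figure 4.1(a), the sketch below enumerates one Pod's edge-layer attachments and initial straight-through circuits. The port and host numbering is one plausible indexing chosen for illustration; it is not necessarily the exact scheme used by Algorithm 1.

    #include <cstdio>

    // Edge-layer building block of one ShareBackup Pod (cf. Figure 4.1(a)):
    // k/2 circuit switches sit between the Pod's k/2 edge switches (plus n backups)
    // and its (k/2)^2 hosts. Indexing below is illustrative.
    void wireEdgeLayer(int k, int n, int pod) {
        int half = k / 2;
        for (int j = 0; j < half; ++j) {                 // circuit switch CS_{1,pod,j}
            for (int e = 0; e < half; ++e) {
                // Up port e: one link from edge switch E_{pod,e}.
                std::printf("CS_1,%d,%d  up%d   <- E_%d,%d\n", pod, j, e, pod, e);
                // Down port e: the j-th host under edge switch E_{pod,e}.
                int host = pod * half * half + e * half + j;
                std::printf("CS_1,%d,%d  down%d <- H_%d\n", pod, j, e, host);
                // Initial circuit: straight-through, so each host reaches its edge switch.
                std::printf("CS_1,%d,%d  circuit: up%d <-> down%d\n", pod, j, e, e);
            }
            // Backup switches are attached but have no internal circuit until a failure.
            for (int b = 0; b < n; ++b)
                std::printf("CS_1,%d,%d  up%d   <- BS_1,%d,%d (idle)\n",
                            pod, j, half + b, pod, b);
        }
    }

    int main() { wireEdgeLayer(6, 1, 0); }   // the k = 6, n = 1 example of Figure 4.1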

4.4 Control Plane

4.4.1 Fast Failure Detection and Recovery

Most previous fault-tolerant data center network architectures, such as PortLand [9] and F10 [39], mainly focus on link failures. According to a measurement study, however, switch failures account for 11.3% of the failure events in data centers, and their impact is significantly more severe than that of link failures [71]. Therefore, ShareBackup aims to detect and recover from both link and switch failures rapidly. Failure detection and recovery are handled by a management entity, e.g., one or more dedicated machines running specific processes. For switch failures, we require switches to continuously send keep-alive messages to the management entity. After missing keep-alive messages from a switch for a pre-defined time period, the management entity allocates an available backup switch to fail over to and reconfigures the circuit switches associated with the failure group. As shown in Figure 4.1(a), in these circuit switches, the original connections to the failed switch are reconnected to the backup switch.
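A minimal sketch of the keep-alive bookkeeping at a failure-group controller follows; the reconfiguration step is a placeholder standing in for the replace() commands sent to the group's circuit switches, not a real API.

    #include <chrono>
    #include <unordered_map>
    #include <vector>

    using Clock = std::chrono::steady_clock;

    // Sketch of a failure-group controller's switch-failure detection.
    struct GroupController {
        std::chrono::milliseconds timeout{15};                 // e.g., three missed 5 ms probes
        std::unordered_map<int, Clock::time_point> lastSeen;   // switch id -> last keep-alive
        std::vector<int> freeBackups;                          // idle backup switches in this group

        void onKeepAlive(int sw) { lastSeen[sw] = Clock::now(); }

        // Placeholder: would issue replace(old_port, new_port) to every circuit
        // switch of the failure group so that all links move to the backup.
        void reconfigureCircuits(int /*failedSw*/, int /*backupSw*/) {}

        void checkFailures() {
            auto now = Clock::now();
            for (auto& [sw, seen] : lastSeen) {
                if (now - seen > timeout && !freeBackups.empty()) {
                    int backup = freeBackups.back();
                    freeBackups.pop_back();
                    reconfigureCircuits(sw, backup);   // switch sw is replaced by 'backup'
                    lastSeen.erase(sw);                // sw goes offline for diagnosis/repair
                    break;                             // map mutated: re-scan on the next round
                }
            }
        }
    };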

We adopt the rapid failure detection mechanism of F10 [39] for link failures, where switches keep sending packets to each other (or to hosts) to test the interface, the link, and the forwarding engine. When a link is down, it takes time to determine which end has lost connectivity. For fast failure recovery, the switches on both sides of the failed link are replaced, and the cause of the failure is analyzed later by the procedure in Section 4.4.3. The management entity gets notifications of link failures from switches and hosts, and reconfigures the circuit switches in the same way as it addresses switch failures, on both ends. Figures 4.1(b) and 4.1(c) show examples of this approach.

4.4.2 Distributed Network Controllers

ShareBackup requires the management entity to be implemented as distributed network controllers. A single controller is not capable of collecting frequent heartbeats from all the switches in the network. As in F10, the probing interval for failure detection can be as low as a few ms. For example, in a k = 48 fat-tree with 2880 switches, a heartbeat message every 5 ms leads to 576k queries per second, which exceeds the capacity of a single controller. Distributed controllers also isolate the impact of controller failures to a small portion of the network. Controllers only store local state, so adding redundant controllers can further enhance resiliency with low state-exchange overhead. Finally, with distributed placement of network controllers, switches and circuit switches can be physically close to their controller, which effectively reduces the message latency of failure detection and recovery.
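Concretely, for the k = 48 example above (assuming one dedicated controller per failure group, as proposed in the next paragraph):

\[
\underbrace{\tfrac{5}{4}k^2}_{2880\ \text{switches}} \times \frac{1}{5\ \text{ms}} = 576{,}000\ \text{heartbeats/s at one central controller},
\qquad
\frac{k}{2} \times \frac{1}{5\ \text{ms}} = 4{,}800\ \text{heartbeats/s per group controller}.
\]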


Figure 4.2 : Communication protocol in the control system. (a): Failure detection and recovery. (b): Diagnosis of link failure.

Figure 4.2(a) shows the failure detection and recovery process using distributed network controllers. Based on the layout of the ShareBackup architecture, an intuitive controller placement is to assign each failure group a dedicated controller. Each controller then receives heartbeat messages from only k/2 switches. As shown in Figure 4.1, a failure group in the core layer corresponds to the k/2 circuit switches beneath it, while one in the edge and aggregation layers corresponds to two sets of k/2 circuit switches above and beneath it, so each controller reconfigures at most k circuit switches. The load on each controller is thus very light even for a large network. The controllers do not share state, so the communication among them is also minimal. Due to this simple functionality, controllers can be realized with low-cost, bare-minimum computing hardware, such as the Arduino [92] and Raspberry Pi [93] platforms. Multiple controllers can also reside on the same machine to realize different degrees of distribution/centralization. Most circuit switches nowadays use the TL1 software interface to set up a connection, whose input and output ports must be specified explicitly, i.e., connect(input_port, output_port). The network controller needs to maintain the current connections of the circuit switches so as to switch to new connections. In Figures 4.1(b) and 4.1(c), a circuit switch connects to two switches from the failure groups above and below, so it can be reconfigured by both controllers. After one controller updates the circuit configuration, the other is ignorant of the change, and may later use outdated port information and mess up the connections. To address

this problem, we change the interface function to replace(old_port, new_port) and free controllers from bookkeeping of the circuit switch configurations. After this change, network controllers reconfigure circuit switches with two parameters, the old port to the failed switch and the new port to the backup switch (the red round and black square ports in Figure 4.1), from which the circuit switches resolve the new connections to change to. Requests from different controllers relate to ports on opposite sides of the circuit switch. In case of concurrent requests, the circuit switch reconfigures one side at a time, so the order of execution does not affect the end result.
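The replace() semantics can be realized by a thin agent that keeps the per-switch circuit map, much like the agent we later use on the testbed (Section 4.6.1). The sketch below uses illustrative names and a placeholder for the actual TL1 command.

    #include <unordered_map>

    // Sketch of a per-circuit-switch agent implementing replace(old_port, new_port)
    // on top of a TL1-style connect(in, out). sendConnect() is a placeholder for
    // the real TL1 command; 'peer' is the agent's bookkeeping of current circuits.
    class CircuitSwitchAgent {
        std::unordered_map<int, int> peer;   // port -> port it is currently circuited to

        void sendConnect(int in, int out) {
            // Would issue TL1 connect(in, out) to the device; keep local state in sync.
            peer[in] = out;
            peer[out] = in;
        }

    public:
        // Move whatever is connected to old_port over to new_port, e.g., when the
        // switch behind old_port fails and a backup sits behind new_port.
        void replacePort(int old_port, int new_port) {
            auto it = peer.find(old_port);
            if (it == peer.end()) return;    // nothing circuited to old_port
            int other = it->second;
            peer.erase(old_port);
            sendConnect(other, new_port);    // the caller never needs to know 'other'
        }
    };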

4.4.3 Offline Auto-Diagnosis


Figure 4.3 : Circuit switch configurations for diagnosis of the link failures shown by examples (b) and (c) in Figure 4.1. Circuit switches in a Pod are chained up using the side ports. Only the “suspect switches” on both sides of the failed link and some related backup switches are shown. Through configurations 1, 2, and 3, the “suspect interface” on both “suspect switches” associated with the failure can connect to 3 different interfaces on one or multiple other switches.

Most link failures are due to malfunction of the network interface on one end. After both switches are replaced, we run failure diagnosis in the background to find which “suspect interface” (and the “suspect switch” it belongs to) has caused the problem. We chain up circuit switches in the same layer of a Pod as a ring through the side ports. Figure 4.3 shows the circuit switch configurations, through which the suspect interface on either end of the failed link can connect to 3 different interfaces, either on

the same switch (as A1,0, E1,0, and C0) or on different switches (as A3,0).

The network controllers for the involved switches coordinate to change the circuit switch configurations and make the switches exchange testing messages. A suspect interface that has connectivity in at least one configuration is cleared as healthy, and so is the corresponding suspect switch. Because failure diagnosis only involves suspect switches already taken offline and backup switches not in use, this process is completely independent of the functioning network. Figure 4.2(b) illustrates the controller coordination process. The two suspect interfaces are tested one by one. Their corresponding controllers elect one to be the initiator and the other to be the passive respondent. The initiator cycles through the configurations to test the suspect interface on its side, after which the initiator and the respondent reverse roles to test the other suspect interface. As shown in Figure 4.2(a), in our distributed control system, each controller is responsible for reconfiguring a small subset of circuit switches and is only allowed to control the ports on its own side, so both controllers need to participate in the circuit setup. The respondent controller learns the target connections from the initiator controller via a configuration ID. Here, we use the original TL1 interface, i.e., connect(input_port, output_port), to connect to the side ports. As an offline process, failure diagnosis can be preempted by failure recovery. It is paused if an involved backup switch, such as BS3,1,0 and

BS3,2,0 in Figure 4.3(c), needs to be used when another failure happens. The initiator controller thus proceeds only after receiving confirmation that the respondent side is not being preempted at the moment; it continues with the next configuration if the ACK times out. In the end, the tested interface pings the other end and terminates the diagnosis process if it has connectivity. Failure diagnosis requires that both sides have at least one healthy interface, so that both suspect interfaces can be tested. If this condition is not met, both suspect switches are considered faulty. Since all hosts are actively in use, offline failure diagnosis is not supported between hosts and edge switches. We assume switches are at fault for link failures to hosts; if the problem is not fixed after replacing the switch, we mark the switch as healthy and troubleshoot the host. After a failed switch is repaired or a suspect switch is exonerated, it is unnecessary to switch back to the original connectivity. Backup switches and regular switches are equal in functionality, so we keep the backup switch online and turn the replaced switch into a backup switch for future use. This design saves reconfiguration overhead and avoids disruptions in the network. Each network controller keeps track of the current backup switches in its failure group.
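The initiator's side of this coordination reduces to cycling through the three configurations, as sketched below. The three helper functions are placeholders to be wired to the real control channel and TL1 calls of Figure 4.2(b); they are not actual APIs.

    #include <initializer_list>

    // Sketch of the initiator controller's diagnosis loop (cf. Figure 4.2(b)).
    bool askRespondentReady(int configId);    // false on ACK timeout or preemption
    void setLocalSideCircuit(int configId);   // connect() on this controller's side ports
    bool pingThroughSuspectInterface();       // testing-message exchange over the circuit

    // Returns true if the suspect interface reaches a peer in at least one of the
    // configurations of Figure 4.3, i.e., it (and its switch) can be cleared as healthy.
    bool diagnoseSuspectInterface() {
        for (int configId : {1, 2, 3}) {
            if (!askRespondentReady(configId)) continue;  // respondent preempted or silent
            setLocalSideCircuit(configId);
            if (pingThroughSuspectInterface()) return true;
        }
        return false;  // no connectivity in any configuration: interface is faulty
    }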

4.4.4 Live Impersonation of Failed Switch

Traffic is redirected to the backup switch in the physical layer after a failed switch is replaced. The backup switch needs to impersonate the failed switch by using the same routing table. Fat-tree uses Two-Level Routing, where each switch has a pre-defined routing table [7]. To avoid the additional delay of inserting forwarding rules into the backup switch, we aim to preload the routing table and make the backup switch a hot standby. Regular switches recovered from failures can work as backup switches, so every switch needs to store the routing tables of all the switches in its failure group. The challenge is to resolve the conflicts between different routing tables. In fat-tree, all the core switches have the same routing table, and so do all the aggregation switches in the same Pod. Therefore, in the aggregation and core layers of our network, switches in a failure group only keep a common routing table. For in-bound traffic, the edge switches in a Pod, also a failure group, have the same set of k/2 forwarding entries that match on the suffix of the end-host addresses. For out-bound traffic, each of these edge switches has k/2 different entries. We use VLANs for differentiation. We first edit the original fat-tree routing tables by assigning every edge switch in the Pod a unique VLAN ID and adding it to the out-bound routing table entries. The edited routing tables from all the edge switches are then combined and stored in every switch in the failure group. A host knows which edge switch it should connect to, so it tags out-going packets with the VLAN ID of that edge switch. No matter which switches in the failure group are active, by matching the VLAN ID, packets always refer to the correct routing table. This combined routing table from k/2 edge switches has k/2 in-bound entries and k^2/4 out-bound entries. This total is within the TCAM capacity of commercial switches even for large-scale fat-tree networks. For instance, the table contains only 1056 entries for a k = 64 fat-tree with over 65k hosts.
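For concreteness, the quoted table size follows directly from these counts:

\[
\frac{k}{2} \;+\; \frac{k^2}{4} \;\big|_{\,k=64} \;=\; 32 + 1024 \;=\; 1056 \ \text{entries},
\]

for a fat-tree with \(k^3/4 = 65{,}536\) hosts.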

4.5 Discussion

4.5.1 Control System Failures

ShareBackup uses a separate control network for failure recovery of the data plane. This raises the question of how ShareBackup handles failures in the control network itself. For most control plane failures, ShareBackup can gracefully fall back to the original fat-tree network and leverage any existing rerouting protocol. In the following, we discuss different component failures of the control system.

Circuit switch failures. Circuit switches are highly reliable passive physical-layer devices [94]. They have minimal control software receiving infrequent reconfiguration requests. When the control software fails, a circuit switch becomes unresponsive but keeps its existing circuit configuration, so the data plane is not impacted. However, because ShareBackup can no longer replace failed switches connected to that circuit switch, the affected part of the network has to fall back to traditional rerouting when data plane failures occur. When a circuit switch has hardware failures such as ports going down, the connected switches will experience real link failures; ShareBackup will treat them as regular link failures and replace the affected switches. In the rare case that a circuit switch is completely down (e.g., a power outage), the controller will receive a large number of link failure reports in a short period of time; it will then stop failure recovery and request human intervention. Since each switch is connected to k/2 circuit switches, each of them only loses 2/k of its capacity during the downtime. We note that hardware/power failures are relatively rare in practice. Most failures come from the software layer, such as switch software/firmware bugs or configuration errors,

etc. [77]. Circuit switches are generally free from these types of failures.

Control channel failures. ShareBackup cannot distinguish a circuit switch software failure from a communication channel failure between the circuit switches and their controllers, so it treats these two failure cases equally. When a link failure happens, the switches on both sides of the link need to be replaced, which is handled by two separate controllers. If the replacement on one side is not successful, the other side will still see a connection failure on the replaced switch. To handle this issue, a backup switch falls back to rerouting-based solutions if the connection error still exists after replacement. Our offline auto-diagnosis requires the coordination of two controllers. The initiator controller cannot proceed without confirmation from its peer controller; it stops the auto-diagnosis after a timeout and calls for human intervention. We note that offline auto-diagnosis is only performed on switches not in use and does not impact the data plane.

Controller failures. Our distributed controllers are intrinsically robust. Each controller only keeps a small amount of runtime state on the current circuit configurations. It is straightforward to realize a fault-tolerant control plane by state duplication.

4.5.2 Cost Analysis

Redundancy incurs extra cost, and we make key design decisions in ShareBackup to reduce it. Concurrent failures are rare in data centers [71], so the ideal case is to have a single backup switch shared by the entire network. However, as discussed at the beginning of Section 4.3, this requires cascaded circuit switches with high cost, insertion loss, and switching delay. As a compromise, we deploy low-cost circuit switches with short switching delay, e.g., electrical crosspoint switches or optical 2D-MEMS, in separate failure groups. Our targeted circuit switches have modest port counts; as Figure 4.1 shows, we combine them to cover more switches and form larger failure groups. Our design achieves a reasonably low backup ratio at low circuit switch cost. The additional cabling cost is minimal in ShareBackup, because the circuit

switches, either electrical or optical, are passive and do not require active elements on their ends, e.g., optical transceivers or amplifiers for copper. The extra cost introduced by ShareBackup on a k = 48 fat-tree network with 27,648 servers is estimated to be 6.7% [25]. This cost does not increase linearly as the network scales up: the larger the failure groups, the smaller the backup ratio. To further reduce cost, we envisage that ShareBackup can be partially deployed at selected network layers or Pods. ShareBackup is especially powerful at the edge of the network. In today's data centers, a host connects to only one Top-of-Rack (ToR) switch, and most fault-tolerant architectures fail to improve this condition [9,39,69]. If ToR or host link failures happen, hosts are disconnected and we have to rely on application frameworks to restart the task elsewhere. Our testbed experiment in Figure 4.11 shows that Spark and Tez jobs get delayed by up to 4.1× in such cases. In a fat-tree network, there are k^2/4 parallel paths in the core layer, but only k/2 in the aggregation layer. ShareBackup is more helpful in the aggregation layer, as rerouting may cause greater congestion with fewer paths over which to balance the load. Partial deployment is straightforward in ShareBackup thanks to the separate failure groups. We give a complete solution in this chapter, but network operators have the freedom to deploy backup switches in certain parts of the network according to application requirements and monetary budget.
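To put this path-diversity argument in numbers for the k = 48 network discussed above:

\[
\frac{k^2}{4} = \frac{48^2}{4} = 576 \ \text{parallel paths in the core layer},
\qquad
\frac{k}{2} = 24 \ \text{in the aggregation layer}.
\]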

4.5.3 Benefits to Network Management

When switches are routinely taken out for upgrade or maintenance, backup switches can neatly take their place to avoid downtime. Misconfigurations account for a large proportion of failures in data centers [71], and they are hard to reason about and fix. ShareBackup can help mitigate their effect and diagnose the problem. The configurations of backup switches can be verified while they are idle. If a switch is misconfigured, it can fail over to a backup switch whose configuration is guaranteed to be correct, and the complicated diagnosis can then be executed offline. With judicious use of hardware,

our offline failure diagnosis in Section 4.4.3 helps identify which interface has caused a link failure. In today's data centers, failure diagnosis and repair are mostly handled manually and take hours at least. Even pioneering work like NetPilot takes 20 minutes just to mitigate failures [77]. Our system implementation in Section 4.6.5 demonstrates that ShareBackup automatically repairs failures in sub-millisecond time and diagnoses failures in sub-second time, which is a breakthrough for data center management.

4.5.4 Alternatives in the Design Space

An interesting question is whether PortLand and F10 would outperform ShareBackup if allowed the same deployment cost, e.g., employing as many additional production switches as ShareBackup has backup switches. However, tree networks are known to lack expandability. It is hard, if not impossible, to add only a small proportion of switches with the same port count to a fully populated tree network. Even if more switches could be added, they would lock up bandwidth at fixed locations, and failures at highly unpredictable locations might still cause bandwidth loss. In contrast, ShareBackup can move backup switches to wherever they are needed. Unstructured networks have been proposed for easier expansion [68,95,96], but their performance under failures is yet to be explored. Admittedly, these topologies have rich bandwidth and diverse paths, but the path length varies hugely, creating a risk of path dilation. We can add switches either to provision bandwidth at the price of degraded performance under failures, or to provide guaranteed performance while keeping the backups idle most of the time. We choose the latter, and we believe shareable backup is an effective way to reduce the idle rate.

4.6 Implementation and Evaluation

A prior study has demonstrated the cost advantage of ShareBackup: using 1 backup switch per failure group, ShareBackup only costs 6.7% more than fat-tree, and it is still more cost-effective than other redundancy-featured architectures with 4 backup

Table 4.2 : Road map of experiment setups

Experiment                 Section  Workload           Failure model  Platform
TCP disruption             4.6.4    Single flow        Single         Testbed
Control plane latency      4.6.5    -                  Single         Testbed
Practical bandwidth        4.6.6    iPerf flows        Rand layered   Testbed
Theoretical bandwidth      4.6.6    Synthetic traffic  Rand layered   LP simulator
FCT/CCT slowdown           4.6.7    Coflow trace       Real           Packet simulator
Job deadlines              4.6.7    Deadline trace     Real           Packet simulator
Throughput-intensive app   4.6.8    Word2Vec, Sort     Rand layered   Testbed
Latency-sensitive app      4.6.8    TPC-H              Rand layered   Testbed

switches per failure group [25]. However, that work lacks an implementation of the ShareBackup system and a performance evaluation against alternative solutions. In this chapter, we conduct comprehensive evaluations of ShareBackup's performance using both a testbed implementation and large-scale simulations. Table 4.2 is a road map showing the setup of each experiment. First, we explore key properties of the ShareBackup system on the testbed, including the failure recovery delay, TCP behavior during the transient state, and the overhead of the control system. Since the major advantage of ShareBackup over other fault-tolerant network architectures is restoring bandwidth after failures, we next compare their bandwidth capacities with both Linear Programming simulations and testbed experiments. We then evaluate ShareBackup's transmission performance using packet-level simulations with practical routing and transport protocols. To simulate real-world scenarios, we use traffic traces and a failure model from production data centers. Finally, we run Spark and Tez jobs on our testbed to measure the performance improvement for real data center applications.

4.6.1 Testbed

Our prototype network is a k = 4, n = 1 ShareBackup with 2 Pods, that is, 2 Pods of a k = 4 fat-tree where each failure group's 2 switches share 1 backup switch. Specifically,

(a) Hosts (b) OpenFlow packet switches (c) OCS

Figure 4.4 : A testbed of k = 4, n = 1 ShareBackup with 2 Pods

it is a non-blocking network with 12 active switches, 6 backup switches, and 8 hosts. Figure 4.4 shows the physical deployment of this logical network. All links are 10Gbps. The switches are hosted on five 48-port OpenFlow packet switches: one partitioned into the core switches and their backups, and the others each partitioned into the active and backup switches of one layer of a Pod. The circuit switches are logical partitions of a 192-port 3D-MEMS optical circuit switch (OCS). The hosts are individual machines, each with six 3.5GHz dual-hyperthreaded CPU cores and 128GB RAM. They run Linux 3.16.5 with TCP Cubic. To make the testbed more manageable, we connect hosts to the OCS via an extra hop on packet switches. We deploy distributed network controllers as described in Section 4.4.2. Our OCS uses the standard TL1 interface. To support the proposed interface function, i.e., replace( ) in Figure 4.2, we let controllers talk to each logical circuit switch (an OCS partition) through an agent. The agent stores the connections of its own circuit switch, with which it translates controller queries issued via our new interface into the TL1 commands that control the corresponding ports on the OCS. The source code of our switch image is inaccessible, so we are unable to realize the failure detection mechanism in Section 4.4.1. Since failure detection is not our main contribution, we bypass it by creating failures at will. We disable forwarding rules to introduce

failures and make the controllers react after a dummy detection latency of 10 ms. This artifact is easily solvable, since the BFD protocol for fast failure detection is readily available on many commercial switches. We set VLANs at the end hosts to enable live impersonation of failed switches according to Section 4.4.4. We focus on online failure recovery in the testbed implementation; the offline diagnosis of link failures is evaluated separately in Section 4.6.5. The switching delay of our OCS is several milliseconds, orders of magnitude higher than that of the targeted circuit switches, e.g., 70ns for the electrical crosspoint switch [74] and 40µs for 2D-MEMS [91]. To evaluate the performance accurately, we also emulate an ideal circuit switch using an electrical packet switch (EPS), which achieves a comparably low switching delay. The bipartite connections on circuit switches are realized as straight-through forwarding rules between input and output ports on the EPS. In case of failures, controllers change the forwarding rules to redirect traffic to the backup switch. Although rule insertion/deletion introduces extra latency, this is the best we can do given the limited hardware.

4.6.2 Simulation

For both simulations below, the simulated network is a k = 16 fat-tree, which consists of 320 switches and 1024 hosts. We assign each failure group 1 backup switch, which adds 40 switches to the network.

Linear Programming Simulation: We abstract the network as a graph and cripple a varying number of links and switches. We solve the maximum concurrent multi-commodity flow problem [97] for different traffic patterns using a Linear Programming (LP) solver, which is a well-adopted approach in topology analysis [27,68,98]. This formulation maximizes the minimum throughput among all the flows, showing the worst case under the effect of failures. The result assumes optimal routing under perfect load balancing.

Packet-Level Simulation: We developed a simulator that supports TCP, fat-

tree's Two-Level Routing, and dynamic failure events. The failure recovery delay is based on the measurement results from our testbed and on the reported numbers for the compared architectures [9,39,69]. Our simulation is a significant improvement over a previous, similar study [25]. First, that work is limited to the converged steady state after failures, while we consider the failure recovery process. Second, its results are biased against rerouting solutions: it uses ECMP routing, which may create more hot spots after rerouting due to hash collisions and thus exaggerates the effect of failures, and it randomly drops packets from the buffer when congestion happens, so packet retransmissions hugely increase the flow completion time. In contrast, our Two-Level Routing eliminates randomness by assigning each flow a deterministic path, load balancing further mitigates hot spots, and the random drop behavior is disabled. Our simulation thus enables realistic and fair comparisons against rerouting solutions.

4.6.3 Experimental Setup

We compare ShareBackup with the following networks:

PortLand [9]: We abstract PortLand as a fat-tree [7] network in the LP formulation. In the packet-level simulator and the testbed, we use Two-Level Routing as the default routing method without failures and improve PortLand's global rerouting with near-optimal load balancing under failures. In the simulator, we reroute traffic heuristically according to the bandwidth utilization of alternative paths. In the testbed, for every possible failure, we hard-code the rerouted paths for the affected flows such that each link carries roughly the same number of flows. Our optimized version of PortLand gives a throughput upper bound for rerouting.

F10 [39]: We build F10's AB fat-tree. In the packet-level simulator and the testbed, we use Two-Level Routing as the default failure-free routing and perform 3-hop local rerouting under failures. In the simulator, we randomly reroute impacted flows to available local paths. In the testbed, limited by the network scale, there is

only one alternative path for each flow. The LP solver enforces globally optimal routing, so it evaluates the capacity of the network topology alone, which serves as a very loose upper bound for actual F10.

Aspen Tree [69]: We maintain the same host count as in the above networks and pick the Aspen Tree configuration that minimizes extra cost and failure convergence time, that is, adding an extra layer of switches below the core switches. Two-Level Routing does not apply to Aspen Tree because of the redundant layer; instead, we use ECMP to distribute flows evenly in each layer. Under failures, we reroute traffic locally if alternative paths exist or push it back to upstream switches otherwise. As with F10, the LP simulations perform globally optimal routing on the topology to show the capacity upper bound. Aspen Tree is not supported on our testbed due to the extra hardware required. The failure models considered in our experiments are listed below.

Random Layered: The LP analysis requires an easy-to-reason-about failure model that helps us understand the effect of failure locations, so we generate random switch and link failures separately in different layers of the network. We simplify this model in the testbed experiments: we create one link failure at a time, since switch failures and concurrent link failures are fatal for our small-scale testbed.

Real: We reproduce real-world failures in our packet-level simulator according to a failure study of production data centers [71]. We create dynamic switch and link failures in the network. Failure locations are derived from the probability of failures per switch/link type (Figures 6 and 7 in [71]), and the arrival times and durations of failures are based on the corresponding distributions (Figures 8 and 9 in [71]). For all our comparisons, we use the following traffic patterns.

The LP solver runs on steady traffic, so we drive the computation with the following widely used synthetic traffic patterns.

Permutation: Every host sends a single flow to a unique server other than itself, chosen at random. This pattern creates uniform traffic across the network.

Stride: Every host sends a single flow to its counterpart in the next Pod. This traffic pattern creates heavy contention in the network core.

Hot Spot: Every 100 hosts form a cluster, in which one host broadcasts to all the others. It simulates the multicast phase of many machine learning applications.

Many-to-Many: Every 20 hosts form a cluster with all-to-all traffic. This traffic pattern simulates the shuffle phase of MapReduce jobs.

There are two pervasive types of applications in data centers: throughput-intensive and latency-sensitive [99,100]. We feed the packet-level simulator with data center traffic traces from these applications to create realistic settings.

Coflow: We obtain the trace of a Facebook data center from the coflow benchmark [101]. It contains correlated flows known as coflows that reflect communications in MapReduce jobs. We observe highly skewed multicast, shuffle, and incast traffic in the trace: some coflows involve a large number of machines and have high traffic volume. For each rack-to-rack flow in the trace, we create flows between hosts under the source and destination edge switches to saturate the switch uplinks.

Deadline: We follow the method in D3 to generate partition-aggregate traffic of interactive web applications [102]. For each query, we randomly choose 1 host as the aggregator and 40 hosts as workers. Workers respond after a random jitter in (0, 10ms] to simulate local computation. The network utilization varies between 10% and 30%. The deadline is set to 2× the response time in the failure-free network.

Finally, we run the following real applications for performance evaluation. We run Spark and Tez on our testbed as representative data center applications. Among the 8 hosts, the first works as the master node and all the others as slave nodes. We create the following throughput-intensive and latency-sensitive jobs.

Spark Word2Vec: This iterative machine learning job uses high-dimensional vectors to represent words in documents. In each iteration, the master node broadcasts the updated model to all workers.


Figure 4.5 : Trace of TCP sequence number during failure recovery.

We configure Spark to broadcast ~500MB of data in each iteration in BitTorrent fashion. This phase thus observes heavy all-to-all traffic. The data to be transmitted is readily available in memory.

Tez Sort: This job is a distributed sorting algorithm based on the MapReduce programming model. The aggregate input data size is 100GB. This job has a heavy shuffle phase, where all the nodes, as mappers, send data to a subset of nodes acting as reducers. We store the data on a RAM disk to prevent the hard drive from becoming the bottleneck of data reads/writes.

Spark TPC-H: This decision support benchmark consists of a suite of database queries that help answer important business questions. The queries run against a 160GB database on each worker. The processing power of the decision making system is reflected by the number of queries per hour, so query latency is critical to performance.

Figure 4.6 : TCP congestion window size during failure recovery.

4.6.4 Transient State Analysis

We examine the TCP behavior during ShareBackup’s failure recovery. A sender host transmits TCP packets at line rate to a receiver host. At the receiver, we capture packets with Wireshark to get the sequence numbers and record the TCP congestion window size with the tcp_probe kernel module while injecting a link failure along the path. We get similar results regardless of the sender and receiver locations. Figure 4.5 and Figure 4.6 show one instance of the results.

In Figure 4.5, the OCS implementation and the EPS emulation experience 8.5ms and 0.5ms of disruption time, respectively. This delay is contributed by the OCS/EPS reconfiguration, i.e., resetting OCS circuits or changing EPS forwarding rules. Interestingly, we observe less packet loss on the OCS testbed even though it has a relatively longer disruption time. Further investigation reveals that our packet switch by default stops forwarding packets when their destination port is detected as down. Those packets are buffered in switch memory and sent out after the port comes up. The EPS emulation does not have such a port-down period, so packets are continuously sent out and dropped in the transient state. As described at the beginning of Section 4.3,

Table 4.3 : Break-down of failure recovery and diagnosis delay (ms)

                     Failure Recovery                          Failure Diagnosis
        total   communication   computation   reconfig      preemption   no preemption
OCS     8.73    0.22            0.01          8.5           502.1        487.3
EPS     0.73    0.22            0.01          0.5           359.2        352.6

our targeted circuit switch technologies have much lower switching delay than the EPS. They function like the OCS and will cause the port-down event. In a practical implementation, we therefore expect a shorter disruption time than the EPS and similar or less packet loss than the OCS. In Figure 4.6, neither the OCS implementation nor the EPS emulation hits the retransmission timeout. For both of them, TCP proceeds smoothly and recovers lost packets rapidly. This result validates our design of fast in-network failure repair that is transparent to applications. Our testbed experiments in the rest of the chapter are based on the OCS implementation. As we will show later, despite the relatively long disruption time, the performance of our ShareBackup implementation is still similar to a failure-free network.
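The disruption window in Figure 4.5 can be read directly off the receiver-side packet capture as the largest gap between consecutive data packets around the failure event. The following is a minimal post-processing sketch, assuming a capture file named trace.pcap and the scapy library; it is illustrative only and not the exact script used in our experiments.

```python
from scapy.all import rdpcap, TCP

# Receiver-side capture around the injected link failure (hypothetical file name).
pkts = [p for p in rdpcap("trace.pcap") if p.haslayer(TCP)]
times = [float(p.time) for p in pkts]

# Disruption time: the largest inter-arrival gap between consecutive packets.
gaps = [(times[i] - times[i - 1], times[i - 1]) for i in range(1, len(times))]
disruption, start = max(gaps)
print(f"Disruption time = {disruption * 1e3:.1f} ms (starting at t = {start:.3f} s)")

# Retransmissions: segments whose sequence number does not advance beyond the
# highest sequence number seen so far (wraparound ignored for brevity).
highest, retransmissions = -1, 0
for p in pkts:
    if p[TCP].seq <= highest:
        retransmissions += 1
    highest = max(highest, p[TCP].seq)
print(f"Retransmitted segments observed: {retransmissions}")
```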

4.6.5 Responsiveness of Control Plane

Our testbed is limited in scale, and the above implementation ignores failure diagnosis. To understand the efficiency of the control plane in a large data center, we abstract the involved entities as individual processes and realize the communication protocol in Figure 4.2 on a k = 48 fat-tree network. There are 2880 switch processes, 120 controller processes, and 3456 agent processes for circuit switches. The communications are implemented using the client-server model with TCP sockets. Link failures are more complicated than switch failures for the control system and are followed by the offline failure diagnosis, so we show the failure recovery and diagnosis delay for link failures in Table 4.3 as the worst-case performance of

the control system. The failure recovery delay is broken down into the time for communications in Figure 4.2(a), computation at the distributed controllers, and the OCS/EPS reconfiguration discussed in the above subsection. Circuit switch agents are not necessary if circuit switches support the proposed reconfiguration function, i.e., replace() in Figure 4.2. Thus, the communication delay can be further reduced in a real implementation. The computation delay is minimal, as controllers only map the failed switch and the backup switch to their circuit switch ports so as to reset circuits. In the EPS emulation, the end-to-end delay of failure recovery is only 0.73ms, which would be even lower with the targeted circuit switches that have shorter switching latency and the modified reconfiguration function. F10 and PortLand reported 1ms and 65ms convergence delay, respectively [9,39]. ShareBackup is more efficient than these state-of-the-art solutions because it involves neither changes to forwarding rules nor computation for rerouting. Our implementation of failure diagnosis cycles through all the configurations in Figure 4.3, although the process may terminate earlier in reality. Even with our very simple implementation, failure diagnosis can finish in hundreds of milliseconds. If the diagnosis process is preempted by failure recovery in one of the tested configurations, the duration only increases slightly. As discussed in Section 5.5 (3), this sub-ms failure recovery and sub-second failure diagnosis are breakthroughs for data center management, compared to the common practice based on manual effort today.
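To give a feel for the structure of this emulation, the sketch below spawns a controller process and two switch processes that exchange a failure report and a reconfiguration reply over TCP sockets. The JSON message format, the port number, and the toy computation step are simplifying assumptions; the actual protocol follows Figure 4.2.

```python
import json
import socket
import threading

CONTROLLER_PORT = 9000          # hypothetical port; one controller per failure group
ready = threading.Event()

def controller(num_reports):
    """Controller process: receives failure reports from the switches adjacent to a
    failed link, maps failed/backup switches to circuit-switch ports, and replies."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", CONTROLLER_PORT))
    srv.listen(num_reports)
    ready.set()
    for _ in range(num_reports):
        conn, _ = srv.accept()
        report = json.loads(conn.recv(4096).decode())          # {"switch": id, "port": p}
        reconf = {"replace": report["switch"], "with_backup": 0}   # toy computation step
        conn.sendall(json.dumps(reconf).encode())
        conn.close()
    srv.close()

def switch_process(switch_id, failed_port):
    """Switch process: detects a port-down event, reports it, and waits for the answer."""
    with socket.create_connection(("127.0.0.1", CONTROLLER_PORT)) as sock:
        sock.sendall(json.dumps({"switch": switch_id, "port": failed_port}).encode())
        return json.loads(sock.recv(4096).decode())

if __name__ == "__main__":
    t = threading.Thread(target=controller, args=(2,))
    t.start()
    ready.wait()
    print([switch_process(sid, failed_port=3) for sid in (17, 42)])
    t.join()
```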

Figure 4.8 : Minimum flow throughput under edge-aggregation link failures normalized against the no-failure case on the LP simulator with global optimal routing for all networks.

4.6.6 Bandwidth Advantage

Figure 4.7 : iPerf throughput of 8 flows saturating all links on testbed.

On the testbed, we show the bandwidth difference among the various network architectures with an iPerf throughput experiment. We create an instance of permutation traffic, where Two-Level Routing places the 8 flows onto different paths without contention. This traffic pattern saturates all links. In Figure 4.7, ShareBackup achieves the same performance as the no-failure case, showing ShareBackup’s ability to restore bandwidth quickly after failures. In contrast, for PortLand and F10, the worst-case flow throughput decreases dramatically as the failure approaches the edge links. In a fat-tree (or AB fat-tree) network, there are k²/4 parallel paths in the core layer, k/2 in the aggregation layer, yet only 1 in the edge layer right above the hosts. As a result, rerouting causes greater congestion with fewer paths to balance the load at the edge. F10 is less tolerant to failures than PortLand, because its local rerouting uses longer paths and makes congestion even worse.

This trend holds for our LP simulations as well. Due to space limitations, we only show the results for link failures in the aggregation layer in Figure 4.8. ShareBackup outperforms the other architectures by 13% to 25%. It achieves throughput similar to the case without failures given 2 backup switches per failure group, while 1 backup switch sometimes falls short when concurrent failures happen in the same failure group. Note that this is a stress test. In data centers, most devices have over 99.99% availability, and concurrent failures are very rare [71]. So, 1 backup switch per failure group is sufficient in most cases. PortLand and F10 have the same numbers, as the LP solver performs optimal routing on both. Although Aspen Tree uses more switches for redundancy, it fails to add more bandwidth to the network, so its performance is no better than PortLand and F10.

Under edge link failures, the impacted flows have zero throughput in all architectures, whereas ShareBackup still gives full capacity. If link failures happen in the core layer, these architectures have very similar throughput, since there are abundant alternative paths for rerouting. The results for switch failures are the same as for link failures. Our formulation calculates the minimum flow throughput as the worst-case analysis. Switch failures affect more flows but do not change the minimum value. From these observations, we conclude that ShareBackup is most powerful in the edge layer, then the aggregation layer. As discussed in Section 5.5 (1), ShareBackup supports layer-wise partial deployment to save cost.
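The path-diversity argument can be made concrete with a few lines of arithmetic: between two hosts in different pods of a k-ary fat-tree there are k²/4 core-layer paths and k/2 aggregation-layer choices, but only one edge link per host. The snippet below is purely an illustration of these counts, not part of our evaluation code.

```python
def fat_tree_path_diversity(k):
    """Parallel paths between two hosts in different pods of a k-ary fat-tree,
    grouped by the layer in which a failure would force rerouting."""
    return {
        "core": (k // 2) ** 2,    # k/2 aggregation choices x k/2 core choices
        "aggregation": k // 2,    # aggregation switches reachable from a source edge switch
        "edge": 1,                # each host hangs off exactly one edge switch
    }

for k in (4, 8, 48):
    print(f"k={k}: {fat_tree_path_diversity(k)}")
# The shrinking path count toward the edge is why rerouting-based recovery
# suffers most from failures near the hosts, as observed in Figures 4.7 and 4.8.
```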

Table 4.4 : Percentage of impacted flows/coflows in Figure 4.9

                              ShareBackup   PortLand   F10      Aspen Tree
Directly impacted flows       0.63%         3.92%      3.91%    3.89%
Indirectly impacted flows     0             16.01%     21.29%   16.23%
Directly impacted coflows     0.78%         17.32%     18.31%   18.48%
Indirectly impacted coflows   0             18.95%     28.89%   19.22%

4.6.7 Transmission Performance at Scale

Figure 4.9 : CDF of completion time slowdowns on packet simulator.

For throughput-intensive jobs in the Coflow trace, the transmission performance is largely determined by the network’s bandwidth capacity shown in the above section. Figure 4.9 plots the distributions of flow completion time (FCT) and coflow completion time (CCT) normalized against the no-failure case, i.e., slowdowns. Overall, ShareBackup has negligible performance degradation, whereas PortLand, F10 and Aspen Tree experience multi-fold slowdowns for >20% of flows and >30% of coflows. The impact of failure is magnified at the coflow level, since a small number of straggler flows is all it takes to negatively impact the CCT. F10 performs notably worse than PortLand

and Aspen Tree, because its local rerouting uses longer paths (path dilation), resulting in more flows being impacted. Aspen Tree slightly underperforms PortLand, because Aspen Tree’s local rerouting also has a small path dilation. As an application can only proceed after the entire coflow has finished, ShareBackup is significantly more effective in masking failures from applications. These results are corroborated by Table 4.4. As demonstrated in Section 4.6.5, ShareBackup recovers from a failure in sub-milliseconds, so very few flows and coflows are impacted during the small transition period. Other architectures, however, have up to 25.2% of flows and 47.2% of coflows impacted. Note that a large portion of the flows/coflows are indirectly impacted: they are not hit by failures directly but contend with the rerouted flows. Rerouting spreads the effect of failures to innocent flows, thus converting a local failure into global performance degradation. In comparison, ShareBackup’s principle of fixing failures where they happen effectively localizes the problem and provides more predictability to application performance.

Figure 4.10 : Percentage of jobs missing deadlines on packet simulator.

Figure 4.11 : Performance of the Spark Word2Vec and Tez Sort applications with a single edge (edge-host) or network (core-aggregation and aggregation-edge) link failure on the testbed.

For latency-sensitive flows in the Deadline trace, Figure 4.10 shows the percentage of jobs that miss deadlines under failures. In ShareBackup, failures cause less than a 2% deadline miss rate, with a slight increase as the network utilization grows. ShareBackup handles switch and link failures in the same way, so the results are similar. Rerouting-based solutions perform much worse in comparison. They are sensitive to network utilization and failure type, with the worst-case job miss rate reaching 51%. Although F10 has a failure recovery delay similar to ShareBackup’s, its local rerouting causes heavy traffic contention. PortLand’s global rerouting is more efficient, but there is still bandwidth loss, and path re-computation takes as long as 65ms [9]. Jointly affected by these two factors, F10 outperforms PortLand slightly. Aspen Tree pushes most impacted downstream traffic back to the core layer and reroutes from there, balancing rerouting delay and path dilation. Its performance thus falls between PortLand and F10.

4.6.8 Benefits to Real Applications

The performance of the bandwidth-intensive applications on the testbed is shown in Figure 4.11. We do not differentiate link failures in the aggregation and core layers because they result in similar performance. The trend is consistent with the bandwidth difference in Figure 4.7. It confirms that ShareBackup masks failures from application performance: for both applications, the communication time and job completion time consistently stay the same as in the no-failure case. Big-data frameworks spend most of their time on computation. Inter-node communications between computations are influenced by node synchronization, data serialization, garbage collection, etc. Considering all these factors, the bandwidth advantage of ShareBackup can still translate into over 12% less communication time and 23% less job completion time under in-network failures. If an edge link fails, workers may get lost and the master needs to relaunch tasks. The communication phase, and thus the entire job, finishes multi-fold slower. In our experiments, we even encountered cases where the job crashed. Failures near the hosts are disastrous to applications, and ShareBackup is an especially useful remedy.

Figure 4.11(c) zooms in on the distribution over multiple iterations of data broadcast for the Spark Word2Vec case. Under failures, ShareBackup has almost the same communication and job duration throughout the runs, while the variation is huge for PortLand and F10. In this Word2Vec job with BitTorrent-like traffic, receivers retrieve data blocks depending on availability. Rerouting slows down data retrieval to different degrees at the worker nodes. The change in data availability further shapes traffic later on, leading to a long tail in completion time. This observation validates the point in Section 4.6.7 that rerouting enlarges the effect of failures while ShareBackup preserves predictability. We again show that ShareBackup can mask failures from application performance: for many applications, it provides a performance guarantee even in the worst case.

Figure 4.12 : CDF of query latency in the Spark TPC-H application with a single edge (edge-host) or network (core-aggregation and aggregation-edge) link failure on the testbed.

This long-tail phenomenon also exists in the Spark TPC-H application. The CDF of query latency in Figure 4.12 reflects TPC-H’s performance metric: the number of queries finished in a time period. By that metric, PortLand and F10 under edge link failures are on average 25% lower than the other cases, and their job completion

time, determined by the last query, is 37.4% and 38.1% longer, respectively. Hosts are disconnected in this case, and the job relies on Spark’s own resilience mechanism, i.e., task relaunch, to proceed. Traffic is light in this application, and the traffic contention from rerouting is not heavy enough to cause congestion, so PortLand and F10 have similar performance to ShareBackup in most cases. Nonetheless, ShareBackup is still necessary for edge link failures.

4.7 Summary

The advancement of circuit switching technology makes it possible to assign backup switches on demand at runtime. ShareBackup is the first effort to realize this concept of shareable backup in data center networks. Circuit switches have inherent tradeoffs between cost, switching latency, and port count. ShareBackup aims at a cost-effective network architecture for rapid failure recovery, so the port count has to be restricted. The choice of modest-size circuit switches drives ShareBackup’s distributed design of both the network architecture and the control system. We find this design a good match for the rare, uncorrelated, and spatially dispersed failures in data centers. With co-design of the architecture and the control system, backup switches can work as hot standbys without primary-backup coordination or online changes to forwarding rules. Besides failure recovery, special setups of circuit switches can automate and speed up failure diagnosis. Extensive system implementations and evaluations demonstrate that ShareBackup can effectively mask failures from application performance. This powerful concept of shareable backup goes beyond the specific ShareBackup architecture. We encourage more research efforts in this promising direction.

Chapter 5

RDC: Relieving Data Center Network Congestion with Topological Reconfigurability at the Edge

In this chapter, we propose the idea of a “rackless” data center (RDC) network architecture, aiming to solve the bandwidth disparity issue in oversubscribed Clos networks. The rackless data center (RDC) is a novel network architecture that removes the logical rack boundary and its inefficiencies in a traditional data center. As modern applications generate more and more inter-rack traffic, the traditional architecture suffers from contention at the core, imbalanced bandwidth utilization across racks, and longer network paths; failures of Top-of-Rack switches also disconnect servers unless multihoming is used. RDC builds upon the traditional Clos topology, inheriting desirable properties such as ease of deployment, maintenance, and expansion. At the same time, it can dynamically move servers across the logical rack boundary at runtime. RDC achieves this by inserting circuit switches at the network edge between the ToR switches and the servers, and by reconfiguring the circuits to regroup servers across racks based on the traffic patterns or switch failures. We have performed extensive evaluations of RDC both on hardware testbeds and in packet-level simulations. RDC can achieve an aggregate server throughput close to that of a non-blocking network and an average path length 35% shorter. On realistic applications such as HDFS, Memcached, and MPI workloads, RDC can improve the job completion times by 1.1-2.7×.

5.1 Introduction

Data center network (DCN) architectures are critical for achieving high throughput and low latency, as well as maintaining low cost and complexity. To meet these goals, researchers have proposed a long line of DCN architectures [7,8,16,18–20,28,30,39,73,103–108] over the past decade. Although these proposals have competing designs for the network core, they have very similar designs for the network edge: servers organized in racks. The network core connects multiple racks, and each rack hosts tens of servers that are statically connected via a Top-of-Rack (ToR) switch. Standardized server racks enable unified power supply and cooling, as well as tremendous space and cable savings. This rack-based topology and connectivity pattern is deeply ingrained in the design of existing DCN architectures.

However, the drawback of this rack-based connectivity is prominent. It fragments the server pool into isolated racks, posing a physical limitation on the speed of communications across rack boundaries. To reduce equipment and operational cost, DCNs are typically oversubscribed at the core, with typical oversubscription ratios somewhere between 4:1 and 20:1 [8,10,109]. Therefore, the available bandwidth between servers can vary drastically, depending on how close they are on the topology. Servers communicating in the same rack enjoy line-rate throughput and low latency, but servers communicating across racks have much lower performance. As a result, the oversubscribed layers can be easily congested by inter-rack flows, and the edge links tend not to be fully utilized due to congestion higher up in the hierarchy. Measurement studies on a variety of DCNs have repeatedly highlighted these limitations [109–112]. For instance, results in [110] show that while the core links are usually heavily utilized, the edge links within racks are extremely under-utilized: more than 98% of the links observe less than 1% utilization.

Existing work mostly views these limitations as a given and designs around them. For instance, advanced transport protocols [99,113] aim to alleviate congestion, flow scheduling algorithms [40,114] aim to increase bandwidth utilization, and job placement and execution strategies [112,115] can reduce inter-rack traffic. All these approaches can and do lead to performance benefits, but the rack boundaries pose physical constraints that are inherently challenging to get around. Our rackless DCN architecture, in comparison, has a very different goal: removing the fixed, topological rack boundaries while preserving the benefits of rack-based designs in terms of ease of power supply, cooling, and space savings. In RDC, servers are still mounted on physical racks, but they are not bound statically to any ToR switch. Rather, they can move logically from one ToR to another. Under the hood, this is achieved by the use of circuit switches, which can be dynamically reconfigured to form different connectivity patterns. In other words, servers remain immobile on the racks, but circuit changes may shift them to different topological locations.

The RDC architecture is particularly useful for pod-level communication, which has become increasingly dominant in datacenter workloads. A pod, or cluster, usually has tens of racks hosting a range of services that work together; for instance, a compute cluster may need to retrieve data from a cache layer and then a storage layer. Recent measurement studies show that more and more DCN traffic is escaping the rack boundary and becoming pod-local [109].
This is not only due to the ever-increasing scale of jobs, but also because racks tend to host servers of similar types—e.g., one rack may host storage servers, and another rack in the same pod may host cache servers. Therefore, servers on one rack would inevitably need to coordinate with servers on other racks, producing inter-rack traffic within the same pod [109]. The rackless design, therefore, can provide substantial throughput and latency improvements for such pod-level workloads.

The power of RDC stems from the fact that servers can dynamically form locality groups optimized for the current traffic pattern. DCN traffic patterns can change based on the underlying dataset being processed, the particular placement of workers, and many other factors. Even for a single job, it may proceed in multiple stages, with each stage exhibiting a different traffic pattern (e.g., distributed matrix multiplication [116,117]).

As a result, no static locality group can consistently outperform other configurations across workloads, worker placements, and job stages. RDC addresses this root cause by designing a novel architecture that is not committed to any static configuration. Rather, servers that heavily communicate with each other can be grouped together on demand, and they can be regrouped as soon as the pattern changes again. Instead of optimizing the workloads for the topology, RDC optimizes the topology for the changing workloads. Specifically, RDC leads to several key benefits:

• Increased traffic locality. RDC dynamically regroups servers that communicate heavily with each other via circuit reconfigurations, so that a higher percentage of traffic enjoys full bandwidth and low latency due to the increased locality.

• Balanced inter-pod traffic. RDC can redistribute servers across the entire pod, which allows the uplinks of ToR switches to be load-balanced nearly optimally, effectively mitigating uplink congestion.

• Higher resilience to ToR switch failures. In today’s DCN architecture, a ToR switch failure would disconnect all servers in a rack. In RDC, the disconnected servers can be quickly reconnected to a different ToR switch for recovery.

We have evaluated RDC using a prototype implemented on a hardware testbed and using packet-level simulations. Our evaluation shows that RDC can regroup servers at runtime to improve the flow completion time (FCT) by more than an order of magnitude. It achieves a similar speedup to that of an ideal non-blocking network and outperforms other baseline solutions. Moreover, RDC can be modularly packaged and incrementally deployed; its extra cost is estimated to be only 3.1% more than that of a traditional pod.

Figure 5.1 : Traffic patterns from the Facebook traces. (a) is the rack-level traffic heatmap of a representative frontend pod. (b) shows the heatmap after regrouping servers in (a). (c) and (d) plot the sorted load of inter-pod traffic across racks in a representative database pod, before and after server regrouping, respectively. (e) shows the traffic stability over different timescales.

5.2 The Case for Rackless Data Centers

To further motivate our design, we present a set of measurement results obtained from sampled traces from a production data center. The highest-level findings are: a) traffic patterns exhibit pod locality, but not rack locality, b) rack-level traffic patterns are heavily imbalanced, and c) traffic patterns are stable over suitably chosen time epochs. Removing rack boundaries, therefore, is both advantageous and feasible.

Dataset. We used a public dataset released by Facebook, which contains packet-level traces collected from their production data centers in a one-day period. The traces were sampled at a rate of 1:30k, and each packet contains information about the source and destination servers [118]. They were collected from the “frontend”, “database”,

and “Hadoop” clusters, respectively. To understand the benefits of removing rack boundaries, we simulate a rackless design by regrouping servers in different racks under a “hypothetical” rack. The regrouping algorithm essentially follows our design in section 5.3.

5.2.1 Observation #1: Pod locality

Fig. 5.1(a) plots the traffic patterns of a representative pod with 74 racks in a 2-minute interval in the frontend trace, in the format of a heatmap. If a server in rack i sends more traffic to another server in rack j, then the pixel (i, j) in the heatmap becomes darker. Intra-rack traffic appears on the diagonal (i.e., i = j). As we can see, the scattered dots show that the traffic does not exhibit rack locality — in fact, 96.26% of the traffic in this heatmap is inter-rack but intra-pod. A similar trend exists for the database trace: 92.89% of traffic is inter-rack but intra-pod. The Hadoop trace has more intra-rack traffic, but 52.49% of its traffic is still inter-rack but intra-pod. This observation is consistent with a larger-scale study of Facebook’s traffic patterns, which shows that over 70% of the traffic is pod-local, but only 18.3% is rack-local [109].

Regrouping servers improves locality. Fig. 5.1(b) shows the heatmap in a hypothetical data center where servers are regrouped under different racks based on their communication intensity, simulating the desired effect of RDC. Here, most of the traffic is on the diagonal, and inter-rack traffic is reduced significantly to 38.4%. For the database and Hadoop traces, the inter-rack traffic after regrouping is 28.4% and 41.6%, respectively. Thus, we believe that enormous optimization opportunities exist if we were able to dynamically regroup servers under different physical racks.
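As a rough illustration of how such heatmap statistics can be computed, the sketch below takes a server-to-server traffic matrix and a server-to-rack assignment and reports the inter-rack traffic fraction. The toy matrix and regrouping are hypothetical examples, not the analysis pipeline used for Figure 5.1.

```python
import numpy as np

def inter_rack_fraction(traffic, rack_of):
    """traffic[i, j]: bytes from server i to server j; rack_of[i]: rack hosting server i."""
    total = traffic.sum()
    inter = sum(traffic[i, j]
                for i in range(traffic.shape[0])
                for j in range(traffic.shape[1])
                if rack_of[i] != rack_of[j])
    return inter / total

# Toy 4-server, 2-rack example: servers 0 and 2 talk heavily, as do 1 and 3.
traffic = np.array([[0, 1, 90, 1],
                    [1, 0, 1, 90],
                    [90, 1, 0, 1],
                    [1, 90, 1, 0]], dtype=float)
print(inter_rack_fraction(traffic, [0, 0, 1, 1]))   # original racks: ~0.99 inter-rack
print(inter_rack_fraction(traffic, [0, 1, 0, 1]))   # after regrouping: ~0.02 inter-rack
```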

5.2.2 Observation #2: Inter-pod imbalance

Another trend we observe is the heavy imbalance of inter-pod traffic, which can also benefit from regrouping servers. Fig. 5.1(c) sorts the racks based on the amount of inter-pod traffic they sent (traffic trace: database) in a 20-min interval, where the X-axis is the rack ID and the Y-axis is the (normalized) amount of inter-pod traffic a

rack sends. As we can see, the top 11 racks account for nearly 50% of the inter-pod traffic, and almost half of the racks never sent traffic across pods. This means that some uplinks of ToR switches are heavily utilized, whereas other links are almost

always idle. The load imbalance, defined as max(Li)/avg(Li), where Li is the amount of inter-pod traffic from rack i, is as much as 4.17. We found qualitatively similar results on other traces.

Regrouping servers mitigates load imbalance. Fig. 5.1(d) shows the results for regrouped servers. After regrouping, the inter-pod traffic is much more evenly load balanced across racks, achieving a load imbalance of 1.14. This would make better use of the inter-pod links, and avoid congesting any particular link due to traffic imbalance.

5.2.3 Observation #3: Predictable patterns

Our third observation is on the degree of predictability of traffic patterns. This does not speak to the benefits of RDC, but its feasibility. We can only dynamically regroup servers based on traffic patterns if the patterns are stable — if traffic patterns change without any predictability, then it would be difficult to find a suitable reconfiguration strategy. We analyzed the stability of inter-rack traffic at different time intervals within a pod, using a metric similar to that in MicroTE [119]. We divide time into epochs, and for each epoch, we measure the amount of traffic exchanged between a pair of servers. This results, for each epoch i, in a set of triples (s, r, v), where s is the sender, r is the receiver, and v is the exchanged volume. Then, if we have a triple

(s, r, v1) for epoch i, and a triple (s, r, v2) for epoch i + 1, and |v1 − v2| / v2 < 0.2, then we consider the communication pattern between s and r to be stable across these two epochs.
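A minimal sketch of this stability metric is shown below: given per-epoch traffic volumes for each (sender, receiver) pair, it counts the pairs whose volume changes by less than 20% between consecutive epochs. The 0.2 threshold follows the definition above; the dictionary layout and toy data are illustrative.

```python
def stable_fraction(epoch_a, epoch_b, threshold=0.2):
    """epoch_a, epoch_b: dict mapping (sender, receiver) -> traffic volume in that epoch.
    Returns the fraction of (s, r) pairs whose relative volume change is within threshold."""
    common = [k for k in epoch_a if k in epoch_b and epoch_b[k] > 0]
    if not common:
        return 0.0
    stable = sum(1 for k in common
                 if abs(epoch_a[k] - epoch_b[k]) / epoch_b[k] < threshold)
    return stable / len(common)

# Toy example with three server pairs across two consecutive epochs.
epoch_i  = {("s1", "r1"): 100, ("s2", "r2"): 50, ("s3", "r3"): 10}
epoch_i1 = {("s1", "r1"): 110, ("s2", "r2"): 90, ("s3", "r3"): 9}
print(stable_fraction(epoch_i, epoch_i1))   # 2 of 3 pairs are stable -> ~0.67
```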

Regrouping based on traffic patterns is feasible. Figure 5.1(e) plots the percent of such “stable triples” for the three traces. As we can see, the database trace is highly stable when an epoch lasts from two minutes to one hour. If an epoch is an hour, then 74

Figure 5.2 : Aggregated server throughput before and after server regrouping for (a) one-to-one, (b) one-to-many (many-to-one), and (c) many-to-many traffic patterns. Source servers are colored in white and destination servers are colored in gray.

81% of the triples are stable. Although the Hadoop trace is less stable, it still has 18% stable triples for one-hour epochs. We note that these traces are sampled from the original traffic, which may exhibit higher stability at finer timescales. Nevertheless, the highest-level takeaway is that the stability of traffic patterns is notable, and that this could be learned from past traffic patterns. RDC could then use the stability patterns to determine its reconfiguration period.

5.2.4 Understanding the power of racklessness

To better understand the conceptual benefits of a rackless design, we consider several simple but illustrative traffic patterns. Consider a network with two racks, each with n servers; the 2 ToR switches are connected via a single core switch with an oversubscription ratio of p : 1 (p ≥ 1). For each traffic pattern, we assume that the src servers and dst servers are originally in different racks. Each server has 1 unit of NIC bandwidth.

One-to-one. As shown in Fig. 5.2(a), in this traffic pattern, each src server in one

rack talks to a dst server in another rack. The communication graph consists of 2n vertices and n disjoint edges. Before regrouping, all the n flows would travel through the oversubscribed network core, yielding an aggregate throughput of n/p. For such a pattern, a rackless network can localize all the traffic by simply packing both endpoints of each edge into the same rack, achieving a p× throughput improvement.

One-to-many/many-to-one. We consider the case where a server in one rack talks to n servers in another rack. Depending on the oversubscription ratio p, the traffic could either be bottlenecked at the source (aggregate throughput 1) or at the core switch (aggregate throughput n/p). Therefore, this traffic pattern yields an aggregate throughput of min(1, n/p). As shown in Fig. 5.2(b), a rackless network can group the source and n − 1 of the destinations in one rack and the remaining destination in another rack. After regrouping, the unused NIC bandwidth due to the bottleneck in the network core can always be filled up by the intra-rack traffic. That is, no matter how large p is, the source can always saturate its NIC bandwidth and achieve an aggregate throughput of 1, equivalent to the aggregate throughput of a non-blocking network. In fact, it is not necessary to move n − 1 servers; moving even one server from rack 2 to rack 1 would be sufficient.

Many-to-many. In this case, all the servers in one rack communicate with all the servers in another. This is essentially n simultaneous one-to-many transmissions. Since each server has n inter-rack flows, traffic can only be bottlenecked at the core switch, yielding an aggregate throughput of n/p. However, as shown in Fig. 5.2(c), the rackless network can exchange the locations of one half of the servers in rack 1 with one half of the servers in rack 2, such that every src server has dst peers in its own rack. Therefore, every src server can saturate its NIC bandwidth, and the whole network achieves an aggregate throughput of n, equivalent to that of a non-blocking network.

Summary. The above analysis demonstrates an attractive property of RDC: it can provide the same aggregate throughput as that of a non-blocking network, as long as

the traffic was originally bottlenecked by the network core; moreover, this property is independent of the oversubscription ratio. We expect that this improvement would lead to application-level benefits, such as shorter flow/job completion times.
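The three cases above reduce to a small closed-form model: with n servers per rack and oversubscription p, the aggregate throughput is n/p, min(1, n/p), and n/p before regrouping, and n, 1, and n after. The tiny sketch below is purely an illustration of this model with example values of n and p.

```python
def aggregate_throughput(pattern, n, p, regrouped):
    """Aggregate server throughput (units of NIC bandwidth) for the toy two-rack
    model of Section 5.2.4: n servers per rack, p:1 core oversubscription."""
    if pattern == "one-to-one":
        return n if regrouped else n / p
    if pattern == "one-to-many":            # also covers many-to-one
        return 1 if regrouped else min(1, n / p)
    if pattern == "many-to-many":
        return n if regrouped else n / p
    raise ValueError(pattern)

n, p = 16, 4
for pattern in ("one-to-one", "one-to-many", "many-to-many"):
    before = aggregate_throughput(pattern, n, p, regrouped=False)
    after = aggregate_throughput(pattern, n, p, regrouped=True)
    print(f"{pattern:>13}: {before} -> {after}")
# one-to-one:   4.0 -> 16   (p x improvement)
# one-to-many:  1   -> 1    (already source-bottlenecked when n >= p)
# many-to-many: 4.0 -> 16   (matches a non-blocking network)
```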

5.3 The RDC Architecture

We describe the RDC architecture in this section.

5.3.1 Building block: Circuit switching

Our choice of the underlying circuit switching technology is guided by the design goal of providing pod-level reconfigurability. A pod typically contains tens of racks and several hundred servers [120], and forms an atomic deployment unit of compute and storage [121,122]. Supporting a rackless architecture for networks at such a scale requires circuit switches with O(1000) ports. 3D MEMS-based optical circuit switches (OCS) can scale to such a port density with a switching delay of several milliseconds [123,124]. For example, today’s largest commercial OCS has 640 input and output ports [123], and research prototypes have demonstrated the feasibility of a few thousand ports [26]. Fast electrical circuit switches with tens of nanoseconds switching delay are also commercially available, but they have relatively fewer ports (e.g., 160) [74,125,126]. The constraints on port count and switching delay of circuit switches are likely to loosen in the future. As recent research shows, non-MEMS (e.g., DMD) circuit switches can scale to tens of thousands of ports with switching delay on the order of microseconds [19]. Whether optical or electrical, circuit switches do not encode, decode, or buffer packets, so they are protocol and data rate transparent. They can provide high bandwidth at very low power [124].

RDC uses circuit switches to achieve reconfigurable server-ToR connectivity at the network edge. Circuit setup and reconfiguration can be managed by software controllers, e.g., via the TL1 interface. Circuits can be reconfigured independently from each other, so reconfigurations only impact traffic on the changed circuits, and the switching delay is independent of the number of changed circuits. In RDC, reconfiguration happens only when a server’s rack membership needs to change. Each circuit reconfiguration introduces a small amount of link-down time for the connected servers. However, since servers can detect the link-down event and buffer traffic in the transient state, and the reconfigurations are infrequent, RDC will not cause significant traffic disruption. We will discuss several reconfiguration algorithms in § 5.4.

Figure 5.3 : RDC architecture and control plane overview. (a) is an example of the RDC network topology. Circuit switches are inserted at the edge between servers and ToR switches. Connectivities for aggregation switches (agg.) and core switches remain the same as in traditional Clos networks. (b) presents an overview of the control plane.

5.3.2 Connectivity structure

Fig. 5.3(a) shows an example of the RDC pods. RDC changes the traditional multi-layer Clos topology [7,8] by inserting circuit switches at the edge layer between servers and ToR switches. The aggregation layer and the core layer of the network remain

the same. Circuit switches are used per pod. Each pod has m racks with n servers per rack, and thus requires 2mn ports on the circuit switch—half of the ports are connected to servers while the other half are connected to ToR switches. For example, a 16-rack pod with 32 servers per rack requires 1024 circuit switch ports. In traditional data centers, each server has a fixed connection to a single ToR; each locality group has a size of n. In contrast, RDC enables full flexibility to permute the server-ToR connectivities, and can achieve a locality group of size mn across reconfigurations.

5.3.3 The pod controller

Today’s data centers are constructed from modular pods [120–122,127], where a pod typically hosts one type of service. RDC similarly views pods as basic units, and uses a per-pod network controller that manages both the packet switches and the circuit switch within the pod. The controller does not require any datacenter-wide information, and there is no coordination among the pod controllers. The controller has two operation modes: proactive mode and reactive mode. In reactive mode, the controller passively monitors the traffic statistics from packet switches and changes the topology periodically. Depending on the cluster usage, network operators may have different optimization goals for the traffic. RDC supports this by providing a general optimization framework that can adjust its reconfiguration algorithm to suit different needs (§5.4). A key property of the reactive controller is that it is transparent to end hosts and applications, which distinguishes it from many prior reconfigurable network proposals that require host modifications to obtain traffic demands (e.g., c-Through [104]).

Fig. 5.3(b) illustrates the controller under reactive mode. In each reconfiguration cycle, it first collects the traffic statistics by querying the flow counters on the ToRs and computes a near-optimal topology under certain optimization goals. Then, the controller generates a set of new routes associated with a new packet version match field and installs them on the packet switches. Next, the controller sends the circuit

reconfiguration request to the circuit switch and simultaneously installs the new packet version number on the ingress switches (ToRs). Only this final step causes a small amount of disturbance, due to the circuit reconfiguration delay. The controller can also run in the proactive mode, where applications are allowed to directly reconfigure the topology based on their needs. We discuss this in detail in section 5.4.3.
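The reactive control cycle can be summarized in a few lines of Python. The helper functions below (poll_flow_counters, compute_topology, install_routes, reconfigure_circuits, flip_packet_version) are hypothetical placeholders for the steps in Figure 5.3(b), not an actual controller API; they are stubbed out so the sketch is self-contained.

```python
import time

RECONFIG_PERIOD_S = 60    # assumed period; in practice derived from traffic stability (Sec. 5.2.3)

# --- Hypothetical placeholders for the real switch and OCS interfaces ---------
def poll_flow_counters(tors):
    """Pull per-flow byte counters from each ToR and build a server traffic matrix."""
    return {}

def compute_topology(traffic_matrix):
    """Run the optimization of Section 5.4; returns (server-to-ToR map, new version tag)."""
    return {}, 1

def install_routes(switches, topology, version):
    """Pre-install version-tagged forwarding rules on ToR and aggregation switches."""

def reconfigure_circuits(circuit_switch, topology):
    """Issue the circuit reconfiguration, e.g., over the TL1 interface."""

def flip_packet_version(tors, version):
    """Flip the tagging rule at the ingress ToRs (the 0/1 update of Section 5.3.4)."""
# -------------------------------------------------------------------------------

def reactive_control_loop(tors, agg_switches, circuit_switch):
    """One RDC pod controller in reactive mode, following Figure 5.3(b)."""
    while True:
        matrix = poll_flow_counters(tors)                        # 1. traffic statistics
        topology, version = compute_topology(matrix)             # 2. topology optimization
        install_routes(tors + agg_switches, topology, version)   # 3. pre-install new routes
        reconfigure_circuits(circuit_switch, topology)           # 4a. change circuits
        flip_packet_version(tors, version)                       # 4b. flip version at ingress
        time.sleep(RECONFIG_PERIOD_S)
```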

5.3.4 Routing

In traditional DCNs, forwarding rules are aggregated based on IP prefixes. In RDC, however, such aggregation does not work, as servers have no fixed topological locations. Instead, RDC uses flat IP addressing and exact matching rules on packet switches. Topology changes are captured by updating the routing rules accordingly. These rules are for intra-pod routing only, as routing mechanisms across pods remain unchanged. In an RDC pod, each ToR has a flow table entry for every server IP in its rack, and a single default entry for other prefixes outside the rack. Each ToR splits traffic to other racks equally across its uplinks using ECMP [11]. All agg. switches have the same forwarding table: one entry per destination IP. The flow entries on ToRs and agg. switches both need to be updated when the topology changes. In general, for an RDC pod with m racks and n servers per rack, a topology change could result in n rule updates on ToRs and m × n updates on agg. switches, which is on the order of hundreds to a thousand. Updating this number of rules on an OpenFlow switch could take 100ms to over 1s [128,129]. Previous works have developed two-phase commit to reduce disruption due to updates [130,131], which first populates the switches with new routing rules and then flips the packet version at the ingress switches. However,

such an approach cannot avoid packet loss in the transient state*. This is because updating the version tagging rule at the ingress switch is still not atomic — there is a

* zUpdate [131] claims zero congestive loss in the update, but switches could still drop packets when their forwarding rules change.

short period when the old tagging rule has been removed but the new tagging rule is not yet installed, during which packets end up untagged and get dropped later in the switch pipeline. In fact, our measurement on a Quanta T3048-LY2R OpenFlow switch shows that this transient period could last for 0.5ms and cause over 500KB of data loss on 10Gbps links. RDC modifies this approach to further reduce transient disruption, using the following mechanism. It flips packets from tagging (untagging) mode to untagging (tagging) mode during the update, instead of flipping tag versions. Assume packets are in tagging mode before the change and there is a single tagging rule at the ingress switch for all packets. We first install the new set of rules with lower priority that

matches only on destination IPs†. Then, we remove the tagging rule. In this way, packets that become untagged during the transient state can immediately match against the new set of rules. Similarly, if packets are in untagging mode before the update, we first install the new set of rules that match on both the packet version and destination IPs, and then install a single tagging rule for all packets. Fig. 5.4 illustrates the update mechanism in RDC, which we call the 0/1 update. It uses an example of a forwarding state update on an OpenFlow ToR switch with 4 ports. Ports 1 and 2 are connected to servers; ports 3 and 4 are connected to agg. switches. Packet versions are encoded in the VLAN tag. Before the update, packets are first matched against a VLAN table that tags packets with a VLAN ID. Those tagged packets are then matched against the old rules in the forwarding table. During the transient state of rule updating, packets become untagged and can thus immediately match against the new rules without being dropped. The instructions of the forwarding table direct packets to the group table, where packets are either directly sent out via an output port or load-balanced over multiple output ports using the select group type. Similarly, an update from the untagging mode to the tagging mode also causes no packet loss. We apply this update approach on both ToR and agg. switches but only tag/untag

† The more general matching rules always have lower priority.

Figure 5.4 : An example of RDC’s 0/1 rule update on an OpenFlow-enabled ToR switch.

packets on ToR switches. Tag flipping actions are only performed when the new rules have been populated network-wide. Note that rule inconsistency could still occur due to the asynchronous tag flipping across multiple ToRs, which is inherently difficult to get around even if the controller sends parallel requests to the ToRs simultaneously. If it occurs, a packet could either be sent to a wrong destination ToR (or server) if the source ToR is not ready, or sent to the correct ToR but dropped if the source ToR is ready and the destination ToR is not. However, since tag flipping is performed simultaneously with circuit reconfiguration and it is very fast (0.5ms on our testbed), the extra packet loss due to this asynchrony is small.
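The ordering that makes the 0/1 update loss-free can be expressed as a short controller-side sketch. The rule-manipulation calls (install_rule, remove_rule) are hypothetical wrappers, not a real OpenFlow library; the priorities mirror Figure 5.4, where untagged, more general rules sit at lower priority.

```python
HIGH, LOW = 200, 100      # rule priorities: tagged rules above untagged (more general) ones

def install_rule(switch, match, priority, actions):
    """Hypothetical wrapper around an OpenFlow flow-mod; records the rule for the sketch."""
    switch.setdefault("rules", []).append((priority, match, actions))

def remove_rule(switch, match):
    switch["rules"] = [r for r in switch.get("rules", []) if r[1] != match]

def zero_one_update_to_untagged(tor, new_routes):
    """Move a ToR from tagging mode to untagging mode without a lossy window:
    1) pre-install the new, untagged, lower-priority rules;
    2) only then remove the single tagging rule at the ingress.
    Packets that become untagged during the transition immediately hit the new rules."""
    for dst_ip, out_port in new_routes.items():
        install_rule(tor, match={"dst_ip": dst_ip}, priority=LOW,
                     actions=[("output", out_port)])
    remove_rule(tor, match={"tag": "any"})          # step 2: stop tagging at ingress

tor = {"rules": [(HIGH, {"tag": "any"}, [("set_vlan", 1)])]}
zero_one_update_to_untagged(tor, {"10.0.0.1": 1, "10.0.0.2": 2})
print(tor["rules"])
```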

5.4 RDC Control Algorithms

RDC provides a general framework to support different control algorithms that can collect traffic statistics and reconfigure the topology based on specific optimization goals. We describe several key use cases below.

5.4.1 Traffic localization

Traffic localization targets workloads with mostly intra-pod traffic. It aims to localize the inter-rack traffic to improve throughput and reduce latency. The controller takes cross-server traffic matrices as input, and produces an optimal server-ToR assignment that minimizes the traffic volume traversing the agg. switches.

Traffic data collection. Previous works have shown that datacenter workloads demonstrate certain degrees of stability [109,119], and RDC similarly relies on this stability to estimate the traffic demand based on historical data. RDC collects traffic matrices entirely in-network using flow counters on packet switches, so that no host modifications are required for deployment. A flow counter associates the 5-tuple (13 bytes) of a flow with an 8-byte counter value and thus takes 21 bytes in total. Switch memory is traditionally the main constraint on maintaining per-flow counters, but this constraint has been loosening over the years as switch SRAM sizes have grown. The most recent switch ASICs have 50-100MB of SRAM and can store millions of flow states [132,133]. Since recent DCN measurement works show that the number of concurrent flows per server is on the order of hundreds to a thousand [99,109], each ToR in RDC would need tens of thousands of flow counters, assuming tens of servers per rack. The RDC controller pulls flow counters from ToRs periodically. Assuming an example RDC pod with 16 racks and 32 servers per rack, and a counter pulling period of 1s, the control channel bandwidth usage is roughly 86Mbps, which is low enough to be feasible.

Estimating true demands. The traffic matrix collected via flow counters is not an accurate indicator of the true server demands, as it is heavily biased by the existing network topology. Due to the inter-rack bandwidth bottleneck, servers can only transfer inter-rack traffic at a lower rate than intra-rack traffic, resulting in smaller values of the inter-rack flow counters. Using the flow counters directly would cause infrequent topology changes and miss opportunities to achieve higher throughput. To compensate for the bias, RDC first filters the small flows with volumes less than a

threshold. Then, it computes the max-min fair bandwidth allocation for the remaining flows. This compensation algorithm performs repeated iterations of increasing the flow rates at the sources and decreasing the rates of flows at destinations whose capacity is exceeded, until all the flow rates converge. The calibrated demand values reflect the rate a flow would grow to in a non-oversubscribed network, where it eventually becomes limited by either its source or its destination. Similar techniques have also been used in Hedera [40] and Helios [18].

Algorithm. The estimated flow demands are then aggregated into server demands for our topology optimization. We formulate this as a balanced graph partitioning problem [134]. The traffic demand is a graph G = (V, E), where V is the vertex set (i.e., servers) and E is the edge set. The weight of an edge e, w(e), is the traffic demand between its endpoints. To simplify the computation, we do not distinguish the directions of traffic between a server pair, i.e., graph G is undirected. Our goal is to partition the graph into subgraphs with equal numbers of vertices such that the weighted sum of cross-subgraph edges is minimized. We require partitions of the same size because each ToR must connect to the same fixed number of servers. The balanced graph partitioning problem is NP-hard, but fast heuristics are available [135].
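As an illustration of the partitioning step (a simple greedy stand-in, not the heuristic of [135]), the sketch below grows equal-sized server groups by co-locating the most heavily communicating pairs first, subject to the per-rack capacity.

```python
def balanced_partition(demand, num_racks, rack_size):
    """Greedy sketch of balanced graph partitioning.
    demand: dict {(u, v): traffic between servers u and v} (undirected, u < v).
    Returns {server: rack_id} with exactly rack_size servers per rack."""
    servers = sorted({s for pair in demand for s in pair})
    assert len(servers) == num_racks * rack_size
    rack_of, load = {}, [0] * num_racks

    # Consider the heaviest edges first and try to co-locate their endpoints.
    for (u, v), _ in sorted(demand.items(), key=lambda kv: -kv[1]):
        for s in (u, v):
            if s not in rack_of:
                peer = v if s == u else u
                target = rack_of.get(peer)                 # prefer the peer's rack
                if target is None or load[target] >= rack_size:
                    target = min(range(num_racks), key=lambda r: load[r])
                rack_of[s] = target
                load[target] += 1

    # Place any servers that never appeared in the demand graph.
    for s in servers:
        if s not in rack_of:
            target = min(range(num_racks), key=lambda r: load[r])
            rack_of[s], load[target] = target, load[target] + 1
    return rack_of

demand = {(0, 2): 90, (1, 3): 90, (0, 1): 1, (2, 3): 1}
print(balanced_partition(demand, num_racks=2, rack_size=2))  # e.g. {0: 0, 2: 0, 1: 1, 3: 1}
```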

5.4.2 Uplink load-balancing

Uplink load-balancing targets out-facing services with mostly inter-pod or inter-DC traffic. The goal is to mitigate congestion across ToR uplinks. The key insight is that server traffic is correlated—multiple servers in the same rack likely perform similar services for clients outside the pod simultaneously [79]. At the same time, uplink congestion is usually not correlated across racks due to the de-correlated demand of a large number of clients [107]. When RDC runs in the load-balancing mode, it monitors the uplink utilization of all ToRs and keeps track of the traffic each server has sent. Once congestion is detected on particular ToRs, it moves several hot servers under those ToRs to other, idle racks.

Traffic data collection. RDC maintains flow counters on ToRs to monitor the amount of traffic that each server has sent to outside clients. We assume each RDC pod has a unique ID, e.g., an IP address prefix shared by all servers in the pod. Counters are only installed and updated for inter-pod traffic. This can be implemented in the switch using two separate flow tables. The first flow table matches on the destination IP prefix, and has only one rule matching the switch’s own pod ID. If the first table misses, the second table then matches the 5-tuple and updates the associated counters. Otherwise, the packet skips the second table and goes to the forwarding table. By default, a miss on the second table will not result in packet loss, but a go-to action to the rest of the switch pipeline, which avoids traffic disruption when the counter rules change.

Estimating true demands. We use a similar technique to estimate the true demand of servers in bottlenecked racks, assuming they fair-share the uplink bandwidth. The estimates are obtained by first aggregating the flow counters for each server and then scaling up the per-server demand to reach an aggregate uplink throughput as if the rack were not oversubscribed. We only apply this technique to racks that have been bottlenecked in the collection period, to prevent idle racks from being mistakenly included. This technique keeps the relative order of server traffic load but creates larger quantitative differences among servers, guiding our algorithm to compute better topologies.

Algorithm. We model the uplink load-balancing problem as a number partitioning problem under the constraint that each partition must have the same cardinality. This

formulation partitions a set of numbers {a1, a2, ..., an} into k subsets S1, S2, ..., Sk such that each subset Sj has exactly n/k numbers (assuming n is divisible by k) and the maximum cost of a subset, defined as max({c(Sj)}) where c(Sj) = Σ_{ai ∈ Sj} ai, is minimized. In our case, n is the number of servers, k is the number of ToRs, and the ai's are the traffic loads from the servers. Again, we require a balanced partition of the servers because each ToR must host the same number of servers. The problem is

strongly NP-hard when k > 2 [136,137]. We use a simple and fast heuristic that tries to reduce the number of server moves. Our algorithm starts with the existing server topology. It iteratively exchanges the most-loaded server in the most congested rack with the most-loaded server in the least congested rack. The iteration stops when the maximum rack load is within a certain factor of the minimum rack load,

i.e., max({c(Sj)}) ≤ (1 + ε) · min({c(Sj)}), or when a predefined number of exchanges has been reached.
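A compact sketch of this swap heuristic follows. The server loads, the tolerance eps, and the move budget are illustrative parameters, and the improvement check that stops unprofitable swaps is an added assumption for the sketch, not part of the description above.

```python
def balance_uplinks(rack_servers, load, eps=0.1, max_swaps=10):
    """Greedy sketch of the uplink load-balancing heuristic described above.
    rack_servers: {rack_id: [server_id, ...]};  load: {server_id: inter-pod traffic sent}."""
    rack_load = lambda r: sum(load[s] for s in rack_servers[r])
    for _ in range(max_swaps):
        hot = max(rack_servers, key=rack_load)
        cold = min(rack_servers, key=rack_load)
        if rack_load(hot) <= (1 + eps) * rack_load(cold):
            break                                       # balanced within tolerance
        s_hot = max(rack_servers[hot], key=load.get)    # most-loaded server in hot rack
        s_cold = max(rack_servers[cold], key=load.get)  # most-loaded server in cold rack
        delta = load[s_hot] - load[s_cold]
        # Only swap if it actually lowers the hottest rack's load (avoids oscillation).
        if delta <= 0 or max(rack_load(hot) - delta, rack_load(cold) + delta) >= rack_load(hot):
            break
        rack_servers[hot].remove(s_hot); rack_servers[cold].remove(s_cold)
        rack_servers[hot].append(s_cold); rack_servers[cold].append(s_hot)
    return rack_servers

racks = {0: ["a", "b"], 1: ["c", "d"]}
traffic = {"a": 80, "b": 60, "c": 5, "d": 1}
print(balance_uplinks(racks, traffic))   # -> {0: ['b', 'c'], 1: ['d', 'a']}
```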

5.4.3 Application-driven optimizations

In addition to the reactive mode that maintains application transparency, RDC also allows applications to proactively request reconfigurations or adapt their traffic patterns to the changing circuits, not unlike how data center applications today tailor their task placements to the network topology. To do so, RDC provides a standard TL1 command interface that allows applications to configure the network topology via a single RPC call.

We envision that two general classes of applications will benefit from telegraphing their intent to the RDC controller. First, applications with physical placement constraints. Many datacenter applications have to sacrifice the benefits of communication locality as they need to spread their tasks to survive rack failures [112] or to reduce synchronized power consumption spikes [138]. With RDC, such sacrifices become unnecessary: tasks of an application can be logically brought back together at runtime based on their traffic patterns without violating their physical placement constraints. Second, applications with changing traffic patterns. For example, many distributed matrix multiplication (DMM) algorithms proceed in iterations and require a different group of servers to communicate in each iteration [116,117]. In section 5.6.6 we implement an OpenMPI-based DMM algorithm on our testbed and allow it to directly request the RDC controller to group different subsets of servers at runtime.

5.4.4 Advanced control algorithms

RDC also allows the co-existence of multiple control algorithms, e.g., for a mix of workloads or applications. For instance, one might optimize for uplink load balancing and inter-rack traffic simultaneously, e.g., minimizing αT + βR, where T is the total inter-rack traffic volume, R is the rack imbalance ratio, and α and β are the respective weights. We note that prior work has used similar techniques to optimize for multiple goals when placing tasks [112]. The only requirement of RDC is that the control algorithms should not produce conflicting reconfigurations. This can be done by a simple conflict detection and resolution algorithm at the controller.

5.5 Discussions

We discuss several practical issues in RDC.

5.5.1 Recovering from ToR failures

Designing data center networks that are resilient to failures is an important consideration. For example, high degrees of path diversity [7,8] and rerouting techniques [9,39,40] are common in today’s data centers. However, most DCNs do not have such redundancy at the edge layer—each server is connected to a ToR switch via a single edge link—making a ToR switch failure unrecoverable. An obvious solution is to use redundant links between servers and ToR switches, i.e., multihoming [79,139]. However, such a 1:1 backup solution comes with substantially higher cost—even dual-homing requires doubling both the server NIC ports and the number of ToR switches. A recent measurement study shows that individual devices are highly reliable, with 99.99% availability, and that failures usually last for just a few minutes [140]. This makes backup-sharing techniques a more cost-effective solution [25,141]. We note that our rackless architecture can readily support backup sharing. Since servers are not tied to any particular ToR, ToR failures can be recovered by “migrating” servers from

failed ToRs to other healthy ToRs. The requirement is that the healthy ToRs must have enough “free” ports to host the migrated servers, which can be achieved by either reserving ports on existing ToRs or adding extra ToRs to the network. In general, for an RDC pod with m racks and n servers per rack, we only need n/m free ports on each ToR to recover from any single ToR failure. Those free ports must be connected to the circuit switch as well, requiring n more ports on the circuit switch. However, compared with the multihoming solution, RDC requires substantially fewer additional ToR ports (n instead of m × n) and does not require multiple NIC ports on each server.

5.5.2 Wiring and incremental deployment

Figure 5.5 : Packaging design of an RDC pod.

One major concern of using circuit switches at the edge layer is that it doubles the number of cables or optical fibers needed to connect all the servers. To make our discussion more concrete, we assume the underlying circuit switch in RDC is an optical circuit switch (OCS). We consider some packaging techniques to reduce the wiring complexity. Fig. 5.5 shows the packaging design of an RDC pod, which is somewhat

different from that of a traditional pod. RDC has a central switch rack dedicated to hosting ToRs, agg. switches, and the OCS. Server racks are connected to the OCS via fiber bundles to reduce wiring complexity. On the central rack, ToRs are connected to the OCS and agg. switches using short fibers and cables, respectively. Agg. switches provide similar connectivity to core switches outside the pod, just like in traditional data centers. To ensure that the centralized switch placement has reliability similar to that of traditional switch placement, backup power supplies are employed. Similar to existing modular data centers, RDC supports incremental expansion by adding RDC pods.

5.5.3 Cost analysis

In a traditional pod, servers are connected to ToRs via short, direct-attach cables, and ToRs are connected to agg. switches via longer optical fibers. Assuming similar fiber bundling mechanisms, the wiring cost of RDC would be similar, with the extra cost mostly coming from the OCS and per-server optical transceivers. Optical transceivers used to be the primary cost of DCNs, but their price has declined sharply over the years due to massive production: a QSFP+ 40Gbps transceiver with 150m transmission distance costs only $39 today [142]. Based on feedback from a commercial optical switch vendor, we have learned that today’s high port-count MEMS OCS costs $200–$500 per port, but this high cost is mainly due to the low-volume, high-margin market for these switches. If production volume goes up to 50k units per year, the per-port cost can be dramatically reduced to below $50.

As a concrete example, consider an RDC pod with 16 racks and 32 servers per rack, using the packaging strategy in section 5.5.2. Each server has a 40Gbps transceiver connecting to the OCS via optical fibers. We compare the overall networking cost of RDC with 1) a static network with 4:1 oversubscription (4:1-o.s.), 2) a non-blocking network (NBLK), and 3) a rack-based hybrid network (Hybrid) that augments 4:1-o.s. with a separate circuit switching network. Hybrid has configurable per-ToR-pair circuit bandwidth. We therefore consider two options: 1) circuit bandwidth is the same as

Table 5.1 : Cost estimates of network components and their quantities needed by a) an RDC pod with 4:1 oversubscription b) a 4:1 oversubscribed packet switching network (4:1-o.s.), c) a rack-based hybrid circuit/packet switching network with 4:1 oversubscribed circuit bandwidth (Hybrid-1) and d) with non-blocking circuit bandwidth (Hybrid-2), and finally e) a non-blocking packet switching network (NBLK). “p.c.” stands for private communication.

    Component          Price  Note        Source    RDC    4:1-o.s.  Hybrid-1  Hybrid-2  NBLK
    Ethernet port      $200   40G         [143]     768    768       896       1280      1536
    40G transceiver    $39    QSFP+       [142]     1024   256       384       768       1024
    Op. fiber, 8m      $4.4   inter-rack  [144]     512    128       256       640       512
    Op. fiber, 3m      $3.4   intra-rack  [144]     512    0         0         0         0
    40G DAC            $40    intra-rack  [145]     128    512       512       512       512
    OCS port           $50    at scale    p.c.      1024   0         128       512       0
    Total price                                     $254k  $185k     $222k     $335k     $370k

(The rightmost five columns give the quantity needed by each architecture.)

that of the packet bandwidth between a ToR pair (320 Gbps), which we call Hybrid-1, and 2) circuit bandwidth is non-blocking (1.28 Tbps), which we call Hybrid-2. Note that the Hybrid setting captures the essence of a wide range of rack-based hybrid DCN architectures proposed in the last decade, such as c-Through [104], Helios [18], Flyways [30], and many others [20,28,58,108]. The estimated component prices and the quantities needed by each architecture are listed in Table 5.1. The overall networking cost of RDC lies between Hybrid-1 and Hybrid-2 but is significantly lower than NBLK (by 31.4%). As we will show in section 5.6.5, RDC achieves transmission performance close to that of NBLK and has the highest performance per dollar among these networks. In particular, RDC outperforms Hybrid-2, even though Hybrid-2 adds enormous extra bandwidth to the network core. When taking the cost of servers into account, e.g., assuming each server costs $4000, the extra cost of RDC compared to 4:1-o.s. is only 3.1%. It would be difficult to extrapolate the cost curve of RDC beyond thousands of servers, as the optical switching technology would need to be significantly different from the

MEMS-based OCSes; we leave the cost estimate of larger-scale RDC as future work.
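To make the arithmetic behind Table 5.1 explicit, the short sketch below recomputes the per-architecture totals from the unit prices and quantities in the table; the data structures and helper names are ours, and only three of the five architectures are shown for brevity.

    # Cost arithmetic behind Table 5.1 (prices and quantities copied from the table).
    PRICES = {  # per-unit price in USD
        "eth_port": 200, "transceiver": 39, "fiber_8m": 4.4,
        "fiber_3m": 3.4, "dac": 40, "ocs_port": 50,
    }
    QUANTITIES = {  # component counts per architecture
        "RDC":      {"eth_port": 768,  "transceiver": 1024, "fiber_8m": 512,
                     "fiber_3m": 512, "dac": 128, "ocs_port": 1024},
        "4:1-o.s.": {"eth_port": 768,  "transceiver": 256,  "fiber_8m": 128,
                     "fiber_3m": 0,   "dac": 512, "ocs_port": 0},
        "NBLK":     {"eth_port": 1536, "transceiver": 1024, "fiber_8m": 512,
                     "fiber_3m": 0,   "dac": 512, "ocs_port": 0},
    }

    def network_cost(arch: str) -> float:
        return sum(PRICES[c] * q for c, q in QUANTITIES[arch].items())

    for arch in QUANTITIES:
        print(f"{arch}: ${network_cost(arch) / 1000:.0f}k")
    # Prints roughly $254k, $185k, and $370k, matching the table.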

5.5.4 Handling circuit switch failures

Circuit switches are physical layer devices with a bare-minimum software stack. They are generally free from firmware bugs, software upgrades, and misconfigurations—the most common causes of switch failures in today’s DCs [140,146,147]—and are thus highly reliable [148]. For example, MEMS-based optical circuit switches have a reported mean time between failures of over 30 years [149]. To further ensure reliability, we can use 1:1 backup to handle hardware or power failures. In an OCS-based RDC network, each server is connected to a primary OCS and a backup OCS via an inexpensive 1:2 optical splitter [61]. Traffic leaving the two OCSes is multiplexed by a cheap 2:1 optical MUX [150] connected to the ToRs. A similar setup is used in the ToR-to-server direction to interpose the primary and backup OCS in between. During normal operation, only the primary OCS carries optical signals, and the backup’s circuits are disabled. Upon a primary failure, the backup OCS changes its circuit configuration based on the configuration of the primary. This doubles the OCS cost, but does not require additional transceivers on either end of the fibers. For our RDC pod example, this backup strategy brings the extra cost to about 6%, including the extra splitters and MUXes.
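As a minimal sketch of this failover logic (our own illustration, not the thesis implementation), a controller could cache the circuit configuration it pushes to the primary OCS and replay it on the backup when the primary is declared failed; the apply/enable methods are hypothetical switch-driver calls.

    class OCSFailover:
        def __init__(self, primary, backup):
            self.primary, self.backup = primary, backup
            self.cached_circuits = {}  # last configuration pushed to the primary

        def configure(self, circuits):
            # Normal operation: only the primary carries traffic.
            self.primary.apply(circuits)
            self.cached_circuits = dict(circuits)

        def handle_primary_failure(self):
            # Replay the cached configuration on the backup and activate it;
            # the backup's circuits stay disabled until this point.
            self.backup.apply(self.cached_circuits)
            self.backup.enable()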

5.5.5 Scaling

RDC’s pod size is constrained by the port count of the underlying circuit switch. Today’s largest MEMS switch provides 640 ports [123], suggesting a maximum pod size of 320 servers. Research prototypes have shown that MEMS switches can scale to a few thousand ports [26], although such switches are not commercially available at the moment. While a few hundred servers is already a good fit for current DC pods, future workloads are likely to require even larger server locality groups. A simple solution is to cascade multiple circuit switches using Clos or butterfly

topologies similar to those used in scaling up packet switching networks, at the price of extra hardware and the requirement for fast, coordinated circuit switching. However, to scale up further, the internal switching technology of an OCS may need to change significantly. Recent research prototypes have explored several alternative technologies, e.g., DMD-based free-space optics [19] and hybrid MZI-SOA based switching [151,152]. Their results seem promising: circuit switches with tens of thousands of ports and microsecond-scale switching delays appear feasible.

5.5.6 Alternatives in the design space

One might wonder if we could simply pack more servers under the same ToR to enable larger locality groups and avoid the need for reconfiguration in RDC. However, this is not as simple as it might seem. At the time of writing, the fastest switching ASIC on the market (e.g., Broadcom’s Tomahawk 3 chipset [153]) provides 12.8 Tbps capacity. A ToR switch equipped with such an ASIC can theoretically provide full bandwidth for a single “super” rack with 256 servers, each with a 40 Gbps NIC, and thus achieve a much larger locality group. However, such a network would lock in the bandwidth ratio between the servers and the ToR switch. Bandwidth upgrades on servers, e.g., from 40 Gbps to 100 Gbps or to 400 Gbps [154], would require replacing the switch with one of even higher capacity, which may not be available for years to come or may come at prohibitively high cost. In short, such a network would have the same scalability and cost issues that catalyzed the preference for “scale-out” over “scale-up” designs in the last decade [7,8,10]. Moreover, this network also imposes a higher risk from ToR switch failures, as one such failure could disconnect hundreds of servers. In comparison, RDC’s architecture is based on the Clos topology and is more fault-resilient and sustainable—allowing bandwidth upgrades by adding more commodity switches rather than scaling up the capacity of a single switch. Because the circuit switch operates at the physical layer and is bandwidth transparent, RDC’s network architecture does not need to change when bandwidth upgrades are performed

on either servers or ToRs.

5.6 Implementation and Evaluation

We conduct comprehensive evaluations of RDC using testbed experiments and packet-level simulations. Our experiments focus on several dimensions: a) microbenchmarks on RDC, including transient disruptions, throughput improvements, and control loop latency, b) packet-level simulations of the latency and throughput improvements at scale, and c) real-world applications of RDC, using HDFS [4], Memcached [155], and MPI-based distributed matrix multiplication (DMM) [116] as use cases.

Figure 5.6 : An RDC prototype with 4 racks and 16 servers. (a) Servers. (b) OCS. (c) OpenFlow packet switches.

5.6.1 Platforms

Testbed. Our RDC prototype consists of 16 servers and 4 ToR switches in 4 logical racks, as well as one agg. switch and one circuit switch; Fig. 5.6 illustrates our hardware testbed. The ToR switches are emulated on two 48-port Quanta T3048-LY2R switches. Each ToR switch has four 10 Gbps downlinks connected to the servers, and one 10 Gbps uplink to the agg. switch, forming an oversubscription ratio of 4:1. We can tune this ratio to emulate a non-blocking network by increasing the number of uplinks to 4. The agg. switch is a separate OpenFlow switch. The OCS is a 192-port Glimmerglass 3D-MEMS switch with a switching delay of several milliseconds. Each

server has six 3.5 GHz dual-hyperthreaded CPU cores and 128 GB RAM, running TCP CUBIC on Linux 3.16.5. Packet-level simulator. In order to simulate a wider variety of experimental settings, we have developed a packet-level simulator based on htsim, which was used to evaluate MPTCP [113] and NDP [156]. This simulator has a full implementation of TCP flow control and congestion control algorithms and supports ECMP. An RDC reconfiguration can be simulated by a topology change followed by a fixed reconfiguration delay of 8.5 ms (obtained from our testbed measurements). Packets in flight during reconfiguration will be dropped if they traverse the disrupted links, and unsent packets will be buffered at the servers. We simulate an RDC pod with 512 servers, 32 servers per rack, and 16 racks overall. The 16 ToR switches are connected to a single agg. switch with tunable oversubscription ratios.
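As a concrete illustration of how a reconfiguration can be modeled in such a simulator (a simplified rendition, not the actual htsim extension), the sketch below drops packets that are in flight on a disrupted link during the 8.5 ms outage and holds not-yet-sent packets at the server.

    RECONFIG_DELAY = 0.0085  # seconds, from our testbed measurements

    class ReconfigurableLink:
        def __init__(self):
            self.disrupted_until = 0.0

        def start_reconfig(self, now):
            # The link is unusable for the duration of the circuit switch delay.
            self.disrupted_until = now + RECONFIG_DELAY

        def deliver(self, pkt, now, server_buffer, in_flight):
            if now < self.disrupted_until:
                if in_flight:
                    return None              # packet caught on the disrupted link is lost
                server_buffer.append(pkt)    # unsent packet waits at the server
                return None
            return pkt                       # normal delivery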

5.6.2 TCP transient state

We first evaluate how RDC reconfiguration affects TCP connections in the transient state. To measure this, we configured the TCP sender to transmit at line rate to the receiver, and captured the TCP sequence numbers of transmitted packets via Wireshark. The connection was disrupted by a reconfiguration in the middle of the transfer. We note one caveat due to an artifact of our NIC firmware: our older NICs can only detect link-down events after a 300-500 ms delay, which causes packet drops that could easily be avoided by a NIC or firmware upgrade (e.g., by reducing the polling period used to detect link status changes). Since our NIC firmware is not open source, we emulated this effect by using a packet switch port as the server NIC, which can poll link status every 100µs; packet transmissions will be temporarily paused

upon link-down events.‡ Fig. 5.7(a) shows the transient disruption to TCP. As we can

‡ Alternatively, we can use the Ethernet flow control protocols defined in IEEE 802.3x and 802.1Qbb to generate “pause frames” from ToR switches to servers before reconfiguration, and use “unpause frames” to resume transmission. This is supported by commodity NICs and Ethernet switches [157–159] and has been demonstrated to be practical by previous work [160]. An 8.5 ms

see, the connection experienced an 8.5 ms disruption and lost 120 KB of data, which is far less than 8.5 ms worth of data on a 10 Gbps link (10.6 MB). Since the data loss is substantially smaller than the TCP congestion window size (2 MB at the time of reconfiguration), it did not lead to any retransmission timeouts; the lost data was rapidly recovered by fast retransmission.
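For reference, the 10.6 MB figure is simply the link rate multiplied by the outage duration: 10 Gbps × 8.5 ms = 8.5 × 10^7 bits ≈ 10.6 MB.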

Figure 5.7 : RDC reconfiguration improves throughput significantly (b) with negligible TCP disruption (a). (a) Circuit reconfiguration. (b) Packet version update.

5.6.3 Throughput benchmark

Our second benchmark tests the throughput improvement of traffic localization using a simple traffic pattern. We created 4 inter-rack iPerf flows that saturate the inter-rack bandwidth; each server in the source rack sends a single TCP flow to each server in the destination rack. We measured the individual flow throughput at 100 ms intervals and aggregated it across the four flows. As we can see in Fig. 5.7(b), each flow initially received 1/4 of the full NIC bandwidth due to congestion at the inter-rack links. We then issued an RDC reconfiguration at t=18s, which regrouped servers based on their communication patterns. This caused a transient throughput drop of 16%, but the throughput quickly ramped up to full line rate after the reconfiguration

reconfiguration delay would require buffering 10.6 MB (42.4 MB) of data on servers with 10 Gbps (40 Gbps) links.

finished. These results show that line-rate throughput improvement can be expected in RDC — topology reconfiguration does not introduce any inefficiency in the steady state.

5.6.4 Control loop latency

Next, we evaluate the latency of the RDC control loop, which includes five components: a) collecting flow counters, b) estimating traffic demands, c) computing new topologies, d) modifying forwarding rules, and e) updating packet versions and reconfiguring the circuits. This latency determines how fast RDC can respond to changing traffic patterns. First, we measured a set of baseline latency results. Using the Ryu OpenFlow controller [161], reading a single flow counter takes 0.4 ms (dominated by the RTT and RPC processing). If the controller reads multiple counters concurrently, this latency is amortized: reading 32 counters from different flows in parallel took 1.3 ms.
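The sketch below illustrates this amortization (our own simplification, not the thesis controller code): the per-flow counter reads are issued concurrently, with read_counter standing in for whatever RPC the controller uses to query one flow counter.

    from concurrent.futures import ThreadPoolExecutor

    def collect_counters(read_counter, flow_ids, max_workers=32):
        # Query all flow counters in parallel; returns {flow_id: counter_value}.
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            counts = pool.map(read_counter, flow_ids)
        return dict(zip(flow_ids, counts))

    # With ~0.4 ms per blocking read, 32 sequential reads would take ~12.8 ms;
    # issuing them concurrently took ~1.3 ms in the measurement above.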

Table 5.2 : Break-down of control loop latency (ms) for traffic localization (TL) and uplink load-balancing (ULB).

    #Racks                 4             8             16             32
                           TL     ULB    TL     ULB    TL      ULB    TL       ULB
    Counter collection     10.6   2.3    21.3   2.6    42.6    3.4    85.1     4.5
    Demand estimation      10.8   0.7    24.9   1.1    80.6    1.3    310.6    1.7
    Topo. computation      14.2   0.1    45.3   0.1    149.3   0.3    507.7    0.6
    Rule installation      32.5   30.6   45.6   30.8   75.6    41.4   147.6    70.6
    Circuit reconfig.      49.4   35.2   71.9   50.3   91.2    60.5   147.4    66.1
    Total                  117.5  68.9   209.0  84.9   439.3   106.9  1198.4   143.5

Table 5.2 breaks down the control loop latency for the traffic localization (TL) and uplink load-balancing (ULB) use cases discussed before. To obtain these results, we ran a set of experiments using different numbers of racks, with 32 servers per rack, using the traffic patterns from the Facebook packet traces. The ToR switches are connected to a single agg. switch. Since our testbed only has four ToR switches, we

emulated more ToR switches using servers, and ensured that each server has the same latency for collecting counters and for installing routing rules as a physical ToR switch. The number of forwarding rules to be installed is bounded by 32 for the ToR switches and 32 × #racks for the agg. switch; the actual number varies depending on the traffic patterns and may differ across switches. The overall rule installation delay is determined by the slowest switch, i.e., the one with the largest number of changes. As we can see from the table, in the worst-case scenario this latency is 147.6 ms. We then measured the circuit reconfiguration latency, i.e., the time between the controller issuing a request and the time it receives a response from the OCS. This includes the message serialization/deserialization delay, the controller software stack delay, the OCS software delay, the OCS hardware delay, and the RTT. The OCS software delay increases linearly with the number of circuit changes, but the OCS hardware delay stays constant (8.5 ms). Since our OCS supports a maximum of 192/2 = 96 circuits, we obtained the latency results for up to 96 circuit changes via measurements and those for more than 96 circuits via extrapolation. We can see that ULB has a notably smaller delay because it involves fewer reconfigurations—our algorithm in section 5.4.2 explicitly aims at reducing server relocations. The topology computation time for TL dominates as the pod size increases. This is partly due to our simple, single-threaded implementation of the balanced graph partitioning algorithm; using fast parallel heuristics can potentially reduce this time significantly [135]. Overall, RDC’s control loop latency is about 1.2 s for TL and 150 ms for ULB, which are on similar timescales to state-of-the-art traffic engineering techniques [119].

5.6.5 Transmission performance at scale

Next, we evaluate an RDC pod at data center scale using the packet-level simulator. Our baselines are a) a static non-blocking network (NBLK), b) a static network with 4:1 oversubscription (4:1-o.s.), c) a hybrid network with 4:1/1:1 oversubscribed reconfigurable ToR-pair bandwidth (Hybrid-1/Hybrid-2), d) RDC with future

traffic-demand information (Ideal RDC), and e) a 4:1-o.s. network that applies RDC’s reconfiguration algorithm only once over the entire traffic trace (One-time RDC). Hybrid-1 and Hybrid-2 use the commonly used Edmonds’ algorithm [162] for circuit allocation. They have the same circuit switching delay as RDC and buffer packets at the ToRs during circuit downtime. We used the Cache and Web traffic from the Facebook traces. Since the original traces do not contain flow-level information, we generated flow-level traffic based on the sampled packet trace from [109]. Specifically, we inferred the source/destination servers of the flows from the trace, and simulated flow sizes and arrival times based on Figures 6 and 14 in the same Facebook paper. Cache traffic has an average flow size of 680 KB, with 82% being inter-rack. Web traffic has an average flow size of 63 KB, with 75% being inter-rack. Both traffic traces last for 60 s in the simulation, and RDC’s reconfiguration period is 1 s. We run the simulator under different network loads. Fig. 5.8(a) shows the CDF of flow completion times (FCT) for RDC and the baselines using the Cache traffic under a network load of 50%. We observe that the median FCT is more than an order of magnitude lower in RDC than in 4:1-o.s.. In fact, even applying the RDC traffic localization algorithm just once brings the median FCT down to 0.12×. This huge performance gap is due to TCP dynamics—severe inter-rack congestion causes consecutive packet losses, and TCP becomes very conservative in increasing its flow rate. The hybrid networks add extra inter-rack bandwidth, but this bandwidth is provisioned per ToR pair. As the traffic traces that motivate our RDC design are generally uniform at the ToR level (see the heatmap in Fig. 5.1), the hybrid networks fall short in relieving inter-rack congestion. Their average FCTs are 1.4× to 1.75× higher than RDC’s. More importantly, we observe that RDC with future knowledge of traffic demands performs consistently close to the non-blocking network, which again demonstrates the power of a rackless network. Because the Cache traffic is largely stable, similar to the Database traffic in

the original traces, RDC performs only 1.15× worse in average FCT when such future knowledge is absent. Fig. 5.8(b) shows the distribution of flow path lengths for RDC, Hybrid, and NBLK (Hybrid-1 and Hybrid-2 have the same distribution; NBLK and 4:1-o.s. also have the same distribution). An intra-rack flow has path length 2 in all networks, whereas an inter-rack flow has path length 4 in RDC, NBLK, and 4:1-o.s., but can vary in Hybrid—3 on the circuit path and 4 on the normal packet-switched path. Overall, RDC localizes more than 70% of the inter-rack traffic and achieves an average path length 0.75× that of the hybrid network and 0.65× that of the non-blocking network.

Figure 5.8 : Performance comparison using Cache traffic. Network load is 50%. (a) CDF of flow completion times. (b) Distribution of path lengths. More results under different settings are in the appendix.

Fig. 5.9 further shows the CDFs of flow completion times for RDC and its baseline networks under different traffic traces and network loads. Regardless of the trace, RDC performs significantly better than the 4:1 oversubscribed network. On average, it improves the FCT by 31.6× for Cache traffic and by 13.8× for Web traffic under a network load of 70%. Compared to Cache traffic, Web traffic sees less improvement, mainly due to its smaller flow sizes. Hybrid-1 and Hybrid-2 perform better than the 4:1 oversubscribed network, as they add extra bandwidth to the network core.

Again, because this extra bandwidth is not uniformly available to all ToR pairs, the hybrid networks are limited in relieving network core congestion when inter-rack traffic is uniform. For both traces, higher network loads result in larger performance improvements over the baselines. Nevertheless, we observe that RDC with future traffic demand knowledge performs consistently close to the non-blocking network. Our examples in section 5.2.4 explained RDC’s bandwidth advantages qualitatively. The results here show this advantage quantitatively: RDC with future demand knowledge has an average FCT always within 1.2× that of the non-blocking network under different simulation settings.

Figure 5.9 : Performance comparison using packet simulations under different traffic workloads and network loads. (a) Cache traffic, load = 70%. (b) Web traffic, load = 50%. (c) Web traffic, load = 70%.

5.6.6 Real-world applications

Next, we evaluate how RDC can improve the performance of real-world applications for each of its use cases.

HDFS. We set up an HDFS cluster with 8 datanodes across 2 racks and 1 namenode on a separate server, with a replication factor of 3 and a block size of 256 MB. Four clients in a different cluster initiated concurrent read/write requests to the HDFS cluster on four 20 GB files. A write generates two inter-rack flows and one intra-rack flow, because the three replicas reside on different racks for fault tolerance; a read generates one inter-rack flow (Fig. 5.10(a)). All data is cached in RAM disks, so the hard drive is not a bottleneck. Fig. 5.10(b) shows the performance. On average, the read latency was reduced to 0.37× by RDC, which equals the speedup provided by a non-blocking network. The write latency was reduced to 0.77×, as write operations involve a mix of inter- and intra-rack flows; RDC can therefore more effectively localize the inter-rack flows of read operations.

Memcached. We then configured Memcached [155] servers on two racks, and issued read/write requests from two other racks. This setup emulates the scenario where clients in one pod read/write the cache servers in another pod. Our workload has a) 200k key-value pairs uniformly distributed across 8 servers, b) a 99%/1% read/write ratio, and c) 512-byte keys and 10 KB values. We adopted a Zipfian query key distribution with skewness 0.99, similar to previous works [163,164], which led to a load imbalance of ∼1.8 on the server racks. As shown in Fig. 5.10(c)-(d), RDC improves the query throughput by 1.78× on average, and reduces the median latency to 0.48×; these improvements are close to what a non-blocking network could achieve. RDC also cuts the tail latency significantly, for which network congestion is a major cause [165,166].

OpenMPI DMM. Next, we evaluate RDC using distributed matrix multiplication (DMM), which is a key primitive in many machine learning and HPC applications. We set up a 16-node OpenMPI cluster and implemented a commonly used DMM algorithm [116]. This algorithm organizes n processes as a √n × √n grid and divides

the input matrices A and B evenly across the grid (shown in Fig. 5.10(e)). In each iteration, it performs a “broadcast-shift-multiply” cycle in which a process a) broadcasts submatrices of A row-wise, b) shifts submatrices of B column-wise, and c) multiplies A with the corresponding blocks of B. These iterations cause a changing traffic pattern, for which no static process placement would be consistently optimal. We instrumented this application to invoke RDC reconfigurations upon traffic pattern changes. Fig. 5.10(f) shows the performance of RDC and the 4:1 oversubscribed network under various input matrix sizes. The 4:1 oversubscribed network uses a row-wise rack organization, which is optimal for the broadcast but suboptimal for the shift. RDC, on the other hand, achieves a speedup of 3.1×-3.6× on shift operations, which translates to 2.1×-2.4× speedups in communication time and 1.1×-2.0× speedups in overall application performance (not shown). We observe less improvement at the application level, as the computation time quickly becomes dominant when matrix sizes scale up, accounting for ∼80% of the application time for 12k × 12k matrices. This is because our setup has fewer compute resources than MPI high-performance computing clusters, which use accelerators (e.g., GPUs) to reduce computation time by more than an order of magnitude [167,168]. In those cases, the network improvements will be even more prominent.
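The sketch below is a schematic, single-process rendition of this broadcast-shift-multiply pattern (our own illustration, not the thesis implementation): blocks of A are broadcast row-wise, blocks of B arrive via the column-wise shift, and each grid position accumulates their product.

    import numpy as np

    def fox_dmm(A, B, p):
        # Multiply A and B, each split into a p x p grid of equal square blocks.
        n = A.shape[0]
        assert n % p == 0
        b = n // p
        blk = lambda M, i, j: M[i*b:(i+1)*b, j*b:(j+1)*b]
        C = np.zeros_like(A)
        for step in range(p):
            for i in range(p):
                k = (i + step) % p
                a_bcast = blk(A, i, k)   # block broadcast along grid row i
                for j in range(p):
                    # In the distributed version, the column-wise shift has
                    # delivered B's block (k, j) to grid position (i, j) by now.
                    C[i*b:(i+1)*b, j*b:(j+1)*b] += a_bcast @ blk(B, k, j)
        return C

    # Sanity check against the direct product.
    A = np.random.rand(8, 8); B = np.random.rand(8, 8)
    assert np.allclose(fox_dmm(A, B, p=2), A @ B)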

Figure 5.10 : Application performance improvements of RDC compared with the 4:1 oversubscribed network (4:1-o.s.) and the non-blocking network (NBLK). (a) The HDFS read/write traffic pattern. (b) The HDFS transfer time. (c)-(d) Memcached query throughput and latency. (e) The DMM traffic pattern. (f) Average shift time and communication time.

5.7 Related Work

Various DCN proposals have recognized the need to serve dynamic traffic workloads and provision bandwidth on demand with reconfigurable topologies. One line of work adds extra bandwidth to the network by creating ad hoc links at runtime [18,30,104,106,108]. Another line constructs an all-connected, flexible network core with high capacity [16,20,73,103,105,169]. Most of these works focus on providing a reconfigurable topology at the rack level under the assumption of skewed inter-rack traffic. RDC, however, alleviates the reliance on such an assumption and achieves finer-grained topology control by pushing reconfigurability down to the edge. We have shown in section 5.6.5 that this fine-grained topology control can bring higher transmission performance than previous proposals, even when they add extra bandwidth to the network. Flat-tree [170] is a recent architecture proposal that enables DC-wide reconfigurability by dynamically changing the topology between Clos [7] and random graph [68]. However, it can only provide a limited number of topology modes, which is suitable for a coarse-grained classification of traffic patterns, e.g., rack-local, pod-local, and DC-local. Compared to RDC, Flat-tree requires substantial changes to the whole network and incurs a much higher burden for wiring and deployment. Beyond architectural solutions, there has been a recent flurry of research projects that try to improve network performance at upper layers of the stack. Optimized transport protocols (e.g., DCTCP [99], MPTCP [113]) and traffic engineering techniques (e.g., Hedera [40], MicroTE [119]) can improve flow performance under a wide range of application scenarios. Several other works further improve flow performance by optimizing task placement. For example, Sinbad [171] selectively chooses data transfer destinations to avoid network congestion. ShuffleWatcher [172] attempts to localize the shuffle phase of MapReduce jobs to one or a few racks. Corral [115] jointly places input data and compute to reduce inter-rack traffic for recurring jobs. Different from these works, RDC focuses on a novel network architecture design.

5.8 Summary

RDC is a rackless network that removes the static rack boundaries in traditional data centers. It uses circuit switching technologies to achieve topological reconfigurability at the edge, enabling fine-grained and on-demand topology control at runtime. RDC also preserves the cost, deployment, maintenance and expansion benefits of Clos networks. By co-designing the network architecture and the control systems, we identified several practical use cases of RDC that benefit a wide range of data center workloads. Our evaluations show that RDC leads to practical benefits in real-world applications.

Chapter 6

Conclusions and Future Work

The main goal of this thesis is to investigate whether a hybrid network can be a good candidate for addressing the cost, performance, and reliability issues facing today’s data center networks. To answer this question, we have explored the design space of hybrid optical/electrical and packet-switching/circuit-switching networks. Overall, the results from the three preceding chapters suggest a positive answer. Specifically, the highlights of our results are:

• Integrating a separate multicast network into the existing network can tremendously relieve the bandwidth pressure on the existing network and improve overall network performance. HyperOptics is a first step towards this goal, purely leveraging low-cost optical splitting technology. Eliminating the use of optical circuit switches enables a dramatic cost reduction and avoids the slow circuit switching speed. HyperOptics shows great potential for improving multicast transmission performance with high power and cost efficiency.

• Sharing a small pool of backup switches across an entire network is an attractive, cost-efficient way to improve overall network reliability. However, the topology constraints of today’s networks impose prohibitively high wiring or device complexity on this solution. ShareBackup shows that coordinated use of a large number of modest port-count circuit switches can achieve this goal with satisfactory cost and complexity. We find this design a good match for the rare, uncorrelated, and spatially dispersed failures in data centers. Besides failure recovery, special setups of the circuit switches can automate and speed up failure diagnosis. Leveraging ShareBackup’s properties of wiring and routing,

backup switches can work as hot standbys without online changes of forwarding rules or primary-backup coordination. In all our experiments, the results for ShareBackup show little difference from the no-failure case. We conclude that ShareBackup effectively masks failures from application performance.

• A dynamic network that can adjust its topology for different traffic characteristics or network maintenance tasks is promising for improving both performance and reliability. Traditional hybrid networks provide topological dynamicity above the server racks. RDC is a first step towards realizing the concept of a “rackless network” without sacrificing the cost, deployment, maintenance, and expansion benefits of existing networks. With the co-design of the network architecture and the corresponding control systems, we proposed several practical use cases of RDC that cover a wide range of data center workloads. A key design choice in RDC is to push topological reconfigurability down to the edge through the use of circuit switching technologies, thereby enabling on-demand, fine-grained topology control at runtime. Extensive testbed evaluations and packet-level simulations showed that RDC meets its design goals.

Although this thesis has made progress towards understanding how to improve network performance, cost, and reliability through hybrid optical/electrical architectures, many interesting problems remain unresolved:

• Joint topology optimization of failure recovery and performance. This thesis presents the potential of runtime network topology reconfiguration for failure recovery in ShareBackup and for performance improvement in RDC. However, in many cases these two goals are related. For example, when we run out of backup switches due to simultaneous failures, ShareBackup falls back to the rerouting strategy to maintain data plane connectivity. This causes bandwidth contention and impacts the performance of other, innocent flows that are not on the failed path. As mentioned before, most of today’s cloud software frameworks are designed with redundancy and can tolerate a certain

degree of network failures. In such cases, it is possible that leaving the network topology unchanged upon failures yields even higher benefits for all parties. Notably, the decision of whether or how to change the topology may not depend solely on failure detection, but also requires a model of how application performance degrades under various failures. More sophisticated network statistics and conditions may also be needed as input to such optimizations.

• Towards automated network topology management. Our current topology reconfiguration models in both ShareBackup and RDC are still primitive. In ShareBackup, the model is based on detecting network component failures, which can itself be a challenging task, as intermittent or performance-related “gray” failures [173] are prevalent in today’s cloud systems. In those cases, the detected failures may not always be accurate. More sophisticated network statistics and operational conditions are needed for ShareBackup to perform topology reconfigurations only when necessary, and in a correct way. In RDC, the model is based on historical traffic characteristics, which may not be enough to induce an optimal topology for future traffic, as many factors, such as routing policies, application behaviors, and even the network topology itself, can interplay with the observed traffic characteristics. In light of recent advances in leveraging artificial intelligence for network management, we envision that data-driven topology management can be an integral part of the future “self-driving” network.

Bibliography

[1] “53 incredible facebook statistics and facts.” https://www.brandwatch.com/blog/facebook-statistics/, 2019.

[2] “Research: There are now close to 400 hyper-scale data centers in the world.” https://www.datacenterknowledge.com/cloud/research-there-are-now-close-400-hyper-scale-data-centers-world, 2017.

[3] “Cisco global cloud index: Forecast and methodology, 2016–2021 white paper.” https://www.cisco.com/c/en/us/solutions/collateral/service-provider/global-cloud-index-gci/white-paper-c11-738085.html, 2018.

[4] “Apache hadoop.” http://hadoop.apache.org/, 2015.

[5] “Apache Spark, https://spark.apache.org.”

[6] “Apache Tez, https://tez.apache.org/.”

[7] M. Al-Fares, A. Loukissas, and A. Vahdat, “A Scalable, Commodity Data Center Network Architecture,” in SIGCOMM ’08,(Seattle,Washington,USA), pp. 63–74, August 2008.

[8] A. Greenberg, J. R. Hamilton, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. A. Maltz, P. Patel, and S. Sengupta, “VL2: A Scalable and Flexible Data Center Network,” in Proceedings of the ACM SIGCOMM 2009 Conference on Data Communication, SIGCOMM ’09, (New York, NY, USA), pp. 51–62, ACM, 2009.

[9] R. Niranjan Mysore, A. Pamboris, N. Farrington, N. Huang, P. Miri, S. Radhakrishnan, V. Subramanya, and A. Vahdat, “PortLand: A Scalable Fault-tolerant Layer 2 Data Center Network Fabric,” in Proceedings of the ACM SIGCOMM 2009 Conference on Data Communication, SIGCOMM ’09, (New York, NY, USA), pp. 39–50, ACM, 2009.

[10] A. Singh, J. Ong, A. Agarwal, G. Anderson, A. Armistead, R. Bannon, S. Boving, G. Desai, B. Felderman, P. Germano, A. Kanagala, J. Provost, J. Simmons, E. Tanda, J. Wanderer, U. Hölzle, S. Stuart, and A. Vahdat, “Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google’s Datacenter Network,” in SIGCOMM ’15, (London, United Kingdom), pp. 183–197, ACM, August 2015.

[11] D. Hopps, “Analysis of an Equal-Cost Multi-Path Algorithm,” RFC 2992,2000.

[12] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.-Y. Su, “Scaling distributed machine learning with the parameter server,” in 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pp. 583–598, 2014.

[13] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, “The Hadoop Distributed File System,” in Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on, pp. 1–10, IEEE, 2010.

[14] P. Gill, N. Jain, and N. Nagappan, “Understanding network failures in data centers: measurement, analysis, and implications,” ACM SIGCOMM Computer Communication Review,vol.41,no.4,pp.350–361,2011.

[15] I. Stoica, R. Morris, D. Liben-Nowell, D. R. Karger, M. F. Kaashoek, F. Dabek, and H. Balakrishnan, “Chord: a scalable peer-to-peer lookup protocol for internet applications,” IEEE/ACM Transactions on Networking (TON),vol.11,no.1, pp. 17–32, 2003.

[16] K. Chen, A. Singla, A. Singh, K. Ramachandran, L. Xu, Y. Zhang, X. Wen, and Y. Chen, “OSA: An Optical Switching Architecture for Data Center Networks with Unprecedented Flexibility,” in NSDI ’12, (San Jose, CA), April 2012.

[17] G. Wang, D. G. Andersen, M. Kaminsky, K. Papagiannaki, T. S. E. Ng, M. Kozuch, and M. Ryan, “c-Through: Part-time Optics in Data Centers,” in SIGCOMM ’10,(NewDelhi,India),pp.327–338,August2010.

[18] N. Farrington, G. Porter, S. Radhakrishnan, H. H. Bazzaz, V. Subramanya, Y. Fainman, G. Papen, and A. Vahdat, “Helios: A Hybrid Electrical/Optical Switch Architecture for Modular Data Centers,” in SIGCOMM ’10,(NewDelhi, India), pp. 339–350, August 2010.

[19] M. Ghobadi, R. Mahajan, A. Phanishayee, N. Devanur, J. Kulkarni, G. Ranade, P.-A. Blanche, H. Rastegarfar, M. Glick, and D. Kilper, “ProjecToR: Agile Reconfigurable Data Center Interconnect,” in Proceedings of the 2016 Conference on ACM SIGCOMM 2016 Conference, SIGCOMM ’16, (Florianopolis, Brazil), pp. 216–229, August 2016.

[20] G. Porter, R. Strong, N. Farrington, A. Forencich, P. Chen-Sun, T. Rosing, Y. Fainman, G. Papen, and A. Vahdat, “Integrating Microsecond Circuit Switch- ing into the Data Center,” in SIGCOMM ’13, (Hong Kong, China), pp. 447–458, August 2013.

[21] C. Kachris and I. Tomkos, “The rise of optical interconnects in data centre networks,” in Proc. 14th Int. Conf. Transparent Opt. Netw.(ICTON),pp.1–4, 2012.

[22] https://fibertronics.com/abs-plc-splitter-boxes, 2019.

[23] A. Neukermans and R. Ramaswami, “Mems technology for optical networking applications,” IEEE Communications Magazine,vol.39,no.1,pp.62–69,2001.

[24] “Calient.” http://www.calient.net/products/s-journal-photonic-switch/.

[25] Y. Xia, X. S. Huang, and T. S. E. Ng, “Stop Rerouting! Enabling ShareBackup for Failure Recovery in Data Center Networks,” in Proceedings of the 16th ACM Workshop on Hot Topics in Networks, HotNets ’17, (Palo Alto, CA), pp. 171–177, December 2017.

[26] J. Kim, C. J. Nuzman, B. Kumar, D. F. Lieuwen, J. S. Kraus, A. Weiss, C. P. Lichtenwalner, A. R. Papazian, R. E. Frahm, N. R. Basavanhally, D. A. Ramsey, V. A. Aksyuk, F. Pardo, M. E. Simon, V. Lifton, H. B. Chan, M. Haueis, A. Gasparyan, H. R. Shea, S. Arney, C. A. Bolle, P. R. Kolodner, R. Ryf, D. T. Neilson, and J. V. Gates, “1100 x 1100 port mems-based optical crossconnect with 4-db maximum loss,” IEEE Photonics Technology Letters,vol.15,pp.1537–1539, Nov 2003.

[27] Y. Xia and T. S. E. Ng, “Flat-tree: A Convertible Data Center Network Archi- tecture from Clos to Random Graph,” in Proceedings of the 15th ACM Workshop on Hot Topics in Networks,HotNets’16,(Atlanta,GA),pp.71–77,November 2016.

[28] N. Hamedazimi, Z. Qazi, H. Gupta, V. Sekar, S. R. Das, J. P. Longtin, H. Shah, and A. Tanwer, “FireFly: A Reconfigurable Wireless Data Center Fabric Using Free-space Optics,” in Proceedings of the 2014 ACM Conference on SIGCOMM, SIGCOMM ’14, (Chicago, Illinois, USA), pp. 319–330, August 2014.

[29] X. Zhou, Z. Zhang, Y. Zhu, Y. Li, S. Kumar, A. Vahdat, B. Y. Zhao, and H. Zheng, “Mirror Mirror on the Ceiling: Flexible Wireless Links for Data Centers,” in Proceedings of the ACM SIGCOMM 2012 Conference on Applica- tions, Technologies, Architectures, and Protocols for Computer Communication, SIGCOMM ’12, (Helsinki, Finland), pp. 443–454, August 2012.

[30] D. Halperin, S. Kandula, J. Padhye, P. Bahl, and D. Wetherall, “Augmenting data center networks with multi-gigabit wireless links,” ACM SIGCOMM Computer Communication Review, vol. 41, no. 4, pp. 38–49, 2011.

[31] D. Li, Y. Li, J. Wu, S. Su, and J. Yu, “Esm: efficient and scalable data center multicast routing,” IEEE/ACM Transactions on Networking (TON),2012.

[32] X. Li and M. Freedman, “Scaling ip multicast on datacenter topologies,” ACM Conext,2013.

[33] Y. Vigfusson, H. Abu-Libdeh, M. Balakrishnan, K. Birman, R. Burgess, G. Chockler, H. Li, and Y. Tock, “Dr. multicast: Rx for data center com- munication scalability,” ACM Eurosys,2010.

[34] D. Basin, K. Birman, I. Keidar, and Y. Vigfusson, “Sources of instability in data center multicast,” in Proceedings of the 4th International Workshop on Large Scale Distributed Systems and Middleware, pp. 32–37, ACM, 2010.

[35] H. Wang, Y. Xia, K. Bergman, T. Ng, S. Sahu, and K. Sripanidkulchai, “Rethink- ing the physical layer of data center networks of the next decade: Using optics to enable efficient*-cast connectivity,” ACM SIGCOMM Computer Communication Review,vol.43,no.3,pp.52–58,2013.

[36] P. Samadi, V. Gupta, J. Xu, H. Wang, G. Zussman, and K. Bergman, “Optical multicast system for data center networks,” Optics express,2015.

[37] Y. Xia, T. S. E. Ng, and X. Sun, “Blast: Accelerating High-Performance Data Analytics Applications by Optical Multicast,” in INFOCOM ’15,(HongKong, China), pp. 1930–1938, April 2015.

[38] X. S. Sun, Y. Xia, S. Dzinamarira, X. S. Huang, D. Wu, and T. E. Ng, “Republic: Data multicast meets hybrid rack-level interconnections in data center,” in 2018 IEEE 26th International Conference on Network Protocols (ICNP), pp. 77–87, IEEE, 2018.

[39] V. Liu, D. Halperin, A. Krishnamurthy, and T. Anderson, “F10: A Fault-Tolerant Engineered Network,” in Presented as part of the 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI 13),(Lombard,IL), pp. 399–412, USENIX, 2013.

[40] M. Al-fares, S. Radhakrishnan, B. Raghavan, N. Huang, and A. Vahdat, “Hedera: Dynamic Flow Scheduling for Data Center Networks,” in NSDI ’10,(SanJose, CA), 2010.

[41] B. Stephens, A. L. Cox, and S. Rixner, “Scalable multi-failure fast failover via forwarding table compression,” in Proceedings of the Symposium on SDN Research, SOSR ’16, (New York, NY, USA), pp. 9:1–9:12, ACM, 2016.

[42] S. S. Lor, R. Landa, and M. Rio, “Packet re-cycling: eliminating packet losses due to network failures,” in Proceedings of the 9th ACM SIGCOMM Workshop on Hot Topics in Networks, p. 2, ACM, 2010.

[43] N. Kushman, S. Kandula, D. Katabi, and B. M. Maggs, “R-bgp: Staying connected in a connected world,” USENIX, 2007.

[44] G. Iannaccone, C.-N. Chuah, S. Bhattacharyya, and C. Diot, “Feasibility of ip restoration in a tier 1 backbone,” Ieee Network,vol.18,no.2,pp.13–19,2004.

[45] P. Pan, G. Swallow, A. Atlas, et al., “Fast reroute extensions to RSVP-TE for LSP tunnels,” 2005.

[46] K. Lakshminarayanan, M. Caesar, M. Rangan, T. Anderson, S. Shenker, and I. Stoica, “Achieving convergence-free routing using failure-carrying packets,” ACM SIGCOMM Computer Communication Review,vol.37,no.4,pp.241–252, 2007.

[47] “Windows server.” https://technet.microsoft.com/en-us/library/hh831764.aspx.

[48] S. Ghemawat, H. Gobioff, and S.-T. Leung, “The google file system,” ACM SOSP,2003.

[49] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, “Spark: Cluster computing with working sets.,” HotCloud,2010.

[50] C. Gray and D. Cheriton, “Leases: An efficient fault-tolerant mechanism for distributed file cache consistency,” ACM SOSP,1989.

[51] M. Li, D. G. Andersen, A. J. Smola, and K. Yu, “Communication efficient distributed machine learning with the parameter server,” Advances in Neural Information Processing Systems,2014.

[52] K. Canini, T. Chandra, E. Ie, J. McFadden, K. Goldman, M. Gunter, J. Harmsen, K. LeFevre, D. Lepikhin, T. Llinares, et al., “Sibyl: A system for large scale supervised machine learning,” Machine Learning Summer School, Santa Cruz, CA, 2012.

[53] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent dirichlet allocation,” Journal of machine Learning research,2003.

[54] “100 gbps for the data center.” http://www.networkcomputing.com/data-centers/ 100-gbps-headed-data-center/407619707.

[55] “Ieee802.3ba-2010 standard.” http://www.ieee802.org/3/ba/. Accessed: 2016-02- 01.

[56] “Calient application note.” http://www.calient.net/resources/application-notes/.

[57] G. Wang, D. G. Andersen, M. Kaminsky, K. Papagiannaki, T. E. Ng, M. Kozuch, and M. Ryan, “c-through: Part-time optics in data centers,” ACM SIGCOMM, 2010.

[58] H. Liu, M. K. Mukerjee, C. Li, N. Feltman, G. Papen, S. Savage, S. Seshan, G. M. Voelker, D. G. Andersen, M. Kaminsky, et al., “Scheduling techniques for hybrid circuit/packet networks,” ACM Conext, 2015.

[59] G. Keeler, D. Agarwal, C. Debaes, B. E. Nelson, N. C. Helman, H. Thienpont, and D. A. Miller, “Optical pump-probe measurements of the latency of silicon cmos optical interconnects,” IEEE Photonics Technology Letters,2002.

[60] “Mellanox sx6536.” http://www.colfaxdirect.com/store/pc/viewPrd.asp? idproduct=1760&idcategory=7.

[61] “1x2 plc fiber splitter.” https://www.fs.com/products/12493.html, 2018.

[62] S. V. Pemmaraju and R. Raman, “Approximation algorithms for the max-coloring problem,” ICALP,2005.

[63] M. Kubale, Graph colorings. American Mathematical Society, 2004.

[64] A. Miller, “Online graph colouring,” Canadian Undergraduate Mathematics Conference,2004.

[65] J. H. Ahn, N. Binkert, A. Davis, M. McLaren, and R. S. Schreiber, “HyperX: Topology, Routing, and Packaging of Efficient Large-scale Networks,” in SC ’09, (Portland, Oregon, USA), pp. 41:1–41:11, November 2009.

[66] C. Guo, G. Lu, D. Li, H. Wu, X. Zhang, Y. Shi, C. Tian, Y. Zhang, and S. Lu, “BCube: A High Performance, Server-centric Network Architecture for Modular Data Centers,” in SIGCOMM ’09,(Barcelona,Spain),pp.63–74,August2009.

[67] C. Guo, H. Wu, K. Tan, L. Shi, Y. Zhang, and S. Lu, “DCell: A Scalable and Fault-Tolerant Network Structure for Data Centers,” in SIGCOMM ’08, (Seattle, Washington, USA), pp. 75–86, August 2008.

[68] A. Singla, C.-Y. Hong, L. Popa, and P. B. Godfrey, “Jellyfish: Networking Data Centers Randomly,” in NSDI ’12, (San Jose, California, USA), pp. 1–14, April 2012.

[69] M. Walraed-Sullivan, A. Vahdat, and K. Marzullo, “Aspen Trees: Balancing Data Center Fault Tolerance, Scalability and Cost,” in Proceedings of the Ninth ACM Conference on Emerging Networking Experiments and Technologies, CoNEXT ’13, (New York, NY, USA), pp. 85–96, ACM, 2013.

[70] M. Caesar, M. Casado, T. Koponen, J. Rexford, and S. Shenker, “Dynamic Route Recomputation Considered Harmful,” SIGCOMM Comput. Commun. Rev.,vol.40,pp.66–71,Apr.2010.

[71] P. Gill, N. Jain, and N. Nagappan, “Understanding Network Failures in Data Centers: Measurement, Analysis, and Implications,” in Proceedings of the ACM SIGCOMM 2011 Conference, SIGCOMM ’11, (New York, NY, USA), pp. 350– 361, ACM, 2011.

[72] K. Chen, X. Wen, X. Ma, Y. Chen, Y. Xia, C. Hu, and Q. Dong, “WaveCube: A Scalable, Fault-tolerant, High-performance Optical Data Center Architecture,” in 2015 IEEE Conference on Computer Communications (INFOCOM),pp.1903– 1911, April 2015.

[73] Y. J. Liu, P. X. Gao, B. Wong, and S. Keshav, “Quartz: A New Design Element for Low-latency DCNs,” in SIGCOMM ’14, (Chicago, Illinois, USA), pp. 283–294, August 2014.

[74] S. Legtchenko, N. Chen, D. Cletheroe, A. Rowstron, H. Williams, and X. Zhao, “XFabric: A Reconfigurable In-Rack Network for Rack-Scale Computers,” in 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI 16), (Santa Clara, CA), pp. 15–29, USENIX Association, 2016.

[75] Y. Xia, X. S. Sun, S. Dzinamarira, D. Wu, X. S. Huang, and T. S. E. Ng, “A tale of two topologies: Exploring convertible data center network architectures with flat-tree,” in Proceedings of the Conference of the ACM Special Interest Group on Data Communication, SIGCOMM ’17, (New York, NY, USA), pp. 295–308, ACM, 2017.

[76] “Introducing data center fabric, the next-generation facebook data center net- work.”

[77] X. Wu, D. Turner, C.-C. Chen, D. A. Maltz, X. Yang, L. Yuan, and M. Zhang, “NetPilot: Automating Datacenter Network Failure Mitigation,” in Proceedings of the ACM SIGCOMM 2012 Conference on Applications, Technologies, Archi- tectures, and Protocols for Computer Communication, SIGCOMM ’12, (Helsinki, Finland), pp. 419–430, August 2012.

[78] D. Zhuo, M. Ghobadi, R. Mahajan, K.-T. Förster, A. Krishnamurthy, and T. Anderson, “Understanding and Mitigating Packet Corruption in Data Center Networks,” in Proceedings of the Conference of the ACM Special Interest Group on Data Communication, SIGCOMM ’17, (Los Angeles, CA), pp. 362–375, ACM, 2017.

[79] V. Liu, D. Zhuo, S. Peter, A. Krishnamurthy, and T. Anderson, “Subways: A Case for Redundant, Inexpensive Data Center Edge Links,” in Proceedings of the 11th ACM Conference on Emerging Networking Experiments and Technologies, CoNEXT ’15, (Heidelberg, Germany), pp. 27:1–27:13, ACM, 2015.

[80] L. T., C. B., M. P., and L. D., “Cisco Hot Standby Router Protocol (HSRP),” RFC 2281,1998.

[81] J. Liu, A. Panda, A. Singla, B. Godfrey, M. Schapira, and S. Shenker, “Ensuring connectivity via data plane mechanisms,” in NSDI, pp. 113–126, 2013.

[82] S. S. Lor, R. Landa, and M. Rio, “Packet re-cycling: eliminating packet losses due to network failures,” in Proceedings of the 9th ACM SIGCOMM Workshop on Hot Topics in Networks, p. 2, ACM, 2010.

[83] M. Reitblatt, M. Canini, A. Guha, and N. Foster, “FatTire: Declarative Fault Tolerance for Software-defined Networks,” in Proceedings of the Second ACM SIGCOMM Workshop on Hot Topics in Software Defined Networking,HotSDN ’13, (Hong Kong, China), pp. 109–114, ACM, Aug. 2013.

[84] B. Yang, J. Liu, S. Shenker, J. Li, and K. Zheng, “Keep forwarding: Towards k-link failure resilient routing,” in INFOCOM, 2014 Proceedings IEEE,pp.1617– 1625, IEEE, 2014.

[85] B. Stephens and A. L. Cox, “Deadlock-free local fast failover for arbitrary data center networks,” in Computer Communications, IEEE INFOCOM 2016-The 35th Annual IEEE International Conference on, pp. 1–9, IEEE, 2016.

[86] M. Borokhovich, L. Schiff, and S. Schmid, “Provable data plane connectivity with local fast failover: Introducing openflow graph algorithms,” in Proceedings of the third workshop on Hot topics in software defined networking,pp.121–126, ACM, 2014.

[87] P. P., S. G., and A. A, “Fast Reroute Extensions to RSVP-TE for LSP Tunnels,” RFC 4090,1998.

[88] B. Stephens, A. L. Cox, and S. Rixner, “Scalable Multi-Failure Fast Failover via Forwarding Table Compression,” in Proceedings of the Symposium on SDN Research, SOSR ’16, (Santa Clara, CA), pp. 9:1–9:12, ACM, 2016.

[89] P. Bodík, I. Menache, M. Chowdhury, P. Mani, D. A. Maltz, and I. Stoica, “Surviving Failures in Bandwidth-constrained Datacenters,” in SIGCOMM ’12, (Helsinki, Finland), pp. 431–442, August 2012.

[90] M. Schlansker, M. Tan, J. Tourrilhes, J. R. Santos, and S.-Y. Wang, “Configurable optical interconnects for scalable datacenters,” in Optical Fiber Communication Conference and Exposition and the National Fiber Optic Engineers Conference (OFC/NFOEC), 2013, pp. 1–3, IEEE, 2013.

[91] M. C. Wu, O. Solgaard, and J. E. Ford, “Optical MEMS for Lightwave Commu- nication,” Journal of Lightwave Technology,vol.24,pp.4433–4454,December 2006.

[92] “Arduino, https://www.arduino.cc.”

[93] “Raspberry Pi, https://www.raspberrypi.org.”

[94] T. J. Seok, N. Quack, S. Han, W. Zhang, R. S. Muller, and M. C. Wu, “Reliability study of digital silicon photonic mems switches,” in Group IV Photonics (GFP), 2015 IEEE 12th International Conference on, pp. 205–206, IEEE, 2015.

[95] S. Kassing, A. Valadarsky, G. Shahaf, M. Schapira, and A. Singla, “Beyond Fat-trees Without Antennae, Mirrors, and Disco-balls,” in Proceedings of the Conference of the ACM Special Interest Group on Data Communication,SIG- COMM ’17, (Los Angeles, CA, USA), pp. 281–294, ACM, 2017.

[96] A. Valadarsky, G. Shahaf, M. Dinitz, and M. Schapira, “Xpander: Towards Optimal-Performance Datacenters,” in Proceedings of the 12th International on Conference on Emerging Networking EXperiments and Technologies, CoNEXT ’16, (Irvine, California, USA), pp. 205–219, ACM, 2016.

[97] T. Leighton and S. Rao, “Multicommodity Max-flow Min-cut Theorems and Their Use in Designing Approximation Algorithms,” J. ACM,vol.46,no.6, pp. 787–832, November 1999.

[98] A. Singla, Designing Data Center Networks for High Throughput. Ph.D. Thesis, University of Illinois at Urbana-Champaign, 2015.

[99] M. Alizadeh, A. Greenberg, D. A. Maltz, J. Padhye, P. Patel, B. Prabhakar, S. Sengupta, and M. Sridharan, “DCTCP: Efficient Packet Transport for the Commoditized Data Center,” in SIGCOMM’10,August2010.

[100] S. Kandula, S. Sengupta, A. Greenberg, P. Patel, and R. Chaiken, “The Nature of Data Center Traffic,” in IMC ’09, (Chicago, Illinois, USA), pp. 202–208, November 2009.

[101] “Coflow-Benchmark, https://github.com/coflow/coflow-benchmark.”

[102] C. Wilson, H. Ballani, T. Karagiannis, and A. Rowtron, “Better Never Than Late: Meeting Deadlines in Datacenter Networks,” in Proceedings of the ACM SIGCOMM 2011 Conference, SIGCOMM ’11, (Toronto, Ontario, Canada), pp. 50–61, ACM, 2011.

[103] W. M. Mellette, R. Das, Y. Guo, R. McGuinness, A. C. Snoeren, and G. Porter, “Expanding across time to deliver bandwidth efficiency and low latency,” arXiv e-prints, p. arXiv:1903.12307, Mar 2019.

[104] G. Wang, D. G. Andersen, M. Kaminsky, K. Papagiannaki, T. Ng, M. Kozuch, and M. Ryan, “c-through: Part-time optics in data centers,” in ACM SIGCOMM Computer Communication Review, vol. 40, pp. 327–338, ACM, 2010.

[105] W. M. Mellette, R. McGuinness, A. Roy, A. Forencich, G. Papen, A. C. Sno- eren, and G. Porter, “Rotornet: A scalable, low-complexity, optical datacenter network,” in Proceedings of the Conference of the ACM Special Interest Group on Data Communication, pp. 267–280, ACM, 2017.

[106] H. Liu, F. Lu, A. Forencich, R. Kapoor, M. Tewari, G. M. Voelker, G. Papen, A. C. Snoeren, and G. Porter, “Circuit switching under the radar with reactor,” in 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI 14), (Seattle, WA), pp. 1–15, USENIX Association, 2014.

[107] A. Chatzieleftheriou et al., “Larry: Practical network reconfigurability in the data center,” in NSDI, USENIX, 2018.

[108] X. Zhou, Z. Zhang, Y. Zhu, Y. Li, S. Kumar, A. Vahdat, B. Y. Zhao, and H. Zheng, “Mirror mirror on the ceiling: Flexible wireless links for data centers,” ACM SIGCOMM Computer Communication Review,vol.42,no.4,pp.443–454, 2012.

[109] A. Roy, H. Zeng, J. Bagga, G. Porter, and A. C. Snoeren, “Inside the social network’s (datacenter) network,” SIGCOMM Comput. Commun. Rev.,vol.45, pp. 123–137, Aug. 2015.

[110] T. Benson et al., “Network traffic characteristics of data centers in the wild,” in IMC, ACM, 2010.

[111] T. Benson, A. Anand, A. Akella, and M. Zhang, “Understanding data center traffic characteristics,” in Proceedings of the 1st ACM workshop on Research on enterprise networking, pp. 65–72, ACM, 2009.

[112] P. Bodík et al., “Surviving failures in bandwidth-constrained datacenters,” in SIGCOMM, ACM, 2012.

[113] D. Wischik, C. Raiciu, A. Greenhalgh, and M. Handley, “Design, implementation and evaluation of congestion control for multipath tcp,” in Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, NSDI’11, (Berkeley, CA, USA), pp. 99–112, USENIX Association, 2011.

[114] M. Chowdhury, Y. Zhong, and I. Stoica, “Efficient Coflow Scheduling with Varys,” in SIGCOMM ’14, (Chicago, IL), pp. 443–454, 2014.

[115] V. Jalaparti et al., “Network-aware scheduling for data-parallel jobs: Plan when you can,” SIGCOMM, 2015.

[116] G. Fox, S. Otto, and A. Hey, “Matrix algorithms on a hypercube i: Matrix multiplication,” ,vol.4,no.1,pp.17–31,1987.

[117] R. A. Van De Geijn and J. Watts, “Summa: Scalable universal matrix mul- tiplication algorithm,” Concurrency: Practice and Experience,vol.9,no.4, pp. 255–274, 1997.

[118] “Facebook network analytics data sharing.” https://www.facebook.com/groups/ 1144031739005495/.

[119] T. Benson, A. Anand, A. Akella, and M. Zhang, “MicroTE: Fine Grained Traffic Engineering for Data Centers,” in CoNEXT ’11,(Tokyo,Japan),pp.8:1–8:12, ACM, 2011.

[120] “Introducing data center fabric, the next-generation facebook data center network.” https://code.fb.com/production-engineering/ introducing-data-center-fabric-the-next-generation-facebook-data-center-network.

[121] “Core and pod data center design.” http://go.bigswitch.com/rs/974-WXR-561/ images/Core-and-Pod%20Overview.pdf.

[122] “Specifying data center it pod architectures.” https://www.apc.com/salestools/ WTOL-AHAPRN/WTOL-AHAPRN_R0_EN.pdf.

[123] “Edge 640 optical circuit switch.” https://www.calient.net/products/ edge640-optical-circuit-switch.

[124] “Polatis series 7000 384x384.” https://www.polatis.com/ series-7000-384x384-port-software-controlled-optical-circuit-switch-sdn-enabled. asp.

[125] “Macom m21605 crosspoint switch.” https://www.macom.com/products/product-detail/M21605/.

[126] V. Shrivastav, A. Valadarsky, H. Ballani, P. Costa, K. S. Lee, H. Wang, R. Agar- wal, and H. Weatherspoon, “Shoal: A network architecture for disaggregated racks,” in 16th USENIX Symposium on Networked Systems Design and Im- plementation (NSDI 19), (Boston, MA), pp. 255–270, USENIX Association, 2019.

[127] “Ibm prefabricated modular data center.” https://www.ibm.com/us-en/ marketplace/prefabricated-modular-data-center.

[128] M. Kuźniar, P. Perešíni, and D. Kostić, “What you need to know about sdn flow tables,” in International Conference on Passive and Active Network Measurement, pp. 347–359, Springer, 2015.

[129] K. He, J. Khalid, A. Gember-Jacobson, S. Das, C. Prakash, A. Akella, L. E. Li, and M. Thottan, “Measuring control plane latency in sdn-enabled switches,” in Proceedings of the 1st ACM SIGCOMM Symposium on Software Defined Networking Research, p. 25, ACM, 2015.

[130] M. Reitblatt, N. Foster, J. Rexford, C. Schlesinger, and D. Walker, “Abstrac- tions for network update,” ACM SIGCOMM Computer Communication Review, vol. 42, no. 4, pp. 323–334, 2012.

[131] H. H. Liu, X. Wu, M. Zhang, L. Yuan, R. Wattenhofer, and D. Maltz, “zupdate: Updating data center networks with zero loss,” in ACM SIGCOMM Computer Communication Review, vol. 43, pp. 411–422, ACM, 2013.

[132] R. Miao, H. Zeng, C. Kim, J. Lee, and M. Yu, “Silkroad: Making stateful layer-4 load balancing fast and cheap using switching asics,” in Proceedings of the Conference of the ACM Special Interest Group on Data Communication, pp. 15–28, ACM, 2017.

[133] “Barefoot tofino.” https://www.barefootnetworks.com/products/brief-tofino.

[134] R. Krauthgamer et al., “Partitioning graphs into balanced components,” in SODA, Society for Industrial and Applied Mathematics, 2009.

[135] D. LaSalle and G. Karypis, “Multi-threaded graph partitioning,” in Parallel & Distributed Processing (IPDPS), 2013 IEEE 27th International Symposium on, pp. 225–236, IEEE, 2013.

[136] M. Dell’Amico and S. Martello, “Bounds for the cardinality constrained P||Cmax problem,” Journal of Scheduling, vol. 4, no. 3, pp. 123–138, 2001.

[137] W. Michiels, J. Korst, E. Aarts, and J. Van Leeuwen, “Performance ratios for the differencing method applied to the balanced number partitioning problem,” in Annual Symposium on Theoretical Aspects of Computer Science, pp. 583–595, Springer, 2003.

[138] C.-H. Hsu, Q. Deng, J. Mars, and L. Tang, “Smoothoperator: Reducing power fragmentation and improving power utilization in large-scale datacenters,” in Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’18, (New York, NY, USA), pp. 535–548, ACM, 2018.

[139] “Ipv4 multihoming practices and limitations.” https://tools.ietf.org/html/rfc4116.

[140] P. Gill, N. Jain, and N. Nagappan, “Understanding network failures in data centers: measurement, analysis, and implications,” ACM SIGCOMM Computer Communication Review, vol. 41, no. 4, pp. 350–361, 2011.

[141] Y. Xia, X. S. Huang, and T. S. E. Ng, “Stop rerouting!: Enabling sharebackup for failure recovery in data center networks,” in Proceedings of the 16th ACM Workshop on Hot Topics in Networks, HotNets-XVI, (New York, NY, USA), pp. 171–177, ACM, 2017.

[142] “40G short range transceiver.” http://www.fs.com/products/17931.html, 2018.

[143] “N8000-32q 40g sdn switch.” https://www.fs.com/products/69342.html, 2018.

[144] “9/125 single mode fiber patch cable.” https://www.fs.com/products/40200.html, 2018.

[145] “40g qsfp+ passive direct attach copper cable.” https://www.fs.com/products/30898.html, 2018.

[146] J. Meza, T. Xu, K. Veeraraghavan, and O. Mutlu, “A large scale study of data center network reliability,” in Proceedings of the Internet Measurement Conference 2018, pp. 393–407, ACM, 2018.

[147] X. Wu, D. Turner, C.-C. Chen, D. A. Maltz, X. Yang, L. Yuan, and M. Zhang, “Netpilot: automating datacenter network failure mitigation,” in Proceedings of the ACM SIGCOMM 2012 conference on Applications, technologies, architectures, and protocols for computer communication, pp. 419–430, ACM, 2012.

[148] T. J. Seok, N. Quack, S. Han, W. Zhang, R. S. Muller, and M. C. Wu, “Reliability study of digital silicon photonic mems switches,” in Group IV Photonics (GFP), 2015 IEEE 12th International Conference on, pp. 205–206, IEEE, 2015.

[149] “Mean time between failures.”

[150] “Single fiber cwdm mux demux.” https://www.fs.com/products/70407.html, 2018.

[151] J. Xia, M. Ding, A. Wonfor, R. V. Penty, and I. H. White, “The feasibility of building 1024 & 4096-port nanosecond switching for data centre networks using dilated hybrid optical switches,”

[152] Q. Cheng, A. Wonfor, R. V. Penty, and I. H. White, “Scalable, low-energy hybrid photonic space switch,” Journal of Lightwave Technology, vol. 31, pp. 3077–3084, Sep. 2013.

[153] “Bcm56980 series.” https://www.broadcom.com/products/ethernet-connectivity/switching/strataxgs/bcm56980-series.

[154] “Ieee standard 802.3bs.” https://standards.ieee.org/standard/802_3bs-2017.html.

[155] “Memcached.” https://memcached.org.

[156] M. Handley, C. Raiciu, A. Agache, A. Voinescu, A. W. Moore, G. Antichi, and M. Wójcik, “Re-architecting datacenter networks and stacks for low latency and high performance,” in Proceedings of the Conference of the ACM Special Interest Group on Data Communication, SIGCOMM ’17, (New York, NY, USA), pp. 29–42, ACM, 2017.

[157] “Intel ethernet converged network adapter x520.” https://www.intel.com/content/dam/doc/product-brief/ethernet-x520-server-adapters-brief.pdf.

[158] “Mellanox connectx-4 vpi card.” http://www.mellanox.com/related-docs/prod_adapter_cards/PB_ConnectX-4_VPI_Card.pdf.

[159] “Cisco nexus 5600 platform 40-gbps switches data sheet.” https://www.cisco.com/c/en/us/products/collateral/switches/nexus-5624q-switch/datasheet-c78-733100.html.

[160] B. C. Vattikonda, G. Porter, A. Vahdat, and A. C. Snoeren, “Practical tdma for datacenter ethernet,” in Proceedings of the 7th ACM european conference on Computer Systems, pp. 225–238, ACM, 2012.

[161] “Welcome to ryu the network operating system (nos).” https://ryu.readthedocs.io/en/latest/.

[162] J. Edmonds, “Paths, trees, and flowers,” Canadian Journal of Mathematics, vol. 17, pp. 449–467, 1965.

[163] B. Atikoglu, Y. Xu, E. Frachtenberg, S. Jiang, and M. Paleczny, “Workload analysis of a large-scale key-value store,” in ACM SIGMETRICS Performance Evaluation Review, vol. 40, pp. 53–64, ACM, 2012.

[164] R. Nishtala, H. Fugal, S. Grimm, M. Kwiatkowski, H. Lee, H. C. Li, R. McElroy, M. Paleczny, D. Peek, P. Saab, et al., “Scaling memcache at facebook,” in Presented as part of the 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI 13), pp. 385–398, 2013.

[165] M. Alizadeh, A. Kabbani, T. Edsall, B. Prabhakar, A. Vahdat, and M. Yasuda, “Less is more: trading a little bandwidth for ultra-low latency in the data center,” in Presented as part of the 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12), pp. 253–266, 2012.

[166] D. Zats, T. Das, P. Mohan, D. Borthakur, and R. Katz, “Detail: reducing the flow completion time tail in datacenter networks,” in Proceedings of the ACM SIGCOMM 2012 conference on Applications, technologies, architectures, and protocols for computer communication, pp. 139–150, ACM, 2012.

[167] L. Polok, V. Ila, and P. Smrz, “Fast sparse matrix multiplication on gpu,” in Proceedings of the Symposium on High Performance Computing, HPC ’15, (San Diego, CA, USA), pp. 33–40, Society for Computer Simulation International, 2015.

[168] L. West and J. K. Lee, “Performance comparison between cg-based and cuda-based matrix multiplications,” in Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA), p. 1, The Steering Committee of The World Congress in Computer Science, Computer . . . , 2012.

[169] P. Bakopoulos, K. Christodoulopoulos, G. Landi, M. Aziz, E. Zahavi, D. Gallico, R. Pitwon, K. Tokas, I. Patronas, M. Capitani, et al., “Nephele: An end-to-end scalable and dynamically reconfigurable optical architecture for application-aware sdn cloud data centers,” IEEE Communications Magazine, vol. 56, no. 2, pp. 178–188, 2018.

[170] Y. Xia, X. S. Sun, S. Dzinamarira, D. Wu, X. S. Huang, and T. Ng, “A tale of two topologies: Exploring convertible data center network architectures with flat-tree,” in Proceedings of the Conference of the ACM Special Interest Group on Data Communication, pp. 295–308, ACM, 2017.

[171] M. Chowdhury, S. Kandula, and I. Stoica, “Leveraging Endpoint Flexibility in Data-Intensive Clusters,” in SIGCOMM ’13, (Hong Kong, China), pp. 231–242, 2013.

[172] F. Ahmad et al., “Shufflewatcher: Shuffle-aware scheduling in multi-tenant mapreduce clusters,” in ATC, USENIX, 2014.

[173] P. Huang, C. Guo, J. R. Lorch, L. Zhou, and Y. Dang, “Capturing and enhancing in situ system observability for failure detection,” in 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), (Carlsbad, CA), pp. 1–16, USENIX Association, Oct. 2018.