ABSTRACT
Designing Hybrid Data Center Networks for High Performance and Fault Tolerance
by
Dingming Wu
Data center networks are critical infrastructure behind today’s cloud services. High-performance data center network architectures have received extensive attention in the past decade, with multiple interesting DCN architectures proposed over the years. In the early days, data center networks were built mostly on electrical packet switches, with Clos, or multi-rooted tree, structures. These networks fall short in 1) providing high bandwidth for emerging non-uniform traffic communications, e.g., multicast transmissions and skewed communicating server pairs, and 2) recovering from various network component failures, e.g., link failures and switch failures. More recently, optical technologies have gained high popularity due to their ultra-simple and flexible connectivity provisioning, low power, and low cost. However, the advantages of optical networking come at a price: switching speeds that are orders of magnitude lower than those of electrical packet switches. This thesis explores the design space of hybrid electrical and optical network architectures for modern data centers. It tries to reach a delicate balance between performance, fault-tolerance, scalability, and cost through coordinated use of both electrical and optical components in the network. We have developed several approaches to achieving these goals from different angles. First, we used optical splitters as key building blocks to improve multicast transmission performance. We built an unconventional optical multicast architecture, called HyperOptics, that provides orders-of-magnitude throughput improvements for multicast transmissions. Second, we developed a failure-tolerant network, called ShareBackup, by embedding optical switches into the Clos networks. ShareBackup, for the first time, achieves network-wide full-capacity failure recovery in milliseconds. Third, we proposed to enable programmable network topology at runtime by inserting optical switches at the network edge. Our system, called RDC, breaks the bandwidth boundaries between servers and dynamically optimizes its topology according to traffic patterns. Through these three works, we demonstrate the high potential of hybrid data center network architectures for high performance and fault-tolerance.
Acknowledgments
I would first like to thank my advisor Prof. T. S. Eugene Ng, for his support, encouragement, and precious advice at both the academic and the personal levels. I am thankful that he pushed me and gave me directions when I did not know what to do. I am thankful that he supported and helped me to explore various external opportunities when I could not make further progress. I am thankful that he believed in me and encouraged me when I had doubts in myself. Without his mentorship, I could not be here today. I also thank my thesis committee members, Ang Chen and Lin Zhong. They have given me much valuable feedback and many suggestions on my research, job search, as well as this thesis. Special thanks are owed to Ang; collaborations with him produced one of the major results in this thesis. I want to thank my internship mentors, Yingnong Dang and Guohui Wang. They taught me by example how to be productive and professional in the industry. They also gave me a lot of suggestions that helped me grow and mature technically. More importantly, I thank them for providing me with memorable experiences in Seattle and Silicon Valley. I am grateful to have had many fantastic fellow students in the CS department. In particular, I thank my groupmates in the BOLD lab: Xiaoye Steven Sun, Yiting Xia, Simbarashe Dzinamarira, Xin Sunny Huang, Sushovan Das, Afsaneh Rahbar, and Weitao Wang. They helped me a lot in shaping the ideas behind this thesis as well as conducting the experimental studies. I also thank the fellow network-systems students for their productive discussions with me: Kuo-Feng Hsu, Jiarong Xing, Qiao Kang, and Xinyu Wu.
Graduate studies are stressful; I am thankful for the company of many of my friends at Rice. I treasure the friendship with Xiaoye Sun, Lechen Yu, Beidi Chen, Chen Luo, Terry Tang, and Simbarashe Dzinamarira. I am grateful that they were always there when I wanted to share my ups and downs. They have made these years an enjoyable journey for me. Finally, I would like to give my special thanks to my parents Ping Xiao and Tongxin Wu, who have always trusted me and supported me along the way.
Contents
Abstract ii
Acknowledgments iv
List of Illustrations x
List of Tables xiv

1 Introduction 1
  1.1 Data Center Networks: State of the Art and Challenges 3
  1.2 Thesis Contributions 5
  1.3 Thesis Organization 7

2 Background and Related Work 8
  2.1 Optics in Data Centers 8
  2.2 Hybrid Data Center Network Architectures 10
  2.3 Supporting Multicast Transmissions in Data Center Networks 11
  2.4 Failure Recovery in Data Center Networks 13

3 HyperOptics: Integrating An Optical Multicast Architecture in Data Centers 16
  3.1 Introduction 16
  3.2 HyperOptics Architecture 19
    3.2.1 ToR Connectivity Design 19
    3.2.2 Routing and Relay Set Computation 19
    3.2.3 Analysis 21
    3.2.4 System Overview 23
    3.2.5 Multicast Scheduling 24
  3.3 Evaluation 25
    3.3.1 Compared networks 25
    3.3.2 Simulation setup 26
    3.3.3 Effect of Splitter Fanout Used in OCS 27
    3.3.4 Performance Comparison 28
  3.4 Summary 28

4 Sharebackup: Masking Data Center Network Failures from Application Performance 30
  4.1 Introduction 30
  4.2 Related Work 32
  4.3 Network Architecture 35
  4.4 Control Plane 39
    4.4.1 Fast Failure Detection and Recovery 39
    4.4.2 Distributed Network Controllers 40
    4.4.3 Offline Auto-Diagnosis 42
    4.4.4 Live Impersonation of Failed Switch 44
  4.5 Discussion 45
    4.5.1 Control System Failures 45
    4.5.2 Cost Analysis 46
    4.5.3 Benefits to Network Management 47
    4.5.4 Alternatives in the Design Space 48
  4.6 Implementation and Evaluation 48
    4.6.1 Testbed 49
    4.6.2 Simulation 51
    4.6.3 Experimental Setup 52
    4.6.4 Transient State Analysis 56
    4.6.5 Responsiveness of Control Plane 57
    4.6.6 Bandwidth Advantage 59
    4.6.7 Transmission Performance at Scale 61
    4.6.8 Benefits to Real Applications 64
  4.7 Summary 66

5 RDC: Relieving Data Center Network Congestion with Topological Reconfigurability at the Edge 67
  5.1 Introduction 68
  5.2 The Case for Rackless Data Centers 71
    5.2.1 Observation #1: Pod locality 72
    5.2.2 Observation #2: Inter-pod imbalance 72
    5.2.3 Observation #3: Predictable patterns 73
    5.2.4 Understanding the power of racklessness 74
  5.3 The RDC Architecture 76
    5.3.1 Building block: Circuit switching 76
    5.3.2 Connectivity structure 78
    5.3.3 The pod controller 79
    5.3.4 Routing 80
  5.4 RDC Control Algorithms 82
    5.4.1 Traffic localization 83
    5.4.2 Uplink load-balancing 84
    5.4.3 Application-driven optimizations 86
    5.4.4 Advanced control algorithms 87
  5.5 Discussions 87
    5.5.1 Recovering from ToR failures 87
    5.5.2 Wiring and incremental deployment 88
    5.5.3 Cost analysis 89
    5.5.4 Handling circuit switch failures 91
    5.5.5 Scaling 91
    5.5.6 Alternatives in the design space 92
  5.6 Implementation and Evaluation 93
    5.6.1 Platforms 93
    5.6.2 TCP transient state 94
    5.6.3 Throughput benchmark 95
    5.6.4 Control loop latency 96
    5.6.5 Transmission performance at scale 97
    5.6.6 Real-world applications 102
  5.7 Related Work 105
  5.8 Summary 106

6 Conclusions and Future Work 107

Bibliography 110

Illustrations
1.1 An example of a three-layer Clos network using homogeneous commodity switches. 2

2.1 An illustration of the optical splitter. (a) is a logical representation of an optical splitter with 1 input port and k output ports. (b) shows a real splitter device with k = 8. 9
2.2 An illustration of the optical circuit switch (OCS). (a) is a logical representation of an OCS with 8 input ports and 8 output ports, or an 8 × 8 OCS. (b) shows a real Calient 312 × 312 OCS. 9

3.1 An example of HyperOptics connectivity with splitter fanout k = 3. The connectivity of t_0, t_3, t_4 and t_6 is shown in the figure. All other ToRs are interconnected to their neighbors in a similar way. The table on the bottom demonstrates the connectivity of all ToRs. 20
3.2 Two broadcast trees originating from t_i and t_{i+2^(k-1)}. Solid circles are relays. The union of the relays and the last neighbor of each relay, shown by squares, form a complete set of all ToRs. The two broadcasts have disjoint relay sets. 22
3.3 An overview of the HyperOptics system. 24
3.4 Comparison of the average total FCT of all three networks. The speedups of HyperOptics over the OCS network are labeled in the figure. 27
3.5 Computation time of HyperOptics' control plane under different numbers of multicast requests. 29

4.1 A k = 6 and n = 1 ShareBackup network. (a), (b), (c) correspond to shaded areas in (d). Devices are labeled according to the notations in Table 4.1. Edge and aggregation switches are marked by their in-Pod indices; core switches and hosts are marked by their global indices. Switches in the same failure group are packed together and share a backup switch, shown in stripes on the side. Circuit switches are inserted into adjacent layers of switches/hosts. The connectivity in shade is the basic building block for shareable backup. The crossed switch and connections represent example switch and link failures. Switches involved in failures are each replaced by a backup switch, with the new circuit switch configurations shown at the bottom, where connections regarding the original red round ports reconnect to the new black square ports. 36
4.2 Communication protocol in the control system. (a): Failure detection and recovery. (b): Diagnosis of link failure. 41
4.3 Circuit switch configurations for diagnosis of link failures, shown by examples (b) and (c) in Figure 4.1. Circuit switches in a Pod are chained up using the side ports. Only "suspect switches" on both sides of the failed link and some related backup switches are shown. Through configurations 1, 2, and 3, the "suspect interface" on both "suspect switches" associated with the failure can connect to 3 different interfaces on one or multiple other switches. 42
4.4 A testbed of k = 4, n = 1 ShareBackup with 2 Pods. 50
4.5 Trace of TCP sequence number during failure recovery. 55
4.6 TCP congestion window size during failure recovery. 56
4.7 iPerf throughput of 8 flows saturating all links on testbed. 59
4.8 Minimum flow throughput under edge-aggregation link failures, normalized against the no-failure case, on the LP simulator with global optimal routing for all networks. 59
4.9 CDF of completion time slowdowns on packet simulator. 61
4.10 Percentage of jobs missing deadlines on packet simulator. 62
4.11 Performance of the Spark Word2Vec and Tez Sort applications with a single edge (edge-host) or network (core-aggregation and aggregation-edge) link failure on the testbed. 63
4.12 CDF of query latency in the Spark TPC-H application with a single edge (edge-host) or network (core-aggregation and aggregation-edge) link failure on the testbed. 65

5.1 Traffic patterns from the Facebook traces. (a) is the rack-level traffic heatmap of a representative frontend pod. (b) shows the heatmap after regrouping servers in (a). (c) and (d) plot the sorted load of inter-pod traffic across racks in a representative database pod, before and after server regrouping, respectively. (d) shows the traffic stability over different timescales. 71
5.2 Aggregated server throughput before and after server regrouping for (a) one-to-one, (b) one-to-many (many-to-one) and (c) many-to-many traffic patterns. Source servers are colored in white and destination servers are colored in gray. 74
5.3 RDC architecture and control plane overview. (a) is an example of the RDC network topology. Circuit switches are inserted at the edge between servers and ToR switches. Connectivities for aggregation switches (agg.) and core switches remain the same as in traditional Clos networks. (b) presents an overview of the control plane. 78
5.4 An example of RDC's 0/1 rule update on an OpenFlow-enabled ToR switch. 82
5.5 Packaging design of an RDC pod. 88
5.6 An RDC prototype with 4 racks and 16 servers. 93
5.7 RDC reconfiguration improves throughput significantly (b) with negligible TCP disruption (a). 95
5.8 Performance comparison using Cache traffic. Network load is 50%. (a) CDF of flow completion times. (b) Distribution of path lengths. More results under different settings are in Appendix ??. 99
5.9 Performance comparison using packet simulations under different traffic workloads and network loads. 101
5.10 Application performance improvements of RDC compared with the 4:1 oversubscribed network (4:1-o.s.) and the non-blocking network (NBLK). (a) The HDFS read/write traffic pattern. (b) The HDFS transfer time. (c)-(d) Memcached query throughput and latency. (e) The DMM traffic pattern. (f) Average shift time and communication time. 104

Tables
3.1 Average FCT of 256 random multicasts on the OCS network under ToR size 256. The OCS uses splitters with fanout varying from 2 to 8. 27

4.1 List of notations 34
4.2 Road map of experiment setups 49
4.3 Break-down of failure recovery and diagnosis delay (ms) 57
4.4 Percentage of impacted flows/coflows in Figure 4.9 61

5.1 Cost estimates of network components and their quantities needed by a) an RDC pod with 4:1 oversubscription, b) a 4:1 oversubscribed packet switching network (4:1-o.s.), c) a rack-based hybrid circuit/packet switching network with 4:1 oversubscribed circuit bandwidth (Hybrid-1) and d) with non-blocking circuit bandwidth (Hybrid-2), and finally e) a non-blocking packet switching network (NBLK). "p.c." stands for private communication. 90
5.2 Break-down of control loop latency (ms) for traffic localization (TL) and uplink load-balancing (ULB). 96
Chapter 1
Introduction
In recent years, data centers have become one of the most important infrastructures in our modern digital life. Popular Internet services and applications such as instant messengers, web search, social networks, e-commerce, video streaming, and the Internet of Things are hosted in data centers. These services and applications generate a huge amount of data every day. For example, there are 350 million photos and videos uploaded, 5.7 billion comments and likes produced, and 4.75 billion pieces of content shared on Facebook each day [1]. Services at such a large scale require enormous computing and storage capability and are a natural fit for massive computing infrastructure. As a result, more and more companies and organizations are building hyper-scale data centers around the globe to meet the ever-increasing demands of their services and applications [2]. By aggregating computing and storage resources in a centralized warehouse-scale facility, data centers allow operators to amortize the capital and operational expenditures while tremendously reducing the management and maintenance complexity. From the perspective of cloud applications, the eventual goal of designing a data center is to provide an always-on, unified platform with almost unlimited compute and storage capacity. Behind this “unified platform”, however, are tens or hundreds of thousands of servers interconnected by high-capacity networks. Modern cloud-scale applications hosted on such a platform tend to be divided into a large number of tasks distributed across many servers, each responsible for a relatively small amount of work. The result of this work division is an increasing demand for high-performance data transfer among the servers.
Figure 1.1 : An example of a three-layer Clos network using homogeneous commodity switches. (The figure shows, from top to bottom, core switches, aggregation switches, ToR switches, and servers.)
Nowadays, network traffic in data centers is growing at a speed one order of magnitude faster than global Internet traffic [3]. According to estimations, the amount of data transferred had already reached 6.8 ZB in 2016 and will triple to 20.6 ZB by 2021 [3]. Essentially, data center networks have become a critical component for application performance and availability. On the one hand, emerging data-intensive applications such as Hadoop [4], Spark [5] and Tez [6] require ultra-high communication throughput between servers and have been continuously driving the design of networks with higher bandwidth. On the other hand, user-facing services such as web search and online transactions must achieve low latency and high availability (typically aiming for at least 99.99% uptime, i.e., "four nines," or about an hour of downtime per year), requiring a network that is resilient to traffic congestion and can recover from failures in minimal time.
1.1 Data Center Networks: State of the Art and Challenges
Considerable efforts have been made on designing DCN architectures to keep pace with the high traffic demand of services and applications. Over the years, networks constructed with multi-stage folded Clos topologies using commodity Ethernet switches have gained the highest popularity [7–10]. Typical Clos networks have two or three layers of switches that form a multi-rooted tree where Top-Of-Rack (ToR) switches are the leaves, the core switches are the roots, and an optional layer of aggregation switches sits in between. Figure 1.1 shows one example of a three-layer Clos network. Real-world data center networks are usually over-subscribed for economic reasons. Depending on the number of uplinks for each rack, the over-subscription ratio can vary from as low as 5:1 to as high as 20:1 [8]. Early generations of Clos networks are based on standard server racks. Each rack typically contains 20-40 servers, which are connected to a single ToR switch using 1 Gbps links. The ToR switches are then connected to the core switches via 2-4 10 Gbps links. To provide high aggregate bandwidth between racks, core switches require switching chips with high port-count and high backplane capacity. The primary challenge for these networks is that they do not scale: network capacity is upgraded by replacing the existing core switches with ones that have even higher port-count and backplane bandwidth, which incurs non-linear cost increases with network size. Rather than scale up, recent evolutions of DCNs scale out by employing a large number of inexpensive and homogeneous switches in the network core. The aggregate network bandwidth can be upgraded by adding commodity switches horizontally. The fat-tree network [7] is a typical example of these networks; it uses k-port switches to support full bisection bandwidth for k^3/4 servers. As is shown in Figure 1.1, in Clos networks there can be multiple shortest paths between two servers under different ToRs. This, in turn, requires a multi-path routing technique, such as ECMP [11]. To take advantage of multiple paths, ECMP performs flow-based hashing on the five-tuple and uniformly randomly chooses a path from a pool of multiple equal-cost candidates. ECMP effectively improves traffic load balancing among network links and reduces network congestion. Clos networks with multipath routing have been successful at addressing the bandwidth bottleneck and scalability issues in data center networking for the past decade [10]. However, as traffic workloads continue to evolve, many problems of the existing Clos network remain:
• Handling group communications. Clos networks are best for uniform server-to-server communications. However, many new applications rely on more skewed communication patterns, such as multicast transmissions. Multicast transmissions are common in many data center applications, such as model training in distributed machine learning [12] and data replication in Hadoop File Systems [13]. However, Clos networks cannot efficiently handle these transmissions. Due to the high complexity of managing multicast groups on such networks, many applications handle multicast transmission by simply sending the same copy of data multiple times to different servers [4], causing huge network resource wastage and degraded application performance.

• Bandwidth disparity. A Clos network is typically over-subscribed at the core, with typical over-subscription ratios somewhere between 5:1 and 20:1 [8]. This means that the bandwidth between servers can vary drastically, depending on how "close" they are on the physical topology. Servers communicating within the same rack enjoy full bisection bandwidth, but servers communicating across racks have much lower bandwidth. Some proposals try to provide full bisection bandwidth by adding more switches in the core [7, 9]. However, due to the non-uniform traffic pattern, this usually results in a network with low utilization and a significantly higher cost.

• Fault-tolerance. Despite the fact that individual switch and link failures are relatively rare, with 99.9% uptime [14], a large data center network may have thousands of switches and hundreds of thousands of links. The chance that some switch or link has failed is therefore very high. The state-of-the-art failure recovery solutions in data center networks are all more or less based on rerouting. While rerouting maintains connectivity, traffic may experience higher bandwidth contention due to the network capacity loss, and eventually, application performance suffers. What is missing is a truly fault-tolerant network that preserves application performance under failures.
1.2 Thesis Contributions
In this thesis, we argue that many of the challenges faced by Clos networks can be addressed by augmenting the existing physical network architecture with low-cost optical switching devices. We present the design and implementation of multiple systems that address the three challenges mentioned in the previous section, i.e., supporting group communication, removing bandwidth disparity, and improving fault-tolerance. One common goal behind these systems is to solve each of the above problems while retaining as many of the existing benefits of current Clos networks as possible, and without significantly increasing the hardware cost. In particular, this thesis makes the following contributions.
• First, we propose a network architecture, called HyperOptics, that augments the existing Clos network with a separate ToR-to-ToR network dedicated to multicast transmissions. A key contribution of HyperOptics is its novel connectivity design for the ToRs, which leverages physical-layer optical splitting technology to form a static, switchless topology. The topology of HyperOptics resembles the logical connectivity graph in traditional distributed hash tables [15]. We show that by offloading multicast traffic from the Clos network to HyperOptics, we can achieve line-rate multicast transmissions. Compared to a design using optical circuit switches, HyperOptics achieves ultra-low-latency transmissions due to its switchless topology. Despite these performance improvements, the overall cost of HyperOptics is still comparable with that of a network using optical circuit switches.

• Our second contribution is a fault-tolerant network architecture, called ShareBackup, that preserves application performance under various network failures. ShareBackup uses a small number of backup switches, shared network-wide, for repairing failures on demand so that the network quickly recovers to its full capacity without applications noticing the failures. Under the hood, this is enabled by inserting circuit switches between adjacent layers of Ethernet switches and connecting both the regular switches and the backup switches to the circuit switches. ShareBackup avoids the complications and ineffectiveness of rerouting by fast fail-over to backup switches at the physical layer. We implement ShareBackup on a hardware testbed. Its failure recovery is as fast as the underlying circuit switching hardware, causing no disruption to routing, and it accelerates big data applications by multi-fold factors under failures. We also use large-scale simulations with real data center traffic and failure models. In all our experiments, the results for ShareBackup differ little from the no-failure case.

• Our third contribution is the design and implementation of a "rackless" data center network, or RDC, that aims to remove the bandwidth disparity between intra-rack and inter-rack communications. RDC is a pod-centric DCN architecture that breaks the traditional rack boundaries in a pod and creates uniform high bandwidth for servers regardless of their topological locations. RDC creates the illusion that servers can move freely among ToR switches in response to traffic pattern changes. Rather than optimizing the workloads based on the topology, RDC optimizes the topology to suit the changing workloads. RDC is implemented by inserting circuit switches between the edge switches and the servers. It reconfigures the circuits on demand to form different connectivity patterns. We have performed extensive evaluations of RDC both on hardware testbeds and in packet-level simulations. RDC can achieve an aggregate server throughput close to that of a non-blocking network and an average path length 35% shorter. On realistic applications such as HDFS, Memcached, and MPI workloads, RDC can improve job completion times by 1.1-2.7×.
1.3 Thesis Organization
The rest of this thesis is organized as follows. Chapter 2 provides a brief introduction to the background of hybrid data center networks and the related work. Chapters 3, 4, and 5 present the three major contributions introduced above: HyperOptics, ShareBackup, and RDC. We conclude the thesis in Chapter 6.
Chapter 2
Background and Related Work
In this chapter, we discuss the background of optical technologies used in data centers and related works.
2.1 Optics in Data Centers
Due to the large bandwidth capacity and low power consumption of optics, a significant amount of research has been carried out in the past few years to develop optical interconnects and networking technologies for data centers [16–20]. In fact, it has been estimated that 70% energy savings can be obtained if data center infrastructure moves toward a fully optical network [21]. Optical interconnect schemes in data centers mainly rely on a mixture of active and passive optical devices to provide switching, routing, and interconnection. Among the many optical devices that offer such functionalities, we focus on the use of two types of devices, optical splitters and optical circuit switches, which provide passive line-rate multicast transmission and reconfigurable point-to-point connectivity at high switching speed, respectively. Optical splitters. As a passive physical-layer device, an optical splitter duplicates the optical signal coming from its input port into multiple copies and transmits them out via the output ports. Under the hood, this is implemented by physical optical power splitting. As illustrated in Figure 2.1, a key parameter of an optical splitter is its fanout, denoted by k, which represents the maximum number of copies into which the input signal can be duplicated. Due to power splitting, the power of the output optical signal will be a fraction of the input power. For example, there is a 3 dB power loss for k = 2 splitters. To ensure that the output power is above the sensitivity of an optical receiver, a splitter with a higher fanout requires higher input power.
Figure 2.1 : An illustration of the optical splitter. (a) is a logical representation of an optical splitter with 1 input port and k output ports. (b) shows a real splitter device with k = 8.

Figure 2.2 : An illustration of the optical circuit switch (OCS). (a) is a logical representation of an OCS with 8 input ports and 8 output ports, or an 8 × 8 OCS. (b) shows a real Calient 312 × 312 OCS.
Optical splitters are manufactured using the Planar Lightwave Circuit (PLC) technology and are commercially available with fanouts up to 32 [22]. Optical circuit switches. Optical circuit switches are physical-layer devices that have traditionally been used in wide-area backbone networks and SONET cross-connects to support telecommunications traffic loads. They provide bipartite connectivity between the input ports and the output ports. A key parameter of an OCS is its port-count. Figure 2.2 shows the logical representation of an 8 × 8 OCS and a commercially available Calient OCS with 312 input ports and 312 output ports. In an OCS, setting up a circuit (or switching) is typically accomplished through the use of Micro-Electromechanical Systems (MEMS) mirror arrays that physically direct the light beams to establish an optical data path between any input and any output port [23]. Once a circuit has been established, communication between the endpoints occurs at very high bandwidth and very low latency. Since all data paths are optical, no optical-to-electrical-to-optical (O-E-O) conversions are needed. This results in tremendous savings in optical transceivers at the endpoints and in overall power consumption compared to packet-switching counterparts. There are two key parameters of an OCS: port-count and circuit setup delay. Due to the mechanical nature of light redirection, the circuit setup delay varies from several milliseconds to tens of milliseconds [24, 25], which is several orders of magnitude slower than an Ethernet packet switch. OCSes with a few hundred ports are commercially available [24], and ones with 1100 ports have already been fabricated [26].
2.2 Hybrid Data Center Network Architectures
Many hybrid electrical/optical [17,18,27,28] or hybrid electrical/wireless [29,30] DCN architectures have been proposed during the past decade. These works all share a similar motivation and take a similar generic approach: creating on-demand connectivity between servers to mitigate the bandwidth bottleneck of DCNs under non-uniform traffic patterns. They differ only in the underlying physical technologies and in the algorithms for setting up those on-demand links. For example, Helios [18] and c-Through [17] propose to augment the existing electrical network with another separate network core using optical circuit switches. The electrical network is used for all-to-all low-volume communications, while the optical network is used for long-lived high-bandwidth communication. Similarly, the work by Zhou et al. [29] and Flyways [28] meets burst bandwidth demands by creating 60 GHz wireless links instead of optical links. One
key difference of our work on ShareBackup and RDC is that we embed the optical devices into the existing Clos topologies, instead of adding a separate network with its own architecture and traffic scheduling techniques. Our work promotes the integration of optical switching and packet switching in a more organic way and allows a coherent control system design for a single network architecture. We show in this thesis that this integration can solve many of the issues of Clos networks, such as bandwidth disparity and fault tolerance, while keeping their existing advantages.
2.3 Supporting Multicast Transmissions in Data Center Networks
Multicast transmission has long been a performance bottleneck in data center networks. The traditional in-network multicast solution, IP multicast, is problematic in many ways. First, it does not scale with aggregate hardware resources in the number of supported multicast groups. Second, existing protocols for IP multicast routing and group management are not able to construct multicast trees in milliseconds, as required by many data analytics applications. Third, IP multicast lacks a universally acceptable congestion control mechanism that is friendly to unicast flows. Even though some recent efforts have been made to improve its scalability in the data center context [31–33], the other problems have caused most network operators to eschew its use. As a result, in today's data center networks, the task of supporting multicast transmission is delegated to the applications. Applications themselves have to build and maintain an overlay network over the physical network. For instance, Spark supports multicast using a BitTorrent-style overlay among the recipient servers [5]. However, BitTorrent suffers from suboptimal multicast trees that introduce duplicate transmissions in the physical network. Due to the large volume of multicast traffic, repetitive data transmissions add a significant burden on the network. In fact, it has also been shown that when the overlay gets large, its throughput can collapse even when overlay forwarding experiences infrequent short delays [34]. The Hadoop Distributed File System (HDFS) improves on the BitTorrent-style overlay by sending the data to multiple receivers in a pipelined fashion, in which each data block is transmitted from the source to the first receiver, then from the first receiver to the second receiver, and so on. The reception of a data block is considered successful only when the last receiver has received the block. Although this pipelined data transfer relieves the link stress, it causes high latency for the transfer of each data block. More recently, researchers have recognized the advantages of leveraging optical splitters for data multicast and have proposed several optical multicast network architectures and control systems [35–38]. These systems are more efficient than the application overlays because optical data duplication is performed in the physical layer in a bandwidth-transparent way: data can be transmitted to multiple receivers as fast as the sender's transceiver speed without causing extra link stress. Also, a separate optical network avoids interference of multicast traffic with unicast traffic and greatly relieves the burden on the unicast network. However, an important drawback common to the above systems is that they require a high port-count optical circuit switch (OCS) acting as the switching substrate for the optical splitters. Using an OCS for multicast transmission is problematic. First, today's high port-count OCSes can only switch at a speed of milliseconds to tens of milliseconds. Second, optical circuit switches are high-cost active devices. For example, our recent quotes from vendors show that an OCS port can be two orders of magnitude more expensive than an optical splitter port. Different from these systems, our solution in HyperOptics adopts a switchless design that directly interconnects the ToR switches to form a regular graph. Such a design has two major advantages compared to the OCS-based multicast systems.
First, it can provide high bandwidth even at the packet granularity because the slow circuit switching delay is completely eliminated. Second, it scales well in the number of ToRs because the constraint of the OCS port-count no longer exists.
2.4 Failure Recovery in Data Center Networks
Clos topologies provide extensive path diversity and redundancy for end hosts. Connectivity for end-host pairs can still be established even under a considerable amount of device or link failures in the network core. To use this path richness for failure recovery, previous studies have explored the tradeoff between the speed and the quality of different rerouting approaches. Both fast local rerouting approaches, e.g., F10 [39], and optimal global rerouting approaches, e.g., Portland [9] and Hedera [40], have been proposed to handle network failures. In F10, the authors show that link failures can be recovered in 10 milliseconds by conducting fast local rerouting on a variation of the traditional FatTree topology. While this rerouting strategy can quickly find an alternative path under failures, the detour path can be significantly longer than the original one due to traffic bouncing between adjacent layers in the network. We refer to this problem as path dilation in this thesis. Path dilation not only enlarges latency and lowers throughput for the rerouted flows; more importantly, it may cause some links to become drastically congested, resulting in significant load imbalance in the network. F10 also proposed a traffic push-back strategy and a centralized load-balancer to mitigate this problem. However, these two schemes have to operate on a much larger time scale due to their high overhead. Further, to achieve fast rerouting, the rerouting rules have to be pre-installed on each of the switches, which has already been shown to have scalability issues because the number of rules increases quadratically with the number of servers [41]. Compared to the local rerouting approaches, centralized network monitoring designs, e.g., Portland [9] and Hedera [40], have the advantage of performing optimal rerouting and load balancing based on global network topology and traffic load information. However, this optimality comes at the cost of significantly reduced responsiveness to failures. After a failure is detected and reported to the monitor via heartbeats, the monitor needs to produce a new set of routing rules based on the changed network topology and install these rules on (potentially) all switches at runtime. In addition to the communication delay and computation delay of the centralized monitor, it takes even longer for the new rules to be installed and take effect. Beyond the data center scenario, packet rerouting strategies have long existed in the Internet and ISP networks. One class of approaches uses pre-computed backup routes when the primary routes fail. For example, Packet Re-cycling [42] associates each unidirectional link in the network with a unidirectional cycle that can be used to bypass it if it fails, through a process called cycle following. Packet Re-cycling can tolerate multiple failures in the network. If further failures are encountered on links along the backup path, the backup paths of these links are guaranteed to avoid previously encountered failures. R-BGP [43] pre-computes a few strategically chosen failover paths during BGP convergence and provides provable guarantees, such as loop prevention, as long as a policy-compliant path to the destination exists. Other works using pre-computed backup routes include IP restoration [44] and MPLS Fast-Reroute [45]. However, one major issue of using backup routes is that it does not scale in terms of the required forwarding table entries on network devices, which limits its resilience to only a few failures in the network [41,46].
FCP [46] is another rerouting paradigm, aiming to completely eliminate the convergence process upon failures. All routers in the network have a consistent view of the potential set of links. FCP allows data packets to carry information about failed links. It ensures that when a packet arrives at a router, that router knows about any relevant failures on the packet's previous path, thereby eliminating the need for the routing protocol to propagate failure information to all routers immediately. However, the techniques in FCP require nontrivial computation resources on each router because the router must re-compute the shortest path to a given destination for each failure combination it receives within a failure-carrying packet. Besides, FCP also adds considerable overhead to each packet header. While all these rerouting schemes are resilient to failed connections, they do not really fix the failures and restore the network capacity. Some hardware failures
may have long downtimes, and consequently, network traffic will have to suffer from bandwidth loss for an extended period of time. Our work in ShareBackup asks a very different question: can we recover the network capacity immediately upon failures without traffic disturbance? An affirmative answer would suggest that we can completely mask the failure impact from application performance. In Chapter 4, we will detail the design and implementation of ShareBackup and demonstrate that it is an economical and effective way to mask failures from application performance.
Chapter 3
HyperOptics: Integrating An Optical Multicast Architecture in Data Centers
In this chapter we present HyperOptics, a static and switchless optical network designed for multicast transmissions in data centers. The key building block behind HyperOptics is the optical splitter, which can passively duplicate optical signals into multiple copies at line rate. Building on top of optical splitters, HyperOptics directly interconnects ToR switches to form a novel connectivity structure, a k-regular graph, where k is the fanout of a splitter. We analytically show that this architecture is scalable and efficient for multicasts. Simulations show that running multicasts on HyperOptics can on average be multi-fold faster than the state-of-the-art solution.
3.1 Introduction
As datacenters scale up, online services and data-intensive computation jobs running on them have an increasing need for fast data replication from one source machine to multiple destination machines, i.e., the multicast service. Apart from traditional multicast applications such as simultaneous server OS installation and upgrade [47], data chunk replication in distributed file systems [4, 48, 49], and cache consistency checks on a large number of nodes [50], recent distributed machine learning models also see a huge demand for multicast services. The explosion of data allows the learning of powerful and complex models with 10^9 to 10^12 parameters [51,52], in which broadcasting the model parameters alone poses a challenge for the underlying network. Some learning algorithms require the processed intermediate data to be duplicated across different nodes. For example, the Latent Dirichlet Allocation algorithm for text
mining needs to multicast the word distribution data in every iteration [53]. A few thousand iterations of LDA with 1 GB of data per iteration would easily cause over 1 TB of multicast data transfer in today's datacenters. Reducing the multicast delay would significantly accelerate machine learning jobs. However, multicast services are still not natively supported by current datacenters. The most established solution is IP multicast, which was originally designed for the Internet. Even though some efforts have been made to improve its scalability in the datacenter context [31–33], the complex dynamic multicast tree building and maintenance, the potentially high packet loss rate and costly loss recovery, and the lack of satisfactory congestion control have caused most network operators to eschew its use. On the other hand, as data size continues to grow, there is an increasing trend towards deploying a high-bandwidth (40/100 Gbps) network core for datacenters [54]. However, high data rate transmissions are not feasible over even modest-length electrical links. For example, data transmissions on traditional twinax copper cable propagate at most 7 m at 40 Gbps due to power limitations [55]. Optical communication technologies are well suited to such high-bandwidth networks. The advantages of optical devices and links, such as data rate transparency, lower power consumption, less heat dissipation, lower bit-error rate, and lower cost, have been noted or already exploited by the industry [56]. As datacenters gradually evolve from electrical to optical, we believe a system design that fully leverages the key physical features of optical technologies is necessary for future datacenters. In this chapter, we propose HyperOptics, a novel optical multicast architecture for datacenters. HyperOptics follows recent efforts such as [18,20,36,37,57,58] that augment the traditional electrical network with a high-speed optical network, but HyperOptics dedicates the optical network to multicast transmissions. The existing optical network proposals usually employ an Optical Circuit Switch (OCS) to provide configurable connectivity for ToRs. The switching speed of today's large port-count
OCSes is, however, orders of magnitude slower (about tens of milliseconds) than that of packet switches. In [20], the authors propose a specific implementation of OCS that is capable of switching in microseconds, but it cannot scale to a large port-count due to the limited number of available optical wavelengths. Also, OCSes are high-cost devices. According to our recent quote from a vendor, a 192-port OCS would cost 365 K USD. All these problems of OCSes motivate us to design an optical network that avoids their use and directly interconnects the racks with low-cost optical splitters. The design of HyperOptics is inspired by Chord's [15] way of organizing peer nodes in traditional overlay networks. Each ToR in HyperOptics can talk to multiple neighbor ToRs simultaneously via passive optical splitters, by which the ToRs form the connectivity of a regular graph. We identify two main advantages of HyperOptics over the OCS architecture. First, HyperOptics can provide high bandwidth even at the packet granularity because the slow circuit switching delay is completely eliminated. Second, unlike the existing OCS architecture, HyperOptics scales well in the number of ToRs because the constraint of the OCS port-count no longer exists in HyperOptics. We show that HyperOptics is well suited for high-throughput and low-latency multicast transmissions. Data from one ToR can be physically duplicated via an optical splitter to multiple ToRs at line speed. For multicasts with large group sizes, data is relayed by some intermediate ToRs. Due to the path flexibility of regular graphs, we show that the maximum path length for any multicast is bounded by log n, where n is the number of ToRs. Another distinguishing property of HyperOptics is that it can support two simultaneously active multicasts of maximal group size. To take full advantage of the underlying optical technologies, we propose a centralized control plane that manages the routing policy and multicast scheduling. Preliminary simulations show that HyperOptics can on average be 2.1× faster than the OCS architecture for multicast services.
3.2 HyperOptics Architecture
We first introduce the connectivity structure of ToRs and then discuss the routing strategy under the given network architecture. Next, we analyze the multicast performance and the wiring complexity of HyperOptics. And finally, we present an overview of the system.
3.2.1 ToR Connectivity Design
We assume that all splitters have the same fanout k and that the number of ToRs is n = 2^k. In our model, optical signals can only pass through the splitters in one direction, i.e., from the input port to the output ports. The ToRs are interconnected as a special k-regular graph. The only difference between HyperOptics and a general k-regular graph is that a node (ToR) can only send the same data to its k neighbors simultaneously. This limitation comes from the fact that the splitter just passively duplicates the input signal on its output ports.

Assume that the ToRs are denoted t_0, t_1, ..., t_{n-1}. All ToRs are logically organized on a circle modulo 2^k. Each ToR t_i is connected to the input port of splitter s_i. The output ports of s_i are connected to t_{i+2^0}, t_{i+2^1}, ..., t_{i+2^(k-1)}, respectively. Note that the gap between t_i's two consecutive neighbors increases exponentially, which is very similar to Chord [15] in organizing peer nodes in overlay networks. Since all ToRs are on a logical circle, the operations above are all modulo 2^k. For example, if k = 3 and n = 8, the third neighbor of t_4 is t_{4+2^2} = t_0. An example of HyperOptics with k = 3 is given in Figure 3.1. We only show the connectivity of t_0, t_3, t_4 and t_6 in the figure; the other ToRs are connected in a similar way, e.g., t_1 is connected to t_2, t_3, t_5. The full connectivity of the architecture is shown in the table at the bottom of the figure.
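To make the connectivity rule concrete, the following Python sketch (illustrative only, not part of the thesis; the function name is ours) computes the neighbor set of every ToR for a given splitter fanout k.

def hyperoptics_neighbors(k):
    """Neighbors of every ToR in HyperOptics with splitter fanout k.

    ToR t_i feeds a 1-to-k splitter whose outputs reach
    t_{i+2^0}, t_{i+2^1}, ..., t_{i+2^(k-1)}, indices modulo n = 2^k.
    """
    n = 1 << k
    return {i: [(i + (1 << j)) % n for j in range(k)] for i in range(n)}

# For k = 3 (n = 8), ToR 4's neighbors are t_5, t_6 and t_0,
# matching the example above (the third neighbor of t_4 is t_0).
assert hyperoptics_neighbors(3)[4] == [5, 6, 0]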
3.2.2 Routing and Relay Set Computation
Routing traffic to indirect destinations needs relays. For example, in Figure 3.1, a possible path, shown as dashed lines, from t_0 to t_7 is t_0 → t_4 → t_6 → t_7, where t_4 and t_6 are relays.
[Figure 3.1 depicts each ToR feeding a 1-to-3 splitter whose outputs reach its three neighbors. The connectivity table at the bottom of the figure is:]

Source ToR:  0 1 2 3 4 5 6 7
Neighbor 1:  1 2 3 4 5 6 7 0
Neighbor 2:  2 3 4 5 6 7 0 1
Neighbor 3:  4 5 6 7 0 1 2 3

Figure 3.1 : An example of HyperOptics connectivity with splitter fanout k = 3. The connectivity of t_0, t_3, t_4 and t_6 is shown in the figure. All other ToRs are interconnected to their neighbors in a similar way. The table on the bottom demonstrates the connectivity of all ToRs.
There may exist multiple paths between each ToR pair. The relay set of a multicast is mainly determined by the routing strategy of HyperOptics. For simplicity, we propose a best-effort routing strategy for HyperOptics. We note that our routing strategy might not be optimal and there is room for improvement, but it already provides satisfactory gains, as we will show in Sec. 3.3. For a single source-destination pair, best-effort routing always designates the neighbor that is nearest to the destination as the next relay. Also, we ensure that the index of the next relay is logically smaller than the destination. Mathematically, given a destination t_j, a relay ToR t_i will specify t_{i+2^⌊log(j-i)⌋} as the next relay. The routing algorithm will recursively compute the remaining relays as if the next relay were the source. For example, consider the traffic from t_0 to t_7 in Figure 3.1: the next relay for t_0 is t_4, because t_4 is the neighbor of t_0 that is nearest to t_7, and the next relay for t_4 is t_6. Hence, the relay set for the path from t_0 to t_7 is {t_4, t_6}. Note that the next relay of t_4 is not t_0, because t_0 has logically passed the destination t_7. For multicasts, best-effort routing will compute the relay set for each individual destination and then
return the union of all relay sets.
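As a concrete illustration of the best-effort rule, here is a small Python sketch (ours, not the thesis implementation); indices are taken modulo n = 2^k, and, as in the t_0-to-t_7 example, the relay set excludes the source and the destinations themselves.

def next_relay(cur, dst, k):
    """Best-effort next hop: the neighbor of cur nearest to dst without passing it."""
    n = 1 << k
    gap = (dst - cur) % n                              # logical distance on the ring
    return (cur + (1 << (gap.bit_length() - 1))) % n   # cur + 2^floor(log2(gap))

def relay_set(src, dsts, k):
    """Union of the relays needed to reach every destination of a multicast."""
    relays = set()
    for dst in dsts:
        cur = src
        while cur != dst:                  # each destination is reached in at most k hops
            cur = next_relay(cur, dst, k)
            if cur != dst:
                relays.add(cur)
    return relays

# From t_0 to t_7 with k = 3, the relays are t_4 and t_6, as in the example above.
assert relay_set(0, [7], 3) == {4, 6}

Because every hop adds the largest power of two that does not overshoot the remaining gap, at most k = log n hops are needed for any destination, which matches the bound established in the next section.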
3.2.3 Analysis
We now analyze the multicast performance under the design of HyperOptics and compare the cost of HyperOptics with that of the traditional OCS networks. Multicast hop-count: The hop-count of a multicast characterizes the minimum latency of a packet traversing from the source to the destination. The following lemma gives the worst-case and average hop-count of a multicast in our architecture.

Lemma 1. The maximum hop-count of a multicast under best-effort routing is upper-bounded by log n, and the average hop-count is (log n)/2.

Proof. All ToRs in HyperOptics are logically equal. Without loss of generality, we consider a multicast originating from t_0. The k direct neighbors of t_0 are ToRs 2^0, ..., 2^(k-1); these IDs differ from 0 by only one bit. Similarly, the IDs of ToRs that are two hops away from t_0 differ by two bits. The farthest ToR differs by k bits. In best-effort routing, traversing a hop is equivalent to flipping the most significant bit of the source ToR's ID that differs from the corresponding bit of the destination's ID. Therefore, the largest hop-count is k = log n. The number of ToRs that are j hops away from t_0 is C(k, j), for 1 ≤ j ≤ k, where C(k, j) denotes the binomial coefficient. The average hop-count is

  (Σ_{j=1}^{k} j · C(k, j)) / (Σ_{j=1}^{k} C(k, j)) = (k · 2^(k-1)) / 2^k = k/2 = (log n)/2.

For one hop, the signal decoding and packet processing can be done in sub-nanosecond time [59]. Therefore, for a datacenter with 1 K racks, the average latency for a multicast is less than 0.5 × log 1000 × 1 ns ≈ 5.0 ns. In the following, we simply assume that the multicast latency is negligible.

Simultaneously active multicasts: Each ToR in HyperOptics has k direct neighbors. In an extreme case where all group members of a multicast are the source's direct neighbors, HyperOptics could support n active multicasts simultaneously. In the other extreme case, where multicasts' group sizes are maximal and need the largest number of relays, the number of simultaneously active multicasts would be much smaller.
Figure 3.2 : Two broadcast trees originating from t_i and t_{i+2^(k-1)}. Solid circles are relays. The union of the relays and the last neighbor of each relay, shown by squares, form a complete set of all ToRs. The two broadcasts have disjoint relay sets.
However, the following lemma shows that HyperOptics still has the capability of servicing multiple multicasts simultaneously in the worst case.
Lemma 2. HyperOptics can simultaneously service two one-to-all multicasts.
Proof. We consider two broadcast sources, ToR i and ToR i + 2^(k-1). Under best-effort routing, we draw the two broadcast trees in Fig. 3.2, where solid circles are relays and squares are the last neighbor of each relay. As can be seen, the relay set and the last neighbor of each relay form a complete set of all ToRs for each broadcast. ToR i's relay set is {i, i+1, i+2, ..., i+2^(k-1) - 1}, while ToR (i+2^(k-1))'s relay set is {i+2^(k-1), i+2^(k-1)+1, i+2^(k-1)+2, ..., i-1}. The two relay sets are disjoint, and therefore both broadcasts can be active simultaneously.
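The disjointness and coverage claims can be checked mechanically. The sketch below (ours, for illustration) builds the two relay sets exactly as stated in the proof and verifies that they are disjoint and that each relay set, together with the last neighbor of every relay, covers all ToRs; it is shown for k = 3.

def broadcast_relays(root, k):
    """Relay set of the broadcast tree rooted at root, as in the proof of Lemma 2:
    the 2^(k-1) consecutive ToRs starting at the root (indices modulo n = 2^k)."""
    n = 1 << k
    return {(root + j) % n for j in range(n // 2)}

def covered(root, k):
    """Relays plus the last neighbor (offset 2^(k-1)) of every relay."""
    n = 1 << k
    relays = broadcast_relays(root, k)
    return relays | {(r + n // 2) % n for r in relays}

k = 3
a = broadcast_relays(0, k)              # {0, 1, 2, 3}
b = broadcast_relays(1 << (k - 1), k)   # {4, 5, 6, 7}
assert a.isdisjoint(b)                  # the two broadcasts never share a relay
assert covered(0, k) == covered(1 << (k - 1), k) == set(range(1 << k))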
ToR port-count: In HyperOptics, each ToR is connected to the input port of a 1 × k splitter. One splitter would take up k + 1 ports across the ToRs. The average number of occupied ports on each ToR would therefore be n(k + 1)/n = k + 1 = 1 + log n.
Cost: Even though HyperOptics does not use an OCS, it occupies more ToR ports than the OCS network. The per-port OCS cost is 1.5 K USD, derived from our recent vendor quote (365 K USD for a 192-port switch) with a 20% discount factored in. The per-port costs of a ToR and a transceiver are 100 USD and 200 USD, respectively, from [?, 60]. Splitters are very inexpensive at 5 USD per port [61]. For a medium-size datacenter with 128 ToRs where each ToR is connected to other ToRs via 40 Gbps links, the total networking cost for HyperOptics is approximately 0.31 M USD. The total cost of the OCS network using a commercially available 192-port OCS is comparable at 0.33 M USD. For a datacenter with 256 racks, the total costs for HyperOptics and the OCS network using a 320-port OCS become 0.69 M and 0.56 M USD, respectively. HyperOptics is thus cost-comparable with the OCS architecture under the current prices of the different network elements.

Wiring complexity: The total number of fibers needed to interconnect the ToRs is n log n, but many of them are short fibers that only go across a few racks. In a datacenter with 2^k racks, the k fibers from each ToR will go across 2^0, 2^1, ..., 2^(k-1) racks, respectively. For instance, in a datacenter with 256 racks, only 2 fibers will go across over 50 racks for each ToR. The total number of long fibers that go across over 50 racks is 2 × 256 = 512. For large datacenters with thousands of racks, we envisage that the ToRs are packaged into Pods. Pods can be wired in the same way as if one Pod were a single virtual ToR. This hierarchical organization of ToRs would significantly reduce the number of global fibers. A systematic study of this hierarchical design is our future work.
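The cost comparison can be reproduced with a back-of-the-envelope calculation. The Python sketch below is ours and rests on stated assumptions (every electrical port counted here carries one transceiver, splitter cost is counted per splitter port, and the OCS network attaches one uplink per ToR); under those assumptions it recovers the quoted figures to within rounding.

TOR_PORT, TRANSCEIVER, OCS_PORT, SPLITTER_PORT = 100, 200, 1_500, 5  # USD per port

def hyperoptics_cost(n_tors):
    """n_tors is a power of two; each ToR uses 1 + log2(n) ports plus one 1-to-k splitter."""
    k = n_tors.bit_length() - 1
    tor_ports = n_tors * (k + 1)          # 1 splitter input + k splitter outputs per ToR
    splitter_ports = n_tors * (k + 1)     # one 1-to-k splitter per ToR
    return tor_ports * (TOR_PORT + TRANSCEIVER) + splitter_ports * SPLITTER_PORT

def ocs_network_cost(n_tors, ocs_ports):
    """One ToR uplink per rack into the OCS; splitter cost on spare OCS ports is negligible."""
    return ocs_ports * OCS_PORT + n_tors * (TOR_PORT + TRANSCEIVER)

print(hyperoptics_cost(128), ocs_network_cost(128, 192))  # roughly 0.31 M vs 0.33 M USD
print(hyperoptics_cost(256), ocs_network_cost(256, 320))  # roughly 0.70 M vs 0.56 M USD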
3.2.4 System Overview
Fig. 3.3 shows an overview of the HyperOptics architecture.

Figure 3.3 : An overview of the HyperOptics system. ToRs send multicast requests (id, s, D, f) and finish commands to the HyperOptics manager; the manager sends back start commands and, to each ToR i, the list of multicast IDs that require ToR i as a relay.

Our current design of HyperOptics assumes that the network core bandwidth can be fully utilized. This assumption holds when the link bandwidth between a server and its ToR is the same as the inter-rack bandwidth, or when a server's bandwidth is lower than the inter-rack bandwidth but multiple sources within a rack share the same destination set. The work-flow of HyperOptics is as follows. The manager first receives multicast requests from source servers. A multicast request contains the request ID, the source server, the destination servers, and the flow size. The manager then computes the relay set for each request and sends to each ToR i a list of the IDs of the multicasts that require ToR i as a relay. All multicast data packets carry a multicast ID in their headers. During the service period, when ToR i receives a packet, it reads the packet header, checks whether it is a relay for the packet, and relays the packet if it is. Note that this rule installation process is conducted only once before each scheduling cycle. Since relays are non-sharable resources for a multicast, multicasts that require common relays must be serviced sequentially. The HyperOptics manager computes a schedule for all requests, which we will discuss in the next section. Every time a server finishes sending its multicast traffic, it sends a finish message to the manager; the manager then checks whether it is time to schedule the next batch of multicasts. If so, the manager sends a start message to the source servers of the next batch. Rules for the current scheduling cycle are deleted on the ToRs before the next cycle begins.
3.2.5 Multicast Scheduling
Given the input of n multicast requests, we now consider how to schedule these multicasts such that the overall delay is minimized. We formulate this problem as
a max vertex coloring problem [62], where a vertex corresponds to a multicast and the edges correspond to the conflict relations among multicasts, i.e., if two multicasts have common relays, there is an edge between them. The weight of a vertex corresponds to the flow size of the multicast. Max vertex coloring has been shown to be strongly NP-hard [63]. We therefore focus on efficient heuristics. HyperOptics adopts a heuristic called Weight-based First Fit (WFF), in which the vertices are first sorted in non-increasing order of their weights. WFF then scans the vertices and assigns each vertex the least-indexed color that is consistent with its already colored neighbors. The WFF heuristic is a specific version of the online coloring method for general graph coloring problems, whose approximation ratio is analyzed in [64]. The time complexity of WFF is Θ(|V|^2).
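A minimal sketch of WFF follows, assuming the conflict graph is given as an adjacency list and the vertex weights are the multicast flow sizes; the function and variable names are our own, not those of the HyperOptics codebase.

#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// Weight-based First Fit (WFF): sort multicasts (vertices) by non-increasing
// flow size (weight), then give each vertex the smallest color not used by its
// already-colored neighbors. Colors correspond to sequential service batches.
// Runs in Theta(|V|^2) time on dense conflict graphs.
std::vector<int> wff_color(const std::vector<uint64_t>& weight,
                           const std::vector<std::vector<int>>& adj) {
    const int n = static_cast<int>(weight.size());
    std::vector<int> order(n);
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return weight[a] > weight[b]; });

    std::vector<int> color(n, -1);
    for (int v : order) {
        std::vector<bool> used(n, false);        // colors taken by neighbors
        for (int u : adj[v]) {
            if (color[u] >= 0) used[color[u]] = true;
        }
        int c = 0;
        while (used[c]) ++c;                     // least-indexed free color
        color[v] = c;
    }
    return color;                                // color[i] = batch of multicast i
}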
3.3 Evaluation
In HyperOptics, the inter-rack link bandwidth is 40 Gbps. We simulate the following two networks to compare with HyperOptics.
3.3.1 Compared networks
OCS network: Each ToR is connected to an OCS via a 40 Gbps link. The OCS has 320 ports, among which some are occupied by the ToRs and the remaining ports are reserved for optical splitters. The number of splitters varies with the fanout of each splitter. The maximum group size achieved by cascading m 1×k splitters is k + (m − 1) × (k − 1) (a worked example is given below). We assume the OCS reconfiguration delay is 25 ms according to commercially available products [24]. As discussed in Sec. 3.2.3, the total cost of this network is comparable to HyperOptics.

Conceptual OCS network: We assume the Conceptual OCS has zero reconfiguration delay and sufficient port count to support arbitrary multicast group sizes. The other configurations are the same as the OCS network. This network is not feasible in practice; it only serves as a comparison baseline to isolate the effect of different design
components of HyperOptics.
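As a worked check of the splitter-cascading formula in the OCS network description above (our own illustrative example, not taken from the original text):

\[
\text{max group size} = k + (m-1)(k-1), \qquad m = 3,\; k = 4 \;\Rightarrow\; 4 + (3-1)(4-1) = 10,
\]

i.e., cascading three 1×4 splitters lets a single OCS circuit reach up to ten receivers, with each additional splitter stage contributing k − 1 = 3 more receivers.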
3.3.2 Simulation setup
The control plane delay consists of the scheduling algorithm computation time, the rule installation time, and the control message transmission time between the manager and the servers. The computation time (measured at run-time) and the rule installation time (about 8.7 ms [37]) are one-time overheads for each scheduling cycle. The control messages between the manager and the servers can be implemented using any existing RPC solution, and their delay has been shown to be less than 2 ms [36,37]. We assume that in a scheduling cycle every rack has exactly one server that generates a multicast request (id, i, D, f) with itself as the source, D being a random subset of servers in other racks as receivers (each rack has a 50% chance of containing some receivers for each source), and f being a random flow size between 10 MB and 1 GB. The number of requests is thus equal to the number of ToRs. We repeat the experiment 500 times and report the average result. This traffic pattern helps us evaluate the network core capacity of the HyperOptics architecture. Note that the group size of a multicast is constrained in the OCS network due to the limited number of ports available for splitters. For a fair comparison, we make sure that all multicast group sizes are no larger than the largest group size that the OCS network can support. We apply the WFF scheduling algorithm to both HyperOptics and the OCS network and compare the total flow completion time (FCT) of the multicasts. The conflict relations of multicasts are only slightly different in the OCS network than in HyperOptics: in the OCS network, multicasts conflict when they share some destinations, when there are not enough splitter resources to service them simultaneously, or when one multicast's source is another multicast's destination.
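The following sketch generates the synthetic workload described above, simplified to rack granularity (receivers are drawn per rack rather than per server); all names are illustrative, and the group-size cap imposed for the OCS comparison is omitted.

#include <cstdint>
#include <random>
#include <set>
#include <utility>
#include <vector>

struct Request {
    int id;
    int source_rack;
    std::set<int> receiver_racks;
    uint64_t flow_size_bytes;
};

// One multicast request per rack: every other rack is a receiver with
// probability 0.5, and the flow size is uniform in [10 MB, 1 GB].
std::vector<Request> generate_requests(int num_racks, std::mt19937& rng) {
    std::bernoulli_distribution is_receiver(0.5);
    std::uniform_int_distribution<uint64_t> size(10ull * 1000 * 1000,
                                                 1000ull * 1000 * 1000);
    std::vector<Request> reqs;
    for (int src = 0; src < num_racks; ++src) {
        Request r{src, src, {}, size(rng)};
        for (int dst = 0; dst < num_racks; ++dst) {
            if (dst != src && is_receiver(rng)) r.receiver_racks.insert(dst);
        }
        reqs.push_back(std::move(r));
    }
    return reqs;
}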
Figure 3.4 : Comparison of the average total FCT of all three networks. The speedups of HyperOptics over the OCS network are labeled in the figure. (The plot breaks down the average total FCT in seconds into OCS switching or control message overhead and service delay for the OCS, Conceptual OCS, and HyperOptics networks, for 16, 32, 64, 128, and 256 ToRs/multicast requests; the labeled speedups range from 1.95× to 2.40×, increasing with the number of ToRs.)
Fanout     2      4      6      8
FCT (s)    17.5   17.8   17.7   17.8

Table 3.1 : Average FCT of 256 random multicasts on the OCS network under ToR size 256. The OCS uses splitters with fanout varying from 2 to 8.
3.3.3 Effect of Splitter Fanout Used in OCS
The overall multicast delay for the OCS network might vary as the splitter fanout changes. Table 3.1 shows the average FCT of 256 random multicasts on the OCS network with varying splitter fanout. It can be seen that the FCT remains nearly constant. Intuitively, a smaller (larger) splitter fanout would yield a better result when the multicast group size is small (large). In the following experiments, we always report the best result of the OCS network across the various splitter fanouts.
3.3.4 Performance Comparison
Fig. 3.4 shows the average FCT for the three networks. We see that the speedup of HyperOptics over the OCS network is 2.13× on average, and the speedup increases with an increasing number of ToRs. We identify two reasons for HyperOptics' advantage. First, because HyperOptics does not use the OCS, the high reconfiguration delay (incurred every time the circuits need to change) is completely eliminated. As can be seen, the overhead of the OCS, mainly its reconfiguration delay, is on average 24× larger than the overhead of HyperOptics, which consists of only the 2 ms control message delay between the manager and the servers. Second, in the OCS network, a ToR can only receive traffic from one other ToR at a time. As a result, multicast requests that share some common destinations must be serviced sequentially. In HyperOptics, by contrast, the ToRs are interconnected in a log n-regular graph, so each ToR can receive traffic from log n other ToRs simultaneously. We observe that even the Conceptual OCS network is still 1.8× slower than HyperOptics. This shows that the unique connectivity structure of HyperOptics alone can lead to a significant FCT improvement.
Computation Time of Control Algorithm
We run our C++ implementation of WFF scheduling on a 3.4 GHz, 4 GB RAM Linux machine. As shown in Fig. 3.5, the time cost is less than 80 ms with 600 requests and less than 18 ms with 256 requests. In addition, this time cost is a one-time overhead per scheduling cycle; the manager does not need to recompute the schedule during the service period. The results in Fig. 3.5 demonstrate that the HyperOptics manager is responsive in handling a large number of requests.
3.4 Summary
We have presented HyperOptics, a multicast architecture for datacenters. A key contribution of HyperOptics is its novel connectivity design for the ToRs, which leverages physical-layer optical splitting technology.
Figure 3.5 : Computation time of HyperOptics' control plane under different numbers of multicast requests. (The y-axis shows the computation time in ms, from 0 to 80; the x-axis shows the number of requests, from 100 to 600.)
HyperOptics achieves high throughput and overcomes the high switching delay of the OCS. We show that the overall cost of HyperOptics is comparable with that of the OCS network, yet HyperOptics is on average 2.1× faster than the OCS network for multicast services. Our current routing and scheduling techniques in HyperOptics are quite basic and leave much room for improvement. Our next step is to explore alternative routing and scheduling techniques to fully exploit the HyperOptics architecture.
Chapter 4
ShareBackup: Masking Data Center Network Failures from Application Performance
In this chapter, we present a system called ShareBackup as an economical and effective way to mask failures from application performance. ShareBackup uses a small number of backup switches, shared network-wide, to repair failures on demand so that the network quickly recovers to its full capacity without applications noticing the failures. This approach avoids the complications and ineffectiveness of rerouting. We propose ShareBackup as a prototype architecture to realize this concept and present its detailed design. We implement ShareBackup on a hardware testbed. Its failure recovery takes merely 0.73 ms, causing no disruption to routing, and it accelerates Spark and Tez jobs by up to 4.1× under failures. Large-scale simulations with real data center traffic and a real failure model show that ShareBackup reduces the percentage of job flows prolonged by failures from 47.2% to as little as 0.78%. In all our experiments, the results for ShareBackup differ little from the no-failure case.
4.1 Introduction
The ultimate goal of failure recovery in data center networks is to preserve application performance. In this chapter, we propose shareable backup as a ground-breaking solution towards that goal. Shareable backup allows the entire data center to share a pool of backup switches. If any switch in the network fails, a backup switch can be brought online to replace it. The failover should be fast enough to avoid disruption to applications. With the power of shareable backup, it is possible for the first time to repair failures instantly instead of making do with a crippled network.
Shareable backup is a natural quest due to the ineptness of rerouting, the mainstream solution to fault tolerance in data center networks [7–9,39,65–69]. While rerouting maintains connectivity, bandwidth is nonetheless degraded under failures. The rerouted traffic may contend with other traffic originally on the path, thus spreading the effect of a failure to a wider range of the network. Routing convergence is known to be slow [70], and even path re-computation on a centralized management entity is expensive [9,68]. This latency is especially harmful to interactive applications with rigid deadlines. Rerouting also risks misconfigurations when updating routing tables, which may cause the network to dysfunction. There are further overheads as well, e.g., slow failure propagation, longer alternative paths, and excessive state exchange. With all these factors, application performance may be jeopardized drastically. According to a failure study of a path-rich production data center, 10% less traffic is delivered for the median case of the analyzed failures, and 40% less for the worst 20% of failures [71]. Injecting these failures into our simulation of a real data center setting (Section 4.6.7), 42% of jobs are slowed down by at least 3× (Figure 4.9(b)), 51% of jobs miss deadlines (Figure 4.10(b)), and 21.3% of flows not on the path of failure still get affected because of rerouting (Table 4.4).

Shareable backup is desirable for its cost-effectiveness. The pool of backup switches need not be large in practice, because failures in data centers are rare and transient. The above failure study shows that most devices have over 99.99% availability and that failures usually last for only a few minutes [71]. With shareable backup, we for the first time achieve network-wide backup at low cost, which is impossible for the traditional 1:1 backup that requires a dedicated spare for each switch.

Shareable backup is achievable with circuit switches, which have been used to facilitate physical-layer topology adaptation in many novel network architectures [16–18,20,72–75]. Theoretically, if the pool of backup switches and all the switches in the network are connected to a circuit switch, any switch can be replaced by changing the circuit switch connections. However, a circuit switch has limited port
count, and layering multiple circuit switches to scale up increases insertion loss. Rather than scaling up, recent proposals scale out low-cost, modest-size circuit switches by distributed placement across the network [74,75]. We adopt this approach to partition the network into smaller failure groups and realize shareable backup in each group. In this work, we design a prototype architecture, named ShareBackup, to explore the feasibility of shareable backup on fat-tree [7], a typical network topology found in data centers [10,76]. We have implemented ShareBackup and its competing solutions on a hardware testbed, a Linear Programming simulator, and a packet-level simulator. We have conducted extensive evaluations, including TCP convergence, control system latency, bandwidth capacity, transmission performance at scale with a real traffic and failure model, and benefits to Spark and Tez jobs on the testbed. The key properties of ShareBackup are: (1) failure recovery takes only 0.73 ms, combining latencies from hardware and the control system; (2) it restores bandwidth to full capacity after failures, and routing is not disturbed; (3) in all our experiments, its performance difference from the no-failure case is negligible, proving its ability to mask failures from application performance; (4) under failures, it accelerates Spark and Tez jobs by up to 4.1× and reduces the percentage of job flows slowed down by failures in the large-scale simulation from 47.2% to 0.78%.
4.2 Related Work
Data center network architectures rely on rich redundant paths for failure resilience [7,8,65–68]. Among them, fat-tree is the most popular in practical use [7]. ShareBackup builds on top of fat-tree, so it is related to other proposals that enhance the fault tolerance of fat-tree networks. PortLand reroutes traffic to globally optimal paths based on a central view of the network at the fabric manager. F10 reduces delays from failure propagation and path re-computation by local rerouting at switches [39], at the cost of longer paths. It also adjusts the wiring of fat-tree to form AB fat-tree, which provides
diverse paths for local rerouting. Aspen Tree adds different degrees of redundancy to fat-tree to tune the local rerouting path length [69]. It either partitions the network or adds extra switches to obtain more paths; to keep the host count, it requires at least one more layer, or 40% more switches. ShareBackup takes a completely different approach: instead of rerouting, it deploys backup switches in the physical layer. We compare with PortLand, F10, and Aspen Tree in the evaluations to explore interesting properties of ShareBackup.

Besides architectural solutions, many works tackle failures in data centers from different angles. NetPilot and CorrOpt give operational guidance for manually mitigating the effect of failures [77,78]. ShareBackup instead automatically replaces failed switches to restore the full capacity of the network, and its recovery speed is significantly faster, e.g., sub-ms vs. tens of minutes. Subways and the Hot Standby Router Protocol suggest multi-homing hosts to several switches to avoid a single point of failure [79,80], which consumes more ports on the hosts and switches. ShareBackup provides more efficient redundancy at the network edge without multi-homing, and we invent a lightweight VLAN-based solution that makes backup switches hot standbys with no additional latency (Section 4.4.4). In the context of rerouting, there is a large body of work on local fast failover [46,81–87], some of which causes an explosion of backup routes; Plinko accordingly introduces a forwarding-table compression algorithm [88]. ShareBackup does not depend on rerouting for failure recovery, so it avoids these complications and its forwarding tables are intrinsically small. On the application level, Bodik et al. propose intelligent service placement for both fault tolerance and traffic locality [89], and computation frameworks such as Spark [5] and Tez [6] restart tasks elsewhere when workers are lost. ShareBackup provides a more reliable network, so service placement has fewer constraints. Our experiment in Section 4.6.8 shows that application-level resilience is insufficient: performance is degraded several-fold if hosts are disconnected. Thus, in-network failure recovery is extremely important.
Table 4.1 : List of notations
Notation    Meaning
k           Fat-tree parameter: switch port count and # Pods [7]
n           # backup switches shared by k/2 switches per failure group
Hj          The jth host
Ei,j        The jth Edge switch in the ith Pod
Ai,j        The jth Aggregation switch in the ith Pod
Cj          The jth Core switch
CSl,i,j     The jth Circuit Switch in the ith Pod on the lth layer
FGl,u       The uth Failure Group on the lth layer
BSl,u,v     The vth Backup Switch in FGl,u
UPp         The pth UPward-facing port of a circuit switch
DOWNp       The pth DOWNward-facing port of a circuit switch
4.3 Network Architecture
Algorithm 1 ShareBackup wiring algorithm
// Edge layer
1: for each CS1,i,j where 0 ≤ i < k