Designing Hybrid Data Center Networks for High Performance and Fault Tolerance
ABSTRACT

Designing Hybrid Data Center Networks for High Performance and Fault Tolerance

by Dingming Wu

Data center networks are critical infrastructure behind today's cloud services. High-performance data center network (DCN) architectures have received extensive attention in the past decade, with multiple interesting DCN architectures proposed over the years. In the early days, data center networks were built mostly on electrical packet switches arranged in Clos, or multi-rooted tree, structures. These networks fall short in 1) providing high bandwidth for emerging non-uniform traffic patterns, e.g., multicast transmissions and skewed communicating server pairs, and 2) recovering from various network component failures, e.g., link failures and switch failures. More recently, optical technologies have gained popularity due to their ultra-simple and flexible connectivity provisioning, low power consumption, and low cost. However, the advantages of optical networking come at a price: switching speeds that are orders of magnitude lower than those of their electrical counterparts.

This thesis explores the design space of hybrid electrical and optical network architectures for modern data centers. It seeks a delicate balance among performance, fault tolerance, scalability, and cost through the coordinated use of both electrical and optical components in the network. We have developed several approaches to achieving these goals from different angles. First, we used optical splitters as key building blocks to improve multicast transmission performance. We built an unconventional optical multicast architecture, called HyperOptics, that provides orders-of-magnitude throughput improvements for multicast transmissions. Second, we developed a failure-tolerant network, called ShareBackup, by embedding optical switches into Clos networks. ShareBackup, for the first time, achieves network-wide full-capacity failure recovery in milliseconds.
Third, we proposed to enable programmable network topology at runtime by inserting optical switches at the network edge. Our system, called RDC, breaks the bandwidth boundaries between servers and dynamically optimizes its topology according to traffic patterns. Through these three works, we demonstrate the high potential of hybrid data center network architectures for high performance and fault tolerance.

Acknowledgments

I would first like to thank my advisor, Prof. T. S. Eugene Ng, for his support, encouragement, and precious advice at both the academic and personal levels. I am thankful that he pushed me and gave me direction when I did not know what to do. I am thankful that he supported and helped me in exploring various external opportunities when I could not make further progress. I am thankful that he believed in me and encouraged me when I had doubts about myself. Without his mentorship, I could not be here today.

I also thank my thesis committee members, Ang Chen and Lin Zhong. They have given me much valuable feedback and many suggestions on my research, my job search, and this thesis. A special thanks is owed to Ang, collaborations with whom produced one of the major results in this thesis.

I want to thank my internship mentors, Yingnong Dang and Guohui Wang. They taught me by example how to be productive and professional in industry. They also gave me many suggestions that helped me grow and become technically mature. More importantly, I thank them for providing me with memorable experiences in Seattle and Silicon Valley.

I am grateful to have many fantastic fellow students in the CS department. In particular, I thank my groupmates in the BOLD lab: Xiaoye Steven Sun, Yiting Xia, Simbarashe Dzinamarira, Xin Sunny Huang, Sushovan Das, Afsaneh Rahbar, and Weitao Wang. They helped me greatly in shaping the ideas behind this thesis as well as in conducting the experimental studies.
I also thank the fellow network-systems students for their productive discussions with me: Kuo-Feng Hsu, Jiarong Xing, Qiao Kang, and Xinyu Wu.

Graduate studies are stressful; I am thankful for the company of many of my friends at Rice. I treasure the friendship of Xiaoye Sun, Lechen Yu, Beidi Chen, Chen Luo, Terry Tang, and Simbarashe Dzinamarira. I am grateful that they were always there when I wanted to share my ups and downs. They have made these years an enjoyable journey for me.

Finally, I would like to give my special thanks to my parents, Ping Xiao and Tongxin Wu, who have always trusted me and supported me along the way.

Contents

Abstract
Acknowledgments
List of Illustrations
List of Tables

1 Introduction
1.1 Data Center Networks: State of the Art and Challenges
1.2 Thesis Contributions
1.3 Thesis Organization

2 Background and Related Work
2.1 Optics in Data Centers
2.2 Hybrid Data Center Network Architectures
2.3 Supporting Multicast Transmissions in Data Center Networks
2.4 Failure Recovery in Data Center Networks

3 HyperOptics: Integrating an Optical Multicast Architecture in Data Centers
3.1 Introduction
3.2 HyperOptics Architecture
3.2.1 ToR Connectivity Design
3.2.2 Routing and Relay Set Computation
3.2.3 Analysis
3.2.4 System Overview
3.2.5 Multicast Scheduling
3.3 Evaluation
3.3.1 Compared Networks
3.3.2 Simulation Setup
3.3.3 Effect of Splitter Fanout Used in OCS
3.3.4 Performance Comparison
3.4 Summary

4 ShareBackup: Masking Data Center Network Failures from Application Performance
4.1 Introduction
4.2 Related Work
4.3 Network Architecture
4.4 Control Plane
4.4.1 Fast Failure Detection and Recovery
4.4.2 Distributed Network Controllers
4.4.3 Offline Auto-Diagnosis
4.4.4 Live Impersonation of Failed Switch
4.5 Discussion
4.5.1 Control System Failures
4.5.2 Cost Analysis
4.5.3 Benefits to Network Management
4.5.4 Alternatives in the Design Space
4.6 Implementation and Evaluation
4.6.1 Testbed
4.6.2 Simulation
4.6.3 Experimental Setup
4.6.4 Transient State Analysis
4.6.5 Responsiveness of Control Plane
4.6.6 Bandwidth Advantage
4.6.7 Transmission Performance at Scale
4.6.8 Benefits to Real Applications
4.7 Summary

5 RDC: Relieving Data Center Network Congestion with Topological Reconfigurability at the Edge
5.1 Introduction
5.2 The Case for Rackless Data Centers
5.2.1 Observation #1: Pod locality
5.2.2 Observation #2: Inter-pod imbalance
5.2.3 Observation #3: Predictable patterns
5.2.4 Understanding the power of racklessness
5.3 The RDC Architecture
5.3.1 Building block: Circuit switching
5.3.2 Connectivity structure
5.3.3 The pod controller
5.3.4 Routing
5.4 RDC Control Algorithms
5.4.1 Traffic localization
5.4.2 Uplink load-balancing
5.4.3 Application-driven optimizations
5.4.4 Advanced control algorithms
5.5 Discussions
5.5.1 Recovering from ToR failures
5.5.2 Wiring and incremental deployment
5.5.3 Cost analysis
5.5.4 Handling circuit switch failures
5.5.5 Scaling
5.5.6 Alternatives in the design space
5.6 Implementation and Evaluation
5.6.1 Platforms
5.6.2 TCP transient state
5.6.3 Throughput benchmark
5.6.4 Control loop latency
5.6.5 Transmission performance at scale
5.6.6 Real-world applications
5.7 Related Work
5.8 Summary
6 Conclusions and Future Work

Bibliography

Illustrations

1.1 An example of a three-layer Clos network using homogeneous commodity switches.

2.1 An illustration of the optical splitter. (a) is a logical representation of an optical splitter with 1 input port and k output ports. (b) shows a real splitter device with k = 8.

2.2 An illustration of the optical circuit switch (OCS). (a) is a logical representation of an OCS with 8 input ports and 8 output ports, or an 8 × 8 OCS. (b) shows a real Calient 312 × 312 OCS.

3.1 An example of HyperOptics connectivity with splitter fanout k = 3. The connectivity of t_0, t_3, t_4, and t_6 is shown in the figure. All other ToRs are interconnected to their neighbors in a similar way. The table at the bottom demonstrates the connectivity of all ToRs.

3.2 Two broadcast trees originating from t_i and t_{i+2^(k-1)}. Solid circles are relays. The union of the relays and the last neighbor of each relay, shown by squares, forms a complete set of all ToRs. The two broadcasts have disjoint relay sets.

3.3 An overview of the HyperOptics system.

3.4 Comparison of the average total FCT of all three networks. The speedups of HyperOptics over the OCS network are labeled in the figure.

3.5 Computation time of HyperOptics' control plane under different numbers of multicast requests.

4.1 A k = 6 and n = 1 ShareBackup network. (a), (b), (c) correspond to the shaded areas in (d). Devices are labeled according to the notations in Table 4.1. Edge and aggregation switches are marked by their in-Pod indices; core switches and hosts are marked by their global indices. Switches in the same failure group are packed together and share a backup switch, shown in stripes on the side. Circuit switches are inserted between adjacent layers of switches/hosts. The shaded connectivity is the basic building block for shareable backup. The crossed switch and connections represent example switch and link failures.
Switches involved in failures are each replaced by a backup switch, with the new circuit switch configurations shown at the bottom, where connections to the original red round ports are reconnected to the new black square ports.

4.2 Communication protocol in the control system.