UNIVERSITY of CALIFORNIA SAN DIEGO Simplifying Datacenter Fault
Total Page:16
File Type:pdf, Size:1020Kb
UNIVERSITY OF CALIFORNIA SAN DIEGO Simplifying datacenter fault detection and localization A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Computer Science by Arjun Roy Committee in charge: Alex C. Snoeren, Co-Chair Ken Yocum, Co-Chair George Papen George Porter Stefan Savage Geoff Voelker 2018 Copyright Arjun Roy, 2018 All rights reserved. The Dissertation of Arjun Roy is approved and is acceptable in quality and form for publication on microfilm and electronically: Co-Chair Co-Chair University of California San Diego 2018 iii DEDICATION Dedicated to my grandmother, Bela Sarkar. iv TABLE OF CONTENTS Signature Page . iii Dedication . iv Table of Contents . v List of Figures . viii List of Tables . x Acknowledgements . xi Vita........................................................................ xiii Abstract of the Dissertation . xiv Chapter 1 Introduction . 1 Chapter 2 Datacenters, applications, and failures . 6 2.1 Datacenter applications, networks and faults . 7 2.1.1 Datacenter application patterns . 7 2.1.2 Datacenter networks . 10 2.1.3 Datacenter partial faults . 14 2.2 Partial faults require passive impact monitoring . 17 2.2.1 Multipath hampers server-centric monitoring . 18 2.2.2 Partial faults confuse network-centric monitoring . 19 2.3 Unifying network and server centric monitoring . 21 2.3.1 Load-balanced links mean outliers correspond with partial faults . 22 2.3.2 Centralized network control enables collating viewpoints . 23 Chapter 3 Related work, challenges and a solution . 24 3.1 Fault localization effectiveness criteria . 24 3.2 Existing fault management techniques . 28 3.2.1 Server-centric fault detection . 28 3.2.2 Network-centric fault detection . 31 3.2.3 Hybridized approaches . 33 3.3 Accounting for traffic and faults with outlier analysis . 35 3.3.1 Localizing faults via passive monitoring and outlier analysis . 35 3.3.2 Outlier analysis demonstration . 38 3.3.3 Outlier analysis in real world datacenters . 42 Chapter 4 Facebook datacenters and outlier analysis . 44 4.1 Inside the social network’s (datacenter) network . 44 4.2 Related datacenter traffic characterization work . 49 v 4.3 A Facebook datacenter . 50 4.3.1 Datacenter topology . 50 4.3.2 Constituent services . 52 4.3.3 Data collection . 53 4.4 Provisioning . 55 4.4.1 Utilization . 56 4.4.2 Locality and stability . 58 4.4.3 Traffic matrix . 59 4.4.4 Implications for connection fabrics . 63 4.5 Traffic engineering . 64 4.5.1 Flow characteristics . 64 4.5.2 Load balancing . 66 4.5.3 Heavy hitters . 68 4.5.4 Implications for traffic engineering . 72 4.6 Switching . 72 4.6.1 Per-packet features . 73 4.6.2 Arrival patterns . 74 4.6.3 Buffer utilization . 75 4.6.4 Concurrent flows . 78 4.7 Implications for partial fault localization . 79 Chapter 5 Simplified fault detection and localization . 82 5.1 Formalizing outlier-analysis via equivalence sets . 84 5.1.1 Defining equivalence sets . 84 5.1.2 Equivalence sets underpin outlier based partial-fault localization. 85 5.1.3 Equivalence-sets compared to prior approaches . 86 5.2 System overview . 87 5.2.1 Production datacenter . 87 5.2.2 System architecture . 89 5.3 Implementation . 92 5.3.1 Datacenter flow path discovery . 92 5.3.2 Aggregating server metrics & path data . 94 5.3.3 Verdict generation . 99 5.3.4 Centralized fault localization . 104 5.4 Evaluation . 105 5.4.1 Test environment . 106 5.4.2 Motivating example . 106 5.4.3 Speed and sensitivity . 108 5.4.4 Precision and accuracy . 110 5.4.5 Small-scale deployment experience . 112 5.5 Applicability . 115 5.5.1 Surmountable issues. 115 5.5.2 Traffic homogeneity . 115 5.5.3 Link failures . 116 vi 5.5.4 Topologies . 117 5.5.5 Limitations . ..