Artificial Intelligence for Understanding Large and Complex Datacenters
Artificial Intelligence for Understanding Large and Complex Datacenters

by

Pengfei Zheng

Department of Computer Science
Duke University

Date: Approved:

Benjamin C. Lee, Advisor
Bruce M. Maggs
Jeffrey S. Chase
Jun Yang

Dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Computer Science in the Graduate School of Duke University

2020

Copyright © 2020 by Pengfei Zheng
All rights reserved except the rights granted by the Creative Commons Attribution-Noncommercial Licence

Abstract

With the democratization of global-scale web applications and cloud computing, understanding the performance of a live production datacenter has become a prerequisite for making strategic decisions about datacenter design and optimization. Advances in monitoring, tracing, and profiling large, complex systems provide rich datasets and establish a rigorous foundation for performance understanding and reasoning. But the sheer volume and complexity of the collected data challenge existing techniques, which rely heavily on human intervention, expert knowledge, and simple statistics. In this dissertation, we address this challenge with artificial intelligence and make the case through two important problems: datacenter performance diagnosis and datacenter workload characterization.

The first thrust of this dissertation is the use of statistical causal inference and Bayesian probabilistic modeling for datacenter straggler diagnosis. Stragglers are exceptionally slow tasks in parallel execution that delay overall job completion. Although stragglers are uncommon within a single job, they are pervasive in datacenters running many jobs. A large body of research has focused on mitigating stragglers, but relatively little has focused on systematically identifying their causes. We present Hound, a statistical machine learning framework that infers the causes of stragglers from traces of datacenter-scale jobs.

The second thrust of this dissertation is the use of graph theory and statistical semantic learning for understanding datacenter workloads, which significantly impacts datacenter hardware architecture, capacity planning, and software re-optimization. Datacenter engineers understand workloads through continuous, distributed profiling that produces snapshots of call stacks across datacenter machines. Unlike stack traces profiled for isolated micro-benchmarks or small applications, those for hyperscale datacenters are enormous and complex; they reflect the scale and diversity of production code and pose great challenges for efficient and effective interpretation. We present Limelight+, an algorithmic framework based on graph theory and statistical semantic learning, to extract workload insights from datacenter-scale stack traces and to derive design insights for datacenter architecture.
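To make the straggler problem of the first thrust concrete, the sketch below shows the kind of per-task trace data involved and a common operational definition of a straggler (a task whose latency far exceeds its job's median). The record format, field names, and the 1.5x-median threshold are illustrative assumptions for this sketch, not Hound's actual interface or diagnosis method.

```python
from collections import defaultdict
from statistics import median

def find_stragglers(task_records, threshold=1.5):
    """Flag tasks whose latency far exceeds their job's median.

    task_records: iterable of (job_id, task_id, latency_seconds).
    A task is flagged when its latency exceeds threshold * the median
    latency of its job -- one common operational definition of a straggler.
    """
    by_job = defaultdict(list)
    for job_id, task_id, latency in task_records:
        by_job[job_id].append((task_id, latency))

    stragglers = []
    for job_id, tasks in by_job.items():
        med = median(latency for _, latency in tasks)
        for task_id, latency in tasks:
            if latency > threshold * med:
                stragglers.append((job_id, task_id, latency / med))
    return stragglers

# Example: task "c" runs at 2.5x its job's median latency and is flagged.
records = [(1, "a", 10.0), (1, "b", 12.0), (1, "c", 30.0)]
print(find_stragglers(records))  # [(1, 'c', 2.5)]
```

Identifying a straggler in this way is only the starting point; Hound's contribution is inferring the causes behind such tasks from datacenter-scale traces.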
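For the second thrust, the following sketch illustrates the raw material Limelight+ starts from: profiled call-stack samples aggregated into per-function exclusive and inclusive cycle counts, the simple measures whose limitations Chapter 3 examines. The sample format and function names here are hypothetical, and this is a baseline illustration rather than Limelight+'s algorithm.

```python
from collections import Counter

def attribute_cycles(stack_samples):
    """Aggregate profiled call-stack samples into per-function costs.

    stack_samples: list of (stack, cycles) pairs, where a stack is a
    tuple of function names ordered from caller (root) to callee (leaf).
    Exclusive cycles charge only the leaf (the function executing when
    the sample fired); inclusive cycles charge every frame on the stack.
    """
    exclusive, inclusive = Counter(), Counter()
    for stack, cycles in stack_samples:
        exclusive[stack[-1]] += cycles
        for frame in set(stack):  # charge each function once per sample
            inclusive[frame] += cycles
    return exclusive, inclusive

samples = [
    (("main", "rpc_serve", "compress"), 70),
    (("main", "rpc_serve", "memcpy"), 20),
    (("main", "gc"), 10),
]
exc, inc = attribute_cycles(samples)
print(exc["compress"], inc["rpc_serve"], inc["main"])  # 70 90 100
```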
For my teachers, family, and friends: people who helped me to come this far.

Contents

Abstract
List of Tables
List of Figures
Acknowledgements

1 Introduction
  1.1 Datacenter-scale Performance Diagnosis
  1.2 Understanding Datacenter Workload Structure
  1.3 Key Contributions

2 Causal Inference and Bayesian Probabilistic Model for Straggler Diagnosis at Datacenter Scale
  2.1 System Objectives
    2.1.1 Datacenter-scale Diagnosis
    2.1.2 Interpretable Models
    2.1.3 Unbiased Inference
    2.1.4 Computational Efficiency
  2.2 The Hound Framework
    2.2.1 Base Learning
    2.2.2 Meta Learning
    2.2.3 Ensemble Learning
  2.3 Experimental Methods
    2.3.1 Google Trace
    2.3.2 Spark Traces
  2.4 Evaluation with Google Trace
    2.4.1 Mixtures of Causes
    2.4.2 Validation with Case Studies
    2.4.3 Validation with Mutual Information
    2.4.4 Comparisons with Expert Diagnosis
    2.4.5 Comparison with Simpler Base Learners
  2.5 Evaluation with Spark Traces
  2.6 Complexity and Overheads
  2.7 Related Work
    2.7.1 Straggler Mitigation
    2.7.2 Performance Analysis
  2.8 Conclusions

3 Graph Theory and Semantic Learning for Understanding Datacenter Workload
  3.1 Challenges of Existing Stack Trace Analysis Methods
    3.1.1 Counting Exclusive Cycles
    3.1.2 Counting Inclusive Cycles
    3.1.3 Call Path Analysis
    3.1.4 Call Graph Analysis
  3.2 Limelight+ Overview
  3.3 Limelight+ Layerization
    3.3.1 Foundational Degree
    3.3.2 Regularized Foundational Degree
    3.3.3 Maximizing Foundational Degree
  3.4 Limelight+ Function Clustering
  3.5 Limelight+ Cycle Attribution
  3.6 Experimental Methods
    3.6.1 SERVICES
    3.6.2 FLEET
  3.7 Evaluation of Limelight+
    3.7.1 Discovering Layers
    3.7.2 Discovering Accelerators
    3.7.3 Evaluating Layer Quality
    3.7.4 Evaluating Semantic Embeddings
    3.7.5 Analyzing Production Datacenter
  3.8 Complexity and Overhead
  3.9 Related Work
  3.10 Conclusions

4 Conclusions
  4.1 Conclusions
  4.2 Future Work
    4.2.1 Performance Diagnosis and Optimization in the Era of Serverless, Microservices and Privacy
    4.2.2 Machine Learning - Efficiency and Economics

Bibliography
Biography

List of Tables

2.1 Comparison of statistical dependence measures
2.2 Task metrics in the Google datacenter trace
2.3 Task metrics in the Spark traces for BDBench and TPC-DS
2.4 Hound's causal topics for the Google dataset, derived from an ensemble of predictive, dependent, and causal models
2.5 Hound's causal topics on the Google trace, with predictive model
2.6 Hound's causal topics on the Google trace, with dependence model
2.7 Hound's causal topics on the Google trace, with causal model
2.8 Example: mixtures of causes
2.9 Comparison of inferred causes from varied modeling strategies for job 6308689702
2.10 Comparison of inferred causes from varied modeling strategies for job 6343946350
2.11 Number of causes per job
2.12 Coverage statistics for causes
2.13 Examples of stragglers' causes from related studies that produce expert diagnoses
2.14 Comparison of causes diagnosed by Hound for the Google system against causes diagnosed by experts for related systems
2.15 Causal topics inferred with linear regression as base learner
2.16 Causal topics inferred with Pearson's correlation as base learner
2.17 Causal topics inferred with logistic regression based Rubin Causal Model as base learner
2.18 Hound's causal topics for the Spark BDBench dataset
2.19 Hound's estimate of stragglers (percentage) explained by each cause for the Spark BDBench dataset
2.20 Hound's causal topics for the Spark TPC-DS dataset
2.21 Hound's estimate of stragglers (percentage) explained by each cause for the Spark TPC-DS dataset
2.22 Computational complexity of Hound
3.1 Stack samples
3.2 Notations. Let S denote a stack trace, s a stack sample, and f or g a function
3.3 Examples of conjugate functions
3.4 An example stack trace. Suppose we target a layer Lu and show only the five functions in the trace
3.5 Experimental traces SERVICES and FLEET
3.6 Comparison between hot function clusters revealed by Limelight+ and expert-designed ASICs/accelerators
3.7 Comparison between hot function clusters revealed by Limelight+ and expert-designed software re-optimizations
3.8 Comparison of Directed Acyclic Graph (DAG) layerization algorithms
3.9 Evaluation of Limelight+'s EE (Equilibrium Embedding) and Word2Vec