
Artificial Intelligence for Understanding Large and Complex Datacenters

by

Pengfei Zheng

Department of Computer Science Duke University

Date: Approved:

Benjamin C. Lee, Advisor

Bruce M. Maggs

Jeffrey S. Chase

Jun Yang

Dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Computer Science in the Graduate School of Duke University

2020

Abstract

Artificial Intelligence for Understanding Large and Complex Datacenters

by

Pengfei Zheng

Department of Computer Science Duke University

Date: Approved:

Benjamin C. Lee, Advisor

Bruce M. Maggs

Jeffrey S. Chase

Jun Yang

An abstract of a dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Computer Science in the Graduate School of Duke University

2020

Copyright © 2020 by Pengfei Zheng
All rights reserved except the rights granted by the Creative Commons Attribution-Noncommercial Licence

Abstract

With the democratization of global-scale web applications and cloud computing, understanding the performance of a live production datacenter becomes a prerequisite for making strategic decisions related to datacenter design and optimization. Advances in monitoring, tracing, and profiling large, complex systems provide rich datasets and establish a rigorous foundation for performance understanding and reasoning. But the sheer volume and complexity of the collected data challenge existing techniques, which rely heavily on human intervention, expert knowledge, and simple statistics. In this dissertation, we address this challenge using artificial intelligence and make the case for two important problems, datacenter performance diagnosis and datacenter workload characterization.

The first thrust of this dissertation is the use of statistical causal inference and Bayesian probabilistic models for datacenter straggler diagnosis. Stragglers are exceptionally slow tasks in parallel jobs that delay overall job completion. Stragglers, which are uncommon within a single job, are pervasive in datacenters with many jobs. A large body of research has focused on mitigating stragglers, but relatively little research has focused on systematically identifying their causes. We present Hound, a statistical machine learning framework that infers the causes of stragglers from traces of datacenter-scale jobs.

The second thrust of this dissertation is the use of graph theory and statistical semantic learning for datacenter workload understanding, which has significant impact

on datacenter hardware architecture, capacity planning, software re-optimization, etc. Datacenter engineers understand datacenter workloads with continuous, distributed profiling that produces snapshots of call stacks across datacenter machines. Unlike stack traces profiled for isolated micro-benchmarks or small applications, those for hyperscale datacenters are enormous and complex, reflect the scale and diversity of their production codes, and pose great challenges for efficient and effective interpretation. We present Limelight+, an algorithmic framework based on graph theory and statistical semantic learning, to extract workload insights from datacenter-scale stack traces, and to gain design insights for datacenter architecture.

For my teachers, family and friends — people who helped me to come this far

Contents

Abstract iv

List of Tables x

List of Figures xii

Acknowledgements xiv

1 Introduction 1

1.1 Datacenter-scale Performance Diagnosis...... 2

1.2 Understanding Datacenter Workload Structure...... 3

1.3 Key Contributions...... 4

2 Causal Inference and Bayesian Probabilistic Model for Straggler Diagnosis at Datacenter Scale 6

2.1 System Objectives...... 7

2.1.1 Datacenter-scale Diagnosis...... 7

2.1.2 Interpretable Models...... 8

2.1.3 Unbiased Inference...... 9

2.1.4 Computational Efficiency...... 9

2.2 The Hound Framework...... 10

2.2.1 Base Learning...... 12

2.2.2 Meta Learning...... 19

2.2.3 Ensemble Learning...... 22

2.3 Experimental Methods...... 26

2.3.1 Google Trace...... 26

2.3.2 Spark Traces...... 27

2.4 Evaluation with Google Trace...... 29

2.4.1 Mixtures of Causes...... 33

2.4.2 Validation with Case Studies...... 37

2.4.3 Validation with Mutual Information...... 39

2.4.4 Comparisons with Expert Diagnosis...... 42

2.4.5 Comparison with Simpler Base Learners...... 44

2.5 Evaluation with Spark Traces...... 49

2.6 Complexity and Overheads...... 52

2.7 Related Work...... 53

2.7.1 Straggler Mitigation...... 53

2.7.2 Performance Analysis...... 54

2.8 Conclusions...... 55

3 Graph Theory and Semantic Learning for Understanding Datacenter Workload 57

3.1 Challenges of Existing Stack Trace Analysis Methods...... 58

3.1.1 Counting Exclusive Cycles...... 63

3.1.2 Counting Inclusive Cycles...... 64

3.1.3 Call Path Analysis...... 66

3.1.4 Call Graph Analysis...... 66

3.2 Limelight+ Overview...... 67

3.3 Limelight+ Layerization...... 70

3.3.1 Foundational Degree...... 70

3.3.2 Regularized Foundational Degree...... 77

3.3.3 Maximizing Foundational Degree...... 82

3.4 Limelight+ Function Clustering...... 87

3.5 Limelight+ Cycle Attribution...... 97

3.6 Experimental Methods...... 100

3.6.1 SERVICES...... 100

3.6.2 FLEET...... 102

3.7 Evaluation of Limelight+ ...... 103

3.7.1 Discovering Layers...... 104

3.7.2 Discovering Accelerators...... 108

3.7.3 Evaluating Layer Quality...... 117

3.7.4 Evaluating Semantic Embeddings...... 122

3.7.5 Analyzing Production Datacenter...... 124

3.8 Complexity and Overhead...... 133

3.9 Related Work...... 135

3.10 Conclusions...... 138

4 Conclusions 140

4.1 Conclusions...... 140

4.2 Future Work...... 141

4.2.1 Performance Diagnosis and Optimization in the Era of Serverless, Microservices and Privacy...... 141

4.2.2 Machine Learning - Efficiency and Economics...... 142

Bibliography 144

Biography 171

List of Tables

2.1 Comparison of statistical dependence measures...... 17

2.2 Task metrics in the Google datacenter trace...... 27

2.3 Task metrics in the Spark traces for BDBench and TPC-DS...... 28

2.4 Hound’s causal topics for the Google dataset, derived from an ensemble of predictive, dependent, and causal models...... 30

2.5 Hound’s causal topics on the Google trace, with predictive model... 32

2.6 Hound’s causal topics on the Google trace, with dependence model.. 33

2.7 Hound’s causal topics on the Google trace, with causal model..... 34

2.8 Example - Mixtures of causes...... 35

2.9 Comparison of inferred causes from varied modeling strategies for job 6308689702...... 35

2.10 Comparison of inferred causes from varied modeling strategies for job 6343946350...... 36

2.11 Number of causes per job...... 36

2.12 Coverage statistics for causes...... 36

2.13 Examples of stragglers’ causes from related studies that produce expert diagnoses...... 44

2.14 Comparison of causes diagnosed by Hound for the Google system against causes diagnosed by experts for related systems...... 45

2.15 Causal topics inferred with linear regression as base learner...... 46

2.16 Causal topics inferred with Pearson’s correlation as base learner.... 48

2.17 Causal topics inferred with logistic regression based Rubin Causal Model as base learner...... 48

2.18 Hound’s causal topics for the Spark BDBench dataset...... 50

2.19 Hound’s estimate of stragglers (percentage) explained by each cause for the Spark BDBench dataset...... 50

2.20 Hound’s causal topics for the Spark TPC-DS dataset...... 51

2.21 Hound’s estimate of stragglers (percentage) explained by each cause for the Spark TPC-DS dataset...... 51

2.22 Computational Complexity of Hound ...... 52

3.1 Stack Samples...... 59

3.2 Notations. Let S denote a stack trace S, s a stack sample, and f or g a function...... 62

3.3 Examples of conjugate functions...... 81

3.4 An example stack trace. Suppose we target a layer Lu and show only the five functions in the trace...... 88

3.5 Experimental trace SERVICES and FLEET...... 100

3.6 Comparison between hot function clusters revealed by Limelight+ and expert-designed ASICs/accelerators...... 112

3.7 Comparison between hot function clusters revealed by Limelight+ and expert-designed software re-optimizations...... 115

3.8 Comparison of Directed Acyclic Graph (DAG) layerization algorithms. 119

3.9 Evaluation of Limelight+’s EE (Equilibrium Embedding) and Word2Vec on token-level and function-level Semantic Relatedness Test...... 123

3.10 Evaluation of function semantic embedding with Probabilistic Stack Modeling...... 124

3.11 Service-level hot function clusters revealed by Limelight+ on a month-long trace of a live production datacenter...... 128

3.12 Shared-library-level, hot function clusters revealed by Limelight+ on a month-long trace of a live production datacenter...... 131

List of Figures

2.1 The Hound Framework...... 11

2.2 Example - Hound Inputs...... 12

2.3 Example - Hound Outputs...... 13

2.4 Example - Hound Base Learning...... 14

2.5 Example - Hound Meta Learning...... 21

2.6 Example - Hound Ensemble Learning...... 22

2.7 Profiles suggest stragglers in job 6308689702 arise from a mix of data, computation, and I/O skew...... 39

2.8 Profiles suggest stragglers in job 6283499093 arise from data skew... 40

2.9 Profiles suggest stragglers in job 6266469130 arise from queueing delay. 40

2.10 Profiles suggest stragglers in job 6274140245 arise from limited processor usage...... 41

2.11 Mutual information to assess accuracy and coherence...... 43

2.12 Comparison of modeling methods for latency prediction on the Google dataset...... 46

2.13 Comparison of dependence measures for safeguards against false correlation on the Google dataset...... 47

2.14 Comparison of propensity score estimates for causal inference on the Google dataset...... 47

2.15 Hound’s scalability as dataset size increases...... 53

3.1 Example of Limelight+’s produced graph...... 68

3.2 Illustration of FDMAX...... 71

3.3 Illustration of Conjugate Functionality Regularization (CFR)..... 78

3.4 Illustration of modules and modules headers...... 89

3.5 Cluster semantically similar functions within each layer with STEAM. 97

3.6 Illustration of HELP...... 99

3.7 Limelight+’s layers and groups for profiled services...... 106

3.8 Categorization of service-level hotspots revealed by Limelight+ for a macro-scale view of datacenter workload composition...... 125

3.9 Categorization of library-level hotspots revealed by Limelight+ for a micro-scale view of datacenter software infrastructure...... 126

3.10 Runtime analysis of Limelight+. Measurements are averaged over multiple runs...... 134

Acknowledgements

It has been a long and challenging journey to complete this doctoral dissertation. Without the help of my teachers, friends and family, this dissertation would never have been accomplished. First of all, I want to thank my academic advisor, Dr. Benjamin Lee, for giving me the opportunity to work as a doctoral student and research assistant in his lab. We have worked closely during my time at Duke; he has devoted numerous weekdays and even weekends to helping me improve my technical ability, writing skills and presentation expertise. Dr. Lee is a unique researcher who has strong backgrounds in multiple research areas, including computer architecture, computer systems, algorithmic decision making, statistics and machine learning. I am inspired by and have greatly benefited from his interdisciplinary insights in the execution of research projects. Dr. Lee was always passionate about our work and sought out the best venues for publication; this dissertation consists of our research that was accepted and submitted to competitive conferences. The members of my exam committee have all been enormously helpful for each of my milestone events: Dr. Bruce Maggs, Dr. Jeff Chase and Dr. Jun Yang drew on their expertise in computer systems, especially distributed systems, to help define practical research problems, provide constructive feedback, and advise on the technical roadmap. It is my supreme honor to have these top-notch researchers in the field on my committee and to be able to discuss my work with them. Moreover, I would like to extend my thanks to Dr. Xiaobai Sun and his student Tiancheng Liu, for their valuable input

on the graph theory part of this dissertation. In particular, I would like to express my thanks to Dr. Benjamin C. Lee, Dr. Bruce Maggs and Dr. Jeff Chase for their invaluable referrals during my postdoctoral job search. Getting a PhD is the combination of academic training and personal development. I would like to thank many important people who accompanied me on this long journey and helped me reach the finish line: Dr. Qiuyun Llull, Dr. Sonchun Fan, Dr. Seyed Zahedi, Dr. Ziqiang Huang, Dr. Luwa Matthews, Dr. Tamara Lehman, Yuhao Li and Atefeh Mehrabi. Last but by no means least, I would like to thank my wife, my parents, and my friends for their support, encouragement and love throughout my doctoral study.

1

Introduction

As datacenter computing scales to support modern Internet services and products, its efficient operation and strategic design become a significant challenge. At such scale, small improvements in software performance usually translate into stronger assurances in user experience and immense savings in ownership cost. In particular, troubleshooting performance deficiencies, such as poor load balancing, can save billions of user requests from delayed completion and avoid inconsistent service quality. Further, offloading performance hotspots that are prominent in datacenter computing for domain-specific acceleration can boost energy efficiency by orders of magnitude and significantly cut energy bills. But first of all, identifying opportunities for such improvements requires an in-depth understanding of datacenter performance characteristics.

System engineers have built distributed monitoring, tracing, and profiling infrastructure to track datacenter activities in different domains (e.g., server, network, FPGA mesh) and at fine granularities (e.g., request or task level). Such “big measurement” establishes a rigorous foundation for systematic understanding of datacenter performance. But its sheer volume and complexity significantly challenge existing

performance analysis techniques, which rely heavily on domain expertise, static rules, simple statistics, etc. In this dissertation, we address this challenge with artificial intelligence techniques, and make the case for two important problems in performance analysis: performance diagnosis and workload characterization.

1.1 Datacenter-scale Performance Diagnosis

In the first work, we focus on diagnosing datacenter stragglers. The datacenter scheduler splits a data-intensive job into many tasks, executes them in parallel on many machines, and aggregates results when the last task completes.1 Stragglers are exceptionally slow tasks within a job that significantly delay its completion. Unfortunately, stragglers’ effects increase with the number of tasks and the scale of the system. In a Google datacenter [213], we find that stragglers extend completion time in 20% of jobs by more than 1.5×. Existing mitigation strategies are incomplete solutions because they address symptoms rather than diagnose causes. For example, stragglers often arise from the skewed distribution of data across tasks. When some tasks are overloaded with work [160], speculative re-execution or rescheduling consumes system resources without improving performance. Although profilers support straggler diagnosis by producing massive datacenter traces, deriving causal explanations from complex datasets is difficult. Existing diagnosis procedures rely heavily on human expertise in systems and the application of best practices, which are laborious and fail to scale [92, 43, 44]. Thus, understanding stragglers’ causes is a prerequisite for efficient countermeasures, and large datasets motivate methods for rigorous causal analysis. We present a statistical machine learning framework – Hound – that infers stragglers’ causes with causal inference and Bayesian probabilistic models.

1 A job contains one or more tasks that execute a single program on multiple data in parallel. When frameworks, such as Apache Spark, organize tasks into stages, each stage corresponds to a job.

We demonstrate Hound’s capabilities for a production trace from Google’s warehouse-scale datacenters and two Spark traces from Amazon EC2 clusters. Hound’s diagnostic inferences codify domain knowledge to reveal straggler causes and are consistent with those from expert analysis.

1.2 Understanding Datacenter Workload Structure

In the second work, we aim to understand the workload structure of a production datacenter and extract design insights for datacenter architecture. We reveal the structure of datacenter workloads by analyzing stack traces, which assess how processor cycles are consumed by software functions. Rigorous stack analysis informs datacenter design and management. First, macro scale analysis of applications guides hardware provisioning and roadmaps. Meso and micro scale analyses reveal software hotspots that can be targeted for algorithm, code, or resource optimizations. These hotspots are also targets for hardware/software co-design or domain-specific accelerators that improve performance and energy efficiency by orders of magnitude [130]. For example, machine learning in datacenter workloads has, in part, driven a golden age of ASIC design for neural networks [142]. Performance toolkits aggregate stack profiles to produce a call graph (e.g., perf, Intel VTune, GraphViz). Each node is a software function. Each directed edge is a caller-callee relationship. Edge weights quantify how frequently one function calls the other.2 Call graphs for datacenter services, unlike those for small microbenchmarks or applications, include tens of thousands of nodes and hundreds of thousands of edges. The graph’s scale and complexity make human analysis infeasible. Although pruning the graph to include only the hottest routines may seem an attractive approximation, the distribution of computational cycles over software routines is heavy-tailed and the resulting analysis would neglect a large fraction of datacenter activity.

2 We study directed acyclic graphs (DAGs) obtained by pruning call stacks with recursive procedures and thereby eliminating cycles within the call graph.

Scale and complexity motivate Limelight+, a new algorithmic framework that extracts workload insights from stack traces. Limelight+ uses graph theory to partition an enormous call graph into succinct layers. Moreover, it uses semantic learning to annotate layers for efficient, human interpretation. We demonstrate Limelight+’s capabilities by analyzing call graphs for widely used software frameworks—deep neural network training and inference, persistent key-value store, server-side PHP processing and HTTP web server—that run as core services within industrial datacenter servers. Moreover, we analyze a broader ecosystem of workloads within a live, production datacenter. Results show that the call graph layers and function clusters inferred by Limelight+ can be effectively translated into workload insights at multiple scales and reveal computational hotspots for hardware acceleration or software re-optimization.

1.3 Key Contributions

This dissertation is an interdisciplinary study of computer systems and artificial intelligence. We summarize our contributions as below.

• Establish the connection between artificial intelligence and system performance analysis. In this dissertation, we establish the link between artificial intelligence (including machine learning, statistical modeling, non-convex optimization and graph theory) and system performance analysis. We use causal inference and Bayesian probabilistic models for the diagnosis of large, complex distributed systems and datacenters. Furthermore, we use graph theory and statistical semantic learning for understanding workload structure at scale.

• Extend theories to meet challenges in real system settings. Real system settings pose challenges to artificial intelligence theories. For example, it is challenging to integrate numerous diagnostic models that are built for varied types of computation to reveal recurring issues at datacenter scale. Furthermore, it is challenging to build succinct models that codify domain knowledge for efficient human interpretation. In this dissertation, we choose and extend standard artificial intelligence methods to overcome many domain-specific challenges presented by computer systems.

• Build distributed engines for scalable performance inference and reasoning. In this dissertation, we build distributed engines for our performance models and analytics, with parallel execution frameworks such as Apache Spark [46], Berkeley Ray [186] and Apache Arrow [10]. Inference and analysis benefit from massive datacenter-scale parallelism.

2

Causal Inference and Bayesian Probabilistic Model for Straggler Diagnosis at Datacenter Scale

We present a statistical machine learning framework – Hound – that infers stragglers’ causes with three techniques. First, Hound models task latency using system conditions, such as hardware activity, resource utilization, and scheduling events, in order to reveal causes of stragglers for each job. Second, Hound discovers recurring, interpretable causes across jobs by constructing topic models that treat jobs’ latency models as documents and their fitted parameters as words. Topics reveal sophisticated causes that explain stragglers at datacenter scale. Finally, Hound guards against false conclusions by constructing an ensemble of predictive, dependence, and causal models and reporting only causes found by multiple models.

We demonstrate Hound on two representative systems. The first is a month-long trace from a production Google datacenter with 12K machines running 650K jobs and 25M tasks [213]. The second is a pair of traces from Amazon EC2 clusters running Spark data analytics [202]. Hound produces interpretable topics that describe stragglers’ causes. Each topic is a set of system conditions that combine to explain a major source of stragglers that affects many jobs. Results show that Hound’s inferred causes are consistent with those from expert analysis.

2.1 System Objectives

We architect Hound for four desiderata: datacenter-scale diagnosis to automatically identify recurring causes of stragglers across many jobs; interpretable models to concisely reveal sophisticated causes and domain insight; unbiased inference to reduce the risk of false explanations when model assumptions are invalid; computational efficiency with methods that are parallelizable and have polynomial complexity.

2.1.1 Datacenter-scale Diagnosis

Datacenter operators benefit from a broad view of stragglers. A datacenter-scale perspective reveals recurring problems that cause stragglers in many jobs instead of minor problems that impact just a few. Hound diagnoses stragglers at datacenter scale with a two-stage approach. For each job, it infers models of latency and characterizes stragglers’ causes. Then it synthesizes patterns across jobs and their models.

Straggler diagnosis requires a separate model for each job rather than a single model for all jobs. Jobs are heterogeneous and completion time varies significantly due to differences in input data, system parameters, and task parallelism. A single model across jobs cannot account for a straggler’s slowness relative to other tasks within its job. It cannot differentiate a slow task in an inherently fast job from a slow task in an inherently slow job—the former is a straggler while the latter is a nominal task. Moreover, a single model provides only one causal explanation for thousands of unique jobs when multiple, diverse causes exist and tailored explanations are required.

Datacenter-scale diagnosis requires extracting domain insight from recurring patterns across jobs and their models. Hound uses meta learning to identify datacenter-scale patterns [72]. The base learner for each job constructs latency models and reveals system conditions associated with poor task performance. From base learners’ models, the meta learner discovers recurring, straggler-inducing conditions shared by many jobs. Our choice of learners is deliberate and serves several objectives. We select base learners to produce succinct models that can be trained without manual intervention and to support subsequent meta learning. We exclude Bayesian networks [83, 84, 270, 240, 219] and decision / regression trees [221, 242, 263, 49, 241, 63], popular methods for analyzing system performance, because learning patterns from heteromorphic graphs and trees is difficult. We select meta learners to extract semantic structure from jobs’ models and to enhance interpretability. Topic modeling infers themes from recurring clusters of words that appear in many documents. Hound constructs topic models by treating base learners’ models as documents, system conditions as words, and causal explanations for stragglers as topics.

2.1.2 Interpretable Models

Widely used models are difficult to interpret. Some methods are prone to over-fitting and preclude broader interpretation. Regression trees often produce large models from small datasets [188]. Such models generate accurate predictions, but cannot produce generalizable insight. Other methods require domain expertise. Lasso regression selects features associated with stragglers and discards the rest [60, 180, 59]. Translating these features into causes requires system expertise.

Hound ensures interpretability with topic models. Latent Dirichlet allocation identifies features that often appear together in jobs’ latency models. A cluster of features corresponds to a topic, which reveals clear and concise system conditions that are associated with atypically large latencies. For example, the topic [CACHE MISS(+), CPI(+)] indicates some stragglers arise from poor cache behavior, measured in terms of the number of cache misses and average cycles per instruction. When jobs suffer from diverse causes of stragglers, Hound reports a mix of relevant topics and assesses their relative contributions to system performance.

2.1.3 Unbiased Inference

A statistical model makes assumptions that prevent it from performing well on all datasets. The No-Free-Lunch theorem states that a learner pays for performance on some datasets with degraded performance on others [259]. Certain models capture some system behaviors but not others [242]. For example, regression assumes little collinearity between predictors. If collinearity exists, fitted models infer erroneous associations [96]. Bayesian networks assume a prior distribution (e.g., Dirichlet) when inferring network structure. Whether the prior is appropriate for the dataset determines inferred network’s quality [232]. Relying on one method is risky when analyzing large, heterogeneous datasets such as datacenter traces. Hound uses ensemble learning to reduce risk of biased conclusions. An ensemble combines responses from multiple, independent learners with a majority rule that amplifies correct responses and avoids erroneous ones [64]. An ensemble is robust when its learners are diverse because, given a dataset, many assumptions hold even when some fail. We design an ensemble that combines distinct but complementary methods for diagnosing stragglers.

2.1.4 Computational Efficiency

Statistical inference can be computationally expensive and even intractable. For example, inferring Bayesian networks requires finding directed acyclic graphs that optimally represent the structure of conditional dependencies within the training data, an NP-hard problem [198]. Even after the network is built, responding to exact queries is intractable [86]. The computational complexity of constructing a model for straggler diagnosis depends on three factors: number of profiled metrics per task, number of tasks per job, and number of jobs. Datacenter traces can include millions of jobs and tasks, each with tens of profiles. Hound relies on learning methods that have polynomial complexity and are amenable to parallelization in a distributed system.

2.2 The Hound Framework

Hound infers predictive, dependent, and causal relationships for task latency from job profiles. Moreover, it reveals and assigns relevant causes to each job’s stragglers. Hound provides these capabilities by extending state-of-the-art methods to improve inference for datacenter profiles. Figure 2.1 shows how inputs translate into outputs in three stages. First, base learning captures relationships between latency and system conditions. Second, meta learning reveals recurring topics. Third, ensemble learning integrates results from disparate types of models.

Figure 2.1: The Hound Framework

Inputs – Profiles. Illustrated in Figure 2.2, Hound requires profiles from jobs that exhibit long-tailed latency distributions. Datacenters may collect these profiles with a unified facility, such as Google-wide Profiling [215] or CPI2 [271], or combine existing facilities for continuous profiling and distributed file systems [45, 119, 125]. Profiles collected throughout the datacenter and across time comprise a dataset. Hound uses the dataset to infer models that take latency as the dependent variable and system conditions, such as resource usage and scheduling events, as independent variables.

Figure 2.2: Example - Hound Inputs. (a) Jobs and tasks. (b) Task profile and trace.

Outputs – Causal Topics. Hound’s topics are concise causal explanations for stragglers. Each topic is a set of abnormal profile metrics associated with stragglers across many jobs. The suffix (-) or (+) indicates whether a metric’s values are significantly lower or higher for stragglers than for normal tasks. Hound identifies relevant topics for each job and estimates each topic’s significance across the datacenter’s tasks and jobs. Administrators can use these outputs to easily identify significant causes and triage system anomalies.

Illustrated in Figure 2.3a, Hound reports a small number of causal topics. Topic E1 attributes stragglers to lower average and peak processor utilization (APU, PPU). Topic E3 attributes stragglers to higher garbage collection frequency and duration (GCF, GCD). In Figure 2.3b, Hound assigns relevant causes to each job, using weighted mixes when a job’s stragglers have multiple causes. Hound attributes job J1’s stragglers to a mix of low processor utilization (E1) and garbage collector interference (E3). Processor utilization is dominant and weighted 0.83.

2.2.1 Base Learning

Base learners infer relationships between task latency and profiled metrics for each job. A base learner produces a causality profile C, a vector in [−1, 1]^P, where C_i is the effect of metric i on latency and P is the number of metrics. Vector elements are absolute-sum-to-one such that Σ_{i=1}^{P} |C_i| = 1. Because relying on a single learner could induce bias and produce false causes, Hound uses an ensemble of heterogeneous learners to discover predictive, dependence, and causal relationships.

Figure 2.3: Example - Hound Outputs. (a) Inferred causes of stragglers across jobs. (b) Assigned causes of stragglers for individual jobs.

• Predictive (PR). Model effects of independent variables on the dependent variable, minimizing differences between data and model outputs. Variables with larger (smaller) effects have higher (lower) predictive power.

• Dependence (DP). Model association between independent and dependent variables with probabilistic foundations.

• Causal (CA). Model cause and effect with matching methods, which compare data that differ only in the suspected cause.

Figure 2.4 illustrates base learning. The learner infers predictive models with regularized regression methods such as ElasticNet. It then produces a causality profile from fitted and re-scaled regression coefficients. The causality profile reveals each metric’s statistical significance when predicting latency. For Job J2, low processor utilization (APU, PPU) and high queueing delay (QUD) predict high task latency. Garbage collection and network communication have no effect.

Figure 2.4: Example - Hound Base Learning

Predictive Modeling (PR). Linear regression supports the creation of causality profiles and subsequent meta learning. Hound’s PR learner constructs regression models with Bagging Augmented ElasticNet (BAE) (cf., Algorithm 1). ElasticNet is a regularization method that automatically selects significant metrics [281]. Bagging is a machine learning method that improves model accuracy and stability. Hound combines these methods to fit latency models using profiled metrics. It encodes resulting regression models as vectors of coefficients to quantify metrics’ predictive power and summarize potential causes of stragglers.

Regularization methods enable robust regression by addressing collinearity. Collinearity is typical in computer systems and arises when correlations between variables distort estimates of regression coefficients [96]. Lasso, a popular regularization method in systems research [60], mitigates collinearity by randomly including only one variable from a group of correlated variables [245]. However, randomly excluding variables may harm diagnostic power. Lasso may, for example, include cycles per instruction (CPI) and exclude correlated hardware events such as cache misses or branch mispredictions. A model that includes only CPI would fail to identify cache behavior as a more likely and direct cause of stragglers. ElasticNet addresses Lasso’s limitations by grouping correlated variables and including the group when any of its variables predicts latency.

Bagging methods enable robust regression by avoiding over-fitted models, which do not accurately generalize beyond the training dataset. Bagging methods mitigate over-fitting by resampling the dataset [64]. The method constructs R replicas of a d-element dataset. Each replica draws d samples with replacement from the original dataset. The final model is a linear combination of R models fit for the replicas. Hound performs bagging on the dataset, uses ElasticNet to fit models to each replica, and reports the models’ average.

ALGORITHM 1: Bagging Augmented ElasticNet (BAE)

• Input:
(i) Dataset D = (X_n, y_n). Matrix X_n and vector y_n denote the observed task metrics and latencies, respectively, for all tasks in job J_n.
(ii) Number of bootstrap replicates, I.

• Initialize: Create I bootstrap replicates D_1, D_2, ..., D_I from D.

• For i = 1, 2, ..., I:
Train an ElasticNet model on dataset D_i and apply the LARS-EN algorithm to estimate the coefficients β̂^(i):

argmin_{β̂^(i)} L(β̂^(i)) = ‖y_n − X_n β̂^(i)‖₂² + λ₁ ‖β̂^(i)‖₂² + λ₂ ‖β̂^(i)‖₁

• Output: the average of coefficient estimates β̂ = (1/I) (β̂^(1) + β̂^(2) + ... + β̂^(I))
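For concreteness, the following is a minimal sketch of the BAE procedure using scikit-learn's ElasticNet, assuming the task metrics X and latencies y for a single job are NumPy arrays; the penalty parameters alpha and l1_ratio stand in for λ₁ and λ₂ and would be tuned in practice.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

def bae_causality_profile(X, y, n_replicas=30, alpha=0.1, l1_ratio=0.5, seed=0):
    """Bagging Augmented ElasticNet sketch: fit ElasticNet on bootstrap
    replicas, average the coefficients, and rescale them into a profile."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    coefs = np.zeros(p)
    for _ in range(n_replicas):
        idx = rng.integers(0, n, size=n)            # sample with replacement
        model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, max_iter=10000)
        model.fit(X[idx], y[idx])
        coefs += model.coef_
    coefs /= n_replicas
    total = np.abs(coefs).sum()                     # absolute-sum-to-one rescaling
    return coefs / total if total > 0 else coefs
```

The averaged, rescaled coefficient vector plays the role of the PR learner's causality profile for that job.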

Dependence Modeling (DP). Statistical dependence, a powerful framework for causal discovery [206], assesses the association between latency and profiled metrics. Table 2.1 presents properties for various dependence measures [47]. First, measures should satisfy basic properties (BP) established by the first four of Rényi’s classic axioms [218]. Moreover, they should be non-linear (NL) because dependences between performance and system conditions are often non-linear [242]. Third, measures should be invariant to strictly increasing transformations (TI), which are used to obfuscate industrial traces [214]. Our analysis favors the Schweizer-Wolff Dependence (SWD), but this measure is unsigned and computationally expensive. We address SWD’s limitations and create the Signed Schweizer-Wolff Dependence (SSW).

Table 2.1: Comparison of statistical dependence measures.
Dependence Measure | BP | NL | TI | Complexity
Pearson’s ρ [91] | | | | O(N)
Spearman’s ρ [91] | | | | O(N log(N))
Kendall’s τ [91] | | | | O(N log(N))
Schweizer-Wolff Dependence (SWD) [47] | | | | O(N²)

SWD measures the dependence between two random variables X, Y. The joint distribution of these variables comprises two pieces of information—marginals and dependence. The Sklar Theorem separates them and defines Copula C to describe dependence [192]. SWD transforms variables using their cumulative distribution functions F_X, F_Y to obtain u ≡ F_X(x), v ≡ F_Y(y). The variables are increasingly dependent as the distance between C(u, v) and u·v grows. SWD measures this distance as follows [206].

SWD(X, Y) = 12 / (N² − 1) · Σ_{i=1}^{N} Σ_{j=1}^{N} | C(u_i, v_j) − u_i · v_j |

C(u_i, v_j) = #{(x_k, y_k) s.t. x_k ≤ x_i, y_k ≤ y_j} / N,   k ∈ [1, N]

Note that (x_1, y_1), (x_2, y_2), ... are observed pairs of X, Y. When dependence between two random variables is positive (negative), an increase in one suggests an increase (decrease) in the other. We use the Fréchet-Hoeffding Bound [192] to determine the sign. When a positive dependence grows stronger, C(u, v) approaches min(u, v). When a negative dependence grows stronger, C(u, v) approaches max(u + v − 1, 0). When X, Y are independent, C(u, v) is equidistant between these bounds. SSW estimates the Copula’s distance to its bounds to determine the sign within some tolerance ε.

SSW(X, Y ) = sign(X, Y ) · SWD(X, Y )

sign(X, Y) = −1 if L1 − L2 > ε;  0 if |L1 − L2| ≤ ε;  1 if L1 − L2 < −ε

L1 = (1/N²) Σ_{i=1}^{N} Σ_{j=1}^{N} [ min(u_i, v_j) − Ĉ(u_i, v_j) ]

L2 = (1/N²) Σ_{i=1}^{N} Σ_{j=1}^{N} [ Ĉ(u_i, v_j) − max(u_i + v_j − 1, 0) ]

SSW values in [−1, 0) indicate negative dependence, values in (0, 1] indicate positive dependence, and a value of 0 indicates independence. The SSW estimator has complexity O(N²) because the SWD and sign estimators visit all pairs (u_i, v_j) for i, j ∈ [1, N]. Sampling reduces cost.
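As an illustration, the sketch below estimates SSW directly from its definition, assuming x and y hold the observed metric and latency values for one job's tasks; the O(N²) copula grid makes sampling advisable for large jobs.

```python
import numpy as np

def ssw_dependence(x, y, eps=0.05):
    """Signed Schweizer-Wolff dependence between two samples (a sketch)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    # Empirical CDF transforms: u_i = F_X(x_i), v_i = F_Y(y_i).
    u = (np.argsort(np.argsort(x)) + 1) / n
    v = (np.argsort(np.argsort(y)) + 1) / n
    # Empirical copula on the n x n grid: C[i, j] = #{k: x_k <= x_i, y_k <= y_j} / n.
    x_le = (x[None, :] <= x[:, None]).astype(float)   # [i, k]
    y_le = (y[None, :] <= y[:, None]).astype(float)   # [j, k]
    C = x_le @ y_le.T / n
    swd = 12.0 / (n * n - 1) * np.abs(C - np.outer(u, v)).sum()
    # Distances to the Frechet-Hoeffding bounds determine the sign.
    l1 = (np.minimum.outer(u, v) - C).mean()
    l2 = (C - np.maximum(u[:, None] + v[None, :] - 1.0, 0.0)).mean()
    if l1 - l2 > eps:
        sign = -1.0
    elif l1 - l2 < -eps:
        sign = 1.0
    else:
        sign = 0.0
    return sign * swd
```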

Causal Modeling (CA). Confounding bias is a major challenge when seeking causal associations between latency and profiled metrics [50]. Confounding bias arises when the supposed causal association between two variables is partially or completely explained by a third, confounding variable. For example, suppose task latency is much higher on older processors than on newer ones. We cannot definitively say processor design causes the latency difference. Servers with new processors often use faster memory and the difference could be partially or primarily caused by memory design. Determining the causal effect of processors requires eliminating memory’s effect.

Hound’s CA learner constructs Rubin Causal Models (RCM) [220]. RCM estimates the causal effect of one metric on latency while eliminating bias induced by other metrics. Let Z be a binary random variable for a treatment level such as high and low processor usage. Let R be a continuous random variable for the response level such as high and low latency. We estimate the causal effect of Z on R while controlling for all other metrics X such as memory usage, scheduling events, etc.

RCM measures the effect ∆, which is the difference in responses with and without treatment (i.e., R1 − R0), for each task.

∆ = E(R1 − R0) = E(R1) − E(R0)

RCM uses Inverse Probability Weighting (IPW) to estimate ∆ [175]. It estimates the effect despite missing data. For each data point, the treatment is either applied or not and the outcome is either R1 or R0. The data cannot report outcomes with and without treatment.

∆ = E[ Z·R / e(X) ] − E[ (1 − Z)·R / (1 − e(X)) ]

e(X) = P{Z = 1|X}

Propensity score e(X) is the conditional probability of having a treatment Z given values of confounding metrics X [220]. Given the score, outcomes and treatment are independent such that E(R1) = E[Z·R / e(X)] and E(R0) = E[(1 − Z)·R / (1 − e(X))] [175]. This formulation resolves the missing value problem because Z and R are observed and e(X) can be estimated from data. We design AdaBoost Inverse Probability Weighting (AIPW) to estimate the propensity score (cf., Algorithm 2). Propensity scores are usually estimated with logistic regression, e(X) = exp(β·X) / (1 + exp(β·X)). But this approach is vulnerable to collinearity, over-fitting, and outliers [208], which distort estimates of causal effects [175]. We address these challenges by enhancing IPW with AdaBoost [111], which estimates conditional probabilities more accurately than regression [253].
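A minimal sketch of the resulting estimator follows, assuming a binary treatment vector z in {0, 1}, task latencies r, and a confounder matrix X; scikit-learn's AdaBoostClassifier stands in for the boosting loop detailed in Algorithm 2.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def ipw_causal_effect(X, z, r, n_estimators=100, seed=0):
    """Estimate E(R1) - E(R0) with inverse probability weighting, using
    boosted propensity scores e(X) = P{Z = 1 | X}."""
    clf = AdaBoostClassifier(n_estimators=n_estimators, random_state=seed)
    clf.fit(X, z)
    e = np.clip(clf.predict_proba(X)[:, 1], 1e-3, 1.0 - 1e-3)  # guard divisions
    return np.mean(z * r / e) - np.mean((1 - z) * r / (1.0 - e))
```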

2.2.2 Meta Learning

Hound’s meta learner uses topic models to identify patterns and extract semantic structure from jobs’ numerous models and causality profiles.

ALGORITHM 2: AdaBoost Propensity Score Estimation

• Input:
(i) Observations (x_1, z_1), (x_2, z_2), ..., (x_M, z_M) for (X, Z), X ∈ [0, 1]^{P−1}, Z ∈ {−1, 1}. x_m indicates the values of the confounding metrics (X) observed for task m, and z_m indicates the value of the treatment metric (Z) observed for task m.
(ii) A weak binary classification algorithm H : X → Z, which predicts the treatment metric (Z) with the confounding metrics (X).
(iii) Initial sampling distribution D = [D_1, ..., D_M], with D_m indicating the sampling probability for observation (x_m, z_m).
(iv) The number of iterations I.

• Initialize: D(1) = D

• For i = 1, 2, ..., I:
(i) Train a weak classifier h_i using H with samples drawn using D(i).
(ii) Estimate the error rate of training, ε_i = (# of x_m s.t. h_i(x_m) ≠ z_m) / M.
(iii) Determine the weight for weak classifier h_i: α_i = (1/2) log((1 − ε_i) / ε_i).
(iv) Create a new sampling distribution D(i + 1) from D(i) with α_i:
D_m(i + 1) = D_m(i) · e^{−α_i · z_m · h_i(x_m)}   (m = 1, 2, ..., M)
(v) Normalize the new sampling distribution:
D_m(i + 1) = D_m(i + 1) / S_{i+1}   (m = 1, 2, ..., M),   S_{i+1} = Σ_{m=1}^{M} D_m(i + 1)

• Boosted Hypothesis: f(X) = Σ_{i=1}^{I} α_i h_i(X).

• Output: e_AdaBoost(X) = P{Z = 1 | X} = 1 / (1 + e^{−2 f(X)}).

We use topic models because of the parallels between inferring themes in documents and identifying patterns in causality [58]. Topics arise from recurring clusters of words just as causes arise from recurring clusters of atypical metric values. Documents contain multiple topics just as profiles reveal causes of stragglers. Significant topics appear in many documents just as significant causes explain stragglers in many jobs.

Figure 2.5: Example - Hound Meta Learning

Hound uses Latent Dirichlet Allocation to infer causes of stragglers as shown in Figure 2.5. First, Hound populates a dictionary with metrics and identifies metrics that often appear together in jobs’ causality profiles. Metrics that are more prominent in these profiles produce corresponding words that appear more frequently in documents. Topic P1 is defined by APU(-) and PPU(-), which cluster within profiles. APU(-) is more significantly associated with latency and is weighted more heavily.

Second, Hound identifies a mix of relevant topics for each job. Job J2’s stragglers are explained by topics P1 and P2. Topic P2 is weighted more heavily because most stragglers are caused by queueing delay QUD(+) even though some are caused by low processor utilization. See Algorithm 3 for details of meta learning.

Figure 2.6: Example - Hound Ensemble Learning. (a) Clustering base topics to identify ensemble topics. (b) Assigning ensemble topics to individual jobs.
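The sketch below illustrates the document construction and LDA step, assuming profiles is an (N × P) array of absolute-sum-to-one causality profiles; each metric contributes a "(+)" word and a "(-)" word, and word counts are proportional to |C_i|.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

def profiles_to_documents(profiles, words_per_doc=1000):
    """Turn each causality profile into a bag-of-words count vector over a
    vocabulary with two words per metric: metric(+) and metric(-)."""
    n_jobs, p = profiles.shape
    docs = np.zeros((n_jobs, 2 * p), dtype=int)
    for j, c in enumerate(profiles):
        for i, w in enumerate(c):
            word = 2 * i if w > 0 else 2 * i + 1
            docs[j, word] = int(round(abs(w) * words_per_doc))
    return docs

def infer_causal_topics(profiles, n_topics=10, seed=0):
    """Fit LDA over the documents; topics over the metric(+)/metric(-) words
    are candidate causal topics, and fit_transform gives per-job mixtures."""
    docs = profiles_to_documents(profiles)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    mixtures = lda.fit_transform(docs)
    topics = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
    return topics, mixtures
```

The trimming, rescaling, and coverage-based stopping rule of Algorithm 3 would be layered on top of this basic fit.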

2.2.3 Ensemble Learning

Hound’s ensemble learner reconciles potentially divergent topics from multiple learners. The ensemble emphasizes topics found by multiple learners and drops those found by only one. Figure 2.6 illustrates this process.

ALGORITHM 3: Meta Learning Algorithm

• Input:
(i) Causality profiles C = {C_1, C_2, ..., C_N}, minimum number of topics k_min (default k_min = 5), stopping criterion ε (default ε = 0.05), trimming threshold η (default η = 0.1).

• Initialize:
(i) Create vocabulary V. Each task metric K_i (i = 1, ..., P) corresponds to two words in V, K_i(+) and K_i(-), which indicate an atypically high and low value for stragglers.
(ii) Create documents D. Transform each causality profile C_i into a document D_i ∈ D (i = 1, ..., N), which is a bag of words in V. For each entry C_{i·j} in C_i (j = 1, 2, ..., P), calculate probabilities Pr{K_j(+)}, Pr{K_j(−)}:
Pr{K_j(+)} = |C_{i·j}|, Pr{K_j(−)} = 0   if C_{i·j} > 0
Pr{K_j(+)} = 0, Pr{K_j(−)} = |C_{i·j}|   if C_{i·j} < 0
Pr{K_j(+)} = 0, Pr{K_j(−)} = 0   if C_{i·j} = 0

• For k = k_min, k_min + 1, k_min + 2, ...:
(i) Infer topics β = {β_1, β_2, ..., β_k} and the probabilistic mixture of topics for each job θ = {θ_1, θ_2, ..., θ_N} from D with Latent Dirichlet Allocation (LDA).
(ii) Trim each topic to eliminate words with weight lower than η, and trim the topic mixture for each document to remove topics with weight lower than η. Rescale each β_i and every θ_i ∈ θ to keep each of them sum-to-one.
(iii) Calculate the document coverage for each topic, (Σ_{i=1}^{N} I{θ_{i·j} > 0}) / N, with I the indicator function and θ_{i·j} the mixture probability of the j-th topic in document D_i.
(iv) Stop if any inferred topic has document coverage lower than ε.

• Output current β, θ. Each topic in β is defined as a Causal Topic.

23 ALGORITHM 4: Ensemble Learning Algorithm - Phase I • Input:

(i) Causal topics P ={P1, ... , PQ1 }, D ={D1, ... , DQ2 }, C ={C1, ... , CQ3 }, produced by PR, DP and CA, respectively.

(ii) Mixture of causal topics for each job, PP ={PP1, PP2, ... , PPN },

PD ={PD1, PD2, ... , PDN }, PC ={PC1, PC2, ... , PCN }, produced by PR, DP and CA, respectively.

(iii) Minimum and maximum number of causes, m1, m2. (iv) Trimming threshold η (default η=0.1).

• Initialize: SCORE ← 0

• For m = m1, m1+1, m1+2,..., m2

(i). Apply HAC (Hierarchical Agglomerative Clustering) to group all causal topics in P ∪ D ∪ C into m clusters.

(ii). Calculate score SCORE[m] for the m clusters. Let H1 denote the average

intra-cluster similarity, and H2 denote the average inter-cluster similarity (default: cosine similarity, λ=2/3):

SCORE[m] ← λH1 + (1 − λ)(1 − H2)

• Determine the optimal number of clusters L as argmaxm SCORE[m]. Let

{S1, S2,..., SL } be the clusters produced by HAC with optimal parameter L.

• For each cluster Si (i = 1, 2,..., L), take its centroid Ei as an ensemble for Si. If a cluster contains only a single topic or all its topics are produced by the same model, mark it as an outlier ∅.

• Define a set E = {E_i | E_i ≠ ∅, i = 1, 2, ..., L}. Each element in E is considered as an Ensemble Causal Topic.

• Output E

24 ALGORITHM 5: Ensemble Learning Algorithm - Phase II • Input:

(i) Mixture of causal topics for each job, PP ={PP1, PP2, ... , PPN },

PD ={PD1, PD2, ... , PDN }, PC ={PC1, PC2, ... , PCN }, produced by PR, DP and CA, respectively.

(ii) Ensemble causal topics E produced from Phase I. Each ensemble causal topic is a cluster of topics in PP , PD and PC • Update the mixture of causal topics in PP , PD and PC for each job to the mixture of their belonged clusters (ensemble causal topics in E). Suppose PP , PD ,

and PC are updated to {PPEn}, {PDEn}, {PCEn} (n = 1, 2,..., N), respectively.

• Determine the ensemble of topic mixture for each job. Define PEn = 1/3 ·

(PPEn + PDEn + PCEn) as the ensemble of topic mixture for job Jn. Define PE =

{PE1, PE2,..., PEN }.

• Trim mixture of causal topics for each job to eliminate topics with weight lower

than η and rescale each PEi ∈ PE to keep it sum-to-one.

• Output PE

First, Hierarchical Agglomerative Clustering (HAC) identifies similarities in predictive (P∗), dependence (D∗), and causal (C∗) topics [191]. Although the ensemble identifies consensus around some topics, such as low processor usage, APU(-) and PPU(-), it also corrects errors and outliers. P∗ and C∗ report queueing delay, QUD(+), as a cause when D∗ misses it. C∗ misreports network overhead, NET(+), as a cause when P∗ and D∗ do not. Clusters’ centroids define the ensemble’s topics. See details in Algorithm 4. Second, Hound identifies relevant ensemble topics for each job. Weights for ensemble topics are averages of those for prediction, dependence, and causation topics. See details in Algorithm 5.
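A minimal sketch of the clustering step follows, assuming topic_vectors stacks the PR, DP, and CA topics as rows over the shared keyword vocabulary; the score-based search for the number of clusters and the requirement that a cluster span multiple learners (Algorithm 4) are omitted for brevity.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def ensemble_topics(topic_vectors, n_clusters):
    """Cluster base learners' topics with HAC (cosine distance, average
    linkage) and return cluster centroids; singletons are dropped as outliers."""
    dists = pdist(topic_vectors, metric="cosine")
    tree = linkage(dists, method="average")
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    centroids = []
    for c in np.unique(labels):
        members = topic_vectors[labels == c]
        if len(members) >= 2:
            centroids.append(members.mean(axis=0))
    return np.array(centroids)
```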

25 2.3 Experimental Methods

We implement Hound in Apache Spark with 3,800 lines of Python code. We deploy Hound on a cluster of eight NUMA nodes for parallel computing. Each node is configured with 48 AMD Opteron 6174 cores and 256GB DDR3 memory.

2.3.1 Google Trace

We use a month-long trace from a production Google cluster that contains 12K servers and computes for diverse jobs such as web services, MapReduce, and high performance computing [213]. A job is a collection of tasks that execute the same program on different shards of data. Jobs are heterogeneous and configured with different scheduling levels. At one end of the spectrum are online (level-3) jobs that serve revenue generating user requests. At the other end are batch (level-0) jobs that support internal software development and background computation. The trace supplies 180GB of data for 650K jobs comprised of 25M tasks. Our analysis includes only production jobs, which comprise 12% of the total. We exclude jobs with limited task parallelism (i.e., fewer than 20 tasks), which contribute 4.8% of the trace’s tasks. After such post-processing, the trace contains 13K jobs and 3.3M tasks. Some tasks need treatment of missing values (e.g., imputation) before statistical analysis. Table 2.2 presents the metrics profiled for each task. At coarse-grain, system monitors track each task’s scheduled processor time, allocated memory capacity, and actual resource utilization. At fine-grain, hardware counters track instruction throughput and cache miss rates. As tasks run, profilers report mean and max for resource usage and microarchitectural activity over a measurement period (i.e., 300s). We average these periodic measurements over the task’s duration. Collectively, these metrics can diagnose stragglers and reveal causes related to systems management such as resource allocation, job scheduling, colocation control, fail-over mechanisms.
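As an illustration of this post-processing, the sketch below filters a task-level table with pandas; the column names (job_id, priority, latency) are hypothetical stand-ins for the trace's actual schema, and the production-tier threshold is an assumption.

```python
import pandas as pd

# Hypothetical schema: one row per task with job_id, priority, latency, and metrics.
tasks = pd.read_csv("google_trace_tasks.csv")

# Keep production-tier jobs only (assumed to be the higher scheduling levels).
tasks = tasks[tasks["priority"] >= 2]

# Require enough task parallelism per job (at least 20 tasks).
sizes = tasks.groupby("job_id")["latency"].size()
tasks = tasks[tasks["job_id"].isin(sizes[sizes >= 20].index)]

# Impute missing metric values before statistical analysis.
metric_cols = tasks.columns.difference(["job_id", "priority", "latency"])
tasks[metric_cols] = tasks[metric_cols].fillna(tasks[metric_cols].median())
```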

Table 2.2: Task metrics in the Google datacenter trace.

Task Metric | Description
Machine Capacity:
MACHINE CPU | Num. cores in host machine
MACHINE RAM | RAM in host machine (B)
Scheduling:
SCHED DELAY | Task scheduling delay (µs)
PRIORITY | Task priority (integer)
EVICT | Num. times task is evicted
FAIL | Num. times task fails
Resource Request:
REQ CPU | Num. processor cores requested
REQ MEM | Amt. memory requested (B)
REQ DISK | Amt. local storage requested (B)
Resource Usage:
PEAK CPU | Max processor used (core-sec/sec)
MEAN CPU | Mean processor used (core-sec/sec)
MEM ASSIGN | Amt. memory allocated (B)
PEAK MEM | Max memory used (B)
MEAN MEM | Mean memory used (B)
PAGE CACHE | Total page cache used (B)
PAGE CACHE UM | Unmapped page cache used (B)
PEAK IO | Max I/O rate (I/O-sec/sec)
MEAN IO | Mean I/O rate (I/O-sec/sec)
DISK SPACE | Local storage used (B)
µarchitecture:
CPI | Cycles per instruction
CACHE MISS | LLC misses per kilo-inst.

2.3.2 Spark Traces

We use Spark traces to supplement the Google trace. Although we can validate our casual explanations for the Google trace, we cannot compare against expert diagnoses, which are not publicly available for this system. For this reason, we demonstrate Hound on a pair of traces collected for Spark performance on Amazon EC2 clusters and use a prior study of this data [202] to define expert diagnoses for comparison.

27 The first trace profiles a five-node Amazon EC2 cluster running Big Data Bench (BDBench), which launches five iterations of ten unique SQL queries [255]. The trace includes 84 Spark jobs, 162 stages, and 115K tasks that process 60GB of input data. The second trace profiles a twenty-node Amazon EC2 cluster running TPC- DS, which simulates 13 users executing 20 unique SQL queries in random order [199]. The trace includes 1,003 Spark jobs, 1,296 stages and 239K tasks that process 850GB of input data. We consider versions of BDBench and TPC-DS that compute on data stored on disk. Table 2.3 presents the major performance metrics profiled for each task. Note that metrics are summary counters instead of time series data. Data is drawn from fine-grained instrumentation of Spark’s run-time engine. Spark-SQL transforms SQL queries into Spark jobs [46]. A Spark job consists of multiple stages and each stage consists of one or more parallel tasks. Tasks within the same stage run the same binary with different partitions of the data. Typically, early stages read data from the file system or memory cache while later stages read the previous stage’s data with a network shuffle [202].

Table 2.3: Task metrics in the Spark traces for BDBench and TPC-DS. Task Metric Description Executor EXE DES TIME Executor deserialization time EXE RUN TIME Executor run time BROADCAST TIME Variable broadcast time Output RESULT SIZE Result size RESULT DES TIME Result deserialization time OUTPUT WRITE TIME Output blocked write time OUTPUT BYTES Output size in bytes Spill SPILL BYTES Num. bytes spilled to disk SPILL BYTES DISK Num. of compressed bytes spilled to disk Input READ BYTES Total input size in bytes 28 READ TIME Total time to read input HDFS READ BYTES Input read bytes from HDFS HDFS READ PACKETS Input read packets from HDFS HDFS READ TIME Time to read input from HDFS HDFS OPEN TIME Time to open input file on HDFS Shuffle Read SHFL READ RBYTES Shuffle read bytes (network) SHFL READ RBLKS Shuffle read blocks (network) SHFL READ WAIT TIME Time to open shuffle read connection SHFL READ LBLKS Shuffle read blocks on local server SHFL READ LBYTES Shuffle read bytes on local server SHFL READ LTIME Time to shuffle read on local server Shuffle Write SHFL WRITE BYTES Size of shuffle write in bytes SHFL WRITE TIME Time to shuffle write SHFL WRITE OPEN TIME Time to open file for shuffle write SHFL WRITE CLOSE TIME Time to close file for shuffle write Processor Usage CPU USER User-space CPU usage CPU SYS Kernel-space CPU usage Disk Usage DISK UTILIZATION Disk utilization DISK READ THPT Disk read throughput DISK WRITE THPT Disk write throughput Network Usage NET READ BYTES PSEC Num. network bytes read per second NET SEND BYTES PSEC Num. network bytes sent per second NET READ PACK PSEC Num. network packets read per second NET SEND PACK PSEC Num. network packets sent per second Scheduler Delay SCHED DELAY Scheduling time to place a task Garbage Collection GC TIME Garbage collection overhead First Task FIRST TASK True for first task in stage on a worker

2.4 Evaluation with Google Trace

Table 2.4 presents Hound’s topics that explain stragglers in Google’s datacenter.

Topics constructed from prediction (P∗), dependence (D∗), and causation (C∗) models cluster to produce an ensemble topic (E∗). Each topic is a weighted combination of keywords, which are measured metrics followed by a suffix, drawn from models. The suffix “(+)” or “(-)” means the metric is significantly larger or smaller for stragglers than for normal tasks. Keywords and topics produce natural interpretations and reveal causes of stragglers. Many of the following causes are concise and consistent with domain-specific expertise, which is remarkable given the wealth of data.

Table 2.4: Hound’s causal topics for the Google dataset, derived from an ensemble of predictive (P), dependent (D), causal (C) models. Cluster identifies the models’ topics that produce each ensemble topic. Tables 2.5-2.7 show individual models’ topics. Topics marked “–” are revealed by individual models but rejected by the ensemble.

Topic | Keywords (Weights) | Cluster | Interpretation
E0 | MEM ASSIGN(+) 0.5, MEAN MEM(+) 0.25, PEAK MEM(+) 0.25 | P0, P3, D0, C0 | Data Skew
E1 | PAGE CACHE(+) 0.45, PAGE CACHE UM(+) 0.38, MEM ASSIGN(+) 0.17 | P1, D1, C1 | Data Skew
E2 | DISK SPACE(+) 1.0 | P2, D2, C2 | Data Skew
E3 | MEAN CPU(+) 0.52, PEAK CPU(+) 0.48 | P4, D3, C3 | Computation Skew
E4 | PEAK IO(+) 0.51, MEAN IO(+) 0.49 | P5, D4, C4 | I/O Skew
E5 | MEAN CPU(-) 0.8, PEAK CPU(-) 0.2 | P6, D5, C5, C6 | Limited Processor
E6 | MEAN MEM(-) 0.83, PEAK MEM(-) 0.17 | P7, D6, D7, C7 | Limited Memory
E7 | MEAN IO(-) 1.0 | D8, C8 | Limited I/O
E8 | PEAK IO(-) 0.83, MEAN IO(-) 0.17 | P8, D9 | Limited I/O
E9 | CACHE MISS(+) 0.54, CPI(+) 0.46 | P9, D10, C9 | Cache Bottleneck
E10 | SCHED DELAY(+) 1.0 | P10, D11, C10 | Scheduler Delay
E11 | EVICT(+) 1.0 | P11, C11 | Eviction Delay
P12 | FAIL(+) 1.0 | – | Failure
C12 | MACHINE RAM(+) 1.0 | – | RAM Heterogeneity

Load Imbalance (Skew¹). Hound finds that tasks with more work are more likely to be stragglers. Tasks that use more memory (E0), larger page caches (E1), or more local storage (E2) may be computing on more data. These topics indicate larger working sets that may not fit in memory. Tasks that require more processor time (E3) or I/O transactions (E4) may also have more work.

Resource Constraints. Tasks that use fewer resources are likely to perform poorly. Under-utilization of one resource suggests contention for others. Atypically low processor and memory use suggest poor progress (E5, E6), and atypically low I/O rates suggest constrained data supply (E7, E8). Note that atypically low and high processor usage are both causes of stragglers (E3, E5), but these conditions do not arise simultaneously. Some jobs' stragglers may be caused by low usage and other jobs' by high usage. Hound identifies causes from patterns across many jobs and then assigns the most relevant cause to each job.

Microarchitectural Activity. Processor counters shed light on poor task performance. Cache misses increase the average number of cycles required per instruction (E9). Although low processor usage and frequent cache misses may appear correlated, Hound identifies separate causes (E5 versus E9) that are justified by domain expertise. When measuring processor usage, the operating system tracks processor time whereas the processor tracks cycles committing instructions. Processor time can be low, perhaps due to thread scheduling, even when pipelines rarely stall for caches. Conversely, processor time can be high even when tasks have poor instruction-level parallelism.

Scheduling Problems. Finally, Hound finds that the cluster manager's decisions affect task completion. Queueing delays cause stragglers by extending tasks' end-to-end latency (E10). Eviction delays also cause stragglers as tasks halt computation on overloaded machines, re-launch on another machine, and lose progress (E11).

Hound's ensemble guards against false conclusions, which may be consistent with operator intuition but arise from biased models. Systems operators might think that machine heterogeneity creates stragglers as some tasks run on slower machines, but only the causation model flags memory heterogeneity (C12). Operators might worry about transient task failures that require re-launch, but only the prediction model raises this concern (P12). Users appropriately size processor and memory requests such that they are irrelevant in our analysis of stragglers.

¹ Here, skew means the non-uniform partition of data across tasks rather than the statistical measure of asymmetry for a probability distribution.

Table 2.5: Hound's causal topics on the Google trace, with predictive model (PR).

Topic | Keywords and Weights | Interpretation
P0 | MEAN MEM(+) 0.36, PEAK MEM(+) 0.34, MEM ASSIGN(+) 0.3 | Data Skew
P1 | PAGE CACHE(+) 0.55, PAGE CACHE UNMAP(+) 0.45 | Data Skew
P2 | DISK SPACE(+) 1.0 | Data Skew
P3 | MEM ASSIGN(+) 1.0 | Data Skew
P4 | PEAK CPU(+) 0.55, MEAN CPU(+) 0.45 | Computation Skew
P5 | MEAN IO(+) 0.57, PEAK IO(+) 0.43 | I/O Skew
P6 | MEAN CPU(-) 1.0 | Limited Processor
P7 | MEAN MEM(-) 1.0 | Limited Memory
P8 | PEAK IO(-) 0.66, MEAN IO(-) 0.34 | Limited I/O
P9 | CACHE MISS(+) 0.55, CPI(+) 0.45 | Cache Bottleneck
P10 | SCHED DELAY(+) 1.0 | Queueing Delay
P11 | EVICT(+) 1.0 | Eviction Delay
P12 | FAIL(+) 1.0 | Failure Delay

Table 2.6: Hound's causal topics on the Google trace, with dependence model (DP).

Topic | Keywords and Weights | Interpretation
D0 | MEAN MEM(+) 0.36, PEAK MEM(+) 0.35, MEM ASSIGN(+) 0.29 | Data Skew
D1 | PAGE CACHE(+) 0.44, PAGE CACHE UNMAP(+) 0.35, MEM ASSIGN(+) 0.21 | Data Skew
D2 | DISK SPACE(+) 1.0 | Data Skew
D3 | MEAN CPU(+) 0.57, PEAK CPU(+) 0.43 | Computation Skew
D4 | PEAK IO(+) 0.55, MEAN IO(+) 0.45 | I/O Skew
D5 | MEAN CPU(-) 0.74, PEAK CPU(-) 0.26 | Limited CPU
D6 | MEAN MEM(-) 1.0 | Limited Memory
D7 | PEAK MEM(-) 0.3, MEM ASSIGN(-) 0.25, MEAN MEM(-) 0.2, PAGE CACHE UNMAP(-) 0.13, PAGE CACHE(-) 0.12 | Limited Memory
D8 | MEAN IO(-) 1.0 | Limited I/O
D9 | PEAK IO(-) 1.0 | Limited I/O
D10 | CACHE MISS(+) 0.54, CPI(+) 0.46 | Cache Bottleneck
D11 | SCHED DELAY(+) 1.0 | Queueing Delay

2.4.1 Mixtures of Causes

Hound not only identifies recurring causes of stragglers; it also determines which jobs suffer from which causes. Hound assigns each job a weighted mix of causes. Weights estimate the fraction of stragglers attributed to each cause in the mix, as shown in Table 2.8 for representative jobs. Sometimes, a job's stragglers can be explained by a single cause; job 628...093's stragglers are explained by data skew. More often, a job's stragglers have multiple causes; job 634...350's stragglers are explained by a mix of low processor and memory utilization.

Table 2.7: Hound's causal topics on the Google trace, with causal model (CA).

Topic | Keywords and Weights | Interpretation
C0 | PEAK MEM(+) 0.26, MEM ASSIGN(+) 0.23, MEAN MEM(+) 0.23, PAGE CACHE(+) 0.16, PAGE CACHE UNMAP(+) 0.12 | Data Skew
C1 | PAGE CACHE(+) 0.36, PAGE CACHE UNMAP(+) 0.33, MEM ASSIGN(+) 0.31 | Data Skew
C2 | DISK SPACE(+) 1.0 | Data Skew
C3 | MEAN CPU(+) 0.55, PEAK CPU(+) 0.45 | Computation Skew
C4 | PEAK IO(+) 0.56, MEAN IO(+) 0.44 | I/O Skew
C5 | MEAN CPU(-) 1.0 | Limited Processor
C6 | PEAK CPU(-) 0.53, MEAN CPU(-) 0.47 | Limited Processor
C7 | MEAN MEM(-) 1.0 | Limited Memory
C8 | MEAN IO(-) 1.0 | Limited I/O
C9 | CACHE MISS(+) 0.52, CPI(+) 0.48 | Cache Bottleneck
C10 | SCHED DELAY(+) 1.0 | Queueing Delay
C11 | EVICT(+) 1.0 | Eviction Delay
C12 | MACHINE RAM(+) 1.0 | RAM Heterogeneity

Hound guards against faulty mixtures of causes by constructing an ensemble of independent models (i.e., for prediction, dependence, and causation). Tables 2.9-2.10 show how ensembles avoid faulty diagnoses. For job 630...702, DP and CA reveal a mix of data, computation, and I/O skew whereas PR alone identifies limited I/O resources. The ensemble considers I/O scarcity a false cause. For job 634...350's causes, DP alone reveals cache misses while CA misses limited memory. The ensemble considers cache behavior a false cause, but includes memory scarcity as a true cause. We propose a series of measures to understand mixtures of causes.

Table 2.8: Example - Mixtures of causes

Job ID | Causes | Weight
6283499093 | Data Skew | 1.00
6266469130 | Queueing Delay | 1.00
6274140245 | Limited Processor | 1.00
6343946350 | Limited Processor | 0.65
6343946350 | Limited Memory | 0.35
6308689702 | Data Skew | 0.53
6308689702 | Computation Skew | 0.27
6308689702 | I/O Skew | 0.20
6343048076 | Data Skew | 0.33
6343048076 | Eviction | 0.25
6343048076 | Queueing Delay | 0.18
6343048076 | Limited Processor | 0.12
6343048076 | Limited I/O | 0.12

Table 2.9: Comparison of inferred causes from varied modeling strategies for job 6308689702.

Model | Causes | Weights
DP | Data Skew (D0) | 0.27
DP | Computation Skew (D3) | 0.26
DP | I/O Skew (D4) | 0.25
DP | Data Skew (D1) | 0.22
CA | Computation Skew (C3) | 0.36
CA | Data Skew (C0) | 0.33
CA | I/O Skew (C4) | 0.31
PR | Data Skew (P0) | 0.31
PR | Data Skew (P2) | 0.22
PR | Data Skew (P1) | 0.19
PR | Computation Skew (P4) | 0.17
PR | Limited I/O (P8) | 0.11
ENS | Data Skew (E0 + E1 + E2) | 0.51
ENS | Computation Skew (E3) | 0.26
ENS | I/O Skew (E4) | 0.19
ENS | Limited I/O (rejected by ensemble) | 0.04

Table 2.10: Comparison of inferred causes from varied modeling strategies for job 6343946350.

Model | Causes | Weights
DP | Limited Processor (D5) | 0.35
DP | Limited Memory (D6) | 0.29
DP | Limited Memory (D7) | 0.19
DP | Cache Bottleneck (D10) | 0.17
CA | Limited Processor (C5) | 1.0
PR | Limited Memory (P7) | 0.55
PR | Limited Processor (P6) | 0.45
ENS | Limited Memory (E6) | 0.6
ENS | Limited Processor (E5) | 0.34
ENS | Cache Bottleneck (rejected by ensemble) | 0.06

Table 2.11: Number of causes per job

Number of Causes | 1 | 2 | 3 | 4 | 5
Percent of Jobs | 12% | 33% | 36% | 16% | 3%

Table 2.12: Coverage statistics for causes

Cause | Coverage | Dominant Coverage
Data Skew (E0, E1, E2) | 73.6% | 55.0%
Limited Processor (E5) | 39.2% | 12.1%
Cache Misses (E9) | 32.6% | 7.0%
Limited I/O (E7, E8) | 36.7% | 6.6%
Queueing Delay (E10) | 20.0% | 5.1%
Limited Memory (E6) | 13.6% | 2.7%
Computation Skew (E3) | 31.2% | 2.2%
Eviction Delay (E11) | 3.8% | 0.9%
I/O Skew (E4) | 5.6% | 0.6%

We say a cause is dominant when it explains the majority of a job's stragglers. We define a cause's coverage as how often it explains some of a job's stragglers. We define a cause's dominant coverage as how often it explains the majority of a job's stragglers, which is a more selective measure and smaller than coverage. Topics with greater coverage are relevant for more jobs and provide greater utility to system operators.

Consider Table 2.8 for examples. Job 630...702's dominant cause is data skew since its weight is greater than the sum of weights for computation and I/O skew. As seen for job 634...076, not every job has a dominant cause. Suppose the system has only the six jobs in Table 2.8. Data skew's coverage is 50% because it explains three of six jobs, while limited processor's and limited memory's coverages are 33.3% and 16.7%, respectively. Data skew's dominant coverage is 33.3% whereas limited memory's is zero.

Table 2.11 presents the percentage of jobs that require multiple causes. Although Hound identifies eleven possible causes, each used by some jobs, more than 80% of jobs require no more than three causes to explain their stragglers. When multiple causes are required, one is usually dominant.

Table 2.12 summarizes coverage. The top three causes—data skew, limited processor utilization, cache misses—explain the majority of stragglers in 74.1% of jobs. 73.6% of jobs experience some form of data skew. Many jobs also suffer from low resource usage or cache behavior. Relatively minor causes include memory availability, task eviction and I/O skew.
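To make the coverage definitions concrete, here is a minimal Python sketch that computes coverage and dominant coverage from a job-to-cause weight mapping in the style of Table 2.8; the dictionary holds only a few illustrative jobs, not the full trace.

```python
from collections import defaultdict

# Hypothetical mixtures of causes keyed by job ID, as in Table 2.8.
job_causes = {
    "6283499093": {"Data Skew": 1.00},
    "6343946350": {"Limited Processor": 0.65, "Limited Memory": 0.35},
    "6308689702": {"Data Skew": 0.53, "Computation Skew": 0.27, "I/O Skew": 0.20},
}

coverage = defaultdict(int)           # jobs where the cause explains some stragglers
dominant_coverage = defaultdict(int)  # jobs where the cause explains the majority

for mixture in job_causes.values():
    for cause, weight in mixture.items():
        coverage[cause] += 1
        if weight > 0.5:              # majority of this job's stragglers
            dominant_coverage[cause] += 1

n_jobs = len(job_causes)
for cause in coverage:
    print(f"{cause}: coverage {coverage[cause] / n_jobs:.0%}, "
          f"dominant coverage {dominant_coverage[cause] / n_jobs:.0%}")
```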

2.4.2 Validation with Case Studies

We determine whether Hound's causal explanations are consistent with domain expertise by comparing inferred causes with manually diagnosed ones. First, we use boxplots to visualize distributions of microarchitectural, system, and scheduling activity. Visual differences in boxplots for stragglers (S) and normal tasks (N) suggest causes of stragglers.

Second, we measure differences between stragglers' and normal tasks' data distributions by calculating the Wilcoxon Rank-Sum Test (WRT [258]) and the Mean Quartile Difference (MQD). WRT evaluates a null hypothesis in which profiled distributions for stragglers and normal tasks exhibit no significant difference. Rejected hypotheses, when the WRT p-value < 0.05, reveal causes of stragglers. MQD averages differences between two distributions' quartiles. Suppose datasets d_S, d_N are profiled from stragglers and normal tasks, and suppose Q_i denotes a dataset's i-th percentile. MQD calculates

MQD = (1/3) Σ_{i ∈ {25, 50, 75}} ( Q_i(d_S) − Q_i(d_N) )
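As a minimal sketch of these two statistics, assuming SciPy's rank-sum test stands in for WRT and NumPy percentiles for the quartiles; the straggler and normal-task arrays below are synthetic placeholders rather than profiled data.

```python
import numpy as np
from scipy.stats import ranksums

rng = np.random.default_rng(0)
d_straggler = rng.normal(loc=1.5, scale=0.4, size=200)   # synthetic straggler metric
d_normal = rng.normal(loc=1.0, scale=0.4, size=2000)     # synthetic normal-task metric

# Wilcoxon rank-sum test: a small p-value rejects "no difference in distributions".
_, p_value = ranksums(d_straggler, d_normal)

# Mean Quartile Difference: average gap at the 25th, 50th, and 75th percentiles.
quartiles = [25, 50, 75]
mqd = np.mean([np.percentile(d_straggler, q) - np.percentile(d_normal, q)
               for q in quartiles])

print(f"WRT p-value = {p_value:.3g}, MQD = {mqd:.3f}")
```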

We perform manual diagnosis for a series of case studies. First, we classify tasks within a job as straggling or normal based on measured latency. Then, we determine whether stragglers' metrics differ from normal tasks' because those that differ may explain stragglers' atypical latency. Finally, we compare this manual analysis against Hound's inferred causes.

Figure 2.7 considers a job with stragglers due to multiple causes. Stragglers use more memory, page cache, and disk space. They also require more processor time and have higher peak I/O rates. Manual analysis is challenging because data indicate multiple atypical behaviors. Inferred causes are consistent with domain expertise. Hound automatically infers three causes—data, computation, and I/O skew. Moreover, it accurately infers the relative importance of these causes. Hound assigns weights of 0.53, 0.27, and 0.20 to data, computation, and I/O skew. These weights are consistent with manual analysis. MQD for data skew's metrics are larger than computation and I/O skew's. Data skew explains more stragglers than other types of skew.

Figure 2.7: Profiles suggest stragglers in job 6308689702 arise from a mix of data, computation, and I/O skew. Hound's inferred causes match human analysis. "S" denotes the distribution of values for stragglers, plotted in red; "N" denotes the distribution of values for normal tasks, plotted in blue; yellow lines indicate medians in the distributions.

Our findings are borne out repeatedly for the datacenter's diverse jobs. For example, Figure 2.8 compares profiles for stragglers and normal tasks. Stragglers have significantly higher memory usage, memory allocations, and page cache usage. Manual analysis suggests data skew is a probable cause of stragglers, matching Hound's automatically inferred cause.

Figure 2.8: Profiles suggest stragglers in job 6283499093 arise from data skew. Hound's inferred causes match human analysis. "S" denotes the distribution of values for stragglers, plotted in red; "N" denotes the distribution of values for normal tasks, plotted in blue; yellow lines indicate medians in the distributions.

In Figure 2.9, stragglers have significantly higher queueing time than normal tasks. Other metrics show comparable distributions between stragglers and normal tasks. Manual analysis suggests queueing delay is a probable cause of stragglers, matching Hound's automatically inferred cause.

Figure 2.9: Profiles suggest stragglers in job 6266469130 arise from queueing delay. Hound's inferred causes match human analysis. "S" denotes the distribution of values for stragglers, plotted in red; "N" denotes the distribution of values for normal tasks, plotted in blue; yellow lines indicate medians in the distributions.

In Figure 2.10, stragglers have significantly lower processor usage than normal tasks. Other metrics show comparable distributions between stragglers and normal tasks. Manual analysis suggests limited processor resources are a probable cause of stragglers, matching Hound's automatically inferred cause.

Figure 2.10: Profiles suggest stragglers in job 6274140245 arise from limited processor usage. Hound's inferred causes match human analysis. "S" denotes the distribution of values for stragglers, plotted in red; "N" denotes the distribution of values for normal tasks, plotted in blue; yellow lines indicate medians in the distributions.

2.4.3 Validation with Mutual Information

We draw on information theory to test the validity of Hound's topics. Shannon mutual information, in units of nats, quantifies information obtained about one random variable through another random variable [225]. The mutual information between discrete variables X and Y, with joint density P_XY(x, y) and marginals P_X(x) and P_Y(y), is

I(X; Y) = Σ_{x,y} P_XY(x, y) log [ P_XY(x, y) / ( P_X(x) P_Y(y) ) ].

We use mutual information to validate the accuracy and coherence of Hound's causal topics assigned for each job. For accuracy, task latency should have high mutual information with metrics within causal topics (I_M·L) and low mutual information otherwise (I_U·L). For coherence, metrics within a causal topic should have high mutual information with each other (I_M·M) and low mutual information otherwise (I_M·U). For each of these desiderata, we calculate mutual information, averaged over topics and metrics.

I_{M·L}^{(j)} = E_t E_w I( M_{t,w}^{(j)} ; L^{(j)} )

I_{U·L}^{(j)} = E_k I( U_k^{(j)} ; L^{(j)} )

I_{M·M}^{(j)} = E_t E_{v,w} I( M_{t,v}^{(j)} ; M_{t,w}^{(j)} )

I_{M·U}^{(j)} = E_t E_{w,k} I( M_{t,w}^{(j)} ; U_k^{(j)} )

where j identifies the job, L denotes task latency, M_{t,w} denotes the metric profiled for word w in a topic t assigned to the job, and U_k denotes a metric not in any topic.

Figure 2.11 shows the distribution of mutual information measures across all jobs in the datacenter trace. Large I_ML and small I_UL indicate accuracy as Hound's topics include the most relevant metrics and exclude the rest. Large I_MM and small I_MU indicate coherence as Hound's topics integrate multiple metrics into a causal explanation for stragglers.

We argue that Hound's inferred causes are accurate because task latency has high mutual information with metrics included in causal topics and low mutual information with all other metrics. And we argue that its causes are coherent because metrics within a causal topic have high mutual information with each other and low mutual information otherwise. Thus, Hound automates model inference and data analysis yet identifies causes that are accurate, coherent, and consistent with manual analysis.

Figure 2.11: Mutual information to assess accuracy (large I_ML, small I_UL) and coherence (large I_MM, small I_MU). Boxplots show the distribution of values across all jobs in the datacenter trace.
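A minimal sketch of the accuracy measure, assuming continuous metrics are discretized into bins so that discrete mutual information (in nats) applies; the latency and metric arrays are synthetic placeholders rather than trace data, and the binning choice is an assumption of this sketch.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def discretize(x, bins=10):
    """Bin a continuous metric so discrete mutual information applies."""
    return np.digitize(x, np.histogram_bin_edges(x, bins=bins))

rng = np.random.default_rng(1)
latency = rng.gamma(shape=2.0, scale=1.0, size=5000)        # task latency L
topic_metric = 0.8 * latency + rng.normal(0, 0.5, 5000)     # metric inside a causal topic
other_metric = rng.normal(0, 1.0, 5000)                     # metric outside any topic

L = discretize(latency)
I_ML = mutual_info_score(discretize(topic_metric), L)  # expected to be large (accuracy)
I_UL = mutual_info_score(discretize(other_metric), L)  # expected to be small

print(f"I_ML = {I_ML:.3f} nats, I_UL = {I_UL:.3f} nats")
```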

2.4.4 Comparisons with Expert Diagnosis

Hound’s advantage over prior approaches is its comprehensive approach to causal analysis on rich datasets. Tables 2.13-2.14 survey studies in which experts discover

42 Figure 2.11: Mutual information to assess accuracy (large IML, small IUL) and coherence (large IMM , small UM ). Boxplots show distribution of values across all jobs in the datacenter trace. and investigate specific causes of stragglers in large-scale, distributed systems. Hound discovers causes in the Google system that match causes found by expert diagnoses in several other systems. It discovers the same fundamental system phenomena that were discovered in other studies despite differences in methods and systems. The only gaps in diagnosis relate to network congestion, hardware heterogeneity and power management; the Google trace lacks profiles to reveal these conditions. More- over, Hound identifies prevalent causes of stragglers. Prior studies often attribute stragglers to data skew and resource constraints, which match Hound’s diagnoses. Coverage statistics indicate that data skew and bottlenecks in processors, memory, and I/O are dominant causes for 76% of the datacenter’s jobs (cf. Table 2.12). One might be concerned about comparing Hound for a Google datacenter and assorted methods for other systems, but the comparison is informative. Google’s trace includes both interactive and batch jobs. Hound’s analysis of interactive jobs

43 Table 2.13: Examples of stragglers’ causes from related studies that produce expert diagnoses. Straggler Cause Example Data Skew Hash function based partitioning scheme distributes data unevenly across tasks. Tasks processing more data are likely to be stragglers. [159] Resource Constraints Processor [44], memory [265] and I/O [92] are contended by collocated tasks, background daemons, garbage col- lector, etc. µarchitecture Activity Colocated tasks contend for shared hardware such as cache. [271] Scheduler Delay Tasks trapped in multiple layers of queues in servers and network switches. [92] Eviction Delay Low-priority tasks are preempted, re-launch on another machine, and lose progress. [271] Computation Skew Some records (e.g., dense graphs) are much more com- putationally expensive than other records (e.g., sparse graphs). [158] I/O Skew Stragglers are caused due to intensive disk I/O, such as Spark’s shuffle write. [202] Network Congestion A rack with many tasks can be congested on its network link and produce stragglers. [44] Hardware Heterogeneity Tasks assigned to obsolete hardware can be slower than others. [267] Power Management Power-saving can add significant delay when moving from inactive to active modes. [92] matches other Google-based studies of tails in latency-sensitive services [92, 271], and our analysis of batch jobs is consistent with those for Hadoop and Spark. Second, Google’s trace, with over 40K unique logical job names (i.e., binaries), is diverse and representative of real-world workloads [213].

2.4.5 Comparison with Simpler Base Learners

Predictive Models. Hound constructs a model to predict task latency from system conditions. Figure 2.12 compares Hound's method against state-of-the-art predictive models by illustrating the distribution of adjusted R2 values across all jobs. Hound's Bagging Augmented ElasticNet (BAE) performs 16% better than linear regression, slightly better than ElasticNet and CART, and comparably to SVM. Nearest neighbor regression (KNN) and linear regression perform much worse than the others. Linear regression produces negative coefficients of determination (adjusted R2) for more than 25% of jobs.

Figure 2.12: Comparison of modeling methods for latency prediction on the Google dataset.

Table 2.15 indicates that multicollinearity affects topic models. Linear regression's topics, such as P3 and P4, report correlated metrics with contradictory signs. Model parameters with incorrect signs and implausible magnitudes are a typical symptom of multicollinearity [183]. BAE regularization mitigates these effects and produces better causal explanations.

Table 2.15: Causal topics inferred with linear regression as base learner.

Topic | Keywords | Weights
P0 | PEAK MEM(+), MEAN MEM(-) | 0.52, 0.48
P1 | MEAN MEM(+), PEAK MEM(-) | 0.71, 0.29
P2 | MEAN MEM(-), MEM ASSIGN(+) | 0.54, 0.46
P3 | CACHE MISS(-), CPI(+) | 0.5, 0.5
P4 | CACHE MISS(+), CPI(-) | 0.52, 0.48
P5 | PAGE CACHE(+), PAGE CACHE UM(-) | 0.66, 0.34
P6 | PAGE CACHE(-), PAGE CACHE UM(+) | 0.5, 0.5
P7 | MEM ASSIGN(+) | 1.0
P8 | DISK SPACE(+) | 1.0
P9 | MEAN CPU(-) | 1.0
P10 | SCHED DELAY(+) | 1.0
P11 | EVICT DELAY(+) | 1.0

Figure 2.13: Comparison of dependence measures for safeguards against false correlation on the Google dataset.

Figure 2.14: Comparison of propensity score estimates for causal inference on the Google dataset.

Table 2.16: Causal topics inferred with Pearson's correlation as base learner.

Topic | Keywords | Weights
D0 | MEAN IO(-), SCHED DELAY(-), MEM ASSIGN(-), RAM REQ(-) | 0.45, 0.19, 0.18, 0.18
D1 | MEM ASSIGN(+) | 1.0
D2 | SCHED DELAY(+) | 1.0
D3 | CACHE MISS(+) | 1.0
D4 | MEAN CPU(-) | 1.0
D5 | MEAN MEM(-) | 1.0
D6 | PAGE CACHE(+), MEAN MEM(+) | 0.65, 0.35

Dependence Models. Copula-based dependence measures control false correlations better than conventional measures such as Pearson's correlation [174, 206]. Figure 2.13 shows dependence estimates after we destroy the correlation between system conditions and task latency by randomly shuffling latency measurements in the dataset. Better estimators are those that more consistently report a value close to zero since there is no dependence between variables. Hound's Signed Schweizer-Wolff (SSW) estimator guards against false correlations much better than Pearson's, Spearman's, and Kendall's measures.

Table 2.16 indicates that Pearson's linear estimate affects topic models. First, Pearson causes models to miss important keywords in inferred topics, which often have just one keyword. Pearson's topics also include erroneous relationships, such as the negative correlation between scheduling delay and task latency in topic D0; the correlation should in fact be positive. Pearson's correlation is sensitive to noise and outliers, which can create false correlations. Moreover, Pearson is a linear estimator and may miss non-linear associations.

Causal Models. The core of the Rubin Causal Model (RCM) is the propensity score, and we compare state-of-the-art methods for estimating these scores (cf. Figure 2.14). Hound's AdaBoosted IPW estimates are most accurate. Naive Bayes and logistic regression perform much worse than the others. Lasso regularization performs in between.

Table 2.17 indicates that Rubin Causal Models that use logistic regression to estimate propensity scores are ineffective when inferring topics. Logistic regression severely distorts causal effect estimation, producing multiple pairs of contradictory topics, such as C11 and C12, and topics with reversed signs for correlated regressors, such as C3 and C4.

Table 2.17: Causal topics inferred with a logistic-regression-based Rubin Causal Model as base learner.

Topic | Keywords | Weights
C0 | CPI(+), CACHE MISS(+) | 0.5, 0.5
C1 | CPI(-) | 1.0
C2 | PEAK IO(+), MEAN IO(+) | 0.63, 0.37
C3 | MEAN IO(+), PEAK IO(-) | 0.69, 0.31
C4 | MEAN IO(-), PEAK IO(-) | 0.52, 0.48
C5 | PAGE CACHE(-), PAGE CACHE UM(-) | 0.53, 0.47
C6 | PAGE CACHE(+), PAGE CACHE UM(-) | 0.51, 0.49
C7 | PEAK MEM(+), MEAN MEM(+) | 0.51, 0.49
C8 | MEAN MEM(-), PEAK MEM(+) | 0.52, 0.48
C9 | RAM CAPACITY(+) | 1.0
C10 | RAM CAPACITY(-) | 1.0
C11 | SCHED DELAY(+) | 1.0
C12 | SCHED DELAY(-) | 1.0
C13 | MEAN CPU(-), PEAK CPU(-) | 0.5, 0.5
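A small sketch of the shuffling safeguard described above: permuting latency destroys any real association, so a well-behaved estimator should report a value near zero. Spearman's and Kendall's measures stand in here because the Signed Schweizer-Wolff estimator is not available in SciPy; the data are synthetic.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

rng = np.random.default_rng(2)
metric = rng.lognormal(mean=0.0, sigma=1.0, size=1000)    # a system condition
latency = 2.0 * metric + rng.normal(0, 1.0, size=1000)    # correlated task latency

shuffled = rng.permutation(latency)  # destroys the true dependence

# After shuffling, each estimate should be close to zero.
for name, estimator in [("Pearson", pearsonr), ("Spearman", spearmanr),
                        ("Kendall", kendalltau)]:
    value = estimator(metric, shuffled)[0]
    print(f"{name} on shuffled data: {value:+.3f}")
```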

2.5 Evaluation with Spark Traces

Hound reproduces the results of domain-specific expertise. Ousterhout et al. say something causes stragglers if tasks become normal when that thing requires no time [202]. Such expert diagnosis for BDBench produces four observations. First, three causes—HDFS Read (Disk), Garbage Collection, Shuffle Write (Disk)—explain a large fraction of stragglers. Second, one cause—Output Size (Skew)—explains a small fraction of stragglers. Third, other events—First Task, Scheduler Delay, Shuffle Read—are insignificant causes. Finally, no cause among the three most prominent ones dominates.

Table 2.18 presents Hound's eight topics, which map to four straggler causes that are consistent with expert diagnosis. Table 2.19 indicates that mixture probabilities, which estimate the number of stragglers explained by each cause, also align with expert diagnosis.

Table 2.18: Hound's causal topics for the Spark BDBench dataset

Topic Keywords and Weights | Interpretation | Mean Mix. Prob.
DISK READ THPT(+) 1.0 | HDFS (Input) read | 16%
READ BYTES(+) 0.25, READ TIME(+) 0.25, HDFS READ BYTES(+) 0.25, HDFS READ TIME(+) 0.25 | HDFS (Input) read | 10%
HDFS READ TIME(+) 0.35, READ TIME(+) 0.25, HDFS OPEN TIME(+) 0.25 | HDFS (Input) read | 10%
GC TIME(+) 1.0 | GC overhead | 30%
DISK UTILIZATION(+) 0.5, DISK WRITE THPT(+) 0.5 | Shuffle write | 14%
SHFL WRITE OPEN TIME(+) 1.0 | Shuffle write | 9%
SHFL WRITE TIME(+) 0.62, SHFL WRITE CLOSE TIME(+) 0.38 | Shuffle write | 2%
OUTPUT WRITE TIME(+) 0.35, DISK WRITE TIME(+) 0.33, OUTPUT BYTES(+) 0.32 | Output skew | 9%

Table 2.19: Hound's estimate of stragglers (percentage) explained by each cause for the Spark BDBench dataset

HDFS (Input) Read | GC Overhead | Shuffle Write | Output Skew
36% | 30% | 25% | 9%

Similarly, expert diagnosis for TPC-DS produces five observations. First, two causes—HDFS Read (Disk), Output Size (Skew)—explain a large fraction of stragglers. Second, three causes—Shuffle Read (Network), Garbage Collection, Shuffle Write (Disk)—explain a moderate fraction of stragglers. Third, one cause—Scheduler Delay—explains a small fraction of stragglers. First Task is an insignificant cause. Finally, there is no dominant cause among the two most prominent causes and no dominant cause among the next three most prominent causes.

Tables 2.20-2.21 present Hound's analysis. Nine topics map to six straggler causes that are consistent with expert diagnosis. Mixture probabilities are also broadly consistent with expert diagnosis.

Table 2.20: Hound's causal topics for the Spark TPC-DS dataset

Topic Keywords and Weights | Interpretation | Mean Mix. Prob.
HDFS READ TIME(+) 0.5, INPUT READ TIME(+) 0.5 | HDFS (Input) read | 23%
HDFS READ BYTES(+) 1.0 | HDFS (Input) read | 8%
RESULT SIZE(+) 1.0 | Output skew | 23%
SHFL READ WAIT TIME(+) 1.0 | Shuffle read | 17%
DISK UTILIZATION(+) 1.0 | Shuffle write | 8%
SHFL WRITE OPEN(+) 0.34, SHFL WRITE TIME(+) 0.33, SHFL WRITE CLOSE TIME(+) 0.33 | Shuffle write | 4%
DISK WRITE THPT(+) 1.0 | Shuffle write | 4%
GC TIME(+) 1.0 | GC Overhead | 10%
SCHED DELAY(+) 1.0 | Scheduler delay | 3%

Table 2.21: Hound's estimate of stragglers (percentage) explained by each cause for the Spark TPC-DS dataset

HDFS (Input) Read | Output Skew | Shuffle Read | Shuffle Write | GC Overhead | Scheduler Delay
31% | 23% | 17% | 16% | 10% | 3%

2.6 Complexity and Overheads

Table 2.22 presents the computational complexity of Hound's constituent methods with the following notation: M and N represent the number of tasks per job and jobs per trace; K, W, and T represent the number of profiled metrics, words per document, and topics to learn; I represents the number of algorithmic iterations, a tunable parameter. We draw on prior complexity analyses. Bagging Augmented ElasticNet applies the Least Angle optimizer to fit coefficients [281]. The AdaBoost-IPW estimator applies C4.5 to train weak learners [234]. Topic models are implemented with distributed online variational inference [134]. Ensemble learning uses hierarchical agglomerative clustering [191].

Table 2.22: Computational Complexity of Hound

Learning Procedure | Theoretical Complexity | Approximate Complexity
BASE: Bagging Augmented ElasticNet | O(N(K³ + K²M)I) | O(NM)
BASE: Signed Schweizer-Wolff Estimator | O(NKM²) | O(NM)
BASE: AdaBoost-IPW Estimator | O(NK³MI) | O(NM)
META | O(WTNI) | O(N)
ENSEMBLE | O(T² log(T) I) | O(T² log(T))

In theory, Hound runs in polynomial time, O(NK³M + NKM²). In practice, K, T, and W are typically constants compared to N and M. For example, the Google and Spark traces report 21 and 37 task metrics (K), respectively. The number of output topics (T) is at most 15 and the number of words per topic (W) is at most 5. These parameters are much smaller than N and M. Given these practical considerations and a 1/√M sampling rate for the SSW estimator, Hound's approximate complexity is O(NM).

Figure 2.15 presents measured scalability, indicating that Hound's run time increases linearly with trace size. Each measurement is the average of five runs on randomly drawn Google jobs. Moreover, we measure scalability with respect to the number of workers used for learning. We implement Hound with Apache Spark to parallelize learning on massive datasets. Given 38.9K jobs and 10M tasks, Hound needs 52 minutes to complete inference on a single worker and just 12 minutes on eight workers.

Figure 2.15: Hound's scalability as dataset size increases.

2.7 Related Work

2.7.1 Straggler Mitigation

Speculative Execution. Dean et al. detect when a task runs slower than expected and launch another equivalent task as a backup, hoping that transient performance issues disappear [93]. But prior research has identified a few limitations of speculative execution. First, speculative execution is too aggressive in heterogeneous clusters: too many speculative tasks are launched, reducing system capacity. LATE [267] addresses this problem by throttling the number of backup tasks and by replicating only a few prioritized slow tasks rather than any slow one. Second, as a reactive approach, speculative execution is usually too late to help small, latency-sensitive jobs. Dolly [43] improves timeliness by proactively launching multiple clones of every task of a job and using only the clone that finishes first. Third, speculative execution improves the performance of the job at hand but can hurt the performance of others since backup tasks can cause resource contention. Hopper [217] addresses contention by right-sizing the resource pool for backup tasks.

Scheduling. Schedulers can mitigate stragglers by preventing tasks from computing on slow machines. Mantri [44] improves task placement to avoid congested network links. Wrangler [264] predicts the straggler risk for each machine, and risky machines are prohibited from serving certain tasks.

Root Cause Analysis. Tales of the Tail [166] experimentally explores whether stragglers are caused by interference from background daemons, poor scheduling, constrained concurrency models, or power throttling. Treadmill [272] carries out experiments to measure tail latencies under different hardware configurations (NUMA, Turbo Boost, DVFS, and NIC) and uses Analysis of Variance (ANOVA) to estimate the impact of different hardware factors. Determining and validating the suspected causes from Tales of the Tail and Treadmill requires deep expertise in the specific system and architecture. In comparison, Hound is a statistical machine learning framework that focuses on automated causal discovery rather than experimental validation. In addition, Hound is independent of any specific system or architecture detail.

2.7.2 Performance Analysis

Machine Learning. Learning has been widely applied in prior research to help understand performance issues in large-scale distributed systems. Related methods can be classified into regression-, tree-, and graph-based methods. Regression-based methods (e.g., Highlighter [59] and Fingerprint [60]) use regression models, such as Lasso, to correlate the state of service-level objectives (SLO) with system metrics. The constructed model can select a few salient metrics, usually called the signature, that help operators understand anomalies such as SLO violations.

Similarly, tree-based methods [221, 263] apply classification and regression trees to characterize the causal rules that produce specific performance outcomes. Graph-based methods [190, 240, 84] apply probabilistic graphs, such as Bayesian networks, to visualize the causal relationship between system metrics and performance state. In comparison, Hound performs causal learning with little domain knowledge and constructs topic models that emphasize interpretability, reliability and scalability.

Tracing Techniques. Profilers and tracers for large-scale distributed systems include Pivot Tracing [177], The Mystery Machine [80], Magpie [51], lprof [273], Pinpoint [75], XTrace [108] and Dapper [228]. These frameworks stitch dispersed event logs from many thousands of machines and reconstruct the complete control flow (system path) for each user request. In comparison, Hound focuses on trace analysis, not acquisition.

2.8 Conclusions

Hound is a statistical machine learning framework for diagnosing stragglers at datacenter scale. Hound offers interpretability, reliability and scalability. We apply Hound to analyze a production Google datacenter and two experimental Amazon EC2 clusters, revealing challenges and providing insights for future systems design and management.

For future work, Hound is an open framework and could incorporate additional causal analysis algorithms, such as Recursive Structural Equation Models (RSEM) [66], as base learners for even more reliable inference. Moreover, Hound assumes a unified profiling framework that reports the same metrics for every job and task. When profilers are heterogeneous, Hound will require new methods for inferring causal topics from different vocabularies.

3

Graph Theory and Semantic Learning for Understanding Datacenter Workload

Scale and complexity motivate Limelight+, a new algorithmic pipeline that extracts workload insights from huge graphs. Limelight+ uses graph theory to partition a datacenter-scale call graph into succinct layers. Moreover, it uses semantic learning to annotate layers for efficient, human interpretation. The algorithmic pipeline includes three stages.

• FDMAX (Foundational Degree Maximization) organizes the call graph into lay- ers that reveal dependencies within and across micro-, meso-, and macro-scale functions.

• STEAM (Spectral Clustering for Equilibrium Embedded Modules) annotates call graph layers by clustering functions according to statistically inferred semantics.

• HELP (Header Label Propagation) determines the number of computational cycles attributed to clusters of semantically similar functions, detailing software structure and revealing targets for acceleration.

We demonstrate Limelight+ in two sets of experiments on open-source and industry-strength datacenter software. First, we study 400K stack samples profiled on a testbed that runs PyTorch Caffe2 [15] for neural network training and inference, RocksDB [30] for key-value store, HHVM [200] for just-in-time PHP compilation, and Proxygen [28] for web serving. We evaluate the quality of discovered call graph layers and function clusters. We also translate these layers and clusters into workload insights at multiple granularities and scales. Limelight+ outperforms state-of-the-art analysis strategies and reveals computational hotspots for optimization.

Moreover, we study 300M stack samples profiled on a live, production datacenter. Limelight+'s layered graph and semantic clusters reveal three salient features. First, datacenter workloads are intelligent and deploy machine learning as a scalable backbone service. Second, datacenter systems are memory-centric and rely on memory caches, key-value stores, and in-memory analytics. Third, datacenters are automated and deploy mechanisms that consume significant computational resources for service management. Beyond these macro-scale insights, we find no single approach to accelerate datacenter performance. Limelight+ indicates the fifteen hottest clusters of micro-scale functions account for only 50% of datacenter cycles, which motivates holistic approaches to performance optimization.

3.1 Challenges of Existing Stack Trace Analysis Methods

The Performance Monitoring Unit (PMU) profiles how software functions interact with hardware resources such as the processor core and cache hierarchy. Online, the PMU samples counters and periodically interrupts the OS every T cycles, a configurable parameter, to record instruction addresses. Offline, we reconstruct the complete call path by linking addresses to debug information and symbolizing corresponding software functions. Thus, cycles can be attributed to a software routine and its context (i.e., call stack).

A stack trace is a collection of profiled samples. Each stack sample s is a tuple s = (c_s, w_s), where c_s is the call path and w_s is the associated time weight. For example, sample s3 = (c_s3, w_s3) in Table 3.1 has call path c_s3 as follows

root → start → libc start main → caffe2::python::TensorFeeder::FeedTensor → caffe2::CPUContext::CopyBytesFromCPU → memcpy avx unaligned

and associated weight w_s3 = 5 cycles.

Table 3.1: Stack Samples

Service / Index | Weight | Stack Sample
Caffe2 Inference / s1 | 10 | root → start → libc start main → caffe2::Predictor::operator → caffe2::Workspace::RunNet → caffe2::SimpleNet::Run → caffe2::Operator::Run → caffe2::ConvPoolOpBase::RunOnDevice → caffe2::ConvOp::RunOnDeviceWithOrderNCHW → caffe2::math::GemmStridedBatched → cblas sgemm batch → sgemm batch → mkl blas sgemm batch → mkl blas sgemm
Caffe2 Inference / s2 | 5 | root → start → libc start main → caffe2::Predictor::operator → caffe2::Workspace::RunNet → caffe2::SimpleNet::Run → caffe2::Operator::Run → caffe2::ConvPoolOpBase::RunOnDevice → caffe2::ConvOp::RunOnDeviceWithOrderNCHW → caffe2::math::Im2Col
Caffe2 Inference / s3 | 5 | root → start → libc start main → caffe2::python::TensorFeeder::FeedTensor → caffe2::CPUContext::CopyBytesFromCPU → memcpy avx unaligned
Caffe2 Inference / s4 | 2 | root → start → libc start main → caffe2::Predictor::operator → caffe2::Workspace::RunNet → caffe2::SimpleNet::Run → caffe2::Operator::Run → caffe2::ConvPoolOpBase::RunOnDevice → caffe2::ConvOp::RunOnDeviceWithOrderNCHW → caffe2::math::Im2Col → caffe2::math::::Im2ColZeroPaddingNoDilation → caffe2::math::CopyMatrix → mkl trans mkl somatcopy
Caffe2 Inference / s5 | 2 | root → start → libc start main → caffe2::Predictor::operator → caffe2::Workspace::RunNet → caffe2::SimpleNet::Run → caffe2::Operator::Run → caffe2::ConvPoolOpBase::RunOnDevice → caffe2::PoolOp::RunOnDeviceWithOrderNCHW → caffe2::MaxPoolFunctor::Forward → caffe2::::RunMaxPool2D → caffe2::::ComputeMaxPool2D → Eigen::DenseBase::maxCoeff
Caffe2 Inference / s6 | 1 | root → start → libc start main → caffe2::python::TensorFeeder::FeedTensor → caffe2::empty → c10::TensorImpl::raw mutable data → c10::DefaultCPUAllocator::allocate → c10::alloc cpu → posix memalign → int memalign → int malloc
RocksDB / s7 | 6 | root → start thread → rocksdb::::StartThreadWrapper → rocksdb::DBImpl::Put → rocksdb::DB::Put → rocksdb::DBImpl::Write → rocksdb::DBImpl::WriteImpl → rocksdb::DBImpl::PipelinedWriteImpl → rocksdb::DBImpl::WriteToWAL → rocksdb::log::Writer::AddRecord → rocksdb::log::Writer::EmitPhysicalRecord → operator new → malloc
RocksDB / s8 | 6 | root → start thread → rocksdb::::StartThreadWrapper → rocksdb::DBImpl::Get → rocksdb::DBImpl::GetImpl → rocksdb::MemTable::Get → rocksdb::MemTable::Seek → rocksdb::InlineSkipList::Iterator::Seek → rocksdb::InlineSkipList::FindGreaterOrEqual → rocksdb::InlineSkipList::KeyIsAfterNode → rocksdb::InternalKeyComparator::CompareKeySeq → rocksdb::Slice::compare → memcmp sse4 1
RocksDB / s9 | 6 | root → start thread → rocksdb::::StartThreadWrapper → rocksdb::DBImpl::Get → rocksdb::DBImpl::GetImpl → rocksdb::BlockBasedTable::Get → rocksdb::DataBlockIter::SeekForGet → rocksdb::DataBlockIter::Seek → rocksdb::BlockIter::BinarySeek → rocksdb::InternalKeyComparator::CompareKeySeq → rocksdb::Slice::compare → memcmp sse4 1
RocksDB / s10 | 2 | root → start thread → rocksdb::ThreadPoolImpl::Impl::BGThread → rocksdb::DBImpl::BGWorkCompaction → rocksdb::CompactionJob::Run → rocksdb::CompactionJob::Process... → rocksdb::BlockBasedTableBuilder::Add → rocksdb::BlockBasedTableBuilder::Flush → rocksdb::BlockBasedTableBuilder::WriteBlock → rocksdb::CompressBlock → rocksdb:: Compress → snappy::RawCompress

Limelight+ requires each stack sample to be acyclic and eliminates cycles (i.e., pruning stacks with recursive functions) as needed: for any function f, there does not exist a sub-call-path in c_s that starts from and ends with f. Note that a stack trace is a collection, rather than a set, of stack samples and can contain multiple identical entries. Table 3.2 defines notations for stack trace analysis.

Table 3.2: Notations. Let S denote a stack trace, s a stack sample, and f or g a function.

Notation | Explanation
F(S) | Set: all unique functions in S
DRS(f, S) | Set: direct (immediate) callers of f in S
IRS(f, S) | Set: direct or indirect (transitive) callers of f in S
DES(f, S) | Set: direct callees of f in S
IES(f, S) | Set: direct or indirect callees of f in S
I(f, s) | Binary function: 1 if f is in s, 0 otherwise
DR(f, g, s) | Binary function: 1 if f is a direct caller of g in s, 0 otherwise
IR(f, g, s) | Binary function: 1 if f is a direct or indirect caller of g in s, 0 otherwise
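To ground this notation, here is a minimal sketch that represents a stack trace as a list of (call path, weight) tuples and implements a few of the sets and indicator functions from Table 3.2; the two-sample trace is a toy stand-in for Table 3.1.

```python
# A stack sample is a (call path, weight) tuple; a trace is a list of samples.
trace = [
    (["root", "start", "caffe2::Operator::Run", "caffe2::math::Im2Col"], 5),
    (["root", "start", "caffe2::Operator::Run", "mkl blas sgemm"], 10),
]

def functions(trace):
    """F(S): all unique functions in the trace."""
    return {f for path, _ in trace for f in path}

def is_leaf(f, sample):
    """L(f, s): 1 if f is the leaf of the sample's call path, 0 otherwise."""
    path, _ = sample
    return int(path[-1] == f)

def is_direct_caller(f, g, sample):
    """DR(f, g, s): 1 if f directly calls g in the sample, 0 otherwise."""
    path, _ = sample
    return int(any(a == f and b == g for a, b in zip(path, path[1:])))

print(sorted(functions(trace)))
print(is_direct_caller("caffe2::Operator::Run", "caffe2::math::Im2Col", trace[0]))
```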

There exist three types of techniques for understanding stack traces. Function-level analysis counts the cycles consumed by individual functions and determines how the total number of cycles is distributed across functions. This analysis may use some combination of exclusive or inclusive cycle measurements, and is "flat" and agnostic about composition relationships between functions. In contrast, call graph analysis is "hierarchical" and characterizes composition relationships between functions using a directed acyclic graph (DAG).

3.1.1 Counting Exclusive Cycles

Measures of exclusive cycles count cycles consumed by individual functions, excluding cycles consumed by their subroutines. The exclusive measure of time EXC_S(f) = Σ_{s∈S} L(f, s) w_s counts cycles consumed by leaf function f in stack trace S. Indicator L(f, s) is one when f is a leaf in s and zero otherwise. Percentage EXCP_S(f) = EXC_S(f) / Σ_{s∈S} w_s counts these cycles as a share of the total.

Measurements of exclusive cycles are orthogonal and complete. Cycles counted for one function do not overlap with those for any other function in the trace. Cycles across all functions in the trace sum to the total cycles in the trace. These properties avoid double-counting and ensure a full accounting of consumed cycles.

For example, we characterize exclusive cycles for ResNet50 inference in Caffe2 [133, 141, 15]. We find that single-precision matrix-matrix multiplication consumes 40% of cycles (mkl blas sgemm). Convolutional kernels, which must flatten image patches into matrix columns, consume 20% of cycles (caffe2::math::Im2Col) [216]. Memory management, such as copying memory blocks, copying matrices, and allocating memory, consumes 32% of cycles (mkl trans mkl somatcopy, int malloc). Finally, calculating the maximum elements of a matrix consumes 8% of cycles (Eigen::DenseBase::maxCoeff).

Challenge 1: Discover computational hotspots at different scales. An analysis of exclusive cycles focuses on leaves in the stack trace, which has limitations when analyzing datacenter-scale computation. The analysis focuses on low-level, shared libraries yet fails to discover larger integrated hotspots. A longitudinal study of Google's production datacenter profiles computation for stack leaves and measures EXCP to reveal datacenter hotspots [146]. The revealed hotspots are entirely micro-scale libraries such as memory (de)allocation, memory moves, and hashing. None of the revealed hotspots are macro-scale services or capabilities. Focusing on only low-level hotspots can lead to suboptimal performance solutions.

Consider the trace S = [s7−10] in Table 3.1. RocksDB reports EXCP and INCP of 60% for memcmp sse4 1. But context indicates that these memory comparisons are performed to compare keys during binary search for key-values in rocksdb::InternalKeyComparator. Without context, an architect might build hardware accelerators for memory comparison, the 60% hotspot, even though Intel SSE4 instructions already accelerate memcmp sse4 1 [127]. However, with context, an architect might optimize the broader key-value seek operation, replacing binary search with index hashes and avoiding ASIC costs.
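A minimal sketch of exclusive-cycle accounting over abbreviated versions of samples s7-s10 (weights from Table 3.1, call paths shortened); it reproduces the 60% exclusive share reported for memcmp sse4 1.

```python
from collections import Counter

# Abbreviated call paths for samples s7-s10; weights match Table 3.1.
trace = [
    (["start thread", "rocksdb::DBImpl::Put", "operator new", "malloc"], 6),      # s7
    (["start thread", "rocksdb::DBImpl::Get", "rocksdb::Slice::compare",
      "memcmp sse4 1"], 6),                                                       # s8
    (["start thread", "rocksdb::DBImpl::Get", "rocksdb::Slice::compare",
      "memcmp sse4 1"], 6),                                                       # s9
    (["start thread", "rocksdb::CompactionJob::Run", "snappy::RawCompress"], 2),  # s10
]

total = sum(w for _, w in trace)
exc = Counter()
for path, w in trace:
    exc[path[-1]] += w  # exclusive cycles accrue only to the leaf function

for f, cycles in exc.most_common():
    print(f"{f}: EXC={cycles}, EXCP={cycles / total:.0%}")
```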

3.1.2 Counting Inclusive Cycles

Measures of inclusive cycles count cycles consumed by individual functions, including cycles consumed by their subroutines. The inclusive measure of time INC_S(f) = Σ_{s∈S} I(f, s) w_s counts cycles consumed by function f in stack trace S. Indicator I(f, s) equals one when f is in the call path of s and zero otherwise. Percentage INCP_S(f) = INC_S(f) / Σ_{s∈S} w_s counts these cycles as a share of the total.

Measures of inclusive cycles permit the analyst to identify integrated services, comprised of disparate functions that collectively consume many cycles. For example, we find that function caffe2::ConvOp::RunOnDeviceWithOrderNCHW reports EXCP=0% and INCP=68%. This indicates the convolutional kernel is an integrated piece of computation distributed across multiple sub-routines that consume a large number of system cycles [162].

Challenge 2: Eliminate misdirecting abstract wrappers. An analysis of inclusive cycles alone has significant limitations. For example, consider the trace S = [s1−6] in Table 3.1. Caffe2 reports inclusive cycles of 100% for libc start main, 76% for caffe2::Operator::Run, 68% for caffe2::ConvOp::RunOnDeviceWithOrderNCHW, and 28% for caffe2::math::Im2Col. These measures are problematic for several reasons. First, wrapper functions that report large INCP misdirect hotspot analysis. Because functions like root, start, and libc start main report large percentages but provide little insight into processor demand, the analysis should identify and exclude wrapper functions.

Challenge 3: Zero double-counting of cycles. Second, the inclusive cycle count for a function overlaps with that of its sub-routines. Cycles are double-counted for functions that hold caller-callee relationships, both direct and indirect, complicating the effort to break down total system cycles across functions. When functions are independent, inclusive cycles accurately reveal where processing time is consumed. Caffe2 reports INCP of 76% and 24% for caffe2::Predictor::operator and caffe2::TensorFeeder::FeedTensor. However, when functions are dependent, inclusive cycles are double-counted and breakdowns are incorrect. Caffe2 reports INCP of 76% for each of caffe2::Predictor::operator and caffe2::Workspace::RunNet because the former calls the latter.

Challenge 4: Compare functions and cycles at a matched scale. Third, comparing inclusive cycles reported for functions at varied scales is difficult. Consider the trace S = [s1−6] in Table 3.1. Caffe2 reports INCP of 76% for caffe2::Predictor::operator, 20% for memcpy avx unaligned, and 4% for int malloc. This breakdown is valid but provides little insight because cycles consumed by micro-scale memory functions are compared against those for macro-scale neural network functions. A more informative breakdown would compare neural network inference and tensor feeding, two functions of comparable granularity that reside at the same level in the call graph.
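A companion sketch for inclusive-cycle accounting over abbreviated versions of samples s1-s6 (weights from Table 3.1, call paths shortened); it reproduces the 76% and 68% inclusive shares quoted above and shows how caller and callee percentages overlap.

```python
from collections import Counter

# Abbreviated call paths for samples s1-s6; weights match Table 3.1.
trace = [
    (["libc start main", "caffe2::Predictor::operator", "caffe2::Operator::Run",
      "caffe2::ConvOp::RunOnDeviceWithOrderNCHW", "mkl blas sgemm"], 10),       # s1
    (["libc start main", "caffe2::Predictor::operator", "caffe2::Operator::Run",
      "caffe2::ConvOp::RunOnDeviceWithOrderNCHW", "caffe2::math::Im2Col"], 5),  # s2
    (["libc start main", "caffe2::python::TensorFeeder::FeedTensor",
      "memcpy avx unaligned"], 5),                                              # s3
    (["libc start main", "caffe2::Predictor::operator", "caffe2::Operator::Run",
      "caffe2::ConvOp::RunOnDeviceWithOrderNCHW", "caffe2::math::Im2Col"], 2),  # s4
    (["libc start main", "caffe2::Predictor::operator", "caffe2::Operator::Run",
      "caffe2::PoolOp::RunOnDeviceWithOrderNCHW"], 2),                          # s5
    (["libc start main", "caffe2::python::TensorFeeder::FeedTensor",
      "int malloc"], 1),                                                        # s6
]

total = sum(w for _, w in trace)
inc = Counter()
for path, w in trace:
    for f in set(path):  # every function on the path absorbs the sample's weight
        inc[f] += w

for f in ["libc start main", "caffe2::Predictor::operator", "caffe2::Operator::Run",
          "caffe2::ConvOp::RunOnDeviceWithOrderNCHW", "caffe2::math::Im2Col"]:
    print(f"{f}: INCP={inc[f] / total:.0%}")
```

Note how the caller and callee rows sum to far more than 100%, which is exactly the double-counting problem described in Challenge 3.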

3.1.3 Call Path Analysis

Call path analysis accounts for context and measures cycles consumed by functions on a specific path through the software. The number of cycles in call path c over stack trace S is PHC(c) = Σ_{s∈S, c_s = c} w_s. Percentage PHCP(c) = PHC(c) / Σ_{s∈S} w_s counts these cycles as a share of the total.

Call path measures capture context around hotspots but suffer from several limitations. First, call path analysis often provides only a partial view of activity. For tractability, the analysis often focuses on hot paths and neglects the long, heavy tail in the distribution of cycles consumed by all paths. The number of possible paths grows combinatorially with the number of functions and caller-callee relationships. We find the number of distinct paths is 10× larger than the number of functions for HHVM in a production datacenter [200] and 5× larger for RocksDB in our testbed [30].

Challenge 5: Derive a hierarchical view of how functions compose. Second, call path measures provide no information about how functions compose.

For example, from trace S = [s1−6] in Table 3.1, we cannot examine any single hot path and infer that the convolutional kernel (caffe2::ConvOp::RunOnDeviceWithOrderNCHW) consists of strided, batched matrix multiply (caffe2::math::GemmStridedBatched) and the image-to-column operation (caffe2::math::Im2Col).

3.1.4 Call Graph Analysis

The call graph is a directed acyclic graph (DAG) that accounts for caller-callee relationships between functions [126]. Each node represents a function and is associated with a [INCP, EXCP] tuple. Each edge represents a caller-callee relationship and is annotated with cycles attributed to sub-routines. The attributed measure ATC_S(f1, f2) = Σ_{s∈S} DR(f1, f2, s) w_s counts cycles in function f1 attributed to sub-routine f2. Indicator DR(f1, f2, s) is one when f1 is a direct caller of f2 in s and zero otherwise. Percentage ATCP_S(f1, f2) = ATC_S(f1, f2) / Σ_{s∈S} w_s counts these cycles as a share of the total.

With call graphs, we can analyze how functions compose and how cycles are attributed to sub-routines. Such analysis reveals common, low-level primitives that support a spectrum of high-level services. Optimizing these primitives could improve performance across the software stack or datacenter. For example, we find that neural network prediction (caffe2::Predictor::operator) consumes 71% of testbed cycles. The operator's sub-routines for convolution and pooling comprise 59% and 12% of consumed cycles. As a result, we can attribute 83% (=59/71) of DNN prediction costs to convolution and 17% (=12/71) to pooling.

Challenge 6: Scalable algorithms that summarize the structural and semantic information of huge stack traces for efficient interpretation. Existing techniques for call graph analysis require human expertise and fail to scale to large, complex graphs. A moderately sized trace of twenty sampled stacks from our testbed produces 114 nodes and 119 distinct call paths. Even so, it is difficult to visually discover themes in the computation and accurately account for processor cycles. Such challenges become even more problematic as datacenters transition to cluster-wide profiles [215, 271] and longitudinal studies over multiple years and tens of thousands of production machines [146]. Such datasets produce massive traces and graphs. Our month-long trace of a production datacenter includes 300M sampled stacks and 400K distinct functions. Thus, techniques for automatic and intelligent call graph analysis are required.
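A minimal sketch of edge-level cycle attribution: it enumerates direct caller-callee pairs in each sample and accumulates ATC and ATCP as defined above. The samples are abbreviated, aggregated stand-ins for Table 3.1, so the printed percentages are illustrative only.

```python
from collections import Counter

# Abbreviated (call path, weight) samples; each direct caller-callee pair is an edge.
trace = [
    (["caffe2::Predictor::operator", "caffe2::ConvOp::RunOnDeviceWithOrderNCHW"], 17),
    (["caffe2::Predictor::operator", "caffe2::PoolOp::RunOnDeviceWithOrderNCHW"], 2),
    (["caffe2::python::TensorFeeder::FeedTensor", "memcpy avx unaligned"], 5),
    (["caffe2::python::TensorFeeder::FeedTensor", "int malloc"], 1),
]

total = sum(w for _, w in trace)
atc = Counter()
for path, w in trace:
    for caller, callee in zip(path, path[1:]):
        atc[(caller, callee)] += w   # ATC(f1, f2): cycles of f1 attributed to f2

for (caller, callee), cycles in atc.items():
    print(f"{caller} -> {callee}: ATC={cycles}, ATCP={cycles / total:.0%}")
```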

3.2 Limelight+ Overview

We design a new algorithmic framework, Limelight+, to address these challenges. First, FDMAX partitions call graph functions into coarse-grained layers that exhibit two properties: (i) functions of comparable scales and similar types of computation are placed in the same layer; (ii) functions at lower layers support those in higher layers, thereby describing how micro-scale computation composes to support macro-scale computation. Second, within each layer, STEAM clusters and consolidates functions of similar or closely related functionality into "super-functions", reducing call-graph size by orders of magnitude. Third, HELP re-attributes consumed cycles to the produced "super-functions" and layers, derives a succinct, hierarchical view of software structure, and suggests hotspots for acceleration. The following highlights the contributions of Limelight+ with an example of its output shown in Figure 3.1.

Figure 3.1: Example of Limelight+'s produced graph.

Multi-Scale Discovery, Wrapper Elimination and Scale-Consistent Comparison. FDMAX partitions distinct functions into five layers (scales). L1 contains system abstractions for threading. L2 contains macro-scale key-value queries, HTTP/PHP requests, and neural network pipelines, which implement service threads. L3 contains computational foundations for L2 such as neural network operators and MemTable and BlockBasedTable operations. L4 contains micro-scale libraries for L3 such as neural network math and key-value block writes. L5 contains shared libraries and functional primitives. Functions in root layer L1 can be classified as abstract wrappers and systematically eliminated. Scale-consistent comparison is feasible for functions within the same layer, e.g., neural network training vs. inference and key-value GET vs. PUT in L2.

Graph Reduction for Enhanced Interpretability. Consolidating similar functions clustered by STEAM into super-functions can significantly reduce call graph size and complexity. For example, convolution and pooling are consolidated as a group representing neural network operators (L3G1) and memory allocation libraries are consolidated in group L5G4. In experiments, STEAM achieves a 29×-38× reduction of call graph size with no sacrifice of function clustering quality, and produces succinct graphs for effective and efficient human interpretation.

Zero Double Counting and Hierarchical Composition. HELP re-attributes consumed cycles from individual functions in the raw call graph to the discovered groups and layers. The cycle re-attribution procedure by design guarantees zero double counting of cycles. GIP (LIP) and GEP (LEP) indicate inclusive and exclusive cycle percentages for a group (layer), respectively. Moreover, cycles are attributed between groups across different layers (golden arrows in Figure 3.1) to quantify function composition.

Hotspot Insights and Acceleration Trade-offs. Comparing groups at different scales can help make trade-offs between acceleration schemes. For example, an end-to-end acceleration of neural network operators (L3G1, a 60% hotspot) is potentially a bigger win compared with optimizing only matrix multiply (L5G1, a 40% hotspot). The former has opportunities to optimize the data path between different operators while the latter does not.

Scalability. Under mild assumptions, Limelight+ has quadratic time complexity and scales successfully to datacenter-strength traces. Limelight+ finishes analysis for hundreds of millions of stack profiles and tens of thousands of functions on a single machine (48 cores) in 12 hours.

3.3 Limelight+ Layerization

We design a graph-theoretic algorithm, FDMAX, that partitions a call graph into layers by maximizing a measure called foundational degree. The algorithm organizes the graph's nodes such that functions of similar type and scale occupy the same layer. It also tracks the graph's edges such that computational cycles can be attributed to functions. The partitioning algorithm scales to massive call graphs and permits interpretable analysis, revealing how lower layers compose to provide computational foundations for higher layers.

Limelight+ uses a recursive, bottom-up algorithm for partitioning the call graph. In each round, FDMAX discovers a foundation (lower layer) and partitions it from the residual of the graph (higher layers). This bi-partition preserves the partial-order relationships because, when a function is assigned to the foundation, all of its (in)direct callees are as well. The algorithm invokes FDMAX on the residual graph after re-calculating partitioning criteria to reflect changes in graph structure.
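A high-level sketch of this recursive control flow, assuming a select_foundation routine that scores nodes (e.g., by foundational degree) and returns a callee-closed set of foundation functions; it illustrates the peeling loop only, not the published FDMAX optimization itself.

```python
def layerize(graph, select_foundation):
    """Recursively peel foundations (lower layers) off a call graph.

    graph: dict mapping each function to the set of functions it calls.
    select_foundation: callable returning the set of nodes chosen as the
    current foundation; it must be closed under callees so the partial
    order between layers is preserved.
    """
    layers = []
    remaining = dict(graph)
    while remaining:
        foundation = select_foundation(remaining)
        if not foundation:                      # nothing left to peel off
            layers.append(set(remaining))
            break
        layers.append(foundation)
        # Residual graph: drop foundation nodes and edges into them.
        remaining = {f: {g for g in callees if g not in foundation}
                     for f, callees in remaining.items() if f not in foundation}
    return layers[::-1]   # report top (macro-scale) layers first

# Toy example: leaves are selected as the foundation in each round.
toy_graph = {"run_net": {"conv", "pool"}, "conv": {"sgemm"}, "pool": set(), "sgemm": set()}
leaves = lambda g: {f for f, callees in g.items() if not callees}
print(layerize(toy_graph, leaves))
```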

3.3.1 Foundational Degree

Limelight+ partitions the graph for layer discovery and cycle attribution by maximizing a measure called foundational degree (FD). Foundational degree is based on binary decision variables that indicate whether a function is assigned to a layer.

Figure 3.2: Illustration of FDMAX. Dashed boxes include functions that are placed in the foundation. Red numbers indicate functions' measured modularity.

Moreover, foundational degree encodes desiderata for the graph partition and, when assigning a function to a layer, accounts for its modularity. Modularity measures the likelihood a function is foundational infrastructure for other functions higher in the call graph. As FDMAX partitions the call graph, more modular functions are placed in lower layers. We argue that modularity is high when a function and its sub-routines provide an integrated and coherent module of computation. We quantify modularity m for function f by assessing four criteria— capacity, fundamentality, integrity, and homogeneity—in the context of its stack trace S.

$$m_S(f) = c_S(f) \cdot d_S(f) \cdot i_S(f) \cdot h_S(f)$$

A function may exhibit different characteristics across these four criteria. Modularity, as a product of multiple criteria, accounts for interactions and trade-offs between the ways that a function could be foundational.

Capacity. The capacity of a function encodes the number of its callees, both direct and indirect. Capacity c_S of function f in stack trace S increases with the number of callees, denoted by |IE_S(f, S)|. A baseline parameter c_µ specifies the measure's minimum capacity for leaf functions, which have zero callees. And a discount parameter c_δ models diminishing marginal returns in capacity as the number of callees increases. Defaults are c_µ = 0.5 and c_δ = 2.5.

$$c_S(f) = 1 - \exp\!\left(-\,\frac{|IE_S(f, S)| + c_\mu}{c_\delta^{2}}\right), \quad c_\mu > 0,\; c_\delta > 0 \qquad (3.1)$$

FDMAX uses the capacity measure to consolidate functions along the call path in the same layer and produce fewer layers. When partitioning the call graph and placing functions in the foundation, FDMAX prefers high-capacity functions over their callees because the former describe integrated, macro-scale computation and the latter describe fragmented, micro-scale kernels. For example, FDMAX consolidates functions Eigen::DenseBase::block, Eigen::Block::Block and Eigen::BlockImpl::BlockImpl in one layer because they are related to matrix block construction.

However, relying on capacity alone risks placing functions that span multiple computational scales into the same layer. For example, HPHP::Transport::addHeaderNoLock has high capacity, but consolidating it and its indirect callees produces an incoherent layer: PHP transport header creation is a macro-scale function whereas memory allocation is a micro-scale primitive. Thus, FDMAX must supplement capacity with other partitioning criteria that account for scale.

Fundamentality. Fundamentality encodes the extent to which a function supports other functions. The fundamentality d_S of function f in stack trace S increases with the number of distinct, direct callers of the function, denoted by |DR_S(f, S)|. It measures the function's in-degree in the call graph, normalized by the largest in-degree for any function in the graph. Sensitivity parameter d_µ makes the measure less sensitive when the call graph's maximum in-degree is extremely small. Default is d_µ = 1.

$$d_S(f) = \frac{|DR_S(f, S)| + d_\mu}{\max_{g \in F(S)} |DR_S(g, S)| + d_\mu}, \quad d_\mu > 0 \qquad (3.2)$$
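To make these two measures concrete, the following minimal Python sketch (not the dissertation's implementation; the helper signatures are illustrative) evaluates capacity and fundamentality from per-function callee and caller counts, using the stated defaults c_µ = 0.5, c_δ = 2.5, and d_µ = 1.

```python
import math

def capacity(num_indirect_callees, c_mu=0.5, c_delta=2.5):
    # Eq. 3.1: saturating measure of how much computation a function wraps.
    return 1.0 - math.exp(-(num_indirect_callees + c_mu) / c_delta ** 2)

def fundamentality(num_direct_callers, max_direct_callers, d_mu=1.0):
    # Eq. 3.2: in-degree normalized by the largest in-degree in the graph.
    return (num_direct_callers + d_mu) / (max_direct_callers + d_mu)

# Toy comparison: a leaf function vs. a macro-scale wrapper, and a narrowly
# used helper vs. an allocator called from everywhere.
print(capacity(0))             # leaf: small baseline capacity
print(capacity(25))            # macro-scale function: capacity approaches 1
print(fundamentality(2, 10))   # narrowly used helper: low fundamentality
print(fundamentality(10, 10))  # called by every other function: 1.0
```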

FDMAX uses fundamentality to identify functions that provide infrastructure to higher layers in the call graph. When partitioning the call graph and placing functions in the foundation, FDMAX prefers functions with high fundamentality. For example, the fundamentality measures for HPHP::Transport::addHeaderNoLock, std::vector::_M_emplace_back_aux, and __GI___libc_malloc are 2/10 = 0.2, for _int_malloc it is 3/10 = 0.33, and for operator new it is 10/10 = 1.0. FDMAX favors placing operator new and its indirect callees, __GI___libc_malloc and _int_malloc, in the foundation.

Integrity. Integrity encodes the degree to which a function is self-contained computation. The integrity i_S of function f in stack trace S is high when all of its callees, direct and indirect, are rarely called by functions other than itself. If the function is a graph leaf, L(f, S) = 1, its integrity is one. Otherwise, we classify graph edges to f's callees into three types. First, α edges are calls from f and its callees. Second, β edges are calls from indirect callers of f. Third, γ edges are calls from functions unrelated to f (i.e., neither f's indirect callers nor indirect callees).

$$i_S^{\alpha}(f) = \sum DR(f_1, f_2, S), \quad f_1, f_2 \in IE_S(f, S) \cup \{f\}$$
$$i_S^{\beta}(f) = \sum DR(f_1, f_2, S), \quad f_1 \in IR_S(f, S),\; f_2 \in IE_S(f, S)$$
$$i_S^{\gamma}(f) = \sum DR(f_1, f_2, S), \quad f_1 \notin IR_S(f, S) \cup IE_S(f, S) \cup \{f\},\; f_2 \in IE_S(f, S)$$

Function f is more integrated and self-contained when its α edges significantly outnumber its β and γ edges. Because we find that β edges break integrity less significantly than γ edges, we specify tolerance parameter i_µ to discount their effect.

$$i_S(f) = 1 - \frac{i_\mu\, i_S^{\beta}(f) + i_S^{\gamma}(f)}{i_S^{\alpha}(f) + i_S^{\beta}(f) + i_S^{\gamma}(f)}, \quad 0 \le i_\mu \le 1 \qquad (3.3)$$
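A minimal sketch of the integrity measure, assuming the call graph is supplied as an edge list plus precomputed indirect-callee and indirect-caller sets; the i_µ default used here is illustrative since the text does not fix one.

```python
def integrity(f, edges, indirect_callees, indirect_callers, i_mu=0.5):
    """Eq. 3.3 sketch: classify edges into f's callee set as alpha/beta/gamma."""
    callees = set(indirect_callees.get(f, set()))
    callers = set(indirect_callers.get(f, set()))
    if not callees:                      # graph leaf: fully self-contained
        return 1.0
    alpha = beta = gamma = 0
    for src, dst in edges:
        if dst not in callees:           # only edges into f's callee set matter
            continue
        if src == f or src in callees:
            alpha += 1                   # calls from f or from within its subtree
        elif src in callers:
            beta += 1                    # calls from f's indirect callers
        else:
            gamma += 1                   # calls from unrelated functions
    return 1.0 - (i_mu * beta + gamma) / (alpha + beta + gamma)

# Toy call graph: f wraps g and h; an unrelated function u also calls h.
edges = [("f", "g"), ("g", "h"), ("u", "h")]
callees = {"f": {"g", "h"}, "g": {"h"}}
callers = {"g": {"f"}, "h": {"f", "g", "u"}}
print(integrity("f", edges, callees, callers))  # the gamma edge from u lowers integrity
```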

Figure 3.2 illustrates examples of the three edge types. First, the edge from std::vector::_M_emplace_back_aux to operator new is an α edge for HPHP::Transport::addHeaderNoLock. Second, the edge from operator new to _int_malloc is a β edge for __GI___libc_malloc. Finally, the eight edges from higher-level functions to operator new are γ edges for HPHP::Transport::addHeaderNoLock.

FDMAX prefers high-integrity functions when partitioning the call graph and placing functions in the foundation. In contrast, if function f has low integrity, its sub-routines are widely called by higher-level functions other than itself. Because the sub-routines provide broad support infrastructure, they, and not f, are better suited for the foundation.

Homogeneity. Homogeneity measures the similarity between function names, which suggests similarity in the type or scale of the computation. We measure homogeneity h_S for a function f in stack trace S with a proposed Levenshtein-Winkler (LW) measure of string similarity between function f and its indirect callees. Higher similarity suggests the function and its callees should be placed in the same call graph layer.

$$h_S(f) = \frac{1}{|IE_S(f, S)|} \sum_{g \in IE_S(f, S)} LW(f, g) \qquad (3.4)$$

We design LW similarity based on the Levenshtein Edit Distance (LED) and the Winkler adjustment.

$$LW(f, g) = \max\!\left(h_b,\; NL(f, g) + h_c\,\frac{S(f, g)}{h_w}\,\big(1 - NL(f, g)\big)\right)$$
$$NL(f, g) = 1 - LED(f, g)\,/\,\max(|f|, |g|)$$
$$0 \le h_c \le 1; \quad 0 < h_w; \quad 0 \le h_b \le 1$$

LED(f, g) is the smallest number of edits required to change string f into g.¹ Our implementation is case insensitive and ignores global names (e.g., HPHP, caffe2) and namespace delimiters (i.e., ::), which appear frequently and carry little information. NL normalizes LED by the longer string's length. When NL is low, the Winkler adjustment supplements the homogeneity measure to account for exact matches of S(f, g) prefix characters, up to h_w matches. When NL is high, the effect of the Winkler adjustment is small.
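The sketch below implements LED, NL, and the LW measure as reconstructed above, using the defaults reported in the text (h_w = 5, h_c = 0.5, h_b = 0.1); the token pre-processing (case folding, stripping global names and ::) is omitted for brevity.

```python
def led(a, b):
    # Levenshtein edit distance via dynamic programming, O(|a| * |b|).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def lw(f, g, h_w=5, h_c=0.5, h_b=0.1):
    """Levenshtein-Winkler similarity sketch with the defaults from the text."""
    nl = 1.0 - led(f, g) / max(len(f), len(g))
    prefix = 0                           # S(f, g): matching prefix characters, capped at h_w
    for x, y in zip(f[:h_w], g[:h_w]):
        if x != y:
            break
        prefix += 1
    return max(h_b, nl + h_c * (prefix / h_w) * (1.0 - nl))

print(lw("scalar_product_op", "scalar_sum_op"))  # conjugate-style names score high
print(lw("addHeaderNoLock", "malloc"))           # unrelated names fall back toward h_b
```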

The homogeneity measure includes three parameters. First, a window h_w determines how many prefix characters are checked for matches. Second, a coefficient h_c determines the power of the Winkler adjustment. Third, a base value h_b ensures that homogeneity is greater than zero, which reduces sensitivity when multiplying homogeneity against the other three measures of modularity. Defaults are h_w = 5, h_c = 0.5, and h_b = 0.1.

FDMAX prefers placing high-homogeneity functions and their callees into the foundation. If a function has low homogeneity, it likely has multiple sub-routines of varied type and scale. Consolidating these functions into the same partition would lead to less coherent call graph layers. Functions with high homogeneity include Eigen::DenseBase::block (h_S = 0.75), rocksdb::Snappy_Compress (h_S = 0.81), folly::IOBuf::create (h_S = 0.77), and __GI___libc_malloc (h_S = 0.55). Those with low homogeneity include HPHP::Transport::addHeaderNoLock (h_S = 0.20) and std::vector::_M_emplace_back_aux (h_S = 0.19).

¹ LED can be computed with a simple recursive algorithm; dynamic programming computes it efficiently in O(|f| × |g|) time.

Groups of Functions. Modularity is a function-level measure, but we must determine whether a group of functions should be assigned to the foundation when partitioning the call graph. Moreover, we must ensure FDMAX places low-level functions in the foundation even as their low modularity makes it prefer to place them in the residual. We address these issues by augmenting our measure of modularity with an analysis of partial order in the call graph. We construct the partial-order matrix POM by computing the transitive closure of the indirect caller-callee matrix², using algorithms such as Depth-First Transitive Closure (DFTC).

$$POM_S[f, g] = \delta\!\left(\sum_{s \in S} \sum_{p \in F(S)} \big(DR(f, p, s) + IR(f, p, s)\big) \cdot IR(p, g, s)\right), \qquad \delta(x) = \begin{cases} 1 & x > 0 \\ 0 & x \le 0 \end{cases}$$

When POM_S[f, g] = 1, f is a partial-order caller of g.

When POM_S[f, g] = 0 and POM_S[g, f] = 0, f is partial-order independent of g and vice versa.

Foundational Degree. We compute the foundational degree FD for a group of functions G using their modularities m and the partial-order matrix POM. FD is the sum of the functions' modularities, where a function's contribution is excluded if one of its partial-order callers is also in the group.

$$FD_S(G) = \sum_{f \in G} m_S(f)\left(1 - \sum_{g \in G} POM_S[g, f]\right) \qquad (3.5)$$

First, maximizing FD ensures that the residual holds high-level functions. If function f's indirect callees are partial-order independent and their summed modularities exceed f's, the function will certainly be placed in the residual and not the foundation.

² IC_S[f, g] = 1 when f is an indirect caller of g.

In Figure 3.2, FD's maximum value is 0.585 (= 0.14 + 0.2 + 0.15 + 0.095) and this value degrades significantly if any high-level function is placed in the foundation; see ComputeMaxPoolGradient2D, rocksdb::Benchmark::CompressSlice, HPHP::ResponseMessage::ResponseMessage, HPHP::Transport::addHeaderNoLock, and std::vector::_M_emplace_back_aux.

Second, maximizing FD ensures the foundation holds a large number of partial-order independent functions. This outcome is desirable because each discovered layer in the call graph should cover as many consumed cycles as possible without becoming incoherent and uninterpretable.

Third, maximizing FD is equivalent to seeking call graph partitions that cut higher in the software hierarchy. Including low-modularity functions below this cut neither harms the group's foundational degree nor distorts the cut's goodness. When high-modularity function f is placed in the foundation partition, its indirect callees do not change the group's FD regardless of whether they are placed in the foundation or residual partition. FD for a group of functions depends heavily on the higher-level functions in the group and not on their partial-order callees.

Finally, we can maximize FD while preserving the call graph's partial order relationships. A function should never appear in a layer higher than any of its indirect callers. After FDMAX optimizes the cut in the call graph, it moves any residual function into the foundation if the function has a partial-order caller in the foundation. This adjustment preserves partial-order relationships but does not affect the cut's optimality for FD maximization.
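A minimal sketch of the foundational degree computation, under the interpretation above that a member's modularity counts only when none of its partial-order callers belong to the group; the data structures and function names are illustrative.

```python
def foundational_degree(group, modularity, po_callers):
    """Eq. 3.5 sketch: sum modularities of group members that have no
    partial-order caller inside the group."""
    fd = 0.0
    for f in group:
        if not (po_callers.get(f, set()) & group):
            fd += modularity[f]
    return fd

# Toy example: placing a high-level wrapper in the foundation does not add its
# callees' modularities on top of its own.
modularity = {"wrapper": 0.2, "malloc": 0.15, "memcpy": 0.14}
po_callers = {"malloc": {"wrapper"}, "memcpy": {"wrapper"}}
print(foundational_degree({"malloc", "memcpy"}, modularity, po_callers))             # 0.29
print(foundational_degree({"wrapper", "malloc", "memcpy"}, modularity, po_callers))  # 0.20
```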

3.3.2 Regularized Foundational Degree

Conjugate functions are connected but perform complementary computation at the same scale. FDMAX should place conjugate functions in the same call graph layer.

Figure 3.3: Illustration of Conjugate Functionality Regularization (CFR). Functions below the dashed lines are partitioned into the foundation. Blue and red lines define the resulting partition with and without CFR.

For example, conjugate functions at similar scales for matrix multiplication and summation (e.g., Eigen::internal::scalar_product_op::operator(), Eigen::internal::scalar_sum_op::operator()) should reside in the same layer. In another example, FDMAX is likely to place rocksdb::Benchmark::AcquireLoad in the foundation because it is a leaf called exclusively by rocksdb::Benchmark::ThreadBody. This is problematic, as rocksdb::Benchmark::AcquireLoad would occupy a layer that differs from its conjugates such as rocksdb::Benchmark::ReadSequential. However, thus far, our approach to maximizing foundational degree does not directly align functions by the scale of their computation.

We develop regularization techniques to align peer functions using measures of graph proximity and string similarity. The set of peers ∆_S(f) for function f contains functions that exhibit high string similarity, discussed earlier, and high hierarchical affinity.

$$\Delta_S(f) = \{\, g \mid POM_S[f, g] = 0 \ \text{and}\ POM_S[g, f] = 0 \ \text{and}\ LW(f, g) > \rho_s \ \text{and}\ HA_S(f, g) > \rho_g \,\}$$

The set of peers depends on two thresholds, 0 ≤ ρ_s ≤ 1 for string similarity and 0 ≤ ρ_g ≤ 1 for hierarchical affinity. Defaults are ρ_s = 0.6 and ρ_g = 0.85.

Hierarchical Affinity. Affinity measures the degree to which two functions are vertically aligned in the call graph and perform computation at the same scale. We measure affinity between two functions by comparing their proximities, or distances, to the call graph's boundaries. We define the upper boundary by inserting a virtual root f_r that becomes the unique caller of the graph's original root functions. And we define the lower boundary by inserting a virtual leaf f_l that becomes the unique callee of the graph's original leaf functions. Affinity is high when functions are similarly proximate to either the virtual root or the virtual leaf. Hierarchical affinity measures proximity between vertices in a graph as follows

$$HA_S(f, g) = \max\big\{\, \eta\big(SCTK(f, f_r; G_S),\, SCTK(g, f_r; G_S)\big),\; \eta\big(SCTK(f, f_l; G_S),\, SCTK(g, f_l; G_S)\big) \,\big\}$$

where η(x, y) = 1 − |x − y| / max(x, y), x > 0, y > 0. G_S is an adjacency matrix generated by reducing the call graph into a symmetric, unweighted, undirected graph. Suppose matrix element DM_S[f, g] = 1 when f is a direct caller of g and DM_S[f, g] = 0 otherwise. Then G_S = (DM_S + DM_S^T)/2.
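A small sketch of the affinity computation, assuming the proximity values to the virtual root and virtual leaf have already been produced by a vertex-proximity kernel such as the commute-time kernel described next; the numeric values are illustrative.

```python
def eta(x, y):
    # Relative closeness of two proximity values (both assumed > 0).
    return 1.0 - abs(x - y) / max(x, y)

def hierarchical_affinity(prox_f_root, prox_g_root, prox_f_leaf, prox_g_leaf):
    """HA sketch: two functions are affine if they sit at similar proximity to
    the virtual root or to the virtual leaf."""
    return max(eta(prox_f_root, prox_g_root), eta(prox_f_leaf, prox_g_leaf))

# Two benchmark drivers equally close to the virtual root are strongly affine.
print(hierarchical_affinity(0.80, 0.78, 0.30, 0.55))  # ~0.975
```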

Sigmoid Commute Time Kernel. SCTK(f, g; G_S) measures proximity between vertices f and g in graph G_S. We use SCTK instead of simpler measures, such as geodesic distance, since it accounts for all paths between vertices rather than one specific path. For example, SCTK correctly considers the root function start_thread a high-level vertex based on its large average path length to the virtual leaf even though its shortest path length is only four. SCTK is used in graph algorithms, machine learning, and computer networking. Empirically, it outperforms other proximity measures such as the Regularized Laplacian, Markov Diffusion, and Laplacian Exponential Diffusion Kernels.

SCTK measures proximity by averaging commute times between two vertices in a graph. Commute time is the number of steps taken during a random walk from vertex f to g and back again. The SCTK measures proximity as follows

$$SCTK_{G_S}[f, g] = \mathrm{sigmoid}\big(L_{G_S}^{+}[i, j]\big), \qquad L_{G_S}[i, j] = \begin{cases} \sum_k G_S[i, k] & \text{if } i = j \\ -1 & \text{if } i \ne j,\; G_S[i, j] = G_S[j, i] = 1 \\ 0 & \text{otherwise} \end{cases}$$

where L_{G_S}^{+} denotes the Moore-Penrose pseudo-inverse of the Laplacian of G_S. In practice, the matrix pseudo-inversion problem is simple and can be computed efficiently with polynomial time complexity, O(n^2.37).

Conjugate Functions. The conjugate functions CF for function f are a subset of its peers ∆. Function g is a conjugate if it satisfies two conditions. First, g must be f's peer. Second, there is no other peer g′ that is an indirect caller or callee of g and has greater hierarchical affinity.

$$CF_S(f) = \{\, g \mid g \in \Delta_S(f) \ \text{and}\ \nexists\, g' \in \Delta_S(f) \ \text{s.t.}\ g' \in IR_S(g, S) \cup IE_S(g, S),\; HA_S(f, g') > HA_S(f, g) \,\}$$

For example, in Figure 3.3, rocksdb::Benchmark::TimeSeries and rocksdb::Benchmark::TimeSeriesWrite have high hierarchical affinity, since the former is the unique direct caller of the latter, and high string similarity. These functions are both peers of rocksdb::Benchmark::AcquireLoad. But only rocksdb::Benchmark::TimeSeries is a conjugate, since it is more hierarchically aligned to the graph's upper boundary

Table 3.3: Examples of conjugate functions.

Function                      Conjugate Functions
__GI___libc_open              __GI___libc_close, __GI___libc_write, __GI___libc_pread64, __GI___ftruncate64, __GI___unlink, __GI___readlink
rocksdb::port::Mutex::Lock    rocksdb::port::Mutex::Unlock, rocksdb::port::Mutex::AssertHeld
Eigen::internal::pmul         Eigen::internal::pmax, Eigen::internal::ploadu, Eigen::internal::pstore, Eigen::internal::padd, Eigen::internal::predux
HPHP::f_array_diff            HPHP::f_array_unique, HPHP::f_array_keys, HPHP::f_array_combine, HPHP::f_array_key_exists, HPHP::f_array_merge

than rocksdb::Benchmark::TimeSeriesWrite. Table 3.3 presents more examples of conjugate functions.

We seek call graph partitions that assign a function and its conjugates to the same layer. We measure how often partitions fall short of this goal and divide conjugates across partitions, which means placing a function in the foundation and its conjugates, if they exist, in the residual. The degree of conjugate division is

CD_S(G) when a call graph F for stack trace S is partitioned into foundation G and residual F − G. The indicator function 1(x) is one when x is true and zero otherwise.

$$CD_S(G) = \sum_{f \in G} \sum_{g \in F(S) - G} \mathbb{1}\big(g \in CF_S(f)\big)$$

Regularized Foundational Degree. Regularization describes our technique for refining our measure of foundational degree to account for how a function and its conjugates are assigned to partitions. When conjugates are divided across partitions, we penalize our measure of foundational degree by a multiplicative factor. The coefficient θ_r is the regularization power. Default is θ_r = 0.98.

$$FD_S^{R}(G) = \theta_r^{\,CD_S(G)} \cdot FD_S(G) \qquad (3.6)$$
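A minimal sketch of the conjugate-division count and the regularized foundational degree of Equation 3.6; the function names are illustrative, RocksDB-flavored stand-ins.

```python
def conjugate_division(foundation, residual, conjugates):
    # CD_S(G): count foundation functions whose conjugates landed in the residual.
    return sum(1 for f in foundation for g in residual if g in conjugates.get(f, set()))

def regularized_fd(fd, cd, theta_r=0.98):
    # Eq. 3.6: penalize the foundational degree once per divided conjugate pair.
    return (theta_r ** cd) * fd

conjugates = {"AcquireLoad": {"ReadSequential", "TimeSeries"}}
cd = conjugate_division({"AcquireLoad"}, {"ReadSequential", "TimeSeries"}, conjugates)
print(cd, regularized_fd(0.585, cd))  # 2 divided pairs -> 0.585 * 0.98**2
```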

3.3.3 Maximizing Foundational Degree

Let x = [x_1, x_2, ..., x_n] ∈ {0, 1}^n be a vector of binary decision variables such that x_i is one and zero when function i is placed in the foundation and residual, respectively. Let G(x) ≡ { f_i | x_i = 1; i = 1, 2, ..., n }. We maximize foundational degree to discover the optimal foundation F* as follows.

$$x^{*} = \operatorname*{argmax}_{x}\; \theta_r^{\,CD_S(G(x))} \cdot FD_S(G(x))$$
$$F^{*} = \{\, f_p \mid x_p^{*} = 1 \ \text{or}\ \exists q,\; x_q^{*} = 1,\; POM_S[f_q, f_p] = 1;\; p = 1, 2, \ldots, n \,\}$$

Rubinstein Cross Entropy Method. Maximizing foundational degree requires combinatorial optimization and exhaustive search is intractable. We use Rubinstein's Cross Entropy Method (CEM), a stochastic search algorithm, to find the foundation when partitioning the call graph. CEM has several advantages over simpler search heuristics such as simulated annealing, genetic algorithms, tabu search, etc.

• Asymptotic Convergence. CEM provably terminates with probability one in a finite number of iterations and produces consistent, normal estimates for optimal parameters (Homem-de-Mello and Rubinstein).

• Black-Box Optimization. CEM uses statistical models but does not require analytical models, fitted models, or gradient calculations for the objective function.

• Computational Efficiency. CEM has polynomial time complexity and converges to optimal solutions quickly in extensive empirical studies.

• Practical Application. CEM successfully solves varied combinatorial optimization problems such as mixed integer nonlinear programming, optimal buffer allocation, clustering, graph partition, etc. CEM also supports fast policy search in reinforcement learning and accelerates Bayesian optimization.

CEM for Rare Event Simulation. We introduce CEM for rare event simulation and then transform the problem into combinatorial optimization. Finally, we formulate FDMAX within the CEM framework. Suppose there exists an n-dimensional random variable x that follows statistical distribution f(x; p) with parameters p. For real-valued function G(X), we wish to estimate L, the probability that G(X) meets or exceeds a threshold α.

X L = (G(X) ≥ α) = f (x; p) ‰(G(X) ≥ α) x

An unbiased estimator $\hat{L}_f = \frac{1}{N}\sum_{k=1}^{N} \mathbb{1}(G(x_k) \ge \alpha)$ uses N Monte Carlo simulations that sample x with density f(x; p). But when G(X) ≥ α is extremely rare, a tremendous number of samples are required for a reliable estimate.

$$L = \sum_x h(x)\,\frac{f(x; p)}{h(x)}\,\mathbb{1}\big(G(X) \ge \alpha\big) = \mathbb{E}_{X \sim h(x)}\!\left[\frac{f(x; p)}{h(x)}\,\mathbb{1}\big(G(X) \ge \alpha\big)\right], \qquad f(x; p)\,\mathbb{1}\big(G(X) \ge \alpha\big) = 0 \ \text{if}\ h(x) = 0$$

Importance Sampling (IS) addresses this problem with an alternative estimate

$\hat{L}_h = \frac{1}{N}\sum_{k=1}^{N} \frac{f(X_k; p)}{h(X_k)}\,\mathbb{1}(G(X_k) \ge \alpha)$ that samples using density h(x). With the proper density, rare events happen more frequently in Monte Carlo simulation and fewer samples are required to produce an accurate estimate. The optimal IS density is h*(x) = f(x; p) 1(G(X) ≥ α) / L, but constructing h*(x) directly is difficult since L itself has no accurate estimate. CEM searches within the family of probability densities around f(x; p) to find a density f(x; q) that minimizes the cross entropy between f(x; q) and h*(x).

$$q^{*} = \operatorname*{argmin}_{q}\, KL\big(h^{*}(x),\, f(x; q)\big) = \operatorname*{argmax}_{q} \sum_x f(x; p)\,\mathbb{1}\big(G(X) \ge \alpha\big)\,\ln f(x; q)$$

The optimal parameter q∗ solves the optimization above. Because it is intractable to traverse the entire space of X, especially when X is high-dimensional, we apply IS one more time.

$$q^{*} = \operatorname*{argmax}_{q} \sum_x f(x; p)\,\mathbb{1}\big(G(X) \ge \alpha\big)\,\ln f(x; q) = \operatorname*{argmax}_{q}\; \mathbb{E}_{X \sim f(x; r)}\!\left[\frac{f(x; p)}{f(x; r)}\,\mathbb{1}\big(G(X) \ge \alpha\big)\,\ln f(x; q)\right]$$

First, we specify density f (x; r) to supply samples for the stochastic counterpart.

Second, we draw samples X_k (k = 1, 2, ..., N) from the density and construct the stochastic program. Note that f(X_k; p), f(X_k; r), and 1(G(X_k) ≥ α) are computed from samples.

$$\hat{q}^{*} = \operatorname*{argmax}_{q}\; \frac{1}{N}\sum_k \frac{f(X_k; p)}{f(X_k; r)}\,\mathbb{1}\big(G(X_k) \ge \alpha\big)\,\ln f(X_k; q) \qquad (3.7)$$

We have introduced a second parameter r to search for the optimal IS parameter

q*. Unfortunately, G(X) ≥ α may also be rare under distribution f(X_k; r). CEM addresses this challenge by iteratively refining the optimized solution {q_t*} using a sequence of non-decreasing values {α_t} that approach α. As {α_t} converges to α, {q_t*} asymptotically converges to q*.

In iteration t, CEM finds an intermediate q*(t) that makes G(X) ≥ α_t more likely. In the next iteration t + 1, q*(t) defines the sampling density f(X_k; r) and the (1 − ε)-percentile of f(X_k; r) determines the increased α_{t+1}. CEM again solves for q*(t + 1) that makes G(X) ≥ α_{t+1} more likely. Once α_t exceeds α, the cumulative changes to the sampling density ensure P(G(X) ≥ α) ≥ ε. This implies that a proper IS density f(x; r) has been found to estimate the optimal IS density f(x; q) with program (3.7).

CEM for Combinatorial Optimization. We have described CEM for rare event simulation, but the framework supports combinatorial optimization with a small extension (see CEM-OPT in Algorithm 6). For rare event simulation, CEM uses an optimized IS density f(x; q*) to shift the probability mass of G(X) beyond a target α. For combinatorial optimization, we replace the pre-specified α with an increasing sequence of values. As it runs, CEM asymptotically converges to a state at which the probability mass of G(X) approaches the global maximum

G* = max_X G(X). Finally, to derive the optimal decision variable X* = argmax_X G(X), we sample from the optimized density.

$$X^{*} = \operatorname*{argmax}_{x \in \{x_1, x_2, \ldots, x_M\}} G(x), \qquad x_i \sim f(x; q^{*})$$

FDMAX extends CEM-OPT to optimize decision variables indicating whether

functions are placed in the foundation or residual partition. Let X ≡ [x_1, x_2, ..., x_n] ∈ {0, 1}^n denote the decision vector such that x_i is one and zero when f_i is in the foundation and residual, respectively. Each x_i is a Bernoulli random variable that places function i in the foundation with probability p_i.

FDMAX iteratively samples graph partitions X with Bernoulli sampling. It assesses each sampled partition's goodness with the regularized foundational degree FD_S^R. It updates the Bernoulli parameter p_i based on the best partitions. For the Bernoulli distribution, the optimal update can be determined analytically. After the last iteration, functions with large Bernoulli parameters are placed in the foundation and the rest are placed in the residual. Hyperparameters ε (= 0.01), β_F (= 0.5), and ν_F (= 0.99) specify the update percentile, smoothness across iterations, and partition

ALGORITHM 6: CEM-OPT: Cross Entropy Method for Combinatorial Optimization
Input: Random variable X with density f(x; q)
Output: Optimal density parameter q* = q_t

1. Initialize parameter q_0. Reset iteration counter t = 1;
2. Draw N i.i.d. samples from density f(x; q_{t−1});
3. Evaluate the objective function for the N samples to produce S_1, S_2, ..., S_N;
4. Let α_t denote the (1 − ε)-percentile of S;
5. If α_t is unchanged for w iterations, terminate the program;
6. Construct the stochastic program using the samples and solve
   $q_t' = \operatorname*{argmax}_{q} \sum_{k=1}^{N} \mathbb{1}(G(X_k) \ge \alpha_t)\, \ln f(X_k; q)$;
7. Update q_t = β_CEM q_t' + (1 − β_CEM) q_{t−1};
8. Increment counter t = t + 1 and go to step 2.

threshold. See details of FDMAX in Algorithm 7.

FDMAX for Graph Layerization. We run FDMAX for multiple rounds to successively extract a sequence of layers. Lower layers are discovered in earlier rounds. In round i, FDMAX reveals a foundation layer using functions' modularities and the graph's partial order matrix from stack trace S_i. We remove foundation functions from S_i to produce S_{i+1} and proceed to the next round. The process completes when a round terminates with an empty residual partition.

ALGORITHM 7: FDMAX: Foundational Degree Maximization
Input: Stack trace S
Output: Foundation F_S and residual R_S partitions

1. Initialize parameter p_0. Reset iteration counter t = 1;
2. Draw N i.i.d. sample vectors X_k = [X_{k,1}, ..., X_{k,n}] with X_{k,i} ∼ Bernoulli(p_{t−1,i});
3. Given the samples, calculate the regularized foundational degrees FD_S^R = [FD_{S,1}^R, ..., FD_{S,N}^R];
4. Let α_t denote the (1 − ε)-percentile of FD_S^R;
5. If α_t is unchanged for w iterations, go to step 9;
6. Calculate the CEM-OPT solution as follows
   $$p_{t,i}' = \frac{\sum_{k=1}^{N} \mathbb{1}\big(FD_{S,k}^{R} \ge \alpha_t\big)\, X_{k,i}}{\sum_{k=1}^{N} \mathbb{1}\big(FD_{S,k}^{R} \ge \alpha_t\big)}, \qquad i = 1, 2, \ldots, n$$
7. Update p_t = β_F p_t' + (1 − β_F) p_{t−1};
8. Increment iteration counter t = t + 1 and go to step 2;
9. Place function i and its partial-order callees in foundation F_S if parameter p_i > ν_F. Place remaining functions in residual R_S. Terminate the program.
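The sketch below condenses Algorithms 6-7 into a generic cross-entropy loop over Bernoulli-sampled partitions. It is a simplified stand-in (fixed iteration count, no w-iteration stopping rule), with `score` standing in for the regularized foundational degree FD_S^R; the toy objective at the end is illustrative only.

```python
import numpy as np

def cem_partition(score, n, iters=50, samples=200, eps=0.01, beta=0.5, nu=0.99, seed=0):
    """Bernoulli-sample candidate partitions, keep the samples at or above the
    (1 - eps)-percentile, and move the Bernoulli parameters toward the elites."""
    rng = np.random.default_rng(seed)
    p = np.full(n, 0.5)
    for _ in range(iters):
        X = (rng.random((samples, n)) < p).astype(int)   # candidate partitions
        scores = np.array([score(x) for x in X])
        alpha = np.quantile(scores, 1.0 - eps)           # elite threshold
        elite = X[scores >= alpha]
        p = beta * elite.mean(axis=0) + (1.0 - beta) * p  # smoothed update
    return p > nu                                        # foundation membership

# Toy objective: reward placing the first half of the functions in the foundation.
target = np.array([1, 1, 1, 0, 0, 0])
print(cem_partition(lambda x: -np.abs(x - target).sum(), n=6))
```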

3.4 Limelight+ Function Clustering

We design a statistical learning algorithm, STEAM (SpecTral clustering for Equilibrium embedded Affinitive Modules), to cluster functions within each layer. Functions within a cluster are similar to each other and dissimilar to those in other clusters. STEAM has a preliminary phase, module identification, and two unsupervised learning phases, semantic embedding and clustering. First, we organize functions into modules. Each module is composed of a unique header and the header's indirect callees. A function is a header if it has at least one direct caller in higher layers. See the demonstration of modules and module headers in Table 3.4 and Figure 3.4.

Second, we construct a vector space model that embeds semantic information of each function into a kernel space. We extract semantic information from function declarations, which include the namespace, class name, and function name. Moreover, we infer semantic similarity between functions even though they have distinct but synonymous tokens. We measure simple, first-order semantic similarity by constructing a vector space feature map, Soft Token Occurrence (STOC), that encodes lexical and structural relationships between tokens and between functions. Furthermore, we infer complex, high-order semantic similarity between tokens and between functions by designing Equilibrium Embedding (EE). Third, we measure semantic proximity between modules based on the equilibrium embedded similarity between their member functions, and cluster modules into larger function groups with spectral clustering.

Table 3.4: An example stack trace. Suppose we target a layer L_u and only show in the trace the functions — operator new, __GI___libc_malloc, malloc, and alloc_root — partitioned into L_u. See the corresponding modules within L_u in Figure 3.4.

Index    Stack Sample
s1       ... → __GI___libc_malloc → ...
s2       ... → operator new → __GI___libc_malloc → ...
s3       ... → malloc → ...
s4       ... → alloc_root → ...

Soft Token Occurrence (STOC): We propose Soft Token Occurrence (STOC), a generalized feature map that extends the standard Bag of Words (BOW) feature map from natural language processing. BOW and its variants measure elementary semantic similarity between words and documents. Two documents are semantically similar if they have a significant number of common words. Conversely, two words are semantically similar if they co-occur in a significant number of documents. The similarity analysis uses feature map P[i, j] that encodes how often word j appears in

document i.

Figure 3.4: Illustration of modules and module headers. An example of modules (red dashed boxes) and their headers (yellow filled boxes) in layer L_u. See the corresponding stack trace in Table 3.4.

Limelight+ views functions as extremely short sentences or documents. Tokens are words that appear in functions' declarations. The corpus is a stack trace. Although we could easily use BOW to encode relationships between tokens and functions, the stack trace and its caller-callee relationships are far more structured than a typical natural language corpus. STOC is a feature map designed to exploit this structure's semantic information.

Word stemming and abbreviation recognition use external knowledge bases to infer lexical similarity. Word stemming maps inflected or derived words to their lexical stems, revealing similar tokens such as logging and logged. Abbreviation recognition maps tokens to their complete forms, revealing similar tokens such as alloc and allocator. We cannot rely on external knowledge bases from natural language

processing because programming languages use different lexical forms. But because tokens should be similar to their stems and abbreviations, we use Levenshtein similarity (cf. NL in Section 3.3) as a proxy for lexical similarity.

We design the STOC feature map to generalize BOW and implement two principles. First, two tokens w and w′ are semantically similar if (i) they frequently co-occur in a function's declaration, (ii) functions including w and those including w′ are proximate in the call graph, and (iii) they are lexically similar as measured by Levenshtein. Second, two functions are semantically similar if they (i) share a large number of tokens, (ii) are proximate in the call graph, and (iii) share a significant number of lexically similar words.

For function f ∈ F (S) and token w ∈ T (S), STOC feature map BS is a matrix where

$$B_S[f, w] = IF(w)\left[\frac{1}{2}\, OC(f, w) + \frac{1}{4}\, \frac{ON(f, w)}{ON(f, w) + k_s} + \frac{1}{4}\, OS(f, w)\right]$$
$$ON(f, w) = \sum_{f' \in DR_S(f) \cup DE_S(f)} OC(f', w) \qquad (3.8)$$
$$OS(f, w) = \max\{\, NL(w', w) \mid w' \in TK(f) \,\}$$

$$IF(w) = \log \frac{|F(S)|}{\sum_{f \in F(S)} OC(f, w)}$$
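A minimal sketch of one STOC row, following Equation 3.8; the inputs (declared tokens, neighbor token counts, per-token document frequencies) and the k_s value are illustrative, since the text does not fix a default for k_s.

```python
import math

def nl(a, b):
    # Normalized Levenshtein similarity: 1 - edit distance / longer length.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return 1.0 - prev[-1] / max(len(a), len(b))

def stoc_row(tokens, declared, neighbor_counts, df, total_functions, k_s=1.0):
    """One row of the STOC feature map B_S: exact, structural, and lexical
    soft occurrence, each weighted and scaled by inverse function frequency."""
    row = {}
    for w in tokens:
        oc = 1.0 if w in declared else 0.0                      # exact occurrence
        on = float(neighbor_counts.get(w, 0))                   # neighborhood occurrence
        os = max((nl(w2, w) for w2 in declared), default=0.0)   # lexical soft occurrence
        idf = math.log(total_functions / max(df.get(w, 1), 1))
        row[w] = idf * (0.5 * oc + 0.25 * on / (on + k_s) + 0.25 * os)
    return row

print(stoc_row(["malloc", "alloc", "root"], declared={"alloc", "root"},
               neighbor_counts={"malloc": 2},
               df={"malloc": 3, "alloc": 2, "root": 1}, total_functions=10))
```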

B_S measures token w's occurrence in function f using a weighted sum of three terms. First, OC(f, w) equals one when token w occurs in function f's declaration and zero otherwise. Sets T(S) and TK(f) are the unions of tokens in stack trace S and in function f's declaration, respectively. Second, ON(f, w) is the frequency with which token w occurs in the declarations of f's direct callers and direct callees. ON(f, w)/(ON(f, w) + k_s) is a soft indicator, between zero and one, that measures the prevalence of w in f's neighborhood; parameter k_s controls sensitivity to neighbors. Third, OS(f, w) denotes the maximum similarity between token w and f's tokens.

Its normalized Levenshtein similarity is a soft indicator, between zero and one, that measures the prevalence of tokens that are lexically similar to w in f's declaration. Finally, IF(w) denotes the inverse function frequency of token w, which is analogous to the concept of inverse document frequency in natural language processing and information retrieval. This factor promotes the soft occurrence of tokens unique to a function and reduces that of tokens common to a large number of functions.

Our feature map is called Soft Token Occurrence because of its soft indicators. The map tracks occurrence values between zero and one, instead of occurrence frequency, since function declarations are usually very short (e.g., fewer than ten tokens) and repeated token occurrences are rare. We extend the standard measure of occurrence OC with structural influences, but these extensions may inject noise into the semantic embedding. Structural soft occurrence (ON) injects noise because callers and callees may not exhibit strong semantic similarity when they span different scales of computation. Lexical soft occurrence (OS) injects noise since Levenshtein is not an oracular measure of lexical similarity (e.g., tree and free are similar). We mitigate these effects by weighting the soft occurrence measures at half the weight of the standard measure (i.e., 1/4 versus 1/2). And we consider similarity only for caller-callee relationships within the same layer.

First-Order Kernels and Semantic Proximity Matrix. Using feature map

B_S, we construct semantic proximity matrices for functions and tokens. Matrix P_S = B_S B_S^T measures proximity between functions f and f′ in P_S[f, f′]. Matrix Q_S = B_S^T B_S measures proximity between tokens w and w′ in Q_S[w, w′]. These first-order kernels encode elementary semantic relationships between any pair of functions or any pair of tokens.

We will need more than first-order kernels for complex semantic relationships. High-order kernels are required to transitively infer indirect or high-order similarities. When functions operator new and __GI___libc_malloc are similar because of

structural links and functions __GI___libc_malloc and alloc_root are similar because of lexical proximity, a high-order kernel should deduce similarity between functions operator new and alloc_root.

Equilibrium Embedding, High-order Kernels and Neumann Diffusion. We define a system of recursive equations that augment first-order kernels. These equations leverage the link between semantic proximity for tokens and that for functions. P_S^(k) and Q_S^(k) denote the k-th order semantic proximity matrices for functions and tokens, respectively. Weight 0 < ν < 1 controls the contribution of indirect proximity. Default is ν = 0.5.

$$P_S^{(k+1)} = \nu B_S Q_S^{(k)} B_S^{T} + B_S B_S^{T}$$
$$Q_S^{(k+1)} = \nu B_S^{T} P_S^{(k)} B_S + B_S^{T} B_S, \qquad k = 1, 2, \ldots, +\infty$$

High-Order Kernels and Equilibrium Embedding. Measures of indirect proximity recursively build on measures of direct proximity. If the initial proximity matrices P^(1) and Q^(1) are positive definite kernels, then P^(k) and Q^(k) are also positive definite kernels for k > 1. Any positive definite kernel can be decomposed

such that $Q_S^{(k)} = U_S^{(k)} U_S^{(k)T}$, and the indirect part can be written as a new kernel between functions $B_S Q_S^{(k)} B_S^{T} = (B_S U_S^{(k)})(B_S U_S^{(k)})^{T}$. In effect, this new kernel measures indirect proximity between functions with the new feature map $B_S U_S^{(k)}$. We can re-write $B_S Q_S^{(k)} B_S^{T}[f, g]$ as the inner product between a row of f's features $B_S U_S^{(k)}[f, :]$ and a row of g's features $B_S U_S^{(k)}[g, :]$ in the new map. As k increases, the two recursive equations produce increasingly refined high-order kernels that integrate similarities revealed by lower-order kernels. As k goes to infinity, there exists an equilibrium for the system of equations below.

$$P_S^{*} = \nu B_S Q_S^{*} B_S^{T} + B_S B_S^{T}$$
$$Q_S^{*} = \nu B_S^{T} P_S^{*} B_S + B_S^{T} B_S$$

When ν < min(‖P‖⁻¹, ‖Q‖⁻¹), the equilibrium has the closed-form solution below [145].

$$P_S^{*} = B_S B_S^{T} (I - \nu B_S B_S^{T})^{-1} = P(I - \nu P)^{-1} \qquad (3.9)$$
$$Q_S^{*} = B_S^{T} B_S (I - \nu B_S^{T} B_S)^{-1} = Q(I - \nu Q)^{-1}$$

The equilibrium kernels measure semantic proximity between functions f and g, considering neighborhoods at all orders from one to infinity. Note that P(I − νP)⁻¹ is equivalent to $\sum_{k=1}^{+\infty} \nu^{k-1} P^{k}$, and similarly for Q(I − νQ)⁻¹. Lower-order neighbors outweigh higher-order ones with decay coefficient ν. The kernels relate to random walks in graph theory and to the Neumann Diffusion Kernel [109], which has been successfully used to identify related documents.

Spectral graph theory provides a rigorous case for these high-order kernels. Consider a graph with affinity matrix G where G[i, j] measures affinity between nodes i and j. The powered matrix G^k[i, j] measures affinity between i and j when j is a k-th order neighbor of i. Node i has affinity to its k-th order neighbor j if and only if all the (k − 1)-th order neighbors of i that connect j to i have affinity to both i and j. For example, consider the advantage of high-order kernels for three functions:

f = operator new, g = malloc, and h = alloc_root. The first-order kernel P_0 reveals semantic similarity between two pairs of functions (P_0[f, g] = P_0[g, h] = 1) but fails to discover the similarity of the third pair (P_0[f, h] = 0). The second-order kernel P_0² uncovers the missing similarity (P_0²[f, h] = 1). Function alloc_root is a second-order neighbor of operator new and their affinity is revealed by the first-order neighbor malloc.
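A small sketch of the closed-form equilibrium kernels of Equation 3.9, with a toy feature map illustrating how a second-order similarity is recovered; ν must satisfy the norm condition above for the inverse to exist.

```python
import numpy as np

def equilibrium_kernels(B, nu=0.5):
    """Closed-form equilibrium embedding: P* = P(I - nu*P)^-1 with P = B B^T,
    and similarly for tokens. B is a (functions x tokens) STOC-style map."""
    P = B @ B.T
    Q = B.T @ B
    P_star = P @ np.linalg.inv(np.eye(P.shape[0]) - nu * P)
    Q_star = Q @ np.linalg.inv(np.eye(Q.shape[0]) - nu * Q)
    return P_star, Q_star

# Toy map: f0 and f1 share a token, f1 and f2 share a token, f0 and f2 share none.
B = np.array([[1.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 0.0]])
P_star, _ = equilibrium_kernels(B, nu=0.2)
print(P_star[0, 2] > 0)   # high-order similarity between f0 and f2 is recovered
```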

Module Semantic Proximity Matrix. Finally, we define semantic proximity between modules as the average proximity between their member functions. M_S[m, m′] describes the semantic proximity between modules m and m′ in layer L. Note that we cluster functions within each layer and not across layers. Let |m| denote the number of functions in module m.

$$M_S[m, m'] = \begin{cases} \dfrac{1}{|m|\,|m'|} \displaystyle\sum_{f \in m} \sum_{f' \in m'} P_S^{*}[f, f'] & \exists L,\; m, m' \in L \\ 0 & \nexists L,\; m, m' \in L \end{cases} \qquad (3.10)$$

STEAM Algorithm. Algorithm 8 summarizes our algorithm for function clustering. For each layer in the partitioned call graph, we use the semantic proximity matrix M_S as a feature map to cluster functionally related modules. We use the map to perform spectral clustering on modules within a call graph layer [195]. Spectral clustering is robust when the geometric properties of the semantic matrix violate assumptions, made by standard implementations of KMEANS and DBSCAN, regarding cluster shape, distance metric, and density thresholds.

Spectral clustering transforms data from a raw feature space into an eigenvector space using the Laplacian eigenmap and then uses standard clustering algorithms, such as KMEANS, on the transformed data. Under mild assumptions, the transformed data has provable guarantees on intra-cluster tightness and inter-cluster separation [195]. Such transformations can guide KMEANS to find non-elliptical or non-convex clusters.

We tune the cluster count using multiple runs of the procedure. We select the count that minimizes the average pairwise distance between members of a cluster. To prevent fragmented clusters, we limit the exploration range to [(1 − ω)N_CAP, (1 +

ω)N_CAP]. Default is ω = 0.25. Finally, N_CAP is an initial estimate of the cluster count from Affinity Propagation [112]. Figure 3.5 shows an example in which, within each layer, STEAM clusters semantically

ALGORITHM 8: STEAM: Spectral Clustering for Equilibrium Embedded Affinitive Modules
Parameter: Decay ν = 0.5, search range ω = 0.25
Input: Stack trace S, layers L_1, ..., L_M and corresponding modules R_{p,1}, ..., R_{p,Z_p} in each layer L_p (p = 1, ..., M)
Output: Clusters C_{p,1}, C_{p,2}, ... discovered for each layer L_p (p = 1, ..., M)

1. Construct STOC feature map B_S such that B_S[f, w] is the soft occurrence score of token w for function f (cf. Equation 3.8);
2. Construct first-order kernel P_S = B_S B_S^T, and then equilibrium kernel P_S* = P_S (I − νP_S)⁻¹ such that P_S*[f, g] is the semantic proximity between functions f and g (cf. Equation 3.9);
3. Construct matrix M_S using equilibrium kernel P_S* such that M_S[m, m′] is the semantic proximity between modules m and m′ (cf. Equation 3.10);
4. foreach layer L_p in L_1, L_2, ..., L_M do
5.   Extract submatrix M_{S,p} from M_S with rows and columns corresponding to the Z_p modules in layer L_p;
6.   Invoke Affinity Propagation, using M_{S,p} as an affinity matrix, to estimate the number of clusters and assign it to N_CAP;
7.   Compute normalized Laplacian W_{S,p} = I − D_{S,p}^{-1/2} M_{S,p} D_{S,p}^{-1/2}, where D_{S,p}[i, i] = Σ_j M_{S,p}[i, j];
8.   Solve for the eigenvalues and eigenvectors of W_{S,p}. Select the eigenvectors corresponding to the smallest S_p = ⌈log((1 + ω)N_CAP)⌉ eigenvalues. Stack the eigenvectors column by column to form the Z_p × S_p eigenmap U_{S,p};
9.   Use eigenmap U_{S,p} as a feature map, where each row corresponds to a module in layer L_p. Run KMEANS, varying the cluster count from (1 − ω)N_CAP to (1 + ω)N_CAP;
10.  Select the clustering result that minimizes the average pairwise distance within clusters. Output as C_{p,1}, C_{p,2}, ...;
11. end

similar functions into a few groups for human annotation.
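A condensed sketch of the per-layer clustering in Algorithm 8, assuming the module affinity matrix M_{S,p} is given; the Affinity Propagation estimate, the eigenvector-count rule, and the cluster-count search are simplified away, and scikit-learn's KMeans stands in for KMEANS.

```python
import numpy as np
from sklearn.cluster import KMeans

def steam_layer(M, n_clusters):
    """Spectral clustering of one layer's modules: normalized Laplacian of the
    affinity matrix, smallest eigenvectors as the eigenmap, then k-means."""
    d = M.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    W = np.eye(len(M)) - D_inv_sqrt @ M @ D_inv_sqrt          # normalized Laplacian
    vals, vecs = np.linalg.eigh(W)
    U = vecs[:, np.argsort(vals)[:n_clusters]]                # eigenmap: rows = modules
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(U)

# Toy affinity: two groups of modules with strong within-group proximity.
M = np.array([[1.0, 0.9, 0.1, 0.1],
              [0.9, 1.0, 0.1, 0.1],
              [0.1, 0.1, 1.0, 0.8],
              [0.1, 0.1, 0.8, 1.0]])
print(steam_layer(M, n_clusters=2))   # e.g., [0 0 1 1] up to label permutation
```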

Figure 3.5: Clustering semantically similar functions within each layer with STEAM.


3.5 Limelight+ Cycle Attribution

We design an algorithm for Header Label Propagation (HELP) that translates the original trace into a re-labeled trace in which functions are replaced by STEAM

groups or FDMAX layers (cf. Algorithm 9). HELP produces a succinct summary of the original call graph by aggregating cycles (inclusive, exclusive, attributed) for individual functions and attributing them to groups or layers. Figure 3.6 shows how a call graph that originally includes 114 functions and 119 edges is translated into 5 layers, 6 to 9 STEAM groups per layer, and 32 edges. Such consolidation leads to interpretability.

ALGORITHM 9: HELP (Header Label Propagation)
Input: Stack trace S, STEAM groups G_1, ..., G_k
Output: Re-labeled stack trace S_G

1. Construct a propagated label P and initialize P = Φ;
2. foreach stack s ∈ S do
3.   foreach function f ∈ s do
4.     if f is a module header then
5.       Re-label f with its STEAM group in G_1, ..., G_k;
6.       Save the label as propagated label P;
7.     else
8.       Re-label f as propagated label P;
9.     end
10.  end
11.  De-duplicate labels in s while preserving order;
12. end
13. Return re-labeled stack trace S_G, replacing functions in S with G_1, ..., G_k
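A minimal sketch of the label propagation in Algorithm 9, assuming stacks are ordered caller-to-callee and that `group_of_header` maps module headers to their STEAM group labels; group names here are illustrative, and functions outside the provided groups are simply dropped in this toy.

```python
def help_relabel(stacks, group_of_header):
    """HELP sketch: switch the propagated label at every module header, let
    non-headers inherit it, then de-duplicate labels preserving order."""
    relabeled = []
    for stack in stacks:
        label, labels = None, []
        for f in stack:
            if f in group_of_header:
                label = group_of_header[f]     # header starts a new group label
            if label is not None:
                labels.append(label)           # non-headers reuse the propagated label
        relabeled.append(list(dict.fromkeys(labels)))  # keep first occurrences, in order
    return relabeled

groups = {"operator new": "L5G4 (memory allocation)", "malloc": "L5G4 (memory allocation)"}
stacks = [["start_thread", "operator new", "__GI___libc_malloc"],
          ["start_thread", "malloc"]]
print(help_relabel(stacks, groups))
```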

After running HELP, each stack sample within the original trace S is translated into a call chain of STEAM-generated groups in a re-labeled trace S_G. These groups are viewed as "super functions" in the re-labeled trace. The INCP, EXCP, and ATCP measures for functions produce GIP, GEP, and GAP for group inclusive, exclusive, and attributed cycle percentage, respectively.

HELP is easily modified to produce an algorithm that translates the trace into a call chain of FDMAX-generated layers in a re-labeled trace S_L. These layers are viewed as "super functions" and cycle measures are re-calculated accordingly. The INCP, EXCP, and ATCP measures for functions produce LIP, LEP, and LAP for layer inclusive, exclusive, and attributed cycle percentage, respectively.

Figure 3.6: Illustration of HELP. HELP re-attributes cycles to function clusters for efficient interpretation of software structure, and for easy identification of computation hotspots.

3.6 Experimental Methods

We deploy Limelight+ to analyze a pair of stack traces, SERVICES and FLEET. SERVICES is collected from a five-machine testbed. Four machines host major services and applications under continuous profiling. A fifth machine collects stack profiles for later analysis. FLEET is collected from a live, production datacenter that includes more than 1M servers, which host business-critical services and support datacenter-wide profiling and analysis. Table 3.5 provides a brief summary of SERVICES and FLEET.

Table 3.5: Experimental traces SERVICES and FLEET.

Trace      Profiled Services                         Workload (Publisher)                  Collection Source
SERVICES   Caffe2 Convolutional Neural Network       Caffe2 ImageNet Trainer - ResNet50    Testbed Server 1
           Training                                  (by Facebook)
           Caffe2 Convolutional Neural Network       Caffe2 Model Zoo - ResNet50           Testbed Server 2
           Inference                                 (by Facebook)
           RocksDB Key-Value Store                   db_bench (by Facebook)                Testbed Server 3
           HHVM Server-side PHP Processing Engine,   OSS-Performance (by Facebook)         Testbed Server 4
           Proxygen HTTP Web Server
FLEET      WWW, Advertisement, Search Engine,        Datacenter Workload                   Live Production
           Stream Processing, Image/Video                                                  Datacenter
           Analytics, etc.

3.6.1 SERVICES

We collect 100K stack samples for each machine. SERVICES is a merged, cluster-level trace with 400K stack samples that include about 3.5K unique functions. These functions cover six major software repositories—PyTorch, RocksDB, HHVM, Proxygen, Wangle and Folly—released by Facebook [16].

Testbed. Each of the four machines hosts major software services that have been open-sourced by Facebook. Moreover, each machine deploys a lightweight profiler that periodically configures perf record and collects stack profiles. These machines are used exclusively for experiments and noisy background daemons are turned off. These machines are configured with 48 Intel Xeon E5-2697 processor cores (2.7GHz clock, 32K L1-DCache, 32K L1-ICache, 256K L2-Cache, 30M L3-Cache), 256GB of main memory, 1TB of Flash storage, and two Intel I350 Ethernet adapters.

A fifth machine gathers stack profiles from the workload servers. It parses stack profiles with perf script and demangles symbolic names generated by the C/C++ linker into human-readable function declarations with c++filt. It transforms the demangled profiles into schematic tables that include process name and ID, binary path, time weight, time stamp, raw stack symbols, function addresses, and hardware event type (i.e., CYCLE). This machine is configured with 32 Intel Xeon E5-2630 processor cores (2.4GHz clock, 32K L1-DCache, 32K L1-ICache, 256K L2-Cache, 20M L3-Cache), 128GB of main memory, 128GB of Flash storage, and two Intel X540-AT2 Ethernet adapters.

Caffe2 Neural Network Training. One server hosts Caffe2 (PyTorch 0.4.0 [29]) for training deep convolutional neural networks. It runs Facebook's resnet50_trainer, a parallelized trainer for Resnet-50 on ImageNet data [154]. We use default configurations for the batch size (32) and the number of shards (1). But we modify configurations for the number of epochs (from 1K to 10K) to support continuous profiling, for CPU use (from false to true) to perform multi-processor computation, and for asynchronous training.

Caffe2 Neural Network Inference. Server two hosts Caffe2 (PyTorch 0.4.0) for evaluating deep convolutional neural networks. It runs Facebook's inference

script, which classifies images using a pre-trained Resnet-50 from the Caffe2 Model Zoo [18]. The workload iteratively re-scales ImageNet inputs and performs inference. We increase the number of iterative rounds to support continuous profiling.

RocksDB Persistent Key-Value Store. Server three hosts RocksDB running Facebook's db_bench [32], a tool for performance benchmarking that enhances LevelDB's db_bench [1]. db_bench supports benchmarks that represent a variety of workloads (e.g., fillseq, fillrandom, readrandom, readseq). We use default configurations and construct a three-level LSM-Tree by setting -num_levels to 3.

HHVM PHP Processing Engine and Proxygen Web Server. Server four hosts the HHVM PHP processing engine and the built-in Proxygen web server; HHVM and Proxygen run as independent processes. We run Facebook's standard benchmark suite, OSS-Performance [25], which includes benchmarks for various PHP implementations. The benchmarks generate user access requests with Siege [34] and balance load with Nginx [22]. They also include popular content management systems, WordPress and Drupal. We run Drupal, which provides more facilities and greater flexibility for rich web sites. We use default configurations and enable infinite benchmarking by activating --i-am-not-benchmarking and --no-time-limit.

3.6.2 FLEET

The FLEET trace is sampled from a live, production datacenter with more than 1M servers that host hundreds of business-critical services such as web servers, advertisements, news feeds, in-memory databases, distributed memory caches, search engines, machine learning, stream processing, recommendation systems, etc. Each server is configured with a profiler that periodically profiles the call stack. Profiles are gathered and stored in a distributed, in-memory database for short-term storage and interactive analysis. Profiles are eventually persisted for long-term storage and off-line analysis. For each of 30 consecutive days, we sample 10M stack

samples across all servers and time frames. The resulting trace includes 300M stack samples and 240K unique functions written in C/C++ and Java. The trace excludes functions written in scripting languages such as Python, PHP, and Ruby.

Pre-Processing. Of the 240K functions, 68K are anonymous or inline functions with no explicit declaration. We merge anonymous or inline functions into their immediate callers in individual call-stack samples. Another 67K Java functions are dynamically generated with unique IDs. For example, different Hive queries for the same SQL operation (e.g., SEL, JOIN, GBY) share the same function format but use different and unique function names. We merge dynamically generated functions of the same format into a single function and drop the part of the function declaration related to unique IDs. Of the remaining 110K unique functions, we select the hottest 20K functions with large measures of inclusive cycles and remove relatively cold functions in the tail that comprise less than 0.003% of system cycles. Such pruning preserves 95% of the overall cycles consumed in the complete trace and does not distort the analysis. The resulting call graph comprises 20K nodes and 650K attribute edges.
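A small sketch of this pre-processing, under illustrative assumptions about how generated IDs and anonymous frames appear in the trace; the 0.003% pruning threshold follows the text, everything else (names, patterns, placeholder tokens) is a stand-in.

```python
import re
from collections import Counter

def preprocess(stacks, inclusive_cycles, cold_threshold=3e-5):
    """Strip trailing numeric IDs from generated names, fold anonymous frames
    into their callers, and prune relatively cold functions."""
    def canonical(f):
        return re.sub(r"_\d+$", "", f)            # e.g., HiveQuery_1234 -> HiveQuery
    total = sum(inclusive_cycles.values())
    hot = {canonical(f) for f, c in inclusive_cycles.items() if c / total >= cold_threshold}
    cleaned = []
    for stack in stacks:
        frames = [canonical(f) for f in stack if f != "<anonymous>"]  # fold into caller
        cleaned.append([f for f in frames if f in hot])
    return cleaned

stacks = [["main", "HiveQuery_1234", "<anonymous>", "memcpy"],
          ["main", "HiveQuery_5678", "memcpy"]]
cycles = Counter({"main": 900, "HiveQuery_1234": 450, "HiveQuery_5678": 440, "memcpy": 10})
print(preprocess(stacks, cycles))
```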

3.7 Evaluation of Limelight+

First, we run Limelight+ on the SERVICES trace, partitioning the call graph into a few layers of clustered functions (cf. Section 3.7.1). The analysis reveals software hotspots and highlights opportunities for hardware accelerators (cf. Section 3.7.2). We show partitions are valid and robust, comparing FDMAX against state-of-the-art graph partitioning algorithms (cf. Section 3.7.3). And we show clustered functions are valid and interpretable, comparing STEAM against state-of-the-art semantic learning algorithms (cf. Section 3.7.4). Second, we run Limelight+ on the FLEET trace to reveal workload structure of a live, production datacenter. We describe datacenter computation by analyzing any

service and library that accounts for more than 0.5% and 0.1% of system cycles, respectively. The analysis summarizes features and trends in modern datacenters. It also provides a comprehensive set of software targets for hardware accelerators (cf. Section 3.7.5).

3.7.1 Discovering Layers

Figure 3.7 presents the layerized SERVICES trace. Limelight+ permits interpretability of a complex call graph, reducing a graph of 3.5K nodes into a diagram of 127 nodes distributed across six layers. Functions are organized into 127 semantically cohesive groups. From high to low layers, groups gradually transition from macro- to micro-scale functions; lower groups support the computation of higher groups.

L1 (Processes and Threads) comprises a single group for process and thread functions. Representative functions include thread wrappers (execute_native_thread_routine, rocksdb::StartThreadWrapper, HPHP::AsyncFuncImpl::ThreadFunc).

L2 (Services) comprises six groups of entry functions into broadly defined services. Representative functions include neural network training in PyTorch (c10::ThreadPool::main_loop), RocksDB foreground queries (rocksdb::Benchmark::ReadSequential), and the PHP processing engine (HPHP::ExecutionContext::invokeUnit).

L3-L5 (Service Logic) are layers for the functional logic of different services, with lower-level groups providing infrastructure for higher-level groups. The profiled services span diverse domains and reveal little common functionality in these intermediate layers. Notable exceptions include the Caffe2 training and inference services, which do share related, complementary functions. Caffe2 computation for deep neural networks is split across three layers. L3 includes neural network functions such as training, inference, and the tensor pipeline. L4 supports training and inference with

Figure 3.7: Limelight+'s layers and groups for profiled services. Services contain convolutional neural network inference and training (Caffe2), key-value store (RocksDB), PHP processing engine (HHVM) and web server (Proxygen).

assorted tensor operators for gradients, pooling, convolution, etc. L5 supports tensor operators with core math libraries.

RocksDB is a key-value store that uses a log-structured merge tree (LSM) [204]. LSM-trees achieve high write throughput by first staging data in memory and then

across multiple levels of persistent storage [148]. Every write to the in-memory MemTable is logged to persistent storage to ensure data consistency. As a MemTable fills, data are flushed to higher-level and eventually to lower-level persistent storage. Persistent data are sorted and arranged in the BlockBasedTable. Reads are first directed to MemTables and then to the BlockBasedTable. L3 includes RocksDB primitives such as GET, PUT, DELETE, and ITERATION. L4 supports these operations with reads and writes for the log and table. L5 implements infrastructure for bytewise comparators and file writers for key-value reads and writes.

HHVM provides Just-in-Time compilation and a virtual-machine runtime for processing web applications [200]. L3 provides PHP API/library support for arrays, strings, regular expressions, etc. It also supports JIT/IR primitives for data structure and class/object manipulation. Finally, it supports HHVM's database interface to SQL. L4 supports these capabilities with object serializers, JIT methods, the HHVM network interface, and VM management operations. Finally, L5 provides low-level infrastructure for basic data structures, database operations, VM state control, and file operations.

Proxygen is the web server for the HHVM PHP processing engine. Proxygen listens on the HTTP port, creates sessions for incoming client connections, and supplies HHVM with parsed HTTP messages that include the path of the PHP script to be executed. HHVM responds to each request via the web server. L3 includes web server ingress/egress functions. L4 supports these functions with the HTTP protocol. L5 performs socket I/O and other network primitives.

L6 (Core Libraries) comprises functions that provide foundational infrastructure for the testbed's services. They also include system primitives for memory management and allocation, thread synchronization, and network I/O.

3.7.2 Discovering Accelerators

Limelight+ provides accurate and interpretable recommendations for acceleration targets, based on integrated groups of functions, without requiring expertise in software services and system profiling. Moreover, Limelight+'s multi-scale analysis reveals broad, integrated acceleration targets rather than narrow, fragmented ones. We validate the value of the discovered targets by linking them to existing hardware accelerators and software optimizations. Recommended targets cover over 50 cases of expertly designed hardware accelerators and over 30 cases of re-optimized systems and algorithms (cf. Tables 3.6-3.7).

Caffe2 Neural Networks. Limelight+ reveals directions to accelerate training and inference at multiple scales. These directions are consistent with expert-designed hardware and optimized software. High-level groups of functions can be targeted with faster computation and custom data flow, producing large gains at high cost. Low-level groups can be targeted with simple parallelism, producing modest gains at low cost.

L3 functions motivate end-to-end accelerators for neural networks. For training, researchers have designed hardware architectures like SCALEDEEP [250], a tiled processor that supports efficient synchronization for forward and backward propagation and gradient updates, and have developed software optimizers like TVM [78], a compiler that fuses operators, maps hardware, and hides latency. For inference, researchers have designed hardware architectures with efficient data flow and operators for pooling, batch normalization, and the rectified linear unit [176], and have developed software optimizations for networks such as SqueezeNet [137], which achieves AlexNet-level accuracy with fewer parameters and smaller models.

L4 functions suggest acceleration for operators, especially for convolution, the hottest of the operators in neural network computation. Hardware strategies include

FPGAs that implement Depthwise Separable Convolution with a ping-pong buffer and matrix-multiplication engine [48]. Software strategies include AccNet [89], a fast convolution kernel based on g-Canonical Polyadic Decomposition and g-Convolution.

Finally, L5 functions suggest acceleration for mathematical kernels. Matrix operations (e.g., GEMM, GEMV, Im2Col, and Col2Im) take over 76% of the computational time for neural network training and inference. Hardware strategies often focus on matrix multiplication. Google's Tensor Processing Unit (TPU) implements 65,536 multiply-accumulate units supported by a large software-managed on-chip memory. Software strategies include rethinking the multiplication algorithm. For example, the Benson-Ballard algorithm [55] performs matrix multiply with asymptotic complexity O(N^2.775), outperforming vendor implementations of the classical algorithm and Strassen's fast algorithm.

RocksDB Key-Value Store. Limelight+ reveals directions for accelerating the RocksDB key-value store at different scales, which are consistent with expert-designed hardware accelerators and software optimizations.

L3 reveals that 70% of RocksDB's computational time is consumed by LSM-Tree operations. Because these operations are diverse—PUT, MERGE, DELETE, COMPACTION, FLUSH, GET, MULTIGET, ITERATION, TRANSACTION and COLUMNFAMILY—RocksDB would benefit from end-to-end accelerators for the LSM-Tree. For example, NoveLSM [148] accelerates RocksDB by supporting the MemTable with byte-addressable non-volatile memory, which permits the direct mutability of persistent state. And NoFTL-KV [252] manages write amplification by eliminating the flash translation layer.

L4 reveals major intermediate targets for acceleration: persistent storage I/O and MemTable/SkipList operations. First, accesses to persistent storage account for over 50% of LSM operation cycles. Persistent reads and writes are required for write-ahead logging (WAL). They are also required for BlockBasedTable access. Second,

MemTable and SkipList insertion, reads, and searches account for over 30% of LSM operation cycles. System architects could accelerate writes and merges, required by WAL and BlockBasedTable, with Open-channel SSDs [33, 7] and Zoned Namespace SSDs [6]. They could accelerate reads, required by BlockBasedTable, with an Elastic Bloom Filter [167]. And they could accelerate LSM-Tree operations with smart NICs and more efficient SkipList implementations that use distributed memory objects [170] or cache-sensitive data structures [268].

L5 suggests about 44% of cycles are consumed by file I/O such as the POSIX file writer. These costs could be mitigated with non-volatile memory heaps [81] or emerging file systems with pre-allocation and explicit journaling [164].

HHVM PHP Engine. Limelight+ reveals six acceleration targets for HHVM in server-side PHP processing that cover over 60% of HHVM computational time. Note that HHVM time is relatively fragmented and there exists no single dominant hotspot even after function clustering, an observation confirmed by independent HHVM studies [123]. The relational database interface, serialization/deserialization, JIT/IR, hash maps, string processing, and regular expression processing account for 22%, 12%, 9%, 9%, 8%, and 2% of system time, respectively.

Relational databases are accelerated with system software or hardware architecture. For the former, Facebook accelerates SQL queries with MyRocks [182], which integrates MySQL with RocksDB for faster data access, data replication, and greater space efficiency. Presto [246] is an in-memory, distributed SQL query engine for fast, interactive analytics. For the latter, the Q100 [261] accelerates query performance by tiling custom processing elements for SQL operators such as JOIN, SELECT, and SORT. Swarm64 [36] provides an FPGA accelerator for PostgreSQL.

HHVM processing benefits from accelerated hash map, string, and regex processing. For hash tables, IMPICA [135] uses 3D-stacked memory to build an in-memory

pointer chasing accelerator. For string processing, ZuXa [247] proposes an ISA and accelerator for encoding, conversion, searching, etc. For regular expression matching, HARE [122] is a pipelined accelerator for the Aho-Corasick algorithm. These hardware solutions complement software optimizations such as Cuckoo Hashing [95] and Delayed Input Deterministic Finite Automata [155] for hashing and regular expression matching, respectively.

Serialization transforms structured objects into unstructured byte-streams for caching, storage, or communication. HHVM caches code and data, which requires time-consuming (de)serialization for fetch and store requests. Several systems have sought to reduce these overheads. Graphiq [23] implements an in-memory, formatted file cache with PHP opcache. HGum [269] uses a hardware-accelerated schema tree, context stack, and finite state machine. Apache Arrow [10] uses a columnar data representation, eliminating (de)serialization overheads.

Hardware accelerators for PHP JIT/IR are rare and static optimizations are more common [207]. Meta-tracing JITs automatically generate high-performance JIT-optimizing virtual machines from high-level specifications [138, 139]. Note that, in our study, heap management is not a hotspot because we compile and optimize HHVM with jemalloc [103], a high-performance heap allocator that is enabled by default for Facebook's web servers [104].

Proxygen Web Server. Limelight+'s analysis for Proxygen suggests acceleration targets for the HTTP web server and the network protocol stack. Commodity products, such as the f5 BIG-IP WebAccelerator [13], accelerate the web server with hardware for SSL encryption/decryption, the HTTP protocol, etc. Google's SPDY [101] optimizes the HTTP protocol and advanced web server functions.

The network protocol stack benefits from accelerators in both hardware and software. For the former, Microsoft proposes the Lightweight Transportation Layer [70] and Azure Accelerated Networking [107], accelerating network protocols and management with

custom SmartNICs on field programmable gate arrays. For the latter, researchers emphasize kernel bypass or fastpath methods. IX [102] is a kernel-bypassing, user-space TCP/IP stack, and TAS [150] accelerates TCP packet processing with a fastpath OS service on dedicated CPUs.

Software Libraries. Library-level subroutines that provide foundational infrastructure for a spectrum of application services are natural targets for acceleration. Software optimizations have permitted more efficient heap management with jemalloc, Hoard, and Ptmalloc [171], and more efficient (de)compression with LZ4 and Snappy [20, 35]. Moreover, researchers have built library-level hardware accelerators for memory (de)allocation [147], (de)compression [110], memory primitives [260, 248, 211], and synchronization primitives [120].

In summary, Limelight+ reveals the software infrastructure of the major services that collectively account for 70% of overall cycle consumption in the SERVICES testbed. Moreover, the suggested acceleration targets align with hardware and software solutions proposed by domain experts.

Table 3.6: Comparison between hot function clusters revealed by Limelight+ and expert-designed ASICs/accelerators, for convolutional neural network training and inference, key-value store, server-side PHP processing and shared libraries. Integrated Hardware Acceleration Limelight+ Hotspot Group Deep Neural End-to-end FPGA acceleration for Deep Convolution L3-G2 Network Neural Network, hardware optimized data flow and op- (DNN) erators related to CONV, POOL, BatchNorm, and Elt- Inference wise (ReLU) [176] Deep Neural CATERPILLAR [168]:hardware accelerated GEMV, L3-G1 Network GEMM kernels, collective communication and stochas- (DNN) tic gradient descent (SGD). SCALEDEEP [250]: het- Training erogeneous processing tiles and compute chips with 3-tiered grid-wheel-ring interconnect and low-overhead synchronization for optimized forward, backward prop- agation and gradient update. DNN FPGA for acceleration of Depthwise Separable Convolu- L4-G1 ConvOp tion, ping-pong buffer and matrix-multiplication engine [48].Convolution Engine [210]: ASIC optimized for en- ergy efficient convolution dataflow and computation.

112 GEMM/ TPU [142]: Tensor Processing Unit, ASIC with a 65,536 L5-G1/ GEMV 8-bit MAC matrix multiply unit and a large software L6-G1 managed on-chip memory.NTX [223]: specialized co- processor for Multiply-Accumulate (MAC) computa- tions and accelerated matrix-vector product,i.e.,GEMV. BLAS ARION [262]: scale-up accelerations for linear alge- L6-G2 (Basic braic operations (e.g., SUM, MAX, MULTIPLY) on Linear Algebra Dense and Sparse matrices.LACore [233] a RISC-V ) processor and ISA for acceleration of BLAS func- tions.cuBLAS[52]: fast GPU-accelerated BLAS. Log HyperLoop [151]: RDMA NICs and NVM based end- L2-G4, Structured to-end hardware acceleration for RocksDB. NoveLSM L2-G6 Merged [148]: NVM supported byte addressable MemTable, Trees direct mutability of persistent state, and opportunis- (RocksDB) tic read parallelism for end-to-end RocksDB accelera- tion.Eideticom NoLoad CSP [31]: co-processor for end- to-end acceleration of RocksDB. SkipList iPipe [170]: Multicore SoC SmartNICs acceler- L4-G3, (RocksDB ated, DMO (Distributed Memory Object) implemented L4-G7, Memtable) SkipList for LSM-Trees (LogStructured Merge Trees) L4-G9 based key-value store. GFSL [187]: GPU supported SkipList, accelerated with chunked skiplist nodes and warp-cooperative functions, higher memory coalescence and lower execution divergence for performance ampli- fication. BionicDB [152]: FPGA accelerated SkipList with pipelined, fast, power-efficient SkipList access and control. PIM [172]: Processing In-Memory co-processor accelerated near-memory skiplist manipulation. RocksDB Open-channel SSD [33,7] and Zoned Namespace SSD L4-G2, Write(WAL/ [6] accelerated WAL (Write Ahead Log) and Block- L4-G35 PUT/ basedTable write and merge. MERGE) RocksDB X-Engine [136]: a write-optimized storage built at Al- L3-G7 Com- ibaba with support FPGA-accelerated, pipelined com- paction/ paction in LSM-Trees. Flush RDBMS Q100 [261]: a heterogeneous collection of ASIC tiles, L3-G4, (Relational each implements a SQL relational operator, JOIN, SE- L3-G5 SQL LECT, SORT, etc. DAnA [179]: FPGA ISA for ac- Database celeration of RDBMS SQL query, with Strider engine for accelerated resultset acess.Swarm64 [36]: end-to-end FPGA accelerator for PostgreSQL database. Serialization/ HGum [269]: serialization/deserialization accelerator L3-G6/ Deserializa- with hardware accelerated schema Tree, context stack L5-G4 tion and finite state machine. ACCORDA [106]: Unstruc- tured Data Processor, with in-memory hierarchy inte- gration for acceleration of data parsing and deserializa- tion.

113 Ordered IMPICA [135]: 3D-stacked memory based In-Memory L6-G6 Map/ Hash PoInter Chasing Accelerator for hash table acceleration. Table/ SQRL[156]: integrated with last-level-cache (LLC), spe- Hashing cialized LLC refill engine (Collector) for accelerated HashTable access.QuickAssist [236]: Intel QuickAssist co-processor accelerated hashing algorithms, SHA-1, SHA-256, SHA-348, etc. Text/ String Hardware accelerator for text/string processing in L3-G8/ Processing content-rich PHP applications. Accelerated function- L5-G5 alities: character/sub-string finding, lower-case conver- sion, upper-case conversion, character/string replace- ment, and string trim. [123]. ZuXa [247]: Hardware accelerator and ISA for string-processing functions, en- coding, conversion, searching, filtering, etc. Regular Hardware accelerator for architectural support of reg- L3-G22, Expression ular expression matching/replacement in content-rich L3-G32 Processing PHP applications. [123]. HAWK [238]: pipelined hard- ware accelerator for bit-split finite automata based reg- ular expression matching. HARE [122]: pipelined hard- ware accelerator for Aho-Corasick algorithm based reg- ular expression matching. PCRE ( FPGA accelerated PCRE for greedy quantifier matching L5-G28 Compatible NFA (Nondeterministic Finite Automata) and PCRE Regular operators, e.g., anchors, quantifiers, ranged quantifiers, Expression) character classes [185]. XILINX GRegeX [17] and Paxym [26] PCRE accelerator. HTTP f5 BIG-IP WebAccelerator [13]: end-to-end hardware L2-G3 (Web) acceleration for SSL encryption/decryption, HTTP pro- Server) tocol, request cache, etc. Network LTL (Lightweight Transportation Layer) [70]: Microsoft L5-G14/ Transport inter-FPGA network protocol, hardware accelerated L6-G4 network transport engine. NICA [102]: a hardware- software co-designed framework for network I/O accel- eration based on FPGA-NICs. AccelNet (Azure Accel- erated Networking) [107]: offloading host networking to hardware, using custom Azure SmartNICs based on FP- GAs. File System BPFS [85]: short-circuit shadow paging for accelerating L6-G4, I/O atomic, fine-grained updates to persistent storage. NV- L6-G9, heaps [81]: Non-volatile Memory Heaps for I/O transac- L6-G16 tion acceleration with pointer safety and garbage collec- tion. UBJ [163]: non-volatile memory supported buffer cache architecture for caching-journaling union and I/O acceleration.

114 Compression/ HBALC [110]: Microsoft FPGA accelerator for L6-G6 Decompres- lossless compression/decompression with LZ77 and sion static Huffman coding. zEDC [14]: zEnterprise Data Compression, hardware acceleration for DE- FLATE/INFLATE based compression/decompression. QuickAssist [236]: INTEL QuickAssist Asic for accel- eration of DEFLATE/INFLATE and LZS based com- pression/decompression. Mutex/ SCU [120]: ASIC for hardware accelerated synchroniza- L6-G15 CondVar/ tion primitives such as barriers and locks. Barrier Memory Mallacc [147]: hardware accelerator for three primary L6-G5 Allocation/ operations of memory allocation request, size class com- Deallocation putation, free block retrieval, and memory usage sam- pling.Heap Manager Accelerator [123]. Memory Hardware accelerators for memcpy, memset, memmove, L6-G3 Copy/ memcmp, etc [260, 248, 211, 97, 209, 98]. Move/ Set/ Comparison

Table 3.7: Comparison between hot function clusters revealed by Limelight+ and expert-designed software re-optimizations, for convolutional neural network training and inference, key-value store, server-side PHP processing and shared libraries Integrated Software Re-optimization Limelight+ Hotspot Group Deep Neural SqueezeNet [137]: a small CNN architecture for accel- L3-G2 Network erated inference with AlexNet-level accuracy, 50x fewer (DNN) parameters and < 0.5MB model size. EIE [131]: effi- Inference cient inference engine on compressed deep neural net- work. Deep Neural TVM [78]: an end-to-end compiler for DNN-specific op- L3-G1 Network timizations, e.g., high-level operator fusion, mapping to (DNN) arbitrary hardware primitives, and memory latency hid- Training ing. DNN AccNet [89]: fast convolution kernel based on gCP (g L4-G1 ConvOp Canonical Polyadic) Decomposition and g-Convolution. GEMM/ Benson-Ballard algorithm [55]: a fast matrix multipli- L4-G1 GEMV cation algorithm with asymptotic complexity O(N2.775), Intel Math Kernel Library [254], OpenBLAS [24]. Log- NoFTL-KV [252]: LSM-Trees (Log Structured Merge L2-G4, structured Trees) read/write amplification with elimination of FTL L2-G6 Merge Trees (Flash Translation Layer) abstraction and native flash (RocksDB) storage control/management.

115 SkipList CSSL [230]: Cache-Sensitive SkipList, a cache-friendly L4-G3, (RocksDB data layout and traversal algorithm that minimizes L4-G7, Memtable) cache misses, branch mis-predictions, and allows to ex- L4-G9 ploit SIMD instructions for optimized range queries. S3 [268]: a scalable, two-layered, in-memory skiplist for the customized version of RocksDB in Alibaba Cloud, with cache-sensitive structure and semi-ordered skiplist in- dex. RocksDB ElasticBF [167]: Elastic Bloom Filter with hotness L4-G14 BlockBased awareness for boosting read performance of Block- Table Read BasedTable, i.e., SST (Sorted String Table) in LSM- Trees (e.g., RocksDB and LevelDB). RDBMS MyRocks [182]: MySQL with RocksDB storage back- L3-G4, (Relational end, featuring efficient compression and compaction, L3-G5 SQL fast append-only writes, etc. Presto [246]: in-memory, Database distributed SQL query engine optimized for realtime analysis at interactive speed. Serialization/ Graphiq File Cache [23]: PHP objects file cache memory L3-G6/ Deserializa- without serialization/deserialzation overhead. Apache L5-G4 tion Arrow [10]: language-independent columnar memory format to eliminate serialization/deserialization over- head. ScootR [157]: dataflow engine using Truffle framework and the Graal compiler to reduce serializa- tion/deserialization overhead. Ordered Cuckoo Hashing [95]: open-addressing hash structure L6-G6 Map/ Hash with optimized collision resolving and constant-time Table/ worst-case lookup. Horton Table [65]: retrofitted BCHT Hashing (Bucketized Cuckoo Hash Table) optimized with remap entries and cache-friendliness. MemC3 [105]: optimistic cuckoo hashing with compact LRU-approximating evic- tion and optimistic locking. Regular D2FA (Delayed Input Deterministic Finite Automata) L3-G22, Expression [155]: a highly compact DFA representation and algo- L3-G32 Processing rithm for fast, space-efficient regular expressions match- ing and replacement. Text/ String Fast-decodable indexing schemes for edit distance which L3-G8/ Processing can be used to speed up string match computations to L5-G5 near-linear time [129]. Linear-time Monte Carlo algo- rithm in optimal time and space for string suffix tree construction [118]. HTTP Google SPDY [101]: Google’s application-layer web pro- L2-G3 (Web) tocol in extension HTTP with advanced web serving Server mechanisms, e.g., multiplexing, compression, universal encryption, push/hint, content prioritization, etc.

116 Network IX [102]: kernel bypassing, user-space TCP/IP stack for L5-G14/ Transport network transport acceleration. TAS [150]: acceleration L6-G4 of TCP packet processing as a fastpath OS service on dedicated CPUs. File System EvFS [266]:I/O acceleration with user-level storage L6-G4, I/O stack and asynchronous, direct I/O. WALDIO [164]: L6-G9, I/O acceleration with: preallocation & explicit journal- L6-G16 ing, header embedding and group synchronization. Be- trFS [140]: in-kernel file system with write-optimized index (WOIs) for I/O acceleration. Compression/ LZ4 [20]: algorithm optimized for extremely fast com- L6-G6 Decompres- pression/decompression with Linear small-integer code sion (LSIC). Snappy [35]: fast, stable, robust compres- sion/decompression library comparable to fastest mode of zlib. Mutex/ CASPAR [115]: CAS (Compare-And-Swap) based syn- L6-G15 CondVar/ chronization architecture that enables lock-free synchro- Barrier nization in parallel. Memory Lock-Free Memory Allocator [184]: scalable, fast, lock- L6-G5 Allocation Free dynamic memory allocator design with support /Dealloca- of deadlock immunity, async-signal-safety, etc. je- tion malloc [103]: high performance heap allocator opti- mized for scalable concurrency support, high alloca- tion/deallocation throughput and fragmentation avoid- ance. Hoard, Ptmalloc [171].

3.7.3 Evaluating Layer Quality

We compare Limelight+'s FDMAX against four DAG layerization techniques drawn from broad algorithmic classes. In the following, let V(G) and E(G) denote nodes

and edges in DAG G. Let (u, v) denote an edge from u to v, and let L_G(v) denote node v's assigned layer.

• Longest Path (LP) assigns each node to a layer based on its distance from the graph's root; a node d edges from the root is assigned to layer d. LP can be solved in polynomial time with dynamic programming: it assigns layers to topologically sorted nodes such that the layer assigned to a node is one plus the deepest layer assigned to its predecessors [224, 53] (see the sketch after this list).

• Maximum Directed Cut (MAXDICUT) cuts a directed graph to maximize the number of edges crossing from a predecessor to successor partition. More generally, k-MAXDICUT cuts the graph into k partitions to maximize the number of edges crossing from a predecessor into all (not necessarily immediate) successors [222]. Because MAXDICUT is NP-hard [121], we approximate MAXDICUT with Extremal Optimization (τ-EO) [61, 99, 62] and approximate k-MAXDICUT with recursive MAXDICUT.

• Coffman-Graham Algorithm (CG) performs list scheduling and minimizes makespan. Consider w processors and n tasks, each requiring one unit of time, and a DAG that describes task interdependencies. List scheduling executes tasks to minimize time, ensuring no more than w tasks execute simultaneously and no task executes before its predecessors. List scheduling is NP-complete for w > 2 [117], and CG is a broadly used approximation [73].

• Node2Vec Embedding and Density Clustering (N2VDC) is a graph embedding and clustering algorithm. First, it produces a vector space representation of the graph's nodes that captures topological structure and preserves inter-node proximities. Second, it uses Euclidean distances and DBSCAN to cluster functions according to distribution densities. Finally, clusters are mapped to layers based on their centroids' distances to the graph's root.
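As a concrete point of reference for the Longest Path baseline above, the following is a minimal sketch of layer assignment by dynamic programming over a topological order. The edge-list encoding and the toy call graph are illustrative assumptions, not Limelight+'s internal representation.

```python
from collections import defaultdict, deque

def longest_path_layers(nodes, edges):
    """Assign each node to layer = length of the longest path from any root.

    nodes: iterable of hashable node ids.
    edges: iterable of (u, v) pairs, meaning u is a direct predecessor of v.
    Returns a dict node -> layer (roots are layer 0).
    """
    succ = defaultdict(list)
    indeg = {n: 0 for n in nodes}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1

    # Kahn's algorithm yields a topological order; layers follow by DP:
    # layer(v) = 1 + max(layer(u) over predecessors u), layer(root) = 0.
    layer = {n: 0 for n in nodes}
    queue = deque(n for n, d in indeg.items() if d == 0)
    while queue:
        u = queue.popleft()
        for v in succ[u]:
            layer[v] = max(layer[v], layer[u] + 1)
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return layer

# Toy call graph: main calls two services, which share a library routine.
print(longest_path_layers(
    ["main", "svc_a", "svc_b", "memcpy"],
    [("main", "svc_a"), ("main", "svc_b"), ("svc_a", "memcpy"), ("svc_b", "memcpy")]))
# {'main': 0, 'svc_a': 1, 'svc_b': 1, 'memcpy': 2}
```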

Table 3.8 indicates FDMAX outperforms alternative techniques for graph layerization across multiple measures of goodness. FDMAX's performance derives from the composite measure of modularity and the regularization strategy that aligns functions by graph proximity and string similarity.

Scale Divergence, Consistency. First, layers should exhibit scale divergence across layers and scale consistency within each layer. Divergence and consistency

Table 3.8: Comparison of Directed Acyclic Graph (DAG) layerization algorithms. Intervals in the table header indicate domains and superscripts + (−) indicate whether a larger (smaller) value is better.

Algorithm   CLH+ [0,1]   ILH+ [−∞,0]   FSP+ [0,d_max]   RDC+ [0,1]   ADC+ [0,1]   SRV− [0,d_max]
LP          0.58         -0.84         1.05             0.52         0.33         0
MAXDICUT    0.83         -0.4          0.94             0.89         0.79         0.34
CG          0.55         -1.1          0.83             0.36         0.32         0
N2VDC       0.57         -1.04         0.23             0.46         0.21         0.33
FDMAX       0.91         -0.34         1.2              0.94         0.86         0

are defined for functions with attributive or caller-callee relationships. A caller and callee are divergent (consistent) if the callee cannot (can) comprehensively represent the caller's dominant functionality. For example, caffe2::math::MaxPoolFunctor::Forward, subroutine RunMaxPool2D, and sub-subroutine ComputeMaxPool2D provide a consistent representation of max pooling. However, caffe2::Predictor::operator() and caffe2::math::MaxPoolFunctor::Forward are divergent since max pooling provides an incomplete representation of neural network prediction. In general, wrapper functions and their implementations exhibit the highest degree of scale consistency (e.g., rocksdb::DBImpl::Write and rocksdb::DBImpl::WriteImpl), whereas process/thread functions and leaf libraries exhibit the highest degree of scale divergence (e.g., __libc_start_main and malloc).

We measure divergence and consistency with measures of information heterogeneity. We consider each layer as a document that describes functions and their computation. Consider layers L_1, ..., L_m, each with a set of function declarations.

Let L_i + L_j denote the union of function declarations in L_i and L_j. The function LZ77(L_i) is the code length when using the Lempel-Ziv 77 algorithm [280] to encode the text in L_i.

The function NCHAR(L_i) is the number of characters of text in L_i.

Larger CLH (∈ [0, 1]) indicates greater textual heterogeneity across layers. LZ77(L_j) and LZ77(L_i + L_j) − LZ77(L_i) are the code lengths for L_j when encoding from scratch and when encoding with L_i's text as a dictionary, respectively. If the layers' contents are similar, L_j's code benefits from L_i's dictionary, its code length should shrink, and the fraction should shrink. CLH is averaged across pairs of higher and lower layers. Larger ILH (∈ [−∞, 0]) indicates greater textual homogeneity within a layer, implying coherent sets of computation. ILH measures the negated code length per character, averaged across layers. If functions within the same layer have many common characters, tokens, or substrings, the layer's text will exhibit higher redundancy and small code lengths.

CLH(L_1, \dots, L_m) = \frac{2}{m(m-1)} \sum_{i<j} \frac{LZ77(L_i + L_j) - LZ77(L_i)}{LZ77(L_j)}

ILH(L_1, \dots, L_m) = -\frac{1}{m} \sum_{i} \frac{LZ77(L_i)}{NCHAR(L_i)}
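As a minimal sketch of how these two measures can be computed, the code below uses zlib's DEFLATE (an LZ77-family coder) as a stand-in for the LZ77 code-length function; the toy layers, the concatenation of declarations into one document per layer, and the choice of compressor are illustrative assumptions.

```python
import itertools
import zlib

def code_len(text: str) -> int:
    """Compressed size in bytes, used as a proxy for the LZ77 code length."""
    return len(zlib.compress(text.encode("utf-8"), 9))

def clh(layers):
    """Cross-layer heterogeneity: average relative cost of encoding one layer's
    text when another layer's text is prepended as an implicit dictionary."""
    m = len(layers)
    total = 0.0
    for i, j in itertools.combinations(range(m), 2):   # pairs with i < j
        li, lj = " ".join(layers[i]), " ".join(layers[j])
        total += (code_len(li + " " + lj) - code_len(li)) / code_len(lj)
    return 2.0 * total / (m * (m - 1))

def ilh(layers):
    """Intra-layer homogeneity: negated code length per character, averaged over
    layers; redundant (homogeneous) layers compress well and score closer to 0."""
    ratios = []
    for layer in layers:
        text = " ".join(layer)
        ratios.append(code_len(text) / max(len(text), 1))
    return -sum(ratios) / len(ratios)

layers = [
    ["caffe2::Predictor::operator()", "rocksdb::DBImpl::Write"],
    ["caffe2::math::MaxPoolFunctor::Forward", "rocksdb::DBImpl::WriteImpl"],
    ["memcpy", "malloc", "free"],
]
print(round(clh(layers), 3), round(ilh(layers), 3))
```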

For divergence measures (CLH), FDMAX approaches the optimal value of 1, outperforms LP, CG, N2VDC by 1.8×, and outperforms MAXDICUT by 1.1×. For consistency and alignment measures (ILH), FDMAX outperforms CG and N2VDC by 3×, outperforms LP by 2.5×, and outperforms MAXDICUT by 1.2×.

Functional Support. Suppose the call graph is layered (L_1, ..., L_m) and | · |

denotes the number of functions in a layer. FSP (∈ [0, d_max], where d_max is the maximum in-degree of the graph) measures the average number of direct callers from layer L_{i−1} to a function in layer L_i. FSP increases with the amount of infrastructural support that a lower layer provides to a higher one, and higher FSPs are preferred.

FSP(L_1, L_2, \dots, L_m) = \frac{1}{|F(S)|} \sum_{i=2}^{m} \sum_{f \in L_i} |DRS(f, S) \cap L_{i-1}|
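A minimal sketch of the FSP computation, assuming the direct-caller sets DRS(f) have already been extracted from the stack samples and the layers are given as sets of function names:

```python
def fsp(layers, direct_callers):
    """Functional Support: average number of direct callers a function receives
    from the layer immediately above it, normalized by the total function count.

    layers: list of sets; layers[0] is the highest layer L1.
    direct_callers: dict mapping a function f to the set DRS(f) of its direct callers.
    """
    n_functions = sum(len(layer) for layer in layers)
    support = 0
    for i in range(1, len(layers)):                # layers L2 .. Lm
        for f in layers[i]:
            support += len(direct_callers.get(f, set()) & layers[i - 1])
    return support / n_functions
```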

For measures of functional support, FDMAX reports a value greater than one and outperforms N2VDC by 6× and outperforms CG, MAXDICUT, and LP by 1.1-1.4×.

Completeness. Absolute completeness (ADC) is a measure that helps us understand multi-scale cycle distributions and answer questions like what share of cycles is exclusively consumed by libraries. Absolute completeness for a layer is the summed layer inclusive cycle percentage (LIP). Absolute completeness reflects call graph structure rather than layerization quality. For example, the absolute completeness of layer L1 is 100% and the absolute completeness of L6 is 70%. L6 contains 27% of the 3.7K functions in SERVICES and cannot report absolute completeness of 100% unless the other 73% of functions consume no exclusive cycles.

Relative completeness (RDC), on the other hand, measures layerization quality by calculating the ratio between absolute completeness and the summed exclusive cycles excluding higher-level layers. For example, L4's relative completeness is 85% (= 81%/95%) because its absolute completeness is 81% and higher layers consume 5% of exclusive cycles. Limelight+'s layers all report relative completeness greater than 85%, such that any layer provides a comprehensive view of how cycles are consumed. This measure also indicates that the number of attributed cycles that bypass an intermediate layer is no more than 15%.

ADC(L_1, L_2, \dots, L_m) = \frac{1}{m} \sum_{i=1}^{m} LIP(L_i)

RDC(L_1, L_2, \dots, L_m) = \frac{1}{m-1} \sum_{i=2}^{m} \frac{LIP(L_i)}{1 - \sum_{j=1}^{i-1} LEP(L_j)}
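A minimal sketch of both completeness measures, assuming per-layer inclusive (LIP) and exclusive (LEP) cycle fractions are supplied; the numbers below are illustrative only, not the SERVICES trace:

```python
def adc(lip):
    """Absolute completeness: mean layer-inclusive cycle fraction."""
    return sum(lip) / len(lip)

def rdc(lip, lep):
    """Relative completeness: inclusive cycles of a layer relative to the cycles
    not already attributed exclusively to the layers above it."""
    m = len(lip)
    total, consumed_above = 0.0, 0.0
    for i in range(1, m):
        consumed_above += lep[i - 1]
        total += lip[i] / (1.0 - consumed_above)
    return total / (m - 1)

# Illustrative per-layer fractions for six layers (L1 .. L6).
lip = [1.00, 0.93, 0.88, 0.81, 0.76, 0.70]
lep = [0.01, 0.02, 0.02, 0.05, 0.06, 0.70]
print(round(adc(lip), 2), round(rdc(lip, lep), 2))
```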

For measures of decomposition completeness, FDMAX reports a value close to one. It outperforms N2VDC by 4×, outperforms LP and CG by 2.6×, and outperforms MAXDICUT by 1.1×. Service Relationships. Functions within a layer L should have no indirect

121 callers (callees) in any layer lower (higher) than L. Violating this rule implies a faulty layerization in which higher-level functions support lower-level ones. SRV

(range [0, d_max], where d_max is the maximum in-degree of the graph) measures the average number

of direct callees from higher layers L_1, ..., L_{i−1} to a function in layer L_i. A layerization with SRV of zero implies an absence of faults.

SRV(L_1, L_2, \dots, L_m) = \frac{1}{|F(S)|} \sum_{i<j} \sum_{f \in L_j} |DES(f, S) \cap L_i|
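Mirroring the FSP sketch above, SRV can be computed from precomputed direct-callee sets DES(f); as before, the data layout is an illustrative assumption:

```python
def srv(layers, direct_callees):
    """Service-relationship violations: direct callees that sit in a *higher*
    layer than their caller; a sound layerization should report zero.

    layers: list of sets; layers[0] is the highest layer L1.
    direct_callees: dict mapping a function f to the set DES(f) of its direct callees.
    """
    n_functions = sum(len(layer) for layer in layers)
    faults = 0
    for j in range(1, len(layers)):                # caller's layer Lj (lower)
        for f in layers[j]:
            for i in range(j):                     # every higher layer Li, i < j
                faults += len(direct_callees.get(f, set()) & layers[i])
    return faults / n_functions
```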

FDMAX's layers have no faulty service relationships, whereas MAXDICUT and N2VDC have severe challenges in maintaining caller-callee relationships when layerizing the call graph.

3.7.4 Evaluating Semantic Embeddings

We evaluate the quality of the semantic embeddings with two methods. First, the Semantic Relatedness Test (SRT) determines, given a function and its vector representation, whether the K nearest functions are related. Second, Probabilistic Stack Modeling (PSM) trains a model to learn the joint probability for functions in stack traces. If the trained model is characterized by higher joint probabilities, then it can more effectively generate collections of functions that reflect real-world relationships between them.

Semantic Relatedness Test (SRT): We use the Precision@K (K = 10) metric to evaluate the quality of semantic embeddings. First, we repeatedly sample a random function q from the trace and search for the 10 nearest functions in the vector space defined by the semantic embedding algorithm. Second, we ask two domain experts to eliminate irrelevant neighbors, which have little functional relationship to q. Finally, Precision@10 is the fraction of neighboring functions that remain, averaged over sampled functions. We perform ten independent tests for token- and function-level

evaluations, drawing 30 random queries per test. Word2Vec is the baseline for token embedding and the average of Word2Vec embedded vectors is the baseline for function embedding.
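A minimal sketch of the Precision@10 computation, assuming embeddings are dense NumPy vectors and the experts' relevance judgments for a query are supplied as a set of function names:

```python
import numpy as np

def precision_at_k(query_id, embeddings, relevant, k=10):
    """Fraction of the k nearest neighbors (Euclidean distance in the embedding
    space) that the domain experts judged functionally related to the query.

    embeddings: dict function-name -> 1-D numpy vector.
    relevant:   set of function names the experts kept for this query.
    """
    names = [n for n in embeddings if n != query_id]
    q = embeddings[query_id]
    dists = [(np.linalg.norm(embeddings[n] - q), n) for n in names]
    neighbors = [n for _, n in sorted(dists)[:k]]
    return sum(n in relevant for n in neighbors) / k
```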

Table 3.9: Evaluation of Limelight+'s EE (Equilibrium Embedding) and Word2Vec on the token-level and function-level Semantic Relatedness Test. Suffix -Avg indicates function vectors are produced by averaging token vectors.

               Word2Vec (Token)   EE (Token)   Word2Vec-Avg (Function)   EE (Function)
Precision@10   0.79               0.86         0.92                      0.94

Table 3.9 reports Precision@K (K = 10) for Equilibrium Embedding and Word2Vec. Equilibrium Embedding accurately captures semantic similarities between functions and tokens. Equilibrium Embedding's token-level precision is 0.86, which is 0.07 higher than Word2Vec's. Its function-level precision is 0.94, which is 0.02 higher than Word2Vec's.

Probabilistic Stack Modeling (PSM): Each stack sample s ∈ S is a chain of function calls, f_1 → f_2 → ... → f_k, and the joint probability of sample s is the product of conditional probabilities for each function given its indirect callers.

P(s) = P(f_1) \cdot P(f_2 \mid f_1) \cdot P(f_3 \mid f_1, f_2) \cdots P(f_k \mid f_1, f_2, \dots, f_{k-1})

A routine itself dictates the sub-routines it invokes, and this "Markovian" property simplifies the task of learning the joint probability to predicting direct caller-callee relationships using routines' semantic embeddings.

P(s) = P(f_1) \cdot P(f_2 \mid f_1) \cdot P(f_3 \mid f_2) \cdots P(f_k \mid f_{k-1})

We use a simple SOFTMAX classifier based on the negative Euclidean distance between functions within the embedded vector space.

P(g \mid f) = \frac{e^{-\|f - g\|_2}}{\sum_{g' \in F(S)} e^{-\|f - g'\|_2}}
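A minimal sketch of the PSM scoring step: the softmax over negative Euclidean distances defined above, applied to the Markov factorization of a stack sample. The embedding dictionary and the handling of P(f_1) as a supplied log-prior are illustrative assumptions.

```python
import numpy as np

def log_p_next(callee, caller, embeddings):
    """log P(callee | caller) under a softmax over negative Euclidean distances."""
    dists = np.array([np.linalg.norm(embeddings[caller] - embeddings[g])
                      for g in embeddings])
    log_z = np.log(np.sum(np.exp(-dists)))          # partition function over F(S)
    return -np.linalg.norm(embeddings[caller] - embeddings[callee]) - log_z

def log_p_stack(stack, embeddings, log_p_first=0.0):
    """log P(f1 -> f2 -> ... -> fk) under the Markov factorization; the prior
    P(f1) is folded into log_p_first (e.g., an empirical root frequency)."""
    return log_p_first + sum(log_p_next(callee, caller, embeddings)
                             for caller, callee in zip(stack, stack[1:]))

# Toy usage with random 8-dimensional embeddings for three functions.
rng = np.random.default_rng(0)
emb = {f: rng.normal(size=8) for f in ["__libc_start_main", "main", "malloc"]}
print(log_p_stack(["__libc_start_main", "main", "malloc"], emb))
```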

We compare Limelight+'s Equilibrium Embedding against Word2Vec, Latent Semantic Indexing, Latent Dirichlet Allocation, and TF-IDF. Equilibrium Embedding directly produces function embeddings. The others first produce token embeddings that are averaged to obtain function embeddings.

Limelight+'s Equilibrium Embedding incorporates both lexical similarity and structural affinity. To fairly compare our embedding against the baselines, we implement structural and non-structural versions of PSM. First, non-structural PSM makes all algorithms agnostic to call graph structure. Each function is treated as a small document and the functions together are treated as a corpus of documents. Second, structural PSM enhances the baseline algorithms with awareness of call graph structure. We concatenate function declarations based on direct caller-callee relationships to form new declarations that capture graph structure for PSM training.

Table 3.10: Evaluation of function semantic embedding with Probabilistic Stack Modeling. Values outside (inside) parentheses indicate non-structural (structural) PSM.

Embedding Model   Log Probability Minimum   Log Probability Mean   Log Probability Maximum
EE                -186.81 (-215.32)         -15.30 (-13.53)        -5.82 (-4.05)
Word2Vec-Avg      -305.46 (-259.16)         -16.86 (-14.42)        -7.48 (-4.94)
LDA-Avg           -333.91 (/)               -17.63 (/)             -8.15 (/)
LSI-Avg           -319.04 (-325.32)         -17.31 (-17.49)        -7.82 (-8.01)
TF-IDF-Avg        -307.20 (-253.79)         -17.18 (-15.17)        -7.69 (-5.71)

Table 3.10 presents log probabilities for all stack samples in the test set. Results indicate EE and Word2Vec significantly outperform the alternatives and EE outperforms Word2Vec.

3.7.5 Analyzing Production Datacenter

For the FLEET trace, Limelight+ produces hundreds of function clusters distributed over seven layers. L2 groups correspond to 43 major services that account for 60% of

overall consumed processor cycles. These higher-level groups reveal broad workload trends such as microservices and serverless computing. L7 groups correspond to thousands of libraries and account for 50% of cycles. These lower-level groups reveal software infrastructure that could inspire directions in hardware architecture.

Figure 3.8: Categorization of service-level hotspots revealed by Limelight+ for a macro-scale view of datacenter workload composition.

Table 3.11 lists major services deployed on hundreds to thousands of machines. Figure 3.8 suggests categories of services important for modern, production datacenter computing. From this analysis, we identify three salient features of modern datacenters.

First, machine learning is a first-class citizen. Inference is deployed as a remote predictor service or a local predictor with API integration (e.g., a Caffe2 model integrated with Ads). Together, these predictors are among the hottest five services.

Figure 3.9: Categorization of library-level hotspots revealed by Limelight+ for a micro-scale view of datacenter software infrastructure.

Moreover, inference is an essential component of varied business-critical applications such as advertisement, news feed, search, integrity & security checks, and image and video understanding.

Second, memory is essential for datacenter storage. Large-scale memory caches and in-memory databases are significant contributors to datacenter cycles. They appear in three major settings where disks may fail to meet latency and scalability requirements. Large-scale memcache services (Cache1 and Cache2) provide an intermediate cache tier between web applications and backend persistent storage (e.g., MySQL) to accelerate object fetching and saving. Key-value stores (DB3 and DB7) are created with RAM-disk hybrid data structures such as a log-structured merge tree

in RAM and a sorted string table or block-based table on disk. In-memory relational databases are the final significant contributor to memory usage in datacenters.

Third, service management and serverless computing contribute moderately to cycle consumption. These cycles are used for run-time buildup, resource management, RPC routing, utilization monitoring, and anomaly detection and recovery. These management functions form an abstraction layer between the service interface and physical servers. The abstraction layer illustrates how datacenter management is transitioning from servers to services. Other studies refer to this trend as serverless computing [100].

Table 3.12 presents low-level groups of functions discovered by Limelight+. Figure 3.9 breaks down the cycles across libraries. L7's 3000 functions account for 57% of datacenter cycles. These functions are organized into 38 groups that provide a detailed, fine-grained view of the software infrastructure.

Limelight+ suggests varied functions provide software infrastructure and no single function is a dominant hotspot. This finding resonates with prior work from Google [146]. However, the Google study discovers only six groups of infrastructural functions that cover only 25% of system cycles. This restricted finding is due to two methodological limitations. First, the study assesses functions' exclusive cycles (EXCP), a measure that is agnostic to hierarchical relationships between routines. Routines with high EXCP may not be low-level infrastructure and could be high-level services. Second, its analysis of cycles is fragmented across many functions, which complicates the discovery of core infrastructure.

Limelight+ improves the analysis with hierarchical graph analysis (FDMAX) and semantic clustering for function declarations (STEAM). These improvements yield two important findings. First, Limelight+ identifies 3000 infrastructural functions that account for 57% of system cycles. Second, these functions can be clustered into 100 groups, and the hottest 37 groups account for 50% of system cycles.

127 Datacenter software infrastructure has been a popular target for hardware ac- celeration. Table 3.6 surveys hardware accelerators relevant to datacenter software. The abundance and diversity of accelerators reveals a transition from software to hardware libraries. For example, Microsoft’s FPGA cluster accelerates web index ranking, matrix arithmetic (for neural networks), compression, network transporta- tion, etc. [70].

Table 3.11: Service-level hot function clusters revealed by Limelight+ on a month- long trace of a live production datacenter. Only hot functionalities with GIP (Group Inclusive Cycle Percentage) larger or equal to 0.05% of overall datacenter cycles are listed. Unlike previous examples, the group notation in this table does not reflect GIP ranking. All 38 groups in total cover about 60% of overall datacenter cycles. Limelight+ Associated Functionality Description Example System Group Service L2G1 WWW1 Front-end services related to HHVM[200] Hack/PHP script processing and web serving. L2G2 WWW2 Web server, proxy, process uWSGI[37] manager & monitors for support of python web applications (e.g., Django). L2G3 WWW3 HTTP server framework for Proxygen[28] support of HTTP/1.1, SPDY/3, SPDY/3.1, HTTP/2, and HTTP/3. L2G4, Ads1, Ads2, Back-end services related to adindexer [201] L2G5, Ads3, Ads4 advertisement marketing, L2G6, indexing, targeting, fetching, L2G7 scoring, ranking and delivery. L2G8 Feed1, Feed1 (Aggregator): News Feed Multifeed[279] Feed2 aggregation, retrieval, filtering, score, ranking and story delivery. Feed2 (Leaf): distributed storage and indexing for up-to-date user activities/actions related to News Feed stories. L2G9 Search1 Online, in-memory search and Unicorn[88] inverted indexing for relationships (e.g., friend, likes, live-in) in social graphs. L2G10 Cache1 Multi-region, distributed Memcached[197] in-memory key-value store with token based lease protocols.

128 L2G11 Cache2 Distributed, in-memory data store TAO[67] for timely access to nodes and edges in constantly changing social graphs. L2G12 DB1 In-memory time series database Gorilla[205] optimized for aggregation queries and compression of time series data. L2G13 DB2 Data storage in support of Rockfort real-time SQL queries. Express[5] L2G14 DB3 Distributed, persistent key-value ZippyDB[237] store based on Paxos and RocksDB L2G15 DB4 Distributed, in-memory SQL Presto[246] query engine optimized for real-time analytics at interactive speed. L2G16 DB5 Distributed, in-memory SQL Scuba[38] query engine optimized for real-time analytics and visualization. L2G17 DB6 Persistent, relational database for MySQL[21] SQL query. L2G18 DB7 Persistent key-value store based Laser[2] on RocksDB for large-scale concurrent queries L2G19 DB8 No-SQL, distributed, structured Cassandra [161] data storage and query engine. L2G20 Log1, Log2 Log1: collector and aggregator of [244], web-generated logs (e.g., user LogDevice [19] actions). Log2: distributed, fault tolerant log storage and indexing. L2G21 MapRed MapReduce based distributed Hadoop/Hive[11] data processing, query and storage engine. L2G22 Pub-Sub Publish-subscribe systems that Wormhole[226] identify and transmit publisher new writes (updates) to interested subscriber services. L2G23 Inference1 Multi-tenancy service for realtime FBLearner[132] inference of pre-trained machine learning models for computer vision, NLP, etc. L2G24 Vision1 Deep neural network based Lumos[69] computer vision service for image/video understanding (e.g., object recognition).

129 L2G25 Vision2 Deep neural network based Facer[132] computer vision service for human face detection/recognition in images and videos. L2G26 Codec Codec service for Transcoder[8] encoding/decoding of digital contents such as image, video, live stream. L2G27 Stream Stream storage & processing PUMA & engine in support of SQL-like PTail[74,2] queries with UDFs (userdefined functions). L2G28 BlobStore1 BLOB (Binary Large OBjects) f4[189] persistent storage for photos, videos, heap dumps, etc. , L2G29 BlobStore2 BLOB (Binary Large OBjects) haystack[54] persistent storage optimized for photo access. L2G30 Integrity1 Site integrity check for Monarch[243] identification of blacklisted domains, URLs, email addresses, phone numbers, etc. L2G31 Integrity2 Machine learning based malicious Sigma [132] content detection in support of site integrity, spam defense, payment security, etc. L7G32 Recommend Recommendation of social PYMK[4] connections between friends, colleagues, etc (real function names under obfuscation). L2G33 Mobile Cloud based mobile computation Snaptu[3] offloading for low-end/obsolete mobile device acceleration. L2G34 Manage1 service communication and SMC[39] management, e.g., service RPC router, service physical resource management. L2G35 Manage2 Production, code version control, configerator authoring, review, automated [239] canary testing and config distribution. L2G36 Monitor1 Back-end service for server fbagent[165], monitoring, dynamic fb303[27] configuration, uptime reports, activity collection, etc. L2G37 Monitor2 Times series database & analytics ODS [205] for performance metrics, realtime performance anomaly detection & alarm.

130 L2G38 Monitor3 Server utilization (e.g. CPU, dynolog [57] RAM, network and disk) monitoring and collection, hardware counter profiling.

Table 3.12: Shared-library-level, hot function clusters revealed by Limelight+ on a month-long trace of a live production datacenter.. Only hot functionalities with GIP (Group Inclusive Cycle Percentage) larger or equal to 0.1% of overall datacenter cycles are listed. Note that notations from L7G1 to L7G37 reflects descending ranks of GIPs, with L7G1 the highest and L7G37 the smallest. All 37 groups in total cover about 50% of overall datacenter cycles. Limelight+ Functionality Representative Member Functions Group L7G1 Memory malloc,free,operatornew,sdallocx,nallocx, (De)Allocation PyMem Malloc,std::smartRealloc,ASN1 STRIN G free L7G2 Memory memcmp sse4 1, memcpy avx unaligned, mems Operations et avx2,memchr, bcopy in 64, bcopy32, rawm emchr sse2 L7G3 Array Operations HPHP::iterNextArrayPacked,java.util.Array List.add,jshort disjoint arraycopy,HPHP:: Array::exists L7G4 String Operations std:: cxx11::basic string:: M append,ava.l ang.AbstractStringBuilder.append,folly::j son::escapeString L7G5 Hash/ Map std:: Hashtable:: M assign,folly::hash::S pookyHashV2::Hash128,java.util.HashSet.ad d,MurmurHash64A L7G6 (De)Compression qlz decompress,LZ4 compress fast,HUF compr ess1X usingCTable,snappy::InternalUncompr ess L7G7 Lock/ CondVar/ pthread mutex lock,folly::RWSpinLock::loc Sync. k shared,pthread cond signal, lll lock wai t L7G8 Linear Algebra Eigen::internal::gemm pack rhs::operator( ),mkl blas avx2 xsaxpy,mkl blas avx2 xsgem v,mkl blas avx sgemm pst L7G9 Network I/O folly::AsyncSocket::sendSocketMessage,sun .nio.ch.SocketChannelImpl.write,sendmsg,r ecv L7G10 Red-Black/ B+ std:: Rb tree:: M erase,btree::btree::reba Tree lance or split,std:: Rb tree insert and reb alance,std:: Rb tree::find L7G11 Vector Operations std::vector:: M insert aux,folly::sorted v ector map::sorted vector map,folly::small vector::insert

131 L7G12 Tensor Operators caffe2::Tensor::Resize,caffe2::OperatorBa se::Output,caffe2::Tensor::CopyFrom,caffe 2::Tensor::dim32 L7G13 Math Functions ieee754 exp avx, log2f finite, pow, ceil sse41, cos avx, floor sse41 L7G14 Text Codec sun.nio.cs.UTF 8$Decoder.decodeArrayLoop, PyUnicode FromFormatV,icu 53::UnicodeSet: :contains L7G15 Multi- threading folly::collectAll,folly::SharedPromise::s etTry,folly::RequestContext::setContextDa ta,folly::Future::detach L7G16 Shared Pointer std:: Sp counted ptr:: M dispose,folly::at (SP) omic shared ptr::get shared ptr,boost::sha red ptr::reset L7G17 Clock/ Timer clock gettime,os::javaTimeMillis,std::ch rono:: V2::steady clock::now,timeout next L7G18 ML Basic Basic machine learning operations, e.g., ngram Operators counter, decision tree leaf evaluation (real function names under obfuscation). L7G19 Video Codec x264 pixel satd 8x8 internal avx2,ff h264 f ilter mb fast,y8 own 8x8 LSClip 16s8u L7G20 Crpyto/Decrypto CRYPTO gcm128 decrypt ctr32,rsaz 1024 mul avx2,aesni ctr32 encrypt blocks L7G21 Image Processing png read filter row paeth multibyte pixel, png write find filter,S32 generic D32 filt er DX SSSE3 L7G22 Regular java.util.regex.Matcher.find,boost::cpp r Expression egex traits::isctype,re2::RE2::Match L7G23 LIBC I/O GI libc write, libc read, libc fcntl, lib Interface c recv, GI libc sendmsg, GI inet aton L7G24 I/O Buffer folly::IOBuf::cloneAsValue,folly::IOBuf:: (IOBUF) decrementRefcount,folly::IOBuf::empty L7G25 Python Runtime PyLong FromLong,PyBool FromLong,PyTuple S Primitives ize,PyFloat FromDouble L7G26 PHP Runtime PHP JIT/IR primitives related to interpretation, Primitives register manipulation, etc (real function names under obfuscation) L7G27 Queue Operations folly::NotificationQueue::putMessageImpl, java.util.concurrent.LinkedBlockingQueue. offer L7G28 Lightweight folly::fibers::Fiber::resume,folly::fiber Thread s::Fiber::LocalData::reset L7G29 Obj. Assoc. Software-implemented associative cache for Cache objects (real function names under obfuscation). L7G30 (Un)Serialization Serialization and unserialization related routines (real function names under obfuscation).

132 L7G31 Object Copy/ folly::Try::operator=,std::back insert it Assignment erator::operator= L7G32 Type Cast/ cxxabiv1:: class type info:: do dyncast,d Conversion ouble conversion::FastDtoa L7G33 Java Runtime PhaseChaitin::Split,PhaseIdealLoop::build Primitives loop late,PhaseIFG::SquareUp L7G34 SSL ssl3 read,OPENSSL cleanse,folly::ssl::Ope nSSLUtils::getCipherNameabi:cxx11 L7G35 Boost Container boost::variant::variant,boost::variant::v ariant assign L7G36 Big Number BN div,bn mul4x mont,BN uadd Arithmetics L7G37 Deduplication Storage content deduplication (real function names under obfuscation)

3.8 Complexity and Overhead

FDMAX. The time complexity of FDMAX is O(T(n^2 + IS(d + c)n)), assuming that the number of caller-callee relations between functions is a constant multiple, d, of the number of functions, n. This assumption is reasonable for real-world stack traces, since most functions usually have only a moderate number of direct callers and callees. T is the number of rounds, I is the number of iterations per round, and S is the number of randomly generated samples per iteration in the CEM-OPT framework. c is the maximum number of conjugate functions for a single function. The time complexity of modularity computation per iteration is O(n^2), if we pre-build a LW (Levenshtein-Winkler) similarity matrix and a hash table for conjugate functions, and if we use dynamic programming for computing Integrity. The time complexity of partial order analysis per random sample (in CEM-OPT) is O(dn), if we use breadth-first search for solving transitive closure and depth-first search for finding sub-root functions. The time complexity of CFR (Conjugate Functionality Regularization) per random sample (in CEM-OPT) is O(cn).

STEAM. The time complexity of STEAM is O(kn^2 + wn). k represents the number of selected eigenvectors in spectral embedding, and it requires that the eigenvectors

be computed using a Krylov subspace method such as the Arnoldi algorithm, which has time complexity O(kn^2). w is the maximum number of tokens in function declarations.

HELP. The time complexity of HELP is O(lm). l is the maximum stack depth and m is the total number of stack samples in the trace.

Limelight+. The overall time complexity of Limelight+ is O(l^2 m + wn + TIS(d + c)n + (k + T)n^2). Besides the costs incurred by FDMAX, STEAM, and HELP, Limelight+ incurs overhead from ahead-of-time computation. First, it requires a pre-processing procedure that parses call stack samples, with time complexity O(l^2 m). Second, it requires pre-building a hash table for conjugate functions, with time complexity O(kn^2). Third, it requires pre-computing the Sigmoid Commute Time Kernel for functions, with k-truncated-SVD-based Moore-Penrose inversion.

The time complexity of Limelight+ can be simplified to O(Sn + n^2), if we assume that c, l, d, T, I, and k are constant with respect to n, and w is a constant with respect to n^2. We argue that these assumptions are reasonable for real-world stack traces.

We parallelize the algorithmic framework of Limelight+ with Berkeley Ray and Apache Arrow, and deploy our implementation on a 48-core NUMA machine for

overhead evaluation. The end-to-end runtime of Limelight+, measured over stack traces of different sizes, is shown in Figure 3.10.

Figure 3.10: Runtime analysis of Limelight+. Measurements are averaged over multiple runs.

The results show Limelight+ can complete its analysis in 5 minutes on a moderate stack trace with 1K functions, while it needs 729 minutes to analyze a huge stack trace with over 19K functions. Moreover, the results show a quadratic trend for Limelight+'s scalability. We also validate that Limelight+'s implementation can benefit from a large degree of parallelism. The results show that, compared with serial execution on a single core, Limelight+'s implementation achieves on average a 4×, 13×, and 35× speedup when running with 4, 16, and 48 cores, respectively.
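The parallelization is described here only at a high level; the sketch below shows the general pattern of sharding independent per-function work across Ray workers. The toy declarations and the placeholder per-shard computation are illustrative assumptions and do not reflect Limelight+'s actual task decomposition.

```python
import ray

# Toy stand-in for the parsed trace: function name -> declaration text.
declarations = {f"func_{i}": f"namespace::Class{i}::method{i}(int, char const*)"
                for i in range(1000)}

ray.init()

@ray.remote
def score_shard(shard, decls):
    # Placeholder per-function work (e.g., similarity or embedding terms) that is
    # independent across functions and can therefore run on any worker.
    return {f: len(decls[f]) for f in shard}

functions = list(declarations)
shards = [functions[i::48] for i in range(48)]            # 48 independent shards
futures = [score_shard.remote(shard, declarations) for shard in shards]
merged = {f: v for part in ray.get(futures) for f, v in part.items()}
```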

3.9 Related Work

Datacenter-scale Performance Monitoring, Tracing, Profiling and Analysis. (i) Datacenter-scale Monitoring: dynolog [57] is a monitoring engine that runs on all servers in the fleet and collects, at one-second intervals, application-level, system-level, and architectural-level performance metrics, including RPS (requests per second), CPU & RAM utilization, storage and network I/O statistics, page faults and context switches per second, memory bandwidth utilization, IPC (Instructions Per Cycle), etc. Similar large-scale monitoring engines are Borg [251], IBM Tivoli [169], Amazon CloudWatch [9], AzureWatch (CloudMonix) [12], and Ganglia [181]. (ii) Datacenter-scale Tracing: Canopy [144] is a performance tracing engine that stitches performance data across the end-to-end execution path of a web request, which involves browsers, mobile applications, backend services, etc. Using comprehensive timing and causal information from the client and across all server-side processes, Canopy can reveal the critical path of requests and pinpoint performance bottlenecks. Similar datacenter-scale tracing frameworks are X-Trace [108], Pivot Tracing [178], and Dapper [229]. (iii) Datacenter-scale Profiling: Datacenter-wide profiling focuses on continuous profiling of hardware events. For example, GWP (Google Wide Profiling)

[215] samples across machines in multiple datacenters and collects various hardware events such as call stacks, cache misses, etc. Similar datacenter-scale profiling engines are CPI2 [271] and Strobelight [68]. (iv) Datacenter-scale Analysis: Datacenter-scale performance analysis requires scalable data storage and processing engines for interactive analytics and visualization. Examples are ODS [165], Scuba [38], and Gorilla [205]. Datacenter-scale performance anomaly troubleshooting requires systematic diagnosis methods, such as Hound [275], The Mystery Machine [80], Iprof [273], Kraken [249], and Blocked Time Analysis [203].

Domain-Specific Hardware Acceleration. Currently, domain-specific hardware acceleration mostly targets shared library subroutines, such as memory management and manipulation [147, 123, 260, 248, 211, 97, 209, 98], linear algebra [142, 223, 262, 233, 52], advanced data structures [79, 172], string & regular expression processing [123, 247, 238, 122], file & network I/O [81, 85, 163, 70, 102], compression & decompression [110, 14, 236], synchronization primitives [120], etc. Accelerators for shared library subroutines are prevalent for two reasons. First, they are common building blocks for a wide range of application domains. Second, they are a special class of miniature applications with simple, deterministic control logic and data flow; they are usually performant and efficient with sufficient parallelism; and they can be synthesized productively on programmable logic such as FPGAs.

Besides hardware accelerators for shared library subroutines, there also exist acceleration schemes for larger, integrated computation that usually involves sophisticated algorithms, data structures, and program logic. Examples are end-to-end acceleration of deep neural networks [168, 250], relational database queries [261, 36, 179], and LSM-Tree based persistent key-value stores [151, 31, 148]. Moreover, researchers in the networking community have demonstrated the potential of offloading complex, distributed applications from datacenter servers to multi-core SoC network devices

[170, 151].

DAG Layerization. There are three major applications of DAG layerization: workflow search, list scheduling, and layered graph drawing. (i) Workflow Search: Workflows are mostly formatted as DAGs. In these DAGs, data input, data output, and data processing modules are nodes, and the directed edges (links) between nodes define data flow. Given a query, a workflow search engine is required to return a set of workflows similar to the query. Researchers found that DAGs layerized with the Longest Path algorithm [53, 224] are more effective than other DAG representations for retrieving similar workflows [231]. (ii) List Scheduling: DAGs can be used to represent data dependencies between potentially parallel jobs. A scheduling plan is a layered DAG (i.e., a Gantt chart [143]), with no dependencies between jobs within the same layer, and with only dependencies from high-layer jobs to low-layer jobs. The goal of list scheduling is to minimize the number of layers (i.e., makespan), while constraining the number of jobs per layer to not exceed a predefined value w. Prevailing list scheduling algorithms are the HLF algorithm [114], Graham's list algorithm [124], the Coffman-Graham algorithm [82, 73], and the Critical Path Method [153]. (iii) Layered Graph Drawing: Layered graph drawing, or Sugiyama-style graph drawing [235], is a type of graph drawing in which the nodes of a directed graph are arranged and drawn in horizontal layers. Layered graph drawing asserts several aesthetic objectives, such as limited width (i.e., number of nodes per layer) and depth (i.e., number of layers). Common layered graph drawing algorithms are Longest Path, Coffman-Graham, and Gansner's ILP algorithm [116].

Combinatorial Graph Optimization. Examples of combinatorial graph optimization problems are travelling salesman [212], minimum cut [149], maximum cut [94], maximum directed cut [121], maximum k-cut [113, 222], Newman's modularity maximization [193], the chromatic number in graph coloring [256], etc. These algorithms lie at the interface of graph theory and combinatorial optimization, and have significant

applications in computer vision, machine learning, job shop scheduling, and network science. Examples are normalized minimum cut [227] for image segmentation, ratio-cut for spectral clustering [71], and Newman's modularity maximization for network community detection [194].

To solve these problems, some researchers focus on randomized algorithms. A milestone among these randomized algorithms is an SDP (Semi-Definite Programming) approximation for maximum directed cut, with approximation ratio 0.79607 [121]. Besides randomized algorithms with strict proofs of approximation bounds, researchers also propose advanced "heuristic" algorithms, such as Extremal Optimization [61] and Rubinstein's Cross-Entropy Method [90], which succeed in many real-world problems of combinatorial graph optimization.

Programming Language & Semantic Learning. The naturalness hypothesis spurs research at the intersection of machine learning, programming languages, and software engineering. The naturalness hypothesis states that software is a form of human communication; software corpora have similar statistical properties to natural language corpora; and these properties can be exploited to build better software engineering tools [41]. Researchers design learnable probabilistic models of source code [56] that exploit the abundance of patterns in code to accomplish challenging tasks such as code plagiarism or clone detection [257], automatic bug fixing & code patching [173, 128], and automatic code completion [40, 196].

3.10 Conclusions

In this chapter, we propose an algorithmic framework, Limelight+, for automated, scalable stack trace analysis. The major contributions of Limelight+ are threefold. First, Limelight+ partitions functions into a small number of layers. Layerization reveals computation hotspots at different scales, and ensures that computation and cycles are compared at a matched scale. Second, Limelight+ clusters functions that

implement similar or closely related computation, with inferred semantics, into "super functions" and re-attributes cycles to reveal the hotness of, and the relations between, these "super functions". Function clustering can succinctly reveal the structure of an enormous, complex software stack and extract design insights for datacenter architecture or hardware roadmaps. Third, Limelight+ has quadratic time complexity under reasonable assumptions, and its distributed, parallel implementation can scale to datacenter-scale stack traces that include tens of thousands of functions.

We apply Limelight+ to a pair of traces, and the results show that Limelight+ can translate the trace of an enormous software stack into a miniature diagram for efficient interpretation. The produced diagram succinctly explains workload structure and reveals computational hotspots at multiple granularities and scales for hardware acceleration or software re-optimization.

4

Conclusions

4.1 Conclusions

This dissertation proposes to use artificial intelligence for understanding the performance characteristics of large, complex datacenters, and presents two pieces of work to demonstrate the workflow and effectiveness of this approach. In the first work, we present Hound, a statistical machine learning framework that leverages causal inference and Bayesian probabilistic models to diagnose performance stragglers in datacenter computing. Hound offers interpretability, reliability, and scalability. We apply Hound to analyze a production Google datacenter and two experimental Amazon EC2 clusters, revealing straggler causes that are consistent with those from expert analysis and providing insights for future systems design and management. In the second work, we present Limelight+, an algorithmic framework that leverages graph theory and statistical semantic learning to extract design insights for datacenter architecture from enormous call stack traces. We apply Limelight+ to analyze call stack traces of a live, production datacenter. Results show that Limelight+ can

140 succinctly characterize the composition and structure of datacenter workload and, at multiple granularities and scales, reveals workload hotshots for hardware acceleration or software re-optimization. Our contribution is not limited to adapting artificial intelligence techniques for system problems. We extend theories to meet challenges raised by real system set- tings and build distributed engines to accomplish scalable performance modeling and analytics.
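As a toy illustration of the flavor of statistical dependence analysis that underlies straggler diagnosis in the first work (this is not Hound's actual method; the metric names and data below are fabricated for exposition), the following Python sketch ranks candidate causes by the absolute Spearman rank correlation between each per-task metric and task latency.

def ranks(xs):
    # Assign each value its rank; ties are broken by position, which is adequate here.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for pos, i in enumerate(order):
        r[i] = pos
    return r

def spearman(xs, ys):
    # Spearman correlation = Pearson correlation computed on ranks.
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

def rank_candidate_causes(metrics, latency):
    # metrics: metric name -> per-task samples; latency: per-task latencies.
    scores = {m: abs(spearman(v, latency)) for m, v in metrics.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

latency = [1.0, 1.2, 5.0, 1.1, 4.8]
metrics = {
    "cpu_steal": [0.1, 0.2, 0.9, 0.1, 0.8],  # rises with the slow tasks
    "disk_wait": [0.3, 0.3, 0.2, 0.4, 0.3],  # roughly unrelated to latency
}
print(rank_candidate_causes(metrics, latency))  # cpu_steal ranks first

A rank correlation is only a screening step; turning such dependences into causal explanations requires the kind of causal inference and probabilistic modeling summarized above.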

4.2 Future Work

The results presented in this dissertation are just the beginning of artificial intelligence for systems research. Each piece of work can be improved to varying degrees and extended in different directions. My future work will continue to bridge this gap. Specifically, I plan to carry out research in performance diagnosis and optimization for cloud platforms featuring serverless computing [100] and microservices [42]. Moreover, I plan to study machine learning systems from the perspectives of efficiency and economics.

4.2.1 Performance Diagnosis and Optimization in the Era of Serverless, Microservices, and Privacy

Serverless computing frees users from operational responsibility for infrastructure but introduces significant challenges for cloud providers. Operational analytics (e.g., performance debugging) and strategic management (e.g., resource scaling) are obscured by incomplete knowledge of black-box applications whose architecture and functionality are undisclosed and whose data traffic is encrypted due to user privacy concerns. Further, the microservice architecture compounds management difficulty, since a single, monolithic service expands into a number of inter-dependent, heterogeneous microservice modules. Operational decisions need to be made separately yet optimized jointly over a high-dimensional decision space to guarantee consistent, predictable overall performance. Moreover, microservices introduce cascading performance errors, and troubleshooting error propagation over a large dependency network is cumbersome and error-prone.

To meet these challenges, performance analysis and optimization of networked systems in the era of serverless, microservices, and privacy first requires prediction, explanation, and optimization of service performance with limited visibility into service internals. Second, it requires careful design of performance optimization procedures that scale up to massive-scale microservice deployments. Third, it requires rigorous treatment of complex interactions between microservices. I am interested in the following problems: (i) scalable automatic resource scaling for massive-scale microservice deployments, and (ii) inferring and diagnosing microservice performance with obfuscated and incomplete data. Recent advances in black-box optimization and latent-variable causal inference provide a fundamental approach to these problems, and such research offers compelling opportunities for multi-disciplinary interaction.

4.2.2 Machine Learning - Efficiency and Economics

Machine learning systems introduce domain-specific problems. First, improving DNN (Deep Neural Network) training efficiency (e.g., convergence rate) is challenging from many perspectives, such as device placement of DNN operators and straggler tolerance. Second, labeled training data has become a precious commodity in the artificial intelligence economy. The arduous human labor required for data annotation has led to the emergence of crowdsourcing-based machine learning, in which a complete training set is gathered from a number of community users, each of whom labels and shares a small amount of training data. This introduces challenges such as inaccurate labeling, data pricing, and sharing mechanisms.

I am interested in the following problems: (i) device placement optimization for efficient model-parallel training, and (ii) public economics and mechanism design for training-data crowdsourcing. Recent theoretical advances in approximate graph optimization and public goods allocation provide promising solutions to these problems.

Bibliography

[1] Leveldb benchmarks, 2011.

[2] Real-time analytics at facebook, 2011.

[3] A focus on efficiency: A whitepaper from facebook, ericsson and qualcomm, 2013.

[4] Methods and systems for determining use and content of pymk based on value model, 2014.

[5] Lessons learned from building and operating scuba, 2015.

[6] Accelerating rocksdb with nvme zoned ssds, 2016.

[7] Application-driven storage with open-channel ssds, 2016.

[8] Accelerating facebooks infrastructure with application-specific hardware, 2019.

[9] Amazon cloudwatch: Use cloudwatch to monitor your aws resources and the applications you run on aws in real time., 2019.

[10] Apache arrow: A cross-language development platform for in-memory data, 2019.

[11] , 2019.

[12] Azurewatch (cloudmonix), 2019.

[13] Big-ip webaccelerator: Accelerate web applications, improve user experience, and increase revenue, 2019.

[14] Big-ip webaccelerator: Accelerate web applications, improve user experience, and increase revenue, 2019.

[15] Caffe2: A new lightweight, modular, and scalable deep learning framework, 2019.

[16] Facebook open-source projects, 2019.

[17] Gregex 12.8gb/s, 2019.

[18] Loading pre-trained caffe2 models, 2019.

[19] Logdevice: a scalable and fault tolerant distributed log system, 2019.

[20] Lz4: Extremely fast compression, 2019.

[21] Mysql, 2019.

[22] nginx, 2019.

[23] Opcache adapter (psr-16), 2019.

[24] An optimized blas library, 2019.

[25] Oss performance, 2019.

[26] The paxym regex/pcre accelerator, 2019.

[27] Project fb303: The facebook bassline, 2019.

[28] Proxygen: Facebook’s c++ http libraries, 2019.

[29] Pytorch: An open source machine learning framework that accelerates the path from research prototyping to production deployment, 2019.

[30] Rocksdb: A persistent key-value store for fast storage environments, 2019.

[31] Rocksdb acceleration using noload csp, 2019.

[32] Rocksdb benchmarking tools, 2019.

[33] Rocksdb on open-channel ssds, 2019.

[34] siege, 2019.

[35] Snappy:a fast compressor/decompressor., 2019.

[36] Swarm64 database accelerator: Turn postgresql into a powerful open source data warehouse with swarm64 da and fpga, 2019.

[37] The uwsgi project, 2019.

[38] Lior Abraham, John Allen, Oleksandr Barykin, Vinayak Borkar, Bhuwan Chopra, Ciprian Gerea, Daniel Merl, Josh Metzler, David Reiss, Subbu Subra- manian, et al. Scuba: Diving into data at facebook. Proceedings of the VLDB Endowment, 6(11), 2013.

[39] Aditya Agarwal. Facebook architecture, 2008.

[40] Miltiadis Allamanis, Earl T. Barr, Christian Bird, and Charles Sutton. Sug- gesting accurate method and class names. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2015, pages 38– 49, New York, NY, USA, 2015. ACM.

[41] Miltiadis Allamanis, Earl T Barr, Premkumar Devanbu, and Charles Sutton. A survey of machine learning for big code and naturalness. ACM Computing Surveys (CSUR), 51(4):81, 2018.

[42] Nuha Alshuqayran, Nour Ali, and Roger Evans. A systematic mapping study in microservice architecture. In 2016 IEEE 9th International Conference on Service-Oriented Computing and Applications (SOCA), pages 44–51. IEEE, 2016.

[43] Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, and Ion Stoica. Ef- fective straggler mitigation: Attack of the clones. In Proceedings of the 10th USENIX Conference on Networked Systems Design and Implementation (NSDI ’13), pages 185–198, 2013.

[44] Ganesh Ananthanarayanan, Srikanth Kandula, Albert G Greenberg, Ion Sto- ica, Yi Lu, Bikas Saha, and Edward Harris. Reining in the outliers in map- reduce clusters using mantri. In 9th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’10), volume 10, page 24, 2010.

[45] J. Anderson, L. Berc, J. Dean, S. Ghemawat, M. Henzinger, S. Leung, R. Sites, M. Vandervoorde, C. Waldspurger, and W. Weihl. Continuous Profiling: Where have all the cycles gone? In Proc. Symposium on Operating Systems Principles (SOSP), 1997.

[46] Michael Armbrust, Reynold S Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K Bradley, Xiangrui Meng, Tomer Kaftan, Michael J Franklin, Ali Ghodsi, et al. Spark SQL: Relational data processing in Spark. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 1383–1394. ACM, 2015.

[47] E. F. Wolff B. Schweizer. On nonparametric measures of dependence for ran- dom variables. The Annals of Statistics, 9(4):879–885, 1981.

[48] L. Bai, Y. Zhao, and X. Huang. A cnn accelerator on fpga using depthwise separable convolution. IEEE Transactions on Circuits and Systems II: Express Briefs, 65(10):1415–1419, Oct 2018.

[49] Athula Balachandran, Vyas Sekar, Aditya Akella, Srinivasan Seshan, Ion Sto- ica, and Hui Zhang. Developing a predictive model of quality of experience for internet video. In ACM SIGCOMM Computer Communication Review (SIG- COMM’13), volume 43, pages 339–350. ACM, 2013.

[50] Elias Bareinboim and Judea Pearl. Controlling selection bias in causal infer- ence. In AAAI, 2011.

[51] Paul Barham, Austin Donnelly, Rebecca Isaacs, and Richard Mortier. Using magpie for request extraction and workload modelling. In 6th USENIX Sym- posium on Operating Systems Design and Implementation, volume 4, pages 18–18, 2004.

[52] Sergio Barrachina, Maribel Castillo, Francisco D Igual, Rafael Mayo, and En- rique S Quintana-Orti. Evaluation and tuning of the level 3 cublas for graphics processors. In 2008 IEEE International Symposium on Parallel and Distributed Processing, pages 1–8. IEEE, 2008.

[53] Giuseppe Di Battista, Peter Eades, Roberto Tamassia, and Ioannis G Tollis. Graph drawing: algorithms for the visualization of graphs. Prentice Hall PTR, 1998.

[54] Doug Beaver, Sanjeev Kumar, Harry C Li, Jason Sobel, Peter Vajgel, et al. Finding a needle in haystack: Facebook’s photo storage.

[55] Austin R Benson and Grey Ballard. A framework for practical parallel fast matrix multiplication. In ACM SIGPLAN Notices, volume 50, pages 42–53. ACM, 2015.

[56] Pavol Bielik, Veselin Raychev, and Martin Vechev. Phog: Probabilistic model for code. In Proceedings of the 33rd International Conference on Machine Learning - Volume 48, ICML'16, pages 2933–2942. JMLR.org, 2016.

[57] Goranka Bjedov. Managing capacity and performance in a large scale produc- tion environment, 2013.

[58] David M Blei. Probabilistic topic models. Communications of the ACM, 55(4):77–84, 2012.

[59] Peter Bodík, Moises Goldszmidt, and Armando Fox. Hilighter: Automatically building robust signatures of performance behavior for small- and large-scale systems. In Proceedings of the Third Conference on Tackling Computer Systems Problems with Machine Learning Techniques, pages 3–3, Berkeley, CA, USA, 2008. USENIX Association.

[60] Peter Bodik, Moises Goldszmidt, Armando Fox, Dawn B. Woodard, and Hans Andersen. Fingerprinting the datacenter: Automated classification of perfor- mance crises. In EuroSys 2010, pages 111–124, 2010.

[61] Stefan Boettcher and Allon G Percus. Extremal optimization for graph parti- tioning. Physical Review E, 64(2):026114, 2001.

[62] Stefan Boettcher and Allon G. Percus. Optimization with extremal dynamics. Phys. Rev. Lett., 86:5211–5214, Jun 2001.

[63] Edward Bortnikov, Ari Frank, Eshcar Hillel, and Sriram Rao. Predicting exe- cution bottlenecks in map-reduce clusters. In Proceedings of the 4th USENIX conference on Hot Topics in Cloud Ccomputing, pages 18–18. USENIX Asso- ciation, 2012.

[64] Leo Breiman. Bagging predictors. Machine learning, 24(2):123–140, 1996.

[65] Alex D Breslow, Dong Ping Zhang, Joseph L Greathouse, Nuwan Jayasena, and Dean M Tullsen. Horton tables: Fast hash tables for in-memory data-intensive computing. In 2016 {USENIX} Annual Technical Conference ({USENIX}{ATC} 16), pages 281–294, 2016.

[66] Carlos Brito and Judea Pearl. Graphical condition for identification in recursive sem. arXiv preprint: 1206.6821, 2012.

[67] Nathan Bronson, Zach Amsden, George Cabrera, Prasad Chakka, Peter Dimov, Hui Ding, Jack Ferris, Anthony Giardullo, Sachin Kulkarni, Harry Li, et al. TAO: Facebook's distributed data store for the social graph. In Presented as part of the 2013 USENIX Annual Technical Conference (USENIX ATC 13), pages 49–60, 2013.

[68] Yannick Brosseau. Using tracing at facebook scale, 2014.

[69] Joaquin Quinonero Candela. Building scalable systems to understand content, 2017.

[70] Adrian M Caulfield, Eric S Chung, Andrew Putnam, Hari Angepat, Jeremy Fowers, Michael Haselman, Stephen Heil, Matt Humphrey, Puneet Kaur, Joo- Young Kim, et al. A cloud-scale acceleration architecture. In The 49th An- nual IEEE/ACM International Symposium on Microarchitecture, page 7. IEEE Press, 2016.

[71] Pak K Chan, Martine DF Schlag, and Jason Y Zien. Spectral k-way ratio-cut partitioning and clustering. IEEE Transactions on computer-aided design of integrated circuits and systems, 13(9):1088–1096, 1994.

[72] Philip K. Chan and Salvatore J. Stolfo. Experiments on multistrategy learning by meta-learning. In CIKM ’93, pages 314–323, 1993.

[73] Marc Chardon and Aziz Moukrim. The coffman–graham algorithm optimally solves uet task systems with overinterval orders. SIAM Journal on Discrete Mathematics, 19(1):109–121, 2005.

[74] Guoqiang Jerry Chen, Janet L Wiener, Shridhar Iyer, Anshul Jaiswal, Ran Lei, Nikhil Simha, Wei Wang, Kevin Wilfong, Tim Williamson, and Serhat Yilmaz. Realtime data processing at facebook. In Proceedings of the 2016 International Conference on Management of Data, pages 1087–1098. ACM, 2016.

[75] Mike Y Chen, Emre Kiciman, Eugene Fratkin, Armando Fox, and Eric Brewer. Pinpoint: Problem determination in large, dynamic internet services. In De- pendable Systems and Networks, 2002. DSN 2002. Proceedings. International Conference on, pages 595–604. IEEE, 2002.

[76] P. Chen, Y. Qi, P. Zheng, and D. Hou. Causeinfer: Automatic and distributed performance diagnosis with hierarchical causality graph in large distributed systems. In IEEE INFOCOM 2014 - IEEE Conference on Computer Commu- nications, pages 1887–1895, April 2014.

[77] P. Chen, Y. Qi, P. Zheng, J. Zhan, and Y. Wu. Multi-scale entropy: One metric of software aging. In SOSE 2013 - IEEE Seventh International Symposium on Service-Oriented System Engineering, pages 162–169, March 2013.

[78] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, et al. Tvm: An automated end-to-end optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 578–594, 2018.

[79] David Cheriton, Amin Firoozshahian, Alex Solomatnikov, John P Stevenson, and Omid Azizi. Hicamp: architectural support for efficient concurrency-safe shared structured data access. In ACM SIGPLAN Notices, volume 47, pages 287–300. ACM, 2012.

[80] Michael Chow, David Meisner, Jason Flinn, Daniel Peek, and Thomas F Wenisch. The mystery machine: End-to-end performance analysis of large- scale internet services. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’14), pages 217–231, 2014.

[81] Joel Coburn, Adrian M. Caulfield, Ameen Akel, Laura M. Grupp, Rajesh K. Gupta, Ranjit Jhala, and Steven Swanson. Nv-heaps: Making persistent ob- jects fast and safe with next-generation, non-volatile memories. SIGPLAN Not., 46(3):105–118, March 2011.

[82] Edward G Coffman and Ronald L Graham. Optimal scheduling for two- processor systems. Acta informatica, 1(3):200–213, 1972.

[83] Ira Cohen, Moises Goldszmidt, Terence Kelly, Julie Symons, and Jeffrey S. Chase. Correlating instrumentation data to system states: A building block for automated diagnosis and control. In 6th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’04), pages 16–16, 2004.

[84] Ira Cohen, Steve Zhang, Moises Goldszmidt, Julie Symons, Terence Kelly, and Armando Fox. Capturing, indexing, clustering, and retrieving system history. In 20th ACM Symposium on Operating Systems Principles (SOSP ’05), pages 105–118, 2005.

[85] Jeremy Condit, Edmund B Nightingale, Christopher Frost, Engin Ipek, Ben- jamin Lee, Doug Burger, and Derrick Coetzee. Better i/o through byte- addressable, persistent memory. In Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles, pages 133–146. ACM, 2009.

[86] Gregory F. Cooper. The computational complexity of probabilistic inference using bayesian belief networks (research note). Artif. Intell., 42(2-3):393–405, March 1990.

[87] Henggang Cui, James Cipar, Qirong Ho, Jin Kyu Kim, Seunghak Lee, Abhimanu Kumar, Jinliang Wei, Wei Dai, Gregory R. Ganger, Phillip B. Gibbons, Garth A. Gibson, and Eric P. Xing. Exploiting bounded staleness to speed up big data analytics. In Proceedings of the 2014 USENIX Conference on USENIX Annual Technical Conference, USENIX ATC'14, pages 37–48, Berkeley, CA, USA, 2014. USENIX Association.

[88] Michael Curtiss, Iain Becker, Tudor Bosman, Sergey Doroshenko, Lucian Gri- jincu, Tom Jackson, Sandhya Kunnatur, Soren Lassen, Philip Pronin, Sriram Sankar, Guanghao Shen, Gintaras Woss, Chao Yang, and Ning Zhang. Unicorn: A system for searching the social graph. Proc. VLDB Endow., 6(11):1150–1161, August 2013.

[89] Longquan Dai, Liang Tang, Yuan Xie, and Jinhui Tang. Designing by training: acceleration neural network for fast high-dimensional convolution. In Advances in Neural Information Processing Systems, pages 1466–1475, 2018.

[90] Pieter-Tjerk De Boer, Dirk P Kroese, Shie Mannor, and Reuven Y Rubin- stein. A tutorial on the cross-entropy method. Annals of operations research, 134(1):19–67, 2005.

[91] Suzana de Siqueira Santos, Daniel Yasumasa Takahashi, Asuka Nakata, and André Fujita. A comparative study of statistical methods used to identify dependencies between gene expression signals. Briefings in bioinformatics, page 051, 2013.

[92] Jeffrey Dean and Luiz André Barroso. The tail at scale. Communications of the ACM, 56(2):74–80, 2013.

[93] Jeffrey Dean and Sanjay Ghemawat. Mapreduce: Simplified data processing on large clusters. CACM, 51(1):107–113, January 2008.

[94] Charles Delorme and Svatopluk Poljak. Laplacian eigenvalues and the maxi- mum cut problem. Mathematical Programming, 62(1-3):557–574, 1993.

[95] Luc Devroye and Pat Morin. Cuckoo hashing: further analysis. Information Processing Letters, 86(4):215–219, 2003.

[96] Carsten F Dormann, Jane Elith, Sven Bacher, Carsten Buchmann, Gudrun Carl, Gabriel Carré, Jaime R García Marquéz, Bernd Gruber, Bruno Lafourcade, Pedro J Leitão, et al. Collinearity: a review of methods to deal with it and a simulation study evaluating their performance. Ecography, 36(1):27–46, 2013.

[97] Filipa Duarte and Stephan Wong. A memcpy hardware accelerator solution for non cache-line aligned copies. In 2007 IEEE International Conf. on Application-specific Systems, Architectures and Processors (ASAP), pages 397–402. IEEE, 2007.

[98] Filipa Duarte and Stephan Wong. Cache-based memory copy hardware accel- erator for multicore systems. IEEE Transactions on Computers, 59(11):1494– 1507, 2010.

[99] Jordi Duch and Alex Arenas. Community detection in complex networks using extremal optimization. Physical review E, 72(2):027104, 2005.

[100] A. Eivy. Be wary of the economics of ”serverless” cloud computing. IEEE Cloud Computing, 4(2):6–12, March 2017.

[101] Y. Elkhatib, G. Tyson, and M. Welzl. Can spdy really make the web faster? In 2014 IFIP Networking Conference, pages 1–9, June 2014.

[102] Haggai Eran, Lior Zeno, Maroun Tork, Gabi Malka, and Mark Silberstein. {NICA}: An infrastructure for inline acceleration of network applications. In 2019 {USENIX} Annual Technical Conference ({USENIX}{ATC} 19), pages 345–362, 2019.

[103] Jason Evans. A scalable concurrent malloc(3) implementation for FreeBSD. 2006.

[104] Jason Evans. jemalloc background, 2017.

[105] Bin Fan, David G. Andersen, and Michael Kaminsky. Memc3: Compact and concurrent memcache with dumber caching and smarter hashing. In Presented as part of the 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI 13), pages 371–384, Lombard, IL, 2013. USENIX.

[106] Yuanwei Fang, Chen Zou, and Andrew A. Chien. Accelerating raw data analy- sis with the accorda software and hardware architecture. Proc. VLDB Endow., 12(11):1568–1582, July 2019.

[107] Daniel Firestone, Andrew Putnam, Sambhrama Mundkur, Derek Chiou, Alireza Dabagh, Mike Andrewartha, Hari Angepat, Vivek Bhanu, Adrian Caulfield, Eric Chung, et al. Azure accelerated networking: Smartnics in the public cloud. In 15th {USENIX} Symposium on Networked Systems Design and Implementation ({NSDI} 18), pages 51–66, 2018.

[108] Rodrigo Fonseca, George Porter, Randy H Katz, Scott Shenker, and Ion Stoica. X-trace: A pervasive network tracing framework. In Proceedings of the 4th USENIX conference on Networked systems design & implementation, pages 20–32. USENIX Association, 2007.

[109] François Fouss, Kevin Francoisse, Luh Yen, Alain Pirotte, and Marco Saerens. An experimental investigation of kernels on graphs for collaborative recommendation and semisupervised classification. Neural Networks, 31:53–72, 2012.

[110] J. Fowers, J. Kim, D. Burger, and S. Hauck. A scalable high-bandwidth ar- chitecture for lossless compression on fpgas. In 2015 IEEE 23rd Annual In- ternational Symposium on Field-Programmable Custom Computing Machines, pages 52–59, May 2015.

[111] Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on- line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, August 1997.

[112] Brendan J Frey and Delbert Dueck. Clustering by passing messages between data points. science, 315(5814):972–976, 2007.

[113] Alan Frieze and Mark Jerrum. Improved approximation algorithms for maxk- cut and max bisection. Algorithmica, 18(1):67–81, 1997.

[114] Harold N Gabow. An almost-linear algorithm for two-processor scheduling. Journal of the ACM (JACM), 29(3):766–780, 1982.

[115] Tanmay Gangwani, Adam Morrison, and Josep Torrellas. Caspar: Breaking serialization in lock-free multicore synchronization. SIGPLAN Not., 51(4):789– 804, March 2016.

[116] Emden R Gansner, Eleftherios Koutsofios, Stephen C North, and K-P Vo. A technique for drawing directed graphs. IEEE Transactions on Software Engi- neering, 19(3):214–230, 1993.

[117] Michael R Garey, David S Johnson, and Ravi Sethi. The complexity of flowshop and jobshop scheduling. Mathematics of operations research, 1(2):117–129, 1976.

[118] Pawe lGawrychowski and Tomasz Kociumaka. Sparse suffix tree construction in optimal time and space. In Proceedings of the Twenty-Eighth Annual ACM- SIAM Symposium on Discrete Algorithms, SODA ’17, pages 425–439, Philadel- phia, PA, USA, 2017. Society for Industrial and Applied Mathematics.

153 [119] S. Ghemawat, H. Gobioff, and S. Leung. The Google file system. In Proc. Symposium on Operating Systems Principles (SOSP ’03), 2003.

[120] F. Glaser, G. Haugou, D. Rossi, Q. Huang, and L. Benini. Hardware- accelerated energy-efficient synchronization and communication for ultra-low- power tightly coupled clusters. In 2019 Design, Automation Test in Europe Conference Exhibition (DATE), pages 552–557, March 2019.

[121] Michel X Goemans and David P Williamson. Improved approximation algo- rithms for maximum cut and satisfiability problems using semidefinite pro- gramming. Journal of the ACM (JACM), 42(6):1115–1145, 1995.

[122] Vaibhav Gogte, Aasheesh Kolli, Michael J Cafarella, Loris D’Antoni, and Thomas F Wenisch. Hare: Hardware accelerator for regular expressions. In The 49th Annual IEEE/ACM International Symposium on Microarchitecture, page 44. IEEE Press, 2016.

[123] Dibakar Gope, David J Schlais, and Mikko H Lipasti. Architectural support for server-side processing. In 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), pages 507–520. IEEE, 2017.

[124] Ronald L Graham. Bounds on multiprocessing anomalies and related packing algorithms. In Proceedings of the May 16-18, 1972, spring joint computer conference, pages 205–217. ACM, 1972.

[125] S. Graham, P. Kessler, and M. McKusick. Gprof: A call graph execution profiler. In Proc. Symposium on Compiler Construction (CC), 1982.

[126] David Grove, Greg DeFouw, Jeffrey Dean, and Craig Chambers. Call graph construction in object-oriented languages. In Proceedings of the 12th ACM SIGPLAN Conference on Object-oriented Programming, Systems, Languages, and Applications, OOPSLA ’97, pages 108–124, New York, NY, USA, 1997. ACM.

[127] Intel Corporation. Intel® 64 and IA-32 architectures software developer's manual. Volume 3B: System Programming Guide, Part 2, 2011.

[128] Rahul Gupta, Soham Pal, Aditya Kanade, and Shirish Shevade. Deepfix: Fix- ing common c language errors by deep learning. In Thirty-First AAAI Con- ference on Artificial Intelligence, 2017.

[129] Bernhard Haeupler, Aviad Rubinstein, and Amirbehshad Shahrasbi. Near-linear time insertion-deletion codes and (1+ε)-approximating edit distance via indexing. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019, pages 697–708, New York, NY, USA, 2019. ACM.

[130] R. Hameed, W. Qadeer, M. Wachs, O. Azizi, A. Solomatnikov, B. Lee, S. Richardson, C. Kozyrakis, and M. Horowitz. Understanding sources of inefficiency in general-purpose chips. In Proc. International Symposium on Computer Architecture (ISCA), 2010.

[131] Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A Horowitz, and William J Dally. Eie: efficient inference engine on compressed deep neural network. In 2016 ACM/IEEE 43rd Annual International Sympo- sium on Computer Architecture (ISCA), pages 243–254. IEEE, 2016.

[132] Kim Hazelwood, Sarah Bird, David Brooks, Soumith Chintala, Utku Diril, Dmytro Dzhulgakov, Mohamed Fawzy, Bill Jia, Yangqing Jia, Aditya Kalro, et al. Applied machine learning at facebook: A datacenter infrastructure per- spective. In 2018 IEEE International Symposium on High Performance Com- puter Architecture (HPCA), pages 620–629. IEEE, 2018.

[133] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learn- ing for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.

[134] Matthew Hoffman, Francis R. Bach, and David M. Blei. Online learning for latent dirichlet allocation. In J. D. Lafferty, C. K. I. Williams, J. Shawe- Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 856–864. Curran Associates, Inc., 2010.

[135] Kevin Hsieh, Samira Khan, Nandita Vijaykumar, Kevin K Chang, Amirali Boroumand, Saugata Ghose, and Onur Mutlu. Accelerating pointer chasing in 3d-stacked memory: Challenges, mechanisms, evaluation. In 2016 IEEE 34th International Conference on Computer Design (ICCD), pages 25–32. IEEE, 2016.

[136] Gui Huang, Xuntao Cheng, Jianying Wang, Yujie Wang, Dengcheng He, Tiey- ing Zhang, Feifei Li, Sheng Wang, Wei Cao, and Qiang Li. X-engine: An optimized storage engine for large-scale e-commerce transaction processing. In Proceedings of the 2019 International Conference on Management of Data, pages 651–665. ACM, 2019.

[137] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360, 2016.

[138] B. Ilbeyi, C. F. Bolz-Tereick, and C. Batten. Cross-layer workload character- ization of meta-tracing jit vms. In 2017 IEEE International Symposium on Workload Characterization (IISWC), pages 97–107, Oct 2017.

[139] Berkin Ilbeyi. Co-Optimizing Hardware Design and Meta-Tracing Just-in-Time Compilation. PhD thesis, Cornell University, 2019.

[140] William Jannen, Jun Yuan, Yang Zhan, Amogh Akshintala, John Esmet, Yizheng Jiao, Ankur Mittal, Prashant Pandey, Phaneendra Reddy, Leif Walsh, Michael Bender, Martin Farach-Colton, Rob Johnson, Bradley C. Kuszmaul, and Donald E. Porter. Betrfs: A right-optimized write-optimized file system. In 13th USENIX Conference on File and Storage Technologies (FAST 15), pages 301–315, Santa Clara, CA, February 2015. USENIX Association.

[141] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM in- ternational conference on Multimedia, pages 675–678. ACM, 2014.

[142] N. Jouppi, C. Young, N. Patil, D. Patterson, et al. In-datacenter performance analysis of a tensor processing unit. In Proc. International Symposium on Computer Architecture (ISCA), 2017.

[143] Cengiz Kahraman, Orhan Engin, İhsan Kaya, and R. Elif Öztürk. Multiprocessor task scheduling in multistage hybrid flow-shops: a parallel greedy algorithm approach. Applied Soft Computing, 10(4):1293–1300, 2010.

[144] Jonathan Kaldor, Jonathan Mace, Michał Bejda, Edison Gao, Wiktor Kuropatwa, Joe O'Neill, Kian Win Ong, Bill Schaller, Pingjia Shan, Brendan Viscomi, et al. Canopy: An end-to-end performance tracing and analysis system. In Proceedings of the 26th Symposium on Operating Systems Principles, pages 34–50. ACM, 2017.

[145] Jaz Kandola, Nello Cristianini, and John S Shawe-Taylor. Learning semantic similarity. In Advances in neural information processing systems, pages 673– 680, 2003.

[146] Svilen Kanev, Juan Pablo Darago, Kim Hazelwood, Parthasarathy Ranganathan, Tipp Moseley, Gu-Yeon Wei, and David Brooks. Profiling a warehouse-scale computer. In 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[147] Svilen Kanev, Sam Likun Xi, Gu-Yeon Wei, and David Brooks. Mallacc: Accelerating memory allocation. ACM SIGOPS Operating Systems Review, 51(2):33–45, 2017.

[148] Sudarsun Kannan, Nitish Bhat, Ada Gavrilovska, Andrea Arpaci-Dusseau, and Remzi Arpaci-Dusseau. Redesigning lsms for nonvolatile memory with novelsm. In 2018 USENIX Annual Technical Conference (USENIX ATC’18), pages 993– 1005, 2018.

[149] David R Karger and Clifford Stein. A new approach to the minimum cut problem. Journal of the ACM (JACM), 43(4):601–640, 1996.

[150] Antoine Kaufmann, Tim Stamler, Simon Peter, Naveen Kr. Sharma, Arvind Krishnamurthy, and Thomas Anderson. Tas: Tcp acceleration as an os service. In Proceedings of the Fourteenth EuroSys Conference 2019, EuroSys ’19, pages 24:1–24:16, New York, NY, USA, 2019. ACM.

[151] Daehyeok Kim, Amirsaman Memaripour, Anirudh Badam, Yibo Zhu, Hongqiang Harry Liu, Jitu Padhye, Shachar Raindel, Steven Swanson, Vyas Sekar, and Srinivasan Seshan. Hyperloop: group-based nic-offloading to accel- erate replicated transactions in multi-tenant storage systems. In Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Commu- nication, pages 297–312. ACM, 2018.

[152] Kangnyeon Kim, Ryan Johnson, and Ippokratis Pandis. Bionicdb: Fast and power-efficient oltp on fpga. In Proceedings of the 22nd International Confer- ence on Extending Database Technology (EDBT), pages 301–312, 2019.

[153] Walter H. Kohler. A preliminary evaluation of the critical path method for scheduling tasks on multiprocessor systems. IEEE Transactions on Computers, 100(12):1235–1238, 1975.

[154] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.

[155] Sailesh Kumar, Sarang Dharmapurikar, Fang Yu, Patrick Crowley, and Jonathan Turner. Algorithms to accelerate multiple regular expressions match- ing for deep packet inspection. In Proceedings of the 2006 Conference on Ap- plications, Technologies, Architectures, and Protocols for Computer Communi- cations, SIGCOMM ’06, pages 339–350, New York, NY, USA, 2006. ACM.

[156] Snehasish Kumar, Arrvindh Shriraman, Vijayalakshmi Srinivasan, Dan Lin, and Jordon Phillips. Sqrl: hardware accelerator for collecting software data structures. In Proceedings of the 23rd international conference on Parallel architectures and compilation, pages 475–476. ACM, 2014.

[157] Andreas Kunft, Lukas Stadler, Daniele Bonetta, Cosmin Basca, Jens Meiners, Sebastian Breß, Tilmann Rabl, Juan Fumero, and Volker Markl. Scootr: Scal- ing r dataframes on dataflow systems. In Proceedings of the ACM Symposium on Cloud Computing, pages 288–300. ACM, 2018.

[158] YongChul Kwon, Magdalena Balazinska, Bill Howe, and Jerome Rolia. Skew- resistant parallel processing of feature-extracting scientific user-defined func- tions. In Proceedings of the 1st ACM symposium on Cloud computing, pages 75–86. ACM, 2010.

[159] YongChul Kwon, Magdalena Balazinska, Bill Howe, and Jerome Rolia. A study of skew in mapreduce applications. In The 5th International Open Cirrus Summit, 2011.

[160] YongChul Kwon, Magdalena Balazinska, Bill Howe, and Jerome Rolia. Skew- tune: Mitigating skew in mapreduce applications. In SIGMOD ’12, pages 25–36, 2012.

[161] Avinash Lakshman and Prashant Malik. Cassandra: a decentralized structured storage system. ACM SIGOPS Operating Systems Review, 44(2):35–40, 2010.

[162] Yann Lecun and Yoshua Bengio. Convolutional networks for images, speech, and time-series. In The handbook of brain theory and neural networks. MIT Press, 1995.

[163] Eunji Lee, Hyokyung Bahn, and Sam H. Noh. Unioning of the buffer cache and journaling layers with non-volatile memory. In Presented as part of the 11th USENIX Conference on File and Storage Technologies (FAST 13), pages 73–80, San Jose, CA, 2013. USENIX.

[164] Wongun Lee, Keonwoo Lee, Hankeun Son, Wook-Hee Kim, Beomseok Nam, and Youjip Won. WALDIO: Eliminating the filesystem journaling in resolving the journaling of journal anomaly. In 2015 USENIX Annual Technical Confer- ence (USENIX ATC 15), pages 235–247, Santa Clara, CA, July 2015. USENIX Association.

[165] Ran Leibman. Monitoring at facebook, 2015.

158 [166] Jialin Li, Naveen Kr. Sharma, Dan R. K. Ports, and Steven D. Gribble. Tales of the tail: Hardware, os, and application-level sources of tail latency. In SOCC ’14, pages 9:1–9:14, 2014.

[167] Yongkun Li, Chengjin Tian, Fan Guo, Cheng Li, and Yinlong Xu. Elas- ticbf: elastic bloom filter with hotness awareness for boosting read performance in large key-value stores. In 2019 {USENIX} Annual Technical Conference ({USENIX}{ATC} 19), pages 739–752, 2019.

[168] Yuanfang Li and Ardavan Pedram. Caterpillar: Coarse grain reconfigurable architecture for accelerating the training of deep neural networks. In 2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP), pages 1–10. IEEE, 2017.

[169] D. Lindquist, H. Madduri, C. J. Paul, and B. Rajaraman. Ibm service man- agement architecture. IBM Systems Journal, 46(3):423–440, 2007.

[170] Ming Liu, Tianyi Cui, Henry Schuh, Arvind Krishnamurthy, Simon Peter, and Karan Gupta. Offloading distributed applications onto smartnics using ipipe. In Proceedings of the ACM Special Interest Group on Data Communication, pages 318–333. ACM, 2019.

[171] Ye Liu, Shinpei Kato, and Masato Edahiro. Is the heap manager important to many cores? In Proceedings of the 8th International Workshop on Runtime and Operating Systems for Supercomputers, ROSS’18, pages 5:1–5:6, New York, NY, USA, 2018. ACM.

[172] Zhiyu Liu, Irina Calciu, Maurice Herlihy, and Onur Mutlu. Concurrent data structures for near-memory computing. In Proceedings of the 29th ACM Sym- posium on Parallelism in Algorithms and Architectures, pages 235–245. ACM, 2017.

[173] Fan Long and Martin Rinard. Automatic patch generation by learning correct code. In ACM SIGPLAN Notices, volume 51, pages 298–312. ACM, 2016.

[174] David Lopez-Paz, Philipp Hennig, and Bernhard Sch¨olkopf. The randomized dependence coefficient. In Advances in neural information processing systems, pages 1–9, 2013.

[175] Jared K Lunceford and Marie Davidian. Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Statistics in medicine, 23(19):2937–2960, 2004.

[176] Yufei Ma, Minkyu Kim, Yu Cao, Sarma Vrudhula, and Jae-sun Seo. End-to-end scalable fpga accelerator for deep residual networks. In 2017 IEEE International Symposium on Circuits and Systems (ISCAS), pages 1–4. IEEE, 2017.

[177] Jonathan Mace, Ryan Roelke, and Rodrigo Fonseca. Pivot tracing: dynamic causal monitoring for distributed systems. In 25th ACM Symposium on Operating Systems Principles (SOSP '15), pages 378–393, 2015.

[178] Jonathan Mace, Ryan Roelke, and Rodrigo Fonseca. Pivot tracing: Dynamic causal monitoring for distributed systems. ACM Transactions on Computer Systems (TOCS), 35(4):11, 2018.

[179] Divya Mahajan, Joon Kyung Kim, Jacob Sacks, Adel Ardalan, Arun Kumar, and Hadi Esmaeilzadeh. In-rdbms hardware acceleration of advanced analytics. Proc. VLDB Endow., 11(11):1317–1331, July 2018.

[180] Ajay Anil Mahimkar, Zihui Ge, Aman Shaikh, Jia Wang, Jennifer Yates, Yin Zhang, and Qi Zhao. Towards automated performance diagnosis in a large iptv network. In Proceedings of the ACM SIGCOMM 2009 Conference on Data Communication, SIGCOMM ’09, pages 231–242, New York, NY, USA, 2009. ACM.

[181] Matthew L Massie, Brent N Chun, and David E Culler. The ganglia dis- tributed monitoring system: design, implementation, and experience. Parallel Computing, 30(7):817–840, 2004.

[182] Yoshinori Matsunobu. Myrocks: A space- and write-optimized database, 2016.

[183] Carl Mela and Praveen Kopalle. The impact of collinearity on regression anal- ysis: the asymmetric effect of negative and positive correlations. Applied Eco- nomics, 34(6):667–677, 2002.

[184] Maged M. Michael. Scalable lock-free dynamic memory allocation. In Pro- ceedings of the ACM SIGPLAN 2004 Conference on Programming Language Design and Implementation, PLDI ’04, pages 35–46, New York, NY, USA, 2004. ACM.

[185] Abhishek Mitra, Walid Najjar, and Laxmi Bhuyan. Compiling pcre to fpga for accelerating snort ids. In Proceedings of the 3rd ACM/IEEE Symposium on Architecture for Networking and Communications Systems, ANCS ’07, pages 127–136, New York, NY, USA, 2007. ACM.

[186] Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I Jordan, et al. Ray: A distributed framework for emerging AI applications. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 561–577, 2018.

[187] Nurit Moscovici, Nachshon Cohen, and Erez Petrank. A gpu-friendly skiplist algorithm. In 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT), pages 246–259. Ieee, 2017.

[188] Jesús Muñoz and Ángel M Felicísimo. Comparison of statistical methods commonly used in predictive modelling. Journal of Vegetation Science, 15(2):285–292, 2004.

[189] Subramanian Muralidhar, Wyatt Lloyd, Sabyasachi Roy, Cory Hill, Ernest Lin, Weiwen Liu, Satadru Pan, Shiva Shankar, Viswanath Sivakumar, Linpeng Tang, et al. f4: Facebook’s warm {BLOB} storage system. In 11th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 14), pages 383–398, 2014.

[190] Karthik Nagaraj, Charles Killian, and Jennifer Neville. Structured comparative analysis of systems logs to diagnose performance problems. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (NSDI ’12), pages 353–366, 2012.

[191] Mirco Nanni. Speeding-up hierarchical agglomerative clustering in presence of expensive metrics. In PAKDD ’05, pages 378–387. Springer, 2005.

[192] Roger B Nelsen. An Introduction to Copulas. Springer Science & Business Media, 2007.

[193] Mark EJ Newman. Modularity and community structure in networks. Pro- ceedings of the national academy of sciences, 103(23):8577–8582, 2006.

[194] Mark EJ Newman. Equivalence between modularity optimization and max- imum likelihood methods for community detection. Physical Review E, 94(5):052315, 2016.

[195] Andrew Y Ng, Michael I Jordan, and Yair Weiss. On spectral clustering: Anal- ysis and an algorithm. In Advances in neural information processing systems, pages 849–856, 2002.

[196] Anh Tuan Nguyen, Michael Hilton, Mihai Codoban, Hoan Anh Nguyen, Lily Mast, Eli Rademacher, Tien N. Nguyen, and Danny Dig. Api code recommendation using statistical learning from fine-grained changes. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2016, pages 511–522, New York, NY, USA, 2016. ACM.

[197] Rajesh Nishtala, Hans Fugal, Steven Grimm, Marc Kwiatkowski, Herman Lee, Harry C Li, Ryan McElroy, Mike Paleczny, Daniel Peek, Paul Saab, et al. Scaling memcache at facebook. In Presented as part of the 10th {USENIX} Symposium on Networked Systems Design and Implementation ({NSDI} 13), pages 385–398, 2013.

[198] Sebastian Ordyniak and Stefan Szeider. Algorithms and complexity results for exact bayesian structure learning. In UAI 2010, 2010.

[199] Raghunath Othayoth and Meikel Poess. The making of tpc-ds. In Proceedings of the International Conference on Very Large Data Bases, volume 32, page 1049, 2006.

[200] Guilherme Ottoni. Hhvm jit: A profile-guided, region-based compiler for php and hack. In Proceedings of the 39th ACM SIGPLAN Conference on Program- ming Language Design and Implementation, PLDI 2018, pages 151–165, New York, NY, USA, 2018. ACM.

[201] Guilherme Ottoni and Bertrand Maher. Optimizing function placement for large-scale data-center applications. In Proceedings of the 2017 International Symposium on Code Generation and Optimization, pages 233–244. IEEE Press, 2017.

[202] Kay Ousterhout, Ryan Rasti, Sylvia Ratnasamy, Scott Shenker, and Byung- Gon Chun. Making sense of performance in data analytics frameworks. In 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI 15), pages 293–307. USENIX Association, 2015.

[203] Kay Ousterhout, Ryan Rasti, Sylvia Ratnasamy, Scott Shenker, and Byung- Gon Chun. Making sense of performance in data analytics frameworks. In 12th {USENIX} Symposium on Networked Systems Design and Implementation ({NSDI} 15), pages 293–307, 2015.

[204] Patrick O'Neil, Edward Cheng, Dieter Gawlick, and Elizabeth O'Neil. The log-structured merge-tree (lsm-tree). Acta Informatica, 33(4):351–385, 1996.

[205] Tuomas Pelkonen, Scott Franklin, Justin Teller, Paul Cavallaro, Qi Huang, Justin Meza, and Kaushik Veeraraghavan. Gorilla: A fast, scalable, in-memory time series database. Proceedings of the VLDB Endowment, 8(12):1816–1827, 2015.

[206] Barnabás Póczos, Zoubin Ghahramani, and Jeff G Schneider. Copula-based kernel dependency measures. In ICML '12, pages 775–782, 2012.

[207] Nikita Popov, Biagio Cosenza, Ben Juurlink, and Dmitry Stogov. Static op- timization in php 7. In Proceedings of the 26th International Conference on Compiler Construction, CC 2017, pages 65–75, New York, NY, USA, 2017. ACM.

[208] Daryl Pregibon. Resistant fits for some commonly used logistic models with medical applications. Biometrics, pages 485–498, 1982.

[209] Kishore Pusukuri, Rob Gardner, and Jared Smolens. An implementation of fast memset () using hardware accelerators. In HPDC 18, pages 3–1, 2018.

[210] Wajahat Qadeer, Rehan Hameed, Ofer Shacham, Preethi Venkatesan, Christos Kozyrakis, and Mark A. Horowitz. Convolution engine: Balancing efficiency & flexibility in specialized computing. SIGARCH Comput. Archit. News, 41(3):24–35, June 2013.

[211] M. Raoufi, Q. Deng, Y. Zhang, and J. Yang. Pagecmp: Bandwidth efficient page deduplication through in-memory page comparison. In 2019 IEEE Com- puter Society Annual Symposium on VLSI (ISVLSI), pages 82–87, July 2019.

[212] Gerhard Reinelt. The traveling salesman: computational solutions for TSP applications. Springer-Verlag, 1994.

[213] Charles Reiss and John Wilkes. Google cluster-usage traces: format+ schema. Technical Report, 2011.

[214] Charles Reiss, John Wilkes, and Joseph L Hellerstein. Obfuscatory obscantur- ism: making workload traces of commercially-sensitive systems safe to release. In Network Operations and Management Symposium (NOMS), 2012 IEEE, pages 1279–1286. IEEE, 2012.

[215] Gang Ren, Eric Tune, Tipp Moseley, Yixin Shi, Silvius Rus, and Robert Hundt. Google-wide profiling: A continuous profiling infrastructure for data centers. IEEE Micro, (4):65–79, 2010.

163 [216] Mengye Ren, Andrei Pokrovsky, Bin Yang, and Raquel Urtasun. Sbnet: Sparse blocks network for fast inference. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[217] Xiaoqi Ren, Ganesh Ananthanarayanan, Adam Wierman, and Minlan Yu. Hopper: Decentralized speculation-aware cluster scheduling at scale. In SIG- COMM ’15, pages 379–392, 2015.

[218] Alfréd Rényi. On measures of dependence. Acta mathematica hungarica, 10(3-4):441–451, 1959.

[219] Irina Rish, Mark Brodie, Sheng Ma, Natalia Odintsova, Alina Beygelzimer, Genady Grabarnik, and Karina Hernandez. Adaptive diagnosis in distributed systems. IEEE Transactions on neural networks, 16(5):1088–1109, 2005.

[220] Paul R Rosenbaum and Donald B Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.

[221] Raja R. Sambasivan, Alice X. Zheng, Michael De Rosa, Elie Krevat, Spencer Whitman, Michael Stroucken, William Wang, Lianghong Xu, and Gregory R. Ganger. Diagnosing performance changes by comparing request flows. In Pro- ceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, NSDI’11, pages 43–56, Berkeley, CA, USA, 2011. USENIX Association.

[222] Nasim Samei. Local search approximation algorithms for clustering problems. 2019.

[223] F. Schuiki, M. Schaffner, and L. Benini. Ntx: An energy-efficient streaming accelerator for floating-point generalized reduction workloads in 22 nm fd-soi. In 2019 Design, Automation Test in Europe Conference Exhibition (DATE), pages 662–667, March 2019.

[224] Robert Sedgewick and Kevin Wayne. Algorithms. Addison-Wesley Professional, 4th edition, 2011.

[225] C. Shannon and W. Weaver. The mathematical theory of communication. Uni- versity of Illinois Press, 1949.

[226] Yogeshwer Sharma, Philippe Ajoux, Petchean Ang, David Callies, Abhishek Choudhary, Laurent Demailly, Thomas Fersch, Liat Atsmon Guz, Andrzej Kotulski, Sachin Kulkarni, et al. Wormhole: Reliable pub-sub to support geo-replicated internet services. In 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI 15), pages 351–366, 2015.

[227] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. Departmental Papers (CIS), page 107, 2000.

[228] Benjamin H Sigelman, Luiz André Barroso, Mike Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, and Chandan Shanbhag. Dapper, a large-scale distributed systems tracing infrastructure. 2010.

[229] Benjamin H Sigelman, Luiz Andre Barroso, Mike Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, and Chandan Shanbhag. Dapper, a large-scale distributed systems tracing infrastructure. 2010.

[230] Stefan Sprenger, Steffen Zeuch, and Ulf Leser. Cache-sensitive skip list: Effi- cient range queries on modern cpus. In Data Management on New Hardware, pages 1–17. Springer, 2016.

[231] Johannes Starlinger, Sarah Cohen-Boulakia, Sanjeev Khanna, Susan B David- son, and Ulf Leser. Effective and efficient similarity search in scientific workflow repositories. Future Generation Computer Systems, 56:584–594, 2016.

[232] Harald Steck. Learning the bayesian network structure: Dirichlet prior versus data. In UAI 2008, 2008.

[233] Samuel Steffl and Sherief Reda. Lacore: A supercomputing-like linear algebra accelerator for soc-based designs. In 2017 IEEE International Conference on Computer Design (ICCD), pages 137–144. IEEE, 2017.

[234] Jiang Su and Harry Zhang. A fast decision tree learning algorithm. In UAI, pages 500–505. AAAI Press, 2006.

[235] Kozo Sugiyama, Shojiro Tagawa, and Mitsuhiko Toda. Methods for visual un- derstanding of hierarchical system structures. IEEE Transactions on Systems, Man, and Cybernetics, 11(2):109–125, 1981.

[236] Hari Tadepalli. Intel® quickassist technology with intel® key protection tech- nology in intel server platforms based on intel® xeon® processor scalable family, 2017.

[237] Amy Tai, Andrew Kryczka, Shobhit Kanaujia, Chris Petersen, Mikhail Antonov, Muhammad Waliji, Kyle Jamieson, Michael J Freedman, and Asaf Cidon. Live recovery of bit corruptions in datacenter storage systems. arXiv preprint arXiv:1805.02790, 2018.

[238] Prateek Tandon, Faissal M Sleiman, Michael J Cafarella, and Thomas F Wenisch. Hawk: Hardware support for unstructured log processing. In 2016 IEEE 32nd International Conference on Data Engineering (ICDE), pages 469– 480. IEEE, 2016.

[239] Chunqiang Tang, Thawan Kooburat, Pradeep Venkatachalam, Akshay Chan- der, Zhe Wen, Aravind Narayanan, Patrick Dowell, and Robert Karl. Holistic configuration management at facebook. In Proceedings of the 25th Symposium on Operating Systems Principles, SOSP ’15, pages 328–343, New York, NY, USA, 2015. ACM.

[240] Mukarram Tariq, Amgad Zeitoun, Vytautas Valancius, Nick Feamster, and Mostafa Ammar. Answering what-if deployment and configuration questions with wise. In ACM SIGCOMM Computer Communication Review, volume 38, pages 99–110. ACM, 2008.

[241] Eno Thereska, Bjoern Doebel, Alice X Zheng, and Peter Nobel. Practical performance models for complex, popular applications. In ACM SIGMETRICS Performance Evaluation Review, volume 38, pages 1–12. ACM, 2010.

[242] Eno Thereska and Gregory R Ganger. Ironmodel: Robust performance models in the wild. ACM SIGMETRICS Performance Evaluation Review, 36(1):253– 264, 2008.

[243] K. Thomas, C. Grier, J. Ma, V. Paxson, and D. Song. Design and evaluation of a real-time url spam filtering service. In 2011 IEEE Symposium on Security and Privacy, pages 447–462, May 2011.

[244] Ashish Thusoo, Zheng Shao, Suresh Anthony, Dhruba Borthakur, Namit Jain, Joydeep Sen Sarma, Raghotham Murthy, and Hao Liu. Data warehousing and analytics infrastructure at facebook. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, pages 1013–1020. ACM, 2010.

[245] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society., pages 267–288, 1996.

[246] Martin Traverso. Presto: Interacting with petabytes of data at facebook, 2013.

[247] Jan Van Lunteren, Ton Engbersen, Joe Bostian, Bill Carey, and Chris Larsson. Xml accelerator engine. In The First International Workshop on High Performance XML Processing, 2004.

[248] Stamatis Vassiliadis, Filipa Duarte, and Stephan Wong. A load/store unit for a memcpy hardware accelerator. In 2007 International Conference on Field Programmable Logic and Applications, pages 537–541. IEEE, 2007.

[249] Kaushik Veeraraghavan, Justin Meza, David Chou, Wonho Kim, Sonia Mar- gulis, Scott Michelson, Rajesh Nishtala, Daniel Obenshain, Dmitri Perelman, and Yee Jiun Song. Kraken: leveraging live traffic tests to identify and resolve resource utilization bottlenecks in large scale web services. In 12th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 16), pages 635–651, 2016.

[250] Swagath Venkataramani, Ashish Ranjan, Subarno Banerjee, Dipankar Das, Sasikanth Avancha, Ashok Jagannathan, Ajaya Durg, Dheemanth Nagaraj, Bharat Kaul, Pradeep Dubey, et al. Scaledeep: A scalable compute architec- ture for learning and evaluating deep networks. ACM SIGARCH Computer Architecture News, 45(2):13–26, 2017.

[251] Abhishek Verma, Luis Pedrosa, Madhukar Korupolu, David Oppenheimer, Eric Tune, and John Wilkes. Large-scale cluster management at google with borg. In Proceedings of the Tenth European Conference on Computer Systems, page 18. ACM, 2015.

[252] Tobias Vinçon, Sergej Hardock, Christian Riegger, Julian Oppermann, Andreas Koch, and Ilia Petrov. Noftl-kv: Tackling write-amplification on kv-stores with native storage management. In Proceedings of the 21st International Conference on Extending Database Technology (EDBT), 2018.

[253] Jelte Peter Vink and Gerard de Haan. Comparison of machine learning tech- niques for target detection. Artificial Intelligence Review, 43(1):125–139, 2015.

[254] Endong Wang, Qing Zhang, Bo Shen, Guangyong Zhang, Xiaowei Lu, Qing Wu, and Yajuan Wang. Intel math kernel library. In High-Performance Com- puting on the Intel® Xeon Phi, pages 167–188. Springer, 2014.

[255] Lei Wang, Jianfeng Zhan, Chunjie Luo, Yuqing Zhu, Qiang Yang, Yongqiang He, Wanling Gao, Zhen Jia, Yingjie Shi, Shujie Zhang, et al. Bigdatabench: A big data benchmark suite from internet services. In High Performance Computer Architecture (HPCA), 2014 IEEE 20th International Symposium on, pages 488–499. IEEE, 2014.

[256] Dominic JA Welsh and Martin B Powell. An upper bound for the chromatic number of a graph and its application to timetabling problems. The Computer Journal, 10(1):85–86, 1967.

[257] Martin White, Michele Tufano, Christopher Vendome, and Denys Poshyvanyk. Deep learning code fragments for code clone detection. In Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineer- ing, pages 87–98. ACM, 2016.

[258] Frank Wilcoxon. Individual comparisons by ranking methods. Biometrics bulletin, 1(6):80–83, 1945.

[259] D. H. Wolpert and W. G. Macready. No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1(1):67–82, Apr 1997.

[260] Stephan Wong, Filipa Duarte, and Stamatis Vassiliadis. A hardware cache memcpy accelerator. In 2006 IEEE International Conference on Field Pro- grammable Technology, pages 141–148. IEEE, 2006.

[261] Lisa Wu, Andrea Lottarini, Timothy K. Paine, Martha A. Kim, and Ken- neth A. Ross. Q100: The architecture and design of a database processing unit. SIGARCH Comput. Archit. News, 42(1):255–268, February 2014.

[262] Luna Xu, Seung-Hwan Lim, Min Li, Ali R Butt, and Ramakrishnan Kannan. Scaling up data-parallel analytics platforms: Linear algebraic operation cases. In 2017 IEEE International Conference on Big Data (Big Data), pages 273– 282. IEEE, 2017.

[263] Wei Xu, Ling Huang, Armando Fox, David Patterson, and Michael Jordan. Detecting large-scale system problems by mining console logs. In Proceedings of the 22nd Symposium on Operating Systems Principles (SOSP’09), pages 117–132. ACM, 2009.

[264] Neeraja J. Yadwadkar, Ganesh Ananthanarayanan, and Randy Katz. Wran- gler: Predictable and faster jobs using fewer resources. In SOCC ’14, pages 26:1–26:14, 2014.

[265] Neeraja J. Yadwadkar, Bharath Hariharan, Joseph E. Gonzalez, and Randy Katz. Multi-task learning for straggler avoiding predictive job scheduling. Jour- nal of Machine Learning Research, 17(106):1–37, 2016.

[266] Takeshi Yoshimura, Tatsuhiro Chiba, and Hiroshi Horii. Evfs: User-level, event-driven file system for non-volatile memory. In 11th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 19), Renton, WA, July 2019. USENIX Association.

[267] Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz, and Ion Stoica. Improving mapreduce performance in heterogeneous environments. In 8th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’08), pages 29–42, 2008.

[268] Jingtian Zhang, Sai Wu, Zeyuan Tan, Gang Chen, Zhushi Cheng, Wei Cao, Yusong Gao, and Xiaojie Feng. S3: a scalable in-memory skip-list index for key-value store. Proceedings of the VLDB Endowment, 12(12):2183–2194, 2019.

[269] Sizhuo Zhang, Hari Angepat, and Derek Chiou. Hgum: Messaging frame- work for hardware accelerators (abstact only). In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA ’16, pages 283–283, New York, NY, USA, 2016. ACM.

[270] Steve Zhang, Ira Cohen, Julie Symons, and Armando Fox. Ensembles of models for automated diagnosis of system performance problems. In Proceedings of the 2005 International Conference on Dependable Systems and Networks, DSN ’05, pages 644–653, Washington, DC, USA, 2005. IEEE Computer Society.

[271] Xiao Zhang, Eric Tune, Robert Hagmann, Rohit Jnagal, Vrigo Gokhale, and John Wilkes. CPI2: Cpu performance isolation for shared compute clusters. In Proceedings of the 8th ACM European Conference on Computer Systems, pages 379–391. ACM, 2013.

[272] Yunqi Zhang, David Meisner, Jason Mars, and Lingjia Tang. Treadmill: At- tributing the source of tail latency through precise load testing and statistical inference. In ISCA ’16, 2016.

[273] Xu Zhao, Yongle Zhang, David Lion, Muhammad Faizan Ullah, Yu Luo, Ding Yuan, and Michael Stumm. lprof: A non-intrusive request flow profiler for distributed systems. In 11th USENIX Symposium on Operating Systems Design and Implementation, pages 629–644, 2014.

[274] P. Zheng, Y. Qi, Y. Zhou, P. Chen, J. Zhan, and M. R. Lyu. An automatic framework for detecting and characterizing performance degradation of soft- ware systems. IEEE Transactions on Reliability, 63(4):927–943, Dec 2014.

[275] Pengfei Zheng and Benjamin C Lee. Hound: Causal learning for datacenter- scale straggler diagnosis. Proceedings of the ACM on Measurement and Anal- ysis of Computing Systems, 2(1):17, 2018.

[276] Pengfei Zheng and Benjamin C. Lee. Hound: Causal learning for datacenter-scale straggler diagnosis. volume 2, New York, NY, USA, 2018. Association for Computing Machinery.

[277] Pengfei Zheng, Xiaodong Wang, Kim Hazelwood, David Brooks, and Ben- jamin C. Lee. Limelight+: Graph theory and statistical semantic learning for understanding workload at datacenter scale. 2020.

[278] Pengfei Zheng, Qingguo Xu, and Yong Qi. An advanced methodology for measuring and characterizing software aging. In Proceedings of the 2012 IEEE 23rd International Symposium on Software Reliability Engineering Workshops, pages 253–258, USA, 2012. IEEE Computer Society.

[279] Yufei Zhu. Serving facebook multifeed: Efficiency, performance gains through redesign, 2015.

[280] Jacob Ziv and Abraham Lempel. A universal algorithm for sequential data compression. IEEE Transactions on information theory, 23(3):337–343, 1977.

[281] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, 67(2):301–320, 2005.

Biography

Pengfei Zheng received his Bachelor of Engineering in Software Engineering from China University of Petroleum (East China) in 2009. During his undergraduate study, he was awarded the National Scholarship (the highest award given by the Chinese government to college students) in 2008 and was honored as a Graduate with Distinction in 2009. In 2014, he received his Master of Science in Computer Science from Xi'an Jiaotong University, China.

He began doctoral study under Dr. Benjamin C. Lee's supervision at Duke University in September 2014. His research focuses on performance analysis and optimization of large, complex systems with quantitative methods inspired by machine learning, graph theory, and algorithmic decision making. He has interned with Facebook's Artificial Intelligence Infrastructure team, Facebook's Capacity Engineering & Analysis team, and Lenovo's Artificial Intelligence Operations team. He has published at SIGMETRICS [276], INFOCOM [76], IEEE Transactions on Reliability [274], SOSE [77], and the ISSRE Workshops [278], and his most recent work is under submission [277].
