<<

PROFILING AND IMPROVING PERFORMANCE OF DATA-INTENSIVE

APPLICATION IN CLOUD SYSTEMS

by

Aidi Pi

B.S., Tongji University, China, 2015

A dissertation submitted to the Graduate Faculty of

University of Colorado Colorado Springs

in partial fulfillment of the

requirements for the degree of

Doctor of Philosophy

Department of Computer Science

2021 This dissertation for the Doctor of Philosophy degree by

Aidi Pi

has been approved for the

Department of Computer Science

by

Xiaobo Zhou, Chair

Edward Chow

Sang-Yoon Chang

Philip Brown

Mike Ji

April 21, 2021

Date

ii Pi, Aidi (Ph.D., Engineering – Computer Science)

Profiling and Improving Performance of Data-Intensive Application in Cloud Systems

Dissertation directed by Professor Xiaobo Zhou

Abstract

Data-intensive applications are deployed on cloud systems. In order to ensure high performance of data-intensive applications, profiling and troubleshooting for such applications and underlying systems is critical. By leveraging the right profiled metrics, we can further tune the systems for better performance. However, troubleshooting for data-intensive applications itself is challenging due to two reasons. Firstly, it is difficult for manual analysis on a distributed system consisting of tens and hundreds of nodes. Moreover, anomalies in distributed systems are caused by various sources. On the other hand, in order to improve application performance, there are two more challenges. First, analyzing the systems and deciding on the right metrics that can accurately reflect performance degradation is non-trivial. Second, the resource consumption during data-intensive application runtime is highly dynamic, which requires dynamic approaches that can adaptively adjust resource allocation for the applications.

From the aspect of troubleshooting, logs and performance metrics are two main sources reflecting the status of applications and underlying systems. Firstly, log is non-intrusive and usually records critical events of workflows in a system. Furthermore, log analysis does not require the source code or byte code of targeted systems. Despite the benefits, there are challenges to achieve efficient log analysis. Firstly, The quantity of logs is extremely large, which makes manual log analysis impossible. Another challenge is to automatically provide users with semantic meaning extracted from logs.

One artifact of profiling is to identify metrics so that we can form a method in detecting interfer- ence and scheduling resources efficiently for high performance and high utilization in multi-tenant systems. In such an environment, latency-critical services must respond user requests with low la-

iii tency to keep their Service Level Objectives (SLOs) while batch jobs try to use most transiently available resources to achieve high resource utilization. By leveraging the performance metrics gen- erated during application executions and dynamically adjusting resource allocation based on the profiled metrics, we target on satisfying requirements for both latency-critical services and batch jobs.

In this thesis, we focus on efficient log analysis for easy troubleshooting and leveraging perfor- mance metrics to efficiently co-locate different kinds of workloads. The thesis is two folds. First, we focus on profiling and troubleshooting distributed data analytic systems. Specifically, we designed a tool collecting both logs and resource metrics of data-intensive applications such that it facilitates troubleshooting processes. We further designed a log analysis tool that is able to extract semantic meaning from logs and automatically report potential anomalies by leveraging natural language pro- cessing approaches. Second, we target on efficient job co-location by identifying metrics via profiling and leveraging the metrics to design efficient approaches on interference mitigation and performance improvement in multi-tenant systems. Specifically, we design a middleware approach that leverages hardware event counters to identify Hyper-threading interference and dynamically adjust CPU allo- cation for co-located latency-critical services and batch jobs. We further design a reservation-based fast memory allocation approach which releases file pages based on system memory usage.

We either developed the proposed approaches from scratch or modified existing open-source frameworks, and conducted the evaluation in physical servers with various data-intensive and cloud service benchmarks. These approaches facilitate the process of troubleshooting for distributed sys- tems, and reduces the latency of latency-critical services while improving the resource utilization in a job co-location environment.

iv Dedication

To Scarlett O’Hara and in memory of Margaret Mitchell.

“After all ... tomorrow is another day.”

v Acknowledgments

This thesis would have been impossible without the support and mentoring of my advisor, Dr. Xiaobo

Zhou. I would like to thank him for being a continuous source of encouragement, inspiration, and help for me to achieve my academic and career goals.

I would like to thank my graduate committee members, Dr. Edward Chow, Dr. Sang-Yoon

Chang, Dr. Phillip Brown and Dr. Mike Ji, for their help and encouragement throughout my Ph.D. study at UCCS. I appreciate the camaraderie and help of my fellow DISCO lab members, Dr. Wei

Chen, Dr. Shaoqi Wang, Dr. Oluwatobi Akanbi and Junxian Zhao.

I am grateful to my family for their encouragement and support in pursuing my dreams.

The research and dissertation were supported in part by the US National Science Foundation research grants CNS-1422119 and CCF-1816850. I thank the College of Engineering and Applied

Science of UCCS for providing the clusters for conducting experiments.

vi Contents

1 Introduction 1

1.1 Background of Distributed Data-intensive Applications ...... 1

1.1.1 Data-intensive Applications ...... 1

1.1.2 Profiling and Troubleshooting for Distributed Systems ...... 6

1.1.3 Performance of Data-intensive Applications ...... 8

1.2 Motivations and Research Focus ...... 9

1.2.1 Aggregating Massive Amount of Logs with Performance Metrics ...... 9

1.2.2 Automatic Semantic Extraction from Logs ...... 12

1.2.3 SMT Interference Detection and Mitigation for Job Co-location ...... 15

1.2.4 Semantic Gap of Memory Allocation for Job Co-location ...... 18

1.3 Challenges...... 22

1.3.1 Challenges in analyzing Massive Amount of Logs ...... 22

1.3.2 Applying Natural Language Processing on Distributed System Logs . . . . . 24

1.3.3 Quantifying SMT Interference and Adjusting Cores ...... 25

1.3.4 Requirements for Fast Memory Allocation ...... 28

2 Related Work 30

2.1 TroubleshootingforDistributedSystems ...... 30

2.1.1 Log analysis ...... 30

2.1.2 Intrusivetroubleshooting ...... 32 2.1.3 Core Dump and Record/Replay Approaches ...... 33

2.2 Performance Improvement for Distributed Systems ...... 34

2.2.1 Job Co-loation and Resource Sharing ...... 34

2.2.2 ClusterScheduling ...... 35

2.2.3 Optimization for Latency-critical services ...... 37

3 Profiling Distributed Systems in Lightweight Virtualized Environments with Logs

and Resource Metrics 39

3.1 KeyedMessage ...... 39

3.1.1 Log Transformation ...... 40

3.1.2 Resource Metrics Storage ...... 42

3.2 LRTraceDesign...... 42

3.2.1 Yarn...... 43

3.2.2 Tracing Worker ...... 44

3.2.3 Tracing Master ...... 45

3.3 Evaluation...... 48

3.3.1 TestbedSetup ...... 48

3.3.2 Workflow Reconstruction and Analysis ...... 48

3.3.3 Bug Diagnosis ...... 51

3.3.4 InterferenceDetection ...... 57

3.3.5 Feedback Control ...... 58

3.3.6 Performance Overhead ...... 60

3.3.7 Discussion...... 60

3.3.8 Summary ...... 62

4 Semantic-aware Workflow Construction and Analysis for Distributed Data Ana-

lytics Systems 63

4.1 Overview ...... 63

viii 4.2 Information Extraction ...... 64

4.2.1 POS Tagging Based Extraction ...... 65

4.2.2 Structure Parsing Based Extraction ...... 67

4.2.3 A Summary Example ...... 67

4.3 Modeling and Anomaly Detection ...... 68

4.3.1 Workflow Reconstruction ...... 69

4.3.2 Anomaly Detection ...... 73

4.4 Implementation...... 74

4.5 Evaluation...... 75

4.5.1 ExperimentSetup ...... 75

4.5.2 Information Extraction ...... 76

4.5.3 HW-graph and Workflows ...... 77

4.5.4 Anomaly Detection ...... 80

4.6 Discussions ...... 84

4.7 Summary ...... 85

5 SMT Interference Diagnosis and CPU Scheduling for Job Co-location 86

5.1 Hyperthreading Interference Diagnosis ...... 86

5.1.1 Findingthemetric ...... 86

5.1.2 EffectivenessoftheMetric...... 88

5.2 Holmes Design ...... 90

5.2.1 Overview ...... 90

5.2.2 Metric Monitor ...... 90

5.2.3 CPUScheduler...... 91

5.3 Implementation...... 94

5.4 Evaluation...... 95

5.4.1 Evaluation Setup ...... 95

5.4.2 Query Latency Reduction ...... 98

ix 5.4.3 ServerThroughputImprovement ...... 100

5.4.4 ParameterSensitivity ...... 102

5.4.5 Overhead ...... 103

5.5 Summary ...... 104

6 Hermes: Fast Memory Allocation for Latency-critical Services 105

6.1 HermesDesign ...... 105

6.1.1 Overview ...... 105

6.1.2 Memory Management Thread ...... 105

6.1.3 Memory Monitor Daemon ...... 111

6.2 Implementation...... 111

6.3 Evaluation...... 114

6.3.1 Evaluation Setup ...... 114

6.3.2 Micro Benchmark ...... 115

6.3.3 Two Real-world Latency-critical Services ...... 116

6.3.4 ParameterSensitivity ...... 121

6.3.5 TheMicroBenchmarkwithSSD ...... 123

6.3.6 Hermes Overhead ...... 124

6.4 Discussions ...... 125

6.5 Summary ...... 127

7 Conclusions and Future Work 128

7.1 Conclusions...... 128

7.2 FutureWork ...... 129

7.3 Publications...... 131

Bibliography 133

x List of Figures

1.1 Thedata-intensiveecosystem ...... 2

1.2 ThearchitectureofYarn...... 3

1.3 Architecture of Two Data-Parallel Applications ...... 4

1.4 Number of tasks and memory usage of a HiBench KMeans application...... 11

1.5 A real-world log snippet of MapReduce...... 13

1.6 Multi-threading and simultaneous multi-threading...... 15

1.7 Memory access latency from different sources...... 16

1.8 Latency of Redis under different settings...... 17

1.9 Process address space in Linux. Shaded areas represent allocated virtual memory

whose physical pages do not reside in RAM. Red fonts represent variables on Glibc. 19

1.10 CDF of a break-down query latency in Rocksdb...... 20

1.11 CDF of the memory allocation latency...... 20

3.1 A snippet of simplified log output by a Spark application. Extracted fields are shown

indifferentcolors...... 40

3.2 LRTracearchitecture...... 44

3.3 Illustration of how a short object can be missed...... 45

3.4 State machines of application attempt and two representative containers. The names

of the states are written in capital letters. Each of short states is labeled with an

arrowpointingtoit...... 47 3.5 Resource metrics and events in three representative containers. The dashed vertical

line in (b) represents a spilling event in container 03. The green and the dashed

vertical blue lines in (c) represent shuffling operations (period events) in container 03

and container 06 respectively. (c) and (d) show cumulative I/O usage instead of

instantI/Orate...... 49

3.6 Workflows of a map task and a reduce task. The values below spill represent how

many MB of keys and values are processed by the spill event, e.g. 10.44/6.25 stands

for 10.44MB of keys and 6.25MB of values...... 52

3.7 Related performance metrics when debugging SPARK-19371 and YARN-6976. . . . 53

3.8 A container spends a long time in KILLING state. The vertical red line at the 98th

second indicates that the application is finished. The vertical blue line at the 100th

second is the boundary between RUNNING state and KILLING state of the container. 56

3.9 Related metrics when debugging an anomaly caused by interference in the cloud. . . 57

3.10 Evaluation of the queue rearrangement plug-in...... 59

3.11 Performance overhead of LRTrace...... 61

4.1 The Overview of IntelLog...... 64

4.2 POS tagging on a log key...... 65

4.3 Process of transforming a log key to an Intel Key...... 68

4.4 Illustration of the UpdateSubroutine function...... 71

4.5 Three relations between entity groups...... 72

4.6 The building procedure of the HW-graph based on entity relations...... 73

4.7 The HW-graph for Spark containing the semantic knowledge of the workflow. We

omit the children of the entity groups in gray rectangles since they have the same

hierarchy as the children of the ‘memory’ group...... 77

4.8 The S 3 graph of Spark built by Stitch...... 77

xii 5.1 Normalized memory access latency and value of HPEs under different thread config-

urations...... 86

5.2 The relationship between VPI and service performance...... 89

5.3 OverviewofHolmes...... 90

5.4 The CDF of query latency of Redis service under three settings...... 96

5.5 The CDF of query latency of RocksDB service under three settings...... 96

5.6 The CDF of query latency of WiredTiger service under three settings...... 96

5.7 The CDF of query latency of Memcached service under three settings...... 96

5.8 SLO violation of four latency-critical services under three settings...... 97

5.9 CPU utilization with four latency-critical services under three settings (Alone, Holmes,

PerfIso) ...... 99

5.10 CPU usage under two settings for Redis...... 101

5.11 Normalized latency of four latency-critical services with different CPU suppression

ratios S in Holmes...... 102

6.1 TheworkflowofthemodifiedGlibcroutines...... 106

6.2 Illustration of gradual reservation...... 107

6.3 Segregated free list for mmapped memory chunks...... 109

6.4 The memory allocation latency for small (1KB-size) memory requests on the HDD. . 113

6.5 The memory allocation latency for large (256KB-size) memory requests on the HDD. 113

6.6 The 90th percentilequerylatencyofRedis...... 117

6.7 The 90th percentilequerylatencyofRocksdb...... 117

6.8 Latency of Redis under 100% memor pressure...... 118

6.9 Latency of Rocksdb under 100% memory pressure...... 118

6.10 The SLO violation ratio of Redis requests...... 119

6.11 The SLO violation ratio of Rocksdb requests...... 119

6.12 Latency reduction for small requests...... 121

6.13 Latency reduction for large requests...... 121

xiii 6.14 The memory allocation latency for small (1KB-size) memory requests on the SSD. . 122

6.15 The memory allocation latency for large (256KB-size) memory requests on the SSD. 124

xiv List of Tables

1.1 Categories of troubleshooting for distributed systems ...... 6

1.2 Lines and percentages of natural language logs...... 14

3.1 Description of fields in a keyed message...... 40

3.2 Keyed messages corresponding to logs in Figure 3.1...... 41

3.3 Summary of rules for extracting the workflow of a Spark application...... 50

3.4 Summary of memory behavior. GC start is the moment when GC starts. GC delay

is the time between a spilling event and a later full GC...... 51

3.5 Summary scenarios about container termination...... 57

4.1 Patterns to match phrases and the corresponding Penn Treebank presentation. Due

to the limited space, ‘NN’ includes four noun tags: ‘NN’, ‘NNS’, ‘NNP’ and ‘NNPS’. 66

4.2 UD relations for operation extration...... 67

4.3 Accuracy of information extraction by IntelLog in the three systems...... 75

4.4 Log and HW-graph statistics for the three systems...... 78

4.5 The accuracy of anomaly detection by IntelLog...... 81

4.6 Jobdescriptionsinthethreecasestudies...... 82

4.7 Comparison of the anomaly detection accuracy among IntelLog, DeepLog and Log-

Cluster...... 83

5.1 Candidate HPEs with correlation value...... 88 5.2 Terminologies in Holmes...... 92

5.3 Throughput comparison of completed batch jobs...... 101

6.1 Throughput of batch jobs...... 120

6.2 Number of two kernel functions called at each latency range...... 126

xvi CHAPTER 1

INTRODUCTION

1.1 Background of Distributed Data-intensive Applications

1.1.1 Data-intensive Applications

In the recent years, a increasing number of data parallel applications are deployed in a distributed environment. The applications include databases, distributed file systems, machine learning jobs and web services. A lot of studies focus on profiling and improving performance for distributed systems. Among diversity of distributed systems, our focus is the data-intensive ecosystem as shown in Figure 1.1. The software stack of data-intensive ecosystem can basically be divided into three layers.

Storage layer. At the very bottom, it is distributed file systems (DFS) serving as the storage layer. Popular DFSes include GFS [94] and HDFS [40]. In this research, we choose the open-source

HDFS as the storage layer. HDFS consists of a NameNode and several DataNode. NameNode resides on the master node of a cluster. It is responsible for storing metadata and coordinating

DataNodes. DataNodes run each worker node, one per worker node. They are responsible for saving input, intermediate and output data of upper level applications.

Scheduling layer. In the middle comes the scheduling layer which allocates computational resources for upper level applications. The efficacy of this layer is a key factor that determines the performance of a data-intensive cluster (see Section 1.1.3). Popular cluster scheduling frameworks 2

Figure 1.1: The data-intensive ecosystem include Yarn [171], Mesos [112] and Sparrow [154]. In this research, we choose the open-source framework Yarn as the scheduling layer. Yarn is a centralized cluster scheduling framework. Fig- ure 1.2 shows the architecture of Yarn. Yarn consists of a Resource Manager (RM) and several Node

Managers (NM). NMs, one per worker node, continuously report resource status of the node to the

RM. When RM receives a job submission request from a client, it first launches an App Master

(AM) representing the job on a worker node. The AM requests resources from the RM based on the job request. Then, the RM return the AM with the resource information including container sizes and resource location. Based on the returned information the AM communicates with NMs.

Finally, the NMs launch containers which are the execution units of the job. Conceptually, a Yarn container represents a bunch of computational resources. For example, a container can have {2 cores, 4GB RAM} to serve tasks running within it. Essentially, a Yarn container is a JVM process.

The specified resources are interpreted into the heap size and the number of child thread of a JVM.

Data processing layer. On the top of the stack is the application layer which contains variety of applications serving for different purposes. Some applications can use others as their processing engine (e.g. Hive in Figure 1.1). In this research, we mainly focus on MapReduce [82] and Spark [192] since they are two of the most commonly used data processing engine which directly communicates with the storage layer and scheduling layer.

MapReduce has been keeping attracting attention from both industrial and academia since it emerged. Its easy of use, scalability, and fault tolerance render MapReduce as a popular data- parallel programming paradigm for a decade. Figure 1.3a shows the architecture of MapReduce. It 3

Figure 1.2: The architecture of Yarn has two major phases: map and reduce. The map phase reads input data and produces intermediate data represented as a list of key/value pairs. The intermediate data are then written to the DFS.

Before the reduce phase, intermediate data are grouped by the keys. Values with the same key are shuffled to one reduce task. Then the reduce tasks further process the data and output the final results to the DFS. The results are returned to clients or serve as input data for further processing.

The number of map tasks and reduce tasks are controlled by clients. Each task is hosted by a Yarn container. However, MapReduce has two own fundamental drawbacks. First, Increasing data size requires more containers, incurring serious start-up latency. Since a Yarn container is a JVM process, the JVM needs to go through a self start-up procedure before running a task. The increasing number of containers leads to increasing start-up time, causing delay to an entire job. Another problem is that MapReduce incurs serious I/O overhead, since a MapReduce job writes data to the DFS twice during execution (one for intermediate data, the other for final results). Furthermore, a complex data analytics application requires multiple rounds of MapReduce jobs, which incur even more I/O overhead.

Due to the drawbacks of MapReduce, there is a matter of urgency for a data-parallel processing framework with low overhead, which motivates Spark. Spark is a in-memory multi-threading data- parallel processing framework. In order to resolve the start-up latency issue, Spark uses executor to hold multiple tasks, which significant reduces the start-up latency. In order to resolve the I/O 4

(a) Architecture of MapReduce

(b) Architecture of Spark

Figure 1.3: Architecture of Two Data-Parallel Applications overhead issue, Spark tries to store intermediate data in memory. It only spills the data to disk when there is not sufficient memory. When a task needs data generated by previous tasks, it can directly read the data from memory, which reduces I/O overhead. A study [192] shows that

Spark runs 10x 100x times as fast as MapReduce. Thus, a lot of applications support Spark as their execution engine. For example, the data warehouse Hive [169] transforms SQL queries into

Spark jobs. MLlib [26] is a machine learning library built on Spark. Other applications include

GraphX [25], Spark Streaming [27], etc.

Key-value store. Besides the three stacks, there are several cross-stack latency-critical appli- cations. In this thesis, we focus on key-value stores which belong to both the data processing layer and the storage layer. Compared to traditional relational database, key-value stores provide a easy programming paradigm. A piece of data in key-value store only consists of a key-value pair. Due to the simple data format, they can provide fast data access speed, which is a perfect fit for use cases where data need to be frequently and fast accessed. Redis and Memcached are two representative 5 in-memory key-value stores, while RocksDB and MongoDB are two representative disk key-value stores. In-memory key-value store provide even faster data access speed than disk ones, while its capacity is limited by the RAM capacity and more vulnerable to failure. On the other hand, disk key-value store can adopt in-memory cache to improve its access speed.

In industry, latency-critical services usually have very different peak and average resource usage.

Based on a paper from SnowFlake, the average memory and CPU usage in a cluster is only 51% and

19%. This feature leads to a solution that latency-critical services are usually co-located with best- effort batch jobs such as Spark and MapReduce workloads. When latency-critical services are not using many resources, the transiently idle resources can be allocated to best-effort jobs, increasing cluster resource utilization.

Runtime stack. On the right side of Figure 1.1, we present the runtime stack in which data- intensive applications are running. On the top, it is the language runtime, such as Java and C++.

Applications are developed in different language to meet their use cases. Then, it is the OS container layer such as Docker. Applications are running in containers in order to achieve resource isolation.

Good resource isolation is especially important in a multi-tenant environment to guarantee that applications do not interfere with each other. Below that, it is the library layer and OS layer. This layer provides the basic runtime environment for the applications. At the bottom, it is the hardware layer including memory, CPU, disk and networks.

Given the complexity of the data-intensive applications ecosystem, it is challenging to profiling and improving performance for such systems. First, in order to get enough information for trou- bleshooting, profiling needs to be done in multiple layers, including both software and runtime stack.

Furthermore, it is important to build the relationship between the profiled multi-layer information.

Second, due to the large number of logs generated by the system, it is difficult, if not impossible, for manual analysis. The challenges is to automatically extract semantic meaning from logs and auto- matically detect possible anomalies. Third, finding appropriate metrics that can reflect interference in a co-location environment is challenging, since there are so many software and hardware metrics to use. For example, an Intel’s processor has hundreds of hardware performance counter. Choos- 6 Table 1.1: Categories of troubleshooting for distributed systems

online offline pros: quick reaction, detailed informa- pros: testing phase Intrusive tion cons: access to source/byte code, do- cons: slow reaction main knowledge requirement pros: low overhead, ease of use pros: ease of understanding Non-intrusive cons: less information cons: slow reaction ing appropriate ones to use is non-trivial. Last but not the least, there is semantic gap between exiting runtime stack and co-location environment. While existing GNU library and OS equally treats running processes and tries to accommodate all of them, co-location applications usually have prioritization.

In this thesis, we focus on troubleshooting and improving performance for the scheduling layer and the application layer of distributed data-intensive systems. Our targeted frameworks include but not limited to Yarn, Hadoop MapReduce, Spark, Redis, and Rocksdb. Specifically, the contributions of this thesis include 1) A log performance analysis framework for data-intensive framework in lightweight virtualized environments. 2) A log processing framework that extract semantic meaning from logs via natural language processing approaches. 3) A SMT interference aware CPU scheduling framework for efficient job co-location for latency-critical services and best-effort jobs. 4) A fast memory allocation for latency-critical services in a co-location environment.

1.1.2 Profiling and Troubleshooting for Distributed Systems

With regard to the evolving scale of distributed systems, profiling and troubleshooting is a critical task in order to maintain a robust and high-performance system. The troubleshooting techniques can generally be divided into two categories: intrusive and non-intrusive. At the meaning while, the reaction to anomalies happens at two moment: online and offline. Therefore, there are basically four kinds approaches adopted to troubleshoot for distributed systems, as shown in Table 1.1. All of them attract much focus from both industrial and academia.

Online - Intrusive This is a popular method for domain-specific experts in production environ- ments. Intrusive tools usually dynamically insert code snippets into source code in order to fetch 7 detailed information requested by users. Despite the flexibility and fine-grained information provided by this method, a crucial prerequisite is that users must have a comprehensive understanding of tar- geted systems. For example, aspect oriented programming is a technique used to dynamically insert modules to source code at runtime. To do so, users need to know the signature of the targeted func- tion (e.g. function name, return type and arguments) for insertion. Therefore, domain-knowledge is required for such method. Since reading and understanding the source code is time consuming, the method is mainly adopted for production systems that are usually in service for years.

Online - Non-intrusive This is a common method that can be applied for many system. As for non-intrusive method, the most common approach is log analysis since most systems use logs to record events during execution. Other approaches include resource consumption monitoring, execution time monitoring, etc. Online log analysis adopts machine learning to build models for log sequences [87, 182, 183, 139]. Anomalies are captured by detecting logs deviating from the models.

The model building and anomaly detection processes are automated in this approach, taking off burden from users. However, users still need to spend some effort on manually inspecting logs when

finding root causes of anomalies. Compared to intrusive methods, non-intrusive methods require less domain knowledge and provide more ease of use for users.

Offline - Intrusive This is a method for debugging in testing phases before software release. One of a representative tool is GDB [6]. These tools show detailed information about targeted running programs includes register values, thread status, memory map, etc. However, this method is not suitable in production environments due to its high overhead since additional debugging information is added besides normal execution.

Offline - non-intrusive This is a method used mostly for understanding applications and tracing recurring workloads. It has the ability to build workflows according to the the events recorded in logs. The reconstructed workflows help users to have a intuitive view of how applications work.

For example, Stitch [196] constructs identifies from different components into a hierarchical graph representing workflows of targeted system. The graph shows the lifespans of objects during execution.

As for troubleshooting, this method is especially helpful to detect latency-related problems. However, 8 it heavily relies on manual analysis of the reconstructed results. When the scale of systems grow beyond tens of nodes, it is time consuming to compare metrics collected from every node. Another problem is that current offline and non-intrusive approaches require either source code analysis or a restricted logging pattern, which limits its applicability.

Two of our proposed contributions lie in the area of non-intrusive troubleshooting. Since online and offline approaches targets on different use cases, our contributions will enable users to more efficiently understand and further troubleshoot the targeted systems. These two contributions focus on different aspects. The first one, called LRTrace, collects both logs and performance metrics during job executions. Combing both kinds of information, users can have a deep understanding of job resource consumption status, job sizes, job execution time, etc. The second one, called IntelLog extracts semantic information from logs via natural language processing approaches. By using this tool, users can more easily obtain key information from logs instead of reading raw log messages.

1.1.3 Performance of Data-intensive Applications

One artifact of profiling is to identify metrics so that we can form a method in detecting interference and scheduling resources efficiently for high performance and high utilization in multi-tenant systems.

Massive existing efforts have been spent on improving performance of data-intensive applications.

Some studies focus on efficient cluster scheduling [171, 154, 83, 49] while others focus on task executions on individual nodes [58, 150, 116]. The major objectives of performance improvement are decreasing the execution time of tasks and increasing the throughputs of a cluster.

From the aspect of cluster scheduling, the major goals are to quickly allocate enough compu- tational resources to incoming requests while not to overwhelm the cluster with too many tasks.

Some studies focus on designs of cluster schedulers. For example, Yarn and Mesos are two popular centralized cluster schedulers. Sparrow is a distributed cluster scheduler. Hawk [83] is a hybrid cluster scheduler that has centralized and distributed scheduling policy for different kinds of jobs.

These schedulers are all actively used in industrial and academia for different environments.

From the aspect of improving execution on individual nodes, the major goal is to efficiently 9 host tasks on one node. Specifically, researches focus on efficient resource sharing (CPU, Memory, and I/O) across tasks on a node. The resource consumption of tasks significantly diverse during execution due to bursty workloads. Traditionally, users tends to request from the cluster scheduler a large size of resources which meets the peak resource consumption of tasks. However, task resource consumption only reach the peak value for a short period of time during execution, leading to resource wastage most of the execution time. Researches in this area tries to share unused resource with other tasks on the same node in order to improve the overall execution time.

Two of our contributions lies in the latter aspect. We found that in a shared cluster, CPU and memory are both resources with a tight budget. From the perspective of CPU resource, when latency- critical services and best-effort batch jobs are scheduled on sibling processors of a physical core,

Simultaneous Multi-Threading (SMT) results in significant latency degradation on latency-critical services. One of our contributions, called Holmes, takes the interference due to SMT into account and efficiently adjusts processor allocation between latency-critical services and batch jobs. From the perspective of memory resource, job co-location usually consumes a large amount of memory on a server. When memory is running low, memory of processes is swapped onto disk, causing additional I/O activities. For latency-critical jobs, the I/O activities causes performance degradation which is not acceptable. Our proposed work, called Hermes, designs and implements a reservation- based memory allocation mechanism that can achieve fast memory allocation for latency-critical services even under memory pressure. These two contributions both target guaranteeing Service

Level Objectives (SLOs) for latency-critical services while improving server resource utilization.

1.2 Motivations and Research Focus

1.2.1 Aggregating Massive Amount of Logs with Performance Metrics

We use a motivating example to demonstrate what information LRTrace can provide to users and how the information is presented. We run a Spark KMeans job with the large data set generated by

HiBench [114] on a 9-node cluster that is managed by Yarn [171] and use Docker as the lightweight 10 virtual container. We demonstrate how a user monitors this application using a traditional method compared to using LRTrace.

Traditional approach. We refer the traditional approach to debugging with tools only provided by the application. Most applications provide two kinds of tools: 1) the raw logs, and 2) the web server provided by the App Master. Logs record various objects and events in a chronological manner and are quickly generated. Thus, log messages regarding to an object may be scattered across the whole log file. Another problem is that logs in a distributed system are generated in different files. A user may spend a long time in finding a specific object in logs even after the application finishes.

A study [196] has shown that using raw logs for anomaly analysis is a common but an inefficient approach.

The web server provides an easy access to the workflows of the application. However, the infor- mation recorded in the web server is determined by the framework developers. Some applications record less information in the web server than in the logs. For example, the web server of Hadoop

MapReduce only records general information about tasks such as the location, progress and state.

Detailed information such as shuffle or spill events cannot be obtained from the web server. Some other applications such as Spark record different information in the web server from logs. In general, the web server can be used as a complementary tool for debugging. However, both approaches lack

flexibility of applying operations on the objects.

In this example, suppose a user wants to get the information about tasks in each Spark executor.

The user might not read the raw log files since this approach is too time consuming. Instead, the

Spark web server provides information about each task such as its location, its start/end time and its input size, which only presents the information of individual tasks but is insufficient for an overview on all tasks.

LRTrace. By leveraging the keyed message, LRTrace provides concise information that describes the workflow at runtime. We now show an intuitive view of how keyed messages work, and leave the details of its design to Section 3. Figure 1.4 shows a small fraction of information that can be provided by LRTrace. Since the application is administrated by Yarn and uses Docker containers, 11 8 container_02 container_04 container_05 6 first task in container_04 4 2 # of tasks 0 0 10 20 30 40 50 60 Time (s) (a) number of tasks in each container.

2048 container_02 1536 container_04 1024 container_05 512 0 0 10 20 30 40 50 60 Memory usage (MB) Time (s)

(b) memory usage of each container.

Figure 1.4: Number of tasks and memory usage of a HiBench KMeans application. we can access both logs and fine-grained resource metrics of each container. A Spark application is divided into multiple stages. In each stage, each container executes several tasks. For the conciseness of demonstration, we only present information about three containers of the application. Figure 1.4a describes the number of tasks concurrently running in each container, which is obtained by the following request: key: task aggregator: count groupBy: container, stage

This request first sets the key as task. The count aggregator counts the number of active tasks during each profiling interval. The results are grouped by the container IDs and stage IDs that are identifiers attached to keyed messages. In Figure 1.4, we highlight tasks in stage 0 with thick lines to distinguish them from those in other stages. Between 22s and 30s, container 02 is processing tasks in stage 0 while container 05 (and several other containers which are not shown in the figure) has

finished the tasks in the current stage and becomes idle. At this point, a user can clearly identify container 02 as a straggler in stage 0. It is also obvious that container 05 gets assigned fewer tasks than container 02 does throughout the entire lifespan of the application. 12 LRTrace also provides the resource consumption of a container. Figure 1.4b shows the memory usages of the three containers, obtained by the following request: key: memory groupBy: container

From Figure 1.4b a user can find that three containers start at around the same moment. How- ever, container 04 gets assigned its first task right before the end of the whole application. The idle container occupies more than 200 MB of memory resource for a long time from its start. Both cases of task imbalance are caused by a bug in the Spark scheduler. We describe how LRTrace locates the bug in Section 5.

LRTrace provides flexibility to reconstruct the workflow from logs. Reconsider the request that generated results in Figure 1.4a. The request returns the number of tasks per container and stage.

If a user wants to inspect the total number of running tasks in the whole cluster, the user only needs to remove “container” from the aggregator field. We also use requests of the same format to obtain resources metrics, shown in Figure 1.4b.

Summary Compared to traditional approaches, LRTrace provides a concise view of the events recorded in the logs. By applying operators, in this case sum on tasks of each container, users can reconstruct the workflow from a statistical aspect. Also, correlating resource metrics gives users an insight on how an event affects the resource consumption.

1.2.2 Automatic Semantic Extraction from Logs

A log message has a constant text field and variable fields that record information such as identifiers, amount of processed data, time spent, localities. Correspondingly, the text field and the variable

fields are outputs by the constant string and variables in the printing statement in the source code.

A log printing statement can be abstracted as a log key, in which the constant field is unchanged while the variable fields are represented by asterisk (*). Log keys is widely used by previous studies for the purpose of log analysis [189, 196, 87]. We use Spell [86] to extract log keys from log messages.

It consumes raw logs generated from the systems and adopts a longest string matching algorithm to 13 1 fetcher # 1 about to shuffle 1 fetcher # * about to shuffle output of map attempt_01 output of map * 2 [fetcher # 1] read 2264 bytes 2 [fetcher # *] read * bytes from map-output for attempt_01 from map-output for * 3 host1:13562 freed by 3 * freed by fetcher # 1 in 4ms fetcher # * in * red: entity blue: identifier green: value purple: locality

Figure 1.5: A real-world log snippet of MapReduce. identify the log keys.

However, a log key is a coarse grained abstraction of log messages since it does not distinguish the variable fields. IntelLog parses the log keys and analyzes the meanings of the variable fields by NLP-assisted approaches ( 4.2). We use a simplified real-world log snippet from MapReduce, shown in Figure 1.5, to introduce the terminologies that will be used in this thesis. The log snippet describes the subroutine instance of a fetcher that gets data from a remote node. The instance is a log sequence consisting of three log messages on the left side. Their corresponding log keys are on the right side. For each log key, IntelLog distinguishes four categories of information from the variables fields including entities, identifiers, values and localities marked in colors in Figure 1.5.

Then, the log keys are transformed to Intel Keys with key-value pairs representing each category of information. The sequence of the Intel Keys are called a subroutine in the HW-graph.

We observe three log features that offer opportunities to take advantage of NLP-assisted ap- proaches for workflow reconstruction and anomaly detection via log analysis.

First, we find that using natural languages to record system events is common in distributed systems. In this thesis, we define that a log message is written in a natural language if it contains at least one clause. We analyze over 300MB logs generated from three popular data analytics systems (MapReduce, Spark and Tez), a cluster resource management framework (Apache YARN) and a component of a cloud infrastructure (nova-compute in OpenStack [19]). Table 1.2 shows the number of lines and percentages of natural language logs in the five systems, which indicate that most of the logs are written in natural languages. Other logs usually record status of the systems such as access control and resource usages. The log feature inspires us that NLP-assisted 14 Table 1.2: Lines and percentages of natural language logs.

System NL logs total logs % of NL logs Spark 1,286,159 1,286,159 100% MapReduce 599,103 549,922 91.8% Tez 112,478 103,609 92.2% Yarn 156,196 159,995 97.6% nova-compute1 73,318 73,318 100% approaches can be applied to analyze log messages generated by systems. However, log messages have more identifiers and domain-specific words compared to free-form text written by human. This fundamental difference makes it challenging to apply NLP analysis on logs.

Second, although logs contain domain-specific words, we are able to find the relationship between correlated entities by analyzing the nomenclature. To be specific, correlated entities usually share common sub-phrase in their names. Thus, users know those entities are closely related when reading the logs. For example, Spark uses block to store data and uses BlockManager to do block man- agement. Finding such entities gives users a clear view of the correlated components in the system.

Nevertheless, it is time consuming for users to manually find such entities by key word searching due to the large number of logs. Furthermore, it is possible that entity names with a common sub-phrase are not correlated. Therefore, a more subtle algorithm should be used in order to distinguish such entities as well as group correlated entities.

Third, since logs are output by execution of the structured source code, the log sequences may follow specific orders [87]. However, the order is not strict in distributed systems due to concurrency.

Currently, existing log analysis tools [87, 189, 135] focus on the order between individual log messages of infrastructure-level distributed systems such as OpenStack [19] and HDFS [40]. However, they do not exploit the semantic knowledge in logs. The log sequences generated by one request in those systems are usually of a short and fixed length since the number of operations in the execution is deterministic. For example, OpenStack has eight most frequently used requests, each generating a fixed-length log sequence with an average length of nine [189]. This feature facilitates workflow reconstruction for infrastructure-level distributed systems.

1Nova-compute constantly outputs log messages with a fixed format that reports the current node resource usage regardless whether there are VM requests. We eliminate such messages and only calculate log messages that are related to VM requests. 15 executiong unit thread 1 thread 2 Multi-threading SMT Time

Figure 1.6: Multi-threading and simultaneous multi-threading.

On the contrary, logs generated by prevalent distributed data analytics systems (e.g., MapReduce,

Spark and Tez) are usually with various lengths and interchangeable orders. Different data sizes and configurations cause various log sequence lengths while parallel executions cause interchangeable orders. Our experiments show that the log sequence lengths of these systems range from tens to thousands ( 4.5.4), which makes existing log analysis approaches ineffective in distributed data analytics systems. By leveraging the nomenclature of entity names, IntelLog is able to find the ordering and hierarchical relationships between groups of entities. As a result, users can have both an overview and a detailed look of workflows in the systems.

1.2.3 SMT Interference Detection and Mitigation for Job Co-location

We show the execution of multi-threading and Simultaneous Multi-Threading (SMT) in Figure 1.6.

In a multi-threading processor, only one thread can run on execution units at a time. When the thread encounter a long waiting event or uses up its quota, the processor switch to another thread.

In contrast, SMT allows multiple hardware threads run on the execution units of the same physical core at the same cycle. This thesis targets on Intel’s implementation of SMT which is known as

Hyper-Threading technology. Hyper-Threading implements a 2-way SMT which makes a physical core appears as two processors to an operating system. Two thread contexts can be simultaneously kept on a Hyper-Threading core while sharing the same set of execution units. Hyper-Threading expects that combining and executing instructions from two threads increases the utilization of a physical core.

Previous studies [165, 162, 164] show that memory controller and bandwidth congestion are the 16 1.0

0.8

0.6 1 thread:1 core

CDF 0.4 2 threads:2 cores 2 threads:1 core 0.2 16 threads:16 cores 32 threads:16 cores 0.0 800 1600 2400 3200 4000 Time (us)

Figure 1.7: Memory access latency from different sources. main bottleneck for memory access latency. We find that these bottleneck are well addressed on a modern CPU. However, interference from Hyper-Threading degrades memory access latency.

Micro benchmark. We use a case study with a micro benchmark to illustrate the sources of memory access latency. We use the same server configuration as that in Section 5.4. The benchmark has a configurable number of threads pinned to individual logical processors. Each thread contin- uously sends memory requests to access a random 1MB block out of 600MB data. To identify the sources of memory access latency, we evaluate the latency in the following 5 cases: 1) 1 thread, 2)

2 threads on 2 cores, 3) 2 threads on 2 logical processors of the same core, 3), 4) 16 threads on 16 cores, and 5) 32 threads on 32 logical processors of 16 cores. 1) is used as the baseline. 2) and 3) are used to inspect the impact of Hyper-Threading. 4) and 5) are used to inspect the impact of whether memory controller or bandwidth is the bottleneck.

Figure 1.7 shows the CDF of the memory access latency in the five situations. Cases where mem- ory is accessed from individual physical cores renders an average latency about 1400 µ s regardless of the number of accessing threads. In these cases, memory controller congestion or memory bandwidth congestion has little impact on memory access latency since case 1), 2) and 4) have almost the same performance. Futhermore, case 3) and case 5) both use Hyper-Threading while case 5) has much more requests sending from 32 threads. If case 5) is bottlenecked by memory bandwidth, it should renders a even higher latency than that of case 3), which is not the case. These two cases scheduled on siblings using Hyper-Threading renders a higher average latency about 2300 µ s compared to 17 1.0 Alone 0.8 Co-separate Co-hyper 0.6

CDF 0.4

0.2

0.0 0 50 100 150 200 250 300 Time (us)

Figure 1.8: Latency of Redis under different settings. those without Hyper-Threading.

CPU scheduling gets even more complicated when latency-critical services are co-located with batch jobs. Since latency-critical services have bursts of incoming queries [115], statically allocating

fixed amount of resource usually results in either sub-optimal performance or resource wastage.

Job co-location with a real-world service. We use Redis, a real-world latency-critical service, to illustrate the significant impact of Hyper-Threading on query latency. Redis is running under three settings. 1) Redis runs alone. 2) Redis runs with batch jobs and they use separate physical cores. 3) Redis runs with batch jobs and batch jobs are allowed to use siblings of Redis cores. We use the Spark Kmeans workload from HiBench [114] as the batch job. In the two co-location settings, threads of Redis and threads of the batch job are pinned on different logical processors. We use workload-a from YCSB [77] to generate queries to Redis.

Figure 1.8 shows the CDF of the latency of Redis queries under different settings. Settings Alone,

Co-separate render similar latency for Redis queries. However, the latency is significantly prolonged with Co-hyper setting where queries are affected by Hyper-Threading. In the case study, the average

(99th percentile) query latency of Redis with Hyper-Threading interference is 2.0× (1.3×) as high as Co-separate. 18 1.2.4 Semantic Gap of Memory Allocation for Job Co-location

The famous malloc function call in Glibc is a unified interface for programs to allocate memory from Linux OS. A process conveniently obtains the address of the memory space without knowing the underlying mechanism by calling malloc. The function call uses two Linux system calls sbrk and mmap to serve memory requests of different sizes. Figure 1.9(a) shows the simplified address space of a process that includes memory chunks allocated by both system calls. We focus on the mechanisms in Glibc that manipulate the main heap space and mmapped memory chunks. Both kinds of memory are dynamically allocated at runtime.

System call sbrk. Each process has exactly one main heap that is a continuous virtual address space. Glibc divides the main heap into two areas: the allocated area and the top chunk. Glibc keeps track of the used and free space in the allocated area. Note the allocated area and the top chunk in Glibc are transparent to Linux OS. Following the allocated area lies the top chunk that is a continuous free address space. The end address of the top chunk is the program break set by the sbrk system call. Upon a request for a small size of memory (< 128 KB by default), Glibc first tries to find a free space in the allocated area. If it cannot satisfy the request, space is taken from the beginning of the top chunk and added to the allocated area. Once the top chunk is used up, Glibc expands the main heap by calling sbrk with the exact requested size. If the top chunk is greater than a certain threshold, Glibc shrinks the main heap by passing a negative number to sbrk.

System call mmap. Besides the main heap, a process can have multiple disjoint memory chunks allocated by mmap. This system call can either map a file to process address space or allocate anonymous pages. Glibc leverages the anonymous page usage to handle large memory requests (≥

128 KB by default). Upon success, it returns the starting address of the newly allocated mmapped memory chunk. Glibc gives the memory chunk to the process after a bookkeeping operation. When a process frees a memory space allocated by mmap, Glibc releases it directly back to Linux OS.

Upon return of both system calls, a process gets a virtual memory space while the corresponding physical memory does not necessarily reside in RAM at the moment. Linux OS constructs the virtual-physical address mapping only when the process writes to the allocated memory for the first 19 %&$%( ,-'!& ,-'!& &""-#..

mmapped%('#')+/% ,*'!#%2

mmapped%('#')+/% mmapped%('#')+/% ,*'!#%1 ,*'!#%1 ,-+$-&)( -#&'( ,-+$-&)( -#&'( 1&-/0&((&""-#.. 4 3(. -'5 /+,(!%0*' 4 3(. -'5 /+,(!%0*' ''%(%$#'* /+,( ''%(%$#'* /+,( &((+!(&-#& &""-#.. &((+!(&-#& &""-#..

(+2( 6-#.-0%"'-'0% ,,7( 6-#.-0%"'-'0% ,,7( &""-#.. .#$)#*/ .#$)#*/ (a) stage 1 (b) stage 2

Figure 1.9: Process address space in Linux. Shaded areas represent allocated virtual memory whose physical pages do not reside in RAM. Red fonts represent variables on Glibc. time. For example, in Figure 1.9(a), the process has a main heap and a mmapped memory space

1. In Figure 1.9(b), the process allocates a new mmapped memory space and writes data in the main heap. The newly mmapped memory space does not have corresponding physical pages yet, and the virtual-physical mapped space in the main heap expands. Two benefits come with the on-demand mapping construction. For Linux OS, physical memory pages are loaded for the actually used memory since physical memory is a scarce resource. For the process, it accelerates the memory allocation routine. The reason is that the mapping construction for all the virtual addresses requires loading all the physical pages at once, which takes a longer time than only returning the virtual address.

While usually fast, the on-demand virtual-physical mapping construction can be significantly delayed when there is insufficient physical memory in the node, which is common in a multi-tenant system. At this point, Linux OS starts to reclaim physical pages by either directly freeing them or swapping them onto disks.

In real-world latency sensitive services, latency spent in memory allocation during data insertion takes a large portion of latency of a whole workload. We take Rocksdb as a case study to illustrate that memory allocation latency is much higher compared to data read latency using both small (1KB) and large (200KB) requests. We use Glibc as the memory allocator and execute data insertion and read requests without any memory pressure. Experimental results are shown in Figure 1.10. For 20

0.99 0.99

0.95 0.95 CDF CDF

insert insert read read 0.90 0.90 0 5 10 15 20 25 30 0 4000 8000 12000 16000 20000 Time (µs) Time (µs) (a) Latency of small (1KB) requests (b) Latency of large (200KB) requests

Figure 1.10: CDF of a break-down query latency in Rocksdb.

1.0

0.8

0.6

CDF 0.4 idle system file cache pressure 0.2 anonymous page pressure 0.0 0 4000 8000 12000 16000 Time (ns)

Figure 1.11: CDF of the memory allocation latency. small requests, the memory allocation latency is 74.7% (54.5%) of the average (99th percentile) overall query latency. For large request, the ratio is 93.5% (97.5%) of the average (99th percentile) overall query latency. The impact of memory allocation is significant, and even more in large requests. As for data update requests, it renders similar results compared with read quests since they do not incur memory allocation.

We use another case study to demonstrate the memory allocation latency degradation under anonymous page pressure and file cache pressure. We use a micro benchmark that continuously sends 1KB-size memory requests until a total amount of 1 GB, using the default Glibc in a node with 128 GB RAM. We repeat this experiment under a dedicated system with sufficient memory, under anonymous page pressure, and under file cache pressure, respectively. The details of the micro benchmark and the node are described in Section 6.3. Figure 1.11 shows the CDF of the memory allocation latency under the dedicated system and two kinds of memory pressure.

Anonymous page pressure. To generate anonymous page pressure, we run a program that continuously sends memory allocation requests until the available memory in the node drops to about 300 MB. Note that the available memory cannot drop further below 300 MB due to the indirect and direct reclaim mechanisms of Linux OS. At this point, new memory allocation requests from the micro benchmark trigger the memory reclaim routine and cause swapping. Figure 1.11 shows that the memory allocation latency significantly increases under anonymous page pressure. The average allocation latency and the 99th percentile allocation latency under anonymous page pressure are prolonged by 35.6% and 46.6%, respectively, compared to those without memory pressure.

File cache pressure. We generate file cache pressure by loading 10 GB of files and sending memory allocation requests to occupy the rest of the system memory until free memory drops to about 300 MB. In this case, the memory reclaim routine starts but does not necessarily trigger swapping, since the file cache can be directly released without accessing the disk. Figure 1.11 shows that the memory allocation latency under file cache pressure is lower than that under anonymous page pressure, but it is still higher than that under a dedicated system. The average allocation latency and the 99th percentile allocation latency under file cache pressure are prolonged by 10.8% and 7.6%, respectively, compared to those without memory pressure.

Memory pressure significantly prolongs memory allocation latency, which has a non-trivial impact on SLO violations. We target both kinds of memory pressure and aim to reduce the memory allocation latency of latency-critical services in a co-located system as well as in a dedicated system.

Linux OS emulates an LRU-like (Least Recently Used) algorithm for physical memory page reclaim by keeping four lists: active anon and inactive anon for anonymous pages, and active file and inactive file for file cache pages. The two active lists contain recently used pages while the two inactive lists contain pages that have not been used recently. Under memory pressure, Linux OS scans through these four lists, updates page usage status, moves pages between lists, and selects pages to reclaim. Specifically, Linux OS keeps three memory watermarks (i.e., high, low, and minimum) to guide the memory reclaim routine. When available memory drops below the low watermark, a page reclaim thread runs until available memory is larger than the high watermark. When available memory further drops below the minimum watermark, each memory request goes through a synchronous direct memory reclaim routine before the physical memory is allocated.
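The per-zone watermarks can be inspected from user space; the following is a rough sketch that parses /proc/zoneinfo (Linux-specific; the 4 KiB page size is an assumption that should be checked with getconf PAGESIZE).

import re

PAGE_SIZE = 4096   # assumed page size in bytes

def zone_watermarks(path="/proc/zoneinfo"):
    """Return {zone: {"min"/"low"/"high": bytes}} parsed from /proc/zoneinfo."""
    zones, current = {}, None
    with open(path) as f:
        for line in f:
            header = re.match(r"Node (\d+), zone\s+(\w+)", line)
            if header:
                current = f"node{header.group(1)}/{header.group(2)}"
                zones[current] = {}
                continue
            mark = re.match(r"\s+(min|low|high)\s+(\d+)", line)
            if current and mark:
                zones[current][mark.group(1)] = int(mark.group(2)) * PAGE_SIZE
    return zones

for zone, marks in zone_watermarks().items():
    if marks:
        print(zone, {k: f"{v / 2**20:.1f} MiB" for k, v in marks.items()})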

However, the page reclaim algorithm in Linux OS is inefficient for latency-critical services in a multi-tenant system. The watermarks are conservatively set at around 1‰ of a zone's memory. For example, the total capacity of a memory zone in one of our physical nodes is 60 GB, and the low and high watermarks are 53 MB and 64 MB, respectively. Since both latency-critical services and batch jobs tend to consume hundreds of megabytes or gigabytes of memory, the watermarks are too small to trigger the indirect memory reclaim thread in time, and the direct memory reclaim routine causes even more delay on memory requests. After a process finishes, all of its anonymous pages are reclaimed immediately. However, the file cache pages loaded by the process are not reclaimed by Linux OS and remain in memory. They are only reclaimed upon memory pressure by the reclaim routine, which prolongs new memory requests. The memory pressure cannot be relieved simply by increasing the watermarks. Although with higher watermarks Linux OS triggers the memory reclaim routine while there is still much free memory, it does not distinguish latency-critical services from batch jobs, so memory from both kinds of workloads can be reclaimed. In this case, the performance of latency-critical services is still degraded.

1.3 Challenges

1.3.1 Challenges in Analyzing a Massive Amount of Logs

Extensive studies focus on anomaly and bug detection. One major method is plain log reconstruction and analysis [196, 74]. It needs to consider the tradeoff between the effectiveness and overhead. In other words, extracting more logs improves the effectiveness but also introduces more overhead.

Another popular method is intrusive tracing [46, 141]. This approach ensures the accuracy since the information is extracted from inside the software. The essential problems, however, are that 1) users need to have a deep understanding of the targeted systems, and 2) modifications to one version of the source code will be invalid after the targeted system is updated.

Besides the methods above, few studies utilize resource metrics to detect anomalies and bugs [80]. One reason is that resource metrics are usually collected at per-machine granularity. As multiple processes usually run on one machine simultaneously, such resource metrics are inaccurate for troubleshooting a specific process. For each process, it is straightforward to access the CPU and memory usage, but I/O metrics of a process are hard to obtain.

Lightweight virtualization techniques, such as Docker [5] and LXC [15], have been gaining increasing interest in both academia and industry due to their higher performance and lower overhead as well as competitive isolation compared to traditional virtualization such as KVM [123] and VMware. An additional opportunity provided by lightweight virtualization is that it allows access to more fine-grained resource metrics. Furthermore, each lightweight virtual container has a unique identifier. By leveraging the identifiers, we can distinguish the resource usage of different applications.
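For example, per-container metrics can be read directly from the container's cgroup files. The sketch below assumes the cgroup v1 hierarchy that Docker creates by default (paths differ under cgroup v2 or the systemd cgroup driver); it is an illustration of the opportunity, not necessarily how LRTrace collects metrics.

import pathlib

CGROUP = pathlib.Path("/sys/fs/cgroup")

def container_metrics(container_id: str) -> dict:
    """Read a few per-container counters from the (assumed) Docker cgroup v1 layout."""
    mem = CGROUP / "memory" / "docker" / container_id / "memory.usage_in_bytes"
    cpu = CGROUP / "cpuacct" / "docker" / container_id / "cpuacct.usage"
    blk = CGROUP / "blkio" / "docker" / container_id / "blkio.throttle.io_service_bytes"

    metrics = {
        "memory_bytes": int(mem.read_text()),
        "cpu_time_ns": int(cpu.read_text()),
    }
    # blkio reports one "<major:minor> <op> <bytes>" line per device and operation
    total_io = 0
    for line in blk.read_text().splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[1] in ("Read", "Write"):
            total_io += int(parts[2])
    metrics["disk_io_bytes"] = total_io
    return metrics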

In this thesis, we propose and develop LRTrace, a distributed troubleshooting and feedback control tool. LRTrace correlates both logs and the fine-grained resource metrics that are made possible by lightweight virtualization. LRTrace focuses on profiling data analytics frameworks such as Spark [192] and MapReduce [82]. It helps users reconstruct the workflows, understand the frameworks, find anomalies, and locate their root causes. Furthermore, LRTrace has a pluggable user-defined feedback control component. Based on the runtime information of applications collected by LRTrace, users can write plug-ins to control the behavior of applications and thus manage the cluster in a semi-automatic way.

Extracting and reconstructing logs of distributed systems are, however, non-trivial due to the following reasons:

• Profiling within one log message: a log message usually contains multiple types of information, such as timestamps, object identifiers, and the amount of processed data. Also, the format of log messages varies. To parse one log message, the tracing tool needs to identify the different fields and their corresponding meanings within the message.

• Profiling across log messages: the same object appears in different log messages indicating different events. For example, an object may be recorded in three log messages. The first one suggests the start of the object, the second one records the amount of data processed by the object, and the third one is the end mark of the object. To reconstruct the workflow, we need to identify the same object that appears in different log messages.

1.3.2 Applying Natural Language Processing on Distributed System Logs

Log analysis tools can generally be classified into two categories: information inference and automatic anomaly detection. Information inference tools [196, 163, 161, 197] help users understand the targeted system more efficiently but leave the burden of anomaly analysis to users. They require users to have domain knowledge of the systems. To achieve automatic anomaly detection, there are tools that leverage machine learning and automaton methods [87, 148, 189]. Those tools can report potential problems and reduce user effort in log analysis. However, they focus on infrastructure-level distributed systems (e.g., OpenStack [19] and HDFS [40]) which have log sessions with relatively fixed lengths. More importantly, the existing tools of both categories do not exploit the semantic knowledge in logs, which is key to workflow construction and analysis for distributed data analytics systems.

Our work is inspired by the original goal of system logs: logs are for users to read. In other words, logs are usually written in natural languages by system designers. Leveraging semantic information in logs for workflow construction and troubleshooting not only provides users a clear view of the workflows of systems, but also reduces the user efforts for manual analysis. In order to extract semantic knowledge from log messages, our intuition is that natural language processing (NLP) should be used to process logs that are written in natural languages.

However, there are technical challenges to developing NLP-assisted workflow reconstruction and anomaly detection. First of all, it is difficult for existing NLP tools to extract useful information from log messages, including events, identifiers, numeric values, and locality information. Spell [86] uses log messages to build log keys that generally represent the types of events in the systems. However, it is not semantic-aware: it does not distinguish identifiers, numeric values, and locality information but treats them all as variables, which is coarse-grained. In order to extract such information in a fine-grained manner, we need to exploit the semantic knowledge in log messages.

The second challenge lies in finding relationships between objects and events in the targeted systems. Stitch [196] uses an identifier-based approach to build relationships between objects, which can only be applied to logs conforming to a certain principle. The results only contain identifiers that require domain knowledge to understand. Moreover, correlated objects recorded in logs do not always have identifiers. Ignoring such log messages can lead to an incomplete reconstructed workflow. NetSieve [163] extracts key information from networking maintenance tickets. However, it focuses only on the networking area, which has no workflow to construct. As a result, it is not applicable to workflow construction and troubleshooting in distributed data analytics systems.

The third challenge is that one request in distributed data analytics systems can generate logs with various lengths and orders. Currently, tools that perform automatic detection target infrastructure-level distributed systems [189, 87] such as cloud infrastructures (e.g., OpenStack) or distributed file systems (e.g., HDFS). A request in such systems usually outputs a log message sequence with a short length and a relatively fixed order, which makes it possible to predict the next message in the sequence. For example, an infrastructure-level distributed system generates the log sequence ‘A,B,C’ every time it processes the same request. In contrast, in a data analytics system, the same request may generate log sequences such as ‘A,[B,C]*,(DE)*’ (we use regular expressions to represent the log sequence). The log sequences are usually long and have various lengths due to factors such as data sizes and the quantity of resources. Thus, the prevalent tools are not effective in workflow construction for distributed data analytics systems.
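To make the notation concrete, the sketch below reads the informal patterns above as ordinary regular expressions over comma-separated event types (the exact reading of ‘(DE)*’ is an assumption); the point is only that the analytics-style pattern admits log sequences of many different lengths for the same request.

import re

infra_pattern = re.compile(r"A,B,C")                 # fixed-length log sequence
analytics_pattern = re.compile(r"A(,[BC])*(,D,E)*")  # variable-length log sequence

print(bool(infra_pattern.fullmatch("A,B,C")))                  # True
print(bool(analytics_pattern.fullmatch("A,B,C,B,C,D,E,D,E")))  # True
print(bool(analytics_pattern.fullmatch("A,D,E")))              # True: same request, shorter log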

1.3.3 Quantifying SMT Interference and Adjusting Cores

In industry, multi-tenant services usually enable Simultaneous Multi-Threading (SMT) technology to improve server throughput [1, 12, 8]. In academia, there are also active studies on CPU scheduling for SMT servers [188, 45, 89, 167]. By enabling SMT, instructions from different threads are able to utilize the execution units of one physical core in the same cycle to improve CPU utilization. However, when instructions from different threads compete for the same shared execution units, the interference on memory access incurs performance degradation on each thread running on the physical core.

Latency-critical services commonly distribute requests across many servers, thus the end-to-end response time is determined by the slowest individual latency [81, 199, 36, 92]. We find that SMT interference incurs non-negligible memory access delay, which can significantly increase the query latency, particularly the tail latency, of co-located latency-critical services and jeopardize their SLOs.

Our experiments with co-located real-world latency-critical services and batch jobs show that SMT interference on memory access can increase the average query latency by up to 2× and the 99th percentile query latency by up to 2.5× compared to when the latency-critical services run alone on a server. Thus, efficient job co-location calls for an adaptive resource scheduler that takes into account the interference on memory access introduced by SMT. Specifically, the scheduler should satisfy the following requirements for efficient job co-location.

• Latency-critical services have high priority under job co-location. Their query latency should be close to that when the services run alone on a server. This is the principle of job co-location.

• When the principle is satisfied, the scheduler should seek to improve server utilization and batch job throughput while incurring low overhead.

• The scheduler should be transparent and generally applicable to all applications. To satisfy this, the source code of applications should not be modified.

Challenge I: SMT interference measurement. There is no existing quantitative metric to measure SMT interference on memory access. At first glance, CPU usage might seem to be an indicator of SMT interference, since memory access must increase CPU usage. However, a high CPU usage does not necessarily imply a large number of memory accesses or SMT interference, since workloads can be computation-intensive. Another naive measurement approach is using a probing process that periodically accesses memory from cores and records the latency. Although the obtained latency could be a quantitative indicator of SMT interference, this approach presents a trade-off between measurement accuracy and overhead. A higher probing frequency leads to higher accuracy but also severe competition for shared execution units. Furthermore, the probing process needs to occupy a chunk of physical memory, which interferes with latency-critical services.

We have shown that Hyper-Threading interference causes significantly prolonged memory access latency; however, there is no existing quantitative metric to directly indicate the interference. We identify two approaches that could be used to indicate the interference, but both have their own defects.

Approach 1: exporting service latency. A straightforward approach exports query latency from latency-critical services and uses it as an indicator for modeling Hyper-Threading interference. This approach is easy to implement but has its fundamental issues. First, it is intrusive to services or applications built upon the services. Their source code needs to be modified to export the query latency. Second, the service-level query latency is not accurate for detecting Hyper-Threading interference since the latency is affected by multiple factors such as inner procedures of services, configuration changes, and size of queries. A model should be built for each individual service, which is not generally applicable.

Approach 2: sending probe requests. Hyper-Threading interference on memory access can be observed by periodically sending a probing request. By comparing the latency of the request to that under an idle core, we are able to determine whether there is Hyper-Threading interference. However, the probing process requires significant CPU quota on each core and extra physical memory, which interferes with both latency-critical services and batch jobs.
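A rough sketch of such a probing process is shown below (for illustration only; the buffer size, access count, and probing period are arbitrary choices, and interpreter overhead inflates the absolute numbers, so only relative changes would matter). It exhibits exactly the drawbacks discussed above: it burns CPU on the probed core and pins tens of megabytes of memory.

import array
import random
import time

BUF_WORDS = (64 * 1024 * 1024) // 8                   # 64 MiB, larger than a typical LLC
buf = array.array("q", [0]) * BUF_WORDS
indices = random.sample(range(BUF_WORDS), 100_000)    # random accesses defeat prefetching

def probe_once():
    start = time.perf_counter_ns()
    total = 0
    for i in indices:
        total += buf[i]
    return (time.perf_counter_ns() - start) / len(indices)   # ns per access (incl. overhead)

for _ in range(10):
    print(f"observed access latency: {probe_once():.1f} ns")
    time.sleep(0.1)   # probing period: accuracy vs. interference trade-off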

Challenge II: Adaptive core allocation. Co-located batch jobs can use transiently idle resources, but they should not impact the performance of latency-critical services. One challenge is deciding the amount of workload that can be scheduled on the sibling of a processor that serves latency-critical services. Specifically, both latency-critical services and batch jobs have three phases during their lifetime: launching, running, and exiting. The scheduling algorithm should dynamically adjust the amount of workload on processors based on process phases and the profiled memory access latency on each processor. Meanwhile, the algorithm needs to guarantee the performance of latency-critical services while keeping the sibling core busy.

Currently, many studies on job co-location do not consider the impact of SMT [58, 115, 198, 118, 186]. For example, PerfIso [115] is a representative approach that leverages multicore servers to efficiently share CPU resources between latency-critical services and batch jobs. It enables SMT but does not take its impact into account, which significantly degrades the performance of latency-critical services. There is also research on improving application performance on servers with SMT enabled [188, 45, 89, 167]. These studies mainly target computation-intensive workloads. A recent study, vSMT-IO [119], is a hypervisor-level approach that optimizes I/O-intensive workloads on SMT servers. Nevertheless, job co-location for workloads that frequently access memory has not been studied in depth.

1.3.4 Requirements for Fast Memory Allocation

There are mainly two categories of research on improving performance for latency-critical services.

Studies [36, 134, 199, 43] improve performance for latency-critical services by leveraging their runtime characteristics. For example, ROLP [43] is a runtime object lifetime profiler for efficient memory allocation and garbage collection for latency-critical services. However, these studies do not take job co-location into consideration. Studies of the other category [115, 198, 118, 186] target co-location of latency-critical services with other jobs. For example, PerfIso [115] and Dirigent [198] are two representative approaches that leverage multicore systems to efficiently share CPU resource between processes. Our work falls into the second category.

While existing efforts try to push resource utilization to the limit, memory management for latency-critical services still faces significant challenges. First, the runtime behavior of a job is difficult to predict. In particular, it is difficult to obtain the amount of memory that will be requested by a job in the future. Second, it is expensive to reclaim physical memory that is occupied by a process. If a process requests more memory when the node memory is almost used up, swapping will be triggered to make space for the requested memory. However, swapping is an expensive operation that takes a long period of time (tens of milliseconds to seconds) or even leads to thrashing. In such cases, the performance of latency-critical services is significantly degraded.

Since the original purpose of a dedicated system is sole use by latency-critical services, ideally their performance should not be affected by batch jobs. In a shared environment, memory is frequently allocated and reclaimed due to the provisioning of various workloads. However, the memory reclaim mechanism in Linux OS significantly degrades the performance of latency-critical services under memory pressure, which makes co-location inefficient or even ineffective. In light of these challenges, we tackle the problem from a new principle: resource slacks should be reserved for latency-critical services in case of a burst of resource requests (a minimal sketch of this reservation idea follows the requirement list below). PerfIso [115] is a preemptive approach that adopts this principle to achieve CPU sharing between latency-critical services and batch jobs. However, data in memory can only be preempted by swapping it onto disks, which is a very expensive operation. We aim to materialize the principle to achieve fast memory allocation for latency-critical services in a multi-tenant system. Our experiments find that memory allocation latency takes up to 97.5% of the whole query latency. Thus, we focus on reducing the memory allocation latency for latency-critical services. The design should meet the following requirements:

• R1 Latency-critical services have the highest priority. This is the primary principle. Best-effort batch jobs can share idle resources only if they do not affect the performance of latency-critical services.

• R2 Memory should be allocated in a fast manner. This is the key to achieving low latency for latency-critical services when they request memory. Fast memory allocation should be achieved even in the absence of memory pressure.

• R3 The design should be generally applicable to all applications written in a popular language such as C/C++. That is, the source code of applications should not be modified.

• R4 The overhead should be low. In other words, it should consume little of a node's resources.
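To illustrate the reservation principle only (this is not the Glibc-level design; the chunk size and slack depth are arbitrary), the sketch below keeps a slack of pre-faulted anonymous mappings so that a latency-critical request never pays for page faults, reclaim, or swapping on its critical path.

import mmap

CHUNK = 1 << 20   # 1 MiB chunks (hypothetical granularity)
SLACK = 64        # number of pre-faulted chunks kept in reserve

def prefault_chunk():
    """Allocate an anonymous mapping and touch every page so that the
    virtual-physical mappings are constructed ahead of time."""
    buf = mmap.mmap(-1, CHUNK)
    for off in range(0, CHUNK, mmap.PAGESIZE):
        buf[off:off + 1] = b"\0"
    return buf

reserve = [prefault_chunk() for _ in range(SLACK)]

def fast_allocate():
    """Fast path for a latency-critical request: pages are already resident,
    so no reclaim or swapping happens on the critical path. The reserve is
    refilled off the critical path (e.g., by a background thread, omitted here)."""
    return reserve.pop() if reserve else prefault_chunk()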

CHAPTER 2

RELATED WORK

2.1 Troubleshooting for Distributed Systems

2.1.1 Log analysis

Logs must be collected before log analysis for distributed systems. There are many tools that collect logs from different components and different nodes of distributed systems [2, 161, 14, 74, 53]. Flume [2] is a popular tool that collects, aggregates, and moves large amounts of log data based on streaming data flows. Our work [161] collects both logs and resource metrics for distributed systems in lightweight virtualized environments.

There are extensive studies on log analysis. Applying learning methods to logs generated by the targeted system is a hot topic [80, 87, 122, 135, 148, 153, 183, 189, 182]. The goal of learning approaches is to automatically report potential anomalies to users, which reduces the effort required for manual analysis. One approach [183] extracts fields in log messages by inferring the source code, and applies Principal Component Analysis to detect anomalies. Draco [122] uses a “top-down” approach and compares failed runs with successful runs to detect anomalies.

PerfScope [80] categorizes system calls into execution units, and uses a light-weight learning method to detect anomalies. A recent effort DeepLog [87] adopts a Long Short-Term Memory model trained by 1% of the existing logs, and has the ability to identify anomalies in the remaining 99% of logs.

For machine learning approaches, the training accuracy relies on 1) the quality of the training data, and 2) the parameters in the machine learning models. Also, they only report potential errors and leave root cause analysis to users. CloudSeer [189] uses an automaton-based approach to reconstruct the workflow of a cloud infrastructure. Since the length and order of the log sequences generated by the cloud infrastructure are relatively fixed, it can accurately capture the workflow and detect anomalies. However, it cannot be applied to distributed data analytics systems since the lengths and orders of logs in such systems can vary a lot.

Studies [39, 197, 196, 38, 93, 138, 190] focus on using logs to reconstruct the workflow. Our work also falls into this category. By leveraging the reconstructed workflow, users can have a clear view of the system components that serve user requests. CSight [38] builds a communicating finite state machine model of the targeted system by leveraging the partially ordered events in logs.

Iprof [197] is a representative. Static analysis provides more accuracy as well as less overhead when reconstructing the workflow, since it is aware of the corresponding locations of logging statements in the source code while remaining non-intrusive. However, it needs to perform static analysis every time the targeted systems are upgraded, which is similar to intrusive approaches. It also relies on bytecode, a feature of languages running on the JVM such as Java and Scala; systems written in C/C++ need further processing to support static analysis.

Stitch [196] uses an identifier-based approach to construct the workflows of systems, leaving the rest of the information unused. This work requires logs to follow a certain principle, which does not always hold for the logs generated by every system. Furthermore, identifiers are usually abbreviations, which makes it difficult for users to understand the system. A fundamental drawback is that it requires manual analysis to detect anomalies, which is only practical with small-sized jobs (e.g., jobs with fewer than 50 sessions).

NetSieve [163] uses an NLP based approach to extract semantic information from network tickets.

A fundamental difference between NetSieve and our proposed work is that NetSieve focuses on logs hand-written by network administrators. Such logs only record maintenance information and do not contain workflow information for reconstruction.

Different from those tools, our work extracts the semantic knowledge in logs and leverages it to reconstruct the workflows of the targeted data analytics systems. The semantic knowledge is essential for users to understand the systems. When an anomaly occurs, our work infers not only the problematic requests but also the potentially erroneous components, which significantly helps users locate the root causes.

2.1.2 Intrusive troubleshooting

Quite a few tracing tools [46, 35, 141, 146, 33, 48, 52, 75, 79, 91, 109] adopt this approach to obtain workflow-related information from inside the targeted systems. The key advantage of intrusive approaches is that the extracted information can be unambiguously identified, which increases the accuracy of workflow reconstruction. Since a user request usually invokes services from different layers, tools like Magpie [35] further apply this approach to cross-layer systems. This provides an overview of the user request, which facilitates the debugging process. Whodunit [48] uses a novel algorithm to profile transactions that flow through shared memory across different layers in a web server. However, some intrusive tools bring considerable overhead since they always collect all the defined information as long as the targeted systems are running.

To solve this problem, DTrace [46] first proposed dynamic instrumentation. Dynamic instrumentation leverages techniques like Aspect-Oriented Programming to dynamically extract the intended information. A representative tool in this category is Pivot Tracing [141]. It requires minor modification to the source code of targeted systems and dynamically installs trace points to collect only the information intended by users. It brings almost no overhead when turned off. However, in order to specify the intended information, users have to have a comprehensive understanding of the source code of the targeted systems. For example, if a user wants to obtain the value of a certain variable, the user must know the name and the code section of the intended variable. Therefore, this approach is only suitable for software developers and experts of the targeted systems.

A recent study, AUDIT [140], adopts both log analysis and the intrusive approach. It sets triggers in the targeted systems and inserts more logging statements when the triggers are fired by anomalies. In this way, the additional logging statements only introduce overhead when the systems need maintenance because of anomalies.

Fundamentally different from intrusive approaches, our work is non-intrusive. It does not require modification or access to the source code of the targeted systems. A non-intrusive approach is more ubiquitous since most systems output logs while fewer systems make their source code available.

2.1.3 Core Dump and Record/Replay Approaches

A large number of studies focus on debugging and replaying failures [78, 181, 193, 155, 72, 47, 145, 120, 113, 194]. REPT [78] enables reverse debugging of software failures in deployed systems. It can recover data from a control flow trace and a memory dump by performing forward and backward execution iteratively with error correction. POMP [181] is an automatic root cause analysis tool based on a control flow trace and a memory dump. It handles missing memory writes by running hypothesis tests recursively. ProRace [193] attempts to recover data values based on the control flow logged by Intel PT and register values logged by Intel Processor Event Based Sampling.

PRES [155] and HOLMES [72] record execution information to help debug failures. PRES performs state space exploration using the recorded information to reproduce bugs.

HOLMES performs bug diagnosis purely based on control flow traces. Castro et al. [47] proposed a system that performs symbolic execution on a full execution trace to generate a new input that can lead to the same failure.

As for failure record/replay techniques, Castor is a recent record/replay system that relies on commodity hardware support as well as instrumentation to enable low-overhead recording. It works efficiently for programs without data races. Ochiai [145] and Tarantula [120] record failing and successful executions and replay them to isolate root causes. H3 [113] uses a control flow trace to reduce the constraint complexity for finding a schedule of shared data accesses that can reproduce a failure. Pensieve [194] is a tool for automatically reproducing failures in complex distributed systems. Its event chaining analysis is based on the Partial Trace Observation, which gives rise to a debugging strategy that aggressively skips code paths to produce a trace containing only relevant prior causes.

2.2 Performance Improvement for Distributed Systems

2.2.1 Job Co-location and Resource Sharing

In production environments, a cluster is usually shared among multiple frameworks and applications. Therefore, efficient resource allocation and sharing becomes a problem that attracts much attention. The goals are to host as many applications as possible while guaranteeing that each application gets enough resources to execute.

Some studies use a preemption-based technique to make applications pause for a while and then resume execution [34, 73, 58]. These tools use a memory dump to store applications onto disks. The resources from paused applications can be assigned to other applications with higher priority. After the other applications finish, the tools resume the paused application from the point where it left off. Instead of killing applications, this technique preserves the execution context of the application. However, it incurs additional I/O overhead since a paused application needs to be stored onto disks.

Other studies propose a consolidation-based technique for a node to host more applications [144, 187, 137, 195, 103, 108, 101, 105, 57, 56, 55, 54, 67, 71, 184, 106, 126]. These studies use profiling-based or historical execution-based approaches to assign free resources to applications such that QoS can be satisfied. These techniques work well when the cluster hosts recurring applications that have similar execution times and resource consumption. However, job submission and resource consumption are difficult to predict, so these approaches may not allocate enough resources to applications without interference.

Addressing memory pressure is a hot topic in resource management. A number of efforts focus on garbage collection for managed languages such as Java and Scala. FACADE [151] bounds the number of objects created in big data applications so that it reduces the memory management overhead of a JVM. Yak [150] separates the JVM heap into a control space and a data space. Since modern big data applications use a large amount of memory to store intermediate data, Yak allows a JVM to use a specific area to store such data. Thus, garbage collection (GC) threads only scan the control space, which reduces GC overhead. Broom [97] uses a region-based GC algorithm: since data in big data applications have clear life cycles, they are put into different regions based on their lifespans, which facilitates GC. Other studies target dynamically adjusting the memory size of applications. ElasticMem [172] includes a technique to dynamically change JVM memory limits, an approach to model memory usage and garbage collection cost during query execution, and a memory manager that performs actions on JVMs to reduce total failures and run times. Study [116] shows that a moderate data spill actually helps jobs finish faster, since a smaller memory requirement reduces the queuing delay of applications.

Study [149] designs a buffer pool for relational databases in a multi-tenant environment. DB2 [168] automatically tunes the memory between different memory consumers, providing stability under system noise. Neither tackles key-value stores. There are studies on efficient memory management for applications running in JVMs that leverage the runtime characteristics of applications [43, 151, 150, 97]. Broom [97] proposes region-based memory management for data analytics applications. Facade [151] statically bounds the number of objects allocated to threads for efficient memory management in JVMs. Yak [150] creates a new region in the JVM heap to store long-lived data. ROLP [43] is an object lifetime profiler for efficient garbage collection. Our work focuses on latency-critical services that use C libraries.

The GNU C Library provides ptmalloc [7] as the memory allocator for C/C++ programs. There are other memory allocators [88, 37, 28] that focus on different design objectives. Jemalloc [88] emphasizes fragmentation avoidance and is the default memory allocator for FreeBSD [29]. Hoard [37] is a scalable memory allocator that largely avoids false sharing and is memory efficient. TCMalloc [28] supports efficient memory allocation for multi-threaded processes. Although our work is implemented in Glibc, its principle can be integrated into those memory allocators.

2.2.2 Cluster Scheduling

There are a large number of studies on cluster scheduling for data-intensive applications [41, 84, 90, 95, 98, 99, 117, 154, 191, 171, 112, 136, 76, 61, 62, 66, 63, 65, 102, 107, 132, 131, 104, 64, 69, 68, 70, 60, 100, 158, 175, 174, 173, 166, 180, 176, 178, 177, 179]. Yarn [171] and Mesos [112] are two commonly used schedulers that allocate resources to different frameworks in a cluster. Yarn uses a two-layer scheduling mechanism: applications request resources from Yarn, and after Yarn returns available resources, applications use them to start their execution units and schedule tasks in these units. Mesos uses another resource allocation strategy called resource offers.

In this case, applications can negotiate resource demands with Mesos and decide whether to accept or reject resource offers.

Besides, there are cluster schedulers for special purposes. Quasar [84] targets high cluster utilization. Instead of users requesting resources, Quasar analyzes the impact of an allocation on the cluster: users specify the QoS of applications, and Quasar decides the quantity of resources allocated to the applications. Jockey [90] estimates the remaining time of applications and dynamically adjusts their resources in order to meet latency SLOs.

As for scheduling fairness, “delay scheduling” tackles the tension between fairness and data locality. When the job that should be scheduled next according to fairness cannot launch a local task, it waits for a small amount of time. It achieves most of the data locality with a small loss of fairness, and reduces job execution time. Carbyne [98] allows applications to yield fractions of their allocated resources without impacting their own completion times; the collected leftover resources are used to meet secondary goals. DRF [95] is a fair sharing model that generalizes max-min fairness to multiple resource types. DRF allows cluster schedulers to take into account the heterogeneous demands of datacenter applications, leading to both fairer allocation of resources and higher utilization than existing solutions that allocate identical resource slices (slots) to all tasks.

Due to the increasing number of nodes in a cluster, scheduling delay becomes an overhead for centralized schedulers because of the limited computational capacity of the master nodes. Since centralized schedulers record the status of every node in the cluster, scheduling decisions can be slow if they compute the optimal nodes for allocation every time a request arrives. To overcome these drawbacks, there are studies on distributed schedulers. Sparrow [154] is a representative distributed scheduler that is able to schedule millions of tasks per second on appropriate machines while offering millisecond-level latency and high availability. Sparrow has multiple independent scheduler instances in a cluster. When an application submits a request to a scheduler instance, the scheduler randomly probes a set of nodes and returns the nodes with the most available resources to the request. Apollo [41] takes into account job historical data and minimizes task completion time. Apollo enables each scheduler to reason about future resource availability and implements a deferred correction mechanism to effectively adjust suboptimal decisions dynamically.

2.2.3 Optimization for Latency-Critical Services

Extensive efforts focus on reducing query latency for latency-critical services [186, 111, 110, 147, 115, 36, 134, 199, 124, 96, 59, 51, 128, 129, 157, 156, 133, 130, 127]. For example, Bubble-flux [186, 125] is a profiling-based framework for co-location of latency-critical services and batch jobs based on job runtime information. Tail-control [134] develops a work-stealing scheduler for optimizing the number of requests that meet a target latency. MittOS [111] tackles the tail latency for distributed file systems where the bottleneck is disk I/O. It predicts whether the I/O SLOs can be met; if not, it rejects the I/O request so that the request can be redirected to another, less busy server. EvenDB [96] is a recent LSM-tree KV store that is optimized for data with spatial locality. FlatStore [59] is a recent KV store that uses a log in persistent memory for efficient data requests.

There are also efforts on resource sharing and workload prioritization for latency-critical applications. FastTrack [110] targets mobile devices and improves the response time of foreground applications. PerfIso [115] uses native CPU isolation to achieve CPU sharing between latency-critical services and batch jobs. It reserves a set of CPUs as a slack for latency-critical services; when the services have a burst of traffic, the reserved CPUs can be immediately allocated to the services to guarantee SLOs. RobinHood [36] dynamically reallocates the cache between cache-rich and cache-poor applications. The idea is that some workloads are not sensitive to cache size or do not use much cache, so this cache space can be reallocated to workloads that demand more cache. CurtailHDFS [147] manages the tail latency in datacenter-scale distributed file systems. It proposes two techniques called “fail fast” and “smart hedging”. “Fail fast” dynamically replaces the slowest server in a data write pipeline and uses existing fault-tolerance mechanisms to handle the failed request. “Smart hedging” sends multiple read requests and accepts the fastest returned request.

There are many efforts on optimizing the performance of applications running on SMT servers [188, 45, 89, 167, 143, 85, 119, 92]. For example, Elfen [188] introduces principled borrowing to efficiently co-locate latency-critical services and batch jobs on SMT processors. However, it is an intrusive approach and targets computation-intensive workloads. Stretch [143] is a hardware-level approach that partitions the reorder buffer to improve the performance of batch jobs. Caladan [92] dynamically pauses/resumes threads of batch jobs running on the siblings of an SMT core based on timeouts from latency-critical services. Our work differs from previous work mainly in two aspects. First, it quantifies the SMT interference on memory access. Second, it is a user-space approach that does not require modification to either applications or Linux OS.

CHAPTER 3

PROFILING DISTRIBUTED SYSTEMS IN LIGHTWEIGHT

VIRTUALIZED ENVIRONMENTS WITH LOGS AND RESOURCE

METRICS

3.1 Keyed Message

Logs generated by distributed systems record rich information including system initialization, lifespans of objects, current states of the system, error messages, etc. However, even experienced programmers have to spend a long time in locating the root cause of anomalies when using raw logs [196].

Our observation indicates that using raw logs is not an efficient method for manual analysis:

• Information in log messages is rich but also miscellaneous. Since logs record all kinds of information, a log message of interest is usually surrounded by other irrelevant ones. Manually locating such a log message is time-consuming and inefficient.

• Workflows recorded in logs are not structured. Popular logging tools, e.g., log4j [4] and slf4j [23], output log messages in a time-ordered manner. Messages recording events of the same object are usually separated by other messages. For example, Yarn ResourceManager outputs log messages when the application state transitions from one to another. If multiple applications run simultaneously, the state transition messages of different applications may be intertwined.

To address the problems, we propose the keyed message, which can be transformed from a raw log message by applying simple rules. Keyed messages provide the ease of aggregating messages with the same key, as well as the flexibility for users to define their own keys.

Field        Description
key          the key assigned to a message
identifiers  to identify the object in the message
value        a numeric variable to store the value in the message
type         type of the message, instant or period
is-finish    whether the message indicates object lifespan end
timestamp    the time when the message is written

Table 3.1: Description of fields in a keyed message.

1 Got assigned task 39
2 Running task 0.0 in stage 3.0 (TID 39)
3 Got assigned task 41
4 Running task 1.0 in stage 3.0 (TID 41)
5 Task 39 force spilling in-memory map to disk and it will release 159.6 MB memory
6 Task 41 force spilling in-memory map to disk and it will release 180.0 MB memory
7 Finished task 0.0 in stage 3.0 (TID 39)
8 Finished task 1.0 in stage 3.0 (TID 41)

Figure 3.1: A snippet of simplified log output by a Spark application. Extracted fields are shown in different colors.

3.1.1 Log Transformation

We use regular expressions to extract the intended log messages. This approach may seem ad hoc, but we use it for two reasons: 1) Though field identification is a solved problem using tools such as Logstash [14], additional information cannot be attached to a log message; in our case, we attach a key and identifiers to a log message. 2) Log messages concerning the workflow repeatedly appear with only a few changing fields, despite the fact that the total number of log messages is large. For example, in Figure 3.1 the messages at line 1 and line 3 only differ in the task ID. In the evaluation, we show that a limited number of regular expressions is enough to capture the whole workflow of an application. To be specific, we use 12 rules, 4 rules, and 5 rules to extract the workflow of Spark,

MapReduce, and Yarn, respectively.

Line  Key    Id       Value     Type     is-finish
1     task   task 39  -         period   F
2     task   task 39  -         period   F
3     task   task 41  -         period   F
4     task   task 41  -         period   F
5     spill  task 39  159.6 MB  instant  -
5     task   task 39  -         period   -
6     spill  task 41  180.0 MB  instant  -
6     task   task 41  -         period   -
7     task   task 39  -         period   T
8     task   task 41  -         period   T

Table 3.2: Keyed messages corresponding to logs in Figure 3.1.

A keyed message consists of the fields listed in Table 3.1, which together describe a raw log message. Key represents the high-level object or event in a log message. Identifiers are used to uniquely identify the object or event in a message. Value stores a numeric value recorded in a log message, if applicable. Type indicates whether the object in a log message is an instant event or a period object. Field is-finish is a flag indicating whether the message is the end mark of a period object and is only applicable to messages of the period type.

We illustrate how to transform raw log messages into keyed messages using the snippet of log messages output by a Spark application in Figure 3.1. Messages 1 and 3 indicate that the Spark executor gets the tasks (task 39 and task 41). During the processing of tasks, each task spills a certain amount of data to disk, which is recorded in messages 5 and 6. Messages 7 and 8 suggest that the tasks are finished. Note that log messages 5 and 6 are each transformed into two keyed messages: one indicates the execution of the task, and the other indicates the spill event. We simplify the logs by omitting timestamps and the classes that generate the messages. In practice, we do extract the timestamps recorded in log messages.

Users can use a configuration file in *.xml or *.json format to define the rules for extracting log messages. The Tracing Workers then parse the configuration file and transform the specified log messages into keyed messages. In our implementation, we use *.xml as the format of the configuration file. The following snippet of rules illustrates how we extract log message 1 in Figure 3.1.

<rule>
  <regex>Got assigned (?<taskId>task [0-9]+)</regex>
  <key-group>
    <key>task</key>
    <identifier>taskId</identifier>
    <is-period>true</is-period>
    <is-finish>false</is-finish>
  </key-group>
</rule>

To summarize, we transform the eight messages in Figure 3.1 into keyed messages and show them in Table 3.2. Taking advantage of the design of keyed messages, operations such as Groupby, Count, and Sum can be easily applied for data analysis. Furthermore, users can perform feedback control at runtime based on the keyed messages. For ease of use, we provide users with configuration files for Spark and MapReduce applications. Besides, users can alter the existing rules or define new rules to extract the intended log messages for other systems.
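As an illustration of how such rules can be applied, the hedged sketch below transforms the log lines of Figure 3.1 into keyed messages. Field names follow Table 3.1, the rule set is a reduced, illustrative subset (the actual deployment uses 12 rules for Spark), and LRTrace's internal representation may differ.

import re
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class KeyedMessage:
    key: str
    identifier: str
    value: Optional[str]
    type: str                      # "instant" or "period"
    is_finish: Optional[bool]
    timestamp: Optional[str] = None

# (compiled regex, key, is-period, is-finish)
RULES = [
    (re.compile(r"Got assigned (?P<id>task \d+)"), "task", True, False),
    (re.compile(r"Finished task \S+ in stage \S+ \(TID (?P<id>\d+)\)"), "task", True, True),
    (re.compile(r"Task (?P<id>\d+) force spilling"), "task", True, None),
    (re.compile(r"Task (?P<id>\d+) force spilling .* release (?P<val>[\d.]+ MB)"), "spill", False, None),
]

def transform(line: str) -> List[KeyedMessage]:
    """Apply every matching rule; a single log line may yield several keyed
    messages, as log lines 5 and 6 do in Table 3.2."""
    out = []
    for regex, key, is_period, is_finish in RULES:
        m = regex.search(line)
        if not m:
            continue
        ident = m.group("id")
        out.append(KeyedMessage(
            key=key,
            identifier=ident if ident.startswith("task") else f"task {ident}",
            value=m.groupdict().get("val"),
            type="period" if is_period else "instant",
            is_finish=is_finish,
        ))
    return out

for msg in transform("Task 39 force spilling in-memory map to disk and it will "
                     "release 159.6 MB memory"):
    print(msg)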

3.1.2 Resource Metrics Storage

LRTrace [161] uses keyed messages to store resource metrics. The advantage of using keyed messages is that users can request all information in a uniform format.

A keyed message stores resource consumption information. In LRTrace, a profile of resource metrics contains four fields: the metric name, the usage value, its container ID, and the profiling time, which are the counterparts of key, value, identifier, and timestamp, respectively, in a corresponding keyed message of the period type. Field is-finish is set to false except for the last metric of a container. Thus, users can regard a resource metric as a special case of a period event whose lifespan equals that of the corresponding container.

3.2 LRTrace Design

We choose Yarn as the cluster resource management framework since it is widely used in both academia and industry. The basic idea can be extended to other cluster resource managers such as Mesos [112]. In this section, we first review the architecture of Yarn. We then describe the design of LRTrace.

3.2.1 Yarn

Apache Hadoop Yarn [171] is a popular resource management framework that can serve applications like MapReduce and Spark. Resources are packed in containers. For example, a Yarn container can have {2 cores, 4 GB RAM} as its resources. An application requests containers from Yarn and launches jobs in them. Traditionally, containers run on top of bare operating systems, and the resource consumption of a container is constrained in a logical manner. An alternative choice is to launch a container in a lightweight container-based virtualized environment, which provides more flexibility for resource allocation and more accuracy for resource monitoring. To avoid ambiguity, in the rest of this chapter we use ‘container’ to refer to a container in Yarn and ‘LWV container’ to refer to a lightweight virtualized container unless otherwise specified.

In a cluster, containers are assigned unique identifiers. In an extreme case, two containers may share the same identifier when the date on the machine is changed, which rarely happens in a real production cluster; our work does not consider this extreme situation. Leveraging the uniqueness of container IDs, we correlate log messages and resource metrics belonging to the same container.

LRTrace is a distributed framework designed to collect and correlate both log messages and resource metrics at runtime. We show the high-level architecture of LRTrace in Figure 3.2. The information collection component and database are considered external to LRTrace. The Tracing Worker runs on each node in the cluster. It reads information from LWV container API files and the log files. After being pre-processed by the Tracing Worker, the information is sent to the information collection component and pulled by the Tracing Master. The Tracing Master is responsible for transforming raw log messages to keyed messages, correlating keyed messages with resource metrics, and sending all the information to a database for further analyses.

Figure 3.2: LRTrace architecture.

3.2.2 Tracing Worker

Log collection. There are two kinds of logs: Yarn logs generated by the ResourceManager or NodeManager, and application logs generated by applications running in Yarn containers. We assume that all the intended log messages follow the format below, which is supported by most of the popular logging tools:

timestamp: log contents

The Tracing Worker collects both kinds of raw logs and sends them to the information collection component. In order to provide container IDs and application IDs for the Tracing Master to construct a keyed message, the Tracing Worker also attaches these two identifiers to a raw log message. For Yarn logs, these identifiers can be extracted from the log messages. However, application logs are written by application developers and only contain objects that are internal to the application. Since applications adopt Yarn as the resource manager, the directory path of an application log file contains the application ID and the container ID. For example, the path of a Yarn log file may look like $HADOOP_HOME/application_01/container_01_01. The Tracing Worker extracts these identifiers by parsing the absolute paths of application log files.
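A hedged sketch of this path parsing is shown below; the directory layout and the example path are assumptions that follow the $HADOOP_HOME/application_*/container_* convention described above.

import re

LOG_PATH_PATTERN = re.compile(r"/(?P<app>application_\w+)/(?P<container>container_\w+)/?")

def ids_from_log_path(path: str):
    """Recover (application ID, container ID) from an application log path."""
    m = LOG_PATH_PATTERN.search(path)
    if m is None:
        return None, None
    return m.group("app"), m.group("container")

print(ids_from_log_path("/opt/hadoop/logs/application_01/container_01_01/stderr"))
# -> ('application_01', 'container_01_01')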

Figure 3.3: Illustration of how a short object can be missed.

Resource metrics. Metrics are collected via the APIs provided by LWV containers. The Tracing Worker monitors four major resources: CPU, memory, disk I/O, and network I/O. In order to distinguish the resource metrics of one LWV container from those of another, the Tracing Worker associates resource metrics with the corresponding Yarn container ID and sends them to the information collection component.

The sampling frequency of monitoring is a trade-off between overhead and accuracy. A higher frequency leads to more CPU, memory, and storage overhead. For long jobs, a low frequency is enough to capture the behavior of resource consumption. However, short jobs lasting only tens of seconds require a higher frequency to guarantee accuracy. In the experiments, we set the frequency to 1 Hz for long jobs and 5 Hz for short jobs.

3.2.3 Tracing Master

The Tracing Master first pulls information (raw logs and resource metrics) from the information collection component. Then, it transforms raw log messages to keyed messages. It periodically writes the processed information to users or the database. Each wave of outputs includes the resource metrics, the living period events and the newly received instant events during the last monitoring interval. The Tracing Master is also responsible for executing user-defined feedback control plug-ins.

Keyed message construction. For instant events, the transformation is straightforward. The Tracing Master simply extracts all the required fields and puts them in the output buffer. For period objects, the Tracing Master maintains a living object set. Objects in the set are distinguished by their keys and identifiers. The Tracing Master adds a period object to the set upon first receiving it, and keeps the object in the set until receiving a corresponding message that has a true is-finish flag.

Since transforming messages and writing results are asynchronous, chances are that two messages containing the same period object are received within one writing interval. This is common for an object with a short lifespan. Figure 3.3 illustrates the scenario. There are two consecutive writing operations at t0 and t0 + ∆t, respectively. In between them, an object starts at ts and finishes at tf . In this case, the object is removed before the second writing operation, which causes data loss.

To avoid this situation, the Tracing Master keeps a finished object buffer. When a period message with a true is-finish flag arrives, the Tracing Master adds the removed object to the finished object buffer. Each writing includes objects in both the living object set and the finished object buffer.

After each writing, the Tracing Master empties the finished object buffer.
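A hedged sketch of this bookkeeping is shown below, assuming keyed messages expose key, identifier, and is_finish attributes as in the earlier sketch; LRTrace's actual data structures may differ.

class PeriodObjectTracker:
    def __init__(self):
        self.living = {}           # (key, identifier) -> first-seen keyed message
        self.finished_buffer = []  # objects that ended since the last write

    def on_keyed_message(self, msg):
        obj = (msg.key, msg.identifier)
        if msg.is_finish:
            # Move the object out of the living set but keep it until the next
            # write so that short-lived objects are not lost (see Figure 3.3).
            self.finished_buffer.append(self.living.pop(obj, msg))
        else:
            self.living.setdefault(obj, msg)

    def write(self):
        # Each write includes both living and recently finished objects;
        # the finished buffer is emptied after every write.
        snapshot = list(self.living.values()) + self.finished_buffer
        self.finished_buffer = []
        return snapshot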

Log and resource metrics matching. As discussed in Section 2, every keyed message is attached with an application ID and a container ID, so are the resource metrics. A matching is done by associating keyed messages and resource metrics that share the same identifier. Since the timestamp granularities of keyed messages and resource metrics may vary, we do not use timestamps when matching the two kinds of information. Instead, the Tracing Master presents the information on two timelines respectively in a chronological manner, one for events in logs and the other for resource metrics.

Feedback control. The information collected by Tracing Worker reflects the current status of the cluster, which includes the state and resource consumption of each application. This information can be used to manage the cluster immediately, such as adjustment of the resources consumed by applications and trivial error handling.

To assist semi-automatic cluster management, LRTrace provides a programming interface for users to define their own feedback control plug-ins. For ease of use, LRTrace does not expose the raw data through the interface. Instead, the Tracing Master arranges the data collected by the Tracing Worker into time-sliding windows. The window size can be specified by the users. In each window, the information is presented in the format of the keyed message and is grouped by the application ID and container ID. The interface contains an action(data window) function that is called by the Tracing Master periodically and shall be implemented by users.

Figure 3.4: State machines of the application attempt and two representative containers (container_03 and container_06). The names of the states are written in capital letters. Each short state is labeled with an arrow pointing to it.

After obtaining the data, users can design their feedback logic to manage the cluster automatically. Usually, users follow three steps when implementing the action(data window) function. First, users get cluster status data extracted from keyed messages via the interface. Second, users update their own local variables based on the cluster data. These variables, for example, can be the usage of resources during the last few windows or counters indicating timeouts. Finally, users execute cluster management tasks if their local variables meet certain conditions.
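The sketch below follows these three steps in a hedged, illustrative plug-in: the class name, the shape of data_window (assumed to map container IDs to lists of keyed messages), and the chosen threshold are all assumptions; only the action(data_window) entry point comes from the description above.

class SpillAlertPlugin:
    def __init__(self, spill_threshold_mb=500.0):
        self.spill_threshold_mb = spill_threshold_mb
        self.spilled = {}   # container ID -> cumulative spilled MB (local state)

    def action(self, data_window):
        # Step 1: read keyed messages grouped by container ID.
        for container_id, messages in data_window.items():
            for msg in messages:
                if msg.key == "spill" and msg.value is not None:
                    mb = float(msg.value.split()[0])
                    # Step 2: update local variables.
                    self.spilled[container_id] = self.spilled.get(container_id, 0.0) + mb
        # Step 3: act when a condition is met (here we only report; a real
        # plug-in could, e.g., request more memory for the container).
        for container_id, mb in self.spilled.items():
            if mb > self.spill_threshold_mb:
                print(f"container {container_id} spilled {mb:.1f} MB; consider resizing")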

Data Query. We design the keyed message to be compatible with the query language used by the backend database, OpenTSDB in our implementation. The query language supports various operations on the keyed message, to name a few, Count, Groupby, Average, downsampling, changing rate calculation, etc. These operations facilitate analysis on the traced data. For example, changing rate calculation can be used to calculate the network/disk usage rate according to the cumulative usage. The Count operation gets the number of concurrently running objects at runtime.

3.3 Evaluation

In this section, we evaluate LRTrace from the following three aspects. 1) We use an example to illustrate how LRTrace reconstructs the workflow ( 3.3.2). We also show some interesting findings in this case. 2) We use several case studies to show how LRTrace can help users to troubleshoot the systems ( 3.3.3 & 3.3.4) and manage the applications ( 3.3.5). 3) We further show the performance overhead introduced by LRTrace ( 3.3.6).

3.3.1 Testbed Setup

We conduct the experiments on a 9-node cluster (1 master and 8 slaves). Each machine has an

Intel(R) Core(TM) i7-2600 CPU, 8GB of memory and a 512GB of HDD with 7200 rpm. The cluster is connected by 1Gbps Ethernet. Experiments run on the Ubuntu 16.04 LTS 64-bit operating system. We use kafka-0.10.2.1 as the information collection component and OpenTSDB-2.3.0 as the time series database. We use the GUI web server provided by OpenTSDB for data visualization and analysis. For Spark applications, we run the original Spark-2.1.0 on Hadoop-2.7.3. We use the sequenceiq/hadoop-docker-2.4.0 as the LWV container. Experiment data are generated either from

HiBench-6.0 [114] benchmark or TPC-H [30] benchmark. Detailed information about the data sizes is given in each experiment, respectively.

3.3.2 Workflow Reconstruction and Analysis

Spark. In this experiment, we run a Spark Pagerank workload to illustrate how LRTrace recon- structs the Spark workflow. The Spark Pagerank workload is configured to perform three iterations on 500MB input data. For log extraction, we define only 12 rules, which is enough to capture the whole workflow. Table 3.3 summarizes these rules. This application is configured with eight execu- tors. For conciseness, we only demonstrate the behaviors of three representative containers but omit the rest of them since they have similar behaviors to one of the three.

We show the states of the application attempt and two containers in Figure 3.4. In this figure, users can see the states of each component at a given time and the length of each state. The 49

12.00 3000 container_03 container_03 10.00 container_04 2500 container_04 container_06 container_06 8.00 2000

6.00 1500 CPU (%) 4.00 1000 spilling 602.9MB Memory usage (MB) 2.00 500

0.00 0 0 10 20 30 40 50 60 70 80 90 100 0 10 20 30 40 50 60 70 80 90 100 Time (s) Time (s) (a) cpu usage. (b) memory usage and related events.

300 100 container_03 240 container_04 80 container_06

180 60

120 40 Total disk (MB) Total

Total Network (MB) Total container_03 60 20 container_04 container_06 0 0 0 10 20 30 40 50 60 70 80 90 100 0 10 20 30 40 50 60 70 80 90 100 Time (s) Time (s) (c) network usage and related events. (d) disk usage.

Figure 3.5: Resource metrics and events in three representative containers. The dashed vertical line in (b) represents a spilling event in container 03. The green and the dashed vertical blue lines in (c) represent shuffling operations (period events) in container 03 and container 06 respectively. (c) and (d) show cumulative I/O usage instead of instant I/O rate.

RUNNING state of a container is further divided into two sub-states: initialization and execution.

Once a container enters the RUNNING state, it starts the internal initialization before it actually receives tasks. The internal state is transparent to Yarn and recorded in the application logs.

LRTrace captures the internal states by assigning the same key to related log messages from either the application or Yarn.

Figure 3.5 shows resource metrics and the related log events. Figure 3.5a shows the CPU usages of two containers. Container 03 and container 06 finished the inner initialization at 10s and 12s respectively. Then, the tasks pre-process the data until 74s. From 74s to 93s, there are three CPU usage peaks indicating the three iterations. Finally, from 93s to 96s, the result is written to the disk.

Analysis on memory behavior Findings get more interesting when we inspect the memory usage and its related events. Figure 3.5b shows that the memory usages of container 03 and container 04 drop immediately at certain moments (container 03 at 49s and container 04 at 58s and 69s). Only container 03 has a spilling operation indicated by the dashed vertical green line. Also, the decrease in 50 Object / # of Description Event rules task 3 All three rules extract the corresponding stage ID, one for the start of a task, two for the end of a task. spill 2 Both extract the processed data, one for task spilling events, the other for force spilling events. shuffle 2 One for the start of a shuffle event, the other for the end. container state 2 One for the start of a container, the other for the rest of state transitions. application state 2 One for the start of an application, the other for the rest of state transitions.

Table 3.3: Summary of rules for extracting the workflow of a Spark application. memory of container 03 happens a few seconds later than the spilling event. We infer that swapping and JVM garbage collection (GC) are two possible reasons for a decrease in memory usage. We check both swap memory usage of the LWV containers and the GC log of the JVM. The swap memory usage stays under 30MB during the entire execution, which apparently does not cause the drop in memory. On the other hand, the GC log indicates that a full garbage collection occurs every time when the memory drops (a full garbage collection does not always cause a drop in memory usage). This explains the delay between the spilling event and the memory drop in container 03. A spilling only copies data to the disk. Later, a full garbage collection releases the memory. Table 3.4 summarizes the memory behaviors of container 03 and container 04. The amount of decreased memory is always less than the memory released by GC since tasks keep generating data during the execution.

Figure 3.5c shows the cumulative network usage of containers and shuffle events. A shuffle operation incurs a network I/O operation thus increases cumulative network usage. The key finding here is that different containers in the Spark application always start shuffling at the same time, in this case at 56s, 69s, 80s, 87s and 94s, which are also the boundaries between stages. This indicates that the Spark application uses a synchronizing mechanism between stages so that it starts a new stage only after all the tasks in the previous stage are finished.

MapReduce. LRTrace is also capable to reconstruct the workflow of MapReduce applications.

Slightly different from Spark, a task in MapReduce monopolizes one container. In this experiment, we run a Hadoop MapReduce Wordcount application on 3GB input data. We demonstrate the 51 Container GC GC Decreased GC start delay memory memory container 03 49th s 10s 658.7 MB 1083.9MB container 04 58th s - 506.5 MB 1077.3 MB container 04 69th s - 236.1 MB 1043.6 MB

Table 3.4: Summary of memory behavior. GC start is the moment when GC starts. GC delay is the time between a spilling event and a later full GC. workflows by showing one map task and one reduce task, respectively.

The map task mainly outputs two kinds of events concerning the workflow in its log: spill and merge. Figure 3.6a shows the workflow of the map task. After initialization, the map task quickly starts five consecutive spill operations. Note that compared to the first three spill, the fourth spill processes a similar amount of data but spends a shorter time. Then, the map task finishes 12 consecutive merge operation in a short time. Each merge processes about 6KB data.

The reduce task also outputs two kinds of events: fetcher and merge, shown in Figure 3.6b. The reduce task first launches three fetchers to fetch the intermediate results generated by map tasks.

Note that fetcher#2 starts later than the other two do. After the fetchers are finished, the reduce task starts to process the data, which is not recorded in the logs. Finally, the reduce task executes two merge operations, each on 30KB data.

We also find an abnormal map task. It starts much later than all the other map tasks and keeps alive for 27s even after the application is finished. In our further experiments, we find that a bug causes this anomaly. We explain how we find the bug in details next.

3.3.3 Bug Diagnosis

We describe how we find two bugs in the Spark-on-Yarn stack step by step. The first bug exists in the Spark scheduler (SPARK-19371 [24]) recently reported by other users. The other bug exists in

Yarn ResourceManager (YARN-6976 [32]) which is reported by us.

Bug1: Uneven task assignment. The Spark-on-Yarn stack uses a two-level scheduler model.

Spark ApplicationMaster first requests containers from the Yarn scheduler (level-1). After receiving containers, the Spark scheduler assigns tasks to them (level-2). The bug identified causes an uneven 52 merge x 12

spill spill spill spill spill Initialization 10.44/ 1.37/ 10.44/6.25 10.44/6.25 10.44/6.25 6.25 6.25

0 5 10 15 20 25 30 Time (s) (a) The workflow of a map task.

merge x 2

fetcher#1 Initialization fetcher#2 fetcher#5

0 5 10 15 Time (s) (b) The workflow of a reduce task.

Figure 3.6: Workflows of a map task and a reduce task. The values below spill represent how many MB of keys and values are processed by the spill event, e.g. 10.44/6.25 stands for 10.44MB of keys and 6.25MB of values. distributed number of tasks in different containers. To start with, we run a Spark TPC-H job

(Query 08) with a MapReduce randomwriter as the interference. The Spark TPC-H job queries on a 30GB date set stored on HDFS. The MapReduce randomwriter writes 10GB data on each node of the cluster. We begin the debugging since we notice that some containers have considerably higher memory consumption than the others do, shown in Figure 3.7a. We do not show container 01 since it runs ApplicationMaster that has stable memory consumption. Container 02, 04, 07 and 09 consume more than 1.4GB memory while container 03, 05, 06 and 08 only consume around 500MB memory. An uneven memory consumption indicates an uneven amount of data generated by tasks.

This leads us to inspect the number of tasks assigned to each container. The following request gets the total number of running tasks during each 5-second downsampled interval: key: task groupBy: container downsampler: {

interval: 5s

aggregator: count } 53

2000

1500

1000

500

Memory usage (MB) 0 02 03 04 05 06 07 08 09 Container ID (a) peak memory usage.

3.0 w/o interference 2.5 w/ interference 2.0 1.5 1.0 0.5 Memory Usage (GB) 0.0 Wordcount TPC-H Q8 TPC-H Q12 Terasort KMeans(pt 1) KMeans(pt 2)

(b) Memory unbalance of different workloads. Note that KMeans job is divided into part 1 (before iterations) and part 2 (iterations).

execution container_02 RUNNING container_03 container_04 container_05 container_06 container_07 container_08 container_09

0 5 10 15 20 25 30 35 40 45 50 Time delayed from AM starts (s)

(c) Delay that a container enters the RUNNING state and the internal execution state respec- tively.

container 02

container 03 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14

container 04 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14

container 05 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14

container 06 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14

container 07 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14

container 08 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14

container 09 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 Time interval (5s)

(d) number of running tasks in each downsampling interval.

Figure 3.7: Related performance metrics when debugging SPARK-19371 and YARN-6976. 54 Figure 3.7d shows that containers which have higher memory consumption also receive more tasks than the other containers do. From the 2nd to the 10th interval, container 04 consecutively runs more than 10 tasks in each interval while container 06 does not receive its first task until the 9th interval. This suggests that the Spark scheduler prefers to assign tasks to a container that previously receives tasks, which is caused by data locality across Spark stages. In other words, tasks are likely to be assigned to nodes that store the required data generated in the previous stage. However, data locality does not explain where in the first stage the scheduler assigns tasks. We conduct the experiments several times. The results suggest that the overloaded nodes in the first stage are random. We suspect that the scheduler prefers to assign tasks to containers which finish the initialization process early. Thus, we inspect the delay that a container enters the RUNNING state and the internal execution state, respectively, using the following request: key: state groupBy: container

The results in Figure 3.7c confirm our presumption. The scheduler assigns tasks to the four containers that finish initialization early. Note that although container 08 enters the RUNNING state early, it spends a long time on initialization, missing the chance to receive tasks. It is enough for a common user to stop here and claim that the Spark scheduler unevenly assigns tasks to different containers. But we further find that the unbalance is also prevalent in Spark applications even without interference.

We conduct experiments using other Spark workloads from HiBench and TPC-H benchmarks with and without interference to check whether an unbalanced task assignment happens. These workloads include 30GB Wordcount, 30GB TPC-H Query 08, 30GB TPC-H Query12 and 10GB

KMeans. We run each workload three times and use the average values as the results, shown in

Figure 3.7b. The length of the bar shows the difference between the maximum and minimum memory usage among containers. Note that we divide the KMeans job into two parts – part 1 includes the tasks before iterations and part 2 includes the tasks during iterations. The results show that an unbalanced memory distribution also exists in applications (or phases) Wordcount, TPC-H, and

KMeans part 1 even without interference. These applications share a common feature that most of 55 the tasks finish within one second.

This bug lies in the Spark scheduler, which causes unbalanced container memory consumption and introduces memory usage inefficiency. The root cause is that the Spark scheduler cannot make appropriate assignment decisions for sub-second tasks. Interference aggravates the unbalance since it causes late start of containers. Any container, even an idle one, still consumes a fixed amount of overhead memory. This memory is indispensable to launch the JVM. For the containers running

Spark applications, the overhead memory is around 250MB. Once a container receives tasks, its memory usage increases due to the data generated by the tasks. We refer the increased memory usage on top of the overhead to effective memory. In another word, the actual memory usage of a container is the sum of the overhead memory and effective memory. Therefore, a higher memory consumption of a container means a larger amount of effective memory, which indicates the container is more effectively used. However, the effective memory of a container may also become a wastage of resources later. Consider the situation that a container first receives tasks then becomes idle.

This container causes more wastage since the early tasks usually generate data, which increases the memory consumption of the container if it later becomes idle. An ideal scheduler should keep every container busy during the execution of an application so as to fully utilize the resources and also reduce the makespan.

Bug2: Zombie containers (new bug). Yarn ResourceManager is not aware of zombie con- tainers. These containers stay alive for a long time after the application is finished, which wastes cluster resources. Running a Spark TPC-H Query 08 with a MapReduce randomwriter reveals the unexpected behavior. LRTrace monitors the resource metrics of a LVW container long after the application is finished. We use the memory consumption to demonstrate this situation. Figure 3.8 shows the related metrics and events during the lifespan of the container. The red vertical line indi- cates the time when the application transitions to the FINISHED state 1. In this case, container 03 keeps alive for another 14s seconds after the application is finished. Then, we check the states of container 03. The container enters the KILLING state 2s after the application is finished and stays

1To avoid confusion, ‘FINISHED’ is the state of the application instead of the state of the container. 56

500 container_03 400

300

200

100

Memory usage (MB) 0 0 10 20 30 40 50 60 70 80 90 100 110 Time (s)

Figure 3.8: A container spends a long time in KILLING state. The vertical red line at the 98th second indicates that the application is finished. The vertical blue line at the 100th second is the boundary between RUNNING state and KILLING state of the container. in the state for 12s, shown by the dashed vertical blue line. Since the container is alive, it takes up about 450MB memory in the cluster. We also observe the similar behavior when running other workloads. In the worst case of a Spark Wordcount application, two containers spend more than

40 seconds in the KILLING state and occupy more than 500MB memory (250MB each), which we call zombie containers that can be only identified by LRTrace by correlating both logs and resource metrics.

The bug is caused by the fact that Yarn ResourceManager considers that the container is finished upon receiving a heartbeat from NodeManager notifying the container is in the KILLING state.

This holds true in most cases since such containers are usually terminated fast enough before the application is finished. Using the current notification mechanism, another scenario is that the container is terminated before the corresponding delayed heartbeat notifying ResourceManager.

The heartbeat is passively delayed due to high network contention. Although, ResourceManager receives the heartbeat late, resources are actually released by the container, which is also acceptable.

The bug scenario occurs when the container is finished from ResourceManager perspective but is still stuck in the KILLING state. Chances are that ResourceManager allocates new containers to the same node that runs the stuck one, leading to resource contention. A possible solution is changing the notification mechanism. The NodeManager should actively send a notification to the ResourceManager after the container actually terminates. Table 3.5 summarizes scenarios of container termination. 57 Slow Late Influence termination heartbeat No No This is the normal termination. No Yes (passive) Delay the scheduling for other applications. Re- sources are released. Yes No ResourceManager is unaware of the long termina- tion, causing resource wastage and contention. Yes Yes (active) A possible solution to the bug, a heartbeat should report the container state after its actual termina- tion.

Table 3.5: Summary scenarios about container termination.

execution 5 container_02 container_02 container_05 container_08 RUNNING container_03 4 container_03 container_06 container_09 container_04 container_07 container_04

3 container_05 container_06 2 container_07 container_08 1

Number of tasks container_09 0 0 10 20 30 40 50 0 5 10 15 20 25 30 35 40 Time (s) Time delayed from AM starts (s) (a) number of running tasks during execution. (b) Delays that a container enters the RUNNING state and the internal execution state.

200 2000 container_02 container_04 container_06 container_08 container_02 container_05 container_08 160 container_03 container_05 container_07 container_09 1500 container_03 container_06 container_09 120 container_04 container_07 1000 80

Disk I/O (MB) 40 500 I/O Wait time (ms) I/O Wait

0 0 0 10 20 30 40 50 0 10 20 30 40 50 Time (s) Time (s) (c) The cumulative disk I/O usage. (d) The cumulative time spent on waiting disk service. Figure 3.9: Related metrics when debugging an anomaly caused by interference in the cloud.

3.3.4 Interference Detection

Interference in cloud environments is common. It can significantly degrade the performance of applications [44]. In this experiment, we show how we find a performance anomaly by using LRTrace.

We run a Spark Wordcount application on a 300MB input data set. The task assignment unbalance scenario also occurs in the application – container 09 receives no task during the first half of its execution. We also find that container 09 enters the internal execution state much later than the other containers. Figures 3.9a and 3.9b show the two scenarios, respectively. At this point, a user may consider that the unbalance is caused by the bug in the Spark scheduler (described in Section

5.2). However, after checking all the resource metrics of this application, we find the root cause is disk

I/O contention on the node that runs container 09. Figures 3.9c and 3.9d show that container 09 58 has a longer disk I/O wait time but much lower disk I/O usage, which indicates that other processes on the same node compete for the disk throughout its execution. The contention gets more severe during 46s to 59s since the disk wait time keeps increasing drastically while there is only little disk

I/O usage. The root cause of the task unbalance is that the disk I/O interference delays the start of container 09. By Figures 3.9a and 3.9b, we can clearly find that container 09 receives tasks as soon as it is fully initialized.

Anomalies due to different reasons may have the same behavior. Using only logs can mislead a user to inaccurate root causes. Thus, a user may consider the root cause as a bug instead of interference if only using information from logs. LRTrace is able to identify the interference because of using both logs and resource metrics.

Summary on diagnosis The examples above show that events from logs and changes in resource consumption are closely related so that any mismatching, such as a decrease in memory without spilling, deserves further analysis. Since popular data analytics frameworks adopt a distributed parallelism model, comparing the information from different containers usually reveals the anomaly.

3.3.5 Feedback Control

We design and implement two plug-ins to illustrate the flexibility and effectiveness the feedback control component of LRTrace. By using Java ClassLoader class, a user-defined plug-in package can be loaded at runtime. Generally, a plug-in has about 300 lines of code but can vary due to the implementation logic.

Queue rearrangement. This plug-in is designed to rearrange the scheduling queue in the Re- source Manager. The plug-in moves an application to other scheduling queue if the application meets either of the following conditions: 1) the application is pending; 2) the application is running slowly.

To find a pending application is straightforward, we simply check whether its state is ACCEPTED.

We define an application as a slow one if it has both of the following symptoms: 1) its memory usage is under the memory limit and it does not increase for a time threshold, and 2) it does not output log messages for a time threshold. After the plug-in finds an application that needs to be moved, it 59

120 200 w/ QR plug-in 100 w/ QR plug-in w/o QR plug-in 150 w/o QR plug-in 80

60 100

40 # of Apps 50 20 Execution Time (s) Time Execution 0 0 WC-Spark KMeans-Spark WC-MR Total WC-Spark KMeans-Spark WC-MR

(a) Number of executed applications. (b) Execution time of applications.

Figure 3.10: Evaluation of the queue rearrangement plug-in. moves the application to another scheduling queue with the most available resources.

To evaluate the plug-in, we configure the scheduler with two queues: default queue and alpha queue. Each queue has half of the cluster resource. We submit three applications, Spark Wordcount,

Spark KMeans and MapReduce Wordcount, to default queue, and keep one instance of each ap- plication at a time. The experiment lasts for one hour, with and without the plug-in, respectively.

Figure 3.10 shows the number of executed applications and the execution time of applications. The plug-in improves the overall cluster throughput by 22.0% and reduce the average execution time by

18.8%.

Application restart. This plug-in kills and restarts a stuck or failed application if the application meets user-defined conditions. During our experiments, we observe that a few applications fail at the

first submission but successfully finish at the second submission even with the same configuration.

We find two reasons: 1) the resource usage fluctuation in the system causes the failure, and 2) the failed application is running with underlying maintenance jobs, such as HDFS loader balancer, simultaneously.

We implement the plug-in to kill the stuck or failed applications and restart it later. The plug-in maintains a log timeout threshold for each application. If not receiving any log message of an application beyond the timeout threshold, the plug-in finds the launch command code of the application, kills the application and restarts it later. To avoid infinite killing and restarting, the plug-in maintains a maximum number of restart times for each application. If an application still cannot successfully finish after several restarting, it requires manual inspection by the users or 60 administrators.

3.3.6 Performance Overhead

Before we proceed to discuss the limitations of LRTrace, we evaluate its performance by answering two questions: 1) How long does it take for a log message from its generation to being stored in the database? 2) How much performance overhead does LRTrace introduce to the targeted system?

Log arrival latency. Real-time processing is important for an online tracing tool since it should make the information accessible as soon as possible. We measure the latency between the time when a new message is written to the log file and the time when it is stored in the database. In this experiment, we assume that all nodes in the cluster have synchronized clocks. We write a program to generate simulated log messages that include a timestamp when the message is written (ltime).

The database outputs a timestamp when storing a log message (dtime). We calculate the log arrival latency by subtracting dtime by ltime. Figure 3.11a shows that the latency generally follows a uniform distribution between 5 ms and 210 ms, which is reasonable for real-time processing.

Degradation. A good tracing tool introduces moderate performance overhead to the targeted system. We use the slowdown of an application as the measure. It is calculated by the ratio of the execution time when the applications are running with LRTrace to that without LRTrace. We run each application with and without LRTrace multiple times, and use the average execution time to calculate the slowdown. For the Spark TPC-H application, one round of execution includes running every query once. Figure 3.11b shows that the slowdown on different applications varies, with a maximum factor of 7.7%. The average slowdown is only 3.8%.

3.3.7 Discussion

Limitations. LRTrace has demonstrated its efficiency in identifying new bugs and performance anomalies. It still requires users to define rules to transform raw log messages to keyed messages.

Our experiments show that we can reconstruct the whole workflow of a Spark application by defining fewer than 15 rules. For a user who is not familiar with the log messages generated by the framework, 61 1.0 latency 1.10 0.8 slowdown

0.6 1.05 CDF 0.4 Slowdown 0.2

1.00 0.0 0 50 100 150 200 250 TPC-H Wordcount KMeans Pagerank Terasort Avg. Delay (ms) Workloads (a) CDF of arrival latency. (b) Slow down.

Figure 3.11: Performance overhead of LRTrace. it may take some time to find out log messages related to workflows and define the rules to transform logs. On the other hand, it provides flexibility for a user to add or delete rules.

LRTrace automatically extracts logs and resource metrics, leaving analysis to users. In exper- iments, we demonstrate that correlating log messages and resource metrics is powerful in both understanding the workflow and debugging distributed systems.

LRTrace focuses on applications that are compatible with Linux cgroup-based virtualization tech- niques, such as LXC and Docker. Such lightweight virtualization provides more fine-grained resource metrics compared to complex virtualization techniques such as VMs. We implement LRTrace mainly for Docker containers since it is a popular virtualization project. However, LRTrace can be extended to other container-based virtualization techniques as long as they provide APIs for monitoring.

Practical experience. We find that the top-down analysis on the data extracted by LRTrace is an effective way to identify the root cause of anomalies. When using LRTrace, users may first check the high-level events of an application and its containers, such as state transition events. Dynamic plug-in loading is an efficient way to perform feedback control at application runtime. In a cluster consisting of more than tens and hundreds of nodes manual management is, if not impossible, hard to be practical. By using the developed feedback control component, users can be relieved from the burden of conducting manual management tasks. 62 3.3.8 Summary

In this paper, we present LRTrace, a non-intrusive tracing tool that extracts information about both logs and resource metrics at runtime. LRTrace transfers a raw log message to a uniform data structure called keyed message. By using keyed messages, users can easily reconstruct the workflow including the events and the amount of processed data if recorded. Lightweight virtualization tech- niques provide the opportunity to access fine-grained per container resource metrics. By correlating these two kinds of information, a user can efficiently find the anomaly and locate its root cause.

Experiments show that LRTrace helps users in understanding workflows, finding anomalies caused by either bugs or interference while incurring low overhead.

In the future,we plan to use machine learning methods or rule-based methods to automatically build the relationship between logs and resource metrics, which further takes the burdens off users. CHAPTER 4

SEMANTIC-AWARE WORKFLOW CONSTRUCTION AND

ANALYSIS FOR DISTRIBUTED DATA ANALYTICS SYSTEMS

4.1 Overview

Figure 4.1 shows the overview of IntelLog [160, 159]. Its first stage, similar to other tools, extracts log keys from log files generated by the targeted systems. The second and third stages build Intel

Keys and construct workflows of the systems represented by HW-graphs. The fourth stage consumes newly incoming logs and automatically report anomalies to the users.

Entity extraction. The second stage extracts and distinguishes fields in log keys using two

NLP approaches: POS analysis and sentence structure analysis. The extraction result of this stage is execution information of the systems that includes entities, identifiers, values, localities and oper- ations. The extracted fields from a log key is stored in an Intel Key that consists of a set of key-value pairs so that users can use queries to request data. An Intel Key is an enhanced representation of a log key. The extracted entities are also used in the next stage to reconstruct workflows ( 4.2).

HW-graph modeling. Before building the HW-graph, we group correlated entities based on their names using nomenclature convention. Besides finding the ordering relationships between individual Intel Keys, we construct hierarchical relationships between entity groups. This step is done by finding the lifespan of each entity group. In an entity group, there are multiple subroutines that represent the ordering relationships between Intel Keys. A HW-graph represents the workflow of a targeted system, which assists users to understand the system. ( 4.3.1) 64

training workflow anomaly log file inference report

log keys Intel Keys HW-graph HW-graph instance

key1 entity group 1 group key2 [entities] ins 1 subroutine 1 [ids] entity incoming key subroutine 2 ins 1 3 [values] group 2 log file … ins 2 … [localities] group subroutine n … [operations] entity ins 2 keyn group 3 ins n

log key information entity grouping and checking incoming log against extraction extraction HW-graph construction HW-graph instance

Figure 4.1: The Overview of IntelLog.

Anomaly detection. IntelLog instantiates a HW-graph instance when a system starts a new session. A session is generally an execution path invoked by a user request. In our case, a session is the execution within one Yarn container ( 4.4). While consuming incoming logs, IntelLog aims to build the graph instance to meet the structure of the corresponding HW-graph. Per its design,

IntelLog reports two categories of anomalies to users: 1) unexpected log messages, and 2) erroneous

HW-graph instances. Similar to previous studies [189, 87], this step does not directly find root causes of anomalies. It reports the affected components and entity group, which significantly reduces the user effort for root cause analysis.

4.2 Information Extraction

After collecting system logs, we use Spell [86] to obtain the log keys of the logs. This section describes how we extract information from the log keys and transform them to Intel Keys. The extracted information includes entities, identifiers, values, localities and operations recorded in the log keys.

The former four categories of information is extracted by a POS tagging based approach while the last one is extracted by a sentence structure parsing based approach. The information extraction phase transforms each log key to an Intel Key. The log message matching the corresponding Intel

Key can be easily stored in databases for queries. 65 log key: * MapTask metrics system log msg: Starting MapTask metrics system POS tagging on msg tagged log msg: Starting_VBG MapTask_NNP metrics_NNS system_NN

Assigning tags on key tagged log key: *_VBG MapTask_NNP metrics_NNS system_NN

Figure 4.2: POS tagging on a log key.

The first step in the information extraction phase is POS analysis. Existing POS tagging tools are widely used in academia and industry, such as OpenNLP [18], Stanford POS tagger [170] and

NLTK [42]. However, POS analysis for log keys is slightly different from that for free-form text.

Since a log key is an abstraction of log messages, it contains variable fields that is represented by asterisk (*). If we directly feed log keys to a POS tagger, the results will not be accurate. Thus, for each log key, we take a corresponding sample log message and feed it as the input of the POS taggers. The output POS tags are used for the log key. Figure 4.2 shows an example of this process.

The log key ‘* MapTask metrics system’ is an abstraction of the log message ‘Starting MapTask metrics system’. Apparently, it is inappropriate to feed the log key as the input to the POS tagger.

In this case, we use the sample log message as the input. The result of POS tagging is a log message in which each word is tagged with its part-of-speech. Finally, we assign the corresponding words in the log keys with the POS tags in the sample log message. We use the Penn Treebank tag set [142] as our POS marks.

4.2.1 POS Tagging Based Extraction

Entity extraction. The step is based on a theory in a previous study [121], which postulates that over 97% of terminological entities only consist of nouns and adjectives and thus can be matched by a list of seven POS patterns. Furthermore, in system logs, entities can also be a single-word noun.

We use Penn Treebank to present these patterns as shown in Table 4.1. Note that patterns ‘JJ JJ

NN’ and ‘JJ NN NN’ have no example since we do not observe such patterns in our implementation. 66 Table 4.1: Patterns to match phrases and the corresponding Penn Treebank presentation. Due to the limited space, ‘NN’ includes four noun tags: ‘NN’, ‘NNS’, ‘NNP’ and ‘NNPS’.

Patterns Penn Treebank Pst. Examples noun NN task adjective noun JJ NN remote process noun noun NN NN event fetcher adjective adjective noun JJ JJ NN - adjective noun noun JJ NN NN - noun adjective noun NN JJ NN cleanup temporary folders noun noun noun NN NN NN map completion events noun preposition noun NN IN NN output of map

We keep these two patterns to make IntelLog extensible to other systems.

Beside the POS pattern matching, we also use a camel-case word filter to find entity names. The intuition is that some entities in logs are also classes defined in the source code whose names follow the camel-case naming format convention. Such entities are usually correlated to other entities such as ‘MapTask’ and ‘map output’. This filter separates a camel-case word into a phrase. In this example, ‘MapTask’ is transformed to ‘map task’. After we extract the entity phrases, we lemmatize them to their singular forms. Note that users can define their own filters for other naming format conventions.

Locality extraction. We define a set of patterns to capture commonly used locality information in distributed systems. These patterns include: 1) host names, 2) IP addresses and ports, 3) local directory paths, and 4) distributed file system paths. Besides, users can define new patterns when applying IntelLog on their own targeted systems.

Identifier and Value extraction. Both identifiers and values are represented by variable fields in a log key. Previous studies either manually define identifier patterns [189] or do not distinguish these two kinds of variables [87]. We apply four heuristics one after another on a variable field, which accurately recognize identifiers and values. First, we filter out variable fields that have verb

POS tags and locality information recognized in the previous steps. Second, we categorize a field as a value if it is followed by a unit, such as ‘12 MB’ and ‘5 ms’. Third, we categorize a field as an identifier if the field is mixed with letters and numbers, such as ‘attempt 01’. Finally, for the fields 67 Table 4.2: UD relations for operation extration.

Element Relations Descriptions ROOT a relation indicating the root of the sentence predicate xcomp an open clausal complement of a verb or an ajective nsubj a nominal subject of a clause subj-entity nsubjpass a passive nominal subject dobj a direct object of a verb obj-entity iobj an indirect object of a verb nmod a nominal modifier of a clausal predicates consist of only numbers, we check the POS tag of the word prior to the field. We categorize the field as an identifier if the POS tag represents a noun. Otherwise, the field is a value.

4.2.2 Structure Parsing Based Extraction

Operation. The idea is that an operation is usually indicated by a predicate in a log key, and the entities related to the predicate are either the source or the target of the operation. We use universal dependency (UD) relations [152] to denote the structure of a log key. For simplicity, we define an operation as a 3-tuple: {subj-entity, predicate, obj-entity}. Since most entities are nouns, the sub-entity and obj-entity can be extracted by checking whether a word has a specific relation with the predicate. We carefully choose 7 relations out of a total of 40 UD relations. Table 4.2 describes these relations. These relations indicate a predicate and the entity related to the predicate.

4.2.3 A Summary Example

We take a log key in the critical path of Spark jobs as an example to demonstrate the process of the information extraction approach as shown in Figure 4.3. This log key records the task finish event and the amount of data sent to the driver. After the POS tagging phase, each word in the log key is tagged with a POS mark. IntelLog extracts five entities (but omit ‘bytes’ since it is a unit), three identifiers and one value. The structure parsing technique allows us to obtain the relations between words. Two operations are extracted according to the relations in Table 4.2.

The resulted Intel Key is shown on the right side of Figure 4.3. When a newly incoming log message matches the Intel Key, the variable fields represented by asterisks are replaced by the 68 Log key: finished task * in stage * (TID *) Intel key * bytes result sent to driver Entities: POS tagging: [task, stage, TID, result finished_VBN task_NN *_CD in_IN , driver] stage_NN *_CD (TID_NN *_CD) Identifiers: *_CD bytes_NNS result_NN sent_VBN [(task) *, (stage) * to_TO driver_NN , (TID) *] Values: * (bytes) Structure parsing: Locality: NULL dobj finished task * in stage * (TID *) Operations: [-, finished, task] nsubj nmod [result, sent to, driver] * bytes result sent to driver

Figure 4.3: Process of transforming a log key to an Intel Key. actual values in the log message. As a result, the log message is transformed to an Intel Message.

Compared to a log message, an Intel Message is a more structured representation of logs since it structurizes both the text formats and the contents of logs. Further, an Intel Message can be considered as a collection of key-value pairs. It naturally fits in the storage structure of time series databases [20, 9] and can be utilized by other profiling tools [161].

4.3 Modeling and Anomaly Detection

IntelLog builds the HW-graph to present the workflow of a targeted distributed system. First, entities are grouped by their names. We identify the hierarchical relationships between groups by checking their lifespans. In each group, we leverage the log orders and identifiers to build subroutines.

In the anomaly detection phase, IntelLog tries to reconstruct a HW-graph instance from the incoming logs and check it against the HW-graph built for the system. IntelLog reports two kinds of potential anomalies to users: 1) unexpected log messages, and 2) incomplete subroutines. 69 4.3.1 Workflow Reconstruction

Entity grouping. In a targeted system, entities are usually correlated with each other. However, such entities are not always associated with the same identifiers. Thus, Stitch [196], which constructs workflows solely based on identifiers, only presents workflows in a course-grained manner. Moreover, the identifier names only contain limited semantic knowledge, leaving a large amount of information unexploited.

Algorithm 1 Grouping entities. 1: Input: a list of all extracted entities in ascending order by the # of words: E; 2: a dictionary D < group, Eg > that maps a common phrase group to a set of entities Eg ; 3: a dictionary Dr < entity, G > that is a reverse key index of D, mapping an entity to a set of groups G; 4: D ←{}; 5: for all e in E do 6: grouped ← false; 7: for all (group, Eg ) in D do 8: com phrasel ← LongestCommonPhrase(group, e); 9: if com phrasel 6= NULL then 10: Eg ←Eg ∪{e}; 11: group ← com phrasel ; 12: if ! grouped then 13: grouped ← true; 14: end if 15: end if 16: end for 17: if grouped = flase then 18: D←D∪{e, {e}}; 19: end if 20: end for 21: Dr ← ReverseIndex(D) 22: 23: function LongestCommonPhrase(G, E) 24: if G has one word or E has one word then 25: return the longest common string of G and E; 26: else if G and E have common last few words then 27: return empty string; 28: end if 29: return the longest common string of G and E; 30: end function

In IntelLog, we leverage the nomenclature of entities to group together log keys that contain correlated entities. The intuition is that correlated entities usually share the common sub-phrase in their names. For example, block, block manager and block manager endpoint share a common sub-phrase ‘block’. However, some entities are not correlated even they share a common sub-phrase.

We find that such phrases usually share last few words that have a general meaning. For example, phrases ‘block manager’ and ‘security manager’ in Spark share the common sub-phrase ‘manager’ 70 but they are not tightly correlated.

Algorithm 1 describes the algorithm for grouping entities. It takes a list of extracted entities as an input and maintains a dictionary that maps group names to a set of entities. It compares each entity e in the list to each group in the dictionary and returns the longest common phrase. If either e or group only contains one word, the function returns the longest common phrase of these two words. In this case, it is either a one-word phrase or an empty string. Since the one-word phrase is part of the other multi-word phrase, their meanings are correlated. For two multi-word phrases, the function also checks whether the two phrases only have the last few words in common. If so, the function ignores the common phrase and returns an empty string. Otherwise, the function returns the longest common phrase that could contain one or multiple words (Line 23 ∼ 30). The idea behind of this step is that the last few words of a system entity usually have general meanings such as ‘manager’, ‘file’ or ‘output’. The algorithm does not group together entities sharing the last few words since their meanings are not tightly correlated. In theory, it is possible that such entities are correlated. Thus, the algorithm categorizes these entities into different groups, missing the correlation. However, IntelLog can still capture the lifespans of the entity groups during HW-graph construction phase. In practice, we do not encounter such entity groups.

Algorithm 2 Assigning Intel Keys into subroutines.

1: Input: a set of log message sequence in the entity group collected from different sessions: S < Seqlog >; 2: a dictionary mapping sets of identifier values to log message sequences: Dvl < Sv , Seqlog >; 3: a dictionary mapping sets of identifier types to Intel Key sequences: Dti < St , Seqkey >; 4: for all Seqlog in S do 5: Dvl ← {{NONE}, []}; 6: for all log in Seqlog do 7: if log.Sv = NULL then 8: append log to the Seq with the NONE key; 9: else if ∃(Sv , Seqlog ) in Dvl 10: s.t. log.Sv ⊆Sv or Sv ⊆ log.Sv then 11: Sv ←Sv ∪ log.Sv ; 12: Seqlog .append(log); 13: else 14: Dvl ←Dvl ∪ (log.Sv , [log]); 15: end if 16: end for 17: if in training process then 18: Dti ← UpdateSubroutine(Dti , Dvl ); 19: end if 20: end for 71 a : A signature {ID_1, ID_2} i log message : Intel Key session ID set log sequence subroutine !" ! a , b , c , d A BC D !v_1 1 1 1 1 1 Session 1 A BC D !v_2 !" 2 ! a2 , b2 , c2 , d1 B ! a , c , b , d A D v_3 !" 3 ! 3 3 3 3 C

training log Session 2 B

consuming order a , b , c A D !v_4 !" 4 ! 4 4 4 C

Figure 4.4: Illustration of the UpdateSubroutine function.

Subroutine. The Intel Keys in an entity group are built as subroutines that are sequences of Intel

Keys. Some subroutines can concurrently run multiple instances during execution distinguished by identifiers. Algorithm 2 builds Intel Keys in one entity group into subroutines. Before applying the algorithm, we find all log sequences belonging to the entity group generated by different executions.

The set of log sequences is represented as S < Seqlog >. The algorithm takes S < Seqlog > as the input to build subroutines. It maintains a dictionary Dvl that maps identifier types to Intel

Key sequences. The dictionary Dvl is a temporary storage for each session that maps the actual identifiers to log message sequences. Before consuming a session, it clears Dvl and assigns an empty sequence with the key NONE (Line 5). The empty sequence is used to store log messages that has no identifiers. In one session, it iterates each log message in the sequences (Line 6). It first updates the log sequence with the NONE key if the log message does not have an identifier (Line 7 ∼ 8). If the identifier values a log message that has already been in this session, it updates the log sequence with corresponding identifiers (Line 9 ∼ 12). Otherwise, it creates a new key-value pair in Dvl and adds the log message to the new sequence (Line 14). After each session, it updates the subroutines based on the log sequence in this session.

The UpdateSubroutine function maintains an order relation set of Intel Keys. With a focus on building the order relations, we assign each identifier with a corresponding identifier type represented by a capitalized word. For example, ‘container 01’ and ‘container 02’ have a type of ‘CONTAINER’.

For each session, it first groups the key-value pairs in Dvl according to the corresponding identifier 72 group a group a group a group b group b group b PARENT PARALLEL BEFORE group a group b group a group b group a group b

Figure 4.5: Three relations between entity groups.

types of the identifiers in Sv . The set of identifier types can be considered as a signature. For a signature, it iterates through the (Sv , Seqlog ) pairs and checks the order of the corresponding Intel

Keys of the log messages in Seqlog . If Key1 always appears before Key2 in every Seqlog , it assigns a

BEFORE relation Key1 → Key2 to the two Intel Keys. Otherwise, these two Intel Keys can appear in parallel. The function also keeps track of the Intel Keys that always appear in a subroutine and marks them as critical Intel Keys. IntelLog uses critical Intel Keys for anomaly detection. Figure 4.4 illustrates the subroutine construction process for the signature {ID 1, ID 2}. It first consumes two log sequences in session 1 that have the same order of Intel Key sequences. IntelLog takes this key sequence as the subroutine. Every Intel Key in this sequence is marked as critical (bold font). Once

IntelLog consumes Seq3 in session 2, it finds that the log messages of B and C are interchangeable.

As a result, it breaks the BEFORE relation between B and C and assigns them as parallel. Finally, in Seq4 there is no log message matching Intel Key D. IntelLog marks Intel Key D as a not critical one (normal font).

HW-graph. The lifespan of an entity group in a session is defined by the duration between the

first and the last log message that belong to the group. We construct the hierarchical relationships by checking the lifespan of each entity group. The idea is that if the lifespan of an group LSa is always within the lifespan of another group LSb in every session, group a is dependent on group b and should be a child of group b. In addition, two groups can either execute sequentially or in parallel. In order to capture the relationships, we define three relations as described in Figure 4.5.

Two entity groups are assigned with PARENT or BEFORE only if they satisfy such relations in every log session. Otherwise, they execute in parallel and are assigned with PARALLEL.

IntelLog constructs the HW-graph of the targeted system based on the relations between each pair of entity group. We use an example to illustrate the construction procedure, as shown in 73 step 1 step 2 step 3 step 4

a b c d a b c d group a group a a - PTPL PT a - PTPL PT group b b CH - BF PL b CH - BF PL group b group c group d c PL AF - PL c PL AF - PL group d d CH PL PL - group c d CH PL PL -

PT: PARENT CH: CHILD BF: BEFORE AF: AFTER PL: PARALLEL

Figure 4.6: The building procedure of the HW-graph based on entity relations.

Figure 4.6. For the simplicity of explanation, we add two auxiliary group relations CHILD and AFTER that are the opposites of PARENT and BEFORE respectively. First, IntelLog identifies the groups that only have PARALLEL, PARENT and BEFORE relations, say group a (step 1). Then, IntelLog constructs other groups based on their relations with group a. In this case, group b and d are children of group a, and group c executes with a in parallel (step 2). Once group a is built, IntelLog crosses out all the relations that are associated with group a (step 3). IntelLog repeats this procedure until all groups are crossed out. At this point, IntelLog constructs a complete HW-graph of a targeted system (step

4).

4.3.2 Anomaly Detection

When consuming incoming logs, IntelLog instantiates a HW-graph instance for each session of the targeted system. A HW-graph instance has the same entity group hierarchy as the corresponding

HW-graph. In each entity group, however, it has multiple subroutine instances. For example, an entity group G in a HW-graph has a subroutine represented by a sequence of Intel Keys [A, B].

In the HW-graph instance, the entity group G may have two subroutine instances [a1, b1] and [a2, b2], where ai and bi are log messages. For each incoming log message, IntelLog uses Algorithm 2 to determine its subroutine instance. For an instance, if the corresponding Intel Key of an incoming log message is a critical one, IntelLog marks the critical Intel Key as used in the subroutine sequence.

IntelLog reports two kinds of anomalies: 1) unexpected log messages, and 2) erroneous HW-graph instances. These anomalies are common in a multi-tenant data cluster since tasks can be affected 74 by administrators or other user processes.

Unexpected log message. IntelLog reports log messages that are not matched with any

Intel Key. Furthermore, IntelLog tries to extract information of the five fields from the unexpected messages using the approaches of POS tagging and structure parsing. Evaluation shows that such information helps users diagnose the erroneous components and the root causes.

Erroneous HW-graph instance. When a whole session is consumed, IntelLog reports the erroneous HW-graph instances. Such instances either have an erroneous hierarchy of entity groups, abnormal subroutine instances or missed critical Intel Keys. IntelLog reports the problematic entity groups or subroutine instances.

4.4 Implementation

We implement IntelLog in about 6,700 lines of Java code and 200 lines of Python code. We have its package open-source at Github, https://github.com/EddiePi/IntelLog/. The code package also includes about a 400-line implementation of Spell. Spell defines a threshold t that helps matching log messages to log keys. We empirically set t to 1.7 via our experiments. We use OpenNLP [18] as the POS tagging tool and use Stanford Parser [50] to analyze the structures of log keys. Both

HW-graphs and its instances are output as JSON files which can be queried by JSON query tools such as JSONQuery [13].

We deploy three popular distributed data analytics frameworks, Hadoop MapReduce, Spark and

Tez as our targeted systems. All three systems are managed by Yarn [171], a cluster resource management framework. Execution in Yarn is encapsulated inside containers. Thus, we consider the log messages generated by one container as a log session. Since the formats of logs generated by different systems vary, we implement two log formatters for the targeted systems in about a total of

200 lines of Java code. The formatters simply recognize the log formats such as timestamps, output classes and log contents by pattern matching. Note that for new systems, users need to customize and implement their own formatters.

We omit log messages that only consist of a set of key-value pairs since they are not written in 75 Table 4.3: Accuracy of information extraction by IntelLog in the three systems.

Frame- Consumed # of Entities Identifiers Values Locations Operations works log msg. Intel (Total / (Total / (Total / (Total / (Total / Keys FP / FN) FP / FN) FP / FN) FP / FN) Missed) Spark 1,361,008 60 63/3/0 19/1/1 13/1/0 9/0/1 63 / 5 MapReduce 5,254,050 44 43/9/2 11/1/1 41/1/1 1/0/0 45 / 5 Tez 1,127,549 95 101/2/3 13/0/3 43/3/0 3/0/0 97 / 7 Total 7,742,607 219 207 / 14 / 43/2/5 97/5/1 13/0/1 205 / 17 5 natural language. They can be captured by pattern matching In the training phase. We also use

Spell to discover the log keys of these omitted log messages. IntelLog maintains a list of these log keys. Once a log message matches a log keys in the list, IntelLog ignores them instead of triggering the unexpected message errors.

4.5 Evaluation

We evaluate IntelLog from three aspects. 1) We evaluate the accuracy of information extraction for Intel Keys. (2) We use a case study to illustrate the HW-graph that represents the workflow of a targeted system. We evaluate the effectiveness of HW-graph and compare it with the S 3 graph in Stitch [196]. 3) We evaluate the problem detection accuracy of IntelLog and compare it with related work DeepLog [87] and LogCluster [135]. We also demonstrate how HW-graphs can help users diagnose the root causes of anomalies.

4.5.1 Experiment Setup

We conduct the experiments on a 27-node physical cluster (1 master node and 26 worker nodes) managed by Yarn [171]. The cluster is connected with 10-Gbps Ethernet. Each node has 128 GB memory, 4 Intel Xeon E5-2640 CPUs (8 cores per CPU, 32 cores in total) and is installed with

Ubuntu 16.04 with Linux kernel 4.12.8. The versions of the targeted data analytics systems are

Spark-2.1.0, Hadoop MapReduce-2.9.1 and Tez-0.8.4, respectively.

We implement a workload generator that submits jobs for each targeted system. For Spark and

MapReduce, the generator randomly chooses jobs from HiBench [114] to generate the workloads. HiBench includes a wide range of jobs covering text processing, machine learning and graph processing. For Tez, we use Hive-1.2.2 [169] as the query interface, and the generator randomly chooses queries from TPC-H [30] as the workloads. TPC-H is a suite of database queries that have broad industry-wide relevance. In the model training phase, resource configurations are carefully tuned in the generator in order to guarantee successful and normal execution of every job. In Sections 4.5.2 and 4.5.3, we use the generator to randomly submit 100 jobs to each system. The logs are collected for evaluation.

4.5.2 Information Extraction

In IntelLog, accuracy checking is done by comparing Intel Keys with log messages. However, such an approach may incorrectly categorize an identifier as a value or vice versa, because such fields may contain only numeric values that appear ambiguous even with context. Since the source code of the targeted systems is available, we check the accuracy of information extraction by manually comparing Intel Keys with the corresponding logging statements in the source code.

Table 4.3 shows the accuracy of information extraction for each field. ‘Total’ denotes the manual checking results. ‘FP’ and ‘FN’ denote false positives and false negatives, respectively. We present the total number and the missed number of operations, since there are no false-positive operations (other fields cannot be categorized as operations). For the entity field, the major reason for false positives is that IntelLog categorizes abbreviations as entities, such as ‘ref’ for ‘reference’ and ‘tid’ for ‘task id’. We categorize such words as false positives since they are meaningless without context. False negatives are usually caused by entity phrases with four or more words. Note that IntelLog has relatively high accuracy for Tez. The reason is that most Tez logs are well formatted as a sentence followed by a set of key-value pairs; thus, the entities can be correctly recognized. The false negatives of identifiers are also false positives of values. Such fields only contain numeric variables, which makes categorization difficult even for manual inspection. The only false negative of location is information about a service port that IntelLog categorizes as a value.

Figure 4.7: The HW-graph for Spark containing the semantic knowledge of the workflow: (a) hierarchical relations between entity groups; (b) subroutines in each entity group. We omit the children of the entity groups in gray rectangles since they have the same hierarchy as the children of the ‘memory’ group.

Figure 4.8: The S3 graph of Spark built by Stitch.

For operation extraction, IntelLog performs well when analyzing sentences with correct grammatical structures. However, some log messages do not strictly follow grammatical rules. For instance,

MapReduce outputs a log ‘Down to the last merge-pass...’ in its critical path. This log message does not have a predicate, so IntelLog cannot recognize the operation. We also observe two Tez log keys that record vague information even though they are grammatically correct. The sample log messages generated by the two keys are ‘6 Close done’ and ‘4 finished. Closing’. After checking the source code, we find that these two log messages are actually related to query operators. Information extraction for such log messages is beyond the scope of this project.

4.5.3 HW-graph and Workflows

An HW-graph helps users understand workflows from two aspects. First, it categorizes workflows into entity groups in a hierarchical manner, which provides an overview of the system without inspecting the detailed events in it. Second, it uses subroutines as a fine-grained view of the workflows in each entity group, which focuses only on specific entities and filters out irrelevant ones. Subroutines are helpful when users try to understand parts of the workflow.

Table 4.4: Log and HW-graph statistics for the three systems.

Framework  | Length of sessions | # of groups (all / crit.) | Length of subroutines (max / avg. all / avg. crit.)
Spark      | 347                | 45 / 10                   | 10 / 1.2 / 2.3
MapReduce  | 137                | 35 / 13                   | 19 / 1.7 / 2.8
Tez        | 304                | 59 / 27                   | 14 / 2.7 / 4.6

Since the HW-graph of a distributed data analytics system is large and contains rich information, we divide entity groups into two categories: critical groups and secondary groups. We categorize a group as critical if it meets either of the following two criteria: 1) it contains multiple Intel Keys, or 2) it contains an Intel Key that has multiple corresponding log messages in a single session. The first criterion captures different entities that correlate with each other. The second criterion captures entities that execute repeatedly or in parallel, which usually indicates common tasks. In practice, users can also choose to obtain a comprehensive HW-graph of a system that contains all the entity groups.

In order to evaluate how a hierarchical HW-graph can reduce the user effort spent on understanding a workflow, we show the statistics of log sessions and the corresponding HW-graphs. Specifically, we measure the following metrics: 1) the average number of log messages in a session; 2) the number of all entity groups of a system; 3) the number of critical entity groups of a system; 4) the average length of a subroutine in all entity groups; and 5) the average length of a subroutine in critical groups. Table 4.4 shows the results.

The number of entity groups (critical groups) is 5∼10 (10∼50) times smaller than the length of a session. Furthermore, the entity groups are organized in a hierarchical manner, which provides a clearer view of workflows compared to the original interleaved log messages. The length of a subroutine instance in entity groups is also shorter than that of a session. The longest instance has about 20 log messages, which is practical for manual analysis.

Spark HW-graph. Due to the limit of space, we solely illustrate the most complex HW-graph of Spark containing all the critical groups. Figure 4.7 depicts the HW-graph that is built from over 1.2 million Spark log messages.

Figure 4.7(a) clearly illustrates the hierarchical relations between entity groups, which form a high-level view of the Spark workflow: 1) Spark first checks the ‘acl’; 2) then, it has four major entities throughout execution, which are ‘memory’, ‘directory’, ‘driver’ and ‘block’; 3) under these four entities, there are child entities such as ‘task’ and ‘fetch’; 4) group ‘shutdown’ comes after ‘task’ and ‘directory’, and the graph shows that it can execute alongside operations in groups ‘memory’, ‘driver’ and ‘block’. The length of a rectangle indicates the lifespan of an entity group. We omit the children of the entity groups in grey rectangles since they are the same as those under group ‘memory’.

Figure 4.7(b) shows the subroutines in each entity group and the Intel Keys in the subroutines. For simplicity, Intel Keys are represented by the extracted operations. One advantage of using entity groups is that they categorize correlated entities and the events of those entities. We use group ‘block’ as an example to illustrate how subroutines in an entity group describe workflows. Group ‘block’ has three subroutines: 1) s1: a subroutine with identifiers of BlockManager; 2) s2: a subroutine with identifiers of block; and 3) s3: a subroutine with no identifier. Subroutine 1 records the workflow of a BlockManager that has three operations: ‘registering’, ‘registered’ and ‘initialized’. Subroutine 2 records whether blocks are stored in memory or on disk during the execution. Subroutine 3 records the ‘getting block’ events together with the number of blocks obtained. Note that the ‘stopped’ operation of a BlockManager is categorized into s3 since it has no identifier. The subroutines provide users with a clear view of the workflows that are related to the ‘block’ component.

Workflows of MapReduce and Tez can also be represented by HW-graphs. Similarly, entity groups capture related entities in the two systems. For example, group ‘map’ in MapReduce captures ‘map metrics system’ and ’map output’. Group ‘task’ in Tez captures ‘task’ and ‘TaskAttempt’.

IntelLog vs. Stitch. Stitch [196] is a closely related tool that reconstructs the workflow of a targeted system as an S3 graph based on identifiers. An S3 graph defines four relationships between identifier pairs in logs, i.e., empty, 1:1, 1:n, and m:n. The 1:1 relationship indicates that the two identifiers are interchangeable. The 1:n relationship indicates a hierarchical relationship between the two identifiers. The m:n relationship indicates that an object can only be unambiguously identified by the combination of the identifier pair. To compare the S3 graph with the HW-graph, we reconstruct the S3 graph of the Spark system.

Figure 4.8 shows that the S3 graph reveals the hierarchical relationships between identifiers in Spark. For example, STAGE and TID have a 1:n relationship, which indicates that one stage runs multiple TIDs. However, a major limitation of the S3 graph is that it does not contain the semantic information recorded in logs. Users can only infer workflows from the names of identifiers, leaving a large amount of information unexploited. In practice, the workflows reconstructed by Stitch contain the lifespans of objects and their hierarchical relationships, whereas IntelLog contains not only such information but also the operations and events related to objects. In this sense, IntelLog provides more fine-grained workflows to users because of its semantic awareness.

4.5.4 Anomaly Detection

In order to evaluate the capability of anomaly detection by IntelLog, we develop a problem injection tool that emulates three real-world scenarios that were also studied in tools DeepLog [87] and

LogCluster [135]. The problems include: 1) execution abortion of a session, 2) network failure on a node, and 3) node failure. The first problem can be caused by administrators or other user processes.

We simulate it by sending a SIGKILL signal that gives the process no grace period to do cleanup.

The second and third problems are common due to multiple reasons such as misconfigurations or hardware errors. We simulate network failures by disabling the network interface and simulate node failures by shutting down the machine. To generate logs with various lengths, we submit jobs to the cluster with five sets of configurations that have different input data sizes and resource allocations for each system. For each configuration setting, we submit three jobs injected with the three problems, respectively, and three jobs that are not affected by the problems. Note that we tune the configurations such that these jobs are guaranteed to run successfully when no problem is injected.

The injection tool triggers the problem at a random point during the job execution. To summarize, we submit a total of 30 jobs for each system, 15 of which run with the problems.
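The injection logic itself is small. The sketch below illustrates the three actions; the NIC name and the use of system(3) are illustrative assumptions rather than the actual tool.

    #include <cstdlib>
    #include <signal.h>
    #include <sys/types.h>

    enum class Problem { KillSession, NetworkFailure, NodeFailure };

    // Inject one of the three studied problems on the local node.
    void inject(Problem p, pid_t containerPid) {
        switch (p) {
        case Problem::KillSession:
            kill(containerPid, SIGKILL);           // no grace period for cleanup
            break;
        case Problem::NetworkFailure:
            std::system("ip link set eth0 down");  // hypothetical NIC name; needs root
            break;
        case Problem::NodeFailure:
            std::system("shutdown -h now");        // emulate a node failure
            break;
        }
    }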

Table 4.5: The accuracy of anomaly detection by IntelLog.

Framework  | Number of sessions | Length of a session | D / FP / FN / (P/B)
Spark      | 4∼26               | 20∼1812             | 13 / 2 / 2 / (2)
MapReduce  | 16∼257             | 67∼2147             | 15 / 1 / 0 / (0)
Tez        | 2∼36               | 107∼486             | 13 / 3 / 2 / (3)

Table 4.5 summarizes the anomaly detection results by IntelLog. For each system, we present the number of sessions and the length of one session. ‘D’ denotes the number of injected problems that are detected by IntelLog. ‘FP’ denotes the number of false positives and ‘FN’ denotes the number of false negatives. In the experiments, we also find that IntelLog can capture performance problems and anomalies caused by bugs; the number of such cases is denoted as ‘(P/B)’. IntelLog detects 41 out of 45 injected problems. In addition, IntelLog captures 5 unexpected problems that are caused by inappropriate configurations or an internal bug. In general, IntelLog has an 87.23% precision and a 91.11% recall.
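These numbers follow directly from the counts above: 41 of the 45 injected problems are detected, with 6 false positives across the three systems, giving

    precision = 41 / (41 + 6) ≈ 87.23%,  recall = 41 / 45 ≈ 91.11%,
    F-measure = 2 × 0.8723 × 0.9111 / (0.8723 + 0.9111) ≈ 89.13%,

which matches the values reported for IntelLog in Table 4.7.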

We manually check the system logs and the problem injection traces in order to analyze the causes of inaccurate anomaly detection. A typical reason for false positives is an incomplete HW-graph due to insufficient training logs. We use Spark as an example to illustrate this scenario. During the shutdown phase, the Spark driver sends every Spark worker a ‘shutdown’ command and then enters the shutdown phase itself. The workers may still receive a heartbeat indicating the disconnection of the driver and record it in the log file. Since the configurations are finely tuned in the training environments, workers finish the self-cleanup procedure so quickly that they terminate before receiving the last heartbeat from the driver; in this case, the disconnection event does not show up in the logs. On the other hand, we change the resource configurations during the anomaly detection phase, so the workers may experience a slower shutdown and output this log message. Since IntelLog does not capture this log message in the training phase, it categorizes the message as an unexpected one and alerts users to a potential anomaly concerning shutdown.

Next, we use three case studies to demonstrate how IntelLog reports anomalous executions and helps users diagnose the root causes. The workload and anomaly information is shown in Table 4.6. In column ‘Sessions D / T’, ‘D’ denotes the number of problematic sessions reported by IntelLog while ‘T’ denotes the total number of sessions of the job.

Table 4.6: Job descriptions in the three case studies.

Case No. | System / job name     | Input / resources     | Sessions D / T | Anomaly summary
1        | MapReduce / WordCount | 30 GB / 8-core, 4 GB  | 4 / 259        | network problem on a host
2.1      | Spark / KMeans        | 30 GB / 8-core, 2 GB  | 1 / 8          | a performance issue
2.2      | Tez / Query 8         | 5 GB / 1-core, 1 GB   | 24 / 25        | a performance issue
3        | Spark / WordCount     | 30 GB / 8-core, 16 GB | 4 / 8          | an internal bug of Spark

Injected problem. In the first case, we take advantage of IntelLog and detect that a MapReduce

WordCount job experiences a network problem on a host. We show the step-by-step procedure by which IntelLog leads us to the root cause. After consuming the logs output by this job, IntelLog reports four problematic sessions containing unexpected log messages out of a total of 259 sessions. At this point, IntelLog already significantly reduces the log range for analysis. Then, IntelLog applies the information extraction procedure on the unexpected log messages to transform them to Intel

Messages. The result indicates that all of the unexpected messages belong to the ‘fetcher’ entity group. We apply a GroupBy operator on the Intel Messages based on their identifiers. The result shows that 11 groups have log messages indicating host connection failures. Finally, we apply another

GroupBy operator based on the location information. The result only has one group with a group

ID ‘host A’. In other words, the diagnosis procedure shows that 11 fetchers have a problem when connecting to ‘host A’ during execution. After checking the trace of the injection tool, we find that a network failure is injected during the job execution.
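The GroupBy step used in this diagnosis is straightforward; a minimal sketch over simplified Intel Messages is shown below (the struct fields are illustrative, and IntelLog itself is implemented in Java).

    #include <map>
    #include <string>
    #include <vector>

    // Simplified view of an Intel Message; the real representation also carries
    // entities, values and operations.
    struct IntelMessage {
        std::string identifier;  // e.g. a fetcher ID
        std::string location;    // e.g. a host name
        std::string operation;   // e.g. "failed to connect"
    };

    // Group messages by an arbitrary key extractor (identifier, location, ...).
    template <typename KeyFn>
    std::map<std::string, std::vector<IntelMessage>>
    groupBy(const std::vector<IntelMessage>& msgs, KeyFn key) {
        std::map<std::string, std::vector<IntelMessage>> groups;
        for (const auto& m : msgs) groups[key(m)].push_back(m);
        return groups;
    }

    // Usage in the diagnosis above:
    //   auto byId  = groupBy(unexpected, [](const IntelMessage& m) { return m.identifier; });
    //   auto byLoc = groupBy(unexpected, [](const IntelMessage& m) { return m.location; });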

Performance issue. The second case illustrates that IntelLog has the capability to report potential performance issues even if jobs finish successfully. When IntelLog consumes logs from a

Spark KMeans job and three Tez jobs (Query 8) without injected problems, it reports unexpected log messages that, after further inspection, are caused by a performance problem. IntelLog detects that one Spark session and 71 Tez sessions are problematic. IntelLog extracts a new entity ‘spill’ from the unexpected log messages. Also, the unexpected logs from Tez record a file path of a disk location. In this case, the ‘spill’ event stores the intermediate data to disk when the memory usage reaches the configured limit, which incurs additional I/O overhead. Then, we re-run the two jobs with all the same configurations but a larger memory limit to verify whether the memory limit causes the anomaly. The resulting logs are consumed by IntelLog without triggering a problem.

Table 4.7: Comparison of the anomaly detection accuracy among IntelLog, DeepLog and LogCluster.

Tool       | Precision | Recall  | F-measure
IntelLog   | 87.23%    | 91.11%  | 89.13%
DeepLog    | 8.81%     | 100.00% | 16.19%
LogCluster | 73.08%    | N/A     | N/A

An unexpected bug. In the last case, IntelLog automatically reports an unexpected anomaly of a Spark WordCount job which finally turns out to be a bug in Spark (Spark-19731 [24]). In this case, the Spark job successfully finishes and it does not generate any unexpected log messages.

However, IntelLog reports that 4 out of 8 Spark sessions do not contain any log message of the

‘task’ entity group nor of its child groups, which are shown in Figure 4.7. We further inspect the

HW-graph instances built from the other 4 sessions. Each session has at most 8 subroutine instances in the task entity group, which indicates that a Spark container only gets assigned at most 8 tasks.

Per a previous study, LRTrace [161], each Spark container without a task occupies at least 250 MB of memory but does not contribute to the job progress. In this case, at least 1 GB of memory is wasted in the cluster during the job execution. A possible solution is for Spark to take the input data size into consideration when it launches containers. Note that a concrete solution is beyond the scope of this project.

Comparison. Table 4.7 shows the anomaly detection accuracy of IntelLog, DeepLog [87] and LogCluster [135]. IntelLog achieves the best overall performance. Its precision and F-measure are significantly higher than those of the other two tools. LogCluster aims to reduce the human effort spent manually examining logs. Though most of its reported logs are related to anomalies (a high precision of 73.08%), it may still miss some logs caused by the problems (recall N/A). DeepLog has high accuracy when it is applied to HDFS and OpenStack systems. However, its performance degrades when it targets distributed data analytics systems (a low precision of 8.81%). The reason is that data analytics jobs have much higher parallelism than infrastructure-level distributed systems. For example, the length of a log sequence generated by a VM operation request in OpenStack is relatively small and fixed. In such a case, DeepLog can accurately predict the next log in a sequence from the logs seen before. But in data analytics systems, parallelism exists even in one session of a single job. For example, multiple tasks run in one Spark executor, and multiple fetchers run in one MapReduce container. Therefore, the log order is less predictable. Another fundamental difference between these two kinds of systems is that the data size affects the length of log sessions of data analytics systems. The log length of new job sessions is difficult to predict from the job history.

Summary. The experiments show that IntelLog is able to detect anomalies in the systems, including simulated real-world problems, performance issues and internal system bugs. For the reported anomalies, including the three cases above, IntelLog also pinpoints the problematic entity groups or subroutines. Using such information, users can easily locate the root causes. In addition, Intel Messages represented by key-value pairs can be stored as JSON files or in a database, which facilitates log analysis since users can issue queries to obtain the messages related to a problem.

4.6 Discussions

N-gram extraction [185] is a conventional approach for pattern extraction. However, the approach is not efficient for the information extraction process in IntelLog for two reasons. First, performing the N-gram approach is time consuming since it searches all possible sequences that have a length of N. Second, it may capture word sequences that are not entities, such as ‘for attempt’, which results in a larger result set with many useless phrases.

IntelLog uses three relations to describe entity groups: parent, parallel and before. The “happens-before” relation is usually used to describe possibly causally related events in distributed systems: if event a happens before event b, event a might be the cause of event b. In addition, we add two other relations. In a distributed data analytics system, one entity can initialize multiple other entities and wait for their completion; the parent relation describes this situation. In such a system, multiple entities can also execute in parallel without causal relations; we use the parallel relation to describe this scenario.

The accuracy of information extraction depends on the quality of logs. IntelLog achieves relatively higher accuracy in Tez than in MapReduce and Spark because logs of Tez are usually short and well formatted. Factors that affect the accuracy include using multiple phrases to describe the same entity and using the same phrase to indicate multiple entities.

The structure parsing based extraction approach of IntelLog is effective because logs are usually written in simple sentences that have only one clause. For complex sentences, the approach may miss the operations in the dependent clause but it can still extract the operations in the independent clauses.

One limitation of IntelLog is that it only focuses on logs that are written in natural languages.

Otherwise, the HW-graphs may not be constructed due to the lack of correlated entity names.

Fortunately, popular distributed systems use natural languages when generating log messages.

4.7 Summary

This project presents IntelLog, a workflow reconstruction and anomaly detection tool for distributed data analytics systems. It reconstructs the workflows of the systems by using NLP based approaches in a non-intrusive manner. It uses POS analysis and sentence structure parsing to extract information from log keys. The results are Intel Keys, an enhanced representation of the original log keys.

IntelLog leverages the nomenclature of entity names to group together related entities. Its semantic awareness via HW-graphs allows users to easily understand the targeted systems. IntelLog also uses the HW-graphs for anomaly detection. Experimental results show that IntelLog outperforms existing tools in automatically detecting anomalies caused by real-world problems, performance issues and internal system bugs, and pinpoints the erroneous components. Logs in natural languages help users not only analyze anomalies but also understand the underlying mechanisms of the systems.

CHAPTER 5

SMT INTERFERENCE DIAGNOSIS AND CPU SCHEDULING FOR

JOB CO-LOCATION

5.1 Hyperthreading Interference Diagnosis

5.1.1 Finding the Metric

We identify a set of hardware performance events (HPEs) provided by Intel processors and leverage them to accurately diagnose Hyper-Threading interference on memory access. We use the selected HPEs to instruct CPU scheduling for co-located latency-critical services and batch jobs.

An Intel processor has hundreds of HPEs [11] that record runtime metrics of cores such as the number of retired instructions, the lengths of different queues, and the number of CPU cycles and stalls. We choose four candidate HPEs that are related to LLC misses and the memory subsystem. They are listed in Table 5.1.

Figure 5.1: Normalized memory access latency and value of HPEs under different thread configurations: (a) one thread; (b) two threads, max RPS; (c) two threads, various RPS.

In order to evaluate the correlation between the counter value of each candidate HPE and the memory access latency, we deploy a measurement program that runs on either one or both threads to continuously send fixed-sized (e.g., 1 MB) memory requests from the same cores to DRAM for a unit time (e.g., one second). When one thread is running, we change the request sending rate (i.e., requests per second, RPS) from 5,000 RPS to its maximum rate of around 74,000 RPS with

5,000 as the step. When two threads are running, we pin the two threads on the siblings of one physical core. We fix one thread to its maximum possible RPS and change the RPS of the other thread from 5,000 to its maximum rate around 45,000 RPS with 5,000 as the step. We make sure that the requested data do not reside in any layer of CPU caches.

Counter value per second. At the same time, we record the counter value of each candidate

HPE from the logical processor that hosts the program. Initially, we intended to use per-second counter values as the metric to reflect Hyper-Threading interference on DRAM access. However, this method is not effective if a logical processor is not fully loaded. For example, when a thread on a processor sends requests at 5,000 RPS and its sibling processor is fully loaded, the thread experiences a long memory access latency. However, the recorded per-second counter value is relatively low since only

5,000 requests are sent. Such a low counter value does not reflect the high memory access latency.

This situation can still happen even with a more fine-grained time unit.

Counter value per instruction. In order to accurately model DRAM access latency, the actual load of DRAM accesses on a processor needs to be obtained. We achieve this by recording the total number of LOAD and STORE instructions and dividing the counter values by it, as shown in formula 5.1. In this way, we calculate the average counter value per DRAM access instruction, VPIevent. We normalize the average memory access latency and the VPIevent of each HPE to their own maximum values.

VPIevent = CounterValueevent / (NumLOAD + NumSTORE)    (5.1)
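As a sketch (not Holmes' actual code), formula 5.1 translates directly into a per-interval computation over the counter deltas, with a guard for intervals without any LOAD/STORE instructions.

    #include <cstdint>

    // VPI for one HPE over one sampling interval (formula 5.1).
    double computeVPI(uint64_t counterValue, uint64_t numLoad, uint64_t numStore) {
        const uint64_t accesses = numLoad + numStore;
        return accesses == 0 ? 0.0 : static_cast<double>(counterValue) / accesses;
    }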

Table 5.1: Candidate HPEs with correlation values.

Name            | Description                                                          | Event # | Corr.
CYCLES L3 MISS  | Cycles while an L3 cache miss demand load is outstanding.            | 0x02A3  | -0.1748
STALLS L3 MISS  | Execution stalls while an L3 cache miss demand load is outstanding.  | 0x06A3  | 0.9992
CYCLES MEM ANY  | Cycles while the memory subsystem has an outstanding load.           | 0x10A3  | 0.9997
STALLS MEM ANY  | Execution stalls while the memory subsystem has an outstanding load. | 0x14A3  | 0.9999

Figure 5.1(a) shows the one-thread configuration. It clearly shows that the memory access latency remains almost unchanged with one thread, regardless of RPS. Figure 5.1(b) and (c) show the two-thread configuration. Figure 5.1(b) shows the thread always using maximum RPS, and

Figure 5.1(c) shows the thread using various RPS. For the thread with various RPS, its memory request latency remains unchanged regardless of its RPS. For the thread with maximum RPS, its maximum RPS decreases from ∼70,000 to ∼45,000 (from right to left in Figure 5.1(b)) with increasing RPS on its sibling thread, and its memory access latency also increases. Among the four HPEs, the VPIs of CYCLES MEM ANY (0x10A3) and STALLS MEM ANY (0x14A3) present almost the same trend as the memory access latency does.

We use Pearson's correlation coefficient to quantify the linear correlation between the memory access latency and the value of each HPE, as shown in the Corr column of Table 5.1. A correlation score close to 1 or -1 indicates a strong correlation between the two metrics. In Table 5.1, event STALLS MEM ANY (0x14A3) has the highest correlation score (0.9999), suggesting a strong positive correlation with the memory access latency. HPE CYCLES L3 MISS (0x02A3) has a relatively low correlation score. We notice that HPEs STALLS L3 MISS (0x06A3) and CYCLES MEM ANY (0x10A3) also present high positive correlation scores; however, the scores are slightly lower than that of HPE STALLS MEM ANY (0x14A3). As a result, we use the counter value of HPE STALLS MEM ANY (0x14A3) in formula 5.1. The resulting value VPI0x14A3 is the appropriate indicator for Hyper-Threading interference between the siblings of a physical core.
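For reference, the Corr column can be reproduced by applying the standard Pearson formula to the paired samples of normalized latency and VPI; a small self-contained sketch is given below.

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Pearson's correlation coefficient between two equally sized samples.
    double pearson(const std::vector<double>& x, const std::vector<double>& y) {
        const std::size_t n = x.size();
        double sx = 0, sy = 0, sxy = 0, sxx = 0, syy = 0;
        for (std::size_t i = 0; i < n; ++i) {
            sx += x[i]; sy += y[i];
            sxy += x[i] * y[i];
            sxx += x[i] * x[i];
            syy += y[i] * y[i];
        }
        const double num = n * sxy - sx * sy;
        const double den = std::sqrt(n * sxx - sx * sx) * std::sqrt(n * syy - sy * sy);
        return den == 0.0 ? 0.0 : num / den;
    }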

5.1.2 Effectiveness of the Metric

We show that VPI0x14A3 is an effective metric reflecting Hyper-Threading interference for real-world services. We test four services: Redis [21], RocksDB [22], WiredTiger [31] and Memcached [16]. We use workload-a from YCSB to generate requests to the services. We wrote a program that can access memory with a configurable request rate (i.e., requests per second, RPS). For each service, we pin its threads on four logical processors (i.e., service processors) and pin the threads of the memory access program on the siblings of the four processors (i.e., sibling processors). In the experiment, we use three RPS settings: 20,000 (Low), 40,000 (Medium) and 60,000 (High). For each setting, we start the memory access program and YCSB concurrently. We also run the YCSB workload alone without the memory access program as the baseline for comparison (Alone). We sum VPI0x14A3 on the service processors during the execution.

Figure 5.2: The relationship between VPI and service performance for (a) Redis, (b) RocksDB, (c) WiredTiger and (d) Memcached.

We normalize both the VPI and the latency of the services to those in the Alone setting using (V − VAlone)/VAlone, where V stands for either the VPI or the latency of the services. For example, an avg bar with value 0.3 indicates that the average latency of the service is 30% higher than that under the Alone setting. Figure 5.2 shows the normalized average latency, 99th percentile latency and VPI0x14A3 for the four services under each setting. We find that VPI0x14A3 is an effective metric that can quantitatively reflect the latency of the services.

Figure 5.3: Overview of Holmes.

5.2 Holmes Design

5.2.1 Overview

Figure 5.3 presents the closed-loop design of Holmes. The metric monitor keeps track of the status of both latency-critical services and batch jobs, and of the resource usage of the server. Holmes diagnoses Hyper-Threading interference on memory access based on the selected HPEs and the quantification approach. It then adaptively adjusts CPU core allocation for latency-critical services and batch jobs in a shared server based on process and system status. The CPU scheduler communicates with the Linux kernel and adjusts core allocation at runtime using an interference-aware scheduling algorithm which aims to achieve low latency for latency-critical services and improve server resource utilization. In return, the adjusted allocation affects the performance metrics in the system.

5.2.2 Metric Monitor

The metric monitor thread is periodically invoked to collect the information of CPU usages and

CPU/thread status of running workloads.

CPU resource usages. The monitor thread collects both the CPU usage and VPI0x14A3 for each core. It stores the metrics in an array for all cores in a server. Holmes maintains logical-processor-to-core mappings. The collected processor metrics are aggregated per core by accumulating both sibling processors' metrics on that core. As a result, Holmes presents the overall CPU usage and the overall

VPI0x14A3 of each core in the server.

Process status. For processes of latency-critical services and those of batch jobs, the monitor thread collects their CPU/thread usages and thread-to-processor mappings. Since latency-critical services are usually long running, their PIDs are provided to Holmes by the system administrator upon service initialization. In contrast, batch jobs are usually launched by a resource scheduler, such as Yarn [171], and finish at arbitrary times. In order to detect batch jobs when they start, we configure Yarn to launch batch job processes in Linux containers. Each Linux container uses its own files in the cgroup file system to achieve resource allocation and isolation. Holmes monitors directories in the cgroup file system to detect batch jobs.
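A minimal sketch of this detection step is shown below; the parent cgroup path is a hypothetical example, and the real monitor also tracks container removal and per-container resource usage.

    #include <filesystem>
    #include <set>
    #include <string>

    namespace fs = std::filesystem;

    // Hypothetical parent cgroup directory holding one sub-directory per
    // batch-job container launched by Yarn.
    const fs::path kBatchCgroupRoot = "/sys/fs/cgroup/cpu/hadoop-yarn";

    // Returns the container directories that appeared since the last scan.
    std::set<std::string> detectNewContainers(std::set<std::string>& known) {
        std::set<std::string> added;
        for (const auto& entry : fs::directory_iterator(kBatchCgroupRoot)) {
            if (!entry.is_directory()) continue;
            const std::string name = entry.path().filename().string();
            if (known.insert(name).second) added.insert(name);
        }
        return added;
    }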

Holmes determines whether a latency-critical service is serving traffic by monitoring its CPU usage. Most queries of a latency-critical service are related to memory access. Some services also have background management threads that are responsible for data merge or compaction operations which are also memory intensive. When these operations are on the fly, they incur CPU usage.

5.2.3 CPU Scheduler

The CPU scheduler periodically reads the information collected by the metric monitor and adaptively adjusts core allocations using an interference-aware scheduling algorithm for co-located latency-critical services and batch jobs. The algorithm prioritizes latency-critical services by reserving a small number of logical processors for bursts of traffic. When latency-critical services do not have traffic, it allows batch jobs to utilize the other, non-reserved logical CPUs. Though this may incur Hyper-Threading interference for batch jobs, it is not considered an issue since batch jobs do not have a hard deadline for completion. Furthermore, by leveraging Hyper-Threading, batch jobs can achieve higher throughput. When latency-critical services are serving traffic, the scheduler adjusts core allocations for both types of workloads. As the lifetime of a process can be divided into launching, running and exiting, we describe the CPU scheduling algorithm corresponding to these three phases.

Table 5.2 describes the terminology used in the algorithms.

Table 5.2: Terminology in Holmes.

Terminology       | Description
Reserved CPU      | A logical CPU reserved for latency-critical services. Batch jobs are not allowed to run on it.
Non-reserved CPU  | A CPU other than the reserved CPUs. All processes can run on it.
Lcs CPU           | A CPU that is currently hosting latency-critical services.
Lcs-sibling CPU   | The sibling CPU of an lcs CPU.
Non-sibling CPU   | A CPU whose sibling is not hosting latency-critical services.

Algorithm 3 Process launching.

1: pid_set_lss: a set of process IDs of latency-sensitive services
2: pid_set_batch: a set of process IDs of batch jobs
3: pid = launched process ID
4: if pid_set_lss contains pid then
5:     allocate(rsv_CPUs, pid)
6: else
7:     allocate(non_rsv_CPUs, pid)
8: end if

Process launching. The procedure during process launching is shown in Algorithm 3. Upon the launch of a latency-critical service, Holmes simply allocates the reserved CPUs to the service.

Upon the launch of containers of a batch job, Holmes allocates non-reserved CPUs to them. Among these non-reserved CPUs, Holmes first chooses ones from the non-sibling CPUs. The number of CPUs is specified by the configuration file of the batch job. When non-sibling CPUs are busy, Holmes chooses CPUs from all the non-reserved CPUs for the containers, as long as the VPI0x14A3 of their sibling CPUs is less than a threshold E. By this design, the performance of latency-critical services is not affected by batch jobs at launch time. Meanwhile, lcs-sibling CPUs can be utilized by batch jobs when necessary.

Algorithm 4 Process running.

1: for each lcs_CPU do
2:     for pid in pid_set_batch do
3:         while VPI0x14A3(lcs_CPU) ≥ E do
4:             if sibling_CPU of lcs_CPU has pid then
5:                 deallocate(sibling_CPU, pid)
6:                 if non_sibling_CPUs.available then
7:                     allocate(non_sibling_CPUs, pid)
8:                 end if
9:             end if
10:        end while
11:    end for
12: end for
13: while reserved_CPUs.usage > T% do
14:     new_CPU ← get_or_deprive(all_CPUs)
15:     reserved_CPUs.add(new_CPU)
16: end while

Process running. The procedure during process running is shown in Algorithm 4. Holmes checks whether a latency-critical service is serving traffic by monitoring its CPU usage. When it is serving traffic, the usage of the reserved CPUs and the VPI0x14A3 on the CPUs hosting the service must increase. If there are co-located batch jobs on the siblings of these CPUs, Holmes adjusts core allocation based on these two metrics accordingly.

When VPI0x14A3 is greater than or equal to E, Hyper-Threading interference is detected. Holmes deallocates the sibling CPUs from the containers of batch jobs until the VPI0x14A3 of the lcs CPU drops below E. Meanwhile, Holmes tries to allocate non-sibling CPUs to containers of batch jobs; this can happen when containers of batch jobs on a non-sibling CPU finish. By deallocating the CPUs from containers of batch jobs, VPI0x14A3 is reduced accordingly. Although the processing of batch jobs is slowed down, their execution progress is preserved.

When the average usage of the reserved CPUs reaches T (0 < T < 1) of their capacity but VPI0x14A3 is less than E, the capacity of the reserved CPUs is not enough to serve the latency-critical service.

For example, we initially assign four reserved CPUs to the latency-critical service and set T to 80%.

When the CPU usage of the four cores increases beyond 320% (4 ∗ 80%), Holmes starts an expansion procedure. This can happen when the latency-critical service creates more active threads than the number of initially reserved CPUs. Holmes adds one CPU at a time until the capacity is enough to serve the latency-critical service. The chosen CPU is not a sibling of the current lcs CPUs. At the same time, if any batch job is running on the sibling of the chosen CPU, Holmes deallocates it from the batch job to ensure the performance of latency-critical services.
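The expansion trigger reduces to a simple capacity check; a sketch with the example values from the text (four reserved CPUs, T = 0.8) is shown below.

    // Returns true when the aggregate usage of the reserved CPUs (in percent,
    // 100 per logical CPU) exceeds T of their total capacity, e.g.
    // needsExpansion(330.0, 4, 0.8) -> true, since 330 > 4 * 100 * 0.8 = 320.
    bool needsExpansion(double reservedUsagePercent, int numReservedCpus, double T) {
        return reservedUsagePercent > numReservedCpus * 100.0 * T;
    }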

Algorithm 5 Process exiting.

1: if lss exits then
2:     for pid in pid_set_batch do
3:         allocate(sibling_CPUs, pid)
4:     end for
5: end if
6: if batch on non_sibling_CPUs exits then
7:     for pid in pid_set_batch do
8:         reallocate(non_sibling_CPUs, pid)
9:     end for
10: end if

Process exiting. The procedure during process exiting is shown in Algorithm 5. There are two situations of process exiting that need to be managed by Holmes: 1) containers of batch jobs on non-sibling CPUs exit, and 2) the traffic of a latency-critical service is over. Upon the exit of containers of a batch job on non-sibling CPUs, Holmes examines whether any sibling CPU is hosting containers of batch jobs. If so, Holmes migrates some of them onto non-sibling CPUs. When the latency-critical service finishes serving traffic, Holmes allocates sibling CPUs to the containers of batch jobs whose CPUs were previously deprived. By this design, Holmes improves the CPU resource utilization of a server while guaranteeing resources for latency-critical services.

5.3 Implementation

We implement Holmes in ∼3,000 lines of C++ code, which will be open-sourced. We empirically set the invocation interval to 50 µs for both the metric monitor thread and the CPU scheduling thread. The interval is chosen based on the fact that latency-critical services usually have per-query response times of several hundred microseconds. The 50 µs invocation interval is short enough for fast Hyper-Threading interference detection and CPU scheduling while maintaining low overhead. We set the number of reserved cores to four in our 32-core server. We set the deallocation threshold E for batch jobs to 20. This is a rather strict value since we expect that the processors of batch jobs can be deallocated at an early stage, resolving Hyper-Threading interference. We set the CPU usage threshold T of latency-critical services for core expansion to 80%. This allows a 20% CPU quota for a burst of queries before Holmes allocates more cores to latency-critical services. When a latency-critical service is launched, the system administrator needs to specify its PID to Holmes.

Linux APIs. Holmes communicates with the Linux OS for two tasks: 1) collecting HPE counter values, and 2) allocating cores to processes. Holmes uses the system call perf_event_open to collect HPE values provided by Intel processors at runtime. Note that AMD processors provide a similar technology called Instruction-Based Sampling [10]. Holmes allocates cores for threads by invoking the system call sched_setaffinity.
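The two system calls can be exercised as in the sketch below. This is only an illustration, not Holmes' implementation: error handling is elided, and the exact raw encoding of event 0x14A3 (including any counter-mask bits) depends on the target microarchitecture.

    #include <linux/perf_event.h>
    #include <sched.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <sys/types.h>
    #include <unistd.h>
    #include <cstdint>
    #include <cstring>
    #include <vector>

    // Open a raw hardware event (e.g. 0x14A3) on one logical CPU for all tasks.
    int openRawEvent(int cpu, uint64_t rawConfig) {
        perf_event_attr attr;
        std::memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_RAW;
        attr.size = sizeof(attr);
        attr.config = rawConfig;   // e.g. 0x14A3
        attr.disabled = 1;
        int fd = syscall(SYS_perf_event_open, &attr, /*pid=*/-1, cpu,
                         /*group_fd=*/-1, /*flags=*/0);
        if (fd >= 0) ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        return fd;
    }

    // Read the accumulated counter value of an opened event.
    uint64_t readCounter(int fd) {
        uint64_t value = 0;
        read(fd, &value, sizeof(value));
        return value;
    }

    // Restrict a thread (or process) to a given set of logical CPUs.
    int pinToCpus(pid_t tid, const std::vector<int>& cpus) {
        cpu_set_t set;
        CPU_ZERO(&set);
        for (int c : cpus) CPU_SET(c, &set);
        return sched_setaffinity(tid, sizeof(set), &set);
    }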

Batch job management. We use Yarn [171] to manage batch jobs. A batch job can be divided and launched in multiple Linux containers [15]. We modify the source code of the Yarn NodeManager to launch batch jobs with a set of specified cores, which makes sure that batch jobs are not allocated the cores reserved for latency-critical services. The modification takes less than 10 lines of code in Yarn. We create a parent cgroup directory to manage all Linux containers for batch jobs. Each batch job container is associated with an individual directory. With this design, Holmes keeps track of the liveness and resource consumption of batch jobs by scanning all the cgroup directories.

5.4 Evaluation

5.4.1 Evaluation Setup

We use two servers in the experimental evaluation. Each server has two Intel Xeon Gold 6143

CPUs (32 cores per CPU), and 256 GB DRAM. Each server has a 512 GB SSD. We use four real- world latency-critical services to evaluate the performance of Holmes: Redis-5.0.5 [21], Memcached-

1.5.22 [16], RocksDB-6.0.0 [22], and WiredTiger-3.2.1 [31]. Redis and Memcached are two in-memory key-value (KV) stores. RocksDB is a state-of-the-art disk-based persistent KV store based on the Log-Structured Merge Tree (LSM tree). WiredTiger is the latest disk-based persistent KV storage engine for MongoDB. Both disk-based KV stores keep an in-memory cache. When they access data stored in files, the Linux OS also keeps a page cache for the recently accessed files. Therefore, Holmes is applicable to disk-based KV stores.

We use Yarn in Hadoop-2.9.2 [3] as the job scheduler and Spark-2.4.5 [192] as the data analytics engine for batch jobs. One server hosts co-located latency-critical services and batch jobs. The other server serves as the client of latency-critical services and the master node of Yarn.

We evaluate the performance of Holmes in four metrics: 1) query latency and SLO violation of latency-critical services, 2) server throughput in terms of CPU utilization and the number of completed batch jobs, 3) parameter sensitivity of Holmes, and 4) overhead of Holmes.

We use the following configurations to generate job co-location on the server, and use it for latency

(Section 5.4.2) and throughput (Section 5.4.3) evaluation. We submit workloads in YCSB [77] to generate bursty traffic for latency-critical services. Each bundle of bursty traffic lasts for 60s∼90s with an interval ranging from 5s∼10s. Both the traffic periods and the interval periods follow a Poisson distribution.


Figure 5.4: The CDF of query latency of Redis service under three settings.


Figure 5.5: The CDF of query latency of RocksDB service under three settings.


Figure 5.6: The CDF of query latency of WiredTiger service under three settings.

Figure 5.7: The CDF of query latency of Memcached service under three settings.


Figure 5.8: SLO violation of four latency-critical services under three settings.

We continuously submit multiple concurrent workloads in HiBench-6.0 [114] as batch jobs. Each batch job lasts for around three minutes. After all batch jobs are submitted, all processors on the server are allocated to batch jobs except for the reserved processors for latency-critical services.

We make sure that there is no memory pressure on the server by constraining the memory limit of containers of batch jobs.

We conduct the experiments in three settings.

• Alone: Latency-critical services run in a dedicated server without job co-location. This is an ideal situation for the services, but without batch job throughput.

• PerfIso: Latency-critical services are co-located with batch jobs. Naive CPU isolation by PerfIso [115] is enabled to dynamically adjust processor allocation.

• Holmes: Latency-critical services are co-located with batch jobs. Holmes is enabled to dynamically adjust core allocation to avoid CPU interference and the impact of SMT interference on latency-critical services.

5.4.2 Query Latency Reduction

We examine query latency when latency-critical services serve three representative workloads in

YCSB under the settings of Alone, Holmes, and PerfIso. Workload-a is a write-heavy workload consisting of 50% read and 50% update queries. Workload-b is a read-heavy workload consisting of 95% read and 5% update queries. Workload-e is a scan-heavy workload consisting of 95% scan and 5% insert queries. Note that there is no workload-e for Memcached service since it does not support scan operations. We show the CDF of query latency of the three workloads for the four latency-critical services running under the three settings in Figures 5.4 to 5.7.

Redis and Memcached. They are two popular in-memory KV stores. As shown in Figure 5.4 and Figure 5.7, naive isolation due to PerfIso prolongs the query latency for all three workloads at each percentile. For Redis service, Holmes reduces the average (99th percentile) query latency by

49.0% (35.2%), 40.7% (11.7%), and 28.1% (49.3%) for the three workloads, respectively. We observe a very long tail (99th percentile) of query latency in workload-b of Redis due to PerfIso. It suggests that Hyper-Threading interference has significant impact on the read-heavy workload for Redis. For

Memcached service, compared to PerfIso, Holmes reduces the average (99th percentile) latency by

16.9% (52.3%) and 9.5% (39.2%) for the two workloads, respectively. It achieves almost identical query latency as Alone setting does for both workload-a and workload-b.

RocksDB and WiredTiger. They are two disk-based persistent KV stores. As shown in

Figure 5.5 and Figure 5.6, Holmes achieves almost identical query latency as the Alone setting does for both services. For the RocksDB service, queries have a long tail of latency even when the service is running under the Alone setting. The long tail further deteriorates when Hyper-Threading interference is present.

Compared to PerfIso, Holmes reduces the average (99th percentile) query latency by 44.2% (28.9%),

18.9% (28.7%), and 25.0% (18.8%) for the three workloads, respectively. For WiredTiger service, compared to PerfIso, Holmes reduces the average (99th percentile) query latency by 21.6% (30.1%) and 19.4% (21.7%) for workload-a and workload-b, respectively. As for workload-e, the query latency under the three settings is almost identical, which implies that the workload is insensitive to Hyper-

Threading interference.


Figure 5.9: CPU utilization with four latency-critical services under three settings (Alone, Holmes, PerfIso).

We notice that for both disk-based KV stores, the CDF curve has a stair-like shape. For example, about 50% of the queries in workload-a of RocksDB have latency around 200 µs while the other 50% of the queries have latency ranging from 500 µs to several milliseconds. There are two reasons for the stair-like CDF curve. 1) For workload-a, half of the queries are updates and the other half are reads. Since both services use an asynchronous technique for update queries, the update half of the queries return quickly. For the read half, the services need to access memory or disks before returning data; thus, the read queries have much slower responses than the update queries. 2) Workload-b and workload-e are both read-intensive. The services have low latency when the queries hit the in-memory cache for the queried data, and high latency when the queries need to access disks. For disk-based KV stores, the results show that all three settings cause similar query latency at low percentiles. Hyper-Threading interference has a more significant impact on query latency at high percentiles.

SLO violation. Figure 5.8 shows the SLO violation ratios of the four latency-critical services under the Alone, Holmes and PerfIso settings, respectively. Since there is no predefined value for the SLO of each service, we adopt the 90th percentile latency under the Alone setting as the SLO. For example, in Redis, the SLOs are 136 µs, 149 µs and 12,468 µs for workload-a, -b and -e, respectively.

These are rather strict values as only 10% SLO violations are allowed under Alone setting. Compared to Alone setting, Holmes achieves a similar SLO violation ratio in most cases, especially for disk- based services (i.e., RocksDB and WiredTiger). One exception is workload-b with Redis. Holmes incurs a violation ratio of 50.8% because workload-b is very sensitive to Hyper-Threading interference and parameter setting. PerfIso causes significantly worse SLO violation ratios in all cases. Its SLO violation ratios are usually above 25%, and around 90% in the worst case.

[Summary] Experiments show that Holmes reduces the average (99th percentile) latency by up to

49.0% (52.3%) for in-memory KV stores. Since disk-based KV stores have a memory cache, Holmes reduces their average (99th percentile) latency by up to 44.2% (30.1%). In most cases, Holmes achieves a very similar latency and SLO violation ratio for the latency-critical services as Alone does.

We notice that the latency of Redis due to Holmes still has some degradation compared to the Alone case. The possible reason is that Redis uses a single thread to serve all user requests. When requests are delayed on the thread, there is no other thread to dispatch the requests, resulting in longer latency.

5.4.3 Server Throughput Improvement

We examine the server throughput when running the four latency-critical services with three workloads for one hour under Alone, Holmes, and PerfIso. The metrics are 1) CPU utilization and 2) the number of completed batch jobs.

CPU utilization. Figure 5.9 shows the average CPU utilization with the four latency-critical services under the three settings. There is no significant difference in the utilization due to different workloads. Overall, Holmes achieves an average CPU utilization of 72.4%∼85.8% while PerfIso achieves an average CPU utilization of 83.4%∼88.5%, both under job co-location. Compared to

Alone where latency-critical services are running alone, both Holmes and PerfIso significantly improve the CPU utilization of the server. Although PerfIso slightly outperforms Holmes in terms of CPU utilization, it significantly violates the SLO of latency-critical services, the principle of job co-location.

Figure 5.10: CPU usage under two settings for Redis: (a) co-location with PerfIso; (b) co-location with Holmes.

Table 5.3: Throughput comparison of completed batch jobs.

Setting                  | Avg. CPU usage | # finished batch jobs
Co-location with PerfIso | 84.6%          | 78
Co-location with Holmes  | 75.0%          | 73
Alone                    | 1.1%           | 0

We give a microscopic view of CPU utilization at runtime when Redis is serving workload-a.

Other latency-critical services and workloads have similar results. Figure 5.10 shows the overall

CPU utilization when Redis is running under PerfIso and Holmes. PerfIso, shown in Figure 5.10(a), achieves more stable and higher utilization. Its average overall utilization reaches 84.6% of the server capacity. Holmes, shown in Figure 5.10(b), leads to lower overall utilization with an average of 75.0% of the server capacity. Overall, CPU utilization due to Holmes fluctuates more than that due to PerfIso. The reason is that Holmes deallocates CPUs of batch jobs when Redis is serving traffic, and restores CPUs when the traffic is over. Thus, the CPU utilization fluctuates due to the dynamic traffic pattern of Redis. Note that it is the interference-aware CPU scheduling of Holmes that provides query latency of latency-critical services under job co-location close to that when the services are running alone in a server.

Memory utilization. The server memory utilization under the three settings does not change much. For Alone, the memory utilization stabilizes around 2 GB for Redis and Memcached, and around 1 GB for RocksDB and WiredTiger. For PerfIso and Holmes, the memory utilization stabilizes around 144 GB for all four latency-critical services co-located with batch jobs. There are two reasons for the stable memory utilization. 1) Latency-critical services use memory as data storage or data cache. Their memory consumption does not change much unless more data is inserted into the services. 2) Each container of a batch job is configured with a fixed memory size. Its memory consumption does not change unless the memory size of the containers is changed.

Figure 5.11: Normalized latency of four latency-critical services with different CPU suppression ratios S in Holmes: (a) Redis; (b) RocksDB; (c) WiredTiger; (d) Memcached.

Number of completed jobs. Table 5.3 gives the number of completed batch jobs in one hour when Redis serves workload-a in the three settings. With PerfIso, 78 batch jobs are completed.

With Holmes, 73 batch jobs are completed. Note that Holmes adaptively adjusts the lcs-sibling CPU allocation for batch jobs, which slows down batch job processing.

[Summary] Holmes significantly improves CPU utilization and system throughput in a shared server, compared to a dedicated system where latency-critical services are running alone. While

PerfIso can achieve even higher utilization and throughput, it violates the principle of job co-location.

That is, while co-located batch jobs can use transiently available resources, they should not impact performance of latency-critical services.

5.4.4 Parameter Sensitivity

We examine the impact of threshold E of VPI0x14A3 on query latency of latency-critical services. In

Holmes scheduler (Section 5.2.3), E determines when the sibling cores of each batch job's containers are disabled. It is a trade-off between CPU utilization and meeting the SLO of a latency-critical service.

A lower value yields more disabled processors and thus lower query latency for meeting the SLOs, but also fewer processors for batch jobs, which reduces the job throughput and CPU utilization of the server. For efficient job co-location, the primary goal of parameter tuning is to achieve low latency for latency-critical services while the secondary goal is to improve server utilization.

We use workload-a in YCSB and change E from 20 to 60 with step size 10. We normalize the query latency due to Holmes to that due to Alone for the four real-workload services. Figure 5.11 shows the normalized latency on average and at four specific percentiles. It shows that parameter

E with a value of 20 renders almost identical results as Alone does in most cases. Specifically, this value yields latency similar to that in Alone for Redis, WiredTiger and Memcached at each percentile. For RocksDB, this value yields slightly worse results than Alone does. In this case, users can set a lower value for E (e.g. 10) to deallocate cores of batch jobs more promptly.

[Parameter Tuning] In the experiments, Holmes with E of 20 renders latency close to that under

the Alone setting in most cases. When users tune parameter E, there are multiple factors to take into account, such as the type of batch jobs, the SLO of a latency-critical service, the hardware configuration, etc. A higher value of E may be appropriate if the service has a loose SLO and server utilization is more important, or a lower value if the SLO of the service cannot be compromised. In the future, we plan to design an automated parameter tuning plugin for Holmes. The plugin should take a user-specified SLO and tune the parameters in Holmes to meet the SLO.

5.4.5 Overhead

We analyze the overhead of Holmes. Holmes introduces about 1.3% ∼ 3% CPU usage depending on whether the scheduling threads are active in management operations. It occupies about 2 MB of memory at runtime, which is negligible compared to the memory capacity of a DRAM node. We suggest launching Holmes on a separate core to minimize its interference with latency-critical services.

The overhead of Holmes also depends on the profiling frequency of the monitor component.

A higher profiling frequency can more accurately detect the SMT interference while incurring larger overhead, and vice versa. One solution is to run Holmes on a dedicated core with a higher profiling frequency. In such a case, Holmes does not interfere with other processes running in the system.

5.5 Summary

This project presents Holmes, a user-space CPU scheduler for efficient job co-location in a SMT environment. Holmes tackles two major challenges, 1) accurately diagnosing SMT interference on memory access by identifying hardware performance events and developing a method for interference measurement, and 2) adaptive CPU scheduling via interference-aware core allocation and migration. Experiments with four real-world key-value stores show that Holmes achieves query latency of the latency-critical services close to that when the services are running alone in a server, while significantly improving server utilization and throughput of co-located batch jobs. Compared to PerfIso [115], Holmes reduces the average (99th percentile) query latency by up to 49.0% (52.3%) for the services.

CHAPTER 6

HERMES: FAST MEMORY ALLOCATION FOR

LATENCY-CRITICAL SERVICES

6.1 Hermes Design

6.1.1 Overview

In this project, we propose and develop Hermes, a library-level memory management mechanism that addresses the identified problems in the GNU/Linux system stack and reduces the memory allocation latency of latency-critical services in a multi-tenant system. Hermes is transparent to applications and does not modify the Linux OS. As shown in Figure 5.3, Hermes consists of two major components: a memory management thread in Glibc, woken every f milliseconds, and a memory monitor daemon independently running on the same physical node. A system administrator sends the process IDs of batch jobs and latency-critical services to the memory monitor daemon.

Upon memory pressure, the file cache adviser advises Linux OS to free the file cache owned by batch jobs. In Glibc, if a process is a latency-critical service, the memory management thread is started for memory reservation and virtual-physical address mapping construction.

6.1.2 Memory Management Thread

The goal of the memory management thread is to reserve memory and construct its virtual-physical address mapping in advance for latency-critical services. Figure 6.1 outlines the workflow of the management thread and the modified Glibc.

[Flowchart: a malloc call checks heap_enabled / mmap_enabled and the request size; if enough reserved memory is available in the heap or in the pool it is returned immediately, heap requests may wait on a running management routine, and otherwise the default heap or mmap allocation routine is used.]

Figure 6.1: The workflow of the modified Glibc routines.

The management thread periodically checks the current amount of reserved memory and decides whether to reserve more memory or release reserved memory back to Linux OS. When a process thread calls malloc, Hermes first tries to return the reserved memory to the process. If the reserved memory is insufficient, it uses the default routine to serve the request. Though sharing the same principle, the management thread uses different approaches to manage the main heap memory and mmapped memory chunks since they are allocated by two different system calls.
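To make the decision flow in Figure 6.1 concrete, below is a minimal C-style sketch of a malloc front end that consults the reserved memory first and falls back to the default routine otherwise. All helper names (hermes_alloc_from_top_chunk, hermes_alloc_from_mmap_pool, default_glibc_alloc, hermes_enabled) are hypothetical placeholders for illustration; they are not the actual Hermes symbols.

#include <stddef.h>
#include <stdbool.h>

/* Hypothetical helpers; names are illustrative only. */
extern void *hermes_alloc_from_top_chunk(size_t size);   /* reserved heap memory  */
extern void *hermes_alloc_from_mmap_pool(size_t size);   /* reserved mmapped pool */
extern void *default_glibc_alloc(size_t size);           /* unmodified slow path  */
extern bool  hermes_enabled;                             /* set when the PID is registered */

#define MMAP_THRESHOLD (128 * 1024)   /* Glibc's default boundary for mmap-backed requests */

void *hermes_malloc(size_t size)
{
    if (!hermes_enabled)
        return default_glibc_alloc(size);

    void *p = NULL;
    if (size < MMAP_THRESHOLD)
        p = hermes_alloc_from_top_chunk(size);   /* small requests: main heap */
    else
        p = hermes_alloc_from_mmap_pool(size);   /* large requests: mmap pool */

    /* If the reservation cannot satisfy the request, fall back to the
     * default allocation routine, as in Figure 6.1. */
    return p ? p : default_glibc_alloc(size);
}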

Small-sized memory requests are allocated from the main heap, as shown in the no branch of the large size decision in Figure 6.1. If there is sufficient memory in the main heap, Hermes immediately allocates it to the requests. Otherwise, if the management thread is running, the requests wait on it; if memory in the main heap is still insufficient, the requests are served by the default allocation routine in Glibc. We show the heap management routine in Algorithm 6. In every round of execution, the routine first updates the memory allocation metrics, including the total size of all small memory requests (i.e. requests < 128 KB) and the number of requests in the last interval. It then updates all the thresholds based on the collected memory allocation metrics

(function UpdateThreshold). For example, the target amount of reserved memory is the total amount of memory requests in the last interval multiplied by a reservation factor RSV FACTOR. If the top chunk is smaller than the reservation threshold RSV THR, it expands the current program break and immediately constructs the virtual-physical mapping for the newly allocated memory.

(a) Reserving a large chunk of memory at once. (b) Reserving small chunks of memory for multiple times. [Timelines of the free space in the top chunk (bytes) over time.]

Figure 6.2: Illustration of gradual reservation.

Otherwise, if the free space in the top chunk exceeds the trim threshold TRIM THR, it shrinks the top chunk by setting the program break to a lower memory address.
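The dissertation does not list UpdateThreshold itself, so the following is only a rough C sketch of how the thresholds could be derived from the last-interval statistics, assuming the reservation factor RSV FACTOR and the 5 MB minimum reservation described later in Section 6.2; the exact relation between RSV THR, TGT MEM and MEM CHUNK is an assumption of this sketch.

#include <stddef.h>

/* Per-interval allocation statistics collected by the modified Glibc. */
struct alloc_stats {
    size_t bytes_requested;   /* total size of small requests in the last interval */
    size_t num_requests;      /* number of small requests in the last interval     */
};

/* Tunables (values from Section 6.2 are used as illustrative defaults). */
static const double RSV_FACTOR = 2.0;              /* reservation factor        */
static const size_t MIN_RSV    = 5 * 1024 * 1024;  /* minimum reservation, 5 MB */

static size_t tgt_mem;   /* target free size in the top chunk                  */
static size_t rsv_thr;   /* reserve more memory when free space drops below it */
static size_t mem_chunk; /* size of each sbrk() step (gradual reservation)     */

static void update_threshold(const struct alloc_stats *s)
{
    /* Target reservation: last-interval demand scaled by RSV_FACTOR,
     * never below the MIN_RSV floor. */
    size_t target = (size_t)(s->bytes_requested * RSV_FACTOR);
    tgt_mem = target > MIN_RSV ? target : MIN_RSV;

    /* Assumed policy: start reserving once half of the target is left,
     * and grow in steps of the average request size. */
    rsv_thr   = tgt_mem / 2;
    mem_chunk = s->num_requests ? s->bytes_requested / s->num_requests : 4096;
}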

A naive approach. The challenge of expanding the main heap lies in how to determine the amount of memory to be reserved. Intuitively, simply reserving a large amount of memory at once would boost process performance since the memory is immediately available for processes. However, our experiments find that this approach even degrades the performance of latency-critical services in terms of tail latency. The latency of the default on-demand virtual-physical mapping construction is nearly proportional to the size of the constructed memory. Since there is only one program break for each process, manipulation of the program break must be synchronized.

Algorithm 6 Heap management routine.
1: RSV THR: a threshold below which more memory should be reserved;
2: TGT MEM: the target free size in the top chunk at which the memory reservation stops;
3: TRIM THR: a threshold above which memory is released;
4: MEM CHUNK: memory reserved on each sbrk() call;
5: top free: current free memory in the top chunk;
6: UpdateThreshold();
7: if top free < RSV THR then
8:     mem to reserve ← (TGT MEM − top free);
9:     reserved ← 0;
10:    while reserved < mem to reserve do
11:        Lock(heap);
12:        address ← sbrk(MEM CHUNK);
13:        ConstructMapping(address);
14:        reserved ← (reserved + MEM CHUNK);
15:        Unlock(heap);
16:    end while
17: else if top free > TRIM THR then
18:    extra ← (top free − TRIM THR);
19:    Lock(heap);
20:    sbrk(−extra);
21:    Unlock(heap);
22: end if

A burst of memory requests in the process thread may be blocked for a long time due to the mapping construction for a large chunk of memory in the management thread. Figure 6.2(a) illustrates this scenario. There are initially 10 bytes in the top chunk. At t1 and t2, the user process sends two memory requests req1 and req2 of 4 bytes, respectively. The requests return immediately.

Then, there are only 2 bytes left in the top chunk. The management thread is now invoked to expand the top chunk by 20 bytes and construct the virtual-physical mapping. At t3, there is another memory request req3 of 4 bytes from the user process. Since the running management thread locks the program break, req3 is blocked. It can only be served at t5 after the top chunk is expanded at t4, which incurs significant delay on the request. Although a large number of memory requests do not compete with the main heap expansion, the competing ones lead to prolonged tail latency.

Gradual reservation. We propose gradual reservation, which expands the program break by a small amount at a time, multiple times (lines 10 ∼ 16 in Algorithm 6). For example, instead of expanding the program break by 20 bytes at once, gradual reservation expands the program break five times, 4 bytes each time, as shown in Figure 6.2(b). Before req3 arrives, a reservation of a small memory chunk has already been sent to Linux OS at t2 by the management thread. After the reservation returns, req3 can be immediately served. Finally, the management thread sends four more small reservation operations until the reserved memory reaches 18 bytes. Based on our observation and

MittOS [111], continuous memory requests from latency-critical services are usually of a similar or constant size. Hermes uses the average memory request size during the previous interval as the size of each memory chunk in gradual reservation. Compared with the default on-demand virtual-physical mapping construction, Hermes serves memory requests faster even if the program break is locked by the management thread, since the virtual-physical mapping construction already starts in advance and returns shortly.
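A minimal C sketch of the gradual-reservation loop of lines 10 ∼ 16 in Algorithm 6 is shown below, assuming a mutex that serializes program-break manipulation and the mlock-based mapping construction described in Section 6.2; it illustrates the idea rather than the modified Glibc code.

#define _DEFAULT_SOURCE      /* expose sbrk() in glibc */
#include <unistd.h>          /* sbrk */
#include <stdint.h>          /* intptr_t */
#include <pthread.h>
#include <sys/mman.h>        /* mlock */

extern pthread_mutex_t heap_lock;   /* assumed lock serializing program-break changes */

/* Construct the virtual-physical mapping for [addr, addr+len);
 * Section 6.2 delegates this to the kernel via mlock. */
static void construct_mapping(void *addr, size_t len)
{
    mlock(addr, len);
}

/* Reserve mem_to_reserve bytes by growing the heap in small steps of
 * mem_chunk bytes, so concurrent malloc calls are blocked only briefly. */
static void gradual_reserve(size_t mem_to_reserve, size_t mem_chunk)
{
    size_t reserved = 0;
    while (reserved < mem_to_reserve) {
        pthread_mutex_lock(&heap_lock);
        void *addr = sbrk((intptr_t)mem_chunk);
        if (addr != (void *)-1)
            construct_mapping(addr, mem_chunk);
        reserved += mem_chunk;
        pthread_mutex_unlock(&heap_lock);
    }
}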

Large-sized memory requests are allocated from mmapped memory chunks, as shown in the yes branch of the large size decision in Figure 6.1. Management for mmapped memory is asynchronous since a process can have multiple chunks of mmapped memory space. In other words, the process thread and the management thread can simultaneously allocate two different chunks of mmapped memory space.

Figure 6.3: Segregated free list for mmapped memory chunks (buckets 0 through 7 covering chunk sizes 128∼255 KB, 256∼511 KB, 512∼639 KB, 640∼767 KB, 768∼895 KB, 896∼1023 KB, 1024∼1151 KB, and 1152 KB and above).

Thus, incoming requests do not wait on the management thread but use the default memory allocation routine when the reserved memory is insufficient. Algorithm 7 shows the management routine for mmapped memory. Since the addresses of mmapped memory space are not necessarily adjacent, each chunk of space needs to be managed separately. We use a segregated free list as the memory pool to keep track of the addresses of mmapped memory space (line 14), as shown in Figure 6.3. The function calculates the target bucket based on the size of a mmapped memory chunk using Formula 6.1.

Parameter min mmap size is the minimum memory request size that can use the mmap system call, which is 128 KB by default in Glibc. The parameter table size is the maximum number of buckets in the segregated free list. In our implementation, we empirically set table size to 8 (1 MB / 128 KB) since the size of a single memory request is usually less than 1 MB.

bucket(chunk size) = MIN(⌊chunk size / min mmap size⌋, table size)    (6.1)

Upon a request for a large chunk of memory (i.e., requests ≥ 128 KB) from the process, the modified allocation routine first tries to find the best-fit bucket in the list by calculating the bucket based on the requested size. The hash code of the best-fit bucket is calculated by the equation MIN(bucket(request size) + 1, table size). If there is no such chunk, the allocation routine uses the largest chunk in the memory pool and expands the chunk to the requested size. If this step still fails due to an empty memory pool, it falls back to the default allocation using the mmap system call. For example, in Figure 6.3, there are three memory chunks. Two are in bucket 1 and the third is in bucket 2. Consider that the application process sends a 278 KB-size memory request. The hash code of the best-fit bucket is 2. Hermes takes the first memory chunk (the 524 KB-size chunk) in the corresponding bucket and returns it to the application process. It is worth noting that Hermes does not first choose bucket 1 as the target bucket since it may contain chunks that are smaller than the memory request. Otherwise, Hermes would need to scan through the memory chunks in the buckets in order to find a memory chunk that is larger than the requested size, which introduces more overhead. As a result, the allocated chunks are usually not exactly the same size as the request size. After being returned to the user process, they are put into alloc set. On the next round of execution of the management thread, they are shrunk to the size of the requests (line 7). By this design, the user process gets the requested memory immediately as long as reserved chunks are available, while asynchronous shrinking avoids memory wastage. If memory requests are served by expanding an existing small chunk, the delay is still shorter than that of the default allocation routine. The reason is that small chunks already have their virtual-physical mapping constructed. Additional mapping constructions only need to be done for the space that exceeds the size of the original memory chunks.
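As an illustration, the bucket computation of Formula 6.1 and the best-fit lookup described above can be sketched in C as follows; the list representation, the array size, and the helper names are assumptions made for the example, not the Hermes implementation.

#include <stddef.h>

#define MIN_MMAP_SIZE (128 * 1024)   /* 128 KB: Glibc's default mmap threshold        */
#define TABLE_SIZE    8              /* 1 MB / 128 KB buckets, as in Section 6.1.2    */

struct chunk {                        /* one reserved mmapped chunk */
    void         *addr;
    size_t        size;
    struct chunk *next;
};

/* memory_pool[i] holds chunks whose size maps to bucket i. */
static struct chunk *memory_pool[TABLE_SIZE + 1];

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Formula 6.1: bucket(chunk_size) = MIN(floor(chunk_size / min_mmap_size), table_size). */
static size_t bucket(size_t chunk_size)
{
    return min_sz(chunk_size / MIN_MMAP_SIZE, TABLE_SIZE);
}

/* Best-fit lookup for a request: take the first chunk from bucket
 * MIN(bucket(request_size) + 1, table_size), so that every chunk in the
 * chosen bucket is at least as large as the request. */
static struct chunk *take_best_fit(size_t request_size)
{
    size_t b = min_sz(bucket(request_size) + 1, TABLE_SIZE);
    struct chunk *c = memory_pool[b];
    if (c)
        memory_pool[b] = c->next;
    return c;   /* NULL means: fall back to the largest chunk or to mmap */
}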

Algorithm 7 Mmap management routine.
1: RSV THR: a threshold below which more memory is reserved;
2: TGT MEM: the target free size of mmapped space at which reservation stops;
3: TRIM THR: a threshold above which memory is released;
4: MEM CHUNK: memory reserved on each mmap() call;
5: memory pool: a segregated free list that keeps track of the allocated mmapped space;
6: alloc set: a set of mmapped chunks allocated by the process thread;
7: DelayRelease(alloc set);
8: UpdateThreshold();
9: if memory pool.total size < RSV THR then
10:    reserved ← 0;
11:    while reserved < TGT MEM do
12:        address ← mmap(MEM CHUNK);
13:        ConstructMapping(address);
14:        memory pool.add(address);
15:        reserved ← (reserved + MEM CHUNK);
16:    end while
17: end if
18: while memory pool.total size > TRIM THR do
19:    to release ← memory pool.smallest space;
20:    munmap(to release);
21: end while

6.1.3 Memory Monitor Daemon

The memory monitor daemon is running on a physical node that adopts job co-location and memory sharing. A system administrator specifies the process IDs of latency-critical services and batch jobs.

The daemon keeps the process IDs of latency-critical services in shared memory. With Hermes, when a process detects that its process ID is in the shared memory, it initializes the memory management thread. Otherwise, the process behaves as if it were using the default Glibc.

Proactive paging. The memory monitor daemon is responsible for proactively advising Linux OS to release file cache pages upon memory pressure. The monitor daemon keeps track of all batch jobs and their loaded data files. When the system memory usage exceeds the threshold adv thr, the monitor daemon advises Linux OS to release file cache pages in a largest-file-first order until the percentage of file cache drops below the threshold or no file cache from the specified batch jobs remains. The largest-file-first paging order makes a large chunk of memory available at once for latency-critical services. It also reduces the number of calls to the advising routine.

Proactive paging is an effective approach to accelerating memory allocation. Even though Hermes reserves physical memory in advance, the reservation can still be delayed if it triggers the direct reclaim routine due to insufficient system memory. Proactive paging reduces the chance that the direct reclaim routine is triggered. Note that solely relying on proactive paging is insufficient since it only tries to make free space for new memory requests but does not contribute to virtual-physical mapping construction.
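The dissertation does not name the exact kernel interface used by the advising routine; one plausible, minimal sketch uses posix_fadvise with POSIX_FADV_DONTNEED on the data files of batch jobs, returning the file size so that the caller can process files in a largest-file-first order. The function below is illustrative only.

#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Advise the kernel to drop the page cache of one data file owned by a
 * batch job. Using posix_fadvise(POSIX_FADV_DONTNEED) is an assumed
 * mechanism for the advising routine. Returns the file size so the
 * caller can sort files largest-first, or -1 on error. */
static off_t drop_file_cache(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    off_t size = (fstat(fd, &st) == 0) ? st.st_size : -1;

    /* Length 0 means "until end of file". posix_fadvise returns an
     * error number directly instead of setting errno. */
    int err = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    if (err != 0)
        fprintf(stderr, "posix_fadvise failed: error %d\n", err);

    close(fd);
    return size;
}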

6.2 Implementation

We implement Hermes in Glibc-2.23 with about 1,200 lines of C code. We empirically set the invocation interval f of the memory management thread to 2 ms. Recall that we use a reservation factor RSV FACTOR to determine the amount of memory to be reserved. A larger value results in more reserved memory and faster memory allocation. However, the reserved memory is wasted if it is never used by latency-critical services. In the rest of the project, we set this value to 2 if not otherwise specified, which balances memory allocation speed and memory wastage. We also set a minimum amount of memory, min rsv, that should be reserved after each execution of the management thread even if there is no newly incoming memory request. This allows a burst of memory requests after an idle period to be served quickly. The value depends on the characteristics of latency-critical services. Empirically, we set this value to 5 MB. We use the mlock system call to delegate virtual-physical mapping construction to kernel space.

There are two choices to implement the virtual-physical mapping construction function: 1) iterating through the allocated virtual memory addresses and filling them with ‘0’, and 2) using the mlock system call to delegate the construction to the kernel space. We choose the second one for two reasons. First, our experiments find that using the mlock system call is at least 40% faster than the iteration approach for both heap memory and mmapped memory. Second, the mlock system call guarantees that newly reserved physical memory is not swapped to disk, which further accelerates memory allocation. After a chunk of reserved memory is allocated to a process, the munlock system call is called on that address space to allow swapping on the chunk.
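The two candidate implementations of the mapping construction function can be sketched as follows; this is a simplified illustration of the choice described above, not the exact Hermes code.

#include <string.h>
#include <sys/mman.h>

/* Option 1 (described in the text): touch every page by writing zeros,
 * which forces the kernel to fault in and map physical frames. */
static void construct_mapping_by_touch(void *addr, size_t len)
{
    memset(addr, 0, len);
}

/* Option 2 (the one Hermes chooses): delegate the mapping construction
 * to the kernel with mlock, which also prevents the reserved pages from
 * being swapped out before they are handed to the process. */
static int construct_mapping_by_mlock(void *addr, size_t len)
{
    return mlock(addr, len);
}

/* After a reserved chunk is handed to the process, munlock re-enables
 * swapping for that range, as described above. */
static int release_pin(void *addr, size_t len)
{
    return munlock(addr, len);
}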

The memory monitor daemon takes about 500 lines of C code. It is responsible for bookkeeping latency-critical services and batch jobs, and for advising Linux OS to release file cache pages. It communicates with the modified Glibc through a shared memory area. Specifically, it uses the shared memory to store the process IDs of all latency-critical services specified by a system administrator.

With the modified Glibc, a process examines whether its process ID is in the shared memory. If so, the modified Glibc initializes the memory management thread. When a process is no longer a latency-critical service, the administrator can simply remove its process ID. Hermes then adopts the default memory management in Glibc for this process.
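A minimal sketch of how such a shared-memory PID registry could look is given below; the segment name, capacity, and layout are hypothetical and only illustrate the daemon-to-Glibc handshake described above.

#include <fcntl.h>
#include <stdbool.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical shared-memory layout used by the monitor daemon to publish
 * the PIDs of latency-critical services. */
#define HERMES_SHM_NAME "/hermes_lcs_pids"
#define MAX_LCS         64

struct lcs_registry {
    pid_t pids[MAX_LCS];   /* 0 marks an empty slot */
};

/* Called from the modified Glibc at process start: map the registry
 * read-only and check whether the current PID is registered. */
static bool hermes_is_latency_critical(void)
{
    int fd = shm_open(HERMES_SHM_NAME, O_RDONLY, 0);
    if (fd < 0)
        return false;   /* no daemon: behave like the default Glibc */

    struct lcs_registry *reg =
        mmap(NULL, sizeof(*reg), PROT_READ, MAP_SHARED, fd, 0);
    close(fd);
    if (reg == MAP_FAILED)
        return false;

    pid_t self = getpid();
    bool found = false;
    for (int i = 0; i < MAX_LCS; i++) {
        if (reg->pids[i] == self) {
            found = true;
            break;
        }
    }
    munmap(reg, sizeof(*reg));
    return found;
}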

(a) Dedicated system. (b) Anonymous pages pressure. (c) File cache pressure. (d) Latency reduction by Hermes. [CDFs of allocation latency (ns) for Hermes, Glibc, jemalloc, and Hermes w/o paging; panel (d): reduction (%) at avg., p75, p90, p95, p99.]

Figure 6.4: The memory allocation latency for small (1KB-size) memory requests on the HDD.

(a) Dedicated system. (b) Anonymous pages pressure. (c) File cache pressure. (d) Latency reduction by Hermes. [CDFs of allocation latency (µs) for Hermes, Glibc, jemalloc, and Hermes w/o paging; panel (d): reduction (%) at avg., p75, p90, p95, p99.]

Figure 6.5: The memory allocation latency for large (256KB-size) memory requests on the HDD.

6.3 Evaluation

6.3.1 Evaluation Setup

We use both a micro benchmark and real-world latency-critical services to evaluate the performance of Hermes, and compare it to Glibc and jemalloc [88]. Glibc's malloc is the most widely used memory allocator for C/C++ programs. Jemalloc is the default memory allocator for Redis [21].

Micro benchmark. We implement a micro benchmark in C, which continuously calls the malloc function to request memory until the total amount of requested memory reaches a specified threshold.

For the micro benchmark, we run the experiments in two settings, referred to as dedicated system and memory pressure. For the dedicated system setting, we run the micro benchmark alone on the nodes with sufficient memory. For the memory pressure setting, we generate memory pressure for the micro benchmark by loading the node with either anonymous pages or file cache pages. We measure the memory allocation latency due to the three approaches. The experiment is done on a node with a HDD (Section 6.3.2) and a node with a SSD (Section 6.3.5). The HDD node has four 2.4 GHz

8-core Intel Xeon E5-2630 CPUs with 128 GB DRAM and 2 TB 7200 rpm HDD disk. It is installed with Ubuntu 16.04 with Linux kernel-4.4.0. The SSD node has four 2.8 GHz 16-core Intel Xeon Gold

6143 CPUs with 256 GB DRAM and 480 GB SSD. It is installed with Ubuntu 18.04 with Linux kernel-4.15.0.

Real-world services. We evaluate Redis [21] and Rocksdb [22] as real-world latency-critical services under different memory pressure levels in Section 6.3.3. We measure three metrics in the experiments: 1) query latency of latency-critical services, 2) SLO violation of latency-critical services, and 3) throughput of batch jobs. To generate different levels of memory pressure, we configure the maximum logically available memory of batch jobs to 50%, 75%, 100%, 125% and 150% of the memory capacity of the node. For example, on a node with 128 GB DRAM, the 150% memory pressure level means that batch jobs can oversubscribe 192 GB (128 GB × 1.5) of DRAM. The experiments are conducted on the node with HDD.

Parameter sensitivity and overhead. We conduct experiments to evaluate parameter sensitivity in Section 6.3.4. Specifically, we run the micro benchmark and evaluate its latency under different values of the reservation factor RSV FACTOR. We evaluate the overhead of Hermes in Section 6.3.6.

These two experiments are conducted on the physical node with a HDD disk.

For all experiments, we pin latency-critical services and background processes onto different cores to avoid CPU interference.

6.3.2 Micro Benchmark

We evaluate the performance of Hermes under three scenarios: a dedicated system with sufficient memory, anonymous page pressure, and file cache pressure. Under file cache pressure, we also show the performance of Hermes with proactive paging disabled, denoted as “Hermes w/o paging”, to demonstrate the performance gain due to proactive paging. The anonymous page pressure is generated by a process that keeps allocating memory until the available system memory drops below 300 MB. The file cache pressure is generated by a process that repeatedly reads 10 GB files and occupies the rest of the system memory with anonymous pages. The micro benchmark continuously sends fixed-size memory requests until the total requested memory reaches 1 GB. We use 1KB-size and 256KB-size memory requests to evaluate the allocation latency of heap memory and mmapped memory, respectively.
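For reference, a minimal version of such a micro benchmark might look as follows; whether the first touch of the allocated memory is included in the timed region, and the exact output format, are assumptions of this sketch rather than details from the dissertation.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Repeatedly malloc fixed-size blocks until 1 GB has been requested,
 * timing each allocation (including the first touch) with clock_gettime. */
int main(int argc, char **argv)
{
    size_t req_size = (argc > 1) ? strtoul(argv[1], NULL, 10) : 1024; /* e.g. 1 KB or 256 KB */
    size_t total    = 1UL << 30;                                      /* 1 GB in total       */
    size_t n        = total / req_size;

    for (size_t i = 0; i < n; i++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        void *p = malloc(req_size);
        if (!p)
            return 1;
        memset(p, 0, req_size);   /* first touch: faults pages in under the default path */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        long ns = (t1.tv_sec - t0.tv_sec) * 1000000000L + (t1.tv_nsec - t0.tv_nsec);
        printf("%ld\n", ns);      /* one latency sample per line */
    }
    return 0;
}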

Figure 6.4(a)-(c) and Figure 6.5(a)-(c) show the CDFs of memory allocation latency of 1KB-size and 256KB-size requests under a dedicated system, anonymous page pressure (“+anon” suffix), and

file cache pressure (“+file” suffix), respectively. For small memory requests, Hermes achieves the lowest latency at every percentile in all three cases. As for large memory requests, jemalloc presents longer but more stable latency under a dedicated system. However, Hermes outperforms both Glibc and jemalloc when the system is under memory pressure. Specifically, we show the latency reduction of Hermes at each percentile compared to Glibc in Figures 6.4d and 6.5d since Glibc outperforms jemalloc in most cases. For 1KB-size requests, Hermes reduces the average latency by 16.0%, 29.3%,

9.4%, and the 99th percentile latency by 15.0%, 38.8%, 17.2% in the three scenarios, respectively.

For 256KB-size requests, Hermes reduces the average latency by 12.1%, 54.4%, 21.7%, and the 99th percentile latency by 5.2%, 62.4%, 11.4%, respectively. Hermes outperforms the default Glibc at each percentile in all scenarios. The allocation latency is as low as 4 µs for small requests and 1 ms for large requests.

Comparing the “dedicated” and “file” bars in Figure 6.4d to those in Figure 6.5d shows that the performance gain by Hermes under a dedicated system and under file cache pressure is more significant for large requests than for small requests. The reason is that large requests take a long time to be allocated in the default Glibc. Hermes allocates the requests and constructs the virtual-physical mapping in advance. Thus, memory is immediately available for incoming requests. By comparing the “anon” bar to the “dedicated” and “file” bars in Figure 6.4d or Figure 6.5d, we observe that

Hermes generally achieves more performance improvement under anonymous pressure for both small and large requests compared to those under file cache pressure. The reason is that it is faster to reclaim file cache pages in the default Linux kernel since unmodified file cache pages are directly released without I/O operations. For anonymous pages, however, each of them must be swapped into disks before it is released, causing much longer delay due to I/O operations. Hermes triggers these long I/O operations in advance of new memory requests and constructs the virtual-physical mapping. Thus, it achieves more improvement under anonymous page pressure.

Proactive paging. Figures 6.4c and 6.5c show that “Hermes w/o paging” achieves similar memory allocation latency at low percentiles compared with the default Glibc, but it significantly reduces the latency at high percentiles. Full Hermes further improves the average latency over “Hermes w/o paging”.

6.3.3 Two Real-world Latency-critical Services

We evaluate the query latency reduction on real-world latency-critical services by Hermes compared to Glibc and jemalloc under different memory pressure. We use Redis-5.0.5 [21] and Rocksdb-

6.4.0 [22] as two representative real-world services. Redis is an in-memory key-value store for fast data access. Rocksdb is a disk-based persistent key-value store for fast storage environments. These services are usually used for intermediate or temporary data storage. Thus, they frequently allocate and release memory.

(a) Small requests. (b) Large requests. [p90 latency (µs) vs. memory pressure level (0%–150%) for Hermes, Glibc, and jemalloc; SLO = 330 µs for small and 4,326 µs for large requests.]

Figure 6.6: The 90th percentile query latency of Redis.

(a) Small requests. (b) Large requests. [p90 latency vs. memory pressure level (0%–150%) for Hermes, Glibc, and jemalloc; SLO = 17.6 µs for small and 573 µs for large requests.]

Figure 6.7: The 90th percentile query latency of Rocksdb.

For both Redis and Rocksdb, we implement a program to continuously generate requests. One request consists of one insertion operation followed by one read operation. We use

1KB-size and 200KB-size data records, referred to as small and large memory requests, respectively.

For each data insertion execution, we insert the data until it reaches 2 GB. To inject memory pressure, we run Spark Kmeans and Spark PageRank as batch jobs on the host node. The jobs are from HiBench-6.0 [114] using its default huge data size. We run Spark-2.3.0 on Hadoop-2.7.3 [171].

Since there is no magic value that defines the SLO of each service, we adopt the 90th percentile latency of the default Glibc under a dedicated system (w/o memory pressure) as the SLO, which is a rather strict target. The rationale is that latency-critical services like web search commonly distribute requests across many servers. The end-to-end response time is determined by the slowest individual latency [81, 199, 36, 92]. Thus, the 90th percentile latency is a critical metric in measuring the SLO of latency-critical services.

Latency reduction. Figures 6.6 and 6.7 show the 90th percentile query latency under different memory pressure levels for Redis and Rocksdb, respectively. The horizontal dashed line represents the target SLO in each case. In Redis, the SLOs are 330 µs and 4,326 µs for small and large requests, respectively. In Rocksdb, the SLOs are 17 µs and 573 µs for small and large requests, respectively.

(a) Small requests w/ batch jobs. (b) Large requests w/ batch jobs. [CDFs of query latency (µs) for Hermes, Glibc, and jemalloc.]

Figure 6.8: Latency of Redis under 100% memory pressure.

(a) Small requests w/ batch jobs. (b) Large requests w/ batch jobs. [CDFs of query latency for Hermes, Glibc, and jemalloc; time in µs for small requests and ms for large requests.]

Figure 6.9: Latency of Rocksdb under 100% memory pressure.

The results show that Hermes outperforms Glibc and jemalloc in reducing the 90th percentile query latency in all scenarios for both Redis and Rocksdb. Specifically, with a dedicated system

(0% memory pressure) or a low memory pressure level (50% and 75%), Hermes achieves similar or slightly lower 90th percentile latency compared to Glibc and jemalloc. Hermes, Glibc and jemalloc all meet the SLO targets. With a moderate memory pressure level (100% and 125%), Hermes can still meet the SLO targets except for large requests while Glibc and jemalloc incur significant SLO violation. With a severe memory pressure level (> 125%), all three approaches incur non-trivial SLO violation but Hermes significantly outperforms Glibc and jemalloc. We observe that large requests in

Rocksdb under high memory pressure experience tens of milliseconds of latency. Note that Rocksdb is a disk-based KV store with a memory cache. Under severe memory pressure, data are written to disk more frequently, causing high latency.

(a) Small requests. (b) Large requests. [SLO violation (%) vs. memory pressure level (50%–150%) for Glibc, jemalloc, and Hermes.]

Figure 6.10: The SLO violation ratio of Redis requests.

(a) Small requests. (b) Large requests. [SLO violation (%) vs. memory pressure level (50%–150%) for Glibc, jemalloc, and Hermes.]

Figure 6.11: The SLO violation ratio of Rocksdb requests.

Under job co-location, severe memory pressure is usually addressed by a system administrator while memory pressure around the 100% level is more likely to happen due to the dynamic memory consumption of latency-critical services and batch jobs. Thus, we plot the CDF of the query latency under such a scenario for Redis and Rocksdb in Figures 6.8 and 6.9, respectively. Hermes achieves the lowest latency for both services. Compared to Glibc, it reduces the average (99th percentile) latency by up to 17.0% (40.6%) for Redis and 20.6% (63.4%) for Rocksdb.

SLO violation. Figure 6.10 and Figure 6.11 show the ratios of SLO violation with Hermes, Glibc and jemalloc under different memory pressure levels for Redis and Rocksdb, respectively. For Redis,

Hermes achieves the SLO violation ratio lower than 10% under a low memory pressure level (i.e.,

50% and 75%). The results for Rocksdb are similar. The reason is that Hermes builds the virtual-physical mapping in advance such that incoming memory requests can be immediately served. The most significant results are those under 100% or higher memory pressure levels, which usually happen in a multi-tenant system. Under such a memory pressure level, compared to the default Glibc and jemalloc, Hermes reduces the SLO violation of Redis by up to 83.6%, and reduces the SLO violation of Rocksdb by up to 84.3%.

Table 6.1: Throughput of batch jobs.

            Default   Hermes   Killing   Dedicated
Redis       212       194      123       0
Rocksdb     380       364      267       0

We examine the throughput of batch jobs co-located with latency-critical services. We submit

Spark Kmeans jobs and keep three concurrent job instances in the node. Each Kmeans job runs in eight Yarn containers and requests around 40GB memory. This generates the 100% memory pressure level. We send data insertion, read, and deletion requests to the latency-critical services such that stored data size varies from 20GB to 40GB. The co-location experiment runs for 24 hours in each of the three scenarios: Default, Hermes, and Killing.

• Default. We co-locate batch jobs and latency-critical services with the default GNU/Linux

stack.

• Hermes. We co-locate batch jobs and latency-critical services with Hermes.

• Killing. Upon Default, we kill the latest launched container of a batch job when node memory

is insufficient, which frees up memory. We expect that killing the container results in the least

progress loss of the batch job.

Table 6.1 gives the number of finished batch jobs in the three co-location scenarios as well as in a dedicated system, where there is no batch job throughput. Both Default and Hermes achieve much higher throughput than Killing. Hermes achieves slightly lower throughput than that of

Default. In return, it significantly reduces the query latency and SLO violation of latency-critical services, the principal requirement of job co-location. We notice that the throughput of co-location with Rocksdb is higher than that with Redis. The reason is that Redis is a memory-based KV store that keeps all data in DRAM, whereas Rocksdb is a disk-based KV store with much lower memory consumption than Redis. Thus, more memory can be allocated to batch jobs. Experimental results

(a) Dedicated system. (b) Anonymous pressure. [Latency reduction (%) at avg., p75, p90, p95, p99 for RSV FACTOR values from 0.5x to 3.0x.]

Figure 6.12: Latency reduction for small requests.

(a) Dedicated system. (b) Anonymous pressure. [Latency reduction (%) at avg., p75, p90, p95, p99 for RSV FACTOR values from 0.5x to 3.0x.]

Figure 6.13: Latency reduction for large requests.

find that job co-location due to Hermes obtains about 98.5% average node memory utilization.

Under a Dedicated System. We also evaluate the performance of real-world services Redis and Rocksdb by Hermes, Glibc and jemalloc under a dedicated system. Compared to Glibc and jemalloc, Hermes renders similar or slightly better average, 90th , and 99th percentile query latency for Redis and Rocksdb.

6.3.4 Parameter Sensitivity

We evaluate the impact of parameter sensitivity. Specifically, we change the value of the reservation factor RSV FACTOR ranging from 0.5 to 3, and evaluate the memory allocation latency under each value for both small and large memory requests using the micro benchmark. We run the micro benchmark under a dedicated system and under anonymous page pressure, respectively. We use the same settings as those in Section 6.3.2 to generate the memory pressure. Figures 6.12 and 6.13 show the percentage of latency reduction at specific percentiles for small and large requests, respectively.

(a) Dedicated system. (b) Anonymous pages pressure. (c) File cache pressure. (d) Latency reduction by Hermes. [CDFs of allocation latency (ns) for Hermes and Glibc; panel (d): reduction (%) at avg., p75, p90, p95, p99.]

Figure 6.14: The memory allocation latency for small (1KB-size) memory requests on the SSD.

Under a dedicated system, a small value of RSV FACTOR significantly increases the 99th percentile tail latency for small requests, as shown in Figure 6.12(a). The reason is that the reserved memory under such a RSV FACTOR value is too small. When a burst of memory requests is sent by the processes, the reserved memory quickly runs out. In this case, the incoming memory requests are blocked by the memory reservation routine in Hermes since they need to modify the main heap. As the value of RSV FACTOR is increased, the 99th percentile tail latency becomes better than that of the default Glibc. For large memory requests, on the other hand, the incoming memory requests are not blocked but served by the default allocation routine in Glibc since there can be multiple mmapped memory chunks for a process. Thus, Hermes achieves more allocation latency reduction for large requests under a dedicated system, as shown in Figure 6.13(a).

Under anonymous page pressure, Hermes achieves much more significant latency reduction compared with that under a dedicated system. Specifically, it reduces the average and the 99th percentile latency by up to 69.1% and 41.6% for small requests, respectively. It reduces the average and the

99th percentile latency by up to 63.8% and 64.2% for large requests, respectively. Overall, setting

RSV FACTOR to a value larger than 2 does not achieve more performance gain. The reserved memory exceeds the total amount of memory requests and causes more memory wastage. Thus, we empirically set the value of RSV FACTOR to 2 since it achieves good reduction in the memory allocation latency while resulting in the least memory wastage.

6.3.5 The Micro Benchmark with SSD

We evaluate the memory allocation latency reduction by Hermes on a node with SSD. We use the same micro benchmark and background processes but adjust the amount of allocated memory. To generate anonymous page pressure, we use a program to allocate memory until the available memory drops below 900 MB. For file cache pressure, we use a program to repeatedly access 20 GB files and occupy the rest of the memory with anonymous pages. The micro benchmark continuously sends

1KB-size or 256KB-size memory requests until the total amount of the allocated memory reaches

2 GB. The results of 1KB-size and 256KB-size requests are shown in Figure 6.14 and Figure 6.15, respectively.

Figure 6.14d shows that for 1KB-size requests, Hermes reduces the average latency by 17.8%,

21.4%, 21.0%, and the 99th percentile latency by 11.3%, 43.9%, 51.6% under a dedicated system, under anonymous page pressure, and under file cache page pressure, respectively. Figure 6.15d shows that for 256KB-size requests, Hermes reduces the average latency by 20.0%, 20.6%, 30.1%, and the

99th percentile latency by 11.3%, 15.2%, 28.4% in the three scenarios, respectively. Overall, the latency reduction with a SSD is less significant than that with a HDD. Specifically, the latency at low percentiles (1st percentile to 70th percentile) by Hermes is close to that by the default Glibc in most scenarios. For 1KB-size requests, Hermes reduces the latency at low percentiles by about

14.3%. For 256KB-size requests under memory pressure shown in Figure 6.15b and Figure 6.15c, the allocation latency below the 70th percentile by Hermes is almost identical to that by the default

Glibc. However, Hermes achieves obvious reduction for latency at high percentiles (from the 70th percentile to the 99th percentile). Since a SSD performs swapping operations quickly, most of the requests are quickly served. Hermes is more effective for latency at high percentiles.

(a) Dedicated system. (b) Anonymous pages pressure. (c) File cache pressure. (d) Latency reduction by Hermes. [CDFs of allocation latency (µs) for Hermes and Glibc; panel (d): reduction (%) at avg., p75, p90, p95, p99.]

Figure 6.15: The memory allocation latency for large (256KB-size) memory requests on the SSD.

6.3.6 Hermes Overhead

We evaluate the overhead of Hermes. Hermes introduces about 0.4% CPU usage overhead due to the management thread in the modified Glibc. We profile the memory that is reserved but not actually used by the micro benchmark for both small (1KB-size) and large (256KB-size) memory requests.

The reserved memory at runtime is about 6 MB ∼ 6.4 MB, which is negligible compared to the memory capacity of a physical node. In addition, the memory monitor daemon requires about 2 MB of memory, including the memory occupied by the daemon process and the shared memory space. It introduces about 2.4% CPU usage since it keeps monitoring the latency-critical services and the available memory in Linux OS.

6.4 Discussions

Reservation factor. Users need to set an appropriate value for the reservation factor RSV FACTOR in Hermes. We find that a value of 2 achieves good performance gain for both the micro benchmark and real-world services while introducing the least memory wastage. However, the value setting depends on various factors such as the characteristics of latency-sensitive services and the shared environment. If a latency-sensitive service does not require much memory at runtime, RSV FACTOR can be set to a small value. Otherwise, it should be set to a relatively large value.

Reservation heuristic. Hermes relies on two simple but effective heuristics, heap management routine and mmap management routine, to reserve memory for latency-sensitive services. Since

Hermes is developed at the library level, the reservation heuristics are simple and incur low overhead.

Users can also implement their own heuristics based on the runtime characteristics of latency-sensitive services. However, complex modeling for predicting future memory usage at the library level is not recommended because it introduces high overhead, which contradicts the design goal of fast memory allocation for latency-sensitive services.

Query latency. Hermes aims for fast memory allocation. Once the reserved pages are obtained by a process, Hermes calls the munlock system call on the pages. The pages can be swapped to disk when the available system memory is low. Queries to the latency-sensitive services will be delayed if the physical pages reside in the swap area. A simple solution is to return the pages to a process without calling munlock. In this case, the pages occupied by latency-sensitive services are never swapped, resulting in low latency for queries. The simple solution meets the design goal that batch jobs should not affect the performance of latency-sensitive services. However, it may incur out-of-memory errors if the locked memory is not well managed under extreme memory pressure: since no page is eligible for reclaim, killing processes becomes the only choice.

Fragmentation. The current Glibc does not round up the size of heap memory chunks to a power of two. Thus, freed memory chunks of any size can be coalesced with neighboring chunks, which does not incur high memory waste through fragmentation. Hermes inherits the heap management algorithm from Glibc for small memory requests allocated from the heap. Thus, the impact of fragmentation on heap memory is the same as that in Glibc. Hermes uses its own hash table to manage large memory chunks allocated by the mmap system call. Since most memory requests from latency-sensitive services are of the same size, freed large memory chunks may exactly fit incoming requests, incurring no fragmentation. In the worst case where significant memory waste through fragmentation occurs, memory compaction can be done through the mremap system call. This is a rare case since modern

Table 6.2: Number of two kernel functions called at each latency range.

Latency (ns)   idle system           under pressure
               alloc     merge       alloc     merge
256∼511        1         0           391       1437
512∼1023       11        0           72        1979
1024∼2047      20        0           16        666
2048∼4095      4         448         12        10
4096∼8091      4         3622        1         1
8192∼          4         26          1         3

CPUs support hundreds of gigabytes of memory address space.

Low latency under memory pressure. In the micro benchmark experiments on a HDD (Section 6.3.2) and on a SSD (Section 6.3.5), we observe a counter-intuitive scenario in both Hermes and the default Glibc. That is, the low-percentile memory allocation latency of large memory requests under memory pressure is even lower than that under an idle system.

As large memory requests are served by mmapped memory chunks, in the kernel space, each mmapped memory chunk needs to be associated with a virtual memory area (VMA) data structure.

Upon the allocation of a mmapped memory chunk, it is either merged into an existing VMA by the vma_merge function if their addresses are adjacent, or otherwise associated with a newly allocated

VMA by the kmem_cache_alloc function.

We trace the two functions when the micro benchmark sends 4,096 256KB-size large memory requests both under an idle system and under memory pressure. Table 6.2 shows the number of two kernel functions called at each latency range. Under memory pressure, the merge function returns faster since it cannot merge mmapped memory chunks to existing VMAs, and thus a lot of new VMAs are allocated. The total latency of merge return and new VMA allocations is only

1 µs ∼ 2 µs. Under an idle system, however, most mmapped memory chunks are merged into existing VMAs while fewer are associated with newly allocated VMAs. A successful merge usually has a latency of 4 µs ∼ 8 µs. New VMAs are allocated through the Linux slab allocator. The slab allocator has more reserved space under memory pressure due to memory allocation requests of other processes. Thus, it accelerates the VMA allocation. In summary, the low latency at low percentiles is caused by 1) fast merge return, and 2) fast new VMA allocation. On the other hand, a system under memory pressure triggers page swapping, which prolongs the latency at high percentiles. Note that the allocated addresses of mmapped memory chunks are architecture-related, which is beyond the scope of this project. As for small memory requests, they are served by the main heap. Linux

OS associates the main heap to a single VMA. Changing the main heap does not involve either VMA merge or VMA allocation. It only changes the bookkeeping value of the end address in the VMA.

Thus, small memory requests do not necessarily render low latency under memory pressure.

Applicability. Currently, Hermes supports C/C++ programs only. Many popular key-value stores [21, 22, 16, 17] are implemented in C/C++. The principle of Hermes can be applied to other language runtimes. For example, for programs running on Java Virtual Machines (JVMs), the

JVMs could reserve a chunk of memory and construct the virtual-physical mapping in advance for fast memory allocation.

6.5 Summary

This project presents Hermes, a user-space approach that enables fast memory allocation for latency-sensitive services in a shared environment. We analyze the root causes of the inefficient allocation under memory pressure in the current GNU/Linux system stack. Hermes constructs the virtual-physical address mapping in advance and quickly serves incoming memory requests from latency-sensitive services. It proactively advises Linux OS to release file cache occupied by batch jobs so as to make memory available without going through the slow memory reclaim routine. Hermes is implemented in the GNU C Library. Using the micro benchmark and real-world services, experimental results show that Hermes significantly reduces the average and the tail latency of memory allocation for latency-sensitive services, both under an idle system and under memory pressure.

CHAPTER 7

CONCLUSIONS AND FUTURE WORK

7.1 Conclusions

Data-intensive applications have become increasingly popular with the support of cloud computing.

Meanwhile, this also brings challenges in maintaining a stable and high-performance system. First, pro-

filing and troubleshooting such a large, multi-layer distributed system is difficult. It is unreasonable, if not impossible, for engineers to manually analyze anomalies in a cluster with hundreds of servers.

Second, when multiple kinds of workloads run in the same cluster, efficient resource allocation and sharing is important to achieve high performance and high resource utilization. With the dynamic resource usage of different workloads during runtime, the scheduling algorithm should also dynamically adjust resources between workloads based on the real-time system status.

From the perspective of profiling and troubleshooting, we design and implement two frameworks,

LRTrace and IntelLog. LRTrace is a non-intrusive tracing tool that extracts information about both logs and resource metrics at runtime. By using keyed messages, users can easily reconstruct the workflow, including the events and the amount of processed data if recorded. Lightweight virtualization techniques provide the opportunity to access fine-grained per-container resource metrics. By correlating these two kinds of information, a user can efficiently find an anomaly and locate its root cause. IntelLog is a workflow reconstruction and anomaly detection tool for distributed data analytics systems. It reconstructs the workflows of the systems by using NLP-based approaches in a non-intrusive manner. Its semantic awareness via HW-graphs allows users to easily understand the targeted systems. IntelLog also uses the HW-graphs for anomaly detection. Logs in natural languages help users not only analyze anomalies but also understand the underlying mechanisms of the systems.

From the perspective of efficient resource scheduling and job co-location, we design and implement two frameworks, Holmes and Hermes. Holmes is a user-space CPU scheduler for efficient job co-location in a SMT environment. Holmes tackles two major challenges, 1) accurately diagnosing SMT interference on memory access by identifying hardware performance events and developing a method for interference measurement, and 2) adaptive CPU scheduling via interference-aware core allocation and migration. Experiments show that Holmes achieves query latency of the latency-critical services close to that when the services are running alone in a server, while significantly improving server utilization and throughput of co-located batch jobs. Hermes is a user-space approach that enables fast memory allocation for latency-sensitive services in a shared environment. Hermes constructs the virtual-physical address mapping in advance and quickly serves incoming memory requests from latency-sensitive services. It proactively advises Linux OS to release file cache occupied by batch jobs so as to make memory available. Experimental results show that Hermes significantly reduces the average and the tail latency of memory allocation for latency-sensitive services, both under an idle system and under memory pressure.

7.2 Future Work

In my future research, I will continue to dedicate my efforts to profiling and improving the performance of data-intensive applications. I am particularly interested in understanding the performance issues of serverless applications and optimizing their performance. I outline my vision for future research and for establishing myself as an independent researcher as follows.

Profiling and understanding serverless applications.

Serverless applications have become a paradigm in Function as a Service (FaaS) programming.

The pay-as-you-go model makes it more lightweight and cost-effective than traditional VM-based cloud services. However, its lightweight nature means that the function initialization overhead can exceed its execution time. Furthermore, since the number of concurrently running functions is very large, interference among functions becomes an important factor that affects performance. A user request can also invoke a chain of functions; understanding the skewed function in such a chain is important. We plan to design a profiling tool that can collect the information above and help users understand and troubleshoot serverless applications.

System-level support for serverless applications.

Serverless applications and FaaS programming heavily rely on lightweight virtualization such as

Linux containers and lightweight VMs. However, existing techniques have semantic gaps with serverless applications and result in either sub-optimal isolation or performance. We plan to add system-level support to bridge the gap between existing lightweight virtualization and serverless applications.

This will modify the existing Linux OS and make it customized for serverless applications and FaaS programming models.

7.3 Publications

Journal Papers

1. Shaoqi Wang, Aidi Pi, Xiaobo Zhou, Jun Wang, Cheng-Zhong Xu

Overlapping Communication with Computation in Parameter Server for Scalable DL

Training.

In IEEE Transactions on Parallel and Distributed Systems (TPDS) 2021 Early Access Article.

Conference Papers

1. Aidi Pi, Wei Chen, Shaoqi Wang, Xiaobo Zhou and Mike Ji

Profiling Distributed Systems in Light-weight Virtualized Environments with Logs and

Resource Metrics.

In Proceedings of the 28th ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC), Tempe, AZ, USA, 2018.

Open source at https://github.com/EddiePi/LRTrace/

2. Aidi Pi, Wei Chen, Will Zeller, and Xiaobo Zhou

It Can Understand the Logs, Literally.

In Proceedings of the 5th IEEE International Workshop on High-Performance Big Data, Deep Learn- ing, and Cloud Computing (HPBDC), Rio de Janeiro, Brazil, 2019.

3. Aidi Pi, Wei Chen, Shaoqi Wang and Xiaobo Zhou

Semantic-aware Workflow Construction and Analysis for Distributed Data Analytics

Systems.

In Proceedings of the 29th ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC), Phoenix, AZ, USA, 2019.

Open source at https://github.com/EddiePi/IntelLog/

4. Aidi Pi and Xiaobo Zhou.

Memory at Your Service: Fast Memory Allocation for Latency-critical Services

Under peer review.

5. Aidi Pi and Xiaobo Zhou.

Holmes: SMT Interference Diagnosis and CPU Scheduling for Job Co-location

Under peer review.

6. Wei Chen, Aidi Pi, Shaoqi Wang, Xiaobo Zhou. Characterizing Scheduling Delay for Low-latency Data Analytics Workloads.

In Proceedings of the 32nd IEEE International Parallel and Distributed Processing Symposium (IPDPS),

Vancouver, Canada, 2018.

7. Shaoqi Wang, Wei Chen, Aidi Pi, Xiaobo Zhou

Aggressive Synchronization with Partial Processing for Iterative ML Jobs on Clusters.

In Proceedings of the 19th ACM/IFIP International Middleware Conference (Middleware), Rennes,

France, 2018.

8. Shaoqi Wang, Aidi Pi, Xiaobo Zhou

Scalable Distributed DL Training: Batching Communication and Computation.

In Proceedings of the 33rd AAAI Conference on Artificial Intelligence (AAAI), Honolulu, HI, USA,

2019.

9. Wei Chen, Aidi Pi, Shaoqi Wang, Xiaobo Zhou

Pufferfish: Container-driven Elastic Memory Management for Data-intensive Applications.

In Proceedings of ACM Symposium on Cloud Computing 2019 (SoCC), Santa Cruz, CA, USA, 2019.

10. Wei Chen, Aidi Pi, Shaoqi Wang, Xiaobo Zhou

OS-Augmented Oversubscription of Opportunistic Memory with a User-Assisted OOM

Killer.

In Proceedings of the 20th ACM/IFIP International Middleware Conference (Middleware), UC

Davis, CA, USA, 2019.

Posters

1. Wei Chen, Aidi Pi, Xiaobo Zhou and Jia Rao.

MBalloon: A Distributed Memory Manager for Big Data.

In Proceedings of ACM Symposium on Cloud Computing (SoCC), Poster Session, Santa Clara, CA,

USA, 2017.

Bibliography

[1] Amazon EC2 Instance Types. https://aws.amazon.com/ec2/instance-types/#instance-details/.

[2] Apache Flume. https://flume.apache.org/.

[3] Apache Hadoop. https://hadoop.apache.org/.

[4] Apache Log4j 2. https://logging.apache.org/log4j/2.x/.

[5] Docker. https://www.docker.com/.

[6] GDB: The GNU Project Debugger. https://www.gnu.org/software/gdb/.

[7] GNU C Library. https://www.gnu.org/software/libc/.

[8] Google Cloud Virtual Machine Types. https://cloud.google.com/compute/docs/machine-types/.

[9] Graphite. https://graphite.readthedocs.io/.

[10] Instruction-Based Sampling: A New Performance Analysis Technique for AMD Family 10h Processors.

http://developer.amd.com/wordpress/media/2012/10/AMD_IBS_paper_EN.pdf/.

[11] Intel 64 and IA-32 Architectures Software Developer’s Manual. https://software.intel.com/content/www/us/en/develop/download/intel-64-and-ia-32-architectures-sdm-combined-volumes-3a-3b-3c-and-3d-system-programming-guide.html/.

[12] Introducing Hyperthreading into Azure VMs. https://azure.microsoft.com/en-us/blog/introducing-the-new-dv3-and-ev3-vm-sizes/.

[13] JSONQuery. https://github.com/burt202/jsonquery/.

[14] Logstash. https://www.elastic.co/products/logstash/.

[15] LXC. https://linuxcontainers.org/lxc/.

[16] Memcached. https://memcached.org/.

[17] MongoDB. https://www.mongodb.com/.

[18] OpenNLP. https://opennlp.apache.org/.

[19] OpenStack. https://www.openstack.org/.

[20] OpenTSDB. http://opentsdb.net//.

[21] Redis. https://redis.io/.

[22] Rocksdb. https://rocksdb.org/.

[23] SLF4J. https://www.slf4j.org/.

[24] Spark-19371. https://issues.apache.org/jira/browse/SPARK-19371/.

[25] Spark GraphX. https://spark.apache.org/graphx/.

[26] Spark MLlib. https://spark.apache.org/mllib/.

[27] Spark Streaming. https://spark.apache.org/streaming/.

[28] TCMalloc: Thread-Caching Malloc. https://gperftools.github.io/gperftools/tcmalloc.html.

[29] The FreeBSD Project. https://www.freebsd.org/.

[30] TPC-H. http://www.tpc.org/tpch/.

[31] WiredTiger Storage Engine. https://docs.mongodb.com/manual/core/wiredtiger/.

[32] YARN-6976. https://issues.apache.org/jira/browse/YARN-6976/.

[33] M. K. Aguilera, J. C. Mogul, J. L. Wiener, P. Reynolds, and A. Muthitacharoen. Performance debug-

ging for distributed systems of black boxes. In Proc. of ACM SOSP, 2003.

[34] G. Ananthanarayanan, C. Douglas, R. Ramakrishnan, S. Rao, and I. Stoica. True elasticity in multi-

tenant data-intensive compute clusters. In Proc. of ACM SoCC, 2012.

[35] P. Barham, A. Donnelly, R. Isaacs, and R. Mortier. Using magpie for request extraction and workload

modelling. In Proc. of USENIX OSDI, 2004.

[36] D. S. Berger, B. Berg, T. Zhu, M. Harchol-balter, and S. Sen. Robinhood: Tail latency-aware caching

– dynamically reallocating from cache-rich to cache-poor. In Proc. of USENIX OSDI, 2018.

[37] E. D. Berger, K. S. McKinley, R. D. Blumofe, and P. R. Wilson. Hoard: A scalable memory allocator for multithreaded applications. In Proc. of ACM ASPLOS, 2000.

[38] I. Beschastnikh, Y. Brun, M. D. Ernst, and A. Krishnamurthy. Inferring models of concurrent systems

from logs of their behavior with csight. In Proc. of the ICSE, 2014.

[39] I. Beschastnikh, Y. Brun, S. Schneider, M. Sloan, and M. D. Ernst. Leveraging existing instrumentation

to automatically infer invariant-constrained models. In Proc. of ACM SIGSOFT ESEC/FSE, 2011.

[40] D. Borthakur. Hdfs architecture guide. hadoop apache project, 2008.

[41] E. Boutin, J. Ekanayake, W. Lin, B. Shi, J. Zhou, Z. Qian, M. Wu, and L. Zhou. Apollo: scalable and

coordinated scheduling for cloud-scale computing. In Proc. of USENIX OSDI, 2014.

[42] S. Bird, E. Loper, and E. Klein. Natural Language Processing with Python. O’Reilly Media Inc.,

2009.

[43] R. Bruno, D. Patricio, J. Simao, L. Veiga, and P. Ferreira. Runtime object lifetime profiler for latency

sensitive big data applications. In Proc. of ACM EuroSys, 2019.

[44] X. Bu, J. Rao, and C.-z. Xu. Interference and locality-aware task scheduling for MapReduce applica-

tions in virtual clusters. In Proc. of ACM HPDC, 2013.

[45] J. R. Bulpin and I. A. Pratt. Hyper-threading aware process scheduling heuristics. In Proc. of USENIX

ATC, 2005.

[46] B. M. Cantrill, M. W. Shapiro, and A. H. Leventhal. Dynamic instrumentation of production systems.

In Proc. of USENIX ATC, 2004.

[47] M. Castro, M. Costa, and J.-P. Martin. Better bug reporting with better privacy. In Proc. of ACM

ASPLOS, 2008.

[48] A. Chanda, A. L. Cox, and W. Zwaenepoel. Whodunit: Transactional profiling for multi-tier applica-

tions. In Proc. of ACM Eurosys, 2007.

[49] C. Chen, W. Wang, and B. Li. Performance-aware fair scheduling: Exploiting demand elasticity of

data analytics jobs. In Proc. of IEEE InfoCom, 2018.

[50] D. Chen and C. D. Manning. A fast and accurate dependency parser using neural networks. In Proc.

of ACL EMNLP, 2014.

[51] J. Chen, L. Chen, S. Wang, G. Zhu, Y. Sun, H. Liu, and F. Li. Hotring: A hotspot-aware in-memory

key-value store. In Proc. of USENIX FAST, 2020.

[52] M. Y. Chen, E. Kiciman, E. Fratkin, A. Fox, and E. Brewer. Pinpoint: Problem determination in

large, dynamic internet services. In Proc. of IEEE DSN, 2002.

[53] W. Chen, A. Pi, S. Wang, and X. Zhou. Characterizing scheduling delay for low-latency data analytic

workloads. In Proc. of IEEE IPDPS, 2018.

[54] W. Chen, A. Pi, S. Wang, and X. Zhou. Os-augmented oversubscription of opportunistic memory with

a user-assisted oom killer. In Proc. of ACM/IFIP Middleware, 2019.

[55] W. Chen, A. Pi, S. Wang, and X. Zhou. Pufferfish: Container-driven elastic memory management for

data-intensive applications. In Proc. of ACM SoCC, 2019.

[56] W. Chen, J. Rao, and X. Zhou. Addressing memory pressure in data-intensive parallel programs via

container based virtualization. In Proc. of 2017 IEEE ICAC, pages 197–202, 2017.

[57] W. Chen, J. Rao, and X. Zhou. Addressing performance heterogeneity in mapreduce clusters with

elastic tasks. In Proc. of IEEE IPDPS, pages 1078–1087, 2017.

[58] W. Chen, J. Rao, and X. Zhou. Preemptive, low latency datacenter scheduling via lightweight virtu-

alization. In Proc. of USENIX ATC, 2017.

[59] Y. Chen, L. Youyou, F. Yang, Q. Wang, Y. Wang, and J. Shu. Flatstore: An efficient log-structured

key-value storage engine for persistent memory. In Proc. of ACM ASPLOS, 2020.

[60] D. Cheng, Y. Chen, X. Zhou, D. Gmach, and D. Milojicic. Adaptive scheduling of parallel jobs in

spark streaming. In Proc. of IEEE INFOCOM, pages 1–9, 2017.

[61] D. Cheng, Y. Guo, and X. Zhou. Self-tuning batching with dvfs for improving performance and energy

efficiency in servers. In Proc. of IEEE MASCOTS, pages 40–49, 2013.

[62] D. Cheng, C. Jiang, and X. Zhou. Heterogeneity-aware workload placement and migration in dis-

tributed sustainable datacenters. In Proc. of IEEE IPDPS, pages 307–316, 2014.

[63] D. Cheng, P. Lama, C. Jiang, and X. Zhou. Towards energy efficiency in heterogeneous hadoop clusters

by adaptive task assignment. In Proc. of IEEE ICDCS, pages 359–368, 2015.

[64] D. Cheng, J. Rao, Y. Guo, C. Jiang, and X. Zhou. Improving performance of heterogeneous mapreduce

clusters with adaptive task tuning. IEEE Transactions on Parallel and Distributed Systems, 28(3):774–

786, 2017.

[65] D. Cheng, J. Rao, Y. Guo, and X. Zhou. Improving mapreduce performance in heterogeneous envi-

ronments with adaptive task tuning. In Proceedings of the 15th International Middleware Conference,

page 97–108, 2014.

[66] D. Cheng, J. Rao, C. Jiang, and X. Zhou. Resource and deadline-aware job scheduling in dynamic

hadoop clusters. In Proc. of IEEE IPDPS, pages 956–965, 2015.

[67] D. Cheng, J. Rao, C. Jiang, and X. Zhou. Elastic power-aware resource provisioning of heterogeneous

workloads in self-sustainable datacenters. IEEE Transactions on Computers, 65(02):508–521, feb 2016.

[68] D. Cheng, X. Zhou, P. Lama, M. Ji, and C. Jiang. Energy efficiency aware task assignment with dvfs in

heterogeneous hadoop clusters. IEEE Transactions on Parallel and Distributed Systems, 29(01):70–82,

jan 2018.

[69] D. Cheng, X. Zhou, P. Lama, J. Wu, and C. Jiang. Cross-platform resource scheduling for spark and

mapreduce on yarn. IEEE Transactions on Computers, 66(8):1341–1353, 2017.

[70] D. Cheng, X. Zhou, Y. Wang, and C. Jiang. Adaptive scheduling parallel jobs with dynamic batching

in spark streaming. IEEE Transactions on Parallel and Distributed Systems, 29(12):2672–2685, 2018.

[71] D. Cheng, X. Zhou, Y. Xu, L. Liu, and C. Jiang. Deadline-aware mapreduce job scheduling with

dynamic resource availability. IEEE Transactions on Parallel and Distributed Systems, 30(4):814–826,

2019.

[72] T. M. Chilimbi, B. Liblit, K. Mehra, A. V. Nori, and K. Vaswani. Holmes: Effective statistical

debugging via efficient path profiling. In Proc. of IEEE ICSE, 2009.

[73] B. Cho, M. Rahman, T. Chajed, I. Gupta, C. Abad, N. Roberts, and P. Lin. Natjam: Design and

evaluation of eviction policies for supporting priorities and deadlines in mapreduce clusters. In Proc.

of ACM SoCC, 2013.

[74] Z. Chothia, J. Liagouris, D. Dimitrova, and T. Roscoe. Online reconstruction of structural information

from datacenter logs. In Proc. of ACM Eurosys, 2017.

[75] M. Chow, D. Meisner, J. Flinn, D. Peek, and T. F. Wenisch. The mystery machine: End-to-end

performance analysis of large-scale internet services. In Proc. of USENIX OSDI, 2014.

[76] A. Chung, J. W. Park, and G. R. Ganger. Stratus: cost-aware container scheduling in the public

cloud. In Proc. of ACM SoCC, 2018.

[77] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears. Benchmarking cloud serving

systems with ycsb. In Proc. of ACM SoCC, 2010.

[78] W. Cui, X. Ge, B. Kasikci, B. Niu, U. Sharma, R. Wang, and I. Yun. Rept: Reverse debugging of

failures in deployed software. In Proc. of USENIX OSDI, 2018.

[79] C. Curtsinger and E. D. Berger. Coz: Finding code that counts with causal profiling. In Proc. of

ACM SOSP, 2015.

[80] D. J. Dean, H. Nguyen, X. Gu, H. Zhang, J. Rhee, N. Arora, and G. Jiang. Perfscope: Practical

online server performance bug inference in production cloud computing infrastructures. In Proc. of

ACM SoCC, 2014.

[81] J. Dean and L. A. Barroso. The tail at scale. Communications of ACM, 56(2):74–80, Feb. 2013.

[82] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. In Proc. of ACM

Communications, 2008.

[83] P. Delgado, F. Dinu, A.-M. Kermarrec, and W. Zwaenepoel. Hawk: Hybrid datacenter scheduling. In

Proc. of USENIX ATC, 2015.

[84] C. Delimitrou and C. Kozyrakis. Quasar: Resource-efficient and qos-aware cluster management. In

Proc. of ACM ASPLOS, 2014.

[85] K. Deng, K. Ren, and J. Song. Symbiotic scheduling for virtual machines on smt processors. In Proc.

of IEEE CGC, 2012.

[86] M. Du and F. Li. Spell: Streaming parsing of system event logs. In Proc. of IEEE ICDM, 2017.

[87] M. Du, F. Li, G. Zheng, and V. Srikumar. Deeplog: Anomaly detection and diagnosis from system

logs through deep learning. In Proc. of ACM CCS, 2017.

[88] J. Evans. A scalable concurrent malloc(3) implementation for freebsd. In Proc. of BSDCan, 2006.

[89] J. Feliu, J. Sahuquillo, S. Petit, and J. Duato. L1-bandwidth aware thread allocation in multicore smt

processors. In Proc. of IEEE PACT, 2013.

[90] A. D. Ferguson, P. Bodik, S. Kandula, E. Boutin, and R. Fonseca. Jockey: guaranteed job latency in

data parallel clusters. In Proc. of ACM Eurosys, 2012.

[91] R. Fonseca, G. Porter, R. H. Katz, S. Shenker, and I. Stoica. X-trace: A pervasive network tracing

framework. In Proc. of USENIX NSDI, 2007.

[92] J. Fried, Z. Ruan, A. Ousterhout, and A. Belay. Caladan: Mitigating interference at microsecond

timescales. In Proc. of USENIX OSDI, 2020.

[93] Q. Fu, J.-G. Lou, Y. Wang, and J. Li. Execution anomaly detection in distributed systems through

unstructured log analysis. In Proc. of IEEE ICDM’09, 2009.

[94] S. Ghemawat, H. Gobioff, and S.-T. Leung. The google file system. In Proc. of ACM SOSP, 2003.

[95] A. Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker, and I. Stoica. Dominant resource

fairness: Fair allocation of multiple resource types. In Proc. of USENIX NSDI, 2011.

[96] E. Gilad, E. Bortnikov, A. Braginsky, Y. Gottesman, E. Hillel, I. Keidar, N. Moscovici, and R. Shahout.

Evendb: optimizing key-value storage for spatial locality. In Proc. of ACM Eurosys, 2020.

[97] I. Gog, J. Giceva, M. Schwarzkopf, K. Vaswani, D. Vytiniotis, G. Ramalingam, M. Costa, D. G.

Murray, S. Hand, and M. Isard. Broom: Sweeping out garbage collection from big data systems. In

Proc. of USENIX HotOS, 2015.

[98] R. Grandl, M. Chowdhury, A. Akella, and G. Ananthanarayanan. Altruistic scheduling in multi-

resource clusters. In Proc. of USENIX OSDI, 2016.

[99] R. Grandl, S. Kandula, S. Rao, A. Akella, and J. Kulkarni. Graphene: Packing and dependency-aware

scheduling for data-parallel clusters. In Proc. of the USENIX OSDI, 2016.

[100] Y. Guo, W. Bland, P. Balaji, and X. Zhou. Fault tolerant mapreduce-mpi for hpc clusters. In Proc.

of SC, pages 1–12, 2015.

[101] Y. Guo, P. Lama, J. Rao, and X. Zhou. V-cache: Towards flexible resource provisioning for multi-tier

applications in iaas clouds. In Proc. of IEEE IPDPS, pages 88–99, 2013.

[102] Y. Guo, P. Lama, and X. Zhou. Automated and agile server parameter tuning with learning and

control. In Proc. of IEEE IPDPS, pages 656–667, 2012.

[103] Y. Guo, J. Rao, D. Cheng, C. Jiang, C.-Z. Xu, and X. Zhou. Storeapp: A shared storage appliance for

efficient and scalable virtualized hadoop clusters. In Proc. of IEEE INFOCOM, pages 594–602, 2015.

[104] Y. Guo, J. Rao, D. Cheng, and X. Zhou. ishuffle: Improving hadoop performance with shuffle-on-write.

IEEE Transactions on Parallel and Distributed Systems, 28(6):1649–1662, 2017.

[105] Y. Guo, J. Rao, C. Jiang, and X. Zhou. Flexslot: Moving hadoop into the cloud with flexible slot man-

agement. In SC ’14: Proceedings of the International Conference for High Performance Computing,

Networking, Storage and Analysis, pages 959–969, 2014.

[106] Y. Guo, J. Rao, C. Jiang, and X. Zhou. Moving hadoop into the cloud with flexible slot management

and speculative execution. IEEE Transactions on Parallel and Distributed Systems, 28(3):798–812,

2017.

[107] Y. Guo, J. Rao, and X. Zhou. ishuffle: Improving hadoop performance with shuffle-on-write. In Proc.

of USENIX ICAC, pages 107–117, 2013.

[108] Y. Guo and X. Zhou. Coordinated vm resizing and server tuning: Throughput, power efficiency and

scalability. In Proc. of IEEE MASCOTS, pages 289–297, 2012.

[109] Z. Guo, D. Zhou, H. Lin, and M. Yang. G2: A graph processing system for diagnosing distributed

systems. In Proc. of USENIX ATC, 2011.

[110] S. S. Hahn, S. Lee, I. Yee, D. Ryu, and J. King. Fasttrack: Foreground app-aware i/o management

for improving user experience of android smartphones. In Proc. of USENIX ATC, 2018.

[111] M. Hao, H. Li, M. H. Tong, C. Pakha, and R. O. Suminto. Mittos: Supporting millisecond tail

tolerance with fast rejecting slo-aware os interface. In Proc. of ACM SOSP, 2017.

[112] B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. H. Katz, S. Shenker, and I. Stoica.

Mesos: A platform for fine-grained resource sharing in the data center. In Proc. of USENIX NSDI,

2011.

[113] S. Huang, B. Cai, and J. Huang. Toward productions-run heisenbugs reproduction on commercial

hardware. In Proc. of USENIX ATC, 2017.

[114] S. Huang, J. Huang, J. Dai, T. Xie, and B. Huang. The HiBench benchmark suite: Characterization of

the mapreduce-based data analysis. In Proc. of IEEE Data Engineering Workshops (ICDEW), 2010.

[115] C. Iorgulescu, R. Azimi, Y. Kwon, S. Elnikety, M. Syamala, V. Narasayya, H. Herodotou, P. Tomita,

A. Chen, J. Zhang, and J. Wang. Perfiso: Performance isolation for commercial latency-sensitive

services. In Proc. of USENIX ATC, 2018.

[116] C. Iorgulescu, F. Dinu, A. Raza, W. U. Hassan, and W. Zwaenepoel. Don’t cry over spilled records:

Memory elasticity of data-parallel applications and its application to cluster scheduling. In Proc. of

USENIX ATC, 2017.

[117] M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and A. Goldberg. Quincy: fair scheduling

for distributed computing clusters. In Proc. of ACM SOSP, 2009.

[118] S. A. Javadi and A. Gandhi. Dial: Reducing tail latencies for cloud applications via dynamic

interference-aware load balancing. In Proc. of IEEE ICAC, 2017.

[119] W. Jia, J. Shan, X. Shang, H. Cui, and X. Ding. vSMT-io: Improving i/o performance and efficiency

on smt processors in virtualized clouds. In Proc. of USENIX ATC, 2020.

[120] J. A. Jones and M. J. Harrold. Empirical evaluation of the tarantula automatic fault-localization

technique. In Proc. of IEEE/ACM ASE, 2005.

[121] J. S. Justeson and S. M. Katz. Technical terminology: some linguistic properties and an algorithm for

identification in text. Natural Language Engineering, 1995.

[122] S. P. Kavulya, D. Scott, K. Joshi, M. Hiltunen, R. Gandhi, and Narasimhan. Draco: Statistical

diagnosis of chronic problems in large distributed systems. In Proc. of IEEE/IFIP DSN, 2012.

[123] A. Kivity, Y. Kamay, D. Laor, U. Lublin, and A. Liguori. kvm: the linux virtual machine monitor. In

Proc. of Linux symposium, 2007.

[124] M. Kogias and E. Bugnion. Hovercraft: achieving scalability and fault-tolerance for microsecond-scale

datacenter services. In Proc. of ACM Eurosys, 2020.

[125] P. Lama, Y. Guo, and X. Zhou. Autonomic performance and power control for co-located web appli-

cations on virtualized servers. In Proc. of IEEE IWQoS, pages 1–10, 2013.

[126] P. Lama, S. Wang, X. Zhou, and D. Cheng. Performance isolation of data-intensive scale-out applica-

tions in a multi-tenant cloud. In Proc. of IEEE IPDPS, pages 85–94, 2018.

[127] P. Lama and X. Zhou. Efficient server provisioning with end-to-end delay guarantee on multi-tier

clusters. In Proc. of IEEE IWQoS, pages 1–9, 2009.

[128] P. Lama and X. Zhou. Autonomic provisioning with self-adaptive neural fuzzy control for end-to-end

delay guarantee. In Proc. of IEEE MASCOTS, pages 151–160, 2010.

[129] P. Lama and X. Zhou. amoss: Automated multi-objective server provisioning with stress-strain curv-

ing. In Proc. of IEEE ICPP, pages 345–354, 2011.

[130] P. Lama and X. Zhou. Perfume: Power and performance guarantee with fuzzy mimo control in

virtualized servers. In Proc. of IEEE IWQoS, pages 1–9, 2011.

[131] P. Lama and X. Zhou. Aroma: Automated resource allocation and configuration of mapreduce envi-

ronment in the cloud. In Proc. of ICAC, page 63–72. Association for Computing Machinery, 2012.

[132] P. Lama and X. Zhou. Ninepin: Non-invasive and energy efficient performance isolation in virtualized

servers. In Proc. of IEEE/IFIP DSN, pages 1–12, 2012.

[133] P. Lama and X. Zhou. Autonomic provisioning with self-adaptive neural fuzzy control for percentile-

based delay guarantee. ACM Trans. Auton. Adapt. Syst., 8(2), July 2013.

[134] J. Li, K. Agrawal, S. Elnikety, Y. He, I.-T. A. Lee, C. Lu, and K. S. McKinley. Work stealing for

interactive services to meet target latency. In Proc. of ACM PPoPP, 2016.

[135] Q. Lin, H. Zhang, J.-G. Lou, Y. Zhang, and X. Chen. Log clustering based problem identification for

online service systems. In Proc. of IEEE/ACM ICSE, 2016.

[136] L. Liu and H. Xu. Elasecutor: Elastic executor scheduling in data analytics systems. In Proc. of ACM

SoCC, 2018.

[137] D. Lo, L. Cheng, R. Govindaraju, P. Ranganathan, and C. Kozyrakis. Heracles: improving resource

efficiency at scale. In Proc. of ACM ISCA, 2015.

[138] J.-G. Lou, Q. Fu, S. Yang, J. Li, and B. Wu. Mining program workflow from interleaved traces. In

Proc. of ACM SIGKDD, 2010.

[139] J.-G. Lou, Q. Fu, S. Yang, Y. Xu, and J. Li. Mining invariants from console logs for system problem

detection. In Proc. of USENIX ATC, 2010.

[140] L. Luo, S. Nath, L. R. Sivalingam, M. Musuvathi, and L. Ceze. Troubleshooting, transiently-recurring

problems in production systems with blame-proportional logging. In Proc. of USENIX ATC, 2018.

[141] J. Mace, R. Roelke, and R. Fonseca. Pivot tracing: Dynamic causal monitoring for distributed systems.

In Proc. of ACM SOSP, 2015.

[142] M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini. Building a large annotated corpus of english:

The penn treebank. Computational Linguistics, 19(2):313–330, June 1993.

[143] A. Margaritov, S. Gupta, R. González-Alberquilla, and B. Grot. Stretch: Balancing qos and through-

put for colocated server workloads on smt cores. In Proc. of IEEE HPCA, 2019.

[144] J. Mars, L. Tang, R. Hundt, K. Skadron, and M. L. Soffa. Bubble-up: Increasing utilization in modern

warehouse scale computers via sensible co-locations. In Proc. of IEEE/ACM MICRO, 2011.

[145] A. J. Mashtizadeh, T. Garfinkel, D. Terei, D. Mazieres, and M. Rosenblum. Towards practical default-

on multi-core record/replay. In Proc. of ACM ASPLOS, 2017.

[146] M. Mejbah ul Alam, T. Liu, G. Zeng, and A. Muzahid. Syncperf: Categorizing, detecting, and

diagnosing synchronization performance bugs. In Proc. of ACM Eurosys, 2017.

[147] P. A. Misra, M. F. Borge, I. Goiri, A. R. Lebeck, W. Zwaenepoel, and R. Bianchini. Managing tail

latency in datacenter-scale file systems under production constraints. In Proc. of ACM EuroSys, 2019.

[148] K. Nagaraj, C. Killian, and J. Neville. Structured comparative analysis of systems logs to diagnose

performance problems. In Proc. of USENIX NSDI, 2012.

[149] V. Narasayya, I. Menache, M. Singh, F. Li, M. Syamala, and S. Chudhuri. Sharing buffer pool memory

in multi-tenant relational databases-as-a-service. In Proc. of VLDB, 2015.

[150] K. Nguyen, L. Fang, G. Xu, B. Demsky, S. Lu, S. Alamian, and O. Mutlu. Yak: A high-performance

big-data-friendly garbage collector. In Proc. of USENIX OSDI, 2016.

[151] K. Nguyen, K. Wang, Y. Bu, L. Fang, J. Hu, and G. Xu. Facade: A compiler and runtime for (almost)

object-bounded big data applications. In Proc. of ACM SOSP, 2015.

[152] J. Nivre, M.-C. Marneffe, F. Ginter, Y. Goldberg, J. Hajic, C. D. Manning, R. McDonald, S. Petrov,

S. Pyysalo, N. Silveira, R. Tsarfaty, and D. Zeman. Universal dependencies v1: A multilingual treebank

collection. In Proc. of LREC, 2016.

[153] A. J. Oliner, A. V. Kulkarni, and A. Aiken. Using correlated surprise to infer shared influence. In

Proc. of IEEE/IFIP DSN, 2010.

[154] K. Ousterhout, P. Wendell, M. Zaharia, and I. Stoica. Sparrow: distributed, low latency scheduling.

In Proc. of ACM SOSP, 2013.

[155] S. Park, Y. Zhou, W. Xiong, Z. Yin, R. Kaushik, K. H. Lee, and S. Lu. Pres: Probabilistic replay

with execution sketching on multiprocessors. In Proc. of ACM SOSP, 2009.

[156] T. B. Perez, W. Chen, R. Ji, L. Liu, and X. Zhou. Pets: Bottleneck-aware spark tuning with param-

eter ensembles. In 2018 27th International Conference on Computer Communication and Networks

(ICCCN), pages 1–9, 2018.

[157] T. B. G. Perez, X. Zhou, and D. Cheng. Reference-distance eviction and prefetching for cache man-

agement in spark. In Proc. of ACM ICPP, 2018.

[158] E. Pettijohn, Y. Guo, P. Lama, and X. Zhou. User-centric heterogeneity-aware mapreduce job provi-

sioning in the public cloud. In Proc. of USENIX ICAC, pages 137–143, 2014.

[159] A. Pi, W. Chen, S. Wang, and X. Zhou. Semantic-aware workflow construction and analysis for

distributed data analytics systems. In Proc. of ACM HPDC, page 255–266, 2019.

[160] A. Pi, W. Chen, W. Zeller, and X. Zhou. It can understand the logs, literally. In Proc. of IEEE

IPDPSW, pages 446–451, 2019.

[161] A. Pi, W. Chen, X. Zhou, and M. Ji. Profiling distributed systems in lightweight virtualized environ-

ments with logs and resource metrics. In Proc. of ACM HPDC, 2018.

[162] M. Popov, A. Jimborean, and D. Black-Schaffer. Efficient thread/page/parallelism autotuning for

NUMA systems. In Proc. of ACM ICS, 2019.

[163] R. Potharaju, N. Jain, and C. Nita-Rotaru. Juggling the jigsaw: Towards automated problem inference

from network trouble tickets. In Proc. of USENIX NSDI, 2013.

[164] I. Psaroudakis, T. Scheuer, N. May, A. Sellami, and A. Ailamaki. Adaptive NUMA-aware data

placement and task scheduling for analytical workloads in main-memory column-stores. In Proc. of

ACM VLDB Endowment, 2016.

[165] J. Rao, K. Wang, X. Zhou, and C.-Z. Xu. Optimizing virtual machine scheduling in numa multicore

systems. In Proc. of IEEE HPCA, 2013.

[166] J. Rao and X. Zhou. Towards fair and efficient smp virtual machine scheduling. ACM SIGPLAN

Notices, 49:273–286, 02 2014.

[167] S. Siddha, V. Pallipadi, and A. Mallick. Chip multi processing aware linux kernel scheduler. In Proc.

of Linux Symposium, 2005.

[168] A. J. Storm, C. Garcia-Arellano, S. S. Lightstone, Y. Diao, and M. Surendra. Adaptive self-tuning

memory in db2. In Proc. of VLDB, 2006.

[169] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy.

Hive: a warehousing solution over a map-reduce framework. Proc. of VLDB Endowment, 2009.

[170] K. Toutanova, D. Klein, C. D. Manning, and Y. Singer. Feature-rich part-of-speech tagging with a

cyclic dependency network. In Proc. of HLT-NAACL, 2003.

[171] V. K. Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, T. Graves, J. Lowe,

H. Shah, S. Seth, et al. Apache Hadoop YARN: Yet another resource negotiator. In Proc. of ACM

SoCC, 2013.

[172] J. Wang and M. Balazinska. Elastic memory management for cloud data analytics. In Proc. of USENIX

ATC, 2017.

[173] S. Wang, W. Chen, A. Pi, and X. Zhou. Aggressive synchronization with partial processing for iterative

ml jobs on clusters. In Proc. of ACM Middleware, page 253–265, 2018.

[174] S. Wang, W. Chen, X. Zhou, S.-Y. Chang, and M. Ji. Addressing skewness in iterative ml jobs with

parameter partition. In Proc. of IEEE INFOCOM, pages 1261–1269, 2019.

[175] S. Wang, W. Chen, X. Zhou, L. Zhang, and Y. Wang. Dependency-aware network adaptive scheduling

of data-intensive parallel jobs. IEEE Transactions on Parallel and Distributed Systems, 30(3):515–529,

2019.

[176] S. Wang, W. Chen, X. Zhou, L. Zhang, and Y. Wang. Dependency-aware network adaptive scheduling

of data-intensive parallel jobs. IEEE Transactions on Parallel and Distributed Systems, 30(3):515–529,

2019.

[177] S. Wang, O. J. Gonzalez, X. Zhou, T. Williams, B. D. Friedman, M. Havemann, and T. Woo. An

efficient and non-intrusive gpu scheduling framework for deep learning training systems. In Proc. of

SC, pages 1–13, 2020.

[178] S. Wang, A. Pi, and X. Zhou. Scalable distributed dl training: Batching communication and compu-

tation. In Proc. of AAAI, volume 33, pages 5289–5296, Jul. 2019.

[179] S. Wang, A. Pi, X. Zhou, J. Wang, and C.-Z. Xu. Overlapping communication with computation in

parameter server for scalable dl training. IEEE Transactions on Parallel and Distributed Systems,

32(9):2144–2159, 2021.

[180] S. Wang, X. Zhou, L. Zhang, and C. Jiang. Network-adaptive scheduling of data-intensive parallel

jobs with dependencies in clusters. In 2017 IEEE International Conference on Autonomic Computing

(ICAC), pages 155–160, 2017.

[181] J. Xu, D. Mu, X. Xing, P. Liu, P. Chen, and B. Mao. Pomp: Postmortem program analysis with

hardware-enhanced post-crash artifacts. In Proc. of USENIX Security, 2017.

[182] W. Xu, L. Huang, A. Fox, D. Patterson, and M. Jordan. Online system problem detection by mining

patterns of console logs. In Proc. of IEEE ICDM, 2009.

[183] W. Xu, L. Huang, A. Fox, D. Patterson, and M. I. Jordan. Detecting large-scale system problems by

mining console logs. In Proc. of ACM SOSP, 2009.

[184] Y. Xu, W. Chen, S. Wang, X. Zhou, and C. Jiang. Improving utilization and parallelism of hadoop

cluster by elastic containers. In IEEE INFOCOM 2018 - IEEE Conference on Computer Communi-

cations, pages 180–188, 2018.

[185] M. Yamamoto and K. W. Church. Using suffix arrays to compute term frequency and document

frequency for all substrings in a corpus. Computational Linguistics, 27(1):1–30, Mar. 2001.

[186] H. Yang, A. Breslow, J. Mars, and T. Lingjia. Bubble-flux: Precise online qos management for

increased utilization in warehouse scale computers. In Proc. of ACM ISCA, 2013.

[187] H. Yang, A. Breslow, J. Mars, and L. Tang. Bubble-flux: Precise online qos management for increased

utilization in warehouse scale computers. In Proc. of ACM ISCA, 2013.

[188] X. Yang and S. M. Blackburn. Elfen scheduling: Fine-grain principled borrowing from latency-critical

workloads using simultaneous multithreading. In Proc. of USENIX ATC, 2016.

[189] X. Yu, P. Joshi, J. Xu, and G. Jin. CloudSeer: Workflow monitoring of cloud infrastructures via

interleaved logs. In Proc. of ACM ASPLOS, 2016.

[190] D. Yuan, H. Mai, W. Xiong, L. Tan, Y. Zhou, and S. Pasupathy. Sherlog: error diagnosis by connecting

clues from run-time logs. In Proc. of ACM ASPLOS, 2010.

[191] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and I. Stoica. Delay scheduling:

a simple technique for achieving locality and fairness in cluster scheduling. In Proc. of ACM Eurosys,

2010.

[192] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster computing with

working sets. In Proc. of USENIX HOTCLOUD, 2010.

[193] T. Zhang, C. Jung, and D. Lee. Prorace: Practical data race detection for production use. In Proc. of

ACM ASPLOS, 2017.

[194] Y. Zhang, S. Makarov, X. Ren, D. Lion, and D. Yuan. Pensieve: Non-intrusive failure reproduction

for distributed systems using the event chaining approach. In Proc. of ACM SOSP, 2017.

[195] Y. Zhang, G. Prekas, G. M. Fumarola, M. Fontoura, Í. Goiri, and R. Bianchini. History-based har-

vesting of spare cycles and storage in large-scale datacenters. In Proc. of USENIX OSDI, 2016.

[196] X. Zhao, K. Rodrigues, Y. Luo, D. Yuan, and M. Stumm. Non-intrusive performance profiling for

entire software stacks based on the flow reconstruction principle. In Proc. of USENIX OSDI, 2016.

[197] X. Zhao, Y. Zhang, D. Lion, M. FaizanUllah, Y. Luo, D. Yuan, and M. Stumm. Iprof: A non-intrusive

request flow profiler for distributed systems. In Proc. of USENIX OSDI, 2014.

[198] H. Zhu and M. Erez. Dirigent: Enforcing qos for latency-critical tasks on shared multicore systems. In

Proc. of ACM ASPLOS, 2016.

[199] T. Zhu, M. A. Kozuch, and M. Harchol-Balter. Workloadcompactor: Reducing datacenter cost while

providing tail latency slo guarantees. In Proc. of ACM SoCC, 2017.