A Dissertation for the Degree of Ph.D. in Engineering

Increasing Developer Productivity by Improving Build Performance and Automating Logging Code Injection

February 2021

Graduate School of Science and Technology
Keio University

Takafumi Kubota

Acknowledgment

I would like to thank my advisor, Prof. Kenji Kono. His constant guidance helped me at every stage of my research. I would like to express my sincere gratitude to Dr. Takeshi Yoshimura, Dr. Yusuke Suzuki, and Naohiro Aota for their invaluable insight; this dissertation would not have been possible without their advice and encouragement. I am grateful to the members of my thesis committee as well: Prof. Shingo Takada, Prof. Motomichi Toyama, and Prof. Hideya Iwasaki. This dissertation was greatly improved by their valuable feedback. I am also thankful to my colleagues in the sslab. Their surprising enthusiasm and skills have always inspired me. I appreciate the financial support from the Core Research for Evolutional Science and Technology program of the Japan Science and Technology Agency and the scholarships of our university. Finally, I thank my family, my parents and sister, for their support all these years. Without their support and encouragement, many accomplishments in my life, including this dissertation, would not have been possible.

Abstract

As software grows in size and complexity, it is critical to develop software efficiently and reliably. For example, Continuous Integration (CI) has become a de facto practice of daily development in large software projects: builds and tests are automated so that the modifications of numerous developers are efficiently integrated into a mainline. However, as an obstacle to improving development efficiency, software engineers spend a lot of time outside of the actual development of the software. This dissertation addresses two such problems. 1) Build time: builds occur frequently during development, so the time spent on builds is a noticeable overhead. 2) Logging code insertion: the quality of log messages is critical to the efficiency of failure diagnosis, but appropriately inserting the logging code is time-consuming because it depends on developers' expertise and engineering effort. To deal with these problems, I introduce two effective tools. For build times, I present a new build system, called Cauldron, which aims to improve build performance for large C++ projects. Cauldron supports sophisticated unity builds and adapts its build behavior to the number of files to be compiled. My experiments show that Cauldron outperforms existing approaches; for example, it reduces the build times of WebKit by 23% in continuous builds. For logging code insertion, I introduce a new logging tool, called K9, which automatically inserts logging code to trace inter-thread data dependencies caused by data shared among threads. In multi-threaded systems, the traceability of inter-thread data dependencies is essential in failure diagnosis because the thread actually causing a failure may be different from the thread executing the buggy code. In my experiments, I show that the logs of K9 provide useful clues for debugging four bugs in the Linux kernel, including one previously unknown bug.

The contribution of this dissertation is summarized as follows. As software development consists of multiple tasks, it is important to consider the various processes in the development cycle. This dissertation proposes two tools to improve the efficiency of two specific parts of software development: build time and logging code insertion. I describe in detail the design, implementation, and evaluations of the two tools.

Contents

1 Introduction
  1.1 Motivation
  1.2 Dissertation Contributions
    1.2.1 Build System for Sophisticated Unity Builds
    1.2.2 Logging Automation Tool for Logging Inter-thread Dependencies
  1.3 Organization

2 Related Work
  2.1 Improving C++ Build Performance
    2.1.1 Compile Caching Tools
    2.1.2 Compiler Approach
    2.1.3 Language Approach
  2.2 Failure Diagnosis
    2.2.1 Diagnosis without Reproducing the Failure
    2.2.2 Diagnosis with Failure Reproduction
  2.3 Other Related Work
    2.3.1 Test Case Generation & Selection
    2.3.2 Static Analysis for Bug Detection
  2.4 Summary

3 Build System for Unity Builds with Sophisticated Bundle Strategies
  3.1 Background
    3.1.1 Build systems
    3.1.2 Incremental builds
    3.1.3 Long build times of large C++ projects
    3.1.4 Unity builds
  3.2 Problems in Unity Builds
  3.3 A Case Study on Unity Builds in WebKit
    3.3.1 Research Questions
    3.3.2 Metrics
    3.3.3 Experimental Results
  3.4 Design and Implementation of Cauldron
    3.4.1 Design Choice: Meta-Build System vs. Native Build System
    3.4.2 Bundle strategies in Cauldron
    3.4.3 Overview
    3.4.4 Dependency graph analysis
    3.4.5 Build behavior decision
    3.4.6 Bundling source files
    3.4.7 Bundle Configuration Refinement
  3.5 Experiments
    3.5.1 Build Performance in Continuous Builds
    3.5.2 Incremental-build performance
    3.5.3 Full-build performance
  3.6 Summary

4 Logging Automation for Inter-thread Data Dependencies
  4.1 Motivation
    4.1.1 Inter-Thread Data Dependency
    4.1.2 Bug Examples in Linux
  4.2 Design Goals and Overview of K9
  4.3 Inter-thread Data Dependency Model
    4.3.1 Collections and Items
    4.3.2 Dependencies between Collections and Items
    4.3.3 Log Points for Collections and Items
  4.4 Design and Implementation of K9
    4.4.1 Collection Support Library
    4.4.2 Data-flow Graph of K9
    4.4.3 Direct Dependency Analysis
    4.4.4 Indirect Dependency Analysis
  4.5 Experiments
    4.5.1 Scalability
    4.5.2 Precision of Log Points
    4.5.3 Diagnosing failures
    4.5.4 Performance Overheads
  4.6 Summary

5 Conclusion
  5.1 Contribution Summary
  5.2 Future Directions

Bibliography

List of Figures

1.1 Typical development cycle
1.2 Example of unity files in WebKit
1.3 Traditional fault-error-failure model
1.4 Logging code examples

2.1 An overview of previous studies

3.1 Example of the dependency graph in build systems
3.2 How much time does the compiler spend parsing?
3.3 Benefits of unity builds
3.4 A patch to avoid bundling source files in WebKit
3.5 The impact of the header similarity on the unity build
3.6 The impact of the front-end ratio on the unity build
3.7 Heat map of header similarity among subdirectories
3.8 Example of the header files dynamically generated by llvm-tblgen
3.9 Dependency graph including the dynamically generated header file
3.10 Compile time estimation
3.11 Overview of the work-flow
3.12 Example of bundling source files in Cauldron
3.13 Compilation of B.cc will be finished before A.cc and B.cc are bundled
3.14 Build performance during 101 builds of real git commits: LLVM
3.15 Build performance during 101 builds of real git commits: WebKit
3.16 CDF of incremental-build overheads caused by unity builds

4.1 Two types of inter-thread data dependencies
4.2 Inter-thread data dependencies in the write system call in Linux
4.3 A bug in Btrfs in Linux kernel v3.17-rc5: an error propagates from kworker to sync through a shared extent_buffer
4.4 CFQ priority violation: I/O throughput is not proportional to priority; a thread with priority 4 submits all I/O requests
4.5 Simplified log of CFQ priority violations
4.6 The work-flow of K9
4.7 Example of an array collection and item in Linux
4.8 Typical structure of the graph collection and item
4.9 Examples of graph collection, head, and item in Linux
4.10 Simplified example of queuing a socket buffer into a socket
4.11 Example of data-flow graph construction for Figure 4.10
4.12 Type-flow graphs showing the results of indirect dependency analysis
4.13 Kernel workqueue bug: data race on cwq->nr_active
4.14 Failure logs for Figure 4.13
4.15 An unknown bug in Btrfs: remaining writeback bit
4.16 Failure logs for Figure 4.15

List of Tables

3.1 The compile times of unity builds (UB) with changing bundle sizes
3.2 Notable results on incremental build overheads (seconds); the results of existing unity builds are shown in parentheses
3.3 Full-build performance (UB: unity build)
3.4 The number of unity files and bundled source files in Cauldron; the results of existing unity builds are shown in parentheses

4.1 Experimental environment
4.2 Analysis results (LP: log point, IR: intermediate representation)
4.3 Characterization of log points identified in the direct dependency analysis of the file system case (G: graph collection, A: array collection)
4.4 Diagnosed failures caused by three known bugs and one unknown bug
4.5 Performance overheads in macro benchmarks

Chapter 1

Introduction

Software has evolved rapidly [95, 114, 151]. Numerous contributors collaborate on the same projects via version-control repository hosting services [64, 65, 15]. For example, GitHub reported that 10 million new developers joined its community in 2019, for a total of over 44 million developers [63]. Focusing on developers in industry, 4,550 Microsoft employees contributed on GitHub during 2018, the largest number from any company, compared with 2,267 from Google and 2,027 from Red Hat [28, 76]. As a result, the code size of software has dramatically increased. For instance, over the last decade, the Linux kernel has grown 2.2x, from 13.4 MLOC to 29.4 MLOC, and WebKit [12], the web browser engine used by Safari, has grown 5.4x, from 7.3 MLOC to 39.7 MLOC. To support rapid development by many contributors, Continuous Integration (CI) [45] has become a de facto practice of daily development, in which all developers' working copies are merged into a shared mainline several times a day [70, 95, 114, 151]. Industry [87, 109, 126, 100] and open-source projects [16, 150, 147, 143] have widely adopted this practice to facilitate economical and reliable software development using dedicated CI systems [5, 3, 1, 36]. The practice involves four major steps. First, developers design and write the code to be committed to the mainline (1st step). Then, they commit the changes via a source control system; a CI system automatically detects the update and triggers a build to check that the update does not cause any build error (2nd step). All predefined tests are then run automatically as well (3rd step). If a test fails, a failure is reported, and developers have to diagnose

such a failure and debug it, which often requires additional code modifications (4th step). Developers continuously go through these four practices as they develop their software.

Figure 1.1. Typical development cycle: developers (1) design and write code and commit changes to the source control system, which triggers (2) builds and then (3) automated tests; reported failures lead to (4) debugging and further code changes.

1.1 Motivation

Despite the many development tools that facilitate rapid software development, two problems remain in the development of large software systems.

First, since the code size of software has dramatically increased, build times have become longer. Long build times are one of the well-recognized costs of CI [62, 75]. As reported by a recent study on open-source projects, over 40% of builds take more than 30 minutes [62], which far exceeds the acceptable build time of 10 minutes [55, 74]. A 2013 survey of 443 software engineers found that they spend on average 3.52 hours per week waiting for builds to complete, out of 41.5 hours worked in total [91]. Such long build times greatly increase the cost of software development and directly decrease developer productivity: during a build, developers simply wait for it to finish and cannot proceed with other tasks that would modify the code.

Second, the cost of debugging is also a non-trivial problem in software development. Real computer systems inevitably experience failures. While considerable effort is spent trying to eliminate such problems before deployment via testing, the size and complexity of modern systems have increased to the point that it is difficult to deliver "bullet-proof" or "bug-free" software. Moreover, diagnosing failures is notoriously difficult. In part, this is due to the fundamental difficulty of troubleshooting any complex software system, but it is further exacerbated by the insufficiency of available information (especially in production settings). For example, for reasons of both overhead and privacy, product support engineers may not be given access to the failed system or its input data in the production environment. Therefore, it is common that only the run-time log generated by a system (e.g., syslog) can be shared with the vendors. Debugging with limited information requires deep knowledge about the system and a huge amount of time to identify the root cause and fix it. Indeed, a 2013 survey showed that half of software development time is spent on debugging [27].

1.2 Dissertation Contributions

This dissertation makes two contributions towards increasing developer productivity: a new build system for improving build performance on large C++ projects and a new logging automation tool for tracing inter-thread data dependencies. These two tools aim to reduce the notable overheads in software development, as motivated in the previous section.

1.2.1 Build System for Sophisticated Unity Builds

Background on Build System

Build systems [54, 48, 89, 11, 49, 148, 110, 47, 52, 117] are considered essential to developing large software projects in compiled languages like C and C++. They orchestrate thousands of order-dependent commands, including compiling source files, linking object files, and testing output binaries, to ensure that target programs are built correctly. Since developers execute a build system whenever they modify the source code, the performance of build systems directly affects the efficiency of development cycles.

Build systems have traditionally focused on providing fast incremental builds [49] for high efficiency when rebuilding software projects after small updates. Incremental builds are small partial rebuilds in which few source files need to be recompiled. For efficient incremental builds, build systems calculate the minimum set of commands necessary to generate a target binary that reflects the changes made to the source code. For example, when one source file is updated, the build system recompiles only that source file; it does not recompile any other source files.

Long Build Times in Large C++ Software

However, as the code size of software projects has increased, full build times and even incremental build times have become long and problematic in large C++ projects [33, 136, 112]. A full build means building the whole target project, in which many source files are compiled. For example, WebKit takes 56 minutes to full-build with 18 threads on our Dell PowerEdge R430 server. Chromium [144], an open-source browser project that forms the basis of the Chrome web browser, takes two and a half hours to build. Furthermore, incremental build times have become longer as well because of the many recompilations needed when source and header files are updated. According to the logs of the WebKit build bot [150] collected between March 17 and August 3, 2018, build tasks taking more than 15 minutes were submitted 164 times on 61 out of 75 days.

Unity Builds

To accelerate builds, large C++ projects have started to apply a new build technique called unity builds [115, 41, 90, 58, 13, 121]. In a unity build, multiple source files are bundled into one unity file, as shown in Figure 1.2. The unity file is then passed to the compiler instead of the bundled source files. As a result, the compiler can eliminate the redundant processing of the header files commonly included in the bundled source files. In so doing, the build times of large C++ projects can be dramatically reduced; for example, full builds have become twice as fast in WebKit [33]. Despite this advantage, unity builds have an adverse effect on incremental builds: if a single file unified in a unity file is updated, all the other files in that unity file have to be recompiled. This increases the rebuild time and diminishes the advantages of incremental builds. Our investigation of WebKit shows that incremental builds may slow down by as much as 479% (19 s → 110 s). The WebKit developer mailing list [59] is actively discussing the overhead caused by unity builds.


#include "API/JSBase.cpp" #include "API/JSCTestRunnerUtils.cpp" #include "API/JSCallbackConstructor.cpp" #include "API/JSCallbackFunction.cpp" #include "API/JSCallbackObject.cpp" #include "API/JSClassRef.cpp" #include "API/JSContextRef.cpp" #include "API/JSHeapFinalizerPrivate.cpp"

Figure 1.2. Example of a unity file in WebKit.

Integrating Sophisticated Unity Builds into Build System

Driven by these motivations, this dissertation presents a new build system called Cauldron. Cauldron uses sophisticated bundle strategies in its unity builds to achieve better unity-build performance. In current unity builds, a fixed number of source files are bundled together in alphabetical order. However, this simple strategy ignores the characteristics of the source files to be bundled (e.g., which header files are #included in them), so undesired source files are likely to be bundled together. In this dissertation, I first conduct a performance analysis of unity builds in WebKit to show that there is room to reduce both the full build time and the incremental build time in existing unity builds. Next, I construct sophisticated bundle strategies based on the analysis results and carefully design Cauldron, which supports unity builds with bundle strategies in which the build system determines which source files are bundled together depending on their characteristics. Cauldron is a drop-in replacement for Ninja [110], a widely used modern build system, so Cauldron brings the benefits of unity builds to the many projects that use Ninja. As a result, the proposed build system successfully reduces the build times of the LLVM project and WebKit.


Figure 1.3. Traditional fault-error-failure model: a fault (a software bug, hardware fault, or misconfiguration) is the root cause; the system then starts to misbehave with errors (e.g., data corruption, memory leaks); failures (e.g., crashes, hangs, out-of-memory) are the results visible to users.

1.2.2 Logging Automation Tool for Logging Inter-thread Dependencies

Background on Failure Diagnosis

Before discussing our contribution, it is helpful to first understand how a failure happens. J.C. Laprie described the structural elements of system failures (fault, error, and failure) in a model that is widely used today [97]. As shown in Figure 1.3, a fault is a root cause, which can be a software bug, a hardware malfunction, or an operator error such as a misconfiguration. A fault can produce anomalous system behaviors referred to as errors. However, some of these errors have no user-perceivable side effects or may be handled transparently by the system; for example, objects that a terminated application forgot to free are automatically reclaimed by the operating system. Only the subset of remaining errors that propagate further and become visible to users are referred to as failures, including crashes, hangs, incorrect results, incomplete functionality, unnecessary results, etc. With this failure model, the failure diagnosis process is to reconstruct the fault-propagation chain, typically in a backward manner starting from the observed failure symptom, to pinpoint the fault.
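To make the model concrete, consider a small, contrived C++ sketch (my illustration, not an example from the studied systems), in which the fault plants corrupt state (the error) that surfaces as a user-visible crash (the failure) only when the state is later used:

#include <cstdio>
#include <map>
#include <string>

std::map<std::string, int*> cache;

// fault: a software bug that stores an invalid entry in the cache
void buggy_insert() {
    cache["config"] = nullptr;   // error: corrupt state, invisible so far
}

int main() {
    buggy_insert();              // the fault produces a latent error ...
    // ... which propagates and becomes a failure only on later use:
    std::printf("%d\n", *cache["config"]);  // failure: crash visible to users
}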

The Importance of Logging and its Challenge

When a failure is reproducible, deterministic replay is in principle effective, and it has been widely explored by the research community [154, 69, 88, 123, 8, 43, 99, 61, 44, 155, 125, 39, 111]. This approach allows a precise postmortem reproduction of the failing execution. In practice, however, it faces a range of deployment challenges, including privacy concerns, integration complexity, and high overhead.

printk(level "format string");        /* Linux kernel */
elog(level, "format string");         /* PostgreSQL */
ap_log_error(level, "format string"); /* Apache */
logit(level, "format string");        /* OpenSSH */

Figure 1.4. Logging code examples from open-source software. Each log printing statement typically has a verbosity level and a format string containing static text and variable values.

In particular, by definition, the replay trace must contain all input, including any sensitive data. Replay also requires recreating the same execution environment in which the failure occurred, including third-party software and hardware, and in the recording phase most non-deterministic events must be logged. As a result, it is frequently the case that production failures cannot be reproduced by the vendors for diagnosis.

Consequently, software engineers continue to rely on traditional log messages as a principal tool for troubleshooting failures. Figure 1.4 shows real-world examples of log printing statements. Logging is a conventional programming practice in software development and is applicable even in the production environment. It is an industry-standard practice to request logs when a customer reports a failure, and since log data typically focuses narrowly on issues, it is generally considered far less sensitive than other data sources [66]. Moreover, since system logs are typically human-readable, they can be inspected by a customer to establish their acceptability [77]. Indeed, large-scale system vendors report that such logs are available from the majority of their customers, and many customers even allow logs to be transmitted automatically and without review [79].

Given the key role that logging plays in diagnosing production failures, the effectiveness of such diagnosis fundamentally depends on the quality of the logging (where to log and what to log).

7 1.2. DISSERTATION CONTRIBUTIONS CHAPTER 1. INTRODUCTION useless and it is hard to understand what happened during a failure. However, excessive logging for trivial events can produce numerous useless logs that even- tually mask the truly important information, thus making it dicult to locate the real issue [17]. Logging too much will cause unacceptable run-time overheads as well. The quality of logging critically aects the eciency of the failure diagno- sis. A study reported that good log messages accelerate debugging by 2.2X [168] As a result, developers need to make carefully informed logging decisions.

Logging inter-thread data dependencies

While the importance of logging in failure diagnosis is widely recognized, logging remains a tough decision for developers: they still need to decide on their own where to log and what to log, which in most cases depends on their domain knowledge. As a result, deciding on appropriate log points is difficult and time-consuming, and some studies focus on logging automation to systematically relieve developers of this burden [170, 57, 168].

In this dissertation, I address the automation of logging inter-thread data dependencies, which entangle error propagation and hinder diagnosis. An inter-thread data dependency is incurred when one thread updates a shared data structure and another thread reads it later. Since an error propagates from the buggy thread to the failing thread, investigating the code executed by the failing thread rarely reveals the root cause: software engineers would see that the failing thread read corrupt data but would not be able to figure out which thread corrupted it. Logging inter-thread data dependencies is therefore critically important to diagnosing failures in multi-threaded systems.
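The following minimal C++ sketch (my illustration; K9 itself targets C systems such as the Linux kernel) shows such a dependency: the producer thread executes the buggy code, but the failure surfaces in the consumer thread that later reads the shared structure.

#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>

std::mutex mtx;
std::queue<int*> shared_queue;   // data structure shared among threads

void producer() {                // buggy thread
    std::lock_guard<std::mutex> lock(mtx);
    shared_queue.push(nullptr);  // fault: enqueues a corrupt (null) item
}

void consumer() {                // failing thread: executes no buggy code
    std::lock_guard<std::mutex> lock(mtx);
    if (!shared_queue.empty()) {
        int* item = shared_queue.front();
        shared_queue.pop();
        std::printf("%d\n", *item);  // failure: crash far from the fault
    }
}

int main() {
    std::thread t1(producer);
    t1.join();                   // producer finishes before consumer runs
    std::thread t2(consumer);
    t2.join();
}

A log recording "producer wrote to shared_queue" and "consumer read from shared_queue" would let an engineer trace the corrupt item back to the thread that enqueued it.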

8 1.3. ORGANIZATION CHAPTER 1. INTRODUCTION to them, K9 leverages coding conventions ubiquitous in real-world software sys- tems. To conrm the scalability of K9, we have applied it to the Linux kernel. K9 inserts logging code to the Linux within a reasonable amount of time. The logs generated by K9 provides useful information for diagnosis. We reproduced three failures by injecting known bugs in Linux. All of these failures involve inter- thread data dependencies. K9 generates logs that told us which thread accessed which shared data structure, and allowed us to trace back from the failure sites to the root causes. In addition, I have used K9 to diagnose a failure caused by a pre- viously unknown bug. The bug-x has been submitted to the Linux community and accepted [93].

1.3 Organization

This dissertation is organized as follows. Chapter 2 introduces previous work related to each contribution of this dissertation. Chapter 3 describes Cauldron, presenting the analysis, design, implementation, and evaluation of my approach to show that the tool achieves better build performance on large real-world projects. Chapter 4 introduces K9; there, I describe the dependency model that K9 focuses on, its design and implementation, and its evaluation. Chapter 5 concludes this dissertation and discusses future directions.

Chapter 2

Related Work

This chapter describes previous work related to this dissertation; Figure 2.1 gives an overview of the previous approaches. Section 2.1 introduces existing techniques for improving build performance in C++ projects. Section 2.2 discusses existing failure diagnosis techniques. Section 2.3 discusses further related studies.

2.1 Improving C++ Build Performance

2.1.1 Compile Caching Tools

Compile caching tools [131, 42] aim at improving the performance of incremental builds. ccache [131] speeds up recompilation by caching previous compilations under textual hashes and detecting when the same compilation is being done again. cHash [42] enhances this approach with AST hashing, which detects unnecessary recompilations more accurately. These tools are designed to eliminate unnecessary recompilations, not to reduce the redundant internal processing within required compilations. The unity builds used in Cauldron remove such redundant processing by bundling multiple source files into one translation unit.
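As a rough sketch of the caching idea (my illustration; the real ccache hashes the compiler, its flags, and the preprocessed source, and stores objects on disk):

#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>

// Stand-in for invoking the real compiler; returns the object file path.
std::string compile(const std::string& preprocessed_source) {
    return "fresh.o";
}

// Maps the hash of a preprocessed translation unit to its object file.
std::unordered_map<std::size_t, std::string> object_cache;

// Reuse the cached object on a hit; compile and remember it on a miss.
std::string compile_with_cache(const std::string& preprocessed_source) {
    std::size_t key = std::hash<std::string>{}(preprocessed_source);
    auto it = object_cache.find(key);
    if (it != object_cache.end())
        return it->second;                              // cache hit
    std::string object = compile(preprocessed_source);  // cache miss
    object_cache.emplace(key, object);
    return object;
}

cHash's refinement is to key the cache on a hash of the AST instead of the text, so textual changes that do not affect the AST can still hit the cache.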


Figure 2.1. An overview of previous studies. Approaches to improving build performance for C++ projects comprise compile caching tools, compiler approaches, language approaches, and build system approaches; failure diagnosis techniques divide into those without failure reproduction (log analysis and logging automation, trace analysis, and core dump analysis) and those with failure reproduction.

2.1.2 Compiler Approach

There are several approaches that modify compiler internals to improve build/compilation performance in C++. First, pre-compiled headers (PCHs) [146, 145, 129] reduce unnecessary recompilation of unchanged code, especially in header files. A pre-compiled header is a header file compiled into an intermediate form that is faster for the compiler to process. When the compiler recompiles a translation unit in which the code of the header file is unchanged and the pre-compiled header is ready, the compiler reuses the PCH to avoid unnecessary recompilation.

Second, designing a new compiler is another way to improve compiler performance and thereby reduce build times. Zapcc [86] is a caching C++ compiler based on LLVM/Clang, designed for faster compilation. Zapcc keeps an in-memory compilation cache in a client-server architecture, remembering all of the compilation information (e.g., ASTs, IRs, and so on) between runs. As a result, Zapcc can reduce redundant header processing, as unity builds do. Cauldron, which uses unity builds with GCC, outperforms Ninja with Zapcc in building LLVM: the full build takes 6 minutes 49 seconds with Zapcc, compared with 5 minutes 49 seconds for our build system. Zapcc is slower because its clients send compile requests to free servers at random. As a result, the compile servers often fail to reuse their compile caches, so they clear the caches and restart the compilation tasks; for example, a full build of LLVM entails clearing the caches 52 times.

This result indicates that considering the characteristics of each compilation task is critical to improving build performance. Even with a redesigned compiler, ignoring these characteristics easily leads to inefficient compilation.

2.1.3 Language Approach

A module system is coming in C++20 [134, 135]. Major open-source compilers, including GCC and Clang, have started to implement it [132, 38]. The module system is expected to mitigate the problem because it avoids re-parsing the same header files. In C++ modules, module files are used instead of header files. Compilers generate pre-compiled results from compiling module files; a pre-compiled result contains a logical description of the abstract syntax tree (AST). Compilers reuse the result for the translation units importing the module and eliminate the redundant parsing of the code written in the module file.

C++ modules and unity builds share certain motivations and goals: both approaches try to reduce redundant compilation work caused by header code. However, unity builds could remain effective even after C++ modules become standard, because the redundant processing that unity builds can eliminate still exists when using the module system. C++ modules accelerate the front-end processing of module files (e.g., file loading, lexing, parsing, and so on) by reusing pre-compiled results among different translation units. However, the back-end processing derived from a module file still occurs independently in different translation units, due to the design of the pre-compiled results. Such redundant back-end processing can be reduced by unity builds. I leave exploring a new design space that combines unity builds with C++ modules as future work.
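For reference, a minimal sketch of the C++20 modules mechanism described above (simplified; build-system integration differs per compiler):

// math.cppm: a module interface unit. The compiler compiles it once
// into a pre-compiled result that importers reuse instead of re-parsing.
export module math;

export int square(int x) { return x * x; }

// main.cpp: importing reuses the pre-compiled AST of the module;
// no header text is re-parsed here.
import math;

int main() { return square(7); }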


2.2 Failure Diagnosis

2.2.1 Diagnosis without Reproducing the Failure

Here, I discuss failure diagnosis techniques that do not require reproducing the failures.

Log Analysis & Logging Automation

Many studies propose techniques to analyze log messages to infer run-time information from failure executions. The majority of them use statistical analysis to detect anomalous log messages or to detect recurring failures that match known issues [159, 7, 20, 50, 107, 120, 139, 140, 106, 138, 162, 35, 161]. For example, Xu et al. [159] use machine learning techniques to learn common patterns from console logs and detect abnormal log patterns that violate the common patterns. However, their error detection and diagnosis are based on patterns solely from log messages and thus cannot provide control-flow and data-flow information. DISTALYZER [120] uses machine learning techniques to compare system behaviors extracted from the logs of failed executions and correct executions to help debug performance problems in distributed systems. Mariani et al. [106] present a technique to automatically identify dependencies between events and values in logs; it generates models of legal behaviors and compares log files collected during failing executions to detect anomalous event sequences. Tan et al. [138, 139, 140] use heuristics to parse the logs of Hadoop systems to derive control-flow and data-flow information, and further provide visualized representations of the diagnostic information. Log analysis is also used for performance analysis and workload modeling in distributed systems by tracking end-to-end requests [35, 161, 169]. The research proposed in this dissertation is different from and complementary to these previous works: improved log quality can benefit these log analysis techniques with more informative logs. For example, logs of inter-thread data dependencies are useful for reconstructing control-flow and data-flow information from a failure execution.

While the importance of logs is widely recognized, deciding on appropriate log points and log messages is time-consuming and requires expertise [159, 165, 57, 164, 73, 102]. To help developers with logging practice, some studies present logging automation tools based on existing logging code [163, 170, 101]. Errlog [163] extracts from the source code the error conditions at which developers should insert log points and automatically adds the logging code. LogAdvisor [170] automatically extracts common logging practices from existing logging code by machine learning and suggests log points to developers. Li et al. [101] enhance LogAdvisor to suggest modifications of existing logging code when it changes, again using machine learning. Unlike K9, these studies are based on the existing logging code in the source and do not focus on inter-thread data dependencies.

Log20 [168] is a fully automated logging tool that does not rely on existing logging code. Guided by information theory, it computes how effective each log printing statement is in disambiguating code paths and automatically determines the placement of logging statements that maximizes the calculated effectiveness under a specified overhead threshold. K9 is also an automatic logging tool that needs no existing logging statements. However, K9 focuses on tracing inter-thread data dependencies, while Log20 aims to disambiguate executed code paths; tracking inter-thread data dependencies is another clue with which to clarify the error status and determine root causes.

Trace Analysis

Many diagnosis tools analyze performance counters [21, 37, 24, 83] or execution traces [23, 19, 133, 7, 51, 103, 34, 18, 119, 167, 40, 60] to diagnose failures. Chopstix [21] collects profiles of low-level OS events (e.g., scheduling, L2 cache misses, CPU utilization, I/O operations, page allocation, locking); it then reconstructs these events offline to diagnose some elusive problems. Cohen et al. [37] use a statistical approach to extract signatures from the raw values of low-level performance counters to diagnose recurring problems. Because of the semantic gap between low-level events and a program's logic, errors caused by semantic bugs cannot be diagnosed by these approaches, since no anomalies can be observed in low-level events.

Some studies use machine learning techniques to identify the performance metrics relevant to a particular crisis. Bodik et al. [24] propose to use machine learning to learn fingerprints from a large number of performance metrics, so that performance anomalies can automatically be classified into known bugs. Magpie [19] monitors the system to record fine-grained events and physical resource consumption. Once the traces are grouped on a per-request basis, it computes the resource consumption of each request, learns the common patterns of requests, and identifies anomalous requests.

Log messages are a different and complementary source of run-time information compared with traces. Since developers typically inject the logging code, log messages provide richer semantic information than traces. The diagnostic information provided by logs is also complementary to the debugging information provided by trace analysis techniques.

Debugging with Core Dumps

Some systems collect memory images (i.e., core dumps) when a system crashes. Windows Error Reporting [66] monitors the system to detect crashes or hangs and records a "mini-dump". CrashReporter [10], NetApp savecore [122], and Google Breakpad [142] also collect compressed memory dumps. PSE [104] can perform offline diagnosis of NULL pointer dereferences from a core dump. Weeratunge et al. [158] diagnose Heisenbugs by diffing the core dumps of failing runs and passing runs. Dimmunix [81] is used to diagnose deadlocks by analyzing the back-traces from core dumps. REPT [40] and Kernel REPT [60] use the memory dump to reconstruct data values from the failure execution; they use a hardware tracing facility to record control-flow information, and by combining the control-flow information with the recovered data, they provide useful clues for debugging certain kinds of failures.

As discussed in Chapter 1, this dissertation's work on automating logging is complementary to methods using core dumps. Core dumps are not available for all types of failures and do not contain the history of inter-thread data dependencies. Logs can capture historical, intermediate information prior to failures and provide diagnostic information even when no core dump is available. Logging also reduces overhead and data size by recording only causally related information.


2.2.2 Diagnosis with Failure Reproduction

Many systems [154, 69, 88, 123, 8, 43, 99, 61, 44, 155, 125, 39, 111] aim to deterministically replay failed executions, which generally requires high run-time logging overhead, especially on multiprocessor systems. For example, ReVirt [43] records all system-level events at the virtual machine monitor level and then replays the execution of the guest OS instruction by instruction. To reduce the overhead, DoublePlay [154] uses a uniparallelism strategy to achieve the benefits of recording on a uniprocessor while still allowing application performance to scale with increasing processors. Triage [152] performs diagnosis at the user's site at the moment of the failure; since it operates at the failure site, it can replay the failure multiple times by reloading recent checkpoints to infer various diagnostic information. To support checkpointing, Triage requires OS kernel modification and support.

Unfortunately, as discussed in Chapter 1, failures are often very hard to replay. Especially for production failures, developers at the vendor's site can hardly reproduce the customers' environments due to privacy concerns, unavailability of execution environments, etc. Consequently, these effective diagnosis techniques often cannot be used on production failures. The contributions of this dissertation are therefore complementary to these techniques and mainly target cases in which failure reproduction is difficult.

2.3 Other Related Work

2.3.1 Test Case Generation & Selection

Testing also plays an important role in software development, checking whether the software meets the requirements that guided its design and implementation. Many studies have tackled improving the performance of testing. For instance, automated test case generation [68, 113, 26, 56, 130, 124, 31] and selection [116, 32, 71, 80, 29, 108, 46] are effective techniques for accelerating the testing process. These studies complement this dissertation in increasing developer productivity in software development.


2.3.2 Static Analysis for Bug Detection

Some studies utilize compiler techniques to address other software reliability problems, such as bug detection [30, 82, 157]. For example, STACK [157] detects unstable code that leads to undefined behavior via static analysis based on a proposed model. KLEE [30] uses a full symbolic execution engine to expose bugs in testing. Carburizer [82] uses data-flow analysis to locate dependencies on data read from hardware. In this dissertation, K9 also uses a static analysis approach, detecting the specific code locations that cause inter-thread data dependencies.

2.4 Summary

In this chapter, I have covered many studies related to software development. Section 2.1 shows various approaches to improving build performance in C++ projects; this dissertation contributes a performance analysis of unity builds and explores a new build system design that integrates unity builds with sophisticated bundle strategies. Section 2.2 introduces many diagnosis techniques with and without failure reproduction; logging plays an important role in diagnosing failures that are hard to reproduce, and this dissertation addresses a new target for logging automation (i.e., inter-thread data dependencies) that is not covered by other studies. Section 2.3 shows other related studies.

Chapter 3

Build System for Unity Builds with Sophisticated Bundle Strategies

The objective of this chapter is to describe Cauldron, our native build system for unity builds with sophisticated bundle strategies. First, I introduce background information about build systems, the long build times of C++ projects, and unity builds (Section 3.1). Then, I clarify the problems of existing unity builds in Section 3.2. To address these problems, I conduct a case study of unity builds in WebKit to pin down the so-far unclear heuristics for improving unity-build performance (Section 3.3). Based on the findings of the case study, I present the design and implementation of Cauldron (Section 3.4); Cauldron is a new build system that supports unity builds with better bundle strategies than existing unity builds. Finally, experiments show that Cauldron outperforms current unity builds in continuous builds, full builds, and incremental builds (Section 3.5).

3.1 Background

3.1.1 Build systems

Build systems [54, 67, 48, 11, 49, 148, 110, 52, 117] automate the execution of order-dependent tasks to generate target programs. A build system takes source files and build rules as inputs, constructs a dependency graph from the rules, and executes build commands appropriately. A dependency graph consists of nodes and edges: each node represents a file or an internal state of the build system, and edges express input and output dependencies among nodes. Build commands are attached to edges; they include not only compiling and linking but also other commands such as changing directories and archiving files. When a build system processes the build command of an edge, it first confirms that all input nodes are ready, then executes the command.

Figure 3.1 shows an example of the dependency graph for generating an executable file, main.exe. Here, normal rectangles show the nodes of files (i.e., header, source, object, and executable files). Double-lined rectangles indicate nodes for the internal states of the build system that issue build commands. Arrows are the edges that specify dependencies and build commands.

To understand how the build system deals with this dependency graph, consider the case in which a user runs the build system specifying main.exe as the desired output, without any previous compile results. The build system first finds the linking edge from LINK_main as the required task. However, it figures out that the two dependent object files (main.o and lib.o) are not ready for this task, so it iterates back through the dependency graph to find the edges that create these object files. It eventually finds the two compiling edges from COMPILE_main and COMPILE_lib, confirms that all dependent input files exist, and starts issuing the build commands. As a result, it will first build lib.o and main.o, in any order since these tasks are independent, and then build main.exe.

Build systems are essential tools for developing large software systems. Whenever developers modify the source code, they have to re-execute a build system to synchronize their modifications into the target programs. Consequently, the performance of build systems directly affects the productivity of typical development cycles, and build systems must deliver high performance and efficiency [118].

Figure 3.1. Example of the dependency graph in build systems: main.exe is produced by the LINK_main edge (cc -o main.exe main.o lib.o) from main.o and lib.o, which in turn are produced by COMPILE_main (cc -c main.cc, from main.cc) and COMPILE_lib (cc -c lib.cc, from lib.cc and lib.h).
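As a rough illustration of this scheduling logic, the following minimal C++ sketch (my simplification: serial execution, no freshness checks, file names taken from Figure 3.1) recursively makes every input of an edge ready before issuing its command:

#include <cstdio>
#include <map>
#include <set>
#include <string>
#include <vector>

// One edge of the dependency graph: a target, its inputs, and the
// build command that produces the target (cf. Figure 3.1).
struct Edge {
    std::vector<std::string> inputs;
    std::string command;
};

std::map<std::string, Edge> graph = {
    {"lib.o",    {{"lib.cc", "lib.h"}, "cc -c lib.cc"}},
    {"main.o",   {{"main.cc"},         "cc -c main.cc"}},
    {"main.exe", {{"main.o", "lib.o"}, "cc -o main.exe main.o lib.o"}},
};

// Confirm all input nodes are ready, then issue the command.
void build(const std::string& target, std::set<std::string>& done) {
    if (done.count(target)) return;      // already made ready
    auto it = graph.find(target);
    if (it != graph.end()) {             // plain source files have no edge
        for (const std::string& in : it->second.inputs)
            build(in, done);             // make every input ready first
        std::printf("%s\n", it->second.command.c_str());
    }
    done.insert(target);
}

int main() {
    std::set<std::string> done;
    build("main.exe", done);  // issues the two compiles, then the link
}

A real build system additionally checks timestamps or content hashes to skip commands whose outputs are already up to date, which is exactly what the incremental builds of Section 3.1.2 exploit.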

3.1.2 Incremental builds

Traditionally, build systems have been designed for high efficiency when rebuilding target programs after small changes. Incremental builds [49] are a build technique to avoid unnecessary build tasks when rebuilding programs: build systems track dirty source files and calculate the minimum set of build tasks from the dependency graph. This reduces the build time when a developer updates a small number of source files, because most of the build tasks can be skipped.

Many existing tools and studies aim at improving the performance of incremental builds. For example, Ninja [110] is a state-of-the-art build system that focuses on very fast incremental builds by forbidding users from expressing complex build rules. Compile caching tools such as CCache [131] and cHash [42] enhance incremental build performance by avoiding redundant compilations via textual hashing or AST hashing. Precompiled headers [129, 146, 145] are useful for saving unnecessary header recompilations.

3.1.3 Long build times of large C++ projects

However, as I mentioned in Chapter 1, the build times of large C++ projects have become unacceptably long [33, 136, 112, 94]. System software has evolved continuously and rapidly, and code sizes have reached thousands of source files and millions of lines of code (LOC). This evolution has increased the times of both full and incremental builds in C++ projects. For instance, a full build of WebKit [12] takes 56 minutes with 18 threads on a Dell PowerEdge R430 server; Chromium [144] takes two and a half hours to build on the same server. Such long build times are problematic for individual developers in open-source projects who compile and build their projects on a standard laptop [98]. The times of incremental builds are also problematic because source-code modifications are no longer small in large open-source projects. For example, a recent study [94] pointed out that build tasks taking more than 15 minutes were submitted to the build bots of WebKit almost every day between March 17 and August 3, 2018; in such build tasks, 1,030 source files were recompiled on average.

To confirm that the problem is not limited to WebKit, I analyzed the logs of the build bots of LLVM [147], an open-source compiler framework written in C++, from March 9 to August 1, 2018. The number of build tasks submitted during the period was 1,146. Build tasks consuming more than 15 minutes happened 227 times on 74 out of the 81 days when the build bot was active. Thus, long build times are commonplace for large C++ projects.

Note that the build bots of both projects mentioned above use CCache, a compile caching tool that saves and reuses previously compiled results indexed by the hash of the preprocessed code. However, the benefit of the compile caching tool is limited because many developers change different source and header files at a time in large open-source projects. Such updates easily change the preprocessed code, so CCache cannot reuse previous compiled results and many recompilations happen. Precompiled headers do not work well either, because header files are often updated in such long builds: according to the build bot logs of both projects, the average number of modified header files was 11 in LLVM and seven in WebKit, and updating even one header file makes the precompiled headers obsolete so that they have to be recompiled.

3.1.4 Unity builds

The problem of long build times highlights a new requirement for build systems: they need to speed up the sequential compilation of many source files.

One of the main reasons why build times are so long in large C++ projects is redundant processing [33, 136, 112, 94]. In large C++ projects, much of the compilation time is spent on front-end processing. The front-end is the compiler phase that builds an internal representation (IR) of the program; it consists of preprocessing, parsing, and so on. The phase after IR generation is called the back-end, which includes code generation and optimization. In C++, parsing in particular is well known as a time-consuming front-end task. For example, Fig. 3.2a and Fig. 3.2b show the distribution of the time the compiler spends parsing. According to these figures, parsing occupies a majority of the compilation time (over 60%) in many source files (56% of the files in LLVM and 80% in WebKit). This high ratio stems from shared header files that are included in multiple source files: the compiler repeatedly parses the shared headers and instantiates the same template bodies across different source files.

Figure 3.2. How much time does the compiler spend parsing? The x-axis shows source-file indexes for (a) LLVM and (b) WebKit, sorted in ascending order of the parse ratio; the y-axis shows the parse ratio of each source file, i.e., the share of its compile time spent on parsing, measured with the timer-report functionality of GCC (-ftime-report).

To eliminate redundant header processing across multiple source files, developers have started to use a new build technique called unity builds; there are several project-specific names for this technique, such as unified builds in WebKit [115]. Unity builds are designed to achieve high build performance in large C++ projects. In a unity build, a bunch of source files are #include'd in a single unity file (as shown in Fig. 1.2), which is then compiled. In current unity builds, the bundle strategy, i.e., how source files are selected to be bundled together, is simply to bundle a fixed number of source files in alphabetical order. The limit on the bundle size, which is the number of #include'd source files, is a project-specific value, but it is often set to eight based on developers' experience [115, 33, 6]. In unity builds, the bundle configurations, which are the rules describing which source files are bundled together, are decided by the bundle strategy and the maximum bundle size.

Unity builds accelerate both compiling and linking. Compilers parse and compile the headers commonly included among the bundled source files only once and make fewer instantiations of the same templates among those source files. Figure 3.3 illustrates how unity builds reduce redundant processing during compilation. In this code, two source files (A.cpp and B.cpp) are included in one unity file. The compiler first processes the content of A.cpp normally; however, it can reuse many of the previously processed results when compiling the code of B.cpp. For instance, the compiler does not preprocess the vector header file twice, since header files are usually include-guarded. Furthermore, the template instantiation of vector happens only once, because the same instantiated template body is reused within the same translation unit.

/* unity.cpp */
// #include "A.cpp"
#include <vector>
void foo() {
    std::vector<int> v1;
    ...
}

// #include "B.cpp"
#include <vector> // reuse the previous result.
void bar() {
    std::vector<int> v2; // reuse the result of v1.
    ...
}

Figure 3.3. Benefits of unity builds.
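To make the current bundle strategy concrete, here is a minimal C++ sketch (my illustration, not WebKit's or Cauldron's actual generator) that bundles a fixed number of source files into unity files in alphabetical order:

#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Emit unity files that #include up to bundle_size source files each,
// taking the files in alphabetical order (the naive strategy).
void emit_unity_files(std::vector<std::string> sources, std::size_t bundle_size) {
    std::sort(sources.begin(), sources.end());      // alphabetical order
    std::size_t unity_index = 0;
    for (std::size_t i = 0; i < sources.size(); i += bundle_size) {
        std::printf("// unity%zu.cpp\n", unity_index++);
        std::size_t end = std::min(i + bundle_size, sources.size());
        for (std::size_t j = i; j < end; ++j)
            std::printf("#include \"%s\"\n", sources[j].c_str());
    }
}

int main() {
    // With a bundle size of 2: unity0.cpp gets A.cpp and B.cpp,
    // and unity1.cpp gets C.cpp and D.cpp.
    emit_unity_files({"B.cpp", "D.cpp", "A.cpp", "C.cpp"}, 2);
}

Because the grouping ignores which headers each file #includes, two files with disjoint header sets can end up in the same unity file, which is exactly the weakness analyzed in the following sections.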

However, unity builds have the disadvantage of increasing incremental build times. Since unity builds enlarge each compile task by bundling multiple source files, an incremental build of one updated source file involves the N − 1 non-updated source files bundled into the same unity file. Developers have recognized this downside, but they underestimate its negative effect. The WebKit project reports that the worst slowdown is 20% in a typical scenario where one or two files are touched [59]. However, I exposed a slowdown in incremental builds of one source file of 479% (19 seconds → 110 seconds) in the worst case [94].

3.2 Problems in Unity Builds

While unity builds are considered useful for reducing the build times of C++ projects and are used in real projects [115] and build systems [41, 90, 58, 13, 121], all of the current approaches are inadequate. The main problems are as follows.

Naive bundle strategy: The current bundle strategy simply bundles a fixed number of source files in alphabetical order, ignoring the characteristics of the bundled source files. This means that bundle configurations that might achieve higher build throughput and cause fewer incremental-build slowdowns are likely to be overlooked. To deal with this problem, I perform a case study of unity builds in WebKit to better understand how to improve unity-build performance.

Static build behavior: The current approaches implement unity builds as a static build behavior. As a result, build systems always pass unity files to compilers, even for incremental builds with small updates. This static configuration causes massive and permanent slowdowns in incremental builds. Developers have dealt with this problem by adopting ad-hoc solutions: for example, the developers of WebKit manually exclude source files that cause unacceptable slowdowns in incremental builds from the candidate source files to be bundled, after they encounter a slowdown, as shown in Fig. 3.4. However, I argue that build systems can deal with this problem better by adaptively switching between unity builds and non-unity builds.

3.3 A Case Study on Unity Builds in WebKit

Here, to deal with the first problem described in the previous section, I investigate the characteristics of unity builds to find a sweet spot that generates efficient compiler tasks, reducing the full build time without increasing the incremental build time. To accomplish this, I perform an in-depth case study of WebKit and construct sophisticated bundle strategies.

3.3.1 Research Questions

Here, I will explore four research questions to reveal the unclear heuristics in making the decision to bundle or not to bundle source files in unity builds:

• RQ1: How does the similarity of the headers included among the source files affect the unity-build performance? Unity builds reduce redundant work on the shared headers included in multiple source files. My hypothesis is that unity builds work well when the included headers overlap among the source files. I will show how the similarity of the included headers affects unity builds. These findings help us to understand the potential impact of header similarity on unity builds.


--- a/Source/JavaScriptCore/Sources.txt
+++ b/Source/JavaScriptCore/Sources.txt
@@ -392,9 +392,10 @@
 dfg/DFGSSAConversionPhase.cpp
 dfg/DFGSSALoweringPhase.cpp
 dfg/DFGSafepoint.cpp
 dfg/DFGSnippetParams.cpp
-dfg/DFGSpeculativeJIT.cpp
-dfg/DFGSpeculativeJIT32_64.cpp
-dfg/DFGSpeculativeJIT64.cpp
+// These files take a long time to compile ...
+dfg/DFGSpeculativeJIT.cpp @no-unify
+dfg/DFGSpeculativeJIT32_64.cpp @no-unify
+dfg/DFGSpeculativeJIT64.cpp @no-unify
 dfg/DFGStackLayoutPhase.cpp
 dfg/DFGStaticExecutionCountEstimationPhase.cpp
 dfg/DFGStoreBarrierClusteringPhase.cpp

Figure 3.4. Patch that prevents source files with long compile times from being bundled in WebKit. Here, the @no-unify attribute excludes a source file from unified builds.

• RQ2: What combinations of source files are not suitable for unity builds? I find cases where unity builds perform poorly and identify two kinds of source-file combinations that have negative effects on unity builds. These findings help us to avoid such source-file combinations when building bundling strategies and improve the efficiency of unity builds.

• RQ3: Where are the source files that can be bundled? Currently, WebKit only bundles source files in the same directory. However, in practice, there are many cases in which source files in different directories would be suitable for unity builds. These findings increase the opportunities to bundle such source files.


• RQ4: How many source files should be bundled into a unity file? The bundle size is critical to compile-time speedups and incremental build slowdowns. Bundling too many source files causes unacceptable slowdowns in incremental builds. On the other hand, bundling only a few source files does not lead to compile-time speedups. To better understand this trade-off, I analyze the compile time when using unity builds for three different kinds of source files while changing the bundle size. These findings help us to identify the best bundle size depending on the characteristics of each source file.

3.3.2 Metrics

The following metrics are used in the analysis.

• Header Similarity. The similarity between two source files is quantified by the extent to which the same headers are included. I use the Jaccard index [141] as the measure of similarity, which is defined as the size of the intersection divided by the size of the union of two sets. Thus, the similarity of two source files is determined as follows:

$$\mathrm{Similarity}(s_1, s_2) = \frac{|\mathrm{Headers}(s_1) \cap \mathrm{Headers}(s_2)|}{|\mathrm{Headers}(s_1) \cup \mathrm{Headers}(s_2)|}$$

s1 and s2 represent source files. Headers indicates the set of headers that are included in each source file. A similarity of 1.0 indicates that all of the included headers in s1 and s2 are the same.

• Front-end Ratio. To identify whether a source file is front-end-intensive¹ or not, I calculate the front-end ratio of the compile time of the source file by using the timer-report feature of the compiler (-ftime-report). If the front-end ratio is 1.0 (100%), the entire compile time of the source file is consumed by the front-end.

• Compile-Time Speedup. Another important metric is by how much unity builds reduce the compile time compared with compiling each source file individually. I calculate the speedup by dividing the sum of the individual compile times

¹Front-end-intensive means that the compile time of the source file is dominated by front-end processing.


(CT(s1) and CT(s2)) by the compile time when using unity builds (CT(unity)):

$$\mathrm{Speedup}(s_1, s_2) = \frac{CT(s_1) + CT(s_2)}{CT(\mathit{unity})}$$

• Expected Penalty for Incremental Builds. Another performance metric of unity builds is the expected penalty on incremental builds caused by unity builds. This penalty is defined as the expected value of the overhead of the incremental build, assuming that one of the merged source files is updated. For instance, the overhead of the incremental build of s1 in unity builds is calculated as CT(unity)/CT(s1). Then, when two source files are bundled into a unity file, the penalty is computed as

$$\mathrm{Penalty} = \frac{CT(\mathit{unity})}{CT(s_1)} \times 0.5 + \frac{CT(\mathit{unity})}{CT(s_2)} \times 0.5$$

A penalty of 1.0 indicates that unity builds cause no overhead on incremental builds. Here, all source files are assumed to be updated with the same probability.
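To make these metrics concrete, the following minimal C++ sketch computes all three for a pair of source files. The function names are illustrative and not part of any tool described in this dissertation; the header sets would come from the compiler's dependency output (e.g., -M), and the compile times from measurement.

    #include <cstddef>
    #include <set>
    #include <string>

    // Jaccard similarity of the header sets of two source files.
    double similarity(const std::set<std::string>& h1,
                      const std::set<std::string>& h2) {
        std::size_t inter = 0;
        for (const auto& h : h1)
            if (h2.count(h)) ++inter;
        std::size_t uni = h1.size() + h2.size() - inter;
        return uni == 0 ? 0.0 : static_cast<double>(inter) / uni;
    }

    // Compile-time speedup of bundling s1 and s2 into one unity file.
    double speedup(double ct_s1, double ct_s2, double ct_unity) {
        return (ct_s1 + ct_s2) / ct_unity;
    }

    // Expected incremental-build penalty, assuming each bundled source
    // file is updated with equal probability.
    double penalty(double ct_s1, double ct_s2, double ct_unity) {
        return 0.5 * (ct_unity / ct_s1) + 0.5 * (ct_unity / ct_s2);
    }

For example, plugging in the measurements reported for RQ2 below (CT(s1) = 4.29s, CT(s2) = 28.81s, CT(unity) = 28.88s) yields a speedup of about 1.15 but an expected penalty of about 3.87, illustrating how imbalanced compile times hurt incremental builds.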

3.3.3 Experimental Results

The experiments were conducted on WebKit (git-svn-id:230391). Here, I show the results for JavaScriptCore, the JavaScript engine in WebKit, to which unity builds are applied. I used an Intel(R) Xeon(R) 8-core E5-2620 v4 CPU @ 2.10GHz, 128GB of DRAM, and a 1TB HDD. I specified 16 job threads for the build system. I used LLVM/Clang (git-svn-id:329912) as the compiler.

RQ1: How does the similarity of the headers included among the source files affect the unity-build performance?

To understand the potential impact of header similarity, I chose one source file, performed unity builds with every other source file, and measured compile-time speedups and expected penalties for incremental builds. I selected API/JSBase.cpp, which is the first source file in alphabetical order. There are 959 other source files that can be bundled with it. I tested all combinations of unity builds with API/JSBase.cpp.


[Figure 3.5: two scatter plots against header similarity (x-axis, 0.0–1.0). (a) Compile-Time Speedup (higher is better). (b) Expected Incremental Build Penalty (lower is better).]

Figure 3.5. The figures show the impact of the similarity of the included headers on the unity build. The impact on the unity build of API/JSBase.cpp in combination with each of 959 different source files in JavaScriptCore of WebKit is shown. Each dot indicates a pair of bundled source files.


Figure 3.5 shows the compile-time speedups and expected incremental build penalties. The header similarity between API/JSBase.cpp and the paired source file is indicated on the x-axes. The y-axes show the impact of unity builds, i.e., the compile-time speedup in Figure 3.5a and the expected penalty on incremental builds in Figure 3.5b. According to Figure 3.5a, the higher the similarity of the included headers is, the greater the speedup that unity builds achieve. If the similarity exceeds 0.90, unity builds achieve a compile-time speedup of 1.85 on average. In the best case, where the paired source file is jit/JITStubRoutine.cpp, the compile-time speedup reaches 1.97. There is also good news for incremental builds. Figure 3.5b shows the relationship between the header similarities and expected penalties of incremental builds. This figure demonstrates that unity builds incur negligible slowdowns in incremental builds when the header similarities are high. In the case of unifying API/JSBase.cpp and jit/JITStubRoutine.cpp, the expected penalty is only 1.02. These findings tell us that unity builds can achieve high compilation speedups with ignorable slowdowns in incremental builds by bundling appropriate source files. However, I also find cases where the compile-time speedups are not as high as expected even when the header similarities are high. I will discuss these cases in the next question.

Finding 1: Higher header similarity has positive effects on both unity builds and incremental builds. However, there are exceptions that should be avoided.

RQ2: What combinations of source files are not suitable for unity builds?

To better understand the situations where unity builds do not work well, I investigate the cases in Figure 3.5a where the compile-time speedups are not high even though the header similarities are high. For instance, when I bundle API/JSBase.cpp and parser/Parser.cpp together, the compile-time speedup is only 1.15, even though the header similarity between the two source files is 0.94. The root cause of this bottleneck is the imbalanced compile times of the bundled source files. The compiler consumes 4.29s for API/JSBase.cpp, but 28.81s for parser/Parser.cpp. Accordingly, the unity build of the two source files takes 28.88s.


[Figure 3.6: two scatter plots against the front-end ratio (x-axis, 0.0–1.0). (a) Compile-Time Speedup (higher is better). (b) Expected Incremental Build Penalty (lower is better).]

Figure 3.6. These figures show the impact of the front-end ratio of the compile time of source files on unity builds. The figures indicate that unity builds achieve high speedups and low penalties when the merged source files are highly front-end-intensive.

This indicates that the compile time of a unity build peaks at the compile time of parser/Parser.cpp. It turns out that unity builds of source files with significantly different compile times are not desirable. I also found another factor that hinders the performance of unity builds even if the header similarity of the bundled source files is high. In short, source files whose front-end ratio is very low are not appropriate for unity builds. This is because a low front-end ratio indicates that the header processing, including header inclusions and template instantiations, is not a bottleneck of the compilation; applying unity builds to such source files is therefore of limited benefit. To clarify this, I analyze how the front-end ratio affects the performance of unity builds. The experiment conducted in this case is similar to the one in Figure 3.5a, but I performed it not only on API/JSBase.cpp but on all 959 source files. However, to exclude other factors that slow down unity-build performance, I did not test some source-file combinations. First, I only applied unity builds to pairs of source files with a high header similarity (more than 90%). I also excluded pairs whose compile times differ by 50% or more, to remove the imbalance in compile time. Some unity builds failed due to semantic compilation errors. As a result, the analysis entails testing 249,813 bundle combinations in total. Figure 3.6 shows the results. The x-axis indicates the front-end ratio relative to the entire compile time of each source file. The y-axis in Figure 3.6a is the average compile-time speedup when applying unity builds to each source file, while the y-axis in Figure 3.6b is the average expected penalty on incremental builds. According to these figures, the front-end ratio is related to both the average compile-time speedup and the expected penalty of unity builds. As the front-end ratio increases to 1.0, the average compile-time speedup reaches 2.0 almost linearly. Figure 3.6b also indicates that the average expected penalty for incremental builds becomes smaller when applying unity builds to front-end-intensive source files. However, when the front-end ratio is low, the average compile-time speedup drops while the expected penalty grows, resulting in unacceptable slowdowns in incremental builds.


Finding 2.1: We should avoid unity builds for source files that have significantly different compile times.
Finding 2.2: If the source files are front-end-intensive, they are appropriate for unity builds. However, if not, they are not suitable for unity builds, resulting in noticeable incremental build slowdowns.

RQ3: Where are the source files that can be bundled?

Currently, WebKit only bundles the source files in the same directory. To better understand the distribution of source files that are suitable for unity builds, I explore the header similarities across directories. Figure 3.7 shows a heat map of expected header similarities in JavaScriptCore. Each label in the figure represents a directory name, and each annotation in a box indicates an expected similarity. I define the expected similarity between two directories as the average header similarity when I perform unity builds on a source file of the directory indicated on the x-axis and another source file of the directory indicated on the y-axis. According to the figure, there are many directory combinations where the expected header similarities are high; for example, the expected similarity of the combination of API and bindings is 0.90. This value is as high as when bundling source files only within API. This result indicates that unity builds can be expected to be effective for source files in different directories.

Finding 3: We can bundle source files not only in the same directory but also in different directories.

RQ4: How many source files should be merged into a unity file?

Here, I discuss the number of source files that are bundled into one unity file. As I described before, unity builds achieve high compile-time speedups without causing incremental build slowdowns when appropriate source files are bundled; for instance, when applying unity builds to API/JSBase.cpp and jit/JITStubRoutine.cpp, the expected penalty for incremental builds is only 1.02. To see whether unity builds scale with the bundle size, I analyze the compile time of unity builds while varying the bundle size from 8 to 32.


[Figure 3.7: lower-triangular heat map of expected header similarities among the JavaScriptCore subdirectories API, assembler, b3, bindings, builtins, bytecode, bytecompiler, debugger, dfg, disassembler, domjit, ftl, heap, inspector, interpreter, jit, llint, parser, profiler, runtime, tools, wasm, and yarr; annotated values range from 0.2 to 1.0.]

Figure 3.7. Heat map of header similarity among subdirectories in JavaScriptCore. Each annotation shows the expected similarity when I combine a source file in a subdirectory on the x-axis with another in a subdirectory on the y-axis. This map indicates that many source files have high similarity to source files in different directories.

Since the front-end ratio of the compile time is related to compile-time speedups and expected penalties of incremental builds, I investigate three source files with different front-end ratios: API/JSBase.cpp, runtime/LiteralParser.cpp, and parser/Parser.cpp, with front-end ratios of 0.90, 0.52, and 0.18, respectively. I manually select the merged source files to maximize compile-time speedups and minimize the expected penalties on incremental builds by referring to the analysis of Figure 3.6. Table 3.1 shows the results.

34 CHAPTER 3. BUILD SYSTEM FOR UNITY BUILDS WITH SOPHISTICATED 3.3. A CASE STUDY ON UNITY BUILDS IN WEBKIT BUNDLE STRATEGIES

Table 3.1. The table shows the compile times of unity builds (UB) of three source files for bundle sizes from 8 to 32. The UB (WebKit) column shows the result of WebKit, where the bundle size is 8, and the UB (Manual) columns indicate the results of applying my bundle rule instead. According to the table, the unity build of JSBase.cpp does not incur large compilation overheads even if the bundle size is larger than WebKit's current default.

Source file          Front-end ratio   Compile time   UB (WebKit): 8   UB (Manual): 8   16       32
JSBase.cpp           0.90              4.26           4.62             4.46             4.61     4.97
LiteralParser.cpp    0.52              7.49           10.59            11.08            14.90    20.80
Parser.cpp           0.18              28.62          42.13            124.82           183.88   307.50

The compile times of unity builds (UB) when using the unify rule of WebKit are shown in the UB (WebKit) column. The UB (Manual) columns indicate the results of manually bundling source files. The compile time of each source file and its front-end ratio are also shown. According to the table, in the case of JSBase.cpp, which is highly front-end-intensive, unity builds scale well with negligible slowdowns in incremental builds. For example, when the bundle size is 32, the slowdown is only 0.71s relative to the compile time of JSBase.cpp itself. This result suggests that more than eight source files can be bundled with JSBase.cpp; eight is the default bundle size of WebKit. However, if the front-end ratio of a source file is low, the slowdown becomes noticeable. In the case of LiteralParser.cpp, the slowdown incurred by unity builds is 13.31s when the size is 32. Furthermore, in the case of Parser.cpp, the slowdown reaches 278.88s and becomes unacceptable. To avoid such unacceptable overheads, source files that are not front-end-intensive should not be bundled in unity builds. The reason why the compile time of UB (WebKit) is shorter than that of UB (Manual) with the same bundle size is that source files with much shorter compile times are bundled together. In the unity builds of WebKit, eight source files are automatically merged into one unity source file in alphabetical order. As a result, source files that have significant differences in compile time become bundled into a unity source file. For example, in the case of Parser.cpp, NodesAnalyzeModule.cpp, whose compile time is 3.38s, is merged into the same unity source file in UB (WebKit). This saves the compile time of UB (WebKit)

even if the front-end ratio of Parser.cpp is low. However, this WebKit strategy completely overlooks the overheads in recompilation of the source files with shorter compile times. In the case of NodesAnalyzeModule.cpp, the recompile overhead of the file reaches 1146% in UB (WebKit). This result highlights Finding 2.1 again: it is not desirable to bundle source files that have different compile times in unity builds.

Finding 4.1: Unity builds scale as long as the bundled source files are front-end-intensive.
Finding 4.2: Source files that are not front-end-intensive should not be bundled, for the sake of incremental builds.

3.4 Design and Implementation of Cauldron

In this section, I describe the design and implementation of Cauldron, the build system that uses sophisticated bundle strategies to achieve better unity-build performance. It fulfills the following design goals.

• Sophisticated bundle strategies: Cauldron is designed to improve the build performance of unity builds by using better bundle strategies based on the findings in the previous section. Here, Cauldron gets hints from the preprocessed code of each source file to look for a better bundle configuration than that of just bundling source files in alphabetical order. Cauldron automatically collects these hints during the build.

• Adaptive build behavior: To address the static build behavior of unity builds, Cauldron should adaptively switch its build behavior between unity builds and non-unity builds depending on the number of files to be compiled. In non-unity builds, it deactivates unity files and compiles source files in parallel to reduce the negative effect on incremental builds.

3.4.1 Design Choice: Meta-Build System vs. Native Build System

When I embarked on the design of my build system, I made an essential design choice, as follows.


Current build systems can be classified into two types. Native build systems (e.g., Make [49], Ninja [110], llbuild [11], and so on) automatically issue build commands to generate target programs. On the other hand, meta build systems (e.g., CMake [89], [148], Waf [121], and so on) do not build directly but instead generate build rules to be used by a native build system. Unity builds are usually implemented as a module of a meta build system or a configuration of an integrated development environment (IDE) [90, 58, 41, 13, 121, 115]. Despite this, I decided to take the native-build-system approach, for the following reason. To apply sophisticated bundle strategies, my build system has to extract hints for bundling from the preprocessed code of each source file. However, it is difficult to complete the preprocessing of all source files when meta build systems run, because not all of the header files are prepared before the build. In large projects, developers sometimes use simple descriptions for defining similar data structures. The actual header files are generated from the descriptions during the build by using specific tools (e.g., llvm-tblgen in LLVM) or scripting languages (e.g., Ruby in WebKit), as shown in Figure 3.8. Thus, during the execution of the meta build system, preprocessing fails for some source files because of the missing header files. To deal with this problem, Cauldron needs a scheduling feature that runs the preprocessor at the appropriate scheduling point during the build. To clarify this requirement, I describe the build behavior using the dependency graph depicted in Figure 3.9. Here, Cauldron tries to bundle Attributes.cpp, which includes Attributes.inc. Attributes.inc is a header file dynamically generated from Attributes.td by llvm-tblgen. In short, my build system runs the preprocessor for Attributes.cpp just before it processes the compile edge for Attributes.o (the highlighted arrow in the figure). This is because, when it starts to deal with the edge, it has already confirmed that all inputs (i.e., Attributes.cpp and Attributes.inc) are ready. Thus, the preprocessing does not fail due to a missing dependent header file. Note that my build system does not run the preprocessor just after the creation of Attributes.inc, because the number of dynamically generated header files is not always one. Waiting until the processing of the compile edge begins is a promising way to confirm that all required input nodes are ready.
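To make this scheduling point concrete, the following C++ sketch illustrates the idea; the types and hooks are hypothetical stand-ins, not Cauldron's actual interfaces. The key property is that the scheduler calls dispatch() only after every input of the edge exists.

    #include <string>
    #include <vector>

    // Hypothetical types and hooks, for illustration only.
    struct Edge {
        std::vector<std::string> inputs;   // e.g., Attributes.cpp, Attributes.inc
        std::string command;               // e.g., "cc -c Attributes.cpp"
        bool is_compile_edge = false;
        bool awaits_jit_bundling = false;  // set by the dependency-graph analysis
    };

    void run_preprocessor_and_collect_hints(const Edge&) { /* preprocess, gather hints */ }
    void bundle_and_rewrite_command(Edge&)               { /* point the command at a unity file */ }
    void run_command(const Edge&)                        { /* spawn the compiler */ }

    // Called by the scheduler only when every input of `e` already exists,
    // so preprocessing cannot fail due to a missing generated header.
    void dispatch(Edge& e) {
        if (e.is_compile_edge && e.awaits_jit_bundling) {
            run_preprocessor_and_collect_hints(e);
            bundle_and_rewrite_command(e);
        }
        run_command(e);
    }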


include/llvm/IR/Attributes.td:

    def Alignment : EnumAttr<"align">;
    def AllocSize : EnumAttr<"allocsize">;
    ...

include/llvm/IR/Attributes.inc (generated by llvm-tblgen):

    // EnumAttr classes
    struct AlignmentAttr : EnumAttr {
      static enum Attribute::AttrKind getKind() {
        return llvm::Attribute::Alignment;
      }
    };
    struct AllocSizeAttr : EnumAttr {
      static enum Attribute::AttrKind getKind() {
        return llvm::Attribute::AllocSize;
      }
    };
    ...

Figure 3.8. Example of the header files dynamically generated by llvm-tblgen.

[Figure 3.9: dependency graph in which GEN_Attributes runs llvm-tblgen on Attributes.td to produce Attributes.inc, and COMPILE_Attributes (cc -c Attributes.cpp) consumes Attributes.cpp and Attributes.inc to produce Attributes.o.]

Figure 3.9. Dependency graph including the dynamically generated header file.


3.4.2 Bundle strategies in Cauldron

Before I discuss the Cauldron design, I describe the two bundle strategies used in Cauldron. These strategies are not independent; instead, the bundle algorithm used in Cauldron is based on both of them. I will discuss the details of the algorithm in Section 3.4.6. Strategy: bundling source files with high header-file similarity. Since unity builds cut off the redundant processing of shared header files, they work well when most of the included headers overlap among the bundled source files, as I described in Finding 1 in the previous section. To quantify the overlap of the included header files, the Jaccard index is used in Cauldron, as in the case study. Unlike existing unity builds, I bundle source files regardless of the directories they are in (based on Finding 3 of the case study). WebKit bundles source files only in the same directory. The reasons why the WebKit developers made this decision are not explained, but I speculate as follows. Developers often place related source files in the same directory, whose name reflects their usage. As a result, these source files often include common header files where basic classes and function prototypes are defined, so the header-file similarity tends to be high among source files in the same directory. However, I found that many source files in different directories have high header-file similarities, as shown in Figure 3.7. Based on this result, Cauldron tries to bundle source files in different directories if the header-file similarity is high. Strategy: bundling source files with similar compile times. According to Finding 2.1 in the case study, the unity-build performance is sensitive to the compile time of each bundled source file. An imbalance in the compile times of bundled source files causes significant performance degradation in incremental builds. However, this strategy is hard to apply straightforwardly, because accurate compile times are not available before the build. I initially tried to use a machine-learning technique to estimate compile times from the preprocessed code. However, I eventually gave up this approach because most of the techniques use information obtained by parsing source files [14]. Parsing the source files takes up a significant portion of the compilation time and adds overhead,


as described in Section 3.1. Thus, I decided to use the count of ';' characters to estimate compile times. Figure 3.10 shows the relationship between the number of ';'s and the compile times in LLVM and WebKit. I performed polynomial regression on these data and calculated an approximate expression that the build system uses to estimate compile times. Note that although this estimation is simple and not completely accurate, it is still effective in increasing unity-build performance, as described in Section 3.5. I have rejected bundle strategies that require the times spent in the front-end and the back-end of the compilation (Finding 2.2 and Finding 4.2), because it is hard to estimate these times without compiling the code. The times spent in the front-end and the back-end have little correlation with the length of the preprocessed code. Front-end ratios, which denote the share of front-end processing during compilation, vary widely even among source files with preprocessed code of similar length. For example, the coefficient of correlation between the front-end ratios and the semicolon counts in the preprocessed code is only 0.09, in contrast to 0.86, which is the coefficient of correlation between the total compile times and the semicolon counts. Since information about the front-end ratio is unavailable in Cauldron, the maximum bundle size is set to eight in this dissertation, which is the default value of the unity builds of WebKit and of CMake-unity builds.

[Figure 3.10: scatter plot of compile times (seconds) against the number of semicolons in the preprocessed code for LLVM and WebKit, with a polynomial fit (Polyfit).]

Figure 3.10. Compile-time estimation
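As a concrete illustration, the estimation can be implemented by counting semicolons in the preprocessed code and evaluating the fitted polynomial. The sketch below is illustrative; the actual coefficients come from the regression shown in Figure 3.10 and are passed in here as a parameter.

    #include <cstddef>
    #include <string>
    #include <vector>

    // Count ';' occurrences in the preprocessed source text.
    std::size_t count_semicolons(const std::string& preprocessed) {
        std::size_t n = 0;
        for (char c : preprocessed)
            if (c == ';') ++n;
        return n;
    }

    // Evaluate the fitted polynomial; coeffs[i] multiplies semicolons^i.
    double estimate_compile_time(std::size_t semicolons,
                                 const std::vector<double>& coeffs) {
        double t = 0.0, x = 1.0;
        for (double c : coeffs) {
            t += c * x;
            x *= static_cast<double>(semicolons);
        }
        return t;  // estimated compile time in seconds
    }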


[Figure 3.11: Cauldron work-flow. Inputs (source files, build manifest) → parse the manifest → generate a dependency graph → dependency-graph analysis (identify sets of source files that can be bundled together; track file-organization updates) → build-behavior decision (count the number of files to be compiled; (de)activate unity files) → either non-unity builds with build-task scheduling, or unity builds with ahead-of-time and just-in-time bundling guided by the bundle strategies (bundling sources with high header similarity and with similar compile times) → bundle-rule refinement → save bundle rules → target programs/libraries.]

Figure 3.11. Overview of the work-flow

3.4.3 Overview

An overview of Cauldron is shown in Fig. 3.11. The design and implementation are based on a state-of-the-art native build system called Ninja [110]. To clarify my contributions, they are highlighted in the figure with double-lined boxes. First, Cauldron takes source files and build manifests as its inputs and generates the dependency graph, just as Ninja does. Next, it analyzes the dependency graph to track updates to the file organization. It tracks additions and removals of source files by comparing the analysis result with the previous one. When it detects any change that affects the bundle configurations, it invalidates the affected configurations and reconfigures them in subsequent bundling steps. After the dependency-graph analysis, Cauldron decides its build behavior depending on the number of files to be compiled. When a small number of files are to be compiled, such builds are recognized as non-unity builds. In non-unity builds, Cauldron compiles the source files instead of the unity files. When a large number of files are to be compiled, Cauldron performs unity builds. The bundling steps consist of two parts: ahead-of-time (AOT) and just-in-time (JIT). In AOT bundling, Cauldron bundles source files whose dependent header files exist before the build. It runs the preprocessor on the source files and gets hints for bundling from the preprocessed code. After bundling the source files, it modifies the compile commands to compile the unity files. Then, it starts to schedule the build tasks. When Cauldron starts the compilation of a source file whose dependent header files are not ready before the build, it interrupts the processing and performs JIT bundling. At this moment, it is confirmed that all header files for this compilation are ready, so it can execute the preprocessor and bundle the source file. After it bundles the source files and modifies the compile commands, it resumes the build schedule. Once it finishes a unity build, it saves the bundle configurations for reuse in future builds.

3.4.4 Dependency graph analysis

During software development, source files are added and removed. To reflect such updates to the file organization, Cauldron keeps track of the updates and modifies the previously created bundle configurations. It accomplishes these tasks by tracking changes in the dependency graph. Cauldron analyzes the compile edges and link edges in the dependency graph and determines sets of source files that can be bundled together. In this dissertation, I define such source-file sets as bundle-sets. For example, source files that are compiled with different flags or that are linked into different target programs cannot be bundled together. When a developer adds a new source file, Cauldron analyzes the corresponding edges for the new source file and registers the source file to an appropriate bundle-set. As a result, it can bundle the source file in subsequent unity builds. It does not bundle a new source file immediately, because changing the bundle configurations would cause additional recompilations of other source files or unity files. It reconfigures the bundle configurations only when all of the source files in the bundle-set are to be compiled. When a source file is removed, Cauldron excludes the source file from its bundle-set. It also checks whether an existing unity file includes the removed source file. If it does, the source file is removed from the unity file and from the bundle configuration. Cauldron also makes sure the unity file is recompiled to avoid inconsistencies in the output programs.
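One simple way to realize bundle-sets is to group source files by a key that combines the compile flags and the link target, since files differing in either cannot be bundled. The sketch below is a hypothetical illustration of this grouping, not Cauldron's actual data structures.

    #include <map>
    #include <string>
    #include <tuple>
    #include <vector>

    // Source files compiled with different flags or linked into different
    // targets must not share a unity file, so the key captures both.
    struct BundleKey {
        std::string compile_flags;
        std::string link_target;
        bool operator<(const BundleKey& o) const {
            return std::tie(compile_flags, link_target) <
                   std::tie(o.compile_flags, o.link_target);
        }
    };

    using BundleSets = std::map<BundleKey, std::vector<std::string>>;

    // A newly added file joins its bundle-set but is not bundled immediately;
    // reconfiguration waits until the whole set is to be recompiled.
    void register_source(BundleSets& sets, const BundleKey& key,
                         const std::string& source_file) {
        sets[key].push_back(source_file);
    }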

3.4.5 Build behavior decision

Cauldron adaptively changes its build behavior by referring to the number of source files and unity files to be compiled. Here, I describe how this information is used to decide the build behavior and what Cauldron does when it switches its build behavior. When the sum of the number of source files to be compiled, Ns, and that of unity files to be compiled, Nu, is less than the parallelism P specified by a job option (i.e., Ns + Nu < P), the build behavior is decided to be a non-unity build. This decision is based on the insight that a small number of source files are recompiled

in incremental builds. As a result, the build system can compile more source files by using free CPUs. To reduce the overhead of additional compilations caused by unity builds, I utilize the benefits of parallel compilation. For example, if Cauldron tries to recompile a unity file that includes N source files with M threads (where N ≤ M), it deactivates the unity file and compiles the N source files in parallel. Cauldron tries to hide the overhead of the N − 1 additional recompilations by utilizing the available CPU power. Thus, when the build is decided to be an incremental build, Cauldron deactivates the unity files and compiles the source files as long as the total number of files to be compiled does not exceed the parallelism. On the other hand, Cauldron performs unity builds when the number of files to be compiled is at least the parallelism (i.e., Ns + Nu ≥ P). In unity builds, Cauldron initializes the bundle configurations or re-activates unity files that were deactivated in non-unity builds. To keep the overhead of incremental builds low, Cauldron compiles the source files rather than the deactivated unity files until it switches to unity builds. Therefore, if there are deactivated unity files in a unity build, Cauldron re-activates the unity files whose source files are to be compiled. To achieve a high level of parallelism in compilation, Cauldron activates unity files until the number of files to be compiled exceeds the parallelism.
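The decision rule itself is simple; a minimal sketch, using the names from the text (Ns, Nu, P):

    #include <cstddef>

    enum class BuildMode { NonUnity, Unity };

    // Ns: source files to compile, Nu: unity files to compile,
    // P: parallelism given by the job option.
    BuildMode decide_build_mode(std::size_t Ns, std::size_t Nu, std::size_t P) {
        // Few files to compile: deactivate unity files and compile their
        // member source files in parallel, hiding the N-1 extra compilations.
        if (Ns + Nu < P) return BuildMode::NonUnity;
        // Many files to compile: initialize or re-activate unity files.
        return BuildMode::Unity;
    }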

3.4.6 Bundling source files

Now, I describe how Cauldron bundles source files. As mentioned above, it performs two types of bundling: ahead-of-time (AOT) and just-in-time (JIT). After the dependency-graph analysis, it gets information on the sets of source files that can be bundled together (i.e., bundle-sets). When all of the source files in a bundle-set are ready for preprocessing before the build, it bundles the source files by using AOT bundling. On the other hand, if the set contains source files that include dynamically generated header files, it performs JIT bundling on the set. In JIT bundling, Cauldron waits until all source files in the set are ready for preprocessing; then it bundles the source files. In particular, Cauldron waits until all of the inputs of the compile edges in the dependency graph are ready. Although AOT and JIT bundling have different timings for bundling the source files, their

bundlings consist of three common steps. First, Cauldron runs the preprocessor on the source files in order to extract two pieces of information. To compute the header-file similarity, it collects the dependencies on header files by using a general compiler feature (e.g., -M). To estimate the compile time of each source file, it counts the number of ';'s and calculates the compile times by using the approximate expression. I show the overhead of this additional preprocessing in Section 3.5.1. This overhead typically occurs only in the first unity build because Cauldron saves the generated bundle configurations and reuses them in subsequent builds. Bundle-hints are managed by timestamps of their creation and are updated only when reconfiguring bundle configurations. Once Cauldron finishes gathering the bundle-hints of all the source files that can be bundled together, it starts to bundle the source files. Note that bundling source files into unity files with the highest header-file similarity and the highest compile-time similarity is a knapsack problem, which is NP-hard. So I decided to use a simplified algorithm to avoid a large overhead in computing which source files are bundled together. Algorithm 1 gives an overview of how source files are bundled together. First, to bundle the source files with high header-file similarity, Cauldron calculates the similarities among the source files by analyzing the file dependencies and saves the result as a sorted list (lines 1-2). Then, it bundles pairs of source files in descending order of header-file similarity (lines 3-15). To keep the header-file similarity in unity files high, it stops bundling source files if the similarity falls below 90% (line 4). To bundle source files with similar compile times, it does not bundle source files whose compile times have a 50% difference, even if the header-file similarity is high (line 5). These thresholds are set according to the results of the case study of WebKit. There are three ways to bundle the two source files. First, if neither source file has been bundled yet, Cauldron bundles them into a new unity file (lines 6-9). Second, if one of the source files has not been bundled and the other one has already been bundled in a unity file, Cauldron checks whether the unbundled source file can be added to the unity file. In particular, it checks that the bundle size does not exceed the maximum bundle size and that the compile times of all bundled source files are similar. If both checks pass, it adds the unbundled source file to the unity file (lines 10-11).


Algorithm 1: Bundling source files
Data: S: source-file set; H: list of header-file similarity entries (srci, srcl, similarity)
Result: U: unity files
 1: H ← calculate header-file similarities among S;
 2: sort H by the similarity;
 3: foreach entry (srci, srcl, similarity) ∈ H do
 4:     if similarity < 0.9 then break;
 5:     if the compile times have a huge difference then continue;
 6:     if neither of the source files is bundled then
 7:         allocate a new unity file u;
 8:         bundle srci and srcl into u;
 9:         add u to U;
10:     else if only one of the source files is bundled then
11:         bundle the unbundled source file, if possible;
12:     else if srci and srcl are bundled in different unity files then
13:         ui, ul ← get the unity files of srci and srcl;
14:         merge the two unity files, if possible;
15: end
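The following compact C++ sketch renders the greedy loop of Algorithm 1 with the thresholds from the case study (0.9 similarity, 50% compile-time difference). The helper functions are assumed placeholders for Cauldron's bookkeeping and are declared but left undefined.

    #include <algorithm>
    #include <string>
    #include <vector>

    struct Pair { std::string a, b; double similarity; };

    // Assumed helpers standing in for Cauldron's bookkeeping:
    double est_time(const std::string& src);          // estimated compile time
    int    unity_of(const std::string& src);          // unity-file id, -1 if unbundled
    void   new_unity(const std::string& a, const std::string& b);
    void   try_append(int unity, const std::string& src);  // checks size and times
    void   try_merge(int u1, int u2);                      // checks size and times

    void bundle(std::vector<Pair> pairs) {
        // Lines 1-2: sort pairs by header similarity, highest first.
        std::sort(pairs.begin(), pairs.end(),
                  [](const Pair& x, const Pair& y) { return x.similarity > y.similarity; });
        for (const Pair& p : pairs) {
            if (p.similarity < 0.9) break;                          // line 4
            double ta = est_time(p.a), tb = est_time(p.b);
            // Line 5: skip pairs whose compile times differ by 50% or more
            // (one reasonable interpretation of the threshold).
            if (std::max(ta, tb) > 1.5 * std::min(ta, tb)) continue;
            int ua = unity_of(p.a), ub = unity_of(p.b);
            if (ua < 0 && ub < 0)      new_unity(p.a, p.b);         // lines 6-9
            else if (ua < 0)           try_append(ub, p.a);         // lines 10-11
            else if (ub < 0)           try_append(ua, p.b);
            else if (ua != ub)         try_merge(ua, ub);           // lines 12-14
        }
    }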

Here, the similarity of header files is not checked when adding a file to an existing unity file. This is acceptable because, even without the check, the header-file similarity of unity files is kept high: Cauldron finishes bundling when the similarity falls below 90% (line 4), so even if unchecked source-file pairs end up in unity files, the header-file similarity stays high; the average header-file similarity of each unity file is never below 87% in the evaluation. Third, when two source files are bundled in different unity files, Cauldron merges these two unity files after checking the maximum bundle size and the differences in compile times (lines 12-14). Figure 3.12 shows an example of how Cauldron bundles source files based on this algorithm. Cauldron starts with the information about the header similarity and the compile-time difference of source pairs, sorted by header similarity. Then, it iterates over this information and bundles source files.


Source file pair   Header similarity   Compile-time difference   Cauldron behavior
<0.cpp, 1.cpp>     98%                 8%                        Create a new unity file (u1.cpp) bundling 0.cpp and 1.cpp.
<1.cpp, 2.cpp>     97%                 12%                       Add 2.cpp to u1.cpp.
<2.cpp, 3.cpp>     96%                 120%                      Skip this source pair due to the large compile-time difference.
<2.cpp, 4.cpp>     95%                 11%                       Append 4.cpp to u1.cpp.
<5.cpp, 6.cpp>     92%                 13%                       Generate a new unity file (u2.cpp) bundling 5.cpp and 6.cpp.
<1.cpp, 6.cpp>     91%                 9%                        Merge the two unity files (u1.cpp and u2.cpp) into one.
<0.cpp, 7.cpp>     87%                 3%                        Finish bundling source files due to the low header similarity.

As a result, Cauldron achieves one unity file bundling six source files:
• unity.cpp: 0.cpp, 1.cpp, 2.cpp, 4.cpp, 5.cpp, 6.cpp.

Figure 3.12. Example of bundling source files in Cauldron.

In this example, Cauldron first creates a new unity file (u1.cpp) to bundle 0.cpp and 1.cpp because their header similarity is high and their compile-time difference is under the threshold. Then, it adds 2.cpp and 4.cpp to u1.cpp in the second and fourth iterations, respectively. In the third iteration, however, Cauldron skips the source pair of 2.cpp and 3.cpp because their compile-time difference is larger than the threshold. Cauldron generates another unity file, u2.cpp, in the fifth iteration to bundle 5.cpp and 6.cpp, and merges it with u1.cpp in the sixth iteration. Last, Cauldron finishes bundling in the seventh iteration because the header similarity falls below 90%. As a result, Cauldron achieves one unity file bundling six source files. After bundling the source files, Cauldron modifies the compile commands to compile unity files instead of the bundled source files. In JIT bundling, it sometimes finishes compiling source files before bundling them, because it waits for dependent header files to be generated for other source files. For example, Figure 3.13 shows two order-independent compilations of A.cc and B.cc. Based on the dependency-graph analysis result, Cauldron tries to bundle the two source files. Since A.cc includes header.h, which is a dynamically generated header file, Cauldron has to postpone bundling until the header file is created. However, the compilation of B.cc does not depend on the header creation. As a result, the compilation of B.cc may be done before A.cc and B.cc are bundled. In such cases, Cauldron just creates a bundle configuration and does


[Figure 3.13: dependency graph in which A.o and B.o are produced by COMPILE_A and COMPILE_B and linked into lib.a by LINK_lib; A.cc depends on header.h, which GEN_header generates from header.td.]

Figure 3.13. The compilation of B.cc may finish before A.cc and B.cc are bundled.

not modify the compile commands, to avoid double compilations, which would cause symbol-redefinition errors.

3.4.7 Bundle Configuration Refinement

Since the compile-time estimation is not perfect, Cauldron may mistakenly bundle source files that have a huge difference in compile times. To cover this drawback, Cauldron measures the compile times of source files in non-unity builds and refines the bundle configurations. When it deactivates unity files in non-unity builds, it rechecks the differences in the compile times of the bundled source files at the end of the build. If it finds source files that take much longer to compile than the other bundled source files, it excludes them from the bundle configurations. Cauldron saves the measured compile times for future unity builds.
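A minimal sketch of this refinement step: after a non-unity build, compare the measured compile times within each bundle and evict files that compile much slower than the rest. The 1.5x threshold used here is illustrative.

    #include <algorithm>
    #include <map>
    #include <string>
    #include <vector>

    // Measured compile times per source file, recorded during non-unity builds.
    using Times = std::map<std::string, double>;

    // Return the files to exclude from the bundle configuration because their
    // measured compile time is far above the bundle's fastest member.
    std::vector<std::string> refine_bundle(const std::vector<std::string>& bundle,
                                           const Times& measured) {
        double min_t = 1e300;
        for (const auto& f : bundle)
            min_t = std::min(min_t, measured.at(f));
        std::vector<std::string> evicted;
        for (const auto& f : bundle)
            if (measured.at(f) > 1.5 * min_t)
                evicted.push_back(f);
        return evicted;
    }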


[Figure 3.14: (a) cumulative build time (seconds) over 101 builds for cmake-unity and our build system; (b) performance gain/loss (seconds) of our build system for each build.]

Figure 3.14. Build performance during 101 builds of real git commits: LLVM

3.5 Experiments

I compare Cauldron with the existing unity builds of CMake, Meson, and WebKit, focusing on the build performance in continuous builds, the incremental-build performance, and the full-build performance. This evaluation shows that:

• Unity builds of Cauldron have better build performance in continuous builds (Section 3.5.1).

• Cauldron decreases the overhead in incremental builds (Section 3.5.2).

• Cauldron achieves competitive full-build performance (Section 3.5.3).

I perform the evaluation on a Dell PowerEdge R430 server running Fedora 29 Server. The server has an 8-core 2.1 GHz Xeon E5-2620 v4 processor, 128 GB of RAM, and a 1TB SATA HDD. All source files reside on the disk. I choose three real C/C++ projects for evaluating Cauldron: LLVM (git-commit-id:ca1e713fdd4), WebKit (git-commit-id:ec4eb02a9e2), and the Mesa 3D library (v18.3.6) [127]. In the evaluations, I use GCC v8.3.1 and


LLVM/Clang v9.0.0 as compilers but only show the results for GCC, because the results for Clang are similar. The linker is GNU ld v2.31.1. I evaluate the unity builds of WebKit [115], CMake-unity builds [6], and Meson-unity builds [41] for comparison. The maximum bundle size of each unity file is set to eight, which is the default value of the unity builds of WebKit and CMake-unity builds. Note that there is no limit on the bundle size in Meson-unity builds; this unlimited bundle size causes a terrible overhead in incremental builds (Section 3.5.2).

3.5.1 Build Performance in Continuous Builds

To evaluate how well the unity builds of Cauldron perform in continuous builds, I analyze the build times of sequential git commits in LLVM and WebKit. The build performance in continuous builds is important when using continuous integration and automated test tools, such as build bots [150, 147]. In this evaluation, I use CCache because it is enabled in the build bots of the two projects. First, I discuss the results for LLVM. Here, I use the git commit ca1e713fdd4 (Feb 6 2019) as the first full build and the 100 consecutive commits from e9f465a6a80 to 2e390e6fde4 (Feb 6-8 2019) for this evaluation. 27,552 LOC in 306 files were updated during these commits. Figure 3.14a shows the changes in the total build time during the 101 builds. The blue line shows the result of CMake's unity builds, and the orange line is the result of Cauldron. According to the figure, Cauldron gradually saves cumulative build time compared with CMake's unity builds. In total, Cauldron reduces the total build time by 20% and saves 13 minutes. Figure 3.14b depicts the performance gain/loss of Cauldron for each build, compared with CMake's unity builds. The blue bars show the performance gain (i.e., the time Cauldron saved). The red bars depict the performance loss (i.e., the overheads caused by Cauldron). According to the figures, Cauldron has better build performance in many builds. For example, Cauldron achieves the highest performance gain in the 92nd build. This is because Cauldron reduces unnecessary recompilations in unity builds. In unity builds, the number of source files to be recompiled is increased because multiple source files are bundled into one unity file.


[Figure 3.15: (a) cumulative build time (seconds) over 101 builds for webkit-unity and our build system; (b) performance gain/loss (seconds) of our build system for each build.]

Figure 3.15. Build performance during 101 builds of real git commits: WebKit

Cauldron avoids bundling unrelated source files together by checking the similarities of the included header files and the compile times. This enables Cauldron to efficiently recompile a smaller number of source files in partial rebuilds. As a result, Cauldron excludes 176 unnecessary recompilations of unrelated source files (404 files → 228 files). The reason why Cauldron takes longer in the first full build than the CMake-unity build is the overhead of bundling source files. Since Cauldron requires additional preprocessing to extract hints for bundling source files, the first full-build time includes this overhead. The overhead of the additional preprocessing is 10% (33s) in LLVM. However, this overhead is only included in the first full build because Cauldron reuses the bundle configuration for subsequent builds. Figure 3.15 shows the results for WebKit. Here, the git commit ec4eb02a9e2 (Feb 8 2019) is used for the first full build, and the 100 consecutive commits 85f3eaeb98f to f7a11be17a9 (Feb 6-12 2019) are used for this evaluation. 15,515 LOC in 624 files were changed during these commits. The figure shows similar results to LLVM. Cauldron achieves large performance gains in many builds, improves the total build time by 23%, and saves

one hour and 24 minutes in total. The overhead of the additional preprocessing in the first full build is 33% (6m 29s). In particular, Cauldron achieves the highest performance gain in the 12th build. This is because the git commit corresponding to this build (404c727821a) includes a file addition and an update of a configuration file of the WebKit-unity builds for the added files. As a result, many unity files are regenerated and recompiled. However, Cauldron avoids such bursts of recompilation by lazily updating the bundle configuration when source files are added. It updates the configuration only when all of the source files in a bundle-set are to be compiled, as described in Section 3.4.4. Thus, Cauldron avoids the update of the bundle configuration due to the file addition and reduces unnecessary recompilations. In summary, unity builds with the sophisticated bundle strategies and Cauldron's adaptive approach succeed in reducing the total build time of continuous builds, which include both large partial rebuilds (like full builds) and small partial rebuilds (like incremental builds). In particular, according to Figure 3.14b and Figure 3.15b, the performance gain of Cauldron is much larger than the performance loss. This indicates that Cauldron successfully reduces the incremental-build overheads of unity builds while keeping the benefits of unity builds. The unity-build performance in continuous builds is important when unity builds are employed in continuous integration, automated testing, and daily development.

3.5.2 Incremental-build performance

Now I discuss the incremental-build performance of unity builds in Cauldron. To evaluate the overhead of incremental builds caused by unity builds, I measured the slowdowns of rebuilding the projects when one source file was updated. I used this methodology because it emphasizes the disadvantage of unity builds. The slowdowns are evaluated by the absolute difference of the build times rather than the ratio, because the ratio does not represent the developer's impression of the build times. For example, doubling a slowdown of 0.5 second to 1 second is more acceptable than a 50% slowdown from 10 seconds to 15 seconds. I did not use CCache in this experiment, to facilitate the recompilation of each source file. In other words, I only update source-file timestamps to trigger recompilations.


[Figure 3.16: CDFs of incremental-build overheads (seconds) for (a) LLVM (our build system vs. the CMake-unity build), (b) WebKit (our build system vs. the WebKit-unity build), and (c) Mesa (our build system vs. the Meson-unity build).]

Figure 3.16. CDFs of incremental-build overheads (seconds) caused by unity builds. The x-axis shows the overheads in seconds for all figures. The y-axis shows the cumulative probability.

Table 3.2. Notable results on incremental build overheads (seconds). The results of existing unity builds are shown in parentheses.

          mean          median        90th percentile   worst case
WebKit    1.91 (3.11)   1.09 (1.97)   4.98 (9.17)       38.0 (29.64)
LLVM      6.97 (9.62)   4.83 (8.49)   15.77 (18.60)     44.25 (57.14)
Mesa      1.69 (6.96)   1.56 (1.40)   3.93 (19.85)      8.80 (35.71)


Figure 3.16 shows the overall results: CDFs of the incremental-build overheads. Table 3.2 shows notable numbers from the overall results. In summary, Cauldron has better incremental-build performance in terms of the average and 90th-percentile slowdowns for these projects. This is because Cauldron utilizes the similarity of compile times in its bundle strategy and mitigates the negative effect of unity builds on incremental builds. For example, the average slowdown is reduced from 3.11 seconds to 1.91 seconds (by 39%) in WebKit. The 90th-percentile slowdown is also improved by 46% (from 9.17 seconds to 4.98 seconds). As for the worst-case overhead, Cauldron improves the overheads in LLVM and Mesa, but not in WebKit. The reason why the existing unity builds of WebKit achieve a smaller worst-case overhead is that the WebKit developers exclude the source files that cause unacceptable overheads in incremental builds by applying a patch, as shown in Figure 3.4. Here, @no-unify is an attribute for CMake that excludes a source file from unity builds. According to Figure 3.16c, the Meson-unity build significantly worsens the overhead of incremental builds for many source files. This is because there is no limit on the maximum bundle size in the Meson-unity build. For example, 108 source files are bundled together in the Meson-unity build. The unlimited bundle size causes a terrible overhead in incremental builds, so setting a maximum bundle size is necessary for reducing the incremental-build overhead of unity builds. Note that some of the incremental-build slowdowns are below zero. This means that for some source files, unity builds outperform non-unity builds even in incremental builds. This is because unity builds can accelerate link processing: the input size of link operations is reduced. As a result, unity builds consistently improve the I/O performance of link processing even in incremental builds. For example, when the static library libWebCoreGTK.a is created in WebKit, 3,667 object files are linked in non-unity builds. By comparison, only 580 object files are processed in the unity builds of Cauldron.


Table 3.3. Full-build performance. UB: unity build.

          Build times w/ Cauldron   Build times w/ existing UBs   Existing UB methods
WebKit    19m 53s                   21m 51s                       WebKit
LLVM      5m 16s                    5m 9s                         CMake
Mesa      2m 9s                     2m 11s                        Meson

3.5.3 Full-build performance

Here, I focus on the full-build performance to evaluate the impact of the bundle strategies in Cauldron. In this evaluation, I exclude the overhead of bundling source files (such as computing header similarities, estimating compile times, choosing source files that should be bundled together, and creating unity files). This shows the pure impact of using sophisticated bundle strategies in terms of which source files are bundled together. Table 3.3 describes the overall results. In short, the unity build of Cauldron achieves competitive build performance compared with the existing unity builds. For example, the full build time is decreased by 9% (from 21m 51s to 19m 53s) compared with the unity builds of WebKit. In LLVM and the Mesa 3D library, the build times are similar to the existing unity builds (i.e., CMake's unity builds and Meson's unity builds, respectively). The reason why only the full-build performance for WebKit is much better with Cauldron than with the existing approach is that the bundle configurations become much more different. Since WebKit contains more source files than the other two projects, the configuration of which source files are bundled together changes significantly when changing bundle strategies. As a result, the benefit of using sophisticated bundle strategies becomes larger. To explore the impact of using the bundle strategies of Cauldron, I examine the number of unity files and the number of bundled source files, as shown in Table 3.4. According to the table, the unity builds of Cauldron do not increase the number of bundled source files compared with any of the existing unity builds. However, the full-build performance of Cauldron is better than or similar to the existing unity builds, as shown in Table 3.3. In general, if the number of bundled source files decreases, the full-build performance degrades because fewer source files benefit from unity builds. However, Cauldron provides competitive full-build performance thanks to its sophisticated bundle strategies, even though fewer source files are bundled than in the existing unity builds.


Table 3.4. The number of unity files and bundled source files in Cauldron. The results of existing unity builds are shown in parentheses.

          # of unity files   # of bundled source files   Avg. bundle size
WebKit    748 (660)          4,959 (5,027)               6.6 (7.6)
LLVM      234 (349)          1,284 (1,735)               5.5 (5.0)
Mesa      142 (81)           818 (905)                   5.8 (11.0)

3.6 Summary

This chapter presented Cauldron, a native build system for unity builds with sophisticated bundling strategies. Unity builds have been identified as a useful technique to improve build times in large C++ projects. However, the downsides of unity builds have been underestimated and overlooked. This dissertation has presented Cauldron, a novel build system focusing on the trade-off between incremental builds and unity builds. I evaluated Cauldron on three large projects: LLVM, WebKit, and the Mesa 3D library. I showed that it achieves better build performance in continuous builds. It also achieves competitive full-build performance among unity builds and mitigates the slowdown caused by unity builds in incremental builds.

Chapter 4

Logging Automation for Inter-thread Data Dependencies

The objective of this chapter is to cover the basic knowledge about inter-thread data dependencies and to describe the design and implementation of K9, a logging tool that automatically inserts log printing code to trace inter-thread data dependencies. First, I explain what an inter-thread data dependency is and introduce the motivation for tracing inter-thread data dependencies in failure diagnosis. Then, I describe the design and implementation of K9. Overall, I design K9 to be "practical": it scales to one million lines of C code, causes negligible runtime overheads, and provides useful clues for tracking inter-thread dependencies in real-world bugs. To scale to such a large codebase, K9 does not use rigorous static analysis of pointers to detect the code locations where an inter-thread data dependency can occur. Instead, K9 takes a best-effort approach and finds "most" of those code locations by making use of coding conventions. This dissertation demonstrates that K9 is applicable to Linux and captures enough relevant code locations to provide useful clues to root causes in real-world bugs. With K9, I was able to debug a previously unknown bug in Linux. I also show that the runtime overhead of K9 is acceptable.


Figure 4.1. Two types of inter-thread data dependencies: (a) a direct dependency, in which thread A adds or updates an item in shared data C and thread B then refers to the item; (b) an indirect dependency, in which thread X adds item P to shared data M, thread Y refers to P and adds an item Q generated from P to shared data N, and thread Z refers to item Q.

4.1 Motivation

Here, I describe the background information about inter-thread data dependencies and introduce motivating examples for logging inter-thread data dependencies in failure diagnosis. Section 4.1.1 introduces the two types of inter-thread data dependency, and Section 4.1.2 shows bug examples from Linux.

4.1.1 Inter-Thread Data Dependency

An inter-thread data dependency is a situation in which a thread's state depends on a shared data structure that has been initialized or modified by other threads. For example, two threads have an inter-thread data dependency if one updates a shared data structure and the other reads it afterwards. The updated shared data structure may affect the reading thread's behavior. Inter-thread data dependencies make error propagation and failure diagnosis complicated. Note that failures involving inter-thread data dependencies are not always related to concurrency bugs such as data races, atomicity violations, and deadlocks. I will introduce real-world examples that are not such concurrency bugs but contain inter-thread data dependencies in the next section.
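To make the definition concrete, the following is a minimal, self-contained sketch (my own illustration, not code taken from any system K9 targets) of a direct dependency between two POSIX threads. Note that it contains no data race: the threads are serialized by pthread_join, yet the reader's state still depends on the writer's update.

#include <pthread.h>
#include <stdio.h>

// Shared data structure: updated by one thread, read by another.
struct shared_item { int value; };
static struct shared_item item;

// Thread A: updates the shared structure.
static void *writer(void *arg) {
    item.value = 42;   // this update affects the reader's behavior
    return NULL;
}

// Thread B: reads the shared structure; its state now depends on A.
static void *reader(void *arg) {
    printf("read %d\n", item.value);
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, writer, NULL);
    pthread_join(a, NULL);                    // A finishes before B starts
    pthread_create(&b, NULL, reader, NULL);
    pthread_join(b, NULL);
    return 0;
}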



Figure 4.2. Inter-thread data dependencies in the write system call in Linux. Direct dependencies arise between t0 ⋯ t7 and t4 through the page cache and between t4 and t3 through the I/O requests. An indirect dependency arises between t0 ⋯ t7 and t3 through t4, which correlates the page caches and the I/O requests.

In this dissertation, I focus on two types of inter-thread data dependencies: 1) direct and 2) indirect. A direct dependency occurs when two threads directly share a data structure; one updates the shared data structure, and the other reads it later. Figure 4.1 (a) illustrates a direct dependency: thread A updates the shared data structure, and then thread B reads it. An indirect dependency is caused by an intermediate thread that correlates distinct shared data structures. Figure 4.1 (b) illustrates an indirect dependency. In the figure, thread Y correlates one shared structure M with another shared structure N: thread Y reads shared structure M and updates another structure N with some value calculated from M. Threads X and Z do not have a direct dependency, but they have an indirect dependency through thread Y; the update by thread X may affect the value read by thread Z. An indirect dependency typically appears when a thread generates a new data structure from an existing one.

Inter-thread data dependencies often occur in multi-threaded systems like the Linux kernel. Figure 4.2 illustrates inter-thread data dependencies in the Linux block I/O subsystem. In the figure, write system calls are issued concurrently, and eight threads (t0 to t7) are serving the calls. Each thread updates the page cache managed by a radix tree. Then, some thread (t4 in Fig. 4.2) gets pointers to dirty pages in the tree.

Here, direct dependencies occur between thread t4 and all the other threads (t0 to t7). Next, thread t4 generates I/O requests from the dirty pages in the tree and inserts them into the I/O request queue. Then, the thread scheduled by chance (t3 in Fig. 4.2) dispatches the I/O requests to the disk driver. Here, an indirect dependency arises between t3 and t0–t7 because thread t4 correlates the tree and the I/O request queue.
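Analogously, here is a minimal sketch (again my own illustration) of an indirect dependency, mirroring Figure 4.1 (b): thread X updates M, thread Y derives N's value from M, and thread Z reads N, so X's update reaches Z only through Y.

#include <pthread.h>
#include <stdio.h>

static int M;   // stands in for shared data M (holding item P)
static int N;   // stands in for shared data N (holding item Q)

static void *thread_x(void *arg) { M = 7; return NULL; }             // add P to M
static void *thread_y(void *arg) { N = M * 2; return NULL; }         // derive Q from P
static void *thread_z(void *arg) { printf("%d\n", N); return NULL; } // refer to Q

int main(void) {
    pthread_t x, y, z;
    pthread_create(&x, NULL, thread_x, NULL); pthread_join(x, NULL);
    pthread_create(&y, NULL, thread_y, NULL); pthread_join(y, NULL);
    pthread_create(&z, NULL, thread_z, NULL); pthread_join(z, NULL);
    return 0;   // Z's output depends on X's update only through Y
}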

4.1.2 Bug Examples in Linux

Now, I introduce two real bugs in the Linux kernel to illustrate the need for tracing inter-thread dependencies. The bug reported at commit 8146502 is related to the Btrfs subsystem [22]. Figure 4.3 shows an excerpt of the code, in which there is a direct dependency between kworker and sync through the shared structures extent_buffer and page. Kworker flushes dirty pages, and sync releases the pages stored in extent_buffer. If an error is encountered while a page is being flushed, Btrfs stops flushing the dirty pages and clears the dirty flags of all the remaining pages so that they can be freed later by sync, which assumes the page to be freed is not dirty. In this bug, kworker forgets to clear the dirty flags after an error is encountered at line 17. Thus, sync panics at line 29.

Another bug exemplifies an indirect dependency. It is the priority violation in the Linux CFQ I/O scheduler, which aims to allocate disk time in proportion to priority. As pointed out in a previous study [160], CFQ ignores priorities if many threads with different priorities perform asynchronous writes. When eight threads with different priorities from 0 (highest) to 7 (lowest) are launched, a thread's write throughput is expected to be proportional to its priority. Unfortunately, CFQ ignores the priorities, and the write throughput is not proportional, as shown in Figure 4.4 (a).

The CFQ priority violation is caused by two factors. First, the priority of the thread submitting I/O requests is different from that of the threads that update the page cache. Second, the priority difference is not tracked in the Linux kernel. Fig. 4.2 illustrates the situation. Eight threads (t0 ∼ t7) write to page caches asynchronously, where each thread index corresponds to its priority (ti's priority is i). Later, a thread (t4 in this example) gets dirty pages from the cache and submits I/O requests with priority 4 to the queue. Thus, all the I/O requests are treated with priority 4, as shown in Fig. 4.4 (b).


 1 // File: fs/btrfs/extent-io.c
 2 // Shared Data: struct extent_buffer *eb;
 3 // An extent_buffer holds an array of page caches.
 4
 5 // Thread: kworker
 6 int write_one_eb(...) {
 7   ...
 8   for (i = 0; i < num_pages; i++) {
 9     struct page *p = eb->pages[i];
10     clear_page_dirty_for_io(p);
11     if (error) { ... break; }
12   }
13   ...
14   if (unlikely(error)) {
15     for (; i < num_pages; i++) {
16       struct page *p = eb->pages[i];
17       # BUG: FORGOT TO CLEAR DIRTY FLAG
18       ...
19     }
20   }
21 }
22 // Thread: sync
23 void btrfs_release_extent_buffer_page(...) {
24   ...
25   struct page *page = eb->pages[index];
26   if (is_dirty(page)) {
27     # FAILURE: KERNEL PANIC
28     panic();
29   }
30   ...
31 }

Figure 4.3. A bug in Btrfs in Linux kernel v3.17-rc5. An error propagates from kworker to sync through the shared extent_buffer.


[Plot: (a) throughput in MB/s for priorities 0 (high) to 7 (low), with the ideal proportional line; (b) the fraction of I/O requests submitted by each priority, from 0% to 100%.]

Figure 4.4. CFQ priority violation. I/O throughput is not proportional to priority. A thread with priority 4 submits all I/O requests.

id : dst_type:dst_addr <- src_type:src_addr
L1 t6 : page:0x...aac0 <- page_tree:0x...c088       // t6 updates a page
L2 t4 : page:0x...aac0 <- page_tree:0x...c088       // t4 gets the same page
L3 t4 : bio:0x...f080 <- page:0x...aac0             // t4 generates
L4 t4 : request:0x...9988 <- bio:0x...f080          // I/O req from the page
L5 t4 : cfq_queue:0x...cd98 <- request:0x...9988    // t4 enqueues it
L6 t3 : request:0x...9988 <- cfq_queue:0x...cd98    // t3 dispatches it

Figure 4.5. Simplified log of the CFQ priority violation.

In this example, an indirect dependency arises through thread t4 between the dirty page cache and the I/O request because thread t4 creates a new data structure (an I/O request) from an existing data structure (a dirty page cache). As can be seen from this example, an inter-thread data dependency is not always associated with a concurrency bug: this example involves the dependency but is not related to concurrency bugs.

4.2 Design Goals and Overview of K9

Now, I describe the two main design goals of K9. Then, I depict an overview of K9's workflow and explain its overall structure.

Log inter-thread data dependencies: K9 automatically inserts logging code so that information about inter-thread data dependencies is recorded. K9 identifies data structures likely to be shared among threads. For direct dependencies, K9 recognizes the code locations where they are accessed and inserts logging code there that records the data type and address of the accessed data structure. Fig. 4.5 shows a simplified log for diagnosing the CFQ priority violation.


Lines L1 and L2 show a direct dependency between threads t6 and t4 through the page at address 0x...aac0. Lines L5 and L6 show another dependency between threads t4 and t3 through the request at address 0x...9988. Note that only the lower 16 bits are shown for readability; in actual logs, all addresses are shown in full 64 bits. For indirect dependencies, K9 recognizes data-flows from one shared data type to another. Suppose K9 identifies two data types, T1 and T2, as shared and detects a data-flow from T1 to T2. K9 considers it an indirect dependency, since a value of type T2 can be affected by a value of type T1. Lines L3 and L4 show an indirect dependency from page to bio and on to request through thread t4.

Scale to large software systems: In designing K9, scalability is important because the Linux kernel is undoubtedly one of the largest and most complex software systems in the world: it has 20.0 MLOC in 29,208 C source files. Guided by the need for scalability, K9 is designed with a best-effort approach. Inter-procedural analysis of pointers becomes inevitable if all the code locations causing inter-thread dependencies are to be identified. Unfortunately, pointer analysis is notoriously hard; it has been proven that a complete analysis is computationally undecidable [96, 128]. Furthermore, propagating pointer information among interleaving threads is a challenge because the shared data can be accessed non-deterministically [137]. Giving up on reaching a perfect solution, K9 instead takes a best-effort approach to avoid complicated and time-consuming static analysis. It identifies "typical" data structures that are often shared among threads, and it detects "most" of the accesses to them. In other words, K9 does not try to detect every code location where an inter-thread data dependency can occur. Instead, it focuses on the key data structures that cause dependencies.

Now, I describe the overview of the K9 workflow, as shown in Fig. 4.6. K9 takes source files as inputs and outputs a patch file that inserts logging statements for tracing direct and indirect data dependencies. The analysis of K9 consists of three stages. First, K9 identifies data types that can be shared among threads. Second, it detects the data-flows of the identified shared data that cause direct data dependencies. Last, it analyzes the data-flows between direct data dependencies to discover indirect data dependencies.
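The dissertation shows the inserted LOG(...) calls in later figures (e.g., Figure 4.10) and the log format in Fig. 4.5. As a rough sketch of what such an inserted statement could expand to inside the kernel, one might write the following; the macro body, the use of printk, and the exact format string are my assumptions, not K9's actual implementation:

// Hypothetical expansion of K9's inserted logging statement, matching
// the "dst_type:dst_addr <- src_type:src_addr" format of Fig. 4.5.
#define LOG(line, dst_type, dst_ptr, src_type, src_ptr)              \
    printk(KERN_DEBUG "%s-%d : %s:%d , %s:%p <- %s:%p\n",            \
           current->comm, current->pid, __FILE__, (line),            \
           (dst_type), (void *)(dst_ptr), (src_type), (void *)(src_ptr))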

62 CHAPTER 4. LOGGING AUTOMATION FOR INTER-THREAD DATA 4.3. INTER-THREAD DATA DEPENDENCY MODEL DEPENDENCIES

[Workflow diagram: $ k9-analyzer takes source code as input, discovers the data types shared among threads, detects direct dependencies by analyzing the data-flow of the shared data, analyzes indirect dependencies from the data-flows between the detected direct dependencies, and outputs a patch file that inserts logging code for tracing direct and indirect dependencies.]

Figure 4.6. The workflow of K9.

4.3 Inter-thread Data Dependency Model

To identify data structures shared among threads, K9 leverages coding conventions. Typically, a data structure shared among threads is allocated on the heap, referenced by pointers, and often managed in data collections. To discover such data structures, K9 analyzes the type information and identifies the collection and item types from the source code. Then, it checks whether the identified data type can be allocated on the heap by examining the existence of pointers to the data type. If K9 discovers such a data structure, it regards the structure as shared among threads. Since not all shared data satisfy this assumption, K9 misses some of the shared structures. Despite this limitation, it identifies the primary data structures, and it has provided helpful clues to diagnose an unknown bug in Linux. If we could assume that all the critical sections are locked correctly, K9 would be able to focus on critical sections to extract the shared data structures and the accesses to them. K9 does not take this approach because, as a debugging tool, it cannot assume that the target code is correctly synchronized.
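A minimal sketch of this screening step follows; the type and helper names are all hypothetical, meant only to make the heuristic concrete:

// Hypothetical sketch: a struct type is a candidate shared structure
// if it is recognized as a collection or an item and pointers to it
// exist, so that instances can live on the heap and be reached from
// other threads.
struct type_info;   // opaque descriptor of a C struct type

int is_collection_type(const struct type_info *t);
int is_item_type(const struct type_info *t);
int pointer_type_exists(const struct type_info *t);  // some "T *" in the code

static int is_candidate_shared(const struct type_info *t)
{
    return (is_collection_type(t) || is_item_type(t))
        && pointer_type_exists(t);
}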

4.3.1 Collections and Items

K9 defines collections and items to select candidate structures that can be shared among threads. A collection is a data structure that groups a fixed or variable number of data items. Important examples of collections are queues and trees. Each element in a queue or each node in a tree is called an item in K9, and queues and trees are considered to be groups of those items. K9 assumes that collections are C structs and classifies them into a) array collections and b) graph collections.


// struct extent_buffer is an array collection.
struct extent_buffer {
    // struct page is an array item.
    struct page *pages[...];
    ...
};

Figure 4.7. Example of an array collection and item in Linux.

[Diagram: a graph collection comprising a graph head linked to a chain of graph items.]

Figure 4.8. Typical structure of the graph collection and item.

K9 uses the following heuristics to identify candidate structures shared by threads. C structs that conform to neither an array collection nor a graph collection are not considered shared structures in K9.

Array Collection: An array collection is a C struct that contains an array storing pointers to C structs. Figure 4.7 shows an example of an array collection from the Linux kernel. The extent_buffer struct is an array collection because it contains an array whose elements are pointers to the page struct. In K9, the extent_buffer struct is called an array collection type, and the page struct is called an array item type.

Graph Collection: A graph collection consists of C structures linked to each other with pointers. As shown in Figure 4.8, a graph collection contains a graph head, which is the head of the collection (e.g., the head of a linked list). It usually contains metadata for controlling the collection (e.g., the length of a list and the lock for the list) in addition to pointers to data items. Graph items represent the items stored in the collection. If a C struct contains self-pointers, K9 regards it as a graph item type. Figure 4.9 shows an example of a graph collection from the Linux kernel.


// A graph collection
// struct sock: network layer representation of sockets.
struct sock {
    // contains two graph heads
    struct sk_buff_head sk_write_queue;
    struct sk_buff_head sk_receive_queue;
};

// A graph head
// struct sk_buff_head: a head of socket buffers
struct sk_buff_head {
    // pointers to graph items
    struct sk_buff *next, *prev;
    // metadata: list length & lock
    __u32 qlen;
    spinlock_t lock;
};

// A graph item
// struct sk_buff: socket buffer
struct sk_buff {
    // self-pointers
    struct sk_buff *next, *prev;
    // body
    unsigned char *data;
    ...
};

Figure 4.9. Examples of graph collection, head, and item in Linux.


4.3.2 Dependencies between Collections and Items

K9 records accesses to array or graph items in collections in order to trace data dependencies across threads.

Direct dependency: K9 identifies 1) adding/removing an item to/from a collection and 2) reading/updating an item in a collection. Since adding/removing operations update pointers to item types in collection types, K9 logs the assignments to the array of pointers to the item type for array collections, and the assignments to self-pointers in the item type for graph collections. The operation that reads or updates an item is called a referencing operation. To reference an item, a pointer to the item is dereferenced, and thus K9 logs dereferences of pointers to item types.

Indirect dependency: Logging indirect dependencies is more complicated. An indirect dependency occurs when an item of one item type is updated with a value calculated from an item of another item type. K9 keeps track of data flows from one item type to another and inserts logging code to record any update by a value originating from another item type.

4.3.3 Log Points for Collections and Items

Here, I define log points, which are the code locations of logging statements [168, 162, 163, 165]. In particular, K9 identifies the following code locations as log points.

• updating and reading operations for array collections: e.g., i = col[idx]; and col[idx] = i;.

• updating and reading operations for pointers in graph collections: e.g., col->next = i; and i = col->next;.

• code locations where data-flows from one direct dependency to another direct dependency occur.

In this dissertation, I assume that items to be removed from collections are referred to from the collections before being removed. Thus, logging the reading operations on collections is enough to record such removing events. By comparing the source code and the log messages, developers can tell that the referred item is removed after the log printing code.


4.4 Design and Implementation of K9

K9 is implemented with the LLVM framework (v3.6.0). K9 works on the LLVM intermediate representation (IR), as in previous studies [111, 157, 85, 137, 72, 166, 105, 156, 84, 78], although I will refer to the source code in the following explanation. To generate LLVM IRs, K9 hooks and modifies compile commands during the build of the Linux kernel. K9 generates a patch that injects logging code into the original source code. It generates the patch from the IR in four stages: 1) collection/item type identification, 2) data-flow graph generation, 3) direct dependency analysis, and 4) indirect dependency analysis.

4.4.1 Collection Support Library

K9 is simplified by making use of coding conventions. Unfortunately, collection support libraries, which are often used in large software projects, complicate the analysis of direct/indirect dependencies. A collection support library provides common implementations for graph collections such as lists and trees. For example, Linux provides list_head for doubly-linked lists. By embedding it in another struct, the embedding struct becomes a node of a doubly-linked list. All the operations are done in polymorphic functions taking list_head as an argument.

A collection support library is a headache for K9 because it 'conceals' the type names of graph items and collections. As shown in Fig. 4.5, K9 records the names of the item and collection types so as to indicate which item/collection is accessed. If the I/O request queue in Fig. 4.2 were implemented with list_head, a naive implementation would record that list_head is accessed, since all the operations are done through list_head. To deal with this problem, K9 keeps track of inter-procedural data-flows between functions working on item/collection types and those working on support library types. This inter-procedural analysis allows it to generate the expected log that records the item/collection types into which a support library type is embedded.
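For concreteness, here is a simplified sketch of how list_head embedding conceals the item type at a call site. The struct bodies and the enqueue function are illustrative, not Linux's actual definitions:

// Simplified sketch (not the kernel's real definitions).
struct list_head { struct list_head *next, *prev; };
void list_add_tail(struct list_head *entry, struct list_head *head);

struct request {                  // graph item
    struct list_head queuelist;   // the 'concealed' link
    int sector;
};

struct request_queue {            // graph collection
    struct list_head queue_head;  // graph head
};

// At this call site, the list operation sees only list_head, not
// struct request; K9 follows the inter-procedural data-flow to
// recover the embedding item/collection types for the log.
void enqueue(struct request_queue *q, struct request *rq)
{
    list_add_tail(&rq->queuelist, &q->queue_head);
}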


4.4.2 Data-flow Graph of K9

K9 employs an inter-procedural and path-insensitive data-flow graph that is used for the direct and indirect dependency analyses. An inter-procedural data-flow graph is required because data-flows from collection types to item/library types span procedure boundaries. The graph is path-insensitive to avoid costly path-sensitive analyses [25, 9]. The lack of path-condition information may result in redundant logs, but it does not miss log points; K9 prefers having redundant logs to missing logs.

A data-flow graph in K9 consists of two types of nodes and two types of edges. A function node represents each function in the source code. A variable node represents each variable within a function; it contains the name and type information of the represented variable and information on whether its type belongs to a collection/item/library type. A call edge keeps the inter-procedural relationship between the caller and callee functions; it also maps the variable nodes of the caller arguments to those of the callee arguments. A data-flow edge represents a data flow between variable nodes and contains the source code location where the data flow occurs. Figure 4.10 and Figure 4.11 describe an example of a data-flow graph. In Fig. 4.11, there is a data-flow edge from item to head due to the assignment at line 14 in Fig. 4.10. The assignment at line 15 causes a data flow from head to item. The call edge from sock_add_skb to skb_queue is created due to the function call at line 8. This call edge keeps the argument mapping between sock_add_skb.item and skb_queue.item, and between sock_add_skb.collection and skb_queue.head.

A data-flow edge maintains an op_kind attribute. This attribute indicates whether the corresponding data flow is an adding/referencing operation or not. If "adding" is set as the attribute, the data flow implies an adding operation, which indicates that an item is added to a collection. If the attribute is "referencing", the data flow is a referencing operation on a collection, which shows that an item is referred to from a collection. Otherwise, the edge represents other operations, and the attribute becomes "other".

K9 distinguishes logs about adding operations on items from logs about referencing operations on items. In the case of array collections, the adding/removing operation means updating array elements; a reference to an array element like "c->items[i]" appears on the left side of an assignment.

If such a reference appears on the right side, it is a referencing operation. Otherwise, the attribute becomes "other". For graph collections, an adding/removing operation updates the self-pointers of a graph item type. If a reference to a self-pointer like "item->next" appears on the left side of an assignment, K9 sets "adding" as the attribute; if it appears on the right side, K9 sets "referencing"; otherwise, K9 sets "other". For graph head types, if a pointer to an item type in the graph head type appears on the left side of an assignment, K9 assigns "adding" as the attribute; if it appears on the right side, "referencing" is set. There is one exception for graph head types: if a graph head type is assigned to a graph item type, it is not considered an "adding" operation even though a graph item type appears on the left side of the assignment. Since there is already a pointer from the head to the item, the operation on the reverse pointer does not have to be logged.

In Fig. 4.11, K9 sets "adding" as the attribute of the data-flow edge that corresponds to the assignment at line 14, because an item type (sk_buff) of a graph head (sk_buff_head) appears on the left side of the assignment. K9 sets "other" as the attribute for the assignment at line 15, because a graph head type is assigned to a graph item type there. This assignment does not have to be logged because the assignment at line 14 already indicates that the graph item has been added to a graph head.

There is a subtle problem with call edges. If a function call occurs through a function pointer, K9 has to decide which function is called through the function pointer. To get such information, it analyzes the assignments to function pointers and saves the relation between the function pointers and the assigned functions to a database. When K9 reaches a function call through a function pointer, it queries which functions can be called through the function pointer and then creates call edges to all the callable functions.
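The dissertation describes these records abstractly; a hypothetical rendering of the node and edge structures in C (all field names are my assumptions) might look like this:

// Hypothetical sketch of K9's data-flow graph representation.
enum op_kind    { OP_ADDING, OP_REFERENCING, OP_OTHER };
enum type_class { TC_COLLECTION, TC_ITEM, TC_LIBRARY, TC_NONE };

struct var_node {                 // a variable within a function
    const char *name;             // e.g. "item"
    const char *type_name;        // e.g. "struct sk_buff *"
    enum type_class klass;        // collection / item / library / none
};

struct func_node {                // a function in the source code
    const char *name;
    struct var_node **vars;
    int nvars;
};

struct dataflow_edge {            // a data flow between variable nodes
    struct var_node *src, *dst;
    const char *file;             // source location where the flow occurs
    int line;
    enum op_kind kind;            // adding / referencing / other
};

struct call_edge {                // caller-callee relationship
    struct func_node *caller, *callee;
    struct var_node **arg_src;    // caller-argument variable nodes ...
    struct var_node **arg_dst;    // ... mapped to the callee parameters
    int nargs;
};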


 1 // Queue a buffer at the tail of the buffer
 2 // Graph Collection: struct sock
 3 // Graph Head: struct sk_buff_head
 4 // Graph Item: struct sk_buff
 5
 6 void sock_add_skb(struct sock *collection, struct sk_buff *item) {
 7   ...
 8   skb_queue(&collection->sk_write_queue, item);
 9   LOG(8, "struct sock", collection, "struct sk_buff", item);
10 }
11
12 void skb_queue(struct sk_buff_head *head, struct sk_buff *item) {
13   ...
14   head->prev = item;
15   item->next = (struct sk_buff *) head;
16 }

Figure 4.10. Simplified example of queueing a socket buffer into a socket.

4.4.3 Direct Dependency Analysis

By analyzing the data-flow graph, K9 identifies data-flows that might cause direct inter-thread data dependencies. To achieve this, K9 first searches for data-flows between collection and item types. Then, to record such direct dependencies, K9 injects a logging statement that outputs "<dst_type>@<dst_addr> <- <src_type>@<src_addr>".

Array Collection: K9 searches for data-flow edges from an array collection type to an array item type. Such an edge implies that an array item is referenced from, added to, or removed from an array collection. Since a data-flow edge records the code location, K9 inserts the logging code there.

Graph Collection: I will start with the case in which collection support libraries are not used.


[Diagram: the data-flow graph for Figure 4.10. The function node sock_add_skb has variable nodes item (sk_buff*) and collection (sock*). Call edge ce1 at line 8 maps sock_add_skb:item -> skb_queue:item and sock_add_skb:collection -> skb_queue:head. The function node skb_queue has variable nodes item (sk_buff*) and head (sk_buff_head*), connected by data-flow edge df1 at line 14 (OpKind: Adding) and data-flow edge df2 at line 15 (OpKind: Other).]

Figure 4.11. Example of the data-flow graph constructed from Figure 4.10.

Since a graph item can be added to or referenced from another graph item or a graph head, K9 first searches for data-flow edges from a graph item/head type to a graph item/head type. Even if such a data-flow edge is found, the collection type of an item type cannot be identified directly: in a graph collection, a graph item is accessed through a graph head. To identify the graph collection type, K9 identifies data-flows from the graph collection type to the graph item type through the graph head type. It backtracks data-flows from a node of an item/head type until a collection type is discovered. When a collection support library is used, all the operations are done on the support library type, so K9 backtracks data-flows from a support library type to a collection/item type.
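A hypothetical sketch of this backtracking step follows. The node structure is reduced to the minimum, and the depth bound is my addition to keep the search finite on cyclic graphs:

// Hypothetical sketch: walk incoming data-flow edges backwards from an
// item/head-typed variable node until a collection-typed node appears.
struct bt_node {
    int is_collection;            // 1 if the node has a collection type
    struct bt_node **pred;        // sources of incoming data-flow edges
    int npred;
};

static struct bt_node *find_collection(struct bt_node *n, int depth)
{
    if (n->is_collection) return n;
    if (depth == 0) return 0;     // give up beyond the bound
    for (int i = 0; i < n->npred; i++) {
        struct bt_node *c = find_collection(n->pred[i], depth - 1);
        if (c) return c;
    }
    return 0;
}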


4.4.4 Indirect Dependency Analysis

An indirect dependency occurs when a thread gets an item from a collection, generates a new item from it, and adds the new item to another collection. To detect indirect dependencies, K9 keeps track of the data-flows from the referenced item to the added item. K9 identifies three types of code locations: 1) where a pointer to an item is obtained from a collection, 2) where another item is generated from the obtained item, and 3) where the generated item is added to another collection. K9 identifies the code locations of the first and third types in the direct dependency analysis. To discover the code locations of the second type, K9 performs the indirect dependency analysis.

To understand which item is created from which item, K9 uses the data-flow graph. When an item is referenced from a collection, there is an edge from a node of a collection type to a node of an item type; I call an item node that has an incoming edge from a collection node an item-source node. If an item is added to a collection, there is a data-flow edge from a node of an item type to a node of a collection type; I call an item node that has an outgoing edge to a collection node an item-destination node. If K9 finds a data-flow path from an item-source node to an item-destination node, it understands that the added item (item-destination) is generated from the referenced item (item-source).

For example, K9 detects an indirect dependency in the Linux block I/O subsystem. When the kernel flushes dirty page caches, it first retrieves such pages from the page cache tree. Then, the kernel generates I/O requests from the pages and submits them to the I/O request queues. In this case, K9 identifies the nodes of page caches as item-source nodes and those of I/O requests as item-destination nodes. K9 analyzes the data-flows from the page caches to the I/O requests and logs them.

K9 performs this analysis for every combination of item-source and item-destination nodes. However, this naive reachability analysis does not scale because the data-flow graph becomes large and the cost of tracking data-flow paths is very high. For example, when I apply K9 to the Linux kernel, the data-flow graph contains 54,913 nodes and 51,983 edges. To avoid this scalability problem, I introduce two heuristics. First, K9 simplifies the data-flow graph before it conducts the indirect dependency analysis.

In particular, it combines different nodes of the same type into one node, called the type node. In so doing, multiple data-flow edges from one type node to another are unified into a single type-flow edge. This simplification reduces the graph size and makes the indirect dependency analysis scalable; in the Linux kernel, the number of type nodes is reduced to 18,753, while the number of type-flow edges is reduced to 2,440. Even with this simplification, K9 does not overlook indirect dependencies, because all the data-flow edges before the simplification exist in the type-flow graph. Note that this simplification increases false positives: logging code is inserted at data-flow edges that do not appear on any data-flow path from an item-source to an item-destination but do appear on a type-flow path.

For the second heuristic, I introduce the chain length, which specifies the maximum number of type-flow edges along a type-flow path. It corresponds to the maximum number of type conversions from the source to the destination. By default, I set the chain length to four. From my experience with Linux, the number of type conversions is often less than three, and thus the chain length does not miss indirect dependencies in practice.
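A hypothetical sketch of the bounded reachability check over the simplified type-flow graph (the structure and names are assumed; the default bound of four follows the text):

// Hypothetical sketch: is `dst` reachable from `src` within
// `chain_len` type-flow edges?
struct type_node {
    struct type_node **succ;      // targets of outgoing type-flow edges
    int nsucc;
};

static int reachable(const struct type_node *src,
                     const struct type_node *dst, int chain_len)
{
    if (src == dst) return 1;
    if (chain_len == 0) return 0;
    for (int i = 0; i < src->nsucc; i++)
        if (reachable(src->succ[i], dst, chain_len - 1))
            return 1;
    return 0;
}

// K9 would run a check like this for every (item-source,
// item-destination) pair, e.g. reachable(page_node, request_node, 4).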

4.5 Experiments

In this section, I present the results of an evaluation of K9 from the following viewpoints.

Scalability: I applied K9 to the Linux kernel. It successfully analyzed the kernel in a reasonable time.

Accuracy: K9 takes a best-effort approach to identifying code locations that can cause inter-thread data dependencies. I manually estimated the precision of the log points (Section 4.5.2). I also demonstrate that K9 provides useful clues for diagnosis: I reproduced two failures and one performance problem from existing bug reports and successfully diagnosed them. In addition, I used K9 to diagnose an unknown bug in Btrfs (Section 4.5.3).

Performance overheads: The performance of the K9-enabled Linux was measured on the filebench varmail benchmark, MySQL, MongoDB, the Apache web server, and the Squid proxy server. The performance overhead was 1.25% on average in terms of throughput, and the CPU usage increased by 0.18% on average (Section 4.5.4).


Table 4.1. Experimental Environment

CPU                   Intel Xeon E3-1220 v2 3.10GHz 1P/4C
Memory size           16GB
Disk                  HP 500GB SATA HDD
Linux kernel version  v4.1.39

4.5.1 Scalability

I applied K9 to the Linux kernel (v4.1.39) in three use cases. In the first case, K9 analyzed the Linux kernel with the default configuration, and the range of the analysis was not limited. The second case supposed that the user wanted to diagnose failures related to file systems: logging code was inserted at code locations reachable from the system calls related to file systems. Since K9 builds a data-flow graph starting from them, the logging code was injected into the kernel core and the memory management as well as the Ext4 file system, the block I/O layer, and device drivers (SATA, SCSI, and so on). The third case supposed that the user wanted to diagnose failures related to networking, and the logging code was limited to networking. Table 4.1 describes the experimental environment.

Table 4.2 shows the results of the analyses. K9 analyzed 1,385,778 LOC comprising 60,513 functions in 2,051 files in the default configuration. The direct and indirect dependency analyses took 44 min 54 s and 321 min, respectively; the indirect dependency analysis took much longer than the direct one because it required a reachability analysis. The patch was generated in 10 s. K9 identified 419 collection types and 386 item types, and detected 3,131 log points for direct dependencies and 1,286 log points for indirect dependencies. When the analysis was limited to the file system or networking, the analyzed code was about one-seventh the size of the default configuration, and the analysis took only about one-sixtieth the time.


Table 4.2. Analysis results (LP: “log point”, IR: “intermediate representation”).

Analysis results        Default conf.   File system   Network
# of files              2,051           466           503
# of functions          60,513          8,628         8,428
LOC                     1,385,778       186,558       175,687
# of collection types   419             93            86
# of item types         386             73            77
# of LPs (direct)       3,131           392           365
# of LPs (indirect)     1,286           69            27
# of total LPs          4,417           461           392
LLVM IR generation      8 min 32 s      <-            <-
Direct dep. analysis    44 min 54 s     2 min 58 s    3 min 24 s
Indirect dep. analysis  321 min         3 min 24 s    2 min 24 s
Total time              374 min 36 s    14 min 56 s   14 min 23 s

4.5.2 Precision of Log Points

I evaluated the precision of the log points identified by K9. In particular, I manually investigated the log points identified by K9 and classified them into correct (true positives) and incorrect (false positives) ones. To reduce the burden of the manual investigation, I focused on the file-system configuration in Table 4.2. For each log point, I carefully examined the source code and determined whether it was correct or not by using my expertise on the Linux kernel. The file-system configuration was chosen because some log points were verified by a maintainer of the Btrfs file system who has intimate knowledge of the Linux file system. In spite of my expertise, the correctness of some log points could not be determined with confidence; I regarded all of them as false positives.

My evaluation did not treat true/false negatives. Counting true negatives would be meaningless because it would mean counting the code locations where logging code should not be injected and K9 indeed injects none. For false negatives, I would need an oracle that tells us all the code locations in the Linux kernel that can cause inter-thread data dependencies. Obtaining such an oracle is almost impossible, and thus I did not address false negatives.


Table 4.3. Characterization of log points identified in the direct dependency analysis of the file-system case. G denotes a graph collection, and A denotes an array collection.

Item Type           Collection Type     TP / FP   Description
mount               mount               16 / 0    G: mount namespace in file system.
page                pagevec             16 / 0    A: page buffer and page cache.
audit_names         audit_context       0 / 15    G: audit context and its name.
request             request_queue       15 / 0    G: I/O request queue and I/O request.
blk_mq_hw_ctx       request_queue       0 / 13    G: software queue and hardware context in multi-queue block I/O.
perf_event          perf_event          12 / 0    G: perf events.
page                address_space       11 / 0    A: page cache tree and page cache.
cfq_queue           request             11 / 0    A: I/O request and associated CFQ queue.
sidtab_node         sidtab              7 / 2     A: SELinux security identifier table and its node.
perf_event          perf_event_context  9 / 0     G: perf context and managed events.
md_rdev             mddev               2 / 6     G: Linux multi-device and its extended device (e.g., RAID).
scsi_device         Scsi_Host           7 / 0     G: SCSI object and its device.
page                kmem_cache_node     7 / 0     G: slab allocator and its slab.
scsi_cmnd           Scsi_Host           7 / 0     G: SCSI object and its command.
page                page                6 / 0     G: LRU page list.
request             deadline_data       6 / 0     G/A: deadline I/O scheduler and I/O request.
file_lock           file_lock           5 / 1     G: POSIX file lock.
inode               backing_dev_info    6 / 0     G: low-level device information that contains the dirty inode list.
page                lruvec              5 / 0     G: LRU page list.
perf_event_context  task_struct         5 / 0     A: process scheduler attaches or detaches perf events to task structs.
page                per_cpu_pages       5 / 0     G: per-cpu page list.
worker              worker_pool         5 / 0     G: worker pool in the Linux kernel workqueue.
perf_event          cpu_hw_events       5 / 0     A: perf events on hardware context.
task_struct         lb_env              5 / 0     G: load balancer and its task in the completely fair scheduler.
file_lock           inode               5 / 0     G: POSIX file lock.
page                zone                5 / 0     G: free page list.
hashtab_node        hashtab             0 / 5     G: SELinux hash table implementation.
sched_rt_entity     rt_prio_array       0 / 5     G: request queue and its request in the real-time scheduler.
task_struct         rq                  0 / 4     G: runqueue and task structure.
request             blk_mq_ctx          4 / 0     G: I/O request queue and I/O request in multi-queue block I/O.
...
Total                                   302 / 90


[Figure: two type-flow graphs from the indirect dependency analysis. (a) Item-destination type: struct request — type-flow paths from item-source types such as inode, page, file, buffer_head, md_rdev, address_space, and block_device converge through bio on request; e.g., the edge from page to buffer_head is annotated 5/0, from page to bio 5/0, from buffer_head to bio 1/0, from bio to bio 1/0, and from bio to request 9/1. (b) Item-destination type: struct cfq_queue — paths additionally pass through blk_mq_ctx, sg_fd, scsi_cmnd, request_queue, and cfq_data before converging on cfq_queue, each edge annotated with its true-positive/false-positive counts.]

Figure 4.12. Type-flow graphs showing the results of the indirect dependency analysis. Each edge is annotated with "true positives/false positives", indicating the number of log points.


For the file-system configuration, as shown in Table 4.2, K9 identified 73 item types and 93 collection types. For the direct dependencies, 392 log points were identified, and Table 4.3 shows the true/false positives for them. For each category, Table 4.3 gives the numbers of correct and incorrect log points and a brief description of the type pair; the top 30 results are shown in terms of the number of correct log points. K9 identified 302 correct and 90 incorrect log points out of the 392: 77.0% of the log points were correct, and 23.0% were incorrect.

For the indirect dependencies, K9 analyzed 19 item-source and item-destination pairs and detected 43 (62.3%) correct and 26 (37.7%) incorrect log points out of 69. Figure 4.12 shows two type-flow graphs for two item-destination types (request and cfq_queue). Each node represents a type; in particular, the nodes drawn in bold denote the item-source types. An edge shows a type conversion from a source node to a destination node, and its annotation indicates the number of correct/incorrect log points.

In the motivating example shown in Fig. 4.2, there is an indirect dependency between the page cache tree and the I/O request queue: an item of page is obtained from the page cache tree, from which an item of request is initialized and inserted into the I/O request queue. This indirect dependency is captured by K9. Fig. 4.12 (a) shows the type-flow paths from the page item type to the request item type. K9 inserted 21 log points: 5 from page to buffer_head, 5 from page to bio, 1 from buffer_head to bio, 1 from bio to bio, and 9 from bio to request.

Now, I discuss the false positives of K9. In the direct dependency analysis, K9 identified 90 incorrect log points (23.0%). There are two reasons for K9 inserting incorrect log points. First, K9 mistakenly identified shared data (16 log points). K9 regarded data allocated on the heap as shared data, but some were per-process data even though they were heap-allocated; for example, audit_context and audit_names were detected as shared data, but both are per-process data. Second, K9 regarded false dependencies as inter-thread data dependencies (34 log points). Suppose one thread updates one field of a C struct and another thread reads another field of the same C struct. In this case, there is no dependency between the two threads. Since K9 keeps track of data flows at the granularity of C structs, not fields, it considers that there is an inter-thread dependency between those threads.

The remaining 40 log points were hard to classify even with my expertise and thus were counted as false positives.

In the indirect dependency analysis, K9 identified 26 incorrect log points. All of these false positives were caused by the simplification described in Section 4.4.4: K9 combines different variable nodes of the same type into one node, so even if there are no data dependencies between two variable nodes, K9 considers that there are if the nodes combined with them have data dependencies.

Table 4.4. Diagnosed failures, caused by three known bugs and one unknown bug.

Uncleared dirty bit: Btrfs forgets to clear the page's dirty flag during error handling, which causes a kernel panic when the page is released later. This is a known bug (8146502 [22]).

CFQ priority violation: CFQ ignores priorities in asynchronous writeback because priority is not propagated among the threads. This performance bug has been reported in previous research [160].

Data race in kernel workqueue: A race condition occurs when a thread tries to destroy a workqueue, which leads to a kernel panic. This bug is reported at fa2563e [153].

Uncleared writeback bit: Btrfs forgets to clear the writeback bit of a page when it fails to flush the page, which hangs up the kernel. This is a heretofore unknown bug.

4.5.3 Diagnosing failures

To demonstrate that K9 provides useful clues for failure diagnosis, this section reports four diagnosis cases. Although K9 produced some false negatives, it provided useful information for debugging in failure diagnosis. Table 4.4 summarizes all four failures. The report includes the diagnosis of an unknown bug in Btrfs; this bug was reported to the Linux kernel community, and my patch fe01aa6 [92] has been accepted.


Diagnosing known bugs

Uncleared dirty bit: This bug is described in Section 4.1.2. Btrfs forgets to clear the page's dirty flag during error handling, and the bug causes a kernel panic when the page is released later. In Fig. 4.3, K9 identifies that the extent_buffer is an array collection and that the page is its array item. K9 detects three code locations (lines 9, 16, and 26) as referencing operations on the array collection and inserts logging code there (lines 10, 17, and 27). The K9 log allows us to trace the error propagation from kworker to sync.

CFQ priority violation: As mentioned in Section 4.1.2, the CFQ I/O scheduler has a priority violation problem, which involves indirect data dependencies. K9 identified address_space and cfq_queue as collection types and page and request as item types. It detected and recorded an indirect dependency from the page to the request through the bio. The resulting log, shown in Fig. 4.5, would enable this problem to be diagnosed.

Data race in kernel workqueue: This is the reported bug fa2563e [153]. K9 missed some log points in this case because the inter-thread dependency was incurred through data types other than collection/item types. Figure 4.13 shows an excerpt of the code related to the kernel workqueue. In rmmod, destroy_workqueue tries to free a kernel workqueue. It flushes all the elements in the workqueue until it becomes empty (lines 7–12). After confirming that it is empty and not used (lines 13–18), the workqueue is released (line 18). Since the workqueue is not locked in rmmod, the kworker thread can manipulate it simultaneously with rmmod; kworker can requeue a new element while the workqueue is being flushed. If this happens, cwq->nr_active is set to non-zero and the kernel panics at line 17.

Here, K9 detected global_cwq as a graph collection and work_struct as its item, and inserted logging code at lines 23 and 29. Unfortunately, it did not insert any logging code in destroy_workqueue because there is no collection access there. Despite the lack of log information, the K9 log, shown in Fig. 4.14, is still helpful for diagnosing this failure. The log at line L3 shows that rmmod crashes at line 17 in destroy_workqueue. Since there is no log from destroy_workqueue, the workqueue causing the failure cannot be identified directly. However, looking at lines L1 and L2, we can see that the crash occurs after an element (the work_struct at 0x...a000) is flushed from and re-queued to the same workqueue (the global_cwq at 0x...c9c0). From this log, we can guess that a new item is added to the workqueue while it is being destroyed.


 1 // File: kernel/workqueue.c
 2 // Shared data: struct global_cwq and struct work_struct
 3 // Process: rmmod
 4 void destroy_workqueue(struct workqueue_struct *wq) { ...
 5   // drain tasks before proceeding with destruction
 6 reflush:
 7   for_each_cwq_cpu(cpu, wq) {
 8     struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
 9     # BUG: cwq is required to be locked.
10     if (!cwq->nr_active()) continue;
11     goto reflush;
12   }
13   for_each_cwq_cpu(cpu, wq) {
14     struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
15     # FAILURE: a data race is caused by self-requeueing
16     if (non_zero(cwq->nr_active()))
17       panic();
18   } kfree(wq); ...}
19
20 // Process: kworker
21 void __queue_work(..., struct work_struct *work) { ...
22   struct global_cwq *gcwq = get_cwq(cpu, wq);
23   insert_work(gcwq, work, worklist, work_flags);
24 + LOG(23, "struct global_cwq", gcwq, "struct work_struct", work); ...
25 }
26 int worker_thread(struct worker *worker) { ...
27   struct global_cwq *gcwq = worker->gcwq;
28   do {
29     struct work_struct *work = list_first_entry(&gcwq->worklist, ...);
30 +   LOG(29, "struct work_struct", work, "struct global_cwq", gcwq);
31     ...
32   } while (...); ...
33 }

Figure 4.13. Kernel workqueue bug: a data race on cwq->nr_active.


Failure Log
process-pid : file_path:line , dst_type:dst_addr <- src_type:src_addr
L1 kworker/-10 : workqueue.c:29 , work_struct:0x...a000 <- global_cwq:0x...c9c0
L2 kworker/-10 : workqueue.c:23 , global_cwq:0x...c9c0 <- work_struct:0x...a000
L3 kernel BUG at kernel/workqueue.c:17, Pid: 2492, comm: rmmod, RIP: destroy_workqueue+0x1e5/0x270

Figure 4.14. Failure logs for Figure 4.13.

 1 // File: fs/btrfs/extent-io.c
 2 // Shared data: struct page
 3 // Process: a.out
 4 static int extent_write_cache_pages(...) { ...
 5   for (i = 0; i < nr_pages; i++) {
 6     struct page *page = pvec.pages[i];
 7 +   LOG(6, "struct page", page, "struct pagevec", pvec);
 8     ...
 9     ret = submit_extent_page(..., page);
10     ...
11     if (ret) { // Error handling
12       SetPageError(page);
13       # BUG: Miss clearing the writeback bit of the page.
14 }}}
15
16 // File: mm/filemap.c
17 // Process: sync
18 static int __filemap_fdatawait_range() {
19   ...
20   for (i = 0; i < nr_pages; i++) {
21     // a.out and sync get pointers to the same page.
22     struct page *page = pvec.pages[i];
23 +   LOG(22, "struct page", page, "struct pagevec", pvec);
24     wait_on_page_writeback(page);}}

Figure 4.15. An unknown bug in Btrfs: remaining writeback bit.



Failure Log
process-pid : file_path:line , dst_type:dst_addr <- src_type:src_addr
L1 a.out-2764 : extent_io.c:6 , page:0x...7180 <- pagevec:0x...fc78
L2 a.out-2764 : extent_io.c:6 , page:0x...13c0 <- pagevec:0x...fc78
L3 a.out-2764 : extent_io.c:6 , page:0x...4bc0 <- pagevec:0x...fc78
L4 ...
L5 sync-2765 : filemap.c:22 , page:0x...7180 <- pagevec:0x...3da8
L6 // sync-2765 hangs and never returns

Figure 4.16. Failure logs for Figure 4.15.

Diagnosing an unknown bug

I used K9 to diagnose a failure caused by an unknown bug. A community developer of Btrfs encountered a situation in which the sync command hangs. If there is a page cache whose writeback bit is set, sync waits until the page cache is flushed and its bit is cleared. However, because of a bug in the code, the writeback bits are never cleared. To diagnose this hang-up, I investigated the log generated by K9. Figures 4.15 and 4.16 depict the related code and the failure logs. By investigating the log for the sync command, I found that it does not output any logs after L5. Log line L5 is printed at line 22 in filemap.c to record the access to the page struct (0x...7180) in the pagevec struct (0x...3da8); the command appears to hang after this log message. Since wait_on_page_writeback is called at line 24 with the page (0x...7180) as an argument, I guessed that sync waits forever for the page to be written back. To get the code location where the page is requested to be written back, I grepped the log for 0x...7180 and discovered L1, which records the access to the same page at line 6 in extent_io.c by a.out. By checking the source code, I found that extent_write_cache_pages writes back each page in pvec.pages and clears the writeback bit if the writeback succeeds. But if the writeback fails, the error handler forgets to clear the writeback bit. As a result, if the writeback fails in submit_extent_page, the bit remains set and sync waits forever for the bit to be cleared.


Table 4.5. Performance overheads in macro benchmarks.

Throughput                    Base     Logging (slowdown)
filebench-varmail (ops/sec)   441      431 (-2.26%)
tpcc-mysql (ops/sec)          4,315    4,290 (-0.58%)
ycsb-mongodb (ops/sec)        1,013    1,005 (-0.83%)
ab-apache (req/sec)           4,150    4,056 (-2.25%)
ab-squid (req/sec)            316      302 (-4.78%)
Harmonic average overhead              -1.25%

CPU usage            Base     Logging (increase)
filebench-varmail    1.29%    1.68% (+0.39%)
tpcc-mysql           12.63%   13.04% (+0.41%)
ycsb-mongodb         14.96%   15.02% (+0.06%)
ab-apache            9.04%    9.08% (+0.04%)
ab-squid             2.15%    2.32% (+0.17%)
Average overhead              +0.18%

4.5.4 Performance Overheads

I measured the overheads of K9 on five macro benchmarks (Table 4.5). Filebench-varmail is the mail server benchmark in filebench [2]. Tpcc-mysql runs the TPC-C [4] workload on MySQL. Ycsb-mongodb is a NoSQL benchmark using MongoDB with the YCSB workload. For network applications, the performance of the Apache web server and the Squid proxy server [149] was measured using the Apache benchmark [53]. The execution environment was the same as in Table 4.1. K9 was applied to the system calls related to file systems in the first three benchmarks and to those related to networking in the remaining two. Linux kernel v4.1.39 was used with the ext4 file system.

Before focusing on the detailed results, I briefly summarize them. In the macro benchmarks, K9 incurs a low logging overhead: a 1.25% throughput degradation and a 0.18% CPU usage increase on average. This is because disk I/Os and network I/Os are the performance bottleneck. However, in a micro benchmark that does not include I/Os, the logging overhead increases to 35.67%. The cause of this increase is the lock contention on the LRU free page list and in process scheduling.

Table 4.5 shows the experimental results. The throughput slowdown and the CPU usage increase were measured for each benchmark. The throughput drop ranged from 0.58% in tpcc-mysql to 4.78% in ab-squid, and the harmonic average slowdown was 1.25%. The CPU usage increase ranged from 0.04% in ab-apache to 0.41% in tpcc-mysql, with an average increase of 0.18%.

To analyze the overhead, I measured a microbenchmark, filebench-seqwrite, that generates dirty pages without writing them to disk. This benchmark highlights the overheads because it does not involve disk I/Os. The throughput deteriorates by 35.67% in this benchmark. Although one might expect the CPU usage to increase because of the logging overheads, it actually decreased by 0.34%. This is due to lock contention: if many logs are generated in a critical section, the execution time of the critical section increases because of the logging overhead, and if the lock is held longer, other threads have to wait longer to acquire it, which results in a decrease in CPU usage. To confirm that lock contention is one of the primary overheads, I counted the number of lock contentions and found that two locks (one for the LRU free page list and the other for process scheduling) are dominant. If I remove the two log points in the critical section for the LRU free page list, the throughput degradation decreases from 35.67% to 26.08%; if I also remove one log point for process scheduling, the degradation decreases to 24.50%. K9 users can reduce the logging overheads by specifying the code range to be logged; if they are not interested in process scheduling, the scheduling code can be excluded from the logging.

To analyze how sensitive the overhead of K9 is to faster disks, I evaluated the disk-I/O-related benchmarks on a RAM disk, whose performance is, in general, faster than other forms of storage media. As a result, the throughput is decreased by 3.77% in ycsb-mongodb, by 3.07% in tpcc-mysql, and by 34.28% in filebench-varmail. The reason why the throughput degradation is smaller in ycsb-mongodb and tpcc-mysql is that the proportion of execution spent in user code is larger than in filebench-varmail. Thus, the logging overhead of K9 becomes more acceptable in large applications.


4.6 Summary

This chapter presented K9, a logging tool that automatically inserts logging code to trace inter-thread data dependencies. In failure diagnosis involving inter-thread data dependencies, the failing thread differs from the buggy thread because the error propagates through shared data. Investigating only the code executed by the failing thread therefore never reveals the root cause; software engineers would see that the failing thread reads corrupt data but could not figure out which thread corrupted it. Logging inter-thread data dependencies is thus critically important for diagnosing failures in multi-threaded systems.

K9 identifies the code locations where inter-thread dependencies can occur and inserts logging code that records which thread reads from or writes to which data structure (a minimal sketch appears at the end of this section). To be a practical tool, K9 is designed to 1) scale to large multi-threaded systems and 2) incur negligible runtime overhead. To scale to large systems, K9 avoids complicated static analyses and instead takes a best-effort approach: it identifies collections and items, which are the typical data structures causing inter-thread data dependencies. To determine such structures and the accesses to them, K9 leverages coding conventions ubiquitous in real-world software systems.

The current design of K9 has some limitations. First, K9 does not capture every code location that can cause an inter-thread data dependency, since it relies on coding conventions; dependencies introduced by code that deviates from the conventions are missed. Second, K9 is designed for software systems written in C. In spite of these limitations, my experimental results suggest that K9 captures enough code locations to diagnose real-world bugs involving inter-thread dependencies. In addition, there are still many software systems written in C.
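As a concrete picture of the kind of record K9 aims at, the sketch below shows a collection/item access pattern with logging calls at the read and write sites. It is only an illustration: K9 targets C code and infers these sites from coding conventions, and the k9_log helper, the Collection and Item types, and both access functions are hypothetical names rather than K9's real interface (written in C++ here for consistency with the other listings).

    #include <iostream>
    #include <thread>

    // Hypothetical stand-in for the call K9 inserts: record which thread
    // touches which collection/item pair.
    static void k9_log(const char *op, const void *coll, const void *item) {
        std::cout << "[k9] thread " << std::this_thread::get_id()
                  << ' ' << op << " collection=" << coll
                  << " item=" << item << '\n';
    }

    struct Item { Item *next = nullptr; };
    struct Collection { Item *head = nullptr; };

    void collection_add(Collection &c, Item &it) {
        k9_log("WRITE", &c, &it);  // inserted at the write to the shared collection
        it.next = c.head;
        c.head = &it;
    }

    Item *collection_pop(Collection &c) {
        Item *it = c.head;
        k9_log("READ", &c, it);    // inserted at the read from the shared collection
        if (it != nullptr) c.head = it->next;
        return it;
    }

    int main() {
        Collection c;
        Item a;
        std::thread producer([&] { collection_add(c, a); });
        producer.join();           // join before reading, so the example is race-free
        collection_pop(c);         // a different thread (main) reads what producer wrote
    }

If the item later turns out to be corrupt, the two records connect the reading thread back to the writing thread, which is precisely the dependency an investigation of the failing thread alone would miss.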

Chapter 5

Conclusion

5.1 Contribution Summary

Motivated by the importance of improving software development productivity in large projects, this dissertation makes two main contributions: 1) a new build system for C++ and 2) a new tool that automates the insertion of logging code.

For build systems, I first discuss the problem of long build times in software development. Long build times have been recognized as a non-trivial problem because builds occur ever more frequently. I focus on unity builds, a new build method for large C++ projects in which multiple source files are bundled into one unity file to reduce redundant compilation work. To clarify the poorly understood characteristics of unity builds, this dissertation provides a case study on WebKit's unity builds. Based on the results, I derive a bundling strategy that determines which source files should be bundled into a unity file. I then describe the design and implementation of Cauldron, a build system that performs unity builds with sophisticated bundling strategies. Cauldron outperforms existing build systems in continuous builds, full builds, and incremental builds. These results serve as a basis for discussion on future build systems, compilers, and language designs that cooperate to generate efficient compiler tasks and reduce build times.
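For concreteness, a unity file is nothing more than a generated source that textually includes its bundled members; the file names below are hypothetical, not taken from WebKit or Cauldron.

    // unity_bundle_0.cpp -- generated by the build system (names hypothetical).
    // Every header pulled in by these sources is preprocessed and parsed once
    // for the whole bundle instead of once per source file.
    #include "Parser.cpp"
    #include "Lexer.cpp"
    #include "SymbolTable.cpp"

The compiler is invoked on the bundle as a single translation unit, which is where the redundancy savings, and the loss of per-file parallelism, both come from.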

For logging code insertion, this dissertation first introduces the cost of debugging in software development. In particular, while the importance of logs is widely recognized, deciding where to log and what to log during development is hard and time-consuming. This dissertation focuses on inter-thread data dependencies, which complicate error propagation and make diagnosis difficult. Logging them is critical for failure diagnosis in multi-threaded systems, yet automating the logging of inter-thread data dependencies had not been studied. This dissertation demonstrates that K9, a best-effort approach, automatically inserts logging code to trace inter-thread data dependencies by leveraging coding conventions ubiquitous in real-world software systems. K9 thereby reduces developers' burden in the logging practice of large software systems.

5.2 Future Directions

Cauldron has uncovered in detail the impact of unity builds on the build times of C++ projects. Unity builds reduce redundant compilation work by bundling multiple source files into one compiler task. Although this approach succeeds in reducing redundancy, it also reduces the parallelism of builds. Examining the trade-off between redundancy and parallelism in unity builds is therefore one direction for future work. Moreover, the build environment (e.g., machine specifications, compilers, and linkers) varies from developer to developer; identifying the proper bundle size for an individual build environment is another open problem for unity builds.

C++ modules are coming in C++20. The module system for C++ is designed to address the serious degradation of compile-time scalability caused by header file inclusion. With C++ modules, a binary representation of the corresponding header is imported instead of its text being copied and re-parsed. Since a module is compiled only once, importing modules removes the redundant compilation work caused by header code. C++ modules and unity builds thus share certain motivations and goals: both approaches try to reduce redundant compilation caused by header code. However, unity builds could remain effective even after C++ modules become standard. For example, when multiple source files import the same module, the module is read and deserialized redundantly; bundling such source files into one file, as unity builds do, improves compile throughput for the same reason that unity builds reduce redundant header inclusion and parsing. Exploring the design space of combining unity builds with C++ modules is therefore future work for improving build performance.
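As a minimal sketch of the module mechanism (the file names and the greet function are hypothetical, and a C++20 compiler with modules support is assumed): the interface unit below is compiled once into a binary form, and every importer deserializes that form instead of re-parsing header text.

    // hello.cppm -- module interface unit, compiled once
    module;                    // global module fragment for legacy headers
    #include <cstdio>
    export module hello;
    export void greet() { std::printf("hello from a module\n"); }

    // main.cpp -- each importer reads the compiled module, not header text
    import hello;
    int main() { greet(); }

When many translation units all contain "import hello;", bundling them into one unity file would collapse the repeated module reads and deserializations into a single one, which is the synergy discussed above.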

While this dissertation focused K9 on logging inter-thread dependencies over typical data structures, there are other kinds of inter-thread data dependencies that are not caused by collections and items. The K9 approach, in which coding conventions are leveraged to extract inter-thread data dependencies, points toward future debuggers and profilers that take inter-thread dependencies into account.

Another direction for automating the logging practice is a hybrid approach that combines logging with hardware tracing. Hardware tracing captures a program's control flow with low overhead, but it records all program behavior and preserves no data-flow information. Logs, in contrast, record specific program behavior (e.g., error executions) and capture data-flow information together with messages rich enough to reason about that behavior. Using log messages and trace records together could enable future failure diagnosis at reasonable overhead.

Bibliography

[1] Buildbot. http://buildbot.net/.

[2] filebench(1) - linux man page. https://linux.die.net/man/1/filebench.

[3] Jenkins. https://www.jenkins.io/.

[4] tpcc-mysql. https://github.com/Percona-Lab/tpcc-mysql.

[5] Travis ci. https://travis-ci.org/.

[6] C. Adam. Unity (Jumbo) build (!3611) · Merge Requests · CMake / CMake. https://gitlab.kitware.com/cmake/cmake/merge_requests/3611, July 2019.

[7] M. K. Aguilera, J. C. Mogul, J. L. Wiener, P. Reynolds, and A. Muthi- tacharoen. Performance debugging for distributed systems of black boxes. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, SOSP ’03, pages 74–89, New York, NY, USA, 2003. Association for Computing Machinery.

[8] G. Altekar and I. Stoica. Odr: Output-deterministic replay for multicore debugging. In Proceedings of the ACM SIGOPS 22nd Symposium on Operat- ing Systems Principles, SOSP ’09, pages 193–206, New York, NY, USA, 2009. Association for Computing Machinery.

[9] G. Ammons and J. R. Larus. Improving data-flow analysis with path profiles. In Proceedings of the ACM SIGPLAN 1998 Conference on Programming Language Design and Implementation, PLDI ’98, pages 72–84, New York, NY, USA, 1998. Association for Computing Machinery.

[10] Apple Inc. Technical note tn2123 crashreporter, February 2006.

[11] Apple Inc. llbuild. https://github.com/apple/swift-llbuild, 2020.

[12] Apple Inc. Webkit. https://webkit.org/, 2020.

[13] O. Arkhipova. Support for unity (jumbo) files in visual studio 2017 15.8 (experimental). https://devblogs.microsoft.com/cppblog/support-for-unity-jumbo-files-in-visual-studio-2017-15-8-experimental/, July 2018.

[14] A. H. Ashouri, W. Killian, J. Cavazos, G. Palermo, and C. Silvano. A survey on compiler autotuning using machine learning. ACM Comput. Surv., 51(5), Sept. 2018.

[15] Atlassian. Bitbucket. https://bitbucket.org/product/, 2020.

[16] C. AtLee, L. Blakk, J. O’Duinn, and A. Z. Gasparian. Firefox Release Engi- neering. The Architecture of Open Source Applications. Lulu.com, 2012.

[17] J. Atwood. The problem with logging. https://blog.codinghorror.com/the-problem-with-logging/, December 2008.

[18] A. Ayers, R. Schooler, C. Metcalf, A. Agarwal, J. Rhee, and E. Witchel. TraceBack. In Proceedings of the 2005 ACM SIGPLAN conference on Pro- gramming language design and implementation - PLDI '05. ACM Press, 2005.

[19] P. Barham, A. Donnelly, R. Isaacs, and R. Mortier. Using magpie for request extraction and workload modelling. In Proceedings of the 6th Conference on Symposium on Operating Systems Design and Implementation - Volume 6, OSDI’04, page 18, USA, 2004. USENIX Association.


[20] I. Beschastnikh, Y. Brun, S. Schneider, M. Sloan, and M. D. Ernst. Leverag- ing existing instrumentation to automatically infer invariant-constrained models. In Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering, ESEC/FSE ’11, pages 267–277, New York, NY, USA, 2011. Association for Comput- ing Machinery.

[21] S. Bhatia, A. Kumar, M. E. Fiuczynski, and L. Peterson. Lightweight, high- resolution monitoring for troubleshooting production systems. In Proceed- ings of the 8th USENIX Conference on Operating Systems Design and Imple- mentation, OSDI’08, pages 103–116, USA, 2008. USENIX Association.

[22] L. Bo. Btrfs: fix crash of btrfs_release_extent_buffer_page, September 2014. https://github.com/torvalds/linux/commit/814650.

[23] E. Bodden, P. Lam, and L. Hendren. Finding programming errors earlier by evaluating runtime monitors ahead-of-time. In Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of software engi- neering - SIGSOFT '08/FSE-16. ACM Press, 2008.

[24] P. Bodik, M. Goldszmidt, A. Fox, D. B. Woodard, and H. Andersen. Finger- printing the datacenter. In Proceedings of the 5th European conference on Computer systems - EuroSys '10. ACM Press, 2010.

[25] R. Bodík and S. Anik. Path-sensitive value-flow analysis. In Proceedings of the 25th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL ’98, pages 237–251, New York, NY, USA, 1998. Association for Computing Machinery.

[26] P. Braione, G. Denaro, A. Mattavelli, and M. Pezzè. Sushi: A test generator for programs with complex structured inputs. In Proceedings of the 40th International Conference on Software Engineering: Companion Proceeedings, ICSE ’18, pages 21–24, New York, NY, USA, 2018. Association for Comput- ing Machinery.

[27] T. Britton, L. Jeng, G. Carver, and P. Chea. Reversible debugging software: “Quantify the time and cost saved using reversible debuggers”. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.370.9611&rep=rep1&type=pdf, 2013.

[28] Z. Brodsky. Git much? The top 10 companies contributing to open source. https://resources.whitesourcesoftware.com/blog-whitesource/git-much-the-top-10-companies-contributing-to-open-source, February 2018.

[29] B. Busjaeger and T. Xie. Learning for test prioritization: An industrial case study. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2016, pages 975– 980, New York, NY, USA, 2016. Association for Computing Machinery.

[30] C. Cadar, D. Dunbar, and D. Engler. Klee: Unassisted and automatic gener- ation of high-coverage tests for complex systems programs. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implemen- tation, OSDI’08, pages 209–224, USA, 2008. USENIX Association.

[31] J. Campos, Y. Ge, N. Albunian, G. Fraser, M. Eler, and A. Arcuri. An empir- ical evaluation of evolutionary algorithms for unit test suite generation. Information and Software Technology, 104:207 – 235, 2018.

[32] C. Catal and D. Mishra. Test case prioritization: a systematic mapping study. Software Quality Journal, 21(3):445–478, jul 2012.

[33] M. Catanzaro. On compiling webkit (now twice as fast!). https://blogs.gnome.org/mcatanzaro/2018/02/17/on-compiling-webkit-now-twice-as-fast/, February 2018.

[34] T. M. Chilimbi, B. Liblit, K. Mehra, A. V. Nori, and K. Vaswani. Holmes: Effective statistical debugging via efficient path profiling. In 2009 IEEE 31st International Conference on Software Engineering, pages 34–44, May 2009.

[35] M. Chow, D. Meisner, J. Flinn, D. Peek, and T. F. Wenisch. The mystery machine: End-to-end performance analysis of large-scale internet services. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, OSDI’14, pages 217–231, USA, 2014. USENIX Association.

[36] I. Circle Internet Services. Continuous integration and delivery - circleci. https://circleci.com/, 2020.

[37] I. Cohen, S. Zhang, M. Goldszmidt, J. Symons, T. Kelly, and A. Fox. Cap- turing, indexing, clustering, and retrieving system history. In Proceedings of the Twentieth ACM Symposium on Operating Systems Principles, SOSP ’05, pages 105–118, New York, NY, USA, 2005. Association for Computing Machinery.

[38] L. Community. Clang - c++20, c++17, c++14, c++11 and c++98 status. https://clang.llvm.org/cxx_status.html, August 2020.

[39] O. Crameri, R. Bianchini, and W. Zwaenepoel. Striking a new balance between program instrumentation and debugging time. In Proceedings of the sixth conference on Computer systems - EuroSys '11. ACM Press, 2011.

[40] W. Cui, X. Ge, B. Kasikci, B. Niu, U. Sharma, R. Wang, and I. Yun. Rept: Reverse debugging of failures in deployed software. In Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation, OSDI’18, pages 17–32, USA, 2018. USENIX Association.

[41] M. development team. Unity builds in meson. http://mesonbuild.com/Unity-builds.html, 2020.

[42] C. Dietrich, V. Rothberg, L. Füracker, A. Ziegler, and D. Lohmann. chash: Detection of redundant compilations via AST hashing. In 2017 USENIX Annual Technical Conference (USENIX ATC 17), pages 527–538, Santa Clara, CA, July 2017. USENIX Association.

[43] G. W. Dunlap, S. T. King, S. Cinar, M. A. Basrai, and P. M. Chen. Revirt: Enabling intrusion analysis through virtual-machine logging and replay. In Proceedings of the 5th Symposium on Operating Systems Design and Im- plementation (Copyright Restrictions Prevent ACM from Being Able to Make the PDFs for This Conference Available for Downloading), OSDI ’02, pages 211–224, USA, 2002. USENIX Association.


[44] G. W. Dunlap, D. G. Lucchetti, M. A. Fetterman, and P. M. Chen. Execu- tion replay of multiprocessor virtual machines. In Proceedings of the Fourth ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Envi- ronments, VEE ’08, pages 121–130, New York, NY, USA, 2008. Association for Computing Machinery.

[45] A. Duvall, P. Matyas, and S. Glover. Continuous Integration: Improving Software Quality and Reducing Risk. Addison-Wesley Professional, 2007.

[46] S. Elbaum, G. Rothermel, and J. Penix. Techniques for improving regres- sion testing in continuous integration development environments. In Pro- ceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2014, pages 235–245, New York, NY, USA, 2014. Association for Computing Machinery.

[47] S. Erdweg, M. Lichter, and M. Weiel. A sound and optimal incremental build system with dynamic dependencies. In Proceedings of the 2015 ACM SIGPLAN International Conference on Object-Oriented Programming, Sys- tems, Languages, and Applications, OOPSLA 2015, pages 89–106, New York, NY, USA, 2015. Association for Computing Machinery.

[48] Facebook. Buck: A fast build tool. https://buck.build/, 2020.

[49] S. I. Feldman. Make — a program for maintaining computer programs. Software: Practice and Experience, 9(4):255–265, April 1979.

[50] K. Fisher, D. Walker, K. Q. Zhu, and P. White. From dirt to shovels: Fully automatic tool generation from ad hoc data. In Proceedings of the 35th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL ’08, pages 421–434, New York, NY, USA, 2008. Associa- tion for Computing Machinery.

[51] R. Fonseca, G. Porter, R. H. Katz, S. Shenker, and I. Stoica. X-trace: A perva- sive network tracing framework. In Proceedings of the 4th USENIX Confer- ence on Networked Systems Design and Implementation, NSDI’07, page 20, USA, 2007. USENIX Association.


[52] S. Foundation. Scons: A software construction tool. https://scons. org/, 2020.

[53] T. A. S. Foundation. ab - apache http server benchmarking tool. http://httpd.apache.org/docs/2.4/ja/programs/ab.html, 2020.

[54] T. A. S. Foundation. Apache ant - welcome. https://ant.apache. org/, 2020.

[55] M. Fowler. Continuous integration (original version). https://martinfowler.com/articles/originalContinuousIntegration.html, September 2000.

[56] G. Fraser and A. Arcuri. Whole test suite generation. IEEE Transactions on Software Engineering, 39(2):276–291, 2013.

[57] Q. Fu, J. Zhu, W. Hu, J.-G. Lou, R. Ding, Q. Lin, D. Zhang, and T. Xie. Where do developers log? an empirical study on logging practices in industry. In Companion Proceedings of the 36th International Conference on Software En- gineering, ICSE Companion 2014, pages 24–33, New York, NY, USA, 2014. Association for Computing Machinery.

[58] F. Fulin. Fastbuild - function reference - unity. https://www.fastbuild.org/docs/functions/unity.html, 2020.

[59] G. Garen. [webkit-dev] growing tired of long build times? check out this awesome new way to speed up your build... soon (hint: It’s not buying a new computer). https://lists.webkit.org/pipermail/webkit-dev/2017-August/029508.html, August 2017.

[60] X. Ge, B. Niu, and W. Cui. Reverse debugging of kernel failures in deployed systems. In Proceedings of the 2020 USENIX Annual Technical Conference (ATC), July 2020.

[61] D. Geels, G. Altekar, P. Maniatis, T. Roscoe, and I. Stoica. Friday: Global comprehension for distributed replay. In Proceedings of the 4th USENIX Conference on Networked Systems Design and Implementation, NSDI’07, page 21, USA, 2007. USENIX Association.


[62] T. A. Ghaleb, D. A. da Costa, and Y. Zou. An empirical study of the long duration of continuous integration builds. Empir Software Eng, 24:2102– 2139, 2019.

[63] GitHub Inc. The state of the octoverse. https://octoverse.github.com/, 2019.

[64] GitHub Inc. Github. https://github.com/, 2020.

[65] GitLab Inc. Gitlab. https://about.gitlab.com/, 2020.

[66] K. Glerum, K. Kinshumann, S. Greenberg, G. Aul, V. Orgovan, G. Nichols, D. Grant, G. Loihle, and G. Hunt. Debugging in the (very) large: Ten years of implementation and experience. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, SOSP ’09, pages 103–116, New York, NY, USA, 2009. Association for Computing Machinery.

[67] Google. Bazel - a fast, scalable, multi-language and extensible build system. https://www.bazel.build/, August 2020.

[68] G. Grano, C. Laaber, A. Panichella, and S. Panichella. Testing with fewer resources: An adaptive approach to performance-aware test case genera- tion. IEEE Transactions on Software Engineering, pages 1–1, 2019.

[69] Z. Guo, X. Wang, J. Tang, X. Liu, Z. Xu, M. Wu, M. F. Kaashoek, and Z. Zhang. R2: An application-level kernel for record and replay. In Pro- ceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, OSDI’08, pages 193–208, USA, 2008. USENIX Association.

[70] P. Gupta, M. Ivey, and J. Penix. Testing at the speed and scale of google. http://google-engtools.blogspot.com/2011/06/testing-at-speed-and-scale-of-google.html, June 2011.

[71] D. Hao, L. Zhang, and H. Mei. Test-case prioritization: achievements and challenges. Frontiers of Computer Science, 10(5):769–777, jun 2016.


[72] B. Hardekopf and C. Lin. Semi-sparse flow-sensitive pointer analysis. In Proceedings of the 36th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL ’09, pages 226–238, New York, NY, USA, 2009. Association for Computing Machinery.

[73] P. He, Z. Chen, S. He, and M. R. Lyu. Characterizing the natural lan- guage descriptions in software logging statements. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineer- ing, ASE 2018, pages 178–189, New York, NY, USA, 2018. Association for Computing Machinery.

[74] M. Hilton, N. Nelson, T. Tunnell, D. Marinov, and D. Dig. Trade-offs in continuous integration: Assurance, security, and flexibility. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2017, pages 197–207, New York, NY, USA, 2017. Association for Computing Machinery.

[75] M. Hilton, T. Tunnell, K. Huang, D. Marinov, and D. Dig. Usage, costs, and benefits of continuous integration in open-source projects. In Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering, ASE 2016, pages 426–437, New York, NY, USA, 2016. Association for Computing Machinery.

[76] F. Hoffa. Who contributed the most to open source in 2017 and 2018? Let’s analyze github’s data and find out. https://www.freecodecamp.org/news/the-top-contributors-to-github-2017-be98ab854e87/, October 2017.

[77] S. Hore, L. Lewis, P. Palukuri, and A. Dindzans. Ontap autosupport and autosupport on demand. https://www.netapp.com/us/media/tr-4444.pdf, October 2017.

[78] J. Huang, M. Allen-Bond, and X. Zhang. Pallas: Semantic-aware checking for finding deep bugs in fast path. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’17, pages 709–722, New York, NY, USA, 2017. Association for Computing Machinery.


[79] S. Jayatheertha, L. S. Balu, and A. C. H. Streamlined troubleshooting with the dell system e-support tool. https://www.dell.com/downloads/global/power/ps4q08-20080257-Jayatheertha.pdf, November 2008.

[80] B. Jiang, Z. Zhang, W. K. Chan, and T. H. Tse. Adaptive random test case prioritization. In 2009 IEEE/ACM International Conference on Automated Software Engineering, pages 233–244, 2009.

[81] H. Jula, D. Tralamazza, C. Zamfir, and G. Candea. Deadlock immunity: Enabling systems to defend against deadlocks. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, OSDI’08, pages 295–308, USA, 2008. USENIX Association.

[82] A. Kadav, M. J. Renzelmann, and M. M. Swift. Tolerating hardware device failures in software. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, SOSP ’09, pages 59–72, New York, NY, USA, 2009. Association for Computing Machinery.

[83] M. P. Kasick, J. Tan, R. Gandhi, and P. Narasimhan. Black-box problem diagnosis in parallel le systems. In Proceedings of the 8th USENIX Confer- ence on File and Storage Technologies, FAST’10, page 4, USA, 2010. USENIX Association.

[84] B. Kasikci, B. Schubert, C. Pereira, G. Pokam, and G. Candea. Fail- ure sketching: A technique for automated root cause diagnosis of in- production failures. In Proceedings of the 25th Symposium on Operating Systems Principles, SOSP ’15, pages 344–360, New York, NY, USA, 2015. Association for Computing Machinery.

[85] B. Kasikci, C. Zamfir, and G. Candea. Racemob: Crowdsourced data race detection. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, SOSP ’13, pages 406–422, New York, NY, USA, 2013. Association for Computing Machinery.

[86] Y. Keren. Zapcc. https://github.com/yrnkrn/zapcc, 2018.


[87] C. H. P. Kim, D. Marinov, S. Khurshid, D. Batory, S. Souto, P. Barros, and M. D’Amorim. Splat: Lightweight dynamic analysis for reducing combinatorics in testing configurable systems. In Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2013, pages 257–267, New York, NY, USA, 2013. Association for Computing Machinery.

[88] S. T. King, G. W. Dunlap, and P. M. Chen. Debugging operating systems with time-traveling virtual machines. In Proceedings of the Annual Con- ference on USENIX Annual Technical Conference, ATEC ’05, page 1, USA, 2005. USENIX Association.

[89] Kitware Inc. Cmake. https://cmake.org/, 2020.

[90] S. Kratky. sakra/cotire: Cmake module to speed up builds. https://github.com/sakra/cotire, 2019.

[91] P. Krill. Software engineers spend lots of time not building software. https://www.infoworld.com/article/2613762/software-engineers-spend-lots-of-time-not-building-software.html, April 2015.

[92] T. Kubota. Btrfs: add another missing end_page_writeback on submit_extent_page failure. https://github.com/torvalds/linux/commit/fe01aa65, February 2017.

[93] T. Kubota. [v2] btrfs: add another missing end_page_writeback on submit_extent_page failure. https://patchwork.kernel.org/patch/9564177/, February 2017.

[94] T. Kubota, Y. Suzuki, and K. Kono. To unify or not to unify: A case study on unified builds (in webkit). In Proceedings of the 28th International Conference on Compiler Construction, CC 2019, pages 42–52, New York, NY, USA, 2019. Association for Computing Machinery.

[95] A. Kumar. Development at the speed and scale of google. http://goo.gl/hCPQxZ, December 2010.


[96] W. Landi. Undecidability of static analysis. ACM Letters on Programming Languages and Systems, 1(4):323–337, dec 1992.

[97] J.-C. Laprie. Dependable computing: Concepts, limits, challenges. In Proceedings of the Twenty-Fifth International Conference on Fault-Tolerant Computing, FTCS’95, pages 42–54, USA, 1995. IEEE Computer Society.

[98] J. Lawall and G. Muller. Coccinelle: 10 years of automated evolution in the linux kernel. In Proceedings of the 2018 USENIX Conference on Usenix Annual Technical Conference, USENIX ATC ’18, pages 601–613, USA, 2018. USENIX Association.

[99] D. Lee, B. Wester, K. Veeraraghavan, S. Narayanasamy, P. M. Chen, and J. Flinn. Respec: Efficient online multiprocessor replay via speculation and external determinism. In Proceedings of the Fifteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XV, pages 77–90, New York, NY, USA, 2010. Association for Computing Machinery.

[100] C. Legnitto. The facebook mobile release process. https://www.infoq.com/presentations/Facebook-Release-Process/, February 2014.

[101] H. Li, W. Shang, Y. Zou, and A. E. Hassan. Towards just-in-time sugges- tions for log changes. Empirical Software Engineering, 22(4):1831–1865, oct 2016.

[102] S. Li, X. Niu, Z. Jia, J. Wang, H. He, and T. Wang. Logtracker: Learn- ing log revision behaviors proactively from software evolution history. In 2018 IEEE/ACM 26th International Conference on Program Comprehension (ICPC), pages 178–17810, 2018.

[103] B. Liblit, A. Aiken, A. X. Zheng, and M. I. Jordan. Bug isolation via remote program sampling. In Proceedings of the ACM SIGPLAN 2003 Conference on Programming Language Design and Implementation, PLDI ’03, pages 141– 154, New York, NY, USA, 2003. Association for Computing Machinery.


[104] R. Manevich, M. Sridharan, S. Adams, M. Das, and Z. Yang. Pse: Explaining program failures via postmortem static analysis. In Proceedings of the 12th ACM SIGSOFT Twelfth International Symposium on Foundations of Software Engineering, SIGSOFT ’04/FSE-12, pages 63–72, New York, NY, USA, 2004. Association for Computing Machinery.

[105] J. Mao, Y. Chen, Q. Xiao, and Y. Shi. Rid: Finding reference count bugs with inconsistent path pair checking. In Proceedings of the Twenty-First Interna- tional Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’16, pages 531–544, New York, NY, USA, 2016. Association for Computing Machinery.

[106] L. Mariani and F. Pastore. Automated identication of failure causes in system logs. In 2008 19th International Symposium on Software Reliability Engineering (ISSRE), pages 117–126, 2008.

[107] L. Mariani, F. Pastore, and M. Pezze. A toolset for automated failure anal- ysis. In Proceedings of the 31st International Conference on Software Engi- neering, ICSE ’09, pages 563–566, USA, 2009. IEEE Computer Society.

[108] D. Marijan, A. Gotlieb, and S. Sen. Test case prioritization for continuous regression testing: An industrial case study. In Proceedings of the 2013 IEEE International Conference on Software Maintenance, ICSM ’13, pages 540–543, USA, 2013. IEEE Computer Society.

[109] D. Marsh. From code to monkeys: Continuous delivery at netflix. https://www.infoq.com/presentations/netflix-continuous-delivery/, November 2013.

[110] E. Martin. Ninja, a small build system with a focus on speed. https://ninja-build.org/, 2012.

[111] A. J. Mashtizadeh, T. Garfinkel, D. Terei, D. Mazieres, and M. Rosenblum. Towards practical default-on multi-core record/replay. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’17, pages 693–708, New York, NY, USA, 2017. Association for Computing Machinery.


[112] S. McIntosh, B. Adams, M. Nagappan, and A. E. Hassan. Identifying and understanding header file hotspots in C/C++ build processes. Automated Software Engineering, 2005.

[113] P. McMinn. Search-based software test data generation: a survey: Re- search articles. Softw. Test., Verif. Reliab., 14:105–156, 06 2004.

[114] J. Micco. Tools for continuous integration at google scale. https://www.youtube.com/watch?v=KH2_sB1A6lA, June 2012.

[115] K. Miller. [webkit-dev] unified source builds: A new rule for static variables. https://lists.webkit.org/pipermail/webkit-dev/2017-August/029465.html, August 2017.

[116] B. Miranda, E. Cruciani, R. Verdecchia, and A. Bertolino. FAST approaches to scalable similarity-based test case prioritization. In Proceedings of the 40th International Conference on Software Engineering. ACM, may 2018.

[117] N. Mitchell. Shake build system. https://shakebuild.com/, 2020.

[118] A. Mokhov, N. Mitchell, and S. Peyton Jones. Build systems à la carte. Proc. ACM Program. Lang., 2(ICFP), July 2018.

[119] K.-K. Muniswamy-Reddy, P. Macko, and M. Seltzer. Provenance for the cloud. In Proceedings of the 8th USENIX Conference on File and Storage Technologies, FAST’10, pages 15–14, USA, 2010. USENIX Association.

[120] K. Nagaraj, C. Killian, and J. Neville. Structured comparative analysis of systems logs to diagnose performance problems. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, NSDI’12, page 26, USA, 2012. USENIX Association.

[121] T. Nagy. Waf: the meta build system. https://waf.io/, 2019.

[122] NetApp Inc. Savecore. ONTAP 7.3 Manual Page Reference, Volume 1, Pages 471-472, June 2014.


[123] M. Olszewski, J. Ansel, and S. Amarasinghe. Kendo: Efficient deterministic multithreading in software. In Proceedings of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XIV, pages 97–108, New York, NY, USA, 2009. Association for Computing Machinery.

[124] A. Panichella, F. M. Kifetew, and P. Tonella. Automated test case gener- ation as a many-objective optimisation problem with dynamic selection of the targets. IEEE Transactions on Software Engineering, 44(2):122–158, 2018.

[125] S. Park, Y. Zhou, W. Xiong, Z. Yin, R. Kaushik, K. H. Lee, and S. Lu. PRES. In Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles - SOSP '09. ACM Press, 2009.

[126] C. Prasad and W. Schulte. Taking control of your engineering tools. Com- puter, 46(11):63–66, 2013.

[127] T. M. D. Project. The mesa 3d graphics library. https://www.mesa3d.org/, 2020.

[128] G. Ramalingam. The undecidability of aliasing. ACM Transactions on Pro- gramming Languages and Systems, 16(5):1467–1471, sep 1994.

[129] C. Robertson, T. Lyngmo, and N. Schonning. Precompiled header files. https://docs.microsoft.com/en-US/cpp/build/creating-precompiled-header-files?view=vs-2019, October 2019.

[130] J. M. Rojas, M. Vivanti, A. Arcuri, and G. Fraser. A detailed investigation of the effectiveness of whole test suite generation. Empirical Softw. Engg., 22(2):852–893, Apr. 2017.

[131] J. Rosdahl. ccache — compiler cache. https://ccache.dev/, 2020.

[132] N. Sidwell. cxx-modules - gcc wiki. https://gcc.gnu.org/wiki/cxx-modules, July 2020.


[133] B. H. Sigelman, L. A. Barroso, M. Burrows, P. Stephenson, M. Plakal, D. Beaver, S. Jaspan, and C. Shanbhag. Dapper, a large-scale distributed systems tracing infrastructure. Technical report, Google Inc., 2010.

[134] R. Smith. Merging modules. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1103r3.pdf, February 2019.

[135] R. Smith. Working draft, standard for programming language c++. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/n4849.pdf, January 2020.

[136] N. Stew. Automatically finding slow headers in c++ projects. https://nibblestew.blogspot.com/2018/02/automatically-finding-slow-headers-in-c.html, February 2018.

[137] Y. Sui, P. Di, and J. Xue. Sparse flow-sensitive pointer analysis for multi-threaded programs. In Proceedings of the 2016 International Symposium on Code Generation and Optimization - CGO 2016. ACM Press, 2016.

[138] J. Tan, S. Kavulya, R. Gandhi, and P. Narasimhan. Visual, log-based causal tracing for performance debugging of mapreduce systems. In 2010 IEEE 30th International Conference on Distributed Computing Systems, pages 795–806, 2010.

[139] J. Tan, X. Pan, S. Kavulya, R. Gandhi, and P. Narasimhan. Salsa: Analyzing logs as state machines. In Proceedings of the First USENIX Conference on Analysis of System Logs, WASL’08, page 6, USA, 2008. USENIX Association.

[140] J. Tan, X. Pan, S. Kavulya, R. Gandhi, and P. Narasimhan. Mochi: Visual log-analysis based tools for debugging hadoop. In Proceedings of the 2009 Conference on Hot Topics in Cloud Computing, HotCloud’09, USA, 2009. USENIX Association.

[141] P.-N. Tan, M. Steinbach, A. Karpatne, and V. Kumar. Introduction to Data Mining. Pearson, 2nd edition, 2019.


[142] The Chromium Team. Breakpad. https://chromium.googlesource.com/breakpad/breakpad.

[143] The Chromium Team. Chrome infrastructure - welcome to luci! https://chromium.googlesource.com/infra/infra/+/master/doc/users/services/about_luci.md, 2020.

[144] The Chromium Team. Chromium. https://www.chromium.org/ Home, 2020.

[145] The Clang Team. Precompiled header and modules internals. https://clang.llvm.org/docs/PCHInternals.html, 2020.

[146] The GCC Team. Precompiled headers (using the gnu compiler collection (gcc)). https://gcc.gnu.org/onlinedocs/gcc/Precompiled-Headers.html, 2020.

[147] The LLVM Team. Buildbot for the llvm project. http://lab.llvm.org:8011/, September 2020.

[148] The Meson Team. The meson build system. https://mesonbuild.com/, 2020.

[149] The Squid Team. Squid web proxy cache. http://www.squid-cache.org/.

[150] The WebKit Team. Buildbot: Webkit. https://build.webkit.org, September 2020.

[151] J. Thomas and A. Kumar. Google engineering tools. http://google-engtools.blogspot.com/2011/05/welcome-to-google-engineering-tools.html, May 2011.

[152] J. Tucek, S. Lu, C. Huang, S. Xanthos, and Y. Zhou. Triage: Diagnosing production run failures at the user’s site. In Proceedings of Twenty-First ACM SIGOPS Symposium on Operating Systems Principles, SOSP ’07, pages 131–144, New York, NY, USA, 2007. Association for Computing Machinery.


[153] T. Tuttle. workqueue: lock cwq access in drain_workqueue. https://github.com/torvalds/linux/commit/fa2563e, September 2011.

[154] K. Veeraraghavan, D. Lee, B. Wester, J. Ouyang, P. M. Chen, J. Flinn, and S. Narayanasamy. Doubleplay: Parallelizing sequential logging and re- play. In Proceedings of the Sixteenth International Conference on Architec- tural Support for Programming Languages and Operating Systems, ASPLOS XVI, pages 15–26, New York, NY, USA, 2011. Association for Computing Machinery.

[155] E. Vlachos, M. L. Goodstein, M. A. Kozuch, S. Chen, B. Falsafi, P. B. Gibbons, and T. C. Mowry. Paralog: Enabling and accelerating online parallel monitoring of multithreaded applications. In Proceedings of the Fifteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XV, pages 271–284, New York, NY, USA, 2010. Association for Computing Machinery.

[156] X. Wang, H. Chen, Z. Jia, N. Zeldovich, and M. F. Kaashoek. Improving in- teger security for systems with kint. In Proceedings of the 10th USENIX Con- ference on Operating Systems Design and Implementation, OSDI’12, pages 163–177, USA, 2012. USENIX Association.

[157] X. Wang, N. Zeldovich, M. F. Kaashoek, and A. Solar-Lezama. Towards optimization-safe systems: Analyzing the impact of undefined behavior. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, SOSP ’13, pages 260–275, New York, NY, USA, 2013. Association for Computing Machinery.

[158] D. Weeratunge, X. Zhang, and S. Jagannathan. Analyzing multicore dumps to facilitate concurrency bug reproduction. ACM SIGARCH Computer Ar- chitecture News, 38(1):155–166, mar 2010.

[159] W. Xu, L. Huang, A. Fox, D. Patterson, and M. I. Jordan. Detecting large- scale system problems by mining console logs. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, SOSP ’09, pages 117–132, New York, NY, USA, 2009. Association for Computing Machinery.


[160] S. Yang, T. Harter, N. Agrawal, S. S. Kowsalya, A. Krishnamurthy, S. Al- Kiswany, R. T. Kaushik, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. Split-level i/o scheduling. In Proceedings of the 25th Symposium on Operat- ing Systems Principles, SOSP ’15, pages 474–489, New York, NY, USA, 2015. Association for Computing Machinery.

[161] X. Yu, P. Joshi, J. Xu, G. Jin, H. Zhang, and G. Jiang. Cloudseer: Workflow monitoring of cloud infrastructures via interleaved logs. In Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’16, pages 489–502, New York, NY, USA, 2016. Association for Computing Machinery.

[162] D. Yuan, H. Mai, W. Xiong, L. Tan, Y. Zhou, and S. Pasupathy. Sherlog: Error diagnosis by connecting clues from run-time logs. In Proceedings of the Fifteenth International Conference on Architectural Support for Program- ming Languages and Operating Systems, ASPLOS XV, pages 143–154, New York, NY, USA, 2010. Association for Computing Machinery.

[163] D. Yuan, S. Park, P. Huang, Y. Liu, M. M. Lee, X. Tang, Y. Zhou, and S. Sav- age. Be conservative: Enhancing failure diagnosis with proactive logging. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation, OSDI’12, pages 293–306, USA, 2012. USENIX Associ- ation.

[164] D. Yuan, S. Park, and Y. Zhou. Characterizing logging practices in open- source software. In Proceedings of the 34th International Conference on Soft- ware Engineering, ICSE ’12, pages 102–112. IEEE Press, 2012.

[165] D. Yuan, J. Zheng, S. Park, Y. Zhou, and S. Savage. Improving software diagnosability via log enhancement. ACM Trans. Comput. Syst., 30(1), 2012.

[166] T. Zhang, D. Lee, and C. Jung. Txrace: Efficient data race detection using commodity hardware transactional memory. In Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’16, pages 159–173, New York, NY, USA, 2016. Association for Computing Machinery.


[167] Q. Zhao, R. Rabbah, S. Amarasinghe, L. Rudolph, and W.-F. Wong. How to do a million watchpoints: Efficient debugging using dynamic instrumentation. In Proceedings of the Joint European Conferences on Theory and Practice of Software 17th International Conference on Compiler Construction, CC’08/ETAPS’08, pages 147–162, Berlin, Heidelberg, 2008. Springer-Verlag.

[168] X. Zhao, K. Rodrigues, Y. Luo, M. Stumm, D. Yuan, and Y. Zhou. Log20: Fully automated optimal placement of log printing statements under specified overhead threshold. In Proceedings of the 26th Symposium on Operating Systems Principles, SOSP ’17, pages 565–581, New York, NY, USA, 2017. Association for Computing Machinery.

[169] X. Zhao, Y. Zhang, D. Lion, M. F. Ullah, Y. Luo, D. Yuan, and M. Stumm. Lprof: A non-intrusive request flow profiler for distributed systems. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, OSDI’14, pages 629–644, USA, 2014. USENIX Association.

[170] J. Zhu, P. He, Q. Fu, H. Zhang, M. R. Lyu, and D. Zhang. Learning to log: Helping developers make informed logging decisions. In Proceedings of the 37th International Conference on Software Engineering - Volume 1, ICSE ’15, pages 415–425. IEEE Press, 2015.
