Zurich Open Repository and Archive University of Zurich Main Library Strickhofstrasse 39 CH-8057 Zurich www.zora.uzh.ch

Year: 2021

Deliberate microbenchmarking of software systems

Laaber, Christoph

Abstract: Software performance faults have severe consequences for users, developers, and companies. One way to unveil performance faults before they manifest in production is performance testing, which ought to be done on every new version of the software, ideally on every commit. However, performance testing faces multiple challenges that inhibit it from being applied early in the development process, on every new commit, and in an automated fashion. In this dissertation, we investigate three challenges of software microbenchmarks, a performance testing technique on unit granularity which is predominantly used for libraries and frameworks. The studied challenges affect the quality aspects (1) runtime, (2) result variability, and (3) performance change detection of microbenchmark executions. The objective is to understand the extent of these challenges in real-world software and to find solutions to address these. To investigate the challenges’ extent, we perform a series of experiments and analyses. We execute benchmarks in bare-metal as well as multiple cloud environments and conduct a large-scale mining study on benchmark configurations. The results show that all three challenges are common: (1) benchmark suite runtimes are often longer than 3 hours; (2) result variability can be extensive, in some cases up to 100%; and (3) benchmarks often only reliably detect large performance changes of 60% or more. To address the challenges, we devise targeted solutions as well as adapt well-known techniques from other domains for software microbenchmarks: (1) a solution that dynamically stops benchmark executions based on statistics to reduce runtime while maintaining low result variability; (2) a solution to identify unstable benchmarks that does not require execution, based on statically-computable source code features and machine learning algorithms; (3) traditional test case prioritization (TCP) techniques to execute benchmarks earlier that detect larger performance changes; and (4) specific execution strategies to detect small performance changes reliably even when executed in unreliable cloud environments. We experimentally evaluate the solutions and techniques on real-world benchmarks and find that they effectively deal with the three challenges. (1) Dynamic reconfiguration enables to drastically reduce runtime by between 48.4% and 86.0% without changing the results of 78.8% to 87.6% of the benchmarks, depending on the project and statistic used. (2) The instability prediction model allows to effectively identify unstable benchmarks when relying on random forest classifiers, having a prediction performance between 0.79 and 0.90 area under the receiver operating characteristic curve (AUC). (3) TCP applied to benchmarks is effective and efficient, with APFD-P values for the best technique ranging from 0.54 to 0.71 and a computational overhead of 11%. (4) Batch testing, i.e., executing the benchmarks of two versions on the same instances interleaved and repeated as well as repeated across instances, enables to reliably detect performance changes of 10% or less, even when using unreliable cloud infrastructure as execution environment. Overall, this dissertation shows that real-world software microbenchmarks are considerably affected by all three challenges (1) runtime, (2) result variability, and (3) performance change detection; however, deliberate planning and execution strategies effectively reduce their impact.

Posted at the Zurich Open Repository and Archive, University of Zurich.
ZORA URL: https://doi.org/10.5167/uzh-203225
Dissertation, Published Version

Originally published at: Laaber, Christoph. Deliberate microbenchmarking of software systems. 2021, University of Zurich, Faculty of Economics.


Department of Informatics

Deliberate Microbenchmarking of Software Systems

Dissertation submitted to the Faculty of Business, Economics and Informatics of the University of Zurich to obtain the degree of Doktor der Wissenschaften, Dr. sc. (corresponds to Doctor of Science, PhD)

presented by Christoph Laaber from Austria

approved in April 2021

at the request of Prof. Dr. Harald C. Gall, Dr. Philipp Leitner, and Prof. Dr. Benoit Baudry

The Faculty of Business, Economics and Informatics of the University of Zurich hereby authorizes the printing of this dissertation, without indicating an opinion of the views expressed in the work.

Zurich, April 7, 2021

The Chairman of the Doctoral Board: Prof. Dr. Thomas Fritz

Acknowledgments

Successfully finishing a PhD not only demands staying motivated and putting in the hard work, it equally requires a great support system that carries you through. I am extremely lucky to have great people around me who supported me tremendously. First and foremost, I am eternally grateful to my advisors Philipp Leitner and Harald C. Gall. Philipp, you showed me the ins and outs of research and academia. I would not have been able to achieve this without your continued great support. But not only did you support me on a professional level, also the friendly relationship we developed helped enormously to get through to the end. Harald, you provided a great environment at SEAL and allowed me the independence to grow into the researcher I am today. I will always keep in mind the numerous retreats and dinners as well as the great laughs we shared. I highly appreciate the valuable feedback Alberto Bacchelli gave me throughout my PhD. Although I was not one of your PhD students, you always took time for discussions. I want to express my gratitude to Benoit Baudry for examining my dissertation, providing insightful comments, and being part of my committee. I especially thank my colleagues in Zurich and Gothenburg. Without you, this would not have been as fun as it was; I will never forget our time together. In particular, the bond with Giovanni Grano and Carmine Vassallo, who started at the same time as me, and the Austrian connection with Jürgen Cito and Gerald Schermann are special to me. Thank you to the SEALs Carol Alexandru, Adelina Ciurumelea, Thomas Fritz, Jian Gu, Anna Jancso, Katja Kevic,

Alexander Lill, André Meyer, Sebastian Müller, Sebastiano Panichella, Sebastian Proksch, Pasquale Salza, and Manuela Züger; to the ZESTies David Ackermann, Larissa Braz, Gül Çalikli, Linda Di Geronimo, Enrico Fregnan, Fabio Palomba, Fernando Petrulio, and Pavlína Wurzel Gonçalves; and to the ICET members Hamdy Michael Ayas, Linda Erlenhov, Hazem (Peter) Samoaa, and Joel Scheuner. I would also like to thank my co-authors and supervised students, who played an integral part in my PhD. Additionally, I am thankful for my great friends outside of academia. You helped me take my mind off work. I want to explicitly thank Joni, who helped me settle in Zurich and designed my dissertation cover; and Simon, with whom, as a fellow PhD student, I could share my experiences and challenges. A big thank-you goes to my family, who always believed in me and my success. I am especially thankful for the support I have received throughout the years from my parents, Karin and Manfred. I would not be here without you. I also want to thank Christa, Daniel, Gerli, Hildegard, Otto, Sabrina, and Wolfgang. Patricia, I feel fortunate to have met you and am grateful for your unconditional support and encouragement. You have made this whole PhD journey a lot easier for me. Shout-out to the Bärlis and Chewie. Finally, sustaining the ups and downs of working towards a PhD also requires a good sense of humor, which is why I want to end with a quote from a contemporary philosopher...

“I wanna thank me; I wanna thank me for believing in me; I wanna thank me for doing all this hard work; I wanna thank me for having no days off; I wanna thank me for never quitting; I wanna thank me for always being a giver and trying to give more than I receive; I wanna thank me for trying to do more right than wrong; I wanna thank me for just being me at all times.”

— Snoop Dogg [2018]

You will always be in my heart, Mama, Oma, Opa, and Daniel.

Abstract

Software performance faults have severe consequences for users, developers, and companies. One way to unveil performance faults before they manifest in production is performance testing, which ought to be done on every new version of the software, ideally on every commit. However, performance testing faces multiple challenges that inhibit it from being applied early in the development process, on every new commit, and in an automated fashion.

In this dissertation, we investigate three challenges of software microbenchmarks, a performance testing technique on unit granularity which is predominantly used for libraries and frameworks. The studied challenges affect the quality aspects (1) runtime, (2) result variability, and (3) performance change detection of microbenchmark executions. The objective is to understand the extent of these challenges in real-world software and to find solutions to address these.

To investigate the challenges’ extent, we perform a series of experiments and analyses. We execute benchmarks in bare-metal as well as multiple cloud environments and conduct a large-scale mining study on benchmark configurations. The results show that all three challenges are common: (1) benchmark suite runtimes are often longer than 3 hours; (2) result variability can be extensive, in some cases up to 100%; and (3) benchmarks often only reliably detect large performance changes of 60% or more.

To address the challenges, we devise targeted solutions as well as adapt well-known techniques from other domains for software microbenchmarks: (1) a solution that dynamically stops benchmark executions based on statistics to reduce runtime while maintaining low result variability; (2) a solution to identify unstable benchmarks that does not require execution, based on statically-computable source code features and machine learning algorithms; (3) traditional test case prioritization (TCP) techniques to execute benchmarks earlier that detect larger performance changes; and (4) specific execution strategies to detect small performance changes reliably even when executed in unreliable cloud environments.

We experimentally evaluate the solutions and techniques on real-world benchmarks and find that they effectively deal with the three challenges. (1) Dynamic reconfiguration enables to drastically reduce runtime by between 48.4% and 86.0% without changing the results of 78.8% to 87.6% of the benchmarks, depending on the project and statistic used. (2) The instability prediction model allows to effectively identify unstable benchmarks when relying on random forest classifiers, having a prediction performance between 0.79 and 0.90 area under the receiver operating characteristic curve (AUC). (3) TCP applied to benchmarks is effective and efficient, with APFD-P values for the best technique ranging from 0.54 to 0.71 and a computational overhead of 11%. (4) Batch testing, i.e., executing the benchmarks of two versions on the same instances interleaved and repeated as well as repeated across instances, enables to reliably detect performance changes of 10% or less, even when using unreliable cloud infrastructure as execution environment.

Overall, this dissertation shows that real-world software microbenchmarks are considerably affected by all three challenges (1) runtime, (2) result variability, and (3) performance change detection; however, deliberate planning and execution strategies effectively reduce their impact.

Contents

1 Synopsis
  1.1 Software Microbenchmarks
  1.2 Challenges
  1.3 Thesis Statement and Research Questions
  1.4 Research Approach and Main Results
    1.4.1 RQ 1: Extent of Challenges in the Real World
    1.4.2 RQ 2: Reducing Runtime
    1.4.3 RQ 3: Handling Result Variability
    1.4.4 RQ 4: Improving Change Detection
  1.5 Scope, Potential, and Limitations
    1.5.1 Scope of Dissertation
    1.5.2 Potential for Industrial Adoption
    1.5.3 Limitations
  1.6 Scientific Implications
  1.7 Opportunities for Future Research
  1.8 Related Work
  1.9 Summary and Contributions
  1.10 Thesis Roadmap

2 An Evaluation of Open-Source Software Microbenchmark Suites for Continuous Performance Assessment
  2.1 Introduction
  2.2 Background
  2.3 Study Subjects
  2.4 RQ 1: Extent of Benchmark Suites
    2.4.1 Context
    2.4.2 Approach
    2.4.3 Results and Analysis
  2.5 RQ 2: Stability of Benchmarks
    2.5.1 Context
    2.5.2 Approach
    2.5.3 Results and Analysis
  2.6 RQ 3: Detectability of Slowdowns
    2.6.1 Context
    2.6.2 Approach
    2.6.3 Results and Analysis
  2.7 Discussion
    2.7.1 Usage Scenarios
    2.7.2 Threats to Validity
  2.8 Related Work
  2.9 Conclusions

3 Software Microbenchmarking in the Cloud. How Bad is it Really?
  3.1 Introduction
  3.2 Background
    3.2.1 Software Microbenchmarking
    3.2.2 Infrastructure-as-a-Service Clouds
  3.3 Approach
    3.3.1 Project and Benchmark Selection
    3.3.2 Cloud Provider Selection
    3.3.3 Execution
  3.4 Benchmark Variability in the Cloud
    3.4.1 Differences between Benchmarks and Instance Types
    3.4.2 Sources of Variability
  3.5 Reliably Detecting Slowdowns
    3.5.1 Statistical Tests
    3.5.2 Sampling Strategies
    3.5.3 A/A Testing
    3.5.4 Minimal Detectable Slowdown Sizes
  3.6 Discussion
    3.6.1 Implications and Main Lessons Learned
    3.6.2 Threats to Validity
    3.6.3 Future Directions
  3.7 Related Work
    3.7.1 Comparison to Our Previous Work
  3.8 Conclusions

4 Dynamically Reconfiguring Software Microbenchmarks: Reducing Execution Time without Sacrificing Result Quality
  4.1 Introduction
  4.2 Java Microbenchmark Harness (JMH)
  4.3 Pre-Study
    4.3.1 Data Collection
    4.3.2 Summary of Pre-Study Subjects
    4.3.3 Results
  4.4 Dynamic Reconfiguration
    4.4.1 Approach
    4.4.2 Stoppage Criteria
    4.4.3 Modified Java Microbenchmarking Harness (JMH) Implementation
  4.5 Empirical Evaluation
    4.5.1 Research Questions
    4.5.2 Study Subjects
    4.5.3 Study Setup
    4.5.4 Results and Analysis
  4.6 Discussion and Recommendations
  4.7 Threats to Validity
  4.8 Related Work
  4.9 Conclusions

5 Predicting Unstable Software Benchmarks Using Static Source Code Features
  5.1 Introduction
  5.2 Software Benchmarks in Go
  5.3 Approach
    5.3.1 Source Code Features
    5.3.2 Benchmark Stability
    5.3.3 Model Creation
    5.3.4 Model Usage
  5.4 Study Design
    5.4.1 Research Questions
    5.4.2 Study Objects
    5.4.3 Study Setup
    5.4.4 Execution Setup
    5.4.5 Approach Parameterization
  5.5 RQ 1: Associating Source Code with Benchmark Stability
    5.5.1 Methodology
    5.5.2 Results
  5.6 RQ 2: Classifying Benchmarks as Unstable
    5.6.1 Methodology
    5.6.2 Results
  5.7 Discussion and Future Research
    5.7.1 Application Scenarios
    5.7.2 Tradeoff between Precision and Recall
    5.7.3 Classification vs. Prediction
    5.7.4 Features
  5.8 Threats to Validity
  5.9 Related Work
    5.9.1 Performance Variability
    5.9.2 Performance Bugs
    5.9.3 Performance Testing
    5.9.4 Performance Impact Prediction
    5.9.5 Machine Learning for Software Engineering
  5.10 Conclusions

6 Applying Test Case Prioritization to Software Microbenchmarks
  6.1 Introduction
  6.2 Software Microbenchmarking with JMH
  6.3 Empirical Study
    6.3.1 Experiment Process
    6.3.2 Study Objects
    6.3.3 Independent Variables
    6.3.4 Dependent Variables
    6.3.5 Execution Setup
    6.3.6 Tooling, Analysis Scripts, and Data
    6.3.7 Threats to Validity and Limitations
  6.4 Results and Analyses
    6.4.1 RQ 1: Effectiveness
    6.4.2 RQ 2: Robustness
    6.4.3 RQ 3: Efficiency
  6.5 Discussion and Recommendations
    6.5.1 Choosing a TCP Technique
    6.5.2 Efficiency of Static vs. Dynamic Coverage Techniques
    6.5.3 Impact of Coverage-Type-Specific Parameters
    6.5.4 Choice of Threshold
    6.5.5 Comparison to TCP for Unit Tests
  6.6 Related Work
    6.6.1 Test Case Prioritization of Functional Tests
    6.6.2 Performance Testing
    6.6.3 Performance Measurements
  6.7 Conclusions
    6.7.1 Future Research

List of Figures

1.1 Software microbenchmarking components
1.2 Overview of the dissertation
1.3 Chapters and papers of the dissertation

2.1 Distributions of benchmarks’ maxSpread
2.2 ABS calculation approach
2.3 Redundancy of Benchmarks

3.1 Experiment overview
3.2 Drilldown into the variability of etcd-5
3.3 Drilldown into the variability of log4j2-3
3.4 Drilldown into the variability of bleve-5
3.5 Drilldown on the source of variability for four example configurations
3.6 Sampling strategies
3.7 False positive (FP) differences between environments and statistical tests
3.8 Minimally required samples
3.9 Minimal detectable slowdowns (MDSs) with instance-based sampling
3.10 MDSs with trial-based sampling

4.1 JMH Execution
4.2 Impact of custom configurations on the execution times of (a) benchmark suites and (b) benchmarks
4.3 Static configuration vs. dynamic reconfiguration
4.4 Mean change rate per study subject and stoppage criteria

5.1 Benchmarking workflow in Go
5.2 Example benchmarks with stable and unstable results
5.3 Training phase of the approach
5.4 Usage phase of the approach
5.5 Spearman’s rank correlations between source code features and benchmark variability
5.6 Distributions of the data after binarization
5.7 Comparison of prediction performance metrics

6.1 JMH execution
6.2 Experiment process overview
6.3 APFD-P ranks across all study objects and for threshold t = 3%
6.4 Top-3 ranks across all study objects and for threshold t = 3%
6.5 APFD-P ranks across all study objects and for all thresholds t
6.6 APFD-P robustness for each study object across all versions and TCP techniques
6.7 APFD-P robustness for each TCP technique across all study objects and versions
6.8 Efficiency of the techniques

List of Tables

1.1 Datasets of the dissertation
1.2 Contributions of the author

2.1 Study subjects
2.2 Extent of benchmark suites
2.3 Maximum spread maxSpread in various percentiles
2.4 Number of project-usage traces
2.5 Application programming interface (API) benchmarking scores (ABS) of all subjects

3.1 Study projects
3.2 Selected benchmarks
3.3 Cloud instance types
3.4 Collected data points per cloud instance type
3.5 Result variability for every combination of benchmark and instance type
3.6 A/A testing FP-rates (example)
3.7 MDSs with instance-based sampling (example)
3.8 MDSs with trial-based sampling (example)

4.1 Study subjects
4.2 Result quality differences between static configuration approach and dynamic reconfiguration approaches
4.3 Time saving per project and stoppage criteria

5.1 Source code features
5.2 Standard library call features

6.1 Study objects
6.2 Independent variables and their values
6.3 Coverage statistics of the studied coverage parameters
6.4 APFD-P of the 54 TCP techniques compared to random per study object
6.5 Top-3 of the 54 TCP techniques compared to random per study object

List of Acronyms

APFD average percentage of fault-detection

API application programming interface

AST abstract syntax tree

AUC area under the receiver operating characteristic curve

AWS Amazon Web Services

CD continuous delivery

CG call graph

CI continuous integration

CLI command line interface

CPU central processing unit

CSV comma-separated values

CV coefficient of variation

DB database

ESD effect size difference

FN false negative

FP false positive

GC garbage collector

GCE Google Compute Engine

HTTP Hypertext Transfer Protocol

I/O input/output

IaaS Infrastructure-as-a-Service

IDE integrated development environment

JAR Java Archive

JDK Java Development Kit

JMH Java Microbenchmarking Harness

JSON JavaScript Object Notation

JVM Java Virtual Machine

KLD Kullback-Leibler divergence

LOC lines of code

MCC Matthews correlation coefficient

MDS minimal detectable slowdown

MIME Multipurpose Internet Mail Extensions

ORM object-relational mapping

OS operating system

OSS open-source software

PaaS Platform-as-a-Service

RAM random-access memory

RCIW relative confidence interval width

ROC receiver operating characteristic

RPC remote procedure call

RTS regression test selection

SaaS Software-as-a-Service

SMTP Simple Mail Transfer Protocol

SPE software performance engineering

SQL Structured Query Language

TCP test case prioritization

TP true positive

UDP User Datagram Protocol

UI user interface

URL Uniform Resource Locator

VCS version control system

VM virtual machine

XML Extensible Markup Language

1 Synopsis

Software is ubiquitous. It powers everything from personal computers and mobile phones to cars and home appliances. Due to this pervasiveness, software affects every part of our lives and, hence, it is paramount to ensure that software adheres to its specification. Already at the first software engineering conference in 1968, software failures caused by functional as well as non-functional issues were pointed out to have severe consequences [Naur and Randell, 1969]. The participants identified software testing as an important technique to reduce such failures.

Software performance is one important non-functional property that can lead to failures, such as unresponsive, slow software or crashes induced by overloaded systems. Research and industry agree that timely performance evaluation is key to reducing these failures. Performance failures pose an enormous challenge for developers. Compared to functional failures, they take considerably longer to be discovered [Jin et al., 2012], are harder to reproduce and debug [Zaman et al., 2012; Nistor et al., 2013a], take longer to fix [Zaman et al., 2011; Nistor et al., 2013a; Liu et al., 2014; Mazuera-Rozo et al., 2020], and require more as well as more experienced developers to do so [Zaman et al., 2011; Nistor et al., 2013a; Chen et al., 2019]. Users are similarly affected, e.g., they are more likely to leave an application if it takes longer to load [Akamai Technologies Inc., 2017; Artz, 2009; An, 2018]. Revenue and profit may plummet as a consequence, as confirmed by web companies such as Amazon [Linden, 2006], Google [Meyer, 2009; Schurman and Brutlad, 2009], Microsoft [Schurman and Brutlad, 2009], and Shopzilla [Dixon, 2009] as well as the stock trading industry [Reporter, 2008; Duhigg, 2009].

Software performance testing, a measurement-based performance evaluation technique, is a way to reduce the number of performance failures and, ultimately, their consequences on users, developers, and companies. Ideally, performance should be effectively and efficiently evaluated early in the development process, i.e., on every commit and as part of continuous integration (CI) and DevOps [Woodside et al., 2007; Bezemer et al., 2019]. Traditional techniques to do so are load tests and benchmarks, which are executed against a system under test. They enable observing performance behavior under realistic conditions over time and detecting performance changes between versions or variants [Jiang and Hassan, 2015]. Both techniques require the software system either to be deployable, i.e., it can be installed on a machine and communicated with through an application programming interface (API), or to have a standardized API, e.g., web browsers or programming language runtimes.

However, there are certain types of software that neither come as a deployable unit nor have a standardized API for which benchmarks exist; consequently, it is impossible to performance test them with load tests or (traditional) benchmarks. Software libraries implementing data structures, algorithms, or utility functionality such as logging, as well as frameworks for, e.g., web programming, database (DB) abstraction through object-relational mapping (ORM), or graphical user interfaces (UIs), fall into this category. They can have a considerable performance impact on their clients. For example, if Log4j (https://logging.apache.org/log4j), a popular Java logging framework, introduces a performance failure, a myriad of its clients might suffer. Therefore, it is important for Log4j to detect performance failures before releasing a new version. Using load tests or traditional benchmarks, however, is unfortunately not an option, because Log4j cannot be deployed, nor do standardized benchmarks exist for it. Developers of Log4j have to write custom performance tests to ensure performance does not degrade.

For Log4j and software with similar characteristics, the performance testing solution is software microbenchmarks.

1.1 Software Microbenchmarks

The performance evaluation technique central to this dissertation is software microbenchmarking, sometimes also referred to as performance unit testing [Stefan et al., 2017]. Software microbenchmarks, hereafter simply called microbenchmarks or benchmarks, are the performance testing equivalent of what unit tests are in functional testing. They assess the performance of code fragments on function, method, or statement level. Typical performance metrics reported by benchmarks are response time, throughput, or memory utilization.

Figure 1.1 depicts the components involved in executing benchmarks and retrieving their results. A benchmark is defined, similarly to a unit test, in source code by implementing a function (1) that executes the code segment to be benchmarked. A benchmark harness executes the benchmark repeatedly (2) to reduce measurement uncertainty and increase result reliability. The benchmark configuration (3) instructs the harness how often and for how long a benchmark should be executed. In each repetition, often called a measurement iteration, the harness invokes (4) the benchmark as often as possible for a defined time interval. Finally, the result (5) of a benchmark is the distribution of performance metric values, e.g., response times, recorded for each iteration.
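To make these components concrete, the following minimal JMH benchmark sketches how such a definition and configuration look in Java; the class name, the benchmarked list operation, and all annotation values are illustrative and not taken from the studied projects.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Warmup;

@BenchmarkMode(Mode.AverageTime)        // report the average time per invocation
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Fork(3)                                // 3 fresh JVM forks
@Warmup(iterations = 5, time = 1)       // 5 warmup iterations of 1 second each
@Measurement(iterations = 10, time = 1) // 10 measurement iterations of 1 second each
public class ListAddBenchmark {

    @Benchmark                          // (1) the function under measurement
    public List<Integer> addThousandElements() {
        List<Integer> list = new ArrayList<>();
        for (int i = 0; i < 1_000; i++) {
            list.add(i);                // the code segment whose performance is measured
        }
        return list;                    // returning the result avoids dead-code elimination
    }
}

JMH then runs this method in the configured forks and iterations (components 2-4) and reports the per-iteration averages that form the result distribution (component 5).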

1.2 Challenges

Performance evaluation is a varied and challenging discipline [Gregg, 2013]; some even consider it an art [Jain, 1991]. This suggests that performance evaluation is non-trivial and that much can be done wrong.

[Figure 1.1 consists of three panels: “Definition and Configuration” shows a JMH benchmark class with configuration annotations (@Fork, @Warmup, @Measurement, @BenchmarkMode, @OutputTimeUnit) and a @Benchmark method; “Execution” shows the repeated measurement iterations, each invoking the benchmark for a defined duration; “Results” shows the distribution of average runtimes per iteration.]

Figure 1.1: Software microbenchmarking components

While load testing and traditional benchmarking have received extensive interest from research and industry [Jiang and Hassan, 2015], microbenchmarking is still a niche technique but has recently gained more attention [Leitner and Bezemer, 2017; Stefan et al., 2017]. As microbenchmarking also evaluates the performance of software systems, it is likely subject to the same challenges as other forms of performance evaluation. In particular, this dissertation focuses on three major challenges related to aspects of software microbenchmark executions: (1) runtime, (2) result variability, and (3) performance change detection.

Runtime The time it takes to execute performance tests —the runtime— is known to be extensive [Weyuker and Vokolos, 2000]. Performance test suites often run for many hours, sometimes even multiple days [Huang et al., 2014; AlGhamdi et al., 2016]. The reason is the large number of repeated executions required to achieve reliable results [Georges et al., 2007]. Long runtimes are especially troublesome in fast-paced environments, such as CI and continuous delivery (CD), where commits arrive multiple times per day and software is released weekly or more often [Hilton et al., 2016; Zhao et al., 2017; Aijaz, 2019; DevOps Research and Assessment (DORA), 2019]. Indeed, there is hardly any sign of performance testing being part of CI [Beller et al., 2017; Vassallo et al., 2017], although there would be a need for it [Bezemer et al., 2019].

Result Variability Measuring performance of software systems is inherently uncertain [Mytkowicz et al., 2009; Maricq et al., 2018]. This is due to the abundance of factors that introduce non-determinism, such as concurrent processes, memory layout, and dynamic compilers [de Oliveira et al., 2013b; Curtsinger and Berger, 2013]. Even physical properties, such as the temperature of the central processing unit (CPU), can have an impact on how “fast” a certain computation executes. Consequently, two measurements of the same performance test are unlikely to return the exact same result. In order to still have reliable test outcomes, one has to repeatedly execute the performance tests to increase confidence in the results [Georges et al., 2007], which also affects runtime. The number of required repetitions depends on the software that is tested as well as the environment the tests are executed in [Kalibera and Jones, 2013]. Performance experiments should ideally be executed in controlled execution environments, such as bare-metal servers, to reduce result variability [Bakshy and Frachtenberg, 2015; Selakovic and Pradel, 2016; Stefan et al., 2017; Maricq et al., 2018]. This is at odds with CI and DevOps practices, where unreliable cloud instances [Iosup et al., 2011a; Leitner and Cito, 2016] are usually utilized as testing environments. If result variability is high, i.e., there is a high dispersion among measurements of individual repetitions, one cannot draw reliable conclusions from the test result nor accurately detect performance changes.

Performance Change Detection To reduce the number of performance failures, it is paramount to detect performance changes quickly and reliably. The quickness of detecting changes is impaired by long performance test runtimes, and the reliability hinges on the result variability induced by measurement uncertainty. Therefore, it is crucial to identify the most important performance changes through the performance tests [Gregg, 2013]; in particular because both performance improvements and performance degradations are common within the same commit [Chen and Shang, 2017]. Similar to functional failures, rapid feedback to developers about performance failures, e.g., as part of CI, might decrease the time for root cause analysis, debugging, and fixing [Hilton et al., 2016], without wasting developers’ time hunting for falsely reported failures, i.e., false alarms or false positives (FPs) [Herzig and Nagappan, 2015].

1.3 Thesis Statement and Research Questions

The three challenges outlined before are closely interconnected; tackling one might have detrimental or even beneficial effects on another, which requires investigating them together. This thesis explores the three challenges, assesses their extent in the real world, and applies known techniques as well as devises new solutions to deal with them, all in the context of software microbenchmarks. The outcome of this thesis should enable software developers as well as performance testers to make informed decisions and apply effective and efficient techniques to better manage the execution of software microbenchmarks. This is also reflected in our thesis statement:

Conscious planning and execution strategies are required to ensure that software microbenchmarks have short runtimes, yield results with low variability, and detect performance changes reliably and early.

We derive four research questions to investigate our thesis. Figure 1.2 depicts an overview of the dissertation, how the challenges relate to the research questions, and a mapping of chapters to research questions.

[Figure 1.2 maps the challenges to the research questions and chapters: RQ 1 studies the extent of the challenges by executing real-world suites (Chapter 2), comparing execution environments (Chapter 3), and parsing real-world configurations (Chapter 4); RQ 2 reduces runtime by stopping when results are stable (Chapter 4); RQ 3 handles result variability by dynamically checking for stability (Chapter 4) and predicting unstable microbenchmarks (Chapter 5); RQ 4 improves change detection by executing and repeating correctly (Chapter 3) and prioritizing microbenchmarks (Chapter 6). The chapters correspond to papers published at MSR ’18 (Chapter 2), in EMSE ’19 (Chapter 3), at FSE ’20 (Chapter 4), and two papers under major revision (Chapters 5 and 6).]

Figure 1.2: Overview of the dissertation

As the three challenges are interconnected, a chapter often addresses multiple challenges together and, therefore, provides a partial answer to our research questions. The following paragraphs introduce the research questions and briefly describe how we aim to answer them.

As a first step, we need to understand to which degree the three challenges affect software microbenchmarks. For this, we formulate the first research question:

RQ 1 What is the extent of the three challenges (1) runtime, (2) result variability, and (3) change detection in real-world software microbenchmark suites?

Previous research primarily focused on performance evaluation in general or on other performance testing types such as load tests and traditional benchmarks (e.g., Weyuker and Vokolos [2000], Georges et al. [2007], or Jiang and Hassan [2015]). To answer RQ 1, we conduct a set of studies on real-world microbenchmark suites for each challenge.

For runtime, we execute benchmark suites and measure their runtime, and we mine benchmark configurations to estimate suite runtimes. For result variability, we use the measurements from the execution results of the runtime study and assess their variability. We further take a subset of these benchmarks and re-execute them in bare-metal and cloud environments to investigate how the variability of the same benchmark changes in different environments. For change detection, we determine the competence of existing microbenchmark suites in finding actual performance changes that are injected into the source code of the software under test.

The remaining three research questions RQ 2, RQ 3, and RQ 4 investigate existing techniques as well as novel solutions addressing the three challenges, one per research question.

The second research question addresses the challenge of long microbenchmark runtimes. It is formulated as follows:

RQ 2 How can we reduce the runtime of microbenchmark suites?

Previous research on reducing the runtime of performance tests either estimates the number of repetitions required to achieve a certain maximum result variability [Kalibera and Jones, 2013; Maricq et al., 2018], decides when the measurement is stable during the execution [Georges et al., 2007; He et al., 2019], or stops performance tests once they become repetitive [AlGhamdi et al., 2016, 2020]. So far, it is unclear whether these techniques are also applicable in the context of software microbenchmarking, how much runtime savings they enable, or whether they need adaptation. We propose a solution that builds on ideas from Georges et al. [2007] and He et al. [2019] to dynamically reconfigure individual microbenchmarks. The aim is to maintain result quality while reducing overall runtime. The approach is strongly linked to result variability investigated in RQ 3.

The third research question addresses the challenge of microbenchmark result variability. It is formulated as follows:

RQ 3 How can we handle microbenchmark result variability?

Previous research on tackling performance variability is similar to that on runtime in RQ 2: more repetitions require more runtime but also lead to lower result variability. Existing techniques either estimate before the execution how many repetitions are required to achieve low result variability [Kalibera and Jones, 2013; Maricq et al., 2018], or they measure the result variability during the execution and stop the performance test once the desired quality is reached [Georges et al., 2007; He et al., 2019]. We propose two solutions, one addressing performance variability before and the other during the execution. The first solution aims at identifying unstable benchmarks, i.e., the ones with high variability, without the need to execute them. The second solution decides during the execution when the result variability is sufficiently low and the execution can be stopped, which is the same approach as the one from RQ 2.

The fourth and final research question addresses the challenge of reliable and early detection of performance changes with microbenchmarks. It is formulated as follows:

RQ 4 How can we improve performance change detection with microbenchmarks?

Previous research took inspiration from functional regression testing [Yoo and Harman, 2012] to either prioritize benchmarks and execute earlier the ones that are likely to expose larger performance changes [Mostafa et al., 2017] or select benchmarks that are expected to reveal performance changes and drop the rest [de Oliveira et al., 2017; Alshoaibi et al., 2019; Chen et al., 2020]. These approaches are specific to certain types of software (e.g., collection-intensive), use functional unit tests as performance tests, or do not consider result variability (e.g., caused by different execution environments) as an important factor.

We investigate two sets of techniques to optimize microbenchmark executions through specific execution strategies to uncover performance changes reliably and early. The first set consists of strategies to utilize unreliable execution environments (e.g., cloud instances) while still being able to detect small performance changes. The second set reorders benchmarks by applying traditional test case prioritization (TCP) techniques from unit testing research.

1.4 Research Approach and Main Results

This section describes the approaches, methodologies, evaluations, and results that lead to answering the research questions introduced in the previous section. Study types follow the nomenclature of Stol and Fitzgerald [2018] and are highlighted.

1.4.1 RQ 1: Extent of Challenges in the Real World

To assess the three challenges’ extent in software microbenchmarks, we conduct a set of empirical studies on real-world open-source software (OSS) microbenchmark suites.

Runtime

To understand how long benchmark suites typically take to execute, we perform two studies: (1) a laboratory experiment that executes 10 OSS benchmark suites to measure their runtime, and (2) a sample study on 573 projects to parse the benchmarks’ configurations and estimate their runtime.

Methodology For the first study, we consider a sample of 5 Java projects with Java Microbenchmarking Harness (JMH) benchmark suites, i.e., the de facto standard framework for microbenchmarks written in Java, and 5 Go projects. The experiment executes the full benchmark suites of the 10 projects in both a controlled, bare-metal environment and on a cloud instance hosted on Google Compute Engine (GCE). The execution uses a unified configuration of 10 warmup iterations for Java projects and 20 measurement iterations for both Java and Go projects per benchmark. We run the full benchmark suites for 5 trials and measure the overall runtime as well as retrieve the performance measurements for the result variability analyses.

While the first study provides exact runtime measurements from two environments and across 5 trials, it suffers from two limitations: (1) it uses a custom configuration that might not align with how developers execute the benchmarks, which reduces internal validity; and (2) it only considers 10 projects, which reduces external validity. The second study addresses these limitations by extracting the configurations set by developers through annotations of 573 projects on GitHub. The projects consist of a total of 10,816 unique benchmarks. We extract their JMH annotations, which provide the information about a benchmark’s execution configuration. Knowing the number of repetitions as well as the duration of each repetition, we can estimate a benchmark’s runtime by summing, over all repetitions, the number of repetitions multiplied by their duration.
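For a JMH benchmark b, one way to write down this estimate (notation introduced here for illustration, assuming f_b forks, wi_b warmup iterations of length wt_b each, and mi_b measurement iterations of length mt_b each) is:

t_b \approx f_b \cdot \left( wi_b \cdot wt_b + mi_b \cdot mt_b \right), \qquad t_{\text{suite}} \approx \sum_{b \in B} t_b

where B is the set of benchmarks in a project's suite.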

Results The first study unveils that runtimes have a wide range depending on the number of benchmarks in the suite. The projects have between 16 and 983 individual benchmarks taking between 11 minutes and 8.75 hours to execute the full suite once. The runtime remains the same when executed in different environments and shows hardly any variation among different trials, for the majority of the suites. The second study elaborates on these findings: the runtime across the 573 projects ranges from 143 milliseconds to a staggering 7.4 years, with a median of 26.7 minutes. 15% of the projects have benchmark suites with runtimes above 3 hours, and still 3% exceed 12 hours. These results indicate that some benchmark suites are feasible to run in temporally-constrained settings, such as CI; however, a considerable number would require a reduced runtime to enable this.

Result Variability

To determine the extent of microbenchmark result variability in different execution environments, we conduct two studies: (1) a laboratory experiment, reusing the performance measurements from the first runtime study, studying result variability of all benchmarks in 10 suites in a bare-metal as well as a cloud environment, and (2) a field experiment executing a sample of these benchmarks to investigate the impact of different cloud providers and instance types.

Methodology The first study reuses the performance measurements of the 5 Java and 5 Go projects from the runtime study, all executed in a bare-metal environment and on an instance in GCE, to assess the variability of the benchmarks. We introduce maxSpread, a variability measure that emphasizes by how much the mean performance metric, i.e., execution time or throughput of the benchmark, differs between identical sets of repetitions (i.e., trials). Based on maxSpread, we investigate variability differences within benchmark suites of the same project, between Java and Go projects, as well as between bare-metal and cloud environments.

Inspired by the first study, the second study explores the variability of benchmarks executed in unstable cloud environments more thoroughly. We select 4 projects, 2 from each programming language, and pick 5 benchmarks each with varying variabilities according to the data of the first study. We then execute the 20 benchmarks in 10 different environments. The environments are 9 different cloud instances and 1 bare-metal instance. The cloud instances are hosted by 3 providers, i.e., Amazon’s Amazon Web Services (AWS) EC2, Google’s GCE, and Microsoft’s Azure, and consist of 3 instance types per provider, i.e., general purpose, compute-optimized, and memory-optimized. The bare-metal instance is hosted by IBM. We repeatedly execute the benchmarks for 50 measurement iterations, for 10 trials within the same instance, and on 50 instances of the same type. This leads to the, to the best of our knowledge, largest microbenchmark result dataset executed on cloud instances, with approximately 4.5 million unique data points.

We use the coefficient of variation (CV) to assess the benchmarks’ variabilities across cloud providers, instance types, and programming languages.
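For reference, the CV of a benchmark's n measurements x_1, ..., x_n is the standard deviation normalized by the mean, and maxSpread can be thought of as the largest relative difference between trial means (a simplified sketch; the precise definition is given in Chapter 2):

\mathrm{CV} = \frac{s}{\bar{x}} = \frac{1}{\bar{x}} \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \mathit{maxSpread} \approx \frac{\max_j \bar{x}_j - \min_j \bar{x}_j}{\min_j \bar{x}_j}

where \bar{x}_j denotes the mean of trial j.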

Results The first study shows that benchmarks in the same project exhibit different result variabilities, which depend on the code that is benchmarked. Moreover, Go benchmarks are considerably more stable than Java benchmarks. This is due to Go directly compiling to machine code, whereas Java compiles to byte code that is interpreted and dynamically compiled by the Java Virtual Machine (JVM). Expectedly, executing benchmarks in the bare-metal environment leads to lower result variability than in the cloud. Go benchmarks have relatively low maxSpread, mostly below 5% in bare-metal and below 25% in the cloud. On the other hand, the majority of the Java benchmarks are below 25% in bare-metal and below 50% in the cloud.

The second study unveils a more detailed picture: some benchmarks show low variability, even if they are executed on cloud instances. CV values of below 10% are possible for 16 benchmarks when extensively repeating executions within and across instances. Most surprisingly, AWS instances perform almost as well as the bare-metal offering from IBM, drastically outperforming GCE and Azure instances. Moreover, 3 benchmarks are inherently unstable, irrespective of where they are executed, even in bare-metal. Lastly, we do not observe the same differences between Java and Go benchmarks in the second study as we do in the first. High result variability of microbenchmarks is common, especially when executed in unstable environments. However, it appears to be manageable if enough repetitions are executed, even in cloud environments.

Change Detection

To determine whether existing microbenchmark suites are competent in finding important performance changes, we conduct a laboratory experiment on the 10 benchmark suites written in Java and Go from the previous studies. An important performance change is one that occurs in a relevant part of the software and one that is of large magnitude. Consequently, an important benchmark and a competent benchmark suite is one that can detect an important change reliably.

Methodology The experiment (1) identifies “important” methods of the benchmark projects, (2) injects slowdowns into these methods, and (3) assesses whether the benchmark suites are able to detect these slowdowns. We identify the 20 most important methods of a project by dynamically tracing the unit test executions of 5 clients of the benchmark projects. In this study, we consider a method more important the more often it is invoked. Based on the variabilities from the first variability study, we determine the slowdown size to be twice the variability of the 95th-percentile most unstable benchmark in a project’s benchmark suite. This increases the chances of even highly variable benchmarks capturing it. We inject the slowdown into each of the most important methods one by one and compare two full benchmark suite runs, one with and one without the injected slowdown. A slowdown is detected if a Wilcoxon rank-sum test reports a difference and the detected mean performance change is at least half of the injected slowdown size, in two distinct trials.
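A minimal sketch of this detection decision for a single benchmark and trial, assuming response-time measurements and using Apache Commons Math for the Wilcoxon rank-sum (Mann-Whitney U) test; the significance level and all names are illustrative, and the dissertation additionally requires the decision to hold in two distinct trials.

import org.apache.commons.math3.stat.StatUtils;
import org.apache.commons.math3.stat.inference.MannWhitneyUTest;

/** Sketch: does a benchmark detect an injected slowdown in one trial? */
final class SlowdownDetection {

    static boolean detects(double[] baselineRuntimes, double[] slowedRuntimes,
                           double injectedSlowdown /* e.g., 0.6 for 60% */,
                           double alpha /* e.g., 0.05 */) {
        // Wilcoxon rank-sum (Mann-Whitney U) test on the two measurement distributions.
        double p = new MannWhitneyUTest().mannWhitneyUTest(baselineRuntimes, slowedRuntimes);
        // Relative mean change between the slowed-down run and the baseline run.
        double meanBaseline = StatUtils.mean(baselineRuntimes);
        double meanSlowed = StatUtils.mean(slowedRuntimes);
        double observedChange = (meanSlowed - meanBaseline) / meanBaseline;
        // Detected: significant difference and at least half of the injected slowdown observed.
        return p < alpha && observedChange >= injectedSlowdown / 2.0;
    }
}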

Results The experiment reveals that the 10 benchmark suites are able to detect the injected slowdowns. Depending on the project, between 10% (worst case) and 100% (best case) of the slowdowns are reliably detected, with the majority of the benchmark suites detecting around 60% or more. In line with the variability results, Go benchmark suites are able to detect smaller slowdowns compared to Java benchmark suites. The slowdown sizes range from 10% to 30% and from 60% to 260%, respectively. Often multiple benchmarks detect the same slowdowns, showing redundancy within the benchmark suites. These results show that existing benchmark suites are able to detect actual slowdowns; however, they often have to be of large magnitudes.

Answer to RQ 1: Real-world software microbenchmark suites are considerably affected by all three challenges, which calls for solutions to tackle them. They often consist of hundreds of benchmarks with runtimes exceeding multiple hours. Their performance results can be highly variable, especially if they are executed in unreliable cloud environments. Although they are usually competent in unveiling important performance changes, the changes they detect reliably are typically large. All this makes microbenchmarking expensive, reduces the reliability of its results, and renders it insensitive to small performance changes.

1.4.2 RQ 2: Reducing Runtime

The results of RQ 1 show that microbenchmark suites often take multiple hours to finish. This inhibits early slowdown detection and execution on every code change, such as part of CI and DevOps. The runtime is influenced by the repetitions of a microbenchmark, which are defined in its configuration. We hypothesize that benchmarks are often “wrongly” configured, either yielding unreliable results in short time or wasting runtime although results are already stable. We devise a solution, “dynamic reconfiguration”, that dynamically decides when to stop microbenchmark executions based on their measurements, and we evaluate its effectiveness in reducing runtime with a laboratory experiment on 10 real-world Java projects with JMH benchmark suites.

Solution Idea JMH benchmarks have two configurable levels of repetition: (1) iterations, which are split into warmup iterations followed by measurement iterations (see Section 1.1); and (2) forks, which are sets of iterations executed in fresh JVMs. In standard JMH, configurations are set before the execution of a benchmark, i.e., statically, and the number of warmup and measurement iterations is the same for all forks. The idea of dynamic reconfiguration is to dynamically decide when sufficient warmup iterations have been executed per fork, to bring the fork into a steady state, and when more forks would not change the stability of the overall measurement. We call these decision points stoppage points; at each of them, our solution relies on a statistical stoppage criterion.
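A minimal sketch of such a criterion, using the coefficient-of-variation variant named in the evaluation below: stop once the CV over a window of the most recent iteration means falls below a threshold. The window size, the threshold, and all names are illustrative assumptions, not the dissertation's exact parameters.

import java.util.List;

/** Sketch of a CV-based stoppage criterion over iteration means. */
final class CvStoppageCriterion {

    private final int windowSize;   // number of most recent iterations to consider (illustrative)
    private final double threshold; // e.g., 0.01 for 1% variability (illustrative)

    CvStoppageCriterion(int windowSize, double threshold) {
        this.windowSize = windowSize;
        this.threshold = threshold;
    }

    /** Returns true if the last windowSize iteration means are stable enough to stop. */
    boolean isStable(List<Double> iterationMeans) {
        if (iterationMeans.size() < windowSize) {
            return false; // not enough measurements yet
        }
        List<Double> window = iterationMeans.subList(iterationMeans.size() - windowSize, iterationMeans.size());
        double mean = window.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
        double variance = window.stream()
                .mapToDouble(x -> (x - mean) * (x - mean))
                .sum() / (window.size() - 1);
        double cv = Math.sqrt(variance) / mean;
        return cv < threshold;
    }
}

The same check can be applied after warmup iterations (to decide that a fork reached a steady state) and after forks (to decide that the overall measurement is stable).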

Evaluation Methodology We sample 10 Java OSS projects with JMH benchmark suites from the runtime sample study of RQ 1, which have a total of 3,969 benchmarks with suite runtimes between 4 and 191 hours. We execute all benchmarks with JMH’s default configuration and apply dynamic reconfiguration on the resulting performance measurements. This gives us the stoppage points for each benchmark where our solution decides to stop. The experiment compares three stoppage criteria: (1) CV, inspired by Georges et al. [2007]; (2) relative confidence interval width (RCIW), estimated with bootstrap [Davison and Hinkley, 1997; Kalibera and Jones, 2012]; and (3) Kullback-Leibler divergence (KLD) [Kullback and Leibler, 1951], inspired by He et al. [2019]. We further measure the runtime overheads of the three stoppage criteria. Based on the total suite runtimes, the stoppage points per benchmark, and the runtime overheads of each stoppage criterion, we can estimate the runtime savings for each project and stoppage criterion.

Results The evaluation shows that CV has the lowest overhead at 0.88%, RCIW has the highest overhead at 10.92%, and KLD lies in the middle at 4.32%. Even with these overheads, dynamic reconfiguration saves between 66.2% and 82.0% of runtime across all studied projects. RCIW is the most conservative stoppage criterion and has the highest overhead; however, it saves 48.4% for the worst project and 83.1% for the best. CV saves at least 72.8% and at best 86.0%, whereas KLD saves at least 77.8% and at best 83.3%. The three criteria also show different stoppage behavior, i.e., they stop at different numbers of warmup iterations and forks. CV requires more warmup iterations than KLD (18.5 versus 14.1) but fewer forks (3.1 versus 4.1). RCIW demands the most warmup iterations (34.6) and has a similar number of forks as CV (3.3). These results show that although dynamic reconfiguration can have relatively high overheads, it is still highly effective in reducing runtime. However, this reduction could come at the expense of result variability, which must also be evaluated and is part of RQ 3.

Answer to RQ 2: Dynamic reconfiguration allows to drastically reduce the runtime of microbenchmark suites. It relieves developers from manually configuring benchmarks and can individually adapt the (previously static) configurations for each benchmark.

1.4.3 RQ 3: Handling Result Variability

Microbenchmark results are often highly variable, as shown in RQ 1. The benchmarked source code, the number of repetitions, as well as the execution environment play a critical role in how variable a benchmark is. We devise two solutions to handle microbenchmark result variability. The first solution identifies unstable microbenchmarks, i.e., the ones with high result variability, without executing them. It utilizes statically-computed source code features and machine learning classifiers. We evaluate the first solution with a laboratory experiment on 4,461 benchmarks of real-world Go projects. The second solution dynamically ensures that benchmarks execute enough repetitions to achieve sufficiently low result variability. This solution is the same as the one from RQ 2. We evaluate the second solution with a laboratory experiment on 10 real-world Java projects with JMH benchmark suites.

Solution Ideas An abundance of factors influence a benchmark’s result variability, such as the source code, the operating system, or the execution environment. The first solution’s idea is that many of these factors are, to some extent, represented in the source code, e.g., through an API call, and can therefore be approximated by source code features. The solution statically extracts meta information (e.g., lines of code (LOC)), programming language features (e.g., loops or variable declarations), and standard library API calls that potentially impact result variability (e.g., input/output (I/O) or concurrency). In total, it relies on 58 features, counts their occurrences at function level, and accumulates the counts per benchmark with static call graph (CG) information, i.e., over all functions reachable from a benchmark. The solution employs machine learning classifiers trained on these features. As dependent variable, we use benchmark stability, which is either stable or unstable depending on a result variability threshold.

The second solution, “dynamic reconfiguration”, executes microbenchmarks in a way that they yield stable results, i.e., with low variability. This solution has two simultaneous goals: stable results while reducing runtime. It assesses the stability of a benchmark (1) after warmup iterations, to decide whether a fork is in a steady state, and (2) after forks, to decide whether the overall measurement is stable. For these decisions, it relies on statistical techniques to approximate how stable a measurement is during runtime.

Evaluation Methodologies To evaluate the first solution, we sample a set of 230 Go OSS projects with a total of 4,461 benchmarks. We then extract the features and repeatedly execute all benchmarks for at most 2 hours per project. Thereafter, the experiment samples benchmark results for different numbers of iterations, i.e., 5, 10, and 20, and computes the variability of each benchmark as RCIW with bootstrap [Davison and Hinkley, 1997; Kalibera and Jones, 2012]. To transform the continuous RCIW values into binary classes, the experiment uses thresholds inspired by previous research, i.e., 1%, 3%, 5%, and 10% [Georges et al., 2007; Curtsinger and Berger, 2013]. We then train 11 binary machine learning classifiers (e.g., Random Forest) for each iteration and threshold pair and evaluate their prediction performance with 30-times-repeated 10-fold cross-validation. The evaluation mainly uses the prediction performance metrics area under the receiver operating characteristic curve (AUC) [Hanley and McNeil, 1982] and Matthews correlation coefficient (MCC) [Matthews, 1975].

To evaluate the second solution, we reuse the measurements from RQ 2 of 3,969 JMH benchmarks from 10 Java OSS projects, as well as the information when the three stoppage criteria, i.e., CV, RCIW, and KLD, decide to stop the benchmark execution. The goal of the evaluation is to assess whether and to which degree “dynamic reconfiguration” changes the result variability compared to a standard JMH execution, i.e., “static configuration”. For this, we compare the results of a benchmark with static configuration to the results of the same benchmark with dynamic reconfiguration, once for each stoppage criterion. The first comparison is a statistical A/A test using bootstrap confidence intervals of the ratio of means for each benchmark [Kalibera and Jones, 2012]. The second comparison examines the mean difference in measured performance for each benchmark.
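A minimal sketch of how such a bootstrap-based RCIW can be computed from a benchmark's iteration measurements (a plain percentile bootstrap of the mean; the number of resamples, the confidence level, and the use of the simple mean are illustrative assumptions rather than the exact procedure of Kalibera and Jones [2012]).

import java.util.Arrays;
import java.util.Random;

/** Sketch: percentile-bootstrap RCIW of the mean of iteration measurements. */
final class BootstrapRciw {

    static double rciw(double[] measurements, int resamples, double confidence, long seed) {
        Random random = new Random(seed);
        double[] bootstrapMeans = new double[resamples];
        for (int r = 0; r < resamples; r++) {
            double sum = 0.0;
            for (int i = 0; i < measurements.length; i++) {
                sum += measurements[random.nextInt(measurements.length)]; // resample with replacement
            }
            bootstrapMeans[r] = sum / measurements.length;
        }
        Arrays.sort(bootstrapMeans);
        double alpha = 1.0 - confidence;
        double lower = bootstrapMeans[(int) Math.floor(resamples * (alpha / 2.0))];
        double upper = bootstrapMeans[(int) Math.ceil(resamples * (1.0 - alpha / 2.0)) - 1];
        double mean = Arrays.stream(measurements).average().orElse(Double.NaN);
        return (upper - lower) / mean; // width of the confidence interval relative to the mean
    }
}

Classifying a benchmark as unstable then amounts to checking whether its RCIW exceeds the chosen threshold (e.g., 1%, 3%, 5%, or 10%).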

Results The evaluation of the first solution shows that individual source code features are at best moderately associated with benchmark stability. This is in line with the findings from Siegmund et al. [2015b]. Only the combination of features and powerful machine learning algorithms enables predicting benchmark instability. In our experiment, random forest comes out on top of the 11 classifiers for most iteration-threshold pairs, with prediction performance ranging from 0.79 to 0.90 AUC and from 0.43 to 0.68 MCC. Our model performs better at predicting unstable benchmarks for larger numbers of iterations (i.e., 20) and higher thresholds (i.e., 10%). These results demonstrate that predicting whether a benchmark will be stable, without executing it, is possible using only static source code features. The evaluation of the second solution “dynamic reconfiguration” shows that for the majority of the studied benchmarks the results do not statistically change compared to a static configuration. The number of benchmarks that do not exhibit a result difference in the A/A test depends on the concrete stoppage criterion used: 78.8% with CV, 87.6% with RCIW, and 79.6% with KLD. The results of the mean change rate are in line with the A/A test results. RCIW achieves the smallest relative change between the results from dynamic reconfiguration and static configuration at 1.4%. KLD and CV have mean change rates of 2.4% and 3.1%, respectively. Even the worst change rate of 3.1% is reasonably small, considering that performance experiments often have a measurement error within 3% [Georges et al., 2007] or sometimes even as large as 10% [Curtsinger and Berger, 2013]. These results, in combination with the results from RQ 2, demonstrate that dynamic reconfiguration is effective in reducing benchmarking runtime without sacrificing result quality.

Answer to RQ 3: We can handle benchmark result variability either before or during the execution. Machine learning models on source code features are able to identify unstable benchmarks before the execution, and dynamic reconfiguration can determine when the benchmark execution has been repeated often enough to yield stable results. Developers can use the solutions to identify benchmarks with variable results early in the development process, assess whether results will be stable in different environments, and ensure the execution yields stable results.

1.4.4 RQ 4: Improving Change Detection

Real-world microbenchmark suites are capable of detecting important performance changes, as RQ 1 shows. However, RQ 1 also reveals that benchmark suites take a long time to execute and that benchmarks often only detect extremely large changes reliably. In RQ 4, we investigate two sets of techniques that enable reliable and early detection of performance changes. The first set of techniques comprises execution strategies that enable benchmarks to detect performance changes reliably, even if they are of small magnitude and execution is carried out in unreliable environments. We evaluate the strategies with an experimental simulation on real-world JMH and Go benchmarks which are executed on public cloud infrastructure. The second set of techniques reorders benchmarks in a suite, such that larger performance changes are detected sooner than smaller ones. We adapt well-known TCP techniques from functional unit testing research for benchmarks and evaluate these with a laboratory experiment on real-world Java projects across multiple versions.

Techniques The first set of techniques builds on the observation that benchmark results become more stable if the executions are repeated more often, at different points in time, and on different machines. The aim is to use cloud instances as execution environment while reducing the impact of confounding factors on the overall result, consequently enabling reliable detection of smaller performance changes. We outline two strategies that use repetition within and across multiple cloud instances:

Version Testing: Every new version is benchmarked on new cloud instances and compared to previous execution results from other cloud instances. This mimics the scenario of traditional CI where software is tested when new versions/commits are pushed to the build server. In Chapter 3 we call this instance-based sampling.

Batch Testing: Upon a new version, both the new version and the old version of the benchmarks are executed interleaved on the same instances. In Chapter 3 we call this trial-based sampling.

The second set of techniques applies total and additional TCP strategies from unit testing research to benchmarks [Rothermel et al., 1999]. Both are greedy strategies relying on code coverage information, which is either retrieved dynamically with JaCoCo (https://www.jacoco.org) or statically with WALA (https://github.com/wala/WALA). total ranks benchmarks by the absolute number of covered methods, while additional ranks benchmarks by the number of covered methods that have not been covered by other benchmarks. The prioritization granularity further changes the exact benchmark ranking. TCP can either have method or parameter granularity. The former ranks all benchmark parameterizations as if they were a single benchmark and executes these sequentially, while the latter considers every parameterization as a distinct benchmark.
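To illustrate the total and additional strategies, the following Go sketch ranks benchmarks from per-benchmark sets of covered methods; the data types are assumptions made for the example, and the coverage extraction itself (JaCoCo or WALA) is out of scope here.

```go
// Sketch of the total and additional greedy prioritization strategies over
// per-benchmark coverage sets (covered method identifiers).
package tcp

import "sort"

type Benchmark struct {
	Name     string
	Coverage map[string]bool // covered methods
}

// totalOrder ranks benchmarks by the absolute number of covered methods.
func totalOrder(benchmarks []Benchmark) []Benchmark {
	ranked := append([]Benchmark(nil), benchmarks...)
	sort.SliceStable(ranked, func(i, j int) bool {
		return len(ranked[i].Coverage) > len(ranked[j].Coverage)
	})
	return ranked
}

// additionalOrder repeatedly picks the benchmark covering the most methods
// not yet covered by previously selected benchmarks.
func additionalOrder(benchmarks []Benchmark) []Benchmark {
	remaining := append([]Benchmark(nil), benchmarks...)
	covered := map[string]bool{}
	var ranked []Benchmark
	for len(remaining) > 0 {
		best, bestGain := 0, -1
		for i, b := range remaining {
			gain := 0
			for m := range b.Coverage {
				if !covered[m] {
					gain++
				}
			}
			if gain > bestGain {
				best, bestGain = i, gain
			}
		}
		for m := range remaining[best].Coverage {
			covered[m] = true
		}
		ranked = append(ranked, remaining[best])
		remaining = append(remaining[:best], remaining[best+1:]...)
	}
	return ranked
}
```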

Evaluation Methodologies To evaluate the first set of techniques, we reuse the performance data from the cloud variability field study in RQ 1. Recall that we repeatedly executed 19 benchmarks written in Java and Go on 10 cloud instance types for 50 measurement iterations, 10 trials (within the same instance), and across 50 instances of the same type. From this data set, we randomly sample test and control groups, resembling an old and a new software version, 100 times for both the version and batch testing strategies. The sampling uses different numbers of trials and instances to investigate their impact on the reliability of the change detection. First, we perform A/A tests comparing the test and control groups with two statistical tests. The A/A tests demonstrate whether a benchmark executed with one of the strategies, for a given number of trials and instances, falsely detects a performance change, i.e., a FP. Second, we simulate a performance change in the test group by transforming its data points in a range between 0.1% and 1,000%. If for at least 95 of the 100 samples the A/A tests do not record a change (i.e., no FP) and a statistical test detects a difference for the simulated performance change (i.e., a true positive (TP)), we consider the benchmark, with this strategy and this number of trials and instances, able to reliably capture a performance change of the simulated size. In Chapter 3, we call this the minimal detectable slowdown (MDS). To evaluate the second set of techniques, we sample 10 Java OSS projects with a total of 161 versions, 1,829 distinct benchmarks, and 6,460 distinct parameterizations. We execute all benchmarks in all versions to identify the performance changes they unveil. Based on the performance change sizes, we create the ideal ranking of how the benchmarks in a suite should be executed to capture larger performance changes sooner. We then create and compare the rankings of the different TCP techniques. The evaluation instantiates a total of 54 TCP techniques with the following technique parameters: (1) total or additional strategy, (2) method or parameter granularity, (3) static or dynamic coverage, and (4) coverage-type-specific parameters such as the CG algorithm. We evaluate the techniques in terms of (1) their effectiveness, i.e., how close they are to the ideal ranking, using the standard APFD-P and Top-3 metrics; (2) whether the effectiveness as well as the ranking among techniques changes if different performance change sizes are considered significant; and (3) their efficiency, which is defined as the analysis overhead.
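The following Go sketch outlines the core of this simulation under simplifying assumptions: a hypothetical Detector function stands in for the statistical test (e.g., the Wilcoxon rank-sum test), a slowdown is injected by scaling the test group’s data points, and the 95-out-of-100 criterion is checked over the sampled test/control pairs.

```go
// Sketch of the slowdown-simulation step; names and types are illustrative,
// not the dissertation's implementation.
package simulation

// injectSlowdown scales every data point of the test group by the
// simulated slowdown, e.g., 0.1 for a 10% regression.
func injectSlowdown(testGroup []float64, slowdown float64) []float64 {
	out := make([]float64, len(testGroup))
	for i, v := range testGroup {
		out[i] = v * (1 + slowdown)
	}
	return out
}

// Detector abstracts the statistical test used for change detection.
type Detector func(control, test []float64) bool

// reliablyDetects mimics the 95-out-of-100 criterion: no change may be
// flagged for the A/A comparison, and the injected slowdown must be found.
func reliablyDetects(samples [][2][]float64, slowdown float64, detect Detector) bool {
	ok := 0
	for _, s := range samples {
		control, test := s[0], s[1]
		falsePositive := detect(control, test)
		truePositive := detect(control, injectSlowdown(test, slowdown))
		if !falsePositive && truePositive {
			ok++
		}
	}
	return ok >= 95 // assumes 100 sampled test/control pairs
}
```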

Results The evaluation of the first set of techniques shows that unreliable cloud instances can be used to reliably detect performance changes. However, the choice of testing strategy, number of repetitions, and cloud provider is crucial. In terms of A/A tests, using version testing often results in FPs, even with 10 repetitions on different cloud instances. Batch testing drastically improves on that: the majority of benchmarks do not falsely detect a performance change when executed in any of the environments, even in the unstable ones (see RQ 1), when using 5 instances and 5 trials. We see a similar trend in terms of MDS: batch testing outperforms version testing. Version testing with 20 repeated instances is only able to reliably detect a performance change of any size in 78% of the benchmark-environment combinations and a change smaller than 10% in just 64%. Batch testing improves on these numbers, with 83% of the benchmark-environment combinations able to reliably detect a performance change of 10% or less when employing 20 instances and 5 trials. The results here are for the Wilcoxon rank-sum test used for change detection; Chapter 3 also reports results for a confidence interval test. Our experiments also show a trend that AWS and the IBM bare-metal instance perform similarly well (see also RQ 1), and both perform better than GCE and Azure. These results demonstrate that reliable benchmarking is feasible in cloud environments if one employs batch testing in combination with repeated execution within and across instances.

The evaluation of the second set of techniques shows that TCP on benchmarks can effectively and efficiently detect larger performance changes sooner. In terms of effectiveness, the best techniques achieve APFD-P values between 0.54 and 0.71 and require executing between 29% and 66% of a suite to capture the 3 largest changes, depending on the project. Although more effective than a random ranking, these values indicate that TCP on benchmarks does not perform as well as on functional unit tests. Also, different from unit testing research, total considerably outperforms additional. Dynamic coverage unsurprisingly leads to better performance than static coverage, and coverage-type-specific parameters, especially the static CG algorithm, can have a drastic impact on TCP effectiveness. We further find that the performance change size that is considered significant has a large impact on the effectiveness, changing the APFD-P value between 0.11 and 0.28. However, it does not have an impact on the comparison among techniques, i.e., which one is better and which one is worse. Finally, the analysis overhead ranges from below 1% to 59%. The most effective technique, which is based on dynamic coverage, has an overhead of 11%. The majority of the static coverage techniques have overheads below 4%. Coverage-type-specific parameters, e.g., reflection options of the static CG construction, can have a detrimental impact on efficiency, leading to overheads beyond 55%. For the techniques in our study, the time to extract coverage information dominates the overall analysis time, while running the prioritization algorithm has a negligible impact in most cases. These results show that applying TCP to benchmarks is effective and reasonably efficient, if the technique is correctly parameterized.

Answer to RQ 4: We can improve performance change detection by applying two optimization strategies for benchmark execution. Batch testing in combination with repetition within and across instances enables detecting small changes reliably, even when using unreliable cloud infrastructure as execution environments. TCP techniques allow reordering benchmarks within a suite to detect large changes sooner, with reasonable analysis overheads. Developers can benefit from both strategies as they minimize false alarms (i.e., FPs) that are likely to waste their time and reduce the time until a large performance change is detected.

1.5 Scope, Potential, and Limitations

This section outlines the scope in which the presented results are applicable, discusses the strategies’ and techniques’ potential for industrial adoption, and reviews the main limitations.

1.5.1 Scope of Dissertation

The dissertation’s scope is a particular performance evaluation technique, software microbenchmarking, which falls into the categories of measurement-based performance evaluation and performance testing. The investigated challenges are derived from general research on performance evaluation and studied in the context of microbenchmarks. The proposed solutions and studied techniques are designed for, adapted for, and evaluated on microbenchmarks, but they might also be applicable in other performance evaluation or testing contexts. For example, the execution strategies for unstable environments are likely transferable to other types of performance evaluations; the dynamic reconfiguration is based on approaches for traditional benchmarks [Georges et al., 2007; He et al., 2019]; the impact of coverage-type-specific parameters on TCP effectiveness and efficiency is likely to also affect TCP on functional unit tests; and the prediction of unstable benchmarks might also be applicable to benchmarks targeting full systems or applications. The projects and benchmark suites used for studying the extent of the challenges as well as for evaluating the effectiveness of the solutions and techniques are all software libraries or frameworks. This is due to microbenchmarking being the only performance evaluation technique applicable to these types of software. As already argued at the beginning of this chapter, load testing and (traditional) benchmarking cannot be done on libraries and frameworks. Finally, the dissertation focuses on evaluating the runtime performance of code segments. The investigated runtime performance metrics are response time and throughput, which are the standard metrics for microbenchmarking. Other performance metrics such as memory allocations, CPU or memory utilization, and I/O operations are not studied here. The results and conclusions do not translate to these other metrics.

1.5.2 Potential for Industrial Adoption

We see three opportunities where the dissertation’s solutions and strategies could successfully be adopted, which we discuss in the following.

CI and Build Tools Inclusion of performance testing in CI and DevOps is one of the most important aspects of tackling performance problems [Bezemer et al., 2019]. However, this is not standard practice in OSS yet, as many researchers have reported [Beller et al., 2017; Leitner and Bezemer, 2017; Vassallo et al., 2017]. The lack of performance testing in CI is likely caused, at least partially, by the three challenges investigated in the dissertation. Consequently, we are positive that our solutions can contribute to this goal. CI tools such as Jenkins (https://www.jenkins.io), TravisCI (https://travis-ci.org), CircleCI (https://circleci.com), and GitLab (https://about.gitlab.com) or build tools such as Maven (https://maven.apache.org), gradle (https://gradle.org), the Go tools (https://github.com/golang/tools), Google’s Bazel (https://bazel.build), and Facebook’s Buck (https://buck.build) would benefit from implementing the techniques used for planning the execution of benchmarks as part of the testing phase. Companies that use sophisticated build and (performance) testing infrastructure, such as CloudBuild at Microsoft [Esfahani et al., 2016], MongoDB [Ingo and Daly, 2020], or Talos by Mozilla [Mozilla], could equally benefit from the dissertation’s execution strategies. Moreover, our solution for predicting unstable benchmarks would fit nicely into CI systems. A CI vendor could train the prediction model on the benchmarks that have already been executed on their infrastructure and alert developers of new projects or for new benchmarks whether they will be stable in the vendor’s environment.

Benchmarking Frameworks and Harnesses Dynamic reconfiguration would probably be best included in benchmarking frameworks such as JMH for Java (https://github.com/openjdk/jmh), Go’s testing package (https://golang.org/pkg/testing), Google’s benchmark tool for C++ (https://github.com/google/benchmark), BenchmarkDotNet for C# (https://benchmarkdotnet.org), Criterion.rs for Rust (https://github.com/bheisler/criterion.rs), or criterion for Haskell (https://hackage.haskell.org/package/criterion). Indeed, we implemented a prototype in JMH, which is available online [Laaber et al., 2020d,e]. We can also envision harnesses of traditional application benchmarks, such as SPECjvm [Standard Performance Evaluation Corporation (SPEC), 2008], SPECjbb [Standard Performance Evaluation Corporation (SPEC), 2015], DaCapo [Blackburn et al., 2006], Da Capo con Scala [Sewe et al., 2011], Renaissance [Prokopec et al., 2019], and DeathStarBench [Gan et al., 2019], to implement dynamic reconfiguration.
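For context, the sketch below shows a minimal Go microbenchmark: the testing framework controls the repetition count via b.N, and dynamic reconfiguration would hook into exactly this kind of harness logic to stop once results are stable. The benchmarked workload and package name are placeholders.

```go
// Minimal Go microbenchmark for illustration; the harness (Go's testing
// package) decides how often the loop body is repeated via b.N.
package bench

import (
	"math/rand"
	"sort"
	"testing"
)

func BenchmarkSortInts(b *testing.B) {
	data := rand.Perm(1_000) // placeholder workload
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		in := append([]int(nil), data...)
		sort.Ints(in)
	}
}
```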

Static Analysis Tools Our solution that predicts unstable benchmarks relies only on statically-computable source code features, once a trained model is available. This could enable static analysis tools, such as PMD (https://pmd.github.io) or SpotBugs (https://spotbugs.github.io), to warn developers about potentially unstable benchmarks. Such a static analysis tool can then easily be integrated into CI, as mentioned above, into version control systems (VCSs) as post-commit hooks, e.g., on GitHub (https://github.com) or GitLab, or into the developer’s integrated development environment (IDE). Damasceno Costa et al. [2019] developed an approach to detect benchmark anti-patterns, which they integrated into SpotBugs. Our instability prediction could augment their approach. Such static analysis tools could help developers to write high-quality benchmarks, e.g., when integrated in their IDE, and prevent “bad” benchmarks from being committed, built, and executed.
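As a rough illustration of what such a static analysis could look like, the following Go sketch counts a few statically-computable features (loops, goroutines, and calls into I/O- or concurrency-related packages) using the standard go/ast tooling; the chosen feature set and file handling are assumptions, not the dissertation’s 58-feature implementation.

```go
// Illustrative extraction of simple static features from a single Go file.
package features

import (
	"go/ast"
	"go/parser"
	"go/token"
	"strings"
)

func ExtractFeatures(filename string) (map[string]int, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, filename, nil, 0)
	if err != nil {
		return nil, err
	}

	counts := map[string]int{}
	ast.Inspect(file, func(n ast.Node) bool {
		switch node := n.(type) {
		case *ast.ForStmt, *ast.RangeStmt:
			counts["loops"]++
		case *ast.GoStmt:
			counts["goroutines"]++
		case *ast.SelectorExpr:
			// Count calls/selectors on packages that often correlate with
			// unstable results (I/O, concurrency).
			if pkg, ok := node.X.(*ast.Ident); ok {
				switch {
				case pkg.Name == "os" || pkg.Name == "io":
					counts["io_calls"]++
				case strings.HasPrefix(pkg.Name, "sync"):
					counts["concurrency_calls"]++
				}
			}
		}
		return true
	})
	return counts, nil
}
```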

1.5.3 Limitations

This section reflects on the limitations of the dissertation as well as the threats to validity of the results that are applicable across multiple studies. More detailed and targeted threats to validity and mitigation strategies can be found in each chapter.

Project and Benchmark Selection The generalizability of the dissertation’s results hinges on the studied projects. While our studies all rely on real-world projects and benchmark suites, there are a number of limitations that need to be considered. First, we rely solely on OSS projects. The majority are arguably of “industry-strength”, either developed by large software companies or widely used by other OSS software. However, we cannot claim that the results also translate to closed-source or industrial microbenchmark suites. Second, the majority of the studies and evaluations rely on a restricted set of projects and benchmarks. The studies assessing the challenges in RQ 1 that rely on executions and the evaluations of the TCP techniques and dynamic reconfiguration depend on 10 projects. The simulation study in RQ 4 relies on only 19 benchmarks from 4 projects, but expands on the number of execution environments instead. Due to the extensive benchmark suite runtimes, studying more projects and benchmarks would render these studies infeasible. However, we aim at a diverse set of projects from different domains to mitigate this limitation. Only the sample study in RQ 1 and the benchmark instability prediction evaluation include hundreds of different projects each. Third, each evaluation usually depends on projects from a single programming language. The solutions and techniques are in principle applicable to benchmarks written in any language. However, supporting multiple languages would require implementing these for each language largely from scratch, which is considered infeasible in the context of the dissertation. Nevertheless, in RQ 1, we assess the challenges on projects from two languages, i.e., Java and Go, which have different properties. Java is dynamically (just-in-time) compiled while Go is statically compiled to machine code, which can have an impact on the magnitude of the challenges and, therefore, our results.

Execution Environment and Setup The execution environment impacts the benchmark’s performance measurements and, consequently, the results of the dissertation. The evaluations for all research questions rely on benchmark executions. As execution data is not publicly available, we have to execute the benchmarks ourselves. For the majority of the studies, we use controlled, bare-metal environments to reduce confounding factors. Using bare-metal environments increases internal validity but decreases external validity [Siegmund et al., 2015a], because the execution environment might be artificial and might not reflect an environment that developers would use. In addition, we conduct a field experiment on different cloud environments from three public cloud providers, i.e., Amazon AWS, Google GCE, and Microsoft Azure, for RQ 1 and RQ 4. Their results show that unreliable (cloud) environments usually perform worse than bare-metal, which indicates that the results from our bare-metal evaluations are likely to be different in cloud environments.

The majority of the evaluations use custom benchmark configurations (i.e., number of repetitions) in order to make the total experiment durations feasible. For example, the TCP study executes a total of 59,164 benchmarks from 10 projects in 161 versions, which takes about 90 days to execute. Using the often extensive developer-defined configurations would be infeasible in the dissertation’s context. On the other hand, the sample study that parses configurations reveals that developers often under-configure their benchmarks, which could lead to unreliable results. Hence, we opt for custom configurations that balance runtime and result reliability. Two evaluations, the previously mentioned sample study as well as the evaluation of dynamic reconfiguration, use developer-defined or framework-default configurations.

Effectiveness Metrics Using the right metrics to assess the effectiveness of the solutions and techniques is important for high construct validity. On the one hand, our evaluations rely on standardized metrics, i.e., AUC and MCC for prediction performance (RQ 3) or standardized statistical tests (RQ 1, RQ 3, and RQ 4). On the other hand, they increase construct validity by including multiple metrics, i.e., APFD-P and Top-3 for TCP effectiveness (RQ 4), confidence intervals and hypothesis tests for change detection (RQ 4), or confidence interval widths and mean change rate of the runtime for benchmark stability (RQ 3). Nonetheless, it is possible that the chosen metrics do not accurately reflect the desired evaluation property.

Parameter Assumptions The dissertation’s solutions and techniques often require certain parameter values to be set. Dynamic reconfiguration uses a sliding window of a defined size, statistical procedures with different parameters (i.e., confidence level, p-value, strips), and stoppage criteria thresholds. We base our choices on previous research [Georges et al., 2007; He et al., 2019], statistical best practice [Hesterberg, 2015], or pre-study experiments. Instability prediction uses a threshold to transform result variability into the binary classes stable and unstable. The evaluation considers different thresholds inspired by previous research [Georges et al., 2007; Curtsinger and Berger, 2013]. It further conducts a sensitivity analysis on the number of benchmark repetitions and their impact on prediction performance. Finally, TCP effectiveness is affected by the performance change size that is considered significant. We conduct a sensitivity analysis on this threshold as well. Despite the efforts to deal with different parameter values, the results might change for values not used in the dissertation.

Empirical Study Types The final limitations relate to the empirical study types employed to investigate the thesis and answer the research questions. We use the nomenclature of Stol and Fitzgerald [2018]. The types used to answer RQ 1 are quite diverse: multiple laboratory experiments, a field experiment, and a sample study. However, the types in the remaining studies are less diverse. We evaluated the majority of the solutions and techniques with laboratory experiments. While they ensure a high precision of measurement, as the setting is contrived, the realism of the study context suffers. Only the study on cloud environments uses data from a field experiment, but then assesses the detectable performance changes with an experimental simulation injecting artificial slowdowns. Moreover, the dissertation evaluates the extent of the challenges as well as the solutions to address these solely with quantitative data. A judgment study with benchmarking experts or a case study within a company would increase the reliability of our findings. Qualitative methods such as interviews would further help to assess the challenges and could validate the usefulness of the investigated solutions and techniques.

1.6 Scientific Implications

This section discusses the dissertation’s scientific implications for researchers and developers.

A Lot of Microbenchmarking Time can be Saved Real-world benchmark suites often have extensive runtimes, which developers seem to address by under-configuring the benchmarks. This ultimately trades result stability for shorter runtime. Using a data-driven solution, such as dynamic reconfiguration, that considers result variability as the important objective to decide when to stop a benchmark can solve this conundrum. Benchmarks are individually stopped based on their exact results and the environment they are executed in. While we show that our solution can drastically save runtime, the overall savings will reach a point where no more time can be saved. If at this point the suite runtime still requires multiple hours, running the benchmarks might not be possible in time-constrained environments such as CI or on every new software version, unless other time reduction techniques are applied, e.g., regression testing approaches.

Performance Regression Testing: The Key to Continuous Performance Assessment? Although dynamic reconfiguration enables reducing the runtime drastically, for projects with extensive suites the time savings are not enough to assess the performance on every commit. Employing regression testing approaches to focus on the most important benchmarks could be the solution. Applying TCP is a promising way to execute more important benchmarks sooner. However, traditional TCP techniques are not as effective on benchmarks as they are on functional unit tests. This suggests that techniques targeted at benchmarks are required. Moreover, the results from the TCP study as well as the study on whether benchmarks detect important changes (RQ 1) indicate that there is high redundancy within suites, i.e., many benchmarks cover the same code and detect the same performance changes. Based on these observations, lowering the number of benchmarks in a suite with test suite reduction could be worthwhile. Finally, using regression test selection (RTS) approaches to select the benchmarks that are most important to run on every commit would further reduce the overall runtime. An optimal solution would probably combine multiple regression testing techniques with the execution strategies and solutions of this dissertation.

Result Variability must Always be Considered One recurring topic in this dissertation is the effect of result variability on the studied solutions and techniques. Apart from the solutions in RQ 3 that directly tackle variability, all other solutions are affected by it as well. Dynamic reconfiguration would not work without considering variability proxies (i.e., stoppage criteria). The TCP effectiveness metrics rely on rigorously executed benchmarks and must account for variability when considering what is a significant performance change. Employing highly unstable execution environments (i.e., cloud infrastructure) is only possible if the inherent variability is ironed out by consciously repeating within and across instances as well as executing test and control group on the same instances (i.e., batch testing). When configuring benchmarks, or any other performance measurement, setting the right repetition values is paramount to achieving reliable results. For researchers, conducting performance evaluations without considering result variability can have detrimental effects on internal validity. Moreover, novel techniques for solving benchmarking problems must always consider the impact on result variability. For developers, unreliable results invalidate the conclusion whether a benchmark unveils a performance change. Resulting FPs or false negatives (FNs) either waste developers’ time hunting for a change that does not exist or let potentially important changes slip through unnoticed.

Unreliable Environments can Supply Surprisingly Reliable Results With the emergence of hosted CI solutions and the availability of cheap computing resources, executing benchmarks on cloud infrastructure becomes appealing if not necessary. Contrary to common belief and to what best practice suggests, unreliable cloud infrastructure can be effectively utilized to find reasonably small performance changes (i.e., 10%) reliably. The key to this is executing test and control groups, i.e., old and new version or alternative implementations of some functionality, on the same instances and repeating the execution within and across instances. Executing benchmarks on cloud instances could also solve the challenge of long runtimes, as benchmarks can be executed in parallel on different instances. The only limiting factor is then the cost of such an experiment.

What are Important Microbenchmarks and Performance Changes Really? Different from functional tests, which either reveal a fault or not (disregarding flaky tests), benchmarks unveil many slowdowns and improvements of different sizes in the same commit [Chen and Shang, 2017]. Measurement uncertainty impacts the change size and might even lead to FPs. Moreover, the simultaneously occurring slowdowns and improvements require identifying the changes that are important and, consequently, the benchmarks that uncover important changes. A natural definition of importance is the size of a change, i.e., larger changes are more important than smaller ones, and, consequently, benchmarks uncovering larger changes are more important. This is also the definition mostly used to assess the solutions and techniques in this dissertation. However, it is questionable whether size is indeed the best proxy for importance. For example, a large slowdown in a code segment that is hardly used might not be as important to address as a small slowdown in a core, heavily-used part of the software. Following this argumentation, it is unclear whether the effectiveness of a technique can realistically be evaluated with metrics relying only on change size. This would require additionally evaluating techniques with respect to what is “really” important. Ways to address this might be the approach from RQ 1, which defines importance by the usage count of clients, or relying on the domain knowledge of developers.

Machine Learning: The Ultimate Hammer? Targeted approaches for improving benchmarking traditionally involved building complex performance inference models (e.g., Jin et al. [2012]; Huang et al. [2014]; Mostafa et al. [2017]). The TCP study shows that simple solutions are often not as effective in the context of performance evaluations. On the other hand, our machine learning model for predicting unstable benchmarks is a great example of combining simple features with a learning algorithm to solve hard challenges that would otherwise require sophisticated, complex models. Others have recently investigated machine learning approaches to predict performance-related properties to great success as well (e.g., Siegmund et al. [2015b]; Foo et al. [2015]; Gao and Jiang [2017]; Luo et al. [2017]; Chen et al. [2020]; Ding et al. [2020]). Hence, general solutions that address the three challenges can potentially greatly benefit from learning algorithms.

1.7 Opportunities for Future Research

Software microbenchmarking is a relatively new subject of research and, consequently, much remains to be done. This section describes potential research opportunities.

Regression Testing with Microbenchmarks The previous section touched on the point that a combination of different regression testing approaches is necessary to include microbenchmarks in CI and DevOps. We see three directions to improve regression testing of microbenchmarks: (1) Our study on TCP shows that basic techniques from unit testing do not necessarily perform equally well for microbenchmarks. There is a myriad of other TCP research on unit testing (e.g., Yoo and Harman [2012]; Hao et al. [2014]; Gligoric et al. [2015]; Zhang [2018]) which would need to be similarly revalidated on microbenchmarks. (2) Microbenchmarks appear to behave differently than unit tests. Consequently, there is a need for tailored regression testing solutions that work well and are generally applicable. In particular, novel solutions should acknowledge result variability, deal with redundancy among benchmarks, and handle heavily parameterized benchmarks. (3) Research often evaluates novel solutions in isolation, i.e., the effectiveness of one technique is shown over a baseline. However, due to the extremely long runtimes of microbenchmark suites, a combination of multiple regression testing approaches with holistic evaluations is required. For example, combining selection of commits to test (e.g., Huang et al. [2014]; Sandoval Alcocer et al. [2016, 2020]), selection of benchmarks (e.g., de Oliveira et al. [2017]; Alshoaibi et al. [2019]; Chen et al. [2020]), and prioritization of benchmarks (e.g., Mostafa et al. [2017] or this dissertation’s TCP study).

Microbenchmark Generation Microbenchmarking is still a niche technique compared to functional unit testing [Stefan et al., 2017]. One potential reason for this is that writing good benchmarks requires knowledge about programming language internals such as compiler optimizations, rigorous execution methodologies to retrieve reliable results, as well as statistical techniques to assess the results. A way to relieve developers of the burden of writing benchmarks is to generate them automatically. Bulej et al. [2012, 2017a] generate, run, and evaluate benchmarks using developer-specified performance annotations and input generators. Rodriguez-Cancio et al. [2016] generate benchmarks from developer-selected code snippets with inputs based on unit test executions. A potential avenue for future research is completely automatic benchmark generation. We see two concrete opportunities: (1) Using search-based techniques to generate benchmarks and their inputs, e.g., with an adapted version of Evosuite [Fraser and Arcuri, 2011]. Such an approach would need to investigate which objectives the search algorithm should optimize for. Our results from the TCP study indicate that coverage is likely not an ideal objective. (2) Transforming unit tests into proper microbenchmarks. A big challenge here is to identify unit tests that are worth transforming. The approach by Ding et al. [2020] or our approach to identify important methods (from RQ 1 and Chapter 2) could be used for that.

Both opportunities could leverage our instability prediction to only generate benchmarks that will have stable results. An evaluation of these could compare the generated tests to the existing benchmarks of a project in terms of result quality and change detection capability.

Developer Support for Writing Microbenchmarks The previous paragraph outlined potential reasons that make writing microbenchmarks difficult. Apart from automatically generating these, we also envision better support for developers when they write benchmarks. Static analysis tools, such as our instability prediction and Damasceno Costa et al. [2019]’s anti-pattern detection, can be a first step towards that goal. Other support could come in the form of IDE plugins that “guide” developers in the creation process. An extension of our instability prediction could utilize dynamically-extracted features to build the model and investigate the feature importance. Although the feature importance would only show correlation with benchmark stability, an in-depth study investigating which source code features causally affect stability would provide actionable insights that could be used in the writing guidance.

Similarities, Differences, and Synergies of Performance Test Types There are different types of performance tests such as load tests, application/system benchmarks, or microbenchmarks [Jiang and Hassan, 2015]. They have different goals and are applied in different contexts; however, it may be possible to employ all performance test types on the same software. In such a scenario, it is unclear whether different types can find the same performance problems and which types of faults are detected by only one test type. For example, if a set of microbenchmarks is able to detect a performance degradation which is also caught by a load test but at a lower cost, running the load test could be postponed to later stages, or it might even be completely skipped. On the other hand, if a load test detects a performance problem, microbenchmarks could potentially be used for root cause analysis. Hence, future research should investigate the similarities and differences between as well as exploit the synergies among different types to improve performance testing overall.

Developers’ Needs when Evaluating Performance Performance engineering research, this dissertation included, is usually a quantitative discipline, i.e., studies execute benchmarks, compare results, and draw conclusions based on this data. However, there is a lack of empirical studies that investigate the human side of dealing with performance problems. There is a need for studying which information developers need, which techniques help, and which data is important to solve performance problems. While this dissertation uses client usage and change size as a proxy for benchmark importance, developers could provide a different perspective. Such qualitative, human-oriented studies could positively influence performance engineering research to create (even more) impactful solutions.

CI Inclusion Research, this dissertation included, often suggests that its effective solutions will pave the way to including performance testing in CI. This would arguably bring the benefits of CI, such as detecting (performance) faults sooner [Hilton et al., 2016]. To the best of our knowledge, there is no scientific study that actually included performance testing in CI, apart from industry reports [Ingo and Daly, 2020; Mozilla]. We believe it is of the highest importance to empirically study the costs, benefits, and tradeoffs of including performance testing in CI. For this, research collaboration with industrial partners can provide the required setting.

1.8 Related Work

The research related to this dissertation can be categorized into (1) performance measurements, (2) performance results from the cloud, (3) performance bugs, (4) performance testing, and (5) performance regression testing. In the following sections, we outline the most important related work at a high level. More detailed descriptions and comparisons are part of Chapters 2 to 6.

Performance Measurements At the core of every measurement-based performance evaluation technique is the art of measuring performance. Many factors influence and can even invalidate the measurement. Georges et al. [2007] outline in their landmark work techniques to rigorously measure and analyze performance of Java software. They discuss different experimental design decisions, such as multiple iterations and virtual machine (VM) invocations. This dissertation follows these guidelines and takes multiple different design decisions, e.g., consistently employing multiple iterations and VM invocations or interleaved measurements (see also Abedi and Brecht [2017]). Georges et al. [2007] suggest using the CV and He et al. [2019] propose a technique based on KLD to assess result stability, both of which we also rely on for dynamic reconfiguration. Kalibera and Jones [2012, 2013] and Maricq et al. [2018] estimate the required repetitions based on mathematical models and confidence interval widths. Our work does not estimate repetitions before the execution, but it uses confidence interval widths as a measure of variability.

There are a number of factors that can influence the validity of the measurements. Mytkowicz et al. [2009] show that the UNIX environment size as well as link order have a considerable impact. DataMill [de Oliveira et al., 2013b] and STABILIZER [Curtsinger and Berger, 2013] reduce such measurement bias through factor randomization to produce reliable results. Our study on reliably detectable changes in unreliable environments aims at neutralizing influencing factors through extensive repetition within and across cloud instances.

In terms of results, variability is a common issue [Georges et al., 2007; Maricq et al., 2018]. Moreover, benchmarks often have a start-up phase, where results are not yet stable, and multiple different steady states [Georges et al., 2007; Gil et al., 2011]. To address these, enough measurements in combination with adequate statistical tests are required. Our dynamic reconfiguration approach can help to execute enough measurements for each benchmark individually. Throughout this dissertation, we also employ rigorous statistical techniques to deal with the peculiarities of performance data, mostly based on bootstrap confidence intervals [Davison and Hinkley, 1997; Ren et al., 2010; Hesterberg, 2015], similar to other researchers [Kalibera and Jones, 2012; Bakshy and Frachtenberg, 2015; Wang et al., 2018; He et al., 2019].

Performance Results from the Cloud Rigorous performance evaluations are best executed in controlled, bare-metal environments. Cloud infrastructure is a compelling alternative due to its availability and lack of required maintenance for users. However, cloud instances are also known to have unreliable performance [Iosup et al., 2011b]. This unreliability often stems from virtualization, co-located neighbors, unstable I/O performance, or hardware heterogeneity [Leitner and Cito, 2016]. Papadopoulos et al. [2019] outline principles to be followed when executing on cloud infrastructure and still retrieving reproducible performance results. Arif et al. [2018] show that there is a significant discrepancy between load test results from physical and virtual environments. This dissertation augments these studies and shows that microbenchmark results are similarly influenced by different cloud environments.

Wang et al. [2018] utilize resource profiles to estimate the impact on an application’s performance when deployed in the cloud. Similarly, Scheuner and Leitner [2018] predict an application’s performance from system microbenchmark results. Our benchmark instability prediction takes a different approach: on the one hand, it predicts whether a benchmark (not an application) will be stable or not; on the other hand, it solely relies on source code features that approximate performance behavior rather than utilizing system benchmark results. However, such additional data could potentially enhance our instability prediction.

He et al. [2019] devise a dynamic solution to retrieve reliable performance test results when executed in cloud environments. Dynamic reconfiguration is similar but adapted for microbenchmarks. Finally, Bulej et al. [2020] introduce duet benchmarking, which synchronizes two benchmarks and executes them in parallel. In this way, cloud environment influences affect both benchmarks equally, and change detection is improved. This approach is closely related to our simulation on the detectable change sizes in cloud environments. Our study, however, executes benchmarks across instance boundaries and investigates back-to-back and interleaved executions.

Performance Bugs Performance bugs or faults have been extensively studied by research. Compared to functional bugs, they are in many aspects trickier to deal with: they go undiscovered longer [Jin et al., 2012], are harder to reproduce [Zaman et al., 2012; Nistor et al., 2013a], take longer to resolve [Zaman et al., 2011; Nistor et al., 2013a; Liu et al., 2014; Mazuera-Rozo et al., 2020], and require more as well as more experienced developers to do so [Zaman et al., 2011; Nistor et al., 2013a; Chen et al., 2019]. Zaman et al. [2012] found that many performance bugs are regression bugs, release blocking, and often explicitly not fixed. Moreover, less than 40% of the bug reports contain measurements. Performance bugs are mostly introduced when fixing functional bugs and hardly with new features [Chen and Shang, 2017]. About every third version introduces a performance bug [Sandoval Alcocer and Bergel, 2015]. This shows that continuous performance assessment is necessary. Our work supplies techniques to make this feasible. Root causes of performance bugs are varied. Inefficient loops and unnecessary or expensive function calls are contributing factors [Jin et al., 2012; Sandoval Alcocer and Bergel, 2015; Selakovic and Pradel, 2016; Chen and Shang, 2017; Zhao et al., 2020]. Repeated, unnecessary computations are another [Selakovic and Pradel, 2016]. Often developers choose suboptimal or wrong APIs or data structures that lead to performance bugs [Jin et al., 2012; Sandoval Alcocer and Bergel, 2015; Selakovic and Pradel, 2016; Chen and Shang, 2017]. Concurrency and synchronization is another root cause for performance faults [Jin et al., 2012; Alam et al., 2017; Chen and Shang, 2017; Zhao et al., 2020]. For mobile apps, the UI and resource leaks are the major causes of performance bugs [Liu et al., 2014; Mazuera-Rozo et al., 2020]. Our work on predicting benchmark instability is influenced by all these works and defines its source code features accordingly.

Performance Testing To uncover performance bugs, performance testing is one essential technique. Load testing has traditionally received research’s focus [Weyuker and Vokolos, 2000; Menascé, 2002; Jiang and Hassan, 2015] and has successfully been applied in industry [Nguyen et al., 2014; Foo et al., 2015]. Lately, load testing research dedicated its efforts to reducing the runtime by detecting repetitive system states [AlGhamdi et al., 2016, 2020] and using statistics on the already gathered results [He et al., 2019]. Our dynamic reconfiguration approach has the same goal, i.e., reducing runtime, but aims at microbenchmarks instead. Microbenchmarking received increased attention from research recently. Leitner and Bezemer [2017] and Stefan et al. [2017] investigate the current state in OSS and find microbenchmarking to still be a niche technique. Horký et al. [2015] utilize microbenchmarks for raising performance awareness through documentation, and Bulej et al. [2012, 2017a] enable hypothesis-based performance assessment through annotations. Chen and Shang [2017] find that microbenchmarks detect many performance improvements and regressions in the same version. Our TCP study therefore studies the impact of which change sizes are considered significant. Damasceno Costa et al. [2019] detect microbenchmark anti-patterns and show that they have a considerable impact on their results. Our instability prediction is similar: it statically assesses whether a benchmark will produce reliable results. Ding et al. [2020] investigate whether functional unit tests can be used to unveil performance bugs and describe a machine learning approach to identify these tests. Our instability prediction model uses similar features, which are also inspired by performance bug research.

Performance Regression Testing Research on regression testing mostly focuses on selecting the relevant tests or prioritizing the tests with the highest fault-exposing potential [Yoo and Harman, 2012]. Huang et al. [2014] devise a static cost model to check whether a commit should be performance tested. Sandoval Alcocer et al. [2016, 2020] employ a hybrid cost model that adds dynamic information from the version history. While these approaches select commits to be (fully) performance tested, others select individual benchmarks on every commit, i.e., RTS for benchmarks. de Oliveira et al. [2017] utilize dynamic and static source code indicators to select benchmarks if their indicator values are above a certain threshold. Alshoaibi et al. [2019] extend this by formulating the selection as a search problem. The TCP techniques in our study nicely complement these selection approaches. In addition, our instability prediction approach could also be used for selecting only stable benchmarks. However, such an approach should also consider the code change in the prediction model. Chen et al. [2020] perform such a selection for performance tests utilizing source code features and machine learning algorithms. They, however, perform the selection on functional unit tests (used for performance), and it is unclear how their approach works for benchmarks. In terms of TCP, Mostafa et al. [2017] are, to the best of our knowledge, the only ones to perform TCP on performance tests. They employ a complex performance impact model that works for collection-intensive software. Our TCP study investigates traditional strategies from unit testing research on a broad spectrum. Chen et al. [2020] also perform prioritization using their classification model and normalize tests by their cost (i.e., duration). They use functional unit tests as performance tests, and therefore such a cost-aware prioritization is reasonable. However, all benchmarks that are configured equally have approximately the same runtime cost, due to their execution iterations having fixed durations.

1.9 Summary and Contributions

This dissertation’s goal was to investigate software microbenchmark challenges related to aspects of their execution. In particular, the focus was on (1) runtime, (2) result variability, and (3) change detection. As a first step, we assessed the extent of the three challenges in real-world projects. We found that microbenchmarks are considerably affected by all three challenges, which led to exploring novel solutions as well as existing techniques to address these. We evaluated the various solutions and techniques with multiple experiments and showed that they can (1) considerably save runtime without increasing result variability, (2) identify unstable benchmarks without running them, (3) prioritize benchmarks effectively and efficiently to detect larger changes sooner, and (4) utilize specific strategies to execute benchmarks in unreliable cloud environments to reliably detect small changes. These results support our thesis that software microbenchmarking requires conscious planning and execution strategies to tackle the three challenges. To summarize, this dissertation makes the following scientific contributions:

(1) We present empirical findings from a series of studies showing the extent of the three challenges in OSS projects written in Java and Go, when benchmarks are executed in bare-metal as well as cloud environments.

(2) We devise an approach, dynamic reconfiguration, which stops benchmark executions when the measurement is stable.

(3) We devise an approach to predict whether a benchmark will be unstable, based on machine learning classifiers and statically-computed source code features.

(4) We present empirical findings showing the effectiveness and efficiency of traditional TCP techniques applied to microbenchmarks.

(5) We present an experimental simulation showing high numbers of falsely detected performance changes (i.e., FPs) and two execution strategies to circumvent these FPs to reliably detect small changes when using cloud instances as execution environments.

(6) We provide 6 microbenchmark datasets, depicted in Table 1.1.

Table 1.1: Datasets created for this dissertation. Environment “BM” corresponds to bare-metal. The table lists, per dataset, its type (meta-data and source code, or execution), the numbers of projects, benchmarks, and parameters, whether it is multi-version, the languages (Java, Go), the environments (BM, AWS, GCE, Azure, Bluemix), the covered measurement levels (invocations, repetitions, iterations, trials, machines), the number of data points, and a reference [Laaber and Leitner, 2018b; Laaber et al., 2019a, 2020a,b,e].

1.10 Thesis Roadmap

The remainder of the dissertation consists of 5 chapters, each corresponding to a paper published or submitted to an international, peer-reviewed journal or conference. Figure 1.3 depicts the mapping from chapters to papers and from the chapters’ research questions (gray circles on the left) to the dissertation’s research questions (orange circles on the right). Table 1.2 shows the contributions of the dissertation’s author to the chapters.

Chapter 2 assesses the extent of the three challenges by executing real-world microbenchmark suites.

Chapter 3 studies benchmark result variability when executed in multiple, different bare-metal and cloud environments. It further investigates execution strategies to reduce FPs and to minimize the change sizes that can be reliably detected.

Chapter 4 introduces dynamic reconfiguration, an approach to reduce benchmarking runtime without increasing result variability. It further conducts a large-scale pre-study on benchmark configurations.

Chapter 5 introduces the approach to predict unstable benchmarks using machine learning classifiers and statically-computed source code features.

Chapter 6 investigates traditional TCP techniques applied to benchmarks.

Chapter 2: “An Evaluation of Open-Source Software Microbenchmark Suites for Continuous Performance Assessment”, Christoph Laaber and Philipp Leitner. Published in Proceedings of the 15th IEEE/ACM International Conference on Mining Software Repositories (MSR ’18). Contributes to RQ 1.

Chapter 3: “Software Microbenchmarking in the Cloud. How Bad is it Really?”, Christoph Laaber, Joel Scheuner, and Philipp Leitner. Published in Empirical Software Engineering 24 (2019). Contributes to RQ 1 and RQ 4.

Chapter 4: “Dynamically Reconfiguring Software Microbenchmarks: Reducing Execution Time without Sacrificing Result Quality”, Christoph Laaber, Stefan Würsten, Harald C. Gall, and Philipp Leitner. Published in Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE ’20). Contributes to RQ 1, RQ 2, and RQ 3.

Chapter 5: “Predicting Unstable Software Benchmarks Using Static Source Code Features”, Christoph Laaber, Michael Basmaci, and Pasquale Salza. In submission under major revision at Empirical Software Engineering. Contributes to RQ 3.

Chapter 6: “Applying Test Case Prioritization to Software Microbenchmarks”, Christoph Laaber, Harald C. Gall, and Philipp Leitner. In submission under major revision at Empirical Software Engineering. Contributes to RQ 4.

Figure 1.3: Chapters and papers of the dissertation

Table 1.2: Contributions of the dissertation’s author to the chapters according to the Contributor Roles Taxonomy (CRediT, http://credit.niso.org). The roles considered are Conceptualization, Data Curation, Formal Analysis, Funding Acquisition, Investigation, Methodology, Project Administration, Resources, Software, Supervision, Validation, Visualization, Writing – Original Draft, and Writing – Review & Editing; for each of Chapters 2 to 6, the table marks the roles to which the author contributed.

2

An Evaluation of Open-Source Software Microbenchmark Suites for Continuous Performance Assessment

Christoph Laaber and Philipp Leitner Published in Proceedings of the 15th IEEE/ACM International Conference on Mining Software Repositories (MSR ’18)

Abstract

Continuous integration (CI) emphasizes quick feedback to developers. This is at odds with current practice of performance testing, which predominantly focuses on long-running tests against entire systems in production-like environments. Alternatively, software microbenchmarking attempts to establish a performance baseline for small code fragments in short time. This paper investigates the quality of microbenchmark suites with a focus on suitability to deliver quick performance feedback and CI integration. We study ten open-source libraries written in Java and Go with benchmark suite sizes ranging from 16 to 983 tests, and runtimes between 11 minutes and 8.75 hours. We show that our study subjects include benchmarks with result variability of 50% or higher, indicating that not all benchmarks are useful for reliable discovery of slowdowns. We further artificially inject actual slowdowns into public application programming interface (API) methods of the study subjects and test whether test suites are able to discover them. We introduce a performance-test quality metric called the API benchmarking score (ABS). ABS represents a benchmark suite's ability to find slowdowns among a set of defined core API methods. Resulting benchmarking scores (i.e., fraction of discovered slowdowns) vary between 10% and 100% for the study subjects. This paper's methodology and results can be used to (1) assess the quality of existing microbenchmark suites, (2) select a set of tests to be run as part of CI, and (3) suggest or generate benchmarks for currently untested parts of an API.

2.1 Introduction

CI [Ståhl and Bosch, 2014] and continuous delivery (CD) [Chen, 2015] have become standard development practices in both open-source software (OSS) and commercial software projects. A core tenet of CI and CD is a focus on short cycle times, so as to innovate and deliver value to customers as quickly as possible. One challenge of this "move fast" mentality is ensuring software quality [Rubin and Rinard, 2016]. Regarding functional quality, unit and regression test suites can easily be automatically executed as part of the CI build. Unfortunately, state-of-the-art performance testing practices are harder to align with CI. They usually encompass executing long-running test workloads against entire systems in production-like environments [Jiang and Hassan, 2015], which is both hard to fully automate and too time-consuming to run for every build [Fagerström et al., 2016]. However, if performance tests are only selectively run outside the CD pipeline, there is a danger of deploying performance regressions to production, and it becomes harder to identify which code change has actually introduced a problem. For some projects, one way to approach this issue may be software microbenchmarking, i.e., utilizing small code-level benchmarks which can, at least in theory, be run on every build [Bulej et al., 2017a].

Previous research on software microbenchmarks studied root causes [Chen and Shang, 2017], raised performance awareness of developers through documentation [Horký et al., 2015], and studied quantitative and qualitative aspects of microbenchmarking in Java [Stefan et al., 2017; Leitner and Bezemer, 2017]. In this paper, we study the quality of existing software microbenchmark suites of ten OSS projects, five written in Java and five written in Go. The context of our study is to evaluate how useful these benchmarks are for continuous performance assessment, but our results also serve as a general survey of the quality of microbenchmarking in OSS. This work expands on the current state by (1) exploring benchmark-result variability in different environments, i.e., bare-metal and cloud; (2) highlighting similarities and differences of a dynamically compiled/VM-based language (i.e., Java) and a statically compiled language (i.e., Go); and (3) proposing an approach to assess the ability of a benchmark suite to find slowdowns within a defined subset of a project's API, identifying untested parts of it, and providing empirical data from OSS projects. We thoroughly investigate the following research questions:

RQ 1 How extensive are the microbenchmark suites in the study subject projects?

We investigate how large the microbenchmark suites are, and how long they take to execute. We find that the projects have suites comprising between 16 and 983 individual benchmarks, which take between 11 minutes and 8.75 hours for a single complete run in a standard configuration.

RQ 2 How stable are the results of microbenchmarks between identical executions?

We repeatedly execute each suite five times and record the difference in results between runs. As the stability of results will depend on the execution environment, we run all suites both in an instance rented from Google Compute Engine (GCE) as an example public cloud provider, and on a self-managed bare-metal server. We find that many projects include benchmarks with a maximum result variability of 50% or higher even on bare-metal, indicating that not all benchmarks are useful for reliable slowdown discovery.

RQ 3 Are the microbenchmark suites able to identify actual slowdowns?

To evaluate how suitable microbenchmark suites are in finding actual slowdowns, we adopt a methodology inspired by mutation testing [Jia and Harman, 2011]. For each project, we identify 20 often-used public API methods by tracing their usage in the unit tests of other OSS projects on GitHub, and artificially introduce slowdowns into those methods. We study which of these artificial slowdowns can be found in our test setup and using the existing microbenchmarks. Using extensive experiments, we show that microbenchmark suites of our study subjects find slowdowns in between 10% and 100% of the selected 20 often-used methods. We further observe that many benchmarks appear redundant, in the sense that they seem to cover regressions in similar core methods. Our results serve as a baseline for how extensive and variable existing OSS microbenchmark suites are in different environments. The proposed methodology is useful for developers to assess the quality of a software's microbenchmark suite, select a set of benchmarks to be run in fast-paced CD pipelines, and retrieve suggestions or even generate stubs of benchmarks for currently untested parts of a project's core API.

2.2 Background

Performance testing is a widely used umbrella term for many different approaches. Most importantly, we distinguish between small-scale and highly granular performance tests on code level (software microbenchmarks or performance unit tests [Horký et al., 2015; Stefan et al., 2017]), and performance tests that target entire components or systems, typically under high load (load or stress tests). Microbenchmarks are executed in a unit-test-like fashion, whereas load tests bring up an entire production-like environment and run a defined workload against this system. In the present work, we focus on performance testing via microbenchmarks. Typical performance counters used in such tests include the average execution time of a method, throughput, heap utilization, lock contention, and input/output (I/O) operations. This section gives a brief overview of microbenchmarking in Java and Go.

    @Fork(1)
    @Warmup(iterations = 5)
    @Measurement(iterations = 10)
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    public class RuntimeSchemaBenchmark {
        ...
        @Benchmark
        public Int1 runtime_deserialize_1_int_field() throws Exception {
            Int1 int1 = new Int1();
            ProtobufIOUtil.mergeFrom(data_1_int, int1, int1RuntimeSchema);
            return int1;
        }
        ...
    }

Listing 2.1: JMH example from the protostuff project.

Java Microbenchmarking Harness Java OSS use a wide variety of microbenchmarking approaches, with Java Microbenchmarking Harness (JMH)1 being the closest to a standard at the time of writing. JMH is part of OpenJDK since version 1.7. Other microbenchmarking tools, as reported by Leitner and Bezemer [2017] and Stefan et al. [2017], such as Caliper, Japex, or JUnitPerf hardly receive any OSS attention, are discontinued, or are not executable in an automated way. Hence, we only consider JMH in this study, which allows users to specify benchmarks in a similar notation to JUnit tests through an annotation mechanism. Every public method annotated with @Benchmark is executed as part of the performance test suite. Listing 2.1 shows an example benchmark from one of the study subjects (protostuff). There are further configuration parameters that are either defined through annotations on test-class or test-method level, or through command-line parameters.

1 https://github.com/openjdk/jmh

These parameters include the number of warmup iterations, measurement iterations, forks (separate benchmark repetitions in fresh Java Virtual Machines (JVMs)), and performance counters to measure. A single iteration of a benchmark takes 1s by default, and runs as many executions of the benchmark as it can fit into that time period. After each iteration the resulting performance counters are evaluated, aggregated, and reported to the user. Before the actual iterations, a defined number of warmup iterations are executed. Moreover, JMH has the notion of parameterized benchmarks, where the same benchmark implementation is run for each combination of input parameters. A partial sample output for the benchmark defined in Listing 2.1 is shown in Listing 2.2.

    # Run progress: 0.00% complete, ETA 00:00:15
    # Fork: 1 of 1
    # Warmup Iteration 1: 11.417 ns/op
    ...
    # Warmup Iteration 5: 8.955 ns/op
    Iteration  1: 8.841 ns/op
    Iteration  2: 8.819 ns/op
    ...
    Iteration 10: 8.963 ns/op

    Result: 8.884 ±(99.9%) 0.169 ns/op [Average]
      Statistics: (min, avg, max) = (8.751, 8.884, 9.089), stdev = 0.112
      Confidence interval (99.9%): [8.715, 9.053]

Listing 2.2: JMH console output produced by the benchmark defined in Listing 2.1.

Benchmarking in Go The Go programming language has a benchmarking framework included in its standard library2, where benchmarks are defined as functions that have a name starting with Benchmark and a single parameter of type *testing.B in files with suffix _test.go. By default, benchmarks are executed once for at least 1s and average execution times are reported. In comparison to JMH, Go does not provide mechanisms for warmup iterations and parameterized tests. Nevertheless, through the command-line flag -count, a single benchmark can be repeated multiple times, similar to JMH's iterations.

2 https://golang.org/pkg/testing

Moreover, forks are not supported, presumably because Go is not executed in a virtual-machine environment, arguably removing the need for warmup iterations and investigation of inter-virtual-machine-variance.
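To make this convention concrete, the following minimal sketch shows what such a Go benchmark looks like; the benchmarked function and all names are illustrative and not taken from any study subject. Placed in a file ending in _test.go, it is picked up by go test -bench.

    package example

    import (
    	"strings"
    	"testing"
    )

    // join stands in for an arbitrary function under test.
    func join(parts []string) string {
    	return strings.Join(parts, ",")
    }

    // BenchmarkJoin follows the framework's convention: a name starting with
    // Benchmark and a single *testing.B parameter. The framework increases
    // b.N until the loop runs for at least the configured benchtime
    // (1s by default) and reports the average time per operation (ns/op).
    func BenchmarkJoin(b *testing.B) {
    	parts := []string{"a", "b", "c", "d"}
    	for i := 0; i < b.N; i++ {
    		join(parts)
    	}
    }

Repeating such a benchmark, as done in this study, then amounts to passing -count on the command line (e.g., go test -bench=. -count=20).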

2.3 Study Subjects

We selected ten study-subject projects, five written in the Java programming language and five written in Go. We chose Java because it is a well-established programming language, and consistently ranks highly in "most popular programming languages" rankings for various domains (e.g., Tiobe3). Go, on the other hand, is a fairly new programming language at the time of writing, which was introduced in 2009 and comes with a benchmarking library as part of its standard test framework. Microbenchmarking is predominantly used in middleware-style library projects (that is, projects such as Log4j2, RxJava, or etcd, as opposed to end-user-facing applications such as Firefox), hence we focus on such projects. To identify concrete projects, we used the public Google BigQuery dataset fh-bigquery:github_extracts4 to retrieve the projects for both Java (BigQuery table java_contents_2015) and Go (BigQuery table contents_go). These tables contain all Java files hosted on GitHub that have been committed in the year 2015, and all Go files. We queried for projects that use JMH by filtering for the import statement import org.openjdk.jmh, and for projects that use Go's benchmarking library via a query for *testing.B. We mapped the identified files back to the containing GitHub projects, and ordered the projects by stars and watchers. We then identified suitable study-subject projects for both languages by manually checking out the projects, verifying whether they compile, contain a non-trivial number of benchmarks (> 10) that largely execute successfully, and whether we could find non-trivial usage of these projects in other OSS (again using Google BigQuery). Table 2.1 shows summary statistics for the projects we have ultimately chosen.

3 https://www.tiobe.com/tiobe-index
4 https://bigquery.cloud.google.com/dataset/fh-bigquery:github_extracts

[Table 2.1: Overview of GitHub and benchmark metadata of the study-subject projects, listing for each project its language, description, URL, commit SHA, and the number of GitHub stars and watchers. The Java projects are Caffeine (high-performance in-memory caching library, github.com/ben-manes/caffeine), JCTools (concurrency tools with focus on message queues, github.com/JCTools/JCTools), Log4j2 (well-known logging framework, github.com/apache/logging-log4j2), protostuff (serialization library that supports schema evolution, github.com/protostuff/protostuff), and RxJava (library for composing asynchronous programs through observable sequences, github.com/ReactiveX/RxJava). The Go projects are bleve (library for text indexing, github.com/blevesearch/bleve), etcd (distributed consistent key/value store, github.com/coreos/etcd), fasthttp (high-performance HTTP library, github.com/valyala/fasthttp), gin (HTTP web framework, github.com/gin-gonic/gin), and otto (JavaScript parser and interpreter, github.com/robertkrimen/otto).]

The column "Commit SHA" in the table refers to the project's Git commit we have used for all our experiments. For all projects except RxJava, this SHA represents the current version of the master branch at the time of experimentation. For RxJava, we used the 1.x branch because version 2 was only released on October 29, 2016, hence other projects on GitHub still used RxJava version 1 at the time of writing. Note that the low star and watcher count for Log4j2 is due to the GitHub repository being merely a mirror of an Apache repository for this project.

2.4 RQ 1: Extent of Benchmark Suites

2.4.1 Context

Improving the build time, i.e., the time it takes for a CI server to check out, compile, and test the build, is a constant concern for CI and CD projects, and the foundation of continuous performance assessment. This is primarily for two reasons: (1) Quick builds allow for fast feedback to the developer, reducing the time that is "wasted" waiting for builds that ultimately fail. (2) State-of-practice CI platforms restrict builds to a certain maximum runtime. A prominent example is TravisCI, which at the time of writing employs a build job timeout of maximum 120 minutes5. Hence, we first study how large (in number of benchmarks) the microbenchmark suites of our study subjects are, and how long they take to execute.

5 https://docs.travis-ci.com/user/customizing-the-build#Build-Timeouts

2.4.2 Approach

To establish a baseline of how much time is necessary to execute the microbenchmarking suites of our study subjects, we execute each suite five times in two different environments. Firstly, similar to other works dealing with performance testing [Selakovic and Pradel, 2016; Bakshy and Frachtenberg, 2015], we executed all tests on a dedicated, non-virtualized ("bare-metal") server that we reserved exclusively for our performance testing. This server has a 12-core Intel Xeon X5670 @ 2.93GHz central processing unit (CPU) with 70 GiB memory, and runs ArchLinux with a kernel version 4.10.4-1-ARCH. It uses a Hitachi hard disk (HUA722020ALA330) with 7200rpm and a 32MB cache. We did not execute any other user-space applications except ssh during test runs, but did not explicitly disable hardware optimizations. Note that our benchmark experiments do not fully utilize this high-performance test machine. With our ultimate goal in mind (integrating performance testing with standard CI tooling), we also executed the test suites in virtual machines hosted in GCE. We used GCE's Infrastructure as a Service (IaaS) instances of machine type n1-standard-2 running Ubuntu 16.10 with a GNU/Linux 4.8.0-46-generic x86_64 kernel. This machine type comes with two virtual CPUs and 7.5 GB of memory. We chose GCE because of its stability compared to other IaaS providers [Leitner and Cito, 2016]. Experiments were conducted between March and May 2017, and in the us-central-1a region of GCE. One challenge with this approach is that JMH allows overriding benchmarking configurations (e.g., number of iterations, number of forks) via command-line parameters. Unfortunately, these customizations are not visible in the software repositories that we have access to. Hence, we use a uniform default JMH configuration for all Java projects (10 warmup iterations, 20 measurement iterations, 1 fork). For Go, we also repeat each benchmark 20 times. For all runs we record both the duration required for running the entire test suite once with the above configuration, and the results of all benchmarks (this data will be used in Section 2.5).

6 https://cloud.google.com/compute/docs/machine-types

2.4.3 Results and Analysis

Table 2.2 reports the size of the benchmark suites in number of benchmarks, the mean duration of a single execution of the entire suite, and the standard deviation of the duration between the different runs in hours.


Table 2.2: Size of microbenchmark suites and durations of a single complete run on a bare-metal machine and in a public cloud

Language   Project      Test Suite   Exec. Time Mean (Hours)   Exec. Time Std.Dev (Hours)
                        Size (#)     Bare-Metal    Cloud       Bare-Metal    Cloud
Java       Caffeine      89          0.79          1.81        0.00          0.08
           JCTools      169          2.28          2.30        0.00          0.00
           Log4j2       437          4.76          4.61        0.00          0.01
           protostuff    31          0.25          0.26        0.00          0.00
           RxJava       983          8.75          8.72        0.02          0.01
Go         bleve         70          1.16          0.92        0.10          0.10
           etcd          41          0.63          0.72        0.17          0.11
           fasthttp      99          1.30          1.20        0.10          0.06
           gin           16          0.19          0.18        0.00          0.00
           otto          49          0.55          0.50        0.00          0.01

For JMH projects, we count every test method annotated with the @Benchmark annotation. For parameterized benchmarks, we count every combination of parameters as a different benchmark, as they appear to the developer as individual benchmarks during JMH execution. The Go benchmarking library does not have a comparable parameterization concept, therefore every benchmark is counted exactly once. We observe that the size of microbenchmark suites among our study subjects varies widely. While Go projects tend to have smaller suites than the Java projects, this is not a strict rule. The two projects with the smallest microbenchmark suites are gin (a Go project with only 16 benchmarks) and protostuff (a Java project with 31 benchmarks). On the other hand, the two projects with the largest suites are both Java projects (Log4j2 with 437, and RxJava with 983 benchmarks). Consequently, the total duration of a single test run also varies substantially between projects. RxJava takes close to 9 hours to finish a single run even with the relatively short default JMH configuration that we used. All other Java projects except protostuff require (often substantially) longer than a full hour for a single test run, making them infeasible to be executed as part of CI runs.

Due to their smaller test suites, the Go projects also require less time for testing. fasthttp is the only Go project in our study to take over a full hour in both environments. Moreover, it is evident that the total duration of a run generally correlates strongly with the size of the microbenchmark suite. This is due to how both JMH and the Go benchmarking library work (see also Section 2.2): they attempt to repeatedly execute a given small benchmark code for a defined duration (e.g., 1 second), and then report summary statistics such as how often the snippet could be executed in this defined duration. Consequently, every benchmark requires close to the same time for execution. This also explains the generally low standard deviation of durations between our five experiment repetitions, and the often negligible difference between the duration in the bare-metal and public cloud environments (despite the bare-metal machine being much more powerful than the cloud instance that we used). The only exception here is Caffeine, which takes about 2.3 times longer in the cloud than in the bare-metal environment. This is due to the project employing time-consuming setups prior to each benchmark, which take significantly longer on the smaller cloud instance.

RQ1 Summary. We find that the projects have microbenchmarking suites comprising between 16 and 983 tests, which take between 11 minutes and 8.75 hours for a single run. We conclude that many projects have microbenchmark test suites that per se take too much time to be used for continuous performance assessment.

2.5 RQ 2: Stability of Benchmarks

2.5.1 Context

A fundamental challenge of performance testing is that results for most performance counters are nondeterministic. A certain baseline level of variability even between identical runs of the same benchmarks is hence to be expected.

How large this variability is depends on multiple factors, including the used programming language, the nature of the benchmark, and the stability of the environment [Laaber and Leitner, 2018a]. However, establishing this baseline variability gives us a good basis to evaluate the usefulness of the benchmarks in the project's microbenchmark suites in different environments. The variability of a benchmark affects the stability and reliability of its results. A lower result variability implies that a benchmark can detect smaller performance changes without reporting false positives (FPs) due to random fluctuations. Multiple factors, such as concurrent processes, virtualization, or I/O latencies, can cause these variations. Therefore, it is essential to measure a project's benchmark-suite stability to assess how large a performance change needs to be before it can be detected reliably.

2.5.2 Approach

As a metric for stability of microbenchmarks, we propose the maximum spread among the mean performance counter outcomes of n repeated runs. This metric is calculated for benchmarks b ∈ B. In our experiments, we use n = 5, and formally define the repeated benchmark executions as the set R = {1, 2, 3, 4, 5}. We chose n = 5 to have a reasonably high number of complete suite executions to catch inter-run result variance. Any concrete run r ∈ R produces a series of measurements for each b ∈ B, which we refer to as M_r^b, with ∀b ∈ B, ∀r ∈ R, ∀m ∈ M_r^b : m ∈ ℝ⁺. Further, we denote the arithmetic mean of the benchmark measurements of benchmark b in run r as M̄_r^b, and the arithmetic mean over all runs as M̄^b. The maximum spread is then a function maxSpread : B → ℝ⁺, defined as Equation 2.1.

    maxSpread(b) = max_{r1, r2 ∈ R} ( (M̄_{r1}^b − M̄_{r2}^b) / M̄^b )    (2.1)

Intuitively, maxSpread represents the largest distance in means (from the slowest to the fastest average benchmark execution) between any two runs in our experiment, in percent of the overall mean. It gives us a measure of the worst variability between repeated runs (as opposed to a measure of variability within a run, such as the standard deviation of a series of measurements M_r^b would be). maxSpread therefore provides a lower bound for performance changes that are detectable by a benchmark in a particular environment. A maxSpread of 0 denotes that there is no variance in the results between runs, whereas, for instance, a value of 1 expresses that in the worst repetition the mean performance counter was twice as high as in the best (e.g., twice the mean execution time).
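As a minimal sketch of this computation (in Go, with illustrative names and hypothetical measurement values), maxSpread can be derived from the per-run means as follows; the overall mean is taken here as the mean of the per-run means.

    package main

    import "fmt"

    // mean returns the arithmetic mean of a series of measurements.
    func mean(xs []float64) float64 {
    	sum := 0.0
    	for _, x := range xs {
    		sum += x
    	}
    	return sum / float64(len(xs))
    }

    // maxSpread follows Equation 2.1: the largest difference between the
    // per-run mean results of a benchmark, relative to the overall mean.
    // runs[r] holds the measurements M_r^b of one complete suite execution.
    func maxSpread(runs [][]float64) float64 {
    	runMeans := make([]float64, len(runs))
    	for r, m := range runs {
    		runMeans[r] = mean(m)
    	}
    	overall := mean(runMeans)

    	spread := 0.0
    	for _, m1 := range runMeans {
    		for _, m2 := range runMeans {
    			if d := (m1 - m2) / overall; d > spread {
    				spread = d
    			}
    		}
    	}
    	return spread
    }

    func main() {
    	// Hypothetical ns/op measurements of one benchmark in n = 5 runs.
    	runs := [][]float64{
    		{8.8, 8.9, 9.0},
    		{8.7, 8.8, 8.9},
    		{9.4, 9.5, 9.6},
    		{8.8, 8.9, 8.8},
    		{8.9, 9.0, 9.1},
    	}
    	fmt.Printf("maxSpread = %.3f\n", maxSpread(runs))
    }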

2.5.3 Results and Analysis

Figure 2.1 shows the distribution of the maximum spread of each benchmark for the study subjects in violin plot notation. To ease visual comprehension, we have limited the y-axis to 1.00 for Java and 0.5 for Go, even though some projects have a small number of outlier benchmarks with even higher maxSpread. All data and analysis scripts are provided in a replication package [Laaber and Leitner, 2018b]. Further, Table 2.3 presents the distribution of the maxSpread of each benchmark for the study subjects. The table lists how many benchmarks (in absolute numbers and percent of the suite) fall into one of five buckets. The column "Benchs" lists for how many benchmarks we were able to generate results for all 5 runs, and only these are considered for the maxSpread calculation. This number can be lower than the size of the entire suite as reported in Table 2.1, as we have experienced transiently failing benchmarks for a subset of projects.

[Table 2.3: Maximum spread (maxSpread) in various percentiles. For each project, the table reports the number of benchmarks with complete results ("Benchs") and, separately for the bare-metal and cloud environments, the absolute number and percentage of benchmarks whose maxSpread falls into the buckets 0–0.05, 0.05–0.1, 0.1–0.25, 0.25–0.5, and >0.5.]

Even though there are obvious differences between the projects in terms of benchmark stability, we are able to identify some common trends in the cloud environment. Firstly, even though Go projects have smaller test suites, their benchmarks tend to be substantially more stable than the Java ones. This is largely due to not being executed in a virtual machine. However, even in Go, two of five projects have some outlier benchmarks that are very unreliable (maxSpread > 0.25). For two Go projects (bleve and fasthttp), more than 70% of benchmarks have a maxSpread < 0.05.


Figure 2.1: Distribution of each benchmark's maxSpread for all study subject projects in bare-metal and cloud environments in violin plot notation

Only one Go project (etcd) has relatively unstable benchmark results, with about two thirds of benchmarks having a maxSpread between 0.1 and 0.25.

All JMH projects have a substantial number of benchmarks with a maxSpread > 0.10. JCTools is interesting, as benchmarks for this project are largely either very stable (maxSpread < 0.05) or very unreliable (maxSpread > 0.25). We were unable to study in detail for the present paper how the benchmarks that fall into one of these buckets differ. We observe that the Java language runtime appears to introduce a baseline variability of benchmarking results in the order of 10% on GCE, as relatively few JMH benchmarks are more stable than that. This may be due to the well-documented challenges of benchmarking languages that use a virtual machine [Georges et al., 2007]. Hence, we conclude that JMH is not a reliable tool for discovering small performance regressions in our test setup. A potential way to improve the microbenchmarking quality for these projects without improving the benchmarks themselves would be to increase the number of iterations and forks. While increasing these parameters potentially decreases maxSpread, it further increases the already long test durations (up to multiple days per run in the case of RxJava).

Another important, albeit unsurprising, observation is that both Go and Java projects produce more consistent benchmarking results on bare-metal than in the cloud. This is due to the performance instability of public cloud instances themselves, as widely reported in the literature [Leitner and Cito, 2016; Gillam et al., 2013; Iosup et al., 2011a]. For the Go projects except etcd, this effect is noticeable but not dramatic, as the majority of benchmarks of each project still have a maxSpread < 0.10, which we consider acceptable for many use cases. For etcd, and to a lesser extent gin, we observe an interesting effect where benchmarks which exhibit close to no variability on bare-metal become rather unreliable in the cloud (maxSpread between 0.05 and 0.3). The generally more unreliable Java benchmarks consistently become even less stable in the cloud, leading to benchmark suites where most benchmarks have a maxSpread of 0.1 or higher. Similar to etcd, some Java projects with comparatively stable benchmarks on bare-metal become disproportionately more unreliable in the cloud (protostuff, RxJava, as well as JCTools for a significant subset of its benchmarks).

We speculate that certain types of microbenchmarks (e.g., ones that are particularly I/O-intensive) are hit harder by the performance instability of public clouds than others. However, a detailed analysis of this is out of scope in the present study. We conclude that microbenchmarking using a public cloud, as commonly used by public CI services, is possible. However, users need to, in some cases drastically, lower their expectations with regard to how fine-grained slowdowns can realistically be discovered this way. Based on our data, only slowdowns of more than 50% in most Java projects, and of more than 25% in Go projects, are detectable with reasonable reliability on the used cloud instance.

RQ2 Summary. Study subjects implemented in Go largely have very reliable benchmarks, with a maxSpread below 0.05 in bare-metal. Conversely, the benchmark stability in most Java projects is more varied, with at least a quarter of benchmarks having a maxSpread > 0.10. In the cloud, benchmarks of all projects become substantially less stable, often leading to a maxSpread greater than 0.25 (Java) and greater than 0.1 (Go).

2.6 RQ 3: Detectability of Slowdowns

2.6.1 Context

Benchmark variability, as studied in Section 2.5, is only half the story when evaluating the quality of a microbenchmark suite, as even a microbenchmark suite with very stable benchmarks may still suffer from not benchmarking a sufficiently large part of the project. With the approach proposed in this section, we can analyze the ability of existing microbenchmark suites in finding regressions, identify parts of the API that are not covered by microbenchmarks, and discover methods which are tested for performance by multiple benchmarks.

2.6.2 Approach

Test coverage is a common quality metric for functional test suites [Zhu et al., 1997]. To the best of our knowledge, there is currently no similar metric for microbenchmarks. We propose an API benchmarking score ABS(K′, B, v) that represents what fraction of (a subset of) the public API of a project is "covered" by the test suite B, i.e., in how many public methods K′ ⊆ K an actual slowdown of severity v can be found. Note that this metric is geared towards library or middleware-type projects, such as Log4j2 or RxJava. We evaluate ABS for all study subject projects using the methodology outlined in Figure 2.2. This methodology is inspired by mutation testing [Jia and Harman, 2011], and relies on systematically introducing artificial slowdowns to observe whether the benchmark suite is able to catch these artificial regressions. Unfortunately, this methodology is very time-consuming. For two repeated runs, it requires 2 · (|K′| + 1) executions of the entire microbenchmark suite, so we decided to use a small subset of the public API of only 20 often-used methods for our experiments (i.e., |K′| = 20). Even with this fairly low number of considered methods, our experiment still requires between ~7.5 hours (gin) and ~15.3 days (RxJava) to execute (cp. the execution time durations for a single run as reported in Table 2.1).

[Figure 2.2: ABS calculation approach. Inputs are the benchmark suite B, the selected methods K′, and the relative slowdown v to introduce. For every run r ∈ R, a baseline of B is executed; then, for every method k ∈ K′, the slowdown v is introduced into k and B is executed again. The measurements are combined and the results computed, yielding the covering/non-covering benchmarks and the covered/non-covered methods.]

Table 2.4: Number of project-usage traces

Java Project   Caffeine       JCTools      Log4j2       protostuff   RxJava
# Traces       1,638 × 10^6   100 × 10^6   4.5 × 10^6   126 × 10^6   1.2 × 10^6
Go Project     bleve          etcd         fasthttp     gin          otto
# Traces       972 × 10^3     789          25 × 10^3    21 × 10^3    6.4 × 10^6

Finding Often-Used API Methods This paper's approach is concerned with testing software performance of a performance-relevant part of a project's API. We denote this subset as K′ ⊆ K, where K is the entire public API of a project. Generally, we expect project developers to be able to identify this subset using their domain knowledge. However, as we are basing our experiments on OSS projects where this kind of information is not publicly available, we chose to trace invocations of the study subjects' public API from unit tests of other OSS projects on GitHub. Recent work by Sawant and Bacchelli [2017] has proposed a sophisticated approach to extract API usage data for OSS projects. However, this work was not yet available when we conducted our research, so we elected to use a simpler approach to finding often-used methods. Concretely, we identified additional GitHub projects per subject by running yet another BigQuery query searching for projects with import statements that indicated usage of our subjects. We again ordered the resulting projects by stars and watchers, manually iterated over them, executing the unit test suites of the found projects and tracing the number of invocations per public method of our subjects. For Java projects, we used AspectJ to weave a custom-written simple tracer at load-time, while for Go projects we transformed the abstract syntax trees (ASTs) of the case study projects before executing the unit test suites. The tool chain for tracing API usage is available on GitHub for both Java [Laaber and Leitner, 2018e] and Go [Laaber and Leitner, 2018c]. Table 2.4 shows the total number of traces for each project that we generated using this tooling. Trace information is again collected in the replication package [Laaber and Leitner, 2018b]. Instead of tracing a fixed number of projects per study subject, we have chosen a flexible number of projects (between four and six) that led to a sufficient number of traces for each subject. For all Go projects except etcd, we have in this way collected more than 10^3 unique API execution traces, and more than 10^6 for all Java projects. Only for etcd, we were unable to identify more than 768 traces using our methodology. We speculate that the reason for this is that etcd is not a traditional library project, but a data store where we expect access to be through network protocols rather than API invocations.

Determining Slowdowns (v) per Subject Based on the assessment described in Section 2.5, we determine the slowdown that is potentially detectable by the 95th percentile of the most unstable benchmarks of the subject's performance test suite. We chose the 95th percentile in order to remove the worst 5% of benchmark results in terms of variability. We define the slowdown to introduce v for a study subject project as v = 2 · max{MS_i | 0 < i ≤ ⌊0.95 · |MS|⌋}, where MS is the ordered set (ascending) containing the maximum spreads of B such that MS = {maxSpread(b) | b ∈ B}. That is, we introduce a slowdown that is large enough that 95% of a project's benchmarks still have a maxSpread that is half as large as the slowdown, or lower. To introduce slowdowns into the projects, we transform the ASTs (using javaparser7 for Java and Go's standard ast package8) of the projects to measure the execution time t(k) of the body of each method k ∈ K′ and append a relative slowdown t′(k) = t(k) · v at the end of the body, such that the overall new execution time is t(k) + t′(k). In Java, we used Thread.sleep and in Go time.Sleep to implement the actual slowdown. Note that our tooling is unable to implement very small slowdowns (in the nanosecond range), as the overhead of measuring the execution time and invoking the programming language's sleep method already delays by more than the intended slowdown. Hence, for very short-running methods, v should be understood as a lower bound, with actual slowdowns potentially being higher.

7 https://github.com/javaparser/javaparser
8 https://golang.org/pkg/go/ast
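To illustrate the transformation in Go (a hand-written sketch rather than the actual AST-based tooling; the method and all names are illustrative), a transformed method body conceptually looks as follows.

    package slowdown

    import "time"

    // v is the relative slowdown to introduce, determined per subject from
    // the maxSpread distribution (e.g., 0.6 for RxJava, cf. Table 2.5).
    const v = 0.6

    // doWork stands in for an arbitrary public API method k in K'.
    func doWork(n int) int {
    	start := time.Now() // added by the transformation

    	// original method body
    	sum := 0
    	for i := 0; i < n; i++ {
    		sum += i
    	}

    	// appended by the transformation: a relative slowdown t'(k) = t(k) * v
    	elapsed := time.Since(start)
    	time.Sleep(time.Duration(float64(elapsed) * v))
    	return sum
    }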

We have developed tool chains for both JMH [Laaber and Leitner, 2018d] and Go [Laaber and Leitner, 2018c] projects, using the Groovy and Go programming languages respectively. These tools implement the approach outlined in Figure 2.2, and can be used by practitioners to evaluate ABS for their projects, based on the core API methods that they deem essential using their domain knowledge. The tools expect a configuration in JavaScript Object Notation (JSON), which designates the project to evaluate, a list of methods forming K′, and a slowdown v to introduce.

Calculating ABS For every k ∈ K′ and every run r ∈ {1, 2}, our experiments produce a set of metrics M_r^{bk} for each benchmark b ∈ B. Based on this and the results from the baseline execution M_r^b, where no slowdown was introduced, we then define a function reg : M × M → {true, false} which captures whether a difference between the sets of measurements can be determined with sufficient confidence (i.e., the benchmark detects a difference). We choose the following implementation for reg. We conduct a Wilcoxon rank-sum test between M_r^{bk} and M_r^b. H0 is that the sets of measurements do not differ, which we conservatively reject iff p < 0.01. However, preliminary cloud-reliability experiments and A/A testing have shown that relying only on statistical testing leads to many FPs, as changes in cloud performance often lead to statistically different measurements even if no slowdown was introduced [Laaber and Leitner, 2018a]. Hence, we add another necessary condition, which is that the relative difference of the means needs to be at least half as large as the relative slowdown (Equation 2.2). reg = true iff both the statistical test reports a significant difference and Equation 2.2 holds for both runs.

    (M̄_r^{bk} − M̄_r^b) / M̄_r^b ≥ v / 2    (2.2)

We consider a method k covered iff at least one benchmark registers a difference, i.e., covered(k, B, v) = 1 ⟺ ∃b ∈ B : reg(M_r^{bk}, M_r^b) = true. The ABS of a project is then the percentage of k ∈ K′ that are covered (Equation 2.3).

Table 2.5: API benchmarking scores (ABS) of all subjects

                                    ABS                       Excluded
Language   Project      v       Total    Rel.     Abs.      Rel.   Abs.
Java       Caffeine     2.6     65%      60%      60%       4      13
           JCTools      1.0     55%      55%      55%       8      4
           Log4j2       1.0     80%      40%      40%       22     19
           protostuff   1.0     40%      35%      35%       2      1
           RxJava       0.6     95%      95%      95%       50     22
Go         bleve        0.1     100%     100%     100%      4      0
           etcd         0.3     10%      10%      10%       2      1
           fasthttp     0.2     95%      95%      95%       5      2
           gin          0.2     65%      65%      65%       1      0
           otto         0.2     60%      60%      60%       2      0

    ABS(K′, B, v) = 100 · ( Σ_{k ∈ K′} covered(k, B, v) ) / |K′|    (2.3)
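As a minimal sketch of how these definitions combine (in Go, with illustrative names; the Wilcoxon rank-sum test is assumed to be computed separately and passed in as a boolean, so only the practical-significance condition of Equation 2.2 and the aggregation of Equation 2.3 are shown):

    package abs

    // regPractical checks the second condition of reg (Equation 2.2): the
    // relative difference of the means must be at least half the introduced
    // relative slowdown v. The outcome of the Wilcoxon rank-sum test
    // (first condition) is supplied as significant.
    func regPractical(meanSlowdown, meanBaseline, v float64, significant bool) bool {
    	return significant && (meanSlowdown-meanBaseline)/meanBaseline >= v/2
    }

    // score implements Equation 2.3: the percentage of methods in K' that
    // are covered, i.e., for which at least one benchmark registers the
    // introduced slowdown. detected[k][b] states whether benchmark b
    // registered the slowdown introduced into method k (in all runs).
    func score(detected [][]bool) float64 {
    	covered := 0
    	for _, perBenchmark := range detected {
    		for _, d := range perBenchmark {
    			if d {
    				covered++
    				break
    			}
    		}
    	}
    	return 100 * float64(covered) / float64(len(detected))
    }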

2.6.3 Results and Analysis

API Benchmarking Score We present all resulting scores in Table 2.5. Further, all data is again available online [Laaber and Leitner, 2018b]. To identify how sensitive scores are to individual unreliable benchmarks, we present three variants: "total" (includes all benchmarks), "relative", and "absolute". The "relative" and "absolute" scores describe two strategies where the most unreliable benchmarks, as per the results discussed in Section 2.5, are excluded. We explore how scores change if (1) for the "relative" score, the worst 5% of benchmarks with respect to result stability (as per maxSpread), or (2) for the "absolute" score, benchmarks with a maxSpread > 0.5, are excluded. Intuitively, these values should not be as prone to reported FPs as the full set of benchmarks ("total" score) is. Column "Excluded" lists how many benchmarks have been skipped in that way. Column "v" lists how large the introduced relative slowdown was per project. Our results show that the Go projects, despite testing for substantially smaller slowdowns, have higher scores than the Java projects. Two of five Go projects have scores of or close to 100% (bleve and fasthttp). Gin and otto have scores comparable to Caffeine and JCTools. Interestingly, etcd has a very low score at 10%. This could be due to etcd being a database project, whose public API may not reflect the most performance-critical part of the system. Given that many benchmarks in the Java projects have a high maxSpread and are hence prone to lead to FPs, excluding the most unstable benchmarks often leads to significantly lower scores. This is most evident for Log4j2, where all tests together lead to 80% ABS. However, once we exclude unstable tests, the score drops to 40% (relative and absolute). We conclude that in Log4j2, many FPs during our test runs give a too optimistic view of the test suite's ability to find regressions. RxJava is the standout Java project of our case studies on multiple levels. As highlighted in Table 2.1, RxJava's benchmark suite is the most extensive in both runtime and size. Further, its benchmarking score is the best among Java projects with 95%, which is similar to the best Go projects (bleve and fasthttp). Generally, the fact that benchmarks are more unreliable in the Java than in the Go projects also manifests in the difference of the three ABS results. All Java projects except JCTools and RxJava show a decreased score when filtering the most unreliable benchmarks. Go projects on the other hand have stable scores across all three types. Experiments with different combinations of ABS calculation showed that the combination of difference in maxSpread and a hypothesis test produced the best results among the different benchmarking-score types.

Redundancy of Benchmarks Another question is whether there typically is a one-to-one mapping between benchmarks and covered methods, or whether different benchmarks tend to redundantly find similar slowdowns. Figure 2.3 shows for each study subject how many methods from K′ are covered by what number of benchmarks as a density plot. The higher the curve is in the right part of a figure, the more benchmarks find slowdowns in the same API methods. A common pattern across multiple projects (JCTools, Log4j2, protostuff, etcd, gin, and otto) is that most methods are either not covered at all, or only covered by a small number of benchmarks, but a small subset of methods are covered by many benchmarks.


Figure 2.3: Mapping of benchmarks to covered API methods as density plots. Plots that lean more to the right indicate that a larger number of benchmarks find slowdowns in the same API methods, indicating redundancy.

The remaining projects, Caffeine, RxJava, bleve, and fasthttp, have more redundant benchmarks in comparison, indicating that benchmarks in these projects tend to be more coarse-grained. Caffeine has an evenly distributed set of benchmarks in terms of redundancy, which shows that the number of benchmarks covering many methods steadily decreases. RxJava shows a high degree of redundancy, with most methods being covered by around 35 benchmarks and a substantial amount covered by up to 160 benchmarks. Contrarily, bleve and fasthttp show a similar redundancy trend, where a relatively low number of benchmarks cover a few methods and the majority of benchmarks detect around a third of the slowdowns in the selected API methods.

RQ3 Summary. The API benchmarking score ABS is a quality metric to assess how well-tested a subset of a software's API is. ABS is determined similarly to mutation testing: artificial regressions of project-dependent sizes are introduced into the API subset, and it is recorded which of them the microbenchmark suite detects. Resulting scores of the study subjects vary between 10% and 100%, and benchmarks often detect regressions in the same core methods.

2.7 Discussion

We presented an approach and metric to study the microbenchmark suites of ten OSS projects written in Java and Go. We showed that, depending on the environment, benchmark suites have different variability, which needs to be taken into consideration when evaluating performance. Through the proposed approach, we can assess the ability of a set of performance tests to detect actual performance changes in fine-granular software components. We now propose three concrete usage scenarios for our approach and results, and discuss threats to validity.

2.7.1 Usage Scenarios

Quality Assessment of Microbenchmark Suites Software developers, both of OSS and commercial projects, can use our approach and the ABS to assess the quality of the benchmark suite of their own projects. Firstly, developers can find the best-fitting environment for executing benchmarks by applying the methodology from Section 2.5. We showed that benchmark-result variability depends on the type and properties of the software. Hence, finding an environment that produces the most reliable results is key for continuous performance assessment. Secondly, by using the approach from Section 2.6, developers can determine "how good" their software's benchmark suite is at actually finding slowdowns. They can either use a similar process to ours for establishing K′ (see Section 2.6.2), or use their domain knowledge. In this way, developers can increase confidence in their performance testing. Moreover, the results that we have presented for ten OSS projects can serve as a point of reference for other OSS developers and their projects. Using our tool chain [Laaber and Leitner, 2018d,e,c,b], setting up the environment, and executing the necessary steps to gather and analyze the data is straightforward. As with any statistical test, especially in the context of performance evaluation and uncertain cloud environments, there will be FPs. The approach and results presented in Section 2.6 mitigate FPs through multiple runs r, where a slowdown is only reported if it is discovered in all runs. If users of our approach still experience too many FPs, an increase in runs should lead to more stable results.

Performance Test Case Selection We argue that the approach taken in Section 2.6 can be used to steer the selection of a subset of benchmarks for CI integration. In future work, we will explore this angle and propose approaches to select an optimal subset of benchmarks for continuous performance assessment. These approaches build on the results of ABS in combination with an assessment of the available time budget or other constraints. We foresee that search-based methods [Harman and Jones, 2001] could be used to optimize benchmark selection. However, our approach is highly dependent on the identified performance-relevant set K′. Based on the selection of this subset, benchmarks identified with ABS potentially do not cover all possible performance problems detectable with a full microbenchmark suite or systematic, large-scale load testing. This subset is intended to act as a smoke test that should be seen as a sanity check rather than a full-fledged performance evaluation.

Generating Stubs for Additional Benchmarks Using our methodology, developers can identify important parts of the API which are not covered by the existing microbenchmark suite. Consequently, they can use this information directly to improve their performance tests. Additionally, we envision future tooling that will be able to generate new benchmark stubs for such uncovered core methods. In future research, we plan to explore these possibilities in more detail. For instance, we envision future developer tooling that analyzes all existing benchmarks, and then uses code generation to propose new benchmarks that follow the same style as the existing tests, but which are able to identify slowdowns in the so-far uncovered parts of the API.

2.7.2 Threats to Validity

There are certain threats and limitations to our approach and the paper's empirical results that a reader should take into account. We discuss the most important ones in the following. In terms of external validity, readers should consider the characteristics of our study subjects, and not generalize our results to other programming languages (with potentially different performance characteristics), application models, or industrial software. However, we have chosen a common programming language running on top of a virtual machine (Java), as well as a statically compiled language (Go) to improve representativeness. Future work should address this by extending our ABS results to other programming languages and paradigms (e.g., interpreted or functional). Given the time-consuming nature of the proposed approach in Section 2.6, i.e., between

~7.5 hours and ~15.3 days for a single project, scaling up to more than ten projects was considered infeasible. Results regarding performance testing in public clouds may vary if a different cloud provider (e.g., Microsoft Azure or Amazon Web Services (AWS)) is used instead of GCE. We have specifically investigated microbenchmarking suites, which evaluate software performance on fine-granular level. To detect full-system-performance degradations (e.g., network latencies, database access, load balancing), one might have to consider load tests or performance monitoring with release-engineering/live-testing strategies. Further, even though slowdowns are only one part of software performance, other performance counters (e.g., memory consumption, I/O operations, lock contention) were considered out of scope. We argue that the overall approach would not change drastically when considering these. Nonetheless, future research needs to investigate performance along these other dimensions.

In terms of construct validity, the most relevant threat is the selection of methods to test for in Section 2.6.2. We trace API usage in the unit tests of other open-source projects to get a simple, data-driven selection of often-used methods. However, there are two limitations to this method. Firstly, usage in unit tests does not necessarily correspond to usage in production. Secondly, absolute invocation counts are not necessarily a good proxy for importance. However, as the main contribution of the present study is the ABS methodology, a more comprehensive and time-consuming methodology to generate a more robust list of important methods, such as Sawant and Bacchelli [2017], was considered out of scope. Further, 20 methods may be too few to comprehensively evaluate a large project. Projects attempting to apply our methodology in practice may wish to calculate the ABS for a larger part of the public API, but need to keep in mind that the time necessary for running the experiments increases linearly with the number of considered methods. Moreover, maxSpread bases its variability on average statistics, whereas performance distributions are often long-tailed [de Oliveira et al., 2013a]. In future research, we want to investigate the effect performance measurements have on maxSpread and compare it to other measures such as averages with confidence-interval widths [Stefan et al., 2017] or medians. The overridden JMH execution settings are another threat to the benchmark results. We decided to use our defaults as they reflect performance engineering best practices [Georges et al., 2007], and limit the execution time to a lower bound. This is especially necessary due to the expensive nature of our experiments (see Section 2.6). Finally, we were unable to determine how sensitive the calculated scores are to the extent of the introduced slowdown. A follow-up sensitivity study is required to investigate whether the benchmark scores change significantly when decreasing or increasing the slowdown.

2.8 Related Work

Historically, research on software performance was often conducted in the context of system-scale load and stress testing [Menascé, 2002; Jiang and Hassan, 2015; Weyuker and Vokolos, 2000; Barna et al., 2011]. More recent research focuses on industrial applicability [Nguyen et al., 2014; Foo et al., 2015], or how to reduce the necessary time for load testing [Grechanik et al., 2012]. Studies of microbenchmarking are scarcer, but have recently gotten more attention from research. Stefan et al. [2017] and Leitner and Bezemer [2017] study quantitatively and qualitatively microbenchmarking practices in Java OSS. Stefan et al. [2017] indicate that microbenchmarks hardly find useful regressions and are often used for design decisions, which might not be the case for larger, industry-driven projects like RxJava. In their subject-selection approach, projects with long-running benchmark suites were not considered. Compared to this work, we extend their findings by studying result variability on cloud infrastructure, which is arguably the way to go for integrating performance testing in CI. Further, we examine a different type of programming language that translates directly to machine code (i.e., Go). Studies from Horký et al. [2015] and Chen and Shang [2017] are auxiliary to ours, focusing on raising performance-awareness of developers through generated API documentation and studying root causes of performance degradations reported by microbenchmarks respectively. None of these microbenchmarking works conducted a detailed analysis of the performance test suites' ability to find actual slowdowns, as is the focus of Section 2.6. Studies on the nature of performance problems have found that these bugs tend to be particularly problematic. They take more time to be fixed than functional bugs [Heger et al., 2013], and require more experienced developers to do so [Zaman et al., 2012]. Performance bugs are often dormant, and only become visible after a time when the system's usage changes [Jin et al., 2012]. Further studies have investigated performance regressions in specific domains, for instance in browser-based JavaScript [Selakovic and Pradel, 2016] or Android mobile applications [Linares-Vasquez et al., 2015]. Another recent line of research related to the study of software performance is mining regressions from version histories [Luo et al., 2016b; Sandoval Alcocer and Bergel, 2015]. There has also been work on the automated fixing of (special types of) performance problems [Barna et al., 2011; Nistor et al., 2013b, 2015; Heger et al., 2013]. Performance test regression selection research so far explored testing only performance-critical commits [Huang et al., 2014; Sandoval Alcocer and Bergel, 2015], or focused on particular types such as collection-intensive software [Mostafa et al., 2017] and concurrent classes [Pradel et al., 2014]. de Oliveira et al. [2017] propose selection of individual benchmarks based on static and dynamic data that assess whether a code change affects the performance of each benchmark. Bulej et al. [2017a] tackle performance-regression testing through stochastic performance logic (SPL), which lets developers describe performance assertions in hypothesis-test-style logical equations.
In the present work, we do not propose methods to actually fix issues, but suggest that our ABS methodology can be used to reduce existing test suites to a minimal required set, and propose stubs for valuable new benchmarks that should be added to a project's benchmark suite. Further, there has recently been an increase of interest in the study of CI and CD builds, as well as in the types of build faults and failures that may occur. For instance, Rausch et al. [2017] categorized build failures on TravisCI and found no evidence that performance testing is currently a relevant part of CI builds. Beller et al. [2017] specifically study testing on TravisCI, but do not mention performance testing either. Vassallo et al. [2017] study the CI builds of OSS and industrial applications, but again find little evidence of structured performance testing in the pipeline. Integrating performance testing into CI is only starting to gain momentum, with an initial publication by Brunnert and Krcmar [2017] on CI load testing. As an alternative to invoking a full load test run with every build (which may be infeasible for many systems), existing techniques to judge the performance sensitivity of a commit could be utilized [Huang et al., 2014]. In industrial practice, and especially in the context of Software as a Service (SaaS) applications, application performance management [Ahmed et al., 2016; Kim et al., 2016] and partial rollouts (or canary releases) [Veeraraghavan et al., 2016; Tang et al., 2015; Schermann et al., 2016] are common approaches to unify CI with performance management. Fundamentally, these approaches eschew performance testing entirely, and instead rely on post-hoc identification of performance issues after deployment. However, they are targeted at SaaS rather than library and middleware projects, such as the ones studied in the present paper.

2.9 Conclusions

In this paper, we studied the quality of the software microbenchmark suites of ten open-source projects written in Java and Go, and evaluated their suitability for continuously assessing software performance in CI environments. Firstly, we showed that benchmark suite sizes vary from 16 to 983 benchmarks, with a mean execution time for a single run between 11 minutes and 8.75 hours. Secondly, we studied the result variability of the study subjects' benchmark suites over identical repetitions, which we defined as maxSpread. Most Go benchmarks have a maxSpread below 5% in bare-metal and below 25% in cloud environments, whereas most Java benchmarks have one below 25% in bare-metal and below 50% in cloud environments. Thirdly, we evaluated how well the benchmark suites find actual slowdowns. We introduced a performance-test quality metric, ABS, and showed that the study subjects' scores vary between 10% and 100%. Moreover, we found significant redundancy in the microbenchmark suites, indicating that it may be possible to execute only a subset of a suite as a smoke test within CI. The methodology and results of our study can be used to assess the microbenchmarking quality of other projects, identify missing tests that would improve the quality of existing benchmark suites, and select benchmarks to be run as part of CI.

3

Software Microbenchmarking in the Cloud. How Bad is it Really?

Christoph Laaber, Joel Scheuner, and Philipp Leitner Published in Empirical Software Engineering 24 (2019)

Abstract

Rigorous performance engineering traditionally assumes measuring on bare-metal environments to control for as many confounding factors as possible. Unfortunately, some researchers and practitioners might not have access, knowledge, or funds to operate dedicated performance testing hardware, making public clouds an attractive alternative. However, shared public cloud environments are inherently unpredictable in terms of the system performance they provide. In this study, we explore the effects of cloud environments on the variability of performance test results and to what extent slowdowns can still be reliably detected even in a public cloud. We focus on software microbenchmarks as an example of performance tests and execute extensive experiments on three different well-known public cloud services (Amazon Web Services (AWS), Google Compute Engine (GCE), and Azure) using three different cloud instance types per service. We also compare the results to a hosted bare-metal offering from IBM Bluemix. In total, we gathered more than 4.5 million unique microbenchmarking data points from benchmarks written in Java and Go. We find that the variability of results differs substantially between benchmarks and instance types (by a coefficient of variation from 0.03% to > 100%). However, executing test and control experiments on the same instances (in randomized order) allows us to detect slowdowns of 10% or less with high confidence, using state-of-the-art statistical tests (i.e., Wilcoxon rank-sum and overlapping bootstrapped confidence intervals). Finally, our results indicate that Wilcoxon rank-sum manages to detect smaller slowdowns in cloud environments.

3.1 Introduction

In many domains, renting computing resources from public clouds has largely replaced privately owning computational resources, such as server racks. This is due to economic factors, but also due to the convenience of outsourcing tedious server management tasks [Cito et al., 2015]. However, one often-cited disadvantage of public clouds is that the inherent loss of control can lead to highly variable and unpredictable performance, for example due to co-located noisy neighbors [Leitner and Cito, 2016; Farley et al., 2012; Iosup et al., 2011b]. Therefore, using cloud resources, such as virtual machines (VMs), in performance testing environments is a challenging proposition, because predictability and low-level control over the hardware and software is key in traditional performance engineering. Nevertheless, there are many good reasons why researchers and practitioners might be interested in adopting public clouds as execution environments for their performance experiments. The experimenters might have insufficient access to dedicated hardware resources for conducting performance testing at a designated scale. They may wish to evaluate the performance of applications under "realistic conditions", which nowadays often refers to cloud environments. They may wish to leverage industrial-strength infrastructure automation tools (e.g., AWS CloudFormation1) to easily provision resources on demand, which allows them to massively parallelize the execution of large benchmarking suites. In this paper, we ask whether using a standard public cloud as an execution environment for software performance experiments is always a bad idea. We focus on the cloud service model Infrastructure as a Service (IaaS) and on the performance testing type of software microbenchmarking in the programming languages Java and Go. IaaS clouds provide relatively low-level access and configurability while allowing for high scalability of experiments, and software microbenchmarks emerge as the performance evaluation strategy of choice for libraries and frameworks. Software microbenchmarks can be seen as the unit test equivalent for performance and are sometimes even referred to as performance unit tests [Stefan et al., 2017]. They are relatively short-running (e.g., <1ms) performance tests against small software units (e.g., methods), which are typically tested in isolation without a fully-deployed system (as used for load testing). A microbenchmark is repeatedly executed (called invocations) for a defined time period (e.g., 1s) and reports the mean execution time over all invocations (called an iteration). The result of a microbenchmark is then the distribution of multiple iterations (e.g., 20). In our previous work, we already studied the performance stability of IaaS clouds using benchmarks for general system performance [Leitner and Cito, 2016]. We focused on running low-level, domain-independent system tests with different characteristics (e.g., input/output (I/O) or central processing unit (CPU) benchmarks). Further, we have also previously executed software microbenchmarks in a bare-metal and one cloud environment to study the quality of open-source microbenchmark suites [Laaber and Leitner, 2018a]. Our results in these papers motivated a more extensive study, dedicated to the reliability of software microbenchmarking in public clouds, which is the core contribution of the present study. In particular, we quantify to what extent slowdowns can still be reliably detected in public clouds.
Concretely, we address the following research questions:

1 https://aws.amazon.com/cloudformation

RQ 1 How variable are microbenchmark results in different cloud environments?

RQ 2 Which slowdowns in microbenchmark results can we detect 95% of the time with at most 5% false positives?

We base our research on 19 real microbenchmarks sampled from four open-source software (OSS) projects written in Java or Go. We aimed for 20 benchmarks, five for each project (see Section 3.3.1); however, due to a configuration error during execution we lack results for one benchmark. We study cloud instances (i.e., VMs in cloud environments) from three of the most prominent public IaaS providers, namely Google Compute Engine (GCE), Amazon Elastic Compute Cloud (EC2), and Microsoft Azure, and we contrast these results against a dedicated bare-metal machine deployed using IBM Bluemix. We also evaluate and compare the impact of common deployment strategies for performance tests in the cloud, such as running experiments on different cloud instances of the same type or repeating experiments on the same instance. Hereby, we use randomized multiple interleaved trials as recently proposed as best practice [Abedi and Brecht, 2017]. We find that result variability ranges from a coefficient of variation of 0.03% to more than 100% between repeated executions of the same experiments. This variability depends on the particular benchmark and the environment it is executed in. Some benchmarks show high variability across all studied instance types, whereas others are stable in only a subset of the environments. We conclude that instability originates from different sources, including variability inherent to the benchmark, variability between trials (i.e., executions within an instance), and variability between instances. We further find that two state-of-the-art statistical tests for performance evaluation [Bulej et al., 2017b], i.e., (1) Wilcoxon rank-sum with effect size medium or larger and (2) overlapping confidence intervals of the mean computed with statistical simulation (bootstrapping), both falsely report high numbers of performance changes (i.e., false positives (FPs)) when in fact neither the benchmark nor the production code has changed and the sample size (e.g., a single instance) is low. To mitigate this, benchmarks have to be repeatedly executed on multiple instances and multiple times within an instance to lower the number of FPs below an acceptable threshold (≤ 5% of 100 simulations), hence making it feasible to use cloud instances as performance-test execution environments. Finally, we find that only 78%–83% of the benchmark-environment combinations are able to reliably detect slowdowns at all when test and control experiments are not run on the same instances and 20 instances each are used. Employing a strategy where test and control experiments are executed on the same instances, all benchmark-environment combinations find slowdowns with high confidence when utilizing ≥ 10 instances. In 77%–83% of the cases, a slowdown below 10% is reliably detectable when using trial-based sampling and 20 instances. With respect to the difference between Wilcoxon rank-sum and overlapping confidence intervals, the Wilcoxon test is superior in two regards: (1) it is able to reliably detect smaller slowdowns, and (2) it is not as computationally intensive and therefore takes less time. Following these findings, we conclude that executing software microbenchmarking experiments on cloud instances is possible, albeit with some caveats. Not all cloud providers and instance types are equally suited for performance testing, and not all microbenchmarks lend themselves to reliably detecting slowdowns in cloud environments.
In most settings, a substantial number of trials or instances is required to achieve robust results. However, running test and control groups on the same instances, ideally in random order, reduces the number of required repetitions (i.e., the number of trials or instances). Practitioners can use our study as a blueprint to evaluate the stability of their own performance microbenchmarks within their custom experimental environment. The remainder of this paper is structured as follows. Section 3.2 introduces relevant background information for this study, such as microbenchmarking, the Java Microbenchmark Harness (JMH) framework, and IaaS cloud services. Section 3.3 outlines our research approach, describes the microbenchmark and cloud-provider selection, and details the execution methodology.

Section 3.4 and Section 3.5 discuss the study results for the two research questions, while Section 3.6 discusses the main lessons learned and threats to the validity of the study. Related research is discussed in Section 3.7, and finally the paper is concluded in Section 3.8.

3.2 Background

This section summarizes software microbenchmarking and IaaS clouds as important concepts we use in our study.

3.2.1 Software Microbenchmarking

Performance testing is a common term used for a wide variety of different approaches. In this paper, we focus on one specific technique, namely software microbenchmarking, sometimes also referred to as performance unit tests [Horký et al., 2015]. Microbenchmarks are short-running (e.g., < 1ms), unit-test-like performance tests that aim to measure fine-grained performance metrics, such as method-level execution times, throughput, or heap utilization. Typically, frameworks repeatedly execute microbenchmarks for a certain time duration (e.g., 1s) and report their mean execution time. The nature of these performance tests is different from traditional load tests, where full applications are deployed and long-running load scenarios simulate the load of multiple users.

JMH is part of the OpenJDK implementation of Java and allows users to specify benchmarks through Java annotations. Every public method annotated with @Benchmark is executed as part of the performance test suite. Listing 3.1 shows an example benchmark from the RxJava project where the execution time and throughput of a latched observer are measured. Other examples measure logging (e.g., logger.info) in Log4j2 or filtering by search terms in the bleve text indexing library.

@State(Scope.Thread)
public class ComputationSchedulerPerf {

    @State(Scope.Thread)
    public static class Input extends InputWithIncrementingInteger {
        @Param({ "100" })
        public int size;
    }

    @Benchmark
    public void observeOn(Input input) {
        LatchedObserver o = input.newLatchedObserver();
        input.observable.observeOn(
            Schedulers.computation()
        ).subscribe(o);
        o.latch.await();
    }
}

Listing 3.1: JMH example (rxjava-5) from the RxJava project.

The Go programming language includes a benchmarking framework directly in its standard library2. This framework primarily follows the convention-over-configuration paradigm. For instance, benchmarks are defined in files ending with _test.go as functions that have a name starting with Benchmark (see Listing 3.2).

func BenchmarkTermFrequencyRowEncode(b *testing.B) {
    row := NewTermFrequencyRowWithTermVectors(...)
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        row.Key()
        row.Value()
    }
}

Listing 3.2: Go benchmarking example (bleve-3) from the bleve project.

2 https://golang.org/pkg/testing

In our study, we use both JMH and Go microbenchmarks as test cases to study the suitability of IaaS clouds for performance evaluation.

3.2.2 Infrastructure-as-a-Service Clouds

The academic and practitioner communities have by now widely agreed on a uniform high-level understanding of cloud services following NIST [Mell and Grance, 2011]. This definition distinguishes three service models: Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS). These levels differ mostly in which parts of the cloud stack are managed by the cloud provider and which are self-managed by the customer. In IaaS, computational resources are acquired and released in the form of VMs or containers. Tenants are relieved from the burden of operating physical servers but are still required to administer their virtual servers. We argue that, for the scope of our research, IaaS is the most suitable model at the time of writing, as this model still allows for comparatively low-level access to the underlying infrastructure. Further, setting up performance experiments in IaaS is substantially simpler than doing the same in a typical PaaS system, where applications need to be adapted to provider-specific interfaces. Hence, we focus on IaaS in this paper. In IaaS, a common abstraction is the notion of an instance: an instance bundles resources (e.g., CPUs, storage, networking capabilities, etc.) defined through an instance type and an image. The instance type governs how powerful the instance is supposed to be (e.g., what hardware it receives), while the image defines the software initially installed. More powerful instance types are typically more expensive, even though there is often significant variation even between individual instances of the same type [Ou et al., 2012; Farley et al., 2012]. Instance types are commonly grouped into families, each representing a different usage class (e.g., general purpose, compute-optimized, or memory-optimized).

3.3 Approach

Traditionally, performance measurements are conducted in dedicated environments, with the goal of reducing the non-deterministic factors inherent in all performance tests to a minimum [Mytkowicz et al., 2009]. Specifically, hardware and software optimizations are disabled on test machines, no background services are running, and each machine has a single tenant. These dedicated environments require high effort to maintain and have considerable acquisition costs. Conversely, cloud providers offer different types of hardware for on-demand rental that have no maintenance costs and low prices. However, the lack of control over optimizations, virtualization, and multi-tenancy negatively affects performance measurements [Leitner and Cito, 2016]. To study the extent of these effects, we take the following approach. We sample a subset of benchmarks from four OSS projects written in two programming languages. These benchmarks are executed repeatedly on the same cloud instance as well as on different cloud-instance types from multiple cloud providers. The results are then compared in terms of variability and detectability of slowdowns. Note that we are not benchmarking cloud infrastructure itself but rather software executed on it, which mostly falls into the category of software libraries and frameworks, such as Log4j2. In this study, we are also not concerned with testing the performance of deployed and running applications (e.g., Agilefant3, Dell DVD Store4, or JPetStore5) in the fashion of a load test, but with evaluating the performance of fine-grained software units (e.g., methods) using software microbenchmarks. We do not claim that software microbenchmarks are a replacement for load tests. Neither is this paper concerned with finding appropriate benchmarks for slowdown detection in cloud environments. Research into these other areas is required as part of potential future work. In the scope of this paper, we study the result variability and slowdown detectability of existing benchmark suites.

3 https://github.com/Agilefant/agilefant
4 https://github.com/dvdstore/ds3
5 https://github.com/mybatis/jpetstore-6

3.3.1 Project and Benchmark Selection

The study is based on 20 microbenchmarks selected from four OSS projects, two of which are written in Java and two in Go. Unfortunately, due to a configuration error, we are lacking all results for one benchmark (bleve-1) and consequently omit this benchmark from all remaining discussions. We decided to choose Java as it has been ranked highly in programming-language rankings (e.g., Tiobe6), is executed in a VM (i.e., the Java Virtual Machine (JVM)) with dynamic compiler optimizations, and has a microbenchmarking framework available that is used by real OSS projects [Stefan et al., 2017]. The Go language complements our study selection as a newer programming language, introduced in 2009. It is backed by Google, has gained significant traction, compiles directly to machine-executable code, and comes with a benchmarking framework7 as part of its standard library. We chose these languages due to their different characteristics, which improves the generalizability of our results. In an earlier study [Laaber and Leitner, 2018a], we investigated OSS projects in these languages that make extensive use of microbenchmarking. We chose real-world projects that are non-trivial and have existing microbenchmark suites. Table 3.1 shows detailed information about these projects, such as the GitHub Uniform Resource Locator (URL), the commit snapshot used for all experiments, and the total number of benchmarks in the project at the time of study. We also report popularity metrics, such as stars, watchers, and forks on GitHub. Note that log4j2's star count is relatively low. This is due to the GitHub repository being a mirror of the main repository hosted directly by the Apache Software Foundation. The four selected projects represent good candidates of study subjects, as they are among the largest ones in their languages with respect to popularity (indicated by GitHub stars, watchers, and forks) and the size of their benchmark suites. For benchmark selection, we executed the entire benchmark suites of all study subjects five times on an in-house bare-metal server at the first author's university, requiring between 37.8 minutes (etcd) and 8.75 hours (RxJava) of execution time per trial [Laaber and Leitner, 2018a].

6 https://www.tiobe.com/tiobe-index
7 https://golang.org/pkg/testing

Table 3.1: Overview of study-subject projects

Project   GitHub URL (https://github.com/)   Commit    Stars    Watchers   Forks   Benchs (#)
Log4j2    apache/logging-log4j2              8a10178   369      58         256     437
RxJava    ReactiveX/RxJava                   2162d6d   27,742   1,951      4,882   977
bleve     blevesearch/bleve                  0b1034d   3,523    209        305     70
etcd      coreos/etcd                        e7e7451   15,084   920        2,934   41

For each project, we ranked all benchmarks in the order of result variability between these five trials and selected the ones that are: the most stable, the most unstable, the median, the 25th percentile, and the 75th percentile. Our intuition is to pick five benchmarks from each project that range from stable to unstable results, to explore the effect of result variability on the ability to detect slowdowns. The selected benchmarks are summarized in Table 3.2, where the first benchmark of each project (e.g., log4j2-1) is the most stable and the last (e.g., log4j2-5) the most unstable according to our previous study.
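As a minimal sketch of this selection step, assume each benchmark of a project has already been summarized by a single variability score over the five pre-study trials; the map layout and method name below are illustrative, not the exact tooling used in the study.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative sketch: pick the most stable, 25th-percentile, median,
// 75th-percentile, and most unstable benchmark of a project, ranked by
// the result variability observed over five pre-study trials.
public class BenchmarkSelection {

    static List<String> selectFive(Map<String, Double> variabilityByBenchmark) {
        List<Map.Entry<String, Double>> ranked =
                new ArrayList<>(variabilityByBenchmark.entrySet());
        // Sort from most stable (lowest variability) to most unstable.
        ranked.sort((a, b) -> Double.compare(a.getValue(), b.getValue()));

        int n = ranked.size();
        int[] picks = { 0, (n - 1) / 4, (n - 1) / 2, 3 * (n - 1) / 4, n - 1 };

        List<String> selected = new ArrayList<>();
        for (int idx : picks) {
            selected.add(ranked.get(idx).getKey());
        }
        return selected;
    }
}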

3.3.2 Cloud Provider Selection

Within three of the most-prominent cloud providers, we choose three different families of instance types. The selected providers are Amazon with AWS EC2, Microsoft with Azure, and Google with GCE. For each provider, we choose instance types in the families of entry-level general purpose (GP), compute- optimized (CPU), and memory-optimized (Mem). We expect instance types with better specifications to outperform the entry-level ones, and therefore this study establishes a baseline of what is possible with the cheapest available cloud-resource options. Table 3.3 lists the selected instance types including information about the data-center region, processor and memory specification, and hourly prices at experiment time (summer 2017). All cloud instances run Ubuntu 17.04 64-bit. 94 Chapter 3. Software Microbenchmarking in the Cloud. How Bad is it Really? al .:Oeve fslce ecmrs o M ecmrswt utpeprmtr,w lols the list also we parameters, multiple with benchmarks JMH For benchmarks. selected of Overview 3.2: Table oceeprmtrzto eue.TeG irbnhakfaeokde o s h oino parameters of notion the use not does framework microbenchmark Go The used. we parameterization concrete agaeProject Language Java Go xaarxjava-1 RxJava o42log4j2-1 Log4j2 lv bleve-1 bleve etcd rxjava-5 rxjava-4 rxjava-3 o425ogaah.ogn.o4.efjhSortedArrayVsHashMapBenchmark.getValueHashContextData org.apache.logging.log4j.perf.jmh log4j2-5 log4j2-4 log4j2-3 log4j2-2 rxjava-2 etcd-5 etcd-4 etcd-3 etcd-2 etcd-1 Name bleve-5 bleve-4 bleve-3 bleve-2 Short rx.schedulers rx.operators rx.operators r.pcelgiglgjpr.m ThreadContextBenchmark.legacyInjectWithoutProperties PatternLayoutBenchmark.serializableMCNoSpace org.apache.logging.log4j.perf.jmh org.apache.logging.log4j.perf.jmh r.pcelgiglgjpr.m ThreadContextBenchmark.putAndRemove org.apache.logging.log4j.perf.jmh rx.operators mc/vtr_ec_etg BenchmarkStorePut BenchmarkMediumResponseUnmarshal BenchmarkMutex4Waiters BenchmarkManySmallResponseUnmarshal BenchmarkBackendPut /mvcc/backend/backend_bench_test.go /mvcc/kvstore_bench_test.go /client/keys_bench_test.go /integration/v3_lock_test.go /client/keys_bench_test.go SortedArrayVsHashMapBenchmark.getValueHashContextData org.apache.logging.log4j.perf.jmh sac/olco/onts.oBenchmarkTop1000of100000Scores BenchmarkTermFrequencyRowDecode BenchmarkTop100of50Scores BenchmarkGoLevelDBIndexing2Workers10Batch /index/upsidedown/benchmark_goleveldb_test.go BenchmarkNullIndexing1Workers10Batch /search/collector/topn_test.go /index/upsidedown/row_test.go /index/upsidedown/benchmark_null_test.go /search/collector/topn_test.go Package rx.operators sn as;bthrqec ;cidon ;sz 1 = size 0; = childCount 8; = batchFrequency false; = async sn as;bthrqec ;cidon ;sz 1000000 = size 5; = childCount 4; = batchFrequency false; = async ie=100 = size on 00 ak=3 = mask 1000; = count on ;traCnetaAis=NoGcOpenHash = threadContextMapAlias 5; = count on 0;lnt 20 = length 500; = count on 0 hedotxMpla NoGcSortedArray = threadContextMapAlias 50; = count on ;lnt 20 = length 5; = count ie=1000 = size OperatorPublishPerf.benchmark OperatorPublishPerf.benchmark FlatMapAsFilterPerf.rangeEmptyConcatMap ComputationSchedulerPerf.observeOn OperatorSerializePerf.serializedTwoStreamsSlightlyContended ecmr aeadParameters and Name Benchmark 3.3 Approach 95

Table 3.3: Overview of used cloud-instance types

Provider   Data Center   Family   Instance Type    vCPU   Memory [GB]   Cost [USD/h]
AWS        us-east-1     GP       m4.large         2      8.00          0.1000
AWS        us-east-1     CPU      c4.large         2      3.75          0.1000
AWS        us-east-1     Mem      r4.large         2      15.25         0.1330
Azure      East US       GP       D2s v2           2      8.00          0.1000
Azure      East US       CPU      F2s              2      4.00          0.1000
Azure      East US       Mem      E2s v3           2      16.00         0.1330
GCE        us-east1-b    GP       n1-standard-2    2      7.50          0.0950
GCE        us-east1-b    CPU      n1-highcpu-2     2      1.80          0.0709
GCE        us-east1-b    Mem      n1-highmem-2     2      13.00         0.1184

Additionally, we selected a bare-metal machine available for rent from IBM in its Bluemix cloud. A bare-metal instance is the closest one can get to a controlled performance-testing environment from a public cloud provider. We used an entry-level bare-metal server equipped with a 2.1GHz Intel Xeon IvyBridge (E5-2620-V2-HexCore) processor and 16GB of memory, running 64-bit Ubuntu 16.04, hosted in IBM's data center in Amsterdam, NL. We specifically deactivated hyperthreading and Intel's TurboBoost. Moreover, we attempted to disable frequency scaling, but manual checks revealed that this setting was ineffective and presumably overridden by IBM.

3.3.3 Execution

We use the following methodology to execute benchmarks on cloud instances and collect the resulting performance data. For each cloud instance type, as listed in Table 3.3, we create 50 different instances. On each instance, we sched- ule 10 consecutive experiment trials of each benchmark and randomize the order within each trial, following the method proposed by Abedi and Brecht [2017]. Within each trial, every benchmark (e.g., etcd-1) consists of 50 iterations (e.g., using the -i50 parameter of JMH) and every iteration produces a single data point, which reports the execution time in ns. For JMH benchmarks, we 96 Chapter 3. Software Microbenchmarking in the Cloud. How Bad is it Really? also run and discard 10 warmup iterations prior to the measurement iterations to reach steady-state performance [Georges et al., 2007; Kalibera and Jones, 2013]. Note that 10 warmup iterations of 1 second might not be sufficient to bring the JVM into a steady state, depending on the microbenchmark under consideration. Ideally the warmup time and iterations would be dynamically determined (e.g., following an approach as outlined by Georges et al. [2007]), which JMH does not support yet. For practical situations and the context of our study, 10 warmup iterations are considered sufficient. We use the same terminology of instances, trials, iterations, and data points in the remainder of the paper. These concepts are also summarized in Figure 3.1.
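On the JMH side, these per-trial settings can also be expressed programmatically. The following sketch mirrors the configuration described above (10 discarded warmup iterations and 50 measurement iterations of 1 second each); the benchmark include pattern is a placeholder, and the study itself drove JMH through its command-line parameters rather than this API.

import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;
import org.openjdk.jmh.runner.options.TimeValue;

// Sketch of the per-trial JMH settings: 10 warmup iterations (discarded)
// followed by 50 measurement iterations, each iteration lasting 1 second.
public class TrialRunner {

    public static void main(String[] args) throws RunnerException {
        Options opts = new OptionsBuilder()
                .include("ComputationSchedulerPerf") // placeholder benchmark filter
                .warmupIterations(10)
                .warmupTime(TimeValue.seconds(1))
                .measurementIterations(50)           // corresponds to -i 50
                .measurementTime(TimeValue.seconds(1))
                .forks(1)
                .build();

        new Runner(opts).run();
    }
}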


Figure 3.1: Schematic view of instance types (e.g., AWS GP), instances (e.g., i-1), and randomized-order trials (e.g., t-1) on each instance

Formally, we run performance measurements for all benchmarks b ∈ B (cf. Table 3.2) on all environments e ∈ E (e.g., AWS GP). We refer to the combination of a benchmark b run on a particular environment e as a configuration c ∈ C, where C = B × E. All configurations are executed on instances = 50 instances, with trials = 10 trials per instance, and iters = 50 iterations. The configuration results (i.e., data points reporting execution times in ns, see Section 3.2.1) of a single trial are defined as the set M^c_{i,t} in Equation 3.1 (cf. Benchmark Trial in Figure 3.1, where different shapes refer to distinct benchmarks), where the subscript i determines the instance, the subscript t represents the trial, and the superscript c describes the configuration the respective results are from.

M^c_{i,t} = { dp_j ∈ R^+ | 0 < j ≤ iters }    (3.1)

Consequently, the configuration-execution order within an instance of a particular environment e is specified as the ordered set M^e_i in Equation 3.2, where B'_t is the benchmark set B in a potentially different randomized order for each trial t.

M^e_i = { M^c_{i,t} | 0 < t ≤ trials ∧ c = (b, e), b ∈ B'_t }    (3.2)

The set of measurements for all benchmarks b, instances i, and trials t of a particular environment e is then defined as M^e in Equation 3.3.

M^e = ∪_{0 < i ≤ instances} M^e_i    (3.3)

8 https://github.com/sealuzh/cloud-workbench
9 https://www.vagrantup.com

Table 3.4: Overview of the number of collected data points per cloud instance type, and how many data points remain after data cleaning

Short Name     Instance Type    Total     Post-Cleaning

AWS GP         m4.large         474,952   474,656
AWS CPU        c4.large         475,000   474,999
AWS MEM        r4.large         474,953   474,951
Azure GP       D2s v2           472,636   472,122
Azure CPU      F2s              473,132   473,041
Azure MEM      E2s v3           470,570   470,400
GCE GP         n1-standard-2    474,491   474,490
GCE CPU        n1-highcpu-2     474,725   474,546
GCE MEM        n1-highmem-2     474,716   474,337
Bluemix (BM)   -                474,436   474,185

We removed data points that are more than an order of magnitude higher than the median. These outliers are due to the non-deterministic nature of cloud resources, such as multi-tenancy, decreased network bandwidth, or instance volatility. None of the cloud providers offer information that would explain the reasons for these outliers, nor did we explicitly record cloud failure events. Due to the long-tailed nature of performance-measurement data, we did not remove outliers more aggressively (e.g., 3 standard deviations from the mean) because they might be correct extreme values belonging to the long tail of the result distribution. Table 3.4 lists, for each instance type across all instances, trials, and iterations, how many data points we have collected and how many data points remain after outlier removal.

In particular, Azure instances were most affected by outlier removal, where up to 8 of the studied benchmarks had at least one data point cleaned. The two benchmarks that have the most outliers removed are log4j2-5 and etcd-2, which comes as no surprise as these two also have the highest variability (see Table 3.5). About 1% of the two benchmarks' data points per instance type are cleaned.
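A direct reading of this cleaning rule, interpreting "an order of magnitude" as a factor of 10 above the median, can be sketched as follows; the method name and array layout are illustrative rather than the exact cleaning script used in the study.

import java.util.Arrays;

// Sketch of the outlier-cleaning step: remove data points that are more
// than an order of magnitude (interpreted here as a factor of 10) above
// the median of the measurements they belong to.
public class OutlierCleaning {

    static double[] removeOutliers(double[] executionTimesNs) {
        double[] sorted = executionTimesNs.clone();
        Arrays.sort(sorted);
        double median = sorted.length % 2 == 1
                ? sorted[sorted.length / 2]
                : (sorted[sorted.length / 2 - 1] + sorted[sorted.length / 2]) / 2.0;

        double threshold = 10.0 * median;
        return Arrays.stream(executionTimesNs)
                .filter(dp -> dp <= threshold)
                .toArray();
    }
}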

3.4 Benchmark Variability in the Cloud

To answer RQ 1, we study the benchmarks of all projects in terms of their result variability in all chosen environments. For all configurations (i.e., all benchmarks on all instance types), we report the variability of each benchmark across all 50 instances, 10 trials on each instance, and 50 iterations per trial. We use the coefficient of variation (CV), also referred to as relative standard deviation, of a set of benchmark measurements M, in percent, as the measure of variability cv : M → R^+. cv is defined as cv(M) = σ(M) / µ(M), with σ(M) representing the standard deviation and µ(M) denoting the mean value of all data points of a set of measurements, where M^c_{i,t} ∈ M. CV is a statistical measure for dispersion among a population of values, in this case the performance variability of microbenchmark results as obtained from executions on cloud instances. CVs have been used previously in studies similar to ours, such as Leitner and Cito [2016]. The set of measurements of a specific configuration c across all trials and instances is defined as M^c in Equation 3.4.

M^c = { M^{c'}_{i,t} ∈ M^e | c' = c }    (3.4)

Further, the variability of a configuration (i.e., benchmark-environment combination) is defined as V^{total,c} in Equation 3.5. V^{total,c} represents the variability (i.e., the CV) of a single configuration, computed over the union of all collected measurements of a benchmark b executed in an environment e across all instances i and trials t.

V^{total,c} = cv( ∪_{M^c_{i,t} ∈ M^c} M^c_{i,t} )    (3.5)

Table 3.5 reports these CV variabilities (in percent, that is 100 · V^{total,c}) for all studied configurations (e.g., log4j2-1 on AWS GP) as numerical values and further provides an indication of the 95% relative confidence interval width (RCIW) of these values. The RCIW describes the estimated spread of the population's CV, as computed by statistical simulation (bootstrap with 1,000 iterations).

Table 3.5: Result variability in CV [%] for every combination of benchmark and instance type in the study. The circles indicate the relative 95% RCIW of the CV, computed with statistical simulation (i.e., bootstrap with 1,000 simulations). ● indicates a RCIW below 30%, ◐ between 30% and 60%, and ○ greater than 60%

Bench      AWS GP     AWS CPU    AWS Mem    GCE GP     GCE CPU    GCE Mem    Azure GP    Azure CPU   Azure Mem   BM

log4j2-1   45.41 ◐    42.17 ●    48.53 ●    41.40 ●    43.47 ●    44.38 ●    46.19 ●     40.79 ◐     51.79 ●     41.95 ●
log4j2-2    7.90 ●     4.89 ●     3.92 ◐    10.75 ◐     9.71 ◐    11.29 ◐     6.18 ◐      6.06 ●     11.01 ◐      3.83 ●
log4j2-3    4.86 ○     3.76 ◐     2.53 ◐    10.12 ●     9.18 ◐    10.15 ◐    13.89 ○      7.55 ○     15.46 ○      3.02 ○
log4j2-4    3.67 ●     3.17 ◐     4.60 ●    10.69 ◐     9.47 ◐    10.52 ●    17.00 ○      7.79 ◐     19.32 ◐      6.66 ●
log4j2-5   76.75 ●    86.02 ●    88.20 ●    83.42 ●    82.44 ●    80.75 ●    82.62 ●     86.93 ●     82.07 ●     77.82 ●
rxjava-1    0.04 ●     0.04 ●     0.05 ●     0.04 ◐     0.04 ●     0.04 ●     0.05 ●      0.05 ●      0.27 ○      0.03 ●
rxjava-2    0.70 ●     0.61 ◐     1.68 ○     5.73 ○     4.90 ◐     6.12 ◐     9.42 ○      6.92 ◐     13.38 ◐      0.49 ◐
rxjava-3    2.51 ◐     3.72 ◐     1.91 ●     8.16 ◐     8.28 ◐     9.63 ◐     6.10 ◐      5.81 ●     10.32 ◐      4.14 ●
rxjava-4    4.55 ◐     4.18 ◐     7.08 ◐     8.07 ○    10.46 ○     8.82 ○    17.06 ○     10.22 ○     21.09 ○      1.42 ●
rxjava-5    5.63 ◐     2.81 ●     4.04 ●    14.33 ○    11.39 ◐    13.11 ○    61.98 ●     64.24 ●     21.69 ◐      1.76 ◐
bleve-2     1.57 ◐     1.32 ◐     4.79 ○     5.56 ◐     6.09 ◐     5.78 ◐     5.97 ◐      5.48 ◐     13.29 ◐      0.27 ●
bleve-3     1.13 ◐     7.53 ○     7.77 ○    10.08 ◐    10.74 ◐    14.42 ◐     7.62 ○      6.12 ◐     14.41 ◐      0.18 ●
bleve-4     4.95 ○     4.38 ○     5.17 ◐    11.24 ◐    12.00 ◐    14.52 ◐     8.18 ○      7.11 ◐     15.24 ◐      0.62 ●
bleve-5    10.23 ●     9.84 ●     8.18 ●    57.60 ●    58.42 ●    59.32 ●    52.29 ◐     46.40 ○     52.74 ●     10.16 ●
etcd-1      1.03 ◐     3.17 ○     1.56 ◐     6.45 ◐     5.21 ◐     7.62 ○     6.36 ○      4.89 ◐     11.46 ◐      0.15 ◐
etcd-2      4.06 ●     4.45 ◐     6.28 ○    66.79 ●    69.07 ●    69.18 ●   100.68 ◐     94.73 ○     90.19 ●     29.46 ●
etcd-3      1.25 ○     0.69 ○     1.24 ◐     7.15 ◐     6.57 ◐     9.26 ○     4.95 ◐      4.31 ◐      9.89 ◐      0.14 ◐
etcd-4      6.80 ●     6.00 ●     7.34 ●    34.53 ●    34.34 ●    34.37 ●    12.28 ◐     12.39 ◐     22.92 ◐      8.09 ●
etcd-5     43.59 ●    22.46 ●    43.44 ●    27.21 ●    27.86 ●    27.17 ●    30.54 ●     31.40 ●     24.98 ●     23.73 ●

It provides an indication of how variable the values in Table 3.5 are. A high variability of a benchmark in an environment (e.g., log4j2-5 in all environments) does not necessarily mean that the CV value is highly variable itself (as indicated by the circles). ● represents a RCIW below 30%, ◐ a RCIW between 30% and 60%, and ○ a RCIW larger than 60%. Note that the dispersion observed in this table originates from three different sources of variability: (1) the difference in performance between different cloud instances of the same type, (2) the variability between different trials on an instance, and (3) the "inherent" variability of a benchmark, i.e., how variable the performance results are, even in the best case. Consequently, a large CV in Table 3.5 can have different sources, such as an unpredictable instance type or an unstable benchmark. We elaborate on the different sources of variability later in this section.

3.4.1 Differences between Benchmarks and Instance Types

It is evident that the investigated benchmarks have a wide spectrum of result variability, ranging from 0.03% for rxjava-1 on Bluemix, to 100.68% for etcd-2 on Azure GP. Consequently, the potential slowdown to be detected by the benchmarks will also vary drastically depending on the benchmark and the instance type it is executed on. We observe three groups of benchmarks: (1) some have a relatively small variability across all providers and instance types (e.g., rxjava-1); (2) some show a high variability in any case (e.g., log4j2-5); and (3) some are stable on some instance types but unstable on others (e.g., bleve-5). The first group's results indicate that variability is low, as desired for performance testing. However, the latter two groups are particularly interesting for further analysis to identify reasons for their instability. The second group consists of benchmarks with high variability across all studied instance types. We observe three such benchmarks: log4j2-1, log4j2-5, and to a lesser extent etcd-5. There are two factors that lead to high variability: either the execution time of the benchmark is very short, or the benchmark itself produces unstable results. log4j2-1 and log4j2-5 are examples of the first case, with low execution times in the order of only tens of nanoseconds. For these benchmarks, measurement inaccuracy becomes an important factor for variability. log4j2-1 is also interesting because this benchmark had been identified as very stable in our preliminary studies. We speculate that for such extremely short-running benchmarks (i.e., 4.7ns on average), small variations in the environment (i.e., our pre-study was on a controlled host at the first author's university) can have a substantial impact on the observed measurements and their stability. This makes such benchmarks of questionable use for performance testing. In contrast, etcd-5 has an execution time of around 250,000ns on GCE Mem with a CV of 27.17%. Figure 3.2 depicts the results for this benchmark on all 50 instances in beanplot notation. The variability of this benchmark is comparable across all instance types, with CVs ranging from 22.46% to 43.59%. Even the bare-metal machine from Bluemix has a high variability of 23.73% CV. This indicates that the benchmark itself is rather low-quality and produces unstable measurement results, independently of where it is executed.

[Beanplots: Instance Variability of etcd-5 on GCE Mem; x-axis: Instances (1 to 50); y-axis: Avg. Exec. Time [ns], approx. 1e+05 to 5e+05]

Figure 3.2: Drilldown into the variability of etcd-5, an example of a benchmark with high result variability across all instance types

The third group of benchmarks exhibits high variability on some, but not all, instance types. This group contains two sub-groups: (1) benchmarks that have high standard deviations but similar medians; and (2) benchmarks that have overall varying results, including substantially differing medians on different instances. An example of the first sub-group is log4j2-3 on GCE Mem (and similarly on the other GCE and Azure instances), where the variability of the benchmark differs among the instances of the same instance type (see Figure 3.3). We observe that this benchmark on this instance type has a "long tail" distribution, which is common for performance data. However, the length of this long tail differs from instance to instance. A possible explanation for this phenomenon is the behavior of other tenants on the same physical machine as the instance. Other tenants may compete for resources needed by a benchmark, causing longer tails in the data. We have observed this problem particularly in the case of the log4j2 benchmarks, where manual analysis reveals that they tend to be I/O-intensive (e.g., writing to log files). Previous work has shown that I/O-bound operations suffer particularly from noisy neighbors in a cloud [Leitner and Cito, 2016].

[Beanplots: Instance Variability of log4j2-3 on GCE Mem; x-axis: Instances (1 to 50); y-axis: Avg. Exec. Time [ns], approx. 75 to 175]

Figure 3.3: Drilldown into the variability of log4j2-3, an example of a benchmark with differing variability between instances

More severe variabilities can be observed in the second sub-group, where even medians are shifted substantially between instances. This is illustrated in Figure 3.4 for bleve-5 on Azure. A potential cause for this phenomenon is hardware heterogeneity [Ou et al., 2012; Farley et al., 2012], where different hardware configurations (e.g., different processor generations) are used for different instances of the same instance type. Given that the medians fall into a small number of different groups (only 2 in the case of Figure 3.4), we conclude that hardware heterogeneity rather than multi-tenancy is the culprit for the variability observed in these cases. In this bleve-5 example on Azure, the hardware metadata supports our conclusion by revealing that two different versions of the same CPU model with distinct clock speeds were provisioned. Moreover, an interesting finding is that different instance type families (e.g., general-purpose versus compute-optimized) of the same cloud provider mostly do not differ drastically from each other. The only cloud provider that consistently has different variabilities between its instance types is Azure, where the memory-optimized type does not perform as well as the general-purpose type and the compute-optimized type.

[Beanplots: Instance Variability of bleve-5 on Azure Mem; x-axis: Instances (1 to 50); y-axis: Avg. Exec. Time [ns], approx. 2.5e+09 to 7.5e+09]

Figure 3.4: Drilldown into the variability of bleve-5 on Azure, an example of a benchmark with high result variability due to differences in hardware

A reason for the similarity of different instance types of the same provider may be that the different types are backed by the same hardware, just with different CPU and random-access memory (RAM) configurations. We assume that the benchmarks under study do not fully utilize the provided hardware and thus show little difference.

3.4.2 Sources of Variability

We now discuss three different sources of variability in more detail. These sources are (1) variability inherent to a benchmark ("Benchmark Variability"), (2) variability between trials on the same instance ("Variability between Trials"), and (3) variability between different instances of the same type ("Total Variability"). This gives insight into whether variability originates from the benchmark itself (e.g., non-deterministic behavior of the code or the runtime environment), from fluctuating performance within a single instance (e.g., due to noisy neighbors), or from different performance behavior of instances of the same type (e.g., hardware heterogeneity). Expectedly, the relative impact of these sources of variability differs for different configurations. Some examples are depicted in Figure 3.5. Each subfigure contrasts three different CV values: the mean CV per trial of a benchmark (Benchmark Variability), the mean CV per instance (Variability between Trials), and the total CV of a configuration as also given in Table 3.5 (Total Variability). The red error bars signify the standard deviation. Notice that this is not meaningful for the Total Variability, which consists of a single CV value per configuration. Recall the definition of the measurements for a configuration M^c (see Equation 3.4) and the variability across all measurements V^{total,c} (see Equation 3.5). In a similar vein, we now define the other two variabilities discussed in this section, i.e., the variability inherent to a benchmark V^{trials,c} and the variability between trials of an instance V^{instances,c}. Equation 3.6 specifies the variability inherent to a benchmark V^{trials,c}. It is the set of all CVs per trial of a benchmark b executed in an environment e.

V^{trials,c} = ∪_{M^c_{i,t} ∈ M^c} cv(M^c_{i,t})    (3.6)

V^{instances,c} (see Equation 3.7) describes the variability within each instance i across all trials t (i.e., trials = 10) on this particular instance. It is the set of CVs for each instance i (i.e., instances = 50), each calculated from all measurements M^c_{i,t} (trials × iters = 500 data points) taken on this instance.

V^{instances,c} = ∪_{i' ∈ [1, instances]} cv( ∪_{M^c_{i,t} ∈ M^c ∧ i = i'} M^c_{i,t} )    (3.7)
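The three definitions translate directly into code over the nested measurement structure (instances, trials, iterations). The sketch below assumes the data points of one benchmark-environment configuration are held in a three-dimensional array; the layout and method names are illustrative, not the study's actual analysis scripts.

import java.util.ArrayList;
import java.util.List;

// Sketch: compute V^{trials,c}, V^{instances,c}, and V^{total,c} for one
// configuration, given data[instance][trial][iteration] execution times.
public class VariabilitySources {

    // cv(M) = sigma(M) / mu(M)
    static double cv(double[] m) {
        double mean = 0;
        for (double v : m) mean += v;
        mean /= m.length;
        double var = 0;
        for (double v : m) var += (v - mean) * (v - mean);
        return Math.sqrt(var / m.length) / mean;
    }

    // V^{trials,c}: one CV per trial (variability inherent to the benchmark).
    static List<Double> perTrialCVs(double[][][] data) {
        List<Double> cvs = new ArrayList<>();
        for (double[][] instance : data)
            for (double[] trial : instance)
                cvs.add(cv(trial));
        return cvs;
    }

    // V^{instances,c}: one CV per instance, over all its trials' data points.
    static List<Double> perInstanceCVs(double[][][] data) {
        List<Double> cvs = new ArrayList<>();
        for (double[][] instance : data)
            cvs.add(cv(flatten(instance)));
        return cvs;
    }

    // V^{total,c}: a single CV over all data points of the configuration.
    static double totalCV(double[][][] data) {
        List<Double> all = new ArrayList<>();
        for (double[][] instance : data)
            for (double[] trial : instance)
                for (double dp : trial) all.add(dp);
        return cv(all.stream().mapToDouble(Double::doubleValue).toArray());
    }

    private static double[] flatten(double[][] trials) {
        List<Double> all = new ArrayList<>();
        for (double[] trial : trials)
            for (double dp : trial) all.add(dp);
        return all.stream().mapToDouble(Double::doubleValue).toArray();
    }
}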

Figure 3.5 depicts the variabilities V^{trials,c} (top bar), V^{instances,c} (middle bar), and V^{total,c} (bottom bar) for four configuration examples. The top-left and top-right subfigures provide a drilldown on etcd-2 in different clouds, and explore in more detail why this curious benchmark is remarkably stable on AWS and unstable in GCE and Azure (see also Table 3.5). The top-left subfigure shows the benchmark on AWS GP. There is little difference in the CVs between trials, instances, and total, indicating that the small variability of this benchmark largely originates from the benchmark itself. This is different for the same benchmark executed on Azure (top-right subfigure).

[Bar charts of CV for four example configurations: AWS GP / etcd-2, Azure GP / etcd-2, AWS CPU / log4j2-5, and GCE Mem / etcd-4; bars per panel: Benchmark Variability, Variability between Trials, Total Variability; x-axis: CV]

Figure 3.5: Drilldown on the source of variability for four example configurations

While the per-trial and per-instance CV is also substantially larger than on AWS, it is particularly the total CV that is now very high. This indicates that etcd-2 is fairly stable within a trial and among multiple trials within an instance, but becomes comparatively much more unstable when considering executions of the benchmark across multiple instances of the same type. Further, this indicates that the reason for the difference in stability between cloud providers is caused by varying instance performance. A different example is provided by the bottom-left subfigure, which shows log4j2-5 on AWS CPU. Here, the inherent benchmark variability is minuscule, but there are substantial differences between different trials, largely independent of whether these trials happen on the same instance or not. This indicates that for this benchmark a large source of variability are trial-level effects, such as different types of resource contention (e.g., CPU, memory, I/O) over the duration of its execution. Finally, the example on the bottom-right shows etcd-4 on GCE Mem. This benchmark has a high total variability, which is composed of a combination of large inherent benchmark variability and substantial performance variability between different instances. This is a case where all three variability types contribute a similarly large fraction of the overall instability of the benchmark execution's result.

Benchmark Variability: In RQ 1, we studied the variability of microbenchmark results when executed on cloud instances. The CVs for the studied benchmark-environment configurations vary between 0.03% and 100.68%. This variability originates from three sources (variability inherent to a benchmark, between trials on the same instance, and between different instances), and different benchmark-environment configurations suffer to very different degrees from each of these sources. The bare-metal instance expectedly produces very stable results; however, AWS is typically not substantially less stable. Based on these results, we conclude that GCE and Azure seem to lend themselves much less to performance testing than AWS at the time of study.

3.5 Reliably Detecting Slowdowns

To answer RQ 2, we compare two standard statistical tests for evaluating whether the performance of a software system has changed. First, we outline two approaches for running performance experiments and sampling data from existing measurements, i.e., instance-based and trial-based sampling strategies. Then we investigate how many FPs the two tests report when executing A/A tests in the studied environments. Intuitively, this gives us an indication of which benchmarks in which configurations should not be used for performance testing at all, as they frequently indicate slowdowns even when comparing identical software versions. Lastly, we explore the minimal detectable slowdowns (MDSs) of all benchmarks in all configurations with a sufficiently low number of FPs during A/A testing.

3.5.1 Statistical Tests

For both the A/A tests and the MDS tests (also referred to as A/B tests), we use two standard statistical tests for evaluating whether a software's performance has changed. Literature on performance evaluation suggests the usage of two test types: (1) hypothesis testing with Wilcoxon rank-sum (sometimes referred to as the Mann-Whitney U test) combined with effect sizes [Bulej et al., 2017b], and (2) change detection through testing for overlapping confidence intervals of the mean [Bulej et al., 2017b; Georges et al., 2007; Kalibera and Jones, 2012, 2013]. Note that Wilcoxon rank-sum is a test for a difference of medians, while the confidence interval test employed in this paper is for means.

Wilcoxon Rank-Sum

Firstly, we study the applicability of Wilcoxon rank-sum for performance change detection. Due to the non-normality (i.e., long-tailed or, in some cases, multi-modal distributions) of performance data, we choose the Wilcoxon rank-sum test because it is applicable to non-normal data. We formulate the null hypothesis H0 as the two performance result populations (i.e., test and control group) having the same performance (mean execution time), when both groups contain performance results from the same benchmark in the same environment. Consequently, the alternative hypothesis H1 states that the two compared performance populations do not have the same performance, and hence we detect a performance change. All experiments use a 95% confidence level. That is, we report a statistically significant performance change iff the p-value of the Wilcoxon rank-sum test is smaller than or equal to 0.05 (i.e., p ≤ 0.05). Further, due to the large number of data points, hypothesis tests might report a statistically significant difference with a low effect size. Therefore, we combine the Wilcoxon test with a test for a minimum effect size as measured by Cliff's Delta [Cliff, 1996]. We test for an effect size of "medium" (0.33) or larger, as defined by Romano et al. [2006]. Similar to the usage of Wilcoxon rank-sum, we utilize Cliff's Delta as a measure of effect size due to its applicability to non-normal data. We have also conducted experiments testing for an effect size of small or larger (|d| ≥ 0.147), but determined that this led to excessive FPs in most tested configurations. Hence, we do not report on these results here. However, the corresponding tables are part of our online appendix [Laaber et al., 2019a].
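A sketch of this combined decision rule is shown below. It assumes the Apache Commons Math library for the rank-sum test (the study's actual tooling is not prescribed here) and computes Cliff's Delta directly; a change is reported only if p ≤ 0.05 and |d| ≥ 0.33 (medium or larger).

import org.apache.commons.math3.stat.inference.MannWhitneyUTest;

// Sketch: report a performance change iff the Wilcoxon rank-sum
// (Mann-Whitney U) p-value is <= 0.05 and Cliff's Delta indicates at
// least a medium effect size (|d| >= 0.33).
public class ChangeDetection {

    static boolean changeDetected(double[] control, double[] test) {
        double p = new MannWhitneyUTest().mannWhitneyUTest(control, test);
        double d = cliffsDelta(control, test);
        return p <= 0.05 && Math.abs(d) >= 0.33;
    }

    // Cliff's Delta: (#{x > y} - #{x < y}) / (n * m)
    static double cliffsDelta(double[] x, double[] y) {
        long greater = 0, smaller = 0;
        for (double xi : x) {
            for (double yj : y) {
                if (xi > yj) greater++;
                else if (xi < yj) smaller++;
            }
        }
        return (double) (greater - smaller) / ((long) x.length * y.length);
    }
}

With sampled control and test measurements as inputs, this returns true only for differences that are both statistically significant and of at least medium effect size.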

Confidence Intervals

Classic textbooks on performance analysis [Jain, 1991; John and Eeckhout, 2005] suggest that confidence intervals should be preferred over hypothesis testing. Therefore, we additionally study how well slowdowns can be detected by testing for overlapping 95% confidence intervals of the mean. We report a difference between test and control group iff the 95% confidence intervals of the mean of the two populations do not overlap. Due to the non-normality of performance data, a simple computation of the confidence interval of the mean would be invalid. To address this, we apply statistical simulation, i.e., bootstrapping [Davison and Hinkley, 1997], with hierarchical random re-sampling [Ren et al., 2010] with replacement. In detail, the approach employed is tailored to evaluating performance data and has been introduced by Kalibera and Jones [2012, 2013]. Re-sampling happens on three levels: (1) instance level, (2) trial level, and (3) iteration level. We run 100 bootstrap iterations for each confidence interval computation. Although Kalibera and Jones [2013] suggest running 1,000 bootstrap iterations, the additional layer of repeated simulations (100 times re-sampling of test and control group for the upcoming A/A tests and minimal detectable slowdowns) adds drastically to the overall runtime, i.e., about 130 hours for A/A tests and 500 hours for detection tests. A manual analysis of the computed confidence intervals (for the benchmark variabilities in Section 3.4) with 100 and 1,000 bootstrap iterations suggested that the gain of using 1,000 bootstrap iterations is relatively small. Hesterberg [2015] even suggests adopting 10,000 bootstrap simulations, as smaller numbers might suffer from Monte Carlo variability. While individual results might change due to the randomized nature of bootstrapping, the overall results presented in the paper are expected to remain stable, especially because our experiments (A/A and detection tests) sample 100 different test and control groups (see Section 3.3) and compute bootstrapped confidence intervals for each of these 100 pairs with 100 bootstrap iterations, which in total reaches the 10,000 samples suggested by Hesterberg [2015]. We refer to Kalibera and Jones [2012, p. 24ff] and our online appendix [Laaber et al., 2019a] for details on how the confidence interval is computed.
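The hierarchical resampling can be sketched as follows: in each bootstrap iteration, instances are drawn with replacement, then trials within each drawn instance, then iterations within each drawn trial, and the mean of the resampled data is recorded; the 2.5th and 97.5th percentiles of these means form the 95% interval. This is a simplified rendering of the Kalibera and Jones procedure under our assumptions, not their exact estimator.

import java.util.Arrays;
import java.util.Random;

// Sketch of a hierarchical bootstrap (resampling with replacement on the
// instance, trial, and iteration level) for a 95% confidence interval of
// the mean, using 100 bootstrap iterations as in the study.
public class HierarchicalBootstrap {

    static double[] confidenceInterval95(double[][][] data, Random rnd) {
        int bootIters = 100;
        double[] means = new double[bootIters];

        for (int b = 0; b < bootIters; b++) {
            double sum = 0;
            long count = 0;
            for (int i = 0; i < data.length; i++) {
                double[][] instance = data[rnd.nextInt(data.length)];        // resample instances
                for (int t = 0; t < instance.length; t++) {
                    double[] trial = instance[rnd.nextInt(instance.length)]; // resample trials
                    for (int j = 0; j < trial.length; j++) {
                        sum += trial[rnd.nextInt(trial.length)];             // resample iterations
                        count++;
                    }
                }
            }
            means[b] = sum / count;
        }

        Arrays.sort(means);
        double lower = means[(int) Math.floor(0.025 * (bootIters - 1))];
        double upper = means[(int) Math.ceil(0.975 * (bootIters - 1))];
        return new double[] { lower, upper };
    }
}

Two groups are then flagged as different iff their bootstrapped intervals, computed this way for the control and test measurements, do not overlap.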

3.5.2 Sampling Strategies

We now outline two sampling strategies along the dimensions of the study (i.e., trials and instances), which define how the test and control group of a performance experiment can be executed. Recall from Section 3.3.3 and Section 3.4 the formal definitions of the specific measurements $M^c_{i,t}$ of a configuration c for trial t and instance i (see Equation 3.1) and the definition for all measurements of a defined configuration $M^c$ (see Equation 3.4). The selection of the test and control group is then defined as select in Equation 3.8.

$$ select : M^c, sel^{instances}, sel^{trials} \mapsto \langle M'^c, M'^c \rangle \qquad (3.8) $$

select takes as input the measurements of a configuration $M^c$, the desired number of instances $sel^{instances}$ to sample, and the desired number of trials $sel^{trials}$ to sample. The function returns a tuple of sampled configuration results $M'^c \subset M^c$ where the first element corresponds to the test group and the second element to the control group. Concrete implementations of select are the sampling strategies instance-based sampling (ibs) and trial-based sampling (tbs), which have the same signature and are described in the following.

Instance-Based Sampling

Instance-based sampling implements the idea of running test and control groups on different instances. In practical terms, this emulates the situation when a performance engineer wants to compare a new version of a system against older performance data, which has been measured at a previous time, for instance when the previous version was released. We assume that, between releases, cloud instances are terminated to save costs.

[Figure 3.6 consists of two panels, (a) Instance-based and (b) Trial-based, each depicting control and test group as instances i-1 … i-n with trials t-1 … t-m of 50 data points (dp1 … dp50); see the caption below.]

Figure 3.6: Sampling strategies. (a) Instance-based: test and control group consist of disjoint instances (blue/green, b/w: dark/medium gray) with m trials each (light gray). (b) Trial-based: test and control group consist of the same instances (light gray), with disjoint m trials each taken from the same instances (blue/green, b/w: dark/medium grey)

Figure 3.6a visualizes the instance-based sampling strategy, and Equation 3.9 formally defines the function ibs that performs this selection. It randomly selects (potentially) multiple different instances for each test and control group (in blue and green; $sel^{instances}$) and uses random trials from each instance (in light grey; $sel^{trials}$).

$$ ibs(M^c, sel^{instances}, sel^{trials}) = \Big\langle \{ M^c_{i,t} \in M^c \mid i \in I^{TG} \wedge t \in rand(T^{\#}, sel^{trials}) \},\; \{ M^c_{i,t} \in M^c \mid i \in I^{CG} \wedge t \in rand(T^{\#}, sel^{trials}) \} \Big\rangle \qquad (3.9) $$

For Equation 3.9 the conditions in Equation 3.10 hold. The first clause defines the test group's instances $I^{TG}$ of size $sel^{instances}$, which are randomly selected from $I^{\#} = [1, instances]$ with instances being the total number of instances (i.e., instances = 50 as defined in Section 3.3.3). The second clause, similar to the first one, describes the control group's instances $I^{CG}$ of size $sel^{instances}$, which have the same selection criterion as the test group's instances. Finally, the third clause ensures that the sets of instances for test and control group are disjoint (i.e., that we are not using the same instances for test and control group). Supporting the above and following equations, the function rand takes as input an arbitrary set and a member count and returns that many randomly selected members of the set.

$$ I^{TG} = rand(I^{\#}, sel^{instances}) \;\wedge\; I^{CG} = rand(I^{\#}, sel^{instances}) \;\wedge\; I^{TG} \cap I^{CG} = \emptyset \qquad (3.10) $$
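A minimal Python sketch of instance-based sampling could look as follows, assuming measurements are stored as a mapping from instance id to its list of trials (each trial being a list of data points); the names are illustrative, not the authors' implementation.

```python
import random

def ibs(measurements, sel_instances, sel_trials):
    """Instance-based sampling: test and control group come from disjoint instances.
    measurements: dict instance_id -> list of trials (each trial a list of data points)."""
    ids = random.sample(sorted(measurements), 2 * sel_instances)
    test_ids, control_ids = ids[:sel_instances], ids[sel_instances:]  # disjoint by construction
    pick = lambda inst_ids: [trial for i in inst_ids
                             for trial in random.sample(measurements[i], sel_trials)]
    return pick(test_ids), pick(control_ids)
```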

Trial-Based Sampling

Trial-based sampling implements the idea of running test and control groups on the same instances, as opposed to instance-based sampling where they are executed on different instances. This emulates the case where a performance engineer starts multiple instances and then runs both the test and the control group, potentially multiple times, on the same instance in randomized order. This minimizes the impact of the specific instance's performance, which we have established to be an important factor contributing to variability in many cases. This approach can also be seen as a paired statistical test. Hence, we expect that this approach should generally lead to fewer FPs and smaller MDSs. Figure 3.6b illustrates and Equation 3.11 formalizes trial-based sampling, which randomly selects (potentially) multiple instances ($sel^{instances}$) per benchmark and environment. However, different from the instance-based strategy, the control and test group now consist of the same instances (in light grey; $sel^{instances}$) with multiple different randomly selected trials each (in blue and green; $sel^{trials}$).

$$ tbs(M^c, sel^{instances}, sel^{trials}) = \Big\langle \{ M^c_{i,t} \in M^c \mid i \in I' \wedge t \in T^{TG}_i \},\; \{ M^c_{i,t} \in M^c \mid i \in I' \wedge t \in T^{CG}_i \} \Big\rangle \qquad (3.11) $$

Equation 3.12 describes the conditions that hold for Equation 3.11. The first clause defines the test and control group's instances $I'$, which are randomly selected from $I^{\#} = [1, instances]$ with instances being the total number of instances (i.e., instances = 50 as defined in Section 3.3.3). The second clause describes the trials selected from each instance for the test group $T^{TG}_i$, which have a size of $sel^{trials}$ (e.g., 5) and are randomly selected for each instance i from $T^{\#} = [1, trials]$ with trials being the total number of trials as defined in our methodology as trials = 10 (cf. Section 3.3.3). The third clause, similar to the second one, describes the control group's trials $T^{CG}_i$, which have the same selection criterion as the test group's trials. The fourth clause ensures that the sets of trials for test and control group are disjoint.

$$ I' = rand(I^{\#}, sel^{instances}) \;\wedge\; \forall i \in I' :\; \big( T^{TG}_i = rand(T^{\#}, sel^{trials}) \;\wedge\; T^{CG}_i = rand(T^{\#}, sel^{trials}) \;\wedge\; T^{TG}_i \cap T^{CG}_i = \emptyset \big) \qquad (3.12) $$
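Analogously, a minimal sketch of trial-based sampling, under the same assumed data layout as above, samples disjoint trial sets from the same instances:

```python
import random

def tbs(measurements, sel_instances, sel_trials):
    """Trial-based sampling: test and control group share instances but use disjoint trials."""
    ids = random.sample(sorted(measurements), sel_instances)
    test, control = [], []
    for i in ids:
        trials = random.sample(measurements[i], 2 * sel_trials)  # disjoint trial sets per instance
        test.extend(trials[:sel_trials])
        control.extend(trials[sel_trials:])
    return test, control
```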

3.5.3 A/A Testing

In this section, we perform A/A tests of all selected benchmarks with different sample sizes and the two sampling strategies (select). The goal of A/A testing is to compare samples that, by construction, do not stem from a differently performing application (i.e., running the same benchmark in the same environment). Thus, the goal of these tests is to validate that the experiment environment does not report a slowdown if, by construction, a slowdown is not possible. Following common statistical practice, we define the upper FP threshold to be 5% and consider a specific benchmark in a specific environment to be too unreliable if it exceeds the 5% FP threshold. That is, from 100 random tests between identical versions, we hope for 5 or fewer FPs. Recall the approach from Section 3.3 depicted in Figure 3.1: we executed every benchmark on each individual instance 10 times (trials = 10), and repeated these trials 50 times on different instances of the same instance type (instances = 50). We now randomly select 1, 2, 3, 5, 10, 15, and 20 instances ($sel^{instances}$), and then randomly select 1, 2, 3, and 5 trials ($sel^{trials}$) for test and control group. As selection strategy (select) we use both instance-based (ibs) and trial-based (tbs) sampling. For each benchmark-environment configuration (e.g., log4j2-1 in AWS GP), we repeat this experiment 100 times with different randomly-sampled instances and trials for test and control groups. To account for the increased likelihood of rejecting H0 when testing a distinct benchmark-environment configuration 100 times with Wilcoxon rank-sum, we apply a Bonferroni correction to the resulting p-values of the 100 test-control-group samples.
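A hedged sketch of one A/A simulation for a single benchmark-environment configuration, reusing a sampler such as ibs or tbs from above, might look as follows; the repetition count and helper names are illustrative simplifications of the procedure described in the text.

```python
from scipy.stats import mannwhitneyu

REPETITIONS = 100  # 100 randomly re-sampled test/control pairs per configuration

def false_positive_rate(measurements, sampler, sel_instances, sel_trials, alpha=0.05):
    """Count how often identical populations are falsely reported as different."""
    flatten = lambda trials: [dp for trial in trials for dp in trial]
    p_values = []
    for _ in range(REPETITIONS):
        test, control = sampler(measurements, sel_instances, sel_trials)
        _, p = mannwhitneyu(flatten(test), flatten(control), alternative="two-sided")
        p_values.append(p)
    corrected = [min(1.0, p * REPETITIONS) for p in p_values]  # Bonferroni correction
    fps = sum(1 for p in corrected if p <= alpha)
    return fps / REPETITIONS
```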

Example

Table 3.6 shows an example of the FP-rates for bleve-3. The first four rows use ibs whereas the last four rows use tbs. Intuitively, the table reports in percent how often a confidence interval test falsely reported a performance change when neither benchmark nor code had in fact changed. The smallest sample sizes (i.e., one instance and one trial) for both sampling strategies show a high number of FPs across all environments, with the bare-metal one from Bluemix (column “BM”) being the “most reliable” environment with approximately 40% FPs. Increasing the number of trials and/or instances yields fewer FPs for both strategies. ibs does not achieve much better results when only considering more trials but still only one instance for both test and control group (row 2). However, increasing the number of instances to 10 (row 3) results in less than 10% FPs for this particular benchmark. Increasing the sample size even further to 20 instances and 5 trials per instance, the instance-based strategy has at most 5% FPs across all studied environments, which we consider acceptable.

Table 3.6: FP-rates in percent testing for overlapping confidence intervals for bleve-3 for instance-based (ibs, top) and trial-based (tbs, bottom) sampling, across all studied environments, with example configurations

Sampling | # Ins. | # Trials | AWS GP | AWS CPU | AWS Mem | GCE GP | GCE CPU | GCE Mem | Azure GP | Azure CPU | Azure Mem | BM
ibs      |  1     |  1       |  81    |  81     |  90     |  82    |  88     |  92     |  86      |  87       |  71       | 43
ibs      |  1     |  5       |  69    |  35     |  85     |  71    |  73     |  69     |  81      |  72       |  65       | 13
ibs      | 10     |  1       |   6    |   7     |   4     |   6    |   5     |   6     |   6      |   7       |   1       |  6
ibs      | 20     |  5       |   3    |   2     |   0     |   4    |   4     |   2     |   5      |   4       |   1       |  0
tbs      |  1     |  1       |  49    |  67     |  64     |  67    |  65     |  65     |  27      |  41       |  26       | 38
tbs      |  1     |  5       |  11    |  13     |   5     |   6    |   8     |  10     |   3      |   6       |   8       |  2
tbs      |  3     |  1       |   3    |   2     |   0     |   4    |   4     |   5     |   0      |   2       |   3       |  8
tbs      |  3     |  5       |   0    |   0     |   0     |   1    |   0     |   0     |   0      |   0       |   0       |  5

In comparison, when running test and control group on the same instances (tbs), already an increase to five trials on a single instance produces much fewer FPs (row 6). A small sample size such as three instances with five trials already reduces the FP-rate of all but one cloud environment to 0. Interestingly, we still observe 5% FPs in the Bluemix environment, despite a very low CV. In this particular case (i.e., for benchmark bleve-3), the confidence intervals are extremely narrow and therefore already minor shifts of the mean in the control group are reported as statistically significant changes. We omit detailed results for other benchmarks and the Wilcoxon rank-sum testing because they are generally in line with the example provided above. Full details for every benchmark, both sampling strategies, and both statistical tests can be found in the online appendix [Laaber et al., 2019a].

Impact of Sampling Strategy

We now provide a more detailed look into the impact of the different sampling strategies and chosen number of samples. Figure 3.7 shows the density plots of FP rates for all studied environments. The left subfigures (Figure 3.7a) show the Wilcoxon rank-sum test and the right subfigures (Figure 3.7b) show the results for the confidence interval test. The first two subplots are instance-based sampling results with one instance and one trial and ten instances and one trial respectively. The lower three subplots depict trial-based sampling results with two instances and five trials, five instances and two trials, and five instances and five trials respectively. The blue horizontal line indicates the 5%-FP threshold, which we consider as acceptable upper bound across the 100 simulations. These figures make obvious that for instance-based sampling, even with 10 instances, testing using Wilcoxon leads to a large number of FPs for most benchmark-environment configurations (intuitively, most of the area under the curve is on the right-hand side of the 5% border). The smallest sample size (one instance with one trial using ibs; top row) produces FP rates > 5% for almost all configurations and only a few are below the 5% threshold: 18/190 for Wilcoxon and 8/190 for confidence intervals. This improves when considering 10 instances with one trial and ibs (second row). Across all configurations only 44 of 190 have 5% FPs or less, 48 have between 10% and 20%, and still 54 have more than 20% FPs when using Wilcoxon. If using only a single trial from each instance for test and control group (e.g., if a performance engineer were to run the benchmarks only once for each release), the rate of reported FPs even exceeds 50% for 4 configurations. Using overlapping confidence intervals and 10 instances, the situation improves drastically (110 of 190 configurations, or close to 60%, have ≤ 5% FPs), although still 80 configurations exhibit too many FPs; 11 configurations have a FP rate of 10% or more, but none has 20% or more. Changing to trial-based sampling improves the overall FP rate across all environments. Both statistical tests show below 5% FPs for almost all configurations (185 of 190, or 97%, with Wilcoxon, and 189 of 190 with confidence intervals) when increasing the sample size to five instances with five trials. Generally, confidence intervals produce slightly fewer total FPs than Wilcoxon rank-sum. From a total of 19,000 simulated tests (19 benchmarks * 10 environments * 100 simulations), only 75 (0.4%) and 106 (0.6%) FPs were reported when testing with confidence intervals and Wilcoxon respectively.

[Figure 3.7 shows density plots of FP rates per cloud provider (AWS, Azure, Bluemix, GCE) for five sampling configurations: 1 instance with 1 trial (instance-based), 10 instances with 1 trial (instance-based), 2 instances with 5 trials (trial-based), 5 instances with 2 trials (trial-based), and 5 instances with 5 trials (trial-based); panel (a) shows the Wilcoxon rank-sum test, panel (b) the confidence interval test.]

Figure 3.7: False-positive differences between environments for Wilcoxon rank-sum and confidence interval tests

Comparing the different cloud providers shows interesting results. Studying the worst case (one instance with one trial using ibs, top row) and the best case (five instances with five trials using tbs, bottom row), we can hardly notice any differences among the cloud providers with respect to their FP rates. For the worst case, only 2 (Azure, Bluemix) to 3 (AWS, GCE) benchmarks have < 5% FPs in all environments. In the best case, 0 (AWS, GCE, Azure) or 1 (Bluemix) benchmarks are not below this threshold. Counter-intuitively, we observe that the more stable cloud environments (AWS and Bluemix, based on results from Section 3.4) lead to more FPs in the other three examples (rows two, three, and four). This is because environments with high dispersion between individual measurements often lead to the Wilcoxon test not being able to reject H0 as well as to wide confidence intervals, which incidentally helps to reduce the number of FPs. However, we expect that the number of accurately identified actual slowdowns (see Section 3.5.4) will be low as well in the more unstable environments. That is, these results should not be taken to mean that an unstable environment is more useful for performance testing than a stable one, just that a specific type of problem (namely a falsely identified slowdown) is less likely to happen. Another interesting observation is that the IaaS environments (AWS, GCE, and Azure), and in particular the more unreliable environments (GCE and Azure), benefit more from employing tbs rather than ibs. This indicates that the RMIT methodology introduced by Abedi and Brecht [2017] can indeed be considered a best practice.

We observe the same trends for the different projects under study. Trial-based sampling outperforms instance-based sampling, and confidence intervals produce slightly more reliable results than Wilcoxon rank-sum tests. An investigation of the FP rate differences between instance families (i.e., GP, CPU, Mem) was inconclusive. We did not see substantial differences between instance families of the same cloud provider in the density plots, nor did an examination of the detailed per-benchmark results reveal major differences.

Minimal Number of Required Samples

In the examples above, we see that an increase of samples (instances and/or trials) as well as the sampling strategy has a direct impact on the reliability of the obtained results, i.e., the number of wrongly reported performance changes (FPs). Figure 3.8 shows the minimal number of required samples (number of instances and trials) for the statistical tests to report below 5% FPs. For every configuration (benchmark-environment combination), we report a single dot that indicates the lowest sampling which has below 5% FPs. Intuitively, the further to the bottom-left corner of each plot, the fewer samples are required to not suffer from excessively many FPs and therefore to have a sufficiently stable environment for slowdown detection. The lowest sample number is defined by the product of the number of instances and the number of trials. As this can result in the same number of samples (e.g., one instance and five trials vs. five instances and one trial), we first check more trials of the same instances, because an increase of trials compared to an increase of instances tends to have a smaller impact on the FP rate. For both tests (Wilcoxon test in the first row and confidence interval test in the second row) and both sampling strategies (instance-based in the left column and trial-based in the right), one instance and one trial is sufficient only for a small number of benchmarks. These configurations are exclusively from the RxJava and etcd projects. For instance-based sampling (left column), just an increase in the number of trials does not produce better results. The majority of the results analyzed with Wilcoxon require repeating the measurements on 10 instances or more to reach < 5% FPs. A few configurations only drop below that threshold with two or more trials for each instance. In comparison, confidence intervals (bottom-left plot) require between five and 15 instances when using one trial. Interestingly, especially Log4j2 configurations benefit more from an increased number of trials than the other projects when using confidence intervals. Considering all studied benchmarks, an upper bound of minimally required samples for ibs is 5,000 (20 instances * 5 trials * 50 iterations) when testing with Wilcoxon and 1,000 (20 instances * 1 trial * 50 iterations) when employing the confidence interval test.
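The selection order described above can be sketched as follows; the function is a hypothetical illustration of the tie-breaking rule (prefer more trials on fewer instances for equal sample products), not the actual analysis script.

```python
INSTANCES = [1, 2, 3, 5, 10, 15, 20]
TRIALS = [1, 2, 3, 5]

def minimal_samples(fp_rate, threshold=0.05):
    """fp_rate: callable (instances, trials) -> observed false-positive rate in [0, 1].
    Returns the smallest (instances, trials) configuration staying below the threshold."""
    candidates = sorted(((i, t) for i in INSTANCES for t in TRIALS),
                        key=lambda c: (c[0] * c[1], c[0]))  # ties: fewer instances, more trials
    for instances, trials in candidates:
        if fp_rate(instances, trials) <= threshold:
            return instances, trials
    return None  # no tested configuration stays below the FP threshold
```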

[Figure 3.8 consists of four jitter plots (Wilcoxon test with ibs, Wilcoxon test with tbs, confidence-interval test with ibs, confidence-interval test with tbs), plotting trials (y-axis) against instances (x-axis) per project (Log4j2, RxJava, bleve, etcd); see the caption below.]

Figure 3.8: Minimally required samples to reach < 5% false-positives. The top row shows the Wilcoxon rank-sum test, and the bottom row depicts confidence intervals. The left column shows instance-based and the right column shows trial-based sampling. Note that this is a jitter plot for discrete values; therefore all data points within one cluster correspond to the same value

For trial-based sampling (right column) we see a similar trend, where confidence intervals require fewer instances/trials than testing with Wilcoxon. Even more evidently, Log4j2 turns out to be the project with the highest FP rate. As expected by the design of trial-based sampling (running test and control group on the same instances), an increase in the number of trials often reduces the FP rate. However, a surprising result is that many benchmark-environment combinations already yield low FP rates for a single trial on two to five instances when testing with confidence intervals. Over all benchmarks, at most 1,500 samples (10 instances * 3 trials * 50 iterations) for Wilcoxon and 750 samples (5 instances * 3 trials * 50 iterations) for the confidence interval test are required for a maximum of 5% FPs when employing tbs. Our A/A testing results show that for the microbenchmarks under study the overlapping confidence interval test (bottom row) indeed yields better results than the Wilcoxon rank-sum test (top row). That is, fewer samples are required to obtain similarly stable results. This result is in line with previous scientific arguments [Jain, 1991; Kalibera and Jones, 2013].

A/A Testing: Identical measurements (e.g., the same benchmark executed in the same environment without any code changes) suffer from falsely-reported performance changes when they are taken from a small number of instances and trials. Nonetheless, an increase in samples yields an acceptably low number of FPs (i.e., < 5%) for all studied benchmarks and environments, both sampling strategies, and both statistical tests. Hence, testing for performance with software microbenchmarks in (unreliable) cloud environments is possible. However, a substantial number of experiment repetitions is required, optimally executed using a trial-based strategy. We have confirmed that testing using overlapping confidence intervals yields substantially fewer false positives with the same setup than using a Wilcoxon test with a medium Cliff's delta effect size.

3.5.4 Minimal Detectable Slowdown Sizes

The previous section showed that, depending on the sample size (number of instances and trials), we are able to utilize cloud instances for microbenchmark execution with low FP rates when performing A/A tests. Especially when employing trial-based sampling (running test and control group on the same instances), already using five instances with five trials produces results that have below 5% FPs for most studied benchmarks and environments. Taking a step further, we now investigate the sizes of slowdowns that are detectable with both sampling strategies (ibs and tbs) using both studied statistical tests (Wilcoxon rank-sum with a Cliff's Delta effect-size measure of medium or larger, and overlapping 95% confidence intervals of the mean computed using bootstrapping).

Approach

To find the MDS, i.e., the slowdown that can be identified in 95% of cases, with at most 5% of FPs, by a benchmark in a particular environment, we performed experiments based on the following procedure. For each benchmark and each instance type, we investigate a fixed list of 11 simulated slowdowns ranging from a tiny 0.1% slowdown to a massive one of 1,000% (i.e., the respective functionality got 10 times slower). Concretely, we experiment with the following slowdowns: 0.1%, 0.2%, 0.5%, 1%, 1.5%, 2%, 5%, 10%, 50%, 100%, and 1,000%. For each simulated slowdown, we use both sampling strategies (ibs and tbs) to sample a test and control group from the data set. We simulate a slowdown in the test group by increasing the execution time of each data point by x%, where x is the simulated slowdown size (e.g., for a slowdown of 1%, we add 1% to each data point), and compare test and control group with both statistical tests, i.e., Wilcoxon and confidence interval tests. We believe this is the most straightforward way of injecting slowdowns that comes with the fewest parameters to be applied correctly. Note that increasing each data point of the test group by a fixed slowdown percentage represents an “idealized” slowdown. As our experiments share the same data, the comparison between the studied approaches and statistical tests remains valid, although it might overestimate the number of detected slowdowns. Similar to the A/A tests, for each slowdown we repeat this process 100 times by re-sampling different control and test groups and count how often we find a statistically significant change, a true positive (TP), either by rejecting H0 with the Wilcoxon test and an effect size of medium or larger, or by non-overlapping 95% confidence intervals. Again, for the Wilcoxon rank-sum test and every benchmark-environment combination, we applied a Bonferroni correction to the resulting p-values. If the rate of TPs is at least 95% (i.e., a minimum of 95 in the 100 repetitions) and the rate of FPs (as identified by the A/A tests from Section 3.5.3) is not higher than 5% (i.e., a maximum of 5 in the 100 repetitions), we conclude that a slowdown of this size can be found reliably using one of the two studied statistical tests. The MDS is then the smallest slowdown from the list that can be found with this approach. In the following, we discuss the MDSs using the instance-based (ibs) and trial-based (tbs) strategies. Due to the runtime of our simulations, we were not able to run the detection experiments for the same number of selected samples as we did for the A/A tests. The runtime is especially a limiting factor for the confidence interval simulations. We randomly sample 100 test and control groups, and for each of these groups we run a bootstrap technique with another 100 iterations to compute the confidence intervals of the mean. In our implementation, this simulation takes around 500 hours for all benchmarks and environments when executed on 8-core machines. We used a cluster of 10 instances in a private cloud at the first author's university to execute the experiments.
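The slowdown-injection and detection loop can be sketched as follows; the detect callable stands for either of the two statistical tests, sample_pairs for the 100 pre-sampled test/control groups, and all names are illustrative rather than the authors' tooling.

```python
# Simulated slowdowns as fractions: 0.001 = 0.1%, 10.0 = 1,000%.
SLOWDOWNS = [0.001, 0.002, 0.005, 0.01, 0.015, 0.02, 0.05, 0.10, 0.50, 1.00, 10.00]

def inject_slowdown(data_points, slowdown):
    """Simulate an 'idealized' slowdown by inflating every data point, e.g. by 1%."""
    return [dp * (1.0 + slowdown) for dp in data_points]

def minimal_detectable_slowdown(sample_pairs, detect, fp_rate, max_fp=0.05, min_tp=0.95):
    """sample_pairs: pre-sampled (test, control) pairs; detect(test, control) -> bool.
    Returns the smallest slowdown found in >= 95% of repetitions, given <= 5% A/A FPs."""
    if fp_rate > max_fp:
        return float("inf")  # configuration is too unreliable to begin with
    for slowdown in SLOWDOWNS:
        tps = sum(1 for test, control in sample_pairs
                  if detect(inject_slowdown(test, slowdown), control))
        if tps / len(sample_pairs) >= min_tp:
            return slowdown
    return float("inf")
```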

Instance-Based Sampling

We evaluate the instance-based sampling strategy for different sample sizes. Concrete parameters for the sampling function ibs are $sel^{instances} \in \{1, 2, 3, 5, 10, 15, 20\}$ and $sel^{trials} = 1$. Table 3.7 shows exemplary results for three benchmarks (i.e., log4j2-3, bleve-3, and etcd-2) and a subset of the tested sample sizes (i.e., 10 and 20 instances). We omit smaller sample sizes from the example as they hardly detect slowdowns reliably using ibs. We provide the full results as part of the online appendix [Laaber et al., 2019a]. Cell values represent the MDS in percent that could be detected in this setup. Cells with the value “∞” (colored in red) indicate that for the given configuration no slowdown could be detected using our approach. The results depict that, using ibs, a high number of samples is required to reliably find slowdowns for many configurations (benchmark-environment combinations). A clear trend is that benchmarks run on AWS are able to detect smaller slowdowns compared to GCE, Azure, and even Bluemix (BM). This trend also applies to the other benchmarks not displayed in the example.

Table 3.7: MDSs for log4j2-3, bleve-3, and etcd-2 using instance-based sampling (ibs) with both statistical tests (“Test”), Wilcoxon test (“WRS”) and confidence interval test (“CI”)

Test | Bench    | # Ins. | # Trials | AWS GP | AWS CPU | AWS Mem | GCE GP | GCE CPU | GCE Mem | Azure GP | Azure CPU | Azure Mem | BM
WRS  | log4j2-3 | 10     | 1        | ∞      | ∞       | 1.5     | ∞      | ∞       | ∞       | 10       | ∞         | ∞         | ∞
WRS  | log4j2-3 | 20     | 1        | ∞      | 2       | 1.5     | 10     | ∞       | 10      | 5        | 5         | 5         | ∞
WRS  | bleve-3  | 10     | 1        | ∞      | ∞       | ∞       | ∞      | ∞       | ∞       | ∞        | ∞         | ∞         | 0.2
WRS  | bleve-3  | 20     | 1        | 0.5    | 1       | 5       | ∞      | 5       | 10      | ∞        | ∞         | ∞         | 0.2
WRS  | etcd-2   | 10     | 1        | 1      | 2       | 5       | ∞      | ∞       | ∞       | ∞        | ∞         | ∞         | 50
WRS  | etcd-2   | 20     | 1        | 1.5    | 2       | 5       | 50     | 50      | 100     | ∞        | ∞         | ∞         | 50
CI   | log4j2-3 | 10     | 1        | 5      | ∞       | 5       | 50     | 50      | 50      | 50       | 10        | 50        | 10
CI   | log4j2-3 | 20     | 1        | 5      | ∞       | 5       | 10     | 10      | 10      | 10       | ∞         | 10        | 5
CI   | bleve-3  | 10     | 1        | ∞      | ∞       | 10      | ∞      | 50      | ∞       | ∞        | ∞         | 50        | ∞
CI   | bleve-3  | 20     | 1        | 2      | 2       | 5       | 50     | 50      | 50      | 10       | 10        | 10        | 0.2
CI   | etcd-2   | 10     | 1        | 5      | 2       | 5       | ∞      | ∞       | 1,000   | 1,000    | 1,000     | 1,000     | 50
CI   | etcd-2   | 20     | 1        | 2      | 2       | 2       | 100    | 1,000   | 1,000   | 1,000    | 1,000     | 1,000     | 50

Although the example might suggest that some instance families (i.e., AWS Mem) yield smaller MDSs than the other instance families of the same provider, an investigation of the full data set did not reveal substantial differences. In the case of AWS and tested with confidence intervals, all three instance types had roughly the same number of undetectable slowdowns over all samples (total of 7 sample sizes * 19 benchmarks), i.e., 85 (64%; AWS GP), 89 (67%; AWS CPU), and 80 (60%; AWS Mem). In terms of MDS, the same instance types find slowdowns of ≤ 10% in 35 (26%; AWS GP), 36 (27%; AWS CPU), and 41 (31%; AWS Mem) of the cases. Generally speaking, the MDS differences between the instance types of the same provider are negligible for the tested benchmarks and environments.

Figure 3.9 shows an overview histogram over all configurations using ibs for testing with Wilcoxon (left) and confidence intervals (right). The bars indicate how many configurations (y-axis) had an MDS of size x on the x-axis. We can see that the majority of configurations do not find any slowdown reliably (“Inf”). This is the case for benchmarks with A/A FPs of more than 5%, or for benchmarks where we could not find the introduced slowdown in 95% or more of the simulations for any slowdown size.

[Figure 3.9 consists of two histograms (Wilcoxon test with ibs, confidence-interval test with ibs) of the number of benchmarks (y-axis) per minimal-detectable slowdown in % (x-axis), broken down by the number of instances (1, 5, 10, 20); see the caption below.]

Figure 3.9: MDSs found with instance-based sampling (ibs) by applying the Wilcoxon test (left) and confidence interval test (right). The x-axis shows the slowdowns we tested for, and the y-axis shows for how many benchmark-environment combinations the corresponding slowdown is the smallest that is detectable. MDSs ≤ 5% are aggregated at 1. If no slowdown is detectable, Inf is depicted

From the 190 configurations (19 benchmarks * 10 environments), no slowdown could be reliably detected for 172 (91%) benchmark-environment combinations with one instance, for 162 (85%) combinations with five instances, for 146 (77%) combinations with ten instances, and for 42 (22%) combinations with 20 instances when testing with Wilcoxon. The confidence interval test performs slightly better: 154 (81%) combinations with one instance do not find a slowdown of any size, 141 (74%) combinations with five instances, 71 (37%) with ten instances, and 24 (13%) with 20 instances. When increasing the sample size to five instances, Wilcoxon finds slowdowns in 27 (14%) configurations, out of which 17 (9%) are below 10% slowdown size. In comparison, the confidence interval test finds slowdowns in 40 (21%) of the studied configurations, where 18 (9%) are below 10% slowdown size. Doubling the sample size to ten instances has a small effect when testing with Wilcoxon, i.e., 44 (23%) configurations find a slowdown reliably, out of which 34 (18%) detect slowdowns ≤ 10%. Going to ten instances has a substantially larger impact on which slowdowns are reliably detectable when testing with confidence intervals: 110 (58%) configurations find a slowdown and 50 (26%) are even able to detect slowdowns ≤ 10%. With 20 instances, Wilcoxon is able to detect slowdowns in 148 (78%) configurations, and confidence intervals expose slowdowns in 157 (83%) configurations, while 121 (64%) and 95 (50%) reliably report slowdowns below 10% respectively. These results show that instance-based sampling, i.e., the sampling strategy where test and control group are executed on different instances, is only able to reliably detect slowdowns (of any size) in 78% of the tested configurations when using Wilcoxon and 83% when testing with confidence intervals, even when utilizing a fairly large number of 20 instances.

Trial-Based Sampling

Following the results from the instance-based MDSs, we now investigate whether trial-based sampling (tbs) is able to detect slowdowns for smaller sample sizes. In Section 3.5.3, we observed that tbs, i.e., running test and control group on the same instances in interleaved order, results in fewer FPs with smaller sample sizes than ibs. Based on this, we expect that tbs yields more reliable and more stable detection results and therefore smaller MDSs. We evaluate the trial-based sampling strategy for different sample sizes analogous to the instance-based sampling. The parameters for the sampling function tbs are $sel^{instances} \in \{1, 2, 3, 5, 10, 15, 20\}$ and $sel^{trials} = 5$. Table 3.8 shows the same examples as above, but now for tbs instead of ibs. Again, all other detailed results are provided as part of the online appendix [Laaber et al., 2019a]. The three benchmark examples provided already give an indication that tbs indeed leads to more reliable results in terms of slowdown detectability. Even when running test and control group on a single instance, slowdowns are reliably detectable for benchmarks bleve-3 and etcd-2 in some environments with both statistical tests (e.g., bleve-3 on Azure with the Wilcoxon test). Similar to ibs, we see a trend that in AWS and Bluemix smaller slowdowns can be detected than in GCE and Azure. We again observe that AWS works even better than Bluemix for at least some benchmarks (notably etcd-2), but this is not true universally. Regarding the differences between instance types of the same cloud provider, we again observe no notable difference. This is in line with the variability results from Section 3.4, which indicate that these providers appear to be particularly useful for performance microbenchmarking experiments.

Table 3.8: MDSs for log4j2-3, bleve-3, and etcd-2 using trial-based sampling (tbs) with both statistical tests (“Test”), Wilcoxon test (“WRS”) and confidence interval test (“CI”)

Test | Bench    | # Ins. | # Trials | AWS GP | AWS CPU | AWS Mem | GCE GP | GCE CPU | GCE Mem | Azure GP | Azure CPU | Azure Mem | BM
WRS  | log4j2-3 |  1     | 5        | ∞      | ∞       | ∞       | ∞      | ∞       | ∞       | ∞        | ∞         | ∞         | ∞
WRS  | log4j2-3 |  5     | 5        | 5      | ∞       | 1.5     | 10     | 5       | 10      | 5        | 5         | 10        | 1.5
WRS  | log4j2-3 | 10     | 5        | 2      | 5       | 1.5     | 10     | 5       | 5       | 5        | 5         | 10        | 1.5
WRS  | log4j2-3 | 20     | 5        | 2      | 5       | 1.5     | 5      | 5       | 5       | 5        | 5         | 10        | 1.5
WRS  | bleve-3  |  1     | 5        | ∞      | ∞       | ∞       | ∞      | ∞       | ∞       | 5        | 5         | 5         | ∞
WRS  | bleve-3  |  5     | 5        | 1      | 0.5     | 2       | 5      | 10      | 10      | 5        | 5         | 10        | 0.2
WRS  | bleve-3  | 10     | 5        | 1      | 0.5     | 2       | 5      | 5       | 5       | 5        | 5         | 10        | 0.2
WRS  | bleve-3  | 20     | 5        | 0.5    | 0.5     | 2       | 5      | 5       | 5       | 5        | 5         | 10        | 0.2
WRS  | etcd-2   |  1     | 5        | 5      | 1.5     | 5       | ∞      | ∞       | ∞       | 50       | 50        | 50        | 50
WRS  | etcd-2   |  5     | 5        | 2      | 1.5     | 2       | 100    | 50      | 100     | 50       | 50        | 50        | 50
WRS  | etcd-2   | 10     | 5        | 2      | 1.5     | 2       | 50     | 50      | 50      | 50       | 50        | 50        | 50
WRS  | etcd-2   | 20     | 5        | 1.5    | 1       | 2       | 50     | 50      | 50      | 50       | 50        | 50        | 50
CI   | log4j2-3 |  1     | 5        | ∞      | ∞       | ∞       | ∞      | ∞       | 50      | 50       | ∞         | ∞         | ∞
CI   | log4j2-3 |  5     | 5        | 5      | 10      | 5       | 50     | 50      | 50      | 50       | 10        | 50        | 5
CI   | log4j2-3 | 10     | 5        | 10     | 5       | 5       | 50     | 10      | 50      | 50       | 10        | 10        | 5
CI   | log4j2-3 | 20     | 5        | 5      | 5       | 2       | 10     | 10      | 10      | 10       | 5         | 10        | 5
CI   | bleve-3  |  1     | 5        | ∞      | ∞       | 5       | ∞      | ∞       | ∞       | 10       | ∞         | ∞         | 1
CI   | bleve-3  |  5     | 5        | 5      | 5       | 10      | 50     | 50      | 50      | 50       | 10        | 50        | 0.2
CI   | bleve-3  | 10     | 5        | 2      | 5       | 50      | 50     | 50      | 50      | 10       | 10        | 10        | 0.2
CI   | bleve-3  | 20     | 5        | 2      | 5       | 10      | 10     | 10      | 10      | 10       | 5         | 10        | 0.2
CI   | etcd-2   |  1     | 5        | 5      | 2       | 5       | ∞      | 1,000   | ∞       | 100      | ∞         | 1,000     | ∞
CI   | etcd-2   |  5     | 5        | 2      | 5       | 5       | 1,000  | 1,000   | 1,000   | 1,000    | 1,000     | 1,000     | 50
CI   | etcd-2   | 10     | 5        | 2      | 2       | 2       | 100    | 100     | 100     | 1,000    | 1,000     | 1,000     | 50
CI   | etcd-2   | 20     | 5        | 2      | 2       | 2       | 100    | 100     | 100     | 1,000    | 1,000     | 1,000     | 10

In line with the presentation of the instance-based sampling results, we now depict an overview of all configurations in Figure 3.10 as a histogram. The results confirm our intuition from the examples in Table 3.8. The majority of configurations reliably find at least some slowdown when considering sample sizes bigger than one instance. Nevertheless, from 190 configurations (19 benchmarks * 10 environments), still 142 (75%) with the Wilcoxon test and 117 (62%) with the confidence interval test are not able to find any of the slowdowns we tested for with a single instance. Increasing the sample size to five instances (and beyond) changes the MDSs drastically for the better. Five instances already allow finding slowdowns in 185 (97%) and 189 (99%) configurations when tested with the Wilcoxon test and confidence interval test respectively.

[Figure 3.10 consists of two histograms (Wilcoxon test with tbs, confidence-interval test with tbs) of the number of benchmarks (y-axis) per minimal-detectable slowdown in % (x-axis), broken down by the number of instances (1, 5, 10, 20); see the caption below.]

Figure 3.10: MDSs found with trial-based sampling by applying the Wilcoxon test (left) and confidence interval test (right). The x-axis shows the slowdowns we tested for, and the y-axis shows for how many combinations the corresponding slowdown is the smallest that is detectable. MDSs ≤ 5% are aggregated at 1. If no slowdown is detectable, Inf is depicted

Interestingly, and contrary to the A/A-testing results, the Wilcoxon test supports finding smaller MDSs. With five instances, already 150 (79%) configurations find slowdowns of 10% or smaller when tested with the Wilcoxon test, whereas only 74 (39%) do when the confidence interval test is applied. Similar trends are observable for 10 instances, where 156 (82%; Wilcoxon) and 113 (60%; confidence interval), and 20 instances, where 157 (83%; Wilcoxon) and 147 (77%; confidence interval), find a slowdown of 10% or less. 34 (17%; Wilcoxon) and 43 (22%; confidence interval) configurations reliably detect a slowdown of 10% or less when only using a single instance. These configurations are spread across all studied environments but only include 12 unique benchmarks from RxJava, bleve, and etcd. We observe that the Wilcoxon test is more aggressive in rejecting H0. This leads to more FPs in the A/A tests, but ultimately means that smaller slowdowns are correctly observed in the present experiments. The tbs results confirm the previous result by Abedi and Brecht [2017]: executing performance experiments on the same instances in randomized order can indeed be considered a best practice. However, it is a surprising outcome how small slowdowns (< 10%) can often be found with high confidence even in a comparatively unstable environment.

Minimal-Detectable Slowdowns: Our experiments re-confirm that testing in a trial-based fashion leads to substantially better results than executing on different instances. If using instance-based sampling, a large number of instances (the maximum of 20 in our experiments) is required; otherwise our benchmarks are not able to discover any slowdown for most benchmark-environment combinations. However, even with 20 instances, slowdowns need to be in the range of 5% to 10% to be reliably detectable in most cases. With trial-based sampling, already 5 instances are sufficient to find similar slowdowns, at least when using Wilcoxon, which has shown to be more aggressive in rejecting H0 than our alternative confidence-interval-based approach. A small number of benchmarks in our study is able to reliably identify small changes in performance (less than 5%), but for most combinations such small slowdowns cannot reliably be detected in the cloud, at least not with the maximum number of instances we tested, i.e., 20.

3.6 Discussion

In this section, we discuss the implications of this study's results for researchers and practitioners. First and foremost, we want to address whether cloud resources can be employed to assess software performance with microbenchmarks. This has multiple aspects to it: (1) which cloud provider and instance type is used, (2) how many measurements are taken, (3) which measurements are considered for slowdown analysis, (4) which slowdown sizes are desired to be detected, and (5) which statistical method to use. In the following, we explicitly address the threats to validity and limitations of our study. Lastly, we briefly mention potential future directions of performance measurements/testing in unstable and uncontrollable environments such as public clouds.

3.6.1 Implications and Main Lessons Learned

In this section we discuss implications and main lessons learned from executing microbenchmarks in cloud environments. In detail, we comment on results for different cloud providers and instance types, discuss the impact of sample size and different sampling strategies on result variability, and elaborate on MDSs of this study's subjects when following the adopted approaches.

Cloud Provider and Instance Type

The reliable detection of slowdowns directly depends on the choice of cloud provider. We have observed relevant differences between providers, and more stable providers eventually will lead to benchmark results with lower result variability. Based on the variability (see Section 3.4) and the results of our A/A and A/B testing (see Section 3.5), we see an indication across all experiments performed between July and October 2017 that benchmarks executed on AWS produce more stable results compared to GCE and Azure. Even better results are obtained when utilizing a bare-metal machine rented from IBM Bluemix. Surprisingly, variability and MDS are not far apart in AWS and Bluemix. An interesting conclusion drawn from the presented data is that there is no big difference between instance types of the same provider. When using microbenchmarks, there is no indication in our data that a particular instance-type family should be chosen over another. Nevertheless, a reader should carefully evaluate this for his/her application and microbenchmark types, as this might not hold true for other applications and benchmarks. Especially if other performance metrics are tested for, e.g., I/O throughput or memory consumption, a difference between specialized instance types might manifest. AWS recently introduced bare-metal instances as well (https://aws.amazon.com/de/blogs/aws/new-amazon-ec2-bare-metal-instances-with-direct-access-to-hardware/), but this new instance type only became available after concluding data gathering for the current study. We expect performance to be comparable to IBM's bare-metal offering. However, a detailed comparison is subject to future research.

Measurement Strategy

Both instance-based and trial-based sampling come with their own advantages and disadvantages. From a purely statistical point of view, the trial-based strategy leads to substantially better slowdown detection results. However, we argue that there is still room for the instance-based strategy, as it inherently lends itself better to typical continuous software development scenarios. The instance-based strategy can be implemented easily, but requires substantially higher sample sizes. One advantage which, to some extent, alleviates the problem of large sample sizes is that this strategy supports parallelization of test executions nicely. That is, in a public cloud, the only real factor preventing an experimenter from launching an experiment on hundreds of instances in parallel is cost. A disadvantage of instance-based sampling is that results need to be persisted such that a comparison is possible as soon as the new version's performance-test results are available. A more pressing issue is that one cannot be sure that the cloud instances have not been upgraded by the provider between test executions of two versions. Therefore, a comparison between two versions run at different points in time might not be fair, and a potentially detected slowdown could be caused by a change to the underlying system. These problems are alleviated by implementing trial-based sampling. However, if the time budget is constrained (e.g., 2 hours per build job on TravisCI), one can only run half of the performance tests compared to instance-based sampling. Nonetheless, if detecting the smallest-possible slowdowns is the primary goal, employing trial-based sampling, i.e., running test and control group on the same instances, ideally in randomized order, is preferable.

Required Number of Measurements

The result variability as well as the MDS is directly affected by the number of repeated benchmark executions. A naive suggestion would be to run as many as possible, which is at odds with temporal and monetary constraints. With the “limitless” availability of cloud instances, performance tests of a new software version can be executed in parallel. Even long-running benchmark suites could be split into subsets and run in parallel in order to reduce overall execution time. The only inherently sequential part is the trials on the same instance. Unfortunately, there is no generalizable rule for how many measurements are required to reliably detect slowdowns, as it depends on project, benchmark, and cloud configuration. Nevertheless, our results show that for finding slowdowns of 10% or less, the repeated measurements have to include 20 instances with five trials each to be sure to catch slowdowns of this size even for less stable benchmarks. Our results also show that a few benchmarks (e.g., rxjava-1) are extremely stable even if only one instance per test and control group is used. A tool chain utilizing cloud instances for performance testing, e.g., as part of a continuous integration (CI) build, should track benchmark-result variabilities and, if needed, reconfigure the repeated executions required to find small slowdowns for each benchmark individually.

Minimal Detectable Slowdown Size

The minimal slowdown size we target to reliably detect directly influences the performance-testing-environment configuration. If it is sufficient to have a sanity check whether the performance of a software system has not plummeted, an instance-based strategy with 20 instances and a single trial might be sufficient for most benchmarks. With this configuration, some benchmarks will still be able to find relatively small slowdowns, but others would only detect changes in the order of 100% or more. However, if slowdowns below 10% are desired to be detected (which is often the case for software-performance changes [Mytkowicz et al., 2009]), a trial-based strategy with at least 20 instances and 5 trials is required for most of the studied benchmarks. Even with extensive testing on multiple instances, it is not guaranteed that all benchmarks of a project will reliably detect slowdowns of a certain size. We have observed multiple benchmarks, most evidently log4j2-5, that are inherently not able to detect realistic slowdowns in our setting. We argue that there is a need for developer tooling which supports developers in writing good benchmarks that are able to detect small slowdowns. It is a prerequisite to start from a benchmark that is stable in an ideal environment, which can then be executed on cloud environments. If the benchmark is already inherently unstable, moving to cloud-based execution will only make matters worse. For I/O-intensive benchmarks, like most of the ones from log4j2, the transition to cloud infrastructure is troublesome as I/O is particularly unreliable. In any case, researchers as well as practitioners are encouraged to follow a similar approach to ours when executing performance experiments in cloud environments. That is, always perform A/A testing to validate whether the detected changes between different versions are due to software-performance changes or unreliable environments.

Testing Using Wilcoxon vs. Overlapping Confidence Intervals

Although both statistical tests are state of the art in performance evaluation [Bulej et al., 2017b], traditional textbooks [Jain, 1991] and research [Georges et al., 2007; Kalibera and Jones, 2012, 2013] tend to prefer confidence intervals of the mean over hypothesis testing (in the form of Wilcoxon rank-sum). Our results show that for both testing approaches (A/A in Section 3.5.3 and A/B tests in Section 3.5.4), Wilcoxon rank-sum reports more changes compared to overlapping confidence intervals. This implies that Wilcoxon is more sensitive towards changes and the computed confidence intervals are more conservative. One advantage of Wilcoxon is that this test is computationally cheaper to run, as the bootstrapping required for long-tailed performance data is fairly expensive. Which test is more recommendable for practitioners depends largely on whether the expensive bootstrapping is problematic, and whether false positives (i.e., false warnings about non-existent slowdowns) are worse than false negatives (i.e., missed slowdowns). However, it should be noted that both statistical tests can also be customized to be more or less conservative. For instance, one can easily imagine testing for a lower p-value or a larger effect size in Wilcoxon.

3.6.2 Threats to Validity

As with any empirical study, there are experiment design trade-offs, threats, and limitations to the validity of our results to consider.

Threats to Internal and Construct Validity

Experiments in a public cloud always need to consider that the cloud provider is, for all practical purposes, a black box that we cannot control. Although reasonable model assumptions can be made (e.g., based on common sense, previous literature, and information published by the providers), we can fundamentally only speculate about the reasons for any variability we observe. Another concern that is sometimes raised for cloud experimentation is that the cloud provider may in theory actively impact the scientific experiment, for instance by providing more stable instances to benchmarking initiatives than they would do in production. However, in practice, such concerns are generally unfounded. Major cloud providers operate large data centers on which our small-scale experiments are expected to have a negligible impact, and historically, providers have not shown interest in directly interfering with scientific experiments. For the present study, we investigated entry-level instance types only. A follow-up study is required to investigate whether variability and detectability results improve for superior cloud hardware.

Note that the compute-optimized instance of GCE has considerably lower memory (1.8GB vs. 3.6GB vs. 4GB) compared to the compute-optimized instance types of AWS and Azure. Hence, a direct comparison between these instance types might not be valid. Another threat to the internal validity of our study is that we have chosen to run all experiments in a relatively short time frame. This was done to avoid bias from changes in the performance of a cloud provider (e.g., through hardware updates). This decision means that our study only reports on a specific snapshot and not on longitudinal data, as would be observed by a company using the cloud for performance testing over a period of years. Another threat is concerned with the simulated slowdowns, where each data point of the test group is increased by a fixed slowdown percentage, which represents an “idealized” situation. Nevertheless, our comparison of results between approaches and statistical tests remains valid as they share the same data.

Threats to External Validity

We have only investigated microbenchmarking in Java and Go for a selected sample of benchmarks in two OSS projects. Other programming language paradigms (e.g., functional) and purely interpreted languages were considered out of scope. Nevertheless, we believe that with Java (dynamically compiled) and Go (statically compiled) we cover two of the most-used language-compilation/execution types. In addition, the availability of projects with microbenchmark suites limits the language options to study. In terms of performance testing paradigms, we do not claim that software microbenchmarking is a replacement for traditional load testing. A comparison between these two is not the goal of the underlying study but should be investigated in future research. Further, we have focused on three, albeit well-known, public cloud providers and a single bare-metal hosting provider. A reader should carefully evaluate whether our results can be generalized to other languages, projects, and cloud providers. Even more so, our results should not be generalized to performance testing in a private cloud, as many of the phenomena that underlie our results (e.g., noisy neighbors, hardware heterogeneity) cannot necessarily, or not to the same extent, be observed in a private cloud. Similarly, we are not able to make claims regarding the generalizability of our results to other types of performance experiments, such as stress or load tests. In this work, we focused on mean (by using bootstrapped overlapping confidence intervals) and median (by applying Wilcoxon rank-sum) execution-time performance. Future work should investigate other performance metrics (e.g., memory consumption, I/O operations, or lock contention), as well as explore best/worst-case performance characteristics of microbenchmark suites. Finally, readers need to keep in mind that any performance benchmarking study run on cloud infrastructure is fundamentally aiming at a moving target. As long as virtualization, multi-tenancy, or control over hardware optimizations is managed by providers, we expect the fundamental results and implications of our work to be stable. Nevertheless, detailed concrete results (e.g., detectable slowdown sizes on particular provider/instance types) may become outdated as providers update their hardware or introduce new offerings.

3.6.3 Future Directions

So far, we studied how bad performance testing is in cloud environments and whether we are able to reliably detect slowdowns. This is a first step towards utilizing cloud infrastructure for performance measurements/testing, but naturally it is not the end of the story. Future research in this area should particularly address supporting developers in creating and executing performance tests on cloud infrastructure. Two concrete topics we envision are (1) studying the properties of benchmarks that contribute to better or worse stability, and therefore slowdown detectability, in uncontrolled environments and (2) supporting developers in writing benchmarks that are appropriate for execution in such environments. The properties to study are root causes that contribute to higher variability in cloud environments (e.g., writing to files, network access, or execution configuration of benchmarks). We then foresee great potential for better tooling that (a) supports developers through IDE extensions to write adequate benchmarks (e.g., by flagging compiler optimizations that invalidate benchmark results) and hints at which kinds of properties might lead to higher result variability, and (b) suggests/adapts execution configurations (i.e., repetitions along trials and instances) for a given cloud-resource type through continuous monitoring of benchmark result variability of prior executions.

3.7 Related Work

Software performance is a cross-cutting concern affected by many parts of a system and therefore hard to understand and study. Two general approaches to software performance engineering (SPE) are prevalent: measurement-based SPE, which executes performance experiments and monitors and evaluates their results, and model-based SPE, which predicts performance characteristics based on the created models [Woodside et al., 2007]. In this paper, we focus on measurement-based SPE.

It has been extensively studied that measuring correctly and applying the right statistical analyses is hard and much can be done wrong. Mytkowicz et al. [2009] pinpoint that many systems researchers have drawn wrong conclusions through measurement bias. Others report on wrongly quantified experimental evaluations that ignore the uncertainty of measurements caused by non-deterministic behavior of software systems, such as memory placement or dynamic compilation [Kalibera and Jones, 2012]. Dealing with non-deterministic behavior of dynamically optimized programming languages, Georges et al. [2007] summarize methodologies to measure languages such as Java, which are dynamically compiled and run on VMs. Moreover, they explain which statistical methods lend themselves to performance evaluation of these languages. All of these studies expect an as-stable-as-possible environment to run performance experiments on. More recently, Arif et al. [2018] study the effect of virtual environments on load tests. They find that there is a discrepancy between physical and virtual environments, which is most strongly affected by the unpredictability of I/O performance. Our paper augments this study, which looks at result unreliability of load tests, whereas we investigate software microbenchmarks. Additionally, our study differs by conducting measurements in cloud environments rather than virtual environments on controlled hardware.

Traditionally, performance testing research was conducted in the context of system-scale load and stress testing [Menascé, 2002; Jiang and Hassan, 2015; Weyuker and Vokolos, 2000; Barna et al., 2011]. By now, such performance tests are academically well understood, and recent research focuses on industrial applicability [Nguyen et al., 2014; Foo et al., 2015] or on how to reduce the time necessary for load testing [Grechanik et al., 2012]. Studies of software microbenchmarking have not received mainstream attention previously, but academics have recently started investigating it [Stefan et al., 2017; Horký et al., 2015; Chen and Shang, 2017]. Similarly, Leitner and Bezemer [2017] recently investigated different practices of microbenchmarking of OSS written in Java. However, none of these studies report on the reliability of detecting slowdowns.

A substantial body of research has investigated the performance and the stability of performance of cloud providers independently of software-performance-engineering experiments. Iosup et al. [2011a] evaluate the usability of IaaS clouds for scientific computing. Gillam et al. [2013] focus on a fair comparison of providers in their work. Ou et al. [2012] and Farley et al. [2012] specifically focus on hardware heterogeneity, and how it can be exploited to improve a tenant's cloud experience. Our study sets a different focus on software performance tests and goes a step further to investigate which slowdowns can be detected.

3.7.1 Comparison to Our Previous Work

In this section, we compare the results of the work presented here to our two previous papers that investigated performance of cloud instances. In Leitner and Cito [2016], we studied the performance characteristics of cloud environments across multiple providers, regions, and instance types. We assessed the stability of four public cloud providers (the same as in the current work) using low-level, domain-independent CPU and I/O system benchmarks (e.g., calculating a series of prime numbers, or writing and reading big files to/from hard disk), using a methodology not unlike previous benchmarking works [Iosup et al., 2011b; Gillam et al., 2013]. This paper laid the groundwork for the presently presented research, but findings from these isolated, domain-independent tests are not easy to transfer to any concrete usage domain, such as performance benchmarking.

In Laaber and Leitner [2018a], we studied the quality of open-source performance test suites. Similarly to the present work, we measured the variability of performance tests in different environments, but the ultimate goal was to study benchmark test coverage. Hence, we were unable to go deeper into this topic. This brought us to the idea of conducting a more targeted study, which ultimately led to the present work, which can be seen as applying a rigorous cloud-benchmarking methodology, similar to Leitner and Cito [2016], to the study of microbenchmarking-based performance tests.

It should be noted that the detailed results of these two previous works and the present one differ in some aspects. Most importantly, AWS has been benchmarked to be fairly unreliable in Leitner and Cito [2016], while the results of the present study show surprisingly stable results for the same provider. While such results may seem inconsistent, they are unavoidable in practice. Cloud providers routinely buy new hardware, roll out different server management policies, and offer entirely new and different services. Hence, one should not read the outcomes of a cloud benchmarking study as a set-in-stone truth. Rather, the goal is to identify common themes, trends, and benchmarking methodologies, which remain valid significantly longer than the detailed data. Similarly, some microbenchmarks that have been identified in Laaber and Leitner [2018a] as particularly stable have proven much less so in the present study (e.g., log4j-1, which we explicitly selected as a stable benchmark, but which has proven to be highly unstable in the present study). This primarily underlines how important a larger study of the subject, as presented here, is.

3.8 Conclusions

This paper empirically studied "how bad" performance testing with software microbenchmarks in cloud environments actually is. By executing microbenchmark suites of two Java projects (Log4j2 and RxJava) and two Go projects (bleve and etcd) in bare-metal and cloud environments, we studied result variability, investigated falsely-detected performance changes (A/A tests), and identified minimal detectable slowdown sizes. The results of the A/A tests and minimal detectable slowdown sizes were retrieved by applying two state-of-the-art statistical tests, namely Wilcoxon rank-sum and overlapping bootstrapped confidence intervals of the mean.

We found result variabilities of the studied benchmark-environment combinations ranging between 0.03% and 100.68%, with the bare-metal environment and AWS delivering the best results. In terms of falsely detected performance changes, the A/A test results show that experiments with small sample sizes (e.g., one instance and one trial) suffer drastically from high false-positive rates, irrespective of which statistical test, sampling strategy, and execution environment is used. With increased sample sizes (e.g., five instances and five trials), though, most benchmark-environment combinations show acceptable numbers of false positives (≤ 5%) when repeatedly executed, hence making it feasible to use cloud instances as performance test execution environments. With regard to minimal detectable slowdowns (MDSs), executing test and control group on the same instances (trial-based sampling) enables finding slowdowns with high confidence in all benchmark-environment combinations when utilizing ≥ 10 instances. A slowdown below 10% is reliably detectable 77%–83% of the time when using trial-based sampling and 20 instances. We further found that Wilcoxon rank-sum is superior to overlapping confidence intervals in two regards: (1) it detects smaller slowdowns reliably and (2) it is less computationally intensive and therefore takes less time.

Following these findings, we conclude that executing software microbenchmarking experiments is, to some degree, possible on cloud instances. Not all cloud providers and instance types perform equally well in terms of detectable slowdowns. However, in most settings, a substantial number of trials or instances and the co-location of test and control group on the same instances are required to achieve robust results with small detectable slowdowns. Practitioners can use our study as a blueprint to evaluate the stability of their own performance microbenchmarks within their custom experimental environment.

4

Dynamically Reconfiguring Software Microbenchmarks: Reducing Execution Time without Sacrificing Result Quality

Christoph Laaber, Stefan Würsten, Harald C. Gall, and Philipp Leitner
Published in Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE ’20)

Abstract

Executing software microbenchmarks, a form of small-scale performance tests predominantly used for libraries and frameworks, is a costly endeavor. Full benchmark suites take up to multiple hours or days to execute, rendering frequent checks, e.g., as part of continuous integration (CI), infeasible. However, altering benchmark configurations to reduce execution time without considering the impact on result quality can lead to benchmark results that are not representative of the software's true performance.

We propose the first technique to dynamically stop software microbenchmark executions when their results are sufficiently stable. Our approach implements three statistical stoppage criteria and is capable of reducing Java Microbenchmark Harness (JMH) suite execution times by 48.4% to 86.0%. At the same time, it retains the same result quality for 78.8% to 87.6% of the benchmarks, compared to executing the suite for the default duration. The proposed approach does not require developers to manually craft custom benchmark configurations; instead, it provides automated mechanisms for dynamic reconfiguration. This makes dynamic reconfiguration highly effective and efficient, potentially paving the way to inclusion of JMH microbenchmarks in CI.

4.1 Introduction

Performance testing enables automated assessment of software performance in the hope of catching degradations, such as slowdowns, in a timely manner. A variety of techniques exist, spanning from system-scale (e.g., load testing) to method or statement level, such as software microbenchmarking. For functional testing, CI has been a revelation, where (unit) tests are regularly executed to detect functional regressions as early as possible [Hilton et al., 2016]. However, performance testing is not yet standard CI practice, although there would be a need for it [Leitner and Bezemer, 2017; Bezemer et al., 2019]. A major reason for not running performance tests on every commit is their long runtimes, often consuming multiple hours to days [Huang et al., 2014; Jiang and Hassan, 2015; Laaber and Leitner, 2018a].

To lower the time spent in performance testing activities, previous research applied techniques to select which commits to test [Huang et al., 2014; Sandoval Alcocer et al., 2016] or which tests to run [de Oliveira et al., 2017; Alshoaibi et al., 2019], to prioritize tests that are more likely to expose slowdowns [Mostafa et al., 2017], and to stop load tests once they become repetitive [AlGhamdi et al., 2016, 2020] or do not improve result accuracy [He et al., 2019]. However, none of these approaches is tailored to the characteristics of software microbenchmarks while enabling running full benchmark suites, reducing the overall runtime, and still maintaining the same result quality.

In this paper, we present the first approach to dynamically, i.e., during execution, decide when to stop the execution of software microbenchmarks. Our approach, dynamic reconfiguration, determines at different checkpoints whether a benchmark execution is stable and whether more executions are unlikely to improve the result accuracy. It builds on the concepts introduced by He et al. [2019], applies them to software microbenchmarks, and generalizes the approach for any kind of stoppage criteria.

To evaluate whether dynamic reconfiguration enables reducing execution time without sacrificing quality, we perform an experimental evaluation on ten Java open-source software (OSS) projects with benchmark suite sizes between 16 and 995 individual benchmarks, ranging from 4.31 to 191.81 hours. Our empirical evaluation consists of three different stoppage criteria, including the one from He et al. [2019]. It assesses whether benchmarks executed with dynamic reconfiguration in controlled, bare-metal environments (1) maintain their result quality and (2) have shorter execution times, compared to being executed with the default JMH configuration.

We find that for the majority of studied benchmarks the result quality remains the same after applying our approach. Depending on the stoppage criterion, between 78.8% and 87.6% of the benchmarks do not produce different results, with an average performance change rate between 1.4% and 3.1%. Even though computation of the stoppage criteria introduces an overhead between <1% and ~11%, dynamic reconfiguration still enables saving a total of 66.2% to 82% of the execution time across all projects. For individual projects, benchmark suites take 48.4% to 86.0% less time to execute. Our empirical results support that dynamic reconfiguration of software microbenchmarks is highly effective and efficient in reducing execution time without sacrificing result quality.

Contributions The main contributions of our study are:

• We present the first approach to dynamically stop the execution of software microbenchmarks using three different stoppage criteria.

• We provide empirical evidence that demonstrates the effectiveness and efficiency of dynamic reconfiguration for ten OSS applications.

• We provide a fork of JMH that implements our approach on Github [Laaber et al., 2020d] and as part of our replication package [Laaber et al., 2020e].

• To investigate whether real-world benchmark suites could benefit from our approach to save time, we collect the largest data set of JMH OSS projects (753 projects with 13,387 benchmarks) including extracted source code properties such as benchmark configurations and parameters.

4.2 Java Microbenchmark Harness (JMH)

JMH¹ is the de facto standard framework for writing and executing software microbenchmarks (in the following simply called benchmarks) for Java. Benchmarks operate on the same level of granularity as unit tests, i.e., statement/method level, and are similarly defined in code and configured through annotations. Different from unit tests, where the outcome is binary, i.e., a test passes or fails (disregarding flakiness), benchmarks produce outputs for a certain performance metric, such as execution time or throughput. As these performance metrics are easily affected by confounding factors, such as the computer's hardware and software, background processes, or even temperature, one must execute benchmarks repeatedly to obtain rigorous results that are representative of the software's true performance [Georges et al., 2007].

Figure 4.1 depicts a standard execution of a JMH benchmark suite B, where benchmarks b are sequentially scheduled. Every benchmark execution starts with a number of warmup forks wf, to bring the system into a steady state, whose results are discarded. A fork is JMH parlance for running a set of measurements in a fresh Java Virtual Machine (JVM). The warmup forks are followed by a number of measurement forks f (often simply called forks). Due to dynamic compilation, every fork is brought into a steady state by running a series of warmup iterations wi, after which a series of measurement iterations mi are executed. An iteration has a specific duration (wt or mt for warmup time and measurement time, respectively) for which the benchmark is executed as often as possible, and the performance metrics for a sample of the invocations are reported. Performance metrics from warmup iterations are discarded, and the union of the measurement iterations across all forks forms the benchmark's result. All these values can be configured by the developer through JMH annotations or the command line interface (CLI); otherwise, default values are used. JMH supports benchmark fixtures, i.e., setup and teardown methods, as well as parameterization of benchmarks. A parameterized benchmark has a number of parameters with (potentially) multiple values; JMH then runs the benchmark once for every parameter combination, which are formed as the Cartesian product of the individual parameters.

JMH uses different sets of default configuration values, depending on the version: ≤ 1.20 and ≥ 1.21. Versions until 1.20 use 10 forks (f) running 40 iterations (20 wi and 20 mi) with an iteration time (wt and mt) of 1s; starting from 1.21, the defaults are 5 forks (f), 5 iterations (both wi and mi), and 10s iteration time (both wt and mt) [Shipilev, 2018a,b]. JMH does not use warmup forks (wf) by default.

Consequently, and as Figure 4.1 depicts, we can define the overall warmup time as t^b_w = wf ∗ (wi ∗ wt + mi ∗ mt) + f ∗ wi ∗ wt, the overall measurement time as t^b_m = f ∗ mi ∗ mt, and the benchmark execution time as t^b = t^b_w + t^b_m + t^b_fix, where t^b_fix is the time spent in benchmark fixtures. Finally, the full microbenchmark suite execution time T is the sum over all benchmark parameter combinations, defined as T = Σ_{b∈B} Σ_{p∈P^b} t^{b_p}, where P^b is the set of parameter combinations for a benchmark b. These definitions will be used in the remainder of the paper.

1 https://github.com/openjdk/jmh (incl. source code examples)
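To make the terminology concrete, the following is a minimal, hypothetical JMH benchmark. The class, method, and parameter values are our own illustration, but the annotations (@Fork, @Warmup, @Measurement, @Param, @Benchmark, and the fixture annotation @Setup) are the standard JMH annotations referred to above, and the shown iteration counts and times mirror the JMH ≥ 1.21 defaults.

    import java.util.concurrent.TimeUnit;

    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.BenchmarkMode;
    import org.openjdk.jmh.annotations.Fork;
    import org.openjdk.jmh.annotations.Measurement;
    import org.openjdk.jmh.annotations.Mode;
    import org.openjdk.jmh.annotations.OutputTimeUnit;
    import org.openjdk.jmh.annotations.Param;
    import org.openjdk.jmh.annotations.Scope;
    import org.openjdk.jmh.annotations.Setup;
    import org.openjdk.jmh.annotations.State;
    import org.openjdk.jmh.annotations.Warmup;

    // Hypothetical example benchmark; the configuration values mirror the JMH >= 1.21 defaults described above.
    @State(Scope.Benchmark)
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    @Fork(5)                                // f: 5 measurement forks (fresh JVMs)
    @Warmup(iterations = 5, time = 10)      // wi = 5 warmup iterations of wt = 10s each
    @Measurement(iterations = 5, time = 10) // mi = 5 measurement iterations of mt = 10s each
    public class ListContainsBenchmark {

        @Param({"10", "1000", "100000"})    // parameterization: the benchmark runs once per value
        int size;

        private java.util.List<Integer> list;

        @Setup                              // benchmark fixture (setup), not measured
        public void setUp() {
            list = new java.util.ArrayList<>();
            for (int i = 0; i < size; i++) {
                list.add(i);
            }
        }

        @Benchmark                          // the measured code; returning a value avoids dead-code elimination
        public boolean contains() {
            return list.contains(size - 1);
        }
    }

With these default values and a single parameter combination, the definitions above give t^b_w = 0 + 5 ∗ 5 ∗ 10s = 250s and t^b_m = 5 ∗ 5 ∗ 10s = 250s, i.e., roughly 8.3 minutes per parameter value (ignoring fixtures and JVM startup); the three @Param values of this one benchmark therefore already account for about 25 minutes of suite runtime.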

Figure 4.1: JMH Execution. (Schematic: a microbenchmark suite of benchmarks b1…bn; each benchmark runs warmup forks wf1…wfn followed by measurement forks f1…fn; within a fork, warmup iterations wi1…win of duration wt precede measurement iterations mi1…min of duration mt, and each iteration reports the performance metrics of a sample of invocations.)

4.3 Pre-Study

To motivate our work, we conduct a pre-study investigating whether benchmark execution times are in fact a problem in real-world OSS projects using JMH.

4.3.1 Data Collection

We create, to the best of our knowledge, the most extensive OSS JMH data set to date from GitHub, by querying and combining three sources: (1) Google BigQuery's most recent GitHub snapshot², queried for org.openjdk.jmh import statements [Laaber and Leitner, 2018a; Damasceno Costa et al., 2019]; (2) GitHub's search application programming interface (API) with an approach as outlined by Stefan et al. [2017]; and (3) Maven Central, searched for projects with JMH as a dependency. Our final dataset consists of 753 pre-study subjects after removing duplicate entries, repositories that no longer exist, projects without benchmarks in the most recent commit, and forked projects. For each project, we apply the tool bencher [Laaber, 2020a] to construct abstract syntax trees (ASTs) for Java source-code files and extract information related to (1) execution configuration (@Fork, @Warmup, @Measurement, and @BenchmarkMode) and (2) benchmark parameterization (@Param). In addition, (3) we extract the JMH version from the build script (Maven and Gradle).

2 https://console.cloud.google.com/bigquery?project=fh-bigquery&page=dataset&d=github_extracts&p=fh-bigquery

4.3.2 Summary of Pre-Study Subjects

The 753 projects have in total 13,387 benchmark methods and 48,107 parameter combinations. 400 (53.1%) projects feature fewer than 10 benchmarks, and 52 (6.9%) projects contain 50 benchmarks or more. On average, a project has 19.7±44.9 benchmarks, with a median of 7. The project with the largest benchmark suite is eclipse/eclipse-collections with 515 benchmarks. Benchmark parameterization is quite common, with projects having, on average, 70.6±303.3 parameterized benchmarks, with a median of 9. 76.9% of the benchmarks have 10 parameter combinations or fewer. We find the largest number of parameter combinations in the project msteindorfer/criterion, with 4,132 combinations; and the project with the most parameter combinations for a single benchmark is apache/hive, which contains an individual benchmark³ with an astounding 2,304 different combinations. However, the majority of the benchmarks are not parameterized, i.e., 10,394 (77.6%). Extracting the JMH version is crucial for our analysis, as the default values of the execution configurations have been changed with version 1.21 (see also Section 4.2). However, automatically extracting the JMH version is not possible for each project. We are able to successfully extract the JMH version from build scripts for 573 (76%) of our pre-study subjects, containing 10,816 (80.8%) benchmarks. About 20% of the projects (containing 4,115 (38.0%) benchmarks) already use the most recent JMH version.

3 VectorGroupByOperatorBench.testAggCount

Figure 4.2: Impact of custom configurations on the execution times of (a) benchmark suites and (b) benchmarks. (Panel (a) plots the cumulative number of projects over the benchmark suite execution time T in hours; panel (b) plots the cumulative number of benchmarks over the factor by which t^b is decreased compared to the JMH defaults, on a log10 scale.)

4.3.3 Results

We use this data to analyze how much time benchmark suites in the wild take to execute. Figure 4.2a depicts a summary of benchmark suite execution times T for the 573 studied projects where JMH version extraction was successful. The runtimes vary greatly, ranging from 143 milliseconds for protobufel/protobuf-el to no less than 7.4 years for kiegroup/kie-benchmarks (clearly, this project does not intend to execute all benchmarks at once), with a median of 26.7 minutes. 364 (49%) benchmark suites run for an hour or less, which is probably acceptable, even in CI environments. However, 110 (15%) suites take longer than 3 hours, with 22 projects (3%) exceeding 12 hours of runtime. For example, the popular collections library eclipse/eclipse-collections has a total benchmark suite runtime of over 16 days, executing 515 benchmarks with 2,575 parameter combinations. We conclude that at least 15% of the pre-study subjects would greatly benefit from an approach to reduce benchmark execution times given their current configuration.

The benchmark suite execution time is based on the extracted JMH configurations from the projects. We speculate that developers specifically apply custom configurations to reduce the default settings of JMH. Indeed, 4,836 (36%) benchmarks have a configuration change that affects their runtime, of which 4,576 (34%) have a decreased benchmark time t^b with respect to the JMH defaults (see Figure 4.2b). We observe that for the majority of these benchmarks the execution time is in fact drastically reduced: for 3,735 (28%) and 2,379 (18%) by a factor ≥ 5 and ≥ 10, respectively, and still 374 (3%) benchmarks are reduced by a factor ≥ 50. Only a minority of 250 (2%) benchmarks, belonging to just 17 (3.0%) projects, are configured to increase execution time compared to the defaults.

Pre-Study Summary. OSS developers extensively customize benchmark configurations, often setting their values considerably lower than the JMH defaults. Despite these changes, 15% of the projects still have a benchmark suite execution time of over 3 hours. These findings indicate that developers of many projects could be supported by a data-driven way to reduce the execution time of JMH benchmarks.

4.4 Dynamic Reconfiguration

In Section 4.3, we found that real-world OSS benchmark suites are often configured to considerably reduce runtime with respect to JMH's defaults; still, many run for multiple hours, making it effectively impossible to assess performance on every software change. We hypothesize that this time reduction is an effort by developers to keep benchmark suite runtimes reasonable without confirming that benchmark results remain stable (accurate). This section introduces an approach to dynamically stop benchmarks when their result is stable, with the goal of saving execution time without sacrificing quality.

4.4.1 Approach

JMH allows developers to define benchmark configurations before execution, either through annotations or CLI parameters, and then executes all benchmarks according to this configuration (see Section 4.2). We call this the "static configuration" of a benchmark execution. Figure 4.3a shows the static configuration, where every row indicates a JMH fork (f1–f5) and every column/rectangle an iteration (i1–i20) of the corresponding fork. Yellow rectangles (i1–i10) indicate warmup iterations, and blue rectangles (i11–i20) indicate measurement iterations. This static configuration bears the problem that all forks are executed with the same configuration, irrespective of the accuracy of the results, potentially wasting precious runtime.

In order to stop benchmark executions when their result is accurate enough, we propose dynamic benchmark reconfiguration, i.e., an approach that dynamically decides, at certain checkpoints, when the benchmark results are unlikely to change with more executions. This happens at two points: (1) within a fork, when the execution reaches a steady state, i.e., the warmup phase was executed long enough, and (2) after a fork's execution, when it is unlikely that more forks will lead to different results. Figure 4.3b illustrates dynamic reconfiguration. Vertical bars indicate checkpoints after iterations (line 7), horizontal bars indicate checkpoints after forks (line 10), and white, dashed boxes represent iterations that are skipped.

Algorithm 1 depicts the pseudocode for our dynamic reconfiguration approach. The algorithm takes the benchmark b to execute, its extended JMH execution configuration C^b, a stability function stable that is executed at every checkpoint, and a threshold t for deciding what is considered stable. C^b is a tuple of configuration values defined as C^b = ⟨wi^min, wi^max, mi, f^min, f^max, wf, wt, mt⟩ (see also Section 4.2). Note that checkpoints only happen after i5 and f2 in the example, defined as wi^min and f^min. If a benchmark is not stable at a checkpoint, the bar is dashed (solid otherwise) and the warmup phase continues or another fork is spawned.

To circumvent the situation where a benchmark's warmup phase never reaches a steady state or the overall measurements are never accurate enough, our approach takes a maximum number of warmup iterations (wi^max) and forks (f^max), e.g., f3 has a dashed bar after the last warmup iteration.

Figure 4.3: Standard JMH execution with static configuration (a) vs. the dynamic reconfiguration approach (b). Each row f1–f5 is a fork and each column i1–i20 an iteration. A yellow box is a warmup iteration, a blue box is a measurement iteration, and a dashed box is a skipped iteration. A solid line indicates that the stoppage criterion is met, and a dashed line indicates the opposite.

This guarantees that a single benchmark execution never exceeds a configurable time budget, which defaults to JMH's warmup iterations (wi) and forks (f). Benchmarks often exhibit multiple steady states, resulting in multi-modal distributions, and outliers caused by non-deterministic behavior might still occur even after the stability function stable has considered a fork to be in a steady state [Gil et al., 2011]. Therefore, our approach uses a fixed number of measurement iterations mi (lines 8–9), as a single measurement iteration would not accurately represent a fork's performance.

4.4.2 Stoppage Criteria

To decide whether a fork reached a steady state (line 7) or the gathered measurements are stable (line 10), our approach needs to decide whether more measurements provide significantly more accurate results. For this, we rely on statistical procedures on the performance measurement distributions.

Algorithm 1: Dynamic reconfiguration algorithm
Input:  b ∈ B: the benchmark to execute
        C^b = ⟨wi^min, wi^max, mi, f^min, f^max, …⟩: execution configuration for b
        stable : M × T → {true, false}: stability function at threshold t ∈ T for a set of measurements M′ ∈ M
        t: stability threshold (specific for stable)
Data:   execute : B → M: executes a benchmark iteration
Result: measurements M^b of the benchmark b

 1  begin
 2    M^b ← ∅
 3    for f ← 1 to f^max do
 4      M_w ← ∅
        // dynamic warmup phase
 5      for wi ← 1 to wi^max do
 6        M_w ← M_w ∪ execute(b, C^b)
          // warmup stoppage
 7        if wi ≥ wi^min ∧ stable(M_w, t) then break
        // measurement phase
 8      for 1 to mi do
 9        M^b ← M^b ∪ execute(b, C^b)
        // fork stoppage
10      if f ≥ f^min ∧ stable(M^b, t) then break
11    return M^b

That is, if more measurements (i.e., data points) are unlikely to change the result distribution, we consider the measurement stable. There are three key aspects to consider: (1) a stability criterion sc : M → R⁺ that assigns a stability value s ∈ R to a set of measurements M′ ∈ M; (2) a threshold t ∈ T that indicates whether a stability value s is considered stable; and (3) a stability function stable : M × T → {true, false} that, based on a set of stability values (extracted from a set of measurements M′ ∈ M) and a threshold t ∈ T, decides whether a set of performance measurements is stable or not.
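One possible shape of these abstractions is sketched below in Java. The interface and class names are ours and not part of JMH or of the implementation described in Section 4.4.3, and the shown stability function is merely one plausible instance; the concrete criteria and functions used by the approach follow below.

    import java.util.List;

    /** sc : M -> R+, assigns a stability value to a set of measurements (hypothetical interface). */
    interface StabilityCriterion {
        double value(List<Double> measurements);
    }

    /** stable : M x T -> {true, false}, decides stability from a vector of stability values and a threshold t. */
    interface StabilityFunction {
        boolean isStable(List<Double> stabilityValues, double threshold);
    }

    /** One plausible stability function: all values in the vector lie within a delta of at most t. */
    class MaxMinDeltaStabilityFunction implements StabilityFunction {
        @Override
        public boolean isStable(List<Double> v, double t) {
            double max = Double.NEGATIVE_INFINITY;
            double min = Double.POSITIVE_INFINITY;
            for (double s : v) {
                max = Math.max(max, s);
                min = Math.min(min, s);
            }
            return max - min <= t;   // delta between largest and smallest stability value is at most t
        }
    }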

Stopping Warmup Phase

The first stoppage point (line 7) decides whether a fork is in a steady state, which indicates the end of the warmup phase and the start of the measurement phase. For this, the dynamic reconfiguration approach uses a sliding-window technique, where the measurement distributions at the last iterations are compared along a stability criterion. Let us consider the set of warmup measurements M_w (across multiple warmup iterations) such that m_i ∈ M_w is the measurement at the i-th iteration. We then define the sliding-window warmup vector W_{i′′} after a current iteration i′′, with a sliding-window size s^W and the resulting start iteration of the window i′ = i′′ − s^W with i′ ≥ 1, in Equation 4.1.

W_{i′′} = ⋃_{i=i′}^{i′′} sc({ m_x | i′ ≤ x ≤ i })    (4.1)

Stopping Forks

The second stoppage point (line 10) decides whether the benchmark measurement results M^b are sufficiently stable and no additional fork needs to be spawned, therefore stopping the execution of benchmark b. Let us consider the set of measurements M^b (across multiple forks) such that M_f^b ⊆ M^b is the subset of measurements at fork number f. We then define the fork vector F_{f′′} after a current fork f′′ in Equation 4.2.

F_{f′′} = ⋃_{f=1}^{f′′} sc({ M_x^b | 1 ≤ x ≤ f })    (4.2)

Stability Criteria and Function

The dynamic reconfiguration approach allows for different stability criteria (sc) and functions (stable), and we identified and evaluated three:

Coefficient of variation (CV): The coefficient of variation (CV) is a measure of variability under the assumption that the distribution is normal. However, performance distributions are usually non-normal, e.g., multi-modal or long-tailed [Curtsinger and Berger, 2013; Maricq et al., 2018]. As a stability criterion sc, CV might still be a "good enough" proxy to estimate a benchmark's stability, especially due to its low computational overhead. Depending on the benchmark, the stability values in the vector v ∈ {W_{i′′}, F_{f′′}} converge towards different values, making a global threshold t for all benchmarks unrealistic. Instead, we compare all stability values from v such that the delta between the largest and the smallest is at most t. Formally, stable_var(M′, t) = true ⟺ max(v) − min(v) ≤ t.
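A minimal sketch of this criterion in Java could look as follows. The class and method names are hypothetical, and, for brevity, the sketch applies the CV to each iteration's invocation samples individually rather than to the growing measurement sets of Equation 4.1; the window size and threshold correspond to the s^W = 5 and t = 0.01 used later in the evaluation (Section 4.5.3).

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.List;

    /** Sketch of the CV stoppage criterion; names and structure are illustrative only. */
    class CvStoppageCheck {
        private final int windowSize;     // sliding-window size s^W, e.g., 5
        private final double threshold;   // threshold t, e.g., 0.01
        private final Deque<Double> window = new ArrayDeque<>();

        CvStoppageCheck(int windowSize, double threshold) {
            this.windowSize = windowSize;
            this.threshold = threshold;
        }

        /** sc: coefficient of variation (standard deviation divided by mean) of one iteration's measurements. */
        static double coefficientOfVariation(List<Double> measurements) {
            double mean = measurements.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
            double variance = measurements.stream()
                    .mapToDouble(m -> (m - mean) * (m - mean))
                    .average().orElse(0.0);
            return mean == 0.0 ? 0.0 : Math.sqrt(variance) / mean;
        }

        /** Called at the checkpoint after each warmup iteration (line 7 of Algorithm 1). */
        boolean checkpoint(List<Double> iterationMeasurements) {
            window.addLast(coefficientOfVariation(iterationMeasurements));
            if (window.size() > windowSize) {
                window.removeFirst();
            }
            if (window.size() < windowSize) {
                return false;                        // not enough stability values in the window yet
            }
            double max = window.stream().mapToDouble(Double::doubleValue).max().getAsDouble();
            double min = window.stream().mapToDouble(Double::doubleValue).min().getAsDouble();
            return max - min <= threshold;           // stable_var: max(v) - min(v) <= t
        }
    }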

Relative confidence interval width (RCIW): The second stability criterion sc, the relative confidence interval width (RCIW), is similar to CV in that it estimates a benchmark's variability; hence, stable_var also applies here. Different from CV, we employ a technique based on Monte Carlo simulation called bootstrap [Davison and Hinkley, 1997; Hesterberg, 2015] to estimate the RCIW for the mean. For this, we utilize the tool pa [Laaber, 2020b], which implements a technique by Kalibera and Jones [2012]. It uses hierarchical random resampling [Ren et al., 2010] with replacement, which is tailored to performance evaluation. The hierarchical levels are (1) invocations, (2) iterations, and (3) forks (we refer to pa [Laaber, 2020b] and Kalibera and Jones [2012] for details).
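The following is a deliberately simplified, non-hierarchical sketch of estimating the RCIW of the mean with a percentile bootstrap; the actual implementation delegates to pa with hierarchical resampling over invocations, iterations, and forks, which is not reproduced here, and all names are illustrative.

    import java.util.List;
    import java.util.Random;

    /** Simplified, non-hierarchical sketch of the RCIW stability criterion (illustrative only). */
    class RciwSketch {

        /** Relative width of a bootstrap percentile confidence interval for the mean: (upper - lower) / mean. */
        static double rciwOfMean(List<Double> measurements, int bootstrapIterations,
                                 double confidenceLevel, long seed) {
            Random random = new Random(seed);
            int n = measurements.size();
            double[] bootstrapMeans = new double[bootstrapIterations];
            for (int b = 0; b < bootstrapIterations; b++) {
                double sum = 0.0;
                for (int i = 0; i < n; i++) {
                    sum += measurements.get(random.nextInt(n));   // resampling with replacement
                }
                bootstrapMeans[b] = sum / n;
            }
            java.util.Arrays.sort(bootstrapMeans);
            double alpha = 1.0 - confidenceLevel;                 // e.g., 0.01 for a 99% interval
            double lower = bootstrapMeans[(int) Math.floor(alpha / 2.0 * (bootstrapIterations - 1))];
            double upper = bootstrapMeans[(int) Math.ceil((1.0 - alpha / 2.0) * (bootstrapIterations - 1))];
            double mean = measurements.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
            return mean == 0.0 ? Double.POSITIVE_INFINITY : (upper - lower) / mean;
        }
    }

As with CV, the resulting RCIW values form the vector v and are checked with stable_var; the evaluation (Section 4.5.3) uses 1,000 bootstrap iterations, a 99% confidence level, and t = 0.03.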

Kullback-Leibler divergence (KLD): The third stability criterion sc uses a technique outlined by He et al. [2019] that constructs a probability that two distributions d_1 and d_2 are similar, based on the Kullback-Leibler divergence (KLD) [Kullback and Leibler, 1951]. sc computes this probability (for every element of the vector v), where d_1 is the measurement distribution excluding the last measurement (warmup iteration i or fork f) and d_2 is the measurement distribution including the last measurement. Consequently, and different from a variability-based stability criterion, the vector v consists of probabilities rather than variabilities. The stability function stable checks whether the mean probability of the stability values from v is above the threshold t. Formally, stable_prob(M′, t) = true ⟺ mean(v) > t.
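As an illustration of the divergence computation only, the following sketch estimates the KLD between two samples over a shared partition into strips via histograms. The kernel density estimation, outlier handling, and the mapping from divergence to the similarity probability described by He et al. [2019] are not reproduced, and all names are ours.

    import java.util.List;

    /** Histogram-based sketch of the Kullback-Leibler divergence between two samples (illustrative only). */
    class KldSketch {

        /** KL divergence D(P || Q) of histograms over a shared partition into `strips` bins. */
        static double klDivergence(List<Double> d1, List<Double> d2, int strips) {
            double min = Math.min(min(d1), min(d2));
            double max = Math.max(max(d1), max(d2));
            double width = (max - min) / strips;
            if (width == 0.0) {
                return 0.0;                                   // identical constant samples
            }
            double[] p = histogram(d1, min, width, strips);
            double[] q = histogram(d2, min, width, strips);
            double kld = 0.0;
            for (int i = 0; i < strips; i++) {
                if (p[i] > 0.0 && q[i] > 0.0) {               // crude simplification: skip empty strips to avoid log(0)
                    kld += p[i] * Math.log(p[i] / q[i]);
                }
            }
            return kld;
        }

        private static double[] histogram(List<Double> data, double min, double width, int strips) {
            double[] h = new double[strips];
            for (double v : data) {
                int bin = Math.min(strips - 1, (int) ((v - min) / width));
                h[bin] += 1.0 / data.size();                  // normalize to a probability distribution
            }
            return h;
        }

        private static double min(List<Double> d) {
            return d.stream().mapToDouble(Double::doubleValue).min().orElse(0.0);
        }

        private static double max(List<Double> d) {
            return d.stream().mapToDouble(Double::doubleValue).max().orElse(0.0);
        }
    }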

4.4.3 Modified JMH Implementation

We implemented the dynamic reconfiguration approach with the three stoppage criteria for JMH version 1.21, by adding a reconfiguration benchmark mode with stoppage criteria (sc and stable) and threshold (t) properties, annotation properties for wi^min and f^min, and corresponding CLI flags. Additionally, we adapted JMH's console and JavaScript Object Notation (JSON) result file output to include the new configuration options and added a warning if the stability criterion has not been met for a benchmark. The modified fork of JMH is available on GitHub [Laaber et al., 2020d] and part of our replication package [Laaber et al., 2020e].

4.5 Empirical Evaluation

To assess whether dynamic reconfiguration is effective and efficient, we conduct an experimental evaluation on a subset of the Java OSS projects identified in our pre-study (see Section 4.3). Our evaluation compares three dynamic reconfiguration approaches (one for every stoppage criterion). As a baseline for comparison, we use standard JMH with static configuration and the default values. To support open science, we provide all evaluation data and scripts in a replication package [Laaber et al., 2020e].

4.5.1 Research Questions

First, we want to ensure that dynamic reconfiguration does not change the results compared to static configuration. If the results of the same benchmark executed with static configuration and with dynamic reconfiguration are equal, we conclude that dynamic reconfiguration is effective in preserving result quality. For this, we formulate RQ 1:

RQ 1 How does dynamic reconfiguration of software microbenchmarks affect their execution result?

Second, we want to evaluate if dynamic reconfiguration improves the overall runtime of a benchmark suite, compared to static configuration, including the overhead imposed by the stoppage criteria computation. For this, we formulate RQ 2:

RQ 2 How much time can be saved by dynamically reconfiguring software microbenchmarks?

As a benchmark’s result quality (accuracy) and runtime are competing objectives, the combination of the results from RQ 1 and RQ 2 validates whether dynamic reconfiguration enables “reducing execution time without sacrificing result quality”.

4.5.2 Study Subjects

Evaluating the dynamic reconfiguration approach on all 753 pre-study subjects (see Section 4.3) is infeasible, as executing benchmark suites potentially takes a long time. Hence, we perform purposive sampling [Baltes and Ralph, 2020] to select a subset of ten non-trivial projects from a wide variety of domains with small (16) to large (994) benchmark suites. Our evaluation executes all 3,969 benchmark parameter combinations of the ten study subjects, which are 8.2% of the 48,107 parameter combinations from the pre-study. Table 4.1 lists the study subjects with their number of benchmarks (“# Benchs.”) and benchmark parameter combinations (“# Param. Benchs.”), git version used for the evaluation (“Version”), and execution time when using JMH default values (“Exec. Time”).

4.5.3 Study Setup

We execute all benchmarks, retrieve the benchmark results, and afterwards apply dynamic reconfiguration and the stoppage criteria to the obtained data set. This allows us to experiment with thresholds and parameters without having to rerun the full benchmark suites with our modified JMH implementation (with dynamic reconfiguration).

Table 4.1: Selected study subjects. All projects are hosted on GitHub except the ones indicated.

Name         Project                     # Benchs.  # Param. Benchs.¹  Domain                       Exec. Time¹
byte-buddy   raphw/byte-buddy                39           39           Bytecode manipulation           5.42h
JCTools      JCTools/JCTools                 60          148           Concurrent data structures     20.56h
jdk          jmh-jdk-microbenchmarks³       994        1,381           Benchmarks of the JDK         191.81h
jenetics     jenetics/jenetics               40           40           Genetic algorithms              5.56h
jmh-core     jmh-core-benchmarks²           110          110           Benchmarks of JMH              15.28h
log4j2       apache/logging-log4j2          358          510           Logging                        70.83h
protostuff   protostuff/protostuff           16           31           Serialization                   4.31h
RxJava       ReactiveX/RxJava               217        1,282           Asynchronous programming      178.06h
SquidLib     SquidPony/SquidLib             269          367           Visualization                  50.97h
zipkin       openzipkin/zipkin               61           61           Distributed tracing             8.47h

Git versions used for the evaluation (commit hashes): d0fab23, c24319a, 19cbaae, 002f969, a07e914, ac121e2, 17a8eef, 2865bb4, 055f041, 43f633d.

1 The numbers correspond to succeeding benchmarks and exclude 38 failing parameterizations. See our replication package for a list [Laaber et al., 2020e].
2 Module directory in repository: https://hg.openjdk.java.net/code-tools/jmh
3 Repository: http://hg.openjdk.java.net/code-tools/jmh-jdk-microbenchmarks

Execution and Data Gathering

As performance measurements are prone to confounding factors [Georges et al., 2007; Mytkowicz et al., 2009; Curtsinger and Berger, 2013; de Oliveira et al., 2013b; Maricq et al., 2018], we apply the subsequent steps to follow a rigorous methodology in order to increase result reliability.

(1) All benchmark suites are patched with JMH 1.21.

(2) We compile and execute all benchmarks with AdoptOpenJDK and Java HotSpot virtual machine (VM) version 1.8.0_222-b10, except log4j2 which requires a Java Development Kit (JDK) version ≥ 9, hence we employ version 13+33.

(3) We run the benchmarks on a bare-metal machine [Bakshy and Frachtenberg, 2015; Selakovic and Pradel, 2016] with a 12-core Intel Xeon X5670 @2.93GHz CPU, 70 GiB memory, and a Samsung SSD 860 PRO SATA III disk, running ArchLinux with kernel version 5.2.9-1-1-ARCH.

(4) All non-mandatory background processes except ssh are disabled, without explicitly disabling software/hardware optimizations.

(5) Regarding benchmark suite execution, we configure and execute all benchmarks with five forks f, 100 measurement iterations mi, 1s measurement time mt, and JMH's sample mode, set through JMH's CLI (see the example invocation after this list). This configuration corresponds to the JMH 1.21 defaults; only mt changes from 10s to 1s but, at the same time, mi increases by a factor of 10, which grants our approach more checkpoints. Note that warmup iterations wi are set to zero but mi is doubled (from 50 to 100), which is required to obtain results for every iteration to dynamically decide when to stop the warmup phase. The resulting execution configuration is then C^b = ⟨0, 0, 100, 5, 5, 0, 0s, 1s⟩.

(6) We remove outliers that are a magnitude larger than the median.
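The sketch below shows one way such a configuration could be expressed programmatically through JMH's Java API; the study itself sets these values via the JMH CLI, and the class name and the (roughly equivalent) CLI invocation shown in the comment are illustrative assumptions rather than the exact commands used.

    import org.openjdk.jmh.annotations.Mode;
    import org.openjdk.jmh.runner.Runner;
    import org.openjdk.jmh.runner.RunnerException;
    import org.openjdk.jmh.runner.options.Options;
    import org.openjdk.jmh.runner.options.OptionsBuilder;
    import org.openjdk.jmh.runner.options.TimeValue;

    // Illustrative runner; roughly equivalent CLI: java -jar benchmarks.jar -f 5 -wi 0 -i 100 -r 1s -bm sample
    public class StudyConfigurationRunner {
        public static void main(String[] args) throws RunnerException {
            Options options = new OptionsBuilder()
                    .forks(5)                               // f = 5 measurement forks
                    .warmupIterations(0)                    // wi = 0 (warmup recorded as measurement iterations)
                    .measurementIterations(100)             // mi = 100
                    .measurementTime(TimeValue.seconds(1))  // mt = 1s
                    .mode(Mode.SampleTime)                  // JMH's sample mode
                    .build();
            new Runner(options).run();                      // runs all benchmarks on the classpath
        }
    }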

Approach

With the obtained performance results from the suite executions, we evaluate dynamic reconfiguration with the following configuration parameters. Recall the configuration definition C^b = ⟨wi^min, wi^max, mi, f^min, f^max, wf, wt, mt⟩ (see Section 4.4.1).

Static Configuration (Baseline) The baseline, i.e., JMH with static configuration, uses the JMH 1.21 default configuration for all benchmarks. For this, we remove from the gathered data the first 50 iterations (corresponding to wi) from each fork and use the 50 remaining iterations as mi. Hence, the baseline has the following configuration: C^b = ⟨50, 50, 50, 5, 5, 0, 1s, 1s⟩. We consciously decided on the JMH default configuration as baseline and against the developers' custom benchmark configurations for the following reasons: (1) 36% of the pre-study benchmarks change the benchmark execution time through custom configurations, hence, 64% of the benchmarks still use the JMH default configuration; (2) the majority of these benchmarks (28% of all pre-study benchmarks) only use a single fork f, which is considered bad practice as inter-JVM variability is common [Laaber and Leitner, 2018a], basically invalidating developers' custom configurations for rigorous benchmarking; and (3) a unified benchmark configuration as the baseline enables comparability across our study subjects.

Dynamic Reconfiguration For the dynamic reconfiguration approaches, we employ the configuration C^b = ⟨5, 50, 10, 2, 5, 0, 1s, 1s⟩ for all benchmarks, which changes the minimum warmup iterations (wi^min = 5) and minimum forks (f^min = 2) compared to the baseline. Note that we also reduce mi to 10 instead of the 50 which the baseline uses. Initial experiments showed that an increase in measurement iterations, after a steady state is reached, has only a minor effect on result accuracy but comes with considerably longer runtimes. We use the following parameters for the three dynamic reconfiguration approaches (one per stoppage criterion).

(1) We draw a weighted sample of 1,000 invocations per iteration to reduce computational overhead at checkpoints.

(2) The sliding-window size is set to s^W = 5.

(3) CV uses a threshold t = 0.01, which corresponds to a maximum variability difference in the sliding window of 1%.

(4) RCIW uses a 99% confidence level, 1,000 bootstrap iterations (which is a good tradeoff between runtime overhead and estimation accuracy), and a threshold t = 0.03 following best practice [Georges et al., 2007].

(5) KLD partitions the distributions d_1 and d_2 into 1,000 strips for the KLD calculation [He et al., 2019]; removes outliers that are more than 1.5×IQR away from the median; and uses a threshold t = 0.99, which corresponds to a mean probability within the sliding window of 99% or larger. More strips would result in longer calculation times for the kernel density estimation and, consequently, in a higher runtime overhead. Without the outlier removal, KLD would not converge above the probability threshold t, and, hence, our approach would not stop the benchmark execution. Note that the outlier removal is only performed as part of the stoppage criteria calculation of our approach; for the evaluation, we consider all measurements and do not remove any outliers (see Section 4.5.4).

4.5.4 Results and Analysis

We now present the results of our empirical evaluation by comparing the benchmark results of the static configuration to the ones of our dynamic reconfiguration approaches with the three stoppage criteria.

RQ 1: Result Quality

To assess whether applying dynamic reconfiguration changes benchmark results and to answer RQ 1, we perform two analyses between the execution results coming from the baseline with static configuration and each of the three dynamic reconfiguration approaches: (1) statistical A/A tests and (2) mean performance change rate.

A/A Tests An A/A test checks whether results from two distributions are not significantly different, where no difference is expected. In our context, this means that if an A/A test between static configuration and dynamic reconfiguration (for each stoppage criterion) does not report a difference, we conclude that dynamic reconfiguration does not change the benchmark result. Following performance engineering best practice [Kalibera and Jones, 2012; Bulej et al., 2017a; Laaber et al., 2019b; Bulej et al., 2019], we estimate the confidence interval for the ratio of means with bootstrap [Davison and Hinkley, 1997], using 10,000 iterations [Hesterberg, 2015] and employing hierarchical random resampling with replacement on (1) invocation, (2) iteration, and (3) fork level [Kalibera and Jones, 2012] (again relying on pa [Laaber, 2020b]). If the confidence interval (of the ratio) straddles 1, there is no statistically significant difference. Note that this procedure is different from the stoppage criterion RCIW (see Section 4.4); here we compare the results (all measurement iterations mi from all forks f) of two techniques, whereas RCIW uses confidence interval widths as a variability measure of a single technique.

The first row of Table 4.2 shows the A/A results. For a majority of the 3,969 benchmark parameter combinations, applying dynamic reconfiguration does not result in significantly different distributions. About 80% or more of the benchmarks have similar result distributions compared to the static configuration. RCIW achieves the best result with 87.6%, while CV and KLD perform similarly well with 78.8% and 79.6%, respectively. Note that the static approach uses 50 measurement iterations (mi) while the dynamic approach "only" runs 10, indicating that if a steady state is reached (which is one goal of dynamic reconfiguration) more measurement iterations have a negligible impact on the overall result.
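Conceptually, the A/A decision for a single benchmark can be sketched as follows; this is a simplified, non-hierarchical percentile bootstrap, whereas the actual analysis resamples hierarchically over invocations, iterations, and forks via pa, and the class and method names are illustrative.

    import java.util.List;
    import java.util.Random;

    /** Simplified (non-hierarchical) sketch of the A/A decision via a bootstrap CI for the ratio of means. */
    class RatioOfMeansAATest {

        /** Returns true if the bootstrap CI of mean(a)/mean(b) straddles 1, i.e., no significant difference. */
        static boolean noSignificantDifference(List<Double> a, List<Double> b,
                                               int bootstrapIterations, double confidenceLevel, long seed) {
            Random random = new Random(seed);
            double[] ratios = new double[bootstrapIterations];
            for (int i = 0; i < bootstrapIterations; i++) {
                ratios[i] = resampledMean(a, random) / resampledMean(b, random);
            }
            java.util.Arrays.sort(ratios);
            double alpha = 1.0 - confidenceLevel;
            double lower = ratios[(int) Math.floor(alpha / 2.0 * (bootstrapIterations - 1))];
            double upper = ratios[(int) Math.ceil((1.0 - alpha / 2.0) * (bootstrapIterations - 1))];
            return lower <= 1.0 && 1.0 <= upper;
        }

        private static double resampledMean(List<Double> data, Random random) {
            double sum = 0.0;
            for (int i = 0; i < data.size(); i++) {
                sum += data.get(random.nextInt(data.size()));   // resampling with replacement
            }
            return sum / data.size();
        }
    }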

Change Rate In addition to A/A tests, we assess the performance change rate between the static configuration approach and each of the dynamic reconfiguration approaches, i.e., by how much the means of the performance result distributions differ. The change rate augments the A/A tests' binary decision by showing how different the benchmark results become when applying dynamic reconfiguration.

Table 4.2: Result quality differences between static configuration approach and dynamic reconfiguration approaches

                           CV            RCIW          KLD
A/A tests not different    78.8%         87.6%         79.6%
Mean change rate           3.1% ± 8.1%   1.4% ± 3.8%   2.4% ± 7.4%
# benchs < 1%              57.4%         73.2%         62.3%
# benchs < 2%              72.4%         87.0%         78.2%
# benchs < 3%              79.6%         91.9%         84.6%

The second row of Table 4.2 shows the mean change rate across all benchmarks in percent and its standard deviation. The mean change rate between the three stoppage criteria and the static approach is ~3% or lower for all three. Note that, even when following a rigorous measurement methodology, ~3% could still be caused by JVM instabilities unrelated to our approach [Gil et al., 2011]. Again, RCIW is the best criterion with 1.4% ± 3.8%. Finally, the last three rows show how many benchmarks have a change rate below 1%, 2%, and 3% for all stoppage criteria. We observe that RCIW outperforms the other two significantly, followed by KLD: ~73% of the benchmarks have a change rate below 1%, ~87% below 2%, and ~92% below 3%. This suggests that RCIW is a highly effective technique for stopping benchmark executions.

Figure 4.4 depicts the change rate distributions per project and stoppage criterion, where every data point corresponds to a benchmark's mean performance change. Considering the median change rate of a project's benchmarks, RCIW performs best for all projects except jenetics, jmh-core, and SquidLib, where KLD is slightly superior. CV consistently has the largest change rates of the three stoppage criteria; nonetheless, it performs only slightly worse in most cases. Considering the mean change rate, RCIW is the most accurate stoppage criterion for 9/10 projects, with only jmh-core being more stable when KLD is employed.

Figure 4.4: Mean change rate per study subject and stoppage criterion (CV, KLD, RCIW). The bar indicates the median, the diamond the mean, the box the IQR, and the whiskers [Q1|Q3] + 1.5 ∗ IQR.

Note that for the projects where RCIW is not the best stoppage criterion, both mean and median change rates are below 1%. The projects with the most diverging benchmarks between static configuration and dynamic reconfiguration execution are byte-buddy, JCTools, log4j2, and SquidLib. The benchmarks of these projects are less stable compared to the other projects, likely due to executing non-deterministic behavior such as concurrency and input/output (I/O). Results from benchmarks that are less stable will potentially have statistically different distributions and, therefore, not maintain the same result quality.

Unreachable Stability Criteria If the stability function stable never evaluates the measurements after a warmup iteration or a fork as stable, the maximum number of warmup iterations (wi^max) or forks (f^max) are executed. This corresponds to the static configuration of JMH. We analyzed how often stability is not achieved according to the three stoppage criteria across all study subjects. CV is the most lenient criterion, with only 1.0% of the benchmarks' forks not considered stable after 50 warmup iterations and 12% of the benchmarks insufficiently accurate after five forks. KLD achieves similar numbers (0.8%) for warmup iterations; however, 46.4% of the benchmarks were not considered stable after five forks. RCIW is even more restrictive, where 46.7% and 37.9% of the benchmarks do not reach the stability criteria after wi^max and f^max, respectively. This restrictiveness impacts the A/A test and mean change rate results, leading to benchmark results with higher quality. Not reaching the stability criteria can happen either if the threshold t is too restrictive or if the benchmark is inherently variable, which is a common phenomenon [Laaber and Leitner, 2018a; Laaber et al., 2019b].

RQ 1 Summary. Applying dynamic reconfiguration does not change the result quality of the majority of the benchmarks, when compared to the static configuration. The RCIW stoppage criterion outperforms KLD and CV, with 87.6% of the benchmarks maintaining their result quality and a mean performance change rate of 1.4%.

RQ 2: Time Saving

The main goal of dynamic reconfiguration is to save time executing benchmark suites. For this, and to answer RQ 2, we (1) measure the runtime overhead of the three stoppage criteria, (2) estimate the time saving for all projects compared to the static configuration, and (3) show at which checkpoint (warmup or fork) more time can be saved.

Runtime Overhead To measure the runtime overhead of the three stoppage criteria, we execute the benchmark suite of log4j2 once with standard JMH 1.21 (i.e., static configuration) and once for each stoppage criterion with our JMH fork implementing dynamic reconfiguration. To ensure a valid comparison between the four measurements (static configuration plus dynamic reconfiguration with three stoppage criteria), we use the same configuration C^b = ⟨5, 90, 10, 2, 5, 0, 1s, 1s⟩ for the static and the dynamic approaches, but do not stop at the stoppage checkpoints. We measure the end-to-end execution time t^{b′} of every benchmark b when executed through JMH's CLI. This time includes JVM startup, benchmark fixtures, benchmark execution, and stoppage criteria computation, which is negligible compared to the duration of the measurement. Note that the number of data points used for the stoppage criteria calculation is independent of the study subject by construction of JMH and our approach; therefore, it is sufficient to measure the overhead based on one project (see a discussion on this in Section 4.7).

The overheads o ∈ O of all benchmarks for a stoppage criterion are O = ⋃_{b∈B} (t^{b′}_dyn / t^{b′}_sta − 1), where t^{b′}_dyn is the execution time of the dynamic reconfiguration with a specific stoppage criterion, and t^{b′}_sta is the execution time of the static configuration. The overheads o are independent of the number of iterations and forks executed, because they are factors of the runtime difference between dynamic reconfiguration with one stoppage criterion and the static configuration (i.e., standard JMH), and all our overhead measurements use the same configuration C^b. The overheads we measure are o_CV = 0.88% ± 0.34% for CV, o_RCIW = 10.92% ± 0.63% for RCIW, and o_KLD = 4.32% ± 0.65% for KLD. Note that changing the iteration time of 1s and executing benchmarks on different hardware might affect the overhead. The considerable difference in overhead is explained by the complexity of the stoppage criteria calculations. Whereas CV is computationally cheap (it only needs to compute standard deviation, mean, and their difference), RCIW is computationally intensive due to the simulations required for bootstrap. Because there is hardly any overhead variability (< 1%) among all benchmarks, we consider the overhead constant and use the mean value for the remainder of the experiments.

Time Saving Estimation To estimate the overall time that can be saved with dynamic reconfiguration, we adapt the execution time equation t^b (see Section 4.2) to incorporate the stoppage criteria. The dynamic reconfiguration benchmark execution time is then t^b_dyn = Σ_{f ∈ forks} [(1 + o) ∗ wi_f ∗ wt + mi ∗ mt], where forks corresponds to the number of executed forks f of a benchmark according to the stoppage criterion, wi_f to the number of warmup iterations in this fork f, and the rest according to C^b from Section 4.4.1. For simplicity, and because of the low variability between benchmark overheads, we disregard benchmark fixture times. The total benchmark suite execution time when using dynamic reconfiguration is then T_dyn = Σ_{b ∈ B′} t^b_dyn, where B′ is the set of benchmark parameter combinations.
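As a purely illustrative back-of-the-envelope instance of this estimation, using roughly the average KLD checkpoint counts and overhead reported later in this section (about 14 warmup iterations per fork, 4 executed forks, and o ≈ 4.3%), together with wt = mt = 1s and mi = 10:

    t^b_dyn ≈ 4 ∗ ((1 + 0.043) ∗ 14 ∗ 1s + 10 ∗ 1s) ≈ 4 ∗ 24.6s ≈ 98.4s

compared to the static baseline t^b = 5 ∗ (50 ∗ 1s + 50 ∗ 1s) = 500s, i.e., a reduction of roughly 80%, which is in line with the ~79.5% total saving reported for KLD in Table 4.3.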

Table 4.3: Time saving per project and stoppage criteria

Project       CV                 RCIW               KLD
byte-buddy    4.42h (81.7%)      2.62h (48.4%)      4.22h (77.8%)
JCTools       17.42h (84.8%)     11.45h (55.7%)     17.13h (83.3%)
jdk           157.32h (82.0%)    135.57h (70.7%)    154.41h (80.5%)
jenetics      4.78h (86.0%)      3.37h (60.7%)      4.52h (81.4%)
jmh           12.76h (83.5%)     12.69h (83.1%)     12.42h (81.3%)
log4j2        54.56h (77.0%)     39.12h (55.2%)     55.96h (79.0%)
protostuff    3.43h (79.6%)      2.91h (67.7%)      3.44h (79.8%)
RxJava        147.91h (83.1%)    121.55h (68.3%)    138.68h (77.9%)
SquidLib      43.07h (84.5%)     30.70h (60.2%)     41.11h (80.7%)
zipkin        6.17h (72.8%)      4.93h (58.2%)      6.59h (77.8%)
Total         451.84h (82.0%)    364.92h (66.2%)    438.48h (79.5%)

Table 4.3 shows the time saving per project and stoppage criterion in absolute numbers (hours) and relative to the static configuration. We observe that dynamic reconfiguration with all three stoppage criteria enables drastic time reductions compared to static configuration. In total, CV and KLD save ~80% and RCIW ~66% of the benchmark suite execution times of all projects combined. For individual projects, the time saving ranges between 72.8% and 86.0% for CV, 48.4% and 83.1% for RCIW, and 77.8% and 83.3% for KLD. Even with the computationally most expensive technique, i.e., RCIW, we can save at least 48.4% of time. In total numbers, the savings are between 3.43h and 157.32h for CV, 2.62h and 135.57h for RCIW, and 3.44h and 154.41h for KLD.

Stoppage Criteria Checkpoints Dynamic reconfiguration defines two points during benchmark execution when to stop: (1) after the warmup phase, if measurements are stable within a fork, and (2) after a fork, if measurements across forks are stable. In our analysis, the range of warmup iterations is from five (wi^min) to 50 (wi^max), and forks are between two (f^min) and five (f^max) (see C^b in Section 4.4.1). Although CV and KLD save a similar amount of time, they have different stoppage behavior. Whereas CV requires more warmup iterations (18.5±9.4) than KLD (14.1±6.9), the opposite is the case for forks, with 3.1±1.2 vs. 4.1±1.2, respectively. RCIW, which saves considerably less time, demands more warmup iterations (34.6±16.6) to consider a fork stable but lies between CV and KLD in terms of forks (3.3±1.4). The reported numbers are arithmetic means (of warmup iterations and forks) with standard deviations across all benchmarks of all study subjects. Generally, warmup iterations are more reduced than forks in our setup, indicating that fork-to-fork variability is more present than within-fork variance, that is, variability across multiple JVMs rather than within a single JVM. Dynamic reconfiguration enables finding the sweet spot between shortening warmup iterations and forks in combination with a certain stoppage criterion.

RQ 2 Summary. With runtime overheads between <1% and ~11%, dynamic reconfiguration enables reducing benchmark suite runtimes by 48.4% to 86.0% compared to JMH’s default runtime.

4.6 Discussion and Recommendations

Our pre-study (see Section 4.3) shows that developers often drastically reduce benchmark execution times. We see two potential reasons for this: (1) the benchmark suite runtimes are too long, and, consequently, developers trade shorter runtimes for inaccurate results; or (2) JMH defaults are overly conservative, and benchmarks with shorter runtimes often still produce results that are considered sufficiently accurate. We hypothesize that the former is more likely, but leave the developer perspective on configuration choices for future work. In any case, the proposed dynamic reconfiguration approach enables reducing time while maintaining similar benchmark results, as our empirical evaluation shows.

Recommendations for Developers Developers are advised to either assess their benchmark accuracies when executed in their environment and adjust configurations accordingly, or employ dynamic reconfiguration, which is able to adjust to different execution environments. The choice of stoppage criterion depends on the required result quality and, therefore, the performance change sizes desired to be detected. For slightly less accurate results but more time reduction, we recommend using KLD; otherwise, RCIW is preferred. The exact threshold t depends on the stability of the execution environment the benchmarks are run in. If a controlled, bare-metal environment is available, we suggest the thresholds of our study. In a virtualized or cloud environment, the thresholds need to be adjusted (see also He et al. [2019]). The effectiveness of our technique in non-bare-metal environments, such as in the cloud, is subject to future research. Moreover, whether a combination of different stoppage criteria, e.g., stopping when both KLD and RCIW deem a benchmark run to be stable, improves result accuracy also requires further research. Such a combination would, however, negatively affect the runtime overhead of dynamic reconfiguration.

Microbenchmarks in CI The long benchmark execution times (see Section 4.3 and [Huang et al., 2014; Sandoval Alcocer et al., 2016; Laaber and Leitner, 2018a]) are a major obstacle for including microbenchmarks in CI [Bezemer et al., 2019]. To overcome this hurdle, a combination of our technique with benchmark selection [de Oliveira et al., 2017], benchmark prioritization [Mostafa et al., 2017], and risk analysis on commits [Huang et al., 2014] would reduce the required time for microbenchmarking and potentially enable CI integration. Continuously assessing software performance would increase confidence that a change does not degrade performance and would likely be beneficial for performance bug root cause analysis.

Choosing JMH Configuration Parameters Choosing JMH configuration parameters that keep execution time low and result accuracy high is non-trivial, and developers decrease configurations drastically. Our results show the importance of setting the warmup phase correctly and utilizing multiple forks for benchmark accuracy. With a large number of benchmarks, expecting developers to pick the “right” values becomes unrealistic. Our dynamic reconfiguration approach helps in this regard by deciding, based on data and per benchmark, when the results are accurate enough.

Iteration Time and Forks The warmup and measurement times affect benchmark result accuracy and control the frequency with which stability checkpoints occur. JMH 1.21 changed the default iteration time from 1s to 10s and reduced the default number of forks from ten to five [Shipilev, 2018a,b]. The OpenJDK team argued that 1s is too short for large workloads [Shipilev, 2018a]. We performed an additional analysis of whether result accuracy changes when switching from 10s to 1s but did not observe differences in most cases. Hence, we decided on 1s iterations to give the dynamic reconfiguration approach more checkpoints to assess a benchmark’s stability. Whereas 10s is a safe choice for static configurations, we believe that 1s provides more flexibility and works better with dynamic reconfiguration. Our results support reducing to five forks, which indicates that most fork-to-fork variability is captured.

Choosing Stability Criteria Parameters Choosing optimal meta-parameters for the stability criteria can affect the effectiveness and efficiency of the overall approach. Dynamic reconfiguration supports the sliding-window size sW, the threshold t at which a stability criterion value (CV, RCIW, or KLD) is considered stable, and stability-criterion-dependent parameters (see Section 4.5.3). We base our parameters on common statistical practice and previous research [Georges et al., 2007; He et al., 2019] (see Section 4.5.3). Only the sliding-window size sW is manually set by us. Our empirical evaluation shows that the employed parameters work well across all study subjects. However, future research should explore the meta-parameter space through experimentation. It is important to emphasize that choosing these meta-parameters is an offline activity, which is done once and before executing the benchmarks; hence, the cost of choosing these parameters is not part of the overhead estimations in RQ 2.

Unreachable Stability Criteria Although the stability criterion is frequently not met for warmup iterations or forks of individual benchmarks, at least when using KLD and RCIW, the overall runtime of the full benchmark suites is considerably reduced (see Section 4.5.4). Dynamic reconfiguration uses upper bounds for warmup iterations (wimax) and forks (fmax); therefore, it does not exceed the runtime of standard JMH with static configuration. In case of an unreachable stability criterion, our JMH implementation warns the developer, who can then adjust this benchmark’s upper bounds to obtain better results. Our approach could also automatically lift the caps if the desired result quality is not reached, which should be explored by future research.

4.7 Threats to Validity

Construct Validity Our pre-study (see Section 4.3) relies on information extracted from source code, i.e., configurations based on JMH annotations. We do not consider configurations overwritten through CLI arguments, which might be present in build scripts or documentation in the repositories. Reported runtimes do not consider fixture (setup and teardown) times, JVM startup, and time spent in the benchmark harness of JMH; and they assume iteration times are as configured, while in reality these are minimum times. Therefore, reported times might slightly underestimate the real execution times.

The results and implications from RQ 1 are based on the notion of benchmark result similarity. We assess this through statistical A/A tests (based on bootstrap confidence intervals for the ratio of means) and the mean performance change rate, similar to previous work [Laaber et al., 2019b; Bulej et al., 2019]. Other tests for the similarity of benchmark results, such as non-parametric hypothesis tests and effect sizes [Damasceno Costa et al., 2019; Laaber et al., 2019b], might lead to different outcomes.

We base the time savings from RQ 2 on overhead calculations from a single project and assume this overhead is constant for all stoppage points and benchmarks. There is hardly any reason to believe that overheads change between study subjects, benchmarks, and stoppage points, because the number of data points used for the stoppage criteria computation is similar. This is due to how JMH and our approach work (see Sections 4.2 and 4.4) and how our experiment is designed (see Section 4.5): (1) the measurement time mt is fixed, irrespective of the benchmark workload; (2) the number of iterations mi and forks f is fixed; (3) benchmark fixtures, i.e., setup and teardown, are constant and of negligible duration compared to the measurement duration; and (4) the stoppage criteria calculation uses a sliding-window approach (sW) and, therefore, the number of iterations used for the calculation is constant. Further, we perform a post-hoc analysis on a single benchmark execution data set for all stoppage criteria. That is, we execute the benchmark suites with five forks and 100 measurement iterations of 1s each and then compute the stoppage points. Computing the stoppage points while executing the test suites might lead to slightly different results. Finally, we use a sliding-window approach for determining the end of the warmup phase with a window size (sW) of five. Different window sizes might impose a larger runtime overhead and change the stoppage point outcomes.

Internal Validity Internal validity is mostly concerned with our performance measurement methodology and the employed thresholds. We follow measurement best practice [Georges et al., 2007] and run experiments on a bare-metal machine [Stefan et al., 2017] to reduce measurement bias [Mytkowicz et al., 2009; Curtsinger and Berger, 2013; de Oliveira et al., 2013b; Maricq et al., 2018]. We did not explicitly turn off software and hardware optimizations, which might affect benchmark variability and, therefore, our results. Regarding the thresholds, we started from previous works [Georges et al., 2007; He et al., 2019] and adapted them to fit the context of microbenchmarks. As we used the same thresholds for all benchmarks and projects, we are confident that they are generally applicable for Java microbenchmarks executed on a machine similar to ours. Further, the times reported in Section 4.3 rely on the JMH version of a benchmark; we applied simple heuristics to extract the version, which might not be fully accurate in the case of, for instance, multi-module projects or dynamic JMH version declarations.

External Validity Generalizability might be affected with respect to the studied projects. We only focus on OSS projects from GitHub, and it is unclear whether our findings are equally valid in the context of industrial software or projects hosted on other platforms. Especially, the ten selected projects for our empirical evaluation (see Section 4.5) might not be a representative sample of all JMH projects. Due to the long benchmark suite execution times, studying more projects would not have been feasible. We aimed for a diverse set of projects, spanning multiple domains (see Table 4.1) and covering ~8% of the benchmarks from the pre-study (see Section 4.3).

The effectiveness and efficiency results of dynamic reconfiguration depend on the environment used for executing the benchmarks. Our experimental evaluation favors internal validity over external validity by using a controlled, bare-metal environment. Therefore, we can be more confident that the reported comparisons between study subjects and stoppage criteria are indeed correct and not due to uncontrollable factors present in virtualized and cloud environments. Executing benchmarks with dynamic reconfiguration in such other environments might lead to different results.

Moreover, our focus has been on Java projects that use JMH as their benchmarking framework. Although the concepts from Section 4.4 also translate to other frameworks and languages, the exact results might be different. We opted for Java/JMH because (1) Java is a dynamically compiled language where warmup phases and multiple forks are essential, (2) JMH benchmark suites are long-running [Laaber and Leitner, 2018a] and can benefit greatly from dynamic reconfiguration, and (3) JMH is a mature framework with many features, offering great opportunities for our approach.

Finally, switching to different Java virtual machines, such as Eclipse OpenJ9 or Graal, might change the results due to different performance characteristics.

4.8 Related Work

Performance testing is a form of measurement-based performance engineering [Woodside et al., 2007], which comes in two main flavors: system-level tests and method/statement-level tests. Historically, research focussed on system-level tests [Weyuker and Vokolos, 2000; Menascé, 2002; Jiang and Hassan, 2015], such as load and stress testing, with more recent advances targeting industrial applicability and practice [Nguyen et al., 2014; Foo et al., 2015]. The other flavor, i.e., software microbenchmarks and performance unit tests, has only recently gained popularity in research. Studies on OSS projects [Stefan et al., 2017; Leitner and Bezemer, 2017] found that adoption lags behind their functional counterparts, i.e., unit tests. One problem is that handling performance tests is complex and requires in-depth knowledge from developers. To reduce this friction, Ding et al. [2020] studied utilizing unit tests for assessing performance properties. Bulej et al. [2017a] proposed a framework that lets developers specify performance assertions and handles rigorous statistical evaluation. Horký et al. [2015] compose performance unit test outcomes into code documentation to raise performance awareness, and Damasceno Costa et al. [2019] uncover bad practices in microbenchmark code through static analyses. Generating tests removes the need to write tests by hand: AutoJMH helps to avoid pitfalls rooted in compiler optimization [Rodriguez-Cancio et al., 2016], Pradel et al. [2014] generate performance regression tests for concurrent classes, and PerfSyn synthesizes inputs through mutation that expose worst-case performance behavior [Toffola et al., 2018]. Our work is orthogonal to the aforementioned works: it dynamically adapts software microbenchmark configurations to stop their execution once their result is stable.

Long execution times [Huang et al., 2014; Foo et al., 2015; Laaber and Leitner, 2018a] and uncertain results [Maricq et al., 2018; Laaber et al., 2019b] are well known to complicate the usage of performance tests in general, including software microbenchmarks. There are a few approaches that reduce the time spent in performance testing activities without considering result quality: (1) predicting commits that are likely to impact performance [Huang et al., 2014; Sandoval Alcocer et al., 2016], (2) prioritizing [Mostafa et al., 2017] and (3) selecting [de Oliveira et al., 2017; Alshoaibi et al., 2019] the tests in a suite that are more likely to expose performance changes. Our approach pursues the same goal of reducing benchmarking time, but with a focus on running all benchmarks (similar to prioritization) as long as necessary while maintaining the same result quality.

Result quality is impaired by not running enough measurements as well as by measurement bias, which requires careful experiment planning and execution [Georges et al., 2007; Mytkowicz et al., 2009; Gil et al., 2011; de Oliveira et al., 2013b; Curtsinger and Berger, 2013; Kalibera and Jones, 2013; Blackburn et al., 2016]. To mitigate measurement bias, Georges et al. [2007] outlined a rigorous methodology for assessing the performance of Java programs, on which we base our measurement technique.
Using the correct statistical techniques to assess performance is paramount, with confidence intervals estimated using bootstrap being the state of the art [Kalibera and Jones, 2012; Bulej et al., 2017a; Laaber et al., 2019b; Bulej et al., 2020]. One of our stopping criteria is based on, and our result quality evaluation uses, confidence intervals with bootstrap. To decide how many measurements are enough, approaches using statistical techniques have been proposed, employing CV [Georges et al., 2007; Maricq et al., 2018], confidence intervals [Jain, 1991; Maricq et al., 2018], and the Kullback-Leibler divergence (KLD) [He et al., 2019]. With these, performance experiments such as benchmark executions run until their results are accurate/stable enough and then abort execution, ideally reducing execution time. Our stoppage criteria use these three techniques and apply them in the context of software microbenchmarks after the warmup phase and after every fork. Closest to our approach are the ones by Maricq et al. [2018] and He et al. [2019]. Maricq et al. [2018] estimate the number of trials and iterations using a bootstrap technique. While they perform this estimation before executing benchmarks, we evaluate result quality during execution. He et al. [2019] stop system-level performance tests executed in cloud environments once they reach a certain stability criterion. Different from the benchmarks used in their study, microbenchmarks are much shorter, with runtimes in the order of seconds instead of multiple hours. Our work builds on top of their statistics-based approach using KLD for system benchmarks, adapts it for microbenchmarks, and extends it to other stoppage criteria.

4.9 Conclusions

This paper introduced a dynamic reconfiguration approach for software microbenchmarks, which reduces benchmark execution time while maintaining the same result quality.

In a pre-study based on real-world configurations of 13,387 microbenchmarks coming from 753 projects, we find that developers make extensive use of custom configurations to considerably reduce runtimes for 34% of the benchmarks. Still, about 15% of the projects have benchmark suite runtimes of more than 3 hours.

Our dynamic reconfiguration approach implements data-driven decisions to stop microbenchmark executions, assisting developers with the intricate task of correctly configuring microbenchmarks. With overheads between 1% and 11%, it achieves a time reduction of 48.4% to 86.0%, with between 78.8% and 87.6% of the microbenchmarks preserving their result quality. These results show that dynamic reconfiguration is highly effective and efficient, and we envision it to enable regular performance microbenchmarking activities, e.g., as part of CI.

5

Predicting Unstable Software Benchmarks Using Static Source Code Features

Christoph Laaber, Mikael Basmaci, and Pasquale Salza In submission under major revision at Empirical Software Engineering

Abstract

Software benchmarks are only as good as the performance measurements they yield. Unstable benchmarks show high variability among repeated measurements, which causes uncertainty about the actual performance and complicates reliable change assessment. However, whether a benchmark is stable or unstable only becomes evident after it has been executed and its results are available. In this paper, we introduce a machine-learning-based approach to predict a benchmark’s stability without having to execute it. Our approach relies on 58 statically-computed source code features related to (1) meta information, e.g., lines of code (LOC), (2) programming language elements, e.g., conditionals or loops, and (3) potentially performance-impacting standard library calls, e.g., file and network input/output (I/O).

To assess our approach’s effectiveness, we perform a large-scale experiment on 4,461 Go benchmarks coming from 230 open-source software (OSS) projects. First, we investigate the association between the features and the benchmarks’ stability. The features most associated with unstable benchmarks are concurrency constructs, as well as library calls to pseudo-random number generators, network functionality, and synchronization primitives. Second, we assess the prediction performance of our machine learning model using 11 binary classification algorithms. We find that Random Forest performs best, with good prediction performance from 0.79 to 0.90 and from 0.43 to 0.68 in terms of AUC and MCC, respectively.

Our results show that although benchmark stability is affected by more than just its source code, we can effectively utilize machine learning models to predict whether a benchmark will be stable or not ahead of execution. This enables spending precious testing time on reliable benchmarks, supporting developers in identifying unstable benchmarks during development, or allowing unstable benchmarks to be repeated more often.

5.1 Introduction

Software benchmarks are a performance testing technique on the same granularity as unit tests, i.e., they test functions, methods, or statements. Different from unit tests, they measure performance, most often execution time, by executing the benchmark repeatedly to retrieve reliable results. Depending on a myriad of factors, such as the quality of the benchmark, the stability of the execution environment, the programming language, and the source code the benchmark invokes, these measurements are close to each other, i.e., the benchmark has low result variability and is “stable”, or further apart, i.e., the benchmark has high result variability and is “unstable”. Results from unstable benchmarks do not accurately reflect the “true” performance of the software under test and hinder rigorous and reliable performance change assessment. Unfortunately, this only becomes evident once the benchmark has been executed and its results are available.

Previous research in software performance focussed on performance impact prediction of new code changes on the execution time of software, i.e., whether a code change slows down (or speeds up) the program. These works often leverage statically or dynamically determined source code features, such as added loops or method calls, the code change diff, and sometimes in combination with profiling data, to predict whether a benchmark or a version is likely to experience a performance change [Jin et al., 2012; Huang et al., 2014; Sandoval Alcocer et al., 2016; de Oliveira et al., 2017; Mostafa et al., 2017; Alshoaibi et al., 2019; Sandoval Alcocer et al., 2020]. This information is then used for selecting which versions to test for performance [Jin et al., 2012; Huang et al., 2014; Sandoval Alcocer et al., 2016, 2020], selecting the benchmarks to execute after a code change [de Oliveira et al., 2017; Alshoaibi et al., 2019], or prioritizing the benchmarks with larger predicted performance changes for execution [Mostafa et al., 2017]. All of these focus on slowdown/speedup size as the performance property to predict, and none considers measurement variability or benchmark stability. Moreover, they all employ traditional inference techniques such as rule-based detection [Jin et al., 2012], cost models [Huang et al., 2014; Sandoval Alcocer et al., 2016; Mostafa et al., 2017; Sandoval Alcocer et al., 2020], heuristics [de Oliveira et al., 2017], or genetic algorithms [Alshoaibi et al., 2019], and none utilizes machine learning models.

In this paper, we propose an approach that leverages static source code features to predict whether a benchmark will be unstable, before executing it. The approach employs 58 source code features extracted with abstract syntax tree (AST) and static call graph (CG) information, which consist of (1) meta information, e.g., LOC, (2) programming language elements, e.g., conditionals or loops, and (3) potentially performance-impacting standard library calls, e.g., file and network I/O. To assess our approach, we empirically study the relation of source code to benchmark stability and perform a large-scale experiment on 4,461 benchmarks coming from 230 OSS projects written in Go. The association of source code features to benchmark stability provides an indication of whether they might be used as predictors for our approach’s overall goal. We investigate the correlation in our first research question:

RQ 1 Are source code features associated with benchmark stability?

We find that individual features are at best moderately associated, with the majority of features showing weak or no association to benchmark stability. The majority of meta information and programming language features show a similar correlation, indicating that benchmarks that invoke more code are more strongly associated with instability. Moreover, standard library calls to packages encapsulating I/O, concurrency primitives, and pseudo-random number generators show higher correlations to benchmark instability than other library calls. With our second research question, we investigate whether a combination of all features and the use of machine learning algorithms enable predicting benchmark instability, although RQ 1 showed that the majority of individual features are only weakly correlated:

RQ 2 Can we predict benchmark stability with statically-computed source code features?

We build a static model based on the source code features and assess the prediction performance among 11 classification algorithms. For this, we transform the benchmarks’ variability into two classes, i.e., stable and unstable, relying on thresholds inspired by previous work [Georges et al., 2007; Curtsinger and Berger, 2013]. We compare the different machine learning algorithms to find the best performance on the binary classification task. Our results show that machine learning models can effectively predict whether a benchmark will be unstable. Random Forest performs best, with prediction performance ranging from 0.79 to 0.90 AUC and from 0.43 to 0.68 MCC. Our classification model considers benchmarks to be stable or unstable depending on whether their result variability falls below or above a certain threshold. We particularly study how the threshold value impacts prediction performance with our first sub research question:

RQ 2.1 How does the definition of stability affect the prediction performance?

Inspired by previous work [Georges et al., 2007; Curtsinger and Berger, 2013], we investigate four stability thresholds t ∈ {1%, 3%, 5%, 10%}, which correspond to t% benchmark variability. We find that the threshold t impacts prediction performance, with our largest threshold t = 10% delivering the best prediction performance for the majority of the algorithms studied. This shows that our model is better at classifying benchmarks as unstable if they have higher result variability, i.e., they are “more unstable”. The number of measurement iterations directly influences benchmark stability, i.e., more iterations lead to narrower confidence intervals and, consequently, to more stable benchmarks. Our second sub research question investigates the impact of more measurement repetitions (iterations) on the prediction performance of our model:

RQ 2.2 How does the number of benchmark iterations affect the prediction performance?

We study three numbers of repeated iterations i ∈ {5, 10, 20} and find that the number of iterations drastically impacts the prediction performance of our model. Measurements from more iterations lead to better prediction performance for the majority of the algorithms under study. This shows that our model is better at predicting benchmark instability if a benchmark remains unstable with an increased number of iterations.

Although benchmark stability is affected by more than just source code, our results show that source code features can be effectively used to predict whether a benchmark will be stable or unstable. We envision that this stability prediction can be used in regression benchmarking for selecting only stable benchmarks to be executed, supporting developers in identifying unstable benchmarks during development, or configuring unstable benchmarks to run more iterations.

Contributions The main contributions of this paper can be summarized as follows:

• An approach to extract statically-computable source code features from benchmarks and predict unstable benchmarks based on machine learning algorithms.

• A first study on the association of source code features with the stability of Go benchmarks.

• A study comparing the predictive performance of 11 machine learning algorithms, investigating the effects of different stability definitions and benchmark iterations.

• A large data set of 4,461 Go benchmark executions from 230 OSS projects.

We provide all data and scripts to reuse our approach and replicate our study online [Laaber et al., 2020a].1

Paper Organization In Section 5.2, we provide an introduction to benchmarking in Go and benchmark result stability. Section 5.3 introduces our approach: how it extracts the features, how we define benchmark stability, what is required for the model creation, and how to use the model. Section 5.4 describes the study used for assessing the effectiveness of our approach, and Sections 5.5 and 5.6 report the results for RQ 1 and RQ 2. In Section 5.7, we discuss application scenarios and other aspects of our approach, as well as directions for future research. Section 5.8 discusses threats to validity, and Section 5.9 compares to related work. Finally, we conclude the paper in Section 5.10.

5.2 Software Benchmarks in Go

Software benchmarking—also referred to as microbenchmarking [Laaber and Leitner, 2018a; Laaber et al., 2019b] or performance unit testing [Horký et al., 2015; Stefan et al., 2017; Bulej et al., 2017a]—is a form of measurement-based software performance engineering (SPE) [Woodside et al., 2007] to evaluate the performance, usually execution time, of fine-granular software components such as functions, methods, or statements. They can be considered as the equivalent of unit tests for performance.

1We will move the replication package to a permanent hosting service, e.g., Zenodo or figshare, upon acceptance.

Figure 5.1: Benchmarking workflow in Go (definition, execution, and results). Benchmark1 corresponds to the benchmark call_test.go/BenchmarkNativeCallWithString from the project robertkrimen/otto

The Go programming language2 comes with a benchmarking framework included in its standard library, as part of the testing framework3. Figure 5.1 depicts a schematic view of Go benchmarks: how they are defined and executed, and what their results look like.

Definition In Go, benchmarks are defined as top-level functions in source code, similar to unit tests, if (1) they are placed in a file ending in _test.go, (2) their name starts with Benchmark, and (3) their only function parameter is of type *testing.B. A benchmark’s body also contains a for loop that repeatedly invokes the component that should be benchmarked.
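For illustration, a minimal, self-contained benchmark file following these conventions could look as follows; it mirrors the structure of Benchmark1 in Figure 5.1, but the fib function is purely hypothetical and not part of the study objects.

package fib

import "testing"

// fib is a hypothetical function under test.
func fib(n int) int {
    if n < 2 {
        return n
    }
    return fib(n-1) + fib(n-2)
}

// BenchmarkFib follows the conventions described above: it lives in a file
// ending in _test.go, its name starts with Benchmark, and its only parameter
// is of type *testing.B. The for loop repeatedly invokes the component under
// test b.N times.
func BenchmarkFib(b *testing.B) {
    for i := 0; i < b.N; i++ {
        fib(20)
    }
}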

2https://golang.org
3https://golang.org/pkg/testing

Execution A single measurement is not sufficient to accurately depict a program’s performance, because performance measurements are influenced by a myriad of factors, such as the machine they are executed on, the software (versions) installed, and the programming language characteristics. Consequently, performance measurements (and benchmarks) are susceptible to measurement bias [Mytkowicz et al., 2009; Curtsinger and Berger, 2013; de Oliveira et al., 2013b], which is addressed with repeated measurements. To execute benchmarks, such as the one previously defined, one uses the go command line interface (CLI), specifying which benchmark(s) to execute (through the use of the -bench option) and for how many iterations

(-count). An iteration (in) is the repeated invocation of the benchmark function, Benchmark1 in our example, for a defined duration (-benchtime). The benchmarking framework measures the runtime (with nanosecond precision) of every benchmark invocation and yields the average runtime among all invocations as the iteration’s result.
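For instance, the hypothetical benchmark sketched above could be run for 20 one-second iterations with go test -run '^$' -bench BenchmarkFib -count 20 -benchtime 1s, where -run '^$' skips the project’s unit tests; the benchmark name and count are purely illustrative.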

Results The final result of a benchmark is the distribution of all iteration measurements, which typically does not follow a Gaussian distribution but is often long-tailed or multi-modal [Curtsinger and Berger, 2013; Maricq et al., 2018]. Depending on the variability (or spread) among the individual iteration results, a benchmark is considered more or less stable when it has low or high measurement variability, respectively. Running more iterations leads to narrower confidence intervals of the results and, hence, to more stable benchmark results (see Section 5.3.2). Figure 5.2 shows an example of real-world benchmark results with low (“stable”) and high (“unstable”) variability after 20 iterations.

5.3 Approach

To predict whether a benchmark will be unstable without executing it, we introduce an approach based on machine learning. This approach employs only statically-computed source code features to build a binary classification model used for prediction.

Figure 5.2: Benchmarks with stable and unstable results after 20 iterations (indicated by the dots); the x-axis shows the execution time in ns. The bar indicates the median, the diamond the mean, the box the IQR, and the whiskers [Q1|Q3] ± 1.5 ∗ IQR. The “stable” benchmark is call_test.go/BenchmarkNativeCallWithString from robertkrimen/otto. The “unstable” benchmark is frame_pool_b_test.go/BenchmarkFramePoolChannel1000 from uber/tchannel-go

Although our approach’s general idea is applicable to different programming languages, in particular the types of source code features employed, this section describes the approach in the context of Go. Our approach is based on two main phases: (1) training and (2) usage. During the training phase, depicted in Figure 5.3, our aim is to build a model based on actual benchmark executions. Starting from a possibly large sample of benchmarks, we first extract the statically-computed source code features for each of them. Then, we run the benchmarks for multiple independent executions (iterations) to obtain a measure for their stability, i.e., the relative confidence interval width (RCIW). Once we have collected enough executions, we train a model by using a black box machine learning algorithm. To identify whether a benchmark will be “unstable”, regardless of the concrete RCIW values, we simplify the machine learning part from a regression to a binary classification problem, i.e., “binarization”. Section 5.7 discusses binary classification in the context of benchmark stability prediction at length. We investigate the best algorithms for this prediction task in Section 5.6. The produced model can then be “used” to predict the expected stability of a new benchmark before executing it.

Figure 5.3: Training phase of the approach (source code feature extraction, benchmark stability computation, and model creation)

5.3.1 Source Code Features

To build our approach’s prediction model, we statically determine its features by (1) extracting them on a per-function basis through AST parsing and (2) combining them per benchmark by accumulating features from reachable functions with static CG information. Table 5.1 provides an overview of the 58 features.

Feature Extraction

Features are elements of the source code that can be extracted from a project’s source code (files). Our approach considers three kinds of source code elements: (1) meta information, e.g., number of LOCs or files; (2) language elements, e.g., loops, conditionals, or variables; and (3) calls to standard library application programming interfaces (APIs) which “might affect” benchmark stability. We encode a feature as the number of occurrences of these source code elements in a method, e.g., the number of if branches or calls to a random number generator.

Table 5.1: Source code features used by our approach. The features comprise meta information on file level (fileloc, pkgfiles) and function level (loc, namelen); language features for control flow (ifs, switches, switchcases, loops, nestedloops, funccalls, rets, defers, panics, recovers, cc), data (vars, ptrs, slices, maps), and concurrency (gos, channels, chsends, chrecs, chcloses, selects, selectcases, chranges); and calls to functions and methods of 31 selected standard library packages, aggregated per package. The Go code in the original table indicates the source code elements considered for the corresponding feature.

Meta Information Meta information features are language-agnostic and do not (necessarily) require complicated AST parsing but are potentially useful approximations for performance variability. Our approach considers meta information features on (1) file and (2) function granularity level. File meta information features assume that functions that are part of larger packages (pkgfiles) or contained in longer files (fileloc) invoke more and more diverse functionality. Similarly, longer functions (loc) and longer function names (namelen) might be indicative of more complex functionality. The suspected consequence is that benchmarks calling functions which follow these behaviours potentially have higher result variability and, therefore, are less stable.

Language Elements Similar to other works that build a (non-machine-learning) performance impact analysis model based on source code [Huang et al., 2014; Mostafa et al., 2017; de Oliveira et al., 2017; Alshoaibi et al., 2019], we extract source code elements that require inter-procedural AST parsing. Our approach considers language element features falling into three main categories: (1) control flow elements, (2) variables and data types, and (3) concurrency elements. Control flow elements consist of programming-language-agnostic features such as conditionals (ifs, switches, and switchcases), loops (loops and nestedloops), function lifecycle (funccalls, rets, and defers), exception handling (panics and recovers), and cyclomatic complexity (cc). More of these increase the complexity of a function, which in turn might have a negative impact on benchmark variability. Data features could have an impact on a function’s stack size, because Go stacks are initially 2 kB and can dynamically grow [Go Authors, 2020a]. Benchmarks might encounter more result variability due to dynamically growing stacks. The other aspect of the potential impact of data features on benchmark stability is frequent garbage collector (GC) activity. More variables (vars), pointers (ptrs), and built-in dynamic data structures (slices and maps) could increase pressure on the GC, caused by more allocated (and then reclaimed) heap memory objects [Hudson, 2018]. Finally, Go has built-in support for concurrency in the form of lightweight user-level threads (goroutines), channels as a communication primitive among goroutines, and channel communication through sending and receiving messages as well as non-deterministic selection among multiple channel operations [Go Authors, 2020b]. These concurrency features are often used in real-life Go programs [Dilley and Lange, 2019] and are a potential cause for benchmark instability due to the inherent non-determinism of thread schedulers.
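To make the feature extraction concrete, the following is a minimal sketch of how a few of the counts in Table 5.1 (ifs, loops, funccalls, gos) could be collected per function with Go’s standard go/ast and go/parser packages; it is not the tooling used in this dissertation, which additionally covers all 58 features.

package features

import (
    "go/ast"
    "go/parser"
    "go/token"
)

// countsForFile parses a Go source file and counts selected language element
// features for every top-level function with a body.
func countsForFile(path string) (map[string]map[string]int, error) {
    fset := token.NewFileSet()
    file, err := parser.ParseFile(fset, path, nil, 0)
    if err != nil {
        return nil, err
    }
    counts := make(map[string]map[string]int)
    for _, decl := range file.Decls {
        fn, ok := decl.(*ast.FuncDecl)
        if !ok || fn.Body == nil {
            continue
        }
        c := map[string]int{}
        ast.Inspect(fn.Body, func(n ast.Node) bool {
            switch n.(type) {
            case *ast.IfStmt:
                c["ifs"]++
            case *ast.ForStmt, *ast.RangeStmt:
                c["loops"]++
            case *ast.CallExpr:
                c["funccalls"]++
            case *ast.GoStmt:
                c["gos"]++
            }
            return true
        })
        counts[fn.Name.Name] = c
    }
    return counts, nil
}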

Library Calls Standard library packages encapsulate behaviour that is essential for all programs. They enable file and network I/O, communication with the underlying operating system (OS), text or string processing, and concurrency primitives. As most of these functionalities rely on or are backed by non-deterministic behaviour, e.g., waiting for locks to become available, blocking on I/O, or sending and receiving data over the network, calls to standard library APIs might affect performance variability more than “regular” function calls. We assume that these calls can serve as a proxy for benchmark stability to statically identify unstable benchmarks, without executing them. Hence, we encode standard library calls as individual source code features aggregated on package level, i.e., all calls to a particular package contribute to the same package-level feature. Table 5.2 depicts the Go standard library features used by our approach, and the bottom row in Table 5.1 shows their source code representation.

Feature Combination

Source-code-induced benchmark variability is not only determined by the benchmark’s body but also by the source code it invokes. Consequently, we need to combine the source code features described previously for all functions a benchmark calls. To go from intra-procedural source code features to inter-procedural ones, we employ static CGs rooted at every benchmark and aggregate the source code features of all reachable functions by summing up their values. The “for each Benchmark” block in Figure 5.3 visualizes the feature combination process. We rely on the callgraph4 tool, which is part of the official extended Go distribution. It constructs sound static CGs through inclusion-based points-to analysis using Andersen’s algorithm [Andersen, 1994], which is the most precise algorithm available.

Table 5.2: Standard library call features

bufio: buffered I/O
bytes: manipulating byte slices
crypto: cryptographic types and constants
database/sql: generic interfaces for Structured Query Language (SQL) databases
encoding: interfaces for byte to text conversions
encoding/binary: binary encodings
encoding/csv: comma-separated values (CSV) encoding and decoding (RFC 4180)
encoding/json: JavaScript Object Notation (JSON) encoding and decoding (RFC 7159)
encoding/xml: Extensible Markup Language (XML) 1.0 parser
io: I/O interfaces and primitives
io/ioutil: I/O utility functions
math: basic mathematical functions
math/rand: pseudo-random number generators
mime: (partial) Multipurpose Internet Mail Extensions (MIME) implementation
net: network I/O including Transmission Control Protocol (TCP), User Datagram Protocol (UDP), domain name resolution, and Unix sockets
net/http: Hypertext Transfer Protocol (HTTP) client and server implementation
net/http/httptest: HTTP testing utility functions
net/http/httptrace1: tracing of HTTP requests
net/http/httputil: HTTP utility functions
net/rpc: remote procedure call (RPC) client and server implementation
net/rpc/jsonrpc: JSON codec for net/rpc (JSON-RPC 1.0)
net/smtp: Simple Mail Transfer Protocol (SMTP) implementation (RFC 5321)
net/textproto: text-based request/response protocol
os: platform-independent interface to OS functionality
os/exec: running external commands/processes
os/signal: signals from the OS (e.g., SIGKILL)
sort: sorting of slices and user-defined collections
strconv: string conversions to/from primitives
sync: synchronization primitives (e.g., mutexes)
sync/atomic: low-level atomic memory primitives
syscall: interface to low-level OS primitives

1 net/http/httptrace is not present in the evaluation of our approach, because no study object uses the package
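The feature combination step described above can be viewed as a reachability traversal that sums per-function counts; a minimal sketch, assuming the static CG is available as a simple caller-to-callees adjacency list (the actual implementation relies on the callgraph tool’s output, and all names here are illustrative), could look like this:

// accumulate sums per-function feature counts (e.g., as produced by the
// countsForFile sketch above) over all functions reachable from a benchmark's
// root function in the call graph cg.
func accumulate(root string, cg map[string][]string, perFunc map[string]map[string]int) map[string]int {
    total := map[string]int{}
    visited := map[string]bool{}
    stack := []string{root}
    for len(stack) > 0 {
        fn := stack[len(stack)-1]
        stack = stack[:len(stack)-1]
        if visited[fn] {
            continue
        }
        visited[fn] = true
        for feature, n := range perFunc[fn] {
            total[feature] += n
        }
        stack = append(stack, cg[fn]...) // visit callees
    }
    return total
}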

5.3.2 Benchmark Stability

Our approach uses the performance result variability of a benchmark across multiple independent executions (iterations) as its measure for stability. For this, we use the RCIW of the mean across a benchmark’s iterations. Due to performance data usually not being normally distributed [Curtsinger and Berger, 2013], we estimate the benchmark’s population confidence interval from an iteration sample with bootstrap [Davison and Hinkley, 1997; Kalibera and Jones, 2012], a Monte Carlo simulation technique drawing a fixed number of random samples with replacement.

Formally, let $M^b$ be the set of iterations of a benchmark $b$. Then $M^b_i \subseteq M^b$ is the subset containing the first $i$ iterations, so that $M^b_i = \{ m^b_{it} \mid m^b_{it} \in M^b \wedge 1 \leq it \leq i \}$. The bootstrap sample set $B^s$ is then defined as $B^s = \bigcup_{s} \overline{\mathit{sample}(M^b)}$, where $s$ is the number of bootstrap samples (usually $s$ is 1,000 or 10,000), $\mathit{sample}$ is the function drawing a random sample with replacement from the measurement iterations $M^b$, and overscores indicate the arithmetic mean. Finally, the RCIW is defined as the function $\mathit{rciw}$ in Equation 5.1.

$$\mathit{rciw}(M^b, s, cl) = \frac{\mathit{quantile}_{cl + \frac{1-cl}{2}}(B^s) - \mathit{quantile}_{\frac{1-cl}{2}}(B^s)}{\overline{M^b}} \qquad (5.1)$$

The number of iterations is a parameter of our approach and an object of investigation in Section 5.6.
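A minimal Go sketch of this computation, assuming a simple nearest-rank quantile estimator (the dissertation does not prescribe a particular estimator, and all names are illustrative), could look as follows:

package stability

import (
    "math"
    "math/rand"
    "sort"
)

// mean returns the arithmetic mean of xs.
func mean(xs []float64) float64 {
    sum := 0.0
    for _, x := range xs {
        sum += x
    }
    return sum / float64(len(xs))
}

// quantile returns the q-quantile of the sorted slice xs (nearest-rank style).
func quantile(xs []float64, q float64) float64 {
    idx := int(math.Ceil(q*float64(len(xs)))) - 1
    if idx < 0 {
        idx = 0
    }
    if idx >= len(xs) {
        idx = len(xs) - 1
    }
    return xs[idx]
}

// rciw implements Equation 5.1: it draws s bootstrap samples with replacement
// from the iteration results m, takes the mean of each sample, and divides the
// width of the central cl confidence interval of these means by the mean of m.
func rciw(m []float64, s int, cl float64) float64 {
    boot := make([]float64, s)
    for k := 0; k < s; k++ {
        sample := make([]float64, len(m))
        for j := range sample {
            sample[j] = m[rand.Intn(len(m))] // resample with replacement
        }
        boot[k] = mean(sample)
    }
    sort.Float64s(boot)
    lower := quantile(boot, (1-cl)/2)
    upper := quantile(boot, cl+(1-cl)/2)
    return (upper - lower) / mean(m)
}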

4https://pkg.go.dev/golang.org/x/tools/cmd

5.3.3 Model Creation

The final part of the training phase of our approach consists of using the collected information to create a machine learning model. We first preprocess the data to ease the prediction task. Precisely, we apply standardization and feature selection based on variance. These operations are solely based on the independent variables, i.e., the source code features. Note that these operations are fitted on the training data: the variance is computed during the training of the model and needs to be stored together with the model in order to preprocess the data during usage. We then apply binarization, i.e., we transform the regression problem into a classification problem. Any binary classification algorithm can then be used to build the final model.
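One way to realize this preprocessing is sketched below; it is illustrative only (the study’s actual scripts are part of the replication package [Laaber et al., 2020a], and the function and parameter names are ours), but it makes explicit that the per-feature statistics are computed on the training data and must be stored for the usage phase.

package model

import "math"

// standardizeAndSelect z-score standardizes every feature column of the
// training matrix x (rows = benchmarks, columns = features) and drops columns
// whose variance falls below minVar. It returns the transformed data together
// with the kept column indices and the per-column means and standard
// deviations, which are stored with the model and reapplied to new benchmarks
// in the usage phase (Section 5.3.4).
func standardizeAndSelect(x [][]float64, minVar float64) (out [][]float64, kept []int, means, stds []float64) {
    rows := len(x)
    cols := len(x[0])
    for c := 0; c < cols; c++ {
        sum := 0.0
        for r := 0; r < rows; r++ {
            sum += x[r][c]
        }
        m := sum / float64(rows)
        varSum := 0.0
        for r := 0; r < rows; r++ {
            d := x[r][c] - m
            varSum += d * d
        }
        variance := varSum / float64(rows)
        if variance < minVar {
            continue // drop (near-)constant features
        }
        kept = append(kept, c)
        means = append(means, m)
        stds = append(stds, math.Sqrt(variance))
    }
    out = make([][]float64, rows)
    for r := 0; r < rows; r++ {
        out[r] = make([]float64, len(kept))
        for i, c := range kept {
            out[r][i] = (x[r][c] - means[i]) / stds[i]
        }
    }
    return out, kept, means, stds
}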

Our approach defines the (in)stability of a benchmark based on its result variability, the RCIW. Hence, our approach’s target dependent variable is the RCIW, which is a continuous value that ranges from 0 to +∞. Similar to research on (functional) defect prediction [Turhan et al., 2009; Zimmermann et al., 2009], we transform the benchmark variability problem into a binary classification problem, where a benchmark is either stable or unstable. For example, if we consider performance test case (benchmark) selection of only stable benchmarks, the exact variability value is secondary, as a binary answer, i.e., whether to select or not, is sufficient.

We consider a benchmark to be stable if the relative confidence interval width (RCIW) of its results is below a certain threshold t and unstable otherwise. A threshold t is provided as a percentage, indicating that benchmarks with an RCIW below t% are stable. Our approach applies binarization to the RCIW value of a benchmark’s results, converting the values to either 0 or 1 for stable or unstable, respectively. For example, if the desired stability threshold t is 3%, binarization assigns the class 0, i.e., stable, if the RCIW value is in [0, 0.03) and 1, i.e., unstable, otherwise.

The threshold value is a parameter for our approach and an object of investigation in Section 5.6. 5.4 Study Design 195

Figure 5.4: Usage phase of the approach (feature extraction, pre-processing, and classification of a benchmark as stable or unstable)

5.3.4 Model Usage

Once the model is created, it can be used to predict a benchmark’s (in)stability. Specifically, the process requires the following steps, also depicted in Figure 5.4:

(1) the new benchmarks are processed using the same source code feature extraction as in the training phase;

(2) we apply the pre-processing operations, i.e., standardization and feature selection, by using the variance value found and stored during the training phase; and

(3) we use the pre-processed data as input to the binary classification model to predict the benchmark stability.

As explained above, the model gives an answer for the definition of stability with which it has been trained and, therefore, depends on the threshold value used.

5.4 Study Design

We perform a laboratory experiment [Stol and Fitzgerald, 2018] to study the relation between our approach’s source code features and benchmark stability and to assess the prediction performance of the approach’s machine learning model. The experiment utilizes 230 OSS projects written in Go with 4,461 unique benchmarks. A replication package containing all data and scripts is available online [Laaber et al., 2020a].

5.4.1 Research Questions

First, we want to investigate whether the 58 source code features of our approach are individually associated with the benchmark variability RCIW. This analysis provides initial insight into whether a machine learning model utilizing the static features could be feasible. For this, we formulate RQ 1:

RQ 1 Are source code features associated with benchmark stability?

Second, we want to assess the prediction performance of a binary classification model to predict whether a benchmark will be stable or unstable. This model is built using the static source code features outlined in Section 5.3.1. For this, we formulate RQ 2:

RQ 2 Can we predict benchmark stability with statically-computed source code features?

The definition of benchmark stability relies on the threshold t, which divides the continuous value of the benchmark variability RCIW into two binary classes, i.e., “stable” and “unstable”. We want to study the sensitivity of t with respect to the prediction performance of our approach’s model. For this, we formulate RQ 2.1:

RQ 2.1 How does the definition of stability affect the prediction performance?

The benchmark variability RCIW is directly affected by the number of repeated measurements, i.e., iterations i, a benchmark is executed for; more iterations result in narrower confidence intervals (see Section 5.2). We want to investigate the impact the number of iterations i has on the prediction performance of our approach’s model. For this, we formulate RQ 2.2:

RQ 2.2 How does the number of benchmark iterations affect the prediction performance?

5.4.2 Study Objects

We focus our study on benchmarks written in Go. Go statically compiles to machine code, which helps to increase benchmark stability compared to dynamically compiled languages such as Java [Laaber and Leitner, 2018a]. By studying a statically compiled language, our experiment design removes dynamic compilation as a non-deterministic factor and, therefore, increases the reliability of the benchmark result variabilities, which increases internal validity.

We follow an approach outlined by Stefan et al. [2017] to mine GitHub repositories through GitHub’s search API. The search considers all projects that use “Go” as one of their languages and contain at least one benchmark (see Section 5.2). This results in 10,707 projects. As executing performance tests is a costly endeavor [Huang et al., 2014], it would be infeasible to execute all projects. Our sampling strategy’s goal is to reduce the overall experiment execution time and to filter out “toy” projects. Therefore, we apply the following inclusion criteria to each project: (1) > 50 commits, (2) > 1 authors, (3) not a fork of another repository, (4) > 1 stars, (5) > 1 watchers, (6) > 5 benchmarks, (7) > 1,000 LOCs, and (8) a benchmark suite execution time of at most 2 hours. From the 10,707 projects, 483 adhere to our selection criteria. We (try to) execute all of them for 2 hours, but due to compilation or runtime errors many projects fail, so that we end up with a final data set containing 4,461 individual benchmarks belonging to 230 projects. To the best of our knowledge, this is the most extensive data set of Go OSS projects with benchmarks and their execution results. We provide the full list of projects, including their commit hashes, as part of our replication package [Laaber et al., 2020a].

5.4.3 Study Setup

In this section, we describe the execution setup of the benchmark suites to retrieve their results for the stability computations and define the parameterization of the approach. The statistical tests used for answering the study’s research questions are described in the corresponding sections, i.e., Section 5.5.1 for RQ 1 and Section 5.6.1 for RQ 2.

5.4.4 Execution Setup

In order to reduce confounding factors influencing the performance measurements [Mytkowicz et al., 2009; Curtsinger and Berger, 2013; de Oliveira et al., 2013b] and, consequently, our approach’s definition of benchmark stability, we execute all benchmarks in a controlled bare-metal environment. The setup consists of four machines with an Intel(R) Xeon(R) central processing unit (CPU) E5-2620 v4 @ 2.10 GHz with 8 cores, a 20 MB cache, and 64 GB random-access memory (RAM). We disabled hyper-threading, frequency scaling, and Intel’s TurboBoost. As OS the machines use Fedora Linux 24, and measurements were taken in the period between November and December 2017. The benchmarks are compiled and run with Go 1.9.2. Every benchmark is executed for a 1-second duration as often as possible and the average runtime is reported, which corresponds to one iteration. The benchmark suite of each project is repeatedly executed for 2 hours, resulting in multiple iterations i per benchmark.

5.4.5 Approach Parameterization

Our approach provides a few knobs for parameterization, which can have an impact on our study and, consequently, on the conclusions drawn from it. To compute the benchmark variability RCIW, we take a fixed number of iterations from the executions of a single benchmark and apply a bootstrap technique to estimate each benchmark’s confidence interval of the mean (see Section 5.3.2). In our study, we use a confidence level of 99% (cl = 0.99) and draw 10,000 bootstrap samples (s = 10,000) with replacement [Hesterberg, 2015].

To study the sensitivity of the number of iterations i on the association in RQ 1 and the prediction performance in RQ 2, we perform all analyses with multiple numbers of iterations. Previous research has shown that Go benchmarks experience much less result variability than Java benchmarks [Laaber and Leitner, 2018a]. This is likely due to Go being a statically compiled and linked language, as opposed to dynamically compiled languages like Java. By default, Go benchmarks are only executed with a single iteration, but in order to follow performance engineering best practice [Georges et al., 2007] and to be able to compute RCIW values, we choose i ∈ {5, 10, 20}.

To answer RQ 2, the binary classification requires transforming the benchmark variability RCIW into two classes, i.e., stable and unstable. We do so by defining the threshold t that divides the RCIW values into these classes. We choose t ∈ {1%, 3%, 5%, 10%}. These thresholds are informed by previous research: Georges et al. [2007] report that performance measurement variability is often around 3%, while Mytkowicz et al. [2009] mention that measurement bias can obfuscate a performance change as large as 10%.

Our study employs 11 binary classification algorithms and compares them across 5 prediction performance metrics, which we describe in detail in Section 5.6.1.
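Tying this parameterization to the rciw sketch from Section 5.3.2, labelling a single benchmark under the t = 5% threshold could, schematically, look like this (the function name and the 20-iteration slice are illustrative):

// classify returns true if the benchmark is considered unstable under the
// study's parameterization: 20 iterations, s = 10,000 bootstrap samples,
// cl = 0.99, and threshold t = 5%.
func classify(iterations []float64) bool {
    v := rciw(iterations[:20], 10000, 0.99)
    return v >= 0.05 // binarization: class 1 (unstable) if RCIW >= t
}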

5.5 RQ 1: Associating Source Code with Benchmark Stability

In RQ 1, we study whether the source code features in our study are associated with the result variability of benchmarks. This analysis provides an insight into whether certain source code constructs might lend themselves to our study’s ultimate goal of predicting benchmark (in)stability.

5.5.1 Methodology

To answer RQ 1, we perform a correlation analysis between all source code features and the benchmarks’ variability. We apply the non-parametric Spearman rank correlation [Hauke and Kossowski, 2011; Corder and Foreman, 2009] for each source code feature across all benchmarks in relation to the RCIW of benchmarks executed for 5, 10, and 20 measurement iterations (i). Recall from Section 5.2 that more iterations lead to smaller confidence interval widths, lower benchmark result variability, and, consequently, more stable benchmarks. Spearman rank yields a correlation coefficient ρ and a p-value. The coefficient ρ can be in the range −1 to 1, where negative values indicate a negative correlation, 0 no correlation, and positive values a positive correlation. ρ values are divided into nominal categories [Akoglu, 2018]: |ρ| < 0.3 corresponds to a weak, 0.3 ≤ |ρ| < 0.6 to a moderate, and |ρ| ≥ 0.6 to a strong correlation. The p-value indicates whether there is a statistically significant correlation at a given significance level; we use a significance level of 0.01.
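For reference, Spearman’s ρ can be computed as Pearson’s correlation on the rank-transformed samples; a small, self-contained Go sketch (using average ranks for ties and omitting the p-value computation) could look as follows:

package correlation

import (
    "math"
    "sort"
)

// spearman computes Spearman's rank correlation coefficient rho between two
// equally long samples by rank-transforming both and computing Pearson's
// correlation on the ranks.
func spearman(x, y []float64) float64 {
    return pearson(ranks(x), ranks(y))
}

// ranks assigns 1-based ranks, averaging the ranks of tied values.
func ranks(xs []float64) []float64 {
    n := len(xs)
    idx := make([]int, n)
    for i := range idx {
        idx[i] = i
    }
    sort.Slice(idx, func(a, b int) bool { return xs[idx[a]] < xs[idx[b]] })
    r := make([]float64, n)
    for i := 0; i < n; {
        j := i
        for j+1 < n && xs[idx[j+1]] == xs[idx[i]] {
            j++
        }
        avg := float64(i+j)/2 + 1
        for k := i; k <= j; k++ {
            r[idx[k]] = avg
        }
        i = j + 1
    }
    return r
}

// pearson computes Pearson's correlation coefficient of two samples.
func pearson(x, y []float64) float64 {
    n := float64(len(x))
    var sx, sy, sxx, syy, sxy float64
    for i := range x {
        sx += x[i]
        sy += y[i]
        sxx += x[i] * x[i]
        syy += y[i] * y[i]
        sxy += x[i] * y[i]
    }
    return (n*sxy - sx*sy) / math.Sqrt((n*sxx-sx*sx)*(n*syy-sy*sy))
}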

5.5.2 Results

Figure 5.5 depicts the Spearman’s rank correlation coefficient ρ between in- dividual source code features (in rows) and benchmark variabilities RCIW. An asterisk after the ρ value indicates whether the null hypothesis can be rejected at significance level 0.01, i.e., there is no correlation. The left table shows meta information and language features, whereas the right table shows standard library features. Every column corresponds to the correlation of the source code features to the benchmark variability with a different number of iterations (i), i.e., 5, 10, and 20 iterations. Overall, we observe that no source code feature is strongly associated with benchmark variability, irrespective of the number of iterations executed. The majority of the features have a small to moderate positive correlation, except crypto’s correlation is only weakly negative at −0.07 and only for 20 iterations. Further, we notice that with the increasing number of iterations, the correlation becomes weaker. This comes at not surprise as benchmarks become less variable the longer they are executed. All but two meta information and programming language features show some sort of association, i.e., their p-value is below 0.01. Though, namelen and chranges are hardly correlated with 0.07 and 0.05. namelen, switches, switchcases, panics, recovers, and chranges show the lowest associations of 0.15 or lower. The remaining features have correlations between 0.2 and 0.3, which is still considered weak, but might help in our study’s goal of predicting benchmark 5.5 RQ 1: Associating Source Code with Benchmark Stability 201

[Figure 5.5 appears here: two heatmap-style tables of per-feature Spearman ρ values for i = 5, 10, and 20 iterations, with meta information and programming language features on the left and standard library features on the right; asterisks mark correlations with p < 0.01.]

Figure 5.5: Spearman’s rank correlations between source code features and benchmark variability RCIW. Asterisks (*) indicate a p-value < 0.01.

These results support the conclusion that source code elements related to meta information, i.e., number of files and LOC, and general programming language features have at least a weak correlation with benchmark variability. Especially encouraging is that Go’s concurrency language features, i.e., gos and channels, are among the highest correlated features, indicating that “more concurrency” might lead to more variability.

Regarding standard library call features, our results paint a more diverse picture. Far more features (9/30), compared to meta information and programming language features, show no significant correlation, i.e., their p-value is not below 0.01. These features are hardly used by our study objects and, hence, no significant association can be shown in our study. In particular, sub-packages of encoding and net (except net/http) are hardly used and have, consequently, little connection with performance variability.

Moreover, the correlation coefficients ρ are more dispersed among the standard library packages. Features that deal with string conversion (encoding or strconv) and sorting (sort) are more weakly correlated with benchmark variability. Similarly, calling into the OS (syscall) and running external commands (os/exec) are less associated. However, features related to I/O are more strongly correlated with benchmark instability: (1) general I/O functionality (bufio, io, and io/ioutil), (2) network I/O (net and net/http), and (3) file system I/O (os). Most associated with benchmark instability are concurrency primitives, such as locks (sync), with the largest ρ of 0.33, 0.33, and 0.28 for 5, 10, and 20 iterations, respectively; and, to a lesser extent, atomic data types (sync/atomic). Finally, benchmarks that measure software components relying on randomized algorithms (math/rand) are also more correlated with being unstable. The correlation results for standard library call features support our assumption that function calls to (some) packages might affect performance variability more than others and, therefore, could lend themselves to being a reasonable proxy for benchmark stability prediction.

RQ 1 Summary: Individual source code features are at best moderately (ρ ≥ 0.3) associated with benchmark result variability. The majority of our meta information and programming language features show similar correlations, between 0.2 and 0.3, indicating that more code correlates with higher benchmark instability. Benchmarks that (indirectly) rely on I/O, concurrency primitives, and pseudo-random numbers are more associated with yielding variable performance measurements.

5.6 RQ 2: Classifying Benchmarks as Unstable

By answering the previous research question on the correlation in Section 5.5, we found that the studied features are at best moderately associated with benchmark instability. As a next step, and in order to answer RQ 2, we investigate whether the same static features can be used in combination to build effective predictors.

5.6.1 Methodology

First, we transform the RCIW values of all 4,461 benchmarks into the binary classes stable (0) and unstable (1) (see Section 5.3.3), where the threshold values are t ∈ {1%, 3%, 5%, 10%}. Binarization is applied for all benchmarks after 5, 10, and 20 iterations (i) to study the impact of the number of iterations on the instability prediction performance. Recall from Section 5.2 that more iterations result in lower benchmark result variability and, hence, in more stable benchmarks. Figure 5.6 shows the bar plots of the resulting distributions after the binary transformation. The figure shows that the data is imbalanced for all iteration-threshold pairs. This phenomenon can also be observed in the defect prediction literature [Tantithamthavorn et al., 2020], where there are far fewer defective methods, classes, or modules than ones without defects. Note that the total number of benchmarks varies depending on the number of iterations considered: 4,205 with 5, 4,122 with 10, and 3,620 with 20. This is due to our study execution design (see Section 5.4.4), which restricts a full benchmark suite execution to a maximum of 2 hours; hence, our data set does not contain the same number of benchmark iterations for every benchmark, depending on the project (or suite) it belongs to.

The data underlying Figure 5.6, i.e., the number of stable and unstable benchmarks per iteration-threshold pair, is as follows:

Threshold t           1%      3%      5%     10%
i = 5    stable    3,014   3,769   3,921   4,061
         unstable  1,191     436     284     144
i = 10   stable    3,082   3,735   3,877   4,002
         unstable  1,040     387     245     120
i = 20   stable    2,912   3,345   3,449   3,528
         unstable    708     275     171      92

Figure 5.6: Distributions of the data used to train the classifiers, after the binarization.

Due to our data set being relatively small for machine learning purposes (3,620 to 4,205 instances, depending on i) and highly imbalanced, we do not use a holdout method for the evaluation of the classification algorithms, i.e., splitting the data into separate training and test sets. Instead, we employ a k-fold cross validation approach. The data is randomly split into k folds; the model is trained on k − 1 folds and validated on the remaining kth fold. This process is repeated k times, collecting a total of k evaluations. In addition, we use the repeated version of k-fold cross validation: we shuffle the whole dataset before applying the cross validation and repeat this a total of m times. We set k = 10 and m = 30, collecting a total of 10 × 30 = 300 evaluations for each combination of model, iterations, and threshold values. This way, we can take advantage of a high number of evaluations and then apply statistical tests to investigate the significance and effect size of our results. We also apply two pre-processing operations before each training phase: (1) standardization of the features by removing the mean and scaling to unit variance, and (2) feature selection that removes all low-variance features. We use the same statistics found during training to pre-process the test data in the same way. In the following, we list the classification algorithms, evaluation metrics, and statistical tests used to answer the research questions.
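To make this setup concrete, the sketch below ties together the binarization, the pre-processing pipeline, and the repeated 10-fold cross validation with the five metrics; it is an approximation of the described procedure under stated assumptions (plain rather than stratified folds, a hypothetical load_dataset helper, thresholds expressed as fractions), not the code from our replication package:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import RepeatedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def evaluate(X: pd.DataFrame, rciw: pd.Series, clf, threshold: float = 0.10,
             k: int = 10, m: int = 30, seed: int = 42) -> dict:
    """Median prediction performance of `clf` for one iteration-threshold pair."""
    # Binarize: RCIW above the threshold t means the benchmark is unstable (1);
    # the threshold is assumed to be a fraction, i.e., 0.10 corresponds to 10%.
    y = (rciw > threshold).astype(int)
    # Pre-processing is part of the pipeline, so the scaler and the low-variance
    # filter are fit on the training folds only and reused on each test fold.
    pipeline = Pipeline([
        ("scale", StandardScaler()),
        ("select", VarianceThreshold()),
        ("clf", clf),
    ])
    scoring = {
        "precision": "precision", "recall": "recall", "f1": "f1",
        "auc": "roc_auc", "mcc": make_scorer(matthews_corrcoef),
    }
    cv = RepeatedKFold(n_splits=k, n_repeats=m, random_state=seed)
    scores = cross_validate(pipeline, X, y, cv=cv, scoring=scoring)
    # k * m = 300 evaluations per metric; the study reports their medians.
    return {name: float(pd.Series(scores[f"test_{name}"]).median()) for name in scoring}

# Hypothetical usage; X holds one row per benchmark with the static features,
# rciw the corresponding result variability for a fixed i (load_dataset is a
# placeholder, not part of the study's tooling).
# X, rciw = load_dataset(i=20)
# print(evaluate(X, rciw, RandomForestClassifier()))
```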

Classification Algorithms

We compare the prediction performance of 11 classification algorithms from the PYTHON library SCIKIT-LEARN [Pedregosa et al., 2011], briefly described in the following:

Naive Bayes (NB) is an algorithm based on Bayes’ theorem with the “naive” assumption of conditional independence between every pair of features [John and Langley, 1995]. We select the “Gaussian Naive Bayes” implementation from SCIKIT-LEARN, where the likelihood of the features is assumed to be Gaussian.

k-Nearest Neighbors (KNN) is an instance-based learning algorithm, where the classification is computed with the majority vote method [Goldberger et al., 2004]. The parameter k indicates the number of neighbors.

Logistic Regression (LR) is a linear model used for classification, where the probabilities that describe the possible outcomes are modeled using a logistic function [Hosmer Jr. et al., 2013].

Neural Networks (NN) are a powerful method for supervised learning, especially effective in estimating functions that can depend on a large number of inputs [Ruck et al., 1990]. For our experiments, we select the “Multi-layer Perceptron” implementation from SCIKIT-LEARN, a fully connected neural network with a linear activation function in all neurons.

Decision Tree (DT) is a non-parametric classification algorithm whose goal is to predict the class by learning simple decision rules inferred from the input data [Quinlan, 1986].

Linear Discriminant Analysis (LDA) is a classification algorithm based on supervised dimensionality reduction, projecting the input data to a linear subspace [Friedman, 1991].

Support Vector Machines (SVMs) are discriminative classification algorithms, formally defined by a separating hyperplane [Cortes and Vapnik, 1995]. Given labeled training data, the algorithm outputs an optimal hyperplane categorizing new examples. Different “kernel” functions can be specified for the decision functions. We report the performance results of two different kernels for SVMs: the linear kernel (LSVM) and the radial kernel (RSVM).

Random Forest (RF) is a bagging/ensemble method, whose final results depend on the decisions of multiple classifiers, i.e., multiple decision trees in the case of Random Forest [Breiman, 2001].

Boosting is another ensemble method, where multiple iterations of the same algorithm, e.g., decision tree, are performed. At every new step, it trains the model with a modified version of the input data. We use two boosting algorithms, both based on decision trees, namely Adaptive Boosting (AB) [Freund and Schapire, 1997] and Gradient Boosting (GB) [Friedman, 2001]. Adaptive Boosting assigns weights to the votes of every trained model, adapting them at every step so that weaker models have a lower impact on the final decision than stronger ones. Gradient Boosting, instead, trains the models in a gradual, additive, and sequential manner.

We do not apply any tuning to optimize the hyper-parameters of the algorithms but use the standard configuration provided by SCIKIT-LEARN version 0.23.0. A study on the effects of tuning on benchmark instability prediction is an important subject for future work. From now on, we also use the classifiers’ names to refer to the models we trained.
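For reference, the following sketch instantiates one plausible mapping of the eleven algorithm families onto SCIKIT-LEARN estimators with default parameters; which exact estimator classes the study used (e.g., SVC with a linear kernel vs. LinearSVC), the iteration cap of 100,000 mentioned in Section 5.6.2, and the use of probability estimates are assumptions here, not confirmed configuration:

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Defaults throughout; the 100,000 iteration cap for the iterative learners is
# taken from the discussion in Section 5.6.2, and probability=True is only
# needed so that an AUC scorer can obtain scores from the SVMs.
CLASSIFIERS = {
    "NB": GaussianNB(),
    "KNN": KNeighborsClassifier(),
    "LR": LogisticRegression(max_iter=100_000),
    "NN": MLPClassifier(max_iter=100_000),
    "LDA": LinearDiscriminantAnalysis(),
    "DT": DecisionTreeClassifier(),
    "LSVM": SVC(kernel="linear", probability=True, max_iter=100_000),
    "RSVM": SVC(kernel="rbf", probability=True, max_iter=100_000),
    "RF": RandomForestClassifier(),
    "AB": AdaBoostClassifier(),
    "GB": GradientBoostingClassifier(),
}

# Each entry can be evaluated with the cross-validation sketch above, e.g.:
# results = {name: evaluate(X, rciw, clf) for name, clf in CLASSIFIERS.items()}
```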

Evaluation Metrics

We evaluate the different models along the following prediction performance metrics:

Precision describes the ability of a classifier not to label a sample that is negative as positive [Buckland and Gey, 1994]. It is defined as the ratio TP / (TP + FP), where TP is the number of true positives and FP the number of false positives. Its values lie between 0.0 and 1.0, ranging from worst to best, respectively.

Recall represents the ability of the classifier to find all positive samples [Buckland and Gey, 1994]. It is defined as TP / (TP + FN), where FN is the number of false negatives. Its values range from 0.0, i.e., the worst value, to 1.0, i.e., the best one.

F-measure, also known as “F1 score” or “F-score”, is a weighted harmonic mean of precision and recall, defined as (2 · precision · recall) / (precision + recall) [Chinchor, 1992]. Its values range from 0.0, i.e., the worst, to 1.0, the best. Intuitively, it is a metric that provides a better intuition about how well the model performs overall, as it is based on both precision and recall.

Area Under the Curve (AUC), precisely the area under the receiver operating characteristic curve (AUROC), measures the area under the ROC curve [Hanley and McNeil, 1982]. The ROC curve plots the true positive rate against the false positive rate, showing the performance of a classifier at all classification thresholds. The AUC quantifies, between 0.0, i.e., the worst, and 1.0, i.e., the best, the area that the ROC curve draws on a plot. This metric represents the probability that the model ranks a randomly chosen positive instance higher than a randomly chosen negative one. It is a form of accuracy measure, but unlike classic accuracy, it remains meaningful in the case of imbalanced data between positive and negative classes, as is the case for our dataset (see Figure 5.6).

Matthews Correlation Coefficient (MCC) is a measure of classification quality, which takes into consideration true and false positives and negatives [Matthews, 1975]. MCC is regarded as a meaningful measure even if the classes are imbalanced. Its values lie between −1.0, i.e., a completely wrong prediction, and +1.0, i.e., a perfect one; 0.0 represents an average random prediction.

In the remainder of the paper, we predominantly discuss the models’ prediction performance along the AUC and MCC metrics. AUC [Bradley, 1997] and MCC [Chicco and Jurman, 2020] have been shown to be reliable metrics, especially in the case of binary classification. We provide insights about the other metrics only in a few cases where more details are of interest. All the details about the above-mentioned metrics are available for reference in our replication package [Laaber et al., 2020a].
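All five metrics are readily available in SCIKIT-LEARN; the toy example below merely illustrates that AUC is computed from predicted scores, whereas the other metrics use hard labels (the values are made up for illustration only):

```python
from sklearn.metrics import (f1_score, matthews_corrcoef, precision_score,
                             recall_score, roc_auc_score)

# Toy data: 1 = unstable (positive class), 0 = stable; y_prob is the predicted
# probability of "unstable".
y_true = [0, 0, 0, 0, 1, 1, 0, 1, 0, 0]
y_prob = [0.1, 0.3, 0.2, 0.4, 0.8, 0.6, 0.7, 0.9, 0.2, 0.1]
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("f1:       ", f1_score(y_true, y_pred))
print("auc:      ", roc_auc_score(y_true, y_prob))  # threshold-independent, uses scores
print("mcc:      ", matthews_corrcoef(y_true, y_pred))
```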

Comparing Multiple Observations

To ensure that our reported results are valid from a statistical perspective, we employ statistical hypothesis testing and effect size measures. All reported statistical test results are significant at a level (α) of 0.01. We use the term “group of observations” to indicate the prediction performance scores for all folds of each of the models under study. To identify the type of statistical tests that are suitable for the distributions of our data, we perform a normality test. For all collected groups of observations, we apply the “D’Agostino’s K2” test [D’Agostino et al., 1990], whose null hypothesis states that a group of observations is normally distributed. For the majority of the observations, we can reject the null hypothesis (p-value < 0.01), which does not allow us to use parametric tests for further investigations. For this reason, we only use non-parametric tests to keep our discussion uniform. We report the median values of the metrics under analysis, since the non-parametric tests generally refer to the median and to ease understandability of the results. Therefore, when we compare combinations of algorithm, iterations, and threshold, we use the median value as an indicator of the performance over multiple observations, i.e., results from repeated k-fold cross validation. The term “range” thus refers to the minimum and maximum values between median values. We refer the interested reader to our external appendix, which contains the raw dataset and enables straightforward computation of other statistics [Laaber et al., 2020a].

To compare the observations, we apply the “Kruskal-Wallis H” test [Kruskal and Wallis, 1952], i.e., the non-parametric version of the ANOVA test. The null hypothesis states that the medians of all tested groups are equal. The test is applied to multiple groups simultaneously but cannot identify exactly where and how much the groups are statistically different. When we are able to reject the null hypothesis, i.e., the median among all the groups is statistically significantly different, we apply a post-hoc pairwise test to identify the pairs of groups of observations that are different. For this, we use the “Dunn’s” test [Dunn, 1964], where the null hypothesis states that there is no difference. To measure how much two groups differ from each other, we compute the “Vargha-Delaney Â12” effect size [Vargha and Delaney, 2000] to characterize the magnitude of such a difference. Â12 = 0.5 if two groups (observations) are statistically indistinguishable. Â12 > 0.5 means that, on average over all observations, the first group obtains larger values than the one it is compared to, and Â12 < 0.5 otherwise. The magnitude values can be summarized into 4 nominal categories: “negligible” (|Â12| < 0.147), “small” (0.147 ≤ |Â12| < 0.33), “medium” (0.33 ≤ |Â12| < 0.474), and “large” (|Â12| ≥ 0.474).
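A sketch of this statistical workflow is shown below; it relies on SCIPY for the Kruskal-Wallis H test, on the third-party scikit-posthocs package for the Dunn’s test, and on a hand-rolled Vargha-Delaney A measure (the groups are simulated placeholders, not our actual cross-validation scores):

```python
import numpy as np
from scipy.stats import kruskal
import scikit_posthocs as sp  # third-party package providing the Dunn's test

def a12(x, y):
    """Vargha-Delaney A measure: probability that a value drawn from x is
    larger than one drawn from y (0.5 means statistically indistinguishable)."""
    x, y = np.asarray(x), np.asarray(y)
    greater = (x[:, None] > y[None, :]).sum()
    ties = (x[:, None] == y[None, :]).sum()
    return (greater + 0.5 * ties) / (x.size * y.size)

# Simulated example: MCC scores of one classifier for thresholds 1%, 3%, and 10%,
# each a group of 300 cross-validation observations (placeholders, not real data).
rng = np.random.default_rng(0)
mcc_t1, mcc_t3, mcc_t10 = (rng.normal(m, 0.05, 300) for m in (0.43, 0.47, 0.54))

h, p = kruskal(mcc_t1, mcc_t3, mcc_t10)  # omnibus test over all groups
if p < 0.01:
    # Post-hoc pairwise comparison to locate the differing groups.
    print(sp.posthoc_dunn([mcc_t1, mcc_t3, mcc_t10]))
print("A12(t=10%, t=1%):", a12(mcc_t10, mcc_t1))
```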

5.6.2 Results

We analyze the results along three dimensions: (1) the 11 classification models; (2) the threshold t used for binarization, i.e., which benchmark variability (RCIW) values are considered stable or unstable; and (3) the number of iterations i used for calculating the benchmark result variability. RQ 2 aims at assessing the effectiveness of machine learning for predicting, i.e., classifying, unstable benchmarks. It gives an overview of the best performing models. RQ 2.1 and RQ 2.2 further investigate the impact that different thresholds t and numbers of iterations i have on prediction performance.

Overall Comparison

Figure 5.7 shows the line plots of the median prediction performance (on the y-axis), over 300 observations, for all evaluation metrics under study. The row facets show the classification models, whereas the column facets represent the number of iterations i, i.e., 5, 10, and 20. The x-axis depicts the threshold values t, i.e., 1%, 3%, 5%, 10%.


Figure 5.7: Metrics comparison over the thresholds t, iterations i, and algorithms

Metrics Interpretation We provide an interpretation of the metrics in the context of benchmark instability prediction. Recall that our dataset is highly imbalanced (see Figure 5.6). We set “unstable” as the positive value (1), i.e., the one we are mostly interested in identifying, and “stable” as the negative value (0). If precision is high, the classifier is able to recognize unstable benchmarks correctly. A simple classifier that maximizes precision would always answer with the negative value, i.e., “stable”. Recall, in contrast, describes the capability of recognizing all benchmarks that are unstable. To maximize recall, a simple classifier would always answer with “unstable” for all benchmarks. In case of high precision and low recall, we can be confident that the classifier (mostly) classifies unstable benchmarks as such, i.e., only a few benchmarks classified as unstable are indeed stable (low false positive (FP)-rate). However, the classifier might have missed many unstable benchmarks because of the low recall, i.e., many benchmarks that are indeed unstable are classified as stable (high false negative (FN)-rate). Conversely, a classifier with low precision and high recall is able to identify most benchmarks that are unstable as such (low FN-rate), but it also classifies many benchmarks as “unstable” that are indeed “stable” (high FP-rate). Depending on the application scenario of an unstable benchmark classifier, recall, precision, or both should be maximized. We present more discussion on this trade-off and application scenarios in Section 5.7. AUC [Bradley, 1997] and MCC [Chicco and Jurman, 2020] consider both precision and recall, being more indicative of the general performance of the classifiers.

Generally High-Performance Algorithms Referring to MCC as an indicator of the overall effectiveness of the classifiers, Random Forest outperforms the other models for all combinations of iterations and thresholds. Across all combinations, Random Forest’s MCC median values range from 0.43 to 0.68. The same applies for AUC, where Random Forest’s prediction performance ranges from 0.79 to 0.90. This is in line with other models whose MCC median values are over 0.60, all with 20 iterations and 10% as threshold: (1) Gradient Boosting (max 0.66), (2) Adaptive Boosting (max 0.64), (3) Decision Tree (max 0.60), and (4) Neural Networks (max 0.60).

High Precision Algorithms The algorithms with the highest precision are: (1) Logistic Regression (from 0.58 to 1.0), (2) Linear Support Vector Machines (from 0.56 to 1.0), and (3) Radial Support Vector Machines (from 0.64 to 1.0). It is likely that these algorithms tend to train the classifier to answer mostly with “stable”, i.e., the negative value for the metric. However, all of these algorithms suffer from low recall, specifically (1) Logistic Regression (from 0.05 to 0.21), (2) Linear Support Vector Machines (from 0.02 to 0.21), and (3) Radial Support Vector Machines (from 0.06 to 0.22), arguably rendering them inferior to algorithms with balanced precision and recall. Further, we observe a growing trend in precision with increasing threshold and iterations (see Figure 5.7); we discuss the impact of these values in more detail with RQ 2.1 and RQ 2.2 in Section 5.6.2, respectively. This might be a phenomenon related to the reduced number of unstable benchmarks with an increasing number of iterations and threshold values (see Figure 5.6). All the above-mentioned algorithms are based on subsequent iterations for training, which we set to 100,000 in our experiments. It is likely that these algorithms do not have enough “time” to learn how to recognize such a low number of unstable benchmarks. In contrast, the Neural Networks algorithm, which is likewise based on iterative training, does not fail in the same way. Neural Networks also obtain higher MCC scores. It is likely that their complex structure of layers is more effective in capturing the features that characterize unstable benchmarks, within the given number of iterations. Thus, hyper-parameter tuning is expected to improve the general performance of the mentioned classifiers.

High Recall Algorithm The model with, by far, the highest recall is Naive Bayes. Its precision is low (≈ 0.0); therefore, Naive Bayes is not able to correctly identify “unstable” benchmarks. Having a high recall (≈ 1.0), it is likely that the algorithm trained the model to answer “unstable” in the majority of the cases. Indeed, precision and recall are not balanced, which is also reflected in a low MCC value (≈ 0.0).

Comparing to Baselines We also trained four basic classification algorithms as baselines, based on different strategies to generate predictions: (1) “stratified”, assigning predictions based on the distribution of the data, (2) “most frequent”, always predicting the most frequent label in the training set, (3) “uniform”, generating predictions uniformly at random, and (4) “prior”, always predicting the class that maximizes the class prior. We verified that in all cases the AUC and MCC values for the baselines are ≈ 0.5 and ≈ 0.0, respectively, and that they are clearly outperformed by all the other classifiers. Further details are available in our replication package [Laaber et al., 2020a].
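These baseline strategies correspond to options offered by SCIKIT-LEARN’s DummyClassifier; a sketch follows, where the exact mapping to our setup (including random seeds) is an assumption:

```python
from sklearn.dummy import DummyClassifier

# Baseline strategies as named above, mapped onto scikit-learn's DummyClassifier.
BASELINES = {
    "stratified": DummyClassifier(strategy="stratified", random_state=42),
    "most frequent": DummyClassifier(strategy="most_frequent"),
    "uniform": DummyClassifier(strategy="uniform", random_state=42),
    "prior": DummyClassifier(strategy="prior"),
}

# Each baseline can be dropped into the same evaluation as the real classifiers,
# e.g., evaluate(X, rciw, BASELINES["stratified"]) using the sketch from Section 5.6.1.
```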

RQ 2 Summary: Machine learning models can be used to predict benchmark instability based on statically-computed source code features. In particular, Random Forest trained with standard parameters has an AUC value ranging from 0.79 to 0.90 and an MCC from 0.43 to 0.68.

RQ 2.1: Comparing Thresholds

We now investigate the impact of the threshold value t on the prediction performance, for all combinations of iteration values i and classification algorithms. More specifically, we fix two of the dimensions and vary the other one, i.e., we compare the evaluation scores of the models with the same values for a specific classification algorithm and iteration value. In this way, we can analyze the impact of a single dimension on the evaluation metrics. We observe from Figure 5.7 that both AUC and MCC values follow a growing trend when the threshold value increases, for the majority of algorithms. Graphically, it is possible to identify this behavior by following the lines of AUC and MCC in each distinct facet box. To validate this, we use the Kruskal-Wallis H test and verify that there is a statistically significant difference between observations whose median value is represented in Figure 5.7. After applying the Dunn’s post-hoc pairwise test, we can also verify whether there is a significant difference between consecutive threshold values. Moreover, we can measure the magnitude of such a difference with the Vargha-Delaney Â12 test. The Â12 value also indicates whether the first group of observations has values that are greater than the second group, on average, thus showing whether there is a statistically growing trend.

We find that there is always a statistically significant difference between threshold values 1% and 10%, regardless of the model and number of iterations. Indeed, the Dunn’s test rejects the null hypothesis (p-value < 0.01), and the magnitude computed by the Â12 test ranges from “small” to “large”. We observe an upwards trend in prediction performance with an increasing threshold value for all algorithms and iterations, except Naive Bayes and Linear Support Vector Machines. Naive Bayes always shows decreasing trends, i.e., both AUC and MCC decrease for all iteration values. Linear Support Vector Machines have decreasing AUC and MCC only with 5 iterations. The increase in prediction performance with higher thresholds is likely an effect of measurement bias (see Section 5.3.2), as reported by Georges et al. [2007] and Mytkowicz et al. [2009]. Stable benchmarks (with low result variability) probably have different characteristics, in terms of the source code features they call, compared to unstable benchmarks. If the threshold value is small(er) and measurement bias pushes the benchmark’s result variability beyond the threshold, a benchmark is more likely to fall into the “wrong” class, i.e., stable vs. unstable. Consequently, its source code feature characteristics and stability class are misaligned, and prediction performance suffers. Higher thresholds tend to increasingly assign benchmarks to their “right” stability class, which positively impacts prediction performance.

Regarding Random Forest, i.e., the best overall model (see Section 5.6.2), we notice a clear growing trend for all the iteration values. In particular: (1) for 5 iterations, the AUC median values start from 0.79 (threshold at 1%) and reach 0.85 (threshold at 10%), whereas MCC goes from 0.43 to 0.54; (2) for 10 iterations, AUC ranges from 0.81 to 0.89, and MCC from 0.44 to 0.63; and (3) for 20 iterations, AUC ranges from 0.83 to 0.90, and MCC from 0.48 to 0.68.

Therefore, for Random Forest, the best prediction performance is always achieved with a threshold t of 10%, regardless of the number of iterations i.

RQ 2.1 Summary: The choice of threshold, which determines the definition of stability, does impact the prediction performance. In the vast majority of the cases, the models with the highest threshold value t, i.e., 10%, reached the highest values for AUC and MCC.

RQ 2.2: Comparing Iterations

Now, we are interested in whether the number of iterations i has an impact on the prediction performance. As before for RQ 2.1, we validate whether the observations follow a growing trend for AUC and MCC when increasing the number of benchmark iterations. For this, we fix both the models and threshold values and analyze the values of the observations for AUC and MCC, for all iteration values. It is possible to visually observe this in Figure 5.7 by looking at the points of a specific metric, e.g., AUC and MCC, and following the horizontal grid lines of one of the thresholds, e.g., 10%, from left to right. We validate the following observations by running the Kruskal-Wallis H test. The test shows no statistically significant difference (p-value ≥ 0.01) of AUC and MCC for (1) Naive Bayes when the threshold is 5%, (2) Linear Discriminant Analysis for the thresholds 3% and 5%, and (3) Linear Support Vector Machines when the threshold is 3%. Similar to RQ 2.1, where precision is high, Linear Support Vector Machines might not be able to converge within the maximum default number of iterations, i.e., 100,000. Hyper-parameter optimization could potentially improve the prediction performance of Linear Support Vector Machines.

The remaining algorithm-threshold combinations follow a statistically significant growing trend of AUC and MCC. We, again, apply the Dunn’s test to pinpoint the differing observations, followed by the Vargha-Delaney Â12 test for the effect size. We find that the following combinations do not improve AUC and MCC values when the number of iterations increases: (1) Naive Bayes for all thresholds as well as (2) Logistic Regression, (3) Neural Networks, (4) Linear Discriminant Analysis, (5) Linear Support Vector Machines, and (6) Radial Support Vector Machines for a threshold of 1%. However, we observe that 20 iterations always result in the best prediction performance, considering MCC as the main indicator.

Similar to the prediction performance sensitivity caused by the threshold value (see Section 5.6.2), the prediction of benchmark instability is affected by the number of iterations considered for calculating the benchmark result variability RCIW. Recall from Section 5.2 and Figure 5.6 that with an increasing number of iterations, a benchmark’s variability decreases, i.e., the confidence interval narrows, and the benchmark becomes more stable. This has the effect that the characterizing features of unstable benchmarks seem to be better associated with benchmark instability and, hence, prediction performance improves.

As for the previous research question, we analyze the trends of Random Forest, i.e., the best overall model. Also in this case, we notice a growing trend for all the threshold values. In detail: (1) for a threshold of 1%, the AUC median values start from 0.79 (5 iterations) and reach 0.83 (20 iterations), whereas MCC goes from 0.43 to 0.48; (2) for a threshold of 3%, AUC ranges from 0.83 to 0.86, and MCC from 0.47 to 0.55; (3) for a threshold of 5%, AUC ranges from 0.84 to 0.88, and MCC from 0.49 to 0.60; and (4) for a threshold of 10%, AUC ranges from 0.85 to 0.90, and MCC from 0.54 to 0.68. Therefore, the best prediction performance with Random Forest is always achieved with 20 iterations, regardless of the threshold value.

RQ 2.2 Summary: The number of benchmark iterations i does considerably impact the prediction performance. In the vast majority of the cases, the models with the largest number of benchmark iterations, i.e., 20, score highest in AUC and MCC.

5.7 Discussion and Future Research

In this section, we provide insights for researchers to build on and for practitioners wishing to apply instability prediction, and we discuss potential directions for future research.

5.7.1 Application Scenarios

We see three potential application scenarios where benchmark instability prediction could be of use for practitioners: (1) regression benchmarking by selection and reduction, (2) support for developers writing benchmarks, and (3) execution configuration of benchmarks.

Regression Benchmarking Test suite reduction (also minimization) and regression test selection (RTS) are common techniques in unit testing to reduce the testing effort [Yoo and Harman, 2012]. RTS selects a subset of tests that should be executed upon a new version. Our approach can identify benchmarks that yield unstable results and, hence, are potentially less useful for slowdown detection or do not accurately reflect the program’s performance. Considering the immense execution times of benchmark suites [Huang et al., 2014; Chen and Shang, 2017; Laaber and Leitner, 2018a; Laaber et al., 2020c], selecting only benchmarks whose results are of high quality is desirable to reduce overall testing time. Our approach can assist in that with only relatively lightweight, statically-computed metrics. As RTS techniques often (only) consider the changes made between two versions, our approach would probably need to incorporate change-related features, e.g., the number of changes to programming language features or API calls. Note that our approach would not be a safe benchmark selection technique, as some (unstable) benchmarks might expose slowdowns and our prediction would remove them. Such a study is out of scope of this paper and subject to future research. We can imagine such a study comparing to state-of-the-art functional RTS research [Gligoric et al., 2015; Zhang, 2018; Machalica et al., 2019] and performance test selection [de Oliveira et al., 2017; Alshoaibi et al., 2019].

Test suite reduction removes redundant tests, usually based on single version information. Our approach would be similar to reduction as it would remove unstable, potentially less meaningful benchmarks. However, the redundancy aspect of traditional reduction is not modeled in our approach. In this regard, we might want to relax the notion of reduction and not remove these benchmarks indefinitely. We argue that unstable benchmarks are similar to flaky tests, as their results can not be trusted [Luo et al., 2014], and unstable benchmarks could be quarantined instead of permanently removed, for developers to (if possible) fix their source of instability.

Developer Support We envision our benchmark instability prediction to fit well into developer support tooling. For example, as part of the integrated development environment (IDE) or as a standalone linter, our approach could raise awareness about the result accuracy of benchmarks. This might be especially useful for developers during the development phase of the benchmarks. Based on this early feedback, developers could rethink which parts of their software are benchmarked and which parts are not. Questions like “Do I really need this benchmark?” could arise, and optional benchmarks that are also unstable might not be written in the first place. Moreover, developers could mock variability-inducing programming constructs, such as file or network I/O, to improve benchmark stability. This, however, should be done with care, as the mocking possibly defeats the purpose of benchmarking and renders the benchmark results unrealistic and useless.

Configuring Benchmarks The number of repeated iterations a benchmark is executed for has a direct impact on its result variability and, consequently, on its stability. In Section 5.2, we described this repetition and its impact. However, finding the right configuration is non-trivial, and developers often get it “wrong”, as shown by Laaber et al. [2020c]. Their solution is to decide dynamically, during the runtime of a benchmark, when to stop it; still, this dynamic reconfiguration requires manually setting upper bounds on the iteration parameter. Usually, developers would rely on default parameters imposed by the benchmarking framework or follow best practices from research, e.g., by Georges et al. [2007]. Our approach would provide a first step towards solving this problem, by making developers aware of certain benchmarks potentially being unstable, before executing them. This would enable developers to take extra care of these benchmarks and to set the iteration configuration to higher values than set by default or proposed by research. If dynamic reconfiguration by Laaber et al. [2020c] is used, the upper bounds for the number of iterations can be pessimistically set to high values for unstable benchmarks.

5.7.2 Tradeoff between Precision and Recall

As with any binary classification, there is an inherent tradeoff between precision and recall, i.e., whether false positives (FPs) or false negatives (FNs) are more detrimental to what the approach is trying to ultimately achieve. In our context, it depends on the application scenario, as described in Section 5.7.1, and on the positive value for classification. We consider the positive value to be “unstable”; consequently, we can question whether it is more important to only select unstable benchmarks (low FP-rate) or not to miss any unstable benchmarks (low FN-rate). Generally, both are desired, and our results show that Random Forest performs best among the classification algorithms under study.

In regression benchmarking, where we want to exclude the benchmarks that are unstable from execution, we might prefer precision to recall. That is, we only want to exclude the benchmarks that are indeed unstable and execute the ones that are stable. In this setting, lower precision means that more stable benchmarks, which might give reliable results, are excluded. A low recall means that many unstable benchmarks are not excluded and, consequently, less time is saved. A similar argumentation is valid for benchmark suite reduction, i.e., it is more favorable to execute unstable benchmarks than to remove stable benchmarks.

Similarly, precision is probably more important for tooling support, to not overwhelm developers with false alarms to check a benchmark for potential sources of instability if there are none. This would result in developers spending a lot of unnecessary time debugging a problem where there is none. In this example, running an unstable benchmark and figuring out that it has high result variability after its execution is arguably the lesser evil, i.e., wasting computing resources over developer time. In particular, with the gained insight, a subsequent execution could exclude the unstable benchmarks or give the benchmark more iterations to become stable.

In terms of configuring benchmarks, we would argue that recall is more important than precision. In an ideal scenario, all benchmarks are configured to be stable, that is, they are executed long enough to lead to stable results. With high recall, all unstable benchmarks are selected, and a developer would allow these to run for more repetitions. The potential downside, in case of low precision, is that stable benchmarks are also over-configured and executed more often than necessary, which boils down to wasting computing resources. If one uses this approach in combination with dynamic reconfiguration [Laaber et al., 2020c], the problem of setting an overly conservative configuration is removed, as dynamic reconfiguration stops a benchmark execution when the results are stable.

Our prediction results show that the majority of classifiers perform better in terms of precision than recall (see Section 5.6.2), supporting our application scenarios regression benchmarking and developer support nicely. A definite answer as to whether these application scenarios are indeed well supported requires follow-up studies employing our approach.

5.7.3 Classification vs. Prediction

In Section 5.6.1, we transform benchmark variability into a binary classification problem, i.e., a benchmark being stable or unstable. We use four thresholds t inspired by literature to make this distinction between the two classes [Georges et al., 2007; Mytkowicz et al., 2009], and we perform a sensitivity analysis on the threshold value in Section 5.6.2. Ideally, we would predict a benchmark’s variability with, e.g., a (linear) regression model; initial experiments were unsuccessful, with exceedingly high errors. One potential reason for this failure is that the statically-computed features are not precise enough or that source code features in general do not offer enough explanatory power for regression-based prediction. Future research should explore other or more precise features to predict the exact performance result variability of a benchmark. Nonetheless, for (at least) two of our outlined application scenarios (see Section 5.7.1), i.e., regression benchmarking and developer support, we argue that the simplification to a binary classification problem is sufficient. The configuration of a benchmark would greatly benefit from a precise variability prediction, as this would enable suggesting the “right” iteration value to achieve a desired benchmark variability, e.g., smaller than 3% RCIW.

5.7.4 Features

We now discuss the tradeoffs concerning our choice of features and two potential improvements.

Static vs. Dynamic Features Our model is based only on statically-computable source code features, obtained by parsing ASTs (intra-procedural) and combining the features with static CG analyses (inter-procedural). The feature extraction trades the precision of the analyses for faster performance. However, our prediction results (see Section 5.6.2) show that our static features perform well; especially Random Forest, our best performing predictor, with a median performance ranging from 0.43 to 0.68 and from 0.79 to 0.90 for MCC and AUC, respectively. To improve feature precision, we identify a switch from static to dynamic features, in particular relying on dynamic control flow and CGs to accurately identify which programming language features are covered and which APIs are called by a benchmark. We suppose that more precise analyses lead to better prediction performance; this, however, is subject to future research. Apart from the features we use, dynamic features based on performance profilers, e.g., CPU, memory, lock, or race detector information, might help to further improve prediction performance. An approach employing dynamic features would run a single invocation of a benchmark with the feature extractor injected, gather all necessary information, and feed these features into a predictor to decide whether a rigorous benchmark execution with multiple iterations is required.

Execution Environment Proxies Source code is only one of the many factors influencing performance variability; others are, to name a few, dynamic compiler optimizations, memory layout, environment variables, virtualization, and OS-dependent factors [Georges et al., 2007; Mytkowicz et al., 2009; Curtsinger and Berger, 2013; de Oliveira et al., 2013b; Arif et al., 2018; Maricq et al., 2018; Laaber et al., 2019b]. A potential reason for the low association in Section 5.5 and the room for improvement in prediction performance in Section 5.6 is that these other factors contribute more to benchmark result variability and stability than source code. We, therefore, argue that an improved prediction model should consider additional features that approximate the other factors for benchmark variability. In particular, we envision including performance profiles of the execution environment, e.g., based on standardized (system) microbenchmarks. An approach similar to Wang et al. [2018] for predicting the performance of cloud applications based on resource profiles, Scheuner and Leitner [2018] for estimating application performance from system microbenchmarks, or Jimenez et al. [2018] for inferring the resource profile of a benchmark could offer the desired proxy features for the execution environment.

5.8 Threats to Validity

Construct Validity The central aspects of our paper are associating source code features with and predicting benchmark stability. The definition of benchmark stability is consequently subject to validity concerns. We use the performance result variability of a benchmark, as defined by the relative confidence interval width (RCIW) of the mean computed with bootstrap [Davison and Hinkley, 1997] (see Section 5.3.2), as the measure for how stable a benchmark is. This is in line with previous work in performance engineering [Bakshy and Frachtenberg, 2015; Stefan et al., 2017; Bulej et al., 2017a; He et al., 2019; Bulej et al., 2020; Laaber et al., 2020c]. To reduce Monte-Carlo noise from the results of the bootstrap technique, we follow best practice and draw 10,000 random samples with replacement [Hesterberg, 2015]. The coefficient of variation (CV), another measure of variability, is less applicable in our context, as performance data usually does not follow a Gaussian distribution. Nevertheless, a reader should acknowledge that different measures for benchmark stability could change the results of our paper. Moreover, the transformation of result variability to a binary classification problem, i.e., whether a benchmark is stable or unstable, depends on the used threshold t (see Section 5.6.1). We use t ∈ {1%, 3%, 5%, 10%}, informed by previous work [Georges et al., 2007; Mytkowicz et al., 2009], and perform a sensitivity analysis on the threshold value in Section 5.6.2. Other thresholds and, consequently, different assignments of benchmarks to the classification classes “stable” and “unstable” might result in different outcomes.

In RQ 1, we study the association of source code features to benchmark variability, which might be impacted by the way we extract the features and the correlation algorithm in use. Our approach relies on AST parsing and static CG information and not on precise, dynamic coverage information. This imprecision is likely to assign higher counts of certain features to benchmarks that do not actually execute these features. Hence, the correlation results in Section 5.5.2 are likely impacted by it. This was a conscious design decision, as we wanted a purely static model to improve feature extraction performance. Similarly, the execution environment is likely to impact the performance variability and, hence, the results from RQ 1 (also compare to Section 5.7.4).

In RQ 2, we assess the prediction performance of the 11 classifiers, relying on the 5 evaluation metrics in Section 5.6.1. We aimed for a wide variety of threshold-dependent and threshold-independent metrics to improve construct validity. In particular, we refer to AUC and MCC as main indicators of the prediction performance of the classifiers. The literature shows that AUC [Bradley, 1997] and MCC [Chicco and Jurman, 2020] are reliable metrics, especially in the case of binary classification.

Internal Validity Any performance measurement experiment, which includes benchmark executions, is subject to measurement bias [Mytkowicz et al., 2009; Curtsinger and Berger, 2013; de Oliveira et al., 2013b; Maricq et al., 2018], which could alter our benchmark result variability and, consequently, the benchmark stability. To reduce measurement bias, we follow a rigorous performance engineering methodology [Georges et al., 2007] and use non-virtualized machines with hyper-threading, frequency scaling, and Intel’s TurboBoost turned off [Stefan et al., 2017]. Different execution environments are likely to change the stability of individual benchmarks, potentially affecting the correlation and prediction results. We opted for a tightly controlled environment to control for as many confounding factors as possible.

The number of iterations i (see Section 5.3.2) affects the internal validity of our experiment, as the iterations have a direct impact on the benchmark result variability. To measure this effect, we perform a sensitivity analysis in Section 5.6.2. The exact results, however, might be different for other iteration counts. Note that we do not claim that certain source code features are the cause for benchmark instability in Section 5.5. RQ 1 is about finding associations and not proving causality. Such a causal relationship would require a different experiment, which is out of scope of this paper.

As for RQ 2, we do not specifically address the problem of data imbalance. Our main goal is to investigate machine learning prediction models to establish a benchmark’s stability before executing it. Since our dataset is relatively small, i.e., at least 3,620 instances, we employ a repeated k-fold cross validation approach, for a total of 300 prediction observations for every combination of model, iterations i, and thresholds t. With such a number of observations, we can apply statistical tests to mitigate the risk of spurious differences.

We also apply pre-processing operations to our data, namely standardization and feature selection based on variance. Pre-processing techniques might have an undisputed benefit for the prediction performance. However, a sensitivity analysis of such a high number of possible methods seems infeasible, considering the large number of classifiers used in our study. As in the case of hyper-parameter optimization, which we did not apply in this study, our view is that simple classifiers can already be competitive for this task. However, a sensitivity study on the effects of pre-processing and tuning on benchmark instability prediction is an avenue for future work.

External Validity Generalizability is mostly a concern with regard to the selected study objects. Our study only considered benchmarks written in Go; consequently, our results do not necessarily translate to benchmarks written in other programming languages. The results potentially also do not extend to other Go projects, which are not part of our study. With 230 projects having 4,461 benchmarks, we have an extensive set of benchmarks and projects to draw conclusions from. We study benchmarks on function/statement granularity, often also called microbenchmarks or performance unit tests [Stefan et al., 2017; Laaber and Leitner, 2018a]. The results presented here do not generalize to other forms of performance tests, such as load tests or system/application-level benchmarks. The benchmark result variability in our study considers execution time as its performance metric. Execution time is the standard performance metric for benchmarks on this granularity, whereas load tests, system benchmarks, or profilers often also consider memory performance, I/O, lock contention, or other performance metrics. A careful reader should not assume that our results transfer to these other performance metrics. Finally, other machine learning algorithms, which are not part of our 11 algorithms under study, might perform differently in terms of prediction as well as sensitivity to the number of iterations i or the threshold value t. We aimed at a diverse set of algorithms to increase generalizability among binary classifiers. Multiclass classification, clustering, and regression algorithms are likely to show different results to binary classifiers and are out of the scope of this work.

5.9 Related Work

Particularly related to our study are works dealing with (1) performance variability, (2) performance bugs, (3) performance testing, (4) performance impact prediction, and (5) machine learning for software engineering. We will discuss these five aspects in the following.

5.9.1 Performance Variability

The performance variability of experiment results, such as benchmarks, is a well-known challenge in performance engineering. Georges et al. [2007] outlined a rigorous methodology to measure the performance of dynamically compiled languages like Java. They report on effects of dynamic compiler optimizations influencing performance measurements and show that measurement variability is often around 3%. Even if a rigorous methodology is followed, measurement bias is common [Mytkowicz et al., 2009; Curtsinger and Berger, 2013; de Oliveira et al., 2013b]. Mytkowicz et al. [2009] report that different environment variable sizes, such as simply having a longer username, impact performance measurements. Curtsinger and Berger [2013] identify that the layout of code and memory, i.e., stack frames and heap objects, impacts the results of performance experiments by as much as 10%. To reduce measurement bias and benchmark variability, it is best practice to repeat measurements on the different levels that introduce measurement uncertainty [Kalibera and Jones, 2012, 2013] and to randomize factors influencing the measurement [Curtsinger and Berger, 2013; de Oliveira et al., 2013b]. Moreover, randomly interleaving benchmarks across multiple trials [Abedi and Brecht, 2017; Laaber et al., 2019b] or executing benchmarks that need to be compared with each other in parallel on different CPUs of the same machine [Bulej et al., 2020] are novel techniques to handle environment-induced variability. The variability can be due to co-located tenants, hardware, OS specifics, or source code [Maricq et al., 2018; Laaber et al., 2019b]. In particular, virtualized [Arif et al., 2018] and cloud [Iosup et al., 2011b; Gillam et al., 2013; Leitner and Cito, 2016] environments suffer from performance variability when used as performance execution environments.

Our work draws inspiration from these works and utilizes source code features, one cause of performance variability, to predict whether a benchmark will be unstable, to ultimately decide whether more repetitions (e.g., iterations) are required. Although our approach’s features are solely extracted from source code, the idea is that they act as proxies for performance variability rooted in non-source-code factors, such as I/O, memory access through variables, or non-deterministic pseudo-random generators and concurrency constructs. Both our sensitivity analyses from RQ 2.1 and RQ 2.2 acknowledge the fact that performance measurements are imperfect, and they report on the impact on prediction performance. The sensitivity analysis on the threshold t, i.e., what is considered a stable or unstable benchmark, bases the investigated values on the experience reported in the aforementioned papers (see RQ 2.1). Recently, He et al. [2019] and Laaber et al. [2020c] introduced techniques for system benchmarks and microbenchmarks to stop benchmarks once the result variability is unlikely to change with more repetitions. Our approach augments these techniques by identifying the benchmarks that are unstable, before execution.

5.9.2 Performance Bugs

There is an abundance of research on the characteristics of performance bugs and how to automatically detect them. In this context, a performance bug can be a slowdown, an increase in memory consumption, reduced throughput, or excessive I/O operations. Jin et al. [2012] find that performance bugs are often related to function calls, synchronization of concurrent computation, data structures, and API misuses. Selakovic and Pradel [2016] study performance bugs in JavaScript and find that they are often caused by inefficient APIs and loops, as well as unnecessarily repeated executions. Nistor et al. [2013b, 2015] identify similar memory access patterns and wasted loops as root causes. Sandoval Alcocer and Bergel [2015] and Sandoval Alcocer et al. [2016, 2020] discuss performance problems related to programming language features, such as function calls, conditionals, and (heavy) object creation. Often, synchronization and concurrency are root causes for performance problems [Alam et al., 2017; Yu and Pradel, 2017]. Other studies conclude with similar observations of where performance bugs stem from [Zhao et al., 2020; Mazuera-Rozo et al., 2020].

Our approach incorporates the root causes identified in these studies as features in our prediction model. The simple counting of feature occurrences is in line with the finding that unnecessary repetition of source code constructs is a cause of performance bugs. Similar to Liu et al. [2014], our approach also checks for potentially performance-variability-inducing API calls in the call graph of a benchmark. However, all these studies characterize and (sometimes) identify performance bugs, whereas our approach and study center around benchmark (in)stability prediction in terms of its result variability.

5.9.3 Performance Testing

Performance testing is part of measurement-based performance engineering [Woodside et al., 2007], which aims to catch performance degradations of software systems. The literature usually focusses on system-scale or method/statement-scale performance tests, often called load testing (or application/system benchmarking) and microbenchmarking (or performance unit testing), respectively. Traditionally, research on performance testing focussed mostly on load testing, such as identifying problems and reporting on case studies [Weyuker and Vokolos, 2000; Menascé, 2002; Jiang and Hassan, 2015]. More recent work focussed on industrial applicability [Nguyen et al., 2014; Foo et al., 2015; Chen et al., 2017] and on reducing the time spent in load testing activities [AlGhamdi et al., 2016, 2020; He et al., 2019]. Microbenchmarking, which is the focus of our study, has significantly gained traction in recent years. Leitner and Bezemer [2017] and Stefan et al. [2017] empirically study the state of practice in OSS projects and identify gaps that require addressing from research. In particular, the complexity of performance testing activities and the lack of tooling seem to be hurdles to overcome [Bezemer et al., 2019]. Horký et al. [2015] utilize benchmarks to increase performance awareness through documentation, and Bulej et al. [2012, 2017a] introduce a declarative form of specifying performance assumptions without the need for manually writing benchmarks. Sandoval Alcocer and Bergel [2015] and Chen and Shang [2017] explore performance changes in evolving software and find that code changes often introduce performance variation. Laaber and Leitner [2018a] and Laaber et al. [2019b] assess benchmarks for their applicability in continuous integration (CI) and study their results when executed on cloud infrastructure. Damasceno Costa et al. [2019] study bad practices in microbenchmark implementations and show that they significantly impact the benchmark results. Ding et al. [2020] explore the applicability of functional unit tests as performance tests and build a machine learning model to predict their suitability. Our approach is orthogonal to the research outlined above: it contributes to addressing one of the major challenges of performance testing, i.e., the lack of tooling.

5.9.4 Performance Impact Prediction

Previous research investigated the impact of “new situations” on software performance. These can be in regression testing upon a new commit or concern the effects of running a performance experiment under different conditions. Two regression testing techniques that are extensively studied for unit testing [Yoo and Harman, 2012], i.e., RTS and test case prioritization (TCP), have recently become subject of investigation for performance testing. In terms of selection (RTS), existing approaches either predict whether a commit potentially introduces a performance regression [Jin et al., 2012; Huang et al., 2014; Sandoval Alcocer et al., 2016, 2020] or whether a benchmark is affected by a code change [de Oliveira et al., 2017; Alshoaibi et al., 2019]. Jin et al. [2012] build a rule-based technique to detect these commits, whereas Huang et al. [2014] and Sandoval Alcocer et al. [2016, 2020] employ a performance cost model. Regarding benchmark selection, de Oliveira et al. [2017] use lightweight static and dynamic source code indicators which are combined with logical operators. Alshoaibi et al. [2019] reuse their indicators, define the selection as an optimization problem, and employ genetic algorithms to predict whether a benchmark will be affected. In terms of prioritization (TCP), Mostafa et al. [2017] rank the execution order of benchmarks according to their predicted performance change size, inferred from a performance cost model. As already mentioned in Section 5.7.1, regression testing is one context where we foresee our approach to be applied. Different from the works above, our approach (1) targets benchmark stability rather than the performance impact and (2) uses a machine learning model rather than more traditional cost and inference models. The second area of performance impact prediction focusses on how software (or a benchmark) behaves when executed in different environments. Gao and Jiang [2017] build ensemble models to predict the performance variation of load tests in different environments. Wang et al. [2018] utilize resource profiles of performance tests and cloud performance distributions to estimate how an application will behave when deployed in cloud environments. Scheuner and Leitner [2018] employ a linear regression model to predict the response time of an application deployed in the cloud based on system microbenchmarks run on the respective cloud instance. All three of these predict the performance based on resource profiles from where the software (or benchmark) is executed, whereas our approach is fully static and leverages the idea that certain source code features contribute more to performance variability than others, e.g., concurrency, expensive API calls, or randomized algorithms. As discussed in Section 5.7, dynamic information, such as resource profiles, could enhance our approach and potentially improve prediction performance.

5.9.5 Machine Learning for Software Engineering

Machine learning has been extensively used in several fields of software engineering research, e.g., defect prediction [Malhotra, 2015] or effort estimation [Wen et al., 2012]. We refer to defect prediction as an example of a primary area of interest to show the trend of the research in this regard. Turhan et al. [2009] and Zimmermann et al. [2009] first demonstrated that it is possible to train models from labeled historical defects to determine whether software is defective, based on static and process features. We can identify some important differences between our approach and the problem we tackle on the one hand and defect prediction on the other. (1) Our approach replaces the computational effort of running a benchmark several times before detecting unstable results. It does not (primarily) replace a human classification activity, but it saves computational time; in defect prediction, for instance, the training data has to be manually labeled. (2) As a consequence, we can theoretically scale up the data set by running a higher number of benchmarks.

Since then, several approaches have been developed to deal with the problem of defect prediction [Malhotra, 2015], investigating different machine learning aspects that might have a significant impact on the prediction performance of the produced models: (1) using noisy data for training, due to mislabelling of defects [Tantithamthavorn et al., 2015]; (2) the operation of class rebalancing [Tantithamthavorn et al., 2020]; (3) the use of feature selection [Jiarpakdee et al., 2020]; (4) the optimization of hyper-parameters of the machine learning algorithms [Tantithamthavorn et al., 2016, 2019; Agrawal et al., 2019]; (5) the importance of correlated metrics [Jiarpakdee et al., 2019]; and (6) the techniques to validate the models [Tantithamthavorn et al., 2017]. In this paper, we conduct a first investigation on the use of machine learning in the field of performance variability, i.e., benchmark stability. Predicting benchmark stability is somewhat similar to defect prediction, as it classifies “defective” benchmarks without executing them. Considering the promising results, we aim at further investigating the use of machine learning by following a similar path to what has been done in the field of defect prediction or similar areas in software engineering.

5.10 Conclusions

In this paper, we introduced a static approach to predict whether a software benchmark will have stable or unstable results, without having to execute it. It uses 58 statically-computable source code features which are extracted with AST parsing and static call graphs. The features are related to (1) meta information, e.g., LOC, (2) programming language elements, e.g., conditionals or loops, and (3) potentially performance-impacting standard library calls, e.g., file and network I/O. We assessed the effectiveness of our approach with an empirical experiment on 230 open-source software Go projects that contain a total of 4,461 benchmarks. Although individual features are at best moderately associated with benchmark stability, we find that combining them with machine learning classifiers enables effective prediction. We built and compared 11 different binary classification models, of which Random Forest performs best, with a median prediction performance of AUC ranging from 0.79 to 0.90 and MCC ranging from 0.43 to 0.68, depending on the concrete approach parameterization. These results show that predicting benchmark instability with only static features is effective. We envision our approach to enable selecting reliable benchmarks in regression testing scenarios, to help developers spot potentially low-quality benchmarks, and to improve unstable benchmark results by increasing the number of repetitions before execution.

6

Applying Test Case Prioritization to Software Microbenchmarks

Christoph Laaber, Harald C. Gall, and Philipp Leitner
In submission, under major revision, at Empirical Software Engineering

Abstract

Software microbenchmark suites take considerably longer to execute than unit test suites. Therefore, applying test case prioritization (TCP) techniques may help capture large performance regressions sooner. As software microbenchmarks differ from unit tests in many aspects, it is unclear whether traditional unit testing TCP techniques work equally well for software microbenchmarks. In this paper, we empirically study total and additional TCP techniques for software microbenchmarks along multiple parameterization dimensions, leading to 54 unique technique instantiations. We find that TCP techniques have a mean APFD-P effectiveness between 0.54 and 0.71 and are able to capture the three largest performance changes after executing 29% to 66% of the whole microbenchmark suite. Our efficiency analysis reveals that the runtime overhead of TCP varies considerably depending on the exact parameterization. The most efficient technique has an overhead of 11% of the total microbenchmark suite execution time, making TCP a viable option for performance regression testing. The results demonstrate that the total strategy is superior to additional. Finally, dynamic techniques should be favored over static techniques due to their acceptable analysis overhead; however, in settings where time for prioritization is limited, static techniques provide an attractive alternative.

6.1 Introduction

Regression testing approaches assist developers in uncovering faults in new software versions. One such approach is TCP: it re-orders tests to execute the most important ones first, to find faults sooner on average. TCP has been extensively studied in unit testing research [Rothermel et al., 1999, 2001; Elbaum et al., 2001, 2002; Tonella et al., 2006; Zhang et al., 2009a; Mei et al., 2012; Yoo and Harman, 2012; Zhang et al., 2013; Hao et al., 2014; Henard et al., 2016; Luo et al., 2016a, 2018, 2019]. The unit-testing-equivalent technique for testing performance is software microbenchmarking. However, software microbenchmarks take substantially longer to execute, often taking multiple hours or even days [Huang et al., 2014; Stefan et al., 2017; Laaber and Leitner, 2018a], which is a compelling reason to apply TCP. Unfortunately, compared to functional regression testing, performance regression testing is not as intensively studied. So far, the focus has been on predicting the performance impact of code changes on commits to decide whether performance tests should be run at all [Huang et al., 2014; Sandoval Alcocer et al., 2016], on prioritizing microbenchmarks according to the expected performance change size [Mostafa et al., 2017], or on selecting microbenchmarks that are most likely to detect a performance regression [de Oliveira et al., 2017; Alshoaibi et al., 2019; Chen et al., 2020]. Applying traditional TCP techniques to software microbenchmarks could work well due to their similarities to unit tests, i.e., a suite contains many microbenchmarks, they are defined in code, they are self-contained and therefore re-arrangeable, and the granularity of the system under test is similar (methods and statements). In addition, existing research builds on the assumption that traditional TCP techniques can be used as baselines for TCP on microbenchmarks [Mostafa et al., 2017]. However, traditional TCP techniques might also behave differently when used to prioritize microbenchmarks, for the following reasons: (1) They rank their tests based on coverage information, under the assumption that a test covering more statements, branches, or functions is more likely to find defects. However, performance changes might not be associated with the number of covered elements, but with the performance impact of each of these elements (e.g., a change to a loop variable potentially has a bigger impact than one to multiple conditional statements [Jin et al., 2012]). (2) Where unit tests have a clearly defined binary outcome (pass or fail), software microbenchmarks result in distributions of performance counters indicating probabilistic results. (3) The reliability of software microbenchmark results and, consequently, of the performance changes depends on how rigorously one conducts the measurement. Hence, the effectiveness of TCP techniques could be compromised by performance measurement inaccuracies.

To investigate whether these underlying differences between unit tests and software microbenchmarks lead to measurable differences in the usefulness of existing TCP techniques, we empirically study total and additional coverage-based prioritization techniques along multiple dimensions, resulting in 54 unique techniques: (1) benchmark granularity on method and parameter level, (2) coverage granularity on method level, employing (3) dynamic and static coverage analyses with (4) different coverage-type-specific parameterizations. Research has shown that these aspects affect TCP effectiveness and coverage precision [Rothermel et al., 2001; Elbaum et al., 2002; Hao et al., 2014; Henard et al., 2016; Luo et al., 2016a, 2019; Reif et al., 2016, 2019].

As study objects, we select 10 Java open-source software (OSS) projects with comprehensive Java Microbenchmark Harness (JMH) suites, having 1,829 unique microbenchmarks with 6,460 unique parameterizations across 161 versions, to which we apply all prioritization approaches. This is, to the best of our knowledge, by far the largest JMH microbenchmarking data set, with approximately 47 billion performance data points (46,978,627,870 JMH microbenchmark invocation samples from 3,549,840 iterations) from a total of 17,464 microbenchmarks with 59,164 parameterizations. As part of our study, we formulate and answer the three subsequent research questions: An effective TCP technique should be able to rearrange the execution order of microbenchmarks to detect larger performance changes sooner. We investigate whether this is the case with our first research question:

RQ 1 How effective are TCP techniques in ranking software microbenchmarks by the size of the performance changes they identify?

The best techniques achieve mean APFD-P values between 0.54 and 0.71 and mean Top-3 values between 29% and 66%, depending on the study object. Techniques using the strategy total outperform ones with additional, and dynamic-coverage enables more effective techniques compared to static-coverage. We find that all TCP techniques perform better than random. However, “wrong” parameterization can have detrimental effects on the effectiveness, even rendering some techniques inferior to random for some study objects. Hence, choosing good parameter values is paramount for effectiveness. With the second research question, we investigate the robustness of the effectiveness measures from RQ 1 when considering different magnitudes of performance changes (i.e., the difference in execution time between two versions):

RQ 2 How robust are the TCP techniques’ effectiveness with respect to perfor- mance change sizes?

We find that the change size threshold, i.e., the size at which a performance change is considered at all, significantly impacts the effectiveness of TCP techniques. Depending on the technique and the study object, our results show that APFD-P values differ by a median of 0.11 to 0.28, with a maximum of up to 0.62.

However, the ranking of techniques, i.e., which techniques perform better or worse, is hardly impacted. When considering the practical usefulness of TCP techniques, it is crucial to not only consider their effectiveness, but also how much overhead they add to the microbenchmark execution. We define this as the efficiency of a technique and investigate it in our third research question:

RQ 3 How efficient are the TCP techniques?

We find that the runtime overhead of the studied techniques ranges between <1% and 59% of the total microbenchmark suite execution duration. Techniques with dynamic-coverage add between 10% and 17%, and techniques with static-coverage often add less than 4% overhead. However, similar to our effectiveness results, choosing the “wrong” prioritization parameters for static-coverage techniques can result in excessive overheads even beyond 55%. This indicates that if “good” parameters are chosen, applying TCP can be highly worthwhile.

Recommendations In typical TCP scenarios, where the entire microbenchmark suite is executed, we suggest employing dynamic TCP techniques due to the low overhead of 11%. However, if TCP is applied in settings with strict time limits, e.g., as part of a continuous integration (CI) pipeline, the analysis overhead introduced by TCP might still exceed the available time budget. In these cases, static TCP techniques can be a viable alternative if the “right” parameters are selected. Finally, according to our results, the total strategy is superior to additional, which may be surprising to readers accustomed to similar research on unit testing [Luo et al., 2019].

Contributions The main contributions of our study are:

• A first large-scale empirical comparison of TCP techniques applied to software microbenchmarks, which can serve as a reference point for future research to decide which techniques and parameters to choose as baselines.

• Empirical evidence about the impact of performance change sizes and coverage-type-specific parameters on TCP effectiveness and efficiency.

• A method to conduct studies about TCP for software microbenchmarks (and, potentially, other types of performance tests).

• An extensive JMH microbenchmark result data set, executed in a controlled, bare-metal environment, for 10 Java OSS projects having 1,829 distinct microbenchmarks with 6,460 distinct parameterizations across 161 versions. The data set comprises 46,978,627,870 microbenchmark invocation measurements and is available as part of our replication package [Laaber et al., 2020b].2

6.2 Software Microbenchmarking with JMH

Software microbenchmarking is a performance testing technique that measures certain performance metrics, such as execution time, throughput, or memory utilization, of small code units. These small code units are usually individual methods or statements, which makes software microbenchmarking comparable to unit tests in functional testing. In the remainder of the paper, we use both benchmark and microbenchmark to refer to software microbenchmarks. In the Java world, JMH is the de-facto standard framework for defining and executing software benchmarks. Similarly to JUnit, a benchmark is defined as Java source code with annotations. Listing 6.1 shows an example from RxJava. A benchmark is a public method annotated with @Benchmark, here measuring the performance of a latched observer (lines 8–16). JMH supports parameterization of benchmarks, i.e., executing the same benchmark with multiple different inputs. Parameters for benchmarks are instance variables annotated with @Param (lines 20–21), defined in a state object (@State). This state object can either be the benchmark class itself or, as in this case, a different class which is passed to the benchmark method as a parameter. In this example,

the values of the parameter size are 1 and 1000, resulting in the benchmark being executed twice, once for each value. If multiple parameters are defined, the number of executions is the cross-product of their numbers of values. For example, a benchmark with parameters a = 20, b ∈ {”a”, ”b”}, and c ∈ {10, 100, 1000} is executed 1 × 2 × 3 = 6 times (a hypothetical sketch of such a multi-parameter benchmark follows Listing 6.1).

2 For now, the replication package is hosted on servers of the first author's university, but we will move it to figshare once the paper has been accepted.

1   @Fork(3)
2   @Warmup(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
3   @Measurement(iterations = 20, time = 1, timeUnit = TimeUnit.SECONDS)
4   @BenchmarkMode(Mode.SampleTime)
5   @OutputTimeUnit(TimeUnit.NANOSECONDS)
6   public class ComputationSchedulerPerf {
7
8     @Benchmark
9     public void observeOn(Input input) {
10      LatchedObserver o = input.newLatchedObserver();
11      input.observable
12        .observeOn(Schedulers.computation())
13        .subscribe(o);
14      o.latch.await();
15    }
16
17    @State(Scope.Thread)
18    public static class Input extends InputWithIncrementingInteger {
19      @Param({ "1", "1000" })
20      public int size;
21    }
22  }

Listing 6.1: Modified JMH example from RxJava
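To illustrate the cross-product rule, the following hypothetical state class (our own sketch, not taken from RxJava or any study object) defines the three parameters a, b, and c from the example above; a benchmark using this state is executed 1 × 2 × 3 = 6 times, once per parameter combination.

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

public class ParameterCrossProductExample {

  @State(Scope.Thread)
  public static class Input {
    // 1 value for a, 2 values for b, and 3 values for c:
    // JMH executes the benchmark once per combination, i.e., 1 * 2 * 3 = 6 times.
    @Param({ "20" })
    public int a;

    @Param({ "a", "b" })
    public String b;

    @Param({ "10", "100", "1000" })
    public int c;
  }

  @Benchmark
  @BenchmarkMode(Mode.SampleTime)
  @OutputTimeUnit(TimeUnit.NANOSECONDS)
  public int combine(Input input) {
    // Trivial placeholder workload; a real benchmark would exercise production code.
    return input.a + input.b.length() + input.c;
  }
}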

As performance is affected by multiple factors, such as the execution environment (e.g., bare-metal server, cloud, developer laptop) or the programming language (e.g., compiler optimizations, caches), one has to execute benchmarks multiple times to get reliable results. JMH lets developers configure the execution repetitions (lines 1–3) as well as the measured performance metric (lines 4–5). Figure 6.1 visualizes how JMH executes benchmarks (we refer to elements of the figure in “quotes”). A (parameterized) benchmark is repeatedly invoked for a defined time period (e.g., 1s), called an iteration, and the performance metric is reported.

[Figure 6.1: JMH execution. The figure shows forks (fork1 ... forkn), each consisting of warmup iterations (wi1 ... win) and measurement iterations (i1 ... in), over which invocation samples of the performance metric are collected.]

This performance metric can be the average execution time (AverageTime), the throughput (Throughput) across all invocations, or a sample distribution of the invocation values (SampleTime). JMH runs multiple iterations (line 2 and “warmup”) to bring the system into a steady-state, which is required for reliable measurements, followed by multiple measurement iterations (line 3 and “measurement”). To deal with non-determinism of the Java Virtual Machine (JVM) (e.g., dynamic compilation), JMH supports forks (line 1 and “fork”) that execute the same benchmarks in fresh JVMs. The result of a benchmark is then the distribution of results from all measurement iterations (“i”) of all forks (“fork”).

6.3 Empirical Study

To investigate whether TCP techniques originating from unit testing research are applicable to software microbenchmarks, we conduct a laboratory experiment [Stol and Fitzgerald, 2018] on open-source Java projects with JMH software microbenchmark suites. The study compares the effectiveness and efficiency (i.e., dependent variables) of different TCP techniques, exploring a wide variety of parameter combinations (i.e., independent variables).

6.3.1 Experiment Process

We use the research design depicted in Figure 6.2. First, we select suitable study objects and multiple versions of these (see Section 6.3.2). Second, and for all selected versions of all selected objects, we apply the prioritization techniques under study by retrieving coverage information of all benchmarks, which serves as input to the actual prioritization. The parameter space, i.e., the independent variables of our study (I; see Section 6.3.3), consists of the prioritization strategy, the benchmark granularity, the coverage type, and the coverage-type-specific parameters. The result of the prioritization strategies is then an ordered list of benchmarks with ranks, in descending order. Third, we compare these rankings by their effectiveness (1, 2) and efficiency (3), which are defined by the dependent variables of our study (D; see Section 6.3.4). For effectiveness, we execute all benchmarks of all projects in all versions (see Section 6.3.5), compute the performance changes between adjacent versions, and calculate the effectiveness measures with the benchmark ranking and the performance changes (see Section 6.3.4). Regarding efficiency, we execute all studied prioritization techniques for all study objects on cloud instances to assess their runtime (see Section 6.3.5), which consists of the time required for retrieving coverage information and prioritizing the benchmarks (see Section 6.3.4).

[Figure 6.2: Experiment process overview. TCP techniques (strategy, benchmark granularity, coverage type and parameters) use coverage information to produce a benchmark ranking; executing the benchmarks of the study objects' versions yields performance changes, from which the effectiveness measures (APFD-P, Top-N) are calculated; executing the techniques on cloud instances yields coverage and prioritization times for the efficiency analysis.]

6.3.2 Study Objects

To study TCP for software microbenchmarks, we select 10 OSS Java libraries. Because of the time-intensive nature of rigorously executing benchmarks, it is infeasible to conduct a study such as ours on, for example, all projects that have JMH suites. Therefore, we perform purposive sampling [Baltes and Ralph, 2020] of GitHub projects based on a list of 1,545 projects with JMH suites from Laaber et al. [2020c], with the aim to select a diverse set of projects from different domains, with varying benchmark suite sizes, and with a multitude of versions to apply TCP on. First, we apply the following inclusion criteria to each project: (1) it is the main project and not a fork, (2) the repository is available on GitHub, and (3) it has 30 benchmarks or more in the newest version. After applying the inclusion criteria, the list contains 111 projects, which we arrange in descending order by their number of GitHub stars, forks, and watchers, as well as their benchmark suite size. The scripts to retrieve this list are part of our replication package [Laaber et al., 2020b]. We then manually iterate through the project list from top to bottom, giving preference to “more popular” projects with many benchmarks, and apply the following exclusion criteria until we reach 10 projects: (1) neither Maven nor gradle is used for building, (2) fewer than 10 versions are available as git tags, and (3) fewer than 10 versions can be compiled. Depending on the number of available, compilable versions per project and the runtime of the benchmark suites, we choose at least 10 versions covering a wide variety of major, minor, and bugfix versions. Table 6.1 depicts the final set of projects used as study objects.

Table 6.1: Study objects

Project               GitHub                        Domain                                      Versions   Bench. Methods   Bench. Params.   Runtime [h] (mean ± stdev)
Byte Buddy            raphw/byte-buddy              Bytecode manipulation                       31         41               41               0.26 ± 0.069
Eclipse Collections   eclipse/eclipse-collections   Collections library                         10         513              2,413            38.45 ± 0.124
JCTools               JCTools/JCTools               Concurrent data structures                  11         67               227              1.15 ± 0.481
Jenetics              jenetics/jenetics             Genetic algorithms                          21         143              143              0.42 ± 0.053
Log4j 2               apache/logging-log4j2         Logging utilities                           15         399              558              2.71 ± 1.398
Netty                 netty/netty                   Asynchronous network communication          10         201              1,616            6.56 ± 4.625
Okio                  square/okio                   Data type access, storage, and processing   11         33               208              1.56 ± 0.170
RxJava                ReactiveX/RxJava              Asynchronous programming                    19         198              1,020            7.81 ± 2.113
Xodus                 JetBrains/xodus               Embedded, schema-less database (DB)         11         76               76               1.33 ± 0.104
Zipkin                openzipkin/zipkin             Distributed tracing system                  22         158              158              0.48 ± 0.101
Total                                                                                           161        1,829 (17,464)   6,460 (59,164)   711.28 (1)

(1) This number corresponds to a single execution of the benchmark suites across all versions. We executed the benchmark suites for three trials; hence, the total runtime as part of our experiments is 2,133.81h or 89 days.

Our data set consists of 161 versions (“Versions”) across the 10 study objects, as well as 1,829 distinct benchmarks (“Bench. Methods”) with 6,460 distinct benchmark parameterizations (“Bench. Params.”) across all study objects and versions. The difference between “Bench. Methods” and “Bench. Params.” is that the former indicates the count of unique methods annotated with @Benchmark, and the latter counts every benchmark-parameter instance as a single benchmark (see Section 6.2). As the same benchmarks are executed for
multiple versions in our experiment, the total number of benchmark-version and benchmark-parameterization-version pairs is 17,464 and 59,164, respectively. The mean runtime (“Runtime mean”) across the versions of our study objects and the execution trials (see Section 6.3.5) varies from approximately 16 minutes (0.26h) for Byte Buddy to 38.45 hours for Eclipse Collections for a single, full benchmark suite execution. A larger standard deviation of the runtime (“Runtime stdev”) is due to earlier versions of the respective project containing fewer benchmarks, with more benchmarks being added over time. To the best of our knowledge, this is by far the largest data set of software microbenchmark executions across multiple versions to date. Details, including the exact versions and commit hashes used, can be found in our replication package [Laaber et al., 2020b].
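For illustration, the inclusion filtering and popularity ordering can be sketched as follows; the Project record and its field names are hypothetical stand-ins for the metadata of the project list by Laaber et al. [2020c], and the subsequent manual application of the exclusion criteria is not shown.

import java.util.Comparator;
import java.util.List;

public class StudyObjectSelection {

  // Hypothetical project metadata; the field names are ours, not from the original list.
  record Project(String name, boolean fork, boolean onGitHub, int benchmarks,
                 int stars, int forks, int watchers) {}

  /** Apply the inclusion criteria and order the candidates by popularity and suite size. */
  static List<Project> candidates(List<Project> all) {
    return all.stream()
        .filter(p -> !p.fork())            // (1) main project, not a fork
        .filter(Project::onGitHub)         // (2) repository available on GitHub
        .filter(p -> p.benchmarks() >= 30) // (3) 30 benchmarks or more in the newest version
        .sorted(Comparator.<Project>comparingInt(Project::stars)
            .thenComparingInt(Project::forks)
            .thenComparingInt(Project::watchers)
            .thenComparingInt(Project::benchmarks)
            .reversed())                   // descending order
        .toList();
  }
}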

6.3.3 Independent Variables

Our empirical study investigates four independent variables (see Table 6.2): (1) the prioritization strategy, (2) the benchmark granularity, (3) the coverage type, and (4) the coverage-type-specific parameters. In total, our study involves 54 unique combinations of independent variable values, hereafter simply called TCP techniques.

Prioritization Strategy

We study and evaluate the two most common and basic strategies from unit testing research, i.e., the total and additional strategies [Rothermel et al., 1999]. The total strategy orders benchmarks based on the number of code elements covered by the individual benchmarks, while additional ranks the benchmarks based on the number of code elements that have not been covered by other benchmarks. In addition, we compare the two strategies to a baseline with random benchmark order, which corresponds to the dependent variable's mean across 100 random orderings.
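As an illustration, the following is a minimal sketch of the two strategies, assuming coverage is given as a set of covered production-method identifiers per benchmark; the class and method names are ours (not part of our tooling), and tie-breaking is simplified.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CoveragePrioritization {

  /** total: rank benchmarks by the number of covered methods, descending. */
  public static List<String> total(Map<String, Set<String>> coverage) {
    List<String> ranking = new ArrayList<>(coverage.keySet());
    ranking.sort((b1, b2) -> Integer.compare(coverage.get(b2).size(), coverage.get(b1).size()));
    return ranking;
  }

  /** additional: greedily pick the benchmark covering the most not-yet-covered methods. */
  public static List<String> additional(Map<String, Set<String>> coverage) {
    List<String> ranking = new ArrayList<>();
    Set<String> remaining = new HashSet<>(coverage.keySet());
    Set<String> covered = new HashSet<>();
    while (!remaining.isEmpty()) {
      String best = null;
      int bestGain = -1;
      for (String b : remaining) {
        Set<String> gain = new HashSet<>(coverage.get(b));
        gain.removeAll(covered);
        if (gain.size() > bestGain) {
          bestGain = gain.size();
          best = b;
        }
      }
      ranking.add(best);
      remaining.remove(best);
      covered.addAll(coverage.get(best));
    }
    return ranking;
  }

  public static void main(String[] args) {
    Map<String, Set<String>> coverage = new LinkedHashMap<>();
    coverage.put("benchA", Set.of("m1", "m2", "m3"));
    coverage.put("benchB", Set.of("m1", "m2"));
    coverage.put("benchC", Set.of("m4"));
    System.out.println("total:      " + total(coverage));      // [benchA, benchB, benchC]
    System.out.println("additional: " + additional(coverage)); // [benchA, benchC, benchB]
  }
}

The greedy re-computation of not-yet-covered methods is also what makes additional computationally more expensive than total (cf. the prioritization time complexities in Section 6.3.4).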

Table 6.2: Independent variables and their values. Variable values are listed top to bottom in increasing precision. Code names will be used as abbreviations in figures

Name                                      Short Name   Value Name            Code Name
Prioritization Strategy                   strategy     total                 t
                                                       additional            a
                                                       random                r
Benchmark Granularity                     bench        benchmark-method      m
                                                       benchmark-parameter   p
Coverage Type                             cov-type     dynamic-coverage      d
                                                       static-coverage       s
dynamic-coverage: Benchmark Granularity   dc-bench     method                m
                                                       parameter             p
static-coverage: Algorithm                sc-algo      RTA                   R
                                                       0CFA                  0
                                                       01CFA                 01
  Reflection Option                       sc-ro        NONE                  N
                                                       OFTCAGM (1)           O
                                                       FULL                  F
                                                       MAX (2)               M
  Entry Points                            sc-ep        single                s
                                                       multiple              m

(1) OFTCAGM corresponds to WALA's reflection option ONE_FLOW_TO_CASTS_APPLICATION_GET_METHOD
(2) MAX represents the “best” reflection option for a particular project where the execution was successful, i.e., OFTCAGM or FULL

Benchmark Granularity

Unit testing research often considers the test case granularity as an independent variable [Hao et al., 2014], which is either on test class or test method level. Since JUnit 5 (released 10th September, 2017), developers can specify parameterized test cases, which arguably would be a third test granularity to consider. However, at the time of writing we are not aware of any studies that investigate TCP with test parameter granularity. JMH supports parameterized benchmarks since version 0.4 (released 19th February, 2014), and many projects make extensive use of parameterization [Laaber et al., 2020c]. Therefore, our study investigates granularity on two levels: benchmark-method and benchmark-parameter. benchmark-parameter treats every parameterization of a benchmark method as a unique benchmark, whereas benchmark-method treats a benchmark method with all its parameter combinations as a unique benchmark. TCP techniques with benchmark-method rank all parameters of a benchmark in sequence without interleaving other benchmarks.
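The effect of the two granularities on the execution order can be sketched as follows; the benchmark and parameter names are illustrative only.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class GranularityExpansion {

  /** Expand a benchmark-method ranking into a flat order over parameterizations. */
  static List<String> expand(List<String> methodRanking, Map<String, List<String>> params) {
    List<String> order = new ArrayList<>();
    for (String benchmark : methodRanking) {
      for (String param : params.get(benchmark)) {
        // benchmark-method granularity keeps all parameterizations of a benchmark adjacent.
        order.add(benchmark + "(" + param + ")");
      }
    }
    return order;
  }

  public static void main(String[] args) {
    List<String> ranking = List.of("observeOn", "latched");
    Map<String, List<String>> params = Map.of(
        "observeOn", List.of("size=1", "size=1000"),
        "latched", List.of("size=1"));
    // [observeOn(size=1), observeOn(size=1000), latched(size=1)]
    System.out.println(expand(ranking, params));
  }
}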

Coverage Type

The TCP strategies studied, i.e., total and additional, rank the benchmarks based on structural coverage information. This structural coverage can be obtained in two ways: dynamic-coverage and static-coverage. We investigate both coverage types in our study. Apart from the type of coverage, its granularity influences the effectiveness and efficiency of TCP. Different coverage granularities have been studied for unit tests such as statement-level and method-level [Elbaum et al., 2002]. In our study, we investigate method-level coverage granularity for two reasons: (1) method-level is available for both dynamic and static types; and (2) dynamic-coverage on statement-level is known to have high runtime overhead, which may render these techniques too expensive in high code velocity environments or as part of CI [Elbaum et al., 2014; Liang et al., 2018].

static-coverage is retrieved by static call graph (CG) analyses constructed by WALA, and dynamic-coverage is retrieved by executing a single benchmark invocation using the JaCoCo agent.

3 https://junit.org/junit5
4 https://junit.org/junit5/docs/current/user-guide/#writing-tests-parameterized-tests
5 https://hg.openjdk.java.net/code-tools/jmh/rev/b45d214529fc

Coverage-Type-Specific Parameters

Previous research on TCP for unit tests investigated different prioritization strategies, coverage types and granularities, and test case granularities [Elbaum et al., 2002; Zhang et al., 2009a; Mei et al., 2012; Yoo and Harman, 2012; Luo et al., 2016a, 2019], but to the best of our knowledge, no study exists that shows the impact of different coverage-type-specific parameters on TCP effectiveness and efficiency. Coverage-type-specific parameters guide how coverage information is retrieved by the respective analyses. Depending on whether dynamic-coverage or static-coverage is used, different parameters are available. We consider the benchmark granularity (similar to the benchmark granularity of the prioritization strategy) of the coverage type, i.e., of the underlying CG type, gathering the coverage information. For dynamic-coverage we study method (dc-benchm) and parameter (dc-benchp) granularity, whereas for static-coverage we only study method granularity, as parameter granularity requires executing the benchmark or utilizing symbolic execution techniques. dc-benchm retrieves coverage information for a benchmark method with a single parameterization, and dc-benchp extracts coverage information for each benchmark parameterization. The coverage type parameters for static-coverage are related to how WALA, a state-of-the-art static analysis library, builds the static CGs: (1) the CG algorithm (sc-algo), (2) the CG algorithm's reflection option (sc-ro), and (3) the set of CG entry points for each benchmark (sc-ep). We investigate three of the four pre-defined CG algorithms in WALA, the context-insensitive algorithms RTA [Bacon and Sweeney, 1996] and 0CFA [Shivers, 1988] as well as the context-sensitive algorithm 01CFA [Grove and Chambers, 2001]. We refrain from using 1CFA (as an instance of nCFA), as it was shown to be inferior to both 0CFA and 01CFA [Reif et al., 2019]. We further excluded 01CFAContainer due to its long execution times and heavy resource usage, which led to timeouts and failures during our experiments.

Regarding the CG algorithms' reflection options (sc-ro), we study all CG algorithms with no special handling of reflection (sc-roNONE) and with the highest reflection option per study object that did not lead to timeouts or failures (sc-roFULL or sc-roOFTCAGM). We were able to execute Byte Buddy, Eclipse Collections, JCTools, Jenetics, Netty, Okio, and RxJava with the highest available reflection option FULL; and Log4j 2, Xodus, and Zipkin with the second highest reflection option OFTCAGM. Table 6.2 also lists MAX, which corresponds to either FULL or OFTCAGM, depending on the project as described above. We use MAX throughout the paper whenever we discuss a TCP technique using the highest reflection option across all study objects.

Finally, CG algorithms rely on a defined set of entry points that inform the algorithm which paths of a program are executed, which classes are instantiated, and hence which subtypes are considered by points-to analyses. Employing different entry point sets results in different CGs and, consequently, in potentially different prioritizations [Reif et al., 2016]. We construct entry point sets assuming closed-package usage, i.e., only methods that are called by the benchmark itself (@Benchmark) and setup methods (@Setup) are contained. Our study investigates two types of entry point sets: single (sc-eps) and multiple (sc-epm). sc-eps constructs a single entry point set for all benchmarks in a suite and, hence, builds a single CG for all benchmarks. sc-epm constructs one entry point set per benchmark, consisting only of the benchmark itself and its setup method(s).
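The two entry point configurations can be sketched as follows, representing each benchmark by its @Benchmark method and its associated @Setup methods; the types and method names below are our own simplification, not WALA's API.

import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class EntryPointSets {

  // Hypothetical representation: a benchmark and the methods it contributes as CG entry points.
  record Bench(String benchmarkMethod, List<String> setupMethods) {}

  /** sc-eps: one entry point set (and hence one call graph) for the whole suite. */
  static Set<String> single(List<Bench> suite) {
    Set<String> entryPoints = new HashSet<>();
    for (Bench b : suite) {
      entryPoints.add(b.benchmarkMethod());
      entryPoints.addAll(b.setupMethods());
    }
    return entryPoints;
  }

  /** sc-epm: one entry point set (and hence one call graph) per benchmark. */
  static Map<String, Set<String>> multiple(List<Bench> suite) {
    Map<String, Set<String>> perBenchmark = new LinkedHashMap<>();
    for (Bench b : suite) {
      Set<String> entryPoints = new HashSet<>(b.setupMethods());
      entryPoints.add(b.benchmarkMethod());
      perBenchmark.put(b.benchmarkMethod(), entryPoints);
    }
    return perBenchmark;
  }
}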

6.3.4 Dependent Variables

Our study investigates three types of dependent variables, two measuring TCP effectiveness and one assessing efficiency.

Effectiveness

For RQ 1 and RQ 2, we study two dependent variables, similar to the work by Mostafa et al. [2017]: (1) average percentage of fault-detection on performance (APFD-P) and (2) Top-N percentile. These two metrics assess how effective the studied TCP techniques are in ranking benchmarks. A more effective TCP technique ranks benchmarks that uncover bigger performance changes higher than benchmarks that find smaller or no performance changes. We do not employ nDCG as an effectiveness measure, as Mostafa et al. [2017] did, because APFD-P and nDCG metrics are correlated in our study.

Performance Changes The performance changes of benchmarks between two adjacent versions are integral to the calculation of effectiveness measures. Performance data is known to often be non-normally distributed [Curtsinger and Berger, 2013] (e.g., long-tailed or multi-modal), and, consequently, best practice suggests using bootstrap confidence intervals [Kalibera and Jones, 2012; Stefan et al., 2017; Wang et al., 2018; He et al., 2019; Laaber et al., 2019b, 2020c]. To compute the confidence interval for ratios of means, we employ a Monte-Carlo technique described by Kalibera and Jones [2012] that relies on statistical simulation, i.e., bootstrap [Davison and Hinkley, 1997], with hierarchical random resampling [Ren et al., 2010] with replacement, 10,000 bootstrap iterations [Hesterberg, 2015], and a confidence level of 99%. We use three hierarchical levels: (1) trials, (2) iterations, and (3) invocations. For details on how to compute the bootstrap confidence intervals, we refer to Kalibera and Jones [2012, p.27ff], in particular the algorithm in Fig. 2 of their paper. We define a performance change to be the lower bound of the bootstrap confidence interval for the ratio of means of two versions, which we formally denote as the function change(i) in Equation 6.1.

change(i) = \mathrm{quantile}_{\frac{1-cl}{2}}\left(\left[\frac{\mathrm{mean}(S_i)}{\mathrm{mean}(S_i^{base})}\right]^{b_i}\right) \qquad (6.1)

i is the i-th benchmark in a TCP ranking. quantile returns the ((1-cl)/2)-th quantile (with cl = 0.99 for a 99% confidence level) of the b_i = 10,000 bootstrap iterations

of the performance change ratios, and mean is the arithmetic average. S_i is the bootstrap sample for the i-th benchmark, and S_i^{base} the sample for the previous version of the same benchmark, according to hierarchical random resampling with replacement. Chen and Shang [2017] showed that benchmarks commonly indicate many small performance changes between version pairs, which might partially be caused by measurement inaccuracy or bias [Mytkowicz et al., 2009; Curtsinger and Berger, 2013; de Oliveira et al., 2013b]. Many of these changes are likely to be unimportant, hence distorting effectiveness measures. In RQ 1, we only consider performance changes of 3% or larger as significant, similar to Georges et al. [2007]. All changes below that threshold are discarded. In RQ 2, we explicitly investigate the effectiveness robustness, i.e., the impact the performance change size has on the studied effectiveness measures, by performing a sensitivity analysis on this threshold value. Our study investigates thresholds t from 0% (i.e., all changes) to 100%, where t ∈ {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 50, 100}.
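The following is a simplified sketch of this computation for a single benchmark, assuming its measurements are given as trials × iterations × invocation samples; it follows the idea of Equation 6.1 and Kalibera and Jones [2012], but it is our own illustration and not the implementation of the pa tool introduced in Section 6.3.6.

import java.util.Arrays;
import java.util.Random;

public class PerformanceChange {

  /** Lower bound of the bootstrap confidence interval of the ratio of means (cf. Equation 6.1). */
  static double change(double[][][] current, double[][][] base,
                       int bootstrapIterations, double confidenceLevel, Random rnd) {
    double[] ratios = new double[bootstrapIterations];
    for (int b = 0; b < bootstrapIterations; b++) {
      ratios[b] = mean(resample(current, rnd)) / mean(resample(base, rnd));
    }
    Arrays.sort(ratios);
    // e.g., the 0.5% quantile for a 99% confidence level
    int index = (int) Math.floor(((1.0 - confidenceLevel) / 2.0) * (bootstrapIterations - 1));
    return ratios[index];
  }

  /** Hierarchical random resampling with replacement: trials, then iterations, then invocations. */
  static double[][][] resample(double[][][] data, Random rnd) {
    double[][][] sample = new double[data.length][][];
    for (int t = 0; t < data.length; t++) {
      double[][] trial = data[rnd.nextInt(data.length)];
      sample[t] = new double[trial.length][];
      for (int i = 0; i < trial.length; i++) {
        double[] iteration = trial[rnd.nextInt(trial.length)];
        sample[t][i] = new double[iteration.length];
        for (int j = 0; j < iteration.length; j++) {
          sample[t][i][j] = iteration[rnd.nextInt(iteration.length)];
        }
      }
    }
    return sample;
  }

  static double mean(double[][][] data) {
    double sum = 0;
    long n = 0;
    for (double[][] trial : data) {
      for (double[] iteration : trial) {
        for (double invocation : iteration) {
          sum += invocation;
          n++;
        }
      }
    }
    return sum / n;
  }
}

In our study, b_i = 10,000 bootstrap iterations, cl = 0.99, and the three hierarchy levels are trials, iterations, and invocations.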

APFD-P The effectiveness measure APFD-P is adapted from the standard TCP measure average percentage of fault-detection (APFD), which was first introduced by Rothermel et al. [1999] and has since been widely used in unit testing research [Rothermel et al., 2001; Elbaum et al., 2002; Mei et al., 2012; Zhang et al., 2009b; Hao et al., 2014; Luo et al., 2016a, 2018]. As unit tests have a binary outcome (i.e., they either pass or fail), whereas benchmarks have a continuous result (e.g., a 10% or 50% slowdown), Mostafa et al. [2017] adapted APFD for performance tests and benchmarks as defined in Equation 6.2.

APFD\text{-}P = \frac{\sum_{x=1}^{N} \frac{detected(x)}{T}}{N} \qquad (6.2)

N is the benchmark suite size, T is the total sum of all performance changes, and detected(x) returns the sum of the performance changes of the first x benchmarks (see Equation 6.3).

detected(x) = \sum_{i=1}^{x} change(i) \qquad (6.3)

where change(i) is the performance change of the i-th benchmark in a TCP ranking. Note that we adapt the APFD-P definition by Mostafa et al. [2017] to use the lower bound of the bootstrap confidence interval as the performance change size (see Equation 6.1), as opposed to the change in mean performance.
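Equations 6.2 and 6.3 translate directly into the following sketch, assuming the per-benchmark change sizes are already ordered by the TCP ranking under evaluation and thresholded as described above (our own illustration, not the replication-package implementation).

public class ApfdP {

  /**
   * APFD-P (Equation 6.2): changes[i] is the performance change size of the
   * benchmark at rank i+1 of the evaluated TCP ranking (Equation 6.1).
   */
  static double apfdP(double[] changes) {
    int n = changes.length;
    double total = 0;        // T: total sum of all performance changes
    for (double c : changes) {
      total += c;
    }
    double sum = 0;
    double detected = 0;     // detected(x): running sum of the first x changes (Equation 6.3)
    for (double c : changes) {
      detected += c;
      sum += detected / total;
    }
    return sum / n;
  }

  public static void main(String[] args) {
    // A ranking that detects the largest change first scores higher ...
    System.out.println(apfdP(new double[] { 0.5, 0.1, 0.0, 0.0 }));  // ~0.958
    // ... than a ranking that detects it last.
    System.out.println(apfdP(new double[] { 0.0, 0.0, 0.1, 0.5 }));  // ~0.292
  }
}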

Top-N This dependent variable provides a better intuition regarding the advantages developers have from prioritizing their benchmarks. It represents the number of benchmarks that have to be executed, based on a TCP technique's ranking, to capture the N largest performance changes. Following Mostafa et al. [2017], we choose N = 3 in our study. This captures how large a fraction of the benchmark suite must be executed to detect the 3 largest performance changes.
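Top-N can be computed analogously, as the fraction of the ranked suite that has to be executed until the N largest changes have been encountered; the helper below is our own sketch, with N = 3 as used in the study and simplified tie handling.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class TopN {

  /** Fraction of the ranked suite that must be executed to capture the n largest changes. */
  static double topN(double[] rankedChanges, int n) {
    // Indices 0..length-1, ordered by change size (descending).
    List<Integer> byChange = new ArrayList<>();
    for (int i = 0; i < rankedChanges.length; i++) {
      byChange.add(i);
    }
    byChange.sort(Comparator.comparingDouble((Integer i) -> rankedChanges[i]).reversed());

    // The last position (in the evaluated ranking) at which one of the n largest changes occurs.
    int lastRank = 0;
    for (int i = 0; i < Math.min(n, byChange.size()); i++) {
      lastRank = Math.max(lastRank, byChange.get(i) + 1);
    }
    return (double) lastRank / rankedChanges.length;
  }

  public static void main(String[] args) {
    // The three largest changes sit at ranks 1, 2, and 5 of a 10-benchmark ranking: Top-3 = 50%.
    double[] changes = { 0.6, 0.4, 0.0, 0.1, 0.3, 0.0, 0.0, 0.0, 0.0, 0.0 };
    System.out.println(topN(changes, 3)); // 0.5
  }
}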

Efficiency

Effectiveness of TCP techniques captures only one part of their quality and applicability. Techniques that rely on precise analyses might produce effective results, but may be unrealistic to perform in practice due to their long runtimes. Hence, our efficiency analysis (for RQ 3) complements the effectiveness analysis by studying the runtimes of the different TCP techniques. The efficiency dependent variable can be split into two distinct classes: (1) coverage time and (2) prioritization time, which together form the analysis time of a TCP technique. Depending on the TCP technique's independent variable values, these two times are expected to contribute in different proportions to the analysis time. The prioritization time's computational complexity is O(mn) for total and O(m^2 n) for additional, where m is the number of benchmark methods and n the number of called production methods [Rothermel et al., 2001]. In our efficiency analysis, we are interested in the actual overhead for the objects under study, with respect to the duration of executing the entire benchmark suite.

6.3.5 Execution Setup

As our empirical study relies on measuring performance, i.e., (1) the performance changes of the benchmark suites for each version required for the effectiveness metrics (RQ 1 and RQ 2) and (2) the efficiency analysis of RQ 3, a rigorous methodology is required to reduce validity concerns and enable replicability.

Performance Changes

Measuring benchmark performance counters requires careful experiment planning because of widely reported performance measurement errors [Georges et al., 2007; Mytkowicz et al., 2009; Curtsinger and Berger, 2013; de Oliveira et al., 2013b]. A sloppy measurement methodology results in unreliable benchmark results, which in turn might distort the results of our experiment. To mitigate these sources of error, we follow a three-part methodology:

(1) We manually patch the build scripts of all projects and versions with the same JMH version (i.e., 1.21), compile the JMH fat Java Archives (JARs), and execute the benchmarks with Java Development Kit (JDK) version 1.8.0_181-b13 employing Java HotSpot 64-Bit Server VM (build 25.181-b13). This way we ensure that a benchmark performance change does not stem from a JDK-related or JMH-related improvement or regression.

(2) In terms of execution setup, we configure the benchmark suites with 10 warmup iterations and 20 measurement iterations of 1 second each (see the annotation sketch after this list). We export the invocation samples of each iteration (using JMH's benchmark mode SampleTime) and take a weighted sub-sample of 1,000 from them. In addition, we execute the full benchmark suites of each version for 3 trials at different points in time. We decide against using the original configurations (as set by the libraries) due to their extensive execution times. Already with our configuration setup, running the benchmark suites of our study objects required approximately 89 days for three

trials (see Table 6.1). In contrast, the configuration set by developers of the most recent version (1.3.8) of RxJava takes about 124.5 hours when running a single trial, which would render our study infeasible.

(3) As performance engineering best practice suggests utilizing controlled environments, we used a non-virtualized (“bare-metal”) server hosted at the first author’s university. This server has a 12-core Intel Xeon [email protected] central processing unit (CPU) with 70GiB memory, runs ArchLinux with a kernel version 5.2.9-arch1-1-ARCH, and uses a Samsung SSD 860 PRO SATA III disk.
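The iteration setup from point (2) corresponds to the following JMH annotations, shown on a placeholder benchmark class; the class and method are ours, and fork counts or other settings are omitted because they are not part of the configuration described above.

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.Warmup;

// 10 warmup and 20 measurement iterations of 1 second each, exporting invocation
// samples via the SampleTime benchmark mode; the three trials of the study are
// obtained by executing the whole suite three times, not by these annotations.
@Warmup(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 20, time = 1, timeUnit = TimeUnit.SECONDS)
@BenchmarkMode(Mode.SampleTime)
public class ExecutionSetupExample {

  @Benchmark
  public void placeholder() {
    // Placeholder workload; in the study, the existing project benchmarks are executed
    // with this configuration (patched to JMH 1.21).
  }
}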

Efficiency Analysis

For the efficiency analysis, the environment used to measure the performance of the coverage extraction and prioritization phase is hosted in the private cloud of the first author's institution. We select private cloud instances and refrain from using a bare-metal machine for two reasons: (1) the runtime of the individually measured times, i.e., coverage and prioritization, is much longer than the individual benchmarks' runtimes. Instead of nanoseconds to milliseconds, they are in the order of seconds to minutes. Therefore, small measurement errors are not expected to have an impact on the overall result of RQ 3; and (2) the time to run the efficiency analysis would take about another 77 days for a single measurement, for all TCP techniques (i.e., independent variable combinations) and study objects. Considering performance engineering best practice and running the measurements repeatedly (e.g., 10 trials), the total duration would exceed a sensible time frame. Hence, we measure coverage and prioritization times once on multiple private cloud instances with the same configuration to make use of experiment parallelization. All instances have the same specification:

(1) The cloud instance types have 16 virtual CPUs and 62GB memory. The CPUs' model is Intel Xeon E3-12xx v2 (Ivy Bridge, IBRS) with 2.5GHz and a 4MB cache.

(2) The instances are provisioned with Ubuntu 18.04 LTS and run a Linux kernel version 4.15.0-23-generic.

(3) Identical to the performance change execution setup, we execute the measurements with JDK version 1.8.0_181-b13 employing Java HotSpot 64-Bit Server VM (build 25.181-b13).

6.3.6 Tooling, Analysis Scripts, and Data

The tools, scripts, and data required to run (and replicate) our study consist of three parts: (1) the benchmark analysis and prioritization tool bencher [Laaber, 2020a], (2) the performance change analysis tool pa [Laaber, 2020b], and (3) an openly-available replication package [Laaber et al., 2020b]. bencher is written in Kotlin 1.3.72. It parses the byte code of JMH projects for their benchmarks with ASM 7.2, retrieves static coverage information with WALA 1.5.0 and dynamic coverage information with JaCoCo 0.8.5, and applies the TCP techniques. pa is written in Go and implements efficient, multi-threaded performance change analysis of benchmark results, as required for Section 6.3.4 and introduced by Kalibera and Jones [2012]. It computes bootstrap confidence intervals and confidence interval ratios of a specified statistic (e.g., the arithmetic mean), with hierarchical random resampling with replacement, user-defined bootstrap iterations and confidence levels, and sampling of invocations. The replication package contains all scripts that perform data preparation and cleaning, invocation of the aforementioned tools, data analyses, and data representations, as well as all input, intermediate, and output data.

6.3.7 Threats to Validity and Limitations

Construct Validity We rely on APFD-P and Top-3 as measures for TCP effectiveness [Mostafa et al., 2017]. APFD-P is adapted from APFD, which,

although widely used, has been discussed to have limitations [Rothermel et al., 1999]. We address this threat by also investigating Top-3. Choosing N = 3, as opposed to 1 or 5, is based on previous research [Mostafa et al., 2017], and we manually confirmed that a larger N would always result in effectiveness values close to 100% (that is, the whole benchmark suite has to be executed for capturing the top N performance changes). We further adapt APFD-P to be more robust against performance measurement variabilities and study the metric's robustness in RQ 2. Finally, we combine the effectiveness findings (RQ 1, RQ 2) with an efficiency analysis (RQ 3) in our discussion.

6 https://asm.ow2.io
7 https://github.com/wala/WALA
8 https://www.jacoco.org

Internal Validity Valid performance changes are paramount to the study's internal validity. Measurement bias is common [Mytkowicz et al., 2009; Curtsinger and Berger, 2013; de Oliveira et al., 2013b] and could threaten our effectiveness (i.e., the validity of APFD-P) and efficiency results. For the performance measurements of all versions, we follow a rigorous methodology based on state-of-the-art best practice [Georges et al., 2007] utilizing a bare-metal environment. However, measurement uncertainty can never be excluded entirely. We rely on statistical simulation, i.e., the bootstrap confidence interval of the ratio of means [Kalibera and Jones, 2012], to decide whether a benchmark's result has changed between two adjacent versions. Bootstrap is a randomized algorithm to approximate a benchmark's result population from a measured sample. Consequently, the detected performance change size might suffer from Monte-Carlo noise. We mitigate this by following statistical best practice and using 10,000 bootstrap iterations [Hesterberg, 2015]. The efficiency measurements are executed in cloud environments, which might interfere with the measurements. However, because the times we measure, i.e., coverage and prioritization, are in the order of minutes (or even longer), and we compare them to the total runtimes of the suites, which are between 16 minutes and 38 hours, small measurement inaccuracies are not expected to change our overall conclusions for RQ 3. Further threats to internal validity concern potential functional (RQ 1, RQ 2) and performance (RQ 3) bugs in our tool chain. We dedicated extensive effort to unit testing our tool chain and to performance benchmarking core functionality. To address validity threats regarding WALA usage and configuration, we rely on results and best practice of current research in static analysis [Reif et al., 2016, 2019].

External Validity Generalizability of our study is mostly concerned with the choice of our study objects and versions. We selected 10 Java OSS projects in 161 versions and with 6,460 distinct JMH benchmark parameterizations. Although we cannot generalize our findings to all Java/JMH projects, the data set created for this study is, to the best of our knowledge, by far the most extensive microbenchmarking data set to date. More projects would have rendered our study infeasible because of the time-intensive nature of running rigorous performance experiments. We picked Java because benchmark suites written in it are long-running [Laaber and Leitner, 2018a; Laaber et al., 2020c] and, hence, would benefit from TCP. Regarding the benchmark framework, JMH is the de facto standard for Java at the time of study [Stefan et al., 2017; Leitner and Bezemer, 2017]. We selected study objects that are large, well-known, popular projects from different domains to investigate high-quality software projects. However, the results might not generalize to closed-source or industrial software, other programming languages, or even other software written in Java. We studied a specific type of performance test, i.e., software microbenchmarks. They typically measure the execution runtime of small software components, such as methods or statements. Therefore, our results may not generalize to regression testing for other performance test types, e.g., load tests or system benchmarks, or other performance metrics, e.g., memory or input/output (I/O). Finally, depending on which static CG library is employed for coverage extraction, effectiveness and efficiency results are likely to change. We chose WALA because it works well for software libraries such as our study objects, performs reasonably well in the presence of reflection, and has been used in previous testing research [Reif et al., 2019; Luo et al., 2016a].

Limitations We limited the implementation of the static CG (“S”) and dynamic coverage (“D”) extractors, which occasionally causes empty coverage sets for affected benchmarks. (1) “S” and “D”: We only consider calls to study-object-internal methods as relevant for the coverage, because we are primarily interested in ranking benchmarks higher that find performance changes in the production code of the study objects. Some benchmarks test JDK collections, JDK concurrency features, or atomic data types, which serve as baselines for the benchmarks of custom functionality. We consider such benchmarks not interesting for regression testing. (2) “S”: If a benchmark implementation (annotated with @Benchmark) is located in a super-class and its parameterization (@Param) is defined in the sub-class, the static CG coverage detector is not able to capture this benchmark. (3) “D”: If a JMH parameter value contains a comma “,”, our tooling is not able to execute this benchmark through the JMH command line interface, because JMH 1.21 exits with a parsing error. In our study, five benchmarks of RxJava (all defined in rx.operators.RedoPerf) and one benchmark of Netty (io.netty.handler.codec.DateFormatter2Benchmark.parseHttpHeaderDateFormatterNew) are affected by this limitation.

6.4 Results and Analyses

This section presents our results and analyses along the three research ques- tions. We elaborate on the impact of different independent variable value combinations on the dependent variables, i.e., effectiveness (APFD-P and Top-3) and efficiency. The result analyses and interpretations in this section are supported by Table 6.3, which provides statistics about the extracted static and dynamic coverage information. Every row corresponds to a unique combination of the coverage independent variable values, i.e., coverage type (“Coverage Type”) and coverage-type-specific parameters (“Coverage Parameters”). These results support and explain phenomena observed throughout this section. Column

9 All defined in rx.operators.RedoPerf.
10 io.netty.handler.codec.DateFormatter2Benchmark.parseHttpHeaderDateFormatterNew.

Table 6.3: Coverage statistics of the studied coverage parameters. The coverage parameters are dc-bench, sc-algo, sc-ro, and sc-ep (in this order); dc-bench applies to dynamic-coverage, while sc-algo, sc-ro, and sc-ep apply to static-coverage. For every combination of coverage type and coverage-type-specific parameters, the table reports the covered methods per benchmark (mean ± standard deviation), the coverage overlap with other benchmarks of the same study object (mean ± standard deviation), and the percentage of benchmarks with an empty coverage set, overall (“all”) and for the top 3 benchmarks (“1st”, “2nd”, “3rd”).

Column “Covered Methods” depicts the number of called methods from each of the 59,164 benchmark parameterizations across all versions. Column “Coverage Overlap” shows the overlap of covered methods with another benchmark parameterization (of the same study object). For example, the benchmarks of our study subjects call on average 130.52±223.09 methods (directly or indirectly), of which 37%±31% are also covered by another benchmark, if we retrieve dynamic-coverage with the parameter dc-benchp. The remaining columns, grouped under “Empty Coverage Set”, show the percentage of benchmarks for which no coverage information can be extracted. Column “all” depicts the percentage of all benchmarks, whereas “1st”, “2nd”, and “3rd” show it for the top 3 benchmarks. The interested reader can find more detailed results, figures, and tables for each study object in our replication package [Laaber et al., 2020b].

6.4.1 RQ 1: Effectiveness

This section presents and discusses the effectiveness measures, i.e., APFD-P and Top-3, for each study object and across all study objects. For this, we follow a rigorous, three-step approach for the statistical analyses, as described below:

(1) We calculate the effectiveness values as described in Section 6.3.4 for every combination of the study objects, their versions, and the 54 TCP techniques (i.e., unique combination of the independent variable values of our study). This results in a single effectiveness value, i.e., either APFD-P or Top-3, for each technique. Recall that we use the performance change size threshold t = 3.

(2) We then apply the Scott-Knott effect size difference (ESD) test [Tantithamthavorn et al., 2019] for every study object, which clusters the TCP techniques into statistically different groups iff the Cohen’s d [Cohen, 1992] effect size estimate is non-negligible, i.e., d > 0.2 at significance level α = 0.05. Techniques in the same cluster only have a negligible effect size difference and hence perform statistically comparably. Colloquially, better techniques receive lower ranks than worse techniques, e.g., techniques with rank 1 are better than techniques with rank 2, and so on.

Table 6.4: APFD-P of the 54 TCP techniques compared to random per study object

Project             | Mean Max. | Mean Min. | Conf. Int. Upper | Conf. Int. Lower | + | = | −
Byte Buddy          | 0.64 | 0.43 | 0.69 | 0.37 | 40 | 8  | 6
Eclipse Collections | 0.64 | 0.60 | 0.71 | 0.53 | 54 | 0  | 0
JCTools             | 0.60 | 0.45 | 0.68 | 0.38 | 4  | 0  | 50
Jenetics            | 0.62 | 0.49 | 0.70 | 0.40 | 3  | 27 | 24
Log4j 2             | 0.64 | 0.43 | 0.68 | 0.38 | 22 | 0  | 32
Netty               | 0.65 | 0.43 | 0.76 | 0.34 | 31 | 8  | 15
Okio                | 0.70 | 0.42 | 0.76 | 0.36 | 33 | 6  | 15
RxJava              | 0.59 | 0.48 | 0.64 | 0.43 | 24 | 10 | 20
Xodus               | 0.71 | 0.51 | 0.74 | 0.43 | 46 | 4  | 0
Zipkin              | 0.54 | 0.48 | 0.61 | 0.42 | 27 | 11 | 16

(3) Finally, we apply the Scott-Knott test again (the double Scott-Knott test [Tantithamthavorn et al., 2019]), this time on the ranks from the previous step across all study objects. By that, we can draw general conclusions on the effectiveness of the 54 TCP techniques across all 10 study objects. A simplified sketch of the underlying clustering idea follows below.
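For illustration, the following simplified sketch captures the core idea of the clustering referenced in steps (2) and (3): order techniques by their mean APFD-P and merge neighbors whose Cohen’s d effect size is negligible (|d| ≤ 0.2). It deliberately omits the significance testing and optimal-split search of the actual Scott-Knott ESD test [Tantithamthavorn et al., 2019]; all class and method names are ours.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import java.util.Map;

    // Simplified, illustrative clustering of TCP techniques by effect size.
    // Not the full Scott-Knott ESD test: it only merges neighboring techniques
    // (ordered by mean APFD-P) whose pairwise Cohen's d is negligible (<= 0.2).
    // Assumes at least two APFD-P values (versions) per technique.
    public final class EffectSizeClustering {

        static double mean(double[] xs) {
            double sum = 0.0;
            for (double x : xs) sum += x;
            return sum / xs.length;
        }

        static double variance(double[] xs) {
            double m = mean(xs);
            double sum = 0.0;
            for (double x : xs) sum += (x - m) * (x - m);
            return sum / (xs.length - 1);
        }

        // Cohen's d with pooled standard deviation.
        static double cohensD(double[] a, double[] b) {
            double pooledVar = ((a.length - 1) * variance(a) + (b.length - 1) * variance(b))
                    / (a.length + b.length - 2);
            return (mean(a) - mean(b)) / Math.sqrt(pooledVar);
        }

        // apfdp: technique name -> APFD-P values (one per version).
        // Returns clusters of techniques; cluster 0 contains the best techniques.
        static List<List<String>> cluster(Map<String, double[]> apfdp) {
            List<String> ordered = new ArrayList<>(apfdp.keySet());
            ordered.sort(Comparator.comparingDouble((String t) -> mean(apfdp.get(t))).reversed());

            List<List<String>> clusters = new ArrayList<>();
            for (String technique : ordered) {
                if (!clusters.isEmpty()) {
                    List<String> last = clusters.get(clusters.size() - 1);
                    String previous = last.get(last.size() - 1);
                    double d = cohensD(apfdp.get(previous), apfdp.get(technique));
                    if (Math.abs(d) <= 0.2) { // negligible effect size: same cluster
                        last.add(technique);
                        continue;
                    }
                }
                List<String> fresh = new ArrayList<>();
                fresh.add(technique);
                clusters.add(fresh);
            }
            return clusters;
        }
    }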

APFD-P

Per Study Object Table 6.4 shows per study object the mean APFD-P values across all versions and all 54 TCP techniques (“Mean”), where “Max.” and “Min.” correspond to the mean APFD-P value (across all versions) of the best and worst technique, respectively. For each TCP technique, we compute a 95% confidence interval of the mean across all versions with bootstrap. Column “Conf. Int.” depicts the minimal lower bound (“Lower”) and the maximal upper bound (“Upper”) of all confidence intervals, one for each TCP technique. These confidence interval bounds supply a range of APFD-P values per study object. Finally, column “vs. random” shows the number of TCP techniques that perform statistically better (“+”), equal (“=”), or worse (“−”) than a random benchmark ordering (random), as assessed by the ranks of the first application of the Scott-Knott test (analysis step 2). Note that random achieves a mean APFD-P value (across 100 random orderings) of approximately 0.5.

We observe that the mean APFD-P values range from 0.42 for Okio to 0.71 for Xodus, with confidence interval bounds between 0.34 for Netty and 0.76 for Netty and Okio. The best techniques for each study object range between 0.54 (Zipkin) and 0.71 (Xodus). Compared to random, it depends on the study object and TCP technique whether TCP on benchmarks is more effective and, therefore, provides a benefit. We see three kinds of study objects:

• the ones where the majority of TCP techniques perform better than random, i.e., Byte Buddy, Eclipse Collections, Netty, Okio, and Xodus;

• the ones where a similar number of techniques are better and worse than (or equal to) random, i.e., Log4j 2, RxJava, and Zipkin; and

• the ones where the majority of techniques are inferior to random, i.e., JCTools and Jenetics.

This shows that for most study objects, a wrong TCP technique or the wrong parameterization can have detrimental effects on its effectiveness, rendering the technique inferior even to random. Nonetheless, for every study object there exist at least a few techniques that outperform random substantially.
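For intuition, the following sketch computes an APFD-P-style value under the assumption that APFD-P is the average, over all prefixes of the prioritized suite, of the fraction of the total performance-change magnitude already detected. The authoritative definition is the one in Section 6.3.4 (following Mostafa et al. [2017]); the names below are ours, and the sketch is only an illustration of such an area-based metric.

    // Illustrative, assumption-based sketch of an APFD-P-style metric:
    // the average, over all prefix lengths i, of the fraction of the total
    // performance-change magnitude detected by the first i benchmarks of the
    // prioritized order. A random order yields roughly 0.5.
    public final class ApfdPSketch {

        // changes[i] = performance change magnitude detected by the i-th benchmark
        // in the prioritized order (e.g., relative change in percent, 0 if none).
        static double apfdP(double[] changes) {
            double total = 0.0;
            for (double c : changes) total += c;
            if (total == 0.0) return 0.5; // no detectable change: treat as neutral

            double cumulative = 0.0;
            double area = 0.0;
            for (double c : changes) {
                cumulative += c;
                area += cumulative / total;
            }
            return area / changes.length;
        }

        public static void main(String[] args) {
            // Large changes ranked first yield a value close to 1.
            System.out.println(apfdP(new double[] {60.0, 10.0, 3.0, 0.0, 0.0}));
            // Large changes ranked last yield a value close to 0.
            System.out.println(apfdP(new double[] {0.0, 0.0, 3.0, 10.0, 60.0}));
        }
    }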

Overall To assess how effective TCP on benchmarks is across all study objects, we depict the results of the double Scott-Knott test (analysis step 3) in Figure 6.3. The y-axis shows the APFD-P ranks from the first Scott-Knott test (analysis step 2), the shape represents the mean rank across the 10 study objects, and the whiskers represent the 95% confidence interval of the mean computed with bootstrap. The x-axis shows the 54 TCP techniques. The facets show the ranks of the second Scott-Knott test. Colloquially, the higher on the y-axis and the more to the left on the x-axis, the better a particular TCP technique performs.

The first and most interesting observation is that total outperforms additional, with the first three ranks only containing total techniques.


Figure 6.3: APFD-P ranks across all study objects and for threshold t = 3%. For the independent variable descriptions, see Table 6.2

This is different from unit testing research and what Mostafa et al. [2017] assume. We see two reasons for this: (1) performance changes are captured by benchmarks that cover similar parts of the software under test; and (2) a large subset of the benchmarks in a suite have overlapping coverage with other benchmarks, where the average overlap is between 25% and 37% (see Table 6.3).

The best performing techniques (with rank 1) are the ones using dynamic-coverage and granularity benchmark-parameter in combination with total. The first additional techniques using dynamic-coverage achieve rank 4, again with benchmark granularity benchmark-parameter. For both dynamic-coverage strategies, the granularity benchmark-method performs worse compared to benchmark-parameter, indicating that they should prioritize benchmarks on parameter level to be more effective.

In terms of techniques using static-coverage, the best techniques achieve a better rank, i.e., rank 2 and 3, than the first additional technique with dynamic-coverage. Interestingly, all of these use the most imprecise static CG algorithm, i.e., sc-algoRTA. A reason could be that the more precise sc-algo0CFA and sc-algo01CFA show a higher number of benchmarks without coverage information, i.e., 28% or higher as compared to 21% or lower for sc-algoRTA (see Table 6.3). Reif et al. [2019] demonstrate that reflection is a common cause for unsoundness of static CGs, which potentially affects the higher percentage of empty coverage sets and, consequently, the lower effectiveness.

Table 6.5: Top-3 of the 54 TCP techniques compared to random per study object

Project             | Mean Min. | Mean Max. | Conf. Int. Lower | Conf. Int. Upper | + | = | −
Byte Buddy          | 0.57 | 0.73 | 0.48 | 0.81 | 40 | 7  | 7
Eclipse Collections | 0.29 | 0.62 | 0.16 | 0.83 | 54 | 0  | 0
JCTools             | 0.44 | 0.73 | 0.27 | 0.85 | 53 | 0  | 1
Jenetics            | 0.53 | 0.67 | 0.43 | 0.78 | 46 | 8  | 0
Log4j 2             | 0.48 | 0.75 | 0.33 | 0.82 | 22 | 16 | 16
Netty               | 0.39 | 0.68 | 0.20 | 0.86 | 38 | 16 | 0
Okio                | 0.45 | 0.71 | 0.26 | 0.86 | 46 | 0  | 8
RxJava              | 0.51 | 0.68 | 0.37 | 0.79 | 48 | 6  | 0
Xodus               | 0.37 | 0.68 | 0.23 | 0.85 | 42 | 8  | 0
Zipkin              | 0.66 | 0.70 | 0.54 | 0.79 | 0  | 54 | 0

Similar to dynamic-coverage techniques, static-coverage techniques with granularity benchmark-parameter tend to outperform techniques with benchmark-method.

Regarding the reflection option (sc-ro) or entry point set (sc-ep), we do not observe particular differences in APFD-P. For sc-algo0CFA and sc-algo01CFA, Table 6.3 shows that there is hardly any coverage difference between the reflection options sc-roNONE and sc-roMAX (when all other coverage-specific parameters are fixed). Although this is not the case for sc-algoRTA, sc-ro does not seem to have a big impact on APFD-P. The entry point set (sc-ep) has, for all CG algorithms (sc-algo), an impact on the number of covered methods (sc-eps results in larger CGs per benchmark than sc-epm), but no impact on the overlap and only a minor impact on the empty coverage sets. Nonetheless, we do not see their impact on the coverage information reflected in the APFD-P effectiveness.

Finally, only 2 of 54 techniques do not statistically outperform random across our study objects.

Top-3

Per Study Object Table 6.5 shows per study object the Top-3 effectiveness across all 54 TCP techniques, similarly to Table 6.4. Different from APFD-P, a lower Top-3 value is better, i.e., fewer benchmarks are required to be executed to find the three largest performance changes. This is reflected in Table 6.5, where the columns “Mean Min.” and “Conf. Int. Lower” are placed further left than “Mean Max.” and “Conf. Int. Upper”, respectively.

We observe that the range of mean Top-3 values is wide: depending on the study object and technique, between 16% (0.16) and 86% (0.86) of the full benchmark suite must be executed to capture the three largest performance changes. This shows that TCP can be effective regarding Top-3 in the best cases, but it can also have almost no benefit over executing the full suites if the worst technique is utilized. Depending on the study object, the best technique requires executing between 29% (Eclipse Collections) and 66% (Zipkin).

It is more often the case than for APFD-P that any TCP technique provides a benefit in terms of Top-3 over random. For eight study objects, i.e., Byte Buddy, Eclipse Collections, JCTools, Jenetics, Netty, Okio, RxJava, and Xodus, the majority of techniques are superior to random. For Log4j 2, more techniques are inferior or equal and, therefore, not effective compared to random. However, for this project there exist 22 techniques that are superior. An interesting study object is Zipkin, for which all techniques perform equally to random. Zipkin is also the project that shows the lowest APFD-P values among all study objects (see Table 6.4). Nonetheless, these results show that most TCP techniques enable capturing larger performance changes sooner.
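Interpreting Top-3 as the fraction of the prioritized suite that must be executed until the three benchmarks with the largest performance changes have been run (the authoritative definition is in Section 6.3.4), a minimal sketch of the computation could look as follows; names are ours, and ties between equal change magnitudes are broken arbitrarily.

    import java.util.Arrays;
    import java.util.Comparator;
    import java.util.HashSet;
    import java.util.Set;

    // Illustrative sketch of a Top-3-style measure: the fraction of the
    // prioritized benchmark order that must be executed until the three
    // benchmarks with the largest performance changes have all been run.
    // Lower values are better.
    public final class Top3Sketch {

        // prioritizedChanges[i] = change magnitude of the benchmark at position i
        // of the prioritized order.
        static double top3(double[] prioritizedChanges) {
            int n = prioritizedChanges.length;

            // Identify the positions (in the prioritized order) of the three largest changes.
            Integer[] byMagnitude = new Integer[n];
            for (int i = 0; i < n; i++) byMagnitude[i] = i;
            Arrays.sort(byMagnitude,
                    Comparator.comparingDouble((Integer i) -> prioritizedChanges[i]).reversed());
            Set<Integer> topThree =
                    new HashSet<>(Arrays.asList(byMagnitude).subList(0, Math.min(3, n)));

            // Walk the prioritized order until all three have been encountered.
            int remaining = topThree.size();
            for (int position = 0; position < n; position++) {
                if (topThree.contains(position)) remaining--;
                if (remaining == 0) return (position + 1) / (double) n;
            }
            return 1.0;
        }
    }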

Overall Similar to APFD-P and Figure 6.3, Figure 6.4 shows the results of the double Scott-Knott test (analysis step 3) across all study objects. In line with the APFD-P results, the sole best TCP technique (rank 1) employs dynamic-coverage in combination with the total strategy, prioritizes benchmark parameters (benchmark-parameter), and retrieves coverage information only once per benchmark method (dc-benchm). additional is generally more effective than for APFD-P, with the first technique already ranked in cluster 2. Almost all additional techniques are ranked one cluster lower than the total technique with the same parameters.


Figure 6.4: Top-3 ranks across all study objects and for threshold t = 3%. For the independent variable descriptions, see Table 6.2

Regarding techniques relying on static-coverage, the best performing technique is the same as for APFD-P, i.e., total with granularity benchmark-parameter constructing CGs with sc-algoRTA, sc-roMAX, and sc-eps. Techniques employing sc-algoRTA again perform better than techniques with more precise CG analyses. The first technique using a different CG algorithm has rank 4. Techniques relying on sc-algoRTA likely perform better due to significantly fewer top 3 benchmarks with empty coverage sets (see Table 6.3). Different from APFD-P, sc-algo0CFA performs the worst, with all techniques but the ones using additional and benchmark-parameter having rank 7, only one above random. The best static-coverage techniques (rank 2 and 3) almost exclusively rely on the highest reflection option parameter available, i.e., sc-roMAX. Nonetheless, sc-ro does not have a considerable impact on lower ranked static-coverage techniques. Finally, we do not observe a Top-3 difference when using CGs with distinct entry points per benchmark (sc-epm) or unified entry points across all benchmarks (sc-eps).

For both static-coverage and dynamic-coverage techniques, benchmark granularity benchmark-parameter performs better than (or equal to) benchmark-method. This is in line with the findings from APFD-P. Finally, none of the 54 techniques perform worse than random overall.

RQ 1 Summary: TCP techniques on benchmarks are effective in terms of both APFD-P and Top-3. The best techniques achieve an APFD-P value between 0.54 and 0.71, depending on the study object. Across all techniques, total outperforms additional and dynamic-coverage achieves better results than static-coverage. Regarding Top-3, the best techniques are able to capture the three largest performance changes after executing between 29% and 66% of the benchmarks contained in the entire suite, depending on the study object. Similar to APFD-P, the best technique follows the total strategy with dynamic-coverage; however, additional techniques perform almost equally well. Prioritizing benchmarks at the right granularity is important for both effectiveness measures, with benchmark-parameter outperforming benchmark-method. Finally, coverage-type-specific parameters can decide whether a technique is more or less effective. In particular, the static CG algorithm has a considerable impact on both effectiveness measures, with sc-algoRTA performing best.

6.4.2 RQ 2: Robustness

This section presents how robust the TCP techniques’ effectiveness is with respect to what magnitude of performance change is considered significant (see Section 6.3.4). To this end, we perform the following two analyses using a diverse set of thresholds t ∈ T, where T = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 50, 100}:

(1) We investigate the impact of the different thresholds t on the overall ranking of TCP techniques.

(2) We study the effectiveness difference of the techniques when using different thresholds t.

Note that we only investigate APFD-P robustness and refrain from analyzing Top-3 robustness, as by construction Top-3 considers the benchmarks exhibiting the 3 largest performance changes, which do not change for different performance change size thresholds t.


Figure 6.5: APFD-P ranks across all study objects and for all thresholds t, where t ∈ {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 50, 100}. For the independent variable descriptions, see Table 6.2

Technique Ranking across all Thresholds t

To investigate whether the overall effectiveness and the ranking among the 54 techniques change if we consider the APFD-P values of all studied thresholds t, we perform the analysis steps of RQ 1 (see Section 6.4.1) with the following minor modifications:

(1) We calculate the APFD-P values for every threshold t individually. This results in a single APFD-P value for each TCP technique applied to a study object, a version, and a threshold t.

(2) We apply the Scott-Knott ESD test for every study object, where a TCP technique is represented by the APFD-P values it achieves in every version for every threshold t, as opposed to every version with a single threshold t (i.e., t = 3 for RQ 1). This provides us with a single rank per technique and study object considering all thresholds t.

(3) We apply the Scott-Knott ESD test again on the ranks of the previous analysis step. This step remains unchanged compared to RQ 1.

Figure 6.5 shows the APFD-P ranks for each TCP technique across all study objects, versions, and thresholds. Similar to Figure 6.3, techniques that have no statistically significant difference among each other are ranked in the same cluster (facet), and techniques that are further to the left perform better across all study objects. The results are largely similar to the APFD-P results from RQ 1 in Figure 6.3. total techniques still perform the best, with the top 5 techniques (ranked 1 and 2) exclusively being total techniques. Similarly, techniques with dynamic-coverage outperform the ones with static-coverage. Nevertheless, we notice three differences where the threshold has a non-negligible impact on the ranking:

• Techniques with additional strategies “catch up” to total, with the first one already having rank 3 as opposed to rank 4.

• All techniques now perform better than random, whereas two techniques (additional with static-coverage) performed equally to random when considering the specific threshold t = 3.

• The confidence intervals are considerably narrower, indicating that the techniques’ APFD-P ranks are more stable, which gives us high confidence in the robustness of the ranking, even if a different threshold t is chosen.

Effectiveness Variation across different Thresholds t

The previous section showed that different thresholds t have a minor impact on the APFD-P rank of the TCP techniques. We now investigate whether the APFD-P value of a particular technique changes with different thresholds t. For this, we apply the following analysis steps:

(1) Similar to analysis step (1) of RQ 1 (see Section 6.4.1), we first calculate APFD-P (see Section 6.3.4) for every combination of the study objects; their versions; the 54 TCP techniques; and, different from RQ 1, the different performance change size thresholds t. This results in a single APFD-P value for each combination.

(2) We then calculate the APFD-P robustness for each TCP technique per project and version, defined as the difference between the maximum

and minimum APFD-P value. Intuitively, the robustness describes by how much the APFD-P values change when using different thresholds t. Let e_t^{TCP,p,v} denote the APFD-P value of a particular TCP technique TCP for a project p, in a version v, and for a threshold t. The set of all effectiveness values is then E^{TCP,p,v} = ⋃_{t∈T} {e_t^{TCP,p,v}}. Finally, the robustness r^{TCP,p,v} is defined as r^{TCP,p,v} = max(E^{TCP,p,v}) − min(E^{TCP,p,v}), with min and max being the minimum and maximum APFD-P value e, respectively. A robustness value r of 0.0 means that a TCP technique is robust and does not change with different thresholds t, whereas a robustness value of 1.0 indicates a completely unstable technique with large effectiveness differences for different thresholds t. A minimal sketch of this computation follows after this list.

(3) Finally, we apply the Scott-Knott ESD test for the TCP techniques, combining the robustness values of all study objects and versions, at significance level α = 0.05. This results in a single cluster rank per technique across all projects.
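As referenced in step (2), the robustness computation itself reduces to taking the spread of a technique’s APFD-P values over all thresholds; a minimal sketch (names are ours) is:

    import java.util.Map;

    // Sketch of the robustness r^{TCP,p,v}: the spread of a technique's APFD-P
    // values across all studied thresholds t for one project and version.
    public final class RobustnessSketch {

        // apfdpByThreshold: threshold t (in percent) -> APFD-P value e_t^{TCP,p,v}.
        static double robustness(Map<Integer, Double> apfdpByThreshold) {
            double min = Double.POSITIVE_INFINITY;
            double max = Double.NEGATIVE_INFINITY;
            for (double e : apfdpByThreshold.values()) {
                min = Math.min(min, e);
                max = Math.max(max, e);
            }
            return max - min; // 0.0 = fully robust, larger = more threshold-sensitive
        }
    }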

Per Study Object Figure 6.6 shows the APFD-P robustness (y-axis) for each study object (x-axis), where each data point of a project p is a robustness value r^{TCP,p,v}, for all TCP techniques TCP in all versions v. We observe that the threshold t has a considerable impact on a technique’s APFD-P value. Depending on the project, technique, and version, the APFD-P values vary between 0 and 0.62. Netty is the least impacted with a median robustness of 0.11, whereas Xodus is the most impacted project with 0.28. This shows that the decision of what is a significant performance change has a drastic impact on the evaluation of TCP techniques.

Overall Figure 6.7 depicts the APFD-P robustness (y-axis) for each TCP technique, across all study objects and versions. Techniques to the left are more robust than ones to the right, also indicated by the Scott-Knott rank (facets) reported by analysis step (3). Note that whiskers represent minimum and maximum robustness values and not confidence interval bounds as in previous figures.


Figure 6.6: APFD-P robustness for each study object across all versions and TCP techniques, when considering the thresholds t ∈ {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 50, 100}. The bar indicates the median, the diamond the mean, the box the IQR, and the whiskers extend to Q1 − 1.5 ∗ IQR and Q3 + 1.5 ∗ IQR


Figure 6.7: APFD-P robustness for each TCP technique across all study objects and versions, when considering the thresholds t ∈ {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 50, 100}. The shapes indicate the mean, and the whiskers show the minimum and maximum. For the independent variable descriptions, see Table 6.2

We observe that although there are statistically significant differences between the three clusters, the mean robustness does not change much among the techniques. total techniques tend to be more robust than additional, as the majority of techniques with rank 1 are total, and the majority of techniques with rank 3 are additional. 4 of 6 techniques with dynamic-coverage are ranked 1, whereas the other two are ranked 2. In terms of benchmark granularity, techniques with benchmark-method tend to be more robust than ones with benchmark-parameter. Finally, we do not observe robustness differences between techniques with different coverage-type-specific parameters, i.e., neither for techniques with dynamic-coverage (dc-bench) nor for techniques with static-coverage (sc-algo, sc-ro, and sc-ep).

RQ 2 Summary: The choice of the threshold t, at which a performance change is considered significant, has a minor impact on the overall ranking of techniques compared to RQ 1. Techniques relying on total and dynamic-coverage are still the best. However, the APFD-P value of a particular technique varies considerably with different thresholds t. Our results show median APFD-P differences between 0.11 and 0.28, with a maximum of up to 0.62, across study objects and techniques.

6.4.3 RQ 3: Efficiency

This section presents the efficiency of the studied TCP techniques, as defined by their runtime overhead with respect to a full benchmark suite execution. Figure 6.8 presents the three efficiency measures across all study objects: (1) analysis time, which is the sum of the following two times, in Figure 6.8a; (2) coverage time, i.e., the time it takes to extract coverage information, in Figure 6.8b; and (3) prioritization time, i.e., the time it takes to prioritize all benchmarks of a suite based on the coverage information, in Figure 6.8c. The techniques are ranked from the lowest overhead on the left to the highest overhead on the right and are, again, clustered into ranks with the Scott-Knott ESD test (similar to RQ 1 and RQ 2). The y-axis depicts the mean runtime overhead for each technique across all study objects and versions.

(a) Analysis time (b) Coverage time (c) Prioritization time

Figure 6.8: Efficiency of the techniques in terms of analysis time (i.e., coverage and prioritization time) as runtime overhead in percent of the full benchmark suite execution. Error bars represent the 95% confidence interval. For the independent variable descriptions, see Table 6.2

Individual technique runtimes, i.e., of a particular study object in a specified version, are normalized by the execution time of the full benchmark suite of the particular project and version. Whiskers represent the 95% bootstrap confidence interval of the mean overhead, using 10,000 bootstrap iterations and drawing samples with replacement.
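The bootstrap confidence intervals used here can be illustrated with a minimal sketch; we assume a simple percentile bootstrap of the mean, which may differ in details from the exact procedure used in the evaluation, and all names and sample values are ours.

    import java.util.Arrays;
    import java.util.Random;

    // Sketch of a percentile bootstrap 95% confidence interval of the mean,
    // e.g., over normalized runtime overheads of one TCP technique across
    // study objects and versions. 10,000 resamples with replacement follow
    // the description in the text; the percentile method is an assumption.
    public final class BootstrapCiSketch {

        static double[] bootstrapCi(double[] values, int iterations, Random random) {
            double[] means = new double[iterations];
            for (int it = 0; it < iterations; it++) {
                double sum = 0.0;
                for (int i = 0; i < values.length; i++) {
                    sum += values[random.nextInt(values.length)]; // resample with replacement
                }
                means[it] = sum / values.length;
            }
            Arrays.sort(means);
            double lower = means[(int) Math.floor(0.025 * (iterations - 1))];
            double upper = means[(int) Math.ceil(0.975 * (iterations - 1))];
            return new double[] {lower, upper};
        }

        public static void main(String[] args) {
            double[] overheads = {0.10, 0.12, 0.17, 0.11, 0.14}; // hypothetical overheads
            double[] ci = bootstrapCi(overheads, 10_000, new Random(42));
            System.out.printf("95%% CI of the mean: [%.3f, %.3f]%n", ci[0], ci[1]);
        }
    }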

Analysis Time From Figure 6.8a, we observe that the 27 techniques ranked in clusters 1 to 6 have a range of mean overheads between <1% and 3.7%, with confidence interval bounds from <1% to 4.5%. These techniques exclusively use static-coverage, with the majority relying on sc-algo0CFA and sc-algo01CFA. All techniques use the lowest reflection option (sc-roNONE) and/or a single CG entry point set (sc-eps).

The 23 techniques ranked in cluster 7 include all dynamic-coverage techniques and static-coverage techniques relying on the algorithms sc-algo0CFA and sc-algo01CFA in combination with multiple entry point sets (one per benchmark; sc-epm). This cluster contains the techniques with the widest variety of mean overheads, ranging from 10% to 23%, with confidence interval bounds between 6% and 42%. The dynamic-coverage techniques show a low overhead variability among the different study objects and versions, with mean overheads between 10% and 17%. Although all dynamic-coverage techniques are ranked in cluster 7, there is a significant difference between techniques with dc-benchm (10.7%) and dc-benchp (16.8%).

The techniques with the highest runtime overhead rely on static-coverage in combination with sc-algoRTA, sc-roMAX, and sc-epm. Their mean overhead ranges from 55% to 59%, with confidence interval bounds between 43% and 72%. Finally, we observe differences in analysis time overhead neither between the total and additional strategies nor between the benchmark-method and benchmark-parameter granularities. This indicates that prioritization time contributes less to the analysis time than coverage time.

Coverage Time Figure 6.8b shows the coverage time overhead per combination of coverage-type-specific parameters. This figure is different from all others because (1) coverage extraction is the first step of the analysis performed by TCP techniques and, hence, cannot be affected by previous steps; and (2) different prioritization strategies and benchmark granularities rely on the same coverage information.

We observe that the coverage time has similar overhead numbers to the analysis time, and the extractors are ranked into four clusters. This is a first indication that coverage time is indeed the deciding factor for a TCP technique’s analysis time.

Coverage extractors with rank 1 use the lowest reflection option (sc-roNONE) and a single CG entry point set (sc-eps), and only differ by their CG algorithm (sc-algoRTA, sc-algo0CFA, and sc-algo01CFA). All these extractors have an overhead below 1%.

The second rank (2), again, contains only static-coverage extractors. Three of these are the same extractors as in cluster 1 but with entry point sets per benchmark (sc-epm). The remaining extractor with rank 2 employs sc-algoRTA with the maximum reflection option (sc-roMAX) and a single entry point set (sc-eps). Their mean overhead ranges from 2.7% to 3.2%.

Cluster 3 contains all dynamic-coverage extractors and all remaining static-coverage extractors except one (which we discuss below). Their mean overheads are between 10.6% and 22.1%, with confidence interval bounds ranging from 6% to 41%. Interestingly, these extractors make up all but one of the TCP techniques in analysis time cluster 7 (see Figure 6.8a). The overheads of the dynamic-coverage extractors are responsible for the majority of the analysis time of the dependent TCP techniques.

Finally, the coverage extractor with the highest overhead (ranked 4) retrieves static-coverage using sc-algoRTA in combination with sc-roMAX and sc-epm. This also explains the worst TCP techniques (in analysis time cluster 8), which all rely on this extractor.

The overheads of the extractors in clusters 3 and 4 are almost equal to the TCP techniques’ analysis times in clusters 7 and 8. This shows that coverage is the major deciding factor for the long analysis times of TCP techniques.

Prioritization Time Figure 6.8c shows the prioritization time overhead per TCP technique across all study objects and versions. Note that here we are again interested in all 54 TCP techniques.

We first observe that the majority of the techniques, i.e., 51 of 54, have a mean prioritization time overhead below 1%. This confirms the suggested finding that coverage time, and not prioritization time, is the main contributor to TCP efficiency for most of the studied techniques. Nevertheless, three techniques show overheads worth mentioning (ranked 16 and 17); all three rely on static-coverage, apply the additional strategy with granularity benchmark-parameter, and use sc-algoRTA as CG algorithm.

The technique in rank 16 uses the lowest reflection option (sc-roNONE) in combination with a single CG entry point set (sc-eps), resulting in a mean overhead of 2%, with confidence interval bounds between 1.2% and 2.8%. The reason why this technique is only one cluster away from the worst techniques is the high number of covered methods per benchmark, i.e., on average 3,005.32 as depicted in Table 6.3. However, this technique has a relatively low analysis time overhead of 2.3%, which is largely caused by the prioritization overhead.

Finally, the two techniques with the highest overhead (with rank 17) use the maximum reflection option (sc-roMAX). Their mean overhead is 3.8% and 8% for sc-epm and sc-eps, respectively. The technique with sc-epm also has the highest mean analysis time overhead at 59%, and the technique with sc-eps is the only one in analysis time cluster 7, which is due to the high prioritization overhead. Both techniques owe their high prioritization overheads to the number of covered methods per benchmark, i.e., on average 7,497.67 (sc-epm) and 14,855.13 (sc-eps).

RQ 3 Summary: The efficiency of TCP techniques applied to benchmarks ranges from <1% to 59% overhead of the duration of a full benchmark suite execution. Techniques relying on static-coverage can have very low (often below 4%) or very high overheads (sometimes exceeding 55%), whereas the overhead of dynamic-coverage techniques is less varied (between 10% and 17%). In most cases the time to extract coverage information is the major contributor to the overall analysis time. However, additional techniques that use coverage information with a high number of called methods per benchmark spend a considerable amount of time “just” applying the prioritization strategy.

6.5 Discussion and Recommendations

In this section, we discuss the trade-off between TCP effectiveness and efficiency, provide recommendations for researchers and practitioners, and compare our findings to the findings of TCP for unit tests.

6.5.1 Choosing a TCP Technique

Our results, especially those for RQ 1 and RQ 3, show that the studied TCP techniques are to varying degrees effective and efficient. However, whether a specific technique is also useful in practice depends on both effectiveness and efficiency. We have found that the most effective techniques in terms of APFD-P and Top-3 use dynamic-coverage. The best dynamic-coverage technique uses total, benchmark-parameter, and dc-benchm and has an analysis time overhead of approximately 11%. In case of very long-running benchmark suites (e.g., 2.71h for Log4j 2 or 38.45h for Eclipse Collections), even an 11% overhead may still be worthwhile if it leads to performance changes being identified sooner.

However, if an 11% overhead is too expensive, a technique relying on static-coverage might be an attractive alternative. The most effective static-coverage technique, for both APFD-P and Top-3, in our study uses total, benchmark-parameter, sc-algoRTA, sc-roMAX, and sc-eps. This technique is also efficient, with a mean analysis overhead of below 3%.

It is important to keep in mind that TCP can be less effective than a random ordering, depending on the project and the parameterization of the technique (see Tables 6.4 and 6.5). However, on average across all studied projects, all techniques are superior to random (see Figures 6.3 to 6.5).

Practitioners who are keen on applying TCP for their microbenchmark suites should carefully evaluate whether they would benefit from it, by answering the following questions:

(1) Is the suite runtime too long to wait for its completion, and can we, therefore, benefit from prioritization?

(2) Which analysis overhead is acceptable (in relation to the suite runtime)?

(3) Which technique is effective and efficient for our project?

6.5.2 Efficiency of Static vs. Dynamic Coverage Techniques

A few static-coverage techniques are almost as effective as dynamic-coverage techniques, and the majority are as efficient as, or considerably more efficient than, dynamic-coverage techniques. However, it cannot generally be assumed that static-coverage is always faster than dynamic-coverage, i.e., extensive overheads are possible with the “wrong” technique parameterization. Our results show that in most cases the coverage time is the deciding factor for whether a technique is fast or slow. For example, highly effective static-coverage techniques, such as the ones ranked in cluster 3 for APFD-P (see Figure 6.3) as well as Top-3 (see Figure 6.4), have the highest runtime overhead at above 50%.

Often a change of one coverage-type-specific parameter can already improve the efficiency drastically without sacrificing effectiveness. The technique relying on total with static-coverage, benchmark-parameter, sc-algoRTA, and sc-epm is equally effective for either sc-ro value. However, sc-roMAX has a mean overhead of 55%, whereas for sc-roNONE the overhead is below 3%. This shows that sophisticated reflection handling mechanisms of static CG libraries can have a detrimental impact on the efficiency of static-coverage techniques.

6.5.3 Impact of Coverage-Type-Specific Parameters

Our study is, to the best of our knowledge, the first to define coverage-type-specific parameters of TCP techniques and to assess their impact on effectiveness and efficiency. Previous studies either fixed these parameters, e.g., the static CG algorithm, or do not explicitly mention them [Zhang et al., 2009a; Luo et al., 2016a, 2019; Mostafa et al., 2017]. Our results show that these parameters can have an impact on both effectiveness and efficiency. Hence, they cannot be neglected in rigorous experimental evaluations. We hypothesize that coverage-type-specific parameters have a similar impact on TCP effectiveness in functional testing research. Future studies should validate this hypothesis.

For dynamic-coverage techniques, choosing between dc-benchm and dc-benchp can affect both effectiveness and efficiency: favoring dc-benchm over dc-benchp, i.e., retrieving dynamic-coverage per benchmark method rather than per benchmark parameterization, (1) reduces the overhead from 17% to 11%, (2) is more effective regarding Top-3, and (3) remains equally effective in terms of APFD-P.

For techniques with static-coverage, both effectiveness and efficiency are drastically impacted by coverage-type-specific parameters. Effectiveness changes mostly with different CG algorithms. Surprisingly, the least precise algorithm among the studied ones, i.e., sc-algoRTA, enables the most effective techniques. This is likely due to sc-algo0CFA and sc-algo01CFA not being able to extract coverage information for many benchmarks (see Table 6.3). As already discussed, changes to coverage-type-specific parameters can also lead to efficiency drops. In particular, more sophisticated reflection options, i.e., sc-roMAX instead of sc-roNONE, and constructing CGs per benchmark with smaller, more specific entry point sets (sc-epm) are often much less efficient at a similar effectiveness.

6.5.4 Choice of Threshold

Our study’s robustness results (RQ 2 in Section 6.4.2) show that, depending on which performance change size is considered to be significant, i.e., as defined by the threshold t, the concrete APFD-P values have a median difference between 0.11 and 0.28, depending on the project. The robustness differences are only small across techniques, demonstrating that all techniques are similarly affected by different thresholds t. One could argue that the threshold should always be set to t = 0, which would consider all performance changes of any size for the APFD-P calculation. However, this can be problematic for two reasons:

(1) Performance experiments are prone to measurement bias, where the measurement result does not accurately reflect the (software) system’s true performance [Georges et al., 2007; Mytkowicz et al., 2009; Curtsinger and Berger, 2013]. Non-obvious execution environment peculiarities can affect the measurement, such as environment variables, stack sizes, background processes, or frequency scaling. Consequently, a measured performance change might in fact be due to a confounding factor and not due to a change to the software. Even if one follows a rigorous measurement methodology, the absence of measurement bias cannot be guaranteed. Therefore, false positives in the detection of performance changes impact the effectiveness evaluation of TCP techniques. To manage measurement bias, a threshold t can filter out these changes, as sketched after this list.

(2) Multiple co-occurring performance changes between two software versions are common [Chen and Shang, 2017], but often they are of small size. Consequently, developers might only be interested in changes of a certain size, e.g., everything below a 10% change is not worth investigating. The exact threshold t that is relevant depends on the project, developer, and application scenario. Our results show that defining this threshold can considerably change the evaluated effectiveness of a technique.
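To make point (1) concrete, the following minimal sketch filters measured changes by a threshold t before they enter an effectiveness calculation. The names are ours, and the relative-change formula is one common choice, not necessarily the exact one used in Section 6.3.4.

    // Sketch of threshold-based filtering of measured performance changes:
    // only relative changes (in percent) whose magnitude exceeds the threshold t
    // are treated as significant; everything else is considered noise.
    public final class ChangeFilterSketch {

        // Relative change of a benchmark between two adjacent versions,
        // based on the mean execution times of the old and new version.
        static double relativeChange(double meanOld, double meanNew) {
            return (meanNew - meanOld) / meanOld * 100.0;
        }

        static double[] filterByThreshold(double[] relativeChanges, double thresholdPercent) {
            double[] filtered = new double[relativeChanges.length];
            for (int i = 0; i < relativeChanges.length; i++) {
                double c = relativeChanges[i];
                filtered[i] = Math.abs(c) > thresholdPercent ? c : 0.0; // below t: treat as no change
            }
            return filtered;
        }
    }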

We suggest that researchers conducting all kinds of performance experiments consider different thresholds t when evaluating the effectiveness of their approaches. In particular, research on TCP for benchmarks must consider the sensitivity of the evaluation metric APFD-P. Practitioners eager to apply TCP on their benchmark suites should decide which performance change sizes they are interested in capturing early, as this can change which technique is optimal for their usage scenario.

6.5.5 Comparison to TCP for Unit Tests

To assess how TCP for benchmarks compares to TCP for unit tests, we compare our results to the ones of Luo et al. [2019]. Their study is the most recent, large-scale study of static and dynamic TCP techniques for unit tests.

Different from their study, ours investigates benchmark granularity on method (benchmark-method) and parameter (benchmark-parameter) level, whereas theirs looks at class and method level. The implementation of our static-coverage techniques (total and additional) resembles theirs; both are based on Zhang et al. [2009a]. Our dynamic-coverage techniques rely on coverage information on method level, whereas theirs rely on statement-level coverage.

There is a conceptual difference between dynamic TCP for unit tests and dynamic TCP for benchmarks: coverage information (for unit tests) is usually acquired during the test executions of the previous version. Luo et al. [2019] refrain from studying the efficiency of dynamic techniques, because “the temporal overhead is quite high and well-studied”. As benchmarks are executed many times to retrieve reliable results (see Section 6.2), TCP for benchmarks can utilize a single benchmark execution of a new version to instrument the code and retrieve dynamic coverage information.

Luo et al. [2019] report the following APFD values for TCP with method granularity (which is more effective than class granularity): techniques with static-coverage achieve on average 0.764 and 0.818 across their study objects, whereas the ones with dynamic-coverage reach 0.809 and 0.898, respectively for total and additional.

Our results highlight four major observations compared to unit tests:

(1) TCP is considerably less effective for benchmarks than for unit tests, if we assume that values for APFD and APFD-P are comparable. This is likely due to performance changes being less correlated with the number of covered methods (or statements) than functional faults are. To circumvent this situation, TCP for benchmarks requires better approximations for performance changes than “just” the sum of all covered items. To this end, Mostafa et al. [2017] build a performance change impact model for collection-intensive software, and Chen et al. [2020] build a runtime-cost-aware machine learning model. However, both studies (partially) evaluated their techniques with unit test suites which are executed in a benchmark-like fashion. In this study, we explored the state of traditional TCP applied to benchmarks among a large set of parameter combinations. It is our hope that future research can use this foundation to develop techniques that are more effective for prioritizing benchmarks.

(2) The strategy total is more effective than additional for benchmarks when relying on either static-coverage or dynamic-coverage, whereas the opposite is true for TCP for unit tests. A potential reason for this relates to the definition of APFD-P by Mostafa et al. [2017]: it does not distinguish between multiple (performance) faults detected by the same benchmark as APFD does, but it considers benchmark results as a single fault with different severities, i.e., the performance change size. Future research should aim at devising a new effectiveness metric for TCP for benchmarks. This would require building a data set that distinguishes root causes of distinct performance changes (with their severities/change sizes) per benchmark. However, it is unclear whether this is feasible, as performance is non-linear and not directly additive [Siegmund et al., 2015b].

(3) The efficiency of TCP for benchmarks is less of a concern compared to TCP for unit tests, at least for the majority of the studied techniques.

About half of the static-coverage techniques have an overhead below 4%, whereas the dynamic-coverage techniques have an overhead between 10% and 17%. This reasonable overhead potentially makes TCP for benchmarks applicable in practice.

(4) The performance change of a benchmark executed on two adjacent versions is a continuous, open-ended value, whereas the outcome of a unit test is binary, i.e., either it exposes a fault or not (disregarding flaky tests for simplicity). This leads to the challenge that measurement uncertainty and bias impact the robustness of the technique evaluation, as studied for RQ 2 in Section 6.4.2.

6.6 Related Work

Our study is related to three main areas of research: (1) TCP of functional tests, (2) performance testing, and (3) performance measurements.

6.6.1 Test Case Prioritization of Functional Tests

Regression testing for functional/unit tests has been extensively explored [Yoo and Harman, 2012], with the three main techniques being test suite minimization, regression test selection (RTS), and test case prioritization (TCP). Our study takes the traditional TCP techniques on unit tests [Rothermel et al., 1999], i.e., total and additional, and studies them in the context of software benchmarks.

TCP’s main idea is to re-order test cases with the goal of maximizing the fault-exposure rate, i.e., finding more faults sooner. Rothermel et al. [1999, 2001] coined the term TCP and introduced the main techniques total and additional. Both are greedy, white-box prioritization techniques relying on coverage information, such as statement, branch, or method coverage. Where total assigns weights once to all tests and ranks them accordingly, additional re-assigns weights to prioritize tests that execute more yet-uncovered regions of the production code (a condensed sketch of both strategies follows below). Elbaum et al. [2002] extended the study to a total of 18 different techniques with rankings based on fault exposure and fault existence probabilities. Elbaum et al. [2001] extended the TCP effectiveness measure to incorporate the cost of tests and faults.

More recent trends in greedy TCP techniques combine total and additional [Zhang et al., 2013; Hao et al., 2014] or utilize less-expensive static coverage information [Zhang et al., 2009a; Mei et al., 2012]. Other, non-greedy techniques have been proposed to utilize search-based algorithms [Walcott et al., 2006; Li et al., 2007], ant-colony optimization [Singh et al., 2010], knapsack solvers [Alspaugh et al., 2007], and integer linear programming [Zhang et al., 2009b]. Time-aware techniques [Walcott et al., 2006; Alspaugh et al., 2007; Zhang et al., 2009b; Do et al., 2010; You et al., 2011] study the impact of time on TCP effectiveness.

With the emergence of CI and new code versions arriving at high velocity, efficient black-box techniques are on the rise [Elbaum et al., 2014; Liang et al., 2018; Haghighatkhah et al., 2018]. Henard et al. [2016] investigate the differences between white-box and black-box techniques, and Luo et al. [2016a, 2019] compare static and dynamic techniques. Finally, recent efforts assess TCP techniques in real-world contexts [Lu et al., 2016], contrast real faults to faults based on mutation [Luo et al., 2018], and incorporate developer knowledge into the ranking [Tonella et al., 2006].

Our study draws inspiration from many of the aforementioned papers. It studies traditional techniques, i.e., total and additional, on method-level granularity, investigates the impact of varying prioritization parameters, focuses on efficiency, and applies all of it to software microbenchmarks.
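The condensed sketch of the two greedy strategies referenced above follows; the data structures and names are ours, and coverage sets are assumed to be method-level, as in this chapter. The reset step in additional (clearing the covered set once no benchmark adds new methods) follows the common description of the strategy in the TCP literature.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    // Sketch of the two classic greedy TCP strategies on method coverage.
    // total: rank benchmarks once by the size of their coverage set.
    // additional: repeatedly pick the benchmark covering the most not-yet-covered
    // methods; when nothing new can be covered, reset the covered set.
    public final class GreedyPrioritization {

        static List<String> total(Map<String, Set<String>> coverage) {
            List<String> order = new ArrayList<>(coverage.keySet());
            order.sort(Comparator.comparingInt((String b) -> coverage.get(b).size()).reversed());
            return order;
        }

        static List<String> additional(Map<String, Set<String>> coverage) {
            List<String> order = new ArrayList<>();
            Set<String> remaining = new HashSet<>(coverage.keySet());
            Set<String> covered = new HashSet<>();
            while (!remaining.isEmpty()) {
                String best = null;
                int bestGain = -1;
                for (String b : remaining) {
                    Set<String> gain = new HashSet<>(coverage.get(b));
                    gain.removeAll(covered);
                    if (gain.size() > bestGain) {
                        bestGain = gain.size();
                        best = b;
                    }
                }
                if (bestGain == 0 && !covered.isEmpty()) {
                    covered.clear(); // nothing new can be covered: reset and re-rank
                    continue;
                }
                order.add(best);
                remaining.remove(best);
                covered.addAll(coverage.get(best));
            }
            return order;
        }
    }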

6.6.2 Performance Testing

Software performance engineering (SPE) can be conducted in two general ways: measurement-based and model-based [Woodside et al., 2007]. Our work focuses on a specific technique of measurement-based SPE, i.e., performance testing with software benchmarks. Traditional performance testing research dedicated its effort to system-level load testing and the related stress, soak, and spike testing [Weyuker and Vokolos, 2000; Menascé, 2002; Jiang and Hassan, 2015].

More recent work on load testing focuses on industrial contexts [Nguyen et al., 2014; Foo et al., 2015] and time reduction techniques [AlGhamdi et al., 2016, 2020; He et al., 2019].

The other form of performance testing, i.e., software benchmarking, has only received more attention from research in recent years. Software benchmarking is to load testing what unit testing is to functional system/integration testing. General studies empirically investigate the current state of software benchmarking [Leitner and Bezemer, 2017; Stefan et al., 2017]. Targeted research on their usage for raising the performance awareness of developers [Horký et al., 2015], the changes that they detect [Chen and Shang, 2017], and their applicability in CI [Laaber and Leitner, 2018a] shows the potential, but also the challenges, of using software benchmarks.

Challenges include the complexity of writing good benchmarks, executing them in a rigorous fashion, and assessing their results with statistical techniques. Damasceno Costa et al. [2019] devise a technique to statically detect bad practices, Laaber et al. [2019b] study their behavior when executed on cloud infrastructure, and Bulej et al. [2012, 2017a] introduce a declarative method for comparing different benchmark results using rigorous statistical testing. Ding et al. [2020] study whether unit tests can be effectively used for detecting performance changes. Laaber et al. [2020c] devise an approach to stop microbenchmarks once their results are sufficiently stable.

In the context of regression testing, only a handful of studies have been conducted so far. Huang et al. [2014] predict the performance impact of a new software version to decide whether this new version should be tested for performance. Pradel et al. [2014] and Yu and Pradel [2017] address performance regression testing for concurrent classes. Three regression test selection (RTS) techniques employ performance-impact prediction [de Oliveira et al., 2017], genetic algorithms [Alshoaibi et al., 2019], and machine learning classifiers [Chen et al., 2020] to select important benchmarks, i.e., the ones that are likely to expose performance changes, for every software version.

Closest to our work are the studies by Mostafa et al. [2017] and Chen et al. [2020], which are, to the best of our knowledge, the only other works on TCP for performance tests. Mostafa et al. [2017] focus on collection-intensive software and decide, based on code changes and a performance-impact model, which performance tests to prioritize. Their paper uses the “best techniques” from unit testing research as baselines. We, however, show that the assumption that TCP techniques from unit testing research behave identically for performance tests does not hold. The primary goal of Chen et al. [2020] is to predict whether tests are performance-affected, e.g., for RTS, but they also prioritize tests by whether they are affected, normalized by their runtime cost. Both works, however, (partially) use unit tests executed in a benchmark-like fashion as performance tests. It is unclear whether these are even comparable to dedicated performance tests, i.e., benchmarks, which are the objects in our study. We further show how the uncertainty of performance measurements and the choice of prioritization parameters impact TCP effectiveness and efficiency.

6.6.3 Performance Measurements

The results of any software benchmarking study are affected by the validity of the underlying performance measurements. A lot can go wrong, and many mistakes can be made. Consequently, measurement bias has in the past led researchers to draw wrong conclusions [Mytkowicz et al., 2009]. Effects due to memory layout [Curtsinger and Berger, 2013] and dynamic compilation [Kalibera and Jones, 2012, 2013] require careful experiment design and statistical evaluation. Georges et al. [2007] provide a guide for performance evaluations in Java. To retrieve reliable results from unreliable environments (such as clouds), Papadopoulos et al. [2019] outline a comprehensive methodology. We follow the methodology of Georges et al. [2007] for the performance changes used in the effectiveness measure calculation, apply cloud performance measurement methodologies [Papadopoulos et al., 2019; Laaber et al., 2019b] for the efficiency results, and employ rigorous statistical techniques [Kalibera and Jones, 2012].
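At the core of these methodologies lies a statistical comparison of two measurement distributions. The sketch below illustrates the general idea with a simplified, single-level percentile bootstrap over the ratio of means; the actual analysis in this chapter follows the hierarchical bootstrap of Kalibera and Jones [2012], and all measurement values and names here are hypothetical.

import java.util.Arrays;
import java.util.Random;

public class BootstrapChangeDetection {

    static double mean(double[] xs) {
        double sum = 0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    // Draw a bootstrap resample (sampling with replacement) from the measurements.
    static double[] resample(double[] xs, Random rnd) {
        double[] out = new double[xs.length];
        for (int i = 0; i < xs.length; i++) out[i] = xs[rnd.nextInt(xs.length)];
        return out;
    }

    // Percentile-bootstrap (1 - alpha) confidence interval for mean(newV) / mean(oldV).
    static double[] ratioCi(double[] oldV, double[] newV, int iterations, double alpha, long seed) {
        Random rnd = new Random(seed);
        double[] ratios = new double[iterations];
        for (int i = 0; i < iterations; i++) {
            ratios[i] = mean(resample(newV, rnd)) / mean(resample(oldV, rnd));
        }
        Arrays.sort(ratios);
        int lo = (int) Math.floor(alpha / 2 * iterations);
        int hi = (int) Math.ceil((1 - alpha / 2) * iterations) - 1;
        return new double[] { ratios[lo], ratios[hi] };
    }

    public static void main(String[] args) {
        double[] oldVersion = { 101, 99, 103, 98, 100, 102 };   // ns/op, hypothetical
        double[] newVersion = { 111, 109, 112, 108, 110, 113 }; // roughly a 10% slowdown
        double[] ci = ratioCi(oldVersion, newVersion, 10_000, 0.05, 42);
        // If the interval excludes 1.0, the change is treated as a performance change.
        System.out.printf("95%% CI for new/old mean ratio: [%.3f, %.3f]%n", ci[0], ci[1]);
    }
}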

6.7 Conclusions

This paper presents the first investigation of whether standard TCP techniques from unit testing research are applicable in the context of software microbenchmarks. We empirically studied the effectiveness, robustness, and efficiency of these techniques and investigated the impact of four independent variables, i.e., total and additional strategies, benchmark granularity on method and parameter level, dynamic and static coverage types, and four coverage-type-specific parameters. The unique combinations of these independent variables result in 54 different TCP techniques, which we evaluated on the most extensive JMH data set to date, consisting of 10 Java OSS projects across 161 versions, with 1,829 distinct microbenchmarks and 6,460 distinct parameterizations.

We found that techniques with the total strategy outperform additional techniques. The mean effectiveness ranges between 0.54 and 0.71 APFD-P, and the techniques require executing between 29% and 66% (Top-3) of the total benchmark suite. The size at which a performance change is considered significant impacts the effectiveness and can change the APFD-P values considerably, i.e., by a median difference of between 0.11 and 0.28; however, the ranking among different techniques is hardly affected by it.

In terms of efficiency, we showed that the best technique has an overhead of 11% and uses dynamic-coverage, making TCP for benchmarks feasible. Techniques with static-coverage often reduce the overhead even further, often below 4%, while still being competitive in terms of effectiveness. Our efficiency analysis also revealed that the assumption that static-coverage is always cheaper than dynamic-coverage does not hold; “wrong” parameterization can drastically decrease efficiency, sometimes exceeding 55% overhead. The choice of independent variable values has a considerable effect on effectiveness and efficiency, sometimes even rendering the TCP technique inferior to a random ordering and imposing a large analysis overhead.

Our results are of high importance to future research that considers standard TCP techniques as baselines for novel techniques, and they raise awareness of how strongly prioritization parameters impact TCP effectiveness and efficiency.

6.7.1 Future Research

It is our hope that this paper is only the beginning of performance test prioritization and selection research. We envision five directions that seem worthwhile to investigate.

• Current TCP techniques for performance tests are either tailored to specific types of software [Pradel et al., 2014; Mostafa et al., 2017] or evaluated on unit tests which are used as performance tests [Chen et al., 2020]. Future research should devise generally applicable TCP techniques for benchmarks and evaluate these and existing techniques on benchmarks.

• This study investigated white-box TCP techniques, whereas black-box techniques have not been explored. These could be based on benchmark quality attributes or test similarity.

• It is unclear which information developers need to decide which benchmarks to execute on new commits and in which order. Empirical studies involving humans could influence design decisions for better benchmark TCP and RTS.

• Choosing TCP hyper-parameters (i.e., independent variable values) that maximize TCP effectiveness and efficiency is non-trivial; developers would require better support for this choice.

• Combining TCP and RTS techniques might provide optimal results in temporally-constrained settings, such as CI.

Bibliography

Ali Abedi and Tim Brecht. Conducting repeatable experiments in highly vari- able environments. In Proceedings of the 8th ACM/SPEC on International Conference on Performance Engineering, ICPE 2017, pages 287–292, New York, NY, USA, 2017. ACM. ISBN 978-1-4503-4404-3. doi: 10.1145/3030207.3030229. URL http://doi.acm.org/10.1145/ 3030207.3030229.

Amritanshu Agrawal, Wei Fu, Di Chen, Xipeng Shen, and Tim Menzies. How to “DODGE” complex software analytics. IEEE Transactions on Software Engineering, pages 1–1, 2019. ISSN 2326-3881. doi: 10.1109/tse.2019.2945020. URL http://dx.doi.org/10.1109/tse.2019.2945020.

Tarek M. Ahmed, Cor-Paul Bezemer, Tse-Hsun Chen, Ahmed E. Hassan, and Weiyi Shang. Studying the effectiveness of application performance man- agement (APM) tools for detecting performance regressions for web applica- tions: An experience report. In Proceedings of the 13th International Conference on Mining Software Repositories, MSR 2016, pages 1–12, New York, NY, USA, 2016. ACM. ISBN 978-1-4503-4186-8. doi: 10.1145/2901739.2901774. URL http://doi.acm.org/10.1145/2901739.2901774.

Adil Aijaz. Release frequency: A need for speed, February 2019. URL https://dzone.com/articles/release-frequency-a-need-for-speed. Accessed: 7.12.2020.

Akamai Technologies Inc. Find out how you stack up to new in- dustry benchmarks for mobile page speed, April 2017. URL https://www.akamai.com/uk/en/about/news/press/2017- press/akamai-releases-spring-2017-state-of-online- retail-performance-report.jsp. Accessed: 20.11.2020.

Haldun Akoglu. User's guide to correlation coefficients. Turkish Journal of Emergency Medicine, 18(3):91–93, September 2018. doi: 10.1016/j.tjem.2018. 08.001. URL https://doi.org/10.1016/j.tjem.2018.08.001.

Mohammad Mejbah ul Alam, Tongping Liu, Guangming Zeng, and Abdul- lah Muzahid. SyncPerf: Categorizing, detecting, and diagnosing synchro- nization performance bugs. In Proceedings of the 12th European Conference on Computer Systems, EuroSys 2017, pages 298–313, New York, NY, USA, 2017. ACM. ISBN 9781450349383. doi: 10.1145/3064176.3064186. URL https://doi.org/10.1145/3064176.3064186.

Hammam M. AlGhamdi, Mark D. Syer, Weiyi Shang, and Ahmed E. Hassan. An automated approach for recommending when to stop performance tests. In Proceedings of the 32nd IEEE International Conference on Software Maintenance and Evolution, ICSME 2016, pages 279–289, October 2016. doi: 10.1109/ ICSME.2016.46.

Hammam M. AlGhamdi, Cor-Paul Bezemer, Weiyi Shang, Ahmed E. Hassan, and Parminder Flora. Towards reducing the time needed for load testing. Journal of Software: Evolution and Process, July 2020. doi: 10.1002/smr.2276. URL https://doi.org/10.1002/smr.2276.

Deema Alshoaibi, Kevin Hannigan, Hiten Gupta, and Mohamed Wiem Mkaouer. PRICE: Detection of performance regression introducing code changes using static and dynamic metrics. In Proceedings of the 11th International Symposium on Search Based Software Engineering, SSBSE 2019, pages 75–88. Springer Nature, 2019. doi: 10.1007/978-3-030-27455-9_6. URL https://doi.org/10.1007/978-3-030-27455-9_6.

Sara Alspaugh, Kristen R. Walcott, Michael Belanich, Gregory M. Kapfhammer, and Mary Lou Soffa. Efficient time-aware prioritization with knapsack solvers. In Proceedings of the 1st ACM International Workshop on Empirical Assessment of Software Engineering Languages and Technologies, WEASELTech 2007, pages 13–18, New York, NY, USA, 2007. ACM. ISBN 978-1-59593-880-0. doi: 10.1145/1353673.1353676. URL http://doi.acm.org/10.1145/ 1353673.1353676.

Daniel An. Akamai online retail performance report: Milliseconds are critical, February 2018. URL https://www.thinkwithgoogle.com/ marketing-strategies/app-and-mobile/mobile-page-speed- new-industry-benchmarks/. Accessed: 20.11.2020.

Lars Ole Andersen. Program Analysis and Specialization for the C Programming Language. PhD thesis, University of Copenhagen, Universitetsparken 1, DK-2100 Copenhagen, Denmark, May 1994.

Muhammad Moiz Arif, Weiyi Shang, and Emad Shihab. Empirical study on the discrepancy between performance testing results from virtual and physical environments. Empirical Software Engineering, 23(3):1490–1518, June 2018. ISSN 1573-7616. doi: 10.1007/s10664-017-9553-x. URL https://doi.org/10.1007/s10664-017-9553-x.

Dave Artz. The secret weapons of the AOL optimization team, 2009. URL http://assets.en.oreilly.com/1/event/29/The%20Secret% 20Weapons%20of%20the%20AOL%20Optimization%20Team% 20Presentation.pdf. Accessed: 19.11.2020.

David F. Bacon and Peter F. Sweeney. Fast static analysis of C++ virtual function calls. In Proceedings of the 11th ACM SIGPLAN Conference on Object-oriented Programming, Systems, Languages, and Applications, OOPSLA 1996, pages 324–341, New York, NY, USA, 1996. ACM. ISBN 0-89791-788-X. doi: 10.1145/236337.236371. URL http://doi.acm.org/10.1145/236337.236371.

Eytan Bakshy and Eitan Frachtenberg. Design and analysis of benchmarking experiments for distributed internet services. In Proceedings of the 24th International Conference on World Wide Web, WWW 2015, pages 108–118, Republic and Canton of Geneva, Switzerland, 2015. International World Wide Web Conferences Steering Committee. ISBN 978-1-4503-3469-3. doi: 10.1145/2736277.2741082. URL https://doi.org/10.1145/2736277. 2741082.

Sebastian Baltes and Paul Ralph. Sampling in software engineering research: A critical review and guidelines. CoRR, abs/2002.07764, 2020. URL https: //arxiv.org/abs/2002.07764.

Cornel Barna, Marin Litoiu, and Hamoun Ghanbari. Autonomic load-testing framework. In Proceedings of the 8th ACM International Conference on Auto- nomic Computing, ICAC 2011, pages 91–100, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0607-2. doi: 10.1145/1998582.1998598. URL http://doi.acm.org/10.1145/1998582.1998598.

Moritz Beller, Georgios Gousios, and Andy Zaidman. Oops, my tests broke the build: An explorative analysis of Travis CI with GitHub. In Proceedings of the 14th International Conference on Mining Software Repositories, MSR 2017, pages 356–367, Piscataway, NJ, USA, 2017. IEEE Press. ISBN 978-1-5386- 1544-7. doi: 10.1109/MSR.2017.62. URL https://doi.org/10.1109/ MSR.2017.62.

Cor-Paul Bezemer, Simon Eismann, Vincenzo Ferme, Johannes Grohmann, Robert Heinrich, Pooyan Jamshidi, Weiyi Shang, André van Hoorn, Monica Villavicencio, Jürgen Walter, and Felix Willnecker. How is performance addressed in DevOps? In Proceedings of the 10th ACM/SPEC International Conference on Performance Engineering, ICPE 2019, pages 45–50, New York, NY, USA, 2019. ACM. ISBN 978-1-4503-6239-9. doi: 10.1145/3297663.3309672. URL http://doi.acm.org/10.1145/3297663.3309672.

Stephen M. Blackburn, Robin Garner, Chris Hoffmann, Asjad M. Khang, Kathryn S. McKinley, Rotem Bentzur, Amer Diwan, Daniel Feinberg, Daniel Frampton, Samuel Z. Guyer, Martin Hirzel, Antony Hosking, Maria Jump, Han Lee, J Eliot B Moss, Aashish Phansalkar, Darko Stefanović, Thomas VanDrunen, Daniel von Dincklage, and Ben Wiedermann. The DaCapo benchmarks: Java benchmarking development and analysis. In Proceedings of the 21st Annual ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages, and Applications, OOPSLA 2006, pages 169–190, New York, NY, USA, 2006. Association for Computing Machinery. ISBN 1595933484. doi: 10.1145/1167473.1167488. URL https://doi.org/10.1145/1167473.1167488.

Stephen M. Blackburn, Amer Diwan, Matthias Hauswirth, Peter F. Sweeney, José Nelson Amaral, Tim Brecht, Lubomír Bulej, Cliff Click, Lieven Eeckhout, Sebastian Fischmeister, and et al. The truth, the whole truth, and nothing but the truth: A pragmatic guide to assessing empirical evaluations. ACM Transactions on Programming Languages and Systems, 38(4), October 2016. ISSN 0164-0925. doi: 10.1145/2983574. URL https://doi.org/10.1145/ 2983574.

Andrew P. Bradley. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7):1145–1159, July 1997. doi: 10.1016/s0031-3203(96)00142-2. URL https://doi.org/10. 1016/s0031-3203(96)00142-2.

Leo Breiman. Random forests. Machine Learning, 45(1):5–32, October 2001. doi: 10.1023/a:1010933404324. URL https://doi.org/10.1023/a: 1010933404324.

Andreas Brunnert and Helmut Krcmar. Continuous performance evaluation and capacity planning using resource profiles for enterprise applications. Journal of Systems and Software, 123:239 – 262, 2017. ISSN 0164-1212. doi: 10. 1016/j.jss.2015.08.030. URL https://doi.org/10.1016/j.jss.2015. 08.030.

Michael Buckland and Fredric Gey. The relationship between Recall and Precision. Journal of the American Society for Information Science, 45(1):12–19, January 1994. doi: 10.1002/(sici)1097-4571(199401)45:1<12::aid-asi2>3.0.co;2-l. URL https://doi.org/10.1002/(sici)1097-4571(199401)45:1<12::aid-asi2>3.0.co;2-l.

Lubomír Bulej, Tomáš Bureš, Jaroslav Keznikl, Alena Koubková, Andrej Podzimek, and Petr Tůma. Capturing performance assumptions using Stochastic Performance Logic. In Proceedings of the 3rd ACM/SPEC International Conference on Performance Engineering, ICPE 2012, pages 311–322, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-1202-8. doi: 10.1145/2188286.2188345. URL http://doi.acm.org/10.1145/2188286.2188345.

Lubomír Bulej, Tomáš Bureš, Vojtěch Horký, Jaroslav Kotrč, Lukáš Marek, Tomáš Trojánek, and Petr Tůma. Unit testing performance with Stochastic Performance Logic. Automated Software Engineering, 24(1):139–187, March 2017a. ISSN 1573-7535. doi: 10.1007/s10515-015-0188-0. URL https://doi.org/10.1007/s10515-015-0188-0.

Lubomír Bulej, Vojtěch Horký, and Petr Tůma. Do we teach useful statistics for performance evaluation? In Proceedings of the 8th ACM/SPEC on International Conference on Performance Engineering Companion, ICPE 2017 Companion, pages 185–189, New York, NY, USA, 2017b. ACM. ISBN 978-1-4503-4899-7. doi: 10.1145/3053600.3053638. URL http://doi.acm.org/10.1145/3053600.3053638.

Lubomír Bulej, Vojtěch Horký, and Petr Tůma. Initial experiments with duet benchmarking: Performance testing interference in the cloud. In Proceedings of the 27th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, MASCOTS 2019, pages 249–255, October 2019. doi: 10.1109/MASCOTS.2019.00035.

Lubomír Bulej, Vojtěch Horký, Petr Tůma, François Farquet, and Aleksandar Prokopec. Duet benchmarking: Improving measurement accuracy in the cloud. In Proceedings of the 11th ACM/SPEC International Conference on Performance Engineering, ICPE 2020, New York, NY, USA, April 2020. ACM. doi: 10.1145/3358960.3379132. URL https://doi.org/10.1145/3358960.3379132.

Jinfu Chen and Weiyi Shang. An exploratory study of performance regression introducing code changes. In Proceedings of the 33rd IEEE International Conference on Software Maintenance and Evolution, ICSME 2017, New York, NY, USA, September 2017. IEEE. doi: 10.1109/icsme.2017.13. URL https://doi.org/10.1109/icsme.2017.13.

Jinfu Chen, Weiyi Shang, and Emad Shihab. PerfJIT: Test-level just-in-time prediction for performance regression introducing commits. IEEE Transactions on Software Engineering, pages 1–1, 2020. doi: 10.1109/tse.2020.3023955. URL https://doi.org/10.1109/tse.2020.3023955.

L. Chen. Continuous delivery: Huge benefits, but challenges too. IEEE Software, 32(2):50–54, March 2015. ISSN 0740-7459. doi: 10.1109/MS.2015.27.

Tse-Hsun Chen, Mark D. Syer, Weiyi Shang, Zhen Ming Jiang, Ahmed E. Hassan, Mohamed Nasser, and Parminder Flora. Analytics-driven load testing: An industrial experience report on load testing of large-scale sys- tems. In Proceedings of the 39th IEEE/ACM International Conference on Soft- ware Engineering: Software Engineering in Practice, ICSE-SEIP 2019. IEEE, May 2017. ISBN 9781538627174. doi: 10.1109/icse-seip.2017.26. URL http://dx.doi.org/10.1109/icse-seip.2017.26.

Yiqun Chen, Stefan Winter, and Neeraj Suri. Inferring performance bug patterns from developer commits. In Proceedings of the 39th IEEE International Symposium on Software Reliability Engineering, ISSRE 2019. IEEE, October 2019. doi: 10.1109/issre.2019.00017. URL https://doi.org/10.1109/issre.2019.00017.

Davide Chicco and Giuseppe Jurman. The advantages of the Matthews correlation coefficient (MCC) over F1 score and Accuracy in binary classification evaluation. BMC Genomics, 21(1), January 2020. doi: 10.1186/s12864-019-6413-7. URL https://doi.org/10.1186/s12864-019-6413-7.

Nancy Chinchor. MUC-4 evaluation metrics. In Proceedings of the 4th Conference on Message Understanding, MUC4 1992. Association for Com- putational Linguistics, June 1992. doi: 10.3115/1072064.1072067. URL https://doi.org/10.3115/1072064.1072067.

Jürgen Cito, Philipp Leitner, Thomas Fritz, and Harald C. Gall. The making of cloud applications: An empirical study on software development for the cloud. In Proceedings of the 10th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2015, pages 393–403, New York, NY, USA, 2015. ACM. ISBN 978-1-4503-3675-8. doi: 10.1145/2786805.2786826. URL http: //doi.acm.org/10.1145/2786805.2786826.

Norman Cliff. Ordinal Methods for Behavioral Data Analysis. Psychology Press, 1 edition, 1996. ISBN 978-0805813333.

Jonathan Cohen. A power primer. Psychological Bulletin, 112:155–159, 1992.

Gregory W. Corder and Dale I. Foreman. Nonparametric Statistics for Non- Statisticians. John Wiley & Sons, Inc., May 2009. doi: 10.1002/9781118165881. URL https://doi.org/10.1002/9781118165881.

Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, September 1995. doi: 10.1007/bf00994018. URL https://doi.org/10.1007/bf00994018.

Charlie Curtsinger and Emery D. Berger. STABILIZER: Statistically sound performance evaluation. In Proceedings of the 18th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2013, pages 219–228, New York, NY, USA, 2013. ACM. ISBN 978-1- 4503-1870-9. doi: 10.1145/2451116.2451141. URL http://doi.acm.org/ 10.1145/2451116.2451141.

Ralph B. D'Agostino, Albert Belanger, and Ralph B. D'Agostino Jr. A suggestion for using powerful and informative tests of normality. The American Statistician, 44(4):316, November 1990. doi: 10.2307/2684359. URL https://doi.org/10.2307/2684359.

Diego Elias Damasceno Costa, Cor-Paul Bezemer, Philipp Leitner, and Artur Andrzejak. What’s wrong with my benchmark results? Studying bad practices in JMH benchmarks. IEEE Transactions on Software Engineering, pages 1–1, 2019. doi: 10.1109/TSE.2019.2925345.

Anthony C. Davison and D Hinkley. Bootstrap methods and their application. Journal of the American Statistical Association, 94, January 1997.

Augusto Born de Oliveira, Sebastian Fischmeister, Amer Diwan, Matthias Hauswirth, and Peter F. Sweeney. Why you should care about quantile regression. In Proceedings of the 18th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2013, pages 207–218, New York, NY, USA, 2013a. ACM. ISBN 978-1-4503-1870-9. doi: 10.1145/2451116.2451140. URL http://dl.acm.org/citation. cfm?doid=2451116.2451140.

Augusto Born de Oliveira, Jean-Christophe Petkovich, Thomas Reidemeister, and Sebastian Fischmeister. DataMill: Rigorous performance evaluation made easy. In Proceedings of the 4th ACM/SPEC International Conference on Performance Engineering, ICPE 2013, pages 137–148, New York, NY, USA, 2013b. ACM. ISBN 978-1-4503-1636-1. doi: 10.1145/2479871.2479892. URL http://doi.acm.org/10.1145/2479871.2479892.

Augusto Born de Oliveira, Sebastian Fischmeister, Amer Diwan, Matthias Hauswirth, and Peter F. Sweeney. Perphecy: Performance regression test se- lection made simple but effective. In Proceedings of the 10th IEEE International Conference on Software Testing, Verification and Validation, ICST 2017, pages 103–113, March 2017. doi: 10.1109/ICST.2017.17.

DevOps Research and Assessment (DORA). Accelerate state of DevOps 2019, 2019. URL https://services.google.com/fh/files/misc/ state-of-devops-2019.pdf. Accessed: 7.12.2020.

Nicolas Dilley and Julien Lange. An empirical study of messaging passing concurrency in Go projects. In Proceedings of the 26th IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2019. IEEE, February 2019. ISBN 9781728105918. doi: 10.1109/saner.2019.8668036. URL http://dx.doi.org/10.1109/saner.2019.8668036.

Zishuo Ding, Jinfu Chen, and Weiyi Shang. Towards the use of the readily available tests from the release pipeline as performance tests. Are we there yet? In Proceedings of the 42nd IEEE/ACM International Conference on Soft- ware Engineering, ICSE 2020, New York, NY, USA, June 2020. ACM. doi: 10.1145/3377811.3380351. URL https://doi.org/10.1145/3377811. 3380351.

Phil Dixon. Shopzilla site redesign – we get what we measure, 2009. URL http://assets.en.oreilly.com/1/event/29/Shopzilla%27s% 20Site%20Redo%20-%20You%20Get%20What%20You%20Measure% 20Presentation.ppt. Accessed: 19.11.2020.

Hyunsook Do, Siavash Mirarab, Ladan Tahvildari, and Gregg Rothermel. The effects of time constraints on test case prioritization: A series of con- trolled experiments. IEEE Transactions on Software Engineering, 36(5):593–617, September 2010. ISSN 0098-5589. doi: 10.1109/TSE.2010.58.

Charles Duhigg. Stock traders find speed pays, in milliseconds. The New York Times, Online, July 2009. URL https://www.nytimes.com/2009/07/ 24/business/24trading.html?_r=1&hp. Accessed: 19.11.2020.

Olive Jean Dunn. Multiple comparisons using rank sums. Technometrics, 6(3):241–252, August 1964. doi: 10.1080/00401706.1964.10490181. URL https://doi.org/10.1080/00401706.1964.10490181.

Sebastian Elbaum, Alexey Malishevsky, and Gregg Rothermel. Incorporating varying test costs and fault severities into test case prioritization. In Proceedings of the 23rd International Conference on Software Engineering, ICSE 2001, pages 329–338, Washington, DC, USA, 2001. IEEE. doi: 10.1109/icse.2001.919106. URL https://doi.org/10.1109/icse.2001.919106.

Sebastian Elbaum, Alexey G. Malishevsky, and Gregg Rothermel. Test case prioritization: A family of empirical studies. IEEE Transactions on Software Engineering, 28(2):159–182, February 2002. ISSN 0098-5589. doi: 10.1109/32. 988497.

Sebastian Elbaum, Gregg Rothermel, and John Penix. Techniques for improv- ing regression testing in continuous integration development environments. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Founda- tions of Software Engineering, FSE 2014, pages 235–245, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-3056-5. doi: 10.1145/2635868.2635910. URL http://doi.acm.org/10.1145/2635868.2635910.

Hamed Esfahani, Jonas Fietz, Qi Ke, Alexei Kolomiets, Erica Lan, Erik Mavrinac, Wolfram Schulte, Newton Sanches, and Srikanth Kandula. Cloud- Build: Microsoft’s distributed and caching build service. In Proceedings of the 38th IEEE/ACM International Conference on Software Engineering: Software En- gineering in Practice, ICSE-SEIP 2016. ACM, May 2016. doi: 10.1145/2889160. 2889222. URL https://doi.org/10.1145/2889160.2889222.

Mikael Fagerström, Emre Emir Ismail, Grischa Liebel, Rohit Guliani, Fredrik Larsson, Karin Nordling, Eric Knauss, and Patrizio Pelliccione. Verdict machinery: On the need to automatically make sense of test results. In Proceedings of the 25th International Symposium on Software Testing and Analysis, ISSTA 2016, pages 225–234, July 2016. doi: 10.1145/2931037.2931064. URL http://doi.acm.org/10.1145/2931037.2931064.

Benjamin Farley, Ari Juels, Venkatanathan Varadarajan, Thomas Ristenpart, Kevin D. Bowers, and Michael M. Swift. More for your money: Exploiting performance heterogeneity in public clouds. In Proceedings of the 3rd ACM Symposium on Cloud Computing, SoCC 2012, page 14, New York, NY, USA, 2012. ACM. ISBN 9781450317610. doi: 10.1145/2391229.2391249. URL https://doi.org/10.1145/2391229.2391249.

King Chun Foo, Zhen Ming (Jack) Jiang, Bram Adams, Ahmed E. Hassan, Ying Zou, and Parminder Flora. An industrial case study on the automated detection of performance regressions in heterogeneous environments. In Proceedings of the 37th IEEE/ACM International Conference on Software Engineering, volume 2 of ICSE 2015, pages 159–168, Piscataway, NJ, USA, 2015. IEEE Press. doi: 10.1109/icse.2015.144. URL https://doi.org/10.1109/icse.2015.144.

Gordon Fraser and Andrea Arcuri. EvoSuite: Automatic test suite genera- tion for object-oriented software. In Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engi- neering, ESEC/FSE 2011. ACM, 2011. doi: 10.1145/2025113.2025179. URL https://doi.org/10.1145/2025113.2025179.

Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, August 1997. doi: 10.1006/jcss.1997.1504. URL https://doi.org/10.1006/jcss.1997.1504.

Jerome H. Friedman. Multivariate adaptive regression splines. The Annals of Statistics, 19(1):1–67, March 1991. doi: 10.1214/aos/1176347963. URL https://doi.org/10.1214/aos/1176347963.

Jerome H. Friedman. Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5):1189–1232, October 2001. doi: 10.1214/ aos/1013203451. URL https://doi.org/10.1214/aos/1013203451.

Yu Gan, Yanqi Zhang, Dailun Cheng, Ankitha Shetty, Priyal Rathi, Nayan Katarki, Ariana Bruno, Justin Hu, Brian Ritchken, Brendon Jackson, Kelvin Hu, Meghna Pancholi, Yuan He, Brett Clancy, Chris Colen, Fukang Wen, Catherine Leung, Siyuan Wang, Leon Zaruvinsky, Mateo Espinosa, Rick Lin, Zhongling Liu, Jake Padilla, and Christina Delimitrou. An open-source benchmark suite for microservices and their hardware-software implications for cloud & edge systems. In Proceedings of the 24th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2019. ACM, April 2019. doi: 10.1145/3297858.3304013. URL https://doi.org/10.1145/3297858.3304013.

Ruoyu Gao and Zhen Ming (Jack) Jiang. An exploratory study on assessing the impact of environment variations on the results of load tests. In Proceedings of the 14th IEEE/ACM International Conference on Mining Software Repositories, MSR 2017. IEEE, May 2017. ISBN 9781538615447. doi: 10.1109/msr.2017.22. URL http://dx.doi.org/10.1109/msr.2017.22.

Andy Georges, Dries Buytaert, and Lieven Eeckhout. Statistically rigorous java performance evaluation. In Proceedings of the 22nd Annual ACM SIGPLAN Conference on Object-Oriented Programming Systems and Applications, OOPSLA 2007, pages 57–76, New York, NY, USA, 2007. ACM. ISBN 978-1-59593-786-5. doi: 10.1145/1297027.1297033. URL http://doi.acm.org/10.1145/ 1297027.1297033.

Joseph Yossi Gil, Keren Lenz, and Yuval Shimron. A microbenchmark case study and lessons learned. In Proceedings of the Compilation of the Co-Located Workshops on DSM’11, TMC’11, AGERE! 2011, AOOPES’11, NEAT’11, & VMIL’11, SPLASH 2011 Workshops, pages 297–308, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-1183-0. doi: 10.1145/2095050.2095100. URL http://doi.acm.org/10.1145/2095050.2095100.

Lee Gillam, Bin Li, John O’Loughlin, and Anuz Pratap Singh Tomar. Fair benchmarking for cloud computing systems. Journal of Cloud Computing: Advances, Systems and Applications, 2(1):6, 2013. ISSN 2192-113X. doi: 10.1186/ 2192-113X-2-6. URL http://dx.doi.org/10.1186/2192-113X-2-6.

Milos Gligoric, Lamyaa Eloussi, and Darko Marinov. Practical regres- sion test selection with dynamic file dependencies. In Proceedings of the 2015 International Symposium on Software Testing and Analysis, ISSTA 2015, pages 211–222, New York, NY, USA, 2015. ACM. ISBN 978-1-4503-3620-8. doi: 10.1145/2771783.2771784. URL http://doi.acm.org/10.1145/ 2771783.2771784.

Go Authors. Go – frequently asked questions (FAQ). https://golang.org/doc/faq, July 2020a.

Go Authors. The Go programming language specification. https://golang.org/ref/spec, January 2020b.

Jacob Goldberger, Sam Roweis, Geoffrey E. Hinton, and Ruslan R. Salakhutdi- nov. Neighbourhood components analysis. In Advances in Neural Information Processing Systems 17, volume 17 of NIPS 2004, pages 513–520. MIT Press, 2004. URL https://proceedings.neurips.cc/paper/2004/file/ 42fe880812925e520249e808937738d2-Paper.pdf.

Mark Grechanik, Chen Fu, and Qing Xie. Automatically finding performance problems with feedback-directed learning software testing. In Proceedings of the 34th International Conference on Software Engineering, ICSE 2012, pages 156–166, Piscataway, NJ, USA, 2012. IEEE. doi: 10.1109/icse.2012.6227197. URL https://doi.org/10.1109/icse.2012.6227197.

Brendan Gregg. Systems Performance. Prentice Hall, 1 edition, 2013. ISBN 0-13-339009-8.

David Grove and Craig Chambers. A framework for call graph construction algorithms. ACM Transactions on Programming Languages and Systems, 23(6): 685–746, November 2001. ISSN 0164-0925. doi: 10.1145/506315.506316. URL http://doi.acm.org/10.1145/506315.506316.

Alireza Haghighatkhah, Mika Mäntylä, Markku Oivo, and Pasi Kuvaja. Test prioritization in continuous integration environments. The Journal of Systems and Software, 146:80–98, 2018. doi: 10.1016/j.jss.2018.08.061. URL https: //doi.org/10.1016/j.jss.2018.08.061.

James A. Hanley and Barbara J. McNeil. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1): 29–36, April 1982. doi: 10.1148/radiology.143.1.7063747. URL https: //doi.org/10.1148/radiology.143.1.7063747.

Dan Hao, Lingming Zhang, Lu Zhang, Gregg Rothermel, and Hong Mei. A unified test case prioritization approach. ACM Transactions on Software Engineering and Methodology, 24(2):10:1–10:31, December 2014. ISSN 1049-331X. doi: 10.1145/2685614. URL http://doi.acm.org/10.1145/2685614.

Mark Harman and Bryan F Jones. Search-based software engineering. Information and Software Technology, 43(14):833–839, 2001. ISSN 0950-5849. doi: 10.1016/S0950-5849(01)00189-6. URL http://www.sciencedirect.com/science/article/pii/S0950584901001896.

Jan Hauke and Tomasz Kossowski. Comparison of values of pearson's and spearman's correlation coefficients on the same sets of data. Quaestiones Geographicae, 30(2):87–93, June 2011. doi: 10.2478/v10117-011-0021-1. URL https://doi.org/10.2478/v10117-011-0021-1.

Sen He, Glenna Manns, John Saunders, Wei Wang, Lori Pollock, and Mary Lou Soffa. A statistics-based performance testing methodology for cloud ap- plications. In Proceedings of the 27th ACM Joint Meeting on European Soft- ware Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2019, pages 188–199, New York, NY, USA, 2019. ACM. ISBN 978-1-4503-5572-8. doi: 10.1145/3338906.3338912. URL http://doi.acm.org/10.1145/3338906.3338912.

Christoph Heger, Jens Happe, and Roozbeh Farahbod. Automated root cause isolation of performance regressions during software development. In Proceedings of the 4th ACM/SPEC International Conference on Performance Engineering, ICPE 2013, pages 27–38, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-1636-1. doi: 10.1145/2479871.2479879. URL http: //doi.acm.org/10.1145/2479871.2479879.

Christopher Henard, Mike Papadakis, Mark Harman, Yue Jia, and Yves Le Traon. Comparing white-box and black-box test prioritization. In Proceedings of the 38th IEEE/ACM International Conference on Software Engineering, ICSE 2016, pages 523–534, New York, NY, USA, 2016. ACM. ISBN 978-1-4503-3900-1. doi: 10.1145/2884781.2884791. URL http://doi.acm.org/10.1145/2884781.2884791.

Kim Herzig and Nachiappan Nagappan. Empirically detecting false test alarms using association rules. In Proceedings of the 37th IEEE/ACM Inter- national Conference on Software Engineering: Software Engineering in Practice, ICSE-SEIP 2015, pages 39–48. IEEE, May 2015. doi: 10.1109/icse.2015.133. URL https://doi.org/10.1109/icse.2015.133.

Tim C. Hesterberg. What teachers should know about the bootstrap: Re- sampling in the undergraduate statistics curriculum. The American Statis- tician, 69(4):371–386, 2015. doi: 10.1080/00031305.2015.1089789. URL https://doi.org/10.1080/00031305.2015.1089789.

Michael Hilton, Timothy Tunnell, Kai Huang, Darko Marinov, and Danny Dig. Usage, costs, and benefits of continuous integration in open-source projects. In Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering, ASE 2016, pages 426–437, New York, NY, USA, 2016. ACM. ISBN 978-1-4503-3845-5. doi: 10.1145/2970276.2970358. URL http://doi.acm.org/10.1145/2970276.2970358.

Vojtěch Horký, Peter Libič, Lukáš Marek, Antonín Steinhauser, and Petr Tůma. Utilizing performance unit tests to increase performance awareness. In Proceedings of the 6th ACM/SPEC International Conference on Performance Engineering, ICPE 2015, pages 289–300, New York, NY, USA, 2015. ACM. ISBN 978-1-4503-3248-4. doi: 10.1145/2668930.2688051. URL http://doi.acm.org/10.1145/2668930.2688051.

David W. Hosmer Jr., Stanley Lemeshow, and Rodney X. Sturdivant. Applied Logistic Regression. John Wiley & Sons, Inc., 3 edition, April 2013. ISBN 978-0-470-58247-3.

Peng Huang, Xiao Ma, Dongcai Shen, and Yuanyuan Zhou. Performance regression testing target prioritization via performance risk analysis. In Proceedings of the 36th IEEE/ACM International Conference on Software Engineering, ICSE 2014, pages 60–71, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-2756-5. doi: 10.1145/2568225.2568232. URL http://doi.acm.org/10.1145/2568225.2568232.

Rick Hudson. Getting to Go: The journey of Go’s garbage collector. https://blog.golang.org/ismmkeynote, July 2018.

Henrik Ingo and David Daly. Automated system performance testing at MongoDB. In Proceedings of the Workshop on Testing Database Systems, DBTest 2020. ACM, June 2020. doi: 10.1145/3395032.3395323. URL https://doi. org/10.1145/3395032.3395323.

Alexandru Iosup, Simon Ostermann, Nezih Yigitbasi, Radu Prodan, Thomas Fahringer, and Dick Epema. Performance analysis of cloud computing services for many-tasks scientific computing. IEEE Transactions on Parallel and Distributed Systems, 22(6):931–945, June 2011a. ISSN 1045-9219. doi: 10.1109/ TPDS.2011.66. URL http://dx.doi.org/10.1109/TPDS.2011.66.

Alexandru Iosup, Nezih Yigitbasi, and Dick Epema. On the performance variability of production cloud services. In Proceedings of the 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGRID 2011, pages 104–113, Washington, DC, USA, 2011b. IEEE Computer Society. ISBN 978-0-7695-4395-6. doi: 10.1109/CCGrid.2011.22. URL http://dx.doi. org/10.1109/CCGrid.2011.22.

Raj Jain. The Art of Computer Systems Performance Analysis. Wiley, 1991. ISBN 0-471-50336-3.

Yue Jia and Mark Harman. An analysis and survey of the development of mutation testing. IEEE Transactions on Software Engineering, 37(5):649– 678, September 2011. ISSN 0098-5589. doi: 10.1109/TSE.2010.62. URL http://dx.doi.org/10.1109/TSE.2010.62.

Zhen Ming Jiang and Ahmed E. Hassan. A survey on load testing of large-scale software systems. IEEE Transactions on Software Engineering, 41(11):1091–1118, November 2015. ISSN 2326-3881. doi: 10.1109/TSE.2015.2445340. URL https://doi.org/10.1109/tse.2015.2445340.

Jirayus Jiarpakdee, Chakkrit Tantithamthavorn, and Ahmed E. Hassan. The impact of correlated metrics on the interpretation of defect models. IEEE Transactions on Software Engineering, 2019.

Jirayus Jiarpakdee, Chakkrit Tantithamthavorn, and Christoph Treude. The impact of automated feature selection techniques on the interpretation of defect models. Empirical Software Engineering, 25(5):3590–3638, August 2020. doi: 10.1007/s10664-020-09848-1. URL https://doi.org/10.1007/ s10664-020-09848-1.

Ivo Jimenez, Noah Watkins, Michael Sevilla, Jay Lofstead, and Carlos Maltzahn. quiho: Automated performance regression testing using inferred resource utilization profiles. In Proceedings of the 9th ACM/SPEC International Confer- ence on Performance Engineering, ICPE 2018, pages 273–284, New York, NY, USA, 2018. ACM. ISBN 9781450350952. doi: 10.1145/3184407.3184422. URL https://doi.org/10.1145/3184407.3184422.

Guoliang Jin, Linhai Song, Xiaoming Shi, Joel Scherpelz, and Shan Lu. Un- derstanding and detecting real-world performance bugs. In Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2012, pages 77–88, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-1205-9. doi: 10.1145/2254064.2254075. URL http://doi.acm.org/10.1145/2254064.2254075.

George H. John and Pat Langley. Estimating continuous distributions in bayesian classifiers. In Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence, UAI 1995, pages 338–345, San Francisco, CA, USA, 1995. Morgan Kaufmann Publishers Inc. ISBN 1558603859. URL https: //arxiv.org/abs/1302.4964.

Lizy Kurian John and Lieven Eeckhout. Performance Evaluation and Benchmarking. CRC Press, 1 edition, 2005. ISBN 9780849336225.

Tomas Kalibera and Richard Jones. Quantifying performance changes with effect size confidence intervals. Technical Report 4–12, University of Kent, June 2012. URL http://www.cs.kent.ac.uk/pubs/2012/3233.

Tomas Kalibera and Richard Jones. Rigorous benchmarking in reasonable time. In Proceedings of the 2013 ACM SIGPLAN International Symposium on Memory Management, ISMM 2013, pages 63–74, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-2100-6. doi: 10.1145/2464157.2464160. URL http://doi.acm.org/10.1145/2464157.2464160.

Chung Hwan Kim, Junghwan Rhee, Kyu Hyung Lee, Xiangyu Zhang, and Dongyan Xu. PerfGuard: Binary-centric application performance monitor- ing in production environments. In Proceedings of the 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2016, pages 595–606, New York, NY, USA, 2016. ACM. ISBN 978-1-4503-4218-6. doi: 10.1145/2950290.2950347. URL http://doi.acm.org/10.1145/ 2950290.2950347.

William H. Kruskal and W. Allen Wallis. Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association, 47(260): 583–621, December 1952. doi: 10.1080/01621459.1952.10483441. URL https://doi.org/10.1080/01621459.1952.10483441.

Solomon Kullback and Richard A. Leibler. On information and sufficiency. Annals of Mathematical Statistics, 22(1):79–86, March 1951. doi: 10.1214/aoms/ 1177729694. URL https://doi.org/10.1214/aoms/1177729694.

Christoph Laaber. bencher - JMH benchmark analysis and prioritization, 2020a. URL https://github.com/chrstphlbr/bencher.

Christoph Laaber. pa - performance (change) analysis using bootstrap, 2020b. URL https://github.com/chrstphlbr/pa.

Christoph Laaber and Philipp Leitner. An evaluation of open-source software microbenchmark suites for continuous performance assessment. In Proceedings of the 15th International Conference on Mining Software Repositories, MSR 2018, pages 119–130, New York, NY, USA, 2018a. ACM. ISBN 978-1-4503-5716-6. doi: 10.1145/3196398.3196407. URL http://doi.acm.org/10.1145/3196398.3196407.

Christoph Laaber and Philipp Leitner. Dataset and scripts "An evaluation of open-source software microbenchmark suites for continuous performance assessment", 2018b. URL https://doi.org/10.6084/m9.figshare. 5982253.

Christoph Laaber and Philipp Leitner. GoABS. https://github.com/ sealuzh/GoABS/releases/tag/msr18, 2018c.

Christoph Laaber and Philipp Leitner. JavaABS. https://github.com/ sealuzh/JavaABS/releases/tag/msr18, 2018d.

Christoph Laaber and Philipp Leitner. JavaAPIUsageTracer. https: //github.com/sealuzh/JavaAPIUsageTracer/releases/tag/ msr18, 2018e.

Christoph Laaber, Joel Scheuner, and Philipp Leitner. Dataset, scripts, and online appendix "Software microbenchmarking in the cloud. How bad is it really?", 2019a. URL https://doi.org/10.6084/m9.figshare. 7546703.

Christoph Laaber, Joel Scheuner, and Philipp Leitner. Software microbench- marking in the cloud. How bad is it really? Empirical Software Engineer- ing, April 2019b. ISSN 1573-7616. doi: 10.1007/s10664-019-09681-1. URL https://doi.org/10.1007/s10664-019-09681-1.

Christoph Laaber, Mikael Basmaci, and Pasquale Salza. Replication package "Predicting unstable software benchmarks using static source code features", 2020a. URL https://seafile.ifi.uzh.ch/d/ 25d4298f8c8245ecb3b1/.

Christoph Laaber, Harald C. Gall, and Philipp Leitner. Replication package "Applying test case prioritization to software microbenchmarks", 2020b. URL https://seafile.ifi.uzh.ch/d/e2db9a8932aa4e13a625/.

Christoph Laaber, Stefan Würsten, Harald C. Gall, and Philipp Leitner. Dynam- ically reconfiguring software microbenchmarks: Reducing execution time without sacrificing result quality. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Founda- tions of Software Engineering, ESEC/FSE 2020. ACM, November 2020c. doi: 10.1145/3368089.3409683. URL https://doi.org/10.1145/3368089. 3409683.

Christoph Laaber, Stefan Würsten, Harald C. Gall, and Philipp Leitner. JMH with dynamic reconfiguration, 2020d. URL https://github.com/ sealuzh/jmh.

Christoph Laaber, Stefan Würsten, Harald C. Gall, and Philipp Leitner. Repli- cation package "Dynamically reconfiguring software microbenchmarks: Reducing execution time without sacrificing result quality", 2020e. URL https://doi.org/10.6084/m9.figshare.11944875.

Philipp Leitner and Cor-Paul Bezemer. An exploratory study of the state of practice of performance testing in java-based open source projects. In Proceedings of the 8th ACM/SPEC on International Conference on Performance Engineering, ICPE 2017, pages 373–384, New York, NY, USA, 2017. ACM. ISBN 978-1-4503-4404-3. doi: 10.1145/3030207.3030213. URL http://doi. acm.org/10.1145/3030207.3030213.

Philipp Leitner and Jürgen Cito. Patterns in the chaos – A study of performance variation and predictability in public IaaS clouds. ACM Transactions on Internet Technology, 16(3):15:1–15:23, April 2016. ISSN 1533-5399. doi: 10. 1145/2885497. URL http://doi.acm.org/10.1145/2885497.

Zheng Li, Mark Harman, and Robert M. Hierons. Search algorithms for regression test case prioritization. IEEE Transactions on Software Engineering, 33(4):225–237, April 2007. ISSN 0098-5589. doi: 10.1109/TSE.2007.38. URL https://doi.org/10.1109/TSE.2007.38.

Jingjing Liang, Sebastian Elbaum, and Gregg Rothermel. Redefining prioriti- zation: Continuous prioritization for continuous integration. In Proceedings of the 40th IEEE/ACM International Conference on Software Engineering, ICSE 2018, pages 688–698, New York, NY, USA, 2018. ACM. ISBN 978-1-4503- 5638-1. doi: 10.1145/3180155.3180213. URL http://doi.acm.org/10. 1145/3180155.3180213.

Mario Linares-Vasquez, Christopher Vendome, Qi Luo, and Denys Poshy- vanyk. How developers detect and fix performance bottlenecks in Android apps. In Proceedings of the 31st IEEE International Conference on Software Main- tenance and Evolution, ICSME 2015, pages 352–361, Los Alamitos, CA, USA, 2015. IEEE Computer Society. doi: doi.ieeecomputersociety.org/10.1109/ ICSM.2015.7332486.

Greg Linden. Make data useful, December 2006. URL http: //glinden.blogspot.com/2006/12/slides-from-my-talk- at-stanford.html. Accessed: 20.11.2020.

Yepang Liu, Chang Xu, and Shing-Chi Cheung. Characterizing and detecting performance bugs for smartphone applications. In Proceedings of the 36th IEEE/ACM International Conference on Software Engineering, ICSE 2014, pages 1013–1024, New York, NY, USA, 2014. ACM. ISBN 9781450327565. doi: 10.1145/2568225.2568229. URL https://doi.org/10.1145/2568225. 2568229.

Yafeng Lu, Yiling Lou, Shiyang Cheng, Lingming Zhang, Dan Hao, Yangfan Zhou, and Lu Zhang. How does regression test prioritization perform in real- world software evolution? In Proceedings of the 38th IEEE/ACM International Conference on Software Engineering, ICSE 2016, pages 535–546, May 2016. doi: 10.1145/2884781.2884874. URL https://doi.org/10.1145/2884781. 2884874.

Qi Luo, Kevin Moran, and Denys Poshyvanyk. A large-scale empirical comparison of static and dynamic test case prioritization techniques. In Proceedings of the 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2016, pages 559–570, New York, NY, USA, 2016a. ACM. ISBN 978-1-4503-4218-6. doi: 10.1145/2950290.2950344. URL http://doi.acm.org/10.1145/2950290.2950344.

Qi Luo, Denys Poshyvanyk, and Mark Grechanik. Mining performance re- gression inducing code changes in evolving software. In Proceedings of the 13th International Conference on Mining Software Repositories, MSR 2016, pages 25–36, New York, NY, USA, 2016b. ACM. ISBN 978-1-4503-4186-8. doi: 10.1145/2901739.2901765. URL http://doi.acm.org/10.1145/ 2901739.2901765.

Qi Luo, Aswathy Nair, Mark Grechanik, and Denys Poshyvanyk. FORE- POST: Finding performance problems automatically with feedback-directed learning software testing. Empirical Software Engineering, 22(1):6–56, Febru- ary 2017. ISSN 1573-7616. doi: 10.1007/s10664-015-9413-5. URL https: //doi.org/10.1007/s10664-015-9413-5.

Qi Luo, Kevin Moran, Denys Poshyvanyk, and Massimiliano Di Penta. As- sessing test case prioritization on real faults and mutants. In Proceedings of the 34th IEEE International Conference on Software Maintenance and Evolution, ICSME 2018. IEEE, September 2018. doi: 10.1109/icsme.2018.00033. URL https://doi.org/10.1109/icsme.2018.00033.

Qi Luo, Kevin Moran, Lingming Zhang, and Denys Poshyvanyk. How do static and dynamic test case prioritization techniques perform on modern software systems? An extensive study on GitHub projects. IEEE Transactions on Software Engineering, 45(11):1054–1080, November 2019. doi: 10.1109/tse. 2018.2822270. URL https://doi.org/10.1109/tse.2018.2822270.

Qingzhou Luo, Farah Hariri, Lamyaa Eloussi, and Darko Marinov. An empirical analysis of flaky tests. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2014. ACM Press, 2014. ISBN 9781450330565. doi: 10.1145/2635868.2635920. URL http://dx.doi.org/10.1145/2635868.2635920.

Mateusz Machalica, Alex Samylkin, Meredith Porth, and Satish Chandra. Pre- dictive test selection. In Proceedings of the 41st IEEE/ACM International Confer- ence on Software Engineering: Software Engineering in Practice, ICSE-SEIP 2019. IEEE, May 2019. ISBN 9781728117607. doi: 10.1109/icse-seip.2019.00018. URL http://dx.doi.org/10.1109/icse-seip.2019.00018.

Ruchika Malhotra. A systematic review of machine learning techniques for software fault prediction. Applied Soft Computing, 27:504–518, February 2015. doi: 10.1016/j.asoc.2014.11.023. URL https://doi.org/10.1016/ j.asoc.2014.11.023.

Aleksander Maricq, Dmitry Duplyakin, Ivo Jimenez, Carlos Maltzahn, Ryan Stutsman, and Robert Ricci. Taming performance variability. In Proceed- ings of the 13th USENIX Conference on Operating Systems Design and Imple- mentation, OSDI 2018, pages 409–425, USA, October 2018. USENIX As- sociation. ISBN 978-1-939133-08-3. URL https://www.usenix.org/ conference/osdi18/presentation/maricq.

Brian W. Matthews. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta (BBA) - Protein Structure, 405(2):442–451, October 1975. doi: 10.1016/0005-2795(75)90109-9. URL https://doi.org/10.1016/0005-2795(75)90109-9.

Alejandro Mazuera-Rozo, Catia Trubiani, Mario Linares-Vásquez, and Gabriele Bavota. Investigating types and survivability of performance bugs in mobile apps. Empirical Software Engineering, 25(3):1644–1686, March 2020. ISSN 1573-7616. doi: 10.1007/s10664-019-09795-6. URL http://dx.doi.org/10.1007/s10664-019-09795-6.

Thomas J. McCabe. A complexity measure. IEEE Transactions on Software Engineering, SE-2(4):308–320, December 1976. ISSN 0098-5589. doi: 10.1109/tse.1976.233837. URL http://dx.doi.org/10.1109/tse.1976.233837.

Hong Mei, Dan Hao, Lingming Zhang, Lu Zhang, Ji Zhou, and Gregg Rother- mel. A static approach to prioritizing junit test cases. IEEE Transactions on Software Engineering, 38(6):1258–1275, November 2012. ISSN 0098-5589. doi: 10.1109/TSE.2011.106. URL http://dx.doi.org/10.1109/TSE.2011. 106.

Peter Mell and Timothy Grance. The NIST definition of cloud computing. Technical Report 800-145, National Institute of Standards and Technology (NIST), Gaithersburg, MD, September 2011. URL https://doi.org/10. 6028/nist.sp.800-145.

Daniel A. Menascé. Load testing of web sites. IEEE Internet Computing, 6(4): 70–74, July 2002. ISSN 1941-0131. doi: 10.1109/MIC.2002.1020328. URL https://doi.org/10.1109/mic.2002.1020328.

Marissa Meyer. In search of... a better, faster, stronger web, 2009. URL http://assets.en.oreilly.com/1/event/29/Keynote% 20Presentation%202.pdf. Accessed: 19.11.2020.

Shaikh Mostafa, Xiaoyin Wang, and Tao Xie. PerfRanker: Prioritization of performance regression tests for collection-intensive software. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2017, pages 23–34, New York, NY, USA, 2017. ACM. ISBN 978-1-4503-5076-1. doi: 10.1145/3092703.3092725. URL http://doi.acm. org/10.1145/3092703.3092725.

Mozilla. Talos. URL https://wiki.mozilla.org/TestEngineering/ Performance/Talos. Accessed: 25.11.2020.

Todd Mytkowicz, Amer Diwan, Matthias Hauswirth, and Peter F. Sweeney. Producing wrong data without doing anything obviously wrong! In Proceedings of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2009, pages 265–276, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-406-5. doi: 10.1145/1508244.1508275. URL http://doi.acm.org/10.1145/1508244.1508275.

Peter Naur and Brian Randell. Software engineering: Report of a conference sponsered by the NATO science committee. In NATO Conference on Software Engineering, Brussels 39, Belgium, January 1969. NATO, Scientific Affairs Division.

Thanh H. D. Nguyen, Meiyappan Nagappan, Ahmed E. Hassan, Mohamed Nasser, and Parminder Flora. An industrial case study of automatically identifying performance regression-causes. In Proceedings of the 11th Working Conference on Mining Software Repositories, MSR 2014, pages 232–241, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-2863-0. doi: 10.1145/2597073. 2597092. URL http://doi.acm.org/10.1145/2597073.2597092.

Adrian Nistor, Tian Jiang, and Lin Tan. Discovering, reporting, and fixing performance bugs. In Proceedings of the 10th Working Conference on Mining Software Repositories, MSR 2013. IEEE, May 2013a. doi: 10.1109/msr.2013. 6624035. URL https://doi.org/10.1109/msr.2013.6624035.

Adrian Nistor, Linhai Song, Darko Marinov, and Shan Lu. Toddler: Detecting performance problems via similar memory-access patterns. In Proceedings of the 35th IEEE/ACM International Conference on Software Engineering, ICSE 2013, pages 562–571, Piscataway, NJ, USA, 2013b. IEEE Press. ISBN 978-1- 4673-3076-3. doi: 10.1109/ICSE.2013.6606602. URL https://doi.org/ 10.1109/ICSE.2013.6606602.

Adrian Nistor, Po-Chun Chang, Cosmin Radoi, and Shan Lu. Caramel: De- tecting and fixing performance problems that have non-intrusive fixes. In Proceedings of the 37th IEEE/ACM International Conference on Software Engi- neering, volume 1 of ICSE 2015, pages 902–912, Piscataway, NJ, USA, 2015. IEEE Press. ISBN 978-1-4799-1934-5. doi: 10.1109/ICSE.2015.100. URL https://doi.org/10.1109/ICSE.2015.100.

Zhonghong Ou, Hao Zhuang, Jukka K. Nurminen, Antti Ylä-Jääski, and Pan Hui. Exploiting hardware heterogeneity within the same instance type of Amazon EC2. In Proceedings of the 4th USENIX Conference on Hot Topics in Cloud Computing, HotCloud 2012, pages 4–4, Berkeley, CA, USA, 2012. USENIX Association. URL https://www.usenix.org/conference/hotcloud12/workshop-program/presentation/ou.

Alessandro Vittorio Papadopoulos, Laurens Versluis, André Bauer, Nikolas Herbst, Jóakim Von Kistowski, Ahmed Ali-eldin, Cristina Abad, José Nelson Amaral, Petr Tůma, and Alexandru Iosup. Methodological principles for reproducible performance evaluation in cloud computing. IEEE Transactions on Software Engineering, pages 1–1, 2019. doi: 10.1109/TSE.2019.2927908. URL https://doi.org/10.1109/tse.2019.2927908.

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duches- nay. Scikit-learn: Machine learning in python. Journal of Machine Learn- ing Research, 12(85):2825–2830, November 2011. ISSN 1532-4435. URL http://jmlr.org/papers/v12/pedregosa11a.html.

Michael Pradel, Markus Huggler, and Thomas R. Gross. Performance regres- sion testing of concurrent classes. In Proceedings of the 2014 International Symposium on Software Testing and Analysis, ISSTA 2014, pages 13–25, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-2645-2. doi: 10.1145/2610384. 2610393. URL http://doi.acm.org/10.1145/2610384.2610393.

Aleksandar Prokopec, Andrea Rosà, David Leopoldseder, Gilles Duboscq, Petr Tůma, Martin Studener, Lubomír Bulej, Yudi Zheng, Alex Villazón, Doug Simon, Thomas Würthinger, and Walter Binder. Renaissance: Benchmarking suite for parallel applications on the JVM. In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2019, pages 31–47, New York, NY, USA, 2019. Association for Computing Machinery. ISBN 9781450367127. doi: 10.1145/3314221.3314637. URL https://doi.org/10.1145/3314221.3314637.

J. Ross Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, March 1986. doi: 10.1007/bf00116251. URL https://doi.org/10.1007/ bf00116251.

Thomas Rausch, Waldemar Hummer, Philipp Leitner, and Stefan Schulte. An empirical analysis of build failures in the continuous integration workflows of Java-based open-source software. In Proceedings of the 14th International Conference on Mining Software Repositories, MSR 2017, pages 345–355, Piscat- away, NJ, USA, 2017. IEEE Press. ISBN 978-1-5386-1544-7. doi: 10.1109/ MSR.2017.54. URL https://doi.org/10.1109/MSR.2017.54.

Michael Reif, Michael Eichberg, Ben Hermann, Johannes Lerch, and Mira Mezini. Call graph construction for Java libraries. In Proceedings of the 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2016, pages 474–486, New York, NY, USA, 2016. ACM. ISBN 978-1-4503- 4218-6. doi: 10.1145/2950290.2950312. URL http://doi.acm.org/10. 1145/2950290.2950312.

Michael Reif, Florian Kübler, Michael Eichberg, Dominik Helm, and Mira Mezini. Judge: Identifying, understanding, and evaluating sources of unsoundness in call graphs. In Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2019, pages 251–261, New York, NY, USA, 2019. ACM. ISBN 978-1-4503-6224-5. doi: 10.1145/3293882.3330555. URL http://doi.acm.org/10.1145/3293882.3330555.

Shiquan Ren, Hong Lai, Wenjing Tong, Mostafa Aminzadeh, Xuezhang Hou, and Shenghan Lai. Nonparametric bootstrapping for hierarchical data. Journal of Applied Statistics, 37(9):1487–1498, 2010. doi: 10.1080/02664760903046102. URL https://doi.org/10.1080/02664760903046102.

Viraf (Willy) Reporter. The value of a millisecond: Finding the optimal speed of a trading infrastructure, April 2008. URL https://research.tabbgroup.com/report/v06-007-value-millisecond-finding-optimal-speed-trading-infrastructure. Accessed: 19.11.2020.

Marcelino Rodriguez-Cancio, Benoit Combemale, and Benoit Baudry. Automatic microbenchmark generation to prevent dead code elimination and constant folding. In Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering, ASE 2016, pages 132–143, New York, NY, USA, 2016. Association for Computing Machinery. ISBN 9781450338455. doi: 10.1145/2970276.2970346. URL https://doi.org/10.1145/2970276.2970346.

J. Romano, J.D. Kromrey, J. Coraggio, and J. Skowronek. Appropriate statistics for ordinal level data: Should we really be using t-test and Cohen’s d for evaluating group differences on the NSSE and other surveys? In Annual Meeting of the Florida Association of Institutional Research, pages 1–3, 2006.

Gregg Rothermel, Roland H. Untch, Chengyun Chu, and Mary Jean Harrold. Test case prioritization: An empirical study. In Proceedings of the IEEE International Conference on Software Maintenance, ICSM 1999, pages 179–, Washington, DC, USA, 1999. IEEE Computer Society. ISBN 0-7695-0016-1. doi: 10.1109/icsm.1999.792604. URL https://doi.org/10.1109/icsm.1999.792604.

Gregg Rothermel, Roland H. Untch, and Chengyun Chu. Prioritizing test cases for regression testing. IEEE Transactions on Software Engineering, 27(10):929–948, October 2001. ISSN 0098-5589. doi: 10.1109/32.962562. URL https://doi.org/10.1109/32.962562.

Julia Rubin and Martin Rinard. The challenges of staying together while moving fast: An exploratory study. In Proceedings of the 38th IEEE/ACM International Conference on Software Engineering, ICSE 2016, pages 982–993, New York, NY, USA, 2016. ACM. ISBN 978-1-4503-3900-1. doi: 10.1145/2884781.2884871. URL http://doi.acm.org/10.1145/2884781.2884871.

Dennis W. Ruck, Steven K. Rogers, Matthew Kabrisky, Mark E. Oxley, and Bruce W. Suter. The multilayer perceptron as an approximation to a Bayes optimal discriminant function. IEEE Transactions on Neural Networks, 1(4):296–298, December 1990. doi: 10.1109/72.80266. URL https://doi.org/10.1109/72.80266.

Juan Pablo Sandoval Alcocer and Alexandre Bergel. Tracking down performance variation against source code evolution. In Proceedings of the 11th Symposium on Dynamic Languages, DLS 2015, pages 129–139, New York, NY, USA, 2015. ACM. ISBN 978-1-4503-3690-1. doi: 10.1145/2816707.2816718. URL http://doi.acm.org/10.1145/2816707.2816718.

Juan Pablo Sandoval Alcocer, Alexandre Bergel, and Marco Tulio Valente. Learning from source code history to identify performance failures. In Proceedings of the 7th ACM/SPEC on International Conference on Performance Engineering, ICPE 2016, pages 37–48, New York, NY, USA, 2016. ACM. ISBN 978-1-4503-4080-9. doi: 10.1145/2851553.2851571. URL http://doi.acm.org/10.1145/2851553.2851571.

Juan Pablo Sandoval Alcocer, Alexandre Bergel, and Marco Tulio Valente. Prioritizing versions for performance regression testing: The Pharo case. Science of Computer Programming, 191:102415, June 2020. ISSN 0167-6423. doi: 10.1016/j.scico.2020.102415. URL http://dx.doi.org/10.1016/j.scico.2020.102415.

Anand Ashok Sawant and Alberto Bacchelli. fine-GRAPE: Fine-grained APi usage extractor – An approach and dataset to investigate API usage. Empirical Software Engineering, 22(3):1348–1371, June 2017. ISSN 1573-7616. doi: 10.1007/s10664-016-9444-6. URL https://doi.org/10.1007/s10664-016-9444-6.

Gerald Schermann, Dominik Schöni, Philipp Leitner, and Harald C. Gall. Bifrost: Supporting continuous deployment with automated enactment of multi-phase live testing strategies. In Proceedings of the 17th International Middleware Conference, Middleware 2016, pages 12:1–12:14, New York, NY, USA, 2016. ACM. ISBN 978-1-4503-4300-8. doi: 10.1145/2988336.2988348. URL http://doi.acm.org/10.1145/2988336.2988348.

Joel Scheuner and Philipp Leitner. Estimating cloud application performance based on micro-benchmark profiling. In Proceedings of the 11th IEEE International Conference on Cloud Computing, CLOUD 2018. IEEE, July 2018. doi: 10.1109/cloud.2018.00019. URL https://doi.org/10.1109/cloud.2018.00019.

Joel Scheuner, Philipp Leitner, Jürgen Cito, and Harald Gall. Cloud WorkBench – Infrastructure-as-Code based cloud benchmarking. In Proceedings of the 6th IEEE International Conference on Cloud Computing Technology and Science, CLOUDCOM 2014, pages 246–253, Washington, DC, USA, 2014. IEEE Computer Society. ISBN 978-1-4799-4093-6. doi: 10.1109/CloudCom.2014.98. URL http://dx.doi.org/10.1109/CloudCom.2014.98.

Eric Schurman and Jake Brutlag. Performance related changes and their user impact, 2009. URL http://assets.en.oreilly.com/1/event/29/The%20User%20and%20Business%20Impact%20of%20Server%20Delays,%20Additional%20Bytes,%20and%20HTTP%20Chunking%20in%20Web%20Search%20Presentation.pptx. Accessed: 19.11.2020.

Marija Selakovic and Michael Pradel. Performance issues and optimizations in JavaScript: An empirical study. In Proceedings of the 38th IEEE/ACM International Conference on Software Engineering, ICSE 2016, pages 61–72, New York, NY, USA, 2016. ACM. ISBN 978-1-4503-3900-1. doi: 10.1145/2884781.2884829. URL http://doi.acm.org/10.1145/2884781.2884829.

Andreas Sewe, Mira Mezini, Aibek Sarimbekov, and Walter Binder. Da Capo con Scala: Design and analysis of a Scala benchmark suite for the Java virtual machine. In Proceedings of the 2011 ACM International Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA 2011, pages 657–676, New York, NY, USA, 2011. Association for Computing Machinery. ISBN 9781450309400. doi: 10.1145/2048066.2048118. URL https://doi.org/10.1145/2048066.2048118.

Aleksey Shipilev. Reconsider defaults for warmup and measurement iteration counts, durations, May 2018a. URL https://bugs.openjdk.java.net/browse/CODETOOLS-7902165.

Aleksey Shipilev. Reconsider defaults for fork count, May 2018b. URL https://bugs.openjdk.java.net/browse/CODETOOLS-7902170.

O. Shivers. Control flow analysis in Scheme. In Proceedings of the 1988 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 1988, pages 164–174, New York, NY, USA, June 1988. ACM. doi: 10.1145/960116.54007. URL http://doi.acm.org/10.1145/960116.54007.

Janet Siegmund, Norbert Siegmund, and Sven Apel. Views on internal and external validity in empirical software engineering. In Proceedings of the 37th IEEE/ACM International Conference on Software Engineering, ICSE 2015. IEEE, May 2015a. doi: 10.1109/icse.2015.24. URL https://doi.org/10.1109/icse.2015.24.

Norbert Siegmund, Alexander Grebhahn, Sven Apel, and Christian Kästner. Performance-influence models for highly configurable systems. In Proceedings of the 10th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2015. ACM, 2015b. doi: 10.1145/2786805.2786845. URL https://doi.org/10.1145/2786805.2786845.

Yogesh Singh, Arvinder Kaur, and Bharti Suri. Test case prioritization using ant colony optimization. SIGSOFT Software Engineering Notes, 35(4):1–7, July 2010. ISSN 0163-5948. doi: 10.1145/1811226.1811238. URL http://doi.acm.org/10.1145/1811226.1811238.

Snoop Dogg. Hollywood Walk of Fame ceremony speech, November 2018. URL https://walkoffame.com/snoop-dogg/. Accessed: 29.4.2021.

Daniel Ståhl and Jan Bosch. Modeling continuous integration practice differences in industry software development. Journal of Systems and Software, 87:48–59, January 2014. ISSN 0164-1212. doi: 10.1016/j.jss.2013.08.032. URL http://dx.doi.org/10.1016/j.jss.2013.08.032.

Standard Performance Evaluation Corporation (SPEC). SPECjvm2008, 2008. URL https://spec.org/jvm2008/.

Standard Performance Evaluation Corporation (SPEC). SPECjbb2015, 2015. URL https://spec.org/jbb2015/.

Petr Stefan, Vojtěch Horký, Lubomír Bulej, and Petr Tůma. Unit testing performance in Java projects: Are we there yet? In Proceedings of the 8th ACM/SPEC on International Conference on Performance Engineering, ICPE 2017, pages 401–412, New York, NY, USA, 2017. ACM. ISBN 978-1-4503-4404-3. doi: 10.1145/3030207.3030226. URL http://doi.acm.org/10.1145/3030207.3030226.

Klaas-Jan Stol and Brian Fitzgerald. The ABC of software engineering research. ACM Transactions on Software Engineering and Methodology, 27(3):1–51, October 2018. doi: 10.1145/3241743. URL https://doi.org/10.1145/3241743.

Chunqiang Tang, Thawan Kooburat, Pradeep Venkatachalam, Akshay Chander, Zhe Wen, Aravind Narayanan, Patrick Dowell, and Robert Karl. Holistic configuration management at Facebook. In Proceedings of the 25th Symposium on Operating Systems Principles, SOSP 2015, pages 328–343, New York, NY, USA, 2015. ACM. ISBN 978-1-4503-3834-9. doi: 10.1145/2815400.2815401. URL http://doi.acm.org/10.1145/2815400.2815401.

Chakkrit Tantithamthavorn, Shane McIntosh, Ahmed E. Hassan, Akinori Ihara, and Kenichi Matsumoto. The impact of mislabelling on the performance and interpretation of defect prediction models. In Proceedings of the 37th IEEE/ACM International Conference on Software Engineering, ICSE 2015. IEEE, May 2015. doi: 10.1109/icse.2015.93. URL https://doi.org/10.1109/icse.2015.93.

Chakkrit Tantithamthavorn, Shane McIntosh, Ahmed E. Hassan, and Kenichi Matsumoto. Automated parameter optimization of classification techniques for defect prediction models. In Proceedings of the 38th IEEE/ACM International Conference on Software Engineering, ICSE 2016. ACM Press, 2016. doi: 10.1145/2884781.2884857. URL https://doi.org/10.1145/2884781.2884857.

Chakkrit Tantithamthavorn, Shane McIntosh, Ahmed E. Hassan, and Kenichi Matsumoto. An empirical comparison of model validation techniques for defect prediction models. IEEE Transactions on Software Engineering, 43(1):1–18, January 2017. doi: 10.1109/tse.2016.2584050. URL https://doi.org/10.1109/tse.2016.2584050.

Chakkrit Tantithamthavorn, Shane McIntosh, Ahmed E. Hassan, and Kenichi Matsumoto. The impact of automated parameter optimization on defect prediction models. IEEE Transactions on Software Engineering, 45(7):683–711, July 2019. doi: 10.1109/TSE.2018.2794977.

Chakkrit Tantithamthavorn, Ahmed E. Hassan, and Kenichi Matsumoto. The impact of class rebalancing techniques on the performance and interpretation of defect prediction models. IEEE Transactions on Software Engineering, 46(11):1200–1219, November 2020. doi: 10.1109/tse.2018.2876537. URL https://doi.org/10.1109/tse.2018.2876537.

Luca Della Toffola, Michael Pradel, and Thomas R. Gross. Synthesizing programs that expose performance bottlenecks. In Proceedings of the 2018 International Symposium on Code Generation and Optimization, CGO 2018, pages 314–326, New York, NY, USA, 2018. Association for Computing Machinery. ISBN 9781450356176. doi: 10.1145/3168830. URL https://doi.org/10.1145/3168830.

Paolo Tonella, Paolo Avesani, and Angelo Susi. Using the case-based ranking methodology for test case prioritization. In Proceedings of the 22nd IEEE International Conference on Software Maintenance, ICSM 2006, pages 123–133, Washington, DC, USA, 2006. IEEE Computer Society. ISBN 0-7695-2354-4. doi: 10.1109/ICSM.2006.74. URL http://dx.doi.org/10.1109/ICSM.2006.74.

Burak Turhan, Tim Menzies, Ayşe B. Bener, and Justin Di Stefano. On the relative value of cross-company and within-company data for defect prediction. Empirical Software Engineering, 14(5):540–578, January 2009. doi: 10.1007/s10664-008-9103-7. URL https://doi.org/10.1007/s10664-008-9103-7.

András Vargha and Harold D. Delaney. A critique and improvement of the "CL" common language effect size statistics of McGraw and Wong. Journal of Educational and Behavioral Statistics, 25(2):101–132, 2000. doi: 10.2307/1165329. URL https://doi.org/10.2307/1165329.

Carmine Vassallo, Gerald Schermann, Fiorella Zampetti, Daniele Romano, Philipp Leitner, Andy Zaidman, Massimiliano Di Penta, and Sebastiano Panichella. A tale of CI build failures: An open source and a financial organization perspective. In Proceedings of the 33rd IEEE International Conference on Software Maintenance and Evolution, ICSME 2017, pages 183–193. IEEE, September 2017. doi: 10.1109/ICSME.2017.67. URL https://doi.org/10.1109/icsme.2017.67.

Kaushik Veeraraghavan, Justin Meza, David Chou, Wonho Kim, Sonia Margulis, Scott Michelson, Rajesh Nishtala, Daniel Obenshain, Dmitri Perelman, and Yee Jiun Song. Kraken: Leveraging live traffic tests to identify and resolve resource utilization bottlenecks in large scale web services. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI 16, pages 635–651, Savannah, GA, November 2016. USENIX Association. ISBN 978-1-931971-33-1. URL https://www.usenix.org/conference/osdi16/technical-sessions/presentation/veeraraghavan.

Kristen R. Walcott, Mary Lou Soffa, Gregory M. Kapfhammer, and Robert S. Roos. Time-aware test suite prioritization. In Proceedings of the 2006 ACM/SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2006, pages 1–12, New York, NY, USA, 2006. ACM. ISBN 1-59593-263-1. doi: 10.1145/1146238.1146240. URL http://doi.acm.org/10.1145/1146238.1146240.

Wei Wang, Ningjing Tian, Sunzhou Huang, Sen He, Abhijeet Srivastava, Mary Lou Soffa, and Lori Pollock. Testing cloud applications under cloud-uncertainty performance effects. In Proceedings of the 11th IEEE International Conference on Software Testing, Verification and Validation, ICST 2018, pages 81–92, April 2018. doi: 10.1109/ICST.2018.00018. URL https://doi.org/10.1109/icst.2018.00018.

Jianfeng Wen, Shixian Li, Zhiyong Lin, Yong Hu, and Changqin Huang. Systematic literature review of machine learning based software development effort estimation models. Information and Software Technology, 54(1):41–59, January 2012. doi: 10.1016/j.infsof.2011.09.002. URL https://doi.org/10.1016/j.infsof.2011.09.002.

Elaine J. Weyuker and Filippos I. Vokolos. Experience with performance testing of software systems: Issues, an approach, and case study. IEEE Transactions on Software Engineering, 26(12):1147–1156, December 2000. ISSN 0098-5589. doi: 10.1109/32.888628. URL http://dx.doi.org/10.1109/32.888628.

Murray Woodside, Greg Franks, and Dorina C. Petriu. The future of software performance engineering. In Future of Software Engineering, FOSE 2007. IEEE, May 2007. doi: 10.1109/fose.2007.32. URL https://doi.org/10.1109/fose.2007.32.

S. Yoo and M. Harman. Regression testing minimization, selection and prioritization: A survey. Software Testing, Verification and Reliability, 22(2):67–120, March 2012. ISSN 0960-0833. doi: 10.1002/stvr.430. URL http://dx.doi.org/10.1002/stvr.430.

Dongjiang You, Zhenyu Chen, Baowen Xu, Bin Luo, and Chen Zhang. An empirical study on the effectiveness of time-aware test case prioritization techniques. In Proceedings of the 2011 ACM Symposium on Applied Computing, SAC 2011, pages 1451–1456, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0113-8. doi: 10.1145/1982185.1982497. URL http://doi.acm.org/10.1145/1982185.1982497.

Tingting Yu and Michael Pradel. Pinpointing and repairing performance bottlenecks in concurrent programs. Empirical Software Engineering, 23(5):3034–3071, November 2017. doi: 10.1007/s10664-017-9578-1. URL https://doi.org/10.1007/s10664-017-9578-1.

Shahed Zaman, Bram Adams, and Ahmed E. Hassan. Security versus performance bugs: A case study on Firefox. In Proceedings of the 8th Working Conference on Mining Software Repositories, MSR 2011. ACM, May 2011. doi: 10.1145/1985441.1985457. URL https://doi.org/10.1145/1985441.1985457.

Shahed Zaman, Bram Adams, and Ahmed E. Hassan. A qualitative study on performance bugs. In Proceedings of the 9th IEEE Working Conference on Mining Software Repositories, MSR 2012, pages 199–208. IEEE, 2012. ISBN 978-1-4673-1761-0. doi: 10.1109/msr.2012.6224281. URL https://doi.org/10.1109/msr.2012.6224281.

Lingming Zhang. Hybrid regression test selection. In Proceedings of the 40th IEEE/ACM International Conference on Software Engineering, ICSE 2018, pages 199–209, New York, NY, USA, 2018. ACM. ISBN 978-1-4503-5638-1. doi: 10.1145/3180155.3180198. URL http://doi.acm.org/10.1145/3180155.3180198.

Lingming Zhang, Ji Zhou, Dan Hao, Lu Zhang, and Hong Mei. Prioritizing JUnit test cases in absence of coverage information. In Proceedings of the 25th IEEE International Conference on Software Maintenance, ICSM 2009, pages 19–28, September 2009a. doi: 10.1109/ICSM.2009.5306350. URL https://doi.org/10.1109/icsm.2009.5306350.

Lingming Zhang, Dan Hao, Lu Zhang, Gregg Rothermel, and Hong Mei. Bridging the gap between the total and additional test-case prioritization strategies. In Proceedings of the 35th IEEE/ACM International Conference on Software Engineering, ICSE 2013, pages 192–201, May 2013. doi: 10.1109/ICSE.2013.6606565. URL https://doi.org/10.1109/icse.2013.6606565.

Lu Zhang, Shan-Shan Hou, Chao Guo, Tao Xie, and Hong Mei. Time-aware test-case prioritization using integer linear programming. In Proceedings of the 18th International Symposium on Software Testing and Analysis, ISSTA 2009, pages 213–224, New York, NY, USA, 2009b. ACM. ISBN 978-1-60558-338-9. doi: 10.1145/1572272.1572297. URL http://doi.acm.org/10.1145/1572272.1572297.

Yangyang Zhao, Alexander Serebrenik, Yuming Zhou, Vladimir Filkov, and Bogdan Vasilescu. The impact of continuous integration on other software development practices: A large-scale empirical study. In Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering, ASE 2017, pages 60–71. IEEE, October 2017. doi: 10.1109/ase.2017.8115619. URL https://doi.org/10.1109/ase.2017.8115619.

Yutong Zhao, Lu Xiao, Xiao Wang, Lei Sun, Bihuan Chen, Yang Liu, and Andre B. Bondi. How are performance issues caused and resolved? – An empirical study from a design perspective. In Proceedings of the 11th ACM/SPEC International Conference on Performance Engineering, ICPE 2020. ACM, April 2020. doi: 10.1145/3358960.3379130. URL https://doi.org/10.1145/3358960.3379130.

Hong Zhu, Patrick A. V. Hall, and John H. R. May. Software unit test coverage and adequacy. ACM Computing Surveys, 29(4):366–427, December 1997. ISSN 0360-0300. doi: 10.1145/267580.267590. URL http://doi.acm.org/10.1145/267580.267590.

Thomas Zimmermann, Nachiappan Nagappan, Harald Gall, Emanuel Giger, and Brendan Murphy. Cross-project defect prediction. In Proceedings of the 7th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on The Foundations of Software Engineering, ESEC/FSE 2009. ACM Press, 2009. doi: 10.1145/1595696.1595713. URL https://doi.org/10.1145/1595696.1595713.

Curriculum Vitae

Personal Information

Name: Christoph Laaber
Nationality: Austria
Date of Birth: 12 June 1988

Education

September 2016 – April 2021: Doctoral program at the University of Zurich, Department of Informatics, Chair of Software Engineering
September 2011 – June 2016: Master of Science from the TU Wien in Software Engineering & Internet Computing
September 2008 – June 2011: Bachelor of Science in Engineering from the St. Pölten University of Applied Sciences in Media Technology