
Performance Testing in the Cloud. How Bad is it Really?

Christoph Laaber, Department of Informatics, University of Zurich, Zurich, Switzerland, [email protected]
Joel Scheuner, Software Engineering Division, Chalmers | University of Gothenburg, Gothenburg, Sweden, [email protected]
Philipp Leitner, Software Engineering Division, Chalmers | University of Gothenburg, Gothenburg, Sweden, [email protected]

Abstract

Rigorous performance engineering traditionally assumes measuring on bare-metal environments to control for as many confounding factors as possible. Unfortunately, some researchers and practitioners might not have access, knowledge, or funds to operate dedicated performance testing hardware, making public clouds an attractive alternative. However, cloud environments are inherently unpredictable and variable with respect to their performance. In this study, we explore the effects of cloud environments on the variability of performance testing outcomes, and to what extent regressions can still be reliably detected. We focus on software microbenchmarks as an example of performance tests, and execute extensive experiments on three different cloud services (AWS, GCE, and Azure) and for different types of instances. We also compare the results to a hosted bare-metal offering from IBM Bluemix. In total, we gathered more than 5 million unique microbenchmarking data points from benchmarks written in Java and Go. We find that the variability of results differs substantially between benchmarks and instance types (from 0.03% to > 100% relative standard deviation). We also observe that testing using Wilcoxon rank-sum generally leads to unsatisfying results for detecting regressions due to a very high number of false positives in all tested configurations. However, simply testing for a difference in medians can be employed with good success to detect even small differences. In some cases, a difference as low as a 1% shift in median execution time can be found with a low false positive rate, given a large sample size of 20 instances.

Keywords

performance testing, microbenchmarking, cloud

1 Introduction

In many domains, renting resources from public clouds has largely replaced privately owning resources. This is due to economic factors (public clouds can leverage economies of scale), but also due to the convenience of outsourcing tedious data center or server management tasks [6]. However, one often-cited disadvantage of public clouds is that the inherent loss of control can lead to highly variable and unpredictable performance (e.g., due to co-located noisy neighbors) [7, 13, 17]. This makes adopting cloud computing for performance testing, where predictability and low-level control over the used hard- and software is key, a challenging proposition.

There are nonetheless many good reasons why researchers and practitioners might be interested in adopting public clouds for their experiments. They may not have access to dedicated hardware, or the hardware that they have access to may not scale to the experiment size that the experimenter has in mind. They may wish to evaluate the performance of applications under "realistic conditions", which nowadays often means running them in the cloud. Finally, they may wish to make use of the myriad of industrial-strength infrastructure automation tools, such as Chef (https://www.chef.io) or AWS CloudFormation (https://aws.amazon.com/cloudformation), which ease the setup and identical repetition of complex performance experiments.

In this paper, we ask the question whether using a standard public cloud for software performance experiments is always a bad idea. To manage the scope of the study, we focus on a specific class of cloud service, namely Infrastructure as a Service (IaaS), and on a specific type of performance experiment (evaluating the performance of open source software products written in Java or Go using microbenchmarking frameworks, such as JMH, http://openjdk.java.net/projects/code-tools/jmh). In this context, we address the following research questions:

RQ 1. How variable are benchmark results in different cloud environments?

RQ 2. How large does a regression need to be to be detectable in a given cloud environment? What kind of statistical methods lend themselves to confidently detect regressions?

We base our research on 20 real microbenchmarks sampled from four open source projects written in Java or Go. Further, we study instances hosted in three of the most prominent public IaaS providers (Google Compute Engine, AWS EC2, and Microsoft Azure) and contrast the results against a dedicated bare-metal machine deployed using IBM Bluemix (formerly Softlayer). We also evaluate and compare the impact of common deployment strategies for performance tests in the cloud, namely running test and control experiments on different cloud instances, on the same instances, and as randomized multiple interleaved trials, as recently proposed as best practice [1].

We find that result variability ranges from 0.03% relative standard deviation to > 100%. This variability depends on the particular benchmark and the environment it is executed in. Some benchmarks show high variability across all studied instance types, whereas others are stable in only a subset of the studied environments. We conclude that there are different types of instability (variability inherent to the benchmark, variability between trials, and variability between instances), which need to be handled differently.

Further, we find that standard hypothesis testing (Wilcoxon rank-sum in our case) is an unsuitable tool for performance-regression detection in cloud environments due to false positive rates of 26% to 81%. However, despite the highly variable measurement results, medians tend to be stable. Hence, we show that simply testing for a difference in medians can be used to surprisingly good effect, as long as a sufficient number of samples can be collected. For the baseline case (1 trial on 1 instance), we are mostly only able to detect regression sizes >20%, if regressions are detectable at all. However, with increased sample sizes (e.g., 20 instances, or 5 trials), even a median shift of 1% is detectable for some benchmarks.

Following these findings, we conclude that executing software microbenchmarks is possible on cloud instances, albeit with some caveats. Not all cloud providers and instance types have shown to be equally useful for performance testing, and not all microbenchmarks lend themselves to reliably finding regressions. In most settings, a substantial number of trials or instances is required to achieve good results. However, running test and control groups on the same instance, optimally in random order, reduces the number of required repetitions. Practitioners can use our study as a blueprint to evaluate the stability of their own performance microbenchmarks, in the environments and experimental settings that are available to them.

2 Background

We now briefly summarize important background concepts that we use in our study.

2.1 Software Microbenchmarking

Performance testing is a common term used for a wide variety of different approaches. In this paper, we focus on one very specific technique, namely software microbenchmarking (sometimes also referred to as performance unit tests [12]). Microbenchmarks are short-running (e.g., < 1ms), unit-test-like performance tests, which attempt to measure fine-grained performance counters, such as method-level execution times, throughput, or heap utilization. Typically, frameworks repeatedly execute microbenchmarks for a certain time duration (e.g., 1s) and report their average execution time.

In the Java world, the Java Microbenchmarking Harness (JMH) has established itself as the de facto standard to implement software microbenchmarking [24]. JMH is part of the OpenJDK implementation of Java and allows users to specify benchmarks through an annotation mechanism. Every public method annotated with @Benchmark is executed as part of the performance test suite. Listing 1 shows an example benchmark from the RxJava project, which is used as one of the studied benchmarks (rxjava-5).

@State(Scope.Thread)
public class ComputationSchedulerPerf {

    @State(Scope.Thread)
    public static class Input
            extends InputWithIncrementingInteger {

        @Param({ "100" })
        public int size;
    }

    @Benchmark
    public void observeOn(Input input) throws InterruptedException {
        LatchedObserver o = input.newLatchedObserver();
        input.observable.observeOn(
            Schedulers.computation()
        ).subscribe(o);
        o.latch.await();
    }
}

Listing 1: JMH example from the RxJava project.

Conversely, in the Go programming language, a benchmarking framework is included directly in the standard library (https://golang.org/pkg/testing). This framework is primarily based on conventions. For instance, benchmarks are defined in files ending with _test.go as functions that have a name starting with Benchmark.

In our study, we use both JMH and Go microbenchmarks as test cases to study the suitability of IaaS clouds for performance evaluation.
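To illustrate these Go conventions, the following is a minimal sketch of a Go microbenchmark. It is not taken from the studied projects; all identifiers are illustrative.

// sum_test.go: Go benchmarks live in files ending in _test.go and are
// executed with `go test -bench=.`.
package sum

import "testing"

func sum(values []int) int {
    total := 0
    for _, v := range values {
        total += v
    }
    return total
}

// BenchmarkSum is discovered by the framework because its name starts
// with "Benchmark". The framework chooses b.N large enough to report a
// stable average execution time per iteration.
func BenchmarkSum(b *testing.B) {
    values := make([]int, 1000)
    for i := 0; i < b.N; i++ {
        sum(values)
    }
}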

2.2 Infrastructure as a Service Clouds

The academic and practitioner communities have nowadays widely agreed on a uniform high-level understanding of cloud services following NIST [18]. This definition distinguishes three service models: IaaS, Platform as a Service (PaaS), and Software as a Service (SaaS). These levels differ mostly in which parts of the cloud stack are managed by the cloud provider and what is left to the customer. Most importantly, in IaaS, computational resources are acquired and released in the form of virtual machines or containers. Tenants do not need to operate physical servers, but are still required to administer their virtual servers. We argue that, for the scope of our research, IaaS is the most suitable model at the time of writing, as this model still allows for comparatively low-level access to the underlying infrastructure. Further, setting up performance experiments in IaaS is significantly simpler than doing the same in a typical PaaS system.

In IaaS, a common abstraction is the notion of an instance: an instance is a bundle of resources (e.g., CPUs, storage, networking capabilities, etc.) defined through an instance type and an image. The instance type governs how powerful the instance is supposed to be (e.g., what hardware it receives), while the image defines the software initially installed. More powerful instance types are typically more expensive, even though there is often significant variation even between individual instances of the same type [7, 22]. Instance types are commonly grouped into families, each representing a different usage class (e.g., compute-optimized, memory-optimized).

3 Approach

Traditionally, performance measurements are conducted in dedicated environments, with the goal of reducing the non-deterministic factors inherent in all performance tests to a minimum [20]. Specifically, hardware and software optimizations are disabled on test machines, no background services are running, and each machine has a single tenant. These dedicated environments require a high effort to maintain and have considerable acquisition costs. Conversely, cloud providers offer different types of hardware for on-demand rental that have no maintenance costs and low prices. However, the lack of control over optimizations, virtualization, and multi-tenancy negatively affects performance measurements [17]. To study the extent of these effects, we take the following approach.

We choose a subset of benchmarks from four open-source software (OSS) projects written in two programming languages. These benchmarks are executed repeatedly on the same cloud instance as well as on different cloud instance types from multiple cloud providers. The results are then compared to each other in terms of variability and detectability of regressions.

Project and Benchmark Selection. The study is based on 20 microbenchmarks selected from four OSS projects, two of which are written in Java and two in Go. We decided to choose Java as it has been ranked highly in programming language rankings (e.g., Tiobe, https://www.tiobe.com/tiobe-index), is executed in a virtual machine (VM) (i.e., the JVM) with dynamic optimizations, and has a microbenchmarking framework available that is used by real OSS projects [24]. Go complements our study selection as a newer programming language, introduced in 2009. It is backed by Google, has gained significant traction, compiles directly to machine-executable code, and comes with a benchmarking framework as part of its standard library (https://golang.org/pkg/testing). We chose these languages due to their different characteristics, which improves the generalizability of our results.

In a pre-study, we investigated OSS projects in these languages that make extensive use of microbenchmarking. We chose Log4j2 (https://logging.apache.org/log4j) and RxJava (https://github.com/ReactiveX/RxJava) for Java, and bleve (https://github.com/blevesearch/bleve) and etcd (https://github.com/coreos/etcd) for Go as study subjects. We executed the entire benchmark suites of all study subjects five times on an in-house bare-metal server at the first author's university. We then ranked all benchmarks for each project in the order of result stability between these five repetitions and selected the ones that are: the most stable, the most unstable, the median, as well as the 25th and 75th percentile across repeated executions. Our intuition is to pick five benchmarks from each project that range from stable to unstable results, to explore the effect result variability has on regression detectability. Information about these benchmarks is given in Table 1.

Short Name | Package | Benchmark Name | Parameters
log4j2-1 | org.apache.logging.log4j.perf.jmh | SortedArrayVsHashMapBenchmark.getValueHashContextData | count = 5; length = 20
log4j2-2 | org.apache.logging.log4j.perf.jmh | ThreadContextBenchmark.putAndRemove | count = 50; threadContextMapAlias = NoGcSortedArray
log4j2-3 | org.apache.logging.log4j.perf.jmh | PatternLayoutBenchmark.serializableMCNoSpace | -
log4j2-4 | org.apache.logging.log4j.perf.jmh | ThreadContextBenchmark.legacyInjectWithoutProperties | count = 5; threadContextMapAlias = NoGcOpenHash
log4j2-5 | org.apache.logging.log4j.perf.jmh | SortedArrayVsHashMapBenchmark.getValueHashContextData | count = 500; length = 20
rxjava-1 | rx.operators | OperatorSerializePerf.serializedTwoStreamsSlightlyContended | size = 1000
rxjava-2 | rx.operators | FlatMapAsFilterPerf.rangeEmptyConcatMap | count = 1000; mask = 3
rxjava-3 | rx.operators | OperatorPublishPerf.benchmark | async = false; batchFrequency = 4; childCount = 5; size = 1000000
rxjava-4 | rx.operators | OperatorPublishPerf.benchmark | async = false; batchFrequency = 8; childCount = 0; size = 1
rxjava-5 | rx.schedulers | ComputationSchedulerPerf.observeOn | size = 100
bleve-1 | /search/collector/topn_test.go | BenchmarkTop100of50Scores | -
bleve-2 | /index/upsidedown/benchmark_null_test.go | BenchmarkNullIndexing1Workers10Batch | -
bleve-3 | /index/upsidedown/row_test.go | BenchmarkTermFrequencyRowDecode | -
bleve-4 | /search/collector/topn_test.go | BenchmarkTop1000of100000Scores | -
bleve-5 | /index/upsidedown/benchmark_goleveldb_test.go | BenchmarkGoLevelDBIndexing2Workers10Batch | -
etcd-1 | /client/keys_bench_test.go | BenchmarkManySmallResponseUnmarshal | -
etcd-2 | /integration/v3_lock_test.go | BenchmarkMutex4Waiters | -
etcd-3 | /client/keys_bench_test.go | BenchmarkMediumResponseUnmarshal | -
etcd-4 | /mvcc/kvstore_bench_test.go | BenchmarkStorePut | -
etcd-5 | /mvcc/backend/backend_bench_test.go | BenchmarkBackendPut | -

Table 1: Overview of selected benchmarks. For JMH benchmarks with multiple parameters, we also list the concrete parameterization we used in our experiments.
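The ranking and selection step described above can be sketched as follows. This is an illustrative reconstruction rather than the authors' tooling, and it assumes that result stability is quantified as the relative standard deviation (RSD) of each benchmark across the five pre-study runs.

package main

import "sort"

// selectByStability picks five benchmarks spanning the stability
// spectrum of one project: most stable, 25th percentile, median, 75th
// percentile, and most unstable. Input: benchmark name -> RSD across
// the five pre-study repetitions (assumed stability metric).
func selectByStability(rsdByBenchmark map[string]float64) []string {
    type entry struct {
        name string
        rsd  float64
    }
    entries := make([]entry, 0, len(rsdByBenchmark))
    for name, rsd := range rsdByBenchmark {
        entries = append(entries, entry{name, rsd})
    }
    sort.Slice(entries, func(i, j int) bool { return entries[i].rsd < entries[j].rsd })

    n := len(entries)
    picks := []int{0, n / 4, n / 2, (3 * n) / 4, n - 1} // quantile positions
    selected := make([]string, 0, len(picks))
    for _, p := range picks {
        selected = append(selected, entries[p].name)
    }
    return selected
}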


Cloud Provider Selection. As cloud environments to test, we choose entry-level general-purpose, compute-optimized, and memory-optimized instance types of three cloud service providers. We expect instance types with better specifications to outperform the entry-level ones, and therefore this study sets a baseline of what is possible with the cheapest available cloud resource options. The selected providers are Amazon with Amazon Web Services (AWS) EC2, Microsoft with Azure, and Google with Google Compute Engine (GCE). Table 2 shows the different instance types selected and information about the data center region, CPU and memory specification, and hourly price of these. All cloud instances run Ubuntu 17.04 64-bit.

Provider | Data Center | Instance Type | vCPU | Mem [GB] | Optimized for | Cost [USD/h]
AWS | us-east-1 | m4.large | 2 | 8.00 | GP | 0.1
AWS | us-east-1 | c4.large | 2 | 3.75 | CPU | 0.1
AWS | us-east-1 | r4.large | 2 | 15.25 | Mem | 0.133
Azure | East US | D2s v2 | 2 | 8.00 | GP | 0.1
Azure | East US | F2s | 2 | 4.00 | CPU | 0.1
Azure | East US | E2s v3 | 2 | 16.00 | Mem | 0.133
GCE | us-east1-b | n1-standard-2 | 2 | 7.50 | GP | 0.0950
GCE | us-east1-b | n1-highcpu-2 | 2 | 1.80 | CPU | 0.0709
GCE | us-east1-b | n1-highmem-2 | 2 | 13.00 | Mem | 0.1184

Table 2: Overview of used cloud instance types.

Additionally, we selected a bare-metal machine available for rent from IBM in its Bluemix (formerly known as Softlayer) cloud. A bare-metal instance represents the closest to a controlled performance testing environment that one can get from a public cloud provider. We used the entry-level bare-metal server equipped with a 2.1GHz Intel Xeon IvyBridge (E5-2620-V2-HexCore) processor and 4 x 4GB RAM, running Ubuntu 16.04 64-bit, hosted in IBM's data center in Amsterdam, NL. We specifically deactivated Hyperthreading and Intel's TurboBoost. Moreover, we attempted to disable frequency scaling, but manual checks revealed that this setting is ineffective, and probably overridden by IBM.

Execution. We use the following methodology to execute benchmarks on cloud instances and collect the resulting performance data. For each cloud instance type as listed in Table 2, we create 50 different instances. On each instance, we schedule 10 experiment trials of each benchmark in randomized order (following the method proposed by Abedi and Brecht [1]) without breaks between trials. Within each trial, every benchmark (e.g., etcd-1) consists of 50 repeated executions (e.g., using the -i 50 parameter of JMH), and every execution produces a single data point, which reports the average execution time in ns. For JMH benchmarks, we also run 10 warmup executions (after which steady-state performance is most likely reached [9]) prior to the test executions. The performance counters originating from warmup iterations are discarded. We use the same terminology of instances, trials, executions, and data points in the remainder of the paper. These concepts are also summarized in Figure 1.

Figure 1: Schematic view of instance types, instances, and randomized-order trials on each instance. (Key: IaaS instance type, IaaS instance, benchmark trials in random order, and individual result data points dp1, dp2.)

For setting up the instances, we used Cloud Workbench (https://github.com/sealuzh/cloud-workbench) [23], a toolkit for conducting cloud benchmarking experiments. Cloud Workbench sets up machines with Vagrant (https://www.vagrantup.com) and Chef, collects performance test results, and delivers the outcomes in the form of CSV files for analysis. We collected our study data between July and October 2017.

Using this study configuration, we collected more than 5 million unique data points from our benchmarks. However, due to the large scale of our experiments and the inherent instability of the environment, transient errors (e.g., timeouts) are unavoidable. We apply a conservative outlier-removal strategy, where we remove all data points that are one or more orders of magnitude higher than the median. Table 3 lists how many data points we have collected for each instance type across all instances, trials, and executions, as well as how many data points remain after outlier removal. Unfortunately, due to a configuration error, we are lacking all results for one benchmark (bleve-1) and consequently omit this benchmark from all remaining discussions.

Short Name | Instance Type | Total | Cleaned
AWS Std | m4.large | 484402 | 484056
AWS CPU | c4.large | 484472 | 484471
AWS MEM | r4.large | 474953 | 474951
Azure Std | D2s v2 | 472636 | 472122
Azure CPU | F2s | 510640 | 510528
Azure MEM | E2s v3 | 470570 | 470400
GCE Std | n1-standard-2 | 474491 | 474490
GCE CPU | n1-highcpu-2 | 484138 | 483959
GCE MEM | n1-highmem-2 | 484168 | 483793
Bluemix | - | 595037 | 594778

Table 3: Overview of the number of collected data points per cloud instance type, and how many data points remain after data cleaning.

4 Results

We now present the empirical study's results and answer the research questions stated in the introduction.

4.1 Variability in the Cloud

To answer RQ 1, we study the benchmarks of all projects in terms of their result variability in all chosen environments. Specifically, we report the variability of each benchmark across all 50 instances, 10 trials on each instance, and 50 executions per trial, for all benchmarks and on all instance types. We refer to such a combination of benchmark and instance type as a configuration c ∈ C. We use the relative standard deviation (RSD) in percent as the measure of variability, defined as 100 · σ(DP_c) / µ(DP_c), with σ(DP_c) representing the standard deviation and µ(DP_c) denoting the mean value of all data points collected for a configuration c ∈ C. We report all data in Table 4. Note that this table conflates three different sources of variability: (1) the difference in performance between different cloud instances, (2) the variability between different trials, and (3) the "inherent" variability of a benchmark, i.e., how variable the resulting performance counters are even in the best case. Consequently, a large RSD in Table 4 can have different sources, including, but not limited to, an unpredictable cloud instance type or an unstable benchmark.
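As a concrete illustration of this metric, the following sketch computes the RSD in percent for the data points of one configuration. The function name is ours, and it uses the population standard deviation; the paper does not state which variant was used.

package main

import "math"

// rsdPercent returns the relative standard deviation in percent of all
// data points of one benchmark/instance-type configuration:
// 100 * sigma(dataPoints) / mean(dataPoints).
func rsdPercent(dataPoints []float64) float64 {
    n := float64(len(dataPoints))
    var sum float64
    for _, dp := range dataPoints {
        sum += dp
    }
    mean := sum / n

    var sqDiff float64
    for _, dp := range dataPoints {
        d := dp - mean
        sqDiff += d * d
    }
    sigma := math.Sqrt(sqDiff / n) // population standard deviation (assumption)

    return 100 * sigma / mean
}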

Bench | AWS Std | AWS CPU | AWS Mem | GCE Std | GCE CPU | GCE Mem | Azure Std | Azure CPU | Azure Mem | BM
log4j2-1 | 45.41 | 42.17 | 48.53 | 41.40 | 43.47 | 44.38 | 46.19 | 40.79 | 51.79 | 41.95
log4j2-2 | 7.90 | 4.89 | 3.92 | 10.75 | 9.71 | 11.29 | 6.18 | 6.06 | 11.01 | 3.83
log4j2-3 | 4.86 | 3.76 | 2.53 | 10.12 | 9.18 | 10.15 | 13.89 | 7.55 | 15.46 | 3.02
log4j2-4 | 3.67 | 3.17 | 4.60 | 10.69 | 9.47 | 10.52 | 17.00 | 7.79 | 19.32 | 6.66
log4j2-5 | 76.75 | 86.02 | 88.20 | 83.42 | 82.44 | 80.75 | 82.62 | 86.93 | 82.07 | 77.82
rxjava-1 | 0.04 | 0.04 | 0.05 | 0.04 | 0.04 | 0.04 | 0.05 | 0.05 | 0.27 | 0.03
rxjava-2 | 0.70 | 0.61 | 1.68 | 5.73 | 4.90 | 6.12 | 9.42 | 6.92 | 13.38 | 0.49
rxjava-3 | 2.51 | 3.72 | 1.91 | 8.16 | 8.28 | 9.63 | 6.10 | 5.81 | 10.32 | 4.14
rxjava-4 | 4.55 | 4.18 | 7.08 | 8.07 | 10.46 | 8.82 | 17.06 | 10.22 | 21.09 | 1.42
rxjava-5 | 5.63 | 2.81 | 4.04 | 14.33 | 11.39 | 13.11 | 61.98 | 64.24 | 21.69 | 1.76
bleve-2 | 1.57 | 1.32 | 4.79 | 5.56 | 6.09 | 5.78 | 5.97 | 5.48 | 13.29 | 0.27
bleve-3 | 1.13 | 7.53 | 7.77 | 10.08 | 10.74 | 14.42 | 7.62 | 6.12 | 14.41 | 0.18
bleve-4 | 4.95 | 4.38 | 5.17 | 11.24 | 12.00 | 14.52 | 8.18 | 7.11 | 15.24 | 0.62
bleve-5 | 10.23 | 9.84 | 8.18 | 57.60 | 58.42 | 59.32 | 52.29 | 46.40 | 52.74 | 10.16
etcd-1 | 1.03 | 3.17 | 1.56 | 6.45 | 5.21 | 7.62 | 6.36 | 4.89 | 11.46 | 0.15
etcd-2 | 4.06 | 4.45 | 6.28 | 66.79 | 69.07 | 69.18 | 100.68 | 94.73 | 90.19 | 29.46
etcd-3 | 1.25 | 0.69 | 1.24 | 7.15 | 6.57 | 9.27 | 4.95 | 4.31 | 9.89 | 0.14
etcd-4 | 6.80 | 6.00 | 7.34 | 34.53 | 34.34 | 34.37 | 12.28 | 12.39 | 22.92 | 8.09
etcd-5 | 43.59 | 22.46 | 43.44 | 27.21 | 27.86 | 27.17 | 30.54 | 31.40 | 24.98 | 23.73

Table 4: Result variability in RSD [%] for every benchmark and instance type configuration in the study.

Differences between Benchmarks and Instance Types. It is evident that the investigated benchmarks have a wide range of result variability. Consequently, the potential regression to be detected by the benchmarks varies drastically depending on the benchmark and instance type it is executed on. We observe three classes of benchmarks: (1) some have a relatively small variability across all providers and instance types (e.g., rxjava-1), (2) some show a high variability in any case (e.g., log4j2-5), and (3) some are stable on some instance types, but unstable on others. The first group's results indicate that variability is, as desired, low, making the latter two groups particularly interesting for identifying the reasons for their instability.

We observe three benchmarks that have high result variability, namely log4j2-1, log4j2-5, and, to a lesser extent, etcd-5 on all instance types. There are two factors that lead to high variability: either the execution time of the benchmark is very small, or the benchmark itself produces unstable results. log4j2-1 and log4j2-5 are examples of the first case, with low execution times in the order of only tens of nanoseconds. For these benchmarks, measurement inaccuracy becomes an important factor for variability. In contrast, etcd-5 (see also Figure 2) has an execution time around 250000ns on GCE Mem with an RSD of 27.17%. This variability is similar to all other instance types, with RSDs ranging from 22.46% to 43.59%. Even the bare-metal machine from Bluemix has a high variability of 23.73% RSD. This indicates that the benchmark itself is rather low-quality and produces unstable measurement results, independently of where it is executed.

Figure 2: Drilldown into the variability of etcd-5, an example of a benchmark with high result variability across all instance types. (Plot: instance variability of etcd-5 on GCE Mem; average execution time [ns] per instance for all 50 instances.)

The second group we discuss exhibits high variability on some, but not all, instance types. This group contains two sub-types: (1) benchmarks that have high standard deviations, but where the median runtime is similar, and (2) benchmarks that have overall varying results, including substantially differing medians on different instances. An example of the first sub-group is log4j2-3 on GCE Mem (and similarly on the other GCE and Azure instances), where the benchmark's variability differs among the instances of the same instance type (see Figure 3). We observe that this benchmark on this instance type, unlike etcd-5, has a "long tail" distribution, which is not uncommon for performance data. However, the length of this long tail differs from instance to instance. A possible explanation for this phenomenon is the behavior of other tenants on the same physical machine as the instance. Other tenants may compete for resources needed by a benchmark, causing longer tails in the data. We have observed this problem particularly in the case of log4j2 benchmarks, as a manual analysis of these benchmarks reveals that they tend to be IO-intensive (e.g., writing to log files). Previous work has shown that IO-bound operations suffer particularly from noisy neighbors in a cloud [17].

Figure 3: Drilldown into the variability of log4j2-3, an example of a benchmark with differing variability between instances. (Plot: instance variability of log4j2-3 on GCE Mem; average execution time [ns] per instance for all 50 instances.)

More severe variabilities can be observed with the second sub-group, where even medians are shifted substantially between instances. This is illustrated in Figure 4 for bleve-5 on Azure. A potential cause for this phenomenon is hardware heterogeneity [7, 22], i.e., different hardware configurations being used for different instances of the same instance type (e.g., different processor generations). Another potential root cause can be the Intel Turbo Boost Technology, which overclocks the CPU if compute-intensive applications are currently running on the particular hardware host. Given that the medians fall into a small number of different groups (only 2 in the case of Figure 4), we conclude that multi-tenancy is not the culprit for these cases.

Figure 4: Drilldown into the variability of bleve-5 on Azure, an example of a benchmark with high benchmark result variability due to differences in hardware. (Plot: instance variability of bleve-5 on Azure Mem; average execution time [ns] per instance for all 50 instances.)

Moreover, an interesting finding is that different instance type families (e.g., general-purpose versus compute-optimized) of the same cloud provider mostly do not differ from each other drastically.

The only cloud provider that consistently has different variabilities between its instance types is Azure, where the memory-optimized type does not perform as well as the general-purpose and compute-optimized types. A reason for the similarity of different instance types of the same provider may be that the different types are backed by the same hardware, just with different CPU and RAM configurations. We assume that the benchmarks under study do not fully utilize the provided hardware and, therefore, show little difference. A detailed exploration of this finding is out of scope of the current work.

Sources of Variability. We now discuss the different sources of variability (between instance performance, between trials, inherent to the benchmark) in more detail. Expectedly, the relative impact of these sources of variability differs for different configurations. Some examples are depicted in Figure 5. Each subfigure contrasts three different RSD values: the total RSD of a configuration, as also given in Table 4 (Total), the average RSD per instance (Per Instance), and the average RSD per trial (Per Trial). For the latter two, error bars signifying the standard deviation are also provided. Error bars for Total are not meaningful, as this is a single RSD value per configuration.

Figure 5: Drilldown on the source of variability for four example configurations. (Panels: AWS CPU / etcd-2, Azure Std / etcd-2, AWS CPU / log4j2-5, and GCE Mem / etcd-4; bars show the RSD per trial, per instance, and in total.)

The top left and right subfigures provide a drilldown on etcd-2 in different clouds, and explore in more detail why this curious benchmark is remarkably stable on AWS and unstable on GCE and Azure (see also Table 4). The top left subfigure shows the benchmark on AWS CPU. There is little difference in the RSDs between trials, instances, and total, signifying that the little variability that the benchmark has largely originates from the benchmark itself. The large error bars signify that this variability differs from trial to trial, independently of which instance the trial is executed on. This is different for the same benchmark executed on Azure (top right subfigure). While the per-trial and per-instance RSD is also larger than on AWS, it is particularly the total RSD that is now very high. This signifies that the reason for the difference in stability between providers is indeed due to varying instance performance.

A different example is provided by the bottom left subfigure, which shows log4j2-5 on AWS CPU. Here, the inherent benchmark variability is minuscule, but there are substantial differences between different trials, largely independently of whether those trials happen on the same instance or not. This indicates that for this benchmark a large source of variability are trial-level effects, such as JVM just-in-time optimizations. Finally, the example on the bottom right shows etcd-4 on GCE Mem. This benchmark has a high total variability, which is composed of a combination of large inherent benchmark variability and substantial performance variability between different instances.

IaaS Instances vs. Bare-metal Cloud. The study's idea is to not only compare performance test results on different resources from a diverse set of cloud providers, but also to compare the nine IaaS instance types with results obtained from a hosted bare-metal server, in our case from IBM Bluemix. Our initial hypothesis was that the bare-metal server would provide substantially more stable results than any other instance type. Based on the data, we cannot support that hypothesis. Benchmarks with high result variability across all instance types (log4j2-1, log4j2-5, and etcd-5) perform badly on bare-metal as well. An interesting observation is that AWS performs about the same as Bluemix in terms of result variability. Some benchmarks (e.g., rxjava-3) even have more reliable results on AWS than on the bare-metal solution. One potential reason for AWS being surprisingly close to the stability of a hosted bare-metal server is its provisioned IOps (https://aws.amazon.com/de/about-aws/whats-new/2012/07/31/announcing-provisioned-iops-for-amazon-ebs/), which is the default for all three used AWS instance types. Provisioned IOps throttles the IO operations to 10 Mbps, but at the same time guarantees consistent bandwidth, whereas the other providers generally offer IO bandwidth on a best-effort basis. However, Bluemix consistently leads to less variable results than the other cloud providers in the study.

4.2 Detecting Regressions

To answer RQ 2, we first explore the usability of standard hypothesis testing (Wilcoxon rank-sum in our case) for reliable regression detection in cloud environments, and then study the effects (with increased numbers of measurements) that intra- and inter-instance variability has on detectable regression sizes.

4.2.1 A/A Testing. We initially study the applicability of Wilcoxon rank-sum (also referred to as the Mann-Whitney U test) for performance-change detection by performing A/A tests of all projects' benchmarks with different sample sizes. The goal of A/A testing is to compare samples that, by construction, do not stem from a differently performing application (i.e., running the same benchmark in the same environment) and observe whether the statistical test correctly does not reject H0 (i.e., the test does not find a regression if, by construction, there is none).

Recall the approach from Section 3 and Figure 1: we executed every benchmark on each individual instance 10 times (number of trials) and repeated these trials 50 times on different instances of the same instance type. We now randomly select 1, 5, 10, and 20 instances (as indicated by the column "# VMs") and then randomly select 5 trials for the test and 5 trials for the control group. This implements a recent best practice for low cloud measurement variabilities as proposed by Abedi and Brecht [1].

For each benchmark, we repeated this experiment 100 times with different instances, test, and control groups. Table 5 reports in percent how often a Wilcoxon rank-sum test between the test and control group rejects H0 with a p-value < 0.05. Wilcoxon rank-sum is stable for non-normally distributed data, which most of our data is, and, therefore, fits the needs of this study [4]. For readability, we aggregate the data over all benchmarks of each project. Intuitively, the table reports in percent how often a Wilcoxon rank-sum test falsely reported a performance change when neither benchmark nor code had in fact changed.
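To make one A/A comparison concrete, the following sketch compares two groups of measurements from the same benchmark and environment with a Mann-Whitney U (Wilcoxon rank-sum) test. It is our own illustrative code, not the statistical tooling used in the study, and it uses the textbook normal approximation without tie correction.

package main

import (
    "math"
    "sort"
)

// mannWhitneyP returns the two-sided p-value of a Mann-Whitney U
// (Wilcoxon rank-sum) test using the normal approximation; ties get
// average ranks, and no tie correction is applied (a simplification).
func mannWhitneyP(a, b []float64) float64 {
    type obs struct {
        v     float64
        fromA bool
    }
    all := make([]obs, 0, len(a)+len(b))
    for _, v := range a {
        all = append(all, obs{v, true})
    }
    for _, v := range b {
        all = append(all, obs{v, false})
    }
    sort.Slice(all, func(i, j int) bool { return all[i].v < all[j].v })

    // assign ranks, averaging over ties
    ranks := make([]float64, len(all))
    for i := 0; i < len(all); {
        j := i
        for j < len(all) && all[j].v == all[i].v {
            j++
        }
        avg := float64(i+j+1) / 2 // average of the 1-based ranks i+1..j
        for k := i; k < j; k++ {
            ranks[k] = avg
        }
        i = j
    }

    var rankSumA float64
    for i, o := range all {
        if o.fromA {
            rankSumA += ranks[i]
        }
    }
    n1, n2 := float64(len(a)), float64(len(b))
    u := rankSumA - n1*(n1+1)/2
    mean := n1 * n2 / 2
    sd := math.Sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z := (u - mean) / sd
    // two-sided p-value from the standard normal distribution
    return 2 * (1 - 0.5*math.Erfc(-math.Abs(z)/math.Sqrt2))
}

// In an A/A test, a p-value below 0.05 counts as a false positive,
// because test and control stem from identical conditions:
// falsePositive := mannWhitneyP(controlGroup, testGroup) < 0.05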

Project | # VMs | AWS Std | AWS CPU | AWS Mem | GCE Std | GCE CPU | GCE Mem | Azure Std | Azure CPU | Azure Mem | BM
log4j2 | 1 | 74 | 81 | 79 | 76 | 73 | 70 | 72 | 72 | 71 | 77
log4j2 | 5 | 68 | 78 | 71 | 72 | 63 | 60 | 62 | 73 | 66 | 65
log4j2 | 10 | 66 | 79 | 70 | 71 | 59 | 57 | 55 | 75 | 64 | 60
log4j2 | 20 | 63 | 77 | 62 | 72 | 57 | 49 | 52 | 72 | 52 | 51
RxJava | 1 | 64 | 68 | 60 | 58 | 62 | 58 | 44 | 47 | 42 | 65
RxJava | 5 | 62 | 62 | 59 | 60 | 56 | 55 | 39 | 44 | 44 | 61
RxJava | 10 | 61 | 60 | 56 | 59 | 56 | 58 | 39 | 45 | 42 | 56
RxJava | 20 | 63 | 62 | 59 | 58 | 54 | 62 | 38 | 42 | 35 | 53
bleve | 1 | 55 | 55 | 58 | 72 | 66 | 65 | 53 | 46 | 47 | 64
bleve | 5 | 64 | 65 | 62 | 66 | 70 | 70 | 55 | 55 | 55 | 62
bleve | 10 | 58 | 63 | 64 | 68 | 65 | 67 | 56 | 56 | 60 | 55
bleve | 20 | 55 | 67 | 58 | 69 | 70 | 67 | 53 | 46 | 60 | 46
etcd | 1 | 41 | 41 | 40 | 53 | 54 | 50 | 34 | 34 | 28 | 47
etcd | 5 | 39 | 41 | 41 | 47 | 47 | 48 | 33 | 33 | 30 | 42
etcd | 10 | 42 | 38 | 43 | 45 | 46 | 43 | 29 | 28 | 30 | 42
etcd | 20 | 41 | 44 | 44 | 44 | 42 | 44 | 28 | 26 | 28 | 35

Table 5: False positive rates in percent of A/A testing with Wilcoxon rank-sum tests, testing between multiple trials on multiple instances.

Overall, the rate of falsely reported differences (false positives, FPs) between identical executions is very high, ranging from 26% up to 81% detected changes in this already conservative test setup. For the 40 different configurations, in 10 cases a single instance produced the least amount of FPs. This shows that even an increase of the sample size does not necessarily help to reduce the number of FPs.

Different instance types of the same cloud provider report roughly the same number of FPs. In contrast to the variability results (see Section 4.1), the FP-rates between cloud providers differ: whereas AWS and Bluemix exhibit the lowest variabilities, the statistical test shows the fewest FPs on Azure instances (12/16 Azure configurations). Unexpectedly, in 9 out of 16 cases, AWS shows the highest FP-rate, with a minimum of 39% for etcd with 5 VMs. Further, the Bluemix bare-metal server is the only one that consistently has fewer FPs with increased sample sizes. Azure has the lowest numbers of FPs for all projects but Log4j2. Nevertheless, FP-rates are above 25%, which still indicates that this approach cannot be used for reliable detection of performance changes in our environment. Hence, we conclude that hypothesis testing using Wilcoxon rank-sum tests is not a suitable vehicle for detecting performance degradations in cloud environments. It should be noted that we did not conduct extensive experiments with other statistical testing methods (e.g., comparison of confidence intervals or ANOVA). However, given the drastically unsatisfying results with Wilcoxon rank-sum tests, we do not expect other hypothesis testing methods to perform so much better as to become practically useful. Given the overall high numbers of FPs, we believe it is unlikely that other tests would outperform Wilcoxon rank-sum such that the FP-rate would consistently drop to 5% or less. Nonetheless, an extensive comparative study between the different types of statistical tests is open for future research.

4.2.2 Detectable Regression Sizes. While our results using hypothesis testing are disappointing, the fairly stable medians we have observed in Section 4.1 (see also Figures 2 and 3, which depict the medians in green dots) indicate that a simpler testing method may prove useful. In the following, we experiment with what regressions can confidently be found when simply evaluating whether the test and control group differ in median by at least a pre-defined threshold.

Input: Benchmark results for all instances and trials of an instance type;
       sampling strategy select for test and control group
Data: Regressions to test for R = {1, 2, 3, 4, 5, 10, 15, 20, 25, 50, 75, 100} [%]
Result: Smallest regression size detectable

foreach r ∈ R in descending order do
    for 100 times do
        select random control (c) and test (t) group
        if |median(t) − median(c)| > r/2 then
            report FP
        end
        if |median(add_reg_to(t, r)) − median(c)| > r/2 then
            report TP
        end
    end
    if FP-rate ≤ 0.05 and TP-rate ≥ 0.95 then
        return r
    end
end
return no regression detectable

Algorithm 1: A simple approach for detecting regressions based on medians.

Threshold-based Approach for Regression Detection. To find the minimal regression detectable by a benchmark in a particular environment, we performed testing based on the approach described in Algorithm 1. This approach finds the smallest detectable regression across 100 random samples, where at least 95% of the time this regression was detected, which we refer to as a true positive (TP), and less than 5% of the time an A/A test reported a change, referred to as an FP. The sampling strategy (select) defines the control and test groups, i.e., which instances and trials both consist of. We describe the two sampling strategies we experimented with below.
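A minimal sketch of this median-threshold check for a single regression size r is shown below. It is our own reconstruction of the idea behind Algorithm 1, not the authors' implementation: we interpret the threshold |median(t) − median(c)| > r/2 relative to the control median, inject an artificial slowdown by scaling the test group by (1 + r), and assume a sample() function that draws control and test groups according to one of the sampling strategies described next.

package main

import (
    "math"
    "sort"
)

func median(xs []float64) float64 {
    s := append([]float64(nil), xs...)
    sort.Float64s(s)
    n := len(s)
    if n%2 == 1 {
        return s[n/2]
    }
    return (s[n/2-1] + s[n/2]) / 2
}

// detectable reports whether a relative regression r (e.g. 0.05 for 5%)
// is reliably detectable for one benchmark/environment: an injected
// slowdown of r must shift the median by more than r/2 in at least 95%
// of resamples (TPs), while an unmodified A/A comparison must do so in
// at most 5% of resamples (FPs).
func detectable(sample func() (control, test []float64), r float64) bool {
    const runs = 100
    tp, fp := 0, 0
    for i := 0; i < runs; i++ {
        c, t := sample()
        mc := median(c)
        if math.Abs(median(t)-mc)/mc > r/2 {
            fp++ // A/A comparison: a difference above the threshold is spurious
        }
        slowed := make([]float64, len(t))
        for j, v := range t {
            slowed[j] = v * (1 + r) // inject an artificial regression of size r
        }
        if math.Abs(median(slowed)-mc)/mc > r/2 {
            tp++
        }
    }
    return float64(fp)/runs <= 0.05 && float64(tp)/runs >= 0.95
}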

Figure 6 shows an example illustrating the approach for etcd-4 on AWS Std. The y-axis shows the percentage of detected regressions (either TPs or FPs), and the x-axis shows the regression tested for. The left subfigure shows the data for 1 trial per instance, the right one for 5 trials. Both subfigures use only a single instance as sample size. The red line indicates the TPs, whereas the blue line depicts the FPs. The horizontal dashed line illustrates the regression detectable by the benchmark in the chosen environment. Note that with increasing simulated regression, the TP-rate typically goes up (it is easier to find larger regressions), but the number of FPs goes down (when we assume larger regressions, the threshold that we test for, which is fixed to be half the regression, goes up, rendering the A/A test more robust).

Figure 6: An illustration for etcd-4 on AWS Std of how the detectable regressions are calculated. (Panels: Trials: 1 and Trials: 5, at Instances: 1; y-axis: fraction of regressions found, separately for TP and FP; x-axis: regression 0.0–0.5.)

In the illustrated example, the minimum regression that can be detected is 20% for both numbers of trials. In both subfigures, we notice that the TP-rate is high for all regressions, and that for slowdowns <20% the FP-rate is high. In this example, the FP-rate is the deciding factor for when a regression is detectable according to our approach.

A further example in Figure 7 shows that with such a small measurement sample size, some benchmarks (e.g., bleve-5 on Azure Mem) are entirely unstable and produce a high rate of FPs. Therefore, they cannot reliably detect regressions at all. We now investigate the detectable regressions for all configurations using this strategy.

Figure 7: In the case of bleve-5 on Azure Mem, and for a sample size of a single instance, no regression that we tested for was detectable due to the very high rate of FPs. (Panels: Trials: 1 and Trials: 5, at Instances: 1; y-axis: fraction of regressions found, separately for TP and FP; x-axis: regression 0.0–0.5.)

Instance-Based Sampling. We use and contrast two different sampling procedures. First, we randomly select 1, 5, and 20 different instances, and use 1 random trial from each instance. This simulates the case of a performance experiment where the test and control group are run on different cloud instances. This can happen, for instance, when a software team runs a performance test for each new release, and uses as control group the data gathered in the previous release. Between releases, cloud instances are terminated to save costs. We refer to this as the instance-based sampling strategy. Table 6 shows the results for this sampling strategy. Values represent a regression of x%. Cells with the value "no" indicate that for the given configuration no regression can be detected with the approach described above. We investigate 190 combinations overall.

Bench | # VMs | AWS Std | AWS CPU | AWS Mem | GCE Std | GCE CPU | GCE Mem | Azure Std | Azure CPU | Azure Mem | BM
log4j2-1 | 1 | 5 | 4 | 2 | 20 | no | 50 | 20 | no | 50 | 1
log4j2-1 | 5 | 1 | 1 | 1 | 10 | 10 | 15 | 10 | 10 | 10 | 1
log4j2-1 | 20 | 1 | 1 | 1 | 10 | 5 | 10 | 4 | 4 | 4 | 1
log4j2-2 | 1 | 10 | 25 | 3 | 25 | 20 | 25 | 20 | 50 | 25 | 1
log4j2-2 | 5 | 3 | 20 | 1 | 15 | 10 | 15 | 10 | 15 | 10 | 1
log4j2-2 | 20 | 1 | 5 | 1 | 10 | 10 | 10 | 4 | 5 | 4 | 1
log4j2-3 | 1 | 10 | 5 | 5 | 25 | 20 | 25 | 50 | 25 | 25 | 4
log4j2-3 | 5 | 4 | 3 | 1 | 15 | 15 | 15 | 10 | 15 | 10 | 2
log4j2-3 | 20 | 2 | 1 | 1 | 10 | 5 | 10 | 4 | 5 | 5 | 1
log4j2-4 | 1 | 10 | 10 | 15 | 25 | 50 | 50 | 50 | 50 | 25 | 15
log4j2-4 | 5 | 5 | 4 | 5 | 15 | 15 | 15 | 15 | 15 | 10 | 5
log4j2-4 | 20 | 3 | 3 | 2 | 10 | 5 | 10 | 5 | 10 | 5 | 3
log4j2-5 | 1 | no | no | no | no | no | no | no | no | no | no
log4j2-5 | 5 | no | no | no | no | no | no | no | no | no | no
log4j2-5 | 20 | no | no | no | no | no | no | no | no | no | no
rxjava-1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1
rxjava-1 | 5 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1
rxjava-1 | 20 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1
rxjava-2 | 1 | 2 | 4 | 3 | 20 | 25 | 50 | 10 | 10 | 10 | 2
rxjava-2 | 5 | 1 | 1 | 2 | 15 | 15 | 20 | 5 | 5 | 4 | 1
rxjava-2 | 20 | 1 | 1 | 1 | 10 | 5 | 10 | 2 | 3 | 2 | 1
rxjava-3 | 1 | 10 | 5 | 10 | 25 | 25 | 25 | 20 | 25 | 20 | 20
rxjava-3 | 5 | 5 | 4 | 3 | 10 | 15 | 15 | 10 | 10 | 10 | 10
rxjava-3 | 20 | 3 | 3 | 2 | 10 | 10 | 10 | 4 | 4 | 4 | 10
rxjava-4 | 1 | 15 | 15 | 50 | 50 | 50 | 20 | 15 | 25 | 15 | 5
rxjava-4 | 5 | 10 | 10 | 50 | 15 | 15 | 10 | 10 | 10 | 10 | 3
rxjava-4 | 20 | 4 | 3 | 10 | 10 | 10 | 5 | 5 | 4 | 3 | 1
rxjava-5 | 1 | 10 | 25 | 50 | no | no | 75 | 50 | 50 | no | 10
rxjava-5 | 5 | 10 | 10 | 20 | 25 | 25 | 20 | 50 | 25 | 50 | 2
rxjava-5 | 20 | 5 | 4 | 5 | 15 | 10 | 15 | 25 | 20 | 20 | 1
bleve-2 | 1 | 5 | 10 | 15 | 20 | 25 | 25 | 25 | 15 | 25 | 1
bleve-2 | 5 | 2 | 3 | 10 | 10 | 10 | 20 | 10 | 10 | 10 | 1
bleve-2 | 20 | 2 | 2 | 4 | 10 | 5 | 10 | 4 | 4 | 3 | 1
bleve-3 | 1 | 3 | 10 | 10 | 20 | 25 | 25 | 50 | 20 | 50 | 1
bleve-3 | 5 | 1 | 2 | 5 | 15 | 15 | 20 | 15 | 15 | 10 | 1
bleve-3 | 20 | 1 | 1 | 3 | 10 | 5 | 10 | 5 | 10 | 5 | 1
bleve-4 | 1 | 15 | 10 | 20 | 20 | 50 | 50 | 50 | 50 | 100 | 1
bleve-4 | 5 | 5 | 5 | 15 | 15 | 15 | 50 | 20 | 20 | 15 | 1
bleve-4 | 20 | 3 | 3 | 4 | 10 | 10 | 10 | 10 | 10 | 10 | 1
bleve-5 | 1 | 75 | 50 | 50 | 75 | no | no | no | no | no | 50
bleve-5 | 5 | 25 | 25 | 10 | 15 | 20 | no | no | no | no | 15
bleve-5 | 20 | 15 | 10 | 10 | 10 | 10 | 25 | 100 | no | 100 | 10
etcd-1 | 1 | 2 | 5 | 10 | 20 | 20 | 25 | 20 | 20 | 20 | 1
etcd-1 | 5 | 1 | 2 | 2 | 10 | 10 | 20 | 10 | 10 | 10 | 1
etcd-1 | 20 | 1 | 1 | 2 | 10 | 5 | 10 | 3 | 4 | 3 | 1
etcd-2 | 1 | 3 | 4 | 3 | 50 | no | no | no | no | no | 20
etcd-2 | 5 | 2 | 3 | 2 | 20 | 50 | no | no | no | no | 10
etcd-2 | 20 | 1 | 1 | 1 | 15 | 20 | 100 | 50 | 75 | 50 | 10
etcd-3 | 1 | 3 | 5 | 4 | 25 | 20 | 25 | 20 | 15 | 25 | 1
etcd-3 | 5 | 1 | 2 | 2 | 15 | 10 | 25 | 10 | 10 | 10 | 1
etcd-3 | 20 | 1 | 1 | 1 | 10 | 5 | 10 | 3 | 4 | 3 | 1
etcd-4 | 1 | 20 | 20 | 25 | 50 | 50 | no | no | no | no | 25
etcd-4 | 5 | 10 | 15 | 15 | 15 | 20 | no | 100 | 100 | 75 | 15
etcd-4 | 20 | 5 | 10 | 10 | 10 | 10 | 15 | 75 | 75 | 50 | 10
etcd-5 | 1 | 5 | 15 | 15 | no | no | 75 | 50 | 50 | 50 | 20
etcd-5 | 5 | 4 | 10 | 10 | 25 | 50 | 20 | 20 | 50 | 25 | 10
etcd-5 | 20 | 2 | 4 | 4 | 20 | 20 | 10 | 15 | 15 | 15 | 4

Table 6: Minimum detectable regression [%] using instance-based sampling and a single trial per instance.

This sampling strategy, combined with the approach outlined above, detects a slowdown in 159 combinations. The remaining 31 combinations generate too unstable results to reliably find the regressions that we tested for (up to 100%). This indicates that some benchmark-environment combinations produce too unstable results for reliable regression detection. In 12 out of the 31 undetectable cases, an increase to 5 instances, and in a further 8 cases to 20 instances, allows us to find at least large regressions reliably. In 15 of the examined combinations, namely log4j2-5 on all instance types, even a sample size increase did not allow us to reliably detect a regression of 100% or lower.

Apart from log4j2-5, AWS and Bluemix are the only providers whose result variabilities never render our approach unable to detect regressions reliably. Further, across all benchmarks, these two providers offer a platform for the lowest detectable regressions, which is in line with our results from Section 4.1. This is especially evident for the first four log4j2, the first three bleve, and the first three etcd benchmarks. For those, on AWS and Bluemix, it is possible to detect regressions between 1% and 10%, whereas in comparison, on GCE and Azure, slowdown sizes above 10% are the usual case. In line with the variability results, IBM's bare-metal environment is best for detecting regressions with the instance-based strategy, although AWS achieves very good results as well. Sometimes AWS even outperforms Bluemix (e.g., etcd-2).

Taking a step back and highlighting the big picture, Figure 8 shows the number of combinations that capture a particular regression size as a density plot. This figure depicts the minimal regression sizes across various providers, instance types, and benchmarks, such that they are detectable with a given number of samples (i.e., repetitions on different instances). An increase in the number of instances overall leads to smaller regressions becoming detectable.

Figure 8: Regressions found using the instance-based sampling strategy. With one sample, many benchmarks are only able to reliably detect regressions in the 20% to 50% range. With 20 instances, 3% to 10% regressions can often be found. (Bar plot: count of configurations per minimum detectable regression [%], grouped by 1, 5, and 20 VMs.)

Trial-Based Sampling. For evaluating the second strategy, we randomly select 1 and 20 instances per benchmark and environment. However, different from the instance-based strategy, the control and test group now consist of the same instances, with five different randomly selected trials each. This simulates the case where a performance engineer starts 1 or 20 instances and then runs both the test and the control group multiple times on the same instance in randomized order. This minimizes the impact of the performance of the specific instance, which we have established to be an important factor contributing to variability in many cases. Hence, we expect that this approach should generally lead to smaller regressions being detectable. The results are presented in Table 7.

Bench | # VMs | AWS Std | AWS CPU | AWS Mem | GCE Std | GCE CPU | GCE Mem | Azure Std | Azure CPU | Azure Mem | BM
log4j2-1 | 1 | 5 | 1 | 1 | 10 | 75 | 15 | 25 | 50 | 15 | 1
log4j2-1 | 20 | 1 | 1 | 1 | 2 | 3 | 3 | 3 | 3 | 2 | 1
log4j2-2 | 1 | 15 | 25 | 3 | 10 | 10 | 10 | 20 | 25 | 15 | 1
log4j2-2 | 20 | 1 | 5 | 1 | 3 | 2 | 3 | 3 | 3 | 3 | 1
log4j2-3 | 1 | 10 | 4 | 4 | 10 | 10 | 10 | 15 | 20 | 15 | 5
log4j2-3 | 20 | 2 | 1 | 1 | 3 | 2 | 2 | 2 | 2 | 3 | 1
log4j2-4 | 1 | 10 | 10 | 15 | 15 | 10 | 15 | 15 | 20 | 15 | 20
log4j2-4 | 20 | 2 | 3 | 2 | 3 | 2 | 4 | 3 | 3 | 3 | 3
log4j2-5 | 1 | no | no | no | no | no | no | no | no | no | no
log4j2-5 | 20 | no | no | no | no | no | no | no | no | no | no
rxjava-1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1
rxjava-1 | 20 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1
rxjava-2 | 1 | 2 | 2 | 1 | 4 | 10 | 10 | 10 | 10 | 10 | 2
rxjava-2 | 20 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 1 | 1
rxjava-3 | 1 | 10 | 5 | 5 | 5 | 10 | 10 | 10 | 10 | 10 | 15
rxjava-3 | 20 | 3 | 3 | 2 | 2 | 2 | 3 | 2 | 2 | 2 | 5
rxjava-4 | 1 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 5
rxjava-4 | 20 | 2 | 3 | 2 | 3 | 3 | 2 | 3 | 2 | 3 | 1
rxjava-5 | 1 | 3 | 4 | 5 | no | no | 10 | 10 | 5 | 10 | 2
rxjava-5 | 20 | 1 | 1 | 2 | 2 | 2 | 2 | 4 | 3 | 4 | 1
bleve-2 | 1 | 1 | 2 | 3 | 4 | 4 | 10 | 10 | 4 | 5 | 1
bleve-2 | 20 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 2 | 1 | 1
bleve-3 | 1 | 1 | 1 | 2 | 3 | 4 | 10 | 10 | 10 | 15 | 1
bleve-3 | 20 | 1 | 1 | 1 | 1 | 1 | 2 | 2 | 2 | 2 | 1
bleve-4 | 1 | 3 | 5 | 3 | 3 | 4 | 10 | 15 | 15 | 15 | 1
bleve-4 | 20 | 1 | 1 | 1 | 1 | 2 | 2 | 2 | 2 | 3 | 1
bleve-5 | 1 | 5 | 5 | 10 | 10 | 4 | 15 | no | no | no | 50
bleve-5 | 20 | 2 | 2 | 1 | 3 | 2 | 5 | 15 | 20 | 20 | 10
etcd-1 | 1 | 1 | 2 | 1 | 4 | 3 | 10 | 5 | 10 | 10 | 1
etcd-1 | 20 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 2 | 1 | 1
etcd-2 | 1 | 1 | 2 | 1 | 20 | 50 | 50 | 100 | no | no | 20
etcd-2 | 20 | 1 | 1 | 1 | 10 | 5 | 50 | 15 | 20 | 20 | 5
etcd-3 | 1 | 1 | 1 | 1 | 5 | 3 | 10 | 10 | 10 | 15 | 1
etcd-3 | 20 | 1 | 1 | 1 | 1 | 1 | 3 | 1 | 2 | 1 | 1
etcd-4 | 1 | 4 | 5 | 4 | 10 | 10 | 10 | no | 50 | 50 | 20
etcd-4 | 20 | 1 | 1 | 2 | 2 | 2 | 3 | 15 | 15 | 15 | 4
etcd-5 | 1 | 3 | 10 | 15 | 10 | 10 | 50 | 25 | 25 | 25 | 20
etcd-5 | 20 | 1 | 2 | 2 | 3 | 3 | 4 | 5 | 10 | 5 | 3

Table 7: Minimum detectable regression using a trial-based sampling strategy, and 5 trials per instance.

The results are generally in line with our expectations. log4j2-5 remains unable to reliably detect regressions in any configuration. Similar to the instance-based strategy, the same benchmarks show undetectable regressions. However, in general, the number of configurations for which no regression can be found goes down considerably, even if just a single instance is used.

An increase in the number of instances reduces the inter-instance variability drastically, as depicted in Figure 9. Comparing the 20-instance cases, this 5-trial-based sampling reduces the detectable regression sizes from around 10% to 15% for the single-trial instance-based sampling to below 4% for most configurations.

Figure 9: Regressions found using the trial-based sampling strategy. With one instance, regressions between 5% and 15% can often be found. With 20 instances, we are able to discover regressions as small as 4% in most configurations. (Bar plot: count of configurations per minimum detectable regression [%], grouped by 1 and 20 VMs.)

This confirms our initial expectation, as well as the previous result by Abedi and Brecht [1]: executing performance experiments on the same instances in randomized order can indeed be considered a best practice. However, it is a surprising outcome how small the regressions are that can be found with high confidence even in a comparatively unstable environment. This is largely due to the medians of benchmark results actually not varying much between runs, even if the distributions themselves have a high standard deviation or a long tail.
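The difference between the two strategies boils down to how control and test groups are drawn from the collected data. The sketch below shows our reading of both strategies and would plug into the sample() parameter of the earlier detection sketch; the data layout (instances holding trials, trials holding data points) follows Figure 1, the exact split between groups is our interpretation, and all names are ours.

package main

import "math/rand"

// Trial is one benchmark trial: 50 data points (average execution times in ns).
type Trial []float64

// Instance holds the 10 trials executed on one cloud instance.
type Instance []Trial

// instanceBased draws control and test groups from different randomly
// chosen instances, one random trial per instance (e.g., comparing a new
// release against data gathered on instances that no longer exist).
func instanceBased(instances []Instance, n int) (control, test []float64) {
    perm := rand.Perm(len(instances))
    for _, idx := range perm[:n] {
        control = append(control, instances[idx][rand.Intn(len(instances[idx]))]...)
    }
    for _, idx := range perm[n : 2*n] {
        test = append(test, instances[idx][rand.Intn(len(instances[idx]))]...)
    }
    return control, test
}

// trialBased draws control and test groups from the same n instances,
// five distinct randomly selected trials per group on each instance.
func trialBased(instances []Instance, n int) (control, test []float64) {
    for _, idx := range rand.Perm(len(instances))[:n] {
        inst := instances[idx]
        trials := rand.Perm(len(inst)) // random order of the 10 trials
        for _, t := range trials[:5] {
            control = append(control, inst[t]...)
        }
        for _, t := range trials[5:10] {
            test = append(test, inst[t]...)
        }
    }
    return control, test
}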

5 Discussion

In this section, we discuss the implications of this study's results for researchers and practitioners. First and foremost, we want to address the titular question, "Performance Testing in the Cloud. How Bad is it Really?". The answer to this question has multiple aspects: (1) which cloud provider and instance type is used, (2) how many measurements are taken, (3) which measurements are considered for regression analysis, and (4) which regression sizes are desired to be detected. Subsequently, we explicitly address threats to the validity and limitations of our study.

5.1 Implications and Main Lessons Learned

Cloud Provider and Instance Type. The reliable detection of regressions directly depends on the choice of cloud provider. We have observed relevant differences between providers, and more stable providers will ultimately lead to benchmark results with lower result variability. Across all experiments performed, we see an indication that, at the time of writing, benchmarks executed on AWS produce the most stable results compared to GCE and Azure. Even better results are obtained when utilizing a bare-metal machine rented from Bluemix. Surprisingly, variability and detectable regressions in AWS and Bluemix are not far apart. An interesting conclusion drawn from the presented data is that there is no big difference between instance types of the same provider, suggesting that using the cheapest option is often the right choice for performance experiments.

Required Number of Measurements. The result variability as well as the detectable slowdown is directly affected by the number of repeated benchmark executions. A naive suggestion would be to run as many as possible, which is obviously at odds with temporal and monetary constraints. With the "limitless" availability of cloud instances, performance tests of a new software version can be executed in parallel. Even long-running benchmark suites could be split into subsets and run in parallel in order to reduce overall execution time. The only sequential operation is the repetition of trials on the same instance. There is no definite answer to exactly how many measurements are required for all benchmarks to reliably detect slowdowns, as this depends on project, benchmark, and cloud configuration. However, we have observed that even a single instance can often be used to detect small regressions if trials are repeated five times and both test and control groups are executed on the same instance within a short time frame and in random order (see the first sketch at the end of this section).

Measurement Strategy. Both instance-based and trial-based sampling come with their own advantages and disadvantages. The trial-based strategy leads to substantially better regression detection results, but may not lend itself to typical continuous software development scenarios. The instance-based strategy can be implemented easily, but requires substantially higher sample sizes, or an experimenter who is willing to compromise on the regressions that can realistically be found. One advantage which, to some extent, alleviates the problem of high sample sizes is that this strategy supports parallelization of test executions nicely.

Detectable Regression Size. The regression size that one desires to detect influences the configuration of the performance testing environment. For example, if it is sufficient to detect regressions in the order of 20% to 50%, an instance-based strategy with 5 instances might be sufficient. However, if regressions below 10% are desired to be detected, more instances with multiple trials might be required. Even with extensive testing on multiple instances, it is not guaranteed that all benchmarks of a project will reliably detect regressions of a certain size. We have observed multiple benchmarks, most evidently log4j2-5, that are inherently not able to detect realistic regressions in our setting. We argue that there is a need for developer tooling which constantly tracks the result variability of benchmarks on the current as well as other cloud providers in order to dynamically adapt to software as well as provider changes. Further, as shown in Section 4.2, standard hypothesis testing is not an appropriate tool for regression detection of performance measurements conducted on cloud instances. Before relying on detected performance variations, researchers as well as practitioners should validate whether the observed change is genuine or potentially caused by the flakiness of the experiment environment. A/A testing can and should be used to assure the experimenter of this (see the second sketch at the end of this section).

5.2 Threats to Validity and Limitations

As with any empirical study, there are experiment design trade-offs, threats, and limitations to the validity of our results to consider.

Threats to Internal and Construct Validity. Experiments in a public cloud always need to consider that the cloud provider is, for all practical purposes, a black box that we cannot control. Although reasonable model assumptions can be made (e.g., based on common sense, previous literature, and information published by the providers), we can fundamentally only speculate about the reasons for any variability we observe. Another concern that is sometimes raised for cloud experimentation is that the cloud provider may in theory actively impact the scientific experiment, for instance by providing more stable instances to benchmarking initiatives than they would do in production. However, in practice, such concerns are generally unfounded. Major cloud providers operate large data centers on which our small-scale experiments are expected to have a negligible impact, and historically, providers have not shown interest in directly interfering with scientific experiments. For the present study we investigated entry-level instance types, which we consider to produce baseline results compared to better instances. A follow-up study is required to investigate whether variability and detectability results improve for superior cloud hardware. Another threat to the internal validity of our study is that we have chosen to run all experiments in a relatively short time frame. This was done to avoid bias from changes in the performance of a cloud provider (e.g., through hardware updates), but this decision means that our study only reports on a specific snapshot and not on longitudinal data, as would be observed by a company using the cloud for performance testing over a period of years.

Threats to External Validity. We have only investigated microbenchmarking in Java and Go, and only for a selected sample of benchmarks in four OSS projects. Further, we have focused on three, albeit well-known, public cloud providers and a single bare-metal hosting provider. A reader should carefully evaluate whether our results can be generalized to other projects and providers. Even more so, our results should not be generalized to performance testing in a private cloud, as many of the phenomena that underlie our results (e.g., noisy neighbors, hardware heterogeneity) cannot, or not to the same extent, be observed in a private cloud. Similarly, we are not able to make claims regarding the generalizability of our results to other types of performance experiments, such as stress or load tests. Further, we focus in our study on the detection of median execution-time related performance regressions. To keep the scope of our experiments manageable, we chose not to discuss worst-case performance or different performance counters (e.g., memory consumption or IO operations) in the present work. Finally, readers need to keep in mind that any cloud-provider-benchmarking study is fundamentally aiming at a moving target. As long as virtualization, multi-tenancy, or control over hardware optimizations is managed by providers, we expect the fundamental results and implications of our work to be stable. Nevertheless, detailed concrete results (e.g., detectable regression sizes on particular provider/instance types) may become outdated as providers update their hardware or introduce new offerings. For instance, when AWS introduced provisioned IOPS, the IO-related performance characteristics of their service changed sufficiently that previously-published academic benchmarking studies became outdated.
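To make the execution strategy discussed above concrete, the following sketch outlines a randomized, interleaved schedule of test and control trials on a single cloud instance. It is illustrative only: the trial count, the fixed random seed, and the runBenchmarkSuite hook are placeholders and not part of our experimental tooling.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Illustrative sketch: builds a randomized, interleaved order of test and
// control trials to be executed back-to-back on one cloud instance.
public class InterleavedTrialSchedule {

    enum Group { CONTROL, TEST }

    public static void main(String[] args) {
        int trialsPerGroup = 5; // e.g., five repeated trials per group, as discussed above
        List<Group> schedule = new ArrayList<>();
        for (int i = 0; i < trialsPerGroup; i++) {
            schedule.add(Group.CONTROL);
            schedule.add(Group.TEST);
        }
        // Randomize the order so that time-dependent interference on the instance
        // (e.g., a noisy neighbor) is not systematically attributed to one group.
        Collections.shuffle(schedule, new Random(42));

        for (Group group : schedule) {
            runBenchmarkSuite(group); // placeholder: run the suite for the given version
        }
    }

    private static void runBenchmarkSuite(Group group) {
        System.out.println("Running benchmark suite for " + group);
    }
}
```

Executing both groups in random order within a short time frame is what allows even a single instance to reveal small regressions, since both versions are exposed to largely the same instance-level conditions.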
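The second sketch contrasts the two analysis approaches discussed in Section 5.1: a Wilcoxon rank-sum (Mann-Whitney U) test and a simple comparison of medians. It assumes Apache Commons Math is available on the classpath; the hard-coded samples and the 3% threshold are placeholders, not recommendations derived from our data. Feeding the check two samples of the same software version turns it into an A/A test.

```java
import org.apache.commons.math3.stat.StatUtils;
import org.apache.commons.math3.stat.inference.MannWhitneyUTest;

// Illustrative sketch: compares control and test execution-time samples with
// (a) the Wilcoxon rank-sum / Mann-Whitney U test and (b) a relative median shift.
public class RegressionCheck {

    public static void main(String[] args) {
        double[] control = {101.2, 99.8, 100.5, 102.0, 100.1, 99.5, 101.7, 100.9};
        double[] test    = {103.4, 102.8, 104.1, 103.0, 102.5, 103.9, 104.4, 102.2};

        // (a) Hypothesis test: small p-values flag a difference, but in our A/A
        //     experiments this approach produced many false positives.
        double pValue = new MannWhitneyUTest().mannWhitneyUTest(control, test);

        // (b) Median comparison: flag a regression if the test median exceeds the
        //     control median by more than a chosen threshold.
        double medianControl = StatUtils.percentile(control, 50);
        double medianTest = StatUtils.percentile(test, 50);
        double relativeShift = (medianTest - medianControl) / medianControl;

        System.out.printf("Wilcoxon rank-sum p-value: %.4f%n", pValue);
        System.out.printf("Relative median shift: %.2f%%%n", relativeShift * 100);
        if (relativeShift > 0.03) { // placeholder threshold
            System.out.println("Median-based check flags a potential regression.");
        }
    }
}
```

In line with our results, thresholding the relative median shift, combined with a sufficient number of instances or trials, is the more robust of the two checks in cloud environments.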
6 Related Work

Software performance is a cross-cutting concern affected by many parts of a system and, therefore, hard to understand and study. Two general approaches to software performance engineering (SPE) are prevalent: measurement-based SPE, which executes performance experiments and monitors and evaluates their results, and model-based SPE, which predicts performance characteristics based on the created models [26]. In this paper, we focus on measurement-based SPE.

It has been extensively studied that measuring correctly and applying the right statistical analyses is hard and that much can be done wrong. Mytkowicz et al. [20] pinpoint that many systems researchers have drawn wrong conclusions through measurement bias. Others report on wrongly quantified experimental evaluations that ignore the uncertainty of measurements caused by the non-deterministic behavior of software systems (e.g., memory placement, dynamic compilation) [15]. Dealing with the non-deterministic behavior of dynamically optimized programming languages, Georges et al. [9] summarize methodologies to measure languages like Java and explain which statistical methods to use for performance evaluation. All of these studies expect an environment that is as stable as possible to run performance experiments on. More recently, Arif et al. [2] study the effect virtual environments have on load tests. They find that there is a discrepancy between physical and virtual environments, which is most strongly affected by the unpredictability of IO performance. Our paper augments this study, which looks at the result unreliability of load tests, whereas we investigate software microbenchmarks. Additionally, our study differs by conducting measurements in cloud environments rather than in virtual environments on controlled hardware.

Traditionally, performance testing research was conducted in the context of system-scale load and stress testing [3, 14, 19, 25]. By now, such performance tests are academically well-understood, and recent research focuses on industrial applicability [8, 21] or on how to reduce the time necessary to execute such tests [11]. Studies of software microbenchmarking have not received mainstream attention previously, but academics have recently started investigating it [5, 12, 24]. Similarly, Leitner and Bezemer [16] recently investigated different practices of microbenchmarking of OSS written in Java. However, none of these studies report on result reliability.

A substantial body of research has investigated the performance, and stability of performance, of cloud providers independently of software performance engineering experiments. For instance, Leitner and Cito [17] study the performance characteristics of cloud environments across multiple providers, regions, and instance types. Iosup et al. [13] evaluate the usability of IaaS clouds for scientific computing. Gillam et al. [10] focus on a fair comparison of providers in their work. Ou et al. [22] and Farley et al. [7] specifically focus on hardware heterogeneity and how it can be exploited to improve a tenant's cloud experience. Our study sets a different focus on software performance tests and goes a step further by investigating which regressions can be detected.

7 Conclusions

This paper empirically studied "how bad" performance testing (i.e., software microbenchmarking) in cloud environments actually is. We investigated the result variability and minimal detectable regression sizes of microbenchmark suite subsets of two Java projects (Log4j2 and RxJava) and two Go projects (bleve and etcd). The test suites were executed on general-purpose, compute-optimized, and memory-optimized instance types of three public IaaS providers, i.e., Amazon's EC2, Google's GCE, and Microsoft Azure, and, as comparison, a hosted bare-metal environment from IBM Bluemix. Result variability, indicated as RSD, for the studied benchmark-environment configurations ranges from 0.03% to 100.68%. This variability originates from three sources (variability between instances, between trials, and inherent to the benchmark), and different benchmark-environment configurations suffer to very different degrees from each of these sources. The bare-metal instance expectedly produces very stable results. However, AWS is typically not substantially less stable. Both GCE and Azure seemed to lend themselves much less to performance testing in our experiments. Further, we found that A/A tests using Wilcoxon rank-sum tests produce high false-positive rates between 26% and 81%. However, a simple strategy based on comparing medians can be applied successfully, often to detect surprisingly small differences. For high sample sizes (e.g., 20 instances), performance differences as small as 1% can be found. If this is not possible, detectable regressions range between 10% and 50% for most cases, or even beyond for some, depending on benchmark, instance type, and sample size.

Acknowledgments

The research leading to these results has received funding from the Swiss National Science Foundation (SNF) under project MINCA (Models to Increase the Cost Awareness of Cloud Developers). Further, this work was partially supported by the Wallenberg Autonomous Systems and Software Program (WASP), funded by the Knut and Alice Wallenberg Foundation.
References

[1] Ali Abedi and Tim Brecht. 2017. Conducting Repeatable Experiments in Highly Variable Cloud Computing Environments. In Proceedings of the 8th ACM/SPEC on International Conference on Performance Engineering (ICPE '17). ACM, New York, NY, USA, 287–292. https://doi.org/10.1145/3030207.3030229
[2] Muhammad Moiz Arif, Weiyi Shang, and Emad Shihab. 2017. Empirical Study on the Discrepancy between Performance Testing Results from Virtual and Physical Environments. Empirical Software Engineering (03 Oct 2017). https://doi.org/10.1007/s10664-017-9553-x
[3] Cornel Barna, Marin Litoiu, and Hamoun Ghanbari. 2011. Autonomic Load-testing Framework. In Proceedings of the 8th ACM International Conference on Autonomic Computing (ICAC '11). ACM, New York, NY, USA, 91–100. https://doi.org/10.1145/1998582.1998598
[4] Lubomír Bulej, Vojtech Horký, and Petr Tůma. 2017. Do We Teach Useful Statistics for Performance Evaluation?. In Proceedings of the 8th ACM/SPEC on International Conference on Performance Engineering Companion (ICPE '17 Companion). ACM, New York, NY, USA, 185–189. https://doi.org/10.1145/3053600.3053638
[5] Jinfu Chen and Weiyi Shang. 2017. An Exploratory Study of Performance Regression Introducing Code Changes. In Proceedings of the 33rd International Conference on Software Maintenance and Evolution (ICSME '17). New York, NY, USA, 12.
[6] Jürgen Cito, Philipp Leitner, Thomas Fritz, and Harald C. Gall. 2015. The Making of Cloud Applications: An Empirical Study on Software Development for the Cloud. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2015). ACM, New York, NY, USA, 393–403. https://doi.org/10.1145/2786805.2786826
[7] Benjamin Farley, Ari Juels, Venkatanathan Varadarajan, Thomas Ristenpart, Kevin D. Bowers, and Michael M. Swift. 2012. More for Your Money: Exploiting Performance Heterogeneity in Public Clouds. In Proceedings of the Third ACM Symposium on Cloud Computing (SoCC '12). ACM, New York, NY, USA, Article 20, 14 pages. https://doi.org/10.1145/2391229.2391249
[8] King Chun Foo, Zhen Ming (Jack) Jiang, Bram Adams, Ahmed E. Hassan, Ying Zou, and Parminder Flora. 2015. An Industrial Case Study on the Automated Detection of Performance Regressions in Heterogeneous Environments. In Proceedings of the 37th International Conference on Software Engineering - Volume 2 (ICSE '15). IEEE Press, Piscataway, NJ, USA, 159–168. http://dl.acm.org/citation.cfm?id=2819009.2819034
[9] Andy Georges, Dries Buytaert, and Lieven Eeckhout. 2007. Statistically Rigorous Java Performance Evaluation. In Proceedings of the 22nd Annual ACM SIGPLAN Conference on Object-oriented Programming Systems and Applications (OOPSLA '07). ACM, New York, NY, USA, 57–76. https://doi.org/10.1145/1297027.1297033
[10] Lee Gillam, Bin Li, John O'Loughlin, and Anuz Pratap Singh Tomar. 2013. Fair Benchmarking for Cloud Computing Systems. Journal of Cloud Computing: Advances, Systems and Applications 2, 1 (2013), 6. https://doi.org/10.1186/2192-113X-2-6
[11] Mark Grechanik, Chen Fu, and Qing Xie. 2012. Automatically Finding Performance Problems with Feedback-Directed Learning Software Testing. In Proceedings of the 34th International Conference on Software Engineering (ICSE '12). IEEE Press, Piscataway, NJ, USA, 156–166. http://dl.acm.org/citation.cfm?id=2337223.2337242
[12] Vojtech Horky, Peter Libic, Lukas Marek, Antonin Steinhauser, and Petr Tuma. 2015. Utilizing Performance Unit Tests To Increase Performance Awareness. In Proceedings of the 6th ACM/SPEC International Conference on Performance Engineering (ICPE '15). ACM, New York, NY, USA, 289–300. https://doi.org/10.1145/2668930.2688051
[13] Alexandru Iosup, Nezih Yigitbasi, and Dick Epema. 2011. On the Performance Variability of Production Cloud Services. In Proceedings of the 2011 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID '11). IEEE Computer Society, Washington, DC, USA, 104–113. https://doi.org/10.1109/CCGrid.2011.22
[14] Z. M. Jiang and A. E. Hassan. 2015. A Survey on Load Testing of Large-Scale Software Systems. IEEE Transactions on Software Engineering 41, 11 (Nov 2015), 1091–1118. https://doi.org/10.1109/TSE.2015.2445340
[15] Tomas Kalibera and Richard Jones. 2012. Quantifying Performance Changes with Effect Size Confidence Intervals. Technical Report 4–12. University of Kent. 55 pages. http://www.cs.kent.ac.uk/pubs/2012/3233
[16] Philipp Leitner and Cor-Paul Bezemer. 2017. An Exploratory Study of the State of Practice of Performance Testing in Java-Based Open Source Projects. In Proceedings of the 8th ACM/SPEC on International Conference on Performance Engineering (ICPE '17). ACM, New York, NY, USA, 373–384. https://doi.org/10.1145/3030207.3030213
[17] Philipp Leitner and Jürgen Cito. 2016. Patterns in the Chaos – A Study of Performance Variation and Predictability in Public IaaS Clouds. ACM Transactions on Internet Technology 16, 3, Article 15 (April 2016), 23 pages. https://doi.org/10.1145/2885497
[18] Peter Mell and Timothy Grance. 2011. The NIST Definition of Cloud Computing. Technical Report 800-145. National Institute of Standards and Technology (NIST), Gaithersburg, MD.
[19] Daniel A. Menascé. 2002. Load Testing of Web Sites. IEEE Internet Computing 6, 4 (2002), 70–74. https://doi.org/10.1109/MIC.2002.1020328
[20] Todd Mytkowicz, Amer Diwan, Matthias Hauswirth, and Peter F. Sweeney. 2009. Producing Wrong Data Without Doing Anything Obviously Wrong!. In Proceedings of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS XIV). ACM, New York, NY, USA, 265–276. https://doi.org/10.1145/1508244.1508275
[21] Thanh H. D. Nguyen, Meiyappan Nagappan, Ahmed E. Hassan, Mohamed Nasser, and Parminder Flora. 2014. An Industrial Case Study of Automatically Identifying Performance Regression-Causes. In Proceedings of the 11th Working Conference on Mining Software Repositories (MSR 2014). ACM, New York, NY, USA, 232–241. https://doi.org/10.1145/2597073.2597092
[22] Zhonghong Ou, Hao Zhuang, Jukka K. Nurminen, Antti Ylä-Jääski, and Pan Hui. 2012. Exploiting Hardware Heterogeneity Within the Same Instance Type of Amazon EC2. In Proceedings of the 4th USENIX Conference on Hot Topics in Cloud Computing (HotCloud '12). USENIX Association, Berkeley, CA, USA, 4–4. http://dl.acm.org/citation.cfm?id=2342763.2342767
[23] Joel Scheuner, Philipp Leitner, Jürgen Cito, and Harald Gall. 2014. Cloud WorkBench – Infrastructure-as-Code Based Cloud Benchmarking. In Proceedings of the 2014 IEEE 6th International Conference on Cloud Computing Technology and Science (CLOUDCOM '14). IEEE Computer Society, Washington, DC, USA, 246–253. https://doi.org/10.1109/CloudCom.2014.98
[24] Petr Stefan, Vojtech Horky, Lubomir Bulej, and Petr Tuma. 2017. Unit Testing Performance in Java Projects: Are We There Yet?. In Proceedings of the 8th ACM/SPEC on International Conference on Performance Engineering (ICPE '17). ACM, New York, NY, USA, 401–412. https://doi.org/10.1145/3030207.3030226
[25] Elaine J. Weyuker and Filippos I. Vokolos. 2000. Experience with Performance Testing of Software Systems: Issues, an Approach, and Case Study. IEEE Transactions on Software Engineering 26, 12 (Dec. 2000), 1147–1156. https://doi.org/10.1109/32.888628
[26] Murray Woodside, Greg Franks, and Dorina C. Petriu. 2007. The Future of Software Performance Engineering. In 2007 Future of Software Engineering (FOSE '07). IEEE Computer Society, Washington, DC, USA, 171–187. https://doi.org/10.1109/FOSE.2007.32