Masaryk University
Faculty of Informatics

Performance Testing Automation of Apache Qpid Messaging Libraries

Bachelor’s Thesis

Jiří Daněk

Brno, Fall 2020

This is where a copy of the official signed thesis assignment and a copy of the Statement of an Author are located in the printed version of the document.

Declaration

Hereby I declare that this paper is my original authorial work, which I have worked out on my own. All sources, references, and literature used or excerpted during elaboration of this work are properly cited and listed in complete reference to the due source.

Jiří Daněk

Advisor: Mgr. Martin Večeřa


Acknowledgements

Software performance testing has proved to be a deep and intriguing area in the field of software quality engineering to me. I would therefore like to thank my fellow quality engineers on the Messaging QE team at Red Hat Czech, s. r. o. for providing a friendly work environment in which I could pursue this topic. Special thanks belong to my thesis consultant, Ing. Zdeněk Kraus, who works as the team’s manager.

I am fortunate that there has been considerable prior work in the development of performance measurement tooling applicable to the software I am focusing on. When making the final step of integrating the tools into a performance regression test suite, I was able to refer to tools built by Justin R. Ross and Otavio R. Piske, both fellow Red Hat employees. Some of the bibliographical references I have used were recommended by yet another Red Hat employee, Francesco Nigro, in e-mail discussions.

Finally, I must not forget to thank my parents for their support, and for helping to keep me motivated during my university studies.

Abstract

Event-driven architecture has proven to be an effective way of designing software systems composed of loosely coupled software components. These designs are often implemented using message-oriented middleware, which puts messaging into a fundamental role and places great demands on the reliability and performance of the messaging solution being used. This thesis focuses on messaging libraries developed under the Apache Qpid Proton project, and proposes a method for automated measurement of their performance in peer-to-peer mode using an open-source tool called Quiver. The proposed performance measurement method has been implemented in the form of a Jenkins pipeline and is suitable for inclusion in a continuous delivery pipeline of a corporate software development process, serving as a performance regression test.

Keywords

software testing, performance testing, continuous integration, network technologies, Qpid Proton, AMQP 1.0


Contents

Introduction
    0.1 Note on Terminology

1 Software Testing
    1.1 Software engineering
    1.2 Software Quality Engineering
    1.3 Testing process
        1.3.1 Testing terminology
        1.3.2 Types of tests
        1.3.3 Purposes of testing
    1.4 When to test
        1.4.1 Continuous integration
        1.4.2 Testing is Alerting
    1.5 When to stop testing

2 Performance testing
    2.1 Importance of performance and performance testing
    2.2 Key Performance Indicators (KPIs)
    2.3 When to test performance
    2.4 Automation in performance testing
    2.5 Types of performance tests
    2.6 Performance testing approaches
        2.6.1 Microbenchmarks
        2.6.2 System-level performance testing
        2.6.3 Performance monitoring in production
    2.7 Industrial benchmarking

3 Maestro, Quiver, and Additional Tooling
    3.1 Software Under Test: Apache Qpid
    3.2 Performance measurement frameworks
        3.2.1 Maestro
        3.2.2 Quiver
    3.3 Additional tooling
        3.3.1 Ansible
        3.3.2 Jenkins
        3.3.3 Google Benchmark
        3.3.4 Docker

4 Automation Design and Implementation
    4.1 Microbenchmarking
        4.1.1 Implementation
    4.2 Quiver automation
        4.2.1 Design
        4.2.2 Implementation
    4.3 Result reporting
        4.3.1 Statistical analysis

5 Performance regression testing evaluation
    5.1 Test results
    5.2 Result analysis
        5.2.1 Qpid Proton C
        5.2.2 Qpid Proton C++
        5.2.3 Qpid Proton Python
        5.2.4 Qpid JMS
    5.3 Future work

Bibliography

List of Figures

    3.1 Maestro HTML report
    3.2 Performance Co-Pilot disk throughput chart
    5.1 Qpid Proton C P2P Throughput per Release Version
    5.2 Qpid Proton C++ P2P Throughput per Release Version
    5.3 Qpid Proton Python P2P Throughput per Release Version
    5.4 Qpid JMS P2P Throughput per Release Version

Introduction

Messaging libraries are part of middleware, a software layer located between application code and the operating system, which provides the application with additional services beyond what the operating system itself offers. Messaging belongs to the area of Interprocess Communication and Enterprise Integration techniques. It allows creating distributed software with a decentralized, decoupled flow of information, based on the notion of a logical address. Main applications of messaging include the Internet of Things (IoT), Event-Driven Architectures, and application integration using the Enterprise Service Bus (ESB) pattern. Messaging libraries developed in the Apache Qpid project use a standardized protocol called AMQP 1.0 (Advanced Message Queuing Protocol) to exchange messages. Moreover, the Apache Qpid JMS library also implements a standardized Application Programming Interface (API) called JMS 2.0 (Java Message Service). As a result, the external interfaces of Apache Qpid libraries are largely fixed, and the focus of the subprojects is to implement them in an efficient and performant manner for their respective supported programming languages. Performant messaging is necessary to enable the creation of demanding applications.
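To give a concrete impression of the programming model these libraries expose, the following minimal sketch uses the Qpid Proton Python API to send a fixed number of small messages to a peer listening at a given AMQP address. The address and message count are illustrative values only, not a configuration used in this thesis.

    from proton import Message
    from proton.handlers import MessagingHandler
    from proton.reactor import Container

    class SimpleSend(MessagingHandler):
        """Sends `count` messages to `url` and closes the connection once all are accepted."""

        def __init__(self, url, count):
            super(SimpleSend, self).__init__()
            self.url = url
            self.count = count
            self.sent = 0
            self.accepted = 0

        def on_start(self, event):
            # The container manages the connection; we only ask for a sender link.
            event.container.create_sender(self.url)

        def on_sendable(self, event):
            # Send only while the peer has granted us link credit.
            while event.sender.credit and self.sent < self.count:
                event.sender.send(Message(body="hello %d" % self.sent))
                self.sent += 1

        def on_accepted(self, event):
            self.accepted += 1
            if self.accepted == self.count:
                event.connection.close()

    # Peer address and message count are illustrative.
    Container(SimpleSend("amqp://localhost:5672/examples", 100)).run()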

Continuous Integration and Continuous Delivery (CI/CD) are two techniques of modern software development. Changes to the software are automatically integrated into the master repository, and unit tests and integration tests are performed for each change. Continuous Delivery then automates the process of creating deliverable artifacts at release time. Continuous Deployment extends this to the automatic deployment of the artifacts to the production environment, making the software development project more agile and efficient. The CI/CD pipeline becomes a point of communication and collaboration for all the diverse roles on the project development team working on the release. Building an automated Continuous Delivery pipeline that includes nonfunctional requirement checks, such as performance, is highly desirable for project health and velocity.

Timely information gained from performance testing can prevent performance regressions, that is, degradation in application performance from one version of the application to the next; aid in capacity planning, making it possible to estimate the amount of hardware needed to support a particular application deployment; inform future optimization efforts; and, last but not least, provide backing for marketing claims used to promote the product.

In my thesis, I have focused on examining two preexisting performance testing solutions for AMQP client libraries and decided on using one of them, called Quiver, initially developed by Justin R. Ross. I have then designed and built a job in the Jenkins continuous integration system, which automatically sets up an environment and runs a Quiver performance test against an Apache Qpid client library the user specifies. The stability and reliability of performance results from this job stem from the use of a dedicated physical machine to perform the test, from running the Quiver test multiple times, and from calculating a mean throughput and the associated confidence interval from the data.

The Jenkins job has been implemented in the form of a Jenkins declarative pipeline and is suitable for inclusion in a continuous delivery pipeline of a corporate software development process, serving as a performance regression test for peer-to-peer message exchange using Apache Qpid libraries.

My work presents an improvement over the previous state of things, which required each developer or quality engineer to procure their own hardware and set up the test in an ad-hoc fashion. Previously shared performance measurement data were never accompanied by confidence intervals or other similar means that would allow their reliability, and the significance of changes in the values over time, to be judged. Using the job developed in this thesis is not only more straightforward, it also provides more meaningful and repeatable results, and improves utilization of the dedicated hardware running the benchmark, which is now managed as a Jenkins node.
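The statistical treatment mentioned above can be illustrated with a short sketch: assuming a handful of repeated Quiver runs, a mean throughput and a 95% confidence interval based on the Student's t distribution can be computed as follows. The throughput values are made up for illustration, and SciPy is assumed to be available for the t quantile; the actual analysis is described in Chapter 4.

    import math
    import statistics
    from scipy import stats  # assumed dependency, used only for the t quantile

    def mean_confidence_interval(samples, confidence=0.95):
        """Return (mean, half-width) of the confidence interval for the mean,
        using the Student's t distribution (appropriate for a small number of runs)."""
        n = len(samples)
        mean = statistics.mean(samples)
        sem = statistics.stdev(samples) / math.sqrt(n)    # standard error of the mean
        t = stats.t.ppf((1 + confidence) / 2.0, n - 1)    # two-sided t quantile
        return mean, t * sem

    # Throughput in messages per second from five hypothetical Quiver runs.
    runs = [152_300, 149_800, 151_100, 150_400, 153_000]
    mean, half_width = mean_confidence_interval(runs)
    print("throughput: %.0f ± %.0f msg/s (95%% CI)" % (mean, half_width))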

This thesis is structured as follows. Chapter 1 introduces the discipline of software testing. Chapter 2 discusses the specifics of performance testing. Chapter 3 describes available performance measurement tooling with a view towards comparing their capabilities. Chapter 4 presents a design for a performance test and discusses its concrete implementation in Jenkins CI. Chapter 5 concludes the thesis with an evaluation of the implemented solution and proposals for future expansions of the performance test to cover additional messaging patterns and messaging middleware.

Note on Terminology

This thesis distinguishes between machines (physical or virtualized computers) and servers (processes providing network services, running on machines) [1]. Another useful distinction, although less relevant for the present discussion, is that between a client (a process connecting to a network server) and a customer (a physical or corporate entity which is using our software).


1 Software Testing

Beware of bugs in the above code; I have only proved it correct, not tried it.

— Donald Knuth, 1977

1.1 Software engineering

Software engineering can be defined as "the application of a systematic, disciplined, quantifiable approach to the development, operation, and maintenance of software" [2, p. 67]. This definition suggests that while programming stands at the core of software engineering, it is being shaped through rigorous engineering practices. Russ Cox offers the following, less encyclopedic-sounding, definition:

Software engineering is what happens to programming when you add time and other programmers [3].

Having established that software engineering is based in programming, computer science, and engineering, we must now acknowledge that its character is inherently related to its practitioners, that is, human beings, often employed in corporate structures. Software engineering is a social process [4, p. 130], and tools of social and cognitive science should be used in its study [4, p. 129].

1.2 Software Quality Engineering

Software quality engineering is one of the areas of software engineering. Its focus can be described as the use of an engineering approach in assuring software quality, much as software reliability engineering can be described as "what happens when you ask a software engineer to design an operations team" [1].

The Oxford English Dictionary defines quality as "the standard or nature of something as measured against other things of a similar kind; the degree of excellence possessed by a thing." [5]. This definition presents quality as a relative attribute, since it is "measured against other things of a similar kind". Every participant in the software development process, as well as the customers, uses different measures for quality, because each uses a different set of "other things" as their yardstick, has a different view of what constitutes "things of the same kind", and so judges "excellence" differently.

The role of software quality engineers on a project is to find information [6, p. 22] and to support programmers [6, p. 25] as well as other members of the project team. When quality engineers do their job well, the programmers will clearly know when the features they are working on are done to customer satisfaction, and project management will have a good idea of the project’s progress and its release readiness.

Quality assurance tools and techniques that software quality engineers employ are centered around testing and metrics.

Metrics provide introspection into the software development process, often even into the quality engineers’ own activities. One example of a popular metric is the defect arrival rate, that is, the number of issues reported against a product in a time period. Monitoring this and other metrics can enable the project management team to make data-driven decisions, such as to pronounce the software ready for release only when previously found defects are fixed and it appears that the probability of finding further defects through continued testing is acceptably low [7]. When using metrics, it is necessary to keep in mind Goodhart’s law: "When a measure becomes a target, it ceases to be a good measure." In the case of the defect arrival rate, if the testers and developers start to tailor their efforts based on the metric, the metric ceases to be a good predictor of progress [8, p. 38].

The remainder of this chapter provides an overview of my understanding of software testing, based on the cited literature I have studied and on my personal work experience as a software quality engineer at Red Hat Czech, s. r. o.

1.3 Testing process

"Testing is the process of executing a program with the intent of finding errors." [9, p. 19]

Besides the definition of testing just presented, Myers [9, p. 19] gives three other alternative definitions of testing and dismisses each as an inadequate attitudinal basis for a tester’s work. These alternative definitions state the following.

1. Testing is the process of demonstrating that errors are not present.

2. The purpose of testing is to show that a program performs its intended function correctly.

3. Testing is the process of establishing confidence that a program does what it is supposed to do.

The first alternative definition is obviously incorrect. As Dijkstra has famously noted, "Testing shows the presence, not the absence of bugs." [nato1969]. This is not to say that testing cannot facilitate increased confidence in the program’s correctness. However, such confidence is best established by a thorough search for errors [9, p. 22], not by demonstrating instances of correct functionality.

Martin suggests that instead of trying to prove software correct, as in mathematics, by creating proofs, we should turn to empirical science. That is, we formulate a falsifiable theory of the operation of the software, and we then do our best to disprove it through testing. Having completed that testing, we conclude that our software is good enough to work under the conditions it was tested for [10, p. 30]. That is not to say that our software is bug-free, just as Newton’s theory of gravity is not bug-free when relativistic effects come into play. "We cannot talk about a theory being absolutely true. We can only talk about a theory being true in a given context of application." [11, p. 8].

The second and third alternative definitions focus the tester towards exploring only the intended functionality, and ignore the possibility that the program may perform incorrectly outside of its primary intended function, such as when given invalid inputs or when handling errors [9, p. 21].

There is still value in tests that demonstrate correct functionality of the program. Such tests constitute an executable specification of the desired functionality of the software and may even be written by business analysts [12, p. 105].

1.3.1 Testing terminology

Quality engineering has developed its own specific vocabulary. The International Software Testing Qualifications Board (ISTQB) is a professional software testing organization that maintains a curriculum of software testing topics and a terminology dictionary, and administers certification examinations for testers. ISTQB is a major force in unifying and standardizing the terminology used in quality engineering [13, p. 17].

The software or part of software being tested is called the Software Under Test (SUT). The delineation of what is and what is not part of the SUT is a crucial prerequisite of effective testing. Depending on the purpose of each test, the scope of the SUT can change. For example, when a payment processing gateway of a web commerce system is replaced by a fake implementation during testing, the payment processing gateway is considered not to be a part of the SUT for this particular test.

When a test is written down, either in structured prose (to be performed manually) or in code (in the case of an automated test), the resulting artifact is called a test case. Each test case consists of a description of a specific program input, usually provided as a sequence of steps to be performed, and an expected output to be compared with the actual output of the SUT. It is a common practice to separate out setup steps, which put the SUT into a state in which the test can be performed, and teardown steps, which clean up and bring the SUT back to some default state.

Multiple test cases gathered together, usually because they share a common characteristic and are to be performed in the same testing session, form a test suite.
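As an illustration of these terms, the following minimal sketch uses Python's standard unittest module; the trivial SUT (a plain list standing in for a shopping cart) is purely illustrative. The setUp and tearDown methods hold the setup and teardown steps, the test method is a single test case, and unittest collects the test cases of the module into a test suite when it is run.

    import unittest

    class ShoppingCartTest(unittest.TestCase):
        """One test case: a specific input and an expected output."""

        def setUp(self):
            # Setup step: bring the (toy) SUT into the state required by the test.
            self.cart = []

        def tearDown(self):
            # Teardown step: return the SUT to a default state.
            self.cart.clear()

        def test_adding_item_increases_size(self):
            self.cart.append("book")
            self.assertEqual(len(self.cart), 1)

    if __name__ == "__main__":
        # unittest gathers the test cases of this module into a test suite and runs it.
        unittest.main()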

1.3.2 Types of tests

There is no single classification of tests that is suitable for all purposes. When discussing a test strategy, the correct classification and the use of particular visual models to illustrate it depend on the circumstances and the audience [14].

ISTQB test types

ISTQB recognizes the following four test types.

• Functional testing

• Nonfunctional testing

• Structural testing

• Change-related testing

The first two test types aim to validate functional requirements and nonfunctional requirements, respectively. Nonfunctional requirements tend to be qualitatively similar across software systems; they may be requirements on reliability, usability, maintainability, or internationalization, as well as the performance-related requirements discussed in the next chapter. Structural testing is a technology-focused test type that exercises the quality of the implementation and is often informed by statement coverage data. Change-related testing focuses on verifying that a change, be it a bug fix or an enhancement, has not compromised the quality of the software [15].

ISTQB test levels

When further classifying the tests of each type, ISTQB offers a classification into test levels based on the size of the SUT. They are:

• unit

• component

• integration

• system

• acceptance

On one end of this scale, unit tests are focused on an individual implementation unit of the software, such as a class in an object-oriented language. On the other end, acceptance tests exercise the deployment of the entire software system [16].

It is interesting to note that Martin defines acceptance tests differently, not based on their scope, but on their purpose. According to Martin, specific acceptance tests may be run against select components of the system, and the use of fake implementations for parts of the system is permitted and even encouraged [12, p. 102].

Other schemes for test classification

Besides the ISTQB classification, many other schemes are in use.

The test classification used by the software company Google categorizes automated tests by their scope and size. Test scopes closely correspond to the ISTQB test levels and will not be discussed further. The test sizes are: small, medium, large, and enormous. Each size is characterized primarily by the constraints put on the environment where the tests are executing. For example, small tests must run on a single machine, are not allowed to perform file and network operations, and their running time is limited to 60 seconds by default [17]. Medium tests are allowed to perform file operations and network communication, but only between processes on a single machine, and their default time limit is longer [17]. Smaller tests can be executed faster and more reliably, so they are to be preferred over larger tests [18].

Yet another scheme, called agile testing quadrants, is presented by Lisa Crispin and Janet Gregory, who credit the original idea to Brian Marick. According to the agile testing quadrants, tests are classified along two orthogonal axes: as either technology-facing or business-facing, and as either primarily supporting the development team or critiquing the product¹. For example, unit tests are technology-facing tests that serve to support the development team. Functional tests are business-facing tests that support the development team. Among the tests that critique the product, an example of a technology-facing one would be performance tests, and an example of a business-facing one is usability testing [19, ch. 6].

1. In their later writing, the phrase "critique the product" was complemented with "evaluate the product", to limit misunderstandings.

1.3.3 Purposes of testing

Quality engineers are not the only department on a software project that uses testing in their work. Business analysts may write acceptance tests to provide executable examples of the desired functionality, software developers can use testing to manage the internal quality of the codebase, and even customers often perform their own testing to ensure the software they were delivered meets their expectations. While everybody has different expectations of quality, testing can help with ensuring many of those.

Smoke testing is used to discover broken software artifacts produced by a build process and to quickly reject them before serious testing commences. It usually entails running a small suite of test cases to verify the most important functionality of the software. There are two origin stories for the name of smoke testing [20]. The first is in electrical engineering, stemming from the practice of plugging a few samples of newly manufactured circuit boards into a power source and waiting a few minutes to see if smoke starts coming out (in which case the shipment that failed smoke testing would be sent back). An alternate origin story comes from the use of fumigating gases to test the quality of seals in plumbing.

Regression tests serve to check that certain properties of the SUT, which were satisfied in the past, still hold. For example, software engineers or quality engineers should write regression tests to ensure that previously fixed bugs are not present again in the current version of the software. The Bazel Test Encyclopedia² focuses exclusively on this type of testing and even defines the purpose of Bazel tests as a developer tool to "confirm some property of the source files checked into the repository", which is expected to continue to be maintained in the future [17]. Robust regression test coverage and the ability to execute the regression tests in a timely manner are essential for maintaining developers’ ability to make changes to the codebase with confidence, and therefore for the project’s agility and maintainability.

2. Despite the title, the Bazel Test Encyclopedia is just a single medium-long web page, a reference guide for a build system.

Fuzz testing tries to automatically identify inputs on which the program behaves unsafely. It is considered a best practice for any program which must parse arbitrary untrusted inputs, such as image decoders or client-server applications of any kind. A fuzz test consists of an input-generating procedure, a fuzzing harness that feeds the inputs into the tested program, and instrumentation to verify the program’s safe operation. The input-generating procedure is usually based either on a formal description of the input in the form of a language grammar or, in the case of coverage-driven fuzzing, on a genetic algorithm that attempts to find inputs with maximal statement coverage, starting from a preseeded corpus of sample inputs. The instrumentation to verify that a program is behaving safely is usually the -fsanitize= feature implemented in GCC, Clang, and recent MSVC compilers.

Property-based testing is an automated testing technique aimed at checking that the operation of a program, or of a procedure in a program, conforms to a specified property. This is accomplished by automatically generating multiple candidate inputs, running the SUT with each, and checking that the property is satisfied. The author of a property-based test is only responsible for writing a function which checks whether the property holds for a given input and output. In a special case, the checking function may invoke an alternative implementation of the SUT (perhaps slower, but more clearly written) and assert that both give the same result. The testing framework’s responsibility is to generate the candidate inputs, which can be inferred from type signatures in a statically typed language or specified by the test author in the form of a grammar (or another such generating procedure from which the framework can sample). Property-based testing is a generalized variant of fuzz testing; a small illustrative sketch is given at the end of this overview.

Test-driven development (TDD) is a discipline that uses unit tests to express short-term development goals as the programming proceeds. The three rules of TDD are: 1) Write production code only to pass a failing unit test. 2) Write no more of a unit test than sufficient to fail (compilation failures are failures). 3) Write no more production code than necessary to pass the one failing unit test [21]. Tests created post hoc struggle to achieve good code coverage and often resort to testing private methods [22]. Test-driven development does help in low-level design by forcing clean interfaces, interface segregation, and testability from the beginning. TDD does not help much in designing high-level algorithms or approaches, as illustrated by a series of blogs documenting a failed attempt to "test-drive" a sudoku solver [23]. It deserves a mention that the author allowed the blogs to remain public after all the negative attention. TDD, in combination with acceptance testing, allows the programmer to confidently say "I am done, the code works".

User acceptance testing focuses on external quality, as the test scenarios are designed according to user expectations and their view of the feature, and less according to the internal structure and design of the application. There is tooling available (such as FitNesse, Cucumber, or Gauge) to write test steps in (structured) natural language and still be able to execute them automatically based on this description.

Exploratory testing. Even if most testing is automated, it is always a good practice to have an actual human being interact with the application before it is shipped to customers. When the majority of testing is already done by other means, the human tester is free to actively look for opportunities for mistakes and to gather hands-on experience with the application. The outcomes of exploratory testing are improved knowledge in the tester’s mind, ideas for test cases, and possibly even issue reports.


Dogfooding is the practice of deploying an upcoming version of the software for use by its programmers and other members of the development team. This should make the developers more sensitive and responsive to the product’s issues and ensure that enough diverse sets of eyes see the product version before it is released. Dogfooding is sometimes called an internal alpha/beta or similar; the more fanciful name of this practice comes from the phrase "eating your own dog food".
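The property-based testing sketch promised above uses the Python Hypothesis library; the toy run-length encoder under test and the chosen property are assumptions made purely for this illustration.

    from hypothesis import given, strategies as st

    def run_length_encode(data):
        """Toy function under test: encode a string as (character, count) pairs."""
        encoded = []
        for ch in data:
            if encoded and encoded[-1][0] == ch:
                encoded[-1] = (ch, encoded[-1][1] + 1)
            else:
                encoded.append((ch, 1))
        return encoded

    def run_length_decode(encoded):
        return "".join(ch * count for ch, count in encoded)

    @given(st.text())  # Hypothesis generates the candidate inputs.
    def test_decode_inverts_encode(s):
        # The property: decoding an encoded string gives back the original.
        assert run_length_decode(run_length_encode(s)) == s

    # Running this function (directly or via pytest) executes the check
    # against many generated inputs.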

1.4 When to test

According to Boehm’s law [13, p. 12], "The cost of finding and fixing a defect grows exponentially with time". The sooner in the development process a defect is identified and fixed, the cheaper the whole episode will be. To minimize costs, we should test at the earliest opportunity, as soon as there is something to test.

Testing, according to Myers, is a "destructive, even sadistic, process" [9, p. 20]. This supposedly makes it incompatible with the creative mode of thinking required of software developers. Myers therefore argues that developers should never test their own code.

There are two notable methodologies that modify or disagree with this view: the test-first approach of eXtreme Programming, and the practice of test-driven development. The test-first approach asks the developer to write the tests as the first step of the development process, before the tested code is written. The test-first approach is very suitable for acceptance testing, that is, tests in which business analysts and quality engineers capture the requirements placed onto the system. Test-driven development is more radical still: it asks the developer to constantly switch between writing tests and production code, and to only write production code if it is necessary to pass a test. Regardless of which developer-driven testing methodology is adopted, the responsibility for testing then becomes shared between the programmers and the quality engineers.

1.4.1 Continuous integration

In the 2000s, there was a dramatic shift in the software industry’s approach to testing, which started to center on the practice of developer-driven, automated testing [18, p. 240]. On the organizational level, this change coincided with the increased adoption of so-called agile approaches, such as eXtreme Programming and Scrum, and later with the DevOps movement and the Site Reliability Engineering (SRE) methodology. The technology side of this transformation led to the establishment of continuous integration (CI) and continuous delivery (CD) infrastructure, which work in tandem to turn the build and test process from a manual hand-off between the software engineering team, the release engineering team, and the quality engineering team into an automated, version-controlled, executable procedure implemented in a programming language "as code". Testing is no longer a separate, two-week-long activity which starts when developers hand off the build to quality engineers and whose output is a slew of entries in the bug tracking system. Instead, quality engineers are encouraged to contribute automated acceptance tests into the continuous integration system, avoiding the need for frequent hand-offs.

Giving software engineers a say in writing the test cases and in building the necessary continuous integration infrastructure to execute the tests is an eminently sensible thing to do, for at least the following reasons.

First, a robust test suite gives the developers confidence to move faster and make necessary changes to existing code, without fear that the changes may introduce unnoticed issues [24].

Second, tests are the first users of the production code being written. The act of writing tests forces the developer to evaluate the low-level code design from the point of view of a client of the interface being implemented, leading to a better design [18, p. 240].

Third, writing and maintaining a robust test suite can be a significant engineering effort. Therefore, it is sensible to involve the best software engineers, who may not necessarily be on the quality engineering team.

Penultimately, shifting the low-level testing effort towards developers, in the shift-left paradigm of DevOps, leaves the quality engineers able to focus on sophisticated nonfunctional requirements, such as usability. It is much more efficient to check input validation logic with a unit test than to require a tester to type invalid values into all fields of a web form, either by hand or using a browser automation tool [19, p. 120].

Finally, as software engineers directly see the benefits they are gaining from a good test suite, they are the best motivated people to maintain it and keep it in sync with the production code.

1.4.2 Testing is Alerting

When software is delivered in a manner that allows frequent releases and seamless upgrades, as is the case for web applications nowadays, there is an opportunity for a radically different approach towards quality engineering and testing.

Continuous integration shares similarities with production monitoring and alerting. Winters notes that the purpose of both is to notify the developers about problems as soon as possible. A continuous integration system executes automated test cases against new software builds before they are deployed into production and provides a signal about the number of failing cases. An alerting system achieves the same end by ingesting metrics from the production deployment of the software and watching for metrics that exceed predetermined thresholds [18, p. 547]. Examples of such metrics include the 5xx response error rate in a web application, the presence of an uncaught exception, excessive tail latency, or similar performance metrics.

Issue detection based on alerting is astonishingly similar to running property-based tests in continuous integration. In a property-based test, the test author specifies a property that must hold between every input and output, plus a procedure to automatically generate test inputs. The similarity to alerting is evident if we replace the input-generating procedure with an actual user, so that our fail criterion becomes the alerting criterion.

In contrast to a synthetic test in a continuous integration system, alerting reports on issues resulting from operation by actual users on actual data. It means that an alert usually highlights an issue that is impacting actual users trying to accomplish something important to them, while synthetic tests cover scenarios that the development team thinks are important to the users.


Testing in production cannot become the sole means of testing, but it can complement other methods. When a release candidate is ready to be deployed, the strategy of canary deployments allows redirecting a fraction of production traffic towards machines running the new build, while leaving the previous deployment to continue serving the majority of users. This minimizes the number of affected users in case something goes wrong and simplifies release roll-backs. Canary deployments are not trivial to achieve, however, as building a service so that it can withstand multiple versions of servers operating at the same time requires significant engineering effort [25].

1.5 When to stop testing

Herout presents this as a question to be decided by the project management, not by the quality engineers themselves. Testing should stop, he states, when the costs of finding and repairing the remaining issues are equal to or larger than the costs resulting from their continued presence [13, p. 26]. This advice is sensible but not easily actionable, since it asks us to estimate costs resulting from an unknown number of yet unidentified issues. Risk-based testing is a strategy striving to find the costliest issues first, by identifying and prioritizing the tests that have the greatest chance to uncover such issues [6].

When considered in the modern setting of continuous integration, testing is no longer a one-off process performed on a release candidate build of the software by a dedicated team. Instead, testing, test case implementation, test suite maintenance, and manual activities such as exploratory testing are a continuous process that never really stops as long as the software project is in operation. The question then is not so much "When to stop testing?" as "How many resources should we spend on testing?" and "Can we release now?". Again, these are questions for the project management, who should, of course, solicit quality engineering input, but quality engineers should never allow themselves to be placed into the role of the sole gatekeeper of releases [6].


2 Performance testing

[B]uying [mutual] funds based purely on their past performance is one of the stupidest things an investor can do.

— Jason Zweig, 2003

The ISO/IEC 25010 product quality model describes the "performance efficiency" characteristic as composed of the following three subcharacteristics [26]:

• Time behaviour: latency and throughput requirements

• Resource utilization: resource usage while performing the product’s functions

• Capacity: whether the maximum limits of the product meet requirements

ISTQB (the International Software Testing Qualifications Board) defines performance testing as "Testing to determine the performance efficiency of a component or system." [27] In its Foundation Level Specialist Syllabus for Performance Testing, ISTQB endorses [28] as a good reference on the subject. In the rest of this chapter, this book is used as my primary reference. Here is one of the things that Molyneaux has to say about the desirable results of performance testing: "effective performance testing identifies performance bottlenecks quickly and early so they can be rapidly triaged, allowing you to deploy with confidence" [28, p. xii].

2.1 Importance of performance and performance testing

There has been much research conducted about the impact of low-performing applications on the end-user. For interactive applications, the latency thresholds are usually given as 0.1, 1, and 10 seconds: a human perceives 0.1 seconds as an instantaneous reaction, 1 second is the threshold at which the user is still engaged and their thoughts are not wandering off, and a 10-second latency is where the user begins to multitask instead of patiently waiting. Test-driven development can be successful because the developers’ thought processes are not significantly interrupted, as long as an incremental recompile and run of the relevant tests takes only a few seconds.

Application performance can be usefully compared to the concept of an error budget from Site Reliability Engineering (SRE). The error budget is computed as the number of minutes a service is allowed to be nonfunctional before the Service Level Objective (SLO) on availability is breached. Whenever an outage happens, which is often caused by a misconfiguration or coding error during deployment of a new version, some error budget is spent. Depletion of the error budget means that no new deployments can be risked until the budget builds back up during a period of outage-free operation [1]. Similarly, good performance provides the freedom to add features that will eventually slow the application down, as new, more fully-featured versions are deployed. In this way, both error budgets and performance constraints put pressure on application developers to maintain a balance between shipping features and maintaining adequate nonfunctional characteristics, be it reliability or performance.

Performance has nowadays become even more critical due to the significant move towards cloud applications, where the application developer also maintains the centralized deployment, and users access the application through more or less thin clients such as web browsers and mobile applications. Even a small improvement to the performance of cloud-deployed applications, which would be unnoticeable if the applications were running on users’ hardware, can have a significant impact in aggregate. For example, Facebook has invested in improving its C++ std::string class, obtaining approximately a 1% performance improvement across the entire site in 2012 [29, 17:00].

An additional consideration stemming from the cloud distribution model is that the maximum conceivable number of concurrent users is "the entire Internet". In practice, such load manifests as "being slashdotted", that is, sudden peaks in visitor count caused by deliberate marketing or organic popularity and newsworthiness. Websites that cannot handle such peak load become inaccessible exactly when the interest in them is the greatest. On the other hand, developing with such considerations in mind increases the difficulty and therefore the cost of development, with no real guarantee that when it is built, the hordes of people will actually come. As an example, Pokemon Go underestimated the player traffic by approximately 50x in its initial capacity planning when the online game was first released. While the game had already been architected to be horizontally scalable, the underlying platform (Google Kubernetes Engine) imposed scalability limits that had to be overcome first [30].

One more example. The HealthCare.gov website was architected and built without scalability in mind; it was able to handle only approximately 50 to 80 thousand concurrent users with 1-second latency. Its performance issues were resolved first by moving static web pages out of the web application into a static site, which could then benefit from the use of a Content Delivery Network, and also by implementing a waiting-room feature, which only allowed a manageable number of users to log in, while offering to take the e-mail addresses of those rejected and later sending spaced-out reminders to try logging in again. This stop-gap measure allowed it to handle peak demand in a graceful fashion [31, 1:05:10].
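As a small worked example of the error-budget arithmetic described above (the 99.9% SLO is an illustrative figure, not one used in this thesis):

    # Error budget for an availability SLO, as described above.
    slo = 0.999                                   # illustrative 99.9% availability target
    minutes_per_30_days = 30 * 24 * 60            # 43 200 minutes
    error_budget_minutes = (1 - slo) * minutes_per_30_days

    print("Allowed downtime per 30 days: %.1f minutes" % error_budget_minutes)
    # -> Allowed downtime per 30 days: 43.2 minutes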

2.2 Key Performance Indicators (KPIs)

In order to formulate specific and measurable performance requirements, Key Performance Indicators (KPIs) need to be constructed and procedures for their measurement need to be agreed upon. The KPIs are essentially the same concept as the Service Level Indicators (SLIs) used in the Site Reliability Engineering literature [1].

The two main KPIs of application performance are the throughput and the response time (also known as latency) of a particular operation. Throughput is given as the amount of data or number of requests processed over a time interval, for example, messages sent and received per second. Latency is the duration of each operation, and can be reported either in the form of a full distribution, summarized using a histogram, or even more condensed using percentiles, saying, for example, that the 99.99th percentile is 14 ms, meaning that 99.99% of requests were processed in 14 ms or less. Traditionally, a batch-processing task requires good throughput and does not care about latency, while an interactive application that interfaces with a user, or say with a high-speed stock exchange trading system, is highly latency-sensitive. Hudson has jokingly proposed to trigger the garbage collector every time the user blinks, as detected by a camera; this illustrates that latency only matters when someone is waiting for the results [32, 4:00].

Besides throughput and latency, there are two other crucial KPIs that need to be discussed: availability and utilization. Availability measures the amount of time the application is available to its clients. This is tightly related to performance, since an overloaded application unable to serve further requests stops being available. Utilization is the amount of resources, or the percentage of available resources, used by the application under a particular load. For example, we may ask about the best achievable throughput and latency, provided that CPU utilization does not exceed 80%.
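To make the percentile reporting concrete, the following sketch computes latency percentiles from a list of per-request latencies using the simple nearest-rank method; the synthetic data and the method are illustrative only, as benchmarking tools typically summarize latencies with histograms.

    import math
    import random

    def percentile(sorted_latencies, p):
        """Nearest-rank percentile: the smallest sample such that at least
        p percent of all samples are less than or equal to it."""
        rank = max(1, math.ceil(p / 100.0 * len(sorted_latencies)))
        return sorted_latencies[rank - 1]

    # Illustrative latencies in milliseconds; real data would come from the load injector.
    latencies = sorted(random.expovariate(1 / 5.0) for _ in range(100_000))

    for p in (50, 90, 99, 99.99):
        print("p%-6s %8.2f ms" % (p, percentile(latencies, p)))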

2.3 When to test performance

Performance testing is commonly performed at the end of a development cycle [28, p. 7]. In these situations, programmers are not getting any feedback on application performance until the development is actually done. Nygard’s thesis in his book is to accept this as a fact of life, and he advises developers not to "design for QE" but to "design for production" from the start [25, p. 93]. This is to be achieved by employing architectural patterns that have been proven in previously deployed production systems, and that can be relied upon when production issue reports start arriving.

System-level performance testing, the way Molyneaux describes it, is a significant undertaking, lasting two weeks at a minimum. For the duration of the testing, Molyneaux advises demanding a code freeze; this is to stop the developers from changing the software in the meantime, which would render the obtained performance testing results no longer relevant. Furthermore, developers might end up changing the system in a way which would invalidate the test scripts used for load generation, thereby adding an extra maintenance burden on the testers.


Performance testing needs to be introduced into the daily development cycle to provide relevant, timely feedback early on, and developers need to participate in test maintenance to avoid handover delays.

2.4 Automation in performance testing

Performance testing always has to include some automation, even in the so-called manual approaches; human testers cannot click and scroll through cat pictures quickly enough to put a computer application under a significant load. Nevertheless, the fortnight-long performance testing process described by Molyneaux can be much improved if it is automated and integrated into a Continuous Integration/Continuous Delivery pipeline. While the first run of the performance tests may take two weeks or even more, including the necessary preparation and automation work, running the test afterward should be trivial and should not require human supervision. Having arrived at such a state, it is possible to make the performance test run overnight, when the developers do not work, avoiding the need to impose a code freeze on them. If the test breaks, it should not be difficult to figure out what caused the breakage. If the culprit is a software change introduced the previous day, the developer responsible may be politely asked to update the performance tests.

Lack of emphasis on automation is one of the common critiques raised against computing industry standards, such as those published by the IEEE (Institute of Electrical and Electronics Engineers). As Huggins puts it, somebody should go to them and "get the word automation in there somewhere" [31, 1:27:20]. Other criticisms point out their inflexibility, costliness, and emphasis on form over substance [6, lesson 146].

2.5 Types of performance tests

Performance testing can be used to demonstrate meeting required performance criteria. It can compare two systems to find which performs better, or it can be used to identify and isolate bottlenecks that cause the system to perform poorly.


The specific performance-oriented types of tests belong to the group of nonfunctional testing, and they can be categorized based on the characteristics of the applied load as follows [33], [28, p. 50]:

Pipe-clean test is a preparatory testing activity to validate that each performance test script is working in the performance test environment. Failures in the test script itself need to be quickly identified and fixed before the actual performance testing is initiated. This is achieved by running the actual performance tests for a limited period of time or for a small number of iterations.

Load test, sometimes confusingly called a volume test, is conducted to understand the behavior of the application under a specific load. It helps us to understand the expected performance characteristics and resource utilization at this particular load. This way we may learn how much CPU and memory the application consumes while serving a certain number of requests per minute, and the latency at this load.

Volume test refers to a test which exercises a software application as it is operating on a certain significant amount of data. One example from the messaging world would be a performance test of a message broker that was previously set up to hold a large number of previously received messages in its persistent message journal.

Efficiency test establishes the amount of resources required by a program to perform a particular function.

Endurance test, sometimes called a soak or stability test, consists of applying a load to a system over an extended period of time. When the system is running for longer periods, previously hidden problems such as resource leaks may cause the system to fail. To analyze the results afterward, a monitoring system needs to be in place, and the monitoring data must be correlated with logs and performance test data.

Scalability test measures the application’s ability to scale up and support an increased load as the available resources on a single machine are increased (i.e., vertical scaling) or the number of replicas of the application’s subsystems is increased (i.e., horizontal scaling).

Stress test involves testing beyond normal operational capacity, often to a breaking point, in order to determine where the breaking points lie and to observe and document the modes of failure. It is a form of testing used to determine the stability of a given system and to evaluate its robustness, availability, and error handling under a heavy load.

2.6 Performance testing approaches

Three approaches to performance testing will be introduced: microbenchmarking, system-level performance testing, and monitoring in production. Out of these three, this thesis will describe my practical automation effort around the first two in a later chapter.

2.6.1 Microbenchmarks

Microbenchmarking is analogous to unit testing. It is developer-driven testing focused on a narrow part of the software system, only instead of testing for functional correctness as in a unit test, the microbenchmark measures (and optionally asserts on) performance. Similarly to writing unit tests, the programmer isolates a part of the system, often a single function or class, and writes timed tests that exercise it. Due to the narrow scope and short duration of the test, this type of testing is sensitive to measurement errors, requiring certain expertise to perform it correctly and obtain meaningful results.

Microbenchmarks can be used for low-level optimization (e.g., which way of concatenating strings is the fastest in my version of the compiler and my version of the language library?); for isolating and fixing bottlenecks discovered in larger-scale performance testing, since a microbenchmark provides immediate feedback, whereas a nightly performance test requires waiting for the results overnight; or for ensuring that the critical path through the code remains performant, by providing a test for this invariant that can be checked automatically and relatively quickly, maybe on every commit to the codebase.


Let us now turn to some of the causes of measurement errors in microbenchmarking:

Programming language runtimes introduce overhead, which may unpredictably wax and wane over the duration of the test. Taking the programming language Java as an example, the JVM (Java Virtual Machine) is performing bytecode compilation and garbage collection while a program is running. To minimize JVM effects, it is desirable to let the bytecode compilation finish, allow the benchmarked function to reach a steady state over several warm-up runs, and only then perform the measurements. Regarding garbage collection, it is desirable to exclude its effects when calculating microbenchmark results, either by disabling it or by automatically redoing the test if garbage collection was triggered.

The processor and memory in contemporary machines are fairly sophisticated, employing strategies such as pipelining, branch prediction [34], or multilevel caching. This means that even if a programming language does not use a sophisticated runtime, a warm-up period is still necessary to reach steady performance [35]. Dynamic frequency scaling of CPU cores needs to be disabled to obtain more predictable results.

Compiler optimizations may end up optimizing the code under test to a greater degree than when it is called from the larger software system. This may happen when a computation is being timed and the compiler is able to establish that the result of the computation remains unused. In that case, the compiler will eliminate the entire computation during compilation (and that includes bytecode compilation in Java) [36].

Operating system processes and system daemons which run on a machine together with the benchmark may also add to the noise. The benchmarking tooling itself and any extra monitoring which is deployed also take up computing resources. In this case, mitigation consists of ensuring there are enough CPU cores and memory so that processes are not competing for resources. Furthermore, it may be desirable to pin the benchmarked program to specific cores, to reduce effects from context switching and thread migration between cores.

Many of the aforementioned issues are resolved by using a benchmarking harness, which implements appropriate mitigations for us. When measuring system-level performance, as discussed in the next section, the sources of noise just mentioned either become negligible or can be treated as simply a part of the system characteristics. The main exception to this is the need to commence measurement only after the application performance has reached a steady state, which is accomplished by ensuring a sufficiently long warm-up period.
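To make the role of such a harness concrete, here is a deliberately simplified warm-up-and-repeat sketch; dedicated harnesses, such as the Google Benchmark library discussed later in this thesis, implement this far more rigorously, together with the mitigations listed above.

    import time

    def measure(func, warmup=5, repetitions=30):
        """Time `func` after discarding warm-up runs. A real harness additionally
        controls CPU frequency scaling, core pinning, and statistical stopping criteria."""
        for _ in range(warmup):
            func()                      # let caches, JIT compilation, etc. reach a steady state
        samples = []
        for _ in range(repetitions):
            start = time.perf_counter()
            func()
            samples.append(time.perf_counter() - start)
        return min(samples), sum(samples) / len(samples)

    # Illustrative workload: one of the string concatenation strategies mentioned earlier.
    best, mean = measure(lambda: "".join(str(i) for i in range(10_000)))
    print("best %.6f s, mean %.6f s" % (best, mean))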

2.6.2 System-level performance testing

System-level performance tests are sophisticated end-to-end tests that require a significant amount of preparation and careful planning before they can be successfully executed. We need to build a performance testing environment, agree on performance requirements, design test cases, set up the test, perform the testing, and analyze the results.

Requirements specification

Soliciting performance requirements from stakeholders is notoriously difficult, since they tend to be unwilling to commit to anything specific. However, leaving performance requirements fluid opens the door to post hoc negotiations where, e.g., a developer is then able to claim that any performance degradation is an acceptable tradeoff for the enhanced functionality of the new version, rendering the results of performance testing meaningless in driving the quality of the product. Always decide on hard performance requirements; otherwise, developers will talk you into accepting whatever the actual test result happens to be at the moment. Until the performance requirements are made specific, we cannot test for them. It has been recognized that latency, quality, cost, and throughput are conflicting concerns and must be balanced. Establishing the balance is a question for the product manager and needs to be done early and deliberately [18].


Building a performance lab

In an ideal case, the deployment for performance testing would exactly match the production deployment of the software in all respects. The deployment model most conducive to this is the blue-green strategy of maintaining two production environments, which allows deployments to be performed by installing a new version of the software on the inactive one and then redirecting user traffic to it in a load balancer. Whichever environment is currently not serving as production can be used for system-level testing, including performance testing.

Usually, an exact replica of the production environment is not available, for reasons of cost, which necessitates building a performance lab which is able to at least approximate the production setup. When designing a scaled-down version of the production deployment for performance testing, it is desirable to at least match the specifications of the machines used in production and to distribute software components across machines so as to maintain as many aspects of the deployment topology used in production as possible.

Most companies plan their purchases long in advance. Procurement of new machines specifically for performance testing may be a lengthy and involved process that needs to be started sufficiently in advance.

Workable alternatives to building a separate performance testing deployment in-house include virtualization, possibly in a cloud computing environment. Cloud computing offerings are notable for the ease and speed of provisioning new virtual machines and for the flexible pricing model, where one pays only for the resources one has provisioned. Assuming that the pricing model is sufficiently understood, to avoid unexpected charges, the use of cloud computing can completely bypass the need to procure physical machines.

Construction of test scenarios

Test scenarios have to represent the realistic load that will be put on the system in its production deployment. If the system under test is already deployed, then its usage patterns can be observed, and test scenarios can be modeled based on the actual load. When this is not the case, the usage patterns and the fraction of simulated users performing each of the patterns must be estimated based on experience with other similar systems. Risk-based testing is a good approach for identifying the most business-critical scenarios to test. In the case of a web commerce application, this would consist of simulating the activity of customers browsing the catalog, customers adding items into the shopping cart, and customers performing a checkout. The ratio between these groups is significant, because it is the checkout operation that tends to be the most resource-intensive, while the number of transactions in this group tends to be the lowest.
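As a concrete illustration of such a user mix, the following sketch draws simulated user actions according to fixed ratios. The ratios and action names are assumptions made for the sake of the example, not values taken from any real system; in practice they would come from observed production traffic or from experience with similar systems.

import random

# Illustrative user-action mix for a web commerce scenario; the exact
# ratios are placeholders, not measured values.
ACTIONS = ["browse_catalog", "add_to_cart", "checkout"]
WEIGHTS = [0.80, 0.15, 0.05]  # checkout is the rarest but most expensive action

def pick_actions(n_virtual_users):
    """Draw one action per simulated user according to the weighted mix."""
    return random.choices(ACTIONS, weights=WEIGHTS, k=n_virtual_users)

if __name__ == "__main__":
    sample = pick_actions(1000)
    for action in ACTIONS:
        print(action, sample.count(action))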

Load injectors (load generators)

A load injector is the component of a system-level performance test that is responsible for submitting requests to the system under test, and usually also for receiving responses and performing their basic validation. During its operation, the load injector logs the throughput it was able to achieve, the latencies between corresponding requests and responses, and whether the responses were successful or the system started responding with errors. The load injector's own performance is critical: it must be able to generate sufficient load. Thankfully, it is often possible to run load generators on multiple machines simultaneously to increase the load as needed. An additional consideration is external monitoring, since it is necessary to ensure that the machine running the load generator operates properly for the whole duration of the test. The load generator's specific load profile and the actions it takes are determined by the type of performance test and the test script. The typical load profiles are: big-bang, which applies the target load from the start; ramp-up, which gradually increases the load until the target load is reached; ramp-up with steps, which increases the load in several increments and is typically used for load testing; and delayed start, which is used when there are multiple load injectors, each beginning with a different delay.
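To make the ramp-up-with-steps profile concrete, the sketch below computes a stepped schedule that a load injector could follow. The function and its parameters are purely illustrative and are not part of any particular load generation tool.

def stepped_ramp_up(target_rate, steps, step_duration):
    """Return a list of (start_second, requests_per_second) pairs that
    raise the load towards target_rate in equal steps.

    This is only a sketch of the 'ramp-up with steps' profile described
    above; a real load injector would feed such a schedule into its
    request-generation loop."""
    schedule = []
    for i in range(1, steps + 1):
        start = (i - 1) * step_duration
        rate = target_rate * i // steps
        schedule.append((start, rate))
    return schedule

if __name__ == "__main__":
    # Example: reach 40,000 requests/s in 4 steps of 5 minutes each.
    for start, rate in stepped_ramp_up(40000, 4, 300):
        print(f"from t={start}s send {rate} requests/s")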

Result analysis

The results of a performance test can be used to judge the fulfillment of non-functional performance requirements and for capacity planning. If the performance lab hardware does not closely resemble the production deployment, we are forced to extrapolate the measured performance. Such extrapolation is inherently risky, as opposed to interpolation, which is relatively safe to do.

2.6.3 Performance monitoring in production

As previously stated in Chapter 1, monitoring can be viewed as a variation on testing, where the real world is providing the test inputs. While this might sound irresponsible in the case of functional testing ("let's just ship it into the Apple AppStore and watch for exceptions reported in analytics"), there is value to this approach in performance testing of incrementally evolving applications. The tastes of users and the usage patterns evolve in concert with the application, which makes prediction and scenario construction difficult [37]. For example, a website designer may move a button from the left side of the site to the center, leading to a dramatic change in how users then engage with the website. Combined with phased rollout of features and graceful degradation, this may be a viable strategy for a service-oriented application. Facebook is one example of a company that bases its performance testing strategy on microbenchmarking and production monitoring. Building an application so that it can be deployed and tested in this way is a large undertaking, which must be conducted deliberately [25]. Notable disadvantages of performance testing through monitoring in production are the lack of repeatability, the inability to experiment with hypothetical usage patterns, and the lack of developer convenience unless the scenarios discovered by monitoring are first turned into a benchmark. Monitoring does not necessarily consist of passive observation only. So-called active monitoring involves generating traffic of its own and measuring the system's response to it. When active monitoring is in use, the line between monitoring and testing in production becomes very blurry.

2.7 Industrial benchmarking

Standardized performance tests are called industrial benchmarks, and they can be used to compare competing products, gauge technological progress in a field, and focus interest on unresolved problems. Benchmarks also play a significant role in research; we can name the ImageNet Large Scale Visual Recognition Challenge as one of the premier competitive benchmarks comparing machine learning approaches for image recognition in recent years. Two notable industry benchmark organizations are the TPC (Transaction Processing Performance Council), which focuses mainly on benchmarking data storage and retrieval systems, and SPEC (Standard Performance Evaluation Corporation). One example of a SPEC benchmark relevant to messaging is SPECjms2007, which is intended to benchmark messaging middleware that implements the JMS API. Benchmarks should ideally be relevant, repeatable, fair, verifiable, and economical. That is, they should exercise the code paths that are relevant for production use of the software, they can be made to repeatedly produce the same result, all relevant software can participate in the benchmark, the benchmark setup and execution is verified for correctness of operation, and running the benchmark is not prohibitively expensive [38]. Agreeing on a standardized benchmark can be a highly politicized process, since each participant is motivated to skew the benchmark towards scenarios that their application handles best. Benchmarks need to be periodically revisited and updated or even deprecated. Otherwise, they risk falling out of date with current developments, and implementers may end up over-optimizing for a particular benchmark while ignoring relevant current real-world scenarios [39].


3 Maestro, Quiver, and Additional Tooling

And you must have nails in your boots and a tuck or two in your petticoat, and a pair of blue spectacles, and an opera-glass slung across your shoulder like Diana’s quiver. Man needs but little here below in the way of dress, and woman too, if she only knew it.

— A. Strahan and Company, 1883

My performance testing automation work relies on a significant amount of existing software. This chapter aims to introduce the Apache Qpid libraries (the Software Under Test), the performance measurement tools Maestro and Quiver that I considered for my automated pipeline, a microbenchmarking library called Google Benchmark, and a continuous integration system called Jenkins.

3.1 Software Under Test: Apache Qpid

Qpid is an open-source project organized under the auspices of the Apache Software Foundation. It is focused on building AMQP (Advanced Message Queuing Protocol) servers and client libraries in various programming languages. At the time of writing, Qpid maintains two AMQP brokers, one written in Java and another in C++, as well as an AMQP 1.0 router written in C and Python. Client libraries developed under a subproject called Qpid Proton are available for the C, C++, Java (JMS 2.0), Python, Ruby, and Go programming languages. Two distinct code-bases support the AMQP 1.0 client development. Qpid Proton C is the core library that supports bindings written in C++, Python, Ruby, and Go, all of which are maintained within the Qpid Proton subproject. On the Java side, the core library is Qpid Proton J, which in turn supports the JMS 2.0 high-level API library called Qpid JMS. I first contributed to Qpid Proton in 2016. In 2019, I joined the Apache Software Foundation and became a Qpid committer. In 2020, I was voted onto the Project Management Committee (PMC) of Apache Qpid1.
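To give an impression of the client API whose performance is measured in this thesis, the following is a minimal sketch of a sender written against the Qpid Proton Python binding. The address localhost:5672/examples and the message count are assumptions for illustration; the sketch requires a listening peer, router, or broker at that address.

from proton import Message
from proton.handlers import MessagingHandler
from proton.reactor import Container

class Send(MessagingHandler):
    """Send a fixed number of small messages and wait for acknowledgements."""

    def __init__(self, url, count):
        super(Send, self).__init__()
        self.url = url
        self.count = count
        self.sent = 0
        self.confirmed = 0

    def on_start(self, event):
        event.container.create_sender(self.url)

    def on_sendable(self, event):
        # Respect the credit granted by the receiving peer.
        while event.sender.credit and self.sent < self.count:
            event.sender.send(Message(body="message-%d" % self.sent))
            self.sent += 1

    def on_accepted(self, event):
        self.confirmed += 1
        if self.confirmed == self.count:
            event.connection.close()

if __name__ == "__main__":
    # Assumes something is listening on this address (a broker, a router,
    # or a peer-to-peer receiver).
    Container(Send("localhost:5672/examples", 100)).run()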

3.2 Performance measurement frameworks

Both Maestro and Quiver, the frameworks I examine in this section, have previously been used [40, 41] in performance tests involving the Qpid Proton AMQP clients. After weighing the considerations raised in this section, I ultimately decided to automate Qpid Proton and Qpid JMS performance testing using the Quiver tool.

3.2.1 Maestro

Maestro is an extensible distributed performance testing framework. It was originally focused on testing the performance of a messaging deployment built around the Apache ActiveMQ Artemis broker, and it was subsequently extended to allow testing the Apache Qpid Dispatch router [40] and to import results from the Quiver tool through a Maestro component called the QuiverAgent [41]. Maestro is architected as a set of standalone services organized around the Maestro Broker, which may be any MQTT-capable messaging broker. The Maestro services are the following:

Maestro broker: any appropriately configured standard MQTT broker, although Mosquitto is commonly used. This message broker is part of the test infrastructure; it is a different broker from the broker under test.

Maestro agents: assigned as either message senders or receivers by the test script; they perform the role of load injectors in the test.

Maestro inspector: connects to the messaging broker under test using the JMX protocol and records various statistics about it.

1. https://projects.apache.org/committee.html?qpid


Maestro result database: stores and visualizes previously measured results.

All Maestro components (except for the Maestro broker itself) need to be given the hostname of the Maestro broker at launch, and they use message passing through the broker for all their communication with each other. Each service is usually deployed on a separate machine, or as a pod in the Kubernetes pod orchestration system. Note that the distributed nature of the performance measurement system complicates measuring the duration of a message round-trip through the messaging system: the Maestro sending agent and the receiving agent may be located on different machines, and therefore have to rely on synchronized hardware clocks in order to calculate the durations.
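The clock-synchronization concern can be illustrated with a small sketch: the only way a receiver on another machine can compute end-to-end latency is by subtracting a send timestamp embedded in the message from its own clock, which is meaningful only if the two clocks agree. The field name sent_at below is a made-up example, not a Maestro message format.

import time

def stamp_outgoing(payload: dict) -> dict:
    """On the sending agent: embed the wall-clock send time in the message."""
    payload["sent_at"] = time.time()  # read from the sender's clock
    return payload

def end_to_end_latency(payload: dict) -> float:
    """On the receiving agent: latency is the receiver's clock minus the
    sender's embedded timestamp.

    Any offset between the two hardware clocks shows up directly in the
    result, which is why clock synchronization (e.g. via NTP or PTP) is
    needed for cross-machine latency measurements."""
    return time.time() - payload["sent_at"]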

Maestro usage

To gain practical experience with the Maestro performance testing framework, I deployed it inside Kubernetes and used it to execute a performance test using the Apache Qpid JMS messaging library and the Apache ActiveMQ Artemis messaging broker. Probably the easiest way to deploy Maestro is to use the Docker Compose2 configuration file3, together with prebuilt Maestro images from Docker Hub. This way, one does not have to build and deploy each service in the Maestro framework individually. Since I was using a Red Hat Enterprise Linux 8 machine to run Maestro, which does not come with Docker by default, I instead chose to deploy Maestro into a single-node K3s4 instance. I used Kompose5 to automatically convert the Docker Compose configuration into a Kubernetes deployment resource. To start a test, I installed and started the Apache ActiveMQ Artemis broker on the same machine running K3s, then I used kubectl on my laptop to expose the Maestro broker, and finally I could run the following commands to configure and start the test.

export SEND_RECEIVE_URL=amqp://msg-qe-07:5672/...
export MAESTRO_BROKER=mqtt://msg-qe-07:30311
export MANAGEMENT_INTERFACE=http://admin:admin@...
export MESSAGE_SIZE=~200
export PARALLEL_COUNT=5
export RATE=40000
export TEST_DURATION=900s

java -cp maestro-cli \
  org.maestro.cli.main.MaestroCliMain exec \
  -s FixedRateTest.groovy

2. https://docs.docker.com/compose/
3. https://github.com/maestro-performance/integration-tests/blob/master/docker-compose.yml
4. https://k3s.io/
5. https://kompose.io/

Maestro results database

The Maestro results database presents test results in the form of HTML reports. An example of a report for the test described above can be seen in Figure 3.1. When the throughput faltered at about 18:40, I was able to investigate using the broker logs and the Performance Co-Pilot monitoring chart shown in Figure 3.2: the broker had run out of the memory space allocated for the messaging address and started paging the message journal to the hard disk.

Maestro capabilities

Maestro can be used to perform soak testing, load testing, as well as stress testing of the broker or router middleware. This is thanks to the robust configurability and scriptability of the Maestro agents, which permit starting an arbitrary number of senders and receivers and adjusting the load profile, for example by setting the message send rate. Maestro test scripts are written in the Groovy language, a fully-featured programming language similar to Java. On the flip side, deploying Maestro is a significant effort. When used to test the performance of peer-to-peer message exchange between Qpid Proton clients, which is the topic of this thesis, Maestro delegates all the work to the QuiverAgent, which is just a wrapper that runs the Quiver tool, discussed in the next section, and reports the results produced by it [41]. This is why I eventually decided to forgo Maestro and use Quiver directly.


3.2.2 Quiver

The Quiver benchmarking tool [42] consists of a test driver written in Python and test harnesses called arrows, written in the respective languages of the messaging client libraries under test. The purpose of the driver is to process the command line parameters that specify the test, execute the test by running the arrow programs, and produce a test report at the end. Both the receiving and the sending arrow, as well as the test driver, have to run on the same machine, although it is possible to send to a different URL endpoint than the one messages are received from. This design of Quiver stems from the multiplatform and polyglot nature of the messaging libraries (the software under test) and is mirrored in the interoperability test suite QIT6 (its equivalents of arrows are called shims) as well as in dTests, an internal test suite developed in Red Hat Messaging QE (which calls its arrows by the overloaded name of clients) [43, p. 75]. In contrast with Maestro, it is not possible to use more than one sending or receiving arrow in a Quiver test. It is also not possible to configure the sending or receiving message rate of a Quiver arrow. These limitations would be problematic if we wanted to use Quiver to test a messaging server, but they are not a significant obstacle for stress-testing the client libraries and measuring the maximum possible throughput. I decided to use Quiver for my performance automation work. The main reason for this decision was the significantly simpler and more straightforward setup that direct use of Quiver requires.
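Because Quiver is driven entirely from the command line, wrapping it for automation is straightforward. The sketch below shells out to the quiver executable from Python; the option names (--count, --body-size, --credit) mirror the configuration echo shown in Chapter 5 but are assumptions that should be verified against quiver --help, as is the way the arrow implementation is selected.

import subprocess

def run_quiver(address="q0", count=300_000, body_size=100, credit=1000):
    """Invoke the quiver driver and return its console report.

    The flag names below are assumptions based on the test configuration
    that Quiver itself prints; consult quiver --help for the real CLI."""
    cmd = [
        "quiver", address,
        "--count", str(count),
        "--body-size", str(body_size),
        "--credit", str(credit),
    ]
    completed = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return completed.stdout

if __name__ == "__main__":
    print(run_quiver())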

3.3 Additional tooling

Quiver does not need an extensive setup before a test can be executed, but it is not a completely straightforward process either. The following are the additional technologies I have utilized to support automated test execution with the Quiver framework.

6. https://qpid.apache.org/components/interop-test


3.3.1 Ansible

Ansible is an agentless automation system for application deployment and remote machine management. A typical Ansible user writes declarative scripts called playbooks, which sequentially invoke one or more modules, usually over an SSH (Secure Shell) connection. Modules are written in the Python programming language and are responsible for enacting changes on the target machines. The most basic Ansible module is called shell, and it merely passes its arguments to the shell environment for execution. More sophisticated modules exist, such as the dnf module, which wraps interactions with the Dandified YUM package manager. At its best, Ansible allows for writing declarative playbooks that describe the target state. When invoked, Ansible modules compare the actual state with the declared one and issue commands to reconcile the two if needed. At its worst, an Ansible playbook may become a glorified shell script written in a verbose syntax.

3.3.2 Jenkins

Jenkins may be briefly described as a Continuous Integration (CI) system; yet such an abbreviated description does not do justice to its extensive third-party plugin library, which can be used to turn Jenkins into virtually anything. A more useful view is therefore to focus on the capabilities that base Jenkins offers. These capabilities then provide direction for plugins to extend further.

Task scheduling — either time-based, like a cron job, or triggered by an event provided by a plugin. This allows Jenkins to run jobs in response to a push to a source code repository or according to a time schedule.

Resource management — Jenkins manages a pool of Jenkins agents. An agent is a relatively small Java binary, which may run on a physical or virtual machine. Each agent offers one or more executors, which may be assigned Jenkins jobs to run. In performance testing, we usually require exclusive access to our machines to avoid interference from other jobs, and so it makes sense to set the number of executors on each agent to one. There are many plugins that extend Jenkins to spawn new virtual machines and provision new agents on them in various environments when there is not already sufficient executor capacity to run all queued jobs.

Task execution — Jenkins manages a queue of jobs that are ready to run and matches them with available executors, where they can be scheduled. The most basic way of matching jobs to executors is through labels. A job is configured with a set of labels it requires, while agents are configured with a set of labels they provide. The labels usually encode the operating system running on the agent, or other capabilities, such as the presence of peripheral devices. Plugins may define new job steps, which may then be executed from Jenkins jobs.

Artifact storage — jobs may produce artifacts and upload them to Jenkins for archival. This allows chaining jobs so that one job runs first and publishes artifacts that are then used by a subsequent job as its inputs.

Results history — jobs may either pass or fail, and Jenkins keeps a history of the outcomes.

Progress monitoring — the web interface provided by Jenkins gives an overview of job status.

Secrets management — jobs may access credentials stored in Jenkins, such as passwords to external systems.

Jenkins 2.0 has brought improvements to the way Jenkins jobs are defined. We may now use a special Job DSL (Domain Specific Language) to write our jobs in code and store them in a version control system. Jobs whose steps are defined using the DSL are called Pipeline jobs.

3.3.3 Google Benchmark

Google Benchmark7 is a microbenchmarking library supporting C++03 as well as C++11 and later.

7. https://github.com/google/benchmark


Using a support library for microbenchmarking is always advisable, because the library implements best practices for measuring execution time which would be difficult to rediscover on one's own.

3.3.4 Docker

Docker Engine8 is an OS-level virtualization software—a so-called container runtime. Container functionality in the Linux operating system predates Docker, but Docker made containers easy to use and practical. Processes running inside a Docker container have their own view of the filesystem, instantiated from a Docker image, as well as their own process namespace and usually their own network namespace. In this way, Docker manages deployments and provides process isolation. I am using Docker to test Qpid Proton clients installed from an RPM (Red Hat Package Manager) package. When installing inside Docker, the RPM package installer does not modify the global state of the entire machine, only that of the disposable container.

8. https://www.docker.com/products/container-runtime


Figure 3.1: Maestro HTML report


Figure 3.2: Performance Co-Pilot disk throughput chart

4 Automation Design and Implementation

’schedule()’ is the scheduler function. This is GOOD CODE! There probably won’t be any reason to change this, as it should work well in all circumstances (ie gives IO-bound processes good response etc)...

— Linus Torvalds, 1991

I have decided on a two-pronged approach. First, I wanted to develop a microbenchmark that would be part of the Qpid Proton code-base and run in the project's CI system. Second, I needed to develop automation to run a Quiver job, as that is the main focus of this thesis. This way, if any performance regressions are found using Quiver, there will be a microbenchmarking infrastructure in place to write a test that isolates the issue.

4.1 Microbenchmarking

While researching this topic, I found a relevant Proton issue, PROTON-2201. My implementations of the microbenchmarks are based on the ideas mentioned in the summary and in the comments of that JIRA ticket. For example, the BM_SendReceiveMessages test moves empty messages from a message sender to a message receiver in memory, without the use of a network socket.

4.1.1 Implementation

The core of Qpid Proton is written in the C programming language. I chose Google Benchmark as the microbenchmarking framework because it is compatible with C code and because I had heard of it previously. The main difficulties with writing the benchmarks were related to finding my way around a mature and unfamiliar code-base; I did not have any significant problems with the framework itself that the documentation would not resolve. The benchmark runs automatically in the Travis CI service on every commit, thanks to the configuration in .travis.yml which I wrote. The only unexpected aspect was that I had to use Ubuntu Focal to run the test, as the other Ubuntu versions Travis offers do not package a sufficiently recent version of Google Benchmark.

1. https://issues.apache.org/jira/browse/PROTON-220

4.2 Quiver automation

The automated performance job needs to obtain Qpid Proton (and/or Qpid JMS), build Quiver, run a test, and report and summarize the results. There are two ways of obtaining Qpid Proton; either compile it as part of the job or install a prebuilt RPM (Red Hat Package Manager) package. The performance job should be capable of both.

4.2.1 Design

My first consideration was to avoid having to obtain and maintain any dedicated machines for the test. This seemed a practical goal at first, because the Jenkins instance I worked with in Red Hat is equipped with a plugin that can spawn virtual machines in an associated OpenStack internal cloud and run jobs on them. Having implemented the job this way, I learned that using OpenStack virtual machines for this particular performance test is a bad idea. The results I was able to obtain were very different from one run to the next. I speculate this was caused by dynamic frequency scaling on the host machines running the VMs, by shifting load patterns on those machines during the day, or by a combination of those and other reasons. The OpenStack deployment is not under my control, so there was little I could do except look elsewhere. For my next attempt, I procured access to a dedicated machine, a Dell PowerEdge R540 with 16 CPU cores (32 hyper-threaded) and 45 GB of RAM. The machine is assigned the hostname rhm-per540-02 and is available as a Jenkins node. Using a bare-metal Jenkins node turned out to work well, and I based my final design around it.

4.2.2 Implementation

The Quiver performance job is written as a Pipeline job using the Job DSL for Jenkins. The implementation is split into two files. The entry point is QuiverPerfJobs.groovy, which defines the job parameters and loads the Pipeline definition from QuiverPerfPipeline.groovy. The pipeline consists of four stages: clean the job workspace, build or install Qpid Proton and build Quiver, run Quiver, and finally analyze the results.

Quiver compilation

Quiver is distributed in the form of source code. The project is equipped with a Makefile that builds both Quiver itself and the arrows. The Qpid Proton libraries need to be installed on the machine in a location where the Makefile can locate them when building the Quiver arrows. This is not a problem when the libraries are installed from RPMs, because RPM packages place them in the standard system location. Building Quiver with a locally compiled version of Proton requires additional configuration. The Qpid JMS arrow expects to download the Qpid JMS library through Maven. The version of the library to use is hardcoded in the Quiver arrow sources, so again there is a need to override the default with a custom version.

Performance of the performance job

One important consideration when working with Jenkins jobs is the testing of the job itself. When the job can be run quickly from start to finish, it becomes significantly easier to develop it iteratively and to test it. To speed up the job, I introduced the ccache compilation cache and added a Gradle task to deploy the job from the command line. The ccache compiler wrapper caches the results of previous compilations, speeding them up. It can be installed from the EPEL CentOS repository. Enabling ccache in a CMake project (such as Qpid Proton) is fairly simple. The only difficulty I encountered was that built-in support for ccache only comes with cmake3, and CentOS 7 only ships with CMake 2; I had to use the older method of creating a symlink to the ccache binary and passing it off to CMake as the compiler binary. Job DSL jobs in Jenkins are usually deployed by means of a seed job. This is a job that runs upon a commit to a job repository and updates the job definitions in Jenkins according to the new definitions in the repository. For job authors, this means that every job update requires a repository commit and push. When we instead use the Gradle task, we are able to push new job definitions from the command line directly, making job development significantly faster.

Native and Docker versions

In order to test the software as it is distributed—in the form of prebuilt RPM packages—it is necessary to modify the global state of the test machine by installing the RPM packages system-wide. Even though it is possible to install RPM packages under a custom directory prefix, that requires support from the package, and the Qpid packages for RHEL do not provide this support. Wanting to gain test isolation and avoid global state modification, I decided to create a second variant of the job that runs the Quiver tests in Docker. Docker uses Linux kernel namespaces and control groups (cgroups) to create isolated environments for groups of processes, which may include their own filesystem or network configuration. Even though Docker is not meant as a security sandbox, it does provide sufficient process isolation for running non-malicious tests.

4.3 Result reporting

At the end of its run, Quiver produces the following files: receiver-snapshots.csv, receiver-summary.json, receiver-transfers.csv.xz, sender-snapshots.csv, sender-summary.json, and sender-transfers.csv.xz. The most interesting of these are the JSON files, which contain the test results in a machine-readable form.


To work with JSON files on the command line, the jq utility can be used; it is available from the EPEL repository.

$ jq '.results.message_rate' {1..5}/receiver-summary.json
120769
122468
122075
121616
120849
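The same extraction can also be done inside the job with a few lines of Python, which is how the rates can be collected for the statistical summary described below. This is a sketch only; it assumes the per-iteration result directories 1/ to 5/ used in the jq example above.

import json
from pathlib import Path

def collect_message_rates(results_dir="."):
    """Read .results.message_rate from every iteration's receiver summary."""
    rates = []
    for summary in sorted(Path(results_dir).glob("*/receiver-summary.json")):
        with open(summary) as f:
            rates.append(json.load(f)["results"]["message_rate"])
    return rates

if __name__ == "__main__":
    print(collect_message_rates())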

4.3.1 Statistical analysis

Reporting the message rate as a single number is insufficient. We also need to ask what the variation between test runs is: is there actually any significant change happening in the test results we are getting, or could it be explained by run-to-run variation? There are multiple ways of answering this question. First, if we run the performance test frequently, such as per commit, there will be contiguous stretches of data points where the true performance has not changed, such as when the commits were typo fixes in the project documentation. Looking at such a chart, the viewer gains an intuitive feeling for which fluctuations are measurement artifacts and which were most likely caused by changes to the true performance of the software. Second, we may perform multiple measurements for each data point and visualize their distribution directly, using a histogram drawn around each mean point, a kernel density estimate, or a box plot. Third, we may repeat each performance measurement multiple times and provide interval estimates, such as confidence intervals or credible intervals; in a chart, we would then draw error bars around each data point. Confidence intervals are the solution used in this thesis.
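As a minimal sketch of the second approach, the per-version distributions could be visualized directly with a box plot. Matplotlib is assumed to be available here, and the measurement values and version labels below are placeholders, not real results from this thesis.

import matplotlib.pyplot as plt

# Placeholder numbers: several repeated throughput measurements per release.
measurements = {
    "0.32.0": [120000, 121000, 122000],
    "0.33.0": [118000, 119000, 120500],
}

fig, ax = plt.subplots()
ax.boxplot(list(measurements.values()), labels=list(measurements.keys()))
ax.set_xlabel("Release version")
ax.set_ylabel("Throughput (msgs/s)")
fig.savefig("throughput-boxplot.png")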

Confidence intervals

A confidence interval (at a given confidence level) suggests a plausible range for a statistic, such as the mean value. The confidence interval is a useful statistic because it aggregates the number of observations and their variance: as the number of observations grows or their variance decreases, the confidence interval shrinks.


The formula to calculate a 95% confidence interval for the sample mean of a normally distributed sample is as follows [44]:

\[
\left( \bar{x} - z^{*} \frac{\sigma}{\sqrt{n}},\ \bar{x} + z^{*} \frac{\sigma}{\sqrt{n}} \right)
\]

In the formula above, \(\bar{x}\) is the sample mean, \(\sigma\) is the standard deviation of the sample, \(n\) is the sample count, and \(z^{*}\) is 1.96.

Confidence intervals have a notoriously unintuitive interpretation. A confidence level of, say, 95% means that if the experiment was repeated many times, the true value would lie inside its own (each time different) confidence interval in 95% of the repeats. That is, there is a 95% chance that our Jenkins job experiment was one of the lucky ones that managed to bound the true value with its confidence interval. Notably, it does not mean that the true value must lie inside the confidence interval, nor that it lies there with a probability of 95% [45].
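A direct transcription of this formula into Python is shown below; it can be applied to the list of message rates collected from the receiver summaries. This is a sketch of the calculation only, not the exact code used by the Jenkins job.

import math
import statistics

Z_STAR = 1.96  # z* for a 95% confidence level

def confidence_interval(samples):
    """Return (lower, upper) bounds of the 95% CI for the sample mean,
    following the formula above."""
    mean = statistics.mean(samples)
    sigma = statistics.stdev(samples)  # sample standard deviation
    half_width = Z_STAR * sigma / math.sqrt(len(samples))
    return mean - half_width, mean + half_width

if __name__ == "__main__":
    # Message rates from the jq example earlier in this chapter.
    rates = [120769, 122468, 122075, 121616, 120849]
    low, high = confidence_interval(rates)
    print(f"mean {statistics.mean(rates):.1f} msgs/s, 95% CI ({low:.1f}, {high:.1f})")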

5 Performance regression testing evaluation

"We’ve observed your performance and you’re just not what we’re looking for. [...] Furthermore, we recommend that you not return for future assessment. [...] That’s all I have." The colonel then formally dismisses the candidate.

— Dick Couch, 2007

In this thesis, I have implemented a Jenkins job that automates running a Quiver peer-to-peer test with Apache Qpid clients. Furthermore, I have implemented several microbenchmarks for the Qpid Proton C client and created a Travis CI job and a Jenkins job to run them. To test the Quiver job, I have executed it with all Qpid Proton releases starting from 0.18.0 and ending at 0.33.0, and all Qpid JMS releases starting from 0.45.0 and ending at 0.54.0. I have also executed the Docker variant of the job with the relevant versions of the client libraries within these ranges.

5.1 Test results

Quiver jobs are parameterized. When the job is started, the user can choose which clients should be tested, their versions, the number of iterations of the test, and the number of messages sent in each iteration. Every Quiver job outputs the Quiver result files as Jenkins Artifacts. For a quick inspection of the results, the job log also contains all Quiver console output. At the end of the job log, a final summary of throughput with a confidence interval is printed. Here is an example of the job log output:

CONFIGURATION

Sender ...... qpid-proton-python2


Receiver ...... qpid-proton-python2
Address URL ...... q0
Output files ...... results/15
Count ...... 300,000 messages
Body size ...... 100 bytes
Credit window ...... 1,000 messages
Flags ...... peer-to-peer

RESULTS

Count ...... 300,000 messages
Duration ...... 79.5 seconds
Sender rate ...... 3,784 messages/s
Receiver rate ...... 3,776 messages/s
End-to-end rate ...... 3,772 messages/s

Latencies by percentile:

0% ...... 94 ms        90.00% ...... 260 ms
25% ...... 255 ms      99.00% ...... 260 ms
50% ...... 257 ms      99.90% ...... 261 ms
100% ...... 269 ms     99.99% ...... 263 ms

Estimated message rate is 3755.5 +- 9 msgs/second at 95% confidence

5.2 Result analysis

I have executed the Quiver job for four Qpid clients, and I tested multiple versions of each client. For each version where an RPM (or a prebuilt Java archive) is available, I also executed the test in Docker. The data points obtained from Docker runs are colored red in the following figures.


5.2.1 Qpid Proton C

Figure 5.1: Qpid Proton C P2P Throughput per Release Version (throughput in msgs/s plotted against Qpid Proton release versions 0.19.0 through 0.33.0)


5.2.2 Qpid Proton C++

Figure 5.2: Qpid Proton C++ P2P Throughput per Release Version (throughput in msgs/s plotted against Qpid Proton release versions 0.19.0 through 0.33.0)


5.2.3 Qpid Proton Python

Figure 5.3: Qpid Proton Python P2P Throughput per Release Version (throughput in msgs/s plotted against Qpid Proton release versions 0.19.0 through 0.33.0)


5.2.4 Qpid JMS

Figure 5.4: Qpid JMS P2P Throughput per Release Version (throughput in msgs/s plotted against Qpid JMS release versions 0.45.0 through 0.56.0)

5.3 Future work

This thesis demonstrates that running Quiver in a Jenkins Pipeline job is a practical proposition. The use of confidence intervals in results presentation provides useful guidance about the significance of the measurements. Finally, the introduction of microbenchmarks into Apache Qpid Proton has been moderately successful.


Nevertheless, there is still a significant amount of low-hanging fruit when it comes to Apache Qpid performance testing. Here, I have collected a few suggestions.

More microbenchmarks should be written, to improve coverage of Qpid Proton C.

Alerting for microbenchmarks needs to be set up, so that regressions in their performance cannot be ignored by project developers.

Quiver can be extended to support randomized message sizes or a mix of preconfigured message sizes in a single test.

Quiver can be further extended to allow multiple sending and receiving arrows in a single test.

Previous measurements should be made accessible in the form of historical charts that need to be reviewed frequently by developers.


Bibliography

1. BEYER, B.; JONES, C.; PETOFF, J.; MURPHY, N.R. Site Reliability Engineering: How Google Runs Production Systems. O'Reilly Media, 2016. ISBN 9781491951187. Available also from: https://books.google.cz/books?id=_4rPCwAAQBAJ.

2. COMMITTEE, IEEE Computer Society. Standards Coordinating; ELECTRICAL, Institute of; ENGINEERS, Electronics; BOARD, IEEE Standards. IEEE Standard Glossary of Software Engineering Terminology. IEEE, 1990. IEEE Std. ISBN 9781559370677. Available also from: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=159342.

3. COX, Russ. What is Software Engineering? [online]. 2018 [visited on 2020-07-18]. Available from: https://research.swtch.com/vgo-eng.

4. BOSSAVIT, L. The Leprechauns of Software Engineering. Laurent Bossavit, 2015. ISBN 9782954745503. Available also from: https://leanpub.com/leprechauns.

5. ONLINE, OED. quality, n. and adj. [online]. Oxford University Press, 2020 [visited on 2020-07-19]. Available from: www.oed.com/view/Entry/155878.

6. KANER, C.; BACH, J.; PETTICHORD, B. Lessons Learned in Software Testing: A Context-Driven Approach. Wiley, 2011. ISBN 9781118080559.

7. DOBROVSKÝ, Pavel. Martin Klíma: Od Dračího doupěte k Warhorse. Tiscali Media, a.s., 2013. Available also from: https://games.tiscali.cz/rozhovor/martin-klima-od-draciho-doupete-k-warhorse-225575.

8. IV113 Validace a verifikace [online]. 2016 [visited on 2020-07-19]. Available from: https://www.fi.muni.cz/~xbarnat/IV113/all_IV113_2016.pdf.

9. MYERS, G.J.; SANDLER, C.; BADGETT, T. The Art of Software Testing. Wiley, 2011. ITPro collection. ISBN 9781118133156. Available also from: https://books.google.cz/books?id=GjyEFPkMCwcC.


10. MARTIN, R.C. Clean Architecture: A Craftsman's Guide to Software Structure and Design. Prentice Hall, 2018. Martin, Robert C. ISBN 9780134494166. Available also from: https://books.google.cz/books?id=8ngAkAEACAAJ.

11. HINKELMANN, K.; KEMPTHORNE, O. Design and Analysis of Experiments, Volume 1: Introduction to Experimental Design. Wiley, 2007. Wiley Series in Probability and Statistics. ISBN 9780470191743. Available also from: https://books.google.cz/books?id=T3wWj2kVYZgC.

12. MARTIN, R.C. The Clean Coder: A Code of Conduct for Professional Programmers. Pearson Education, 2011. Robert C. Martin Series. ISBN 9780132542883. Available also from: https://books.google.cz/books?id=ik0qCTVzl44C.

13. HEROUT, Pavel. Testování pro programátory. 1. vyd. České Budějovice: Koop nakladatelství, 2016. ISBN 978-80-7232-481-1.

14. CRISPIN, Lisa. [WEBINAR] Modeling Your Test Automation Strategy Part 2 [online]. 2019 [visited on 2020-07-19]. Available from: https://www.mabl.com/blog/modeling-your-test-automation-strategy-webinar-2.

15. KUMAR, S. What are Software Test Types? [online] [visited on 2020-07-19]. Available from: http://tryqa.com/what-are-software-test-types/.

16. KUMAR, S. What are Software Testing Levels? [online] [visited on 2020-07-19]. Available from: http://tryqa.com/what-are-software-testing-levels/.

17. THE BAZEL AUTHORS. Test encyclopedia - Bazel [online]. 2020 [visited on 2020-07-17]. Available from: https://docs.bazel.build/versions/3.4.0/test-encyclopedia.html.

18. WINTERS, T.; MANSHRECK, T.; WRIGHT, H. Software Engineering at Google: Lessons Learned from Programming Over Time. O'Reilly Media, 2020. ISBN 9781492082743. Available also from: https://books.google.cz/books?id=WXTTDwAAQBAJ.


19. CRISPIN, L.; GREGORY, J. Agile Testing: A Practical Guide for Testers and Agile Teams. Pearson Education, 2008. Addison-Wesley Signature Series (Cohn). ISBN 9780321616937. Available also from: https://books.google.cz/books?id=68_lhPvoKS8C.

20. KLOSTERMANN, Aiko. About the origin of Smoke Testing and the confusion it comes with [online]. 2019 [visited on 2020-11-17]. Available from: https://medium.com/@AikoPath/about-the-origin-of-smoke-testing-and-the-confusion-it-comes-with-dfa18eb8ce0.

21. LANGR, Jeff. The Three Rules of TDD [online]. 2013 [visited on 2020-11-17]. Available from: https://www.oreilly.com/library/view/modern-c-programming/9781941222423/f_0055.html.

22. TRENK, Andrew. Testing on the Toilet: Prefer Testing Public APIs Over Implementation-Detail Classes [online]. 2015 [visited on 2020-11-17]. Available from: https://testing.googleblog.com/2015/01/testing-on-toilet-prefer-testing-public.html.

23. MOHAN, Ravi. Learning From Sudoku Solvers [online]. 2007 [visited on 2020-07-19]. Available from: http://ravimohan.blogspot.com/2007/04/learning-from-sudoku-solvers.html.

24. BLAND, Mike. Goto Fail, Heartbleed, and Unit Testing Culture [online]. 2014 [visited on 2020-12-03]. Available from: https://martinfowler.com/articles/testing-culture.html.

25. NYGARD, M.T. Release It!: Design and Deploy Production-Ready Software. 2nd ed. Pragmatic Bookshelf, 2018. ISBN 9781680504521. Available also from: https://books.google.cz/books?id=Ug9QDwAAQBAJ.

26. ISO 25000 PORTAL. ISO/IEC 25010 [online]. 2011 [visited on 2020-11-17]. Available from: https://iso25000.com/index.php/en/iso-25000-standards/iso-25010.

27. ISTQB. ISTQB Glossary: performance testing [online] [visited on 2020-11-17]. Available from: https://glossary.istqb.org/en/term/performance-testing-2.


28. MOLYNEAUX, I. The Art of Application Performance Testing: From Strategy to Tools. O'Reilly Media, 2014. Theory in practice. ISBN 9781491900505. Available also from: https://books.google.cz/books?id=jc7UBQAAQBAJ.

29. ORMROD, Nicholas. The strange details of std::string at Facebook [online]. 2020 [visited on 2020-07-17]. Available from: https://docs.bazel.build/versions/3.4.0/test-encyclopedia.html.

30. STONE, Luke. Bringing Pokémon GO to life on Google Cloud [online]. 2016 [visited on 2020-07-17]. Available from: https://cloud.google.com/blog/products/gcp/bringing-pokemon-go-to-life-on-google-cloud.

31. HUGGINS, Jason. Fixing HealthCare.gov, One Test at a Time [MEETUP] [online]. 2015 [visited on 2020-07-17]. Available from: https://www.youtube.com/watch?v=wRTDo6uWBss.

32. HUDSON, Rick. Go GC: Solving the Latency Problem [online]. 2015 [visited on 2021-01-07]. Available from: https://www.youtube.com/watch?v=aiv1JOfMjm0.

33. KUMAR, S. What is Non-functional testing (Testing of software product characteristics)? [online] [visited on 2020-12-19]. Available from: http://tryqa.com/what-is-non-functional-testing-testing-of-software-product-characteristics/.

34. THE STACK OVERFLOW COMMUNITY. Why is processing a sorted array faster than processing an unsorted array? Stack Exchange Inc, 2012. Available also from: https://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-processing-an-unsorted-array.

35. MYTKOWICZ, Todd; DIWAN, Amer; HAUSWIRTH, Matthias; SWEENEY, Peter F. Producing wrong data without doing anything obviously wrong! ACM Sigplan Notices. 2009, vol. 44, no. 3, pp. 265–276. Available also from: https://users.cs.northwestern.edu/~robby/courses/322-2013-spring/mytkowicz-wrong-data.pdf.


36. CARRUTH, Chandler. CppCon 2015: Chandler Carruth "Tuning C++: Benchmarks, and CPUs, and Compilers! Oh My!" [online]. 2015 [visited on 2020-07-19]. Available from: https://www.youtube.com/watch?v=nXaxk27zwlk.

37. THE V8 TEAM. How V8 measures real-world performance [online]. 2016 [visited on 2020-07-19]. Available from: https://v8.dev/blog/real-world-performance.

38. HUPPLER, Karl. The art of building a good benchmark. In: Technology Conference on Performance Evaluation and Benchmarking. 2009, pp. 18–30.

39. THE V8 TEAM. Retiring Octane [online]. 2017 [visited on 2020-07-19]. Available from: https://v8.dev/blog/retiring-octane.

40. STEJSKAL, Jakub. Performance Testing and Analysis of Qpid Dispatch Router. Brno, CZ, 2018. Available also from: https://www.fit.vut.cz/study/thesis/21191/. Master's thesis. Brno University of Technology, Faculty of Information Technology.

41. STUCHLÍK, Dominik. Performance testing of Messaging Protocols [online]. 2020 [visited on 2021-01-09]. Available also from: https://is.muni.cz/th/cmpll/.

42. ROSS, Justin Richard. Quiver [https://github.com/ssorj/quiver]. GitHub, Inc., 2020.

43. LEŠKO, Matej. Enhanced remote execution layer for Deployment testing framework [online]. 2016 [visited on 2021-01-09]. Available also from: https://is.muni.cz/th/ot3q2/. Master's thesis. Masaryk University, Faculty of Informatics, Brno. Supervised by Adam RAMBOUSEK.

44. ADELSTEIN-LELBACH, Bryce. Benchmarking C++ Code. 2015. Available also from: https://www.youtube.com/watch?v=zWxSZcpeS8Q. CppCon.

45. GREENLAND, Sander; SENN, Stephen J; ROTHMAN, Kenneth J; CARLIN, John B; POOLE, Charles; GOODMAN, Steven N; ALTMAN, Douglas G. Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. European journal of epidemiology. 2016, vol. 31, no. 4, pp. 337–350.
