
Benchmarking Testing Strategies with Tools from Mutation Analysis

Ralph Guderlei∗, René Just, Christoph Schneckenburger, Franz Schweiggert
Institute of Applied Information Processing
Ulm University, D-89069 Ulm, Germany
{ralph.guderlei, rene.just, christoph.schneckenburger, franz.schweiggert}@uni-ulm.de

∗Work supported by the DFG research training group 1100 "Modellierung, Analyse und Simulation in der Wirtschaftsmathematik".

Abstract

The assessment of a testing strategy and the comparison of different testing strategies is a crucial part of current research on testing. Often, manual error seeding is used to generate faulty programs. As a consequence, the results obtained from the examination of these programs are often not reproducible and likely to be biased. In this paper, a flexible approach to the benchmarking of testing strategies is presented. The approach utilizes well-known results from mutation analysis to construct an objective effectiveness measure for test oracles. This measure makes it possible not only to draw conclusions on the effectiveness of a single testing strategy but also to compare different testing strategies by their effectiveness measure.

1 Introduction

Extensive manual testing is often not feasible. Therefore, the automation of the testing process is necessary to reduce the testing effort. Automated testing requires solutions to two problems: the automatic generation of test inputs and the automatic validation of the output of the implementation under test (IUT). The latter problem is often referred to as the oracle problem (cf. [18]). The combination of a method to generate the test input and an oracle will be referred to as a testing strategy.

A class of oracles, the so-called partial oracles (cf. [2]), seems to be a promising means to automate the verification of the output of the IUT. Partial oracles are characterized by the property that if an IUT is judged to be faulty by the oracle, then the IUT indeed contains a fault. That means that the oracle produces no false-positive results. On the other hand, if an IUT is judged to be correct, it may still contain a fault. The fact that partial oracles may leave some faults undetected requires a strict assessment of the proposed oracles.

The test input also has an impact on the ability of the testing strategy to detect faults in the IUT. Several methods have been proposed for the assessment of test inputs. In the present paper, methods from mutation analysis will be used to measure the adequacy of the test input.

Often, selected example programs are used for the assessment of testing methods. The assessment procedure usually starts with the manual insertion of errors into the programs. Then, the testing method is measured by its ability to detect the inserted faults. The results of this kind of assessment are hard to reproduce and can hardly be used for the comparison with other testing methods.

Therefore, a benchmarking method based on known results from mutation analysis is presented in this paper. After a short introduction to mutation analysis in Section 2, the proposed approach is described and an example is given in Section 3. Thereafter, the presented approach is discussed in Section 4. Finally, a conclusion is given in Section 5.

2 Preliminaries

Mutation analysis is a fault-based method to examine the effectiveness of a test set. It has been explored since 1977 and was originally introduced in [4, 6]. In order to assess test case adequacy, faults are seeded systematically into the implementation which is to be investigated. Thus, faulty versions of the original implementation, so-called mutants, are generated. In contrast to classical error seeding (cf. [13]), where the seeding is guided by the intuition of experienced engineers, the way of seeding faults in mutation analysis is formally described by so-called mutation operators. Examples of mutation operators are the replacement of variables by constants or the swapping of comparison operators (e.g. replacing a<b by a<=b or by a==b); see e.g. [9, 10] for definitions of mutation operators.
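To make this concrete, the following minimal sketch (our own illustration; the method max and both mutants are hypothetical examples, not taken from the cited papers) shows what two typical mutation operators might produce from a small Java method:

    public class MutationExample {

        // Original implementation: returns the larger of two integers.
        static int max(int a, int b) {
            return (a > b) ? a : b;
        }

        // Mutant 1: the comparison operator > is swapped to >=.
        // This mutant is equivalent to the original (for a == b both
        // versions return the same value), so no test input can kill it.
        static int maxMutant1(int a, int b) {
            return (a >= b) ? a : b;
        }

        // Mutant 2: the variable b is replaced by the constant 0.
        // The input a = 1, b = 5 kills this mutant: the original
        // returns 5, the mutant returns 1.
        static int maxMutant2(int a, int b) {
            return (a > 0) ? a : b;
        }
    }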
Due to the methodical seeding of faults, the way of obtaining the mutants is reproducible, and the applicability is independent of the semantics of the implementation in question, i.e. the implementation can be chosen arbitrarily and the corresponding semantics is exclusively covered by the test cases. This fact can be regarded as fundamental in order to yield reliable and comparable benchmarking results.

Mutation analysis is based on two hypotheses, namely the so-called competent programmer hypothesis and the coupling effect. The latter assumes that complex faults can be regarded as an accumulation of small mutations. Due to this premise, it can be assumed that every test case which detects a simple mutation will also reveal a complex fault. The competent programmer hypothesis, on the other hand, implies that an application written by an experienced programmer contains only small faults, i.e. faults that are similar to simple mutations. Therefore, every mutant contains just one mutation; this satisfies the two hypotheses and avoids an exploding number of mutants that would result from combining multiple mutation operators.

Before the mutants are generated, all test cases are applied to the original implementation in order to ensure that the implementation as well as the test cases are correct (correctness means that, for every input value and test case respectively, the output of the implementation meets the expectation). Thereafter, mutants are generated by applying all mutation operators to the original implementation.

Every test case is then applied to the original implementation as well as to every mutant, and the outputs are compared. If the outputs differ, the mutant is said to be killed. Owing to Rice's theorem [17], which states that the equivalence of non-trivial programs is not decidable in general, it can be necessary to manually inspect the not-killed mutants for equivalence. (A mutant can produce the same output as the original implementation either if the test cases do not cover the mutation or if the mutation has no effect, e.g. a post-increment of a variable at the end of its life cycle.)

After the equivalent mutants have been separated out, the number of killed mutants can be related to the number of mutants that are not equivalent to the original implementation. The resulting quotient Ms(T) is referred to as the mutation score:

    Ms(T) = #{killed mutants} / #{not-equivalent mutants}

Thus, the mutation score is suitable for measuring the adequacy of the test input. Further extensive research results and the current state of affairs in the field of mutation analysis can be found in [15].
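As a sketch, the mutation score can be computed along the following lines. Modeling a program as a Java function and the names MutationScore and mutationScore are our own illustrative assumptions; real tools such as MuJava execute compiled mutants instead:

    import java.util.List;
    import java.util.function.Function;

    // Minimal sketch of a mutation-score computation. Equivalent mutants
    // are assumed to have been removed already, otherwise the denominator
    // would be wrong.
    public final class MutationScore {

        static <I, O> double mutationScore(
                Function<I, O> original,
                List<Function<I, O>> nonEquivalentMutants,
                List<I> testInputs) {
            int killed = 0;
            for (Function<I, O> mutant : nonEquivalentMutants) {
                // A mutant is killed as soon as one test input provokes an
                // output that differs from the original implementation.
                for (I input : testInputs) {
                    if (!original.apply(input).equals(mutant.apply(input))) {
                        killed++;
                        break;
                    }
                }
            }
            return (double) killed / nonEquivalentMutants.size();
        }
    }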
3 A Benchmark for Testing Strategies

Recall that a testing strategy consists of two parts: a method to select test input data and a method to verify the actual output of the IUT (an oracle). Each part of the testing strategy is captured by a separate quality measure. The benchmark of the testing strategy T is then defined as the product

    E(T) = E_in(T) · E_out(T)

of the quality measures for the generated test input and the oracle, E_in and E_out, respectively. These quality measures are based on results from mutation analysis. The measure E_in is equal to the mutation score as presented in Section 2. In order to benchmark a testing strategy T, the following steps have to be performed:

1. Choose a benchmark implementation: The domain and the structure of the implementation should match the testing strategy to be assessed, and the implementation itself should be representative of the testing problem to be solved by the testing method. The implementation should be complex enough so that a large number of mutants can be generated. In the following, the chosen implementation will be referred to as the original implementation.

2. Apply T to the benchmark implementation: The testing strategy should not detect a bug in the original implementation. Otherwise, either the original implementation or the testing strategy has to be corrected.

3. Generate mutants: This step can be automated using various tools, e.g. for programs written in Java, Fortran, C#, or PHP. One possibility is to consider programs written in Java and to use the tool MuJava (cf. [11]) to create the mutants.

4. Determine the number of non-equivalent mutants N: Usually, this step has to be performed manually, as there is no tool for it. However, an approach for the automatic detection of equivalent mutants is presented in [16].

5. Examination of T: For each mutant, do:
   (a) Generate test input according to the input generation method taken from T.
   (b) Execute the mutant and compare the output of the mutant to the output of the original implementation. If the outputs differ, increase m by one.
   (c) Apply the oracle of T to the mutant. If the oracle judges the mutant to be faulty, increase n by one.

6. Compute E(T) (a code sketch of steps 5 and 6 follows this list):

    E(T) = (m/N) · (n/m) = n/N
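The following sketch translates steps 5 and 6 into code, under the same illustrative assumptions as before. The PartialOracle interface is hypothetical, and N (the number of non-equivalent mutants) is assumed to have been determined in step 4:

    import java.util.List;
    import java.util.function.Function;

    public final class Benchmark {

        // A partial oracle: a "faulty" verdict guarantees a fault,
        // but a passing verdict guarantees nothing.
        interface PartialOracle<I, O> {
            boolean judgesFaulty(I input, O output);
        }

        static <I, O> double effectiveness(
                Function<I, O> original,
                List<Function<I, O>> mutants,   // all generated mutants
                int nonEquivalent,              // N, determined in step 4
                List<I> inputs,                 // test input generated by T
                PartialOracle<I, O> oracle) {
            int m = 0;  // mutants killed by comparison with the original
            int n = 0;  // mutants judged faulty by the oracle of T
            for (Function<I, O> mutant : mutants) {
                boolean killedByComparison = false;  // perfect-oracle verdict
                boolean killedByOracle = false;      // partial-oracle verdict
                for (I input : inputs) {
                    O output = mutant.apply(input);
                    if (!output.equals(original.apply(input))) killedByComparison = true;
                    if (oracle.judgesFaulty(input, output)) killedByOracle = true;
                }
                if (killedByComparison) m++;
                if (killedByOracle) n++;
            }
            double eIn = (double) m / nonEquivalent;       // E_in, the mutation score
            double eOut = (m == 0) ? 0.0 : (double) n / m; // E_out, oracle effectiveness
            return eIn * eOut;                             // E(T) = n/N
        }
    }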
The comparison of the mutants' outputs to the outputs of the original implementation can be interpreted as a perfect oracle (see e.g. [3]), as the original implementation is a defect-free version of each mutant. The perfect oracle is the best available oracle. Therefore, if the same inputs are used for the perfect oracle and the examined testing strategy T, the number of mutants killed by T cannot exceed the number of mutants killed by the comparison with the original implementation. If the oracle of T is a partial oracle, it holds that 0 ≤ E_out(T) ≤ 1 and thus 0 ≤ E(T) ≤ 1. Only partial oracles should be benchmarked in this way: if the oracle can produce false-positive verdicts, the measure E_out is meaningless.

The separation of the overall benchmark into two parts makes it possible to separate the impact of the two parts of the testing strategy on the overall measure. A low value in one of the two measures indicates which part of the testing strategy has to be improved.

In principle, it is not necessary to use the mutation score as E_in. Any other adequacy criterion, e.g. a coverage measure, can be used instead. In that case, the manual detection of equivalent mutants can be avoided, as the comparison with the original implementation as a benchmark for the oracle does not require identifying equivalent mutants. If the input adequacy criterion can be obtained automatically, the benchmarking process itself can be automated.

Furthermore, the presented approach can be used to compare different testing strategies T1 and T2. If E(T1) < E(T2) holds, then it can be concluded that T2 is more effective than T1 and, in that sense, T2 is better than T1.

In our past work, we have used the proposed benchmark to examine various testing strategies. In [12], the effectiveness of Metamorphic Testing (see e.g. [5]) was examined using different implementations for the computation of matrix determinants. Roughly speaking, Metamorphic Testing checks necessary (post-)conditions on tuples of program outputs. The preconditions (for tuples of inputs) and post-conditions (for tuples of outputs) are specified by so-called metamorphic relations. Different metamorphic relations were assessed, and general hints on the choice of good relations were derived from the empirical results. Each metamorphic relation implicitly defines a testing strategy (a concrete example of such a relation is sketched below).
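As an illustration (our own example, not one of the relations R1 to R12 examined in [12]): for a determinant implementation, scaling a single row of A by a factor c must scale det(A) by c. This yields a partial oracle that compares two outputs of the IUT against each other without knowing the true determinant. The IUT is passed in as a function, and EPS is an assumed tolerance for rounding errors:

    import java.util.function.Function;

    public final class RowScalingRelation {

        static final double EPS = 1e-9;  // tolerance for rounding errors

        // Metamorphic relation: det(A') = c * det(A), where A' is A with
        // one row multiplied by c. Returns true if the relation is violated
        // beyond the tolerance, in which case the implementation under test
        // is judged faulty.
        static boolean violated(Function<double[][], Double> det,
                                double[][] a, int row, double c) {
            double[][] scaled = new double[a.length][];
            for (int i = 0; i < a.length; i++) {
                scaled[i] = a[i].clone();
            }
            for (int j = 0; j < scaled[row].length; j++) {
                scaled[row][j] *= c;
            }
            return Math.abs(det.apply(scaled) - c * det.apply(a)) > EPS;
        }
    }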
In the empirical study, 306 mutants were created; 37 of these mutants are equivalent to the original implementation (N = 269). For all of the relations, it turned out that E_in = 0.978 (m = 263). Therefore, the different relations differ only in E_out, which ranges from E_out(R2) = 0.44 to E_out(R12) = 1. Fig. 1 gives an overview of the results. On the x-axis, the size of the input matrices is shown; the y-axis depicts the oracle effectiveness E_out.

[Figure 1. Example: effectiveness of different metamorphic relations. The oracle effectiveness E_out (y-axis, approximately 0.3 to 1) is plotted against the matrix size (x-axis, 2 to 20) for the relations R1 to R12.]

An approach for the fully automated testing of imaging applications by means of random and Metamorphic Testing was presented and examined in [7]. Here, the influence of different techniques for the generation of input data was taken into consideration and quantified using the proposed approach. The effectiveness of statistical methods for testing randomized software was examined in [8]. In these papers, the proposed methods were also compared to other testing strategies.

4 Discussion

Some restrictions must be considered when using the proposed approach. First, the weaknesses of mutation analysis also apply to our approach: there is only empirical evidence that the general assumptions of mutation analysis, the competent programmer hypothesis and the coupling effect, actually hold. Additionally, it is not clear that results derived from experiments based on mutation analysis can be transferred to real-world faults. An empirical study (cf. [1]) supports this hypothesis, but it has to be confirmed by further research.

A further technical limitation is tool support. All of the available tools have certain limitations; e.g. MuJava cannot work on source code containing Java 1.5 language features, and other tools implement only some mutation operators. Note that our approach is not efficiently applicable without appropriate tool support.

On the other hand, the presented approach is very generic. The implementation which is used to generate the mutants is not restricted to a specific application domain, and the approach can be applied to any testing strategy that contains a partial oracle. Therefore, the benchmark implementation and the testing strategy can be chosen to fit the research project exactly.

Given the necessary tools and testing strategies, the approach can be fully automated. Therefore, it is possible to examine either a large number of testing strategies or to apply these testing strategies to a large number of different programs.

To make the approach more efficient, a subset of the mutation operators can be selected; in [14], an empirical study determining the most relevant mutation operators is presented. Manual error seeding often leads to faults which support the claims of the researcher. In contrast, the proposed effectiveness measure is based on the formal specification of the mutation operators and therefore leads to (at least more) objective benchmark results.

The two parts of the benchmark can also be used separately. For example, if two testing strategies share a common method for the generation of the input, then only the effectiveness measures of the oracles have to be compared.

5 Conclusion

In the present paper, a very flexible method to measure the effectiveness of a testing strategy has been presented. By choosing the implementation used to generate the mutants, the approach can be tailored to the researcher's needs. The automation of the benchmarking process also makes it possible to consider a large number of different programs in empirical studies. Therefore, the presented approach can be (and has been) applied to a broad range of different testing problems.

References

[1] J. Andrews, L. Briand, and Y. Labiche. Is mutation an appropriate tool for testing experiments? In Proceedings of the 27th International Conference on Software Engineering (ICSE 2005), pages 402–411. ACM, 2005.
[2] A. Bertolino. Software Testing Research: Achievements, Challenges, Dreams. In Proceedings of the International Conference on Software Engineering (ICSE 2007), pages 85–103. IEEE Computer Society, 2007.
[3] R. Binder. Testing Object-Oriented Systems: Models, Patterns, and Tools. Addison-Wesley Professional, 1999.
[4] T. Budd. Mutation Analysis of Program Test Data. PhD thesis, Yale University, 1980.
[5] T. Y. Chen, T. Tse, and Z. Zhou. Fault-based testing in the absence of an oracle. In Proceedings of the 25th IEEE Annual International Computer Software and Applications Conference (COMPSAC 2001), pages 172–178. IEEE Computer Society, 2001.
[6] R. A. DeMillo, R. J. Lipton, and F. G. Sayward. Hints on test data selection: Help for the practicing programmer. IEEE Computer, 11:34–41, 1978.
[7] R. Guderlei and J. Mayer. Towards automatic testing of imaging software by means of random and metamorphic testing. International Journal of Software Engineering and Knowledge Engineering (accepted; invited extended version, Special Issue on Quality Software), 18(3):??, 2008.
[8] R. Guderlei, J. Mayer, C. Schneckenburger, and F. Fleischer. Testing randomized software by means of statistical hypothesis tests. In Proceedings of SOQUA '07: Fourth International Workshop on Software Quality Assurance, pages 46–54, New York, NY, USA, 2007. ACM.
[9] K. N. King and A. J. Offutt. A Fortran language system for mutation-based software testing. Software Practice and Experience, 21(7):685–718, 1991.
[10] Y.-S. Ma, Y. R. Kwon, and J. Offutt. Inter-class mutation operators for Java. In Proceedings of the 13th International Symposium on Software Reliability Engineering (ISSRE 2002), pages 352–366, 2002.
[11] Y.-S. Ma, J. Offutt, and Y. R. Kwon. MuJava: an automated class mutation system. Software Testing, Verification and Reliability, 15(2):97–133, 2005.
[12] J. Mayer and R. Guderlei. An Empirical Study on the Selection of Good Metamorphic Relations. In Proceedings of the 30th Annual International Computer Software and Applications Conference (COMPSAC 2006), volume 1, 2006.
[13] H. Mills. On the Statistical Validation of Computer Programs. Technical report, IBM FSD, 1970.
[14] A. Offutt, A. Lee, G. Rothermel, R. Untch, and C. Zapf. An experimental determination of sufficient mutant operators. ACM Transactions on Software Engineering and Methodology, 5(2):99–118, 1996.
[15] J. Offutt and R. H. Untch. Mutation 2000: Uniting the orthogonal. In Proceedings of Mutation 2000: Mutation Testing in the Twentieth and the Twenty First Centuries, San Jose, CA, pages 45–55, 2000.
[16] A. Offutt and J. Pan. Detecting equivalent mutants and the feasible path problem. The Journal of Software Testing, Verification, and Reliability, 7(3):165–192, 1997.
[17] H. Rice. Classes of Recursively Enumerable Sets and Their Decision Problems. Transactions of the American Mathematical Society, 74(2):358–366, 1953.
[18] E. Weyuker. On Testing Non-Testable Programs. The Computer Journal, 25(4):465–470, 1982.