Using Mutation Testing to Measure Behavioural Test Diversity
Francisco Gomes de Oliveira Neto, Felix Dobslaw, Robert Feldt
Chalmers and the University of Gothenburg, Dept. of Computer Science and Engineering, Gothenburg, Sweden
[email protected], fdobslaw,[email protected]
arXiv:2010.09144v1 [cs.SE] 19 Oct 2020

We acknowledge the support of Vetenskapsrådet (Swedish Science Council) in the project BaseIT (2015-04913).

Abstract—Diversity has been proposed as a key criterion to improve testing effectiveness and efficiency. It can be used to optimise large test repositories but also to visualise test maintenance issues and raise practitioners' awareness about waste in test artefacts and processes. Even though these diversity-based testing techniques aim to exercise diverse behaviour in the system under test (SUT), diversity has mainly been measured on and between artefacts (e.g., inputs, outputs or test scripts). Here, we introduce a family of measures to capture behavioural diversity (b-div) of test cases by comparing their executions and failure outcomes. Using failure information to capture the SUT behaviour has been shown to improve the effectiveness of history-based test prioritisation approaches. However, history-based techniques require reliable test execution logs, which are often not available or can be difficult to obtain due to flaky tests, scarcity of test executions, etc. To be generally applicable, we instead propose to use mutation testing to measure behavioural diversity by running the set of test cases on various mutated versions of the SUT. Concretely, we propose two specific b-div measures (based on accuracy and the Matthews correlation coefficient, respectively) and compare them with artefact-based diversity (a-div) for prioritising the test suites of 6 different open-source projects. Our results show that our b-div measures outperform a-div and random selection in all of the studied projects. The improvement is substantial, with an average increase in the average percentage of faults detected (APFD) of between 19% and 31% depending on the size of the subset of prioritised tests.

Index Terms—diversity-based testing, test prioritisation, test selection, empirical study

I. INTRODUCTION

Testing is an essential activity that drives and ensures the quality of software-intensive systems. However, the increasing growth and complexity of software systems in combination with limited resources creates prohibitive testing scenarios [1], [2]. Test optimisation approaches, such as test prioritisation or selection, cover a variety of strategies and techniques that aim for cost-effective testing by automatically or manually supporting testers in deciding what and how much to test [3]. In particular, diversity-based techniques perform among the best when comparing a large set of test prioritisation techniques [4], hence being a prominent candidate for cost-effective testing in prohibitive situations.

Test diversity can measure how different tests are from each other, based on distance functions for each pair of tests [5]–[7], or even over the entire set of tests altogether [8]. In other words, the goal is then to obtain a test suite able to exercise distinct parts of the SUT to increase the fault detection rate [5], [8]. However, the main focus in diversity-based testing has been on calculating diversity on, typically static, artefacts, while there has been little focus on considering the outcomes of existing test cases, i.e., whether similar tests fail together when executed against the same SUT. Many studies report that using the outcomes of test cases has proven to be an effective criterion to prioritise tests, but there are many challenges with such approaches [9], [10]. In particular, gathering historical data to evaluate test failure rates has its own trade-offs. On the one hand, historical data can be readily mined or provided by many build and continuous integration environments. On the other hand, historical data can be unreliable due to, e.g., many flaky test executions, or insufficient/limited in case the project is not mature enough or many test cases are being added or removed from test repositories [10], [11]. Moreover, historical data is inherently connected to the quality of the test suite, such that test cases that rarely fail can hide many false negatives (i.e., tests that pass but should fail) [1], [10]. In short, calculating diversity based on historic test outcomes has shown a lot of promise but is limited to situations where reliable and extensive test logs are available.

As an alternative, we here propose to use mutation testing to overcome these challenges and enable more general use of diversity approaches for test optimisation. An added benefit of our approach is that it allows controlled experiments on the value of test diversity itself. By introducing many mutants in the code and running the entire test suite for each mutant, we identify patterns in the test outcomes (passes and failures) of pairs of test cases. We call this behaviour-based diversity (b-div) since it measures diversity in the behaviour of the test cases when exercising different faults in the SUT. Unlike artefact-based diversity (a-div) measures, our approach considers test outcomes independently from the test specification (e.g., requirements, test input, or system output data). In other words, our distance measures convey whether tests have been passing/failing together when executed on the same instance of the SUT, i.e., when it was exposed to a certain (set of) mutant(s). Note that other techniques use co-failure to obtain the probability distribution of an arbitrary test B failing given that test A failed [10], whereas our approach measures the distance value between tests A and B based on whether they have the same test outcome.
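To make the b-div idea concrete, below is a minimal sketch (not the authors' implementation, which is defined later in Section III) of how an accuracy-based and an MCC-based behavioural distance could be computed for a pair of tests taken from a test outcome matrix, assuming each test is encoded as a boolean vector of failures over the same set of mutants. The function names, and the rescaling of MCC from [-1, 1] to a [0, 1] distance, are illustrative assumptions.

```python
import math
from typing import Sequence

def b_div_accuracy(fail_a: Sequence[bool], fail_b: Sequence[bool]) -> float:
    """Accuracy-based behavioural distance: fraction of mutants on which the
    two tests disagree (0 = identical outcomes, 1 = always different)."""
    agree = sum(a == b for a, b in zip(fail_a, fail_b))
    return 1.0 - agree / len(fail_a)

def b_div_mcc(fail_a: Sequence[bool], fail_b: Sequence[bool]) -> float:
    """MCC-based behavioural distance: treat test A's outcomes as 'predictions'
    of test B's outcomes, compute the Matthews correlation coefficient, and
    rescale it from [-1, 1] to a distance in [0, 1] (an assumed normalisation)."""
    tp = sum(a and b for a, b in zip(fail_a, fail_b))              # both fail
    tn = sum((not a) and (not b) for a, b in zip(fail_a, fail_b))  # both pass
    fp = sum(a and (not b) for a, b in zip(fail_a, fail_b))        # only A fails
    fn = sum((not a) and b for a, b in zip(fail_a, fail_b))        # only B fails
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return (1.0 - mcc) / 2.0

# Example: outcomes of two tests over five mutants (True = the test kills the mutant).
t_a = [True, False, True, True, False]
t_b = [True, False, False, True, False]
print(b_div_accuracy(t_a, t_b))  # 0.2 -> the tests behave similarly
print(b_div_mcc(t_a, t_b))       # ~0.17 -> outcomes are highly correlated
```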
Our goal in this paper is to describe and then investigate the benefits and drawbacks of using behaviour-based diversity based on mutation testing. Our empirical study applies the approach for test prioritisation, which is one of the more direct use cases for test optimisation [3]. We also contrast different diversity-based techniques in terms of their fault detection capabilities. The research questions we study are:

RQ1: How can we represent and capture behavioural diversity?
RQ2: How can we use mutation testing to calculate behavioural diversity?
RQ3: How do artefact- and behaviour-based diversity differ in terms of effectiveness?
RQ4: What are the trade-offs in applying behavioural diversity in test prioritisation?

Here, we focus on analysing open source projects and use the PIT mutation testing tool to compare fault detection rates between behaviour-based and traditional artefact-based diversity measures. Our results show that behavioural diversity is significantly better at killing mutants regardless of the prioritisation budget, i.e., the size of the selected test set. In particular, larger projects (containing more test cases and test execution data) yield the largest improvements. For instance, in our two larger investigated projects, when given a budget of 30% of tests selected, the difference in the average percentage of mutants killed between the best b-div measure (MCC) and the a-div measure is, respectively, 31% and 19%. But for tighter budgets the difference can be even larger, e.g., in one project, behavioural diversity kills 45% more mutants when selecting only 10% of test cases.
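As a rough illustration of this prioritisation setting, the sketch below greedily orders tests so that each newly selected test maximises its minimum distance to those already chosen, then cuts the ordering at a given budget. The pairwise distance matrix could come from any a-div or b-div measure; the algorithm and names here are illustrative assumptions, not necessarily the exact procedure used in the study.

```python
from typing import List, Sequence

def prioritise_by_diversity(dist: Sequence[Sequence[float]], budget: int) -> List[int]:
    """Greedy farthest-point ordering: start from an arbitrary test, then
    repeatedly pick the test whose minimum distance to the already selected
    tests is largest. Returns the indices of the first `budget` tests."""
    n = len(dist)
    selected = [0]                      # seed with the first test
    remaining = set(range(1, n))
    while remaining and len(selected) < budget:
        best = max(remaining,
                   key=lambda t: min(dist[t][s] for s in selected))
        selected.append(best)
        remaining.remove(best)
    return selected[:budget]

# Example with a toy 4x4 symmetric distance matrix (e.g., b-div values).
d = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.8, 0.7],
    [0.9, 0.8, 0.0, 0.2],
    [0.8, 0.7, 0.2, 0.0],
]
print(prioritise_by_diversity(d, budget=2))  # [0, 2]: two behaviourally dissimilar tests
```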
Here, mutants allow us to run the tests against realistic faults introduced into the code, instead of measuring behavioural diversity in connection to modifications in the tests or in the SUT. Therefore, the results indicate that the use of mutation testing and behavioural diversity is a suitable candidate for prioritising tests where historical information or modification data are not available. The contributions of this paper include:

• Analysis of the differences between artefact-based and behaviour-based diversity on fault detection.

Next, we discuss the background and related work. Section III details our approach, along with the proposal of two distance measures to operate on test execution data through a Test Outcome Matrix (TOM). Our empirical study is described in Section IV, whereas the results and discussion are presented, respectively, in Sections V and VI. Lastly, Section VII contains concluding remarks and future work.

II. BACKGROUND AND RELATED WORK

Generally, there are two main steps in techniques exploiting test diversity. First, we determine which information will be used to calculate diversity. Many different information sources can be used, as outlined by, e.g., the variability of tests model of [5]: test setup, inputs, method calls, execution traces, outputs, etc. The second step is to use a distance function $d$ to measure the diversity of the information of each (or a subset of) test case(s). As an example, consider a test suite $T$ where $t_a, t_b \in T$ are two test cases.

When the diversity measure is normalised, its distance function $d(t_a, t_b)$ returns a value on a continuum between zero and one, where $d(t_a, t_b) = 0$ indicates that the two tests are identical, while $d(t_a, t_b) = 1$ indicates they are completely different. Numerous distance functions $d$ have been used for diversity-based testing, each supporting different types of data. For instance, the Hamming distance was applied on traces of historical failure data based on sequences of method calls from execution traces [12], whereas Jaccard and Levenshtein are applicable when textual information from test specifications is available [13]–[15]. Moreover, the Normalized Compression Distance (NCD) has been used as a generic distance function applicable to most types of information (e.g., full execution traces including test input and output) [5], [8].

Empirical studies have shown promising results when diversity is used to increase fault detection [4], [8], [11], test generation [16] or visualisation of test redundancy [17]. On the one hand, the approaches using diversity tend to be quite general, since distance measures such as NCD can be used for any type of data (e.g., code, requirements, test scripts, execution traces).
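For contrast with the behavioural measures, here is a small sketch of two artefact-based distances of the kind listed above, assuming the artefacts are simply textual test specifications or scripts: a Jaccard distance over token sets and an NCD approximation using zlib as the compressor. The tokenisation and the choice of compressor are assumptions of this sketch; the cited studies may use different ones.

```python
import zlib

def jaccard_distance(text_a: str, text_b: str) -> float:
    """Jaccard distance over the sets of whitespace-separated tokens."""
    a, b = set(text_a.split()), set(text_b.split())
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def ncd(text_a: str, text_b: str) -> float:
    """Normalized Compression Distance using zlib compressed sizes:
    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    c = lambda s: len(zlib.compress(s.encode("utf-8")))
    ca, cb, cab = c(text_a), c(text_b), c(text_a + text_b)
    return (cab - min(ca, cb)) / max(ca, cb)

# Example: three hypothetical test specifications; the first two are similar.
t1 = "login with valid credentials expect dashboard"
t2 = "login with invalid credentials expect error message"
t3 = "export report as pdf and verify file size"
print(jaccard_distance(t1, t2), jaccard_distance(t1, t3))  # smaller vs. larger distance
print(ncd(t1, t2), ncd(t1, t3))
```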