Extreme Mutation Testing in Practice: an Industrial Case Study

arXiv:2103.08480v1 [cs.SE] 15 Mar 2021 n airt opeeddet t ihrasrcinlevel abstraction higher its to by due expensi comprehend introduced less to computationally was easier is and variatio and it This because [3]. testing promising 2016 not seems mutation in Wagner still and of testi Juergens, is mutation Niedermayr, variation Extreme it a [2]. research, [1], is industry in sever in for studied adopted around widely thoroughly being Despite and is. suite decades test a effective how ItlietMtosfrTs n eiblt”(SIT)a (GS-IMTR) Reliability” and Stuttgart. Test of for Methods “Intelligent r,wa hyma npatc,adwihfcoscurrently factors which testin and practice, mutation in traditional mean they over what testing are, mutation extreme htterslsaeaalbefs nuhbtsilmeaning still but s enough fast tests available unit a are how writing in results investigate while testing the than testing that should rather mutation mutation research tests extreme extreme unit Future perform that writing session. when and decoupled performed reports highlig be be coverage should should code code testi cov pseudo-tested in code mutation that with propose extreme conjunction We w in that tools. are tests show unit times writing and analysis supplements practice and We execution in summary. reas shorter noticeable interview findings the mutators, an that and and execut the conclude pseudo-tested, tests include useful are executed results methods how of Our why numbers on are. writ scores, testing opinions they times, mutation an gathered how extreme methods see and of pseudo-tested to t tests 25 to developers addition of unit experienced In analysis tests. two large qualitative unit interviewed 11,000 a a than did more in of o co we suite is testing consisting test that a industry study mutation by semiconductor the case extreme from industrial project and software an testing adop traditional conducted industry mutation foster running we to extreme addressed that, be how For to need at discus that and look challenges practices t development We fo software what it, current times discuss practice. impacts analysis notice and in and testing not execution mutation mean the would extreme compare and suite we traditional pseudo-t paper, test called this are the calle methods In functionality These and variation coverage. their having removed, l despite where promising computationally entirely methods a is be identifies years, that It recent emerged expensive. testing In mutation suites. extreme test of hsrsac a upre yAvneta ato h Gradu the of part as Advantest by supported was research This uaintsigi otaetsigtcnqet measure to technique testing software a is testing Mutation ewn ofidothwbgtetm mrvmnsof improvements time the big how out find to want We Abstract Mtto etn sue oeaut h effectiveness the evaluate to used is testing —Mutation xrm uaintsigi practice: in testing mutation Extreme nttt fSfwr Engineering Software of Institute [email protected] .I I. nvriyo Stuttgart of University NTRODUCTION ttgr,Germany Stuttgart, akBetka Maik nidsra aestudy case industrial An h University the t t School ate open s ested. vered erage ful. tion. hted hat, uch hey can ons ion ess ng ng ve ell to al . d g d n e r f .Eteemtto testing mutation Extreme [2]. B. optimizi [1], by one towards expressed score typically mutation is the testing mutation of goal .Mtto testing Mutation A. extreme of findings the particular. in and about testing practices developers mutation testing experienced software testing two semiconduct their mutation interviewed the from both and project performed software industry large we a in where techniques indu study an case conducted we trial Therefore, adoption. industry prevent uat aet uvv.Freape o rmtv data primitive a for example, For survive. multi to often have values, categorized return is mutants with method methods the (void- method. For survives, whole pseudo-tested. value mutant the resulting return of body the no the If empties with mutator methods the methods), For [4]. pseudo-tested. 10.1% was as methods mutated of number method total pseudo-tested the of proportion to been median surprisin have the projects still, are and open-source methods analyzed and 19 study, Pseudo-tested removed, another [3]. In be fail. common. can will level. test functionality no method whole a their on mu- find where to mutations extreme is level, these goal The instruction performs an testing on tation performed usually are yehnigo rtn e et okl h uvvn non- surviving the gener called all the to is kill mutants mutants killed non-equivalent to of tests ratio called The new mutants. are writing equivalent mutants or enhancing These by behavior. t different mutants mutants a equivalent have i.e., not mutants, do equivalent is syntactically mutant produce the that said oeo h otaeudrts r nrdcdt change called to are called introduced versions are are software changes of altered test types respective under The software behavior. its the of code en httets ut sntal odtc h mutant. the detect mutant to the th able that fails, not test said is no is suite If it test re-run. Thus, is the suite that test means given the terwards, ncnrs otaiinlmtto etn,weemutatio where testing, mutation traditional to contrast In o uaintsig ml otoldcagsi h sourc the in changes controlled small testing, mutation For atclrmtnshv osriet aeoieamethod a categorize to survive to have mutants Particular [email protected] nttt fSfwr Engineering Software of Institute I F II. nvriyo Stuttgart of University ttgr,Germany Stuttgart, UDTOSADRLTDWORK RELATED AND OUNDATIONS tfnWagner Stefan h oli osrnte h etsuite test the strengthen to is goal The . suotse methods pseudo-tested killed oesfwr hne may changes software Some . survived uainscore mutation hc r methods are which , tews,i is it Otherwise, . mutators mutants hs the Thus, . The . Af- . ated gly hat ple ng ns as or s- is e s type like boolean, two mutants have to survive. One mutator different properties. Afterwards, we killed the extreme mutants replaces the method-body with a single return statement that that caused the methods to be pseudo-tested, where possible, returns true and one that only returns false. If both by enhancing or writing new tests and noted the reason why resulting mutants survive, the method is categorized as pseudo- the method was pseudo-tested. tested. If only one survives, the method is categorized as C. Interview with developers partially-tested. Analogously, other mutators that replace the whole method-body with default return values for other data The second part of this case study was a semi-structured types like integers (e.g. 0 and 1) or strings (e.g. "" and "A") interview. We interviewed two developers with more than are chosen such that, if the mutants survive, the method can seven years of development experience who are familiar with be categorized as pseudo-tested. Complex data types derived the software under test. None of them used mutation testing from classes, like objects, can be simply set to null. The beforehand and only one developer heard of it at university. choice which default values are used depends on the mutation We structured the interview into four phases. In the first phase, testing tool used [3], [4]. we asked several questions about the developers’ roles in Compared to traditional mutation testing, extreme mutation the company and their experiences with unit testing, code testing has the advantage of generating far fewer mutants as coverage, and mutation testing. Afterwards, we explained the number of methods is significantly lower than the number traditional and extreme mutation testing and showed the output of instructions that can be mutated. This reduces both: runtime of the PIT tool. In the third phase, we opened the software and time to analyze the mutants. On the other hand, extreme project in an integrated development environment (IDE) and mutation testing is not as fine-grained as traditional mutation presented pseudo-tested methods to depict the process of how testing to find weaknesses in the test suite. to analyze and strengthen the test suite. For this session, four classes with eight pseudo-tested methods to discuss were III. METHOD pre-selected. Five methods had 100% line coverage and the A. Research questions remaining three had more or equal to 75% line coverage. During the enhancement of the test suite, the developers were RQ1: Are the improved execution and analysis times of asked several questions regarding the relevance of the findings extreme mutation testing relevant in practice? and at which stage in development the findings would be most RQ2: How do extreme mutation testing results impact the likely fixed. We also asked whether the developers would have established practice of writing unit tests with code coverage? written more or better tests if pseudo-tested lines would be RQ3: Which factors prevent extreme mutation testing from shown as “uncovered” in code coverage reports. The last phase industry adoption? was an open discussion.

B. Mutation testing IV. RESULTS In the first part of this case study, we have run traditional A. Mutation testing results and extreme mutation testing for a software project that is Table I shows the results of mutation testing with PIT used in the semiconductor testing industry. The software is for 11,424 unit tests that normally finish in 12 seconds. For tested by more than 11,000 unit tests that call about 2,000 traditional mutation testing, we used two sets of mutators that methods (about 12,500 lines of code). The software project PIT provides. The “default” set consists of seven mutators, is written in Java and consists of multiple smaller projects whereas the “all” set consists of 66 mutators. All executions of different age, authors, and coding styles. For traditional were performed on the same machine. 1 mutation testing, we used PIT [5] with its default mutation Table II shows the total number of methods and the number engine Gregor as a mutation testing tool (version 1.4.10). For of pseudo-tested methods as well as their proportion. It also extreme mutation testing, we used the same tool but with the shows the lines that are covered, i.e., executed, by the test 2 Descartes engine [6] (version 1.2.6). In addition to that, we suite, and the total line counts. In the analyzed software also measured code coverage with the Java code coverage project, 291 (14%) of all 2,041 methods are pseudo-tested. 3 library JaCoCo (version 0.8.6) and counted the total lines The 291 pseudo-tested methods consist of 1,129 lines of code of code for all methods called by the test suite. where 835 lines have coverage. After running mutation tests with different mutators and We analyzed 25 methods in more depth and found the measuring the code coverage, we additionally labeled the following three reasons why methods are pseudo-tested: methods by their access modifiers (public, private, etc.) by • Weak tests with no assertions (8) writing a Python script. We then manually selected 25 pseudo- • Incomplete tests (3) tested methods according to the collected data, namely: cov- • Side-effect methods (14) erage, number of lines, access modifier, package-, class-, We found eight tests that had no assertions, which caused them and method-name to have a high diversity of methods with to be pseudo-tested. For example, a class that handled network 1http://pitest.org/ connections was tested by opening and closing connections, 2https://github.com/STAMP-project/pitest-descartes but the connections were never verified to work by, e.g., count- 3https://www.jacoco.org/jacoco/ ing the number of open sessions before and after closing. TABLE I MUTATION TESTING RESULTS

Type Score Killed / Total Survived Mutators Executed Tests Time Extreme 85% 2,297 / 2,706 409 19 597,877 13 min Traditional (default) 73% 5,813 / 7,989 2,176 7 4,228,169 37 min Traditional (all) 70% 55,515 / 79,350 23,835 66 34,391,374 4 h 1 min

Instead, these tests were designed in a way that they pass if no depict that level of verification. Code coverage is not seen as failure occurs, which means that no functionality is checked an appropriate measure to judge how bug-free the code is. by verifying any output. Some of these tests could be fixed 3) Relevance of pseudo-tested methods: The developers by simply capturing the output and adding assert statements. see remediating pseudo-tested methods as relevant when they Other tests, like the example that concerns the connections, want to improve regression testing and to verify that the code have required to add new methods to the class under test to works as expected. They do not see remediating pseudo-tested get the information like the session count in addition to adding methods as relevant to write clean code. assertions to the test. 4) Pseudo-tested methods in the development process: We also found three incomplete tests. These are tests that When asking the developers at which stage in development are well designed with strong assertions but missed to check the knowledge about pseudo-tested methods would most likely certain properties of a class. For example, a class that creates result in improving the test suite, both developers answered at a table was tested whether it writes the correct data rows, but the time when writing unit tests. The best point in time would the table header was not tested, despite being called. This was be while working with an IDE or when invoking the build on easily fixed by adding another assertion for the table header. the command line because the developers who write the unit The remaining 14 methods are methods that are called tests have the expert knowledge and could quickly add new during testing but are not subject to be tested and do not tests. The developers stated that the information would still have an impact on the test result. Because of that, we named be useful when a continuous integration pipeline produces a the last category side-effects. Typical side-effect methods are, report, but it would strongly depend on the available time and e.g., custom methods that are responsible for logging or priorities whether they would fix the pseudo-tested method. methods that store meta-data that is never used or verified They would probably not remediate pseudo-tested methods by the test suite elsewhere. when spotted during a code review when merging code changes or when receiving the results from a separate quality B. Interview results assurance team because they usually have other priorities at these stages. 1) Main reasons to write unit tests: In general, the develop- 5) Acting upon pseudo-tested methods: The developers ers write unit tests to verify that code works as expected, future would have written more or better tests if pseudo-tested code changes do not break anything, and to write clean code methods were highlighted in code coverage reports. However, by using, for example, the test-driven software development the developers stated one exception where they probably would approach. None of the developers write unit tests because of not have enhanced the test suite. This was the case for a an existing 80% line-coverage policy in the company. subset of unit tests that tested classes of a particular package 2) Confidence in code coverage: The developers mainly use but found pseudo-tested methods of classes of a completely code coverage to spot regions in the code that have not been different package that was not the target of the unit tests. tested so far. They have some confidence that code coverage is V. EVALUATION an accurate measure to judge the quality of the test suite when using it for regression testing but are aware of the caveats. A. Answer to RQ1 – Relevance of execution and analysis time Although one of the main reasons to write unit tests is to Traditional mutation testing with seven default mutators verify that the code works as expected, the developers only takes almost three times as long (37 min.) as extreme mutation have very little confidence that code coverage can accurately testing (13 min.), despite having only about a third of the number of mutators (see table I). In practice, the execution time requirements strongly depend on in which phase of the TABLE II development process mutation testing is used. When analyzing METHODS, LINES, AND COVERAGE mutation testing results in a separate session, none of the Measure Pseudo-tested Total Proportion execution times would actually be troublesome because muta- Methods 291 2,041 14% tion testing would be performed in the background and later Lines (covered) 835 11,189 7% analyzed. However, when trying to perform mutation testing Lines (total) 1,129 12,572 9% during the development of unit tests, as the interview results suggest (see section IV-B4), all of the execution times are too long and thus impractical. This may be changed when not (see section IV-A). For pseudo-tested side-effect methods, performing mutation testing for the whole test suite. For that, reflecting them in the coverage reports would be sufficient. extreme mutation testing is more promising due to its lower VI. THREATS TO VALIDITY time requirements. The presented numbers rely on the correctness of the Table I shows that the coarse-grained approach of extreme mutation testing tool PIT. Some tests of the software project mutation testing results in having only a fifth of surviving are designed to test parallel execution features. Other tests rely mutants (409) compared to the traditional approach (2,176). on static classes and their fields and depend on each other, This number gets further lowered to 291 when working on a which influences the test result when executed in parallel or method-level to check which methods are pseudo-tested (see split into independent chunks, which is what PIT does. Thus, table II). These numbers are more relevant in practice. If we for PIT, we set the number of execution threads to one and hypothetically assume that a pseudo-tested method can be tried to remove parallel tests where possible. Nonetheless, we remediated in only five minutes, then even remediating 291 still found nine methods that belong to a class that included pseudo-tested methods would take about 24 hours. Applying “parallel” as part of its name. However, we ran the extreme the same calculation to all 2,176 mutants of the traditional and traditional mutation tests with default mutators both three approach would result in about seven days of analysis. Thus, times and always got the same results. we conclude that the improved execution and analysis times Further, we only interviewed two experienced developers are relevant and noticeable in practice in terms of execution who mostly agreed in their opinions. This does not necessarily time and workload for the developers. mean that the results are well generalizable and that other B. Answer to RQ2 – Impact on established testing practices developers would rate the relevance of pseudo-tested methods or how to use extreme mutation testing the same way they did. Extreme mutation testing supplements writing unit tests in conjunction with code coverage. Unit tests are mainly written VII. CONCLUSIONS to verify that the code works as expected and serve as regres- We have shown that extreme mutation testing is supplemen- sion tests for future code changes (see sections IV-B1, IV-B2). tal to current software testing practices. It can help to write Pseudo-tested methods are seen as relevant for improving both better unit tests and addresses some shortcomings that code (see section IV-B3). Line coverage alone does not accurately coverage has. Compared to the traditional approach, the faster depict how effective a test suite is. We found 1,129 lines execution time is well noticeable in practice and provides a of code (see table II) where we could easily introduce new good starting point for further research. For result presentation, bugs that would remain unnoticed by the given test suite. In we consider it to be best integrated into existing coverage fact, table II shows that when counting pseudo-tested lines as reports because it is easy to implement and it is a format “uncovered”, the total line coverage would have to be adjusted that is already known by the developers. Future research from 89% to 82%. should investigate how to apply extreme mutation testing while writing unit tests such that the results are available fast enough C. Answer to RQ3 – Industry adoption but still meaningful. Lastly, future research should also analyze Despite the mentioned advantages in terms of execution the severity and number of issues that remain undiscovered by and analysis time (see section V-A) and its positive impact the extreme approach, which trades accuracy for speed gains. on established software testing practices (see section V-B), REFERENCES extreme mutation testing is comparatively young with only a few available implementations [4] and thus, is not widely [1] Y. Jia and M. Harman, “An analysis and survey of the development of mutation testing,” IEEE transactions on software engineering, vol. 37, known yet. We noticed that there are two areas to improve to no. 5, pp. 649–678, 2010. foster industry adoption: tooling and usage. [2] M. Papadakis, M. Kintis, J. Zhang, Y. Jia, Y. Le Traon, and M. Harman, For tooling, we found that the results of extreme mutation “Mutation testing advances: an analysis and survey,” in Advances in Computers. Elsevier, 2019, vol. 112, pp. 275–378. testing would be best represented in code coverage reports [3] R. Niedermayr, E. Juergens, and S. Wagner, “Will my tests tell me by highlighting the pseudo-tested code (see section IV-B5). if i break this code?” in 2016 IEEE/ACM International Workshop on This is currently not supported by the Descartes engine. The Continuous Software Evolution and Delivery (CSED). IEEE, 2016, pp. 23–29. advantage of that approach is that developers are already [4] R. Niedermayr, “Evaluation and improvement of automated software familiar with coverage reports, and it is easy to implement. test suites,” Ph.D. dissertation, University of Stuttgart, 2019. [Online]. For usage, further research should investigate how extreme Available: http://dx.doi.org/10.18419/opus-10640 [5] H. Coles, T. Laurent, C. Henard, M. Papadakis, and A. Ventresque, “Pit: mutation testing has to be improved such that it can be run a practical mutation testing tool for java,” in Proceedings of the 25th while writing unit tests. This is the point in time where International Symposium on Software Testing and Analysis, 2016, pp. issues are most likely fixed (see section IV-B4). Also, we 449–452. [6] O. L. Vera-Pérez, M. Monperrus, and B. Baudry, “Descartes: a pitest observed that the mutation score is irrelevant when using engine to detect pseudo-tested methods: tool demonstration,” in 2018 33rd extreme mutation testing in practice because some methods IEEE/ACM International Conference on Automated Software Engineering are not worth fixing (see section IV-B5). This was also found (ASE). IEEE, 2018, pp. 908–911. [7] O. L. Vera-Pérez, B. Danglot, M. Monperrus, and B. Baudry, “A in another study [7]. It would probably be enough to only comprehensive study of pseudo-tested methods,” Empirical Software enhance weak tests with no assertions and incomplete tests Engineering, vol. 24, no. 3, pp. 1195–1225, 2019.