Freeing Testers from Polluting Test Objectives
Total Page:16
File Type:pdf, Size:1020Kb
Freeing Testers from Polluting Test Objectives Michaël Marcozzi, Sébastien Bardin, Mike Papadakis Nikolai Kosmatov, Virgile Prevosto, Interdisciplinary Centre for Loïc Correnson Security, Reliability and Trust CEA, LIST, Software Reliability Laboratory University of Luxembourg Gif-sur-Yvette, France Luxembourg, Luxembourg fi[email protected] [email protected] ABSTRACT Problem. White-box testing criteria are purely syntactic and thus Testing is the primary approach for detecting software defects. A totally blind to the semantic of the program under analysis. As a major challenge faced by testers lies in crafting efficient test suites, consequence, many of the test objectives that they define may turn able to detect a maximum number of bugs with manageable ef- out to be in practice either fort. To do so, they rely on coverage criteria, which define some (a) infeasible: no input can satisfy them, such as dead code or precise test objectives to be covered. However, many common cri- equivalent mutants [2], or teria specify a significant number of objectives that occur to be (b) duplicate versions of other objectives: satisfied by exactly the infeasible or redundant in practice, like covering dead code or se- same inputs, such as semantically equal mutants [53], or mantically equal mutants. Such objectives are well-known to be (c) subsumed by another objective: satisfied by every input cover- harmful to the design of test suites, impacting both the efficiency ing the other objective [1, 37, 52], such as validity of a condi- and precision of testers’ effort. This work introduces a sound and tion logically implied by another one in condition coverage. scalable formal technique able to prune out a significant part of the infeasible and redundant objectives produced by a large panel of We refer to these three situations as polluting test objectives, which white-box criteria. In a nutshell, we reduce this challenging prob- are well-known to be harmful to the testing task [52, 53, 62, 63, 65] lem to proving the validity of logical assertions in the code un- for two main reasons: der test. This technique is implemented in a tool that relies on • While (early) software testing theory [66] requires all the crite- weakest-precondition calculus and SMT solving for proving the rion objectives to be covered, this seldom reflects the actual prac- assertions. The tool is built on top of the Frama-C verification plat- tice, which usually relies on test suites covering only a part of form, which we carefully tune for our specific scalability needs. them [23]. This is due to the difficulty of generating the appro- The experiments reveal that the tool can prune out up to 27% of priate test inputs, but also to infeasible test objectives. Indeed, test objectives in a program and scale to applications of 200K lines testers often cannot know whether they fail to cover them be- of code. cause their test suites are weak or because they are infeasible, possibly wasting a significant amount of their test budget try- KEYWORDS ing to satisfy them. Software Testing, White-Box Testing, Coverage Criteria, Polluting • As full objective coverage is rarely reached in practice, testers Test Objectives, Formal Methods rely on the ratio of covered objectives to measure the strength of their test suites. However, the working assumption of this prac- 1 INTRODUCTION tice is that all objectives are of equal value. Early testing research Context. Heretofore, software testing is the primary method for demonstrated that this is not true [1, 13, 17, 52], as duplication detecting software defects [2, 42, 44, 66]. It is performed by exe- and subsumption can make a large number of feasible test objec- arXiv:1708.08765v1 [cs.SE] 29 Aug 2017 cuting the programs under analysis with some inputs, in the aim tives redundant. Such coverable redundant objectives may arti- of finding some unintended (defective) behavior. As the number ficially deflate or inflate the coverage ratio. This skews the mea- of possible test inputs is typically enormous, testers do limit their surement, which may misestimate test thoroughness and fail to tests in practice to a manageable but carefully crafted set of in- evaluate correctly the remaining cost to full coverage. puts, called a test suite. To build such suites, they rely on so-called coverage criteria, also known as adequacy or test criteria, which Goal and Challenges. While detecting all polluting test objec- define the objectives of testing [2, 66]. In particular, many white- tives is undecidable [1, 52], our goal is to provide a technique capa- box criteria have been proposed so far, where the test objectives ble to identify a significant part of them. This is a challenging task are syntactic elements of the code that should be covered by run- as it requires to perform complex program analyses over large sets ning the test suite. For example, the condition coverage criterion of objectives produced by various criteria. Moreover, duplication imposes to cover all possible outcomes of the boolean conditions and subsumption should be checked for each pair of objectives, a appearing in program decisions, while the mutant coverage crite- priori putting a quadratic penalty over the necessary analyses. rion requires to differentiate the program from a set of its syntactic Although many studies have demonstrated the harmful effects variants. Testers need then to design their suite of inputs to cover of polluting objectives, to date there is no scalable technique to the corresponding test objectives, such as — for the two aforemen- discard them. Most related research works (see Tables 1, 2 and Sec- tioned cases — condition outcomes or mutants to kill. tion 8) focus on the equivalent mutant problem, i.e. the particular arXiv.org, August, 2017 Marcozzi et al. instance of infeasible test objectives for the mutant coverage cri- 1 // given three sides x,y,z of a valid triangle, computes terion. These operate either in dynamic mode, i.e. mutant classifi- 2 // its type as: 0 scalene, 1 isosceles, 2 equilateral 3 int type = 0; cation [57, 58], or in static mode, i.e. Trivial Compiler Equivalence 4 // l1: x == y && y == z; (DC) l2: x != y || y != z; (DC) (TCE) [53]. Unfortunately, the dynamic methods are unsound and 5 //l3: x==y;(CC) l4: x!=y;(CC) 6 if( x == y && y == z ){ produce many false positives [51, 58], while the static one deals 7 type = type + 1; only with strong mutation and cannot detect subsumed mutants 8 } (whereas it handles duplicates in addition to infeasible ones). The 9 // l5: x==y || y==z || x==z; (DC) l6: x!=y && y!=z && x!=z; (DC) 10 //l7:x==y;(CC) l8:x!=y;(CC) LUncov technique [8] combines two static analyses to prune out 11 // l9: x!=y && y==z && x==z; (MCC) l10: x==y && y!=z && x==z;(MCC) infeasible objectives from a panel of white-box criteria in a generic 12 if( x == y || y == z || x == z ){ way, but faces scalability issues. 13 // l11: type + 1 != type + 2; (WM) l12: type + 1 != type; (WM) 14 // l13: type + 1 != -type + 1; (WM) l14: type + 1 != 1; (WM) 15 type = type + 1; } KindofPollution Criterion 16 Sound Scale Inf. Dupl. Subs. Genericity Figure 1: Toy example of a C program with test objectives Mutant class. [58] × X X × × × TCE [53] X X X X × × • An open-source prototype tool LClean (Section 5) enacting the pro- LUncov [8] X × X × × X posed approach. It relies on an industrial-proof formal verifica- X XXXX X LClean (this work) tion platform, which we tune for the specific scalability needs of Table 1: Comparison with closest research techniques LClean, yielding a robust multi-core assertion-proving kernel. • A thorough evaluation (Sections 6 and 7) assessing Analyses Scope Acuteness (a) the scalability and detection power of LClean for three types built-in compiler TCE [53] interprocedural + of polluting objectives and four test criteria – pruning out up optimizations to 27% of the objectives in C files up to 200K lines, value analysis and LUncov [8] interprocedural ++ weakest-precondition (b) the impact of using a multi-core kernel and tailored verifica- LClean (this work) weakest-precondition local function + tion libraries on the required computation time (yielding a Table 2: Static analyses available in closest techniques speedup of approximately 45×), and (c) that, compared to the existing methods, LClean prunes out Proposal. Our intent is to provide a unified, sound and scalable so- four times more objectives than LUncov [8] in about half as lution to prune out a significant part of polluting objectives, includ- much time, it can be one order of magnitude faster than (un- ing infeasible but also duplicate and subsumed ones, while handling sound) dynamic identification of (likely) polluting objectives, a large panel of white-box criteria in a generic manner. To achieve and it detects half more duplicate objectives than TCE, while this, we propose reducing the problem of finding polluting objec- being complementary to it. tives for a wide range of criteria to the problem of proving the Potential Impact. Infeasible test objectives have been recognized validity of logical assertions inside the code under test. These as- as a main cost factor of the testing process [62, 63, 65]. By pruning sertions can then be verified using known verification techniques. out a significant number of them with LClean, testers could rein- Our approach, called LClean, is the first one that scales to pro- vest the gained cost in targeting full coverage of the remaining ob- grams composed of 200K lines of C code, while handling all types jectives. This would make testing more efficient, as most faults are of polluting test requirements. It is also generic, in the sense that found within high levels of coverage [22].