Building Very Small Test Suites (Using SNAP)

1 Building Very Small Test Suites (using SNAP) Jianfeng Chen, Xipeng Shen, Senior Member, IEEE, and Tim Menzies, Fellow, IEEE Abstract—Software is now so large and complex that additional architecture is needed to guide theorem provers as they try to generate test suites. For example, the SNAP test suite generator (introduced in this paper) combines the Z3 theorem prover with the following tactic: sample around the average values seen in a few randomly selected valid tests. This tactic is remarkably effective. For 27 real-world programs with up to half a million variables, SNAP found test suites which were 10 to 750 smaller times than those found by the prior state-of-the-art. Also, SNAP ran orders of magnitude faster and (unlike prior work) generated 100% valid tests. Index Terms—SAT solvers, test suite generation, mutation F 1 INTRODUCTION When changing software, it is useful to test if the new indication that there is much repeated structure in the formula work damages old functionality. For this reason, testing and (as well as in the code). The starting point for this paper was the re-testing code is widely applied in both open-source projects conjecture that, when repeated structures exist, the common and closed-source projects [1], [2], [3]. But generating test suites settings seen in a small M sample may also be the common can be difficult since good test suite generators must struggle settings in a large N M sample. If so, the space of tests can À to achieve five goals: be explored using the following “SNAP tactic”: 1) Terminate quickly; Sample around the average values seen in a few 2) Scale to large programs; randomly selected valid tests. 3) Return small test suites that contain valid tests; To evaluate this tactic, this paper combines Eq. 1 with the 4) Cover most program branches; Z3 theorem prover with the SNAP tactic. The results of that 5) Avoid excessive redundancy in the tests. combined tool were then compared to the more extensive For example, the QuickSampler system [4] presented at ICSE’18 search offered by QuickSampler. In that comparison, we asked: runs faster than prior work [5], [6] for larger programs and RQ1: How much faster is the SNAP tactic? finds test suites with more valid tests [4]. QuickSampler uses the heuristic that “valid tests can be built by combining other Answer #1: SNAP terminated 10 to 3000 times faster than valid tests’; e.g. a new test can be built from valid tests a,b,c QuickSampler (median to max). using (“exclusive or”): © d c (a b) (1) Æ © © RQ2: Does the SNAP tactic find fewer test cases? This heuristic is useful since “exclusive or” is faster than, say, running a theorem prover. But heuristics are, by definition, just Answer #2: Test cases from SNAP were 10 to 750 times guesses (albeit, good ones). Therefore it is hardly surprising that smaller than those from QuickSampler (median to max). heuristics like Eq. 1 can introduce new problems. For example: Redundancy: In one sample of 10 million tests generated ² RQ3: How “good” are the tests found via the SNAP tactic? from the blasted_case47, a benchmark in this paper. Quick- For “credibility” defined as the percent of valid tests; and Sampler only found 26,000 unique valid solutions. That is, arXiv:1905.05358v3 [cs.SE] 15 Jul 2020 “diversity” defined as percent code branches covered: 99% of the tests were repeating other tests. Credibility: A test suite becomes more credible as the ² number of valid tests increases. As shown below, 30% (on Answer #3: Usually, SNAP’s tests are much more credible average) of tests found by QuickSampler are not credible. and very nearly as diverse as those from QuickSampler. Minimality: Our SNAP tool finds test which are 10 to 750 ² smaller than those found by QuickSampler. The rest of this paper is structured as follows. The next This paper introduces SNAP. We show that SNAP runs orders section discusses why we want to shrink test suites and how of magnitude faster than QuickSampler and its tests are more to do that using SAT solvers. Then, after describing SNAP, we credible (since 100% of its tests are valid). run experiments on the same case studies used in the ICSE’18 SNAP was designed as follows. QuickSampler generates tests QuickSampler paper. Finally, we discuss threats to validity. by converting code into a logical formula (discussed later in Note that an open-source version of SNAP (and all SE this paper). The presence of many repeated tests is hence an models used in this paper) is freely available on-line1. E-mail: [email protected], [email protected], [email protected] ² J. Chen is at Facebook. X. Shen and T. Menzies are at NC State University. ² Manuscript received March 23, 2020; revised XXX. 1. http://github.com/ai-se/SatSpaceExpo 2 2 BACKGROUND Many software problems can be transformed into “SAT”; i.e. a propositional satisfiability problem. For example, given 2.1 Why Reduce Test Suite Size? a script of C programming, one can translate it into CNF This paper evaluates the SNAP test suite generator using the formulas, as done in Fig. 1. Symbolic/dynamic execution tech- five goals described above: i.e. runtime, scalability, redundancy, niques [19], [20] extract the possible execution branches of a credibility and minimality. But before that, we offer general procedural program. Each branch is a conjunction of condi- notes on the important of test suite minimization. tions B C C ... so the whole program can be summarized i Æ x ^ y ^ When developers extend a code base, test suites let them as the disjunction B B .... Using deMorgan’s rules these i _ j _ check that their new work does not harm old functionality. clauses can be converted to conjunctive normal form (CNF) Such tests mean that developers can find and fix more faults, where the inputs to the program are the variables in the CNF: sooner. Hence, better tests enable faster code modification [1], Disjunctions to conjunctions: P Q ( P Q) ² _ ´ : ^ : [2], [15]. Conjunctions to disjunctions: (P Q) P Q. ² : ^ ´ : _: By minimizing the number of tests executed, we also min- SAT has applications in many areas, including test case gener- imize the developer effort required to specify the expected ation. Modern constraint solvers (i.e. SAT-solvers) are based on behavior associated with each test execution [16]. If testing some variant of the Davis-Putnam-Logemann-Loveland (DPLL) for (e.g.) core dumps, then specifying off-nominal behavior procedure [21]. DPLL searches systematically for a satisfying is trivial (just look for a core dump file). But in many other assignment, applying first unit propagation and pure literal cases, specifying what should (and should not) be seen when elimination as often as possible. Then, DPLL branches on the a test executes is a time-consuming task requiring a deep truth value of a variable, and recurses. understanding of the purpose and context of the software. Many methods have been explored to make these tools Smaller tests suites are also cheaper to run. The industrial practical for large problems. For example Arito et al. pro- experience is that excessive testing can be expensive and time posed a framework to transform the test suite minimization consuming, especially if run after each modification to soft- problem (TSMP) in regression testing into a constrained SAT ware [16]. Such high-frequency testing can consumes as much problem [22]. This transformation is done by modeling TSMP as 80 percent of the software maintenance effort [16], [17]. instances as a set of Pseudo-Boolean constraints that are Many current organizations spend tens of millions of dollars later translated to SAT instances. TSMP has two objectives: 1) each year (or more) on cloud-based facilities to run large tests minimizing the testing cost and 2) maximizing the program suites [16]. The fewer the tests those organizations have to run, coverage. To start with, a set of test cases T {t ,t ,t ,...} as Æ 1 2 3 the cheaper their testing. well as their running time cost {c1,c2,...} is defined. Here, ti is Smaller test suites are also faster to execute. If minimal a binary signal indicating if the test case i should be tested and and effective test suites can be generated then, within a fixed the information about whether test case ti covers some element time limit, more faults can be found and fixed [16]. Faster test in the program e is stored as the binary matrix M [m ]. j Æ i j execution means that software teams can certify a new release, To translate the TSMP into constrained problems, we use quicker. This is particularly important for organizations using pseudo-boolean constraints: Pn c t B and Pm e P i 1 i i · j 1 j ¸ continuous integration since faster test suites mean they can where B Z is the maximum allowedÆ cost and P {1,2,...,Æ m} is 2 2 make more releases each day – which means that clients can the minimum coverage level. Having the pseudo-boolean con- sooner receive new (or fixed) features [18]. straints, Een et al. [23] provides three techniques to translate pseudo-boolean constraints (linear constraints over boolean variables) into clauses that can be handled by a SAT-solver. 2.2 Theorem Proving in Software Engineering Combinatorial testing covers interactions of parameters in As argued in this section, the test suite generation problem is the system under test. A well-chosen sampling mechanism can so complex that extra machinery is needed to guide theorem reduce the cost of software and system testing by reducing the provers.

Building Very Small Test Suites (Using SNAP)

Autogr: Automated Geo-Replication with Fast System Performance and Preserved Application Semantics

Functional SMT Solving: a New Interface for Programmers

Approximations and Abstractions for Reasoning About Machine Arithmetic

Automated Theorem Proving

Learning-Aided Program Synthesis and Verification

Faster and Better Simple Temporal Problems

(12) United States Patent (10) Patent No.: US 8,341,602 B2 Hawblitzel Et Al

Principles of Security and Trust

LNCS 8141 Molecular Programming 19Th International Conference, DNA 19 Tempe, AZ, USA, September 2013 Proceedings

Applied Type System: an Approach to Practical Programming with Theorem-Proving

Homework: Verification

COIN Attacks: on Insecurity of Enclave Untrusted Interfaces In