
Building Very Small Test Suites (using SNAP)

Jianfeng Chen, Xipeng Shen, Senior Member, IEEE, and Tim Menzies, Fellow, IEEE

Abstract—Software is now so large and complex that additional architecture is needed to guide theorem provers as they try to generate test suites. For example, the SNAP test suite generator (introduced in this paper) combines the Z3 theorem prover with the following tactic: sample around the average values seen in a few randomly selected valid tests. This tactic is remarkably effective. For 27 real-world programs with up to half a million variables, SNAP found test suites which were 10 to 750 times smaller than those found by the prior state-of-the-art. Also, SNAP ran orders of magnitude faster and (unlike prior work) generated 100% valid tests.

Index Terms—SAT solvers, test suite generation, mutation

1 INTRODUCTION

When changing software, it is useful to test if the new work damages old functionality. For this reason, testing and re-testing code is widely applied in both open-source and closed-source projects [1], [2], [3]. But generating test suites can be difficult since good test suite generators must struggle to achieve five goals:

1) Terminate quickly;
2) Scale to large programs;
3) Return small test suites that contain valid tests;
4) Cover most program branches;
5) Avoid excessive redundancy in the tests.

For example, the QuickSampler system [4] presented at ICSE'18 runs faster than prior work [5], [6] for larger programs and finds test suites with more valid tests [4]. QuickSampler uses the heuristic that "valid tests can be built by combining other valid tests"; e.g. a new test d can be built from valid tests a, b, c using ⊕ ("exclusive or"):

d = c ⊕ (a ⊕ b)    (1)

This heuristic is useful since "exclusive or" is faster than, say, running a theorem prover. But heuristics are, by definition, just guesses (albeit, good ones). Therefore it is hardly surprising that heuristics like Eq. 1 can introduce new problems. For example:
• Redundancy: In one sample of 10 million tests generated from blasted_case47 (a benchmark in this paper), QuickSampler only found 26,000 unique valid solutions. That is, 99% of the tests were repeating other tests.
• Credibility: A test suite becomes more credible as the number of valid tests increases. As shown below, 30% (on average) of tests found by QuickSampler are not credible.
• Minimality: Our SNAP tool finds test suites which are 10 to 750 times smaller than those found by QuickSampler.

This paper introduces SNAP. We show that SNAP runs orders of magnitude faster than QuickSampler and its tests are more credible (since 100% of its tests are valid).

SNAP was designed as follows. QuickSampler generates tests by converting code into a logical formula (discussed later in this paper). The presence of many repeated tests is hence an indication that there is much repeated structure in the formula (as well as in the code). The starting point for this paper was the conjecture that, when repeated structures exist, the common settings seen in a small M sample may also be the common settings in a large N ≫ M sample. If so, the space of tests can be explored using the following "SNAP tactic":

Sample around the average values seen in a few randomly selected valid tests.

To evaluate this tactic, this paper combines Eq. 1 and the Z3 theorem prover with the SNAP tactic. The results of that combined tool were then compared to the more extensive search offered by QuickSampler. In that comparison, we asked:

RQ1: How much faster is the SNAP tactic?
Answer #1: SNAP terminated 10 to 3000 times faster than QuickSampler (median to max).

RQ2: Does the SNAP tactic find fewer test cases?
Answer #2: Test cases from SNAP were 10 to 750 times smaller than those from QuickSampler (median to max).

RQ3: How "good" are the tests found via the SNAP tactic? For "credibility" defined as the percent of valid tests, and "diversity" defined as the percent of code branches covered:
Answer #3: Usually, SNAP's tests are much more credible and very nearly as diverse as those from QuickSampler.

The rest of this paper is structured as follows. The next section discusses why we want to shrink test suites and how to do that using SAT solvers. Then, after describing SNAP, we run experiments on the same case studies used in the ICSE'18 QuickSampler paper. Finally, we discuss threats to validity. Note that an open-source version of SNAP (and all SE models used in this paper) is freely available on-line¹.
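To make Eq. 1 concrete, the sketch below (ours, not code from the QuickSampler implementation) shows how a candidate test is proposed by XOR-ing three valid tests represented as bit vectors; the bit values are made-up toy data. The proposal itself is cheap, but nothing guarantees that the result satisfies the underlying CNF.

    # Sketch of Eq. 1: propose a new test d = c XOR (a XOR b) from three valid tests.
    # Tests are bit vectors (one bit per CNF variable); the result is only a guess
    # and may still need checking against the formula.
    def propose(a, b, c):
        return [ci ^ (ai ^ bi) for ai, bi, ci in zip(a, b, c)]

    a = [1, 0, 0, 1, 1, 0, 0, 0]
    b = [1, 0, 1, 0, 1, 0, 1, 0]
    c = [0, 1, 0, 1, 0, 1, 0, 1]
    print(propose(a, b, c))   # candidate test; validity is not guaranteed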

• J. Chen is at Facebook. X. Shen and T. Menzies are at NC State University.
• E-mail: [email protected], [email protected], [email protected]
• Manuscript received March 23, 2020; revised XXX.
1. http://github.com/ai-se/SatSpaceExpo

2 BACKGROUND

2.1 Why Reduce Test Suite Size?

This paper evaluates the SNAP test suite generator using the five goals described above; i.e. runtime, scalability, redundancy, credibility and minimality. But before that, we offer general notes on the importance of test suite minimization.

When developers extend a code base, test suites let them check that their new work does not harm old functionality. Such tests mean that developers can find and fix more faults, sooner. Hence, better tests enable faster code modification [1], [2], [15].

By minimizing the number of tests executed, we also minimize the developer effort required to specify the expected behavior associated with each test execution [16]. If testing for (e.g.) core dumps, then specifying off-nominal behavior is trivial (just look for a core dump file). But in many other cases, specifying what should (and should not) be seen when a test executes is a time-consuming task requiring a deep understanding of the purpose and context of the software.

Smaller test suites are also cheaper to run. The industrial experience is that excessive testing can be expensive and time consuming, especially if run after each modification to software [16]. Such high-frequency testing can consume as much as 80 percent of the software maintenance effort [16], [17]. Many organizations spend tens of millions of dollars each year (or more) on cloud-based facilities to run large test suites [16]. The fewer the tests those organizations have to run, the cheaper their testing.

Smaller test suites are also faster to execute. If minimal and effective test suites can be generated then, within a fixed time limit, more faults can be found and fixed [16]. Faster test execution means that software teams can certify a new release quicker. This is particularly important for organizations using continuous integration since faster test suites mean they can make more releases each day – which means that clients can sooner receive new (or fixed) features [18].

2.2 Theorem Proving in Software Engineering

As argued in this section, the test suite generation problem is so complex that extra machinery is needed to guide theorem provers. The rest of this paper discusses that extra machinery.

Many software problems can be transformed into "SAT"; i.e. a propositional satisfiability problem. For example, given a C program, one can translate it into CNF formulas, as done in Fig. 1. Symbolic/dynamic execution techniques [19], [20] extract the possible execution branches of a procedural program. Each branch is a conjunction of conditions B_i = C_x ∧ C_y ∧ ..., so the whole program can be summarized as the disjunction B_i ∨ B_j ∨ .... Using de Morgan's rules, these clauses can be converted to conjunctive normal form (CNF), where the inputs to the program are the variables in the CNF:
• Disjunctions to conjunctions: P ∨ Q ≡ ¬(¬P ∧ ¬Q);
• Conjunctions to disjunctions: ¬(P ∧ Q) ≡ ¬P ∨ ¬Q.

SAT has applications in many areas, including test case generation. Modern constraint solvers (i.e. SAT-solvers) are based on some variant of the Davis-Putnam-Logemann-Loveland (DPLL) procedure [21]. DPLL searches systematically for a satisfying assignment, applying unit propagation and pure literal elimination as often as possible. Then, DPLL branches on the truth value of a variable, and recurses.

Many methods have been explored to make these tools practical for large problems. For example, Arito et al. proposed a framework to transform the test suite minimization problem (TSMP) in regression testing into a constrained SAT problem [22]. This transformation is done by modeling TSMP instances as a set of pseudo-Boolean constraints that are later translated to SAT instances. TSMP has two objectives: 1) minimizing the testing cost and 2) maximizing the program coverage. To start with, a set of test cases T = {t1, t2, t3, ...} as well as their running time costs {c1, c2, ...} is defined. Here, t_i is a binary signal indicating if test case i should be executed, and the information about whether test case t_i covers some element e_j in the program is stored as the binary matrix M = [m_ij]. To translate the TSMP into constrained problems, we use the pseudo-Boolean constraints ∑_{i=1..n} c_i·t_i ≤ B and ∑_{j=1..m} e_j ≥ P, where B ∈ Z is the maximum allowed cost and P ∈ {1, 2, ..., m} is the minimum coverage level. Given these pseudo-Boolean constraints, Een et al. [23] provide three techniques to translate pseudo-Boolean constraints (linear constraints over Boolean variables) into clauses that can be handled by a SAT-solver. A small sketch of this encoding appears below.

Combinatorial testing covers interactions of parameters in the system under test. A well-chosen sampling mechanism can reduce the cost of software and system testing by reducing the number of test cases to be executed [24].
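Returning to the TSMP encoding above, the following sketch (ours, using the z3-solver Python bindings rather than the tooling of [22], [23]) expresses the cost bound and coverage level as constraints over Boolean test-selection variables; the costs and the coverage matrix are made-up toy values.

    from z3 import Bool, If, Sum, Or, Solver, sat

    costs  = [4, 2, 3]                   # toy running costs c_i
    covers = [[1, 0, 1],                 # toy matrix m_ji: does test i cover element j?
              [0, 1, 1]]
    B, P = 6, 2                          # maximum allowed cost, minimum coverage level

    t = [Bool("t%d" % i) for i in range(len(costs))]   # t_i = 1 if test i is selected
    s = Solver()
    # cost constraint: sum_i c_i * t_i <= B
    s.add(Sum([If(t[i], costs[i], 0) for i in range(len(t))]) <= B)
    # coverage constraint: at least P program elements are covered by selected tests
    e = [If(Or([t[i] for i in range(len(t)) if covers[j][i] == 1]), 1, 0)
         for j in range(len(covers))]
    s.add(Sum(e) >= P)
    if s.check() == sat:
        print(s.model())                 # one feasible selection of tests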

1 int mid(int x, int y, int z) {
2   if (x < y) {
3     if (y < z) return y;
4     else if (x < z) return z;
5     else return x;
6   } else if (x < z) return x;
7   else if (y < z) return z;
8   else return y; }

path 1: [C1: x < y < z]  L2->L3
path 2: [C2: x < z < y]  L2->L3->L4
path 3: [C3: z < x < y]  L2->L3->L4->L5
path 4: [C4: y < x < z]  L2->L6
path 5: [C5: y < z < x]  L2->L6->L7
path 6: [C6: z < y < x]  L2->L6->L7->L8

The code above has six branches, shown as paths 1–6. Each branch is a logical constraint, and the whole program is the disjunction C1 ∨ C2 ∨ C3 ∨ ... ∨ C6. A valid test selects x, y, z such that it satisfies these constraints. Such constraints are converted via SMT conversion tools [7]. By convention, the disjunction ∨C_i is transformed into conjunctive normal form (CNF) C′1 ∧ C′2 ∧ .... A valid assignment to the CNF, i.e. an assignment that fulfills all clauses, corresponds to a test case covering some branch of the code.

Fig. 1. A script of C programming can be translated into CNF (conjunctive normal form).
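As a concrete illustration of Fig. 1 (not code from the paper), the sketch below asks Z3 for an input that exercises branch C1 (x < y < z) of the mid() function; the solver returns one valid test input for that path.

    from z3 import Ints, Solver, sat

    x, y, z = Ints("x y z")
    s = Solver()
    s.add(x < y, y < z)            # branch constraint C1: x < y < z  (path L2->L3)
    if s.check() == sat:
        m = s.model()
        print(m[x], m[y], m[z])    # a concrete test input covering path 1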

Note that not all combinations are valid. For example, MacOS does not support the AMD processor while IE does not support MacOS, etc. All such constraints can be expressed as a feature model [25] or as product lines. Further, such a feature model can be transformed into CNF formulas [26], at which point SAT solvers can compute the valid testing environment combinations. For other applications in this area, see [27].

In summary, in theory, it can be useful to reformulate SE tasks as a SAT task. As Michael Lowry said at a panel at ASE'15:

"It used to be that reduction to SAT proved a problem's intractability. But with the new SAT solvers, that reduction now demonstrates practicality."

However, in practice, general SAT solvers, such as Z3 [28], MathSAT [29], vZ [30] et al., are challenged by the complexity of real-world software models. For example, the largest benchmark for the SAT Competition 2017 [31] had 58,000 variables – which is far smaller than (e.g.) the 300,000 variable problems seen in the recent SE testing literature [4]. Accordingly, the rest of this paper discusses ways to better customize theorem provers in order to handle very large test case generation problems.

2.3 Theorem Provers for Large Problems

As shown in Table 1, much prior research has explored scaling theorem proving for software engineering. One way to tame the theorem proving problem is to simplify or decompose the CNF formulas. A recent example in this arena was GreenTire, proposed by Jia et al. [32]. GreenTire supports constraint reuse based on the logical implication relation among constraints. One advantage of this approach is its efficiency guarantees. Similar to the analytical methods in linear programming, such methods are always applied to a specific class of problem. However, even with the improved theorem prover, such methods may be difficult to adopt in large models. GreenTire was tested on 7 case studies. Each case study corresponded to a small code script with ten lines of code, e.g. the BinTree in [33]. For the larger models, such as those explored in this paper, the following methods might do better.

Another approach, which we will call sampling, is to combine theorem provers like Z3 with stochastic sampling heuristics. For example, given random selections for b, c, Eq. 1 might be used to generate a new test suite without calling a theorem prover. Theorem proving might then be applied to some (small) subset of the newly generated tests, just to assess how well the heuristics are working.

The earliest sampling tools were based on binary decision diagrams (BDDs) [34]. Yuan et al. [8], [10] built a BDD from the input constraint model and then weighted the branches of the vertices in the tree such that a stochastic walk from root to leaf was able to generate samples with the desired distribution. In other work, Iyer proposed a technique named RACE which has been applied in multiple industrial solutions [9]. RACE (a) builds a high-level model to represent the constraints; then (b) implements a branch-and-bound algorithm for sampling diverse solutions. The advantage of RACE is its implementation simplicity. However, RACE, as well as the BDD-based approaches introduced above, returns highly biased (that is, highly non-uniform) samples. For testing, this is not recommended since it means small parts of the code get explored at a much higher frequency than others.

Using the SAT solver WalkSat [35], Wei et al. [11] proposed SampleSAT. SampleSAT combines random walk steps with greedy steps from WalkSat – a method that works well for small models. However, due to the greedy nature of WalkSat, the performance of SampleSAT is highly skewed as the size of the constraint model increases.

For seeking diverse samples, some use universal hashing [36], which offers strong guarantees of uniformity. Meel et al. [14] list key ingredients of the integration of universal hashing and SAT solvers; e.g. guaranteeing uniform solutions to a constraint model. These hashing algorithms can be applied to extremely large models (with near 0.5M variables). More recently, several improved hashing-based techniques have been proposed to balance the scalability of the algorithm as well as diversity (i.e. uniform distribution) requirements. For example, Chakraborty et al. proposed an algorithm named UniGen [13], followed by UniGen2 [6]. UniGen provides strong theoretical guarantees on the uniformity of generated solutions and has been applied to constraint models with hundreds of thousands of variables. However, UniGen suffered from a large computational resource requirement. Later work explored a parallel version of this approach; UniGen2 achieved near linear speedup in the number of CPU cores.

To the best of our knowledge, the state-of-the-art technique for generating test cases using theorem provers is QuickSampler [4].
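The hashing-based samplers discussed above rest on the idea of adding random parity (XOR) constraints that split the solution space into roughly equal cells, then sampling within a cell. The sketch below is our own toy illustration of that idea with the Z3 Python bindings; it is not UniGen, and it omits all of UniGen's uniformity guarantees. The formula, variable count and subset size are arbitrary placeholders.

    import random
    from functools import reduce
    from z3 import Bool, Xor, BoolVal, Solver, sat

    vars_ = [Bool("v%d" % i) for i in range(8)]
    s = Solver()
    s.add(Xor(vars_[0], vars_[1]))                 # stand-in for a real CNF formula

    # One random parity constraint: the XOR of a random subset must equal a random bit.
    subset = random.sample(vars_, 4)
    xor_subset = reduce(Xor, subset)
    s.add(xor_subset == BoolVal(random.choice([True, False])))
    if s.check() == sat:                           # solutions now come from one random "cell"
        print(s.model())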

TABLE 1 SNAP and its related work for solving theorem proving constraints via sampling.

Reference   Year   Citations     Sampling methodology           Case study size (max |variables|)
[8]         1999   105           Binary Decision Diagram        ≈ 1.3K
[9]         2003   50            Interval-propagation-based     200
[10]        2004   54            Binary Decision Diagram        < 1K
[11]        2004   141           Random Walk + WALKSAT          no experiment conducted
[12]        2011   88            Sampling via determinism       6K
[5]         2012   25            MAXSAT + Search Tree           experiment details not reported
[13]        2014   29            Hashing based                  400K
[6]         2015   28            Hashing based (parallel)       400K
[14]        2016   29            Universal hashing              400K
[4]         2018   5             Z3 + Eq. 1 flipping            400K
SNAP        2020   this paper    Z3 + Eq. 1 + local sampling    400K
(The original table also notes, per row, whether sample verification and distribution/diversity results were reported; for some rows only partial case studies – the small ones – were reported.)

QuickSampler was evaluated on large real-world case studies, some of which have more than 400K variables. At ICSE'18, it was shown that QuickSampler outperforms the aforementioned UniGen2 as well as another similar technique named SearchTreeSampler [5]. QuickSampler starts from a set of valid solutions generated by Z3. Next, it computes the differences between the solutions using Eq. 1. New test cases generated in this manner are not guaranteed to be valid. QuickSampler defines three terms we use later in this paper:
• A test suite is a set of valid tests.
• A test is valid if it uses input settings that satisfy the CNF (a check illustrated by the sketch below).
• One test suite is more diverse than another if it uses more variable settings within the CNF disjunctions. Diverse test suites are preferred since they cover more parts of the code.
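To illustrate the "valid test" definition above, here is a small sketch (ours, with a toy formula) that checks a fully assigned candidate against a CNF using Z3. With every variable pinned, the solver only has to evaluate the formula, which is why such verification calls are cheap.

    from z3 import Bools, And, Or, Not, BoolVal, Solver, sat

    a, b, c = Bools("a b c")
    cnf = And(Or(a, Not(b)), Or(b, c))        # a toy CNF standing in for a real model

    def is_valid(bits):
        s = Solver()
        s.add(cnf)
        for var, bit in zip([a, b, c], bits):
            s.add(var == BoolVal(bool(bit)))  # pin every variable to the candidate's value
        return s.check() == sat

    print(is_valid([1, 0, 1]), is_valid([0, 1, 0]))   # -> True False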

According to Dutra et al.'s experiments, the percent of valid tests found by QuickSampler can be higher than 70%. The percent of valid tests found by SNAP, on the other hand, is 100%. Further, as shown below, SNAP builds those tests, with enough diversity, much faster than QuickSampler.

3 IMPLEMENTING THE SNAP TACTIC

0) Set up:
   a) Let N = 100; i.e. the initial sample size;
   b) Let k = 5; i.e. the number of clusters;
   c) Let suite = ∅; i.e. the output test suite;
   d) Let samples = ∅; i.e. a temporary work space.
1) Initial sample generation:
   a) Add N solutions (from Z3) to samples;
   b) Put all samples into suite (since they are valid).
2) Delta preparation:
   a) Find delta δ = (a ⊕ b) for all a, b ∈ samples;
   b) Weight each delta by how often it repeats.
3) Sampling:
   a) Find k centroids in samples using k-means;
   b) For each centroid c, repeat N times:
      i) stochastically pick deltas δi, δj with probability equal to their weight;
      ii) compute a new candidate using c ⊕ (δi ∨ δj);
      iii) verify the new candidate using Z3;
      iv) if invalid, repair using Z3 (see §3.1) and add to samples;
      v) add the repaired candidate to suite.
4) Loop or terminate:
   a) If improving (see §3.2), go to step 2. Else return suite.

Fig. 2. The SNAP algorithm.

In the SNAP algorithm of Fig. 2, each test is a set of zeros or ones (false, true) assigned to all the variables in a CNF formula.

As shown in initial sample generation (steps 1a, 1b), instead of computing deltas between many tests, SNAP restrains mutation to the deltas between a few valid tests (generated from Z3). SNAP builds a pool of 10,000 deltas from N = 100 valid tests (which means calling a theorem prover only N = 100 times). SNAP uses this pool as a set of candidate "mutators" for existing tests (and by "mutator", we mean an operation that converts an existing test into a new one).

After that, in delta preparation (steps 2a, 2b), SNAP applies Eq. 1. Note that the more often a setting repeats, the more likely it is a backdoor variable. Hence, step 2b sorts the deltas by occurrence frequency. This sort is used in step 3b.

In sampling (steps 3a, 3b), SNAP samples around the average values seen in a few randomly selected valid tests. Here, "averaging" is inferred by using the median values seen in k clusters. Note that, in step 3b, we use deltas that are more likely to be valid (i.e. we use the deltas that occur more frequently).

Step 3b.iii is where we verify the new candidate using Z3. SNAP explores far fewer candidates than QuickSampler (10 to 750 times less, see §5.2). Since we are exploring less, we can take the time to verify them all. Hence, 100% of SNAP's tests are valid (and the same is not true for QuickSampler – see Fig. 6).

Note that, in step 3b.iv, we only add a new test to the clusters if it fails verification (taking care to first repair it). We do this since test cases that pass verification do not add new information. But when an instance fails verification and is repaired, that offers new settings.

Note also that SNAP takes great care in how it calls a theorem prover. Theorem provers are much slower at generating new tests than at repairing invalid tests, which in turn is slower than verifying that a test is valid (since there are more options for generation than for repair than for verification). Hence, SNAP verifies more often than it repairs (and repairs more often than it generates new tests). More specifically:
• The call to Z3 in step 1a is a generating call. This can be slowest since it must navigate all the constraints of our CNF. Therefore, we only do this N = 100 times.
• The call to Z3 in step 3b.iii is a verification call and is much faster since all the variables are set.
• The call to Z3 in step 3b.iv, a repair call, is slower than step 3b.iii since our repair method adds open choices to a test.
Note that we only need to repair the small minority of new tests that fail verification. Later in this paper, we use Fig. 6 to show that repairs are only needed on 30% (median) of all tests.

3.1 Implementing "Repair"

SNAP's repair function deletes "dubious" parts of a test case, then uses Z3 to fill in the gaps. In this way, when we repair a test, most bits are set and Z3 only has to search a small space. To find the "dubious" section, we reflect on how step 3b.ii operates. Recall that the new test uses δ = a ⊕ b, where a, b are valid tests taken from samples. Since a, b were valid, the "dubious" parts of the new test are anything that was not seen in both a and b. Hence, we preserve the bits of c ⊕ δ where the corresponding δ bit was 1, while removing all other bits (where the δ bit was 0). For example:
• To mutate c = (1,0,0,1,1,0,0,0), use δ = (1,0,1,0,1,0,1,0).
• If c ⊕ δ = (0,0,1,1,0,0,1,0) is invalid, then SNAP deletes the "dubious" sections as follows.
• SNAP preserves any bits whose corresponding δ bit was 1.
• SNAP deletes the others; e.g. bits 2, 4, 6, 8 of (0,0,1,1,0,0,1,0) are dropped, leaving (0,_,1,_,0,_,1,_).
• Z3 is then called to figure out the missing bits of (0?1?0?1?).
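The repair step of §3.1 can be sketched as follows (our own minimal reading of the description above, not the paper's C++ code): bits at positions where δ was 1 are pinned, all other bits are left open, and Z3 fills the gaps. The formula and bit vectors are toy values.

    from z3 import Bool, BoolVal, And, Or, Not, Solver, is_true, sat

    def repair(cnf, vars_, candidate, delta):
        """Pin the bits where delta==1, drop the rest, and let Z3 complete the test."""
        s = Solver()
        s.add(cnf)
        for v, bit, d in zip(vars_, candidate, delta):
            if d == 1:
                s.add(v == BoolVal(bool(bit)))
        if s.check() != sat:
            return None
        m = s.model()
        return [1 if is_true(m.evaluate(v, model_completion=True)) else 0 for v in vars_]

    vars_ = [Bool("v%d" % i) for i in range(4)]
    cnf = And(Or(vars_[0], vars_[1]), Or(Not(vars_[2]), vars_[3]))
    print(repair(cnf, vars_, candidate=[0, 0, 1, 0], delta=[1, 0, 1, 0]))   # -> [0, 1, 1, 1]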
3.2 Implementing "Termination"

To implement SNAP's termination criteria (step 4a), we need a working measure of diversity. Recall from the introduction that one test suite is more diverse than another if it uses more of the variable settings within the disjunctions inside the CNF. Diverse test suites are better since they cover more parts of the code.

To measure diversity, we used Feldt et al. [37]'s normalized compression distance (NCD). A test suite with high NCD implies higher code coverage during testing.² NCD uses gzip to estimate the Kolmogorov complexity [38] of the tests. If C(x) is the length of the compression of x and C(X) is the compression length of the concatenation of the binary string set X, then:

NCD(X) = ( C(X) − min_{x∈X} C(x) ) / max_{x∈X} C(X\{x})    (2)
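Eq. 2 can be computed directly with an off-the-shelf compressor. The sketch below is ours (the paper does not publish this helper); it uses zlib as the compressor C(·) and treats each test as a string of bits, so the absolute scores will differ from the gzip-based numbers quoted in the text.

    import zlib

    def C(s):                      # compressed length, the proxy for Kolmogorov complexity
        return len(zlib.compress(s.encode()))

    def ncd(tests):                # Eq. 2 over a set of tests (each test is a bit-string)
        whole = C("".join(tests))
        return (whole - min(C(t) for t in tests)) / \
               max(C("".join(tests[:i] + tests[i + 1:])) for i in range(len(tests)))

    T1 = ["00000", "00011", "00010"]
    T2 = T1 + ["10000", "01011"]
    print(ncd(T1), ncd(T2))        # NCD of each example suite (values depend on the compressor)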

0 0 0 0 0  4 EXPERIMENTAL SET-UP T1 0 0 0 1 1  = 0 0 0 1 0 4.1 Code To explore the research questions shown in the introduction, Here, NCD1 0.142. Now, assuming that we have obtained the = the SNAP system shown in Fig. 2 was implemented in C++ following test suite after several iterations using Z3 v4.8.4 (the latest release when the experiment was   0 0 0 0 0 conducted). A k-means cluster was added using the free edition 0 0 0 1 1  of ALGLIB [39], a numerical analysis and data processing library   T2 0 0 0 1 0  delivered for free under GPL or Personal/Academic license. =   1 0 0 0 0  QuickSampler does not integrate the samples verification into  +  0 1 0 1 1 the workflow. Hence, in the experiment, we adjusted the work- + flow of QuickSampler so that all samples are verified before then NCD is now 0.272. Note that (a) marks the new test 2 + termination, which is the same as SNAP as in §3.2. Also, the cases obtained since T1 ; and (b) NCD2 is larger since the new outputs of QuickSampler were the assignments of independent cases cover various options in first two bits. support. The independent support is a subset of variables which On the other hand, if we further consider the following test completely determines all the assignments to a formula [4]. In suite: practice, engineers need the complete test case input; conse- quently, for valid samples, we extended the QuickSampler to 0 0 0 0 0  get full assignments of all variables from independent support’s 0 0 0 1 1    assignment via propagation. 0 0 0 1 0      T3 0 0 0 0 1  =  +  4.2 Case Studies 1 0 0 0 0    0 1 0 1 1  Table 2 lists the case studies used in this work. We can see that 0 1 0 1 0 the number of variables ranges from hundreds to more than + 486K. The large examples have more than 50K clauses, which is then NCD3 = 0.305. Here, due to two new test cases (marked very huge. For exposition purposes, we divided the case studies as ), we do see NCD improvements in T3 , as compared to into three groups: the small case studies with vars 6K ; the + < T2 . Such scale of improvements, however, is not significant: medium case studies with 6K vars 12K and the large case 0.272 0.142 < < from T1 to T2 , we got 0.142− 91% NCD improvements, studies with vars 12K . 0.305 0.272= > while in the T3 , we got − 12.1% increases. This is For the following reasons, our case studies are the same as 0.272 = because the new cases in T3 does not explore the diversity of those used in the QuickSampler paper: 3 new bits, such as the third bit. We wanted to compare our method to QuickSampler; • Another point to note is that NCD is presenting the diversity Their case studies were online available; • of a string-block (i.e. every substring matters). One substring x Their studies are used in many papers [4], [6], [13], [14]. • where C(x) 0 does not imply NCD 1. This is because the x → → These case studies are representative of scenarios engineers itself can attribute a lot to NCD of the whole string-block. Take met in software testing or circuit testing in embedded sys- the aforementioned T1 as an example: among all three cases tem design. They include bit-blasted versions of SMTLib case x , x and x , C(x ) 1, NCD(x x x ) 1. 1 2 3 1 → 1 ∪ 2 ∪ 3 ¿ studies, ISCAS89 circuits augmented with parity conditions on randomly chosen subsets of outputs and next-state vari- 2. 
2. Aside: we note that we did not adopt the diversity metric (distribution of samples displayed as a histogram) from [4], [6] since computing that metric is very time-consuming. For the case studies of this paper, that calculation required days of CPU. Later in this paper, we show that our use of this diversity measure is not a threat to validity for this study.
3. In this example, we examine the diversity via single bits. However, the NCD also examines bit-groups, such as combinations of a bit pair (x, y), or a bit tuple (x, y, z), etc.

For pragmatic reasons, certain case studies were omitted from our study. For example, we do not report on diagStencilClean.sk_41_36 since the purpose of this paper is to sample a set of valid solutions that meets the diversity requirement, while there are only 13 valid solutions for this model. QuickSampler spent 20 minutes (on average) to search for one solution.

Also, we do not report further on the case studies marked with a star (*) in Table 2. Based on the experiment, we found that even though QuickSampler generates tens of millions of samples for these examples, all samples were assignments to the independent support (defined in §4.1). The omission of these case studies is not a critical issue: solving or sampling these examples is not difficult, since they are all very small compared to the other case studies.

TABLE 2
Case studies used in this paper, sorted by number of variables and grouped into small, medium and large problems. Three items (marked with *) are not included in some further reports (see text). See §4.2 for details.

Size     Case studies                  Vars     Clauses
Small    blasted_case47                118      328
         blasted_case110               287      1263
         s820a_7_4                     616      1703
         s820a_15_7                    685      1987
         s1238a_3_2                    685      1850
         s1196a_3_2                    689      1805
         s832a_15_7                    693      2017
         blasted_case_1_b12_2*         827      2725
         blasted_squaring16*           1627     5835
         blasted_squaring7*            1628     5837
         70.sk_3_40                    4669     15864
         ProcessBean.sk_8_64           4767     14458
         56.sk_6_38                    4836     17828
         35.sk_3_52                    4894     10547
         80.sk_2_48                    4963     17060
Medium   7.sk_4_50                     6674     24816
         doublyLinkedList.sk_8_37      6889     26918
         19.sk_3_48                    6984     23867
         29.sk_3_45                    8857     31557
         isolateRightmost.sk_7_481     10024    35275
         17.sk_3_45                    10081    27056
         81.sk_5_51                    10764    38006
         LoginService2.sk_23_36        11510    41411
Large    sort.sk_8_52                  12124    49611
         parity.sk_11_11               13115    47506
         77.sk_3_44                    14524    27573
         20.sk_1_51                    15465    60994
         enqueueSeqSK.sk_10_42         16465    58515
         karatsuba.sk_7_41             19593    82417
         tutorial3.sk_4_31             486193   2598178

4.3 Experimental Rig

We compared SNAP to the state-of-the-art QuickSampler technique. To ensure repeatability, we updated the Z3 solver in QuickSampler to the latest version.

To reduce observation error and test the robustness of the performance, we repeated all experiments 30 times with 30 different random seeds. To simulate real practice, such random seeds were used in the Z3 solver (for initial solution generation), ALGLIB (for the k-means) and other components. Due to space limitations, we cannot report results for all 30 repeats. Instead, we report the median or the IQR (75th–25th percentile variation) results.

All experiments were conducted on Xeon-E5@2GHz machines with 4GB memory, running CentOS. We only used one core per machine.

5 RESULTS

The rest of this paper uses the machinery defined above to answer the research questions posed in the introduction.

5.1 RQ1: How Much Faster is the SNAP Tactic?

Fig. 3 shows the execution time required for SNAP and QuickSampler. The y-axis of this plot is log-scale and shows time in seconds. These results are shown in the same order as Table 2. That is, from left to right, these case studies grow from around 300 to around 3,000,000 clauses.


Fig. 3. RQ1 results: time to terminate (seconds). The y-axis is in log scale. The SNAP sampling time for s1238a_3_2 and parity.sk_11_11 is not reported since their achieved NCD was much worse than QuickSampler's (see Fig. 7). Fig. 4 illustrates the corresponding speedups.

TABLE 3
RQ2 results: number of unique valid cases in each test suite. Sorted by the last column. Same color scheme as Table 2.

Case studies            SNAP (SS)   QuickSampler (SQ)   SQ/SS
blasted_case47          2899        71                  0.02
isolateRightmost        15480       7510                0.49
LoginService2           404         210                 0.52
19.sk_3_48              204         200                 0.98
70.sk_3_40              3050        4270                1.40
s820a_15_7              29065       70099               2.41
29.sk_3_45              225         660                 2.93
s820a_7_4               37463       124457              3.32
s832a_15_7              27540       96764               3.51
enqueueSeqSK            338         2495                7.38
s1196a_3_2              225         1890                8.40
blasted_case110         274         2386                8.71
tutorial3.sk_4_31       336         2953                8.79
81.sk_5_51              227         2814                12.40
sort.sk_8_52            812         10184               12.54
karatsuba.sk_7_41       139         4210                30.29
20.sk_1_51              239         10039               42.00
doublyLinkedList        278         12042               43.32
17.sk_3_45              228         12780               56.05
ProcessBean             1193        75392               63.20
7.sk_4_50               258         18090               70.12
56.sk_6_38              1827        149031              81.57
80.sk_2_48              653         54440               83.37
77.sk_3_44              245         33858               138.20
35.sk_3_52              258         193920              751.63

Fig. 4. RQ1 results: sorted speedup (time(QuickSampler) / time(SNAP)), benchmarks sorted by speedup. If over 10⁰ (i.e. greater than one), then SNAP terminates earlier.

For the smaller case studies, shown on the left, SNAP is sometimes slower than QuickSampler. Moving left to right, from smaller to larger case studies, it can be seen that SNAP often terminates much faster than QuickSampler. On the very right-hand side of Fig. 3, there are some results where it seems SNAP is not particularly faster. This is due to the log-scale applied to the y-axis. Even in these cases, SNAP is terminating in less than an hour while other approaches need more than two hours.

Fig. 4 is a summary of Fig. 3 that divides the execution times of both systems. From this figure it can be seen:

Conclusion #1: SNAP terminated 10 to 3000 times faster than QuickSampler (median to max).

There are some exceptions to this conclusion, where QuickSampler was faster than SNAP (see the right-hand side of Fig. 4). Those cases are usually small models (17,000 clauses or less). For medium to larger models, with 20,000 to 2.5 million clauses, SNAP is often orders of magnitude faster.

5.2 RQ2: Does the SNAP tactic find fewer test cases?

Table 3 compares the number of tests from QuickSampler and SNAP. As shown by the last column in that table:

Conclusion #2: Test cases from SNAP were 10 to 750 times smaller than from QuickSampler (median to max).

Hence we say that using SNAP is easier than other methods, where "easier" is defined as per our Introduction. That is, when test suites are 10 to 750 times smaller, then they are faster to run, consume less cloud-compute resources, and mean developers have to spend less time processing failed tests.

5.3 RQ3: How "good" are the tests found via the SNAP tactic?

Generating small test sets, and doing so very quickly, is not interesting unless those test suites are also "good". This section applies two definitions of "good" to the SNAP output:

Fig. 5. Identical deltas seen in 100*100 pairs of valid solution deltas for all case studies. Same color scheme as Table 2.

• Credibility: Recalling Fig. 1, test suites need to satisfy the CNF clauses generated from source code. As defined in the introduction, we say that the "more credible" a test suite, the larger its percentage of valid tests.


Fig. 6. RQ3 results for "credibility": percentage of valid mutations found in step 3b.iii (computed separately for each case study), benchmarks sorted by credibility rate.

Fig. 7. RQ3 results for "diversity": normalized compression distance, NCD(SNAP)/median(QuickSampler), when QuickSampler and SNAP terminated on the same case studies. Median results over 30 runs (small black lines show the 75th–25th variations). Same color scheme as Table 2.

• Diversity: A CNF clause is a conjunction of disjunctions. Diversity measures how many of those disjunctions are explored by the tests. This is important since high diversity means that most code branches are covered.

5.3.1 Credibility

Regarding credibility, we note that SNAP only prints valid tests. That is, 100% of SNAP's tests are valid.

The same cannot be said for QuickSampler. That algorithm ran so quickly since it assumed that tests generated using Eq. 1 did not need verification. To check that assumption, for each case study we randomly generated 100 valid solutions S = {s1, s2, ..., s100} using Z3. Next, we selected three {a, b, c} ∈ S and built a new test case using Eq. 1; i.e. new = c ⊕ (a ⊕ b).

Fig. 5 lists the number of identical deltas seen in 100² of those deltas. We rarely found large sets of unique deltas; i.e. among the 100 valid solutions given by Z3, many δs were shared within pairwise solutions. This is important since, if otherwise, the Eq. 1 heuristic would be dubious.

The percentages of these deltas that proved to be valid in step 3b.iii of the SNAP algorithm (Fig. 2) are shown in Fig. 6. Dutra et al.'s estimate was that the percentage of valid tests generated by Eq. 1 was usually 70% or more. As shown by the median values of Fig. 6, this was indeed the case. However, we also see that in the lower third of those results, the percent of valid tests generated by Eq. 1 is very low: 25% to 50% (median to max). This result makes us cautious about using QuickSampler since, when the Eq. 1 heuristic fails, it seems to be inefficient.

By way of comparison, it is relevant to remark here that SNAP verifies every test case it generates. This is practical for SNAP, but impractical for QuickSampler since these two systems typically process 10² to 10⁸ test cases, respectively.
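The credibility check described above can be sketched as follows (our own approximation, with a toy formula): enumerate a few distinct valid solutions by adding blocking clauses, build Eq. 1 candidates from random triples, and measure what fraction verify.

    import random
    from z3 import Bools, And, Or, Not, BoolVal, Solver, is_true, sat

    vars_ = Bools("a b c d")
    cnf = And(Or(vars_[0], vars_[1]), Or(Not(vars_[2]), vars_[3]))   # toy CNF

    def distinct_solutions(n):
        s, found = Solver(), []
        s.add(cnf)
        while len(found) < n and s.check() == sat:
            m = s.model()
            bits = [1 if is_true(m.evaluate(v, model_completion=True)) else 0 for v in vars_]
            found.append(bits)
            s.add(Or([v != BoolVal(bool(b)) for v, b in zip(vars_, bits)]))  # block this model
        return found

    def valid(bits):
        s = Solver()
        s.add(cnf)
        s.add(And([v == BoolVal(bool(b)) for v, b in zip(vars_, bits)]))
        return s.check() == sat

    sols = distinct_solutions(6)
    trials = [[c ^ (a ^ b) for a, b, c in zip(*random.sample(sols, 3))] for _ in range(20)]
    print(sum(valid(t) for t in trials) / len(trials))   # fraction of Eq. 1 guesses that verify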
In any case, another reason to recommend SNAP is that this tool delivers test suites where 100% of all tests are valid. In summary, measured in terms of credibility:

Conclusion #3: Usually, SNAP's tests are far more credible and very nearly as diverse as those from QuickSampler.

5.3.2 Diversity

Regarding diversity, Fig. 7 compares the diversity of the test suites generated by our two systems (expressed as ratios of the observed NCD values). Results less than one indicate that SNAP's test suites are less diverse than QuickSampler's. In the median case, the ratio is one; i.e. in terms of central tendency, there is no difference between the two algorithms.

We have analyzed the Fig. 7 results with a bootstrap test at 95% confidence (to test for statistically significant results), and a Cohen's effect size test (to rule out trivially small differences). Based on those tests, we say that in 25/27 = 93% of these results, there is no significant difference (of non-trivial size) between the two algorithms.

That said, there are two cases with a statistically significant difference where SNAP's diversity is markedly lower (see s1238a_3_2 and parity.sk_11_11 in Fig. 7). In terms of scoring different algorithms, it could be argued that these examples might mean that QuickSampler is the preferred algorithm, but only (a) if numerous invalid tests are not an issue; (b) if testing resources are fast and cheap (so saving time and money on cloud-compute test facilities is not worthwhile); and (c) if developer time is cheap (so the time required to specify expected test output, or to process large numbers of failed tests, is not an issue).

Hence we recommend SNAP since:
• SNAP's tests are 100% "good";
• while other methods may find fewer "good" tests.

6 THREATS TO VALIDITY

6.1 Baseline Bias

One threat to the validity of this work is baseline bias. Indeed, there are many other sampling techniques, or solvers, that SNAP might be compared to. However, our goal here was to compare SNAP to a recent state-of-the-art result from ICSE'18. In further work, we will compare SNAP to other methods.

6.2 Internal Bias

A second threat to validity is internal bias that arises from the stochastic nature of sampling techniques. SNAP requires many random operations. To mitigate this threat, we repeated the experiments 30 times.

6.3 Hyperparameter Bias

Another threat is hyperparameter bias. The hyperparameters are the set of configurations for the algorithm. The hyperparameters used in these experiments were shown in §3.3. Learning how to automatically adjust these settings would be a useful direction for future work.

How long would it take to learn better parameters? As shown in Fig. 4, it can take 10⁵ (approx.) seconds to complete one run of our test generation systems. Standard advice for hyperparameter optimization with (say) a genetic algorithm is to mutate a population of 100 candidates over 100 generations [40]. Allowing for 20 repeats (for statistical validity), the runtimes for hyperparameter optimization experiments could require:

10⁵ ∗ 100 ∗ 100 ∗ 20 / 3600 / 168 / 52 ≈ 650 years

This is clearly an upper bound. If we applied experimental hyperparameter optimizers that tried fewer than 50 configurations (selected via Bayesian parameter optimization [41], [42]), then that runtime could be three years of CPU:

10⁵ ∗ 50 ∗ 20 / 3600 / 168 / 52 ≈ 3 years

Yet another method, that might be more promising, is incremental transfer learning, where optimizers transfer lessons learned between hyperparameter optimizations running in parallel [43]. In this approach, we might not need to wait 10⁵ seconds before we can find better parameters.

In summary, it would be an exciting and challenging task to perform hyperparameter optimization in this domain.
6.4 Construct Validity

There are cases where the above test scheme would be incomplete. All the above assumes that the constraints of the program can be expressed in terms of the literals seen within the conditionals that define each branch of a program. This may not always be true. For example, consider constraints between fields buried deep within a nested data structure being passed around the program. To address constraints of that type, we would need access to (e.g.) invariants that may be defined within those structs, but which are invisible to the tests in the path conditionals. Strongly typed languages like Haskell or OCaml, which can reason about nested types, might be of some assistance here. This would be a promising area for future work.

6.5 External Validity

Apart from issues of nested type constraints, this section lists two other areas that would require an extension to the current SNAP framework. Specifically, SNAP is not designed for testing non-deterministic or nonfunctional requirements.

Functional requirements define system functions; e.g. "update credit card record". Non-functional requirements, on the other hand, specify how the system should do it. For example, nonfunctional requirements relate to software "ilities" such as usability, maintainability, scalability, etc. When designing tests for nonfunctional requirements, it may be necessary to access variables that are not defined in the conditionals that define program branches (e.g. is the user happy with the interaction?). SNAP does not do that since it draws its tests only from the variables in the branch tests.

As to testing non-deterministic systems, a deterministic function is one whose output is fully determined by its inputs; i.e. if the function is called N times with the same inputs then, in a deterministic environment, we would expect the same output. A non-deterministic function, on the other hand, is one where identical inputs can lead to different outputs. When designing tests for non-deterministic systems, it would be useful to make multiple tests fall down each program branch, since that better samples the space of possible non-deterministic behaviours within that branch. SNAP may not be the best tool for non-deterministic systems since, often, it only produces one test for each of the branches it visits.

In future work, it would be insightful to consider how SNAP might be extended to handle testing for non-functional and/or non-deterministic systems.

6.6 Algorithm Bias

One threat to the validity of the above results is that we used a termination criterion based on NCD, which is different from some prior work. So is NCD a fair diversity comparison metric?

To explore this, we need some way to compare the results of test case generation algorithms, given the same CPU allocation, i.e. getting rid of NCD. Fig. 8 shows one way to make that comparison (this figure comes from the QuickSampler paper):
• Three different test case generation algorithms [4], [5], [6] (and a uniform random generator) are executed.
• Each algorithm was given 10 hours of CPU.
• Fig. 8 counts the number of repeated solutions within each run. That figure shows that (e.g.) 25,000 solutions are found 15 times (approximately) within all four methods.
The authors of the QuickSampler paper used Fig. 8 to argue that, assuming a large CPU allocation, at the end of the run all these algorithms achieve similar solution diversity. Aside: just to defend QuickSampler here – merely because the same solutions are found in Fig. 8 by different methods does not mean that there is no benefit to QuickSampler. As discussed in [4], QuickSampler wins over the other algorithms of Fig. 8 since (a) it scales to larger problems and (b) it produces test suites with more valid tests, faster than previous methods.

To illustrate SNAP's diversity in a manner similar to Fig. 8, we make the following observations:
• Recalling the Table 3 results, SNAP produces far fewer test cases than these other algorithms. Hence, we cannot use the y-axis of Fig. 8 to compare our method to previous methods.


Fig. 8. The authors of the QuickSampler paper [4] say this figure shows that different test generation algorithms – QuickSampler [4], SearchTreeSampler [5], UniGen2 [6] and one uniform random generator – achieve similar solution diversity (assuming unlimited CPU).

• We need another, non-NCD, measure of diversity that does not favor either QuickSampler or SNAP. For that purpose, we used Shannon entropy [44], i.e.

H(p) = −p log₂(p) − (1 − p) log₂(1 − p)

where p is the probability of ones (1s) in the solution.

In combination, our comparison proceeds as follows:
• QuickSampler and SNAP were run on each case study, terminating after the same number of minutes.
• Since our goal was to see "what is lost by SNAP", we terminated in times similar to the termination times seen in the RQ1 study. Specifically, those termination times were assigned to the case studies, based on their number of clauses, from the set {1, 5, 10} minutes.
• As shown above in the RQ2 study, the numbers of solutions generated by SNAP and QuickSampler are not on the same scale. Accordingly, we only recorded the unique solutions found in this study.
• At termination, we collected all unique valid test cases found within the given time and compared the diversities among them.

As seen in Fig. 9, we output the distribution of entropies, expressed as the percentage of tests that have that entropy (and not as Fig. 8's absolute number of tests with that entropy).

Fig. 9 shows the distributions of the diversity of the SNAP and QuickSampler results seen in blasted_case47, blasted_case110, s820a_7_4, s820a_15_7 and LoginService2.sk_23_36. These data sets were selected for presentation here since, in the QuickSampler paper, they were singled out for special analysis (according to that paper, these case studies yield a large and countable range of diverse results). In that figure, we see that:
• Among all these test cases, SNAP and QuickSampler yielded solutions within the same entropy range.
• In fact, usually we see a very narrow range of entropy on the x-axis: in 4/5 cases, the range was less than 3%. This means that these case studies yield solutions with similar entropy.

Fig. 9. Distributions of the diversity (assessed using Shannon entropy). Both QuickSampler and SNAP were terminated at the same time – blasted cases = 1 min, s820a cases = 5 mins, LoginService2 = 10 mins. The x-axis shows valid solutions (those with the same Shannon entropy were clustered together); the y-axis shows occurrences among the result set.

In summary, from Fig. 9, we say that if solutions were generated by QuickSampler with a particular entropy, then it is likely that SNAP was generating that kind of solution as well. Hence, we do not see a threat to validity introduced by how SNAP selects its termination criteria.

6.7 Evaluation Bias

This paper has evaluated the SNAP test case generator using the five goals described in the introduction: i.e. runtime, scalability, redundancy, credibility and minimality. But as the following examples show, these are not the only criteria for assessing test suites. For future work it could be useful and insightful to apply other evaluation criteria.

Firstly, Yu et al. [16] discuss the information needs for test case prioritization. They argue that in modern, complex, cloud-based test environments, it can be advantageous not to run all tests all the time. Rather, there are engineering benefits to first running the tests that are most likely to fail. Their results show that different kinds of systems need different kinds of prioritization schemes, but not all projects collect the kinds of data needed for different prioritization schemes. Hence it is an open issue if tools like SNAP and QuickSampler can contribute to test case prioritization.

Secondly, once tests are run, then faults have to be localized and fixed. Spectrum-based Reasoning (SR) is a research hotspot here. Given a system of M components, a test suite T, as well as the errors obtained after executing T on the system, SR approaches utilize similarity-like coefficients to find a correlation between components and error locations. Perez et al. [45] warn that though high-coverage test suites can detect errors in the system, it is not guaranteed that inspecting tests will yield a straightforward explanation, i.e. root cause, for the error. It would be insightful to test how effective QuickSampler or SNAP are at localizing faults in real-world executions.

Thirdly, Ostrand et al. [46] argue that the value of quality assurance methods is that they focus the analysis on what parts of the code base deserve most attention.
By this criterion, we should assess test suites by how well they find the most bugs in the fewest lines of code.

Fourthly, a common way to assess test suite generators is via the uniformity of the generated tests [47]. Theorem provers report their solutions in some implementation-specific order. Hence, it is possible that after running a theorem prover for some finite time, the solutions found in that time may only come from a small "corner" of the space of possible solutions [13]. When testing the uniformity of a theorem prover sampling the space of N possible tests, the frequency of occurrence of some test Ti should be approximately 1/N.

We argue that issues of uniformity are less important than branch coverage (which is measured above as diversity, see §6.6). To make that argument, we draw a parallel from the field of data mining. Consider a rule learner that is building a rule from the set of all possible literals in a data set. In theory, this space of literals is very large (all attributes combined with all their ranges, combined with any number of logical operators, and combined to any length of rule). Nevertheless, a repeated result is that such learners can terminate very quickly [48] since, rather than searching all literals, these learners need only explore the small set of literals commonly seen in the data.

We draw this parallel since the success of SNAP is consistent with the conjecture that the programs we explore are using just a small subset of the space of all settings. In that situation, uniformity is less of an issue than diversity, since the latter only reports how well the tests match the "shape" of the data.

We note that other researchers endorse our position here that effective testing need only explore a small portion of the total state space. Miryung Kim and colleagues [49] were testing scripts that processed up to 10¹⁰ rows of data. In theory, the test suite required here is very large indeed (the cross-product of all the possible values in 10¹⁰ rows). However, a static analysis showed that those scripts could be approximated by less than 3 dozen pathways. Hence in that application, less than three dozen tests were enough to test those scripts. Note the parallels of the Kim et al. results to the SNAP work and the data mining example offered above:
• Kim et al. did not cover all possible data combinations.
• Rather, they constrained their tests to just cover the standard "shape" of the code they were testing.
Accordingly, when seeking the smallest number of tests that cover the branches, it may be a secondary concern whether or not those test cases "bunch up" and do not cover the cross product of all possible solutions.

7 RELATED WORK

In essence, the algorithms of this paper are samplers that explore some subset of seemingly large and complex problems. Sampling is not only useful for finding test suites in theorem proving. It also has applications for other SE problems such as requirements engineering, resource planning optimization, etc. [50], [51], [52], [53]. A repeated problem with all these applications was the time required to initialize the reasoning. In that initialization step, some large number of samples had to be collected. In practice, that step took a significant percentage of the total runtime of those systems. We conjecture that SNAP can solve that initialization problem. Using the techniques of this paper, it might be time now to repeat all the above work. This time, however, instead of wasting much time on a tedious generation process, we could use something like SNAP to quick-start the reasoning.

As to other related work: like SNAP, the DODGE system of Agrawal et al. [54] assumed that, given a set of solutions to some SE problem, there is much redundancy and repetition within those different solutions. A tool for software analytics, DODGE needed just a few dozen evaluations to explore billions of configuration options for (a) choice of learner, (b) choice of pre-processor, and (c) control parameters for the learner and pre-processor. DODGE executed by:
1) Assign random weights to configuration options.
2) Randomly pick options, favoring those with the most weight;
3) Configure and execute data pre-processors and learners using those options;
4) Divide output scores into regions of size ε = 0.2;
5) When some new configuration has scores within ε of prior configurations, then...
6) ...reduce the weight of those configuration options; and
7) Go to Step 2.
Note that after Step 5, the choices made in subsequent Step 1s will avoid options that arrive within ε of other observed scores. Experiments with DODGE found that best learner performance plateaus after just a few dozen repeats of Steps 1–5. To explain this result, Agrawal et al. [54] note that for a range of software analytics tasks, the outputs of a learner divide into only a handful of equivalent regions. For example, when a software analytics task is repeated 10 times, each time with 90% of the data, then the observed performance scores (e.g. recall, false alarm) can vary by 5 percent, or more.
Assuming normality, scores less than ε = 1.96 ∗ 2 ∗ 0.05 = 0.196 apart are statistically indistinguishable. Hence, for learners evaluated on (say) N = 2 scores, those scores effectively divide into just C = (1/ε)^N = (1/0.196)² ≈ 26 different regions. Hence, it is hardly surprising that a few dozen repeats of Steps 1–5 were enough to explore a seemingly very large space of options.

It has not escaped our notice that some analogy of the DODGE result could explain the curious success of the QuickSampler heuristic. Consider: one way to summarize Eq. 1 is that the space around existing valid test cases contains many other valid test cases – which is an idea analogous to Agrawal's ε regions. That said, we would be hard pressed to defend that analogy. Agrawal's ε regions are a statistical concept based on continuous variables, while Eq. 1 is defined over discrete values.

Also, there are many other ways in which DODGE is fundamentally different to SNAP. DODGE was a support tool for inductive data mining applications, while SNAP is most accurately described as a support tool for a deductive system (Z3). Further, DODGE assumes very little structure in its inputs (just tables of data with no more than a few dozen attributes), while SNAP's inputs are far larger and far more structured (recall from Table 2 that SNAP processes CNF formulas with up to hundreds of thousands of variables). Lastly, recalling Step 6 (listed above), DODGE incrementally re-weights the space from which new options are generated. SNAP, on the other hand, treats the option generator as a black box algorithm since it does not reach inside Z3 to alter the order in which it generates solutions.

8 CONCLUSION

Exploring propositional formulas is a core computational process with many areas of application. Here, we explore the use of such formulas for test suite generation. SAT solvers are a promising technology for finding settings that satisfy propositional formulas. The current generation of SAT solvers is challenged by the size of the formulas seen in the recent SE testing literature.

Using the criteria listed in the introduction (runtime, scalability, redundancy, credibility and minimality), we recommend the following "SNAP tactic" to tame the computational complexity of SAT solving for test suite generation:

Sample around the average values seen in a few randomly selected valid tests.

When this tactic was applied to 27 real-world case studies, test suite generation ran 10 to 3000 times faster (median to max) than a prior report. While that prior work found tests that were 70% valid, our SNAP tool generated 100% valid tests.

Another important result was the size of the test set generated in this manner. There is an economic imperative to run fewer tests when companies have to pay money to run each test, and when developers have to spend time studying each failed test. In that context, it is interesting to note that SNAP's tests are 10 to 750 times smaller (median to max) than those from prior work.

We conjecture that:
• SNAP's success is due to widespread repeated structures;
• Given the presence of such repeated structures, the SNAP tactic might be useful for many other SE tasks.

ACKNOWLEDGEMENTS

This work was partially funded by an NSF award #1703487.

REFERENCES
8 CONCLUSION

Exploring propositional formulas is a core computational process with many areas of application. Here, we explore the use of such formulas for test suite generation. SAT solvers are a promising technology for finding settings that satisfy propositional formulas. However, the current generation of SAT solvers is challenged by the size of the formulas seen in the recent SE testing literature.

Using the criteria listed in the introduction (runtime, scalability, redundancy, credibility and minimality), we recommend the following "SNAP tactic" to tame the computational complexity of SAT solving for test suite generation:

    Sample around the average values seen in a few randomly selected valid tests.
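To make that recommendation concrete, the sketch below shows one way the tactic could be realized with the Z3 Python bindings (the z3-solver package). This is an illustrative sketch rather than the SNAP implementation evaluated in this paper: the function snap_sketch, its parameters (n_seed, n_wanted, stick), and the model enumeration used to gather the first few valid tests are our own simplifications, and a production tool would need to randomize the seed tests and scale these steps to formulas with hundreds of thousands of variables.

    # Hypothetical sketch of the SNAP tactic (not the authors' code).
    # Requires the z3-solver package: pip install z3-solver
    import random
    from z3 import Bool, Or, Not, Solver, sat, is_true

    def snap_sketch(clauses, variables, n_seed=5, n_wanted=20, stick=0.9):
        """Return a small suite of valid tests for a CNF formula."""
        seeder = Solver()
        seeder.add(clauses)

        # Step A: collect a few valid tests. Here we simply enumerate and
        # block models; SNAP itself randomizes this step.
        seeds = []
        while len(seeds) < n_seed and seeder.check() == sat:
            m = seeder.model()
            seed = {v: is_true(m.eval(v, model_completion=True)) for v in variables}
            seeds.append(seed)
            seeder.add(Or([Not(v) if seed[v] else v for v in variables]))
        if not seeds:
            return []  # formula is unsatisfiable

        # Step B: average the seed tests, variable by variable.
        avg = {v: sum(s[v] for s in seeds) / len(seeds) for v in variables}

        # Steps C and D: sample candidate tests around that average and keep
        # only those that Z3 confirms are valid (so 100% of returned tests
        # satisfy the formula). Duplicates are discarded to limit redundancy.
        checker = Solver()
        checker.add(clauses)
        tests, attempts = [], 0
        while len(tests) < n_wanted and attempts < 10000:
            attempts += 1
            guess = {v: (random.random() < avg[v]) if random.random() < stick
                     else (random.random() < 0.5) for v in variables}
            literals = [v if guess[v] else Not(v) for v in variables]
            if guess not in tests and checker.check(*literals) == sat:
                tests.append(guess)
        return tests

    # Tiny usage example: two clauses over three variables.
    a, b, c = Bool('a'), Bool('b'), Bool('c')
    suite = snap_sketch([Or(a, b), Or(Not(a), c)], [a, b, c], n_seed=2, n_wanted=3)

Note that Z3 is used here strictly as a black box: the sketch only asks it for models and for validity checks, and never alters how it searches.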
When this tactic was applied to 27 real-world case studies, test suite generation ran 10 to 3000 times faster (median to max) than a prior report. While that prior work found tests that were 70% valid, our SNAP tool generated 100% valid tests.

Another important result was the size of the test suites generated in this manner. There is an economic imperative to run fewer tests when companies have to pay money to run each test, and when developers have to spend time studying each failed test. In that context, it is interesting to note that SNAP's test suites are 10 to 750 times smaller (median to max) than those from prior work.

We conjecture that:
• SNAP's success is due to widespread repeated structures in software. Without such repeated structures, we are at a loss to explain our results. Algorithms that exploit such repeated structures, like SNAP, could (potentially) clarify and simplify many kinds of SE analysis; for example, Miryung Kim and colleagues [49] were testing big data analytics with complex user-defined functions.
• Given the presence of such repeated structures, the SNAP tactic might be useful for many other SE tasks.

ACKNOWLEDGEMENTS

This work was partially funded by NSF award #1703487.

REFERENCES

[1] Y. Fazlalizadeh, A. Khalilian, M. A. Azgomi, and S. Parsa, "Prioritizing test cases for resource constraint environments using historical test case performance data," in 2009 2nd IEEE International Conference on Computer Science and Information Technology. IEEE, 2009, pp. 190–195.
[2] Y. Lu, Y. Lou, S. Cheng, L. Zhang, D. Hao, Y. Zhou, and L. Zhang, "How does regression test prioritization perform in real-world software evolution?" in 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE), 2016, pp. 535–546.
[3] S. Mahajan, S. D. Joshi, and V. Khanaa, "Component-based software system test case prioritization with genetic algorithm decoding technique using java platform," in 2015 International Conference on Computing Communication Control and Automation, 2015, pp. 847–851.
[4] R. Dutra, K. Laeufer, J. Bachrach, and K. Sen, "Efficient sampling of SAT solutions for testing," in 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE). IEEE, 2018, pp. 549–559.
[5] S. Ermon, C. P. Gomes, and B. Selman, "Uniform solution sampling using a constraint solver as an oracle," arXiv preprint arXiv:1210.4861, 2012.
[6] S. Chakraborty, D. J. Fremont, K. S. Meel, S. A. Seshia, and M. Y. Vardi, "On parallel scalable uniform SAT witness generation," in International Conference on Tools and Algorithms for the Construction and Analysis of Systems. Springer, 2015, pp. 304–319.
[7] M. Finke, "Equisatisfiable SAT encodings of arithmetical operations," [Online]. Available: http://www.martin-finke.de/documents/Masterarbeit_bitblast_Finke.pdf, 2015.
[8] J. Yuan, K. Shultz, C. Pixley, H. Miller, and A. Aziz, "Modeling design constraints and biasing in simulation using BDDs," in Proceedings of the 1999 IEEE/ACM International Conference on Computer-Aided Design. IEEE Press, 1999, pp. 584–590.
[9] M. A. Iyer, "RAVE: a word-level ATPG-based constraints solver system for smart random simulation," in International Test Conference, 2003. Proceedings. ITC 2003., vol. 1, 2003, pp. 299–308.
[10] J. Yuan, A. Aziz, C. Pixley, and K. Albin, "Simplifying boolean constraint solving for random simulation-vector generation," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 23, no. 3, pp. 412–420, 2004.
[11] W. Wei, J. Erenrich, and B. Selman, "Towards efficient sampling: Exploiting random walk strategies," in AAAI, vol. 4, 2004, pp. 670–676.
[12] V. Gogate and R. Dechter, "SampleSearch: Importance sampling in presence of determinism," Artificial Intelligence, vol. 175, no. 2, pp. 694–729, 2011.
[13] S. Chakraborty, K. S. Meel, and M. Y. Vardi, "Balancing scalability and uniformity in SAT witness generator," in Proceedings of the 51st Annual Design Automation Conference. ACM, 2014, pp. 1–6.
[14] K. S. Meel, M. Y. Vardi, S. Chakraborty, D. J. Fremont, S. A. Seshia, D. Fried, A. Ivrii, and S. Malik, "Constrained sampling and counting: Universal hashing meets SAT solving," in Workshops at the Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[15] S. Haidry and T. Miller, "Using dependency structures for prioritization of functional test suites," IEEE Transactions on Software Engineering, vol. 39, no. 2, pp. 258–275, 2013.
[16] Z. Yu, F. Fahid, T. Menzies, G. Rothermel, K. Patrick, and S. Cherian, "Terminator: better automated UI test case prioritization," in Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2019, pp. 883–894.
[17] P. K. Chittimalli and M. J. Harrold, "Re-computing coverage information to assist regression testing," in 2007 IEEE International Conference on Software Maintenance, 2007, pp. 164–173.
[18] C. Parnin, E. Helms, C. Atlee, H. Boughton, M. Ghattas, A. Glover, J. Holman, J. Micco, B. Murphy, T. Savor et al., "The top 10 adages in continuous deployment," IEEE Software, vol. 34, no. 3, pp. 86–95, 2017.
[19] R. Baldoni, E. Coppa, D. C. D'elia, C. Demetrescu, and I. Finocchi, "A survey of symbolic execution techniques," ACM Computing Surveys (CSUR), vol. 51, no. 3, p. 50, 2018.

[20] M. Christakis, P. Müller, and V. Wüstholz, "Guiding dynamic symbolic execution toward unverified program executions," in Proceedings of the 38th International Conference on Software Engineering. ACM, 2016, pp. 144–155.
[21] M. Davis and H. Putnam, "A computing procedure for quantification theory," Journal of the ACM (JACM), vol. 7, no. 3, pp. 201–215, 1960.
[22] F. Arito, F. Chicano, and E. Alba, "On the application of SAT solvers to the test suite minimization problem," in SSBSE. Springer, 2012, pp. 45–59.
[23] N. Eén and N. Sorensson, "Translating pseudo-boolean constraints into SAT," Journal on Satisfiability, Boolean Modeling and Computation, vol. 2, pp. 1–26, 2006.
[24] A. Yamada, T. Kitamura, C. Artho, E.-H. Choi, Y. Oiwa, and A. Biere, "Optimization of combinatorial testing by incremental SAT solving," in 2015 IEEE 8th International Conference on Software Testing, Verification and Validation (ICST). IEEE, 2015, pp. 1–10.
[25] M. Janota, V. Kuzina, and A. Wasowski, "Model construction with external constraints: An interactive journey from semantics to syntax," in MoDELS. Springer, 2008, pp. 431–445.
[26] M. Mendonca, A. Wasowski, and K. Czarnecki, "SAT-based analysis of feature models is easy," in Proceedings of the 13th International Software Product Line Conference. Carnegie Mellon University, 2009, pp. 231–240.
[27] C. Nie and H. Leung, "A survey of combinatorial testing," ACM Computing Surveys (CSUR), vol. 43, no. 2, p. 11, 2011.
[28] L. De Moura and N. Bjørner, "Z3: An efficient SMT solver," in International Conference on Tools and Algorithms for the Construction and Analysis of Systems. Springer, 2008, pp. 337–340.
[29] R. Bruttomesso, A. Cimatti, A. Franzén, A. Griggio, and R. Sebastiani, "The MathSAT 4 SMT solver," in International Conference on Computer Aided Verification. Springer, 2008, pp. 299–303.
[30] N. Bjørner, A.-D. Phan, and L. Fleckenstein, "νZ - an optimizing SMT solver," in International Conference on Tools and Algorithms for the Construction and Analysis of Systems. Springer, 2015, pp. 194–199.
[31] M. Heule, M. Järvisalo, and T. Balyo, "SAT competition," SAT, 2017.
[32] X. Jia, C. Ghezzi, and S. Ying, "Enhancing reuse of constraint solutions to improve symbolic execution," in Proceedings of the 2015 International Symposium on Software Testing and Analysis. ACM, 2015, pp. 177–187.
[33] W. Visser, C. S. Pasareanu, and R. Pelánek, "Test input generation for Java containers using state matching," in Proceedings of the 2006 International Symposium on Software Testing and Analysis. ACM, 2006, pp. 37–48.
[34] S. B. Akers, "Binary decision diagrams," IEEE Trans. Computers, no. 6, pp. 509–516, 1978.
[35] B. Selman, H. A. Kautz, B. Cohen et al., "Local search strategies for satisfiability testing," Cliques, Coloring, and Satisfiability, vol. 26, pp. 521–532, 1993.
[36] Y. Mansour, N. Nisan, and P. Tiwari, "The computational complexity of universal hashing," Theoretical Computer Science, vol. 107, no. 1, pp. 121–133, 1993.
[37] R. Feldt, S. Poulding, D. Clark, and S. Yoo, "Test set diameter: Quantifying the diversity of sets of test cases," in 2016 IEEE International Conference on Software Testing, Verification and Validation (ICST). IEEE, 2016, pp. 223–233.
[38] M. Li and P. Vitányi, An Introduction to Kolmogorov Complexity and Its Applications. Springer Science & Business Media, 2013.
[39] S. Bochkanov and V. Bystritsky, "ALGLIB," available from www.alglib.net, vol. 59, 2013.
[40] D. E. Goldberg and J. H. Holland, "Genetic algorithms and machine learning," 1988.
[41] D. Golovin, B. Solnik, S. Moitra, G. Kochanski, J. Karro, and D. Sculley, "Google Vizier: A service for black-box optimization," in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2017, pp. 1487–1495.
[42] V. Nair, Z. Yu, T. Menzies, N. Siegmund, and S. Apel, "Finding faster configurations using FLASH," IEEE TSE, 2018.
[43] R. Krishna, V. Nair, P. Jamshidi, and T. Menzies, "Whence to learn? Transferring knowledge in configurable systems using BEETLE," IEEE Transactions on Software Engineering, pp. 1–1, 2020.
[44] C. E. Shannon, "A mathematical theory of communication," Bell System Technical Journal, vol. 27, no. 3, pp. 379–423, 1948.
[45] A. Perez, R. Abreu, and A. van Deursen, "A test-suite diagnosability metric for spectrum-based fault localization approaches," in 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE). Los Alamitos, CA, USA: IEEE Computer Society, May 2017, pp. 654–664.
[46] T. J. Ostrand, E. J. Weyuker, and R. M. Bell, "Where the bugs are," in Proceedings of the 2004 ACM SIGSOFT International Symposium on Software Testing and Analysis, ser. ISSTA'04, 2004, pp. 86–96.
[47] S. Deng, Z. Kong, J. Bian, and Y. Zhao, "Self-adjusting constrained random stimulus generation using splitting evenness evaluation and XOR constraints," in 2009 Asia and South Pacific Design Automation Conference. IEEE, 2009, pp. 769–774.
[48] D. J. Hand and N. M. Adams, "Data mining," Wiley StatsRef: Statistics Reference Online, pp. 1–7, 2014.
[49] M. A. Gulzar, S. Mardani, M. Musuvathi, and M. Kim, "White-box testing of big data analytics with complex user-defined functions," in FSE'19, 2019, pp. 290–301.
[50] J. Chen, V. Nair, R. Krishna, and T. Menzies, "'Sampling' as a baseline optimizer for search-based software engineering," IEEE Transactions on Software Engineering, 2018.
[51] J. Chen, V. Nair, and T. Menzies, "Beyond evolutionary algorithms for search-based software engineering," Information and Software Technology, vol. 95, pp. 281–294, 2018.
[52] T. Menzies and J. Richardson, "XOMO: Understanding development options for autonomy," in COCOMO Forum, vol. 2005, 2005.
[53] J. Chen and T. Menzies, "RIOT: A stochastic-based method for workflow scheduling in the cloud," in 2018 IEEE 11th International Conference on Cloud Computing (CLOUD). IEEE, 2018, pp. 318–325.
[54] A. Agrawal, W. Fu, D. Chen, X. Shen, and T. Menzies, "How to 'DODGE' complex software analytics," IEEE Transactions on Software Engineering, pp. 1–1, 2019.

Jianfeng Chen holds a Computer Science Ph.D. from North Carolina State University. His research mainly focuses on utilizing machine learning techniques to optimize software engineering/system infrastructure performance. He now works as a research scientist at Facebook, New York. For more information, please visit https://jianfeng.us/.

Xipeng Shen is a Professor in the Computer Science Department at North Carolina State University, USA. He is a recipient of the DOE Early Career Award, the NSF CAREER Award, the Google Faculty Research Award, and the IBM CAS Faculty Fellow Award. He is an ACM Distinguished Member, an ACM Distinguished Speaker, and a senior member of IEEE. He was honored with the University Faculty Scholars Award. His primary research interest lies in the fields of Programming Systems and Machine Learning, emphasizing inter-disciplinary problems and approaches.
Tim Menzies (IEEE Fellow, Ph.D. UNSW, 1995) is a Professor in computer science at NC State University, USA, where he teaches software engineering, automated software engineering, and programming languages. His research interests include software engineering (SE), data mining, artificial intelligence, search-based SE, and open access science. For more information, please visit http://menzies.us.