Efficient Dependency Detection for Safe Test Acceleration

Jonathan Bell, Gail Kaiser Eric Melski, Mohan Dattatreya Columbia University Electric Cloud, Inc 500 West 120th St 35 S Market Street New York, NY USA San Jose, CA USA {jbell,kaiser}@cs.columbia.edu {ericm,mohan}@electric-cloud.com

ABSTRACT Our study of popular open source Java programs echoes Slow builds remain a plague for software developers. The fre- these results, finding projects that take hours to build, with quency with which code can be built (compiled, tested and most of that time spent testing. Even in cases of projects packaged) directly impacts the productivity of developers: that build in a more manageable amount of time — for ex- longer build times mean a longer wait before determining if ample, five to ten minutes — faster builds can result in a a change to the application being built was successful. We significant increase in productivity due to less lag-time for have discovered that in the case of some languages, such as test results. Java, the majority of build time is spent running tests, where To make testing faster, developers may turn to techniques dependencies between individual tests are complicated to such as Test Suite Minimization (which reduce the size of discover, making many existing test acceleration techniques a test suite, for instance by removing tests that duplicate unsound to deploy in practice. Without knowledge of which others) [11, 12,23,24,27,28,37,39], Test Suite Prioritization tests are dependent on others, we cannot safely parallelize (which reorders tests to run those most relevant to recent the execution of the tests, nor can we perform incremen- changes first) [14, 15, 35, 36, 38], or Test Selection [17, 25, 32] tal testing (i.e., execute only a subset of an application’s (which selects tests to execute that are impacted by recent tests for each build). The previous techniques for detecting changes). Alternatively, given a sufficient quantity of cheap these dependencies did not scale to large test suites: given a computational resources (e.g. Amazon’s EC2), we might test suite that normally ran in two hours, the best-case run- hope that we could reduce the amount of wall time needed ning scenario for the previous tool would have taken over to run a given test suite even further by parallelizing it. 422 CPU days to find dependencies between all test meth- All of these techniques involve executing tests out of or- ods (and would not soundly find all dependencies) — on the der (compared to their typical execution — which may be same project the exhaustive technique (to find all depen- random but is almost always alphabetically), making the dencies) would have taken over 10300 years. We present a assumption that individual test cases are independent. If novel approach to detecting all dependencies between test some test case t1 writes to some persistent state, and t2 de- cases in large projects that can enable safe exploitation of pends on that state to execute properly, we would be unable parallelism and test selection with a modest analysis cost. to safely apply previous work in test parallelization, selec- tion, minimization, or prioritization without knowledge of Categories and Subject Descriptors this dependency. Previous work by Zhang et al. has found that these dependencies often come as a surprise and can D.2.5 [Software Engineering]: Testing and Debugging cause unpredictable results when using common test priori- Keywords tization algorithms [40]. 
Test dependence, detection algorithms, empirical studies This assumption is part of the controlled regression testing assumption: given a program P and new version P 0, when 1 Introduction P 0 is tested with test case t, all factors that may influence 0 Slow builds remain a hinderance to continuous integration the outcome of this test (except for the modified code in P ) and deployment processes, with the majority of building remain constant [34]. This assumption is key to maintaining time often spent in the testing phase. Our industry part- the soundness of techniques that reorder or remove tests ners confirm previous results reported in literature [35]: test from a suite. In the case of test dependence, we specifically suites frequently take several hours to run — often over a day assume that by executing only some tests, or executing them — making it hard to run them with the desired frequency. in a different order, we are not effecting their outcome (i.e., that they are independent). One simple approach to accelerating these test suites is to ignore these dependencies, or hope that developers specify Permission to make digital or hard copies of part or all of this work for them manually. However, previous work has shown that personal or classroom use is granted without fee provided that copies are inadvertently dependent tests exist in real projects, can take not made or distributed for profit or commercial advantage and that copies significant time to identify, and pose a threat to test suite bear this notice and the full citation on the first page. Copyrights for third- correctness when applying test acceleration techniques [29, party components of this work must be honored. For all other uses, contact 40]. Zhang et al. show that dependent tests are a serious the Owner/Author. Copyright is held by the owner/author(s). ESEC/FSE ’15, August 31 - September 04, 2015, Bergamo, Italy problem, finding in a study of five open source applications ACM 978-1-4503-3675-8/15/08. 96 tests that depend on other tests [40]. In our own study we nique can allow for sound test acceleration. While we im- found many large test suites in popular open source software plement ElectricTest in Java, we believe that we present a do not isolate their tests, and hence, may potentially have sufficiently general approach that should be applicable to dependencies. other memory-managed languages. Our new approach and tool, ElectricTest, detects depen- dencies between test cases in both small and large, real- 2 Motivation world test suites.ElectricTest monitors test execution, de- To motivate our work, we set out to answer three motivating tecting dependencies between tests, adding on average a 20x questions to ground our approach: slowdown to test execution when soundly detecting depen- MQ1: For those projects that take a long time to build, dencies. In comparison, we found that the previous state of what component of the build dominates that time? the art approach applied to these same projects showed an MQ2: Are existing test acceleration approaches safe to average slowdown of 2,276x (using an unsound heuristic not apply to real world, long running test suites? guaranteed to find all dependencies), often requiring more MQ3: Can the state of the art in test dependency detec- than 10308 times the amount of time needed to run the test tion be practically used to safely apply test accelera- suite normally in order to exhaustively find all dependencies. tion to these long running test suites? 
Moreover, the existing technique does not point developers 2.1 A Study of Java Build Times to the specific code causing dependencies, making inspection In our previous work [8], we studied 20 open source Java and analysis of these dependencies costly. applications to determine the relative amount of build time With ElectricTest, it becomes feasible to soundly perform spent testing, finding testing to consume on average 78% of test parallelization and selection on large test suites. Rather build time. The longest of these projects took approximately than detect manifest dependencies (i.e., a dependency that 40 minutes to build, while the shortest completed in under changes the outcome of a test case, the definition in previous one minute. Given a desire to target projects with very long work by Zhang et al, DTDetector [40]), ElectricTest detects build times, we wanted to make sure that those very long simple data dependencies and anti-dependencies (i.e., read- running builds were also spending most of their time in tests. over-write and write-over-read). Since not all data depen- If we are sure that most of the time spent building these dencies will result in manifest dependencies, our approach is projects is in the testing phase, then we can be confident inherently less precise than DTDetector at reporting “true” that a reduction in testing time will have a strong impact in dependencies between tests, though it will never miss a de- reducing overall build time. pendency that DTDetector would have detected. However, For this study, we downloaded the 1,966 largest and most in the case of long running test suites (e.g. over one hour), popular Java projects from the open source repository site, the DTDetector approach is not feasible. On popular open GitHub (those 1,000 with the most forks and stars overall, source software, we found that the number and type of de- and those 1,000 with the most forks over 300 MB, as of pendencies reported by ElectricTest allow for up to 16X December 23rd, 2014). From these projects, we searched for speedups in test parallelization. only those with tests (i.e., had files that had the word “test” Our key insight is that, for memory-managed languages, in their name), bringing our list to 921 projects. we can efficiently detect data dependencies between tests by Next, we looked at the different build management sys- leveraging existing efficient heap traversal mechanisms like tems used by each project: there are several popular build those used by garbage collectors, combined with filesystem systems for Java, such as ant, maven, and gradle. To mea- and network monitoring. For ElectricTest, test T depends 2 sure the per-step timing of building each of these projects, on test T if T reads some data that was last written by T . 1 2 1 we had to instrument the build system, and hence, we se- A system that logs all data dependencies will always report lected the most commonly used system in this dataset. We at least as many dependencies as a system that searches for looked for build files for five build systems: ant, maven, manifest dependencies. Our approach also provides addi- gradle, set, and regular Makefiles. Of these 921 projects, tional benefits to developers: it can report the exact line of the majority (599) used maven, and hence, we focused our code (with stack trace) that causes a dependency between study on only those projects using maven due to resource tests, greatly simplifying test debugging and analysis. limitations creating and running experiments. 
ElectricTest instruments all classes in the system under We utilized Amazon’s EC2 “m3.medium” instances, each test (including those provided in the core Java system li- running Ubuntu 14.04.1 and Maven 3.2.5 with 3.75GB of brary) to support a heap-walking analysis. During test exe- RAM, 14 GB of SSD disk space, and a one-core 2.5Ghz cution, ElectricTest observes the heap values, files, and net- Xeon processor. We tried to build each project first with work resources that are read and written, collecting these Java 1.8.0 40, and then fell back to Java 1.7.0 60 if the results and analyzing them to determine if a dependency newer version did not work (some projects required the lat- occurred. After this learning phase, we can safely perform est version while others didn’t support it). For each project, test parallelization or selection using the resulting depen- we first built it in its entirety without any instrumentation, dency chains produced by ElectricTest. and then we built it again from a clean checkout with our We applied ElectricTest to the test suites of 10 popu- instrumented version of Maven in “offline” mode (with exter- lar free open source applications, studying the dependen- nal dependencies already downloaded and cached locally). cies detected and the runtime overhead imposed by the de- If a project contained multiple maven build files, we exe- tection process. We found that ElectricTest finds at least cuted maven on the build file nearest the root of the reposi- as many dependencies as the state-of-the-art tool in many tory, and we did not perform any per-project configuration. orders of magnitude less time. We have studied the im- Of these 599 projects, we could successfully build 351. pact of ElectricTest on test parallelization and test selection Table 1 shows the three longest build phases, first for all techniques, finding that its test dependence analysis tech- of these projects, and then filtering to only those projects Phase All Only projects building in: currently depend on each other, then tests may present false Projects >10 min >1 hour negatives or false positives (albeit deterministically between executions) when isolated. Test 41.22% 59.64% 90.04% We examined the 351 Java projects that we built, finding Compile 38.33% 26.25% 8.46% that 18 (or 5%) isolated all of their test classes, and 41 (or Package 15.49% 1.05% 12%) isolated at least some of their test classes (i.e., some Pre-Test 13.51% classes were isolated and others were grouped together and Table 1: Top three phases in Java builds. executed without isolation). The majority of projects did that took more than 10 minutes to build (69 projects), and not isolate their tests at all, and therefore are prone to test those that took more than one hour to build (8 projects). dependencies occurring, posing a risk to test acceleration. When looking across all projects, 41% of the build time (per This result differs from our 2013 study, which showed 41% project) was spent testing, and testing was the single most of 591 Java projects isolated their tests [7]. This study ex- time consuming build step. When eliminating the cases of amined only projects that built with maven, while our pre- projects with particularly short build times (those taking vious study (which was performed through a static analy- less than 10 minutes to execute all phases of the build), the sis of build scripts) examined both maven and ant-building average testing time increased significantly to nearly 60%. projects. 
In our previous study, we found that of our 591 In the eight cases of projects that took more than an hour to projects, only approximately 10% of those that used maven build, nearly all time (90%) is spent testing. Therefore, to to build and run their tests isolated some or all of their tests, answer MQ1, we find that testing dominates build times, es- a number much more similar to what we found here. pecially in long running builds. This conclusion underscores Due to the risks that they impose and ability to occur the importance of accelerating testing. (when tests aren’t isolated), our goal is to detect dependen- 2.2 Danger of Dependent Tests cies between test classes so that we can (1) inform existing test acceleration techniques of the dependencies to ensure Any test acceleration technique that executes only a sub- sound acceleration, and (2) provide feedback to developers set of tests, or executes them out of order (e.g., test paral- so that they are aware of dependencies that exist. lelization or test selection) is unsound in the presence of test dependencies. If the result of one test depends on the execu- 2.3 Feasibility of Existing Approaches tion of a previous test, then these techniques may cause false positives (tests that should fail but pass) or false negatives Finally, we study the existing state-of-the-art approach for (tests that should pass but fail). detecting dependencies between test cases to determine if it Zhang et al. studied the issue trackers of five popular open is feasible to apply to long-running test suites. source applications to determine if dependent tests truly ex- If we define a test dependence as the case where execut- ist and cause problems for developers [40]. They found a ing some set of tests T in a different order changes the re- total of 96 dependent tests, 95 of which would result in a sult of the test(s), then identifying test dependencies is NP- false negative when executed out of order (causing a test to Complete [40]. This definition for dependence (henceforth fail although it should pass), and one which produced a false referred to as a manifest dependence) is more narrow than positive when executed out of order (causing a test to pass ours (a distinction described later in §3), but is the definition when it should fail). Given that test dependencies exist and used in the state-of-the-art work by Zhang et al. [40]. can cause tests to behave incorrectly when executed out of To identify all manifest test dependencies in a suite of n order, we conclude that yes: dependent tests pose a risk to tests we would have to execute every permutation of those n existing test acceleration techniques. tests, requiring O(n!) test executions, clearly infeasible for If we isolate the execution of each of our test cases, then any reasonably large test suite. Moreover, such a technique dependencies would not be possible. In practice, tests are would only identify that tests are dependent, and not the typically written as single test methods, which are grouped specific resource or lines of code causing the dependence, into test classes, which are batched together into modules. making it difficult for developers who wish to examine or Typically each test method represents an atomic test, while remove the dependency. In our study that follows, we esti- test classes represent groups of tests that test the same com- mated that this exhaustive process often would take more ponent. 
The module separation occurs when a project is than 1 × 10308 times longer than running the test suite nor- split into modules, with a test suite for each module. mally. Zhang et al. propose two techniques to reduce the Since they are typically testing the same component, indi- number of test executions needed to detect manifest depen- vidual test methods are never isolated, although sometimes dent tests, both of which they acknowledge may not scale to test classes are isolated. Since they represent different mod- large test suites [40]. ules of code (that must compile separately), test modules In one approach, they reduce the search space to O(n2) by are always isolated in our experience. We are interested in suggesting that most dependencies manifest between only detecting dependencies both at the level of individual test two tests, with no need to consider every possible n size methods, and also test classes, which also are the same gran- permutation. However, this is incomplete: there may be ularity used by test selection and parallelization techniques. dependencies that only manifest when more than two tests For the remainder of this paper, when we refer to individual interact. They further reduce the search space by a constant tests, we will refer to test classes and test modules. factor (it is still an O(n2) algorithm) by only checking test One approach to solving the dependent test problem is combinations that share common resources (defined to be to simply isolate each test to ensure that no dependencies static fields and files). If two tests access (read or write) could occur (e.g., by executing each test in its own process, the same file or static field, then they are marked as sharing or by using our efficient isolation system VmVm [7]). How- a common resource, regardless of whether a true data de- ever, if the application does not isolate its tests, and tests pendency exists or not. Since this very coarse dependency Testing Pairwise Test Exhaustive Project Test Classes Test Methods Time Slowdown Test Slowdown (mins) Class Method Class Method camel 5,919 13,562 109.70 1,865X 8,045X *1E+308X *1E+308X crunch 62 243 17.58 65X 298X 54E+82X 54E+82X 2.3 § hazelcast 297 2,623 47.37 147X 2,536X *1E+308X *1E+308X jetty.project 554 5,603 20.08 35X 2,555X 2E+60X 2E+60X mongo-java-driver 58 576 74.25 58X 649X 4E+76X 4E+76X mule 2,047 10,476 117.45 250X 3,438X *1E+308X *1E+308X netty 289 4,601 62.95 11X 2,725X 62E+82X 62E+82X spring-data-mongodb 141 1,453 121.38 136X 1,715X 3E+230X 3E+230X tachyon 53 362 34.47 47X 397X 56E+56X 56E+56X titan 177 1,191 81.82 181X 398X *1E+308X *1E+308X Projects selected in Average 960 4,069 68.71 279X 2,276X *1E+308X *1E+308X joda-time 122 3,875 0.27 627X 418,016X 3E+204X *1E+308X xml security 19 108 0.37 59X 1,316X 47E+16X 15E+172X crystal 11 75 0.07 37X 763X 3E+8X 3E+108X synoptic 27 118 0.03 183X 3,497X 5E+28X 1E+194X

Zhang [40] Average 45 1,044 0.18 226X 105,898X 70E+202X *1E+308X Table 2: Testing time and statistics for the 10 longest-running test suites studied with unisolated tests, plus the 4 projects studied in previous work by Zhang et al. [40]. In addition to the normal testing time, we estimate the time that needed to run all pairwise combinations of tests, and the time needed to exhaustively run all combinations.* indicates a slowdown greater than 1 × 10308. detection will likely result in many false positives, Zhang A large slowdown appears in both long-building and fast et al. manually inspect each resource to determine if it is building projects. likely to cause a dependence, and if not, ignore it in this For the four projects previously studied by Zhang et al., process. This heuristic can limit the search space but it still the dependence-aware approach showed approximately one can remain large, and requires manual effort to rule out some order of magnitude less overhead. However, we were un- resource accesses that will not cause manifest dependencies. able to evaluate the dependence-aware technique on our ten Table 2 shows the estimated CPU time needed to de- projects due to technical limitations of the DTDetector im- tect the dependent tests in each of the ten longest building plementation: running it requires manual enumeration and projects from our dataset from §2.1 with unisolated tests, configuration of each test to run in the DTDetector test run- along with the four projects studied by Zhang et al. pre- ner. Given the manual effort required and that this heuristic viously [40]. This experiment was performed on Amazon is unsound, we chose not to implement it for our ten projects. EC2 “r3.xlarge” instances, each running Ubuntu 14.04.1 and As expected, there is no situation in the projects that Maven 3.2.5 with 4 virtualized Intel Xeon X5-2670 v2 2.5Ghz we studied where the fully exhaustive method (testing all CPUs, 30.5 GB of RAM and 80 GB of SSD storage. Subjects possible permutations) is feasible. Even in the cases of the ‘jetty’, ‘titan’ and ‘crunch’ were evaluated on OpenJDK Java more modest length test suites, the overhead of DTDetector 1.7.0 60 (the most recent version supported by the projects) is very high. We answer MQ3 and conclude that the existing, while the others were evaluated on OpenJDK Java 1.8.0 40. state of the art approach for detecting dependencies between In the case of the four small projects previously studied tests can not scale to detect dependencies in the wild, except by Zhang et al, we used their publicly available tool to calcu- when using unsound heuristics on the very smallest of test late the pairwise testing time, and estimated the exhaustive suites that took less than a minute to execute normally. testing time. In the case of our ten projects, we estimate all times, due to scaling limitations of the DTDetector tool. 3 Detecting Test Dependencies We estimated all times using the following approach: first we While previous work in the area has focused on detecting measured the time to run each test normally and then we cal- manifest dependencies between tests [40], we focus instead culated the permutations of tests to run for each module of on a more general definition of dependence. 
For our pur- each project (most of these projects had many modules with poses, if T2 reads some value that was last written by T1, tests, and since tests from different modules were isolated, then we say that T2 depends on T1 (i.e., there is a data de- there was no need to include permutations cross-module). pendence). If some later test, T3 writes over that same data, We added a constant time of 1 second to each combination then we say that there is an anti-dependence between tests of tests executed to account for the time needed to start T2 and T3: T3 must never run between T1 and T2. Note and stop the JVM and system under test (a conservative that any two tests that are manifest dependent will also be estimate based on our prior results [7]). dependent by our definition, but two tests that have a data We compare this projected time to the actual time needed dependence may not have a manifest dependence. to run the test suite in its normal configuration, presenting Consider the case of a simple utility function that caches the slowdown as TDT Detector/Tnormal. Even the pairwise the current formatted timestamp at the resolution of sec- heuristic (examining only every 2-pair of tests, rather than onds so that multiple invocations of the method in the same all possible permutations) can be cost prohibitive: adding an second returns the same formatted string. If the date for- overhead of up to 418,016X (minimum 298X for test meth- matter has no side-effects, then we can surmise that if multi- ods), even though there is no guarantee of its correctness. ple tests call this method, while there is a data dependency between them (since the cache is reused), this dependence ject which has other instance fields. If we simply recorded won’t in and of itself influence the outcome of any tests. only accesses to static fields (as our previous work, VmVm Hence, there will be no manifest dependence between these did [7]), we wouldn’t be able to detect all data dependen- tests even though there is a data dependence. cies, since we wouldn’t be able to follow all pointers. With While detecting manifest dependencies between tests may VmVm, we were forced to treat all static field reads as writes, require executing every possible permutation of all tests, de- since a test might read a static field to get a pointer to some tecting data dependencies (that may or may not result in other part of the heap, then write that other part (indirectly manifest dependencies) requires that each test is executed writing the area referenced by the static field). only once. ElectricTest detects dependencies by observing Our key insight is that we can efficiently detect these global resources read and written by each test, and reports dependencies by leveraging several powerful features that any test Tj that reads a value last written by test Ti as de- already exist in the JVM: garbage collection and profil- pendent. ElectricTest also reports anti-dependencies, that ing. The high level approach that ElectricTest uses to ef- is, other tests Tk that write that same data after Tj , to en- ficiently detect in-memory dependencies between test cases sure that Tk is not executed between Ti and Tj . is twofold. At the end of each test execution, we force a ElectricTest consists of a static analyzer/instrumenter and garbage collection and mark any reachable objects (that a runtime library. Before tests are run with ElectricTest, weren’t marked as written yet) as written in this test case. 
all classes in the system under test (including its libraries) During the following test executions, we monitor all accesses are instrumented with heap tracking code (at the bytecode to the marked objects, and if a test reads an object that was level — no access to source code is required). In principle, written during a previous test, we tag it as having a depen- this instrumentation could occur on-the-fly during testing dence on the last test that wrote that object. This method is as classes are loaded into JVM, however, we perform the similarly used to detect anti-dependencies (write after read). instrumentation offline for increased performance, as many ElectricTest heavily leverages the JVM Tooling Interface external library classes may remain constant between dif- (JVMTI), which provides support for internal monitoring ferent versions of the same project. This process is fairly and is used for implementing profilers and debuggers that fast though: analyzing and instrumenting the 67,893 classes interact with the JVM [31]. While it would be possible (and in the Java 1.8 JDK took approximately 4 minutes on our likely more performant) to implement ElectricTest by modi- commodity server. ElectricTest detects dynamically gener- fying certain aspects of the JVM directly (e.g. to piggy-back ated classes that are loaded when testing (which were not generation counters already used for garbage collection to statically instrumented) and instruments them on the fly. track which test wrote an object), we chose instead to use During test execution, the ElectricTest runtime monitors the standard JVMTI interface so that ElectricTest will not heap accesses to detect dependencies between tests. require a specialized JVM: it functions on commodity JVMs Dependencies between tests can arise due to shared mem- such as Oracle’s HotSpot or OpenJDK’s IcedTea. ory, shared files on a filesystem, or shared external resources Aside from bytecode instrumentation, ElectricTest uti- (e.g. on a network). ElectricTest’s approach for file and net- lizes three key functions of the JVMTI API: heap walking, work dependency detection is simple: it maintains a list of heap tagging, and heap access notifications. The heap walk- files and network socket addresses that are read and written ing mechanism provides a fairly efficient means whereby we during each test. ElectricTest leverages Java’s built in IO- can visit every object on the heap, descending from root Trace support to track file and network access. Efficiently nodes down to leaves. Heap tagging allows us to associate detecting in-memory dependencies is much more complex, objects with arbitrary 64-bit tags, useful for storing infor- and we focus our discussion to this technique next. mation about the status of each object (i.e., which test last read and wrote it) and each static field. Finally, heap access 3.1 Detecting In-Memory Dependencies notifications allows us to register callbacks for the JVM to notify ElectricTest when specific objects or their primitive To detect dependencies in memory between test cases, Elec- fields are written or read (when instance fields are accessed). tricTest carefully examines reads and writes to heap mem- Observing New Heap Writes. For each test execution, ory. Recall that Java is a memory managed language, where ElectricTest needs to be able to efficiently determine what it is impossible to directly address memory. Simply put, the part of the heap was written by that test. 
We have optimized heap can be accessed through pointers to it that already ex- this process for cases where the majority of data created on ist on the stack, or via static fields (which reside in the the heap is not shared between tests (which we have found to heap and can be directly referenced). be a common case). As objects are created on the heap dur- At the start of each test, we’ll assume that the test runner ing test execution, ElectricTest does nothing. At the end of (which is creating these tests) does not pass (on the stack) each test, after performing a garbage collection, ElectricTest references to heap objects, or at least not the same reference uses JVMTI to scan for all objects that have no tag associ- to multiple tests. We easily verified this safe assumption, ated with them (i.e., those not yet tagged by ElectricTest). as there is typically only a single test runner that’s shared Each untagged object is tagged with a counter indicating between all projects using that framework (e.g. JUnit, which that it was created in the current test case. Objects are also creates a new instance of each test class for each test). tagged with the list of static fields from which they are ac- Therefore, our possible leakage points for dependencies cessible. Since the only objects that still exist after the test between tests will arise through static fields. static fields completes are those that can be shared between tests, this are heap roots: they are directly accessed, and therefore the method avoids unnecessarily tagging and tracking objects level of granularity at which we detect dependencies. that can’t be part of dependencies. Unfortunately, to soundly detect all possible dependen- Observing Heap Reads and Writes of Old Data. cies, it is insufficient to simply record accesses to static Aside from references on the stack, data on the JVM’s heap fields, since each static field may in turn point to some ob- is accessed through fields of objects, static fields of classes, section, the coarse grained approach is sufficient to allow for or array elements. The easiest type of heap access to ob- reasonable test suite acceleration in the projects we studied. serve is to the fields of objects, which ElectricTest accom- 3.3 Reporting and Debugging Dependencies plishes through JVMTI’s field tracking system. For each class of object created in a previous test but still reachable Once all dependencies have been detected, ElectricTest can in the current test, ElectricTest registers a callback through be used to help developers analyze and inspect them. While JVMTI to be notified whenever any fields of those objects manual inspection is not required for sound test acceleration are read or written. (the following section will describe how ElectricTest does When an object is read or written, ElectricTest checks its this automatically), we imagine that in some cases develop- tag to see if it was last written or read in a previous test: ers will want to understand the dependencies between tests if so, then it is marked as causing a dependency on the last in their projects. For instance, perhaps some dependencies test that wrote that object, and we note this dependence to may be indicative of incorrectly written tests. We expect report at the end of the test. This technique will detect both that in some cases developers may want to investigate de- data dependencies (read after write) and anti-dependencies pendencies to make sure that they are intentional. 
Alterna- (write after read), reporting them independently. tively, developers may want to mark some dependencies as Detecting reads and writes of static fields and array ele- benign: perhaps multiple tests intentionally share resources, ments is more complicated, as there is no similar callback but do so in a way that doesn’t create a functional depen- to use. Instead, ElectricTest relies on bytecode instrumen- dence. For instance, we have seen many test suites that tation, modifying the bytecode of every class that executes intentionally share state between tests to reduce setup time: to directly notify ElectricTest of reads and writes. In its each test checks if the shared state is established and if not, instrumentation phase, ElectricTest employs an intraproce- initializes it, and resets it to this initial state when done. dural data flow analysis to reduce the number of redundant ElectricTest supports developers analyzing dependencies calls that it makes to record reads and writes on the same by providing a complete stack-trace showing how a depen- value by inferring which arrays and fields have already been dency occurred. Stack traces are trivially collected when a read or written before each instruction. ElectricTest also dy- test reads data previously written since ElectricTest is de- namically detects reads and writes through Java’s reflection tecting data dependencies in real-time, within the executing interface by intercepting all calls to the reflection API and program and JVM. In this way, ElectricTest provides signif- adding a call to the ElectricTest runtime library to record icantly more information than previous work in test depen- the access. dency detection [40], which could only report that two tests Detecting Dependencies at Static Fields. At the end were dependent. of each test, we perform a heap walk, rooted at every static 3.4 Sound Test Acceleration field, visiting all objects that are reachable from each static Given the list of dependencies between tests, we can soundly field. We maintain a simple stop-list of common static fields apply existing test acceleration techniques such as paral- within the Java API that are manually-verified as determin- lelization and prioritization. Naive approaches to both are istically written and hence may be ignored in this process. straightforward, but may be sub-optimal. For instance, a These fields include fields such as System.out, which is the naive approach for running a given test is to ensure that all stream handler for standard output. While it is possible to tests that must run before it have just run (in order). modify these fields to point to a different object, their de- Haidry and Miller proposed several techniques for effi- fault value is always deterministically created. Therefore, a ciently and effectively prioritizing test suites in light of de- dependence on the default value of one of these fields can pendencies [22]. Rather than consider prioritization met- safely be ignored (a dependence on the non-default value is rics (e.g. line coverage) for a single test, entire dependency not ignored), since we can assume that this field would have graphs are examined at once. Their techniques are agnos- the same value independent of which test first accessed it. 
tic to the dependency detection method (relying on manual This mechanism also allows developers to easily filter specific specification in their work), and would be easily adapted to fields that are known to be data-dependent between execu- consume the dependencies output by ElectricTest. tions, but benign (e.g. internals of logging mechanisms). A We propose a simple technique to improve parallelization simple configuration file maintains a stop-list of fields for of dependent tests based on historical test timing informa- ElectricTest to ignore. ElectricTest marks all objects at the tion. Our optimistic greedy round-robin scheduler observes end of each test with the list of static fields that point to it. how long each test takes to execute and combines this data with the dependency tree to opportunistically achieve par- 3.2 Detecting External Dependencies allelism. Consider the simple case of ten tests, each of which take 10 minutes to run, all dependent on a single other test ElectricTest leverages the JVM’s built-in IOTrace features that takes only 30 seconds to run (but not dependent on to track access to external resources. When code attempts to each other). If we have 10 CPUs to utilize, we can safely access a file or socket, ElectricTest gets a callback identifying utilize all resources by first running the single test that the which file or network address is being accessed. All tests that others are dependent on on each CPU (causing it to be ex- access the same file or socket are marked as dependent. This ecuted 10 times total), and then run one of the remaining relatively coarse approach is based on our observation that 10 tests on each of the 10 CPUs. The testing infrastructure tests infrequently share access to the same file or network can then filter the unnecessary executions from reports. resource — or that if they do, they are dependent. While ElectricTest generates schedules for parallel execution of it would be possible to have a finer grained approach to tests using a greedy version of this algorithm, re-executing detecting these dependencies (e.g. by tracing the exact data a single test multiple times on multiple CPUs when doing read and written), as our evaluation shows in the following so would decrease wall time for execution. ElectricTest also Testing Time (Seconds) ElectricTest # of Tests DTDetector Speedup vs Project Classes Methods Baseline All 2-pair Dep-Aware Pairs Exhaustive ElectricTest Dep-Aware Joda 122 3875 16 *6,688,250 *657,144 *1E+308 2122 310X XMLSecurity 19 108 22 28,958 5,500 *3E+174 57 96X Crystal 11 75 4 3,050 874 *14E+108 22 40X Synoptic 27 118 2 6,993 2,070 *2E+194 34 61X Table 3: Dependency detection times for DTDetector and ElectricTest using the same subjects evaluated in [40]. We show the baseline runtime of the test suite as well as the running time for three configurations of DTDetector: the 2-pair algorithm, the dependence-aware 2-pair algorithm, and the exhaustive algorithm. Execution times for DTDetector on Joda and with the Exhaustive algorithm (marked with *) are estimations based on the same methodology used by the authors of DTDetector [40]. Dependencies ET Shared 4.1 Accuracy ET Resource Locations We evaluated the accuracy of ElectricTest by comparing the Project DTD W R App Code JRE Code dependencies detected between test methods with those de- Joda 2 15 121 39 12 tected by Zhang et al.’s tool, DTDetector [40]. Table 4 shows XMLSecurity 4 3 103 3 15 the dependencies detected by each tool. 
ElectricTest de- Crystal 18 15 39 4 19 tected all of the same dependencies identified by DTDetec- Syntopic 1 10 117 3 14 tor, plus some additional dependencies. We therefore con- clude that ElectricTest’s recall is at least as good as the Table 4: Dependencies detected by DTDetector existing tool, DTDetector. (DTD) and ElectricTest (ET). For ElectricTest, we We can directly attribute the additional dependencies to group dependencies, into tests that write a value the different definition of dependencies employed by the two which others read (W) and tests that read a value systems: ElectricTest detects all data dependencies, whereas written by a previous test (R). DTDetector detects tests that have different outcomes when executed in a different order (manifest dependencies). Since not all data dependencies will result in manifest dependen- can speculatively parallelize test methods by breaking sim- cies, we expect that ElectricTest reports more dependencies ple dependencies. When one test depends on a simple (i.e., than DTDetector. primitive) value from another test, ElectricTest will allow For the purposes of test acceleration, given the computa- the dependent test to run separately from the test it de- tional ability to execute DTDetector on the test suite under pends on, simulating the dependent value. If ElectricTest scrutiny, it may still be preferable to use it over ElectricTest. runs the test writing that value and finds that it writes a However, as discussed in §2.3, it is often infeasible to use DT- different value than was replayed, the pair of tests are re- Detector on projects of reasonable size. Moreover, in cases executed serially. In our evaluation that follows, we show where developers want to debug a dependency, ElectricTest that most dependencies are on a small number of tests, al- would still be preferable over DTDetector (which would only lowing this simple algorithm to greatly reduce the longest tell the developer that two tests had a dependency). serial chain of tests to execute in parallel. Interestingly, all dependencies detected by ElectricTest were caused by shared accesses to a very small number of 4 Evaluation resources (static fields in this case). Most of the data de- We evaluated ElectricTest across three dimensions: accu- pendences were caused by references to static fields within racy, runtime performance, and impact on test acceleration the JRE made by JRE code (not by application code), and techniques. For accuracy, we compare the dependencies de- all such references were to fields of primitive types, allowing tected by ElectricTest to the state-of-the-art tool, DTDe- for the opportunistic parallelism described in §3.4. This is tector [40]. In terms of performance, we measured the over- also interesting in that it may make manual investigation of head of running ElectricTest on Java test suites compared dependencies by developers easier: even if many tests are to the normal running time of the test suite. Given that dependent, the number of actual resources shared is small. ElectricTest may report non-manifest dependencies (that is, 4.2 Overhead those that need not be respected in order to maintain the integrity of the test suite), we are particularly interested in We evaluated the overhead of ElectricTest on the same four the impact of ElectricTest’s detected dependencies on test subjects studied by Zhang et al. [40], in addition to the ten acceleration. To determine the impact of ElectricTest on large Java projects described earlier in §2.3. 
test acceleration, we measured the longest chain of data de- We reproduced Zhang et al.’s experiments [40] in our en- pendencies and number of anti-dependencies in each of these vironment to provide a direct performance comparison be- large test suites to identify how effective test parallelization tween the two tools. Table 3 shows the runtime of the and selection could be when respecting the dependencies au- same four test suites evaluated by Zhang et al., present- tomatically detected by ElectricTest. ing the baseline test execution time, the DTDetector exe- All of these experiments were performed in the same envi- cution time and the ElectricTest execution time. None of ronment as our previous experiment in §2.3: Amazon EC2 the DTDetector algorithms we studied provided reasonable r3.xlarge instances with 4 2.5Ghz CPUs and 30.5 GB of performance on the Joda test suite, and the exhaustive tech- RAM (more details on this environment are in §2.3). nique was infeasible in all cases. Even in the case of the # Resources in- Analysis Analysis Test Dependencies volved at level: Number of Tests Time Relative Classes Methods Project Classes Methods (Min) Slowdown WRCWRC Classes Methods camel 5,919 13,562 2,449 22.3X 1,977 3,465 1,356 4,790 8,399 1,695 4,944 5,490 crunch 62 243 165 9.4X 9 20 6 18 43 18 190 207 hazelcast 297 2,623 1,780 37.6X 174 200 186 1,163 1,261 1,482 941 1,020 jetty.project 554 5,603 184 9.2X 223 261 54 4,016 4,079 424 713 828 mongo-java-driver 58 576 103 1.4X 36 36 34 342 362 357 32 33 mule 2,047 10,476 9,698 82.6X 185 859 119 2,049 6,279 1,400 11,844 12,387 netty 289 4,601 338 5.4X 128 120 63 2,928 3,297 2,926 640 1,104 spring-data-mongodb 141 1,453 364 3.0X 114 130 110 1,407 1,404 1,401 1,469 1,489 tachyon 53 362 89 2.6X 9 13 9 55 93 13 125 157 titan 177 1,191 2,262 27.7X 118 126 46 429 877 40 1,433 1,562 Average 960 4,069 1,743 20.0X 297 523 198 1,720 2,609 976 2,233 2,428 Table 5: Dependencies found by ElectricTest on 10 large test suites. We show the number of tests in each suite, the time necessary to run the suite in dependency detection mode, the relative overhead of running the dependency detector (compared to running the test suite normally), the number of dependent tests, and the number of resources involved in dependencies. For dependencies, we report the number of tests that write a resource that is later read (W), the number of tests that read a previously written resource (W), and the longest serial chain of dependencies (C). For dependencies and resources in dependencies, we report our findings at the granularity of test classes and test methods. dependence-aware optimized heuristics (which is not guar- a complete garbage collection and heap walk had to occur anteed to detect all dependencies), ElectricTest still ran sig- after each test finishes, test suites consisting of a lot of very nificantly faster than DTDetector. fast-executing tests (like in ‘mule’) had a greater slowdown. However, these test suites were all very small, with the We believe that ElectricTest’s overhead makes it feasible longest taking only 22 seconds to run. We applied Elec- to use in practice, and note that it is still much less than tricTest to the ten large open source projects with unisolated DTDetector’s, the previous system for detecting test depen- tests previously discussed in §2.3, recording the number of dencies [40]. dependencies detected and the time needed to run the tool. 
Table 5 shows the results of this study, showing the num- 4.3 Impact on Acceleration ber of test classes and test methods in each project, along Our approach may detect dependencies between tests that with the time needed to detect dependencies, the relative do not effect the outcome of tests. That is, two tests may slowdown of dependency detection compared to normal test have a data dependency, but this dependency may be com- execution, and the number of dependent tests detected. For pletely benign to the control flow of the test. dependencies, we report dependence at both the level of Therefore, we take special care to evaluate the impact of test classes and test methods (in the case of test classes, dependencies detected by ElectricTest on test acceleration we report the dependencies between entire test classes, and techniques, notably, test parallelization. In the extreme, if not the dependencies between the methods in the same test ElectricTest found data dependencies between every single class). We report the number of tests writing values (W) test, techniques like test parallelization or test selection yield that are later read by dependent tests (R), as well as the no benefits, since it would be impossible to change the order size of the longest serial chain (C). Finally, we report the that tests ran in while preserving the dependencies. distinct number of resources involved in dependencies be- We first evaluate the impact of ElectricTest on test ac- tween tests, both at the test class and test method level. celeration techniques by examining the longest dependency In general, far more tests caused dependencies (i.e., wrote chain detected in each project, shown under the heading a shared value) than were dependent (i.e., read a shared ‘C’ in Table 5. In almost all projects, even if there were value). The longest critical path was fairly short (relative to many dependent tests, the longest critical path was very the total number of tests) in almost all cases, indicating that short compared to the total number of tests. For example, test parallelization or selection may remain fairly effective. while 4,079 of the 5,603 test methods in the jetty test suite Given infinite CPUs to parallelize test execution across, the depended on some value from a previous test, the longest maximum speedup possible is restricted by this measure. dependency chain was 424 methods long. Across all of the ElectricTest imposed on average a 20X slowdown com- projects, the average maximum dependency chain between pared to running the test suite normally to detect all de- test methods was 976 of an average 4,069 test methods and pendencies between test methods or classes. In comparison, 198 between an average of 960 test classes. We find this we calculated that on the same projects, DTDetector (us- result encouraging, as it indicates that test selection tech- ing the pairwise testing heuristic) would impose on average niques can still operate with some sensitivity while preserv- a 2,276X slowdown when considering dependencies between ing detected dependencies. test methods, or 279X between entire test classes (Table 3). To quantify the impact of ElectricTest’s automatically de- ElectricTest’s overhead fluctuates with heap access patterns tected dependencies on test parallelization we simulated the — in test suites that share large amounts of heap data be- execution of each test suite running in parallel on a 32-core tween tests, ElectricTest is slower. 
The overhead also fluctu- machine, distributing tests in a round-robin fashion in the ated somewhat with the average test method duration: since same order they would typically run in. In this environment, Naive ET Greedy ET Unsound testing time. Thanks to its integration with the JVM, Elec- Project C M C M C M tricTest can easily be configured by developers to ignore par- camel 4.6 6.5 6.9 9.8 16.1 18.0 ticular dependencies at the level of static or instance fields. crunch 4.8 3.0 7.8 3.1 12.4 15.5 hazelcast 1.2 2.1 1.2 2.6 12.9 20.0 4.4 Discussion and Limitations jetty.project 6.1 6.9 6.1 6.9 9.5 17.0 There are several limitations to our approach and implemen- mongo-java-driver 1.8 1.6 1.8 1.6 8.2 27.8 tation. Because we detect dependencies of code running in mule 9.6 7.9 9.6 13.8 17.0 18.0 the JVM, we may miss some dependencies that occur due to netty 2.8 6.7 2.8 6.7 2.9 7.0 native code that is invoked by the test. While ElectricTest spring-data-mongodb 1.2 0.8 1.2 0.8 3.5 26.5 can detect Java field accesses from native code (through the tachyon 3.9 4.3 3.9 15.5 5.3 26.9 use of field watches), it can not detect file accesses or array titan 3.8 6.3 3.8 6.3 8.0 18.2 accesses from native code. However, none of the applica- Average 4.0 5.0 5.0 7.0 10.0 19.0 tions that we studied contained native libraries. It would be Table 6: Relative speedups from parallelizing each possible to expand our implementation to detect and record app’s tests. Shown at the test class (C) and test these accesses by performing load-time patching for calls to method (M) while respecting ElectricTest-reported JNI functions for array accesses and system calls for file ac- dependencies (with the naive scheduler and the cess to record the event. greedy scheduler) in comparison to unsound paral- When we detect external dependencies (i.e., files or net- lelization without respecting dependencies. work hosts), we assume that there is no collusion between externalities. For example, we assume that if one test com- there are 32 processes each running tests on the same ma- municates with network host A, and another test communi- chine, with each process persisting for the entire execution. cates with network host B, hosts A and B have no backchan- We simulated the parallelization of each test suite follow- nel that may cause a dependency between the two tests. We ing three different configurations: without respecting depen- have not built our tool to handle specialized hardware de- dencies (“unsound”), with a naive dependency-aware sched- vices (other than those accessed via files or network sock- uler (“naive”), and the optimistic greedy scheduler described ets) that may be involved in dependencies. However, Elec- in §3.4 (“greedy”). The naive scheduler groups tests into tricTest could easily be extended to handle such devices in chains to represent dependencies, such that each test is in the same manner as files and network hosts. Since Elec- exactly one group, and each group contains all dependen- tricTest is a dynamic tool, it will only detect dependencies cies for each test — this approach soundly respects depen- that occur during the specific execution that it is used for: dencies but may not be optimal in execution time. Table if tests exhibit different dependencies due to nondetermin- 6 shows the results of this simulation, parallelizing at the ism in the test, the dependency may not be detected. Elec- granularities of test classes and test methods. 
We show tricTest could be expanded to include a deterministic replay the theoretical speedups for each schedule provided rela- tool such as [9, 26] to ensure that dependencies don’t vary. tive to the serial execution, where speedup is calculated as There are also several threats to the validity of our ex- Tserial/Tparallel. Overall, the greedy optimistic scheduler periments. We studied the ten longest building open source outperformed the naive scheduler in some cases, and at times projects that we could find, in conjunction with four rel- provided a speedup close to that of the unsound paralleliza- atively short-building projects used by other researchers. tion. In some cases, the dependency-preserving paralleliza- These projects may not necessarily have the same charac- tion was faster when parallelizing at the coarser level of test teristics as those projects found in industry. However, we classes. In these cases, there were so many dependencies at believe that they are sufficiently diverse to show a cross- the test method level that the schedulers were generating in- section of software, and show that ElectricTest works well credibly inefficient schedules, requiring that some tests were with both long and short building software. re-executed many times. Also, this may have occurred in We simulated the speedups afforded by various parallel some cases because we assumed that the shortest amount of schedules of test suites. Due to resource limitations, we time that a single test could take was one millisecond: in the did not actually run the test suites in parallel. We assume case of a single test class that took one millisecond that had that the running time of a test is constant, regardless of several test methods, if we parallelized the test methods, we the order in which it is executed. Therefore we may expect may assume a total time to execute longer than just running that the speedup of the unsound parallelization is an over- a test class at once. estimate: if multiple tests share the same state to save on We investigated the cases where ElectricTest didn’t do time running setup code, then it may actually take longer as well, ‘spring-data-mongodb’ and ‘mongo-java-driver’ — to run the tests in parallel since the setup must run multiple both projects had very long dependency chains. Upon in- times. However, we are confident that the various speedups spection, we found that most tests in each project purposely predicted for dependency-preserving schedules are sound, as shared state between test cases for performance reasons. For we do not believe that other external factors are likely to instance, the mongo driver created a single connection to a impact the running time of each test. database and reused that connection between tests to save Similarly, we did not directly study the impact of the de- on setup. The spring based project had a similar pattern. pendencies we detected on test selection or prioritization These cases bring up an interesting point: sometimes tests techniques, instead using the maximum dependency size as may be intentionally data-dependent on each other. Espe- a proxy for selectivity. 
4.4 Discussion and Limitations

There are several limitations to our approach and implementation. Because we detect dependencies of code running in the JVM, we may miss some dependencies that occur due to native code invoked by a test. While ElectricTest can detect Java field accesses from native code (through the use of field watches), it cannot detect file accesses or array accesses from native code. However, none of the applications that we studied contained native libraries. It would be possible to expand our implementation to detect and record these accesses by performing load-time patching of calls to JNI functions (for array accesses) and system calls (for file accesses) to record each event.

When we detect external dependencies (i.e., files or network hosts), we assume that there is no collusion between externalities. For example, we assume that if one test communicates with network host A, and another test communicates with network host B, hosts A and B have no backchannel that could cause a dependency between the two tests. We have not built our tool to handle specialized hardware devices (other than those accessed via files or network sockets) that may be involved in dependencies; however, ElectricTest could easily be extended to handle such devices in the same manner as files and network hosts. Since ElectricTest is a dynamic tool, it will only detect dependencies that occur during the specific execution it observes: if tests exhibit different dependencies due to nondeterminism in the test, a dependency may not be detected. ElectricTest could be expanded to include a deterministic replay tool such as [9, 26] to ensure that dependencies do not vary.

There are also several threats to the validity of our experiments. We studied the ten longest-building open source projects that we could find, in conjunction with four relatively short-building projects used by other researchers. These projects may not necessarily have the same characteristics as projects found in industry. However, we believe that they are sufficiently diverse to show a cross-section of software, and they show that ElectricTest works well with both long- and short-building software.

We simulated the speedups afforded by various parallel schedules of test suites. Due to resource limitations, we did not actually run the test suites in parallel. We assume that the running time of a test is constant, regardless of the order in which it is executed. Therefore, we may expect that the speedup of the unsound parallelization is an over-estimate: if multiple tests share the same state to save time running setup code, then it may actually take longer to run the tests in parallel, since the setup must run multiple times. However, we are confident that the speedups predicted for the dependency-preserving schedules are sound, as we do not believe that other external factors are likely to impact the running time of each test.

Similarly, we did not directly study the impact of the dependencies we detected on test selection or prioritization techniques, instead using the maximum dependency size as a proxy for selectivity. A more thorough study might have instead downloaded many versions of each program, performed test selection or prioritization on each version (based on results from the previous version), and then measured the impact of the detected dependencies on these tools. Such a study would also show the practicality of caching ElectricTest results throughout the development cycle, so that it need not be executed for each build; this caching might significantly reduce the performance burden of checking for test dependencies with each successive change to the program. Again, we were limited in resources to perform such a study, and believe that our use of maximum dependency size as an indicator for selectivity is sufficient.
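For reference, the speedup computation behind the simulation described above reduces to the following sketch (our illustration, with made-up durations; the actual experiment used measured per-test times and 32 worker processes): serial time is the sum of all test durations, parallel time is the busiest worker's total, and speedup is their ratio.

import java.util.List;

/** Sketch of the simulation's speedup computation: with per-test running times
 *  assumed constant, speedup = T_serial / T_parallel, where T_parallel is the
 *  busiest worker's total time (the makespan). Durations below are invented. */
public class SpeedupSimulation {
    static double predictSpeedup(List<List<Long>> perWorkerDurationsMillis) {
        long serial = 0, makespan = 0;
        for (List<Long> worker : perWorkerDurationsMillis) {
            long total = worker.stream().mapToLong(Long::longValue).sum();
            serial += total;                      // serial time: sum over all tests
            makespan = Math.max(makespan, total); // parallel time: slowest worker
        }
        // The 1 ms minimum per-test duration assumed in the simulation is omitted here.
        return (double) serial / makespan;
    }

    public static void main(String[] args) {
        // Two workers with made-up per-test durations, in milliseconds.
        List<List<Long>> schedule = List.of(List.of(40L, 10L, 50L), List.of(60L, 40L));
        System.out.printf("Predicted speedup: %.2fx%n", predictSpeedup(schedule));
        // Prints 2.00x: 200 ms of tests finish in 100 ms of wall time.
    }
}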
5 Related Work

Test dependencies are one root cause of the general problem of flaky tests, a term used to refer to tests whose outcome is non-deterministic with regard to the software under test [16, 29, 30]. Luo et al. analyzed bug reports and fixes in 51 projects, studying the causes and fixes of 161 flaky tests and categorizing 19 (12%) of them as caused by test dependencies [29]. ElectricTest could be used to automatically detect and avoid these dependency problems before they result in flaky tests.

ElectricTest is most similar to Zhang et al.'s DTDetector system, which detected manifest dependencies between tests by running the tests in various orderings [40]. A manifest dependency is indicated by a test having a different outcome when it is executed in a different order relative to the entire test suite. This approach required O(n!) test executions for n tests, with best-case approximation scenarios at O(n^2). ElectricTest instead detects data dependencies, where one test reads data that was last written by a previous test, and does not require running each test more than once, making it a much more scalable approach.

Unlike ElectricTest, which observes and reports actual data dependencies between tests, Gyori et al.'s PolDet tool detects potential data sharing between tests by searching for data "pollution": data left behind by a test that a later test may read (which may or may not ever occur) [21]. PolDet captures the JVM heap to an XML file using Java reflection and compares these XML files offline, while ElectricTest performs all analysis on live heaps, greatly simplifying detection of leaked data.

ElectricTest's technique for dependency detection is more closely related to work on Makefile (build) parallelization, such as EMake [33] or Metamorphosis [19]. These systems observe filesystem reads and writes for each step of the build process to detect dependencies between steps and infer which steps can be parallelized. In addition to filesystem accesses, ElectricTest monitors memory and network accesses.

While ElectricTest detects hidden dependencies between tests, there has also been work on efficiently isolating tests to ensure that dependencies do not occur. Popular Java testing platforms (e.g., JUnit [2] or TestNG [3] running with Ant [4] or Maven [5]) support optional test isolation by executing each test class in its own process, resulting in isolation at the expense of a high runtime overhead. Our previous work, VMVM, eliminates in-memory dependencies between tests without requiring each test to run in its own process, greatly reducing the overhead of isolation [7]. While VMVM preserves the exact same semantics for isolation and initialization that would come from executing each test in its own process, other systems such as JCrasher [13] also isolate tests efficiently, although without reproducing the same exact semantics. If tests are already dependent on each other, but the goal is to isolate them, then ElectricTest could be used to identify which tests are currently dependent (and how), allowing a programmer to manually fix the tests so that they can run in isolation.

Other tools support test execution in the presence of test dependencies. However, all of these tools require developers to manually specify dependencies, a tedious and difficult process that ElectricTest automates. For instance, both the depunit [1] and TestNG [3] frameworks allow developers to specify dependencies between tests, while JUnit [2] allows developers to specify the order in which to run tests.
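As an illustration of the manual style these frameworks rely on, the hypothetical TestNG class below declares a dependency by hand with the dependsOnMethods attribute (JUnit 4 can only pin a method order, e.g., with @FixMethodOrder); this is the kind of annotation that ElectricTest aims to make unnecessary by discovering the relationship automatically.

import static org.testng.Assert.assertNotNull;
import org.testng.annotations.Test;

/** TestNG lets developers declare inter-test dependencies by hand; the class and
 *  methods here are hypothetical, purely to show the annotation style. */
public class ManualDependencyTest {
    private String sessionToken;

    @Test
    public void login() {
        sessionToken = "token-123";   // pretend this came from a server
    }

    // TestNG always runs login() first and skips this test if login() failed.
    @Test(dependsOnMethods = "login")
    public void fetchProfile() {
        assertNotNull(sessionToken);
    }
}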
Test Suite Minimization identifies test cases that may be redundant in terms of coverage metrics and removes them from the suite. Many heuristics and coverage metrics have been proposed for minimizing test suites, although most approaches are limited by the strength of the coverage criteria used [11, 12, 23, 24, 27, 28, 37, 39]. Test selection approaches the same problem of having too many tests to run from a different angle, by instead selecting only those tests that have been impacted by changes in the application code since the last time the tests were executed [6, 10, 18, 20, 25]. Since test selection can be dangerous if additional tests are impacted by changes but not selected (i.e., due to imprecision in the coverage metrics used to determine impact), some may turn to test prioritization, where entire test suites are still executed, but the tests most likely to be impacted by recent changes are executed first [14, 15, 35, 36, 38]. Haidry and Miller propose several test prioritization techniques that consider dependencies between tests when performing the prioritization, but require developers to manually specify the dependencies [22]. ElectricTest could be combined with each of these techniques to efficiently and automatically detect dependencies between tests, and then safely accelerate them using test selection or prioritization.

6 Conclusions

While testing dominates long build times, accelerating testing is tricky, since test dependencies pose a threat to test acceleration tools. Test dependencies can be difficult to detect by hand, and prior to ElectricTest there was no tool to practically detect them in all but the very smallest test suites (those that took less than several minutes to run normally). We have presented ElectricTest, a tool for detecting data dependencies between Java tests with an average slowdown of only 20x, where previous approaches would have been completely infeasible, taking up to 10^308 times longer to find all dependencies. We evaluated the accuracy of ElectricTest, finding it to have perfect recall compared to the previous approach in our study. Because not all data dependencies will influence the control flow of the data-dependent tests, we evaluated the impact of ElectricTest on test parallelization and selection, finding its dependency chains small enough to still allow for acceleration. The dependencies detected by ElectricTest can further be used by developers to gain insight into how their tests interact and to fix unintentional dependencies.

7 Acknowledgements

Bell and Kaiser are members of The Programming Systems Laboratory, which is funded in part by NSF CCF-1302269, CCF-1161079, and NIH U54 CA121852. Bell was partially supported by Electric Cloud while completing this work.

8 References

[1] Dependency and data driven unit testing framework for java. https://code.google.com/p/depunit/.
[2] Junit: A programmer-oriented testing framework for java. http://junit.org/.
[3] Next generation java testing. http://testng.org/doc/index.html.
[4] Apache Software Foundation. The apache ant project. http://ant.apache.org/.
[5] Apache Software Foundation. The apache maven project. http://maven.apache.org/.
[6] T. Ball. On the limit of control flow analysis for regression test selection. In Proceedings of the 1998 ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA '98, pages 134–142, New York, NY, USA, 1998. ACM.
[7] J. Bell and G. Kaiser. Unit test virtualization with VMVM. In Proceedings of the 36th International Conference on Software Engineering, ICSE '14, pages 550–561, New York, NY, USA, 2014. ACM.
[8] J. Bell, E. Melski, M. Dattatreya, and G. Kaiser. Vroom: Faster build processes for Java. In IEEE Software Special Issue: Release Engineering. IEEE Computer Society, March/April 2015. To appear. Preprint: http://jonbell.net/s2bel.pdf.
[9] J. Bell, N. Sarda, and G. Kaiser. Chronicler: Lightweight recording to reproduce field failures. In Proceedings of the 2013 International Conference on Software Engineering, ICSE '13, pages 362–371, Piscataway, NJ, USA, 2013. IEEE Press.
Automating Engineering, ICSE 2012, pages 738–748, Piscataway, regression test selection based on uml designs. Inf. NJ, USA, 2012. IEEE Press. Softw. Technol., 51(1):16–30, Jan. 2009. [24] M. J. Harrold, R. Gupta, and M. L. Soffa. A [11] T. Chen and M. Lau. A new heuristic for test suite methodology for controlling the size of a test suite. reduction. Information and Software Technology, ACM Trans. Softw. Eng. Methodol., 2(3):270–285, 40(5–6):347 – 354, 1998. July 1993. [12] T. Chen and M. Lau. A simulation study on some [25] M. J. Harrold, J. A. Jones, T. Li, D. Liang, A. Orso, heuristics for test suite reduction. Information and M. Pennings, S. Sinha, S. A. Spoon, and A. Gujarathi. Software Technology, 40(13):777 – 787, 1998. Regression test selection for java software. In [13] C. Csallner and Y. Smaragdakis. Jcrasher: an Proceedings of the 16th ACM SIGPLAN Conference automatic robustness tester for java. Software: on Object-oriented Programming, Systems, Languages, Practice and Experience, 34(11):1025–1050, 2004. and Applications, OOPSLA ’01, pages 312–326, New [14] H. Do, G. Rothermel, and A. Kinneer. Empirical York, NY, USA, 2001. ACM. studies of test case prioritization in a testing [26] J. Huang, P. Liu, and C. Zhang. Leap: Lightweight environment. In Software Reliability Engineering, deterministic multi-processor replay of concurrent java 2004. ISSRE 2004. 15th International Symposium on, programs. In Proceedings of the Eighteenth ACM pages 113–124, 2004. SIGSOFT International Symposium on Foundations [15] S. Elbaum, A. Malishevsky, and G. Rothermel. of Software Engineering, FSE ’10, pages 385–386, New Incorporating varying test costs and fault severities York, NY, USA, 2010. ACM. into test case prioritization. In Proceedings of the 23rd [27] D. Jeffrey and N. Gupta. Improving fault detection International Conference on Software Engineering, capability by selectively retaining test cases during ICSE ’01, pages 329–338, Washington, DC, USA, test suite reduction. IEEE Trans. Softw. Eng., 2001. IEEE Computer Society. 33(2):108–123, Feb. 2007. [16] M. Fowler. Eradicating non-determinism in tests. [28] J. A. Jones and M. J. Harrold. Test-suite reduction http://martinfowler.com/articles/ and prioritization for modified condition/decision nonDeterminism.html, 2011. coverage. IEEE Trans. Softw. Eng., 29(3):195–209, [17] M. Gligoric, R. Majumdar, R. Sharma, L. Eloussi, and Mar. 2003. D. Marinov. Regression test selection for distributed [29] Q. Luo, F. Hariri, L. Eloussi, and D. Marinov. An software histories. In A. Biere and R. Bloem, editors, empirical analysis of flaky tests. In Proceedings of the Computer Aided Verification, volume 8559 of Lecture 22Nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2014, pages symposium on Software testing and analysis, ISSTA 643–653, New York, NY, USA, 2014. ACM. ’02, pages 97–106, New York, NY, USA, 2002. ACM. [30] A. M. Memon and M. B. Cohen. Automated testing of [37] S. Tallam and N. Gupta. A concept analysis inspired gui applications: Models, tools, and controlling greedy algorithm for test suite minimization. In flakiness. In Proceedings of the 2013 International Proceedings of the 6th ACM SIGPLAN-SIGSOFT Conference on Software Engineering, ICSE ’13, pages workshop on Program analysis for software tools and 1479–1480, Piscataway, NJ, USA, 2013. IEEE Press. engineering, PASTE ’05, pages 35–42, New York, NY, [31] Oracle. Jvm tool interface. http://docs.oracle.com/ USA, 2005. ACM. 
javase/7/docs/platform/jvmti/jvmti.html. [38] W. E. Wong, J. R. Horgan, S. London, and H. A. [32] A. Orso, N. Shi, and M. J. Harrold. Scaling regression Bellcore. A study of effective regression testing in testing to large software systems. In Proceedings of the practice. In Proceedings of the Eighth International 12th ACM SIGSOFT Twelfth International Symposium on Software Reliability Engineering, Symposium on Foundations of Software Engineering, ISSRE ’97, Washington, DC, USA, 1997. IEEE SIGSOFT ’04/FSE-12, pages 241–251, New York, NY, Computer Society. USA, 2004. ACM. [39] W. E. Wong, J. R. Horgan, S. London, and A. P. [33] J. Ousterhout. 10–20x faster software builds. USENIX Mathur. Effect of test set minimization on fault ATC, 2005. detection effectiveness. In Proceedings of the 17th [34] G. Rothermel and M. J. Harrold. Analyzing regression international conference on Software engineering, test selection techniques. IEEE Transactions on ICSE ’95, pages 41–50, New York, NY, USA, 1995. Software Engineering, 22(8):529–441, August 1996. ACM. [35] G. Rothermel, R. Untch, C. Chu, and M. Harrold. Test [40] S. Zhang, D. Jalali, J. Wuttke, K. Muslu, M. Ernst, case prioritization: an empirical study. In Proceedings and D. Notkin. Empirically revisiting the test of the IEEE International Conference on Software independence assumption. In Proceedings of the 2014 Maintenance (ICSM ’99), pages 179–188, 1999. International Symposium on Software Testing and [36] A. Srivastava and J. Thiagarajan. Effectively Analysis, ISSTA ’14, pages 384–396, New York, NY, prioritizing tests in development environment. In USA, 2014. ACM. Proceedings of the 2002 ACM SIGSOFT international