2020 IEEE International Conference on Software Maintenance and Evolution (ICSME)

Shake It! Detecting Flaky Tests Caused by Concurrency with Shaker

Denini Silva, Leopoldo Teixeira, Marcelo d'Amorim
Federal University of Pernambuco, Brazil
{dgs,lmt,damorim}@cin.ufpe.br

Abstract—A test is said to be flaky when it non-deterministically passes or fails. Test flakiness negatively affects the effectiveness of regression testing and, consequently, impacts software evolution. Detecting test flakiness is an important and challenging problem. ReRun is the most popular approach in industry to detect test flakiness. It re-executes a test suite on a fixed code version multiple times, looking for inconsistent outputs across executions. Unfortunately, ReRun is costly and unreliable. This paper proposes SHAKER, a lightweight technique to improve the ability of ReRun to detect flaky tests. SHAKER adds noise in the execution environment (e.g., it adds stressor tasks to compete for the CPU or memory). It builds on the observations that concurrency is an important source of flakiness and that adding noise in the environment can interfere in the ordering of events and, consequently, influence the test outputs. We conducted experiments on a data set with 11 Android apps. Results are very encouraging. SHAKER discovered many more flaky tests than ReRun (95% and 37.5% of the total, respectively) and discovered these flaky tests much faster. In addition, SHAKER was able to reveal 61 new flaky tests that went undetected in 50 re-executions with ReRun.

Keywords: test flakiness, regression testing, noise

I. INTRODUCTION

A test is said to be flaky when it non-deterministically passes or fails depending on the running environment [1]. For example, a test may fail or pass depending on the availability of a server that the test itself has spawned—the test passes if the server is up-and-running at the point the test makes a request to the server, and it fails otherwise.

Test flakiness hurts software testing practice in multiple ways. For example, Thorve et al. [2] found that, because of test flakiness, developers lost trust in test results, sometimes choosing to ignore test failures altogether. Even ignoring failures of known flaky tests or ignoring previously classified flaky tests (i.e., not executing those tests) can be dangerous, as those tests could reveal bugs in code. Furthermore, ignoring flakiness can produce the effect of observing even more failures to be analyzed during software evolution. For example, Rahman and Rigby [3] found that when developers ignored flaky test failures during a build, the deployed build experienced more crashes than builds that did not contain any flaky test failures. Test flakiness is a huge problem in industry. Most test failures at Google are due to flaky tests [1], [4]. At Microsoft, the presence of flaky tests also imposes a significant burden on developers. Lam et al. [5] reported that 58 Microsoft developers involved in a survey considered flaky tests to be the second most important reason, out of 10 reasons, for slowing down software deployments. Finally, Facebook considers flakiness detection a priority [6].

ReRun is the most popular approach in industry to detect test flakiness [1], [7]. It reruns tests multiple times. A test that failed and then passed is considered flaky, and vice-versa. The status of a test that persistently failed is unknown, but developers typically treat this scenario as a problem in application code as opposed to a bug in test code. Unfortunately, ReRun is unreliable and expensive. It is unreliable because it is hard to determine the number of reruns needed to find a discrepancy in outputs. It is expensive because rerunning tests consumes a lot of computing power. Google, for example, uses 2-16% of its testing budget just to rerun flaky tests [1]. Researchers have proposed various techniques to identify flaky tests. Bell et al. [8] proposed DeFlaker, a technique that determines that a test is flaky if the output of a test has changed even though there was no change in the code reachable from the execution trace. Note that DeFlaker cannot capture flakiness in test cases not impacted by changes and that change-impacted flaky tests do not necessarily reveal flakiness. Pinto et al. [9] proposed the use of text processing and classification to statically identify likely flaky test cases from the keywords they use.

This paper proposes SHAKER, a lightweight technique to improve the ability of ReRun to detect flaky tests. SHAKER adds noise in the execution environment with the goal of provoking failures in time-constrained tests. For example, it adds stressor tasks to compete with the test execution for the CPU or memory. SHAKER builds on the observations that concurrency is an important source of flakiness [10], [11] and that adding noise in the environment can interfere in the ordering of events related to test execution and, consequently, influence the test outputs. The process of detecting flaky tests consists of two steps. In the first, offline step, SHAKER uses a sample of tests known to be flaky to search for configurations of a noise generator to be used for revealing flakiness in new tests. To that end, SHAKER first builds a probability matrix encoding the relation between flaky tests and noise configurations. The matrix shows the probability of a test failing when executed with a given noise configuration. Then, SHAKER uses that matrix to search for sets of configurations that reveal the highest number of flaky tests. In the second, online step, SHAKER uses those configurations to find time-constrained flaky tests in the project of interest.

We conducted experiments on a data set with 75 flaky tests of 11 open source Android apps. Preliminary results provide early evidence that SHAKER is promising.


Listing 1: AntennaPod Test

1  @Test
2  public void testReplayEpisodeContinuousPlaybackOff() throws Exception {
3      setContinuousPlaybackPreference(false);
4      uiTestUtils.addLocalFeedData(true);
5      activityTestRule.launchActivity(new Intent());
6      // Navigate to the first episode in the list of episodes and click
7      openNavDrawer();
8      onDrawerItem(withText(R.string.episodes_label)).perform(click());
9      onView(isRoot()).perform(waitForView(withText(R.string.all_episodes_short_label), 1000));
10     onView(withText(R.string.all_episodes_short_label)).perform(click());
11     final List episodes = DBReader.getRecentlyPublishedEpisodes(0, 10);
12     Matcher allEpisodesMatcher = allOf(withId(android.R.id.list), isDisplayed(), hasMinimumChildCount(2));
13     onView(isRoot()).perform(waitForView(allEpisodesMatcher, 1000));
14     onView(allEpisodesMatcher).perform(actionOnItemAtPosition(0, clickChildViewWithId(R.id.secondaryActionButton)));
15     FeedMedia media = episodes.get(0).getMedia();
16     Awaitility.await().atMost(1, TimeUnit.SECONDS).until(() -> media.getId() == PlaybackPreferences.getCurrentlyPlayingFeedMediaId());
       ...
   }

SHAKER discovered many more flaky tests than ReRun (95% and 37.5% of the total from our data set, respectively) and discovered these flaky tests much faster than ReRun. SHAKER discovered 85% of the total number of possible flakies in 10% of the average time ReRun took to find its maximum number of flakies. In addition, SHAKER was able to reveal 61 new flaky tests that went undetected in 50 re-executions with ReRun. This paper makes the following contributions:

Approach. We proposed a simple lightweight approach to find time-constrained flaky tests by introducing noise in the test environment where tests will be executed.
Implementation. We developed a tool implementing SHAKER (available per request due to double blindness).
Evaluation. We evaluated SHAKER on 11 Android apps with encouraging results.
Artifacts. Our scripts, data sets, and list of issues submitted to GitHub projects are publicly available at the following link: https://github.com/shaker-project/shaker.

II. EXAMPLES

This section presents two examples of flaky tests in Android applications (apps) to motivate and illustrate SHAKER. Before introducing the examples, it is important to briefly explain concepts related to the Android OS. The execution of an Android app creates a separate process including an execution thread, typically called the main thread, or the UI thread. This thread is responsible for handling events such as callbacks from UI interactions, callbacks associated with the lifecycle of components, etc. Any costly operation, such as network operations, should be offloaded to separate threads to avoid blocking the main thread and, consequently, freezing the UI. Blocking the main thread for more than 5 seconds results in an Application Not Responding (ANR) error, which crashes the app. Consequently, asynchronous operations are common in Android apps. Separate worker threads, responsible for these operations, are not allowed to manipulate the UI directly, which is the responsibility of the main (UI) thread. Consequently, there must be coordination among worker threads and the main thread. Testing frameworks, such as Espresso [12], provide means to handle asynchronous operations. However, developers may fail to properly coordinate tasks from tests.

A. AntennaPod

AntennaPod [13] is an open source podcast manager for Android supporting episode download and streaming. The AntennaPod app is implemented in Java in ∼50KLOC. As of the date of submission,¹ the app had 250 GUI tests written in Espresso [12] and UIAutomator [14].

Listing 1 shows a simplified version of a test that checks whether a podcast episode can be played twice. It uses the Awaitility library [15] to handle asynchronous events, such as I/O events related to notifications of media playback. Executing the statement at line 3 turns off the continuous playback option. This stops the app from automatically playing the next episode in the queue after it finishes the current one. The statement at line 4 then adds local data (e.g., podcast feeds, images, and episodes) to the app, whereas the execution of the statements at lines 7–14 navigates through the GUI objects with the effect of playing the first episode in the queue. Line 16 shows an assertion based on the Awaitility library. The assertion checks if the episode is being played and indicates that test execution should wait for at most one second to verify this (line 16). When the play button is pressed (line 14), the app runs a custom service in the background² to load the media file from the file system and, subsequently, play the media to the user. These are typically expensive I/O operations. If the machine running the test is heavily loaded, the condition may not be satisfied within the 1-second limit. Consequently, the execution of the test will fail with a ConditionTimeoutException raised by the Awaitility library.

We ran this test case 50 times and found failures in 11 runs, i.e., 22% of the cases. Considering this example, one would need to execute the test more than four times to detect the failure with high probability. When configured to execute the test suite using only the most effective noisy SHAKER configuration, we were able to detect the failure in the first execution of the test suite. Furthermore, we re-executed the test case 50 times with this configuration and were able to detect flakiness in all of them.

¹ Revision SHA dd5234c as per the time of submission.
² Look for "Thread" in the PlaybackService.java file [16].
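As a back-of-the-envelope check of the "more than four times" estimate (our own illustration, assuming reruns are independent and the per-run failure probability stays at p = 0.22), the probability of observing at least one failure within k reruns is

P(\text{detect within } k \text{ reruns}) = 1 - (1 - p)^{k}, \qquad E[\text{reruns until first failure}] = 1/p \approx 4.5

With p = 0.22, five reruns detect the failure with probability 1 - 0.78^5 ≈ 0.71, and about ten reruns are needed to exceed 90%.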


Listing 2: Paintroid Test

1  @Test
2  public void testFullscreenPortraitOrientationChangeWithShape() {
3      onToolBarView().performSelectTool(ToolType.SHAPE);
4      setOrientation(SCREEN_ORIENTATION_PORTRAIT);
5      onTopBarView().performOpenMoreOptions();
6      onView(withText(R.string.menu_hide_menu)).perform(click());
7      setOrientation(SCREEN_ORIENTATION_LANDSCAPE); pressBack();
8      onToolBarView().performOpenToolOptionsView().performCloseToolOptionsView();
   }

Although each individual execution of the test suite is more expensive with SHAKER, it requires fewer re-executions of the test suite to detect flakiness. For comparison, one non-noisy execution of this test case takes 11.08s whereas one execution of the test case under SHAKER's noisy environment takes 20.2s, i.e., the execution without noise is ∼2x faster in this particular case. However, comparing the total execution time, we observed that SHAKER enabled the discovery of flakiness 2.49x (=50.3/20.2) faster and, more importantly, it required a single execution for that.

B. Paintroid

Paintroid is a graphical paint editor application for Android implemented in Java in ∼25KLOC. As of its latest version,³ it had 250 Espresso tests. One of these tests checks whether some buttons can be clicked after changing the screen orientation. Listing 2 shows the test. It first selects the Shape drawing option (line 3), and then sets the screen orientation to portrait (line 4). Then, it opens a menu that shows a list of options (line 5) (e.g., an option to save an image, an option to export an image to a file, etc.). The test clicks on the full screen option (line 6), then it changes orientation to landscape (line 7), exits full screen mode (line 7), and clicks on the tool options to again open a menu, and then close it (line 8). Note that there are no assertions in the test. The intention is to validate that the options remain clickable as the orientation changes from portrait to landscape.

Like the previous example, this test can produce different results depending on the efficiency of the machine. More precisely, the click on the menu item (line 6) can be performed before or after the menu is rendered on the screen (line 5). As expected, the test fails, throwing a PerformException, if the click is performed before the menu is shown. Changing screen orientation corresponds to a configuration change in Android.⁴ When a configuration change happens, Android destroys and recreates the current screen (represented by an Activity object). This happens because changing orientation might result in a different screen layout.

We ran this test case 50 times with ReRun and found failures in 4 runs (8% of the cases). This example shows that, albeit practical and widely adopted in industry [1], [6], [10], ReRun can be (1) ineffective or (2) costly to proactively detect flaky tests. Given that it takes 12.5 (=50/4) executions on average to find flakiness and each regular execution takes 5.54s, the aggregate cost to detect this flaky test with ReRun is 69.25s (=12.5*5.54). We also ran the test case 50 times with SHAKER and found failures in 18 runs (36% of the cases). As the test fails in every 2.78 (=50/18) executions and each execution with SHAKER takes 7.08s, the aggregate cost of SHAKER to find this flaky test is 19.68s (=2.78x7.08). Overall, SHAKER reveals the flaky test 3.52x faster than ReRun (=69.25/19.68) despite the execution of the test case itself being 1.28x (=7.08/5.54) slower.

³ Revision SHA 1f302a2 as per the time of submission.
⁴ https://developer.android.com/guide/topics/resources/runtime-changes

III. SHAKER

The goal of SHAKER is to detect flaky tests. The observation that motivates SHAKER is that many tests are flaky because of timing constraints in test executions [2], [10], [11], such as those from the examples. Our hypotheses are that 1) such tests can be detected by adding noise in the environment where test cases will run and 2) rerunning tests in a noisy environment will more promptly reveal flaky tests compared with rerunning tests without noise.

Fig. 1: SHAKER's workflow.

Figure 1 shows SHAKER's workflow. SHAKER has an offline step and an online step. In the first, offline step, SHAKER uses a sample of tests known to be flaky to search for configurations of a noise generator. In the second, online step, SHAKER uses those configurations to find flaky tests in the test suite of a project provided as input.

The following sections detail SHAKER. Section III-A describes how an off-the-shelf tool can be used to generate noise in the execution environment. Section III-B describes the offline step of generating configurations for a noise generator. Finally, Section III-C describes how SHAKER optimizes ReRun to find flaky tests efficiently.

A. Noise Generation

A noise generator is a tool to create load. For example, a noise generator can spawn "stressor" tasks that can influence the execution environment where tests will be executed. Existing tools provide different choices of target for noise generation.


We focused on CPU and memory options as we empirically found that they influence the detection of test flakiness (see Section V-A). We used the stress-ng [17] tool to create CPU and memory load. We used the following options:

• --cpu n. Starts n stressors to exercise the CPU by working sequentially through different CPU stress methods, such as the Ackermann function or the Fibonacci sequence.
• --cpu-load p. Sets the load percentage p for the --cpu option.
• --vm n. Starts n stressors to continuously allocate and deallocate memory.
• --vm-bytes p. Sets the percentage p of the total available memory to be used by the tasks created with option --vm.

For example, the command stress-ng --cpu 2 --cpu-load 50% --vm 1 --vm-bytes 30% configures stress-ng to run two CPU stressors with 50% load each and one virtual memory stressor using 30% of the available memory. The documentation of stress-ng can be found elsewhere.⁵

In addition to the options listed above, SHAKER uses an option that we found important for finding flaky Android tests—the number of cores available for use by an Android emulator. This option can be used to restrict an Android emulator to run on a specified number of cores.

⁵ https://manpages.ubuntu.com/manpages/artful/man1/stress-ng.1.html
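The following minimal Java sketch is our own illustration (not part of the SHAKER tool) of how a noise configuration like the command above could be applied around a test-suite run. It assumes stress-ng is installed on the machine hosting the emulator, passes --cpu-load as a plain percentage number, and uses "./gradlew connectedAndroidTest" only as a stand-in for whatever command executes the Espresso tests.

public class NoisyRun {
    // Runs the test suite while stress-ng generates CPU and memory load,
    // mirroring the example configuration given in the text.
    public static void main(String[] args) throws Exception {
        Process noise = new ProcessBuilder(
                "stress-ng", "--cpu", "2", "--cpu-load", "50",
                "--vm", "1", "--vm-bytes", "30%")
                .inheritIO().start();
        try {
            // Placeholder: run the instrumented test suite (e.g., a Gradle connected test task).
            new ProcessBuilder("./gradlew", "connectedAndroidTest")
                    .inheritIO().start().waitFor();
        } finally {
            noise.destroy();   // stop the stressor tasks once the suite finishes
            noise.waitFor();
        }
    }
}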
B. Step 1: Discovering Configurations

The goal of this step is to identify configurations of the noise generator that are more likely to reveal flakiness in a test suite. We search for suitable configurations because an overly heavy load could crash the emulator. This step takes as input a set of tests TR, known to be flaky, and reports on output a list of configurations c_k = [v_1, ..., v_n]. A noise generator is configured from a list of options [o_1, ..., o_n], with each option o_i ranging over the interval [lo_i, hi_i]. Section III-A describes which options are currently used. The flaky tests in TR can be obtained with ReRun, i.e., by rerunning the test sets of several applications multiple times. Section IV explains how we obtained the training set to evaluate SHAKER. To identify "good" configurations, it is necessary to define a metric for configuration quality. We use the symbol fit(c_k, TR) to denote the fitness value of configuration c_k for the test set TR. The fitness value is obtained by computing the average probability of detecting flakiness on TR when the configuration c_k is used for noise generation.

Example. Consider that the set TR contains three tests. A given configuration c that detects flakiness in TR with probabilities {0.2, 0.5, 0.0} has fit(c, TR) = (0.2+0.5+0.0)/3 ≈ 0.23. These probabilities are obtained by running the test suite multiple times and computing the average number of failures. In this case, one test failed in 20% of the reruns, another test failed in 50% of the reruns, and another did not fail at all.

The search for noise configurations is realized in two stages. First, SHAKER samples configurations and generates a probability matrix characterizing the probability of each configuration to find flakiness in TR. Second, SHAKER uses that matrix to search for sets of configurations. The following sections elaborate each of these steps in detail.

1) Generation of probability matrix: This step takes as input a set of flaky tests TR and reports on output a probability matrix M, relating the tests in TR and randomly-sampled configurations in K by their corresponding probabilities. The symbol M[t][c] denotes the probability of configuration c ∈ K detecting flakiness in t ∈ TR. To obtain approximate probability measurements, SHAKER runs each test several times on each sampled configuration. The probability measurement M[t][c] is obtained by dividing the number of failures (of t on c) found by the total number of reruns (of t on c).

2) Search for sets of configurations: SHAKER offers two strategies to search for configurations.

The Greedy strategy sorts configurations according to their fitness value and reports the top-n configurations in the sorted list. The value n is provided by the user. Note that this strategy does not offer guarantees that every flaky test will be detected.

The Minimum Hitting Set (MHS) strategy takes that problem into account. MHS [18] is a well-known intractable problem with efficient polynomial-time approximations [19]. In short, it enables SHAKER to obtain minimum sets of configurations (columns of the matrix) that detect the maximum number of flaky tests (rows of the matrix). Variations of the MHS problem exist considering weights and returning complete or partial (sub-optimal) solutions [20]. SHAKER builds on the unweighted and complete MHS version, which takes a boolean matrix as input and produces a minimum hitting set encoding configurations as output. We abstracted the probability matrix M to only encode low or high likelihood of configurations detecting flaky tests. Intuitively, we are only interested in a configuration c detecting flakiness of a certain test t if the observed probability M[t][c] is above a certain threshold. More precisely, SHAKER computes an abstract matrix A defined as A[t][c] = 1 if M[t][c] >= threshold, and 0 otherwise. SHAKER runs MHS on the boolean matrix A. The goal is to find a set of configurations (columns in the matrix) that detects flakiness in the tests (rows in the matrix). Note that, although MHS assures that all flaky tests are covered (i.e., each test would be detected with some configuration in the MHS), there is no guarantee that these tests are uniformly covered.

Example. Figure 2 shows an illustrative example of the MHS procedure to discover configurations for detecting flakiness. The left-hand side of the figure shows the probability matrix M. The abstract matrix A appears on the right-hand side. For space, we used a 3x4 matrix, i.e., the test suite contains three test cases and SHAKER sampled four configurations. In practice, these matrices are much bigger. The matrix on the left shows the probabilities of each configuration detecting flakiness on TR. The matrix on the right-hand side is obtained using the abstraction function described above with a threshold value of 0.5. There are five hitting sets associated with the abstract matrix, namely {c1,c2,c4}, {c1,c3,c4}, {c2,c3,c4}, {c2,c4}, and {c1,c2,c3,c4}. The MHS algorithm is able to identify that the set {c2,c4} is a minimal set that hits (i.e., covers) the tests in TR. In this case, it is also the minimum.


      c1    c2    c3    c4              c1   c2   c3   c4
t1    0.1   0.6   0.5   0.2       t1     0    1    1    0
t2    0.6   0.6   0.1   0.2       t2     1    1    0    0
t3    0.1   0.1   0.1   0.5       t3     0    0    0    1

Fig. 2: Original probability matrix (M) and its abstracted version (A) using a threshold of 0.5. MHS(A)={c2, c4}.
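To make the abstraction and selection steps concrete, the sketch below is our own illustration of this step, not SHAKER's implementation: the names abstractMatrix and greedyHittingSet are ours, and the selection uses a standard greedy approximation of the minimum hitting set rather than the complete MHS solver the paper builds on. It operates on the values from Figure 2.

import java.util.*;

public class ConfigSelection {
    // Abstracts the probability matrix M into a boolean matrix A using a threshold.
    static boolean[][] abstractMatrix(double[][] m, double threshold) {
        boolean[][] a = new boolean[m.length][m[0].length];
        for (int t = 0; t < m.length; t++)
            for (int c = 0; c < m[t].length; c++)
                a[t][c] = m[t][c] >= threshold;
        return a;
    }

    // Greedy hitting-set approximation: repeatedly pick the configuration (column)
    // that covers the most still-uncovered flaky tests (rows).
    static List<Integer> greedyHittingSet(boolean[][] a) {
        Set<Integer> uncovered = new HashSet<>();
        for (int t = 0; t < a.length; t++) uncovered.add(t);
        List<Integer> chosen = new ArrayList<>();
        while (!uncovered.isEmpty()) {
            int best = -1, bestCount = 0;
            for (int c = 0; c < a[0].length; c++) {
                int count = 0;
                for (int t : uncovered) if (a[t][c]) count++;
                if (count > bestCount) { bestCount = count; best = c; }
            }
            if (best < 0) break;           // remaining tests are not covered by any configuration
            final int b = best;
            chosen.add(b);
            uncovered.removeIf(t -> a[t][b]);
        }
        return chosen;
    }

    public static void main(String[] args) {
        double[][] m = {                   // probability matrix M from Figure 2
            {0.1, 0.6, 0.5, 0.2},
            {0.6, 0.6, 0.1, 0.2},
            {0.1, 0.1, 0.1, 0.5}};
        System.out.println(greedyHittingSet(abstractMatrix(m, 0.5)));  // prints [1, 3], i.e., {c2, c4}
    }
}

On this 3x4 example the greedy pick matches MHS(A) = {c2, c4}; in general, the greedy approximation may return a slightly larger set than the true minimum.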

C. Step 2: Discovering Flakies

Finally, SHAKER uses the configurations obtained in Step 1 to determine which of the tests the user provides are flaky. Figure 1 illustrates the inputs and output of this step in the box named "Flakies Discovery". SHAKER reruns the test suite on each configuration for a specified number of times and reports divergences in the test outputs.

Note the tension between cost and effectiveness in SHAKER. The execution of a test suite under a loaded environment should be slower compared to a regular (noiseless) execution, as different tasks are competing for the machine resources. Compared to ReRun, however, SHAKER may require fewer executions to detect flakiness. As a result, the aggregate cost of detecting flakiness would be lower.
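As a rough sketch of this online step (our own illustration; the TestRunner interface and Outcome type are hypothetical and not SHAKER's API), the discovery loop reruns the suite under each selected configuration and flags tests whose outcomes diverge across runs:

import java.util.*;

public class FlakyDiscovery {
    enum Outcome { PASS, FAIL, ERROR }

    // Hypothetical hook that runs the whole suite under one noise configuration
    // and returns the outcome of each test.
    interface TestRunner {
        Map<String, Outcome> runSuite(String noiseConfiguration);
    }

    // Reruns the suite 'repetitions' times per configuration and reports
    // every test that produced more than one distinct outcome.
    static Set<String> discoverFlakies(TestRunner runner,
                                       List<String> configurations,
                                       int repetitions) {
        Map<String, Set<Outcome>> observed = new HashMap<>();
        for (String config : configurations) {
            for (int i = 0; i < repetitions; i++) {
                runner.runSuite(config).forEach((test, outcome) ->
                        observed.computeIfAbsent(test, k -> EnumSet.noneOf(Outcome.class))
                                .add(outcome));
            }
        }
        Set<String> flaky = new TreeSet<>();
        observed.forEach((test, outcomes) -> { if (outcomes.size() > 1) flaky.add(test); });
        return flaky;
    }
}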
IV. OBJECTS OF ANALYSES AND SETUP

This section describes the methodology we used to build a data set of flaky tests (Section IV-A) and the setup of SHAKER that we used to run the experiments (Section IV-B).

A. Objects

We mined flaky tests from various GitHub projects to build a dataset to evaluate SHAKER. We used the following search criteria to select projects:
1) the project must be written in Java or Kotlin;
2) the project must have at least 100 stars;
3) the project must include tests in Espresso or UIAutomator;
4) the project needs to be built without errors.

We sampled a total of 11 projects that satisfied these criteria and used ReRun to find test flakiness. More precisely, we executed the test suite of these projects 50 times using a generic Android Emulator (AVD) with Android API version 28. As usual, we consider a test to be flaky if there was a disagreement in the outcomes (i.e., pass, fail, or error) across those runs. For example, we considered as flaky a test that passes in all but one run. To confirm that we could reproduce flakiness, we repeated the execution of each flaky test 100 times. Section V-E shows results of running SHAKER to find other flaky tests on this same set of projects, i.e., flakies that went undetected with the aforementioned procedure and that only SHAKER could detect.

Table I shows the 11 projects we used. The column "App" shows the name of the Android app, the column "#Tests" shows the number of Espresso or UIAutomator tests in that application, the column "#Flakies" shows the number of flaky tests detected, and column "@FlakyTest(+/-)" shows numbers x/y, with x indicating the number of flaky tests we found that do not contain the @FlakyTest annotation⁶ and y indicating the number of tests from the test suite containing the annotation @FlakyTest that we missed. Column "#Stars" shows the number of stars that the project received on GitHub. Finally, columns "GitHub..." and "SHA" show the GitHub address and the corresponding SHA prefix of the revision.

⁶ The @FlakyTest annotation is a JUnit test filter. It is used on test declarations to indicate that JUnit should exclude those tests from execution (if a corresponding command is provided on input).

We found flakiness in 75 of the 1,298 tests we analyzed (=5.78% of the total). We found flaky tests in 7 of the 11 apps. In two of these apps, namely Espresso and Flexbox-layout, we found only 1 flaky test. We highlighted the apps without flaky tests in gray color. Results also show that AntennaPod, FirefoxLite, and Orgzly were the apps with the highest numbers of flaky tests, with 12, 15, and 38 flaky tests, respectively. These projects are among those with the highest numbers of test cases too, with 250, 70, and 266 tests, respectively. Curiously, we found that Flexbox-layout is one of the apps with the highest number of tests (232) and the lowest number of flakies detected (1). Despite this evidence of determinism in the test suite, developers of this app chose to use the annotation @FlakyTest in all tests. We inspected the code and it appears that developers used the annotation only for documentation—all tests in that app have that annotation, and filtering test cases in that state would result in no test executed. Flexbox-layout is a library for implementing widgets that adhere to the CSS Flexible Box Layout module using the RecyclerView widget, which allows displaying lists and grids in Android applications. Its tests rely heavily on checking list/grid elements, which are typically computed in worker threads and require synchronization to be posted to the main thread.

Fig. 3: Histogram of test failures.

Figure 3 shows a histogram of the number of failures for the 75 flaky tests we discovered using ReRun. Nearly 52.1% of the flaky tests revealed a failure in 10 or fewer executions. This number is reflected in the leftmost bar of the histogram. Intuitively, to capture those cases, one would need to rerun the test suite five times, given that the failure probability for that group of tests is 20% (=10/50).


TABLE I: Apps and tests.

#    App              #Tests   #Flakies      @FlakyTest(+/-)   #Stars   GitHub URL (https://github.com/URL)      SHA
1    AntennaPod          250      12         12/0                2.8k   /AntennaPod/AntennaPod                   dd5234c
2    AnyMemo             150       0         0/0                  117   /helloworld1/AnyMemo                     7e674fb
3    Espresso             14       1         1/0                 1.1k   /TonnyL/Espresso                         043d028
4    FirefoxLite          70      15         15/3                 220   /mozilla-tw/FirefoxLite                  048d605
5    Flexbox-layout      232       1         0/231              15.5k   /google/flexbox-layout                   611c755
6    Kiss                 16       3         3/0                 1.5k   /Neamar/KISS                             00011ce
7    Omni-Notes           10       0         0/0                 2.1k   /federicoiosue/Omni-Notes                b7f9396
8    Orgzly              266      38         38/0                1.5k   /orgzly/orgzly-android                   d74235e
9    Paintroid           270       5         5/0                  101   /Catrobat/Paintroid                      1f302a2
10   Susi                 17       0         0/0                   2k   /fossasia/susi_android                   17a7031
11   WiFiAnalyzer          3       0         0/0                   1k   /VREMSoftwareDevelopment/WiFiAnalyzer    80e0b5d
Total                  1,298      75 (5.78%) 74/234                -    -                                        -

B. Setup

To evaluate SHAKER, we need to define (1) the training set of test cases TR that will be used to discover configurations (step 1) and (2) the testing set of test cases TS that will be used to discover the flaky tests (step 2). For that, we divided our dataset containing 75 flaky tests as follows. We randomly sampled 35 tests and added them to the training set. The remaining 40 tests were added to the testing set.

The configuration discovery step of SHAKER also requires a sample of random configurations that will constitute the columns of the concrete and abstract matrices. For that, we sampled 50 configurations uniformly distributed across the domains of the five parameters we analyzed (see Section III-A): four parameters from stress-ng and one parameter from the AVD. To obtain probabilities for the concrete matrix M, we ran each test on each of these configurations 3 times. The result of this execution is a probability matrix with failure probability values 0, 0.33, 0.66, and 1.0. To construct the abstract matrix A, we used a probability threshold of 0.66, i.e., values equal to or above that level are set to 1 (true) and values below that level are set to 0 (false).

V. EVALUATION

The goal of this section is to evaluate the effectiveness of SHAKER. We pose the following research questions.

RQ1. Do tests fail more often in noisy environments than in regular (non-noisy) environments?
The purpose of this question is to evaluate if executing tests in noisy environments has the effect of making tests fail more often. This is an important question because SHAKER builds on that assumption to detect flaky tests, so if results are positive, then SHAKER is likely to be useful.

RQ2. How repeatable is the discovery of flaky tests with a given noise configuration?
This question analyzes the variance of the results obtained with a given configuration of the noise generator. If results are very non-deterministic, then selecting configurations, as described in Section III-B2, is helpless as results would be unpredictable.

RQ3. How effective is the search for configurations of the noise generator (e.g., Greedy and MHS) that SHAKER uses?
This question evaluates the effect of the (configuration) search strategies proposed by SHAKER on their ability to detect flaky tests. More precisely, this question compares Greedy, MHS, and Random search in their effect to detect flakiness.

RQ4. How effective is SHAKER to find flaky tests?
This question evaluates how SHAKER compares with ReRun to find the flaky tests in our data set. To answer this question, we measured how long SHAKER and ReRun took to discover all flaky tests and how quickly each technique finds most tests.

RQ5. How effective is SHAKER to find new flaky tests?
For fairness with ReRun, the previous question restricted the evaluation of SHAKER to the set of flaky tests that ReRun itself was able to discover within 50 re-executions (see Section IV). This question evaluates the ability of SHAKER to discover flaky tests not detected by ReRun in the same set of projects.

The following sections elaborate the answers to each of these questions.

A. Answering RQ1: Do tests fail more often in noisy environments than in regular (non-noisy) environments?

The purpose of this question is to evaluate the effect of introducing noise in the environment where tests are executed. This question is important as SHAKER assumes there is such an effect. To answer the question, we ran statistical tests to evaluate if there are differences in the rate of failures observed in noisy executions versus standard executions.

As we evaluated the effect of noise on the same data set, we used a statistical test that takes two paired distributions as input: one distribution associated with regular executions and one distribution associated with noisy executions. Each number in a distribution corresponds to the rate of failures in one execution of the test suite, i.e., the fraction of the 75 tests from our data set that fails. We ran each test suite on each treatment 30 times. Consequently, each distribution contains 30 samples. First, we ran a Shapiro-Wilk test to check if the data is normally distributed. The p-values of this test are higher than the traditional threshold of α = 0.05. As such, we cannot reject the null hypothesis that the data is normally distributed with 95% confidence. Given that both data sets are normally distributed, we used the paired t-test parametric statistical hypothesis test to check if there is a difference in the measurements. The null hypothesis (H0) is that the measurements are the same, that is, introducing noise does not impact the rate of failure. The test indicated that we could reject H0 with 99% confidence as the p-value is less than 0.01.
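For readers who want to reproduce this kind of analysis, the sketch below is our own illustration (assuming Apache Commons Math 3 on the classpath; it is not the authors' analysis script, and the Shapiro-Wilk normality check is performed separately) of the paired t-test and the Cohen's d effect size used in the next paragraphs:

import org.apache.commons.math3.stat.StatUtils;
import org.apache.commons.math3.stat.inference.TTest;

public class NoiseEffect {
    // Compares paired failure-rate samples (one entry per suite execution,
    // e.g., 30 noisy runs and their 30 regular counterparts).
    public static void analyze(double[] noisyRates, double[] regularRates) {
        // Paired t-test: H0 says the mean failure rates are the same.
        double pValue = new TTest().pairedTTest(noisyRates, regularRates);

        // Cohen's d = (mean1 - mean2) / sd; here we use the pooled standard deviation.
        double meanDiff = StatUtils.mean(noisyRates) - StatUtils.mean(regularRates);
        double pooledSd = Math.sqrt((StatUtils.variance(noisyRates)
                + StatUtils.variance(regularRates)) / 2.0);
        double d = meanDiff / pooledSd;

        System.out.printf("p-value = %.4f, Cohen's d = %.2f%n", pValue, d);
    }
}

The pooled standard deviation used for d above is an assumption on our part; the text only states that σ is the standard deviation of the data.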


Fig. 5: Standard deviations of distributions of 10 reruns on each of the 15 randomly-selected configurations.

Since the distributions were different, we proceeded to evaluate the effect size. For that, we used Cohen's d to measure the difference in group means in terms of standard deviation units. This is given by d = (μ1 − μ2)/σ, where μ1 − μ2 is the difference between the two means from the sample data and σ is the standard deviation of the data. The magnitude of the effect varies with the value of d, ranging from very small (<0.01) to huge (>2.0) [21]. We obtained a value of d = 2.52, which indicates that the effect of introducing noise was huge.

Fig. 4: Histograms of ratio of failures for execution with and without noise. Vertical lines show average value.

Figure 4 shows histograms for the distributions of values associated with noisy and noiseless runs. Note that, although some configurations manifest a relatively small number of failures (0.15-0.2 range), most executions with noisy configurations raise more failures in tests. The average rate of failures in a noisy execution is 0.37. In contrast, regular executions have an average failure rate of 0.09, which is substantially smaller compared with that of noisy runs. To sum up:

Summary: Results indicate that introducing noise in the environment increases the ratio of failures in test suites with time-constrained test cases.

B. Answering RQ2: How repeatable is the discovery of flaky tests with a given noise configuration?

The goal of this question is to evaluate how repeatable the results obtained with a given configuration are. If the results obtained with two runs of the test suite with the same configuration are very different, then choosing configurations randomly would be no worse than systematically searching for configurations as described in Section III-B.

To answer this question, we randomly selected 15 noise configurations and ran the test suite on each one of them 10 times, measuring the percentage of failures detected on each execution. For each one of the 15 configurations, we generated a distribution with 10 samples, where each sample indicates the percentage of failures detected on a given configuration.

Figure 5 shows a boxplot associated with the standard deviations of these 15 distributions. Results indicate that the average and median standard deviation is ∼0.04, indicating that the average difference in measurements is not superior to 8% (avg±σ for a 95% confidence interval) of the total number of flaky tests found. We conclude that:

Summary: Despite the non-deterministic nature of the noise generation process, we observed relatively similar rates of failures across multiple executions of the same noise configuration.

C. Answering RQ3: How effective is the search for configurations of the noise generator (e.g., Greedy and MHS) that SHAKER uses?

Recall that SHAKER selects configurations in two steps (see Section III-B). First, it generates a probability matrix and then it selects configurations from that matrix. This research question evaluates the effectiveness of the configuration selection strategies SHAKER uses. More precisely, it evaluates the ability of different techniques to select configurations.

We evaluated three strategies for selecting configurations, namely, (i) MHS, (ii) Greedy, and (iii) Random. Recall that MHS is the technique that finds the smallest set of configurations whose execution of the test suite is capable of detecting all flaky tests from the training set (as per their associated probabilities in the abstract matrix). Greedy is the technique that selects configurations with maximum individual fitness scores (see Section III-B). Random serves as our control in this experiment. It is the technique that randomly selects configurations regardless of their scores. Both Greedy and Random select the same number of configurations as MHS.

The metric we used to compare techniques is the ratio of flaky tests detected when re-running the test suite against all configurations in the set associated with a technique. For example, let us consider that MHS produced four configurations. To obtain the score for that set, we execute the test suite four times, once for each configuration. Let us assume we observe discrepancy in the outputs of 30 of the 40 flaky tests from the testing set. In that case, the score associated with the configuration set will be 0.75 (=30/40).

We ran a statistical test to evaluate if there are differences in the measurements obtained by MHS, Greedy, and Random. The metric used was the fitness score, as described above, i.e., we measured the ratio of flaky tests detected when using each selection strategy. We ran each technique 10 times, so each distribution of measurements contains 10 samples.


As for RQ1, we ran a Shapiro-Wilk test to check if the data is normally distributed. We found that the p-values are above α = 0.05, therefore we concluded that the data is normally distributed. We then used Bartlett's test to check the homogeneity of variances, which revealed that the samples come from populations with the same variance. As such, we chose to use the one-way ANOVA (ANalysis Of VAriance) parametric test to check the statistical significance of the sample means by examining the variance of the samples. The null hypothesis (H0) is that there is no variation in the means of the sample measurements, which would indicate that there is no impact of changing selection strategies. The test reveals that there are statistically significant differences among treatments at the p<0.01 level (ANOVA F(2,27)=66.83, p<0.01).

TABLE II: Post-hoc analysis for RQ3 — Multiple Comparison of Means using Tukey HSD

Group 1   Group 2   Mean Diff.   p-adj.    Lower      Upper     Reject
Greedy    MHS         0.008      0.8352    -0.0279     0.0439   False
Greedy    Random     -0.141      0.001     -0.1769    -0.1051   True
MHS       Random     -0.149      0.001     -0.1849    -0.1131   True

We performed a post-hoc paired comparison to evaluate which of the techniques differ. For that, we used the Tukey HSD test to execute multiple pairwise comparisons. Table II shows the difference in means, the adjusted p-values, and the confidence intervals for all possible pairs. The adjusted p-values and confidence intervals show that between-group differences are significant only when comparing Random to MHS and to Greedy. Statistically, the results do not support the claim that Greedy and MHS are significantly different. To sum up:

Summary: Results indicate that there is an advantage in selecting noise configurations based on their fitness scores as opposed to randomly picking them. However, there is no statistical support to claim significant differences between MHS and Greedy.

D. Answering RQ4: How effective is SHAKER to find flaky tests?

The goal of this research question is to evaluate SHAKER's performance. To that end, we analyzed two dimensions: (i) efficiency, i.e., how fast it finds flaky tests, and (ii) completeness, i.e., what fraction of the set of known flaky tests the technique detects. The set of flaky tests to be detected corresponds to the test set, as defined in Section IV-B.

Fig. 6: Progress over time of SHAKER and ReRun in detecting flaky tests.

Figure 6 shows the progress of SHAKER and ReRun in detecting flaky tests over time. The x-axis denotes time in minutes whereas the y-axis denotes the number of flaky tests detected. The steep increase in the number of flaky tests detected by SHAKER indicates that it quickly discovers many flaky tests. For example, 26 of the 40 flaky tests failed (i.e., 65% of the total) in the first execution of the test suite with SHAKER. Recall that the test set involves test cases of multiple projects. ReRun detected flakiness at a much slower pace, as reflected by the growth of its plot. ReRun needed 43 re-executions of the test suite and 316m (=5h16m) to reach saturation with 15 flaky tests detected (i.e., 37.5% of the total) whereas SHAKER needed 14 re-executions and 106m (=1h46m) to reach saturation with 38 flaky tests detected (i.e., 95% of the total). The vertical dotted line in Figure 6 marks the 32m point in time, corresponding to 10% of the time required by ReRun to saturate, which is the limit of the x-axis. In contrast, at that point, SHAKER had already found 34 of the flaky tests (i.e., 85% of the total).

We used the Area Under the Curve (AUC) as a proxy for effectiveness. Rothermel et al. [22] pioneered the use of this metric to assess the performance of test case prioritization techniques. The larger the area under the curve, the better. For these progress plots, a higher area indicates a higher ability to detect flaky tests and to detect them quickly. Intuitively, an optimal technique would detect all flaky tests in one execution and would have the AUC of a big trapezoid. Considering the plot from Figure 6, the ReRun curve has an AUC of 3,491 whereas the SHAKER curve has an AUC of 11,628, i.e., the area of SHAKER is 3.33x (=11,628/3,491) higher than that of ReRun. We used the auc function of the MESS library in R to compute the AUCs of these plots. That implementation uses the traditional trapezoidal integration method⁷ to obtain AUCs.

It is worth noting that, because of the increased load in the environment, one execution of the test suite with SHAKER is slower than one execution of the test suite with ReRun. The average cost of one execution of the test suite with SHAKER is 8m55s whereas the average cost of one execution of the test suite with ReRun is 7m38s. Also note that SHAKER has the additional cost of finding configurations. However, this phase does not need to be executed as software evolves. Although the configurations that SHAKER finds can be calibrated with the inclusion of new projects and test suites for training, the proposal is to use the set of configurations found in test runs of any Android test suite. To sum up:

Summary: Results show that (1) SHAKER discovered many more flaky tests than ReRun (95% versus 37.5%) and (2) it discovered these flaky tests much faster. SHAKER discovered 85% of the total number of possible flakies in 10% of the time ReRun took to find its maximum number of flakies.

⁷ https://www.lexjansen.com/nesug/nesug02/ps/ps017.pdf
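For reference, the AUC values reported above follow from trapezoidal integration over the detection-progress points (t_i, f_i), where t_i is the elapsed time in minutes and f_i is the number of flaky tests detected by that time:

\mathrm{AUC} = \sum_{i=1}^{n-1} \frac{f_i + f_{i+1}}{2} \, (t_{i+1} - t_i)

We state the standard trapezoidal rule only to make the reported values interpretable; it is not the authors' exact script.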


E. Answering RQ5: How effective is SHAKER to find new flaky tests?

We ran SHAKER on each project for 5h with the goal of finding additional flaky tests. Table III shows the new flaky tests we found. We discarded tests with the @FlakyTest annotation from our search to avoid the risk of inflating our results, given that developers already signaled a potential issue on those tests. As such, we needed to discard the project Flexbox-layout as all tests from that project contain that annotation.

TABLE III: Number of new flaky tests found with SHAKER.

App             #flaky tests
                All    First 3
AntennaPod       27      20
FirefoxLite       5       3
Omni-Notes        4       3
Orgzly           14      10
Paintroid        10       8
WiFiAnalyzer      1       1
Total:           61      45

Overall, we found new flaky tests in 6 of the 11 projects (of the 10, if we exclude Flexbox-layout) and a total of 61 new flaky tests across these projects. SHAKER detected 45 of these flaky tests in the first three runs of the test suite, confirming the behavior observed in the previous experiment, when we showed that SHAKER could reveal most flaky tests very early during the search. We did not find any flaky tests in the projects AnyMemo, Espresso, Kiss, and Susi. As per Table I, note that the previous experiment we ran could not reveal any flaky tests on AnyMemo and Susi either. In contrast, we found new flaky tests in Omni-Notes and WiFiAnalyzer, projects that previously did not manifest test flakiness. It is important to recall that 95% of the 75 flaky tests found by ReRun are also found by SHAKER (see Section V-D). To sum up:

Summary: SHAKER revealed 61 new flaky tests that went undetected in 50 re-executions with ReRun.

For each project where we found flaky tests, we opened issues on GitHub informing the detected flaky tests, explaining how we detected flakiness, and pointing to a shell script with instructions for reproducing our steps.⁸ AntennaPod developers confirmed they are aware of flakiness for some of their tests, due to emulator performance when testing the Playback functionality. This is consistent with our discussion in Section II. Some of those tests are ignored on their Continuous Integration systems. Nevertheless, we still detected flakiness in other tests that developers considered important to fix.⁹ Omni-Notes developers also modified their test suite to include the @FlakyTest annotation in the tests we reported as flaky.¹⁰

⁸ See https://bit.ly/3dbqCZe for a script for AntennaPod.
⁹ https://github.com/AntennaPod/AntennaPod/issues/4194
¹⁰ https://github.com/federicoiosue/Omni-Notes/issues/761

F. Threats to validity

Threats to the construct validity are related to the appropriateness of the evaluation metrics we used. We used popular metrics from previous work. For example, we used the ratio of detected flakiness, the number of flaky tests detected, and the Area Under the Curve (AUC) to evaluate the techniques.

Threats to the internal validity compromise our confidence in establishing a relationship between the independent and dependent variables. To mitigate this threat, we carefully inspected the results of our evaluation. In addition, we ran our experiments on different machines to confirm the impact of noise in detecting flakiness.

Threats to the external validity relate to the ability to generalize our results. We cannot claim generalization of our results beyond the particular set of projects studied. In particular, our findings are intrinsically limited by the projects studied, as well as their domains. The problems we found are related to task coordination. Nevertheless, future work will have to investigate to what extent our findings generalize to software written in other programming languages and frameworks (beyond Android and UI tests).

VI. RELATED WORK

In the following, we describe recent work related to ours.

A. Empirical studies about bugs in test code

Different empirical studies [10], [23]–[25] have attempted to characterize the causes and symptoms of buggy tests, i.e., problematic tests that can fail raising a false alarm when there is no indication of a bug in the application code. This paper focuses on test flakiness, which is one of several possible types of test code issues. For example, Vahabzadeh et al. [23] mined the JIRA bug repository and the version control systems of the Apache Software Foundation and found that 5,556 unique bug fixes exclusively affected test code. They manually examined a sample of 499 buggy tests and found that 21% of these false alarms were related to flaky tests, which they further classified into Asynchronous Wait, Race Condition, and Concurrency Bugs. In principle, all such problems can result in timing constraints that SHAKER could capture. Note, however, that we focused on Android, that the diagnosis of defects is out of scope, and that SHAKER is a technique to find these issues whereas the aforementioned works manually analyze test artifacts.

Luo et al. [10] analyzed the commit history of the Apache Software Foundation central repository looking specifically for flakiness. They analyzed 1,129 commits including the keyword "flak" or "intermit", and then manually inspected each commit. They proposed 10 categories of flakiness root causes and summarized the most common strategies to repair them. Many of the problems reported are related to timing constraints that could, in principle, be captured by SHAKER. We remain to investigate how SHAKER performs for software of different domains.


Thorve et al. [2] conducted a study in Android apps and observed that the causes of test flakiness in Android apps are similar to those identified by Luo et al. [10]. They also found two new causes: Program Logic and UI. Altogether, these studies show that test flakiness is prevalent and a potential deterrent to software productivity.

B. Detection of test smells

Code smells are syntactical symptoms of poor design that could result in a variety of problems. Test smells manifest in test code as opposed to application code. Van Deursen et al. [26] described 11 sources of test smells and suggested corresponding refactorings to circumvent them. More recent studies have been conducted on the same topic. Bavota et al. [27] and Tufano et al. [28] separately studied the sources of test smells as defined by Van Deursen et al. [26]. They used simple syntactical patterns to detect these smells in code and then manually inspected them for validation. Bavota et al. found that up to 82% of the 637 test classes they analyzed contain at least one test smell. In related work, Tufano et al. studied the life cycle of test smells and concluded that they are introduced since test creation—instead of during evolution—and they survive through thousands of commits. Waterloo et al. [24] developed a set of (anti-)patterns to pinpoint problematic test code. They performed a study using 12 open source projects to assess the validity of those patterns. Garousi et al. [29] prepared a comprehensive catalogue of test smells and a summary of guidelines and tools to deal with them. Test flakiness may relate to test smells. For example, the use of sleeps is a good predictor of flakiness [9], [30]; sleeps induce time constraints that could be violated. We remain to investigate the extent to which static methods of flakiness prediction can improve the detection ability of SHAKER. For example, in principle, it is possible to instrument particular tests to initiate and terminate noise generation.

C. Detection of flaky tests

In principle, a test case should produce the same results regardless of the order in which it is executed in a test suite [31]. Unfortunately, this is not always the case, as the application code reached by the test cases can inadvertently modify static state, and resetting that state after the execution of a given test is impractical. Test dependency is one particular source of flakiness [10]. Gambi et al. [32] proposed a practical approach, based on flow analysis and iterative testing, to detect flakiness due to broken test dependencies. SHAKER is complementary to techniques for capturing broken test dependencies. It remains to investigate how a technique that forcefully modifies test orderings (e.g., discarding tests from test runs and modifying the order of test execution) compares with the approach proposed by Gambi et al.

Bell et al. proposed DeFlaker [8], a dynamic technique that uses test coverage to detect flakiness during software evolution. DeFlaker observes the latest code changes and marks any new failing test that did not execute changed code as flaky. The expectation is that a test that passed in the previous execution and did not execute changed code should still pass. When that does not happen, DeFlaker assumes that the changes in the coverage profile must have been caused by non-determinism. Note that DeFlaker is unable to determine flakiness if the coverage profile was impacted by a change and the ability of DeFlaker to detect flakiness is bound by the ability of ReRun itself. Shi et al. [33] proposed iFixFlakies to find and fix flaky tests caused by broken test dependencies. SHAKER focuses on a different source of flakiness, which often relates to time constraints, such as those brought by concurrency. Recently, Dong et al. [11] proposed FlakeShovel, a tool to detect flakiness in Android apps by monitoring and manipulating thread executions to change event orderings. It directly interacts with the Android runtime, instead of generating stress loads, as we did. We remain to evaluate how SHAKER compares with FlakeShovel.

Purely static approaches have also been proposed to identify flaky tests [9], [34]–[36]. An important benefit of these approaches is scalability. For example, it is possible to build services to proactively search for suspicious tests in open source repositories. On the downside, they only offer estimates of flakiness; re-execution is still necessary to confirm the issue. Herzig and Nagappan [34] developed a machine learning approach that mines association rules among individual test steps in tens of millions of false test alarms. King et al. [35] used Bayesian networks for flakiness classification. Pinto et al. [9] used binary text classification (e.g., Random Forests) to predict test flakiness. They used typical NLP techniques to classify flaky test cases—they tokenized the body of test cases, discarded stop words, put words in camel case, and built language models from the words associated with flaky and non-flaky tests. SHAKER is complementary to static techniques. We remain to evaluate how static classification techniques could be used to selectively run tests in a noisy environment.

VII. CONCLUSIONS

Flaky tests are a huge problem in industry. Their presence makes it difficult for developers to unambiguously interpret the results of a regression testing cycle. This paper proposes SHAKER, a lightweight approach to detect flakiness in time-constrained tests by adding noise in the execution environment. For example, SHAKER adds stressor tasks to create load on the CPU and memory. We evaluated SHAKER on a sample of 11 Android apps. Results are very encouraging. SHAKER discovered many more flaky tests than ReRun (95% and 37.5% of the total, respectively) and discovered these flaky tests much faster. In addition, SHAKER was able to reveal 61 new flaky tests that went undetected in 50 re-executions with ReRun. Our data sets and results are publicly available at https://github.com/shaker-project/shaker.

Acknowledgements. This research was partially funded by INES 2.0, FACEPE grants PRONEX APQ 0388-1.03/14 and APQ-0399-1.03/17, CAPES grant 88887.136410/2017-00, FACEPE grant APQ-0570-1.03/14 and CNPq (grants 465614/2014-0, 309032/2019-9, 406308/2016-0, 409335/2016-9).


REFERENCES

[1] J. Micco, “Flaky tests at Google and how we mitigate them,” 2016, https://testing.googleblog.com/2016/05/flaky-tests-at-google-and-how-we.html.
[2] S. Thorve, C. Sreshtha, and N. Meng, “An empirical study of flaky tests in Android apps,” in 2018 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 2018, pp. 534–538.
[3] M. T. Rahman and P. C. Rigby, “The impact of failing, flaky, and high failure tests on the number of crash reports associated with Firefox builds,” in Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ser. ESEC/FSE 2018. New York, NY, USA: Association for Computing Machinery, 2018, pp. 857–862. [Online]. Available: https://doi.org/10.1145/3236024.3275529
[4] J. Listfield, “Where do our flaky tests come from?” 2017, https://testing.googleblog.com/2017/04/where-do-our-flaky-tests-come-from.html.
[5] W. Lam, P. Godefroid, S. Nath, A. Santhiar, and S. Thummalapenta, “Root causing flaky tests in a large-scale industrial setting,” in Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis, ser. ISSTA 2019. New York, NY, USA: Association for Computing Machinery, 2019, pp. 101–111. [Online]. Available: https://doi.org/10.1145/3293882.3330570
[6] M. Harman and P. W. O’Hearn, “From start-ups to scale-ups: Opportunities and open problems for static and dynamic program analysis,” in Proc. SCAM’18, 2018.
[7] J. Palmer, “Test flakiness – methods for identifying and dealing with flaky tests,” 2019, https://labs.spotify.com/2019/11/18/test-flakiness-methods-for-identifying-and-dealing-with-flaky-tests/.
[8] J. Bell, O. Legunsen, M. Hilton, L. Eloussi, T. Yung, and D. Marinov, “DeFlaker: Automatically detecting flaky tests,” in Proceedings of the 40th International Conference on Software Engineering. ACM, 2018, pp. 433–444.
[9] G. Pinto, B. Miranda, S. Dissanayake, M. d’Amorim, C. Treude, and A. Bertolino, “What is the vocabulary of flaky tests?” in International Conference on Mining Software Repositories (MSR), 2020, to appear.
[10] Q. Luo, F. Hariri, L. Eloussi, and D. Marinov, “An empirical analysis of flaky tests,” in Proc. FSE’14, 2014.
[11] Z. Dong, A. Tiwari, X. L. Yu, and A. Roychoudhury, “Concurrency-related flaky test detection in Android apps,” ArXiv, vol. abs/2005.10762, 2020.
[12] Google, “Espresso API webpage,” 2020, https://developer.android.com/training/testing/espresso.
[13] “AntennaPod app website,” 2020, https://antennapod.org.
[14] Google, “UI Automator API webpage,” 2020, https://developer.android.com/training/testing/ui-automator.
[15] “Awaitility library,” 2020, http://www.awaitility.org.
[16] “AntennaPod PlaybackService.java,” 2020, https://tinyurl.com/y9vltj5e.
[17] C. King and A. Waterland, “stress-ng,” https://manpages.ubuntu.com/manpages/artful/man1/stress-ng.1.html#description, 2020, [Online; accessed April 2020].
[18] N. Alon, D. Moshkovitz, and S. Safra, “Algorithmic construction of sets for k-restrictions,” ACM Trans. Algorithms, vol. 2, no. 2, pp. 153–177, Apr. 2006. [Online]. Available: https://doi.org/10.1145/1150334.1150336
[19] A. Gainer-Dewar and P. Vera-Licona, “The minimal hitting set generation problem: algorithms and computation,” CoRR, vol. abs/1601.02939, 2016. [Online]. Available: http://arxiv.org/abs/1601.02939
[20] A. Rebert, S. K. Cha, T. Avgerinos, J. Foote, D. Warren, G. Grieco, and D. Brumley, “Optimizing seed selection for fuzzing,” in USENIX Security Symposium, 2014, pp. 861–875.
[21] S. Sawilowsky, “New effect size rules of thumb,” Journal of Modern Applied Statistical Methods, vol. 8(2), pp. 597–599, 2009.
[22] G. Rothermel, R. H. Untch, C. Chu, and M. J. Harrold, “Test case prioritization: An empirical study,” in ICSM, 1999.
[23] A. Vahabzadeh, A. M. Fard, and A. Mesbah, “An empirical study of bugs in test code,” in Proceedings of the 2015 IEEE International Conference on Software Maintenance and Evolution (ICSME), ser. ICSME ’15. USA: IEEE Computer Society, 2015, pp. 101–110. [Online]. Available: https://doi.org/10.1109/ICSM.2015.7332456
[24] M. Waterloo, S. Person, and S. Elbaum, “Test analysis: Searching for faults in tests,” in Proceedings of the 30th IEEE/ACM International Conference on Automated Software Engineering, ser. ASE ’15. Piscataway, NJ, USA: IEEE Press, 2015, pp. 149–154. [Online]. Available: https://doi.org/10.1109/ASE.2015.37
[25] H. K. V. Tran, N. B. Ali, J. Börstler, and M. Unterkalmsteiner, “Test-case quality–understanding practitioners’ perspectives,” in International Conference on Product-Focused Software Process Improvement. Springer, 2019, pp. 37–52.
[26] A. Van Deursen, L. Moonen, A. Van Den Bergh, and G. Kok, “Refactoring test code,” in Proceedings of the 2nd International Conference on Extreme Programming and Flexible Processes in Software Engineering (XP2001), 2001, pp. 92–95.
[27] G. Bavota, A. Qusef, R. Oliveto, A. De Lucia, and D. Binkley, “An empirical analysis of the distribution of unit test smells and their impact on software maintenance,” in 2012 28th IEEE International Conference on Software Maintenance (ICSM). IEEE, 2012, pp. 56–65.
[28] M. Tufano, F. Palomba, G. Bavota, M. Di Penta, R. Oliveto, A. De Lucia, and D. Poshyvanyk, “An empirical investigation into the nature of test smells,” in 2016 31st IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2016, pp. 4–15.
[29] V. Garousi, B. Kucuk, and M. Felderer, “What we know about smells in software test code,” IEEE Software, vol. 36, no. 3, pp. 61–73, 2018.
[30] F. Palomba, A. Zaidman, and A. De Lucia, “Automatic test smell detection using information retrieval techniques,” in 2018 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 2018, pp. 311–322.
[31] S. Zhang, D. Jalali, J. Wuttke, K. Muslu, W. Lam, M. D. Ernst, and D. Notkin, “Empirically revisiting the test independence assumption,” in International Symposium on Software Testing and Analysis, ISSTA ’14, San Jose, CA, USA, July 21–26, 2014, 2014, pp. 385–396.
[32] A. Gambi, J. Bell, and A. Zeller, “Practical test dependency detection,” in 2018 IEEE 11th International Conference on Software Testing, Verification and Validation (ICST). IEEE, 2018, pp. 1–11.
[33] A. Shi, W. Lam, R. Oei, T. Xie, and D. Marinov, “iFixFlakies: A framework for automatically fixing order-dependent flaky tests,” in Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2019, pp. 545–555.
[34] K. Herzig and N. Nagappan, “Empirically detecting false test alarms using association rules,” in Proceedings of the 37th International Conference on Software Engineering - Volume 2, ser. ICSE ’15. Piscataway, NJ, USA: IEEE Press, 2015, pp. 39–48. [Online]. Available: http://dl.acm.org/citation.cfm?id=2819009.2819018
[35] T. M. King, D. Santiago, J. Phillips, and P. J. Clarke, “Towards a Bayesian network model for predicting flaky automated tests,” in 2018 IEEE International Conference on Software Quality, Reliability and Security Companion (QRS-C). IEEE, 2018, pp. 100–107.
[36] A. Bertolino, E. Cruciani, B. Miranda, and R. Verdecchia, “Know Your Neighbor: Fast Static Prediction of Test Flakiness,” Jan. 2020. [Online]. Available: https://doi.org/10.32079/ISTI-TR-2020/001
