Shake It! Detecting Flaky Tests Caused by Concurrency with Shaker
2020 IEEE International Conference on Software Maintenance and Evolution (ICSME)

Denini Silva, Leopoldo Teixeira, Marcelo d’Amorim
Federal University of Pernambuco, Brazil
{dgs,lmt,damorim}@cin.ufpe.br

Abstract—A test is said to be flaky when it non-deterministically passes or fails. Test flakiness negatively affects the effectiveness of regression testing and, consequently, impacts software evolution. Detecting test flakiness is an important and challenging problem. ReRun is the most popular approach in industry to detect test flakiness. It re-executes a test suite on a fixed code version multiple times, looking for inconsistent outputs across executions. Unfortunately, ReRun is costly and unreliable. This paper proposes SHAKER, a lightweight technique to improve the ability of ReRun to detect flaky tests. SHAKER adds noise to the execution environment (e.g., it adds stressor tasks that compete for the CPU or memory). It builds on the observations that concurrency is an important source of flakiness and that adding noise to the environment can interfere with the ordering of events and, consequently, influence the test outputs. We conducted experiments on a data set with 11 Android apps. Results are very encouraging. SHAKER discovered many more flaky tests than ReRun (95% and 37.5% of the total, respectively) and discovered these flaky tests much faster. In addition, SHAKER was able to reveal 61 new flaky tests that went undetected in 50 re-executions with ReRun.

Keywords—test flakiness, regression testing, noise

I. INTRODUCTION

A test is said to be flaky when it non-deterministically passes or fails depending on the running environment [1]. For example, a test may fail or pass depending on the availability of a server that the test itself has spawned: the test passes if the server is up and running at the point the test makes a request to the server, and it fails otherwise.

Test flakiness hurts software testing practice in multiple ways. For example, Thorve et al. [2] found that, because of test flakiness, developers lost trust in test results, sometimes choosing to ignore test failures altogether. Even ignoring failures of known flaky tests, or ignoring previously classified flaky tests (i.e., not executing those tests), can be dangerous, as those tests could reveal bugs in code. Furthermore, ignoring flakiness can produce the effect of observing even more failures to be analyzed during software evolution. For example, Rahman and Rigby [3] found that when developers ignored flaky test failures during a build, the deployed build experienced more crashes than builds that did not contain any flaky test failures. Test flakiness is a huge problem in industry. Most test failures at Google are due to flaky tests [1], [4]. At Microsoft, the presence of flaky tests also imposes a significant burden on developers. Wing et al. [5] reported that 58 Microsoft developers involved in a survey considered flaky tests to be the second most important reason, out of 10 reasons, for slowing down software deployments. Finally, Facebook considers flakiness detection a priority [6].

ReRun is the most popular approach in industry to detect test flakiness [1], [7]. It reruns tests multiple times. A test that failed and then passed is considered flaky, and vice versa. The status of a test that persistently failed is unknown, but developers typically treat this scenario as a problem in application code as opposed to a bug in test code. Unfortunately, ReRun is unreliable and expensive. It is unreliable because it is hard to determine the number of reruns needed to find a discrepancy in outputs. It is expensive because rerunning tests consumes a lot of computing power. Google, for example, uses 2-16% of its testing budget just to rerun flaky tests [1].
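For concreteness, the following minimal sketch (our illustration, not any specific CI system's implementation) captures the essence of ReRun; runTest is a hypothetical stand-in for executing one test on a fixed code version:

import java.util.HashSet;
import java.util.Set;
import java.util.function.Supplier;

public class ReRunSketch {
  // Rerun a test on a fixed code version; flag it as flaky if the
  // outcome is inconsistent across executions.
  static boolean isFlaky(Supplier<Boolean> runTest, int reruns) {
    Set<Boolean> outcomes = new HashSet<>();
    for (int i = 0; i < reruns; i++) {
      outcomes.add(runTest.get()); // true = pass, false = fail
    }
    return outcomes.size() > 1; // both outcomes observed => flaky
  }
}

Both weaknesses are visible in the sketch: the rerun count must be guessed up front, and a persistently failing test is never flagged.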
Researchers have proposed various techniques to identify flaky tests. Bell et al. [8] proposed DeFlaker, a technique that determines that a test is flaky if the output of the test has changed even though there was no change in the code reachable from the execution trace. Note that DeFlaker cannot capture flakiness in test cases not impacted by changes and that change-impacted flaky tests do not necessarily reveal flakiness. Pinto et al. [9] proposed the use of text processing and classification to statically identify likely flaky test cases from the keywords they use.

This paper proposes SHAKER, a lightweight technique to improve the ability of ReRun to detect flaky tests. SHAKER adds noise to the execution environment with the goal of provoking failures in time-constrained tests. For example, it adds stressor tasks that compete with the test execution for the CPU or memory. SHAKER builds on the observations that concurrency is an important source of flakiness [10], [11] and that adding noise to the environment can interfere with the ordering of events related to test execution and, consequently, influence the test outputs. The process of detecting flaky tests consists of two steps. In the first, offline step, SHAKER uses a sample of tests known to be flaky to search for configurations of a noise generator to be used for revealing flakiness in new tests. To that end, SHAKER first builds a probability matrix encoding the relation between flaky tests and noise configurations. The matrix shows the probability of a test failing when executed with a given noise configuration. Then, SHAKER uses that matrix to search for sets of configurations that reveal the highest number of flaky tests. In the second, online step, SHAKER uses those configurations to find time-constrained flaky tests in the project of interest.
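The search over the matrix is only summarized above. One plausible way to read it is as a coverage problem: pick a small set of configurations that together reveal as many known flaky tests as possible. The sketch below is our illustration of that reading, with hypothetical names and a hypothetical failure-probability threshold, not SHAKER's actual algorithm:

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ConfigSearchSketch {
  // prob[t][c] = estimated probability that known-flaky test t fails when
  // executed under noise configuration c (the probability matrix above).
  // We assume a configuration "reveals" t if prob[t][c] >= threshold.
  static List<Integer> selectConfigs(double[][] prob, double threshold, int budget) {
    int numTests = prob.length;
    int numConfigs = numTests == 0 ? 0 : prob[0].length;
    Set<Integer> revealed = new HashSet<>();
    List<Integer> chosen = new ArrayList<>();
    for (int k = 0; k < budget; k++) {
      int best = -1, bestGain = 0;
      for (int c = 0; c < numConfigs; c++) {
        int gain = 0; // how many new flaky tests this configuration reveals
        for (int t = 0; t < numTests; t++) {
          if (!revealed.contains(t) && prob[t][c] >= threshold) gain++;
        }
        if (gain > bestGain) { bestGain = gain; best = c; }
      }
      if (best < 0) break; // remaining configurations reveal nothing new
      chosen.add(best);
      for (int t = 0; t < numTests; t++) {
        if (prob[t][best] >= threshold) revealed.add(t);
      }
    }
    return chosen;
  }
}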
We conducted experiments on a data set with 75 flaky tests of 11 open source Android apps. Preliminary results provide early evidence that SHAKER is promising. SHAKER discovered many more flaky tests than ReRun (95% and 37.5% of the total from our data set, respectively) and discovered these flaky tests much faster than ReRun. SHAKER discovered 85% of the total number of possible flakies in 10% of the average time ReRun took to find its maximum number of flakies. In addition, SHAKER was able to reveal 61 new flaky tests that went undetected in 50 re-executions with ReRun. This paper makes the following contributions:

Approach. We proposed a simple lightweight approach to find time-constrained flaky tests by introducing noise in the test environment where tests will be executed.

Implementation. We developed a tool implementing SHAKER (available per request due to double blindness).

Evaluation. We evaluated SHAKER on 11 Android apps with encouraging results.

Artifacts. Our scripts, data sets, and list of issues submitted to GitHub projects are publicly available at the following link: https://github.com/shaker-project/shaker.

II. EXAMPLES

This section presents two examples of flaky tests in Android applications (apps) to motivate and illustrate SHAKER. Before presenting them, it is worth noting that Android testing frameworks provide means to handle asynchronous operations. However, developers may fail to properly coordinate tasks from tests.

A. AntennaPod

AntennaPod [13] is an open source podcast manager for Android supporting episode download and streaming. The AntennaPod app is implemented in Java in ∼50KLOC. As of the date of submission, the app had 250 GUI tests written in Espresso [12] and UIAutomator [14].

Listing 1 shows a simplified version of a test that checks whether a podcast episode can be played twice. It uses the Awaitility library [15] to handle asynchronous events, such as I/O events related to notifications of media playback.

Listing 1: AntennaPod Test
1  @Test
2  public void testReplayEpisodeContinuousPlaybackOff() throws Exception {
3    setContinuousPlaybackPreference(false);
4    uiTestUtils.addLocalFeedData(true);
5    activityTestRule.launchActivity(new Intent());
6    // Navigate to the first episode in the list of episodes and click
7    openNavDrawer();
8    onDrawerItem(withText(R.string.episodes_label)).perform(click());
9    onView(isRoot()).perform(waitForView(withText(R.string.all_episodes_short_label), 1000));
10   onView(withText(R.string.all_episodes_short_label)).perform(click());
11   final List<FeedItem> episodes = DBReader.getRecentlyPublishedEpisodes(0, 10);
12   Matcher<View> allEpisodesMatcher = allOf(withId(android.R.id.list), isDisplayed(), hasMinimumChildCount(2));
13   onView(isRoot()).perform(waitForView(allEpisodesMatcher, 1000));
14   onView(allEpisodesMatcher).perform(actionOnItemAtPosition(0, clickChildViewWithId(R.id.secondaryActionButton)));
15   FeedMedia media = episodes.get(0).getMedia();
16   Awaitility.await().atMost(1, TimeUnit.SECONDS).until(() -> media.getId() == PlaybackPreferences.getCurrentlyPlayingFeedMediaId());
     ...
   }

Executing the statement at line 3 turns off the continuous playback option. This stops the app from automatically playing the next episode in the queue after it finishes the current one. The statement at line 4 then adds local data (e.g., podcast feeds, images, and episodes) to the app, whereas the execution of the statements at lines 7–14 navigates through the GUI objects with the effect of playing the first episode in the queue. Line 16 shows an assertion based on the Awaitility library. The assertion checks if the episode is being played and indicates that test execution should wait for at most one second to verify this (line 16).
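A deadline like this is exactly the kind of time constraint that environment noise can violate. As an illustration of the idea (our sketch, not SHAKER's actual noise generator), busy-loop stressors such as the following compete with the app under test for CPU time and can delay the playback service past the one-second wait, making the assertion at line 16 prone to timing out:

import java.util.ArrayList;
import java.util.List;

public class CpuNoiseSketch {
  // Start one busy-loop thread per core to compete with the test for CPU.
  static List<Thread> startStressors() {
    List<Thread> stressors = new ArrayList<>();
    int cores = Runtime.getRuntime().availableProcessors();
    for (int i = 0; i < cores; i++) {
      Thread t = new Thread(() -> {
        while (!Thread.currentThread().isInterrupted()) {
          // spin: consume CPU cycles to delay other threads
        }
      });
      t.setDaemon(true);
      t.start();
      stressors.add(t);
    }
    return stressors;
  }

  // Stop the noise after the test run finishes.
  static void stopStressors(List<Thread> stressors) {
    stressors.forEach(Thread::interrupt);
  }
}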
When the play button is pressed (line 14), the app runs a custom service in the