A Study of Static Bug Detectors
Total Page:16
File Type:pdf, Size:1020Kb
How Many of All Bugs Do We Find? A Study of Static Bug Detectors Andrew Habib Michael Pradel [email protected] [email protected] Department of Computer Science Department of Computer Science TU Darmstadt TU Darmstadt Germany Germany ABSTRACT International Conference on Automated Software Engineering (ASE ’18), Sep- Static bug detectors are becoming increasingly popular and are tember 3–7, 2018, Montpellier, France. ACM, New York, NY, USA, 12 pages. widely used by professional software developers. While most work https://doi.org/10.1145/3238147.3238213 on bug detectors focuses on whether they find bugs at all, and on how many false positives they report in addition to legitimate 1 INTRODUCTION warnings, the inverse question is often neglected: How many of all Finding software bugs is an important but difficult task. For average real-world bugs do static bug detectors find? This paper addresses industry code, the number of bugs per 1,000 lines of code has been this question by studying the results of applying three widely used estimated to range between 0.5 and 25 [21]. Even after years of static bug detectors to an extended version of the Defects4J dataset deployment, software still contains unnoticed bugs. For example, that consists of 15 Java projects with 594 known bugs. To decide studies of the Linux kernel show that the average bug remains in which of these bugs the tools detect, we use a novel methodology the kernel for a surprisingly long period of 1.5 to 1.8 years [8, 24]. that combines an automatic analysis of warnings and bugs with a Unfortunately, a single bug can cause serious harm, even if it has manual validation of each candidate of a detected bug. The results been subsisting for a long time without doing so, as evidenced by of the study show that: (i) static bug detectors find a non-negligible examples of software bugs that have caused huge economic loses amount of all bugs, (ii) different tools are mostly complementary to and even killed people [17, 28, 46]. each other, and (iii) current bug detectors miss the large majority Given the importance of finding software bugs, developers rely of the studied bugs. A detailed analysis of bugs missed by the static on several approaches to reveal programming mistakes. One ap- detectors shows that some bugs could have been found by variants proach is to identify bugs during the development process, e.g., of the existing detectors, while others are domain-specific problems through pair programming or code review. Another direction is that do not match any existing bug pattern. These findings help testing, ranging from purely manual testing over semi-automated potential users of such tools to assess their utility, motivate and out- testing, e.g., via manually written but automatically executed unit line directions for future work on static bug detection, and provide tests, to fully automated testing, e.g., with UI-level testing tools. a basis for future comparisons of static bug detection with other Once the software is deployed, runtime monitoring can reveal so bug finding techniques, such as manual and automated testing. far missed bugs, e.g., collect information about abnormal runtime behavior, crashes, and violations of safety properties, e.g., expressed CCS CONCEPTS through assertions. Finally, developers use static bug detection tools, • Software and its engineering → Automated static analysis; which check the source code or parts of it for potential bugs. Software testing and debugging; • General and reference → In this paper, we focus on static bug detectors because they have Empirical studies; become increasingly popular in recent years and are now widely used by major software companies. Popular tools include Google’s KEYWORDS Error Prone [1], Facebook’s Infer [7], or SpotBugs, the successor to the widely used FindBugs tool [10]. These tools are typically static bug checkers, bug finding, static analysis, Defects4J designed as an analysis framework based on some form of static ACM Reference Format: analysis that scales to complex programs, e.g., AST-based pattern Andrew Habib and Michael Pradel. 2018. How Many of All Bugs Do We Find? matching or data-flow analysis. Based on the framework, the tools A Study of Static Bug Detectors. In Proceedings of the 2018 33rd ACM/IEEE contain an extensible set of checkers that each addresses a specific bug pattern, i.e., a class of bugs that occurs across different code bases. Typically, a bug detector ships with dozens or even hundreds Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed of patterns. The main benefit of static bug detectors compared to for profit or commercial advantage and that copies bear this notice and the full citation other bug finding approaches is that they find bugs early inthe on the first page. Copyrights for components of this work owned by others than the development process, possibly right after a developer introduces a author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission bug. Furthermore, applying static bug detectors does not impose and/or a fee. Request permissions from [email protected]. any special requirements, such as the availability of tests, and can ASE ’18, September 3–7, 2018, Montpellier, France be fully automated. © 2018 Copyright held by the owner/author(s). Publication rights licensed to ACM. ACM ISBN 978-1-4503-5937-5/18/09...$15.00 The popularity of static bug detectors and the growing set of https://doi.org/10.1145/3238147.3238213 bug patterns covered by them raise a question: How many of all 316 ASE ’18, September 3–7, 2018, Montpellier, France Andrew Habib and Michael Pradel real-world bugs do these bug detectors find? Or in other words, what numbers of false positives. Huge advances in static bug detection is the recall of static bug detectors? Answering this question is have been made since then. Our study focuses on a novel generation important for several reasons. First, it is an important part of as- of static bug detectors, including tools that have been adopted by sessing the current state-of-the-art in automatic bug finding. Most major industry players and that are in wide use. reported evaluations of bug finding techniques focus on showing The main findings of our study are the following: that a technique detects bugs and how precise it is, i.e., how many • The three bug detectors together reveal 27 of the 594 studied of all reported warnings correspond to actual bugs rather than false bugs (4.5%). This non-negligible number is encouraging and positives. We do not consider these questions here. In contrast, shows that static bug detectors can be beneficial. practically no evaluation considers the above recall question. The • The percentage of detected among all bugs ranges between reason for this omission is that the set of “all bugs” is unknown (oth- 0.84% and 3%, depending on the bug detector. This result erwise, the bug detection problem would have been solved), making points out a significant potential for improvement, e.g., by it practically impossible to completely answer the question. Second, considering additional bug patterns. It also shows that check- understanding the strengths and weaknesses of existing static bug ers are mostly complementary to each other. detectors will guide future work toward relevant challenges. For • The majority of missed bugs are domain-specific problems example, better understanding of which bugs are currently missed not covered by any existing bug pattern. At the same time, may enable future techniques to cover previously ignored classes several bugs could have been found by minor variants of the of bugs. Third, studying the above question for multiple bug detec- existing bug detectors. tors allows us to compare the effectiveness of existing tools with each other: Are existing tools complementary to each other or does 2 METHODOLOGY one tool subsume another one? Fourth and finally, studying the This section presents our methodology for studying which bugs above question will provide an estimate of how close the current are detected by static bug detectors. At first, we describe the bugs state-of-the-art is to the ultimate, but admittedly unrealistic, goal (§ 2.1) and bug detection tools (§ 2.2) that we study. Then, we of finding all bugs. present our experimental procedure for identifying and validating To address the question of how many of all bugs do static bug matches between the warnings reported by the bug detectors and detectors find, we perform an empirical study with 594 real-world the real-world bugs (§ 2.3). Finally, we discuss threats to validity bugs from 15 software projects, which we analyze with three widely in § 2.4. used static bug detectors. The basic idea is to run each bug detector on a version of a program that contains a specific bug, and to check 2.1 Real-World Bugs whether the bug detector finds this bug. While being conceptually Our study builds on an extended version of the Defects4J data set, simple, realizing this idea is non-trivial for real-world bugs and bug a collection of bugs from popular Java projects. In total, the data detectors. The main challenge is to decide whether the set of warn- set consists of 597 bugs that are gathered from different versions ings reported by a bug detector captures the bug in question. To of 15 projects. We use Defects4J for this study for three reasons.