Continuously Reasoning About Programs Using Differential Bayesian Inference
Kihong Heo∗, University of Pennsylvania, USA
Mukund Raghothaman∗, University of Pennsylvania, USA
Xujie Si, University of Pennsylvania, USA
Mayur Naik, University of Pennsylvania, USA

∗The first two authors contributed equally to this work.

Abstract

Programs often evolve by continuously integrating changes from multiple programmers. The effective adoption of program analysis tools in this continuous integration setting is hindered by the need to only report alarms relevant to a particular program change. We present a probabilistic framework, Drake, to apply program analyses to continuously evolving programs. Drake is applicable to a broad range of analyses that are based on deductive reasoning. The key insight underlying Drake is to compute a graph that concisely and precisely captures differences between the derivations of alarms produced by the given analysis on the program before and after the change. Performing Bayesian inference on the graph then makes it possible to rank alarms by likelihood of relevance to the change. We evaluate Drake using Sparrow (a static analyzer that targets buffer-overrun, format-string, and integer-overflow errors) on a suite of ten widely-used C programs each comprising 13k–112k lines of code. Drake enables the user to discover all true bugs by inspecting only 30 alarms per benchmark on average, compared to 85 (3× more) alarms by the same ranking approach in batch mode, and 118 (4× more) alarms by a differential approach based on syntactic masking of alarms, which also misses 4 of the 26 bugs overall.

CCS Concepts · Software and its engineering → Automated static analysis; Software evolution; · Mathematics of computing → Bayesian networks.

Keywords Static analysis, software evolution, continuous integration, alarm relevance, alarm prioritization

ACM Reference Format:
Kihong Heo, Mukund Raghothaman, Xujie Si, and Mayur Naik. 2019. Continuously Reasoning about Programs using Differential Bayesian Inference. In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '19), June 22–26, 2019, Phoenix, AZ, USA. ACM, New York, NY, USA, 15 pages. https://doi.org/10.1145/3314221.3314616

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
PLDI '19, June 22–26, 2019, Phoenix, AZ, USA
© 2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-6712-7/19/06...$15.00
https://doi.org/10.1145/3314221.3314616

1 Introduction

The application of program analysis tools such as Astrée [5], SLAM [2], Coverity [4], FindBugs [22], and Infer [7] to large software projects has highlighted research challenges at the intersection of program reasoning theory and software engineering practice. An important aspect of long-lived, multi-developer projects is the practice of continuous integration, where the codebase evolves through multiple versions which are separated by incremental changes. In this context, programmers are typically less worried about the possibility of bugs in existing code (which has been in active use in the field) and in parts of the project which are unrelated to their immediate modifications. They specifically want to know whether the present commit introduces new bugs or regressions, or breaks assumptions made by the rest of the codebase [4, 50, 57]. How do we determine whether a static analysis alarm is relevant for inspection given a small change to a large program?

A common approach is to suppress alarms that have already been reported on previous versions of the program [4, 16, 19]. Unfortunately, such syntactic masking of alarms has a great risk of missing bugs, especially when the commit modifies code in library routines or in commonly used helper methods, since the new code may make assumptions that are not satisfied by the rest of the program [44]. Therefore, even alarms previously reported and marked as false positives may potentially need to be inspected again.

In this paper, we present a probabilistic framework to apply program analyses to continuously evolving programs. The framework, called Drake, must address four key challenges to be effective. First, it must overcome the limitation of syntactic masking by reasoning about how semantic changes impact alarms. For this purpose, it employs derivations of alarms (logical chains of cause and effect) produced by the given analysis on the program before and after the change. Such derivations are naturally obtained from analyses whose reasoning can be expressed or instrumented via deductive rules. As such, Drake is applicable to a broad range of analyses, including those commonly specified in the logic programming language Datalog [6, 46, 59].

Second, Drake must relate abstract states of the two program versions which do not share a common vocabulary. We build upon previous syntactic program differencing work by setting up a matching function which maps source locations, variable names, and other syntactic entities of the old version of the program to the corresponding entities of the new version. The matching function makes it possible to relate not only alarms but also the derivations that produce them.

Third, Drake must efficiently and precisely compute the relevance of each alarm to the program change. For this purpose, it constructs a differential derivation graph that captures differences between the derivations of alarms produced by the given analysis on the program before and after the change. For a fixed analysis, this graph construction takes effectively linear time, and it captures all derivations of each alarm in the old and new program versions.

Finally, Drake must be able to rank the alarms based on likelihood of relevance to the program change. For this purpose, we leverage recent work on probabilistic alarm ranking [53] by performing Bayesian inference on the graph. This approach also makes it possible to further improve the ranking by taking advantage of any alarm labels provided by the programmer offline in the old version and online in the new version of the program.

We have implemented Drake and demonstrate how to apply it to two analyses in Sparrow [49], a sophisticated static analyzer for C programs: an interval analysis for buffer-overrun errors, and a taint analysis for format-string and integer-overflow errors. We evaluate the resulting analyses on a suite of ten widely-used C programs each comprising 13k–112k lines of code, using recent versions of these programs involving fixes of bugs found by these analyses. We compare Drake's performance to two state-of-the-art baseline approaches: probabilistic batch-mode alarm ranking [53] and syntactic alarm masking [50]. To discover all the true bugs, the Drake user has to inspect only 30 alarms on aver-

Contributions. In summary, we make the following contributions in this paper:
1. We propose a new probabilistic framework, Drake, to apply static analyses to continuously evolving programs. Drake is applicable to a broad range of analyses that are based on deductive reasoning.
2. We present a new technique to relate static analysis alarms between the old and new versions of a program. It ranks the alarms based on likelihood of relevance to the difference between the two versions.
3. We evaluate Drake using different static analyses on widely-used C programs and demonstrate significant improvements in false positive rates and missed bugs.

2 Motivating Example

We explain our approach using the C program shown in Figure 1. It is an excerpt from the audio file processing utility shntool, and highlights changes made to the code between versions 3.0.4 and 3.0.5, which we will call P_old and P_new respectively. Lines preceded by a "+" indicate code which has been added, and lines preceded by a "-" indicate code which has been removed from the new version. The integer overflow analysis in Sparrow reports two alarms in each version of this code snippet, which we describe next.

The first alarm, reported at line 30, concerns the command line option "t". This program feature trims periods of silence from the ends of an audio file. The program reads unsanitized data into the field info->header_size at line 25, and allocates a buffer of proportional size at line 30. Sparrow observes this data flow, concludes that the multiplication could overflow, and subsequently raises an alarm at the allocation site. However, this data has been sanitized at line 29, so that the expression header_size * sizeof(char) cannot overflow. This is therefore a false alarm in both P_old and P_new. We will refer to this alarm as Alarm(30).

The second alarm is reported at line 45, and is triggered by the command line option "c". This program feature compares the contents of two audio files. The first version has source-sink flows from the untrusted fields info1->data_size and info2->data_size, but this is a false alarm since the value of bytes cannot be larger than CMP_SIZE. On the other hand, the new version of the program includes an option to offset the contents of one file by shift_secs seconds. This value is used without sanitization to compute cmp_size, leading to a possible integer overflow at line 42, which would then result in a buffer of unexpected size being allocated at line 45.
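
The risk of syntactic masking on this example can be made concrete with a minimal sketch. It is only an illustration under stated assumptions: the function name `syntactic_masking`, the string identifiers for the alarms, and the identity `matching` dictionary are hypothetical and are not part of Drake or Sparrow. The only facts taken from the text are that both alarms are reported in both versions and that the alarm at line 45 corresponds to a possible real overflow only in the new version.

```python
# Hypothetical illustration of syntactic alarm masking on the shntool example.
# Nothing here is Drake's or Sparrow's actual API; all names are assumptions.

def syntactic_masking(old_alarms, new_alarms, matching):
    """Return the new-version alarms that are not the image, under the
    matching function, of an alarm already reported on the old version."""
    already_reported = {matching[a] for a in old_alarms if a in matching}
    return [a for a in new_alarms if a not in already_reported]

# Both alarms are reported in each version, as described in the text.
old_alarms = ["Alarm(30)", "Alarm(45)"]
new_alarms = ["Alarm(30)", "Alarm(45)"]

# Assumption: the matching function maps each alarm site to itself.
matching = {"Alarm(30)": "Alarm(30)", "Alarm(45)": "Alarm(45)"}

# Per the text, only the alarm at line 45 reflects a possible real overflow,
# and only in the new version.
true_bugs_new = {"Alarm(45)"}

reported = syntactic_masking(old_alarms, new_alarms, matching)
missed = sorted(true_bugs_new - set(reported))

print("reported:", reported)  # reported: []
print("missed:", missed)      # missed: ['Alarm(45)']
```

Under these assumptions, masking reports nothing for the new version, so the overflow introduced through shift_secs goes unseen. Comparing alarm identities alone cannot distinguish an alarm whose derivation is unchanged from one that has gained a new derivation.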