Finding Real Bugs in Big Programs with Incorrectness Logic

This is a draft paper of August 2021 describing work that has not yet been published.

QUANG LOC LE, University College London and Facebook, UK
AZALEA RAAD, Imperial College London and Facebook, UK
JULES VILLARD, Facebook, UK
JOSH BERDINE, Facebook, UK
DEREK DREYER, MPI-SWS, Germany
PETER W. O'HEARN, Facebook and University College London, UK

Recently, Incorrectness Logic (IL) has been advanced as a logical theory for proving the presence of bugs. Its proof theory supports a compositional approach to reasoning which mirrors the compositionality of Hoare logic, where knowledge that a bug exists can be inferred from local properties established independently for program parts. This promise of the compositionality of IL for bug catching is, however, so far unrealized: reasoning about bugs was done by hand, in a merely suggestive way, in the early work.

In this paper, we take a step towards realizing the promise of the compositional reasoning principles of IL as a foundation for practical program analysis. Specifically, we develop Pulse-X, a new, automatic program analysis for memory bugs based on ISL, a recent synthesis of IL and separation logic. To investigate the effectiveness of Pulse-X, we compare it to Infer, a related, widely used analyzer which has proven useful in industrial engineering practice, demonstrating that Pulse-X is competitive in speed and, when zooming in on a particular large program (OpenSSL), finds more true bugs with fewer false positives. Using Pulse-X, we have also found 15 new real bugs in OpenSSL, which we have reported to the OpenSSL maintainers and which have since been fixed.

1 INTRODUCTION

Incorrectness Logic (IL) [O'Hearn 2019] describes principles for reasoning about the presence of bugs as a dual to those of Hoare's logic (HL) for reasoning about their absence. As well as being able to describe existing bug-catching approaches, including traditional testing and symbolic execution, the proof theory of IL mirrors the compositionality of HL, where proofs of compound programs are built from (proven) specifications of their parts. In a retrospective on 50 years of Hoare logic, Apt and Olderog [2019] identify this compositionality as the principal reason for HL's remarkable influence.

In this paper we take steps towards realising the potential of IL's compositionality by developing an automatic program analyser, Pulse-X, which we apply to real programs. Our analyser is closely inspired by the Infer tool, which is in deployment at Facebook, Amazon, Microsoft and other companies. In particular, Pulse-X adopts Infer's approach to compositionality based on the biabductive proof principle [Calcagno et al. 2011]. However, rather than building on classic separation logic [O'Hearn et al. 2001] as Infer did, Pulse-X builds instead on the recent Incorrectness Separation Logic (ISL) [Raad et al. 2020]; and rather than reporting bugs based on failing to prove their absence, Pulse-X instead reports bugs based on positive proofs of their presence.
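The difference between these two reporting regimes is reflected in the shape of the underlying judgements: an HL triple over-approximates the states reachable from a precondition, so an alarm corresponds to a failed proof, whereas an IL triple under-approximates them, so every state it describes is genuinely reachable. The following summarises this duality in the standard formulation of O'Hearn [2019] (the concrete rendering here is ours):

    % Over-approximate Hoare triple: Q covers all states reachable from P
    \{P\}\ C\ \{Q\} \iff \mathit{post}(C)\,P \subseteq Q
    % Under-approximate incorrectness triple: every state in Q is reachable
    % from some state in P, under exit condition \epsilon (normal "ok" or erroneous "er")
    [P]\ C\ [\epsilon : Q] \iff Q \subseteq \mathit{post}(\epsilon, C)\,P

In particular, a valid IL triple whose result assertion describes an erroneous state is a proof that a bug is reachable; this is the sense in which Pulse-X reports bugs by proving their presence.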
Program Analysis Context. Before describing our results, we recall some of the relevant program analysis context.

The theory of static program analysis has traditionally most often adopted a global program perspective: analyses assume they are given the whole program to be analysed, compute an approximation (usually an over-approximation) of the program runs, and then use this information to issue bug reports [Cousot and Cousot 1977; Rival and Yi 2020]. Although there are notable exceptions (e.g., [Cousot 2001]), much less attention has been paid to the foundations of compositional static analysis. A compositional analysis is one in which each part of the program is analysed locally, i.e., independently of the global program context in which it is used. The analysis result of a composite program is then computed directly from the analysis results of its constituent parts [Calcagno et al. 2011].

In recent CACM articles, industrial developers of static analyses at Facebook [Distefano et al. 2019] and Google [Sadowski et al. 2018] described several concrete advantages of compositional static analyses in practical deployments, the overarching one being that compositionality enables analyses to be usefully integrated into the code review process. According to the Facebook article, industrial codebases evolve at such a high velocity that, for a bug-finding analysis to be deployable as part of a code review process, it must execute in minutes and not hours; thus, any analysis that requires traversal of an entire large program would likely be too slow. By contrast, since compositional analyses work locally on code snippets rather than globally on whole programs, they can be run quickly (in minutes) on diffs (snippets of code that change as part of a pull request). This agility makes it possible for compositional analyses to be run automatically as part of continuous integration (CI) systems and provide timely information to developers performing code review.

The Problem of Compositional Bug Reporting. One of the central problems in developing static analyses, particularly compositional analyses that are deployed automatically as part of CI, is figuring out when to report bugs to developers. This is difficult for several reasons.

First, the fast, compositional static analyses deployed in CI usually do not enjoy a "soundness for bug-catching" theorem. If they are sound for correctness (proving the absence of bugs), then any alarm they report will be the result of a failure to prove the program correct, and not of having found a bug with certainty. On the other hand, static analysis tools that are sound for incorrectness (proving the presence of bugs) have been developed, e.g., symbolic execution, but they usually work in a whole-program fashion and (thus far) have been too slow for CI on large codebases.

Second, there is the even more fundamental question of what constitutes a "bug". For global analyses, we can appeal to the classic distinction between true positives and false positives: a bug report is deemed a true positive if the bug can be exercised in some concrete execution of the program starting from a main() function or other specified entry point, and is otherwise deemed a false positive. On the other hand, compositional analyses only examine code snippets, not whole programs, so they must report bugs based only on local information about a snippet, without knowing the global program context in which it is used. In some cases, such as libraries, an enveloping "whole program" simply does not exist. In the absence of such contextual information, however, it is not clear how to even define whether a bug reported by a compositional analysis is a true positive or not.

For example, a compositional analysis might deduce that a function foo is safe to execute when passed a nonzero argument, but that it will incur a memory safety error (e.g., a null-pointer dereference) when passed zero (0), as in the sketch below. Without knowing the calling contexts of foo, does this "bug" count as a true positive or not, and should it be reported?
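A minimal C sketch of such a function (our illustration; the paper does not fix a particular definition of foo) might look as follows:

    #include <stddef.h>

    /* Hypothetical example: foo is memory-safe whenever n != 0, but
     * dereferences a null pointer when n == 0. A compositional analysis
     * sees only this snippet, not the contexts from which it is called. */
    int foo(int n) {
        int x = 42;
        int *p = (n != 0) ? &x : NULL; /* p is NULL exactly when n == 0 */
        return *p;                     /* null dereference if n == 0 */
    }

Whether this is a true positive in the classic sense depends entirely on whether some caller can in fact pass 0, which is precisely the global information a compositional analysis does not have.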
In industrially deployed compositional analysis tools (such as those presented in the aforementioned CACM articles), the choice of when to report bugs is thus typically implemented using heuristics, which are refined over time based on empirical evaluation of developer feedback on the bug reports produced by the tool. Moreover, the success of these tools is measured not in "true positive rate" or "false positive rate" but rather in fix rate, namely the proportion of reported bugs fixed by developers.¹ Fix rate is a useful empirical metric of analysis effectiveness, but it leaves open a natural research question:

Can we develop useful compositional bug-finding tools, and characterise their effectiveness, in a more formally rigorous way?

¹The Google article [Sadowski et al. 2018] uses the term "effective false positive" for a report a developer does not fix, but this should not be confused with the established objective concept of false positive as a spurious bug claim.

Contributions. In this paper, we make a significant step forward in answering the above question. Specifically, we develop a compositional program analysis that is similar in spirit and deployability to existing, empirically effective analyses, but that (unlike those analyses) is also backed by rigorous foundations concerning the abstractions it computes and the bugs it reports.

We take as our starting point the original foundation of Infer [Calcagno et al. 2011], one of the analyses described in the Facebook paper. Infer's use of biabduction provides a powerful technique for inferring compositional specifications for memory-manipulating procedures. However, as O'Hearn [2019] has noted, there is a fundamental mismatch between the logical foundation of Infer and its practical deployment. On the one hand, the theory of Infer is based on separation logic [O'Hearn et al.