1 Finding Real Bugs in Big Programs with Incorrectness Logic 2 This is a draft paper of August 2021 describing work that has not yet been published. 3 4 QUANG LOC LE, University College London and , UK 5 AZALEA RAAD, Imperial College London and Facebook, UK 6 JULES VILLARD, Facebook, UK 7 Facebook, UK 8 JOSH BERDINE, 9 DEREK DREYER, MPI-SWS, Germany 10 PETER W. O’HEARN, Facebook and University College London, UK 11 Recently, Incorrectness Logic (IL) has been advanced as a logical theory for proving the presence of bugs. Its 12 proof theory supports a compositional approach to reasoning which mirrors the compositionality of Hoare 13 logic, where knowledge that a bug exists can be inferred from local properties established independently for 14 program parts. This promise of the compositionality of IL for bug catching is, however, so far unrealized: 15 reasoning about bugs was done by hand, in a merely suggestive way, in the early work. 16 In this paper, we take a step towards realizing the promise of the compositional reasoning principles of IL 17 as a foundation for practical program analysis. Specifically, we develop Pulse-X, a new, automatic program 18 analysis for memory bugs based on ISL, a recent synthesis of IL and . To investigate the effectiveness of Pulse-X, we compare it to Infer, a related, widely used analyzer which has proven useful in 19 industrial engineering practice, demonstrating how Pulse-X is competitive for speed and, when zooming in on 20 a particular large program (OpenSSL), finds more true bugs with fewer false positives. Using Pulse-X, we have 21 also found 15 new real bugs in OpenSSL, which we have reported to OpenSSL maintainers and have since 22 been fixed. 23 24 25 1 INTRODUCTION 26 Incorrectness Logic (IL)[O’Hearn 2019] describes principles for reasoning about the presence of 27 bugs as a dual to those of Hoare’s logic (HL) for reasoning about their absence. As well as being able 28 to describe existing bug catching approaches, including traditional testing and symbolic execution, 29 the proof theory of IL mirrors the compositionality of HL, where proofs of compound programs are 30 built from (proven) specifications of its parts. In a retrospective on 50 years of Hoare logic, Apt and 31 Olderog[2019] identify this compositionality as the principal reason for HL’s remarkable influence. 32 In this paper we take steps towards realising the potential of the IL compositionality, by devel- 33 oping an automatic program analyser, Pulse-X, which we apply to real programs. Our analyser 34 is closely inspired by the Infer tool which is in deployment at Facebook, Amazon, Microsoft and 35 other companies. In particular, Pulse-X adopts Infer’s approach to compositionality based on the 36 biabductive proof principle [Calcagno et al. 2011]. However, rather than building on classic sepa- 37 ration logic [O’Hearn et al. 2001] as Infer did, Pulse-X builds instead on the recent Incorrectness 38 Separation Logic (ISL,[Raad et al. 2020]); and rather than reporting bugs based on failing to prove 39 their absence, Pulse-X instead reports bugs based on positive proofs of their presence. 40 41 Program Analysis Context. Before describing our results, we recall some of the relevant pro- 42 gram analysis context. The theory of static program analysis has traditionally most often adopted a 43 global program perspective: analyses assume they are given the whole program to be analysed, 44 compute an approximation (usually an over-approximation) of the program runs, and then use this 45 information to issue bug reports [Cousot and Cousot 1977; Rival and Yi 2020]. Although there are 46 notable exceptions (e.g., [Cousot 2001]), much less attention has been paid to the foundations of 47 compositional static analysis. A compositional analysis is one in which each part of the program is 48 analysed locally—i.e., independently of the global program context in which it is used. The analysis 49 , Vol. 1, No. 1, Article . Publication date: August 2021. 2 Quang Loc Le, Azalea Raad, Jules Villard, Josh Berdine, Derek Dreyer, and Peter W.O’Hearn

50 result of a composite program is then computed directly from the analysis results of its constituent 51 parts [Calcagno et al. 2011]. 52 In recent CACM articles, industrial developers of static analyses at Facebook [Distefano et al. 53 2019] and Google [Sadowski et al. 2018] described several concrete advantages of compositional 54 static analyses in practical deployments, the overarching one being that compositionality enables 55 analyses to be usefully integrated into the code review process. According to the Facebook article, 56 industrial codebases evolve at such a high velocity that, for a bug-finding analysis to be deployable 57 as part of a code review process, it must execute in minutes and not hours; thus, any analysis 58 that requires traversal of an entire large program would likely be too slow. By contrast, since 59 compositional analyses work locally on code snippets rather than globally on whole programs, they 60 can be run quickly (in minutes) on diffs (snippets of code that change as part of a pull request). This 61 agility makes it possible for compositional analyses to be run automatically as part of continuous 62 integration (CI) systems and provide timely information to developers performing code review. 63 The Problem of Compositional Bug Reporting. One of the central problems in developing 64 static analyses—particularly compositional analyses that are deployed automatically as part of 65 CI—is figuring out when to report bugs to developers. This is difficult for several reasons. 66 First, the fast, compositional static analyses deployed in CI usually do not enjoy a “soundness for 67 bug-catching” theorem. If they are sound for correctness (proving the absence of bugs) then any 68 alarm they report will be the result of a failure to prove the program correct, and not of having 69 found a bug with certainty. On the other hand, static analysis tools that are sound for incorrectness 70 (proving the presence of bugs) have been developed, e.g., symbolic execution, but they usually 71 work in a whole-program fashion and (thus far) have been too slow for CI on large codebases. 72 Second, there is the even more fundamental question of what constitutes a “bug”. For global 73 analyses, we can appeal to the classic distinction between true positives and false positives: a 74 bug report is deemed a true positive if the bug can be exercised in some concrete execution of 75 the program starting from a main() function or other specified entry points, and is otherwise 76 deemed a false positive. On the other hand, compositional analyses only examine code snippets, 77 not whole programs, so they must report bugs based only on local information about a snippet, 78 without knowing the global program context in which it is used. In some cases such as libraries, an 79 enveloping “whole program” simply does not exist. In the absence of such contextual information, 80 however, it is not clear how to even define whether a bug reported by a compositional analysis isa 81 true positive or not. 82 For example, a compositional analysis might deduce that a function foo is safe to execute 83 when passed a nonzero argument, but that it will incur a memory safety error (e.g., null-pointer- 84 dereference) when passed zero (0). Without knowing the calling contexts of foo, does this “bug” 85 count as a true positive or not, and should it be reported? 86 In industrially deployed compositional analysis tools (such as those presented in the aforemen- 87 tioned CACM articles), the choice of when to report bugs is thus typically implemented using 88 heuristics, which are refined over time based on empirical evaluation of developer feedback onthe 89 bug reports produced by the tool. Moreover, the success of these tools is measured not in “true 90 positive rate” or “false positive rate” but rather in fix rate, namely the proportion of reported bugs 91 fixed by developers.1 Fix rate is a useful empirical metric of the analysis effectiveness, but it leaves 92 open a natural research question: 93 94 Can we develop useful compositional bug-finding tools, and characterise their effectiveness, 95 in a more formally rigorous way? 96 1The Google article [Sadowski et al. 2018] uses the term “effective false positive” for a report a developer does not fix,but 97 this should not be confused with the established objective concept of false positive as a spurious bug claim. 98 , Vol. 1, No. 1, Article . Publication date: August 2021. Finding Real Bugs in Big Programs with Incorrectness Logic 3

99 Contributions. In this paper, we make a significant step forward in answering the above ques- 100 tion. Specifically, we develop a compositional program analysis that is similar in spirit and deploya- 101 bility to existing, empirically effective analyses, but that (unlike those analyses) is also backed by 102 rigorous foundations concerning the abstractions it computes and the bugs it reports. 103 We take as our starting point the original foundation of Infer [Calcagno et al. 2011], one of 104 the analyses described in the . Infer’s use of biabduction provides a powerful tech- 105 nique for inferring compositional specifications for memory-manipulating procedures. However, 106 as O’Hearn[2019] has noted, there is a fundamental mismatch between the logical foundation 107 of Infer and its practical deployment. On the one hand, the theory of Infer is based on separation 108 logic [O’Hearn et al. 2001], a logic of over-approximation—meaning that separation logic speci- 109 fications prove the absence of bugs. On the other hand, Infer was deployed in practice as a bug 110 catcher, for which we really want a logic of under-approximation—meaning that we want to prove 111 the presence of bugs. 112 Our first step is thus to replace over-approximation by under-approximation: we obtaina 113 compositional program analysis, called Pulse-X, which infers proofs much like the original Infer 114 does, but they are proofs in Incorrectness Separation Logic (ISL)[Raad et al. 2020] rather than classic 115 (over-approximate) separation logic. Pulse-X calculates a collection of under-approximate ISL triples 116 of the form [presumption] 푓 [휖 : result], which are interpreted as the dual of Hoare triples: they 117 assert that any final state satisfying the result condition result is reachable (with exit condition 118 휖) by executing 푓 starting in some initial state satisfying presumption. Here, 휖 can be either ok 119 (indicating normal termination) or er (indicating erroneous termination, i.e., a program error). 120 By trading over- for under-approximation in the computed abstractions, we know that Pulse-X 121 guarantees the existence of bugs under some conditions—our analysis infers error specifications of 122 the form [presumption] 푓 [er : result] for each function—but this leaves us with the key question of 123 when to report them to developers. To this end, we describe conditions on presumption and result 124 under which the error er is guaranteed to be context-independent, meaning intuitively that it does 125 not depend on particular preconditions which may or may not hold in the calling context of 푓 . We 126 show how to check these context-independence conditions algorithmically as part of our analysis, 127 and thereby only report error specifications that satisfy them; we refer to such context-independent 128 bugs as manifest bugs. On the other hand, we refer to non-manifest bugs as latent bugs. Even when 129 we do not report them directly to developers, latent bugs nevertheless play an important role in 130 bug detection and reporting: a latent bug of 푓 may be used compositionally to compute manifest 131 bug specifications for code that calls 푓 . 132 For example, for function foo mentioned earlier, Pulse-X would compute an error specification 133 under the presumption that its argument is 0; however, this error specification is not context- 134 independent, and thus it would be classified as a latent bug. If, however, another function bar were 135 to (without context-dependent conditions of its own) call foo with argument 0, we would then 136 report a manifest bug for bar. 137 To formalise the claim that manifest bugs are indeed context-independent, we establish that 138 Pulse-X satisfies a True Positives Property: 139 140 If a subprogram of a complete program has a manifest error, then either the subprogram 141 is dead code (not reachable from main()) or there is a trace in the concrete semantics 142 from main() to the error. 143 144 This provides a very strong justification for reporting an error to a programmer, even whenthe 145 global program context is not known. Interestingly, we also found that it makes sense to report 146 some, though not all, latent bugs, in particular those for memory leaks; see §2.3 for discussion. 147 , Vol. 1, No. 1, Article . Publication date: August 2021. 4 Quang Loc Le, Azalea Raad, Jules Villard, Josh Berdine, Derek Dreyer, and Peter W.O’Hearn

148 Standard Caveat. Our soundness theorem says that, under assumptions encapsulated in a formal 149 model, the computed abstractions will be under-approximate and there will be no false positives. 150 However, in a real implementation there can still be false positives, in cases that lie outside the 151 theoretical model assumptions. We did encounter false positives in practice, all due to outside-the- 152 theory issues with unknown code, but this was the minority of findings (see §4.2 and §5). End of 153 Standard Caveat. 154 Another key contribution we make is an experimental evaluation of Pulse-X, which indicates 155 that our rigorous bug-reporting criterion is competitive with the heuristic approach of Infer. We 156 ran both Infer and Pulse-X on OpenSSL versions from 2015 and 2021. We chose OpenSSL for our 157 evaluation because Infer had reported several bugs on OpenSSL back in 2015, which were fixed by 158 OpenSSL developers, and thus it was natural to consider how Pulse-X would fare. We found that 159 Pulse-X found more true bugs and fewer false ones than Infer. We also compared the performance 160 of Pulse-X against that of Infer on a number of large programs; our comparison suggests that 161 although Pulse-X is currently a scientific experimental tool not deployed in an industrial CI system, 162 its performance characteristics are close enough to Infer to suggest that it could indeed be deployed 163 in industrial CI systems in the future. 164 165 Outline. The remainder of this article is organised as follows. In §2 we present an overview of 166 our analysis and several representative bugs found by Pulse-X. In §3 we discuss the formal model 167 underpinning Pulse-X. In §4 we present the Pulse-X analysis algorithm. In §5 we evaluate the 168 performance of Pulse-X in terms of its scalability, bug report accuracy and number of bugs found. 169 We discuss related work in §6 and conclude in §7. 170 Pulse-X is an eXperimental version of the Pulse program analyzer being developed at Facebook, 171 and was developed as a fork of Pulse. We’ll describe the relationship between Pulse-X and Pulse in 172 more detail as part of the related work section, §6. 173 We plan to make Pulse-X together with our experiments publicly available at a later date, and before 174 this draft paper is published. For the time being, certain supplemental material may be accessed at 175 http://www0.cs.ucl.ac.uk/staff/p.ohearn/RealBigDraftSupplemental.txt; this includes our bug reports 176 and links to pull requests for fixes against recent OpenSSL, as well as the Infer bugs discussed inthe 177 experimental section of the paper. Additionally, a technical appendix containing proofs ommitted from 178 the main body of the paper is at http://www0.cs.ucl.ac.uk/staff/p.ohearn/RealBigDraftAppendix.pdf. 179 180 2 PULSE-X OVERVIEW AND REPRESENTATIVE BUG EXAMPLES 181 Rather than start with theory we begin with bugs, four of them drawn from running Pulse-X on 182 OpenSSL. The first three are real bugs illustrating different challenges for compositional analysis 183 and reporting, while the fourth is a false positive which is excluded by our reporting criterion. To 184 prepare the reader for the upcoming technical development, we make several remarks along the 185 way regarding the Pulse-X theory and tool architecture. 186 187 2.1 Bugs 1 and 2: Null Dereference Bugs 188 189 Listing1 shows a null-pointer-dereference (denial-of-service) vulnerability we discovered in 190 OpenSSL’s ssl_excert_prepend function. The lines starting with + denote our proposed patch. 191 The code begins by calling app_malloc, which in turn calls CRYPTO_malloc in its body. The 192 CRYPTO_malloc function is a malloc wrapper, i.e., a wrapper for a C-standard malloc, used through- 193 out OpenSSL instead of a standard malloc. The returned pointer is set to 0 on line 6. Pulse-X found 194 that the malloc wrapper could return NULL and thus a null-pointer-dereference may occur on line 195 6. We reported this bug with our proposed patch to OpenSSL. 196 , Vol. 1, No. 1, Article . Publication date: August 2021. Finding Real Bugs in Big Programs with Incorrectness Logic 5

197 This bug may seem straightforward, and one might wonder whether a simplistic intra-procedural 198 analysis could find it. The problem is that the analysis must understand that app_malloc is a 199 NULL malloc wrapper which can return . 1 int ssl_excert_prepend(SSL_EXCERT**pexc){ 200 Note that it is not scalable to ask the hu- 2 SSL_EXCERT*exc= app_malloc( sizeof(*exc), 201 man (developer) to specify what the null- 3 "prepend cert"); 202 returning malloc wrappers are, as there 4 + if (exc == NULL) 203 might be many malloc wrappers in a given 5 + return 0; 204 codebase, and new ones may be added 6 memset(exc, 0, sizeof(*exc)); 205 with new pull requests.2 A further inter- 7 ... 206 esting wrinkle is that not all malloc wrap- 8 } 207 pers are the same: some wrappers never Listing 1. A null pointer dereference error in OpenSSL. 208 return NULL, and thus reporting bugs in 209 these cases would lead to false positives, a point that was brought home to us vividly in an interac- 210 tion with an OpenSSL maintainer which we now recall. 211 1 apps/lib/s_cb.c:959: error: Nullptr Dereference 212 2 213 3 apps/lib/s_cb.c:957:23: in call to `app_malloc` 214 4 955. static int ssl_excert_prepend(SSL_EXCERT**pexc) 215 5 956. { 216 6 957. SSL_EXCERT*exc= app_malloc( sizeof(*exc),"prepend cert"); 217 7 ^ 218 8 958. 219 9 959. memset(exc, 0, sizeof(*exc)); 220 10 test/testutil/apps_mem.c:16:16: in call to CRYPTO_malloc (modelled) 221 11 ` ` 12 14. void *app_malloc(size_t sz, const char *what) 222 13 15. { 223 14 16. void *vp= OPENSSL_malloc(sz); 224 15 ^ 225 16 17. 226 17 18. return vp; 227 18 228 19 test/testutil/apps_mem.c:16:16: is the null pointer 229 20 14. void *app_malloc(size_t sz, const char *what) 230 21 15. { 231 22 16. void *vp= OPENSSL_malloc(sz); ^ 232 23 24 17. 233 25 18. return vp; 234 26 ...... 235 236 Listing 2. Part of error trace of the bug in Listing 1. 237 Our analysis finds this bug by computing a procedure summary for app_malloc() which includes 238 the two under-approximate triples below: 239 240 [emp ∧ true] app_malloc(sz, what) [ok : ret↦→nil ∧ true] 241 [emp ∧ true] app_malloc(sz, what) [ok : ∃푋 .ret↦→푋 ∗ 푋↦→− ∧ true] 242 243 2Note that we inform our Pulse-X analyser (using a flag) that CRYPTO_malloc is a malloc for OpenSSL, but then Pulse-X 244 automatically discovers a procedure summary for app_malloc and many other CRYPTO_malloc wrappers. 245 , Vol. 1, No. 1, Article . Publication date: August 2021. 6 Quang Loc Le, Azalea Raad, Jules Villard, Josh Berdine, Derek Dreyer, and Peter W.O’Hearn

246 Using the first specification for app_malloc(sz, what) on line 2, when calling memset (on line 6) 247 the analyser uses a built-in summary for memset which includes the following error specification: 248 [푎↦→nil ∧ true] memset(a,b,c) [er : 푎↦→nil ∧ true] 249 250 Putting the two together, we obtain the following error specification for the entire procedure: 251 [emp ∧ true] ssl_excert_prepend(pexc) [er : emp ∧ true] 252 253 As we discuss in §3, this specification indicates that an error can happen under very general 254 circumstances, no matter how ssl_excert_prepend() is called. This kind of general specification 255 of a bug is what we call a manifest error, and we report such manifest errors to developers even 256 when we do not have the calling context or an enveloping program containing a main() function. 257 When we first reported this bug to OpenSSL, an OpenSSL maintainer replied that it was a 258 false positive as app_malloc aborts when the result is NULL. However, after inspecting the gen- 259 erated error trace of the bug shown in Listing 2, we were led to a definition of app_malloc in 260 test/testutil/apps_mem.c which does not contain abort.3 It turns out that another wrapper 261 with the same name app_malloc, located in file apps/lib/apps.c, does abort in the case of NULL 262 result, which was the one the maintainer had in mind, but it was not the version called in Listing 1, 263 leading to this bug. We returned to the maintainer with the trace information, and they agreed 264 with us and acknowledged our bug report as genuine. However, although they agreed with the 265 bug report, they did not accept our proposed fix and they preferred a different one. In a separate 266 pull request https://github.com/openssl/openssl/pull/15836 the developer changed the wrapper in 267 test/testutil/apps_mem.c to be one that never returns NULL, as it executes abort() when the 268 embedded call to OPENSSL_malloc() inside the wrapper returns NULL (see Listing 3). Many of the 269 malloc wrappers we have encountered in OpenSSL do have the ability to return NULL, but others, 270 such as this one, do not. After applying the developer’s preferred fix, for the NULL case our analysis 271 computes an abort triple for app_malloc() and for ssl_excert_prepend(). Since abort triples 272 are a special form of ok triples representing a program exit, Pulse-X does not produce the error in 273 ssl_excert_prepend(). 274 1 void *app_malloc(size_t sz, const char *what){ 275 2 void *vp; 276 3 if (!TEST_ptr(vp= OPENSSL_malloc(sz))) { 277 4 TEST_info("Could not allocate%zu bytes for%s\n", sz, what); 278 5 abort(); 279 6 } 280 7 return vp; 281 8 } 282 Listing 3. Null-avoiding malloc() wrapper. 283 284 This story highlights the value of inter-procedural reasoning in explaining and fixing bugs that 285 might deceptively seem simple. 286 We found malloc wrappers to be implicated in a number of bugs in OpenSSL, but the reasoning 287 needed after NULL is assigned is not always so straightforward as in the case above. To see this, 288 consider the DH_check_pub_key function in Listing 4, where the code allocates new memory and 289 assigns it to pointer q by calling the BN_new function (another malloc wrapper). After allocating 290 memory for ret successfully, this function also assigns ret->d to NULL (on line 18). On line 6, the 291 code sets 1 to the buffer pointed to by q by calling BN_set_word. On line 25, ret->d is dereferenced, 292 293 3The full trace is available in the supplementary material. 294 , Vol. 1, No. 1, Article . Publication date: August 2021. Finding Real Bugs in Big Programs with Incorrectness Logic 7

295 triggering a null-pointer-deference on line 6. To find this bug, non-trivial heap information is 296 needed; in our analyser this information is provided by ISL formulae representing abstract heaps. 297 298 1 int DH_check_pub_key(const DH*dh, const BIGNUM*pub_key, int *ret){ 299 2 BIGNUM*q=NULL; 300 3 ... 301 4 q=BN_new(); 5 if (q == NULL) goto err; 302 6 BN_set_word(q,1); 303 7 ... 304 8 } 305 9 306 10 BIGNUM*BN_new( void){ 307 11 BIGNUM*ret; 308 12 309 13 if((ret=(BIGNUM*)OPENSSL_malloc( sizeof(BIGNUM))) == NULL){ 310 14 BNerr(...); 311 15 return(NULL); 312 16 } 17 ... 313 18 ret->d=NULL; 314 19 ... 315 20 return(ret); 316 21 } 317 22 318 23 int BN_set_word(BIGNUM*a, BN_ULONGw){ 319 24 ... 320 25 a->d[0] =w; 321 26 ... 322 27 } 323 Listing 4. OpenSSL null pointer bug in DH_check_pub_key. 324 325 326 2.2 Bug 3: A Memory Leak 327 One particularly interesting bug involved the 1 static int www_body(int s, int stype, 328 s_server application, implementing a generic int prot, unsigned char *context){ 329 SSL/TLS server which listens for connections 2 ... 330 on a given port using SSL/TLS. Pulse-X discov- 3 io= BIO_new(BIO_f_buffer()); 331 ered a memory leak in function www_body (List- 4 ssl_bio= BIO_new(BIO_f_ssl()); 332 ing 5). Once allocated, the SSL socket ssl_bio 5 ... 333 is pushed at the end of the linked list io (on 6 BIO_push(io, ssl_bio); 334 line 6), and all sockets and buffers linked by the 7 ... BIO_free_all(io); 335 list io are freed afterwards. Pulse-X found that, 8 9 + BIO_free(ssl_bio); 336 the memory allocated at line 4 would be leaked. 10 return ret; 337 When we submitted the patch, an OpenSSL 11 } 338 maintainer warned us that our fix might cause a 339 double-free-error and suggested that assigning Listing 5. OpenSSL memory leak bug in www_body. 340 ssl_bio to NULL after the push would be the 341 correct fix. However, rerunning Pulse-X on the patch proposed by the maintainer revealed that if 342 the push fails (i.e., ssl_bio points to an allocated heap and it has not been pushed at the end of the 343 , Vol. 1, No. 1, Article . Publication date: August 2021. 8 Quang Loc Le, Azalea Raad, Jules Villard, Josh Berdine, Derek Dreyer, and Peter W.O’Hearn

344 list io), then the leak occurs after the added assignment. This leak report was indeed valid because 345 the memory pointed to by ssl_bio was not accessible from any pointer after the added assignment, 346 and consequently the memory would not be freed afterwards. Furthermore, our proposed fix does 347 not in fact cause a double-free-error because the BIO library uses a reference count mechanism to 348 prevent such an error. Indeed, the maintainer’s fix would have been correct if the push had been 349 always successful. After reporting this new observation to the maintainer, they agreed with our 350 proposed fix. 351 This bug was a challenging one to find. The www_body contains 426 lines of code, and is fairly 352 complicated: the list io is manipulated by a chain of function calls and multiple loops. In contrast to 353 over-approximate techniques (such as that of Infer’s “biabduction” analysis) which cannot reason 354 precisely about the presence of bugs for looping programs, Pulse-X performs under-approximation 355 and can reason precisely about the bugs within a bounded number of program paths (loop un- 356 foldings). This example also illustrates a challenge in software testing. While a developer may 357 write tests to try to make sure this function works correctly, it is perhaps not immediate that a 358 test would exercise the condition that triggers the failure of BIO_push(io, ssl_bio). As a direct 359 consequence, this bug was nearly three years old (it also affects the stable release OpenSSL-1.1.1). 360 361 2.3 Latent and Manifest Bugs 362 A function has a latent error if it contains a bug that occurs only when its inputs satisfy certain 363 conditions, i.e., the bug does not occur in all calling contexts. Otherwise the bug is manifest. 364 For an example of a latent bug, the potential null dereference in function file_ctrl (Listing 6) is 365 classified as latent in OpenSSL-1.0.1h and has not been fixed in the current revision OpenSSL-3.0.04. 366 180 static long file_ctrl(BIO*b, int cmd, long num, void *ptr){ 367 181 long ret = 1; 368 182 FILE*fp=(FILE*)b->ptr; 369 183 .... 370 184 } 371 crypto/bio/bss_file.c 372 Listing 6. Latent issue in . 373 Our analysis discovered a null-pointer-dereference on line 182, but did not report it to the user. This 374 bug occurs only when the input b is NULL. Pulse-X would report this bug at a call site where pointer 375 b is NULL. Indeed, this issue seems not a real bug and has not been fixed; the code of function 376 file_ctrl is the same in the current revision of OpenSSL. 377 As this example illustrates, it seems undesirable to report latent null-pointer-dereferences to a 378 programmer. Often, there are implicit assumptions on whether a pointer is allocated, which are not 379 checked locally, and such assumptions can be inferred and pushed back to callers. Interestingly, 380 however, the case of leaks is different. If a function contains a local memory leak, very rarely 381 would this be the fault of a caller. A caller might happen to avoid a path with a leak, but the leak 382 is a bug just the same. Furthermore, leaks can be difficult to observe (especially for an end user), 383 but easy to fix. For these reasons, we have adopted a strategy of reporting latent leaks, butnot 384 null-pointer-dereferences. Our experience with this strategy has been good: OpenSSL developers 385 reacted positively to latent leak reports, and legacy latent leaks had often been fixed; from our 386 1.0.1h (from 2015) experiments, there were 42 latent leaks that were fixed later and 5 latent leaks 387 that were not. 388 In more detail, our strategy is to report: 389 • all manifest null-pointer-dereferences, no matter where in the program; 390 391 4Commit 147ed5f9def86840c9f6ba512e63a890d58ac1d6. 392 , Vol. 1, No. 1, Article . Publication date: August 2021. Finding Real Bugs in Big Programs with Incorrectness Logic 9

393 Pulse-X Assertions Pulse-X Assertion Semantics 394 휅 ::= emp | 푥↦→푋 | 푋↦→푉 | 푋̸↦→ | 휅 ∗ 휅 spatial (휂, 휎) |= emp iff dom (휎) =∅ 395 휋 ::= true | 퐵 | 휋 ∧ 휋 | 휋 ∨ 휋 | ¬휋 pure (휂, 휎) |= 푥↦→푋 iff 휎={휂(푥) ↦→ 휂(푋)} 396 Δ ::= 휅 ∧ 휋 quantifier-free (휂, 휎) |= 푋↦→푉 iff 휎={휂(푋) ↦→ 휂(푉 )} 397 푝, 푞, 푟 ::= Δ | ∃푋 . Δ top-level (휂, 휎) |= 푋̸↦→ iff 휎={휂(푋) ↦→ ⊥} 398 Pulse-X Model Domain (휂, 휎) |= 휅1 ∗휅2 iff ∃휎1, 휎2. 휎=휎1 ⊎ 휎2 ∧ 399 휂 ∈ Env ≜ (Var ∪ LVar) → Val (휂, 휎1) |= 휅1 ∧(휂, 휎2) |= 휅2 (휂, 휎) |= 휅 ∧ 휋 iff (휂, 휎) |= 휅 and 휂 |= 휋 400 휎 ∈ State ≜ Loc ⇀fin Val with Loc ⊎ {nil} ⊆ Val 401 푠 ∈ World ≜ Env × State (휂, 휎) |= ∃푋 . Δ iff ∃푣. (휂[푋 ↦→ 푣], 휎) |= Δ 402 403 Fig. 1. The Pulse-X model domain, assertions and their semantics. 404 405 406 407 • all latent null-pointer-dereferences in main(); and 408 • all latent leaks, no matter where in the program. 409 410 As we demonstrate in more detail in §5.2, our experimental evidence validates this strategy: on 411 OpenSSL 1.0.1h, Pulse-X found 306 issues in all (latent and manifest), it reported 63 of the 306 412 issues (those satisfying the three properties above), 53 of the 63 have been subsequently fixed with 413 5 not fixed and the remaining ones unknown (the procedures were removed totally). Ondirect 414 inspection, we found that a high number of the remaining (latent) issues had not been fixed, and 415 there was no reason to fix them (a human might reasonably label them false positives, hadthey 416 been reported). 417 418 3 THE PULSE-X FORMAL MODEL 419 420 We describe the abstract states of our program analyser Pulse-X, certain separation logic formulae, 421 their semantics in a concrete state model, and the semantics of under-approximate triples. We give 422 a formal definition of manifest bugs in terms of the assertions and triples, and we establish atrue 423 positives result which underpins our bug reporting in Pulse-X. The result is a property of the triples 424 and assertions themselves, and is independent of the way that the triples are computed. The next 425 section describes an analysis algorithm for obtaining such triples. 426 427 The Pulse-X State Model. We assume two (countably infinite) sets of program variables, Var, 428 and logical variables, LVar, such that Var ∩ LVar = ∅, a set of heap (memory) locations, Loc, and a 429 set of values, Val, such that Loc ⊎ {nil} ⊆ Val. We further assume a standard interpreted language 430 of (program) expressions, Exp, containing at least variables and values, and a standard interpreted 431 language for Boolean expressions, BExp. 휂, 휎 휂 432 As shown in Fig. 1, we model a Pulse-X world as a pair ( ), comprising an environment 휎 433 and a state . Intuitively, the environment tracks the values associated with program and logical 434 variables, while the state models the stack and the heap. 푥,푦, . . . 푋 푌, . . . 푣 435 We use as metavariables for program variables; , for logical variables; for values; 푒 퐵 436 for expressions; for Boolean expressions; and ret for a designated program variable recording the −→푥 −→푒 437 return value at procedure exits. We further write to denote a tuple of variables; and to denote a . 438 tuple of expressions. Lastly, we assume an expression interpretation function, : Exp → Env → Val, . J K 439 and a Boolean interpretation function, : BExp → Env → Val, respectively evaluating the values J K 440 of expressions and Boolean expressions against a given environment. 441 , Vol. 1, No. 1, Article . Publication date: August 2021. 10 Quang Loc Le, Azalea Raad, Jules Villard, Josh Berdine, Derek Dreyer, and Peter W.O’Hearn

442 PDef ∋ pdef ::= skip | error() | assume(휋) | 푥 := 푒 | 푥 := [푦] | [푥] :=푦 Predefined instructions 443 | 푥 := malloc() | free(푥) | return(푥) 444 Inst ∋ inst ::= pdef | f() Instructions 445 Comm ∋ C ::= inst | local 푥.C | C; C | C + C | C★ Commands 446 FDec ∋ d ::= f() = C Function declarations 447 Prog ∋ P ::= d, ··· , d Programs 448 449 Fig. 2. The Pulse-X programming language. 450 451 The Pulse-X Assertions and their Semantics. As shown in Fig. 1, the Pulse-X assertions (ranged 452 over by 푝, 푞, 푟) are those of [Calcagno et al. 2011], without inductive predicates, and extended with 453 the invalidated assertion (푋̸↦→) of ISL.5 454 Pulse-X assertions describe sets of worlds, and their semantics is given by a satisfiability relation, 455 |=, as shown in Fig. 1. Pure assertions constrain the underlying environment (휂); their semantics 456 is standard and omitted here. Analogously, spatial assertions constrain the shape of the state (휎). 457 Specifically, emp describes worlds in which the state is empty; 푥↦→푋 describes worlds in which the 458 state comprises a single location denoted by (the interpretation of) 푥 containing the value denoted 459 by 푋; analogously for 푋↦→푉 . Similarly, 푋̸↦→ describes worlds in which the state comprises a single 460 invalidated location at 푋 (i.e., 푋 contains the designated value ⊥), and 휅1 ∗휅2 describes worlds in 461 which the state can be split into two disjoint sub-states, one satisfying 휅1 and the other 휅2. The 462 semantics of the remaining assertions is standard. 463 def 464 Notation. We define 푝 ≜ {푠 | 푠 |= 푝} and write 푝 ⊢ 푞 ⇔ 푝 ⊆ 푞 . We write 푝 ≡ 푞 for 465 푝 푞 푞 푝 L M푝 푝 푝 def 푝 L L M 푝 푝 466 ⊢ ∧ ⊢ ; and write sat( ) when is satisfiable: sat( ) ⇔ ≠∅. We write fv( ) (resp. flv( )) 푝L M 467 for the set of free variables (resp. free logical variables) in . We use the standard substitution 푝 푣 푋 푝 푣 푋 468 notation and write [ / ] for the assertion obtained from by substituting for . 469 3.1 Programs and Summaries 470 Pulse-X 471 We present the Pulse-X programming language in Fig. 2, which is similar to the intermediate 472 language of Infer [Calcagno et al. 2011]. A program, P, comprises a sequence of procedure (function) 473 declarations. A procedure declaration d is of the form f()=C, where C is a command describing 474 the function body. Commands include instructions (inst), local variable declaration (local 푥.C), 475 sequential composition (C; C), non-deterministic choice (C + C) and Kleene (non-deterministic) ★ 476 iteration (C ). An instruction is either a procedure call or a predefined instruction. The predefined 477 instructions include skip, assume statements, assignment (푥 := 푒), heap lookup (푥 := [푦]), heap 478 mutation ([푥] :=푦), memory allocation, memory disposal and return statements. Standard procedure 479 calls in C – where a call allocates formals, assigns actuals and deallocates formals on procedure 480 exit – can be encoded in terms of predefined instructions. 481 As is standard, assume statements can be used to encode if and while statements. Specifically, 482 if 휋 then C1 else C2 is encoded as (assume(휋); C1) + (assume(¬휋); C2); and while(휋) do C ★ 483 is encoded as (assume(휋); C) ; assume(¬휋). Similarly, we can define assert(.) statements via 484 assume(.), error() and non-deterministic choice as: 485 assert(휋) ≜ (assume(¬휋); error()) + (assume(휋); skip) 486

487 5Inductive predicates were used by Calcagno et al.[2011] to establish over-approximate loop invariants, which we do not 488 need in an under-approximate analysis; see [O’Hearn 2019] for a discussion. However, using inductive predicates to leap 489 loops via backward variant rules in incorrectness logic [O’Hearn 2019] might be studied in separate work. 490 , Vol. 1, No. 1, Article . Publication date: August 2021. Finding Real Bugs in Big Programs with Incorrectness Logic 11

491 [emp] skip [ok : emp] 492 [emp] error() [er : emp] 493 h i h i 494 ( ↦→ ) ∧ [ / ] assume( ) ( ↦→ ) ∧ [ / ] ∗푥푖 ∈pvars(휋) 푥푖 푋푖 휋 푋푖 푥푖 휋 ok : ∗푥푖 ∈pvars(휋) 푥푖 푋푖 휋 푋푖 푥푖 495     ( ∈ ( )\{ } 푦푖 ↦→푌푖 ∗ 푥↦→푋) ( ∈ ( )\{ } 푦푖 ↦→푌푖 ∗ 푥↦→푉 ) 496 ∗푦푖 pvars 푒 푥 푥 := 푒 ok : ∗푦푖 pvars 푒 푥 497 ∧ 푉 =푒 [푌푖 /푦푖 ][푋/푥] ∧ 푉 =푒 [푌푖 /푦푖 ][푋/푥] 498 [푥↦→푋 ∗ 푦↦→푌 ∗ 푌 ↦→푉 ] 푥 := [푦] [ok : 푥↦→푉 ∗ 푦↦→푌 ∗ 푌↦→푉 ] 499 [푦↦→푌 ∗ 푌̸↦→] 푥 := [푦] [er : 푦↦→푌 ∗ 푌 ̸↦→] 500 501 [푦↦→푌 ∧ 푌 = nil] 푥 := [푦] [er : 푦↦→푌 ∧ 푌 = nil] 502 [푥↦→푋 ∗ 푋↦→푉 ∗ 푦↦→푌] [푥] :=푦 [ok : 푥↦→푋 ∗ 푋↦→푌 ∗ 푦↦→푌] 503 [푥↦→푋 ∗ 푋̸↦→] [푥] :=푦 [er : 푥↦→푋 ∗ 푋̸↦→] 504 505 [푥↦→푋 ∧ 푋 = nil] [푥] :=푦 [er : 푥↦→푋 ∧ 푋 = nil] 506 507 [푥↦→푋] 푥 := malloc() [ok : ∃퐿. 푥↦→퐿 ∗ 퐿↦→푉 ∧ true] 508 509 [푥↦→푋] 푥 := malloc() [ok : ∃퐿. 푥↦→퐿 ∧ 퐿 = nil] 510 [푥↦→푋 ∗ 푋↦→푉 ] free(푥) [ok : 푥↦→푋 ∗ 푋̸↦→] 511 [푥↦→푋 ∗ 푋̸↦→] free(푥) [er : 푥↦→푋 ∗ 푋̸↦→] 512 513 [푥↦→푋 ∧ 푋 = nil] free(푥) [er : 푥↦→푋 ∧ 푋 = nil] 514 [푥↦→푋 ∗ ret↦→푉 ] return(푥) [ok : 푥↦→푋 ∗ ret↦→푋] 515 516 Fig. 3. Predefined Pulse-X summaries as ISL triples, where pvars(.) returns the program variables of an 517 expression or a statement; for brevity, we omit the pure assertion true and write 푝 in lieu of 푝 ∧ true. 518 519 520 Our approach to formalism here is minimalistic rather than maximalistic: its aim is to include 521 as little as possible in the language while still exposing key issues to explain/probe technically, 522 rather than including as much as possible so as to cover more of an implemented analyser. In 523 order to focus on the essential issues regarding the analysis algorithm and how it infers states, for 524 brevity we focus on parameterless procedures. It would be straightforward to follow the treatment 525 of parameters by Calcagno et al. [2011], and doing so would not provide any novel insights. We 526 also assume that procedures are non-recursive. In practice they can also be subject to bounded 527 unrolling, and that is what our implementation (implicitly) does. 528 Predefined Pulse-X Summaries as ISL Triples. As we describe in §4, the Pulse-X algorithm 529 uses the summaries (specifications) of predefined instructions to infer the specification of agiven 530 piece of code. The summaries of predefined instructions are given in Fig. 3 as ISL triples, and 531 are adapted from those of Raad et al. [2020] with only minor changes to fit our formalism with 532 environment 휂. In particular, these specifications are modified with 푥↦→푋 to track the values 푋 of 533 program variables 푥. For example, the spec for assignment statements 푥 := 푒 replaces variable 푥 by 534 the value 푋 and other variables 푦푖 by the values 푌푖 before actually applying the assignment. 535 536 Pulse-X Summaries as Valid ISL Triples. As discussed in §2, Pulse-X computes procedure 537 summaries as a set of valid ISL (under-approximate) triples of the form [푝] C [휖 :푞]. As described 538 in [Raad et al. 2020], a triple [푝] C [휖 :푞] is valid, written |= [푝] C [휖 :푞], iff every state in 푞 is 539 , Vol. 1, No. 1, Article . Publication date: August 2021. 12 Quang Loc Le, Azalea Raad, Jules Villard, Josh Berdine, Derek Dreyer, and Peter W.O’Hearn

540 reachable under the exit condition 휖 (which may be ok or er) by executing C on some state in 푝. 541 Note that [푝] C [휖 :false] is vacuously valid as false denotes an empty state set. 542 Definition 3.1 (ISL triples). An (under-approximate) ISL triple [푝] C [휖 :푞] is valid, written |= [푝] 543 C [휖 :푞], iff: 544 def 545 |= [푝] C [휖 :푞] ⇔ 푞 ⊆ C 휖 ( 푝 ) 546 L M J K L M where . 휖 : Comm → P(World × World) denotes the command semantics under 휖, defined (in 547 the technicalJ K appendix, http://www0.cs.ucl.ac.uk/staff/p.ohearn/RealBigDraftAppendix.pdf) as a 548 state transition system analogously to that in [Raad et al. 2020]. 549 550 3.2 Manifest Errors 551 We begin by recalling the ISL Cons and Frame rules (below), as the placements of entailment and 552 ∗푓 assertions in the upcoming discussion implicitly appeals to these rules. Note that the entailments 553 in the premise of Cons are in the opposite direction compared to those of their counterpart in 554 Hoare logic; this is due to the under-approximate nature of ISL and our Pulse-X analysis. 555 556 Frame Cons [푝] [휖 푞] 푝 ′ ⊢ 푝 [푝 ′] [휖 푞′] 푞 ⊢ 푞′ 557 C : C : 558 [푝 ∗ 푟] C [휖 :푞 ∗ 푟] [푝] C [휖 :푞] 559 Recall from §2 that in order to minimise noise when reporting bugs and eliminate false positives, 560 we introduce the notion of manifest errors. Intuitively, a manifest error denotes a valid procedure 561 summary [푝] C [er : 푞] that (1) can be applied within any calling context (i.e., regardless of the state 562 at call site); and (2) when applied, it always yields an erroneous execution terminating in a state 563 satisfying an extension of 푞. Put formally, given any calling context 푟, if sat(푟) holds6, then (1) there 564 exists 푓 such that 푝 ∗ 푓 ⊢ 푟, i.e., the summary precondition 푝 can be extended with a frame 푓 to 565 match the context 푟; and (2) sat(푞 ∗ 푓 ) also holds: the summary postcondition 푞 can be analogously 566 extended with frame 푓 . (Note that the entailment instance ⊢ here is in the opposite direction to 567 what one would use in over-approximate analysis.) Condition (2) ensures that [푝] C [er : 푞] is not 568 vacuously valid as otherwise sat(푞 ∗ 푓 ) would not hold. 569 570 Definition 3.2 (Manifest errors). A valid error triple |= [푝] C [er : 푞] denotes a manifest error iff 571 for all 푟, if sat(푟) holds, then there exists 푓 such that 푝 ∗ 푓 ⊢ 푟 and sat(푞 ∗ 푓 ) also holds. 572 The following theorem shows that this notion of manifest errors ensures that executing C on 573 any calling context state 푠 can terminate erroneously in some state 푠 ′ satisfying an extension of 푞 574 (i.e., satisfying 푞 ∗ true). In other words, the manifest error is reachable from any input state. This 575 is formulated, for simplicity, without taking procedure parameters into account, but in any case 576 forms the semantic basis of our approach to compositional reporting. The full proof of Theorem 3.3 577 is given in the supplementary material. 578 579 Theorem 3.3. For all manifest errors |= [푝] C [er : 푞]: 580 ′ ′ ′ ∀푠. ∃푠 . (푠, 푠 ) ∈ C er ∧ 푠 ∈ 푞 ∗ true 581 J K L M 582 In the following by a “complete program” we mean one with main() in which there are no 583 undefined, free function names. 584 Corollary 3.4 (True Positives Property). If f()is a procedure in a complete program with a 585 manifest error, then either f()is dead code (not reachable from main) or main er ≠ ∅. 586 J K 587 6Note that given any context state 푠, there exists an assertion 푟 that 푠 satisfies, namely 푟 ≜ {푠 }. 588 , Vol. 1, No. 1, Article . Publication date: August 2021. Finding Real Bugs in Big Programs with Incorrectness Logic 13

589 Manifest Errors and Backwards Under-Approximate Triples. Note that Theorem 3.3 states 590 that every (and not just some) state in the precondition can reach a state in the postcondition. As 591 such, manifest errors curiously satisfy another interpretation of triples, namely that of backwards, 592 under-approximate triples denoted as {|푝|} C {|휖 :푞|}, stating that 푝 under-approximates the states 593 that can be reached from 푞 when executing C backwards (see [Möller et al. 2021], Section 5). We 594 formalise and prove this relation to backwards, under-approximate triples in the supplementary + 595 material. A related concept is studied in [Ball et al. 2005] (and elsewhere) under the name must 596 transitions, in the reachability analysis of [Asadi et al. 2021], and was referred to as “total Hoare 597 triples” by de Vries and Koutavas[2011]. 598 599 3.3 Algorithmically Identifying Manifest Errors 600 Note that directly checking the conditions in Def. 3.2 is practically infeasible as it involves finding 601 a suitable frame 푓 for each arbitrary context 푟, while also ensuring sat(푞 ∗ 푓 ). To remedy this, we 602 identify four conditions that are simpler to check, do not quantify over all contexts, and if satisfied 603 ensure that a given triple |= [푝] C [er : 푞] denotes a manifest error. 604 First, to ensure that the triple can be applied in any context, we stipulate that the precondition 605 impose no spatial or pure constraints on the underlying state. That is, we require (1) 푝 ≡ emp∧true. 606 This way, for any given context 푟 we have 푝 ∗ 푟 ≡ 푟 (and thus 푝 ∗ 푟 ⊢ 푟), yielding the frame 푟. 607 Next, to ensure that sat(푞∗푟) holds for an arbitrary 푟, we require that (2) sat(푞) hold, as otherwise 608 푞 ∗ 푟 would be rendered unsatisfiable. Note that condition (2) is not sufficient toensure sat(푞 ∗ 푟) 609 holds for an arbitrary 푟. To see this, consider f() and g() below and their valid error summaries: 610 f() ≜ 푧 := malloc(); g() ≜ if (푥 = 7) then 611 if (푧 = 푥) then error() error() 612 푆1: |= [푥↦→푋 ∗ 푧↦→푍] f() [er : 푥↦→푋 ∗ 푧↦→푋 ∗ 푋↦→푉 ] 613 푆2: |= [푥↦→푋 ∗ 푧↦→푍] f() [er : ∃푌 . 푥↦→푋 ∗ 푧↦→푌 ∗ 푌↦→푉 ∧ 푌=푋] 푆3: |= [푥↦→푋] g() [er : 푥↦→푋 ∧ 푋=7] 614 615 Note that all summaries are valid and can be derived using the ISL proof system [Raad et al. 2020]. 616 Specifically, ISL also includes the following axiom for malloc(), allowing the allocation to pick a 푋 617 specific location, namely : 618 [푧↦→푍] 푧 := malloc() [ok : 푧↦→푋 ∗ 푋↦→푉 ] (Malloc-ISL) 619 620 This summary is a combination of the predefined summary of malloc() in Fig. 3 and an application 푆 621 of Cons to strengthen the post. As such, 1 can be derived using the ISL rule for malloc() above, 푆 622 while 2 can be derived using the predefined summary of malloc() in Fig. 3, followed by an 푌 . 푥 푋 푧 푌 푌 푉 푌 푋 푌 . 푥 푋 푧 푌 푌 푉 푆 623 application of Cons (as ∃ ↦→ ∗ ↦→ ∗ ↦→ ∧ = ⊢ ∃ ↦→ ∗ ↦→ ∗ ↦→ ). Similarly, 3 . 624 can be derived using the encoding of if and the predefined summaries of assume( ) and error(). 푞 푞 푞 푆 푆 푆 625 Let 1, 2 and 3 respectively denote the postconditions of 1, 2 and 3. As mentioned above, 푆 푋 626 in the case of 1, malloc() allocates a specific location, namely that denoted by , rendering 푞 푟 푟 푟 푋 푉 푞 푟 627 1 ∗ unsatisfiable for an arbitrary : e.g., if ≜ ↦→ , then 1 ∗ ≡ false. In other words, for 628 an error triple to be manifest, it must not require that malloc() allocate a specific location, i.e., 푆 629 the postcondition may not constrain the locations allocated via malloc(). Intuitively, 1 does 630 not denote a manifest error as the specified error only occurs when malloc() allocates a specific 푋 푋 631 location (i.e., the location denoted by is not allocated on the heap beforehand), and thus this error does not arise in contexts in which 푋 is already allocated. To remedy this, when the 632 −→ 633 summary postcondition is given by ∃푋푞 . 휅푞 ∧ 휋푞, we require that the heap locations in 휅푞 (i.e., 634 the logical variables on the left-hand side of ↦→ and ̸↦→ assertions in 휅푞) be existentially quantified 635 and thus not be the same as heap locations in the context. To this end, we define an auxiliary 636 function, locs(.), that computes the heap locations in a given spatial assertion 휅푞; we then require 637 , Vol. 1, No. 1, Article . Publication date: August 2021. 14 Quang Loc Le, Azalea Raad, Jules Villard, Josh Berdine, Derek Dreyer, and Peter W.O’Hearn

−→ 638 that (3) locs(휅푞) ⊆ 푋푞. Indeed, as we demonstrated above in Fig. 3, in Pulse-X we do not include 639 the original ISL summary for malloc() in Malloc-ISL; instead we make a minor alteration by 640 existentially quantifying the location allocated, thus ensuring that this condition holds for all 641 summaries computed by Pulse-X. 642 Note that conditions (1–3) are still not sufficient to ensure sat(푞 ∗ 푟) as the pure assertions in 643 the postcondition may impede the satisfiability of 푞 ∗ 푟. Specifically, in the case of 푆2, although the 644 location allocated via malloc() is existentially quantified as 푌 in 푞2, the pure part of 푞2 requires 645 푌=푋, once again constraining the location allocated via malloc(), rendering 푞2 ∗ 푟 unsatisfiable 646 when 푟 ≜ 푋↦→푉 . Note that 푞2 ≡ 푞1 and thus the error specified by 푆2 is likewise not manifest. In 647 the case of 푆3, the postcondition 푞3 constrains the value of 푥 (tied to logical variable 푋) by requiring 648 푋=7, rendering 푞3 ∗ 푟 unsatisfiable when e.g., 푟 ≜ 푋=2. Intuitively, the error identified by 푆3 does 649 not denote a manifest error as it only arises in contexts in which 푋=7. To rule out summaries such 650 −→ as 푆2 and 푆3 as manifest errors, when the summary postcondition is given by 푞 ≜ ∃푋푞 . 휅푞 ∧ 휋푞 and 651 −→ flv(푞) = 푌 , we lastly require that the pure assertion 휋 be satisfiable for any choice of allocated 652 푞 locations (ruling out summaries such as 푆 ) and any choice of free logical variables (ruling out 653 2 −→ 푆 휋 −→푣 푌 휅 −→푣 654 summaries such as 3). Put formally, we require that (4) sat( 푞 [ / ∪ locs( 푞)]) hold for all . 푆 푆 푆 푌 푋 푣 푋, 푣 푌 655 Note that neither 2 nor 3 satisfies condition 4: in the case of 2, sat( = [ 1/ 2/ ]) does not 푣 푣 푆 푋 푣 푋 푣 656 hold when 1≠ 2; in the case of 3, sat( =7[ / ]) does not holds when ≠7. 657 We formalise the four conditions described above in Theorem 3.5, proving that they are sufficient 658 for identifying manifest errors. The full proof of this theorem is given in the supplementary material. 659 −→ Theorem 3.5 (Manifest errors). An error triple |= [푝] C [er : 푞] with 푞 ≜ ∃푋푞 . 휅푞 ∧ 휋푞 denotes 660 a manifest error if: 661 (1) 푝 ≡ emp ∧ true; 662 (2) sat(푞) holds; 663 −→ 휅 푋 . 664 (3) locs( 푞) ⊆ 푞, where locs( ) is as defined below; and −→ −→ −→ −→ 665 (4) for all 푣 , sat(휋푞 [ 푣 /푌 ∪ locs(휅푞)]) holds, where 푌 = flv(푞). 666 locs(emp)≜ ∅ locs(푥↦→푋)≜ {푥} locs(푋↦→푉 )=locs(푋̸↦→)≜ {푋 } locs(휅1∗휅2)≜ locs(휅1)∪locs(휅2) 667 Note that conditions (1) and (3) can be checked in polynomial time, and the complexity of checking 668 conditions (2) and (4) in our assertion language corresponds to the complexity of the satisfiability 669 problem of equality logic, which is NP-complete. Nevertheless, checking these conditions is efficient 670 in practice. 671 672 3.4 Extending ISL to Support Memory Leaks 673 We next describe how we extend the ISL theory in [Raad et al. 2020] to support memory leak 674 detection. To this end, we assume a designated location, a ∈ Loc, that tracks all memory locations 675 allocated thus far, as shown in Fig. 4. That is, we define states as State ≜ (Loc ⇀ Val) ∪ ({a} → 676 fin P(Val)). Analogously, we extend the assertion language with the a↦→A and leaks(S, 퐿), where we 677 use A and S as meta-variables for sets of logical variables. Specifically, the a↦→A denotes that the 678 set of locations allocated thus far (at a) is given by A. We describe leaks(S, 퐿) shortly below. 679 Accordingly, we adapt the predefined summaries for memory allocation and disposal to track 680 allocated locations, as shown at the bottom of Fig. 4. Specifically, the erroneous specifications (under 681 er) for malloc() and free(.) remain unchanged (as in Fig. 3), while their normal specifications 682 (under ok) are adapted as shown in Fig. 4 to additionally account for the set of allocated locations. 683 684 Detecting Memory Leaks. The leaks(S, 퐿) assertion states that the location denoted by 퐿 is 685 leaked in that it is not reachable from the starting points given by S. Intuitively, as we demonstrate 686 , Vol. 1, No. 1, Article . Publication date: August 2021. Finding Real Bugs in Big Programs with Incorrectness Logic 15

687 Extended Pulse-X Model Domain 688 휂 ∈ Env ≜ (Var → Val) ∪ (LVar → Val ∪ P(Val)) 689 휎 ∈ State ≜ (Loc ⇀fin Val) ∪ ({a} → P(Val)) where a ∈ Loc 690 691 Extended Pulse-X Assertions 휅 692 ::= · · · | a↦→A 휋 , 퐿 693 ::= · · · | leaks(S ) 694 Extended Pulse-X Assertion Semantics 휂, 휎 휎 휂 695 ( ) |= a↦→A iff = {a ↦→ (A)} 휂, 휎 , 퐿 휂 퐿 휂 , 휎 696 ( ) |= leaks(S ) iff ( ) ∉ reach( (S) ) 푆, 휎 Ð 푆 푆 푙 푘 푆 . 휎 푘 푙 푆 푆 697 where reach( ) ≜ 푖 푛 ≜ ∈ Loc ∃ ∈ 푛−1 ( ) = 0 ≜ 푖 ∈N+ 698 699 [ a↦→A ∗ 푥↦→푋] 푥 := malloc() ok : ∃퐿. a↦→A ⊎ {퐿}∗ 푥↦→퐿 700 [ ↦→ ∗ 푥↦→푋 ∗ 푋↦→푉 ] free(푥)  ↦→ \{푋 }∗ 푥↦→푋 ∗ 푋̸↦→ 701 a A ok : a A 702 Fig. 4. Updated ISL model and assertions for memory leak detection (above), where A, S denote meta-variables 703 for sets of logical variables and the extensions from Fig. 1 are highlighted; updated predefined summaries for 704 memory leak detection (below), where the extensions from Fig. 3 are highlighted. 705 706 707 shortly, we check for memory leaks at the end of each procedure call; as such, the starting points 708 correspond to the global program variables, denoted by G. That is, a procedure leaks a location 퐿 if 퐿 709 has been allocated (i.e., is tracked under a) and is not reachable from G (i.e., leaks(G, 퐿) holds). We 710 then define the noLeaks assertion, denoting that no allocated location is being leaked, as follows: 711 def Û ⇔ ∃ . ↦→ ∧ ¬ (G, 퐿) 712 noLeaks A a A leaks 퐿∈A 713 7 714 To detect memory leaks, we insert assert(noLeaks) at the end of each procedure. As such, if 715 procedure f() leaks a location, this assertion fails, leading Pulse-X to report a memory leak for f(). 716 717 4 THE PULSE-X ANALYSIS ALGORITHM 718 Our Pulse-X program analysis finds a collection of ISL triples for a given procedure in a program, 719 forming its summary. These summaries are then used to report errors, and to obtain (inductively) 720 summaries for other procedures. We describe our Pulse-X analysis algorithm in terms of a proof 721 search algorithm in ISL. We present it as a variation on predicate transformer semantics, one 722 that generates pre-assertions (presumptions) on the way to producing a post-assertion (result). 723 Following our presentation of the Pulse-X algorithm, we then describe our actual implementation 724 of Pulse-X, how it differs from the idealised algorithm, and sources of false positives which arise 725 from going outside the presumptions of the soundness theorem. 726 727 4.1 The Pulse-X Analysis Algorithm as Proof Search in ISL 728 Specification Tables. Our proof search algorithm carries around a specification table, 푇 , that 729 associates each instruction (i.e., a predefined instruction or a procedure call) inst with a set of ISL 730 specifications. At the beginning of the algorithm, the specification table is only populated withthe 731 7Note that assert(noLeaks) does not fit the official syntax of the encoding of assert statements from earlier, because 732 noLeaks is not a pure (heap independent) boolean. Instead of extending the syntax of booleans we can more simply regard 733 assert(noLeaks) as a special command apart from the given instructions. Its semantics is that its ok relation sends a state 734 to the same state if there are no leaks, and its er relation sends a state to the same state if there are leaks. 735 , Vol. 1, No. 1, Article . Publication date: August 2021. 16 Quang Loc Le, Azalea Raad, Jules Villard, Josh Berdine, Derek Dreyer, and Peter W.O’Hearn

736 Eval(푝, skip,푇 ) ≜ {(ok, emp, 푝)} 737   738 ′ (휖,푚, 푞) ∈ Eval(푦↦→푌 ∗ 푝 ∧ 푌 =nil, C[푦/푥],푇 ) Eval(푝, local x.C,푇 ) ≜ (휖, ∃푌 .푚, ∃푌 .푞 ) ′ 739 and 푞 = prune(푦, 푌, 푞) 740 where 푦, 푌 fresh in 푝 and 푦 fresh in C 741  ( ,푚 , 푞 ) ∈ (푝, ,푇 )  742 푝, ,푇 휖,푚 푚 , 푞 ok 1 1 Eval C1 Eval( C1; C2 ) ≜ ( 1 ∗ 2 ) 휖,푚 , 푞 푞 , ,푇 743 ∧( 2 ) ∈ Eval( 1 C2 )  ,푚 , 푞 ,푚 , 푞 푝, ,푇 744 ∪ (er 1 1) (er 1 1) ∈ Eval( C1 ) 745 Eval(푝, C1 + C2,푇 ) ≜ Eval(푝, C1,푇 ) ∪ Eval(푝, C2,푇 ) 746 747 Eval(푝, C★,푇 ) ≜ Eval(푝, Cunrollings,푇 ) where C푖+1 ≜ (skip + (C; C푖 )) 748  749 Eval(푝, inst,푇 ) ≜ (휖,푚,푏 ∗ 푓 ) [푎] inst [휖 :푏] ∈ 푇 ∧ (푚, 푓 ) ∈ BiabDiscover(푝, 푎) 750 푝, ,푇 , , 푝 푝 , , 푝 푝 751 Eval( assert(noLeaks) ) ≜ {(ok emp ) | |= noLeaks} ∪ {(er emp ) | ̸|= noLeaks} 752 753 Fig. 5. The Eval function for pre/post discovery. 754 755 summaries of predefined instructions, and is extended incrementally with the procedure summaries 756 along the way. As such, for each predefined instruction inst, the 푇 (inst) is as given in Fig. 3; and 757 for each function call f(), the 푇 (f()) is of the form [푝1] f() [휖1 :푞1], ··· , [푝푛] f() [휖푛 :푞푛]. 758 At the core of Pulse-X is the evaluation function Eval(푝, C,푇 ). Given a presumption 푝 and a 759 specification table 푇 , the Eval(푝, C,푇 ) computes a set of tuples of the form (휖,푚, 푞), such that: 760 푝 푚 휖 푞 푝 푚 761 If we extend with the ‘missing’ resource , then : is a valid result of executing C on ∗ . 762 This intuition is captured in the theorem below, where we write 푇 |= [푝 ∗ 푚] C [휖 : 푞] to denote 763 that if the triples in 푇 are valid ISL triples (see Def. 3.1), then the triple [푝 ∗ 푚] 푐 [휖 : 푞] is also valid. 764 We formalise the notion of Eval(푝, C,푇 ) shortly below. 765 Theorem 4.1 (Under-approximation Soundness). For all 푝, C,푇, 휖,푚, 푞: 766 767 (휖,푚, 푞) ∈ Eval(푝, C,푇 ) implies 푇 |= [푝 ∗ 푚] C [휖 : 푞] 768 769 The Eval Function. We define the Eval function using the inductive rules shown in Fig. 5. These 770 rules are designed in such a way that they maintain the result in Theorem 4.1, thereby ensuring 771 the soundness of the Pulse-X analysis by construction. Note that these rules are an adaptation 772 of ISL proof rules, oriented to a forwards-running symbolic execution. The Eval definition in the 773 case of sequential composition (C1; C2) incorporates many of the key properties of the analysis. 774 Specifically, we first execute the first command C1 and generate its missing resource 푚1. In the 775 case where executing C1 results in an error, the execution is halted (short-circuit semantics) and 776 푚1 is returned as the missing resource. Otherwise, we continue with executing C2 and generate 777 its missing resource 푚2, which is then combined with the missing resource of C1 and returned as 778 푚1 ∗ 푚2. To execute a path symbolically we use the sequential composition rule repeatedly, until 779 the execution terminates normally, or we encounter an error (short-circuiting). 780 Note that the Eval definition of procedure calls (the inst case at the bottom of Fig. 5) uses the 781 BiabDiscover(푝, 푞) function. This function is the biabduction notion from [Calcagno et al. 2011]. 782 BiabDiscover(푝, 푞) returns a set of pairs of the form (푚, 푓 ) such that 푝 ∗ 푚 = | 푞 ∗ 푓 . That is, 783 it abduces a frame 푓 and anti-abduces a missing resource 푚. In the Eval rule for procedure calls 784 , Vol. 1, No. 1, Article . Publication date: August 2021. Finding Real Bugs in Big Programs with Incorrectness Logic 17

785 at the bottom of Fig. 5, after abducing the frame 푓 and anti-abducing the missing resource 푚 via 786 BiabDiscover(푝, 푞), the 푚 is fed back as the missing resource (to be added to the presumption), 787 while 푓 is carried forward. The soundness of the procedure call rule follows from the Frame and 788 Cons proof rules of ISL. Note that our treatment of procedure calls is as in [Calcagno et al. 2011], 789 except that (1) we anti-abduce 푚 while they abduce 푚; and (2) we abduce 푓 while they anti-abduce 790 푓 . This difference is due to the different direction of entailments in the premise oftherule 791 consequence (Frame) in under-approximate reasoning of incorrectness logic (IL) and ISL, compared 792 to the over-approximate reasoning of Hoare logic and separation logic (SL). 793 The rule for local variables utilizes a function prune(푦, 푌, 푞) which takes program and logical 794 variables and a symbolic heap as arguments, and returns a symbolic heap. Intuitively, it deallocates ′ 795 a cell denoted by 푦 from 푞. The correctness property for prune is that for some 푌 , the entailment ′ 796 푦↦→푌 ∗prune(푦, 푌, 푞) ∧푌=nil |= 푞 holds. prune can be implemented by simply stripping 푦↦→− facts 797 from a symbolic heap, and performing additional equivalence-preserving boolean simplifications if 798 desired. The direction of the entailment in the correctness condition is a result of the reversed rule 799 of consequence in IL, where shrinking the post is a sound operation. 800 As mentioned earlier, assert(noLeaks) is treated as a special instruction. It can be implemented 801 by checking whether all locations in the symbolic heap are reachable from globals, where symbolic 802 reachability takes equalities and other pure facts into account. This has been a standard operation 803 on symbolic heaps in separation logic program analyses from the beginning [Distefano et al. 2006]. 804 To be sound for bug catching, this calculation should be sound for concluding that there are leaks 805 (푝 ̸|= noLeaks) – if the prover thinks there’s an unreachable element in a symbolic heap, then there 806 is in at least one concrete heap that satisfies it – while the prover can be complete but unsound for 807 concluding that there are no leaks (푝 |= noLeaks). 808 Lastly, the Eval rule for loops assumes a parameter, unrollings, for bounded loop unrolling, leading 809 to a form of bounded model checking. 810 In order to convert the rules in Fig. 5 to ones that also bound paths as well as loop depths, we 811 assume we are given another parameter, width, and accordingly adapt Eval so that it computes a 812 sequence of tuples (휖,푚, 푞) rather than a set. We then interpret the set comprehension and ∪ in the 813 case of sequential composition so that the final sequence follows a “lexicographic” order of first 814 selecting disjuncts resulting from 퐶1 in order, and for each of them collecting those resulting from 815 퐶2 second. Similarly, we interpret ∪ in the cases of non-deterministic choice so that it selects first 816 from the disjuncts resulting from 퐶1 (thus favouring lower iteration counts when evaluating loops), 817 then those from 퐶2, capping the length of the resulting sequence at a maximum of width. 818 Analogously, we interpret the set comprehension in the case of instructions (inst) to truncate after 819 width, and assume the specification table and BiabDiscover(푝, 푞) give back sequences instead 820 of sets. For brevity, we omit a more formal definition of width bounding and of the ordering 821 implemented in the tool. Note that the order described briefly in the previous paragraph is the one 822 implemented in Pulse-X and is but one of several possible; a more comprehensive exploration of 823 the benefits of various orderings is left to future work. 824 825 Remark 1. Our analysis algorithm is strikingly simple compared to the original biabductive 826 analysis from [Calcagno et al. 2011]. In particular, a key source of simplification relates to the 827 handling of loops: to over-approximate loops, the original biabductive analysis had to seek fixed- 828 points (i.e., to guess loop invariants). To describe loop invariants for linked structures, inductive 829 definitions of predicates in one form or another are generally needed, and the treatment ofinductive 830 predicates is a challenging issue in over-approximate analysis for both (abductive) theorem proving 831 and abstract semantics of instructions. In contrast, for our under-approximate analysis, we can 832 avoid the issue entirely: our abstract domain does not include inductive predicates (apart from 833 , Vol. 1, No. 1, Article . Publication date: August 2021. 18 Quang Loc Le, Azalea Raad, Jules Villard, Josh Berdine, Derek Dreyer, and Peter W.O’Hearn

834 a special one for checking leaks), and we need not seek loop invariants because we are merely 835 seeking to prove the existence of (finite) buggy executions. Instead, we use a simple fragment ofthe 836 symbolic heaps used in separation logic analyzers which includes ↦→ and ̸↦→ assertions and pure 837 boolean conditions, but not e.g., predicates to describe unbounded lists or trees. There are potential 838 advantages to including inductive definitions to describe (backwards) loop variants, and that isa 839 direction for research in the future, but we can use the simple method of loop unrolling and get 840 useful results without needing to invent sophisticated abstract domains. So, an under-approximate 841 biabductive analysis has a lower technical startup cost than an over-approximate one, and yet we 842 will see in the next section that this low-startup-cost analysis nonetheless delivers comparable and 843 often better practical results than analyses with higher startup costs. 844 A second source of simplification is the description of the analysis viathe Eval function. The 845 idea that hypotheses discovered during execution are sent back to the precondition is crystallized 846 in the sequencing rule, which is nothing but a derived inference rule. So soundness of the analysis 847 is obvious: the analysis is simply doing proof search using straightforwardly derivable inference 848 rules. In contrast, the original biabductive analysis was presented in two forms – a denotational 849 style from POPL’09 and a worklist presentation from JACM’11 – neither of which had as simple 850 a presentation or as direct a connection to proof theory. Our approach using Eval is related to 851 the analysis description in [Raad et al. 2020], but makes one further simplification in avoiding 852 having a presumption/result pair describing the computation “up to now” as a parameter, letting 853 the sequencing rule do the whole job. 854 These comparisons to the original carry over as well to other biabductive analyses that have 855 appeared in the intervening years, including [Le et al. 2014] and [Fragoso Santos et al. 2020]. 856 Remark 2. Our approach of using width and unrollings parameters for under-approximate bound- 857 ing may be regarded as simple-minded in the extreme: we simply cut after a certain number of loop 858 iterations or paths are explored. This under-approximation is sound in Incorrectness Logic, but 859 different choices could be made too. For instance, dynamic testing tools, especially fuzzers, often 860 employ more sophisticated exploration strategies, based on the observation that testing is a search 861 problem [Harman et al. 2015]. Such strategies can also be used in under-approximate static tools 862 such as . However, simple-minded can be very useful (especially in the early studies for an 863 Pulse-X approach), as it provides baselines for future work to build upon, as well as a better understanding 864 of the potential benefits a new approach has that are not due to more sophisticated strategies. Our 865 goal therefore is to keep our approach simple in design and implementation, and to resist the urge 866 to bring in more sophisticated search strategies, in order to gain insights and understanding. We 867 were pleasantly surprised, in fact, that this simple bounding furnished as good experimental results 868 as it did – see §5 – and expect that there is potential for going quite a bit further. 869 870 Generating Procedure Summaries. As mentioned above, our analysis algorithm begins with 871 a specification table 푇 that is initially only populated with the predefined summaries in Fig. 3, and 872 extends 푇 along the way with procedure summaries. To do this, given a procedure f() = C, we 873 extend 푇 by assigning the specifications in Eval(emp, C,푇 ) to 푇 (f()). That is, we start with the 874 empty state emp, thus assuming nothing about the initial state. Although our formalism focuses on 875 parameterless procedures, it is straightforward to adapt it to account for parameters: rather than 876 −−−−→ −→ −→ emp, we would then begin with 푦↦→푌 , where 푦 denote the procedure parameters and 푌 denote the 877 logical variables recording their initial values. We illustrate the analysis via the following example. 878 879 Example 4.2. Consider the foo() summary in Fig. 6, a simplified version of the code in Listing4. 880 In this example, we show how our analysis finds a null-pointer-dereference (NPE) error. Specifically, 881 set(y,v) dereferences the heap cell whose address is the content of the pointer y and updates its 882 , Vol. 1, No. 1, Article . Publication date: August 2021. Finding Real Bugs in Big Programs with Incorrectness Logic 19

883 1 void foo(){ 884 [emp ∧ true] 885 2 local 푥; 886 [푥↦→푋 ∧ 푋 = nil] 887 3 푥 := malloc(); 888 휎3 ≡[ ok: ∃퐿. 푥↦→퐿 ∗ 퐿↦→푉 ∧ 푋 = nil ∧ 퐿 ≠ nil] 889 4 assume(푥 != NULL); 890 휎3 ≡[ ok: ∃퐿. 푥↦→퐿 ∗ 퐿↦→푉 ∧ 푋 = nil ∧ 퐿 ≠ nil] 891 5 [푥] := NULL; 892 휎4 ≡[ ok: ∃퐿. 푥↦→퐿 ∗ 퐿↦→nil ∧ 푋 = nil ∧ 퐿 ≠ nil] 893 6 set(푥, 1); 894 휎5 ≡[ er: ∃퐿. 푥↦→퐿 ∗ 퐿↦→nil ∧ 푋 = nil ∧ 퐿 ≠ nil] 895 7 } 896 897 8 void set(푦,푣){ 898 [푦↦→푌 ∗ 푣↦→푉 ∗ 푌 ↦→푊 ∧ 푊 = nil ] 899 9 local 푧; 900 휎 ≡[ 푦↦→푌 ∗ 푣↦→푉 ∗ 푧↦→푍 ∗ 푌 ↦→푊 ∧ 푍 = nil ∧ 푊 = nil ] 901 0 10 푧 := [푦]; 902 휎 ≡[ 푦↦→푌 ∗ 푣↦→푉 ∗ 푧↦→푊 ∗ 푌 ↦→푊 ∧ 푍 ∧ 푊 ] 903 1 ok: = nil = nil [푧] 푣 904 11 := ; 휎 ≡[ 푦↦→푌 ∗ 푣↦→푉 ∗ 푧↦→푊 ∗ 푌 ↦→푊 ∧ 푍 ∧ 푊 ] 905 2 er: = nil = nil 906 12 } 907 908 Fig. 6. An example illustrating how Pulse-X generates procedure summaries. 909 910 911 content to v. The NPE arises as follows: (1) on line 3, malloc() is called and 푥 is assigned to an 912 allocated heap cell; (2) 푥 is dereferenced on line 5, assigning nil to its content; and (3) set(푥,1) is 913 called; as the content of 푥 is NULL, this leads to a NPE. 914 Pulse-X algorithm infers summaries for set(푦,푣) and then foo(). Procedure set(푦,푣) uses 915 two parameters, which are excluded from our formalism, but will help illustrate the workings of 916 the algorithm. As discussed above, we simply record the initial value of 푦 and 푣 via 푦↦→푌 and 푣↦→푉 , 917 respectively. Initially, the Pulse-X algorithm starts with the pre-state 푦↦→푌 ∗ 푣↦→푉 . As 푧 is a local 918 variable, it applies the local inference rule via Eval (Fig. 5) to obtain 휎0. Note: (1) for now, ignore the 919 highlighted assertions in 휎0: they represent the anti-abduced assertion 푚, which will be computed 920 by subsequent analysis steps; (2) to simplify the presentation, and as 푧 and 푍 are distinct from other 921 variables, we reused 푧 and 푍 and did not replace 푧 with a fresh variable. Recall that on encountering 922 the memory read on line 10, Pulse-X uses its predefined summary in Fig. 3 via Eval (Fig. 5). This 923 in turn calls the biabductive procedure, BiabDiscover(., .), to infer the frame 푓 and the missing 924 resource 푚. More concretely, Pulse-X uses the following ok summary of memory lookup (repeated 925 from Fig. 3): 926 [푧↦→푍 ∗ 푦↦→푌 ∗ 푌 ↦→푊 ] 푧 := [푦] [ok : 푧↦→푇 ∗ 푦↦→푌 ∗ 푌↦→푊 ] 927 푦 푌 푣 푉 푧 푍 푍 , 푧 푍 푦 푌 푌 푊 928 and poses the biabductive query BiabDiscover( ↦→ ∗ ↦→ ∗ ↦→ ∧ = nil ↦→ ∗ ↦→ ∗ ↦→ ), 푚, 푓 푚 푌 푊 푓 푣 푉 푍 푚 929 which yields {( )} with ≡ ↦→ and ≡ ↦→ ∧ = nil. Note that is used to compute the 930 pre-assertion and it is sent back to the beginning of the procedure, as denoted by the highlighted 931 , Vol. 1, No. 1, Article . Publication date: August 2021. 20 Quang Loc Le, Azalea Raad, Jules Villard, Josh Berdine, Derek Dreyer, and Peter W.O’Hearn

932 assertion 푌↦→푊 in Fig. 6, while 푓 is combined with the post-condition of 푧 := [푦] to obtain 휎1 933 (shown in Fig. 6). The other er specifications for 푧 := [푦] are applied similarly and result in two 934 additional (er) disjuncts being analysed. 935 Similarly, on encountering the memory read [푧] := 푣 on line 11, Pulse-X uses the following er 936 summary of memory store in Fig. 3: 937 [푧↦→푊 ∧ 푊 = ] [푧] = 푣 [er 푧↦→푊 ∧ 푊 = ] 938 nil : : nil (1a) 939 and poses the query BiabDiscover(푦↦→푌 ∗ 푣↦→푉 ∗ 푧↦→푊 ∗ 푌↦→푊 ∧ 푍 = nil, 푧↦→푊 ∧ 푊 = nil), 940 which yields {(푚, 푓 )} with 푚 ≡ emp ∧푊 = nil and 푓 ≡ 푦↦→푌 ∗ 푣↦→푉 ∗ 푌 ↦→푊 ∧ 푍 = nil. 푚 is used 941 to compute the pre-condition while 푓 is combined with the post-condition of [푧] := 푣 to obtain 942 휎2 (shown in Fig. 6). The other specifications for [푧] := 푣 are applied similarly and result in two 943 additional disjuncts being analysed. 944 Finally, Pulse-X infers the following summary based on the precondition and 휎2. 945 푦 푌 푣 푉 푌 푊 푊 푦 푣 푦 푌 푣 푉 푌 푊 푊 946 [ ↦→ ∗ ↦→ ∗ ↦→ ∧ = nil] set( , ) [er : ↦→ ∗ ↦→ ∗ ↦→ ∧ = nil] (1) 947 Similarly, it generates four other summaries for set(y,v) as follows. 948 949 [푦↦→푌 ∗ 푣↦→푉 ∧ 푌 = nil] set(푦,푣) [er : 푦↦→푌 ∗ 푣↦→푉 ∧ 푌 = nil] (2) 950 [푦↦→푌 ∗ 푣↦→푉 ∗ 푌 ̸↦→] set(푦,푣) [er : 푦↦→푌 ∗ 푣↦→푉 ∗ 푌 ̸↦→] (3) 951 [푦↦→푌 ∗ 푣↦→푉 ∗ 푌↦→푊 ∗ 푊 ↦→푊 ′] set(푦,푣) [ok : 푦↦→푌 ∗ 푣↦→푉 ∗ 푌 ↦→푊 ∗ 푊 ↦→푉 ] (4) 952 953 [푦↦→푌 ∗ 푣↦→푉 ∗ 푌↦→푊 ∗ 푊 ̸↦→] set(푦,푣) [er : 푦↦→푌 ∗ 푣↦→푉 ∗ 푌 ↦→푊 ∗ 푊 ̸↦→] (5) 954 (2) and (3) correspond to the two disjuncts generated after line 10 while (4) and (5) correspond to 955 those inferred after line 11. 956 Pulse-X processes the statements of foo() similarly, where it uses the following summary of 957 x := malloc() in Fig. 6. 958 959 [푥↦→푋] 푥 := malloc() [ok : ∃퐿. 푥↦→퐿 ∗ 퐿↦→푉 ∧ 퐿 ≠ nil] 960 Note that, upon calling set(x,1) on line 6 from state 휎 , of the above five summaries, only (1) is 961 4 applicable, for which the biabductive procedure returns the output 휎 , as shown in Fig. 6. As such, 962 5 휎 is returned as the error post-condition, in accordance with the post-state of (1). In particular, 963 5 infers the following er specification: 964 Pulse-X 965 [emp] foo() [er : ∃퐿. 퐿↦→nil ∧ 퐿 ≠ nil] 966 967 Note that the above triple denotes a manifest error as it satisfies the four conditions of Theorem 3.5: 퐿. 퐿 퐿 968 (1) its precondition is emp; (2) its postcondition is satisfiable i.e., sat(∃ ↦→nil ∧ ≠ nil) holds; 퐿 퐿 퐿 969 (3) locs( ↦→nil) = {L} and is existentially quantified; and (4) sat( ≠ nil) holds. In the imple- 970 mentation, Pulse-X also generates an error trace that is based on the chain of function calls (e.g., 971 foo();set(x,1)) and their corresponding triples. If the trace, like the one in this example, includes 972 a null-dereference er triple e.g., (1a), the error is an NPE. 973 In the cases of (2), (3), (4) and (5), the biabductive procedure returns an empty set, as their 휎 974 preconditions are not compatible with 4. For instance, the biabductive query generated from the 휎 975 precondition of (2) (after renaming variables) and 4 is as follows: 976 BiabDiscover(∃퐿. 푥↦→퐿 ∗ 푧↦→푍 ∗ 퐿↦→nil ∗ 푋̸↦→ ∧ 퐿 ≠ nil, 푥↦→푋 ∗ 푣↦→1 ∗ 푧↦→푍 ∧ 푋 = nil) 977 978 As the underlined sections show that the 푥↦→푋 ∧ 푋 = nil in the precondition of (2) is incompatible 979 with 푥↦→퐿 ∗ 퐿↦→nil ∧ 퐿 ≠ nil in 휎2 (they cannot be reconciled), this query returns the empty set. 980 , Vol. 1, No. 1, Article . Publication date: August 2021. Finding Real Bugs in Big Programs with Incorrectness Logic 21

981 4.2 Implementation Notes 982 Pulse-X is a fork of Pulse, which is itself a program analyzer built on top of the Infer platform 983 (www.fbinfer.com). Note that there is a distinction between Infer-the-platform (which includes sev- 984 eral static analysis tools including RacerD and Pulse) and Infer-the-original which is the biabduction 985 engine of the first version of Infer. The platform shares a frontend and some code with the original 986 that doesn’t involve biabduction or separation logic, but is otherwise separate. Pulse-X does not use 987 any of the original Infer biabduction code: we use a general compositonal abstract interpretation 988 framework implemented inside Infer that is also used for RacerD (but not biabduction). Pulse-X is 989 not yet in the public Infer github repo, which is why we do not share it yet. But we intend to share 990 it in due course. 991 Our implementation of Eval does not use proof search per se. Rather, it uses a worklist algo- 992 rithm which maintains a set of (휖,푚, 푞) triples at each program point [Jhala and Majumdar 2009]. 993 Intuitively, 푚 denotes the ‘missing’ resource that is to be sent back to the start of the procedure, 994 and 푞 denotes states that can be reached from the combination of 푚 and the assumed function 995 precondition. The relationship between the predicate-transformer presentations (e.g., Eval in Fig. 5) 996 and the worklist descriptions is the same as in standard forwards program analysis, except for the 997 presence of the additional component 푚. 998 Given an implementation of Eval, we take several further steps to obtain Pulse-X: 999 • We check the validity of predicate at the end of each procedure, to detect memory 1000 noLeaks leaks (see §3). This gives the effect of inserting assert( ) at the end.8 1001 noLeaks • We apply to all procedures in a codebase, such that is applied to callees before their 1002 Eval Eval callers. For brevity, we do not formalise this ordering as it is standard; however, we note that 1003 the analysis of callees before callers can be done lazily (“on-demand”) as follows. When the 1004 summary for a callee procedure f() is needed for the analysis of the current procedure g(), 1005 if the summary of f() is already available in the specification table 푇 , (i.e., f() has already 1006 been analysed) then we simply fetch it; otherwise, we pause the analysis of g(), generate the 1007 summary of f() and store it in 푇 , and subsequently resume the analysis of g(). 1008 • Once all procedure summaries have been generated and stored in the specification table, we 1009 examine the error specifications to determine which errors to report. This is a two-stage 1010 process. In the first stage, we filter out those null-pointer errors (NPEs) that are not manifest 1011 or not associated with main(), as well as those errors that have already been reported (our 1012 summaries are associated with metadata indicating their “reported” status). In the second 1013 stage, we report the remaining errors and update their metadata to indicate that they have 1014 been reported, so as to report errors as soon as we have ascertained they are manifest and to 1015 avoid reporting them again at procedure call sites. 1016 • All this work is done at the level of an intermediate language with a front-end that maps C/C++ 1017 to the intermediate language. As mentioned above, this front-end is provided independently 1018 by Infer-the-platform. 1019 1020 Finally, we note that our Pulse-X implementation departs in several ways from its formalism and 1021 from C/C++ semantics. In particular, our implementation handles calls to unknown procedures 1022 (e.g., library functions for which the code is not available, or calls to unresolved function pointers) 1023 by treating them as if they had no effect on the state (i.e., as skip), except that the return value and 1024 pointer arguments are assigned arbitrary values. This is not accounted for in our formalism. We also 1025 interpret arithmetic formulae using a theorem prover for rationals, rather than the fixed-precision 1026 1027 8We distinguish between a version of assert() known internally to the analysis and that of C. Programmers sometimes use 1028 assert() or abort() to indicate when they do not want to be warned, and we treat abort this way for OpenSSL, as in §2.1. 1029 , Vol. 1, No. 1, Article . Publication date: August 2021. 22 Quang Loc Le, Azalea Raad, Jules Villard, Josh Berdine, Derek Dreyer, and Peter W.O’Hearn

1030 Table 1. Experimental results on compositional bug reporting. 1031 1032 Project #files LoC(k) #procs BoC(m) Infer Pulse-X-10 Pulse-X-100 1033 OpenSSL-1.0.1h 1536 444 8324 3 80 26 63 1034 OpenSSL-3.0.3 2452 754 21639 8.5 116 30 38 1035 wdt 194 25.4 6679 8.5 1 0 0 1036 bistro 424 37.6 7290 9.7 1 4 5 1037 SQuangLe 36 8.3 12938 17.9 4 4 5 1038 RocksDB 1291 411.7 14669 18 32 8 35 1039 FbThrift 5639 937.7 21753 29 8 21 21 1040 OpenR 341 78.3 124461 195.7 60 50 137 1041 Treadmill 409 25.3 236676 393.7 60 48 132 1042 Watchman 557 63.2 245661 407.3 70 48 131 1043 1044 1045 integers (and floats) of the machine. Both of these departures open up the possibility of real-world 1046 false positives, an issue we revisit in the next section. 1047 1048 5 EVALUATION 1049 The goals of our experiments are 1) to demonstrate that our approach scales to large codebases, and 1050 2) to study the effectiveness of our bug reporting criterion. For the first experiment weran Pulse-X 1051 on a variety of publicly-available open-source C and C++ projects, and compared the runtimes 1052 to those of Infer. This produces many more bug reports than is feasible to triage manually, so for 1053 the second experiment we decided to zoom in on OpenSSL. Infer had previously reported bugs 1054 on OpenSSL back in 2015, so this provided an additional point of comparison, making OpenSSL a 1055 natural choice for this part. We conducted all our experiments on a Linux machine with 24 cores. 1056 Of course, the experimental evidence in this section should be interpreted in the appropriately 1057 cautious way: we do not claim that these results will carry over to arbitrary other codebases (as 1058 would be impossible to conclude from any finite experiments). 1059 1060 5.1 Scale 1061 The purpose of the work in this section is to test the hypothesis 1062 1063 Hypothesis H1. Pulse-X is broadly comparable with Infer in terms of performance, 1064 while reporting a comparable number of bugs. 1065 Note that this hypothesis says nothing about the quality of the bugs reported. 1066 We studied the scalability of Pulse-X and Infer on the projects in Table 1, counting the number of 1067 issues reported by Infer and Pulse-X. We evaluated Pulse-X in two configurations concerning the 1068 maximum number of disjuncts involved: 10 and 100. As Pulse-X is based on under-approximation, 1069 setting a limit on the maximum number of disjuncts in abstract states is sound for proving the 1070 presence of bugs. We say “Infer” to refer to the “biabduction” analysis of Infer. As Infer was aimed 1071 originally at over-approximation, it does not include a disjunct limit. Nonetheless, by default Infer 1072 sets a limit on the wall-clock time and the number of symbolic execution steps that are allowed for 1073 the analysis of each procedure, which we have left at their defaults of 1100 steps and 10 seconds 1074 per procedure. 1075 The first column in Table 1 shows the project names. In the next two columns, we present the 1076 number of files and the number of source lines of code excluding blank lines and commentsfor 1077 each project (LoC, and k for thousands); these two numbers are computed by using cloc, and they 1078 , Vol. 1, No. 1, Article . Publication date: August 2021. Finding Real Bugs in Big Programs with Incorrectness Logic 23

1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 Fig. 7. Speed measurements, in minutes of (user) CPU time per run. 1100 1101 1102 1103 1104 do not include the code in libraries used by the projects. The next two columns show the number 1105 of procedures (#procs) and number of bytes of IR code (BoC, and m for millions) actually analysed 1106 by Pulse-X, including those in libraries used by each project. Note that the size of IR code analysed 1107 can, because of libraries, be considerably larger for a project with a smaller number of source lines 1108 excluding libraries. The translation of C code into IR was done starting from Make build and the 1109 translation of C++ code from build. Pulse-X-10 and Pulse-X-100 refer to Pulse-X with the 1110 indicated disjunct limit. The columns for Pulse-X and Infer show the number of null pointer and 1111 memory leak issues found by each. 1112 We plot the running times of the tools in Fig. 7. All told, we find that Pulse-X-100 and Infer’s 1113 performance are comparable, as are the number of issues both tools report. This finding says 1114 nothing about the quality of the issues found; to study this, we focus on OpenSSL. 1115 1116 1117 Note on Incremental Performance. The compositional nature of the Pulse-X analysis enables 1118 us not only to perform wholesale analyses of large projects efficiently, but also to get fast results on 1119 code changes. For example, fixing the bug we described in Listing 1 and then rerunning Pulse-X in 1120 incremental mode will re-analyse only the changed file to validate the fix. This takes less than ten 1121 seconds, compared to the tens of minutes needed to analyse the whole of OpenSSL. This is not a 1122 new feature of the analysis compared to other compositional techniques such as Infer, but it bears 1123 repeating here as our compositional criterion for bug catching gives us the same benefits as before 1124 in this respect. Even though Pulse-X is a scientific tool that is not deployed in production within an 1125 industrial workflow, this incrementality, together with the scaling results above, suggest thatithas 1126 performance characteristics compatible with being run in CI at code review time. 1127 , Vol. 1, No. 1, Article . Publication date: August 2021. 24 Quang Loc Le, Azalea Raad, Jules Villard, Josh Berdine, Derek Dreyer, and Peter W.O’Hearn

1128 5.2 Accuracy 1129 We evaluate Infer and Pulse-X on OpenSSL, a widely used implementation of the SSL/TSL protocols 1130 as well as a general-purpose cryptography library. 1131 1132 Old Bugs. Often, when judging the accuracy of an analysis, researchers use the concepts of true 1133 and false positive rates. While meaningful in theory, judging whether a bug is a true or false positive 1134 can be error-prone and time-intensive. Below we speak of “thought-true” and “thought-false” bugs, 1135 to emphasize the (fallable) human judgement involved. 1136 To study accuracy we follow [Distefano et al. 2019] in looking at fix rate – the proportion of 1137 reports that have been fixed – in our evaluation. The fix rate concept is often used tojudgethe 1138 effectiveness of a deployment of an analysis in CI but, while Pulse-X is not deployed in this way, 1139 we can use the fix rate concept by looking at a legacy rather than new version of a codebase. Inthis 1140 section we consider OpenSSL-1.0.1h, a legacy version of OpenSSL from 2015 which had previously 9 1141 been analysed by Infer . The evaluation in this section is testing the following hypothesis. 1142 Hypothesis H2: On legacy OpenSSL (from 2015) Pulse-X has a superior fix rate to the 1143 present-day Infer. 1144 OpenSSL uses procedures CRYPTO_malloc, CRYPTO_free, and CRYPTO_realloc in place of the 1145 usual malloc, free, and realloc from the C standard library. We model these CRYPTO wrapper 1146 functions by providing their triples to Pulse-X, as well as Infer. 1147 The results of these experiments are as follows: 1148 • Pulse-X-10 found 26 bugs, 19 of which had been fixed, including 12 null pointer derefer- 1149 ences and 7 memory leaks, fix rate is 73%(= 19/26). Moreover, the unfixed issues involve 1150 either calls to unknown functions for which the code is unavailable or buggy procedures 1151 that no longer exist in OpenSSL-3.0.0 (https://github.com/openssl/openssl, commit hash 1152 147ed5f9def86840c9f6ba512e63a890d58ac1d6). 1153 • Pulse-X-100 found 63 bugs, 53 of which had been fixed, including 11 null pointer dereferences 1154 and 42 memory leaks10, fix rate is 84%(= 53/63). Similar to the case of Pulse-X-10, the unfixed 1155 issues concern either calls to unknown functions or buggy procedures that no longer exist in 1156 the current revision. 1157 • The present-day Infer discovered 39 fixed bugs and 41 unfixed issues, Infer’s is 49%(= 39/80). 1158 8 of the 39 fixed bugs overlap with those found by Pulse-X. The other 41 issues flagged by 1159 Infer were classified as latent errors by Pulse-X and as such were not reported by Pulse-X. 1160 1161 These results confirm hypothesis H2. 1162 Our reporting criterion filters the error specifications, reporting only some of them; we comment 1163 on the effect of this filtering, in light of this experiment. As wesawin §2.3, reporting all error 1164 specifications as alarms is obviously a bad idea. It would make Pulse-X report any time an address 1165 that was not allocated locally is accessed for the first time, due to the error specifications forload 1166 and store instructions that have invalidated assertions in their preconditions. To avoid obvious false 1167 positives, one might then think of reporting alarms whenever 1) there is no invalidated assertion in 1168 the precondition, and 2) if the error is due to a null pointer X, then X=0 is not part of the precondition 1169 either. The latter is to say: we have not assumed a pointer was null, only to then complain about it 1170 being null. This is a subset of the total number of error specifications, but still much greater than 1171 the manifest specifications. Pulse-X found over 300 of these latent null dereferences, more than 1172 9https://mailing.openssl.dev.narkive.com/2DbkkYzD/openssl-org-3403-null-dereference-and-memory-leak-reports-for- 1173 openssl-1-0-1h-from-facebook-s-infer 1174 10See the supplementary material, available at https://www.dropbox.com/s/lltpwcxmtc6aibp/list-bugs-openssl-1.0.1h.txt? 1175 dl=0, for the description of these 53 fixed bugs. 1176 , Vol. 1, No. 1, Article . Publication date: August 2021. Finding Real Bugs in Big Programs with Incorrectness Logic 25

1177 ten times the number of manifest issues reported. Upon inspection, most of these seem to be false 1178 positives, in that they are true specifications but might be considered as issues that are unlikely to 1179 be triggered in a global program. Indeed a high number of the 300+ have not been fixed (like the 1180 one shown in §2.3). 1181 So, the reporting criterion suppresses reports of issues which have not been fixed, and this 1182 is a positive indication. But, the criterion is purposely extremely conservative, reporting null 1183 dereferences only when they can occur in all contexts. It might be that some of the suppressed 1184 latent issues can be worth fixing, and we do not claim our reporting criterion to be the ultimate one 1185 (perhaps there is no ultimate). Indeed, Listing7 shows a bug in function chopup_args that had been 1186 fixed by OpenSSL developers and is classified as latent by Pulse-X. On line 6, the function allocates 1187 new memory (by calling OPENSSL_malloc) and assigns the resulting address to pointer arg->data. 1188 As the allocation might fail, arg->data could point to NULL. Subsequently, the dereference on line 1189 9 might cause a null-pointer-error. Pulse-X classified this issue as latent because this error occurs 1190 only when the condition on line 4 (arg->count == 0) holds. Interestingly, that is the case in a call 1191 site to this function in main, as shown in Listing 8, which may be why, when Infer reported this 1192 bug to the developer in 2015, it consequently got fixed. In theory, Pulse-X could report this bug at 1193 the call site of chopup_args in main as a manifest issue. However, this call is only reachable after 1194 20 conditional statements and 2 loops, and thus the disjunct limit was hit before that bug could be 1195 found. 1196 1 int chopup_args(ARGS*arg, char *buf, int *argc, char **argv[]) { 1197 2 int num,i; 1198 3 ... 1199 4 if (arg->count == 0) { 1200 5 arg->count=20; 1201 6 arg->data= (char **)OPENSSL_malloc(sizeof(char *)*arg->count); 1202 7 } 1203 8 for (i=0;icount;i++) 1204 9 arg->data[i]=NULL; 1205 10 .... } 1206 11 1207 Listing 7. Latent error in chopup_args. 1208 1209 1 int main(int Argc, char *ARGV[]){ 1210 2 ARGS arg; 1211 3 ... 1212 4 arg.count=0; 1213 5 ... 1214 6 if (!chopup_args(&arg,buf,&argc,&argv)) break; 1215 7 ... 1216 8 } 1217 Listing 8. Manifest error in main of openssl.c. 1218 1219 Comparing with the 15 bugs found by Infer and fixed in 2015, Pulse-X-10 reported 4 overlapping 1220 fixed bugs, Pulse-X-100 reported 5, and present Infer discovered 4. Of the 10 fixed bugs not reported 1221 by Pulse-X-100, 5 were null dereferences found by Pulse-X-100 but that were classified as latent. 1222 We are unsure why the remaining fixed leaks were not discovered, but it might have to do withthe 1223 order of paths being processed. We are also unsure why the present-day Infer does not report 11 of 1224 the fixed bugs. Note that the 15 fixed from 2015 were reported by the last author during atraining 1225 , Vol. 1, No. 1, Article . Publication date: August 2021. 26 Quang Loc Le, Azalea Raad, Jules Villard, Josh Berdine, Derek Dreyer, and Peter W.O’Hearn

1226 phase of Infer, when there were many more false positives (the exact number of false positives 1227 from then is not known). The main lesson we can take from these numbers is that there might be 1228 room to loosen our reporting criterion, though this would have to be done in a way that avoids 1229 introducing bugs deemed not-worth-fixing. 1230 New Bugs. Given the positive results for Pulse-X on legacy OpenSSL, we decided to run it on 1231 the current beta (OpenSSL-3.0.0 commit hash 147ed5f9def86840c9f6ba512e63a890d58ac1d6, available 1232 at https://github.com/openssl/openssl), in case it found new bugs that would be worth fixing. 1233 1234 Hypothesis H3. Pulse-X finds new bugs worth fixing on latest OpenSSL. 1235 Running with a 10-disjunct limit, Pulse-X discovered 30 issues on 3.0.0. Manual inspection by 1236 us revealed what we thought were 15 real bugs (8 null-pointer-dereferences and 7 memory leaks), 1237 as well as 5 maybe and 10 likely false positives. We then submitted the 15 thought-true bugs 1238 to the OpenSSL maintainers in two pull requests: https://github.com/openssl/openssl/pull/15834 1239 and https://github.com/openssl/openssl/pull/15910. The maintainers agreed that all 15 bugs were 1240 legitimate, and our patches are now in the OpenSSL codebase (see the supplementary materials 1241 mentioned in the introduction). We take the information that the OpenSSL maintainers judged 1242 these issues worth fixing as confirmation of hypothesis H3. 1243 When we ran Pulse-X with 100 rather than 10 disjuncts, it found a greater number of issues, 38, 1244 including a non-zero overlap with fixed bugs from the 10-disjunct case. We did not classify allthose 1245 38 because doing so is time-consuming, and the non-zero overlap means that hypothesis H3 is also 1246 confirmed for the 100-disjunct case: to classify those issues would furnish no new informationas 1247 to our hypothesis. Curiously, though, the 100-disjunct case did not find a superset of the bugs of 푡ℎ 푡ℎ 푡ℎ 푡ℎ 1248 the 10 case: several real bugs (4 , 6 , 13 and 14 ) were not found in the 100-disjunct case. The 1249 reason is that when the disjunct limit increases, the number of specs inferred for each function 1250 might increase; consequently, if a function contains a buggy statement after several function calls, 1251 the limit might be reached before the analysis reaches the buggy statement. 1252 We discuss one false positive reported by Pulse-X in the function ossl_do_ex_data_init (List- 1253 ing 9). This function assigns the pointer address of a field of ctx to variable global via the 1254 function call on line 2. The call on line 2 invokes three other functions, pthread_getspecific, 1255 pthread_once, and pthread_rwlock_init whose code is unavailable, and so are unknown to 1256 Pulse-X. Our analysis replaces the result of an unknown function by a fresh logical variable, and 1257 as a consequence the analysis assumes that global points to a local heap cell instead of a global 1258 pointer. This leads Pulse-X to report a memory leak as a consequence of the allocation on line 5. 1259 1 int ossl_do_ex_data_init(OSSL_LIB_CTX*ctx){ 1260 2 OSSL_EX_DATA_GLOBAL*global= ossl_lib_ctx_get_ex_data_global(ctx); 1261 3 if (global == NULL) 1262 4 return 0; 1263 5 global->ex_data_lock= CRYPTO_THREAD_lock_new(); 1264 6 return global->ex_data_lock != NULL; 1265 7 } 1266 Listing 9. Pulse-X false positive. 1267 1268 If we were to model pthread_getspecific, pthread_once, and pthread_rwlock_init then 1269 we could avoid this specific issue. But there are very many libraries, some not yet created, which 1270 would have to be modelled. Another way to address this issue would be to suppress any report that 11 1271 involves unknown code . This would stamp out all the false positives observed with OpenSSL, and 1272 11Interesting subtler approaches to unknown code have been investigated, a good example being [Das et al. 2015]; but, to 1273 our knowledge, no definitive solution has emerged and further research is warranted. 1274 , Vol. 1, No. 1, Article . Publication date: August 2021. Finding Real Bugs in Big Programs with Incorrectness Logic 27

1275 we could trumpet that we had obtained no false positives on OpenSSL, but it would also remove true 1276 positives: 5 of the 15 true bugs fixed in 3.0.0 involved unknown functions. Rather than artificially 1277 move to a position where we can blow that trumpet, we prefer to present the realistic situation 1278 (which arises from examples going beyond assumptions of theory) and not make a value judgement 1279 on whether it would be better to stamp out these false positives at the expense of false negatives; 1280 this is, in the end, an engineering rather than a scientific judgement, and we would prefer tobe 1281 able to provide this data to engineers to help inform their judgement. 1282 A run of Infer found 116 issues on 3.0.0, 7 of which overlap with those fixed bugs found by 1283 Pulse-X. We do not know exactly how many more of the Infer issues would be considered worth 1284 fixing by OpenSSL maintainers, but we noticed over 40 likely false positives and did not pursuethe 1285 matter further. We emphasize that we are not making any comparative claim w.r.t. Infer here – our 1286 hypothesis H3 only mentions Pulse-X– and these numbers are given just for information. 1287 We also remark that Coverity scan runs regularly on OpenSSL and does result in bug fixes12; as 1288 Coverity is proprietary, we do not know whether it finds the issues that Pulse-X does, but the 15 1289 reported Pulse-X-issues had not been previously fixed at the time of our tool run. 1290 1291 6 RELATED WORK 1292 Our work builds directly on the biabductive compositional analysis of Infer [Calcagno et al. 2011], 1293 on Incorrectness Logic [O’Hearn 2019], and on Incorrectness Separation Logic (ISL)[Raad et al. 1294 2020]. In their work introducing ISL, Raad et al. [2020] also presented an intra-procedural anal- 1295 ysis. Our work goes beyond [Raad et al. 2020] with inter-procedural reasoning, using Eval (our 1296 predicate transformer-style semantics of Pulse-X) for summary inference, with an implemented 1297 inter-procedural analysis algorithm, and an experimental evaluation. Most significantly, we pro- 1298 posed a rigorous approach to the problem of compositional bug reporting, without a global program 1299 context. 1300 Pulse is an analyzer for memory issues being developed at Facebook, which has evolved alongside 1301 the incorrectness logic theory. It implemented ideas about under-approximate compositional 1302 analysis before the formal foundation was in place, and this then influenced IL and ISL. As we 1303 mentioned in the introduction, Pulse-X is itself an experimental fork of Pulse. We made this fork to 1304 distinguish goals in tool engineering. The goal of Pulse-X is scientific, to probe the effect of trading 1305 Infer’s over-approximate separation logic for the under-approximate ISL. In doing so, our aim was to 1306 minimize the conceptual distance from Infer’s original foundation. In particular, our summaries look 1307 much like Infer’s, except that they are based on under-approximate rather than over-approximate 1308 triples. On the other hand, the goal of Pulse is to help engineers, and minimizing this conceptual 1309 difference is not as important for that purpose. For example, Pulse has a more compact notion of 1310 summary which is less obviously similar to that of Infer, and there is no requirement for Pulse to 1311 evolve in a way that eases scientific comparison. 1312 Still, the distance between Pulse-X and Pulse has been narrowing rather than growing, and the 1313 Pulse-X work has even repaid some of IL’s debt to Pulse: the reporting criterion based on manifest 1314 bugs has been implemented in Pulse as well as Pulse-X, it is switched on in production, and this 1315 has resulted in a significant reduction in observed false positives. We expect that a fuller account 1316 of the Pulse story will be forthcoming after there has been extended industrial experience with it. 1317 Long before the work on IL and Pulse began, summary-based under-approximate analysis was 1318 studied in the context of symbolic execution by Godefroid and others [Godefroid 2007; Godefroid 1319 et al. 2010]. Godefroid et al. [2010] use an analogue of the under-approximate triple referred to 1320 as a must− transition [Ball et al. 2005]. Summaries were used in these works in an effort to speed 1321 1322 12https://scan.coverity.com/projects/tpm2-openssl 1323 , Vol. 1, No. 1, Article . Publication date: August 2021. 28 Quang Loc Le, Azalea Raad, Jules Villard, Josh Berdine, Derek Dreyer, and Peter W.O’Hearn

1324 up a global program analysis, and ultimately one falls back on dynamic analysis for soundness. 1325 These works present experimental results for device drivers up to just over 30k LOC, with running 1326 times in the hundreds of minutes, whereas we tackle programs in the hundreds of thousands 1327 of LOC. Summary-based analyses were subsequently implemented in the SAGE and PEX tools 1328 used in production at Microsoft, but are not used by default widely in their deployments, because 1329 other techniques were found which were better for fighting path explosion (Godefroid, personal 1330 communication). Note that we do not use summaries principally for efficiency (though they help), 1331 but rather as part of a begin-anywhere analysis which can be applied to partial programs (and thus 1332 fits well with a CI deployment for analysing pull requests). Although we believe the authorsof 1333 these works could have, technically, run the symbolic part of their analysis on partial programs, 1334 they did not develop an approach for compositional reporting as we do, and this would have been 1335 essential for automated deployment without the human intervention to filter reports. 1336 Another direction in symbolic execution is “under-constrained” analysis, which uses symbolic 1337 execution to analyse individual functions without going back to main(). Representative tools are 1338 JPF [Khurshid et al. 2003], UC-KLEE [Ramos and Engler 2015] and JPF-Star [Pham et al. 2019b]. 1339 Instead of abduction, these works use lazy initialisation [Khurshid et al. 2003] to infer procedure 1340 preconditions and use the preconditions for test case generation. These works are not summary- 1341 based, but do have the characteristic of Infer where execution can start from anywhere in a program. 1342 The authors of UC-KLEE emphasise that the ability to run on individual procedures, without 1343 the complete program, opens up new possibilities for applying symbolic execution to real code, 1344 since execution on a global program may get stuck and not reach many of the functions. As an 1345 example of (rough) comparison with this work, when run on an OpenSSL target [Ramos and Engler 1346 2015], UC-KLEE found five true memory leak bugs with 267 false positives (excluding 269 false 1347 positives in ASN.1 sub-library). By contrast, when run on OpenSSL (circa 2015), Pulse-X found 1348 40 leak bugs which were reported and fixed (so we consider them true positives) with 12 false 1349 positives. Although the two tools were not run on exactly the same OpenSSL version (and thus 1350 this is not a systematic study), the difference is striking. There are numerous technical differences 1351 between Pulse-X and UC-KLEE related to path explosion and areas of unsoundness, but the greatest 1352 difference concerns what we call compositional reporting. Generated preconditions can lead tobugs 1353 but that would be deemed infeasible or not useful by programmers. This is managed in UC-KLEE 1354 using heuristics and programmer annotations. Their best-performing heuristic led to a claimed 1355 true-positive ratio of 20%, considerably lower than what we have observed in our experiments 1356 (which is 73% for Pulse-X-10 and 84% for Pulse-X-100). 1357 JPF-Star performs under-constrained symbolic execution by using separation logic to pre- 1358 constrain the test cases generated. For high path coverage, it uses inductive definitions in precon- 1359 ditions to represent an expressive fragment of heap inputs and utilises a separation logic solver 1360 to generate the test inputs that drive the symbolic execution engine to different paths of a given 1361 program. It was also combined with random heap-based input generation to support unknown 1362 functions e.g., Java native methods [Pham et al. 2019a]. However, as with JPF, it does not use 1363 summaries. Additionally, its symbolic states might over-approximate, and it does not report bugs 1364 compositionally. 1365 Gillian [Fragoso Santos et al. 2020] is a separation logic reasoning platform that combines 1366 technologies for symbolic execution-based testing, verification, and biabductive compositional 1367 analysis. The compositional testing phase of Gillian currently works by using over-approximate 1368 rather than under-approximate summaries, but we expect it could be altered to follow our under- 1369 approximate approach. If so, given that Gillian is multi-language and parametric on the memory 1370 model of the target language, an expressive under-approximate reasoning platform might be 1371 obtained for C, JavaScript, and many other languages. 1372 , Vol. 1, No. 1, Article . Publication date: August 2021. Finding Real Bugs in Big Programs with Incorrectness Logic 29

1373 A distantly related line of work is harness generation for fuzzers (e.g., [Ispoglou et al. 2020]). 1374 Fuzzing a complete program poses a similar problem to symbolic execution: when starting from 1375 main(), many parts of the code can be difficult to reach. Harnesses are “fake main programs” which 1376 set up the environment and then call a collection of functions repeatedly. Constructing them can 1377 be time-consuming and error-prone, hence the desire for automatic generation. Automatic harness 1378 generation is similar in spirit to the precondition generation of Infer or UC-KLEE, but the harnesses 1379 (so far) can be more complex than preconditions. Harness generation can also suffer from generating 1380 spurious or infeasible fake main programs, which could uncover “bugs” that programmers regard as 1381 false positives. We are unsure whether harness generation and symbolic execution or compositional 1382 static analysis can affect one another by direct transfer of already-existing ideas; however, given 1383 their somewhat similar motivations, we expect some convergence in the future. 1384 1385 7 CONCLUSION 1386 This paper has made steps towards realising the promise of Incorrectness Logic for defining new 1387 automated bug catching techniques, going beyond its ability to describe existing approaches. Our 1388 focus has been in a specific direction, compositionality, where program logics such as Hoare Logic 1389 and Separation Logic have had the most influence for correctness reasoning. This is not the only 1390 possible direction. New under-approximate abstractions and methods for synthesising (backwards) 1391 loop variants might also be approached, but those would be topics for other papers. 1392 We defined a new compositional, under-approximate analysis, by merging ideas from automatic 1393 compositional analysis based on Separation Logic [Calcagno et al. 2011] and Incorrectness SL 1394 [Raad et al. 2020]. This merging gives us sufficient information in the abstract semantics to reason 1395 soundly about the presence of bugs, and one of our main contributions is a rigorous reporting 1396 criterion which accesses this information, leading to a “true positives property” of compositional 1397 bug reporting on potentially incomplete program fragments. Analysis of incomplete program 1398 fragments (without a global program context) is relevant to the diff-time analysis reporting recently 1399 emphasised by Google and Facebook, but rests on theoretical definitions rather than heuristics 1400 alone. 1401 A second main contribution is an experimental evaluation of the effectiveness of our method 1402 based on an implementation in a tool, Pulse-X. We compared to Infer, a state-of-the-art industrial 1403 program analysis. We found that Pulse-X is competitive on performance and, zooming in on 1404 OpenSSL, superior on accuracy (true/false bugs). We remark that Pulse-X is a scientific prototype 1405 that does not have the demonstrated industrial impact of Infer. Furthermore, there is no guarantee 1406 that our results on accuracy will carry over to arbitrary other codebases.13 But, with these caveats, 1407 our results do provide some corroborative evidence of potential effectiveness. 1408 Although we did not do a detailed comparison, we also noted in the previous section that 1409 our experimental results compare favourably with UC-KLEE [Ramos and Engler 2015] on some 1410 dimensions. UC-KLEE is a leading14 symbolic execution tool that targets running on individual 1411 functions instead of complete programs. 1412 These remarks on Infer and UC-KLEE are not meant as criticisms of them. We stress that we 1413 build on the insights and reported experience of Infer, and we acknowledge that UC-KLEE has 1414 convincingly made the case that symbolic execution can more easily be applied to real code, finding 1415 real bugs, if it is applied on a per-function basis. (This experience with UC-KLEE, incidentally, 1416 agrees with the perspective of the Google and Facebook CACM papers [Sadowski et al. 2018; 1417 Distefano et al. 2019], and with that of the present paper.) Stepping back from comparisons of 1418 1419 13Of course this is almost always the case with static analyses, which are tackling undecidable problems. 1420 14It received the best paper award at the 2014 USENIX Security Symposium. 1421 , Vol. 1, No. 1, Article . Publication date: August 2021. 30 Quang Loc Le, Azalea Raad, Jules Villard, Josh Berdine, Derek Dreyer, and Peter W.O’Hearn

1422 specific tools showing that one has certain apparent advantages over another, the moregeneral 1423 point we wish to make is the positive potential of using sound under-approximate abstractions 1424 when reasoning compositionally. We believe the results of this paper underline this point, but 1425 add that the general point does not depend crucially on our specific abstractions. For example, it 1426 might be possible to define a rigorous compositional reporting criterion for a tool like UC-KLEE, 1427 maybe similarly to how we have done so here; if successful, then symbolic execution tools might 1428 be deployed automatically in CI to identify regressions on diffs, without the human in the loop, 1429 which would likely considerably boost their impact. We certainly don’t claim to have discovered 1430 the ultimate compositional reporting criterion in this paper but rather an example with associated 1431 experimental backing, an example criterion which we hope others will be motivated to improve, 1432 extend, or even replace. 1433 1434 REFERENCES 1435 Krzysztof R. Apt and Ernst-Rüdiger Olderog. 2019. Fifty years of Hoare’s logic. Formal Aspects Comput. 31, 6 (2019), 751–807. 1436 Ali Asadi, Krishnendu Chatterjee, Hongfei Fu, Amir Kafshdar Goharshady, and Mohammad Mahdavi. 2021. Polynomial 1437 reachability witnesses via Stellensätze. In PLDI ’21: 42nd ACM SIGPLAN International Conference on Programming 1438 Language Design and Implementation, Virtual Event, Canada, June 20-25, 20211, Stephen N. Freund and Eran Yahav (Eds.). 1439 ACM, 772–787. https://doi.org/10.1145/3453483.3454076 Thomas Ball, Orna Kupferman, and Greta Yorsh. 2005. Abstraction for Falsification. In Computer Aided Verification, 17th 1440 International Conference, CAV 2005, Edinburgh, Scotland, UK, July 6-10, 2005, Proceedings (Lecture Notes in Computer 1441 Science, Vol. 3576), Kousha Etessami and Sriram K. Rajamani (Eds.). Springer, 67–81. https://doi.org/10.1007/11513988_8 1442 Cristiano Calcagno, Dino Distefano, Peter W. O’Hearn, and Hongseok Yang. 2011. Compositional Shape Analysis by Means 1443 of Bi-Abduction. J. ACM 58, 6, Article 26, 66 pages. https://doi.org/10.1145/2049697.2049700 1444 Patrick Cousot. 2001. Compositional separate modular static analysis of programs by abstract interpretation. In Proc. SSGRR 2001 – Advances in Infrastructure for Electronic Business, Science, and Education on the Internet, 6 – 10. 1445 P. Cousot and R. Cousot. 1977. Abstract Interpretation: A Unified Lattice Model for Static Analysis of Programs by 1446 Construction or Approximation of Fixpoints. In POPL. 238–252. 1447 Ankush Das, Shuvendu K. Lahiri, Akash Lal, and Yi Li. 2015. Angelic Verification: Precise Verification Modulo Unknowns. In 1448 Computer Aided Verification - 27th International Conference, CAV 2015, San Francisco, CA, USA, July 18-24, 2015, Proceedings, 1449 Part I (Lecture Notes in Computer Science, Vol. 9206), Daniel Kroening and Corina S. Pasareanu (Eds.). Springer, 324–342. https://doi.org/10.1007/978-3-319-21690-4_19 1450 Edsko de Vries and Vasileios Koutavas. 2011. Reverse Hoare Logic. In Software Engineering and Formal Methods, Gilles 1451 Barthe, Alberto Pardo, and Gerardo Schneider (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 155–171. 1452 Dino Distefano, Manuel Fähndrich, Francesco Logozzo, and Peter W. O’Hearn. 2019. Scaling Static Analyses at Facebook. 1453 Commun. ACM 62, 8 (July 2019), 62–70. https://doi.org/10.1145/3338112 1454 Dino Distefano, Peter W. O’Hearn, and Hongseok Yang. 2006. A Local Shape Analysis Based on Separation Logic. In Tools and Algorithms for the Construction and Analysis of Systems, 12th International Conference, TACAS 2006 Held as Part of 1455 the Joint European Conferences on Theory and Practice of Software, ETAPS 2006, Vienna, Austria, March 25 - April 2, 2006, 1456 Proceedings (Lecture Notes in Computer Science, Vol. 3920), Holger Hermanns and Jens Palsberg (Eds.). Springer, 287–302. 1457 https://doi.org/10.1007/11691372_19 1458 José Fragoso Santos, Petar Maksimović, Sacha-Élie Ayoun, and Philippa Gardner. 2020. Gillian, Part i: A Multi-Language 1459 Platform for Symbolic Execution. In Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation (London, UK) (PLDI 2020). Association for Computing Machinery, New York, NY, USA, 927–942. 1460 https://doi.org/10.1145/3385412.3386014 1461 Patrice Godefroid. 2007. Compositional Dynamic Test Generation. In Proceedings of the 34th Annual ACM SIGPLAN-SIGACT 1462 Symposium on Principles of Programming Languages (Nice, France) (POPL ’07). Association for Computing Machinery, 1463 New York, NY, USA, 47–54. https://doi.org/10.1145/1190216.1190226 1464 Patrice Godefroid, Aditya V. Nori, Sriram K. Rajamani, and SaiDeep Tetali. 2010. Compositional may-must program analysis: unleashing the power of alternation. In Proceedings of the 37th ACM SIGPLAN-SIGACT Symposium on Principles of 1465 Programming Languages, POPL 2010, Madrid, Spain, January 17-23, 2010, Manuel V. Hermenegildo and Jens Palsberg 1466 (Eds.). ACM, 43–56. 1467 Mark Harman, Yue Jia, and Yuanyuan Zhang. 2015. Achievements, Open Problems and Challenges for Search Based Software 1468 Testing. In 8th IEEE International Conference on Software Testing, Verification and Validation, ICST 2015, Graz, Austria, 1469 April 13-17, 2015. IEEE Computer Society, 1–12. https://doi.org/10.1109/ICST.2015.7102580 1470 , Vol. 1, No. 1, Article . Publication date: August 2021. Finding Real Bugs in Big Programs with Incorrectness Logic 31

1471 Kyriakos K. Ispoglou, Daniel Austin, Vishwath Mohan, and Mathias Payer. 2020. FuzzGen: Automatic Fuzzer Generation. In 1472 29th USENIX Security Symposium, USENIX Security 2020, August 12-14, 2020, Srdjan Capkun and Franziska Roesner (Eds.). 1473 USENIX Association, 2271–2287. https://www.usenix.org/conference/usenixsecurity20/presentation/ispoglou Ranjit Jhala and Rupak Majumdar. 2009. Software Model Checking. ACM Comput. Surv. 41, 4, Article 21 (Oct. 2009), 54 pages. 1474 https://doi.org/10.1145/1592434.1592438 1475 Sarfraz Khurshid, Corina S. Păsăreanu, and Willem Visser. 2003. Generalized Symbolic Execution for Model Checking and 1476 Testing. In Proceedings of the 9th International Conference on Tools and Algorithms for the Construction and Analysis of 1477 Systems (Warsaw, Poland) (TACAS’03). Springer-Verlag, Berlin, Heidelberg, 553–568. 1478 Quang Loc Le, Cristian Gherghina, Shengchao Qin, and Wei-Ngan Chin. 2014. Shape Analysis via Second-Order Bi- Abduction. In Computer Aided Verification, Armin Biere and Roderick Bloem (Eds.). Springer International Publishing, 1479 Cham, 52–68. 1480 Bernhard Möller, Peter O’Hearn, and Tony Hoare. 2021. On Algebra of Program Correctness and Incorrectness. In Relational 1481 and Algebraic Methods in Computer Science - 19th International Conference, RAMiCS 2021. 1482 Peter W. O’Hearn. 2019. Incorrectness Logic. Proc. ACM Program. Lang. 4, POPL, Article 10 (Dec. 2019), 32 pages. 1483 https://doi.org/10.1145/3371078 Peter W. O’Hearn, John C. Reynolds, and Hongseok Yang. 2001. Local Reasoning about Programs that Alter Data Structures. 1484 In Computer Science Logic, 15th International Workshop, CSL 2001. 10th Annual Conference of the EACSL, Paris, France, 1485 September 10-13, 2001, Proceedings. 1–19. https://doi.org/10.1007/3-540-44802-0_1 1486 Long H. Pham, Quang Loc Le, Quoc-Sang Phan, and Jun Sun. 2019a. Concolic Testing Heap-Manipulating Programs. 1487 In Formal Methods – The Next 30 Years, Maurice H. ter Beek, Annabelle McIver, and José N. Oliveira (Eds.). Springer 1488 International Publishing, Cham, 442–461. Long H. Pham, Quang Loc Le, Quoc-Sang Phan, Jun Sun, and Shengchao Qin. 2019b. Enhancing Symbolic Execution of 1489 Heap-Based Programs with Separation Logic for Test Input Generation. In Automated Technology for Verification and 1490 Analysis, Yu-Fang Chen, Chih-Hong Cheng, and Javier Esparza (Eds.). Springer International Publishing, Cham, 209–227. 1491 Azalea Raad, Josh Berdine, Hoang-Hai Dang, Derek Dreyer, Peter O’Hearn, and Jules Villard. 2020. Local Reasoning About 1492 the Presence of Bugs: Incorrectness Separation Logic (CAV). 1493 David A. Ramos and Dawson Engler. 2015. Under-Constrained Symbolic Execution: Correctness Checking for Real Code. In 24th USENIX Security Symposium (USENIX Security 15). USENIX Association, Washington, D.C., 49–64. https: 1494 //www.usenix.org/conference/usenixsecurity15/technical-sessions/presentation/ramos 1495 X. Rival and K. Yi. 2020. Introduction to Static Analysis. MIT Press. 1496 Caitlin Sadowski, Edward Aftandilian, Alex Eagle, Liam Miller-Cushon, and Ciera Jaspan. 2018. Lessons from Building 1497 Static Analysis Tools at Google. Commun. ACM 61, 4 (March 2018), 58–66. https://doi.org/10.1145/3188720 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 , Vol. 1, No. 1, Article . Publication date: August 2021.