Online seminar series on Verification beyond 2020
Is Static Analysis Successful?
Patrick Cousot
IMDEA Software, Madrid & NYU, New York [email protected] cs.nyu.edu/~pcousot
Tuesday, July 7th, 2020
“Is Static Analysis Successful?” – 1/37 – © P. Cousot, IMDEA Software, Madrid & NYU, New York, Tuesday, July 7th, 2020
Abstract
When I was invited for the online seminar, I first thought of a technical subject (more precisely, symbolic terms, substitutions, and systems of equations, which abstract interpretation clarifies by giving them various ground semantics).
Assume that α : C → A and γ : A → C form a Galois connection between the posets ⟨C, ⊑⟩ and ⟨A, ≼⟩, with α surjective. If ⟨C, ⊑⟩ is a complete lattice, then so is ⟨A, ≼⟩. So symbolic terms, substitutions, and equations are complete lattices ⟨A, ≼⟩, where ⟨C, ⊑⟩ is a powerset of ground terms (already well known, but proved "in abstracto" in the literature). The problem is that substitutions do not have a unique interpretation! This is the origin of great difficulties, misunderstandings, and a lot of confusion about substitutions in the literature.
But then, the conversation Andreas and I had on the mainstream success of deductive methods, while static analysis stays in the shade, led me to another idea: discuss whether static analysis is successful, or not, what the conditions are to make it mainstream, and, why not, immensely popular?
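The statement above can be typeset cleanly. A sketch in standard notation, assuming the usual convention that α is the lower (left) adjoint and γ the upper (right) adjoint:

```latex
% Galois connection between posets (alpha lower adjoint, gamma upper adjoint):
\[
  \forall c \in C.\ \forall a \in A.\quad
  \alpha(c) \preceq a \;\Longleftrightarrow\; c \sqsubseteq \gamma(a)
\]
% If alpha is surjective and C is a complete lattice, then A is one too:
% least upper bounds in A can be computed through C by
\[
  \bigvee\nolimits_{A} S \;=\;
  \alpha\Bigl(\bigsqcup\nolimits_{C}\,\{\gamma(a) \mid a \in S\}\Bigr)
  \qquad (S \subseteq A).
\]
```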
The successes of deductive methods
Deductive methods
• Theorem provers
• Proof assistants
• SMT solvers
Theorem provers
• First-order theorem proving started with Jacques Herbrand (who invented unification in 1930), John Robinson (who invented resolution in 1965), …;
• Great academic tools: ACL2, the B prover of Atelier B, iProver, Prover9, PVS, etc.;
• Industrial successes, e.g.:
• the B method, used to prove the security software of the new driverless Paris metro line 14 (not the control/command software);
• verification of all elementary floating-point arithmetic on the AMD Athlon by ACL21;
• Not simple, slow, requires specialists, proofs must be redone after each program modification, etc.;
• Punctual successes, not easily replicated (e.g. B was not used for the renovation of the existing line A of the Paris RER).
1 J. Strother Moore, Marijn J. H. Heule: Industrial Use of ACL2: Applications, Achievements, Challenges, and Directions. ARCADE@CADE 2017: 42-45

Theorem provers
• Full automation is hopeless2;
• Real successes mostly came from reducing the initial ambitions:
• Restrict automation: Proof assistants;
• Prove less: SMT solvers.
2 Henry Gordon Rice, Classes of Recursively Enumerable Sets and Their Decision Problems, Trans. Amer. Math. Soc., 74:1, 1953, 358–366

Proof assistants
• Interact with or program the proof;
• Great academic tools: Isabelle3, Coq4, etc.;
• Great successes, e.g. with Coq:
• the four color theorem and the Feit–Thompson theorem by Georges Gonthier,
• CompCert by Xavier Leroy;
• Industrialization (of software proved with Coq):
• CompCert sold by AbsInt (used by Airbus France);
3 first-order logic (FOL), higher-order logic (HOL), or Zermelo–Fraenkel set theory (ZFC)
4 calculus of constructions

Proof assistants
• Despite proving already known theorems, proof assistants are not used
• by mathematicians (who favor creation over verification),
• by production engineers;
• Real compilers (LLVM, GCC, etc.) are 10 to 50 times larger than CompCert; the proofs become inhuman.
• The hope is more for small, complex algorithms (e.g. EasyCrypt5 in cryptography).
5 Gilles Barthe, François Dupressoir, Benjamin Grégoire, César Kunz, Benedikt Schmidt, Pierre-Yves Strub: EasyCrypt: A Tutorial. FOSAD 2013: 146-166

SMT solvers
• Restrict what you can prove;
• Fixed reduced product of abstract domains6;
• An unexpected formalization by abstract interpretation7;
• Great academic tools: CVC3, CVC4, Z3, and many others;
• Some industrial successes, e.g.:
• in R&D on very specific subjects (most often small but complex algorithms)8,
• used by AdaCore to check SPARK contracts;
• Rather unstable over time in competitions;
• At the limit, one SMT solver per application9.
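The "prove less" idea can be caricatured in a few lines: restricting to a decidable fragment makes validity checking mechanical. A minimal sketch (not a real SMT solver; the helper `valid` and the finite variable range are illustrative assumptions of mine):

```python
# "Prove less": over a finite range of integer values, validity of a
# quantifier-free formula is decidable by brute-force enumeration.
# Real SMT solvers handle infinite theories with far cleverer algorithms.
from itertools import product

def valid(formula, variables, lo=-8, hi=8):
    """True iff `formula` holds for every assignment of `variables`
    drawn from the finite range [lo, hi]."""
    return all(
        formula(dict(zip(variables, values)))
        for values in product(range(lo, hi + 1), repeat=len(variables))
    )

# x >= 0 implies x + 1 > 0 : valid on the range
print(valid(lambda v: (not (v["x"] >= 0)) or v["x"] + 1 > 0, ["x"]))  # True
# x > 0 implies x > 1 : refuted by x = 1
print(valid(lambda v: (not (v["x"] > 0)) or v["x"] > 1, ["x"]))       # False
```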
6 Patrick Cousot, Radhia Cousot, Laurent Mauborgne: Theories, solvers and static analysis by abstract interpretation. J. ACM 59(6): 31:1-31:56 (2012)
7 Vijay D'Silva, Leopold Haller, Daniel Kroening: Abstract satisfaction. POPL 2014: 139-150
8 e.g. Z3 and SMT in Industrial R&D, Nikolaj Bjørner, 2018
9 could be done by incorporating the adequate abstract domains in the product; presently, customization is done by hand

Is static analysis successful?
Static analysis is hard to understand
Contrary to deductive methods (which check the work of mathematicians), static analysis (which checks the work of programmers) is more difficult to perceive
• by the general public;
• by programmers (who learn to prove theorems at school, but not to prove, and even less to analyze, a program with a tool10).
10 Compilers are the only universally used static analyzers.

Static analysis is harder than verification (by deductive methods)

Program Analysis Is Harder Than Verification: A Computability Perspective
Patrick Cousot (New York University), Roberto Giacobazzi (University of Verona and IMDEA Software Institute), Francesco Ranzato (University of Padova)

Abstract. We study from a computability perspective static program analysis, namely detecting sound program assertions, and verification, namely sound checking of program assertions. We first design a general computability model for domains of program assertions and corresponding program analysers and verifiers. Next, we formalize and prove an instantiation of Rice's theorem for static program analysis and verification. Then, within this general model, we provide and show a precise statement of the popular belief that program analysis is a harder problem than program verification: we prove that for finite domains of program assertions, program analysis and verification are equivalent problems, while for infinite domains, program analysis is strictly harder than verification.

[The slide shows the first page of this paper: H. Chockler, G. Weissenbacher (Eds.): CAV 2018, LNCS 10982, pp. 75-95, 2018.]

Two additional difficulties for static analysis
• induction (no inductive invariants are given to static analysers);
• no universal interface (to the many programming languages, libraries, and systems).
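The "induction" difficulty is what abstract interpretation automates: the analyser synthesises an inductive invariant by iterating an abstract transformer with widening, instead of asking the user for one. A minimal interval-domain sketch (the helper names and the example loop `x = 0; while (x < 100) x = x + 1` are my own illustration, not a real analyser):

```python
# Interval analysis of:  x = 0; while (x < 100): x = x + 1
# The loop-head invariant is computed, not supplied by the user.
INF = float("inf")

def widen(a, b):
    """Classic interval widening: any unstable bound jumps to infinity,
    which forces termination of the fixpoint iteration."""
    lo = a[0] if a[0] <= b[0] else -INF
    hi = a[1] if a[1] >= b[1] else INF
    return (lo, hi)

def analyse():
    inv = (0, 0)  # abstract value of x at the loop head, from x = 0
    while True:
        # abstract effect of one iteration: filter x < 100, then x += 1
        body = (inv[0] + 1, min(inv[1], 99) + 1)
        join = (min(inv[0], body[0]), max(inv[1], body[1]))
        new = widen(inv, join)
        if new == inv:
            return inv  # fixpoint reached: an inductive invariant
        inv = new

print(analyse())  # (0, inf)
```

The result (0, inf) is a sound overapproximation of the true loop-head invariant 0 ≤ x ≤ 100; widening lost the upper bound, which a subsequent narrowing pass could recover.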
The context of static analysis is quite diverse
• Academic and industrial tools;
• Academic and industrial users;
• Explicit versus implicit specifications;
• Small models to large programs;
• Sound and unsound tools;
• Bug finding versus verification;
• Terminology is confusing (static analysis versus software checking);
• Many different levels of ambition (from compilers, linters, unsound commercial tools, semantic-based, sound, and precise tools, to verifiers);
→ hard for non-specialists to have a clear understanding of the field;
→ a lot of space for unscrupulous charlatans.

[The slide shows the first page of Steven P. Miller, Michael W. Whalen, Darren D. Cofer: Software Model Checking Takes Off. Commun. ACM 53(2): 58-64 (2010).]

Keys to static analysis successes
Specifications are nonexistent
• Formal semantics of languages;
• Requirements for programs;
• Few exceptions: e.g. MISRA C for automobiles, DO-178C for avionics.
11 Static analysis of blueprints is certainly also useful!

[The slide shows the first page of Leslie Lamport's Viewpoint: Who Builds a House without Drawing Blueprints?11 Commun. ACM 58(4): 38-41 (2015).]

Program certification is not mandatory
• No regulation on software quality;
• A facility rather than an obligation;
• An obligation of means rather than of results;
• Software engineering is empiricism, not science;
Benchmarks are biased
• Academic benchmarks do not reflect industrial needs; • Industrial benchmarks are not publicly available.
Independent evaluations are rare
• Few independent comparative evaluations;
→ choosing the appropriate static analysis tool is very difficult for engineers and managers.
Example 1 of benchmarks

[The slide shows excerpts from Lukas Westhofen, Philipp Berger, Joost-Pieter Katoen: Benchmarking Software Model Checkers on Automotive Code. arXiv:2003.11689, 2020. The paper evaluates open-source software model checkers (2LS, CBMC, CPAChecker, DepthK, ESBMC, PeSCo, SMACK, Symbiotic, UltimateAutomizer, UltimateKojak, UltimateTaipan) on auto-generated embedded C code from two Ford Motor Company Simulink models (the Driveline State Request and E-Clutch Control features), with requirements formalized by hand. The study reveals large discrepancies in coverage: most open-source verifiers cover at most 20% of the requirements, a simple hand-crafted extension of CBMC with k-induction delivers results on 63%, while the proprietary BTC EmbeddedValidator covers 80% and obtains results for most of the remaining requirements.]
SATE VI Ockham Sound Analysis Criteria
NISTIR 8304 Paul E. Black Kanwardeep Singh Walia ______
Example 2 of benchmarks

SATE VI Ockham Sound Analysis Criteria
NISTIR 8304
Paul E. Black
Kanwardeep Singh Walia
This publication is available free of charge from: https://doi.org/10.6028/NIST.IR.8304

Abstract
Static analyzers examine the source or executable code of programs to find problems. Many static analyzers use heuristics or approximations to examine programs with millions of lines of code for hundreds of classes of problems. The Ockham Sound Analysis Criteria recognizes static analyzers that are precise. In brief the criteria are (1) the analyzer's findings are claimed to always be correct, (2) it produces findings for most of a program, and (3) even one incorrect finding disqualifies an analyzer. This document begins by explaining the background and requirements of the Ockham Criteria and how we determine if a tool satisfies it. As part of Static Analysis Tool Exposition (SATE) VI, we examined two tools: Astrée and Frama-C with the Evolved Value Analysis (Eva) plug-in. Examining the tool outputs led us to find several systematic mistakes in the Juliet 1.3 test suite and thousands of mistakes in its manifest of known errors. Our conclusion is that Astrée and Frama-C with Eva satisfied the SATE VI Ockham Sound Analysis Criteria.

Key words: Ockham Criteria; SATE VI; software assurance; software measurement; software testing; sound analysis; static analysis.

We still had significant difficulty assigning tool warnings to these new classes. Part of the problem was understanding exactly what the tool warning covered or what it did not cover. Part of the problem was that our classes made distinctions that tools did not and vice versa. Our own classes were easier to use than CWEs, but did not come close to being reasonable universal classes.

2. The Criteria
This section has the details of the Ockham Criteria itself, and includes explanation and discussion. Much of this section comes from Sec. 2 of the SATE V Ockham report [9]. The Criteria is named for William of Ockham, best known for Occam's Razor. Since the details of the Criteria will likely change in the future, the name includes a time reference: SATE VI Ockham Sound Analysis Criteria.

The value of a sound analyzer is that every one of its findings can be assumed to be correct. We tried to write criteria that communicated our intent, ruled out trivial satisfaction, and were understandable. We planned to be liberal in interpreting the rules: we anticipated that tools satisfy the Criteria, so we occasionally assumed proper operation of the tool in cases requiring human judgment.

The criteria were:
1. The tool is claimed to be sound.
2. For at least one weakness class and one test case the tool produces findings for a minimum of 75 % of appropriate sites.
3. Even one incorrect finding disqualifies a tool for this SATE.
An implicit criterion is that the tool is useful, not merely a toy.

A finding is a definitive report about a site, which is a specific place in code. In other words, the tool reports that the site has a specific weakness (is buggy) or that the site does not have that weakness.

No manual editing of the tool output was allowed. No automated filtering specialized to a test case or to SATE VI was allowed, either. The tool's settings and options may be selected to produce the best result, as alluded to in Sec. 1.1. Such setting should be reported.

2.1 Criterion 1: "Sound" (and "Complete") Analysis
Criterion 1 is "The tool is claimed to be sound." We use the term sound to mean that every finding is correct. In other words, "Sound analysis means that the [tool] never asserts a property to be true when it is not true." [16, FM.1.6.2]. The tool need not produce a finding for every site; that is completeness. A tool may have settings that allow unsound analysis. The tool still qualifies if it has clearly sound settings. For example, it is acceptable for the user to be able to select unsound […]

Astrée produces 36316 alarms including all 18954 known bugs in the test suite (plus a number unknown but real). Note: Astrée is a general purpose static analyzer, not specialized to the Juliet 1.3 (https://samate.nist.gov/SRD/testsuite.php) test suite.
Frama-C with Eva produces 42056 alarms including all 18954 known bugs in the test suite.
The analyzers that fail the test are not mentioned in the politically-correct report.
......
"Is Static Analysis Successful?" – 25/37 – © P. Cousot, IMDEA Software, Madrid & NYU, New York, Tuesday, July 7th, 2020
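The three Ockham criteria are concrete enough to phrase as a mechanical check over a tool's output. The sketch below is a hypothetical illustration, not code from the SATE VI evaluation: the `Finding` record, the sample data, and the way the 75 % threshold is applied to a single weakness class and test case are all simplifying assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Finding:
    site: str        # a specific place in code, e.g. "file.c:42"
    buggy: bool      # the tool's definitive verdict for this site
    correct: bool    # ground truth from the manifest of known errors

def satisfies_ockham(claimed_sound: bool, findings: list,
                     appropriate_sites: int) -> bool:
    """Check the three SATE VI Ockham criteria for one weakness class
    and one test case (a simplified, hypothetical reading)."""
    # Criterion 1: the tool is claimed to be sound.
    if not claimed_sound:
        return False
    # Criterion 3: even one incorrect finding disqualifies the tool.
    if any(not f.correct for f in findings):
        return False
    # Criterion 2: findings for at least 75 % of appropriate sites.
    return len(findings) >= 0.75 * appropriate_sites

# Toy example: 8 correct findings over 10 appropriate sites -> passes.
demo = [Finding(f"t.c:{i}", buggy=(i % 2 == 0), correct=True)
        for i in range(8)]
print(satisfies_ockham(True, demo, appropriate_sites=10))  # True
```

Note how criterion 3 dominates: a single finding with `correct=False` fails the tool regardless of coverage, which is exactly why only sound-by-construction analyzers such as Astrée and Frama-C/Eva can pass.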
Of course not everybody agrees with the significance of the test suite:

WHITE PAPER
Coverity: Risk Mitigation for DO-178C
Gordon M. Uchenick, Lead Aerospace/Defense Sales Engineer

Putting it all together
Picking the right analysis tool is important because it will have a significant effect on development efficiency and the DO-178C certification schedule. In this paper we've discussed how using Coverity static analysis during code development and before reverse-engineering certification artifacts from the code has proven to increase productivity while simultaneously reducing budget and schedule risk. But it is typical for procurement policies to require the consideration of multiple suppliers for software tools. Synopsys welcomes competition and, in that spirit, provides the following vendor-neutral guidelines for establishing your selection criteria:

Do's
• Install, or have the vendor install, the candidate tool for a test run in your environment. (Synopsys does this regularly with Coverity, for free.)
  – Verify that the tool works in your development environment.
  – Verify that it interfaces with your software repository and defect tracking systems.
  – Verify that it is compatible with your software build procedures and other development tools, such as compilers, integrated development environments (IDEs), and so on.
  – Verify that managing and updating the tool will not impose an unacceptable workload on IT staff.
• Run the tool over your existing code.
  – Determine whether the defects reported are meaningful or insignificant. Allocate some time for your subject matter experts to perform this task, because a proper assessment requires a systemwide perspective.
  – Determine whether the tool presents defects in a manner useful to developers. There should be more information than "Problem type X in line Y of source file Z." The tool should disclose the reasoning behind each finding, because very often the fix for a defect found in a line of code is to change lines of code in the control flow preceding that defect.
• Verify that the tool is practical.
  – Verify that it runs fast enough to be invoked in every periodic build.
  – Determine whether you can run it only over modified code relative to the baseline while still retaining context or whether it is fast enough to analyze all the code all the time.
  – Determine whether you can implement and enforce a clean-before-review or clean-before-commit policy.
• Determine the false-positive (FP) rate.
  – Focus on your own code. Do not accept an FP rate based on generic code or an unsubstantiated vendor claim.
  – Decide an acceptable FP rate for your process. An unacceptable FP rate wastes resources and erodes developer confidence in the tool itself. A very significant finding can be obscured by meaningless noise.
• Investigate training, startup, and support options.
  – Inquire about the vendor's capability to provide on-site training relevant to your SDLC.
  – Verify that they can provide services to help you get started with the tool quickly and smoothly.
  – Verify that their support hours correspond to your workday.
  – Verify that they have field engineering staff if on-site support becomes necessary.

Don'ts
• Don't evaluate tools by comparing lists of vendor claims about the kinds of defects that their tools can find, and don't let a vendor push the evaluation in that direction. Comparing lists of claims regarding defect types isn't meaningful and leads to false equivalencies. Capabilities with the same name from different vendors won't have the same breadth, depth, or accuracy.
• Don't waste time purposely writing defective code to be used as the evaluation target. Purposely written bad code can contain only the kinds of defects that you already know of. The value of a static analysis tool is to find the kinds of defects that you don't already know of.
• Don't overestimate the limited value of standard test suites such as Juliet.†† These suites often exercise language features that are not appropriate for safety-critical code. Historically, the overlap between findings of different tools that were run over the same Juliet test suite has been surprisingly small.

†† https://samate.nist.gov/SRD/testsuite.php
......
"Is Static Analysis Successful?" – 26/37 – © P. Cousot, IMDEA Software, Madrid & NYU, New York, Tuesday, July 7th, 2020
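The white paper's advice to measure the false-positive rate on your own code can be made concrete with a small triage calculation. The sketch below is an illustrative assumption, not part of the white paper: the "TP"/"FP" labels and the 30 % acceptability threshold are invented for the example.

```python
def false_positive_rate(triaged):
    """Fraction of reported findings that triage labeled false positives.

    `triaged` holds one label per finding: "TP" for a real defect,
    "FP" for a spurious report. Returns 0.0 for an empty triage.
    """
    if not triaged:
        return 0.0
    return triaged.count("FP") / len(triaged)

# Toy triage of 8 findings from a run over your own code.
labels = ["TP", "TP", "FP", "TP", "FP", "TP", "TP", "TP"]
rate = false_positive_rate(labels)
print(f"FP rate: {rate:.0%}")   # 2 FPs out of 8 findings -> 25%
print(rate <= 0.30)             # hypothetical acceptability threshold
```

The point of measuring on your own code, rather than on generic code or a benchmark, is that the rate depends heavily on the idioms your codebase actually uses.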