Natural Proofs for Structure, Data, and Separation

XiaokangQiu PranavGarg AndreiStefanescu P.Madhusudan University of Illinois at Urbana-Champaign, USA {qiu2, garg11, stefane1, madhu}@illinois.edu

Abstract fields), data (the integers stored in data-fields of the tree respect the We propose natural proofs for reasoning with programs that ma- binary search tree property, or the data stored in a tree is a max- nipulate data-structures against complex specifications— specifica- heap), and separation (the procedure modifies one list and not the tions that describe the structure of the heap, the data stored within other and leaves the two lists disjoint at exit, etc.). The fact that the it, and separation and framing of sub-structures. Natural proofs are dynamic heap contains an unbounded number of locations means a subclass of proofs that are amenable to completely automated that expressing the above properties requires quantification in some reasoning, that provide sound but incomplete procedures, and that form, which immediately precludes the use of most SMT decidable capture common reasoning tactics in program verification. We de- theories (there are only a few of them known that can handle quan- trand velop a dialect of separation logic over heaps, called Dryad, with tification; e.g., the array property fragment [12] and the S recursive definitions that avoids explicit quantification. We develop logic [24, 25]). Consequently, expressing such properties naturally ways to reason with heaplets using classical logic over the theory and succinctly in a logical formalism has been challenging, and of sets, and develop natural proofs for reasoning using proof tactics reasoning with them automatically even more so. involving disciplined unfoldings and formula abstractions. Natural For instance, in the Boogie line of tools (including VCC) of proofs are encoded into decidable theories of first-order logic so as writing specifications using first-order logic and employing SMT to be discharged using SMT solvers. solvers to validate verification conditions, the specification of in- variants of even simple methods like singly-linked-list insert is te- We also implement the technique and show that a large class of 1 more than 100 correct programs that manipulate data-structures are dious. In such code , second-order properties (reachability, acyclic- amenable to full functional correctness using the proposed natural ity, separation, etc.) are smuggled in using carefully chosen ghost proof method. These programs are drawn from a variety of sources variables; for example, acyclicity of a list is encoded by assign- including standard data-structures, the Schorr-Waite algorithm for ing a ghost number (idx) to each node in the list, with the property garbage collection, a large number of low-level routines from the that the numbers associated with adjacent nodes strictly increase Glib library, the OpenBSD library and the Linux kernel, and rou- going down the list. These ghost variables require careful manip- tines from a secure verified OS-browser project. Our work is the ulation when the structures are updated; for example, inserting a first that we know of that can handle such a wide range of full func- node may require updating the ghost numbers for other nodes in tional verification properties of heaps automatically, given pre/post the list, in order to maintain the acyclicity property. Once such a and loop invariant annotations. We believe that this work paves the ghost-encoding of the specification is formulated, the validation of way for the deductive verification technology to be used by pro- verification conditions, which typically have quantifiers, are dealt grammers who do not (and need not) understand the internals of with using sound heuristics (a wide variety of them including e- the underlying logic solvers, significantly increasing their applica- matching, model-based quantifier instantiation, etc. are available), bility in building reliable systems. but are still often not enough and have to be augmented by instan- tiation triggers from the verification engineer to help the proof go through. 1. Introduction In recent years, separation logic, especially in combination In recent years, the automated deductive verification paradigm for with recursive definitions, has emerged as a much more succinct software verification that combines user written modular contracts and natural logic to express properties about structure and separa- and loop invariants with automated theorem proving of the result- tion [31, 35]. However, the validation of verification conditions re- ing verification conditions has become very powerful. The latter sulting from separation logic invariants are also complex, and has is often executed by automated logical decision procedures eluded automatic reasoning and exploitation of SMT solvers (even supported by SMT solvers, which have emerged as robust and pow- more so than tools such as Boogie that use classical logic). Again, erful engines to automatically find proofs. Several techniques and help from the user in proving the verification conditions are cur- erifast edrock tools have been developed [2, 16, 20] and there have been several rently necessary— the tools V [20] and B [15], for success stories of large software verification projects using this ap- instance, admit separation logic specifications but require the user to write low-level lemmas and proof tactics to guide the verifica- proach (the OS project[41], the hypervisor veri- 2 fication project using VCC [16], and a recent verified-for-security tion. For example, in verifying an in-place reversal of a linked list , edrock OS+browser for mobile applications [28] to name a few). B would require several lemmas and a hint package be sup- Verification conditions do not, however, always fall into decid- plied at the level of the code in order for the proof to go through. able theories. In particular, the verification of properties of the dy- The work in this paper is motivated by the opinion that en- namically modified heap is a big challenge for logical methods. The tirely decidable logics are too restrictive, in general, to support dynamically manipulated heap poses several challenges, as typical correctness properties of heaps require complex combinations of 1 http://vcc.codeplex.com/SourceControl/changeset/view/dcaa4d0ee8c2#vcc structure (e.g., p points to a tree structure, or to a doubly-linked /Docs/Tutorial/c/7.2.list.c list, or to an almost balanced tree, with respect to certain pointer- 2 http://plv.csail.mit.edu/bedrock/Tutorial.html

1 2012/11/14 the verification of complex specifications of functional correctness to express structural restrictions; for example. tree-ness can be ex- for heap manipulating programs, and the other extreme of user- pressed simply as: supplied proof tactics and lemmas is too tedious, requiring of the user too much knowledge of the underlying proof systems/decision tree(x)::(x = nil ∧ emp) ∨ (x 7−→ (l, ) ∗ tree(l) ∗ tree(r)) procedures. Our aim is to build completely automatic, sound, but We first define a new logic, Dryad, that permits no explicit incomplete proof techniques that can solve a large class of proper- quantification, but permits powerful recursive definitions in order ties involving complex data-structures. to define integers, sets/multisets of integers, and sets of locations, The natural proof methodology: using least fixed-points. The logic Dryad furthermore has a heaplet Our proof method, called the natural proof method, is inspired by semantics and allows the spatial conjunction operator ∗. However, work on natural proofs proposed by Madhusudan et al. [26]; natural a key design feature of Dryad is that the heaplet for recursive proofs exploit a fixed set of proof tactics, keeping the expressive- formulas is essentially determined by the syntax as opposed to ness of powerful logics, retaining the automated nature of proving the semantics. In classical separation logic, a formula of the form validity, but giving up on completeness (i.e., gives up decidability, α ∗ β says that the heaplet can be partitioned into any two disjoint retaining soundness). The idea of natural proofs [26] is to identify heaplets, one satisfying α and the other β. In Dryad, the heaplet a subclass of proofs N such that (a) a large class of valid verifi- for a (complex) formula is determined and hence if there is a way cation conditions of real-world programs have a proof in N, and to partition the heaplet, there is precisely one way to do so. We (b) searching for a proof in N is decidable. In fact, we would even have found that most uses of separation logic to express properties like the search for a proof in N to be efficiently decidable, possibly can be written quite succinctly and easily using Dryad (in fact, it utilizing the automatic logic solvers (SMT solvers) that exist today. is easier to write such deterministic-heap specifications). The key Natural proofs are hence a fixed set of proof tactics whose appli- advantage is that this eliminates implicit existential quantification cation is itself expressible in a decidable logic. The natural proofs the separation operator provides. In a verification condition that developed in [26] were too restrictive, however, handling only sin- combines the pre-condition in the negative and the post-condition gle trees, with no scope for handling multiple or more complex in the positive, the classical semantics for ∗ invariably introduces data-structures and their separation (see Related Work for more de- universal quantification in the satisfiability query for the negation of tails). the verification condition, which in turn is extremely hard to handle. The aim of this paper is to provide natural proofs for general In Dryad, the semantics of a recursive definition r(x) (such as properties of structure, data, and separation. Our contributions are: tree above), requires that the heaplet be determined and defined (a) we propose Dryad, a dialect of separation logic for heaps, with as the set of all locations reachable from the node x through a set no explicit (classical) quantification but with recursive definitions, of pointer-fields f1,..., fk without passing through a set of loca- to express second-order properties; (b) show that Dryad is both tions (given by a set of location terms t1,... tn). While our logical powerful in terms of expressiveness, and that the strongest-post of mechanisms can be extended beyond this notion (but in determisitic Dryad specifications with respect to bounded code segments can ways), we have found that this mostly captures common properties be formulated in Dryad, (c) show how Dryad has been designed so required in proving data-structure manipulating programs correct. that it can be systematically converted to classical logic using the ryad theory of sets, allowing us to connect the more natural and succinct Translating D to classical logic with recursion: specifications to more verbose but classical logic, and (d) develop The second key step in our paradigm is a technique to bridge the a natural proof mechanism for classical logics with recursion and gap from separation logic to classical logic in order to utilize ef- sets that implement sound but incomplete reductions to decidable ficient decision procedures supported by SMT solvers. We show theories that can be handled by an SMT solver. that heaplet semantics and separation logic constructs of Dryad can be effectively translated to classical logic where heaplets are Dryad: A separation logic with determined heaplets modeled as sets of locations. We show that Dryad formulas can be The primary design principle behind separation logic is the deci- translated into classical logic with free set variables that capture the sion to express strict specifications— logical formulas must natu- heaplets corresponding to the strict semantics. This translation does rally refer to heaplets (subparts of the heap), and, by default, the not, of course, a decidable theory yet, as recursive definitions smallest heaplets over which the formula needs to refer to. This is are still present (the recursion-free formulas are in a decidable the- in contrast to classical logics (such as FOL) which implicitly re- ory). The carefully designed Dryad logic with determined heaplet fer to the entire heap globally. Strict specifications permit elegant semantics ensures that there is no quantification in the resulting ways to capture how a particular sub-part of the heap changes due formula in classical logic. The heaplets of recursively defined prop- to a program, implicitly leaving the rest of the heap and its prop- erties, which are defined using the set of all reachable nodes, are erties unchanged across a call to a procedure. Separation logic is translated to recursively defined sets of locations. a particular framework for strict specifications, where formulas are implicitly defined on strictly defined heaplets, and where heaplets Natural proofs for Dryad: can be combined using a spatial conjunction operator denoted by We finally develop a natural proof methodology for Dryad by ∗. The frame rule in separation logic captures the main advantage showing a natural proof mechanism for the equivalent formulas of strict specifications: if the Hoare-triple {P}C{Q} holds for some in classical logic. The basic proof tactic that we follow is not just program C, then {P ∗ R}C{Q ∗ R} also holds (with side-conditions dependent on the formula embodying the verification condition, but that the modified variables in C are disjoint from the free variables also on the precise footprint touched by the program segment being in R). verified. We unfold recursive definitions precisely across footprints, Consider, for example, expressing that the location x is the root translating them to the frontier of the footprint, and then use a form of a tree. This is a second-order property and formulations of it of formula abstraction that treats recursive formulas on frontier in classical logic using set or path quantification are quite com- nodes as uninterpreted functions. The resulting formula falls in plex and not easily amenable to automated verification. We prefer a logic over sets and integers, which is then decided using the inductive definitions of structural properties without any explicit theory of uninterpreted functions and arrays using SMT solvers. quantification. The separation logic syntax with recursive defini- The key feature is that heaplets and separation logic constructs, tions and heaplet semantics allows simple quantifier-free formulas which get translated to recursively defined sets of locations, are

2 2012/11/14 unfolded along with other user-defined recursive definition and on lists which has been further extended in [11] to include a re- formula-abstracted using this uniform natural proof strategy, stricted form of arithmetic. Symbolic execution with separation While our proof strategy is roughly as above, there are many logic has been used in [4, 5, 8] to prove structural specifications for more technical details that are complex. For example, the heaplets various list and tree programs. These tools come hardwired with a defined by pre/post conditions intrinsically specify the modified collection of axioms and their symbolic execution engines check locations of the heap, which have to be considered when processing the entailment between two formulas modulo these axioms. Veri- procedure calls in order to ensure which recursively defined metrics fast [20] on the other hand, chooses the flexibility of writing richer on locations continue to hold after a procedure call. The final specifications over complete automation, but requires the user to decidable theories that we compile our conditions down to does provide some inductive lemmas and proof tactics to aid verifica- require a bit of quantification, but it turns out to be in the array tion. Similarly, Bedrock [15] is a Coq library that aims at mostly property fragment which admits automatic decision procedures. automated (but not completely automated) procedures that requires some proof tactics to be given by the user to prove verification Implementation and Evaluation: conditions. The idea of using regions (sets of locations) for de- Our proof mechanisms are essentially a class of decidable proof scribing heaps in our work also extends to describing frames for tactics that result in sound but incomplete validation procedures. In function calls, and the use for the latter is similar to implicit dy- order to show that this class of natural proofs is effective in practice, namic frames [37] in the literature. The crucial difference in our we provide a prototype implementation of our technique, which framework is that the implicit dynamic frames are determined, and handles a low-level programming language with pre-conditions and is amenable to quantifier-free reasoning. A work that comes very post-conditions written in Dryad. We show, using a large class of close to ours is a paper by Chin et al. [14], where the authors allow correct programs manipulating lists, trees, cyclic lists, and dou- user-defined recursive predicates (similar to ours) and build a ter- bly linked lists as well as multiple data-structures of these kinds, minating procedure that reduces the verification condition to stan- that the natural proof mechanism succeeds in proving the verifi- dard logical theories. However, their procedure does not search for cations conditions automatically. These programs are drawn from a proof in a well-defined simple and decidable class, unlike our nat- a range of sources, from textbook data-structure routines (binary ural proof mechanism; in fact, the resulting formulas are quantified search trees, red-black trees, etc.) to routines from Glib low-level and incompatible with decidable logics handled by SMT solvers. C-routines used in GTK+/Gnome and OpenBSD to routines from In all of the above cited work and other manual and semi- Linux memory regions, a routine from the Schorr-Waite garbage automatic approaches to verification of heap-manipulating pro- collection algorithm, to several programs from a recent secure grams like [36], inductive definitions of algebraic data-types is ex- framework developed for mobile applications [28]. Our work is by tremely common for capturing second-order data-structure proper- far the only one that we know of that can handle such a large class ties. Most of these approaches use proof tactics which unroll induc- of programs, completely automatically. Our experience has been tive definitions and do extensive unification to try to match terms that the user-provided contracts and invariants are easily express- to find simple proofs. Our notion of natural proofs is very much ible in Dryad, and the automatic natural proof mechanisms work inspired by the such kinds of semi-automatic and manual heap rea- extremely fast. In fact, contrary to our own expectations, we also soning that we have seen in the literature. found that the tool is useful in debugging: in several cases, when There is also a variety of verification tools based on classi- the annotations supplied were incorrect, the model provided by the cal logics and SMT solvers. [23] and VCC [16] compile SMT solver for the natural proof was useful in detecting errors and to Boogie [2] and generate VCs that are passed to SMT solvers. correcting the invariants/program. This approach requires significant ghost annotations, and annota- tions that explicitly express and manipulate frames. The Jahob sys- 2. Related Work tem [42, 43] is one of the first attempts at full functional verifica- tion of linked data structures, which integrates a variety of theo- The natural proof methodology was introduced in [26] (see also [38]), rem provers, including SMT solvers, and makes the process mostly but was exclusively built for tree data-structures. In particular, this automated. However, complex specifications combining structure, work could only handle recursive programs, i.e., no while-loops, data and separation usually require more complex provers such and even for tree data-structures, imposed a large number of restric- as Mona [21], or even interactive theorem provers such as Is- tions on pre/post conditions for methods— the input to a procedure abelle [30] in the worst case. The success of the proof search also had to be only a single tree, the method can only return a single relies on users’ manual guidance. tree, and even then must havoc the input tree given to it. The lack The idea of unfolding recursive definitions and formula abstrac- of handling of multiple structures means that even simple programs tion also features in the work by Suter et al. [38, 39], where a pro- like mergesort (that merges two lists internally), cannot be handled, cedure for algebraic data-types is presented. However, this work and simple programs that manipulate two lists or two trees cannot focuses on soundness and completeness, and is not terminating for be reasoned with. Also, structures such as doubly-linked lists, trees several complex data structures like red-black trees. Moreover, the with parent pointers, etc. are out of scope of this work. Technically, work limits itself to functional program correctness; in our opinion, in our work, we can handle user-defined structures expressible in functional programs are very similar to algebraic inductive specifi- separation logic, multiple structures and their separation, programs cations, leading to much simpler proof procedures. with while-loops, etc., because of our logical treatment of separa- There is also a rich literature on completely automatic deci- tion logic using classical logic. sion procedures for restricted heap logics, some of which com- There is a rich literature on analysis of heaps in software. bine structure-logic and arbitrary data-logic. These logics are usu- We omit the discussing literature on general interactive theorem ally FOLs with restricted quantifiers, and usually are decided us- sabelle provers (like I [30]) that require considerable manual guid- ing SMT solvers. The logics Lisbq [22] and CSL [9, 10] offer such ance. We also omit a lot of work on analyzing shape properties reasoning with restricted reachability predicates and quantification; of the heap [6, 13, 18, 27, 40], as they do not handle complex see also the logics in [1, 7, 29, 32–34]. Strand is a relatively expres- functional properties. sive logic that can handle some data-structure properties (at least There are several proof systems and assistants for separation BSTs) and admits decidable fragments [24, 25], but is again not logic [31, 35] that incorporate proof heuristics and are incomplete. expressive enough for more complex properties of inductive data- However, [3] gives a small decidable fragment of separation logic

3 2012/11/14 structures. None of these logics can express the variety of VCs for • Subtraction, set-difference, and negation are disallowed; full functional verification explored in this paper. • Every variable in ~s should appear in the right hand side of a points-to relation binding it to x exactly once. 3. TheLogicDryad 3.2 Semantics We present our logic Dryadsep; this redefines the logic Dryad [26] on arbitrary data-structures (not just trees), using heaplet semantics Our logic is interpreted on models that are program states: and separation logic primitives; the logic hence is a quantifier-free Definition 3.1. A program state is a tuple C = (R, s, h) where heaplet logic augmented with recursively defined predicates/func- • tions. However, for brevity, we will refer to the new logic we pro- R ⊆ Loc \{nil} is a finite set of locations; • s : Vars → Int ∪ Loc is a store mapping program variables to pose as Dryad, and refer to the logic in [26] as Dryadtree. locations or integers (of appropriate type); 3.1 Syntax • h : R × (PF ∪ DF) → Int ∪ Loc is a heaplet mapping non- nil locations and each pointer-field/data-field to values of the Let us fix a finite set of pointer-fields PF and a finite set of data- appropriate type.  fields DF. A record consists of a set of pointer-fields from PF and a set of data-fields from DF. Our logic also presumes that locations Note that the set of locations is, in general, larger than the state refer to entire records rather than particular fields, and that address R and hence R defines a subset of heap locations. The store maps arithmetic is precluded. We will use the term locations hence to variables to locations (not necessarily in R), but the heaplet h gives refer to these records. We assume that every field is defined at every interpretations for pointer and data-fields only for elements in R. location, i.e., all memory records have the same layout (to simplify Given a heaplet h, for every pointer field pf, we denote the the presentation; our logic can easily be extended with record types. projection of h on R × (PF\{pf}∪ DF) as h ∤ pf; similarly, for every Let Bool = {true, false} stand for the set of Boolean values, data-field df, we denote the projection of h on R × (PF ∪ DF \{ f }) Int stand for the set of integers and Loc stand for the universe of as h ∤ f . Also, for every subset S ⊆ R, we denote the projection of locations. For any set A, let S(A) denote the powerset of A, and let h on S × (PF ∪ DF) as h | S . MS(A) denote the set of all finite multisets with elements in A. A term/formula with free variables F is interpreted by inter- The Dryad logic allows expressing quantifier-free first-order preting the free variables in F according to an interpretation. The properties over heaps/heaplets augmented with recursively defined semantics of Dryad is similar to that of classical Separation Logic notions for a location to express second-order properties, denoted (SL). In particular, a term/formula without recursive definitions is as a function r : Loc → D. The domain is Loc, and the codomain interpreted exactly in the same way in Dryad and SL. Hence we D can be IntL, S(Loc), S(Int), MS(Int)L or Bool, where IntL and first give the semantics of the non-recursive part, followed by the MS(Int)L extend Int and MS(Int) to lattice domains, respectively, semantics of recursive definitions. in order to give least fixed-point semantics (explained later in this Before defining the semantics of formulas, we define the pure section). Typical examples of these recursive definitions include the property for terms/formulas. Intuitively, a term/formula is pure if it height of a tree or the height of black-nodes in the tree rooted at a is independent of the heap. Syntactically, a term/formula is pure if node (recursively defined integers), the set of nodes reachable from it does not contain emp, 7−→ or any recursive definition. Note that a location following certain pointer fields (recursively defined sets in SL, all terms are pure, but in Dryad, a term can be impure if it of locations), the set/multiset of keys stored at a particular data- contains a recursive function f ∗. field under nodes reachable from a location (recursively defined set/multiset of integers), and the property that the tree rooted at Semantics for terms (non-recursive) a node is a binary search tree or a balanced tree or just a tree Each T -term evaluates to either a normal value of type T , or to (recursively defined predicates). undef, which is only used in interpreting recursive functions (will ADryad formula ϕ is quantifier-free, but parameterized by a set be explained later). As a special value, undef will be propagated of recursive definitions Def. The syntax of Dryad logic is given in throughout the formula: if an atomic formula ϕ contains a sub-term Figure 1, where the syntax of the formulas is followed by the syn- that evaluates to undef, then ϕ will evaluate to false if it appears tax for recursive definitions. Most symbols in Dryad are common positively, and will evaluate to true otherwise. Intuitively, undef and self-explanatory. Note that the inequality (< or ≤) between in- cannot help in making the formula true over a model. teger sets/multisets indicates that any integer in the left-hand side The variables and constants are evaluated as follows: is less-than/not-greater-than any integer in the right-hand side. It is ~xC = s(x) ~ jC = s( j) ~cC = c ~nilC = nil ∗ also noteworthy that the separating conjunction ( ) from separation ′ logic is also allowed, but only if it is not above any negation (¬). We For any binary operator op, t op t is evaluated as follows: / require that every recursive function predicate used in the formula ′ ′ ϕ has a unique definition in Def. Each recursive function is param- ~tC op ~t C if t or t is pure ~t op ~t′ else if there exist R , R such that −→  C|R1 C|R2 1 2 eterized by a set of pointer fields pf and a set of program variables ~t t′ =  R = R ∪ R , ~t , undef op C  1 2 C|R1 ~v, denoted as f ∗ . We usually simply use f ∗ when the subscripts  and ~t′ , undef −→  C|R2 pf ,~v  undef otherwise are not important in the context. Similarly, recursive predicates are  ∗ ∗  denoted as p −→ or simply p . The recursively functions are defined where op is interpreted in the natural way. pf ,~v For singletons, {it} will evaluate to ∅ if it evaluates to −∞ or ∞: using the syntax:

f f f f f undef if ~itC = undef ϕ1 (x,~v, ~s): t1 (x, ~s); ... ; ϕk (x,~v, ~s): tk (x, ~s) ; default : tk+1(x, ~s) ~{it}C =  ∅ if ~itC = −∞ or ∞ f f   {~itC } otherwise where ϕu (x,~v, ~s)/tu (x, ~s) is a formula/term in our logic with ~s im-   plicitly existentially quantified. The recursively defined predicates {it}m will evaluate similarly. are defined using the syntax: ϕp(x,~v, ~s), which is a formula in our logic with ~s implicitly existentially quantified. The restrictions on Semantics for formulas (non-recursive) these recursive definitions are: The semantics of formulas is defined as follows.

4 2012/11/14 ∗ ∗ ∗ ∗ ∗ i : Loc → IntL sl : Loc → S(Loc) si : Loc → S(Int) msi : Loc → MS(Int)L p : Loc → Bool j ∈ IntL Variables L ∈ S(Loc) Variables S ∈ S(Int) Variables MS ∈ MS(Int)L Variables q ∈ Bool Variables x ∈ Loc Variables c : IntL Constant pf ∈ PF df ∈ DF Loc Term: lt ::= x | nil ∗ IntL Term: it ::= c | j | i −→ (lt) | it + it | it − it pf ,~t ∗ S(Loc) Term: slt ::= ∅l | L | {lt} | sl −→ (lt) | slt ∪ slt | slt ∩ slt | slt \ slt pf ,~t ∗ S(Int) Term: sit ::= ∅s | S | {it} | si (lt) | sit ∪ sit | sit ∩ sit | sit \ sit pf~ ,~t ∗ MS(Int)L Term: msit ::= ∅m | MS |{it}m | msi (lt) | msit ∪ msit | msit ∩ msit | msit\msit pf~ ,~t −→ −→ ∗ pf ,df Positive Formula: ϕ ::= true | false | q | p −→ (lt) | emp | lt 7−→ (~lt, it~) | lt = lt | lt , lt | it ≤ it | it < it | sit ≤ sit | sit < sit | msit ≤ msit | msit < msit pf ,~t | slt ⊆ slt | slt * slt | sit ⊆ sit | sit * sit | msit ⊑ msit | msit @ msit | lt ∈ slt | lt < slt | it ∈ sit | it < sit | it ∈ msit | it < msit | ϕ ∧ ϕ | ϕ ∨ ϕ | ϕ ∗ ϕ Formula: ψ ::= ϕ | ψ ∧ ψ | ψ ∨ ψ |¬ψ

∗ def f f f f f Recursive function : f −→ (x) = ϕ (x,~v, ~s) : t (x, ~s) ; ... ; ϕ (x,~v, ~s) : t (x, ~s) ; default : t + (x, ~s) pf ,~v 1 1 k k k 1 ∗ def p  Recursive predicate : p −→ (x) = ϕ (x,~v, ~s) pf ,~v

Figure 1. Syntax of Dryad ∗ The formula true is always interpreted to be true: delineating the heap domain. Intuitively, the heap domain for f −→ (l) pf ,~v (R, s, h) |= true −→ is the set of locations reachable from l using pointer-fields in pf , but The formula emp asserts that the heap is empty: without going through the locations ~v. In other words, we want to (R, s, h) |= emp iff R = ∅ take the set of locations that lie in between l and ~v. Precisely, this −→ −→ set is determined by a location l and a program state (R, s, h). We pf ,df ~ ~ The formula lt 7−→ (lt, it) asserts that the heap contains exactly denote it as reach −→ (l, (R, s, h)). Formally it is the smallest set of −→ −→ pf ,~v one record consisting of fields pf and df , at address lt, with values locations L satisfying the following two conditions: ~lt and ~it, respectively: −→ −→ 1. l is in the set if l is not in ~v and l , nil; pf ,df ~ (R, s, h) |= lt 7−→ (lt, ~it) iff R = {~ltR,s,h } and 2. for each c in L, with c ∈ R, and for each pointer pf, if h(c, pf) is

h(~ltR,s,h , pfi) = ~ltiR,s,h for corresponding pfi and lti, not in ~v and is not nil, then h(c, pf) is also in L. h(~ltR,s,h , df ) = ~itiR,s,h for corresponding df and iti. ∗ i i Remark: For each recursive definition rec −→ , we usually simply pf ,~v Note that, as in separation logic, the above has a strict semantics— rec denote reach −→ as reach , as the subscripts are implicitly known. the heaplet must be a singleton set and cannot be a larger set. pf ,~v ′ Then given a program state C = (R, s, h) and a recursive For binary relations t ∼ t between integers, sets, and multisets, ∗ including the equality, the pure property plays an important role. functions/predicates rec , the semantics on a location l depends Remember that in SL all terms are pure. To be consistent with SL, if on whether the heap domain R is exactly the required reach set both t and t′ are pure, it is interpreted in the normal way. Otherwise, reachrec(l, (R, s, h)). If the heap domain does not match, we simply t ∼ t′ is only defined on the minimum heaplet required by t and t′, interpret it as undef or false. more concretely the union of the heaplet associated with t and t′. If the heap domain matches what is required, the semantics is ′ ′ ′ (R, s, h) |= t ∼ t iff t or t is pure and ~tC ∼ ~t C just defined in the natural way (using least fixed-point). In order ′ to give least fixed-point semantics for recursive definitions in the or t and t are impure and there exist R1, R2 s.t. R = R1 ∪ R2 logic, we extend the primitive data-types to lattice domains. Bool and ~t , undef, ~t′ , undef and ~t ∼ ~t′ C|R1 C|R2 C|R1 C|R2 with the order false ⊑ true forms a complete lattice, and S(Loc) where ∼ is interpreted in the natural way. and S(Int) ordered by inclusion, with join as union and meet as The semantics of the disjoint conjunction operator ∗ is defined intersection, form complete lattices. Integers and multisets are ex- as follows. The formula ϕ0 ∗ ϕ1 asserts that the heap can be split tended to lattices. Let (IntL, ≤) denote the complete lattice, where into two disjoint parts in which ϕ0 and ϕ1 hold respectively: IntL = Int∪{−∞, ∞}, and where the ordering is ≤, join is max, meet (R, s, h) |= ϕ0 ∗ ϕ1 iff there exist R0, R1 s.t. R0 ∩ R1 = ∅ and is min. Also, MS(Int)L, ⊑ denote the complete lattice constructed R0 ∪ R1 = R and (R0, s, h | R0) |= ϕ0 and (R1, s, h | R1) |= ϕ1 from MS(Int), where MS(Int)L = MS(Int) ∪ {⊤}, and ⊑ extends the inclusion relation with S ⊑ ⊤ for any M ∈ MS(Int). It is easy Boolean combinations are defined in the standard way: to see that (IntL, ⊑) and (MS(Int)L, ⊑) are complete lattices. −→∗ −→∗ −→∗ −−→∗ −→∗ (R, s, h) |= ψ0 ∧ ψ1 iff (R, s, h) |= ψ0 and (R, s, h) |= ψ1 Formally, let Def consists of i , sl , si ,msi , p ; each is a vector (R, s, h) |= ψ0 ∨ ψ1 iff (R, s, h) |= ψ0 or (R, s, h) |= ψ1 of recursive definitions of a certain sort. Since these definitions (R, s, h) |= ¬ψ iff (R, s, h) 6|= ψ could be mutually recursive, they together form a function vector −→ −→ −→ −−→ −→ r∗ = (i∗ ,sl∗,si∗,msi∗, p∗ ) Semantics of recursive definitions The main semantical difference between Dryad and SL is on recur- We take the cartesian product lattice of the individual lattices and take the least-fixed point of r∗ to obtain the semantics for sive definitions. We would like to deterministically delineate the ∗ ∗ each definition. Let select f (lfp(r )), for each recursive definition f , heap domain for any recursive definition, so that the minimum heap ∗ ∗ domain required by any Dryad formula can be syntactically deter- denote the selection of the coordinate for f in lfp(r ). ∗ −→ Now we can formally define the semantics of recursive defini- mined. Given a definition f −→ , its subscripts pf and ~v play a role in tions. For any C = (R, s, h), the semantics of a recursive function pf ,~v

5 2012/11/14 ∗ f is defined as: Construct Domain-Exact Scope ∗ f ∗ select f lfp(r ) (~ltC ) if R = reach (~ltC , C) var/const false ∅ ~ f (lt)R,s,h = ( undef  otherwise {t}/{t}m dom-ext(t) scope(t) t op t′ dom-ext(t) ∨ dom-ext(t′) scope(t) ∪ scope(t′) and the semantics of a recursive predicate p∗ is defined as f ∗(lt) true reach f (lt) ∗ p ∗ select p lfp(r ) (~ltC ) if R = reach (~ltC , C) true/false false ∅ ~p (lt)R,s,h = ( false  otherwise emp true ∅ pf~ ,df~ Remark: Note that we disallow negative operations (subtraction, lt 7−→ (~lt, it~) true {lt} set-difference and negation) in defining recursive definitions. This p∗(lt) true reachp(lt) syntactical restriction guarantees that each iteration of r∗ is - t ∼ t′ dom-ext(t) ∨ dom-ext(t′) scope(t) ∪ scope(t′) tonic. By Knaster-Tarski theorem, r∗ admits a least fixed-point. ϕ ∧ ϕ′ dom-ext(ϕ) ∨ dom-ext(ϕ′) scope(ϕ) ∪ scope(ϕ′) ϕ ∗ ϕ′ dom-ext(ϕ) ∧ dom-ext(ϕ′) scope(ϕ) ∪ scope(ϕ′) Examples To clarify the difference between Dryad and SL, consider now this Figure 2. Domain-Exact property and Scope function. Both are recursive definition on locations of a binary search tree: defined only for terms and formulas without disjunction and nega-

∗ def d,l,r tion. A formula is assumed in its disjunctive normal form. p{l,r}(x) = x = nil ∨ (x 7−→ xd, xl, xr) ∗ h ∗ ∗ ∗ ∧ ∗ ∗ true ∨ ≤ ∧ ∗ ∗ true For example, consider the formula tree (x) tree (y), since the 0 < xd (p (xl) ) xd 0 (p (xr) ) heaplets for tree∗(x) and tree∗(y) are precise, it can get translated  i When x is nil, p∗(x) is true onanyheaplet in SL; when x is not to an equivalent formula with a free set variable G that denotes the nil, the predicate is recursively defined on its left child xl or right global heap over which the formula is evaluated: child xr, depending on whether the data field xd is positive or not. tree(x) ∧ tree(y) ∧ (reach{l,r}(x) ∩ reach{l,r}(y) = ∅) Since the base case is not deterministic, the heap domain for the ∧ (reach{l,r}(x) ∪ reach{l,r}(y) = G) recursive case is also not exactly defined. The heap domain could be arbitrarily large, as long as it contains the set of nodes constitut- where tree and reach{l,r} are corresponding recursive definitions in classical logic, which will be defined later in this section. Note that ing the path from root to leaf, determined by the data xd stored in ∗ the current node (move left if positive, move right otherwise). Such we use italics and remove the superscript to show the difference non-determined heaplets are inherent in SL (even if the base case is from their counterpart in Dryad. determined, a disjunction in the recursive definition can cause un- We assume the Dryad formula to be translated in the disjunctive determined heaplets). However, in Dryad, the heap domain for p∗ normal form, i.e., ∨ operators should be above all ∗ and ∧ operators. is precisely determined; in the above example, it is precisely the en- This is not a real restriction as one can always push the disjunction tire tree starting from the node x. Consequently, in our semantics, out. This normal form ensures that for all occurrences of the separa- the heap domain for p∗ is syntactically determined to be a fixed tion operator in a formula, there exists a unique way of splitting the set, based only on its signature, making Dryad amenable to being heap so as to satisfy the ∗ separated sub-formulas. Also, it ensures translated to quantifier-free classical logic. that this unique heap-split can be determined syntactically from the Dryad can express structures beyond trees. The main restriction structure of those sub-formulas. we do impose is that we allow only unary recursive definitions, as In our translation, we model the heaplets associated with a for- this allows us to find simpler natural proofs since there is only one mula or a term as a set of locations and all operations on these way to unfold the definition across a footprint. However, Dryad can heaplets are modeled as set operations like set union, set intersec- express structures like cyclic lists and doubly-linked lists. tions, etc. over set-of-location variables. For example the separating ∗ conjunction P∗Q is translated to the following set constraint: the in- A cyclic-list is captured as (h 7→ y) ∗ lsegh(y). Here, h is a program variable which denotes the head of the cyclic-list and tersection of the sets associated with the heaplets in the formulas P ∗ lsegh(y) captures the list segment from y back to the head h. and Q is empty. Given a formula ψ in Dryad and its associated heap def domain modeled by a set variable G, we define an inductive trans- ∗ = = ∧ emp ∨ 7→ ∗ ∗ lsegh(y) (y h ) (y z) lsegh(z) lation T into a classical logic formula T(ψ, G) in the quantifier-  free theory of finite sets, integers and uninterpreted functions. The Another interesting example is a doubly-linked list. We define a translated formula is not interpreted on a heaplet, but interpreted on doubly-linked list as the following unary predicate: a global heap (i.e., with the heap domain Loc). The translation uses next dll∗(x) = (x = nil ∧ emp) ∨ (x 7→ nil) ∨ an auxiliary domain-exact property and an auxiliary scope function. next prev next ∗ The domain-exact property indicates whether a term or a positive (x 7→ y ∗ y 7→ x ∗ true) ∧ (x 7→ y ∗ dll (y)) formula is evaluated on a fixed heap domain or not. This is different  The first two disjuncts in the definition cover the base case when from the property pure; a pure formula or term is not domain-exact x is nil or the location next to x is nil; otherwise, let y be the location but the reverse implication is not true, in general. For example, the next to x, then the prev pointer at y points to x and location y is formula (lt 7−→ it) ∗ true is not domain-exact but is also not pure. recursively defined as a doubly-linked list. The scope function maps a term to the minimum heap domain re- quired to evaluate it to a normal value, and maps a positive formula to the minimum heap domain required to evaluate it to true. The 4. Translation to a logic over the global heap domain-exact property and the scope function are defined induc- We now show one of the main contributions of this paper— a trans- tively in Figure 2. lation from Dryad logic to classical logic with recursive predicates We describe the logic translation in detail in Figure 3. The ITE and functions, but over the global heap. The formulation of separa- expression used in the translation is short for ”if-then-else”. It is just tion logic primitives in the global heap allows us to express com- a conditional expression defined as follows: ITE(φ, t1, t2) evaluates plex structural properties, like disjointness of heaplets and tree- to t1 if φ is true, otherwise evaluates to t2. ness, using recursive definitions over sets of locations, which are In general, our translation restricts an impure term/formula to defined locally, and are amenable to unfolding across the footprint be evaluated only on the syntactically determined heap domain and hence amenable to natural proofs. according to the semantics of Dryad. In particular, a recursive

6 2012/11/14 T(var / const, G) ≡ var / const P :− P ; P | stmt − = | = nil | = | = T({t} / {t}m, G) ≡ {t} / {t}m stmt : u : v u : u : v.pf u.pf : v T(t op t′, G) ≡ T(t, G) op T(t′, G) | j := u.df | u.df := j | j := aexpr T( f ∗(lt), G) ≡ ITE reach f (lt) = G, f (lt), undef | u := new | free u | assume bexpr | u := f (~v,~z) | j := g(~v,~z) T(true / false, G) ≡ true / false  T(emp, G) ≡ G = ∅ aexpr :− int | j | aexpr + aexpr | aexpr − aexpr − = | = nil | ≤ pf~ ,df~ bexpr : u v u aexpr aexpr T lt 7−→ ~lt, it~ , G ≡ G = {lt} ∧ pf T lt, G = T lt , G | ¬bexpr | bexpr ∨ bexpr ( ( ) ) pfi i ( ) ( i ) V∧ df dfi T(lt,G) = T(iti,G) ∗ i p T(p (lt), G) ≡ p(lt) ∧ G = reachV (lt)  Figure 4. Syntax of program t ∼ t′ if t ∼ t′ is not domain-exact T(t ∼ t′, G) ≡ short for  ′ ′  t ∼ t ∧ G = scope(t ∼ t ) otherwise f f f f ′  ′ ITE scope(t (x, ~s)) ⊆ reach (x), T t (x, ~s), scope(t (x, ~s)) , undef T(ψ ∧ ψ , G) ≡ T(ψ, G) ∧ T(ψ , G) i i i ′  ′   T(ψ ∨ ψ , G) ≡ T(ψ, G) ∨ T(ψ , G) We denote the righthand side as def f (x,~t,~v). Now for each set of T(¬ψ, G) ≡ ¬T(ψ,G) recursive definitions Def in Dryad, we can translate it to a set of −∗ T ψ, scope(ψ) ∧ T ψ′, scope(ψ) recursive definitions Def in classical logic. ∧ scope(ψ) ∪ scope(ψ′) = G    Theorem 4.1. Let ψ be a Dryad formula w.r.t. a set of recursive ∧ scope(ψ) ∩ scope(ψ′) = ∅  definitions Def. For every program state C with heap domain Loc,  if both ψ and ψ′ are domain-exact  and for every interpretation of variables I including a set-variable   T ψ, scope(ψ) ∧ T ψ′, G \ scope(ψ) −∗  G, (C, I) |= T(ψ, G) w.r.t. Def if and only if (C |G, I\{G}) |= ψ  ∧ scope(ψ) ⊆ G if only ψ is domain-exact  ′    w.r.t. Def. T(ψ ∗ ψ , G) ≡   T ψ′, ψ′ ∧ T ψ, G \ ψ′  scope( ) scope( )  ∧ scope(ψ′) ⊆ G if only ψ′ is domain-exact    5. Natural Proofs for Dryad   T ψ, scope(ψ) ∧ T ψ′, scope(ψ) ryad  In this section we show how D can be used in reasoning about  ∧ scope(ψ) ∪ scope(ψ′) ⊆ G the correctness of imperative heap-manipulating programs, in terms   ′   ∧ scope(ψ) ∩ scope(ψ ) = ∅  ′ of verifying Hoare-triples where the pre- and post-conditions are  if neither ψ nor ψ is domain-exact expressed in Dryad. We first introduce a simple programming lan-   guage and the corresponding Hoare-triples, generate a classical Figure 3. Translating Dryad terms/formulas to classical logic logic formula as the verification condition, utilizing in part the

∗ translation defined in Section 4. Then we present the natural proof definition rec (lt) is evaluated only on the heaplet consisting of the framework which consists of two steps. In the first step, we utilize reach set reachrec(lt). For a formula ψ ∗ ψ′, translation to classical ′ the idea of unfolding across the footprint to strengthen the verifica- logic depends on whether the sub-formulas ψ and ψ are domain- tion condition. In the second step, we prove the validity of the VC exact or not. If a sub-formula is domain-exact then it is evaluated soundly using the technique of formula abstraction. on its scope. If it is not domain-exact, then it is evaluated on the rest of the heaplet. 5.1 Programs and Hoare-triples Translating a recursive definition rec∗ uses the corresponding definitions rec and reachrec, both are defined recursively in classical We consider straight-line program segments that do destructive logic. The set reachrec represents the domain of the required heaplet pointer-updates, data-updates and procedure calls. Parameterized for evaluating rec∗, and the ∗-eliminated definition rec captures the by a set of pointer fields PF and a set of data-fields DF, the syntax value of rec∗ when the heaplet is restricted to reachrec. Formally, of the programs is defined in Figure 4, where pf ∈ PF, df ∈ DF, u suppose rec∗ is a recursive definition w.r.t. pointer fields pf~ and and v are program variables of type location, j and z are program stopping locations ~v, then reachrec is recursively defined as variables of type integer, int is an integer constant. To simplify the presentation, we assume all program variables are local and are def reachrec(x) = ITE x = nil ∨ x ∈ ~v, ∅, either pre-assigned or assigned once in the program.  We allow two kinds of recursive procedures, one returning a lo- ITE x ∈ R, {x} ∪ reachrec(pf(x)) , {x} cation f (~v,~z) and one returning an integer g(~v,~z). Each program   pf[∈pf~   is annotated with its pre- and post-conditions in Dryad. The pre- We denote the righthand side as reachdefrec(x). For each recursive condition is denoted as a formula ψpre(~v,~z,~c), where ~v and ~z are variables as the formal parameters/program variables, ~c is a set of ∗ ∗ def p predicate p defined as p (x) = ϕ (x,~v, ~s), we define complimentary variables, which are implicitly existentially quan- def tified. The post-condition is denoted as a formula ψ (ret,~v,~z,~c), p(x) = T ϕp(x, ~v, ~s), reachp(x) post where ret is the variable representing the returned value, of corre- We denote the righthand side as defp(x,~v, ~s). Similarly, for each sponding type, ~v and ~z are program variables, ~c is a set of compli- ∗ recursive function f defined as mentary variables that have appeared in the pre-condition ψpre. def Given a straight-line program with its pre- and post-conditions f ∗(x) = ϕ f (x,~v, ~s) : t f (x, ~s) ; ... ϕ f (x,~v, ~s) : t f (x, ~s) ; default : t f (x, ~s) 1 1 k k k+1 { ψpre } P { ψpost }, we define its partial correctness without Then we define  considering memory errors3: P is partially correct iff for every def f f f −∗ normal execution (memory-error free) of P, which transits state C f (x) = ITE T ϕ (x,~v, ~s), reach (x) , t (x, ~s) ′ ′ 1 1 to state C , if C |= ψpre, then C |= ψpost.  f  f −∗ f Given a Hoare-triple {ψpre} P {ψpost} as defined above, in the ITE T ϕ2 (x,~v, ~s), reach (x) , t2 (x, ~s)  f −∗ presence of a set of recursive definitions and a set of annotated ..., tk+1 (x, ~s) ...  3 f −∗ f We exclude memory errors in order to simplify the presentation. Memory where ti (x, ~s) is just the classical logic counterpart of ti (x, ~s), errors can be handled using a similar VC generation for assertions that when interpreted in a heap domain within reach f (x). Formally it is negate the conditions for memory errors to occur.

7 2012/11/14 Time (s) procedure declarations, we algorithmically derive the verification Data-structure Routine / Routine condition ψVC corresponding to it (the algorithm is quite involved, find rec, insert front, and is presented as Appendix A). Singly- insert back rec, delete all rec, < 1s Linked List Theorem 5.1. Given a Hoare-triple {ψpre} P {ψpost}, assume that copy rec, append rec, reverse iter each procedure call in P satisfies its associated pre- and post- find rec, insert rec, merge rec, conditions. Then the triple is valid if the formula ψ derived above delete all rec, insert sort rec, < 1s VC Sorted List is valid. Moreover, when P contains no procedure calls, the triple reverse iter, find last iter insert iter is valid iff ψVC is valid. 1.4 quick sort iter 64.8 insert front, insert back rec, Proof. Presented as Appendix B.  Doubly- delete all rec, append rec, < 1s Linked List mid insert, mid delete, meld 5.2 Unfolding Across the Footprint insert front, insert back rec, Cyclic List < 1s The verification condition obtained above is a quantifier-free for- delete front, delete back rec mula involving recursive definitions and the reachable sets of the Max-Heap heapify rec 8.8 p form reach (x), which are also defined recursively. Intuitively, the find rec, find iter, insert rec, < 1s footprint is the set of locations explored by the program explicitly delete rec, remove root rec (not including procedure calls). More precisely, a location is in the insert iter 72.4 BST footprint if it is dereferenced explicitly in the program. While these find leftmost iter 4.7 recursive definitions can be unfolded ad infinitum, the idea of un- remove root iter 65.6 folding across the footprint is to unfold the recursive definitions delete iter 225.2 precisely over the footprint of the program, so that recursive defi- find rec, delete rec < 1s nitions on the footprint nodes are related, as tightly as possible, to Treap insert rec 12.7 the recursive definitions on frontier nodes. This will enable effec- remove root rec 9.5 tive use of the formula abstraction mechanism, as when recursive balance, leftmost rec < 1s insert rec 4.1 definitions on frontier nodes are made uninterpreted, the formula AVL-Tree ensures tight conditions that the frontier nodes have to satisfy. We delete rec 13.9 state the unfolding formally using a formula Unfold (formulated in insert rec 73.9 Appendix C). insert left fix rec 8.1 We use a predicate fp to indicate the footprint of the program. insert right fix rec 5.1 RB-Tree delete rec 12.1 The fact that the footprint is exactly the set of locations pointed by delete left fix rec 7.6 the dereferenced variables can be stated using a formula Footprint. delete right fix rec 5.5 We may also know that the recursive definitions or fields on a leftmost rec < 1s location is unchanged during a segment or procedure call of the Binomial find min rec 1.1 program, if the location is not affected by the segment or procedure Heap merge rec 152.7 eg nchanged call. These facts can be captured by formulas S U and Schorr-Waite marking iter < 1s CallUnchanged, respectively. We may also incorporate the fact that inorder tree to list rec 2.4 the reach set on every non-footprint location u contains u itself Tree inorder tree to list iter 42.7 elf each Traversals using the formula S R . All the formulas mentioned above preorder rec, postorder rec < 1s are formally defined in Appendix C. inorder rec 3.76 Now we can strengthen the verification condition by incorporat- ing all the derived formulas above: Figure 5. Results of verifying data-structure algorithms. (see de- ′ tails at http://web.engr.illinois.edu/˜qiu2/dryad) ψVC ≡ ψVC ∧ Unfold ∧ Footprint ∧ SegUnchanged ∧ CallUnchanged ∧ SelfReach abs Since the incorporated formulas are all satisfied when inter- gist of our natural proof scheme. When a proof for ψVC is found, ′ preted on a normal execution E, we can reduce the validity of ψVC we call it a natural proof for ψVC (and also for ψVC). ′ to the validity of ψVC.

Theorem 5.2. Given a Hoare-triple {ψpre} P {ψpost}, its verification ′  condition ψVC is valid if and only if ψVC is valid. The formula abstraction step is the only step that introduces in- completeness in our framework, but helps us transform the verifi- 5.3 Formula Abstraction cation condition to a decidable theory. Formula abstraction (com- ′ While the strengthened verification condition ψVC is still undecid- bined with unfolding recursive definitions across the footprint) dis- able, as we argued before, it is often sufficient to prove it by as- cover recursive proofs where the recursion is structural recursion on suming that the recursive definitions are arbitrary, or uninterpreted. the definitions of the data-structures. The use of these tactics comes Moreover, the uninterpreted formula falls in the array property frag- from the observation that such programs often have such recursive ment [12], whose satisfiability is decidable and is supported by proofs (see [38] also for use of formula abstractions). abs modern SMT solvers such as Z3 [17]. This tactic roughly corre- Our goal now is to check the satisfiability of ¬ψVC in a decidable sponds to applying unification in proof systems. theory. We do that by transforming it to a formula ψAPF in the ′ To prove ψVC, we first replace each recursive predicate recd with array-property fragment, which is decidable [12] (details are in an uninterpreted predicaterec ˆ d, and replacing the corresponding Appendix D). rec rec reach set reachd with an uninterpreted Hd . Let the result formula abs Theorem 5.3. Given a Hoare-triple {ψ } P {ψ }, if the derived be ψVC. This conversion is called formula abstraction, which is pre post abs ′ array formula ψAPF is satisfiable, then the Hoare-triple is valid.  sound: if ψVC is valid, so is ψVC. The formula abstraction is the

8 2012/11/14 Time (s) 6. Experimental Evaluation Example Routine / Routine We have implemented a prototype of the natural proof methodol- free, prepend, concat, ogy for Dryad presented in this paper. The prototype verifier takes insert before, remove all, as input a set of user-defined recursive definitions, a set of pro- remove link, delete link, < 1s cedure declarations with contracts, and a set of straight-line pro- copy, reverse, nth, nth data, find, position, grams (or basic blocks) annotated with a pre-condition and a post- glib/gslist.c index, last, length condition specifying a set of partial correctness properties includ- Singly append 4.9 ing structural, data and separation requirements. Both the contracts Linked-List insert at pos 11.4 and pre-/post-conditions are written in Dryad. For each basic block, LOC: 1.1K remove 3.1 the verifier automatically generates the abstracted formula ψAPF as APF insert sorted list 16.6 described in Section 5, and passes ψ to Z3 [17], a state-of-the- merge sorted lists 6.1 art SMT solver, to check the satisfiability in the decidable theory merge sort 3.0 of array property fragment. The front-end of our verifier is based glib/glist.c free, prepend, reverse, on ANTLR and our tool is around 4000 lins of C# code. Using the Doubly nth, nth data, position, < 1s verifier, we successfully proved the partial correctness of 59 rou- Linked-List find, index, last, tines over a large class of programs involving heap data structures LOC: 0.3K length like sorted lists, doubly-linked lists, cyclic lists and trees. Addi- simpleq init, < 1s tionally, we pit our natural proofs methodogy against real-world simpleq remove after openbsd/queue.h programs and successfully verified, in total, 47 routines from dif- simpleq insert head 1.6 Queue simpleq insert tail 3.6 ferent projects including the list and queue implementations in the LOC: 0.1K Glib open source library, the OpenBSD library, the Linux kernel simpleq insert after 18.3 and the memory regions and the page cache implementations from simpleq remove head 2.1 two different operating systems. Experimental details are available secureOS/cachePage.c lookup prev 2.4 at http://web.engr.illinois.edu/˜qiu2/dryad. LOC: 0.1K add cachepage 6.4 User-provided Axioms: We allow the user to provide some inter- secureOS/ memory region init < 1s pretations to the recursive definitions with the help of axioms. As memoryRegion.c create user space region 3.6 a simple example, the axiom tree∗(x) ⇒ height∗(x) ≥ 0 states that LOC: 0.1K split memory region 5.8 find vma, remove vma, the height of a tree is always non-negative. During the VC genera- linux/mmap.c < 1s remove vma list tion, these axioms are automatically instantiated for every node in LOC: 0.1K the footprint of the heap. The user can also use the separating impli- insert vm struct 11.6 cation, or −∗, from separation logic while specifying these axioms, Figure 6. Results of verifying open-source libraries. (see details expressing some of which can become very diffcult without it. For at http://web.engr.illinois.edu/˜qiu2/dryad) eg. the axiom lseg∗(h, t) ⇒ list∗(t) − ∗ list∗(h) states that a list seg- ment from head h to tail t when appended to a disjoint list from t is an implementation of Schorr-Waite for trees and we check the gives a list from h. property that the resulting output tree is well-marked. We conducted the experiments on a machine with a dual-core, The routines inorder tree to list construct a list consisting 2.4GHz CPU and 6GB RAM. The first part of our experimental of the keys of the input tree, which is traversed inorder. The itera- results is tabulated in Figure 5. In general, for every routine, we tive version of this algorithm achieves this by maintaining a work- checked the properties formalizing the complete verification of the list/stack of sub-trees which remain to be processed at any given routines— capturing the precise structure of the resulting heap- time. The routines inorder, preorder and postorder number structure, the precise change to the data stored in the nodes and the nodes of an input tree according to the inorder, preorder and the precise heaplet modified by the respective routines. postorder traversal algorithm, respectively. For every routine, the suffix rec or iter indicates if the routine Figure 6 shows the results of applying natural proofs to the was implemented recursively or iteratively using while loops. The verification of various other real world programs and libraries. names for most of the routines are self-descriptive. Routines like Glib is the low-level C library that forms the basis of the GTK+ find, insert, delete, append, etc. are the natural implementa- toolkit and the GNOME desktop environment, apart from other tions of the corresponding operations. The routine open source projects. Using our protoype verifier, we efficiently delete all for singly-linked lists, sorted lists and doubly-linked verified Glib implementation of various routines for manipulating lists recursively deletes all occurrences of a particular key in the singly-linked and doubly-linked lists. We also verified the queue input list. The max-heap routine heapify accepts an almost max- library which forms part of the OpenBSD . heap in which the heap property is violated only at the root, both SecureOS is an operating-system/browser implementation which of whose children are max-heaps, and recursively descends the tree provides security guarantees to the user via formal verification [28]. to restore the max-heap property. The routine remove root for bi- The module cachePage maintains a cache of the recently used nary search trees and treaps is an auxiliary routine which is called disc pages. The cache is implemented as a priority queue based in delete. Similarly, the routines leftmost for AVL-trees and on a sorted list. We prove that the methods add cachepage and RB-trees and delete fix and insert fix for RB-trees are also lookup prev (both called whenever a disc page is accessed) main- auxiliary routines. tain the sortedness property of the cache page. Schorr-Waite is a well-known graph marking algorithm which In an OS kernel, a process address space consists of a set of in- marks all the reachable nodes of the graph using very little addi- tervals of linear addresses represented as a memory region. In the tional space. The algorithm achieves this by manipulating the point- SecureOS implementation, a memory region is implemented as a ers in the graph such that the stack of nodes along the path from sorted doubly-linked list where each node of the list with a start the root is encoded in the graph itself. The Schorr-Waite algorithm and an end address represents an interval included in the address is used in garbage collectors and it is traditionally considered as space. We also verified some key components of the Linux im- a challenging problem for verification [19]. The routine marking plementation of a memory region, present in the file mmap.c. In Linux, a memory region is represented as a red-black tree where

9 2012/11/14 each node, again, represents an address interval. We proved meth- [16] E. Cohen, M. Dahlweid, M. A. Hillebrand, D. Leinenbach, M. Moskal, ods which find, remove and insert a vma struct (vma short for a T. Santen, W. Schulte, and S. Tobies. VCC: A practical system for virtual memory address) into a memory region. verifying concurrent C. In TPHOLs’09, volume 5674 of LNCS, pages It also worth mentioning that in the process of experiments, 23–42. Springer, 2009. we did make some unintentional mistakes, in writing both the ba- [17] L. M. de Moura and N. Bjørner. Z3: An efficient SMT solver. In sic blocks and the annotations. For example, forgetting to free the TACAS’08, volume 4963 of LNCS, pages 337–340. Springer, 2008. deleted node, or using ∧ instead of ∗ in the specification between [18] D. Distefano, P. W. O’Hearn, and H. Yang. A local shape analysis two disjoint heaplets, were common mistakes. In these cases, Z3 based on separation logic. In TACAS’06, volume 3920 of LNCS, pages provided counter-examples to the verification condition that cap- 287–302. Springer, 2006. tured the essence of the bugs, and turned out to be very helpful for [19] T. Hubert and C. March´e. A case study of C source code verification: us to debug the specification. These debugging hints are usually not the Schorr-Waite algorithm. In SEFM’05, pages 190–199. IEEE, 2005. available in other incomplete proof systems. [20] B. Jacobs, J. Smans, P. Philippaerts, F. Vogels, W. Penninckx, and Our experiments show that the natural proof methodology set F. Piessens. Verifast: A powerful, sound, predictable, fast verifier for C forth in this paper is successful in efficiently proving full-functional and Java. In NFM’11, volume 6617 of LNCS, pages 41–55. Springer, correctness of a large variety of algorithms. Most of the VCs gen- 2011. erated for the above examples were dishcarged by Z3 in a few [21] N. Klarlund and A. Møller. MONA. BRICS, Department of Com- seconds. To the best of our knowledge, this is the first automatic puter Science, Aarhus University, January 2001. Available from mechanism that can prove such a wide variety of algorithms full- http://www.brics.dk/mona/. functionally correct, involving such complex properties of struc- [22] S. Lahiri and S. Qadeer. Back to the future: revisiting precise program ture, data and separation. verification using SMT solvers. In POPL’08, pages 171–182. ACM, 2008. References [23] K. R. M. Leino. Dafny: An automatic program verifier for functional [1] I. Balaban, A. Pnueli, and L. D. Zuck. Shape analysis of single-parent correctness. In LPAR-16, volume 6355 of LNCS, pages 348–370. heaps. In VMCAI’07, volume 4349 of LNCS, pages 91–105. Springer, Springer, 2010. 2007. [24] P. Madhusudan and X. Qiu. Efficient decision procedures for heaps [2] M. Barnett, B.-Y. E. Chang, R. DeLine, B. Jacobs, and K. R. M. Leino. using STRAND. In SAS’11, volume 6887 of LNCS, pages 43–59. Boogie: A modular reusable verifier for object-oriented programs. In Springer, 2011. FMCO’05, volume 4111 of LNCS, pages 364–387. Springer, 2005. [25] P. Madhusudan, G. Parlato, and X. Qiu. Decidable logics combining [3] J. Berdine, C. Calcagno, and P. W. O’Hearn. A decidable fragment heap structures and data. In POPL’11, pages 611–622. ACM, 2011. of separation logic. In FSTTCS’04, volume 3328 of LNCS, pages 97– [26] P. Madhusudan, X. Qiu, and A. Stefanescu. Recursive proofs for 109. Springer, 2004. inductive tree data-structures. In POPL’12, pages 123–136. ACM, [4] J. Berdine, C. Calcagno, and P. W. O’Hearn. Symbolic execution with 2012. separation logic. In APLAS’05, volume 3780 of LNCS, pages 52–68. [27] S. Magill, M.-H. Tsai, P. Lee, and Y.-K. Tsay. THOR: A tool for Springer, 2005. reasoning about shape and arithmetic. In CAV’08, volume 5123 of [5] J. Berdine, C. Calcagno, and P. W. O’Hearn. Smallfoot: Modular LNCS, pages 428–432. Springer, 2008. automatic assertion checking with separation logic. In FMCO’05, [28] H. Mai, E. Pek, H. Xue, P. Madhusudan, and S. King. Building a volume 4111 of LNCS, pages 115–137. Springer, 2005. secure foundation for mobile apps. In ASPLOS’13. to appear. [6] J. Berdine, C. Calcagno, B. Cook, D. Distefano, P. W. O’Hearn, [29] G. Nelson. Verifying reachability invariants of linked structures. In T. Wies, and H. Yang. Shape analysis for composite data structures. POPL’83, pages 38–47. ACM, 1983. In CAV’07, volume 4590 of LNCS, pages 178–192. Springer, 2007. [30] T. Nipkow, L. C. Paulson, and M. Wenzel. Isabelle/HOL - A Proof [7] N. Bjørner and J. Hendrix. Linear functional fixed-points. In CAV’09, Assistant for Higher-Order Logic, volume 2283 of Lecture Notes in volume 5643 of LNCS, pages 124–139. Springer, 2009. Computer Science. Springer, 2002. [8] M. Botinˇcan, M. Parkinson, and W. Schulte. Separation logic verifica- [31] P. W. O’Hearn, J. C. Reynolds, and H. Yang. Local reasoning about tion of c programs with an smt solver. ENTCS, 254:5 – 23, 2009. programs that alter data structures. In CSL’01, volume 2142 of LNCS, [9] A. Bouajjani, C. Dr˘agoi, C. Enea, and M. Sighireanu. A logic- pages 1–19. Springer, 2001. based framework for reasoning about composite data structures. In CONCUR’09, volume 5710 of LNCS, pages 178–195. Springer, 2009. [32] Z. Rakamari´c, J. D. Bingham, and A. J. Hu. An inference-rule-based decision procedure for verification of heap-manipulating programs [10] A. Bouajjani, C. Dr˘agoi, C. Enea, and M. Sighireanu. On inter- with mutable data and cyclic data structures. In VMCAI’07, volume procedural analysis of programs with lists and data. In PLDI’11, pages 4349 of LNCS, pages 106–121. Springer, 2007. 578–589. ACM, 2011. [33] Z. Rakamari´c, R. Bruttomesso, A. J. Hu, and A. Cimatti. Verifying [11] M. Bozga, R. Iosif, and S. Perarnau. Quantitative separation logic heap-manipulating programs in an SMT framework. In ATVA’07, and programs with lists. In IJCAR’08, volume 5195 of LNCS, pages volume 4762 of LNCS, pages 237–252. Springer, 2007. 34–49. Springer, 2008. [34] S. Ranise and C. Zarba. A theory of singly-linked lists and its exten- [12] A. R. Bradley, Z. Manna, and H. B. Sipma. What’s decidable about ar- sible decision procedure. In SEFM’06, pages 206–215. IEEE, 2006. rays? In VMCAI’06, volume 3855 of LNCS, pages 427–442. Springer, 2006. [35] J. Reynolds. Separation logic: a logic for shared mutable data struc- tures. In LICS’02, pages 55–74. IEEE, 2002. [13] B.-Y. E. Chang and X. Rival. Relational inductive shape analysis. In POPL’08, pages 247–260. ACM, 2008. [36] G. Rosu, C. Ellison, and W. Schulte. Matching logic: An alternative to Hoare/Floyd logic. In AMAST’10, volume 6486 of LNCS, pages [14] W.-N. Chin, C. David, H. H. Nguyen, and S. Qin. Automated veri- 142–162. Springer, 2010. fication of shape, size and bag properties via user-defined predicates in separation logic. Science of , 77(9):1006 – [37] J. Smans, B. Jacobs, and F. Piessens. Implicit dynamic frames. ACM 1036, 2012. Trans. Program. Lang. Syst., 34(1):2:1–2:58, 2012. [15] A. Chlipala. Mostly-automated verification of low-level programs in [38] P. Suter, M. Dotta, and V. Kuncak. Decision procedures for algebraic computational separation logic. In PLDI’11, pages 234–245. ACM, data types with abstractions. In POPL’10, pages 199–210. ACM, 2011. 2010.

10 2012/11/14 [39] P. Suter, A. S. K¨oksal, and V. Kuncak. Satisfiability modulo recur- [ u := v ] sive programs. In SAS’11, volume 6887 of LNCS, pages 298–315. ϕi ≡ u = v ∧ Ri = Ri−1 ∧ FieldsUnmod(PF ∪ DF, i, i − 1) Springer, 2011. [ u := nil ] [40] H. Yang, O. Lee, J. Berdine, C. Calcagno, B. Cook, D. Distefano, and P. W. O’Hearn. Scalable shape analysis for systems code. In CAV’08, ϕi ≡ u = nil ∧ Ri = Ri−1 ∧ FieldsUnmod(PF ∪ DF, i, i − 1) volume 5123 of LNCS, pages 385–398. Springer, 2008. [ u := v.p f ] [41] J. Yang and C. Hawblitzel. Safe to the last instruction: automated verification of a type-safe operating system. In PLDI’10, pages 99– ϕi ≡ v ∈ Ri−1 ∧ u = pfi−1(v) ∧ Ri = Ri−1 110. ACM, 2010. ∧ FieldsUnmod(PF ∪ DF, i, i − 1) [42] K. Zee, V. Kuncak, and M. C. Rinard. Full functional verification of [ u.p f := v ] linked data structures. In PLDI’08, pages 349–361. ACM, 2008. ϕ ≡ u ∈ R ∧ pf = pf {v ← u} ∧ R = R [43] K. Zee, V. Kuncak, and M. C. Rinard. An integrated proof language i i−1 i i−1 i i−1 for imperative programs. In PLDI’09, pages 338–351. ACM, 2009. ∧ FieldsUnmod(PF ∪ (DF \{pf}), i, i − 1) [ j := u.d f ]

ϕi ≡ u ∈ Ri−1 ∧ j = dfi−1(u) ∧ Ri = Ri−1 A. Generating the Verification Condition ∧ FieldsUnmod(PF ∪ DF, i, i − 1) Assume that P consists of n statements, then consider a normal exe- [ u.d f := j ] cution E, which can be represented as a sequence of program states ϕ ≡ u ∈ R ∧ df = df { j ← u} ∧ R = R (C0,..., Cn), where each Ci = (Ri, si, hi) represents the program i i−1 i i−1 i i−1 state after executing the first i statements. The verification condi- ∧ FieldsUnmod((PF \{df}) ∪ DF, i, i − 1) tion is just a formula interpreted on a state sequence (C ,..., C ). 0 n [ j := aexpr ] Let pfi : Loc → Loc be the function mapping every location l to ≡ = ∧ = ∧ ∪ − its pf pointer, i.e., pfi(l) = hi(l, pf) for every location l. Similarly, ϕi j aexpr Ri Ri−1 FieldsUnmod(PF DF, i, i 1) df : Loc → Int is defined such that df (l) = h (l, df) for every i i i [ u := new ] l. Recall that every program variable is either pre-assigned or as- ϕ ≡ new , nil ∧ u = new ∧ new < R ∧ R = R ∪{new } signed once in the program, each si is an expanding of the previous i i i i i−1 i i−1 i one, and sn is the store with all program variables defined. Hence ∧ pfi = pfi−1{nil ← newi} ∧ dfi = dfi−1{0 ← newi} we simply use v to denote sn(v). Moreover, every recursive predi- ^pf  ^df  cate/function is also indexed by i. For example, pi is the recursive ∗ p [ free u ] predicate such that pi(l)istrue iff Ci |= T(p (l), reach (l)). Now for every formula ϕ and every index i, we can give the index i to all the ϕi ≡ u ∈ Ri−1 ∧ Ri = Ri−1 \{u} ∧ FieldsUnmod(PF ∪ DF, i, i − 1) pointer fields, data fields and recursive definitions. We denote the indexed formula as ϕ[i]. [ assume bexpr ] Assume there are m procedure calls in P, then P can be divided ϕi ≡ bexpr ∧ Ri = Ri−1 ∧ FieldsUnmod(PF ∪ DF, i, i − 1) into m + 1 basic segments (subprograms without procedure calls): [ u := f (~v,~z) ] S ; g ; S ; ... ; g ; S d d 0 1 1 m m ϕi ≡ T ψpre(~v,~z,c ~d), Calld [i − 1] ∧ T ψpost(u,~v,~z,c ~d), Returnd [i]   where S d is the d + 1-th basic segment and gd is the d-th procedure ∧ (Ri−1 \ Calld) ∩ Returnd = ∅ ∧ Ri = (Ri−1 \ Calld) ∪ Returnd call. where d is the index such that td = i For each d ∈ [m], let the d-th procedure call in P be the td-th statement (we also extend the index d to 0 and m + 1 such that [ j := g(~v,~z) ] t0 = 0 and tm+1 = n + 1). Note that E requires that a portion of the ϕi is defined in the same way as the above case, state Ctd−1 satisfies the precondition of the call, and a portion of the except replacing u with j. state Ctd satisfies the postcondition of the call. We denote the two required portions Ctd−1 |Calld and Ctd |Returnd, respectively, where Calld ⊆ Rtd−1 and Returnd ⊆ Rtd are two sets of records. where FieldsUnmod(F, i, j) is short for field∈F (fieldi = field j). Let all the location variables appearing in P be LVars. We call a V location variable v dereferenced if v appears on the left-hand side Figure 7. Formulas capture modification by statements of a dereferencing operator “.” in P. We call a location variable v modified if v appears in a statement of the form v.pf := u or v.df := j not involve null pointer dereference: in P. Then we can extract the set of dereferenced variables Deref NoNullDereference ≡ v , nil and the set of modified variables Mod. Note that a modified variable v∈^Deref is always dereferenced, i.e., Mod ⊆ Deref. For each basic segment For each i ∈ [n], Figure 7 shows the effect of each statement S d, let the dereferenced and modified variables within the segment be Deref and Mod , respectively. on the verification condition generated. Each statement’s strongest td td post condition is captured in the logic, and for procedure calls, For the d-th procedure call, let the pre- and post-condition d d the heaplet manipulated by the procedure is carefully taken into associated with the procedure be ψpre(~v,~z,~c) and ψpost(ret,~v,~z,~c), account to update the heap at the caller. The conjunction of these respectively. Since E is a normal execution, we have Ctd−1 |= formulas captures the modification made in E: d d T(ψpre(v~d,z ~d,c ~d), Calld) and Ctd |= T(ψpost(u,v ~d,z ~d,c ~d), Returnd) Modification ≡ ϕi (assume the procedure call returns a location to u), where v~d and i^∈[n] z~d are the actual parameters of the procedure call, c~d are the com- plimentary variables with fresh names. Finally, we can define two formulas to capture the pre- and post- Now we are ready to define the verification condition corre- conditions: sponding to P. We first derive a formula expressing that E does Pre ≡ T(ψpre, R0)[0]

11 2012/11/14 Post ≡ T(ψpost, Rn)[n] [ j := aexpr ] Now the validity of {ψ } P {ψ } can be captured by the following pre post ϕi ≡ j = aexpr ∧ Ri = Ri−1 ∧ FieldsUnmod(PF ∪ DF, i, i − 1) formula: The statement assigns the value of aexpr, which is expressible ψVC ≡ Pre ∧ NoNullDereference ∧ Modification → Post in our logic, to j. Hence j = aexpr. The rest is similar to other  variable assignment cases. B. Proof of Theorem5.1 [ u := new ]

Proof. We prove the soundness by contradiction. Assume the ϕi ≡ newi , nil ∧ u = newi ∧ newi < Ri−1 ∧ Ri = Ri−1 ∪{newi} Hoare-triple {ψ } P {ψ } is not valid. Assume P consists of pre post ∧ pf = pf {nil ← new } ∧ df = df {0 ← new } n statements, then there is an execution E, which can be repre- i i−1 i i i−1 i ^pf ^df sented as a state sequence (C0,..., Cn) where each Ci = (Ri, si, hi),   such that (C0, R0) satisfies ψpre[0], (Cn, Rn) satisfies ψpost[n], and This statement makes u points to a freshly allocated location, the whole execution is memory error free. Then by the defini- namely newi in E. So it is clear that newi , nil ∧ u = newi. tions of Pre,Post and NoNullDereference, and Theorem 5.1, Since the heap domain at timestamp i is an extension of that E |= Pre ∧ NoNullDereference ∧ Post. Now it suffices to show at timestamp i − 1 by adding newi, we know that newi < that E |= Modification, in which case E dissatisfies ψVC. The con- Ri−1 ∧ Ri = Ri−1 ∪{newi}. By default, for newi, each pointer tradiction will conclude the proof. field initially points to nil, each data field initially stores 0. The Since Modification ≡ i∈[n] ϕi, we just need to prove E |= ϕi remaining portion of the heap is exactly the same as Ci−1. Hence = { ← } ∧ = { ← } for each i ∈ [n], by case analysisV on the type of the i-statement in pf pfi pfi−1 nil newi df dfi dfi−1 0 newi . P. [ freeV u ]  V  ϕ ≡ u ∈ R ∧ R = R \{u}∧ FieldsUnmod(PF ∪ DF, i, i − 1) [ u := v ] i i−1 i i−1 This statement removes the location pointed by u from the heap. ϕ ≡ u = v ∧ R = R ∧ FieldsUnmod(PF ∪ DF, i, i − 1) i i i−1 So the old heap contains this location, and the new heap can be The variable assignment makes u points to where v points to. obtained by subtracting it from the old heap: u ∈ Ri−1 ∧ Ri = Hence u = v. Since the heap is unmodified from Ci−1 to Ci, Ri−1 \{u}. Since the domain is shrinked, the field function can the heap domain remains the same (Ri = Ri−1), and all the field be simply unchanged. functions remain the same (FieldsUnmod(PF ∪ DF, i, i − 1)). [ assume bexpr ] [ u := nil ] ϕi ≡ bexpr ∧ Ri = Ri−1 ∧ FieldsUnmod(PF ∪ DF, i, i − 1) ϕ ≡ u = nil ∧ R = R ∧ FieldsUnmod(PF ∪ DF, i, i − 1) i i i−1 The assumed condition bexpr, which can be expressed in our The variable assignment makes u points to nil, so u = nil. logic, must be true. The heap is simply unmodified. Similar to the above case, the heap is also unmodified from Ci−1 [ u := f (~v,~z) ] to C . i ϕ ≡ T ψd (~v,~z,c ~ ), Call [i − 1] ∧ T ψd (u,~v,~z,c ~ ), Return [i] [ u := v.pf ] i pre d d post d d ∧ (Ri−1 \ Calld) ∩ Return d = ∅∧ Ri = (Ri−1 \ Calld) ∪ Return d ϕi ≡ v ∈ Ri−1 ∧ u = pfi−1(v) ∧ Ri = Ri−1 where d is the index such that td = i ∧ FieldsUnmod(PF ∪ DF, i, i − 1) As assumed, this call is the d-th procedure call in P, and The dereferencing on v implies that v points to a valid location Ci−1 satisfies the associated precondition by the heaplet de- − ∈ at timestamp i 1, i.e., v Ri−1. Moreover, the assignment fined by Calld, and Ci satisfies the associated postcondition − = makes u points to the pf field of v at timestamp i 1, formally u by the heaplet defined by Returnd. Then formally we have pf (v). Similar to the above case, the heap is also unmodified d d i−1 T ψpre(~v,~z,c ~d), Calld [i − 1] ∧ T ψpost(u,~v,~z,c ~d), Returnd [i] be- from C to C . i−1 i ing satisfied by E.   [ u.pf := v ] Due to the framing property of the separation semantics, the portion of Ci−1 that is not required by ψpre remains unchanged, ϕi ≡ u ∈ Ri−1 ∧ pfi = pfi−1{v ← u}∧ Ri = Ri−1 and is disjoint from Returnd (since the returned location is ∧ FieldsUnmod(PF ∪ (DF \{pf}), i, i − 1) assigned to u, the variable ret can be replaced with u). This Similar to the above case, u points to a valid location at times- property can be expressed as (Ri−1 \ Calld) ∩ Returnd = ∅∧ Ri = R \ Call ∪ Return tamp i − 1(u ∈ Ri−1). the mutation makes the pf field at times- ( i−1 d) d. = tamp i updated from that at timestamp i − 1: pfi = pfi−1{v ← u}. [ j : g(~v,~z) ] Moreover, the heap domain is unmodified, so Ri = Ri−1. The ϕ is defined in the same way as the above case, other field functions also remain the same, which is captured i by FieldsUnmod(PF ∪ (DF \{pf}), i, i − 1). except replacing u with j. [ j := u.df ] The proof is also similar to the above case.

ϕi ≡ u ∈ Ri−1 ∧ j = dfi−1(u) ∧ Ri = Ri−1  ∧ FieldsUnmod(PF ∪ DF, i, i − 1) Similar to the u := v.pf case. C. Strengthening the Verification Condition [ u.df := j ] Let u be a location variable in LVars and let i be an timestamp such that 1 ≤ i ≤ n. For each recursive definition rec∗ whose ∗- ϕ ≡ u ∈ R ∧ df = df { j ← u}∧ R = R def i i−1 i i−1 i i−1 eliminated version defined as p(x) = defrec(x,~t,~v) and whose reach ∧ FieldsUnmod((PF \{df}) ∪ DF, i, i − 1) def set defined as reachrec(x) = reachdef rec(x), we can derive a formula Similar to the u.pf := v case. Unfoldrec(i, u) for unfolding both rec∗ and its corresponding reach

12 2012/11/14 set on u at timestamp i, provided that u is allocated at the current The formula SelfReach can be defined as: timestamp (u ∈ R ). Note that in def rec(x,~t,~v), x will be renamed i SelfReach ≡ u , nil → as u, and ~t will not be renamed as they are program variables, 0≤^d≤m u∈^LVars  but ~v are existentially quantified and should be replaced with fresh variable names. Due to the restrictions on the recursive definitions, u ∈ reachrec(u) ∩ reachrec (u) td td+1−1 every v is unique and can be determined by dereferencing u on  \rec    the corresponding pointer fields, say pf rec,v. Hence we can replace each v in ~v distinctly as u rec v i. Let the renamed formula be rec ~ abs APF def (u, t,~vfresh), then we can derive D. Translating ¬ψVC to ψ abs rec rec rec Note that ¬ψVC is mostly expressible in the quantifier-free theory Unfold (i, u) ≡ reachi (u) = reachdefi (u) ∧ u ∈ Ri →   of arrays, maps, uninterpreted functions, and integers: Loc can be viewed as an uninterpreted sort; each pointer field pf can be rec rec,v reci(u) ↔ def (u,~t,~vfresh) ∧ pf (u) = u rec v i viewed as an array with both indices and elements of sort Loc; each i i !    ^v∈~v    data field df can be viewed as an array with indices of sort Loc and elements of sort Int; each integer set (or multiset) variable S Now the footprint unfolding is just unfolding every derefer- enced variable for every timestamp at the beginning and the end and can be viewed as an array with indices of sort Int and elements the program, as well as before/after each procedure call, as long as of sort Bool (or Int). Moreover, each array update operation of the variable is allocated at the current timestamp. As a special lo- the form array{elem ← key} can be viewed as a read-over-write cation, nil is also unfolded with respect to each recursive definition operation in the array property fragment, and each set-operation and each timestamp. We can simply make the conjunction of all (union, intersection, etc.) can be viewed as a mapping function these unfoldings: applying a Boolean operation (∧, ∨, etc.) to the range of arrays. The only construct in ¬ψabs that escapes the quantifier-free Unfold ≡ VC formulation is the ≤ relation between integer sets/multisets; but 0≤^d≤m u∈Deref^∪{nil} ^rec this can be captured using the array-property fragment, which is rec rec Unfold (td, u) ∧ Unfold (td+1 − 1, u) decidable [12]. For each atomic formula of the form S 1 ≤ S 2, if S 1   and S 2 are sets of integers, we can be replace the formula with a universally quantified formula as follows: The formula Footprint is defined as follows: ∀i1, i2. i1 < i2 → (¬S 2[i1] ∨¬S 1[i2]) Footprint ≡ (u = v) ↔ fp(u) Similarly, if S 1 and S 2 are integer multisets, we can replace the u∈^LVars  v∈_Deref    formula with To define SegUnchanged and CallUnchanged, we first define a formula expressing that a recursive definition and its corresponding ∀i1, i2. i1 < i2 → (S 2[i1] = 0 ∨ S 1[i2] = 0) reach set on a location is unchanged between two timestamps: We thus obtain a formula ψAPF whose satisfiability is decidable. rec ′ rec rec Unchanged (u, i, i ) ≡ reci(u) = reci′ (u) ∧ reachi (u) = reachi′ (u) In the d-th segment of the program, for each non-footprint location variable u and recursive predicate p∗ (or similarly for each recursive ∗ p function f ), ptd (u) and reachtd (u) are unchanged, if the reach set p u reachtd ( ) is not modified during this segment. This fact can be captured as follows:

SegUnchanged ≡ ¬fp(u) →

0≤^d≤m u∈^LVars  rec rec (reach (u) ∩ Modt = ∅) → Unchanged (u, td, td+1 − 1) td d !  ^rec   In the d-th procedure call of the program, for each variable u, if u is not in the footprint, then for each recursive predicate rec∗, rec u rec u rec u td ( ) and reachtd ( ) are unchanged, if the reach set reachtd ( ) is not affected during this call. Moreover, if u is not nil, then for ff each field pf (or df), pftd−1(u) is unchanged if u itself is not a ected during the call. This fact can be captured as follows: CallUnchanged ≡ d^∈[m] u∈^LVars

(¬fp(u)) → 

rec rec (reach (u) ∩ Calld = ∅) → Unchanged (u, td, td+1 − 1) td !  ^rec  

∧ u , nil ∧ u < Calld →   pf u = pf u ∧ df u = df u td−1( ) td ( ) td−1( ) td ( )  ^pf  ^df   

13 2012/11/14