PxTP 2011: First International Workshop on Proof eXchange for Theorem Proving

August 1, 2011

Affiliated with CADE 2011, 31 July – 5 August 2011, Wrocław, Poland

http://pxtp2011.loria.fr/

Preface

This volume contains the papers presented at PxTP 2011, the First International Workshop on Proof eXchange for Theorem Proving, held on August 1st, 2011 in Wrocław. The main objective of this workshop is to stimulate research on proof production and exchange in the field of computer-aided deduction. Machine-checkable proofs have been proposed for applications like proof-carrying code and certified compilation, as well as for exchanging knowledge between different automated reasoning systems. For example, interactive theorem provers can import results from otherwise untrusted high-performance solvers, by means of proofs the solvers produce. In such situations, one automated reasoning tool can make use of the results of another, without having to trust that the second tool is sound. It is only necessary to be able to reconstruct a proof that the first tool will accept, in order to import the result without increasing the size of the trusted computing base.

This simple idea of proof exchange for theorem proving becomes quite complicated under the real-world constraints of highly complex and heterogeneous proof producers and proof consumers. For example, even the issue of a standard proof format for a single class of solvers, like SMT solvers, is quite difficult to address, as different solvers use different inference systems. It may be quite challenging, from an engineering and possibly also theoretical point of view, to fit these into a single standard format. Emerging work from several groups proposes standard meta-languages or parametrized formats to achieve flexibility while retaining a universal proof language.

Papers were solicited on topics that include (but are not limited to) all aspects of proof exchange among automated reasoning tools. More specifically, some suggested topics were:

• proposed proof formats for different classes of logic solvers (SAT, SMT, QBF, First-Order ATP, Higher-Order ATP, Rewriting, etc.).

• meta-languages and logical frameworks for proofs, particularly proof systems designed for solvers.

• proof checking tools and algorithms.

• proof translation and methods for importing proofs, including proof replaying or reconstruction.

• tools and case studies related to analyzing proofs produced by solvers, and proof metrics.

• applications relying on importing proofs from automated theorem provers, such as certified static analysis, proof-carrying code, or certified compilation.

• data structures and algorithms for improved proof production in solvers (for example, more time- or memory-efficient ways of representing proofs).

We received seven papers. Each submission was reviewed by three program committee members. Due to the quality of the submissions, we were able to accept all contributions for presentation at this first edition of the PxTP workshop. The workshop used the EasyChair Conference System; we would like to thank the EasyChair team for making this great tool available. The workshop organizers would also like to thank the program committee for their work and the organizers of the PSATTT workshop for a very fruitful partnership. We are very grateful to the CADE organizers for their support and for hosting the workshop.

July 2011 Pascal Fontaine and Aaron Stump

Program Committee

• Clark Barrett (New York University)

• Christoph Benzmüller (Articulate Software)

• Sascha Böhme (Technische Universität München)

• Amy Felty (University of Ottawa)

• Pascal Fontaine, co-chair (INRIA, University of Nancy)

• Leonardo de Moura (Microsoft Research)

• Hans de Nivelle (University of Wrocław)

• David Pichardie (INRIA Rennes)

• Stephan Schulz (Technische Universität München)

• Aaron Stump, co-chair (The University of Iowa)

• Geoff Sutcliffe (University of Miami)

• Laurent Théry (INRIA)

• Tjark Weber (University of Cambridge)

• Bruno Woltzenlogel Paleo (Technische Universität Wien)

Table of Contents

A Nelson-Oppen based Proof System using Theory Specific Proof Systems ...... 1
  Frédéric Besson, Pierre-Emmanuel Cornilleau and David Pichardie
A Flexible Proof Format for SMT: a Proposal ...... 15
  Frédéric Besson, Pascal Fontaine, and Laurent Théry
Designing Proof Formats: A User's Perspective ...... 27
  Sascha Böhme and Tjark Weber
Quantifier Inference Rules for SMT proofs ...... 33
  David Déharbe, Pascal Fontaine and Bruno Woltzenlogel Paleo
Towards certification of TLA+ proof obligations with SMT solvers ...... 40
  Stephan Merz and Hernán Vanzetto
Escape to ATP for Mizar ...... 46
  Piotr Rudnicki and Josef Urban
Combining Proofs to form Different Proofs ...... 60
  Geoff Sutcliffe, Cynthia Chang, Deborah McGuinness, Tim Lebo, Li Ding, and Paulo Pinheiro da Silva

A Nelson-Oppen based Proof System using Theory Specific Proof Systems∗

Frédéric Besson, Pierre-Emmanuel Cornilleau, David Pichardie

INRIA Rennes – Bretagne Atlantique, France

Abstract

SMT solvers are nowadays pervasive in verification tools. When the verification is about a critical system, the result of the SMT solver is also critical and cannot be trusted. The SMT-LIB 2.0 is a standard interface for SMT solvers but does not specify the output of the get-proof command. We present a proof system that is geared towards SMT solvers and follows their conceptually modular architecture. Our proof system makes a clear distinction between propositional and theory reasoning. Moreover, individual theories provide specific proof systems that are combined using the Nelson-Oppen proof scheme. We propose specific proof systems for linear real arithmetic (LRA) and uninterpreted functions (EUF) and discuss proof generation and proof checking. We have evaluated the cost of generating proofs in our proof system. Our experiments on benchmarks taken from the SMT-LIB library show that the simple mechanisms used in our approach suffice for a large majority of the selected benchmarks.

1 Introduction

Modern Satisfiability Modulo Theory (SMT) solvers (e.g., CVC3 [2], veriT [5], Yices [11] or Z3 [7]) are able to automatically discharge formulas of industrial size combining various logic fragments such as linear (real or integer) arithmetic, the theory of uninterpreted function symbols or the theory of arrays. The SMT-LIB 2.0 format [1] is a standard interface for SMT solvers. It provides a unified syntax for SMT problems and a rich interface for interacting with SMT solvers. The command check-sat tests the satisfiability of the problem and is the minimal information that is expected from an SMT solver. More advanced features are unsat cores (get-unsat-core) or models (get-model). In case the problem is unsat, the command get-proof outputs a proof of this fact. The answer to the get-proof command is unspecified and is therefore prover-specific. Actually, the SMT solvers CVC3, veriT and Z3 all use a different syntax and semantics for their proofs. Moreover, the granularity of the proofs greatly differs. This hinders proof exchange and significantly complicates proof checking by third-party entities. Several works show that checking proofs generated by SMT provers in skeptical proof assistants (see e.g., [12, 13, 4]) requires substantial (retro-)engineering. In this paper, we advocate for a very structured proof system that mimics the (conceptual) modular architecture of SMT solvers. We provide:

• A new methodology to obtain unsatisfiability proofs from an untrusted, non proof-producing SMT solver. Our proof format is modular: it separates Boolean reasoning from theory reasoning. Each multi-theory proof is itself decomposed (using the Nelson-Oppen proof scheme) into mono-theory proofs.

• A prototype prover that generates proofs. The prover only requires an SMT solver that extracts unsat cores and Boolean models, as expected by the SMT-LIB 2.0 format. An SMT solver is used to obtain unsat multi-theory cores, and any proof-generating multi-theory prover can be used to obtain certificates for theory-specific lemmas.

∗This work was funded by the ANR Decert project.


For uninterpreted functions (EUF) and linear real arithmetic (LRA) we propose specific proof systems and discuss how to generate proofs using state-of-the-art decision procedures. We have done preliminary experiments to assess the viability of our proof generation. Using SMT-LIB 2.0 scripts, we have implemented a lazy SMT loop [9]: a first SMT solver acting as a SAT solver, and a second SMT solver acting as a theory reasoner. Such a set-up amounts to disabling many optimisations and forbidding, for instance, any global pre-processing or theory propagation. Nonetheless, the results are rather encouraging as we are able to generate, for most of the benchmarks, a proof with an acceptable overhead. The remainder of this paper is organised as follows. Section 2 covers the needed SMT solving background and describes a simple SMT proof search. Section 3 defines our proof systems and describes their interactions. Section 4 presents some experimental evaluation results. We discuss related work in Section 5 and conclude in Section 6 with a discussion on further work.

2 Background

In this section, we give an overview of some concepts useful to describe the interactions between the Boolean and the theory parts of an SMT proof search.

2.1 Separating Boolean and Theory reasoning

We consider multi-theory unquantified first-order formulas, with terms belonging to combinations of theories. Such a formula will be called a T-formula. The following formula is an example of a T-formula combining uninterpreted functions and arithmetic:

$f(f(x) - f(y)) \neq f(z) \;\wedge\; x \leq y \;\wedge\; ((y + z \leq x \wedge z \geq 0) \vee (y - z \leq x \wedge z < 0))$   (1)

Boolean Abstraction. A simple approach to solve a T-formula is to consider its Boolean abstraction and search for propositional models, eliminating along the search any model leading to a contradiction at the theory level. To obtain the Boolean abstraction, the atoms from the underlying theories are replaced by propositional variables. We will refer to the resulting propositional formula as the propositional abstraction of the initial T-formula. Each variable corresponds to a theory literal. For example, the propositional abstraction of the T-formula (1) is $A \wedge B \wedge ((C \wedge D) \vee (E \wedge \neg D))$, with the following T-mapping:

$A \mapsto f(f(x) - f(y)) \neq f(z)$   $B \mapsto x \leq y$   $C \mapsto y + z \leq x$   $D \mapsto z \geq 0$   $E \mapsto y - z \leq x$

If the abstracted formula does not have a model, i.e., the propositional abstraction is unsatisfiable, then the T-formula is unsatisfiable at the Boolean level. But if the abstraction has a model, this model needs to be validated at the theory level. To do that, we transform this model into a conjunction of theory atoms, called a T-conjunction, according to the T-mapping between propositional variables and corresponding atoms. Consider the following propositional model of the T-formula (1):

$A \mapsto True$   $B \mapsto True$   $C \mapsto True$   $D \mapsto True$   $E \mapsto False$   (2)

The corresponding T-conjunction is

$f(f(x) - f(y)) \neq f(z) \wedge x \leq y \wedge y + z \leq x \wedge z \geq 0 \wedge \neg(y - z \leq x)$   (3)

This formula is unsatisfiable (see Section 2.2 for the involved theory reasoning), hence model (2) leads to a contradiction at the theory level, and has to be removed from the search. We eliminate model (2) from the propositional SAT search by adding to the propositional abstraction, as a new clause called a conflict clause, the negation of the abstraction of the T-conjunction (3), i.e., $A \wedge B \wedge C \wedge D \wedge \neg E \Rightarrow False$. We refer to the conjunction of the propositional abstraction and the discovered conflict clauses as the propositional abstraction set. At the beginning of the search, this set only contains the propositional abstraction of the T-formula. We can now continue the search by looking for another model of the propositional abstraction set, until either the set is unsatisfiable, or a model of the initial T-formula is found.
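To make the search loop concrete, here is a minimal OCaml sketch of it; the three parameters are hypothetical oracles standing for the external solvers: sat_model returns a propositional model of the current clause set (or None if the set is unsatisfiable), theory_unsat decides whether the T-conjunction corresponding to a model is unsatisfiable, and conflict_clause negates (an unsat-core of) that T-conjunction.

  (* lazy loop of Section 2.1: enumerate propositional models and
     eliminate the spurious ones with conflict clauses *)
  let rec lazy_smt_loop ~sat_model ~theory_unsat ~conflict_clause clauses =
    match sat_model clauses with
    | None -> `Unsat                  (* abstraction set unsatisfiable *)
    | Some m ->
        if theory_unsat m then
          (* model m is spurious: eliminate it and iterate *)
          lazy_smt_loop ~sat_model ~theory_unsat ~conflict_clause
            (conflict_clause m :: clauses)
        else `Sat m                   (* m extends to a T-model *)

Termination relies on each conflict clause eliminating at least the model that produced it, so only finitely many propositional models remain.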

Shorter Conflict Clauses. Notice that in our example the atom $\neg(y - z \leq x)$ is not necessary to prove T-conjunction (3) unsatisfiable. The T-conjunction

$f(f(x) - f(y)) \neq f(z) \wedge x \leq y \wedge y + z \leq x \wedge z \geq 0$   (4)

is already unsatisfiable; it is in fact an unsatisfiable core. The T-conjunction (3) being redundant, it leads to a weak conflict clause that does not eliminate the following model:

$A \mapsto True$   $B \mapsto True$   $C \mapsto True$   $D \mapsto True$   $E \mapsto True$   (5)

By building the conflict clauses from unsatisfiability cores (unsat-cores) instead of whole T-conjunctions, we eliminate more models and accelerate the search. If we use unsat-cores in our example, the conflict clause to add, in order to eliminate model (2), is $A \wedge B \wedge C \wedge D \Rightarrow False$, and it also eliminates model (5). The propositional abstraction set is then

$A \wedge B \wedge ((C \wedge D) \vee (E \wedge \neg D))$
$A \wedge B \wedge C \wedge D \Rightarrow False$

A model of this propositional formula is

$A \mapsto True$   $B \mapsto True$   $C \mapsto True$   $D \mapsto False$   $E \mapsto True$

and the corresponding T-conjunction is $f(f(x) - f(y)) \neq f(z) \wedge x \leq y \wedge y + z \leq x \wedge z < 0 \wedge y - z \leq x$. This is an unsatisfiable formula, and its unsat-core is

$x \leq y \wedge z < 0 \wedge y - z \leq x$   (6)

This unsat-core leads to the conflict clause $B \wedge \neg D \wedge E \Rightarrow False$. Once we have added this conflict clause to the propositional abstraction set, the set becomes unsatisfiable, and the model search ends.

Concluding the Search. Any model of the propositional abstraction set is a model of the propositional abstraction, because the conflict clauses we add to the set only eliminate models. Conversely, any model of the propositional abstraction which is not a model of the propositional abstraction set corresponds to an unsatisfiable T-conjunction. As a result, if the T-conjunction corresponding to a propositional model is satisfiable, we can obtain a model of the initial T-formula, i.e., a proof of satisfiability. On the contrary, if all propositional models translate into unsatisfiable T-conjunctions, the initial T-formula is unsatisfiable. In such a case, when the search ends, the propositional abstraction set is an unsatisfiable propositional formula. It is composed of:

• the propositional abstraction; in our example $A \wedge B \wedge ((C \wedge D) \vee (E \wedge \neg D))$;

• all the conflict clauses; in our example we found two of them:

$A \wedge B \wedge C \wedge D \Rightarrow False$
$B \wedge \neg D \wedge E \Rightarrow False$

Each conflict clause corresponds to an unsatisfiable T-conjunction. In our example, the two conflict clauses come from the T-conjunction unsat-cores (4) and (6). A conflict clause is the abstraction of a tautology, i.e., the negation of an unsatisfiable T-conjunction. In fact, we could add to the propositional abstraction set the abstraction of any tautology, conflict clause or not, without endangering the soundness of our proof search. Adding more clauses to the propositional abstraction would eliminate more models from the search and accelerate the procedure. Conflict clauses can more generally be seen as abstractions of theory lemmas, i.e., valid formulas whose abstractions are necessary to prove the unsatisfiability of the T-formula. To optimise the search, other kinds of theory lemmas could be useful, and modern SMT solvers do use more theory reasoning than mere conflict clauses. Some SMT solvers check partial models incrementally against the theory in order to build similarly small conflict sets. In this example, it is useless to assign a Boolean value to E to obtain a theory conflict. Moreover, the multi-theory solver may be able to discover propagation lemmas, i.e., theory literals that are consequences of partial models. In Boolean form, such lemmas allow the SAT solver to perform efficient unit propagation and reduce its search tree.

2.2 Multi-Theory Conjunction Proofs

We now give an overview of the Nelson-Oppen equality exchange, used to prove unsatisfiability of T-conjunctions. We illustrate the proof search on the T-conjunction (4) from the previous example¹.

The T-conjunction $f(f(x) - f(y)) \neq f(z) \wedge x \leq y \wedge y + z \leq x \wedge z \geq 0$ is first purified into pure parts:

EUF: (1) $f(y) = t_3$   (2) $f(x) = t_5$   (4) $f(t_6) = t_8$   (5) $f(z) = t_9$   (6) $t_8 \neq t_9$
LRA: (0) $t_0 = 0$   (3) $t_3 - t_5 + t_6 = 0$   (7) $y - x \geq 0$   (8) $-y + x - z \geq 0$   (9) $z \geq 0$

The two theories then exchange equalities: LRA proves (11) $x = y$ and (12) $t_0 = z$; EUF proves (14) $t_3 = t_5$, sent to LRA as $t_3 - t_5 = 0$; LRA proves (18) $t_6 = z$; finally EUF proves UNSAT.

Figure 1: Example of Nelson-Oppen equality exchange

In this example, we combine the theories of Equality and Uninterpreted Functions (EUF) and Linear Real Arithmetic (LRA). For EUF, a literal is an equality between multi-sorted ground terms and a formula is a conjunction of positive and negative literals. The axioms of this theory are reflexivity, symmetry and transitivity, and the congruence axiom $\forall a\, \forall b,\ a = b \Rightarrow f(a) = f(b)$ for functions. Such a theory is stably infinite and decidable using an efficient extension of the union-find algorithm to compute

¹The formula is taken from [14].

congruence closures [10]. The only way for a set of literals to be unsatisfiable is to deduce from positive literals an equality trivially negated by one of the negative literals. For LRA, a literal is a linear constraint $c_0 + c_1 \cdot x_1 + \cdots + c_n \cdot x_n \bowtie 0$ where $(c_i)_{i=0..n} \in \mathbb{Q}$ is a sequence of rational coefficients, $(x_i)_{i=1..n}$ is a sequence of real unknowns and $\bowtie \in \{=, >, \geq\}$². Here, a formula is a conjunction of positive literals. Such a theory is also stably infinite and decidable using the Simplex procedure [10]. The Nelson-Oppen algorithm is a sound and complete decision procedure for combining stably infinite theories with disjoint signatures. Figure 1 presents the deduction steps of this procedure on an example. We start from the formula at the top of Figure 1 and first apply a purification step that introduces sufficiently many intermediate variables to flatten each term and dispatch pure formulas to each theory. Then, each theory exchanges new equalities with the others, until a contradiction is found.
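As an illustration, here is a small OCaml sketch of the purification step for the EUF part, under simplifying assumptions: terms are uninterpreted applications only (the dispatch of arithmetic sub-terms is elided), fresh variable names come from a counter, and all names are illustrative and local to this sketch.

  type term = Var of string | App of string * term list

  let counter = ref 0
  let fresh () = incr counter; Printf.sprintf "t%d" !counter

  (* [flatten defs t] returns a variable standing for [t]; the defining
     flat equalities v = f(v1, ..., vn) accumulated in [defs] form the
     pure EUF part handed to the EUF solver *)
  let rec flatten defs t =
    match t with
    | Var _ -> t
    | App (f, args) ->
        let flat_args = List.map (flatten defs) args in
        let v = fresh () in
        defs := (v, App (f, flat_args)) :: !defs;
        Var v

Flattening the nested applications of the running example is what introduces the intermediate variables $t_3$, $t_5$, $t_6$ and $t_8$ above.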

3 Proof Systems

In this section we discuss the proof system for multi-theory formulas. We begin with a general discussion of proof search for whole formulas, then detail what is meant by Nelson-Oppen proofs. We follow with instances of uninterpreted functions (EUF) and linear real arithmetic (LRA) proofs.

3.1 Proof Scheme

Preprocessing. The first step of SMT solving is to handle Boolean abstraction and purification. Depending on the SAT proof system we use, we also need to put the propositional formulas in Conjunctive Normal Form (CNF). We can either give a proof for all these preprocessings, or make sure the checker will be able to find the normal forms itself, by using the same algorithms in the proof-producing prover and in the checker.

SMT Proofs. Once we are sure that the proof-producing prover and the proof checker agree on the preprocessing of the formula, the proof of unsatisfiability is composed of two parts:

• a proof of unsatisfiability of the propositional abstraction set, including all conflict clauses;
• the set of unsatisfiable T-conjunctions with their proofs.

With the theory proofs we can check the validity of the theory lemmas, and with the propositional proof we can check the unsatisfiability of the formula at the Boolean level.

Proof Generation. The proof generation would be facilitated if state-of-the-art SMT solvers gave direct access to the conflict clauses discovered during a search, or to any kind of theory reasoning for that matter. Still, we would have to link these discovered formulas to the initial problem, which would require taking into account any preprocessing done by the solver. Anyway, using the SMT-LIB 2.0 standard we can access models discovered by a SAT solver and unsatisfiability cores using an SMT solver. Then we can use off-the-shelf solvers to generate proofs, if non-optimal ones, and try to evaluate our scheme. See Section 4 for experimental results.

3.2 Propositional SAT Proof System

One part of an SMT proof is a proof of unsatisfiability of the propositional abstraction set. Unsatisfiability proofs of propositional formulas have already been discussed in the literature. Several proof systems [19]

²Following the Simplify [10] approach, disequality is managed on the EUF side.

and checking procedures [22] exist. State-of-the-art solvers like zChaff [18] or PicoSAT [3] can output checkable proofs. Formats may vary and we will not go into details, but all proof systems are based on the resolution rule:

$$\frac{x \vee C \qquad \neg x \vee C'}{C \vee C'}$$

The variable $x$ is called the resolution variable and $C$ and $C'$ are clauses. Using resolution chains, new clauses are deduced. Once the empty clause has been deduced, the initial set of clauses has been proved unsatisfiable; hence a proof is a list of resolution chains, and the checker uses them to produce new clauses until it reaches the empty clause. Using optimised algorithms, resolution proofs can be checked efficiently [20]. Other proof systems exist, e.g., Reverse Unit Propagation proofs [21], for checking propositional unsatisfiability.
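As a sketch, the resolution rule and the checking of a resolution chain fit in a few lines of OCaml; the representation below (literals as signed variables, clauses as lists) is illustrative and not a standard format, and a real checker would additionally verify that the pivot occurs with the expected polarities in both clauses.

  type lit = Pos of int | Neg of int

  (* resolution on variable x: from (x ∨ c) and (¬x ∨ c') deduce c ∨ c' *)
  let resolve x c c' =
    List.filter (fun l -> l <> Pos x) c @ List.filter (fun l -> l <> Neg x) c'

  (* a chain folds [resolve] over (pivot, clause) pairs; the proof is
     accepted once some chain produces the empty clause [] *)
  let check_chain init steps =
    List.fold_left (fun acc (x, c) -> resolve x acc c) init steps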

3.3 Nelson-Oppen Proofs

The second part of an SMT proof is a set of T-conjunctions and their proofs of unsatisfiability. We have seen on an example in Section 2.2 how to solve such conjunctions and we will now introduce Nelson-Oppen based proofs using the same example.

Step 1 (LRA): (11) $x = y$ because (7) gives $y - x \geq 0$ and (8) + (9) gives $x - y \geq 0$.

Step 2 (EUF): (14) $t_3 = t_5$ because of the following rewriting steps: $t_3 \xrightarrow{\text{trans. with (1)}} f(y) \xrightarrow{\text{congr. with (11)}} f(x) \xrightarrow{\text{trans. with (2)}} t_5$.

Step 3 (LRA): (18) $t_6 = z$ because (3) + (7) + (8) − (14) gives $t_6 - z \geq 0$ and (7) + (8) + 2·(9) + (14) − (3) gives $z - t_6 \geq 0$.

Step 4 (EUF): False by contradiction of (6) with the following rewriting steps: $t_8 \xrightarrow{\text{trans. with (4)}} f(t_6) \xrightarrow{\text{congr. with (18)}} f(z) \xrightarrow{\text{trans. with (5)}} t_9$.

Figure 2: Example of Nelson-Oppen proof

Proof Generation. Figure 2 presents the proofs we consider. The proof generation only has to consider useful exchanges, based on the whole history of exchanges. In this example, $t_0 = z$ is not required in the final proof. An LRA proof of $a = b$ is made of two Farkas proofs [17] of $b - a \geq 0$ and $a - b \geq 0$. Each inequality is obtained by a linear combination of hypotheses that preserves signs. An EUF proof of $a = b$ is made of a sequence of rewriting steps that allows one to reach $b$ from $a$. Each proof is expressed in a theory-specific proof format that is complete w.r.t. the theory, i.e., if a formula is unsatisfiable, there exists a proof of it. For EUF+LRA, unsatisfiability can always be proved without resorting to case-splits: EUF and LRA are said to be convex theories. In the general case of non-convex theories (such as linear integer arithmetic or theories of arrays), disjunctions of equalities may be generated and case-splits are necessary.

The Nelson-Oppen Proof System. The proof system we propose for a combination of n theories $T_1, \ldots, T_n$ is given below.

$$\frac{\Gamma_i \vdash_{T_i} prf_i : (\Gamma_i', eqs) \qquad \bigwedge_{x_k = y_k \in eqs} \big(\Gamma_1[j \mapsto x_k = y_k], \ldots, \Gamma_i', \ldots, \Gamma_n[j \mapsto x_k = y_k] \vdash_{NO} sons[k] : False\big)}{\Gamma_1, \ldots, \Gamma_n \vdash_{NO} (prf_i, sons) : False}$$

In this judgement $\Gamma_i$ represents an environment of pure literals of theory $T_i$. Each theory is equipped with its own deduction judgement $\Gamma_i \vdash_{T_i} prf_i : (\Gamma_i', eqs)$ where $\Gamma_i$ and $\Gamma_i'$ are environments of theory $T_i$, $prf_i$ is a proof specific to theory $T_i$ and $eqs$ is a list of equalities between variables. Such a judgement reads as follows: assuming that all the literals in $\Gamma_i$ hold, we can prove that all the literals in $\Gamma_i'$ hold and the disjunction of the equalities in $eqs$ can be proved from $\Gamma_i$. The judgement $\Gamma_1, \ldots, \Gamma_n \vdash_{NO} (prf_i, sons) : False$ holds if, given an environment $\Gamma_1, \ldots, \Gamma_n$ of the joint theory $T_1 + \ldots + T_n$, the proof $(prf_i, sons)$ allows to exhibit a contradiction, i.e., False. Suppose that proof $prf_i$ establishes a judgement of the form $\Gamma_i \vdash_{T_i} prf_i : (\Gamma_i', eqs)$. If the list $eqs$ is empty, we have a proof that $\Gamma_i$ is contradictory and therefore the joint environment $\Gamma_1, \ldots, \Gamma_n$ is contradictory and the judgement holds. An important situation is when the list is always a singleton. This corresponds to the case of convex theories for which the Nelson-Oppen algorithm never performs case-splits. In the general case, we recursively exhibit a contradiction for each equality $(x_k = y_k)$ using the kth proof of $sons$, i.e., $sons[k]$, for a joint environment $(\Gamma_1[j \mapsto x_k = y_k], \ldots, \Gamma_i', \ldots, \Gamma_n[j \mapsto x_k = y_k])$ enriched with the equality $(x_k = y_k)$. For completeness, the index $j$ used to store the equality $(x_k = y_k)$ should be fresh. The judgement holds if all the branches of the case-split over the equalities in $eqs$ reach a contradiction.
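The judgement translates directly into a recursive checker. Below is a minimal OCaml sketch of it, under illustrative types: theory_check implements the theory-specific judgement (returning the extended environment and the derived equalities, the empty list meaning a contradiction), and push_eq adds an equality to an environment under a fresh index; both are assumptions of this sketch, not a fixed API.

  type 'p no_proof = NO of int * 'p * 'p no_proof list
    (* theory index i, theory-specific proof, one son per equality *)

  let rec no_check ~theory_check ~push_eq envs (NO (i, prf, sons)) =
    let env_i', eqs = theory_check i envs.(i) prf in
    match eqs with
    | [] -> true                            (* theory i derived False *)
    | _ ->
        (* case-split: every equality is pushed into all environments
           and the corresponding son must derive False ([eqs] and
           [sons] must have the same length) *)
        List.for_all2
          (fun e son ->
             let envs' = Array.map (push_eq e) envs in
             envs'.(i) <- env_i';           (* theory i keeps Γi' *)
             no_check ~theory_check ~push_eq envs' son)
          eqs sons

For a convex combination such as EUF+LRA, eqs is always empty or a singleton, so the recursion never branches.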

3.4 Proof Checking and Generation for EUF

In this section we introduce a proof system and checker for EUF and present an overview of the proof-producing procedure. We then propose an overview of an alternative EUF proof system. After preprocessing and purification, EUF formulas can be encoded with the following types:

type var = int
type term = Var of var | Apply of var * var list
type formula = Eq of term * term | Neq of term * term

The fact that terms are purified and flat is an invariant maintained by the proof-producing procedure.
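For instance, under the hypothetical encoding where variable x has code 0, variable t5 has code 2 and the function symbol f has code 1, assumption (2) of Section 2.2, $f(x) = t_5$, is the flat formula:

  (* f(x) = t5, with illustrative codes x = 0, f = 1, t5 = 2 *)
  let assumption_2 : formula = Eq (Apply (1, [0]), Var 2)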

Proof System. A proof is a list of commands executed in sequence. Each command operates on the state of the checker, which is a pair $(\Gamma, eq)$. The assumption set $\Gamma$ is a mapping from indices to assumptions, written $\Gamma(i) \mapsto a = b$, and $eq$ is the current equality, i.e., the last one we proved. Each command corresponds to an axiom or a combination of axioms of the EUF theory. The syntax of the commands is the following:

type command =
  | Refl of term
  | Trans of index * bool
  | Congr of index * position * bool
  | Push of index

The semantics is given by rules of the form $(\Gamma, eq) \xrightarrow{cmd} (\Gamma', eq')$ where $(\Gamma', eq')$ is the state obtained after executing the command $cmd$ from the state $(\Gamma, eq)$. The Boolean $s$ in Trans and Congr commands makes symmetry explicit: if $\Gamma(i) \mapsto t = t'$ then we have $\Gamma(i)^{true} \mapsto t = t'$ and $\Gamma(i)^{false} \mapsto t' = t$.

$$\frac{}{\Gamma, \cdot = \cdot \xrightarrow{Refl(y)} \Gamma, y = y} \qquad \frac{\Gamma(i)^s \mapsto t = t'}{\Gamma, x = t \xrightarrow{Trans(i,s)} \Gamma, x = t'} \qquad \frac{\Gamma' = \Gamma[i \mapsto x = t]}{\Gamma, x = t \xrightarrow{Push(i)} \Gamma', x = t}$$

$$\frac{\Gamma(i)^s \mapsto a_p = a'_p}{\Gamma, x = f(a_0 .. a_p .. a_n) \xrightarrow{Congr(i,p,s)} \Gamma, x = f(a_0 .. a'_p .. a_n)}$$

The command Refl(y) corresponds to the reflexivity axiom and initialises the current equality with the tautology $y = y$, whatever the previous equality. Subsequent commands will then rewrite the right hand side of this equality. The command Trans(i,s) updates the right hand side of the current equality. If we


can prove that $x = t$ (current equality) and we know that $t = t'$ (equality indexed by $i$), then we can deduce $x = t'$. The command Congr(i,p,s) rewrites a sub-term of the right hand side: if we can prove $x = f(y)$ (current equality) and we know that $y = z$ (equality indexed by $i$), then we can deduce $x = f(z)$ and make it the new current equality. The parameter $p$ is used to determine where to rewrite. The command Push(i) is used to update the assumption set $\Gamma$ with the current equality $x = t$, creating a new context $\Gamma' = \Gamma[i \mapsto x = t]$ to be used to evaluate the next commands. It allows some factorisation of sub-proofs and is mandatory to keep the terms flat. The rules below detail the transitive closure of the previous relation, explaining how to evaluate a list of commands $prf$.

$$\frac{}{\Gamma', eq' \xrightarrow{nil}_{*} \Gamma', eq'} \qquad \frac{\Gamma, eq \xrightarrow{cmd} \Gamma', eq' \qquad \Gamma', eq' \xrightarrow{prf}_{*} \Gamma'', eq''}{\Gamma, eq \xrightarrow{cmd::prf}_{*} \Gamma'', eq''}$$

The relation $\Gamma \vdash_{EUF} prf : (\Gamma', eqs)$ implements the theory-specific judgement seen in Section 3.3.

$$\frac{\Gamma, z = z \xrightarrow{prf}_{*} \Gamma', x = y}{\Gamma \vdash_{EUF} EUF\_Eq(prf) : (\Gamma', [x = y])} \qquad \frac{\Gamma, z = z \xrightarrow{prf}_{*} \Gamma', x = y \qquad \Gamma(i) \mapsto x \neq y}{\Gamma \vdash_{EUF} EUF\_False(i, prf) : (\Gamma', nil)}$$

Suppose that we obtain a state $(\Gamma, x = y)$ after processing a list $prf$ of commands. The proof EUF_False(i, prf) deduces a contradiction if $\Gamma(i) \mapsto x \neq y$, and the proof EUF_Eq(prf) deduces the equality $x = y$.
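A one-step evaluator for these commands is a direct transcription of the rules. The OCaml sketch below reuses the datatypes term and command of this section, with index and position both being int; the assumption set is an integer-indexed map, and the helper names are illustrative rather than the actual implementation.

  module IMap = Map.Make (Int)

  type state = { gamma : (term * term) IMap.t;   (* assumption set Γ *)
                 eq    : term * term }           (* current equality *)

  (* the Boolean s of Trans/Congr selects the orientation of Γ(i) *)
  let orient s (t, t') = if s then (t, t') else (t', t)

  let step st = function
    | Refl y -> { st with eq = (y, y) }
    | Push i -> { st with gamma = IMap.add i st.eq st.gamma }
    | Trans (i, s) ->
        let t, t' = orient s (IMap.find i st.gamma) in
        let x, u = st.eq in
        if u = t then { st with eq = (x, t') }
        else failwith "Trans: current equality does not match"
    | Congr (i, p, s) ->
        (match st.eq, orient s (IMap.find i st.gamma) with
         | (x, Apply (f, args)), (Var a, Var a') when List.nth args p = a ->
             let args' = List.mapi (fun k b -> if k = p then a' else b) args in
             { st with eq = (x, Apply (f, args')) }
         | _ -> failwith "Congr: current equality does not match")

  (* evaluating a command list implements the transitive closure →* *)
  let eval st prf = List.fold_left step st prf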

Proof Generation. Proof generation follows closely [16], where the proof-producing prover maintains a proof forest that keeps track of the reasons why two nodes are merged. Besides the usual merge and find operations, the data structure has a new operator explain(a, b, forest) which outputs a proof that a = b based on forest. In our case, proofs are lists of commands, while in the original approach they were unsatisfiable unordered sets of assumptions. We show below the proof forest corresponding to the example of Section 2.2. Trees represent equivalence classes and each edge is labelled by assumptions. The prover updates the forest with each merge. Two distinct classes can be merged for two reasons: an equality between variables is added or two terms are equal by congruence.

[Proof forest of the example: edge y–x labelled (11) x = y; edge z–t0 labelled (12) t0 = z; edge t6–z labelled (18) z = t6; edge t3–t5 labelled (1) f(y) = t3 and (2) f(x) = t5; edge t8–t9 labelled (4) f(t6) = t8 and (5) f(z) = t9.]

Suppose for example that the problem contains (2) $f(x) = t_5$ and (1) $f(y) = t_3$, and we add the equality (11) $x = y$. First, we have to add an edge between $x$ and $y$, labelled by the reason of this merge, i.e., assumption (11). Then, we have to add an edge between $t_3$ and $t_5$, and label it with the two assumptions that triggered that merge by congruence, i.e., (1) and (2). To output a proof that two variables are equal, we travel the path between the two corresponding nodes, and each edge yields a list of commands. An edge labelled by an equality corresponds to a simple transitivity: $t_6 \xrightarrow{(18)} z$ yields

[Trans(18, true)]


An edge labelled by two equalities makes use of the congruence: $t_3 \xrightarrow{(1)(2)} t_5$ yields

[Trans(1, false); Congr(11, 1, true); Trans(2, true)]

If the equality that triggered the congruence was discovered by EUF and is not an assumption, we have to explain it, and then update the environment accordingly, using the Push command. This could lead to factorisation issues. We can ensure that any intermediate result is checked only once during proof production, but this may not be enough. We may want to ensure that any connection between variables, reflected by an edge in the proof forest, is only checked once, but this is trickier.

Alternative EUF Checker. We now briefly expose a second EUF proof verifier, which aims at maximum factorisation of subproofs. The proof forest maintained by our proof-producing prover is a compact array-based structure, on which it is very easy and efficient to check equalities of variables while sharing subproofs. Arrays may be a sensitive data structure depending on the proof verification context. In Coq, for example, only functional-style arrays are provided, and they may not behave like traditional arrays. But if our checker is able to efficiently manipulate arrays, the proof forest itself is a fine proof. To check an equality a = b, the checker only has to travel between the trees to ensure that the nodes corresponding to the variables a and b are in the same equivalence class, i.e., have the same root. During this computation of the root of a node, any node on the path can store that information. Once the checker is aware of the root of a node, it doesn't have to compute it again, hence a high rate of subproof sharing if we can ensure that any edge in the forest is only crossed once. The forest being linear in the number of assumptions, we achieve linear complexity in checking. The checker algorithm mimics the initial congruence closure algorithm, without any decision making or reordering of the forest. We take the forest for granted and fail as soon as it does not reflect a needed equality. In particular, the choice of the roots is made by the prover, and the checker relies on it. With this simplification comes the reduction of algorithmic complexity. A Nelson-Oppen compatible EUF checker needs to be incremental. We need to check equalities between variables, then to assert equalities discovered by other theories, and then to check more equalities. Fortunately, the proof forest obtained at the end of a Nelson-Oppen cycle reflects its history, i.e., a path between two variables only uses equalities asserted or discovered earlier. We can compute the temporary root of a node, instead of its real root, by stopping as soon as an edge in the forest is not labelled by an available assumption. We can then check early equalities without breaking any temporal constraint, and unroot nodes as soon as a new assumption is available. This second proof system is checked using different data structures, namely arrays. Depending on the tools available, one could choose either a very efficient checker, or a checker that does not rely on arrays. The switch between checkers is easy as long as both implement the primitives needed by the Nelson-Oppen checker.
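The core of this alternative checker fits in a few lines. In the OCaml sketch below (illustrative names, non-incremental version), the forest is the array parent, a root points to itself, and the array root memoizes the result of every traversal so that each edge is crossed at most once over all queries; two variables are provably equal iff their roots coincide.

  let make_checker (parent : int array) =
    let n = Array.length parent in
    let root = Array.make n (-1) in          (* -1: root not yet known *)
    let rec find x =
      if root.(x) >= 0 then root.(x)
      else begin
        let r = if parent.(x) = x then x else find parent.(x) in
        root.(x) <- r;                       (* memoize along the path *)
        r
      end
    in
    fun x y -> find x = find y               (* check the equality x = y *)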

3.5 Proof Checking and Generation for LRA

In this section we introduce the proof system for LRA and describe a proof-producing procedure. Literals are of the form $e \bowtie 0$ with $e$ a linear expression manipulated in (Horner) normal form and $\bowtie \in \{\geq, >, =\}$.

Proof System. For linear real arithmetic, Farkas' lemma provides a sound and complete notion of proof that a conjunction of linear constraints is unsatisfiable [17, Corollary 7.1e]. The following proof system allows one to prove an inequality with a list of commands (a Farkas proof). Each command is a pair Mul(c, i) with $c$ a coefficient (in type Z) and $i$ the index of an assumption in the current assumption set. Such a


Mul(c, i) command is used below in a judgement $\Gamma \vdash e \bowtie 0 \xrightarrow{Mul(c,i)} e' \bowtie' 0$ with $\bowtie$ and $\bowtie'$ in $\{\geq, >\} \cup \{=\}$. $\Gamma$ is the current set of assumptions and $e' \bowtie' 0$ is the new inequality that is deduced.

$$\frac{c > 0 \qquad \Gamma(i) \mapsto e' \geq 0}{\Gamma \vdash e \bowtie 0 \xrightarrow{Mul(c,i)} (c\,[*]\,e'\,[+]\,e) \bowtie 0} \qquad \frac{\Gamma(i) \mapsto e' = 0}{\Gamma \vdash e \bowtie 0 \xrightarrow{Mul(c,i)} (c\,[*]\,e'\,[+]\,e) \bowtie 0}$$

$$\frac{c > 0 \qquad \Gamma(i) \mapsto e' > 0}{\Gamma \vdash e \bowtie 0 \xrightarrow{Mul(c,i)} (c\,[*]\,e'\,[+]\,e) > 0}$$

The operators $[*], [+], [-]$ model the standard arithmetic operations but maintain the normalised form of the LRA expressions. The previous rules follow the standard sign rules in arithmetic: for example, if $e'$ is non-negative we can add it $c$ times to the expression $e$ of the inequality $e \bowtie 0$, assuming $c$ is strictly positive. Contrary to the EUF checker of Section 3.4, the LRA checker does not change the assumption set $\Gamma$; this difference motivates the use of a different type of judgement. It is completely transparent to the Nelson-Oppen checker as long as the judgement $\Gamma_i \vdash_{T_i} prf_i : (\Gamma_i', eqs)$ is implemented. The transitive closure of the previous relations allows one to prove an inequality with a list of commands. It is formalised with the following rules.

$$\frac{}{\Gamma \vdash nil : 0 \geq 0} \qquad \frac{\Gamma \vdash (c_1 :: \cdots :: c_{n-1}) : e \bowtie 0 \qquad \Gamma \vdash e \bowtie 0 \xrightarrow{c_n} e' \bowtie' 0}{\Gamma \vdash (c_1 :: \cdots :: c_{n-1} :: c_n) : e' \bowtie' 0}$$

An LRA proof is then either a proof of $0 > 0$, given by a list of commands, or a proof of $x = y$, given by two lists of commands (one for $x - y \geq 0$ and one other for $y - x \geq 0$).

type lra_proof =
  | LRA_False of command list
  | LRA_Eq of command list * command list

$$\frac{\Gamma \vdash l : 0 > 0}{\Gamma \vdash_{LRA} LRA\_False(l) : (\Gamma, nil)} \qquad \frac{\Gamma \vdash l_1 : e \geq 0 \qquad e = x\,[-]\,y \qquad \Gamma \vdash l_2 : [-]e \geq 0}{\Gamma \vdash_{LRA} LRA\_Eq(l_1, l_2) : (\Gamma, [x = y])}$$

Proof Generation. In order to produce Farkas proofs efficiently, we can use the Simplex algorithm used in Simplify [10]. This variant of the standard linear programming algorithm does not require all the variables to be non-negative, and directly handles inequalities (strict or not) and equalities. Each time a contradiction is found, one line of the Simplex tableau gives us the expected Farkas coefficients. The algorithm is also able to discover new equalities between variables. In this case again, the two expected Farkas proofs are read from the current tableau, up to trivial manipulations.
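A checker for Farkas proofs is essentially a fold over the command list. The OCaml sketch below uses integer coefficients and a map from variable names to coefficients for linear expressions; both choices, and all names, are illustrative simplifications (the paper's expressions are in Horner normal form), not the actual implementation.

  module SMap = Map.Make (String)

  type rel = Ge | Gt | EqZ                       (* e ≥ 0, e > 0, e = 0 *)
  type lin = { cst : int; coefs : int SMap.t }   (* constant + monomials *)

  let scale c e =
    { cst = c * e.cst; coefs = SMap.map (fun k -> c * k) e.coefs }

  let add e e' =
    { cst = e.cst + e'.cst;
      coefs = SMap.union (fun _ a b -> Some (a + b)) e.coefs e'.coefs }

  (* one command Mul(c, i): [gamma] maps indices to assumptions (e', r') *)
  let mul gamma (e, r) (c, i) =
    let e', r' = List.assoc i gamma in
    let r'' = match r', r with
      | EqZ, _ -> r                 (* equality: no sign condition on c *)
      | Gt, _ when c > 0 -> Gt      (* strict assumption gives  > 0     *)
      | Ge, _ when c > 0 -> r
      | _ -> failwith "Mul: sign condition violated"
    in
    (add (scale c e') e, r'')

  (* a Farkas proof starts from 0 ≥ 0; LRA_False expects the final
     result to be the contradiction 0 > 0 *)
  let check_farkas gamma cmds =
    List.fold_left (mul gamma) ({ cst = 0; coefs = SMap.empty }, Ge) cmds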

4 Experiments

For our approach to be viable, we first need to make sure that proof generation is feasible. For the moment, our goal is not to evaluate the proof verifier; hence, to get an idea of what we can expect at best, we used a high-performance solver instead of a solver complying with the proof systems presented in Sections 3.4 and 3.5. For this reason we were able to test proof generation for linear integer arithmetic, whose proof system is left as further work.

Prototype. The SMT-LIB 2.0 standard defines scripts to be run by solvers. First, one declares the logic used and the types of the terms, then asserts formulas and checks for satisfiability with a check-sat command. The standard also defines utility commands to obtain more than a verdict from the solver. A solver can implement a get-model command, which outputs a valuation of the variables validating

a satisfiable formula, and a get-unsat-core command, which outputs an unsatisfiable subformula. Our scheme would benefit from a get-conflict-clauses command, to obtain the conflict clauses discovered during the search, but we can already use get-model and get-unsat-core to emulate the simple search described in Section 2.1, with a SAT solver to discover models of the propositional abstraction and an SMT solver to obtain the unsatisfiability cores of the formulas corresponding to models. Once the conflict clauses have been discovered, we can build their proofs using the proof-producing prover of our choice.

We have implemented our proof scheme in OCaml, the OCaml programme being in charge of the abstraction and the communication with the SMT-LIB 2.0 compatible, off-the-shelf SAT and SMT solvers (we chose Z3 for both). We have isolated several parts of this lazy SMT loop and distinguished accordingly four times of importance:

• the time spent solving the propositional abstraction and the conflict clauses, and obtaining the propositional models;
• the time spent obtaining the unsatisfiability cores from the models;
• the time spent obtaining the propositional proof of unsatisfiability;
• the time spent obtaining the proofs of the conflict clauses.

The sum of these four times is the proof generation time. We estimated these times by re-launching Z3 on the scripts generated by our OCaml programme. We do not take into account the running time of our OCaml programme, whose only role was to make the SAT solver and the SMT solver communicate. We have launched this proof-producing prover on SMT-LIB benchmarks to measure the times described earlier and the number of conflict clauses discovered for each benchmark. We compared these measures with the time of a direct run of Z3 on the same benchmark, referred to as direct solve time, to understand the overhead induced by our scheme. We also counted the number of atoms of each conflict clause to evaluate the stress put on the multi-theory conjunction solver. We call overhead factor the ratio

$$\text{overhead factor} = \frac{\text{generation time}}{\text{direct solve time}}$$
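For illustration, here is the kind of SMT-LIB 2.0 script such a driver emits, built by a small hypothetical OCaml helper; real scripts also name every assertion with (! ... :named ...) so that get-unsat-core can refer to them.

  let script_for ~logic ~assertions ~query =
    String.concat "\n"
      ([ Printf.sprintf "(set-logic %s)" logic;
         "(set-option :produce-unsat-cores true)" ]
       @ List.map (Printf.sprintf "(assert %s)") assertions
       @ [ "(check-sat)";
           (match query with
            | `Model -> "(get-model)"          (* SAT-solver role    *)
            | `UnsatCore -> "(get-unsat-core)" (* theory-solver role *)) ])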

Results. We used 574 unsatisfiable unquantified formulas from the SMT-LIB benchmarks, combining uninterpreted functions with linear real arithmetic (QF_UFLRA) or linear integer arithmetic (QF_UFLIA). Eight of the benchmarks hit the timeout of 1000 seconds. They belong to the same category (QF_UFLIA/wisas). The only 3 benchmarks with more than 2000 conflict clauses, and which took the longest time to prove, belong to that category too. We believe that theory propagation is needed to solve them efficiently. As soon as theory propagation can be encoded by conflict clauses, it is expressible in our proof system, but this would require a tighter integration with an SMT solver. To evaluate the overhead factor of our approach we sort the benchmarks by overhead factor and draw in the right-hand side graphic of Figure 3 one point per benchmark, with the overhead factor on Y and the benchmark index in the list of benchmarks on X. On the left-hand side graphic we do the same with the proof generation time. For 2/3 of the benchmarks the overhead of the generation time w.r.t. the solving time is less than 10. For only 3%, the overhead climbs up to more than 100. For certain applications such as interactive theorem proving, wall-clock time is the critical factor, not the overhead. If we only consider benchmarks that take more than a tenth of a second to be solved, 4% have an overhead factor greater than 100. These cases represent 1.5% of the whole dataset. Looking only at the generation time, 91% of the proofs are generated in less than 3 seconds, and 96% in less than 30 seconds.


[Figure 3: proof generation time in seconds (left) and overhead factor (right), one point per benchmark, plotted against formula index on log scales.]

Maybe surprisingly, for some benchmarks the generation time is inferior to the direct solve time, resulting in an overhead factor inferior to 1. These are benchmarks solved by our prototype without any conflict clause, the abstraction being faster to solve and prove than the initial formula. Overall, proof generation went quite well, considering how naive our prototype is. We can expect the overhead factor to vary less with each theory reasoning we take into account; but with only conflict clauses and no preprocessing, a lot of formulas can be certified in a reasonable amount of time. For each benchmark, the number of conflict clauses varies between 0 (for 326 benchmarks of the QF_UFLRA category) and 29873 (only 3 benchmarks have more than 2000 clauses), the mean being 318.5 conflicts per benchmark, and 86% of the benchmarks raise fewer than 100 conflicts. The mean size of the conflicts is 5.6 atoms per conjunction; therefore, we expect the proof generation of the conflict clauses to account for a small part of the whole generation time.

In Figure 4 we consider the percentage of the generation time spent proving the conflict clauses.

[Figure 4: weight of the conflict clauses proof generation in the whole generation time, one point per benchmark, plotted against formula index on a log scale.]

In 84% of the benchmarks, the proof generation of the conflict clauses accounts for less than 10% of the generation time, and at most it accounts for less than 20% of the generation time. For this reason it seems that the proof-producing multi-theory prover is not the bottleneck of our process, and we can focus on the quality of the proofs rather than the efficiency of the prover.

Overall, once we have reduced a T-conjunction to its unsat-core, the remaining formula is very short and easy to prove.

5 Related Work

For his Proof Carrying Code framework, Necula pioneered the area of proof-generating decision procedures [14]. In his Touchstone theorem prover [15], Necula needed to derive complete proof terms in a unified language. In our approach, each decision procedure comes with its own proof language, thus allowing one to choose the level of detail to be put in the proofs. Several authors have examined EUF proofs [8, 16]. They extend a pre-existing decision procedure with proof-producing mechanisms without degrading its complexity and achieving a certain level of irredundancy. However, their notion of proof is reduced to unsatisfiable cores of literals rather than proof trees. Our proof generation builds on such works to produce detailed explanations. Like several modern SMT solvers (CVC3, veriT), the solver Z3 has its own proof language [6]. It contains a lot of rules reflecting its internal reasoning with different levels of precision, some rules detailing each computation step, some others accounting for complex reasoning with no further details. Our approach advocates a strict discipline in the way the proof is conducted but simplifies its proof-checking. Moreover, we believe that SMT solvers could generate proofs in our proof system without too much hassle when certain optimisations are disabled. Previous work has been devoted to reconstructing SMT solver proofs in proof assistants. McLaughlin et al. [13] have combined CVC Lite and HOL Light for quantifier-free first-order logic with equality, arrays and linear real arithmetic. Ge and Barrett have continued that work with CVC3 and have extended it to quantified formulas and linear integer arithmetic. This approach highlighted the difficulty for proof reconstruction to compete with straightforward implementations of decision procedures in HOL. Independently, Fontaine et al. [12] have combined haRVey with Isabelle/HOL for quantifier-free first-order formulas with equality and uninterpreted functions. Their scheme includes Isabelle solving of EUF sub-proofs with hints provided by haRVey. Our EUF proof system is more detailed and does not require any decision on the checker side. Böhme and Weber [4] have built a proof reconstruction of Z3 proofs in the theorem provers Isabelle/HOL and HOL4. Their implementation is particularly efficient but their fine profiling shows that a lot of time is spent re-proving sub-goals for which the Z3 proof does not give sufficient details.

6 Conclusion and Perspectives

We have presented a proof system for multi-theory unquantified first-order formulas that relies on theory-specific proofs. We have developed uninterpreted functions and linear real arithmetic checkers, and combined them using a Nelson-Oppen checker. The proof format of any theory can be changed as long as a checker is provided, with no modification of the combination scheme. We have examined the feasibility of proof generation based on state-of-the-art SMT solvers, and implemented simple proof-producing provers to test proof generation for our EUF and LRA proof systems and combinations of them. Our prover uses an extended union-find algorithm [16] for EUF and a Simplex algorithm [10] for LRA. The checkers for EUF, LRA and the generic Nelson-Oppen combination have been developed and proved in Coq to provide a new reflexive decision procedure. As further work we intend to instantiate the framework further and examine checkers and proof systems for non-convex theories such as the theory of linear integer arithmetic and the theory of arrays. The Nelson-Oppen verifier is generic enough to handle such theories but we still need to design specialised checkers and examine proof generation. The experiments have shown that handling conflict clauses is not always enough to solve formulas in a reasonable time with a reasonable amount of resources, and

we need to explore other kinds of theory reasoning to shorten the proof search. Closer interaction with SMT solvers and access to theory propagation decisions would be very beneficial for our proofs because theory propagation can readily be encoded in our proof system.

References

[1] C. Barrett, A. Stump, and C. Tinelli. The SMT-LIB standard: Version 2.0, 2010.
[2] C. Barrett and C. Tinelli. CVC3. In Proc. of CAV 2007, volume 4590 of LNCS, pages 298–302. Springer, 2007.
[3] A. Biere. PicoSAT essentials. Journal on Satisfiability, Boolean Modeling and Computation (JSAT), 4(2-4):75–97, 2008.
[4] S. Böhme and T. Weber. Fast LCF-style proof reconstruction for Z3. In Proc. of ITP 2010, volume 6172 of LNCS, pages 179–194. Springer, 2010.
[5] T. Bouton, D. C. B. de Oliveira, D. Déharbe, and P. Fontaine. veriT: an open, trustable and efficient SMT-solver. In Proc. of CADE 2009, LNCS. Springer, 2009.
[6] L. M. de Moura and N. Bjørner. Proofs and Refutations, and Z3. In Proc. of the LPAR 2008 Workshops, Knowledge Exchange: Automated Provers and Proof Assistants, volume 418. CEUR-WS.org, 2008.
[7] L. M. de Moura and N. Bjørner. Z3: An efficient SMT solver. In Proc. of TACAS 2008, volume 4963 of LNCS, pages 337–340. Springer, 2008.
[8] L. M. de Moura, H. Rueß, and N. Shankar. Justifying equality. ENTCS, 125(3):69–85, 2005.
[9] L. M. de Moura, H. Rueß, and M. Sorea. Lazy theorem proving for bounded model checking over infinite domains. In Proc. of CADE 2002, volume 2392 of LNCS, pages 438–455. Springer, 2002.
[10] D. Detlefs, G. Nelson, and J. B. Saxe. Simplify: a theorem prover for program checking. J. ACM, 52(3):365–473, 2005.
[11] B. Dutertre and L. de Moura. The Yices SMT solver. Tool paper at http://yices.csl.sri.com/tool-paper.pdf, 2006.
[12] P. Fontaine, J.-Y. Marion, S. Merz, L. P. Nieto, and A. F. Tiu. Expressiveness + automation + soundness: Towards combining SMT solvers and interactive proof assistants. In Proc. of TACAS 2006, volume 3920 of LNCS, pages 167–181. Springer, 2006.
[13] S. McLaughlin, C. Barrett, and Y. Ge. Cooperating theorem provers: A case study combining HOL-Light and CVC Lite. ENTCS, 144(2):43–51, 2006.
[14] G. C. Necula. Compiling with Proofs. PhD thesis, Carnegie Mellon University, 1998.
[15] G. C. Necula and P. Lee. Proof generation in the Touchstone theorem prover. In Proc. of CADE 2000, volume 1831 of LNCS, pages 25–44. Springer, 2000.
[16] R. Nieuwenhuis and A. Oliveras. Proof-producing congruence closure. In Proc. of RTA 2005, volume 3467 of LNCS, pages 453–468. Springer, 2005.
[17] A. Schrijver. Theory of Linear and Integer Programming. Wiley, 1998.
[18] Princeton University. zChaff. http://www.princeton.edu/~chaff/zchaff.html.
[19] A. Van Gelder. Verifying RUP proofs of propositional unsatisfiability. In Elec. Proc. of ISAIM 2008.
[20] A. Van Gelder. Verifying propositional unsatisfiability: Pitfalls to avoid. In Proc. of SAT 2007, Lisboa, Portugal, 2007.
[21] A. Van Gelder. Verifying RUP proofs of propositional unsatisfiability. In Proc. of ISAIM 2008, Fort Lauderdale, 2008. http://isaim2008.unl.edu/index.php?page=proceedings.
[22] L. Zhang. Validating SAT solvers using an independent resolution-based checker: Practical implementations and other applications. In Proc. of DATE 2003, pages 10880–10885, 2003.

A Flexible Proof Format for SMT: a Proposal∗

Frédéric Besson
INRIA Rennes – Bretagne Atlantique, France
[email protected]

Pascal Fontaine
University of Nancy and INRIA, Nancy, France
[email protected]

Laurent Théry
INRIA Sophia-Antipolis – Méditerranée, France
[email protected]

Abstract The standard input format for Satisfiability Modulo Theories (SMT) solvers has now reached its second version and integrates many of the features useful for users to interact with their favourite SMT solver. However, although many SMT solvers do output proofs, no standardised proof for- mat exists. We, here, propose for discussion at the PxTP Workshop a generic proof format in the SMT-LIB philosophy that is flexible enough to be easily recast for any SMT solver. The format is configurable so that the proof can be provided by the solver at the desired level of detail.

1 Introduction

Satisfiability Modulo Theories (SMT) consists in deciding the satisfiability of formulae belonging to a combination of theories. Over the past few years, the quality of SMT-provers has greatly improved. This is evaluated at SMT-COMP, the annual competition for SMT. Current SMT-provers are highly optimised and engineered tools that are capable of deciding formulae of industrial size. For the moment, the output of most SMT-provers is just the simple answer: sat or unsat. This information is enough to evaluate their speed and relative soundness – especially when the status of formulae is known beforehand. But this may not be sufficient, particularly when trusting the SMT solver is not an option, as in a skeptical cooperation of solvers. The purpose of the current proposal is to tackle this problem and propose a generic proof format for SMT-provers. Compared to existing approaches, our objective is to aim at a format that is sufficiently generic so that:

- any SMT-solver could generate a proof without too much effort;
- the proof could be checked by a trustworthy external verifier.

Our assessment is that previous attempts have produced formats that are either easy to generate but hard to check, or hard to generate and easy to check. For instance, the SMT-solver clsat generates proofs that are already genuine proof objects of the logical framework LF [9]. This approach is very challenging and intellectually attractive. Yet efficient generation and checking of LF proofs is still an open research area. Unlike [9], we advocate for a clear distinction between the proof generated by the SMT-solver and the proof object that would be built by the checker. As a pay-off, our proof format should be easier to produce and not require a substantial re-engineering of the SMT-solver. There are other proof-generating SMT-provers such as CVC3 [2], veriT [4] and Z3 [6, 5]. For those, the proof format is not totally formalised and proof reconstruction in a skeptical proof assistant is not a trivial task [8, 3]. One difficulty lies in the fact that certain proof steps of the SMT-provers are kept implicit and do not appear in the proof trace. We aim at providing a proof format that is proof-assistant friendly.

∗This work was funded by the ANR Decert project.


1.1 Proof of satisfiability/unsatisfiability

When a formula is satisfiable, a checkable account of the satisfiability is a detailed model of the formula, such that every term, atom, literal, and sub-formula has a precise value. Giving this detailed model may be problematic when handling formulae with quantifiers. When the formula is unsatisfiable, SMT-solvers derive an inconsistency from the original formula. A proof of unsatisfiability is thus a checkable derivation from the original formula to an object which is trivially inconsistent. The context of such a proof is the set of all deduced facts. Initially, the context contains the original formula. Logical rules are applied to derive new formulae from formulae in the context. These new formulae are then added to the context. At the end of the proof, the empty clause (noted ()) should belong to the final context. Our proof format aims at providing a clear interface between SMT-solvers that generate proofs and checkers that verify the correctness of the generated proofs. We voluntarily restrict ourselves to a specific and limited fragment of the logics current SMT-solvers can deal with. This will let us experiment rapidly and get feedback on how the format should evolve and be improved. As a result, we are considering quantifier-free formulae with uninterpreted functions, equality and linear arithmetic only. In particular, we do not take quantifiers into account. This means that tasks such as skolemization and instantiation are not covered yet. The work of the solver is usually composed of several phases that the proof format has to address:

- First, the formula may be rewritten: the formula is transformed to a somewhat simpler and equivalent one. This is done by identifying sub-formulae and sub-terms that clearly can be rewritten to simpler equivalent forms. Provers implement this by some rewriting rules, and those rewriting rules should have their derivation counterparts.

- SMT-solvers, being based on SAT-solver technology, also require conjunctive normal forms (CNF). The transformation phase that translates the original formula to a set of clauses also requires a proof. Each clause of the CNF becomes a new fact added to the context. From those clauses, new clauses can be derived using resolution. Resolution is a complete method for the satisfiability of propositional formulae.

- Reasoning about theories can be understood as adding conflict clauses to the SAT-solver. These conflict clauses are tautologies according to the theories in action. To each theory will correspond a set of rules that will add clauses in the context. Equality exchange between different parts of the reasoner can simply be seen as resolution.

1.2 Rationale for the proof format

The SMT-LIB 2.0 format [1] is the de facto input standard of the SMT community. It provides a set of commands for interacting with solvers. While designing our proof format, we have been careful to remain fully compatible with the SMT-LIB format. Other existing formats like TSTP [10] could have been valid candidates to represent SMT proofs, but we wanted a format whose syntax was a direct extension of the SMT-LIB syntax. For this reason, we inherit the syntax of terms and formulae but also certain well-formedness conditions from the SMT-LIB. This does not prevent the possibility of developing a translator to other formats like TSTP in the future. The core of the proof format specifies the syntax and semantics for the existing SMT-LIB command get-proof – currently left unspecified. In our opinion, a fixed proof format is not viable because of the variety of existing SMT-provers. Our proposal is a proof format that is built upon a generic kernel enriched by prover-specific proof rules that can be obtained by a dedicated get-proof-header command (see Section 2.2).


Overall, the proof format is a trade-off between the requirements of proof consumers, who want an unambiguous syntax and semantics, and those of SMT-solver developers, who favour the freedom to output proofs in a structure that their solvers can generate without sacrificing efficiency. In Section 2.5, we list a set of recommendations for getting maximal benefit from the proof format. The rest of the paper is organised as follows. Section 2 explains the syntax of the proof format. The format is illustrated by an example discussed in detail in Section 3. Section 4 provides an operational semantics of the format in the form of a reference implementation of a proof-checker.

2 Syntax

2.1 Data structures

This document inherits the syntax for terms (and thus formulae) from the SMT-LIB 2.0 standard [1]. Furthermore, clauses are used in several places. A clause is a list of formulae. For instance, ((= x y) (= (f x) (f y))) represents a clause which is the disjunction of the formula x = y and the formula f(x) = f(y). The empty clause is the empty list (). For convenience, one can use the trivially valid clause (true).

2.2 Proof header

The format is generic and therefore has to be instantiated with solver-specific proof rules. To ease proof reconstruction by third-party tools, proof rules are declared as belonging to a certain SMT-LIB logic, with an attached informal description. The header may be output on request by the SMT solver with a supplementary command get-proof-header, an addition to SMT-LIB 2.0. An example of such a header is given in Figure 2. Its syntax is:

⟨rule_tag_def⟩ ::= (define-rule-kw ⟨keyword⟩ ⟨string⟩)
⟨rule_def⟩ ::= (⟨rule_id⟩ ⟨logic⟩ ⟨attribute⟩∗)
⟨header⟩ ::= ⟨rule_tag_def⟩∗ ⟨rule_def⟩∗

where ⟨attribute⟩ is defined as in SMT-LIB 2.0. Note that attribute values can be ⟨s_expr⟩, so they could possibly contain code (for instance for proof checkers). All keywords used as attributes in a ⟨rule_def⟩ should be defined; there is currently no predefined keyword. Keywords are used to identify collections of rules. For example, they could be used to qualify rules that handle quantifiers, skolemization, or specific operations on connectives (e.g., conjunctions). A rule declared with a given ⟨logic⟩ should be such that SMT solvers implementing only that SMT-LIB 2.0 logic are able to verify every clause deduced by the rule.
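As a small sketch of this grammar (the keyword :quantifiers and the rule name forall_inst below are invented for illustration and are not prescribed by the format), a header could declare a keyword and tag a rule with it:

    ; declare a keyword identifying a collection of rules
    (define-rule-kw :quantifiers "rules handling quantifiers and skolemization")
    ; a hypothetical solver-specific rule tagged with that keyword
    (forall_inst AUFLIA :quantifiers
      :comment "instantiates a universally quantified unit clause"
      :clauses 1 :terms 1 :conclusion 1)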

2.3 Proof script

An example proof script is presented in Figure 3. A proof script is a pair made of a proof context and a sequence of proof steps:

⟨proof⟩ ::= ⟨context⟩ ⟨proof_step⟩∗

where the context is an SMT-LIB 2.0 script using only the commands set-logic, declare-sort, define-sort, declare-fun, define-fun, and assert. The context is generally just a subset of the SMT-LIB 2.0 input script. Furthermore, in asserts, in order to identify formulae, we require them to be explicitly named using the following SMT-LIB 2.0 notation:

(assert (! ⟨term⟩ :named ⟨clause_id⟩))

where ⟨term⟩ is the formula asserted, written in the SMT-LIB 2.0 syntax, and ⟨clause_id⟩ is an SMT-LIB 2.0 ⟨symbol⟩. A proof step is defined by the following rule:

⟨proof_step⟩ ::= (define ⟨term_id⟩ ⟨term⟩)
              |  (set ⟨clause_id⟩ ⟨gen_clause⟩)
              |  (seth ⟨clause_id⟩ ⟨clause⟩)

where

- (define ⟨term_id⟩ ⟨term⟩) declares ⟨term_id⟩ as a short name for the term ⟨term⟩.

- (set ⟨clause_id⟩ ⟨gen_clause⟩) constructs a new clause named ⟨clause_id⟩ using the clause generation rule ⟨gen_clause⟩ (see below). The identifier ⟨clause_id⟩ is an SMT-LIB 2.0 ⟨symbol⟩ and is used to refer to the clause in the current environment.

- (seth ⟨clause_id⟩ ⟨clause⟩) is used to assert the hypotheses of a sub-proof. Again, ⟨clause_id⟩ is an SMT-LIB 2.0 ⟨symbol⟩.

Note that this version of the proof format does not cover quantifiers yet, so ⟨term⟩ is just a ground term.
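To fix intuitions, here is a minimal sequence of proof steps; the rule name some_rule and all clause and term names are invented for this sketch and are not part of the format:

    ; introduce a short name for a shared term
    (define t1 (f (f a)))
    ; derive a clause from c1 with a hypothetical solver-specific rule
    (set c2 (some_rule :clauses (c1) :conclusion ((= t1 a))))
    ; assert a local hypothesis (only meaningful inside a subproof)
    (seth h1 ((= a b)))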

2.4 Derived clauses

A clause can either be explicit or derived. Here we describe how it is derived:

⟨gen_clause⟩ ::= ⟨clause_id⟩
              |  (⟨rule_id⟩ (:clauses (⟨gen_clause⟩∗) | :all-clauses)?
                            (:terms (⟨term⟩∗))?
                            ⟨attribute⟩∗
                            (:conclusion ⟨clause⟩)?)
              |  (subproof ⟨proof_step⟩∗ (:conclusion ⟨clause⟩)?)

There exist three ways to derive a clause:

- ⟨clause_id⟩ is used to retrieve a clause formerly assigned a ⟨clause_id⟩ in the current environment (where ⟨clause_id⟩ is an SMT-LIB 2.0 ⟨symbol⟩).

- A clause can be derived using a (named) rule; an optional list of premise clauses, an optional list of term arguments, and an optional resulting clause can be provided, as exemplified after this list. When :clauses is used, the conclusion is a logical consequence of the given set of clauses. When :all-clauses is used, the conclusion is a logical consequence of an unspecified set of local hypotheses. When neither :clauses nor :all-clauses is used, the rule introduces a tautology.


- A local proof can be declared. For simplicity, only local proofs from an empty context are supported; references to the current context in a local proof are not allowed. The clause attached to a local proof is composed of all the literals of all clauses introduced by seth in the proof steps (these literals are negated) together with the literals of the clause given after the :conclusion attribute. When this attribute is omitted, the clause attached to the last step is taken as the conclusion.
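The following snippet illustrates the three forms of ⟨gen_clause⟩, reusing rules from Figure 2 and clause names in the style of Figure 3 (the concrete clauses are ours, for illustration only):

    ; 1. retrieve a clause by name
    c4
    ; 2. introduce a tautology (no :clauses attribute)
    (eq_transitive :conclusion ((not (= a b)) (not (= b c)) (= a c)))
    ; 3. derive a logical consequence of given clauses
    (resolution :clauses (c1 c2) :conclusion ((p a)))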

2.5 Recommendations

Figure 1 presents the whole grammar of the format. The format is very flexible and tries to avoid unnecessary constraints. Nevertheless, some obvious recommendations can already be made in order to allow effective validation by third-party tools:

- rules should be carefully described in the header;

- term sharing should be maximised in order to improve memory management;

- all the hypotheses in local proofs should be specified first.

These are recommendations and not requirements since external tools could always refactor proofs in order to meet these recommendations.

⟨rule_tag_def⟩ ::= (define-rule-kw ⟨keyword⟩ ⟨string⟩)
⟨rule_def⟩ ::= (⟨rule_id⟩ ⟨logic⟩ ⟨attribute⟩∗)
⟨header⟩ ::= ⟨rule_tag_def⟩∗ ⟨rule_def⟩∗

⟨proof⟩ ::= ⟨context⟩ ⟨proof_step⟩∗

⟨proof_step⟩ ::= (define ⟨term_id⟩ ⟨term⟩)
              |  (set ⟨clause_id⟩ ⟨gen_clause⟩)
              |  (seth ⟨clause_id⟩ ⟨clause⟩)

⟨gen_clause⟩ ::= ⟨clause_id⟩
              |  (⟨rule_id⟩ (:clauses (⟨gen_clause⟩∗) | :all-clauses)?
                            (:terms (⟨term⟩∗))?
                            ⟨attribute⟩∗
                            (:conclusion ⟨clause⟩)?)
              |  (subproof ⟨proof_step⟩∗ (:conclusion ⟨clause⟩)?)

Figure 1: The complete proof format

3 An example

In Figure 3, we present a proof in our format of the following formula

¬((a = c) ∧ (b = c) ∧ ((f(a) ≠ f(b)) ∨ (p(a) ∧ ¬p(b))))


For readability, the proof does not use the term-sharing facility of the format. It starts with the context, which is just a subset of the SMT-LIB 2.0 input script. Every following command introduces a new clause (with the set command) until the final clause, the empty clause (). The proof relies on elementary rules. The description of these rules can be obtained using the command (get-proof-header), which could return (among other rule definitions) the definitions in Figure 2. In those definitions, the :clauses, :terms and :conclusion attributes respectively give the expected number of clauses (−1 if arbitrary), terms, and conclusions (0 if no conclusion is provided, that is, if the checker is expected to recompute the conclusion).

The and rule is used to deduce new clauses from conjunctive unit clauses: every conjunct of a conjunctive unit clause can itself be introduced as a unit clause, as are clauses c2, c3 and c4 in the example in Figure 3. The and_pos rule generates a tautology of the form ¬(a_1 ∧ ... ∧ a_n) ∨ a_j for some j (see clauses c5 and c6). The or rule introduces a clause from a disjunctive unit clause, by simply building the clause of the disjuncts (e.g. clause c7). Clauses c2 to c7 constitute the CNF of the input formula. Notice that the CNF transformation is not a naive one, yet it does not explicitly introduce new (Tseitin) Boolean variables. Indeed, as a side effect of the fact that clauses are sets of formulae (and not just literals), it is easy to obtain a definitional CNF transformation without introducing new Boolean variables: the formulae themselves stand for those Boolean variables. With sharing, the CNF would be linear with respect to the size of the initial formula.

The rules eq_congruent, eq_congruent_pred, and eq_transitive introduce equational tautologies (for instance c8, c9, c15 and c16). The resolution rule implements chain resolution. Note that this rule serves several purposes here: it is used for resolution performed within the SAT-solver (e.g. to deduce c18), and it is used to build conflict clauses from generic tautologies (e.g. to build c10).

The subproof construct (see Figure 1) can always be inlined, maybe at the cost of adding more proof rules. It does not change the expressive power of the proof format but allows for more structured proofs. The subproof construct can also be used to minimise the number of proof rules. For instance, in the proof script of Figure 3, the clause c5 can alternatively be obtained by the following sub-proof.

(set c5 (subproof
  (seth h (and (p a) (not (p b))))
  (set res (and :clauses (h) :conclusion (p a)))))

Note that the conclusion of a sub-proof, if omitted, can be reconstructed exactly as soon as the proof rule for the last derived clause in the sub-proof has an explicit conclusion. Here, the obtained clause is therefore ((not (and (p a) (not (p b)))) (p a)).

4 Proof checking

In this section, we provide a reference implementation of a proof verifier for the generic proof format. The proof verifier is parametrised by prover-specific proof rules declared in the proof header. As a result, the validity of the proof verifier relies on the fact that proof rules are logically sound, i.e., only derive clauses that are logical consequences of the currently known clauses. The proof verifier takes as input a proof π ∈ ⟨context⟩ × ⟨proof_step⟩∗ (see Section 2.3) and verifies that the proof steps indeed derive the empty clause, i.e., that the conjunction of the formulae asserted by the ⟨context⟩ part of the proof is unsatisfiable. The state of the verifier is a quadruple (Sig, N, S, cl) where

- Sig is an SMT-LIB signature built from the commands declare-sort and declare-fun of the proof context. Its purpose is to ensure that terms are well-sorted.


; get-proof-header returns

(and_pos QF_UF :comment "valid clause ((not (and a_1 ... a_n)) a_i)" :clauses 0 :terms 0 :conclusion 1)

(and QF_UF :comment "(and :clauses (c) :conclusion (a_i)) where c = ((and a_1 ... a_n))" :clauses 1 :terms 0 :conclusion 1)

(or QF_UF :comment "(or :clauses (c) :conclusion (a_1 ... a_n) ) where c = ((or a_1 ... a_n))" :clauses 1 :terms 0 :conclusion 1)

(eq_transitive QF_UF :comment "valid clause ((not (= x_1 x_2)) ... (not (= x_{n-1} x_n)) (= x_1 x_n))" :clauses 0 :terms 0 :conclusion 1)

(eq_congruent QF_UF :comment "valid clause ((not (= x_1 y_1)) ... (not (= x_n y_n)) (= (f x_1 ... x_n) (f y_1 ... y_n)))" :clauses 0 :terms 0 :conclusion 1)

(eq_congruent_pred QF_UF :comment "valid clause ((not (= x_1 y_1)) ... (not (= x_n y_n)) (not (p x_1 ... x_n)) (p y_1 ... y_n))" :clauses 0 :terms 0 :conclusion 1)

(resolution QF_UF :comment "Chain resolution of any number of clauses" :clauses -1 :terms 0 :conclusion 1)

Figure 2: The proof header

- N is a mapping from symbols to constants and function declarations. It is initialised by the define-fun context commands. Later on, for sharing purposes, it is updated by the proof command define. Note that unlike define-fun, the define command can only define constants¹.

- S is a stack of named assertions tagged with a boolean flag. The assert commands of the context construct an initial singleton stack in which all the asserted formulae are tagged true. During the proof, derived clauses introduced by the set command are also tagged true. The tag false is reserved for clauses introduced by the seth command; those clauses are local hypotheses to be discharged by sub-proofs.

¹ For more flexibility, this restriction might be lifted in the future.


; get-proof returns

; Context

(set-logic QF_UF)
(declare-sort U 0)
(declare-fun p (U) Bool)
(declare-fun f (U) U)
(declare-fun a () U)
(declare-fun b () U)
(declare-fun c () U)
(assert (! (and (= a c) (= b c)
                (or (not (= (f a) (f b))) (and (p a) (not (p b)))))
           :named c1))

; Proof

(set c2 (and :clauses (c1) :conclusion ((= a c))))
(set c3 (and :clauses (c1) :conclusion ((= b c))))
(set c4 (and :clauses (c1) :conclusion ((or (not (= (f a) (f b))) (and (p a) (not (p b)))))))
(set c5 (and_pos :conclusion ((not (and (p a) (not (p b)))) (p a))))
(set c6 (and_pos :conclusion ((not (and (p a) (not (p b)))) (not (p b)))))
(set c7 (or :clauses (c4) :conclusion ((not (= (f a) (f b))) (and (p a) (not (p b))))))
(set c8 (eq_congruent :conclusion ((not (= b a)) (= (f a) (f b)))))
(set c9 (eq_transitive :conclusion ((not (= b c)) (not (= a c)) (= b a))))
(set c10 (resolution :clauses (c8 c9) :conclusion ((= (f a) (f b)) (not (= b c)) (not (= a c)))))
(set c11 (resolution :clauses (c10 c2 c3) :conclusion ((= (f a) (f b)))))
(set c12 (resolution :clauses (c7 c11) :conclusion ((and (p a) (not (p b))))))
(set c13 (resolution :clauses (c5 c12) :conclusion ((p a))))
(set c14 (resolution :clauses (c6 c12) :conclusion ((not (p b)))))
(set c15 (eq_congruent_pred :conclusion ((not (= b a)) (p b) (not (p a)))))
(set c16 (eq_transitive :conclusion ((not (= b c)) (not (= a c)) (= b a))))
(set c17 (resolution :clauses (c15 c16) :conclusion ((p b) (not (p a)) (not (= b c)) (not (= a c)))))
(set c18 (resolution :clauses (c17 c2 c3 c13 c14) :conclusion ()))

Figure 3: A simple example with its proof.

- cl is the last derived clause. It is initialised to the trivially valid clause (true) and, for a valid proof, is eventually set to the empty clause ().

The proof checking succeeds if the empty clause is eventually generated in a singleton stack in which all the formulae are tagged true. The proof verifier is executed in a state constructed by running the context commands. After running the context commands, the proof state is such that:

- The signature Sig contains the sort and function declarations (Sig is thereafter immutable);

- N contains the constant and function declarations (during the proof checking, only new constants can be defined);

- S is a singleton stack of the form {n_1 ↦ cl_1^tt, ..., n_i ↦ cl_i^tt}, where each binding (n_j, cl_j) is the result of the assert command

(assert (! f_j :named n_j))

cl_j being the clause containing the sole formula f_j.

4.1 Conventions and auxiliary functions

The verifier makes use of auxiliary functions and predicates. Several of them are already part of SMT-LIB 2.0, where they are used to give a semantics to SMT-LIB scripts. The pseudo-code does not introduce an explicit abstract syntax for the proof constructs. Instead, we directly use the concrete syntax, with the convention that optional attributes are given a default value. More precisely, missing attributes of list type (for instance :clauses, :terms, :attributes) default to the empty list, and a missing attribute of clause type (for instance :conclusion) defaults to the symbol null.

4.1.1 Well-formedness

As in SMT-LIB, we only consider well-formed terms.

Definition 1 (Well-formed terms). A term t is well-formed with respect to a signature Sig and an environment of names N (isWellFormed(Sig, N, t)) if

- all the symbols in the term are bound in N;

- the term is well-sorted according to the signature Sig.

To ensure that a well-formed term will never be invalidated, the signature Sig and the environment N can only be augmented by additional declarations. It is therefore forbidden to overwrite a declaration. In case of violation, the verification is aborted.

4.1.2 Assertion stack

Proof scripts feature a notion of subproof to describe scoped proofs. As already stated, this construct does not increase the expressive power of the format. Its purpose is to structure proofs and thus facilitate their checking. For instance, each theory reasoner of the SMT solver can generate a sub-proof for each theory lemma or conflict clause. The advantage of sub-proofs is that such theory lemmas can be checked in isolation. To implement scoped proofs, the assertion stack is updated by push and pop operations². Given a tagged clause cl^b, clause(cl^b) returns the clause cl and tag(cl^b) returns the boolean b. The verifier always accesses and updates the topmost assertion set. We write top[s] for the clause bound to the symbol s in the topmost assertion set. Accessing a non-existent symbol aborts the verification. Overwriting a clause tagged with ff also aborts the verification, i.e., an assignment of the form top[s] := cl^b where tag(top[s]) = ff aborts the verification.

² They correspond to the commands (push 1) and (pop 1) of SMT-LIB.


4.1.3 Prover-specific proof rules

The verifier is parametrised by prover-specific proof rules that are trusted and could therefore be responsible for an invalid proof. Each proof rule is implemented by a function named according to the ⟨rule_id⟩, taking as arguments an optional list of clauses, terms and attributes, and either returning a clause or reporting an error. An error immediately aborts the verification of the proof. The soundness requirement is that the generated clause must be a logical consequence of the clauses passed as arguments. A proof rule also takes as arguments the signature Sig and the environment of bindings N:

⟨rule_id⟩ : Sig × N × ⟨clause⟩∗ × ⟨term⟩∗ × ⟨attribute⟩∗ → ⟨clause⟩ | ⟨error⟩
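As a sketch of how one such rule function could look, here is a possible implementation of the and rule of Figure 2, written in the style of the pseudo-code of Section 4.2; the helper predicates isUnitConjunction and isConjunctOf are hypothetical names introduced for this sketch:

    clause and(Sig, N, cls, trms, attrs, concl) {
      /* cls must contain exactly one clause, a unit clause ((and a_1 ... a_n)) */
      requires(length(cls) = 1 & isUnitConjunction(Sig, N, first(cls)));
      /* concl must be a unit clause whose formula is a conjunct a_i of that conjunction */
      if concl = null or not isConjunctOf(concl, first(cls)) then abort;
      return concl;
    }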

4.2 Pseudo-code of the verifier

The verifier is written in an imperative style and therefore updates the proof state (Sig, N, S, cl) in place. The verifier is presented top-down and consists of four functions:

- checker is the top-level function and evaluates a whole proof. Upon success, it concludes that the input problem is not satisfiable.

- eval_proof evaluates all the proof steps in turn and updates the proof state accordingly.

- eval_proof_step performs a case analysis over the proof command.

- eval_gen_clause is responsible for deriving new logical consequences from the proof state by combining prover-specific proof rules.

A proof is valid if, after executing the proof steps, the clause cl is the empty clause () and there are no remaining local hypotheses (all clauses are tagged tt).

bool checker(Π) {
  eval_proof(Π);
  return (cl = ()) & (∀i, tag(top[i]) = tt);
}

The function eval_proof executes the proof steps in turn.

void eval_proof(Π) {
  for-each (π ∈ Π)
    eval_proof_step(π);
}

The function eval_proof_step performs a case analysis and updates the last derived clause.

void eval_proof_step(π) {
  switch (π) {
    case (define tid trm):
      requires(isWellFormed(trm));
      N[tid] := trm;

    case (set cid gencl):
      cl := eval_gen_clause(gencl);
      top[cid] := cl^tt;

    case (seth cid hyp):
      requires(isWellFormed(hyp) & hasSortBool(hyp));
      top[cid] := hyp^ff;
  }
}


The core of the verifier is the function eval_gen_clause, which generates a new clause that is a logical consequence of the current proof state.

clause eval_gen_clause(gencl) {
  switch (gencl) {
    case clid:
      return clause(top[clid]);

    case (rid :clauses (gc1, ..., gci) :terms (t1, ..., tk)
              :attributes (a1, ..., an) :conclusion concl):
      /* Recursively generate the premise clauses */
      gcs := (eval_gen_clause(gc1), ..., eval_gen_clause(gci));
      /* Call the prover-specific proof rule */
      cl := rid(Sig, N, gcs, (t1, ..., tk), (a1, ..., an), concl);
      /* Check the conclusion, if any */
      if concl != null & concl != cl then abort;
      return cl;

    case (subproof prf :conclusion concl):
      push();                       /* local assertion set */
      eval_proof(prf);
      if concl != null & cl != concl then abort;
      /* Generation of the conflict clause */
      conflict := (not(h1), ..., not(hn), cl)
        where {h1, ..., hn} = { clause(top[i]) | tag(top[i]) = ff };
      pop();
      return conflict;
  }
}
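Since the grammar of Figure 1 lets :clauses contain nested ⟨gen_clause⟩ applications, eval_gen_clause evaluates premises recursively. As an illustration (our own rewriting of a step from Figure 3, not an additional feature of the format), clause c10 could equivalently be derived with its first premise inlined:

    (set c10 (resolution
      :clauses ((eq_congruent :conclusion ((not (= b a)) (= (f a) (f b))))
                c9)
      :conclusion ((= (f a) (f b)) (not (= b c)) (not (= a c)))))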

5 Acknowledgements

The work presented here has been funded by the French ANR Decert initiative. This format is the result of many stimulating conversations with its members. We want to thank Aaron Stump for his remarks on a previous version of this document, and for numerous informal discussions. We also thank the anonymous reviewers for their insightful comments.

6 Conclusion and further work

The proof format presented in this paper is an extension of the SMT-LIB 2.0 format. We have tried as much as possible to find a good compromise between what can be output by an SMT-solver and what is needed for a simple checker to be able to verify a proof. We are aware that this format is not perfect and that some subtle issues still need to be further investigated and discussed. Nevertheless, we hope that the format can serve as a common basis for the SMT community. The format is currently being implemented in the veriT solver, distributed as open source under the BSD licence.

We are also currently investigating proofs for quantifier reasoning [7]; skolemization, as a satisfiability-preserving transformation which does not preserve logical equivalence, may raise difficult issues. The CNF transformation presented here does not require Tseitin variables; if such variables are required by other solvers, then similar issues may also appear for the CNF transformation.

In Section 4, we presented the code of a proof checker. A standalone checker for the format is one of our long-term goals. However, in addition to the infrastructure presented in Section 4, this requires implementing the term data structure and the operators to manipulate terms. This amounts to reimplementing (parts of) a kernel like those found in mainstream proof assistants. We are currently implementing a module to replay proofs in the present format within Coq, and we plan to study replaying this proof format within LFSC.

References

[1] C. Barrett, A. Stump, and C. Tinelli. The SMT-LIB standard: Version 2.0, 2010. Latest official release of Version 2.0 of the SMT-LIB standard.
[2] C. Barrett and C. Tinelli. CVC3. In CAV '07, volume 4590 of LNCS, pages 298–302. Springer, 2007.
[3] S. Böhme and T. Weber. Fast LCF-style proof reconstruction for Z3. In ITP 2010, volume 6172 of LNCS, pages 179–194. Springer, 2010.
[4] T. Bouton, D. C. B. de Oliveira, D. Déharbe, and P. Fontaine. veriT: an open, trustable and efficient SMT-solver. In CADE '09, volume 5663 of LNCS, pages 151–156. Springer, 2009.
[5] L. M. de Moura and N. Bjørner. Proofs and refutations, and Z3. In Proc. of the LPAR 2008 Workshops, volume 418 of CEUR Workshop Proceedings, 2008.
[6] L. M. de Moura and N. Bjørner. Z3: an efficient SMT solver. In TACAS '08, volume 4963 of LNCS, pages 337–340. Springer, 2008.
[7] D. Déharbe, P. Fontaine, and B. Woltzenlogel Paleo. Quantifier inference rules for SMT proofs, 2011. Workshop on Proof eXchange for Theorem Proving (PxTP).
[8] S. McLaughlin, C. Barrett, and Y. Ge. Cooperating theorem provers: a case study combining HOL-Light and CVC Lite. In PDPAR '05, volume 144(2) of ENTCS, pages 43–51. Elsevier, 2006.
[9] A. Stump and D. L. Dill. Faster proof checking in the Edinburgh Logical Framework. In CADE-18, volume 2392 of LNCS, pages 392–407. Springer, 2002.
[10] G. Sutcliffe, J. Zimmer, and S. Schulz. TSTP data-exchange formats for automated theorem proving tools. In Distributed Constraint Problem Solving and Reasoning in Multi-Agent Systems, pages 201–215, 2004.

Designing Proof Formats: A User's Perspective
— Experience Report —

Sascha Böhme                        Tjark Weber
Technische Universität München      University of Cambridge

Abstract

Automatic provers that can produce proof certificates do not need to be trusted. The certificate can be checked by an independent tool, for example an LCF-style proof assistant such as Isabelle/HOL or HOL4. Currently, the design of proof formats is mostly dictated by internal constraints of automatic provers and less guided by applications such as checking of certificates. In the worst case, checking can be as involved as the actual proof search, simply because important information is missing in the proof certificate. To address this and other issues, we describe design choices for proof formats that we consider both feasible for implementors of automatic provers and effective in simplifying the checking of certificates.

1 Introduction

The development of automatic theorem provers has seen much improvement in recent years, yet there is reason to doubt their soundness [6, 7]—the ultimate criterion for such tools. Solvers for propositional satisfiability (SAT), for quantified Boolean formulae (QBF), for satisfiability modulo theories (SMT), and automatic provers for first-order logic implement complicated algorithms. Hence, it is not surprising that bugs happen. As C. A. R. Hoare famously put it in his 1980 Turing Award Lecture [15]: "There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies and the other way is to make it so complicated that there are no obvious deficiencies. The first method is far more difficult." State-of-the-art provers rely on sophisticated heuristics and ingenious optimizations to achieve striking performance. There is little hope of making them obviously correct.

A solution to this issue lies in requesting proof certificates from automatic provers that can be checked by an independent tool (Section 2). This approach pushes the soundness issue to the checker, which may be simple enough to be "obviously correct" or trustworthy for other reasons, for instance because it has been verified. In previous work we have used LCF-style proof assistants (Section 3), which are based on a small trusted kernel, to implement efficient checkers for SAT [25], QBF [17, 24] and SMT solvers [4], and have been working with proofs from first-order provers [3]. Our integrations achieved two goals. First, they provide high correctness assurances for these automatic provers. Second, they increase the degree of proof automation available in the LCF-style proof assistants Isabelle/HOL [20] and HOL4 [14].

The certificate-based approach is theoretically appealing, but our experience shows that there are practical pitfalls. Currently, the design of proof formats is mostly dictated by internal constraints of automatic provers. Checking should have lower complexity than proof search, yet proof certificates sometimes lack essential information, so that checking requires costly search already performed by the prover. Checking should be simple, yet proofs employ powerful inference steps with unclear semantics. Checking the results of a second prover should require little extra effort, yet similar provers use very different proof formats. We report on our findings to alleviate the process of checking proof certificates, both technical and conceptual (Section 4), and abstract from specific problems of particular provers.



2 Automatic Provers and Their Proof Formats

The demand for proofs has been perceived many times by the automated reasoning community. Most automatic provers can now generate proof certificates. It is impossible to give an exhaustive list of systems, but we briefly review the state of proof formats for important classes of provers.

State-of-the-art SAT solvers expect their input to be in conjunctive normal form. Proofs of unsatisfiability can, therefore, be given as a sequence of resolution steps. Several SAT solvers, e.g., MiniSat and zChaff [25], define proof formats based on this observation. Despite the conceptual simplicity of resolution proofs, however, no two solvers seem to use the exact same proof format. Fostering the development of proof-producing SAT solvers, SAT competitions since 2005 have included a special Verified UNSAT track, where several different resolution-based proof formats are accepted. Among them, the RUP format [11] is relatively concise, while other (more verbose) formats can be checked more easily. Noting that simplicity of the checker is essential to obtain trustworthy results, Allen Van Gelder proposed that the language of certificates should be recognizable in deterministic log space [10].

Proofs of invalidity for quantified Boolean formulae are typically based on Q-resolution [8], an extension of propositional resolution. Proof formats for satisfiability show more diversity, with techniques including model generation and Skolemization. A recent overview can be found in [19]. Jussila et al. suggested a unified QBF proof format [16]. However, their calculus may not be rich enough to capture important algorithmic features found in certain QBF solvers. The SAT 2011 Competition will use another Q-resolution-based format, QIR [12], for proofs of invalidity. The QIR format can be checked in deterministic log space. Whether solver developers will adopt this new format remains to be seen.

Several SMT solvers produce proofs, e.g., clsat [21], CVC3 [2], Fx7 [18], veriT [5] and Z3 [9]. Each of them has a very distinct proof format. Only clsat and Fx7 specifically target efficient formal proof checking. The proof calculus underlying the CVC3, veriT and Z3 proof formats is natural deduction; CVC3 uses hundreds of inference rules, whereas less than 40 rules—some of which compress several reasoning steps—suffice for Z3. Fx7 uses natural deduction with a rewriting-based flavor, and clsat is based on the Edinburgh Logical Framework (LF). LF has been suggested as the basis for a standardized SMT proof format, as it promises efficient proof checking and generality, but its adoption is pending.

The first-order prover community has devised the TSTP proof format [23] as a standard for their proofs, although several first-order provers continue to provide certificates in their own formats. TSTP prescribes the syntax of proof certificates, and it is general enough to potentially cover certificates for other classes of provers. It does not, however, define or restrict the set of allowed inferences.

3 LCF-Style Proof Assistants

The term LCF-style [13] describes proof assistants based on a small inference kernel. Theorems are implemented as an abstract data type, and the only way to construct new theorems is through a fixed set of functions (corresponding to the underlying logic's axiom schemata and inference rules) provided by this data type, thus ensuring a small trusted code base. Proof procedures based on an LCF-style kernel cannot produce unsound theorems, as long as the implementation of the theorem data type is correct.

Using LCF-style proof assistants like Isabelle/HOL [20] and HOL4 [14] as proof checkers is instructive for three reasons. First, these systems implement rich logics, e.g., simply-typed higher-order logic with schematic polymorphism. Their language contains the subsets and extensions of first-order logic implemented by major automatic provers. We do not explicitly consider proof formats for higher-order logic in this paper, although much of our advice in Section 4 likely is transferable. Since proof assistants have not been optimized for a specific subset or background theory, they provide a generic testbed for the implementation of checkers.


Second, LCF-style proof assistants require that all inferences are validated by their kernel. Isabelle can even construct persistent proof terms during validation, but this feature is optional and not investigated here. Soundness errors in the proof checker, therefore, are extremely unlikely. No handwaving is possible: the checker must know all data required to make the necessary calls to kernel functions. Proof formats that lack essential information will be found out.

Third, various automatic proof procedures are available in Isabelle/HOL and other LCF-style proof assistants, notably a simplifier, which performs term rewriting, a decision procedure for propositional logic, tableau- and resolution-based first-order provers, and decision procedures for Presburger arithmetic and real algebra. Despite their simple, trustworthy inference kernel, LCF-style systems can, therefore, serve as checkers for arbitrarily complex inference rules and proof formats.

4 Guidelines for Proof Formats

The primary goals in designing a proof format are that generating, storing and checking certificates should be easy and fast. Unfortunately these goals are conflicting: Small certificates are easier to generate and store, but may not contain enough information to be checked efficiently. Based on our experience in using LCF-style proof assistants as checkers for various proof formats, we suggest more detailed guidelines for the design of new proof formats.

Use an existing format. There is general agreement among developers of automatic provers about standardized input formats. For instance, there is the DIMACS format for SAT solvers, its QDIMACS extension for QBF solvers, the SMT-LIB language for SMT solvers, and the TPTP language for first-order provers. These formats are widely supported by automatic provers. It is hence easy to run different provers on the same problem, and to exchange inferior provers for better ones.

When it comes to proof formats, the situation is very different. Few standards have emerged (see Section 2), and most automatic provers currently use their own format. This is unfortunate: implementing an independent checker for any one format can require a considerable amount of time and effort. In the absence of standardized formats, users of proof certificates become locked in to specific provers. To avoid this, we need standardized proof formats in much the same fashion as we have standardized input formats.

Of course, a balance must be struck between standardization and progress. Designing proof formats for some classes of provers, e.g., for SMT solvers, is very much an active research area. We do not mean to discourage genuine advances in proof formats, but we want to point out the detrimental effect that the Not Invented Here syndrome has when prover developers needlessly reinvent concrete syntax. Sharing proof formats across provers will enable more re-use of checkers and other proof-based tools, and a better comparison of different automatic provers.

Provide a human-readable (yet easy to parse) representation. There are various ways for an automatic prover to make its proofs available to other tools, e.g., via a binary API, or via exporting proofs to binary or text files. An in-memory API may provide rich functionality, combined with short access times, and binary files may provide a storage-efficient representation. In our experience, the perceived benefits of in-memory APIs and binary formats for proof exchange are small. Access times are usually insignificant when compared to runtimes for proof generation and proof checking. Even the size of proofs is rarely much of an issue, given that modern hard disk drives can routinely store terabytes of data. If necessary, generic or domain-specific data compression techniques [1] can be employed.


Moreover, binary formats are surprisingly brittle: solvers and their interfaces keep changing. Documentation in practice is often outdated and incomplete. Bugs happen. This can make a significant amount of reverse engineering necessary to understand any binary proof format. In-memory APIs also require bridging different programming languages (between the prover and the checker), which can introduce additional complications.

Thus, in our experience, human-readable text files are clearly easiest to use. They facilitate proof debugging and provide a clear separation of tools. For instance, they make it easy to set up a remote prover that sends its certificates to a checker via HTTP [3, 22]. To simplify parsing of proofs, a standardized data format language, e.g., JSON, YAML or S-expressions, can be used.
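To make the last point concrete, here is one and the same hypothetical resolution step rendered in two of the data format languages just mentioned; neither encoding is prescribed by any existing solver. In S-expression syntax, in the style of the SMT proof format proposed earlier in this volume:

    (set c3 (resolution :clauses (c1 c2) :conclusion ((p a))))

and the same step in an invented JSON schema:

    {"name": "c3", "rule": "resolution", "clauses": ["c1", "c2"], "conclusion": ["(p a)"]}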

Take theoretical considerations into account. Some proof formats are little more than quick hacks: prover developers tend to include information in certificates only if it can readily be extracted from the prover's internal data structures. This sometimes leads to proof formats that contain redundant information, which could easily be reconstructed in the checker. More often, however, and much more gravely, it leads to proofs that contain an insufficient amount of detail. For instance, some proof-producing SMT solvers provide no justification for theorems derived by their theory-specific decision procedures.

Proof checking should not require significant amounts of search. While automatic provers may use sophisticated decision procedures and heuristics to find proofs, checkers should not have to re-implement those heuristics. Instead, proof checking should be reasonably easy—perhaps in deterministic log space, but certainly at most polynomial in the size of the proof. For a good proof format, considerations about complexity should take precedence over ease of prover implementation.

The complexity-theoretic perspective also provides a good guideline against including too much detail in proofs. It is not necessary to include every decision taken in the prover, if the result of that computation can instead be validated by a succinct (but easily checkable) certificate.

Use simple, canonical semantics. Proofs should not contain surprises. The number of different inference rules in a proof format will likely depend on the automatic prover: SAT solvers currently use just one rule (propositional resolution), while SMT solvers employ dozens or even hundreds of rules. Clearly, this can be attributed to the richer input language supported by the latter. But no matter what the number of rules is, each one should have simple and straightforward semantics.

There is an analogy between inference rules in proofs and functions in programming—in fact, a checker will probably use the latter to implement the former. Small and focused inference rules are to be preferred, and complex rules should be broken into several smaller ones. Special cases and small, local optimizations are best avoided. These may be justified in the implementation of a prover, but they should not complicate the prover's interface—i.e., the proof format.

Use declarative proofs. Proofs apply inference rules to derive formulas. If each rule has clear, deterministic semantics (cf. the previous paragraph), the derived formulas could be left implicit. In our experience, however, the semantics of inference rules is often poorly defined and ambiguous. Are formulas treated modulo "obvious" equivalences, such as idempotence of logical connectives or α-equivalent renaming of bound variables? Are formulas implicitly transformed into a normal form of some kind? Are there corner cases where rules have unexpected semantics? The answers are prover-dependent and rarely obvious. Therefore, it is tremendously helpful when proofs contain not only inference rules, but also explicitly state the derived formulas. Declarative proofs facilitate proof debugging and enable checking of intermediate results. They make it possible to pinpoint errors in the proof (or in the checker) and to detect them early.
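As an illustration, reusing the S-expression proof format discussed earlier in this volume, where the :conclusion attribute is optional: both steps below are legal, but only the second is declarative in the above sense, since it states the derived clause explicitly.

    (set c13 (resolution :clauses (c5 c12)))
    (set c13 (resolution :clauses (c5 c12) :conclusion ((p a))))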


Provide exhaustive documentation. Independent of the proof format chosen, ample documentation should be provided. We have seen well-designed proof formats that were hard to check simply because the semantics of individual features was unclear, and no documentation gave any clue. Both the concrete and abstract syntax of the proof format, as well as its semantics, should be described in full detail. As an additional benefit, doing this thoroughly will likely help to identify proof features and inference rules that are overly complex.

Many automatic provers include a pre-processing phase to simplify the input and transform it into some canonical form. Pre-processing steps are rarely included in proof certificates: they typically do not involve search, so a checker can mimic them without additional information. Precisely for this reason, however, a prover's pre-processing phase needs to be documented as part of its proof format description.

An open-source reference implementation of a checker for the proof format is desirable, but it goes without saying that the source code of the checker (or of the automatic prover, if available) should not be considered sufficient documentation of the implemented proof format.

5 Conclusion

Automatic provers are complex software tools whose correctness is difficult to verify. Proof certificates that can be checked independently promise a solution to this issue. Certificates have several applications, from deciding the status of individual benchmark problems with great confidence to integrating automatic provers into LCF-style proof assistants without enlarging their trusted code base. Such integrations have been found to be fruitful for both worlds: proof assistants benefit from increased automation, and automatic provers reach new domains and receive feedback as well as bug reports.

Currently, proof formats are usually designed by prover developers. Few independent proof checkers have been implemented, and the user community for proof certificates appears to be small. Consequently, the design of proof formats is mostly dictated by internal constraints of automatic provers and less guided by applications. We hope that by providing a user's perspective, this experience report can make a contribution to an informed debate over future proof formats.

A major nuisance for implementing checkers is the prevalent lack of standardized proof formats. There are only a handful of substantially different proof calculi in practical use, but hardly any two provers implement the same proof format. This situation is unfortunate, as implementing independent checkers, especially efficient ones, tends to be involved. Overcoming the lack of standardization is probably more of a social than a technical challenge. We believe that standard proof formats could advance the use of certificates as much as standard input formats have advanced the use of automatic provers in general.

Research into proof formats is continuing, and we look forward to better formats being proposed. We also expect to see new applications of proof certificates emerge, for instance in proof mining or machine learning. Our guidelines were designed to simplify proof checking; new applications will undoubtedly lead to additional requirements.

Acknowledgments. We acknowledge funding from EPSRC grant EP/F067909/1.

References

[1] H. Amjad. Data compression for proof replay. J. Autom. Reasoning, 41(3-4):193–218, 2008.
[2] C. Barrett and C. Tinelli. CVC3. In W. Damm and H. Hermanns, editors, Computer Aided Verification, volume 4590 of LNCS, pages 298–302. Springer, 2007.
[3] J. C. Blanchette, S. Böhme, and L. C. Paulson. Extending Sledgehammer with SMT solvers. In Automated Deduction, 2011. To appear.


[4] S. Böhme and T. Weber. Fast LCF-style proof reconstruction for Z3. In M. Kaufmann and L. Paulson, editors, Interactive Theorem Proving, volume 6172 of LNCS, pages 179–194. Springer, 2010.
[5] T. Bouton, D. C. B. de Oliveira, D. Déharbe, and P. Fontaine. veriT: an open, trustable and efficient SMT-solver. In R. A. Schmidt, editor, Automated Deduction, volume 5663 of LNCS, pages 151–156. Springer, 2009.
[6] R. Brummayer and A. Biere. Fuzzing and delta-debugging SMT solvers. In Satisfiability Modulo Theories, 2009.
[7] R. Brummayer, F. Lonsing, and A. Biere. Automated testing and debugging of SAT and QBF solvers. In O. Strichman and S. Szeider, editors, Theory and Applications of Satisfiability Testing, volume 6175 of LNCS, pages 44–57. Springer, 2010.
[8] H. K. Büning, M. Karpinski, and A. Flögel. Resolution for quantified Boolean formulas. Information and Computation, 117(1):12–18, 1995.
[9] L. M. de Moura and N. Bjørner. Proofs and refutations, and Z3. In P. Rudnicki, G. Sutcliffe, B. Konev, R. A. Schmidt, and S. Schulz, editors, International Workshop on the Implementation of Logics, volume 418 of CEUR Workshop Proceedings. CEUR-WS.org, 2008.
[10] A. V. Gelder. Verifying propositional unsatisfiability: Pitfalls to avoid. In J. Marques-Silva and K. A. Sakallah, editors, Theory and Applications of Satisfiability Testing, volume 4501 of LNCS, pages 328–333. Springer, 2007.
[11] A. V. Gelder. Specification for reverse unit-propagation proof files, version 1.3, 2009. Retrieved May 5, 2011 from http://users.soe.ucsc.edu/~avg/ProofChecker/fileformat_rup.txt.
[12] A. V. Gelder. The QBF proof format: QIR, version 1.0, 2010. Retrieved May 5, 2011 from http://users.soe.ucsc.edu/~avg/ProofChecker/fileformat_qir.txt.
[13] M. Gordon, R. Milner, and C. P. Wadsworth. Edinburgh LCF: A Mechanised Logic of Computation, volume 78 of LNCS. Springer, 1979.
[14] M. J. C. Gordon and A. M. Pitts. The HOL logic and system. In Towards Verified Systems, volume 2 of Real-Time Safety Critical Systems Series, chapter 3, pages 49–70. Elsevier, 1994.
[15] C. A. R. Hoare. The emperor's old clothes. Communications of the ACM, 24(2):75–83, 1981.
[16] T. Jussila, A. Biere, C. Sinz, D. Kröning, and C. M. Wintersteiger. A first step towards a unified proof checker for QBF. In J. Marques-Silva and K. A. Sakallah, editors, Theory and Applications of Satisfiability Testing, volume 4501 of LNCS, pages 201–214. Springer, 2007.
[17] R. Kumar and T. Weber. Validating QBF validity in HOL4. In Interactive Theorem Proving, 2011. To appear.
[18] M. Moskal. Rocket-fast proof checking for SMT solvers. In C. R. Ramakrishnan and J. Rehof, editors, Tools and Algorithms for the Construction and Analysis of Systems, volume 4963 of LNCS, pages 486–500. Springer, 2008.
[19] M. Narizzano, C. Peschiera, L. Pulina, and A. Tacchella. Evaluating and certifying QBFs: A comparison of state-of-the-art tools. AI Communications, 22(4):191–210, 2009.
[20] T. Nipkow, L. C. Paulson, and M. Wenzel. Isabelle/HOL — A Proof Assistant for Higher-Order Logic, volume 2283 of LNCS. Springer, 2002.
[21] D. Oe, A. Reynolds, and A. Stump. Fast and flexible proof checking for SMT. In Satisfiability Modulo Theories, pages 6–13. ACM, 2009.
[22] G. Sutcliffe. System description: SystemOnTPTP. In D. McAllester, editor, Automated Deduction, volume 1831 of LNAI, pages 406–410. Springer, 2000.
[23] G. Sutcliffe, J. Zimmer, and S. Schulz. TSTP data-exchange formats for automated theorem proving tools. In W. Zhang and V. Sorge, editors, Distributed Constraint Problem Solving and Reasoning in Multi-Agent Systems, volume 112 of Frontiers in Artificial Intelligence and Applications, pages 201–215. IOS Press, 2004.
[24] T. Weber. Validating QBF invalidity in HOL4. In M. Kaufmann and L. C. Paulson, editors, Interactive Theorem Proving, volume 6172 of LNCS, pages 466–480. Springer, 2010.
[25] T. Weber and H. Amjad. Efficiently checking propositional refutations in HOL theorem provers. Journal of Applied Logic, 7(1):26–40, 2009.

Quantifier Inference Rules for SMT proofs∗

David Déharbe
Universidade Federal do Rio Grande do Norte, Natal, Brazil
[email protected]

Pascal Fontaine                                 Bruno Woltzenlogel Paleo
University of Nancy and INRIA, Nancy, France    Technische Universität Wien, Vienna, Austria
[email protected]                             [email protected]

Abstract

This paper discusses advantages and disadvantages of some possible alternatives for inference rules that handle quantifiers in the proof format of the SMT-solver veriT. The quantifier-handling modules in veriT being fairly standard, we hope this will motivate the discussion among the PxTP audience around proof production for quantifier handling. This could generate ideas to help us improve our proof production module, and also benefit the SMT community.

1 Introduction

In the typical architecture of an SMT-solver, the core automated reasoner is a propositional SAT-solver, and quantifiers are handled by independent modules [8]. In veriT [4], essentially universal quantifiers are handled by an instantiation module, which heuristically chooses terms to instantiate such quantified variables. The instantiation module is called on demand, as rarely as possible (to reduce the number of generated instances) and only on essentially universally quantified subformulas. Essentially existential quantifiers, on the other hand, are handled by a skolemization module that is called only in a pre-processing phase and replaces all the essentially existentially quantified variables by skolem terms.

Currently, these modules are not proof-producing: if the input problem contains quantifiers that require skolemization, the proof produced by veriT will take as starting point the skolemized formula. If the instantiation module is called, generated instances will be used to deduce unsatisfiability, and the proof produced by veriT will contain holes. This paper discusses advantages and disadvantages of possible inference rules to handle quantifiers in the proof format of the SMT-solver veriT. We believe veriT's instantiation module behaves mostly like those in other solvers that handle quantifiers, e.g. CVC3 [2] or Z3 [5]. We thus believe that the following discussion is relevant in the larger context of SMT solving for quantified formulas.

We aim at developing inference rules for skolemization and instantiation that take into account the following requirements:

• Proof size: the proofs produced by the skolemization and by the instantiation modules should be as short as possible, relative to the size of the formula that needs to be skolemized or instantiated.

• Faithfulness to the inner workings of the quantification modules: the proposed inference rules should reflect what actually happens inside the solver, so that they can also be used for precisely tracing executions; from a tool engineering perspective, this is important for debugging, profiling and maintainability.

• Ease of programming: our solver has a generic framework for proof production; it is desirable that the new inference rules comply with this framework.

∗This work was supported by the ANR DeCert and the SMT-SAVeS Projects.


• Compatibility with the proof format: veriT's input format follows the SMT-LIB standard; its output format also obeys the proposed proof format [3]. Hence the inference rules should be expressible in this format.

• Fine-grainedness and simplicity: the inference rules should describe instantiation and skolemization in steps that are as small and simple as possible.

• Generality: inference rules having a broader range of applicability should be preferred to over-specialized inference rules that can be used only for specific purposes.

• Elegance: the proposed inference rules should fit into the style of rules already existing in the solver.

• Complexity of proof-checking: it should be possible for an external proof checker to efficiently check instances of the proposed inference rules.

• User-friendliness: the proposed inference rules should be suitable for the users of the solver and their applications.

In this paper we discuss some alternative inference rules, focusing on the more objective and more easily measurable criteria mentioned above (e.g. proof size, fine-grainedness and complexity of proof-checking). We leave the more subjective criteria for future work.

2 The Proof Format

veriT's proof format follows a proposed format [3] in the philosophy of the SMT-LIB standard [1]. Its grammar is partially shown below. Clauses are sets of arbitrary formulas (not only literals), and inference rules have an arbitrary number of clauses as premises and a single clause as conclusion. Optionally, an inference rule may also take terms and attributes as arguments.

⟨gen_clause⟩ ::= ⟨clause_id⟩
              |  (⟨rule_id⟩ (:clauses (⟨gen_clause⟩∗) | :all-clauses)?
                            (:terms (⟨term⟩∗))?
                            ⟨attribute⟩∗
                            (:conclusion ⟨clause⟩)?)
              |  (subproof ⟨proof_step⟩∗ (:conclusion ⟨clause⟩)?)

This document describes inference rules abstractly, using a proof-theoretical notation that is independent from any concrete proof format. The translation from this notation to the proof format is easy. An inference of the form

    Γ1  ...  Γn
    -----------  rule_id(term∗; attribute∗)
         Γ

becomes

(rule_id (:clauses (η(Γ1) ... η(Γn)))? (:terms (⟨term⟩∗))? ⟨attribute⟩∗ (:conclusion Γ)?)


where η(Γi) is either the clause id of Γi or an inference rule instance that derives Γi or a subproof that derives Γi.
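For instance (our own illustration; the clause names are invented), a chain-resolution inference with premises named c1 and c2 and conclusion ((p a)) would be rendered as

    (resolution :clauses (c1 c2) :conclusion ((p a)))

where either premise could itself be replaced, in place, by the rule instance or subproof that derives it.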

3 Rules for Instantiation of Essentially Universal Quantifiers

Essentially universal quantifiers are universal quantifiers that occur with positive polarity or existential quantifiers that occur with negative polarity. Essentially universally quantified variables can be instantiated by any term of the suitable sort. When a satisfying assignment that does not generate any theory conflict contains an essentially universally quantified formula, the instantiation module generates and returns singleton clauses whose only formulas are instances of an instantiation axiom schema. The instantiation terms are usually chosen by a heuristic based on E-matching [6, 9]. This heuristic selects ground terms that appear in the literals composing the satisfying assignment, sometimes based on annotations called triggers (i.e. sets of term patterns). If a quantified formula is of the form ∀x1. ... ∀xn. A[x1, ..., xn] or ¬∃x1. ... ∃xn. A[x1, ..., xn] — with A[_] being a formula not starting with an essentially universal quantifier — the instantiation module instantiates not only the first universally quantified variable x1, but all the variables xi at once.

This instantiation heuristic (based on E-matching) is incomplete; a simple clause set that is unsatisfiable but irrefutable due to this incompleteness is {∀x.P(x); ∀y.¬P(y)}. To find a refutation, the solver shall instantiate x and y to the same arbitrary term t and resolve the two unit clauses with each other. Taking these remarks into account, the most straightforward inference rules for the instantiation module would be:

    --------------------  forall_inst_axiom
    ∀x̄. F(x̄) → F(ā)

    --------------------  exists_inst_axiom
    F(ā) → ∃x̄. F(x̄)

where x̄ denotes a sequence of variables x1, ..., xn and ā a sequence of terms a1, ..., an of suitable sort. An obvious and easily implementable idea to improve these rules is to combine them with the clause form transformation rule for implication, as shown below. This reduces the size of proofs, since it eliminates the need to always apply the implication rules after the instantiation axioms.

    --------------------  forall_inst_cnf_axiom
    ¬∀x̄. F(x̄), F(ā)

    --------------------  exists_inst_cnf_axiom
    ¬F(ā), ∃x̄. F(x̄)

In first-order resolution proofs, it is usual to follow a convention that considers variables to be implicitly universally quantified. Universal quantifiers then simply do not (need to) appear in the proof. One might wonder whether it would be desirable to adopt a similar convention in the presented proof format and rules. The answer is negative: because the SMT-LIB standard does not enforce any naming convention to distinguish identifiers for constants and for free variables, a proof checker would not be able to (easily) tell whether a given identifier (e.g. x) is a constant or a variable. With explicit rules for omitting the quantifiers, a more sophisticated proof-checker could keep track of which identifiers are variables. However, this would imply an undesirable loss of simplicity of the proof format and of the proof checker.
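To make this concrete, the refutation of the clause set {∀x.P(x); ∀y.¬P(y)} mentioned above could be written as follows in the proof format. This is a minimal sketch under stated assumptions: a sort U, a constant a of sort U, a predicate P from U to Bool, the two input clauses asserted under the names c1 and c2, and the instantiation term t chosen as a for both variables; the rule name is the forall_inst_cnf_axiom proposed above.

    (set c3 (forall_inst_cnf_axiom :terms (a)
              :conclusion ((not (forall ((x U)) (P x))) (P a))))
    (set c4 (forall_inst_cnf_axiom :terms (a)
              :conclusion ((not (forall ((y U)) (not (P y)))) (not (P a)))))
    (set c5 (resolution :clauses (c1 c3) :conclusion ((P a))))
    (set c6 (resolution :clauses (c2 c4) :conclusion ((not (P a)))))
    (set c7 (resolution :clauses (c5 c6) :conclusion ()))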

4 Rules for Skolemization of Essentially Existential Quantifiers

Essentially existential quantifiers are existential quantifiers that occur with positive polarity or universal quantifiers that occur with negative polarity. veriT eliminates them by skolemization during a pre-processing phase. The simplest solution would be to disregard this kind of pre-processing in the proof

production. However, this would go against the style of veriT’s proof format, since veriT does produce proofs for other pre-processing tasks such as clause form transformation. Another simple solution consists of having a single macro inference rule that skolemizes all essentially existential quantifiers:

      F
  ──────────  skolemize_all
    sk(F)

where sk(F) is any skolemization of F. The rule skolemize_all is simple to implement in veriT and simple to check by an independent proof checker. The proof checker just needs to traverse F and sk(F) once, checking that each essentially existential quantifier is eliminated and that the quantified variables it binds are replaced by skolem terms headed by skolem symbols that do not occur anywhere else in the proof and whose arguments are (a subset, depending on the skolemization algorithm used, of) the variables bound by essentially universal quantifiers having scope over the eliminated quantifier. This rule is also convenient from the point of view of size, since it is clearly linear in the size |sk(F)| of sk(F). |sk(F)|, however, is in the worst case Θ(|F|²) if F is a tree-formula. To see that |sk(F)| is O(|F|²), just note that the number of essentially existential quantifiers in F is O(|F|) and each of these quantifiers is replaced by a skolem term of size O(|F|). To see that in the worst case |sk(F)| is Ω(|F|²), consider the following example sequence:

Fn = ∀x1...∀xn.∃y1...∃yn.P(x1,...,xn,y1,...,yn)
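Here sk(Fn) = ∀x1...∀xn.P(x1,...,xn, f1(x1,...,xn),..., fn(x1,...,xn)) for fresh skolem symbols f1,..., fn: each of the n skolem terms takes all n universally quantified variables as arguments, so |sk(Fn)| grows quadratically while |Fn| grows only linearly.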

If F is a dag-formula, skolemization may require it to be transformed into an exponentially bigger tree-formula first. In this case, the worst-case size of |sk(F)| is Θ(2^|F|). An example where this happens is available in [7]. However, the rule skolemize_all has the disadvantage of being very coarse-grained, since it skolemizes the whole formula at once. In trying to develop more fine-grained inference rules, it would be desirable to have something analogous to the rules forall_inst_axiom and exists_inst_axiom. This could be attempted with rules such as the following:

  ──────────────────────────────────  exists_skolem_axiom
    ∃x.F(x) → F(fnew(x1,...,xn))

  ──────────────────────────────────  forall_skolem_axiom
    F(fnew(x1,...,xn)) → ∀x.F(x)

  ──────────────────────────────────  exists_skolem_cnf_axiom
    ¬∃x.F(x), F(fnew(x1,...,xn))

  ──────────────────────────────────  forall_skolem_cnf_axiom
    ¬F(fnew(x1,...,xn)), ∀x.F(x)

where x1,...,xn are the free variables occurring in F(x) and fnew is a fresh skolem symbol, not occurring anywhere else in the proof. The rules above are validity-preserving only if we consider just models in which fnew(x1,...,xn) has a fixed interpretation as the witness of the essentially existentially quantified variable it replaces (if such a witness exists). Otherwise, these rules are merely satisfiability-preserving. In the case of instantiation of essentially universal quantifiers, when a quantified formula g (e.g. ∀x.F(x)) needs to be instantiated, g is one of the formulas in a clause c (e.g. Γ,∀x.F(x)). So, after stating an instantiation axiom clause c′ (e.g. ¬∀x.F(x),F(a)), we can do the actual instantiation of g in c simply by resolving c with c′. This replaces g by its instance in c. However, in the case of skolemization, a quantified formula g (e.g. ∃x.F(x)) may often occur not shallowly as a direct formula of a clause c but more deeply as a subformula of a formula in c (e.g. Γ,∀y.∃x.F(x)). Therefore, replacing g by its instance in c cannot be done simply by resolution with a skolemization axiom. To overcome this problem, a deep version of resolution is proposed, so that one of the resolved formulas can occur arbitrarily deep inside

another formula. This rule may be used in the more general case of the replacement of a deep occurrence of a subformula by another:

  Γ, ¬F1, F2        ∆, F⁺(F1)
  ─────────────────────────────  deep_resolution+
          Γ, ∆, F⁺(F2)

  Γ, ¬F1, F2        ∆, F⁻(F2)
  ─────────────────────────────  deep_resolution−
          Γ, ∆, F⁻(F1)

where the signs + and − indicate the polarity of the annotated subformula. Note that deep_resolution+ and deep_resolution− are analogous to deep applications of modus ponens and modus tollens, but fit better in the style of veriT’s proof format, which is based on resolution.
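For example, resolving the skolemization axiom clause ¬∃x.F(x), F(fnew(y)) against a clause Γ, ∀y.∃x.F(x), in which ∃x.F(x) occurs with positive polarity below the universal quantifier, by deep_resolution+ (with F1 = ∃x.F(x) and F2 = F(fnew(y))) yields Γ, ∀y.F(fnew(y)).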

  Γ, F1 → F2        ∆, F⁺[F1]
  ─────────────────────────────  deep_modus_ponens
          Γ, ∆, F⁺[F2]

  Γ, F1 → F2        ∆, F⁻[F2]
  ─────────────────────────────  deep_modus_tollens
          Γ, ∆, F⁻[F1]

  Γ, F1 ↔ F2        ∆, F[F1]
  ─────────────────────────────  deep_replacement
          Γ, ∆, F[F2]

This approach with skolemization axioms and deep resolution has many problems, though. Firstly, there is a significant increase in the size of proofs: if m quantifiers need to be skolemized and, for the sake of fine-grainedness, a deep replacement is performed separately for each of the quantifiers, then there will be Θ(m) inferences, whose conclusions are of size O(|sk(F)|). Consequently, there is also a significant increase in the proof-checking time. Anti-prenexing the quantifiers as much as possible could reduce this problem in the average case. Secondly, and perhaps more seriously, proof-checking the skolemization axiom rules and the deep resolution rules depends on being capable of distinguishing identifiers of free variables and constants, and the SMT-LIB standard does not enforce any distinction. To understand this issue, consider the following formulas:

  F1 := ∀x.∃y.P(x,c,y)
  F2 := ∀c.∀x.∃y.P(x,c,y)

and consider the following exists_skolem_axiom:

  ∃y.P(x,c,y) → P(x,c, fnew(x))

Proof-checking this axiom depends on being able to tell whether c is a variable or a constant, for if it were a variable, then c should also have been listed as an argument of fnew. Moreover, both F1 and F2 might occur in a proof, and then c occurs both as a constant identifier and as a variable identifier. In such cases, the proof checker should be able to accept this skolem axiom and a deep resolution (or modus ponens) with F1 (concluding ∀x.P(x,c, fnew(x))), but it should also be able to reject an incorrect deep resolution (or modus ponens) with F2 (unsoundly concluding ∀x.P(x,c, fnew(x)) instead of ∀x.∀c.P(x,c, fnew(x,c))). This simple example shows that the combination of skolem axioms, deep replacement rules and no distinction between identifiers for constants and for variables leads to unsoundness. This could be fixed by requiring non-local side conditions in the deep replacement rules, so that a deep replacement is only allowed if any skolem function symbol occurring in the replacing formula has as arguments all the identifiers that occur free in the replacing formula and that become bound after the replacement. Although this is technically feasible, it is questionable whether the increase

in the complexity of proof-checking is a reasonable price to pay for the fine-grainedness and elegance provided by the deep replacement rules and the skolem axioms. This means that the proof checker would not be able to check the correctness of a skolemization axiom inference locally; it would not be able to verify whether the list of arguments of the skolem function, which ought to contain all the free variables (but not the constants) of F(x), is correct. The proof checker would only be able to tell whether the arguments of the skolem function are variables or constants when the deep resolutions are performed. Proof-checking would not be just local anymore, since the proof checker would need to keep track of some global correctness conditions. One way to keep proof-checking local, while still keeping fine-grainedness, would be to combine the skolemization axiom rule and the deep resolution rule into a single unary inference rule, as follows:

  Γ, G⁺[∃x.F(x)]
  ─────────────────────────────  exists_skolem
  Γ, G⁺[F(fnew(x1,...,xn))]

  Γ, G⁻[∀x.F(x)]
  ─────────────────────────────  forall_skolem
  Γ, G⁻[F(fnew(x1,...,xn))]

where x1,...,xn are the free variables of F(x) that are bound in G(Qx.F(x)). Another approach would be to give up using skolem terms and use Hilbert’s epsilon terms instead. A problem with this approach is that the size of the transformed formula epsilon(F) for a first-order formula F is in the worst case Ω(2^|F|) for tree-like F. This lower bound can be easily proved by considering the sequence of linearly growing formulas Fn := ∃x1...∃xn.P(x1,...,xn) and checking that the sequence epsilon(Fn) indeed grows in Ω(2^n). For dag-like F, another exponential blow-up is possible, since the formula may need to be transformed into a tree, and hence the size is Ω(2^(2^|F|)).
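For instance, for F2 := ∃x1.∃x2.P(x1,x2), the epsilon translation is P(a1,a2) with a1 = εx1.∃x2.P(x1,x2) (whose body is itself translated recursively) and a2 = εx2.P(a1,x2): the ε-term for each eliminated variable contains copies of the ε-terms for all previously eliminated variables, so the size at least doubles with each additional quantifier.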

  ─────────────────────────────  epsilon_axiom_1
    ∃~x.F(~x) → F(ε~x.F(~x))

  ─────────────────────────────  epsilon_axiom_2
    F(ε~x.¬F(~x)) → ∀~x.F(~x)

Yet another alternative worth considering would be to stop doing skolemization as a pre-processing step altogether, and do it only on demand, when an essentially existentially quantified formula occurs shallowly as a direct formula of a clause. In this case, no deep replacement would be necessary, and skolem terms would always be just skolem constants. Equivalently, strong quantifier rules or axioms that instantiate the essentially existentially quantified variables by eigenvariables could be used.

5 Conclusions

In this paper we have presented a few alternative inference rules for handling quantifiers in the proof format of veriT. We showed that each alternative has advantages but also disadvantages with respect to the requirements mentioned in the introduction. Therefore, we found none of the alternatives completely satisfactory. Since it seems to be difficult to find the right balance to satisfy most if not all of the requirements simultaneously, we have implemented some of these alternative rules in veriT, selecting those that seemed to fit best within the existing proof style of that tool. The quantifier instantiation module produces instances of the rules forall_inst_axiom and exists_inst_axiom to justify lemmas that are added to the Boolean satisfiability solver. The clauses they introduce are then combined with existing rules for CNF transformation and resolution. Skolemization is applied to the input formula only on essentially existentially quantified


variables occurring at the outermost level, and thus only produces skolem constants. Also, since the formulas generated by the quantifier instantiation module might reveal essentially existential quantifiers at the outermost level, skolemization is also applied to instances. In both cases, exists_skolem_cnf_axiom and forall_skolem_cnf_axiom are used in the proof, and the resulting clauses are further combined using deep_resolution rules.

References

[1] Clark Barrett, Aaron Stump, and Cesare Tinelli. The SMT-LIB standard: Version 2.0, March 2010. First official release of Version 2.0 of the SMT-LIB standard.
[2] Clark Barrett and Cesare Tinelli. CVC3. In Werner Damm and Holger Hermanns, editors, Computer Aided Verification, volume 4590 of Lecture Notes in Computer Science, pages 298–302. Springer Berlin / Heidelberg, 2007.
[3] Frédéric Besson, Pascal Fontaine, and Laurent Théry. A flexible proof format for SMT: a proposal, 2011. Workshop on Proof eXchange for Theorem Proving (PxTP).
[4] Thomas Bouton, Diego Caminha B. de Oliveira, David Déharbe, and Pascal Fontaine. veriT: an open, trustable and efficient SMT-solver. In Renate Schmidt, editor, Proc. Conference on Automated Deduction (CADE), volume 5663 of Lecture Notes in Computer Science, pages 151–156, Montreal, Canada, 2009. Springer.
[5] Leonardo Mendonça de Moura and Nikolaj Bjørner. Z3: An efficient SMT solver. In C. Ramakrishnan and Jakob Rehof, editors, Tools and Algorithms for the Construction and Analysis of Systems, volume 4963 of Lecture Notes in Computer Science, pages 337–340. Springer Berlin / Heidelberg, 2008.
[6] David Detlefs, Greg Nelson, and James B. Saxe. Simplify: a theorem prover for program checking. J. ACM, 52(3):365–473, 2005.
[7] Pascal Fontaine. Techniques for Verification of Concurrent Systems with Invariants. PhD thesis, Université de Liège, Belgium, September 2004.
[8] Yeting Ge, Clark Barrett, and Cesare Tinelli. Solving quantified verification conditions using satisfiability modulo theories. In Frank Pfenning, editor, Automated Deduction — CADE-21, volume 4603 of Lecture Notes in Computer Science, pages 167–182. Springer Berlin / Heidelberg, 2007.
[9] M. Moskal, J. Lopuszanski, and J. R. Kiniry. E-matching for fun and profit. Electronic Notes in Theoretical Computer Science, 198(2):19–35, 2008. Proceedings of the 5th Int’l Workshop on Satisfiability Modulo Theories (SMT 2007).

Towards certification of TLA+ proof obligations with SMT solvers

Stephan Merz and Hernán Vanzetto∗
INRIA Nancy Grand-Est & LORIA, Nancy, France

Abstract

TLA+ is a formal specification language that is based on Zermelo-Fraenkel set theory and the Temporal Logic of Actions TLA. The TLA+ proof system TLAPS assists users in deductively verifying safety properties of TLA+ specifications. TLAPS is built around a proof manager, which interprets the TLA+ proof language, generates corresponding proof obligations, and passes them to backend verifiers. In this paper we present a new backend for use with SMT solvers that supports elementary set theory, functions, arithmetic, tuples, and records. We introduce a typing discipline for TLA+ proof obligations, which helps us to disambiguate the translation of expressions of (untyped) TLA+, while ensuring its soundness. Our work is a first step towards the certification of proofs generated by proof-producing SMT solvers in Isabelle/TLA+, which is intended to be the only trusted component of TLAPS.

1 Introduction

TLA+ [8] is a language for specifying and verifying concurrent and distributed algorithms and systems. It is based on a variant of Zermelo-Fraenkel set theory for specifying the data structures, and on the Temporal Logic of Actions TLA for describing the dynamic system behavior. Recently, a first version of the TLA+ proof system TLAPS [4] has been developed, in which users can deductively verify safety properties of TLA+ specifications. TLA+ contains a declarative language for writing hierarchical proofs, and TLAPS is built around a proof manager, which interprets this proof language, expands the necessary module and operator definitions, generates corresponding proof obligations (POs), and passes them to backend verifiers, as illustrated in Figure 1. While TLAPS is an interactive proof environment that relies on users guiding the proof effort, it integrates automatic backends to discharge proof obligations that users consider trivial. The two main backends of the current version of TLAPS are Zenon [3], a tableau prover for first-order logic and set theory, and Isabelle/TLA+, a faithful encoding of TLA+ in the Isabelle [11] proof assistant, which provides automated proof methods based on first-order reasoning and rewriting. Zenon is not part of the trusted code base of TLAPS, but outputs proof scripts in Isar syntax for the theorems that it proves. These proofs are passed to Isabelle for verification. In this way, Isabelle/TLA+ is used both as a standalone backend prover and as the certification engine of proof scripts produced by other backends. The currently available backends also include a generic translation to the input language of SMT solvers, focusing on quantifier-free formulas of linear arithmetic (not shown in Fig. 1). This SMT backend has occasionally been useful because the other verifiers perform quite poorly on obligations involving arithmetic reasoning. However, it covers a rather limited fragment of TLA+, which heavily relies on modeling data using sets and functions. Assertions mixing arithmetic, sets and functions arise frequently in TLA+ proofs. In the work reported here we present a new SMT-based backend for (non-temporal) TLA+ formulas that encompasses set-theoretic expressions, functions, arithmetic, records, and tuples. By evaluating the performance of the backend over several existing TLA+ proofs we show that it achieves good coverage for “trivial” proof obligations.

∗Supported by the Microsoft Research-INRIA Joint Centre, Saclay, France.


[Figure 1: General architecture of TLAPS. The proof manager interprets the TLA+ specification and proofs, expands module and operator definitions, generates proof obligations, and passes them (with type inference for the new SMT modules) to the backend verifiers Isabelle/TLA+, Zenon, the SMT-LIB solvers, and Yices, which return results, error messages, and, when possible, certified proofs.]

The new modules comprising our backend appear shaded in Figure 1. We consider two target languages for our translation: SMT-LIB [2], the de facto standard input format for SMT solvers, and the native input language of the SMT solver Yices [6]. Using SMT-LIB as the target of our translation, TLAPS can be independent of any particular solver. On the other hand, the Yices language provides useful concepts such as sub-typing or a direct representation of tuples and records. The considered TLA+ formulas are translated to quantified first-order formulas over the theory of linear integer and real arithmetic, extended with free sort and function symbols. In particular, we make heavy use of uninterpreted functions, and we do not restrict ourselves to quantifier-free formulas. All SMT solvers mentioned in this paper produce proofs, although each one has its specific output format.

TLA+ is an untyped language, which makes it very expressive and flexible, but also makes automated reasoning quite challenging [9]. Since TLA+ variables can assume any value, it is customary (and recommended) to start any verification project by proving a so-called type invariant that constrains the values that variables of the specification may assume. Most higher-level correctness proofs rely on the type invariant. It should be noted that TLA+ type invariants frequently express more sophisticated properties than what could be ensured by a decidable type system. In contrast, our target languages are multi-sorted first-order languages, which are supported by dedicated decision procedures in SMT solvers. The first challenge is therefore to assign an SMT sort to each expression that appears in the proof obligation. We make use of this type assignment during the translation of expressions, which may depend on the types of the subexpressions involved. For example, equality between integer expressions will be handled differently from equality between sets or functions.

During a first phase, our translation attempts to infer types for all subexpressions of a proof obligation. This phase may fail because not every set-theoretic expression is typable according to our typing discipline, and in this case the backend aborts. Otherwise, the proof obligation is translated to an SMT formula. Observe that type inference is relevant for the soundness of the SMT backend: a proof obligation that is unprovable according to the semantics of untyped TLA+ must not become provable due to incorrect type annotation. As a trivial example, consider the formula x+0 = x, which should be provable only if x is known to be in an arithmetic domain. Type inference essentially relies on assumptions that are present in the proof obligation and that constrain the values of symbols (variables or operators). A type system for TLA+ together with a concise description of the type inference algorithm is

presented in the next section. Section 3 describes the translation. Finally, results for some case studies and conclusions will be given in Sections 4 and 5.

2 Type inference for TLA+

We define a type system for TLA+ expressions that underlies our SMT translation. We consider types τ according to the following grammar.

τ ::= i | o | Str | Nat | Int | Real | P τ | τ → τ | Rec {fieldi ↦ τi} | Tup [τi]

The atomic types are i (terms of unspecified type), o (propositions), strings, and natural, integer and real numbers. Complex types are sets (of base type τ), functions, records (defined by a mapping from field names to types) and tuples (as a list of types). For this type system, we define an inference algorithm that is based on a recursive, bottom-up operator [[e, ε]]I whose arguments are a TLA+ expression e and an expected (minimum) type ε that e should at least have, according to a predefined partial order relation on types that includes relations such as Nat < Int or i < τ, for any type τ ≠ i. The computation either returns the inferred type or aborts. The operator recurses over the structure of TLA+ expressions, gathering information in a typing environment type : Id ↦ τ that maps each TLA+ symbol to its type. The operator [[·]]I is applied iteratively on the list of hypotheses H1,...,Hn until a fixpoint is reached: note that later hypotheses may provide additional information for symbols that occur in earlier ones. Initially, we consider that every symbol has the unspecified type i, which can be thought of as a bottom type. Recursive calls to the operator [[·]]I may update the type of symbols as recorded in type by new types that are greater than the previous ones. A type assignment is definitive only when types for all expressions in the proof obligation have been successfully inferred. For example, consider the hypotheses S = {} and S ⊆ Int. After evaluating the first one, S will have type P i, but it will be updated to P Int when the second hypothesis is processed. When type inference succeeds, the environment variable type will contain the resulting final type assignments. As we discuss below, there are two cases where the inference algorithm can fail: (1) when a symbol an expression depends on does not have an assigned type, and (2) when a constraint stating that two or more expressions need to be of the same type cannot be solved.

The operator [[·]]I assigns types to complex expressions based on the types of their constituents. Although expressions such as {a} ∪ 0 or 3 + TRUE appear silly, they are allowed in TLA+, yet their meaning is unknown. Certain TLA+ operators are defined in such a way that the result type is fixed, aiding type inference. For example, logical operators always return Boolean values, whatever the types of their operands. Similarly, an expression S ∪ T is of type P i; if S and T are known to be sets of elements of the same type τ, then we obtain the more precise type P τ. Arithmetic operators guarantee the result type only if their arguments are in the arithmetic domain.

TLA+ expressions such as e1 = e2, e1 ⊆ e2, and IF p THEN e1 ELSE e2 are critical in the sense that their constituents e1 and e2 must have the same type in order to express them in the sorted format of the SMT solvers. Similarly, the expression e1 ∈ e2 requires that e2 be of type P τ, and e1 of type τ, for some τ. For similar reasons, we do not allow functions (or operators) to return values of different types. Type inference for this kind of expression makes use of the function S([e1,...,en], ε), which, given a list of expressions e1,...,en and an expected type ε, returns their unique type and fails when some pair among [[e1,ε]]I,...,[[en,ε]]I cannot be equated. As a concrete example, consider the proof obligation (¬¬P) = P, where P is a defined operator whose definition is hidden (i.e., its definition cannot be used in the proof obligation).
We know that ¬¬P is Boolean, and the typing rule for equality requires that the expressions on both sides must be of equal

type. Without any information on the type of P, the obligation should be rejected by type checking. In particular, we are not allowed to infer that P is of type Boolean. Indeed, P might be defined as the integer 42, and we do not know if (¬¬42) = 42 is true in TLA+. We therefore derive the type assignment only from available facts, i.e., hypotheses of the proof obligation of the forms id ⊗ e and ∀x ∈ S : id(x) ⊗ e, for ⊗ ∈ {=, ∈, ⊆}. In these expressions, id is a constant, variable or operator, and e is an expression whose type can already be inferred1.
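The fixpoint iteration described above can be pictured with the following sketch in Python; the type representation, the order relation, and all names are simplifying assumptions made for this illustration, not the actual TLAPS implementation.

# Minimal sketch of the type-inference fixpoint, assuming a toy type
# representation: "i", "Nat", "Int", "Real", and set types "P <tau>".
def lub(t1, t2):
    """Least upper bound in the partial type order; None if incomparable."""
    if t1 == t2:
        return t1
    if t1 == "i":                      # i is the bottom type: i < tau
        return t2
    if t2 == "i":
        return t1
    chain = ["Nat", "Int", "Real"]     # Nat < Int < Real
    if t1 in chain and t2 in chain:
        return chain[max(chain.index(t1), chain.index(t2))]
    if t1.startswith("P ") and t2.startswith("P "):
        inner = lub(t1[2:], t2[2:])    # set types compared componentwise
        return None if inner is None else "P " + inner
    return None

def infer(hypotheses):
    """Iterate over (symbol, constraint-type) facts until a fixpoint."""
    env, changed = {}, True
    while changed:
        changed = False
        for sym, ty in hypotheses:
            new = lub(env.get(sym, "i"), ty)
            if new is None:
                raise TypeError(f"untypable: {sym}")   # inference aborts
            if env.get(sym) != new:
                env[sym] = new
                changed = True
    return env

# S = {} constrains S to "P i"; S <= Int then upgrades it to "P Int".
print(infer([("S", "P i"), ("S", "P Int")]))   # {'S': 'P Int'}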

3 From TLA+ to SMT

Once we have determined a type assignment for a TLA+ proof obligation, it can be translated to the input languages of SMT solvers. Our target languages (SMT-LIB and Yices) are similar, and the translation proceeds along the same lines. We define the operator [[·]]T that translates a TLA+ expression, using the type information gathered previously, to its corresponding SMT formula extended with λ-terms. The typing discipline ensures that all resulting λ-abstractions can be β-reduced before obtaining the definitive translation of the original expression. A symbol’s translation depends on its inferred type. It is translated to an SMT variable or function name, which has to be declared in the output produced by the translation, together with its type. Purely arithmetic and first-order expressions are translated to the corresponding built-in operators of the target languages, except for quantified expressions, where quantified variables are introduced temporarily in the context with their types inferred accordingly. In this way, for the expression ∀x : e, the value [[e, o]]I is evaluated to obtain the type assignment for the variable x, so it can be properly declared in the translation of the quantified expression. Sets and functions are translated to uninterpreted functions. The encoding of a set S represents its characteristic predicate (“is a member of set S”), allowing for the direct translation of the set membership relation. In this way, [[S]]T is a λ-abstraction which will be applied to a value to decide whether it is an element of the set or not; consequently [[e ∈ S]]T ≡ ([[S]]T [[e]]T). Similarly, function application reduces to [[f[e]]]T ≡ ([[f]]T [[e]]T). A function [x ∈ S ↦ exp], whose domain is S, is translated to λy. [[exp(x/y)]]T, where x is replaced by y in the expression exp (the domain S is represented separately, as we will see below). For example, a function f that returns a set of type P τ will be translated as a λ-abstraction where the first parameter is the function’s argument and the second is a value of type τ. Then the translation is defined as [[f]]T ≡ λx y.(f x y), where f denotes the symbol’s name in the SMT output. The translation of equality depends on the types of the two subexpressions, which must have the same type by type inference. For example:

[[e1 = e2]]T ≡ [[∀x : (x ∈ DOMAIN e1 ⇔ x ∈ DOMAIN e2)
                     ∧ (x ∈ DOMAIN e1 ⇒ e1[x] = e2[x])]]T        when e1 is of a function type (τ → τ)

[[e1 = e2]]T ≡ [[∀f, g : (f ∈ e1 ∧ g ∈ e2) ⇒ f = g]]T            when e1 is of a set-of-functions type (P(τ → τ))

More generally, n-ary functions or operators returning a (simple) set are represented as predicates of arity n + 1 with their last argument being the set members. We only consider simple sets, i.e. sets of individuals, in order to remain in the realm of first-order logic. Any hypotheses of a proof obligation that fall outside this class are discarded. For example, hypotheses of the form x ∈ S where S is of type PPτ are useful during type inference in order to determine the type of x but are then dropped during the SMT translation.
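The core of this encoding — membership and function application both becoming plain application, with intermediate λ-abstractions β-reduced away — can be illustrated by the following toy sketch; the expression representation and all names are assumptions of the illustration, not the actual backend.

import re

def subst(body, x, a):
    # naive, capture-unaware textual substitution; adequate for this sketch
    return re.sub(rf"\b{re.escape(x)}\b", a, body)

def apply_t(f, a):
    if isinstance(f, tuple):           # ("lam", x, body): beta-reduce on the fly
        _, x, body = f
        return subst(body, x, a)
    return f"({f} {a})"                # uninterpreted symbol: SMT application

def translate(e):
    if isinstance(e, str):             # symbol -> SMT variable/function name
        return e
    if e[0] == "in":                   # e1 \in S  ~>  ([[S]] [[e1]])
        _, e1, s = e
        return apply_t(translate(s), translate(e1))
    if e[0] == "app":                  # f[e1]     ~>  ([[f]] [[e1]])
        _, f, e1 = e
        return apply_t(translate(f), translate(e1))
    if e[0] == "fun":                  # [x \in S |-> exp]  ~>  lam x. [[exp]]
        _, x, _S, exp = e
        return ("lam", x, translate(exp))
    raise ValueError(e[0])

print(translate(("in", "v", "S")))                                    # (S v)
print(translate(("app", ("fun", "x", "S", ("app", "g", "x")), "v")))  # (g v)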

1 For variables, this kind of expression is usually given by the type invariant. Our backend requires similar type-correctness lemmas for hidden operators.


Because SMT-LIB has no notion of function domain, we associate with each function f a set DOMAIN f (as a characteristic predicate). For every function application that occurs in the proof obligation, we check that the argument values are in the domain: otherwise the value of the function application would be unspecified. To this end, we define an auxiliary operator [[·]]F that computes corresponding proof obligations. In particular, for function applications we let [[f[e]]]F ≡ [[f]]F ∧ [[e]]F ∧ e ∈ DOMAIN f and [[Qx ∈ S : e]]F ≡ ∀x ∈ [[S]]T : [[e]]F for Q ∈ {∀,∃}. The remaining clauses in the definition of [[·]]F collect all function applications that occur in an expression. In particular, [[x]]F ≡ TRUE for an atomic expression x. An expression e depending on subexpressions e1,...,en is evaluated as [[e(e1,...,en)]]F ≡ [[e1]]F ∧ ... ∧ [[en]]F.

The main difference between the SMT-LIB and Yices translations is the encoding of tuples and records. In Yices, they are natively supported, whereas SMT-LIB currently does not have a pre-defined theory. To encode (fixed-size) tuples and records in the first-order logic of SMT-LIB, we treat each of their components separately. A symbol t of type Tup[τi] is translated as λi. t_i, introducing the new symbols t_i with types τi to the context, corresponding to the i-th component of the tuple t. For example, [[t = ⟨e1,e2⟩]]T ≡ [[t_1 = e1 ∧ t_2 = e2]]T. The translation of records is analogous, with field names taking the place of tuple indexes. Currently, all constituents of tuples and records must be of basic types.
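The computation of the domain conditions by [[·]]F described above can be sketched as follows (a toy expression representation is assumed, and the quantifier clause is omitted):

# Collect "argument \in DOMAIN f" conditions for every function
# application in an expression, mirroring [[.]]_F.
def conditions(e):
    if isinstance(e, str):
        return []                      # atomic expression: TRUE, no conditions
    op, *args = e
    out = [c for a in args for c in conditions(a)]
    if op == "app":                    # f[e1] also requires e1 in DOMAIN f
        f, e1 = args
        out.append(("in", e1, ("DOMAIN", f)))
    return out

print(conditions(("app", "f", ("app", "g", "x"))))
# [('in', 'x', ('DOMAIN', 'g')), ('in', ('app', 'g', 'x'), ('DOMAIN', 'f'))]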

4 Experimental results

We have used our new backend with good success on several examples taken from the TLAPS distribution. For example, the non-temporal part of the invariant proof for the well-known N-process Bakery algorithm [7], which mainly uses set theory, functions and arithmetic over the natural numbers, could be reduced from around 320 lines of interactive proof to a completely automatic proof. The resulting obligation generates an SMT-LIB file containing 105 quantifiers (many of them nested), which has been proved by the CVC3 [1] SMT solver in around 10 seconds and by Z3 [5] in less than a second on a standard laptop (whereas the original, interactive proof takes around 24 seconds to process). On the other hand, Yices cannot handle the entire proof obligation at once, and it was necessary to split the theorem into separate cases per subaction; it then takes about 8 seconds to prove the resulting obligations. More interestingly, the Yices backend (with better support for records) could handle significant parts of the type and safety invariant proofs of the Memoir system [10], a generic framework for executing modules of code in a protected environment. The proofs were almost fully automated, except for three sub-proofs that required manual Skolemization of second-order quantifiers. In terms of lines of proof, they were reduced to around 10% of the original size. In particular, the original 2400 lines of proof for the complete type invariant theorems were reduced to 208 lines.

5 Conclusions

We defined a translation of certain TLA+ proof obligations into SMT-LIB 2 and into the native language of Yices. The translation relies on a typing discipline for TLA+, which is untyped, and a corresponding type inference algorithm. This discipline restricts the class of TLA+ expressions that can be translated. Nevertheless, a significant fragment of the source language can be handled. In particular, we support first-order logic, elementary set theory, functions, integer and real arithmetic, tuples and records. Sets and functions are represented as lambda-abstractions, which works quite efficiently but excludes handling second-order expressions involving sets of sets or quantification over complex types. Universal set quantifiers that occur at the outermost level can easily be removed by the user of TLAPS, by introducing Skolem constants. An automatic pre-processing of such terms would further improve the backend. The current SMT-LIB backend provides only limited support for tuples and records.


In future work, we intend to study the question of interpreting proofs provided by SMT solvers for reconstructing them (as well as the type assignment) in the trusted object logic of Isabelle/TLA+. We also envisage extending our translation to the native input languages of other SMT solvers such as Z3 [5]. These are similar to the ones that we considered here, and therefore their translation will be straightforward once the types are assigned.

References

[1] C. Barrett and C. Tinelli. CVC3. In W. Damm and H. Hermanns, editors, 19th International Conference on Computer Aided Verification (CAV ’07), volume 4590 of Lecture Notes in Computer Science, pages 298–302. Springer-Verlag, Berlin, Germany, July 2007.
[2] C. Barrett, A. Stump, and C. Tinelli. The SMT-LIB Standard: Version 2.0. In A. Gupta and D. Kroening, editors, Satisfiability Modulo Theories, 2010.
[3] R. Bonichon, D. Delahaye, and D. Doligez. Zenon: An Extensible Automated Theorem Prover Producing Checkable Proofs. In N. Dershowitz and A. Voronkov, editors, 14th Intl. Conf. Logic for Programming, Artificial Intelligence, and Reasoning (LPAR 2007), volume 4790 of Lecture Notes in Computer Science, pages 151–165, Yerevan, Armenia, 2007. Springer.
[4] K. Chaudhuri, D. Doligez, L. Lamport, and S. Merz. Verifying Safety Properties with the TLA+ Proof System. In J. Giesl and R. Hähnle, editors, 5th Intl. Joint Conf. Automated Reasoning (IJCAR 2010), volume 6173 of Lecture Notes in Computer Science, pages 142–148, Edinburgh, UK, 2010.
[5] L. de Moura and N. Bjørner. Z3: An efficient SMT solver. In C. Ramakrishnan and J. Rehof, editors, Tools and Algorithms for the Construction and Analysis of Systems, volume 4963 of Lecture Notes in Computer Science, pages 337–340. Springer Berlin / Heidelberg, 2008.
[6] B. Dutertre and L. de Moura. The Yices SMT solver. Tool paper at http://yices.csl.sri.com/tool-paper.pdf, 2006.
[7] L. Lamport. A new solution of Dijkstra’s concurrent programming problem. Commun. ACM, 17:453–455, August 1974.
[8] L. Lamport. Specifying Systems: The TLA+ Language and Tools for Hardware and Software Engineers. Addison-Wesley, 2002.
[9] L. Lamport and L. C. Paulson. Should your specification language be typed? ACM Trans. Program. Lang. Syst., 21:502–526, May 1999.
[10] B. Parno, J. R. Lorch, J. R. Douceur, J. Mickens, and J. M. McCune. Memoir: Practical State Continuity for Protected Modules. IEEE Symp. Security and Privacy, IEEE, May 2011. Formal Specifications and Correctness Proofs: Tech. Report, Microsoft Research, Feb. 2011.
[11] M. Wenzel, L. C. Paulson, and T. Nipkow. The Isabelle Framework. In 21st Intl. Conf. Theorem Proving in Higher Order Logics, TPHOLs ’08, pages 33–38, Berlin, Heidelberg, 2008. Springer-Verlag.

Escape to ATP for Mizar

Piotr Rudnicki∗                              Josef Urban†
University of Alberta                        Radboud University
Edmonton, Alberta, Canada                    Nijmegen, The Netherlands
[email protected]                          [email protected]

Abstract

An interactive ATP service is a new feature in the Mizar proof assistant. The functionality of the service is in many respects analogous to the Sledgehammer subsystem of Isabelle/HOL. The ATP service requires minimal user configuration and is accessible via a few keystrokes from within Mizar mode in Emacs. In return, for a given goal formula, the ATP service, when it succeeds, finds premises sufficient to prove the goal. The “escape” to ATP uses a sound translation from Mizar’s language to that of first-order provers, the same translation that has been used in the more batch-oriented Automated Reasoning for Mizar (MizAR) web services presented in [16]. We briefly present the interactive ATP service followed by an account of initial experiments with the tool. We claim with some confidence that the tool will substantially ease the process of preparing new Mizar articles.

1 Introduction

We start by noting a substantial difference between the context of discovering a proof, when any help from ATPs is welcome, and the context of justifying a proof, when the proof is finished and we only want to check its correctness. The experience of maintaining the Mizar library1 for more than 20 years, with its continual improvements, refactorings and reorganizations, indicates that in the context of justification we need a checker that runs very quickly. Many experiments in which the strength of the Mizar verifier was increased had to be abandoned because they tended to substantially increase the time of verifying the entire library. Such changes happen, for a variety of reasons, almost daily. On the other hand, when we are constructing a proof for a formula, we are ready to use any assistance, even if the available automated helpers run for quite some time. Thus we need two, if not more, quite different proof checkers/verifiers/advisors. To ensure fast rechecking, the proofs stored in the large repository of formal mathematics need to be checkable almost instantly, and thus tend to be quite detailed. Such detailed proofs are not suitable for direct human consumption, but one can conceive of presentation tools that can browse through the proofs at various levels of abstraction. The question remains: how to construct the detailed proofs? For years, the common practice among Mizar authors has been to write such proofs by hand. Until several years ago, the contents of the Mizar Mathematical Library (MML) were processable only by the ‘secretive’ Mizar processor, whose source code is still available only to the members of the Association of Mizar Users. Despite this difficulty, the contents of MML have been exported to other systems, independent of the Mizar processor [13]. For historical accuracy, we must also mention the efforts of Ingo Dahn in trying to export the contents of MML to ATPs and other presentations [2]. Dahn’s effort was inspired by the (then influential) QED Manifesto2, anonymously published in [1].

∗Partially supported by NSERC.
†Partially supported by the NWO project “MathWiki: a Web-based Collaborative Authoring Environment for Formal Proofs”.
1http://mizar.org
2http://en.wikipedia.org/wiki/QED_manifesto


Although the QED initiative did not result in much more than a few written texts3, the spirit of the project seems to be quite alive, as this particular workshop demonstrates. Over the last decade, the second author has developed software tools advising Mizar authors. The advice consists of suggesting premises relevant for proving a new conjecture in an article. The following tools are external to the Mizar processor and they can be launched from within the Mizar mode for Emacs.

• M(ost—uch) of M(izar—ath) Matches (MoMM [10]), released in 2002, is a subsumption tool based on Schulz’s E Equational Theorem Prover4. Given a goal to prove, the tool quickly finds similar previous proof situations in the large Mizar library. Thanks to the sheer size of the library, subsumption, incorporating additional Mizar-like type mechanisms, turns out to provide useful (exact) advice in more than 50% of new proof situations [10], which is remarkably good compared with full ATP techniques.

• The Mizar Proof Advisor [9, p. 332] was released in 2003. The tool finds previous theorems and definitions from the library as hints (relevant premises) for proving a new conjecture. This advice is inexact—a proof is not searched for. The advisor is trained (in the sense of machine learning) on previous proofs and thus brings into the exact world of automated reasoning some heuristic methods similar to keyword-based web search.

Despite numerous regular presentations and other similar attempts by the second author, these tools have never become regularly used in the Mizar community. The following factors could have played a role:

• The Emacs mode for Mizar took some time to be adopted by a majority of Mizar users. Some core Mizar users still use editors other than Emacs.

• The proof advisor’s precision depends on the domain in which it is applied, and the “Google”-like answer it provides may be quite difficult to post-process into a human-understandable complete proof, particularly when the suggested connections and proof ideas are non-obvious. Thus, using an ATP to find a proof makes a big difference for users. The premises used in the final ATP proof are guaranteed to be sufficient, and because the user knows that the ATP is right, he might put extra effort into understanding the suggested ATP proof. In contrast, in the case of an imprecise AI advisor, users might simply give up even when the proof is easy to reconstruct from the “Google hits”. This seems to be an interesting difference between the perceived utility of Googling for finding relevant high-level mathematical facts, and “Googling” in formal libraries with the purpose of finishing sufficiently simple low-level reasoning steps.

• While MoMM’s overall performance on the MML was unusually good at the time of its invention in 2002, when the large-theory ATP field was just being born, its performance depends on the domain in which it is applied, because it strongly depends on previously formalized knowledge. This might have discouraged the then small number of potential Mizar users ready to use Emacs and experiment with the then untested large-theory ATP techniques in 2002.

The last version of MoMM was released in 2006. Later, the second author’s interest shifted to the more general task of employing AI techniques and ATPs in large formal knowledge bases. In 2005 large parts of the Mizar software were reimplemented and the internal representation of Mizar articles became based on XML. The entire MML in this form is freely distributed and allows for processing of its contents by generic XML tools. This semantically complete XML representation of

3http://mizar.uwb.edu.pl/qed/ 4http://www4.informatik.tu-muenchen.de/~schulz/WORK/eprover.


MML simplified the process of translating the library contents to the TPTP format and facilitated the creation of various utilities based on HTML representations; see the next section for a summary and consult [13, 16] for details. An early milestone of these translations is the MPTP system [11], which allows analysis of various mathematically oriented ATP benchmarks as described in [14]. Later, the XMLized MML opened the way for creating metasystems supporting reasoning in large theories [12, 17]. The XMLized version of MML prompted a crucial application of ATPs: the cross-verification of the Mizar library. The Mizar proof checker is a complicated piece of software whose sources are publicly unavailable. Its wide, successful usage was the only external reason for trusting Mizar-checked proofs. It was natural to attempt reproving all of the theorems in MML. As reported in [15], the initial experiments were encouraging: combinations of ATPs are able to reprove between 40% and 60% of MML theorems, depending on the area of mathematics covered. ATPs have been much more successful at reproving Mizar atomic inference steps, where the success rate approaches 100%. It was only natural to ask whether the ATPs could prove—and not just reprove—the goals obtained by Mizar atomic inferences without using the guidance of which premises were used by human authors. This presents the challenge of how to select the potentially useful premises out of about 100,000 available; this brings us closer to the topic of this paper. The tools described in [16] offer a combination of several automated reasoning and proof presentation techniques for the Mizar system. One of the functions, which has been developed since 2008, offers the possibility to ask an ATP system for help in proving a conjecture in the context of a Mizar article. However, in order to call for the ATP assistance, one has to switch into an HTML presentation of the article. Recently, functionality has been added to the Mizar mode in Emacs that allows the user to invoke ATP help, in several simple ways, from within an Emacs session. This functionality owes much to the recent dissemination and development phase of new ATP techniques for large theories such as SInE [4] and MaLARea [12, 17]. The techniques are now mature enough to handle with reasonable efficiency—at least comparable to specialized ATP techniques like MoMM—simple mathematical problems in an environment of some 100,000 available facts (axioms). We would like to stress that the Japanese Mizar community expressed deep interest in ATP techniques assisting in writing Mizar proofs. The growing popularity of the related Sledgehammer ATP subsystem [6] for Isabelle/HOL finally led the first author to investigate the usability of the MizAR service (see Section 2) and thus provided the first valuable feedback to the second author. The first demonstration5 of the service reported here was done at the Joint Mathematical Meeting in New Orleans, January 2011. We report in this paper our initial experience with using the interactive ATP service, which we have collected while developing a new Mizar article. In Section 2 we summarize the automated reasoning online service called MizAR, described in detail in [16]. Section 3 illustrates how the ATP service is invoked from inside Emacs and how the ATP reports back. Since this service for Mizar is quite new, we have collected only some initial, but encouraging, experience, which is presented in Section 4.
The last section contains some conclusions but consists mainly of future plans.

2 Essence of the service

A host of automated reasoning and proof presentation tools for the Mizar system are described by Urban and Sutcliffe in [16]. The collection of tools provides evolving support for Mizar authors and potential users of the relatively rich Mizar library. The name chosen for this collection is Automated

5http://mws.cs.ru.nl/~urban/ams11/out4.ogv


Reasoning for Mizar, with the suggested acronym MizAR; the main motivation for its development was Sutcliffe’s SystemOnTPTP ATP service [8]. MML is distributed as a collection of source Mizar articles plus an internal representation of the library for use by the Mizar verifier. The standard Mizar distribution offers a stand-alone installation of the Mizar verifier but no tools for browsing or searching the library besides common utilities like grep. In contrast, MizAR offers, among others, the following services:

• Web access to the whole cross-linked HTMLized presentation of MML, which allows for browsing the entire library just by following HTML links. The HTMLized version is presented in a form close to the original articles as stored in the library. This feature is especially important for new authors.

• Server-based verification of a Mizar article which frees the author from installing the system on the local machine. The increasing speed of the Internet, parallelization (currently in development) of the verifier, and access to multi-core servers, all make the remote verification an attractive alternative to installing Mizar locally.

• HTMLization of the article on which the author is currently working. Such assistance is crucial to a Mizar author: the author can inspect his text disambiguated with respect to the omnipresent overloading of Mizar notations, synonyms and antonyms. It is a key advantage over just reading a text version of an article. The HTMLized presentation is also furnished with links permitting an inspection of all the items that are tacitly taken into account by the Mizar processor during semantic analysis and inference checking.

• Translation of the article to MPTP (Mizar Problems for Theorem Provers) format.

• Generation of ATP problems in TPTP format for theorems and inferences in the article with the option of invoking some default ATP systems for solving the problems. This is the point that we are focusing on in this paper as we describe how this service is made available directly from inside Emacs.

• Generation of suggestions about potentially useful premises for proving a formula in the article being developed.

The ATP service described in this paper relies on MizAR being able to tackle an inference step and, on success, to return all the details about the sufficient premises. The current implementation, used in the experiments reported below, uses Vampire6 as the ATP and SInE7 [4] as the premise pre-selection heuristic. Three different premise pre-selection strategies are used, run in parallel; one of them uses the entire set of some 100,000 MML items as axioms. For instance, the HTMLized presentation of the Mizar text snippet

6http://www.vprover.org/ 7http://www.cs.man.ac.uk/~hoderk/.


F1: card C >= cnG by Lchro;
G1: card C <= cnG by D1, E1, MYCIELSK:8;
H1: card C = cnG by F1, G1, XXREAL_0:1;

looks like

[screenshot of the HTMLized presentation omitted]

where behind almost every token there is a link to its denotation. Clicking on the last by token produces

[screenshot omitted]

Again, behind each displayed token of text, there is a link to its denotation. For example, clicking at dt_k5_card_1 takes us to

[screenshot omitted]

which is a redefinition of the functor card for finite sets, returning a natural number (Element of omega). The list of Mizar items underneath the heading ATP Proof References above includes all the items that the ATP needed in order to prove the goal: the items which the Mizar processor handles tacitly, together with the explicit references F1, G1, XXREAL_0:1. Showing the items used tacitly in an inference step by the Mizar verifier helps those authors who are puzzled when an inference step is accepted and it is not immediately clear why. At the moment, the ATP service uses a naming convention for Mizar items different from that of the purely textual Mizar sources. The problem stems from an asymmetry: some items in Mizar sources cannot be explicitly referenced in the textual form, as they do not have Mizar names. These items (clusters) are tacitly processed by the Mizar verifier. However, all information-bearing Mizar items exported to Mizar-‘unaware’ ATPs have to be referable by name. Work on the standardization of names for Mizar items continues. Another, and from our viewpoint more substantial, form of assistance that the ATP service can provide is to prove an inference step that the Mizar verifier does not accept or, as is frequently the case, that the author does not immediately know how to justify. The current MizAR service provides this assistance, but the returned output uses the HTMLized presentation of the Mizar text and a naming of items convenient for ATPs rather than for a human reader. In the next section, we present a solution to this problem whereby the ATP help is interactively returned in the Emacs session and uses the plain text version of Mizar notation.


3 Access from Emacs

The interactive ATP service is accessible from the Mizar mode in Emacs, and all the findings are reported within the same Emacs session. (We apologize to non-Emacs speakers for using technical jargon.) In the Mizar menu under Remote solving we set the remote server; several are available in Białystok, Nijmegen and Edmonton. Then, we need to switch on the option by; triggers ATP completion. Mizar inference steps are presented to the verifier by stating the goal followed by the keyword by and then a list of labels of premises. For example,

A: x in L
......
D: {x} c= L by A, ZFMISC_1:37;

where the label ZFMISC_1:37 refers to a fact imported from MML. The acceptance of an inference is decided by the rather complex and still evolving Mizar checker8 [7]. Finding the necessary references requires detailed knowledge of the library, and even then, in more complicated cases, it can be a time-consuming process. We have chosen a particularly simple example to illustrate the steps performed when finding sufficient premises with the assistance of the ATP. When the ATP service is switched on, we invoke it by typing by; after the goal we want assistance with.

A: x in L
......
D: {x} c= L by;

The server reacts immediately and displays

A: x in L
......
D: {x} c= L ; :: ATP asked ...

and the ATP answer is provided quickly, generally in less than a minute and often in just a few seconds, depending on preset time limits and on how busy the server is. The reported solution replaces the original by; with the premises found (or with a report of failure). In our simple case we get

A: x in L
......
D: {x} c= L by A,ENUMSET1:69,ZFMISC_1:37;

Within the same Emacs session, we invoke the standard Mizar utility relprem, which detects premises relevant for the Mizar verifier. We learn that referring to ENUMSET1:69 is not necessary and, after removing it, we obtain the result with which we started our illustration.

8http://www.cs.ru.nl/~freek/mizar/by.ps.gz


The above example is one step taken from a proof of a very simple theorem

theorem Sub3:
  for G being SimpleGraph, L being set, x being set
  st x in L & x in Vertices G
  holds x in Vertices (G SubgraphInducedBy L)

which turns out to be too much for the ATP service to handle at the moment; calling by; returns Sub3: ... by Unsolved. This ‘theorem’ comes from an attempt to represent simple graphs as at-most-1-dimensional simplicial complexes, presented in Section 4. Mizar proofs are typically written with a mix of top-down and bottom-up strategies, depending on the author, as the Mizar system allows constructing proofs both ways. In this small example, we tried to write the formulae of the intermediate proof steps that we found immediately obvious to ourselves.

proof
  let G be SimpleGraph, L be set, x be set such that
  A: x in L and
  B: x in Vertices G;
  C: {x} in G ;
  D: {x} c= L ;
  E: {x} in (G SubgraphInducedBy L) ;
  thus x in Vertices (G SubgraphInducedBy L) ;
end;

None of the sentences labeled C, D, E, nor the proof conclusion, is obvious to the Mizar verifier. We could have searched for the necessary premises by hand, but instead we employed the ATP escape by replacing the semicolons with by;. The replies from the ATP server were coming in while we were still typing, with the final result as follows:

C: {x} in G by B,SIMCOLOR:5;
D: {x} c= L by A,ENUMSET1:69,ZFMISC_1:37;
E: {x} in (G SubgraphInducedBy L) by C,D,BOOLE:7,SIMCOLOR:14;
thus x in Vertices (G SubgraphInducedBy L)
  by SIMCOLOR:func 5,E,BOOLE:7,SIMCOLOR:5;

At this initial stage of the project, the replies from the ATP server need some postediting in order to satisfy the Mizar verifier.

• The article in which we work is named SIMCOLOR and references to lemmas from this article use the name. These references have to be renamed to the corresponding local names.

• Some references returned by the ATP service mention Mizar items, like SIMCOLOR:func 5, which are implicitly processed by Mizar and cannot be explicitly referred to.

• Some references, like BOOLE:7, are processed implicitly by the Mizar verifier and do not have to be referenced.

• Finally, some references are spurious for the Mizar verifier and they can be removed, e.g. ENUMSET1:69 above. Why these references are returned by the ATP is sometimes puzzling.

All this postediting can (and will) be automated, as sketched below. Note that studying the references found by the ATP is instructive, as the automated service sometimes finds solutions quite different from what the author had in mind.
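Purely as an illustration — the names, the set of implicitly applied references, and the overall shape are hypothetical, not the planned MizAR implementation — such postediting could look roughly like the following Python sketch (removing references that are merely spurious, like ENUMSET1:69 above, would additionally require a run of relprem):

import re

LOCAL_ARTICLE = "SIMCOLOR"        # hypothetical: the article being written
IMPLICIT = {"BOOLE:7"}            # hypothetical: refs the verifier applies tacitly

def postedit(refs, local_names):
    """Rewrite an ATP-found reference list into verifier-ready form."""
    out = []
    for r in refs:
        if re.search(r":(func|def|sch) ", r):
            continue              # unnamed items cannot be cited textually
        if r in IMPLICIT:
            continue              # handled implicitly by the Mizar verifier
        if r.startswith(LOCAL_ARTICLE + ":"):
            out.append(local_names.get(r, r))   # rename to the local label
        else:
            out.append(r)
    return out

print(postedit(["SIMCOLOR:func 5", "E", "BOOLE:7", "SIMCOLOR:5"],
               {"SIMCOLOR:5": "Vertices0"}))
# ['E', 'Vertices0']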


With the postediting done by hand, the final result, accepted by the Mizar verifier, is:

C: {x} in G by B,Vertices0;
D: {x} c= L by A,ZFMISC_1:37;
E: {x} in (G SubgraphInducedBy L) by C,D;
thus x in Vertices (G SubgraphInducedBy L) by E,Vertices0;

4 Initial experience

Automated assistance for Mizar authors has been developed by the second author for quite some time, as mentioned in Section 2. Yet it was only in January 2011 that it was noticed that such assistance can be easily incorporated into the Mizar mode in Emacs, offering on-the-fly advice for justifying proof steps. At the time, the first author was developing a new formalization of simple graphs (the sixth in the Mizar library!) as 1-dimensional simplicial complexes9. In this formalization, we start with a set of vertex labels, with vertices being singletons (0-dimensional simplexes) and edges being unordered pairs (1-dimensional simplexes). We started using the ATP service on a Mizar article that was still being created, when even the formalization of the basic definitions had not been ironed out. The goal of the article was to develop enough theory to prove the construction of the Mycielskian of a graph. This construction was used by Mycielski [5] to prove the existence of triangle-free graphs with arbitrarily large chromatic number. The formalization uses only very basic notions of set theory: empty set, singleton, ordered and unordered pairs, union of a family of sets, subset, finite set, cardinality of a finite set, partition of a set, Cartesian product, basic properties of relations and functions, and not much more. We were interested in how the ATP service performs under such conditions, since the Mizar library contains thousands of facts using only these basic notions.

4.1 The first little surprise

The Mizar checker provides a proof placeholder similar to sorry in Isabelle and Admitted in Coq, which makes the Mizar verifier temporarily treat a stated fact as proven. In this case, one writes @proof instead of proof. However, this practice soon becomes tedious; more importantly, it cannot be used in contexts where a proof is syntactically illegal and a simple justification with by must be used. One helpful trick (thanks to Freek Wiedijk) is to state at the beginning of the article

FAKE: contradiction @proof end;

and then use by FAKE to justify any goal, as for the Mizar verifier anything follows from a contradiction. Mizar articles destined for the library must not contain @proof, of course. We have employed the FAKE trick, but it spoiled the ATP server: with contradiction available as an axiom, the ATP was able to prove everything. We could have resorted to using @proof, but we believe that the Mizar authoring environment needs to provide a built-in tool for temporarily preventing the verifier from checking an inference step. This could be implemented with @by as an analog of @proof, or (less conveniently) using some pragma comments.

⁹ Using simplicial complexes was first suggested by Veblen [18, p. 2] according to Harary [3, p. 7], who wondered why the approach had been popular only within pure mathematics. From our short experience, the explanation could be that for a pen-and-paper graph theorist the advantage of using this simple framework is not clear.


4.2 Empty graphs

With the FAKE statement removed, the ATP was still able to prove everything. This time, the following statement was to blame:

theorem SG1: for G being SimpleGraph holds {} in G @proof end;

While developing a new formalization we frequently state similar, simple facts and leave them unproven while focusing on more interesting pieces. We had carelessly stated that {} is in every simple graph, even in an empty graph. We tried the following fix:

theorem SG1: for G being non empty SimpleGraph holds {} in G @proof end;

which only removed a symptom. A similar unproven statement

theorem SG0: for G being SimpleGraph holds
  G = { {} } \/ Vertices G \/ Edges G @proof end;

is also unprovable, as it fails when G is empty. However, the presence of this unproven fact allowed the ATP to prove many other facts in a rather unexpected way. We again tried a quick fix by requiring G in SG0 to be non empty. However, we ran out of quick fixes when the ATP managed to prove a lot based on a contradiction stemming from:

theorem Vertices2: for G being SimpleGraph holds G = {} iff Vertices G = {} @proof end;

With our then definition of SimpleGraph (which we do not show here), the 'only if' direction of the above formula leads to a contradiction, but in discovering this fact the ATP used a dozen other facts from MML. This time, the quick and dirty fix from above would not work, as the theorem

theorem Vertices2: for G being non empty SimpleGraph holds G = {} iff Vertices G = {} @proof end;

looks rather suspicious even to the naked eye. This was the moment when we changed the definition of simple graph to the following:

definition
  mode SimpleGraph is 1-at_most_dimensional subset-closed (finite-membered non empty set);
end;

With this definition, the only simple graph with no vertices is {{}}, i.e. the singleton of the empty set. We call such a graph void, and a graph with no edges edgeless. The ATP helped us to straighten out the basic definition before we did more proofs which we deemed interesting but which would have been based on unproven, contradictory lemmas about empty graphs.

4.3 Empty sets

After rectifying the issue of contradictions stemming from empty simple graphs, the ATP was still able to prove surprising results by finding a contradiction elsewhere. Consider:

theorem M0e2a: for G being SimpleGraph, u being Element of Vertices G
  holds {[u, union G], union G} in Mycielskian G @proof end;


The fault is again in misusing empty sets. With the definition of SimpleGraph stated at the end of Section 4.2, a simple graph can be non empty and still have no vertices¹⁰. The construction of the Mycielskian of G adds vertices to the graph which are labeled, in our approach, either by union G or by an ordered pair with the first element being a vertex of G. In Mizar, typing u being Element of Vertices G means that u in Vertices G, but only when Vertices G is non empty. The contradiction stemmed from M0e2a combined with another unproven (but valid) fact:

theorem M0e2: for G being SimpleGraph, u being set st { union G, u } in Mycielskian G
  holds ex x being set st x in Vertices G & u = [x, union G] @proof end;

In the case of a void graph, we can still introduce objects that can be typed Element of Vertices G although Vertices G is empty. Then M0e2a gives us an edge in the Mycielskian of G, and M0e2 results in an x in Vertices G although there are none. To fix the problem we changed M0e2a to

theorem M0e2a: for G being SimpleGraph, u being set st u in Vertices G
  holds { [u, union G], union G } in Mycielskian G @proof end;

With the ATP service, we managed to discover inconsistencies in our formalization quite early on rather than much later, and without first investing time in proving more interesting facts based on shaky assumptions. The contradiction-finding capability of ATP systems in large theories is a fresh development which is desirable in many contexts. The large-theory component that we use, SInE [4], won the SUMO reasoning prize at CASC¹¹ in 2008 by finding a contradiction in an early first-order presentation of the large SUMO library.

4.4 Deciphering the ATP advice

There are times when the ATP manages to find a proof for a fact that we deem worth including as an exportable item (such items are marked theorem in Mizar). Here is an example:

theorem Aux1a: for x, X being set holds not [x,X] in X

for which the ATP returns the following list of premises

by ENUMSET1:69,ZFMISC_1:12,ZFMISC_1:38,TARSKI:def 5,ORDINAL1:3;

and at the moment the author is left with the task of converting the list into a sequence of smaller inference steps acceptable to the Mizar verifier. Sometimes this is quite a puzzling task, but with the trustworthy verdict from the ATP we know we can do it. It may help to consult the feedback obtained from MizAR in the form of the ATP Proof References (see Section 2), which in this case displays

ATP Proof References: reflexivity_r1_tarski, t38_zfmisc_1, t69_enumset1, d5_tarski, l1_simcolor, commutativity_k2_tarski, t3_ordinal1

¹⁰ It is common practice among graph theorists to require the set of vertices to be non-empty, but this seems to be a matter of tradition when a graph is considered as a structure with a non-empty carrier.
¹¹ http://www.ontologyportal.org/reasoning.html


We see that the ATP used the fact that the subset relation is reflexive. (c= is the first relational symbol defined in TARSKI.)

definition
  let X, Y be set;
  pred X c= Y means :: TARSKI:def 3
    for x being set st x in X holds x in Y;
  reflexivity;
end;

With this information we quickly see how to construct a more detailed justification by hand using the proof construct.¹²

theorem Aux1a: for x, X being set holds not [x,X] in X
proof
  let x, X be set such that
  A: [x,X] in X;
  B: [x,X] = { {x,X}, {x} } by TARSKI:def 5;
  C: {x,X} in { {x,X}, {x} } by ZFMISC_1:38;
  D: X in {x,X} by ZFMISC_1:38;
  thus contradiction by A, B, C, D, ORDINAL1:3;
end;

It turns out that the current ATP service reports using some premises that are not necessary for the Mizar verifier, like ENUMSET1:69 and ZFMISC_1:12. Minimizing the set of premises returned by the ATP is a topic of ongoing work. As mentioned earlier, we strongly favor refactoring into structured, detailed, and human-obvious proofs. We foresee heavy use of standard Mizar tools like the detectors of irrelevant premises and spurious inference steps, both of which preserve human-obviousness. We ended up using the following facts from MML as premises:

definition
  let x,y;
  func [x,y] equals :: TARSKI:def 5
    { { x,y }, { x } };
end;

theorem :: ZFMISC_1:38
  {x1,x2} c= Z iff x1 in Z & x2 in Z;

theorem :: ORDINAL1:3
  not ( X in Y & Y in Z & Z in X);

Later we realized that a more natural proof of this little fact can use the definition of the unordered pair,

definition
  let y; let z;
  func { y, z } -> set means :: TARSKI:def 2
    x in it iff x = y or x = z;
  commutativity;
end;

instead of ZFMISC_1:38 and the reflexivity of c=, to state what belongs to an unordered pair. It is an interesting problem for future research how to direct the ATP into using premises of our preference rather than what the ATP finds first.


¹² We consider it a good feature that Mizar does not allow complicated, fragile, and slow proof-finding procedures as a part of the core proof checking. Twenty years of experience with daily large-scale theory refactoring of MML has taught the Mizar community that such fragility and slowness should be avoided. We strongly believe that the way from automatically found proofs to proofs in the MML leads through suitable semi-automated refactoring into structured proofs which are perceived as obvious by humans. Work in this direction has been started by translating detailed (untyped) Otter proofs directly into Mizar format, and compressing them using the Mizar proof-compression utilities in a fix-point loop; see https://github.com/JUrban/ott2miz.

4.5 ATP service effectiveness

Our work on the SIMCOLOR article is being continued. Of the few hundred nontrivial inferences that we tried, the ATP managed to solve around 40%, which, surprisingly, is close to the success rate of Sledgehammer on nontrivial goals reported by Paulson and Blanchette [6]. On the other hand, the ATP which we used can reprove 86% of the inferences if it is told which premises were used by humans. This means that more precise narrowing of potential premises is a vital issue for the ATP service. From other experiments using several ATPs in parallel and longer time limits, one can reprove more than 99% of atomic inferences in the Mizar library [15]. The interactive ATP service has helped us in several ways:

• The ATP managed to prove some lemmas using the simple by justification that require a structured proof for the Mizar verifier. This is not a big surprise, as the Mizar verifier uses only pattern matching and very limited resolution. The feedback from the ATP was quite helpful, as it is much easier to write a detailed proof when one knows the facts that suffice for the proof.

• The ATP turned out to be a search tool in a rather unexpected way. More than once, the ATP indicated that we had already formed a local lemma from which a given formula follows in one step while we were about to write several inference steps.

• The ATP found proof justifications quite different from what we had in mind. Sometimes, it found large sets of premises when some other premise sufficed. The inverse also happened as stated in the previous item.

• When the ATP finds a justification, it returns all the Mizar items deemed necessary by the automated prover. The feedback also includes those Mizar items which are tacitly processed by the Mizar verifier and are not (and cannot be) referenced in a Mizar text. This feedback led us to a better understanding of the task at hand.

5 Conclusion and future plans

Our initial experience with the interactive ATP service for Mizar authors is encouraging. We get help both in justifying single inference steps and in 'debugging' the conceptual framework of a new formalization, as pointed out in Section 4. We find the help of the ATP service very valuable in the context of discovering a proof. The plans for future development of the ATP service for Mizar include:

• Employing several ATP systems in parallel. The E prover and SPASS have been part of the MizAR service since its beginning. Recent exhaustive experiments [14] over MML show that the internal strategy scheduling of Vampire is adequate, and that a more important source of gains is parallelization of different axiom selection scenarios.

• Improving the techniques for narrowing the selection of potentially useful premises such that the ATP systems can become more efficient.

• Creating strong knowledge-based ATP meta-systems able to suggest intermediate lemmas, suitable concepts, and reliable proof structures by learning from the large proof corpus, in a similar way to the learning of ATP premise selection. That ATPs with suitable heuristic premise selection methods can significantly help in large theories like MML has been known since 2003 [9]. We perceive a continuation of this effort as a chance for large-theory formal mathematics to boost the development of strong semantic AI systems that are hard to envisage in less formal domains.

• Automatically translating the findings of the ATP service into a sequence of basic inference steps acceptable to the Mizar verifier. This would substantially contribute to creating proofs at the level of detail suitable in the context of justification, which is necessary for smooth maintenance of the growing library.

• Designing syntax such that the Mizar author can restrict the parts of MML which are searched for relevant premises. At the moment the entire MML of some 100,000 items is searched. Such a feature would facilitate automating searches for alternative proofs.

We foresee future work on this project as a contribution to web-based collaborative platforms for the development of mathematics, currently within the MathWiki¹³ initiative.

Acknowledgments: We would like to thank one of the reviewers for many helpful comments, suggestions, and corrections. We would also like to thank Lorna Stewart and Jesse Alama for their help in improving the English of the paper.

References

[1] The QED manifesto. In Alan Bundy, editor, CADE, volume 814 of Lecture Notes in Computer Science, pages 238–251. Springer, 1994.
[2] Ingo Dahn and Christoph Wernhard. First order proof problems extracted from an article in the MIZAR mathematical library. In M.P. Bonacina and U. Furbach, editors, Int. Workshop on First-Order Theorem Proving (FTP'97), pages 58–62, 1997.
[3] Frank Harary. Graph theory. Addison-Wesley, 1969.
[4] Kryštof Hoder and Andrei Voronkov. Sine qua non for large theory reasoning. In Proceedings of CADE-23, 2011. Accepted.
[5] Jan Mycielski. Sur le coloriage des graphes. Colloquium Mathematicum, 3:161–162, 1955.
[6] Lawrence C. Paulson and Jasmin C. Blanchette. Three years of experience with Sledgehammer, a practical link between automated and interactive theorem provers. In 8th IWIL, 2010. Invited talk.
[7] Piotr Rudnicki. Obvious inferences. Journal of Automated Reasoning, 3:383–393, 1987.
[8] Geoff Sutcliffe. System description: SystemOnTPTP. In David A. McAllester, editor, CADE, volume 1831 of Lecture Notes in Computer Science, pages 406–410. Springer, 2000.
[9] Josef Urban. MPTP — motivation, implementation, first experiments. J. Autom. Reasoning, 33(3-4):319–339, 2004.
[10] Josef Urban. MoMM — fast interreduction and retrieval in large libraries of formalized mathematics. International Journal on Artificial Intelligence Tools, 15(1):109–130, 2006.

¹³ http://www.fnds.cs.ru.nl/fndswiki/Research/MathWiki


[11] Josef Urban. MPTP 0.2: Design, implementation, and initial experiments. Journal of Automated Reasoning, 37(1-2):21–43, 2006.
[12] Josef Urban. MaLARea: a metasystem for automated reasoning in large theories. In Geoff Sutcliffe, Josef Urban, and Stephan Schulz, editors, ESARLT, volume 257 of CEUR Workshop Proceedings. CEUR-WS.org, 2007.
[13] Josef Urban. Automated reasoning for Mizar: Artificial intelligence through knowledge exchange. In Piotr Rudnicki, Geoff Sutcliffe, Boris Konev, Renate A. Schmidt, and Stephan Schulz, editors, LPAR Workshops, volume 418 of CEUR Workshop Proceedings. CEUR-WS.org, 2008.
[14] Josef Urban, Kryštof Hoder, and Andrei Voronkov. Evaluation of automated theorem proving on the Mizar mathematical library. In Komei Fukuda, Joris van der Hoeven, Michael Joswig, and Nobuki Takayama, editors, ICMS, volume 6327 of Lecture Notes in Computer Science, pages 155–166. Springer, 2010.
[15] Josef Urban and Geoff Sutcliffe. ATP-based cross-verification of Mizar proofs: Method, systems, and first experiments. Mathematics in Computer Science, 2(2):231–251, 2008.
[16] Josef Urban and Geoff Sutcliffe. Automated reasoning and presentation support for formalizing mathematics in Mizar. In Serge Autexier, Jacques Calmet, David Delahaye, Patrick D. F. Ion, Laurence Rideau, Renaud Rioboo, and Alan P. Sexton, editors, AISC/MKM/Calculemus, volume 6167 of Lecture Notes in Computer Science, pages 132–146. Springer, 2010.
[17] Josef Urban, Geoff Sutcliffe, Petr Pudlák, and Jiří Vyskočil. MaLARea SG1 - machine learner for automated reasoning with semantic guidance. In Alessandro Armando, Peter Baumgartner, and Gilles Dowek, editors, IJCAR, volume 5195 of Lecture Notes in Computer Science, pages 441–456. Springer, 2008.
[18] Oswald Veblen. Analysis Situs, volume V. AMS Colloquium Publications, 193.

Combining Proofs to form Different Proofs

Geoff Sutcliffe (University of Miami, USA)
Cynthia Chang (Rensselaer Polytechnic Institute, USA)
Deborah McGuinness, Tim Lebo, Li Ding (Rensselaer Polytechnic Institute, USA)
Paulo Pinheiro da Silva (University of Texas at El Paso, USA)

Abstract

Different Automated Theorem Proving (ATP) systems solve different parts of different problems in different ways. Given a set of proofs produced by ATP systems based on adequately common principles, it is possible to create new proofs by combining proof components extracted from the proofs in the set. It is not generally easy to say that one of the original or new proofs is better or worse than another, but ways to show that two proofs are different are available. This paper describes a process of proof combination to form new proofs that are different from the original set of proofs.

1 Introduction

Proofs form an essential part of mathematics and modern sciences [11]. Proofs allow human users and machines to understand [4, 25] and verify [12, 21] the processes that yield scientific discoveries, logical conclusions, and other forms of results. This work deals specifically with formal proofs generated by Automated Theorem Proving (ATP) systems for first-order logic, but the framework and techniques are applicable to a broad range of proof-like structures, e.g., informal arguments, dataflows, scientific data manipulation, etc. Some examples of such use cases are described in Section 6. The proofs used in this work come from problems expressed as a set of axioms and a conjecture to be proved, such as those in the Thousands of Problems for Theorem Provers (TPTP)¹ problem library [22]. The proofs are directed acyclic graphs in which the leaf nodes come from the problem, other nodes are inferred from parent nodes, and the root nodes provide an assurance that the conjecture is a theorem of the leaf axioms. Specifically, the ATP systems that have been used produce proofs by contradiction, in which an initial inference step negates the leaf conjecture, all other inferences are at least satisfiability preserving, and the root nodes are false. These are typical of proofs found in the Thousands of Solutions from Theorem Provers (TSTP)² solution library, which contains solutions to TPTP problems. All the problems and proofs are written in the TPTP language [24]. Given a set of proofs and an equivalence relation between nodes of the proofs, a new proof can be formed by selecting a replaced node in a target proof, finding an equivalent replacing node in a contributing proof, and replacing the sub-DAG rooted at the replaced node by the sub-DAG rooted at the replacing node. The resultant combined proof is different from the target and contributing proofs. Combining can be iterated in a controlled fashion to generate a series of new combined proofs based on the set of original proofs. The overall process might require that the set of original proofs be proofs for the same problem, or be based on a common background theory; this depends on the application context. There has been previous work related to this, in the direction of compressing propositional proofs, e.g., [2, 18, 5]. A key difference between that work and this is that here the goal is to produce proofs that are different from the original proofs, rather than proofs that are shorter (see Section 3).

¹ http://www.tptp.org
² http://www.tptp.org/TSTP/

Further, this work is in the context of first-order logic, which adds extra demands on the process. At a higher level there has been work studying techniques for presenting alternative proofs and proofs at different levels of granularity [3, 1]. The data structure presented in [1] provides functionality similar to that of the PML data structure described in Section 2. That research was aimed at supporting proof development in an interactive mathematical proof assistant, and was related to an application in proof explanation and proof adaptation for tutoring systems [19]. Again, the goal here of producing different proofs, based on proofs produced by ATP systems, makes this work somewhat distinct. The rest of this paper is structured as follows: Section 2 describes the PML language and the translation of TPTP format proofs into PML for the proof combining process. Section 3 describes quantitative measures of proofs, which are used to guide the proof combination process. Section 4 describes the proof combination process, and Section 5 presents an example application of the process. Section 6 concludes.

2 Proofs in PML

The Proof Markup Language (PML) [15] is a Semantic Web based representation for exchanging explanations, including provenance information that records the sources of knowledge, justification information that describes the steps for deriving the conclusions or executing workflows, and trust information that captures trustworthiness assertions about knowledge and sources. In contrast to the TPTP language, there is less focus on the logical data and the fine-grained reasoning processes: PML supports arbitrary logical data and inference steps, including, e.g., extraction of data from non-logical sources, conversion to logical forms, clausification, and first-order inferences. The proof combining described in this work is done using PML because of its ability to explicitly represent alternative justifications for a conclusion. The XML-based format also allows for easy parsing and processing. PML classes are OWL [13] classes, and PML data is therefore expressible in the RDF/XML syntax. PML is used to build OWL documents representing both proofs and proof provenance information. For this work, the representation of proofs is of primary interest. The two main constructs of proofs in PML are NodeSets and InferenceSteps (a code sketch of both constructs follows the two lists below). A NodeSet is used to host a set of alternative justifications for one conclusion. A NodeSet contains:

• A URI that is its unique identifier.
• The conclusion of the proof step.
• The expression language in which the conclusion is written.
• Any number of InferenceSteps, each of which represents an application of an inference rule that justifies the conclusion.

An InferenceStep contains:

• The inference rule that was applied to produce the conclusion.
• The antecedent NodeSets of the inference step.
• Bindings for variables in the inference.
• Any number of discharged assumptions.
• The original sources upon which the conclusion depends.
• The inference engine that performed the inference step.
• A time stamp recording when the inference step was performed.
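As promised above, here is a minimal Python sketch of the two constructs as just listed; the class and field names are our own shorthand for illustration and do not reproduce the actual OWL vocabulary of PML.

from dataclasses import dataclass, field
from typing import List

@dataclass
class InferenceStep:
    rule: str                     # inference rule applied
    antecedents: List["NodeSet"]  # antecedent NodeSets
    bindings: dict = field(default_factory=dict)           # variable bindings
    assumptions: List[str] = field(default_factory=list)   # discharged assumptions
    sources: List[str] = field(default_factory=list)       # original sources
    engine: str = ""              # inference engine that performed the step
    timestamp: str = ""           # when the step was performed

@dataclass
class NodeSet:
    uri: str                      # unique identifier
    conclusion: str               # conclusion of the proof step
    language: str                 # expression language of the conclusion
    justifications: List[InferenceStep] = field(default_factory=list)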

61 Combining Proofs to form Different Proofs G. Sutcliffe, et al.

A proof consists of a collection of NodeSets, with a root NodeSet as the final goal, linked recursively to its antecedent NodeSets. The translation of a TSTP proof into PML is done by parsing the TSTP file and extracting the necessary information into PML object instances [16]. A proof is translated into a PML NodeSet collection, with each formula in the solution being translated as a singleton member of the collection. Additionally, the conjecture of the corresponding TPTP problem is translated into a PML Query, and the English header field of the problem into a PML Question. The Query contains a pointer to the Question and to all NodeSet collections (from different ATP systems) that provide a solution. The Query thus provides a starting point for accessing all the proofs for that problem. It is important for this work that the PML representation can store multiple justifications (InferenceSteps) for a formula in a proof (a NodeSet). In ATP terms, PML can associate multiple inference steps with a formula, in the sense that the formula is the result of each of the inference steps. This allows a single PML structure to capture multiple proofs, and a single proof can be extracted by choosing one inference step leading to each formula; a sketch of such an extraction is given below. A limited version of this is used in the proof combining process, described in Section 4. The PML representation is generally convenient for representing alternative derivations, as it is independent of the underlying reasoning process. This independence will be leveraged in future work representing scientific provenance information; see Section 6. The way that PML captures multiple proofs in one structure is akin to using a reasoning calculus with either a weakening or a juxtaposition rule [7]. In such calculi the "alternative" sub-DAGs leading to a node are combined by the application of one of these rules. In the same way that alternative proofs can be extracted from the PML representation, alternative proofs can be extracted from these richer calculi's proofs, e.g., by using cut elimination on a proof that uses weakening [6].
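As an illustration of that extraction, here is a small sketch reusing the hypothetical NodeSet and InferenceStep classes above; the traversal and the selection function (defaulting to the first recorded justification) are our own assumptions, standing in for whatever criterion an application would use.

def extract_proof(root, choose=lambda steps: steps[0]):
    # Extract one proof from a multi-justification PML structure by
    # choosing a single InferenceStep per NodeSet, working back from
    # the root conclusion to the leaves.
    proof = {}
    def visit(node):
        if node.uri in proof:
            return
        if node.justifications:                 # inferred node
            step = choose(node.justifications)  # pick one justification
            proof[node.uri] = (node.conclusion, step.rule,
                               [a.uri for a in step.antecedents])
            for parent in step.antecedents:
                visit(parent)
        else:                                   # leaf: axiom or conjecture
            proof[node.uri] = (node.conclusion, "leaf", [])
    visit(root)
    return proof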

3 Different Proofs (are Good Proofs)

Section 4 describes how proof combining steps are wrapped in a greedy hill-climbing algorithm. Hill-climbing uses a heuristic function that guides the search to a (locally) optimal solution. For this work the heuristic function evaluates and ranks proofs based on three artifacts:

• the leaf formulae (axioms and conjecture)
• the inferred formulae
• the inference steps (where an inference step is identified with its parent and inferred formulae).

However, as is explained in [23], it turns out to be difficult, if not unreasonable, to directly rank proofs according to such artifacts. Instead, it is possible to say only that proofs are different from each other. Different proofs may be preferable, depending on the user's point of view. For example: in mathematics, different proofs can demonstrate different levels of mathematical "elegance", or show that a theorem can be proved from different axiomatic bases; in formal methods, different proofs can be easier or harder to verify, or provide different insights into the structure or process being analyzed; in planning, different proofs can correspond to different sequences of actions that lead to the same goal state; in knowledge based reasoning, different proofs can be built from different knowledge bases, allowing independent agents with different knowledge to reach a common conclusion; in security analysis, different proofs can rely on different ground observations, thus increasing trust in the intelligence information that is generated.


Figure 1: Jaccard Distances (the two proofs by contradiction of the illustrative problem below, with inferences numbered for identification)

Thus the heuristic function used in this work prefers new combined proofs that are maximally different from other proofs. Evaluating proofs by counting formulae and inference steps is not necessarily meaningful. Comparing the numbers of axioms used in proofs can be meaningless because different axioms contain different amounts of information: using many axioms that each contain a small amount of information is incomparable with using fewer axioms that each contain a large amount of information. Comparing the numbers of inferred formulae or the numbers of inference steps in proofs can be meaningless because of the in-principle differences between inference rules: a proof that uses inference rules that take "large" steps, e.g., hyperresolution, is likely to have fewer steps and fewer inferred formulae than one that uses rules that take "smaller" steps, e.g., binary resolution. It is also easy to create proofs with fewer or more inferences, by combining inferences into larger steps, or splitting inferences into smaller steps. Thus, rather than counting the proof artifacts, it is better to compare sets of the artifacts. For this work the Jaccard similarities [10] between sets of proof artifacts are measured. The following propositional problem illustrates this (the example is only for illustration; the problem is trivial):

Axioms: {a, b, a ⇔ d, d ⇔ e, e ⇔ b}
Conjecture: a & b

Two different proofs by contradiction are shown in Figure 1. The inferences are numbered for identification. The sets of artifacts (leaf formulae, inferred formulae, and inferences) for the left-hand proof are:

Leaves: {a & b, a, a ⇔ d, d ⇔ e, e ⇔ b}
Inferred: {¬a|¬b, a, ¬a|d, ¬d|e, ¬e|b, d, e, b, false}
Inferences: {1, 2, 3, 4, 5, 6, 7, 8, 9}

and for the right-hand proof:

Leaves: {a & b, a, e ⇔ b, b}
Inferred: {¬a|¬b, a, ¬e|b, ¬b|e, b, e, false}
Inferences: {1, 2, 5, 8, 9, 10, 11, 12}

The Jaccard similarity between the sets of leaf formulae is 0.50, between the sets of inferred formulae it is 0.60, and between the sets of inferences it is 0.42.
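As a quick check of these numbers, a few lines of Python; the set encodings are transcribed from the example above and are not part of the described system.

def jaccard(a, b):
    # Jaccard similarity: |intersection| / |union|
    return len(a & b) / len(a | b)

left_infs  = {1, 2, 3, 4, 5, 6, 7, 8, 9}
right_infs = {1, 2, 5, 8, 9, 10, 11, 12}
print(round(jaccard(left_infs, right_infs), 2))   # 0.42

left_leaves  = {"a & b", "a", "a <=> d", "d <=> e", "e <=> b"}
right_leaves = {"a & b", "a", "e <=> b", "b"}
print(jaccard(left_leaves, right_leaves))         # 0.5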


Given Jaccard similarities between pairs of proofs, a set of proofs is evaluated in terms of the differences between the proofs. The measurement is done using a modified version [23] of the generalized clustering coefficient [14]. This measure views the proofs as vertices of a complete weighted graph, with the Jaccard similarities for a chosen proof artifact as the edge weights. For each pair of edges coincident on a vertex, the pair defines a triplet (of vertices) closed by the edge between the vertices at the other ends of the edges. The weight of a triplet is the average of the pair of edges' weights. The clustering coefficient C_sw³ is:

C_sw = ( Σ_{all triplets} triplet weight × closing edge weight ) / ( Σ_{all triplets} triplet weight )

For a set of proofs that are all the same, so that all edge weights are 1.0, C_sw = 1.0. For a set of proofs that are all completely different, so that all edge weights are 0.0, C_sw = 0.0 (noting that it is necessary to not do the math, to avoid a division by zero). For two proofs, C_sw is defined to be the Jaccard similarity between the two sets of chosen artifacts. C_sw is undefined for single proofs. The chosen proof artifact is recorded as a superscript on C_sw: Lf for leaf formulae, If for inferred formulae, and Is for inference steps, e.g., C_sw^Lf. For example, for three (somewhat similar) proofs A, B, and C, with leaf set Jaccard similarities A-B = 0.66, B-C = 0.50, and A-C = 0.50, the triplet formed from the edges A-B and A-C has weight 0.58. Then

C_sw^Lf = (0.29 + 0.29 + 0.33) / (0.58 + 0.58 + 0.50) = 0.5482

Replacing C with a (quite different) proof D, so that the leaf set Jaccard similarities are A-B = 0.66, B-D = 0.34, and A-D = 0.34,

C_sw^Lf = (0.17 + 0.17 + 0.22) / (0.50 + 0.50 + 0.34) = 0.4212

For a chosen proof artifact α, a high C_sw^α indicates that the proofs in the set are highly clustered with respect to α. Conversely, a low C_sw^α indicates that the proofs are rather different from each other with respect to α. Thus, for this work, D would be preferred over C with respect to leaf sets.
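The coefficient itself can be sketched in a few lines of Python; representing the edge weights as a dictionary keyed by vertex pairs is our own choice. Run on the A, B, C example above, it reproduces 0.5482.

from itertools import combinations

def c_sw(weights):
    # Generalized clustering coefficient over a complete weighted graph.
    # `weights` maps frozenset({u, v}) to the Jaccard similarity of u, v.
    vertices = set().union(*weights)
    num = den = 0.0
    for v in vertices:
        # each pair of edges coincident on v defines a triplet,
        # closed by the edge between the two other vertices
        for a, b in combinations(sorted(vertices - {v}), 2):
            triplet = (weights[frozenset({v, a})]
                       + weights[frozenset({v, b})]) / 2
            num += triplet * weights[frozenset({a, b})]
            den += triplet
    return num / den

sims = {frozenset({"A", "B"}): 0.66,
        frozenset({"B", "C"}): 0.50,
        frozenset({"A", "C"}): 0.50}
print(round(c_sw(sims), 4))   # 0.5482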

4 Combining Proofs

Reiterating from Section 1, given a set of proofs and an equivalence relation between nodes of the proofs, a new proof can be formed by selecting a replaced node in a target proof, finding an equivalent replacing node in a contributing proof (which can be the target proof itself or a different proof), and replacing the sub-DAG rooted at the replaced node in the target proof by the sub-DAG rooted at the replacing node of the contributing proof. Any nodes in the replaced sub-DAG that are used in another part of the target proof are retained, and any duplicated axioms from the problem (leaf formulae of the proof) are merged. Figure 2 illustrates a combining step based on two original proofs for the simple propositional example introduced in Section 3. The left hand upper proof is the target proof, with the node containing the formula e as the replaced node, so that the outlined sub-DAG is replaced. The right hand upper proof is the contributing proof, with the node containing the formula e as the replacing node (based on syntactic equivalence), so that the outlined sub-DAG replaces the sub-DAG in the target proof. The lower proof is the combined proof. Note that the nodes containing a in the target proof are retained in the combined proof because they are also used in another part of the target proof. Note also that the node containing e ⇔ b is merged as the parent of the node containing ¬e | b from the target proof, and as the parent of the node containing ¬b | e in the replacing sub-DAG.
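For illustration, here is a minimal Python sketch of one combining step, over proofs represented as maps from each node to the list of its parent nodes; this representation and the pruning strategy are our own simplifications of the process described above, and node equivalence is reduced to equality of node labels (so each formula is assumed to label at most one node per proof).

def combine(target, contributing, replaced, replacing):
    # One combining step: the sub-DAG rooted at `replaced` in the target
    # proof is replaced by the sub-DAG rooted at `replacing` from the
    # contributing proof. A proof maps node -> list of parent nodes;
    # leaves map to [].
    def up(proof, node, seen):
        # collect `node` and everything above it, towards the leaves
        if node not in seen:
            seen.add(node)
            for parent in proof.get(node, []):
                up(proof, parent, seen)

    combined = dict(target)
    # graft: the replaced node is now justified like the replacing node
    combined[replaced] = list(contributing.get(replacing, []))
    grafted = set()
    up(contributing, replacing, grafted)
    for node in grafted - {replacing}:
        # merge duplicated axioms: keep the target's entry if present
        combined.setdefault(node, contributing.get(node, []))

    # prune: keep only nodes still reachable from the target's roots
    roots = [n for n in target if all(n not in ps for ps in target.values())]
    reachable = set()
    for root in roots:
        up(combined, root, reachable)
    return {n: ps for n, ps in combined.items() if n in reachable}

On the example of Figure 2, replaced and replacing would both be the node labelled e, with the left proof as the target and the right proof as the contributing proof.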

³ 'C' - clustering coefficient, 'sw' - scaled weight.


Figure 2: One combining step (the target proof, the contributing proof, and the resulting combined proof, with the replaced and replacing nodes outlined)

The C_sw value of Section 3 is used to compare original and combined proofs. Given a set of original proofs, a proof artifact α, and a proof P to be evaluated, C_sw^α(P) is the C_sw^α value for the set consisting of the original proofs plus (a distinct copy of) P. Following the example above, using the leaf formulae as the proof artifact of interest, the Jaccard similarity between the target and contributing proofs is 0.66, and between each of the two original proofs and the combined proof it is 0.50. C_sw^Lf(Target) is 0.7576, and C_sw^Lf(Combined) is 0.4241. The combined proof is thus (as might be expected) more different than the target proof from the set of original proofs, with respect to leaf sets. The heuristic of "maximal difference from the original proofs" is the basis for the greedy hill-climbing algorithm that is used to generate a series of new combined proofs based on a set of original proofs. The algorithm is shown in Figure 3. The algorithm gives preference to self-combining steps, in which the target proof is also used as the contributing proof, over non-self-combining steps that graft in sub-DAGs from an original proof. Self-combining steps serve to "optimize" the target proof, e.g., collapsing sequences of equivalent nodes. The heuristic is used to rank alternative combined proofs, and also to determine whether the best combined proof is better than the target proof (in which case the (local) maximum has not yet been reached). The heuristic aims to find non-self-combined proofs that are different (i.e., with smaller C_sw values) from the original proofs, first with respect to their leaf sets, then with respect to their inferred formula sets, and lastly with respect to their inferences. When selecting the initial target proof from the set of original proofs, a proof with the largest C_sw^If is chosen. This preference selects an original proof that has the most formulae that can be used as replaced formulae, hence maximizing the number of possible combined proofs that can be formed. The initial selection also relies on a preference ordering on the ATP systems, which has to be provided by the user. Figure 4 shows a sequence of combinations as generated by the hill-climbing algorithm, with the two upper proofs of Figure 2 as the original proofs.


/* Choose the first target from the original proofs */
BestProofs = the subset of OriginalProofs with maximal C_sw^If;
Target = the proof in BestProofs from the most preferred ATP system;
/* Start hill-climbing */
repeat
    /* Try combining with itself as the contributing proof */
    CombinedProofs = Combine(Target, {Target});
    if CombinedProofs ≠ ∅ then
        BestProofs = BestByArtifacts(CombinedProofs);
    else
        /* Try combining with original proofs as the contributing proof */
        CombinedProofs = Combine(Target, OriginalProofs);
        BestProofs = BestByArtifacts(CombinedProofs ∪ {Target});
    end
    /* Select the next target proof */
    if Target ∈ BestProofs then
        Exit with Target as final proof;
    else
        Target = any proof in BestProofs;
    end
until Target == null;

/* Function to select the best target proof based on artifacts */
Function BestByArtifacts(Proofs) begin
    /* Prefer proofs with different axioms from the original proofs */
    BestProofs = the subset of Proofs with minimal C_sw^Lf;
    /* Next prefer proofs with the same/different inferred formulae as the
       original proofs (see explanation in text) */
    BestProofs = the subset of BestProofs with minimal C_sw^If;
    /* Finally prefer proofs with different inference steps */
    BestProofs = the subset of BestProofs with minimal C_sw^Is;
    return BestProofs;
end

Figure 3: Hill-climbing combining proofs

For this simple illustrative example, syntactic equivalence is adequate for identifying replaced and replacing nodes (see Section 5 for stronger notions of equivalence). The original proof A is selected as the initial target. The first iteration combines Original Proof A with itself as the contributing proof, collapsing the sequence of two nodes containing the formula a, to produce Combined Proof 1. The second iteration continues with Combined Proof 1 as the target proof and Original Proof B as the contributing proof. The sub-DAG rooted at the node containing b is replaced, to produce Combined Proof 2. Another iteration does not produce any more different proofs, and the hill-climbing stops. Table 1 shows the C_sw values for the four proofs. The implementation of the combining process takes advantage of PML's ability to store multiple justifications (PML InferenceSteps) for a formula in a proof (a PML NodeSet), as explained in Section 2. For each node in the target proof, all equivalent ancestor nodes in the target proof, and all equivalent nodes in the original proofs, are identified.


Figure 4: A sequence of combining steps (Original Proof A, Original Proof B, Combined Proof 1, and Combined Proof 2, with the replaced and replacing nodes marked)

Proof       | C_sw^Lf | C_sw^If | C_sw^Is
Original A  | 1.0000  | 0.8483  | 0.5616
Original B  | 1.0000  | 0.8483  | 0.5616
Combined 1  | 1.0000  | 0.8736  | 0.5616
Combined 2  | 0.6250  | 0.6405  | 0.4235

Table 1: C_sw values for the sequence of combining steps

The inferences that produce these equivalent nodes are added as alternative justifications for the node in the target proof. In this way all the possible justifications for all the nodes in the target proof are captured in one PML structure. Each combined proof is then extracted by selecting a different inference for a node (the replaced node) in the target proof. Note that if the different inference came from the target proof itself, then a self-combining step has been implemented; otherwise a non-self-combining step has been implemented.

The greedy nature of the hill-climbing means that the process does not necessarily find the most different possible combined proof, i.e., the process is not “complete”. A broader search, e.g., a beam search or best-first search, might produce more different proofs. A full search of the space of combining possibilities might be necessary to find the most different proof. This is a topic for further investigation.


5 Example Use

The proof combining process has been tested on a range of problems and their proofs. The problems are all in full first-order logic, and the proofs were produced by the ATP systems EP 1.2 [20], Metis 2.3 [9], SInE 0.4 [8], and Vampire 0.6 [17]. In each case the proofs for the same problem were combined, i.e., all the proofs had axioms from the problem's axiom set. In cases where different problems use the same axiom set it is possible to combine proofs for different problems, but this has been left for future work. As the problems and solutions are in full first-order logic, a simple syntactic notion of equivalence is inadequate. For example, under syntactic equivalence the commutativity, associativity, and reversibility of logical operators would result in "obviously equivalent" formulae not being recognized as such, e.g., ∀X(q(X) ⇐ p(X)) would not be recognized as equivalent to ∀X(p(X) ⇒ q(X)). More subtly, as the ATP systems all convert the problems to clause normal form, Skolem function symbols are typically introduced, and different ATP systems use different naming conventions for these symbols. Therefore a more general notion of formula equivalence was implemented, taking into account the properties of logical operators, possible reordering of quantifications, and the naming of Skolem symbols (a toy sketch of this kind of normalization is given at the end of this section).

As an exemplar use of proof combining, a simplification of the general education rules for obtaining a Bachelor of Science degree from the College of Arts and Sciences at the University of Miami⁴ was encoded. The rules require that students complete requirements in English composition, Humanities, Natural science, Mathematics, Social science, a Second language, and Writing. In most of these areas there are alternative ways of completing the requirements. The encoding is provided in Appendix A. A proof of the conjecture that "it is possible to complete the general education requirements" finds one possible combination of the alternatives. By combining a few different proofs, new combinations of the alternatives are found. The combining process used proofs by EP and Metis. As noted in Section 4, the algorithm relies on a preference ordering on the ATP systems; for these experiments the order chosen was EP then Metis.⁵ Three original proofs, two by EP and one by Metis, were used. The combining process chose one of the EP proofs as the initial target. The proof DAG is shown in the left hand side of Figure 5, as rendered by IDV [25]. An examination of the axioms used shows that EP chose Art (Humanities), Biology (Natural science), Computer Science (Mathematics), Anthropology (Social science), Arabic (Second language), and Philosophy (Writing). Philosophy could have been used to also satisfy the Humanities requirement, but EP chose to take the extra Art course. The combining process first does 13 self-combining steps, which collapse sequences of equivalent formulae in EP's proof. Following that there are five non-self-combining steps interleaved with 28 self-combining steps. The five non-self-combining steps replace sub-DAGs representing alternative ways of satisfying the various general education requirements. The 28 self-combining steps collapse sequences of equivalent formulae in the replacing sub-DAGs from the non-self-combining steps. The right hand side of Figure 5 shows the final combined proof.
An examination of the axioms used shows that the combined proof chose Art History (Humanities, from Metis' proof), Biology (Natural science, from the initial EP proof), Statistics (Mathematics, from the other EP proof), Psychology (Social science, from the other EP proof), Japanese (Second language, from the other EP proof), and a specialized writing course (Writing, from the other EP proof). The first of the five non-self-combining steps makes a comprehensive change to the EP proof, replacing the sub-DAG rooted at the second-to-deepest node in the target proof by the corresponding sub-DAG from the other EP proof; this step effectively replaced the original EP proof by the other EP proof.

⁴ http://www.miami.edu/bulletins
⁵ This ordering was only weakly motivated, based on experience with the systems.

The second non-self-combining step replaces the sub-DAG for the Humanities with that from Metis' proof. The third non-self-combining step replaces the sub-DAG for the Science with that from the original EP proof. All three of those steps reduce the C_sw^Lf value. The last two non-self-combining steps replace parts of the proof in which there is no choice about the education requirements, but use different sequences of inferences that produce different C_sw^If values. Table 2 shows the C_sw values for the three original proofs, the target proof after the 13 self-combining steps, and the final combined proof. All the heuristic values are smaller in the final combined proof. The greedy nature of the hill-climbing algorithm leads to a proof (and hence a way of completing the general education requirements) that is different from the original three, according to the measures described in Section 3.

Figure 5: The initial and final proofs
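To make the notion of formula equivalence used above concrete, here is a minimal Python sketch of the kind of normalization involved. The nested-tuple formula representation, the operator set, and the Skolem name prefixes are illustrative assumptions of ours; the implemented equivalence also handles reordering of quantifications and implication reversal, which this sketch omits.

COMMUTATIVE = {"&", "|", "<=>"}

def normalize(f, skolem_map=None):
    # Normalize a formula term (op, arg1, ..., argN) so that equivalent
    # formulae compare equal: arguments of commutative operators are
    # sorted, and Skolem symbols (assumed to use 'sk'/'esk' prefixes)
    # are renamed canonically in order of first occurrence.
    if skolem_map is None:
        skolem_map = {}
    if isinstance(f, str):
        if f.startswith(("sk", "esk")):
            skolem_map.setdefault(f, "skolem%d" % len(skolem_map))
            return skolem_map[f]
        return f
    op, *args = f
    args = [normalize(a, skolem_map) for a in args]
    if op in COMMUTATIVE:
        args = sorted(args, key=repr)
    return (op, *args)

# p(skf1) & q is recognized as equivalent to q & p(esk2_0), even though
# the two provers named their Skolem constants differently:
print(normalize(("&", ("p", "skf1"), "q"))
      == normalize(("&", "q", ("p", "esk2_0"))))   # True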

6 Conclusion

This paper has described a process for combining proofs to produce new different proofs. The process uses measures of proof difference based on the Jaccard similarities between sets of proof artifacts (leaf formulae, inferred formulae, and inference steps), and implements a hill-climbing approach that generates proofs that are successively more different from the original proofs. An example has shown how this process can be applied in a practical application.


Proof      | C_sw^Lf | C_sw^If | C_sw^Is
EP chosen  | 0.6466  | 0.7105  | 0.6155
EP other   | 0.6274  | 0.7045  | 0.6475
Metis      | 0.6466  | 0.6475  | 0.5166
After 13   | 0.6466  | 0.7063  | 0.5791
Combined   | 0.6270  | 0.6815  | 0.5428

Table 2: C_sw values for the original and combined proofs

There are four items of work planned for the immediate future. The first is to do more extensive testing, to confirm that the process is useful over a broad range of problem domains, and to exploit the full potential of proof combining. This testing will use problems from the TPTP and proofs from the TSTP. The second is to combine proofs that come from different problems that use the same axiom set, e.g., a suite of problems that are based on a common axiomatization of set theory. The third is to try broader search algorithms, to produce proofs that are more different from the original proofs, and hence regain some of the completeness lost by the greedy hill-climbing. The fourth is to further generalize the notion of formula equivalence, e.g., using ATP to prove that two formulae are equivalent (modulo different Skolem symbol names), or to prove that the formula in the replacing node implies the formula in the replaced node. In the longer term the proof combining process will be applied to a broad range of proof-like structures, with a range of motivations for producing combined proofs. One motivation comes from settings where there are multiple sources of information, and some sources may be less certain than others. When proofs can be combined and the same conclusions reached from alternative sources, trust may be increased. Identifying different derivations for the same conclusion may be of value when considering explanation strategies. Sometimes one derivation to a conclusion may be more useful for computation, and another derivation may be more useful for presenting explanations. For example, a refutation style approach may be useful for computation, while a more direct forward chaining style approach may be more appropriate for an explanation. Some derivations may be best for some users while others may be best for different users. For example, some users may prefer derivations that depend more heavily on certain reasoning tools. A particular type of "proof-like structure" to which the proof combining process can be applied is scientific provenance traces. Two aspects of scientific activities are important for understanding the connection between these distinct areas of endeavor:

• It is typical for scientists to document the provenance of their scientific products, i.e., the way the scientists collect and process data to derive their scientific products, so that they can claim the reproducibility of their scientific activities.
• The provenance of scientific products and underlying activities is often a description of derivation traces that are simply less formal than proofs in logic. Proofs are a special case of provenance information, where the conclusion of each step of a derivation is a formal sentence (as opposed to a generic piece of information), and inference rules are often defined as patterns over these sentences.

With this understanding, the proof combining process should be transferable to scientific provenance. For example, it is common for geoscientists to combine multiple types of evidence to get an idea of the subterranean features that exist within some geographical region. For gravity data about the earth, geoscientists are often concerned only with anomalies in the data, which often indicate the presence of a water table or oil reserve. In a contour map, such anomalies are illustrated as a set of contour lines in very close proximity, indicating a drastic change in gravity (or whatever data is being mapped).

However, these anomalies have the potential to be artificial, simply imperfections introduced during the map generation process. The use of seismograms can produce a higher resolution three-dimensional tomography of the same subterranean features identified by gravity contour maps. If the contour map and a 3D tomography of a given region are compared, it should be possible to determine whether the anomalies are artificial imperfections or features to be further investigated. If the provenance of these scientific products is further compared, it may even be possible to identify the cause of the imperfection. However, without the use of automated "proof" combining techniques, these comparisons have to be manually identified and enabled by geoscientists who understand the interdependencies between data. Common aspects of provenance traces such as "subterranean features" and "regions of interest" may allow the proof combining process to be applied.

References

[1] S. Autexier, C. Benzmüller, D. Dietrich, A. Meier, and C-P. Wirth. A Generic Modular Data Structure for Proof Attempts Alternating on Ideas and Granularity. In M. Kohlhase, editor, Proceedings of the 4th International Conference on Mathematical Knowledge Management, number 3863 in Lecture Notes in Artificial Intelligence, pages 126–142. Springer-Verlag, 2006.
[2] O. Bar-Ilan, O. Fuhrmann, S. Hoory, O. Shacham, and O. Strichman. Linear-Time Reductions of Resolution Proofs. In H. Chockler and A. Hu, editors, Proceedings of the 4th International Haifa Verification Conference on Hardware and Software: Verification and Testing, number 5394 in Lecture Notes in Computer Science, pages 114–128. Springer-Verlag, 2009.
[3] L. Cheikhrouhou and V. Sorge. PDS - A Three-Dimensional Data Structure for Proof Plans. In R. Braham and M. Mohammadian, editors, Proceedings of the International Conference on Artificial and Computational Intelligence, pages 144–149, 2000.
[4] A. Fiedler. P.rex: An Interactive Proof Explainer. In R. Goré, A. Leitsch, and T. Nipkow, editors, Proceedings of the International Joint Conference on Automated Reasoning, number 2083 in Lecture Notes in Artificial Intelligence, pages 416–420. Springer-Verlag, 2001.
[5] P. Fontaine, S. Merz, and B. Woltzenlogel Paleo. Compression of Propositional Resolution Proofs via Partial Regularization. In N. Bjørner and V. Sofronie-Stokkermans, editors, Proceedings of the 23rd International Conference on Automated Deduction, Lecture Notes in Artificial Intelligence, page To appear. Springer-Verlag, 2011.
[6] J-Y. Girard, P. Taylor, and Y. Lafont. Proofs and Types. Cambridge University Press, 1989.
[7] S. Hetzl. Proof Profiles. Characteristic Clause Sets and Proof Transformations. VDM, 2008.
[8] K. Hoder and A. Voronkov. Sine Qua Non for Large Theory Reasoning. In V. Sofronie-Stokkermans and N. Bjørner, editors, Proceedings of the 23rd International Conference on Automated Deduction, Lecture Notes in Artificial Intelligence, page To appear. Springer-Verlag, 2011.
[9] J. Hurd. First-Order Proof Tactics in Higher-Order Logic Theorem Provers. In M. Archer, B. Di Vito, and C. Muñoz, editors, Proceedings of the 1st International Workshop on Design and Application of Strategies/Tactics in Higher Order Logics, number NASA/CP-2003-212448 in NASA Technical Reports, pages 56–68, 2003.
[10] P. Jaccard. Étude Comparative de la Distribution Florale Dans une Portion des Alpes et du Jura. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:547–579, 1901.
[11] M. Kohlhase. OMDoc - An Open Markup Format for Mathematical Documents [version 1.2]. Number 4180 in Lecture Notes in Artificial Intelligence. Springer-Verlag, 2006.
[12] W. McCune and O. Shumsky-Matlin. Ivy: A Preprocessor and Proof Checker for First-Order Logic. In M. Kaufmann, P. Manolios, and J. Strother Moore, editors, Computer-Aided Reasoning: ACL2 Case Studies, number 4 in Advances in Formal Methods, pages 265–282. Kluwer Academic Publishers, 2000.


[13] D. McGuinness and F. van Harmelen. OWL Overview. Technical report, World Wide Web Consortium, 2004. World Wide Web Consortium (W3C) Recommendation.
[14] T. Opsahl and P. Panzarasa. Clustering in Weighted Networks. Social Networks, 31(2):155–163, 2009.
[15] P. Pinheiro da Silva, D.L. McGuinness, and R. Fikes. A Proof Markup Language for Semantic Web Services. Information Systems, 31(4-5):381–395, 2006.
[16] P. Pinheiro da Silva, G. Sutcliffe, C. Chang, L. Ding, N. del Rio, and D. McGuinness. Presenting TSTP Proofs with Inference Web Tools. In R. Schmidt, B. Konev, and S. Schulz, editors, Proceedings of the Workshop on Practical Aspects of Automated Reasoning, 4th International Joint Conference on Automated Reasoning, number 373 in CEUR Workshop Proceedings, pages 81–93, 2008.
[17] A. Riazanov and A. Voronkov. The Design and Implementation of Vampire. AI Communications, 15(2-3):91–110, 2002.
[18] S. Rollini, R. Bruttomesso, and N. Sharygina. An Efficient and Flexible Approach to Resolution Proof Reduction. In S. Barner, I. Harris, D. Kroening, and O. Raz, editors, Proceedings of the 6th International Conference on Hardware and Software: Verification and Testing, number 6504 in Lecture Notes in Computer Science, pages 182–196. Springer-Verlag, 2011.
[19] M. Schiller, D. Dietrich, and C. Benzmüller. Proof Step Analysis for Proof Tutoring - a Learning Approach to Granularity. Teaching Mathematics and Computer Science, 6(2):325–343, 2008.
[20] S. Schulz. E: A Brainiac Theorem Prover. AI Communications, 15(2-3):111–126, 2002.
[21] G. Sutcliffe. Semantic Derivation Verification. International Journal on Artificial Intelligence Tools, 15(6):1053–1070, 2006.
[22] G. Sutcliffe. The TPTP Problem Library and Associated Infrastructure. The FOF and CNF Parts, v3.5.0. Journal of Automated Reasoning, 43(4):337–362, 2009.
[23] G. Sutcliffe, C. Chang, L. Ding, D. McGuinness, and P. Pinheiro da Silva. Different Proofs are Good Proofs. In D. McGuinness, A. Stump, G. Sutcliffe, and C. Tinelli, editors, Proceedings of the IJCAR 2010 Workshop on Evaluation Methods for Solvers and Quality Metrics for Solutions, 2010.
[24] G. Sutcliffe, S. Schulz, K. Claessen, and A. Van Gelder. Using the TPTP Language for Writing Derivations and Finite Interpretations. In U. Furbach and N. Shankar, editors, Proceedings of the 3rd International Joint Conference on Automated Reasoning, number 4130 in Lecture Notes in Artificial Intelligence, pages 67–81, 2006.
[25] S. Trac, Y. Puzis, and G. Sutcliffe. An Interactive Derivation Viewer. In S. Autexier and C. Benzmüller, editors, Proceedings of the 7th Workshop on User Interfaces for Theorem Provers, 3rd International Joint Conference on Automated Reasoning, volume 174 of Electronic Notes in Theoretical Computer Science, pages 109–123, 2006.

A Appendix

fof(get_degree,conjecture, degree ).
fof(degree,axiom, ( ( composition & humanities & science & math & social_science & language & writing ) => degree ) ).
fof(composition,axiom, ( ( eng105 & eng106 ) => composition ) ).
fof(composition_courses,axiom, ( eng105 & eng106 ) ).
fof(humanities,axiom, ( ( art & literature & religion & phi115 ) => humanities ) ).


fof(art,axiom, ( ( artXXX | arhXXX | danXXX | mcyXXX | thaXXX ) => art ) ).

fof(artXXX,axiom,artXXX ).
fof(arhXXX,axiom,arhXXX ).
fof(danXXX,axiom,danXXX ).
fof(mcyXXX,axiom,mcyXXX ).
fof(thaXXX,axiom,thaXXX ).

fof(literature,axiom, ( eng2XX => literature ) ).

fof(literature_courses,axiom, ( eng2XX ) ).

fof(religion,axiom, ( relXXX => religion ) ).

fof(religion_courses,axiom, ( relXXX ) ).

fof(phi115,axiom, ( phi115 ) ).

fof(science,axiom, ( ( bilXXX | chmXXX | ecsXXX | geoXXX | mscXXX | phyXXX ) => science ) ).

fof(bilXXX,axiom,bilXXX ).
fof(chmXXX,axiom,chmXXX ).
fof(ecsXXX,axiom,ecsXXX ).
fof(geoXXX,axiom,geoXXX ).
fof(mscXXX,axiom,mscXXX ).
fof(phyXXX,axiom,phyXXX ).

fof(math,axiom, ( ( mth162 & ( cscXXX | staXXX) ) => math ) ).

fof(mth162,axiom,mth162 ).
fof(cscXXX,axiom,cscXXX ).
fof(staXXX,axiom,staXXX ).

fof(social_science,axiom, ( ( apyXXX | ecoXXX | gegXXX | hisXXX | intXXX | polXXX | psyXXX | socXXX ) => social_science ) ).

fof(apyXXX,axiom,apyXXX ).
fof(ecoXXX,axiom,ecoXXX ).
fof(gegXXX,axiom,gegXXX ).
fof(hisXXX,axiom,hisXXX ).
fof(intXXX,axiom,intXXX ).
fof(polXXX,axiom,polXXX ).
fof(psyXXX,axiom,psyXXX ).
fof(socXXX,axiom,socXXX ).

fof(language,axiom, ( ( arb2XX | chi2XX | fre2XX | ger2XX | gre2XX | heb2XX | ita2XX | jap2XX | lat2XX | por2XX | spa2XX ) => language ) ).
fof(arbXXX,axiom,arb2XX ).
fof(chiXXX,axiom,chi2XX ).
fof(freXXX,axiom,fre2XX ).
fof(gerXXX,axiom,ger2XX ).
fof(greXXX,axiom,gre2XX ).
fof(hebXXX,axiom,heb2XX ).
fof(itaXXX,axiom,ita2XX ).
fof(japXXX,axiom,jap2XX ).
fof(latXXX,axiom,lat2XX ).
fof(porXXX,axiom,por2XX ).
fof(spzXXX,axiom,spa2XX ).
fof(wwwXXX_writing,axiom, wwwXXX => writing ).
fof(wwwXXX,axiom,wwwXXX).
fof(hisXXX_writing,axiom, hisXXX => writing ).
fof(eng2XX_writing,axiom, eng2XX => writing ).
fof(phi115_writing,axiom, phi115 => writing ).
