
Machine Learning for Automated Reasoning

Doctoral Thesis

in order to obtain the degree of doctor at Radboud University Nijmegen, on the authority of the Rector Magnificus, prof. mr. S.C.J.J. Kortmann, according to the decision of the Council of Deans, to be defended in public on Monday, 14 April 2014 at 10:30 hours precisely

by

Daniel A. Kühlwein

born on 7 November 1982 in Balingen, Germany

Supervisors: Prof. dr. Tom Heskes

Prof. dr. Herman Geuvers

Co-supervisor: Dr. Josef Urban

Manuscript committee: Prof. dr. M.C.J.D. van Eekelen (Open University, the Netherlands)
Prof. dr. L.C. Paulson (University of Cambridge, UK)
Dr. S. Schulz (TU Munich, Germany)

This research was supported by the NWO project Learning2Reason (612.001.010).

Copyright © 2013 Daniel Kühlwein

ISBN 978-94-6259-132-5
Printed by Ipskamp Drukkers, Nijmegen

Contents

Contents i

1 Introduction 1
1.1 Formal Mathematics ...... 1
1.1.1 Interactive Theorem Proving ...... 1
1.1.2 Automated Theorem Proving ...... 2
1.1.3 Industrial Applications ...... 3
1.1.4 Learning to Reason ...... 4
1.2 Machine Learning in a Nutshell ...... 5
1.3 Outline of this Thesis ...... 6

2 Premise Selection in ITPs as a Machine Learning Problem 9
2.1 Premise Selection as a Machine-Learning Problem ...... 9
2.1.1 The Training Data ...... 10
2.1.2 What to Learn ...... 10
2.1.3 Features ...... 12
2.2 Naive Bayes and Kernel-Based Learning ...... 14
2.2.1 Formal Setting ...... 14
2.2.2 A Naive Bayes Classifier ...... 15
2.2.3 Kernel-based Learning ...... 15
2.2.4 Multi-Output Ranking ...... 18
2.3 Challenges ...... 19
2.3.1 Features ...... 20
2.3.2 Dependencies ...... 20
2.3.3 Online Learning and Speed ...... 21

3 Overview of Premise Selection Techniques 23
3.1 Premise Selection Algorithms ...... 23
3.1.1 Premise Selection Setting ...... 23
3.1.2 Learning-based Ranking Algorithms ...... 24
3.1.3 Other Algorithms Used in the Evaluation ...... 25


3.1.4 Techniques Not Included in the Evaluation ...... 25
3.2 Machine Learning Evaluation Metrics ...... 26
3.3 Evaluation ...... 27
3.3.1 Evaluation Data ...... 27
3.3.2 Machine Learning Evaluation ...... 28
3.3.3 ATP Evaluation ...... 30
3.4 Combining Premise Rankers ...... 33
3.5 Conclusion ...... 34

4 Learning from Multiple Proofs 37
4.1 Learning from Different Proofs ...... 37
4.2 The Machine Learning Framework and the Data ...... 38
4.3 Using Multiple Proofs ...... 39
4.3.1 Substitutions and Unions ...... 40
4.3.2 Premise Averaging ...... 40
4.3.3 Premise Expansion ...... 41
4.4 Results ...... 42
4.4.1 Experimental Setup ...... 42
4.4.2 Substitutions and Unions ...... 42
4.4.3 Premise Averaging ...... 42
4.4.4 Premise Expansions ...... 44
4.4.5 Other ATPs ...... 44
4.4.6 Comparison With the Best Results Obtained so far ...... 46
4.4.7 Machine Learning Evaluation ...... 46
4.5 Conclusion ...... 48

5 Automated and Human Proofs in General Mathematics 49
5.1 Introduction: Automated Theorem Proving in Mathematics ...... 49
5.2 Finding proofs in the MML with AI/ATP support ...... 50
5.2.1 Mining the dependencies from all MML proofs ...... 50
5.2.2 Learning Premise Selection from Proof Dependencies ...... 51
5.2.3 Using ATPs to Prove the Conjectures from the Selected Premises ...... 52
5.3 Proof Metrics ...... 53
5.4 Evaluation ...... 54
5.4.1 Comparing weights ...... 56
5.5 Conclusion ...... 56

6 MaSh - Machine Learning for Sledgehammer 59
6.1 Introduction ...... 59
6.2 Sledgehammer and MePo ...... 61
6.3 The Machine Learning Engine ...... 62
6.3.1 Basic Concepts ...... 63
6.3.2 Input and Output ...... 63
6.3.3 The Learning Algorithm ...... 63


6.4 Integration in Sledgehammer ...... 64
6.4.1 The Low-Level Learner Interface ...... 64
6.4.2 Learning from and for Isabelle ...... 65
6.4.3 Relevance Filters: MaSh and MeSh ...... 67
6.4.4 Automatic and Manual Control ...... 68
6.4.5 Nonmonotonic Theory Changes ...... 68
6.5 Evaluations ...... 69
6.5.1 Evaluation on Large Formalizations ...... 69
6.5.2 Judgment Day ...... 72
6.6 Related Work and Contributions ...... 73
6.7 Conclusion ...... 73

7 MaLeS - Machine Learning of Strategies 75
7.1 Introduction: ATP Strategies ...... 75
7.1.1 The Strategy Selection Problem ...... 76
7.1.2 Overview ...... 77
7.2 Finding Good Search Strategies with MaLeS ...... 77
7.3 Strategy Scheduling with MaLeS ...... 79
7.3.1 Notation ...... 80
7.3.2 Features ...... 80
7.3.3 Runtime Prediction Functions ...... 82
7.3.4 Crossvalidation ...... 85
7.3.5 Creating Schedules from Prediction Functions ...... 85
7.4 Evaluation ...... 86
7.4.1 E-MaLeS ...... 87
7.4.2 Satallax-MaLeS ...... 88
7.4.3 LEO-MaLeS ...... 91
7.4.4 Further Remarks ...... 94
7.4.5 CASC ...... 95
7.5 Using MaLeS ...... 97
7.5.1 E-MaLeS, LEO-MaLeS and Satallax-MaLeS ...... 97
7.5.2 Tuning E, LEO-II or Satallax for a New Set of Problems ...... 98
7.5.3 Using a New Prover ...... 101
7.6 Future Work ...... 102
7.7 Conclusion ...... 102

Contributions 105

Bibliography 107

Scientific Curriculum Vitae 121

Summary 125


Samenvatting 127

Acknowledgments 129

Chapter 1

Introduction

Heuristically, a proof is a rhetorical device for convincing someone else that a mathematical statement is true or valid.
— Steven G. Krantz [52]

I am entirely convinced that formal verification of mathematics will eventually become commonplace.
— Jeremy Avigad [6]

1.1 Formal Mathematics

The foundations of modern mathematics were laid at the end of the 19th century and the beginning of the 20th century. Seminal works such as Frege's Begriffsschrift [30] established the notion of mathematical proofs as formal derivations in a logical calculus. In [118], Whitehead and Russell set out to show by example that all of mathematics can be derived from a small set of axioms using an appropriate logical calculus. Even though Gödel later showed that no effectively generated consistent system can capture all mathematical truth [32], Principia Mathematica showed that most of normal mathematics can indeed be catered for by a formal system. Proofs could now be rigidly defined, and verifying the validity of a proof was a simple matter of checking whether the rules of the calculus were correctly applied. But formal proofs were extremely tedious to write (and read), and so they found no audience among practicing mathematicians.

1.1.1 Interactive Theorem Proving

With the advent of computers, formal mathematics became a more realistic proposal. Interactive theorem provers (ITP), or proof assistants, are computer programs that support the creation of formal proofs.

This chapter is based on: “A Survey of Axiom Selection as a Machine Learning Problem”, submitted to “Infinity, computability, and metamathematics. Festschrift celebrating the 60th birthdays of Peter Koepke and Philip Welch”.


Theorem. There are infinitely many primes: for every number n there exists a prime p > n.

Proof [after Euclid]. Given n. Consider k = n! + 1, where n! = 1 · 2 · 3 · ... · n. Let p be a prime that divides k. For this number p we have p > n: otherwise p ≤ n; but then p divides n!, so p cannot divide k = n! + 1, contradicting the choice of p. QED

Figure 1.1: An informal proof that there are infinitely many prime numbers [117]

Proofs are written in the input language of the ITP, which can be thought of as being at the intersection between a programming language, a logic, and a mathematical typesetting system. In an ITP proof, each statement the user makes gives rise to a proof obligation. The ITP ensures that every proof obligation is met with a correct proof. ACL2 [47], Coq [11], HOL4 [90], HOL Light [39], Isabelle [68], Mizar [35], and PVS [71] are perhaps the most widely used ITPs. Figures 1.1 and 1.2 show a simple informal proof and the corresponding Isabelle proof. ITPs typically provide built-in and programmable automation procedures for performing reasoning, called tactics. In Figure 1.2, the by command specifies which tactic should be applied to discharge the current proof obligation. Developing proofs in ITPs usually requires a lot more work than sketching a proof with pen and paper. Nevertheless, the benefit of gaining quasi-certainty about the correctness of the proof led a number of mathematicians to adopt these systems. One of the largest mechanization projects is probably the ongoing formalization of the proof of Kepler's conjecture by Thomas Hales and his colleagues in HOL Light [37]. Other major undertakings are the formal proofs of the Four-Color Theorem [33] and of the Odd-Order Theorem [34] in Coq, both developed under Georges Gonthier's leadership. In terms of mathematical breadth, the Mizar Mathematical Library [61] is perhaps the main achievement of the ITP community so far: With nearly 52000 theorems, it covers a large portion of the mathematics taught at the undergraduate level.

1.1.2 Automated Theorem Proving

In contrast to interactive theorem provers, automated theorem provers (ATPs) work without human interaction. They take a problem as input, consisting of a set of axioms and a conjecture, and attempt to deduce the conjecture from the axioms. The TPTP (Thousands of Problems for Theorem Provers) library [91] has established itself as a central infrastructure for exchanging ATP problems. Its main developer also organizes an annual competition, the CADE ATP Systems Competition (CASC) [95], that measures progress in this field. E [84], SPASS [114], Vampire [77], and Z3 [66] are well-known ATPs for classical first-order logic.


theorem Euclid: ∃p ∈ prime. n < p
proof
  let ?k = n! + 1
  obtain p where prime: p ∈ prime and dvd: p dvd ?k
    using prime-factor-exists by auto
  have n < p
  proof
    have ¬ p ≤ n
    proof
      assume p ≤ n
      with prime-g-zero have p dvd n! by (rule dvd-factorial)
      with dvd have p dvd ?k − n! by (rule dvd-diff)
      then have p dvd 1 by simp
      with prime show False using prime-nd-one by auto
    qed
    then show ?thesis by simp
  qed
  from this and prime show ?thesis ..
qed

corollary ¬ finite prime using Euclid by (fastsimp dest!: finite-nat-set-is-bounded simp: le-def)

Figure 1.2: An Isabelle proof corresponding to the informal proof of Figure 1.1 [117]

Some researchers use ATPs to try to solve open mathematical problems. William McCune's proof of the Robbins conjecture using a custom ATP is the main success story on this front [62]. More recently, ATPs have also been integrated into ITPs [16, 109, 46], where they help increase productivity by reducing the number of manual interactions needed to carry out a proof. Instead of using a built-in tactic, the ITP translates the current proof obligation (e.g., the lemma that the user has just stated but not proved yet) into an ATP problem. If the ATP can solve it, the proof is translated to the logic of the ITP and the user can proceed. In Isabelle, the component that integrates ATPs is called Sledgehammer [16]. The process is illustrated in Figure 1.3 and a detailed description can be found in Section 6.2. In Chapter 6, we show that almost 70% of the proof obligations arising in a representative Isabelle corpus can be solved by ATPs.

1.1.3 Industrial Applications

Apart from mathematics, formal proofs are also used in industry. With the ever increasing complexity of software and hardware systems, quality assurance is a large part of the time and money budget of projects. Formal mathematics can be used to prove that an implementation meets a specification. Although some tests might still be mandated by certification authorities, formal proofs can both drastically reduce the testing burden and increase confidence that the systems are bug-free.



Figure 1.3: Sledgehammer integrates ATPs (here E) into Isabelle

AMD and Intel have been verifying floating-point procedures since the late 1990s [65, 40], as a consequence of the Pentium bug. Microsoft has had success applying formal verification methods to Windows device drivers [7]. One of the largest software verification projects so far is seL4, a formally verified operating system kernel [48].

1.1.4 Learning to Reason

One of the main reasons why formal mathematics and its related technologies have not become mainstream yet is that developing ITP proofs is tedious. The reasoning capabilities of ATPs and ITP tactics are in many respects far behind what is considered standard for a human mathematician. Developing an interactive proof requires not only knowledge of the subject of the proof, but also of the ITP and its libraries. One way to make users of ITPs more productive is to improve the success rate of ATPs. ATPs struggle with problems that have too many unnecessary axioms since they increase the search space. This is especially an issue when using ATPs from an ITP, where users have access to thousands of premises (axioms, definitions, lemmas, theorems, and corollaries) in the background libraries. Each premise is a potential axiom for an ATP. Premise selection algorithms heuristically select premises that are likely to be useful for inclusion as axioms in the problem given to the ATP.

A terminological note is in order. ITP axioms are fundamental assumptions in the common mathematical sense (e.g., the axiom of choice). In contrast, ATP axioms are arbitrary formulas that can be used to establish the conjecture. In an ITP, we call statements that can be used for proving a new statement premises. Alternative names are facts (mainly in the Isabelle community), items, or just lemmas. After a new statement has been proven, it becomes a premise for all following statements.

Learning mathematics involves studying proofs to develop a mathematical intuition. Experienced mathematicians often know how to approach a new problem by simply looking at its statement. Assume that p is a prime number and a, b ∈ N − {0}. Consider the following statement: If p | ab, then p | a or p | b.


Even though mathematicians usually know about many different areas (e.g., linear algebra, probability theory, numerics, analysis), when trying to prove the above statement they would ignore those areas and rely on their knowledge about number theory. At an abstract level, they perform premise selection to reduce their search space. Most common premise selection algorithms rely on (recursively) comparing the symbols and terms of the conjecture and axioms [41, 64]. For example, if the conjecture involves π and sin, they will prefer axioms that also talk about either of these two symbols, ideally both. The main drawback of such approaches is that they focus exclusively on formulas, ignoring the rich information contained in proofs. In particular, they do not learn from previous proofs.

1.2 Machine Learning in a Nutshell

This section aims to provide a high-level introduction to machine learning; for a more thorough discussion, we refer to standard textbooks [13, 60, 67]. Machine learning concerns itself with extracting information from data. Some typical examples of machine learning problems are listed below.

Spam classification: Predict if a new email is spam.

Face detection: Find human faces in a picture.

Web search: Predict the websites that contain the information the user is looking for.

The result of a learning algorithm is a prediction function that takes a new datapoint (email, picture, search query) and returns a target value (spam / not spam, location of faces, relevant websites). The learning is done by optimizing a score function over a training dataset. Typical score functions are accuracy (how many emails were correctly labeled?) and the root mean square error (the Euclidean distance between the predicted values and the real values). Elements of the training datasets are datapoints together with their intended value. For example:

Spam classification: A set of emails together with their classification.

Face detection: A set of pictures where all faces are marked.

Web search: A set of (query, relevant websites) tuples.

The performance of the learned function heavily depends on the quality of the training data, as expressed by the aphorism “Garbage in, garbage out.” If the training data is not representative of the problem, the prediction function will likely not generalize to new data. In addition to the training data, problem features are also essential. Features are the input of the prediction function and should describe the relevant attributes of the datapoint. A datapoint can have several possible feature representations. Feature engineering concerns itself with identifying relevant features [59]. To simplify computations, most

machine learning algorithms require that the features are a (sparse) real-valued vector. Potential features are listed below.

Spam classification: A list of all the words occurring in the email.

Face detection: The matrix containing the color values of the pixels.

Web search: The n-grams of the query.

From a mathematical point of view, most machine learning problems can be reduced to an optimization problem. Let D ⊆ X × T be a training dataset consisting of datapoints and their corresponding target values. Let ϕ : X → F be a feature function that maps a datapoint to its feature representation in the feature space F (usually a subset of R^n for some n ∈ N). Furthermore, let F ⊆ (F → T) be a set of functions that map features to the target space and s a (convex) score function s : D × F → R. One possible goal is to find the function f ∈ F that maximizes the average score over the training set D. The main differences between various learning algorithms are the function space F and the score function s they use. If the function space is too expressive, overfitting may occur: The learned function f ∈ F might perform well on the training data D, but poorly on unseen data. A simple example is trying to fit a polynomial of degree n − 1 through n training datapoints; this will give perfect scores on the training data but is likely to yield a curve that behaves so wildly as to be useless for making predictions. Regularization is used to balance function complexity with the result of the score function. To estimate how well a learning algorithm generalizes or to tune metaparameters (e.g., which prior to use in a Bayesian model), cross-validation partitions the training data into two sets: one set used for training, the other for the evaluation. Section 2.2.4 gives an example of metaparameter tuning with cross-validation.
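To make the polynomial example concrete, the following small Python sketch (using NumPy; the data, polynomial degrees, and split are made up for illustration only) fits a high-degree and a low-degree polynomial to noisy samples of a smooth function and compares their errors on a held-out part of the data. The high-degree fit typically scores almost perfectly on the training points but much worse on the held-out points.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training data: noisy samples of a smooth target function.
x = np.linspace(0.0, 1.0, 20)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=x.size)

# Random 70/30 split into a training part and an evaluation part.
idx = rng.permutation(x.size)
train, test = idx[:14], idx[14:]

def rmse(coeffs, part):
    """Root mean square error of a polynomial on a subset of the data."""
    return np.sqrt(np.mean((np.polyval(coeffs, x[part]) - y[part]) ** 2))

# np.polyfit may warn that the degree-13 fit is poorly conditioned; it still runs.
wiggly = np.polyfit(x[train], y[train], deg=13)  # interpolates the 14 training points
smooth = np.polyfit(x[train], y[train], deg=3)   # constrained by its low degree

print("degree 13: train RMSE %.3f, test RMSE %.3f" % (rmse(wiggly, train), rmse(wiggly, test)))
print("degree  3: train RMSE %.3f, test RMSE %.3f" % (rmse(smooth, train), rmse(smooth, test)))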

1.3 Outline of this Thesis

This work develops machine learning methods that can be used to improve both interactive and automated theorem proving. The first part of the thesis focuses on how learning from previous proofs can help to improve premise selection algorithms. In a way, we are trying to teach the computer mathematical intuition. The second part concerns itself with the orthogonal problem of strategy selection for ATPs. My detailed contributions to the thesis chapters are listed in the Contributions section.

Chapter 2 presents premise selection as a machine learning problem, an idea originally introduced in [101]. First, the problem setup and the properties of the training data are generally defined. The naive Bayesian approach of SNoW [21] is discussed and a new kernel-based Multi-Output Ranking (MOR) algorithm is introduced. The chapter ends with a discussion of the typical properties of the training datasets and the challenges they present to machine learning algorithms.


Chapter 3 compares the learning-based premise selection algorithms of SNoW and a faster variant of MOR, MOR-CG, with several other state-of-the-art techniques on the MPTP2078 benchmark dataset [2]. We find a discrepancy between the results of the typical machine learning evaluations and the ATP evaluations. Due to incomplete training data, i.e. alternative proofs, a low score in AUC and/or Recall does not necessarily imply a low number of solved problems by the ATP. With 726 problems, MOR-CG solves 11.3% more problems than the second best method, SInE [41].1 An ensemble combination of learning (MOR-CG) with non-learning (SInE) algorithms leads to 797 solved problems, an increase of almost 10% compared to MOR-CG.

Chapter 4 explores how knowledge of different proofs can be exploited to improve the premise predictions. The proofs found from the ATP experiments of the previous chapter are used as additional training data for the MPTP2078 dataset. Several different proof combinations are defined and tested. We find that learning from ATP proofs instead of ITP proofs gives the best results. The ensemble of ATP-learned MOR-CG with SInE solved 3.3% more problems than the former maximum.

Chapter 5 takes a closer look at the differences between ITP and ATP proofs on the whole Mizar Mathematical Library. We compare the average number of dependencies of ITP and ATP proofs and try to measure the proof complexity. We find that ATPs tend to use alternative proofs employing more advanced lemmas whereas humans often rely on the basic definitions for their proofs.

Chapter 6 brings learning-based premise selection to Isabelle. MaSh is a modified version of the sparse naive Bayes algorithm that was built to deal with the challenges of premise selection. Unlike MOR and MOR-CG, it is fast enough to be used during everyday proof development and has become part of the default Isabelle installation. MeSh, a combination of MaSh and the old relevance filter MePo, increases the number of solved problems in the Judgment Day benchmark by 4.2%.

Chapter 7 presents MaLeS, a general learning-based tuning framework for ATPs. ATP systems tuned with MaLeS successfully competed in the last three CASCs. MaLeS combines strategy finding with automated strategy scheduling using a combination of random search and kernel-based machine learning. In the evaluation, we use MaLeS to tune three different ATPs, E, LEO-II [9] and Satallax [19], and evaluate the MaLeS version against the default setting. The results show that using MaLeS can significantly improve the ATP performance.

1With the ATP Vampire 0.6, 70 premises and a 5 second time limit. Section 3.3 contains additional information.


Chapter 2

Premise Selection in Interactive Theorem Proving as a Machine Learning Problem

Without premise selection, automated theorem provers struggle to discharge proof obligations of interactive theorem provers. This is partly due to the large number of background premises which are passed to the automated provers as axioms. Premise selection algorithms predict the relevance of premises, thereby helping to reduce the search space of automated provers. This chapter presents premise selection as a machine learning problem and describes the challenges that distinguish this problem from other applications of machine learning.

2.1 Premise Selection as a Machine-Learning Problem

Using an ATP within an ITP requires a method to filter out irrelevant premises. Since most ITP libraries contain several thousands of theorems, simply translating every library statement into an ATP axiom overwhelms the ATP due to the exploding search space.1 To use machine learning to create such a relevance filter, we must first answer three questions:

1. What is the training data?

2. What is the goal of the learning?

3. What are the features?

This chapter is based on: “A Survey of Axiom Selection as a Machine Learning Problem”, submitted to “Infinity, computability, and metamathematics. Festschrift celebrating the 60th birthdays of Peter Koepke and Philip Welch” and my part of [2] “Premise Selection for Mathematics by Corpus Analysis and Kernel Methods”, published in the Journal of Automated Reasoning. 1Initially, even parsing huge problem files has been an issue with some ATPs.


Axiom 1. A
Axiom 2. B
Definition 1. C iff A
Definition 2. D iff C
Theorem 1. C
Proof. By Axiom 1 and Definition 1.
Corollary 1. D
Proof. By Theorem 1 and Definition 2.

Figure 2.1: A simple library

2.1.1 The Training Data

ITP proof libraries consist of axioms, definitions and previously proved formulas together with their proofs. We use these proofs as training data for the learning algorithms. For example, for Isabelle we can use the libraries included with the prover or the Archive of Formal Proofs [50]; for Mizar, the Mizar Mathematical Library [61]. The data could also include custom libraries defined by the user or third parties. Abstracting from its source, we assume that the training data consists of a set of formulas (axioms, definitions, lemmas, theorems, corollaries) equipped with

1. a visibility relation that for each formula states which other formulas appear before it

2. a dependency graph that for each formula shows which formulas were used in its proof (for lemmas, theorems, and corollaries)

3. a formula tree representation of each formula

For the remainder of the thesis we simply use theorem to denote lemmas, theorems and corollaries.

Example. Figure 2.1 introduces a simple, constructed library. For each formula, every formula that occurs above it is visible. Axioms 1 and 2 and Definitions 1 and 2 are visible from Theorem 1, whereas Corollary 1 is not visible. Figure 2.2 presents the corresponding dependency graph. Finally, Figure 2.3 shows the formula tree of ∀x x + 1 > x.
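For concreteness, the toy library of Figure 2.1 and its dependency graph can be written down directly as data; a minimal Python sketch (the representation is ours, chosen only for illustration):

# The formulas of Figure 2.1, in the order in which they appear in the library.
library = ["Axiom 1", "Axiom 2", "Definition 1", "Definition 2", "Theorem 1", "Corollary 1"]

# The dependency graph of Figure 2.2: which formulas are used in which proof.
dependencies = {
    "Theorem 1": {"Axiom 1", "Definition 1"},
    "Corollary 1": {"Theorem 1", "Definition 2"},
}

def visible(formula):
    """All formulas that occur before `formula` in the library."""
    return library[: library.index(formula)]

print(visible("Theorem 1"))         # Axioms 1 and 2, Definitions 1 and 2
print(dependencies["Corollary 1"])  # {'Theorem 1', 'Definition 2'}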

2.1.2 What to Learn

When using an ATP as a proof tactic of an ITP, the conjecture of the ATP problem is the current proof obligation the ITP user wants to discharge and the axioms are the visible premises. Recall that machine learning tries to optimize a score function over the training dataset.


Figure 2.2: The dependency graph of the library of Figure 2.1, where edges denote dependency between formulas.

Figure 2.3: The formula tree for ∀x x + 1 > x

If we ignore alternative proofs and assume that the dependencies extracted from the ITP are the dependencies that an ATP would use, then an ambitious, but unrealistic, learning goal would be to try to predict the parents of the conjecture in the dependency graph. Treating premise selection as a ranking rather than a subset selection problem allows more room for error and simplifies the problem. Hence we state our learning goal as:

Given a training dataset (Section 2.1.1) and the formula tree of a conjecture, rank the visible premises according to their predicted usefulness based on previous proofs.

In the training phase, the learning algorithm is allowed to learn from the proofs of all previously proved theorems. For all theorems in the training set, their corresponding



Figure 2.4: Sledgehammer generates several ATP problems from a single ranking. For simplicity, other possible slicing options are not shown.

dependencies should be ranked as high as possible. That is, the score function should optimize the ranks of the premises that were used in the proof. Alternative proofs and their effect on premise selection are addressed in Chapter 4, and Chapter 5 takes a look at the difference between ITP and ATP dependencies. When trying to prove the conjecture, the predicted ranking is used to create several different ATP problems. It has often been observed that it is better to invoke an ATP repeatedly with different options (e.g., numbers of axioms, type encodings, ATP parameters) for a short period of time (e.g., 5 seconds) than to let it run undisturbed until the user stops it. This optimization is called time slicing [99]. Figure 2.4 illustrates the process using Sledgehammer as an example. Slices with few axioms are more likely to find deep proofs involving a few obvious axioms, whereas those with lots of axioms might find straightforward proofs involving more obscure axioms.
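The slicing step itself is simple; the following hypothetical Python fragment illustrates the idea (Sledgehammer is implemented differently and supports more slicing options than shown here; the slice sizes are made up):

def make_slices(conjecture, ranked_premises, sizes=(16, 32, 64, 128, 256, 512)):
    """Turn one premise ranking into several ATP problems of increasing size."""
    problems = []
    for n in sizes:
        axioms = ranked_premises[:n]           # the n highest-ranked premises
        problems.append((conjecture, axioms))  # each slice gets its own short time limit
    return problems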

2.1.3 Features

Almost all learning algorithms require the features of the input data to be a real vector. Therefore a method is needed that translates a formula tree into a real vector characterizing the formula.


Symbols.

The symbols that appear in a formula can be seen as its basic characterization and hence a simple approach is to take the set of symbols of a formula as its feature set. The symbols correspond to the node labels in the formula tree. Let n ∈ N denote the vector size, which should be at least as large as the total number of symbols in the library. Let i be an injective index function that maps each symbol s to a positive number i(s) ≤ n. The feature representation of a formula tree t is the binary vector ϕ(t) such that ϕ(t)(j) = 1 iff the symbol with index j appears in t. The example formula tree in Figure 2.3 contains the symbols ∀, >, +, x, and 1. Given n = 10, i(∀) = 1, i(>) = 4, i(+) = 6, i(x) = 7, and i(1) = 8, the corresponding feature vector is (1,0,0,1,0,1,1,1,0,0).
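A minimal Python sketch of this encoding, reproducing the example above (the index function i is hard-coded):

def feature_vector(symbols, index, n):
    """Binary vector with a 1 at position index[s] for every symbol s of the formula."""
    phi = [0] * n
    for s in symbols:
        phi[index[s] - 1] = 1   # index() is 1-based, Python lists are 0-based
    return tuple(phi)

index = {"∀": 1, ">": 4, "+": 6, "x": 7, "1": 8}
print(feature_vector({"∀", ">", "+", "x", "1"}, index, n=10))
# (1, 0, 0, 1, 0, 1, 1, 1, 0, 0)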

Subterms and subformulas.

In addition to the symbols, one can also include as features the subterms and subformulas of the formula to prove—i.e., the subtrees of the formula tree [110]. For example, the formula tree in Figure 2.3 has subtrees associated with x, 1, x + 1, x + 1 > x, and ∀x x + 1 > x. Adding all subtrees significantly increases the size of the feature vector. Many subterms and subformulas appear only once in the library and are hence useless for making predictions. An approach to curtail this explosion is to consider only small subtrees (e.g., those with a height of at most 2 or 3).

Types.

The formalisms supported by the vast majority of ITP systems are typed (or sorted), meaning that each term can be given a type that describes the values that can be taken by the term. Examples of types are int, real, real × real, and real → real. Adding the types that appear in the formula tree as additional features is reasonable [56, 45]. Like terms, types can be represented as trees, and we may choose between encoding only basic types or also some or all complex subtypes.

Context.

Due to the way humans develop complex proofs, the last few formulas that were proved are likely to be useful in a proof of the current goal [24]. However, the machine learning algorithm might rank them poorly because they are new and hence little used, if at all. Adding the feature vectors of some of the last previously proved theorems to the feature vector of the conjecture, in a weighted fashion, is a way to add information about the context in which the conjecture occurs to the feature vector. This method is particularly useful when a formula has very few or very general features but occurs in a wider context.


2.2 Naive Bayes and Kernel-Based Learning

We give a detailed example of an actual learning setup using a standard naive Bayes classifier and the kernel-based Multi-Output Ranking (MOR) algorithm. The mathematics underlying both algorithms is introduced and the benefits of kernels explained. Naive Bayes has already been used in previous work on premise selection [110], whereas the MOR algorithm is newly introduced in this thesis. The next chapter contains an evaluation of these two (among other) algorithms.

2.2.1 Formal Setting

Let Γ be the set of formulas that appear in the training dataset.

Definition 1 (Proof matrix). For two formulas c, p ∈ Γ we define the proof matrix µ : Γ × Γ → {0,1} by

    µ(c, p) := 1 if p is used to prove c, and 0 otherwise.

In other words, µ is the adjacency matrix of the dependency graph.

The used premises of a formula c are the direct parents of c in the dependency graph.

    usedPremises(c) := {p | µ(c, p) = 1}

Definition 2 (Feature matrix). Let T := {t1, ..., tm} be a fixed enumeration of the set of all symbols and (sub)terms that appear in all formulas from Γ.2 We define Φ : Γ × {1, ..., m} → {0,1} by

    Φ(c, i) := 1 if ti appears in c, and 0 otherwise.

This matrix gives rise to the feature function ϕ : Γ → {0,1}^m which for c ∈ Γ is the vector ϕ^c with entries in {0,1} satisfying

    ϕ^c_i = 1 ⟺ Φ(c, i) = 1.

The expressed features of a formula are denoted by the value of the function e : Γ → P(T) that maps c to {ti | Φ(c, i) = 1}.

For each premise p ∈ Γ we learn a real-valued classifier function Cp(·) : Γ → R which, given a conjecture c, estimates how useful p is for proving c. The premises for a conjecture c ∈ Γ are ranked by the values of Cp(c). The main difference between learning algorithms is the function space in which they search for the classifiers and the measure they use to evaluate how good a classifier is.

2If the set of features is not constant they are enumerated in order of appearance.


2.2.2 A Naive Bayes Classifier

Naive Bayes is a statistical learning method based on Bayes' theorem about conditional probabilities3 with a strong (read: naive) independence assumption. In the naive Bayes setting, the value Cp(c) of the classifier function of a premise p at a conjecture c is the probability that µ(c, p) = 1 given the expressed features e(c). To understand the difference between the naive Bayes and the kernel-based learning algorithm we need to take a closer look at the naive Bayes classifier. Let θ denote the statement that µ(c, p) = 1 and for each feature ti ∈ T let t̄i denote that Φ(c, i) = 1. Furthermore, let e(c) = {s1, ..., sl} ⊆ T be the expressed features of c (with corresponding s̄1, ..., s̄l). Then (by Bayes' theorem) we have

    P(θ | s̄1, ..., s̄l) ∝ P(s̄1, ..., s̄l | θ) P(θ)    (2.1)

where the logarithm of the right-hand side can be computed as

    ln ( P(s̄1, ..., s̄l | θ) P(θ) )
        = ln P(s̄1, ..., s̄l | θ) + ln P(θ)                              (2.2)
        = ln ∏_{i=1}^{l} P(s̄i | θ) + ln P(θ)        (by independence)  (2.3)
        = ∑_{i=1}^{m} ϕ^c_i ln P(t̄i | θ) + ln P(θ)                     (2.4)
        = w^T ϕ^c + ln P(θ)                                            (2.5)

where w_i := ln P(t̄i | θ).    (2.6)

There are two things worth noting here. First, P(t̄i | θ) and P(θ) might be 0. In that case, taking the natural logarithm would not be defined. In practice, if P(t̄i | θ) or P(θ) is 0, the algorithm replaces the 0 with a predefined very small ε > 0. Second, equation (2.5) shows that the naive Bayes classifier is “essentially” (after the monotonic transformation) a linear function of the features of the conjecture. The feature weights w are computed using formula (2.6).
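The classifier value can thus be computed from simple counts over the training data. The following Python sketch follows equations (2.1)–(2.6), including the ε-substitution mentioned above; the value of ε and the data layout are placeholders of ours:

import math

EPS = 1e-10  # placeholder for the predefined "very small epsilon"

def naive_bayes_score(conj_features, premise, training):
    """training: list of (features, used_premises) pairs, one per previously proved theorem."""
    positives = [feats for feats, used in training if premise in used]
    if not positives:
        return math.log(EPS)                           # the premise was never used so far
    score = math.log(len(positives) / len(training))   # ln P(theta)
    for t in conj_features:                            # sum over the expressed features of c
        p_t = sum(t in feats for feats in positives) / len(positives)
        score += math.log(p_t if p_t > 0 else EPS)     # ln P(t_i | theta), zeros replaced by eps
    return score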

2.2.3 Kernel-based Learning

We saw that the naive Bayes algorithm gives rise to a linear classifier. This leads to several questions: ‘Are there better weights?’ and ‘Can one get better performance with non-linear functions?’. Kernel-based learning provides a framework for investigating such questions. In this subsection we give a simplified, brief description of kernel-based learning that is tailored to our present problem; further information can be found in [5, 82, 88].

3In its simplest form, Bayes' theorem asserts for a probability function P and random variables X and Y that P(X|Y) = P(Y|X)P(X) / P(Y), where P(X|Y) is understood as the conditional probability of X given Y.


Are there better weights?

To answer this question we must first define what ‘better’ means. Using the number of problems solved as a measure is not feasible because we cannot practically run an ATP for every possible weight combination. Instead, we measure how well a classifier approximates our training data. We would like to have that

    ∀x ∈ Γ : Cp(x) = µ(x, p).

However, this will almost never be the case. To compare how well a classifier approximates the data, we use loss functions and the notion of expected loss that they provide, which we now define.

Definition 3 (Loss function and Expected Loss). A loss function is any function l : R × R → R+. Given a loss function l we can then define the expected loss E(·) of a classifier Cp as

    E(Cp) = ∑_{x∈Γ} l(Cp(x), µ(x, p)).

One might add additional properties such as l(x, x) = 0, but this is not necessary. Typical examples of a loss function l(x, y) are the square loss (y − x)² or the 0-1 loss defined by I(x = y).4 We can compare two different classifiers via their expected loss. If the expected loss of a classifier Cp is less than the expected loss of a classifier C′p, then Cp is the better classifier.

Nonlinear Classifiers

It seems straightforward that more complex functions would lead to a lower expected loss and are hence desirable. However, weight optimization becomes tedious once we leave the linear case. Kernels provide a way to use the machinery of linear optimization on non-linear functions.

Definition 4 (Kernel). A kernel is a function k : Γ × Γ → R satisfying

    k(x, y) = ⟨φ(x), φ(y)⟩

where φ : Γ → F is a mapping from Γ to an inner product space F with inner product ⟨·,·⟩. A kernel can be understood as a similarity measure between two entities.

Example 1. A standard example is the linear kernel:

    k_lin(x, y) := ⟨ϕ^x, ϕ^y⟩

with ⟨·,·⟩ being the normal dot product in R^m. Here, ϕ^f denotes the features of a formula f, and the inner product space F is R^m. A nontrivial example is the Gaussian kernel with parameter σ [13]:

    k_gauss(x, y) := exp( −( ⟨ϕ^x, ϕ^x⟩ − 2⟨ϕ^x, ϕ^y⟩ + ⟨ϕ^y, ϕ^y⟩ ) / σ² )

4I is defined as follows: I(x = y) = 0 if x = y, and I(x = y) = 1 otherwise.
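Both kernels of Example 1 are easy to write down for the binary feature vectors of Section 2.1.3; a small NumPy sketch (σ = 1 is an arbitrary placeholder):

import numpy as np

def k_lin(phi_x, phi_y):
    """Linear kernel: the ordinary dot product of the feature vectors."""
    return float(np.dot(phi_x, phi_y))

def k_gauss(phi_x, phi_y, sigma=1.0):
    """Gaussian kernel, written exactly as in Example 1."""
    sq_dist = k_lin(phi_x, phi_x) - 2 * k_lin(phi_x, phi_y) + k_lin(phi_y, phi_y)
    return float(np.exp(-sq_dist / sigma ** 2))

x = np.array([1, 0, 0, 1, 0, 1, 1, 1, 0, 0])
y = np.array([1, 0, 0, 1, 0, 0, 1, 0, 1, 0])
print(k_lin(x, y), k_gauss(x, y))   # 3.0 and exp(-3): the two formulas share three features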


We can now define our kernel function space in which we will search for classification functions.

Definition 5 (Kernel Function Space). Given a kernel k, we define

    F_k := { f ∈ R^Γ | f(x) = ∑_{v∈Γ} α_v k(x, v), α_v ∈ R, ‖f‖ < ∞ }

as our kernel function space, where for f(x) = ∑_{v∈Γ} α_v k(x, v)

    ‖f‖ = ∑_{u,v∈Γ} α_u α_v k(u, v).

Essentially, every function in Fk compares the input x with formulas in Γ using the kernel, and the weights α determine how important each comparison is.5

The kernel function space Fk naturally depends on the kernel k. It can be shown that when we use klin, Fklin consists of linear functions of the features T. In contrast, the Gaussian kernel kgauss gives rise to a nonlinear (in the features) function space.

Putting it all together

Having defined loss functions, kernels and kernel function spaces we can now define how kernel-based learning algorithms learn classifier functions. Given a kernel k and a loss function l, recall that we measure how good a classifier Cp is with the expected loss E(Cp). With all our definitions it seems reasonable to define Cp as

    Cp := argmin_{f ∈ F_k} E(f)    (2.7)

However, this is not what a kernel-based learning algorithm does. There are two reasons for this. First, the minimum might not exist. Second, in particular when using complex kernel functions, such an approach might lead to overfitting: Cp might perform very well on our training data, but badly on data that was not seen before. To handle both problems, a regularization parameter λ > 0 is introduced to penalize complex functions. This regularization parameter allows us to place a bound on possible solutions which, together with the fact that F_k is a Hilbert space, ensures the existence of Cp. Hence we define

    Cp = argmin_{f ∈ F_k} ( E(f) + λ‖f‖² )    (2.8)

Recall from the definition of F_k that Cp has the form

    Cp(x) = ∑_{v∈Γ} α_v k(x, v),    (2.9)

with α_v ∈ R. Hence, for any fixed λ, we only need to compute the weights α_v for all v ∈ Γ in order to define Cp. In Section 2.2.4 we show how to solve this optimization problem in our setting.

5Schölkopf gives a more general approach to kernel spaces [81].


Naive Bayes vs Kernel-based Learning

Kernel-based methods typically outperform the naive Bayes algorithm. There are several reasons for this. Firstly and most importantly, while naive Bayes is essentially a linear classifier, kernel-based methods can learn non-linear dependencies when an appropriate non-linear (e.g. Gaussian) kernel function is used. This advantage in expressiveness usually leads to significantly better generalization6 performance of the algorithm given properly estimated hyperparameters (e.g., the kernel width σ for Gaussian functions). Secondly, kernel-based methods are formulated within the regularization framework that provides a mechanism to control the errors on the training set and the complexity ("expressiveness") of the prediction function. Such a setting prevents overfitting of the algorithm and leads to notably better results compared to unregularized methods. Thirdly, some of the kernel-based methods (depending on the loss function) can use very efficient procedures for hyperparameter estimation (e.g. fast leave-one-out cross-validation [78]) and therefore result in a close to optimal model for the classification/regression task. For such reasons kernel-based methods are among the most successful algorithms applied to various problems from bioinformatics to information retrieval to computer vision [88]. A general advantage of naive Bayes over kernel-based algorithms is the computational efficiency, particularly when taking into account the fact that computing the kernel matrix is generally quadratic in the number of training data points.

2.2.4 Multi-Output Ranking

We define the kernel-based multi-output ranking (MOR) algorithm. It extends previously defined preference learning algorithms by Tsivtsivadze and Rifkin [100, 78]. Let Γ = {x1,..., xn}. Then formula (2.9) becomes

    Cp(x) = ∑_{i=1}^{n} α_i k(x, x_i)

Using this and the square loss l(x, y) = (x − y)², solving equation (2.8) is equivalent to finding weights α_i that minimize

    min_{α_1,...,α_n}  ∑_{i=1}^{n} ( ∑_{j=1}^{n} α_j k(x_i, x_j) − µ(x_i, p) )² + λ ∑_{i,j=1}^{n} α_i α_j k(x_i, x_j)    (2.10)

Recall that Cp is the classifier for a single premise. Since we eventually want to rank all premises, we need to train a classifier for each premise. So we need to find weights α_{i,p} for each premise p. We can use the fact that for each premise p, Cp depends on the values of k(x_i, x_j), where 1 ≤ i, j ≤ n, to speed up the computation. Instead of learning the classifiers Cp for each premise separately, we learn all the weights α_{i,p} simultaneously.

6Generalization is the ability of a machine learning algorithm to perform accurately on new, unseen examples after training on a finite data set.


To do this, we first need some definitions. Let

    A = (α_{i,p})_{i,p}    (1 ≤ i ≤ n, p ∈ Γ).

A is the matrix where each column contains the parameters of one premise classifier. Define the kernel matrix K and the label matrix Y as

    K := (k(x_i, x_j))_{i,j}    (1 ≤ i, j ≤ n)
    Y := (µ(x_i, p))_{i,p}      (1 ≤ i ≤ n, p ∈ Γ).

We can now rewrite (2.10) in matrix notation to state the problem for all premises:

    argmin_A  tr( (Y − KA)^T (Y − KA) + λ A^T K A )    (2.11)

where tr(A) denotes the trace of the matrix A. Taking the derivative with respect to A leads to:

    ∂/∂A  tr( (Y − KA)^T (Y − KA) + λ A^T K A )
        = tr( −2K(Y − KA) + 2λKA )
        = tr( −2KY + (2KK + 2λK)A )

To find the minimum, we set the derivative to zero and solve with respect to A. This leads to:

    A = (K + λI)^{−1} Y    (2.12)

If the regularization parameter λ and the (potential) kernel parameter σ are fixed, we can find the optimal weights through simple matrix computations. Thus, to fully determine the classifiers, it remains to find good values for the parameters λ and σ. This is done, as is common with such parameter optimization for kernel methods, by simple (logarithmically scaled) grid search and cross-validation on the training data using a 70/30 split. For this, we first define a logarithmically scaled set of potential parameters. The training set is then randomly split into two parts cvtrain and cvtest, with cvtrain containing 70% of the training data and cvtest containing the remaining 30%. For each set of parameters, the algorithm is trained on cvtrain and evaluated on cvtest. The process is repeated 10 times. The set of parameters with the best average performance is then picked for the real evaluation.
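Equation (2.12) and the resulting premise ranking amount to a few lines of NumPy. The sketch below uses the linear kernel and a made-up λ; it omits the grid search and cross-validation and only illustrates the matrix computations:

import numpy as np

def train_mor(Phi, Y, lam):
    """Phi: n x d feature matrix of the training formulas; Y: n x |Gamma| label matrix."""
    K = Phi @ Phi.T                                        # linear kernel matrix
    A = np.linalg.solve(K + lam * np.eye(K.shape[0]), Y)   # A = (K + lam I)^{-1} Y, eq. (2.12)
    return A

def rank_premises(phi_c, Phi, A):
    """Scores C_p(c) = sum_i alpha_{i,p} k(c, x_i) for every premise p, best first."""
    k_c = Phi @ phi_c              # kernel values k(c, x_i) for the new conjecture c
    scores = A.T @ k_c
    return np.argsort(-scores)     # premise indices, highest predicted relevance first

# Tiny made-up example: 4 training formulas, 6 features, 5 candidate premises.
rng = np.random.default_rng(1)
Phi = rng.integers(0, 2, size=(4, 6)).astype(float)
Y = rng.integers(0, 2, size=(4, 5)).astype(float)
A = train_mor(Phi, Y, lam=0.1)
print(rank_premises(rng.integers(0, 2, size=6).astype(float), Phi, A))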

2.3 Challenges

Premise selection has several peculiarities that restrict which machine learning algorithms can be effectively used. In this section, we illustrate these challenges on a large fragment of Isabelle's Archive of Formal Proofs (AFP). The AFP benchmarks contain 165964 formulas distributed over 116 entries contributed by dozens of Isabelle users.7 Most entries are related to computer science (e.g., data structures, algorithms, programming languages, and process algebras). The dataset was generated using Sledgehammer [56] and is available publicly at http://www.cs.ru.nl/~kuehlwein/downloads/afp.tar.gz.

7A number of AFP entries were omitted because of technical difficulties.


2.3.1 Features

The features introduced in Section 2.1.3 are very sparse. For example, the AFP contains 20461 symbols. Adding small subterms and subformulas as well as basic types raises the total number of features to 328361. Rare features can be very useful, because if two formulas share a very rare feature, the likelihood that one depends on the other is very high. However, they also lead to much larger and sparser feature vectors. Figure 2.5 shows the percentage of features that appear in at least x formulas in the AFP, for various values of x. If we consider all features, then only 3.37% of the features appear in more than 50 formulas. Taking only the symbols into account gives somewhat less sparsity, with 2.65% of the symbols appearing in more than 500 formulas. Since there are 165964 formulas in total, this means that 97.35% of all symbols appear in less than 0.3% of the training data.

[Plot: percentage of features (y-axis) that appear in at least a given number of formulas (x-axis, 1 to 1000), with one curve for symbols only and one for all features.]

Figure 2.5: Distribution of the feature appearances in the Archive of Formal Proofs

Another peculiarity of the premise selection problem is that the number of features is not a priori fixed. Defining new names for new concepts is standard mathematical practice. Hence, the learning algorithm must be able to cope with an unbounded, ever increasing feature set.

2.3.2 Dependencies

Like the features, the dependencies are also sparse. On average, an AFP formula depends on 5.5 other formulas; 19.4% of the formulas have no dependencies at all, and 10.7% have at least 20 dependencies. Figure 2.6 shows the percentage of formulas that are dependencies of at least x formulas in the AFP, for various values of x. Less than half of the formulas (43.0%) are a dependency in at least one other formula and 94593 formulas are never used as dependencies. This includes 32259 definitions as well as 17045 formulas

where the dependencies could not be extracted and were hence left empty. Only 0.08% of the formulas are being used as dependencies more than 500 times. The main issue is that the dependencies in the training data might be incomplete or otherwise misleading. The dependencies extracted from the ITP are not necessarily the same as an ATP would use [3]. For example, Isabelle users can use induction in an interactive proof, and this would be reflected in the dependencies—the induction principle is itself a (higher-order) premise. Most ATPs are limited to first-order logic without induction. If an alternative proof is possible without induction, this is the one that should be learned. Experiments with combinations of ATP and ITP proofs indicate that ITP dependencies are a reasonable guess, but learning from ATP dependencies yields better results (Chapter 4, [55, 110]). More generally, the training data extracted from an ITP library lacks information about alternative proofs. In practice, this means that any evaluation method that relies only on the ITP proofs cannot reliably evaluate whether a premise selection algorithm produces good predictions. There is no choice but to actually run ATPs—and even then the hardware, time limit, and version of the ATP can heavily influence the results.

2.3.3 Online Learning and Speed

Any algorithm for premise selection must update its prediction model and create predictions fast. The typical use case is that of an ITP user who develops a theory formula by formula, proving each along the way. Usually these formulas depend on one another, often in the familiar sequence definition–lemma–theorem–corollary. After each user input, the prediction model might need to be updated. In addition, it is not uncommon for users to alter existing definitions or lemmas, which should trigger some relearning. Speed is essential for a premise selection algorithm since the automated proof finding

[Plot: percentage of formulas (y-axis) that are used as dependencies of at least a given number of formulas (x-axis, 1 to 300).]

Figure 2.6: Distribution of the dependency appearances in the Archive of Formal Proofs

process needs to be faster than manual proof creation. The less time is spent on updating the learning model and predicting the premise ranking, the more time can be used by ATPs. Users of ITPs tend to be impatient: If the automated provers do not respond within half a minute or so, they usually prefer to carry out the proof themselves.

Chapter 3

Overview and Evaluation of Premise Selection Techniques

In this chapter, an overview of state-of-the-art techniques for premise selection in large theory mathematics is presented, and new premise selection techniques are introduced. Several evaluation metrics are defined and their appropriateness is discussed in the con- text of automated reasoning in large theory mathematics. The methods are evaluated on the MPTP2078 benchmark, a subset of the Mizar library, and a 10% improvement is obtained over the best method so far.

3.1 Premise Selection Algorithms

3.1.1 Premise Selection Setting

The typical setting for the task of premise selection is a large developed library of formally encoded mathematical knowledge, over which mathematicians attempt to prove new lemmas and theorems [102, 15, 109]. The actual mathematical corpora suitable for ATP techniques are only a fraction of all mathematics (e.g. about 52000 lemmas and theorems in the Mizar library) and started to appear only recently, but they already provide a corpus on which different methods can be defined, trained, and evaluated. Premise selection can be useful as a standalone service for the formalizers (suggesting relevant lemmas), or in conjunction with ATP methods that can attempt to find a proof from the relevant premises.

This chapter is based on: [57] “Overview and Evaluation of Premise Selection Techniques for Large Theory Mathematics”, published in the Proceedings of the 6th International Joint Conference on Automated Reasoning.


3.1.2 Learning-based Ranking Algorithms

Learning-based ranking algorithms have a training and a testing phase and typically represent the data as points in pre-selected feature spaces. In the training phase the algorithm tries to fit one (or several) prediction functions to the data it is given. The result of the training is the best fitting prediction function which can then be used in the testing phase for evaluations. In the typical setting presented above, the algorithms would train on all existing proofs in the library and be tested on the new theorem the mathematician wants to prove. We compare three different algorithms.

SNoW: SNoW (Sparse Network of Winnows) [21] is an implementation of (among others) the naive Bayes algorithm that has already been successfully used for premise selection [102, 105, 2]. Naive Bayes is a statistical learning method based on Bayes' theorem with a strong (or naive) independence assumption. Given a new conjecture c and a premise p, SNoW computes the probability of p being needed to prove c, based on the previous use of p in proving conjectures that are similar to c. The similarity is in our case typically expressed using symbols and terms of the formulas. The independence assumption says that the (non-)occurrence of a symbol/term is not related to the (non-)occurrence of every other symbol/term. A detailed description can be found in Section 2.2.2.

MOR-CG: MOR-CG (Multi-Output Ranking with Conjugate Gradient) is a kernel-based learning algorithm [88] that is a faster version of the MOR algorithm described in the previous chapter. Instead of doing an exact computation of the weights as presented in Section 2.2.4, MOR-CG uses conjugate-gradient descent [89], which speeds up the training. Since preliminary tests gave the best results for a linear kernel, the following experiments are based on a linear kernel. Kernel-based algorithms do not aim to model probabilities, but instead try to minimize the expected loss of the prediction functions on the training data. For each premise p, MOR-CG tries to find a function Cp such that for each conjecture c, Cp(c) = 1 iff p was used in the proof of c. Given a new conjecture c, we can evaluate the learned prediction functions Cp on c. The higher the value Cp(c), the more relevant p is to prove c.
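The exact MOR-CG implementation is not shown here; as a rough illustration of the idea (an iterative conjugate-gradient solve in place of the exact inverse of Section 2.2.4), the following is a hypothetical sketch using SciPy:

import numpy as np
from scipy.sparse.linalg import cg

def train_mor_cg(K, Y, lam, maxiter=100):
    """Approximate A = (K + lam I)^{-1} Y column by column with conjugate gradients."""
    M = K + lam * np.eye(K.shape[0])     # symmetric positive definite for lam > 0
    A = np.zeros_like(Y, dtype=float)
    for j in range(Y.shape[1]):          # one linear system per premise classifier
        A[:, j], _ = cg(M, Y[:, j], maxiter=maxiter)
    return A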

BiLi: BiLi (Bi-Linear) is a new algorithm by Twan van Laarhoven that is based on a bilinear model of premise selection, similar to the work of Chu and Park [23]. Like MOR-CG, BiLi aims to minimize the expected loss. The difference lies in the kind of prediction functions they produce. In MOR-CG the prediction functions only take the features1 of the conjecture into account. In BiLi, the prediction functions use the features of both the conjectures and the premises. This makes BiLi similar to methods like SInE that symbolically compare conjectures with premises. The bilinear model learns a weight for

1In our experiments each feature indicates the presence or absence of a certain symbol or term in a formula.

each combination of a conjecture feature together with a premise feature. Together, this weighted combination determines whether or not a premise is relevant to the conjecture. When the number of features becomes large, fitting a bilinear model becomes computationally more challenging. Therefore, in BiLi the number of features is first reduced to 100, using random projections [12]. To combat the noise introduced by these random projections, this procedure is repeated 20 times, and the averaged predictions are used for ranking the premises.
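The reduction-and-averaging step can be sketched as follows (a hypothetical NumPy fragment: the projection dimension 100 and the 20 repetitions are the values mentioned above, while the Gaussian projection matrix and the fit_bilinear callback, which would minimize the loss over a dim x dim weight matrix, are placeholders of ours):

import numpy as np

def bilinear_scores(phi_c, Phi_p, W):
    """Relevance of every candidate premise p for conjecture c: phi_c^T W phi_p."""
    return Phi_p @ (W.T @ phi_c)

def averaged_projection_scores(phi_c, Phi_p, fit_bilinear, dim=100, repeats=20, seed=0):
    """Average BiLi-style predictions over several random projections to `dim` features."""
    rng = np.random.default_rng(seed)
    d = phi_c.shape[0]
    total = np.zeros(Phi_p.shape[0])
    for _ in range(repeats):
        R = rng.normal(size=(d, dim)) / np.sqrt(dim)   # random projection matrix
        W = fit_bilinear(phi_c @ R, Phi_p @ R)         # training of W is not shown here
        total += bilinear_scores(phi_c @ R, Phi_p @ R, W)
    return total / repeats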

3.1.3 Other Algorithms Used in the Evaluation

SInE: SInE, the SUMO Inference Engine, is a heuristic state-of-the-art premise selection algorithm by Kryštof Hoder [41]. The basic idea is to use global frequencies of symbols in a problem to define their generality, and build a relation linking each symbol S with all formulas F in which S has the lowest global generality among the symbols of F. In common-sense ontologies, such formulas typically define the symbols linked to them, which is the reason for calling this relation a D-relation. Premise selection for a conjecture is then done by recursively following the D-relation, starting with the conjecture's symbols. For the experiments described here the E implementation2 of SInE has been used, because it can be instructed to select exactly the N most relevant premises. This is compatible with the way other premise rankers are used in this chapter, and it allows us to compare the premise rankings produced by different algorithms for increasing values of N.3
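A much-simplified Python sketch of the D-relation idea (the function name, the fixed recursion depth, and the omission of SInE's tolerance parameter are ours):

from collections import Counter, defaultdict

def sine_select(conjecture_symbols, formulas, depth=3):
    """formulas: dict mapping formula name -> set of symbols occurring in it."""
    # Generality of a symbol = number of formulas it occurs in (global frequency).
    freq = Counter(s for syms in formulas.values() for s in syms)
    # D-relation: link each symbol to the formulas in which it is least general.
    d_rel = defaultdict(set)
    for name, syms in formulas.items():
        least = min(freq[s] for s in syms)
        for s in syms:
            if freq[s] == least:
                d_rel[s].add(name)
    # Recursively follow the D-relation, starting from the conjecture's symbols.
    selected, frontier = set(), set(conjecture_symbols)
    for _ in range(depth):
        new_symbols = set()
        for s in frontier:
            for f in d_rel[s]:
                if f not in selected:
                    selected.add(f)
                    new_symbols |= formulas[f]
        frontier = new_symbols
    return selected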

Aprils: APRILS [79], the Automated Prophesier of Relevance Incorporating Latent Semantics, is a signature-based premise selection method that employs Latent Semantic Analysis (LSA) [26] to define symbol and premise similarity. Latent semantics is a machine learning method that has been successfully used for example in the Netflix Prize,4 and in web search. Its principle is to automatically derive “semantic” equivalence classes of words (like car, vehicle, automobile) from their co-occurrences in documents, and to work with such equivalence classes instead of the original words. In APRILS, formulas define the symbol co-occurrence, each formula is characterized as a vector over the symbols' equivalence classes, and the premise relevance is its dot product with the conjecture.

3.1.4 Techniques Not Included in the Evaluation

As a part of the overview, we also list important or interesting algorithms used for ATP knowledge selection that for various reasons do not fit this evaluation. We refer readers to [106] for their discussion.

2http://www.mpi-inf.mpg.de/departments/rg1/conferences/deduction10/slides/stephan-schulz.pdf
3The exact parameters used for producing the E-SInE rankings are at https://raw.github.com/JUrban/MPTP2/master/MaLARea/script/filter1.
4http://www.netflixprize.com


• The default premise selection heuristic used by the Isabelle/Sledgehammer export [64]. This is an Isabelle-specific symbol-based technique similar to SInE that would need to be evaluated on Isabelle data.

• Goal directed ATP calculi including the Conjecture Symbol Weight clause selection heuristics in E prover [84] giving lower weights to symbols contained in the conjecture, the Set of Support (SoS) strategy in resolution/superposition provers, and tableau calculi like leanCoP [70] that are in practice goal-oriented.

• Model-based premise selection, as done by Pudlák’s semantic axiom selection system for large theories [76], by the SRASS metasystem [97], and in a different setting by the MaLARea [110] metasystem.

• MaLARea [110] is a large-theory metasystem that loops between deductive proof and model finding (using ATPs and finite model finders), and learning premise-selection (currently using SNoW or MOR-CG) from the proofs and models to attack the conjectures that still remain to be proved.

• Abstract proof trace guidance implemented in the E prover by Stephan Schulz for his PhD [83]. Proofs are abstracted into clause patterns collected into a common knowledge base, which is loaded when a new problem is solved, and used for guiding clause selection. This is also similar to the hints technique in Prover9 [63].

• The MaLeCoP system [112] where the clause relevance is learned from all closed tableau branches, and the tableau extension steps are guided by a trained machine learner that takes as input features a suitable encoding of the literals on the current tableau branch.

3.2 Machine Learning Evaluation Metrics

Given a database of proofs, there are several possible ways to evaluate how good a premise selection algorithm is without running an ATP. Such evaluation metrics are used to estimate the best parameters (e.g. regularization, tolerance, step size) of an algorithm. The input for each metric is a ranking of the premises for a conjecture together with the information which premises were used to prove the conjecture (according to the training data).

Recall Recall@n is a value between 0 and 1 and denotes the fraction of used premises that are among the top n highest ranked premises.

\[
\mathrm{Recall@}n = \frac{\lvert\{\text{used premises}\} \cap \{n \text{ highest ranked premises}\}\rvert}{\lvert\{\text{used premises}\}\rvert}
\]

Recall@n is always at most Recall@(n + 1). As n increases, Recall@n eventually converges to 1. Our intuition is that the better the algorithm, the faster its Recall@n converges to 1.
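
For reference, the metric is a short computation over a ranked list of premise names (a sketch; the names are ours):

def recall_at_n(ranking, used_premises, n):
    # ranking: premise names, best first; used_premises: set from the known proof.
    if not used_premises:
        return 1.0
    return len(set(ranking[:n]) & used_premises) / len(used_premises)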


AUC The AUC (Area under the ROC Curve) is the probability that, given a randomly drawn used premise and a randomly drawn unused premise, the used premise is ranked higher than the unused premise. Values closer to 1 show better performance. Let $x_1,\dots,x_n$ be the ranks of the used premises and $y_1,\dots,y_m$ be the ranks of the unused premises. Then, the AUC is defined as

\[
\mathrm{AUC} = \frac{\sum_{i=1}^{n}\sum_{j=1}^{m} \mathbf{1}_{x_i > y_j}}{mn}
\]

where $\mathbf{1}_{x_i > y_j} = 1$ iff $x_i > y_j$ and zero otherwise.
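
A direct transcription of the formula (a sketch with hypothetical names; x_i and y_j are the ranks of used and unused premises as above):

def auc(used_ranks, unused_ranks):
    # Count the pairs for which the indicator 1_{x_i > y_j} fires, divide by m*n.
    hits = sum(1 for x in used_ranks for y in unused_ranks if x > y)
    return hits / (len(used_ranks) * len(unused_ranks))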

100%Recall 100%Recall denotes the minimum n such that Recall@n = 1.

100%Recall = min{n | Recall@n = 1}

In other words 100%Recall tells us how many premises (starting from the highest ranked one) we need to give to the ATP to ensure that all necessary premises are included.
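
Since Recall@n reaches 1 exactly when the worst-ranked used premise has been included, 100%Recall can be computed as follows (a sketch; returns None if some used premise never appears in the ranking):

def full_recall(ranking, used_premises):
    positions = [ranking.index(p) + 1 for p in used_premises if p in ranking]
    if len(positions) < len(used_premises):
        return None                      # Recall@n never reaches 1
    return max(positions) if positions else 0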

3.3 Evaluation

3.3.1 Evaluation Data

The premise selection methods are evaluated on the large (chainy) problems from the MPTP2078 benchmark [2].5 These are 2078 related large-theory problems (conjectures) and 4494 formulas (conjectures and premises) in total, extracted from the Mizar Mathematical Library (MML). The MPTP2078 benchmark was developed to supersede the older and smaller MPTP Challenge benchmark (developed in 2006), while keeping the number of problems manageable for experimenting. Larger evaluations are possible,6 but not convenient when testing a large number of systems with many different settings. MPTP2078 seems sufficiently large to test various hypotheses and find significant differences. MPTP2078 also contains (in the smaller, bushy problems) for each conjecture the information about the premises used in the MML proof. This can be used to train and evaluate machine learning algorithms using a chronological order emulating the growth of MML. For each conjecture, the algorithms are allowed to train on all MML proofs that were done up to (not including) the current conjecture. For each of the 2078 problems, the algorithms predict a ranking of the premises.

5Available at http://wiki.mizar.org/twiki/bin/view/Mizar/MpTP2078.
6See [108, 3] for recent evaluations spanning the whole MML.
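
The chronological protocol can be summarized by the following sketch; the learner interface (learn/rank) is hypothetical and only illustrates that training never includes the conjecture being predicted.

def chronological_evaluation(facts, proofs, learner, top_k=200):
    # facts: all fact names in MML order; proofs: conjecture -> set of used premises.
    available = []
    for fact in facts:
        if fact in proofs:                      # a conjecture to predict
            ranking = learner.rank(fact, available)[:top_k]
            yield fact, ranking
            learner.learn(fact, proofs[fact])   # only afterwards learn its proof
        available.append(fact)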


3.3.2 Machine Learning Evaluation: Comparison of Predictions with Known Proofs

We first compare the algorithms introduced in section 3.1 using the machine learning evaluation metrics introduced in section 3.2. All evaluations are based on the training data, the human-written formal proofs from the MML. They do not take alternative proofs into account.

Recall Figure 3.1 compares the average Recall@n of MOR-CG, BiLi, SNoW, SInE and Aprils for the top 200 premises over all 2078 problems. Higher values denote better performance. The graph shows that MOR-CG performs best, and Aprils worst.


Figure 3.1: Recall comparison of the premise selection algorithms

Note that there is a sharp distinction between the learning algorithms, which use the MML proofs and eventually reach a very similar recall, and the heuristic-based algorithms Aprils and SInE.

AUC The average AUC of the premise selection algorithms is reported in table 3.1. Higher values mean better performance, i.e. a higher chance that a used premise is ranked higher than an unused premise. SNoW (97%) and BiLi (96%) have the best average AUC scores

with MOR-CG taking the third spot with an average AUC of 88%. Aprils and SInE are considerably worse with 64% and 42% respectively. The standard deviation is very low with around 2% for all algorithms.

Table 3.1: AUC comparison of the premise selection algorithms

Algorithm   Avg. AUC   Std.
SNoW        0.9713     0.0216
BiLi        0.9615     0.0215
MOR-CG      0.8806     0.0206
Aprils      0.6443     0.0176
SInE        0.4212     0.0142

100%Recall The comparison of the 100%Recall measure values can be seen in figure 3.2. For the first 115 premises, MOR-CG is the best algorithm. From then on, MOR-CG hardly increases and SNoW takes the lead. Eventually, BiLi almost catches up with MOR-CG. Again we can see a big gap between the performance of the learning and the heuristic algorithms with SInE and Aprils not even reaching 400 problems with 100%Recall.


Figure 3.2: 100%Recall comparison of the premise selection algorithms


Discussion In all three evaluation metrics there is a clear difference between the performance of the learning-based algorithms SNoW, MOR-CG and BiLi and the heuristic-based algorithms SInE and Aprils. If the machine-learning metrics on the MML proofs are a good indicator for the ATP performance then there should be a corresponding performance difference in the number of problems solved. We investigate this in the following section.

3.3.3 ATP Evaluation

Vampire In the first experiment we combined the rankings obtained from the algorithms introduced in section 3.1 with version 0.6 of the ATP Vampire [77]. All ATPs are run with a 5s time limit on an Intel Xeon E5520 2.27GHz server with 24GB RAM and 8MB CPU cache. Each problem is always assigned one CPU. We use Vampire because of its good performance in the CASC competitions as well as earlier experiments with the MML [108]. For each MPTP2078 problem (containing on average 1976.5 premises), we created 20 new ATP problems, containing the 10,20,...,200 highest ranked premises. The results can be seen in figure 3.3.


Figure 3.3: Problems solved – Vampire

Apart from the first 10-premise batch and the last three batches, MOR-CG always solves the highest number of problems, with a maximum of 726 problems with the top 70 premises. SNoW solves fewer problems in the beginning, but catches up in the end. BiLi solves very few problems in the beginning, but gets better as more premises are given and eventually is as good as SNoW and MOR-CG. The surprising fact (given the machine learning performance) is that SInE performs very well, on par with SNoW in the range of 60-100 premises. This indicates that SInE finds proofs that are very different from the human proofs. Furthermore, it is worth noting that most algorithms have their peak at around 70-80 premises. It seems that after that, the effect of increased premise recall is beaten by the effect of the growing ATP search space.


Figure 3.4: Problems solved – E

E, SPASS and Z3

We also compared the top three algorithms, MOR-CG, SNoW and SInE, with three other ATPs: E 1.4 [84], SPASS 3.7 [114] and Z3 3.2 [66]. The results can be seen in figures 3.4, 3.5 and 3.6 respectively. In all three experiments, MOR-CG gives the best results. Looking at the number of problems solved by E we see that SNoW and SInE solve about the same number of problems when more than 50 premises are given. In the SPASS evaluation, SInE performs better than SNoW after the initial 60 premises. The results for Z3 are clearer, with (apart from the first run with the top 10 premises) MOR-CG always solving more problems than SNoW, and SNoW solving more problems than SInE. It is worth noting that independent of the learning algorithm, SPASS solves the fewest problems and Z3 the most, and that (at least up to the limit of 200 premises used) Z3 is hardly affected by having too many premises in the problems.



Figure 3.5: Problems solved – SPASS


Figure 3.6: Problems solved – Z3

Discussion The ATP evaluation shows that a good ML evaluation performance does not necessarily imply a good ATP performance and vice versa. E.g. SInE performs better than expected, and BiLi worse. A plausible explanation for this is that the human-written proofs that are the basis of the learning algorithms are not the best possible guidelines for ATP proofs, because there are a number of good alternative proofs: the total number of problems proved with Vampire by the union of all prediction methods is 1197, which is more (in 5s) than the 1105 problems that Vampire can prove in 10s when using exactly the premises used in the human-written proofs. One possible way to test this hypothesis (to a certain extent at least) would be to train the learning algorithms on all the ATP proofs that are found, and test whether the ML evaluation performance correlates more closely with the ATP evaluation performance. The most successful 10s combination, solving 939 problems, is to run Z3 with the 130 best premises selected by MOR-CG, together with Vampire using the 70 best premises selected by SInE. It is also worth noting that when we consider all provers and all methods, 1415 problems can be solved. It seems the heuristic and the learning based premise selection methods give rise to different proofs. In the next section, we try to exploit this by considering combinations of ranking algorithms.

3.4 Combining Premise Rankers

There is clear evidence that alternative proofs can be found from alternative predictions. This should not be too surprising, because the premises are organized into a large derivation graph, and there are many explicit (and quite likely also many yet-undiscovered) semantic dependencies among them. The evaluated premise selection algorithms are based on different ideas of similarity, relevance, and functional approximation spaces and norms in them. This also means that they can be better or worse in capturing different aspects of the premise selection problem (whose optimal solution is obviously undecidable in general, and intractable even if we impose some finiteness limits). An interesting machine learning technique to try in this setting is the combination of different predictors. There has been a large amount of machine learning research in this area, done under different names; ensembles is one of the most frequent. A recent overview of ensemble based systems is given in [75], while for example [87] deals with the specific task of aggregating rankers. As a final experiment that opens the premise selection field to the application of advanced ranking-aggregation methods, we have performed an initial simple evaluation of combining two very different premise ranking methods: MOR-CG and SInE. The aggregation is done by simple weighted linear combination, i.e., the final ranking is obtained via a weighted linear combination of the predicted individual rankings. We test a limited grid of weights, in the interval [0,1] with a step value of 0.25, i.e., apart from the original MOR-CG and SInE rankings we get three more weighted aggregate rankings as follows: 0.25 ∗ CG + 0.75 ∗ SInE, 0.5 ∗ CG + 0.5 ∗ SInE, and 0.75 ∗ CG + 0.25 ∗ SInE. Figure 3.7 shows their ATP evaluation. The machine learning evaluation (done as before against the data extracted from the human proofs) is not surprising, and the omitted graphs look like linear combinations of the corresponding figures for MOR-CG and SInE.
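
A minimal sketch of the aggregation, under our own reading of "weighted linear combination of the predicted individual rankings": positions in the two rankings are combined linearly and re-sorted (combining raw ranker scores instead would work analogously).

def combine_rankings(ranking_a, ranking_b, weight_a=0.5):
    # Each ranking is a list of premise names, best first.
    premises = set(ranking_a) | set(ranking_b)
    worst = len(premises)
    pos_a = {p: i for i, p in enumerate(ranking_a)}
    pos_b = {p: i for i, p in enumerate(ranking_b)}
    def score(p):
        return weight_a * pos_a.get(p, worst) + (1 - weight_a) * pos_b.get(p, worst)
    return sorted(premises, key=score)

# The three intermediate grid points from the text correspond to
# weight_a = 0.25, 0.5 and 0.75 with ranking_a = MOR-CG and ranking_b = SInE.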



Figure 3.7: Combining CG and SInE: Problems solved

The ATP evaluation (only Vampire was used) is a very different case. For example the equally weighted combination of MOR-CG and SInE solves over 604 problems when using only the top 20 ranked premises. The corresponding values for standalone MOR-CG resp. SInE are 476, resp. 341, i.e., they are improved by 27%, resp. 77%. The equally weighted combination solves 797 problems when using the top 70 premises, which is a 10% improvement over the best result of all methods (726 problems solved by MOR-CG when using the top 70 premises). Note that unlike the external combination mentioned above, this is done only in 5 seconds, with only one ATP, one premise selector, and one threshold.

3.5 Conclusion

Heuristic and inductive methods seem indispensable for strong automated reasoning in large formal mathematics, and significant improvements can be achieved by their proper design, use and combination with precise deductive methods. Knowing previous proofs and learning from them turns out to be important not just to mathematicians, but also for automated reasoning in large theories. We have evaluated practically all reasonably fast state-of-the-art premise selection techniques and tried some new ones. The results show that learning-based algorithms can perform better than heuristics. Relying solely on ML evaluations is not advisable, since heuristic premise selection algorithms in particular often find different proofs. A

combination of heuristic and learning-based predictions gives the best results.


Chapter 4

Learning from Multiple Proofs

Mathematical textbooks typically present only one proof for most of the theorems. However, there are infinitely many proofs for each theorem in first-order logic, and mathematicians are often aware of (and even invent new) important alternative proofs and use such knowledge for (lateral) thinking about new problems. In this chapter we explore how the explicit knowledge of multiple (human and ATP) proofs of the same theorem can be used in learning-based premise selection algorithms in large-theory mathematics. Several methods and their combinations are defined, and their effect on the ATP performance is evaluated on the MPTP2078 benchmark. The experiments show that the proofs used for learning significantly influence the number of problems solved, and that the quality of the proofs is more important than the quantity.

4.1 Learning from Different Proofs

In the previous chapter we tested and evaluated several premise selection algorithms on a subset of the Mizar Mathematical Library (MML), the MPTP2078 large-theory benchmark,1 using the (human) proofs from the MML as training data for the learning algorithms. We found that learning from such human proofs helps a lot, but alternative proofs can quite often be successfully constructed by ATPs, making heuristic methods like SInE surprisingly strong and orthogonal to learning methods. Thanks to these experiments we now also have (possibly several) ATP proofs for most of the problems. In this chapter, we investigate how the knowledge of different proofs can be integrated into the machine learning algorithms for premise selection, and how it influences the performance of the ATPs. Section 4.2 introduces the necessary machine learning terminology and explains how different proofs can be used in the algorithms. In Section 4.3, we define

This chapter is based on: [55] “Learning from Multiple Proofs: First Experiments”, published in the Proceedings of the 3rd Workshop on Practical Aspects of Automated Reasoning.
1Available at http://wiki.mizar.org/twiki/bin/view/Mizar/MpTP2078.

several possible ways to use the additional knowledge given by the different proofs. The different proof combinations are evaluated and discussed in Section 4.4, and Section 4.5 concludes.

4.2 The Machine Learning Framework and the Data

We start with the setting introduced in the previous chapter. Γ denotes the set of all facts that appear in a given (fixed) large mathematical corpus (MPTP2078 in this chapter). The corpus is assumed to use notation (symbols) and formula names consistently, since they are used to define the features and labels for the machine learning algorithms as defined in Chapter 2. The visibility relation over Γ is defined by the chronological growth of the ITP library. We say that a proof P is a proof over Γ if the conjecture and all premises used in P are elements of Γ. Given a set of proofs ∆ over Γ in which every fact has at most one proof, the (∆-based) proof matrix $\mu_\Delta : \Gamma \times \Gamma \to \{0,1\}$ is defined as

\[
\mu_\Delta(c, p) := \begin{cases} 1 & \text{if } p \text{ is used to prove } c \text{ in } \Delta, \\ 0 & \text{otherwise.} \end{cases}
\]

In other words, $\mu_\Delta$ is the adjacency matrix of the graph of the direct proof dependencies from ∆. The proof matrix derived from the MML proofs, together with the formula features, is used as training data. In the previous chapter, we compared several different premise selection algorithms on the MPTP2078 dataset. Thanks to this comparison we have ATP proofs for 1328 of the 2078 problems, found by Vampire 0.6 [77]. For some problems we found several different proofs, meaning that the sets of premises used in the proofs differ. Figure 4.1 shows the number of different ATP proofs we have for each problem. The maximum number of different proofs is 49. On average, we found 6.71 proofs per solvable problem. This database of proofs allows us to start considering multiple proofs for a c ∈ Γ. For each conjecture c, let $\Theta_c$ be the set of all ATP proofs of c in our dataset, and let $n_c$ denote the cardinality of $\Theta_c$. We use a generalized proof matrix to represent multiple proofs of c. The general interpretation of $\mu_X(c, p)$ is the relevance (weight) of a premise p for a proof of c determined by X, where X can either be a set of proofs as above, or a particular algorithm (typically in conjunction with the data to which it is applied). For a single proof σ, let $\mu_\sigma := \mu_{\{\sigma\}}$, i.e.,

\[
\mu_\sigma(c, p) := \begin{cases} 1 & \text{if } \sigma \in \Theta_c \text{ and } p \text{ is used to prove } c \text{ in } \sigma, \\ 0 & \text{otherwise.} \end{cases}
\]
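
Concretely, the proof matrix can be materialized as a 0/1 matrix indexed by the fixed order of facts in Γ. The sketch below uses a dense NumPy array for clarity; in practice a sparse representation would be preferable.

import numpy as np

def proof_matrix(facts, proofs):
    # facts: list of all fact names in Gamma (fixes the row/column order).
    # proofs: dict conjecture -> set of premises used in its (single) proof.
    index = {f: i for i, f in enumerate(facts)}
    mu = np.zeros((len(facts), len(facts)), dtype=np.int8)
    for c, premises in proofs.items():
        for p in premises:
            mu[index[c], index[p]] = 1       # mu[c, p] = 1 iff p is used to prove c
    return mu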

We end this section by introducing the concept of redundancy, which seems to be at the heart of the problem that we are exploring. Let c be a conjecture and $\sigma_1, \sigma_2$ be proofs for c ($\sigma_1, \sigma_2 \in \Theta_c$) with used premises $\{p_1, p_2\}$ and $\{p_1, p_2, p_3\}$ respectively. In this case, premise $p_3$ can be called redundant since we know a proof of c that does not


use $p_3$.2 Redundant premises appear quite frequently in ATP proofs, for example, due to exhaustive equational normalization that can turn out to be unnecessary for the proof. Now imagine we have a third proof of c, $\sigma_3$, with used premises $\{p_1, p_3\}$. With this knowledge, $p_2$ could also be called redundant (or at least unnecessary). But one could also argue that at least one of $p_2$ and $p_3$ is not redundant. In such cases, it is not clear what a meaningful definition of redundancy should be. We will use the term redundancy for premises that might not be necessary for a proof.


Figure 4.1: Number of different ATP proofs for each of the 2078 problems. The problems are ordered by their appearance in the MML.

4.3 Using Multiple Proofs

We define several different combinations of MML and ATP proofs and their respective proof matrices. Recall that there are many problems for which we do not have any ATP proofs. For those problems, we will always just use the MML proof. I.e., for all proof matrices $\mu_X$ defined below, if there is no ATP proof of a conjecture c, then $\mu_X(c, p) = \mu_{\mathrm{MML}}(c, p)$.

2For this we assume some similarity between the efficiency of the proofs in $\Theta_c$, which is the case for our experiments based on the 5-second time limit.


4.3.1 Substitutions and Unions

The simplest way to combine different proofs is to either only consider the used premises of one proof, or take the union of all used premises. We consider five different combinations.

Definition 6 (MML Proofs).

\[
\mu_{\mathrm{MML}}(c, p) := \begin{cases} 1 & \text{if } p \text{ is used to prove } c \text{ in the MML proof,} \\ 0 & \text{otherwise.} \end{cases}
\]

This dataset will be used as baseline throughout all experiments. It uses the human proofs from the Mizar library.

Definition 7 (Random ATP Proof). For each conjecture c for which we have ATP proofs, pick a (pseudo)random ATP proof $\sigma_c \in \Theta_c$.

\[
\mu_{\mathrm{Random}}(c, p) := \begin{cases} 1 & \text{if } p \text{ is a used premise in } \sigma_c, \\ 0 & \text{otherwise.} \end{cases}
\]

Definition 8 (Best ATP Proof). For each conjecture c for which we have ATP proofs, pick an(y) ATP proof with the least number of used premises $\sigma_c^{\min} \in \Theta_c$.

\[
\mu_{\mathrm{Best}}(c, p) := \begin{cases} 1 & \text{if } p \text{ is a used premise in } \sigma_c^{\min}, \\ 0 & \text{otherwise.} \end{cases}
\]

Definition 9 (Random Union). For each conjecture c for which we have ATP proofs, pick a random ATP proof $\sigma_c \in \Theta_c$.

\[
\mu_{\mathrm{RandomUnion}}(c, p) := \begin{cases} 1 & \text{if } p \text{ is a premise used in } \sigma_c \text{ or in the MML proof of } c, \\ 0 & \text{otherwise.} \end{cases}
\]

Definition 10 (Union). For each conjecture c for which we have ATP proofs, we define

\[
\mu_{\mathrm{Union}}(c, p) := \begin{cases} 1 & \text{if } p \text{ is a premise used in any ATP or MML proof of } c, \\ 0 & \text{otherwise.} \end{cases}
\]

4.3.2 Premise Averaging

Proofs can also be combined by learning from the average used premises. We consider three options: the standard average, a biased average and a scaled average.

Definition 11 (Average). The average gives equal weight to each proof.

\[
\mu_{\mathrm{Average}}(c, p) = \frac{1}{n_c + 1}\Big(\sum_{\sigma \in \Theta_c} \mu_\sigma(c, p) + \mu_{\mathrm{MML}}(c, p)\Big)
\]


The intuition is that the average gives a better idea of how necessary a premise really is. When there are very different proofs, the average will give a very low weight to every premise. That is why we also tried scaling as follows:

Definition 12 (Scaled Average). The scaled average ensures that there is at least one premise with weight 1.

\[
\mu_{\mathrm{ScaledAverage}}(c, p) = \frac{\sum_{\sigma \in \Theta_c} \mu_\sigma(c, p) + \mu_{\mathrm{MML}}(c, p)}{\max_{q \in \Gamma}\big(\sum_{\sigma \in \Theta_c} \mu_\sigma(c, q) + \mu_{\mathrm{MML}}(c, q)\big)}
\]

Another experiment is to make the weight of all the ATP proofs equal to the weight of the MML proof:

Definition 13 (Biased Average).

\[
\mu_{\mathrm{BiasedAverage}}(c, p) = \frac{1}{2}\left(\frac{\sum_{\sigma \in \Theta_c} \mu_\sigma(c, p)}{n_c} + \mu_{\mathrm{MML}}(c, p)\right)
\]
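
For a single conjecture, the combination schemes of this and the previous subsection reduce to simple premise counting. The following sketch (a hypothetical helper using dictionaries instead of matrix rows) computes the weights of Definitions 8, 10, 11, 12 and 13.

def combine_proofs(mml_premises, atp_proofs, scheme="average"):
    # mml_premises: set of premises of the MML proof.
    # atp_proofs:   list of premise sets, one per ATP proof (possibly empty).
    if not atp_proofs:                        # no ATP proof: fall back to MML
        return {p: 1.0 for p in mml_premises}
    proofs = atp_proofs + [mml_premises]      # the n_c ATP proofs plus the MML proof
    counts = {}
    for prf in proofs:
        for p in prf:
            counts[p] = counts.get(p, 0) + 1
    if scheme == "best":                      # Definition 8: smallest ATP proof
        return {p: 1.0 for p in min(atp_proofs, key=len)}
    if scheme == "union":                     # Definition 10
        return {p: 1.0 for p in counts}
    if scheme == "average":                   # Definition 11
        return {p: c / len(proofs) for p, c in counts.items()}
    if scheme == "scaled":                    # Definition 12
        top = max(counts.values())
        return {p: c / top for p, c in counts.items()}
    if scheme == "biased":                    # Definition 13
        n_c = len(atp_proofs)
        return {p: 0.5 * (sum(p in prf for prf in atp_proofs) / n_c
                          + (1.0 if p in mml_premises else 0.0))
                for p in counts}
    raise ValueError(scheme)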

4.3.3 Premise Expansion

Consider a situation where a ⊢ b and b ⊢ c. Obviously, not only b, but also a proves c. When we consider the used premises in a proof, we only use the information about the direct premises (b in the example), but nothing about the indirect premises (a in the example), the premises of the direct premises. Using this additional information might help the learning algorithms. We call this premise expansion and define three different weight functions that try to capture this indirect information. All three penalize the weight of the indirect premises with a factor of $\frac{1}{2}$.

Definition 14 (MML Expansion). For the MML expansion, we only consider the MML proofs and their one-step expansions:

\[
\mu_{\mathrm{MMLExp}}(c, p) = \mu_{\mathrm{MML}}(c, p) + \frac{\sum_{q \in \Gamma} \mu_{\mathrm{MML}}(c, q)\,\mu_{\mathrm{MML}}(q, p)}{2}
\]

Note that since $\mu_{\mathrm{MML}}(c, p)$ is either 0 or 1, the sum $\sum_{q \in \Gamma} \mu_{\mathrm{MML}}(c, q)\,\mu_{\mathrm{MML}}(q, p)$ just counts how often p is a grandparent premise of c.

Definition 15 (Average Expansion). The average expansion takes $\mu_{\mathrm{Average}}$ instead of $\mu_{\mathrm{MML}}$:

\[
\mu_{\mathrm{AverageExp}}(c, p) = \mu_{\mathrm{Average}}(c, p) + \frac{\sum_{q \in \Gamma} \mu_{\mathrm{Average}}(c, q)\,\mu_{\mathrm{Average}}(q, p)}{2}
\]

Definition 16 (Scaled Expansion). And finally, we consider an expansion of the scaled average.

\[
\mu_{\mathrm{ScaledAverageExp}}(c, p) = \mu_{\mathrm{ScaledAverage}}(c, p) + \frac{\sum_{q \in \Gamma} \mu_{\mathrm{ScaledAverage}}(c, q)\,\mu_{\mathrm{ScaledAverage}}(q, p)}{2}
\]

Deeper expansions and different penalization factors are possible, but given the performance of these initial tests shown in the next section we decided not to investigate further.
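
In matrix form, one-step expansion is just a penalized matrix product, since the entry (µ·µ)(c, p) sums the weights of p over all direct premises q of c. A minimal sketch:

import numpy as np

def expand(mu, penalty=0.5):
    # mu[c, q]: (direct) weight of premise q for conjecture c, as a NumPy array.
    # (mu @ mu)[c, p] accumulates the weights of the grandparent premises of c.
    return mu + penalty * (mu @ mu)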


4.4 Results

4.4.1 Experimental Setup

All experiments were done on the MPTP2078 dataset. Because of its good performance in earlier evaluations, we used the Multi-Output-Ranking (MOR) learning algorithm for the experiments. For each conjecture, MOR is allowed to train on all proofs that were (in the chronological order of MML) done up to that conjecture. In particular, this means that the algorithms do not train on the data they were asked to predict. Three-fold cross validation on the training data was used to find the optimal parameters. For the combinations in 4.3.1, the AUC measure was used to estimate the performance. The other combinations used the square-loss error. For each of the 2078 problems, MOR predicts a ranking of the premises. We again use Vampire 0.6 for evaluating the predictions. Version 0.6 was chosen to make the experiments comparable with the earlier results. Vampire is run with a 5s time limit on an Intel Xeon E5520 2.27GHz server with 24GB RAM and 8MB CPU cache. Each problem is always assigned one CPU. For each MPTP2078 problem, we created 20 new problems, containing the 10,20,...,200 highest ranked premises and ran Vampire on each of them. The graphs show how many problems were solved using the 10,20,...,200 highest ranked premises. As a performance baseline, Vampire 0.6 in CASC mode (that means also using SInE with different parameters on large problems) can solve 548 problems in 10 seconds [2].

4.4.2 Substitutions and Unions

Figure 4.2 shows the performance of the simple proof combinations introduced in 4.3.1. It can be seen that using ATP instead of MML proofs can improve the performance considerably, in particular when only few premises are provided. One can also see the difference that the quality of the proof makes. The best ATP proof predictions always solved more problems than the random ATP proof predictions. Taking the union of two or more proofs decreases the performance. This can be due to the redundancy introduced by considering many different premises and suggests that the ATP search profits most from simple and clear (one-directional) advice, rather than from a combination of ideas.

4.4.3 Premise Averaging

Taking the average of the used premises could be a good way to combat the redundant premises. The idea is that premises that are actually important should appear in almost every proof, whereas premises that are redundant should only be present in a few proofs. Hereby, important premises should get a high weight and unimportant premises a low weight. The results of the averaging combinations can be seen in Figure 4.3. Apart from the scaled average, it seems that taking the average does perform better than taking the union. However, the baseline of only the MML premises is better or almost as good as the average predictions.



Figure 4.2: Comparison of the combinations presented in 4.3.1.


Figure 4.3: Comparison of the combinations presented in 4.3.2.


4.4.4 Premise Expansions

Finally, we compare how expanding the premises affects the ATP performance in Figure 4.4. While expanding the premises does add additional redundancy, it also adds further potentially useful information.


Figure 4.4: Comparison of the combinations presented in 4.3.3.

However, all expansions perform considerably worse than the MML proof baseline. It seems that the additional redundancy outweighs the usefulness.

4.4.5 Other ATPs

We also investigated how learning from Vampire proofs affects other provers, by running E 1.4 [84] and Z3 3.2 [66] on some of the learned predictions. Figure 4.5 shows the results. The predictions learned from the MML premises serve as a baseline. E does not improve as much over the MML-based predictions when using the predictions based on the best Vampire proofs as Vampire does. This would suggest that the ATPs really profit most from “their own” best proofs. However, for Z3 the situation is the opposite: the improvement by learning from the best Vampire proofs is at some points even slightly better than for Vampire itself, and this helps Z3 to reach its maximum performance earlier than before. Also, learning from the averaged proofs behaves differently for the ATPs. For E, the MML and the averaged proofs give practically the same performance, for Vampire the MML proofs are better, but for Z3 the averaged proofs are quite visibly better.



(a) E


(b) Z3

Figure 4.5: Performance of other ATPs when learning from Vampire proofs.


4.4.6 Comparison With the Best Results Obtained so far

In the previous chapter, we found that a combination of SInE [41] and the MOR algorithm (trained on the MML proofs) has so far the best performance on the MPTP2078 dataset. Figure 4.6 compares the new results with this combination. Furthermore, we also try combining SInE with MOR trained on ATP proofs. For comparison we also include our baseline, the MML Proof predictions, and the results obtained from the SInE predictions.


Figure 4.6: Comparison of the best performing algorithms.

While learning from the best ATP proofs leads to more problems solved than learning from the MML proofs, the combination of SInE and learning from MML proofs still beats both. However, combining the SInE predictions with the best ATP proof predictions gives even better results, with a maximum of 823 problems solved (a 3.3% increase over the former maximum) when given the top 70 premises.

4.4.7 Machine Learning Evaluation

Machine learning has several methods to measure how good a learned classifier is without having to run an ATP. In the earlier experiments the results of the machine learning evaluation did not correspond to the results of the ATP evaluation. For example, SInE performed worse than BiLi on the machine learning evaluation but better than BiLi on the ATP evaluation. Our explanation was that we are training from (and therefore measuring) the wrong data. With SInE the ATP found proofs that were very different from the MML proofs.



(a) 100%Recall on the MML proofs.


(b) 100%Recall on the best ATP proofs.

Figure 4.7: 100%Recall comparison between evaluating on the MML and the best ATP proofs. The graphs show how many problems have all necessary premises (according to the training data) within the n highest ranked premises.


In Figure 4.7 we compare the machine learning evaluation (the 100%Recall measure) depending on whether we evaluate on the MML proofs or on the best ATP proofs. Ideally, the machine learning performance of the algorithms would correspond to the ATP performance (see Figure 4.6). This is clearly not the case for the 100%Recall on the MML proofs graph. The best ATP predictions are better than the MML proof predictions, and SInE solves more than 200 problems. With the new evaluation, the 100%Recall on the best ATP proofs graph, the performance is more similar to the actual ATP performance, but there is still room for improvement.

4.5 Conclusion

The fact that there is never only one proof makes premise selection an interesting machine learning problem. Since it is in general undecidable to know the “best prediction”, the domain has a randomness aspect that is quite unusual (Chaitin-like [22]) in AI. In this chapter we experimented with different proof combinations to obtain better information for high-level proof guidance by premise selection. We found that it is easy to introduce so much redundancy that the predictions created by the learning algorithms are not good for existing ATPs. On the other hand we saw that learning from proofs with few premises (and hence probably less redundancy) increases the ATP performance. It seems that we should look for a measure of how ‘good’ or ‘simple’ a proof is, and only learn from the best proofs. Such measures could be for example the number of inference steps done by the ATP during the proof search, or the total CPU time needed to find the proof. Another question that was (at least initially) answered in this chapter is to what extent learning from human proofs can help an ATP, in comparison to learning from ATP proofs. We saw that while not optimal, learning from human proofs seems to be approximately equivalent to learning from suboptimal (for example random, or averaged) ATP proofs. Learning from the best ATP proof is about as good as combining SInE with learning from the MML proofs. Combining SInE with learning from the best ATP proof still outperforms all methods.

Chapter 5

Automated and Human Proofs in General Mathematics

First-order translations of large mathematical repositories allow discovery of new proofs by automated reasoning systems. Large amounts of available mathematical knowledge can be re-used by combined AI/ATP systems, possibly in unexpected ways. But automated systems can also be more easily misled by irrelevant knowledge in this setting, and finding deeper proofs is typically more difficult. Both large-theory AI/ATP methods, and translation and data-mining techniques of large formal corpora, have significantly developed recently, providing enough data for an initial comparison of the proofs written by mathematicians and the proofs found automatically. This chapter describes such a comparison conducted over the 52000 mathematical theorems from the Mizar Mathematical Library.

5.1 Introduction: Automated Theorem Proving in Mathematics

Computers are becoming an indispensable part of many areas of mathematics [38]. As their capabilities develop, human mathematicians are faced with the task of steering, comprehending, and evaluating the ideas produced by computers, similar to chess players in recent decades. A notable milestone is the automatically found proof of the Robbins conjecture by EQP [62] and its postprocessing into a human-comprehensible proof by ILF [25] and Mathematica [29]. Especially in small equational algebraic theories (e.g., quasigroup theory), a number of nontrivial proofs have already been found automatically [74], and their evaluation, understanding, and automated post-processing is an open problem [113].

This chapter is based on: [3] “Automated and Human Proofs in General Mathematics: An Initial Comparison”, published in the Proceedings of the 18th International Conference on Logic for Programming, Artificial Intelligence, and Reasoning. All three authors contributed equally to the paper. Part of Section 5.2.1 is taken from [2] “Premise Selection for Mathematics by Corpus Analysis and Kernel Methods”, published in the Journal of Automated Reasoning.


In recent years, large general mathematical corpora like the Mizar Mathematical Library (MML) and the Isabelle/HOL library are being made available to automated reasoning and AI methods [102, 73], leading to the development of automated reasoning techniques working in large theories with many previous theorems, definitions, and proofs that can be re-used [110, 41, 64, 112]. A recent evaluation (and tuning) of ATP systems on the MML [108] has shown that the Vampire/SInE [77] system can already re-prove 39% of the MML’s 52000 theorems when the necessary premises are precisely selected from the human1 proofs, and about 14% of the theorems when the ATP is allowed to use the whole available library, leading on average to 40000 premises in such ATP problems. In the previous chapters we showed that re-using (generalizing and learning) the knowledge accumulated in previous proofs can further significantly improve the performance of combined AI/ATP systems in large-theory mathematics. This performance, and the recently developed proof analysis for the MML [4], allowed an experiment with automatically finding proofs for all theorems in the MML by a combination of learning and ATP methods. This is described in Section 5.2. The 9141 ATP proofs found automatically were then compared using several metrics to the human proofs in Section 5.3 and Section 5.4.

5.2 Finding proofs in the MML with AI/ATP support

To create a sufficient body of ATP proofs from the MML, we have conducted a large AI/ATP experiment that makes use of several recently developed techniques and significant computational resources. The basic idea of the experiment is to lift the setting used in [2] for large-theory automated proving of the MPTP2078 benchmark to the whole MML (approximately 52000 theorems and more than 100000 premises). The setting consists of the following three consecutive steps:

• mining proof dependencies from all MML proofs;

• learning premise selection from the mined proof dependencies;

• using an ATP to prove new conjectures from the best selected premises.

5.2.1 Mining the dependencies from all MML proofs

For the experiments below, we used Alama et al.’s method for computing fine-grained dependencies [4]. The first step in the computation is to break up each article in the MML into a sequence of Mizar texts, each consisting of a single statement (e.g., a theorem, definition, unexported lemma). Each of these texts can—with suitable preprocessing—be regarded as a complete, valid Mizar article in its own right. The decomposition of a whole

1Mizar proofs are initially human-written, but they are formal and machine-understandable. That allows their automated machine processing and refactoring, which can make them “less human”. Yet, we believe that their classification as “human” is appropriate, and that MML/MPTP is probably the most suitable resource today for attempting this initial comparison of ATP and human proofs.


MML article into such smaller articles typically requires a number of nontrivial refactoring steps, comparable, e.g., to automated splitting and re-factoring of large programs written in programming languages with complicated syntactic mechanisms. In Mizar, every article has a so-called environment: a list $\mathit{ENV}_0 = [\mathit{statement}_j : 1 \leq j \leq \mathrm{length}(\mathit{ENV}_0)]$ of statements $\mathit{statement}_j$ specifying the background knowledge (theorems, notations, etc.) that is used to verify the article. The actual Mizar content contained in an article’s environment is, in general, a rather conservative overestimate of the statements that the article actually needs. The algorithm first defines the current environment as $\mathit{ENV}_0$. It then considers each statement in $\mathit{ENV}_0$ and tries to verify the article using the current environment without the considered statement. If the verification succeeds, the considered statement is deleted from the current environment. To be more precise, starting with the original environment $\mathit{ENV}_0$ (in which the article verification succeeds), the algorithm works by constructing a sequence of finer environments $\{\mathit{ENV}_i : 1 \leq i \leq \mathrm{length}(\mathit{ENV}_0)\}$ such that

\[
\mathit{ENV}_i := \begin{cases} \mathit{ENV}_{i-1} & \text{if the verification fails in } \mathit{ENV}_{i-1} \setminus \{\mathit{statement}_i\}, \\ \mathit{ENV}_{i-1} \setminus \{\mathit{statement}_i\} & \text{otherwise.} \end{cases}
\]

The article verification thus still succeeds in the final $\mathit{ENV}_{\mathrm{length}(\mathit{ENV}_0)}$ environment, and this environment consists of all the statements of $\mathit{ENV}_0$ whose removal caused the article verification to fail during this construction.2 The dependencies of the original statement, which formed the basis of the article, are then defined as the elements of $\mathit{ENV}_{\mathrm{length}}$. This process is described in detail in [4], where it is conducted for the 100 initial articles from the MML. The computation takes several days for all of MML, however the information thus obtained gives rise to an unparalleled corpus of data about human-written proofs in the largest available formal body of mathematics. In the final account, the days of computation pay off, by providing more precise advice for proving new conjectures over the whole MML. An approximate estimate of the computational resources taken by this job is about ten days of full (parallel) CPU load (12 hyperthreading Xeon 2.67 GHz cores, 24 GB RAM) of the Mizar server at the University of Alberta. The resulting dependencies for all MML items can be viewed online.3

5.2.2 Learning Premise Selection from Proof Dependencies To learn premise selection from proof dependencies, one characterizes all MML formulas by suitable features, and feeds them (together with the detailed proof information) to a machine learning system that is trained to advise premises for later conjectures. Formula symbols have been used previously for this task in [102]. Thanks to sufficient hardware being available, we have for the first time included also term features generated by the MaLARea framework, added to it in 2008 [110] for experiments with smaller subsets of

2Note that this final environment could in general still be made smaller (after the removal of a certain statement, another statement might become unnecessary), and its construction depends on the (chosen and fixed for all experiments) initial ordering of statements in the environment. 3http://mizar.cs.ualberta.ca/mizar-items

51 CHAPTER 5. AUTOMATED AND HUMAN PROOFS IN GENERAL MATHEMATICS

MML. Thus, for each MML formula we include as its characterization also all the subterms and subformulas contained in the formula, which makes the learning and prediction more precise. To our surprise, the EPROVER-based [84] utility that consistently numbers all shared terms in all formulas, written for this purpose in 2008 by Josef Urban, scaled without problems to the whole MML. This feature-generation phase took only minutes and created over one million features. We have also briefly explored using validity in finite models (introduced in MaLARea in 2008, building on Pudlák’s previous work [76]) as a more semantic way of characterizing formulas. However, this has turned out to be very time-consuming, most likely caused by the LADR-based clausefilter utility struggling with evaluating in the models some complicated (containing many quantifiers) mathematical formulas. Clearly, further optimizations are needed for extracting such semantic characterizations for all of MML. Even without such features, the machine learning was already pushed to the limit. The kernel-based multi-output ranker presented in chapter 2.2.4 turned out to be too slow and memory-exhaustive to handle over one million features and over a hundred thousand training examples. The SNoW system used in naive Bayes mode took several gigabytes of RAM to train on the data, and on average about a second (ca. a day of computation for all of MML) to produce a premise prediction for each MML problem (based always on incremental training4 on all previous MML proofs). The results of this run are SNoW premise predictions for all of MML, available online5 as the raw SNoW output, and also postprocessed into corresponding ATP problems (see below).

5.2.3 Using ATPs to Prove the Conjectures from the Selected Premises

As the MML grows from axioms of set theory to advanced mathematics, it gives rise to a chronological ordering of its theorems. When a new theorem C is conjectured, all the previous theorems and definitions are available as premises, and all the previous proofs are used to learn which of these premises are relevant for C. The SNoW system provides a ranking of all premises, and the best premises are given to an ATP which attempts a proof of C. There are many ways to organize several ATP systems to try to prove C, with different numbers of the highest ranked premises and with different time limits. For our experiments, we have fixed the ATP system to be Vampire (version 1.8) [77], and we have always used the 200 highest ranked premises and a time limit of 20 seconds. A 12-core 2.67 GHz Xeon server at the University of Alberta was used for (parallelized) proving, which took about a day in real time. This has produced 9141 automatically found proofs that we further analyze. The overall success rate is over 18% of theorems proved, which is so far the best result on the whole MML, but we have not really focused yet on getting this as high as possible. For example, running Vampire in parallel with both 40 and 200 best recommended premises has been shown to significantly improve the success rate, and a preliminary experiment with the Z3 solver has provided another two thousand proofs

4In the incremental learning mode, the evaluation and training are done at the same time for each example, hence there was no extra time taken by training.
5http://mizar.cs.ualberta.ca/~mptp/proofcomp/snow_predictions.tar.gz

from the problems with 200 best premises. Unfortunately, Z3 does not (yet) print the names of premises used in the proofs, so its proofs would not be directly usable for the analysis that is conducted here. When using a large number of premises, an ATP proof can often contain unnecessary premises. To weed out those unnecessary premises, we always re-run the ATP with only the premises that were used in the first run. The ATP problems are also available online6 for further experiments, as well as all the proofs found.7

5.3 Proof Metrics

We thus have, for 9141 Mizar theorems φ, the set of premises that were used in the (minimized) ATP proof of φ. Each ATP proof was found completely independently of its Mizar proof, i.e., no information (e.g., about the premises used) from the Mizar proof was transferred to the ATP proof.8 This gives us a notion of dependency for Mizar theorems, derived from an ATP. From the Mizar proof dependency analysis we also know precisely what Mizar items are needed for a given Mizar (human-written) proof to be successful.

Definition 17. For a Mizar theorem φ, let $P_{\mathrm{MML}}(\varphi)$ be the minimal set of premises needed for the success of the (human) MML proof of φ. Let $P_{\mathrm{ATP}}(\varphi)$ be the set of premises used by an ATP to prove φ. This gives rise to the notions of “immediate dependence” and “indirect dependence” of one Mizar item a upon another Mizar item b:

Definition 18. For Mizar items a and b, $a <_1 b$ means that a immediately depends on b ($b \in P_{\mathrm{MML}}(a)$). Let $<$ be the transitive closure of $<_1$, $\leq$ its reflexive version, and let $P^{*}_{\mathrm{MML}}(a) := \{b : b < a\}$. For a set S of items, let $P^{*}_{\mathrm{MML}}(S) := \{b : \exists a \in S : b \leq a\}$.

While theoretically there are multiple versions of $<_1$ and $<$ induced by different (ATP, Mizar) proofs, unless we explicitly state otherwise these relations will always refer to the dependencies derived from the Mizar proofs. The pragmatic reason is that we do not have an ATP proof for all Mizar items,9 and hence we do not have the full dependency graph induced by ATP proofs. Also, the way ATP proofs were produced was by always relying on the previous Mizar theorems and dependency data, therefore it makes sense to also use the Mizar data for the transitive closure. We define two comparison metrics. D (Dependencies) counts the number of premises used in a proof.

Definition 19. For each Mizar item a, we define its Mizar dependencies as $D_{\mathrm{MML}}(a) := |P_{\mathrm{MML}}(a)|$ and its ATP dependencies via $D_{\mathrm{ATP}}(a) := |P_{\mathrm{ATP}}(a)|$.

6http://mizar.cs.ualberta.ca/~mptp/proofcomp/advised200f1.tar.gz
7http://mizar.cs.ualberta.ca/~mptp/proofcomp/proved200f1min.tar.gz
8The ATP proofs are however always based on the same state of previous theory and proof knowledge. This could be further relaxed in future experiments.
9We have limited the ATP experiment to Mizar theorems, so even with perfect ATP success rate we would still miss for example all ATP dependencies of Mizar definitions, that often require proofs of existence, uniqueness, etc.


The second metric L (Length) adds weighting by (recursive) proof complexity. For the Mizar proofs, L is computed using the assumption that the Mizar weak refutational checker enforces a relatively uniform degree of derivational complexity on all Mizar proof steps, which roughly correspond to proof lines in Mizar formalizations. For the ATP version, we make a similar assumption that the complexity of ATP proof steps is roughly uniform.10 For the comparison with human proofs, we define a conversion ratio c between the number of ATP inference lines and the corresponding number of Mizar proof lines. This is pragmatically estimated as the average of such ratios for all the proofs where the ATP used the same premises as the Mizar proof. The actual value computed (based on 1223 proofs where $P_{\mathrm{ATP}}(a) = P_{\mathrm{MML}}(a)$) is c = 81.99. Formally:

Definition 20. For a Mizar-proved item a, let $L_{\mathrm{MML}}(a)$ be the number of Mizar lines of code used to prove a (direct Mizar proof length). For each ATP-proved item a, let $L'_{\mathrm{ATP}}(a)$ be the number of steps in the ATP proof. Let $E_{\mathrm{MML=ATP}} := \{a : P_{\mathrm{ATP}}(a) = P_{\mathrm{MML}}(a)\}$ (items whose ATP and Mizar proofs use the same premises). The length conversion ratio c is defined as

\[
c := \frac{1}{|E_{\mathrm{MML=ATP}}|} \sum_{a \in E_{\mathrm{MML=ATP}}} \frac{L'_{\mathrm{ATP}}(a)}{L_{\mathrm{MML}}(a)}
\]

Finally, we define the normalized ATP proof length as $L_{\mathrm{ATP}}(a) := L'_{\mathrm{ATP}}(a)/c$. For a set of items S, let again $L_{\mathrm{MML}}(S) := \sum_{a \in S} L_{\mathrm{MML}}(a)$.

Definition 21. For a Mizar theorem a we set $L^{*}_{\mathrm{MML}}(a) := L_{\mathrm{MML}}(a) + L_{\mathrm{MML}}(P^{*}_{\mathrm{MML}}(a))$. If we have an ATP proof, we define $L^{*}_{\mathrm{ATP}}(a) := L_{\mathrm{ATP}}(a) + L_{\mathrm{MML}}(P^{*}_{\mathrm{MML}}(P_{\mathrm{ATP}}(a)))$.

The reason for using $L_{\mathrm{MML}}$ and $P^{*}_{\mathrm{MML}}$ in the recursive part of $L^{*}_{\mathrm{ATP}}$ is again the fact that we only have the complete line count information for the Mizar proofs. Note that in $L^{*}$ we always count any lemma on the transitive proof path exactly once. We believe that this approach captures the mathematician’s intuition of proof complexity as the set of “the proofs that need to be understood” rather than as their multiset.
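
The conversion ratio c and the normalized ATP length are then straightforward to compute from per-item dictionaries (a sketch with hypothetical argument names):

def conversion_ratio(L_mml, L_atp_raw, P_mml, P_atp):
    # Only items whose ATP and Mizar proofs use exactly the same premises count.
    same = [a for a in P_atp if P_atp[a] == P_mml.get(a)]
    return sum(L_atp_raw[a] / L_mml[a] for a in same) / len(same)

def normalized_atp_length(a, L_atp_raw, c):
    return L_atp_raw[a] / c                 # L_ATP(a) = L'_ATP(a) / c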

5.4 Evaluation

The metrics developed above were used to compare the Mizar and ATP proofs. The detailed evaluation data corresponding to this section are available online.11 First we analyze the data based on the relation between $P_{\mathrm{MML}}$ and $P_{\mathrm{ATP}}$. For each Mizar theorem φ that can be proved by an ATP, we have either $P_{\mathrm{MML}}(\varphi) = P_{\mathrm{ATP}}(\varphi)$, $P_{\mathrm{MML}}(\varphi) \subset P_{\mathrm{ATP}}(\varphi)$, $P_{\mathrm{ATP}}(\varphi) \subset P_{\mathrm{MML}}(\varphi)$, or neither set is included in the other. Let us say that two sets A and B are orthogonal if neither A ⊆ B, nor B ⊆ A. The statistics are given in Table 5.1. More than 10% (1223) of the proofs have the same dependencies. For 386 proofs, the MML proof depends on fewer premises than the ATP proof. While the orthogonal

10The precision of such metrics could be further improved, for example by expanding Vampire proofs into the (really uniform) detailed proof objects developed for Otter/Prover9/IVY [63].
11http://mizar.cs.ualberta.ca/~mptp/proofcomp/metrics_evaluation.xls


Table 5.1: Dependency statistics

                      P_ATP = P_MML   P_ATP ⊂ P_MML   P_MML ⊂ P_ATP   Orthogonal
Cases                 1223            1980            386             5552
Min D_ATP             0               0               1               1
Min D_MML             0               1               1               1
Max D_ATP             7               12              89              63
Max D_MML             7               59              10              58
Max |D_MML − D_ATP|   0               58              83              60
Avg D_ATP             2.18            2.20            6.24            5.22
Avg D_MML             2.18            5.58            2.41            6.33
Avg |D_MML − D_ATP|   0               3.40            3.88            3.86

category is largest with 5552 proofs as was expected, it is surprising to see that 1980 ATP proofs (21.66%) depend on fewer premises. We found several possible explanations:

• The ATP is naturally oriented towards finding short proofs. Getting involved proofs with many premises is hard, and it may well be the main reason of ATP failure outside the 9141 proved theorems.

• In many cases, a human formalizer can overlook the fact that the same or very similar theorem is already in the library.12 An example is the theorem LOPBAN_3:24¹³ which required a 20-line proof in Mizar, but the ATP found an earlier more general theorem BHSP_4:3 that (using additional typing information) provides an almost immediate proof.

• ATPs work in untyped first-order logic, and they are not constrained by Mizar’s (and other ITPs’) requirement that all types should be inhabited. For example, Mizar proof checking of GOEDELCP:1¹⁴ fails if two type non-emptiness declarations are removed, because the formula is no longer well-typed. The ATP proof however does not need any of them.

An interesting case is when the ATP finds a way to re-use previous lemmas. Sometimes enough knowledge about advanced concepts is already developed that can be used for their quite simple (“algebraic”) manipulation, abstracting from their definitions. An example is COMSEQ_3:40¹⁵, proving the relation between the limit of a complex sequence and its real and imaginary parts. The human proof expands the definitions (finding a suitable n for a given ε). The ATP just notices that this kind of groundwork was already done in a “similar” case COMSEQ_3:39¹⁶, and notices the “similarity” (algebraic simplification)

12From this point of view, this analysis is conducted at the right time, because the ATP service is starting to be used by authors, and such simple repetitions will be prevented by it.
13http://mizar.cs.ualberta.ca/~mptp/cgi-bin/browserefs.cgi?refs=t24_lopban_3. The theorem says that the partial-sums operator on normed space sequences commutes with multiplication by a scalar.
14http://mizar.cs.ualberta.ca/~mptp/cgi-bin/browserefs.cgi?refs=t1_goedelcp
15http://mizar.cs.ualberta.ca/~mptp/cgi-bin/browserefs.cgi?refs=t40_comseq_3
16http://mizar.cs.ualberta.ca/~mptp/cgi-bin/browserefs.cgi?refs=t39_comseq_3


provided by COMPLEX1:28¹⁷. Such manipulations can be used (if noticed!) to avoid the “hard thinking” about the epsilons in the definitions.

5.4.1 Comparing weights

For a Mizar theorem φ, a large difference between $L^{*}_{\mathrm{MML}}(\varphi)$ and $L^{*}_{\mathrm{ATP}}(\varphi)$ is an indicator that the ATP proof of φ is different from the human Mizar proof. Table 5.2 shows that with the exception of the $P_{\mathrm{MML}} = P_{\mathrm{ATP}}$ case, which we used to define c, the ATP proofs have on average higher recursive complexity $L^{*}$ than the corresponding human proofs. Again, we have found several explanations:

Table 5.2: Recursive line count/proof step statistics

                        P_MML = P_ATP   P_ATP ⊂ P_MML   P_MML ⊂ P_ATP   Orthogonal
Max L*_ATP              140176          40653           162308          139935
Max L*_MML              140438          32652           162532          140172
Max |L*_MML − L*_ATP|   6210            40626           35536           75114
Min L*_ATP              7               26              9               13
Min L*_MML              1               1               3               3
Avg L*_ATP              7390.77         7373.31         14155.3         9893.04
Avg L*_MML              7385.06         6167.73         14768.3         9828.81
Avg |L*_MML − L*_ATP|   0               1220.52         632.329         910.744

• Some cases are due to failures in the minimization of the ATP proofs. For example, the ATP proof of FUNCT_7:20^18 reports 40 premises and 178715 ATP (non-normalized) proof steps, largely coming from the recent addition of BDDs to Vampire.

• Most of the cases again seem to be due to the ATPs' tendency to get a short proof via advanced lemmas, rather than getting into longer proofs by expanding the definitions. The lemmas typically recursively use the basic definitions anyway, and their line complexity is then a net contribution to the ATP proof's recursive complexity.

5.5 Conclusion

While ATPs are becoming clearly useful in general large-theory formal mathematics, our proof analysis has not found any highly surprising ATP proofs. Clearly, the general large-theory mathematical setting is still quite far from producing automated proofs of the order of complexity that some specialized algebraic theories enjoy. On the other hand, the ATPs have found a surprising number of proofs that are shorter than the mathematicians' versions. Unlike humans, the combined AI/ATP stack learns new lemmas and new proofs immediately, and this results in their more extensive use and a higher value of L*. An

^17 http://mizar.cs.ualberta.ca/~mptp/cgi-bin/browserefs.cgi?refs=t28_complex1
^18 http://mizar.cs.ualberta.ca/~mptp/cgi-bin/browserefs.cgi?refs=t20_funct_7


ATP working in unsorted FOL can sometimes find proofs that, in some sense, get to the "mathematical heart" of a theorem without first going through the syntactic hoops of ensuring that terms have suitable sorts. The tools developed for our experiments yield information that is useful for maintainers of large formal libraries. We found cases where an ATP was able to find a significantly shorter proof—sometimes employing only one premise—compared to a human proof. At times, such highly efficient ATP proofs were due to duplication in the library or to a failure to use a generalization to prove a special case. Finally, this work could provide a practical "test bed" for theoretical criteria of proof identity [27].


Chapter 6

MaSh - Machine Learning for Sledgehammer

Sledgehammer integrates automated theorem provers into the proof assistant Isabelle. A key component, the relevance filter, heuristically ranks the thousands of facts available and selects a subset, based on syntactic similarity to the current goal. We introduce MaSh, an alternative that learns from successful proofs. New challenges arose from our "zero-click" vision: MaSh should integrate seamlessly with the users' workflow, so that they benefit from machine learning without having to install software, set up servers, or guide the learning. The underlying machinery draws on recent research in the context of Mizar and HOL Light, with a number of enhancements. MaSh outperforms the old relevance filter on large formalizations, and a particularly strong filter is obtained by combining the two filters.

6.1 Introduction

Sledgehammer [73] is a subsystem of the proof assistant Isabelle/HOL [68] that discharges interactive goals by harnessing external automated theorem provers (ATPs). It heuristically selects a number of relevant facts^1 (axioms, definitions, or lemmas) from the thousands available in background libraries and the user's formalization, translates the problem to the external provers' logics, and reconstructs any machine-found proof in Isabelle (Section 6.2). The tool is popular with both novices and experts.

Various aspects of Sledgehammer have been improved since its introduction, notably the addition of SMT solvers [16], the use of sound translation schemes [14], close cooperation

This chapter is based on: [56] "MaSh: Machine Learning for Sledgehammer", published in the Proceedings of the 4th International Conference on Interactive Theorem Proving.
^1 To keep with standard Isabelle terminology, the notation differs from the previous chapters. We use lemma instead of statement, fact instead of premise, and goal instead of conjecture.

with the first-order superposition prover SPASS [17], and of course advances in the underlying provers themselves. Together, these enhancements increased the success rate from 48% to 64% on the representative "Judgment Day" benchmark suite [20, 17].

One key component that has received little attention is the relevance filter. Meng and Paulson [64] designed a filter, MePo, that iteratively ranks and selects facts similar to the current goal, based on the symbols they contain. Despite its simplicity, and despite advances in prover technology [41, 17, 86], this filter greatly increases the success rate: Most provers cannot cope with tens of thousands of formulas, and translating so many formulas would also put a heavy burden on Sledgehammer. Moreover, the translation of Isabelle's higher-order constructs and types is optimized globally for a problem—smaller problems make more optimizations possible, which helps the automated provers.

Coinciding with the development of Sledgehammer and MePo, a line of research has focused on applying machine learning to large-theory reasoning. Much of this work has been done on the vast Mizar Mathematical Library (MML) [1], either in its original Mizar [61] formulation or in first-order form as the Mizar Problems for Theorem Proving (MPTP) [104]. The MaLARea system [105, 110] and the competitions CASC LTB and Mizar@Turing [95] have been important milestones. Recently, comparative studies involving MPTP [57, 2] and the Flyspeck project in HOL Light [45] have found that fact selectors based on machine learning outperform purely symbol-based approaches. Several learning-based advisors have been implemented and have made an impact on the automated reasoning community.

In this chapter, we describe a tool that aims to bring the fruits of this research to the Isabelle community. This tool, MaSh, offers an alternative to MePo by learning from successful proofs, whether human-written or machine-generated.

Sledgehammer is a one-click technology—fact selection, translation, and reconstruction are fully automatic. For MaSh, we had four main design goals:

• Zero-configuration: The tool should require no installation or configuration steps, even for use with unofficial repository versions of Isabelle.

• Zero-click: Existing users of Sledgehammer should benefit from machine learning, both for standard theories and for their custom developments, without having to change their workflow.

• Zero-maintenance: The tool should not add to the maintenance burden of Isabelle. In particular, it should not require maintaining a server or a database.

• Zero-overhead: Machine learning should incur no overhead to those Isabelle users who do not employ Sledgehammer.

By pursuing these “four zeros,” we hope to reach as many users as possible and keep them for as long as possible. These goals have produced many new challenges. MaSh’s heart is a Python program that implements a custom version of a weighted sparse naive Bayes algorithm that is faster than the naive Bayes algorithm implemented in the SNoW [21] system used in previous studies (Section 6.3). The program maintains

a persistent state and supports incremental, nonmonotonic updates. Although distributed with Isabelle, it is fully independent and could be used by other proof assistants, automated theorem provers, or applications with similar requirements.

This Python program is used within a Standard ML module that integrates machine learning with Isabelle (Section 6.4). When Sledgehammer is invoked, it exports new facts and their proofs to the machine learner and queries it to obtain relevant facts. The main technical difficulty is to perform the learning in a fast and robust way without interfering with other activities of the proof assistant. Power users can enhance the learning by letting external provers run for hours on libraries, searching for simpler proofs. A particularly strong filter, MeSh, is obtained by combining MePo and MaSh.

The three filters are compared on large formalizations covering the traditional application areas of Isabelle: cryptography, programming languages, and mathematics (Section 6.5). These empirical results are complemented by Judgment Day, a benchmark suite that has tracked Sledgehammer's development since 2010. Performance varies greatly depending on the application area and on how much has been learned, but even with little learning MeSh emerges as a strong leader.

6.2 Sledgehammer and MePo

Whenever Sledgehammer is invoked on a goal, the MePo (Meng–Paulson) filter selects n facts φ1,...,φn from the thousands available, ordering them by decreasing estimated relevance. The filter keeps track of a set of relevant symbols—i.e., (higher-order) constants and fixed variables—initially consisting of all the goal's symbols. It performs the following steps iteratively, until n facts have been selected:

1. Compute each fact’s score, as roughly given by r/(r + i), where r is the number of relevant symbols and i the number of irrelevant symbols occurring in the fact.

2. Select all facts with perfect scores as well as some of the remaining top-scoring facts, and add all their symbols to the set of relevant symbols.
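The two steps above can be sketched as follows. This is only a minimal illustration of the iterative loop, not Sledgehammer's actual ML implementation, and it omits the refinements discussed next; the fact table mapping names to symbol sets is a hypothetical input.

def mepo_select(goal_symbols, facts, n):
    # facts: dict mapping fact names to the set of symbols occurring in them
    relevant = set(goal_symbols)
    selected = []
    remaining = dict(facts)
    while remaining and len(selected) < n:
        # Step 1: score each remaining fact by r / (r + i)
        scores = {name: (len(syms & relevant) / len(syms) if syms else 0.0)
                  for name, syms in remaining.items()}
        # Step 2: take the facts with perfect scores, or else the single best fact,
        # and add their symbols to the set of relevant symbols
        ranked = sorted(scores, key=scores.get, reverse=True)
        accepted = [f for f in ranked if scores[f] == 1.0] or ranked[:1]
        for name in accepted:
            if len(selected) == n:
                break
            selected.append(name)
            relevant |= remaining.pop(name)
    return selected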

The implementation refines this approach in several ways. Chained facts (inserted into the goal by means of the keywords using, from, then, hence, and thus) take absolute priority; local facts are preferred to global ones; first-order facts are preferred to higher-order ones; rare symbols are weighted more heavily than common ones; and so on.

MePo tends to perform best on goals that contain some rare symbols; if all the symbols are common, it discriminates poorly among the hundreds of facts that could be relevant. There is also the issue of starvation: The filter, with its iterative expansion of the set of relevant symbols, effectively performs a best-first search in a tree and may therefore ignore some relevant facts close to the tree's root.

The automated provers are given prefixes φ1,...,φm of the selected n facts. The order of the facts—the estimated relevance—is exploited by some provers to guide the search. Although Sledgehammer's default time limit is 30 s, the automated provers are invoked repeatedly for shorter time periods, with different options and different numbers of facts

m ≤ n; for example, SPASS is given as few as 50 facts in some slices and as many as 1000 in others. Excluding some facts restricts the search space, helping the prover find deeper proofs within the allotted time, but it also makes fewer proofs possible. The supported ATP systems include the first-order provers E [85], SPASS [17], and Vampire [77]; the SMT solvers CVC3 [8], Yices [28], and Z3 [66]; and the higher-order provers LEO-II [9] and Satallax [19].

Once a proof is found, Sledgehammer minimizes it by invoking the prover repeatedly with subsets of the facts it refers to. The proof is then reconstructed in Isabelle by a suitable proof text, typically a call to the built-in resolution prover Metis [43].

Example 2. Given the goal

map f xs = ys =⇒ zip (rev xs) (rev ys) = rev (zip xs ys)

MePo selects 1000 facts: rev_map, rev_rev_ident,..., add_numeral_special(3). The prover E, among others, quickly finds a minimal proof involving the 4th and 16th facts:

zip_rev: length xs = length ys =⇒ zip (rev xs) (rev ys) = rev (zip xs ys)
length_map: length (map f xs) = length xs

Example 3. MePo’s tendency to starve out useful facts is illustrated by the following goal, taken from Paulson’s verification of cryptographic protocols [72]:

used [] ⊆ used evs

A straightforward proof relies on these four lemmas:

used_Nil: used [] = (⋃B. parts (initState B))
initState_into_used: X ∈ parts (initState B) =⇒ X ∈ used evs
subsetI: (⋀x. x ∈ A =⇒ x ∈ B) =⇒ A ⊆ B
UN_iff: b ∈ (⋃x∈A. B x) ←→ (∃x∈A. b ∈ B x)

The first two lemmas are ranked 6807th and 6808th, due to the many initially irrelevant constants (⋃, parts, initState, and ∈). In contrast, all four lemmas appear among MaSh's first 45 facts and MeSh's first 77 facts.

6.3 The Machine Learning Engine

MaSh (Machine Learning for Sledgehammer) is a Python program for fact selection with machine learning.^2 Its default learning algorithm is an approximation of naive Bayes adapted to fact selection. MaSh can perform fast model updates, overwrite data points, and predict the relevance of each fact. The program can also use the slower naive Bayes algorithm implemented by SNoW [21].

^2 The source code is distributed with Isabelle2013 in the directory src/HOL/Tools/Sledgehammer/MaSh/src.


6.3.1 Basic Concepts

MaSh manipulates theorem proving concepts such as facts and proofs in an agnostic way, as "abstract nonsense":

• A fact φ is a string.

• A feature f is also a string. A positive weight w is attached to each feature.

• Visibility is a partial order ≺ on facts. A fact φ is visible from a fact φ′ if φ ≺ φ′, and visible through the set of facts Φ if there exists a fact φ′ ∈ Φ such that φ ⪯ φ′.

• The parents of a fact are its (immediate) predecessors with respect to ≺.

• A proof Π for φ is a set of facts visible from φ.

Facts are described abstractly by their feature sets. The features may for example be the symbols occurring in a fact's statement. Machine learning proceeds from the hypothesis that facts with similar features are likely to have similar proofs.

6.3.2 Input and Output

MaSh starts by loading the persistent model (if any), executes a list of commands, and saves the resulting model on disk. The commands and their arguments are

learn fact parents features proof
relearn fact proof
query parents features hints

The learn command teaches MaSh a new fact φ and its proof Π. The parents specify how to extend the visibility relation for φ, and the features describe φ. In addition to the supplied proof Π ⊢ φ, MaSh learns the trivial proof φ ⊢ φ; hence something is learned even if Π = ∅ (which can indicate that no suitable proof is available). The relearn command forgets a fact's proof and learns a new one. The query command ranks all facts visible through the given parents by their predicted relevance with respect to the specified features. The optional hints are facts that guide the search. MaSh temporarily updates the model with the hints as a proof for the current goal before executing the query.

The commands have various preconditions. For example, for learn, φ must be fresh, the parents must exist, and all facts in Π must be visible through the parents.

6.3.3 The Learning Algorithm

MaSh's default machine learning algorithm is a weighted version of sparse naive Bayes. It ranks each visible fact φ as follows. Consider a query command with the features f1,..., fn weighted w1,...,wn, respectively. Let P denote the number of proofs in which φ occurs, and pj ≤ P the number of such proofs associated with facts described by fj


(among other features). Let π and σ be predefined weights for known and unknown features, respectively. The estimated relevance is given by

    r(φ, f1,..., fn) = ln P + Σ_{j : pj ≠ 0} wj (ln(π pj) − ln P) + Σ_{j : pj = 0} wj σ

When a fact is learned, the values for P and pj are initialized to a predefined weight τ. The models depend only on the values of P, pj, π, σ, and τ, which are stored in dictionaries for fast access. Computing the relevance is faster than with standard naive Bayes because only the features that describe the current goal need to be considered, as opposed to all features (of which there may be tens of thousands). Experiments have found the values π = 10, σ = −15, and τ = 20 suitable. A crucial technical issue is to represent the visibility relation efficiently as part of the persistent state. Storing all the ancestors for each fact results in huge files that must be loaded and saved, and storing only the parents results in repeated traversals of long parentage chains to obtain all visible facts. MaSh solves this dilemma by complementing parentage with a cache that stores the ancestry of up to 100 recently looked-up facts. The cache not only speeds up the lookup for the cached facts but also helps shortcut the parentage chain traversal for their descendants.
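The ranking step can be sketched in Python along the following lines. This is a minimal reading of the formula above, not the code shipped with Isabelle; the dictionaries proof_count and feature_count stand in for MaSh's internal tables holding P and the pj values.

from math import log

PI, SIGMA = 10.0, -15.0   # weights for known and unknown features

def relevance(phi, query_features, proof_count, feature_count):
    # proof_count[phi] plays the role of P, feature_count[phi][f] the role of p_j;
    # both are assumed to be initialized to tau = 20 when a fact is learned.
    P = proof_count[phi]
    score = log(P)
    for f, w in query_features:          # only the goal's features are visited
        p_j = feature_count[phi].get(f, 0)
        if p_j != 0:
            score += w * (log(PI * p_j) - log(P))
        else:
            score += w * SIGMA           # penalty for a feature never seen with phi
    return score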

6.4 Integration in Sledgehammer

Sledgehammer’s MaSh-based relevance filter is implemented in Standard ML, like most of Isabelle.3 It relies on MaSh to provide suggestions for relevant facts whenever the user invokes Sledgehammer on an interactive goal.

6.4.1 The Low-Level Learner Interface

Communication with MaSh is encapsulated by four ML functions. The first function resets the persistent state; the last three invoke MaSh with a list of commands:

MaSh.unlearn ()
MaSh.learn [(fact1, parents1, features1, proof1),..., (factn, parentsn, featuresn, proofn)]
MaSh.relearn [(fact1, proof1),..., (factn, proofn)]
suggestions = MaSh.query parents features hints

To track what has been learned and avoid violating MaSh's preconditions, Sledgehammer maintains its own persistent state, mirrored in memory. This mainly consists of the visibility graph, a directed acyclic graph whose vertices are the facts known to MaSh and whose edges connect the facts to their parents. (MaSh itself maintains a visibility graph based on learn commands.) The state is accessed via three ML functions that use a

^3 The code is located in Isabelle2013's files src/HOL/Tools/Sledgehammer/sledgehammer_mash.ML, src/HOL/TPTP/mash_export.ML, and src/HOL/TPTP/mash_eval.ML.

lock to guard against race conditions in a multithreaded environment [116] and keep the transient and persistent states synchronized.

6.4.2 Learning from and for Isabelle

Facts, features, proofs, and visibility were introduced in Section 6.3.1 as empty shells. The integration with Isabelle fills these concepts with content.

Facts. Communication with MaSh requires a string representation of Isabelle facts. Each theorem in Isabelle carries a stable "name hint" that is identical or very similar to its fully qualified user-visible name (e.g., List.map.simps_2 vs. List.map.simps(2)). Top-level lemmas have unambiguous names. Local facts in a structured Isar proof [115] are disambiguated by appending the fact's statement to its name.

Features. Machine learning operates not on the formulas directly but on sets of features. The simplest scheme is to encode each symbol occurring in a formula as its own feature. The experience with MePo is that other factors help—for example, the formula’s types and type classes or the theory it belongs to. The MML and Flyspeck evaluations revealed that it is also helpful to preserve parts of the formula’s structure, such as subterms [3, 45]. Inspired by these precursors, we devised the following scheme. For each term in the formula, excluding the outer quantifiers, connectives, and equality, the features are derived from the nontrivial first-order patterns up to a given depth. Variables are replaced by the wildcard _ (underscore). Given a maximum depth of 2, the term g (h x a), where constants g, h, a originate from theories T, U, V, yields the patterns

T.g(_)   T.g(U.h(_,_))   U.h(_,_)   U.h(_,V.a)   V.a

which are simplified and encoded into the features

T.g T.g(U.h) U.h U.h(V.a) V.a

Types, excluding those of propositions, Booleans, and functions, are encoded using an analogous scheme. Type variables constrained by type classes give rise to features corresponding to the specified type classes and their superclasses. Finally, various pieces of metainformation are encoded as features: the theory to which the fact belongs; the kind of rule (e.g., introduction, simplification); whether the fact is local; whether the formula contains any existential quantifiers or λ-abstractions. Guided by experiments similar to those of Section 6.5, we attributed the following weights to the feature classes:

Fixed variable   20        Type           2        Presence of ∃   2
Constant         16        Theory         2        Presence of λ   2
Localness         8        Kind of rule   2        Type class      1

Example 4. The lemma transpose (map (map f) xs) = map (map f) (transpose xs) from the List theory has the following features and weights (indicated by subscripts):

65 CHAPTER 6. MASH - MACHINE LEARNING FOR SLEDGEHAMMER

List₂   List.transpose₁₆   List.list₂   List.transpose(List.map)₁₆
List.map₁₆   List.map(List.transpose)₁₆   List.map(List.map)₁₆   List.map(List.map, List.transpose)₁₆
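A rough sketch of this pattern-based feature extraction is shown below. It operates on a toy tuple representation of terms and treats lowercase leaves as variables; both conventions are assumptions for illustration only, since the real code works on Isabelle terms inside Standard ML.

def term_features(term, depth=2):
    # term is a nested tuple such as ('T.g', ('U.h', 'x', 'V.a'));
    # lowercase leaves stand for variables (the wildcard _ in the patterns above)
    feats = set()

    def is_variable(t):
        return isinstance(t, str) and t[:1].islower()

    def walk(t, d):
        if isinstance(t, str):
            if not is_variable(t):
                feats.add(t)             # a constant by itself is a feature
            return
        head, args = t[0], t[1:]
        feats.add(head)
        arg_heads = [a if isinstance(a, str) else a[0]
                     for a in args if not is_variable(a)]
        if arg_heads:                    # simplified pattern: head applied to arg heads
            feats.add(f"{head}({', '.join(arg_heads)})")
        if d > 0:
            for a in args:
                walk(a, d - 1)

    walk(term, depth)
    return feats

# term_features(('T.g', ('U.h', 'x', 'V.a'))) yields
# {'T.g', 'T.g(U.h)', 'U.h', 'U.h(V.a)', 'V.a'}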

Proofs. MaSh predicts which facts are useful for proving the goal at hand by studying successful proofs. There is an obvious source of successful proofs: All the facts in the loaded theories are accompanied by proof terms that store the dependencies [10]. However, not all available facts are equally suitable for learning. Many of them are derived automatically by definitional packages (e.g., for inductive predicates, datatypes, recursive functions) and proved using custom tactics, and there is not much to learn from those highly technical lemmas. The most interesting lemmas are those stated and proved by humans. Slightly abusing terminology, we call these "Isar proofs." Even for user lemmas, the proof terms are overwhelmed by basic facts about the logic, which are tautologies in their translated form. Fortunately, these tautologies are easy to detect, since they contain only logical symbols (equality, connectives, and quantifiers). The proofs are also polluted by decision procedures; an extreme example is the procedure, which routinely pulls in over 200 dependencies. Proofs involving over 20 facts are considered unsuitable and simply ignored.

Human-written Isar proofs are abundant, but they are not necessarily the best raw material to learn from. They tend to involve more, and different, facts than Sledgehammer proofs. Sometimes they rely on induction, which is difficult for automated provers; but even excluding induction, there is evidence that the provers work better if the learned proofs were produced by similar provers [57, 55]. A special mode of Sledgehammer runs an automated prover on all available facts to learn from ATP-generated proofs. Users can let it run for hours at a time on their favorite theories. The Isar proof facts are passed to the provers together with a few dozen MePo-selected facts. Whenever a prover succeeds, MaSh discards the Isar proof and learns the new minimized proof (using MaSh.relearn). Facts with large Isar proofs are processed first since they stand to gain the most from shorter proofs.
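The proof-filtering heuristic just described can be summarized in a few lines. The helper constants_of and the concrete list of logical constants are assumptions for illustration; the actual filter lives in Sledgehammer's ML sources.

LOGICAL_CONSTS = {"HOL.eq", "HOL.conj", "HOL.disj", "HOL.implies", "HOL.Not",
                  "HOL.All", "HOL.Ex", "HOL.True", "HOL.False"}

def learnable_dependencies(dependencies, constants_of, max_facts=20):
    # Drop tautological dependencies (only logical symbols) and reject the whole
    # proof if it still uses too many facts; such proofs are not learned from.
    def is_tautology(fact):
        consts = constants_of(fact)
        return bool(consts) and all(c in LOGICAL_CONSTS for c in consts)
    filtered = [f for f in dependencies if not is_tautology(f)]
    return filtered if len(filtered) <= max_facts else None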

Visibility. The loaded background theories and the user's formalization, including local lemmas, appear to Sledgehammer as a vast collection of facts. Each fact is tagged with its own abstract theory value, of type theory in ML, that captures the state of affairs when it was introduced. Sledgehammer constructs the visibility graph by using the (very fast) subsumption order ⊑ on theory. A complication arises because ⊑ lifted to facts is a preorder, whereas the graph must encode a partial order ≺. Antisymmetry is violated when facts are registered together. Despite the simultaneity, one fact's proof may depend on another's; for example, an inductive predicate's definition p_def is used to derive introduction and elimination rules pI and pE, and yet they may share the same theory. Hence, some care is needed when constructing ≺ from ⊑ to ensure that p_def ≺ pI and p_def ≺ pE.


When performing a query, Sledgehammer needs to compute the current goal’s parents. This involves finding the maximal vertices of the visibility graph restricted to the facts available in the current Isabelle proof context. The computation is efficient for graphs with a quasi-linear structure, such as those that arise from Isabelle theories: Typically, only the first fact of a theory will have more than one parent. A similar computation is necessary when teaching MaSh new facts.
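One way to read the maximal-vertex computation is sketched below, assuming the graph is kept as a mapping from each fact to its parents; this is only a conceptual illustration of the idea, not the Standard ML code in Sledgehammer.

def goal_parents(available, parent_map):
    # The goal's parents are the available facts that are not (transitive) parents
    # of any other available fact, i.e. the maximal vertices of the restricted graph.
    available = set(available)
    covered, seen = set(), set()
    for fact in available:
        stack = list(parent_map.get(fact, []))
        while stack:
            p = stack.pop()
            if p in seen:
                continue
            seen.add(p)
            if p in available:
                covered.add(p)           # p lies below another available fact
            stack.extend(parent_map.get(p, []))
    return available - covered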

6.4.3 Relevance Filters: MaSh and MeSh

Sledgehammer’s MaSh-based relevance filter computes the current goal’s parents and features; then it queries the learner program (using MaSh.query), passing the chained facts as hints. This process usually takes about one second on modern hardware, which is reasonable for a tool that may run for half a minute. The result is a list with as many suggestions as desired, ordered by decreasing estimated relevance. Relying purely on MaSh for relevance filtering raises an issue: MaSh may not have learned all the available facts. In particular, it will be oblivious to the very latest facts, introduced after Sledgehammer was invoked for the last time, and these are likely to be crucial for the proof. The solution is to enrich the raw MaSh data with a proximity filter, which sorts the available facts by decreasing proximity in the proof text. Instead of a plain linear combination of ranks, the enriched MaSh filter transforms ranks into probabilities and takes their weighted average, with weight 0.8 for MaSh and 0.2 for proximity. The probabilities are rough approximations based on experiments. Fig. 6.1 shows the mathematical curves; for example, the first suggestion given by MaSh is considered about 15 times more likely to appear in a successful proof than the 50th. Probability Probability 0 0 1 34 67 100 1 34 67 100 (a) MaSh (b) Proximity

Figure 6.1: Estimated probability of the jth fact's appearance in a proof, for (a) MaSh and (b) proximity

This notion of combining filters to define new filters is taken one step further by MeSh, a combination of MePo and MaSh. Both filters are weighted 0.5, and both use the probability curve of Fig. 6.1(a). Ideally, the curves and parameters that control the combination of filters would be learned mechanically rather than hard-coded. However, this would complicate and possibly slow down the infrastructure.
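The rank-combination scheme can be sketched as follows. The decay curve used here is only a stand-in for the hard-coded curve of Fig. 6.1(a), and equal weights are assumed as described for MeSh.

def combine_filters(ranking_a, ranking_b, curve=lambda r: 1.0 / (r + 1), weights=(0.5, 0.5)):
    # Map each filter's rank to an estimated probability via a decreasing curve,
    # then order facts by the weighted average of the two probabilities.
    facts = set(ranking_a) | set(ranking_b)
    def rank(ranking, fact):
        return ranking.index(fact) if fact in ranking else len(ranking)
    score = {f: weights[0] * curve(rank(ranking_a, f)) + weights[1] * curve(rank(ranking_b, f))
             for f in facts}
    return sorted(facts, key=score.get, reverse=True)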


6.4.4 Automatic and Manual Control

All MaSh-related activities take place as a result of a Sledgehammer invocation. When Sledgehammer is launched, it checks whether any new facts, unknown to the visibility graph, are available. If so, it launches a thread to learn from their Isar proofs and update the graph. The first time, it may take about 10 s to learn all the facts in the background theories (assuming about 10 000 facts). Subsequent invocations are much faster.

If an automated prover succeeds, the proof is immediately taught to MaSh (using MaSh.learn). The discharged (sub)goal may have been only one step in an unstructured proof, in which case it has no name. Sledgehammer invents a fresh, invisible name for it. Although this anonymous goal cannot be used to prove other goals, MaSh benefits from learning the connection between the formula's features and its proof.

For users who feel the need for more control, there is an unlearn command that resets MaSh's persistent state (using MaSh.unlearn); a learn_isar command that learns from the Isar proofs of all available facts; and a learn_prover command that invokes an automated prover on all available facts, replacing the Isar proofs with successful ATP-generated proofs whenever possible.

6.4.5 Nonmonotonic Theory Changes

MaSh's model assumes that the set of facts and the visibility graph grow monotonically. One concern that arises when deploying machine learning—as opposed to evaluating its performance on static benchmarks—is that theories evolve nonmonotonically over time. It is left to the architecture around MaSh to recover from such changes. The following scenarios were considered:

• A fact is deleted. The fact is kept in MaSh’s visibility graph but is silently ignored by Sledgehammer whenever it is suggested by MaSh.

• A fact is renamed. Sledgehammer perceives this as the deletion of a fact and the addition of another (identical) fact.

• A theory is renamed. Since theory names are encoded in fact names, renaming a theory amounts to renaming all its facts.

• Two facts are reordered. The visibility graph loses synchronization with reality. Sledgehammer may need to ignore a suggestion because it only appears to be visible according to the graph.

• A fact is introduced between two facts φ and φ′. MaSh offers no facility to change the parent of φ′, but this is not needed. By making the new fact a child of φ, it is considered during the computation of maximal vertices and hence visible.

• The fact’s formula is modified. This occurs when users change the statement of a lemma, but also when they rename or relocate a symbol. MaSh is not informed of such changes and may lose some of its predictive power.


More elaborate schemes for tracking dependencies are possible. However, the benefits are unclear: Presumably, the learning performed on older theories is valuable and should be preserved, despite its inconsistencies. This is analogous to teams of humans developing a large formalization: Teammates should not forget everything they know each time a colleague changes the capitalization of some basic theory name. And should users notice a performance degradation after a major refactoring, they can always invoke unlearn to restart from scratch.

6.5 Evaluations

This section attempts to answer the main questions that existing Sledgehammer users are likely to have: How do MaSh and MeSh compare with MePo? Is machine learning really helping? The answer takes the form of two separate evaluations.^4

6.5.1 Evaluation on Large Formalizations

The first evaluation measures the filters' ability to re-prove the lemmas from three formalizations included in the Isabelle distribution and the Archive of Formal Proofs [50]:

Auth          Cryptographic protocols [72]            743 lemmas
Jinja         Java-like language [49]                 733 lemmas
Probability   Measure and probability theory [42]    1311 lemmas

These formalizations are large enough to exercise learning and provide meaningful numbers, while not being so massive as to make experiments impractical. They are also representative of large classes of mathematical and computer science applications.

The evaluation is twofold. The first part computes how accurately the filters can predict the known Isar or ATP proofs on which MaSh's learning is based. The second part connects the filters to automated provers and measures actual success rates. The first part may seem artificial: After all, real users are interested in any proof that discharges the goal at hand, not a specific known proof. The predictive approach's greatest virtue is that it does not require invoking external provers; evaluating the impact of parameters is a matter of seconds instead of hours. MePo itself has been fine-tuned using similar techniques. For MaSh, the approach also helps ascertain whether it is learning the learning materials well, without noise from the provers.

Two (slightly generalized) standard metrics, 100%Recall and AUC, are useful in this context. For a given goal, a fact filter (MePo, MaSh, or MeSh) ranks the available facts and selects the n best-ranked facts Φ = {φ1,...,φn}, with rank(φj) = j and rank(φ) = n + 1 for φ ∉ Φ. The parameter n is fixed at 1024 in the experiments below. The known proof Π serves as a reference point against which the selected facts and their ranks are judged. Ideally, the selected facts should include as many facts from the proof as possible, with as low ranks as possible.

^4 Our empirical data are available at http://www21.in.tum.de/~blanchet/mash_data.tgz.


                     MePo               MaSh               MeSh
                     100%Rec.   AUC     100%Rec.   AUC     100%Rec.   AUC
Isar proofs
  Auth                    430   79.2         190   93.1         142   94.9
  Jinja                   472   73.1         307   90.3         250   92.2
  Probability             742   57.7         384   88.0         336   89.2
ATP proofs
  Auth                    119   93.5         198   92.0          68   97.0
  Jinja                   163   90.4         241   90.6          84   96.8
  Probability             428   74.4         368   85.2         221   91.6

Figure 6.2: Average 100%Recall and AUC (%) with Isar and ATP proofs

Definition 22 (100%Recall). 100%Recall denotes the minimum number m ∈ {0,...,n} such that {φ1,...,φm} ⊇ Π, or n + 1 if no such number exists.

Definition 23 (AUC). The area under the receiver operating characteristic curve (AUC) is given by

    |{(φ, φ′) ∈ Π × (Φ − Π) | rank(φ) < rank(φ′)}| / (|Π| · |Φ − Π|)

100%Recall tells how many facts must be selected to ensure that all necessary facts are included—ideally as few as possible. The AUC focuses on the ranks: It gives the probability that, given a randomly drawn "good" fact (a fact from the proof) and a randomly drawn "bad" fact (a selected fact that does not appear in the proof), the good fact is ranked before the bad fact. AUC values closer to 1 (100%) are preferable.

For each of the three formalizations (Auth, Jinja, and Probability), the evaluation harness processes the lemmas according to a linearization (topological sorting) of the partial order induced by the theory graph and their location in the theory texts. Each lemma is seen as a goal for which facts must be selected. Previously proved lemmas, and the learning performed on their proofs, may be exploited—this includes lemmas from imported background theories. This setup is similar to the one used by Kaliszyk and Urban [45] for evaluating their Sledgehammer-like tool for HOL Light. It simulates a user who systematically develops a formalization from beginning to end, trying out Sledgehammer on each lemma before engaging in a manual proof.^5

Fig. 6.2 shows the average 100%Recall and AUC over all lemmas from the three formalizations. For each formalization, the statistics are available for both Isar and ATP proofs. In the latter case, Vampire was used as the ATP, and goals for which it failed to find a proof are simply ignored. Learning from ATP proofs improves the machine learning metrics, partly because they usually refer to fewer facts than Isar proofs.
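As a concrete reading of Definitions 22 and 23, both metrics can be computed from a ranked selection and a known proof as follows (a small sketch, not the evaluation harness itself):

def full_recall(selected, proof):
    # Definition 22: smallest m such that the m best-ranked facts contain the proof
    needed = set(proof)
    if not needed:
        return 0
    for m, fact in enumerate(selected, start=1):
        needed.discard(fact)
        if not needed:
            return m
    return len(selected) + 1             # some proof fact was never selected

def auc(selected, proof):
    # Definition 23: chance that a proof fact is ranked before a selected non-proof fact
    n = len(selected)
    rank = {fact: i + 1 for i, fact in enumerate(selected)}
    good = [rank.get(f, n + 1) for f in proof]
    bad = [rank[f] for f in selected if f not in set(proof)]
    if not good or not bad:
        return 1.0
    better = sum(1 for g in good for b in bad if g < b)
    return better / (len(good) * len(bad))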

^5 Earlier evaluations of Sledgehammer always operated on individual (sub)goals, guided by the notion that lemmas can be too difficult to be proved outright by automated provers. However, lemmas appear to provide the right level of challenge for modern automation, and they tend to exhibit less redundancy than a sequence of similar subgoals.


Figure 6.3: Success rates for a combination of provers on Auth + Jinja + Probability (success rate in %, plotted against the number of facts from 16 to 1024, for MePo, MaSh/Isar, MaSh/ATP, MeSh/Isar, and MeSh/ATP)

There is a reversal of fortune between Isar and ATP proofs: MaSh dominates MePo for the former but performs slightly worse than MePo for the latter on two of the formalizations. The explanation is that the ATP proofs were found with MePo's help. Nonetheless, the combination filter MeSh scores better than MePo on all the benchmarks.

Next comes the "in vivo" part of the evaluation, with actual provers replacing machine learning metrics. For each goal from the formalizations, 13 problems were generated, with 16, 23 (≈ 2^4.5), 32, ..., 724 (≈ 2^9.5), and 1024 facts. Sledgehammer's translation is parameterized by many options, whose defaults vary from prover to prover and, because of time slicing, even from one prover invocation to another. As a reasonable uniform configuration for the experiments, types are encoded via the so-called polymorphic "featherweight" guard-based encoding (the most efficient complete scheme [14]), and λ-abstractions via λ-lifting (as opposed to the more explosive SK combinators).

Fig. 6.3 gives the success rates of a combination of three state-of-the-art automated provers (Epar,^6 Vampire 2.6, and Z3 3.2) on these problems. Two versions of MaSh and MeSh are compared, with learning on Isar and ATP proofs. A problem is considered solved if it is solved within 10 s by any of them, using only one thread. The experiments were conducted on a 64-bit Linux server equipped with 12-core AMD Opteron 6174 processors running at 2.2 GHz. We observe the following:

• MaSh clearly outperforms MePo, especially in the range from 32 to 256 facts. For 91-fact problems, the gap between MaSh/Isar and MePo is 10 percentage points. (The curves have somewhat different shapes for the individual formalizations, but the general picture is the same.)

• MaSh's peak is both higher than MePo's (44.8% vs. 38.2%) and occurs for smaller problems (128 vs. 256 facts), reflecting the intuition that selecting fewer facts more carefully should increase the success rate.

^6 A modification of E 1.6 described in [107].


            MePo   MaSh   MeSh
E           55.0   49.8   56.6
SPASS       57.2   49.1   57.7
Vampire     55.3   49.7   56.0
Z3          53.0   51.8   60.8
Together    65.6   63.0   69.8

Figure 6.4: Success rates (%) on Judgment Day goals

• MeSh adds a few percentage points to MaSh. The effect is especially marked for the problems with fewer facts.

• Against expectations, learning from ATP proofs has a negative impact. A closer inspection of the raw data revealed that Vampire performs better with ATP (i.e., Vampire) proofs, whereas the other two provers prefer Isar proofs.

Another measure of MaSh and MeSh’s power is the total number of goals solved for any number of facts. With MePo alone, 46.3% of the goals are solved; adding MaSh and MeSh increases this figure to 62.7%. Remarkably, for Probability—the most difficult formalization by any standard—the corresponding figures are 27.1% vs. 47.2%.

6.5.2 Judgment Day

The Judgment Day benchmarks [20] consist of 1268 interactive goals arising in seven Isabelle theories, covering among them areas as diverse as the fundamental theorem of algebra, the completeness of a Hoare logic, and Jinja's type soundness. The evaluation harness invokes Sledgehammer on each goal. The same hardware is used as in the original Judgment Day study [20]: 32-bit Linux servers with Intel Xeon processors running at 3.06 GHz. The time limit is 60 s for proof search, potentially followed by minimization and reconstruction in Isabelle. MaSh is trained on 9852 Isar proofs from the background libraries imported by the seven theories under evaluation.

The comparison comprises E 1.6, SPASS 3.8ds, Vampire 2.6, and Z3 3.2, which Sledgehammer employs by default. Each prover is invoked with its own options and problems, including prover-specific features (e.g., arithmetic for Z3; sorts for SPASS, Vampire, and Z3). Time slicing is enabled. For MeSh, some of the slices use MePo or MaSh directly to promote complementarity.

The results are summarized in Fig. 6.4. Again, MeSh performs very well: The overall 4.2 percentage point gain, from 65.6% to 69.8%, is highly significant. As noted in a similar study, "When analyzing enhancements to automated provers, it is important to remember what difference a modest-looking gain of a few percentage points can make to users" [17, §7]. Incidentally, the 65.6% score for MePo reveals progress in the underlying provers compared with the 63.6% figure from one year ago.


The other main observation is that MaSh underperforms, especially in the light of the evaluation of Section 6.5.1. There are many plausible explanations. First, Judgment Day consists of smaller theories relying on basic background theories, giving few opportunities for learning. Consider the theory NS_Shared (Needham–Schroeder shared-key protocol), which is part of both evaluations. In the first evaluation, the linear progress through all Auth theories means that the learning performed on other, independent protocols (certified email, four versions of Kerberos, and Needham–Schroeder public key) can be exploited. Second, the Sledgehammer setup has been tuned for Judgment Day and MePo over the years (in the hope that improvements on this representative benchmark suite would translate into improvements on users' theories), and conversely MePo's parameters are tuned for Judgment Day.

In future work, we want to investigate MaSh's mediocre performance on these benchmarks (and MeSh's remarkable results given the circumstances). The evaluation of Section 6.5.1 suggests that there are more percentage points to be gained.

6.6 Related Work and Contributions

The main related work is already mentioned in the introduction. Bridges such as Sledgehammer for Isabelle/HOL, MizAR [109] for Mizar, and HOL(y)Hammer [45] for HOL Light are opening large formal theories to methods that combine ATPs and artificial intelligence (AI) [106, 57] to help automate interactive proofs. Today such large theories are the main resource for combining semantic and statistical AI methods [111].^7

The main contribution of this work has been to take the emerging machine learning methods for fact selection and make them incremental, fast, and robust enough so that they run unnoticed on a single-user machine and respond well to common user-interaction scenarios. The advising services for Mizar and HOL Light [104, 103, 45, 109] (with the partial exception of MoMM [103]) run only as remote servers trained on the main central library, and their solution to changes in the library is to ignore them or relearn everything from scratch. Other novelties of this work include the use of more proof-related features in the learning (inspired by MePo), experiments combining MePo and MaSh, and the related learning of various parameters of the systems involved.

6.7 Conclusion

Relevance filtering is an important practical problem that arises with large-theory reasoning. Sledgehammer's MaSh filter brings the benefits of machine learning methods to Isabelle users: By decreasing the quantity and increasing the quality of facts passed to the automated provers, it helps them find more, deeper proofs within the allotted time. The core machine learning functionality is implemented in a separate Python program that can be reused by other proof assistants.

^7 It is hard to envisage all possible combinations, but with the recent progress in natural language processing, suitable ATP/AI methods could soon be applied to another major aspect of formalization: the translation from informal prose to formal specification.


Many areas are calling for more engineering and research; we mentioned a few already. Learning data could be shared on a server or supplied with the proof assistant. More advanced algorithms appear too slow for interactive use, but they could be optimized. Learning could be applied to control more aspects of Sledgehammer, such as the prover options or even MePo's parameters. Evaluations over the entire Archive of Formal Proofs might shed more light on MaSh's and MePo's strengths and weaknesses. Machine learning being a gift that keeps on giving, it would be fascinating to instrument a user's installation to monitor performance over several months.

Chapter 7

MaLeS - Machine Learning of Strategies

MaLeS is a framework that develops strategies for automated theorem provers (ATPs) and creates suitable schedules of strategies for individual problems. The framework can be used in a push-button way to develop such strategies and schedules for an arbitrary ATP. This chapter describes the tool and the methods used in it, and evaluates its performance for three automated theorem provers: E, LEO-II and Satallax. An evaluation on a subset of the TPTP library problems shows that, on average, a MaLeS-tuned prover solves 8.67% more problems than the prover with its default settings.

7.1 Introduction: ATP Strategies

Automated theorem proving is a search problem. Many different approaches exist, and most of them have parameters that can be tuned. Examples of such parameterizations are clause weighting and selection schemes, term orderings, and the sets of inference and reduction rules used. For a given ATP A, its implemented parameters form A's parameter space. A specific choice of parameters defines a search strategy.^1 The choice of a strategy can often make the difference between finding a proof in a few milliseconds or not at all (within a reasonable time limit). This naturally leads to the question: Given a new problem, which search strategy should be used?

Considerable attention has already been paid to this problem. Gandalf [99] pioneered strategy scheduling: Run several search strategies sequentially with shorter time limits instead of a single strategy for the whole time limit. This method is used in most current ATPs, most prominently Vampire [77]. In the SETHEO project [119], a local search

This chapter is based on: [53] "MaLeS: A Framework for Automatic Tuning of Automated Theorem Provers" and an extension of [58] "E-MaLeS 1.1", published in the Proceedings of the 24th Conference on Automated Deduction.
^1 Many different names exist for these concepts. In Satallax [19] parameters are called flags, and a strategy is called a mode. Option is often used as a synonym for parameter. Configurations and configuration space are other alternative names.

algorithm in the space of all strategy schedules was used to find better strategy schedules. Fuchs [31] used a nearest neighbour algorithm to determine which strategy or strategies to run. Bridge's [18] thesis is about machine learning for search heuristic selection in ATPs, with a particular focus on problem features and feature selection. In the SAT community, Satzilla [120] very successfully used machine learning to decide when to run which SAT solver. ParamILS [44] is a general tuning framework that searches for good parameter settings with a randomized hill climbing algorithm. BliStr [107] uses ParamILS to develop strategies for E [84] on a large set of interrelated problems.

Despite all this work, most ATPs do not harness the methods available. Search strategies are often manually defined by the developer of the ATP, and strategy schedules are created by a greedy algorithm or very simple clustering. This chapter introduces MaLeS (Machine Learning (of) Strategies), an easy-to-use learning-based framework for automatic tuning and configuration of ATPs. It is based on and supersedes E-MaLeS 1.0 [54] and E-MaLeS 1.1 [58]. The goal of MaLeS is to help ATP users fine-tune an ATP to their problems and provide developers with a push-button method for finding good search strategies and creating strategy schedules. MaLeS is implemented in Python and has been tested with the ATPs E, LEO-II [9] and Satallax [19]. The source code is freely available at https://code.google.com/p/males/.

7.1.1 The Strategy Selection Problem

Figure 7.1 gives an overview of the strategy selection problem. Each point in the parameter space corresponds to a search strategy. Parameter spaces can be very big. For example, the ATP E supports over 10^17 different search strategies. To simplify the strategy selection problem, strategy selection algorithms usually only consider a small number of preselected strategies. There are different criteria to determine which strategies should be selected. The most common ones are to pick strategies that solve a lot of problems, or that are very good for a particular kind of problem.

In order to determine which strategy to use for a problem, one needs to be able to characterize different problem classes. This is usually done by defining a set of problem features. Features must be fast to compute, but also expressive enough so that the ATP behaves similarly on problems with similar features. The features are used to determine which strategy is run. Hence, the strategy selection problem consists of three subproblems:

1. Finding a good set of preselected strategies S.

2. Defining features F which are easy to compute (via a feature function ϕ), but also expressive enough to distinguish different types of problems.

3. Determining a method which given the features of a problem creates a strategy schedule.


Figure 7.1: Overview of the strategy selection problem for ATPs. The feature function ϕ maps the problem space P to the feature space F, which is used to pick strategies from the preselected strategy space S, a subset of the full parameter space.

7.1.2 Overview

The rest of the chapter is organized as follows: Section 7.2 explains how MaLeS defines the preselected strategy space S. The features and the algorithm that creates the strategy schedule are presented in Section 7.3. MaLeS is evaluated against the default installations of E 1.7, LEO-II 1.6.0 and Satallax 2.7 in Section 7.4. The experiments compare the performance of running an ATP in default mode versus running the ATP with strategy scheduling provided by MaLeS. Section 7.5 shows how to install the MaLeS-tuned versions of the ATPs mentioned above, E-MaLeS, LEO-MaLeS and Satallax-MaLeS, how to tune any of those systems for new problems, and how to use MaLeS with different ATPs. Future work is considered in Section 7.6, and the chapter concludes with Section 7.7.

7.2 Finding Good Search Strategies with MaLeS

Choosing a good strategy for a problem requires prior information on how the different strategies behave on different kinds of problems. Getting this information for all strategies is often infeasible due to constraints on the available CPU power and the number of possible strategies. Hence, one has to decide which strategies one wishes to evaluate. ATP developers often manually define such a set of strategies based on their intuition and experience. This option is, however, not available when one lacks in-depth knowledge of the internal workings of the ATP. A local search algorithm can help in these cases, and can even be combined with the manual approach by taking the predefined strategies as starting points of the search.

We present a basic stochastic local search algorithm labeled find_strategies (Algorithm 1) for ATPs. The strategies returned by find_strategies define the preselected strategy space S. The difference to existing parameter selection frameworks like ParamILS


Algorithm 1 find_strategies: For each problem search for an optimal strategy.
 1: procedure find_strategies(Problems, tol, t_max, nS, nC)
 2:   initialize Queue Q
 3:   initialize dictionary bestTimes with t_max for all problems
 4:   while Q not empty do
 5:     s ← pop(Q)
 6:     for p ∈ Problems do
 7:       oldBestTime ← bestTime[p]
 8:       proofFound, timeNeeded ← run_strategy(s, p, t_max)
 9:       if proofFound and timeNeeded < bestTime[p] then
10:         bestTime[p] ← timeNeeded
11:         bestStrategies[p] ← s
12:       end if
13:       if proofFound and timeNeeded < bestTime[p] + tol then
14:         randomStrategies ← create_random_strategies(s, nS, nC)
15:         for r in randomStrategies do
16:           proofFoundR, timeNeededR ← run_strategy(r, p, timeNeeded)
17:           if proofFoundR and timeNeededR

^2 find_strategies is essentially equivalent to running ParamILS on every single problem.

on all problems. If the strategy solves a problem faster than any of the tried strategies (within some tolerance, see Line 13), a local search is performed. If the search yields faster strategies, the fastest newly found search strategy is appended to the queue. In the end, find_strategies returns the strategies that were the fastest strategy on at least one problem.

Algorithm 2 create_random_strategies: Returns slight variations of the input strategy.
 1: procedure create_random_strategies(Strategy, nS, nC)
 2:   newStrategies is an empty list
 3:   for i in nS do
 4:     newStrategy is a copy of Strategy
 5:     for j in nC do
 6:       newStrategy = change_random_parameter(newStrategy)
 7:     end for
 8:     newStrategies.append(newStrategy)
 9:   end for
10:   return newStrategies
11: end procedure
nS determines the number of new strategies, nC is the upper limit for the number of changed parameters.

The local search part is defined in Algorithm 2 (create_random_strategies). It returns a predefined number of strategies similar to the input strategy. The new strategies are created by randomly changing the parameters of the input strategy. How many parameters are changed is determined in MaLeS' configuration file.^3

7.3 Strategy Scheduling with MaLeS

Most automated theorem provers, independent of the parameters used, solve problems either very fast, or not at all (within a reasonable time limit). Instead of trying only a single strategy for a long time, it is often beneficial to run several search strategies for a shorter time. This approach is called strategy scheduling. Many current ATPs use strategy scheduling to define their default configuration. Some use a single schedule for every problem (e.g. Satallax 2.7). Others define classes of similar problems and use different schedules for different classes (e.g. LEO-II 1.6.0).

MaLeS creates an individual strategy schedule for each problem, depending on the problem's features. For each strategy s in the preselected strategies S, MaLeS defines a runtime prediction function ρs : P → R. The prediction function ρs uses the features of a problem to predict the time the ATP running strategy s needs to solve the problem. The strategy schedule for the problem is created from these predictions.
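As an illustration, one simple way to turn such predictions into a schedule is to sort the preselected strategies by predicted runtime and give each a time slot until the overall limit is exhausted; this is only a sketch of the idea, not necessarily the exact procedure MaLeS uses.

def make_schedule(problem, strategies, predict, time_limit):
    # predict(s, problem) plays the role of rho_s(problem)
    by_predicted_time = sorted(strategies, key=lambda s: predict(s, problem))
    schedule, remaining = [], time_limit
    for s in by_predicted_time:
        if remaining <= 0:
            break
        slot = min(max(predict(s, problem), 1.0), remaining)   # at least 1 s per strategy
        schedule.append((s, slot))
        remaining -= slot
    return schedule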

^3 Parameter WalkLength in Table 7.7.


7.3.1 Notation

For the remainder of the chapter, we shall use the following notation:

• p is an ATP problem. P denotes a set of problems.

• Ptrain ⊆ P is a set of training problems that is used to tune the learning algorithm.

• F is the feature space. We assume that F is a subset of R^n for some n ∈ N.

• ϕ : P → F is the feature function. ϕ(p) is the feature vector of a problem.

• 𝒮 is the parameter space; S is the set of preselected strategies.

• The time the ATP running strategy s needs to solve a problem p is denoted by τ(p, s). If s is obvious from the context or irrelevant, we also use τ(p).

• For a strategy s, ρs : P → R is the runtime prediction function.

7.3.2 Features

Features give an abstract description of a problem. Optimally, the features should be designed in such a way that the ATP behaves similarly on problems with similar features, i.e. if two problems p, q have similar features ϕ(p) ∼ ϕ(q), then for each strategy s the runtimes should be similar, τ(p, s) ∼ τ(q, s). The similarity function (e.g. the cosine distance between the feature vectors) and the set of features heavily influence the quality of the prediction functions. Indeed, feature selection is an entire subfield of machine learning [59, 36]. Currently, MaLeS supports two different feature spaces: Schulz's E features are used for first-order (FOF) problems. The TPTP features designed by Sutcliffe are used for higher-order (THF) problems [96].

The E Features

Schulz designed a set of features for clause-normal-form and first-order problems. They are used in the strategy selection process in his theorem prover E [58]. Table 7.1 shows the features together with a short description.^4 MaLeS uses the same features for first-order problems.

A clause is called negative if it only has negative literals. It is called positive if it only has positive literals. A ground clause is a clause that contains no variables. In this setting, we refer to all negative clauses as "goals", and to all other clauses as "axioms". Clauses can be unit (having only a single literal), Horn (having at most one positive literal), or general (no constraints on the form). All unit clauses are Horn, and all Horn clauses are general. The features are computed by running Schulz's classify_problem program, which is distributed with MaLeS.

^4 The author would like to thank Stephan Schulz for the design of the features, the program that extracts them and their precise description in this subsection.


Table 7.1: Problem features used for strategy selection in E and in first-order MaLeS.

Feature                 Description
axioms                  Most specific class (unit, Horn, general) describing all axioms
goals                   Most specific class (unit, Horn) describing all goals
equality                Problem has no equational literals, some equational literals, or only equational literals
non-ground units        Number (or fraction) of unit axioms that are not ground
ground-goals            Are all goals ground?
clauses                 Number of clauses
literals                Number of literals
term_cells              Number of all (sub)terms
unitgoals               Number of unit goals (negative clauses)
unitaxioms              Number of positive unit clauses
horngoals               Number of Horn goals (non-unit)
hornaxioms              Number of Horn axioms (non-unit)
eq_clauses              Number of unit equations
groundunitaxioms        Number of ground unit axioms
groundgoals             Number of ground goals
groundpositiveaxioms    Number (or fraction) of positive axioms that are ground
positiveaxioms          Number of all positive axioms
ng_unit_axioms_part     Number of non-ground unit axioms
max_fun_arity           Maximal arity of a function or predicate symbol
avg_fun_arity           Average arity of symbols in the problem
sum_fun_arity           Sum of arities of symbols in the problem
clause_max_depth        Maximal clause depth
clause_avg_depth        Average clause depth

The TPTP Features

The TPTP problem library [91] provides a syntactical description of every problem which can be used as problem features. Figure 7.2 shows an example. Before normalization, the feature vector corresponding to the example is

[145,5,47,31,1106,...,147,0,0,0,0]

Sutcliffe’s MakeListStats computes these features and is publicly available as part of the TPTP infrastructure. A modified version which outputs only the numbers without any text is also distributed with MaLeS.


% Syntax : Number of formulae    :  145 (   5 unit;  47 type;  31 defn)
%          Number of atoms       : 1106 (  36 equality; 255 variable)
%          Maximal formula depth :   11 (   7 average)
%          Number of connectives :  760 (   4 ~;   4 |;   8 &; 736 @)
%                                        (   0 <=>;   8 =>;   0 <=;   0 <~>)
%                                        (   0 ~|;   0 ~&;   0 !!;   0 ??)
%          Number of type conns  :  235 ( 235 >;   0 *;   0 +;   0 <<)
%          Number of symbols     :   52 (  47 :)
%          Number of variables   :  147 (   3 sgn;  29 !;   6 ?; 112 ^)
%                                        ( 147 :;   0 !>;   0 ?*)
%                                        (   0 @-;   0 @+)

Figure 7.2: The TPTP features of the THF problem AGT029^1.p in TPTP-v5.4.0.

Normalization

In the initial form, there can be great differences between the values of different features. In the THF example (Figure 7.2), the number of atoms (1106) is of a different order of magnitude than e.g. the maximal formula depth (7). Since our machine learning method (like many others) computes the Euclidean distance between data points, these differences can render features with smaller values irrelevant. Hence, normalization is used to scale all features to have values between 0 and 1. First we compute the features for each p ∈ Ptrain. Then the maximal and minimal value of each feature f is determined. These values are then used to rescale the feature vectors for each problem p via

ϕ(p)_f := (ϕ(p)_f − min_f) / (max_f − min_f)

where ϕ(p)_f is the value of feature f for problem p, and min_f and max_f are the minimal and maximal values of f among the problems in Ptrain.
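This rescaling step can be sketched as follows, assuming the training features are stored in a NumPy matrix with one row per problem; the function and variable names are illustrative, not MaLeS's actual code.

import numpy as np

def fit_minmax(train_features):
    # Per-feature minima and maxima, computed on the training problems only.
    return train_features.min(axis=0), train_features.max(axis=0)

def normalize(features, f_min, f_max):
    # Rescale feature values to [0, 1]; constant features are mapped to 0.
    span = np.where(f_max > f_min, f_max - f_min, 1.0)
    return (features - f_min) / span

train = np.array([[145.0,  5.0, 1106.0],
                  [ 80.0,  2.0,  300.0],
                  [200.0,  9.0, 2500.0]])
f_min, f_max = fit_minmax(train)
print(normalize(train, f_min, f_max))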

7.3.3 Runtime Prediction Functions

Predicting the runtime of an ATP is a classic regression problem [13]. For each strategy s in the preselected strategies S, we are searching for a function ρs : P → R such that for all problems p ∈ P the predicted values are close to the actual runtimes: ρs(p) ∼ τ(p, s). This section explains the learning method employed by MaLeS as well as the data preparation techniques used.

Timeouts

The prediction functions are learned from the behaviour of the preselected strategies on the training problems Ptrain. Each preselected strategy is run on all training problems with a timeout t. Often, strategies will not solve all problems within the timeout. This leads to the question of how one should treat unsolved problems. Setting the time value of an unsolved problem-strategy pair (p, s) to the timeout, τ(p, s) = t, is one possible solution.


Another possibility, which is used in MaLeS, is to only learn on problems that can be solved. While ignoring unsolved problems introduces a bias towards shorter runtimes, it also simplifies the computation of the prediction functions and allows us to update the prediction functions at runtime (Section 7.3.5). If MaLeS runs the ATP with strategy s for a time limit t on a problem p and the ATP does not find a solution, then MaLeS uses this information to update the prediction functions and adapt the strategy schedule for p at runtime.

Kernel Methods

MaLeS uses kernels to learn the runtime prediction function. Kernels are a very popular machine learning method that has successfully been applied in many domains [88]. A kernel can be seen as a similarity function between feature vectors. Kernels allow the usage of nonlinear features while keeping the learning problem itself linear. The basic principles will be covered on the next pages. More information about kernel-based machine learning can be found in [88].

Definition 24 (Gaussian Kernel). The Gaussian kernel k with parameter σ of two problems p, q ∈ P with feature vectors ϕ(p), ϕ(q) ∈ F ⊆ Rn for some n ∈ N is defined as

k(p, q) := exp( −(ϕ(p)T ϕ(p) − 2ϕ(p)T ϕ(q) + ϕ(q)T ϕ(q)) / σ² )

ϕ(p)T is the transposed vector, and hence ϕ(p)T ϕ(q) is the dot product between ϕ(p) and ϕ(q) in Rn. In order to apply machine learning, we first need some data to learn from. Let t ∈ R be a time limit. For each preselected strategy s ∈ S, the ATP is run with strategy s and time limit t on each problem in Ptrain. For each strategy s, Ptrain^s ⊆ Ptrain is the set of problems that the ATP can solve within the time limit t with strategy s. In kernel-based machine learning, the prediction function ρs has the form

ρs(p) = Σ_{q ∈ Ptrain^s} α^s_q k(p, q)

for some α^s_q ∈ R. The α^s_q are called weights and are the result of the learning. To define how exactly this is done, some more notation is needed.

Definition 25 (Kernel Matrix, Times Matrix and Weights Matrix). For every strategy s ∈ S, let m be the number of problems in Ptrain^s and (p_i), 1 ≤ i ≤ m, be an enumeration of the problems in Ptrain^s. The kernel matrix K^s ∈ R^{m×m} is defined as

K^s_{i,j} := k(p_i, p_j)

We define the time matrix Y^s ∈ R^{m×1} via

Y^s_i := τ(p_i, s)


Finally, we set the weight matrix A^s ∈ R^{m×1} as

A^s_i := α^s_{p_i}

If it is obvious which strategy is meant, or the statement is independent of the strategy, we omit the s in K^s, Y^s and A^s.

A simple way to define A would be to solve KA = Y. Such a solution (if it exists) would likely perform very well on known data but poorly on new data, a behaviour called overfitting. A regularization parameter λ ∈ R is added as a penalty for complex prediction functions. Least squares regression is used to minimize the difference between the predicted times and the actual times [78]. That means we want

A = argmin_{A ∈ R^{m×1}} ( (Y − KA)^T (Y − KA) + λ A^T K A )

The first part of the equation, (Y − KA)^T (Y − KA), is the square loss between the predicted values and the actual times needed. λA^T KA is the regularization term. The bigger λ is, the more strongly complex functions are penalized [78]. For very high values of λ, we force A to be almost equal to the 0 matrix. This approach can be seen as a kind of Occam's razor for prediction functions. A is the matrix that best fits the training data while staying as simple as possible.

Theorem 1 (Weight Matrix for a Strategy). For λ > 0, the optimal weights for a strategy s are given by

A = (K + λI)^{−1} Y

with I being the identity matrix in R^{m×m}.

Proof.

∂/∂A ( (Y − KA)^T (Y − KA) + λA^T KA ) = −2K(Y − KA) + 2λKA = −2KY + (2KK + 2λK)A

It can be shown that K is a positive semi-definite symmetric matrix and therefore (K + λI) is invertible for λ > 0. To find a minimum, we set the derivative to zero and solve with respect to A:

K(K + λI)A = KY

and hence

A = (K + λI)^{−1} Y

is a solution. □
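The following sketch puts the pieces together for a single strategy: it builds the Gaussian kernel matrix on normalized training features, computes the weights A = (K + λI)^{−1} Y from Theorem 1 (via a linear solve rather than an explicit inverse), and predicts the runtime of a new problem. It is an illustration of the method described above with made-up names and toy data, not the actual MaLeS implementation.

import numpy as np

def gaussian_kernel(x, y, sigma):
    # k(p, q) = exp(-||phi(p) - phi(q)||^2 / sigma^2), cf. Definition 24.
    d = x - y
    return np.exp(-np.dot(d, d) / sigma ** 2)

def kernel_matrix(features, sigma):
    # K_ij = k(p_i, p_j) over the solved training problems of one strategy.
    m = features.shape[0]
    K = np.empty((m, m))
    for i in range(m):
        for j in range(m):
            K[i, j] = gaussian_kernel(features[i], features[j], sigma)
    return K

def fit_weights(K, runtimes, lam):
    # Solve (K + lambda*I) A = Y instead of forming the inverse explicitly.
    return np.linalg.solve(K + lam * np.eye(K.shape[0]), runtimes)

def predict_runtime(problem_features, train_features, weights, sigma):
    # rho_s(p) = sum over q of alpha_q * k(p, q).
    return sum(w * gaussian_kernel(problem_features, q, sigma)
               for w, q in zip(weights, train_features))

# Toy example: three solved training problems with normalized features and runtimes.
X = np.array([[0.1, 0.3], [0.5, 0.2], [0.9, 0.8]])
Y = np.array([1.2, 4.0, 30.0])
sigma, lam = 1.0, 0.1
A = fit_weights(kernel_matrix(X, sigma), Y, lam)
print(predict_runtime(np.array([0.4, 0.25]), X, A, sigma))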


7.3.4 Crossvalidation

Finally, the values for the regularization constant λ and the kernel width σ need to be determined. This is done via 10-fold cross-validation on the training problems, a standard machine learning method for such tasks [51]. Cross-validation simulates the effect of not knowing the data and picks the values that perform, in general, best on unknown problems. First a finite number of possible values for λ and σ is defined. Then, the training set Ptrain^s is split into 10 disjoint, equally sized subsets P1, ..., P10. For all 1 ≤ i ≤ 10, each possible combination of values for λ and σ is trained on Ptrain^s − Pi and evaluated on Pi. The evaluation is done by computing the square loss between the predicted runtimes and the actual runtimes. The combination with the least average square loss is used.
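A minimal sketch of this 10-fold grid search, reusing the kernel_matrix, fit_weights and predict_runtime helpers from the sketch above (names and structure are illustrative, not MaLeS's code):

import numpy as np

def cross_validate(X, Y, lambdas, sigmas, folds=10):
    # Return the (lambda, sigma) pair with the smallest average square loss.
    m = X.shape[0]
    fold_indices = np.array_split(np.random.permutation(m), folds)
    best, best_loss = None, np.inf
    for lam in lambdas:
        for sigma in sigmas:
            losses = []
            for test_idx in fold_indices:
                train_idx = np.setdiff1d(np.arange(m), test_idx)
                A = fit_weights(kernel_matrix(X[train_idx], sigma), Y[train_idx], lam)
                preds = np.array([predict_runtime(X[i], X[train_idx], A, sigma)
                                  for i in test_idx])
                losses.append(np.mean((preds - Y[test_idx]) ** 2))
            if np.mean(losses) < best_loss:
                best, best_loss = (lam, sigma), np.mean(losses)
    return best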

7.3.5 Creating Schedules from Prediction Functions

Having defined the prediction functions, we can now introduce the scheduling algorithm that is used when trying to solve a new problem. For each new problem, MaLeS uses the prediction functions to select the strategy and runtime that is most likely (according to our model) to solve the problem. If the predicted strategy does not solve the problem, MaLeS updates all prediction functions with this new information. Algorithm 3 shows the details. In line 2 the algorithm starts by running some predefined start strategies. The goal of running these start strategies first is to filter out simple problems, which allows the learning algorithm to focus on the harder problems. The start strategies are picked greedily: first the strategy that solves the most problems (within some time limit) is chosen, then the strategy that solves the most of the problems that were not solved by the first picked strategy (within some time limit) is picked, and so on (a sketch of this greedy selection follows below). The number of start strategies and their runtime are determined via their respective parameters in the setup.ini file (Table 7.8). Training problems that are solved by the start strategies are deleted from the training set. For example, let s1, ..., sn be the starting strategies, all with a runtime of 1 second. Then for all s ∈ S we can set

Ptrain^s := {p ∈ Ptrain^s | ∀ 1 ≤ i ≤ n : τ(p, si) > 1}

and train ρs on the updated Ptrain^s. The subprocedure choose_best_strategy in line 12 picks the strategy s′ with the minimum predicted runtime among those that have not been run with a bigger or equal runtime before.5 run_strategy runs the ATP with strategy s′ and time limit t_s′ on the problem. If the ATP cannot solve the problem within the time limit, this information is used to improve the prediction functions in update_prediction_function (line 19). For this, all the training problems that are solved by the picked strategy s′ within the predicted runtime t_s′ are deleted from the training set Ptrain^s, i.e. for all s ∈ S

Ptrain^s := {p ∈ Ptrain^s | τ(p, s′) > t_s′}

5 If there are several strategies with the same minimal predicted runtime, a random one is chosen.
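The greedy choice of start strategies described above can be sketched as follows; runtimes[s][p] is assumed to hold τ(p, s) for the training problems, and the function name is illustrative rather than MaLeS's actual code.

def pick_start_strategies(runtimes, n, time_limit):
    # Greedily pick n strategies: each pick maximizes the number of training
    # problems solved within time_limit that are not yet covered.
    chosen, covered = [], set()
    for _ in range(n):
        def newly_solved(s):
            return {p for p, t in runtimes[s].items()
                    if t <= time_limit and p not in covered}
        best = max(runtimes, key=lambda s: len(newly_solved(s)))
        chosen.append(best)
        covered |= newly_solved(best)
    return chosen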


Algorithm 3 males: Tries to solve the input problem within the time limit. Creates and runs a strategy schedule for the problem.

 1: procedure males(problem, time)
 2:   proofFound, timeUsed ← run_start_strategies(problem, time)
 3:   if proofFound then
 4:     return timeUsed
 5:   end if
 6:   while timeUsed < time do
 7:     times is an empty list
 8:     for s ∈ S do
 9:       ts ← ρs(problem)
10:       times.append([ts, s])
11:     end for
12:     ([t_s′, s′]) ← choose_best_strategy(times)
13:     proofFound, timeNeeded ← run_strategy(s′, problem, t_s′)
14:     timeUsed += timeNeeded
15:     if proofFound then
16:       return timeUsed
17:     end if
18:     for s ∈ S do
19:       timeUsed += update_prediction_function(ρs, s′, t_s′)
20:     end for
21:   end while
22:   return timeUsed
23: end procedure

Afterwards, new prediction functions are learned on the reduced training set. This is done by first creating a new kernel and time matrix for the new Ptrain^s and then computing new weights as shown in Theorem 1. Due to the small size of the training dataset, this can be done in real time during a proof. Note that these updates are local, i.e. they do not have any effect on future calls to males. If males finds a proof, the total time needed is returned to the user.
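A sketch of this update step for one strategy, assuming the per-strategy training data is kept as a feature matrix and a runtime vector and reusing the earlier kernel_matrix and fit_weights helpers (again, illustrative names only):

import numpy as np

def update_prediction_function(train_X, train_Y, failed_strategy_times, t_limit,
                               lam, sigma):
    # failed_strategy_times[i] is tau(p_i, s') for the picked strategy s'.
    # Drop every training problem that s' solves within the predicted runtime,
    # then refit the weights on the reduced set as in Theorem 1.
    keep = np.array([t > t_limit for t in failed_strategy_times])
    X, Y = train_X[keep], train_Y[keep]
    if len(Y) == 0:
        return X, Y, None
    A = fit_weights(kernel_matrix(X, sigma), Y, lam)
    return X, Y, A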

7.4 Evaluation

MaLeS is evaluated with three different ATPs: E 1.7, LEO-II 1.6 and Satallax 2.7. For every prover, a set of training and testing problems is defined. MaLeS first searches for good strategies on the training problems using Algorithm 1 with a 10 second time limit, i.e. tmax = 10. Promising strategies are then run for 300 seconds on all training problems. The resulting data is used to learn runtime prediction functions and strategy schedules as explained in the previous section. After the learning, MaLeS uses Algorithm 3 when trying to solve a new problem. The difference between the different MaLeS versions


(i.e. E-MaLeS, Satallax-MaLeS and Leo-MaLeS) is the training data used to create the prediction functions and start strategies, and the ATP that is run in the run_strategy part of Algorithm 3. The MaLeS version of the ATP is compared with the default mode on both the test and the training problems. The section ends with an overview of previous versions of MaLeS and their CASC performance.

7.4.1 E-MaLeS

E is a popular ATP for first order logic. It is open source, easily available and consistently performs very well at the CASC competitions. Additionally, E is easily tunable with a big parameter space,6 which suggested that parameter tuning could lead to significant improvements. All computations were done on a 64 core AMD Opteron Processor 6276 with 1.4 GHz per CPU and 256 GB of RAM.

E’s Automatic Mode E’s automatic mode is developed by Stephan Schulz and based on a static partitioning of the set of all problems into disjoint classes. It is generated in two steps. First, the set of all training examples (typically the set of all current TPTP problems) is classified into disjoint classes using some of the features listed in Table 7.1. For the numeric features, threshold values have originally been selected to split the TPTP into 3 or 4 approximately equal subsets on each feature. Over time, these have been manually adapted using trial and error. Once the classification is fixed, a Python program reads the different classes and as- signs to each class one of the strategies that solves the most examples in this class. For large classes (arbitrarily defined as having more than 200 problems), it picks the strat- egy that also is fastest on that class. For small classes, it picks the globally best strategy among those that solve the maximum number of problems. A class with zero solutions by all strategies is assigned the overall best strategy.

The Training Data

The problems from the FOF divisions of CASC-22 [92], CASC-J5 [93], CASC-23 [94] and CASC-J6 and CASC@Turing [95] were used as training problems. Several problems appeared in more than one CASC. There are also a few problems from earlier CASCs that are not part of the TPTP version used in the experiments, TPTP-v5.4.0. Deleting duplicates and missing problems leaves 1112 problems that were used to train E-MaLeS. The strategy search for the set of preselected strategies took three weeks on a 64 core server. The majority of the time was spent running promising strategies with a 300 seconds time limit. Over 2 million strategies were considered. Of those, 109 were selected to be used in E-MaLeS. E-MaLeS runs 10 start strategies, each with a 1 second time limit. E 1.7 (running the automatic mode) and E-MaLeS were evaluated on all training problems with a 300 second time limit. The results can be seen in Figure 7.3.

6 The parameter space considered in the experiments contains more than 10^17 different strategies.


Figure 7.3: Performance graph for E-MaLeS 1.2 on the training problems.

Altogether, 1055, or 94.9%, of the problems can be solved by E 1.7 with the consid- ered strategies. E 1.7’s automatic mode solves 856 of the problems (77.0%), E-MaLeS solves 10.0% more problems: 942 (84.7%). Best Strategy shows the best possible result, i.e. the number of problems solved if for each problem the strategy that solves it in the least amount of time was picked.

The Test Data

Similar to the way the problems for CASC are chosen, 1000 random FOF problems of TPTP-v5.4.0 with a difficulty rating [98] between 0.2 and (including) 1.0 were chosen for the test dataset. 165 of the test problems are also part of the training dataset. The results are similar to the results on the training problems and can be seen in Figure 7.4. In the first three seconds, E solves more problems than E-MaLeS. Afterwards, E-MaLeS overtakes E. After 300 seconds, E-MaLeS solves 573 of the problems (57.3%) and E 1.7 511 (51.1%), an increase of 12.4%. Figure 7.5 shows the results for only the 835 problems that are not part of the training problems.

7.4.2 Satallax-MaLeS

In order to show that MaLeS works for arbitrary ATPs, we picked a very different ATP for the next experiment: Satallax. Satallax is a higher order theorem prover that has a reputation of being highly tuned. The built-in strategy schedule of Satallax solves 95.3% of all solvable problems in the training dataset and, with the right parameters, 91.3% (525) of the training problems can be solved in less than 1 second. The strategy search for the


Figure 7.4: Performance graph for E-MaLeS 1.2 on the test problems.

Figure 7.5: Performance graph for E-MaLeS 1.2 on the unseen test problems.

set of preselected strategies was done on a 32 core Intel Xeon with 2.6GHz per CPU and 256 GB of RAM. The evaluations were done on a 64 core AMD Opteron Processor 6276 with 1.4GHz per CPU and 256 GB of RAM.


Satallax’s Automatic Mode Satallax employs a hard-coded strategy schedule that defines a sequence of strategies together with their runtimes. The same schedule is used for all problems. It is defined in the file satallaxmain.ml in the src directory of the Satallax installation. Many modes are only run for a very short time (0.2 seconds). This can cause problems if Satallax is run on CPUs that are slower than the one(s) used to create this schedule.

The Training Data

The problems from the THF divisions of CASC-J5 [93], CASC-23 [94] and CASC-J6 [95] were used as training problems. The THF division of CASC-J5 contained 200 problems, that of CASC-23 300 problems, and that of CASC-J6 also 200 problems. After deleting duplicates and problems that are not available in TPTP-v5.4.0, 573 problems remain. The strategy search took approximately 3 weeks. In the end, 111 strategies were selected to be used in Satallax-MaLeS. Satallax-MaLeS runs 20 start strategies, each with a 0.5 second time limit. 533 of the 573 problems are solvable with the appropriate strategy. Satallax and Satallax-MaLeS were evaluated on all training problems with a 300 second time limit. Satallax solves 508 of the problems (88.7%). Satallax-MaLeS solves 1.6% more problems for a total of 516 solved problems (90.1%).

Figure 7.6: Performance graph for Satallax-MaLeS 1.2 on the training problems.

Figure 7.6 shows a log-scaled time plot of the results. For low time limits, Satallax-MaLeS solves significantly more problems than Satallax. This is probably due to the fact that Satallax uses the same strategy schedule for every problem, whereas Satallax-MaLeS


adapts its schedule. Best Strategy shows the best possible result, i.e. the number of problems solved if for each problem the strategy that solves it in the least amount of time was picked.

The Test Data

Similar to the E-MaLeS evaluation, the test dataset consists of 1000 randomly selected THF problems of TPTP-v5.4.0 with a difficulty rating between 0.2 and (including) 1.0. 301 of the test problems are also part of the training dataset. The results are similar to the results on the training problems and can be seen in Figure 7.7. While the end results are almost the same, with Satallax-MaLeS solving 590 (59.0%) and Satallax solving 587 (58.7%) of the problems, Satallax-MaLeS significantly outperforms Satallax for lower time limits. Figure 7.8 shows the results for only the 699 problems that are not part of the training problems. Here, Satallax-MaLeS solves more problems than Satallax in the beginning, but fewer for longer time limits. After 300 seconds, Satallax solves 344 and Satallax-MaLeS 336 problems.

Figure 7.7: Performance graph for Satallax-MaLeS 1.2 on the test problems.

7.4.3 LEO-MaLeS

LEO-MaLeS is the latest addition to the MaLeS family. LEO-II is a resolution-based higher-order theorem prover designed for fruitful cooperation with specialist provers for


Figure 7.8: Performance graph for Satallax-MaLeS 1.2 on the unseen test problems.

natural fragments of higher-order logic.7 The strategy search for the set of preselected strategies, and all evaluations, were done on a 32 core Intel Xeon with 2.6 GHz per CPU and 256 GB of RAM.

LEO-II’s Automatic Mode LEO-II’s automatic mode is a mixture of E’s and Satallax’s automatic modes. The prob- lem space is split into disjoint subspaces and a different strategy schedule is used for each subspace. The automatic mode is defined in the file strategy_scheduling.ml in the src/interfaces directory of the LEO-II installation.

The Training and Test Datasets

The same training and test problems as for the Satallax evaluation were used. The strategy search took 2 weeks. 89 strategies were selected. LEO-II and LEO-MaLeS were run with a 300 second time limit per problem. Of the 573 training problems, 472 can be solved by LEO-II if the correct strategy is picked. LEO-MaLeS runs 5 start strategies, each with a 1 second time limit. Using more start strategies only marginally increases the number of problems solved by the start strategies. LEO-II's default mode solves 415 of the training problems (72.4%), and 367 of the test problems (36.7%). LEO-MaLeS improves this to 441 (77.0%) and 417 (41.7%) solved problems respectively. Figure 7.9 and Figure 7.10 show the graphs. Figure 7.11 shows the results for only the 699 problems that are not part of the training problems.

7Description from the LEO-II website www.leoprover.org.


Figure 7.9: Performance graph for LEO-MaLeS 1.2 on the training problems.

Figure 7.10: Performance graph for LEO-MaLeS 1.2 on the test problems.

Between 7 and 20 seconds, both provers solve approximately the same number of problems. For all other time limits, LEO-MaLeS solves more. On the test problems, a similar time frame is problematic for LEO-MaLeS. LEO-II solves more problems than LEO-MaLeS between 5 and 30 seconds. For other time limits, LEO-MaLeS solves more


Figure 7.11: Performance graph for LEO-MaLeS 1.2 on the unseen test problems.

problems than LEO-II. This behaviour indicates that the initial predictions of LEO-MaLeS are wrong. Better features could help remedy this problem. The sudden jump in the number of solved problems at around 30 seconds on the test dataset seems peculiar. Upon inspection, we found that 42 out of 43 problems solved in the 30-35 seconds timeframe are from the SEU (Set Theory) problem domain. These problems have very similar features and hence similar strategy schedules. 34 of the 43 problems were solved by the same strategy.

7.4.4 Further Remarks

There are a few things to note that are independent of the underlying prover.

Multicore Evaluations: All the evaluations were done on multicore machines, a 64 core AMD Opteron Processor 6276 with 1.4GHz per CPU and 256 GB of RAM and a 32 core Intel Xeon with 2.6GHz per CPU and 256 GB of RAM. All runtimes were measured in wall-clock time. During the evaluation we noticed irregularities in the runtime of the ATPs. When running a single instance of an ATP, the time needed to solve a problem often differed from the result we got when running several instances in parallel, even when using less than the maximum number of cores. It turns out that the number of cores used during the evaluation heavily influences the performance. The more cores, the worse the ATPs performed. We were not able to completely determine the cause, but the speed of the hard disk drive, shared cache and process swapping are all possible explanations. Reducing the hard disk drive load by changing the behaviour of MaLeS from loading all

models at the very beginning to loading them only when they are needed did lead to more (and faster) solved problems. Eventually, all evaluation experiments (apart from the strategy searches for the sets of preselected strategies) were redone using only 20 out of 64 / 14 out of 32 cores, and the results reported here are based on those runs.

How Good are the Predictions? Apart from the total number of solved problems, the quality of the predictions is also of interest. In short, they are not very good. The predictions of MaLeS are already heavily biased because the unsolvable problems are ignored (Section 7.3.3). Reducing the number of training problems during the update phase makes the predictions even less reliable. For some strategies, the average difference between the actual and predicted runtimes exceeds 40 seconds. Two heuristics were added to help MaLeS deal with this uncertainty. First, the predicted runtime must always exceed the minimal runtime of the training data. This prevents unreasonably low (in particular negative) predictions. Second, if the number of training problems is less than a predefined minimum (set to 5), then the predicted runtime is the maximum runtime of the training data. That MaLeS nevertheless gives good results is likely due to the fact that the tested ATPs all utilize either no or very basic strategy scheduling.
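The two heuristics can be sketched as a small wrapper around the raw prediction (the threshold of 5 follows the text; the names are illustrative, not MaLeS's actual code):

def robust_prediction(raw_prediction, train_runtimes, min_problems=5):
    # Too few training problems left: fall back to the slowest observed runtime.
    if len(train_runtimes) < min_problems:
        return max(train_runtimes)
    # Otherwise never predict below the fastest observed runtime.
    return max(raw_prediction, min(train_runtimes))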

The Impact of the Learning Parameters: Tables 7.7 and 7.8 show the learning parameters of MaLeS. Tolerance, StartStrategies and StartStrategiesTime had the greatest impact in our experiments. Tolerance influences the number of strategies used in MaLeS. A low value means more strategies, a high value fewer. For E and LEO, higher values (1.0-15.0 seconds) gave better results since fewer irrelevant strategies were run. Satallax performed slightly better with a low tolerance, which is probably due to the fact that it can solve almost every problem in less than a second. The values for StartStrategies and StartStrategiesTime determine how many problems are left for learning. 10 StartStrategies with a 1 second StartStrategiesTime are good default values for the provers tested. For LEO-II we found that the number of solved problems barely increased after 5 seconds, and hence changed the number of StartStrategies to 5.

7.4.5 CASC

MaLeS 1.2 is the third iteration of the MaLeS framework. E-MaLeS 1.0 competed at CASC-23, E-MaLeS 1.1 at CASC@Turing and CASC-J6, and E-MaLeS 1.2 at CASC-24. Satallax-MaLeS competed for the first time at CASC-24. We give an overview of the older versions, their CASC performance and the changes over the years.

CASC-23

E-MaLeS 1.0 [54] was the first MaLeS version to compete at CASC. Stephan Schulz provided us with a set of strategies and information about their performance on all TPTP problems. This data was used to train a kernel-based classification model for each strategy. Given the features of a problem p, the classification models predict whether or not a strategy can solve p. Altogether, three strategies were run. First E's auto mode was run for 60


Table 7.2: Results of the FOF division of CASC 23

ATP                Vampire 0.6   Vampire 1.8   E-MaLeS 1.0   EP 1.4 pre
Solved             269/300       263/300       233/300       232/300
Average CPU Time   12.95         13.62         18.85         22.55

Table 7.3: Results of the FOF division of CASC-J6

ATP                Vampire 2.6   E-MaLeS 1.1   EP 1.6 pre   Vampire 0.6
Solved             429/450       377/450       359/450      355/450
Average CPU Time   13.17         17.85         13.46        11.81

Table 7.4: Results of the FOF division of CASC@Turing

ATP                Vampire 2.6   E-MaLeS 1.1   EP 1.6 pre   Vampire 0.6
Solved             469/500       401/500       378/500      368/500
Average CPU Time   20.26         20.81         14.49        16.40

seconds, then the strategy with the highest probability of solving the problem, as predicted by a Gaussian kernel classifier, was run for 120 seconds. Finally, the strategy with the highest probability of solving the problem as predicted by a linear (dot-product) kernel classifier was run for the remainder of the available time. E-MaLeS 1.0 won third place in the FOF division. Table 7.2 shows the results.

CASC@Turing and CASC-J6

E-MaLeS 1.1 [58] changed the learning from classification to regression. Like E-MaLeS 1.0, E-MaLeS 1.1 learned from (an updated version of) Schulz's data. Instead of predicting which strategy to run, E-MaLeS 1.1 learned runtime prediction functions. The learning method is the same as the one presented in this chapter, without the updating of the prediction functions. E-MaLeS 1.1 first ran E's auto mode for 60 seconds. Afterwards, each strategy was run for its predicted runtime, starting with the strategy with the lowest predicted runtime. E-MaLeS 1.1 won second place in the FOF divisions of both CASC@Turing (Table 7.4) and CASC-J6 (Table 7.3). It also came fourth in the LTB division of CASC-J6.

CASC-24

E-MaLeS 1.2 and Satallax-MaLeS 1.2 competed at CASC-24, both based on the algorithms presented in this chapter. E-MaLeS 1.2 used Schulz's strategies as start strategies for find_strategies. It is the first E-MaLeS that was not based on the CASC version of E (E 1.7 in E-MaLeS 1.2 vs E 1.8). E-MaLeS 1.2 got fourth place in the FOF division,


Table 7.5: Results of the FOF division of CASC 24

ATP                Vampire 2.6   Vampire 3.0   EP 1.8    E-MaLeS 1.2
Solved             281/300       274/300       249/300   237/300
Average CPU Time   12.24         10.91         29.02     14.52

Table 7.6: Results of the THF division of CASC 24

ATP                Satallax-MaLeS 1.2   Satallax   Isabelle 2013
Solved             119/150              116/150    108/150
Average CPU Time   10.42                11.39      54.65

losing to two versions of Vampire, and E 1.8. Several significant changes were introduced in E 1.8, in particular new strategies and E's own strategy scheduling. Satallax-MaLeS won first place in the THF division, ahead of Satallax. The results can be seen in Tables 7.5 and 7.6.

7.5 Using MaLeS

MaLeS aims to be a general ATP tuning framework. In this section, we show how to set up E-MaLeS, LEO-MaLeS and Satallax-MaLeS, how to tune any of those provers on new problems, and how to use MaLeS with a completely new prover. The first step is to download the MaLeS git repository via

git clone https://code.google.com/p/males/

MaLeS requires Python 2.7, Numpy 1.6 or later, and Scipy 0.10 or later [69]. Installation instructions for Numpy and Scipy can be found at http://www.scipy.org/install.html.

7.5.1 E-MaLeS, LEO-MaLeS and Satallax-MaLeS

Setting up any of the presented systems can be done in three steps.

1. Install the ATP (E, LEO-II or Satallax).

2. Run the configuration script with the location of the prover as argument. For example

EConfig.py --location=../E/PROVER

for E-MaLeS.

3. Learn the prediction function via


MaLeS/learn.py

After the installation, MaLeS can be used by running

MaLeS/males.py -t 30 -p test/PUZ001+1.p

where -t denotes the time limit and -p the problem to be solved.

7.5.2 Tuning E, LEO-II or Satallax for a New Set of Problems

Tuning an ATP for a particular dataset involves finding good search strategies and learning prediction models. The search behaviour is defined in the file setup.ini in the main directory. Using the default search behaviour, E, LEO-II and Satallax can be tuned for new data as follows:

1. Install the ATP (E, LEO-II or Satallax).

2. Run the configuration script with the location of the prover as argument. For example

EConfig.py --location=../E/PROVER

for E-MaLeS.

3. Store the absolute pathnames of the problems in a new file with one problem per line and change the PROBLEM parameter in setup.ini to the file containing the problem paths.

4. Find promising strategies by searching with a short time limit (which is the default setup):

MaLeS/findStrategies.py

5. Run all promising strategies for a longer time. For this several parameters need to be changed.

   a) Copy the value of ResultsDir to TmpResultsDir.
   b) Copy the value of ResultsPickle to TmpResultsPickle.
   c) Change the value of ResultsDir to a new directory.
   d) Change the value of ResultsPickle to a new file.
   e) Change Time in search to the maximal runtime (in seconds), e.g. 300.
   f) Set FullTime to True.
   g) Set TryWithNewDefaultTime to True.

6. Run findStrategies again:


MaLeS/findStrategies.py

7. The newly found strategies are stored in ResultsDir. MaLeS can now learn from these strategies via

MaLeS/learn.py

For completeness, Tables 7.7 and 7.8 contain a list of all parameters in setup.ini with their descriptions.

Table 7.7: Parameters of MaLeS

Settings
Parameter          Description
TPTP               The TPTP directory. Not required.
TmpDir             Directory for temporary files.
Cores              How many cores to use.
ResultsDir         Directory where the results of findStrategies are stored.
ResultsPickle      Directory where the models are stored.
TmpResultsDir      Like ResultsDir, but only used if TryWithNewDefaultTime is True.
TmpResultsPickle   Like ResultsPickle, but only used if TryWithNewDefaultTime is True.
Clear              If True, all existing results are ignored and MaLeS starts from scratch.
LogToFile          If True, a log file is created.
LogFile            Name of the log file.

Search
Parameter               Description
Time                    Maximal runtime during search.
Problems                File with the absolute pathnames of the problems.
FullTime                If True, the ATP is run for the value of Time. If False, it is run for the rounded minimal time required to solve the problem.
TryWithNewDefaultTime   If True, findStrategies uses the best strategies from TmpResultsDir and TmpResultsPickle as start strategies for a new search.
Walks                   How many different strategies are tried in the local search step.
WalkLength              Up to this many parameters are changed for each strategy in the local search step.


Table 7.8: Parameters of MaLeS (cont.)

Learn
Parameter              Description
Features               Which features to use. Possible values are E for the E features and TPTP for the TPTP features.
FeaturesFile           Location of the feature file.
StrategiesFile         Location of the strategies file.
KernelFile             Location of the file containing the kernel matrices.
RegularizationGrid     Possible values for λ.
KernelGrid             Possible values for σ.
CrossValidate          If False, no crossvalidation is done during learning. Instead the first values in RegularizationGrid and KernelGrid are used.
CrossValidationFolds   How many folds to use during crossvalidation.
StartStrategies        Number of start strategies.
StartStrategiesTime    Runtime of each start strategy.
CPU Bias               This value is added to each runtime before learning. Serves as a buffer against runtime irregularities.
Tolerance              For a strategy s to be considered a good strategy, there must be at least one problem where the difference between the best runtime of any strategy and the runtime of s is at most this value.

Run
Parameter        Description
CPUSpeedRatio    Predicted runtimes are multiplied with this value. Useful if the training was done on a different machine.
MinRunTime       Minimal time a strategy is run.
Features         Either TPTP for higher order features or E for first order features.
StrategiesFile   Location of the strategies file.
FeaturesFile     Location of the feature file.
OutputFile       If not None, the output of MaLeS is stored in this file.


7.5.3 Using a New Prover

The behaviour of MaLeS is defined in three configuration files: ATP.ini defines the ATP and its parameters, setup.ini configures the searching and learning of MaLeS, and strategies.ini contains the default strategies of the ATP that form the starting point of the strategy search for the set of preselected strategies. To use a new prover, ATP.ini and strategies.ini need to be adapted. Table 7.9 describes the parameters in ATP.ini.

Table 7.9: Parameters in ATP.ini

ATP Settings
Parameter   Description
binary      Path to the ATP binary.
time        Argument used to denote the time limit.
problem     Argument used to denote the problem.
strategy    Defines how parameters are given to the ATP. Three styles are supported: E, LEO and Satallax.
default     Any default parameters that should always be used.

The section Boolean Parameters contains all flags that are given without a value. List Parameters contains flags which require a value, together with their possible values. MaLeS searches strategies in the parameter space defined by Boolean Parameters and List Parameters. Running EConfig.py creates the configuration file for E, which can serve as an example. Different ATPs have (unfortunately) different input formats for search parameters. MaLeS currently supports three formats: E, LEO or Satallax. Each format corresponds to the format of the respective ATP. Table 7.10 lists the differences. New formats need to be hardcoded in the file Strategy.py.

Table 7.10: ATP Formats

Format     Description
E          Parameters and their values are joined by = if the parameter starts with --. Else the parameter is directly joined with its value. For example --ordering=3 -sine13.
LEO        Parameters and their values are joined by a space. For example --ordering 3.
Satallax   The parameters are written in a new mode file M. The ATP is then called with ATP -m M.

Strategies defined in strategies.ini are used to initialize the strategy queue during the strategy searching for the set of preselected strategies. The default ini format is used. Each strategy is its own section with each parameter on a separate line. For example


[NewStrategy12884]
FILTER_START = 0
ENUM_IMP = 100
INITIAL_SUBTERMS_AS_INSTANTIATIONS = true
E_TIMEOUT = 1
POST_CONFRONT3_DELAY = 1000
FORALL_DELAY = 0
LEIBEQ_TO_PRIMEQ = true

At least one strategy must be defined. After the ini files are adapted, the new ATP can be tuned and run using the procedure defined in the last two sections.

7.6 Future Work

Apart from simplifying the installation and set up, there are several other ways to improve MaLeS. We present the three most promising ones.

Features: The quality of the runtime prediction function is limited by the quality of the features. Adding new features and using feature selection methods could increase the prediction capabilities of MaLeS.

Strategy Finding: As an alternative to randomized hill climbing, different search algorithms should be supported. In particular, simulated annealing and genetic algorithms seem promising. The biggest problem of the current implementation, the time it needs to find good strategies, could be alleviated by using a clusterized local search similar to the one employed in BliStr [107].

Strategy Prediction: The runtime prediction functions are the heart of MaLeS. Machine learning offers dozens of different regression methods which could be used instead of the kernel methods of MaLeS. A big drawback of the current method is that it scales badly due to the need to invert a new matrix after every tried strategy. A nearest neighbour approach would eliminate the need for matrix computations and also remove the dependency on Numpy and Scipy.

7.7 Conclusion

Finding the best parameter settings and strategy schedules for an ATP is a time consuming task that often requires in-depth knowledge of how the ATP works. MaLeS is an automatic tuning framework for ATPs. Given the possible parameter settings of an ATP and a set of problems, MaLeS finds good search strategies and creates individual strategy schedules. MaLeS currently supports E, LEO-II and Satallax and can easily be extended to work with other provers. Experiments with the ATPs E, LEO-II and Satallax showed that the MaLeS version performs at least comparably to the respective default strategy selection algorithm. In

some cases, the MaLeS optimized version solves considerably more problems than the untuned ATP. MaLeS simplifies the workflow for both ATP users and developers. It allows ATP users to fine-tune ATPs to their specific problems and helps ATP developers to focus on actual improvements instead of time-consuming parameter tuning.


Contributions

The detailed contributions to each chapter are listed here. Josef Urban proofread the complete thesis and suggested many improvements.

Chapter 1 and Chapter 2 are, apart from Sections 1.3 and 2.2, based on joint work with Jasmin Blanchette. The paper is titled "A Survey of Axiom Selection as a Machine Learning Problem" and was submitted to "Infinity, computability, and metamathematics. Festschrift celebrating the 60th birthdays of Peter Koepke and Philip Welch". I did most of the writing and all evaluations; Jasmin provided the raw Isabelle data upon which the figures and tables are based.

Section 2.2 is based on [2] "Premise Selection for Mathematics by Corpus Analysis and Kernel Methods", published in the Journal of Automated Reasoning. Section 2.2.1 is mostly written by Josef Urban. Section 2.2.2 was written by me and proofread by Tom Heskes. Section 2.2.3 was originally written by Evgeni Tsivtsivadze; the current version was mainly written by myself with input from all other co-authors. Evgeni had the idea for Section 2.2.4. The implementation was done by myself (based on earlier code by Evgeni). The text of the section was written by myself, Evgeni, and Tom.

Chapter 3 is based on [57] "Overview and Evaluation of Premise Selection Techniques for Large Theory Mathematics", published in the Proceedings of the 6th International Joint Conference on Automated Reasoning. The introduction is based on an earlier workshop paper by Josef Urban [106]. SNoW had already been used in earlier work by Josef [102, 105]. MOR-CG was developed by me with help from Evgeni Tsivtsivadze. Twan van Laarhoven created BiLi. Section 3.2 is my work. The data for Section 3.3 was created by me and Josef. I wrote the text, Josef proofread it. Section 3.4 was done by Josef. Tom Heskes, Evgeni and Twan helped with polishing the paper.

Chapter 4 is based on [55] "Learning from Multiple Proofs: First Experiments", published in the Proceedings of the 3rd Workshop on Practical Aspects of Automated Reasoning. I did the writing and the machine learning experiments. Josef Urban did the ATP evaluations and proofread the paper.


Chapter 5 is based on [3] "Automated and Human Proofs in General Mathematics: An Initial Comparison", published in the Proceedings of the 18th International Conference on Logic for Programming, Artificial Intelligence, and Reasoning. All three authors contributed equally to the paper. The proof dependencies in Section 5.2.1 were created by Jesse Alama and Josef Urban. The final version of the text of 5.2.1 was written by Josef as a part of our joint work on paper [2] "Premise Selection for Mathematics by Corpus Analysis and Kernel Methods", published in the Journal of Automated Reasoning.

Chapter 6 is based on [56] "MaSh: Machine Learning for Sledgehammer", published in the Proceedings of the 4th International Conference on Interactive Theorem Proving. Jasmin Blanchette developed the Isabelle side of MaSh. I programmed the Python part of MaSh and was responsible for the machine learning evaluation. Cezary Kaliszyk and Jasmin did the ATP evaluation. Josef Urban did some proofreading and acted as general advisor.

Chapter 7 is based on [53] "MaLeS: A Framework for Automatic Tuning of Automated Theorem Provers", currently under review at the Journal of Automated Reasoning, and on an extension of joint work with Stephan Schulz and Josef Urban [58] "E-MaLeS 1.1", published in the Proceedings of the 24th Conference on Automated Deduction. I wrote this paper, implemented the MaLeS system, and did all the experiments. Josef Urban advised me. Both Stephan Schulz and Josef helped with proofreading the paper. The writing of the earlier E-MaLeS 1.1 paper was done equally by me and Josef; Stephan contributed some E-related parts and suggested improvements.

Bibliography

[1] The Mizar Mathematical Library. http://mizar.org/.

[2] Jesse Alama, Tom Heskes, Daniel Kühlwein, Evgeni Tsivtsivadze, and Josef Urban. Premise Selection for Mathematics by Corpus Analysis and Kernel Methods. Journal of Automated Reasoning, pages 1–23, 2013. doi:10.1007/s10817-013-9286-5.

[3] Jesse Alama, Daniel Kühlwein, and Josef Urban. Automated and Human Proofs in General Mathematics: An Initial Comparison. In Nikolaj Bjørner and Andrei Voronkov, editors, Logic for Programming, Artificial Intelligence, and Reasoning, volume 7180 of Lecture Notes in Computer Science, pages 37–45. Springer, 2012. doi:10.1007/978-3-642-28717-6_6.

[4] Jesse Alama, Lionel Mamane, and Josef Urban. Dependencies in Formal Mathematics: Applications and Extraction for Coq and Mizar. In Johan Jeuring, John A. Campbell, Jacques Carette, Gabriel Reis, Petr Sojka, Makarius Wenzel, and Volker Sorge, editors, Intelligent Computer Mathematics, volume 7362 of Lecture Notes in Computer Science, pages 1–16. Springer, 2012. doi:10.1007/978-3-642-31374-5_1.

[5] N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68, 1950. doi:10.1090/S0002-9947-1950-0051437-7.

[6] Jeremy Avigad, Kevin Donnelly, David Gray, and Paul Raff. A formally verified proof of the prime number theorem. ACM Transactions on Computational Logic (TOCL), 9(1):2, 2007. doi:10.1145/1297658.1297660.


[7] Thomas Ball, Ella Bounimova, Vladimir Levin, Rahul Kumar, and Jakob Lichtenberg. The Static Driver Verifier Research Platform. In Tayssir Touili, Byron Cook, and Paul Jackson, editors, Computer Aided Verification, volume 6174 of Lecture Notes in Computer Science, pages 119–122. Springer, 2010. doi:10.1007/978-3-642-14295-6_11.

[8] Clark Barrett and Cesare Tinelli. CVC3. In Werner Damm and Holger Hermanns, editors, Computer Aided Verification, volume 4590 of Lecture Notes in Computer Science, pages 298–302. Springer, 2007. doi:10.1007/978-3-540-73368-3_34.

[9] Christoph Benzmüller, Lawrence C. Paulson, Frank Theiss, and Arnaud Fietzke. LEO-II - A Cooperative Automatic Theorem Prover for Classical Higher-Order Logic (System Description). In Alessandro Armando, Peter Baumgartner, and Gilles Dowek, editors, Automated Reasoning, volume 5195 of Lecture Notes in Computer Science, pages 162–170. Springer, 2008. doi:10.1007/978-3-540-71070-7_14.

[10] Stefan Berghofer and Tobias Nipkow. Proof Terms for Simply Typed Higher Order Logic. In Mark Aagaard and John Harrison, editors, Theorem Proving in Higher Order Logics, volume 1869 of Lecture Notes in Computer Science, pages 38–52. Springer, 2000. doi:10.1007/3-540-44659-1_3.

[11] Yves Bertot and Pierre Castéran. Interactive Theorem Proving and Program Development—Coq'Art: The Calculus of Inductive Constructions. Texts in Theoretical Computer Science. Springer, 2004.

[12] Ella Bingham and Heikki Mannila. Random Projection in Dimensionality Reduction: Applications to Image and Text Data. In Proceedings of the Seventh International Conference on Knowledge Discovery and Data Mining, pages 245–250. ACM Press, 2001. doi:10.1145/502512.502546.

[13] Christopher M. Bishop. Pattern Recognition and Machine Learning. Information Science and Statistics. Springer, 2006.

[14] Jasmin Christian Blanchette, Sascha Böhme, Andrei Popescu, and Nicholas Smallbone. Encoding Monomorphic and Polymorphic Types. In Nir Piterman and Scott Smolka, editors, Proceedings of the 19th international conference on Tools and


Algorithms for the Construction and Analysis of Systems, volume 7795 of Lecture Notes in Computer Science. Springer, 2013. doi:10.1007/978-3-642-36742-7_34.

[15] Jasmin Christian Blanchette, Lukas Bulwahn, and Tobias Nipkow. Automatic Proof and Disproof in Isabelle/HOL. In Cesare Tinelli and Viorica Sofronie-Stokkermans, editors, Proceedings of the 8th international conference on Frontiers of combining systems, volume 6989 of Lecture Notes in Computer Science, pages 12–27. Springer, 2011. doi:10.1007/978-3-642-24364-6_2.

[16] Jasmin Christian Blanchette, Sascha Böhme, and Lawrence C. Paulson. Extending Sledgehammer with SMT solvers. Journal of Automated Reasoning, 51(1):109–128, 2013. doi:10.1007/s10817-013-9278-5.

[17] Jasmin Christian Blanchette, Andrei Popescu, Daniel Wand, and Christoph Weidenbach. More SPASS with Isabelle. In Lennart Beringer and Amy Felty, editors, Interactive Theorem Proving, volume 7406 of Lecture Notes in Computer Science, pages 345–360. Springer, 2012. doi:10.1007/978-3-642-32347-8_24.

[18] James P. Bridge. Machine learning and automated theorem proving. University of Cambridge, Computer Laboratory, Technical Report, (792), 2010.

[19] Chad E. Brown. Satallax: An Automatic Higher-Order Prover. In Bernhard Gramlich, Dale Miller, and Uli Sattler, editors, Automated Reasoning, volume 7364 of Lecture Notes in Computer Science, pages 111–117. Springer, 2012. doi:10.1007/978-3-642-31365-3_11.

[20] Sascha Böhme and Tobias Nipkow. Sledgehammer: Judgement Day. In Jürgen Giesl and Reiner Hähnle, editors, Automated Reasoning, volume 6173 of Lecture Notes in Computer Science, pages 107–121. Springer, 2010. doi:10.1007/978-3-642-14203-1_9.

[21] Andy Carlson, Chad Cumby, Jeff Rosen, and Dan Roth. The SNoW Learning Architecture. Technical Report UIUCDCS-R-99-2101, UIUC Computer Science Department, May 1999.

[22] Gregory J. Chaitin. The Omega Number: Irreducible Complexity in Pure Math. In Jonathan M. Borwein and William M. Farmer, editors, Proceedings of the 5th


International Conference on Mathematical Knowledge Management, volume 4108 of Lecture Notes in Computer Science, page 1. Springer, 2006. doi:10.1007/11812289_1.

[23] Wei Chu and Seung-Taek Park. Personalized Recommendation on Dynamic Content Using Predictive Bilinear Models. In Proceedings of the 18th International Conference on World Wide Web, pages 691–700. ACM Press, 2009. doi:10.1145/1526709.1526802.

[24] Marcos Cramer, Peter Koepke, Daniel Kühlwein, and Bernhard Schröder. Premise Selection in the Naproche System. In Jürgen Giesl and Reiner Hähnle, editors, Automated Reasoning, volume 6173 of Lecture Notes in Computer Science, pages 434–440. Springer, 2010. doi:10.1007/978-3-642-14203-1_37.

[25] Ingo Dahn. Robbins Algebras Are Boolean: A Revision of McCune's Computer-Generated Solution of Robbins Problem. Journal of Algebra, 208:526–532, 1998. doi:10.1006/jabr.1998.7467.

[26] Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990. doi:10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9.

[27] Kosta Dosen. Identity of proofs based on normalization and generality. Bulletin of Symbolic Logic, 9:477–503, 2003.

[28] Bruno Dutertre and Leonardo de Moura. The Yices SMT solver. http://yices.csl.sri.com/tool-paper.pdf, 2006.

[29] Branden Fitelson. Using Mathematica to understand the computer proof of the Robbins Conjecture. Mathematica In Education and Research, 7(1), 1998.

[30] Gottlob Frege. Begriffsschrift, eine der arithmetischen nachgebildete Formelsprache des reinen Denkens. Verlag von Louis Nebert, Halle, 1879.

[31] Matthias Fuchs. Automatic Selection Of Search-Guiding Heuristics For Theorem Proving. In Proceedings of the 10th Florida AI Research Society Conference, pages 1–5. Florida AI Research Society, 1998.


[32] Kurt Gödel. Über formal unentscheidbare Sätze der Principia Mathematica und verwandter Systeme I. Monatshefte für Mathematik und Physik, 38(1):173–198, 1931.

[33] Georges Gonthier. Formal Proof—The Four-Color Theorem. Notices of the AMS, 55(11):1382–1393, 2008.

[34] Georges Gonthier, Andrea Asperti, Jeremy Avigad, Yves Bertot, Cyril Cohen, François Garillot, Stéphane Le Roux, Assia Mahboubi, Russell O'Connor, Sidi Ould Biha, Ioana Pasca, Laurence Rideau, Alexey Solovyev, Enrico Tassi, and Laurent Théry. A Machine-Checked Proof of the Odd Order Theorem. In Sandrine Blazy, Christine Paulin-Mohring, and David Pichardie, editors, Interactive Theorem Proving, volume 7998 of Lecture Notes in Computer Science, pages 163–179. Springer, 2013. doi:10.1007/978-3-642-39634-2_14.

[35] Adam Grabowski, Artur Korniłowicz, and Adam Naumowicz. Mizar in a Nutshell. Journal of Formalized Reasoning, 3(2):153–245, 2010. doi:10.6092/issn.1972-5787/1980.

[36] Isabelle Guyon and André Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182, March 2003.

[37] Thomas C. Hales. Introduction to the Flyspeck project. In Thierry Coquand, Henri Lombardi, and Marie-Françoise Roy, editors, Mathematics, Algorithms, Proofs, volume 05021 of Dagstuhl Seminar Proceedings. Internationales Begegnungs- und Forschungszentrum für Informatik (IBFI), Schloss Dagstuhl, Germany, 2005.

[38] Thomas C. Hales. Mathematics in the age of the Turing machine. Lecture Notes in Logic, 2012. To appear; http://www.math.pitt.edu/~thales/papers/turing.pdf.

[39] John Harrison. HOL Light: A tutorial introduction. In Mandayam Srivas and Albert Camilleri, editors, Formal Methods in Computer-Aided Design, volume 1166 of Lecture Notes in Computer Science, pages 265–269. Springer, 1996. doi:10.1007/BFb0031814.

[40] John Harrison. Formal verification of IA-64 division algorithms. In Mark Aagaard and John Harrison, editors, Theorem Proving in Higher Order Logics, volume 1869


of Lecture Notes in Computer Science, pages 233–251. Springer, 2000. doi:10.1007/3-540-44659-1_15.

[41] Kryštof Hoder and Andrei Voronkov. Sine Qua Non for Large Theory Reasoning. In Nikolaj Bjørner and Viorica Sofronie-Stokkermans, editors, Automated Deduction, volume 6803 of Lecture Notes in Computer Science, pages 299–314. Springer, 2011. doi:10.1007/978-3-642-22438-6_23.

[42] Johannes Hölzl and Armin Heller. Three chapters of measure theory in Isabelle/HOL. In Marko C. J. D. van Eekelen, Herman Geuvers, Julien Schmaltz, and Freek Wiedijk, editors, Proceedings of the 2nd Conference on Interactive Theorem Proving, volume 6898 of Lecture Notes in Computer Science, pages 135–151. Springer, 2011. doi:10.1007/978-3-642-22863-6_12.

[43] Joe Hurd. First-order proof tactics in higher-order logic theorem provers. In Myla Archer, Ben Di Vito, and César Muñoz, editors, Design and Application of Strategies/Tactics in Higher Order Logics, number CP-2003-212448 in NASA Tech. Reports, pages 56–68, 2003.

[44] Frank Hutter, Holger H. Hoos, Kevin Leyton-Brown, and Thomas Stützle. ParamILS: An Automatic Algorithm Configuration Framework. Journal of Artificial Intelligence Research, 36:267–306, October 2009. doi:10.1613/jair.2861.

[45] Cezary Kaliszyk and Josef Urban. Learning-assisted Automated Reasoning with Flyspeck. CoRR, abs/1211.7012, 2012. http://arxiv.org/abs/1211.7012.

[46] Cezary Kaliszyk and Josef Urban. Automated Reasoning Service for HOL Light. In Jacques Carette, David Aspinall, Christoph Lange, Petr Sojka, and Wolfgang Windsteiger, editors, Intelligent Computer Mathematics, volume 7961 of Lecture Notes in Computer Science, pages 120–135. Springer, 2013. doi:10.1007/978-3-642-39320-4_8.

[47] Matt Kaufmann, Panagiotis Manolios, and J Strother Moore. Computer-Aided Reasoning: An Approach. Kluwer Academic Publishers, 2000.

[48] Gerwin Klein, June Andronick, Kevin Elphinstone, Gernot Heiser, David Cock, Philip Derrin, Dhammika Elkaduwe, Kai Engelhardt, Rafal Kolanski, Michael


Norrish, Thomas Sewell, Harvey Tuch, and Simon Winwood. seL4: formal verification of an operating-system kernel. Communications of the ACM, 53(6):107–115, 2010. doi:10.1145/1743546.1743574.

[49] Gerwin Klein and Tobias Nipkow. Jinja is not Java. In Gerwin Klein, Tobias Nipkow, and Lawrence Paulson, editors, Archive of Formal Proofs. http://afp.sf.net/entries/Jinja.shtml, 2005.

[50] Gerwin Klein, Tobias Nipkow, and Lawrence Paulson, editors. Archive of Formal Proofs. http://afp.sf.net/.

[51] Ron Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the 14th international joint conference on Artificial intelligence - Volume 2, pages 1137–1143. Morgan Kaufmann Publishers Inc., 1995.

[52] Steven G Krantz. The history and concept of mathematical proof. 2007.

[53] D. Kühlwein and J. Urban. MaLeS: A Framework for Automatic Tuning of Automated Theorem Provers. ArXiv e-prints, August 2013. arXiv:1308.2116.

[54] Daniel Kühlwein, Stephan Schulz, and Josef Urban. Experiments with Strategy Learning for E Prover. In 2nd Joint International Workshop on Strategies in Rewriting, Proving and Programming. 2012.

[55] Daniel Kühlwein and Josef Urban. Learning from multiple proofs: First experiments. In Pascal Fontaine, Renate A. Schmidt, and Stephan Schulz, editors, Practical Aspects of Automated Reasoning, volume 21 of EPiC Series, pages 82–94. EasyChair, 2013.

[56] Daniel Kühlwein, Jasmin Christian Blanchette, Cezary Kaliszyk, and Josef Urban. MaSh: Machine Learning for Sledgehammer. In Sandrine Blazy, Christine Paulin-Mohring, and David Pichardie, editors, Interactive Theorem Proving, volume 7998 of Lecture Notes in Computer Science, pages 35–50. Springer, 2013. doi:10.1007/978-3-642-39634-2_6.

[57] Daniel Kühlwein, Twan van Laarhoven, Evgeni Tsivtsivadze, Josef Urban, and Tom Heskes. Overview and Evaluation of Premise Selection Techniques for Large Theory Mathematics. In Bernhard Gramlich, Dale Miller, and Uli Sattler, editors,


Automated Reasoning, volume 7364 of Lecture Notes in Computer Science, pages 378–392. Springer, 2012. doi:10.1007/978-3-642-31365-3_30.

[58] Daniel Kühlwein, Stephan Schulz, and Josef Urban. E-MaLeS 1.1. In Maria Paola Bonacina, editor, Automated Deduction – CADE-24, volume 7898 of Lecture Notes in Computer Science, pages 407–413. Springer, 2013. doi:10.1007/978-3-642-38574-2_28.

[59] Huan Liu and Hiroshi Motoda. Feature Selection for Knowledge Discovery and Data Mining. Kluwer Academic Publishers, 1998. doi:10.1007/978-1-4615-5689-3.

[60] David J.C. MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003. doi:10.2277/0521642981.

[61] Roman Matuszewski and Piotr Rudnicki. Mizar: The first 30 years. Mechanized Mathematics and Its Applications, 4:3–24, 2005.

[62] William McCune. Solution of the Robbins problem. Journal of Automated Reasoning, 19(3):263–276, 1997. doi:10.1023/A:1005843212881.

[63] William McCune. Prover9 and Mace4. http://www.cs.unm.edu/~mccune/prover9/, 2005–2010.

[64] Jia Meng and Lawrence C. Paulson. Lightweight relevance filtering for machine-generated resolution problems. Journal of Applied Logic, 7(1):41–57, 2009. doi:10.1016/j.jal.2007.07.004.

[65] J. Strother Moore, Thomas W. Lynch, and Matt Kaufmann. A mechanically checked proof of the AMD5K86™ floating point division program. IEEE Transactions on Computers, 47(9):913–926, 1998. doi:10.1109/12.713311.

[66] Leonardo de Moura and Nikolaj Bjørner. Z3: An Efficient SMT Solver. In C.R. Ramakrishnan and Jakob Rehof, editors, Tools and Algorithms for the Construction and Analysis of Systems, volume 4963 of Lecture Notes in Computer Science, pages 337–340. Springer, 2008. doi:10.1007/978-3-540-78800-3_24.

[67] Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.

[68] Tobias Nipkow, Lawrence C. Paulson, and Markus Wenzel. Isabelle/HOL: A Proof Assistant for Higher-Order Logic, volume 2283 of Lecture Notes in Computer Science. Springer, 2002.

[69] Travis E. Oliphant. Python for scientific computing. Computing in Science & Engineering, 9(3):10–20, 2007. doi:10.1109/MCSE.2007.58.

[70] Jens Otten and Wolfgang Bibel. leanCoP: lean connection-based theorem proving. Journal of Symbolic Computation, 36(1–2):139–161, 2003. doi:10.1016/S0747-7171(03)00037-3.

[71] Sam Owre and Natarajan Shankar. A brief overview of PVS. In Theorem Proving in Higher Order Logics, volume 5170 of Lecture Notes in Computer Science, pages 22–27. Springer, 2008. doi:10.1007/978-3-540-71067-7_5.

[72] Lawrence C. Paulson. The inductive approach to verifying cryptographic protocols. Journal of Computer Security, 6(1-2):85–128, 1998.

[73] Lawrence C. Paulson and Jasmin Christian Blanchette. Three years of experience with Sledgehammer, a Practical Link Between Automatic and Interactive Theorem Provers. In Geoff Sutcliffe, Stephan Schulz, and Eugenia Ternovska, editors, International Workshop on the Implementation of Logics 2010, volume 2 of EPiC Series, pages 1–11. EasyChair, 2012.

[74] J.D. Phillips and D. Stanovský. Automated Theorem Proving in Loop Theory. In G. Sutcliffe, S. Colton, and S. Schulz, editors, Proceedings of the Workshop on Empirically Successful Automated Reasoning for Mathematics, number 378 in CEUR Workshop Proceedings, pages 42–53, 2008.

[75] Robi Polikar. Ensemble based systems in decision making. Circuits and Systems Magazine, IEEE, 6(3):21–45, 2006. doi:10.1109/MCAS.2006.1688199.

[76] Petr Pudlák. Semantic Selection of Premisses for Automated Theorem Proving. In Geoff Sutcliffe, Josef Urban, and Stephan Schulz, editors, Proceedings of the CADE-21 Workshop on Empirically Successful Automated Reasoning in Large Theories, volume 257 of CEUR Workshop Proceedings, 2007.

[77] Alexandre Riazanov and Andrei Voronkov. The design and implementation of VAMPIRE. AI Communications, 15(2-3):91–110, August 2002.

[78] Ryan Rifkin, Gene Yeo, and Tomaso Poggio. Regularized least-squares classification. In J.A.K. Suykens, G. Horvath, S. Basu, C. Micchelli, and J. Vandewalle, editors, Advances in Learning Theory: Methods, Models and Applications, pages 131–154, Amsterdam, 2003. IOS Press.

[79] Alex Roederer, Yury Puzis, and Geoff Sutcliffe. Divvy: An ATP Meta-system Based on Axiom Relevance Ordering. In Schmidt [80], pages 157–162. doi:10.1007/978-3-642-02959-2.

[80] Renate A. Schmidt, editor. Automated Deduction, volume 5663 of Lecture Notes in Computer Science. Springer, 2009. doi:10.1007/978-3-642-02959-2.

[81] Bernhard Schölkopf, Ralf Herbrich, Robert Williamson, and Alex J. Smola. A Generalized Representer Theorem. In D. Helmbold and R. Williamson, editors, Proceedings of the 14th Annual Conference on Computational Learning Theory, pages 416–426, 2001. doi:10.1007/3-540-44581-1_27.

[82] Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA, 2001.

[83] Stephan Schulz. Learning search control knowledge for equational deduction, volume 230 of Dissertations in Artificial Intelligence. Infix Akademische Verlagsgesellschaft, 2000.

[84] Stephan Schulz. E—A Brainiac Theorem Prover. AI Communications, 15(2-3):111–126, 2002.

[85] Stephan Schulz. System description: E 0.81. In David Basin and Michaël Rusinowitch, editors, Automated Reasoning, volume 3097 of Lecture Notes in Computer Science, pages 223–228. Springer, 2004. doi:10.1007/978-3-540-25984-8_15.

[86] Stephan Schulz. First-order deduction for large knowledge bases. Presentation at Deduction at Scale Seminar, 2011.

[87] D. Sculley. Rank aggregation for similar items. In Proceedings of the 2007 SIAM International Conference on Data Mining. Society for Industrial and Applied Mathematics, 2007.

[88] John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.

[89] Jonathan R. Shewchuk. An Introduction to the Conjugate Gradient Method Without the Agonizing Pain. Technical report, Carnegie Mellon University, 1994.

[90] Konrad Slind and Michael Norrish. A Brief Overview of HOL4. In Otmane Ait Mohamed, César Muñoz, and Sofiène Tahar, editors, Theorem Proving in Higher Order Logics, volume 5170 of Lecture Notes in Computer Science, pages 28–32. Springer, 2008. doi:10.1007/978-3-540-71067-7_6.

[91] Geoff Sutcliffe. The TPTP problem library and associated infrastructure. Journal of Automated Reasoning, 43(4):337–362, 2009. doi:10.1007/s10817-009-9143-8.

[92] Geoff Sutcliffe. The CADE-22 Automated Theorem Proving System Competition - CASC-22. AI Communications, 23(1):47–60, 2010.

[93] Geoff Sutcliffe. The 5th IJCAR Automated Theorem Proving System Competition - CASC-J5. AI Communications, 24(1):75–89, 2011.

[94] Geoff Sutcliffe. The CADE-23 Automated Theorem Proving System Competition - CASC-23. AI Communications, 25(1):49–63, 2012.

[95] Geoff Sutcliffe. The 6th IJCAR automated theorem proving system competition— CASC-J6. AI Communications, 26(2):211–223, 2013.

[96] Geoff Sutcliffe and Christoph Benzmüller. Automated reasoning in higher-order logic using the TPTP THF infrastructure. Journal of Formalized Reasoning, 3(1):1–27, 2010.

[97] Geoff Sutcliffe and Yury Puzis. SRASS - A Semantic Relevance Axiom Selection System. In Frank Pfenning, editor, Proceedings of the 21st Conference on Automated Deduction, volume 4603 of Lecture Notes in Computer Science, pages 295–310. Springer, 2007. doi:10.1007/978-3-540-73595-3_20.

[98] Geoff Sutcliffe and Christian Suttner. Evaluating General Purpose Automated Theorem Proving Systems. Artificial Intelligence, 131(1-2):39–54, 2001. doi:10.1016/S0004-3702(01)00113-8.

[99] Tanel Tammet. Gandalf. Journal of Automated Reasoning, 18:199–204, 1997. doi:10.1023/A:1005887414560.

[100] Evgeni Tsivtsivadze, Tapio Pahikkala, Jorma Boberg, Tapio Salakoski, and Tom Heskes. Co-Regularized Least-Squares for Label Ranking. In Johannes Fürnkranz and Eyke Hüllermeier, editors, Preference Learning, pages 107–123. Springer, 2011. doi:10.1007/978-3-642-14125-6_6.

[101] Josef Urban. MPTP – Motivation, Implementation, First Experiments. Journal of Automated Reasoning, 33(3-4):319–339, 2004. doi:10.1007/s10817-004-6245-1.

[102] Josef Urban. MizarMode—an integrated proof assistance tool for the Mizar way of formalizing mathematics. Journal of Applied Logic, 4(4):414–427, 2006. doi:10.1016/j.jal.2005.10.004.

[103] Josef Urban. MoMM - fast interreduction and retrieval in large libraries of formalized mathematics. International Journal on Artificial Intelligence Tools, 15(1):109–130, 2006. doi:10.1142/S0218213006002588.

[104] Josef Urban. MPTP 0.2: Design, Implementation, and Initial Experiments. Journal of Automated Reasoning, 37(1-2):21–43, 2006. doi:10.1007/s10817-006-9032-3.

[105] Josef Urban. MaLARea: A metasystem for automated reasoning in large theories. In Geoff Sutcliffe, Josef Urban, and Stephan Schulz, editors, Proceedings of the CADE-21 Workshop on Empirically Successful Automated Reasoning in Large Theories, volume 257 of CEUR Workshop Proceedings, 2007.

[106] Josef Urban. An Overview of Methods for Large-Theory Automated Theorem Proving. In Peter Höfner, Annabelle McIver, and Georg Struth, editors, Proceedings of the First Workshop on Automated Theory Engineering, volume 760 of CEUR Workshop Proceedings, pages 3–8, 2011.

[107] Josef Urban. BliStr: The Blind Strategymaker. CoRR, abs/1301.2683, 2013.

[108] Josef Urban, Kryštof Hoder, and Andrei Voronkov. Evaluation of Automated Theorem Proving on the Mizar Mathematical Library. In Proceedings of the Third International Congress Conference on Mathematical Software, volume 6327 of Lecture Notes in Computer Science, pages 155–166. Springer, 2010. doi:10.1007/978-3-642-15582-6_30.

[109] Josef Urban, Piotr Rudnicki, and Geoff Sutcliffe. ATP and Presentation Service for Mizar Formalizations. Journal of Automated Reasoning, 50(2):229–241, 2013. doi:10.1007/s10817-012-9269-y.

[110] Josef Urban, Geoff Sutcliffe, Petr Pudlák, and Jiří Vyskočil. MaLARea SG1 - Machine Learner for Automated Reasoning with Semantic Guidance. In Alessandro Armando, Peter Baumgartner, and Gilles Dowek, editors, Automated Reasoning, volume 5195 of Lecture Notes in Computer Science, pages 441–456. Springer, 2008. doi:10.1007/978-3-540-71070-7_37.

[111] Josef Urban and Jiří Vyskočil. Theorem Proving in Large Formal Mathematics as an Emerging AI Field. In Maria Paola Bonacina and Mark E. Stickel, editors, Automated Reasoning and Mathematics, volume 7788 of Lecture Notes in Computer Science, pages 240–257. Springer, 2013. doi:10.1007/978-3-642-36675-8_13.

[112] Josef Urban, Jiří Vyskočil, and Petr Štěpánek. MaLeCoP: Machine Learning Connection Prover. In Kai Brünnler and George Metcalfe, editors, Automated Reasoning with Analytic Tableaux and Related Methods, volume 6793 of Lecture Notes in Computer Science, pages 263–277. Springer, 2011. doi:10.1007/978-3-642-22119-4_21.

[113] Jiří Vyskočil, David Stanovský, and Josef Urban. Automated proof shortening by invention of new definitions. In Proceedings of the 16th International Conference on Logic for Programming, Artificial Intelligence, and Reasoning, volume 6355 of Lecture Notes in Computer Science, pages 447–462. Springer, 2010.

[114] Christoph Weidenbach, Dilyana Dimova, Arnaud Fietzke, Rohit Kumar, Martin Suda, and Patrick Wischnewski. SPASS version 3.5. In Schmidt [80], pages 140–145. doi:10.1007/978-3-642-02959-2_10.

[115] Makarius Wenzel. Isabelle/Isar—A generic framework for human-readable proof documents. In Roman Matuszewski and Anna Zalewska, editors, From Insight to Proof—Festschrift in Honour of Andrzej Trybulec, volume 10(23) of Studies in Logic, Grammar, and Rhetoric. University of Białystok, 2007.

[116] Makarius Wenzel. Parallel Proof Checking in Isabelle/Isar. In Gabriel Dos Reis and Laurent Théry, editors, Proceedings of the 2009 International Workshop on Programming Languages for Mechanized Mathematics Systems, pages 13–29. ACM Digital Library, 2009.

[117] Markus Wenzel and Freek Wiedijk. A Comparison of Mizar and Isar. Journal of Automated Reasoning, 29(3-4):389–411, 2002. doi:10.1023/A:1021935419355.

[118] Alfred North Whitehead and Bertrand Russell. Principia Mathematica. Cambridge University Press, 1925–1927.

[119] Andreas Wolf. Strategy selection for automated theorem proving. In Fausto Giunchiglia, editor, Artificial Intelligence: Methodology, Systems, and Applications, volume 1480 of Lecture Notes in Computer Science, pages 452–465. Springer, 1998. doi:10.1007/BFb0057466.

[120] Lin Xu, Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. SATzilla: Portfolio-based Algorithm Selection for SAT. Journal of Artificial Intelligence Research, 32:565–606, 2008. doi:10.1613/jair.2490.

Scientific Curriculum Vitae

Education

2011 – 2014  PhD Computer Science, Radboud University Nijmegen, Nijmegen, The Netherlands

2013  Internship at Microsoft Research, Mountain View, CA, USA

2009 – 2010  PhD Mathematics (unfinished), Universität Bonn, Bonn, Germany

2005 – 2008  Diploma in Mathematics (Diplom Mathematik), Universität Bonn, Bonn, Germany

2004 – 2005  Erasmus Exchange, University of Birmingham, Birmingham, UK

2002 – 2004  Intermediate Exam in Mathematics (Vordiplom Mathematik), Universität Tübingen, Tübingen, Germany

Awards

2013  The CADE ATP System Competition at CADE 24: 1st place in the THF division with Satallax-MaLeS 1.2; 4th place in the FOF division with E-MaLeS 1.2

2012  The CADE ATP System Competition at IJCAR 6: 2nd place in the FOF division with E-MaLeS 1.1

2012  The CADE ATP System Competition at the Alan Turing Centenary Conference: 2nd place in the FOF division with E-MaLeS 1.1; 3rd place in the MZR@Turing division with PS-E

2011  The CADE ATP System Competition at CADE 23: 3rd place in the FOF division with E-MaLeS 1.0

Publications

• D. Kühlwein and J. Blanchette, A Survey of Axiom Selection as a Machine Learning Problem, submitted to “Infinity, computability, and metamathematics. Festschrift celebrating the 60th birthdays of Peter Koepke and Philip Welch”, 2014

• D. Kühlwein and J. Urban, MaLeS: A Framework for Automatic Tuning of Automated Theorem Provers, CoRR, arXiv:1308.2116, 2013

• D. Kühlwein, S. Schulz, and J. Urban, E-MaLeS 1.1, LNCS 7898: Automated Deduction – CADE-24, 2013

• D. Kühlwein, J. Blanchette, C. Kaliszyk, and J. Urban, MaSh: Machine Learning for Sledgehammer, LNCS 7998: Interactive Theorem Proving, 2013

• D. Kühlwein and J. Urban, Learning from Multiple Proofs: First Experiments, EPiC 21: Practical Aspects of Automated Reasoning, 2013

• J. Alama, T. Heskes, D. Kühlwein, E. Tsivtsivadze, and J. Urban, Premise Selection for Mathematics by Corpus Analysis and Kernel Methods, Journal of Automated Reasoning, 2013

• D. Kühlwein, T. Laarhoven, E. Tsivtsivadze, J. Urban, and T. Heskes, Overview and Evaluation of Premise Selection Techniques for Large Theory Mathematics, LNCS 7364: Automated Reasoning, 2012

• J. Alama, D. Kühlwein, and J. Urban, Automated and Human Proofs in General Mathematics: An Initial Comparison, LNCS 7180: Logic for Programming, Artificial Intelligence, and Reasoning, 2012

• D. Kühlwein, J. Urban, E. Tsivtsivadze, H. Geuvers, and T. Heskes, Multi-output Ranking for Automated Reasoning, KDIR, 2011

• D. Kühlwein, J. Urban, E. Tsivtsivadze, H. Geuvers, and T. Heskes, Learning2Reason, Intelligent Computer Mathematics, 2011

• M. Cramer, D. Kühlwein, and B. Schröder, Presupposition Projection and Accommodation in Mathematical Texts, Proceedings of the Conference on Natural Language Processing, 2010

• M. Cramer, P. Koepke, D. Kühlwein, and B. Schröder, Premise Selection in the Naproche System, LNCS 6173: Automated Reasoning, 2010

• M. Cramer, B. Fisseni, P. Koepke, D. Kühlwein, B. Schröder, and J. Veldman, The Naproche Project Controlled Natural Language Proof Checking of Mathematical Texts, LNCS 5972: Controlled Natural Language, 2010

• D. Kühlwein, M. Cramer, P. Koepke, and B. Schröder, The Naproche System, Calculemus 2009

Summary

This thesis develops machine learning methods that improve interactive and automated theorem provers, with a strong focus on building systems that are actually helpful for developers and users. The various experiments show that learning can not only significantly improve the success rates of theorem provers, but also simplify the tuning process of automated theorem provers.

The first part of this thesis focuses on the premise selection problem. Automated theorem provers struggle to solve problems when too much information (i.e., too many premises) is available, because the search space explodes. Premise selection techniques try to predict which premises are relevant. The approach we take is to learn premise relevance from previous proofs. As for any machine learning problem, a thorough understanding of the training data is necessary, and the first few chapters provide it. We introduce several new algorithms and show that they outperform both state-of-the-art non-learning-based premise selection methods and previously tried learning-based approaches. Chapter 6 presents a new system, MaSh, that brings learning-based premise selection to the interactive theorem prover Isabelle. MaSh is built on the insights we gained from the experiments in the first chapters, while also taking into account the requirements of real users. MaSh has become part of the default installation of Isabelle.
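
To make this concrete, the following is a minimal, illustrative sketch of learning premise relevance from previous proofs. It is not the implementation used in this thesis; the class name, facts, and features are hypothetical. Each previously proved fact contributes the symbol features of its statement together with the premises its proof used, and a new conjecture is then ranked against these counts.

from collections import defaultdict

class NaivePremiseRanker:
    def __init__(self):
        # counts[premise][feature]: how often `feature` occurred in the statement
        # of a conjecture whose proof used `premise`
        self.counts = defaultdict(lambda: defaultdict(int))
        self.uses = defaultdict(int)  # how often each premise was used at all

    def learn(self, conjecture_features, used_premises):
        # Update the statistics with one previously proved fact.
        for p in used_premises:
            self.uses[p] += 1
            for f in conjecture_features:
                self.counts[p][f] += 1

    def rank(self, conjecture_features, candidates):
        # Rank candidates by the fraction of the conjecture's features they have
        # co-occurred with in earlier proofs (a crude relevance score).
        def score(p):
            if self.uses[p] == 0:
                return 0.0
            seen = sum(1 for f in conjecture_features if self.counts[p][f] > 0)
            return seen / len(conjecture_features)
        return sorted(candidates, key=score, reverse=True)

# Two previously proved facts and their proof dependencies (hypothetical data).
ranker = NaivePremiseRanker()
ranker.learn({"group", "inverse", "mul"}, ["inverse_unique", "mul_assoc"])
ranker.learn({"ring", "mul", "distrib"}, ["mul_assoc", "distrib_left"])
# Rank candidate premises for a new conjecture about groups and multiplication.
print(ranker.rank({"group", "mul"}, ["inverse_unique", "mul_assoc", "distrib_left"]))

A real premise selector would additionally weight features (for example by rarity), smooth the estimates, and scale to hundreds of thousands of facts.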

The second part of the thesis considers the related problem of automated theorem prover tuning. Automated theorem provers often have several (possibly infinitely many) search strategies. These search strategies define how the prover tries to solve a problem, i.e., find a proof. Finding good search strategies and knowing when to use which strategy is becoming an increasingly important part of automated theorem proving. Chapter 7 presents MaLeS, a general learning-based tuning framework for ATPs. ATP systems tuned with MaLeS successfully competed in the last three world championships for automated theorem provers, the CADE ATP System Competition. Notable achievements are a 6% improvement over the standard version of E prover in the 2012 CASC@Turing100 competition (2nd place for E-MaLeS), and a 2.5% improvement over the standard version of Satallax in CASC 2013 (1st place for Satallax-MaLeS).
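
As a toy illustration of what learned strategy selection can look like (this is not the MaLeS algorithm; the features, strategy names, and numbers below are invented), one can predict a strategy for a new problem from the strategies that solved its nearest neighbours in a problem feature space.

import math

# Hypothetical runtime data: problem feature vectors and the strategy that
# solved each problem fastest. Features could be, e.g., clause count, maximal
# term depth, and the fraction of equational clauses.
training = [
    ((120.0, 3.0, 0.4), "strategy_A"),
    ((2500.0, 7.0, 0.1), "strategy_B"),
    ((90.0, 2.0, 0.8), "strategy_A"),
]

def distance(x, y):
    # Euclidean distance between two feature vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pick_strategy(problem_features, k=1):
    # Choose a strategy by majority vote over the k nearest past problems.
    nearest = sorted(training, key=lambda t: distance(t[0], problem_features))[:k]
    votes = {}
    for _, strategy in nearest:
        votes[strategy] = votes.get(strategy, 0) + 1
    return max(votes, key=votes.get)

print(pick_strategy((100.0, 3.0, 0.5)))  # the nearest past problems were solved by strategy_A

In practice the features would be normalized, and the prediction would take the actual runtimes into account rather than a simple vote.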

What's next

All the machine learning programs presented in this thesis are add-ons or extensions to existing systems. A deeper integration between machine learning and automated reasoning seems like a promising research direction. There is no reason why one should have to learn that a variable of type integer can also be seen as a general group element if this is already easily deducible by the calculus. ATPs could use learning as an integral part of their calculus, e.g. to predict which unification to attempt next or whether to try an alternative search strategy. Automated reasoning systems have a lot to offer to society, for example in software and hardware development, mathematics, or even philosophy. But in order to reach the mainstream, we must improve both the capabilities and the usability of our tools. I sincerely hope that this thesis is a step towards this overall goal.

Samenvatting

This thesis develops machine learning methods that improve interactive and automated theorem provers, with a strong emphasis on building systems that actually support developers and users. The various experiments show that machine learning not only significantly improves the success rates of automated theorem provers, but also simplifies their tuning.

The first part of this thesis focuses on the problem of selecting the right premises. Automated theorem provers have difficulty proving theorems when too much information (i.e., too many premises) is available, because the number of possible premise selections then grows exponentially. Premise selection techniques try to predict which premises will be relevant. The approach taken here is to learn this relevance from previous proofs. As with every machine learning problem, this requires a good understanding of the training data, and the first chapters are devoted to it. We introduce several new algorithms and show that they outperform state-of-the-art premise selection techniques both with and without machine learning. Chapter 6 describes a new system, MaSh, which adds automatic premise selection to the interactive theorem prover Isabelle. MaSh is based on the insights gained from the experiments in the first chapters, while also taking the wishes of actual users into account. MaSh has since become a standard component of Isabelle.

The second part of this thesis addresses a related problem, namely the tuning of automated theorem provers. Automated theorem provers typically have several (possibly even infinitely many) search strategies. These search strategies determine how the prover solves a problem, i.e., how it finds a proof. Finding good search strategies and knowing when to use which strategy is becoming an increasingly important part of automated theorem proving. Chapter 7 describes MaLeS, a generic learning-based framework for tuning automated theorem provers. Theorem provers tuned with MaLeS have successfully competed in three world championships for automated theorem provers (the CADE ATP System Competition, CASC). Notable achievements are a second place in the 2012 CASC@Turing100 for E-MaLeS (a 6% improvement over the standard version of E) and a first place in CASC 2013 for Satallax-MaLeS (a 2.5% improvement over the standard version of Satallax).

For the future

All the machine learning programs described in this thesis are add-ons to existing systems. A deeper integration of machine learning and automated reasoning seems a promising research direction. For instance, there is no need to learn that a variable of type integer can also be seen as a generic group element if this can already be derived from the underlying calculus. Automated theorem provers could use machine learning as an integral part of their calculus, for example to predict which unification to attempt next, or whether to switch to a different search strategy. Automated reasoning systems have much to offer society, in areas ranging from software and hardware development to mathematics and even philosophy. But to become commonplace, the capabilities and usability of these tools must improve. I sincerely hope that this thesis is a step towards this overarching goal.

Acknowledgments

Throughout my PhD, I had the great pleasure of working with a bunch of fascinating and very clever people. I would like to start by thanking the two people who probably had the biggest impact on this thesis: my daily supervisors, Josef Urban and Evgeni Tsivtsivadze. Both were always motivated, full of ideas and open to my suggestions, which made for a perfect work environment. Tom Heskes and Herman Geuvers successfully managed to walk the fine line between too close and too loose that every promotor has to find.

I'm grateful to my co-authors. To Jesse Alama for his help during the early-day Mizar experiments, and in particular for connecting me with Susanne and Ed when I went to the US. To Stephan Schulz for providing me with the data that formed the basis of MaLeS, and for his support with the great piece of software that is E. To Jasmin Blanchette and Tobias Nipkow for the opportunity to visit the Isabelle group (twice!). Jasmin's help during the development of MaSh has been invaluable, and he even taught me a thing or two about how to create pretty papers. To Cezary Kaliszyk who, even though he had just had a small child, ran experiment after experiment after experiment, and was always available for discussions.

MaLeS couldn't exist without Christoph Benzmüller, Chad Brown, and Geoff Sutcliffe, who publicly released their programs and provided support whenever problems occurred.

Every thesis is also a product of the environment in which it was written, and hence I'd like to thank my colleagues: Alexandra, Ali, Bas, Carst, Elena M., Elena S., Freek, Helle, Janos, Jelle, Jonce, Joris, Kasper, Maya, Max, Michael, Mohsen, Nicole, Robbert, Simone, Suzan, Thijs, Tjeerd, Tom, Twan and Wout.

Life does not only consist of work, and the numerous adventures with my climbing buddies ensured that I remembered that. Thanks to Alex, Dieke, Gitta, Johannes, Jonas, Nadja, Niko, Marcos, Marek, Sebastian, Silke and Pawel. In particular, a big thank you to Janina, who brightens my every day.

Last but not least, I want to express my gratitude to my family for their support, help, and advice throughout my studies.
