Learning Logic Programs by Explaining Failures

Rolf Morel, Andrew Cropper
{rolf.morel,andrew.cropper}@cs.ox.ac.uk

Abstract

Scientists form hypotheses and experimentally test them. If a hypothesis fails (is refuted), scientists try to explain the failure to eliminate other hypotheses. We introduce similar explanation techniques for inductive logic programming (ILP). We build on the ILP approach learning from failures. Given a hypothesis represented as a logic program, we test it on examples. If a hypothesis fails, we identify clauses and literals responsible for the failure. By explaining failures, we can eliminate other hypotheses that will provably fail. We introduce a technique for failure explanation based on analysing SLD-trees. We experimentally evaluate failure explanation in the POPPER ILP system. Our results show that explaining failures can drastically reduce learning times.

1 Introduction

The process of forming hypotheses, testing them on data, analysing the results, and forming new hypotheses is the foundation of the scientific method [Popper, 2002]. For instance, imagine that Alice is a chemist trying to synthesise a vial of the compound octiron from substances thaum and slood. To do so, Alice can perform actions, such as fill a vial with a substance (fill(Vial,Sub)) or mix two vials (mix(V1,V2,V3)). One such hypothesis is:

synth(A,B,C) ← fill(V1,A), fill(V1,B), mix(V1,V1,C)

This hypothesis says to synthesise compound C, fill vial V1 with substance A, fill vial V1 with substance B, and then mix vial V1 with itself to form C.

When Alice experimentally tests this hypothesis she finds that it fails. From this failure, Alice deduces that hypotheses that add more actions (i.e. literals) will also fail (C1). Alice can, however, go further and explain the failure as "vial V1 cannot be filled a second time", which allows her to deduce that any hypothesis that includes fill(V1,A) and fill(V1,B) will fail (C2). Clearly, conclusion C2 allows Alice to eliminate more hypotheses than C1, that is, by explaining failures Alice can better form new hypotheses.

Our main contribution is to introduce similar explanation techniques for inductive program synthesis, where the goal is to machine learn computer programs from data [Shapiro, 1983]. We build on the inductive logic programming (ILP) approach learning from failures and its implementation called POPPER [Cropper and Morel, 2021]. POPPER learns logic programs by iteratively generating and testing hypotheses. When a hypothesis fails on training examples, POPPER examines the failure to learn constraints that eliminate hypotheses that will provably fail as well. A limitation of POPPER is that it only derives constraints based on entire hypotheses (as Alice does for C1) and cannot explain why a hypothesis fails (cannot reason as Alice does for C2).

We address this limitation by explaining failures. The idea is to analyse a failed hypothesis to identify sub-programs that also fail. We show that, by identifying failing sub-programs and generating constraints from them, we can eliminate more hypotheses, which can in turn improve learning performance. By the Blumer bound [1987], searching a smaller hypothesis space should result in fewer errors compared to a larger space, assuming a solution is in both spaces.

Our approach builds on algorithmic debugging [Caballero et al., 2017]. We identify sub-programs of hypotheses by analysing paths in SLD-trees. In similar work [Shapiro, 1983; Law, 2018], only entire clauses can make up these sub-programs. By contrast, we can identify literals responsible for a failure within a clause. We extend POPPER with failure explanation and experimentally show that failure explanation can significantly improve learning performance.

Our contributions are:

• We relate logic programs that fail on examples to their failing sub-programs. For wrong answers we identify clauses. For missing answers we additionally identify literals within clauses.
• We show that hypotheses that are specialisations and generalisations of failing sub-programs can be eliminated.
• We prove that hypothesis space pruning based on sub-programs is more effective than pruning without them.
• We introduce an SLD-tree based technique for failure explanation. We introduce POPPERX, which adds the ability to explain failures to the POPPER ILP system.
• We experimentally show that failure explanation can drastically reduce (i) hypothesis space exploration and (ii) learning times.
2 Related Work

Inductive program synthesis systems automatically generate computer programs from specifications, typically input/output examples [Shapiro, 1983]. This topic interests researchers from many areas of machine learning, including Bayesian inference [Silver et al., 2020] and neural networks [Ellis et al., 2018]. We focus on ILP techniques, which induce logic programs [Muggleton, 1991]. In contrast to neural approaches, ILP techniques can generalise from few examples [Cropper et al., 2020]. Moreover, because ILP uses logic programming as a uniform representation for background knowledge (BK), examples, and hypotheses, it can be applied to arbitrary domains without the need for hand-crafted, domain-specific neural architectures. Finally, due to logic's similarity to natural language, ILP learns comprehensible hypotheses.

Many ILP systems [Muggleton, 1995; Blockeel and Raedt, 1998; Srinivasan, 2001; Ahlgren and Yuen, 2013; Inoue et al., 2014; Schüller and Benz, 2018; Law et al., 2020] either cannot or struggle to learn recursive programs. By contrast, POPPERX can learn recursive programs and thus programs that generalise to input sizes it was not trained on. Compared to many modern ILP systems [Law, 2018; Evans and Grefenstette, 2018; Kaminski et al., 2019; Evans et al., 2021], POPPERX supports large and infinite domains, which is important when reasoning about complex data structures, such as lists. Compared to many state-of-the-art systems [Cropper and Muggleton, 2016; Evans and Grefenstette, 2018; Kaminski et al., 2019; Hocquette and Muggleton, 2020; Patsantzis and Muggleton, 2021], POPPERX does not need metarules (program templates) to restrict the hypothesis space.

Algorithmic debugging [Caballero et al., 2017] explains failures in terms of sub-programs. Similarly, in databases, provenance is used to explain query results [Cheney et al., 2009]. In seminal work on logic program synthesis, Shapiro [1983] analysed debugging trees to identify failing clauses. By contrast, our failure analysis reasons about concrete SLD-trees. Both ILASP3 [Law, 2018] and the remarkably similar ProSynth [Raghothaman et al., 2020] induce logic programs by precomputing every possible clause and then using a select-test-and-constrain loop. This precompute step is infeasible for clauses with many literals and restricts their failure explanation to clauses. By contrast, POPPERX does not precompute clauses and can identify clauses and literals within clauses responsible for failure.

POPPER [Cropper and Morel, 2021] learns first-order constraints, which can be likened to conflict-driven clause learning [Silva et al., 2009]. Failure explanation in POPPERX can therefore be viewed as enabling POPPER to detect smaller conflicts, yielding smaller yet more general constraints that prune more effectively.

3 Problem Setting

We now reiterate the LFF problem [Cropper and Morel, 2021] as well as the relation between constraints and failed hypotheses. We then introduce failure explanation in terms of sub-programs. We assume standard logic programming definitions [Lloyd, 2012].

3.1 Learning From Failures

To define the LFF problem, we first define predicate declarations and hypothesis constraints. LFF uses predicate declarations as a form of language bias, defining which predicate symbols may appear in a hypothesis. A predicate declaration is a ground atom of the form head_pred(p,a) or body_pred(p,a) where p is a predicate symbol of arity a. Given a set of predicate declarations D, a definite clause C is declaration consistent when two conditions hold: (i) if p/m is the predicate in the head of C, then head_pred(p,m) is in D, and (ii) for all predicate symbols q/n in the body of C, body_pred(q,n) is in D.

To restrict the hypothesis space, LFF uses hypothesis constraints. Let L be a language that defines hypotheses, i.e. a meta-language. Then a hypothesis constraint is a constraint expressed in L. Let C be a set of hypothesis constraints written in a language L. A set of definite clauses H is consistent with C if, when written in L, H does not violate any constraint in C.

We now define the LFF problem, which is based on the ILP learning from entailment setting [Raedt, 2008]:

Definition 3.1 (LFF input). A LFF input is a tuple (E+, E−, B, D, C) where E+ and E− are sets of ground atoms denoting positive and negative examples respectively; B is a Horn program denoting background knowledge; D is a set of predicate declarations; and C is a set of hypothesis constraints.

A definite program is a hypothesis when it is consistent with both D and C. We denote the set of such hypotheses as H_{D,C}. We define a LFF solution:

Definition 3.2 (LFF solution). Given an input tuple (E+, E−, B, D, C), a hypothesis H ∈ H_{D,C} is a solution when H is complete (∀e ∈ E+, B ∪ H ⊨ e) and consistent (∀e ∈ E−, B ∪ H ⊭ e).

If a hypothesis is not a solution then it is a failure and a failed hypothesis. A hypothesis H is incomplete when ∃e+ ∈ E+, H ∪ B ⊭ e+. A hypothesis H is inconsistent when ∃e− ∈ E−, H ∪ B ⊨ e−. A worked example of LFF is included in Appendix A.
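To make Definition 3.1 concrete, the droplast task from Appendix A can be written out as an LFF input. The sketch below uses plain Prolog facts; the pos/1 and neg/1 wrappers and the concrete clauses chosen for the body predicates are illustrative, not POPPER's actual input format.

% Predicate declarations D (taken from Appendix A).
head_pred(droplast,2).
body_pred(empty,1).
body_pred(head,2).
body_pred(tail,2).
body_pred(cons,3).
body_pred(droplast,2).    % permits recursive hypotheses

% Positive examples E+ and negative examples E-.
pos(droplast([1,2,3],[1,2])).
pos(droplast([1,2],[1])).
neg(droplast([1,2],[])).

% Background knowledge B: one suitable definition per body predicate.
empty([]).
head([H|_], H).
tail([_|T], T).
cons(H, T, [H|T]).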
3.2 Specialisation and Generalisation Constraints

The key idea of LFF is to learn constraints from failed hypotheses. Cropper and Morel [2021] introduce constraints based on subsumption [Plotkin, 1971] and theory subsumption [Midelfart, 1999]. A clause C1 subsumes a clause C2 if and only if there exists a substitution θ such that C1θ ⊆ C2. A clausal theory T1 subsumes a clausal theory T2, denoted T1 ⪯ T2, if and only if ∀C2 ∈ T2, ∃C1 ∈ T1 such that C1 subsumes C2. Subsumption implies entailment, i.e. if T1 ⪯ T2 then T1 ⊨ T2. A clausal theory T1 is a specialisation of a clausal theory T2 if and only if T2 ⪯ T1. A clausal theory T1 is a generalisation of a clausal theory T2 if and only if T1 ⪯ T2.

Hypothesis constraints prune the hypothesis space. Generalisation constraints only prune generalisations of inconsistent hypotheses. Specialisation constraints only prune specialisations of incomplete hypotheses. Generalisation and specialisation constraints are sound in that they do not prune solutions [Cropper and Morel, 2021].
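As a concrete reading of these definitions, the sketch below checks clause and theory subsumption in Prolog, with clauses represented as lists of literals and theories as lists of clauses. The predicate names and the grounding trick via copy_term/2 and numbervars/3 are our own choices; POPPER encodes such constraints declaratively rather than checking them like this.

:- use_module(library(lists)).

% theta_subsumes(+C1, +C2): clause C1 subsumes clause C2, i.e. there is a
% substitution Theta with C1.Theta a subset of C2. Standard check: ground a
% copy of C2, then map every literal of C1 onto a literal of the grounded C2,
% sharing variable bindings across the literals of C1.
theta_subsumes(C1, C2) :-
    \+ \+ ( copy_term(C2, C2Ground),
            numbervars(C2Ground, 0, _),
            all_in(C1, C2Ground) ).

all_in([], _).
all_in([Lit|Lits], C2) :-
    member(Lit, C2),
    all_in(Lits, C2).

% theory_subsumes(+T1, +T2): every clause of T2 is subsumed by a clause of T1,
% so T1 is a generalisation of T2 and T2 is a specialisation of T1.
theory_subsumes(T1, T2) :-
    forall(member(C2, T2),
           ( member(C1, T1), theta_subsumes(C1, C2) )).

% ?- theta_subsumes([droplast(A,B), tail(A,B)],
%                   [droplast(X,Y), tail(X,Y), empty(Y)]).
% true, with Theta = {A/X, B/Y}.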
3.3 Missing and Incorrect Answers

We follow Shapiro [1983] in identifying examples as responsible for the failure of a hypothesis H given background knowledge B. A positive example e+ is a missing answer when B ∪ H ⊭ e+. Similarly, a negative example e− is an incorrect answer when B ∪ H ⊨ e−. We relate missing and incorrect answers to specialisations and generalisations. If H has a missing answer e+, then each specialisation of H has e+ as a missing answer, so the specialisations of H are incomplete and can be eliminated. If H has an incorrect answer e−, then each generalisation of H has e− as an incorrect answer, so the generalisations of H are inconsistent and can be eliminated.

Example 1 (Missing answers and specialisations). Consider the following droplast hypothesis:

H1 = { droplast(A,B) ← empty(A),tail(A,B) }

Both droplast([1,2,3],[1,2]) and droplast([1,2],[1]) are missing answers of H1, so H1 is incomplete and we can prune its specialisations, e.g. programs that add literals to the clause.

Example 2 (Incorrect answers and generalisations). Consider the hypothesis H2:

H2 = { droplast(A,B) ← tail(A,C),tail(C,B)
       droplast(A,B) ← tail(A,B) }

In addition to being incomplete, H2 is inconsistent because of the incorrect answer droplast([1,2],[]), so we can prune the generalisations of H2, e.g. programs with additional clauses.

3.4 Failing Sub-programs

We now extend LFF by explaining failures in terms of failing sub-programs. The idea is to identify sub-programs that cause the failure. Consider the following two examples:

Example 3 (Explain incompleteness). Consider the positive example e+ = droplast([1,2],[1]) and the previously defined hypothesis H1. An explanation for why H1 does not entail e+ is that empty([1,2]) fails. It follows that the program H1′ = { droplast(A,B) ← empty(A) } has e+ as a missing answer and is incomplete, so we can prune all specialisations of it.

Example 4 (Explain inconsistency). Consider the negative example e− = droplast([1,2],[]) and the previously defined hypothesis H2. The first clause of H2 always entails e− regardless of other clauses in the hypothesis. It follows that the program H2′ = { droplast(A,B) ← tail(A,C),tail(C,B) } has e− as an incorrect answer and is inconsistent, so we can prune all generalisations of it.

We now define a sub-program:

Definition 3.3 (Sub-program). The definite program P is a sub-program of the definite program Q if and only if either:
• P is the empty set
• there exists Cp ∈ P and Cq ∈ Q such that Cp ⊆ Cq and P \ {Cp} is a sub-program of Q \ {Cq}

In functional program synthesis, sub-programs (sometimes called partial programs) are typically defined by leaving out nodes in the parse tree of the original program [Feng et al., 2018]. Our definition generalises this idea by allowing for arbitrary ordering of clauses and literals.
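Definition 3.3 translates almost directly into Prolog. The sketch below checks the sub-program relation for programs given as lists of clauses, each clause a list of literals; this representation and the syntactic (==) literal check are our own simplifications for illustration.

:- use_module(library(lists)).

% sub_program(+P, +Q): P is a sub-program of Q (Definition 3.3).
% Programs are lists of clauses; clauses are lists of literals.
sub_program([], _).
sub_program(P, Q) :-
    select(Cp, P, Prest),        % pick a clause Cp of P ...
    select(Cq, Q, Qrest),        % ... and a clause Cq of Q that contains it,
    sub_clause(Cp, Cq),
    sub_program(Prest, Qrest).   % then recurse on the remaining clauses.

% sub_clause(+Cp, +Cq): every literal of Cp occurs (syntactically) in Cq.
sub_clause([], _).
sub_clause([L|Ls], Cq) :-
    member_eq(L, Cq),
    sub_clause(Ls, Cq).

member_eq(X, [Y|_]) :- X == Y.
member_eq(X, [_|T]) :- member_eq(X, T).

% ?- sub_program([[droplast(A,B), empty(A)]],
%                [[droplast(A,B), empty(A), tail(A,B)]]).
% true (with the clauses sharing variables as written):
% H1' from Example 3 is a sub-program of H1.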
We now define the failing sub-programs problem:

Definition 3.4 (Failing sub-programs problem). Given the definite program P and sets of examples E+ and E−, the failing sub-programs problem is to find all sub-programs of P that do not entail an example of E+ or entail an example of E−.

By definition, a failing sub-program is incomplete and/or inconsistent, so, by Section 3.2, we can always prune specialisations and/or generalisations of a failing sub-program.

Remark 1 (Undecidability). The failing sub-programs problem is undecidable in general as deciding entailment can be reduced to it.

We show that sub-programs are effective at pruning:

Theorem 1 (Better pruning). Let H be a definite program that fails and P (≠ H) be a sub-program of H that fails. Specialisation and generalisation constraints for P can always achieve additional pruning versus those only for H.

Proof. Suppose H is a specialisation of P. If P is incomplete, then among the specialisations of P, which are all prunable, is H and its specialisations. If P is inconsistent, P's generalisations do not completely overlap with H's generalisations and specialisations (using that P ≠ H). Hence, pruning P's generalisations prunes programs not pruned by H. The case where H is a generalisation of P is analogous. In the remaining case, where H and P are not related by subsumption, it is immediate that the constraints derived for P prune a distinct part of the hypothesis space.

4 Implementing Failure Explanation

We now describe our failure explanation technique, which identifies sub-programs by identifying both clauses and literals within clauses responsible for failure. Subsequently we summarise the POPPER ILP system before introducing our extension of it: POPPERX.

4.1 SLD-trees and Sub-programs

In algorithmic debugging, missing and incorrect answers help characterise which parts of a debugging tree are wrong [Caballero et al., 2017]. Debugging trees can be seen as generalising SLD-trees, with the latter representing the search for a refutation [Nienhuys-Cheng and Wolf, 1997]. Exploiting their granularity, we analyse SLD-trees to address the failing sub-programs problem, only identifying a subset of them.

A branch in a SLD-tree is a path from the root goal to a leaf. Each goal on a branch has a selected atom, on which resolution is performed to derive child goals. A branch that ends in an empty leaf is called successful, as such a path represents a refutation. Otherwise a branch is failing. Note that selected atoms on a branch identify a subset of the literals of a program.

Let B be a Horn program, H be a hypothesis, and e be an atom. The SLD-tree T for B ∪ H ∪ {¬e}, with ¬e as the root, proves B ∪ H ⊨ e iff T contains a successful branch. Given a branch λ of T, we define the λ-sub-program of H. A literal L of H occurs in λ-sub-program H′ if and only if L occurs as a selected atom in λ or L was used to produce a resolvent that occurs in λ. The former case is for literals in the body of clauses and the latter for head literals. Now consider the SLD-tree T′ for B ∪ H′ ∪ {¬e} with ¬e as root. As all literals necessary for λ occur in B ∪ H′, the branch λ must occur in T′ as well.

Suppose e− is an incorrect answer for hypothesis H. Then the SLD-tree for B ∪ H ∪ {¬e−}, with ¬e− as root, has a successful branch λ. The literals of H necessary for this branch are also present in λ-sub-program H′, hence e− is also an incorrect answer of H′. Now suppose e+ is a missing answer of H. Let T be the SLD-tree for B ∪ H ∪ {¬e+}, with ¬e+ as root, and λ′ be any failing branch of T. The literals of H in λ′ are also present in λ′-sub-program H″. This is however insufficient for concluding that the SLD-tree for H″ has no successful branch. Hence it is not immediate that e+ is a missing answer for H″. In case that H″ is a specialisation of H we can conclude that e+ is a missing answer.
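To see how the selected atoms of a successful branch yield a λ-sub-program, consider the following minimal Prolog meta-interpreter. It assumes the hypothesis is stored clause by clause as hyp_clause(Id, Head, BodyLiterals) facts and that background predicates are listed by bk_predicate/1 and defined as ordinary Prolog; these representations and prove/2 itself are a simplification for illustration, not POPPERX's implementation. Each answer of prove/2 corresponds to one successful branch and returns the hypothesis literals used on it.

:- use_module(library(lists)).

% Hypothesis H2 from Example 4, stored clause by clause.
hyp_clause(c1, droplast(A,B), [tail(A,C), tail(C,B)]).
hyp_clause(c2, droplast(A,B), [tail(A,B)]).
% Background predicates, listed and defined as ordinary Prolog.
bk_predicate(tail/2).
tail([_|T], T).

% prove(+Atom, -Used): prove Atom against the hypothesis plus background
% knowledge; Used collects the hypothesis literals selected on this
% successful branch, i.e. a lambda-sub-program of the hypothesis.
prove(Atom, [head(Id, Atom)|Used]) :-
    hyp_clause(Id, Atom, Body),          % resolve with a hypothesis clause
    prove_body(Body, Id, 1, Used).
prove(Atom, []) :-
    functor(Atom, F, N),
    bk_predicate(F/N),                   % background atom: call Prolog directly
    call(Atom).

prove_body([], _, _, []).
prove_body([Lit|Lits], Id, Pos, Used) :-
    prove(Lit, UsedLit),                 % Lit is the selected atom here
    Pos1 is Pos + 1,
    prove_body(Lits, Id, Pos1, UsedRest),
    append([body(Id, Pos, Lit)|UsedLit], UsedRest, Used).

% ?- prove(droplast([1,2],[]), Used).
% succeeds via clause c1 only, so Used names exactly the literals of the
% inconsistent sub-program H2' from Example 4.

For an incorrect answer e−, any such answer therefore identifies an inconsistent sub-program, mirroring the argument above. Handling missing answers would additionally require inspecting the failing branches of the SLD-tree, which this sketch does not do.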
4.2 POPPER

POPPER tackles the LFF problem (Definition 3.1) using a generate, test, and constrain loop. A logical formula is constructed and maintained whose models correspond to Prolog programs. The first stage is to generate a model and convert it to a program. The program is tested on all positive and negative examples. The number of missing and incorrect answers determines whether specialisations [1] and/or generalisations can be pruned. When a hypothesis fails, new hypothesis constraints (Section 3.2) are added to the formula, which eliminates models and thus prunes the hypothesis space. POPPER then loops back to the generate stage.

Smaller programs prune more effectively, which is partly why POPPER searches for hypotheses by their size (number of literals) [2]. Yet there are many small programs that POPPER does not consider well-formed that achieve significant, sound pruning. Consider the sub-program H1′ = { droplast(A,B) ← empty(A) } from Example 3. POPPER does not generate H1′ as it does not consider it a well-formed hypothesis (as the head variable B does not occur in the body). Yet precisely because this sub-program has so few body literals is why it is so effective at pruning specialisations.

[1] POPPER generates elimination constraints when a hypothesis entails none of the positive examples [Cropper and Morel, 2021].
[2] The other reason is to find optimal solutions, i.e. those with the minimal number of literals.

4.3 POPPERX

We now introduce POPPERX, which extends POPPER with SLD-based failure explanation. Like POPPER, any generated hypothesis H is tested on the examples. However, additionally, for each tested example we obtain the selected atoms on each branch of the example's SLD-tree, which correspond to sub-programs of H. As shown, sub-programs derived from incorrect answers have the same incorrect answers. For each such identified inconsistent sub-program H′ of H we tell the constrain stage to prune generalisations of H′. Sub-programs derived from missing answers are retested, now without obtaining their SLD-trees. If a sub-program H″ of H is incomplete we inform the constrain stage to prune specialisations [3] of H″.

[3] As in POPPER, we prune by elimination constraints if no positive examples are entailed.

Pruning for sub-programs is in addition to the pruning that the constrain stage already does for H. This is important as H's failing sub-programs need not be specialisations/generalisations of H.
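Continuing the earlier prove/2 sketch, the outline below mimics the extra work POPPERX does after testing a hypothesis: every entailed negative example yields an inconsistent λ-sub-program whose generalisations can be pruned, and every non-entailed positive example triggers specialisation pruning. The pos/1 and neg/1 example facts and the printed messages stand in for the real constrain stage, which instead adds constraints to the generate stage's formula.

% test_hypothesis: sketch of POPPERX's test stage on top of prove/2.
test_hypothesis :-
    % Incorrect answers: each successful branch yields a failing sub-program
    % whose generalisations can be pruned.
    forall(( neg(E), prove(E, SubProgram) ),
           format("incorrect answer ~w: prune generalisations of ~w~n",
                  [E, SubProgram])),
    % Missing answers: prune specialisations (sub-programs from failing
    % branches would be retested here, as described in Section 4.3).
    forall(( pos(E), \+ prove(E, _) ),
           format("missing answer ~w: prune specialisations~n", [E])).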
5 Experiments

We claim that failure explanation can improve learning performance. Our experiments therefore aim to answer the questions:

Q1 Can failure explanation prune more programs?

Q2 Can failure explanation reduce learning times?

A positive answer to Q1 does not imply a positive answer for Q2 because of the potential overhead of failure explanation. Identifying sub-programs requires computational effort and the additional constraints could potentially overwhelm a learner. For example, as well as identifying sub-programs, POPPERX needs to derive more constraints, ground them, and have a solver reason over them. These operations are all costly.

To answer Q1 and Q2, we compare POPPERX against POPPER. The addition of failure explanation is the only difference between the systems and in all the experiments the settings for POPPERX and POPPER are identical. We do not compare against other state-of-the-art ILP systems, such as Metagol [Cropper and Muggleton, 2016] and ILASP3 [Law, 2018], because such a comparison cannot help us answer Q1 and Q2. Moreover, POPPER has been shown to outperform these two systems on problems similar to the ones we consider [Cropper and Morel, 2021].

We run the experiments on a 10-core server (at 2.2GHz) with 30 gigabytes of memory (note that POPPER and POPPERX only run on a single CPU). When testing individual examples, we use an evaluation timeout of 33 milliseconds.

5.1 Experiment 1: Robot Planning

The goal of this experiment is to evaluate whether failure explanation can improve performance when progressively increasing the size of the target program. We therefore need a problem where we can vary the program size. We consider a robot strategy learning problem. There is a robot that can move in four directions in a grid world, which we restrict to being a corridor (dimensions 1 × 10). The robot starts in the lower left corner and needs to move to a position to its right. In this experiment, failure explanation should determine that any strategy that moves up, down, or left can never succeed and thus can never appear in a solution.

Settings. An example is an atom f(s1,s2), with start (s1) and end (s2) states. A state is a pair of discrete coordinates (x,y). We provide four dyadic relations as BK: move_right, move_left, move_up, and move_down, which change the state, e.g. move_right((2,2),(3,2)). We allow one clause with up to 10 body literals and 11 variables. We use hypothesis constraints to ensure this clause is forward-chained [Kaminski et al., 2019], which means body literals modify the state one after another.
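A possible Prolog encoding of this background knowledge is sketched below. The state representation as (X,Y) pairs and the relation names follow the description above, but the concrete clauses, the bound checks, and the grid_width/1 and grid_height/1 helpers are our own guesses rather than the definitions used in the experiment.

% States are pairs (X,Y) of discrete coordinates; the corridor used in this
% experiment is 1 x 10, i.e. ten columns and a single row.
grid_width(10).
grid_height(1).

in_grid((X,Y)) :-
    grid_width(W), grid_height(H),
    X >= 0, X < W,
    Y >= 0, Y < H.

% Dyadic BK relations that change the state by one cell.
move_right((X,Y), (X1,Y)) :- X1 is X + 1, in_grid((X1,Y)).
move_left((X,Y),  (X1,Y)) :- X1 is X - 1, in_grid((X1,Y)).
move_up((X,Y),    (X,Y1)) :- Y1 is Y + 1, in_grid((X,Y1)).
move_down((X,Y),  (X,Y1)) :- Y1 is Y - 1, in_grid((X,Y1)).

% A forward-chained five-move solution then has the form:
% f(A,B) :- move_right(A,C1), move_right(C1,C2), move_right(C2,C3),
%           move_right(C3,C4), move_right(C4,B).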

Method. The start state is (0,0) and the end state is (n,0), for n in 1, 2, 3, ..., 10. Each trial has only one (positive) example: f((0,0),(n,0)). We measure learning times and the number of programs generated. We enforce a timeout of 10 minutes per task. We repeat each experiment 10 times and plot the mean and (negligible) standard error.

Results. Figure 1a shows that POPPERX substantially outperforms POPPER in terms of learning time. Whereas POPPERX needs around 80 seconds to find a 10 move solution, POPPER exceeds the 10 minute timeout when looking for a six move solution. The reason for the improved performance is that POPPERX generates far fewer programs, as failure explanation will, for example, prune all programs whose first move is to the left. For instance, to find a five literal solution, POPPER generates 1300 programs, whereas POPPERX only generates 62. When looking for a 10 move solution, POPPERX only generates 1404 programs in a hypothesis space of 1.4 million programs. These results show that, compared to POPPER, POPPERX generates substantially fewer programs and requires less learning time. The results from this experiment strongly suggest that the answer to questions Q1 and Q2 is yes.

[Figure 1 (plots omitted): (a) learning time in seconds and (b) number of generated programs, both plotted against program size, for POPPERX and the POPPER baseline.]
Figure 1: Results of robot planning experiment. The x-axes denote the number of body literals in the solution, i.e. the number of moves required.

5.2 Experiment 2: String Transformations

We now explore whether failure explanation can improve learning performance on real-world string transformation tasks. We use a standard dataset [Lin et al., 2014; Cropper, 2019] formed of 312 tasks, each with 10 input-output pair examples. For instance, task 81 has the following two input-output pairs:

Input                      Output
"Alex","M",41,74,170       M
"Carly","F",32,70,155      F

Settings. As BK, we give each system the monadic predicates is_uppercase, is_empty, is_space, is_letter, is_number and dyadic predicates mk_uppercase, mk_lowercase, skip1, copyskip1, copy1. For each monadic predicate we also provide a predicate that is its negation. We allow up to 3 clauses with 4 body literals and up to 5 variables per clause.

Method. The dataset has 10 positive examples for each problem. We perform cross validation by selecting 10 distinct subsets of 5 examples for each problem, using the other 5 to test. We measure learning times and number of programs generated. We enforce a timeout of 120 seconds per task. We repeat each experiment 10 times, once for each distinct subset, and record means and standard errors.

[Figure 2 (scatter plot omitted): ratio of learning time (y-axis) against ratio of generated programs (x-axis), one point per problem.]
Figure 2: String transformation results. The ratio of the number of programs that POPPERX needs versus POPPER is plotted against the ratio of learning time needed on that problem.

Results. For 52 problems both POPPER and POPPERX find solutions [4]. On 11 tasks POPPER times out, and on 7 of these in all trials. POPPERX finds solutions on these same 11 tasks, with timeouts in some trials on only 6 tasks. As relational solutions are allowed, many solutions are not ideal, e.g. allowing for optionally copying over a character.

[4] Note that these problems are very difficult with many of them not having solutions given only our primitive BK and with the learned program restricted to defining a single predicate. Therefore, absolute performance should be ignored. The important result is the relative performance of the two systems.

Figure 2 plots ratios of generated programs and learning times. Each point represents a single problem. The x-axis is the ratio of programs that POPPERX generates versus the number of programs that POPPER generates. The y-value is the ratio of learning time of POPPERX versus POPPER. These ratios are acquired by dividing means, the mean of POPPERX over that of POPPER.

Looking at x-axis values, of the 52 problems plotted 50 require fewer programs when run with POPPERX. Looking at the y-axis, the learning times of 51 problems are faster on POPPERX. Note that either failure explanation is very effective or its influence is rather limited, which we explore more in the next experiment.

Overall, these results show that, compared to POPPER, POPPERX almost always needs fewer programs and less time to learn programs. This suggests that the answer to questions Q1 and Q2 is yes.

5.3 Experiment 3: Programming Puzzles

This experiment evaluates whether failure explanation can improve performance when learning programs for recursive list problems, which are notoriously difficult for ILP systems. Indeed, other state-of-the-art ILP systems [Law, 2018; Evans and Grefenstette, 2018; Kaminski et al., 2019] struggle to solve these problems. We use the same 10 problems used by [Cropper and Morel, 2021] to show that POPPER drastically outperforms METAGOL [Cropper and Muggleton, 2016] and ALEPH [Srinivasan, 2001]. The 10 tasks include a mix of monadic (e.g. evens and sorted), dyadic (e.g. droplast and finddup), and triadic (dropk) target predicates. Some problems are functional (e.g. last and len) and some are relational (e.g. finddup and member).

Settings. We provide as BK the monadic relations empty, zero, one, even, odd, the dyadic relations element, head, tail, increment, decrement, geq, and the triadic relation cons. We provide simple types and mark the arguments of predicates as either input or output. We allow up to two clauses with five body literals and up to five variables per clause.
Method. We generate 10 positive and 10 negative examples per problem. Each example is randomly generated from lists up to length 50, whose integer elements are sampled from 1 to 100. We test on 1000 positive and 1000 negative randomly sampled examples. We measure overall learning time, number of programs generated, and predictive accuracy. We also measure the time spent in the three distinct stages of POPPER and POPPERX. We repeat each experiment 25 times and record the mean and standard error.

            Number of programs               Learning time (seconds)
Name        POPPER      POPPERX    ratio     POPPER       POPPERX      ratio
len         590 ± 4     60 ± 5     0.10      16 ± 0.2     1 ± 0.1      0.09
dropk       114 ± 0.7   13 ± 1     0.12      1 ± 0.01     0.3 ± 0.02   0.23
finddup     1223 ± 22   644 ± 14   0.53      53 ± 2       20 ± 0.7     0.38
member      57 ± 2      14 ± 0.7   0.24      0.6 ± 0.03   0.2 ± 0.01   0.41
last        232 ± 6     64 ± 6     0.28      3 ± 0.1      2 ± 0.1      0.48
evens       306 ± 2     278 ± 2    0.91      7 ± 0.06     7 ± 0.09     1.00
threesame   18 ± 4      15 ± 3     0.81      0.2 ± 0.04   0.2 ± 0.04   1.00
droplast    161 ± 9     148 ± 10   0.92      11 ± 0.9     11 ± 1       1.02
addhead     32 ± 3      31 ± 2     0.96      0.6 ± 0.04   0.6 ± 0.04   1.05
sorted      708 ± 40    599 ± 26   0.85      29 ± 3       31 ± 2       1.08

Table 1: Results for programming puzzles. Left, the average number of programs generated by each system. Right, the corresponding average time to find a solution. We round values over one to the nearest integer. The error is standard error.

[Figure 3 (bar chart omitted): relative time per stage for each problem, POPPER on the left bar and POPPERX on the right bar.]
Figure 3: Relative time spent in three stages of POPPERX and POPPER. From bottom to top: testing (in red), generating hypotheses (in blue), and imposing constraints (in orange). Times are scaled by the total learning time of POPPER, with POPPER's average time(s) on the left and POPPERX's on the right. Bars are standard error.

Results. Both systems are equally accurate, except on sorted where POPPER scores 98% and POPPERX 99%. Accuracy is 98% on dropk and 99% on both finddup and threesame. All other problems have 100% accuracy.

Table 1 shows the learning times in relation to the number of programs generated. Crucially, it includes the ratio of the mean of POPPERX over the mean of POPPER. On these 10 problems, POPPERX always considers fewer hypotheses than POPPER. Only on three problems is over 90% of the original number of programs considered. On the len problem, POPPERX only needs to consider 10% of the number of hypotheses.

As seen from the ratio columns, the number of generated programs correlates strongly with the learning time (0.96 correlation coefficient). Only on three problems is POPPERX slightly slower than POPPER. Hence POPPER can be negatively impacted by failure explanation, however, when POPPERX is faster, the speed-up can be considerable.

To illustrate how failure explanation can drastically improve pruning, consider the following hypothesis that POPPERX considers in the len problem:

f(A,B):- element(A,D),odd(D),even(D),tail(A,C),element(C,B).

Failure explanation identifies the failing sub-program:

f(A,B):- element(A,D),odd(D),even(D).

As should be hopefully clear, generating constraints from this smaller failing program, which is not a POPPER hypothesis, leads to far more effective pruning.

Figure 3 shows the relative time spent in each stage of POPPERX and POPPER and that any of the stages can dominate the runtime. For addhead, it is hypothesis generation (predominantly spent searching for a model). For finddup, it is constraining (mostly spent grounding constraints). More pertinently, droplast, the only dyadic problem whose output is a list, is dominated by testing.

We can also infer the overhead of failure explanation by analysing SLD-trees from Figure 3. All problems from last to sorted have POPPERX spend more time on testing than POPPER. On both last and sorted, POPPERX incurs considerable testing overhead. Whilst for last this effort translates into more effective pruning constraints, for sorted this is not the case. Abstracting away from the implementation of failure explanation, we see that POPPER outfitted with zero-overhead sub-program identification would have been strictly faster.

Overall, these results strongly suggest that the answer to questions Q1 and Q2 is yes.

6 Conclusions and Limitations

To improve the efficiency of ILP, we have introduced an approach for failure explanation. Our approach, based on SLD-trees, identifies failing sub-programs, including failing literals in clauses. We implemented our idea in POPPERX. Our experiments show that failure explanation can drastically reduce learning times.

Limitations. We have shown that identifying failing sub-programs will lead to more constraints and thus more pruning of the hypothesis space (Theorem 1), which our experiments empirically confirm. We have not, however, quantified the theoretical effectiveness of pruning by sub-programs, nor have we evaluated improvements in predictive accuracy, which are implied by the Blumer bound [Blumer et al., 1987]. Future work should address both of these limitations. Although we have shown that failure explanation can drastically reduce learning times, we can still significantly improve our approach. For instance, reconsider the failing sub-program f(A,B):- element(A,D),odd(D),even(D) from Section 5.3. We should be able to identify that the two literals odd(D) and even(D) can never both hold in the body of a clause, which would allow us to prune more programs. Finally, in future work, we want to explore whether our inherently interpretable failure explanations can aid explainable AI and ultra-strong machine learning [Michie, 1988; Muggleton et al., 2018].

References

[Ahlgren and Yuen, 2013] John Ahlgren and Shiu Yin Yuen. Efficient program synthesis using constraint satisfaction in inductive logic programming. JMLR, 2013.
[Blockeel and Raedt, 1998] Hendrik Blockeel and Luc De Raedt. Top-down induction of first-order logical decision trees. AIJ, 1998.
[Blumer et al., 1987] Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K. Warmuth. Occam's razor. Inf. Process. Lett., 1987.
[Caballero et al., 2017] Rafael Caballero, Adrián Riesco, and Josep Silva. A survey of algorithmic debugging. ACM Comput. Surv., 2017.
[Cheney et al., 2009] James Cheney, Laura Chiticariu, and Wang Chiew Tan. Provenance in databases: Why, how, and where. Found. Trends Databases, 2009.
[Cropper and Morel, 2021] Andrew Cropper and Rolf Morel. Learning programs by learning from failures. Machine Learning, 2021. To appear.
[Cropper and Muggleton, 2016] Andrew Cropper and Stephen H. Muggleton. Learning higher-order logic programs through abstraction and invention. In IJCAI, 2016.
[Cropper et al., 2020] Andrew Cropper, Sebastijan Dumancic, and Stephen H. Muggleton. Turning 30: New ideas in inductive logic programming. In IJCAI, 2020.
[Cropper, 2019] Andrew Cropper. Playgol: Learning programs through play. In IJCAI, 2019.
[Ellis et al., 2018] Kevin Ellis, Lucas Morales, Mathias Sablé-Meyer, Armando Solar-Lezama, and Josh Tenenbaum. Learning libraries of subroutines for neurally-guided bayesian program induction. In NeurIPS, 2018.
[Evans and Grefenstette, 2018] Richard Evans and Edward Grefenstette. Learning explanatory rules from noisy data. JAIR, 2018.
[Evans et al., 2021] Richard Evans, José Hernández-Orallo, Johannes Welbl, Pushmeet Kohli, and Marek Sergot. Making sense of sensory input. Artificial Intelligence, 2021.
[Feng et al., 2018] Yu Feng, Ruben Martins, Osbert Bastani, and Isil Dillig. Program synthesis using conflict-driven learning. In PLDI, 2018.
[Hocquette and Muggleton, 2020] Céline Hocquette and Stephen H. Muggleton. Complete bottom-up predicate invention in meta-interpretive learning. In IJCAI, 2020.
[Inoue et al., 2014] Katsumi Inoue, Tony Ribeiro, and Chiaki Sakama. Learning from interpretation transition. Machine Learning, 2014.
[Kaminski et al., 2019] Tobias Kaminski, Thomas Eiter, and Katsumi Inoue. Meta-interpretive learning using hex-programs. In IJCAI, 2019.
[Law et al., 2020] Mark Law, Alessandra Russo, Elisa Bertino, Krysia Broda, and Jorge Lobo. Fastlas: scalable inductive logic programming incorporating domain-specific optimisation criteria. In AAAI, 2020.
[Law, 2018] Mark Law. Inductive learning of answer set programs. PhD thesis, Imperial College London, UK, 2018.
[Lin et al., 2014] Dianhuan Lin, Eyal Dechter, Kevin Ellis, Joshua B. Tenenbaum, and Stephen Muggleton. Bias reformulation for one-shot function induction. In ECAI, 2014.
[Lloyd, 2012] John W. Lloyd. Foundations of logic programming. Springer Science & Business Media, 2012.
[Michie, 1988] Donald Michie. Machine learning in the next five years. In EWSL, 1988.
[Midelfart, 1999] Herman Midelfart. A bounded search space of clausal theories. In ILP, 1999.
[Muggleton et al., 2018] S.H. Muggleton, U. Schmid, C. Zeller, A. Tamaddoni-Nezhad, and T. Besold. Ultra-strong machine learning - comprehensibility of programs learned with ILP. Machine Learning, 2018.
[Muggleton, 1991] Stephen Muggleton. Inductive logic programming. New Generation Comput., 1991.
[Muggleton, 1995] Stephen Muggleton. Inverse entailment and Progol. New Generation Comput., 1995.
[Nienhuys-Cheng and Wolf, 1997] Shan-Hwei Nienhuys-Cheng and Ronald de Wolf. Foundations of Inductive Logic Programming. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 1997.
[Patsantzis and Muggleton, 2021] S. Patsantzis and Stephen H. Muggleton. Top program construction and reduction for polynomial time meta-interpretive learning. Machine Learning, 2021.
[Plotkin, 1971] G.D. Plotkin. Automatic Methods of Inductive Inference. PhD thesis, Edinburgh University, August 1971.
[Popper, 2002] K.R. Popper. Conjectures and Refutations: The Growth of Scientific Knowledge. Routledge, 2002.
[Raedt, 2008] Luc De Raedt. Logical and relational learning. Cognitive Technologies. Springer, 2008.
[Raghothaman et al., 2020] Mukund Raghothaman, Jonathan Mendelson, David Zhao, Mayur Naik, and Bernhard Scholz. Provenance-guided synthesis of datalog programs. PACMPL, 2020.
[Schüller and Benz, 2018] Peter Schüller and Mishal Benz. Best-effort inductive logic programming via fine-grained cost-based hypothesis generation. Machine Learning, 2018.
[Shapiro, 1983] Ehud Y. Shapiro. Algorithmic Program Debugging. MIT Press, Cambridge, MA, USA, 1983.
[Silva et al., 2009] João P. Marques Silva, Inês Lynce, and Sharad Malik. Conflict-driven clause learning SAT solvers. In Handbook of Satisfiability, 2009.
[Silver et al., 2020] Tom Silver, Kelsey R. Allen, Alex K. Lew, Leslie Pack Kaelbling, and Josh Tenenbaum. Few-shot bayesian imitation learning with logical program policies. In AAAI, 2020.
[Srinivasan, 2001] A. Srinivasan. The ALEPH manual. 2001.

h1 = { droplast(A,B):- empty(A),tail(A,B). }
h2 = { droplast(A,B):- empty(A),cons(C,D,A),tail(D,B). }
h3 = { droplast(A,B):- tail(A,C),tail(C,B).
       droplast(A,B):- tail(A,B). }
h4 = { droplast(A,B):- empty(A),tail(A,B),head(A,C),head(B,C). }
h5 = { droplast(A,B):- tail(A,C),tail(C,B).
       droplast(A,B):- tail(A,B),tail(B,A). }
h6 = { droplast(A,B):- tail(A,B),empty(B).
       droplast(A,B):- cons(C,D,A),droplast(D,E),cons(C,E,B). }
h7 = { droplast(A,B):- tail(A,C),tail(C,B).
       droplast(A,B):- tail(A,B).
       droplast(A,B):- tail(A,C),droplast(C,B). }

H1 = { h1, h2, h3, h4, h5, h6, h7 }

Figure 4: LFF hypothesis space considered in Example 5.

A Appendix: LFF Example

Example 5. To illustrate LFF, consider learning a droplast/2 program. Suppose our predicate declarations D are head_pred(droplast,2), denoting that we want to learn a droplast/2 relation, and body_pred(empty,1), body_pred(head,2), body_pred(tail,2), and body_pred(cons,3). Suitable definitions for the provided body predicate declarations constitute our background knowledge B. To allow for learning a recursive program, we also supply the predicate declaration body_pred(droplast,2). Let e1+ = droplast([1,2,3],[1,2]), e2+ = droplast([1,2],[1]) and e1− = droplast([1,2],[]). Then E+ = {e1+, e2+} and E− = {e1−} are the positive and negative examples, respectively. Our initial set of hypothesis constraints C only ensures that hypotheses are well-formed, e.g. that each variable that occurs in the head of a rule occurs in the rule's body.

We now consider learning a solution for LFF input (E+, E−, B, D, C), where, for demonstration purposes, we use the simplified hypothesis space H1 ⊆ H_{D,C} of Figure 4. The order the hypotheses are considered in is by their number of literals. Pruning is achieved by adding additional hypothesis constraints. First we learn by a generate-test-and-constrain loop without failure explanation. This first sequence is representative of POPPER's execution:

1. POPPER starts by generating h1. B ∪ h1 fails to entail e1+ and e2+ and correctly does not entail e1−. Hence only specialisations of h1 get pruned, namely h4.

2. POPPER subsequently generates h2. B ∪ h2 fails to entail e1+ and e2+ and is correct on e1−. Hence specialisations of h2 get pruned, of which there are none in H1.

3. POPPER subsequently generates h3. B ∪ h3 does not entail the positive examples, but does entail negative example e1−. Hence specialisations and generalisations of h3 get pruned, meaning only generalisation h7.

4. POPPER subsequently generates h5. B ∪ h5 is correct on none of the examples. Hence specialisations and generalisations of h5 get pruned, of which there are none in H1.

5. POPPER subsequently generates h6. B ∪ h6 is correct on all the examples and hence is returned.

Now consider learning by a generate-test-and-constrain loop with failure explanation. The following execution sequence is representative of POPPERX:

1. POPPERX starts by generating h1. B ∪ h1 fails to entail e1+ and e2+ and correctly does not entail e1−. Failure explanation identifies sub-program h1′ = {droplast(A,B):- empty(A).}. h1′ fails in the same way as h1. Hence specialisations of both h1 and h1′ get pruned, namely h2 and h4.

2. POPPERX subsequently generates h3. B ∪ h3 does not entail the positive examples, but does entail negative example e1−. Failure explanation identifies sub-program h3′ = {droplast(A,B):- tail(A,C),tail(C,B).}. B ∪ h3′ fails in the same way as h3. Hence specialisations and generalisations of h3 and h3′ get pruned, meaning h5 and h7.

3. POPPERX subsequently generates h6. B ∪ h6 is correct on all the examples and hence is returned.

The difference in these two execution sequences is illustrative of how failure explanation can help prune away significant parts of the hypothesis space.