Bayesian Inductive Logic Programming

Stephen Muggleton
Oxford University Computing Laboratory
Wolfson Building, Parks Road
Oxford, OX1 3QD

Abstract

Inductive Logic Programming (ILP) involves the construction of first-order definite clause theories from examples and background knowledge. Unlike both traditional Machine Learning and Computational Learning Theory, ILP is based on lock-step development of Theory, Implementations and Applications. ILP systems have successful applications in the learning of structure-activity rules for drug design, semantic grammar rules, finite element mesh design rules and rules for prediction of protein secondary structure and mutagenic molecules. The strong applications in ILP can be contrasted with relatively weak PAC-learning results (even highly-restricted forms of logic programs are known to be prediction-hard). It has recently been argued that the mismatch is due to distributional assumptions made in application domains. These assumptions can be modelled as a Bayesian prior probability representing subjective degrees of belief. Other authors have argued for the use of Bayesian prior distributions for reasons different to those here, though this has not led to a new model of polynomial-time learnability. Incorporation of Bayesian prior distributions over time-bounded hypotheses in PAC leads to a new model called U-learnability. It is argued that U-learnability is more appropriate than PAC for Universal (Turing computable) languages. Time-bounded logic programs have been shown to be polynomially U-learnable under certain distributions. The use of time-bounded hypotheses enforces decidability and allows a unified characterisation of speed-up learning and inductive learning. U-learnability has as special cases PAC and Natarajan's model of speed-up learning.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association of Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
COLT 94 - 7/94 New Brunswick, N.J., USA
© 1994 ACM 0-89791-655-7/94/0007..$3.50

1 INTRODUCTION

This paper is written for two separate audiences: 1) a Machine Learning audience, interested in implementations and applications and 2) a Computational Learning Theory audience interested in models and results in learnability. The paper should thus be taken as two loosely connected papers with a common theme involving the use of distributional information in Inductive Logic Programming (ILP).

In the last few years ILP [23, 24, 20, 31] has developed from a theoretical backwater to a rapidly growing area in Machine Learning. This is in part due to the fact that pure logic programs provide Machine Learning with a representation which is general purpose (Turing-computable), and has a simple, and rigorously defined, semantics. In addition, logic programs tend to be easy to comprehend as hypotheses in scientific domains.

ILP is based on interaction between Theory, Implementations and Applications. The author has argued [25] that the development of a formal semantics of ILP (see Section 2) allows the direct derivation of ILP algorithms. Such derivational techniques have been at the centre of specific-to-general ILP techniques such as Plotkin's least general generalisation [37, 36], Muggleton and Buntine's inverse resolution [27, 23] and Idestam-Almquist's inverse implication [15]. Two contending semantics of ILP have been developed [31]. These are the so-called 'open-world' semantics [31] and 'closed-world' semantics [13].

A number of efficient ILP systems have been developed, including FOIL [38], Golem [28], LINUS [21], CLAUDIEN [39] and Progol [26]. The use of a relational logic formalism has allowed successful application of ILP systems in a number of domains in which the concepts to be learned cannot easily be described in an attribute-value, propositional-level, language. These applications (see Section 3) include structure-activity prediction for drug design [19, 44, 43], protein secondary-structure prediction [29], finite-element mesh design [9] and learning semantic grammars [45].

A variety of positive and negative PAC-learnability results exist for subsets of definite clause logic [11, 34, 10, 18, 7, 40]. However, in contrast to experimentally demonstrated abilities of ILP systems in applications, the positive PAC results are rather weak, and even highly restricted forms of logic

programs have been shown to be prediction hard [7]. Like many other results in PAC-learning, positive results are only achieved for ILP by setting language parameters to constant values (eg. k-clauses and l-literals). This provides 'hard' boundaries to the hypothesis space, which often reduces the representation to a propositional level. An alternative model, U-learnability [30], is better suited to Universal (Turing computable) representations, such as logic programs. U-learnability provides a 'soft' boundary to hypothesis spaces in the form of a prior probability distribution over the complete representation (eg. time-bounded logic programs). Other authors have argued for the use of Bayesian prior distributions for reasons different to those here, though this has not led to a new model of polynomial-time learnability. U-learnability allows standard distributional assumptions, such as Occam's razor, to be used in a natural manner, without parameter settings. Although Occam algorithms have been studied in PAC-learning [3], this has not led to algorithms capable of learning in universal representations, as it has in Inductive Logic Programming.

In the early 1980's a distributional assumption very different to Occam's razor was implicitly implemented by Arbab and Michie [1] and Bratko [4] in an algorithm for constructing decision trees. The algorithm constructs a decision list (linear decision tree) where one exists and otherwise returns the most linear decision tree which can be constructed from the data. This distributional assumption is based on the fact that decision lists are generally easier to comprehend than arbitrary decision trees. Other interesting kinds of distributions along these lines can be imagined: assigning higher prior probabilities to grammars that are regular or almost so, logic programs that are deterministic or almost so and logic programs that run in linear time or almost so. One can imagine very different kinds of prior distributions on hypotheses. For example, when learning concepts which mimic human performance in skill tasks [41] predictive accuracy is dependent on hypotheses being evaluable in time similar to that of human reaction, and so such hypotheses should be preferred a priori.

Bayesian approaches to Machine Learning have been discussed previously in the literature. For instance, Buntine [6] develops a Bayesian framework for learning, which he applies to the problem of learning class probability trees. Also Haussler et al. [12] analyse the sample complexity of two Bayesian algorithms. This analysis focuses on average case, rather than worst case, accuracy. On the other hand U-learnability uses average case time complexity and worst case accuracy. Also unlike U-learnability (Section 4), neither Buntine nor Haussler et al. develop a new learning model which allows significantly larger polynomially-learnable representations, such as logic programs. The applications in the following section show that ILP systems are effective at learning logic programs in significant real-world domains. This is consistent with the fact that time-bounded logic programs are U-learnable under certain distributions (see Section 4).

This paper is organised as follows. Section 2 gives a formal definition of ILP in terms of Bayesian inference. Section 3 provides an overview of ILP applications in molecular biology and discusses some of the distributional assumptions used in these applications. Section 4 defines U-learnability and describes some general results on U-learnable distributions.

2 BAYES' ILP DEFINITIONS

Familiarity with standard definitions from Logic Programming [22, 14] is assumed in the following. A Bayesian version of the usual (open world semantics) setting for ILP is as follows. Suppose P, F and V are sets of predicate symbols, function symbols and variables, and Cd and Ch are the classes of definite and Horn clauses constructed from P, F and V. The symbol ⊢n denotes SLDNF derivation bounded by n resolution steps. An ILP learner is provided with background knowledge B ⊆ Cd and examples E = (E+, E−) in which E+ ⊆ Cd are positive examples and E− ⊆ (Ch − Cd) are negative examples. Each hypothesis is a pair Hn = (H, n) where H ⊆ Cd, H ⊨ B and n is a natural number. Hn is said to hold for examples E, or holds(Hn, E), when both the following are true.¹

Sufficiency: H ⊢n E+
Satisfiability: H ∧ E− ⊬n □

The prior probability, p(Hn), of an hypothesis is defined by a given distribution D. According to Bayes' theorem, the posterior probability is

p(Hn|E) = p(E|Hn) · p(Hn) / p(E)

where p(E|Hn) is 1 when holds(Hn, E) and 0 otherwise, and

p(E) = Σ over all H′n such that holds(H′n, E) of p(H′n)

Well known strategies for making use of distributional information include a) maximisation of posterior probability, b) class prediction based on the posterior weighted sum of predictions of all hypotheses (Bayes optimal strategy) and c) randomly sampling an hypothesis from the posterior probability (Gibbs algorithm). The sample complexities of strategies b) and c) are analysed in [12].

Although no existing ILP system takes an hypothesis distribution D as an input parameter, such distributions are implicit in FOIL's [38] information gain measure and Golem's [32] compression measure. Additionally, search strategies, such as general-to-specific or specific-to-general, use an implicit prior distribution which favours more general hypotheses or, conversely, more specific ones.

¹ Though most ILP systems can deal with noise, this is ignored in the above definition for the sake of simplicity.
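The 0/1-likelihood posterior above can be made concrete with a minimal numeric sketch (illustrative code, not from the paper; the hypothesis names, example sets and prior values are invented). Hypotheses that fail sufficiency or satisfiability receive likelihood 0, so the posterior simply renormalises the prior over the holding hypotheses; strategies a) and c) then fall out directly.

```python
import random

# Toy hypothesis space: (name, set of examples the hypothesis derives, prior).
# These stand in for definite clause theories Hn = (H, n).
hypotheses = [
    ("h1", frozenset({"e1", "e2"}), 0.4),
    ("h2", frozenset({"e1", "e2", "e3"}), 0.3),  # derives the negative e3
    ("h3", frozenset({"e1"}), 0.1),              # fails to derive the positive e2
    ("h4", frozenset({"e1", "e2", "e4"}), 0.2),
]

def holds(covered, pos, neg):
    """p(E|Hn): 1 iff all of E+ is derivable (sufficiency) and no member
    of E- is derivable (satisfiability)."""
    return pos <= covered and not (neg & covered)

def posterior(pos, neg):
    """Bayes' theorem with a 0/1 likelihood: renormalise the prior over the
    hypotheses that hold; the normaliser z plays the role of p(E)."""
    w = {name: (prior if holds(c, pos, neg) else 0.0)
         for name, c, prior in hypotheses}
    z = sum(w.values())
    return {name: wi / z for name, wi in w.items()}

post = posterior(pos={"e1", "e2"}, neg={"e3"})
map_h = max(post, key=post.get)                               # strategy a): MAP
gibbs_h = random.choices(list(post), list(post.values()))[0]  # strategy c): Gibbs
```

Here h2 and h3 get posterior 0, and the prior mass 0.4 and 0.2 of h1 and h4 renormalises to 2/3 and 1/3. Strategy b), the Bayes optimal strategy, would instead classify each new example by the posterior-weighted vote of all holding hypotheses rather than committing to one.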

3 APPLICATIONS IN MOLECULAR BIOLOGY

ILP applications in Molecular Biology extend a long tradition in approaches to scientific discovery that was initiated by Meta-Dendral's [5] application in mass spectrometry. Molecular biology domains are particularly appropriate for ILP due to the rich relational structure of the data. Such relationships make it hard to code feature-based representations for these domains. ILP's applications to protein secondary structure prediction [29], structure-activity prediction of drugs [19] and mutagenicity [42] of molecules have produced new knowledge, published in respected scientific journals in their area.

3.1 PROTEIN SECONDARY STRUCTURE PREDICTION

When a protein (amino acid sequence) shears from the RNA that codes it, the molecule convolves in a complex though deterministic manner. However, it is largely the final shape, known as secondary and tertiary structure, which determines the protein's function. Determination of the shape is a long and labour intensive procedure involving X-ray crystallography. On the other hand the complete amino acid sequence is relatively easy to determine and is known for a large number of naturally occurring proteins. It is therefore of great interest to be able to predict a protein's secondary and tertiary structure directly from its amino acid sequence.

Many attempts at secondary structure prediction have been made using hand-coded rules, statistical and pattern matching algorithms and neural networks. However, the best results for the overall prediction rate are still disappointing (65-70%).

In [29] the ILP system Golem was applied to learning secondary structure prediction rules for the sub-domain of α/α proteins. The input to the program consisted of 12 non-homologous proteins (1612 amino acid residues) of known secondary structure, together with background knowledge describing the chemical and physical properties of the naturally occurring amino acids. Golem learned a set of 21 rules that predict which amino acids are part of the α-helix based on the positional relationships and chemical and physical properties. The rules were tested on four independent non-homologous proteins (416 amino acid residues) giving an accuracy of 81% (with 2% expected error). The best previously reported result in the literature was 76%, achieved using a neural net approach. Although higher accuracy is possible on sub-domains such as α/α than is possible in the general domain, ILP has a clear advantage in terms of comprehensibility over neural network and statistical techniques.

However, according to the domain expert, Mike Sternberg, Golem's secondary-structure prediction rules were hard to comprehend since they tended to be overly specific and highly complex. In many scientific domains, comprehensibility of new knowledge has higher priority than accuracy. Such a priori preferences can be modelled by a prior distribution on hypotheses.

3.2 DRUG STRUCTURE-ACTIVITY PREDICTION

Proteins are large macro-molecules involving thousands of atoms. Drugs tend to be small, rigid molecules, involving tens of atoms. Drugs work by stimulating or inhibiting the activity of proteins. They do so by binding to specific sites on the protein, often in competition with natural regulatory molecules, such as hormones. The 3-dimensional structure of the protein binding site is not known in over 90% of all drug development projects. However, the in vitro activity of members of a family of drugs can be determined experimentally in the laboratory. This data is usually analysed by drug designers using statistical techniques, such as regression, to predict the activity of untested members of the family of drugs. This in turn leads to directed testing of the drugs predicted to have high activity.

An application of ILP to structure-activity prediction was reported in [19]. The ILP program Golem [28] was applied to the problem of modelling the structure-activity relationships of trimethoprim analogues binding to dihydrofolate reductase. The training data consisted of 44 trimethoprim analogues and their observed inhibition of Escherichia coli dihydrofolate reductase. A further 11 compounds were used as unseen test data. Golem obtained rules that were statistically more accurate on the training data and also better on the test data than a previously published linear regression model. Figure 1 illustrates the family of analogues used in the study.

Figure 1: The family of analogues in the first ILP study. A) Template of 2,4-diamino-5(substituted-benzyl)pyrimidines; R3, R4 and R5 are the three possible substitution positions. B) Example compound: 3-Cl, 4-NH2, 5-CH3.

Owing to the lack of variation in the molecules, a highly deterministic subset of the class of logic programs was sufficient to learn in this domain. Unlike the protein domain, the domain expert found the rules to be highly compact and thus easy to comprehend.

3.3 MUTAGENESIS PREDICTION

In a third molecular biology study ILP has been applied to predicting mutagenicity of molecules [42]. Mutagenicity is highly correlated to carcinogenicity. Unlike activity of a drug, mutagenicity is a property which pharmaceutical companies would like to avoid. Also, unlike drug development projects, the data on mutagenicity forms a highly heterogeneous set with large variations in molecular form, shape and charge distribution. The data comes from a wide number of sources, and the molecules involved are almost arbitrary in shape and bond connectivity (see Figure 2).

Figure 2: Examples of the diverse set of aromatic and heteroaromatic nitro compounds used in the mutagenesis study. A) 3,4,4'-trinitrobiphenyl B) 2-nitro-1,3,7,8-tetrachlorodibenzo-1,4-dioxin C) 1,6-dinitro-9,10,11,12-tetrahydrobenzo[e]pyrene D) nitrofurantoin.

The 2D bond-and-atom molecular descriptions of 229 aromatic and heteroaromatic nitro compounds taken from [8] were given to the ILP system Progol [26]. The study was confined to the problem of obtaining structural descriptions that discriminate molecules with positive mutagenicity from those which have zero or negative mutagenicity.

A set of 8 compact rules were discovered by Progol. These rules suggested 3 previously unknown features leading to mutagenicity. Figure 3 shows the structural properties described by the rules in the Progol theory.

Figure 3: Structural properties discovered by Progol.

The descriptions of these patterns are as follows. 1) Almost half of all mutagenic molecules contain 3 fused benzyl rings. Mutagenicity is enhanced if such a molecule also contains a pair of atoms connected by a single bond where one of the atoms is connected to a third atom by an aromatic bond. 2) Another strong mutagenic indicator is 2 pairs of atoms connected by a single bond, where the 2 pairs are connected by an aromatic bond as shown. 3) The third mutagenic indicator discovered by Progol was an aromatic carbon with a partial charge of +0.191 in which the molecule has three NO2 groups substituted onto a biphenyl template (two benzene rings connected by a single bond).

An interesting feature is that in the original regression analysis of the molecules by [8], a special 'indicator variable' was provided to flag the existence of three or more fused rings (this variable was then seen to play a vital role in the regression equation). The first rule (pattern 1 in Figure 3), expressing precisely this fact, was discovered by Progol without access to any specialist chemical knowledge. Further, the regression analysis suggested that electron-attracting elements conjugated with nitro groups enhance mutagenicity. A particular form of this pattern was also discovered by Progol (pattern 3 in Figure 3).

Previous published results for the same data [8] used linear discriminant techniques and hand-crafted features. The ILP technique allowed prediction of 49 cases which were statistical outliers not predictable by the linear discriminant algorithm. In addition, the ILP rules are small and simple enough to provide insight into the molecular basis of mutagenicity. Such a low-level representation opens up a range of arbitrary chemical structures for analysis. This is of particular interest for drug discovery, since all compounds within a database of compounds have known atom and bond descriptions, while very few have computed features available.

Owing to the large range of molecules the rules were highly non-deterministic (of the form "there exists somewhere in molecule M a structure S with property P"). Owing to simplicity, the mutagenicity rules were the most highly favoured by the expert among the Molecular Biology domains studied. For all the domains in Molecular Biology comprehensibility, rather than accuracy, provides the strongest incentive for an Occam distribution (ie. shorter theories preferred a priori).

4 U-LEARNABILITY

The learnability model defined in this section allows for the incorporation of a priori distributions over hypotheses. The definition of this model, U-learnability, is motivated by the general problems associated with learning Universal (Turing computable) representations, such as logic programs. Nevertheless U-learnability can be easily applied to other representations, such as decision trees.
U-learnability is defined using the traditional notation from computational learning theory rather than that typically used in ILP (see Section 2).

4.1 BACKGROUND FOR U-LEARNING

Let (ΣE, X, ΣC, R, c) be the representation of a learning problem (as in PAC-learning), where X is the domain of examples (finite strings over the alphabet ΣE), R is the class of concept representations (finite strings over the alphabet ΣC), and c : R → 2^X maps concept representations to concepts (subsets of X). Hence the concept class being learned is the actual range of c (a subset of 2^X). Let P be a subset of the class of polynomial functions in one variable. From this point on, a concept representation will be a pair (r, p), for r ∈ R and p ∈ P.

Let A(r, p, x) be an algorithm that maps any tuple (r, p, x), for r ∈ R, p ∈ P, and x ∈ X, to 0 or 1. Let A run in time bounded by q(|r|, p(|x|)), where q is a polynomial in two variables. Furthermore, let A have the following properties. First, for all r ∈ R and x ∈ X, if x ∉ c(r) then for every p ∈ P: A(r, p, x) = 0. Second, for all r ∈ R and x ∈ X, if x ∈ c(r) then there exists p ∈ P such that for every p′ ∈ P with p′(|x|) ≥ p(|x|): A(r, p′, x) = 1. Intuitively, p specifies some form of time bound that is polynomially related to the size of the example. It might specify the maximum derivation length (number of resolution steps) for logic programs or the maximum number of steps in a Turing Machine computation. Greater derivation lengths or more computation steps are allowed for larger inputs.

Let F be any family of probability distributions over R × P, where the distributions in F have an associated parameter n ≥ 0. Let G be any family of probability distributions over X.

4.2 PROTOCOL FOR U-LEARNING

In U-learning, a Teacher randomly chooses a pair (r, p), for r ∈ R and p ∈ P, according to some distribution D1 in F. The Teacher presents to a learning algorithm L an infinite stream of labeled examples ⟨(x1, l1), (x2, l2), ...⟩ where each example xi is drawn randomly, independently of the preceding examples, from X according to some fixed, unknown distribution D2 in G and labeled by li = A(r, p, xi). After each example xi in the stream of examples, L outputs a hypothesis Hi = (ri, pi), where ri ∈ R and pi ∈ P.

4.3 DEFINITION OF U-LEARNABILITY

Definition 1 Polynomial U-learnability. Let F be a family of distributions (with associated parameters) over R × P, and let G be a family of distributions over X. The pair (F, G) is polynomially U-learnable just if there exist polynomial functions p1(y) = y^c, for some constant c, p2(y), p3(y1, y2), and a learning algorithm L, such that for every distribution D1 (with parameter n) in F, and every distribution D2 in G, the following hold.

● The average-case time complexity of L at any point in a run is bounded by p1(M), where M is the sum of q(|r|, p(|xi|)) over the examples xi seen to that point.² (Recall that (r, p) is the target concept chosen randomly according to D1, and that q describes the time complexity of the evaluation algorithm A.)

● For all m > p2(n), there exist values ε and δ, 0 < ε, δ < 1, such that: p3(1/ε, 1/δ) > m, and with probability at least 1 − δ the hypothesis Hm = (rm, pm) is (1 − ε)-accurate. By (1 − ε)-accurate, we mean that the probability (according to D2) that A(rm, pm, xm+1) ≠ lm+1 is less than ε.

² To be more precise, let m be any positive integer. Let PrD1(r, p) be the probability assigned to (r, p) by D1, and (with a slight abuse of notation) let PrD1,D2(r, p, (x1, ..., xm)) be the product of PrD1(r, p) and the probability of obtaining the sequence (x1, ..., xm) when drawing m examples randomly and independently according to D2. Let TIMEL(r, p, (x1, ..., xm)) be the time spent by L until output of its hypothesis Hm, when the target is (r, p) and the initial sequence of m examples is (x1, ..., xm). Then we say that the average-case time complexity of L to output Hm is bounded by p1(y) = y^c just if the sum (or integral), over all tuples (r, p, (x1, ..., xm)), of

PrD1,D2(r, p, (x1, ..., xm)) · [TIMEL(r, p, (x1, ..., xm))]^(1/c) / M

is less than infinity, where M is the sum of q(|r|, p(|xi|)) over all xi in x1, ..., xm. See [2] for the motivation of this definition of average-case complexity.

As with PAC-learning, we can also consider a prediction variant, in which the hypotheses Hm need not be built from R (in addition to P), but can come from a different class R′ and can have a different associated evaluation algorithm A′.

4.4 INITIAL RESULTS AND PROPOSED DIRECTIONS

This section presents initial results about polynomial U-learnability and suggests directions for further work. To conserve space the proofs are omitted, but they appear in a full paper on U-learnability [30]. In each of these results, let (ΣE, X, ΣC, R, c) be the representation of any given learning problem. The following theorem shows that PAC-learnability is a special case of U-learnability.

Theorem 2 U-learnability generalises PAC. Let p be the polynomial function p(x) = x. Let F be a family of distributions that contains one distribution Di for each ri ∈ R, such that Di assigns probability 1 to (ri, p). Let n = 0 be the parameter for each member of F. Let G be the family of all distributions over X. Then the representation (ΣE, X, ΣC, R, c) is PAC-learnable (PAC-predictable) if and only if (F, G) is polynomially U-learnable (U-predictable).

This observation raises the question of how, exactly, polynomial U-learnability differs from PAC-learnability. There is a cosmetic difference that polynomial U-learnability has been defined to look more similar to identification in the limit than does the definition of PAC-learnability. Ignoring this leaves the following distinguishing features of polynomial U-learnability.

● Assumption of a prior distribution on target concept representations.

● Sample size is related to ε, δ, and (by means of the parameter n) the reciprocal of the probability of the target representation rather than the target representation size.

● Use of average case time complexity, relative to the prior distribution on hypotheses and the distribution on examples, in place of worst case time complexity.

● Concept representations need not be polynomially evaluable in the traditional sense, but only in the weaker sense that some polynomial p must exist for each concept representation such that the classification of an example x can be computed in time p(|x|). This distinction and the next are irrelevant for concept representations that are already polynomially evaluable and hence do not need time bounds.

● The learning algorithm runs in (expected) time polynomial in the (worst-case) time required to compute the classification, by the target, of the examples seen thus far, rather than in the sum of the sizes of the examples seen thus far.

Definition 3 The distribution family Dp. Let R be countable, let p′(y) be a polynomial function, and let Ep′ = (E1, E2, ...) be an enumeration of subsets of R such that: (1) for all i ≥ 1, the cardinality of Ei is at most p′(i), (2) every r ∈ R is in some Ei, and (3) the members of Ei, for each i, can be determined efficiently (by an algorithm that runs in time polynomial in the sum of the sizes of the members of Ei). Let F′ be the family of all distributions D′, with finite variance, over the positive integers, and let the parameter n′ of D′ be the maximum of the mean and standard deviation of D′. For each such distribution D′ in F′, define a corresponding distribution D″ over R as follows: for each i ≥ 1, let the probabilities of the r ∈ Ei be uniformly distributed with sum PrD′(i). Let the parameter associated with D″ also be n′. Let P be the set of all linear functions of the form p(x) = cx, where c is a positive integer. Let D′c be any probability distribution over the positive integers that has a probability density function (p.d.f.) prD′c(x) ∝ 1/x^k, for some k ≥ 3 (appropriately normalised so that it sums to 1). For each distribution D′c, define a distribution DL over P to assign to the function p(x) = cx the probability assigned to c by D′c. Finally, for each pair of distributions D″ and DL as defined above, define a distribution D over R × P as follows: for each pair (r, p), where r ∈ R and p ∈ P, PrD(r, p) = [PrD″(r)][PrDL(p)]. Let the parameter n of D be n′. Dp is the family of all such distributions D, each with associated parameter n.

Theorem 4 Polynomial U-learnability under Dp. Let F be the distribution family Dp defined by a particular enumeration of subsets of R. Let G be any family of distributions over X. The pair (F, G) is polynomially U-learnable.

The following example relates U-learnability to the ILP definitions given in Section 2.

Example 5 Time-bounded logic programs are U-learnable under Dp. Let R be the set of all logic programs that can be built from a given alphabet of predicate symbols P, function symbols F, and variables V. Notice that R is countable. Let P be the set of all bounds p(x) on derivation length of the form p(x) = cx, where c is a positive integer and x is the size of (number of terms in) the goal. Let the domain X of examples be the ground atomic subset of Cd, which is the set of definite clauses that can be built from the given alphabet. Notice also that a concept (r, p) classifies as positive just the definite clauses d ∈ X such that r ⊢p(|d|) d, where |d| is the number of terms in d. Let F be the family of distributions Dp built using some particular enumeration of polynomially-growing subsets of R, and let G be the family of all distributions over examples. From Theorem 4, it follows that (F, G) is polynomially U-learnable.

Definition 6 The distribution family DEk. Let R be countable, let k be a positive integer, and let Ee = (E1, E2, ...) be an enumeration of subsets of R such that: (1) for all i ≥ 1, the cardinality of Ei is at most k^i, (2) every (r, p), for r ∈ R and p ∈ P, is in some Ei, and (3) the members of Ei, for each i, can be determined efficiently (by an algorithm that runs in time polynomial in the sum of the sizes of the members of Ei). Let D′k be the discrete exponential distribution with probability density function

Pr(x) = (k − 1)k^(−x) for all integers x ≥ 1

and let the parameter n′ of D′k be its mean, k/(k − 1). From D′k, define a corresponding distribution Dk over R as follows: for each i ≥ 1, let the probabilities of the r ∈ Ei be uniformly distributed with sum PrD′k(i). Let the parameter associated with Dk also be n′. Let P be the set of all linear functions of the form p(x) = cx, where c is a positive integer. Let the distributions DL over P be as defined earlier (see definition of Dp). Finally, for the distribution Dk (with parameter n′) and each distribution DL, define a distribution D over R × P as follows: for each pair (r, p), for r ∈ R and p ∈ P, PrD(r, p) = [PrDk(r)][PrDL(p)]. Let the parameter n of D be the parameter n′ of Dk. DEk is the family of distributions consisting of each such distribution D.

Theorem 7 Polynomial U-learnability under DEk. Let F be the distribution family DEk defined by a particular enumeration of subsets of R and a particular choice of k. Let G be any family of distributions over X. The pair of distributions (F, G) is polynomially U-learnable.

Example 8 Time-bounded logic programs are U-learnable under DEk. Again let R be the set of all logic programs that can be built from a given finite alphabet of predicate symbols P, function symbols F, and variables V. And again let P be the set of all bounds p(x) on derivation length of the form p(x) = cx, where c is a positive integer and x is the size of (number of terms in) the goal. Let the domain X of examples be the ground atomic subset of Cd. Let (S1, S2, ...) be an enumeration of subsets of R such that Si contains all logic programs of length i, for all i ≥ 1. Let F be the distribution family DEk built from this enumeration and from the positive integer k chosen to be the sum of the sizes of P, F, and V. Let G be the family of all distributions over examples. From Theorem 7, it follows that (F, G) is polynomially U-learnable.
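As a numeric sanity check on Definition 6, here is an illustrative sketch (not code from the paper; the enumeration members are just labelled placeholders): level i of the enumeration receives total probability (k − 1)k^(−i), shared uniformly among at most k^i members, so the level probabilities sum to 1 and the mean level index is k/(k − 1).

```python
# Finite truncation of a DEk-style prior over representations, for k = 3.
k = 3
levels = 10  # truncate the infinite enumeration E1, E2, ... for illustration

def level_prob(i):
    """Discrete exponential over levels: Pr(i) = (k - 1) * k**(-i), i >= 1."""
    return (k - 1) * k ** (-i)

prior = {}
for i in range(1, levels + 1):
    members = [f"r_{i}_{j}" for j in range(k ** i)]  # |Ei| <= k**i, as required
    share = level_prob(i) / len(members)             # uniform within level Ei
    for r in members:
        prior[r] = share

total = sum(level_prob(i) for i in range(1, levels + 1))
mean_level = sum(i * level_prob(i) for i in range(1, levels + 1))
# As levels -> infinity: total -> 1 and mean_level -> k / (k - 1) = 1.5,
# the parameter n' of D'k.
```

The point of the construction is visible here: although later levels hold exponentially many representations, their total mass decays exponentially, so the probability of a target representation is tied to how early it appears in the enumeration.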

In many cases it may be incorrect to assume that the probabilities of target representations vanish as quickly as they do in the distributions considered in the preceding theorems. In particular, a more appropriate distribution on target representations may be defined by combining an enumeration Ee, as in Definition 6, with some distribution in which the probabilities do not decrease exponentially, as in Definition 3. Polynomial U-learnability should be considered with such distributions, perhaps combined with restricted families of distributions on examples. It can be expected that obtaining positive results for "natural" such distributions will be theoretically challenging, and that such results will have a major impact on practical applications of machine learning. It is also expected that it will be challenging and useful to obtain negative results for some such distributions. Whereas negative results for PAC-learnability can be obtained by showing that the consistency problem for a given concept representation is NP-hard (assuming RP ≠ NP) [35], negative results for polynomial U-learnability in some cases can be obtained by showing that the consistency problem is NP-hard-on-average (or DistNP hard) [2, 16] relative to particular distributions on hypotheses and examples. In addition, just as negative results for PAC-predictability can be obtained (based on certain assumptions) using hard problems from cryptography [17], negative results for polynomial U-predictability might possibly be obtained in this same way, but would require additional effort. Specifically, obtaining negative results for polynomial U-predictability will require specifying a particular distribution (or family of distributions) to be used with a hard problem from cryptography, such as inverting RSA, and it must be verified, in some way, that the problem does not become substantially easier under this distribution (or every distribution in this family). For example, inverting RSA is assumed hard if encryption/decryption is based on a choice of two large random prime numbers according to a uniform distribution over primes of some large size, but it becomes easy if the distribution over primes of a given size assigns probability 1 to one particular prime number; what about for distributions "in between", where some primes of a given size are more likely than others? The distributions used in polynomial U-predictability would correspond to such distributions for RSA.

4.5 INCORPORATING BACKGROUND KNOWLEDGE

Recall that background knowledge is often used in ILP. This paper has not provided for the use of background knowledge in polynomial U-learnability. The full paper on U-learnability includes a definition of polynomial U-learnability that allows the teacher to provide the learner with a background theory. A surprising aspect of this definition is that it not only captures the notion of inductive learning relative to background knowledge, as desired, but it also provides a generalisation of Natarajan's PAC-style formalisation of speed-up learning [33]. Although Natarajan's formalisation is quite appealing, few positive results have been proven within it because it is also very demanding. It is hoped that the generalised definition will relax these demands and lead to additional positive results.

5 DISCUSSION

As the ILP applications in Section 3 show, not only is Horn clause logic a powerful representation language for scientific concepts, but also one which is tractable in real-world domains. This apparently flies in the face of negative PAC-learnability results for logic programs. This may be explained by the fact that despite logic programs being used as the representation language throughout the applications in Section 3, very different distributional assumptions were necessary in each application.

ILP has a clear model-theoretic semantics which is inherited from Logic Programming. However, without consideration of the probabilistic nature of inductive inference, a purely logic-based definition of ILP is always incomplete. For this reason the Bayes' oriented definitions of ILP found in Section 2 provide a more complete semantics for ILP than has previously been suggested. This introduces prior distributional information into the definition of ILP. The introduction of such information is also motivated both by the requirements of practical experience and the ability to define a model of learnability which is more appropriate for learning logic programs.

All ILP algorithms make implicit use of distributional information. One promising line of research would involve explicit provision of distributions to general-purpose ILP systems.

The choice of representation is at the centre of focus in PAC-learning. By contrast, the choice of distribution is the central theme in U-learnability. The reason for the refocusing of attention is that once one is committed to a Universal representation, as in the case of ILP, differences in distributional assumptions become paramount.

Some general initial results of U-learnability are stated in this paper. Much more effort is required to clearly define the boundaries of what is and is not U-learnable. It is believed that U-learnability will not only be of interest to researchers in ILP but also to the wider community within Machine Learning.

Acknowledgements

The author would like to thank David Page and Ashwin Srinivasan of Oxford University Computing Laboratory and Michael Sternberg and Ross King of Imperial Cancer Research Fund for discussions and input to this paper. This work was supported partly by the Esprit Basic Research Action ILP (project 6020), the SERC project GR/J46623 and
By contrast, the choice of distribution is the central distribution over primes of a given size assigns probability theme in U-learnability. The reason for the refocusing of 1 to one particular prime number; what about for distribu- attention is that once one is committed to a Universal repre- tions “in between”, where some primes of a given size are sentation, as in the case of ILP, differences in distributional more likely than others? The distributions used in polyno- assumptions become paramount. mial U-predictability would correspond to such distributions Some general initial results of U-learnability are stated in for RSA. this paper. Much more effort is required to clearly define the boundaries of what is and is not U-learnable. It is believed 4.5 INCORPORATING BACKGROUND that U-learnability will not only be of interest to researchers in KNOWLEDGE ILP but also to the wider community within Machine Learn- ing. Recall that background knowledge is often used in ILP. This paper has not provided for the use of background knowl- edge in polynomial U-learnability. The full paper on U- Acknowledgements learnability includes a definition of polynomial U-learnability that allows the teacher to provide the learner with a back- ground theory. A surprising aspect of this definition is that The author would like to thank David Page and Ashwin it not only captures the notion of inductive learning relative Srinivasan of Oxford University Computing Laboratory and to background knowledge, as desired, but it also provides Michael Sternberg and Ross King of Imperial Cancer Re- a generalisation of Natarajan’s PAC-style formalisation of search Fund for discussions and input to this paper. This speed-up learning [33]. Although Natarajan’s formalisation work was supported partly by the Esprit Basic Research Ac- is quite appealing, few positive have been proven within it be- tion ILP (project 6020), the SERC project GR/J46623 and cause it is also very demanding. 
The use of a family of prior an SERC Advanced Research Fellowship held by the author. distributions on hypotheses, as specified in the definition of The author is also supported by a Research Fellowship at polynomial U-learnability, can in some cases effectively ease Wolfson College, Oxford.
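The question raised in Section 4 about prime distributions "in between" the uniform distribution and a point mass can be made concrete with a small sketch. The geometric weighting and every name below are illustrative assumptions, not anything prescribed by the paper:

```python
import math

def primes_with_bits(b):
    # All primes with exactly b bits (trial division; tiny b only).
    def is_prime(n):
        return n > 1 and all(n % k for k in range(2, int(n ** 0.5) + 1))
    return [n for n in range(2 ** (b - 1), 2 ** b) if is_prime(n)]

def geometric_family(ps, r):
    # Hypothetical one-parameter family over an ordered list of primes:
    # r = 1 gives the uniform distribution, r -> 0 a point mass on ps[0],
    # and intermediate r the "in between" distributions.
    w = [r ** j for j in range(len(ps))]
    z = sum(w)
    return [x / z for x in w]

def entropy(dist):
    # Shannon entropy in bits; low entropy signals concentration on a
    # few primes, the regime in which inverting RSA may become easier.
    return -sum(p * math.log2(p) for p in dist if p > 0)

ps = primes_with_bits(8)  # the 8-bit primes 131, 137, ..., 251
for r in (1.0, 0.5, 0.01):
    print(r, round(entropy(geometric_family(ps, r)), 3))
```

As r shrinks, the entropy of the family falls monotonically from log2 of the number of primes towards zero, tracing out exactly the spectrum of distributions whose hardness for RSA inversion would need to be verified.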

References

[1] B. Arbab and D. Michie. Generating rules from examples. In IJCAI-85, pages 631-633, Los Altos, CA, 1985. Kaufmann.

[2] S. Ben-David, B. Chor, and O. Goldreich. On the theory of average case complexity. Journal of Computer and System Sciences, 44:193-219, 1992.

[3] R. Board and L. Pitt. On the necessity of Occam algorithms. Technical Report UIUCDCS-R-89-1544, University of Illinois at Urbana-Champaign, 1989.

[4] I. Bratko. Generating human-understandable decision rules. Working paper, E. Kardelj University, Ljubljana, Yugoslavia, 1983.

[5] B. Buchanan, E. Feigenbaum, and N. Sridharan. Heuristic theory formation: data interpretation and rule formation. In B. Meltzer and D. Michie, editors, Machine Intelligence 7, pages 267-290. Edinburgh University Press, 1972.

[6] W. Buntine. A Theory of Learning Classification Rules. PhD thesis, School of Computing Science, University of Technology, Sydney, 1990.

[7] W. Cohen. Learnability of restricted logic programs. In S. Muggleton, editor, Proceedings of the 3rd International Workshop on Inductive Logic Programming (Technical Report IJS-DP-6707 of the Josef Stefan Institute, Ljubljana, Slovenia), pages 41-72, 1993.

[8] A.K. Debnath, R.L. Lopez de Compadre, G. Debnath, A.J. Shusterman, and C. Hansch. Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. Correlation with molecular orbital energies and hydrophobicity. Journal of Medicinal Chemistry, 34(2):786-797, 1991.

[9] B. Dolsak and S. Muggleton. The application of Inductive Logic Programming to finite element mesh design. In S. Muggleton, editor, Inductive Logic Programming. Academic Press, London, 1992.

[10] S. Džeroski, S. Muggleton, and S. Russell. PAC-learnability of determinate logic programs. In Proceedings of the 5th ACM Workshop on Computational Learning Theory, Pittsburgh, PA, 1992.

[11] D. Haussler. Applying Valiant's learning framework to AI concept-learning problems. In Y. Kodratoff and R. Michalski, editors, Machine Learning: An Artificial Intelligence Approach, volume 3, pages 641-669. Morgan Kaufmann, San Mateo, CA, 1990.

[12] D. Haussler, M. Kearns, and R. Schapire. Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. In COLT-91: Proceedings of the 4th Annual Workshop on Computational Learning Theory, pages 61-74, San Mateo, CA, 1991. Morgan Kaufmann.

[13] N. Helft. Induction as nonmonotonic inference. In Proceedings of the 1st International Conference on Principles of Knowledge Representation and Reasoning, pages 149-156. Kaufmann, 1989.

[14] C.J. Hogger. Essentials of Logic Programming. Oxford University Press, Oxford, 1990.

[15] P. Idestam-Almquist. Generalisation of Clauses. PhD thesis, Stockholm University, 1993.

[16] D. Johnson. The NP-completeness column: an ongoing guide. Journal of Algorithms, 4:284-299, 1984.

[17] M. Kearns and L. Valiant. Cryptographic limitations on learning boolean formulae and finite automata. In Proceedings of the 21st ACM Symposium on Theory of Computing, pages 433-444. ACM, 1989.

[18] J.U. Kietz. Some lower bounds on the computational complexity of inductive logic programming. In P. Brazdil, editor, Proceedings of the 6th European Conference on Machine Learning, volume 667 of Lecture Notes in Artificial Intelligence, pages 115-123. Springer-Verlag, 1993.

[19] R. King, S. Muggleton, R. Lewis, and M. Sternberg. Drug design by machine learning: The use of inductive logic programming to model the structure-activity relationships of trimethoprim analogues binding to dihydrofolate reductase. Proceedings of the National Academy of Sciences, 89(23), 1992.

[20] N. Lavrač and S. Džeroski. Inductive Logic Programming: Techniques and Applications. Ellis Horwood, 1993.

[21] N. Lavrač, S. Džeroski, and M. Grobelnik. Learning non-recursive definitions of relations with LINUS. In Y. Kodratoff, editor, Proceedings of the 5th European Working Session on Learning, volume 482 of Lecture Notes in Artificial Intelligence. Springer-Verlag, 1991.

[22] J.W. Lloyd. Foundations of Logic Programming. Springer-Verlag, Berlin, 1984.

[23] S. Muggleton. Inductive logic programming. New Generation Computing, 8(4):295-318, 1991.

[24] S. Muggleton, editor. Inductive Logic Programming. Academic Press, 1992.

[25] S. Muggleton. Inductive logic programming: derivations, successes and shortcomings. SIGART Bulletin, 5(1):5-11, 1994.

[26] S. Muggleton. Mode-directed inverse resolution. In K. Furukawa, D. Michie, and S. Muggleton, editors, Machine Intelligence 14. Oxford University Press. (to appear).

[27] S. Muggleton and W. Buntine. Machine invention of first-order predicates by inverting resolution. In Proceedings of the Fifth International Conference on Machine Learning, pages 339-352. Kaufmann, 1988.

[28] S. Muggleton and C. Feng. Efficient induction of logic programs. In Proceedings of the First Conference on Algorithmic Learning Theory, Tokyo, 1990. Ohmsha.

[29] S. Muggleton, R. King, and M. Sternberg. Protein secondary structure prediction using logic-based machine learning. Protein Engineering, 5(7):647-657, 1992.

[30] S. Muggleton and D. Page. A learnability model for universal representations. Technical Report PRG-TR-3-94, Oxford University Computing Laboratory, Oxford, 1994.

[31] S. Muggleton and L. De Raedt. Inductive logic programming: Theory and methods. Journal of Logic Programming, 12, 1994. (to appear).

[32] S. Muggleton, A. Srinivasan, and M. Bain. Compression, significance and accuracy. In Proceedings of the Ninth International Machine Learning Conference, San Mateo, CA, 1992. Morgan Kaufmann.

[33] B.K. Natarajan. Learning from exercises. In Proceedings of the 1989 Workshop on Computational Learning Theory, pages 72-86, San Mateo, CA, 1989. Morgan Kaufmann.

[34] D. Page and A. Frisch. Generalization and learnability: A study of constrained atoms. In S. Muggleton, editor, Inductive Logic Programming. Academic Press, London, 1992.

[35] L. Pitt and L.G. Valiant. Computational limitations on learning from examples. Journal of the ACM, 35(4):965-984, 1988.

[36] G. Plotkin. A further note on inductive generalization. In Machine Intelligence, volume 6. Edinburgh University Press, 1971.

[37] G.D. Plotkin. A note on inductive generalisation. In B. Meltzer and D. Michie, editors, Machine Intelligence 5, pages 153-163. Elsevier North Holland, New York, 1970.

[38] R. Quinlan. Learning logical definitions from relations. Machine Learning, 5:239-266, 1990.

[39] L. De Raedt and M. Bruynooghe. A theory of clausal discovery. In Proceedings of the 13th International Joint Conference on Artificial Intelligence. Morgan Kaufmann, 1993.

[40] L. De Raedt and S. Džeroski. First-order jk-clause theories are PAC-learnable. Technical report, Department of Computer Science, Katholieke Universiteit Leuven, Heverlee, Belgium, 1994.

[41] C. Sammut, S. Hurst, D. Kedzier, and D. Michie. Learning to fly. In D. Sleeman and P. Edwards, editors, Proceedings of the Ninth International Workshop on Machine Learning, pages 385-393, San Mateo, CA, 1992. Morgan Kaufmann.

[42] A. Srinivasan and S. Muggleton. Discovering rules for mutagenesis. Technical Report PRG-TR-4-94, Oxford University Computing Laboratory, Oxford, 1994.

[43] M. Sternberg, R. King, R. Lewis, and S. Muggleton. Application of machine learning to structural molecular biology. Philosophical Transactions of the Royal Society B, 1994. (to appear).

[44] M. Sternberg, R. Lewis, R. King, and S. Muggleton. Modelling the structure and function of enzymes by machine learning. Proceedings of the Royal Society of Chemistry: Faraday Discussions, 93:269-280, 1992.

[45] J. Zelle and R. Mooney. Learning semantic grammars with constructive inductive logic programming. In Proceedings of the Eleventh National Conference on Artificial Intelligence, pages 817-822, San Mateo, CA, 1993. Morgan Kaufmann.