Bayesian Inductive Logic Programming
Stephen Muggleton
Oxford University Computing Laboratory
Wolfson Building, Parks Road
Oxford, OX1 3QD

Abstract

Inductive Logic Programming (ILP) involves the construction of first-order definite clause theories from examples and background knowledge. Unlike both traditional Machine Learning and Computational Learning Theory, ILP is based on lock-step development of Theory, Implementations and Applications. ILP systems have successful applications in the learning of structure-activity rules for drug design, semantic grammar rules, finite element mesh design rules and rules for prediction of protein structure and mutagenic molecules. The strong applications in ILP can be contrasted with relatively weak PAC-learning results (even highly-restricted forms of logic programs are known to be prediction-hard). It has recently been argued that the mismatch is due to distributional assumptions made in application domains. These assumptions can be modelled as a Bayesian prior probability representing subjective degrees of belief. Other authors have argued for the use of Bayesian prior distributions for reasons different to those here, though this has not led to a new model of polynomial-time learnability. Incorporation of Bayesian prior distributions over time-bounded hypotheses in PAC leads to a new model called U-learnability. It is argued that U-learnability is more appropriate than PAC for Universal (Turing computable) languages. Time-bounded logic programs have been shown to be polynomially U-learnable under certain distributions. The use of time-bounded hypotheses enforces decidability and allows a unified characterisation of speed-up learning and inductive learning. U-learnability has as special cases PAC and Natarajan's model of speed-up learning.

1 INTRODUCTION

This paper is written for two separate audiences: 1) a Machine Learning audience, interested in implementations and applications, and 2) a Computational Learning Theory audience, interested in models and results in learnability. The paper should thus be taken as two loosely connected papers with a common theme involving the use of distributional information in Inductive Logic Programming (ILP).

In the last few years ILP [23, 24, 20, 31] has developed from a theoretical backwater to a rapidly growing area in Machine Learning. This is in part due to the fact that pure logic programs provide Machine Learning with a representation which is general purpose (Turing-computable) and has a simple, rigorously defined semantics. In addition, logic programs tend to be easy to comprehend as hypotheses in scientific domains.

ILP is based on interaction between Theory, Implementations and Applications. The author has argued [25] that the development of a formal semantics of ILP (see Section 2) allows the direct derivation of ILP algorithms. Such derivational techniques have been at the centre of specific-to-general ILP techniques such as Plotkin's least general generalisation [37, 36], Muggleton and Buntine's inverse resolution [27, 23] and Idestam-Almquist's inverse implication [15]. Two contending semantics of ILP have been developed [31]: the so-called 'open-world' semantics [31] and 'closed-world' semantics [13].

A number of efficient ILP systems have been developed, including FOIL [38], Golem [28], LINUS [21], CLAUDIEN [39] and Progol [26]. The use of a relational logic formalism has allowed successful application of ILP systems in a number of domains in which the concepts to be learned cannot easily be described in an attribute-value, propositional-level language. These applications (see Section 3) include structure-activity prediction for drug design [19, 44, 43], protein secondary-structure prediction [29], finite-element mesh design [9] and learning semantic grammars [45].

A variety of positive and negative PAC-learnability results exist for subsets of definite clause logic [11, 34, 10, 18, 7, 40]. However, in contrast to the experimentally demonstrated abilities of ILP systems in applications, the positive PAC results are rather weak, and even highly restricted forms of logic programs have been shown to be prediction-hard [7]. Like many other results in PAC-learning, positive results are only achieved for ILP by setting language parameters to constant values (e.g. k clauses and l literals). This provides 'hard' boundaries to the hypothesis space, which often reduces the representation to a propositional level. An alternative model, U-learnability [30], is better suited to Universal (Turing computable) representations, such as logic programs. U-learnability provides a 'soft' boundary to hypothesis spaces in the form of a prior probability distribution over the complete representation (e.g. time-bounded logic programs). Other authors have argued for the use of Bayesian prior distributions for reasons different to those here, though this has not led to a new model of polynomial-time learnability. U-learnability allows standard distributional assumptions, such as Occam's razor, to be used in a natural manner, without parameter settings. Although Occam algorithms have been studied in PAC-learning [3], this has not led to algorithms capable of learning in universal representations, as it has in Inductive Logic Programming. In the early 1980s a distributional assumption very different to Occam's razor was implicitly implemented by Arbab and Michie [1] and Bratko [4] in an algorithm for constructing decision trees. The algorithm constructs a decision list (linear decision tree) where one exists and otherwise returns the most linear decision tree which can be constructed from the data. This distributional assumption is based on the fact that decision lists are generally easier to comprehend than arbitrary decision trees. Other interesting kinds of distributions along these lines can be imagined: assigning higher prior probabilities to grammars that are regular or almost so, logic programs that are deterministic or almost so, and logic programs that run in linear time or almost so.
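To make the contrast between a 'hard' language bound and a 'soft' prior concrete, the following sketch (an illustration under assumed conventions, not code from the paper) defines an Occam-style prior over clausal hypotheses that decays with theory size instead of cutting the space off at fixed numbers of clauses and literals. The tuple-of-clauses representation and the decay rate are assumptions introduced purely for this example.

    # Illustrative sketch only: a 'soft boundary' prior over clausal theories.
    # A hypothesis is assumed (for this example) to be a tuple of clauses,
    # each clause a tuple of literal strings; the decay rate 0.5 is arbitrary.

    def unnormalised_prior(hypothesis, decay=0.5):
        # Weight decay**size, where size is the total number of literals:
        # larger theories remain admissible but receive exponentially
        # smaller prior weight, unlike a hard k-clause/l-literal cut-off.
        size = sum(len(clause) for clause in hypothesis)
        return decay ** size

    def normalised_prior(candidates, decay=0.5):
        # Normalise over a finite candidate set so the weights sum to one.
        weights = [unnormalised_prior(h, decay) for h in candidates]
        total = sum(weights)
        return [w / total for w in weights]

    # A one-clause grandparent theory is preferred a priori to a two-clause
    # variant, but the larger theory still receives non-zero probability.
    h1 = (("grandparent(X,Z)", "parent(X,Y)", "parent(Y,Z)"),)
    h2 = h1 + (("grandparent(X,Y)", "parent(X,Y)"),)
    print(normalised_prior([h1, h2]))  # -> [0.8, 0.2]

Priors of the other kinds mentioned above (favouring almost-regular grammars, or programs that are almost deterministic or almost linear-time) would simply replace the size measure with a measure of deviation from the preferred class.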
One can imagine very different kinds of prior distributions on hypotheses. For example, when learning concepts which mimic human performance in skill tasks [41], predictive accuracy is dependent on hypotheses being evaluable in time similar to that of human reaction, and so such hypotheses should be preferred a priori.

This paper is organised as follows. Section 2 gives a formal definition of ILP in terms of Bayesian inference. Section 3 provides an overview of ILP applications in molecular biology and discusses some of the distributional assumptions used in these applications. Section 4 defines U-learnability and describes some general results on U-learnable distributions.

2 BAYES' ILP DEFINITIONS

Familiarity with standard definitions from Logic Programming [22, 14] is assumed in the following. A Bayesian version of the usual (open-world semantics) setting for ILP is as follows. Suppose P, F and V are sets of predicate symbols, function symbols and variables, and Cd and Ch are the classes of definite and Horn clauses constructed from P, F and V. The symbol ⊢n denotes SLDNF derivation bounded by n resolution steps. Examples E comprise positive examples E+ and negative examples E−, and B denotes background knowledge. An hypothesis Hn consists of a set of definite clauses H together with a resource bound n, where H ⊆ Cd, H ⊨ B and n is a natural number. Hn is said to hold for examples E, or holds(Hn, E), when both the following are true.

Sufficiency:    H ⊢n E+
Satisfiability: H ∧ E− ⊬n □

The prior probability, p(Hn), of an hypothesis is defined by a given distribution D. According to Bayes' theorem, the posterior probability is

    p(Hn | E) = p(E | Hn) · p(Hn) / p(E)

where p(E | Hn) is 1 when holds(Hn, E) and 0 otherwise, and

    p(E) = Σ p(Hn'),  the sum ranging over all Hn' such that holds(Hn', E).

Well-known strategies for making use of distributional information include a) maximisation of posterior probability, b) class prediction based on the posterior-weighted sum of predictions of all hypotheses (Bayes optimal strategy) and c) randomly sampling an hypothesis from the posterior probability (Gibbs algorithm). The sample complexities of strategies b) and c) are analysed in [12].
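The following sketch (an illustration under assumed conventions, not code from the paper) spells out these definitions for a small, explicitly enumerated hypothesis set: the 0/1 likelihood gives posterior mass only to hypotheses that hold for E, and strategies a)-c) are then simple choices over that posterior. The holds and predict callables and the dictionary encoding of the prior D are assumptions made for the example.

    # Illustrative sketch only: the Bayes' ILP posterior over a finite
    # hypothesis set, with the three strategies a)-c) described above.
    # 'holds(h, E)' and 'predict(h, x)' are assumed, example-specific
    # callables; 'prior' maps each hypothesis to its probability under D.
    import random

    def posterior(hypotheses, prior, holds, E):
        # p(E | Hn) is 1 if holds(Hn, E) and 0 otherwise, so p(E) is the
        # total prior mass of hypotheses that hold for E and
        # p(Hn | E) = p(E | Hn) * p(Hn) / p(E).
        # (Assumes at least one hypothesis holds for E.)
        p_E = sum(prior[h] for h in hypotheses if holds(h, E))
        return {h: (prior[h] / p_E if holds(h, E) else 0.0)
                for h in hypotheses}

    def map_hypothesis(post):
        # Strategy a): maximisation of posterior probability.
        return max(post, key=post.get)

    def bayes_optimal_predict(post, predict, x):
        # Strategy b): class prediction by the posterior-weighted vote of
        # all hypotheses on instance x (Bayes optimal strategy).
        weight_for_true = sum(p for h, p in post.items() if predict(h, x))
        return weight_for_true >= 0.5

    def gibbs_hypothesis(post):
        # Strategy c): randomly sample one hypothesis from the posterior
        # and use it for all subsequent predictions (Gibbs algorithm).
        hs, ps = zip(*post.items())
        return random.choices(hs, weights=ps, k=1)[0]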
Although no existing ILP system takes an hypothesis distribution D as an input parameter, such distributions are implicit in FOIL's [38] information gain measure and Golem's [32] compression measure. Additionally, search strategies such as general-to-specific or specific-to-general use an implicit prior distribution which favours more general hypotheses or, conversely, more specific ones.

Bayesian approaches to Machine Learning have been discussed previously in the literature. For instance, Buntine [6] develops a Bayesian framework for learning, which he applies to the problem of learning class probability trees. Also, Haussler et al. [12] analyse the sample complexity of two Bayesian algorithms. This analysis focuses on average-case, rather than worst-case, accuracy. On the other hand, U-learnability uses average-case time complexity and worst-case accuracy. Also, unlike U-learnability (Section 4), neither Buntine nor Haussler et al. develop a new learning model which allows significantly larger polynomially-learnable representations, such as logic programs. The applications in the following section show that ILP systems are effective at learning logic programs in significant real-world domains. This is consistent with the fact that time-bounded logic programs are U-learnable under certain distributions (see Section 4).

3 APPLICATIONS IN MOLECULAR BIOLOGY

ILP applications in Molecular Biology extend a long tradition in approaches to scientific discovery that was initiated by Meta-Dendral's [5] application in mass spectrometry.