
NASA-CR-191236

Research Institute for Advanced Computer Science
NASA Ames Research Center

Myths and Legends in Learning Classification Rules

WRAY BUNTINE

(NASA-CR-191236) MYTHS AND LEGENDS IN LEARNING CLASSIFICATION RULES (Research Inst. for Advanced Computer Science) 10 p. N93-13366 Unclas


RIACS Technical Report 90.23 May 1990

This paper appears in The Proceedings of AAAI-90, published by Morgan Kaufmann (Boston, MA; July-August 1990).


The Research Institute for Advanced Computer Science is operated by Universities Space Research Association, The American City Building, Suite 212, Columbia, MD 21044, (301) 730-2656.

Work reported herein was supported in part by Cooperative Agreement NCC2-387 between the National Aeronautics and Space Administration (NASA) and the Universities Space Research Association (USRA). Work was performed at the Research Institute for Advanced Computer Science (RIACS), NASA Ames Research Center, Moffett Field, California 94035-1000.

Myths and Legends in Learning Classification Rules

Wray Buntine
RIACS, NASA Ames Research Center
Mail Stop 244-17, Moffett Field, CA 94035, U.S.A.

May 1990

Abstract

This paper is a discussion of machine learning theory on empirically learning classification rules. The paper proposes six myths in the machine learning community that address issues of bias, learning as search, computational learning theory, Occam's razor, "universal" learning algorithms, and interactive learning. Some of the problems raised are also addressed from a Bayesian perspective. The paper concludes by suggesting questions that machine learning researchers should be addressing both theoretically and experimentally.

This paper appears in The Proceedings of AAAI-90, published by Morgan Kaufmann (Boston, MA; July-August 1990).

Myths and Legends in Learning Classification Rules

Wray Buntine*

Turing Institute, George House, 36 North Hanover St., Glasgow G1 2AD, UK

Abstract

This paper is a discussion of machine learning theory on empirically learning classification rules. The paper proposes six myths in the machine learning community that address issues of bias, learning as search, computational learning theory, Occam's razor, "universal" learning algorithms, and interactive learning. Some of the problems raised are also addressed from a Bayesian perspective. The paper concludes by suggesting questions that machine learning researchers should be addressing both theoretically and experimentally.

* Current address: RIACS, NASA Ames Research Center, MS 244-17, Moffett Field, CA 94035 (wray@ptolemy.arc.nasa.gov).

Introduction

Machine learning addresses the computational problem of learning, whether it be for insight into the corresponding psychological process or for prospective commercial gain from knowledge learned. Empirical learning is sometimes intended to replace the manual elicitation of classification rules from a domain expert (Quinlan et al. 1987), as a knowledge acquisition sub-task for building classification systems. A classification rule is used to predict the class of a new example, where the class is some discrete variable of practical importance. For instance, an example might correspond to a patient described by attributes such as age, sex, and various measurements taken from a blood sample, and we want to predict a binary-valued class of whether the patient has an overactive thyroid gland. Empirical learning here would be the learning of the classification rule from a set of patient records.

The knowledge acquisition environment provides specific goals for and constraints on an empirical learning system: the system should fit neatly into some broader knowledge acquisition strategy, the system should be able to take advantage of any additional information over and above the examples, for instance, acquired interactively from an expert, the system should only require the use of readily available information, and of course the system should learn efficiently and as well as possible.

This paper proposes and discusses some myths in the machine learning community. All of these are twists on frameworks that have made significant contributions to our research, so the emphasis of the discussion is on qualifying the problems and suggesting solutions. The current flavour of machine learning research is first briefly reviewed before the so-called myths are introduced. The myths address the framework of bias, learning as search, computational learning theory, Occam's razor, the continuing quest for "universal" learning algorithms, and the notion of automatic non-interactive learning. While discussing these, a Bayesian perspective is also presented that addresses some of the issues raised. However, the arguments introducing the myths are intended to be independent of this Bayesian perspective. The conclusion raises some general questions for a theory of empirical learning.

Machine learning research

The development of learning systems in the machine learning community has been largely empirical and ideas-driven in nature, rather than theoretically motivated. That is, learning methods are developed around good ideas, some with strong psychological support, and the methods are of course honed through experimental evaluation.

Such development is arguably the right approach in a relatively young area. It allows basic issues and problems to come to the fore, and basic techniques and methodologies to be developed, although these should be more and more augmented with theory as the area progresses, especially where suitable theory is available from other sciences.

Early comments by Minsky and Papert (Minsky & Papert 1972) throw some light onto this kind of development approach. They were discussing the history of research on the "perceptron", a simple linear thresholding unit that was a subject of intense study in early machine learning and pattern recognition. They first comment on the attraction of the perceptron paradigm (Minsky & Papert 1972, page 18):

    Part of the attraction of the perceptron lies in the possibility of using very simple physical devices--"analogue computers"--to evaluate the linear threshold functions. ... The popularity of the perceptron as a model for an intelligent, general purpose learning machine has roots, we think, in an image of the brain itself ...

Good ideas and apparent plausibility were clearly the initial motivating force.

While perceptrons usually worked quite well on simple problems, their performance deteriorated rapidly on the more ambitious problems. Minsky and Papert sum up much of the research as follows (Minsky & Papert 1972, page 19):

    The results of these hundreds of projects and experiments were generally disappointing, and the explanations inconclusive.

It was only after this apparent lack of success that Minsky and Papert set about developing a more comprehensive theory of perceptrons and their capabilities. Their theory led to a more mature understanding of these systems from which improved approaches might have been developed. The lack of success, however, had already dampened research so the good ideas underlying the approach were virtually forgotten for another decade. Fortunately, a second wave of promising "neural net" research is now underway.

The major claim of this paper is that research on empirical learning within the machine learning community is at a similar juncture. Several promising frameworks have been developed for learning such as the bias framework (Mitchell 1980), the notion of learning as search (Simon & Lea 1974) and computational learning theory (Valiant 1985). It is argued in this paper that to progress we still need more directed theory to give us insight about designing learning algorithms.
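The linear threshold unit behind the perceptron story above is easy to state concretely. The sketch below is illustrative only -- the task (logical AND), function names and learning rate are ours, not Minsky and Papert's -- but it applies the classic error-correction update rule:

```python
# Minimal sketch of a perceptron: a linear threshold unit trained with
# the classic error-correction update. Task and names are illustrative.

def train_perceptron(examples, epochs=20, lr=1.0):
    """Learn weights w and bias b for a linear threshold unit."""
    n = len(examples[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, y in examples:          # y is 0 or 1
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            err = y - pred             # -1, 0, or +1
            if err:
                w = [wi + lr * err * xi for wi, xi in zip(w, x)]
                b += lr * err
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Logical AND is linearly separable, so the update rule converges on it.
AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(AND)
```

On a problem that is not linearly separable (XOR, say) the same loop never settles, which is precisely the limitation Minsky and Papert's theory made clear.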
A catalogue of myths and legends

The theoretical basis for bias research

If a learning system is to come up with any hypotheses at all, it will need to somehow make a choice based on information that is not logically present in the learning sample. The first clear enunciation of this problem in the machine learning community was by Mitchell, who referred to it as the problem of bias (Mitchell 1980). Utgoff developed this idea further, saying (Utgoff 1986, page 5):

    Given a set of training instances, bias is the set of all factors that collectively influence hypothesis selection. These factors include the definition of the space of hypotheses and definition of the algorithm that searches the space of concept descriptions.

Utgoff also introduced a number of terms: good bias is appropriate to learn the actual concept, strong bias restricts the search space considerably but independent of appropriateness, declarative bias is defined declaratively as opposed to procedurally, and preference bias is implemented as soft preferences rather than definite restrictions to the search space. Several researchers have since extended this theory by considering the strength of bias (Haussler 1988), declarative bias (Russell & Grosof 1987), the appropriateness of the bias, and the learning of bias (Tcheng et al. 1989; Utgoff 1986). Much of this research has concentrated on domains without noise or uncertainty. In noisy domains, some researchers have considered the "bias towards simplicity", "overfitting", or the "accuracy vs. complexity tradeoff" (Fisher & Schlimmer 1988) first noticed in AI with decision tree learning algorithms (Cestnik et al. 1987).

There are, however, remaining open issues on this line of research. First, where does the original bias come from for a particular application? Are there domain independent biases that all learning systems should use? Researchers have managed to uncover through experimentation rough descriptions of useful biases, but a generative theory of bias has really only been presented for the case of a logical declarative bias. Second, investigation of bias in noisy or uncertain domains (where perfect classification is not possible) is fairly sparse, and the relation between the machine learning notion of bias and the decades of literature in statistics needs more attention. Third, there seems to be no separation between that component of bias existing due to computational limitations on the learner, bias input as knowledge to the original system, for instance defining a search goal, and therefore unaffected by computational limitations, and the interaction between bias and the sample itself.

What is required is a more precise definition of bias, its various functional components and how they can be pieced together so that we can generate or at least reason about a good bias for a particular application without having to resort to experimentation. So while important early research identified the major problem in learning as bias, and mapped out some broad issues particularly concerning declarative bias, we have not since refined this to where we can reason about what makes a bias good. The first myth is that there is currently a sufficient theoretical basis for understanding the problem of bias.

Utgoff's definition of bias could, for instance, be decomposed into separate functional components.

Hypothesis space bias: This component of bias defines the space of hypotheses that are being searched, but not the manner of search.

Ideal search bias: This component of bias defines what a learning system should be searching for given the sample, that is, assuming infinite computing resources are available for search. As a contrast, one could consider algorithm bias as defining how the algorithm differs from the ideal.

Application-specific bias: This is the component of the bias that is determined by the application.

This decomposition is suggested by Bayesian techniques which give prescriptions for dealing with each of these components (Buntine 1990). Bayesian techniques deal with belief in hypotheses, where belief is a form of preference bias. This gives, for instance, prescriptions for learning from positive-only examples, and for noisy or uncertain examples of different kinds.

For many learning systems described in the literature, ideal search bias is never actually specified, although notable exceptions exist (Quinlan & Rivest 1989; Muggleton 1987). Many publications describe an algorithm and results of the algorithm but never describe in general goal-oriented terms what the algorithm should be searching for, although they give the local heuristics used by the algorithm. Hence it is often difficult to determine the limitations of or assumptions underlying the algorithm. In areas such as constructive induction, where one is trying to construct new terms from existing terms used to describe examples, this becomes critical because the careful evaluation of potential new terms is needed for success. A survey of constructive induction methods (Matheus & Rendell 1989) reveals that few use a coherent evaluation strategy for new terms. This issue of search is the subject of the next myth.
Learning is a well-specified search problem

A seminal paper by Simon and Lea argued that learning should be cast as a search problem (Simon & Lea 1974). While few would disagree with this general idea, there appears to be no broad agreement in the machine learning community as to what precise goals a learning algorithm should be searching for. For a particular application, can we obtain a precise notion of the goal of search at all? Of course, the problems of bias and overfitting are just different perspectives of this same problem. The second myth, related to the first, is that under the current view of learning as search, the goal of search is well-specified. If it was generally known how to design a good bias then the search problem could be made well-specified.

Recurrent problems in machine learning such as splitting and pruning rules for decision trees (Mingers 1989) and the evaluation of new predicates for constructive induction (Matheus & Rendell 1989) are just some symptoms of this broad search problem.

There is one context, however, where learning as search is well-specified according to most current learning theories. Any reasonable model of learning or statistics has asymptotic properties that guarantee the model will converge on an optimal hypothesis. When a large enough quantity of data is available, it is easy to show the various statistical approaches become almost indistinguishable in result: maximum likelihood methods from classical statistics, uniform convergence and empirical risk minimisation techniques (Vapnik 1989) adopted by the computational learning community for handling logical, noisy and uncertain data, minimum encoding approaches (Wallace & Freeman 1987) and Bayesian methods (Buntine 1990). For instance, probably approximate correctness (PACness) is a notion used in computational learning theory to measure confidence in learning error (Valiant 1985). Under the usual definition of PACness, if confidence is high that error is low then the same will hold for a Bayesian method no matter what prior was used (this is a direct corollary of (Buntine 1990, Lemma 4.2.1)). In Bayesian statistics, "large enough" data means there is so much data that the result of a Bayesian method is virtually the same no matter what prior was used for the analysis; this situation is referred to as stable estimation (Berger 1985).

We say a sample is sufficient when the condition of "large enough" more or less holds for the sample. (Because the condition is about asymptotic convergence, it will only ever hold approximately.) This definition is about as precise as is needed for this paper. This means, for instance, that a sufficient sample is a reasonably complete specification of the semantics of the "true" concept but not its representational form. (Assume the classification being learned is time independent so the existence of a true concept is a reasonable assumption.)

A sufficient sample makes the search well-specified. For instance, if we are seeking an accurate classifier, then with a sufficient sample we can determine the "true" accuracy of all hypotheses in the search space reasonably well just by checking each one against the sample. That is, we just apply the principle of empirical risk minimisation (Vapnik 1989). With an insufficient sample, we can often check the accuracy of some hypotheses the same way--this is often the purpose of an independent test set--but the catch is we cannot check the accuracy of all hypotheses because inaccurate hypotheses can easily have high accuracy on the sample by chance. For instance, this happens with decision trees; it is well known they often have to be pruned because a fully grown tree has been "fitted to the noise" in the data (Cestnik et al. 1987).

A variation on this theme has been made by Weiss and colleagues who report a competitive method for learning probabilistic conjunctive rules (Weiss & Kapouleas 1989). They make the hypothesis space small (conjuncts of size three or less), so the sample size is nearly "sufficient". They then do a near exhaustive search of the space of hypotheses--something not often done in machine learning--to uncover the rule minimising empirical risk.

It is unfortunate, though, that a sufficient sample is not always available. We may only have a limited supply of data, as is often the case in medical or banking domains. In this case we would hope our learning algorithm will make the best use of the limited supply of data available. How can this be done?
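The small-conjunct search just described can be sketched concretely. Everything below -- the sample, attribute names and helper functions -- is invented for illustration; only the strategy, exhaustively scoring every conjunct of at most three attribute-value tests and keeping one of minimum empirical risk, follows the idea credited to Weiss and colleagues:

```python
# Sketch of empirical risk minimisation by near-exhaustive search over
# small conjunctive rules. Data and names are invented for illustration.
from itertools import combinations, product

def rule_predicts(rule, x):
    """A rule is a conjunction of (attribute, value) tests."""
    return all(x[a] == v for a, v in rule)

def empirical_risk(rule, sample):
    """Fraction of the sample the rule misclassifies."""
    return sum(rule_predicts(rule, x) != y for x, y in sample) / len(sample)

def erm_conjuncts(sample, attributes, max_size=3):
    """Exhaustively search conjuncts of at most max_size tests and
    return one minimising empirical risk on the sample."""
    best_rule, best_risk = (), empirical_risk((), sample)
    for size in range(1, max_size + 1):
        for attrs in combinations(attributes, size):
            for vals in product([0, 1], repeat=size):
                rule = tuple(zip(attrs, vals))
                risk = empirical_risk(rule, sample)
                if risk < best_risk:
                    best_rule, best_risk = rule, risk
    return best_rule, best_risk

# Invented sample: the "true" concept is (a = 1 AND b = 1).
sample = [({'a': a, 'b': b, 'c': c}, bool(a and b))
          for a in (0, 1) for b in (0, 1) for c in (0, 1)]
rule, risk = erm_conjuncts(sample, ['a', 'b', 'c'])
```

With a sufficient sample the empirical risk of every candidate is a reliable estimate of its true risk, so this brute-force loop is exactly the well-specified search the text describes; with an insufficient sample the same loop happily returns rules that fit the noise.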
Computational learning theory gives a basis for learning algorithms

Valiant's "theory of the learnable" was concerned with whether a machine can learn a particular class of concepts in feasible computation (Valiant 1985). This theory centered around three notions: uniform convergence or distribution-free learning, that learning should converge regardless of the underlying distributions, probable approximate correctness (PACness), that the best a learner can do is probably be close to correct, and the need for tractable algorithms for learning. These three notions and variations have since been vigorously adopted for a wider range of learning problems by the theoretical community to create an area called computational learning theory.

The theory has been concentrated on the strength of bias and resultant worst-case complexity results about learning logical concepts. Evidence exists, however, that these results can be improved so better principles exist for the algorithm designer.

Simulations reported in (Buntine 1990) indicate a large gap exists between current worst-case error bounds obtained using uniform convergence and the kinds of errors that might occur in practice. The same gap did not occur when using a Bayesian method to estimate error, although the question of priors clouds these results. In addition, the current bounds only consider the size of the sample and not the contents of the sample. In the simulations, for instance, the space of consistent hypotheses sometimes reduced to one hypothesis quite quickly, indicating the "true" concept had been identified; yet bounds based on just the size of the sample cannot recognise this. This identification can occur approximately in that an algorithm may recognise parts of the concept have been identified. This behaviour has been captured using the "reliably probably almost always usefully" learning framework (Rivest & Sloan 1988), and a technique that generalises this same behaviour is implicit in a Bayesian method (Buntine 1990).

It is argued in (Buntine 1990) that these issues arise because of the worst-case properties of the standard PACness notion which is based on uniform convergence. While uniform convergence has desirable properties, it cannot be achieved when there is less data. In this context, less stringent principles give stronger guides. This is especially relevant in the context of learning noisy or uncertain concepts where a variety of other statistical principles could be used instead.

So the third myth is that computational learning theory, in its current form, provides a solid basis on which the algorithm designer can perform his duties. There are two important qualifications where this is not a myth. First, computational learning theory currently provides the algorithm designer with a skeletal theory of learning that gives a rough guide as to what to do and where further effort should be invested. Second, there are some areas where computational learning theory has provided significant support to the algorithm designer. For instance, in the previous section it was argued that with the notion of probably approximate correctness, computational learning theory provides a basis for learning in the context of a sufficient sample.
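The flavour of these worst-case bounds can be made concrete. For a finite hypothesis space and a learner that returns a hypothesis consistent with the sample, the standard uniform-convergence argument gives the sample size sketched below; the function name and the numbers plugged in are ours, chosen only for illustration:

```python
# Standard worst-case PAC sample-size bound for a finite hypothesis
# space and a consistent learner. Illustrative numbers only.
from math import ceil, log

def pac_sample_size(hypothesis_count, epsilon, delta):
    """Sample size sufficient so that, with probability at least
    1 - delta, every hypothesis consistent with the sample has true
    error below epsilon:  m >= (1/epsilon) * (ln|H| + ln(1/delta))."""
    return ceil((log(hypothesis_count) + log(1.0 / delta)) / epsilon)

# The bound depends only on the size of the hypothesis space, epsilon
# and delta -- never on the contents of the sample, which is exactly
# the limitation noted in the simulations discussed above.
m = pac_sample_size(hypothesis_count=2 ** 20, epsilon=0.05, delta=0.01)
```

Even if the first few examples happen to pin the "true" concept down exactly, this bound demands the same number of further examples, which is the gap between worst-case theory and practice that the section describes.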
Occam's razor has a simple explanation

A standard explanation of Occam's razor (Blumer et al. 1987) can be summarised as follows (Dietterich 1989):

    The famous bias of Occam's Razor (prefer the simplest hypothesis consistent with the data) can thus be seen to have a mathematical basis. If we choose our simplicity ordering before examining the data, then a simple hypothesis that is consistent with the data is provably likely to be approximately correct. This is true regardless of the nature of the simplicity ordering, because no matter what the ordering, there are relatively few simple hypotheses.

An algorithm that looks for a simpler hypothesis under certain computational limitations has been called an Occam algorithm (Blumer et al. 1987). While the mathematical basis of Occam algorithms is solid, their useful application can be elusive. A poorly chosen Occam algorithm is rather like the drunk who, having lost his keys further down the road, only searches for them around the lamp post because that is the place where the light is strongest. Search should be directed more carefully.

Clearly, an Occam algorithm can only be said to provide useful support for Occam's razor if it will at least sometimes (not infrequently) find simpler hypotheses that are good approximations. The catch with Occam algorithms is that there is no guarantee they will do so. Suppose all reasonable approximations to the "true" concept are complex. Notice almost all hypotheses in a large space can be considered complex (since size is usually determined from the length of a non-redundant code). Then the framework of constructing a simplicity ordering and searching for simpler hypotheses has been pointless. If this almost always turns out to be the case, then the Occam algorithm approach will almost always not be useful. Certainly no proof has been presented that a guarantee of useful application exists; the main theorem in (Blumer et al. 1987) ignores this problem by assuming the "true" function is of size "at most n".

In other words, we need at least a weak guarantee that during learning, simpler hypotheses will sometimes be found that are good approximations. There are two potential arguments for this.

The first potential argument is that since we only require a good approximation, the hypothesis space can be reduced in size to one that is sufficient for finding a good approximation. A thorough treatment of this appears in (Amsterdam 1988). For the purposes of discussion, assume there is a uniform distribution on the examples and we are considering the space of all possible hypotheses defined over E kinds of examples. Such a space has size 2^E because each kind of example is either in the concept or not. Notice that for any one hypothesis, there are 2^(εE) hypotheses within error ε of it. So the smallest space of hypotheses guaranteed to contain a hypothesis within ε of the "true" concept must be at least of size 2^((1-ε)E). So this first argument provides some support, but the reduction factor of just (1 - ε) shows the argument is clearly not sufficient to provide a guarantee on its own.
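The counting argument above can be checked exactly for small E. Under a uniform distribution, the hypotheses within error ε of a fixed hypothesis form a Hamming ball of radius εE; the text approximates the ball's size as 2^(εE), while the sketch below (function names ours) uses the exact binomial count, giving the same style of lower bound on any reduced hypothesis space:

```python
# Exact version of the covering argument: any set of hypotheses
# guaranteed to contain something within error epsilon of an arbitrary
# "true" concept must have at least 2**E / |ball| members.
from math import comb

def hamming_ball(E, radius):
    """Number of hypotheses (subsets of E kinds of examples) within a
    given Hamming distance of one fixed hypothesis."""
    return sum(comb(E, k) for k in range(radius + 1))

def cover_lower_bound(E, epsilon):
    """Lower bound on the size of any epsilon-cover of the 2**E
    hypotheses over E kinds of examples."""
    ball = hamming_ball(E, int(epsilon * E))
    return -(-2 ** E // ball)  # ceiling division

# For E = 10 and epsilon = 0.2 the ball holds C(10,0)+C(10,1)+C(10,2)
# = 56 hypotheses, so any cover needs at least ceil(1024 / 56) = 19.
```

The exact ball is larger than 2^(εE), so the exact bound is looser than the text's 2^((1-ε)E) figure, but the qualitative conclusion is the same: the reduced space is still exponentially large, so shrinking the space cannot by itself guarantee that simple hypotheses are good approximations.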
Second, if we believe the simplicity ordering implicit in the Occam algorithm is somehow appropriate, that is, simpler hypotheses should be expected to be good approximations, then the Occam algorithm should be useful. Here, the power of the Occam algorithm comes from choosing a simplicity ordering appropriate for the problem in the first place. Bayesian methods support this principle because the simplicity ordering corresponds to prior belief. In minimum encoding approaches (Wallace & Freeman 1987) the principle is achieved by choosing an appropriate representation in which the "true" concept should be simple.

So the complexity argument above needs to be qualified: it is not useful in learning "regardless of the nature of the simplicity ordering". Either way, we need to think carefully about an appropriate simplicity ordering if we are to usefully employ the Occam algorithm.

The explanation for Occam's razor quoted above provides a mathematical basis for Occam's razor. However, the fourth myth is that this mathematical basis provides a full and useful explanation of Occam's razor. Two other supporting explanations have been presented that seem necessary to engage this mathematical basis. And we have not yet considered the case where concepts have noise or uncertainty. In this context different complementary arguments for Occam's razor become apparent (Wallace & Freeman 1987). Perhaps the key point is that Occam's razor finds practical application because of people's inherent ability to select key attributes and appropriate representations for expressing a learning problem in the first place (Michie 1986). There may also be other more subtle psychological explanations for its use.

"Universal" learning methods are best

Empirical learning seems to ignore one of the key lessons for AI in the 1970s called the strong knowledge principle (Waterman 1986, page 4):

    ... to make a program intelligent, provide it with lots of high quality specific knowledge about some problem area.

Whereas techniques such as explanation-based learning, analogical learning, and knowledge integration and refinement certainly embrace the strong knowledge principle, in empirical learning, as it is often described in the literature, one simply picks a universal learning method, inputs the data and then receives as output a classification rule. While early logical induction systems like Shapiro's MIS (Shapiro 1983) and subsequent similar systems do appear to incorporate background knowledge, they usually do so to extend the search space rather than to guide the search (Buntine 1988).

Some successful machine learning methodologies do incorporate weaker application-specific knowledge by some means. Two approaches are the careful selection of attributes in which examples are described (Quinlan et al. 1987; Michie 1986) and the use of various forms of interaction with an expert (Buntine & Stirling 1990). In addition, Bayesian statistics, with its notion of subjective knowledge or prior belief, could provide a means by which application-specific knowledge can be cautiously incorporated into the learning process.

Yet many comparative studies from the machine learning community, for instance, fail to consider even weak kinds of knowledge about an application that would help decide whether an algorithm is appropriate for an application, and hence whether the comparison of algorithms on the application is a fair one. Algorithms are applied universally to all problems without consideration of their applicability. A similar issue has been noted by Fisher and Schlimmer (Fisher & Schlimmer 1988, page 27):

    Using a statistical measure to characterize prediction tasks instantiates a methodology forwarded by Simon (1969) - domains must be characterized before an AI system's effectiveness can be properly evaluated. There is little benefit in stating that a system performs in a certain manner unless performance is tied to domain properties that predict system performance in other domains. Adherence to this strategy is relatively novel in machine learning.

The fifth myth is that there exist universal learning algorithms that perform well on any application regardless. Rather, there exist universal learning algorithms (and each of us provides living proof), but these can always be outperformed by a second class of algorithms better selected and modified for the particular application.

A way to develop this second class of non-universal learning algorithms is to develop "targeted" learning methods that, first, are suitable for specific kinds of classification tasks or specific inference models, and, second, are able to be fine tuned or primed for the application at hand. The choices necessary in using these algorithms could be made in the light of subjective knowledge available, for instance, elicited in an interview with an expert. The correct choice of model is known to have considerable bearing on statistical problems like learning (Berger 1985, page 110).
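The remark above that a simplicity ordering corresponds to prior belief can be illustrated with a toy Bayesian computation. Everything here -- the hypotheses, their description lengths, the noise level and the data -- is invented for illustration; the point is only that the posterior trades the prior (simplicity) against fit to the sample, which is one way subjective knowledge can prime a targeted algorithm:

```python
# Toy Bayesian computation: prior weight 2**(-description_length)
# encodes a simplicity ordering; the posterior balances it against fit.
from math import prod

def posterior(hypotheses, sample, noise=0.1):
    """Posterior over hypotheses for i.i.d. examples whose labels agree
    with a hypothesis's prediction with probability 1 - noise."""
    def likelihood(h):
        return prod((1 - noise) if h['predict'](x) == y else noise
                    for x, y in sample)
    weights = {name: 2.0 ** -h['length'] * likelihood(h)
               for name, h in hypotheses.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

# Invented hypotheses: a very short rule that ignores the data, and a
# slightly longer threshold rule. 'length' is a stand-in code length.
hypotheses = {
    'always_true': {'length': 1, 'predict': lambda x: True},
    'threshold':   {'length': 3, 'predict': lambda x: x > 2},
}
sample = [(1, False), (2, False), (3, True), (4, True)]
post = posterior(hypotheses, sample)
# The better-fitting rule wins despite its smaller prior weight.
```

Swapping in a different prior is exactly a change of simplicity ordering, so the "choice of bias" question of the earlier sections reappears here as the (equally unresolved) choice of prior.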
and refinement certainly embrace the strong knowledge principle. In empirical learning, as it is often described in the literature, one simply picks a universal learning method, inputs the data, and then receives a classification rule as output. While early logical induction systems like Shapiro's MIS (Shapiro 1983) and subsequent similar systems do appear to incorporate background knowledge, they usually do so to extend the search space rather than to guide the search (Buntine 1988).

At the broadest level we could choose to model the classification process with probabilistic decision trees or DNF rules, Bayesian or belief nets, linear classifiers, or perhaps even rules in a pseudo 1st-order logic such as DATALOG. And for each of these models there are many additional constraints or preferences that could be imposed on the representation to give the learning algorithm some of the flavour of the strong knowledge learning methods. One could choose to favour some attributes over others in a tree algorithm, or prime a belief net algorithm with potential causes and known independencies.

Learning should be non-interactive

One aspect of learning for knowledge acquisition, not sufficiently well highlighted in earlier statistical approaches, is the capability of promoting interaction with the expert to assist learning. This is done to obtain additional knowledge from the expert that may well be equivalent in value to a possibly expensive sample. The importance of interactive learning was recognised as early as 1947 by Turing (1986, page 124), who said:

    No man adds very much to the body of knowledge, [sic] why should we expect more of a machine? Putting the same point differently, the machine must be allowed to have contact with human beings in order that it may adapt itself to their standards.

Interactive learning should be used with caution, however, because experts are often unreliable sources of knowledge. In the context of uncertainty, people have limitations with reasoning and in articulating their reasoning, so knowledge elicited must be interpreted with caution (Cleaves 1988). Also, we would hope that learning could still be achieved without interaction, perhaps at the expense of needing larger samples.

However, interaction is not always possible. Some applications require a so-called "autonomous agent". With these, not only should learning be entirely automatic, there may also be no-one present to help select an appropriate targeted learning algorithm, as suggested in the previous section. While most researchers now believe that learning can profit from careful expert interaction where possible, and research in this area exists (Angluin 1988; Buntine & Stirling 1990), the sixth myth, that learning should be automatic and non-interactive, lives on in many of the experimental studies reported and many of the algorithms under development.
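The earlier remark about favouring some attributes over others in a tree algorithm can be made concrete with a small sketch. This is an illustration of mine, not code from the paper: expert-supplied preference weights scale each attribute's information gain before the usual greedy split choice, so expert knowledge biases, rather than restricts, the search. All function names and the multiplicative weighting scheme are assumptions for illustration.

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(rows, labels, attr):
    """Information gain of splitting `rows` (dicts of attribute values) on `attr`."""
    by_value = {}
    for row, y in zip(rows, labels):
        by_value.setdefault(row[attr], []).append(y)
    n = len(labels)
    return entropy(labels) - sum(len(ys) / n * entropy(ys)
                                 for ys in by_value.values())

def best_split(rows, labels, attrs, prefs=None):
    """Greedy split choice, biased by expert preference weights.

    `prefs` maps attribute name -> weight; unlisted attributes get 1.0,
    so with no preferences this reduces to the plain ID3-style choice.
    """
    prefs = prefs or {}
    return max(attrs, key=lambda a: prefs.get(a, 1.0) * gain(rows, labels, a))
```

A weight above one nudges the search toward an attribute the expert believes is relevant (for instance, a suspected cause), while leaving every attribute available when the data strongly favour another split.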
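The paper's recurring appeal to Bayesian theory, for instance in judging how reliable a learned rule is and how sensitive the answer is to priors, can also be made concrete. The following is a minimal sketch of the standard Beta-Binomial model, not code from the paper; the function name and the default uniform prior are illustrative assumptions.

```python
from math import sqrt

def accuracy_posterior(successes, trials, prior_a=1.0, prior_b=1.0):
    """Posterior over a rule's true accuracy under a Beta-Binomial model.

    A Beta(prior_a, prior_b) prior updated with s successes in n trials
    gives a Beta(prior_a + s, prior_b + n - s) posterior; returns its
    mean and standard deviation.
    """
    a = prior_a + successes
    b = prior_b + (trials - successes)
    mean = a / (a + b)
    var = (a * b) / ((a + b) ** 2 * (a + b + 1))
    return mean, sqrt(var)

# 90 correct out of 100 test cases: posterior mean ~0.892, sd ~0.031.
mean, sd = accuracy_posterior(90, 100)
```

Changing the prior, say to Beta(2, 2), shifts the answer noticeably for small samples and washes out as data accumulate, which is one form of the prior-sensitivity question the paper raises.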

Requirements for a theory of empirical learning

This section suggests questions that a theory for learning of classification rules should be addressing.

• How does the use of a learning system fit in a broader knowledge-acquisition strategy?

• According to what inference model should classification proceed, or in other words, what form of classification rules should be learned?

• What is the induction protocol or typical course of an induction session? What sorts of questions can the trainer reasonably answer, and with what sort of knowledge can the trainer prime the learning system?

• For a particular induction protocol and inference model, how should the system perform induction given its computational resources? What is being searched for, and how should this search be performed?

• What are the average and worst-case computational and data requirements of the system for a given problem? Furthermore, what problems can be solved with reasonable computational resources, and what amounts of data should be needed?

• How does a theory of uncertainty relate to the problem of learning classification rules? How can this then throw light on the previous search problem?

• When and how should subjective knowledge, weak or strong domain knowledge, or other information extraneous to the sample be incorporated into the learning process? If this is done but poorly, how can the system subsequently detect from evidence in the sample that the incorporated subjective knowledge is actually inappropriate to the application?

• How reliable are the classification rules learned by the system from available data? How much more data is required, and of which type? Can the system ask a few pertinent questions or design a few key experiments to improve subsequent results? Are the results sensitive to assumptions implicit in the system? (In Bayesian methods, this includes priors.)

It has been argued at various places throughout this paper that Bayesian theory can address at least some of these questions. It is certainly a (seventh) myth, however, that Bayesian methods provide a complete theory for designing learning algorithms. There are many complementary statistical perspectives, and many complementary theoretical tools such as optimisation and search, decision theory, resource-bounded reasoning, computational complexity, models of man-machine interaction, and the psychology of learning. In addition, some of the above questions require a pragmatic and experimental perspective, particularly those concerned with the human interface and learning methodology.

Acknowledgements

Brian Ripley and RIACS gave support, and Peter Cheeseman, Pete Clark, David Haussler, Phil Laird, Steve Muggleton, Tim Niblett, and the AAAI reviewers provided constructive feedback. Any remaining errors are my own.

References

Amsterdam, J. 1988. Extending the Valiant learning model. In Fifth International Conference on Machine Learning, 381-394, Ann Arbor, Michigan. Morgan Kaufmann.

Angluin, D. 1988. Queries and concept learning. Machine Learning, 2(4):319-343.

Berger, J.O. 1985. Statistical Decision Theory and Bayesian Analysis. Springer-Verlag, New York.

Blumer, A.; Ehrenfeucht, A.; Haussler, D.; and Warmuth, M.K. 1987. Occam's razor. Information Processing Letters, 24:377-380.

Buntine, W.L. 1988. Generalised subsumption and its applications to induction and redundancy. Artificial Intelligence, 36(2):149-176.

Buntine, W.L. 1990. A Theory of Learning Classification Rules. PhD thesis, University of Technology, Sydney. Forthcoming.

Buntine, W.L. and Stirling, D.A. 1990. Interactive induction. In Hayes, J.; Michie, D.; and Tyugu, E., editors, MI-12: Machine Intelligence 12, Machine Analysis and Synthesis of Knowledge. Oxford University Press, Oxford.

Cestnik, B.; Kononenko, I.; and Bratko, I. 1987. Assistant 86: A knowledge-elicitation tool for sophisticated users. In Bratko, I. and Lavrač, N., editors, Progress in Machine Learning: Proceedings of EWSL-87, 31-45, Bled, Yugoslavia. Sigma Press.

Cleaves, D.A. 1988. Cognitive biases and corrective techniques: proposals for improving elicitation procedures for knowledge-based systems. In Gaines, B. and Boose, J., editors, Knowledge Acquisition for Knowledge-Based Systems, 23-34. Academic Press, London.

Dietterich, T.G. 1989. Machine learning. Tech. Report 89-30-6, Dept. Comp. Sc., Oregon State University. To appear, Annual Review of Computer Science, Vol. 4, Spring 1990.

Fisher, D.H. and Schlimmer, J.C. 1988. Concept simplification and prediction accuracy. In Fifth International Conference on Machine Learning, 22-28, Ann Arbor, Michigan. Morgan Kaufmann.

Haussler, D. 1988. Quantifying inductive bias: AI learning algorithms and Valiant's learning framework. Artificial Intelligence, 36(2):177-222.

Matheus, C.J. and Rendell, L.A. 1989. Constructive induction on decision trees. In International Joint Conference on Artificial Intelligence, 645-650, Detroit. Morgan Kaufmann.

Michie, D. 1986. The superarticulacy phenomenon in the context of software manufacture. Proc. Roy. Soc. (A), 405:185-212.

Mingers, J. 1989. An empirical comparison of selection measures for decision-tree induction. Machine Learning, 3(4):319-342.

Minsky, M. and Papert, S. 1972. Perceptrons: An Introduction to Computational Geometry. MIT Press, second edition.

Mitchell, T.M. 1980. The need for biases in learning generalisations. CBM-TR-117, Rutgers University, New Brunswick, NJ.

Muggleton, S.H. 1987. Duce, an oracle based approach to constructive induction. In International Joint Conference on Artificial Intelligence, 287-292, Milan.

Quinlan, J.R. and Rivest, R.L. 1989. Inferring decision trees using the minimum description length principle. Information and Computation, 80:227-248.

Quinlan, J.R.; Compton, P.J.; Horn, K.A.; and Lazarus, L. 1987. Inductive knowledge acquisition: A case study. In Quinlan, J.R., editor, Applications of Expert Systems. Addison Wesley, London.

Rivest, R.L. and Sloan, R. 1988. Learning complicated concepts reliably and usefully (extended abstract). In Seventh National Conference on Artificial Intelligence, 635-640, Saint Paul, Minnesota.

Russell, S.J. and Grosof, B.N. 1987. A declarative approach to bias in concept learning. In Sixth National Conference on Artificial Intelligence, 505-510, Seattle.

Shapiro, E.Y. 1983. Algorithmic Program Debugging. MIT Press.

Simon, H.A. and Lea, G. 1974. Knowledge and cognition. In Gregg, L.W., editor, Knowledge and Cognition, 105-127. Lawrence Erlbaum Associates.

Tcheng, D.; Lambert, B.; Lu, S. C-Y.; and Rendell, L. 1989. Building robust learning systems by combining induction and optimization. In International Joint Conference on Artificial Intelligence, 806-812, Detroit. Morgan Kaufmann.

Turing, A.M. 1986. Lecture to the London Mathematical Society on 20 February 1947. In Carpenter, B.E. and Doran, R.W., editors, A.M. Turing's ACE Report of 1946 and Other Papers, 106-124. MIT Press.

Utgoff, P.E. 1986. Machine Learning of Inductive Bias. Kluwer Academic Publishers.

Valiant, L.G. 1985. A theory of the learnable. CACM, 27(11):1134-1142.

Vapnik, V.N. 1989. Inductive principles of the search for empirical dependencies. In Rivest, R.; Haussler, D.; and Warmuth, M.K., editors, COLT'89: Second Workshop on Computational Learning Theory, 3-21, University of California, Santa Cruz. Morgan Kaufmann.

Wallace, C.S. and Freeman, P.R. 1987. Estimation and inference by compact coding. J. Roy. Statist. Soc. B, 49(3):240-265.

Waterman, D.A. 1986. A Guide to Expert Systems. Addison Wesley.

Weiss, S.M. and Kapouleas, I. 1989. An empirical comparison of pattern recognition, neural nets, and machine learning classification methods. In International Joint Conference on Artificial Intelligence, 781-787, Detroit. Morgan Kaufmann.