Myths and Legends in Learning Classification Rules
Research Institute for Advanced Computer Science
NASA Ames Research Center

RIACS Technical Report 90.23
May 1990

This paper appears in The Proceedings of AAAI-90, published by Morgan Kaufmann (Boston, MA; July-August 1990).

The Research Institute of Advanced Computer Science is operated by Universities Space Research Association, The American City Building, Suite 212, Columbia, MD 21044, (301) 730-2656. Work reported herein was supported in part by Cooperative Agreement NCC2-387 between the National Aeronautics and Space Administration (NASA) and the Universities Space Research Association (USRA). Work was performed at the Research Institute for Advanced Computer Science (RIACS), NASA Ames Research Center, Moffett Field, California 94035-1000.


Myths and Legends in Learning Classification Rules

Wray Buntine*
Turing Institute
George House, 36 Nth. Hanover St.
Glasgow, G1 2AD, UK

*Current address: RIACS, NASA Ames Research Center, MS 244-17, Moffett Field, CA 94035 (wray@ptolemy.arc.nasa.gov)

Abstract

This paper is a discussion of machine learning theory on empirically learning classification rules. The paper proposes six myths in the machine learning community that address issues of bias, learning as search, computational learning theory, Occam's razor, "universal" learning algorithms, and interactive learning. Some of the problems raised are also addressed from a Bayesian perspective. The paper concludes by suggesting questions that machine learning researchers should be addressing both theoretically and experimentally.

Introduction

Machine learning addresses the computational problem of learning, whether it be for insight into the corresponding psychological process or for prospective commercial gain from knowledge learned. Empirical learning is sometimes intended to replace the manual elicitation of classification rules from a domain expert (Quinlan et al. 1987), as a knowledge acquisition sub-task for building classification systems. A classification rule is used to predict the class of a new example, where the class is some discrete variable of practical importance. For instance, an example might correspond to a patient described by attributes such as age, sex, and various measurements taken from a blood sample, and we want to predict a binary-valued class of whether the patient has an overactive thyroid gland. Empirical learning here would be the learning of the classification rule from a set of patient records.
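To make the task concrete, the sketch below learns such a rule from a handful of invented patient records. It is a minimal illustration only, not code from the paper: the attribute names, values, and the single-attribute threshold search are all assumed for the example.

```python
# Hypothetical training records: attribute dictionary -> hyperthyroid (True/False).
records = [
    ({"age": 45, "sex": "F", "tsh": 0.1}, True),
    ({"age": 60, "sex": "M", "tsh": 0.3}, True),
    ({"age": 29, "sex": "F", "tsh": 0.2}, True),
    ({"age": 33, "sex": "F", "tsh": 2.1}, False),
    ({"age": 52, "sex": "M", "tsh": 1.8}, False),
    ({"age": 71, "sex": "M", "tsh": 3.0}, False),
]

def learn_threshold_rule(records, attribute):
    """Try each observed value as a threshold and keep the rule
    'attribute < threshold' that classifies most training records correctly."""
    best_threshold, best_correct = None, -1
    for example, _ in records:
        t = example[attribute]
        correct = sum((x[attribute] < t) == label for x, label in records)
        if correct > best_correct:
            best_threshold, best_correct = t, correct
    return best_threshold

threshold = learn_threshold_rule(records, "tsh")

def rule(patient):
    """The learned classification rule: predict an overactive thyroid below the threshold."""
    return patient["tsh"] < threshold

# Predict the class of a new, unseen example.
print(rule({"age": 40, "sex": "F", "tsh": 0.4}))  # True under these toy records
```

A practical learner searches a much richer space of rules over many attributes, but the input-output behaviour is the same: labelled examples in, a classification rule out.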
The knowledge acquisition environment provides specific goals for and constraints on an empirical learning system: the system should fit neatly into some broader knowledge acquisition strategy; it should be able to take advantage of any additional information over and above the examples, for instance information acquired interactively from an expert; it should only require the use of readily available information; and of course it should learn efficiently and as well as possible.

This paper proposes and discusses some myths in the machine learning community. All of these are twists on frameworks that have made significant contributions to our research, so the emphasis of the discussion is on qualifying the problems and suggesting solutions. The current flavour of machine learning research is first briefly reviewed before the so-called myths are introduced. The myths address the framework of bias, learning as search, computational learning theory, Occam's razor, the continuing quest for "universal" learning algorithms, and the notion of automatic non-interactive learning. While discussing these, a Bayesian perspective is also presented that addresses some of the issues raised. However, the arguments introducing the myths are intended to be independent of this Bayesian perspective. The conclusion raises some general questions for a theory of empirical learning.

Machine learning research

The development of learning systems in the machine learning community has been largely empirical and ideas-driven in nature, rather than theoretically motivated. That is, learning methods are developed based around good ideas, some with strong psychological support, and the methods are of course honed through experimental evaluation.

Such development is arguably the right approach in a relatively young area. It allows basic issues and problems to come to the fore, and basic techniques and methodologies to be developed, although it should be more and more augmented with theory as the area progresses, especially where suitable theory is available from other sciences.

Early comments by Minsky and Papert (Minsky & Papert 1972) throw some light onto this kind of development approach. They were discussing the history of research on the "perceptron", a simple linear thresholding unit that was a subject of intense study in early machine learning and pattern recognition. They first comment on the attraction of the perceptron paradigm (Minsky & Papert 1972, page 18):

    Part of the attraction of the perceptron lies in the possibility of using very simple physical devices--'analogue computers'--to evaluate the linear threshold functions. The popularity of the perceptron as a model for an intelligent, general purpose learning machine has roots, we think, in an image of the brain itself.

Good ideas and apparent plausibility were clearly the initial motivating force.
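For reference, the "simple linear thresholding unit" referred to here can be written out in a few lines, together with the classical perceptron weight-update rule. The toy problem and learning rate below are assumptions for illustration, not material from the paper.

```python
def predict(weights, bias, x):
    """Linear threshold function: output 1 iff the weighted sum exceeds zero."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0

def train_perceptron(examples, epochs=20, learning_rate=0.1):
    """Classical perceptron rule: nudge the weights towards each misclassified example."""
    n_inputs = len(examples[0][0])
    weights, bias = [0.0] * n_inputs, 0.0
    for _ in range(epochs):
        for x, y in examples:
            error = y - predict(weights, bias, x)  # -1, 0, or +1
            weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
            bias += learning_rate * error
    return weights, bias

# A linearly separable toy problem: logical AND of two binary inputs.
examples = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
weights, bias = train_perceptron(examples)
print([predict(weights, bias, x) for x, _ in examples])  # expect [0, 0, 0, 1]
```

On a linearly separable problem such as this the update rule converges; the difficulties analysed by Minsky and Papert arise on problems that no single linear threshold function can represent.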
While perceptrons usually worked quite well on simple problems, their performance deteriorated rapidly on the more ambitious problems. Minsky and Papert sum up much of the research as follows (Minsky & Papert 1972, page 19):

    The results of these hundreds of projects and experiments were generally disappointing, and the explanations inconclusive.

It was only after this apparent lack of success that Minsky and Papert set about developing a more comprehensive theory of perceptrons and their capabilities. Their theory led to a more mature understanding of these systems from which improved approaches might have been developed. The lack of success, however, had already dampened research, so the good ideas underlying the approach were virtually forgotten for another decade. Fortunately, a second wave of promising "neural net" research is now underway.

The major claim of this paper is that research on empirical learning within the machine learning community is at a similar juncture. Several promising frameworks have been developed for learning, such as the bias framework (Mitchell 1980), the notion of learning as search (Simon & Lea 1974) and computational learning theory (Valiant 1985). It is argued in this paper that to progress we still need more directed theory to give us insight about designing learning algorithms.

... than definite restrictions to the search space. Several researchers have since extended this theory by considering the strength of bias (Haussler 1988), declarative bias (Russell & Grosof 1987), the appropriateness of the bias, and the learning of bias (Tcheng et al. 1989; Utgoff 1986). Much of this research has concentrated on domains without noise or uncertainty. In noisy domains, some researchers have considered the "bias towards simplicity", "overfitting", or the "accuracy vs. complexity tradeoff" (Fisher & Schlimmer 1988) first noticed in AI with decision tree learning algorithms (Cestnik et al. 1987).

There are, however, remaining open issues on this line of research. First, where does the original bias come from for a particular application? Are there domain-independent biases that all learning systems should use? Researchers have managed to uncover through experimentation rough descriptions of useful biases, but a generative theory of bias has really only been presented for the case of a logical declarative bias. Second, investigation of bias in noisy or uncertain domains (where perfect classification is not possible) is fairly sparse, and the relation between the machine learning notion of bias and the decades of literature in statistics needs more attention. Third, there seems to be no separation between the component of bias existing due to computational limitations on the learner, bias input as knowledge to the original system (for instance, defining a search goal) and therefore unaffected by computational limitations, and the interaction between bias and the sample itself.

What is required is a more precise definition of bias, its various functional components and how they can be