From: FLAIRS-01 Proceedings. Copyright © 2001, AAAI (www.aaai.org). All rights reserved.

Decision Tree Rule Reduction Using Linear Classifiers in Multilayer Perceptron

DaeEun Kim
Division of Informatics, University of Edinburgh
5 Forrest Hill, Edinburgh, EH1 2QL, United Kingdom
[email protected]

Sea Woo Kim
Dept. of Information and Communication Engineering, KAIST
Cheongryang 1-dong, Seoul, 130-011, Korea
[email protected]

Abstract

It has been shown that a neural network is better than a direct application of induction trees in modeling complex relations of input attributes in sample data. We propose that concise rules be extracted to support data with input variable relations over continuous-valued attributes. Those relations, as a set of linear classifiers, can be obtained from neural network modeling based on back-propagation. A linear classifier is derived from a linear combination of input attributes and neuron weights in the first hidden layer of neural networks. It is shown in this paper that when we use a decision tree over linear classifiers extracted from a multilayer perceptron, the number of rules can be reduced. We have tested this method over several data sets to compare it with decision tree results.

Introduction

The discovery of decision rules and recognition of patterns from data examples is one of the most challenging problems in machine learning. If data points contain numerical attributes, induction tree methods need the continuous-valued attributes to be made discrete with threshold values. Induction tree algorithms such as C4.5 build decision trees by recursively partitioning the input attribute space (Quinlan 1996). The tree traversal from the root node to each leaf leads to one conjunctive rule. Each internal node in the decision tree has a splitting criterion or threshold for continuous-valued attributes to partition some part of the input space, and each leaf represents a class related to the conditions of each internal node.

Approaches based on decision trees involve making the continuous-valued attributes discrete in input space, creating many rectangular divisions. As a result, they may be unable to detect data trends or desirable classification surfaces. Even in the case of multivariate discretization methods, which search in parallel for threshold values over more than one continuous attribute (Fayyad & Irani 1993; Kwedlo & Kretowski 1999), the decision rules may not reflect data trends, or the decision tree may build many rules supported by only a small number of examples, or ignore some data points by dismissing them as noisy.

A possible process to grasp the trend of the data is the following. It first fits the given data set to capture the relationship between data points, using a statistical technique. It then generates many data points on the response surface of the fitted curve and induces rules with a decision tree. This method was introduced as an alternative to the direct application of the induction tree to raw data (Irani & Qian 1990; Kim 1991). However, it still has the problem of requiring many induction rules to reflect the response surface.

In this paper we use a hybrid technique to combine neural networks and decision trees for data classification (Kim & Lee 2000). It has been shown that neural networks are better than direct application of induction trees in modeling nonlinear characteristics of sample data (Dietterich, Hild, & Bakiri 1990; Quinlan 1994; Setiono & Lie 1996; Fisher & McKusick 1989; Shavlik, Mooney, & Towell 1991). Neural networks have the advantage of being able to deal with noisy, inconsistent and incomplete data. Methods to extract symbolic rules from neural networks have been proposed to increase the performance of the decision process (Andrews, Diederich, & Tickle 1996; Taha & Ghosh 1999; Fu 1994; Towell & Shavlik 1993; Setiono & Lie 1996). The KT algorithm developed by Fu (Fu 1991) extracts rules from subsets of connected weights with high activation in a trained network. The M-of-N algorithm clusters weights of the trained network and removes insignificant clusters with low active weights; the rules are then extracted from the remaining weights (Towell & Shavlik 1993).

A simple rule extraction algorithm that uses discrete activations over continuous hidden units is presented by Setiono and Taha (Setiono & Lie 1996; Taha & Ghosh 1999). They used in sequence a weight-decay back-propagation over a three-layer feed-forward network, a pruning process to remove irrelevant connection weights, a clustering of hidden unit activations, and extraction of rules from discrete unit activations. They derived symbolic rules from neural networks that include oblique decision hyperplanes instead of general input attribute relations (Setiono & Liu 1997). Also, the direct conversion from neural networks to rules has an exponential complexity when using a search-based algorithm over the incoming weights of each unit (Fu 1994; Towell & Shavlik 1993). Most rule extraction algorithms derive rules from neuron weights and neuron activations in the hidden layer with such a search-based method. An instance-based rule extraction method has been suggested to reduce computation time by avoiding search-based methods (Kim & Lee 2000). After training a two-hidden-layer neural network, the first hidden layer weight parameters are treated as linear classifiers. These linear classifiers are then chosen by decision tree methods to determine decision boundaries, after re-organizing the training set in terms of the new linear classifier attributes.

Our approach is to train a neural network with sigmoid functions and to use decision classifiers based on the weight parameters of the neural network. An induction tree then selects the desirable input variable relations for data classification. Decision tree applications have the ability to determine proper subintervals over continuous attributes by a discretization process. This discretization process covers the oblique hyperplanes mentioned in Setiono's papers. In this paper, we have tested linear classifiers with variable thresholds and with fixed thresholds. The methods are tested on various types of data and compared with the method based on the decision tree alone.

Problem Statement

Induction trees are useful for a large number of examples, and they enable us to obtain proper rules from examples rapidly (Quinlan 1996). However, they have difficulty in inferring relations between data points and cannot handle noisy data.

Figure 1: Example (a) data set and decision boundary (O: class 1, X: class 0); (b)-(c) neural network fitting; (d) data with 900 points (Kim & Lee 2000)

We can see a simple example of undesirable rule extraction discovered in the induction tree application. Figure 1(a) displays a set of 29 original sample data with two classes. It appears that the set has four sections whose boundaries run from upper-left to lower-right. The set of dotted boundary lines is the result of multivariate classification by the induction tree; it takes six rules to classify the data points. Even a C4.5 (Quinlan 1996) run produces four rules with 6.9% error, making divisions with attribute y. The rules do not catch the data clustering completely in this example. Figure 1(b)-(c) shows neural network fitting with the back-propagation method. In Figure 1(b)-(c) the neural network nodes have sigmoid slopes α = 1.5 and 4.0, respectively. After curve fitting, 900 points were generated uniformly on the response surface for the mapping from input space to class, and the response values of the neural network were calculated as shown in Figure 1(d). The result of applying C4.5 to those 900 points followed the classification curves, but produced 55 rules. The production of many rules results from the fact that the decision tree makes piecewise rectangular divisions for each rule. This happens in spite of the fact that the response surface for data clustering has a correlation between the input variables.

As shown above, the decision tree has an over-generalization problem for a small number of data and an over-specialization problem for a large number of data. A possible suggestion is to consider or derive relations between input variables as another attribute for rule extraction. However, it is difficult to find input variable relations for classification directly in supervised learning, while unsupervised methods can use statistical techniques such as principal component analysis (Haykin 1999).

Method

The goal of our approach is to generate rules following the shape and characteristics of response surfaces. Usually induction trees cannot trace the trend of data, and they determine data clustering only in terms of input variables, unless we apply other relation factors or attributes. In order to improve classification rules from a large training data set, we allow input variable relations for multi-attributes in a set of rules.

Neural Network and Linear Classifiers

We use a two-phase method for rule extraction over continuous-valued attributes. Given a large training set of data points, the first phase, as a feature extraction phase, is to train feed-forward neural networks with back-propagation and collect the weight set over input variables in the first hidden layer. A feature useful in inferring multi-attribute relations of data is found in the first hidden layer of neural networks. The extracted rules involving network weight values will reflect features of the data examples and provide good classification boundaries. They may also be more compact and comprehensible, compared to induction tree rules.

In the second phase, as a feature combination phase, the extracted features for linear classification boundaries are combined together using Boolean logic gates. In this paper, we use an induction tree to combine the linear classifiers.

The highly nonlinear property of neural networks makes it difficult to describe how they reach predictions. Although their predictive accuracy is satisfactory for many applications, they have long been considered a complex model in terms of analysis. By using expert rules derived from neural networks, the neural network representation can be made more understandable.
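As a concrete illustration of the first phase, the sketch below trains a small two-hidden-layer network and reads the first-hidden-layer weights off as linear classifiers. It is only an approximation of the setup described in this paper: scikit-learn's MLPClassifier is used here as a stand-in for the authors' back-propagation network (it has no mechanism for increasing the sigmoid slope during training), and the function names are illustrative, not the authors'.

```python
# Sketch of phase 1 (feature extraction), assuming scikit-learn as a stand-in
# for the back-propagation network with an increasing sigmoid slope.
import numpy as np
from sklearn.neural_network import MLPClassifier

def extract_linear_classifiers(X, y, hidden_layer_sizes=(4, 3)):
    """Train a two-hidden-layer MLP and return the first-layer weights/biases.

    Each first-hidden-layer unit j defines a linear classifier
        L_j(x) = sum_i W[i, j] * x[i] + b[j],
    whose sign (equivalently, sigmoid activation thresholded at 0.5) splits
    the input space by an oblique hyperplane.
    """
    net = MLPClassifier(hidden_layer_sizes=hidden_layer_sizes,
                        activation='logistic',   # sigmoid units, fixed slope here
                        max_iter=5000, random_state=0)
    net.fit(X, y)
    W, b = net.coefs_[0], net.intercepts_[0]     # first hidden layer parameters
    return W, b

def classifier_outputs(X, W, b):
    """Evaluate every linear classifier on every example (the L-vectors)."""
    return X @ W + b                             # shape: (n_samples, n_classifiers)
```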

It has been shown that a particular set of functions can be obtained with arbitrary accuracy by at most two hidden layers given enough nodes per layer (Cybenko 1988). Also, one hidden layer is sufficient to represent any Boolean function (Hertz, Palmer, & Krogh 1991). Our neural network structure has two hidden layers, where the first hidden layer makes a local feature selection with linear classifiers and the second layer receives Boolean logic values from the first layer and maps any Boolean function. The second hidden layer and the output layer can thus be thought of as a sum of products of Boolean logic gates. The n-th output of the neural network for a data example is

F_n = f( Σ_{j=1}^{N2} w_{nj} f( Σ_{k=1}^{N1} w_{jk} f( Σ_{i=1}^{N0} w_{ki} a_i ) ) )

where N0, N1 and N2 are the numbers of nodes in the input layer and in the first and second hidden layers. After training data patterns with a neural network by back-propagation, we have linear classifiers in the first hidden layer.

For a node in the first hidden layer, the activation is defined as H_j = f( Σ_{i=1}^{N0} a_i W_{ij} ) for the j-th node, where N0 is the number of input attributes, a_i is an input, and f(x) = 1 / (1 + e^{-αx}) is a sigmoid function. When we train neural networks with the back-propagation method, α, the slope of the sigmoid function, is increased as iteration continues. If we have a high value of α, the activation of each neuron is close to the behavior of a digital logic gate, which takes a binary value of 0 or 1.

Except for the first hidden layer, we can replace each neuron by a logic gate if we assume a high slope for the sigmoid function. The input to each neuron in the first hidden layer is a linear combination of input attributes and weights, Σ_{i=1}^{N0} a_i W_{ij}. This forms linear classifiers for data classification as a feature extraction over the data distribution.

When the Figure 1(a) data is trained, we can introduce new attributes aX + bY + c, where a, b, c are constants. We use two hidden layers with 4 nodes and 3 nodes, respectively, where every neuron node has a high sigmoid slope to guarantee desirable linear classifiers as shown in Figure 1(c). We transformed the 900 data points in Figure 1(d) with the four linear classifiers, and then added the classifier attributes {L1, L2, L3, L4} to the original attributes x, y. The induction tree algorithm used those six attributes {x, y, L1, L2, L3, L4} as its input attributes. We could then obtain only four rules with C4.5, while a simple application of C4.5 to those data generated 55 rules. The rules are given as follows:

rule 1: if (1.44x + 1.73y <= 5.98), then class
rule 2: if (1.44x + 1.73y > 5.98) and (1.18x + 2.81y <= 12.37), then class
rule 3: if (1.44x + 1.73y > 5.98) and (1.18x + 2.81y > 12.37) and (0.53x + 2.94y <= 14.11), then class
rule 4: if (1.44x + 1.73y > 5.98) and (1.18x + 2.81y > 12.37) and (0.53x + 2.94y > 14.11), then class

These linear classifiers exactly match the boundaries shown in Figure 1(c), and they are more dominant for classification in terms of entropy minimization than the set of original input attributes itself. Even if we include the input attributes, the entropy measurement leads to a rule set with boundary equations. These rules are more meaningful than those of a direct C4.5 application to raw data, since their divisions show the trend of data clustering and how each attribute is correlated.

Linear Classifiers for Decision Trees

Induction trees can split any continuous value by selecting thresholds for given attributes, but they cannot derive relations of input attributes directly. Thus, before induction trees are applied to a given training set, we put new relation attributes consisting of linear classifiers into the training set, generated from the weight set of a neural network.

We can represent training data as a set of attribute column vectors. When we have linear classifiers extracted from a neural network, each linear classifier can be a column vector in a training set, where the vector size is equal to the number of original training data. Each linear classifier becomes a new attribute in the training set. If we represent the original input attribute vectors and the neural network linear classifiers as U-vectors and L-vectors, respectively, then U-vectors, L-vectors, and {U + L}-vectors form different sets of training data; each set of vectors is transformed from the same data. Those three vector sets were tested with several data sets from the UCI repository (Blake, Keogh, & Merz 1998) to compare the performance (Kim & Lee 2000).

It is believed that a compact set of attributes representing the data set gives better performance. Adding the original input attributes does not improve the result; it makes the performance worse in most cases. C4.5 has difficulty in properly selecting the most significant attributes for a given set of data, because it chooses attributes with a local entropy measurement and the method is not a global optimization of entropy. Also, especially when only the linear classifiers from the neural network, the L-vectors, are used, the method is quite effective in reducing the number of rules (Kim & Lee 2000).

Generally, many decision tree algorithms have much difficulty with feature extraction. When we add many unrelated features (attributes) to a training set for decision trees, performance tends to worsen. This is because the induction tree is based on a locally optimal entropy search. In this paper, a compact L-linear classifier method was tested. We used L-linear classifiers with fixed thresholds and with variable thresholds.

In the linear classifier method with fixed thresholds, all instances in the training data are transformed into Boolean logic values through a dichotomy of the node activations in the first hidden layer; the Boolean data are then applied to the induction algorithm C4.5. In the linear classifier method with variable thresholds, the set of linear classifiers is taken as continuous-valued attributes. The C4.5 application over instances of linear classifiers will try to find the best splitting thresholds for discretization over each linear classifier attribute. In this case, each linear classifier attribute may have multiple thresholds to handle marginal boundaries of linear classifiers.
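A minimal sketch of the two threshold variants is given below, under the assumption that scikit-learn's CART implementation with the entropy criterion is an acceptable surrogate for C4.5 (it produces binary splits over the supplied attributes, not C4.5's exact rule sets), with W and b taken from the phase-1 sketch above; the helper names are ours.

```python
# Sketch of phase 2 (feature combination): an induction tree over the
# linear-classifier attributes, with fixed (Boolean) or variable thresholds.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

def build_attribute_sets(X, W, b):
    """Form the U-, L-, and {U+L}-vector versions of the training set."""
    L = X @ W + b                              # continuous linear-classifier values
    return {'U': X, 'L': L, 'U+L': np.hstack([X, L])}

def tree_over_classifiers(L, y, thresholds='variable'):
    """Fit a tree over L-vectors; 'fixed' dichotomizes each classifier first."""
    if thresholds == 'fixed':
        L = (L > 0.0).astype(int)              # sigmoid(L) > 0.5  <=>  L > 0
    tree = DecisionTreeClassifier(criterion='entropy', random_state=0)
    tree.fit(L, y)
    names = [f'L{j + 1}' for j in range(L.shape[1])]
    return tree, export_text(tree, feature_names=names)

# Hypothetical usage: each root-to-leaf path printed by export_text is one
# conjunctive rule over the classifiers L1..Lk, analogous to rules 1-4 above.
# tree, rules = tree_over_classifiers(build_attribute_sets(X, W, b)['L'], y)
# print(rules)
```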

Experiments

Our method has been tested on several data sets from the UCI repository (Blake, Keogh, & Merz 1998). Figure 2 shows average classification error rates for C4.5, neural networks and the linear classifier method. Table 1 shows error rates comparing the pure C4.5 method and our linear classifier method. The error rates were estimated by running the complete 10-fold cross-validation ten times, and the average and the standard deviation over the ten runs are given in the tables. Several neural networks were tested for each data set; Tables 2-3 show examples of different neural networks and their linear classifier results.

Table 1: Data classification errors for C4.5 and the linear classifier method with variable thresholds (error rates in the linear classifier method show the best result among several neural network experiments)

              C4.5                        Linear classifier (variable T {L})
  data        train (%)     test (%)      train (%)     test (%)
  wine         1.2 ± 0.1     7.9 ± 1.3     0.0 ± 0.0     3.6 ± 0.9
  iris         1.9 ± 0.1     5.4 ± 0.7     0.7 ± 0.2     4.7 ± 1.5
  breast-w     1.1 ± 0.1     4.7 ± 0.5     0.9 ± 0.2     4.4 ± 0.3
  ionosphere   1.6 ± 0.2    10.4 ± 1.1     1.2 ± 0.2     8.8 ± 1.5
  pima        15.1 ± 0.8    26.4 ± 0.9    15.8 ± 0.4    27.0 ± 0.9
  glass        6.7 ± 0.4    32.0 ± 1.5     6.6 ± 0.8    33.6 ± 3.1
  bupa        12.9 ± 1.5    34.5 ± 2.0    10.2 ± 1.1    33.7 ± 2.1

Table 2: iris data classification result (a) neural network error rate and the number of rules with the linear classifier method (b) error rate in the linear classifier method with variable thresholds and fixed thresholds

  (a)         neural network             variable T {L}   fixed T {L}
  nodes       train (%)     test (%)     # rules          # rules
  4-3-3        0.8 ± 0.1     4.3 ± 1.5    3.8 ± 0.3        3.4 ± 0.2
  4-5-3        0.5 ± 0.1     4.7 ± 1.1    3.9 ± 0.4        4.0 ± 0.5
  4-7-3        0.5 ± 0.2     5.2 ± 1.1    3.9 ± 0.2        4.1 ± 0.4
  4-10-3       0.4 ± 0.2     5.1 ± 1.2    3.9 ± 0.1        4.4 ± 0.6
  4-3-3-3      0.7 ± 0.1     4.5 ± 1.1    3.7 ± 0.2        3.8 ± 0.2
  4-5-4-3      0.6 ± 0.1     4.1 ± 1.3    4.0 ± 0.4        4.1 ± 0.2
  4-7-4-3      0.5 ± 0.1     4.7 ± 0.9    3.9 ± 0.2        4.0 ± 0.4

  (b)         variable thresholds {L}    fixed thresholds {L}
  nodes       train (%)     test (%)     train (%)     test (%)
  4-3-3        0.7 ± 0.1     5.1 ± 1.3    1.7 ± 0.3     4.8 ± 1.4
  4-5-3        0.7 ± 0.2     6.0 ± 1.7    1.8 ± 0.3     5.1 ± 1.2
  4-7-3        0.7 ± 0.2     4.6 ± 1.1    1.5 ± 0.3     5.3 ± 1.1
  4-10-3       0.8 ± 0.1     5.2 ± 1.6    1.4 ± 0.2     5.2 ± 1.4
  4-3-3-3      0.9 ± 0.1     5.9 ± 1.4    2.6 ± 0.7     6.1 ± 1.7
  4-5-4-3      0.7 ± 0.2     4.7 ± 1.5    2.0 ± 0.4     5.3 ± 1.2
  4-7-4-3      0.8 ± 0.2     4.7 ± 1.0    2.2 ± 0.4     5.1 ± 1.0

Table 3: bupa data classification result (a) neural network error rate and the number of rules with the linear classifier method (b) error rate in the linear classifier method with variable thresholds and fixed thresholds

  (a)         neural network             variable T {L}   fixed T {L}
  nodes       train (%)     test (%)     # rules          # rules
  6-5-2       17.9 ± 1.1    31.9 ± 1.6   10.0 ± 1.3        7.2 ± 0.7
  6-10-2      10.2 ± 1.1    32.8 ± 2.1   15.3 ± 1.7       16.3 ± 1.7
  6-5-5-2     15.2 ± 0.9    32.8 ± 1.9   10.6 ± 1.3        9.4 ± 0.9
  6-8-6-2      9.3 ± 0.7    32.1 ± 1.9   14.3 ± 2.1       19.5 ± 1.3
  6-10-7-2     7.3 ± 1.4    32.7 ± 2.2   16.2 ± 1.7       24.4 ± 2.3

  (b)         variable thresholds {L}    fixed thresholds {L}
  nodes       train (%)     test (%)     train (%)     test (%)
  6-5-2       17.8 ± 1.5    32.8 ± 1.9   25.2 ± 1.9    34.1 ± 1.3
  6-10-2      14.8 ± 1.4    33.7 ± 2.1   19.9 ± 1.5    36.2 ± 3.1
  6-5-5-2     17.5 ± 1.0    33.6 ± 2.3   22.2 ± 1.1    34.3 ± 2.3
  6-8-6-2     13.6 ± 1.5    32.5 ± 1.7   15.1 ± 0.6    34.0 ± 2.4
  6-10-7-2    15.2 ± 1.3    32.7 ± 2.9   17.3 ± 0.7    32.7 ± 1.6

Figure 2: Comparison between C4.5 and the linear classifier method: (a) average error rate on test data with C4.5, neural network, and linear classifier; (b) the number of rules with C4.5 and the linear classifier method
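The mean ± standard deviation entries in Tables 1-3 come from the estimation protocol described at the start of this section: ten repetitions of complete 10-fold cross-validation, with statistics taken over the ten per-run error rates. A minimal sketch of that protocol follows; `fit_rules` and `predict_rules` are hypothetical placeholders for whichever variant of the pipeline is being evaluated, and X, y are assumed to be NumPy arrays.

```python
# Sketch of the error-rate estimation protocol: ten runs of complete 10-fold
# cross-validation, reporting mean and standard deviation of the run-level
# error rates (in percent, as in Tables 1-3).
import numpy as np
from sklearn.model_selection import StratifiedKFold

def repeated_cv_error(X, y, fit_rules, predict_rules, repeats=10, folds=10):
    run_errors = []
    for r in range(repeats):
        cv = StratifiedKFold(n_splits=folds, shuffle=True, random_state=r)
        fold_errors = []
        for train_idx, test_idx in cv.split(X, y):
            model = fit_rules(X[train_idx], y[train_idx])
            pred = predict_rules(model, X[test_idx])
            fold_errors.append(np.mean(pred != y[test_idx]))
        run_errors.append(np.mean(fold_errors))   # error rate of one complete run
    return 100 * np.mean(run_errors), 100 * np.std(run_errors)
```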

Our methods using linear classifiers are better than C4.5 on some data sets and worse on others, such as glass and pima, which are hard to predict even with a neural network. The result supports the fact that the methods greatly depend on neural network training. If the neural network fitting is not correct, the fitting errors may mislead the result of the linear classifier methods. Normally, the C4.5 application shows a very high error rate for the training data in Table 1. The neural network can improve training performance by increasing the number of nodes in the hidden layers, as shown in Tables 2-3. However, this does not guarantee improved test set performance. In many cases, reducing errors in a training set tends to increase the error rate in a test set by overfitting.

The error rate difference between a neural network and the linear classifier method indicates that some data points are located on the marginal boundaries of classifiers. This is due to the fact that our neural network model uses sigmoid functions with high slopes instead of step functions. When an activation is near 0.5, the weighted sum of activations may lead to different output classes. If the number of nodes in the first hidden layer is increased, this marginal effect becomes larger, as observed in Table 2 (see fixed thresholds).

Figure 2(b) shows that the number of rules using our method is significantly smaller than that using conventional C4.5 on all the data sets. With the Boolean circuit model, the number of rules produced by the linear classifiers greatly depends on the number of nodes in the first hidden layer. The number of rules decreases when the number of nodes in the first hidden layer decreases, while the error rate performance remains similar within some limit, regardless of the number of nodes. The linear classifier method with variable thresholds also depends on the number of nodes. The reason why the number of rules is proportional to the number of nodes is related to the search space of Boolean logic circuits. The linear classifier method with the Boolean circuit model often tends to generate rules that have a small number of supporting examples, while the variable threshold model prunes those rules by adjusting splitting thresholds in the decision tree. Two-hidden-layer neural networks are not significantly more effective in terms of error rates and the number of rules than one hidden layer, as shown in Tables 2-3. Thus, neural networks with one hidden layer may be enough for the UCI data sets.

Most of the data sets in the UCI repository have a small number of data examples relative to the number of attributes. The significant difference between a simple C4.5 application and the combination of C4.5 with a neural network is not seen distinctively in the UCI data in terms of error rate, unlike the synthetic data in Figure 1. Information about data trends or input relations can be described more definitely when many data examples are given relative to the number of attributes.

Tables 2-3 show that neural network classification is better than the linear classifier applications. Even though the linear classifier methods are good approximations to nonlinear neural network modeling in the experiments, we still need to reduce the gap between neural network training and linear classifier models. There is a trade-off between the number of rules and error rate performance. Determining the optimal number of rules for a given data set is left for future study.

Conclusions

This paper presents a hybrid method for constructing a decision tree from neural networks. Our method uses neural network modeling to find unseen data points, and then an induction tree is applied to the data points for symbolic rules, using features from the neural network. The combination of neural networks and induction trees compensates for the disadvantages of either approach alone. This method has advantages over a simple decision tree method. First, we can obtain good features for a classification boundary from neural networks by training input patterns. Second, because of the feature extraction of input variable relations, we can obtain a compact set of rules to reflect input patterns.

We still have much work ahead, such as reducing the number of rules and the error rate together, and finding the optimal number of linear classifiers.

References

Andrews, R.; Diederich, J.; and Tickle, A. 1996. A survey and critique of techniques for extracting rules from trained artificial neural networks. Knowledge-Based Systems 8(6).

Blake, C.; Keogh, E.; and Merz, C. 1998. UCI repository of machine learning databases. In Proceedings of the Fifth International Conference on Machine Learning.

Cybenko, G. 1988. Continuous valued neural networks with two hidden layers are sufficient. Technical report, Department of Computer Science, Tufts University, Medford, MA.

Dietterich, T.; Hild, H.; and Bakiri, G. 1990. A comparative study of ID3 and backpropagation for English text-to-speech mapping. In Proceedings of the 1990 Machine Learning Conference, 24-31. Austin, TX.

Fayyad, U., and Irani, K. 1993. Multi-interval discretization of continuous-valued attributes for classification learning. In Proceedings of IJCAI'93, 1022-1027. Morgan Kaufmann.

Fisher, D., and McKusick, K. 1989. An empirical comparison of ID3 and back-propagation. In Proceedings of the 11th International Joint Conference on AI, 788-793.

Fu, L. 1991. Rule learning by searching on adaptive nets. In Proceedings of the 9th National Conference on Artificial Intelligence, 590-595.

Fu, L. 1994. Neural Networks in Computer Intelligence. New York: McGraw-Hill.

Haykin, S. 1999. Neural Networks: A Comprehensive Foundation. Upper Saddle River, N.J.: Prentice Hall, 2nd edition.

Hertz, J.; Palmer, R.; and Krogh, A. 1991. Introduction to the Theory of Neural Computation. Redwood City, Calif.: Addison Wesley.

Irani, K., and Qian, Z. 1990. Karsm: A combined response surface / knowledge acquisition approach for deriving rules for expert systems. In TECHCON'90 Conference, 209-212.

Kim, D., and Lee, J. 2000. Handling continuous-valued attributes in decision tree using neural network modeling. In European Conference on Machine Learning, Lecture Notes in Artificial Intelligence 1810, 211-219. Springer Verlag.

Kim, D. 1991. Knowledge acquisition based on neural network modeling. Technical report, Directed Study, The University of Michigan, Ann Arbor.

Kwedlo, W., and Kretowski, M. 1999. An evolutionary algorithm using multivariate discretization for decision rule induction. In Proceedings of the Third European Conference on Principles of Data Mining and Knowledge Discovery, 392-397. Springer.

Quinlan, J. 1994. Comparing connectionist and symbolic learning methods. In Computational Learning Theory and Natural Learning Systems, 445-456. MIT Press.

Quinlan, J. 1996. Improved use of continuous attributes in C4.5. Journal of Artificial Intelligence Research 4:77-90.

Setiono, R., and Lie, H. 1996. Symbolic representation of neural networks. Computer 29(3):71-77.

Setiono, R., and Liu, H. 1997. NeuroLinear: A system for extracting oblique decision rules from neural networks. In European Conference on Machine Learning, 221-233. Springer Verlag.

Shavlik, J.; Mooney, R.; and Towell, G. 1991. Symbolic and neural learning algorithms: An experimental comparison. Machine Learning 6(2):111-143.

Taha, I. A., and Ghosh, J. 1999. Symbolic interpretation of artificial neural networks. IEEE Transactions on Knowledge and Data Engineering 11(3):448-463.

Towell, G., and Shavlik, J. 1993. Extracting refined rules from knowledge-based neural networks. Machine Learning 13(1):71-101.
