Decision Tree Rule Reduction Using Linear Classifiers in Multilayer Perceptron

From: FLAIRS-01 Proceedings. Copyright © 2001, AAAI (www.aaai.org). All rights reserved.

DaeEun Kim
Division of Informatics, University of Edinburgh
5 Forrest Hill, Edinburgh, EH1 2QL, United Kingdom
[email protected]

Sea Woo Kim
Dept. of Information and Communication Engineering, KAIST
Cheongryang 1-dong, Seoul, 130-011, Korea
[email protected]

Abstract

It has been shown that a neural network is better than a direct application of induction trees in modeling complex relations of input attributes in sample data. We propose that concise rules be extracted to support data with input variable relations over continuous-valued attributes. Those relations, as a set of linear classifiers, can be obtained from neural network modeling based on back-propagation. A linear classifier is derived from a linear combination of input attributes and neuron weights in the first hidden layer of neural networks. It is shown in this paper that when we use a decision tree over linear classifiers extracted from a multilayer perceptron, the number of rules can be reduced. We have tested this method over several data sets to compare it with decision tree results.

Introduction

The discovery of decision rules and recognition of patterns from data examples is one of the most challenging problems in machine learning. If data points contain numerical attributes, induction tree methods need the continuous-valued attributes to be made discrete with threshold values. Induction tree algorithms such as C4.5 build decision trees by recursively partitioning the input attribute space (Quinlan 1996). The tree traversal from the root node to each leaf leads to one conjunctive rule. Each internal node in the decision tree has a splitting criterion or threshold for continuous-valued attributes to partition some part of the input space, and each leaf represents a class related to the conditions of each internal node.

Approaches based on decision trees involve making the continuous-valued attributes discrete in input space, creating many rectangular divisions. As a result, they may be unable to detect data trends or desirable classification surfaces. Even in the case of multivariate discretization methods, which search in parallel for threshold values for more than one continuous attribute (Fayyad & Irani 1993; Kweldo & Kretowski 1999), the decision rules may not reflect data trends, or the decision tree may build many rules with the support of a small number of examples or ignore some data points by dismissing them as noisy.

A possible process is suggested to grasp the trend of the data. It first tries to fit the relationship between data points in a given data set, using a statistical technique. It generates many data points on the response surface of the fitted curve, and then induces rules with a decision tree. This method was introduced as an alternative measure regarding the problem of direct application of the induction tree to raw data (Irani & Qian 1990; Kim 1991). However, it still has the problem of requiring many induction rules to reflect the response surface.

In this paper we use a hybrid technique to combine neural networks and decision trees for data classification (Kim & Lee 2000). It has been shown that neural networks are better than direct application of induction trees in modeling nonlinear characteristics of sample data (Dietterich, Hild, & Bakiri 1990; Quinlan 1994; Setiono & Lie 1996; Fisher & McKusick 1989; Shavlik, Mooney, & Towell 1991). Neural networks have the advantage of being able to deal with noisy, inconsistent and incomplete data. A method to extract symbolic rules from neural networks has been proposed to increase the performance of the decision process (Andrews, Diederich, & Tickle 1996; Taha & Ghosh 1999; Fu 1994; Towell & Shavlik Oct 1993; Setiono & Lie 1996). The KT algorithm developed by Fu (Fu 1991) extracts rules from subsets of connected weights with high activation in a trained network. The M-of-N algorithm clusters weights of the trained network and removes insignificant clusters with low active weights. Then the rules are extracted from the weights (Towell & Shavlik Oct 1993).

A simple rule extraction algorithm that uses discrete activations over continuous hidden units is presented by Setiono and Taha (Setiono & Lie 1996; Taha & Ghosh 1999). They used in sequence a weight-decay back-propagation over a three-layer feed-forward network, a pruning process to remove irrelevant connection weights, a clustering of hidden unit activations, and extraction of rules from discrete unit activations. They derived symbolic rules from neural networks that include oblique decision hyperplanes instead of general input attribute relations (Setiono & Liu 1997). Also, the direct conversion from neural networks to rules has an exponential complexity when using a search-based algorithm over incoming weights for each unit (Fu 1994; Towell & Shavlik Oct 1993). Most of the rule extraction algorithms are used to derive rules from neuron weights and neuron activations in the hidden layer as a search-based method. An instance-based rule extraction method is suggested to reduce computation time by escaping search-based methods (Kim & Lee 2000). After training two-hidden-layer neural networks, the first hidden layer weight parameters are treated as linear classifiers. These linear differentiated functions are chosen by decision tree methods to determine decision boundaries after re-organizing the training set in terms of the new linear classifier attributes.

Our approach is to train a neural network with sigmoid functions and to use decision classifiers based on weight parameters of neural networks. Then an induction tree selects the desirable input variable relations for data classification. Decision tree applications have the ability to determine proper subintervals over continuous attributes by a discretization process. This discretization process will cover oblique hyperplanes mentioned in Setiono's papers. In this paper, we have tested linear classifiers with variable thresholds and fixed thresholds. The methods are tested on various types of data and compared with the method based on the decision tree alone.

Problem Statement

Induction trees are useful for a large number of examples, and they enable us to obtain proper rules from examples rapidly (Quinlan 1996). However, they have difficulty inferring relations between data points and cannot handle noisy data.

Figure 1: Example (a) data set and decision boundary (O: class 1, X: class 0) (b)-(c) neural network fitting (d) data with 900 points (Kim & Lee 2000)

We can see a simple example of undesirable rule extraction discovered in the induction tree application. Figure 1(a)

The rules do not catch data clustering completely in this example. Figure 1(b)-(c) shows neural network fitting with the back-propagation method. In Figure 1(b)-(c) neural network nodes have slopes alpha = 1.5, 4.0 for sigmoids, respectively. After curve fitting, 900 points were generated uniformly on the response surface for the mapping from input space to class, and the response values of the neural network were calculated as shown in Figure 1(d). The result of applying C4.5 to those 900 points followed the classification curves, but produced 55 rules. The production of many rules results from the fact that the decision tree makes piecewise rectangular divisions for each rule. This happens in spite of the fact that the response surface for data clustering has a correlation between the input variables.

As shown above, the decision tree has a problem of overgeneralization for a small number of data and an overspecialization problem for a large number of data. A possible suggestion is to consider or derive relations between input variables as another attribute for rule extraction. However, it is difficult to find input variable relations for classification directly in supervised learning, while unsupervised methods can use statistical methods such as principal component analysis (Haykin 1999).

Method

The goal for our approach is to generate rules following the shape and characteristics of response surfaces. Usually induction trees cannot trace the trend of data, and they determine data clustering only in terms of input variables, unless we apply other relation factors or attributes. In order to improve classification rules from a large training data set, we allow input variable relations for multi-attributes in a set of rules.

Neural Network and Linear Classifiers

We use a two-phase method for rule extraction over continuous-valued attributes. Given a large training set of data points, the first phase, as a feature extraction phase, is to train feed-forward neural networks with back-propagation and collect the weight set over input variables in the first hidden layer. A feature useful in inferring multi-attribute relations of data is found in the first hidden layer of neural networks. The extracted rules involving network weight values will reflect features of data examples and provide good classification boundaries. Also, they may be more compact and comprehensible, compared to induction tree rules.

In the second phase, as a feature combination phase, each extracted feature for a linear classification boundary is combined together using Boolean logic gates. In this paper, we use an induction tree to combine each linear classifier.

The highly nonlinear property of neural networks makes
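The second phase described above can be illustrated with a minimal sketch. It assumes the first-hidden-layer weights are already available; here they are hand-picked, hypothetical values (`weights`, `biases`) rather than the result of back-propagation training, and a tiny greedy Gini-based tree stands in for C4.5. The toy data set, the function names, and the leaf-counting shortcut are all assumptions made for illustration and are not taken from the paper.

```python
from itertools import product

def linear_features(x, weights, biases):
    """Map an input vector to one value (w . x + b) per first-hidden-layer unit."""
    return [sum(w * xi for w, xi in zip(ws, x)) + b for ws, b in zip(weights, biases)]

def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)

def best_split(rows, labels):
    """Return (weighted impurity, attribute index, threshold) of the best
    single-attribute threshold split, or None if no split exists."""
    best = None
    n = len(rows)
    for j in range(len(rows[0])):
        values = sorted(set(r[j] for r in rows))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, j, t)
    return best

def count_leaves(rows, labels):
    """Grow a tree until every leaf is pure and return the number of leaves.
    Each leaf corresponds to one conjunctive rule."""
    if len(set(labels)) <= 1:
        return 1
    split = best_split(rows, labels)
    if split is None:  # identical rows with mixed labels cannot be split
        return 1
    _, j, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[j] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[j] > t]
    return (count_leaves([r for r, _ in left], [y for _, y in left])
            + count_leaves([r for r, _ in right], [y for _, y in right]))

# Toy data (assumption): a 10 x 10 grid whose classes are separated by the
# oblique boundary x1 + x2 = 9.5, i.e. the inputs are correlated.
points = [[float(i), float(j)] for i, j in product(range(10), repeat=2)]
labels = [1 if x[0] + x[1] > 9 else 0 for x in points]

# Hypothetical first-hidden-layer weights for a single unit (not trained).
weights, biases = [[1.0, 1.0]], [-9.5]

raw_rules = count_leaves(points, labels)
feature_rules = count_leaves(
    [linear_features(x, weights, biases) for x in points], labels)
print("rules on raw attributes:", raw_rules)
print("rules on linear classifier attribute:", feature_rules)
```

On this toy set the tree over the single linear classifier attribute needs only two leaves, while the tree over the raw attributes must approximate the oblique boundary with a staircase of rectangular regions and therefore needs more. With a trained network, `weights` and `biases` would instead be read from the first hidden layer.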
