Pattern Recognition Letters 24 (2003) 2195–2207 www.elsevier.com/locate/patrec

Supervised fuzzy clustering for the identification of fuzzy classifiers

Janos Abonyi *, Ferenc Szeifert

Department of Process Engineering, University of Veszprem, P.O. Box 158, H-8201 Veszprem, Hungary

Received 9 July 2001; received in revised form 26 August 2002

Abstract

The classical fuzzy classifier consists of rules, each one describing one of the classes. In this paper a new fuzzy model structure is proposed in which each rule can represent more than one class with different probabilities. The obtained classifier can be considered as an extension of the quadratic Bayes classifier that utilizes mixtures of models for estimating the class conditional densities. A supervised clustering algorithm has been worked out for the identification of this fuzzy model. The relevant input variables of the fuzzy classifier have been selected based on the analysis of the clusters by Fisher's interclass separability criterion. This new approach is applied to the well-known wine and Wisconsin breast cancer classification problems.
© 2003 Elsevier B.V. All rights reserved.

Keywords: Fuzzy clustering; Bayes classifier; Rule-reduction; Transparency and interpretability of fuzzy classifiers

1. Introduction

Typical fuzzy classifiers consist of interpretable if–then rules with fuzzy antecedents and class labels in the consequent part. The antecedents (if-parts) of the rules partition the input space into a number of fuzzy regions by fuzzy sets, while the consequents (then-parts) describe the output of the classifier in these regions. Fuzzy logic improves rule-based classifiers by allowing the use of overlapping class definitions and improves the interpretability of the results by providing more insight into the decision making process. Fuzzy logic, however, is not a guarantee for interpretability, as was also recognized in (Valente de Oliveira, 1999; Setnes et al., 1998). Hence, real effort must be made to keep the resulting rule-base transparent.

The automatic determination of compact fuzzy classifier rules from data has been approached by several different techniques: neuro-fuzzy methods (Nauck and Kruse, 1999), genetic-algorithm (GA)-based rule selection (Ishibuchi et al., 1999), and fuzzy clustering in combination with GA-optimization (Roubos and Setnes, 2000).

* Corresponding author. Tel.: +36-88-422-022/4290; fax: +36-88-422-022/4171. E-mail address: [email protected] (J. Abonyi). URL: http://www.fmt.vein.hu/softcomp.

Generally, the bottleneck of the data-driven identification of fuzzy systems is the structure identification, which requires non-linear optimization.

Thus for high-dimensional problems the initialization of the fuzzy model becomes very significant. Common initialization methods such as grid-type partitioning (Ishibuchi et al., 1999) and rule generation on extrema initialization result in complex and non-interpretable initial models, and the rule-base simplification and reduction steps become computationally demanding. To avoid these problems, fuzzy clustering algorithms (Setnes and Babuška, 1999) were put forward. However, the obtained membership values have to be projected onto the input variables and approximated by parameterized membership functions, which deteriorates the performance of the classifier. This decomposition error can be reduced by using eigenvector projection (Kim et al., 1998), but the obtained linearly transformed input variables do not allow the interpretation of the model. To avoid the projection error and maintain the interpretability of the model, the proposed approach is based on the Gath–Geva (GG) clustering algorithm (Gath and Geva, 1989) instead of the widely used Gustafson–Kessel (GK) algorithm (Gustafson and Kessel, 1979), because the simplified version of GG clustering allows the direct identification of fuzzy models with exponential membership functions (Hoppner et al., 1999).

Neither the GG nor the GK algorithm utilizes the class labels. Hence, they give suboptimal results if the obtained clusters are directly used to formulate a classical fuzzy classifier, and there is a need for fine-tuning of the model. This GA or gradient-based fine-tuning, however, can result in overfitting and thus poor generalization of the identified model. Unfortunately, the severe computational requirements of these approaches also limit their applicability as a rapid model-development tool.

This paper focuses on the design of interpretable fuzzy rule-based classifiers from data with low human intervention and low computational complexity. Hence, a new modeling scheme is introduced based only on fuzzy clustering. The proposed algorithm uses the class label of each data point to identify the optimal set of clusters that describe the data. The obtained clusters are then used to build a fuzzy classifier.

The contribution of this approach is twofold.

• The classical fuzzy classifier consists of rules, each one describing one of the C classes. In this paper a new fuzzy model structure is proposed where the consequent part is defined as the probabilities that a given rule represents the c1, ..., cC classes. The novelty of this new model is that one rule can represent more than one class with different probabilities.

• Classical fuzzy clustering algorithms are used to estimate the distribution of the data. Hence, they do not utilize the class label of each data point available for the identification. Furthermore, the obtained clusters cannot be directly used to build the classifier. In this paper a new cluster prototype and the related clustering algorithm are introduced that allow the direct supervised identification of fuzzy classifiers.

The proposed algorithm is similar to the multi-prototype classifier technique (Biem et al., 2001; Rahman and Fairhurst, 1997). In that approach, each class is clustered independently from the other classes and is modeled by a few components (Gaussian in general). The main differences are that in the proposed approach each cluster can represent more than one class, and that the number of clusters used to approximate a given class does not have to be determined manually, so the proposed approach does not suffer from this problem.

Using too many input variables may result in difficulties in the prediction and interpretability capabilities of the classifier. Hence, the selection of the relevant features is usually necessary. Generally, there is a very large set of possible features from which the feature vectors of classifiers can be composed. As the training set size should ideally increase exponentially with the feature vector size, it is desirable to choose a minimal subset of them. Generic guidelines for choosing a good feature set are that the features should discriminate the pattern classes as much as possible and that they should not be correlated or redundant. There are two basic feature-selection approaches: closed-loop algorithms are based on the classification results, while open-loop algorithms are based on a distance between clusters. In the former, each possible feature subset is used to train and to test a classifier, and the recognition rates are used as a decision criterion: the higher the recognition rate, the better the feature subset. The main disadvantage of this approach is that choosing a classifier is a critical problem on its own, and that the final selected subset clearly depends on the classifier. The latter, on the other hand, depends on defining a distance between the clusters; some possibilities are the Mahalanobis, the Bhattacharyya and the class separation distances (Campos and Bloch, 2001).

In this paper the Fisher interclass separability method is utilized, which is an open-loop feature selection approach (Cios et al., 1998). Other papers have focused on feature selection based on similarity analysis of the fuzzy sets (Campos and Bloch, 2001; Roubos and Setnes, 2000). The differences between these reduction methods are: (i) feature reduction based on the similarity analysis of fuzzy sets results in a closed-loop feature selection because it depends on the actual model, while the applied open-loop feature selection can be used beforehand as it is independent from the model; (ii) in similarity analysis, a feature can be removed from individual rules, whereas in the interclass separability method the feature is omitted from all the rules (Roubos et al., 2001). In this paper the simple Fisher interclass separability method has been modified, but in the future advanced multiclass data reduction algorithms like the weighted pairwise Fisher criteria (Loog et al., 2001) could also be used.

The paper is organized as follows. In Section 2, the structure of the new fuzzy classifier is presented. Section 3 describes the developed clustering algorithm that allows for the direct identification of fuzzy classifiers. For the selection of the important features of the fuzzy system, a method based on the Fisher interclass separability criterion will be presented in Section 4. The proposed approach is studied for the Wisconsin breast cancer and the wine classification examples in Section 5. Finally, the conclusions are given in Section 6.

2. Structure of the fuzzy rule-based classifier

2.1. Classical Bayes classifier

The identification of a classifier system means the construction of a model that predicts the class $y_k \in \{c_1, \ldots, c_C\}$ to which pattern $x_k = [x_{1,k}, \ldots, x_{n,k}]^T$ should be assigned. The classic approach for this problem with C classes is based on Bayes' rule. The probability of making an error when classifying an example $x$ is minimized by Bayes' decision rule of assigning it to the class with the largest a posteriori probability:

\[ x \text{ is assigned to } c_i \iff p(c_i \mid x) \geq p(c_j \mid x) \quad \forall j \neq i \tag{1} \]

The a posteriori probability of each class given a pattern $x$ can be calculated based on the $p(x \mid c_i)$ class conditional distribution, which models the density of the data belonging to the class $c_i$, and the $P(c_i)$ class prior, which represents the probability that an arbitrary example out of the data belongs to class $c_i$:

\[ p(c_i \mid x) = \frac{p(x \mid c_i) P(c_i)}{p(x)} = \frac{p(x \mid c_i) P(c_i)}{\sum_{j=1}^{C} p(x \mid c_j) P(c_j)} \tag{2} \]

As (1) can be rewritten using the numerator of (2),

\[ x \text{ is assigned to } c_i \iff p(x \mid c_i) P(c_i) \geq p(x \mid c_j) P(c_j) \quad \forall j \neq i \tag{3} \]

we would have an optimal classifier if we could perfectly estimate the class priors and the class conditional densities.

In practice one needs to find approximate estimates of these quantities on a finite set of training data $\{x_k, y_k\}$, $k = 1, \ldots, N$. Priors $P(c_i)$ are often estimated on the basis of the training set as the proportion of samples of class $c_i$, or using prior knowledge. The $p(x \mid c_i)$ class conditional densities can be modeled with non-parametric methods like histograms or nearest neighbors, or with parametric methods such as mixture models.

A special case of Bayes classifiers is the quadratic classifier, where the $p(x \mid c_i)$ distribution generated by the class $c_i$ is represented by a Gaussian function

\[ p(x \mid c_i) = \frac{1}{|2\pi F_i|^{n/2}} \exp\left( -\frac{1}{2} (x - v_i)^T F_i^{-1} (x - v_i) \right) \tag{4} \]

where $v_i = [v_{1,i}, \ldots, v_{n,i}]^T$ denotes the center of the $i$th multivariate Gaussian and $F_i$ stands for the covariance matrix of the data of the class $c_i$. In this case, the classification rule (3) can be reformulated based on a distance measure. The sample $x_k$ is classified to the class that minimizes the $D^2_{i,k}(x_k)$ distance, where the distance measure is inversely proportional to the probability of the data:

\[ D^2_{i,k}(x_k) = \left( \frac{P(c_i)}{|2\pi F_i|^{n/2}} \exp\left( -\frac{1}{2} (x_k - v_i)^T F_i^{-1} (x_k - v_i) \right) \right)^{-1} \tag{5} \]
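To make the decision rule concrete, the following NumPy sketch evaluates Eqs. (4) and (5) in log-space; minimizing the distance of (5) is equivalent to maximizing log P(c_i) + log p(x | c_i). The function and variable names are ours, chosen only for illustration, and the sketch is not the authors' implementation.

```python
import numpy as np

def quadratic_bayes_predict(x, centers, covariances, priors):
    """Assign x to the class that minimizes D^2 of Eq. (5), i.e. maximizes
    P(c_i) * N(x; v_i, F_i).  Works in log-space for numerical stability."""
    log_scores = []
    for v, F, p in zip(centers, covariances, priors):
        diff = x - v
        _, logdet = np.linalg.slogdet(F)            # log |F_i|
        maha = diff @ np.linalg.solve(F, diff)      # (x - v_i)^T F_i^{-1} (x - v_i)
        log_scores.append(np.log(p) - 0.5 * logdet - 0.5 * maha)
    return int(np.argmax(log_scores))
```

Dropping the normalization constant common to all classes does not affect the arg max, which is why it is omitted above.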
2.2. Classical fuzzy classifier

The classical fuzzy rule-based classifier consists of fuzzy rules, each one describing one of the C classes. The rule antecedent defines the operating region of the rule in the n-dimensional feature space and the rule consequent is a crisp (non-fuzzy) class label from the $\{c_1, \ldots, c_C\}$ label set:

\[ r_i: \ \text{If } x_1 \text{ is } A_{i,1}(x_{1,k}) \text{ and } \ldots \ x_n \text{ is } A_{i,n}(x_{n,k}) \text{ then } \hat{y} = c_i, \ [w_i] \tag{6} \]

where $A_{i,1}, \ldots, A_{i,n}$ are the antecedent fuzzy sets and $w_i$ is a certainty factor that represents the desired impact of the rule. The value of $w_i$ is usually chosen by the designer of the fuzzy system according to his or her belief in the accuracy of the rule. When such knowledge is not available, $w_i$ is fixed to the value 1 for any $i$.

The and connective is modeled by the product operator, allowing for interaction between the propositions in the antecedent. Hence, the degree of activation of the $i$th rule is calculated as:

\[ \beta_i(x_k) = w_i \prod_{j=1}^{n} A_{i,j}(x_{j,k}) \tag{7} \]

The output of the classical fuzzy classifier is determined by the winner-takes-all strategy, i.e. the output is the class related to the consequent of the rule that gets the highest degree of activation:

\[ \hat{y}_k = c_{i^*}, \quad i^* = \arg\max_{1 \le i \le C} \beta_i(x_k) \tag{8} \]

To represent the $A_{i,j}(x_{j,k})$ fuzzy sets, we use Gaussian membership functions

\[ A_{i,j}(x_{j,k}) = \exp\left( -\frac{1}{2} \frac{(x_{j,k} - v_{j,i})^2}{\sigma_{j,i}^2} \right) \tag{9} \]

where $v_{j,i}$ represents the center and $\sigma_{j,i}^2$ stands for the variance of the Gaussian function. The use of Gaussian membership functions allows for the compact formulation of (7):

\[ \beta_i(x_k) = w_i A_i(x_k) = w_i \exp\left( -\frac{1}{2} (x_k - v_i)^T F_i^{-1} (x_k - v_i) \right) \tag{10} \]

where $v_i = [v_{1,i}, \ldots, v_{n,i}]^T$ denotes the center of the $i$th multivariate Gaussian and $F_i$ stands for a diagonal matrix that contains the $\sigma_{i,j}^2$ variances.

The fuzzy classifier defined by the previous equations is in fact a quadratic Bayes classifier when $F_i$ in (4) contains only diagonal elements (variances). (For more details, refer to the paper of Baraldi and Blonda (1999), which overviews this issue.) In this case, the $A_i(x)$ membership functions and the $w_i$ certainty factors can be calculated from the parameters of the Bayes classifier following Eqs. (4) and (10) as

\[ A_i(x) = p(x \mid c_i)\,|2\pi F_i|^{n/2}, \qquad w_i = \frac{P(c_i)}{|2\pi F_i|^{n/2}} \tag{11} \]
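A minimal sketch of the classical fuzzy classifier of Eqs. (6)–(9) follows; it assumes that the rule parameters (centers, variances, weights, and one class label per rule) are already available, and all identifiers are illustrative rather than taken from the paper.

```python
import numpy as np

def rule_activations(x, centers, variances, weights):
    """Degrees of activation of Eq. (7): product of the univariate Gaussian
    memberships of Eq. (9), multiplied by the rule weights w_i.
    centers, variances: arrays of shape (R, n); weights: shape (R,)."""
    memberships = np.exp(-0.5 * (x - centers) ** 2 / variances)   # (R, n)
    return weights * memberships.prod(axis=1)                     # (R,)

def classical_fuzzy_predict(x, centers, variances, weights, rule_classes):
    """Winner-takes-all output of Eq. (8): the class label of the most active rule."""
    beta = rule_activations(x, centers, variances, weights)
    return rule_classes[int(np.argmax(beta))]
```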

2.3. Bayes classifier based on mixture of density models

One of the possible extensions of the classical quadratic Bayes classifier is to use mixtures of models for estimating the class-conditional densities. The usage of mixture models in Bayes classifiers is not so widespread (Kambhatala, 1996). In these solutions each conditional density is modeled by a separate mixture of models. A possible criticism of such Bayes classifiers is that in a sense they are modeling too much: for each class many aspects of the data are modeled which may or may not play a role in discriminating between the classes.

In this paper a new approach is presented. The $p(c_i \mid x)$ posteriori densities are modeled by a mixture of $R > C$ models (clusters):

\[ p(c_i \mid x) = \sum_{l=1}^{R} p(r_l \mid x) P(c_i \mid r_l) \tag{12} \]

where $p(r_l \mid x)$ represents the a posteriori probability that $x$ has been generated by the $r_l$th local model and $P(c_i \mid r_l)$ denotes the prior probability that this model represents the class $c_i$.

Similarly to (2), $p(r_l \mid x)$ can be written as

\[ p(r_l \mid x) = \frac{p(x \mid r_l) P(r_l)}{p(x)} = \frac{p(x \mid r_l) P(r_l)}{\sum_{j=1}^{R} p(x \mid r_j) P(r_j)} \tag{13} \]

By using this mixture of density models the posteriori class probability can be expressed following Eqs. (2), (12) and (13) as

\[ p(c_i \mid x) = \frac{p(x \mid c_i) P(c_i)}{p(x)} = \sum_{l=1}^{R} \frac{p(x \mid r_l) P(r_l)}{\sum_{j=1}^{R} p(x \mid r_j) P(r_j)} \, P(c_i \mid r_l) = \frac{\sum_{l=1}^{R} p(x \mid r_l) P(r_l) P(c_i \mid r_l)}{p(x)} \tag{14} \]

The Bayes decision rule can thus be formulated similarly to (3) as

\[ x \text{ is assigned to } c_i \iff \sum_{l=1}^{R} p(x \mid r_l) P(r_l) P(c_i \mid r_l) \geq \sum_{l=1}^{R} p(x \mid r_l) P(r_l) P(c_j \mid r_l) \quad \forall j \neq i \tag{15} \]

where the $p(x \mid r_l)$ distribution is represented by Gaussians similarly to (4).

2.4. Extended fuzzy classifier identification

A new fuzzy model that is able to represent the Bayes classifier defined by (15) can be obtained. The idea is to define the consequent of the fuzzy rule as the probabilities that the given rule represents the $c_1, \ldots, c_C$ classes:

\[ r_i: \ \text{If } x_1 \text{ is } A_{i,1}(x_{1,k}) \text{ and } \ldots \ x_n \text{ is } A_{i,n}(x_{n,k}) \text{ then } \hat{y}_k = c_1 \text{ with } P(c_1 \mid r_i), \ \ldots, \ \hat{y}_k = c_C \text{ with } P(c_C \mid r_i), \ [w_i] \tag{16} \]

Similarly to Takagi–Sugeno fuzzy models (Takagi and Sugeno, 1985), the rules of the fuzzy model are aggregated using the normalized fuzzy mean formula and the output of the classifier is determined by the label of the class that has the highest activation:

\[ \hat{y}_k = c_{i^*}, \quad i^* = \arg\max_{1 \le i \le C} \frac{\sum_{l=1}^{R} \beta_l(x_k) P(c_i \mid r_l)}{\sum_{l=1}^{R} \beta_l(x_k)} \tag{17} \]

where $\beta_l(x_k)$ has the meaning expressed by (7).

As the previous equation can be rewritten using only its numerator, the obtained expression is identical to the Gaussian mixture of Bayes classifiers (15) when, similarly to (11), the parameters of the fuzzy model are calculated as

\[ A_i(x) = p(x \mid r_i)\,|2\pi F_i|^{n/2}, \qquad w_i = \frac{P(r_i)}{|2\pi F_i|^{n/2}} \tag{18} \]

The main advantage of the previously presented classifier is that the fuzzy model can consist of more rules than classes and every rule can describe more than one class. Hence, as a given class will be described by a set of rules, it does not have to be a compact geometrical object (hyper-ellipsoid).

The aim of the remaining part of the paper is to propose a new clustering-based technique for the identification of the fuzzy classifier presented above. In addition, a new method for the selection of the antecedent variables (features) of the model will be described.
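The extended classifier of Eqs. (16) and (17) can be evaluated with the same rule activations as before; the only change is that every rule votes for every class with its consequent probabilities P(c_i | r_l). The sketch below reuses the illustrative naming introduced earlier and is not the authors' implementation.

```python
import numpy as np

def extended_fuzzy_predict(x, centers, variances, weights, rule_class_probs):
    """Eq. (17): activation-weighted vote of the rules over the classes.
    rule_class_probs has shape (R, C) and holds P(c_i | r_l)."""
    memberships = np.exp(-0.5 * (x - centers) ** 2 / variances)   # Eq. (9), (R, n)
    beta = weights * memberships.prod(axis=1)                     # Eq. (7), (R,)
    class_activation = beta @ rule_class_probs                    # numerator of Eq. (17), (C,)
    # dividing by beta.sum() (the denominator of Eq. (17)) does not change the arg max
    return int(np.argmax(class_activation))
```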
3. Supervised fuzzy clustering

The objective of clustering is to partition the data Z into R clusters. Each observation consists of input and output variables, grouped into a row vector $z_k = [x_k^T, y_k]$, where the subscript $k$ denotes the $k = 1, \ldots, N$th row of the Z pattern matrix. The fuzzy partition is represented by the $U = [\mu_{i,k}]_{R \times N}$ matrix, where the element $\mu_{i,k}$ represents the degree of membership of the observation $z_k$ in the cluster $i = 1, \ldots, R$.

The clustering is based on the minimization of the sum of the weighted squared distances $D^2_{i,k}$ between the data points and the cluster prototypes $\eta$ that contain the parameters of the clusters:

\[ J(Z; U, \eta) = \sum_{i=1}^{R} \sum_{k=1}^{N} (\mu_{i,k})^m D^2_{i,k}(z_k, r_i) \tag{19} \]

where m is a fuzzy weighting exponent that determines the fuzziness of the resulting clusters. As m approaches one from above, the partition becomes hard ($\mu_{i,k} \in \{0, 1\}$) and the $v_i$ are the ordinary means of the clusters. As $m \to \infty$, the partition becomes completely fuzzy ($\mu_{i,k} = 1/R$) and the cluster means are all equal to the grand mean of Z. Usually, m is chosen as $m = 2$.

Classical fuzzy clustering algorithms are used to estimate the distribution of the data. Hence, they do not utilize the class label of each data point available for the identification. Furthermore, the obtained clusters cannot be directly used to build the classifier. In the following, a new cluster prototype and the related distance measure are introduced that allow the direct supervised identification of fuzzy classifiers. As the clusters are used to obtain the parameters of the fuzzy classifier, the distance measure is defined similarly to the distance measure of the Bayes classifier (5):

\[ \frac{1}{D^2_{i,k}(z_k, r_i)} = \underbrace{P(r_i) \prod_{j=1}^{n} \exp\left( -\frac{1}{2} \frac{(x_{j,k} - v_{i,j})^2}{\sigma_{i,j}^2} \right)}_{\text{Gath–Geva clustering}} \; P(c_j = y_k \mid r_i) \tag{20} \]

This distance measure consists of two terms. The first term is based on the geometrical distance between the $v_i$ cluster centers and the $x_k$ observation vector, while the second is based on the probability that the $r_i$th cluster describes the density of the class of the $k$th data point, $P(c_j = y_k \mid r_i)$. It is interesting to note that this distance measure only slightly differs from that of the unsupervised GG clustering algorithm, which can also be interpreted in a probabilistic framework (Gath and Geva, 1989). However, the novelty of the proposed approach is the second term, which allows the use of class labels.
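The supervised distance measure (20) can be evaluated directly from the cluster parameters. The NumPy sketch below is our own illustration (not the paper's code); it assumes the class label y is encoded as an integer index.

```python
import numpy as np

def supervised_inverse_distance(x, y, centers, variances, cluster_priors, class_probs):
    """1 / D^2 of Eq. (20) between one labeled sample (x, y) and every cluster.
    class_probs[i, c] holds P(c | r_i); y is an integer class index.  A large value
    means the cluster is both geometrically close to x and likely to represent class y."""
    geometric = cluster_priors * np.exp(-0.5 * (x - centers) ** 2 / variances).prod(axis=1)
    return geometric * class_probs[:, y]
```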

To obtain a fuzzy partitioning space, the membership values have to satisfy the following conditions:

\[ U \in \mathbb{R}^{R \times N} \mid \mu_{i,k} \in [0, 1] \ \forall i, k; \quad \sum_{i=1}^{R} \mu_{i,k} = 1 \ \forall k; \quad 0 < \sum_{k=1}^{N} \mu_{i,k} < N \ \forall i \tag{21} \]

The minimization of the functional (19) represents a non-linear optimization problem that is subject to the constraints defined by (21) and can be solved by using a variety of available methods. The most popular method is alternating optimization (AO), which consists of the application of Picard iteration through the first-order conditions for the stationary points of (19). These can be found by adjoining the constraints (21) to J by means of Lagrange multipliers (Hoppner et al., 1999),

\[ J(Z; U, \eta, \lambda) = \sum_{i=1}^{R} \sum_{k=1}^{N} (\mu_{i,k})^m D^2_{i,k}(z_k, r_i) + \sum_{k=1}^{N} \lambda_k \left( \sum_{i=1}^{R} \mu_{i,k} - 1 \right) \tag{22} \]

and by setting the gradients of J with respect to U, $\eta$ and $\lambda$ to zero. Hence, similarly to the update equations of the GG clustering algorithm, the following equations result in a solution that satisfies the constraints (21).

Initialization. Given a set of data Z, specify R and choose a termination tolerance $\epsilon > 0$. Initialize the $U = [\mu_{i,k}]_{R \times N}$ partition matrix randomly, where $\mu_{i,k}$ denotes the membership that the $z_k$ data point is generated by the $i$th cluster.

Repeat for $l = 1, 2, \ldots$

Step 1. Calculate the parameters of the clusters.

• Calculate the centers and standard deviations of the Gaussian membership functions (the diagonal elements of the $F_i$ covariance matrices):

\[ v_i^{(l)} = \frac{\sum_{k=1}^{N} (\mu_{i,k}^{(l-1)})^m x_k}{\sum_{k=1}^{N} (\mu_{i,k}^{(l-1)})^m}, \qquad \sigma_{i,j}^{2,(l)} = \frac{\sum_{k=1}^{N} (\mu_{i,k}^{(l-1)})^m (x_{j,k} - v_{j,i})^2}{\sum_{k=1}^{N} (\mu_{i,k}^{(l-1)})^m} \tag{23} \]

• Estimate the consequent probability parameters,

\[ P(c_i \mid r_j) = \frac{\sum_{k \mid y_k = c_i} (\mu_{j,k}^{(l-1)})^m}{\sum_{k=1}^{N} (\mu_{j,k}^{(l-1)})^m}, \quad 1 \le i \le C, \ 1 \le j \le R \tag{24} \]

• Estimate the a priori probability of the clusters and the weights (impact) of the rules:

\[ P(r_i) = \frac{1}{N} \sum_{k=1}^{N} (\mu_{i,k}^{(l-1)})^m, \qquad w_i = P(r_i) \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi \sigma_{i,j}^2}} \tag{25} \]

Step 2. Compute the distance measure $D^2_{i,k}(z_k, r_i)$ by (20).

Step 3. Update the partition matrix

\[ \mu_{i,k}^{(l)} = \frac{1}{\sum_{j=1}^{R} \left( D_{i,k}(z_k, r_i) / D_{j,k}(z_k, r_j) \right)^{2/(m-1)}}, \quad 1 \le i \le R, \ 1 \le k \le N \tag{26} \]

until $\|U^{(l)} - U^{(l-1)}\| < \epsilon$.

The remainder of this section is concerned with the theoretical convergence properties of the proposed algorithm. Since this algorithm is a member of the family of algorithms discussed in (Hathaway and Bezdek, 1993), the following discussion is based on the results of Hathaway and Bezdek (1993). Using Lagrange multiplier theory, it is easily shown that for $D^2_{i,k}(z_k, r_i) \ge 0$, (26) defines $U^{(l+1)}$ to be a global minimizer of the restricted cost function (22). From this it follows that the proposed iterative algorithm is a special case of grouped coordinate minimization, and the general convergence theory from (Bezdek et al., 1987) can be applied for reasonable choices of $D^2_{i,k}(z_k, r_i)$ to show that any limit point of an iteration sequence will be a minimizer, or at worst a saddle point, of the cost function J. The local convergence result in (Bezdek et al., 1987) states that if the distance measures $D^2_{i,k}(z_k, r_i)$ are sufficiently smooth and a standard convexity condition holds at a minimizer of J, then any iteration sequence started with $U^{(0)}$ sufficiently close to $U^*$ will converge to that minimizer. Furthermore, the rate of convergence of the sequence will be c-linear. This means that there is a norm $\|\cdot\|$ and constants $0 < c < 1$ and $l_0 > 0$ such that for all $l \ge l_0$ the sequence of errors $\{e^{(l)}\} = \{\|U^{(l)} - U^{*}\|\}$ satisfies the inequality $e^{(l+1)} < c\, e^{(l)}$.
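The update equations (23)–(26) amount to one alternating-optimization sweep. The sketch below shows such a sweep for any weighting exponent m > 1 (with the common default m = 2); it is a compact illustration under our own naming conventions, not the authors' code.

```python
import numpy as np

def supervised_clustering_step(X, y, mu, n_classes, m=2.0):
    """One alternating-optimization sweep of the supervised clustering:
    parameter updates of Eqs. (23)-(25), distances of Eq. (20), and the
    partition update of Eq. (26).  X: (N, n), y: (N,) int labels, mu: (R, N)."""
    R, N = mu.shape
    w = mu ** m                                            # weighted memberships, (R, N)
    sw = w.sum(axis=1, keepdims=True)                      # (R, 1)
    centers = (w @ X) / sw                                 # Eq. (23), (R, n)
    variances = np.stack([(w[i] @ (X - centers[i]) ** 2) / sw[i, 0]
                          for i in range(R)])              # Eq. (23), (R, n)
    class_probs = np.stack([w[:, y == c].sum(axis=1) for c in range(n_classes)],
                           axis=1) / sw                    # Eq. (24), (R, C)
    priors = w.sum(axis=1) / N                             # Eq. (25), (R,)

    # Eq. (20): inverse squared distance of every sample to every cluster
    diff2 = (X[None, :, :] - centers[:, None, :]) ** 2 / variances[:, None, :]
    inv_d2 = priors[:, None] * np.exp(-0.5 * diff2).prod(axis=2) * class_probs[:, y]
    d2 = 1.0 / np.maximum(inv_d2, 1e-300)                  # guard against underflow

    # Eq. (26): fuzzy-c-means-type partition update
    ratio = (d2[:, None, :] / d2[None, :, :]) ** (1.0 / (m - 1.0))
    mu_new = 1.0 / ratio.sum(axis=1)                       # (R, N)
    return mu_new, centers, variances, class_probs, priors
```

Iterating this step until the change in the partition matrix drops below the tolerance reproduces the loop described above; the returned centers, variances, class probabilities and priors are exactly the quantities needed to parameterize the rule base of Section 2.4 through Eqs. (24) and (25).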
4. Feature selection based on interclass separability

Using too many input variables may result in difficulties in the interpretability of the obtained classifier. Hence, selection of the relevant features is usually necessary. Others have focused on reducing the antecedent variables by similarity analysis of the fuzzy sets (Roubos and Setnes, 2000); however, this method is not very suitable for feature selection. In this section the Fisher interclass separability method (Cios et al., 1998), which is based on statistical properties of the data, is modified. The interclass separability criterion is based on the between-class covariance matrix $F_B$ and the within-class covariance matrix $F_W$, which sum up to the total covariance of the data, $F_T = F_W + F_B$, where

\[ F_W = \sum_{l=1}^{R} P(r_l) F_l, \qquad F_B = \sum_{l=1}^{R} P(r_l) (v_l - v_0)(v_l - v_0)^T, \qquad v_0 = \sum_{l=1}^{R} P(r_l) v_l \tag{27} \]

The interclass separability feature selection criterion is a trade-off between $F_W$ and $F_B$:

\[ J = \frac{\det F_B}{\det F_W} \tag{28} \]

The importance of a feature is measured by leaving out the feature of interest and calculating J for the reduced covariance matrices. The feature selection is a step-wise procedure in which, in every step, the least needed feature is deleted from the model.

In the current implementation of the algorithm, after fuzzy clustering and initial model formulation a given number of features is selected by continuously checking the decrease of the performance of the classifier. To increase the classification performance, the final classifier is identified based on the re-clustering of the reduced data, which have smaller dimensionality because of the neglected input variables.
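The criterion (27)–(28) and the step-wise elimination described above can be sketched as follows. Diagonal cluster covariances are assumed, matching the clustering of Section 3, and all names are illustrative.

```python
import numpy as np

def separability(centers, variances, priors):
    """Interclass separability J of Eq. (28) from the cluster parameters of Eq. (27).
    centers, variances: (R, n); priors: (R,).  Diagonal F_l is assumed."""
    v0 = priors @ centers                                       # weighted mean of the centers
    f_w = sum(p * np.diag(var) for p, var in zip(priors, variances))
    f_b = sum(p * np.outer(v - v0, v - v0) for p, v in zip(priors, centers))
    return np.linalg.det(f_b) / np.linalg.det(f_w)

def least_needed_feature(centers, variances, priors):
    """Step-wise selection: drop the feature whose removal leaves the largest J,
    i.e. the feature the criterion needs the least."""
    n = centers.shape[1]
    scores = [separability(centers[:, [d for d in range(n) if d != j]],
                           variances[:, [d for d in range(n) if d != j]],
                           priors)
              for j in range(n)]
    return int(np.argmax(scores))
```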

5. Performance evaluation

In order to examine the performance of the proposed identification method, two well-known multidimensional classification benchmark problems are presented in this section. The studied Wisconsin breast cancer and wine data come from the UCI Repository of Machine Learning Databases (http://www.ics.uci.edu).

The performance of the obtained classifiers was measured by 10-fold cross validation. The data are divided into ten sub-sets of cases that have similar size and class distributions. Each sub-set is left out once, while the other nine are applied for the construction of the classifier, which is subsequently validated on the unseen cases in the left-out sub-set.

5.1. Example 1: the Wisconsin breast cancer classification problem

The Wisconsin breast cancer data is widely used to test the effectiveness of classification and rule extraction algorithms. The aim of the classification is to distinguish between benign and malignant cancers based on the available nine measurements: x1 clump thickness, x2 uniformity of cell size, x3 uniformity of cell shape, x4 marginal adhesion, x5 single epithelial cell size, x6 bare nuclei, x7 bland chromatin, x8 normal nucleoli, and x9 mitosis (data shown in Fig. 1). The attributes have integer values in the range [1, 10] (Baraldi and Blonda, 1999; Hoppner et al., 1999). The original database contains 699 instances, 16 of which are omitted because they are incomplete, which is common practice in other studies. The class distribution is 65.5% benign and 34.5% malignant, respectively.

The advanced version of C4.5 gives a misclassification rate of 5.26% on 10-fold cross validation (94.74% correct classification) with tree size 25 ± 0.5 (Quinlan, 1996). Nauck and Kruse (1999) combined neuro-fuzzy techniques with interactive strategies for rule pruning to obtain a fuzzy classifier. An initial rule base was made by applying two sets for each input, resulting in 2^9 = 512 rules, which was reduced to 135 by deleting the non-firing rules. A heuristic data-driven learning method was applied instead of gradient descent learning, which is not applicable for triangular membership functions. Semantic properties were taken into account by constraining the search space. Their final fuzzy classifier could be reduced to two rules with 5–6 features only, with a misclassification of 4.94% on 10-fold validation (95.06% classification accuracy). Rule-generating methods that combine GA and fuzzy logic were also applied to this problem (Peña-Reyes and Sipper, 2000). In this method the number of rules to be generated needs to be determined a priori. This method constructs a fuzzy model that has four membership functions and one rule with an additional else part. Setiono (2000) has generated a similarly compact classifier by a two-step rule extraction from a feedforward neural network trained on preprocessed data.

Fig. 1. Wisconsin breast cancer data: two classes and nine attributes (class 1: 1–445, class 2: 446–683).

As Table 1 shows, our fuzzy rule-based classifier is one of the most compact models in the literature with such high accuracy.

In the current implementation of the algorithm, after fuzzy clustering an initial fuzzy model is generated that utilizes all nine information profile data about the patient. A step-wise feature reduction algorithm has been used in which, in every step, one feature has been removed while continuously checking the decrease of the performance of the classifier on the training data. To increase the classification performance, the classifier is re-identified in every step by re-clustering the reduced data, which have smaller dimensionality because of the neglected input variables. As Table 2 shows, our supervised clustering approach gives better results than utilizing the GG clustering algorithm in the same identification scheme.

The 10-fold validation experiment with the proposed approach showed 95.57% average classification accuracy, with 90.00% as the worst and 98.57% as the best performance. This is really good for such a small classifier as compared with previously reported results (Table 1). As the error estimates are either obtained from 10-fold cross validation or from testing the solution once by using 50% of the data as a training set, the results given in Table 1 are only roughly comparable.

5.2. Example 2: the wine classification problem

The wine data contains the chemical analysis of 178 wines grown in the same region in Italy but derived from three different cultivars. The problem is to distinguish the three different types based on 13 continuous attributes derived from chemical analysis (Fig. 2). Corcoran and Sen (1994) applied all 178 samples for learning 60 non-fuzzy if–then rules in a real-coded genetic-based machine learning approach. They used a population of 1500 individuals and applied 300 generations, with full replacement, to come up with the following result for 10 independent trials: best classification rate 100%, average classification rate 99.5% and worst classification rate 98.3%, which is three misclassifications. Ishibuchi et al. (1999) applied all 178 samples to design a fuzzy classifier with 60 fuzzy rules by means of an integer-coded genetic algorithm and grid partitioning. Their population contained 100 individuals and they applied 1000 generations, with full replacement, to come up with the following result for 10 independent trials: best classification rate 99.4% (one misclassification), average classification rate 98.5% and worst classification rate 97.8% (four misclassifications).

Table 1
Classification rates and model complexity for classifiers constructed for the Wisconsin breast cancer problem

Author                          Method         # Rules   # Conditions   Accuracy (%)
Setiono (2000)                  NeuroRule 1f   4         4              97.36
Setiono (2000)                  NeuroRule 2a   3         11             98.1
Peña-Reyes and Sipper (2000)    Fuzzy-GA1      1         4              97.07
Peña-Reyes and Sipper (2000)    Fuzzy-GA2      3         16             97.36
Nauck and Kruse (1999)          NEFCLASS       2         10–12          95.06*

* Denotes results from averaging a 10-fold validation.

Table 2
Classification rates and model complexity for classifiers constructed for the Wisconsin breast cancer problem

Method       Min Acc.   Mean Acc.   Max Acc.   Min # Feat.   Mean # Feat.   Max # Feat.
GG: R = 2    84.28      90.99       95.71      8             8.9            9
Sup: R = 2   84.28      92.56       98.57      7             8              9
GG: R = 4    88.57      95.14       98.57      9             9              9
Sup: R = 4   90.00      95.57       98.57      8             8.7            9

Results from a 10-fold validation. GG: Gath–Geva clustering based classifier, Sup: proposed method.


Fig. 2. Wine data: three classes and 13 attributes (alcohol, malic acid, ash, alcalinity of ash, magnesium, total phenols, flavonoids, nonflavonoid phenols, proanthocyanins, color intensity, hue, OD280/OD315, proline).

In both approaches the final rule base contains 60 rules. The main difference is the number of model evaluations that was necessary to come to the final result.

Firstly, for comparison purposes, a fuzzy classifier that utilizes all 13 information profile data about the wine has been identified by the proposed clustering algorithm based on all 178 samples. Fuzzy models with three and four rules were identified. The three-rule model gave only two misclassifications (correct percentage 98.9%). When a cluster was added to improve the performance of this model, the obtained classifier gave only one misclassification (99.4%). The classification power of the identified models is then compared with fuzzy models with the same number of rules obtained by GG clustering, as GG clustering can be considered the unsupervised version of the proposed clustering algorithm. The GG-identified fuzzy model achieves eight misclassifications, corresponding to a correct percentage of 95.5%, when three rules are used in the fuzzy model, and six misclassifications (correct percentage 96.6%) in the case of four rules. The results are summarized in Table 3. As is shown, the performance of the obtained classifiers is comparable to those in (Corcoran and Sen, 1994; Ishibuchi et al., 1999), but they use far fewer rules (3–5 compared to 60) and fewer features.

Table 3
Classification rates on the wine data for 10 independent runs

Method                     Best result (%)   Average result (%)   Worst result (%)   Rules   Model evaluations
Corcoran and Sen (1994)    100               99.5                 98.3               60      150 000
Ishibuchi et al. (1999)    99.4              98.5                 97.8               60      6000
GG clustering              95.5              95.5                 95.5               3       1
Sup (13 features)          98.9              98.9                 98.9               3       1
Sup (13 features)          99.4              99.4                 99.4               4       1

Table 4
Classification rates and model complexity for classifiers constructed for the wine classification problem

Method       Min Acc.   Mean Acc.   Max Acc.   Min # Feat.   Mean # Feat.   Max # Feat.
GG: R = 3    83.33      94.38       100        10            12.4           13
Sup: R = 3   88.88      97.77       100        12            12.6           13
GG: R = 3    88.23      95.49       100        4             4.8            5
Sup: R = 3   76.47      94.87       100        4             4.8            5
GG: R = 6    82.35      94.34       100        4             4.9            5
Sup: R = 6   88.23      97.15       100        4             4.8            5

Results from averaging a 10-fold validation.

These results indicate that the proposed clustering method effectively utilizes the class labels. As can be seen from Table 3, because of the simplicity of the proposed clustering algorithm, the proposed approach is attractive in comparison with other iterative and optimization schemes that involve extensive intermediate optimization to generate fuzzy classifiers.

The 10-fold validation is a rigorous test of the classifier identification algorithms. These experiments showed 97.77% average classification accuracy, with 88.88% as the worst and 100% as the best performance (Table 4). The above presented automatic model reduction technique removed only one feature without a decrease of the classification performance on the training data. Hence, to avoid possible local minima, the feature selection algorithm is used to select only five features, and the proposed scheme has been applied again to identify a model based on the selected five attributes. This compact model, with an average of 4.8 features, showed 97.15% average classification accuracy, with 88.23% as the worst and 100% as the best performance. The resulting membership functions and the selected features are shown in Fig. 3.


Fig. 3. Membership functions obtained by fuzzy clustering for the five selected features (alcohol, magnesium, flavonoids, hue, proline).

Comparing the fuzzy sets in Fig. 3 with the data in Fig. 2 shows that the obtained rules are highly interpretable. For example, the flavonoids are divided into low, medium and high, which is clearly visible in the data.

6. Conclusions

In this paper a new fuzzy classifier has been presented to represent Bayes classifiers defined by a mixture of Gaussians density model. The novelty of this new model is that each rule can represent more than one class with different probabilities. For the identification of the fuzzy classifier a supervised clustering method has been worked out that is a modification of the unsupervised GG clustering algorithm. In addition, a method for the selection of the relevant input variables has been presented. The proposed identification approach is demonstrated on the Wisconsin breast cancer and the wine benchmark classification problems. The comparison to GG clustering and GA-tuned fuzzy classifiers indicates that the proposed supervised clustering method effectively utilizes the class labels and is able to identify compact and accurate fuzzy systems.

Acknowledgements

This work was supported by the Hungarian Ministry of Education (FKFP-0073/2001) and the Hungarian Science Foundation (OTKA TO37600). Part of the work was elaborated while J. Abonyi was at the Control Laboratory of Delft University of Technology. J. Abonyi is grateful for the Janos Bolyai Fellowship of the Hungarian Academy of Sciences.

References

Baraldi, A., Blonda, P., 1999. A survey of fuzzy clustering algorithms for pattern recognition––Part I. IEEE Trans. Systems Man Cybernet. Part B 29 (6), 778–785.
Bezdek, J.C., Hathaway, R.J., Howard, R.E., Wilson, C.A., Windham, M.P., 1987. Local convergence analysis of a grouped variable version of coordinate descent. J. Optimization Theory Appl. 71, 471–477.
Biem, A., Katagiri, S., McDermott, E., Juang, B.H., 2001. An application of discriminative feature extraction to filter-bank-based speech recognition. IEEE Trans. Speech Audio Process. 9 (2), 96–110.
Campos, T.E., Bloch, I., Cesar Jr., R.M., 2001. Feature selection based on fuzzy distances between clusters: First results on simulated data. In: ICAPR'2001––International Conference on Advances in Pattern Recognition, Rio de Janeiro, Brazil, May. Lecture Notes in Computer Science. Springer-Verlag, Berlin.
Cios, K.J., Pedrycz, W., Swiniarski, R.W., 1998. Data Mining Methods for Knowledge Discovery. Kluwer Academic Press, Boston.
Corcoran, A.L., Sen, S., 1994. Using real-valued genetic algorithms to evolve rule sets for classification. In: IEEE-CEC, June 27–29, Orlando, USA, pp. 120–124.
Gath, I., Geva, A.B., 1989. Unsupervised optimal fuzzy clustering. IEEE Trans. Pattern Anal. Machine Intell. 7, 773–781.
Gustafson, D.E., Kessel, W.C., 1979. Fuzzy clustering with a fuzzy covariance matrix. In: Proceedings of IEEE CDC, San Diego, USA.
Hathaway, R.J., Bezdek, J.C., 1993. Switching regression models and fuzzy clustering. IEEE Trans. Fuzzy Systems 1, 195–204.
Hoppner, F., Klawonn, F., Kruse, R., Runkler, T., 1999. Fuzzy Cluster Analysis––Methods for Classification, Data Analysis and Image Recognition. John Wiley and Sons, New York.
Ishibuchi, H., Nakashima, T., Murata, T., 1999. Performance evaluation of fuzzy classifier systems for multidimensional pattern classification problems. IEEE Trans. SMC B 29, 601–618.
Kambhatala, N., 1996. Local models and Gaussian mixture models for statistical data processing. Ph.D. Thesis, Oregon Graduate Institute of Science and Technology.
Kim, E., Park, M., Kim, S., Park, M., 1998. A transformed input–domain approach to fuzzy modeling. IEEE Trans. Fuzzy Systems 6, 596–604.
Loog, L.C.M., Duin, R.P.W., Haeb-Umbach, R., 2001. Multiclass linear dimension reduction by weighted pairwise Fisher criteria. IEEE Trans. PAMI 23 (7), 762–766.
Nauck, D., Kruse, R., 1999. Obtaining interpretable fuzzy classification rules from medical data. Artificial Intell. Med. 16, 149–169.
Peña-Reyes, C.A., Sipper, M., 2000. A fuzzy genetic approach to breast cancer diagnosis. Artificial Intell. Med. 17, 131–155.
Quinlan, J.R., 1996. Improved use of continuous attributes in C4.5. J. Artificial Intell. Res. 4, 77–90.
Rahman, A.F.R., Fairhurst, M.C., 1997. Multi-prototype classification: improved modelling of the variability of handwritten data using statistical clustering algorithms. Electron. Lett. 33 (14), 1208–1209.
Roubos, J.A., Setnes, M., 2000. Compact fuzzy models through complexity reduction and evolutionary optimization. In: FUZZ-IEEE, May 7–10, San Antonio, USA, pp. 762–767.
Roubos, J.A., Setnes, M., Abonyi, J., 2001. Learning fuzzy classification rules from data. In: John, R., Birkenhead, R. (Eds.), Developments in Soft Computing. Springer-Verlag, Berlin/Heidelberg, pp. 108–115.
Setiono, R., 2000. Generating concise and accurate classification rules for breast cancer diagnosis. Artificial Intell. Med. 18, 205–219.
Setnes, M., Babuška, R., 1999. Fuzzy relational classifier trained by fuzzy clustering. IEEE Trans. SMC B 29, 619–625.
Setnes, M., Babuška, R., Kaymak, U., van Nauta Lemke, H.R., 1998. Similarity measures in fuzzy rule base simplification. IEEE Trans. SMC B 28, 376–386.
Takagi, T., Sugeno, M., 1985. Fuzzy identification of systems and its application to modeling and control. IEEE Trans. SMC 15, 116–132.
Valente de Oliveira, J., 1999. Semantic constraints for membership function optimization. IEEE Trans. FS 19, 128–138.