Pattern Recognition Letters 24 (2003) 2195–2207 www.elsevier.com/locate/patrec

Supervised fuzzy clustering for the identification of fuzzy classifiers

Janos Abonyi *, Ferenc Szeifert

Department of Process Engineering, University of Veszprem, P.O. Box 158, H-8201 Veszprem, Hungary

Received 9 July 2001; received in revised form 26 August 2002

Abstract

The classical fuzzy classifier consists of rules, each one describing one of the classes. In this paper a new fuzzy model structure is proposed in which each rule can represent more than one class with different probabilities. The obtained classifier can be considered as an extension of the quadratic Bayes classifier that utilizes mixtures of models for estimating the class conditional densities. A supervised clustering algorithm has been worked out for the identification of this fuzzy model. The relevant input variables of the fuzzy classifier have been selected based on the analysis of the clusters by Fisher's interclass separability criterion. This new approach is applied to the well-known wine and Wisconsin breast cancer classification problems.
© 2003 Elsevier B.V. All rights reserved.

Keywords: Fuzzy clustering; Bayes classifier; Rule-reduction; Transparency and interpretability of fuzzy classifiers

1. Introduction

Typical fuzzy classifiers consist of interpretable if–then rules with fuzzy antecedents and class labels in the consequent part. The antecedents (if-parts) of the rules partition the input space into a number of fuzzy regions by fuzzy sets, while the consequents (then-parts) describe the output of the classifier in these regions. Fuzzy logic improves rule-based classifiers by allowing the use of overlapping class definitions and improves the interpretability of the results by providing more insight into the decision making process. Fuzzy logic, however, is not a guarantee for interpretability, as was also recognized in (Valente de Oliveira, 1999; Setnes et al., 1998). Hence, real effort must be made to keep the resulting rule-base transparent.

The automatic determination of compact fuzzy classifier rules from data has been approached by several different techniques: neuro-fuzzy methods (Nauck and Kruse, 1999), genetic-algorithm (GA)-based rule selection (Ishibuchi et al., 1999), and fuzzy clustering in combination with GA-optimization (Roubos and Setnes, 2000).

* Corresponding author. Tel.: +36-88-422-022/4290; fax: +36-88-422-022/4171. E-mail address: [email protected] (J. Abonyi). URL: http://www.fmt.vein.hu/softcomp.

Generally, the bottleneck of the data-driven identification of fuzzy systems is the structure identification, which requires non-linear optimization.

Thus for high-dimensional problems the initialization of the fuzzy model becomes very significant. Common initialization methods such as grid-type partitioning (Ishibuchi et al., 1999) and rule generation on extrema initialization result in complex and non-interpretable initial models, and the rule-base simplification and reduction steps become computationally demanding. To avoid these problems, fuzzy clustering algorithms (Setnes and Babuška, 1999) were put forward. However, the obtained membership values have to be projected onto the input variables and approximated by parameterized membership functions, which deteriorates the performance of the classifier. This decomposition error can be reduced by using eigenvector projection (Kim et al., 1998), but the obtained linearly transformed input variables do not allow the interpretation of the model. To avoid the projection error and maintain the interpretability of the model, the proposed approach is based on the Gath–Geva (GG) clustering algorithm (Gath and Geva, 1989) instead of the widely used Gustafson–Kessel (GK) algorithm (Gustafson and Kessel, 1979), because the simplified version of GG clustering allows the direct identification of fuzzy models with exponential membership functions (Hoppner et al., 1999).

Neither the GG nor the GK algorithm utilizes the class labels. Hence, they give suboptimal results if the obtained clusters are directly used to formulate a classical fuzzy classifier, and there is a need for fine-tuning of the model. This GA or gradient-based fine-tuning, however, can result in overfitting and thus poor generalization of the identified model. Unfortunately, the severe computational requirements of these approaches also limit their applicability as a rapid model-development tool.

This paper focuses on the design of interpretable fuzzy rule-based classifiers from data with low human intervention and low computational complexity. Hence, a new modeling scheme is introduced based only on fuzzy clustering. The proposed algorithm uses the class label of each data point to identify the optimal set of clusters that describe the data. The obtained clusters are then used to build a fuzzy classifier.

The contribution of this approach is twofold.

• The classical fuzzy classifier consists of rules, each one describing one of the C classes. In this paper a new fuzzy model structure is proposed where the consequent part is defined as the probabilities that a given rule represents the c1, ..., cC classes. The novelty of this new model is that one rule can represent more than one class with different probabilities.

• Classical fuzzy clustering algorithms are used to estimate the distribution of the data. Hence, they do not utilize the class label of each data point available for the identification. Furthermore, the obtained clusters cannot be directly used to build the classifier. In this paper a new cluster prototype and the related clustering algorithm are introduced that allow the direct supervised identification of fuzzy classifiers.

The proposed algorithm is similar to the multi-prototype classifier technique (Biem et al., 2001; Rahman and Fairhurst, 1997). In that approach, each class is clustered independently from the other classes and is modeled by a few components (Gaussian in general). The main differences are that in the proposed approach each cluster can represent more than one class, and that the number of clusters used to approximate a given class does not have to be determined manually, so the proposed approach does not suffer from this problem.

Using too many input variables may result in difficulties in the prediction and interpretability capabilities of the classifier. Hence, the selection of the relevant features is usually necessary. Generally, there is a very large set of possible features from which the feature vectors of classifiers can be composed. As the training set size should ideally increase exponentially with the feature vector size, it is desirable to choose a minimal subset of them. Generic guidelines for choosing a good feature set are that the features should discriminate the pattern classes as much as possible and that they should not be correlated or redundant. There are two basic feature-selection approaches: closed-loop algorithms are based on the classification results, while open-loop algorithms are based on a distance between clusters. In the former, each possible feature subset is used to train and to test a classifier, and the recognition rates are used as a decision criterion: the higher the recognition rate, the better the feature subset. The main disadvantage of this approach is that choosing a classifier is a critical problem on its own, and that the final selected subset clearly depends on the classifier. The latter, on the other hand, depends on defining a distance between the clusters; some possibilities are the Mahalanobis, the Bhattacharyya and the class separation distances (Campos and Bloch, 2001).

In this paper the Fisher interclass separability method is utilized, which is an open-loop feature selection approach (Cios et al., 1998). Other papers have focused on feature selection based on similarity analysis of the fuzzy sets (Campos and Bloch, 2001; Roubos and Setnes, 2000). The differences between these reduction methods are: (i) feature reduction based on the similarity analysis of fuzzy sets results in a closed-loop feature selection because it depends on the actual model, while the applied open-loop feature selection can be used beforehand as it is independent from the model; (ii) in similarity analysis, a feature can be removed from individual rules, whereas in the interclass separability method the feature is omitted from all the rules (Roubos et al., 2001). In this paper the simple Fisher interclass separability method has been modified, but in the future advanced multiclass data reduction algorithms like the weighted pairwise Fisher criteria (Loog et al., 2001) could also be used.

The paper is organized as follows. In Section 2, the structure of the new fuzzy classifier is presented. Section 3 describes the developed clustering algorithm that allows for the direct identification of fuzzy classifiers. For the selection of the important features of the fuzzy system, a method based on the Fisher interclass separability criterion will be presented in Section 4. The proposed approach is studied for the Wisconsin breast cancer and the wine classification examples in Section 5. Finally, the conclusions are given in Section 6.

2. Structure of the fuzzy rule-based classifier

2.1. Classical Bayes classifier

The identification of a classifier system means the construction of a model that predicts the class $y_k \in \{c_1, \ldots, c_C\}$ to which pattern $x_k = [x_{1,k}, \ldots, x_{n,k}]^T$ should be assigned. The classic approach for this problem with C classes is based on Bayes' rule. The probability of making an error when classifying an example $x$ is minimized by Bayes' decision rule of assigning it to the class with the largest a posteriori probability:

\[ x \text{ is assigned to } c_i \iff p(c_i \mid x) \geq p(c_j \mid x) \quad \forall j \neq i \tag{1} \]

The a posteriori probability of each class given a pattern $x$ can be calculated based on the $p(x \mid c_i)$ class conditional distribution, which models the density of the data belonging to the class $c_i$, and the $P(c_i)$ class prior, which represents the probability that an arbitrary example out of the data belongs to class $c_i$:

\[ p(c_i \mid x) = \frac{p(x \mid c_i) P(c_i)}{p(x)} = \frac{p(x \mid c_i) P(c_i)}{\sum_{j=1}^{C} p(x \mid c_j) P(c_j)} \tag{2} \]

As (1) can be rewritten using the numerator of (2),

\[ x \text{ is assigned to } c_i \iff p(x \mid c_i) P(c_i) \geq p(x \mid c_j) P(c_j) \quad \forall j \neq i \tag{3} \]

we would have an optimal classifier if we could perfectly estimate the class priors and the class conditional densities.

In practice one needs to find approximate estimates of these quantities on a finite set of training data $\{x_k, y_k\}$, $k = 1, \ldots, N$. Priors $P(c_i)$ are often estimated on the basis of the training set as the proportion of samples of class $c_i$, or using prior knowledge. The $p(x \mid c_i)$ class conditional densities can be modeled with non-parametric methods like histograms or nearest neighbors, or with parametric methods such as mixture models.

A special case of Bayes classifiers is the quadratic classifier, where the $p(x \mid c_i)$ distribution generated by the class $c_i$ is represented by a Gaussian function

\[ p(x \mid c_i) = \frac{1}{|2\pi F_i|^{n/2}} \exp\left( -\frac{1}{2} (x - v_i)^T F_i^{-1} (x - v_i) \right) \tag{4} \]

where $v_i = [v_{1,i}, \ldots, v_{n,i}]^T$ denotes the center of the $i$th multivariate Gaussian and $F_i$ stands for the covariance matrix of the data of the class $c_i$. In this case, the classification rule (3) can be reformulated based on a distance measure. The sample $x_k$ is classified to the class that minimizes the $D^2_{i,k}(x_k)$ distance, where the distance measure is inversely proportional to the probability of the data:

\[ D^2_{i,k}(x_k) = \left( \frac{P(c_i)}{|2\pi F_i|^{n/2}} \exp\left( -\frac{1}{2} (x_k - v_i)^T F_i^{-1} (x_k - v_i) \right) \right)^{-1} \tag{5} \]
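To make the decision rule concrete, the following NumPy sketch evaluates Eqs. (4) and (5) in log-space; minimizing the distance of (5) is equivalent to maximizing log P(c_i) + log p(x | c_i). The function and variable names are ours, chosen only for illustration, and the sketch is not the authors' implementation.

```python
import numpy as np

def quadratic_bayes_predict(x, centers, covariances, priors):
    """Assign x to the class that minimizes D^2 of Eq. (5), i.e. maximizes
    P(c_i) * N(x; v_i, F_i).  Works in log-space for numerical stability."""
    log_scores = []
    for v, F, p in zip(centers, covariances, priors):
        diff = x - v
        _, logdet = np.linalg.slogdet(F)            # log |F_i|
        maha = diff @ np.linalg.solve(F, diff)      # (x - v_i)^T F_i^{-1} (x - v_i)
        log_scores.append(np.log(p) - 0.5 * logdet - 0.5 * maha)
    return int(np.argmax(log_scores))
```

Dropping the normalization constant common to all classes does not affect the arg max, which is why it is omitted above.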
2.2. Classical fuzzy classifier

The classical fuzzy rule-based classifier consists of fuzzy rules, each one describing one of the C classes. The rule antecedent defines the operating region of the rule in the n-dimensional feature space and the rule consequent is a crisp (non-fuzzy) class label from the $\{c_1, \ldots, c_C\}$ label set:

\[ r_i: \ \text{If } x_1 \text{ is } A_{i,1}(x_{1,k}) \text{ and } \ldots \ x_n \text{ is } A_{i,n}(x_{n,k}) \text{ then } \hat{y} = c_i, \ [w_i] \tag{6} \]

where $A_{i,1}, \ldots, A_{i,n}$ are the antecedent fuzzy sets and $w_i$ is a certainty factor that represents the desired impact of the rule. The value of $w_i$ is usually chosen by the designer of the fuzzy system according to his or her belief in the accuracy of the rule. When such knowledge is not available, $w_i$ is fixed to the value 1 for any $i$.

The and connective is modeled by the product operator, allowing for interaction between the propositions in the antecedent. Hence, the degree of activation of the $i$th rule is calculated as:

\[ \beta_i(x_k) = w_i \prod_{j=1}^{n} A_{i,j}(x_{j,k}) \tag{7} \]

The output of the classical fuzzy classifier is determined by the winner-takes-all strategy, i.e. the output is the class related to the consequent of the rule that gets the highest degree of activation:

\[ \hat{y}_k = c_{i^*}, \quad i^* = \arg\max_{1 \le i \le C} \beta_i(x_k) \tag{8} \]

To represent the $A_{i,j}(x_{j,k})$ fuzzy sets, we use Gaussian membership functions

\[ A_{i,j}(x_{j,k}) = \exp\left( -\frac{1}{2} \frac{(x_{j,k} - v_{j,i})^2}{\sigma_{j,i}^2} \right) \tag{9} \]

where $v_{j,i}$ represents the center and $\sigma_{j,i}^2$ stands for the variance of the Gaussian function. The use of Gaussian membership functions allows for the compact formulation of (7):

\[ \beta_i(x_k) = w_i A_i(x_k) = w_i \exp\left( -\frac{1}{2} (x_k - v_i)^T F_i^{-1} (x_k - v_i) \right) \tag{10} \]

where $v_i = [v_{1,i}, \ldots, v_{n,i}]^T$ denotes the center of the $i$th multivariate Gaussian and $F_i$ stands for a diagonal matrix that contains the $\sigma_{i,j}^2$ variances.

The fuzzy classifier defined by the previous equations is in fact a quadratic Bayes classifier when $F_i$ in (4) contains only diagonal elements (variances). (For more details, refer to the paper of Baraldi and Blonda (1999), which overviews this issue.) In this case, the $A_i(x)$ membership functions and the $w_i$ certainty factors can be calculated from the parameters of the Bayes classifier following Eqs. (4) and (10) as

\[ A_i(x) = p(x \mid c_i)\,|2\pi F_i|^{n/2}, \qquad w_i = \frac{P(c_i)}{|2\pi F_i|^{n/2}} \tag{11} \]
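A minimal sketch of the classical fuzzy classifier of Eqs. (6)–(9) follows; it assumes that the rule parameters (centers, variances, weights, and one class label per rule) are already available, and all identifiers are illustrative rather than taken from the paper.

```python
import numpy as np

def rule_activations(x, centers, variances, weights):
    """Degrees of activation of Eq. (7): product of the univariate Gaussian
    memberships of Eq. (9), multiplied by the rule weights w_i.
    centers, variances: arrays of shape (R, n); weights: shape (R,)."""
    memberships = np.exp(-0.5 * (x - centers) ** 2 / variances)   # (R, n)
    return weights * memberships.prod(axis=1)                     # (R,)

def classical_fuzzy_predict(x, centers, variances, weights, rule_classes):
    """Winner-takes-all output of Eq. (8): the class label of the most active rule."""
    beta = rule_activations(x, centers, variances, weights)
    return rule_classes[int(np.argmax(beta))]
```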

2.3. Bayes classifier based on mixture of density models

One of the possible extensions of the classical quadratic Bayes classifier is to use mixtures of models for estimating the class-conditional densities. The usage of mixture models in Bayes classifiers is not so widespread (Kambhatala, 1996). In these solutions each conditional density is modeled by a separate mixture of models. A possible criticism of such Bayes classifiers is that in a sense they are modeling too much: for each class many aspects of the data are modeled which may or may not play a role in discriminating between the classes.

In this paper a new approach is presented. The $p(c_i \mid x)$ posteriori densities are modeled by a mixture of $R > C$ models (clusters):

\[ p(c_i \mid x) = \sum_{l=1}^{R} p(r_l \mid x) P(c_i \mid r_l) \tag{12} \]

where $p(r_l \mid x)$ represents the a posteriori probability that $x$ has been generated by the $r_l$th local model and $P(c_i \mid r_l)$ denotes the prior probability that this model represents the class $c_i$.

Similarly to (2), $p(r_l \mid x)$ can be written as

\[ p(r_l \mid x) = \frac{p(x \mid r_l) P(r_l)}{p(x)} = \frac{p(x \mid r_l) P(r_l)}{\sum_{j=1}^{R} p(x \mid r_j) P(r_j)} \tag{13} \]

By using this mixture of density models the posteriori class probability can be expressed following Eqs. (2), (12) and (13) as

\[ p(c_i \mid x) = \frac{p(x \mid c_i) P(c_i)}{p(x)} = \sum_{l=1}^{R} \frac{p(x \mid r_l) P(r_l)}{\sum_{j=1}^{R} p(x \mid r_j) P(r_j)} \, P(c_i \mid r_l) = \frac{\sum_{l=1}^{R} p(x \mid r_l) P(r_l) P(c_i \mid r_l)}{p(x)} \tag{14} \]

The Bayes decision rule can thus be formulated similarly to (3) as

\[ x \text{ is assigned to } c_i \iff \sum_{l=1}^{R} p(x \mid r_l) P(r_l) P(c_i \mid r_l) \geq \sum_{l=1}^{R} p(x \mid r_l) P(r_l) P(c_j \mid r_l) \quad \forall j \neq i \tag{15} \]

where the $p(x \mid r_l)$ distribution is represented by Gaussians similarly to (4).

2.4. Extended fuzzy classifier identification

A new fuzzy model that is able to represent the Bayes classifier defined by (15) can be obtained. The idea is to define the consequent of the fuzzy rule as the probabilities that the given rule represents the $c_1, \ldots, c_C$ classes:

\[ r_i: \ \text{If } x_1 \text{ is } A_{i,1}(x_{1,k}) \text{ and } \ldots \ x_n \text{ is } A_{i,n}(x_{n,k}) \text{ then } \hat{y}_k = c_1 \text{ with } P(c_1 \mid r_i), \ \ldots, \ \hat{y}_k = c_C \text{ with } P(c_C \mid r_i), \ [w_i] \tag{16} \]

Similarly to Takagi–Sugeno fuzzy models (Takagi and Sugeno, 1985), the rules of the fuzzy model are aggregated using the normalized fuzzy mean formula and the output of the classifier is determined by the label of the class that has the highest activation:

\[ \hat{y}_k = c_{i^*}, \quad i^* = \arg\max_{1 \le i \le C} \frac{\sum_{l=1}^{R} \beta_l(x_k) P(c_i \mid r_l)}{\sum_{l=1}^{R} \beta_l(x_k)} \tag{17} \]

where $\beta_l(x_k)$ has the meaning expressed by (7).

As the previous equation can be rewritten using only its numerator, the obtained expression is identical to the Gaussian mixture of Bayes classifiers (15) when, similarly to (11), the parameters of the fuzzy model are calculated as

\[ A_i(x) = p(x \mid r_i)\,|2\pi F_i|^{n/2}, \qquad w_i = \frac{P(r_i)}{|2\pi F_i|^{n/2}} \tag{18} \]

The main advantage of the previously presented classifier is that the fuzzy model can consist of more rules than classes and every rule can describe more than one class. Hence, as a given class will be described by a set of rules, it does not have to be a compact geometrical object (hyper-ellipsoid).

The aim of the remaining part of the paper is to propose a new clustering-based technique for the identification of the fuzzy classifier presented above. In addition, a new method for the selection of the antecedent variables (features) of the model will be described.
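The extended classifier of Eqs. (16) and (17) can be evaluated with the same rule activations as before; the only change is that every rule votes for every class with its consequent probabilities P(c_i | r_l). The sketch below reuses the illustrative naming introduced earlier and is not the authors' implementation.

```python
import numpy as np

def extended_fuzzy_predict(x, centers, variances, weights, rule_class_probs):
    """Eq. (17): activation-weighted vote of the rules over the classes.
    rule_class_probs has shape (R, C) and holds P(c_i | r_l)."""
    memberships = np.exp(-0.5 * (x - centers) ** 2 / variances)   # Eq. (9), (R, n)
    beta = weights * memberships.prod(axis=1)                     # Eq. (7), (R,)
    class_activation = beta @ rule_class_probs                    # numerator of Eq. (17), (C,)
    # dividing by beta.sum() (the denominator of Eq. (17)) does not change the arg max
    return int(np.argmax(class_activation))
```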
3. Supervised fuzzy clustering

The objective of clustering is to partition the data Z into R clusters. Each observation consists of input and output variables, grouped into a row vector $z_k = [x_k^T, y_k]$, where the subscript $k$ denotes the $k = 1, \ldots, N$th row of the Z pattern matrix. The fuzzy partition is represented by the $U = [\mu_{i,k}]_{R \times N}$ matrix, where the element $\mu_{i,k}$ represents the degree of membership of the observation $z_k$ in the cluster $i = 1, \ldots, R$.

The clustering is based on the minimization of the sum of the weighted squared distances $D^2_{i,k}$ between the data points and the cluster prototypes $\eta$ that contain the parameters of the clusters:

\[ J(Z; U, \eta) = \sum_{i=1}^{R} \sum_{k=1}^{N} (\mu_{i,k})^m D^2_{i,k}(z_k, r_i) \tag{19} \]

where m is a fuzzy weighting exponent that determines the fuzziness of the resulting clusters. As m approaches one from above, the partition becomes hard ($\mu_{i,k} \in \{0, 1\}$) and the $v_i$ are the ordinary means of the clusters. As $m \to \infty$, the partition becomes completely fuzzy ($\mu_{i,k} = 1/R$) and the cluster means are all equal to the grand mean of Z. Usually, m is chosen as $m = 2$.

Classical fuzzy clustering algorithms are used to estimate the distribution of the data. Hence, they do not utilize the class label of each data point available for the identification. Furthermore, the obtained clusters cannot be directly used to build the classifier. In the following, a new cluster prototype and the related distance measure are introduced that allow the direct supervised identification of fuzzy classifiers. As the clusters are used to obtain the parameters of the fuzzy classifier, the distance measure is defined similarly to the distance measure of the Bayes classifier (5):

\[ \frac{1}{D^2_{i,k}(z_k, r_i)} = \underbrace{P(r_i) \prod_{j=1}^{n} \exp\left( -\frac{1}{2} \frac{(x_{j,k} - v_{i,j})^2}{\sigma_{i,j}^2} \right)}_{\text{Gath–Geva clustering}} \; P(c_j = y_k \mid r_i) \tag{20} \]

This distance measure consists of two terms. The first term is based on the geometrical distance between the $v_i$ cluster centers and the $x_k$ observation vector, while the second is based on the probability that the $r_i$th cluster describes the density of the class of the $k$th data point, $P(c_j = y_k \mid r_i)$. It is interesting to note that this distance measure only slightly differs from that of the unsupervised GG clustering algorithm, which can also be interpreted in a probabilistic framework (Gath and Geva, 1989). However, the novelty of the proposed approach is the second term, which allows the use of class labels.
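The supervised distance measure (20) can be evaluated directly from the cluster parameters. The NumPy sketch below is our own illustration (not the paper's code); it assumes the class label y is encoded as an integer index.

```python
import numpy as np

def supervised_inverse_distance(x, y, centers, variances, cluster_priors, class_probs):
    """1 / D^2 of Eq. (20) between one labeled sample (x, y) and every cluster.
    class_probs[i, c] holds P(c | r_i); y is an integer class index.  A large value
    means the cluster is both geometrically close to x and likely to represent class y."""
    geometric = cluster_priors * np.exp(-0.5 * (x - centers) ** 2 / variances).prod(axis=1)
    return geometric * class_probs[:, y]
```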

To obtain a fuzzy partitioning space, the membership values have to satisfy the following conditions:

\[ U \in \mathbb{R}^{R \times N} \mid \mu_{i,k} \in [0, 1] \ \forall i, k; \quad \sum_{i=1}^{R} \mu_{i,k} = 1 \ \forall k; \quad 0 < \sum_{k=1}^{N} \mu_{i,k} < N \ \forall i \tag{21} \]

The minimization of the functional (19) represents a non-linear optimization problem that is subject to the constraints defined by (21) and can be solved by using a variety of available methods. The most popular method is alternating optimization (AO), which consists of the application of Picard iteration through the first-order conditions for the stationary points of (19). These can be found by adjoining the constraints (21) to J by means of Lagrange multipliers (Hoppner et al., 1999),

\[ J(Z; U, \eta, \lambda) = \sum_{i=1}^{R} \sum_{k=1}^{N} (\mu_{i,k})^m D^2_{i,k}(z_k, r_i) + \sum_{k=1}^{N} \lambda_k \left( \sum_{i=1}^{R} \mu_{i,k} - 1 \right) \tag{22} \]

and by setting the gradients of J with respect to U, $\eta$ and $\lambda$ to zero. Hence, similarly to the update equations of the GG clustering algorithm, the following equations result in a solution that satisfies the constraints (21).

Initialization. Given a set of data Z, specify R and choose a termination tolerance $\epsilon > 0$. Initialize the $U = [\mu_{i,k}]_{R \times N}$ partition matrix randomly, where $\mu_{i,k}$ denotes the membership that the $z_k$ data point is generated by the $i$th cluster.

Repeat for $l = 1, 2, \ldots$

Step 1. Calculate the parameters of the clusters.

• Calculate the centers and standard deviations of the Gaussian membership functions (the diagonal elements of the $F_i$ covariance matrices):

\[ v_i^{(l)} = \frac{\sum_{k=1}^{N} (\mu_{i,k}^{(l-1)})^m x_k}{\sum_{k=1}^{N} (\mu_{i,k}^{(l-1)})^m}, \qquad \sigma_{i,j}^{2,(l)} = \frac{\sum_{k=1}^{N} (\mu_{i,k}^{(l-1)})^m (x_{j,k} - v_{j,i})^2}{\sum_{k=1}^{N} (\mu_{i,k}^{(l-1)})^m} \tag{23} \]

• Estimate the consequent probability parameters,

\[ P(c_i \mid r_j) = \frac{\sum_{k \mid y_k = c_i} (\mu_{j,k}^{(l-1)})^m}{\sum_{k=1}^{N} (\mu_{j,k}^{(l-1)})^m}, \quad 1 \le i \le C, \ 1 \le j \le R \tag{24} \]

• Estimate the a priori probability of the clusters and the weights (impact) of the rules:

\[ P(r_i) = \frac{1}{N} \sum_{k=1}^{N} (\mu_{i,k}^{(l-1)})^m, \qquad w_i = P(r_i) \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi \sigma_{i,j}^2}} \tag{25} \]

Step 2. Compute the distance measure $D^2_{i,k}(z_k, r_i)$ by (20).

Step 3. Update the partition matrix

\[ \mu_{i,k}^{(l)} = \frac{1}{\sum_{j=1}^{R} \left( D_{i,k}(z_k, r_i) / D_{j,k}(z_k, r_j) \right)^{2/(m-1)}}, \quad 1 \le i \le R, \ 1 \le k \le N \tag{26} \]

until $\|U^{(l)} - U^{(l-1)}\| < \epsilon$.

The remainder of this section is concerned with the theoretical convergence properties of the proposed algorithm. Since this algorithm is a member of the family of algorithms discussed in (Hathaway and Bezdek, 1993), the following discussion is based on the results of Hathaway and Bezdek (1993). Using Lagrange multiplier theory, it is easily shown that for $D^2_{i,k}(z_k, r_i) \ge 0$, (26) defines $U^{(l+1)}$ to be a global minimizer of the restricted cost function (22). From this it follows that the proposed iterative algorithm is a special case of grouped coordinate minimization, and the general convergence theory from (Bezdek et al., 1987) can be applied for reasonable choices of $D^2_{i,k}(z_k, r_i)$ to show that any limit point of an iteration sequence will be a minimizer, or at worst a saddle point, of the cost function J. The local convergence result in (Bezdek et al., 1987) states that if the distance measures $D^2_{i,k}(z_k, r_i)$ are sufficiently smooth and a standard convexity condition holds at a minimizer of J, then any iteration sequence started with $U^{(0)}$ sufficiently close to $U^*$ will converge to that minimizer. Furthermore, the rate of convergence of the sequence will be c-linear. This means that there is a norm $\|\cdot\|$ and constants $0 < c < 1$ and $l_0 > 0$ such that for all $l \ge l_0$ the sequence of errors $\{e^{(l)}\} = \{\|U^{(l)} - U^{*}\|\}$ satisfies the inequality $e^{(l+1)} < c\, e^{(l)}$.
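The update equations (23)–(26) amount to one alternating-optimization sweep. The sketch below shows such a sweep for any weighting exponent m > 1 (with the common default m = 2); it is a compact illustration under our own naming conventions, not the authors' code.

```python
import numpy as np

def supervised_clustering_step(X, y, mu, n_classes, m=2.0):
    """One alternating-optimization sweep of the supervised clustering:
    parameter updates of Eqs. (23)-(25), distances of Eq. (20), and the
    partition update of Eq. (26).  X: (N, n), y: (N,) int labels, mu: (R, N)."""
    R, N = mu.shape
    w = mu ** m                                            # weighted memberships, (R, N)
    sw = w.sum(axis=1, keepdims=True)                      # (R, 1)
    centers = (w @ X) / sw                                 # Eq. (23), (R, n)
    variances = np.stack([(w[i] @ (X - centers[i]) ** 2) / sw[i, 0]
                          for i in range(R)])              # Eq. (23), (R, n)
    class_probs = np.stack([w[:, y == c].sum(axis=1) for c in range(n_classes)],
                           axis=1) / sw                    # Eq. (24), (R, C)
    priors = w.sum(axis=1) / N                             # Eq. (25), (R,)

    # Eq. (20): inverse squared distance of every sample to every cluster
    diff2 = (X[None, :, :] - centers[:, None, :]) ** 2 / variances[:, None, :]
    inv_d2 = priors[:, None] * np.exp(-0.5 * diff2).prod(axis=2) * class_probs[:, y]
    d2 = 1.0 / np.maximum(inv_d2, 1e-300)                  # guard against underflow

    # Eq. (26): fuzzy-c-means-type partition update
    ratio = (d2[:, None, :] / d2[None, :, :]) ** (1.0 / (m - 1.0))
    mu_new = 1.0 / ratio.sum(axis=1)                       # (R, N)
    return mu_new, centers, variances, class_probs, priors
```

Iterating this step until the change in the partition matrix drops below the tolerance reproduces the loop described above; the returned centers, variances, class probabilities and priors are exactly the quantities needed to parameterize the rule base of Section 2.4 through Eqs. (24) and (25).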
4. Feature selection based on interclass separability

Using too many input variables may result in difficulties in the interpretability of the obtained classifier. Hence, selection of the relevant features is usually necessary. Others have focused on reducing the antecedent variables by similarity analysis of the fuzzy sets (Roubos and Setnes, 2000); however, this method is not very suitable for feature selection. In this section the Fisher interclass separability method (Cios et al., 1998), which is based on statistical properties of the data, is modified. The interclass separability criterion is based on the between-class covariance matrix $F_B$ and the within-class covariance matrix $F_W$, which sum up to the total covariance of the data, $F_T = F_W + F_B$, where

\[ F_W = \sum_{l=1}^{R} P(r_l) F_l, \qquad F_B = \sum_{l=1}^{R} P(r_l) (v_l - v_0)(v_l - v_0)^T, \qquad v_0 = \sum_{l=1}^{R} P(r_l) v_l \tag{27} \]

The interclass separability feature selection criterion is a trade-off between $F_W$ and $F_B$:

\[ J = \frac{\det F_B}{\det F_W} \tag{28} \]

The importance of a feature is measured by leaving out the feature of interest and calculating J for the reduced covariance matrices. The feature selection is a step-wise procedure in which, in every step, the least needed feature is deleted from the model.

In the current implementation of the algorithm, after fuzzy clustering and initial model formulation a given number of features is selected by continuously checking the decrease of the performance of the classifier. To increase the classification performance, the final classifier is identified based on the re-clustering of the reduced data, which have smaller dimensionality because of the neglected input variables.
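The criterion (27)–(28) and the step-wise elimination described above can be sketched as follows. Diagonal cluster covariances are assumed, matching the clustering of Section 3, and all names are illustrative.

```python
import numpy as np

def separability(centers, variances, priors):
    """Interclass separability J of Eq. (28) from the cluster parameters of Eq. (27).
    centers, variances: (R, n); priors: (R,).  Diagonal F_l is assumed."""
    v0 = priors @ centers                                       # weighted mean of the centers
    f_w = sum(p * np.diag(var) for p, var in zip(priors, variances))
    f_b = sum(p * np.outer(v - v0, v - v0) for p, v in zip(priors, centers))
    return np.linalg.det(f_b) / np.linalg.det(f_w)

def least_needed_feature(centers, variances, priors):
    """Step-wise selection: drop the feature whose removal leaves the largest J,
    i.e. the feature the criterion needs the least."""
    n = centers.shape[1]
    scores = [separability(centers[:, [d for d in range(n) if d != j]],
                           variances[:, [d for d in range(n) if d != j]],
                           priors)
              for j in range(n)]
    return int(np.argmax(scores))
```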

5. Performance evaluation

In order to examine the performance of the proposed identification method, two well-known multidimensional classification benchmark problems are presented in this section. The studied Wisconsin breast cancer and wine data come from the UCI Repository of Machine Learning Databases (http://www.ics.uci.edu).

The performance of the obtained classifiers was measured by 10-fold cross validation. The data are divided into ten sub-sets of cases that have similar size and class distributions. Each sub-set is left out once, while the other nine are applied for the construction of the classifier, which is subsequently validated on the unseen cases in the left-out sub-set.

5.1. Example 1: the Wisconsin breast cancer classification problem

The Wisconsin breast cancer data is widely used to test the effectiveness of classification and rule extraction algorithms. The aim of the classification is to distinguish between benign and malignant cancers based on the available nine measurements: x1 clump thickness, x2 uniformity of cell size, x3 uniformity of cell shape, x4 marginal adhesion, x5 single epithelial cell size, x6 bare nuclei, x7 bland chromatin, x8 normal nucleoli, and x9 mitosis (data shown in Fig. 1). The attributes have integer values in the range [1, 10] (Baraldi and Blonda, 1999; Hoppner et al., 1999). The original database contains 699 instances, 16 of which are omitted because they are incomplete, which is common practice in other studies. The class distribution is 65.5% benign and 34.5% malignant, respectively.

The advanced version of C4.5 gives a misclassification rate of 5.26% on 10-fold cross validation (94.74% correct classification) with tree size 25 ± 0.5 (Quinlan, 1996). Nauck and Kruse (1999) combined neuro-fuzzy techniques with interactive strategies for rule pruning to obtain a fuzzy classifier. An initial rule base was made by applying two sets for each input, resulting in 2^9 = 512 rules, which was reduced to 135 by deleting the non-firing rules. A heuristic data-driven learning method was applied instead of gradient descent learning, which is not applicable for triangular membership functions. Semantic properties were taken into account by constraining the search space. Their final fuzzy classifier could be reduced to two rules with 5–6 features only, with a misclassification of 4.94% on 10-fold validation (95.06% classification accuracy). Rule-generating methods that combine GA and fuzzy logic were also applied to this problem (Peña-Reyes and Sipper, 2000). In this method the number of rules to be generated needs to be determined a priori. This method constructs a fuzzy model that has four membership functions and one rule with an additional else part. Setiono (2000) has generated a similarly compact classifier by a two-step rule extraction from a feedforward neural network trained on preprocessed data.

Fig. 1. Wisconsin breast cancer data: two classes and nine attributes (class 1: 1–445, class 2: 446–683).

As Table 1 shows, our fuzzy rule-based classifier is one of the most compact models in the literature with such high accuracy.

In the current implementation of the algorithm, after fuzzy clustering an initial fuzzy model is generated that utilizes all nine information profile data about the patient. A step-wise feature reduction algorithm has been used in which, in every step, one feature has been removed while continuously checking the decrease of the performance of the classifier on the training data. To increase the classification performance, the classifier is re-identified in every step by re-clustering the reduced data, which have smaller dimensionality because of the neglected input variables. As Table 2 shows, our supervised clustering approach gives better results than utilizing the GG clustering algorithm in the same identification scheme.

The 10-fold validation experiment with the proposed approach showed 95.57% average classification accuracy, with 90.00% as the worst and 98.57% as the best performance. This is really good for such a small classifier as compared with previously reported results (Table 1). As the error estimates are either obtained from 10-fold cross validation or from testing the solution once by using 50% of the data as a training set, the results given in Table 1 are only roughly comparable.

5.2. Example 2: the wine classification problem

The wine data contains the chemical analysis of 178 wines grown in the same region in Italy but derived from three different cultivars. The problem is to distinguish the three different types based on 13 continuous attributes derived from chemical analysis (Fig. 2). Corcoran and Sen (1994) applied all 178 samples for learning 60 non-fuzzy if–then rules in a real-coded genetic-based machine learning approach. They used a population of 1500 individuals and applied 300 generations, with full replacement, to come up with the following result for 10 independent trials: best classification rate 100%, average classification rate 99.5% and worst classification rate 98.3%, which is three misclassifications. Ishibuchi et al. (1999) applied all 178 samples to design a fuzzy classifier with 60 fuzzy rules by means of an integer-coded genetic algorithm and grid partitioning. Their population contained 100 individuals and they applied 1000 generations, with full replacement, to come up with the following result for 10 independent trials: best classification rate 99.4% (one misclassification), average classification rate 98.5% and worst classification rate 97.8% (four misclassifications).

Table 1
Classification rates and model complexity for classifiers constructed for the Wisconsin breast cancer problem

Author                          Method         # Rules   # Conditions   Accuracy (%)
Setiono (2000)                  NeuroRule 1f   4         4              97.36
Setiono (2000)                  NeuroRule 2a   3         11             98.1
Peña-Reyes and Sipper (2000)    Fuzzy-GA1      1         4              97.07
Peña-Reyes and Sipper (2000)    Fuzzy-GA2      3         16             97.36
Nauck and Kruse (1999)          NEFCLASS       2         10–12          95.06*

* Denotes results from averaging a 10-fold validation.

Table 2
Classification rates and model complexity for classifiers constructed for the Wisconsin breast cancer problem

Method       Min Acc.   Mean Acc.   Max Acc.   Min # Feat.   Mean # Feat.   Max # Feat.
GG: R = 2    84.28      90.99       95.71      8             8.9            9
Sup: R = 2   84.28      92.56       98.57      7             8              9
GG: R = 4    88.57      95.14       98.57      9             9              9
Sup: R = 4   90.00      95.57       98.57      8             8.7            9

Results from a 10-fold validation. GG: Gath–Geva clustering based classifier, Sup: proposed method.


Fig. 2. Wine data: three classes and 13 attributes (alcohol, malic acid, ash, alcalinity of ash, magnesium, total phenols, flavonoids, nonflavonoid phenols, proanthocyanins, color intensity, hue, OD280/OD315, proline).

In both approaches the final rule base contains 60 rules. The main difference is the number of model evaluations that was necessary to come to the final result.

Firstly, for comparison purposes, a fuzzy classifier that utilizes all 13 information profile data about the wine has been identified by the proposed clustering algorithm based on all 178 samples. Fuzzy models with three and four rules were identified. The three-rule model gave only two misclassifications (correct percentage 98.9%). When a cluster was added to improve the performance of this model, the obtained classifier gave only one misclassification (99.4%). The classification power of the identified models is then compared with fuzzy models with the same number of rules obtained by GG clustering, as GG clustering can be considered the unsupervised version of the proposed clustering algorithm. The GG-identified fuzzy model achieves eight misclassifications, corresponding to a correct percentage of 95.5%, when three rules are used in the fuzzy model, and six misclassifications (correct percentage 96.6%) in the case of four rules. The results are summarized in Table 3. As is shown, the performance of the obtained classifiers is comparable to those in (Corcoran and Sen, 1994; Ishibuchi et al., 1999), but they use far fewer rules (3–5 compared to 60) and fewer features.

Table 3
Classification rates on the wine data for 10 independent runs

Method                     Best result (%)   Average result (%)   Worst result (%)   Rules   Model evaluations
Corcoran and Sen (1994)    100               99.5                 98.3               60      150 000
Ishibuchi et al. (1999)    99.4              98.5                 97.8               60      6000
GG clustering              95.5              95.5                 95.5               3       1
Sup (13 features)          98.9              98.9                 98.9               3       1
Sup (13 features)          99.4              99.4                 99.4               4       1

Table 4
Classification rates and model complexity for classifiers constructed for the wine classification problem

Method       Min Acc.   Mean Acc.   Max Acc.   Min # Feat.   Mean # Feat.   Max # Feat.
GG: R = 3    83.33      94.38       100        10            12.4           13
Sup: R = 3   88.88      97.77       100        12            12.6           13
GG: R = 3    88.23      95.49       100        4             4.8            5
Sup: R = 3   76.47      94.87       100        4             4.8            5
GG: R = 6    82.35      94.34       100        4             4.9            5
Sup: R = 6   88.23      97.15       100        4             4.8            5

Results from averaging a 10-fold validation.

These results indicate that the proposed clustering method effectively utilizes the class labels. As can be seen from Table 3, because of the simplicity of the proposed clustering algorithm, the proposed approach is attractive in comparison with other iterative and optimization schemes that involve extensive intermediate optimization to generate fuzzy classifiers.

The 10-fold validation is a rigorous test of the classifier identification algorithms. These experiments showed 97.77% average classification accuracy, with 88.88% as the worst and 100% as the best performance (Table 4). The above presented automatic model reduction technique removed only one feature without a decrease of the classification performance on the training data. Hence, to avoid possible local minima, the feature selection algorithm is used to select only five features, and the proposed scheme has been applied again to identify a model based on the selected five attributes. This compact model, with an average of 4.8 features, showed 97.15% average classification accuracy, with 88.23% as the worst and 100% as the best performance. The resulting membership functions and the selected features are shown in Fig. 3.


Fig. 3. Membership functions obtained by fuzzy clustering for the five selected features (alcohol, magnesium, flavonoids, hue, proline).

Comparing the fuzzy sets in Fig. 3 with the data in Fig. 2 shows that the obtained rules are highly interpretable. For example, the flavonoids are divided into low, medium and high, which is clearly visible in the data.

6. Conclusions

In this paper a new fuzzy classifier has been presented to represent Bayes classifiers defined by a mixture of Gaussians density model. The novelty of this new model is that each rule can represent more than one class with different probabilities. For the identification of the fuzzy classifier a supervised clustering method has been worked out that is a modification of the unsupervised GG clustering algorithm. In addition, a method for the selection of the relevant input variables has been presented. The proposed identification approach is demonstrated on the Wisconsin breast cancer and the wine benchmark classification problems. The comparison to GG clustering and GA-tuned fuzzy classifiers indicates that the proposed supervised clustering method effectively utilizes the class labels and is able to identify compact and accurate fuzzy systems.

Acknowledgements

This work was supported by the Hungarian Ministry of Education (FKFP-0073/2001) and the Hungarian Science Foundation (OTKA TO37600). Part of the work was elaborated while J. Abonyi was at the Control Laboratory of Delft University of Technology. J. Abonyi is grateful for the Janos Bolyai Fellowship of the Hungarian Academy of Sciences.

References

Baraldi, A., Blonda, P., 1999. A survey of fuzzy clustering algorithms for pattern recognition––Part I. IEEE Trans. Systems Man Cybernet. Part B 29 (6), 778–785.
Bezdek, J.C., Hathaway, R.J., Howard, R.E., Wilson, C.A., Windham, M.P., 1987. Local convergence analysis of a grouped variable version of coordinate descent. J. Optimization Theory Appl. 71, 471–477.
Biem, A., Katagiri, S., McDermott, E., Juang, B.H., 2001. An application of discriminative feature extraction to filter-bank-based speech recognition. IEEE Trans. Speech Audio Process. 9 (2), 96–110.
Campos, T.E., Bloch, I., Cesar Jr., R.M., 2001. Feature selection based on fuzzy distances between clusters: First results on simulated data. In: ICAPR'2001––International Conference on Advances in Pattern Recognition, Rio de Janeiro, Brazil, May. Lecture Notes in Computer Science. Springer-Verlag, Berlin.
Cios, K.J., Pedrycz, W., Swiniarski, R.W., 1998. Data Mining Methods for Knowledge Discovery. Kluwer Academic Press, Boston.
Corcoran, A.L., Sen, S., 1994. Using real-valued genetic algorithms to evolve rule sets for classification. In: IEEE-CEC, June 27–29, Orlando, USA, pp. 120–124.
Gath, I., Geva, A.B., 1989. Unsupervised optimal fuzzy clustering. IEEE Trans. Pattern Anal. Machine Intell. 7, 773–781.
Gustafson, D.E., Kessel, W.C., 1979. Fuzzy clustering with a fuzzy covariance matrix. In: Proceedings of IEEE CDC, San Diego, USA.
Hathaway, R.J., Bezdek, J.C., 1993. Switching regression models and fuzzy clustering. IEEE Trans. Fuzzy Systems 1, 195–204.
Hoppner, F., Klawonn, F., Kruse, R., Runkler, T., 1999. Fuzzy Cluster Analysis––Methods for Classification, Data Analysis and Image Recognition. John Wiley and Sons, New York.
Ishibuchi, H., Nakashima, T., Murata, T., 1999. Performance evaluation of fuzzy classifier systems for multidimensional pattern classification problems. IEEE Trans. SMC B 29, 601–618.
Kambhatala, N., 1996. Local models and Gaussian mixture models for statistical data processing. Ph.D. Thesis, Oregon Graduate Institute of Science and Technology.
Kim, E., Park, M., Kim, S., Park, M., 1998. A transformed input–domain approach to fuzzy modeling. IEEE Trans. Fuzzy Systems 6, 596–604.
Loog, L.C.M., Duin, R.P.W., Haeb-Umbach, R., 2001. Multiclass linear dimension reduction by weighted pairwise Fisher criteria. IEEE Trans. PAMI 23 (7), 762–766.
Nauck, D., Kruse, R., 1999. Obtaining interpretable fuzzy classification rules from medical data. Artificial Intell. Med. 16, 149–169.
Peña-Reyes, C.A., Sipper, M., 2000. A fuzzy genetic approach to breast cancer diagnosis. Artificial Intell. Med. 17, 131–155.
Quinlan, J.R., 1996. Improved use of continuous attributes in C4.5. J. Artificial Intell. Res. 4, 77–90.
Rahman, A.F.R., Fairhurst, M.C., 1997. Multi-prototype classification: improved modelling of the variability of handwritten data using statistical clustering algorithms. Electron. Lett. 33 (14), 1208–1209.
Roubos, J.A., Setnes, M., 2000. Compact fuzzy models through complexity reduction and evolutionary optimization. In: FUZZ-IEEE, May 7–10, San Antonio, USA, pp. 762–767.
Roubos, J.A., Setnes, M., Abonyi, J., 2001. Learning fuzzy classification rules from data. In: John, R., Birkenhead, R. (Eds.), Developments in Soft Computing. Springer-Verlag, Berlin/Heidelberg, pp. 108–115.
Setiono, R., 2000. Generating concise and accurate classification rules for breast cancer diagnosis. Artificial Intell. Med. 18, 205–219.
Setnes, M., Babuška, R., 1999. Fuzzy relational classifier trained by fuzzy clustering. IEEE Trans. SMC B 29, 619–625.
Setnes, M., Babuška, R., Kaymak, U., van Nauta Lemke, H.R., 1998. Similarity measures in fuzzy rule base simplification. IEEE Trans. SMC B 28, 376–386.
Takagi, T., Sugeno, M., 1985. Fuzzy identification of systems and its application to modeling and control. IEEE Trans. SMC 15, 116–132.
Valente de Oliveira, J., 1999. Semantic constraints for membership function optimization. IEEE Trans. FS 19, 128–138.