Learning of Decision Fusion Mappings for Pattern Recognition

Friedhelm Schwenker, Christian R. Dietrich, Christian Thiel, Günther Palm
Department of Neural Information Processing, Universität Ulm, 89069 Ulm, Germany
fi[email protected]

Abstract

Different learning algorithms for the decision fusion mapping of a multiple classifier system are compared in this paper. It is well known that the confusion matrices of the individual classifiers are utilised in the naive Bayes combination of classifier outputs. After a brief review of the decision template, the linear associative memory and the pseudoinverse approaches, it is demonstrated that all four adaptive decision fusion mappings share the confusion matrices as the essential ingredient.

1 Introduction

During the last years, a community has emerged that is interested in the fusion of classifiers in the context of so-called multiple classifier systems (MCS). One goal is to design classifier systems with a high classification accuracy ([7]...[13], [11]). The reason behind this new approach is that in traditional pattern recognition the best individual classifier has to be found, which is difficult to identify. Also, a single classifier cannot make use of information from other discrimination functions.

Typical fusion strategies for the combination of classifiers in MCS are based on fixed fusion mappings, for instance averaging or multiplying the classifier outputs [1, 6]. In this paper we consider supervised learning methods to train the fusion mapping of the MCS. Whereas the MCS architecture (see Figure 1) is very similar to layered artificial neural networks, like radial basis function networks (RBFs) or multilayer perceptrons (MLPs), the training is rather different. For most artificial neural network architectures, e.g. MLPs, the parameters are typically trained simultaneously through a backpropagation-like training algorithm in a one-phase learning procedure. An MCS, on the other hand, is typically trained in two completely different training phases:

1. Building the Classifier Layer, consisting of a set of classifiers where each classifier is trained separately, e.g. each classifier is trained on a specific feature subset.

2. Training of the Fusion Layer, performing a mapping of the classifier outputs (soft or crisp decisions) into the set of desired class labels.

This two-phase MCS training is very similar to RBF learning procedures, where the network is learned over two or even three learning phases:

1. Training of the RBF kernel parameters (centres and widths) of the first layer through clustering or vector quantization.

2. Supervised learning of the output layer (second layer) of the network.

3. Training of both layers of the RBF network through a backpropagation-like optimisation procedure.

Typically, for this three-phase learning scheme a labeled training set is used to calculate the radial basis function centres, the radial basis function widths and the weights of the output layer, but it should be noticed that other learning schemes may be applied. For instance, if a large set of unlabelled data points is available, this data can be used to train the parameters of the first RBF layer (centres and widths) through unsupervised clustering [17].

For the training of the two-layer fusion architecture, two labeled data sets may be used. It is assumed that I classifier decisions, here calculated on I different features, have to be combined. First, the classifiers C^i, i = 1, ..., I, are trained by utilising a training set, and then another labeled training set R, which is not necessarily completely disjoint from the first, is sent to the previously trained first level classifiers to calculate the individual classifier outputs. These classifier outputs C^i(x_i), i = 1, ..., I, together with the desired class labels are then used to train the decision fusion mapping F. To train this fusion mapping, different algorithms have been proposed.

In this paper decision templates [9, 10], naive Bayes decision fusion [19], linear matrix memory [4], and pseudoinverse matrix [4] methods are studied and links between these fusion mappings are shown. The mappings are introduced in Section 2, while Section 3 presents experimental results on their performance. The closing Section explores the close links between the fusion mappings.

2 Adaptive fusion mappings

In this Section we briefly describe methods to train adaptive fusion mappings. It is assumed that a set of first level classifiers C^1, ..., C^I has been trained through a supervised learning procedure using a labeled training set. This corresponds to phase 1 of the two-phase MCS learning scheme.

To fix the notation, let Ω = {1, ..., L} be a set of class labels and C^i(x_i^μ) ∈ Δ be the probabilistic classifier output of the i-th classifier C^i given the input feature vector x_i^μ ∈ R, with Δ defined as

\Delta := \{ (y_1, \dots, y_L) \in [0,1]^L \mid \sum_{l=1}^{L} y_l = 1 \}.   (1)

Then for each classifier the outputs for the training set R are given by an (L × M)-matrix C_i, i = 1, ..., I, where M = |R| and the μ-th column of C_i contains the classifier output C^i(x_i^μ)^T. Here the superscript T denotes transposition. The desired class labels ω^μ ∈ Ω for inputs x_i^μ ∈ R are given by the (L × M)-matrix Y defined by the 1-of-L encoding scheme for class labels

Y_{l,\mu} = \begin{cases} 1, & l = \omega^\mu \\ 0, & \text{otherwise.} \end{cases}   (2)

That is, corresponding to C_i the μ-th column Y_{·,μ} ∈ Δ contains the binary coded target output of feature vector x_i^μ ∈ R.

In the following we discuss different approaches to combine the classifier outputs C^i(x_i), i = 1, ..., I, into an overall classifier decision

z := F(C^1(x_1), \dots, C^I(x_I)).   (3)

The four different learning schemes, namely linear associative memory, decision template, pseudoinverse matrix and naive Bayes, are introduced to calculate the decision fusion mapping F : Δ^I → Δ (see Eq. 3). In all these methods the fusion mapping F is realised through (L × L)-matrices V^1, ..., V^I, calculated through a certain training algorithm. The overall classifier system is depicted in Figure 1.
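Before turning to the individual schemes, the following sketch (not from the paper; a minimal NumPy illustration with hypothetical helper names) shows how the (L × M)-matrices C_i and Y of Eqs. 1-2 can be assembled from stored soft classifier outputs and the 1-of-L encoded labels of the second training set R.

```python
import numpy as np

def one_of_L(labels, L):
    """Build the (L x M) target matrix Y of Eq. 2: Y[l, mu] = 1 iff sample mu has class l."""
    M = len(labels)
    Y = np.zeros((L, M))
    Y[labels, np.arange(M)] = 1.0            # labels are 0-based class indices here
    return Y

def stack_outputs(soft_outputs):
    """Build the (L x M) output matrix C_i of one classifier.

    soft_outputs: one length-L probability vector (element of Delta) per sample in R;
    the mu-th column of C_i is the output C^i(x_i^mu)."""
    return np.column_stack([np.asarray(p, dtype=float) for p in soft_outputs])

# Hypothetical toy example: L = 3 classes, M = 4 samples in R.
labels = np.array([0, 2, 1, 2])
Y = one_of_L(labels, L=3)
C_i = stack_outputs([[0.7, 0.2, 0.1],
                     [0.1, 0.3, 0.6],
                     [0.2, 0.5, 0.3],
                     [0.0, 0.4, 0.6]])
assert Y.shape == C_i.shape == (3, 4)
```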

2.1 Linear Associative Memory

A linear decision fusion mapping F may be realised through an associative matrix memory whose error-correcting properties have been shown in several numerical experiments and theoretical investigations [8]. In order to calculate the memory matrix V^i for each classifier C^i, the stored classifier outputs C_i are adapted through a Hebbian learning rule [15], and V^i is given as the product of the classifier outputs C_i and the desired classifier outputs Y:

V^i := \underbrace{Y C_i^T}_{W^i}.   (4)

In the case of crisp classifiers the matrix V^i is equal to the confusion matrix of classifier C^i, where V^i_{ω,ω*} is equal to the number of samples of class ω in the training set which were assigned by C^i to class ω* [19]. For soft classifiers the ω-th row of V^i contains the accumulated soft classifier decisions of C^i for the feature vectors x_i^μ ∈ R^ω. (Footnote 1: Independent of the classifier type (soft or crisp) we consider W^i a confusion matrix. It is defined to compare the individual fusion schemes, see Eqs. 4, 11, 12 and 14.)

In the classification phase these matrices are then used to combine the individual classifier decisions and calculate the overall classifier decision (see Eq. 3). For a feature vector X = (x_1, ..., x_I), the classifier outputs C^i(x_i) are applied to the matrices V^i, i = 1, ..., I, and the outputs z^i ∈ IR^L are given by

z^i := V^i (C^i(x_i))^T.   (5)

The combined class membership estimate based on I feature vectors is then calculated as the average of the individual outputs of the second level classifiers

z := \sum_{i=1}^{I} z^i = \sum_{i=1}^{I} V^i (C^i(x_i))^T   (6)

(the constant factor 1/I is omitted, as it does not affect the subsequent decision), and the final class label based on the combined class membership estimate z is then determined by the maximum membership rule

\omega := \operatorname{argmax}_{l \in \Omega}(z_l).   (7)
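A minimal NumPy sketch of Eqs. 4-7 (illustrative, not the authors' code): the memory matrices are the Hebbian products Y C_i^T, and a new sample is classified by summing the matrix-vector products over all feature spaces and taking the maximum.

```python
import numpy as np

def train_linear_memory(Y, C_list):
    """Eq. 4: one (L x L) memory matrix V^i = Y @ C_i.T per classifier / feature space."""
    return [Y @ C_i.T for C_i in C_list]

def fuse_linear(V_list, outputs):
    """Eqs. 5-7: sum the per-classifier decisions z^i = V^i @ C^i(x_i) and take the argmax."""
    z = sum(V @ np.asarray(c, dtype=float) for V, c in zip(V_list, outputs))
    return np.argmax(z), z        # 0-based winning class label and membership estimate z

# Usage with the hypothetical Y and C_i from the previous sketch (I = 1 feature space):
# V_list = train_linear_memory(Y, [C_i])
# label, z = fuse_linear(V_list, [[0.6, 0.3, 0.1]])
```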


Figure 1. Two-layer MCS architecture consisting of a classifier layer and an additional fusion layer. In general the combination of the classifier outputs C^i(x_i), i = 1, ..., I, is accomplished through a fusion mapping F(C^1(x_1), ..., C^I(x_I)). In this paper we restrict ourselves to separable linear fusion mappings, where the classifier outputs C^i(x_i) are multiplied with matrices V^i, i = 1, ..., I. The resulting decisions z^1, ..., z^I are combined with decision fusion. To produce independent answers, the classifiers in this example work on different features.

2.2 Decision Templates

The concept of decision templates is a simple, intuitive, and robust aggregation idea that evolved from the fuzzy template introduced by Kuncheva, see [10, 11]. Decision templates are calculated as the mean of the classifier outputs for inputs x_i^μ of class ω ∈ Ω

T_i^\omega := \frac{1}{|R^\omega|} \sum_{x_i^\mu \in R^\omega} C^i(x_i^\mu).   (8)

In the case of I input features, for each class the decision template T^ω is given by the (I × L)-matrix

T^\omega := \begin{pmatrix} T_1^\omega \\ \vdots \\ T_I^\omega \end{pmatrix} \in \Delta^I.   (9)

In order to align the decision template combining scheme in such a way that the combination is done as proposed in Figure 1, for each feature space a linear matrix operator V^i is calculated. Let T_i^ω ∈ Δ be the decision template of the i-th feature space and target class ω as defined in Eq. 8. Then for each classifier i = 1, ..., I, an (L × L)-decision template V^i is given by the decision templates of the i-th feature space

V^i := \begin{pmatrix} T_i^1 \\ \vdots \\ T_i^L \end{pmatrix} \in \Delta^L.   (10)

It can be computed similarly to the associative memory matrix (see Eq. 4) by using the stored classifier outputs and the desired classifier outputs

V^i := (Y Y^T)^{-1} \underbrace{(Y C_i^T)}_{W^i}.   (11)

Multiplying the confusion matrix W^i with (Y Y^T)^{-1} from the left is equivalent to the row-wise normalisation of W^i with the number of training patterns per class.

Assuming the normalised correlation as similarity measure [2, 9, 10], the combination of the classifier outputs with decision templates is equivalent to the combination of classifier outputs with the linear associative memory. Thus, for a set of input vectors X = (x_1, ..., x_I) and a set of classifier outputs C^i(x_i), i = 1, ..., I, the combined class membership estimate is given by Eq. 6.

2.3 Pseudoinverse Solution

Another linear decision fusion mapping F can be calculated as an optimal least squares solution between the outputs of the second layer V^i C_i and the desired classifier outputs Y. Such a mapping is given by the pseudoinverse solution, which defines another type of linear associative memory [4]. The linear matrix operator V^i based on the pseudoinverse solution (also called generalised inverse matrix) is given by

V^i := Y \lim_{\alpha \to 0^+} C_i^T (C_i C_i^T + \alpha I)^{-1} = Y C_i^+   (12)

with C_i^+ being the pseudoinverse of C_i and I for once denoting the identity matrix. Provided the inverse matrix of C_i C_i^T exists, which is always the case for full rank matrices C_i, the pseudoinverse solution is given through

V^i = Y C_i^T (C_i C_i^T)^{-1} = \underbrace{(Y C_i^T)}_{W^i} (C_i C_i^T)^{-1}.   (13)

As with the linear associative memory and the decision template, the matrix operators V^1, ..., V^I based on the pseudoinverse solution are used for a linear combination of the classifier outputs. Therefore, the combined class membership is given by Eq. 6 as well.
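As a sketch (again NumPy, with hypothetical helper names, not the authors' implementation): Eq. 11 row-normalises the confusion matrix W^i by the per-class sample counts, and Eqs. 12-13 correspond to NumPy's Moore-Penrose pseudoinverse; both matrices are then applied exactly as in the associative memory fusion above.

```python
import numpy as np

def train_decision_templates(Y, C_i):
    """Eq. 11: V^i = (Y Y^T)^-1 (Y C_i^T), i.e. the confusion matrix W^i = Y C_i^T
    with each row divided by the number of training patterns of that class."""
    W = Y @ C_i.T                         # confusion matrix W^i
    counts = Y.sum(axis=1)                # diagonal of Y Y^T: patterns per class
    return W / counts[:, None]            # row-wise normalisation

def train_pseudoinverse(Y, C_i):
    """Eqs. 12-13: V^i = Y C_i^+, the least squares solution of V^i C_i ~ Y."""
    return Y @ np.linalg.pinv(C_i)

# Both matrices are applied as in Eq. 6, e.g. with the fuse_linear() sketch above.
```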
2.4 Naive Bayes Decision Fusion

In the naive Bayes fusion scheme the classifiers C^1, ..., C^I are assumed to be crisp and mutually independent [9]; therefore in [19] it is called Bayes combination. In the case of I feature spaces, for each feature space an (L × L)-matrix operator is calculated based on the stored classifier outputs C_i and the desired classifier outputs Y:

V^i := \underbrace{(Y C_i^T)}_{W^i} (C_i C_i^T)^{-1}.   (14)

In contrast to decision templates (see Eq. 11), the confusion matrices W^i, i = 1, ..., I, are normalised column-wise, and V^i is called label matrix. Due to the fact that a classifier C^i(x_i) is typically error-bearing, the individual confusions describe the uncertainty of C^i when classifying x_i. Therefore, the (m, l)-th entry of V^i is an estimate of the conditional probability

\hat{P}(\omega = m \mid C^i(x_i) = l),   (15)

that is, the probability that the true class label is m given that C^i(x_i) assigns the crisp class label l ∈ Ω.

After learning, the label matrices V^1, ..., V^I are used to calculate the final decision of the classifier ensemble. The classification works as follows: Let C^1(x_1) = ω_1, ..., C^I(x_I) = ω_I be the class decisions of the individual classifiers for a new sample. Then, by the independence assumption [5], the estimate z_m of the probability that the true class label is m is calculated as [19]

z_m := \alpha \prod_{i=1}^{I} \hat{P}(\omega = m \mid C^i(x_i) = \omega_i), \quad m = 1, \dots, L   (16)

where α is a normalising constant that ensures \sum_{m=1}^{L} z_m = 1. This type of Bayesian combination and various methods for uncertainty reasoning have been studied extensively in [14].
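A sketch of Eqs. 14-16 for crisp classifiers (illustrative only): the label matrix is the column-normalised confusion matrix, and a new sample is classified by multiplying, over all feature spaces, the columns selected by the crisp decisions.

```python
import numpy as np

def train_naive_bayes(Y, C_crisp_list):
    """Eq. 14 for crisp classifiers: column-wise normalised confusion matrices,
    so V^i[m, l] estimates P(true class = m | classifier i says l), cf. Eq. 15."""
    V_list = []
    for C_i in C_crisp_list:                 # C_i holds one-hot (crisp) outputs, shape (L, M)
        W = Y @ C_i.T                        # confusion matrix W^i
        col_sums = W.sum(axis=0)             # number of samples assigned to each label l
        V_list.append(W / np.where(col_sums > 0, col_sums, 1.0))
    return V_list

def fuse_naive_bayes(V_list, crisp_labels):
    """Eq. 16: element-wise product of the selected columns, renormalised to sum to one."""
    z = np.ones(V_list[0].shape[0])
    for V, l in zip(V_list, crisp_labels):   # l is the 0-based crisp decision of classifier i
        z = z * V[:, l]
    if z.sum() > 0:
        z = z / z.sum()                      # the normalising constant alpha
    return np.argmax(z), z

# Usage: label, z = fuse_naive_bayes(V_list, [0, 2, 1])   # crisp decisions of I = 3 classifiers
```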

3 Experiments

All four schemes have been tested using two benchmark sets, namely our photos of fruits and Coil20. The fruits data set comes from a robotic environment, consisting of 840 pictures from 7 different classes (apples, oranges, plums, ... [3]). Four features were extracted (3 histograms left/middle/right using a Sobel edge detector, mean colour values from the HSV representation) and then classified separately with a Gaussian RBF network which had been initialised using the K-Means algorithm (for details see [18]). Coil20 is the well-known data set from the Columbia Object Image Library [16], consisting of 1440 greyscale pictures from 20 different classes, from which 4 features were extracted (concatenation of 7 invariant Hu moments [12], orientation histograms utilising Sobel or Canny edge detectors respectively on the grey scale image as well as on the black/white channel of the opponent colour system) and then classified separately with a k-nearest-neighbour classifier. The accuracy after combining the answers from the different classifiers using one of the four fusion schemes was then measured using 5x5 cross validation with the fruits and 5x6 cross validation with the Coil20 data set. Please note that no effort has been made to achieve high accuracy or to select good features; the emphasis was on comparing the fusion architectures.

                         Fruits    Coil20
Assoc. linear memory      94.93     98.87
Decision templates        95.71     99.06
Pseudoinverse             95.81     99.28
Naive Bayes               89.17     78.14

Table 1. Accuracy of the four fusion schemes on two different data sets in %.

As can be seen in Table 1, the naive Bayes scheme falls well behind for the two data sets tested. This could not be attributed to it working only on crisp answers, as the accuracy of the other fusion schemes did not change much in experiments when also fed only with crisp classifier outputs. The associative linear memory and decision template fusion schemes do yield different results because we did not use the normalised correlation as similarity measure for the latter (as shown in Table 2) but the S1 measure introduced in [11]. It is striking that three of the schemes achieve very similar results, demonstrating that despite coming from different research directions they are quite alike, in fact all using the confusion matrix W^i.

4 Discussion

Finally we want to explore the close ties between the four combining schemes. All are realised using (L × L)-matrices V^1, ..., V^I, obtained using a certain training algorithm (see Table 2 for an overview). These matrices are based on the classifier outputs C_1, ..., C_I of the corresponding first level classifiers C^1, ..., C^I and the desired output Y.

The confusion matrix W^i = Y C_i^T (see Eqs. 4, 11, 12 and 14) is the main ingredient in all combination schemes, with W^i describing the uncertainty of the classifier, which is taken into account for the training of the fusion layer. From another point of view, the confusion matrix W^i could be regarded as prior knowledge about the classifier C^i [19].

Table 2. Overview of the central matrix V^i in the different approaches, and what to do in each case to get the classification result z, which is an estimate for the class membership. The final decision is hence made for the class with the highest probability (see Eq. 7).
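To make the shared structure in Table 2 concrete, the following sketch (illustrative, reusing the hypothetical conventions of the earlier sketches) computes all four operators from the same confusion matrix W^i = Y C_i^T.

```python
import numpy as np

def all_fusion_matrices(Y, C_i):
    """Table 2: all four (L x L) operators are normalisations of W^i = Y C_i^T."""
    W = Y @ C_i.T                                          # shared confusion matrix W^i
    row_counts = Y.sum(axis=1)                             # diag(Y Y^T): training patterns per class
    return {
        "assoc_linear_memory": W,                                   # Eq. 4
        "decision_templates":  W / row_counts[:, None],             # Eq. 11 (row-wise normalisation)
        "pseudoinverse":       Y @ np.linalg.pinv(C_i),             # Eqs. 12-13
        "naive_bayes":         W @ np.linalg.pinv(C_i @ C_i.T),     # Eq. 14 (column-wise for crisp C_i)
    }
```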

In three of the four combining schemes the combination of the classifier decisions is a matrix-vector product (see Eq. 6). The only exception is the naive Bayes fusion scheme.

By assuming the normalised correlation as similarity measure, classifier fusion via decision templates is very similar to the combination with linear associative memories. Provided that the number of feature vectors for each class in R is equal to κ ∈ IN, the normalisation term of the decision template (Y Y^T)^{-1} (see Eq. 11) can be written as

(Y Y^T)^{-1} = \operatorname{diag}\!\left(\frac{1}{\kappa}, \dots, \frac{1}{\kappa}\right).   (17)

In comparison to the linear associative memory matrix, the decision vector z of the decision template is multiplied by 1/κ, and therefore the class decisions of both fusion schemes are equivalent, because both are based on the maximum membership rule (see Eq. 7).

Regarding the coefficients of the matrix operators V^1, ..., V^I, fusion with the naive Bayes combination is very similar to the pseudoinverse matrix solution. If crisp classifiers are employed to classify the feature vectors in R, the term C_i C_i^T (see Eq. 12 and Eq. 14) is a diagonal matrix containing the number of feature vectors of the individual classes. Let κ^ω be the number of feature vectors classified to class ω ∈ Ω in R. Then the normalisation term is given by

C_i C_i^T = \operatorname{diag}(\kappa^1, \dots, \kappa^L).   (18)

If κ^ω ≠ 0 for each ω ∈ Ω, the inverse of C_i C_i^T is given through

(C_i C_i^T)^{-1} = \operatorname{diag}(\kappa^1, \dots, \kappa^L)^{-1} = \operatorname{diag}\!\left(\frac{1}{\kappa^1}, \dots, \frac{1}{\kappa^L}\right).   (19)

But then the combination schemes for the naive Bayes combination (see Eq. 16) and the pseudoinverse solution (see Eq. 12) are different, leading to different decisions.
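A short numerical check (a sketch under the stated assumptions, not from the paper) of the diagonal structure claimed in Eqs. 17-19 for 1-of-L targets and crisp classifier outputs:

```python
import numpy as np

# Hypothetical crisp example: L = 3 classes, M = 6 samples, kappa = 2 samples per class.
Y = np.repeat(np.eye(3), 2, axis=1)           # 1-of-L targets, two patterns per class
C_i = np.eye(3)[:, [0, 1, 1, 2, 2, 0]]        # crisp (one-hot) outputs of classifier C^i

print(Y @ Y.T)                                # Eq. 17: diag(kappa, ..., kappa) = 2 * I
print(C_i @ C_i.T)                            # Eq. 18: diag(kappa^1, ..., kappa^L) of assigned labels
print(np.linalg.inv(C_i @ C_i.T))             # Eq. 19: diag(1/kappa^1, ..., 1/kappa^L)
```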
5 Acknowledgements

This work has been partially supported by the DFG (Deutsche Forschungsgemeinschaft) under contract SCHW 623/3-2.

References

[1] C. Dietrich, F. Schwenker, and G. Palm. Classification of time series utilizing temporal and decision fusion. In J. Kittler and F. Roli, editors, Multiple Classifier Systems, pages 378-387. Springer, 2001.
[2] C. Dietrich, F. Schwenker, and G. Palm. Decision templates for the classification of bioacoustic time series. In H. B. et al., editor, Proceedings of the IEEE Workshop on Neural Networks and Signal Processing, pages 159-168, 2002.
[3] R. Fay, U. Kaufmann, F. Schwenker, and G. Palm. Learning Object Recognition in a NeuroBotic System. In H.-M. Groß, K. Debes, and H.-J. Böhme, editors, 3rd Workshop on Self-Organization of AdaptiVE Behavior (SOAVE 2004), number 743 in Fortschritt-Berichte VDI, Reihe 10, pages 198-209. VDI, 2004.
[4] S. Haykin. Neural Networks: A Comprehensive Foundation. Macmillan College Publishing Company, 1994.
[5] E. T. Jaynes. Probability Theory: The Logic of Science. Washington University, St. Louis, MO 63130, U.S.A., 1996.
[6] J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas. On combining classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3):226-239, 1998.
[7] J. Kittler and F. Roli. Multiple Classifier Systems. Springer, Berlin, 2000.
[8] T. Kohonen. Self-Organizing Maps. Springer, Berlin, 1995.
[9] L. I. Kuncheva. Fuzzy Classifier Design. Physica Verlag, Heidelberg, New York, 2000.
[10] L. I. Kuncheva. Using measures of similarity and inclusion for multiple classifier fusion by decision templates. Fuzzy Sets and Systems, 122(3):401-407, 2001.
[11] L. I. Kuncheva, J. C. Bezdek, and R. P. W. Duin. Decision templates for multiple classifier fusion. Pattern Recognition, 34(2):299-314, 2001.
[12] M. K. Hu. Visual pattern recognition by moment invariants. IRE Transactions on Information Theory, IT-8:179-187, 1962.
[13] N. C. Oza, R. Polikar, J. Kittler, and F. Roli. Multiple Classifier Systems. Springer, Berlin, 2005.
[14] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann Publishers, San Francisco, 1988.
[15] H. Ritter, T. Martinetz, and K. Schulten. Neuronale Netze. Addison-Wesley, 2nd edition, 1991.
[16] S. Nene, S. Nayar, and H. Murase. Columbia Object Image Library: COIL-20. Technical Report CUCS-006-96, Department of Computer Science, Columbia University, Feb. 1996.
[17] F. Schwenker, H. A. Kestler, and G. Palm. Three learning phases for radial-basis-function networks. Neural Networks, 14:439-458, 2001.
[18] C. Thiel. Multiple Classifier Fusion Incorporating Certainty Factors. Master's thesis, University of Ulm, Germany, 2004.
[19] L. Xu, A. Krzyzak, and C. Y. Suen. Methods of combining multiple classifiers and their applications in handwritten character recognition. IEEE Transactions on Systems, Man and Cybernetics, 22(3):418-435, 1992.