Learning of Decision Fusion Mappings for Pattern Recognition

Friedhelm Schwenker, Christian R. Dietrich, Christian Thiel, Günther Palm
Department of Neural Information Processing, University of Ulm, 89069 Ulm, Germany
[email protected], http://www.informatik.uni-ulm.de/neuro/

Abstract

Different learning algorithms for the decision fusion mapping of a multiple classifier system are compared in this paper. It is very well known that the confusion matrices of the individual classifiers are utilised in the naive Bayes combination of classifier outputs. After a brief review of the decision templates, the linear associative memory and the pseudoinverse approaches, it is demonstrated that all four adaptive decision fusion mappings share the confusion matrices as the essential ingredient.

Keywords: Decision fusion mapping, Linear associative memory, Decision templates, Pseudoinverse matrix, Naive Bayes, Confusion matrix.

1 Introduction

The reason behind this new approach is that in traditional pattern recognition, the best individual classifier has to be found, which is difficult to identify. Also, a single classifier can not make use of information from other discrimination functions.

Typical fusion strategies for the combination of classifiers in MCS are based on fixed fusion mappings, for instance averaging or multiplying the classifier outputs [1, 5]. In this paper we consider supervised learning methods to train the fusion mapping of the MCS. Whereas the MCS architecture (see Figure 1) is very similar to layered artificial neural networks, like radial basis function networks (RBFs) or multilayer perceptrons (MLPs), the training is rather different. For most artificial neural network architectures, e.g. MLPs, the parameters are typically trained simultaneously through a backpropagation-like training algorithm by a one-phase learning procedure. On the other hand a MCS is typically trained by two completely different training phases:

1. Building the Classifier Layer consisting of a set of classifiers where each classifier is trained separately, e.g. each classifier is trained on a specific feature subset.

2. Training of the Fusion Layer performing a mapping of the classifier outputs (soft or crisp decisions) into the set of desired class labels.

This two-phase MCS training is very similar to RBF learning procedures, where the network is learned over two or even three learning phases:

1. Training of the RBF kernel parameters (centres and widths) of the first layer through clustering or vector quantisation.

2. Supervised learning of the output layer (second layer) of the network.

3. Training both layers of the RBF network through a backpropagation-like optimisation procedure.

Typically, for this three-phase learning scheme a labeled training set is used to calculate the radial basis function centres, the radial basis function widths and the weights of the output layer, but it should be noticed that other learning schemes may be applied. For instance, if a large set of unlabeled data points is available, this data can be used to train the parameters of the first RBF layer (centres and widths) through unsupervised clustering [13].

For the training of the two layer fusion architecture two labeled data sets may be used. It is assumed that I classifier decisions, here calculated on I different features, have to be combined. First, the classifiers C^i, i = 1, ..., I, are trained by utilising a training set, and then another labeled training set R, which is not necessarily disjoint from the first, is sent to the previously trained first level classifiers to calculate the individual classifier outputs. These classifier outputs C^i(x_i), i = 1, ..., I, together with the desired class labels are then used to train the decision fusion mapping F. To train this fusion mapping different algorithms have been proposed; a small code sketch of this two-phase procedure is given below.
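To make the two-phase procedure concrete, the following is a minimal NumPy sketch (not from the paper): phase 1 trains one deliberately simple nearest-class-mean classifier per feature subset as a stand-in for the first level classifiers, and phase 2 collects the soft outputs of these classifiers on a second labeled set R, which is the training material for the fusion mapping F. All function names, variable names and the synthetic data are illustrative assumptions.

```python
import numpy as np

def train_nearest_mean(X, y, L):
    """Phase 1 (stand-in first level classifier): one class mean per label for a single feature subset."""
    return np.stack([X[y == l].mean(axis=0) for l in range(L)])

def soft_output(class_means, x):
    """Soft decision in Delta (cf. Eq. 1): distances to the class means turned into a probability vector."""
    d = np.linalg.norm(class_means - x, axis=1)
    s = np.exp(-d)
    return s / s.sum()

L, I = 3, 2                                           # L classes, I feature subsets
rng = np.random.default_rng(0)
X = [rng.normal(size=(60, 5)) for _ in range(I)]      # first training set, one view per feature subset
y = rng.integers(0, L, size=60)
R = [rng.normal(size=(20, 5)) for _ in range(I)]      # second labeled set R used to train the fusion layer
y_R = rng.integers(0, L, size=20)

classifiers = [train_nearest_mean(X[i], y, L) for i in range(I)]        # phase 1: classifier layer
C = [np.column_stack([soft_output(classifiers[i], x) for x in R[i]])    # phase 2: (L x M) outputs on R
     for i in range(I)]
# C[i] together with the targets built from y_R are the matrices C_i and Y used in Section 2.
```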
In this paper decision templates [7, 8], naive Bayes decision fusion [15], linear matrix memory [3], and pseudoinverse matrix [3] methods are studied and links between these fusion mappings are shown. The mappings are introduced in Section 2, while Section 3 presents experimental results on their performance. The closing Section explores the close links between the fusion mappings.

2 Adaptive fusion mappings

In this Section we briefly describe methods to train adaptive fusion mappings. It is assumed that a set of first level classifiers C^1, ..., C^I has been trained through a supervised learning procedure using a labeled training set. This corresponds to phase 1 of the two-phase MCS learning scheme.

To fix the notation, let Ω = {1, ..., L} be a set of class labels and C^i(x_i^µ) ∈ ∆ be the probabilistic classifier output of the i-th classifier C^i given the input feature vector x_i^µ ∈ R, with ∆ defined as

    ∆ := {(y_1, ..., y_L) ∈ [0, 1]^L | Σ_{l=1}^L y_l = 1}.    (1)

Then for each classifier the outputs for the training set R are given by a (L × M)-matrix C_i, i = 1, ..., I, where M = |R| and the µ-th column of C_i contains the classifier output C^i(x_i^µ)^T. Here the superscript T denotes transposition. The desired classifier outputs ω^µ ∈ Ω for inputs x_i^µ ∈ R are given by the (L × M)-matrix Y defined by the 1-of-L encoding scheme for class labels

    Y_{l,µ} := 1 if l = ω^µ, and 0 otherwise.    (2)
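As a concrete reading of this notation, the following short NumPy sketch builds the two matrices defined above; the helper names and toy values are illustrative, not from the paper.

```python
import numpy as np

def target_matrix(labels, L):
    """(L x M) matrix Y of Eq. 2: Y[l, mu] = 1 iff sample mu carries class label l."""
    Y = np.zeros((L, len(labels)))
    Y[labels, np.arange(len(labels))] = 1.0
    return Y

def output_matrix(soft_outputs):
    """(L x M) matrix C_i: column mu is the classifier output C^i(x_i^mu), an element of Delta (Eq. 1)."""
    C_i = np.column_stack(soft_outputs)
    assert np.allclose(C_i.sum(axis=0), 1.0)   # every column sums to one
    return C_i

# toy example with L = 3 classes and M = 4 samples of the set R
Y = target_matrix(np.array([0, 2, 1, 0]), L=3)
C_1 = output_matrix([np.array([.7, .2, .1]), np.array([.1, .1, .8]),
                     np.array([.2, .6, .2]), np.array([.5, .3, .2])])
```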

That is, corresponding to C_i, the µ-th column Y_{·,µ} ∈ ∆ contains the binary coded target output of feature vector x_i^µ ∈ R.

In the following we discuss different approaches to combine the classifier outputs C^i(x_i), i = 1, ..., I, into an overall classifier decision

    z := F(C^1(x_1), ..., C^I(x_I)).    (3)

The four different learning schemes, namely linear associative memory, decision template, pseudoinverse matrix and naive Bayes, are introduced to calculate the decision fusion mapping F : ∆^I → ∆ (see Eq. 3). In all these methods the fusion mapping F is realised through (L × L)-matrices V^1, ..., V^I, calculated through a certain training algorithm. The overall classifier system is depicted in Figure 1.

2.1 Linear Associative Memory

A linear decision fusion mapping F may be realised through an associative matrix memory whose error-correcting properties have been shown in several numerical experiments and theoretical investigations [6]. In order to calculate the memory matrix V^i for each classifier C^i, the stored classifier outputs C_i are adapted through a Hebbian learning rule [6] and V^i is given as the product of the classifier outputs C_i and the desired classifier outputs Y:

    V^i := Y C_i^T =: W^i.    (4)

In the case of crisp classifiers the matrix V^i is equal to the confusion matrix of classifier C^i, where V^i_{ω,ω*} is equal to the number of samples of class ω in the training set which were assigned by C^i to class ω* [15]. For soft classifiers the ω-th row of V^i contains the accumulated soft classifier decisions of C^i for the feature vectors x_i^µ ∈ R^ω.¹

¹ Independent from the classifier type (soft or crisp) we consider W^i a confusion matrix. It is defined to compare the individual fusion schemes (see Eq. 4, 11, 12 and 14).

In the classification phase these matrices are then used to combine the individual classifier decisions to calculate the overall classifier decision (see Eq. 3). For a feature vector X = (x_1, ..., x_I), the classifier outputs C^i(x_i) are applied to the matrices V^i, i = 1, ..., I, and the outputs z^i ∈ IR^L are given by

    z^i := V^i (C^i(x_i))^T.    (5)

The combined class membership estimate based on I feature vectors is then calculated as the average of the individual outputs of the second level classifiers

    z := Σ_{i=1}^I z^i = Σ_{i=1}^I V^i C^i(x_i)^T    (6)

and the final class label based on the combined class membership estimate z is then determined by the maximum membership rule

    ω := argmax_{l∈Ω} (z_l).    (7)
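A minimal NumPy sketch of Eqs. 4-7 follows; the function names are illustrative and the toy matrices merely satisfy the definitions of Section 2.

```python
import numpy as np

def memory_matrix(Y, C_i):
    """Eq. 4: Hebbian / correlation learning, V^i = Y C_i^T (the confusion matrix W^i)."""
    return Y @ C_i.T

def fuse(V_list, outputs):
    """Eqs. 5-7: per-classifier responses z^i = V^i C^i(x_i)^T, summed and decided by maximum membership."""
    z = sum(V_i @ c_i for V_i, c_i in zip(V_list, outputs))   # Eq. 6
    return z, int(np.argmax(z)) + 1                           # Eq. 7, classes numbered 1..L as in Omega

# toy usage with I = 2 classifiers, L = 3 classes, M = 4 samples (matrices Y, C_1, C_2 as in Section 2)
Y   = np.array([[1, 0, 0, 1], [0, 0, 1, 0], [0, 1, 0, 0]], dtype=float)
C_1 = np.array([[.7, .1, .2, .5], [.2, .1, .6, .3], [.1, .8, .2, .2]])
C_2 = np.array([[.6, .2, .3, .4], [.3, .2, .5, .4], [.1, .6, .2, .2]])
V = [memory_matrix(Y, C) for C in (C_1, C_2)]
z, omega = fuse(V, [np.array([.6, .3, .1]), np.array([.5, .4, .1])])
```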

[Figure 1 diagram: features x_1, ..., x_I → classifier layer C^i(x_i) → fusion layer matrices V^i → decisions z^1, ..., z^I → decision fusion → final classification.]

Figure 1: Two layer MCS architecture consisting of a classifier layer and an additional fusion layer. In general the combination of the classifier outputs C^i(x_i), i = 1, ..., I, is accomplished through a fusion mapping F(C^1(x_1), ..., C^I(x_I)). In this paper we restrict ourselves to separable linear fusion mappings, where the classifier outputs C^i(x_i) are multiplied with matrices V^i, i = 1, ..., I. The resulting decisions z^1, ..., z^I are combined with decision fusion. To produce independent answers, the classifiers in this example work on different features.

2.2 Decision Templates

The concept of decision templates is a simple, intuitive, and robust aggregation idea that evolved from the fuzzy template which was introduced by Kuncheva, see [8, 9]. Decision templates are calculated as the mean of the classifier outputs for inputs x_i^µ of class ω ∈ Ω:

    T_i^ω := 1/|R^ω| Σ_{x_i^µ ∈ R^ω} C_i(x_i^µ).    (8)

In the case of I input features, for each class ω the decision template T^ω is given by the (I × L)-matrix

    T^ω := [T_1^ω; ...; T_I^ω] ∈ ∆^I.    (9)

In order to align the decision template combining scheme in such a way that the combination is done as proposed in Figure 1, for each feature space a linear matrix operator V^i is calculated. Let T_i^ω ∈ ∆ be the decision template of the i-th feature space and target class ω as defined in Eq. 8. Then for each classifier i = 1, ..., I, a (L × L)-decision template V^i is given by the decision templates of the i-th feature space

    V^i := [T_i^1; ...; T_i^L] ∈ ∆^L.    (10)

It can be computed similar to the associative memory matrix (see Eq. 4) by using the stored classifier outputs and the desired classifier outputs through

    V^i := (Y Y^T)^{-1} (Y C_i^T) = (Y Y^T)^{-1} W^i.    (11)

Multiplying the confusion matrix W^i with (Y Y^T)^{-1} from the left is equivalent to the row wise normalisation of W^i with the number of training patterns per class.

Assuming the normalised correlation as similarity measure [7, 8], the combination of the classifier outputs with decision templates is equivalent to the combination of classifier outputs with the linear associative memory. Thus, for a set of input vectors X = (x_1, ..., x_I) and a set of classifier outputs C^i(x_i), i = 1, ..., I, the combined class membership estimate is given by Eq. 6.

2.3 Pseudoinverse Solution

Another linear decision fusion mapping F can be calculated as an optimal least squares solution between the outputs of the second layer V^i and the desired classifier outputs Y. Such a mapping is given by the pseudoinverse solution, which defines another type of linear associative memory [3]. The linear matrix operator V^i based on the pseudoinverse solution (also called generalised inverse matrix) is given by

    V^i := Y lim_{α→0+} C_i^T (C_i C_i^T + αI)^{-1} = Y C_i^+    (12)

with C_i^+ being the pseudoinverse of C_i and I for once denoting the identity matrix. Provided the inverse matrix of C_i C_i^T exists, which is always the case for full rank matrices C_i, the pseudoinverse solution is given through

    V^i = Y C_i^T (C_i C_i^T)^{-1} = (Y C_i^T)(C_i C_i^T)^{-1} = W^i (C_i C_i^T)^{-1}.    (13)

As with the linear associative memory and the decision templates, the matrix operators V^1, ..., V^I based on the pseudoinverse solution are used for a linear combination of the classifier outputs. Therefore, the combined class membership is given by Eq. 6 as well.
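A brief NumPy sketch of the two matrix operators (Eq. 11 and Eqs. 12/13), assuming every class occurs at least once in R so that Y Y^T is invertible; the function names are illustrative.

```python
import numpy as np

def decision_template_operator(Y, C_i):
    """Eq. 11: V^i = (Y Y^T)^{-1} (Y C_i^T), i.e. the confusion matrix W^i with every row divided
    by the number of training patterns of that class (the class-wise mean outputs of Eq. 8)."""
    return np.linalg.inv(Y @ Y.T) @ (Y @ C_i.T)

def pseudoinverse_operator(Y, C_i):
    """Eqs. 12/13: V^i = Y C_i^+, the least-squares optimal linear map from stored outputs to targets."""
    return Y @ np.linalg.pinv(C_i)

# Both operators are applied exactly as in Eq. 6: z = sum_i V^i @ C^i(x_i), final label by Eq. 7.
```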
2.4 Naive Bayes Decision Fusion

In the naive Bayes fusion scheme the classifiers C^1, ..., C^I are assumed to be crisp and mutually independent [7], therefore in [15] it is called Bayes combination. In the case of I feature spaces, for each feature space a (L × L)-matrix operator is calculated based on the stored classifier outputs C_i and the desired classifier outputs Y:

    V^i := (Y C_i^T)(C_i C_i^T)^{-1} = W^i (C_i C_i^T)^{-1}.    (14)

In contrast to decision templates (see Eq. 11), the confusion matrices W^i, i = 1, ..., I, are normalised column wise and V^i is called label matrix. Due to the fact that a classifier C^i(x_i) is typically error-bearing, the individual confusions describe the uncertainty of C^i to classify x_i. Therefore, the (m, l)-th entry of V^i is an estimate of the conditional probability

    P̂(ω = m | C^i(x_i) = l),    (15)

that is, the probability that the true class label is m given that C^i(x_i) assigns the crisp class label l ∈ Ω.

After learning, the label matrices V^1, ..., V^I are used to calculate the final decision of the classifier ensemble. The classification works as follows: Let C^1(x_1) = ω_1, ..., C^I(x_I) = ω_I be the class decisions of the individual classifiers for a new sample. Then, by the independence assumption [4], the estimate z_m of the probability that the true class label is m is calculated as [15]

    z_m := α Π_{i=1}^I P̂(ω = m | C^i(x_i) = ω_i),   m = 1, ..., L,    (16)

where α is a normalising constant that ensures Σ_{m=1}^L z_m = 1. This type of Bayesian combination and various methods for uncertainty reasoning have been studied extensively in [11].
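A minimal sketch of Eqs. 14-16 for crisp classifiers, assuming each class label is assigned at least once so that C_i C_i^T is invertible; the 0-based label indices and function names are illustrative.

```python
import numpy as np

def label_matrix(Y, C_i):
    """Eq. 14: V^i = (Y C_i^T)(C_i C_i^T)^{-1}; entry (m, l) estimates P(omega = m | C^i assigns l), Eq. 15."""
    return (Y @ C_i.T) @ np.linalg.inv(C_i @ C_i.T)

def naive_bayes_fusion(label_matrices, crisp_decisions):
    """Eq. 16: element-wise product of the per-classifier posterior columns, renormalised; Eq. 7 gives the label."""
    z = np.ones(label_matrices[0].shape[0])
    for V_i, omega_i in zip(label_matrices, crisp_decisions):
        z = z * V_i[:, omega_i]          # column omega_i (0-based) holds the posterior estimates for label omega_i
    z = z / z.sum()                      # the normalising constant alpha of Eq. 16
    return z, int(np.argmax(z)) + 1      # classes numbered 1..L as in Omega
```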
3 Experiments

All four schemes have been tested using two benchmark sets, namely our photos of fruits and Coil20. The fruits data set comes from a robotic environment, consisting of 840 pictures from 7 different classes (apples, oranges, plums... [2]). Four features were extracted (3 histograms left/middle/right using a Sobel edge detector, mean colour values from the HSV representation) and then classified separately with a Gaussian RBF network which had been initialised using the K-Means algorithm (for details see [14]). Coil20 is the well known data set from the Columbia Object Image Library [12], consisting of 1440 greyscale pictures from 20 different classes, from which 4 features were extracted (concatenation of 7 invariant Hu moments [10], orientation histograms utilising Sobel or Canny edge detectors respectively on the grey scale image as well as on the black/white channel of the opponent colour system) and then classified separately with a k-nearest-neighbour classifier.

The accuracy after combining the answers from the different classifiers using one of the four fusion schemes was then measured using 5x5 cross validation with the fruits and 5x6 cross validation with the Coil20 data set. Please note that no effort has been made to achieve high accuracy or select good features; the emphasis was on comparing the fusion architectures.

                          Fruits    Coil20
    Linear assoc. memory  94.93     98.87
    Decision templates    95.71     99.06
    Pseudoinverse         95.81     99.28
    Naive Bayes           89.17     78.14

Table 1: Accuracy of the four fusion schemes on two different data sets in %.

As can be seen in Table 1, the naive Bayes scheme falls well behind for the two data sets tested. This could not be attributed to it working only on crisp answers, as the accuracy of the other fusion schemes did not change much in experiments when also fed only with crisp classifier outputs. The linear associative memory and decision templates fusion schemes do yield different results, as we did not use the normalised correlation as similarity measure for the latter (as shown in Table 2) but the measure S1 introduced in [9]. It is striking that three of the schemes achieve very similar results, demonstrating that despite coming from different research directions they are quite alike, in fact all using the confusion matrix W^i.

4 Discussion

Finally we want to explore the close ties between the four combining schemes. All are realised using (L × L)-matrices V^1, ..., V^I, obtained using a certain training algorithm (see Table 2 for an overview). These matrices are based on the classifier outputs C_1, ..., C_I of the corresponding first level classifiers C^1, ..., C^I and the desired output Y.

The confusion matrix W^i = Y C_i^T (see Eq. 4, 11, 12 and 14) is the main ingredient in all combination schemes, with W^i describing the uncertainty of the classifier which is taken into account for the training of the fusion layer. From another point of view, the confusion matrix W^i could be regarded as prior knowledge about the classifier C^i [15].

In three of the four combining schemes the combination of the classifier decisions is a matrix-vector product (see Eq. 6). The only exception is the naive Bayes fusion scheme.

Table 2: Overview of the central matrix V^i in the different approaches, and what to do in each case to get the classification result z, which is an estimate of the class membership. The final decision is hence obviously for the class with the highest probability (see Eq. 7).
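As a numerical illustration of these relations (not part of the original paper), the following sketch builds the operators of Table 2 for a toy crisp classifier and checks that the decision template and the pseudoinverse / naive Bayes operators are simply the row-wise and column-wise normalised confusion matrix W^i.

```python
import numpy as np

# Deterministic toy data: L = 4 classes, 10 samples per class, a few injected misclassifications.
L = 4
true = np.repeat(np.arange(L), 10)
pred = true.copy()
pred[::7] = (pred[::7] + 1) % L
M = len(true)

Y = np.zeros((L, M)); Y[true, np.arange(M)] = 1.0        # targets (Eq. 2)
C_i = np.zeros((L, M)); C_i[pred, np.arange(M)] = 1.0    # crisp classifier outputs

W = Y @ C_i.T                                            # confusion matrix (Eq. 4)
V_dt = np.linalg.inv(Y @ Y.T) @ W                        # decision templates: row-wise normalisation (Eq. 11)
V_nb = W @ np.linalg.inv(C_i @ C_i.T)                    # pseudoinverse / naive Bayes: column-wise normalisation (Eq. 13/14)

assert np.allclose(V_dt, W / W.sum(axis=1, keepdims=True))   # rows scaled by the samples per true class
assert np.allclose(V_nb, W / W.sum(axis=0, keepdims=True))   # columns scaled by the samples per assigned class
```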

By assuming the normalised correlation as similarity measure, classifier fusion via decision templates is very similar to the combination with linear associative memories. Provided that the number of feature vectors for each class in R is equal to κ ∈ IN, the normalisation term of the decision template (Y Y^T)^{-1} (see Eq. 11) can be written as

    (Y Y^T)^{-1} = diag(1/κ, ..., 1/κ).    (17)

In comparison to the linear associative memory matrix, the decision vector z of the decision template is multiplied by 1/κ, and therefore the class decisions of both fusion schemes are equivalent, because both are based on the maximum membership rule (see Eq. 7).

Regarding the coefficients of the matrix operators V^1, ..., V^I, fusion with the naive Bayes combination is very similar to the pseudoinverse matrix solution. If crisp classifiers are employed to classify the feature vectors in R, the term C_i C_i^T (see Eq. 12 and Eq. 14) is a diagonal matrix containing the number of feature vectors of the individual classes. Let κ^ω be the number of feature vectors classified to class ω ∈ Ω in R. Then the normalisation term is given by

    C_i C_i^T = diag(κ^1, ..., κ^L).    (18)

If κ^ω ≠ 0 for each ω ∈ Ω, the inverse of C_i C_i^T is given through

    (C_i C_i^T)^{-1} = diag(κ^1, ..., κ^L)^{-1} = diag(1/κ^1, ..., 1/κ^L).    (19)

But then the combination schemes for the naive Bayes combination (see Eq. 16) and the pseudoinverse solution (see Eq. 12) are different, leading to different decisions.

5 Acknowledgements

This work was partially supported by the DFG (German Research Society) contract SCHW 623/3-2.

References

[1] C. Dietrich, F. Schwenker, and G. Palm. Classification of time series utilizing temporal and decision fusion. In J. Kittler and F. Roli, editors, Multiple Classifier Systems, pages 378-387. Springer, 2001.

[2] R. Fay, U. Kaufmann, F. Schwenker, and G. Palm. Learning object recognition in a NeuroBotic system. In H.-M. Groß, K. Debes, and H.-J. Böhme, editors, 3rd Workshop on Self-Organization of AdaptiVE Behavior SOAVE 2004, number 743 in Fortschritt-Berichte VDI, Reihe 10, pages 198-209. VDI, 2004.

[3] S. Haykin. Neural Networks: A Comprehensive Foundation. Macmillan College Publishing Company, 1994.

[4] E. T. Jaynes. Probability Theory: The Logic of Science. Washington University, St. Louis, MO 63130, U.S.A., 1996.

[5] J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas. On combining classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3):226-239, 1998.

[6] T. Kohonen. Self-Organizing Maps. Springer, Berlin, 1995.

[7] L. I. Kuncheva. Fuzzy Classifier Design. Physica Verlag, Heidelberg, New York, 2000.

[8] L. I. Kuncheva. Using measures of similarity and inclusion for multiple classifier fusion by decision templates. Fuzzy Sets and Systems, 122(3):401-407, 2001.

[9] L. I. Kuncheva, J. C. Bezdek, and R. P. W. Duin. Decision templates for multiple classifier fusion. Pattern Recognition, 34(2):299-314, 2001.

[10] M. K. Hu. Visual pattern recognition by moment invariants. IRE Transactions on Information Theory, IT-8:179-187, 1962.

[11] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann Publishers, San Francisco, 1988.

[12] S. Nene, S. Nayar, and H. Murase. Columbia Object Image Library (COIL-20). Technical Report CUCS-006-96, Department of Computer Science, Columbia University, Feb. 1996.

[13] F. Schwenker, H. A. Kestler, and G. Palm. Three learning phases for radial basis function networks. Neural Networks, 14:439-458, 2001.

[14] C. Thiel. Multiple Classifier Fusion Incorporating Certainty Factors. Master's thesis, University of Ulm, Germany, 2004.

[15] L. Xu, A. Krzyzak, and C. Y. Suen. Methods of combining multiple classifiers and their applications in handwritten character recognition. IEEE Transactions on Systems, Man and Cybernetics, 22(3):418-435, 1992.