
Fuzzy Support Vector Machines for Pattern Recognition and Data Mining

Han-Pang Huang and Yi-Hung Liu

Abstract

A support vector machine (SVM) is a new, powerful classification machine that has been applied to many fields, such as pattern recognition and data mining, in the past few years. However, some problems remain to be solved. One of them is that SVMs are very sensitive to outliers and noises because of the overfitting problem. In this paper, a fuzzy support vector machine (FSVM) is proposed to deal with this problem. A proper membership model is also proposed to fuzzify all the training data of the positive and negative classes. The outliers are detected by the proposed outlier detection method (ODM). The ODM is a hybrid method based on the fuzzy c-means (FCM) algorithm cascaded with an unsupervised neural network, the self-organizing map (SOM). Experimental results indicate that the proposed FSVM actually reduces the effect of outliers and yields a higher classification rate than the SVM does.

Keywords: Support Vector Machines (SVMs), Fuzzy c-means (FCM), Self-organizing Map (SOM), Outlier Detection.

1. Introduction

Support vector machines (SVMs) are based on the statistical learning theory developed by Vapnik [2,4,6,16]. The formulation of SVMs embodies the structural risk minimization (SRM) principle. SVMs have gained wide acceptance due to their high generalization ability over a wide range of applications and their better performance than other traditional learning machines [2]. In addition, SVMs have been applied to many classification and recognition fields, such as isolated handwritten digit recognition, object recognition, speech recognition [2], and spatial data analysis [11].

In a SVM, the original input space is mapped into a high-dimensional dot product space via a kernel. The new space is called the feature space, in which an optimal hyperplane is determined to maximize the generalization ability. The optimal hyperplane is determined by only a few data points, called support vectors (SVs). Accordingly, a SVM can provide good generalization performance for classification problems even though it does not incorporate problem-domain knowledge. This attribute is unique to SVMs [6].

However, there are two important issues in developing SVMs. One is how to extend two-class problems to multi-class problems efficiently. Several methods have been proposed, such as one-against-one, one-against-all, and the directed acyclic graph SVM (DAGSVM) [8,15]. These methods are based on solving several binary classifications; in other words, SVMs are originally designed for binary classification [4]. Hence, before enhancing the performance of SVMs for multi-class classification, the problems of SVMs in binary classification should be solved first. The other issue is to overcome the problem of overfitting in two-class classification [2]. As remarked in [17] and [5], SVMs are very sensitive to outliers and noises. This paper deals with that problem.

Different from SVMs, the proposed FSVMs treat the training data points with different degrees of importance in the training process. Namely, FSVMs fuzzify the penalty term of the cost function to be minimized, reformulate the constrained optimization problem, and then construct the Lagrangian so that the solution for the optimal hyperplane in the primal form can be found in the dual form. The other goal of this paper is to provide a membership model with the function of outlier detection. Based on this membership model, all training data points are properly assigned different degrees of importance to their own classes, so that FSVMs can avoid the phenomenon of overfitting due to outliers and noises.

This paper is organized as follows. Section 2 briefly reviews the basic theory of SVMs. Section 3 first states the problem of SVMs and then formulates the proposed FSVM. In Section 4, a membership model for the FSVM is given, including the proposed outlier detection method (ODM). Section 5 conducts several experiments to indicate the merit of the FSVM. Finally, conclusions are given in Section 6.

Corresponding Author: Han-Pang Huang is with the Robotics Laboratory, Department of Mechanical Engineering, National Taiwan University, Taipei, 10660, Taiwan. TEL/FAX: (886) 2-23633875. E-mail: [email protected]


2. Basic Theory of SVMs

This section briefly introduces the theory of SVMs, including the linearly separable case, the linearly nonseparable case, and the nonlinear case, through a two-class classification problem [2,4,6,16]. Assume that a training set $S$ is given as

$S = \{x_i, y_i\}_{i=1}^{n}$    (1)

where $x_i \in R^N$ and $y_i \in \{-1,+1\}$. The goal of SVMs is to find an optimal hyperplane such that

$w^T x_i + b \ge +1$ for $y_i = +1$, and $w^T x_i + b \le -1$ for $y_i = -1$    (2)

where the weight vector $w \in R^N$ and the bias $b$ is a scalar. If the inequalities in Eq. (2) hold for all training data, the case is said to be linearly separable. In learning the optimal hyperplane, SVMs maximize the margin of separation $\rho$ between the classes, where $\rho = 2/\|w\|$. Therefore, for the linearly separable case, finding the optimal hyperplane amounts to solving the following constrained optimization problem

Minimize $F(w) = \frac{1}{2} w^T w$    (3)
Subject to $y_i(w^T x_i + b) \ge 1$, $i = 1,2,\ldots,n$    (4)

The above constrained optimization problem can be solved by quadratic programming (QP). However, if the inequalities in Eq. (2) do not hold for some data points in $S$, the case becomes linearly nonseparable. The margin of separation between the classes is then said to be soft, since some data points violate the separation conditions in Eq. (2). To set the stage for a formal treatment of nonseparable data points, SVMs introduce a set of nonnegative scalar variables $\{\xi_i\}_{i=1}^{n}$ into the decision surface; i.e.,

$y_i(w^T x_i + b) \ge 1 - \xi_i$, $i = 1,2,\ldots,n$    (5)

The $\xi_i$ are called slack variables. For $0 \le \xi_i < 1$, the data points fall inside the region of separation but on the right side of the decision surface. For $\xi_i > 1$, they fall on the wrong side of the decision surface. Now the goal of the SVM is to find a separating hyperplane for which the misclassification error is minimized while the margin of separation is maximized. Finding an optimal hyperplane for the linearly nonseparable case amounts to solving the following constrained optimization problem

Minimize $F(w,\xi) = \frac{1}{2} w^T w + C \sum_{i=1}^{n} \xi_i$    (6)
Subject to $y_i(w^T x_i + b) \ge 1 - \xi_i$, $i = 1,2,\ldots,n$    (7)
$\xi_i \ge 0$, $i = 1,2,\ldots,n$    (8)

where $C$ is a user-defined positive parameter. It controls the tradeoff between the complexity of the machine and the number of nonseparable points. In particular, it is the only free parameter in SVMs.

For the primal problem (Eqs. (6)-(8)), it is difficult to find the solution by QP when the problem becomes large-scale. Hence, by introducing a set of Lagrange multipliers $\alpha_i$ and $\beta_i$ for the constraints (7) and (8), the primal problem becomes the task of finding the saddle point of the Lagrangian. Thus, the dual problem becomes

Maximize $Q(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j$    (9)
Subject to $\sum_{i=1}^{n} \alpha_i y_i = 0$    (10)
$0 \le \alpha_i \le C$, $i = 1,2,\ldots,n$    (11)

In the dual problem, the cost function $Q(\alpha)$ to be maximized depends on the training data only through the set of dot products $\{x_i^T x_j\}$. Furthermore, the Lagrange multipliers can be obtained by constrained nonlinear programming with the equality constraint in Eq. (10) and the inequality constraints in Eq. (11). The Kuhn-Tucker (KT) conditions play a central role in optimization theory and are defined by

$\alpha_i[y_i(w^T x_i + b) - 1 + \xi_i] = 0$, $i = 1,2,\ldots,n$    (12)
$\beta_i \xi_i = 0$, $i = 1,2,\ldots,n$    (13)

where $\beta_i = C - \alpha_i$. There are two types of $\alpha_i$. If $0 < \alpha_i \le C$, the corresponding data points are called support vectors (SVs). The optimal solution for the weight vector is given by

$w_o = \sum_{i=1}^{N_s} \alpha_i y_i x_i$    (14)

where $N_s$ is the number of SVs. Moreover, in the case of $0 < \alpha_i < C$, we have $\xi_i = 0$ according to the KT condition in Eq. (13). Hence, one may determine the optimal bias $b_o$ by taking any data point in the set $S$ for which $0 < \alpha_i < C$ (and therefore $\xi_i = 0$) and using that data point in the KT condition in Eq. (12). However, from the numerical perspective it is better to take the mean value of $b_o$ obtained from all such data points in the set $S$. Once the optimal pair $(w_o, b_o)$ is determined, the decision function is obtained as

$g(x) = \mathrm{sign}\left(\sum_{i=1}^{N_s} \alpha_i y_i x_i^T x + b_o\right)$    (15)

Also, the optimal hyperplane is $g(x) = 0$. Another case is $\alpha_i = C$; then $\xi_i > 0$ and can be computed from Eq. (12). If $\alpha_i = 0$, the corresponding data points are classified correctly.
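To make the role of Eqs. (14)-(15) concrete, the following minimal numpy sketch evaluates the decision function from a given dual solution. The array names, and the assumption that the multipliers have already been obtained by a QP solver, are ours and not part of the original formulation.

import numpy as np

def svm_decision(X_sv, y_sv, alpha_sv, b_o, x):
    """Evaluate g(x) of Eq. (15), assuming the dual solution is known.

    X_sv     : (N_s, N) support vectors
    y_sv     : (N_s,)   labels in {-1, +1}
    alpha_sv : (N_s,)   optimal Lagrange multipliers with 0 < alpha_i <= C
    b_o      : optimal bias, e.g. averaged over points with 0 < alpha_i < C
    x        : (N,)     test point
    """
    # Eq. (14): w_o = sum_i alpha_i y_i x_i, applied here directly to x
    return np.sign(np.sum(alpha_sv * y_sv * (X_sv @ x)) + b_o)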

Another merit of SVMs is that the input vector can be mapped into a higher dimensional feature space, so that the nonlinear case can also be solved. By choosing a nonlinear mapping function $\varphi(x) \in R^M$, where $M > N$, the SVM can construct an optimal hyperplane in this new feature space. $K(x, x_i)$ is the inner-product kernel performing the nonlinear mapping into the feature space

$K(x, x_i) = K(x_i, x) = \varphi(x)^T \varphi(x_i)$    (16)

Hence, the dual optimization problem becomes

Maximize $Q(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$    (17)

subject to the same constraints as Eqs. (10) and (11). The only requirement on the kernel $K(x, x_i)$ is that it satisfy Mercer's theorem [6]. Using kernel functions, and without treating the high dimensional data explicitly, unseen data are classified as follows

$x \in$ positive class if $g(x) > 0$; $x \in$ negative class if $g(x) < 0$    (18)

where the decision function is

$g(x) = \mathrm{sign}\left(\sum_{\mathrm{SVs}} \alpha_i y_i K(x, x_i) + b_o\right)$    (19)

In Table 1, we summarize the inner-product kernels for three common types of SVMs. The power $p$ and the width $\sigma^2$ are user-specified. For the two-layer perceptron, Mercer's theorem is satisfied only for some values of $\sigma_0$ and $\sigma_1$.

Table 1. Three common types of kernels used in SVMs
Kernel type                    | $K(x, x_i)$
Polynomial learning machine    | $(1 + x^T x_i)^p$
Radial-basis function network  | $\exp\left(-\frac{1}{2\sigma^2}\|x - x_i\|^2\right)$
Two-layer perceptron           | $\tanh(\sigma_0 x^T x_i + \sigma_1)$
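As an illustration of Table 1, the three kernels can be written directly as functions of a pair of input vectors. The default parameter values below are arbitrary placeholders, not values used in the paper.

import numpy as np

def poly_kernel(x, xi, p=3):
    # Polynomial learning machine: (1 + x^T xi)^p
    return (1.0 + x @ xi) ** p

def rbf_kernel(x, xi, sigma=1.0):
    # Radial-basis function network: exp(-||x - xi||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - xi) ** 2) / (2.0 * sigma ** 2))

def mlp_kernel(x, xi, s0=1.0, s1=-1.0):
    # Two-layer perceptron: tanh(s0 x^T xi + s1); Mercer's theorem holds
    # only for some values of s0 and s1
    return np.tanh(s0 * (x @ xi) + s1)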
3. FSVMs

A. Problems Occurring in SVMs

In SVMs, the training process is very sensitive to training data points that are far away from their own classes. If there exist two linearly separable classes, SVMs can find an optimal hyperplane such that no data points fall into the margin of separation, no matter how small the compactness of the two classes is. However, for two linearly nonseparable classes, how to set the free parameter $C$ to a proper value is very significant. Recalling the cost function in Eq. (6), the second term on the right hand side is actually a penalty term. A larger $C$ assigns a higher penalty to errors and thus reduces the number of misclassified data points. On the contrary, a smaller $C$ ignores more plausibly misclassified data points and thus yields wider margins. Whether the value of $C$ is large or small, this parameter is fixed during the training process of SVMs. Namely, all training data points are treated equally during the training of SVMs. This leads to a high sensitivity to some special cases, such as outliers and noises.

Figure 1. Equally treating outliers may cause overfitting in the training of SVMs.

Figure 1 shows a linearly nonseparable example in a 2-D space to illustrate two different cases. The "less important region" means that the training data points in this region make no contribution to determining the support vectors for which the equality terms in Eq. (2) are satisfied. In other words, support vectors that may meet the equality conditions fall into the "more important region." With the same value of $C$, if there exist outliers in class 1, the optimal hyperplane $H$ moves toward the side of class 2 with a degree of bias due to overfitting. In addition, the slope of the hyperplane changes, because the weight vector $w$ is determined by all of the support vectors according to Eq. (14). Moreover, the classification error for class 2 increases because of the unsuitable overfitting arising from the outliers.

B. Fuzzify the Training Set

As mentioned above, treating every data point equally may cause unsuitable overfitting in SVMs. Hence, the central concept of the proposed FSVM is to assign each data point a membership value according to its relative importance in its class. Since each data point $x_i$ has an assigned membership value $m_i$, the training set becomes a fuzzy training set $S_f$, given by

$S_f = \{x_i, y_i, m_i\}_{i=1}^{n}$    (20)

For the positive class ($y_i = +1$), the membership values are denoted as $m_i^+$; for the negative class ($y_i = -1$), they are denoted as $m_i^-$. They are assigned independently.


C. Formulate FSVMs

Suppose that a fuzzy training set has been obtained; the next step is to formulate the FSVM. Starting from the construction of a cost function, the FSVM also aims to maximize the margin of separation and minimize the classification error so that good generalization ability can be achieved. Unlike the penalty term in SVMs, FSVMs fuzzify the penalty term in order to reduce the effect of less important data points. The penalty term becomes a function of the membership values, and hence the SVM is named the FSVM. The constrained optimization problem of the FSVM is formulated as follows

Minimize $F(w,\xi,m) = \frac{1}{2} w^T w + C \sum_{i=1}^{n} m_i^{m} \xi_i$    (21)
Subject to $y_i(w^T x_i + b) \ge 1 - \xi_i$, $i = 1,2,\ldots,n$    (22)
$\xi_i \ge 0$, $i = 1,2,\ldots,n$    (23)

where $m$ influences the fuzziness of the fuzzified penalty term in the cost function. Now let the Lagrangian function be

$Q(w,b,\xi,\alpha,\beta,m) = \frac{1}{2} w^T w + C \sum_{i=1}^{n} m_i^{m}\xi_i - \sum_{i=1}^{n} \alpha_i[y_i(w^T x_i + b) - 1 + \xi_i] - \sum_{i=1}^{n} \beta_i \xi_i$    (24)

where $\alpha_i$ and $\beta_i$ are nonnegative Lagrange multipliers. Differentiating $Q$ with respect to $w$, $b$, and $\xi_i$, and setting the results equal to zero, we have the following three conditions of optimality

$\partial Q/\partial w = w - \sum_{i=1}^{n} \alpha_i y_i x_i = 0$    (25)
$\partial Q/\partial b = -\sum_{i=1}^{n} \alpha_i y_i = 0$    (26)
$\partial Q/\partial \xi_i = C m_i^{m} - \alpha_i - \beta_i = 0$    (27)

Substituting Eqs. (25)-(27) into Eq. (24), the Lagrangian becomes a function of $\alpha$ only. The dual problem becomes

Maximize $Q(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j$    (28)
Subject to $\sum_{i=1}^{n} \alpha_i y_i = 0$    (29)
$0 \le \alpha_i \le C m_i^{m}$, $i = 1,2,\ldots,n$    (30)

It is clear that the only difference between SVMs and the proposed FSVMs lies in the upper bounds of the Lagrange multipliers $\alpha_i$ in the dual problem. In SVMs, the $\alpha_i$ are bounded by the constant $C$, while in FSVMs they are bounded by dynamical boundaries that are functions of the membership values. The lower the membership value of a data point $x_i$ to its own class, the narrower the feasible region along the $\alpha_i$ axis. Finally, the KT conditions of FSVMs are

$\alpha_i[y_i(w^T x_i + b) - 1 + \xi_i] = 0$, $i = 1,2,\ldots,n$    (31)
$(C m_i^{m} - \alpha_i)\xi_i = 0$, $i = 1,2,\ldots,n$    (32)
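The practical effect of the bound in Eq. (30) is that each point receives its own penalty $C m_i^{m}$. One way to approximate this behavior with an off-the-shelf solver (not the authors' implementation) is to pass the membership values as per-sample weights, which scale $C$ point by point; the toy data below are made up purely for illustration.

import numpy as np
from sklearn.svm import SVC

# Toy fuzzy training set: the last positive point acts as an "outlier" and gets a
# tiny membership value, so its effective penalty C*m_i is nearly zero.
X = np.array([[0.0, 0.0], [0.3, 0.2], [1.0, 1.2], [1.2, 0.9], [0.1, 1.1]])
y = np.array([-1, -1, 1, 1, 1])
m = np.array([1.0, 1.0, 1.0, 1.0, 0.01])

clf = SVC(kernel="linear", C=5.0)
clf.fit(X, y, sample_weight=m)   # per-sample weight multiplies C, mirroring Eq. (30) with m = 1
print(clf.support_)              # indices of the resulting support vectors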
4. Membership Model for FSVMs

The membership model for the FSVMs can be divided into two steps, as shown in Figure 2. First, the outlier set is detected from the training set; the rest is called the main body set of the training set. Then the two sets are fuzzified by membership functions. It should be noted that the membership model is applied to the positive and negative classes independently. Finally, after combining the two fuzzy sets $S_f^+$ and $S_f^-$, the fuzzy training set $S_f$ is obtained. In the following, all symbols can stand for either the positive class or the negative class.

Figure 2. Membership model for the FSVMs.

A. Outlier Detection

Before detecting outliers from a class set, some characteristics of outliers should be given first. There are two important aspects. First, the outliers should be largely separated from the main body. Second, the number of outliers should be much smaller than the number of elements in the main body [3,12]. According to these two aspects, an intelligent outlier detection method (ODM), based on several techniques including Kohonen's self-organizing map (SOM), the index of fuzziness, and the fuzzy c-means (FCM) algorithm, is proposed. The merit of using the ODM is that it performs the task of outlier detection in a 1-D space, so that one can observe the distribution of the training data points along an axis. In addition, the outlier partitions and the main body partitions can be determined by the ODM automatically; thus, the outliers in the training set can be detected. Let $X_c = \{x_i\}_{i=1}^{p}$ be the positive class set or the negative class set in $S$, with $x_i \in R^N$. The procedures of the ODM are given below.

Step 1: Unsupervised feature map. Let the SOM be a 1-D SOM, i.e., the output nodes are arranged in a 1-D array. After using the learning rule of Kohonen's feature map [13] to update the weight vectors between the input and output layers (fully connected), each training data point $x_i \in R^N$ is mapped into a scalar called the winner. The winner is determined by similarity matching based on the minimum-distance Euclidean criterion

$y_i = \arg\min_{j} \|x_i - W_j\|$, $j = 1,2,\ldots,l$    (33)

The updating rule for the weight vector at time $t$ is

$W_j(t+1) = W_j(t) + \eta(t) h(t)[x(t) - W_j(t)]$    (34)

where the learning rate $\eta$ and the neighborhood function $h$ change dynamically during learning, and $l$ denotes the number of output nodes. Through the unsupervised feature map, the class set $X_c$ can be represented by the winner set $Y_c = \{y_i\}_{i=1}^{p}$, with $y_i \in R^1$.
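A minimal numpy sketch of step 1 is given below, assuming simple exponential schedules for the learning rate and neighborhood width (the exact schedules used in the experiments are listed in Section 5); all names and defaults are illustrative.

import numpy as np

def train_1d_som(X, l=100, epochs=200, eta0=0.9, sigma0=100.0, seed=0):
    """Minimal 1-D Kohonen SOM implementing Eqs. (33)-(34)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(l, X.shape[1]))                  # fully connected weights
    for t in range(epochs):
        eta = max(eta0 * np.exp(-t / epochs), 0.01)       # decaying learning rate
        sigma = max(sigma0 * np.exp(-t / epochs), 1.0)    # shrinking neighborhood width
        for x in X[rng.permutation(len(X))]:
            c = np.argmin(np.linalg.norm(x - W, axis=1))  # winner, Eq. (33)
            d = np.arange(l) - c                          # lateral distances to the winner
            h = np.exp(-d**2 / (2.0 * sigma**2))          # neighborhood function
            W += eta * h[:, None] * (x - W)               # update rule, Eq. (34)
    return W

def winners(X, W):
    """Map every training point to its winning output node (the winner set Y_c)."""
    return np.array([np.argmin(np.linalg.norm(x - W, axis=1)) for x in X])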

Step 2: Incorporate a virtual set into the class set. Let $Y_v$ be a virtual set, $Y_v \subset R^1$, $|Y_v| = |Y_c| = p$, and $Y_v \cap Y_c = \phi$, i.e., the two sets are completely disjoint. Randomly search for a virtual set such that $Y_c$ and $Y_v$ satisfy the following two conditions:

1. Separation condition: the separation between the class set $Y_c$ and the virtual set $Y_v$ is sufficiently large.
2. Compactness condition: the compactness of the virtual set $Y_v$ is sufficiently large.

To quantify the separation between the sets and the compactness of the virtual set, a separation index (SI) and a compactness index (CI) are defined. They are based on the index of fuzziness [10], which measures the ambiguities of the intersets and intrasets, as

$\frac{2}{p'}\sum_{i=1}^{p'} m_{A\cap\bar{A}}(d(y_i,\bar{y})) = \mathrm{SI}$, if $y_i \in \{Y_c, Y_v\}$ and $p' = 2p$; $= \mathrm{CI}$, if $y_i \in \{Y_v\}$ and $p' = p$    (35)

where $\bar{y}$ denotes the center and $\bar{A}$ is the complement set of $A$. The distance $d$ is defined as the normalized distance within $[0, 0.5]$

$d(y_i,\bar{y}) = \frac{\|y_i - \bar{y}\|}{2\cdot\max_j \|y_j - \bar{y}\|}$    (36)

The membership function of $A$ is defined as the semi-$\pi$ membership function

$m_A = 1 - 2d^2$, if $0 \le d \le 1/2$; $m_A = 0$, otherwise    (37)

Clearly, the higher (lower) the value of the index SI (CI), the larger the separation between the sets (the compactness of the virtual set). On the contrary, the lower (higher) the value of the index SI (CI), the lower the separation between the sets (the compactness of the virtual set). Hence, while searching for the virtual set, the two indices SI and CI can be used as criteria for deciding whether the virtual set satisfies the above two conditions. Finally, let $Y = \{Y_v, Y_c\} = \{y_1,\ldots,y_j,\ldots,y_{2p}\}$ be the candidate set for the next step.
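The following sketch computes the two indices for 1-D winner sets. Equation (35) leaves the intersection with the complement implicit; here it is taken as min(m_A, 1 - m_A), the usual index-of-fuzziness choice, and the center is taken as the mean of the set being measured. Both choices are our assumptions.

import numpy as np

def semi_pi_membership(d):
    # Eq. (37): m_A(d) = 1 - 2 d^2 for 0 <= d <= 1/2, and 0 otherwise
    return np.where((d >= 0.0) & (d <= 0.5), 1.0 - 2.0 * d**2, 0.0)

def index_of_fuzziness(Y):
    """Eqs. (35)-(36) for a set of 1-D winners."""
    Y = np.asarray(Y, dtype=float)
    center = Y.mean()
    spread = np.max(np.abs(Y - center))
    if spread == 0.0:                      # perfectly compact set, e.g. CI = 0
        return 0.0
    d = np.abs(Y - center) / (2.0 * spread)        # normalized distance, Eq. (36)
    m = semi_pi_membership(d)
    return (2.0 / len(Y)) * np.sum(np.minimum(m, 1.0 - m))

# SI is measured over the combined candidate set, CI over the virtual set alone:
# SI = index_of_fuzziness(np.concatenate([Yc, Yv]))
# CI = index_of_fuzziness(Yv)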

Step 3: Perform fuzzy clustering to partition the set $Y$ into fuzzy partitions (clusters). The fuzzy c-means (FCM) algorithm [1] is used here to find the cluster centers $v_i$ and the membership values $u_{ij} \in [0,1]$, which express the degree to which the element $y_j$ belongs to the $i$th cluster

$v_i = \frac{\sum_{j=1}^{2p}(u_{ij})^e y_j}{\sum_{j=1}^{2p}(u_{ij})^e}$, $i = 1,2,\ldots,c$    (38)

$u_{ij} = \frac{(1/\|y_j - v_i\|^2)^{1/(e-1)}}{\sum_{k=1}^{c}(1/\|y_j - v_k\|^2)^{1/(e-1)}}$, $i = 1,\ldots,c$, $j = 1,\ldots,2p$    (39)

where $e$ is called the exponential weight, which influences the degree of fuzziness of the partition matrix $[u_{ij}]$. There are two constraints on the values of $u_{ij}$:

$\sum_{i=1}^{c} u_{ij} = 1$, for all $j = 1,2,\ldots,2p$    (40)

$0 < \sum_{j=1}^{2p} u_{ij} < 2p$, for all $i = 1,2,\ldots,c$    (41)

where $c$ is the number of clusters. To find the most plausible number of clusters in the set, the partitioning entropy (PE) $H(U,c)$ gives a global validity measure [7,13]. Hence, the best partition number (BPN), $c_{best}$, can be determined by minimizing the partitioning entropy

$c_{best} = \arg\min_{2 \le c \le 2p-1}\{\min_{U \in \Omega_c}[H(U,c)]\}$    (42)

where the partitioning entropy is defined as

$H(U,c) = -\frac{1}{2p}\sum_{j=1}^{2p}\sum_{i=1}^{c} u_{ij}\ln u_{ij}$    (43)

and $\Omega_c$ is the set of all optimal solutions for a given $c$.
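A compact numpy sketch of Eqs. (38)-(44) for a 1-D candidate set is shown below; the random initialization, the fixed iteration count, and the small constant guarding against division by zero are implementation choices of ours.

import numpy as np

def fcm(Y, c, e=2.0, iters=100, seed=0):
    """Fuzzy c-means on a 1-D candidate set; returns centers v and memberships u."""
    rng = np.random.default_rng(seed)
    Y = np.asarray(Y, dtype=float)
    u = rng.random((c, len(Y)))
    u /= u.sum(axis=0)                               # constraint (40)
    for _ in range(iters):
        v = (u**e @ Y) / (u**e).sum(axis=1)          # cluster centers, Eq. (38)
        dist2 = (Y[None, :] - v[:, None])**2 + 1e-12
        u = (1.0 / dist2) ** (1.0 / (e - 1.0))       # memberships, Eq. (39)
        u /= u.sum(axis=0)
    return v, u

def partition_entropy(u):
    # Eq. (43): H(U, c) = -(1/n) sum_j sum_i u_ij ln u_ij
    return -np.mean(np.sum(u * np.log(u + 1e-12), axis=0))

# Best partition number, Eq. (42), searched over c = 2,...,10 as in Section 5:
# c_best = min(range(2, 11), key=lambda c: partition_entropy(fcm(Y, c)[1]))
# Crisp labels by the defuzzification rule of Eq. (44):
# labels = np.argmax(fcm(Y, c_best)[1], axis=0)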

Step 4: Defuzzify the fuzzy partitions into crisp partitions. This is done by the following defuzzification rule: the element $y_j$ is assigned to the partition whose index is

$i^* = \arg\max_{i=1,\ldots,c_{best}} u_{ij}$    (44)

Disregarding the virtual partition $P_v$, which contains the whole virtual set $Y_v$ only, the class set $Y_c$ forms a partition set

$P = \{P_i \mid i = 1,2,\ldots,c_{best}-1\}$    (45)

Step 5: Determination of the largest main body partition.


Let

$P_m^1 = \arg\max_{i=1,\ldots,c_{best}-1} |P_i|$    (46)

(the largest main body has the most training points). Then:

IF $c_{best} = 2$ THEN
    $P_m = P_m^1$. END the ODM.
ELSEIF $c_{best} > 2$ THEN
    Discard the largest main partition $P_m^1$ temporarily. The partition set becomes $P = \{P_i \mid i = 1,2,\ldots,c_{best}-2\}$.
    Initialization: $c^* = c_{best}-2$, $q = 1$, $r = 0$. Go to step 6.
END

Step 6: Let $P_T = \arg\max_{i=1,\ldots,c^*} |P_i|$ ($P_T$ is a temporary partition). Find the main body partitions and the outlier partitions as follows:

IF $|P_T| \le \frac{1}{\gamma}|P_m^a|$ for $a = 1,\ldots,q$, THEN
    $r = r + 1$; $P_o^r = P_T$; $P_o = \{P_o^b \mid b = 0,\ldots,r\}$, where $P_o^0 = \emptyset$
ELSE
    $q = q + 1$; $P_m^q = P_T$; $P_m = \{P_m^a \mid a = 1,\ldots,q\}$
END

Discard the temporary partition $P_T$. Reset $c^* = c^* - 1$ and $P = \{P_i \mid i = 1,\ldots,c^*\}$. Repeat step 6 until $c^* = 0$; then stop the ODM.

So far, the main body partitions and the outlier partitions in a class set $Y_c$ have been found. Accordingly, the training data points in the outlier partitions are the outliers.
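The IF-THEN procedure of steps 5 and 6 can be summarized as the following sketch, which takes the crisp partitions of the class set (virtual partition excluded) and splits them into main body partitions and outlier partitions; the list-of-member-lists representation is ours.

def split_main_and_outliers(partitions, gamma=10.0):
    """Sketch of steps 5-6: `partitions` is a list of member lists/arrays, one per
    crisp partition of the class set (the virtual partition already removed)."""
    parts = sorted(partitions, key=len, reverse=True)
    main_bodies = [parts[0]]                 # step 5: the largest main body
    outliers = []
    for P_T in parts[1:]:                    # step 6: largest remaining partition first
        if all(len(P_T) <= len(P_m) / gamma for P_m in main_bodies):
            outliers.append(P_T)             # much smaller than every main body so far
        else:
            main_bodies.append(P_T)          # comparable in size: another main body
    return main_bodies, outliers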

Some explanations about the ODM are given below.

In step 1, we use an unsupervised neural network, Kohonen's 1-D SOM, to reduce the dimension of the training data points, because the SOM can convert patterns of arbitrary dimensionality (the original feature space) into one- or two-dimensional arrays of neurons (the transformed feature space). One of the major properties of the SOM is its topology-preserving map, which preserves the neighborhood relations of the input patterns [6,9,13]. Namely, the SOM is capable of mapping similar input patterns onto geometrically close output nodes [12]. Another property of the SOM is noted in [14]; therefore, we have step 1.

In step 3, we use FCM to partition the set $Y$ instead of a crisp clustering method such as the K-means algorithm, because it is difficult to define outliers with a crisp boundary. Hence, having the best partition number, FCM assigns each element different membership values to every cluster, so that an element can be properly assigned to a partition according to its maximum belongingness. Namely, FCM forms fuzzy boundaries between clusters, and the defuzzification rule converts the fuzzy boundaries into crisp boundaries.

As to the virtual set, it can be viewed as an auxiliary tool in the ODM. Since the number of clusters must be larger than or equal to two, we have to integrate a virtual set into a new set, so that the original class set is not partitioned into two or more partitions when there are no outliers in the class set and the class set has only one main body partition. In other words, if we performed the fuzzy clustering method directly on a class set with high compactness, the BPN would still be larger than one. In fact, "high compactness" means that the class set should form only one cluster in the feature space. To avoid this unsuitable partitioning, it is necessary to add a virtual set to the class set, and therefore we have step 2. Moreover, in order to prevent the partition set in Eq. (45) from containing any elements that belong to the virtual set, the virtual set and the class set should satisfy the separation condition in step 2. Furthermore, in order to make the virtual partition $P_v$ contain all the elements of $Y_v$, the compactness condition should be satisfied as well.

The goal of step 5 is to find the largest main body partition, which contains the most elements (see Eq. (46)). Certainly, the ODM stops at this step if the BPN equals two; otherwise, it goes to step 6 to find other main body partitions or outlier partitions. In step 6, the positive real number $\gamma$ is an inter-cluster weight factor that provides a threshold such that the temporary partition $P_T$ can be assigned to either a main body partition or an outlier partition according to the weight ratio $1/\gamma$ between partitions. The IF-THEN rule in step 6 is based on the assumption that the number of outliers is much smaller than the number of elements in the main body, as mentioned at the beginning of this section. Finally, the training data points in the outlier partitions $P_o$ form an outlier set $O$; otherwise, they belong to the main body set $M$.

B. Membership Functions

Basically, the rule for assigning membership values depends on the relative importance of the training data points in their classes. In order to obtain the fuzzy training set $S_f$, we define two membership functions to fuzzify the main body set and the outliers in a class (positive or negative class), respectively. If $x_i \in M$, the membership values are defined according to the distances to the center of the main body as

$m_i = 1 - \frac{\|x_i - \bar{x}\|}{\max_j \|x_j - \bar{x}\|} + \varepsilon$, $x_j \in M$    (47)

where $\|\cdot\|$ denotes the Euclidean distance, $\varepsilon$ is a sufficiently small positive number, and $\bar{x}$ is the center of all data points in $M$. The membership function $m_i$ for the main body set is bounded in $[\varepsilon, 1]$. For outliers, the membership function is

$m_i = \sigma$, $\forall x_i \in O$    (48)

where $\sigma$, $0 < \sigma < \varepsilon$, is a sufficiently small positive value, so that the effect of outliers can be reduced substantially in the training process of FSVMs.
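Putting Eqs. (47)-(48) together, the membership assignment for one class can be sketched as follows; the clipping to [eps, 1] reflects the bound stated above, and the default values follow the experiments in Section 5.

import numpy as np

def assign_memberships(X_main, X_out, eps=0.1, sigma=0.01):
    """Memberships for the main body set M (Eq. (47)) and the outlier set O (Eq. (48))."""
    center = X_main.mean(axis=0)                       # center of the main body
    dist = np.linalg.norm(X_main - center, axis=1)
    m_main = 1.0 - dist / dist.max() + eps             # Eq. (47)
    m_main = np.clip(m_main, eps, 1.0)                 # keep within [eps, 1]
    m_out = np.full(len(X_out), sigma)                 # Eq. (48): a tiny constant
    return m_main, m_out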

5. Experimental Results

In the experiments, the difference between the SVM and the proposed FSVM is discussed. The proposed membership model with the outlier detection method (ODM) is used to fuzzify the training set, so that the FSVM can show good ability against outliers.

A result of the SVM for two-class classification is shown in Figure 3. The value of the free parameter $C$ is 5. In this figure, the symbols "+" and "x" denote the positive class and the negative class, respectively. The data points with square symbols around them are the so-called support vectors. The distance between the two dotted lines is the margin of separation. The line between the two dotted lines is the optimal hyperplane, i.e., the decision surface. Obviously, data points no. 5 and no. 11 can be regarded as outliers of the positive class, since they are very far away from the rest and form only a small group. In the following, we perform the ODM to detect outliers from these two class sets. We assume that there are no outliers in the negative class; hence, we simply detect outliers from the positive class by the ODM.

Figure 3. A result of SVM for a two-class classification.

In our experiments, some parameters of the 1-D Kohonen SOM are set as follows:

1) Number of output nodes: 100. Number of input nodes: 2.
2) Neighborhood size: the initial value is 100, and it shrinks according to

$h_{j,y_c}(t) = \exp\left(-\frac{d_{j,y_c}^2}{2\sigma^2(t)}\right)$

where $d_{j,y_c}$ is the lateral distance between the winner $y_c$ and the excited neuron $j$ in the 1-D output space, and $\sigma(t)$ is defined as [6]

$\sigma(t) = 100\exp(-t/\tau_1)$, $\tau_1 = 200/\log 100$

Therefore, as time $t$ increases, the width $\sigma(t)$ decreases exponentially, and the topological neighborhood shrinks in a corresponding manner.
3) Learning rate: the initial value is 0.9 and is updated by $\eta(t) = 0.9\exp(-t/200)$. Hereafter, it decreases gradually, but its minimum value is set to 0.01.
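For reference, the neighborhood and learning-rate schedules listed above can be written as follows; the time constant of sigma(t) is reconstructed here as 200/ln 100, following the form in [6], and should be read as an assumption.

import numpy as np

TAU1 = 200.0 / np.log(100.0)          # assumed time constant for sigma(t)

def sigma(t):
    return 100.0 * np.exp(-t / TAU1)  # neighborhood width, shrinking with time

def neighborhood(j, winner, t):
    d = abs(j - winner)               # lateral distance in the 1-D output space
    return np.exp(-d**2 / (2.0 * sigma(t)**2))

def learning_rate(t):
    return max(0.9 * np.exp(-t / 200.0), 0.01)   # floored at 0.01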

All 20 positive training data points are fed into the 1-D SOM to learn the weight vectors. Learning is stopped when the mean square error (MSE) remains almost constant at a small value. Figure 4 shows the MSE of the 1-D SOM for the learning of the positive class.

Figure 4. MSE of the 1-D SOM for the learning of the positive class.

After learning, all winners along the 1-D output space are determined by the similarity matching, and the result is shown in Figure 5. Clearly, data points no. 5 and no. 11 are clustered together. Note that these two data points are still far away from the rest, which form another cluster. As a result, the 1-D SOM shows the topology-preserving property. Since the dimension of the input space has been transformed from 2 to 1 by the unsupervised feature map, the next step is to find a virtual set (step 2).

Figure 5. Winner distribution along the 1-D output space.


For searching a virtual set that satisfies the separation condition in step 2 of the ODM, let all elements in the virtual set have a minimum distance between themselves and the class set. This distance is actually a threshold and is set to 20, so that the virtual set and the class set have a large separation. On the other hand, for obtaining a high compactness, let all elements in the virtual set have the same location. After searching, the locations of the virtual elements are at 96 on the output layer of the 1-D SOM (see Figure 6). In this case, the value of SI is 0.7237 and the value of CI is 0.0 (maximum compactness).

Figure 6. Distributions of the positive class set and the virtual set on the output layer of the 1-D SOM.

Since the virtual set, $Y_v = \{96\}$ with $|Y_v| = p = 20$, has been found, the candidate set $Y = \{Y_c, Y_v\}$ is formed, where $Y_c$ is the corresponding winner set of the positive training data points. In step 3, we perform the FCM algorithm to partition the candidate set. Before partitioning, the PE is used as the validity indicator to find the BPN. The search range for the number of clusters is from 2 to 10, and the results are listed in Table 2. As a result, the BPN is $c_{best} = 3$, because the value of PE is minimum when $c = 3$. Hence, the FCM algorithm is performed to partition the set $Y$ with the best partition number. The exponential weight $e$ of FCM is set to 2. Finally, the clustering result is obtained and drawn in Figure 7.

Table 2. PEs for different numbers of clusters.
c :  2      3      4      5      6      7      8      9      10
PE:  0.0085 0.0068 0.0423 0.0426 0.0550 0.0471 0.0472 0.0536 0.0483

Figure 7. Clustering results of the candidate set with e = 2.

The partition centers are $v_1 = 95.95$, $v_2 = 12.59$, and $v_3 = 68.71$, respectively. One can see that all elements of the virtual set have high membership values to partition 1 and low membership values to the other two. Data points no. 5 and no. 11 have high membership values to partition 3, and the rest belong to partition 2. After defuzzifying with the defuzzification rule in Eq. (44), there are three crisp partitions: the first is the virtual partition $P_v$, the second is the main body of the positive class, and the third is an outlier partition. The above partition results are obtained from step 5 and step 6 of the ODM. In this experiment, the inter-cluster weight factor $\gamma$ is set to 10. Since partition 3 is an outlier partition, the two training data points no. 5 and no. 11 are outliers, and the remaining 18 training data points belong to the main body set.

After detecting the outliers with the ODM, all training data are assigned membership values with the membership functions in Eqs. (47) and (48). For the main body set, the lower bound $\varepsilon$ of the membership function is set to 0.1. For the outliers, $\sigma$ is set to 0.01. With the same value of $C$ as in the SVM and with $m$ set to 1, the result of the FSVM is shown in Figure 8. Clearly, the overfitting of the optimal hyperplane resulting from the FSVM is reduced compared with the optimal hyperplane resulting from the SVM. In addition, the classification error for the negative class is zero in the FSVM but two in the SVM. Besides, the number of training data points falling inside the margin of separation is reduced to zero in the FSVM, in contrast with the SVM.

Figure 8. The result of the FSVM with the same training data as in the experiment of the SVM.

6. Conclusions

In this paper, a fuzzy support vector machine (FSVM) is proposed to deal with the problem of overfitting in traditional support vector machines (SVMs) for two-class classification. In addition, a membership model is proposed to assign membership values to the main body training data points and to the outliers according to their relative importance in the training set. In the membership model, the proposed outlier detection method (ODM) detects outliers automatically from the training set. Experimental results show that the optimal hyperplane of the SVM is indeed influenced by outliers and thus exhibits overfitting. The proposed FSVM reduces the overfitting by using the membership model and the ODM to obtain the fuzzy training set. Accordingly, the number of classification errors is reduced.

References

[1] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, New York: Plenum Press, 1981.
[2] C. J. C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition," Data Mining and Knowledge Discovery, Vol. 2, pp. 121-167, 1998.
[3] V. Barnett and T. Lewis, Outliers in Statistical Data, New York: Wiley, 1994.
[4] C. Cortes and V. N. Vapnik, "Support Vector Networks," Machine Learning, Vol. 20, pp. 273-297, 1995.
[5] I. Guyon, N. Matic, and V. N. Vapnik, Discovering Information Patterns and Data Cleaning, Cambridge, MA: MIT Press, pp. 181-203, 1996.
[6] S. Haykin, Neural Networks: A Comprehensive Foundation, Second Edition, New Jersey: Prentice-Hall, 1999.
[7] F. Höppner, F. Klawonn, R. Kruse, and T. Runkler, Fuzzy Cluster Analysis: Methods for Classification, Data Analysis and Image Recognition, New York: John Wiley & Sons, 1999.
[8] C. W. Hsu and C. J. Lin, "A Comparison of Methods for Multiclass Support Vector Machines," IEEE Transactions on Neural Networks, Vol. 13, No. 2, pp. 415-425, 2002.
[9] T. Kohonen, "The Self-organizing Map," Proceedings of the IEEE, Vol. 78, No. 9, pp. 1464-1480, 1990.
[10] A. Kaufmann and M. Gupta, Introduction to Fuzzy Arithmetic: Theory and Applications, New York: Van Nostrand Reinhold Co., 1985.
[11] M. Kanevski, A. Pozdnukhov, S. Canu, and M. Maignan, "Advanced Spatial Data Analysis and Modeling with Support Vector Machines," International Journal of Fuzzy Systems, Vol. 4, No. 1, pp. 606-615, 2002.
[12] X. Liu, G. Cheng, and J. X. Wu, "Analyzing Outliers Cautiously," IEEE Transactions on Knowledge and Data Engineering, Vol. 14, No. 2, pp. 432-437, 2002.
[13] C. T. Lin and C. S. George Lee, Neural Fuzzy Systems: A Neuro-Fuzzy Synergism to Intelligent Systems, Englewood Cliffs, NJ: Prentice-Hall, 1996.
[14] C. T. Lin, Y. C. Lee, and H. C. Pu, "Satellite Sensor Image Classification using Cascaded Architecture of Neural Fuzzy Network," IEEE Transactions on Geoscience and Remote Sensing, Vol. 38, No. 2, pp. 1033-1043, 2000.
[15] J. C. Platt, N. Cristianini, and J. Shawe-Taylor, "Large Margin DAGs for Multiclass Classification," Advances in Neural Information Processing Systems, Vol. 12, Cambridge, MA: MIT Press, pp. 547-553, 2000.
[16] V. N. Vapnik, Statistical Learning Theory, New York: John Wiley & Sons, 1998.
[17] X. Zhang, "Using Class-Center Vectors to Build Support Vector Machines," Proc. IEEE NNSP'99, pp. 3-11, 1999.

Dr. Huang graduated from National Taipei Institute of Technology in 1977, and received the M.S. and Ph.D. degrees in electrical engineering from the University of Michigan, Ann Arbor, in 1982 and 1986, respectively. Since 1986 he has been with National Taiwan University, where he is currently a Professor in the Department of Mechanical Engineering and the Graduate Institute of Industrial Engineering. He was the Vice Chairperson of the Mechanical Engineering Department from August 1992 to July 1993, Director of the Semiconductor Industry Teaching Resource Center from January 2001 to December 2001, Director of the CIM Education Center, Taiwan IBM and Tjing Ling Industrial Research Institute from August 1989 to July 1996, and Director of the Manufacturing Automation Research Technology Center from August 1996 to July 1999. He has served as the Associate Dean of the College of Engineering, National Taiwan University, since August 2000. His research interests include machine intelligence, network-based manufacturing systems, intelligent robotic systems, prosthetic hands, nano manipulation, and nonlinear systems. Dr. Huang holds several patents on dexterous hands, real-time communication control, and semiconductor manufacturing.


Dr. Huang is a member of Tau Beta Pi, IEEE, SME, CFSA, and CIAE. He was the Chapter Chair of the IEEE Robotics and Automation Society Taipei Chapter (1996-1998), the Editor-in-Chief of the Journal of the Chinese Fuzzy Systems Association, the Program Chair of the 1998 International Conference on Mechatronic Technology (ICMT'98), and the General Vice Chair of the Eighth International Fuzzy Systems Association World Congress (IFSA'99), 1999. He is also the Publication Chair of the 2003 IEEE International Conference on Robotics and Automation, the Program Chair of the 2003 International Conference on Automation Technology, and the Organizing Committee Chair of the 2002 Asia-Pacific Conference on Industrial Engineering and Management Systems. He is the co-author of the book "Fuzzy Neural Intelligent Systems: Mathematical Foundation and the Applications in Engineering," published by CRC Press in January 2001. He was a Guest Editor of the IEEE/ASME Transactions on Mechatronics in 2001. Currently, he is the Editor-in-Chief of the International Journal of Fuzzy Systems and a FIRA Associate Editor. He has received the Distinguished Research Award from the National Science Council, Taiwan, R.O.C., three times. He is named in Who's Who in the World 2001 and 2002, and in Who's Who in the R.O.C. 2002.

Yi-Hung Liu is currently a Ph.D. student in the Department of Mechanical Engineering at National Taiwan University, Taiwan, R.O.C. He received the M.S. degree from the Department of Engineering Science and Ocean Engineering, National Taiwan University, in 1996. He was a teaching assistant there from August 1996 to July 1997. His research interests are in the areas of applications of neural networks and fuzzy set theory, automatic control, pattern recognition, data mining, computer vision, and machine learning.