Fuzzy Support Vector Machines for Pattern Recognition and Data Mining

Han-Pang Huang and Yi-Hung Liu
International Journal of Fuzzy Systems, Vol. 4, No. 3, September 2002

Abstract

A support vector machine (SVM) is a new, powerful classification machine and has been applied to many application fields, such as pattern recognition and data mining, in the past few years. However, there still remain some problems to be solved. One of them is that SVMs are very sensitive to outliers or noises because of the overfitting problem. In this paper, fuzzy support vector machines (FSVMs) are proposed to deal with this problem. A proper membership model is also proposed to fuzzify all the training data of the positive and negative classes. The outliers are detected by the proposed outlier detection method (ODM). The ODM is a hybrid method based on the fuzzy c-means (FCM) algorithm cascaded with an unsupervised neural network, the self-organizing map (SOM). Experimental results indicate that the proposed FSVMs actually reduce the effect of outliers and yield a higher classification rate than SVMs do.

Keywords: Support Vector Machines (SVMs), Fuzzy c-means (FCM), Self-organizing Map (SOM), Outlier Detection.

1. Introduction

Support vector machines (SVMs) are based on the statistical learning theory developed by Vapnik [2,4,6,16]. The formulation of SVMs embodies the structural risk minimization (SRM) principle. SVMs have gained wide acceptance due to their high generalization ability for a wide range of applications and better performance than other traditional learning machines [2]. In addition, SVMs have been applied to many classification and recognition fields, such as isolated handwritten digit recognition, object recognition, speech recognition [2], and spatial data analysis [11].

In an SVM, the original input space is mapped into a high-dimensional dot-product space via a kernel. The new space is called the feature space, in which an optimal hyperplane is determined to maximize the generalization ability. The optimal hyperplane can be determined by only a few data points, which are called support vectors (SVs). Accordingly, an SVM can provide good generalization performance for classification problems even though it does not incorporate problem-domain knowledge. This attribute is unique to SVMs [6].

However, there are two important issues in developing SVMs. One is how to extend two-class problems to multi-class problems efficiently. Several methods have been proposed, such as one-against-one, one-against-all, and the directed acyclic graph SVM (DAGSVM) [8,15]. These methods are based on solving several binary classifications; in other words, SVMs are originally designed for binary classification [4]. Hence, before enhancing the performance of SVMs for multi-class classification, the problem of SVMs in binary classification should be solved first. The other issue is to overcome the problem of overfitting in two-class classification [2]. As remarked in [17] and [5], SVMs are very sensitive to outliers and noises. This paper deals with this problem.

Different from SVMs, the proposed FSVMs treat the training data points with different importance in the training process. Namely, FSVMs fuzzify the penalty term of the cost function to be minimized, reformulate the constrained optimization problem, and then construct the Lagrangian so that the solutions for the optimal hyperplane in the primal form can be found in the dual form. The other goal of this paper is to provide a membership model with the function of outlier detection. Based on this membership model, all training data points are properly assigned different degrees of importance to their own classes so that FSVMs can avoid the phenomenon of overfitting due to outliers and noises.
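The effect of a fuzzified penalty term can be made concrete with a small sketch. The snippet below is an illustration only, not the authors' implementation: it uses scikit-learn's per-sample weights, which scale the penalty parameter C for each training point in the same spirit as FSVM membership values. The data, the injected outlier, and the membership values are hand-set placeholders, not the output of the ODM introduced in Section 4.

```python
# Sketch: down-weighting a suspected outlier when training an SVM.
# This mimics the FSVM idea of fuzzifying the penalty term: each training
# point x_i gets a membership s_i in (0, 1], and its slack is penalized by
# C * s_i instead of a uniform C.  scikit-learn's sample_weight scales C
# per sample, which gives the same effect for this illustration.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_pos = rng.normal(loc=+2.0, scale=0.5, size=(50, 2))
X_neg = rng.normal(loc=-2.0, scale=0.5, size=(50, 2))
X = np.vstack([X_pos, X_neg])
y = np.hstack([np.ones(50), -np.ones(50)])

# Inject one outlier: a positive-labeled point deep inside the negative class.
X = np.vstack([X, [[-2.0, -2.0]]])
y = np.hstack([y, [1.0]])

# Membership: 1.0 for ordinary points, a small value for the outlier.
membership = np.ones(len(y))
membership[-1] = 0.1

svm_plain = SVC(kernel="linear", C=10.0).fit(X, y)
svm_fuzzy = SVC(kernel="linear", C=10.0).fit(X, y, sample_weight=membership)

print("plain SVM weight vector:   ", svm_plain.coef_[0])
print("weighted SVM weight vector:", svm_fuzzy.coef_[0])
```

With the outlier down-weighted, the separating hyperplane is pulled far less toward the mislabeled point, which is exactly the behavior the FSVM formulation aims for.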
This paper is organized as follows. Section 2 briefly reviews the basic theory of SVMs. Section 3 first states the problem of SVMs, and then the proposed FSVM is formulated. In Section 4, a membership model for the FSVM is given, including the proposed outlier detection method (ODM). Section 5 conducts several experiments to indicate the merit of the FSVM. Finally, conclusions are given in Section 6.

Corresponding Author: Han-Pang Huang is with the Robotics Laboratory, Department of Mechanical Engineering, National Taiwan University, Taipei, 10660, Taiwan. TEL/FAX: (886) 2-23633875. E-mail: [email protected]

2. Basic Theory of SVMs

This section briefly introduces the theory of SVMs, including the linearly separable case, the linearly nonseparable case, and the nonlinear case, through a two-class classification problem [2,4,6,16]. Assume that a training set $S$ is given as

$$S = \{\mathbf{x}_i, y_i\}_{i=1}^{n} \qquad (1)$$

where $\mathbf{x}_i \in \mathbb{R}^N$ and $y_i \in \{-1, +1\}$. The goal of SVMs is to find an optimal hyperplane such that

$$\mathbf{w}^T \mathbf{x}_i + b \ge +1 \ \ \text{for } y_i = +1, \qquad \mathbf{w}^T \mathbf{x}_i + b \le -1 \ \ \text{for } y_i = -1 \qquad (2)$$

where the weight vector $\mathbf{w} \in \mathbb{R}^N$ and the bias $b$ is a scalar. If the inequalities in Eq. (2) hold for all training data, the problem is said to be linearly separable. In learning the optimal hyperplane, SVMs maximize the margin of separation $\rho$ between the classes, where $\rho = 2/\|\mathbf{w}\|$. Therefore, for the linearly separable case, finding the optimal hyperplane amounts to solving the following constrained optimization problem:

$$\text{Minimize } \ \Phi(\mathbf{w}) = \tfrac{1}{2}\mathbf{w}^T \mathbf{w} \qquad (3)$$
$$\text{Subject to } \ y_i(\mathbf{w}^T \mathbf{x}_i + b) \ge 1, \quad i = 1, 2, \ldots, n \qquad (4)$$

The above constrained optimization problem can be solved by quadratic programming (QP). However, if the inequalities in Eq. (2) do not hold for some data points in $S$, the problem becomes linearly nonseparable. In such a case, the margin of separation between classes is said to be soft, since some data points violate the separation conditions in Eq. (2). To set the stage for a formal treatment of nonseparable data points, SVMs introduce a set of nonnegative scalar variables, $\{\xi_i\}_{i=1}^{n}$, into the decision surface; i.e.,

$$y_i(\mathbf{w}^T \mathbf{x}_i + b) \ge 1 - \xi_i, \quad i = 1, 2, \ldots, n \qquad (5)$$

The $\xi_i$ are called slack variables. For $0 \le \xi_i < 1$, the data points fall inside the region of separation but on the right side of the decision surface. For $\xi_i > 1$, they fall on the wrong side of the decision surface. Now the goal of the SVM is to find a separating hyperplane for which the misclassification error is minimized while the margin of separation is maximized. Finding an optimal hyperplane for a linearly nonseparable case is to solve the following constrained optimization problem:

$$\text{Minimize } \ \Phi(\mathbf{w}, \boldsymbol{\xi}) = \tfrac{1}{2}\mathbf{w}^T \mathbf{w} + C\sum_{i=1}^{n} \xi_i \qquad (6)$$
$$\text{Subject to } \ y_i(\mathbf{w}^T \mathbf{x}_i + b) \ge 1 - \xi_i, \quad i = 1, 2, \ldots, n \qquad (7)$$
$$\xi_i \ge 0, \quad i = 1, 2, \ldots, n \qquad (8)$$

where $C$ is a user-defined positive parameter. It controls the tradeoff between the complexity of the machine and the number of nonseparable points; a larger $C$ assigns a higher penalty to errors. In particular, it is the only free parameter in SVMs.
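To make the primal problem (6)-(8) concrete: eliminating the slack variables gives the well-known equivalent unconstrained hinge-loss form $\tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \max(0,\, 1 - y_i(\mathbf{w}^T\mathbf{x}_i + b))$. The sketch below minimizes this objective with plain batch subgradient descent on toy data; it is only meant to illustrate the objective, whereas the paper solves the problem by QP (in the dual, as shown next). The data, learning rate, and epoch count are arbitrary assumptions.

```python
# Sketch: the soft-margin primal (Eqs. (6)-(8)) in its hinge-loss form,
#     minimize  0.5 * ||w||^2 + C * sum_i max(0, 1 - y_i * (w^T x_i + b)),
# minimized here by a toy batch subgradient-descent loop.  A real SVM solver
# would use QP or SMO; this only makes the tradeoff controlled by C concrete.
import numpy as np

def train_soft_margin_svm(X, y, C=1.0, lr=1e-3, epochs=2000):
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1.0                   # points with nonzero hinge loss
        grad_w = w - C * (y[active, None] * X[active]).sum(axis=0)
        grad_b = -C * y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(+1.5, 1.0, (40, 2)), rng.normal(-1.5, 1.0, (40, 2))])
y = np.hstack([np.ones(40), -np.ones(40)])
w, b = train_soft_margin_svm(X, y, C=1.0)
print("w =", w, " b =", b)
```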
For the primal problem (Eqs. (6)-(8)), it is difficult to find the solution by QP when it becomes a large-scale problem. Hence, by introducing a set of Lagrange multipliers $\alpha_i$ and $\beta_i$ for constraints (7) and (8), the primal problem becomes the task of finding the saddle point of the Lagrangian. Thus, the dual problem becomes

$$\text{Maximize } \ Q(\boldsymbol{\alpha}) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j \qquad (9)$$
$$\text{Subject to } \ \sum_{i=1}^{n} \alpha_i y_i = 0 \qquad (10)$$
$$0 \le \alpha_i \le C, \quad i = 1, 2, \ldots, n \qquad (11)$$

In the dual problem, the cost function $Q(\boldsymbol{\alpha})$ to be maximized depends on the training data only in the form of a set of dot products, $\{\mathbf{x}_i^T \mathbf{x}_j\}_{i,j=1}^{n}$. Furthermore, these Lagrange multipliers can be obtained by constrained nonlinear programming with the equality constraint in Eq. (10) and the inequality constraints in Eq. (11). The Kuhn-Tucker (KT) conditions play a central role in optimization theory and are defined by

$$\alpha_i\big[y_i(\mathbf{w}^T \mathbf{x}_i + b) - 1 + \xi_i\big] = 0, \quad i = 1, 2, \ldots, n \qquad (12)$$
$$\beta_i \xi_i = 0, \quad i = 1, 2, \ldots, n \qquad (13)$$

where $\beta_i = C - \alpha_i$. There are two types of $\alpha_i$. If $0 < \alpha_i \le C$, the corresponding data points are called support vectors (SVs). The optimal solution for the weight vector is given by

$$\mathbf{w}_o = \sum_{i=1}^{N_s} \alpha_i y_i \mathbf{x}_i \qquad (14)$$

where $N_s$ is the number of SVs. Moreover, in the case of $0 < \alpha_i < C$, we have $\xi_i = 0$ according to the KT condition in Eq. (13). Hence, one may determine the optimal bias $b_o$ by taking any data point in the set $S$ for which $0 < \alpha_i < C$ and therefore $\xi_i = 0$, and using that data point in the KT condition in Eq. (12). However, from the numerical perspective it is better to take the mean value of $b_o$ resulting from all such data points in $S$. Once the optimal pair $(\mathbf{w}_o, b_o)$ is determined, the decision function is obtained as

$$g(\mathbf{x}) = \operatorname{sign}\!\left(\sum_{i=1}^{N_s} \alpha_i y_i \mathbf{x}_i^T \mathbf{x} + b_o\right) \qquad (15)$$

Also, the optimal hyperplane is $g(\mathbf{x}) = 0$. Another case is $\alpha_i = C$; then $\xi_i > 0$ and can be computed from Eq. (12). If $\alpha_i = 0$, the corresponding data points are classified correctly.

Another merit of SVMs is to map the input vector into a higher-dimensional feature space and thus solve the nonlinear case.
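The recovery of $(\mathbf{w}_o, b_o)$ from the dual solution can be illustrated with a short sketch. The snippet below is an assumption-laden example rather than the paper's procedure: it trains a linear SVM with scikit-learn, reads back the dual coefficients $\alpha_i y_i$, reconstructs $\mathbf{w}_o$ via Eq. (14), and averages $b_o$ over the margin support vectors ($0 < \alpha_i < C$), as recommended above. The data set and parameter values are arbitrary toy choices.

```python
# Sketch: recovering w_o (Eq. (14)) and b_o from the dual solution of a
# trained linear SVM.  scikit-learn's SVC exposes alpha_i * y_i for each
# support vector as dual_coef_ and the support vectors as support_vectors_,
# so Eq. (14) becomes a single matrix product; b_o is averaged over the
# margin SVs (0 < alpha_i < C), for which xi_i = 0 and y_i(w^T x_i + b) = 1.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(+2.0, 1.0, (60, 2)), rng.normal(-2.0, 1.0, (60, 2))])
y = np.hstack([np.ones(60), -np.ones(60)])

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)

alpha_y = clf.dual_coef_[0]                    # alpha_i * y_i for each SV
w_o = alpha_y @ clf.support_vectors_           # Eq. (14)

# Margin SVs satisfy 0 < alpha_i < C, hence b = y_i - w_o^T x_i for each.
on_margin = np.abs(alpha_y) < C - 1e-8
y_sv = y[clf.support_][on_margin]
x_sv = clf.support_vectors_[on_margin]
b_o = np.mean(y_sv - x_sv @ w_o)

print("w_o:", w_o, "vs sklearn coef_:", clf.coef_[0])
print("b_o:", b_o, "vs sklearn intercept_:", clf.intercept_[0])

# Decision function of Eq. (15):
g = np.sign(X @ w_o + b_o)
print("training accuracy:", np.mean(g == y))
```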
