Memetic Algorithm Based Support Vector Machine Classification
International Journal of Innovative Management Information & Production
ISME International ⓒ 2014 ISSN 2185-5455
Volume 5, Number 1, March 2014, PP. 99-117

MEMETIC ALGORITHM BASED SUPPORT VECTOR MACHINE CLASSIFICATION

MINGNAN WU, ZHENYUAN XU, JUNZO WATADA
The Graduate School of Information, Production and Systems
Waseda University
Kitakyushu, 808-0135, Japan
[email protected]; [email protected]; [email protected]

ABSTRACT. Databases contain a great deal of hidden knowledge that can support decision making in commerce, research and other activities. Classification analysis is one of the core research topics in the field of pattern recognition. Algorithms such as the artificial neural network (ANN) and the support vector machine (SVM) have been proposed to perform binary classification according to the distribution of samples. Although knowledge discovery and data mining techniques have been applied successfully to many real-world problems, classifying an imbalanced dataset remains challenging, and the traditional classification algorithms mentioned above rarely work well on imbalanced datasets. In this dissertation, a novel model based on the memetic algorithm (MA) and the support vector machine (SVM) is proposed to classify large imbalanced datasets. It is named the MSVC (memetic support vector classification) model. The memetic algorithm is a recently proposed method, and it is used here as a heuristic framework for large imbalanced classification. Because SVM performs well in balanced binary classification, support vector classification (SVC) is combined with MA to improve SVM classification accuracy. G-mean is used to evaluate the final result, and the data employed are semiconductor manufacturing data from Intel Corp. Compared with some other existing models, the results show that the MSVC model is a suitable alternative for imbalanced dataset classification, and that it extends the applications of the memetic algorithm.
Keywords: Memetic Algorithm (MA); Support Vector Machine (SVM); Classification on Imbalanced Dataset; Memetic Support Vector Classification (MSVC)

1. Introduction. In this dissertation, a newly proposed MSVC model is employed to deal with classification on imbalanced datasets, and it shows its superiority over other existing models. This section discusses the research topics, the research motivation and the organization of this dissertation.

1.1. Research Topics. Classification plays a pivotal role in data mining and knowledge discovery. Many algorithms and models have been developed to deal with the binary classification problem, such as decision trees (Yuan and Shaw, 1995; Yang, 2006), neural networks (Siegelmann and Sontag, 1991), support vector machines (Boser et al., 1992) and nearest neighbor methods (Bremner et al., 2005). They have been well developed and applied in various domains. All of these algorithms assume that the classifier operates on data drawn from the same distribution as the training data, and that the final goal is to maximize accuracy. In the real world, however, this is not always the case: a dataset usually contains more instances of one class than of the other, and such a dataset is called an imbalanced dataset. In recent years, the imbalanced learning problem has attracted a lot of interest from academia, industry and government funding agencies. The class with more instances is referred to as the majority class, and the other as the minority class. Learning from large and imbalanced classes is a challenging problem. Applications such as health control, vision recognition and credit card fraud detection focus their attention on the minority class. Most conventional algorithms try to minimize the overall error rate and ignore the difference between the majority class and the minority class, so they do not perform well on imbalanced dataset classification.
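The weakness of the error rate on imbalanced data can be illustrated with a minimal sketch. The class sizes (990 majority, 10 minority) and the trivial "always predict the majority class" classifier are hypothetical and are not taken from the Intel data. G-mean is assumed here to follow its usual definition, the geometric mean of sensitivity and specificity.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fn, tn, fp), treating `positive` as the minority label."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    return tp, fn, tn, fp


def accuracy(y_true, y_pred):
    """Fraction of all instances classified correctly."""
    tp, fn, tn, fp = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fn + tn + fp)


def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity and specificity (assumed definition)."""
    tp, fn, tn, fp = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


# Hypothetical imbalanced dataset: 990 majority (0) and 10 minority (1) labels.
y_true = [0] * 990 + [1] * 10
# A trivial classifier that always predicts the majority class.
y_pred = [0] * 1000

print("accuracy:", accuracy(y_true, y_pred))
print("G-mean:", g_mean(y_true, y_pred))
```

On this toy data the trivial classifier reaches an accuracy of 0.99 although it never detects a minority instance, so its G-mean is 0. A criterion that rewards only overall accuracy therefore cannot separate such a classifier from a useful one.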
Dealing with imbalanced dataset classification using conventional algorithms is difficult. The higher the proportion of the majority class, the worse the conventional algorithms perform. Some modifications to the existing methods are needed to deal with imbalanced dataset classification. There are usually two basic lines of thinking in imbalanced classification: the data level approach and the algorithm level approach. At the data level, a common objective is to re-balance the data distribution by re-sampling the original training dataset. Tomek links (Ivan Tomek, 1976), one-sided selection (OSS) (Miroslav Kubat and Matwin Stan, 1997) and the neighborhood cleaning rule (NCL) (Jorma Laurikkala, 2001) were proposed to remove majority class examples, while the Synthetic Minority Over-sampling Technique (SMOTE) (Bianca Zadrozny and Elkan Charles, 2001) was proposed to form new minority class examples by interpolating between several minority class examples that lie close together. At the algorithm level, choosing an inductive bias is a common strategy for dealing with the imbalance problem. Researchers try to adapt conventional algorithms to strengthen the learning process with regard to the minority class. Improved decision trees (Bianca Zadrozny and Elkan Charles, 2001) and support vector machines (Yi Lin, 2002) have been proposed to achieve better classification.

1.2. Motivation. In the manufacturing industry, the volume of production is a key index that measures processing capacity, health and quality performance. In the semiconductor industry, with its mass production, it is quite obvious that production growth means a substantial cost reduction and revenue increase. Being able to predict the ultimate volume of production from upstream process data allows us to proactively address impending yield or quality concerns on the final product.
This helps us to intervene in problematic upstream process variations and excursions, and so to mitigate risks to product quality and supply line management. In this dissertation, a novel algorithm for large imbalanced dataset classification based on the memetic algorithm (MA) and the support vector machine (SVM) is proposed, and it is tested using semiconductor data obtained from Intel Corp. After pretreatment of the training data, the memetic algorithm helps us obtain an optimized partition of the training distribution space. As a heuristic framework, the memetic algorithm is also used to select the parameters of the SVM classification. Being different from the genetic algorithm (GA) (Goldberg and David, 1989), the memetic algorithm provides a proper alternative as an evolutionary algorithm (EA) (Eiben and Smith, 2003). This dissertation extends the application of the memetic algorithm, and the results show its feasibility.

1.3. Organization of This Dissertation. This dissertation is organized as follows. Section I briefly introduces the research topics in this area. Section II introduces background knowledge about imbalanced datasets, the memetic algorithm (MA) and the support vector machine (SVM). Section III describes the model proposed to deal with this classification problem, which is named the memetic support vector classification (MSVC) model. Section IV describes, step by step, the process of running the experiment under our proposal. Section V presents the results and compares them with some other existing proposals. Section VI gives the conclusion.

2. Background Knowledge. This section discusses some necessary background knowledge. Firstly, we introduce the dataset employed in this dissertation and give some guidance on imbalanced datasets.
Since classification on imbalanced datasets differs somewhat from traditional classification problems, a new evaluation criterion is introduced in this dissertation. Secondly, we introduce some basic knowledge about the memetic algorithm (MA), covering both its procedure and its individual learning. We compare MA with the genetic algorithm (GA) and find that, in certain situations, MAs are superior to GAs. Thirdly, we introduce some basic knowledge about the support vector machine (SVM).

2.1. Imbalanced Dataset.

2.1.1. Classification on imbalanced dataset. Binary classification is the task of classifying the members of a given set of objects into two groups on the basis of whether they have some property or not. Some typical binary classification tasks are:
(1) Medical testing to determine whether a patient has a certain disease or not (the classification property is the disease);
(2) Quality control in factories, i.e. deciding whether a new product is good enough to be sold or should be discarded (the classification property is being good enough);
(3) Deciding whether a page or an article should be in the result set of a search or not (the classification property is the relevance of the article, typically the presence of a certain word in it).
Statistical classification in general is one of the problems studied in computer science, in order to automatically learn classification systems; some methods suitable for learning binary classifiers include decision trees, Bayesian networks, support vector machines and neural networks. Two kinds of imbalance can be found in a binary classified dataset: one is called between-class imbalance and the other within-class imbalance. In the case of between-class imbalance, one class has many more examples than the other class; in the case of within-class imbalance, some subsets of one class have far fewer examples