International Journal of Innovative Management Information & Production ISME International ⓒ 2014 ISSN 2185-5455 Volume 5, Number 1, March 2014 PP. 99-117

MEMETIC ALGORITHM BASED SUPPORT VECTOR MACHINE CLASSIFICATION

MINGNAN WU, ZHENYUAN XU, JUNZO WATADA
The Graduate School of Information, Production and Systems
Waseda University, Kitakyushu, 808-0135, Japan
[email protected]; [email protected]; [email protected]

ABSTRACT. A database contains plenty of hidden knowledge, which can be used in decision making to support commerce, research and other activities. Classification analysis is one of the core research topics in this field. According to the distribution of samples, models such as the artificial neural network (ANN) and the support vector machine (SVM) have been proposed to perform binary classification. Although knowledge discovery and data mining techniques have successfully resolved many real-world applications, classifying imbalanced data remains challenging, and the traditional classification algorithms mentioned above hardly work well on imbalanced datasets. In this paper, a novel model based on the memetic algorithm (MA) and the support vector machine (SVM) is proposed to perform classification on large imbalanced datasets. It is named the MSVC (memetic support vector classification) model. The memetic algorithm is used here as a heuristic framework for the large imbalanced classification. Because of the high performance of SVM in balanced binary classification, support vector classification (SVC) is combined with MA to improve the classification accuracy. G-mean is used to evaluate the final result, and the data employed here are semiconductor manufacturing data from Intel Corp. Compared with several other existing models, the results show that the MSVC model is a proper alternative for imbalanced dataset classification, and it extends the applications of the memetic algorithm.

Keywords: Memetic Algorithm (MA); Support Vector Machine (SVM); Classification on Imbalanced Dataset; Memetic Support Vector Classification (MSVC)

1. Introduction. In this paper, a newly proposed MSVC model is employed to deal with classification on an imbalanced dataset, and its advantages over other existing models are shown. In this section, the research topics, the research motivation and the organization of this paper are discussed.

1.1. Research Topics. Classification plays a pivotal role in data mining and knowledge discovery. Many algorithms and models have been developed to deal with binary classification problems, such as decision trees (Yuan and Shaw, 1995; Yang, 2006), neural networks (Siegelmann and Sontag, 1991), support vector machines (Boser et al., 1992) and nearest neighbor (Bremner et al., 2005). They have been well developed and applied in various domains. All of the above algorithms assume that the classifiers operate on data drawn from the same distribution as the training data, and the final goal is to maximize the accuracy. In the real world this is not always the case: a dataset usually contains more instances from one class than from the other, and such a dataset is called an imbalanced dataset. In recent years, the imbalanced learning problem has attracted a lot of interest from academia, industry and government funding agencies. The class with more instances is referred to as the majority class, and the other as the minority class. Learning from large and imbalanced classes is a challenging problem. Applications such as health control, vision recognition and credit card fraud detection focus their attention on the minority class. Most conventional algorithms try to minimize the error rate, ignoring the differences between the majority class and the minority class, so they cannot perform well on imbalanced dataset classification. The higher the proportion of the majority class, the worse the conventional algorithms perform. Some modifications of the existing methods are therefore needed to deal with imbalanced dataset classification. Usually there are two basic approaches to imbalanced classification: the data-level approach and the algorithm-level approach. At the data level, re-balancing the data distribution by re-sampling the original training dataset is a common objective. Tomek links (Ivan Tomek, 1976), one-sided selection (OSS) (Miroslav Kubat and Matwin Stan, 1997) and the neighborhood cleaning rule (NCL) (Jorma Laurikkala, 2001) were proposed to remove majority class examples, while the Synthetic Minority Over-sampling Technique (SMOTE) (Bianca Zadrozny and Elkan Charles, 2001) was proposed to form new minority class examples by interpolating between several minority class examples that lie together. At the algorithm level, choosing an inductive bias is a common strategy to deal with the imbalanced problem. Researchers try to adapt conventional algorithms to strengthen the learning process regarding the minority class. An improved decision tree (Bianca Zadrozny and Elkan Charles, 2001) and support vector machine (Yi Lin, 2002) have been proposed to achieve better classification.

1.2. Motivation. In the manufacturing industry, the volume of production is a key index that measures processing capacity, health and quality performance. It is quite obvious in the semiconductor industry, with its mass production, that production growth means a substantial cost reduction and revenue increase. Being capable of predicting the ultimate volume of production from upstream process data allows us to proactively address impending yield or quality concerns on the final product. This helps us intervene on problematic upstream process variations and excursions to mitigate risks to product quality and supply line management. In this paper, a novel algorithm for large imbalanced dataset classification is proposed based on the memetic algorithm (MA) and the support vector machine (SVM), and it is tested using semiconductor data obtained from Intel Corp. After pretreatment of the training data, the memetic algorithm helps us obtain an optimized partition of the training distribution space. As a heuristic framework, the memetic algorithm also runs to select the parameters of the SVM classification. Being different from the genetic algorithm (GA) (Goldberg and David, 1989), the memetic algorithm provides a proper alternative as an evolutionary algorithm (EA) (Eiben and Smith, 2003). This paper extends the application of the memetic algorithm, and the results show its feasibility.

1.3. Organization of This Paper. This paper is organized as follows. In Section I we briefly introduce the research topics in this area. In Section II, we introduce some background knowledge about imbalanced datasets, the memetic algorithm (MA) and the support vector machine (SVM). Section III introduces the problem domain, a semiconductor manufacturing dataset. In Section IV we describe the model proposed to deal with this classification, named the memetic support vector classification (MSVC) model. In Section V, the experimental procedure and results are presented and compared with some other existing proposals. Section VI concludes the paper.

2. Background Knowledge. In this section, some necessary background knowledge is discussed. First, we introduce the dataset employed in this paper and provide guidance on imbalanced datasets. Since classification on an imbalanced dataset differs somewhat from traditional classification problems, a new evaluation criterion is introduced. In the following part, we introduce some basic knowledge about the memetic algorithm (MA), covering both its procedure and its individual learning. We compare MA with the genetic algorithm (GA), and find that, in certain situations, MAs are actually preferable to GAs. Third, we introduce some basic knowledge about the support vector machine (SVM).

2.1. Imbalanced Dataset.

2.1.1. Classification on imbalanced dataset. Binary classification is the task of classifying the members of a given set of objects into two groups on the basis of whether they have some property or not. Some typical binary classification tasks are (1) medical testing to determine whether a patient has a certain disease or not (the classification property is the disease); (2) quality control in factories, i.e., deciding whether a new product is good enough to be sold or should be discarded (the classification property is being good enough); (3) deciding whether a page or an article should be in the result set of a search or not (the classification property is the relevance of the article, typically the presence of a certain word in it).

Statistical classification in general is one of the problems studied in machine learning, in order to automatically learn classification systems; some methods suitable for learning binary classifiers include decision trees, Bayesian networks, support vector machines, and neural networks. There are two kinds of imbalance commonly found in a binary-classified dataset: one is named between-class imbalance, the other within-class imbalance. In the case of between-class imbalance, one class has many more examples than the other class; in the case of within-class imbalance, some subsets of one class have far fewer examples than other subsets of the same class (Qiong Gu et al., 2008). In imbalanced datasets, classes having more examples are called the majority classes and the ones having fewer examples are called the minority classes. In many real-world domains imbalanced datasets do exist, for example, spotting unreliable telecommunication customers, detection of oil spills in satellite radar images and detection of fraudulent telephone calls. In these real applications, the ratio of the minority class to the majority class can be drastic, such as 1 to 1000, or even more. This factor is named the Imbalance Ratio (IR), while another factor is named Lack of Information (LI), which means that instances in the minority class are quite rare. Take the following two datasets as an example: one has 100:10000 instances, while the other has 10:1000 instances. Although the IRs of the two datasets are the same, the LIs are different; considering the LI factor, the former dataset is easier to classify accurately.

2.1.2. Evaluation of classification on imbalanced dataset. Let us classify the imbalanced dataset into a positive class (P) and a negative class (N). Thus we obtain the confusion matrix shown in Table 2.1.

TABLE 2.1. Confusion matrix

            PPos    PNeg
    APos    TPos    FNeg
    ANeg    FPos    TNeg

where PPos stands for Predicted Positive, PNeg for Predicted Negative, APos for Actual Positive, ANeg for Actual Negative, TPos for True Positive, FPos for False Positive, FNeg for False Negative and TNeg for True Negative. The majority class is usually notated as positive, and the minority class as negative. Thus we can easily calculate the true positive rate (TPR) and the true negative rate (TNR) by the following equations, respectively:

    TPR = \frac{TPos}{TPos + FNeg}    (1)

    TNR = \frac{TNeg}{TNeg + FPos}    (2)

Another term, the correctly classified rate (CCR), can be obtained by the following equation:

    CCR = \frac{TPos + TNeg}{APos + ANeg}    (3)

CCR can assess the classification result well if APos is almost the same as ANeg. But for an imbalanced dataset, even if a high CCR is achieved, it may be contributed mostly by the majority class, with a high TPR and a low TNR. CCR therefore does not meet the need for evaluating classification on an imbalanced dataset. Another way is to use the geometric mean (G-mean), which is defined as:

    G\text{-mean} = \sqrt{TPR \times TNR}    (4)

G-mean enables us to evaluate the classifier's performance fairly on an imbalanced dataset; at least it is much better than CCR.
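For illustration only, the following short Python sketch (with hypothetical counts) computes TPR, TNR, CCR and G-mean from the confusion-matrix entries defined in equations (1)-(4); it is not part of the original experiments.

    import math

    def evaluate(tpos, fneg, tneg, fpos):
        # True positive rate, equation (1)
        tpr = tpos / (tpos + fneg)
        # True negative rate, equation (2)
        tnr = tneg / (tneg + fpos)
        # Correctly classified rate, equation (3); APos = tpos + fneg, ANeg = tneg + fpos
        ccr = (tpos + tneg) / (tpos + fneg + tneg + fpos)
        # Geometric mean, equation (4)
        g_mean = math.sqrt(tpr * tnr)
        return tpr, tnr, ccr, g_mean

    # Hypothetical counts: CCR looks high while G-mean exposes the weak minority class.
    print(evaluate(tpos=9500, fneg=500, tneg=40, fpos=60))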

2.2. Memetic Algorithm.

2.2.1. Brief of memetic algorithm. The memetic algorithm (MA) (Moscato, 1989) was first viewed as being close to a form of population-based hybrid genetic algorithm coupled with an individual learning procedure capable of performing local refinements. It appeared not long after the theory of Universal Darwinism (Richard Dawkins, 1983), which suggests that evolution is not limited to biological systems and not confined to genes, but is also applicable to any complex system that exhibits the principles of inheritance, variation and selection (Wikipedia). The term "meme" (Richard Dawkins, 1976) is defined as "the basic unit of cultural transmission, or imitation", and also as "an element of culture that may be considered to be passed on by non-genetic means". In other words, a meme can be considered as any unit of information observable in the environment. Memes are similar to genes in that they are self-replicating, but differ from genes in that they are transmitted through imitation rather than being inherited. Furthermore, memes replicate in a Lamarckian or Baldwinian manner, in that changes to a meme during its lifetime are passed on. Examples of memes are catch phrases, stories, fashion, technology and chain letters. Thus, we are able to determine how far the concept of the meme is being used, or in fact whether it is being used at all; modifications to the algorithm to further mimic the concept of the meme are then suggested.

Pseudo code of MA. Process of the memetic algorithm:
    Define: stopping condition s, number of individuals n;
    Initialize: generate an initial population P;
    While (s is false) do {
        Evolve new individuals by stochastic offset;
        Evaluate every individual i in the total population;
        Select the best n individuals W;
        for w in W {
            Perform individual learning using meme;
            Proceed with individual learning;
        } end for;
    } end while;
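As a rough, non-authoritative illustration of the pseudo code above, the following Python sketch realizes the same loop; the initialization, fitness, offset and meme-based local learning are left as abstract functions because they depend on the problem domain.

    def memetic_algorithm(init_population, offset, fitness, local_learning, n, stop):
        # Generate an initial population P of n individuals.
        population = init_population(n)
        generation = 0
        while not stop(population, generation):
            # Evolve new individuals by a stochastic offset.
            children = [offset(ind) for ind in population]
            # Evaluate every individual in the total population and select the best n (W).
            survivors = sorted(population + children, key=fitness, reverse=True)[:n]
            # Perform individual learning using the meme.
            population = [local_learning(w) for w in survivors]
            generation += 1
        return max(population, key=fitness)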

2.2.2. Individual learning of memetic algorithm. In the individual learning step, the meme is defined according to the problem domain; a meme can be defined as a real value, a binary value, a vector and so on. Later, in the MSVC model, we explain the meme used in this imbalanced dataset classification. In the step "proceed with individual learning", usually Lamarckian (Lamarck, 1794-1796) learning or Baldwinian (Baldwin, 1896) learning is used. A Lamarckian example is that "giraffes stretching their necks to reach leaves high in trees strengthen and gradually lengthen their necks; these giraffes have offspring with slightly longer necks". A Baldwinian example is that "individuals who learn the behavior to escape from a predator more quickly will obviously be at an advantage; as time goes on, the ability to learn the behavior will improve (by genetic selection), and at some point it will seem to be an instinct". Individual learning is performed between generations, in addition to the techniques used by the genetic algorithm (GA) to explore the search space, namely recombination/crossover and mutation. For this reason, the memetic algorithm is also known as a hybrid GA (Moscato, 2002). It is performed to improve the fitness of the population (in a localized region of the solution space, or under a certain rule) so that the next generation inherits "better" memes from its parents, hence the claim that memetic algorithms can reduce convergence time. Memetic algorithms incorporate the concept of memes by allowing individuals to change before the next population is produced, or under certain rules. Individuals may copy parts of memes from other individuals to improve their own fitness. The individual learning algorithm adopted in a memetic algorithm is somewhat dependent on the problem being solved; however, the common trait of any individual learning is that the parameters of the algorithm cannot be changed. This does not follow the definition of a meme, which can be changed because the adopted meme is the individual's own interpretation of it. Furthermore, when memes are transmitted, changes to them are also passed on. Memes affect the behavior of an individual, and the individuals do not modify the memes themselves.
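The difference between the two learning schemes can be sketched in code (purely illustrative, with abstract local_search and fitness functions): Lamarckian learning writes the refined solution back into the individual, while Baldwinian learning keeps the original individual and only credits it with the improved fitness.

    def lamarckian_step(individual, local_search, fitness):
        # Lamarckian learning: the refined solution replaces the individual itself,
        # so acquired improvements are inherited directly.
        refined = local_search(individual)
        return refined, fitness(refined)

    def baldwinian_step(individual, local_search, fitness):
        # Baldwinian learning: the individual keeps its original form but is
        # credited with the fitness it could reach after learning.
        refined = local_search(individual)
        return individual, fitness(refined)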

2.2.3. GA vs MA. A population-based search algorithm called the genetic algorithm (GA) is commonly used to solve combinatorial optimization problems where the goal is to find the best solution in a (possibly unknown) solution space. GA uses the principle of biological evolution to generate successively better solutions from previous generations of solutions. The memetic algorithm (MA) is an extension of GA which incorporates a local-search/learning algorithm for each solution between generations. MA is different from the genetic algorithm (GA) (Goldberg, 1994).

The biggest difference lies in the fact that MA uses the meme as the heuristic unit, while GA uses the gene. In computation, genes are usually denoted by binary or real values, whereas memes can be expressed by an object; MA can perform well no matter what values are assigned to the memes. Another difference is that MA and GA use different ways to evolve new individuals. GA uses crossover and mutation to generate new individuals, which is essentially a stochastic process; MA generates new individuals after learning, so this process follows certain rules. Theoretically, MA can optimize more efficiently than GA, which means it generates better results in fewer generations. Also, according to Pastorino (2004), MA is able to improve convergence time, hence making it more favorable than GA.

2.2.4. Applications of MA. To further incorporate the concept of memes into MAs, individual learning could involve a history-dependent component. This history component can be referred to as a rule base, whereby it could dictate or influence the outcome of the individual learning. A rule base could be defined as a particular search direction or a particular meme to imitate. Like current local searches, the rule base would be dependent on the problem being solved. Each individual in the population has a probability of adopting this rule base, or a typical local search algorithm. If the rule base is not adopted and the individual finds a better fitness on its own, it may change the rule base. The rule base is reinforced as more individuals adopt it, but it decreases in popularity if following it results in lowered fitness. In contrast to memes, this rule base is not inherited through recombination, but exists throughout the generations to the end. The rule base can be updated several times during the iterations and affect individuals in different ways. This is very similar to trends or fashions in human society, which appear, disappear or change over a period of time. Recent applications of MA include (but are not limited to): training of artificial neural networks (Ichimura and Kuriyama, 1998), pattern recognition (Aguilar and Colmenares, 1998), robotic motion planning (Ridao et al., 1998), beam orientation (Haas et al., 1998), circuit design (Harris and Ifeachor, 1998), electric service restoration (Augugliaro et al., 1998), medical expert systems (Wehrens et al., 1993), single machine scheduling (França et al., 1999), automatic timetabling (notably, the timetable for the NHL (Costa, 1995)), manpower scheduling (Aickelin, 1998), nurse rostering and function optimization, processor allocation (Ozcan and Onbasioglu, 2006), maintenance scheduling (for example, of an electric distribution network (Burke and Smith, 1999)), VLSI design (Areibi and Yang, 2004), clustering of gene expression profiles (Merz and Zell, 2002), feature/gene selection (Zexuan Zhu et al., 2007; Zexuan Zhu et al., 2007) and multi-class, multi-objective feature selection (Zexuan Zhu et al., 2008). We will explain the learning process later in the MSVC model for this particular application.

2.3. SVM.

2.3.1. Brief of SVM. Support vector machines (SVMs) (Vapnik, 1995) are a set of related supervised learning methods for classification and regression analysis. The goal of SVM modeling is to find the optimal hyper-plane that separates the samples, with one category on one side of the plane and the other category on the other side. The vectors near the hyper-plane are called the support vectors. An SVM finds the hyper-plane oriented so that the margin between the support vectors is maximized.

FIGURE 2.1. SVM algorithm

As shown in Figure 2.1, for a binary classification, conventional classifiers would just identify a decision boundary w between the two classes, while SVMs identify support vectors (SVs) H1 and H2 that create a margin between the two classes, making the data more separable than in the case of conventional classifiers.

Given a training set of labeled pairs (x_i, y_i), i = 1, ..., l, where x_i \in R^d and y_i \in \{-1, 1\} indicates the category information, any point lying on the hyper-plane separating the classes satisfies w^T x + b = 0, with w being the normal to the hyper-plane and |b| / \|w\| being the perpendicular distance from the hyper-plane to the origin. This problem can be transformed into the following quadratic programming optimization problem:

    \min_{w,b} \frac{1}{2}\|w\|^2
    s.t. \; y_i (w^T x_i + b) \ge 1    (5)

In particular, when the samples of the dataset are mislabeled or not linearly separable, a so-called slack variable \xi_i is employed to accommodate the non-linearly separable outliers. Thus equation (5) can be rewritten as follows:

    \min_{w,b,\xi} \frac{1}{2} w^T w + C \sum_{i=1}^{n} \xi_i
    s.t. \; y_i (w^T \phi(x_i) + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, ..., n    (6)

Its dual problem is then

    \min_{a} \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j y_i y_j K(x_i, x_j) - \sum_{i=1}^{n} a_i
    s.t. \; 0 \le a_i \le C, \quad i = 1, ..., n; \quad \sum_{i=1}^{n} a_i y_i = 0    (7)

Here C is the upper bound on the error, and the function K(x_i, x_j) is the kernel function that describes the behavior of the support vectors. Since the radial basis function kernel is used here,

    K(x_i, x_j) = \phi(x_i)^T \phi(x_j) = \exp\left(-\gamma \|x_i - x_j\|^2\right)    (8)

where \gamma > 0. The parameters C and \gamma are tuned to make the SVM fit the training data well, and they ultimately contribute to reducing the error rate of this classification.
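In practice, the soft-margin SVM with the RBF kernel of equations (6)-(8) is available in off-the-shelf libraries; the snippet below is a minimal sketch using scikit-learn with placeholder data, not the implementation used in this paper.

    import numpy as np
    from sklearn.svm import SVC

    # Placeholder training data: rows are scaled (num1*, num2*) pairs, labels are 0/1.
    X_train = np.array([[0.79, 0.78], [0.88, 0.87], [0.60, 0.61], [0.95, 0.93]])
    y_train = np.array([0, 1, 0, 1])

    # C is the upper bound on the error term in (6)-(7); gamma is the RBF width in (8).
    clf = SVC(kernel="rbf", C=10.0, gamma=0.5)
    clf.fit(X_train, y_train)
    print(clf.predict(np.array([[0.85, 0.84]])))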

2.3.2. Applications of SVM. The support vector machine (SVM) algorithm (Boser et al., 1992; Vapnik, 1998) is a classification algorithm that provides state-of-the-art performance in a wide variety of application domains, including handwriting recognition, object recognition, speaker identification, face detection and text categorization (Cristianini and Shawe-Taylor, 2000). During the past years, SVMs have been applied very broadly within the field of computational biology, to pattern recognition problems including protein remote homology detection, microarray gene expression analysis, recognition of translation start sites, functional classification of promoter regions, prediction of protein-protein interactions, and peptide identification from mass spectrometry data. Two main motivations suggest the use of SVMs in computational biology. First, many biological problems involve high-dimensional, noisy data, for which SVMs are known to behave well compared with other statistical or machine learning methods. Second, in contrast to most machine learning methods, kernel methods like the SVM can easily handle non-vector inputs, such as variable-length sequences or graphs. These types of data are common in biology applications and often require the engineering of knowledge-based kernel functions.

3. Problem Domains. In this section we introduce the imbalanced dataset employed in this paper: semiconductor manufacturing data from Intel Corp.

3.1. Semiconductor Manufacturing Dataset. Semiconductor manufacturing, particularly the manufacturing test process, involves the assembly of good- and bad-performance products. Even though bad products are the minority and their amount is significantly smaller than that of good products, a classifier is required to distinguish products in terms of their performance, since a bad-performance product should be prevented from further processing as early as possible. Hence, a good classifier can provide manufacturing with better performance in terms of reducing cost, time and machine utilization. We obtained an encrypted semiconductor dataset from Intel, and we performed our experiments on this dataset. According to the introduction from Intel, the minority implies the defective products and the majority implies the qualified products. If we can find these defective products, we can guarantee the rate of qualified products; we should never classify a minority instance as a majority instance. An instance of this dataset consists of four attributes: two numerical input parameters, one categorical parameter and one output. Two typical instances are shown in Table 3.1.

TABLE 3.1. Examples of real instances

    output   categ   num1     num2
    0        103A    236780   236400
    1        134A    262540   260630

Feature num1 ranges from 174610 to 299720, and feature num2 ranges from 174650 to 301800. The categorical parameter categ consists of 210 different types, ranging from 1A to 210A. The output is the classification information, notated as 1 and 0. See Table 3.2 for details.

TABLE 3.2. Semiconductor data detail

    attribute       num1            num2            categ       output
    Training data   174610-261970   174650-278140   1A - 210A   0, 1
    Testing data    180800-299370   176670-301770   1A - 210A   0, 1

The data in the training dataset and the testing dataset are imbalanced: the number of samples with label 0 is much greater than the number of samples with label 1 in each dataset. The data with label 0 form the majority class, and the data with label 1 form the minority class. Table 3.3 and Table 3.4 show some features of the datasets; the numbers are the counts of instances in each class.

TABLE 3.3. Training data status

    Majority Class   Minority Class   Total    Minority/Total
    107424           8890             116314   0.076

TABLE 3.4. Testing data status

    Majority Class   Minority Class   Total    Minority/Total
    107529           8821             116350   0.076


4. MSVC Model. In this part the proposed MSVC model is introduced for the first time. We combine MA and SVC, where MA runs as the heuristic framework and SVC runs as the classifier. The processes of the memetic algorithm and support vector classification are introduced separately, and then integrated to perform the final classification.

4.1. Brief of MSVC Model.

4.1.1. Summary of MSVC. In the proposed MSVC model, MA provides a heuristic framework to find an optimal partition of the original training space, and SVM is used to perform classification within a given partition. In the SVM process, another MA optimization is performed to find the optimal parameters C and γ for the SVM. Before the MSVC model is applied to the data to be classified, it is necessary to pretreat the data to make them suitable for the classification calculation. Specifically, for the imbalanced dataset here, a certain re-sampling strategy is selected to adapt to the problems caused by format and distribution differences.

4.1.2. Re-sampling. Re-sampling methods are commonly used for dealing with class-imbalance problems, since they straightforwardly balance the dataset at the preprocessing stage. Typically, there are three methods of data re-sampling that can help to solve the imbalanced data analysis problem: Down Sampling, Over Sampling and Advanced Sampling [35][36][37]. Down Sampling reduces the sample size of the majority class by randomly selecting samples from it. Over Sampling randomly and repeatedly selects samples to increase the size of the minority class. Another sampling technique that combines Down Sampling and Over Sampling is called Advanced Sampling, which applies priors on sample selection for both the majority and minority populations to achieve an equal sample size. Down-sampling the dataset causes information loss, while over-sampling introduces noise. Therefore, the key question of the re-sampling problem is whether simply varying the skewness of the data distribution can improve predictive performance systematically. Imbalanced datasets have two inner factors, namely the imbalance ratio (IR) and lack of information (LI), as mentioned before. IR is the ratio of the majority sample quantity to the minority sample quantity; LI is the lack of information for the minority class. In other words, IR describes the relative imbalance, in which the quantity of instances in the minority class is not necessarily small, whereas LI describes absolute scarcity, meaning that instances of the minority class are quite rare. Both of these inherent factors are present in every imbalanced dataset learning problem, in combination with other external factors, such as overlap, complexity, size of the data, high dimensionality, etc.
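A minimal NumPy sketch of Down Sampling and Over Sampling is given below (array names and sizes are hypothetical); Advanced Sampling would combine the two so that both classes reach a common target size.

    import numpy as np

    def down_sample(majority, target_size, rng):
        # Randomly keep only target_size rows of the majority class.
        idx = rng.choice(len(majority), size=target_size, replace=False)
        return majority[idx]

    def over_sample(minority, target_size, rng):
        # Randomly repeat minority rows until target_size rows are reached.
        idx = rng.choice(len(minority), size=target_size, replace=True)
        return minority[idx]

    rng = np.random.default_rng(0)
    majority = rng.normal(size=(1000, 2))   # placeholder majority-class samples
    minority = rng.normal(size=(80, 2))     # placeholder minority-class samples
    balanced = np.vstack([down_sample(majority, 500, rng),
                          over_sample(minority, 500, rng)])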

4.1.3. Memetic process. In the memetic process of the MSVC model, the objective is to partition the distribution in a proper way for the sake of the SVC process.

There is a dilemma: if we partition the space into pieces that are too small, it leads to a lot of calculation time, while if we partition it into pieces that are too big, or not small enough, the superiority of MSVC does not appear. It is sensible to run more than one experiment and compare all the results in order to find a near-optimal solution. The following is the pseudo code for the memetic algorithm in the heuristic framework.

Pseudo Code of MSVC.
    Step 1: scale the data in the training dataset:
        num1* = num1 / max(numerical 1)
        num2* = num2 / max(numerical 2)
    Step 2: code the training space into distribution blocks.
    Step 3: define the stopping condition s, the number of partitions p and the number of memes n.
    Step 4: generate n initial p-partition memes randomly, and train them using SVM.
    Step 5: evolve n new memes from the n previous memes by a stochastic/ruled offset, number every one, and train them using SVM.
    Step 6: test these memes using the testing dataset; calculate the fitness of all 2n memes after SVM classification.
    Step 7: select the n best memes W.
    Step 8: for every w in W, compare the differences between each pair of previous meme and generated meme, which is a learning process.
    Step 9: update the rule base according to the learning result.
    Step 10: if s is satisfied, terminate the computation; if not, go back to Step 4.

The above procedure is the memetic process of MSVC. After generations of imitation, or when a certain computing threshold is reached, it comes to an end. Scaling makes the data suitable for SVM classification. Here a meme consists of a memeID and partition information. Before we perform the classification, we do some pretreatment on the dataset: we cut the training space into different pieces. Some blocks contain no distribution, and they do not count; the blocks that do contain distributions are numbered for the sake of partitioning. The detailed partition information is explained in the experiment section. In this overall procedure, the stopping condition s, the number of partitions p and the number of memes n are crucial for controlling the whole computation; these three parameters determine the final computation time and classification accuracy. In Step 5, in the first cycle, new memes are evolved in a random way. After the first cycle, since we obtain some rules in Step 8 and Step 9, we evolve new memes under these rules. In this step, the offset changes the combination of blocks in each partition: it deletes one or more blocks from one partition and puts it/them into another partition, and no partition should contain a redundant block. In the experiment section we explain this more clearly for this particular situation. For different situations, offsets can be totally different. For example, when finding an optimal result for a numerical computation, the offset can simply be a random real value between 0 and 1, while for a travelling salesman problem (TSP) [7], the offset can be a binary value. The application domain determines the offsets.
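The following Python-style sketch is one possible reading of Steps 4-10, assuming that a meme is represented as a list of partitions and each partition as a list of block IDs; the SVM training, the G-mean fitness and the rule base are abstracted away, so it should not be taken as the authors' implementation.

    import random

    def offset(meme, rng):
        # Offset: move one block from one partition to another; blocks stay unique.
        child = [list(partition) for partition in meme]
        src, dst = rng.sample(range(len(child)), 2)
        if child[src]:
            block = child[src].pop(rng.randrange(len(child[src])))
            child[dst].append(block)
        return child

    def msvc_outer_loop(initial_memes, gmean_of, learn_rule, stop, n, seed=0):
        rng = random.Random(seed)
        memes = list(initial_memes)                                        # Step 4
        while not stop(memes):
            children = [offset(m, rng) for m in memes]                     # Step 5
            ranked = sorted(memes + children, key=gmean_of, reverse=True)  # Step 6
            for parent, child in zip(memes, children):                     # Steps 8-9
                learn_rule(parent, child)
            memes = ranked[:n]                                             # Step 7
        return max(memes, key=gmean_of)                                    # Step 10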

Each pair of memes has a parent meme and a child meme. The comparisons in Step 8 are performed on every pair to confirm what leads to higher fitness. These experiences form rules, and these rules are stored in the rule base in the form of a list. Since some memes have already obtained their fitness in a previous computation cycle, these memes are marked; thus in Step 6 we can reduce the computation load.

4.1.4. SVC process. In the SVM classification section, an MA optimization is performed to find the optimal parameters C and γ for the SVM. The following is the procedure of this MA optimization.

Pseudo Code of SVC.
    Step 1: define the stopping condition s and the number of memes n.
    Step 2: initialize n (C, γ) memes randomly.
    Step 3: evolve n new memes from the n previous memes by a stochastic/ruled offset, and number every one.
    Step 4: calculate the fitness of all 2n memes.
    Step 5: select the n best individuals as memes W.
    Step 6: for every w in W, compare the differences between each pair of previous meme and generated meme, which is a learning process.
    Step 7: update the rule base according to the learning result.
    Step 8: if s is satisfied, terminate the computation; if not, go back to Step 3.

The above procedure is the MA process of the SVM. After generations of imitation, or when a certain computing threshold is reached, it comes to an end. Here a meme consists of a memeID and two real values, C and γ. The offset is simply a random value; after a transformation, it changes C and γ. The standard genetic algorithm codes the genes as binary strings, similar to how DNA codes information in living organisms. This is fine if the parameters really are binary. If the parameters have been transformed to binary, then some pairs of bits should usually be inherited from the same parent, but the algorithm will not know which ones those are; if no transformation has been made, then each location should be inherited independently of all the others. In this memetic algorithm process, however, the learning process helps to find which bits should be inherited frequently. In summary, this memetic process searches the parameter space more efficiently than the traditional standard genetic algorithm.
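A compact sketch of this inner MA search for (C, γ) is given below; the fitness function (for example, a cross-validated G-mean of an RBF SVM) is left abstract, the learned rules are omitted, and the offset C*(1+r) with r drawn from (-1, 1) follows the description of the SVC process in Section 5.1.2; it is an interpretation, not the original code.

    import random
    import time

    def evolve_svm_parameters(fitness, n=8, time_budget=8.0, seed=0):
        # fitness(C, gamma) is abstract here, e.g. a cross-validated G-mean of an RBF SVM.
        rng = random.Random(seed)
        memes = [(rng.uniform(1.0, 100.0), rng.uniform(0.01, 1.0)) for _ in range(n)]
        deadline = time.time() + time_budget          # stopping condition s (8 seconds)
        while time.time() < deadline:
            children = []
            for C, gamma in memes:
                r = rng.uniform(-1.0, 1.0)            # offset r in (-1, 1)
                children.append((max(C * (1.0 + r), 1e-6),        # new C = C*(1+r)
                                 max(gamma * (1.0 + r), 1e-6)))   # same treatment for gamma
            memes = sorted(memes + children,
                           key=lambda m: fitness(*m), reverse=True)[:n]
        return memes[0]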

5. Experiments and Results. In this section, we report the experiments and the results obtained. We applied MSVC to this particular classification problem, and the results demonstrated the advantages of MSVC. From the discussion of the experimental results, we found that the more resources the model consumed, the better the results it obtained.


5.1. Application of MSVC.

5.1.1. Memetic process of MSVC. Before we perform the classification, we do some pretreatment on the dataset. As shown in Figure 5.1, we grid the training space into different units; this demo grids the original space into a 7 by 6 matrix.

FIGURE 5.1. Distribution blocks

Some blocks contain no distribution, so they do not count. Of these 42 blocks, there are in total 15 blocks that contain distributions, and we number them for the sake of partitioning. Specifically, in our experiment, we can number these blocks as [1][0], [0][1], [1][1], [2][1], [1][3], [2][3], [3][3], [2][4], [3][4], [4][4], [3][5], [4][5], [5][5], [4][6], [5][6] according to their positions in the grid space. These blocks are arranged in different combinations to form a partition, and for a given partition, SVM is performed to train these data for classification. The accuracy of the SVM classification is tracked for each partition in a single meme; if this SVM classification proves to perform well after comparison, it forms a rule on proper partitions, and this rule is updated into MSVC's rule base. Within a single partition, the blocks do not have to be connected with each other. For example, given the blocks in Figure 5.1, if we initialize the training space as 4 partitions with 4 memes in an MSVC process, then one of these 4 memes might be: partition 1, [1][0], [0][1], [1][1] (block IDs are stored in a list); partition 2, [2][1], [1][3]; partition 3, [2][4], [3][4], [4][4], [3][5], [4][5], [5][5], [4][6]; partition 4, [3][3], [5][6]. The offset deletes one or more blocks from one partition and puts one or some of them into another partition, and no partition should contain a redundant block.

For instance, if one meme at time T(n) is the meme above, a meme formed from it at T(n+1) could be: partition 1, [1][0], [0][1], [1][1] (block IDs are stored in a list); partition 2, [2][1], [1][3], [4][6]; partition 3, [2][4], [3][4], [4][4], [3][5], [4][5]; partition 4, [3][3], [5][5], [5][6]. SVM classifications are performed on each partition in both memes, and their SVM classification accuracies are recorded and compared. After a period of learning, it can be found that when certain blocks appear together in one partition the classification performs better; eventually these regular patterns form a rule base. For instance, suppose the classification of partition 2 with [2][1], [1][3], [4][6] gives a good result; then the information that [2][1], [1][3], [4][6] should be grouped together is updated into the rule base, and it is called a pattern. The more patterns are found, the better the partitioning performs, and the better this MSVC performs.
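To make the offset concrete, the transition from T(n) to T(n+1) above can be written as a small, purely illustrative data manipulation: block [4][6] moves from partition 3 to partition 2, and block [5][5] moves from partition 3 to partition 4.

    # Meme at time T(n): four partitions, each a list of block IDs [row][column].
    meme_tn = [
        [(1, 0), (0, 1), (1, 1)],                                   # partition 1
        [(2, 1), (1, 3)],                                           # partition 2
        [(2, 4), (3, 4), (4, 4), (3, 5), (4, 5), (5, 5), (4, 6)],   # partition 3
        [(3, 3), (5, 6)],                                           # partition 4
    ]

    def move_block(meme, block, src, dst):
        # Offset step: remove a block from one partition and append it to another.
        child = [list(partition) for partition in meme]
        child[src].remove(block)
        child[dst].append(block)
        return child

    # Moving [4][6] from partition 3 to partition 2, and [5][5] from partition 3 to
    # partition 4, yields the meme at T(n+1) described above.
    meme_tn1 = move_block(move_block(meme_tn, (4, 6), 2, 1), (5, 5), 2, 3)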

5.1.2. SVC process of MSVC. The MA process within the SVM is very simple, since it only concerns numerical computation. The stopping condition is different from the condition in the outer MSVC process: here it is set so that one computation terminates after 8 seconds. The offset here is simply a single value r between -1 and 1. A new parameter is evolved as C*(1+r), and γ is evolved in the same way.

5.1.3. Integration of MSVC. Table 5.1 lists some common notations used in the experiments.

TABLE 5.1. Notations for computation

    Attribute   Meaning
    num1*       num1 / max(numerical 1)
    num2*       num2 / max(numerical 2)
    S           Stopping condition (50 cycles of computation / 8 seconds of computation)
    P           Number of partitions
    N           Number of memes
    C           Parameter C for SVM
    γ           Parameter γ for SVM
    offset      Generated randomly / under a certain rule
    W           Candidate memes

In total, we performed five experiments on this dataset, as shown in Table 5.2. In the first experiment, the training space was gridded into a 32 by 32 matrix, with 8 partitions and 8 memes. In the second experiment, the training space was gridded into a 32 by 32 matrix, with 16 partitions and 8 memes.

In the third experiment, the training space was gridded into a 64 by 64 matrix, with 8 partitions and 8 memes. In the fourth experiment, the training space was gridded into a 32 by 32 matrix, with 8 partitions and 16 memes. In the fifth experiment, the training space was gridded into a 256 by 256 matrix, with 32 partitions and 32 memes. The stopping condition is that they should finish within 50 cycles of calculation. The MA optimization of each SVM's parameters was uniformly set to 8 seconds of computation.

TABLE 5.2. Results of MSVC

    Experiment   TPR     TNR     G-mean
    First        81.3%   84.7%   83.0%
    Second       87.4%   85.1%   86.2%
    Third        88.3%   86.2%   87.2%
    Fourth       81.9%   85.1%   83.4%
    Fifth        89.2%   88.9%   89.0%

5.1.4. Conclusions of MSVC. Comparing the first experiment with the second, as the number of partitions increases, accuracy improves. Comparing the first experiment with the third, as the number of grid cells increases, accuracy improves. Comparing the first experiment with the fourth, as the number of memes increases, accuracy improves. However, as the numbers of partitions, grid cells and memes increase, the computation becomes more and more resource-consuming, in both time and space. This implies that if we consume more and more resources, the classification becomes more and more accurate. Through these experiments, we find that the MSVC model is able to deal with the imbalanced classification problem: all of the TPR and TNR values are higher than 0.8, so neither TPR nor TNR is neglected.

5.2. Comparison with Other Algorithms. To evaluate the performance of MSVC, we compare the best experimental result with the model in [14] and some other research by Watada et al., as shown in Table 5.3. These models are SVM-RBF (support vector machine with radial basis function kernel), BN (Bayesian network classification), RSA-SVM (rough set algorithm based support vector machine), GBT (gradient boosted trees classification), MIPS (multi-input probabilistic system), TSLM-ANN, MSVC (memetic support vector classification) and ANN (artificial neural network).


TABLE 5.3. Comparisons with other models

    Model     G-mean       Model      G-mean
    SVM-RBF   79.0%        BN         82.2%
    RSA-SVM   90.1%        GBT        88.9%
    MIPS      87.8%        TSLM-ANN   87.9%
    MSVC      89.0%        ANN        63.4%

Although MSVC did not provide the best result among the listed models, the comparisons show that it is a proper alternative for classification, since its G-mean accuracy is high and MSVC deals with both the majority and the minority fairly. MSVC can also execute the computation procedure in fewer steps than RSA-SVM. Besides, this implementation extends the application of the memetic algorithm (MA), which is meaningful. Still, the memetic algorithm should be developed further in other fields.

6. Conclusions. Data mining on imbalanced datasets is important, including the classification problem. In this paper, we reviewed imbalanced dataset classification, the memetic algorithm and the support vector machine, and then proposed a novel model named the memetic algorithm based support vector machine for classification (the MSVC model). G-mean was used to test the applicability of the proposed MSVC model. The data used here are semiconductor data from Intel, used for both training and testing. After five experiments, it was shown that the MSVC model provides a proper alternative for the imbalanced classification problem: the MSVC model can classify the majority and the minority fairly. This study also extends the applications of the memetic algorithm, which is most meaningful.

REFERENCES

[1] A. Augugliaro, L. Dusonchet, E. Riva-Sanseverino (1998), Service restoration in compensated distribution networks using a hybrid genetic algorithm, Electric Power Systems Research, vol.46, no.1, pp.59–66.
[2] A. E. Eiben and J. E. Smith (2003), Introduction to evolutionary computing, Springer.
[3] B. E. Boser, I. M. Guyon and V. N. Vapnik (1992), A training algorithm for optimal margin classifiers, in D. Haussler, editor, 5th Annual ACM Workshop on COLT, Pittsburgh, PA, ACM Press, pp.144–152.
[4] Bianca Zadrozny and Elkan Charles (2001), Learning and making decisions when costs and probabilities are both unknown, The Seventh International Conference on Knowledge Discovery and Data Mining, pp.204–213.
[5] D. Bremner, E. Demaine, J. Erickson, J. Iacono, S. Langerman, P. Morin, G. Toussaint (2005), Output-sensitive algorithms for computing nearest-neighbor decision boundaries, Discrete and Computational Geometry, vol.33, no.4, pp.593–604.
[6] D. Costa (1995), An evolutionary tabu search algorithm and the NHL scheduling problem, Infor, vol.33, pp.161–178.
[7] D. L. Applegate, R. M. Bixby, V. Chvátal, W. J. Cook (2006), The traveling salesman problem, ISBN 0691129932.

[8] E. Burke and A. Smith (1999), A memetic algorithm to schedule planned maintenance for the national grid, Journal of Experimental Algorithmics, vol.4, no.4, pp.1–13.
[9] E. Ozcan (2007), Memes, self-generation and nurse rostering, Lecture Notes in Computer Science (Springer-Verlag), vol.3867, pp.85–104.
[10] E. Ozcan and E. Onbasioglu (2006), Memetic algorithms for parallel code optimization, International Journal of Parallel Programming, vol.35, no.1, pp.33–61.
[11] Goldberg and E. David (1989), Genetic algorithms in search, optimization and machine learning, Addison Wesley, pp.41.
[12] H. T. Siegelmann and E. D. Sontag (1991), Turing computability with neural nets, Appl. Math. Lett., vol.4, no.6, pp.77–80.
[13] Ivan Tomek (1976), Two modifications of CNN, IEEE Transactions on Systems, Man, and Cybernetics, vol.6, no.11, pp.769–772.
[14] J. Aguilar and A. Colmenares (1998), Resolution of pattern recognition problems using a hybrid genetic/random neural network learning algorithm, Pattern Analysis and Applications, vol.1, no.1, pp.52–61.
[15] Jorma Laurikkala (2001), Improving identification of difficult small classes by balancing class distribution, The 8th Conference on AI in Medicine in Europe, Lecture Notes in Computer Science, pp.63–66.
[16] Junzo Watada, Lee-Chuan Lin, Lei Ding, Mohd Ibrahim Shapiai, Lim Chun Chew, Zuwairie Ibrahim, Lee Wen Jau and Marzuki Khalid (2010), A rough-set-based two-class classifier for large imbalanced dataset, Smart Innovation, Systems and Technologies, vol.4, pp.641–651.
[17] Lei Ding, Junzo Watada, Lim Chun Chew, Zuwairie Ibrahim, Lee Wen Jau and Marzuki Khalid (2010), A SVM-RBF method for solving imbalanced data problem, ICIC EL, vol.4, pp.2419–2424.
[18] Lim Chun Chew, Zuwairie Ibrahim, Lee Wen Jau and Marzuki Khalid (2009), A probabilistic classifier for large and imbalanced data, Universiti Teknologi Malaysia, ATTD-a APAC pathfinding, Intel Malaysia.
[19] Miroslav Kubat and Matwin Stan (1997), Addressing the curse of imbalanced training sets: one-sided selection, The 14th International Conference on Machine Learning, pp.179–186.
[20] M. Ridao, J. Riquelme, E. Camacho, M. Toro (1998), An evolutionary and local search algorithm for planning two manipulators motion, Lecture Notes in Computer Science (Springer-Verlag), pp.105–114.
[21] O. Haas, K. Burnham, J. Mills (1998), Optimization of beam orientation in radiotherapy using planar geometry, Physics in Medicine and Biology, vol.43, no.8, pp.2179–2193.
[22] P. França, A. Mendes, P. Moscato (1999), Memetic algorithms to minimize tardiness on a single machine with sequence-dependent setup times, Proceedings of the 5th International Conference of the Decision Sciences Institute, Athens, Greece, pp.1708–1710.
[23] P. Merz and A. Zell (2002), Clustering gene expression profiles with memetic algorithms, Parallel Problem Solving from Nature—PPSN VII, Springer, pp.811–820.
[24] Qiong Gu, Zhihua Cai, Li Zhu, Bo Huang (2008), Data mining on imbalanced data sets, International Conference on Advanced Computer Theory and Engineering.
[25] R. Wehrens, C. Lucasius, L. Buydens, G. Kateman (1993), HIPS, a hybrid self-adapting expert system for nuclear magnetic resonance spectrum interpretation using genetic algorithms, Analytica Chimica Acta, vol.277, no.2, pp.313–324.

[26] S. Areibi and Z. Yang (2004), Effective memetic algorithms for VLSI design automation = genetic algorithms + local search + multi-level clustering, (MIT Press), vol.12, no.3, pp.327–353.
[27] S. Harris and E. Ifeachor (1998), Automatic design of frequency sampling filters by hybrid genetic algorithm techniques, IEEE Transactions on Signal Processing, vol.46, no.12, pp.3304–3314.
[28] T. Yang (2006), Computational verb decision trees, International Journal of Computational Cognition (Yang's Scientific Press), vol.4, no.4, pp.34–46.
[29] T. Ichimura and Y. Kuriyama (1998), Learning of neural networks with parallel hybrid GA using a royal road function, IEEE International Joint Conference on Neural Networks, New York, pp.1131–1136.
[30] U. Aickelin (1998), Nurse rostering with genetic algorithms, Proceedings of the Young Operational Research Conference, Guildford, UK.
[31] V. Vapnik, S. Golowich and A. Smola (1997), Support vector method for function approximation, regression estimation and signal processing, in M. Mozer, M. Jordan and T. Petshe (eds), Advances in Neural Information Processing Systems 9, Cambridge, MA, MIT Press, pp.281–287.
[32] Yi Lin, Lee Yoonkyung, Wahba Grace (2002), Support vector machines for classification in nonstandard situations, Machine Learning, vol.46, no.1-3, pp.191–202.
[33] Y. Yuan and M. J. Shaw (1995), Induction of fuzzy decision trees, Fuzzy Sets and Systems, vol.69, pp.125–139.
[34] Zexuan Zhu, Y. S. Ong and M. Dash (2007), Markov blanket-embedded genetic algorithm for gene selection, Pattern Recognition, vol.49, no.11, pp.3236–3248.
[35] Zexuan Zhu, Y. S. Ong and M. Dash (2007), Wrapper-filter feature selection algorithm using a memetic framework, IEEE Transactions on Systems, Man and Cybernetics - Part B, vol.37, no.1, pp.70–76.
[36] Zexuan Zhu, Y. S. Ong and M. Zurada (2008), Simultaneous identification of full class relevant and partial class relevant genes, IEEE/ACM Transactions on Computational Biology and Bioinformatics.
[37] Zhu Ming (2002), Data mining, University of Science and Technology of China Press.