2nd International Conference on Modelling, Identification and Control (MIC 2015)

Data Preprocessing and Classification for Taproot Site Data Sets of Panax Notoginseng

Dao Huang, Jin He*
School of Electronics and Information Engineering, Yunnan University of Nationalities, Kunming, China
e-mail: [email protected]; *corresponding author, e-mail: [email protected]

Abstract—Herbs from different producing regions differ in their active constituents and efficacy, and the quality of a herb from its authentic region is better than that of the same herb from other regions. Nowadays many peddlers substitute non-authentic herbs for authentic-region herbs in order to make more money, so it is important to distinguish herbs from different producing regions. This paper studies the data preprocessing and classification of taproot site data sets of Panax notoginseng from three different producing regions. We compare the effects of data preprocessing steps, including data standardization, instance selection, and attribute selection, to find the best method and parameter settings for the data sets. Finally, we apply different classification algorithms to the preprocessed data and compare their classification performance to find the optimal algorithm for the data sets. Classification performance in the experiments was evaluated by Percent Correct (PC), Mean Squared Error (MSE), Kappa Statistic (KS), Area Under ROC (AUR), and Mean Absolute Error (MAE). The results show that standardizing the data by decimal scaling and choosing the attribute subset {1, 2, 4, 6, 7, 8} suit this data, and that Random Forest and AdaBoost.M1 are the optimal classification algorithms for these data sets.

Keywords—AdaBoost.M1; authentic-region herbs; Random Forest; data preprocessing

I. INTRODUCTION

There are different identification methods in the field of herb identification, such as the traditional method of identifying a herb by its appearance [1-2], or identifying herbs by analyzing the ingredients extracted with advanced chemical instruments [3-6]. Mining the data extracted by the chemical instruments can make herb identification more effective.

Data preprocessing is one of the most important parts of data mining. Related studies in this area have mainly focused on two aspects: data cleaning and data reduction. In terms of data cleaning, researchers have studied anomaly detection [7], ways to clean duplicated records [8-9], and ways to clean the data [10-11]. In terms of data reduction, they have studied dimensionality reduction for high-dimensional data [12] and discretization techniques [13], etc.

At present, research on herbs' data classification has focused on traditional single-classifier algorithms [14-15], but most single classification algorithms are not suitable for every kind of data. To solve this problem, many scholars turned to multi-classifier algorithms and proposed many good ones, such as Bagging [16], Boosting [17], AdaBoost [18], and Random Forest [19]. These multi-classifier algorithms have been applied in many fields with good effect, so we can use them in herbs' data classification to improve classification performance.

In order to distinguish herbs from different producing regions, this paper first preprocesses the data extracted from the taproot site of Panax notoginseng, determines the optimal preprocessing method by comparing several evaluation measures, and then compares several classification algorithms to choose the better ones. The study demonstrates that decimal scaling standardization together with the attribute subset {1, 2, 4, 6, 7, 8} is suitable for this data, and that the suitable classification algorithms are Random Forest and AdaBoost.M1. This can be helpful in the study of herbs' data classification.

II. METHOD

A. Data standardization

There are three common standardization operations: zero-mean standardization, decimal scaling standardization, and minimum-maximum standardization.

Minimum-maximum standardization (Min-Max) is a linear transformation of the data. Suppose $S$ is an attribute, $\min_S$ is the minimum value of $S$, and $\max_S$ is the maximum value of $S$. Minimum-maximum standardization onto a target interval $[a, b]$ is defined as

$w = \frac{v - \min_S}{\max_S - \min_S}(b - a) + a$    (1)

which maps each value $v$ of $S$ to a value $w$ in $[a, b]$. Minimum-maximum standardization preserves the relationships between the original data values.

Zero-mean standardization (Z-m) is defined as

$w = \frac{v - \mu_S}{\sigma_S}$    (2)

where $v$ is a value of $S$, $w$ is the standardized result, $\mu_S$ is the mean of $S$, and $\sigma_S$ is the standard deviation of $S$.

Decimal scaling standardization (DS) moves the decimal point of the values of $S$; how far the point is moved depends on the maximum absolute value of $S$. It is defined as

$w = \frac{v}{10^{j}}$    (3)

where $j$ is the smallest integer such that $\max(|w|) < 1$.
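As a concrete illustration, the following is a minimal sketch of equations (1)-(3) in Java, the language of the WEKA toolkit used for the experiments below. The class name, method names, and sample values are our own illustrative assumptions, not part of the paper.

```java
// Sketch of the three standardizations of Section II.A (equations 1-3).
// Names and sample data are illustrative assumptions.
import java.util.Arrays;

public class Standardization {

    // Min-Max (eq. 1): linearly map the values of S onto [a, b].
    static double[] minMax(double[] s, double a, double b) {
        double min = Arrays.stream(s).min().getAsDouble();
        double max = Arrays.stream(s).max().getAsDouble();
        return Arrays.stream(s)
                     .map(v -> (v - min) / (max - min) * (b - a) + a)
                     .toArray();
    }

    // Zero-mean (eq. 2): subtract the mean, divide by the standard deviation.
    static double[] zeroMean(double[] s) {
        double mean = Arrays.stream(s).average().getAsDouble();
        double var = Arrays.stream(s).map(v -> (v - mean) * (v - mean)).sum() / s.length;
        double sd = Math.sqrt(var);
        return Arrays.stream(s).map(v -> (v - mean) / sd).toArray();
    }

    // Decimal scaling (eq. 3): divide by 10^j, with j the smallest integer
    // that brings every |w| below 1 (the small epsilon handles values of S
    // whose maximum absolute value is an exact power of ten).
    static double[] decimalScaling(double[] s) {
        double maxAbs = Arrays.stream(s).map(Math::abs).max().getAsDouble();
        int j = (int) Math.ceil(Math.log10(maxAbs + 1e-12));
        double scale = Math.pow(10, j);
        return Arrays.stream(s).map(v -> v / scale).toArray();
    }

    public static void main(String[] args) {
        double[] s = {12.039, 19.286, 10.767};   // made-up attribute values
        System.out.println(Arrays.toString(minMax(s, 0, 1)));
        System.out.println(Arrays.toString(zeroMean(s)));
        System.out.println(Arrays.toString(decimalScaling(s)));
    }
}
```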
III. RESULTS AND DISCUSSION

A. Data

The data used in this experiment were obtained by the Panax notoginseng Institute using HPLC (High Performance Liquid Chromatography). After analyzing the fingerprints, we obtained 100 samples from each producing region, 300 samples in total across the 3 regions. One of the fingerprints is shown in Fig. 1.

[Figure 1. One of the fingerprints of the taproot site of Panax notoginseng; HPLC detector A at 203 nm, response in mV.]

B. Comparison of data standardization

In this study we run the experiments in WEKA. We first convert the raw data into a suitable file format such as CSV or ARFF, then add a class attribute to the file with the values 1, 2, 3 for the three producing regions.

We then apply the three kinds of standardization introduced in Section II.A and compare them using several classification algorithms: Naive Bayes (BY), Random Forest (RF), Bagging (BA), NNge (NNG), and RBFNetwork (RN). We use 10-fold cross-validation; all other settings are the WEKA defaults. Percent Correct (PC), Kappa Statistic (KS), Area Under ROC (AUR), and Mean Absolute Error (MAE) serve as the evaluation standards. The results are shown in Tables I-IV.

TABLE I. COMPARISON OF PC AFTER THE THREE STANDARDIZATIONS
           BY      RF      BA      NNG     RN
Z-m        85.27   97.27   92.53   93.9    94.23
Min-Max    85.47   97.13   92.57   93.83   94.43
DS         85.53   97.5    92.57   93.8    94.47

TABLE II. COMPARISON OF KS AFTER THE THREE STANDARDIZATIONS
           BY      RF      BA      NNG     RN
Z-m        0.78    0.96    0.89    0.91    0.91
Min-Max    0.78    0.96    0.89    0.91    0.92
DS         0.78    0.96    0.89    0.91    0.92

TABLE III. COMPARISON OF AUR AFTER THE THREE STANDARDIZATIONS
           BY      RF      BA      NNG     RN
Z-m        0.95    1       0.98    0.94    0.99
Min-Max    0.95    1       0.98    0.94    0.99
DS         0.95    1       0.98    0.94    0.99

TABLE IV. COMPARISON OF MAE AFTER THE THREE STANDARDIZATIONS
           BY      RF      BA      NNG     RN
Z-m        0.11    0.06    0.09    0.04    0.05
Min-Max    0.11    0.06    0.09    0.04    0.05
DS         0.11    0.06    0.09    0.04    0.05

The tables show that decimal scaling standardization gives the best classification performance across the four evaluation standards, so we choose it for data standardization.

C. Comparison of different attribute subsets

We use the AttributeSelectedClassifier (ASC), one of the classifiers in WEKA, to perform attribute selection, with three different evaluators: InfoGainAttributeEval (IGAE), CfsSubsetEval (CSE), and WrapperSubsetEval (WSE). The attributes are numbered 1 to 8; in order they are R1, Rg1, Rb1, Amount of sample, Moisture content, R1ct, Rg1ct, and Rb1ct. The class attribute does not take part in the selection.

First, we choose Random Forest as the base classifier in ASC, IGAE as the evaluator, and Ranker as the search method, setting numToSelect of Ranker from 7 down to 1. This yields the attribute subsets {7,5,8,6,2,3,1}, {7,5,8,6,2,3}, {7,5,8,6,2}, {7,5,8,6}, {7,5,8}, {7,5}, and {7}. Second, with CSE as the evaluator and BestFirst as the search method, we obtain the subset {3,4,5,6,7,8}. Third, with WSE as the evaluator and BestFirst as the search method, we obtain the subset {1,2,4,6,7,8}.

We then classify each attribute subset with the RF algorithm and compare the classification performance on four evaluation standards: PC, KS, MAE, and MSE (Mean Squared Error). The results are shown in Table V.

TABLE V. COMPARISON OF DIFFERENT SUBSETS OF ATTRIBUTES (RF CLASSIFIER)
Subset             PC        KS       MAE      MSE
{7,5,8,6,2,3,1}    98.3539   0.9753   0.0466   0.104
{7,5,8,6,2,3}      97.9424   0.9691   0.046    0.1126
{7,5,8,6,2}        97.9424   0.9691   0.0468   0.1226
{7,5,8,6}          97.5309   0.963    0.0494   0.1324
{7,5,8}            97.1193   0.9568   0.0599   0.1537
{7,5}              92.5926   0.8889   0.0916   0.219
{7}                67.078    0.5062   0.2518   0.3901
{3,4,5,6,7,8}      98.3539   0.9753   0.0535   0.1269
{1,2,4,6,7,8}      99.17     0.9877   0.0472   0.0948

As Table V shows, the attribute subset {1,2,4,6,7,8} achieves the best overall classification performance (highest PC and KS, lowest MSE), so we use this subset for attribute selection.
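The sketch below shows how the third configuration of this subsection (WSE evaluator with BestFirst search, which selected {1,2,4,6,7,8}) can be assembled, assuming the WEKA 3.x Java API; "notoginseng.arff" is a placeholder file name, and the random seed is our own choice. Note that WEKA's Evaluation class reports the root of the mean squared error rather than the MSE itself.

```java
// Sketch of the attribute-selection setup from Section III.C,
// assuming the WEKA 3.x Java API. File name and seed are placeholders.
import java.util.Random;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.WrapperSubsetEval;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.AttributeSelectedClassifier;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AttributeSubsetDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("notoginseng.arff");
        data.setClassIndex(data.numAttributes() - 1);  // class = producing region

        // WrapperSubsetEval wraps the base classifier to score candidate subsets;
        // BestFirst searches the space of subsets.
        WrapperSubsetEval evaluator = new WrapperSubsetEval();
        evaluator.setClassifier(new RandomForest());

        AttributeSelectedClassifier asc = new AttributeSelectedClassifier();
        asc.setEvaluator(evaluator);
        asc.setSearch(new BestFirst());
        asc.setClassifier(new RandomForest());

        // 10-fold cross-validation; other settings left at WEKA defaults.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(asc, data, 10, new Random(1));
        System.out.printf("PC   = %.4f%%%n", eval.pctCorrect());
        System.out.printf("KS   = %.4f%n", eval.kappa());
        System.out.printf("MAE  = %.4f%n", eval.meanAbsoluteError());
        System.out.printf("RMSE = %.4f%n", eval.rootMeanSquaredError());
    }
}
```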
D. Comparison of different classification algorithms

In this experiment we compare Naive Bayes (BY), J48, Random Forest (RF), AdaBoost.M1 (ADM), a Neural Network (BP), the Nearest Neighbor classifier (k-NN), a Support Vector Machine (SVM), and the Vote algorithm. We use 10-fold cross-validation; all other settings are the WEKA defaults.
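A sketch of this comparison loop, assuming the WEKA 3.x Java API, is given below. The paper does not name the exact WEKA implementations, so mapping "Neural Network" to MultilayerPerceptron, "k-NN" to IBk, and "SVM" to SMO is our assumption; the Vote ensemble is omitted for brevity, and "notoginseng.arff" is a placeholder for the preprocessed data set.

```java
// Sketch of the classifier comparison in Section III.D, assuming the
// WEKA 3.x Java API. Implementation choices and file name are assumptions.
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.MultilayerPerceptron;
import weka.classifiers.functions.SMO;
import weka.classifiers.lazy.IBk;
import weka.classifiers.meta.AdaBoostM1;
import weka.classifiers.trees.J48;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ClassifierComparison {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("notoginseng.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Default-parameter versions of the algorithms listed in Section III.D.
        Classifier[] classifiers = {
            new NaiveBayes(), new J48(), new RandomForest(),
            new AdaBoostM1(), new MultilayerPerceptron(),
            new IBk(1), new SMO()
        };

        for (Classifier c : classifiers) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(c, data, 10, new Random(1));  // 10-fold CV
            System.out.printf("%-22s PC=%.2f%%  KS=%.3f  MAE=%.3f%n",
                    c.getClass().getSimpleName(),
                    eval.pctCorrect(), eval.kappa(), eval.meanAbsoluteError());
        }
    }
}
```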