Dissolved Oxygen Prediction Using Support Vector Machine
Sorayya Malek, Mogeeb Mosleh, Sharifah M. Syed
World Academy of Science, Engineering and Technology, International Journal of Bioengineering and Life Sciences, Vol:8, No:1, 2014
International Science Index, Bioengineering and Life Sciences Vol:8, No:1, 2014 waset.org/Publication/9997153

Abstract: In this study, the Support Vector Machine (SVM) technique was applied to predict the dichotomized value of dissolved oxygen (DO) in two freshwater lakes, Chini and Bera (Malaysia). The data sample contained 11 water quality parameters from 2005 until 2009. All data parameters were used to predict the dissolved oxygen concentration, which was categorized into three levels (High, Medium, and Low). The input parameters were ranked, and the forward selection method was applied to determine the optimum parameters that yield the lowest errors and highest accuracy. Initial results showed that pH, water temperature, and conductivity are the most important parameters that significantly affect the prediction of DO. An SVM model using the Anova kernel with those parameters then yielded a 74% accuracy rate. We concluded that using SVM models to predict DO is feasible, and that using the dichotomized value of DO yields higher prediction accuracy than using the precise DO value.

Keywords: Dissolved oxygen, Water quality, DO prediction, Support Vector Machine.

Sorayya Malek is with the Institute of Biological Sciences, Faculty of Science University Malaya (phone: +603-79676748; fax: +603-79674208; e-mail: [email protected]).
Mogeeb Mosleh was with Taiz University, Yemen. He is now with the Department of A.I, Faculty of Computer Science & Information Technology, University of Malaya, multivariate statistics, Kuala Lumpur, Malaysia (e-mail: [email protected]).
Sharifah M. Syed is with Department of Computer & Communication System Engineering University Putra Malaysia (e-mail: [email protected]).

I. INTRODUCTION

Eutrophication is a common global concern in lakes and reservoirs. More than 60% of Malaysian lakes and reservoirs are reported to be eutrophic [1], [2]. The effects of eutrophication are deterioration of water quality for human consumption, limitation of recreational usage, and exhaustion of dissolved oxygen (DO) below the tolerable level, which induces reductions in specific fish and other animal populations [3]. The concentration of DO is important for the healthy functioning of aquatic ecosystems and is a significant indicator of their state. DO concentration is a parameter commonly used to assess the water quality in different reservoirs and watersheds. It is strongly influenced by a combination of the physical, chemical, and biological characteristics of the streams' oxygen-demanding substances, including algal biomass, dissolved organic matter, ammonia, volatile suspended solids, and sediment oxygen demand [4]-[9].

Machine learning methods such as artificial neural networks (ANNs) have been successfully implemented for DO estimation [10]-[12]. ANN is a well-suited method with self-adaptability, self-organization, and error tolerance for nonlinear simulation. However, this method has limitations: its complex structure requires a great amount of training data, its structure parameters are difficult to tune and are set mainly from experience, it is not sensitive to the noise produced, and its "black box" nature makes it difficult to understand and interpret the data [13], [14].

Considering the drawbacks of ANN, support vector machines (SVMs) have recently become widely used for predicting DO concentration [15], [16]. SVM is a machine-learning technology based on statistical learning theory and derived from structural risk minimization, which can enhance generalization ability and minimize the upper limit of the generalization error. Compared to ANN, SVM has the advantages of dealing with complex and highly nonlinear data, requiring only a small number of samples, a high degree of prediction accuracy, and a long prediction period when a kernel function is used to solve the nonlinear problems of water quality modeling [13].

In this study, we attempted to develop an SVM-based predictive model to predict dissolved oxygen concentration using data from selected freshwater lakes in Malaysia. The data samples for two lakes from 2005 to 2009 were used to train and test the model. The dissolved oxygen concentrations used are dichotomized because different threshold values of DO concentration can represent either acceptable or unacceptable water quality in a lake. This article is organized as follows: Section II discusses the methodology, which includes the study area, data collection, analysis, and SVM model development. Section III presents the results, followed by the discussion and conclusion.

II. METHODOLOGY

A. Study Area

Water quality data were collected from two freshwater lakes in peninsular Malaysia, Bera and Chini Lakes, as illustrated in Figs. 1 and 2. Lake Bera comprises a 26,000-hectare core zone and a 27,500-hectare buffer zone, all preserved as a Ramsar site, and is located at 3°49′00″N 102°25′00″E. Lake Chini covers about 5026 hectares and is located at 3°26′N 102°55′E. Lake Chini and Lake Bera both serve as wetlands that prevent floods and riverbank erosion, and as important water reservoirs in Malaysia. However, due to a recent increase in pollution levels, the water quality at these lakes is deteriorating [17].
International Scholarly and Scientific Research & Innovation 8(1) 2014, scholar.waset.org/1307-6892/9997153

B. Data Collection and Analysis

Water quality data from six monitoring stations at Bera Lake and nine monitoring stations at Chini Lake were used in this study. Bimonthly water quality data from February 2005 until October 2009 were used.

Fig. 1 Lake Bera Map

TABLE I
STATISTICS SUMMARY OF DATA PARAMETERS FROM LAKE CHINI AND LAKE BERA (2005 - 2009)

No.  Limnological Variables     Mean      Minimum   Maximum
1.   Water temperature (°C)     30.245    26.12     34.29
2.   Turbidity (NTU)            17.554    0.4       282.2
3.   pH                         6.596     4.97      7.94
4.   DO (mg/L)                  M         L         H
5.   Conductivity (µS/cm)       40.17     14        190
6.   Salinity (ppt)             0.019     0.01      0.09
7.   Nitrate (NO3-N) (mg/L)     0.232     0.01      2.04
8.   Phosphate (PO4) (mg/L)     0.063     0         0.36
9.   Ammonia (NH3) (mg/L)       0.085     0.01      0.81
10.  E-coli (mg/L)              733.12    0         5400
11.  Total Coliform (mg/L)      8885.7    20        284000

The parameter "DO" was classified into three levels ranging from low to high, because water quality can be classified into different classes depending on the DO concentration range. The Interim National Water Quality Standard, Malaysia (INWQS) and the Department of Environment have classified water quality based on DO concentration. Table II shows the DO concentration ranges and the corresponding water quality classes. This classification of DO ranges has been adopted in this study.

TABLE II
CLASSIFICATION OF WATER QUALITY BASED ON DO CONCENTRATION (INWQS)

DO Value      INWQS Water Classification
DO <= 5       Class III - IV: Extensive treatment required.
5 < DO < 7    Class II: Suitable for recreational use; conventional treatment required.
DO >= 7       Class I: Conservation of natural environment; water supply, practically no treatment necessary.
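The thresholds in Table II can be applied mechanically to raw DO readings. The following is a minimal Python sketch of that mapping, not the authors' code (the study itself used R). The function name `classify_do`, the constant names, and the sample readings are hypothetical and chosen only for illustration; the Low/Medium/High labels correspond to the three levels described in the text.

```python
# Sketch only: thresholds follow Table II (DO in mg/L).
# All names and sample values below are hypothetical.
LOW_MAX = 5.0    # DO <= 5 -> INWQS Class III-IV
HIGH_MIN = 7.0   # DO >= 7 -> INWQS Class I


def classify_do(do_mg_l):
    """Return (level, inwqs_class) for a DO concentration in mg/L."""
    if do_mg_l < 0:
        raise ValueError("DO concentration cannot be negative")
    if do_mg_l <= LOW_MAX:
        return ("Low", "Class III-IV")
    if do_mg_l < HIGH_MIN:
        return ("Medium", "Class II")
    return ("High", "Class I")


# Made-up readings chosen to hit each range and both boundaries.
sample_readings = [3.2, 5.0, 6.1, 7.0, 8.4]
levels = [classify_do(value)[0] for value in sample_readings]
print(levels)
```

Note that a reading of exactly 5 falls in the lowest class and a reading of exactly 7 in the highest, which matches the inequalities given in the table.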
The DO used in this study for SVM model development is classified into three categories (high, medium, and low), following the classification of the Interim National Water Quality Standard, Malaysia (INWQS) and the Department of Environment, as shown in Table III. In this classification, the high class covers DO values of 7 and above, the medium class covers values between 5 and 7, and the low class covers values of 5 and below.

TABLE III
OUR CLASSIFICATION FOR THE "DO" CONCENTRATIONS

DO Value      DO Categorization
DO <= 5       Low
5 < DO < 7    Medium
DO >= 7       High

Fig. 2 Lake Chini Map

The water quality data comprise 147 samples (rows) for 11 parameters. Sampling procedures, including preservation, for the water quality parameters were carried out in accordance with WHO [18], and the analytical methods for the measured parameters were adopted from the manual published by APHA [19]. Table I lists the summary statistics of the parameters used in this study.

C. SVM Model Data Development

The SVM was implemented in this study using the R software. The Kernel-based Machine Learning Lab (kernlab) package was used to develop our SVM model. It is an extensible package for kernel-based machine learning methods in R. This package contains dot product primitives (kernels), implementations of support vector machines and the relevance vector machine, and many other kernel methods and algorithms [20].

SVM works by predicting the labels of training data $(x_i, y_i)$, $i = 1, \dots, n$, with $y_i \in \{-1, 1\}$, that are separable by a hyperplane. If a data point is positive it belongs to class 1, and if it is negative it belongs to class -1. When the data are linearly separable and support vectors exist, the hyperplane can be written as:

$$w \cdot x + b = 0$$

$w$ and $b$ are determined during the training process, in which the support vectors are located.

True positive (TP) represents a correct prediction of the target class, whereas false negative (FN) represents a wrong prediction. True negative (TN) is a class that is not wrongly predicted as the target class, and false positive (FP) is a class that is wrongly predicted as the target class.

III. RESULTS

Forward selection is used to determine the optimum number of parameters for the DO SVM prediction model. The results of the SVM model that yields the highest accuracy rate using various kernels with the optimum number of parameters are shown in Table