World Academy of Science, Engineering and Technology International Journal of Bioengineering and Life Sciences Vol:8, No:1, 2014

Dissolved Oxygen Prediction Using Support Vector Machine Sorayya Malek, Mogeeb Mosleh, Sharifah M. Syed

[10]-[12]. ANN is a well-suited method with self-adaptability, Abstract—In this study, Support Vector Machine (SVM) self-organization, and error tolerance for nonlinear simulation. technique was applied to predict the dichotomized value of Dissolved However, this method has limitationsdue to its complex oxygen (DO) from two freshwater lakes namely Chini and Bera Lake structure that requires a great amount of training data, (). Data sample contained 11 parameters for water quality difficulty in tuning the structure parameter that is mainly features from year 2005 until 2009. All data parameters were used to predicate the dissolved oxygen concentration which was based on experience, not sensitive to the noise produce and its dichotomized into 3 different levels (High, Medium, and Low). The “black box” nature that makes it difficult to understand and input parameters were ranked, and forward selection method was interpret the data [13], [14]. applied to determine the optimum parameters that yield the lowest Considering the drawbacks of ANN, recently support vector errors, and highest accuracy. Initial results showed that pH, Water machine (SVM) are become widely used for predicting the Temperature, and Conductivity are the most important parameters DO concentration [15], [16]. It is a new machine-learning that significantly affect the predication of DO. Then, SVM model was applied using the Anova kernel with those parameters yielded technology based on statistical theory and derived from 74% accuracy rate. We concluded that using SVM models to instruction risk minimization, which can enhance the predicate the DO is feasible, and using dichotomized value of DO generalization ability and minimize the upper limit of yields higher prediction accuracy than using precise DO value. generalization error. Compared to ANN, SVM has advantages of dealing with complex and highly nonlinear data, only Keywords—Dissolved oxygen, Water quality,predication DO, requiring a small amount of samples, high degree of prediction Support Vector Machine. accuracy, and long prediction period if kernel function was used to solve the nonlinear problems of water quality I. INTRODUCTION modeling [13]. UTROPHICATION is a common global concern in lakes In this study, we attempted to develop an SVM-based Eand reservoir. Current status of eutrophication in predictive model to predict dissolved oxygen concentration Malaysian lakes and reservoirs are indicated to be more than using data from selected freshwater lake in Malaysia. The data 60% [1],[2]. The effects of eutrophication are deterioration of samples for two lakes from 2005 to 2009 were used to train water quality for human consumption, limitation of and test the model. Dissolved oxygen concentrations used are recreational usage and exhaustion of dissolved oxygen (DO) dichotomized because different threshold value of DO below tolerable level which induces reductions in specific fish concentration can represent either acceptable or unacceptable and other animal populations [3]. The concentration of DO is water quality in a lake. This article is organized as follows; important for the healthy functioning of aquatic ecosystems section II discusses the methodology which includes the study and is a significant indicator of the state of aquatic area, data collection, analysis, and SVM model development. ecosystems. DO concentration is a parameter commonly used Section III represents the results followed by the discussion to assess the water quality in different reservoirs and and conclusion. watersheds. DO concentration is strongly influenced by a combination of the physical, chemical, and biological II. METHODOLOGY characteristics of the streams’ oxygen-demanding substances, A. Study Area including algal biomass, dissolved organic matter, ammonia, volatile suspended solids, and sediment oxygen demand [4]- Water quality data were collected from two fresh water [9]. lakes in . They are Bera and Chini Lakes Machine learning methods such as artificial neural networks as illustrated in Figs. 1 and 2. Lake Bera consists of 26,000 International Science Index, Bioengineering and Life Sciences Vol:8, No:1, 2014 waset.org/Publication/9997153 (ANNs) has been successfully implemented for DO estimation hectares of its core zone and 27,500 hectares of the buffer zone all has been preserved as RAMSAR sites and it is coordinate at 3°49′00″N102°25′00″E.Lake Chini consists of Sorayya Malek is with the Institute of Biological Sciences, Faculty of Science University Malaya (phone: +603-79676748; fax: +603-79674208; e- about 5026 hectares and is located in the coordinate mail: [email protected]). 3°26′N102°55′E. Lake Chini and Lake Bera both serves as Mogeeb Mosleh was with Taiz University, Yemen. He is now with the wetlands to prevent the floods and erosion of riverbank and Department of A.I, Faculty of Computer Science & Information Technology, serve as an important water reservoir in Malaysia. However University of Malaya, multivariate statistics, KualaLumpur, Malaysia (e-mail: [email protected]). due recent increase in pollution level the water qualities at Sharifah M. Syed is with Department of Computer & Communication these lakes are deteriorating [17]. System Engineering University Putra Malaysia (e-mail: [email protected]).

International Scholarly and Scientific Research & Innovation 8(1) 2014 46 scholar.waset.org/1307-6892/9997153 World Academy of Science, Engineering and Technology International Journal of Bioengineering and Life Sciences Vol:8, No:1, 2014

B. Data Collection and Analysis TABLE I STATISTICS SUMMARY OF DATA PARAMETERS FROM LAKE CHINI AND LAKE Water quality data from six monitoring stations from Bera BERA (2005 – 2009) lakes, and nine monitoring stations from were used No. Limnological Mean Minimum Maximum in this study. Bimonthly water quality data from February Variables 2005 until October 2009 was used. 1. Water temperature oC 30.245 26.12 34.29 2. Turbidity (NTU) 17.554 0.4 282.2

3. pH 6.596 4.97 7.94 4. DOmg/L M L H 5. Conductivity uS/cm 40.17 14 190 6. Salinityppt 0.019 0.01 0.09 7. Nitrate (NO3-N) mg/L 0.232 0.01 2.04 8. Phosphate PO4mg/L 0.063 0 0.36 9. Ammonia(NH3) mg/L 0.085 0.01 0.81 10. E-colimg/L 733.12 0 5400 11. Total Coliformmg/L 8885.7 20 284000

The parameter “DO” was classified into three levels ranges from low to high. This is because water quality can be classified into different classes depending on DO concentration range. The Interim National Water Quality Standard, Malaysia (INWQS) and Department of Environment have classified water quality based on DO concentration. Fig. 1 Lake Bera Map Table II, illustrates the DO concentration and water quality

classes.This classification of DO range has been adapted in this study.

TABLE II CLASSIFICATION OF WATER QUALITY BASED ON DO CONCENTRATION (INWQS) DO Value INWQS water Classification DO < = 5 Class III – IV Extensive treatment required 5 > DO <7 Class II - Suitable for recreational use and conventional treatment required. Do > = 7 Class I -Conservation of natural environment, Water Supply practically no treatment necessary.

The Do used in this study for SVM model development is classified into three categorizations (high, medium, and low) similar with the classification of the Interim National Water Quality Standard, Malaysia (INWQS) and Department of Environment as shown in Table III. In this classification the high class of DO range from value 7 and above, while the Fig. 2 Lake Chini Map medium class ranges between 5 until 7, and lastly for the DO

that had value below 5 is in class low. The water quality data include 147 row samples for 11

parameters. Sampling procedures including preservation for TABLE III water quality parameters were carried out in accordance with OUR CLASSIFICATION FOR THE “DO” CONCENTRATIONS WHO[18] and the analytical methods for the measured DO Value DO Categorizations parameters were adopted from manual published APHA [19]. DO < = 5 Low Table I list the summary statistics of the parameters used in 5 > DO <7 Medium International Science Index, Bioengineering and Life Sciences Vol:8, No:1, 2014 waset.org/Publication/9997153 this study. Do > = 7 High C. SVM Model Data Development The SVM was implemented in this study using the R Package software. Kernel-based Machine Learning Lab

(Kernlab) package was used to develop our SVM model. It is an extensible package for kernel-based machine learning methods in R. This package contains dot product primitives (kernels), implementations of support vector machines and the relevance vector machine with many other kernel methods and

International Scholarly and Scientific Research & Innovation 8(1) 2014 47 scholar.waset.org/1307-6892/9997153 World Academy of Science, Engineering and Technology International Journal of Bioengineering and Life Sciences Vol:8, No:1, 2014

algorithms [20]. True positive (TP) is represented the correct prediction SVM works by predicting the labels of training data state, where false negative (FN) is represented the wrong

as,, 1.. with 1, 1 is separable by predication state. True negative (TN) is class which not a hyperplane. Where if the data is positive it belongs to class 1 wrongly predicted as target class and false positive (FP) is and if the data is negative it belongs to class -1. When data is class that is wrongly predict and is predicted on target class. linearly separable and support vector existed it can be derived as: III. RESULTS Forward selection is used to determine optimum number of parameter for DO SVM prediction model. Results of SVM model that yields the highest accuracy rate using various W and b are determined during training process where it kernels with optimum number of parameters is shown in Table locates the support vector. W is defined as weight vector while IV. It illustrates the results of the input parameters for the b is term for bias and is a scalar. T stands for transpose selection process. This table is yield from the process of operator. selection, forwarding, and ranking based on the least errors In SVM kernel function is introduced as it offer a better value obtained. performance when dealing with a nonlinear data. Here data is mapped and derived as below where K is the kernel function: TABLE IV INPUT PARAMETER RANKING , Ranking Parameter Cross validation Error 1 Temperature 0.458333 By introducing the kernel it can be derived as 2 pH 0.516667 3 Conductivity 0.575 4 Total Coliform 0.608333 1 2 5 Turbidity 0.625 6 Ammonia 0.675

7 Salinity 0.691667 subject to: 8 Phosphate 0.7 9 E.coli 0.708333 0, 10 Nitrate 0.8 0 , Selection procedure was repeated, until the SVM model The equation produce is as below with kernel that has the highest accuracy rate was determined automatically using a built in function in R. The results of SVM prediction model with highest accuracy rate and , optimum number of parameters for each kernel is presented in Table V.

The kernel selection depends on the data distribution but TABLE V kernel selection also generally done through trial and error SVM PREDICTION MODEL RESULT FOR “DO” [21], [22]. Input Parameter Kernel Accuracy (%) Prior to model development data was divided into training Water Temperature, pH Radial Basis 44.44 and testing data set; 80% for training data and 20% for testing Water Temperature, pH, Conductivity. Anova 74.07 data. All data was converted into comma delimited (CSV) Water Temperature, pH, Conductivity., Anova 62.96 Coliform format before it can be used to run in the R software. Input Temperature, pH, Conductivity., Anova 70.37 parameters were ranked using linear kernel in ascending order Coliform, Turbidity based on the cross validation errors. This is carried out to Temperature, pH, Conductivity., Laplacian 66.67 Coliform, Turbidity, Ammonia determine effects of each input parameter on the output Temperature, pH, Conductivity., Anova 42 parameter DO. Once the effect of each parameter has been Coliform, Turbidity, Ammonia, Salinity determined, SVM prediction models using RBF, ANOVA and Temperature, pH, Conductivity., Anova 48. 14 International Science Index, Bioengineering and Life Sciences Vol:8, No:1, 2014 waset.org/Publication/9997153 Laplace kernel was developed. The default value was used for Coliform, Turbidity, Ammonia, Salinity, Phosphate each kernel for the SVM prediction model Temperature, pH, Conductivity., Anova 44.44 development.Forward elimination method was used for each Coliform, Turbidity, Ammonia, SVM model to eliminate less significance parameters. Salinity, Phosphate, E.coli Temperature, pH, Conductivity., Radial Basis 51.85 Manual calculations are also carried out to determine the Coliform, Turbidity, Ammonia, accuracy for each class of DO predication using the formula Salinity, Phosphate, E.coli, Nitrate below for training and testing data. After we determined the appropriate parameters that have the most effects on the predication results of DO, we isolated those parameters (Water Temperature, pH, and Conductivity)

International Scholarly and Scientific Research & Innovation 8(1) 2014 48 scholar.waset.org/1307-6892/9997153 World Academy of Science, Engineering and Technology International Journal of Bioengineering and Life Sciences Vol:8, No:1, 2014

in individual comma separated file. Then we used the test data results showed that ANOVA is the most appropriate kernel to as input to the SVM model. 27 test data samples were run to obtain the highest accuracy in predication results because of predicate the DO level using the SVM. Finally, we calculate its ability in reducing data redundancy of nonlinear data. In the accuracy of predication manually as shown in Table VI. this study the prediction of DO level by using the SVM for the Lake Bera and Lake Chini produced accuracy about 74.07%. TABLE VI The acceptable accuracy results showed that predication of the SVM ACCURACY FOR DO PREDICATION RESULTS DO level using dichotomized value isfeasible, and more DO Level Accuracy (%) accurate than predicating exact value of DO. Data Classification Low Medium High Training 78.33 72.5 75.83 Testing 81.48 77.78 88.89 ACKNOWLEDGMENT The authors would like to thank National Hydraulic As shown in Table VI, we calculated the accuracy of Research Institute of Malaysia (Nahrim) for their data and predication for the both data sets, training and testing data. UMRG grant RG241-12AFR. Then we calculated manually the accuracy of each DO classification. REFERENCES [1] Nahrim: A desktop Study on the Status of Lake Eutrophication in IV. DISCUSSION Malaysia, Final Report, Malaysia; 2005. [2] Z. Sharip, and Z. Yusop, "National overview: the status of lakes In the SVM theory, there are different types of kernel eutrophication in Malaysia." pp. 2-3.H. Poor, An Introduction to Signal functions can be used, such as linear, polynomial, RBF, Detection and Estimation. New York: Springer-Verlag, 1985, ch. 4. ANOVA and sigmoid. Different base functions are applicable [3] S.O. Ryding, and W. Rast, “Control of eutrophication of lakes and reservoirs,” Manual the biosphere series, vol. 1, 1989. for dealing with different types of data. ANOVA kernels have [4] M. Spanou, and D. Chen, “An object-oriented tool for the control of been shown to work rather well in multi-dimensional point-source pollution in river systems,” Environmental Modelling& regression problems [21],[23]. The SVM accuracy using the Software, vol. 15, no. 1, pp. 35-54, 2000. [5] M. D. Williams, and M. Oostrom, “Oxygenation of anoxic water in a ANOVA kernel gave more accurate predication results if fluctuating water table system: an experimental and numerical study,” compared with other kernels which used in this study such as Journal of hydrology, vol. 230, no. 1, pp. 70-85, 2000. RBF and laplacian kernels. The highest accuracy obtained [6] J. Kalff, Limnology: inland water ecosystems: Prentice Hall New Jersey, 2002. using ANOVA kernel with the three selected parameters was [7] B. Cox, “A review of currently available in-stream water-quality models 74%.Forward selection method was used to select parameters and their applicability for simulating dissolved oxygen in lowland that yield the optimum results. The parameters were pH, rivers,” Science of the Total Environment, vol. 314, pp. 335-377, 2003. [8] P. J. Mulholland, J. N. Houser, and K. O. Maloney, “Stream diurnal conductivity and water temperature. DO in water is primarily dissolved oxygen profiles as indicators of in-stream metabolism and determined by water temperature [6]. disturbance effects: Fort Benning as a case study,” Ecological Water temperature also has a reverse relationship with DO Indicators, vol. 5, no. 3, pp. 243-252, 2005. [9] N. W. T. Quinn, K. Jacobs, C. W. Chen, and W. T. Stringfellow, in water and proportional relationship with hydrogen ion “Elements of a decision support system for real-time management of concentration (pH). Conductivity meanwhile is related to dissolved oxygen in the San Joaquin River Deep Water Ship Channel,” salinity which affects the DO level in water. Water that has Environmental Modelling& Software, vol. 20, no. 12, pp. 1495-1504, high salinity has high conductivity which decreases DO level 2005. [10] E. Dogan, B. Sengorur, and R. Koklu, “Modeling biological oxygen in water. Salinity also decreases the DO saturation value demand of the Melen River in Turkey using an artificial neural network Salinity enables conductivity affects DO level in water. technique,” Journal of Environmental Management, vol. 90, no. 2, pp. Similar studies on DO prediction using SVM has reported 1229-1235, 2009. [11] A. Akkoyunlu, and M. E. Akiner, “Pollution evaluation in streams using accuracy level of 70% [13],[24]-[27]. The success rate of DO water quality indices: A case study from Turkey's Sapanca Lake Basin,” prediction in this study is reported to be 74% using Ecological Indicators, vol. 18, pp. 501-511, 2012. dichotomous values of DO instead of precise values as used in [12] K. P. Singh, A. Basant, A. Malik, and G. Jain, “Artificial neural network modeling of the river water quality—a case study,” Ecological other studies. Many ecological responses are difficult to Modelling, vol. 220, no. 6, pp. 888-895, 2009. measure accurately and definitely. Furthermore there are many [13] M. Bouamar, and M. Ladjal, "Evaluation of the performances of ANN factors that need to be considered when modeling water and SVM techniques used in water quality classification." pp. 1047- 1050. quality such as climate factors and development activities in [14] X. Yunrong, and J. Liangzhong, "Water quality prediction using LS-

International Science Index, Bioengineering and Life Sciences Vol:8, No:1, 2014 waset.org/Publication/9997153 the surrounding areas. Therefore characterizing responses that SVM and particle swarm optimization." pp. 900-904 are dichotomous such DO for lakeeutrophication level [15] S. Liu, J. Ren, and W. You, “A Study on Purification of the Eutrophic Water Body with Economical Plants Siollessly Cultivated on Artificial increases the accuracy of prediction in this study. Substratum [J],” ActaScicentiarumNaturalumUniversitisPekinesis, vol. 4, 1999. V. CONCLUSION [16] A. Najah, A. El-Shafie, O. Karim, O. Jaafar, and A. H. El-Shafie, “An application of different artificial intelligences techniques for water SVM is considered recently one of the most suitable quality prediction,” Int. J. Phy. Sci, vol. 6, no. 22, pp. 5298-5308, 2011. approaches for predication and forecasting. The preparation [17] M. Shuhaimi-Othman, E. C. Lim, and I. Mushrifah, “Water quality approaches such as selection, forwarding, and ranking that changes in Chini Lake, , West Malaysia,” Environmental Monitoring and Assessment, vol. 131, no. 1-3, pp. 279-292, 2007. applied to determine the significant water quality parameters [18] WHO:UNEP/WHO/UNESCO/WMO Project on Global Environmental can play essential roles in the prediction of DO. In addition, Monitoring.GEM Water Operational Guide 1987.

International Scholarly and Scientific Research & Innovation 8(1) 2014 49 scholar.waset.org/1307-6892/9997153 World Academy of Science, Engineering and Technology International Journal of Bioengineering and Life Sciences Vol:8, No:1, 2014

[19] American Public Health Association (APHA): Standard methods for the examination of water and waste water. 19th edition. American Water Works Association (AWWA) and Water Environment Federation APHA, Washington, DC; 1995. [20] A. Karatzoglou, A. Smola, K. Hornik, and A. Zeileis, “kernlab-an S4 package for kernel methods in R,” 2004. [21] M. Stitson, A. Gammerman, V. Vapnik, V. Vovk, C. Watkins, and J. Weston, “Support vector regression with ANOVA decomposition kernels,” Advances in kernel methods—Support vector learning, pp. 285-292, 1999. [22] A. Zien, G. Rätsch, S. Mika, B. Schölkopf, T. Lengauer, and K.-R. Müller, “Engineering support vector machine kernels that recognize translation initiation sites,” Bioinformatics, vol. 16, no. 9, pp. 799-807, 2000. [23] T. Hofmann, B. Schölkopf, and A. J. Smola, “Kernel methods in machine learning,” The annals of statistics, pp. 1171-1220, 2008. [24] R. V. Thomann, and J. A. Mueller, Principles of surface water quality modeling and control: Harper & Row, Publishers, 1987. [25] A. Najah, A. El-Shafie, O. Karim, and O. Jaafar, “Integrated versus isolated scenario for prediction dissolved oxygen at progression of water quality monitoring stations,” Hydrology and Earth System Sciences Discussions, vol. 8, no. 3, pp. 6069-6112, 2011. [26] X.-S. Qin, G. H. Huang, G.-M. Zeng, A. Chakma, and Y. Huang, “An interval-parameter fuzzy nonlinear optimization model for stream water quality management under uncertainty,” European Journal of Operational Research, vol. 180, no. 3, pp. 1331-1357, 2007. [27] J. Liu, M. Chang, and X. Ma, "Groundwater Quality Assessment Based on Support Vector Machine." pp. 173-178. International Science Index, Bioengineering and Life Sciences Vol:8, No:1, 2014 waset.org/Publication/9997153

International Scholarly and Scientific Research & Innovation 8(1) 2014 50 scholar.waset.org/1307-6892/9997153