International Journal of Basic & Applied Sciences IJBAS-IJENS Vol: 12 No: 02

Predictions of Water Level in Dungun River Using Partial Least Squares Regression

Noraini Ibrahim 1, Antoni Wibowo 2 Faculty of Computer Science and Information Systems 81310, UTM Bahru, Johor, e-mail: [email protected] 1, [email protected] 2

Abstract - Floods are a common phenomenon in the district of Dungun in the state of Terengganu, Malaysia. Every year, floods affect the biodiversity of this region and cause property loss in its residential areas. The residents of Dungun regularly suffer from floods when water overflows into the areas adjoining rivers, lakes or dams. The rainfall and evaporation of the area have a large influence on the water level of the Dungun River. Therefore, a suitable prediction model is needed to forecast the water level in the Dungun River, and we adopt ordinary linear regression (OLR) and partial least squares regression (PLSR) based on hydrological data. Because the original data contain inconsistent records, the hydrological data must first be cleansed. The experiments show that PLSR is a more suitable model than OLR and that the cleansing data give higher accuracy than the original data.

Keywords - Dungun River, evaporation, flooding, linear regression, partial least squares regression, rainfall, water level

I. INTRODUCTION

Most states in Malaysia suffer from floods during the monsoon season, especially in , , Terengganu, and Johor. Terengganu is a state on the east coast of Peninsular Malaysia that has never missed a flooding event; floods occur between October and March every year during the North-East monsoon period [4]. The annual rainfall in the sub-regions of Peninsular Malaysia over a decade (1984-1993) shows that Terengganu has the highest amount of annual rainfall [13]. Floods in the Dungun district of Terengganu are caused by a combination of physical factors, such as elevation and close proximity to the sea, in addition to the heavy rainfall experienced during the monsoon period. The severe floods all over Terengganu result from heavy rainfall during the north-east monsoon season, especially in November and December.

The floods that affect Dungun district, and that cause extensive damage to property and road systems along the eastern coast of Terengganu, are categorized as coastal flooding. The Department of Irrigation and Drainage Malaysia measures several flood characteristics: water level, area of inundation, peak discharge, volume of flow and duration. The department has introduced three critical stages of water level: alert, warning and danger. It has also set up four telemetric stations in Dungun to observe and monitor the water level and rainfall.

There are many related works on predicting water level, and most of them use neural networks and other nonlinear methods. Previous research on water level variation and prediction at the Pingshan Sinkhole in Guizhou, Southwestern China, shows that predicting water level using an artificial neural network (ANN) combined with partial least squares regression achieves better results than PLSR alone [14].

This study attempts to find an appropriate model for predicting the water level in the Dungun River. OLR and PLSR are used to predict the water level since the data appear to have a linear relationship between the predictors and the response. Cross-validation (CV) is applied to select the appropriate model. The normalized original hydrological data contain inconsistent records, so the data must be cleansed. The following sections present an approach to the development of the water level, evaporation and rainfall models. The OLR and PLSR methods are discussed in Section II, while the evaluation of the prediction quality is described in Section III. A case study and some experimental results are reported in Sections IV, V and VI. Finally, the conclusion is given in Section VII.

II. THEORIES AND METHODS

The linear regression model is given as follows [16]

$$y = 1_n \beta_0 + X \beta_1 + \varepsilon \tag{1}$$

where $y$ is an $n \times 1$ vector of observations on the response variable, $X$ is an $n \times p$ matrix consisting of $n$ observations and $p$ predictors, $\beta_0$ is an unknown constant, $\beta_1$ is a $p \times 1$ vector of unknown regression coefficients, $1_n$ is an $n \times 1$ vector of ones and $\varepsilon$ is an $n \times 1$ vector of errors that are independently and identically distributed with mean zero and variance $\sigma^2 > 0$.

A. Ordinary Linear Regression

OLR is often used to fit a model to observations by minimizing the sum of the squared residuals between the predicted and actual responses. When the matrix $X_1 = [1_n \; X]$ has full column rank $p + 1$, the OLR estimator of $\beta = (\beta_0, \beta_1^T)^T$, say $\hat{\beta}_{OLS} = [\beta_{OLS0}, \beta_{OLS1}, \dots, \beta_{OLSp}]^T$, is given by

$$\hat{\beta}_{OLS} = (X_1^T X_1)^{-1} X_1^T y. \tag{2}$$

The prediction of $y$ is given by

$$\hat{y} = X_1 \hat{\beta}_{OLS}. \tag{3}$$

The model for OLR can be represented by

$$f(x) = \beta_{OLS0} + \beta_{OLS1} x_1 + \dots + \beta_{OLSp} x_p \tag{4}$$

where $x = [x_1, x_2, \dots, x_p]^T \in \mathbb{R}^p$.

B. Partial Least Squares Regression

Partial Least Squares (PLS) has proven to be an effective approach to problems in chemometrics, such as predicting the bioactivity of molecules to facilitate the discovery of novel pharmaceuticals. The PLS approach was originated around 1975 by Herman Wold for modeling complicated data sets in terms of blocks of matrices, called path models [19]. The PLS method was introduced in the chemical literature as an algorithm, and only recently have its numerical and statistical properties become more apparent [16]. PLSR is a technique for modeling a linear relationship between a set of output variables (responses) $\{y_i\}_{i=1}^{n} \in \mathbb{R}^L$ with $L$-dimensional responses and a set of input variables (regressors) $\{x_i\}_{i=1}^{n} \in \mathbb{R}^p$ with $p$ variables [12]. The data matrices $X$ and $y$ in this analysis are assumed to be centered as a first step to perform PLSR.

In this paper, we only use a one-dimensional response, that is, $L = 1$. PLS models the relations between sets of observed variables by means of latent variables, which are linear combinations of the original regressors that retain most of the information in the input variables. PLS is useful when the number of explanatory variables exceeds the number of observations and a high level of multicollinearity among those variables is assumed. The weights used to determine the linear combinations of the original regressors are proportional to the covariance between the input and output variables [6].

C. Partial Least Squares Regression Using SIMPLS Algorithm

The SIMPLS algorithm was used to compute the regression coefficients of the model for predicting the water level in the Dungun River of Terengganu. SIMPLS is reported to work well and to be fast, easy to implement and simple to tune [1]. In the PLSR approach, we need to obtain the PLSR estimator, say $\hat{B}_{PLSR} = [\hat{B}_{PLSR1}, \hat{B}_{PLSR2}, \dots, \hat{B}_{PLSRp}]^T$. The computation starts with the cross-product [7]

$$S = X^T y. \tag{5}$$

The iteration then runs from 1 to $A$ latent variables, where $A$ is determined in advance and $1 \le A \le p$. The SIMPLS algorithm is given as follows.

For $a = 1$ to $A$:
• If $a = 1$, compute the singular value decomposition (svd) of $S$: $[u, s, v] = \mathrm{svd}(S)$. Otherwise, if $a > 1$, compute the svd of the deflated cross-product: $[u, s, v] = \mathrm{svd}(S - P (P^T P)^{-1} P^T S)$.
• Get the weights $r$, which are the first left singular vector: $r = u(:, 1)$.
• Compute the scores: $t = X r$.
• Compute the loadings: $p = X^T t / (t^T t)$.
• Store the vectors $r$, $t$ and $p$ into $R$, $T$ and $P$, respectively.

The last step computes the regression coefficients:

$$\hat{B}_{PLSR} = R (T^T T)^{-1} T^T y. \tag{6}$$

Then, the PLSR estimate of the response is

$$\hat{y} = X \hat{B}_{PLSR}. \tag{7}$$

The resulting PLSR model is stated in equation (8).
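The SIMPLS steps described above can be sketched for the single-response case ($L = 1$) as follows. This is a minimal illustration under stated assumptions, not the authors' Matlab implementation: the function name `simpls` is hypothetical, the data are plain Python lists, and the projection $S - P (P^T P)^{-1} P^T S$ is carried out with an orthonormal basis of the loadings (Gram-Schmidt), which gives the same deflated vector without forming a matrix inverse. Because $S$ is a $p \times 1$ vector here, its first left singular vector is simply $S / \|S\|$.

```python
from math import sqrt


def simpls(X, y, n_components):
    """Single-response PLSR via a SIMPLS-style loop (illustrative sketch).

    Returns (intercept, coefficients) on the original, uncentred scale.
    """
    n, p = len(X), len(X[0])
    x_mean = [sum(col) / n for col in zip(*X)]
    y_mean = sum(y) / n
    # Centre X and y, as assumed in the text.
    Xc = [[v - m for v, m in zip(row, x_mean)] for row in X]
    yc = [v - y_mean for v in y]
    # Cross-product S = X^T y, equation (5).
    S = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    basis = []        # orthonormal basis spanning the loadings found so far
    B = [0.0] * p     # accumulated regression coefficients
    for _ in range(n_components):
        norm = sqrt(sum(s * s for s in S))
        if norm < 1e-12:          # nothing left to explain; stop early
            break
        r = [s / norm for s in S]                                  # weights
        t = [sum(row[j] * r[j] for j in range(p)) for row in Xc]   # scores t = X r
        tt = sum(v * v for v in t)
        load = [sum(Xc[i][j] * t[i] for i in range(n)) / tt
                for j in range(p)]                                 # loadings
        q = sum(yc[i] * t[i] for i in range(n)) / tt
        # The scores are mutually orthogonal, so R (T^T T)^{-1} T^T y
        # in equation (6) reduces to a sum of r * q terms.
        B = [b + rj * q for b, rj in zip(B, r)]
        # Deflate S: remove its component along the new loading direction.
        v = load[:]
        for u in basis:
            c = sum(a * b for a, b in zip(u, v))
            v = [a - c * b for a, b in zip(v, u)]
        v_norm = sqrt(sum(a * a for a in v))
        v = [a / v_norm for a in v]
        basis.append(v)
        c = sum(a * b for a, b in zip(v, S))
        S = [s - c * a for s, a in zip(S, v)]
    intercept = y_mean - sum(b * m for b, m in zip(B, x_mean))
    return intercept, B
```

With the number of components equal to the number of predictors and a full-rank $X$, the coefficients returned by this sketch coincide with the OLR solution, which matches the identical OLR and PLSR coefficients reported later in the paper for three components.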

$$g(x) = \bar{y} + \hat{B}_{PLSR1}(x_1 - \bar{x}_1) + \dots + \hat{B}_{PLSRp}(x_p - \bar{x}_p) \tag{8}$$

where $\bar{y}$ is the mean of the responses $y_i$ and $\bar{x}_p$ is the mean of the observed data of $x_p$.

III. EVALUATING THE QUALITY OF THE PREDICTION

The quality of the prediction with $A$ latent variables is evaluated by comparing $\hat{y}_i$ and $y_i$ [6]. The CV technique is used to estimate the prediction capacity: the data are separated into a training data set used to build the model and a testing data set used to test it. CV is applied in three situations: performance estimation, model selection and tuning of learning-model parameters. In this paper, CV is used for model selection in predicting the water level of the Dungun River. CV is a statistical method for evaluating algorithms by dividing the data into two segments, one for training and one for validation; its basic form is K-fold CV. The idea of CV originated in the 1930s [9, 11]. In the 1970s, CV was employed as a means of choosing proper model parameters, as opposed to using cross-validation purely for estimating model performance [5, 15].

Stratified 10-fold CV has been recommended as the best model selection method since it tends to provide a less biased estimate of the accuracy than regular cross-validation, leave-one-out CV and bootstrap methods [8]. For this analysis, we used 10-fold CV because it gives accurate performance estimates and is suitable for small samples. We used it to choose an appropriate model between the normalized original data and the cleansing data by comparing the mean squared error of cross-validation (MSECV) based on OLR and PLSR. The data are divided into K segments of roughly equal size, and the inner sum of the MSECV is taken over the observations in the kth segment [3, 10]. In each of the K experiments, K-fold CV uses K-1 folds for training and the remaining fold for testing. An advantage of K-fold CV is that all the examples in the data set are eventually used for both training and testing. We used the Matlab function 'crossval' to obtain the MSECV, which is a scalar containing the 10-fold cross-validation estimate of the mean squared error. We select the model with the lowest MSECV, which is a measure of how well the model fits the data.

IV. CASE STUDY

In our case study, the hydrological data are taken from water resources management and hydrology division in . The predictors for these experiments are months, rainfall and evaporation, while the response is the water level. Dungun receives its monthly rainfall mainly in the rainy season, which starts in October and ends in March. The following tables show the details of the predictors and response used for water level prediction.

A. Data

In this paper, we use data from the Dungun River of Terengganu between 2001 and 2010. We examine the relationship of rainfall, evaporation and month with the water level of the Dungun River, and we obtain the regression coefficients and the corresponding models using OLR and PLSR. The data are separated into 108 observations for training and 12 observations for testing.

B. Normalized Original Data

The data set covers January to December for 10 years, a total of 120 monthly observations. The predictors are rainfall, evaporation and month, while the response is the water level. Figure 1 describes the predictors and response used over the training period in predicting the water level of the Dungun River. The first, second, third and fourth columns of this figure are the month, rainfall, evaporation and water level data, respectively. The figure shows the normalized original data; the 17th month is May 2002.

Fig. 1. The snapshot of normalized original data

The original data in Figure 1 have been normalized according to the following simple normalization:

$$z_{\text{normalized},i} = \frac{z_{\text{original},i}}{z_{\max}} \tag{9}$$

where
$z_{\text{normalized},i}$ is the normalized original data,
$z_{\text{original},i}$ is the original data,
$z_{\max}$ is the maximum value of the original data.
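The simple normalization of equation (9) can be illustrated with a short sketch. The function names and the sample rainfall values below are hypothetical and are not taken from the Dungun data set; the sketch assumes a non-empty series whose maximum is positive. Multiplying a normalized value by the same maximum recovers the original scale.

```python
def normalize_by_max(values):
    """Divide every value by the series maximum, as in equation (9)."""
    z_max = max(values)
    if z_max <= 0:
        raise ValueError("the maximum must be positive for this normalization")
    return [v / z_max for v in values], z_max


def to_original_scale(normalized, z_max):
    """Undo the normalization by multiplying with the stored maximum."""
    return [v * z_max for v in normalized]


# Hypothetical monthly rainfall readings (illustrative values only).
rainfall = [120.0, 0.0, 480.0, 240.0]
rainfall_normalized, rainfall_max = normalize_by_max(rainfall)
```

Keeping the maximum alongside the normalized series is a design choice of this sketch: it is the only quantity needed to map predictions back to the original units.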

C. Cleansing Data

Data preprocessing is applied to the normalized original data to prepare them for the next processing step. It transforms the data into a format that is more effective for the purpose of our analysis. Data preprocessing is important since real-world data are normally noisy, that is, they contain errors and outliers. Data preprocessing comprises five tasks: data cleaning, data integration, data transformation, data reduction and data discretization.

Fig. 2. The snapshot of cleansing data

In Figure 1, a rainfall value of 0 is recorded for December 2002. It is classified as non-logical data since a value of 0 indicates no rainfall in December, whereas that month falls in the rainy season, when rainfall is higher. For this analysis, we used data cleaning to correct non-logical data. For example, the 0 values of rainfall in November 2002 and December 2002 are replaced by the means of the corresponding months over the 10 years. Figure 2 shows the snapshot of the cleansing data for the Dungun River.

V. MODELS DEVELOPMENT

The following sections present the results of the experiments using OLR and PLSR.

A. Ordinary Linear Regression

OLR is performed in this experiment to build the models for rainfall, evaporation and water level in the Dungun River. This subsection presents the results of the experiment, which are the prediction models for water level, rainfall and evaporation over the training period.

TABLE I
THE $\hat{\beta}_{OLS}$ VECTOR FOR MODEL PREDICTION USING NORMALIZED ORIGINAL DATA
β_OLS0     β_OLS1     β_OLS2     β_OLS3
0.64623    0.00004    0.60831    -0.24382

Table I describes the OLR estimator of $\beta$, say $\hat{\beta}_{OLS}$, using normalized original data, and the model for water level prediction using OLR is

$$f_{OLR1}(x_1, x_2, x_3) = 0.64623 + 0.00004 x_1 + 0.60831 x_2 - 0.24382 x_3. \tag{10}$$

The model for rainfall prediction using original data is

$$x_2 = 0.07595 + 0.00013 x_1. \tag{11}$$

The model for evaporation prediction using original data is

$$x_3 = 0.70034 + 0.00004 x_1. \tag{12}$$

TABLE II
THE $\hat{\beta}_{OLS}$ VECTOR FOR MODEL PREDICTION USING NORMALIZED CLEANSING DATA
β_OLS0     β_OLS1     β_OLS2     β_OLS3
0.62455    -0.00029   1.49075    -0.28687

Table II shows the OLR estimator of $\beta$, say $\hat{\beta}_{OLS}$, using cleansing data, and the model for water level prediction using OLR is

$$f_{OLR2}(x_1, x_2, x_3) = 0.62455 - 0.00029 x_1 + 1.49075 x_2 - 0.28687 x_3. \tag{13}$$

The model for rainfall prediction using cleansing data is

$$x_2 = 0.06272 + 0.00030 x_1. \tag{14}$$

The model for evaporation prediction using cleansing data is

$$x_3 = 0.69381 + 0.00012 x_1. \tag{15}$$

B. Partial Least Squares Regression

PLSR is the other method used in this experiment to obtain the prediction model, and the results of the two methods are compared between the original data and the cleansing data. A validation method is used to choose the number of PLS components, and the model with the lowest MSECV is considered to be the optimal one.

TABLE III
THE MATRIX OF PREDICTOR LOADINGS USING NORMALIZED ORIGINAL DATA
Predictors     Comp1       Comp2      Comp3
Month          323.9861    -0.0875    8E-05
Rainfall       -0.0415     1.1471     0.2658
Evaporation    -0.0140     -0.1303    1.0703

Table III shows the matrix of predictor loadings. Each row of the matrix contains the coefficients of a linear combination of PLS components that approximates the original predictor variable.

TABLE IV
THE MATRIX OF RESPONSE LOADINGS USING NORMALIZED ORIGINAL DATA
Response       Comp1       Comp2      Comp3
Water level    -0.03397    0.72955    -0.09925

Table IV shows the matrix of response loadings. Each row likewise contains the coefficients of a linear combination of PLS components that approximates the original response variable.

TABLE V
THE MATRIX OF PREDICTOR LOADINGS USING CLEANSING DATA
Predictors     Comp1        Comp2      Comp3
Month          -323.9861    -0.0446    -0.0002
Rainfall       -0.0968      0.6926     0.2120
Evaporation    -0.0388      -0.1658    1.0230

TABLE VI
THE MATRIX OF RESPONSE LOADINGS USING CLEANSING DATA
Response       Comp1      Comp2     Comp3
Water level    -0.0382    1.0801    0.0226

TABLE VII
THE $\hat{B}_{PLSR}$ VECTOR FOR MODEL PREDICTION USING NORMALIZED ORIGINAL DATA
B_PLSR0    B_PLSR1    B_PLSR2    B_PLSR3
0.64623    0.00004    0.60831    -0.24382

Table VII describes the regression coefficients of PLSR using normalized original data, and the model for water level prediction using PLSR is

$$g_{PLSR1}(x_1, x_2, x_3) = 0.64623 + 0.00004 x_1 + 0.60831 x_2 - 0.24382 x_3. \tag{16}$$

TABLE VIII
THE $\hat{B}_{PLSR}$ VECTOR FOR PREDICTION MODEL USING CLEANSING DATA
B_PLSR0    B_PLSR1    B_PLSR2    B_PLSR3
0.62455    -0.00029   1.49075    -0.28687

Table VIII shows the regression coefficients of PLSR using cleansing data, and the model for water level prediction using PLSR is

$$g_{PLSR2}(x_1, x_2, x_3) = 0.62455 - 0.00029 x_1 + 1.49075 x_2 - 0.28687 x_3. \tag{17}$$

The models for evaporation and rainfall cannot be obtained using PLSR since each has only one predictor, and PLSR is more meaningful when there are more predictors. In this case, we use OLR to obtain the models for evaporation and rainfall.

VI. MODEL SELECTION

In this study, we restrict ourselves to the common variant of CV called K-fold CV, where the calibration objects are divided into k segments; for this experiment we use k = 10 [2, 17]. As long as the number of components selected by k-fold CV lies in the range where the prediction error is close to its minimum, the actual value of the number of components is immaterial [17]. We used 10-fold CV to obtain the appropriate model for predicting the water level at the Dungun River of Terengganu using two types of data, the normalized original data and the cleansing data. The data were analyzed using OLR and PLSR, and the results are compared between the normalized original data and the cleansing data to obtain the better model according to the lowest value of MSECV.

TABLE VIIII
A COMPARISON OF MSECV FOR WATER LEVEL IN DUNGUN RIVER USING 10-FOLD CROSS-VALIDATION
Data                        Method              MSECV
Normalized original data    OLR                 0.0445
                            PLSR (ncomp=1)      0.0271
                            PLSR (ncomp=2)      0.0348
                            PLSR (ncomp=3)      0.0348
Cleansing data              OLR                 0.0251
                            PLSR (ncomp=1)      0.0257
                            PLSR (ncomp=2)      0.0151
                            PLSR (ncomp=3)      0.015
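The K-fold comparison by MSECV can be sketched as follows. This is an illustrative pure-Python version under stated assumptions, not the Matlab 'crossval' call used in the paper: the helper names (`solve`, `fit_ols`, `fit_mean`, `predict`, `msecv`) are hypothetical, the folds are formed by shuffling with a fixed seed, and the MSECV is taken as the pooled squared error over all held-out observations divided by the number of observations. The mean-only model is included only as a baseline for comparison.

```python
import random


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def fit_ols(X, y):
    """Return [b0, b1, ..., bp] from the normal equations of equation (2)."""
    X1 = [[1.0] + list(row) for row in X]          # prepend the column of ones
    q = len(X1[0])
    A = [[sum(r[i] * r[j] for r in X1) for j in range(q)] for i in range(q)]
    b = [sum(r[i] * yi for r, yi in zip(X1, y)) for i in range(q)]
    return solve(A, b)


def fit_mean(X, y):
    """Baseline: predict the training mean, ignoring all predictors."""
    return [sum(y) / len(y)] + [0.0] * len(X[0])


def predict(beta, row):
    return beta[0] + sum(b * v for b, v in zip(beta[1:], row))


def msecv(X, y, fit, k=10, seed=0):
    """Pooled squared error over k held-out folds, divided by n."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]          # k folds of roughly equal size
    sq_err = 0.0
    for fold in folds:
        held = set(fold)
        train = [i for i in idx if i not in held]  # the other k-1 folds
        beta = fit([X[i] for i in train], [y[i] for i in train])
        sq_err += sum((y[i] - predict(beta, X[i])) ** 2 for i in fold)
    return sq_err / len(y)
```

Calling `msecv(X, y, fit_ols)` and `msecv(X, y, fit_mean)` on the same data and seed gives two numbers that can be compared in the same way as the rows of the MSECV table; the candidate with the lower value would be preferred.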

Table VIIII illustrates the comparison of MSECV for the water level in the Dungun River using 10-fold CV of OLR and PLSR. From Table VIIII, PLSR with cleansing data and ncomp equal to 3 has the smallest MSECV. Therefore, this PLSR model is considered the best model. Figure 3 shows the comparison between actual and predicted monthly water level for the Dungun River with the 2010 test data using normalized original data, and Figure 4 shows the same comparison using cleansing data. From these graphs, it is clear that the cleansing data achieve closer agreement between actual and predicted water level than the normalized original data. To obtain the predicted water level in the original scale, we modified equation (9) as follows:

$$z1_{\text{original},i} = z1_{\text{normalized},i} \cdot z1_{\max} \tag{18}$$

where:
$z1_{\text{original},i}$ is the predicted water level in the original scale,
$z1_{\text{normalized},i}$ is the predicted water level in the normalized scale,
$z1_{\max}$ is the maximum value of the water level in the original scale.

Fig. 3. A comparison between actual and predicted monthly water level for Dungun River with test data in 2010 using normalized original data. [Plot: Actual Water Level and Predicted Water Level (PLSR) against Month (2010).]

Fig. 4. A comparison between actual and predicted monthly water level for Dungun River with test data in 2010 using cleansing data. [Plot: Actual Water Level and Predicted Water Level (PLSR) against Month (2010).]

VII. CONCLUSION

In Dungun district, the rising water level of the river is a critical issue since it can cause floods and extensive damage. We compared two types of data, the original data and the cleansing data, using the OLR and PLSR approaches. The experiment showed that PLSR using cleansing data is a more suitable model than OLR and than PLSR without cleansing data. PLSR using cleansing data gives higher accuracy than PLSR using the original data. Our further research will focus on nonlinear methods and compare them with the PLSR model.

ACKNOWLEDGEMENT

This project is funded by the Short Term Research Grant-Foreign Academic Visitor Fund (vot number: 4D051). The authors would like to thank the Research Management Centre for supporting this research and the Drainage and Irrigation Department of Malaysia for general assistance. The first author would like to thank the Zamalah scholarship for supporting her master by research program.

REFERENCES

[1] K. P. Bennett and M. J. Embrechts. An optimization perspective on kernel partial least squares regression. Mathematical Sciences Dept., Decision Science and Engineering Systems Dept., Rensselaer Polytechnic Institute, 227-250, 2003.
[2] L. Breiman, J. H. Friedman, R. A. Olshen and C. Stone. Classification and regression trees. Wadsworth: Belmont, CA, 1984.
[3] A. C. Davison and D. V. Hinkley. Bootstrap methods and their application. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, UK, 1997.
[4] M. B. Gasim, J. H. Adam, M. E. H. Toriman, S. A. Rahim and H. H. Juahir. Coastal flood phenomenon in Terengganu, Malaysia: Special references to Dungun. Research Journal of Environment Sciences 1 (3): 102-109, 2007.

[5] S. Geisser. The predictive sample reuse method with applications. J. Am. Stat. Assoc., 70(350):320-328, 1975.
[6] I. S. Helland. On the structure of partial least squares regression. Communications in Statistics Elements of Simulation and Computation, 17:581-607, 1988.
[7] S. D. Jong. SIMPLS: An alternative approach to partial least squares regression. Chemometrics and Intelligent Laboratory Systems, Elsevier Science Publisher B.V., Amsterdam, 251-263, 1993.
[8] R. Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of International Joint Conference on AI, 1995, pp. 1137-1145, URL http://citeseer.ist.psu.edu/Kohavi95study.html.
[9] S. Larson. The shrinkage of the coefficient of multiple correlation. J. Educat. Psychol., 22:45-55, 1931.
[10] B. H. Mevik, H. R. Cederkvist. Mean squared error of prediction (MSEP) estimates for principal component regression (PCR) and partial least squares regression (PLSR). Journal of Chemometrics, 18(9): 422-429, 2004.
[11] P. Refaeilzadeh, L. Tang and H. Liu. Cross-validation. Arizona State University, 2008.
[12] R. Rosipal and L. J. Trejo. Kernel partial least squares regression in reproducing kernel Hilbert space. Journal of Machine Learning Research 2, 97-123, 2001.
[13] A. J. Shaaban, Y. M. Chan, M. L. Kavvas, Z. Q. Chen, and N. Ohara. Impact of climate change on Peninsular Malaysia Water Resources and Hydrologic Regime, 2010.
[14] L. Shu, G. Dong, L. Liu, Y. Tao and M. Wang. Water level variation and prediction of the Pingshan Sinkhole in Guizhou, Southwestern China. Sinkholes and the Engineering and Environmental Impacts of Karst, 2008.
[15] M. Sjöström, S. Wold, W. Lindberg, J.-A. Persson and H. Martens. A multivariate calibration problem in analytical chemistry solved by partial least-squares models in latent variables. Analytica Chimica Acta, 150:61-70, 1983.
[16] M. Stone. Cross-validatory choice and assessment of statistical predictions. J. Royal Stat. Soc., 36(2):111-147, 1974.
[17] O. Yeniay and A. Goktas. A comparison of partial least squares regression with other prediction methods. Hacettepe Journal of Mathematics and Statistics, 31: 99-111, 2002.
[18] S. Wiklund, D. Nilsson, L. Eriksson, M. Sjostrom, S. Wold and K. Faber. A randomization test for PLS component selection. Journal of Chemometrics, 21:427-439, 2007.
[19] H. Wold. Soft modeling. The basic design and some extensions, in: K.-G. Jöreskog, H. Wold (Eds.), System Under Indirect Observation, vols. I and II, North-Holland, Amsterdam, 1982.

Noraini Ibrahim is currently a Master's student in the Faculty of Computer Science and Information Systems in UTM Johor Bahru, Malaysia. She received a B.Sc in Industrial Mathematics from UTM. Her interests are in the field of computational intelligence and data analysis.

Antoni Wibowo is currently working as a senior lecturer in the Faculty of Computer Science and Information Systems in UTM. He received a B.Sc in Math Engineering from Sebelas Maret University (UNS) Indonesia and an M.Sc in Computer Science from University of Indonesia. He also holds an M.Eng and a Dr.Eng in System and Information Engineering from University of Tsukuba, Japan. His interests are in the field of computational intelligence, machine learning, cybernetics, operations research and data analysis.

124702-8383 IJBAS-IJENS © April 2012 IJENS I J E N S