Predictions of Water Level in Dungun River Terengganu Using Partial Least Squares Regression
International Journal of Basic & Applied Sciences IJBAS-IJENS Vol: 12 No: 02
124702-8383 IJBAS-IJENS © April 2012 IJENS

Noraini Ibrahim, Antoni Wibowo
Faculty of Computer Science and Information Systems, 81310 UTM Johor Bahru, Johor, Malaysia

Abstract - Floods are a common phenomenon in the district of Dungun in Terengganu, Malaysia. Every year, floods affect the biodiversity of this region and cause property loss in its residential areas. The residents of Dungun have long suffered from floods, because water overflows into the areas adjoining the rivers, lakes or dams. The rainfall and evaporation of the area have a large influence on the water level of Dungun River. Therefore, a suitable prediction model is needed to forecast the water level in Dungun River; here it is built by adopting ordinary linear regression (OLR) and partial least squares regression (PLSR) based on hydrological data. However, the hydrological data must first be cleansed, since the original data contain inconsistent entries. The experiments show that PLSR is a more suitable model than OLR and that using the cleansed data gives higher accuracy than using the original data.

Keywords - Dungun River, evaporation, flooding, linear regression, partial least squares regression, rainfall, water level

I. INTRODUCTION

Many states in Malaysia suffer from floods during the monsoon season, especially Kedah, Kelantan, Terengganu, Pahang and Johor. Terengganu is a state on the east coast of Peninsular Malaysia that floods every year between October and March, during the North-East monsoon period [4]. Annual rainfall records for the subregions of Peninsular Malaysia over the decade 1984-1993 show that Terengganu has the highest annual rainfall [13]. Flooding in the Dungun district of Terengganu results from a combination of physical factors, such as elevation and close proximity to the sea, in addition to the heavy rainfall experienced during the monsoon period. The severe floods all over Terengganu result from heavy rainfall during the north-east monsoon season, especially in November and December.

The floods that affect Dungun district, and that cause extensive damage to property and road systems along the eastern coast of Terengganu, are categorized as coastal flooding. Several flood characteristics have been measured by the Department of Irrigation and Drainage Malaysia: water level, area of inundation, peak discharge, volume of flow and duration. The department has introduced three critical stages of water level: alert, warning and danger. It has also set up four telemetric stations in Dungun to observe and monitor the water level and rainfall.

Much related work exists on predicting water level, most of it using neural networks and other nonlinear methods. Previous research on water level variation and prediction at the Pingshan Sinkhole in Guizhou, Southwestern China, shows that an artificial neural network (ANN) combined with partial least squares regression achieves better results than PLSR alone [14].

This study attempts to find an appropriate model for predicting the water level in Dungun River. OLR and PLSR are used because the relationship between the predictors and the response appears to be linear. Cross validation (CV) is applied to select the appropriate model. The normalized original hydrological data contain inconsistent entries, so the data must be cleansed. The following sections present an approach to developing the water level, evaporation and rainfall models. OLR and PLSR are discussed in Section II, and the evaluation of the prediction quality is described in Section III. A case study and experimental results are reported in Sections IV, V and VI. Finally, the conclusion is given in Section VII.

II. THEORIES AND METHODS

The linear regression model is given as follows [16]:

$$y = \mathbf{1}_n \beta_0 + X \beta_1 + \varepsilon \tag{1}$$

where $y$ is an $n \times 1$ vector of observations on the response variable, $X$ is an $n \times p$ matrix consisting of $n$ observations and $p$ predictors, $\beta_0$ is an unknown constant, $\beta_1$ is a $p \times 1$ vector of unknown regression coefficients, $\mathbf{1}_n$ is an $n \times 1$ vector of ones and $\varepsilon$ is an $n \times 1$ vector of errors that are independently and identically distributed with mean zero and variance $\sigma^2 > 0$.

A. Ordinary Linear Regression

OLR is often used to fit models; it works by minimizing the sum of the squared residuals between the predicted and actual response. When the matrix $X_1 = [\mathbf{1}_n \; X]$ has full column rank $p + 1$, the OLR estimator of $\beta = (\beta_0, \beta_1^T)^T$, say $\hat{\beta}_{OLS} = [\beta_{OLS0}, \beta_{OLS1}, \ldots, \beta_{OLSp}]^T$, is given by

$$\hat{\beta}_{OLS} = (X_1^T X_1)^{-1} X_1^T y. \tag{2}$$

The prediction of $y$ is given by

$$\hat{y} = X_1 \hat{\beta}_{OLS}. \tag{3}$$

The model for OLR can be represented by

$$f(x) = \beta_{OLS0} + \beta_{OLS1} x_1 + \cdots + \beta_{OLSp} x_p \tag{4}$$

where $x = [x_1, x_2, \ldots, x_p]^T \in \mathbb{R}^p$.

B. Partial Least Squares Regression

Partial Least Squares (PLS) has proven to be an effective approach to problems in chemometrics, such as predicting the bioactivity of molecules to facilitate the discovery of novel pharmaceuticals. The PLS approach was originated around 1975 by Herman Wold for modeling complicated datasets in terms of blocks of matrices, called path models [19]. The PLS method was introduced in the chemical literature as an algorithm, and it is only recently that its numerical and statistical properties have become more apparent [16].

PLSR is a technique for modeling a linear relationship between a set of output variables (responses) $\{y_i\}_{i=1}^{n}$, $y_i \in \mathbb{R}^L$, with $L$-dimensional responses, and a set of input variables (regressors) $\{x_i\}_{i=1}^{n}$, $x_i \in \mathbb{R}^p$, with $p$ variables [12]. As a first step in PLSR, the data matrices $X$ and $y$ in this analysis are assumed to be centered. In this paper, we only use a one-dimensional response, that is, $L = 1$. PLS models the relations between sets of observed variables by means of latent variables, which are linear combinations of the original regressors that retain most of the information in the input variables. PLS is useful when the number of explanatory variables exceeds the number of observations and a high level of multicollinearity among those variables is assumed. The weights used to determine the linear combinations of the original regressors are proportional to the covariance between the input and output variables [6].

C. Partial Least Squares Regression Using SIMPLS Algorithm

The SIMPLS algorithm was used to compute the regression coefficients of the model for predicting the water level in Dungun River, Terengganu. The SIMPLS algorithm is reported to be robust, fast, easy to implement and simple to tune [1]. In the PLSR approach, we need to obtain the PLSR estimator, say $\hat{B}_{PLSR} = [\hat{B}_{PLSR1}, \hat{B}_{PLSR2}, \ldots, \hat{B}_{PLSRp}]^T$. It starts with computing the cross-product [7]

$$S = X^T y. \tag{5}$$

The iteration then runs from 1 to $A$ latent variables, where $A$ is determined in advance and $1 \le A \le p$. The SIMPLS algorithm is given as follows.

For $a = 1$ to $A$:
• If $a = 1$, compute the singular value decomposition (svd) of $S$: $[u, s, v] = \mathrm{svd}(S)$. Otherwise, if $a > 1$, compute the svd of the deflated cross-product: $[u, s, v] = \mathrm{svd}(S - P (P^T P)^{-1} P^T S)$.
• Take the weight vector $r$ as the first left singular vector: $r = u(:, 1)$.
• Compute the scores: $t = X r$.
• Compute the loadings: $p = X^T t / (t^T t)$.
• Store the vectors $r$, $t$ and $p$ as new columns of $R$, $T$ and $P$, respectively.

The last step computes the regression coefficients:

$$\hat{B}_{PLSR} = R (T^T T)^{-1} T^T y. \tag{6}$$

Then, the PLSR estimate is

$$\hat{y} = X \hat{B}_{PLSR}. \tag{7}$$

The model for PLSR can be represented by

$$g(x) = \bar{y} + \hat{B}_{PLSR1} (x_1 - \bar{x}_1) + \cdots + \hat{B}_{PLSRp} (x_p - \bar{x}_p) \tag{8}$$

where $\bar{y}$ is the mean of the responses $y_i$ and $\bar{x}_j$ is the mean of the observed values of $x_j$, for $j = 1, \ldots, p$.

III. EVALUATING THE QUALITY OF THE PREDICTION

The quality of the prediction is evaluated using $A$ latent variables, $\hat{y}_i$ and $y_i$ [6]. The CV technique is used to estimate the prediction capacity and the data are separated between [...]

[...] months, rainfall and evaporation while the response is water levels. Dungun receives most of its monthly rainfall in the rainy season, which starts in October and ends in March. The following tables show the details of the predictors and response used for water level prediction.

A. Data

In this paper, we use data from Dungun River, Terengganu, between 2001 and 2010. We want to examine the relationship of rainfall, evaporation and month with the water level of Dungun River. The regression coefficients and the resulting models are obtained using OLR and PLSR.
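
The two fitting procedures of Section II can be illustrated with a minimal NumPy sketch: the OLR estimator of equation (2) and the single-response SIMPLS steps of equations (5) to (7). This is not the authors' code. The function names `fit_olr` and `fit_simpls`, the argument `n_components` (standing for the number of latent variables A) and the synthetic data are assumptions made for illustration. OLR is solved here with `numpy.linalg.lstsq` instead of an explicit matrix inverse, which solves the same least-squares problem.

```python
import numpy as np


def fit_olr(X, y):
    """Return [beta_0, beta_1, ..., beta_p], the OLR estimate of equation (2)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    X1 = np.column_stack([np.ones(len(X)), X])  # X1 = [1_n  X]
    # lstsq minimizes the same squared residuals as (X1^T X1)^{-1} X1^T y.
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta


def fit_simpls(X, y, n_components):
    """Single-response SIMPLS following equations (5) to (7).

    Returns (B, x_mean, y_mean); a prediction in the form of equation (8)
    is y_mean + (x - x_mean) @ B.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean       # center X and y first
    S = (Xc.T @ yc).reshape(-1, 1)        # cross-product, equation (5)
    R, T, P = [], [], []
    for a in range(n_components):
        if a == 0:
            S_a = S
        else:
            Pm = np.column_stack(P)
            # Deflate S by removing its projection onto the stored loadings.
            S_a = S - Pm @ np.linalg.solve(Pm.T @ Pm, Pm.T @ S)
        u, _, _ = np.linalg.svd(S_a, full_matrices=False)
        r = u[:, 0]                       # weights: first left singular vector
        t = Xc @ r                        # scores
        p = Xc.T @ t / (t @ t)            # loadings
        R.append(r)
        T.append(t)
        P.append(p)
    Rm, Tm = np.column_stack(R), np.column_stack(T)
    B = Rm @ np.linalg.solve(Tm.T @ Tm, Tm.T @ yc)  # equation (6)
    return B, x_mean, y_mean


# Synthetic stand-in for three predictors (NOT the Dungun River data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 3))
y_demo = 2.0 + X_demo @ np.array([1.0, -0.5, 0.3]) + 0.1 * rng.normal(size=60)

beta_olr = fit_olr(X_demo, y_demo)
B_pls, x_bar, y_bar = fit_simpls(X_demo, y_demo, n_components=2)
print("OLR coefficients:", beta_olr)
print("PLSR coefficients (A = 2):", B_pls)
```

When A equals p and X has full column rank, the PLSR coefficients coincide with the OLR slopes, which gives a convenient sanity check for an implementation. The two models differ only when fewer latent variables are kept (A smaller than p).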