
Article

Machine Learning Algorithms for the Forecasting of Wastewater Quality Indicators

Francesco Granata 1,*, Stefano Papirio 2, Giovanni Esposito 1, Rudy Gargano 1 and Giovanni de Marinis 1

1 Department of Civil and Mechanical Engineering, University of Cassino and Southern Lazio, via G. Di Biasio 43, 03043 Cassino, Italy; [email protected] (G.E.); [email protected] (R.G.); [email protected] (G.d.M.)
2 Department of Civil, Architectural and Environmental Engineering, University of Napoli “Federico II”, via Claudio 21, 80125 Napoli, Italy; [email protected]
* Correspondence: [email protected]

Academic Editor: Yung-Tse Hung Received: 21 November 2016; Accepted: 6 February 2017; Published: 9 February 2017

Abstract: Stormwater runoff is often contaminated by human activities. Stormwater discharge into water bodies significantly contributes to environmental pollution. The choice of suitable treatment technologies depends on the pollutant concentrations. Wastewater quality indicators such as biochemical oxygen demand (BOD5), chemical oxygen demand (COD), total suspended solids (TSS), and total dissolved solids (TDS) give a measure of the main pollutants. The aim of this study is to provide an indirect methodology for the estimation of the main wastewater quality indicators, based on some characteristics of the drainage basin. The catchment is seen as a black box: the physical processes of accumulation, washing, and transport of pollutants are not mathematically described. Two models deriving from studies on artificial intelligence have been used in this research: Support Vector Regression (SVR) and Regression Trees (RT). Both models showed robustness, reliability, and high generalization capability. However, with reference to the coefficient of determination R2 and the root-mean square error, Support Vector Regression showed a better performance than Regression Tree in predicting TSS, TDS, and COD. As regards BOD5, the two models showed a comparable performance. Therefore, the considered machine learning algorithms may be useful for providing an estimate of the values to be considered for the sizing of treatment units in the absence of direct measurements.

Keywords: stormwater; machine learning; Support Vector Regression; Regression Tree; wastewater; treatment units

1. Introduction

In a rapidly developing world, water is more and more extensively polluted [1]. Urban areas can pollute water in many ways. Modifications of the urban environment lead to changes in the physical and chemical characteristics of urban stormwater [2]. The increase of impervious surfaces affects many components of the hydrologic cycle, altering the pervious interface between the atmosphere and the soil. Runoff is heavily contaminated by pollution arising from industrial activities, civil emissions, and road traffic emissions. Stormwater collected in sewer systems and then discharged into water bodies significantly contributes to environmental pollution. The selection of suitable treatment technologies requires an understanding of the pollutant concentrations. The stormwater quality characterization [3] is a challenging issue that has become of utmost importance with increasing urban development. The problem complexity is mainly due to the randomness of the phenomena that govern the dynamics of pollutants in the basin and in the drainage network. The complexity is increased by the real difficulty of acquiring sufficient information to describe in detail the evolution of the pollution parameters during runoff. The water quality in the sewer system during rainfall events depends on many factors related to the basin, the drainage system, the rainfall and, more generally, the climate. The urban stormwater quality process may be considered as consisting of four main processes:

- Accumulation of sediments and pollutants on the basin surface (build-up);
- Washing of the surfaces operated by rain (wash-off);
- Accumulation of sediments in the drainage system;
- Transport of pollutants in the drainage system.

These phenomena can be modeled according to different approaches, and the quality models may be used for different purposes in the study of urban drainage. In particular, the quality models are extremely useful for the characterization of effluents, the determination of the pollution load in the receiving watercourses, the evaluation of the dissolved oxygen in the flow [4], and the prediction of the effects of the drainage network control systems. Wastewater Quality Indicators (WQIs) give a measure of the physical (i.e., solid content, turbidity, and electric conductivity), chemical (i.e., BOD5, COD, total nitrogen, phosphorus, pH, and dissolved oxygen), and biological (pathogenic microorganisms, etc.) parameters in the effluents. In order to choose the most appropriate treatment systems, the most used WQIs are biochemical oxygen demand (BOD5), chemical oxygen demand (COD), total suspended solids (TSS), and total dissolved solids (TDS). These indicators are evaluated by means of specific laboratory tests. TDS are measured as the mass of the solid residue remaining in the liquid phase after filtration of a water sample. Conversely, the mass of solids remaining on the filter surface is the amount of TSS. COD is used to indirectly measure the amount of reduced organic and inorganic compounds in water. BOD5, in turn, is the amount of dissolved oxygen needed by aerobic microorganisms to oxidize the biodegradable organic material present in the wastewater at a certain temperature in five days. Wastewater treatment plants should be designed and sized using quality parameters (i.e., BOD5, COD, TSS, and TDS) measured on samples taken from the influent wastewater, ideally in both dry and runoff conditions. However, direct measurements are often not available, as the sampling of the wastewater cannot always be performed. Therefore, the quality parameters are typically calculated from standard values of per capita production [5]. For instance, the influent BOD5 is calculated assuming that each person produces 60 g O2 per day. This approach is reliable only when calculating the quality parameters under dry climate conditions. A method to estimate the influent quality parameters during runoff is therefore needed for properly sizing the wastewater treatment units. Hence, the aim of this study is to provide an indirect methodology for the estimation of the main wastewater quality indicators, based on some characteristics of the drainage basin. The catchment is seen as a black box, with the physical processes of accumulation, washing, and transport of pollutants not mathematically described in detail. Recently, models derived from artificial intelligence studies have been increasingly used in hydraulic engineering issues. Many researchers have investigated the feasibility of solving different water engineering problems using Support Vector Regression (SVR) or Regression Trees (RT). Support Vector Regression is a variant of the Support Vector Machine, a machine-learning algorithm based on a two-layer structure. The first layer (Figure 1) is a kernel function weighting on the input variable series, while the second is a weighted sum of the kernel outputs.

Figure 1. Network architecture of the SVR algorithms.

Regression Tree analysis is based on a decision tree that is used as a predictive model (Figure 2). The tree is obtained by splitting the source set into subsets on the basis of an attribute value test. The process is recursively repeated on each derived subset. It reaches a conclusion when the subset at a node has all the same values of the target variable, or when splitting no longer improves the results of the predictions.

Figure 2. A typical Regression Tree architecture.

Dibike et al. [6] used Support Vector Machines in rainfall-runoff modeling in comparison with ANN models. Bray and Han [7] emphasized the difficulties in identifying a suitable model structure and parameters for rainfall-runoff modeling. Chen et al. [8] applied SVMs to forecast the daily precipitation in the Hanjiang basin. Granata et al. [9] performed a comparative study of rainfall-runoff modeling between an SVR-based approach and the EPA's Storm Water Management Model. Raghavendra and Deka [10] provided an extensive review of the SVM applications in hydrology.


The RT method has been employed for sediment transport [11,12], flood prediction [13], mean annual flood forecasting [14], scour depth prediction [15,16], and sediment yield estimation in rivers [17]. A comparative study of wastewater quality modeling between Support Vector Regression- and Regression Tree-based approaches has been performed in this work. The experimental data was taken from the National Stormwater Quality Database (NSQD), which was developed under the support of the U.S. Environmental Protection Agency [18]. The water quality indicators have been correlated to the main characteristics of the basin, such as the drainage area, the percentage of impervious surface, the types of land use, the rainfall depth, and the runoff.

2. Materials and Methods

2.1. Support Vector Regression

Support vector machines (SVMs) are supervised learning models that analyze data for classification or regression analysis (SVR). Given a training data set {(x1, y1), (x2, y2), ..., (xl, yl)} ⊂ X × R, where X is the space of the input patterns (e.g., X = R^n), the main aim of SVR is to find a function f(x) that deviates by at most ε from the actually obtained target values yi for all the training data. Moreover, f(x) has to be as flat as possible. Errors smaller than ε are tolerated, while errors greater than ε are generally unacceptable. For this purpose, for linear functions of the form

$$f(x) = \langle w, x \rangle + b \tag{1}$$

in which w ∈ X, b ∈ R, and ⟨·, ·⟩ is the dot product in X, the Euclidean norm ‖w‖² has to be minimized. Accordingly, the problem can be considered as a convex optimization problem, as reported in the following equations:

$$\text{minimize} \quad \frac{1}{2}\|w\|^2 \qquad \text{subject to} \quad \begin{cases} y_i - \langle w, x_i \rangle - b \le \varepsilon \\ \langle w, x_i \rangle + b - y_i \le \varepsilon \end{cases} \tag{2}$$

Equation (2) implicitly states that a function f is capable of approximating all the pairs (xi, yi) with ε precision. However, in many cases this is not possible, or a certain error has to be tolerated. Therefore, slack variables ξi, ξi* have to be introduced (Figure 3) in the constraints of the optimization problem, based on the following formulation:

$$\text{minimize} \quad \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{l}\left(\xi_i + \xi_i^*\right) \qquad \text{subject to} \quad \begin{cases} y_i - \langle w, x_i \rangle - b \le \varepsilon + \xi_i \\ \langle w, x_i \rangle + b - y_i \le \varepsilon + \xi_i^* \end{cases} \tag{3}$$
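As a concrete illustration of Equation (3), the following minimal Python sketch evaluates the ε-insensitive loss and the resulting primal objective for a candidate linear model. This is not the authors' implementation; all names and values are illustrative.

```python
import numpy as np

def eps_insensitive_loss(y_true, y_pred, eps=0.01):
    """Deviations inside the eps-tube cost nothing; larger ones grow linearly."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - eps)

def svr_primal_objective(w, b, X, y, C=1.0, eps=0.01):
    """Eq. (3): flatness term 0.5*||w||^2 plus the C-weighted total slack."""
    slack = eps_insensitive_loss(y, X @ w + b, eps)
    return 0.5 * np.dot(w, w) + C * slack.sum()
```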

The constant C > 0 determines the trade-off between the flatness of f and the amount up to which deviations larger than ε are tolerated. The optimization problem defined in Equation (3) is commonly solved in its dual formulation, utilizing Lagrange multipliers:

$$L = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{l}\left(\xi_i + \xi_i^*\right) - \sum_{i=1}^{l}\alpha_i\left(\varepsilon + \xi_i - y_i + \langle w, x_i \rangle + b\right) - \sum_{i=1}^{l}\alpha_i^*\left(\varepsilon + \xi_i^* + y_i - \langle w, x_i \rangle - b\right) - \sum_{i=1}^{l}\left(\eta_i \xi_i + \eta_i^* \xi_i^*\right) \tag{4}$$

in which αi, αi*, ηi, ηi* ≥ 0, whereas the partial derivatives of L with respect to the variables w, b, ξi, ξi* must be equal to zero in the optimal condition. Thus, the dual optimization problem can be formulated as

$$\begin{aligned} \text{maximize} \quad & -\frac{1}{2}\sum_{i,j=1}^{l}\left(\alpha_i - \alpha_i^*\right)\left(\alpha_j - \alpha_j^*\right)\langle x_i, x_j \rangle - \varepsilon \sum_{i=1}^{l}\left(\alpha_i + \alpha_i^*\right) + \sum_{i=1}^{l} y_i\left(\alpha_i - \alpha_i^*\right) \\ \text{subject to} \quad & \sum_{i=1}^{l}\left(\alpha_i - \alpha_i^*\right) = 0, \qquad \alpha_i, \alpha_i^* \in [0, C] \end{aligned} \tag{5}$$

The calculation of b can be performed on the basis of the Karush-Kuhn-Tucker conditions [19], which impose that the product between constraints and dual variables must vanish at the optimal solution, that is:

$$\alpha_i\left(\varepsilon + \xi_i - y_i + \langle w, x_i \rangle + b\right) = 0, \qquad \alpha_i^*\left(\varepsilon + \xi_i^* + y_i - \langle w, x_i \rangle - b\right) = 0 \tag{6}$$

and

$$\left(C - \alpha_i\right)\xi_i = 0, \qquad \left(C - \alpha_i^*\right)\xi_i^* = 0 \tag{7}$$

Then, in order to make the Support Vector algorithm nonlinear, the training patterns xi can be preprocessed by a map Φ: X → F, where F is some feature space. The SVR algorithm depends only on the dot products between the various patterns, so it is sufficient to use a kernel k(xi, xj) = ⟨Φ(xi), Φ(xj)⟩ instead of the map Φ(·) explicitly. Hence, the Support Vector algorithm can be rewritten as:

$$\begin{aligned} \text{maximize} \quad & -\frac{1}{2}\sum_{i,j=1}^{l}\left(\alpha_i - \alpha_i^*\right)\left(\alpha_j - \alpha_j^*\right) k\left(x_i, x_j\right) - \varepsilon \sum_{i=1}^{l}\left(\alpha_i + \alpha_i^*\right) + \sum_{i=1}^{l} y_i\left(\alpha_i - \alpha_i^*\right) \\ \text{subject to} \quad & \sum_{i=1}^{l}\left(\alpha_i - \alpha_i^*\right) = 0, \qquad \alpha_i, \alpha_i^* \in [0, C] \end{aligned} \tag{8}$$

The expansion of f can be formulated as follows:

$$f(x) = \sum_{i=1}^{l}\left(\alpha_i - \alpha_i^*\right) k\left(x_i, x\right) + b \tag{9}$$

Although uniquely defined, w is no longer explicitly identified, unlike in the linear case. In the nonlinear case, the optimization problem amounts to finding the flattest function in the feature space, not in the input space. Mercer's condition has to be satisfied by the functions k(xi, xj), so that they correspond to a dot product in some feature space F [20]. In this study, a radial basis function (RBF) has been chosen as kernel. The RBF has the form:

$$k\left(x_i, x_j\right) = \exp\left(-\gamma \left\|x_i - x_j\right\|^2\right), \qquad \gamma > 0 \tag{10}$$
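Equations (9) and (10) translate directly into a prediction routine. The sketch below assumes the dual coefficients (αi − αi*), the support vectors, and b have already been obtained by solving the dual problem; it is illustrative only, and γ is an arbitrary placeholder value.

```python
import numpy as np

def rbf_kernel(xi, xj, gamma=0.1):
    """Eq. (10): k(xi, xj) = exp(-gamma * ||xi - xj||^2), gamma > 0."""
    diff = np.asarray(xi) - np.asarray(xj)
    return np.exp(-gamma * np.dot(diff, diff))

def svr_predict(x, support_vectors, dual_coefs, b, gamma=0.1):
    """Eq. (9): f(x) = sum_i (alpha_i - alpha_i*) k(x_i, x) + b,
    where dual_coefs[i] holds the difference (alpha_i - alpha_i*)."""
    return sum(c * rbf_kernel(sv, x, gamma)
               for sv, c in zip(support_vectors, dual_coefs)) + b
```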

A Support Vector algorithm was implemented in the MATLAB environment. The search for the optimal structure of the model and its parameters was conducted by means of a trial and error iteration procedure. The deviation between the function found by the SVR and the target value was controlled by the ε parameter. Finally, ε was chosen to be equal to 0.01, while the cost of error C, which affects the smoothness of the function, was assumed equal to 1.
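The paper reports a MATLAB implementation; a roughly equivalent setup can be sketched in Python with scikit-learn, using the stated hyperparameters (RBF kernel, ε = 0.01, C = 1). The value of γ is not reported, so scikit-learn's default heuristic is used here as an assumption; X_train, y_train, and X_test are assumed to be prepared as described in Section 3.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# epsilon and C as reported in the paper; gamma is an assumption
# (left to scikit-learn's default 'scale' heuristic, as it is not reported).
model = make_pipeline(
    StandardScaler(),
    SVR(kernel="rbf", epsilon=0.01, C=1.0),
)
model.fit(X_train, y_train)    # basin descriptors -> measured WQI values
y_pred = model.predict(X_test)
```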

Figure 3. Example of nonlinear Support Vector Regression.

2.2. Regression Tree

A Regression Tree [21] is a variant of Decision Trees, developed to approximate real-valued functions. Once again, the aim is to develop a model that can predict the value of a target variable, given several input variables. In the RT model, the whole input domain is recursively partitioned into sub-domains, while a multivariable linear regression model is used to make predictions in each of them. The tree is composed of a root node (containing all data), a set of internal nodes (splits), and a set of terminal nodes (leaves).

The building process of Regression Trees consists of an iterative process that splits the data into partitions or branches. Initially, all data in the training set is grouped in a single partition. The algorithm then starts to allocate the data into the first two branches, using every possible binary split on every field. Subsequently, the procedure continues splitting each partition into smaller groups, as the method moves up each branch. At each stage, the algorithm selects the split that minimizes the sum of the squared deviations from the mean in the two separate partitions.

The process is repeated until each node reaches a user-specified minimum node size, becoming a terminal node. If the sum of squared deviations from the mean tends to zero in a node, then it is considered a terminal node even if the minimum size has not been reached. In other words, for each node the algorithm selects the partition of data that leads to nodes 'purer' than their parents. The process stops if the lowest impurity level is reached or if some stopping rules are met. These stopping rules are generally associated with the maximum tree depth (i.e., the final level reached after successive splitting from the root node), the minimum number of units in nodes, or the threshold for the minimum variation of the impurity provided by new splits.

In the RT algorithm, the impurity of a node is evaluated by the Least-Squared Deviation (LSD), R(t), which is simply the within variance for the node t:

1 2 Rt()1 ( y y ()) t2 (11) R(t) = (yi im− ym(t)) (11) Nt()( ) ∑it N t i∈t


in which N(t) is the number of sample units in the node t, yi is the value of the response variable for the i-th unit, and ym(t) is the mean of the response variable in the node t. The function of the Least-Squared Deviation criterion for split s at the node t may be defined as follows:

$$\phi(s, t) = R(t) - p_L R\left(t_L\right) - p_R R\left(t_R\right) \tag{12}$$

where tL and tR are the left and right nodes generated by the split s, respectively, while pL and pR are the portions of units assigned to the left and right child node. The split s that maximizes the value of φ(s, t) is finally chosen, as it provides the highest improvement in terms of tree homogeneity.

Since the tree is built from training samples, it might suffer from over-fitting when the full structure is reached. This may deteriorate the accuracy of the tree when applied to unseen data and lead to a lower generalization capability. For this reason, a pruning process is often used in order to obtain a tree of optimal dimension. This process is defined as Error-Complexity Pruning: after the tree generation, the process removes the redundant splits of the tree. Concisely, this method is based on a measure of both error and complexity, depending on the number of nodes, for the generated binary tree. An accurate description of the pruning algorithm is reported by Breiman et al. [21].
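To make the impurity computation of Equation (11) and the split selection of Equation (12) concrete, the following Python sketch scans the candidate thresholds of a single feature and returns the one maximizing φ(s, t). It is illustrative only, not the authors' implementation.

```python
import numpy as np

def lsd_impurity(y_node):
    """Eq. (11): R(t) = (1/N(t)) * sum_{i in t} (y_i - y_m(t))^2,
    i.e., the within-node variance of the response variable."""
    y_node = np.asarray(y_node, dtype=float)
    return np.mean((y_node - y_node.mean()) ** 2)

def split_gain(y_parent, y_left, y_right):
    """Eq. (12): phi(s, t) = R(t) - pL * R(tL) - pR * R(tR)."""
    n = len(y_parent)
    return (lsd_impurity(y_parent)
            - (len(y_left) / n) * lsd_impurity(y_left)
            - (len(y_right) / n) * lsd_impurity(y_right))

def best_split(x, y):
    """Return (threshold, gain) of the phi-maximizing binary split
    of a single feature; threshold is None if no valid split exists."""
    order = np.argsort(x)
    x, y = np.asarray(x)[order], np.asarray(y)[order]
    best_threshold, best_gain = None, -np.inf
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue  # identical values admit no threshold between them
        gain = split_gain(y, y[:i], y[i:])
        if gain > best_gain:
            best_threshold, best_gain = (x[i - 1] + x[i]) / 2, gain
    return best_threshold, best_gain
```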

2.3. The Database

The data of the NSQD was collected by the University of Alabama and the Center for Watershed Protection, funded by the U.S. Environmental Protection Agency (EPA). The monitoring period of the used data is from 1992 to 2002. The NSQD contains about 3765 events from 360 sites in 65 communities in the U.S.A. [18]. The drainage area of the considered basins ranges from about 0.16 hectares to more than 1160 hectares. For each site, the database also includes additional information, such as the geographical location, the total area, the percentage of impervious cover, and the percentage of each land use in the catchment. In addition, the characteristics of each event, such as total precipitation, precipitation intensity, total runoff, and antecedent dry period, are also included. It is important to highlight that the database only contains information for samples collected at drainage system outfalls, while in-stream samples were not included.
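As a hedged illustration, a database extract of this kind might be loaded and filtered in Python with pandas as below. The file name and column labels are placeholders, not the actual NSQD schema.

```python
import pandas as pd

# Hypothetical NSQD-style export; names are illustrative only.
nsqd = pd.read_csv("nsqd_events.csv")

# Keep only outfall samples, consistent with the database description.
outfall = nsqd[nsqd["sample_location"] == "outfall"]
print(outfall[["drainage_area_ha", "impervious_pct",
               "precip_depth_mm", "runoff_mm"]].describe())
```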

3. Results and Discussion

Since the database was largely incomplete with regard to many sets of data, only a part of it could be used. The input matrix for both models consisted of a series of vectors containing the following quantities: drainage area, percentage of residential area, percentage of institutional area, percentage of commercial area, percentage of industrial area, percentage of open space area, percentage of freeway, percentage of impervious area, precipitation depth, and runoff. The size of the training and testing datasets of the two prediction models, for each of the four considered indicators, is shown in Table 1 (a sketch of how such a dataset can be assembled is given after the table). The difference in size is due to the different data availability.

Table 1. Training and testing dataset size.

WQI     Training Dataset Size     Testing Dataset Size
TSS     811                       50
TDS     894                       44
COD     1108                      42
BOD5    680                       40
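Under the same naming assumptions as the loading sketch above, the input matrix and a hold-out split resembling the TSS row of Table 1 could be assembled as follows. The paper does not describe how the training/testing partition was drawn, so the split below is purely illustrative.

```python
# Placeholder column names for the ten input quantities listed above.
features = ["drainage_area_ha", "residential_pct", "institutional_pct",
            "commercial_pct", "industrial_pct", "open_space_pct",
            "freeway_pct", "impervious_pct", "precip_depth_mm", "runoff_mm"]

tss = outfall.dropna(subset=features + ["tss_mg_l"])
X, y = tss[features].to_numpy(), tss["tss_mg_l"].to_numpy()

# Hold out 50 events for testing, mirroring the 811/50 TSS row of Table 1.
X_train, X_test = X[:-50], X[-50:]
y_train, y_test = y[:-50], y[-50:]
```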

The effectiveness of the prediction algorithms was evaluated by two criteria: the root-mean square error (RMSE) and the coefficient of determination R2. The latter is a number that indicates the proportion of the variance in the dependent variable that is predictable from the independent variable, giving some information about the goodness of fit of a model. It is defined as

$$R^2 = 1 - \frac{\sum_{i=1}^{m}\left(I_{pi} - I_{mi}\right)^2}{\sum_{i=1}^{m}\left(I_a - I_{mi}\right)^2} \tag{13}$$

where m is the total number of measured data, Ipi is the predicted value of the considered water quality indicator for data point i, Imi is the measured indicator for data point i, and Ia is the average of the measured values of the indicator. The root-mean square error represents the sample standard deviation of the differences between predicted and observed values. It is defined as:

$$RMSE = \sqrt{\frac{\sum_{i=1}^{m}\left(I_{pi} - I_{mi}\right)^2}{m}} \tag{14}$$

It aggregates the magnitudes of the errors in predictions into a single measure of predictive power. Table 2 and Figure 4 show that both models are able to provide good predictions of the water quality parameters. The analysis of RMSE shows that the SVR model was characterized by a better accuracy in the predictions of TSS, TDS, and COD than the RT model (Table 2). The difference was particularly relevant with regard to TSS and COD. As regards the BOD5, instead, the two models showed a similar performance.
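Equations (13) and (14) correspond to the following straightforward Python functions (a sketch, using numpy):

```python
import numpy as np

def rmse(predicted, measured):
    """Eq. (14): root-mean square error of the predictions."""
    return np.sqrt(np.mean((predicted - measured) ** 2))

def r_squared(predicted, measured):
    """Eq. (13): 1 minus the ratio of residual to total sum of squares."""
    residual = np.sum((predicted - measured) ** 2)
    total = np.sum((measured.mean() - measured) ** 2)
    return 1.0 - residual / total
```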

Figure 4. Comparison between predictions and experimental values. Logarithmic scale is used for both axes. (a) TSS; (b) TDS; (c) COD; (d) BOD5.


Table 2. Comparison between SVR and RT by means of RMSE and R2.

        RMSE (mg/L)           R2
WQI     SVR      RT           SVR      RT
TSS     1049     3486         0.97     0.906
TDS     1549     1826         0.851    0.828
COD     1172     2395         0.893    0.784
BOD5    104      103          0.87     0.871

Figure 5 shows the relative error distribution in the TSS predictions made with both models. The highest relative errors were observed in correspondence of the lowest measured values, while in correspondence of the highest experimental values the relative error was less than 0.2. The use of Regression Tree led to generally higher errors. From Figure 5b, it can be seen that 50% of the results provided by SVR were affected by an error of less than 25% in absolute value, while only 30% of the results provided by RT were characterized by a less than 25% error. In addition, 62% of the results provided by the two models suffered from a negative relative error. Both SVR and RT showed a clear tendency to overestimate the experimental data.

Figure 5. Distribution of errors in the TSS predictions. (a) Relative error versus experimental values; (b) Frequency distribution.

Figure 6 shows the relative error distribution in the TDS tests. The highest relative errors were found for TDS values between 0 and 200 mg/L. From Figure 6b, it can be seen that 50% of the results provided by SVR were affected by a less than 25% error in absolute value, while 52% of the results provided by RT were characterized by a less than 25% error. However, the highest errors were made by using a Regression Tree: 14% of the results provided by RT were affected by an error greater than 50%, while only 2% of the results provided by SVR had such a large error. Moreover, 71% of the results obtained with RT and 63% of those obtained with SVR showed a positive relative error: both algorithms tended to underestimate the experimental results.

Figure 6. Distribution of errors in the TDS predictions. (a) Relative error versus experimental values; (b) Frequency distribution.
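The frequency distributions of Figures 5-8 can be reproduced from the model outputs with a short computation; a sketch is given below. The sign convention is assumed such that a negative relative error corresponds to an overestimation, consistent with the discussion above, and measured values are assumed to be strictly positive.

```python
import numpy as np

def relative_error_distribution(predicted, measured):
    """Relative errors and their frequencies over the eight bins
    [-1, -0.75], ..., [0.75, 1] used in Figures 5-8."""
    rel_err = (measured - predicted) / measured  # assumed sign convention
    bins = np.linspace(-1.0, 1.0, 9)
    counts, _ = np.histogram(np.clip(rel_err, -1.0, 1.0), bins=bins)
    return rel_err, counts / counts.sum()
```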

Relative error distributions in the COD predictions made with both models are shown in Figure 7. Again, the highest relative errors were observed in correspondence of the lowest measured values, while in correspondence of the highest experimental values the relative error was quite small: less than 10% in the case of SVR and approximately equal to 30% for RT.

Figure 7. Distribution of errors in the COD predictions. (a) Relative error versus experimental values; (b) Frequency distribution.

Figure 7b shows that 47% of the results provided by Support Vector Regression were affected by a less than 25% error in absolute value, while 50% of the results provided by Regression Tree were characterized by a less than 25% error. Furthermore, relative errors were equally distributed around zero in the case of SVR, while 64% of the results provided by RT showed a negative relative error and a tendency to overestimate the experimental results.

Figure 8 shows the relative error distribution in the BOD5 tests. In this case, the highest relative errors were found for the BOD5 values between 0 and 50 mg/L. In correspondence of the highest measured values, the relative error was less than 25% for both models. From Figure 8b, it may be noticed that 61% of the results provided by SVR were affected by a less than 25% error in absolute value, while 46% of the RT outcomes showed a less than 25% error. In addition, SVR led to errors equally distributed around zero, while RT showed a tendency to underestimate the experimental results.

Figure 8. Distribution of errors in the BOD5 predictions. (a) Relative error versus experimental values; (b) Frequency distribution.

The considered machine learning algorithms were characterized by robustness and reliability. Both models proved able to provide good predictions of the WQI values in very different situations as regards basin characteristics and rainfall events.

The obtained results highlight that the proposed modelling approaches are able to produce formulations with a high generalization capability. Therefore, machine learning based models are a valuable alternative to physically based modelling in forecasting applications. This is particularly true when the physical processes are difficult to identify and to formulate in a suitable mathematical form. Data driven models naturally tend to minimize the dependency on knowledge of the real-world processes. However, machine-learning algorithms suffer from some significant limitations. Building an optimal model is often a complex procedure. In addition, the selection of inadequate input variables can deteriorate the model performance. Finally, a reliable model training needs a large set of events, including samples of the typical situations that may occur in simulation scenarios. The use of limited samples of training data may result in poor quality of predictions.

4. Conclusions

An indirect methodology for the estimation of the main effluent quality parameters was tested in this study. The methodology accounts for the following characteristics of the urban catchment: drainage area, percentage of residential area, percentage of institutional area, percentage of commercial area, percentage of industrial area, percentage of open space area, percentage of freeway, percentage of impervious area, precipitation depth, and runoff. Two different machine-learning algorithms were compared: Support Vector Regression and Regression Tree. Both models showed robustness, reliability, and high generalization capability. However, with reference to the root-mean square error and the coefficient of determination R2, Support Vector Regression showed a better performance than Regression Tree in forecasting TSS, TDS, and COD. As regards BOD5, the two models showed a comparable performance. Machine learning algorithms may also be useful for providing an estimate of the values to be considered for the sizing of the treatment units, given the characteristics of the basin, when direct measurements are not available. The proposed approach could also be effective in the context of real-time management of wastewater treatment plants, as well as be used to address environmental planning issues.

Author Contributions: Francesco Granata conceived this research, developed the models and carried out the model simulations. All the authors analyzed the results, wrote and approved the manuscript. Conflicts of Interest: The authors declare no conflicts of interest.

References

1. Peters, N.E.; Meybeck, M. Water quality degradation effects on freshwater availability: Impacts of human activities. Water Int. 2000, 25, 185–193. [CrossRef]
2. Goonetilleke, A.; Thomas, E.; Ginn, S.; Gilbert, D. Understanding the role of land use in urban stormwater quality management. J. Environ. Manag. 2005, 74, 31–42. [CrossRef] [PubMed]
3. Lee, J.H.; Bang, K.W. Characterization of urban stormwater runoff. Water Res. 2000, 34, 1773–1780. [CrossRef]
4. Granata, F.; Gargano, R.; de Marinis, G. Air-water flows in circular drop manholes. Urban Water J. 2015, 12, 477–487. [CrossRef]
5. Tchobanoglous, G.; Burton, F.L.; Stensel, H.D. Wastewater Engineering: Treatment and Reuse, 4th ed.; Metcalf & Eddy Inc.: Boston, MA, USA, 2003.
6. Dibike, Y.B.; Velickov, S.; Solomatine, D.; Abbott, M.B. Model induction with support vector machines: Introduction and application. J. Comput. Civ. Eng. 2001, 15, 208–216. [CrossRef]
7. Bray, M.; Han, D. Identification of support vector machines for runoff modelling. J. Hydroinform. 2004, 6, 265–280.
8. Chen, H.; Guo, J.; Wei, X. Downscaling GCMs using the smooth support vector machine method to predict daily precipitation in the Hanjiang basin. Adv. Atmos. Sci. 2010, 27, 274–284. [CrossRef]
9. Granata, F.; Gargano, R.; de Marinis, G. Support Vector Regression for Rainfall-Runoff Modeling in Urban Drainage: A Comparison with the EPA's Storm Water Management Model. Water 2016, 8, 69. [CrossRef]

10. Raghavendra, S.N.; Deka, P.C. Support vector machine applications in the field of hydrology: A review. Appl. Soft Comput. 2014, 19, 372–386. [CrossRef]
11. Bhattacharya, B.; Solomatine, D.P. Neural networks and M5 model trees in modelling water level-discharge relationship. Neurocomputing 2005, 63, 381–396. [CrossRef]
12. Najafzadeh, M.; Laucelli, D.B.; Zahiri, A. Application of model tree and evolutionary polynomial regression for evaluation of sediment transport in pipes. KSCE J. Civ. Eng. 2016. [CrossRef]
13. Solomatine, D.P.; Xue, Y.P. M5 model trees and neural networks: Application to flood forecasting in the upper reach of the Huai River in China. J. Hydrol. Eng. 2004, 9, 491–501. [CrossRef]
14. Singh, K.K.; Pal, M.; Singh, V.P. Estimation of mean annual flood in Indian catchments using backpropagation neural network and M5 model tree. Water Resour. Manag. 2009, 24, 2007–2019. [CrossRef]
15. Etemad-Shahidi, A.; Ghaemi, N. Model tree approach for prediction of pile groups scour due to waves. Ocean Eng. 2011, 38, 1522–1527. [CrossRef]
16. Ghaemi, N.; Etemad-Shahidi, A.; Ataie-Ashtiani, B. Estimation of current-induced pile groups scour using a rule based method. J. Hydroinform. 2013, 15, 516–528. [CrossRef]
17. Goyal, M.K. Modeling of sediment yield prediction using M5 model tree algorithm and wavelet regression. Water Resour. Manag. 2014, 28, 1991–2003. [CrossRef]
18. Pitt, R.; Maestre, A.; Morquecho, R. The national stormwater quality database (NSQD, version 1.1). In Proceedings of the 1st Annual Stormwater Management Research Symposium, Orlando, FL, USA, 12–13 October 2004; pp. 13–51.
19. Kuhn, H.W.; Tucker, A.W. Nonlinear programming. In Proceedings of the 2nd Berkeley Symposium on Mathematical Statistics and Probability, University of California, Oakland, CA, USA, 31 July–12 August 1950; pp. 481–492.
20. Smola, A.J.; Scholkopf, B. A tutorial on support vector regression. Stat. Comput. 2004, 14, 199–222. [CrossRef]
21. Breiman, L.; Friedman, J.; Stone, C.J.; Olshen, R.A. Classification and Regression Trees; CRC Press: Boca Raton, FL, USA, 1984.

© 2017 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).