Journal of Marine Science and Engineering

Article

Application of the LightGBM Model to the Prediction of the Water Levels of the Lower Columbia River

Min Gan 1,2,3, Shunqi Pan 3, Yongping Chen 1,2,*, Chen Cheng 2, Haidong Pan 4,5 and Xian Zhu 1,2

1 State Key Laboratory of Hydrology-Water Resources & Hydraulic Engineering, Nanjing 210098, China; [email protected] (M.G.); [email protected] (X.Z.)
2 College of Harbor, Coastal, and Offshore Engineering, Hohai University, Nanjing 210098, China; [email protected]
3 Hydro-Environmental Research Centre, School of Engineering, Cardiff University, Cardiff CF24 3AA, UK; [email protected]
4 Key Laboratory of Physical Oceanography, Qingdao Collaborative Innovation Center of Marine Science and Technology, Ocean University of China, Qingdao 266003, China; [email protected]
5 Qingdao National Laboratory for Marine Science and Technology, Qingdao 266100, China
* Correspondence: [email protected]

Abstract: Due to the strong nonlinear interaction with river discharge, tides in estuaries are characterised as nonstationary and their mechanisms are yet to be fully understood. It remains highly challenging to accurately predict estuarine water levels. Machine learning methods, which offer a unique ability to simulate the unknown relationships between variables, have been increasingly used in a large number of research areas. This study applies the LightGBM model to predicting the water levels along the lower reach of the Columbia River. The model inputs consist of the discharges from two upstream rivers (Columbia and Willamette Rivers) and the tide characteristics, including the tide range at the estuary mouth (Astoria) and tide constituents. The model is optimized with the selected parameters. The results show that the LightGBM model can achieve high prediction accuracy, with the root-mean-square-error values of water level being reduced to 0.14 m and the correlation coefficient and skill score being in the ranges of 0.975–0.987 and 0.941–0.972, respectively, which are statistically better than those obtained from physics-based models such as the nonstationary tidal harmonic analysis model (NS_TIDE). The importance of subtide constituents in interacting with the river discharge in the estuary is clearly revealed from the model results.

Keywords: water level prediction; estuarine tides; lower Columbia River; machine learning; LightGBM; NS_TIDE

Citation: Gan, M.; Pan, S.; Chen, Y.; Cheng, C.; Pan, H.; Zhu, X. Application of the Machine Learning LightGBM Model to the Prediction of the Water Levels of the Lower Columbia River. J. Mar. Sci. Eng. 2021, 9, 496. https://doi.org/10.3390/jmse9050496

Academic Editor: Georgios Sylaios

Received: 22 March 2021; Accepted: 26 April 2021; Published: 3 May 2021

1. Introduction

In recent decades, estuaries have been intensively developed and used for a wide range of economic activities [1]. In estuaries, water temperature, salinity, turbidity and sediment transport change daily in response to the tides [2]. Accurately predicting the water levels in estuaries is important, in particular under extreme conditions, to allow sufficient time for the authorities to issue a forewarning of an impending flood and to implement early evacuation measures [3].

Ocean and coastal tides are usually predictable as they have periodic signals. Their tide properties (amplitudes and phases) can be easily obtained from tide levels by using classical harmonic analysis methods such as the T_TIDE model [4], or extracted from tide prediction models such as the Oregon State University Tidal Inversion Software [5]. However, when tides propagate toward the nearshore zone, shallow water effects can significantly affect tide properties, as shallow water tide constituents can be generated during the on-shore propagation of ocean tides [6]. Moreover, tide properties can be further altered when tides enter an estuary due to the effects caused by the river discharge from upstream. In addition, changes of estuarine width can cause energy variation of tides,


while the estuarine water depth can change the propagation speed of tide waves. Due to the complex estuarine environment, estuarine water level variations contain strong nontide components, making the analysis of estuarine tides more challenging [7].

In the past decades, substantial research has been focused on the nonlinear interaction between the river discharge and tides, such as the works of Jay [8] and Godin [9]. Especially in recent years, advanced tools aimed at improving the predictions of estuarine tides have been developed. Matte et al. [7] developed the nonstationary tidal harmonic analysis model (NS_TIDE) to consider the time-dependent tide properties (amplitudes and phases) caused by the nonlinear interaction between tides and river discharge. Similarly, Pan et al. [10] proposed the enhanced harmonic analysis model (S_TIDE), which calculates time-dependent tide properties by using an interpolation method without the input of river discharge. These tools achieved a significantly higher analytical accuracy than the classical harmonic analysis model. Therefore, they have been widely used in the analysis of estuarine tides, or even modified to make them more suitable for local estuary environments [11–16]. These studies provided a good generalization of the nonstationary nature of estuarine tides from the view of physical processes.

At present, mathematical descriptions of the nonlinear interaction between tides and river discharge in estuaries commonly adopt a linear simplification in predicting the tides, or predict them numerically using hydrodynamic models. The linear simplification or numerical solutions, which rely on accurate topography data and boundary conditions, can limit the prediction accuracy. As an alternative, data-driven models provide a good supplement to physical process-based models.
Data-driven models require no specific mathematical assumptions about tide levels and are more suitable to simulate the nonlinear or unknown relationships between the inputs and outputs. A number of previous studies have succeeded in accurately predicting tide levels both in coastal and estuary zones through data-driven methods [17–20]. Some studies also showed that data-driven models, such as the artificial neural network (ANN) method, could achieve better performance than classical harmonic analysis when the water level variation contains nontide components [21–23]. However, to the best of our knowledge, previous applications of data-driven models to tide level predictions mainly focused on coastal tides, whilst their applications to estuarine tides were mainly used for short-term prediction.

Supharatid [20] constructed a prediction model for the Chao Phraya River in the Gulf of Thailand by using the ANN method to predict the mean monthly estuarine tide level. Chang and Chen [24] used the radial basis function neural network to predict one-hour-ahead water levels in the Tanshui River during typhoon periods. Similar work was done by Tsai et al. [25] in predicting the water levels at 1–3 hourly intervals in the Tanshui River. Chen et al. [26] compared the performance of the ANN model with 2-D and 3-D hydrodynamic models in predicting the water levels of the Danshui River estuary in northern Taiwan. Their results showed that the ANN model achieved similar accuracy to the 2-D and 3-D hydrodynamic models, or even better performance during extremely high flow periods. More recently, Yoo et al. [27] applied the Long Short-Term Memory method to predict the water level of the Hangang River, South Korea. Their results showed that the model prediction accuracy becomes unacceptable when the prediction length is longer than 3 h. Chen et al.
[28] integrated the NS_TIDE model [7] with an autoregressive model to improve the short-term (within 48 h) prediction of the water levels in the Yangtze estuary.

With the growing research interest in artificial intelligence and the concept of big data, more data-driven methods, especially machine learning and deep learning methods, have been developed [29]. These methods have been widely used in a large number of areas such as economics [30], geoscience [23,31,32], and medicine [33]. Riazi [23] also successfully applied the deep learning approach to accurately predict the coastal tide level, which clearly indicated the applicability of machine learning methods in analyzing estuarine tides.

In 2017, Microsoft® proposed a fast, distributed, high-performance gradient boosting framework: LightGBM [34]. Since then, the LightGBM framework, which belongs to the family of decision tree algorithms, has been frequently used in solving problems of classification [35] and regression [33,36]. Considering the effective performance of the LightGBM framework in regression, this study aimed to develop a LightGBM model for predicting estuarine water levels. The appropriate selection of input parameters and the optimization of the hyperparameters are fully described and carried out. The lower reach of the Columbia River was selected as the research area because long-term hydrographic field data are publicly available. The results from the LightGBM model were further compared with those from the commonly-used NS_TIDE model [7]. The spatially varied contribution of the parameters in the input layer of the LightGBM model in the lower Columbia River was also investigated in this study.

The remaining part of this paper is organized as follows. The algorithm and the modelling process of the LightGBM model are described in Section 2, while the information on the research area and the measurements is given in Section 3. Model training and testing results are presented in Sections 4 and 5, respectively. A detailed discussion is given in Section 6, while the main conclusions are presented in Section 7.

2. Model Description

2.1. The Algorithm of the LightGBM Model

The LightGBM model is an open-source framework originally proposed by Microsoft® [34]. It is a decision-tree-based algorithm, which divides the parameters in the input layer into different parts and thereby constructs the mapping relationship between the inputs and outputs. Figure 1A shows the main feature of the LightGBM model, which uses leaf-wise tree growth instead of the widely used level-wise tree growth to speed up the training. In the level-wise tree growth strategy, the tree structure grows level by level. As shown in the figure, assuming that the tree depth is D, the number of nodes at the final level is 2^D. In contrast, the leaf-wise tree growth strategy grows the tree from the node where the largest error reduction can be obtained. In other words, the leaf-wise strategy is greedier than the level-wise strategy, and the number of tree nodes of the leaf-wise algorithm is usually smaller than that of the level-wise algorithm at the same tree depth. A typical example is shown in Figure 1B, illustrating the leaf-wise tree growth for breast cancer patients through the input parameters of gender and age groups. In comparison with the level-wise tree growth algorithm, the leaf-wise tree growth algorithm can considerably reduce the number of tree nodes, so that the training process can be significantly sped up when the dataset is large. However, when the dataset is small, the leaf-wise tree growth algorithm tends to over-fit because of its greedier strategy, while the level-wise tree growth algorithm is relatively more stable.

More specifically, the LightGBM model uses the Gradient Boosting Decision Tree (GBDT) algorithm to reach a more accurate prediction. As conceptualized in Figure 1C, assuming X and Y are the input and output (targets) for regression, the target of the 1st tree (Tree 1 in Figure 1C) is Y.
If the estimation of the 1st tree is T1, the input and output (target) of the 2nd tree (Tree 2 in Figure 1C) become X and Y − T1, respectively. More specifically, the target (output) of each tree comes from the errors of the previous tree, while the input is kept the same. The GBDT algorithm generates the final predictions by using the ensemble predictions of a series of trees. With the construction of more and more trees, the ensemble estimation gradually approaches the output Y. During the construction of each decision tree, the parameters that determine the tree structure and influence the iteration process are determined through cross-validation.

Figure 1. Schematic diagram of the LightGBM model: (A) growth tree structures; (B) an example of the leaf-wise tree growth conceptual algorithm and (C) the Gradient Boosting Decision Tree algorithm.

Another commonly-used GBDT-based model is the XGBoost model [37]. It has gained much popularity and attention in recent years, such as in the studies of Shi et al. [38] and Dong et al. [39]. The XGBoost model uses the level-wise tree growth algorithm to generate the tree nodes. Compared with other GBDT-based models such as the XGBoost model, the LightGBM model also provides another three algorithms to accelerate its training, namely the histogram-based, gradient-based one-side sampling (GOSS) and exclusive feature bundling (EFB) algorithms. The histogram-based algorithms [34] convert the sorted dataset of the parameters in the input layer into a histogram with a specified number of data intervals or bins [36], as shown in Figure 2A. After this transformation, each data interval (bin) shares the same index. This algorithm allows the LightGBM model to require a much lower memory consumption while considerably accelerating the training speed. However, the split point of the dataset found by the LightGBM model may be less accurate than that of the XGBoost model, because the LightGBM model traverses the data bins (Figure 2A) while the XGBoost model directly traverses the dataset. As shown in Figure 2B, the GOSS algorithm [34] approximates the sum of sample data by using the top a% descending-sorted data points and a randomly selected part (b%) of the remaining data (100% − a%). The GOSS algorithm reduces the number of data instances, while the accuracy of the learned decision trees can still be kept [34]. The EFB algorithm reduces the parameter size by merging some parameters in the input layer, which also improves the training speed. Taking Figure 2C as an example, the repeated zero-value points are in the same bin (Figure 2A). However, the locations of the nonzero values of Parameter 1 and Parameter 2 in Figure 2C are mutually exclusive. Merging them does not change the distribution of the output (target) dataset but can reduce the parameters' size in the input layer. Through these algorithms, the LightGBM model has the advantage of a faster training speed and is suitable for handling large-scale datasets.
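The histogram transformation can be illustrated with a few lines of Python. The equal-width binning below is a simplified stand-in for LightGBM's internal binning, and the sample values are hypothetical; the point is that split search then only needs to traverse max_bin bin indices instead of the raw sorted data.

```python
# Illustrative histogram binning: map continuous feature values to
# equal-width bin indices so that split search traverses bins, not raw data.

def to_bins(values, max_bin=4):
    lo, hi = min(values), max(values)
    width = (hi - lo) / max_bin or 1.0       # guard against a constant feature
    # clamp so that the maximum value falls in the last bin
    return [min(int((v - lo) / width), max_bin - 1) for v in values]

water_levels = [0.2, 0.5, 0.9, 1.4, 1.8, 2.0, 2.6, 3.0]   # hypothetical samples
print(to_bins(water_levels, max_bin=4))
```

After binning, every value in the same interval shares one index, which is why the memory footprint drops and candidate split points are limited to bin boundaries.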


Figure 2. (A) Histogram, (B) gradient-based one-side sampling, and (C) exclusive feature bundling algorithms provided in the LightGBM model to accelerate the training.

Taking the estuarine water levels as an example in this study, they can be estimated by the LightGBM model as:

η_I = ∑_{i=1}^{I} T_i(X, θ_i)    (1)

where η_I (unit: m) is the estimated estuarine water level by the LightGBM model; T_i (unit: m) is the estimation of the water level from the ith tree, in which X is the input parameter dataset and θ_i is the learned parameter set of the ith tree, and I is the total number of decision trees.

During the learning stage, T_i can be determined by minimizing a loss (error) function, L, against the observation data, η_obs. For the 1st tree, T_1 can be estimated with the following equation:

T_1 = argmin{ L(η_obs, η_1) }    (2)

where argmin is the condition of the function achieving the minimum value between the observed and an initial estimation, η_1. For the ith (i > 1) tree, the square error (SE) is adopted as the loss function, L(η_i) = (η_obs − η_i)², to evaluate the model errors, with the introduction of a target function, Obj_i, as shown below:

Obj_i = L(η_i) + Ω(T_i)    (3)

where the target function of the ith tree, Obj_i (unit: m²), contains two parts: one represents the loss function for evaluating model accuracy, L(η_i), and the other, Ω(T_i), represents model robustness, which is a regularization function expressed as:

Ω(T_i) = α ∑_{j=1}^{J} |ω_j| + (1/2) β ∑_{j=1}^{J} ω_j²    (4)


where α and β are hyperparameters used to control the learning process, to be tuned at the hyperparameter optimization stage; j is the index of the leaf node, J is the total number of leaf nodes of the ith decision tree and ω_j (unit: m) is the model estimation at the jth leaf node. The first and second terms on the right-hand side of Equation (4) are respectively the Lasso Regression (L1 regularization) and Ridge Regression (L2 regularization) [40]. Both are used to improve the model stability (avoid over-fitting) by adding different penalty terms. If α = 0 and β = 0, Equation (3) becomes the ordinary square error function. The ith tree is determined by finding the smallest Obj_i following the least square error approach. In other words, the determination of the decision tree structure not only considers the model accuracy but also takes into account the model's stability.

The LightGBM model follows the concept of GBDT, which uses the second-order Taylor expansion to estimate the target function. More specifically, the ith target function is estimated by using the learning results from the (i − 1)th process:

Obj_i = L(η_{i−1}) + g T_i(X, θ_i) + (1/2) h T_i²(X, θ_i) + Ω(T_i)    (5)

g = ∂L(η_{i−1}) / ∂η_{i−1}    (6)

h = ∂²L(η_{i−1}) / ∂η_{i−1}²    (7)

where g and h are the first and second derivatives of L(η_{i−1}) with respect to η_{i−1}; for the square error loss adopted here, g = 2(η_{i−1} − η_obs) and h = 2. Since L(η_{i−1}) is known from the (i − 1)th learning process, the target function can be written as:

Obj_i = ∑_{j=1}^{J} [ G_j ω_j + α |ω_j| + (1/2)(H_j + β) ω_j² ]    (8)

G_j = ∑_{t=1}^{N} g^t,  t ∈ leaf node j    (9)

H_j = ∑_{t=1}^{N} h^t,  t ∈ leaf node j    (10)

where ω_j at each leaf node on the ith tree is the unknown variable to be solved, the superscript t is the sample index and N is the total number of sample points. In Equation (8), except for ω_j, all other parameters are specified or can be obtained from the (i − 1)th learning process. In order to obtain the smallest Obj_i, we have:

∂Obj_i / ∂ω_j = 0    (11)

which yields the following:

ω_j = −sgn(G_j) · G_{α,j} / (H_j + β)    (12)

G_{α,j} = max(0, |G_j| − α)    (13)

where sgn is the sign function. When ω_j in Equation (12) is determined, the objective function in Equation (8) can be re-written as:

Obj_i = −(1/2) ∑_{j=1}^{J} G_{α,j}² / (H_j + β)    (14)

To statistically evaluate the tree performance at each leaf node, a score function for the jth leaf node is introduced as:

V_j = G_{α,j}² / (H_j + β)    (15)

It is clear that Equation (15) is a subterm of Equation (14), indicating the model's error reduction contributed from the jth leaf node. Equation (15) can assist the LightGBM model in selecting the most suitable node to split further. If two new leaf nodes are generated from the jth leaf node, the difference of V_j after the split, V_d, can be calculated as:

V_d = G_{L,α,j}² / (H_{L,j} + β) + G_{R,α,j}² / (H_{R,j} + β) − G_{α,j}² / (H_j + β)    (16)

where G_{L,α,j} (H_{L,j}) and G_{R,α,j} (H_{R,j}) are the values of G (H) on the left and right nodes newly generated from the jth node. Equation (16) statistically compares the model's error reduction when new leaf nodes are generated from different nodes. After examining all the parameters in the input layer and all the leaf nodes, the LightGBM model chooses the most suitable parameter to split on from the leaf node where the largest V_d can be obtained.
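Equations (12)–(16) condense into a short numerical sketch. The function names and the toy gradient/hessian sums below are illustrative assumptions, not LightGBM's internal code, but they perform the same leaf-weight and split-gain computations:

```python
import math

def leaf_weight(G, H, alpha=0.0, beta=0.0):
    """Optimal leaf output, Equations (12)-(13): soft-thresholded by alpha (L1)."""
    G_alpha = max(0.0, abs(G) - alpha)
    return -math.copysign(1.0, G) * G_alpha / (H + beta)

def leaf_score(G, H, alpha=0.0, beta=0.0):
    """Score V_j of a leaf, Equation (15)."""
    G_alpha = max(0.0, abs(G) - alpha)
    return G_alpha ** 2 / (H + beta)

def split_gain(G_left, H_left, G_right, H_right, alpha=0.0, beta=0.0):
    """Error reduction V_d of splitting one node into two, Equation (16)."""
    return (leaf_score(G_left, H_left, alpha, beta)
            + leaf_score(G_right, H_right, alpha, beta)
            - leaf_score(G_left + G_right, H_left + H_right, alpha, beta))

# Toy gradient/hessian sums for the two candidate children of a node:
print(leaf_weight(G=-4.0, H=2.0, beta=1.0))
print(split_gain(G_left=-4.0, H_left=2.0, G_right=3.0, H_right=2.0, beta=1.0))
```

The candidate split with the largest `split_gain` over all input parameters and leaf nodes is the one the model would choose, mirroring the selection rule stated above.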

2.2. Construction of the LightGBM Model

2.2.1. Input and Output Setting

The input dataset of the LightGBM model should contain the factors that influence the variation of estuarine water levels. These input factors should also be predictable themselves when they are used to forecast estuarine water levels. The effective input parameters include the variables related to the tide potential or the time series of tide constituents from ocean or coastal tides. However, due to the strong nonlinear interaction between tides and river discharge, these factors may not be enough. As shown in the works of Kukulka and Jay [41,42], the upstream river discharge and the tide range near the estuary mouth are important factors in analyzing estuarine tides. Supharatid [20] also included them as variables of the input layer of the ANN model. Therefore, the LightGBM model also includes the river discharges and tide range as parameters in the input layer.

Figure 3 shows the schematic map of the input and output layers of the LightGBM model in this study. In order to consider both the riverine and marine forcing factors, similar to the works of Jay et al. [43] and Matte et al. [7], the upstream river discharge (Q) and the tide range (R) near the estuary mouth were selected in the input layer of the LightGBM model. Moreover, the upstream river discharge was low-pass filtered to smooth the data. In the semi-diurnal tide regime, the tide range R is the greater tide range within one lunar day. Following the works of Lee [18] and Lee et al. [44], the time series of tide constituents were also included. The tide constituents whose signal-to-noise-ratio values are larger than 2 in the results of the T_TIDE model [4] were selected as the input of the LightGBM model. Harmonic analysis was thus used to preliminarily filter out the insignificant tide constituents, which reduces the computation time of the LightGBM model. The LightGBM model selects the important parameters from the input layer by finding the parameter that contributes the largest error reduction (Equation (16)).
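The assembly of this input layer can be sketched as follows. The column layout and the two constituent periods (M2 and K1, standard astronomical values in hours) are illustrative assumptions; in the actual study the constituents were those passing the T_TIDE signal-to-noise screening.

```python
import math

# Illustrative input-layer assembly: low-passed river discharge Q(t),
# daily greater tide range R(t), and cos/sin time series of the selected
# tide constituents. M2 and K1 periods are used here purely for illustration.
PERIODS_H = {"M2": 12.4206012, "K1": 23.9344721}

def build_inputs(hours, Q, R):
    rows = []
    for t, q, r in zip(hours, Q, R):
        row = [q, r]
        for period in PERIODS_H.values():
            omega = 2.0 * math.pi / period          # constituent frequency (rad/h)
            row += [math.cos(omega * t), math.sin(omega * t)]
        rows.append(row)
    return rows  # shape: (n_samples, 2 + 2 * n_constituents)

hours = [0, 1, 2]
X = build_inputs(hours, Q=[7500.0, 7400.0, 7300.0], R=[2.1, 2.0, 1.9])
print(len(X), len(X[0]))
```

Each row is one training sample; with two constituents this gives 2 + 2 × 2 = 6 input parameters per time step, matching the Q(t), R(t), cos(ωt), sin(ωt) structure of Figure 3.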

2.2.2. Hyperparameters Setting

In the LightGBM model, a number of hyperparameters are used, and many of them can be adjusted to improve the model performance for different applications, as the hyperparameters may significantly influence the model performance. Therefore, hyperparameter tuning is an essential process in the construction of machine learning models. The hyperparameters tuned in this study and their meanings are listed in Table 1. Only the hyperparameters listed in Table 1 were tuned. The other hyperparameters that can further slightly influence the training process, such as the minimal H_j on a leaf node and the minimal number of data inside one bin, were kept at their default values. Generally, the core


hyperparameters for regression are listed in Table 1. More details of the hyperparameters of the LightGBM model can be found in the LightGBM documentation [45].

[Figure 3 here: the input layer, consisting of the river discharge Q(t), the tide range R(t), and the cosine and sine time series cos(ω_1 t), sin(ω_1 t), …, cos(ω_n t), sin(ω_n t) of the n selected tide constituents with frequencies ω, feeds the LightGBM model, whose output is the estuarine water level.]

Figure 3. Schematic map of the input and output layers of the LightGBM model used in this study.

Table 1. The main hyperparameters of the LightGBM model tuned in this study.

learning_rate (double, >0.0, default 0.1): Controls the shrinkage rate; a smaller value indicates a smaller iteration step.
num_iterations (int, ≥0, default 100): Number of iterations (trees).
max_depth (int, case-based): Limits the max depth of a tree model.
num_leaves (int, 1–131,072, default 31): Controls the maximum number of leaves of a decision tree.
max_bin (int, 1–255, default 255): Controls the max number of bins (data intervals) when the dataset of a parameter in the input layer is transformed to a histogram (Figure 2A).
min_data_in_leaf (int, ≥0, default 20): Minimal number of data in one leaf.
feature_fraction (double, 0.0–1.0, default 1.0): The proportion of the selected parameters to the total number of parameters in the input layer.
bagging_fraction (double, 0.0–1.0, default 1.0): The proportion of the selected data to the total data size.
bagging_freq (int, ≥0, default 0): Frequency of re-sampling the data when bagging_fraction is smaller than 1.0.
lambda_l1 (double, ≥0.0, default 0): The value of α in Equation (4).
lambda_l2 (double, ≥0.0, default 0): The value of β in Equation (4).
min_gain_to_split (double, ≥0.0, default 0): The minimal error reduction required to conduct a further split; corresponds to the minimal value of Equation (16).

2.2.3. Optimal Hyperparameters Searching

Figure 4 shows a diagram of constructing the optimal LightGBM model in this study. The main process was to optimize the training process of the LightGBM model by adjusting the hyperparameter values listed in Table 1. The numerical ranges of the hyperparameter values in Figure 4 were specified according to preliminary tests, and they can be varied when this model is applied to other research areas. The hyperparameters in Table 1 were divided into four steps to be sequentially optimized (Figure 4) once suitable learning_rate and num_iterations were specified. The optimization sequence of these hyperparameters was determined according to their mutual relationships and the importance of their influence on the LightGBM model.

Figure 4. Flow chart of conducting the optimization of hyperparameters (ranges are indicated in the square brackets) of the LightGBM model.

In determining the optimal combination of hyperparameters, the GridSearchCV function in the Scikit-learn Python module [46] was used. The GridSearchCV function uses the concept of the grid search method [47] to determine the optimal hyperparameters. On the basis of the specified range of hyperparameters, the GridSearchCV function forms all possible combinations and then performs a cross-validation process (Figure 4) to search for the optimal hyperparameter combination. In the specification of the GridSearchCV function, the negative mean squared error (NMSE) was selected as the scoring criterion to evaluate the model performance. In addition, five-fold cross-validation was specified to optimize the hyperparameters. More specifically, the initial training data in Figure 4 were further divided evenly into five groups, four of which were sequentially selected as the training data in the hyperparameter optimization period, while the remaining group was used to test the model. Therefore, five cross-validation tests were conducted for each combination of hyperparameters. The highest NMSE score in the cross-validation tests represents the smallest model error, indicating the best combination of the hyperparameters in the given range. The LightGBM model was constructed from the open-source framework of Microsoft® [34] and the related codes were run in a Python 3.7 environment.



3. Study Area & Data

3.1. The Lower Columbia River

The Columbia River is located in North America (Figure 5) and flows into the North Pacific Ocean. It is the third largest river in the United States [14,15] and the largest river in the Pacific Northwest region of North America. Its annual average flow is about 7500 m3/s [48]. Three variation patterns can be found in the river discharge records of the Columbia River [48]: first, a significant seasonal variation pattern; second, high-flow events that can occur in the winter and early spring seasons; third, a periodical variation pattern during low-flow periods caused by hydropower generation. The research area of this study was the lower Columbia River, which extends from the estuary mouth to 235 km landward along the river course [7]. It can be seen from Figure 5 that a tributary (the Willamette River) enters the lower Columbia River. It is the largest tributary of the lower Columbia River, with an average flow of about 950 m3/s [7]. The tides in the Columbia River are classified as mixed tides, and the amplitude ratio of semidiurnal tides to diurnal tides at the estuary mouth is around 1.5 [48]. Both the river discharge from the Columbia River and that from the Willamette River can affect the tides in the lower Columbia River.

Under the influence of river discharge, the tides in the lower Columbia River present significant nonstationary characteristics. Former studies [10,14,41–43] clearly showed that the lower Columbia River is a suitable natural laboratory for the analysis of estuarine tides.

Figure 5. Map of the lower Columbia River and the locations of the hydrometric stations (modified from Google Map by the M_Map software package [49]).

3.2. Data

The hydrometric stations used in this study are also shown in Figure 5. Hourly water level measurements at the Astoria, Wauna, Longview, St.Helens, and Vancouver stations were collected from the National Oceanic and Atmospheric Administration (NOAA) website [50]. The duration of the water level measurements at all stations was from 2003 to 2020, except for Wauna station, where data between 2003 and 2006 were unavailable. In addition, the synchronized hourly river discharge data of the Columbia River at the Bonneville Dam (denoted as Qc hereafter) and the Willamette River at Portland (denoted as Qw hereafter) over the same period (2003–2020) were downloaded from the website of the U.S. Geological Survey (USGS) [51]. The date and time of all the measurements were adjusted to Greenwich Mean Time (GMT), while the mean sea level datum was specified as the uniform datum when downloading the water level data from the NOAA. The collected water level measurements are shown in Figure 6A, while the measured river discharges are shown in Figure 6B. For the discharge of the Willamette River at Portland, it can be seen that the river discharge is under the influence of tides from downstream, as evidenced by the negative values. In the collected data, the values of Qw were no more than 20% of the values of Qc for about 70% of the whole period. However, there were also a few periods in which the values of Qw exceeded Qc, as shown in Figure 6B.
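For illustration only, an hourly water-level request could be assembled against the NOAA CO-OPS data API as below. The endpoint, parameter names and the Astoria station ID are my assumptions (they are not stated in the text), and no request is actually sent:

```python
# Build (but do not send) a NOAA CO-OPS query URL for hourly water levels.
# Endpoint, parameters and station ID are assumptions for illustration.
from urllib.parse import urlencode

base = "https://api.tidesandcurrents.noaa.gov/api/prod/datagetter"
params = {
    "product": "hourly_height",
    "station": "9439040",    # assumed NOAA ID for Astoria, OR
    "begin_date": "20030101",
    "end_date": "20031231",
    "datum": "MSL",          # mean sea level datum, as specified in the paper
    "time_zone": "gmt",      # GMT, as specified in the paper
    "units": "metric",
    "format": "csv",
}
url = f"{base}?{urlencode(params)}"
print(url)
```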

The Astoria station is the most seaward station in the study site and is assumed to be under little or no influence from the upstream river discharge. Therefore, it was selected as the reference station to provide the tide range information, while the Bonneville Dam and Portland stations were used as the reference stations to provide the upstream river discharge information. The measured water levels at the Wauna, Longview, St.Helens and Vancouver stations were used for training and testing the LightGBM model.

Figure 6. (A) Measured water levels at five stations along the lower Columbia River, and (B) measured river discharge of the lower Columbia River at the Bonneville Dam and the Willamette River at Portland.

3.3. Data Preparation

The measured water levels at all stations were divided into two parts: 80% of all the data, from January 2003 to June 2017, together with the tide range at Astoria and the river discharges at Bonneville Dam and Portland in the same period, was used for training, whilst the remaining data measured during June 2017–December 2020 (20% of the data) was used for model predictions. As shown in Figure 6A, the measured water level record at Wauna station available for training and testing was slightly shorter due to the data unavailability. Wauna station used the same data division ratio for training (80%) and testing (20%). The shorter data length at Wauna station may have had only a minor impact on the results because the tides at Wauna station were less affected by the upstream river discharge than those at the Vancouver, St.Helens, and Longview stations (Figure 6A). The tides at Wauna station mainly present the characteristics of astronomic tides, while the river discharge primarily influences the tides in the flood season.
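The chronological 80/20 division described above can be sketched as follows (illustrative only; the array stands in for an hourly water-level series in time order):

```python
# Minimal sketch of the chronological 80/20 split described above; the
# array is a stand-in for an hourly water-level series in time order.
import numpy as np

levels = np.arange(100.0)            # placeholder hourly water levels
n_train = int(len(levels) * 0.8)     # first 80% (January 2003-June 2017) for training
train, test = levels[:n_train], levels[n_train:]  # last 20% for prediction
print(len(train), len(test))         # 80 20
```

Splitting chronologically (rather than randomly) keeps the test period strictly after the training period, as in the paper.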

4. Model Training

The training process of the LightGBM model was conducted in two stages. First, to specify the appropriate values of learning_rate and num_iterations, their influence on the model performance and the computational time of the LightGBM model was tested. With the other hyperparameters kept at their defaults, the RMSE values of the LightGBM model in the training period were tested with learning_rate varying from its default value (0.1) to 0.05 and 0.01. The value of num_iterations of the LightGBM model is the iteration at which the model errors stop decreasing with the specified learning_rate.


The comparison of the RMSE values between the water levels predicted by the LightGBM model and the measurements during the training period at each station is shown in Figure 7. It can be seen from Figure 7 that a smaller learning_rate led to better model performance (as indicated by the smaller RMSE values) at the St.Helens, Longview and Wauna stations. However, at Vancouver station, the RMSE values when learning_rate = 0.05 were slightly smaller than those when learning_rate = 0.01. Overall, learning_rate did not significantly influence the model performance; the difference between the RMSE values at each station with different learning_rate was within 0.01 m.


Figure 7. Comparison of the RMSE values of the predicted water levels from the LightGBM model during the training period when the hyperparameters were not optimized, but with different learning_rate.

Taking the results at Vancouver station as an example, Figure 8 plots the variation of the RMSE values of the LightGBM model during each iteration with different learning_rate values. When learning_rate = 0.10, the LightGBM model used 184 iterations to reach the convergence state of the RMSE values. When learning_rate was reduced to 0.05 and 0.01, the iterations increased to 380 and 1829, respectively. The results in Figures 7 and 8 show that a smaller learning_rate did not significantly improve the model accuracy but considerably increased the model iterations (computation time). According to the preliminary test results in Figures 7 and 8, learning_rate at each station was specified as 0.05 during the subsequent training and testing. Meanwhile, the corresponding num_iterations values at each station were, respectively, 380 (Vancouver), 291 (St.Helens), 339 (Longview) and 374 (Wauna).

The optimization of max_depth and num_leaves at Vancouver station was used as an instance to show the optimization process of the remaining hyperparameters. The parameters max_depth and num_leaves control the complexity of a decision tree. In the methods using the level-wise tree growth strategy shown in Figure 1A, such as the XGBoost model [37], num_leaves is 2^D (assuming D = max_depth). However, as the LightGBM model uses the leaf-wise tree growth strategy [34], the tree generated could be deeper than the tree generated by the level-wise strategy with the same value of num_leaves. In this study, max_depth was set in a range from 4 to 10, while num_leaves was specified in a range from 5 to 40 and also kept smaller than 2^D to avoid over-fitting.

All combinations of max_depth and num_leaves are shown as a grid graph in Figure 9, together with their related NMSE values in the cross-validation period at Vancouver station. As indicated by the size (a larger size represents better performance) and the colour of the data points, the grid point where max_depth = 5 and num_leaves = 17 had the highest score (NMSE), indicating the best model performance during the hyperparameter optimization period. Compared with the default value of num_leaves (31), the results shown in Figure 9 suggest that a simpler (fewer leaf nodes) tree structure could achieve better model performance. However, it should be noted that the mutually restrictive relationship between max_depth and num_leaves prevented the NMSE values from presenting apparent monotonous relationships with max_depth or num_leaves: max_depth restricted the upper limit of the number of tree leaf nodes, while num_leaves restricted max_depth to no more than num_leaves − 1. Figure 9 mainly shows the process of searching for the optimal hyperparameters based on all the hyperparameter combinations in their specified ranges. During the subsequent optimization of the other hyperparameters (Step 2–Step 4 in Figure 4), the values of max_depth and num_leaves at Vancouver station were set as 5 and 17, respectively.
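The constrained search grid described above (max_depth in 4–10, num_leaves in 5–40 and below 2^max_depth) can be enumerated directly; this short sketch mirrors what the grid search iterates over:

```python
# Enumerate the constrained (max_depth, num_leaves) grid discussed above:
# depths 4-10, leaves 5-40, with num_leaves kept below 2**max_depth.
grid = [
    {"max_depth": d, "num_leaves": n}
    for d in range(4, 11)
    for n in range(5, 41)
    if n < 2 ** d      # leaf-wise growth: cap leaves by the level-wise maximum
]
print(len(grid))       # 218 admissible combinations
```

Note that the constraint only bites for shallow trees: for max_depth ≥ 6, 2^D already exceeds 40, so all 36 num_leaves values are admissible.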


Figure 8. The variation of the RMSE values of the predicted water levels from the LightGBM model during the training period at Vancouver station with the increase of iterations.

The remaining hyperparameters were essential in improving the model performance and accelerating the model iterations. Judged by the NMSE values, all the optimized hyperparameters are listed in Table 2. With the hyperparameters in Table 2, the final LightGBM model could be trained.

Table 2. The optimal hyperparameters at each station.

Hyperparameters       Wauna    Longview   St.Helens   Vancouver
num_iterations        374      339        291         380
max_depth             5        5          8           5
num_leaves            21       13         18          17
max_bin               165      190        195         194
min_data_in_leaf      17       16         23          19
feature_fraction      0.72     0.79       1.00        0.97
bagging_fraction      0.70     0.67       0.88        0.72
bagging_freq          1        6          8           4
lambda_l1             0.016    0.008      0.000       0.000
lambda_l2             0.060    0.100      0.052       0.138
min_gain_to_split     0.104    0.136      0.068       0.008


Figure 9. The variation of the NMSE values with different combinations of max_depth and num_leaves values at Vancouver station.

5. Model Testing

The testing data (20% of the measurements) at each station were used to examine the model accuracy. Figure 10 shows the comparison of the water levels predicted by the LightGBM model with the measurements at the four stations. In general, the results of the LightGBM model show a reasonable accuracy relative to the measurements. However, it is clear that the discrepancy becomes larger when the water levels were high at the Vancouver and St.Helens stations. This illustrates that the errors of the LightGBM model were larger in flood seasons at the upstream tide reach stations (Vancouver and St.Helens). However, the differences between the results of the LightGBM model and the measurements at the Longview and Wauna stations were more randomly (or evenly) distributed.


Figure 10. Comparisons of the predicted water levels from the LightGBM model with the measurements at (A) Vancouver, (B) St.Helens, (C) Longview and (D) Wauna stations.

In the time domain, Figure 11A shows the time series of the predicted water levels against the measurements at Vancouver station during 2019 and 2020, while Figure 11B plots the model errors and the synchronous river discharge Qc (Columbia River). Again, the predicted water level from the LightGBM model generally agrees well with the measurements, as shown in Figure 11A. It is clear from Figure 11B that the errors between the predicted and measured water levels were mostly in the range of ±0.5 m. However, larger errors of up to approximately ±1.0 m appeared when the river discharge was strong in the flood season, such as the periods between April and May of 2019 and between May and June of 2020. The larger errors of the LightGBM model in these two periods both appeared during the rising flood period. In the rising flood period of 2019, the river discharge met a neap tide (Figure 11B), and the model errors mainly came from the phase difference between the model results and the measurements. However, in the rising flood period of 2020, the river discharge met a spring tide (Figure 11B), and the model errors mainly stemmed from the over-estimation of the model results. This means that the model errors of the LightGBM model in the rising flood period tended to present a different pattern when the tidal condition was different.

Figure 11. (A) Comparison of the LightGBM model results and the measurements at Vancouver station, and (B) the corresponding errors of the LightGBM model and the measured river discharge of the Columbia River.

To compare the performance of the LightGBM model at different tide reaches when the river discharge was significant in 2019, Figure 12 compares the predicted and measured water levels for the period between 1 April and 22 April 2019, together with the river discharge from the Columbia River (Qc), at all stations. Generally, the errors mainly arose when the phase of the water levels predicted by the LightGBM model was earlier than the measurements. Meanwhile, the phase difference was more significant at the Vancouver and St.Helens stations than at the Longview and Wauna stations. In the LightGBM model, the synchronized upstream river discharge and the water levels at each station were used, respectively, as the input and output of the model. There was actually a propagation time for the upstream river discharge flowing to each station; in other words, there was a time lag in the effect of the upstream river discharge on tides. Without considering this time lag, the signals of the water levels predicted by the LightGBM models could be earlier than the measurements. The time lag between the model results and the measurements was up to about one day at the Vancouver and St.Helens stations, became smaller at Longview station (Figure 12C), and was insignificant at Wauna station (Figure 12D) because the influence of the river discharge on tides gradually became weaker in the landward direction.

For further comparison of the LightGBM model results and the measurements in other periods, Figure 13 compares the monthly maximum water levels from the LightGBM models and the measurements at each station. It is clear from Figure 13 that the error limits of the LightGBM model results relative to the measurements were all in the range of about ±0.5 m. This clearly shows that the LightGBM model can give an accurate prediction of the monthly maximum water levels. However, due to the inability to consider the propagation time of the upstream river discharge or even tides, the errors of the LightGBM model can be larger during the periods of the flood season, such as the model errors at Vancouver station (Figure 11B).

[Figure 12: time series of observed and LightGBM-predicted water levels (m), with the Columbia River discharge Qc (10³ m³/s) on the secondary axis, 1–21 April 2019, panels (A)–(D).]

Figure 12. Comparison of the LightGBM model results and the measurements in April 2019 at: (A) Vancouver, (B) St.Helens, (C) Longview and (D) Wauna stations.

Although the LightGBM model had relatively larger errors in the flood season, its accuracy was still comparable with that of the physics-based models. For statistical comparison, the performance of the LightGBM model was further compared with that of the NS_TIDE model [7,13] because of its typical application in the analysis of estuarine tides. The details of the NS_TIDE model can be found in Matte et al. [7], and its specification for the lower Columbia River follows the works of Matte et al. [7] and Pan et al. [14]. The same data division as for the LightGBM model was used to examine the performance of the NS_TIDE model. Table 3 compares the two models' performance using the performance indicators of the maximum absolute error (MAE), RMSE, correlation coefficient (CC), and the nondimensional skill score (SS) based on the method of Murphy [52]; an SS value close to 1 represents better model performance. The RMSE values of the LightGBM model were all smaller than those of the NS_TIDE model, which indicates that the LightGBM model statistically performed better. The MAE values of the LightGBM model were smaller, and its CC and SS values larger, than those of the NS_TIDE model.
The results in Table 3 illustrate that the LightGBM model could achieve better performance than the NS_TIDE model.
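The four indicators can be computed directly from paired series; a minimal sketch, assuming hourly observation and prediction arrays (the function name is illustrative). Note that MAE here is the maximum absolute error as defined in the paper, not the mean absolute error, and SS follows Murphy's definition SS = 1 − MSE/Var(obs):

```python
import numpy as np

def evaluate(obs, pred) -> dict:
    """Performance indicators used in Table 3: MAE (maximum absolute error),
    RMSE, correlation coefficient CC, and Murphy's (1988) skill score SS."""
    obs = np.asarray(obs, dtype=float)
    pred = np.asarray(pred, dtype=float)
    err = pred - obs
    mae = np.max(np.abs(err))                      # maximum absolute error
    rmse = np.sqrt(np.mean(err ** 2))
    cc = np.corrcoef(obs, pred)[0, 1]
    # SS = 1 - MSE / variance of the observations (climatology reference)
    ss = 1.0 - np.mean(err ** 2) / np.mean((obs - obs.mean()) ** 2)
    return {"MAE": mae, "RMSE": rmse, "CC": cc, "SS": ss}
```

A perfect prediction gives MAE = RMSE = 0 and CC = SS = 1; a prediction no better than the observed mean gives SS = 0.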


[Figure 13: scatter plots of predicted versus observed monthly maximum water levels (m), axes 0–5 m, panels (A)–(D).]
(C) (D) 0 0 0 1 2 3 4 5 0 1 2 3 4 5 Observation (m) Observation (m) FigureFigure 13. 13.Comparison Comparison of of the the monthly monthly maximummaximum waterwater levelslevels ofof thethe LightGBMLightGBM modelmodel resultsresults andand thethe measurements measurements at:( at:A(A)) Vancouver, Vancouver, (B ()B St.Helens,) St.Helens, (C ()C Longview) Longview and and (D (D) Wauna) Wauna stations. stations.


Table 3. Comparison of the performance indicators of the LightGBM and the NS_TIDE models (LightGBM/NS_TIDE).

Stations     MAE (m)      RMSE (m)     CC            SS
Vancouver    0.87/1.09    0.14/0.16    0.987/0.983   0.972/0.965
St.Helens    0.82/0.97    0.14/0.15    0.982/0.978   0.963/0.956
Longview     0.75/0.90    0.14/0.15    0.975/0.972   0.941/0.938
Wauna        0.72/0.82    0.14/0.16    0.977/0.975   0.955/0.947

6. Discussion

6.1. Parameter Contribution to the LightGBM Model

During the propagation of tide waves in the upstream direction in an estuary, the nonlinear interaction between tides and river discharge, as well as among the tide constituents themselves, intensifies, so that the overall influence of the river discharge on estuarine tides can vary spatially. These characteristics make the parameters in the input layer (Figure 3) contribute differently to constructing the tree nodes of the LightGBM model at different tide reaches. Figure 14 shows the contribution of the top 15 parameters in the input layer at each station, including the two river discharges (Qc and Qw), the tide range at Astoria (R), and 12 tide constituents from statistical analysis. The parameter contribution represents the ratio of the number of times a parameter is used in tree splitting to the total number of times all parameters are used in tree splitting. A larger value of the parameter contribution means that the parameter has been used more times to split the nodes, indicating greater parameter importance. For each tide constituent, the cosine and sine parts are used as the input parameters, but the sum of the parameter contributions of the cosine and sine parts of a tide constituent is shown in Figure 14. It is clear from Figure 14 that the river discharges Qc and Qw consistently rank in the top three at each station. The tide range at Astoria (R) ranks between 7th and 13th at those stations for its contribution. With respect to the tide constituents, their ranks are basically related to their tide amplitudes. For example, the M2 tide constituent, which is the most significant astronomical tide constituent, ranks 3rd–5th. Although the LightGBM model plays the role of a black box, its selection of the parameters for the construction of the decision trees is still related to the physical background of the lower Columbia River.

[Figure 14: horizontal bar charts of parameter contribution (%). Top three contributors: (A) Vancouver: Qc 15.1, Qw 14.7, SA 8.8; (B) St.Helens: Qc 14.4, Qw 13.8, SA 8.9; (C) Longview: Qc 12.6, Qw 10.8, M2 9.8; (D) Wauna: SA 7.9, Qw 7.8, Qc 7.8.]

Figure 14. The top 15 parameters of the input layer of the LightGBM model at: (A) Vancouver, (B) St.Helens, (C) Longview, and (D) Wauna stations.

6.2. The Subtide Constituents

It can be seen from Figure 14 that the SA, SSA, MSF, and MM tide constituents are among the top 15 most important parameters. To further analyze these subtide constituents, their amplitudes obtained from the measurements are plotted in Figure 15 and compared with the M2 tide constituent. These subtide constituents generally become more significant in the landward direction. Near the estuary mouth, where the reference station (Astoria) is located, the amplitudes of these subtide constituents are insignificant except for the SA tide constituent (0.10 m). The strengthening of subtides in the landward direction was also reported in previous studies [12,28,53,54]. For example, MSF (MM) can become more energetic through the nonlinear interaction of the M2 and S2 (M2 and N2) tide constituents during the landward propagation of tide waves. Moreover, the lower frequency of subtides relative to the diurnal and semidiurnal tides favors their propagation, because subtides are affected by smaller friction [8].

The upstream river discharge can also directly affect the water level variation. The upstream river discharge varies annually and seasonally, making the water level fluctuation also annual and seasonal. The stronger effect of the upstream river discharge on water levels in the upstream direction is reflected in the increase of the SA and SSA tide amplitudes, as their periods are annual and semiannual, respectively. The amplitudes of the SA and SSA tide constituents increase by about 7 and 13 times from Astoria station to Vancouver station, which is comparable with the variation ratio of the M2 tide constituent. It should be noted that the amplification of the SA and SSA tide constituents mainly comes from the variation pattern of the influence of river discharge on water levels. This makes their physical meaning different from that of the SA and SSA tide constituents in coastal or ocean zones. Figure 15 illustrates that the strong inland propagation of subtide waves is also excited in the lower Columbia River, consistent with the high ranks of the subtide constituents in Figure 14.

Figure 15. Amplitudes of the most important subtide constituents and the M2 tide constituent at each station.
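The cosine/sine representation of a constituent used in the input layer, and the amplitude A = sqrt(a² + b²) plotted in Figure 15, can be illustrated with a least-squares harmonic fit. This is a sketch on a synthetic series, not the study's analysis; the constituent periods are the standard tidal values and the 0.9 m amplitude is arbitrary:

```python
import numpy as np

# Standard constituent periods in hours (M2 semidiurnal, SA annual).
HOURS_PER_CYCLE = {"M2": 12.4206012, "SA": 365.2422 * 24.0}

def constituent_features(t_hours, name):
    """Return the cosine and sine parts of one constituent at times t (hours),
    as they would enter the model's input layer."""
    omega = 2.0 * np.pi / HOURS_PER_CYCLE[name]
    return np.cos(omega * t_hours), np.sin(omega * t_hours)

# Synthetic hourly water level: a single M2 signal of 0.9 m amplitude.
t = np.arange(0.0, 24.0 * 30.0, 1.0)
eta = 0.9 * np.cos(2.0 * np.pi / HOURS_PER_CYCLE["M2"] * t - 0.4)

# Least-squares fit of mean + cos/sin pair; the constituent amplitude is
# recovered from the two coefficients as A = sqrt(a^2 + b^2).
c, s = constituent_features(t, "M2")
A = np.column_stack([np.ones_like(t), c, s])
coef, *_ = np.linalg.lstsq(A, eta, rcond=None)
amplitude = np.hypot(coef[1], coef[2])
```

The fit recovers the 0.9 m amplitude from the cosine and sine coefficients, which is the same relationship that links the paired cos/sin input parameters to the constituent amplitudes discussed above.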

7. Conclusions

In this study, a novel prediction model was built based on the machine learning LightGBM framework to accurately predict the water levels in a tide-affected estuary and to gain a better understanding of the nonlinear interaction between tides and river discharge in estuarine waters. The model was fully trained with the data obtained from the lower Columbia River to establish the optimal parameters required. The model results were also compared with those from the NS_TIDE model.

In the input layer of the LightGBM model, the river discharges from the lower Columbia River at the Bonneville Dam and from the Willamette River at Portland, the tide range near the estuary mouth (Astoria), and the tide constituents were used. 80% of the field data were used for training and optimizing the model, and the remaining 20% of the measurements were used for testing the model performance.

The accuracy of the LightGBM model was statistically evaluated by RMSE, MAE, CC, and SS. The RMSE value of the LightGBM model was 0.14 m, which shows higher accuracy than those from the NS_TIDE model (0.15–0.16 m). The MAE values (0.72–0.87 m), the CC values (0.975–0.987), and the SS values (0.941–0.972) of the LightGBM model also indicated a comparable prediction accuracy relative to the MAE values (0.82–1.09 m), the CC values (0.972–0.983), and the SS values (0.938–0.965) of the NS_TIDE model. It was also revealed that the predicted water levels from the LightGBM model in the flood season exhibited a phase lag affected by the upstream river discharge.

The results clearly demonstrate that the LightGBM model is a powerful machine learning tool for predicting the water levels in fluvial-flow-affected estuaries, such as the lower Columbia River and possibly other mega-estuaries worldwide, with satisfactory accuracy. The LightGBM model offers a new perspective on the prediction of estuarine water levels.

Author Contributions: Conceptualization, M.G., S.P. and Y.C.; Data curation, M.G. and H.P.; Formal analysis, M.G., S.P., Y.C., C.C., H.P. and X.Z.; Funding acquisition, M.G. and Y.C.; Investigation, M.G., S.P., Y.C. and H.P.; Methodology, M.G., S.P., Y.C. and C.C.; Software, M.G., S.P. and C.C.; Supervision, S.P. and Y.C.; Validation, M.G., S.P. and C.C.; Writing—original draft, M.G., S.P., Y.C., C.C., H.P. and X.Z.; Writing—review & editing, M.G., S.P., Y.C., C.C., H.P. and X.Z. All authors have read and agreed to the published version of the manuscript.

Funding: This work was partly supported by the National Natural Science Foundation of China [Grant No. 51620105005], the Fundamental Research Funds for the Central Universities of China [Grant Nos. 2018B635X14, B200204017], and the Postgraduate Research & Practice Innovation Program of Jiangsu Province [Grant No. KYCX18_0602].

Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.

Data Availability Statement: The water level and river discharge data used in this study can be obtained from the National Oceanic and Atmospheric Administration (NOAA) website (https://www.noaa.gov/, accessed on 27 April 2021) and the U.S. Geological Survey (USGS) website (https://www.usgs.gov/, accessed on 27 April 2021), respectively. The LightGBM model used in this study is based on the open-source framework of Microsoft® available at https://github.com/microsoft/LightGBM/tree/master/python-package (accessed on 27 April 2021).

Acknowledgments: The first author would like to acknowledge the financial support from the China Scholarship Council (CSC) under the PhD exchange program with Cardiff University [201906710022]. The LightGBM model was run on the Supercomputing Wales Hawk clusters, funded by the European Regional Development Fund (ERDF). The authors would also like to thank Pascal Matte for providing the codes of the NS_TIDE model.
Conflicts of Interest: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

1. Savenije, H.H.G. Prediction in ungauged estuaries: An integrated theory. Water Resour. Res. 2015, 51, 2464–2476. [CrossRef]
2. Garvine, R.W. The distribution of salinity and temperature in the Connecticut River estuary. J. Geophys. Res. 1975, 80, 1176–1183. [CrossRef]
3. Chau, K.W. A split-step particle swarm optimization algorithm in river stage forecasting. J. Hydrol. 2007, 346, 131–135. [CrossRef]
4. Pawlowicz, R.; Beardsley, B.; Lentz, S. Classical tidal harmonic analysis including error estimates in MATLAB using T_TIDE. Comput. Geosci. 2002, 28, 929–937. [CrossRef]
5. Egbert, G.D.; Erofeeva, S.Y. Efficient inverse modeling of barotropic ocean tides. J. Atmos. Ocean. Technol. 2002, 19, 183–204. [CrossRef]
6. Gallo, M.N.; Vinzon, S.B. Generation of overtides and compound tides in Amazon estuary. Ocean Dyn. 2005, 55, 441–448. [CrossRef]
7. Matte, P.; Jay, D.A.; Zaron, E.D. Adaptation of classical tidal harmonic analysis to nonstationary tides, with application to river tides. J. Atmos. Ocean. Technol. 2013, 30, 569–589. [CrossRef]
8. Jay, D.A. Green's law revisited: Tidal long-wave propagation in channels with strong topography. J. Geophys. Res. 1991, 96, 20585. [CrossRef]
9. Godin, G. The propagation of tides up rivers with special considerations on the upper Saint Lawrence River. Estuar. Coast. Shelf Sci. 1999, 48, 307–324. [CrossRef]
10. Pan, H.; Lv, X.; Wang, Y.; Matte, P.; Chen, H.; Jin, G. Exploration of tidal-fluvial interaction in the Columbia River estuary using S_TIDE. J. Geophys. Res. Oceans 2018, 123, 6598–6619. [CrossRef]
11. Cai, H.; Yang, Q.; Zhang, Z.; Guo, X.; Liu, F.; Ou, S. Impact of river-tide dynamics on the temporal-spatial distribution of residual water level in the Pearl River channel networks. Estuaries Coasts 2018, 41, 1885–1903. [CrossRef]
12. Gan, M.; Chen, Y.; Pan, S.; Li, J.; Zhou, Z. A modified nonstationary tidal harmonic analysis model for the Yangtze estuarine tides. J. Atmos. Ocean. Technol. 2019, 36, 513–525. [CrossRef]
13. Matte, P.; Secretan, Y.; Morin, J. Temporal and spatial variability of tidal-fluvial dynamics in the St. Lawrence fluvial estuary: An application of nonstationary tidal harmonic analysis. J. Geophys. Res. Oceans 2014, 119, 5724–5744. [CrossRef]
14. Pan, H.D.; Guo, Z.; Wang, Y.Y.; Lv, X.Q. Application of the EMD method to river tides. J. Atmos. Ocean. Technol. 2018, 35, 809–819. [CrossRef]
15. Pan, H.; Lv, X. Reconstruction of spatially continuous water levels in the Columbia River estuary: The method of empirical orthogonal function revisited. Estuar. Coast. Shelf Sci. 2019, 222, 81–90. [CrossRef]
16. Zhang, W.; Cao, Y.; Zhu, Y.; Zheng, J.; Ji, X.; Xu, Y.; Wu, Y.; Hoitink, A.J.F. Unravelling the causes of tidal asymmetry in deltas. J. Hydrol. 2018, 564, 588–604. [CrossRef]
17. Chang, H.; Lin, L. Multi-point tidal prediction using artificial neural network with tide-generating forces. Coast. Eng. 2006, 53, 857–864. [CrossRef]
18. Lee, T.L. Back-propagation neural network for long-term tidal predictions. Ocean Eng. 2004, 31, 225–238. [CrossRef]
19. Lee, T.L.; Jeng, D.S. Application of artificial neural networks in tide-forecasting. Ocean Eng. 2002, 29, 1003–1022. [CrossRef]
20. Supharatid, S. Application of a neural network model in establishing a stage–discharge relationship for a tidal river. Hydrol. Process. 2003, 17, 3085–3099. [CrossRef]
21. Cox, D.T.; Tissot, P.; Michaud, P. Water level observations and short-term predictions including meteorological events for entrance of Galveston Bay, Texas. J. Waterw. Port Coast. Ocean Eng. 2002, 128, 21–29. [CrossRef]
22. Liang, S.X.; Li, M.C.; Sun, Z.C. Prediction models for tidal level including strong meteorologic effects using a neural network. Ocean Eng. 2008, 35, 666–675. [CrossRef]
23. Riazi, A. Accurate tide level estimation: A deep learning approach. Ocean Eng. 2020, 198, 107013. [CrossRef]
24. Chang, F.; Chen, Y. Estuary water-stage forecasting by using radial basis function neural network. J. Hydrol. 2003, 270, 158–166. [CrossRef]
25. Tsai, C.C.; Lu, M.C.; Wei, C.C. Decision tree-based classifier combined with neural-based predictor for water-stage forecasts in a river basin during typhoons: A case study in Taiwan. Environ. Eng. Sci. 2012, 29, 108–116. [CrossRef]
26. Chen, W.B.; Liu, W.C.; Hsu, M.H. Comparison of ANN approach with 2D and 3D hydrodynamic models for simulating estuary water stage. Adv. Eng. Softw. 2012, 45, 69–79. [CrossRef]
27. Yoo, H.J.; Kim, D.H.; Kwon, H.; Lee, S.O. Data driven water surface elevation forecasting model with hybrid activation function—A case study for Hangang River, South Korea. Appl. Sci. 2020, 10, 1424. [CrossRef]
28. Chen, Y.P.; Gan, M.; Pan, S.Q.; Pan, H.D.; Zhu, X.; Tao, Z.J. Application of auto-regressive (AR) analysis to improve short-term prediction of water levels in the Yangtze estuary. J. Hydrol. 2020, 590, 125386. [CrossRef]
29. Zhang, Q.C.; Yang, L.T.; Chen, Z.K.; Li, P. A survey on deep learning for big data. Inf. Fusion 2018, 42, 146–157. [CrossRef]
30. Sun, X.L.; Liu, M.X.; Sima, Z.Q. A novel cryptocurrency price trend forecasting model based on LightGBM. Financ. Res. Lett. 2020, 32, 101084. [CrossRef]
31. Zhu, S.L.; Hrnjica, B.; Ptak, M.; Choiński, A.; Sivakumar, B. Forecasting of water level in multiple temperate lakes using machine learning models. J. Hydrol. 2020, 585, 124819. [CrossRef]
32. Zhu, S.L.; Ptak, M.; Yaseen, Z.M.; Dai, J.Y.; Sivakumar, B. Forecasting surface water temperature in lakes: A comparison of approaches. J. Hydrol. 2020, 585, 124809. [CrossRef]
33. Chen, C.; Zhang, Q.M.; Ma, Q.; Yu, B. LightGBM-PPI: Predicting protein-protein interactions through LightGBM with multi-information fusion. Chemom. Intell. Lab. Syst. 2019, 191, 54–64. [CrossRef]
34. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Liu, T.Y. LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems; Morgan Kaufmann Publishers: San Mateo, CA, USA, 2017; pp. 3146–3154.
35. Dev, V.A.; Eden, M.R. Formation lithology classification using scalable gradient boosted decision trees. Comput. Chem. Eng. 2019, 128, 392–404. [CrossRef]
36. Fan, J.L.; Ma, X.; Wu, L.F.; Zhang, F.C.; Yu, X.; Zeng, W.Z. Light gradient boosting machine: An efficient soft computing model for estimating daily reference evapotranspiration with local and external meteorological data. Agric. Water Manag. 2019, 225. [CrossRef]
37. Chen, T.Q.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794. [CrossRef]
38. Shi, X.; Wong, Y.D.; Li, M.Z.; Palanisamy, C.; Chai, C. A feature learning approach based on XGBoost for driving assessment and risk prediction. Accid. Anal. Prev. 2019, 129, 170–179. [CrossRef]
39. Dong, W.; Huang, Y.; Lehane, B.; Ma, G. XGBoost algorithm-based prediction of concrete electrical resistivity for structural health monitoring. Autom. Constr. 2020, 114, 103155. [CrossRef]
40. Demir-Kavuk, O.; Kamada, M.; Akutsu, T.; Knapp, E.W. Prediction using step-wise L1, L2 regularization and feature selection for small data sets with large number of features. BMC Bioinform. 2011, 12, 412. Available online: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-12-412 (accessed on 27 April 2021). [CrossRef]
41. Kukulka, T.; Jay, D.A. Impacts of Columbia River discharge on salmonid habitat: 1. A nonstationary fluvial tide model. J. Geophys. Res. Oceans 2003, 108, 3293. [CrossRef]
42. Kukulka, T.; Jay, D.A. Impacts of Columbia River discharge on salmonid habitat: 2. Changes in shallow-water habitat. J. Geophys. Res. Oceans 2003, 108, 3294. [CrossRef]
43. Jay, D.A.; Leffler, K.; Degens, S. Long-term evolution of Columbia River tides. J. Waterw. Port Coast. Ocean Eng. 2011, 137, 182–191. [CrossRef]
44. Lee, T.; Makarynskyy, O.; Shao, C. A combined harmonic analysis–artificial neural network methodology for tidal predictions. J. Coast. Res. 2007, 23, 764–770. [CrossRef]
45. LightGBM's Documentation. Available online: https://lightgbm.readthedocs.io/en/latest/ (accessed on 27 April 2021).
46. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [CrossRef]
47. Bergstra, J.; Bengio, Y. Random search for hyper-parameter optimization. J. Mach. Learn. Res. 2012, 13, 281–305. Available online: https://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a (accessed on 27 April 2021).
48. Jay, D.A.; Flinchem, E.P. Interaction of fluctuating river flow with a barotropic tide: A demonstration of wavelet tidal analysis methods. J. Geophys. Res. 1997, 102, 5705–5720. [CrossRef]
49. Pawlowicz, R. "M_Map: A Mapping Package for MATLAB", Version 1.4m [Computer Software]. 2020. Available online: www.eoas.ubc.ca/~rich/map.html (accessed on 27 April 2021).
50. National Oceanic and Atmospheric Administration. Available online: https://www.noaa.gov/ (accessed on 27 April 2021).
51. U.S. Geological Survey (USGS). Available online: https://www.usgs.gov/ (accessed on 27 April 2021).
52. Murphy, A.H. Skill scores based on the mean square error and their relationships to the correlation coefficient. Mon. Weather Rev. 1988, 116, 2417–2424. [CrossRef]
53. Guo, L.C.; Wegen, M.V.D.; Jay, D.A.; Matte, P.; Wang, Z.B.; Roelvink, D.; He, Q. River-tide dynamics: Exploration of nonstationary and nonlinear tidal behavior in the Yangtze River estuary. J. Geophys. Res. Oceans 2015, 120, 3499–3521. [CrossRef]
54. Guo, L.; Zhu, C.; Wu, X.; Wan, Y.; Jay, D.A.; Townend, I.; Wang, Z.B.; He, Q. Strong inland propagation of low-frequency long waves in river estuaries. Geophys. Res. Lett. 2020, 47, e2020GL089112. [CrossRef]