A machine learning approach to geomorphometry-extreme links in the Lower Colorado Basin

Lin Ji, Victor R. Baker, Hoshin V. Gupta, P.A. Ty Ferré, and Tao Liu
Department of Hydrology and Atmospheric Sciences, the University of Arizona

INTRODUCTION

Extreme flood hazards are common in the Lower Colorado River Basin (LCRB) due to the complex terrain and entrenched river channels. Evaluating basin morphometry helps explain the physical behavior of watersheds with respect to extreme events. However, extracting basin morphometric characteristics is computationally expensive and time consuming. Conventional approaches lack effective tools that link morphometric indices to extreme floods, and this poses a great challenge for extreme flood prediction.

In this study, we extracted 41 basin morphometric parameters for 372 watersheds in the LCRB from a 10 m DEM using ArcGIS with a Python script. We then employed Random Forest (RF) regression with the GridSearchCV algorithm and Out-of-Bag (OOB) error estimation to link these morphometric features to two extreme-flood records: the maximum annual peak discharge (MAP) and the peak discharge per unit area (UP). The model can also be used to understand the relative importance of geomorphometric variables in predicting extreme floods.

DATA

There are 695 USGS stream gages located in the LCRB. We excluded redundant sites (gages a short distance apart on the same stream), heavily urbanized areas, sites influenced by upstream reservoirs or dams, and sites with large errors in the calculated stream orders. The remaining 372 gaging stations, each with at least 10 years of annual peak discharge records, were taken as the research watersheds.

The watersheds were delineated with the ArcGIS Hydrology Toolbox. The morphometric characteristics of each basin were then calculated with the Morphometric Toolbox and a Python script, according to the methods and formulas in Tab. 1.

Tab. 1 Formulas and methods in calculation of morphometric parameters

| Morphometric characteristics | Formula | Units | Reference |
|---|---|---|---|
| 1. Drainage network | | | |
| Stream order (u) | Hierarchical rank | dimensionless | Strahler (1964) |
| Number of stream orders (Nu) | $N_u = N_1 + N_2 + \cdots + N_n$ | dimensionless | Horton (1945) |
| Length of stream orders (Lu) | $L_u = L_1 + L_2 + \cdots + L_n$ | km | Horton (1945) |
| Bifurcation ratio (Rb) | $R_b = N_u / N_{u+1}$ | dimensionless | Schumm (1956) |
| Mean bifurcation ratio (Rbm) | $R_{bm} = \left(\frac{N_1}{N_2} + \frac{N_2}{N_3} + \cdots + \frac{N_{n-1}}{N_n}\right) / (n - 1)$ | dimensionless | Strahler (1957) |
| 2. Basin geometry | | | |
| Total basin area (A) | Plan area enclosed by basin boundary | km2 | |
| Total basin surface area (As) | Surface area enclosed by basin boundary | km2 | |
| Basin perimeter (P) | Length of the drainage basin boundary | km | Schumm (1956) |
| Basin length (Lb) | Distance from outlet to the farthest point on basin boundary | km | Schumm (1956) |
| Main channel length (Lc) | Length of longest network from outlet to upstream | km | Ayad (2015) |
| Fitness ratio (Rf) | $R_f = L_c / P$ | dimensionless | Melton (1957) |
| Form factor (Ff) | $F_f = A / L_b^2$ | dimensionless | Horton (1932) |
| Relative perimeter (Pr) | $P_r = A / P$ | dimensionless | Schumm (1956) |
| Length area relation (Lar) | $L_{ar} = 1.4 \, A^{0.6}$ | km1.2 | Hack (1957) |
| Rotundity coefficient (R) | $R = L_b^2 \, \pi / (4A)$ | dimensionless | Strahler (1964) |
| Mean basin width (W) | $W = A / L_b$ | km | Horton (1932) |
| Compactness coefficient (C) | $C = 0.282 \, P / \sqrt{A}$ | dimensionless | Horton (1945) |
| Circularity ratio (Rc) | $R_c = 4 \pi A / P^2$ | dimensionless | Strahler (1964); Miller (1953) |
| Elongation ratio (Re) | $R_e = 1.129 \, \sqrt{A} / L_b$ | dimensionless | Schumm (1956) |
| 3. Drainage texture | | | |
| Drainage texture (Dt) | $D_t = N_u / P$ | No./km | Horton (1945) |
| Drainage density (Dd) | $D_d = L_u / A$ | km/km2 | Horton (1945) |
| Stream frequency (Fs) | $F_s = N_u / A$ | No./km2 | Horton (1945) |
| Constant of channel maintenance (Cm) | $C_m = 1 / D_d = A / L_u$ | km2/km | Strahler (1964); Schumm (1956) |
| Infiltration number (In) | $I_n = F_s \times D_d$ | No./km3 | Faniran (1968); Pareta and Pareta (2011) |
| Drainage intensity (Di) | $D_i = F_s / D_d$ | No./km | Faniran (1968); Pareta and Pareta (2011) |
| Average length of overland flow (Lo) | $L_o = 1 / (2 D_d)$ | km | Horton (1945) |
| 4. Basin relief | | | |
| Height of basin outlet | The outlet height from DEM | m | |
| Maximum height of basin | The maximum basin height from DEM | m | |
| Basin relief (R) | $R = H - h$, where H is the maximum elevation and h is the minimum elevation of a basin | m | Schumm (1956) |
| Relief ratio (Rr) | $R_r = R / L_b$ | dimensionless | Schumm (1956) |
| Relative relief ratio (Rrr) | $R_{rr} = R \times 100 / P$ | dimensionless | Melton (1957) |
| Ruggedness number (Rn) | $R_n = R \times D_d$ | dimensionless | Strahler (1958) |
| Terrain undulation index (T) | $T = A_s / A$ | dimensionless | Ayad (2015) |

MODEL

1. Random Forest algorithm (Breiman, 2001)

2. Hyperparameter Tuning

Two important hyperparameters need to be optimized in the Python script:

• n_estimators: the number of trees in the forest, each built on a bootstrapped sample of the observations.
• max_features: the number of features to consider when looking for the best split. The options are "auto" (max_features = n_features), "sqrt" (max_features = sqrt(n_features)), and "log2" (max_features = log2(n_features)), where n_features is the number of features in the data.

GridSearchCV and the OOB error rate are used to evaluate the performance of the trained model.

RESULTS

Fig. 2 The results of GridSearchCV and the OOB error rate used to validate the trained RF model for the two response variables, MAP and UP. 2a and 2b show how the average MSE of GridSearchCV with 10-fold cross-validation changes with n_estimators for different max_features, for MAP and UP respectively. 2c and 2d show how the OOB error rate changes with n_estimators for different max_features, for MAP and UP respectively.

Tab. 2 The summary of the best results of training and evaluation for the random forest model

| Metric | MAP training | MAP testing | UP training | UP testing |
|---|---|---|---|---|
| R2 | 0.918 | 0.706 | 0.902 | 0.754 |
| MAE | 0.022 | 0.047 | 0.024 | 0.037 |
| RMSE | 0.198 | 0.172 | 0.202 | 0.154 |
| std | 0.002 | 0.006 | 0.002 | 0.004 |
| Explained variance | 0.885 | 0.336 | 0.871 | 0.512 |

Fig. 4 The feature importance of the input variables in predicting MAP (left) and UP (right).

CONCLUSION

The results indicate that the RF model estimates UP better than MAP. They also suggest that a significant improvement in predicting MAP is achieved with the relative perimeter, total basin area, and length area relation. A similar improvement in predicting UP is achieved with the maximum height of basin, total basin relief, and relief ratio. This initial study using RF shows that data-driven machine learning can help link morphometry to measures of extreme flooding, thereby advancing our understanding of regional large flood behavior and improving flood risk analyses for the Southwestern U.S.

REFERENCES

[1]. Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.
[2]. Friedman, J., Hastie, T., & Tibshirani, R. (2001). The elements of statistical learning (Vol. 1, No. 10). New York: Springer Series in Statistics.
[3]. Liaw, A., & Wiener, M. (2002). Classification and regression by randomForest. R News, 2(3), 18-22.
[4]. Sadler, J. M., Goodall, J. L., Morsy, M. M., & Spencer, K. (2018). Modeling urban coastal flood severity from crowd-sourced flood reports using Poisson regression and Random Forest. Journal of Hydrology, 559, 43-55.

Fig. 1 Locations of 372 USGS stream gages and their contributing areas delineated from USGS NED on the Lower Colorado River Basin.

Fig. 3 The model results of the observation and estimation for training samples and evaluation samples using Random Forest regression. The left is the prediction of MAP. The right is the prediction of UP.
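As an illustration of how a few of the Tab. 1 indices follow from basic basin measurements, here is a minimal sketch in plain Python. The function name `basin_indices` and the input values are hypothetical and are not taken from the Morphometric Toolbox. The elongation ratio and compactness coefficient are written with the square root of the area, which is the usual dimensionless form; treat that as an assumption about the poster's formulas.

```python
import math

def basin_indices(A, P, Lb, Nu, Lu):
    """Compute selected Tab. 1 indices (hypothetical helper).

    A  : total basin area (km2)
    P  : basin perimeter (km)
    Lb : basin length (km)
    Nu : total number of stream segments
    Lu : total stream length (km)
    """
    Dd = Lu / A  # drainage density (km/km2)
    return {
        "Ff": A / Lb ** 2,                   # form factor
        "Rc": 4 * math.pi * A / P ** 2,      # circularity ratio
        "Re": 1.129 * math.sqrt(A) / Lb,     # elongation ratio (sqrt assumed)
        "C": 0.282 * P / math.sqrt(A),       # compactness coefficient (sqrt assumed)
        "Dt": Nu / P,                        # drainage texture (No./km)
        "Dd": Dd,
        "Fs": Nu / A,                        # stream frequency (No./km2)
        "Cm": 1 / Dd,                        # constant of channel maintenance
        "Lo": 1 / (2 * Dd),                  # average length of overland flow (km)
    }

# Made-up basin, for illustration only.
example = basin_indices(A=100.0, P=50.0, Lb=20.0, Nu=40, Lu=150.0)
for name, value in example.items():
    print(f"{name}: {value:.4f}")
```

Relief-based indices such as the relief ratio are left out because they mix elevations in metres with lengths in kilometres, and the unit conversion is not stated in Tab. 1.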
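The hyperparameter tuning described under MODEL can be sketched with scikit-learn roughly as follows. This is not the authors' script: the data are synthetic stand-ins (120 rows, 8 features), the grid values are arbitrary, and `max_features=1.0` is used to mean "all features" because recent scikit-learn releases no longer accept "auto" for the regressor.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the morphometric feature matrix and a flood response.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 8))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=120)

# Candidate values for the two hyperparameters named in the text.
param_grid = {
    "n_estimators": [50, 100],
    "max_features": [1.0, "sqrt", "log2"],  # 1.0 = use all features
}

# 10-fold cross-validated grid search, scored by (negative) mean squared error.
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=10,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)
best = search.best_params_
cv_mse = -search.best_score_

# Refit with the selected settings and request the out-of-bag estimate.
# For a regressor, oob_score_ is the R2 computed on out-of-bag predictions.
oob_model = RandomForestRegressor(oob_score=True, random_state=0, **best)
oob_model.fit(X, y)
oob_r2 = oob_model.oob_score_

# Impurity-based feature importances, the kind of ranking shown in Fig. 4.
importances = oob_model.feature_importances_

print("best parameters:", best)
print("cross-validated MSE:", cv_mse)
print("OOB R2:", oob_r2)
```

Cross-validation and the OOB estimate give two independent views of generalization error, which is presumably why the poster reports both in Fig. 2.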