Geosci. Model Dev., 13, 4253–4270, 2020
https://doi.org/10.5194/gmd-13-4253-2020
© Author(s) 2020. This work is distributed under the Creative Commons Attribution 4.0 License.
Published by Copernicus Publications on behalf of the European Geosciences Union.

ML-SWAN-v1: a hybrid machine learning framework for the concentration prediction and discovery of transport pathways of surface water nutrients

Benya Wang1,2, Matthew R. Hipsey2,3, and Carolyn Oldham1,2

1Department of Civil, Mining and Environmental Engineering, The University of Western Australia, 35 Stirling Highway, Crawley 6009, Australia
2Co-operative Research Centre for Water Sensitive Cities, Clayton, Australia
3UWA School of Agriculture and Environment, The University of Western Australia, 35 Stirling Highway, Crawley 6009, Australia

Correspondence: Carolyn Oldham ([email protected])

Received: 8 January 2020 – Discussion started: 6 April 2020
Revised: 6 July 2020 – Accepted: 23 July 2020 – Published: 15 September 2020

Abstract. Nutrient data from catchments discharging to receiving waters are monitored for catchment management. However, nutrient data are often sparse in time and space and have non-linear responses to environmental factors, making it difficult to systematically analyse long- and short-term trends and undertake nutrient budgets. To address these challenges, we developed a hybrid machine learning (ML) framework that first separated baseflow and quickflow from total flow, generated data for missing nutrient species, and then utilised the pre-generated nutrient data as additional variables in a final simulation of tributary water quality. Hybrid random forest (RF) and gradient boosting machine (GBM) models were employed and their performance compared with a linear model, a multivariate weighted regression model, and stand-alone RF and GBM models that did not pre-generate nutrient data. The six models were used to predict six different nutrients discharged from two study sites in Western Australia: Ellen Brook (small and ephemeral) and the Murray River (large and perennial). Our results showed that the hybrid RF and GBM models had significantly higher accuracy and lower prediction uncertainty for almost all nutrient species across the two sites. The pre-generated nutrient and hydrological data were highlighted as the most important components of the hybrid model. The model results also indicated different hydrological transport pathways for total nitrogen (TN) export from two tributary catchments. We demonstrated that the hybrid model provides a flexible method to combine data of varied resolution and quality and is accurate for the prediction of responses of surface water nutrient concentrations to hydrologic variability.

1 Introduction

Surface water nutrient concentrations have been significantly increased by human activities (Forio et al., 2015) due to urbanisation, waste discharges and agricultural intensification (Liu et al., 2012; Kaiser et al., 2013; Li et al., 2013). Increased nutrient concentrations and loads in streams alter the biogeochemical functioning and biological community structure in receiving estuaries (Jickells et al., 2014; Staehr et al., 2017), leading to an increased incidence of harmful algal blooms (Domingues et al., 2011), anoxia and hypoxia (Li et al., 2016; Testa et al., 2017) and reduced water availability (Heathwaite, 2010). Analysis of tributary water quality data over time is therefore essential to compute incoming nutrient loads, support policy and plan remediation measures.

Water quality data, however, often have constraints that make it challenging to analyse long- and short-term trends. Firstly, water quality data often have non-linear responses to environmental factors and show high-order interaction effects between different environmental variables. Moreover, nutrients can derive from different sources (point or non-point) in the landscape and are transported to receiving waters through different water pathways subject to varied catchment hydrological conditions and human intervention (Hirsch et al., 2010; Lloyd et al., 2014). Additionally, tributary nutrient datasets often are sparse in both space and time, due to the high cost of fieldwork and chemical analysis (Lamsal et al., 2006; Forio et al., 2015). Historical and current water quality monitoring programmes often use low-frequency sampling regimes on a weekly to monthly basis (Halliday et al., 2012). When monthly averaged concentrations are used, calculated nutrient loads to receiving environments such as lakes or estuaries may be poorly estimated (Cozzi and Giani, 2011), with high variability in the estimated loads (Jordan and Cassidy, 2011). It is also common to have patchy availability of nutrient species data across a study area, and combining datasets from different projects and analytical laboratories makes the analysis of long-term trends fraught with uncertainty. For instance, total nitrogen (TN) and total phosphorus (TP) concentrations within catchment outflows may have been monitored for decades, while dissolved organic nitrogen (DON) and dissolved organic carbon (DOC) concentrations may have only been monitored recently, with the increasing recognition of their ecological importance (Górniak et al., 2002; Petrone et al., 2009; Erlandsson et al., 2011). Given the hydrochemical correlation between different nutrient species and high analytical cost, there are benefits in extracting maximum information from all available nutrient data, particularly relating to changes in water quality over time (Hirsch et al., 2010). In summary, while high-quality nutrient data from tributaries are typically required as input to water quality modelling of receiving waters, the reliability and accuracy of the trend analysis of tributary data are frequently restricted by data non-linearity, limited sample size and variable nutrient availability.

Various models for constructing tributary water quality data have been developed. For example, linear models (LMs) and generalised linear models (GLMs) that use correlations between concentration (C) and flow (Q) have long played a central role in stream water quality analysis (Cohn et al., 1989; Chanat et al., 2002). Some multivariate regression models have been applied to analyse the long-term trend (Li et al., 2007; Tao et al., 2010; Greening et al., 2014) and seasonal patterns (Giblin et al., 2010; Chen et al., 2012) of surface water nutrients. For example, a weighted regression on time, discharge and season (WRTDS) was introduced by Hirsch et al. (2010) and has been applied to a number of different water quality studies (Green et al., 2014; Zhang et al., 2016a, b, c).

Meanwhile, data-driven machine learning (ML) methods are ...

... function of stream ecology (Clapcott et al., 2012). In contrast to process-based conceptual models, ML methods simulate relationships purely from the data (Maier et al., 2014) and have the ability to incorporate different types of variables (e.g. numerical or categorised variables); this is particularly suitable for systems with complex variable interactions and non-linear response functions (Povak et al., 2014).

While both process-based and ML models can manage non-linear interactions and be used to explore long-term trends, they both have difficulty in fully extracting important hydrochemical information embedded in nutrient data. Hybrid methods have been proposed for flow forecasting, to enhance the performance of ML models by first using intermediate models to generate additional variables, which are then used for subsequent modelling. For instance, a neural network model is first applied to reconstruct surface ocean partial pressure of carbon dioxide (pCO2) climatology, which is used as an input into another neural network to predict pCO2 anomalies with other features (Denvil-Sommer et al., 2019). Similarly, Noori and Kalin (2016) used the soil and water assessment tool (SWAT) to generate baseflow and stormflow, which were then used as inputs to an artificial neural network (ANN) model to improve daily flow prediction. Both studies used hybrid models to demonstrate that pre-generated variables provided additional information that was crucial to achieving higher prediction accuracy, compared with stand-alone ANN models.

Stream flow integrates water from multiple pathways resulting in a distribution of residence times. Stream nutrients are the product of overlapping historical inputs and reaction rates, which are spatially distributed and temporally weighted within the catchment (Abbott et al., 2016). Therefore, it is beneficial to understand nutrient transport pathways from the source to receiving waters, to analyse the long- and short-term trends of stream nutrient data; this knowledge will improve management strategies to reduce nutrient transport (Tesoriero et al., 2009; Mellander et al., 2012). In the analysis of the streamflow hydrograph, separating baseflow (the long-term delayed flow from storage) and quickflow (the short-term response to a rainfall event) from total flow is a well-established strategy to better understand transport pathways (Tesoriero et al., 2009). To utilise all available nutrient data and assess the impact of different transport pathways on stream nutrient concentrations, we developed a hybrid machine learning framework for surface water nutrient concentrations (ML-SWAN) that first separated baseflow and quickflow from total flow and then built intermediate models