Based Systems for Diagnosing and Predicting River Health
Total Page:16
File Type:pdf, Size:1020Kb
Refinement of artificial intelligence- based systems for diagnosing and predicting river health Report – SC030189 Title here in 8pt Arial (change text colour to black) i The Environment Agency is the leading public body protecting and improving the environment in England and Wales. It’s our job to make sure that air, land and water are looked after by everyone in today’s society, so that tomorrow’s generations inherit a cleaner, healthier world. Our work includes tackling flooding and pollution incidents, reducing industry’s impacts on the environment, cleaning up rivers, coastal waters and contaminated land, and improving wildlife habitats. This report is the result of research commissioned and funded by the Environment Agency. Published by: Author(s): Environment Agency, Horizon House, Deanery Road, Paisley M.F., Trigg, D.J., Martin, R., Walley, W.J., Bristol, BS1 5AH Andriaenssens, V*., Buxton, R., & O’Connor, M. www.environment-agency.gov.uk Dissemination Status: ISBN: 978-1-84911-243-7 Publicly available Released to all regions © Environment Agency – September 2011 Keywords: All rights reserved. This document may be reproduced River invertebrates, biomonitoring, pollution, diagnosis, with prior permission of the Environment Agency. modelling, pressure, pattern recognition, Bayesian network, artificial intelligence The views and statements expressed in this report are those of the author alone. The views or statements Research Contractor: expressed in this publication do not necessarily Centre for Intelligent Environmental Systems, represent the views of the Environment Agency and the Faculty of Computing, Engineering & Technology, Environment Agency cannot accept any responsibility for Staffordshire University, Stafford ST18 0AD such views or statements. * Environment Agency Further copies of this report are available from our Research Collaborator: publications catalogue: http://publications.environment- Scottish Environment Protection Agency agency.gov.uk or our National Customer Contact Centre: T: 08708 506506 Environment Agency’s Project Manager: E: [email protected]. John Murray-Bligh, Operations Directorate Project Number: SC030189 Product Code: SCHO0911BUBL-E-E ii Refinement of AI-based systems for diagnosing and predicting river health Evidence at the Environment Agency Evidence underpins the work of the Environment Agency. It provides an up-to-date understanding of the world about us, helps us to develop tools and techniques to monitor and manage our environment as efficiently and effectively as possible. It also helps us to understand how the environment is changing and to identify what the future pressures may be. The work of the Environment Agency’s Evidence Directorate is a key ingredient in the partnership between research, guidance and operations that enables the Environment Agency to protect and restore our environment. This report was produced by the Research, Monitoring and Innovation team within Evidence. The team focuses on four main areas of activity: • Setting the agenda, by providing the evidence for decisions; • Maintaining scientific credibility, by ensuring that our programmes and projects are fit for purpose and executed according to international standards; • Carrying out research, either by contracting it out to research organisations and consultancies or by doing it ourselves; • Delivering information, advice, tools and techniques, by making appropriate products available. Miranda Kavanagh Director of Evidence Refinement of AI-based systems for diagnosing and predicting river health iii Executive Summary This report presents the results of a project funded by the Environment Agency which builds on the previous creation of two software systems to diagnose and predict river health from biological and environmental data, namely the River Pressure Diagnostic System (RPDS) and the River Pressure Bayesian Belief Network (RPBBN). RPDS is a pattern recognition system to diagnose likely pressures at a river site. RPBBN is a reasoning system that can diagnose chemical concentrations from a biological community, or predict likely changes in a biological community from changes in chemical concentrations. An early aim of our project was to use the RPDS database to define chemical standards needed to protect ecological quality. This was achieved by developing Thresholder, a software application which searches the 1995 river survey database to determine the chemical concentrations needed to support the invertebrate fauna predicted by RIVAPCS (River Invertebrate Prediction and Classification System) at all general quality assessment sites. A second early objective was to use the same database to help determine potential reference sites to act as targets for temporal trajectories in RPDS. The main aims of the project were to enhance the software systems RPDS and RPBBN. The specific objectives were as follows: • Substantially extend the dataset on which the data models are based. • Revise and test the data models on which the systems are based. • Extend the functionality of the two systems and combine into one ‘integrated system’. Extending the dataset took much longer than originally anticipated and affected the progress of the remaining work. In particular, the integrated system could not be developed in the remaining time (the third objective), and the two software systems have been kept separate. The dataset has been substantially extended temporally, geographically and in terms of the variables included. The new dataset covers the ten-year period 1995-2004 instead of the single year 1995. The dataset covers Scotland as well as England and Wales. Variables relating to flow (except for Scotland), geology, land cover and land risk have been added to the chemistry and stress as diagnostic variables. The resulting spring and autumn datasets contain over five times more biological samples than the original systems. To reflect increasing interest in linking changes in the biological community with the physical flow of rivers, two measures of flow were included in the new diagnostic variables. The first was the percentage impact (at 95 per cent exceedence probability) from LowFlows2000. To complement this, a second measure was developed to estimate the flow condition at each site at the time the sample was taken. This was based on interpolation from thirty years of monthly flow records at gauged sites in England and Wales. The ecological significance of this measure was demonstrated by the fact that that those taxa more likely to occur in wet conditions were more sensitive, and those more likely to occur in dry conditions were more tolerant, according to their revised biological monitoring working party (BMWP) scores. Following preliminary tests, new MIR-max models were produced for the full dataset for both spring and autumn. In all cases, the number of bins (clusters of samples with iv Refinement of AI-based systems for diagnosing and predicting river health similar biological composition) was kept the same as in the original models, namely 250. The original spring and autumn models contained more than 6,000 samples each (with an average of 24 samples in each bin), whereas the new spring model contains 32,100 samples and the new autumn model 31,400 (with averages of 128 and 126 samples in each bin respectively). The sample data in the original spring and autumn models was from 6,000 sites in England and Wales, whereas the sample data in the new spring and autumn models cover 9,100 and 8,800 sites respectively in England, Wales and Scotland. The new spring and autumn cluster models were ordered by MIR-max to produce hexagonal output maps with the same side-length (10, 15 and 20 bins) as the original models. Each map was rotated to align it as closely as possible with the originals to ease comparison. After adding the new diagnostic variables to the spring and autumn models, RPDS 2.0 was revised to RPDS 3.0 by streamlining the operations involving database queries and including Scotland on the geographical map panel. The output maps in RPDS 3.0 for particular variables are qualitatively similar to those of the same variables in RPDS 2.0, demonstrating that the clustering and ordering in the new models are similar to those of the original models despite the large increase in data. From this we conclude that the models are sound and represent reality rather than an artefact of the sample data, and that the models contain sufficient data: more data is unlikely to affect the overall models. The geographic locations of the samples in clusters occupying similar positions in the hexagonal output map are also similar in RPDS 2.0 and RPDS 3.0. Preliminary evaluation of the new flow variables shows that the percentage impact at 95-percentile flow (flow exceeded 95 per cent of the time) is negatively correlated with distance from source, as might be expected. Preliminary evaluation of the flow condition variable, on the other hand, shows a relationship with the taxa as the clusters containing samples taken in wetter years tend to be those with higher average score per taxon (ASPT), while those taken in drier years tend to be those with lower ASPT. As with the MIR-max models, the new Bayesian Belief Network (BBN) model was derived from a substantially larger quantity of data. The original BBN model was based on 3,600 spring and autumn matched samples, whereas the new spring and autumn models are based on 16,200 and 15,800 matched samples respectively. Several other changes were made. The structure of the model was modified. The larger dataset meant that chemical statistics could be based on percentile values over three years prior to sampling (the chemical statistics that are used to manage water quality) rather than mean values over the preceding three months, and that five states could be used for the taxonomic variables instead of four. Dependent testing of the network including all of these changes against the original indicated major improvements in the predictions of total ammoniacal nitrogen and dissolved oxygen. The prediction of the flow condition variables was poor, but this was not unexpected given the far fewer connections to the taxonomic variables.