Analysis of Algae-Vulnerable Lakes in Utah Using R Plotting Tools to Visualize Water Quality Data

Sunayna Dasgupta and Aiswarya Rani Pappu
Department of Civil and Environmental Engineering, University of Utah

Abstract - Algae formation in a water body is a direct outcome of eutrophication. Eutrophication adversely impacts the biological, physical, chemical and aesthetic components of a water body. It usually occurs due to an increased rate of nutrient loading in the form of nitrogen and phosphorus. This study presents a comparative analysis of algae-vulnerable lakes and water bodies in the State of Utah and categorizes them based on the Trophic State Index.

Keywords: eutrophication, algae, lakes

I. INTRODUCTION

A. Problem
Algae are primarily aquatic, single or multicellular organisms containing chlorophyll. Examples of algae include diatoms, green and red algae, and primitive photosynthetic bacteria such as Cyanobacteria (blue-green algae). Algal biomass acts as one of the primary surface water quality criteria. A higher level of algal biomass in a water body can be associated with a broad range of changes in the dissolved oxygen concentration. Fluctuations in dissolved oxygen concentration will affect sensitive aquatic biota [6]. The process of algal biomass formation in a water body is commonly known as algal bloom/eutrophication. This usually occurs due to an increase in nutrient loading [1] [2] to the water body, light levels, pH and temperature. Figure 1 illustrates the ill effects of algae contamination by depicting a fish kill resulting from eutrophication and the formation of green algal biomass in a water body.

Figure 1. (a) Fish kill due to algal growth (eutrophication), and (b) Water Stargrass (Heteranthera dubia), a rooted macrophyte in the Yakima River, Washington. (USGS.)

There are basically three methods for the estimation of algal biomass:
• Computing the chlorophyll a amount (CHL a)
• Measurement of the carbon biomass as the ash-free dry mass (AFDM)
• Measurement of the particulate organic carbon (POC).

B. CHL a quantification
Detection and quantification of chlorophyll a (CHL a) has proven to be an effective way to assess the presence of algae in a water body [8]. Since algae have chlorophyll as their primary photosynthetic pigment, CHL a quantification provides useful information for measuring algal population density in a water body. Chlorophyll is the green pigment that acts as an essential component to trap sunlight and convert it to energy for metabolism.

C. Algae vulnerable lakes in Utah
According to a recent report, three of Utah’s largest public drinking water systems, tap reservoirs, and twenty rivers have developed green biota in them. The Utah Division of Water Quality released a list of algae vulnerable water bodies [3]:
• Huntington Creek*
• Provo River*
• Utah Lake
• Gunlock Reservoir
• Lost Creek Reservoir
• Pineview Reservoir*
• East Canyon Reservoir
• Steinaker Reservoir*
• Minersville Reservoir
• Yuba Reservoir
• Red Fleet Reservoir*
• Starvation Reservoir*
• Wide Hollow Reservoir
• Flaming Gorge Reservoir*
• Sand Hollow Reservoir*
• Millsite Reservoir*
• Deer Creek Reservoir*
*Denotes source of drinking water

Algal blooms have been found to cause health risks to both humans and animals. Hence regular monitoring of such water bodies is important for promoting a healthy environment.

I. OBJECTIVE
The present report discusses the collection, organization and visual analysis of water quality data collected from various counties of Utah. The collected data was analyzed using tools for R. The data was visually analyzed for chlorophyll a along a timeline of sampling dates. A database consisting of the data from various counties was created and then manipulated for analysis.

II. SOFTWARE DESCRIPTION

A. MySQL
MySQL Workbench is a unified visual tool used by database architects, developers, and DBAs [9]. It provides data modeling, an SQL development environment and comprehensive administration tools for server configuration, user administration and backup. It enables a developer to design and visualize models and to generate and manage databases.

B. Rstudio
Rstudio is a free and open source integrated development environment for R [10]. It provides an environment for data analysis. The RMySQL package was used to create a connection with an Observations Data Model (ODM) database in MySQL. The variables requiring analysis can then be called and visualized with R plotting functions such as plot().

III. DATA COLLECTION
The data required for the analysis was collected from EPA’s STORET (STOrage and RETrieval). It is a data warehouse repository for sharing water monitoring data, which includes biological and physical parameters. This data can be used by state environmental agencies, EPA and other federal agencies, universities, private citizens and others.

A. Data query and download
The STORET database can be used to query and access data on specific chemical, physical and biological attributes and parameters of water resources, as well as the methods used in their evaluation [7]. STORET data can be queried by monitoring location and by the data collected at that location.
The STORET data can be accessed by following the steps below:
• Go to the STORET main page http://www.epa.gov/storet/
• Click the download data link (Figure 2).
• Under the STORET Data Warehouse link, click on the yellow button titled "browse or download Modernized STORET data" (Figure 3).
• Under STORET/WQX Warehouse Reports-STORET Results, click the link titled Results Download (Figure 4).
• Choose the respective state, county, station type, date, activity medium, activity intent and community sampled (default) and the characteristics intended.
• For downloading the query results, note down the number of records found and narrow down the query if the download exceeds 3 gigabytes.
• Select the report types, click on the ‘appropriate user profile’ box, enter an email address, prefix the intended name, and select the data elements for the report.
• Under batch processing, click the immediate button; the data will then be sent to the provided email address.

Figure 2. EPA STORET Web interface

Figure 3. Data access tool

Figure 4. Data selection window
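Once the emailed file has been saved, the chlorophyll a records can be isolated before they are loaded into the database. The following is only a sketch of that step: the inline sample stands in for a downloaded file, its values are illustrative, and the County and Characteristic column names are hypothetical. Only the Time and Datavalue names mirror those used in the R script in the Appendix.

```r
# Inline sample standing in for a STORET result download (illustrative values,
# hypothetical layout; a real export has many more columns).
sample_csv <- "County,Characteristic,Time,Datavalue
Beaver,Chlorophyll a,07/18/2006,5.9
Beaver,Phosphorus,07/18/2006,0.03
Beaver,Chlorophyll a,06/12/2006,4.1
Cache,Chlorophyll a,08/02/2007,1.9"

raw <- read.csv(text = sample_csv, header = TRUE, stringsAsFactors = FALSE)

# Keep only the characteristic of interest.
chl <- raw[raw$Characteristic == "Chlorophyll a", ]

# Parse the sampling date and order each county's records in time.
chl$Time <- as.Date(chl$Time, format = "%m/%d/%Y")
chl <- chl[order(chl$County, chl$Time), ]

# One data frame per county, ready for plotting or dbWriteTable().
by_county <- split(chl, chl$County)
print(names(by_county))
```

Splitting by county keeps each time series separate, which matches the one-table-per-county layout used later in the report.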

B. Database creation
Data downloaded from the STORET site was organized using the Observations Data Model (ODM) in MySQL [9]. Chlorophyll a data was available only for the years 2006 and 2007. This data was sorted by time and stored in a column named data value. Data from 10 different counties were plotted using R, according to the available time series. Figure 5 shows the data values loaded into the table created for the Beaver County ODM.

Figure 5. Variables table created for Beaver County in MySQL

IV. DATA VISUALISATION USING R
The data was visualized using R Studio [10]. The R script was used to connect to the local ODM named chlorophyll a ODM. The code was written to sort and create time series plots for all counties.

V. TROPHIC STATE INDEX
The Trophic State Index (TSI) is a common way to characterize a lake’s trophic status. It indicates the overall health status of a waterbody. The quantities of chlorophyll a, phosphorus, nitrogen and other nutrients are the primary determinants that independently estimate the algal bloom density in a waterbody at a specific location. The index uses algal biomass as the basis for classifying the trophic state of a waterbody. It is a dimensionless numeric value which ranges approximately from 0 to 100. The index is simple to calculate, use and understand.

A simplified equation (given below) can be used to calculate the Trophic State Index for chlorophyll a.

$$\mathrm{TSI}(\mathrm{CHL}) = 9.81\,\ln(\mathrm{CHL}) + 30.6$$ (Carlson, R.E. et al., 1996)

Where CHL represents the average value of chlorophyll a in µg/l.

Generally, every TSI value indicates algal population densities and water system characteristics. Waterbodies with low TSI values ranging from 30-40 are generally transparent, have low algal population densities, and have adequate dissolved oxygen (DO) concentrations. Waterbodies with low to midrange TSI values from 40-50 are moderately clear and have a high chance of increased algal growth. Waterbodies with midrange TSI values between 50-70 are generally more turbid, have higher algal population densities, and also exhibit low DO levels. Waterbodies with high TSI values (70 and above) are observed to have heavy and dense algal blooms with severe DO problems [4].

VI. RESULTS
Figures 6 to 15 show the time series plots for Beaver, Cache, Davis, Duchesne, Morgan, Salt Lake, Summit, Wasatch, Wayne and Weber counties.

Figure 6. Plot showing Chlorophyll a record for Beaver County during 2006-2007.

Figure 7. Plot showing Chlorophyll a record for Cache County during 2006-2007.

Figure 8. Plot showing Chlorophyll a record for Davis County during 2006-2007.

Figure 9. Plot showing Chlorophyll a record for Duchesne County during 2006-2007.

Figure 10. Plot showing Chlorophyll a record for Morgan County during 2006-2007.

Figure 11. Plot showing Chlorophyll a record for Salt Lake County during 2006-2007.

Figure 12. Plot showing Chlorophyll a record for Summit County during 2006-2007.

Figure 13. Plot showing Chlorophyll a record for Wasatch County during 2006-2007.

Figure 14. Plot showing Chlorophyll a record for Wayne County during 2006-2007.

Figure 15. Plot showing Chlorophyll a record for Weber County during 2006-2007.

The plots show that Salt Lake County water bodies exhibited the highest recorded chlorophyll a among all counties, with a mean of 21.07 µg/l (third quartile 28.30 µg/l) and a maximum of 202.80 µg/l. Duchesne County showed the lowest recorded chlorophyll a levels among all counties.

TSI values were calculated for all the counties. The results indicate that only Salt Lake, Summit, Wasatch, and Morgan counties have a comparatively large number of data points, with TSI values of 60.788, 46.455, 48.054, and 44.988 respectively. This indicates that waterbodies in Summit, Wasatch, and Morgan counties might have moderately clear water with a high chance of increased algal growth. The TSI value of Salt Lake County is in the range of 50-70, which indicates that its waterbodies may experience more turbidity and higher algal population densities. The water may also exhibit low DO levels, especially in mid to late summer [5].

TABLE I
COMPARATIVE RESULTS OF TSI OF EACH COUNTY

County Name          Trophic State Index
Beaver County        59.428
Cache County         50.527
Davis County         56.169
Duchesne County      44.348
Morgan County        44.988
Salt Lake County     60.788
Summit County        46.455
Wasatch County       48.054
Wayne County         58.283
Weber County         54.344

VII. CONCLUSIONS

MySQL and Rstudio served as useful tools for database creation, storage of chlorophyll a data over a time series, statistical analysis and visualization of data. Database creation in MySQL was somewhat challenging because of formatting requirements, but these issues were resolved. Once the database was created and the data was stored, linking it to Rstudio and creating visualizations was comparatively easy. Due to limited field sampling for chlorophyll a in Utah, only a limited number of data points were available. The graphs generated using R and the statistical analysis showed the presence of algae-vulnerable sites in Utah, with the highest presence indicated in Salt Lake County. Their ease of use and compatibility make MySQL and R efficient tools for handling large amounts of data and conducting comparative and statistical analysis.

ACKNOWLEDGEMENTS
We would like to thank our instructors Dr. Ames, Dr. Horsburgh and Dr. Burian. We would also like to thank Carly and Erfan for helping us with the project and data collection.

REFERENCES
[1] A. Räike, O.P. Pietiläinen, S. Rekolainen, P. Kauppila, H. Pitkänen, J. Niemi, A. Raateland, and J. Vuorenmaa. "Trends of phosphorus, nitrogen and chlorophyll a concentrations in Finnish rivers and lakes in 1975–2000." Science of the Total Environment 310, no. 1 (2003): 47-59.
[2] J.R. Jones and R.W. Bachmann. "Prediction of phosphorus and chlorophyll levels in lakes." Journal (Water Pollution Control Federation) (1976): 2176-2182.
[3] E. Penrod. "It’s not just Utah Lake: Toxic algae plagues 20 waterways, including drinking water sources." The Salt Lake Tribune (2015).

[4] R.E. Carlson and J. Simpson. A Coordinator’s Guide to Volunteer Lake Monitoring Methods. North American Lake Management Society (1996). 96 pp.
[5] S. A. Pothoven and G. L. Fahnenstiel. "Recent change in summer chlorophyll a dynamics of southeastern Lake Michigan." Journal of Great Lakes Research 39, no. 2 (2013): 287-294.
[6] T. Blakey, A. M. Melesse, and C. S. Rousseaux. "Toward connecting subtropical algal blooms to freshwater nutrient sources using a long-term, spatially distributed, in situ chlorophyll-a record." Catena 133 (2015): 119-127.
[7] The STORET database, http://www.epa.gov/storet/
[8] USGS field manual on Algal Biomass Indicators. Section 7.4.
[9] https://www.mysql.com/
[10] https://www.rstudio.com/

APPENDIX

Results of Summaries:

> summary(MyDataBeaver$Datavalue)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.700   4.025   5.850  18.890  14.150 119.100
> summary(MyDataCache$Datavalue)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   1.300   1.900   7.624   5.200  33.500
> summary(MyDataDavis$Datavalue)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   0.70    5.60    9.50   13.55   16.30   41.10
> summary(MyDataDuchesne$Datavalue)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
  0.000   1.350   1.800   4.061   2.900  42.900       1
> summary(MyDataMorgan$Datavalue)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   1.475   2.300   4.335   3.875  26.500
> summary(MyDataSaltLake$Datavalue)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   0.90    4.35   11.20   21.07   28.30  202.80      13
> summary(MyDataSummit$Datavalue)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   1.800   3.350   5.034   6.925  20.700
> summary(MyDataWasatch$Datavalue)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
  0.000   2.600   4.100   5.925   7.200  34.000       4
> summary(MyDataWayne$Datavalue)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   0.60    2.60    5.80   16.81   12.65  116.20
> summary(MyDataWeber$Datavalue)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   3.10    4.60    5.90   11.25   11.80   32.50

R Script:

library(RMySQL) # will load DBI as well

d=dbDriver("MySQL")
con=dbConnect(d,user='root',password='Aiswarya43$',host='localhost')

##select database
sqlstmntdb=dbSendQuery(con,"Use chlorophyllaodm")

## list the tables in the database
sqlstmntlist=dbListTables(con)
print(sqlstmntlist)

## load a data frames into the database
##BEAVER County Chlorophyll a data
MyDataBeaver <- read.csv(file="C:/Users/Aiswarya R P/Desktop/hydro/bea_20151130_153905/Data_bea_20151130_153905_RegResults_data.csv", header=TRUE, sep=",")
MyDataBeaver

## CACHE County Chlorophyll a data
MyDataCache <- read.csv(file="C:/Users/Aiswarya R P/Desktop/hydro/cac_20151130_153620/Data_cac_20151130_153620_RegResults_data.csv", header=TRUE, sep=",")
MyDataCache

##DAVIS County Chlorophyll a data
MyDataDavis <- read.csv(file="C:/Users/Aiswarya R P/Desktop/hydro/dav_20151130_154152/Data_dav_20151130_154152_RegResults_data.csv", header=TRUE, sep=",")
MyDataDavis

##DUCHESNE County Chlorophyll a data
MyDataDuchesne <- read.csv(file="C:/Users/Aiswarya R P/Desktop/hydro/duc_20151130_155622/Data_duc_20151130_155622_RegResults_data.csv", header=TRUE, sep=",")
MyDataDuchesne

## MORGAN County Chlorophyll a data
MyDataMorgan <- read.csv(file="C:/Users/Aiswarya R P/Desktop/hydro/mor_20151130_164713/Data_mor_20151130_164713_RegResults_data.csv", header=TRUE, sep=",")
MyDataMorgan

##SALTLAKE County Chlorophyll a data
MyDataSaltLake <- read.csv(file="C:/Users/Aiswarya R P/Desktop/hydro/SLC_20151130_154459/Data_SLC_20151130_154459_RegResults_data.csv", header=TRUE, sep=",")
MyDataSaltLake

##SUMMIT County Chlorophyll a data
MyDataSummit <- read.csv(file="C:/Users/Aiswarya R P/Desktop/hydro/sum_20151130_164216/Data_sum_20151130_164216_RegResults_data.csv", header=TRUE, sep=",")
MyDataSummit

##WASATCH County Chlorophyll a data
MyDataWasatch <- read.csv(file="C:/Users/Aiswarya R P/Desktop/hydro/was_20151130_154840/Data_was_20151130_154840_RegResults_data.csv", header=TRUE, sep=",")
MyDataWasatch

##WAYNE County Chlorophyll a data
MyDataWayne <- read.csv(file="C:/Users/Aiswarya R P/Desktop/hydro/way_20151130_155146/Data_way_20151130_155146_RegResults_data.csv", header=TRUE, sep=",")
MyDataWayne

## WEBER County Chlorophyll a data
MyDataWeber <- read.csv(file="C:/Users/Aiswarya R P/Desktop/hydro/web_20151130_155349/Data_web_20151130_155349_RegResults_data.csv", header=TRUE, sep=",")
MyDataWeber

##Write tables for all Chlorophyll a records
dbWriteTable(con,"BeaverCounty",MyDataBeaver)
dbWriteTable(con,"CacheCounty",MyDataCache)
dbWriteTable(con,"DavisCounty",MyDataDavis)
dbWriteTable(con,"DuchesneCounty",MyDataDuchesne)
dbWriteTable(con,"MorganCounty",MyDataMorgan)
dbWriteTable(con,"SaltLakeCounty",MyDataSaltLake)
dbWriteTable(con,"SummitCounty",MyDataSummit)
dbWriteTable(con,"WasatchCounty",MyDataWasatch)
dbWriteTable(con,"wayneCounty",MyDataWayne)
dbWriteTable(con,"WeberCounty",MyDataWeber)

##List all the tables
dbListTables(con)

## get the whole table
dbReadTable(con, "BeaverCounty")
dbReadTable(con, "CacheCounty")
dbReadTable(con, "DavisCounty")
dbReadTable(con, "DuchesneCounty")
dbReadTable(con, "MorganCounty")
dbReadTable(con, "SaltLakeCounty")
dbReadTable(con, "SummitCounty")
dbReadTable(con, "WasatchCounty")
dbReadTable(con, "wayneCounty")
dbReadTable(con, "WeberCounty")

#Make plots for Chlorophyll A data
Dates=MyDataBeaver$Time
Dateformat=as.Date(Dates, format = "%m/%d/%Y")
plot(Dateformat,MyDataBeaver$Datavalue,xlab='Time(yrs)',ylab='ChlorophyllA(ug/l)',type="b",main='Chlorophyll A record for Beaver County')

Dates=MyDataCache$Time
Dateformat=as.Date(Dates, format = "%m/%d/%Y")
plot(Dateformat,MyDataCache$Datavalue,xlab='Time(yrs)',ylab='ChlorophyllA(ug/l)',type="b",main='Chlorophyll A record for Cache County')

Dates=MyDataDavis$Time
Dateformat=as.Date(Dates, format = "%m/%d/%Y")
plot(Dateformat,MyDataDavis$Datavalue,xlab='Time(yrs)',ylab='ChlorophyllA(ug/l)',type="b",main='Chlorophyll A record for Davis County')

Dates=MyDataDuchesne$Time
Dateformat=as.Date(Dates, format = "%m/%d/%Y")
plot(Dateformat,MyDataDuchesne$Datavalue,xlab='Time(yrs)',ylab='ChlorophyllA(ug/l)',type="b",main='Chlorophyll A record for Duchesne County')

Dates=MyDataMorgan$Time
Dateformat=as.Date(Dates, format = "%m/%d/%Y")
plot(Dateformat,MyDataMorgan$Datavalue,xlab='Time(yrs)',ylab='ChlorophyllA(ug/l)',type="b",main='Chlorophyll A record for Morgan County')

Dates=MyDataSaltLake$Time
Dateformat=as.Date(Dates, format = "%m/%d/%Y")
plot(Dateformat,MyDataSaltLake$Datavalue,xlab='Time(yrs)',ylab='ChlorophyllA(ug/l)',type="b",main='Chlorophyll A record for SaltLake County')

Dates=MyDataSummit$Time
Dateformat=as.Date(Dates, format = "%m/%d/%Y")
plot(Dateformat,MyDataSummit$Datavalue,xlab='Time(yrs)',ylab='ChlorophyllA(ug/l)',type="b",main='Chlorophyll A record for Summit County')

Dates=MyDataWasatch$Time
Dateformat=as.Date(Dates, format = "%m/%d/%Y")
plot(Dateformat,MyDataWasatch$Datavalue,xlab='Time(yrs)',ylab='ChlorophyllA(ug/l)',type="b",main='Chlorophyll A record for Wasatch County')

Dates=MyDataWayne$Time
Dateformat=as.Date(Dates, format = "%m/%d/%Y")
plot(Dateformat,MyDataWayne$Datavalue,xlab='Time(yrs)',ylab='ChlorophyllA(ug/l)',type="b",main='Chlorophyll A record for Wayne County')

Dates=MyDataWeber$Time
Dateformat=as.Date(Dates, format = "%m/%d/%Y")
plot(Dateformat,MyDataWeber$Datavalue,xlab='Time(yrs)',ylab='ChlorophyllA(ug/l)',type="b",main='Chlorophyll A record for Weber County')

summary(MyDataBeaver$Datavalue)
summary(MyDataCache$Datavalue)
summary(MyDataDavis$Datavalue)
summary(MyDataDuchesne$Datavalue)
summary(MyDataMorgan$Datavalue)
summary(MyDataSaltLake$Datavalue)
summary(MyDataSummit$Datavalue)
summary(MyDataWasatch$Datavalue)
summary(MyDataWayne$Datavalue)
summary(MyDataWeber$Datavalue)
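The script in this appendix stops at the summary statistics and does not show how the TSI values in Table I were obtained. A minimal sketch of that step follows, assuming Carlson's chlorophyll relation TSI = 9.81 ln(CHL) + 30.6 and the class bands described in Section V. The function names tsi_chl and tsi_class are hypothetical, the handling of values that fall exactly on a band boundary is an assumption, and the county means are copied from the summaries in this appendix.

```r
# Chlorophyll-based trophic state index (Carlson-type relation, assumed here).
# chl: mean chlorophyll a concentration in ug/l.
tsi_chl <- function(chl) {
  9.81 * log(chl) + 30.6   # log() is the natural logarithm in R
}

# Map TSI values onto the bands described in Section V.
# Intervals are closed on the left, e.g. [40, 50); this is an assumption.
tsi_class <- function(tsi) {
  cut(tsi,
      breaks = c(-Inf, 40, 50, 70, Inf),
      labels = c("below 40", "40-50", "50-70", "70 and above"),
      right = FALSE)
}

# Mean chlorophyll a (ug/l) per county, copied from the summaries.
county_means <- c(Beaver = 18.89, Cache = 7.624, Davis = 13.55,
                  Duchesne = 4.061, Morgan = 4.335, SaltLake = 21.07,
                  Summit = 5.034, Wasatch = 5.925, Wayne = 16.81,
                  Weber = 11.25)

tsi <- tsi_chl(county_means)
tsi_table <- data.frame(County  = names(county_means),
                        MeanCHL = unname(county_means),
                        TSI     = round(unname(tsi), 3),
                        Class   = tsi_class(tsi))
print(tsi_table)
```

With these means the sketch agrees with Table I to roughly two decimal places for nine of the ten counties. For Salt Lake County the summary mean of 21.07 µg/l gives about 60.5 rather than the tabulated 60.788; the text does not explain that difference.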