Machine Learning Classification of Microbial Community Compositions to Predict Anthropogenic Pollutants in the Baltic Sea
Total Page:16
File Type:pdf, Size:1020Kb
Machine learning classification of microbial community compositions to predict anthropogenic pollutants in the Baltic Sea Kumulative Dissertation zur Erlangung des akademischen Grades Doctor rerum naturalium (Dr. rer. nat.) der Mathematisch-Naturwissenschaftlichen Fakultät der Universität Rostock vorgelegt von René Janßen, geb. am 12.9.1986 in Goch aus Rostock Rostock, 18.09.2020 https://doi.org/10.18453/rosdok_id00002897 Dieses Werk ist lizenziert unter einer Creative Commons Namensnennung - Weitergabe unter gleichen Bedingungen 4.0 International Lizenz. Gutachter: Prof. Dr. Matthias Labrenz, Leibniz-Institut für Ostseeforschung Warnemünde, Sektion Biologische Meereskunde Prof. Dr. Rudolf Amann, Bremen, Max-Planck-Institut für Marine Mikrobiologie, Abteilung Molekulare Ökologie Prof. Dr. Alexander Probst, Universität Duisburg-Essen, Fachbereich Umweltmikrobiologie und Biotechnologie Ass.-Prof. Stephen Techtmann, PhD, Houghton, Michigan Technological University, Department of Biological Sciences Jahr der Einreichung: 2020 Jahr der Verteidigung: 2020 Table of contents Summary .............................................................................................................................. 1 Zusammenfassung .............................................................................................................. 3 General introduction ........................................................................................................... 6 Formation of Earth and life ................................................................................................. 6 Bacterial lifestyles and strategies ....................................................................................... 7 Reactions of microbial communities towards changing environments ................................ 7 Accessing microbial communities via next generation sequencing ..................................... 9 Microbiological data sets have specific characteristics ..................................................10 Machine learning to classify microbial community compositions .......................................11 Shallow and deep machine learning algorithms .............................................................13 Variable importance and interpretable models ...............................................................17 Clustering, classification and regression tasks for models .............................................17 Pollution of the Baltic Sea .................................................................................................18 Monitoring the environmental state of the Baltic Sea .........................................................19 Description of research aims ............................................................................................20 Conducted experiments and analyses ..............................................................................22 Summary of published papers ..........................................................................................23 General discussion ............................................................................................................24 Machine learning algorithms utilized non-i.i.d. sequencing data for accurate predictions ..24 Variable importance is a precondition to determine indicative microbial fingerprints ......27 Shotgun sequencing requires careful experimental design to support analyses ............30 Disturbed communities displayed resistance and resilience ..........................................31 Random Forest is preferable to Artificial Neural Networks .............................................32 Phyloseq2ML: an R package facilitates machine learning with microbial communities ..37 Applying sequencing data and machine learning analysis to monitoring ...........................37 Microbial communities may inform additionally to parameter prediction .........................37 Microbial monitoring requires specific collection and storage of samples ......................41 Selecting appropriate algorithms for integration with environmental monitoring .............42 Conclusion and outlook .....................................................................................................43 Chapter I: An artificial neural network and Random Forest identify glyphosate-impacted brackish communities based on 16S rRNA amplicon MiSeq read counts .....................45 Abstract ............................................................................................................................46 1.1 Introduction ............................................................................................................47 1.2 Material and methods .............................................................................................49 1.2.1 Laboratory & sampling ....................................................................................49 1.2.2 Machine learning .............................................................................................51 1.3 Results ...................................................................................................................56 1.3.1 ANN identifies glyphosate-treated microbial communities ...............................56 1.3.2 Identification of clusters present in successful classifications by the ANN .......57 1.3.3 Exploring the limits and assembling a highly indicative selection .....................60 1.3.4 Comparing the use of 16S rRNA gene - vs. 16S rRNA-derived data ...............62 1.4 Discussion ..............................................................................................................62 1.4.1 A statistical approach to identifying decision-important clusters ......................63 1.4.2 More observations should be generated..........................................................65 1.4.3 The outcome of the ANN was confirmed by bioinformatic analysis ..................66 1.4.4 Concluding further steps in the application of ANN with NGS ..........................66 Chapter II: A glyphosate pulse to brackish long-term microcosms has a greater impact on the microbial diversity and abundance of planktonic than of biofilm assemblages 68 Abstract ............................................................................................................................69 2.1 Introduction ............................................................................................................70 2.2 Material and methods .............................................................................................71 2.2.1 Experimental setup .........................................................................................71 2.2.2 Sampling procedure ........................................................................................71 2.2.3 Determination of total cell counts.....................................................................72 2.2.4 Significance testing applied to total cell counts ................................................72 2.2.5 Nutrient analysis ..............................................................................................72 2.2.6 Glyphosate and AMPA analysis ......................................................................72 2.2.7 Nucleic acid extraction and sequencing...........................................................73 2.2.8 Bioinformatic and statistical analysis of the amplicon data ..............................74 2.2.9 Metagenomic analysis .....................................................................................75 2.2.10 Functional tree calculation ...............................................................................75 2.3 Results ...................................................................................................................75 2.3.1 Total cell counts, glyphosate and AMPA concentrations and nutrients ............75 2.3.2 16S rRNA and rRNA gene based community compositions ............................77 2.3.3 NMDS ordination .............................................................................................78 2.3.4 Alpha diversity measures ................................................................................80 2.3.5 OTUs increasing in abundance after glyphosate treatment .............................81 2.3.6 Duration of the detected signals ......................................................................82 2.3.7 Distribution of glyphosate degradation genes in metagenomic samples ..........83 2.4 Discussion ..............................................................................................................87 2.4.1 Potential impacts of glyphosate on a brackish microbial ecosystem ................87 2.4.2 Differences in the responses to glyphosate addition ........................................88 2.4.3 Glyphosate-induced changes in OTU abundance ...........................................89 2.4.4 Probability of glyphosate degradation ..............................................................89 2.5 Data availability statement ......................................................................................91 Chapter III: Machine learning predicts the presence of 2,4,6-trinitrotoluene in sediments of a Baltic Sea munitions dumpsite using microbial community compositions ............92 Abstract ............................................................................................................................94 3.1 Introduction ............................................................................................................95