Advancing Taxonomic Annotation and Profiling of Microbial Communities Using Marker Genes
Research Collection
Doctoral Thesis

Author(s): Milanese, Alessio
Publication Date: 2020
Permanent Link: https://doi.org/10.3929/ethz-b-000479809
Rights / License: In Copyright - Non-Commercial Use Permitted

DISS. ETH NO. 27137

Advancing taxonomic annotation and profiling of microbial communities using marker genes

A thesis submitted to attain the degree of DOCTOR OF SCIENCE of ETH ZURICH (Dr. sc. ETH Zurich)

Presented by ALESSIO MILANESE
Laurea magistrale in Bioinformatica e biotecnologie mediche, Università di Verona, Verona, Italy
Born on April 12, 1990
Citizen of Italy

Accepted on the recommendation of
Prof. Dr. Shinichi Sunagawa, examiner
Prof. Dr. Julia Vorholt, co-examiner
Dr. Georg Zeller, co-examiner

2021

Content

Abbreviations
Summary
Sommario
Chapter 1: General introduction
    Modern analysis of microbial communities
    Phylogenetic analyses with marker genes
    Taxonomy analysis of microbial survey data
    Benchmarking of taxonomic analysis tools
    Thesis objectives
Chapter 2: Microbial abundance, activity, and population genomic profiling with mOTUs2
Chapter 3: Machine-learning based hierarchical taxonomic classification of Prokaryotic genes and genomes using STAG
Chapter 4: Additional related projects
Chapter 5: General conclusions and perspective
References
Acknowledgments
Curriculum vitae

Abbreviations

16S rRNA    16S ribosomal RNA gene
ANI         Average Nucleotide Identity
ASVs        Amplicon Sequence Variants
BMI         Body Mass Index
CAMI        Critical Assessment of Metagenome Interpretation
CCR         Continuous Calorie Restriction
CTR         Control
DDH         DNA-DNA hybridization
FDR         False Discovery Rate
ICR         Intermittent Calorie Restriction
ITS         Internal transcribed spacer
LCA         Lowest common ancestor
MAG         Metagenome-assembled genome
mOTU        Marker gene-based OTU
MG          Marker Gene
MGC         Marker Gene Cluster
NGS         Next Generation DNA Sequencing
NR          Non-Redundant
OTU         Operational taxonomic unit
SAT         Subcutaneous adipose tissue
specI       Species identification tool
SSU         Small Subunit of the rRNA
VAT         Visceral adipose tissue
WMS         Whole metagenome sequencing

Summary

Microorganisms are found everywhere on Earth and play a prominent role in many natural processes, such as nitrogen fixation and carbon storage for nutrient cycling, that are essential for all other living organisms. Their importance is also increasingly being appreciated in the context of human health, with human-associated microbes being differentially abundant in many diseases. While analyses of microbes in communities were historically restricted to isolated organisms, modern methods allow us to examine an entire community as a whole by sequencing DNA fragments directly from an environmental sample without a prior culturing step. To obtain a census of the organisms present in a sample, there are currently two well-established techniques: 16S ribosomal RNA gene (16S for short) profiling, in which part of this gene is amplified and sequenced; and whole-metagenome sequencing (WMS), where all the DNA from the sample is indiscriminately sequenced. The fundamental bioinformatics task is then to identify which taxa are present in the original microbial community, a process called taxonomic profiling.

Taxonomic profiling of microbial communities is an inherently difficult computational problem and, although many tools have been developed in this area, the results are still a biased representation of the actual communities, as many studies have shown. Almost all bioinformatics tools for taxonomic profiling of metagenomic samples rely on reference genomes, which are currently unavailable for a substantial fraction of microbial species.
As a consequence, the inferred taxonomic profiles remain incomplete and relative abundances biased. In Chapter 2, I present mOTUs2, a tool based on 10 universal phylogenetic marker genes that allow profiling of both known and unknown species. The resulting taxonomic profiles are more complete and relative abundances more accurate compared to other methods, in particular in less-studied environments. We additionally show that marker genes used by mOTUs2 are essential housekeeping genes, and as such demonstrably well-suited for quantification of basal transcriptional activity of community members. Furthermore, single nucleotide variation profiles estimated using mOTUs2 reflect those from whole genomes, which opens up the possibility of comparing microbial strain populations in a computationally efficient manner.

While mOTUs2 is a tool to quantify microbial species in WMS data, it does not accurately place them into existing taxonomies. For species represented by reference genomes, lineage information is obtained from NCBI; for species without such a representation, only a rudimentary procedure is employed to approximately infer their lineages. However, accurate solutions to this taxonomic assignment problem are urgently needed to help make sense of the ever-growing volume of (meta-)genomic sequences (in the form of amplicons, genes, or genomes), many of them from novel organisms. To address this problem, I introduce STAG (in Chapter 3), a hierarchical taxonomic classifier for marker gene sequences. The central idea is to formalize the problem of taxonomic annotation as a hierarchical series of binary classification tasks along the taxonomic tree. The features used by these classifiers are extracted from a multiple sequence alignment of gene families with well-resolved taxonomies. As a result, STAG is able to learn which informative positions accurately discriminate clades.
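
The hierarchical formulation described above can be illustrated with a small Python sketch. It is a toy under stated assumptions, not STAG's implementation: the perceptron is a stand-in for whatever binary classifier is actually used (this summary does not specify one), the one-hot encoding of alignment columns is one plausible choice of features, and all function names, sequences, and lineages are hypothetical. At each node of the taxonomy, one classifier per child clade is trained against its sibling clades, and a query sequence is classified by descending the tree.

```python
from collections import defaultdict

ALPHABET = "ACGT-"  # assumed symbols of an aligned nucleotide sequence


def one_hot(aligned_seq):
    """Encode an aligned sequence as a 0/1 vector, one slot per (column, symbol)."""
    vec = []
    for ch in aligned_seq:
        vec.extend(1.0 if ch == sym else 0.0 for sym in ALPHABET)
    return vec


def train_binary(positives, negatives, epochs=20):
    """Perceptron separating one clade from its siblings (stand-in classifier)."""
    w, b = [0.0] * len(positives[0]), 0.0
    data = [(x, 1) for x in positives] + [(x, -1) for x in negatives]
    for _ in range(epochs):
        for x, y in data:
            s = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * s <= 0:  # misclassified: move the boundary toward the example
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return w, b


def score(model, x):
    """Signed score of a feature vector under a trained (weights, bias) model."""
    w, b = model
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_hierarchy(training):
    """Train one binary classifier per child clade at every node of the taxonomy.

    training: list of (aligned_sequence, lineage_tuple) pairs.
    Returns {parent_lineage: {child_name: model or None}}.
    """
    groups = defaultdict(lambda: defaultdict(list))
    for seq, lineage in training:
        x = one_hot(seq)
        for depth in range(len(lineage)):
            groups[lineage[:depth]][lineage[depth]].append(x)
    models = {}
    for parent, children in groups.items():
        models[parent] = {}
        for child, pos in children.items():
            neg = [x for other, xs in children.items() if other != child for x in xs]
            # A clade without siblings needs no decision, so no model is trained.
            models[parent][child] = train_binary(pos, neg) if neg else None
    return models


def classify(models, aligned_seq):
    """Descend the tree, picking the highest-scoring child clade at each level."""
    x = one_hot(aligned_seq)
    lineage = ()
    while lineage in models:
        options = models[lineage]
        best = max(
            options,
            key=lambda c: float("inf") if options[c] is None else score(options[c], x),
        )
        lineage += (best,)
    return lineage


# Made-up toy alignment (6 columns) with two-rank lineages.
training = [
    ("ACGTAC", ("Bacteria", "Firmicutes")),
    ("ACGTAA", ("Bacteria", "Firmicutes")),
    ("ACGGCC", ("Bacteria", "Proteobacteria")),
    ("ACGGCA", ("Bacteria", "Proteobacteria")),
    ("TTGTAC", ("Archaea", "Euryarchaeota")),
    ("TTGTAA", ("Archaea", "Euryarchaeota")),
]
models = train_hierarchy(training)
print(classify(models, "ACGGCT"))  # ('Bacteria', 'Proteobacteria')
```

In this toy, the learned weights concentrate on the alignment columns that differ between sibling clades (columns 1 and 2 at the root, columns 4 and 5 within Bacteria), which mirrors the idea that the classifiers learn which positions are informative for discriminating clades.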
STAG shows high accuracy across taxonomic ranks, outperforming other tools on simulated assignment problems for 16S gene sequences. Additionally, STAG can be applied to genome sequences by leveraging a concatenation of 40 universal phylogenetic marker genes. In these evaluations, STAG also exhibits high precision and recall on isolate genomes and facilitates deeper taxonomic annotation for MAGs from human- and cow rumen-associated microbiomes. Owing to its versatility, STAG can be readily trained and applied to the taxonomic annotation of 16S amplicon profiles, which, when annotated with STAG, are significantly more similar to corresponding WMS profiles than those annotated with other state-of-the-art tools.

Taxonomic assignment and taxonomic profiling are two essential steps in the study of any microbial community. They provide the basis for further analyses seeking to identify human pathogens, to associate microbes with disease, and to understand the interplay of microbial communities with the host or the biogeochemical environment. In summary, by providing novel computational methodology and showcasing its potential, my work presented in this thesis advances the state of the art in the taxonomic analysis of microbial communities.

Sommario

Microorganisms are found everywhere on Earth and play a fundamental role in many natural processes that are essential for all other living organisms, such as nitrogen fixation and carbon storage for nutrient cycling. Their importance for human health has been increasingly recognized, especially given that the abundance of microbes in the human body varies when disease is present.
In the past, analyses of microorganisms were limited to isolated organisms, whereas modern methods allow us to examine an entire community as a whole by sequencing DNA fragments directly from an environmental sample without a prior culturing step. To obtain a census of the organisms present in a sample, there are currently two established techniques: 16S ribosomal RNA gene (16S for short) profiling, in which part of this gene is amplified and sequenced; and whole-metagenome sequencing (WMS), in which all the DNA in the sample is sequenced indiscriminately. The fundamental bioinformatics task is then to identify which taxa are present in the original microbial community, a process called taxonomic profiling.

Profiling microbial communities is an inherently difficult computational problem and, although many tools have been developed in this area, the results are still only a partial representation of the actual communities, as many studies have shown. Almost all bioinformatics tools for profiling metagenomic samples rely on reference genomes, which are currently unavailable for many species. As a consequence, the inferred taxonomic profiles remain incomplete and relative abundances are biased. In Chapter 2, I present mOTUs2, a tool based on 10 genes with phylogenetic properties that allow the profiling of known and unknown species. The resulting taxonomic profiles are more complete and relative abundances more accurate than those of other methods, particularly in less-studied environments. I also show that the genes used by mOTUs2 are essential genes because they have housekeeping functions and, as such, are demonstrably suited for quantifying the basal transcriptional activity of community members.
Furthermore, single nucleotide variation profiles estimated using mOTUs2 reflect those from whole genomes,