Bioinformatics Approaches for Metagenomics Data

BIOINFORMATICS APPROACHES FOR METAGENOMICS DATA ANALYSIS
Adi Doron-Faigenboim
Plant Sciences, Vegetable and Field Crops, ARO, The Volcani Center, Rishon LeZion 7528809, Israel

Metagenomics
o "Metagenomics is the study of the collective genomes of all microorganisms from an environmental sample"
o Community
o Environmental
o Ecological

DNA sequencing & microbial profiling
Traditional microbiology relies on isolation and culture of bacteria
o Cumbersome and labour-intensive process
o Fails to account for the diversity of microbial life
o Great plate-count anomaly
Staley, J. T., and A. Konopka. 1985. Measurements of in situ activities of nonphotosynthetic microorganisms in aquatic and terrestrial habitats. Annu. Rev. Microbiol. 39:321-346

Why environmental sequencing?
Estimated 1000 trillion tons of bacterial/archaeal life on Earth
o Only a small proportion of organisms have been grown in culture
o Species do not live in isolation
o Clonal cultures fail to represent the natural environment of a given organism
o Many proteins and protein functions remain undiscovered

Why environmental sequencing?
o Rhizobiome
o Pollutant
o Non-human microbiomes
o Human microbiome sites

The revolution in sequencing technologies
High-throughput technologies promote the accumulation of enormous volumes of genomic and metagenomic data.
[Figure: HiSeq and MiSeq platforms]
Brendan P. Hodkinson and Elizabeth A. Grice. Next-Generation Sequencing: A Review of Technologies and Tools for Wound Microbiome Research. Adv Wound Care (New Rochelle). 2015

Experimental approaches
Community composition
◦ Microbiome (16S rRNA gene, 18S, ITS, etc.)
Community composition and functional potential
◦ Metagenomics
Functional genetic response
◦ Metatranscriptomics

16S vs. shotgun metagenomics
o 16S: targeted sequencing of a single gene
  ◦ Marker for identification
  ◦ Well established
  ◦ Cheap
  ◦ Amplifies only what you want
o Shotgun sequencing: sequence all the DNA
  ◦ No primer bias
  ◦ Can identify all microbes
  ◦ Functional information

16S rRNA sequencing
• 16S rRNA forms part of bacterial ribosomes.
• Contains regions of highly conserved and highly variable sequence.
• The variable sequence can be thought of as a molecular "fingerprint" that can be used to identify bacterial genera and species.
• Large public databases are available for comparison.
  − Ribosomal Database Project (RDP) currently contains >1.5 million rRNA sequences.
• Conserved regions can be targeted to amplify a broad range of bacteria from environmental samples.
• Not quantitative due to copy number variation
Erlandsen S L et al. J Histochem Cytochem 2005;53:917-927

16S rRNA gene sequencing
o Pros
  ◦ Well established
  ◦ Sequencing costs are relatively cheap (~50,000 reads/sample)
  ◦ Only amplifies what you want (no host contamination)
o Cons
  ◦ Primer choice can bias results towards certain organisms
  ◦ Usually not enough resolution to identify to the strain level
  ◦ Different primers are usually needed for archaea & eukaryotes (18S)
  ◦ Cannot identify viruses
  ◦ No direct functional profiling

Binning sequences to OTUs
o Operational Taxonomic Unit (OTU): an arbitrary definition of a taxonomic unit based on sequence divergence
o Composition-based binning
  − GC content
  − Di/tri/tetra/... nucleotide composition (k-mer-based frequency comparison)
  − Codon usage statistics
o Similarity-based binning
  − Direct comparison of OTU sequence to a reference database
  − Identity cut-off varies depending on resolution required: Species - 97%, Genus - 90%, Family - 80%
[Figure: Sample 1 and Sample 2, OTU present 50:50 in both samples; MEGAN, Blast against NCBI database; clustering of OTUs based on sequence similarity]

Software for binning
o Composition-based binning
  o TETRA - Maximal-Order Markov Model
  o PhyloPythia - Support Vector Machine
  o Seeded Growing Self-Organising Maps (S-GSOM)
  o TETRA + codon usage
o Similarity-based binning
  o Requires that most sequences in a sample are present in a primary or secondary reference database
  o QIIME
  o MEGAN (comparison against Blast NCBI NR)
  o Mothur (RDP)
  o CARMA (comparison against PFAM)
  o ARB (linked with Silva database)

Sequence databases

Measuring diversity of OTUs
Two primary measures for sequence-based studies:
• Alpha diversity
  − What is there? How much is there?
  − Diversity within a sample
• Beta diversity
  − How similar are two samples?
  − Diversity between samples

Alpha diversity - human microbiome
C Huttenhower et al. Nature 486, 207-214 (2012) doi:10.1038/nature11234

Alpha diversity
o Species count in the sample
  o What is a species?
  o OTUs
  o Misses the level of evolutionary diversity
o Phylogenetic diversity (PD)
  o Sum of the branch lengths covered by a sample
  o Misses the distribution of the species

Alpha diversity
o Simpson's diversity index (also Shannon, Chao indexes)
  o Gives less weight to the rarest species
  S is the number of species
  N is the total number of organisms
  n_i is the number of organisms of species i
Whittaker, R.H. (1972). "Evolution and measurement of species diversity". Taxon (International Association for Plant Taxonomy (IAPT)) 21 (2/3): 213–251

Beta diversity - human microbiome
C Huttenhower et al. Nature 486, 207-214 (2012) doi:10.1038/nature11234

Beta diversity
o Diversity between samples
o UniFrac distance
o Phylogenetic-based beta diversity
o Percentage of observed branch length unique to either sample
Lozupone and Knight, 2005. UniFrac: A new phylogenetic method for comparing microbial communities. Appl Environ Microbiol 71:8228

Other useful data representations
Simple bar charts - what species are present?

Other useful data representations
Rarefaction curves - how much of a community have we sampled?
[Plot: number of OTUs (y-axis) against number of sequences (x-axis)]
Adapted from Wooley et al. A Primer on Metagenomics, PLoS Computational Biology, Feb 2010, Vol 6(2)

Shotgun whole metagenome
o Unlike 16S, metagenomic sequencing is not targeted to a specific gene, but takes an unbiased sample of the entire genomic DNA.
o Typically, shorter sequence reads are used to obtain >5 Gb of data per sample.
o HiSeq or NextSeq platforms are typically more cost-effective for metagenomic sequencing

Shotgun metagenomics
o Pros
  ◦ No primer bias
  ◦ Can identify all microbes (e.g. eukaryotes, viruses)
  ◦ Direct functional profiling
o Cons
  ◦ More expensive (millions of sequences needed)
  ◦ Host/site contamination can be significant
  ◦ May not be able to sequence "rare" microbes
  ◦ Required computational resources can be restrictive
  ◦ More complex bioinformatic analyses required
  ◦ Chimeras, unknown function

Sequence coverage
[Figure: complexity, diversity & coverage]
Luis M Rodriguez-R and Konstantinos T Konstantinidis. Estimating coverage in metagenomic data sets and why it matters. ISME J. 2014

Metagenomic assembly
Metagenomic Assembly: Overview, Challenges and Applications. Yale J Biol Med. 2016 Sep; 89(3): 353–362

Metagenomic assembly
o Greedy assembler:
  o Reads with maximum overlaps are iteratively merged into contigs
o Overlap-Layout-Consensus:
  o A graph is constructed by finding overlaps between all pairs of reads
o de Bruijn graph:
  o Reads are chopped into short overlapping segments (k-mers)
  o K-mers are organized in a de Bruijn graph based on their co-occurrence across reads
  o The graph is simplified to remove artifacts due to sequencing errors
  o Branch-less paths are reported as contigs

de Bruijn graph approach
o Low-abundance genomes may end up fragmented if overall sequencing depth is insufficient to form connections in the graph
  o Using a short k-mer size
o The assembler must strike a balance between recovering low-abundance genomes and obtaining long, accurate contigs for high-abundance genomes
o Computational time and memory may be insufficient to complete such assemblies
o Multiple k-mer approach
o Spread the memory load over a cluster of computers

Metagenome assembly tools
John Vollmers, Sandra Wiegand, Anne-Kristin Kaster. Comparing and Evaluating Metagenome Assembly Tools from a Microbiologist's Perspective - Not Only Size Matters!

What we do with the assembly
o Characterizing the contigs/scaffolds
  o Mapping statistics
  o Composition (%GC, codon usage)
  o Annotation - taxonomy & function assignments
o Binning
o Comparative genomics
o Metabolic pathways

Binning over read mapping
o Partition the metagenome into species
  o Read coverage (multiple samples)
  o Composition

             sample 1   sample 2   sample 3   GC%
  scaffold1      27          7         60      34
  scaffold2      29          6         61      33
  scaffold3       5         21         20      51
  scaffold4       7         20         22      50

Metagenomic Assembly: Overview, Challenges and Applications. Yale J Biol Med. 2016 Sep; 89(3): 353–362

Binning over read mapping
[Chart: the values from the table above plotted per scaffold (scaffold1 to scaffold4) across sample1, sample2, sample3 and GC]

Binning contigs
o Completely automated approaches
  o CONCOCT
  o GroopM
  o MetaBAT
o Completeness of metagenome-assembled genomes (MAGs)
  o Single-copy core genes (tRNA synthetases, ribosomal proteins)

Gene annotation
o Find bacterial genes in the contigs/scaffolds
  ◦ Prodigal
  ◦ Prokka
o Annotation of the genes
  ◦ By homology searches (DIAMOND)
  ◦ Domain finding
o Comparisons
  ◦ Gene family
  ◦ Distribution among the samples (CD-HIT)

Functional potential
- The annotations suggest the functional potential of the community
- Biological activity is not certain (genes may not be transcribed and translated)

Common functional databases
o NCBI
o COG
  o Well known but original classification (not updated since 2003)
o PFAM
  o Focused more on protein domains, based on hidden Markov models
o KEGG
  o Very popular, each entry is well annotated, and often linked into "Modules" or "Pathways"
  o Full access now requires a license fee
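
The coverage and GC% table from the "Binning over read mapping" slides can be turned into a small worked example. The sketch below is only an illustration and is not the method used by CONCOCT, GroopM or MetaBAT: it greedily groups scaffolds whose feature vectors (coverage in three samples plus GC%) lie within a Euclidean distance threshold of an existing bin member. The function name `bin_scaffolds` and the threshold of 10 are assumptions chosen for this example.

```python
from math import sqrt

# Read coverage in sample 1-3 and GC% per scaffold, copied from the slide's table.
scaffolds = {
    "scaffold1": (27, 7, 60, 34),
    "scaffold2": (29, 6, 61, 33),
    "scaffold3": (5, 21, 20, 51),
    "scaffold4": (7, 20, 22, 50),
}


def distance(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def bin_scaffolds(features, max_distance):
    """Greedy grouping (hypothetical helper): each scaffold joins the first
    bin that already holds a member within max_distance, otherwise it
    starts a new bin. Bins are never merged afterwards."""
    bins = []
    for name, vec in features.items():
        for members in bins:
            if any(distance(vec, features[m]) <= max_distance for m in members):
                members.append(name)
                break
        else:
            bins.append([name])
    return bins


# The threshold is an arbitrary value picked for this toy table.
bins = bin_scaffolds(scaffolds, max_distance=10.0)
print(bins)
```

With the slide's values, scaffold1 and scaffold2 end up in one bin and scaffold3 and scaffold4 in another, which matches the pattern visible in the table. Real binners use many more features (for example tetranucleotide frequencies) and proper clustering models.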

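The alpha-diversity slides define S, N and n_i for Simpson's index. As a hedged illustration, the sketch below assumes one common form of the index, D = sum over i of n_i(n_i - 1) / (N(N - 1)), and the Shannon index H = -sum over i of p_i ln(p_i) with p_i = n_i / N. The slides may have used a different variant (for example 1 - D), and the function names are hypothetical.

```python
from math import log


def simpson_index(counts):
    """Simpson's D (assumed form): probability that two organisms drawn
    without replacement belong to the same species or OTU."""
    n_total = sum(counts)
    if n_total < 2:
        raise ValueError("need at least two organisms")
    return sum(n * (n - 1) for n in counts) / (n_total * (n_total - 1))


def shannon_index(counts):
    """Shannon index H using natural logarithms; zero counts are skipped."""
    n_total = sum(counts)
    return -sum((n / n_total) * log(n / n_total) for n in counts if n > 0)


# Toy community: two OTUs present 50:50, as in the binning slide's figure.
even_sample = [50, 50]
print(simpson_index(even_sample), shannon_index(even_sample))
```

A lower D (or a higher 1 - D) and a higher H both indicate a more diverse sample; a sample containing a single OTU gives D = 1 and H = 0.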