Bioinformatics Approaches for Metagenomics Data
BIOINFORMATICS APPROACHES FOR METAGENOMICS DATA ANALYSIS
Adi Doron-Faigenboim
Plant Sciences, Vegetable and Field Crops, ARO, The Volcani Center, Rishon LeZion 7528809, Israel

Metagenomics
o "Metagenomics is the study of the collective genomes of all microorganisms from an environmental sample"
o Community
o Environmental
o Ecological

DNA sequencing & microbial profiling
Traditional microbiology relies on isolation and culture of bacteria
o Cumbersome and labour-intensive process
o Fails to account for the diversity of microbial life
o Great plate-count anomaly
Staley, J. T., and A. Konopka. 1985. Measurements of in situ activities of nonphotosynthetic microorganisms in aquatic and terrestrial habitats. Annu. Rev. Microbiol. 39:321-346

Why environmental sequencing?
Estimated 1000 trillion tons of bacterial/archaeal life on Earth
o Only a small proportion of organisms have been grown in culture
o Species do not live in isolation
o Clonal cultures fail to represent the natural environment of a given organism
o Many proteins and protein functions remain undiscovered
Examples:
o Rhizobiome
o Pollutant
o Non-human microbiomes
o Human microbiome sites

The revolution in sequencing technologies
High-throughput technologies promote the accumulation of enormous volumes of genomic and metagenomic data.
Platforms shown: HiSeq, MiSeq
Next-Generation Sequencing: A Review of Technologies and Tools for Wound Microbiome Research. Brendan P. Hodkinson and Elizabeth A. Grice. Adv Wound Care (New Rochelle). 2015

Experimental Approaches
o Community composition
  ◦ Microbiome (16S rRNA gene, 18S, ITS, etc.)
o Community composition and functional potential
  ◦ Metagenomics
o Functional genetic response
  ◦ Metatranscriptomics

16S vs. Shotgun Metagenomics
o 16S: targeted sequencing of a single gene
  ◦ Marker for identification
  ◦ Well established
  ◦ Cheap
  ◦ Amplifies what you want
o Shotgun sequencing: sequence all the DNA
  ◦ No primer bias
  ◦ Can identify all microbes
  ◦ Functional information

16S rRNA sequencing
• 16S rRNA forms part of bacterial ribosomes.
• It contains regions of highly conserved and highly variable sequence.
• The variable sequence can be thought of as a molecular "fingerprint" that can be used to identify bacterial genera and species.
• Large public databases are available for comparison; the Ribosomal Database Project (RDP) currently contains >1.5 million rRNA sequences.
• Conserved regions can be targeted to amplify a broad range of bacteria from environmental samples.
• Not quantitative, due to copy number variation.
Erlandsen S L et al. J Histochem Cytochem 2005;53:917-927

16S rRNA gene sequencing
o Pros
  ◦ Well established
  ◦ Sequencing costs are relatively cheap (~50,000 reads/sample)
  ◦ Only amplifies what you want (no host contamination)
o Cons
  ◦ Primer choice can bias results towards certain organisms
  ◦ Usually not enough resolution to identify to the strain level
  ◦ Different primers are usually needed for archaea & eukaryotes (18S)
  ◦ Cannot identify viruses
  ◦ No direct functional profiling

Binning sequences to OTUs
o Operational Taxonomic Unit (OTU): an arbitrary definition of a taxonomic unit based on sequence divergence
o Composition-based binning
  − GC content
  − Di/tri/tetra/... nucleotide composition (k-mer-based frequency comparison)
  − Codon usage statistics
o Similarity-based binning
  − Direct comparison of OTU sequence to a reference database
  − Identity cut-off varies depending on the resolution required: Species - 97%, Genus - 90%, Family - 80%
[Figure: Sample 1 and Sample 2, with an OTU present 50:50 in both samples; MEGAN, BLAST against the NCBI database; clustering of OTUs based on sequence similarity]

Software for binning
o Composition-based binning
  o TETRA - Maximal-Order Markov Model
  o PhyloPythia - Support Vector Machine
  o Seeded Growing Self-Organising Maps (S-GSOM)
  o TETRA + codon-based usage
o Similarity-based binning
  o Requires that most sequences in a sample are present in a primary or secondary reference database
  o QIIME
  o MEGAN (BLAST comparison against NCBI NR)
  o Mothur (RDP)
  o CARMA (comparison against Pfam)
  o ARB (linked with the SILVA database)
[Figure labels: Sequences, Databases]

Measuring diversity of OTUs
Two primary measures for sequence-based studies:
• Alpha diversity
  − What is there? How much is there?
  − Diversity within a sample
• Beta diversity
  − How similar are two samples?
  − Diversity between samples

Alpha diversity - human microbiome
C Huttenhower et al. Nature 486, 207-214 (2012) doi:10.1038/nature11234

Alpha diversity
o Species count in the sample
  o What is a species?
  o OTUs
  o Misses the level of evolutionary diversity
o Phylogenetic diversity (PD)
  o Sum of the branch lengths covered by a sample
  o Misses the distribution of the species
o Simpson's diversity index (also Shannon, Chao indices)
  o Gives less weight to the rarest species
  o S is the number of species
  o N is the total number of organisms
  o n_i is the number of organisms of species i
Whittaker, R.H. (1972). "Evolution and measurement of species diversity". Taxon (International Association for Plant Taxonomy (IAPT)) 21 (2/3): 213–251

Beta diversity - human microbiome
C Huttenhower et al. Nature 486, 207-214 (2012) doi:10.1038/nature11234

Beta diversity
o Diversity between samples
o UniFrac distance
o Phylogeny-based beta diversity
o Percentage of observed branch length unique to either sample
Lozupone and Knight, 2005. UniFrac: a new phylogenetic method for comparing microbial communities. Appl Environ Microbiol 71:8228

Other useful data representations
o Simple bar charts: what species are present?
o Rarefaction curves: how much of a community have we sampled?
[Figure: rarefaction curve; x-axis: number of sequences, y-axis: number of OTUs]
Adapted from Wooley et al. A Primer on Metagenomics, PLoS Computational Biology, Feb 2010, Vol 6(2)

Shotgun whole metagenome
o Unlike 16S, metagenomic sequencing is not targeted to a specific gene but takes an unbiased sample of the entire genomic DNA.
o Typically, shorter sequence reads are used to obtain >5 Gb of data per sample.
o HiSeq or NextSeq platforms are typically more cost-effective for metagenomic sequencing.

Shotgun metagenomics
o Pros
  ◦ No primer bias
  ◦ Can identify all microbes (e.g. eukaryotes, viruses)
  ◦ Direct functional profiling
o Cons
  ◦ More expensive (millions of sequences needed)
  ◦ Host/site contamination can be significant
  ◦ May not be able to sequence "rare" microbes
  ◦ Required computational resources can be restrictive
  ◦ More complex bioinformatic analyses required
  ◦ Chimeras, unknown functions

Sequence coverage
Complexity, Diversity & Coverage
Estimating coverage in metagenomic data sets and why it matters. Luis M Rodriguez-R and Konstantinos T Konstantinidis. ISME J. 2014

Metagenomics assembly
Metagenomic Assembly: Overview, Challenges and Applications. Yale J Biol Med. 2016 Sep; 89(3): 353–362
o Greedy assembler:
  o Reads with maximum overlaps are iteratively merged into contigs
o Overlap-Layout-Consensus:
  o A graph is constructed by finding overlaps between all pairs of reads
o de Bruijn graph:
  o Reads are chopped into short overlapping segments (k-mers)
  o K-mers are organized in a de Bruijn graph based on their co-occurrence across reads
  o The graph is simplified to remove artifacts due to sequencing errors
  o Branch-less paths are reported as contigs

de Bruijn graph approach
o Low-abundance genomes may end up fragmented if overall sequencing depth is insufficient to form connections in the graph
  o Using a short k-mer size
o The assembler must strike a balance between recovering low-abundance genomes and obtaining long, accurate contigs for high-abundance genomes
o Computational time and memory may be insufficient to complete such assemblies
  o Multiple k-mer approach
  o Spread the memory load over a cluster of computers

Metagenome assembly tools
Comparing and Evaluating Metagenome Assembly Tools from a Microbiologist's Perspective - Not Only Size Matters! John Vollmers, Sandra Wiegand, Anne-Kristin Kaster

What we do with the assembly
o Characterizing the contigs/scaffolds
  o Mapping statistics
  o Composition (%GC, codon usage)
  o Annotation - taxonomy & function assignments
o Binning
o Comparative genomics
o Metabolic pathways

Binning over read mapping
o Partition the metagenome into species
  o Read coverage (multiple samples)
  o Composition

             sample 1   sample 2   sample 3   GC%
  scaffold1  27         7          60         34
  scaffold2  29         6          61         33
  scaffold3  5          21         20         51
  scaffold4  7          20         22         50

[Figure: chart of the four scaffolds plotted across sample1, sample2, sample3 and GC]
Metagenomic Assembly: Overview, Challenges and Applications. Yale J Biol Med. 2016 Sep; 89(3): 353–362

Binning contigs
o Completely automated approach
  o CONCOCT
  o GroopM
  o MetaBAT
o Completeness of metagenome-assembled genomes (MAGs)
  o Single-copy core genes (tRNA synthetases, ribosomal proteins)

Gene annotation
o Find bacterial genes in the contigs/scaffolds
  ◦ Prodigal
  ◦ Prokka
o Annotation of the genes
  ◦ By homology searches (DIAMOND)
  ◦ Domain finding
o Comparisons
  ◦ Gene family
  ◦ Distribution among the samples (CD-HIT)

Functional potential
o The annotations suggest the functional potential of the community
o The actual biological activity is not certain (genes may not be transcribed and translated)

Common functional databases
o NCBI
o COG
  o Well known, but original classification (not updated since 2003)
o PFAM
  o Focused more on protein domains, based on hidden Markov models
o KEGG
  o Very popular; each entry is well annotated and often linked into "Modules" or "Pathways"
  o Full access now requires a license fee
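
Worked example: similarity-based OTU clustering
The "Binning sequences to OTUs" slide describes grouping sequences by an identity cut-off (97% at species level). The sketch below illustrates that idea with a simple greedy, centroid-based scheme. It is only an illustration under strong assumptions: sequences are taken to be pre-aligned and of equal length, identity is the plain fraction of matching positions, and the function names `identity` and `greedy_otus` are hypothetical. It is not the procedure used by QIIME, Mothur or any other tool named in the slides.

```python
def identity(a, b):
    """Fraction of identical positions between two pre-aligned,
    equal-length sequences (an assumption made for this sketch)."""
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otus(sequences, threshold=0.97):
    """Assign each sequence to the first OTU whose centroid it matches
    at or above the identity threshold; otherwise start a new OTU."""
    centroids = []  # the first sequence of each OTU acts as its centroid
    otus = []       # otus[i] holds the members of the OTU with centroids[i]
    for seq in sequences:
        for i, centroid in enumerate(centroids):
            if identity(seq, centroid) >= threshold:
                otus[i].append(seq)
                break
        else:
            centroids.append(seq)
            otus.append([seq])
    return otus


base = "ACGT" * 25             # 100 bp toy sequence
near = "TT" + base[2:]         # 2 mismatches, 98% identity to base
far = "T" * 10 + base[10:]     # 8 mismatches, 92% identity to base
clusters = greedy_otus([base, near, far])
```

With the toy input, `base` and `near` fall into one OTU and `far` starts a second one. The result of a greedy scheme like this depends on input order, which is one reason real pipelines sort or denoise sequences first.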
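
Worked example: alpha diversity indices
The "Alpha diversity" slides name species count, Simpson's index and the Shannon index, and define S, N and n_i, but the formula itself did not survive in this text. The sketch below uses common textbook forms, Simpson's D = sum(n_i (n_i - 1)) / (N (N - 1)) and Shannon's H = -sum(p_i ln p_i) with p_i = n_i / N. Which variant the original slide showed is an assumption, and the function names are hypothetical.

```python
import math


def observed_richness(counts):
    """S: number of species (or OTUs) with at least one organism."""
    return sum(1 for n in counts if n > 0)


def simpson_index(counts):
    """Simpson's D: probability that two organisms drawn without
    replacement belong to the same species. Higher D means lower
    diversity; 1 - D is often reported instead."""
    total = sum(counts)  # N, the total number of organisms
    if total < 2:
        raise ValueError("need at least two organisms")
    return sum(n * (n - 1) for n in counts) / (total * (total - 1))


def shannon_index(counts):
    """Shannon's H (natural log), using p_i = n_i / N."""
    total = sum(counts)
    return -sum((n / total) * math.log(n / total) for n in counts if n > 0)


even_sample = [5, 5]       # two species, equally abundant
skewed_sample = [9, 1]     # same richness, one species dominates
```

Both toy samples have the same richness (S = 2), yet the skewed sample has a higher Simpson's D and a lower Shannon H, which is the sense in which these indices account for the distribution of species and not only their count.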
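
Worked example: de Bruijn graph assembly
The assembly slides describe chopping reads into k-mers, organizing them in a de Bruijn graph and reporting branch-less paths as contigs. The sketch below follows those steps on toy reads. It is a minimal illustration, assuming error-free reads on a single strand; it omits the error-removal step, ignores isolated cycles, and uses hypothetical function names. Real assemblers are far more involved.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Nodes are (k-1)-mers; each distinct k-mer contributes one
    edge from its prefix to its suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branchless_contigs(graph):
    """Walk every maximal path whose inner nodes have exactly one
    incoming and one outgoing edge, and spell it out as a contig."""
    indegree = defaultdict(int)
    for successors in graph.values():
        for node in successors:
            indegree[node] += 1
    nodes = set(graph) | set(indegree)

    def is_simple(node):
        return indegree[node] == 1 and len(graph.get(node, ())) == 1

    contigs = []
    for node in sorted(nodes):
        if is_simple(node):
            continue  # paths start only at branching or terminal nodes
        for succ in sorted(graph.get(node, ())):
            path, current = node + succ[-1], succ
            while is_simple(current):
                current = next(iter(graph[current]))
                path += current[-1]
            contigs.append(path)
    return contigs


linear = branchless_contigs(build_de_bruijn(["ATGGCGT", "GGCGTGC"], k=4))
branched = branchless_contigs(build_de_bruijn(["AAGCT", "AAGCA"], k=3))
```

The two overlapping reads in the first call merge into a single contig, while the second call shows how a branch in the graph splits the output into shorter contigs. This is the fragmentation effect the "de Bruijn graph approach" slide warns about.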
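
Worked example: binning scaffolds by coverage and GC%
The "Binning over read mapping" slide partitions scaffolds using read coverage across samples together with composition. The sketch below applies that idea to the four scaffolds from the slide's table. It groups scaffolds whose coverage profiles are strongly correlated and whose GC% values are close. The thresholds (`min_corr`, `max_gc_diff`) and the greedy grouping are assumptions chosen for illustration; this is not how CONCOCT, GroopM or MetaBAT work internally.

```python
import math

# Coverage in sample 1..3 and GC%, copied from the table in the slides.
SCAFFOLDS = {
    "scaffold1": ([27, 7, 60], 34),
    "scaffold2": ([29, 6, 61], 33),
    "scaffold3": ([5, 21, 20], 51),
    "scaffold4": ([7, 20, 22], 50),
}


def pearson(x, y):
    """Pearson correlation of two equal-length coverage profiles."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a flat profile carries no co-variation signal
    return cov / (sd_x * sd_y)


def bin_scaffolds(scaffolds, min_corr=0.9, max_gc_diff=5.0):
    """Greedy binning: join the first bin whose representative (its
    first member) has a similar coverage profile and similar GC%."""
    bins = []
    for name, (coverage, gc) in scaffolds.items():
        for members in bins:
            rep_coverage, rep_gc = scaffolds[members[0]]
            if (pearson(coverage, rep_coverage) >= min_corr
                    and abs(gc - rep_gc) <= max_gc_diff):
                members.append(name)
                break
        else:
            bins.append([name])
    return bins


bins = bin_scaffolds(SCAFFOLDS)
```

On the slide's table this yields two bins, scaffold1 with scaffold2 and scaffold3 with scaffold4, matching the pattern visible in the coverage and GC% columns.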