Bioinformatics Approaches for Metagenomics Data
BIOINFORMATICS APPROACHES FOR METAGENOMICS DATA ANALYSIS
Adi Doron-Faigenboim
Plant Sciences, Vegetable and Field Crops
ARO, The Volcani Center, Israel
Rishon LeZion 7528809

Metagenomics
o “Metagenomics is the study of the collective genomes of all microorganisms from an environmental sample”
o Community
o Environmental
o Ecological

DNA sequencing & microbial profiling
Traditional microbiology relies on isolation and culture of bacteria
o Cumbersome and labour-intensive process
o Fails to account for the diversity of microbial life
o Great plate-count anomaly
Staley, J. T., and A. Konopka. 1985. Measurements of in situ activities of nonphotosynthetic microorganisms in aquatic and terrestrial habitats. Annu. Rev. Microbiol. 39:321-346

Why environmental sequencing?
Estimated 1000 trillion tons of bacterial/archaeal life on Earth
o Only a small proportion of organisms have been grown in culture
o Species do not live in isolation
o Clonal cultures fail to represent the natural environment of a given organism
o Many proteins and protein functions remain undiscovered

Why environmental sequencing?
[Figure: examples of sequenced environments: rhizobiome, pollutant, non-human microbiomes, human microbiome sites]

The revolution in sequencing technologies
High-throughput technologies promote the accumulation of enormous volumes of genomic and metagenomic data.
[Figure: HiSeq and MiSeq sequencing platforms]
Next-Generation Sequencing: A Review of Technologies and Tools for Wound Microbiome Research. Brendan P. Hodkinson and Elizabeth A. Grice. Adv Wound Care (New Rochelle). 2015

Experimental Approaches
Community composition
◦ Microbiome (16S rRNA gene, 18S, ITS, etc.)
Community composition and functional potential
◦ Metagenomics
Functional genetic response
◦ Metatranscriptomics

16S vs. Shotgun Metagenomics
o 16S – targeted sequencing of a single gene
◦ Marker for identification
◦ Well established
◦ Cheap
◦ Amplifies only what you want
o Shotgun sequencing – sequence all the DNA
◦ No primer bias
◦ Can identify all microbes
◦ Functional information

16S rRNA sequencing
• 16S rRNA forms part of bacterial ribosomes.
• Contains regions of highly conserved and highly variable sequence.
• The variable sequence can be thought of as a molecular “fingerprint” that can be used to identify bacterial genera and species.
• Large public databases are available for comparison, e.g. the Ribosomal Database Project (RDP), which currently contains >1.5 million rRNA sequences.
• Conserved regions can be targeted to amplify a broad range of bacteria from environmental samples.
• Not quantitative due to copy number variation
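One common way to soften the copy-number caveat is to divide each taxon's read count by its 16S gene copy number before computing relative abundances. The following is a minimal sketch of that idea only; the taxon names, read counts and copy numbers are hypothetical and not taken from the slides.

```python
def copy_number_corrected_abundance(read_counts, copy_numbers):
    """Return relative abundances after dividing read counts by 16S copy number."""
    corrected = {
        taxon: read_counts[taxon] / copy_numbers[taxon] for taxon in read_counts
    }
    total = sum(corrected.values())
    return {taxon: value / total for taxon, value in corrected.items()}


# Hypothetical example: equal read counts, but taxon_A carries 7 copies
# of the 16S gene per genome and taxon_B only 1.
reads = {"taxon_A": 700, "taxon_B": 700}
copies = {"taxon_A": 7, "taxon_B": 1}

abundance = copy_number_corrected_abundance(reads, copies)
for taxon, fraction in abundance.items():
    print(f"{taxon}: {fraction:.3f}")
```

Without the correction both taxa would appear equally abundant; with it, taxon_B dominates, which illustrates why raw 16S read counts should be read with care.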
Erlandsen S L et al. J Histochem Cytochem 2005;53:917-927

16S rRNA gene sequencing
o Pros
◦ Well established
◦ Sequencing costs are relatively low (~50,000 reads/sample)
◦ Only amplifies what you want (no host contamination)
o Cons
◦ Primer choice can bias results towards certain organisms
◦ Usually not enough resolution to identify to the strain level
◦ Different primers are usually needed for archaea & eukaryotes (18S)
◦ Cannot identify viruses
◦ No direct functional profiling

Binning sequences to OTUs
o Operational Taxonomic Unit (OTU)
An arbitrary definition of a taxonomic unit based on sequence divergence
o Composition-based binning
− GC content
− Di/Tri/Tetra/... nucleotide composition (k-mer-based frequency comparison)
− Codon usage statistics
o Similarity-based binning
− Direct comparison of OTU sequence to a reference database
− Identity cut-off varies depending on resolution required: Species - 97%, Genus - 90%, Family - 80%
[Figure: an OTU present 50:50 in Sample 1 and Sample 2; MEGAN, BLAST against the NCBI database; clustering of OTUs based on sequence similarity]

Software for binning
o Composition-based binning
  o TETRA - Maximal-Order Markov Model
  o PhyloPythia – Support Vector Machine
  o Seeded Growing Self-Organising Maps (S-GSOM)
  o TETRA + codon-based usage
o Similarity-based binning
  o Requires that most sequences in a sample are present in a primary or secondary reference database
  o QIIME
  o MEGAN (comparison against BLAST NCBI NR)
  o Mothur (RDP)
  o CARMA (comparison against PFAM)
  o ARB (linked with the Silva database)

Sequence Databases

Measuring diversity of OTUs
Two primary measures for sequence-based studies:
• Alpha diversity
− What is there? How much is there?
− Diversity within a sample
• Beta diversity
− How similar are two samples?
− Diversity between samples

Alpha diversity – human microbiome
C Huttenhower et al. Nature 486, 207-214 (2012) doi:10.1038/nature11234

Alpha diversity
o Species count in the sample
  o What is a species?
  o OTUs
  o Misses the level of evolutionary diversity
o Phylogenetic diversity (PD)
  o Sum of the branch lengths covered by a sample
  o Misses the distribution of the species

Alpha diversity
o Simpson’s diversity index (also Shannon, Chao indexes)
  o Gives less weight to the rarest species
S is the number of species
N is the total number of organisms
n_i is the number of organisms of species i
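The formula itself did not survive in the text, so the sketch below assumes one widely used form of Simpson's index, D = Σ n_i(n_i − 1) / (N(N − 1)) summed over the S species, and reports 1 − D so that larger values mean higher diversity. The Shannon index is included for comparison. The two example communities are hypothetical.

```python
import math


def simpson_index(counts):
    """Simpson's D = sum(n_i * (n_i - 1)) / (N * (N - 1)) (assumed form)."""
    n_total = sum(counts)
    return sum(n * (n - 1) for n in counts) / (n_total * (n_total - 1))


def simpson_diversity(counts):
    """1 - D: probability that two organisms drawn at random differ in species."""
    return 1.0 - simpson_index(counts)


def shannon_index(counts):
    """Shannon's H = -sum(p_i * ln(p_i)) with p_i = n_i / N."""
    n_total = sum(counts)
    return -sum((n / n_total) * math.log(n / n_total) for n in counts if n > 0)


# Hypothetical communities with S = 4 species and N = 100 organisms each.
even = [25, 25, 25, 25]
skewed = [97, 1, 1, 1]

for name, community in (("even", even), ("skewed", skewed)):
    print(name, round(simpson_diversity(community), 4), round(shannon_index(community), 4))
```

Both communities have the same species count, yet the skewed one scores much lower, which is the point of the "gives less weight to the rarest species" remark.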
Whittaker, R.H. (1972). "Evolution and measurement of species diversity". Taxon (International Association for Plant Taxonomy (IAPT)) 21 (2/3): 213–251

Beta diversity – human microbiome
C Huttenhower et al. Nature 486, 207-214 (2012) doi:10.1038/nature11234

Beta diversity
o Diversity between samples
o UniFrac distance
o Phylogeny-based beta diversity
o Percentage of observed branch length unique to either sample
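The "percentage of observed branch length unique to either sample" can be sketched as below. This is a simplified, unweighted version under stated assumptions: the tree is given as a flat list of branches, each with a length and the set of tips descending from it. The toy tree and the sample compositions are hypothetical.

```python
def unweighted_unifrac(branches, sample_a, sample_b):
    """Fraction of observed branch length that leads to tips from only one sample.

    branches: list of (branch_length, set_of_descendant_tips)
    sample_a, sample_b: sets of tip names present in each sample
    """
    unique = 0.0
    observed = 0.0
    for length, tips in branches:
        in_a = bool(tips & sample_a)
        in_b = bool(tips & sample_b)
        if in_a or in_b:
            observed += length
            if in_a != in_b:
                unique += length
    return unique / observed


# Hypothetical tree in Newick notation: ((t1:1,t2:1):2,(t3:1,t4:1):2);
branches = [
    (1.0, {"t1"}),
    (1.0, {"t2"}),
    (1.0, {"t3"}),
    (1.0, {"t4"}),
    (2.0, {"t1", "t2"}),
    (2.0, {"t3", "t4"}),
]

identical = unweighted_unifrac(branches, {"t1", "t2"}, {"t1", "t2"})
overlapping = unweighted_unifrac(branches, {"t1", "t2"}, {"t2", "t3"})
disjoint = unweighted_unifrac(branches, {"t1", "t2"}, {"t3", "t4"})
print(identical, round(overlapping, 3), disjoint)
```

Identical samples give a distance of 0, samples drawn from different clades give 1, and partial overlap falls in between.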
Lozupone and Knight, 2005. UniFrac: A new phylogenetic method for comparing microbial communities. Appl Environ Microbiol 71:8228

Other useful data representations
Simple bar charts - what species are present?
Rarefaction curves - How much of a community have we sampled?
[Figure: rarefaction curves; x-axis: number of sequences, y-axis: number of OTUs]
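A rarefaction curve can be approximated by repeatedly subsampling the reads at increasing depths and counting the distinct OTUs observed. The sketch below assumes that each read has already been assigned an OTU label; the community composition, the depths and the number of iterations are hypothetical choices.

```python
import random


def rarefaction_curve(otu_labels, depths, n_iter=20, seed=0):
    """Mean number of distinct OTUs in random subsamples of each depth."""
    rng = random.Random(seed)
    curve = []
    for depth in depths:
        observed = [
            len(set(rng.sample(otu_labels, depth))) for _ in range(n_iter)
        ]
        curve.append(sum(observed) / n_iter)
    return curve


# Hypothetical sample of 100 reads, one list element per read.
reads = ["otu1"] * 60 + ["otu2"] * 25 + ["otu3"] * 10 + ["otu4"] * 4 + ["otu5"]
depths = [1, 10, 50, 100]

curve = rarefaction_curve(reads, depths)
for depth, mean_otus in zip(depths, curve):
    print(depth, mean_otus)
```

A curve that flattens well before the full depth suggests the community has been sampled adequately; one that is still rising suggests more sequencing would reveal more OTUs.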
Figure adapted from Wooley et al. A Primer on Metagenomics, PLoS Computational Biology, Feb 2010, Vol 6(2)

Shotgun whole metagenome
o Unlike 16S, metagenomic sequencing is not targeted to a specific gene, but takes an unbiased sample of the entire genomic DNA.
o Typically, shorter sequence reads are used to obtain >5 Gb of data per sample.
o The HiSeq or NextSeq platforms are typically more cost-effective for metagenomic sequencing.

Shotgun metagenomics
o Pros
◦ No primer bias
◦ Can identify all microbes (e.g. eukaryotes, viruses)
◦ Direct functional profiling
o Cons
◦ More expensive (millions of sequences needed)
◦ Host/site contamination can be significant
◦ May not be able to sequence “rare” microbes
◦ Required computational resources can be restrictive
◦ More complex bioinformatic analyses required
◦ Chimeras, unknown functions

Sequence coverage
Complexity, Diversity & Coverage
Estimating coverage in metagenomic data sets and why it matters. Luis M Rodriguez-R and Konstantinos T Konstantinidis. ISME J. 2014

Metagenomic assembly
Metagenomic Assembly: Overview, Challenges and Applications. Yale J Biol Med. 2016 Sep; 89(3): 353–362

Metagenomic assembly
o Greedy assembler:
  o Reads with maximum overlaps are iteratively merged into contigs
o Overlap-Layout-Consensus:
  o A graph is constructed by finding overlaps between all pairs of reads
o de Bruijn graph:
  o Reads are chopped into short overlapping segments (k-mers)
  o K-mers are organized in a de Bruijn graph based on their co-occurrence across reads
  o The graph is simplified to remove artifacts due to sequencing errors
  o Branch-less paths are reported as contigs

de Bruijn graph approach
o Low-abundance genomes may end up fragmented if overall sequencing depth is insufficient to form connections in the graph
  o Using a short k-mer size
o The assembler must strike a balance between recovering low-abundance genomes and obtaining long, accurate contigs for high-abundance genomes
o Computational time and memory may be insufficient to complete such assemblies
o Multiple k-mer approach
o Spread the memory load over a cluster of computers

Metagenome assembly tools
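Before looking at specific tools, the de Bruijn graph approach described above can be made concrete with a toy example. This is a minimal sketch, not how any production assembler is implemented: it skips the error-correction and graph-simplification steps, and the reads, the k-mer size and the function names are hypothetical.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Nodes are (k-1)-mers; every k-mer in a read adds an edge prefix -> suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def extend_contig(graph, start):
    """Follow a branch-less path from `start` and spell out the contig."""
    in_degree = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            in_degree[target] += 1

    contig, node, seen = start, start, {start}
    # Continue only while the current node has exactly one outgoing edge.
    while len(graph.get(node, ())) == 1:
        nxt = next(iter(graph[node]))
        # Stop at a merge point (in-degree > 1) or when a cycle is detected.
        if in_degree[nxt] != 1 or nxt in seen:
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


# Hypothetical overlapping reads from one short sequence.
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = build_de_bruijn(reads, k=4)
contig = extend_contig(graph, "ATG")
print(contig)
```

With a low-abundance genome, some k-mers are simply never sequenced, so edges are missing and the path breaks into several short contigs; this is the fragmentation problem noted in the list above.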
Comparing and Evaluating Metagenome Assembly Tools from a Microbiologist’s Perspective - Not Only Size Matters! John Vollmers, Sandra Wiegand, Anne-Kristin Kaster

What we do with the assembly
o Characterizing the contigs/scaffolds
  o Mapping statistics
  o Composition (%GC, codon usage)
  o Annotation - taxonomy & function assignments
o Binning
o Comparative genomics
o Metabolic pathways

Binning over read mapping
o Partition the metagenome into species
  o Read coverage (multiple samples)
  o Composition
            sample 1   sample 2   sample 3   GC%
scaffold1   27         7          60         34
scaffold2   29         6          61         33
scaffold3   5          21         20         51
scaffold4   7          20         22         50
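The table suggests that scaffolds with similar coverage across samples and similar GC content belong to the same genome. A minimal sketch of that idea, using the values from the table, is shown below. The greedy distance-threshold rule and the threshold of 10 are illustrative assumptions; automated binners use considerably more elaborate models.

```python
import math

# (coverage in sample 1, sample 2, sample 3, GC%) taken from the table above.
scaffolds = {
    "scaffold1": (27, 7, 60, 34),
    "scaffold2": (29, 6, 61, 33),
    "scaffold3": (5, 21, 20, 51),
    "scaffold4": (7, 20, 22, 50),
}


def bin_scaffolds(features, max_distance):
    """Greedy binning: a scaffold joins the first bin whose seed lies within
    max_distance (Euclidean); otherwise it seeds a new bin."""
    bins = []  # list of (seed_vector, [scaffold names])
    for name, vector in features.items():
        for seed, members in bins:
            if math.dist(seed, vector) <= max_distance:
                members.append(name)
                break
        else:
            bins.append((vector, [name]))
    return [members for _, members in bins]


bins = bin_scaffolds(scaffolds, max_distance=10.0)
print(bins)
```

Scaffolds 1 and 2 end up in one bin and scaffolds 3 and 4 in another, matching what the coverage and GC columns suggest by eye.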
Metagenomic Assembly: Overview, Challenges and Applications. Yale J Biol Med. 2016 Sep; 89(3): 353–362

Binning over read mapping
[Figure: chart of the read coverage in sample1, sample2 and sample3 and of the GC% for scaffold1 to scaffold4, plotted from the table above]

Binning contigs
o Completely automated approach
  o CONCOCT
  o GroopM
  o MetaBAT
o Completeness of metagenome-assembled genomes (MAGs)
  o Single-copy core genes (tRNA synthetases, ribosomal proteins)

Gene annotation
o Find bacterial genes in the contigs/scaffolds
◦ Prodigal
◦ Prokka
o Annotation of the genes
◦ By homology searches (DIAMOND)
◦ Domain finding
o Comparisons
◦ Gene families
◦ Distribution among the samples (CD-HIT)
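Returning to the completeness check for MAGs mentioned under binning: since single-copy core genes are expected exactly once per genome, the share of markers found gives a rough completeness estimate, and extra copies hint at contamination. The sketch below is a simplification under stated assumptions; the five-gene marker set and the hits are hypothetical, and real marker sets are much larger.

```python
from collections import Counter


def completeness_and_contamination(marker_hits, marker_set):
    """Estimate completeness and contamination (both in %) of a genome bin.

    marker_hits: marker gene names detected in the bin, one entry per hit
    marker_set:  names of the expected single-copy marker genes
    """
    counts = Counter(hit for hit in marker_hits if hit in marker_set)
    completeness = 100.0 * len(counts) / len(marker_set)
    extra_copies = sum(n - 1 for n in counts.values())
    contamination = 100.0 * extra_copies / len(marker_set)
    return completeness, contamination


# Hypothetical marker set (ribosomal proteins and tRNA synthetases) and hits.
markers = {"rpsC", "rplB", "rpsJ", "alaS", "pheS"}
hits = ["rpsC", "rplB", "rplB", "alaS", "pheS"]

completeness, contamination = completeness_and_contamination(hits, markers)
print(f"completeness: {completeness:.0f}%, contamination: {contamination:.0f}%")
```

Here four of five markers are present (80% complete) and one marker appears twice (20% contamination), suggesting a bin that is both incomplete and mixed.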
Functional potential
- The annotations suggest the functional potential of the community
- They give no certainty about biological activity (genes may not be transcribed and translated)

Common functional databases
o NCBI
o COG
  o Well known, but original classification (not updated since 2003)
o PFAM
  o Focused more on protein domains, based on hidden Markov models
o KEGG
  o Very popular; each entry is well annotated and often linked into “Modules” or “Pathways”
  o Full access now requires a license fee
o MetaCyc
  o Similar to KEGG, but more microbe-focused
o UniRef
  o Has clustering at different levels (e.g. UniRef100, UniRef90, UniRef50)
  o Most comprehensive and constantly updated
  o These gene families are typically less functionally informative

Metagenomic annotation systems
Web-based
◦ EBI
◦ MG-RAST
GUI-based
◦ MEGAN
Local-based
◦ Kraken
◦ MetAMOS

Post-processing analysis
o Data matrices of samples versus microbial features
  o Species
  o Genes
  o Pathways
o Unsupervised methods
  o Clustering and correlations
  o PCA
o Statistical differences between sample types
  o Taxa or functional genes

A Review of Bioinformatics Tools for Bio-Prospecting from Metagenomic Sequence Data. Front. Genet., 06 March 2017

Case study: the microbiome of fruit peel
Shiri Freilich, Maria Vetcos, Edoardo Piombo, Shlomit Medina, Samir Droby, Michael Wisniewski

Case study: the microbiome of fruit peel
Sequencing output: files in FASTQ format
Format: FASTQ
Read length: 150
Total of 472 million quality reads
Total of 71 Gbp

Assembly: MEGAHIT
Format: FASTA
Total number of contigs/contigs > 2k: 4,000,000/200,000
Average contig length: 820/4,600 bp
N50: 980/5,000 bp
Total #bp: 3 Gbp/1 Gbp

Sample   #raw reads    #clean reads   %clean reads   #PE           %mapping vs. filtered set
A1       26,692,151    22,638,404     84.81          45,276,808    75.59
A2       32,550,741    27,819,952     85.47          55,639,904    69.84
A3       24,083,541    20,677,583     85.86          41,355,166    82.77
C1W      29,722,008    25,416,861     85.52          50,833,722    78.32
C2W      24,125,961    20,451,024     84.77          40,902,048    76.01
C3W      24,956,733    21,353,952     85.56          42,707,904    87.48
M1       26,211,005    21,974,866     83.84          43,949,732    66.52
M2       5,640,819     4,765,939      84.49          9,531,878     62.97
M3       6,113,051     5,137,683      84.04          10,275,366    57.24
O1S      23,760,866    19,848,045     83.53          39,696,090    57.85
O2S      28,317,777    23,141,736     81.72          46,283,472    57.22
O3S      28,604,975    22,679,029     79.28          45,358,058    64.43
Total    280,779,628   235,905,074    84.02          471,810,148
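The derived columns of the table can be re-computed from the raw and clean read counts. The sketch below uses three rows from the table and assumes, as the numbers suggest, that #PE is twice the clean count (two reads per pair) and that every read has the full length of 150 bases.

```python
READ_LENGTH = 150  # bases, as stated for the sequencing output

# (raw reads, clean reads) for three of the libraries in the table.
samples = {
    "A1": (26_692_151, 22_638_404),
    "M2": (5_640_819, 4_765_939),
    "O3S": (28_604_975, 22_679_029),
}


def clean_read_stats(raw, clean):
    """Return (% clean reads, number of paired-end reads)."""
    pct_clean = 100.0 * clean / raw
    paired_end_reads = 2 * clean  # assumption: #PE counts both mates of a pair
    return pct_clean, paired_end_reads


for name, (raw, clean) in samples.items():
    pct, pe = clean_read_stats(raw, clean)
    print(f"{name}: {pct:.2f}% clean, {pe:,} PE reads")

# Total sequence volume from the Total row of the table.
total_pe_reads = 471_810_148
total_gbp = total_pe_reads * READ_LENGTH / 1e9
print(f"Total: {total_gbp:.1f} Gbp")
```

The totals agree with the slide: about 472 million quality reads at 150 bases each gives roughly 71 Gbp.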
                            Full contig set   Contigs > 2K
Total number of sequences   3,762,133         206,575
Total number of bps         3,085,995,440     945,480,334
Average sequence length     820.27            4,576.93
N50                         979               4,926

Gene calling: Prodigal
Format: FASTA
Total number of contigs > 2 kbp: 200,000

Format: FASTA
Total number of genes: 1,000,000

From sequence to gene: summary
Raw data: 4 treatments X 3 repeats = 12 libraries; ~45 million reads per library; total of ~472 million quality reads
Genomic assembly: ~200,000 contigs with N50 of ~5000 bp; 60% of reads mapped
Gene calling: ~1,000,000 genes
Annotations: functional and taxonomic annotations
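The N50 quoted in the summary is the contig length at which the contigs of that length or longer together contain at least half of the assembled bases. A minimal sketch follows; the list of contig lengths is hypothetical.

```python
def n50(lengths):
    """Length L such that contigs of length >= L cover at least half the assembly."""
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    raise ValueError("n50 is undefined for an empty assembly")


# Hypothetical assembly of seven contigs, 24,000 bp in total.
contigs = [8000, 5000, 4000, 3000, 2000, 1000, 1000]
print(n50(contigs))
```

Filtering out short contigs raises the N50, which is why the set of contigs above 2 kbp reports a much higher value than the full contig set.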
Pipeline: Raw data → Genomic assembly (pooled data) → Gene calling → Genome/gene annotations

JGI annotation platform
Annotation in MEGAN based on DIAMOND similarity search
1,000,000 genes
DIAMOND similarity search against NCBI NR: homologs detected for 75% of genes
Condensation into DAA binary format
MEGAN annotation platform (input: DAA file)
o Taxonomy - output files: TaxonPath, Taxon ID, etc.
o KEGG - output files: KEGGPath, KEGGName, etc.
o SEED - output files: SEEDPath, SEEDName, etc.

Taxonomic annotations
Krona chart: dynamic representation
MEGAN file - Taxonomy ID
assigned_Krona_All.html
Annotations of most genes on the same contig are consistent

Functional annotations
[Figures: SEED and KEGG functional annotations]

Annotation statistics
              Genes     Assigned genes   Fraction assigned   % genes assigned
Taxa          759,353   570,702          0.75                75
Interpro2go   759,353   367,789          0.48                48
Eggnog        759,353   255,892          0.34                34
KEGG*         759,353   187,842          0.25                25
* from seed 2015 mapping file

Count data
The count data are presented as a table that reports, for each sample, the number of sequence fragments assigned to each gene.

PCA & correlations
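As a minimal sketch of the correlation step, the count table can be normalised for library size and the samples compared pairwise. The counts below are hypothetical, and counts-per-million scaling with Pearson correlation is only one of several reasonable choices. PCA itself would normally be run with a numerical library and is not shown here.

```python
import math


def counts_per_million(column):
    """Scale one sample's gene counts so that they sum to one million."""
    total = sum(column)
    return [1e6 * count / total for count in column]


def pearson(x, y):
    """Pearson correlation coefficient between two equally long vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Hypothetical count table: one list per sample, one entry per gene.
counts = {
    "sample1": [100, 50, 10, 0],
    "sample2": [200, 100, 20, 0],  # same profile, sequenced twice as deeply
    "sample3": [5, 10, 60, 80],
}

normalised = {name: counts_per_million(col) for name, col in counts.items()}
r12 = pearson(normalised["sample1"], normalised["sample2"])
r13 = pearson(normalised["sample1"], normalised["sample3"])
print(round(r12, 3), round(r13, 3))
```

Samples 1 and 2 differ only in sequencing depth and correlate perfectly after normalisation, while sample 3 has a different gene profile and correlates negatively with sample 1.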
[Figure: PCA and correlations of the samples; groups: US conventional, Israel organic, Israel conventional]

Columns: (1) compounds_contig_conventional, (2) compounds_contig_organic, (3) compounds_gene_conventional, (4) compounds_gene_organic

Name                                                      (1)   (2)   (3)   (4)
Cutin, suberine and wax biosynthesis                        0     5     0     6
Biosynthesis of alkaloids derived from shikimate pathway    0     5     0     4
Drug metabolism - cytochrome P450                           0    10     0     9
Glycerophospholipid metabolism                              5     0     5     0
Tyrosine metabolism                                         2     6     2     6
Bisphenol degradation                                       0     4     0     4
Penicillin and cephalosporin biosynthesis                   2     4     2     4
Chlorocyclohexane and chlorobenzene degradation             0     6     0     5
Steroid hormone biosynthesis                               10     1    10     1
Inflammatory mediator regulation of TRP channels            3     1     3     0
Isoquinoline alkaloid biosynthesis                          0     6     0     6
Arachidonic acid metabolism                                17     0    17     0
Aminobenzoate degradation                                   0     7     0     7
Retinol metabolism                                          0     6     0     6
Flavonoid biosynthesis                                      8     0     8     0
Flavone and flavonol biosynthesis                           7     1     6     1
Fluorobenzoate degradation                                 11     0    11     0
Anthocyanin biosynthesis                                   12     0    12     0
Betalain biosynthesis                                       8     0     8     0
Steroid biosynthesis                                       12     0    12     0
Polycyclic aromatic hydrocarbon degradation                 0    21     0    21
Porphyrin and chlorophyll metabolism                       14     0    14     0
Amino sugar and nucleotide sugar metabolism                 0     9     0     9
Biosynthesis of plant secondary metabolites                 4     2     4     1
Biosynthesis of type II polyketide products                 5     0     5     0
Ubiquinone and other terpenoid-quinone biosynthesis         1    10     1    10
Linoleic acid metabolism                                    5     0     5     0
Biosynthesis of 12-, 14- and 16-membered macrolides        21     4    21     4
Glycine, serine and threonine metabolism                    4     1     4     1
Differential abundance of enzymes in the KEGG metabolic pathways

Thank you