
Bioinformatics Approaches for Metagenomics Data


BIOINFORMATICS APPROACHES FOR METAGENOMICS DATA ANALYSIS
Adi Doron-Faigenboim
Plant Sciences, Vegetable and Field Crops
ARO, The Volcani Center, Israel
Rishon LeZion 7528809

Metagenomics
o "Metagenomics is the study of the collective genomes of all microorganisms from an environmental sample"
o Community
o Environmental
o Ecological

DNA & microbial profiling
Traditional microbiology relies on isolation and culture of microorganisms
o Cumbersome and labour-intensive process
o Fails to account for the diversity of microbial life
o Great plate-count anomaly

Staley, J. T., and A. Konopka. 1985. Measurement of in situ activities of nonphotosynthetic microorganisms in aquatic and terrestrial habitats. Annu. Rev. Microbiol. 39:321-346

Why environmental sequencing?
Estimated 1000 trillion tons of bacterial/archaeal life on Earth
o Only a small proportion of organisms have been grown in culture
o Species do not live in isolation
o Clonal cultures fail to represent the natural environment of a given organism
o Many proteins and protein functions remain undiscovered

[Figure labels: Rhizobiome, Pollutant, Non-human, Human sites]

The revolution in sequencing technologies
High-throughput technologies promote the accumulation of enormous volumes of genomic and metagenomic data.

HiSeq MiSeq

Next-Generation Sequencing: A Review of Technologies and Tools for Wound Microbiome Research. Brendan P. Hodkinson and Elizabeth A. Grice. Adv Wound Care (New Rochelle). 2015

Experimental Approaches
Community composition
◦ Microbiome (16S rRNA gene, 18S, ITS, etc.)
Community composition and functional potential
◦ Metagenomics
Functional genetic response
◦ Metatranscriptomics

16S vs. Shotgun Metagenomics
o 16S – targeted sequencing of a single gene
◦ Marker for identification
◦ Well established
◦ Cheap
◦ Amplifies only what you want
o Shotgun sequencing – sequence all the DNA
◦ No primer bias
◦ Can identify all microbes
◦ Function information

16S rRNA sequencing

• 16S rRNA forms part of bacterial ribosomes.

• Contains regions of highly conserved and highly variable sequence.

• Variable sequence can be thought of as a molecular “fingerprint” that can be used to identify bacterial genera and species.

• Large public databases available for comparison.
– Ribosomal Database Project (RDP) currently contains >1.5 million rRNA sequences.

• Conserved regions can be targeted to amplify broad range of bacteria from environmental samples.

• Not quantitative due to copy number variation

Erlandsen S L et al. J Histochem Cytochem 2005;53:917-927

16S rRNA gene sequencing
o Pros
◦ Well established
◦ Sequencing costs are relatively cheap (~50,000 reads/sample)
◦ Only amplifies what you want (no host contamination)
o Cons
◦ Primer choice can bias results towards certain organisms
◦ Usually not enough resolution to identify to the strain level
◦ Usually need different primers for archaea & eukaryotes (18S)
◦ Cannot identify viruses
◦ No direct functional profiling

Sequences to OTUs
o Operational Taxonomic Unit (OTU)
  An arbitrary definition of a taxonomic unit based on sequence divergence
o Composition-based binning
− GC content
− Di/Tri/Tetra/... nucleotide composition (k-mer-based frequency comparison)
− Codon usage statistics
o Similarity-based binning
− Direct comparison of OTU sequence to a reference database
− Identity cut-off varies depending on resolution required
  Genus - 90%, Family - 80%, Species - 97%

[Figure labels: Sample 1, Sample 2]

OTU present 50:50 in both samples

MEGAN Blast against NCBI database
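The composition features listed under composition-based binning (GC content and k-mer frequencies) can be illustrated with a short Python sketch. This is only a minimal illustration, not the method of any of the tools named in these slides: the function names, the default k = 4 and the Euclidean distance between profiles are all assumptions made for the example.

```python
from collections import Counter
from itertools import product


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def kmer_frequencies(seq: str, k: int = 4) -> dict:
    """Relative frequency of every k-mer over the ACGT alphabet."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Only k-mers made of A, C, G, T are counted; windows containing N are ignored.
    total = sum(counts[kmer] for kmer in kmers)
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in kmers}


def composition_distance(a: str, b: str, k: int = 4) -> float:
    """Euclidean distance between two k-mer frequency profiles (assumed metric)."""
    fa, fb = kmer_frequencies(a, k), kmer_frequencies(b, k)
    return sum((fa[m] - fb[m]) ** 2 for m in fa) ** 0.5
```

Sequences with similar profiles (small distance, similar GC content) would be candidates for the same bin; real sequences need to be long enough for the k-mer frequencies to be stable.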

Clustering of OTUs based on sequence similarity

Software for binning
o Composition-based binning
  o TETRA - Maximal-Order Markov Model
  o PhyloPythia – Support Vector Machine
  o Seeded Growing Self-Organising Maps (S-GSOM)
  o TETRA + Codon based usage
o Similarity-based binning
  o Requires that most sequences in a sample are present in a primary or secondary reference database
  o QIIME
  o MEGAN (comparison against Blast NCBI NR)
  o Mothur (RDP)
  o CARMA (comparison against PFAM)
  o ARB (linked with Silva database)

[Figure labels: Sequences, Databases]

Measuring diversity of OTUs
Two primary measures for sequence based studies:

• Alpha diversity
  − What is there? How much is there?
  − Diversity within a sample

• Beta diversity
  − How similar are two samples?
  − Diversity between samples

Alpha diversity –

C Huttenhower et al. Nature 486, 207-214 (2012) doi:10.1038/nature11234

Alpha diversity
o Species count in the sample
  o What is a species?
  o OTUs
  o Misses the level of evolutionary diversity
o Phylogenetic diversity (PD)
  o Sum of the branch lengths covered by a sample
  o Misses the distribution of the species

Alpha diversity
o Simpson’s diversity index (also Shannon, Chao indexes)
  o Gives less weight to the rarest species

S is the number of species
N is the total number of organisms
n_i is the number of organisms of species i
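The formula that went with these definitions is not preserved in the slide text. As a hedged illustration, the sketch below assumes one commonly used finite-sample form of Simpson's index, D = sum(n_i (n_i - 1)) / (N (N - 1)), and the Shannon index H = -sum(p_i ln p_i) with p_i = n_i / N. The function names are made up for this example, and the slide may have used a different variant (for instance 1 - D).

```python
import math


def simpson_index(counts):
    """Simpson's index D = sum(n_i * (n_i - 1)) / (N * (N - 1)).

    counts holds n_i, the number of organisms of each species.
    D is the probability that two organisms drawn without replacement
    belong to the same species, so lower D means higher diversity.
    """
    n_total = sum(counts)
    if n_total < 2:
        raise ValueError("need at least two organisms")
    return sum(n * (n - 1) for n in counts) / (n_total * (n_total - 1))


def shannon_index(counts):
    """Shannon index H = -sum(p_i * ln(p_i)), with p_i = n_i / N."""
    n_total = sum(counts)
    return -sum((n / n_total) * math.log(n / n_total) for n in counts if n > 0)
```

Because each term in D grows roughly with the square of n_i, rare species contribute little to it, which matches the remark that the index gives less weight to the rarest species.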

Whittaker, R.H. (1972). "Evolution and measurement of species diversity". Taxon (International Association for Plant Taxonomy (IAPT)) 21 (2/3): 213–251

Beta diversity – human microbiome

C Huttenhower et al. Nature 486, 207-214 (2012) doi:10.1038/nature11234

Beta diversity
o Diversity between samples
o UniFrac distance
o Phylogenetic-based beta diversity
o Percentage of observed branch length unique to either sample
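The last bullet, the percentage of observed branch length unique to either sample, can be sketched in a few lines of Python. This is a simplified illustration of the unweighted form only: the tree is assumed to be given as a flat list of (branch length, set of leaf names below that branch) pairs, which is a representation chosen for this example and not the input format of any particular UniFrac implementation.

```python
def unweighted_unifrac(branches, sample_a, sample_b):
    """Fraction of observed branch length that leads to only one of two samples.

    branches: iterable of (length, tips), where tips is the set of leaf
              names found below that branch (assumed representation).
    sample_a, sample_b: sets of leaf names observed in each sample.
    """
    unique = 0.0
    shared = 0.0
    for length, tips in branches:
        in_a = bool(tips & sample_a)
        in_b = bool(tips & sample_b)
        if in_a and in_b:
            shared += length      # branch leads to both samples
        elif in_a or in_b:
            unique += length      # branch leads to exactly one sample
        # branches leading to neither sample are not observed and are skipped
    observed = unique + shared
    return unique / observed if observed else 0.0


# Toy tree ((A:1,B:1):1,(C:1,D:1):1) written as branch/tip-set pairs.
toy_branches = [
    (1.0, {"A"}), (1.0, {"B"}), (1.0, {"A", "B"}),
    (1.0, {"C"}), (1.0, {"D"}), (1.0, {"C", "D"}),
]
```

Two samples drawn from entirely different clades give a distance of 1, and identical samples give 0.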

Lozupone and Knight, 2005. UniFrac: A new phylogenetic method for comparing microbial communities. Appl Environ Microbiol 71:8228

Other useful data representations
Simple bar charts - what species are present?

Rarefaction curves - How much of a community have we sampled?
[Figure: rarefaction curve; x-axis: number of sequences, y-axis: number of OTUs]
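A rarefaction curve of this kind can be approximated by repeatedly subsampling the reads and counting the distinct OTUs seen at each depth. The sketch below is one simple way to do that, assuming each read has already been assigned an OTU label; the function name, the number of repeats and the fixed random seed are illustrative choices.

```python
import random


def rarefaction_curve(otu_labels, depths, n_repeats=10, seed=0):
    """Mean number of distinct OTUs seen when subsampling reads.

    otu_labels: list with one OTU label per sequence.
    depths: subsample sizes (number of sequences) to evaluate.
    Returns a list of (depth, mean number of OTUs) pairs.
    """
    rng = random.Random(seed)
    curve = []
    for depth in depths:
        if depth > len(otu_labels):
            raise ValueError("depth exceeds number of sequences")
        # Sample without replacement and count the distinct OTUs observed.
        observed = [len(set(rng.sample(otu_labels, depth)))
                    for _ in range(n_repeats)]
        curve.append((depth, sum(observed) / n_repeats))
    return curve
```

A curve that flattens as the number of sequences grows suggests that most of the OTUs in the community have been sampled; a curve that keeps rising suggests that deeper sequencing would reveal more.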

Adapted from Wooley et al. A Primer on Metagenomics, PLoS Computational Biology, Feb 2010, Vol 6(2)

Shotgun whole metagenome
o Unlike 16S, metagenomic sequencing is not targeted to a specific gene; it takes an unbiased sample of the entire genomic DNA.
o Typically, shorter sequence reads are used to obtain >5 Gb of data per sample.
o HiSeq or NextSeq platforms are typically more cost effective for metagenomic sequencing.

Shotgun metagenomics
o Pros
◦ No primer bias
◦ Can identify all microbes (e.g. eukaryotes, viruses)
◦ Direct functional profiling
o Cons
◦ More expensive (millions of sequences needed)
◦ Host/site contamination can be significant
◦ May not be able to sequence “rare” microbes
◦ Required computational resources can be restrictive
◦ More complex bioinformatic analyses required
◦ [...], unknown function

Sequence coverage
[Figure labels: Complexity, Diversity & Coverage]

Estimating coverage in metagenomic data sets and why it matters. Luis M Rodriguez-R and Konstantinos T Konstantinidis. ISME J. 2014

Metagenomics' assembly

Metagenomic Assembly: Overview, Challenges and Applications. Yale J Biol Med. 2016 Sep; 89(3): 353–362

Metagenomics' assembly
o Greedy assembler:
  o Reads with maximum overlaps are iteratively merged into contigs
o Overlap-Layout-Consensus:
  o A graph is constructed by finding overlaps between all pairs of reads
o de Bruijn graph:
  o Reads are chopped into short overlapping segments (k-mers)
  o K-mers are organized in a de Bruijn graph based on their co-occurrence across reads
  o The graph is simplified to remove artifacts due to sequencing errors
  o Branch-less paths are reported as contigs

de Bruijn graph approach
o Low abundance genomes may end up fragmented if overall sequencing depth is insufficient to form connections in the graph
  o Using a short k-mer size
o The assembler must strike a balance between recovering low abundance genomes and obtaining long, accurate contigs for high abundance genomes
o Computational time and memory may be insufficient to complete such assemblies
  o Multiple k-mer approach
  o Spread memory load over a cluster of computers

Metagenome assembly tools
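Before turning to specific tools, the de Bruijn graph steps described above (chop reads into k-mers, link them in a graph, report branch-less paths as contigs) can be sketched in Python. This is a toy illustration under strong assumptions: reads are error-free, only one strand is considered, and no graph simplification is performed. The function names are hypothetical and do not correspond to any assembler.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Nodes are (k-1)-mers; every k-mer in a read adds an edge prefix -> suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branchless_contigs(graph):
    """Report maximal non-branching paths as contigs.

    Isolated cycles (where every node has one way in and one way out)
    are not reported by this simplified sketch.
    """
    indegree = defaultdict(int)
    for successors in graph.values():
        for succ in successors:
            indegree[succ] += 1
    nodes = set(graph) | set(indegree)

    def is_simple(node):
        # Exactly one incoming and one outgoing edge: interior of a path.
        return indegree[node] == 1 and len(graph.get(node, ())) == 1

    contigs = []
    for node in sorted(nodes):
        if is_simple(node):
            continue  # paths only start at branch points or path ends
        for succ in sorted(graph.get(node, ())):
            path = node + succ[-1]
            while is_simple(succ):
                succ = next(iter(graph[succ]))
                path += succ[-1]
            contigs.append(path)
    return contigs
```

For example, the overlapping reads `ATGGCGT` and `GGCGTGC` with k = 4 form a single unbranched path, so they are merged into one contig. A shorter k makes it easier for low-coverage reads to connect but creates more branches in repeated regions, which is the trade-off noted in the bullets above.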

Comparing and Evaluating Metagenome Assembly Tools from a Microbiologist’s Perspective - Not Only Size Matters! John Vollmers, Sandra Wiegand, Anne-Kristin Kaster

What we do with the assembly
o Characterizing the contigs/scaffolds
  o Mapping statistics
  o Compositions (%GC, codon usage)
  o Annotation - taxonomy & function assignments
o Binning
o Comparative
o Metabolic pathways

Binning over read mapping
o Partition the metagenome to species
  o Read coverage (multiple samples)
  o Compositions

            sample1  sample2  sample3  GC%
scaffold1   27       7        60       34
scaffold2   29       6        61       33
scaffold3   5        21       20       51
scaffold4   7        20       22       50
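Using the values from this table, the idea of grouping scaffolds by their coverage across samples and their GC% can be sketched as a small greedy clustering. This is only an illustration: the distance threshold of 10, the unscaled Euclidean distance over (coverage, GC%) and the greedy strategy are assumptions made for this example, and dedicated binning tools use more elaborate models.

```python
import math

# Coverage in sample1..sample3 followed by GC%, taken from the table above.
scaffolds = {
    "scaffold1": (27, 7, 60, 34),
    "scaffold2": (29, 6, 61, 33),
    "scaffold3": (5, 21, 20, 51),
    "scaffold4": (7, 20, 22, 50),
}


def greedy_bins(features, max_distance=10.0):
    """Assign each scaffold to the first bin whose first member is close enough.

    max_distance is a hypothetical threshold chosen for this toy data.
    """
    bins = []
    for name, vector in features.items():
        for members in bins:
            representative = features[members[0]]
            if math.dist(vector, representative) <= max_distance:
                members.append(name)
                break
        else:
            bins.append([name])  # no existing bin is close enough
    return bins


print(greedy_bins(scaffolds))
```

On the table's values, scaffold1 and scaffold2 end up in one bin and scaffold3 and scaffold4 in another, reflecting their similar coverage patterns and GC%.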

Metagenomic Assembly: Overview, Challenges and Applications. Yale J Biol Med. 2016 Sep; 89(3): 353–362

Binning over read mapping
[Figure: coverage in sample1, sample2 and sample3 and GC% plotted for scaffold1 to scaffold4]

Binning contigs
o Completely automated approach
  o CONCOCT
  o GroopM
  o MetaBAT
o Completeness of metagenome assembled genomes (MAGs)
  o Single-copy core genes (tRNA synthetases, ribosomal proteins)

Gene annotations
o Finds bacterial genes in the contigs/scaffolds
◦ Prodigal
◦ Prokka
o Annotation of the genes
◦ By similarity searches (DIAMOND)
◦ Domain finding
o Comparisons
◦ Gene family
◦ Distribution among the samples (CD-HIT)

Functional potential
- The annotations suggest the functional potential of the community
- Not sure about the biological activity (genes may not be transcribed and translated)

Common functional databases
o NCBI
o COG
  o Well known but original classification (not updated since 2003)
o PFAM
  o Focused more on protein domains based on hidden Markov models
o KEGG
  o Very popular, each entry is well annotated, and often linked into “Modules” or “Pathways”
  o Full access now requires a license fee
o MetaCyc
  o Similar to KEGG, but more microbe focused
o UniRef
  o Has clustering at different levels (e.g. UniRef100, UniRef90, UniRef50)
  o Most comprehensive and is constantly updated
  o These gene families are typically less functionally informative

Metagenomic annotation system
Web-based
◦ EBI
◦ MG-RAST
GUI-based
◦ MEGAN
Local-based
◦ Kraken
◦ MetAMOS

Post-processing analysis
o Data matrices of samples versus microbial features
  o Species
  o Genes
  o Pathways
o Unsupervised methods
  o Clustering and correlations
  o PCA
o Statistically different between sample types
  o Taxa or functional genes

A Review of Tools for Bio-Prospecting from Metagenomic Sequence Data. Front. Genet., 06 March 2017

Case study: the microbiome of fruit peel

Shiri Freilich, Maria Vetcos, Edoardo Piombo, Shlomit Medina, Samir Droby, Michael Wisniewski

Sequencing output: files in FASTQ format

Format: FASTQ
Read length: 150
Total of 472 million quality reads
Total of 71 Gbp

Assembly: MEGAHIT

Format: FASTA
Total number of contigs/contigs > 2k: 4,000,000/200,000
Average contig length: 820/4,600 bp
N50: 980/5,000 bp
Total #bp: 3 Gbp/1 Gbp

Sample  #raw reads   #clean reads  %clean reads  #PE          %mapping vs. filtered set
A1      26,692,151   22,638,404    84.81         45,276,808   75.59
A2      32,550,741   27,819,952    85.47         55,639,904   69.84
A3      24,083,541   20,677,583    85.86         41,355,166   82.77
C1W     29,722,008   25,416,861    85.52         50,833,722   78.32
C2W     24,125,961   20,451,024    84.77         40,902,048   76.01
C3W     24,956,733   21,353,952    85.56         42,707,904   87.48
M1      26,211,005   21,974,866    83.84         43,949,732   66.52
M2      5,640,819    4,765,939     84.49         9,531,878    62.97
M3      6,113,051    5,137,683     84.04         10,275,366   57.24
O1S     23,760,866   19,848,045    83.53         39,696,090   57.85
O2S     28,317,777   23,141,736    81.72         46,283,472   57.22
O3S     28,604,975   22,679,029    79.28         45,358,058   64.43
Total   280,779,628  235,905,074   84.02         471,810,148
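The N50 values quoted for the assembly can be computed from a list of contig lengths. The sketch below uses the usual definition (the length L such that contigs of length at least L contain half of all assembled bases); the function name is illustrative and the example lengths in the test are made up, not taken from this assembly.

```python
def n50(lengths):
    """Contig length at which the running total, longest first, reaches half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contigs given")
```

Because N50 is driven by the longest contigs, removing short contigs raises it, which is consistent with the higher N50 reported for the contigs > 2k set than for the full contig set.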

                            Full contig set   Contig > 2K
Total number of sequences   3,762,133         206,575
Total number of bps         3,085,995,440     945,480,334
Average sequence length     820.27            4,576.93
N50                         979               4,926

Gene calling: Prodigal

Format: FASTA
Total number of contigs > 2k bp: 200,000

Format: FASTA
Total number of genes: 1,000,000

From sequence to gene: summary

Raw data
o 4 treatments x 3 repeats = 12 libraries
o ~45 million reads per library
o Total of ~472 million quality reads

Genomic assembly (pooled data)
o ~200,000 contigs with N50 of ~5,000 bp
o With 60% of reads mapped

Gene calling
o ~1,000,000 genes

Annotations /gene
o Functional and taxonomic annotations

JGI annotation platform

Annotation in MEGAN based on DIAMOND similarity search

1,000,000 genes

DIAMOND similarity search
o Detection of homologs in NCBI NR for 75% of genes
o Condensation into DAA binary format

MEGAN annotation platform
o Input: DAA file
o Taxonomy - output files: TaxonPath, Taxon ID, etc.
o KEGG - output files: KEGGPath, KEGGName, etc.
o SEED - output files: SEEDPath, SEEDName, etc.

Taxonomic annotations
Krona chart: dynamic representation

MEGAN file - Taxonomy ID
assigned_Krona_All.html

Annotations of most genes on the same contig are consistent

Functional annotations

SEED

KEGG

Annotation statistics

              Genes     Assigned genes  Fraction assigned  % genes assigned
Taxa          759,353   570,702         0.75               75
Interpro2go   759,353   367,789         0.48               48
Eggnog        759,353   255,892         0.34               34
KEGG*         759,353   187,842         0.25               25

* from seed 2015 mapping file

Count data

The count data are presented as a table which reports, for each sample, the number of sequence fragments that have been assigned to each gene.

PCA & correlations
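As a rough illustration of how such a count table feeds into sample correlations, the sketch below normalises per-sample gene counts to counts per million and computes the Pearson correlation between two samples. The gene names, sample names and counts are invented toy values, not data from this study, and counts-per-million is just one possible normalisation assumed for the example.

```python
import math

# Hypothetical count table: sample -> fragments assigned to gene_1..gene_4.
toy_counts = {
    "sampleX": [120, 30, 0, 50],
    "sampleY": [240, 60, 0, 100],
    "sampleZ": [10, 200, 90, 0],
}


def counts_per_million(counts):
    """Scale raw counts so that each sample sums to one million."""
    total = sum(counts)
    return [c * 1e6 / total for c in counts]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


normalised = {name: counts_per_million(c) for name, c in toy_counts.items()}
```

In this toy table sampleX and sampleY have the same relative gene abundances, so their correlation after normalisation is 1, while sampleZ has a different profile. A matrix of such pairwise values, or a PCA on the normalised table, is what groups samples by treatment.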

[Figure labels: US conventional, Israel organic, Israel conventional]

Columns: compounds_contig_conventional, compounds_contig_organic, compounds_gene_conventional, compounds_gene_organic

                                                          contig        contig   gene          gene
Name                                                      conventional  organic  conventional  organic
Cutin, suberine and wax biosynthesis                      0             5        0             6
Biosynthesis of alkaloids derived from shikimate pathway  0             5        0             4
Drug metabolism - cytochrome P450                         0             10       0             9
Glycerophospholipid metabolism                            5             0        5             0
Tyrosine metabolism                                       2             6        2             6
Bisphenol degradation                                     0             4        0             4
Penicillin and cephalosporin biosynthesis                 2             4        2             4
Chlorocyclohexane and chlorobenzene degradation           0             6        0             5
Steroid hormone biosynthesis                              10            1        10            1
Inflammatory mediator regulation of TRP channels          3             1        3             0
Isoquinoline alkaloid biosynthesis                        0             6        0             6
Arachidonic acid metabolism                               17            0        17            0
Aminobenzoate degradation                                 0             7        0             7
Retinol metabolism                                        0             6        0             6
Flavonoid biosynthesis                                    8             0        8             0
Flavone and flavonol biosynthesis                         7             1        6             1
Fluorobenzoate degradation                                11            0        11            0
Anthocyanin biosynthesis                                  12            0        12            0
Betalain biosynthesis                                     8             0        8             0
Steroid biosynthesis                                      12            0        12            0
Polycyclic aromatic hydrocarbon degradation               0             21       0             21
Porphyrin and chlorophyll metabolism                      14            0        14            0
Amino and nucleotide sugar metabolism                     0             9        0             9
Biosynthesis of plant secondary metabolites               4             2        4             1
Biosynthesis of type II polyketide products               5             0        5             0
Ubiquinone and other terpenoid-quinone biosynthesis       1             10       1             10
Linoleic acid metabolism                                  5             0        5             0
Biosynthesis of 12-, 14- and 16-membered macrolides       21            4        21            4
Glycine, serine and threonine metabolism                  4             1        4             1

Differential abundance of [...] in the KEGG metabolic pathway

Thank you