Validation of the Asaim Framework and Its Workflows on HMP Mock Community Samples

Validation of the ASaiM framework and its workflows on HMP mock community samples The ASaiM framework and its workflows have been tested and validated on two mock metagenomic data of an artificial community (with 22 known microbial strains). The datasets are available on EBI metagenomics database (project accession number: SRP004311). First we checked that the targeted abundances (based on number of PCR product) from both mock datasets were similar to the effective abundance (by mapping reads on reference genomes). Second, taxonomic and functional results produced by the ASaiM framework have been extensively analyzed and compared to expectations and to results obtained with the EBI metagenomics pipeline (S. Hunter et al. 2014). For these datasets, the ASaiM framework produces accurate and precise taxonomic assignations, different functional results (gene families, pathways, GO slim terms) and results combining taxonomic and functional information. Despite almost 1.4 million of raw metagenomic sequences, these analyses were executed in less than 6h on a commodity computer. Hence, the ASaiM framework and its workflows are proven to be relevant for the analysis of microbiota datasets. 1Data On EBI metagenomics database, two mock community samples for Human Microbiome Project (HMP) are available. Both samples contain a genomic mixture of 22 known microbial strains. Relative abundance of each strain has been targeted using the number of PCR product of their respective 16S sequences (Table 1). In first sample (SRR072232), the targeted 16S copies of the strains vary by up to four orders of magnitude between the strains (Table 1), whereas in second sample (SRR072233) the same 16S copy number is targeted for each strain (Table 1). After pooling, the DNA of the strains of both samples were sequenced using 454 GS FLX Titanium. 1,225,169 and 1,386,198 raw metagenomic sequences are then respectively obtained for the first dataset (SRR072232) and the second dataset (SRR072233). 2 Methods Both datasets have been analyzed using the ASaiM framework. The results are extensively analyzed and compared to expected results from reference genome information and to EBI metagenomics results. Details about these analyses (workflows, scripts) are available on a dedicated GitHub repository (https://github.com/ASaiM/hmp_mock_tests). 1 2.1 Abundance computation using mapping on reference genomes In both datasets, abundance of each strain is targeted based on 16S quantity to build the genomic mixture before sequencing. But these targeted abundances may be not reflect the final abundances (e.g. 16S copy number variation, sequencing bias). We have a known composition of the datasets with an expected abundances before sequencing but no information about the real abundance of the strains after sequencing. Before any analysis of EBI metagenomics and ASaiM results, we need more insights in the real abundance of the strains after sequencing. We mapped raw reads on reference genomes of expected strains using BWA 0.7.12 (Li and Durbin 2009; Li and Durbin 2010) (using default parameters). We then extract the “exact” abundances of expected strains in the metagenomic datasets, after DNA pooling and sequencing (i.e. not based on targeted rRNA operon counts in PCR). 2 Taxonomy Targeted abundances (%) Domain Kingdom Phylum Class Order Family Genus Species Strains SRR072232 SRR072233 Methanobrevibacter Archaea Archaea Euryarchaeota Methanobacteria Methanobacteriales Methanobacteriaceae Methanobrevibacter ATCC 35061 1.797 101 4.545 smithii · Actinomyces 2 Bacteria Bacteria Actinobacteria Actinobacteria Actinomycetales Actinomycetaceae Actinomyces ATCC 17982 1.797 10≠ 4.545 odontolyticus · Propionibacterium 1 Propionibacteriaceae Propionibacterium DSM 16379 1.797 10≠ 4.545 acnes · Bacteroides vul- 2 Bacteroidetes Bacteroidia Bacteroidales Bacteroidaceae Bacteroides ATCC 8482 1.797 10≠ 4.545 gatus · Deinococcus- Deinococcus ra- 2 Deinococci Deinococcales Deinococcaceae Deinococcus DSM 20539 1.797 10≠ 4.545 Thermus diodurans · Bacillus cereus Firmicutes Bacilli Bacillales Bacillaceae Bacillus ATCC 10987 1.797 4.545 thuringiensis Listeria monocy- 1 Listeriaceae Listeria ATCC BAA-679 1.797 10≠ 4.545 togenes · Staphylococcus ATCC BAA- Staphylococcaceae Staphylococcus 1.797 4.545 aureus 1718 Staphylococcus ATCC 12228 1.797 101 4.545 epidermidis · Enterococcus 2 Lactobacillales Enterococcaceae Enterococcus ATCC 47077 1.797 10≠ 4.545 faecalis · Lactobacillus 2 Lactobacillaceae Lactobacillus DSM 20243 1.797 10≠ 4.545 gasseri · Streptococcus Streptococcaceae Streptococcus ATCC BAA-611 1.797 4.545 agalactiae Streptococcus ATCC 700610 1.797 101 4.545 mutans · Streptococcus 2 mitis oralis ATCC BAA-334 1.797 10≠ 4.545 · pneumoniae Clostridium bei- Clostridia Clostridiales Clostridiaceae Clostridium ATCC 51743 1.797 4.545 jerinckii Rhodobacter Proteobacteria Alphaproteobacteria Rhodobacterales Rhodobacteraceae Rhodobacter ATCC 17023 1.797 101 4.545 sphaeroides · Neisseria menin- 1 Betaproteobacteria Neisseriales Neisseriaceae Neisseria ATCC BAA-335 1.797 10≠ 4.545 gitidis · Helicobacter py- 1 Epsilonproteobacteria Campylobacterales Helicobacteraceae Helicobacter ATCC 700392 1.797 10≠ 4.545 lori · Acinetobacter 1 Gammaproteobacteria Pseudomonadales Moraxellaceae Acinetobacter ATCC 17978 1.797 10≠ 4.545 baumannii · Pseudomonas Pseudomonadaceae Pseudomonas ATCC 47085 1.797 4.545 aeruginosa Enterobacteriales Enterobacteriaceae Escherichia Escherichia coli ATCC 70096 1.797 101 4.545 · Candida albi- 2 Eukaryotes Fungi Ascomycota Saccharomycetes Saccharomycetales Debaryomycetaceae Candida SC5314 1.797 10≠ 4.545 cans · Table 1: Expected strains, their taxonomy and their targeted relative abundance (percentage) based on 16S gene copy counts (abundance, from metadata on EBI metagenomics database) on both samples (SRR072232 and SRR072233) Similar community compositions are observed using mapping-based relative abundances of strains or targeted relative abundance (Figure 1): the Bray-Curtis dissimilarity scores are smaller than 0.5 (0.338 for SRR02232 and 0.479 for SRR072233). However, for SRR072233 (Figure 1), identical targeted abundances are expected for all species, but variations are observed for mapping based abundances. The variation of 16S gene copy number between the species can explain the differences between targeted abundances and mapping-based abundances. Indeed, the targeted abundances are based on 16S copy number targeted in PCR to build the DNA pool. But, the number of 16S gene copies is not identical in the strains (from 1 for Candida albicans to 14 for Clostridium beijerinckii). Hence, even with identical targeted abundances (e.g. for SRR072233), we expect that a species with two 16S gene copies in its genome would be found twice less abundant in mapping-based relative abundance results. The 16S gene copy number variation induces then a difference between the relative abundance based on mapping reads on whole genome and the expected relative abundance based on the targeted 16S gene counts. Abundance from mapping Abundance from mapping Streptococcus pneumoniae RNA operon abundance Streptococcus pneumoniae RNA operon abundance Streptococcus mutans Streptococcus mutans Streptococcus agalactiae Streptococcus agalactiae Staphylococcus epidermidis Staphylococcus epidermidis Staphylococcus aureus Staphylococcus aureus Rhodobacter sphaeroides Rhodobacter sphaeroides Pseudomonas aeruginosa Pseudomonas aeruginosa Propionibacterium acnes Propionibacterium acnes Neisseria meningitidis Neisseria meningitidis Methanobrevibacter smithii Methanobrevibacter smithii Listeria monocytogenes Listeria monocytogenes Lactobacillus gasseri Lactobacillus gasseri Helicobacter pylori Helicobacter pylori Escherichia coli Escherichia coli Enterococcus faecalis Enterococcus faecalis Deinococcus radiodurans Deinococcus radiodurans Clostridium beijerinckii Clostridium beijerinckii Candida albicans Candida albicans Bacteroides vulgatus Bacteroides vulgatus Bacillus cereus Bacillus cereus Actinomyces odontolyticus Actinomyces odontolyticus Acinetobacter baumannii Acinetobacter baumannii 0.02 0.05 0.10 0.20 0.50 1.00 2.00 5.00 0.05 0.10 0.20 0.50 1.00 2.00 5.00 10.00 20.00 10.00 20.00 Relative abundance Relative abundance Figure 1: Comparison of relative abundances (percentage, in log scale) between expectation given the ribosomal RNA operon counts (green, Table 1) and mapping against reference genomes for both samples (SRR072232 on left, SRR072233 on right) Taxonomic analyses in EBI metagenomics and ASaiM workflows are executed on metagenomic sequences, i.e. on data after DNA pooling and sequencing. Mapping-based relatives abundances computed on raw metagenomic sequences are then more appropriate expected abundance information than the relative abundances based on 16S counts. We will then use this information in the next sections. 4 2.2 Analyses using EBI Metagenomics In EBI metagenomics database, both datasets have been analysed with EBI metagenomics pipeline (Version 1.0) (Figure 2). Figure 2: EBI metagenomics pipeline (version 1.0). The grey boxes correspond to data, the blue boxes to pretreatment steps, the red boxes to functional analysis steps and the green boxes to taxonomic analysis steps. To ease comparison with ASaiM results, EBI metagenomics pipeline results were downloaded from EBI metagenomics database and formatted. First, to compute relative abundances of each clade at all taxonomic levels, OTUs with taxonomic assignation are extracted and aggregated. Second, EBI metagenomics pipeline generates 3 types of functional

Validation of the Asaim Framework and Its Workflows on HMP Mock Community Samples

Assemblage and Functioning of Bacterial Communities in Soil and Rhizosphere Issue Date: 2016-06-08

Mobile Genetic Elements in Streptococci

Appendix III: OTU's Found to Be Differentially Abundant Between CD and Control Patients Via Metagenomeseq Analysis

Inoculation with Mycorrhizal Fungi and Irrigation Management Shape the Bacterial and Fungal Communities and Networks in Vineyard Soils

Successful Drug Discovery Informed by Actinobacterial Systematics

Streptococcaceae

Development of Bacterial Communities in Biological Soil Crusts Along

Supplemental Material S1.Pdf

Molecular Mechanisms of Inhibition of Streptococcus Species by Phytochemicals

Beta-Haemolytic Streptococci (BHS)

Table S5. the Information of the Bacteria Annotated in the Soil Community at Species Level

Microbial Community Composition During Degradation of Organic Matter