Contact: [email protected]

Comparative evaluation of Lazypipe

Benchmarking with simulated data

To evaluate the quality of Lazypipe predictions, we used the artificial metagenome dataset from the MetaShot evaluation [1]. This set contains reads produced by simulating 2x150 Illumina paired-end sequencing with ART [2]. Using the accession numbers in the read id fields, we mapped 94.5% of all reads to 107 viral and 99 prokaryote NCBI taxonomic ids and the human genome. Strain taxonomic ids were further mapped using the NCBI taxonomy to species, genus, family, order and superkingdom taxonomic ids, resulting in 84 species and 46 genera of viruses, and 71 species and 42 genera of bacteria. Based on this mapping, we constructed the gold-standard taxonomic profile for the MetaShot benchmark.
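The roll-up from strain ids to higher ranks can be illustrated with a minimal sketch. It assumes a toy parent/rank table in place of the real NCBI taxonomy. All node ids and the helper names `lineage_at_ranks` and `profile_at_rank` are hypothetical and are not part of Lazypipe or MetaShot.

```python
# Toy taxonomy: node id -> (parent id, rank).
# These ids are invented for illustration; they are NOT real NCBI taxonomic ids.
TOY_NODES = {
    1: (1, "no rank"),         # root (its own parent)
    10: (1, "superkingdom"),
    20: (10, "family"),
    30: (20, "genus"),
    40: (30, "species"),
    41: (40, "no rank"),       # strain below species 40
    42: (40, "no rank"),       # second strain of the same species
}

RANKS = ("species", "genus", "family", "order", "superkingdom")


def lineage_at_ranks(taxid, nodes, ranks=RANKS):
    """Walk from taxid up to the root, recording the ancestor at each rank."""
    found = {}
    current = taxid
    while True:
        parent, rank = nodes[current]
        if rank in ranks and rank not in found:
            found[rank] = current
        if parent == current:  # reached the root
            break
        current = parent
    return found


def profile_at_rank(strain_ids, nodes, rank):
    """Collapse strain ids to the set of distinct taxa at the given rank."""
    taxa = set()
    for taxid in strain_ids:
        lineage = lineage_at_ranks(taxid, nodes)
        if rank in lineage:    # some lineages lack a rank (here: "order")
            taxa.add(lineage[rank])
    return taxa


print(lineage_at_ranks(41, TOY_NODES))
print(profile_at_rank([41, 42], TOY_NODES, "species"))
```

Two strains of the same species collapse into one species-level entry, which is why the number of species and genera in the gold standard is smaller than the number of strain-level ids.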

We compared the performance of Lazypipe against the Kraken2 [3], MetaPhlan2 [4], and Centrifuge [5] software packages. Lazypipe was run in two configurations: querying translated de novo genes against UniProt (labelled Lazypipe), and querying nucleotide contigs directly against the NCBI nucleotide non-redundant database (labelled Lazypipe-nt). Kraken2, MetaPhlan2, and Centrifuge were run with default settings. For Centrifuge we used the NCBI nucleotide non-redundant sequence database; alignments with a match shorter than 60 nt (40% of read length) were removed to improve precision. Classification results were converted to CAMI taxonomic profiles and evaluated against the gold standard using OPAL, a CAMI [6] spinoff project implementing CAMI metrics for metagenomic profilers [7]. For the simulated metagenome, we separately evaluated the entire taxonomic profile output by each pipeline and the subprofile limited to predicted viral taxa. Results are displayed in Tables 1 and 2.
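The length filter applied to the Centrifuge results can be sketched as follows. This is an illustration only: the record layout with a `hit_length` field and the function name `filter_short_matches` are assumptions, not the actual parsing code used in the evaluation.

```python
READ_LENGTH = 150       # read length of the simulated 2x150 dataset
MIN_FRACTION = 0.40     # minimum matched fraction of the read
MIN_MATCH = round(READ_LENGTH * MIN_FRACTION)  # 60 nt


def filter_short_matches(alignments, min_match=MIN_MATCH):
    """Keep alignments whose matched length is at least min_match nucleotides.

    Each alignment is assumed to be a dict with a hypothetical
    'hit_length' key holding the matched length in nucleotides.
    """
    return [a for a in alignments if a["hit_length"] >= min_match]


# Invented example records, for illustration only.
example = [
    {"read_id": "r1", "hit_length": 150},
    {"read_id": "r2", "hit_length": 59},   # removed: shorter than 60 nt
    {"read_id": "r3", "hit_length": 60},   # kept: exactly at the threshold
]
kept = filter_short_matches(example)
print([a["read_id"] for a in kept])
```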

Both Lazypipe variants showed high accuracy for viral taxa, with F1 scores of 0.93 to 0.96 at the genus level and 0.87 to 0.92 at the species level (see Table 2). Accuracy for all taxa (bacterial and viral) was also good up to the genus level (see Table 1).

tool         rank     TP   FP   FN   Pr      Rc     F1 score
Lazypipe     genus    84   17   5    0.8317  0.944  0.884
MetaPhlan2   genus    71   6    18   0.9221  0.798  0.855
Lazypipe-nt  genus    68   19   21   0.7816  0.764  0.773
Kraken2      genus    50   3    39   0.9434  0.562  0.704
Centrifuge   genus    83   161  6    0.3402  0.933  0.498
MetaPhlan2   species  105  10   51   0.913   0.673  0.775
Lazypipe     species  143  111  13   0.563   0.917  0.698
Lazypipe-nt  species  107  64   49   0.6257  0.686  0.654
Kraken2      species  53   21   103  0.7162  0.340  0.461
Centrifuge   species  131  466  25   0.2194  0.840  0.348

Table 1. Benchmarking pipelines on the simulated metagenome [1]: evaluation of all predicted taxa. Within each rank, the compared tools are ordered by descending F1 score. Abbreviations: TP, true positives; FP, false positives; FN, false negatives; Pr, precision; Rc, recall; F1 score, harmonic mean of precision and recall.
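For reference, the metrics in Table 1 follow from the TP, FP and FN counts using the standard definitions. The sketch below recomputes them for one row. The function name `precision_recall_f1` is illustrative; the reported values were produced with OPAL, not with this code.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Lazypipe at genus level in Table 1: TP=84, FP=17, FN=5.
pr, rc, f1 = precision_recall_f1(84, 17, 5)
print(f"Pr={pr:.4f} Rc={rc:.3f} F1={f1:.3f}")
```

With these counts the function gives Pr = 84/101, Rc = 84/89 and F1 = 168/190, which round to the 0.8317, 0.944 and 0.884 reported for Lazypipe at the genus level.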

tool         rank     TP  FP  FN  Pr     Rc     F1 score
Lazypipe-nt  genus    43  1   3   0.977  0.935  0.956
Lazypipe     genus    41  1   5   0.976  0.891  0.932
Centrifuge   genus    46  7   0   0.868  1.000  0.929
MetaPhlan2   genus    33  3   13  0.917  0.717  0.805
Kraken2      genus    21  1   25  0.955  0.457  0.618
Lazypipe-nt  species  73  1   11  0.986  0.869  0.924
Lazypipe     species  71  8   13  0.899  0.845  0.871
Centrifuge   species  84  43  0   0.661  1.000  0.796
MetaPhlan2   species  37  8   47  0.822  0.440  0.574
Kraken2      species  16  1   68  0.941  0.190  0.317

Table 2. Benchmarking pipelines on simulated metagenome [1]: evaluating predicted viral taxa. Compared tools are ordered by descending F1-score. For abbreviations see Table 1.

Benchmarking with real data

To evaluate Lazypipe performance with real data, we ran Lazypipe with default settings on the mock-community dataset [8]. This dataset was composed from 9 viral cultures (Porcine circovirus 2, Feline panleukopenia virus, BK virus, Pepino mosaic virus, Rotavirus A, Feline infectious peritonitis virus, Bovine herpesvirus 1, Dickeya solani LIMEstone bacteriophage and Acanthamoeba polyphaga mimivirus) and 4 bacterial cultures [8]. The taxonomic profile of viral taxa produced by Lazypipe is presented in Table 3. Lazypipe recovered all 7 eukaryotic viruses included in the mock virome. Moreover, these were the only eukaryotic viruses predicted for this data with acceptable confidence scores (scores 1 and 2; score 3 carries a high risk of false positives and was therefore excluded). Thus, we obtained 100% sensitivity and 100% precision for eukaryotic viruses at the species level. Lazypipe also reported the Dickeya LIMEstone virus, but did not report the Acanthamoeba polyphaga mimivirus.
readn    readn_pc  csum    csumq  contign  species                      species_id  genus             family
1906884  16.9%     51.7%   1      19       Pepino mosaic virus          112229
1773244  15.7%     67.4%   1      18       Rotavirus A                  28875       Rotavirus
570391   5.1%      72.5%   1      7        Carnivore protoparvovirus 1  1511906     Protoparvovirus
74102    0.7%      95.8%   2      39       Bovine alphaherpesvirus 1    10320
57966    0.5%      96.9%   2      3        Porcine circovirus 2         85708       Circovirus
57919    0.5%      97.4%   2      5        Alphacoronavirus 1           693997      Alphacoronavirus
31185    0.3%      97.7%   2      1        Human polyomavirus 1         1891762     Betapolyomavirus
3448     0.0%      99.7%   3      3        Porcine type-C oncovirus     369960      Gammaretrovirus   Retroviridae
500      0.0%      99.9%   3      3        Bovine alphaherpesvirus 5    35244       Varicellovirus    Herpesviridae
248      0.0%      100.0%  3      1        Rotavirus C                  36427       Rotavirus         Reoviridae

readn    readn_pc  csum    csumq  contign  species                      species_id  genus             family
512699   4.5%      77.1%   1      5        Dickeya virus Limestone      1091052     Limestonevirus    Ackermannviridae
509088   4.5%      81.6%   1      1        Dickeya phage Coodle         2320188     Limestonevirus    Ackermannviridae
509088   4.5%      86.1%   1      1        Dickeya phage Kamild         2320190     Limestonevirus    Ackermannviridae
509088   4.5%      90.6%   1      1        Dickeya phage phiDP10.3      1542132     Limestonevirus    Ackermannviridae
509088   4.5%      95.1%   2      1        Salmonella virus SKML39      2169690     Agtrevirus        Ackermannviridae
1056     0.0%      99.9%   3      1        Dickeya phage phiDP23.1      1542133     Limestonevirus    Ackermannviridae
85       0.0%      100.0%  3      2        unidentified phage           38018       none              none
23       0.0%      100.0%  3      1        Escherichia virus HK629      2169968     Lambdavirus

Table 3. Lazypipe summary1 table for the mock community dataset. The upper part of the table lists predicted abundances for non-bacteriophage viruses and the lower part predicted abundances for bacteriophages. All 7 eukaryotic viruses included in the mock dataset were correctly recovered with acceptable confidence scores (i.e. csumq scores of 1 or 2).
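The confidence filter described above can be sketched as a simple threshold on the csumq column. The (species_id, csumq) pairs below are copied from the non-bacteriophage part of Table 3. The function name `accepted_species` and the constant `MAX_ACCEPTED_CSUMQ` are illustrative and not part of Lazypipe.

```python
# (species_id, csumq) pairs from the non-bacteriophage part of Table 3.
NON_PHAGE_ROWS = [
    (112229, 1),
    (28875, 1),
    (1511906, 1),
    (10320, 2),
    (85708, 2),
    (693997, 2),
    (1891762, 2),
    (369960, 3),
    (35244, 3),
    (36427, 3),
]

# Scores 1 and 2 are treated as acceptable; score 3 is excluded.
MAX_ACCEPTED_CSUMQ = 2


def accepted_species(rows, max_csumq=MAX_ACCEPTED_CSUMQ):
    """Return species ids whose csumq score is within the accepted range."""
    return [species_id for species_id, csumq in rows if csumq <= max_csumq]


accepted = accepted_species(NON_PHAGE_ROWS)
print(len(accepted), "species accepted")
```

Applying the threshold leaves 7 of the 10 listed species, matching the 7 eukaryotic viruses in the mock community.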

References

[1] Fosso B, Santamaria M, D’Antonio M, et al. MetaShot: an accurate workflow for taxon classification of host-associated microbiome from shotgun metagenomic data. Bioinformatics. 2017;33:1730–1732.

[2] Huang W, Li L, Myers JR, et al. ART: a next-generation sequencing read simulator. Bioinformatics. 2011;28:593–594.

[3] Wood DE, Salzberg SL. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 2014;15:R46.

[4] Truong DT, Franzosa EA, Tickle TL, et al. MetaPhlAn2 for enhanced metagenomic taxonomic profiling. Nat. Methods. 2015;12:902.

[5] Kim D, Song L, Breitwieser FP, et al. Centrifuge: rapid and sensitive classification of metagenomic sequences. Genome Res. [Internet]. 2016 [cited 2019 Oct 22]; Available from: http://genome.cshlp.org/content/early/2016/11/16/gr.210641.116.

[6] Sczyrba A, Hofmann P, Belmann P, et al. Critical assessment of metagenome interpretation—a benchmark of metagenomics software. Nat. Methods. 2017;14:1063.

[7] Meyer F, Bremges A, Belmann P, et al. Assessing taxonomic metagenome profilers with OPAL. Genome Biol. 2019;20:51.

[8] Conceição-Neto N, Zeller M, Lefrère H, et al. Modular approach to customise sample preparation procedures for viral metagenomics: a reproducible protocol for virome analysis. Sci. Rep. 2015;5:16532.