bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.
1 Bactopia: a flexible pipeline for complete
2 analysis of bacterial genomes
3
4
5 Robert A. Petit III1 and Timothy D. Read1
6 1Division of Infectious Diseases, Department of Medicine, Emory University School of
7 Medicine.
8
9 Corresponding Author:
10 Timothy Read1
11
12 Email address: [email protected]
13
14 Abstract
15 Sequencing of bacterial genomes using Illumina technology has become such a standard
16 procedure that often data are generated faster than can be conveniently analyzed. We created
17 a new series of pipelines called Bactopia, built using Nextflow workflow software, to provide
18 efficient comparative genomic analyses for bacterial species or genera. Bactopia consists of a
19 dataset setup step (Bactopia Datasets; BaDs) where a series of customizable datasets are
20 created for the species of interest; the Bactopia Analysis Pipeline (BaAP), which performs
21 quality control, genome assembly and several other functions based on the available datasets bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.
22 and outputs the processed data to a structured directory format; and a series of Bactopia Tools
23 (BaTs) that perform specific post-processing on some or all of the processed data. BaTs include
24 pan-genome analysis, computing average nucleotide identity between samples, extracting and
25 profiling the 16S genes and taxonomic classification using highly conserved genes. It is
26 expected that the number of BaTs will increase to fill specific applications in the future. As a
27 demonstration, we performed an analysis of 1,664 public Lactobacillus genomes, focusing on L.
28 crispatus, a species that is a common part of the human vaginal microbiome. Bactopia is an
29 open source system that can scale from projects as small as one bacterial genome to
30 thousands that allows for great flexibility in choosing comparison datasets and options for
31 downstream analysis. Bactopia code can be accessed at
32 https://www.github.com/bactopia/bactopia.
33
34 Introduction
35 Sequencing a bacterial genome, an activity that once required the infrastructure of a dedicated
36 genome center, is now a routine task that even a small laboratory can undertake. A large
37 number of open source software tools have been created to handle various parts of the process
38 of using raw read data for functions such as SNP calling and de novo assembly. As a result of
39 dedicated community efforts, it has recently become much easier to locally install these
40 bioinformatic tools through package managers (Bioconda (1), Brew) or through the use of
41 software containers (Docker, Singularity). Despite these advances, producers of bacterial
42 sequence data face a bewildering array of choices when considering how to perform analysis,
43 particularly when large numbers of genomes are involved and processing efficiency and
44 scalability become major factors.
45 bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.
46 Efficient bacterial multi-genome analysis has been hampered by three missing functionalities.
47 First, is the need to have “workflows of workflows'' that can integrate analyses and provide a
48 simplified way to start with a collection of raw genome data, remove low quality sequences and
49 perform the basic analytic steps of de novo assembly, mapping to reference sequence and
50 taxonomic assignment. Second, is the desire to incorporate user-specific knowledge of the
51 species into the input of the main genome analysis pipeline. While many microbiologists are not
52 expert bioinformaticians, they are experts in the organisms they study. Third, is the need to
53 create an output format from the main pipeline that could be used for future customized
54 downstream analysis such as pan-genome analysis and basic visualization of phylogenies.
55
56 Here we introduce Bactopia, an integrated suite of workflows for flexible analysis of Illumina
57 genome sequencing projects of bacteria from the same taxon. Bactopia is based on the
58 Nextflow workflow software (2), and is designed to be scalable, allowing projects as small as a
59 single genome to be run on a local desktop, or many thousands of genomes to be run as a
60 batch on a cloud infrastructure. Running multiple tasks on a single platform standardizes the
61 underlying data quality used for gene and variant calling between projects run in different
62 laboratories. This structure also simplifies the user experience. In Bactopia, complex multi-
63 genome analysis can be run in a small number of commands. However, there are myriad
64 options for fine-tuning datasets used for analysis and the functions of the system. The
65 underlying Nextflow structure ensures reproducibility. To illustrate the functionality of the system
66 we performed a Bactopia analysis of 1,664 public genome projects of the Lactobacillus genus,
67 an important component of the microbiome of humans and animals.
68 bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.
69 Design and implementation
70 Bactopia links together open source bioinformatics software, available from Bioconda (1), using
71 Nextflow (2). Nextflow was chosen for its flexibility: Bactopia can be run locally, on clusters, or
72 on cloud platforms with simple parameter changes. It also manages the parallel execution of
73 tasks and creates checkpoints allowing users to resume jobs. Nextflow automates installation of
74 the component software of the workflow through integration with Bioconda. For ease of
75 deployment, Bactopia can either be installed through Bioconda, a Docker container, or a
76 Singularity container. All the software programs used by Bactopia (version v1.3.0) described in
77 this manuscript are listed in Table 1 with their individual version numbers.
78
79 There are three main components of Bactopia (Figure 1): 1) Bactopia Datasets (BaDs), a
80 framework for formatting organism-specific datasets to be used by the downstream analysis
81 pipeline; 2) Bactopia Analysis Pipeline (BaAP), a customizable workflow for the analysis of
82 individual bacterial genome projects that is an extension and generalization of the previously
83 published Staphylococcus aureus-specific “Staphopia Analysis Pipeline” (StAP) (3). The inputs
84 to BaAP are FASTQ files from bacterial Illumina sequencing projects, either imported from the
85 National Centers for Biotechnology Information (NCBI) Short Read Archive (SRA) database or
86 provided locally, and any reference data in the BaDs; 3) Bactopia Tools (BaTs), a set of
87 workflows that use the output files from a BaAP project to run genomic analysis on multiple
88 genomes. For this project we used BaTs to a) summarize the results of running multiple
89 bacterial genomes through BaAP, b) extract 16S sequences and create a phylogeny, c) assign
90 taxonomic classifications with the Genome Taxonomy Database (4), d) subset Lactobacillus
91 crispatus samples by average nucleotide identity (ANI) with FastANI (5), e) run pan-genome
92 analysis for L. crispatus using Roary (6) and create a core-genome phylogeny. bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.
93 Figure 1 - Bactopia overview
94
95
96 Bactopia Datasets
97 The Bactopia pipeline can be run without downloading and formatting Bactopia Datasets
98 (BaDs). However, providing them enriches the downstream analysis. Bactopia can import
99 specific existing public datasets, as well as accessible user-provided datasets in the appropriatete
100 format. A subcommand (bactopia datasets) was created to automate downloading, building, andnd
101 (or) configuring these datasets for Bactopia.
102
103 BaDs can be grouped into those that are general and those that are user-supplied. General
104 datasets include: a Mash (7) sketch of the NCBI RefSeq (8) and PLSDB (9) databases and a
105 Sourmash (10) signature of microbial genomes (includes viral and fungal) from the NCBI
106 GenBank (11) database. Ariba (12), a software for detecting genes in raw read (FASTQ) files,
107 uses a number of default reference databases for virulence and antibiotic resistance. The bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.
108 available Ariba datasets include: ARG-ANNOT (13), CARD (14), MEGARes (15), NCBI
109 Reference Gene Catalog (16), plasmidfinder (17), resfinder (18), SRST2 (19), VFDB (20), and
110 VirulenceFinder (21).
111
112 When an organism name is provided, additional datasets are set up. If a MLST schema is
113 available for the species, it is downloaded from PubMLST.org (22) and set up for BLAST+ (23)
114 and Ariba. Each RefSeq completed genome for the species is downloaded using ncbi-genome-
115 download (24). A Mash sketch is created from the set of downloaded completed genomes to be
116 used for automatic reference selection for variant calling. Protein sequences are extracted from
117 each genome with BioPython (25), clustered using CD-HIT (26, 27) and formatted to be used by
118 Prokka (28) for annotation. Users may also provide their own organism-specific reference
119 datasets to be used for BLAST+ alignment, short-read alignment, or variant calling.
120 Bactopia Analysis Pipeline
121 The Bactopia Analysis Pipeline (BaAP) takes input FASTQ files and optional user-specified
122 BaDs and performs a number of workflows that are based on either de novo whole genome
123 assembly, reference mapping or sequence decomposition (ie kmer-based approaches) (Figure
124 1). BaAP has incorporated numerous existing bioinformatic tools (Table 1) into its workflow
125 (Supplementary Figure 1). For each tool, many of the input parameters are exposed to the
126 user, allowing for fine-tuning analysis.
127
128
129 Table 1 - List of bioinformatic tools used by Bactopia
130 A list of bioinformatic tools used throughout version 1.3.0 of the Bactopia Analysis Pipeline and
131 Bactopia Tools.
132 bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.
Name Version Description Link Ref.
AMRFinder+ 3.6.7 Find acquired antimicrobial resistance https://github.com/ncbi/amr (16)
genes and some point mutations in
protein or assembled nucleotide
sequences
Aragorn 1.2.38 Finds transfer RNA features (tRNA) http://130.235.244.92/ARAGORN/Downloads/ (29)
Ariba 2.14.4 Antimicrobial Resistance Identification https://github.com/sanger-pathogens/ariba (12)
By Assembly assembly-scan 0.3.0 Generate basic stats for an assembly https://github.com/rpetit3/assembly-scan (30)
Barrnap 0.9 Bacterial ribosomal RNA predictor https://github.com/tseemann/barrnap (31)
BBMap 38.76 A suite of fast, multithreaded https://jgi.doe.gov/data-and-tools/bbtools/ (32)
bioinformatics tools designed for
analysis of DNA and RNA sequence
data.
BCFtools 1.9 Utilities for variant calling and https://github.com/samtools/bcftools (33)
manipulating VCFs and BCFs.
Bedtools 2.29.2 A powerful toolset for genome https://github.com/arq5x/bedtools2 (34)
arithmetic.
BioPython 1.76 Tools for biological computation written https://github.com/biopython/biopython (25)
in Python
BLAST+ 2.9.0 Basic Local Alignment Search Tool https://blast.ncbi.nlm.nih.gov/Blast.cgi (23)
BWA 0.7.17 Burrows-Wheeler Aligner for short-read https://github.com/lh3/bwa/ (35)
alignment
CD-HIT 4.8.1 Accelerated for clustering the next- https://github.com/weizhongli/cdhit (26, 27)
generation sequencing data bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.
ClonalFrameM 1.12 Efficient Inference of Recombination in https://github.com/xavierdidelot/ClonalFrameML (36)
L Whole Bacterial Genomes
DiagrammeR 1.0.0 Graph and network visualization using https://github.com/rich-iannone/DiagrammeR (37)
tabular data in R
EMIRGE 0.61.1 Reconstructs full length ribosomal genes https://github.com/csmiller/EMIRGE (38)
from short read sequencing data
FastANI 1.3 Fast Whole-Genome Similarity (ANI) https://github.com/ParBLiSS/FastANI (5)
Estimation
FastTree 2 2.1.10 Approximately-maximum-likelihood http://www.microbesonline.org/fasttree (39)
phylogenetic trees from alignments of
nucleotide or protein sequences fastq-dl 1.0.3 Download FASTQ files from SRA or https://github.com/rpetit3/fastq-dl (40)
ENA repositories
FastQC 0.11.9 A quality control analysis tool for high https://github.com/s-andrews/FastQC (41)
throughput sequencing data. fastq-scan 0.4.2 Output FASTQ summary statistics in https://github.com/rpetit3/fastq-scan (42)
JSON format
FLASH 1.2.11 A fast and accurate tool to merge https://ccb.jhu.edu/software/FLASH/ (43)
paired-end reads freebayes 1.3.2 Bayesian haplotype-based genetic https://github.com/ekg/freebayes (44)
polymorphism discovery and genotyping
GNU Parallel 20200122 A shell tool for executing jobs in parallel https://www.gnu.org/software/parallel/ (45)
GTDB-tk 1.0.2 A toolkit for assigning objective https://github.com/Ecogenomics/GTDBTk (46)
taxonomic classifications to bacterial
and archaeal genomes bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.
HMMER 3.3 Biosequence analysis using profile http://hmmer.org/ (47–49)
hidden Markov models
Infernal 1.1.2 Searches DNA sequence databases for http://eddylab.org/infernal/ (50)
RNA structure and sequence similarities
IQ-TREE 1.6.12 Efficient phylogenomic software by https://github.com/Cibiv/IQ-TREE (51)
maximum likelihood
ISMapper 2.0 IS mapping software https://github.com/jhawkey/IS_mapper (52)
Lighter 1.1.2 Fast and memory-efficient sequencing https://github.com/mourisl/Lighter (53)
error corrector
MAFFT 7.455 Multiple alignment program for amino https://mafft.cbrc.jp/alignment/software/ (54)
acid or nucleotide sequences
Mash 2.2.2 Fast genome and metagenome distance https://github.com/marbl/Mash (7, 55)
estimation using MinHash maskrc-svg 0.5 Masks recombination as detected by https://github.com/kwongj/maskrc-svg (56)
ClonalFrameML or Gubbins and draws
an SVG
McCortex 1.0 De novo genome assembly and https://github.com/mcveanlab/mccortex (57)
multisample variant calling
MEGAHIT 1.2.9 Ultra-fast and memory-efficient (meta- https://github.com/voutcn/megahit (58)
)genome assembler
MinCED 0.4.2 Mining CRISPRs in Environmental https://github.com/ctSkennerton/minced (59)
Datasets
Minimap2 2.17 A versatile pairwise aligner for genomic https://github.com/lh3/minimap2 (60)
and spliced nucleotide sequences ncbi-genome- 0.2.12 Scripts to download genomes from the https://github.com/kblin/ncbi-genome-download (24) bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.
download NCBI FTP servers
Nextflow 19.10.0 A DSL for data-driven computational https://github.com/nextflow-io/nextflow (2)
pipelines phyloFlash 3.3b3 Rapidly reconstruct the SSU rRNAs and https://github.com/HRGV/phyloFlash (61)
explore phylogenetic composition of an
illumina (meta)genomic dataset.
Pigz 2.3.4 A parallel implementation of gzip for https://zlib.net/pigz/ (62)
modern multi-processor, multi-core
machines
Pilon 1.23 An automated genome assembly https://github.com/broadinstitute/pilon/ (63)
improvement and variant detection tool pplacer 1.1.alpha Phylogenetic placement and https://github.com/matsen/pplacer (64)
19 downstream analysis
Prodigal 2.6.3 Fast, reliable protein-coding gene https://github.com/hyattpd/Prodigal (65)
prediction for prokaryotic genomes.
Prokka 1.4.5 Rapid prokaryotic genome annotation https://github.com/tseemann/prokka (28)
Roary 3.13.0 Rapid large-scale prokaryote pan https://github.com/sanger-pathogens/Roary (6)
genome analysis samclip 0.2 Filter SAM file for soft and hard clipped https://github.com/tseemann/samclip (66)
alignments
SAMtools 1.9 Tools for manipulating next-generation https://github.com/samtools/samtools (67)
sequencing data
Seqtk 1.3 A fast and lightweight tool for processing https://github.com/lh3/seqtk (68)
sequences in the FASTA or FASTQ
format bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.
Shovill 1.0.9se Faster assembly of Illumina reads https://github.com/tseemann/shovill (69)
SKESA 2.3.0 Strategic Kmer Extension for Scrupulous https://github.com/ncbi/SKESA (70)
Assemblies
Snippy 4.4.5 Rapid haploid variant calling and core https://github.com/tseemann/snippy (71)
genome alignment
SnpEff 4.3.1 Genomic variant annotations and http://snpeff.sourceforge.net/ (72)
functional effect prediction toolbox snp-dists 0.6.3 Pairwise SNP distance matrix from a https://github.com/tseemann/snp-dists (73)
FASTA sequence alignment
SNP-sites 2.5.1 Rapidly extracts SNPs from a multi- https://github.com/sanger-pathogens/snp-sites (74)
FASTA alignment
Sourmash 3.2.0 Compute and compare MinHash https://github.com/dib-lab/sourmash (10)
signatures for DNA data sets
SPAdes 3.13.0 An assembly toolkit containing various https://github.com/ablab/spades (75)
assembly pipelines
Trimmomatic 0.39 A flexible read trimming tool for Illumina http://www.usadellab.org/cms/index.php (76)
NGS data vcf-annotator 0.5 Add biological annotations to variants in https://github.com/rpetit3/vcf-annotator (77)
a VCF file
Vcflib 1.0.0rc3 A simple C++ library for parsing and https://github.com/vcflib/vcflib (78)
manipulating VCF files
Velvet 1.2.10 Short read de novo assembler using de https://github.com/dzerbino/velvet (79)
Bruijn graphs
VSEARCH 2.14.1 Versatile open-source tool for https://github.com/torognes/vsearch (80)
metagenomics bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.
vt 2015.11.1 A tool set for short variant discovery in https://github.com/atks/vt (81)
0 genetic sequence data
133
134 BaAP: Acquiring FASTQs
135 Bactopia provides multiple ways for users to provide their FASTQ formatted sequences. Input
136 FASTQs can be local or downloaded from public repositories.
137
138 Local sequences can be processed one at a time or in batches. To process a single sample, the
139 user provides the path to their FASTQ(s) and a sample name. For multiple samples, this
140 method does not make efficient use of Nextflow’s queue system. Alternatively, users can
141 provide a “file of filenames” (FOFN), which is a tab-delimited file with information about samples
142 and paths to the corresponding FASTQ(s). By using the FOFN method, Nextflow queues each
143 sample and makes efficient use of available resources. A subcommand (bactopia prepare) was
144 created to automate the creation of a FOFN.
145
146 Raw sequences available from public repositories (e.g. European Nucleotide Archive (ENA),
147 Sequence Read Archive (SRA), or DNA Data Bank of Japan (DDBJ)) can also be processed by
148 Bactopia. Sequences associated with a provided Experiment accession (e.g. DRX, ERX, SRX)
149 are downloaded and processed exactly as local sequences would be. A subcommand (bactopia
150 search) was created, which allows users to query ENA to create a list of Experiment accessions
151 from the ENA Data Warehouse API (82) associated with a BioProject accession, Taxon ID, or
152 organism name.
153 bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.
154 BaAP: Validating FASTQs
155 The path for input FASTQ(s) is validated, and if necessary, sequences from public repositories
156 are downloaded using fastq-dl (40). Once validated, the FASTQ input(s) are tested to determine
157 if they meet a minimum threshold for continued processing. All BaAP steps expect to use
158 Illumina sequence data, which is the great majority of genome projects currently generated.
159 FASTQ files that are explicitly marked as non-Illumina or have properties that suggest non-
160 Illumina (e.g read-length, error profile) are excluded. By default, input FASTQs must exceed
161 2,241,820 bases (20x coverage of the smallest bacterial genome, Nasuia deltocephalinicola
162 (83)) and 7,472 reads (minimum required base pairs / 300 bp, the longest available reads from
163 Illumina). If estimated, the genome size must be between 100,000 bp and 18,040,666 bp which
164 is based off the range of known bacterial genome sizes (N. deltocephalinicola,
165 GCF_000442605, 112,091 bp; Minicystis rosea, GCF_001931535, 16,040,666 bp). Failure to
166 pass these requirements excludes the samples from further subsequent analysis. The threshold
167 values can be adjusted by the user at runtime.
168
169 BaAP: FastQ Quality Control and generation of pFASTQ
170 Input FASTQs that pass the validation steps undergo quality control steps to remove poor
171 quality reads. BBDuk, a component of BBTools (32), removes Illumina adapters, phiX
172 contaminants and filters reads based on length and quality. Base calls are corrected using
173 Lighter (53). At this stage, the default procedure is to downsample the FASTQ file to an average
174 100x genome coverage (if over 100x) with Reformat (from BBTools). This step, which was used
175 in StAP (3), significantly saves compute time at little final cost to assembly or SNP calling
176 accuracy. The genome size for coverage calculation is either provided by the user or estimated
177 based on the FASTQ data by Mash (7). The user can provide their own value for downsampling
178 FASTQs or disable it completely. Summary statistics before and after QC are created using bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.
179 FastQC (41) and fastq-scan (42). After QC, the original FASTQs are no longer used and only
180 the processed FASTQs (pFASTQ) are used in subsequent analysis.
181
182 BaAP: Assembly, Reference Mapping and Decomposition
183 BaAP uses Shovill (69) to create a draft de novo assembly with MEGAHIT (58), SKESA (70)
184 (default), SPAdes (75) or Velvet (79) and make corrections using Pilon (63) from the pFASTQ.
185 Summary statistics for the draft assembly are created using assembly-scan (30). If the total size
186 of the draft assembly fails to meet a user-specified minimum size, further assembly-based
187 analyses are discontinued. Otherwise, a BLAST+ (23) nucleotide database is created from the
188 contigs. The draft assembly is also annotated using Prokka (28). If available at runtime, Prokka
189 will first annotate with a clustered RefSeq protein set, followed by its default databases. The
190 annotated genes and proteins are then subjected to antimicrobial resistance prediction with
191 AMRFinder+ (16).
192
193 For each pFASTQ, sketches are created using Mash (k=21,31) and Sourmash (10)
194 (k=21,31,51). McCortex (57) is used to count 31-mers in the pFASTQ.
195
196 BaAP: Optional Steps
197 At runtime, Bactopia checks for BaDs specified by the command line (if any) and adjusts the
198 settings of the pipeline accordingly. Examples of processes executed only if a BaDs is specified
199 include Ariba (12) analysis for each available reference dataset; sequence containment
200 screening against RefSeq (8) with mash screen (55) and against GenBank (11) with sourmash
201 lca gather (10), and PLSDB (9), with mash screen and BLAST+. The sequence type (ST) of the
202 sample is determined with BLAST+ and Ariba. The nearest reference RefSeq genome, based
203 on mash (7) distance, is downloaded with ncbi-genome-download (24), and variants are called bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.
204 with Snippy (71). Alternatively, one or more reference genomes can be provided by the user.
205 Users can also provide sequences for: sequence alignment and per-base coverage with BWA
206 (35, 84) and Bedtools (34), BLAST+ alignment, or insertion site determination with ISMapper
207 (52).
208
209 Bactopia Tools
210 After BaAP has successfully completed, it will create a directory for each strain with
211 subdirectories for each analysis result. The directory structure is independent of project or
212 options chosen. Bactopia Tools (BaTs) are a set of comparative analysis workflows written
213 using Nextflow that take advantage of the predictable output structure from BaAP. Each BaT is
214 created from the same framework and a subcommand (bactopia tools create) is available to
215 simplify the creation of future BaTs.
216
217 Five BaTs were used for analyses in this article. The summary BaT, outputs a summary report
218 of the set of samples and a list of samples that failed to meet thresholds set by the user. This
219 summary includes basic sequence and assembly stats as well as technical (pass/fail)
220 information. The roary BaT creates a pan-genome of the set of samples with Roary (6), with the
221 option to include RefSeq (8) completed genomes. The fastani BaT determines the pairwise
222 average nucleotide identity (ANI) for each sample with FastANI (5). The phyloflash BaT
223 reconstructs 16S rRNA gene sequences with phyloFlash (61). The gtdb BaT assigns taxonomic
224 classifications from the Genome Taxonomy Database (GTDB) (4) with GTDB-tk (46). Each
225 Bactopia tool has a separate Nextflow workflow with its own conda environment, Docker image
226 and Singularity image.
227 bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.
228 Comparison to Similar Open Source Software
229 At the time of writing (February 2020), we knew of only three other actively-maintained open
230 source generalist bacterial genomic workflow softwares that encompassed a similar range of
231 functionality to Bactopia: ASA3P (85), TORMES (86) and the currently unpublished Nullarbor
232 (87). The versions of these programs used many of the same component software (e.g. Prokka,
233 SPAdes, BLAST+, Roary) but differed in the philosophies underlying their design (Table 2). This
234 made head-to-head runtime comparisons somewhat meaningless, as each was aimed at
235 different analysis scenarios and produced different output. Bactopia was the most open-ended
236 and flexible, allowing the user to customize input databases and providing a platform for
237 downstream analysis by different BaTs rather than built-in pangenome and phylogeny creation.
238 Bactopia also had some features not implemented in the others, such as SRA/ENA search and
239 download and automated reference genome selection for identifying variants. Both Bactopia
240 and ASA3P are highly scalable, each can be seamlessly executed on local, cluster, and cloud
241 environments with little effort required by the user. ASA3P was the only program to implement
242 long-read data assembly. TORMES was the only program to include a user customizable
243 RMarkdown for report and to have optional analyses specifically for Escherichia and
244 Salmonella. Nullarbor was the only program to implement a pre-screen method for filtering out
245 potential biological outliers prior to full analysis.
246
247 Table 2 - A comparison of bacterial genome analysis workflows.
Feature Bactopia ASA3P Nullarbor TORMES
Version 1.3.0 1.2.2 2.0.20191013 1.0
Release Date February 19th, 2020 November 20th, 2019 October 13th, 2019 February 4th, 2019 bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.
Latest Commit February 23th, 2020 February 19th, 2020 February 23th, 2020 January 30th, 2020
Sequence Illumina Illumina, Nanopore, Illumina Illumina
Technology PacBio
Single End Reads Yes Yes No No
Workflow Nextflow Groovy Perl + Make Bash
Resume If Stopped Yes No Yes No
Reuse Existing Yes No Yes No
Runs For
Expanded Analysis
Built In HPC Yes Yes No No
Cluster and Cloud
Capability
Individual Program Yes No Yes No
Adjustable
Parameters
Batch processing Yes Yes Yes Yes from Config File
Single Sample Yes No Yes No
Processing from
Command Line bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.
Sequence Depth Yes No Yes No
Downsample
Automatic Yes No No No
Reference
Selection For
Variant Detection
Data Download Yes No No No from SRA/ENA
Species k-mers,16S, ANI k-mers, 16S, ANI k-mers k-mers
Identification
Comparative Separate Process Built In Process Built In Process Built In Process
Analysis
Summary Text HTML HTML R Markdown
Package Manager Bioconda - Bioconda & Brew Conda YAML
Container Available Yes Yes Yes No
Documentation Website PDF Manual README README
Github Repository bactopia/bactopia oschwengers/asap tseemann/nullarbor nmquijada/tormes
248
249 Use case: the Lactobacillus genus
250 We performed a Bactopia analysis of publicly-available raw Illumina data labelled as belonging
251 to the Lactobacillus genus. Lactobacillus is an important component of the human microbiome, bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.
252 and cultured samples have been sequenced by several research groups over the past few
253 years. L. crispatus and other species are often the majority bacterial genus of the human vagina
254 and are associated with low pH and reduction in pathogen burden (88). Samples of the genus
255 are used in the food industry for fermentation in the production of yoghurt, kimchi, kombucha
256 and other common items. Lactobacillus is a common probiotic, although recent genomic-based
257 transmission studies showed that bloodstream infections can follow after ingestion by
258 immunocompromised patients (89).
259
260 Running BaAP
261 In November 2019, we initiated Bactopia analysis using the following three commands:
262
263 # Build Lactobacillus dataset
264 bactopia datasets ~/bactopia-datasets --species 'Lactobacillus' --
265 include_genus --cpus 10
266
267 # Query ENA for all Lactobacillus (tax id 1578) sequence projects
268 bactopia search 1578 --prefix lactobacillus
269 # this creates a file called ‘lactobacillus-accessions.txt’
270
271 # Process Lactobacillus samples
272 mkdir ~/bactopia
273 cd ~/bactopia
274 bactopia --accessions ~/lactobacillus-accessions.txt --datasets ~/bactopia-
275 datasets --species lactobacillus --coverage 100 --cpus 4 --min_genome_size
276 1000000 --max_genome_size 4200000 bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.
277
278 The bactopia datasets subcommand automated downloading of BaDs. With these parameters,
279 we downloaded and formatted the following datasets: Ariba (12) reference databases for the
280 Comprehensive Antibiotic Resistance Database (CARD) and the core Virulence Factor
281 Database (VFDB) (14, 20), RefSeq Mash sketch (7, 8), GenBank Sourmash signatures (10, 11),
282 PLSDB blast database and sketch (9), and a clustered protein set and Mash sketch from
283 completed Lactobacillus genomes available from NCBI Assembly (RefSeq). This took 25
284 minutes to complete.
285
286 The bactopia search subcommand produced a list of 2,030 Experiment accessions that had
287 been labelled as “Lactobacillus” (taxonomy ID: 1578) (Supplementary Data 1). After filtering for
288 only Illumina sequencing, 1,664 Experiment accessions remained (Supplementary Data 2).
289
290 The main bactopia command automated BaAP processing of the list of accessions using the
291 downloaded BaTs. Here we chose a standard maximum coverage per genome of 100x, based
292 on the estimated genome size. We used the range of genome sizes (1.2 Mb to 3.7Mb) for the
293 completed Lactobacillus genomes to require that the estimated genome size for each sample be
294 between 1Mbp and 4.2 Mbp.
295
296 Samples were processed on a 96-core SLURM cluster with 512 GB of available RAM. Analysis
297 took approximately 2.5 days to complete, with an estimated runtime of 30 minutes per sample,
298 (determined by adding up the median process runtime, 17 total, in BaAP). No individual process
299 used more than 8GB of memory, with all but 5 using less than 1 GB. Nextflow (2) recorded
300 detailed statistics on resource usage, including CPU, memory, job duration and I/O.
301 (Supplementary Data 3).
302 bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.
303 Analysis of Lactobacillus genomes using BaTs
304 The BaAP outputted a directory of directories named after the unique Experiment accessions for
305 each sample. Within each sample directory were subdirectories for the output of each analysis
306 run. This data structure was recognized by BaTs for subsequent analysis.
307
308 We used BaT summary to generate a summary report of our analysis. The report included an
309 overview of sequence quality, assembly statistics, and predicted antimicrobial resistances and
310 virulence factors. It also output a list of samples that failed to meet minimum sequencing depth
311 and/or quality thresholds.
312
313 bactopia tools summary --bactopia ~/bactopia --prefix lactobacillus
314
315 BaT summary grouped samples as “Gold”, “Silver”, “Bronze”, “Exclude” or “QC Failure”, based
316 on BaAP completion, minimum sequencing coverage, per-read sequencing mean quality,
317 minimum mean read length, and assembly quality (Table 3; Supplementary Figure 2). We
318 used the default values for these cutoffs to group our samples. Gold samples had greater than
319 100x coverage, per-read mean quality greater than Q30, mean read length greater than 95bp,
320 and an assembly with less than 100 contigs. Silver samples had greater than 50x coverage,
321 per-read mean quality greater than Q20, mean read length greater than 75bp, and an assembly
322 with less than 200 contigs. Bronze samples had greater than 20x coverage, per-read mean
323 quality greater than Q12, mean read length greater than 49bp, and an assembly with less than
324 500 contigs. 106 samples (Exclude and QC Failure) were excluded from further analysis
325 (Supplementary Table 1). 48 samples that failed to meet the minimum thresholds for Bronze
326 quality were grouped as Exclude. 58 samples that were not processed by BaAP due to
327 sequencing related errors or the estimated genome size were grouped as QC Failure. Of these, bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.
328 one (Experiment accession SRX4526092) was labeled as paired-end but did not have both sets
329 of reads, one (Experiment accession SRX1490246) was identified to be an assembly converted
330 to FASTQ format, and 14 had insufficient sequencing depth. The remaining 42 samples,
331 unprocessed by BaAP, had an estimated genome size which exceeded 4.2Mbp (set at runtime).
332 We queried these samples against available GenBank and RefSeq sketches using Mash screen
333 and Sourmash lca gather. There were 36 samples that contained evidence for Lactobacillus but
334 also sequences for other bacterial species, phage, viruses and plants genomes. There were 6
335 samples that contained no evidence for Lactobacillus, 4 had matches to multiple bacterial
336 species and 2 to only Saccharomyces cerevisiae.
337
338 There were 1,558 Gold, Silver or Bronze quality samples (Table 3) used for further analysis. For
339 these we found that, on average, the assembled genome size was about 12% smaller than the
340 estimated genome size (Supplementary Figure 3, Table 3). If we assume the assembled
341 genome size is a better indicator of a sample’s genome size, the average coverage before QC
342 increases from 220x to 268x. In this use case, the Lactobacillus genus, it was necessary to
343 estimate genome sizes, but when dealing with samples from a single species it may be better to
344 provide a known genome size.
345
346 Table 3 - Summary of Lactobacillus genome sequencing projects quality and coverage
Quality Count Original Post-Bactopia Per-Read Read Contig Percent of
Rank Coverage Coverage Quality Length Count Assembled Genome
(Median) (Median) Score (Median) (Median) Size compared to
(Median) Estimated Genome
Size (Median) bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.
Gold 967 213x 100x Q35 100 52 92%
Silver 386 160x 100x Q35 100 110 93%
Bronze 205 102x 100x Q34 100 90 93%
Exclude 48 26x 22x Q34 100 706 93%
QC 58 ------
Failure
347
348 For visualization of the phylogenetic relationships of the samples, we used the phyloflash and
349 gtdb BaTs.
350
351 bactopia tools phyloflash --phyloflash ~/bactopia-datasets/16s/138 --bactopia
352 ~/bactopia --cpus 16 --exclude ~/bactopia-tool/summary/lactobacillus-
353 exclude.txt
354
355 bactopia tools gtdb --gtdb ~/bactopia-datasets/gtdb/db --bactopia ~/bactopia
356 --cpus 48 --exclude ~/bactopia-tool/summary/lactobacillus-exclude.txt
357
358 The gtdb BaT used GTDB-Tk (46) to assign a taxonomic classification to each sample. GTDB-
359 Tk used the assembly to predict genes with Prodigal (65), identify marker genes (4) for
360 phylogenetic inference with HMMER3 (47), and find the maximum-likelihood placement of each
361 sample on the GTDB-Tk reference tree with pplacer (64). A taxonomic classification was
362 assigned to 1,554 samples, and 4 samples failed classification due to insufficient marker gene
363 coverage or marker genes with multiple hits.
364 bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.
365 The phyloflash BaT used the phyloFlash tool (61) to reconstruct a 16S rRNA gene from each
366 sample that was used for phylogenetic reconstruction (Figure 2). The 16S rRNA was
367 reconstructed from a SPAdes (75) assembly and annotated against the SILVA (92) rRNA
368 database (v138) database for 1,470 samples. There were 88 samples that were excluded from
369 the phylogeny: 12 samples that did not meet the requirement of a mean read length of 50bp, 17
370 samples in which a 16S could not be reconstructed, 19 samples that had a mismatch in
371 assembly and mapped read taxon designations, and 40 samples that had 16S genes
372 reconstructed for multiple species. A phylogenetic tree was created with IQ-TREE (51, 90, 91)
373 based on a multiple sequence alignment of the reconstructed 16S genes with MAFFT (54).
374 Taxonomic classifications from GTDB-Tk were used to annotate the 16S with iTOL (93).
375
376 Figure 2 - Maximum-likelihood phylogeny from reconstructed 16S rRNA genes
377 A phylogenetic representation 1,470 samples using IQ-Tree (51, 90, 91). A) A tree of the full set of
378 samples. The outer ring represents the genus assigned by GTDB-Tk, with Blue being
379 Lactobacillus and all other assignments Red. B) The same tree as A, but with the non-
380 Lactobacillus clade collapsed. Major groups of Lactobacillus groups (depicted with a letter) and
381 the most sequenced Lactobacillus species have been labeled. The inner ring represents the
382 average nucleotide identity (ANI), determined by FastANI (5) of samples to L. crispatus. The tree
383 was built from a multiple sequence alignment (54) of 16S genes reconstructed by phyloFlash (61)
384 with 1,281 parsimony informative sites. The likelihood score for the consensus tree constructed
385 from 1,000 bootstrap trees was -54,698. Taxonomic classifications were assigned by GTDB-Tk
386 (46).
387 bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.
388 bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.
389
390
391
392 A recent analysis of completed genomes in the NCBI found 239 discontinuous de novo
393 Lactobacillus species using a 94% ANI cutoff (94). Based on GTDB taxonomic classification,
394 which applies a 95% ANI cutoff, we identified 161 distinct Lactobacillus species in 1,554
395 samples. The five most sequenced Lactobacillus species, 45% of the total, were L. rhamnosus
396 (n=225), L. paracasei (n=180), L. gasseri (n=132), L. plantarum (n=86) and L. fermentum
397 (n=80). Within these five species the assembled genomes sizes were remarkably consistent
398 (Supplementary Figure 4). There were 58 samples that were not classified as Lactobacillus, of bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.
399 which 34 were classified as Streptococcus pneumoniae by both 16S and GTDB
400 (Supplementary Table 2).
401
402 We found that in 505 (~33%) out of 1,554 taxonomically classifications by 16S and GTDB were
403 in conflict with the taxonomy according to the NCBI SRA, illustrating the importance of the
404 unbiased approach to understanding sample context. In samples that had both a 16S and
405 GTDB taxonomic classification, there was disagreement in 154 out of 1,467 samples. Of these,
406 47% were accounted for by the recently described L. paragasseri (95) (n=72). This possibly
407 highlights a lag in the reclassification of assemblies in the NCBI Assembly database.
408
409 Analysis of the pangenome of the entire genus using a tool such as Roary (6), would return only
410 a few core genes, owing to sequence divergence of evolutionary distant species. However,
411 because the roary BaT can be supplied a list of individual samples it is possible to isolate the
412 analysis to the species level. As an example of using BaTs to focus on a particular group within
413 the larger set of results, we chose L. crispatus, a species commonly isolated from the human
414 vagina and also found in the guts/feces of poultry. We used the fastani BaT to estimate the ANI
415 of all samples against a randomly selected L. crispatus completed genome (NCBI Assembly
416 Accession: GCF_003795065) with FastANI (5). A cutoff of greater than 95% ANI was used to
417 categorize a sample as L. crispatus. A pan-genome analysis was conducted on only the
418 samples categorized as L. crispatus using the roary BaT. The roary BaT downloaded all
419 available completed L. crispatus genomes with ncbi-genome-download (24), formatted the
420 completed genomes with Prokka (28), created a pan-genome with Roary (6), identified and
421 masked recombination with ClonalFrameML (36) and maskrc-svg (56), created a phylogenetic
422 tree with IQ-TREE (51, 90, 91) and a pair-wise SNP distance matrix with snp-dists (73).
423 bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.
424 bactopia tools fastani --bactopia ~/bactopia --exclude ~/bactopia-
425 tool/summary/lactobacillus-exclude.txt --accession GCF_003795065.1 --
426 refseq_only --minFraction 0.0
427
428 # Identify samples with > 95% ANI to L. crispatus
429 awk '{if ($3 > 95){print $0}}' ~/bactopia-tool/fastani/fastani.tsv | grep
430 "RX" > ~/crispatus-include.txt
431
432 bactopia tools roary --bactopia ~/bactopia --cpus 20 --include ~/crispatus-
433 include.txt --species "lactobacillus crispatus" --n
434
435 ANI analysis revealed 38 samples as having >96.1% ANI to L. crispatus, with no other sample
436 greater than 83.1%. Four completed L. crispatus genomes were also included in the analysis
437 (Table 4) for a total of 42 genomes. The pan-genome of L. crispatus was revealed to have
438 7,037 gene families and 972 core genes (Figure 3). Similar to a recent analysis by Pan et al
439 (96), L. crispatus was separated into two main phylogenetic groups, one associated with human
440 vaginal isolates, the other of more mixed provenance, and including chicken, turkey and human
441 gut isolates.
442
443 Table 4 - L crispatus genomes used in pan-genome analysis
444 Lactobacillus crispatus samples (n=42) used in the pan-genome analysis. An accession
445 corresponding to the BioProject, BioSample, and either the NCBI Assembly accession (beginning
446 with GCF) or SRA Experiment accession is given for each sample. The host and source were
447 collected from metadata associated with the BioSample or available publications. In cases when a
448 host and/or source was not explicitly stated, it was inferred from available metadata (denoted by
449 *). bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.
450
BioProject BioSample Accession Host Source Reference
PRJEB8104 SAMEA3319334 ERX1126086 Human* urine*
PRJEB8104 SAMEA3319350 ERX1126089 Human* urine*
PRJEB8104 SAMEA3319265 ERX1126106 Human* urine*
PRJEB8104 SAMEA3319366 ERX1126138 Human* urine*
PRJEB8104 SAMEA3319373 ERX1126140 Human* urine*
PRJEB8104 SAMEA3319383 ERX1126143 Human* urine*
PRJEB8104 SAMEA3319392 ERX1126150 Human* urine*
PRJEB22112 SAMEA104208649 ERX2150228 Human* urine*
PRJEB22112 SAMEA104208650 ERX2150229 Human* urine*
PRJEB3060 SAMEA1920319 ERX271950 Human* unknown
PRJEB3060 SAMEA1920326 ERX271958 Human* unknown
PRJEB3060 SAMEA1920319 ERX450852 Human* unknown
PRJEB3060 SAMEA1920326 ERX450860 Human* unknown
PRJNA50051 SAMN00109860 SRX026143 Human* vaginal*
PRJNA272101 SAMN03854351 SRX1090887 Human urine (97)
PRJNA50053 SAMN00829399 SRX130900 Human* vaginal*
PRJNA50057 SAMN00829123 SRX130912 Human* vaginal*
PRJNA50067 SAMN00829125 SRX130914 Human* vaginal*
PRJNA52107 SAMN01057066 SRX155504 Human* vaginal*
PRJNA52105 SAMN01057067 SRX155505 Human* vaginal*
PRJNA52107 SAMN01057066 SRX155863 Human* vaginal* bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.
PRJNA52105 SAMN01057067 SRX155875 Human* vaginal*
PRJNA379934 SAMN06624125 SRX2660270 Human eye
PRJNA222257 SAMN02369387 SRX456245 Human eye
PRJNA231221 SAMN11056458 SRX5949263 Human vaginal (98)
PRJNA547620 SAMN11973370 SRX5986001 Human vaginal (99)
PRJNA547620 SAMN11973369 SRX5986002 Human vaginal (99)
PRJNA547620 SAMN11973371 SRX5986003 Human vaginal (99)
PRJNA557339 SAMN12395213 SRX6613945 Human vaginal (100)
PRJNA563077 SAMN12667791 SRX6959881 Human gut (96)
PRJNA563077 SAMN12667801 SRX6959883 Chicken gut (96)
PRJNA563077 SAMN12667803 SRX6959885 Human gut (96)
PRJNA563077 SAMN12667804 SRX6959886 Turkey gut (96)
PRJNA563077 SAMN12667805 SRX6959887 Human eye (96)
PRJNA563077 SAMN12667793 SRX6959888 Chicken gut (96)
PRJNA563077 SAMN12667794 SRX6959889 Chicken gut (96)
PRJNA563077 SAMN12667795 SRX6959890 Chicken gut (96)
PRJNA563077 SAMN12667796 SRX6959891 Chicken gut (96)
PRJNA563077 SAMN12667797 SRX6959892 Chicken gut (96)
PRJNA563077 SAMN12667798 SRX6959893 Chicken gut (96)
PRJNA563077 SAMN12667799 SRX6959894 Chicken gut (96)
PRJNA563077 SAMN12667800 SRX6959895 Chicken gut (96)
PRJNA531669 SAMN11372136 GCF_009769205 Chicken gut (101)
PRJNA231221 SAMN11056458 GCF_009730275 Human vaginal (98) bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.
PRJNA431864 SAMN08409124 GCF_003971565 Human vaginal (102)
PRJNA499123 SAMN10343598 GCF_003795065 Human vaginal (103)
451
452 Figure 3 - Core-genome maximum-likelihood phylogeny of Lactobacillus crispatus.
453 A core-genome phylogenetic representation using IQ-Tree (51, 90, 91) of 42 L. crispatus samples .
454 The putatively recombinant positions predicted using ClonalFrameML (36) were removed from thee
455 alignment with maskrc-svg (56). The tree was built from 972 core genes identified by Roary with
456 9,209 parsimony informative sites. The log likelihood score for the consensus tree constructed
457 from 1,000 bootstrap trees was −1,418,106.
458 bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.
459
460 Lastly, we looked at patterns of antibiotic resistance across the genus using a table, generated
461 by the summary BaT, of resistance genes and loci called by AMRFinder+ (104). Only 79 out of
462 1,496 Lactobacillus samples defined by GTDB-Tk (46) were found to have predicted resistance
463 using AMRFinder+. The most common resistance categories were tetracyclines (67 samples),
464 followed by macrolides, lincosamides and aminoglycosides (16, 15 and 11, respectively).
465 Species with the highest proportion of resistance included L. amylovorus (12/14 tetracycline
466 resistant) and L. crispatus (10/42 tetracycline resistant). Only 3 genomes of L. amylophilus were
467 included in the study but each contained matches to genes for macrolide, lincosamide and
468 tetracycline resistance. The linking thread between these species is they are each commonly
469 isolated from agricultural animals. The high proportion of L. crispatus samples isolated from
470 chickens that were tetracycline resistant has been previously observed (105, 106) (Figure 3).
471
472 A recent analysis of 184 Lactobacillus type strain genomes by Campedelli et al (107) found a
473 higher percentage of type strains with aminoglycoside (20/184), tetracycline (18/184),
474 erythromycin (6/184) and clindamycin (60/184) resistance. Forty-two of the type strains had
475 chloramphenicol resistance genes, where here, the AMRFinder+ returned only 1/1,467. These
476 differences probably reflect a combination of the different sampling bias of the studies, and the
477 Campedelli et al strategy of using a relaxed threshold for hits to maximize sensitivity (blastp
478 matches against the CARD database with acid sequence identity of 30% and query coverage of
479 70% (107)). Resistance is probably undercalled in both methods because of a lack of well-
480 characterized resistance loci from the Lactobacillus genus to use as comparison.
481 bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.
482 Conclusions
483 Bactopia is a flexible workflow for bacterial genomics. It can be run on a laptop for a single
484 bacterial sample, but critically, the underlying Nextflow framework allows it to make efficient use
485 of large clusters and cloud-compute environments to process the many thousands of genomes
486 that are currently being generated. For users that are not familiar with bacterial genomic tools
487 and/or who require a standardized pipeline, Bactopia is a “one-stop-shop” that can be easily
488 deployed using conda, Docker and Singularity containers. For researchers with particular
489 interest in individual species or genera, BaDs can be highly customized with taxon-specific
490 databases.
491
492 The current version of Bactopia does not support long-read data, but this is planned for future
493 implementation. We also plan to implement more comparative analyses in the form of additional
494 BaTs. With a framework set in place for developing BaTs, it should be possible to make a
495 toolshed of workflows that not only can be used for all bacteria but are also customized for
496 annotating genes and loci specific for particular species.
497
498 Acknowledgements
499 We would like to thank Torsten Seemann, Oliver Schwengers, Narciso Quijada, Michelle Su,
500 Michelle Wright, Matt Plumb, Sean Wang, Ahmed Babiker and Monica Farley for their helpful
501 suggestions and feedback. We would also like to acknowledge our gratitude to the many
502 scientists and their funders who provided genome sequences to the public domain, ENA and
503 SRA for storing and organizing the data, and the authors of the open source software tools and
504 datasets used in this work.
505 bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.
506 Support for this project came from the award CDC U50CK000485 Emerging Infection Program
507 (EIP) PPHF/ACA: Enhancing Epidemiology and Laboratory Capacity” funding from the Emory
508 Public Health Bioinformatics Fellowship.
509
510 Availability of data
511 Raw Illumina sequences used in this study were acquired from Experiment accessions under
512 the BioProject accessions PRJDB1101, PRJDB1726, PRJDB4156, PRJDB4955, PRJDB5065,
513 PRJDB5206, PRJDB6480, PRJDB6495, PRJEB10572, PRJEB11980, PRJEB14693, PRJEB18589,
514 PRJEB19875, PRJEB21025, PRJEB21680, PRJEB22112, PRJEB22252, PRJEB23845, PRJEB24689,
515 PRJEB24698, PRJEB24699, PRJEB24700, PRJEB24701, PRJEB24713, PRJEB24715, PRJEB25194,
516 PRJEB2631, PRJEB26638, PRJEB2824, PRJEB29398, PRJEB29504, PRJEB2977, PRJEB3012,
517 PRJEB3060, PRJEB31213, PRJEB31289, PRJEB31301, PRJEB31307, PRJEB5094, PRJEB8104,
518 PRJEB8721, PRJEB9718, PRJNA165565, PRJNA176000, PRJNA176001, PRJNA183044,
519 PRJNA184888, PRJNA185359, PRJNA185406, PRJNA185584, PRJNA185632, PRJNA185633,
520 PRJNA188920, PRJNA188921, PRJNA196697, PRJNA212644, PRJNA217366, PRJNA218804,
521 PRJNA219157, PRJNA222257, PRJNA224116, PRJNA227106, PRJNA227335, PRJNA231221,
522 PRJNA234998, PRJNA235015, PRJNA235017, PRJNA247439, PRJNA247440, PRJNA247441,
523 PRJNA247442, PRJNA247443, PRJNA247444, PRJNA247445, PRJNA247446, PRJNA247452,
524 PRJNA254854, PRJNA255080, PRJNA257137, PRJNA257138, PRJNA257139, PRJNA257141,
525 PRJNA257142, PRJNA257182, PRJNA257185, PRJNA257853, PRJNA257876, PRJNA258355,
526 PRJNA258500, PRJNA267549, PRJNA269805, PRJNA269831, PRJNA269832, PRJNA269860,
527 PRJNA269905, PRJNA270961, PRJNA270962, PRJNA270963, PRJNA270964, PRJNA270965,
528 PRJNA270966, PRJNA270967, PRJNA270968, PRJNA270969, PRJNA270970, PRJNA270972,
529 PRJNA270973, PRJNA270974, PRJNA272101, PRJNA272102, PRJNA283920, PRJNA289613,
530 PRJNA29003, PRJNA291681, PRJNA296228, PRJNA296248, PRJNA296274, PRJNA296298,
531 PRJNA296309, PRJNA296751, PRJNA296754, PRJNA298448, PRJNA299992, PRJNA300015,
532 PRJNA300023, PRJNA300088, PRJNA300119, PRJNA300123, PRJNA300179, PRJNA302242, bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.
533 PRJNA303235, PRJNA303236, PRJNA305242, PRJNA306257, PRJNA309616, PRJNA312743,
534 PRJNA315676, PRJNA316969, PRJNA322958, PRJNA322959, PRJNA322960, PRJNA322961,
535 PRJNA336518, PRJNA342061, PRJNA342757, PRJNA347617, PRJNA348789, PRJNA376205,
536 PRJNA377666, PRJNA379934, PRJNA381357, PRJNA382771, PRJNA388578, PRJNA392822,
537 PRJNA397632, PRJNA400793, PRJNA434600, PRJNA436228, PRJNA474823, PRJNA474907,
538 PRJNA476494, PRJNA477598, PRJNA481120, PRJNA484967, PRJNA492883, PRJNA493554,
539 PRJNA496358, PRJNA50051, PRJNA50053, PRJNA50055, PRJNA50057, PRJNA50059, PRJNA50061,
540 PRJNA50063, PRJNA50067, PRJNA50115, PRJNA50117, PRJNA50125, PRJNA50133, PRJNA50135,
541 PRJNA50137, PRJNA50139, PRJNA50141, PRJNA50159, PRJNA50161, PRJNA50163, PRJNA50165,
542 PRJNA50167, PRJNA50169, PRJNA50173, PRJNA504605, PRJNA504734, PRJNA505088,
543 PRJNA52105, PRJNA52107, PRJNA52121, PRJNA525939, PRJNA530250, PRJNA533291,
544 PRJNA533837, PRJNA542049, PRJNA542050, PRJNA542054, PRJNA543187, PRJNA544527,
545 PRJNA547620, PRJNA552757, PRJNA554696, PRJNA554698, PRJNA557339, PRJNA562050,
546 PRJNA563077, PRJNA573690, PRJNA577465, PRJNA578299, PRJNA68459, and PRJNA84.
547
548 Links
549 ● Website/Docs - https://bactopia.github.io/
550 ● Github - https://www.github.com/bactopia/bactopia/
551 ● Zenodo Snapshot - https://doi.org/10.5281/zenodo.3691353
552 ● Bioconda - https://bioconda.github.io/recipes/bactopia/README.html
553 ● Containers
554 ○ Docker - https://cloud.docker.com/u/bactopia/
555 ○ Singularity - https://cloud.sylabs.io/library/rpetit3/bactopia
556 bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.
557 Supplementary Tables
558 Supplementary Table 1 - Lactobacillus excluded from analysis
559
Sample Exclusion Reason
DRX061123 Not processed, reason: Poor estimate of genome size
DRX061147 Not processed, reason: Poor estimate of genome size
DRX143657 Not processed, reason: Poor estimate of genome size
ERX034817 Not processed, reason: Poor estimate of genome size
ERX1046386 Not processed, reason: Low depth of sequencing
ERX1046387 Not processed, reason: Low depth of sequencing
ERX1046388 Not processed, reason: Low depth of sequencing
ERX1046389 Not processed, reason: Low depth of sequencing
Failed to pass minimum cutoffs, reason: Low coverage (14.83x, expect >= 20x);Too
ERX1046390 many contigs (783, expect <= 500)
ERX1046391 Failed to pass minimum cutoffs, reason: Too many contigs (631, expect <= 500)
Failed to pass minimum cutoffs, reason: Low coverage (11.54x, expect >= 20x);Too
ERX1046392 many contigs (755, expect <= 500)
Failed to pass minimum cutoffs, reason: Low coverage (17.59x, expect >= 20x);Too
ERX1046393 many contigs (544, expect <= 500)
ERX1046397 Not processed, reason: Low depth of sequencing
ERX1046398 Not processed, reason: Low depth of sequencing
ERX1046399 Not processed, reason: Low depth of sequencing
ERX1046400 Failed to pass minimum cutoffs, reason: Low coverage (11.97x, expect >= 20x);Too bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.
many contigs (810, expect <= 500)
Failed to pass minimum cutoffs, reason: Low coverage (18.17x, expect >= 20x);Too
ERX1046401 many contigs (559, expect <= 500)
Failed to pass minimum cutoffs, reason: Low coverage (13.30x, expect >= 20x);Too
ERX1046402 many contigs (857, expect <= 500)
ERX1046407 Failed to pass minimum cutoffs, reason: Low coverage (18.63x, expect >= 20x)
ERX1046416 Failed to pass minimum cutoffs, reason: Low coverage (19.20x, expect >= 20x)
ERX1275836 Not processed, reason: Poor estimate of genome size
ERX1275870 Not processed, reason: Low number of reads;Low depth of sequencing
ERX1275925 Not processed, reason: Poor estimate of genome size
ERX1275959 Not processed, reason: Low number of reads;Low depth of sequencing
ERX178660 Not processed, reason: Poor estimate of genome size
ERX178661 Failed to pass minimum cutoffs, reason: Too many contigs (1079, expect <= 500)
ERX178663 Not processed, reason: Poor estimate of genome size
ERX2438180 Not processed, reason: Poor estimate of genome size
ERX271964 Failed to pass minimum cutoffs, reason: Too many contigs (900, expect <= 500)
ERX271983 Failed to pass minimum cutoffs, reason: Too many contigs (515, expect <= 500)
ERX2866884 Not processed, reason: Poor estimate of genome size
ERX303991 Not processed, reason: Poor estimate of genome size
ERX359696 Not processed, reason: Poor estimate of genome size
ERX359703 Not processed, reason: Poor estimate of genome size
ERX359718 Not processed, reason: Poor estimate of genome size
ERX359724 Not processed, reason: Low number of reads;Low depth of sequencing bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.
ERX359774 Not processed, reason: Poor estimate of genome size
ERX359777 Failed to pass minimum cutoffs, reason: Too many contigs (674, expect <= 500)
ERX399759 Not processed, reason: Poor estimate of genome size
ERX450866 Failed to pass minimum cutoffs, reason: Too many contigs (902, expect <= 500)
ERX450885 Failed to pass minimum cutoffs, reason: Too many contigs (554, expect <= 500)
ERX529055 Not processed, reason: Poor estimate of genome size
ERX529175 Not processed, reason: Poor estimate of genome size
SRX059832 Failed to pass minimum cutoffs, reason: Short read length (35.00bp, expect >= 49 bp)
SRX130889 Not processed, reason: Low depth of sequencing
SRX130890 Not processed, reason: Low depth of sequencing
SRX130892 Failed to pass minimum cutoffs, reason: Too many contigs (618, expect <= 500)
SRX130895 Failed to pass minimum cutoffs, reason: Too many contigs (1050, expect <= 500)
SRX130896 Failed to pass minimum cutoffs, reason: Too many contigs (610, expect <= 500)
Failed to pass minimum cutoffs, reason: Low coverage (18.31x, expect >= 20x);Too
SRX130897 many contigs (940, expect <= 500)
SRX130898 Not processed, reason: Low depth of sequencing
Failed to pass minimum cutoffs, reason: Low coverage (17.02x, expect >= 20x);Too
SRX130903 many contigs (677, expect <= 500)
SRX130904 Failed to pass minimum cutoffs, reason: Too many contigs (719, expect <= 500)
SRX130910 Failed to pass minimum cutoffs, reason: Low coverage (18.77x, expect >= 20x)
SRX130911 Failed to pass minimum cutoffs, reason: Too many contigs (948, expect <= 500)
SRX130916 Failed to pass minimum cutoffs, reason: Low coverage (16.95x, expect >= 20x)
SRX1490246 Not processed, reason: Low number of reads bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.
SRX1684187 Not processed, reason: Poor estimate of genome size
SRX1684189 Not processed, reason: Poor estimate of genome size
SRX1687039 Not processed, reason: Poor estimate of genome size
SRX1842889 Failed to pass minimum cutoffs, reason: Too many contigs (1016, expect <= 500)
SRX1949176 Not processed, reason: Poor estimate of genome size
Failed to pass minimum cutoffs, reason: Short read length (35.00bp, expect >= 49
SRX1950180 bp);Too many contigs (962, expect <= 500)
SRX1950181 Failed to pass minimum cutoffs, reason: Short read length (35.00bp, expect >= 49 bp)
SRX1970179 Not processed, reason: Poor estimate of genome size
SRX200229 Not processed, reason: Poor estimate of genome size
SRX200230 Not processed, reason: Poor estimate of genome size
SRX247326 Not processed, reason: Poor estimate of genome size
SRX2582464 Not processed, reason: Poor estimate of genome size
SRX2940498 Not processed, reason: Poor estimate of genome size
SRX2940499 Not processed, reason: Poor estimate of genome size
SRX3075577 Not processed, reason: Poor estimate of genome size
Failed to pass minimum cutoffs, reason: Low coverage (18.90x, expect >= 20x);Too
SRX3146668 many contigs (784, expect <= 500)
SRX3155954 Not processed, reason: Poor estimate of genome size
SRX318181 Failed to pass minimum cutoffs, reason: Too many contigs (692, expect <= 500)
SRX377715 Failed to pass minimum cutoffs, reason: Too many contigs (787, expect <= 500)
SRX422995 Not processed, reason: Low number of reads;Low depth of sequencing
SRX4336892 Not processed, reason: Poor estimate of genome size bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.
SRX4526088 Failed to pass minimum cutoffs, reason: Low coverage (16.08x, expect >= 20x)
SRX4526092 Not processed, reason: Paired-end read count mismatch
SRX4997533 Not processed, reason: Poor estimate of genome size
SRX5116733 Not processed, reason: Poor estimate of genome size
SRX5116734 Not processed, reason: Poor estimate of genome size
SRX5116735 Not processed, reason: Poor estimate of genome size
SRX5116736 Not processed, reason: Poor estimate of genome size
SRX5116737 Not processed, reason: Poor estimate of genome size
SRX5395058 Not processed, reason: Poor estimate of genome size
SRX5489807 Not processed, reason: Poor estimate of genome size
SRX5992480 Failed to pass minimum cutoffs, reason: Too many contigs (843, expect <= 500)
SRX5992985 Failed to pass minimum cutoffs, reason: Too many contigs (1220, expect <= 500)
Failed to pass minimum cutoffs, reason: Low coverage (9.19x, expect >= 20x);Too many
SRX5992988 contigs (1440, expect <= 500)
SRX5992989 Not processed, reason: Poor estimate of genome size
SRX6454026 Failed to pass minimum cutoffs, reason: Too many contigs (533, expect <= 500)
SRX6454030 Failed to pass minimum cutoffs, reason: Too many contigs (796, expect <= 500)
SRX6454035 Failed to pass minimum cutoffs, reason: Too many contigs (687, expect <= 500)
SRX6454040 Failed to pass minimum cutoffs, reason: Too many contigs (792, expect <= 500)
SRX6454043 Failed to pass minimum cutoffs, reason: Too many contigs (801, expect <= 500)
Failed to pass minimum cutoffs, reason: Low coverage (16.59x, expect >= 20x);Too
SRX6959882 many contigs (855, expect <= 500)
SRX7004909 Failed to pass minimum cutoffs, reason: Too many contigs (999, expect <= 500) bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.
SRX761992 Failed to pass minimum cutoffs, reason: Low coverage (18.30x, expect >= 20x)
Failed to pass minimum cutoffs, reason: Low coverage (15.33x, expect >= 20x);Too
SRX762105 many contigs (560, expect <= 500)
SRX762278 Not processed, reason: Poor estimate of genome size
SRX762279 Failed to pass minimum cutoffs, reason: Low coverage (16.51x, expect >= 20x)
SRX762385 Failed to pass minimum cutoffs, reason: Too many contigs (1132, expect <= 500)
SRX762507 Failed to pass minimum cutoffs, reason: Low coverage (16.75x, expect >= 20x)
SRX762687 Failed to pass minimum cutoffs, reason: Low coverage (19.54x, expect >= 20x)
560
561 Supplementary Table 2 - Samples with a non-Lactobacillus taxonomic classification
562
sample sra gtdb
ERX2000473 Lactobacillus jensenii Actinomyces viscosus
ERX1275832 Lactobacillus rhamnosus Aerococcus urinae
ERX1275921 Lactobacillus rhamnosus Aerococcus urinae
ERX2000484 Lactobacillus iners Bifidobacterium vaginale
ERX1275866 Lactobacillus rhamnosus Bifidobacterium vaginale
ERX1275955 Lactobacillus rhamnosus Bifidobacterium vaginale
ERX1275856 Lactobacillus fermentum Campylobacter ureolyticus
ERX1275945 Lactobacillus fermentum Campylobacter ureolyticus
SRX301579 Lactobacillus catenefornis DSM 20559 Eggerthia catenaformis bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.
ERX2000470 Lactobacillus jensenii Facklamia hominis
ERX034808 Lactobacillus sp. KLE1615 sp900066985
SRX2118809 Lactobacillus rogosae Lachnospira rogosae
ERX1275881 Lactobacillus jensenii Microbacterium sp001595495
ERX1275970 Lactobacillus jensenii Microbacterium sp001595495
SRX963052 Lactobacillus sp. HMSC12B03 Neisseria bacilliformis
ERX1275835 Lactobacillus iners Pseudoglutamicibacter cumminsii
ERX1275924 Lactobacillus iners Pseudoglutamicibacter cumminsii
ERX178670 Lactobacillus casei Staphylococcus epidermidis
SRX244574 Lactobacillus gasseri ADL-351 Streptococcus agalactiae
ERX1275883 Lactobacillus sp. Streptococcus agalactiae
ERX1275972 Lactobacillus sp. Streptococcus agalactiae
ERX2000463 Lactobacillus johnsonii Streptococcus parasanguinis
ERX2000481 Lactobacillus crispatus Streptococcus pasteurianus
ERX3310397 Lactobacillus acetotolerans Streptococcus pneumoniae
ERX3310420 Lactobacillus acidophilus Streptococcus pneumoniae
ERX3310369 Lactobacillus agilis Streptococcus pneumoniae
ERX3310458 Lactobacillus alimentarius Streptococcus pneumoniae
ERX3310319 Lactobacillus amylophilus Streptococcus pneumoniae
ERX3310443 Lactobacillus amylovorus Streptococcus pneumoniae
ERX3310479 Lactobacillus animalis Streptococcus pneumoniae bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.
ERX3310410 Lactobacillus aviarius Streptococcus pneumoniae
ERX3310408 Lactobacillus bifermentans Streptococcus pneumoniae
ERX3310303 Lactobacillus brevis Streptococcus pneumoniae
ERX3310449 Lactobacillus buchneri Streptococcus pneumoniae
ERX3310425 Lactobacillus casei Streptococcus pneumoniae
ERX3310459 Lactobacillus coryniformis Streptococcus pneumoniae
ERX3310264 Lactobacillus delbrueckii subsp. bulgaricus Streptococcus pneumoniae
ERX3310316 Lactobacillus delbrueckii Streptococcus pneumoniae
ERX3310336 Lactobacillus farciminis Streptococcus pneumoniae
ERX3310280 Lactobacillus fermentum Streptococcus pneumoniae
ERX3310372 Lactobacillus fructivorans Streptococcus pneumoniae
ERX3310460 Lactobacillus helveticus Streptococcus pneumoniae
ERX3310328 Lactobacillus hilgardii Streptococcus pneumoniae
ERX3310291 Lactobacillus mali Streptococcus pneumoniae
ERX3310432 Lactobacillus oris Streptococcus pneumoniae
ERX3310305 Lactobacillus paracasei Streptococcus pneumoniae
ERX3310374 Lactobacillus pentosus Streptococcus pneumoniae
ERX3310445 Lactobacillus plantarum Streptococcus pneumoniae
ERX3310452 Lactobacillus ruminis Streptococcus pneumoniae
ERX3310353 Lactobacillus sakei Streptococcus pneumoniae
ERX3310387 Lactobacillus sakei L45 Streptococcus pneumoniae bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.
ERX3310490 Lactobacillus salivarius Streptococcus pneumoniae
ERX3310309 Lactobacillus sanfranciscensis Streptococcus pneumoniae
ERX3310430 Lactobacillus sharpeae Streptococcus pneumoniae
ERX3310388 Lactobacillus sp. 30A Streptococcus pneumoniae
ERX3310483 Lactobacillus vaginalis Streptococcus pneumoniae
ERX3310456 Lactobacillus vermiforme Streptococcus pneumoniae
ERX2000443 Lactobacillus gasseri Winkia sp002849225
563 bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.
564 Supplementary Figures
565 Supplementary Figure 1 - Bactopia Analysis Pipeline Workflow
566
567 bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.
568 Supplementary Figure 2 - Sequencing quality ranks per year 2011–2019 of
569 Lactobacillus genome projects.
570 Genome projects were grouped into three increasing quality ranks: Bronze, Silver and Gold. The
571 rank was based on coverage, read length, per-read quality and total assembled contigs. The
572 highest rank, Gold, represented 62% (n=967) of the available Lactobacillus genome projects. The
573 remaining genomes were 25% Silver (n=386) and 13% Bronze (n=205). Between the years of 2011
574 and 2019, Gold rank samples consistently outnumbered Silver and Bronze except for 2011 and
575 2015 . However it is likely the total number of Gold rank samples is underrepresented due
576 coverage reduction being based on the estimated genome size.
577
578
579
580
581 bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.
582 Supplementary Figure 3 - Comparison of estimated genome size and assembled
583 genome size.
584 The assembled genome size (Y-axis) and estimated genome size (X-axis) was plotted for each
585 sample. The color of the dot is determined by the rank of the sample. The solid black line
586 represents a 1:1 ratio between the assembled genome size and the estimated genome size. The
587 genome size was estimated for each sample by Mash (7) using the raw sequences.
588
589
590
591
592 bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.
593 Supplementary Figure 4 - Assembled genome sizes are consistent within species.
594
595
596
597 Supplementary Data
598 Supplementary Data 1 - Results returned after querying ENA for Lactobacillus
599
600 Supplementary Data 2 - SRA/ENA Experiment accessions processed by Bactopia
601
602 Supplementary Data 3 - Nexflow runtime report for Lactobacillus genomes processed
603 by Bactopia
604 bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.
605 References
606 1. Grüning B, Dale R, Sjödin A, Rowe J, Chapman BA, Tomkins-Tinch CH, Valieris R, The
607 Bioconda Team, Köster J. 2017. Bioconda: A sustainable and comprehensive software
608 distribution for the life sciences. bioRxiv.
609 2. Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. 2017.
610 Nextflow enables reproducible computational workflows. Nat Biotechnol 35:316–319.
611 3. Petit RA 3rd, Read TD. 2018. Staphylococcus aureus viewed from the perspective of
612 40,000+ genomes. PeerJ 6:e5261.
613 4. Parks DH, Chuvochina M, Waite DW, Rinke C, Skarshewski A, Chaumeil P-A, Hugenholtz
614 P. 2018. A standardized bacterial taxonomy based on genome phylogeny substantially
615 revises the tree of life. Nat Biotechnol 36:996–1004.
616 5. Jain C, Rodriguez-R LM, Phillippy AM, Konstantinidis KT, Aluru S. 2018. High throughput
617 ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nat Commun
618 9:5114.
619 6. Page AJ, Cummins CA, Hunt M, Wong VK, Reuter S, Holden MTG, Fookes M, Falush D,
620 Keane JA, Parkhill J. 2015. Roary: Rapid large-scale prokaryote pan genome analysis.
621 Bioinformatics.
622 7. Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, Phillippy AM.
623 2016. Mash: fast genome and metagenome distance estimation using MinHash. Genome
624 Biol 17:132.
625 8. O’Leary NA, Wright MW, Brister JR, Ciufo S, Haddad D, McVeigh R, Rajput B, Robbertse
626 B, Smith-White B, Ako-Adjei D, Astashyn A, Badretdin A, Bao Y, Blinkova O, Brover V, bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.
627 Chetvernin V, Choi J, Cox E, Ermolaeva O, Farrell CM, Goldfarb T, Gupta T, Haft D,
628 Hatcher E, Hlavina W, Joardar VS, Kodali VK, Li W, Maglott D, Masterson P, McGarvey
629 KM, Murphy MR, O’Neill K, Pujar S, Rangwala SH, Rausch D, Riddick LD, Schoch C,
630 Shkeda A, Storz SS, Sun H, Thibaud-Nissen F, Tolstoy I, Tully RE, Vatsan AR, Wallin C,
631 Webb D, Wu W, Landrum MJ, Kimchi A, Tatusova T, DiCuccio M, Kitts P, Murphy TD,
632 Pruitt KD. 2016. Reference sequence (RefSeq) database at NCBI: current status,
633 taxonomic expansion, and functional annotation. Nucleic Acids Res 44:D733–45.
634 9. Galata V, Fehlmann T, Backes C, Keller A. 2019. PLSDB: a resource of complete bacterial
635 plasmids. Nucleic Acids Res 47:D195–D202.
636 10. Titus Brown C, Irber L. 2016. sourmash: a library for MinHash sketching of DNA. JOSS
637 1:27.
638 11. Clark K, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW. 2016. GenBank. Nucleic Acids
639 Res 44:D67–72.
640 12. Hunt M, Mather AE, Sánchez-Busó L, Page AJ, Parkhill J, Keane JA, Harris SR. 2017.
641 ARIBA: rapid antimicrobial resistance genotyping directly from sequencing reads. Microb
642 Genom 3:e000131.
643 13. Gupta SK, Padmanabhan BR, Diene SM, Lopez-Rojas R, Kempf M, Landraud L, Rolain J-
644 M. 2014. ARG-ANNOT, a new bioinformatic tool to discover antibiotic resistance genes in
645 bacterial genomes. Antimicrob Agents Chemother 58:212–220.
646 14. McArthur AG, Waglechner N, Nizam F, Yan A, Azad MA, Baylay AJ, Bhullar K, Canova MJ,
647 De Pascale G, Ejim L, Kalan L, King AM, Koteva K, Morar M, Mulvey MR, O’Brien JS,
648 Pawlowski AC, Piddock LJ V, Spanogiannopoulos P, Sutherland AD, Tang I, Taylor PL,
649 Thaker M, Wang W, Yan M, Yu T, Wright GD. 2013. The comprehensive antibiotic bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.
650 resistance database. Antimicrob Agents Chemother 57:3348–3357.
651 15. Lakin SM, Dean C, Noyes NR, Dettenwanger A, Ross AS, Doster E, Rovira P, Abdo Z,
652 Jones KL, Ruiz J, Belk KE, Morley PS, Boucher C. 2017. MEGARes: an antimicrobial
653 resistance database for high throughput sequencing. Nucleic Acids Res 45:D574–D580.
654 16. Feldgarden M, Brover V, Haft DH, Prasad AB, Slotta DJ, Tolstoy I, Tyson GH, Zhao S, Hsu
655 C-H, McDermott PF, Tadesse DA, Morales C, Simmons M, Tillman G, Wasilenko J, Folster
656 JP, Klimke W. 2019. Validating the NCBI AMRFinder Tool and Resistance Gene Database
657 Using Antimicrobial Resistance Genotype-Phenotype Correlations in a Collection of
658 NARMS Isolates. Antimicrob Agents Chemother.
659 17. Carattoli A, Zankari E, García-Fernández A, Voldby Larsen M, Lund O, Villa L, Møller
660 Aarestrup F, Hasman H. 2014. In silico detection and typing of plasmids using
661 PlasmidFinder and plasmid multilocus sequence typing. Antimicrob Agents Chemother
662 58:3895–3903.
663 18. Zankari E, Hasman H, Cosentino S, Vestergaard M, Rasmussen S, Lund O, Aarestrup FM,
664 Larsen MV. 2012. Identification of acquired antimicrobial resistance genes. J Antimicrob
665 Chemother 67:2640–2644.
666 19. Inouye M, Dashnow H, Raven L-A, Schultz MB, Pope BJ, Tomita T, Zobel J, Holt KE. 2014.
667 SRST2: Rapid genomic surveillance for public health and hospital microbiology labs.
668 Genome Med 6:90.
669 20. Chen L, Zheng D, Liu B, Yang J, Jin Q. 2016. VFDB 2016: hierarchical and refined dataset
670 for big data analysis--10 years on. Nucleic Acids Res 44:D694–7.
671 21. Joensen KG, Scheutz F, Lund O, Hasman H, Kaas RS, Nielsen EM, Aarestrup FM. 2014.
672 Real-time whole-genome sequencing for routine typing, surveillance, and outbreak bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.
673 detection of verotoxigenic Escherichia coli. J Clin Microbiol 52:1501–1510.
674 22. Jolley KA, Bray JE, Maiden MCJ. 2018. Open-access bacterial population genomics:
675 BIGSdb software, the PubMLST.org website and their applications. Wellcome Open Res
676 3:124.
677 23. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL. 2009.
678 BLAST+: architecture and applications. BMC Bioinformatics 10:421.
679 24. Blin K. ncbi-genome-download - Scripts to download genomes from the NCBI FTP servers.
680 Github.
681 25. Cock PJA, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, Friedberg I, Hamelryck T,
682 Kauff F, Wilczynski B, de Hoon MJL. 2009. Biopython: freely available Python tools for
683 computational molecular biology and bioinformatics. Bioinformatics 25:1422–1423.
684 26. Li W, Godzik A. 2006. Cd-hit: a fast program for clustering and comparing large sets of
685 protein or nucleotide sequences. Bioinformatics 22:1658–1659.
686 27. Fu L, Niu B, Zhu Z, Wu S, Li W. 2012. CD-HIT: accelerated for clustering the next-
687 generation sequencing data. Bioinformatics 28:3150–3152.
688 28. Seemann T. 2014. Prokka: rapid prokaryotic genome annotation. Bioinformatics 30:2068–
689 2069.
690 29. Laslett D, Canback B. 2004. ARAGORN, a program to detect tRNA genes and tmRNA
691 genes in nucleotide sequences. Nucleic Acids Res 32:11–16.
692 30. Petit RA III. assembly-scan: generate basic stats for an assembly. Github.
693 31. Seemann T. Barrnap: Bacterial ribosomal RNA predictor. Github. bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.
694 32. Bushnell B. BBMap short read aligner, and other bioinformatic tools. SourceForge.
695 33. Danecek P. BCFtools - Utilities for variant calling and manipulating VCFs and BCFs.
696 Github.
697 34. Quinlan AR, Hall IM. 2010. BEDTools: a flexible suite of utilities for comparing genomic
698 features. Bioinformatics 26:841–842.
699 35. Li H, Durbin R. 2010. Fast and accurate long-read alignment with Burrows-Wheeler
700 transform. Bioinformatics 26:589–595.
701 36. Didelot X, Wilson DJ. 2015. ClonalFrameML: efficient inference of recombination in whole
702 bacterial genomes. PLoS Comput Biol 11:e1004041.
703 37. Iannone R. 2018. DiagrammeR: Graph/network visualization. R package 1.
704 38. Miller CS, Baker BJ, Thomas BC, Singer SW, Banfield JF. 2011. EMIRGE: reconstruction
705 of full-length ribosomal genes from microbial community short read sequencing data.
706 Genome Biol 12:R44.
707 39. Price MN, Dehal PS, Arkin AP. 2010. FastTree 2 – Approximately Maximum-Likelihood
708 Trees for Large Alignments. PLoS One 5:e9490.
709 40. Petit RA III. fastq-dl - Download FASTQ files from SRA or ENA repositories. Github.
710 41. Andrews S, Krueger F, Seconds-Pichon A, Biggins F, Wingett S. 2016. FastQC A Quality
711 Control tool for High Throughput Sequence Data. Babraham Bioinformatics. 2012.
712 42. Petit RA III. fastq-scan: generate summary statistics of input FASTQ sequences. Github.
713 43. Magoč T, Salzberg SL. 2011. FLASH: fast length adjustment of short reads to improve
714 genome assemblies. Bioinformatics 27:2957–2963. bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.
715 44. Garrison E, Marth G. 2012. Haplotype-based variant detection from short-read sequencing.
716 arXiv [q-bioGN].
717 45. Tange O. 2018. GNU Parallel 2018.
718 46. Chaumeil P-A, Mussig AJ, Hugenholtz P, Parks DH. 2019. GTDB-Tk: a toolkit to classify
719 genomes with the Genome Taxonomy Database. Bioinformatics.
720 47. Eddy SR. 2009. A new generation of homology search tools based on probabilistic
721 inference. Genome Inform 23:205–211.
722 48. Eddy SR. 2011. Accelerated Profile HMM Searches. PLoS Comput Biol 7:e1002195.
723 49. Wheeler TJ, Eddy SR. 2013. nhmmer: DNA homology search with profile HMMs.
724 Bioinformatics 29:2487–2489.
725 50. Nawrocki EP, Eddy SR. 2013. Infernal 1.1: 100-fold faster RNA homology searches.
726 Bioinformatics 29:2933–2935.
727 51. Nguyen L-T, Schmidt HA, von Haeseler A, Minh BQ. 2015. IQ-TREE: a fast and effective
728 stochastic algorithm for estimating maximum-likelihood phylogenies. Mol Biol Evol 32:268–
729 274.
730 52. Hawkey J, Hamidian M, Wick RR, Edwards DJ, Billman-Jacobe H, Hall RM, Holt KE. 2015.
731 ISMapper: identifying transposase insertion sites in bacterial genomes from short read
732 sequence data. BMC Genomics 16:667.
733 53. Song L, Florea L, Langmead B. 2014. Lighter: fast and memory-efficient sequencing error
734 correction without counting. Genome Biol 15:509.
735 54. Katoh K, Standley DM. 2013. MAFFT multiple sequence alignment software version 7: bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.
736 improvements in performance and usability. Mol Biol Evol 30:772–780.
737 55. Ondov BD, Starrett GJ, Sappington A, Kostic A, Koren S, Buck CB, Phillippy AM. 2019.
738 Mash Screen: high-throughput sequence containment estimation for genome discovery.
739 Genome Biol 20:232.
740 56. Kwong J. maskrc-svg - Masks recombination as detected by ClonalFrameML or Gubbins
741 and draws an SVG. Github.
742 57. Turner I, Garimella KV, Iqbal Z, McVean G. 2018. Integrating long-range connectivity
743 information into de Bruijn graphs. Bioinformatics 34:2556–2565.
744 58. Li D, Liu C-M, Luo R, Sadakane K, Lam T-W. 2015. MEGAHIT: an ultra-fast single-node
745 solution for large and complex metagenomics assembly via succinct de Bruijn graph.
746 Bioinformatics 31:1674–1676.
747 59. Skennerton CT. MinCED: Mining CRISPRs in Environmental Datasets. Github.
748 60. Li H. 2018. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics
749 34:3094–3100.
750 61. Gruber-Vodicka HR, Seah BKB, Pruesse E. 2019. phyloFlash – Rapid SSU rRNA profiling
751 and targeted assembly from metagenomes. bioRxiv.
752 62. Adler M. 2015. pigz: A parallel implementation of gzip for modern multi-processor, multi-
753 core machines. Jet Propulsion Laboratory.
754 63. Walker BJ, Abeel T, Shea T, Priest M, Abouelliel A, Sakthikumar S, Cuomo CA, Zeng Q,
755 Wortman J, Young SK, Earl AM. 2014. Pilon: An Integrated Tool for Comprehensive
756 Microbial Variant Detection and Genome Assembly Improvement. PLoS One 9:e112963. bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.
757 64. Matsen FA, Kodner RB, Armbrust EV. 2010. pplacer: linear time maximum-likelihood and
758 Bayesian phylogenetic placement of sequences onto a fixed reference tree. BMC
759 Bioinformatics 11:538.
760 65. Hyatt D, Chen G-L, Locascio PF, Land ML, Larimer FW, Hauser LJ. 2010. Prodigal:
761 prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics
762 11:119.
763 66. Seemann T. Samclip: Filter SAM file for soft and hard clipped alignments. Github.
764 67. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin
765 R, 1000 Genome Project Data Processing Subgroup. 2009. The Sequence Alignment/Map
766 format and SAMtools. Bioinformatics 25:2078–2079.
767 68. Li H. 2012. seqtk Toolkit for processing sequences in FASTA/Q formats. GitHub.
768 69. Seemann T. Shovill: De novo assembly pipeline for Illumina paired reads. Github.
769 70. Souvorov A, Agarwala R, Lipman DJ. 2018. SKESA: strategic k-mer extension for
770 scrupulous assemblies. Genome Biol 19:153.
771 71. Seemann T. Snippy: fast bacterial variant calling from NGS reads. Github.
772 72. Cingolani P, Platts A, Wang LL, Coon M, Nguyen T, Wang L, Land SJ, Lu X, Ruden DM.
773 2012. A program for annotating and predicting the effects of single nucleotide
774 polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118;
775 iso-2; iso-3. Fly 6:80–92.
776 73. Seemann T. snp-dists - Pairwise SNP distance matrix from a FASTA sequence alignment.
777 Github. bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.
778 74. Page AJ, Taylor B, Delaney AJ, Soares J, Seemann T, Keane JA, Harris SR. 2016. SNP-
779 sites: rapid efficient extraction of SNPs from multi-FASTA alignments. Microb Genom
780 2:e000056.
781 75. Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, Lesin VM, Nikolenko
782 SI, Pham S, Prjibelski AD, Pyshkin AV. SPAdes: A New Genome Assembly Algorithm and
783 Its Applications to Single-Cell Sequencing | Journal of Computational Biology. Mary Ann
784 Liebert, Inc, publishers.
785 76. Bolger AM, Lohse M, Usadel B. 2014. Trimmomatic: a flexible trimmer for Illumina
786 sequence data. Bioinformatics 30:2114–2120.
787 77. Petit RA III. VCF-Annotator: Add biological annotations to variants in a VCF file. Github.
788 78. Vcflib: A C++ library for parsing and manipulating VCF files. Github.
789 79. Zerbino DR, Birney E. 2008. Velvet: Algorithms for de novo short read assembly using de
790 Bruijn graphs. Genome Res 18:821–829.
791 80. Rognes T, Flouri T, Nichols B, Quince C, Mahé F. 2016. VSEARCH: a versatile open
792 source tool for metagenomics. PeerJ 4:e2584.
793 81. Tan A, Abecasis GR, Kang HM. 2015. Unified representation of genetic variants.
794 Bioinformatics 31:2202–2204.
795 82. Toribio AL, Alako B, Amid C, Cerdeño-Tarrága A, Clarke L, Cleland I, Fairley S, Gibson R,
796 Goodgame N, Ten Hoopen P, Jayathilaka S, Kay S, Leinonen R, Liu X, Martínez-Villacorta
797 J, Pakseresht N, Rajan J, Reddy K, Rosello M, Silvester N, Smirnov D, Vaughan D, Zalunin
798 V, Cochrane G. 2017. European Nucleotide Archive in 2016. Nucleic Acids Res 45:D32–
799 D36. bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.
800 83. Bennett GM, Moran NA. 2013. Small, smaller, smallest: the origins and evolution of ancient
801 dual symbioses in a Phloem-feeding insect. Genome Biol Evol 5:1675–1688.
802 84. Li H. 2013. Aligning sequence reads, clone sequences and assembly contigs with BWA-
803 MEM. arXiv [q-bioGN].
804 85. Schwengers O, Hoek A, Fritzenwanker M, Falgenhauer L, Hain T, Chakraborty T,
805 Goesmann A. 2019. ASA3P: An automatic and scalable pipeline for the assembly,
806 annotation and higher level analysis of closely related bacterial isolates. bioRxiv.
807 86. Quijada NM, Rodríguez-Lázaro D, Eiros JM, Hernández M. 2019. TORMES: an automated
808 pipeline for whole bacterial genome analysis. Bioinformatics 35:4207–4212.
809 87. da Silva A Bulach DM Schultz MB Kwong JC Howden BP. STG. Nullarbor. Github.
810 88. Fettweis JM, Serrano MG, Brooks JP, Edwards DJ, Girerd PH, Parikh HI, Huang B, Arodz
811 TJ, Edupuganti L, Glascock AL, Xu J, Jimenez NR, Vivadelli SC, Fong SS, Sheth NU, Jean
812 S, Lee V, Bokhari YA, Lara AM, Mistry SD, Duckworth RA, Bradley SP, Koparde VN,
813 Orenda XV, Milton SH, Rozycki SK, Matveyev AV, Wright ML, Huzurbazar SV, Jackson
814 EM, Smirnova E, Korlach J, Tsai Y-C, Dickinson MR, Brooks JL, Drake JI, Chaffin DO,
815 Sexton AL, Gravett MG, Rubens CE, Wijesooriya NR, Hendricks-Muñoz KD, Jefferson KK,
816 Strauss JF, Buck GA. 2019. The vaginal microbiome and preterm birth. Nat Med.
817 89. Yelin I, Flett KB, Merakou C, Mehrotra P, Stam J, Snesrud E, Hinkle M, Lesho E, McGann
818 P, McAdam AJ, Sandora TJ, Kishony R, Priebe GP. 2019. Genomic and epidemiological
819 evidence of bacterial transmission from probiotic capsule to blood in ICU patients. Nat Med.
820 90. Kalyaanamoorthy S, Minh BQ, Wong TKF, von Haeseler A, Jermiin LS. 2017. ModelFinder:
821 fast model selection for accurate phylogenetic estimates. Nat Methods 14:587–589. bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.
822 91. Hoang DT, Chernomor O, von Haeseler A, Minh BQ, Vinh LS. 2018. UFBoot2: Improving
823 the Ultrafast Bootstrap Approximation. Mol Biol Evol 35:518–522.
824 92. Quast C, Pruesse E, Yilmaz P, Gerken J, Schweer T, Yarza P, Peplies J, Glöckner FO.
825 2013. The SILVA ribosomal RNA gene database project: improved data processing and
826 web-based tools. Nucleic Acids Res 41:D590–6.
827 93. Letunic I, Bork P. 2016. Interactive tree of life (iTOL) v3: an online tool for the display and
828 annotation of phylogenetic and other trees. Nucleic Acids Res 44:W242–5.
829 94. Wittouck S, Wuyts S, Meehan CJ, van Noort V, Lebeer S. 2019. A Genome-Based Species
830 Taxonomy of the Lactobacillus Genus Complex. mSystems 4.
831 95. Tanizawa Y, Tada I, Kobayashi H, Endo A, Maeno S, Toyoda A, Arita M, Nakamura Y,
832 Sakamoto M, Ohkuma M, Tohno M. 2018. Lactobacillus paragasseri sp. nov., a sister taxon
833 of Lactobacillus gasseri, based on whole-genome sequence analyses. Int J Syst Evol
834 Microbiol 68:3512–3517.
835 96. Pan M, Hidalgo-Cantabrana C, Barrangou R. 2020. Host and body site-specific adaptation
836 of Lactobacillus crispatus genomes. NAR Genom Bioinform 2.
837 97. Weimer CM, Deitzler GE, Robinson LS, Park S, Hallsworth-Pepin K, Wollam A, Mitreva M,
838 Lewis WG, Lewis AL. 2016. Genome Sequences of 12 Bacterial Isolates Obtained from the
839 Urine of Pregnant Women. Genome Announc 4.
840 98. Sichtig H, Minogue T, Yan Y, Stefan C, Hall A, Tallon L, Sadzewicz L, Nadendla S, Klimke
841 W, Hatcher E, Shumway M, Aldea DL, Allen J, Koehler J, Slezak T, Lovell S, Schoepp R,
842 Scherf U. 2019. FDA-ARGOS is a database with public quality-controlled reference
843 genomes for diagnostic use and regulatory science. Nat Commun 10:3313. bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.
844 99. Bassis CM, Bullock KA, Sack DE, Saund K, Pirani A, Snitkin ES, Alaniz VI, Quint EH,
845 Young VB, Bell JD. 2019. Evidence that vertical transmission of the vaginal microbiota can
846 persist into adolescence. bioRxiv.
847 100. Clabaut M, Boukerb AM, Racine P-J, Pichon C, Kremser C, Picot J-P, Karsybayeva M,
848 Redziniak G, Chevalier S, Feuilloley MGJ. 2020. Draft Genome Sequence of Lactobacillus
849 crispatus CIP 104459, Isolated from a Vaginal Swab. Microbiol Resour Announc 9.
850 101. Richards PJ, Flaujac Lafontaine GM, Connerton PL, Liang L, Asiani K, Fish NM,
851 Connerton IF. 2020. Galacto-Oligosaccharides Modulate the Juvenile Gut Microbiome and
852 Innate Immunity To Improve Broiler Chicken Performance. mSystems 5.
853 102. Chang D-H, Rhee M-S, Lee S-K, Chung I-H, Jeong H, Kim B-C. 2019. Complete
854 Genome Sequence of Lactobacillus crispatus AB70, Isolated from a Vaginal Swab from a
855 Healthy Pregnant Korean Woman. Microbiol Resour Announc 8.
856 103. McComb E, Holm J, Ma B, Ravel J. 2019. Complete Genome Sequence of Lactobacillus
857 crispatus CO3MRSI1. Microbiol Resour Announc 8.
858 104. Feldgarden M, Brover V, Haft DH, Prasad AB, Slotta DJ, Tolstoy I, Tyson GH, Zhao S,
859 Hsu C-H, McDermott PF, Tadesse DA, Morales C, Simmons M, Tillman G, Wasilenko J,
860 Folster JP, Klimke W. 2019. Using the NCBI AMRFinder Tool to Determine Antimicrobial
861 Resistance Genotype-Phenotype Correlations Within a Collection of NARMS Isolates.
862 bioRxiv.
863 105. Cauwerts K, Pasmans F, Devriese LA, Haesebrouck F, Decostere A. 2006. Cloacal
864 Lactobacillus isolates from broilers often display resistance toward tetracycline antibiotics.
865 Microb Drug Resist 12:284–288.
866 106. Dec M, Urban-Chmiel R, Stępień-Pyśniak D, Wernicki A. 2017. Assessment of antibiotic bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.
867 susceptibility in Lactobacillus isolates from chickens. Gut Pathog 9:54.
868 107. Campedelli I, Mathur H, Salvetti E, Clarke S, Rea MC, Torriani S, Ross RP, Hill C,
869 O’Toole PW. 2018. Genus-wide assessment of antibiotic resistance in Lactobacillus spp.
870 Appl Environ Microbiol.
871