bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

1 Bactopia: a flexible pipeline for complete

2 analysis of bacterial

3

4

5 Robert A. Petit III1 and Timothy D. Read1

6 1Division of Infectious Diseases, Department of Medicine, Emory University School of

7 Medicine.

8

9 Corresponding Author:

10 Timothy Read1

11

12 Email address: [email protected]

13

14 Abstract

15 of bacterial genomes using Illumina technology has become such a standard

16 procedure that often data are generated faster than can be conveniently analyzed. We created

17 a new series of pipelines called Bactopia, built using Nextflow workflow software, to provide

18 efficient comparative genomic analyses for bacterial species or genera. Bactopia consists of a

19 dataset setup step (Bactopia Datasets; BaDs) where a series of customizable datasets are

20 created for the species of interest; the Bactopia Analysis Pipeline (BaAP), which performs

21 quality control, assembly and several other functions based on the available datasets bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

22 and outputs the processed data to a structured directory format; and a series of Bactopia Tools

23 (BaTs) that perform specific post-processing on some or all of the processed data. BaTs include

24 pan-genome analysis, computing average nucleotide identity between samples, extracting and

25 profiling the 16S and taxonomic classification using highly conserved genes. It is

26 expected that the number of BaTs will increase to fill specific applications in the future. As a

27 demonstration, we performed an analysis of 1,664 public Lactobacillus genomes, focusing on L.

28 crispatus, a species that is a common part of the human vaginal . Bactopia is an

29 open source system that can scale from projects as small as one bacterial genome to

30 thousands that allows for great flexibility in choosing comparison datasets and options for

31 downstream analysis. Bactopia code can be accessed at

32 https://www.github.com/bactopia/bactopia.

33

34 Introduction

35 Sequencing a bacterial genome, an activity that once required the infrastructure of a dedicated

36 genome center, is now a routine task that even a small laboratory can undertake. A large

37 number of open source software tools have been created to handle various parts of the process

38 of using raw read data for functions such as SNP calling and de novo assembly. As a result of

39 dedicated community efforts, it has recently become much easier to locally install these

40 bioinformatic tools through package managers (Bioconda (1), Brew) or through the use of

41 software containers (Docker, Singularity). Despite these advances, producers of bacterial

42 sequence data face a bewildering array of choices when considering how to perform analysis,

43 particularly when large numbers of genomes are involved and processing efficiency and

44 scalability become major factors.

45 bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

46 Efficient bacterial multi-genome analysis has been hampered by three missing functionalities.

47 First, is the need to have “workflows of workflows'' that can integrate analyses and provide a

48 simplified way to start with a collection of raw genome data, remove low quality sequences and

49 perform the basic analytic steps of de novo assembly, mapping to reference sequence and

50 taxonomic assignment. Second, is the desire to incorporate user-specific knowledge of the

51 species into the input of the main genome analysis pipeline. While many microbiologists are not

52 expert bioinformaticians, they are experts in the organisms they study. Third, is the need to

53 create an output format from the main pipeline that could be used for future customized

54 downstream analysis such as pan-genome analysis and basic visualization of phylogenies.

55

56 Here we introduce Bactopia, an integrated suite of workflows for flexible analysis of Illumina

57 genome sequencing projects of from the same taxon. Bactopia is based on the

58 Nextflow workflow software (2), and is designed to be scalable, allowing projects as small as a

59 single genome to be run on a local desktop, or many thousands of genomes to be run as a

60 batch on a cloud infrastructure. Running multiple tasks on a single platform standardizes the

61 underlying data quality used for and variant calling between projects run in different

62 laboratories. This structure also simplifies the user experience. In Bactopia, complex multi-

63 genome analysis can be run in a small number of commands. However, there are myriad

64 options for fine-tuning datasets used for analysis and the functions of the system. The

65 underlying Nextflow structure ensures reproducibility. To illustrate the functionality of the system

66 we performed a Bactopia analysis of 1,664 public genome projects of the Lactobacillus genus,

67 an important component of the microbiome of humans and animals.

68 bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

69 Design and implementation

70 Bactopia links together open source software, available from Bioconda (1), using

71 Nextflow (2). Nextflow was chosen for its flexibility: Bactopia can be run locally, on clusters, or

72 on cloud platforms with simple parameter changes. It also manages the parallel execution of

73 tasks and creates checkpoints allowing users to resume jobs. Nextflow automates installation of

74 the component software of the workflow through integration with Bioconda. For ease of

75 deployment, Bactopia can either be installed through Bioconda, a Docker container, or a

76 Singularity container. All the software programs used by Bactopia (version v1.3.0) described in

77 this manuscript are listed in Table 1 with their individual version numbers.

78

79 There are three main components of Bactopia (Figure 1): 1) Bactopia Datasets (BaDs), a

80 framework for formatting organism-specific datasets to be used by the downstream analysis

81 pipeline; 2) Bactopia Analysis Pipeline (BaAP), a customizable workflow for the analysis of

82 individual bacterial genome projects that is an extension and generalization of the previously

83 published Staphylococcus aureus-specific “Staphopia Analysis Pipeline” (StAP) (3). The inputs

84 to BaAP are FASTQ files from bacterial Illumina sequencing projects, either imported from the

85 National Centers for Biotechnology Information (NCBI) Short Read Archive (SRA) database or

86 provided locally, and any reference data in the BaDs; 3) Bactopia Tools (BaTs), a set of

87 workflows that use the output files from a BaAP project to run genomic analysis on multiple

88 genomes. For this project we used BaTs to a) summarize the results of running multiple

89 bacterial genomes through BaAP, b) extract 16S sequences and create a phylogeny, c) assign

90 taxonomic classifications with the Genome Taxonomy Database (4), d) subset Lactobacillus

91 crispatus samples by average nucleotide identity (ANI) with FastANI (5), e) run pan-genome

92 analysis for L. crispatus using Roary (6) and create a core-genome phylogeny. bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

93 Figure 1 - Bactopia overview

94

95

96 Bactopia Datasets

97 The Bactopia pipeline can be run without downloading and formatting Bactopia Datasets

98 (BaDs). However, providing them enriches the downstream analysis. Bactopia can import

99 specific existing public datasets, as well as accessible user-provided datasets in the appropriatete

100 format. A subcommand (bactopia datasets) was created to automate downloading, building, andnd

101 (or) configuring these datasets for Bactopia.

102

103 BaDs can be grouped into those that are general and those that are user-supplied. General

104 datasets include: a Mash (7) sketch of the NCBI RefSeq (8) and PLSDB (9) databases and a

105 Sourmash (10) signature of microbial genomes (includes viral and fungal) from the NCBI

106 GenBank (11) database. Ariba (12), a software for detecting genes in raw read (FASTQ) files,

107 uses a number of default reference databases for virulence and antibiotic resistance. The bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

108 available Ariba datasets include: ARG-ANNOT (13), CARD (14), MEGARes (15), NCBI

109 Reference Gene Catalog (16), plasmidfinder (17), resfinder (18), SRST2 (19), VFDB (20), and

110 VirulenceFinder (21).

111

112 When an organism name is provided, additional datasets are set up. If a MLST schema is

113 available for the species, it is downloaded from PubMLST.org (22) and set up for BLAST+ (23)

114 and Ariba. Each RefSeq completed genome for the species is downloaded using ncbi-genome-

115 download (24). A Mash sketch is created from the set of downloaded completed genomes to be

116 used for automatic reference selection for variant calling. sequences are extracted from

117 each genome with BioPython (25), clustered using CD-HIT (26, 27) and formatted to be used by

118 Prokka (28) for annotation. Users may also provide their own organism-specific reference

119 datasets to be used for BLAST+ alignment, short-read alignment, or variant calling.

120 Bactopia Analysis Pipeline

121 The Bactopia Analysis Pipeline (BaAP) takes input FASTQ files and optional user-specified

122 BaDs and performs a number of workflows that are based on either de novo whole genome

123 assembly, reference mapping or sequence decomposition (ie kmer-based approaches) (Figure

124 1). BaAP has incorporated numerous existing bioinformatic tools (Table 1) into its workflow

125 (Supplementary Figure 1). For each tool, many of the input parameters are exposed to the

126 user, allowing for fine-tuning analysis.

127

128

129 Table 1 - List of bioinformatic tools used by Bactopia

130 A list of bioinformatic tools used throughout version 1.3.0 of the Bactopia Analysis Pipeline and

131 Bactopia Tools.

132 bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

Name Version Description Link Ref.

AMRFinder+ 3.6.7 Find acquired antimicrobial resistance https://github.com/ncbi/amr (16)

genes and some point mutations in

protein or assembled nucleotide

sequences

Aragorn 1.2.38 Finds transfer RNA features (tRNA) http://130.235.244.92/ARAGORN/Downloads/ (29)

Ariba 2.14.4 Antimicrobial Resistance Identification https://github.com/sanger-pathogens/ariba (12)

By Assembly assembly-scan 0.3.0 Generate basic stats for an assembly https://github.com/rpetit3/assembly-scan (30)

Barrnap 0.9 Bacterial ribosomal RNA predictor https://github.com/tseemann/barrnap (31)

BBMap 38.76 A suite of fast, multithreaded https://jgi.doe.gov/data-and-tools/bbtools/ (32)

bioinformatics tools designed for

analysis of DNA and RNA sequence

data.

BCFtools 1.9 Utilities for variant calling and https://github.com/samtools/bcftools (33)

manipulating VCFs and BCFs.

Bedtools 2.29.2 A powerful toolset for genome https://github.com/arq5x/bedtools2 (34)

arithmetic.

BioPython 1.76 Tools for biological computation written https://github.com/biopython/biopython (25)

in Python

BLAST+ 2.9.0 Basic Local Alignment Search Tool https://blast.ncbi.nlm.nih.gov/Blast.cgi (23)

BWA 0.7.17 Burrows-Wheeler Aligner for short-read https://github.com/lh3/bwa/ (35)

alignment

CD-HIT 4.8.1 Accelerated for clustering the next- https://github.com/weizhongli/cdhit (26, 27)

generation sequencing data bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

ClonalFrameM 1.12 Efficient Inference of Recombination in https://github.com/xavierdidelot/ClonalFrameML (36)

L Whole Bacterial Genomes

DiagrammeR 1.0.0 Graph and network visualization using https://github.com/rich-iannone/DiagrammeR (37)

tabular data in R

EMIRGE 0.61.1 Reconstructs full length ribosomal genes https://github.com/csmiller/EMIRGE (38)

from short read sequencing data

FastANI 1.3 Fast Whole-Genome Similarity (ANI) https://github.com/ParBLiSS/FastANI (5)

Estimation

FastTree 2 2.1.10 Approximately-maximum-likelihood http://www.microbesonline.org/fasttree (39)

phylogenetic trees from alignments of

nucleotide or protein sequences fastq-dl 1.0.3 Download FASTQ files from SRA or https://github.com/rpetit3/fastq-dl (40)

ENA repositories

FastQC 0.11.9 A quality control analysis tool for high https://github.com/s-andrews/FastQC (41)

throughput sequencing data. fastq-scan 0.4.2 Output FASTQ summary statistics in https://github.com/rpetit3/fastq-scan (42)

JSON format

FLASH 1.2.11 A fast and accurate tool to merge https://ccb.jhu.edu/software/FLASH/ (43)

paired-end reads freebayes 1.3.2 Bayesian haplotype-based genetic https://github.com/ekg/freebayes (44)

polymorphism discovery and genotyping

GNU Parallel 20200122 A shell tool for executing jobs in parallel https://www.gnu.org/software/parallel/ (45)

GTDB-tk 1.0.2 A toolkit for assigning objective https://github.com/Ecogenomics/GTDBTk (46)

taxonomic classifications to bacterial

and archaeal genomes bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

HMMER 3.3 Biosequence analysis using profile http://hmmer.org/ (47–49)

hidden Markov models

Infernal 1.1.2 Searches DNA sequence databases for http://eddylab.org/infernal/ (50)

RNA structure and sequence similarities

IQ-TREE 1.6.12 Efficient phylogenomic software by https://github.com/Cibiv/IQ-TREE (51)

maximum likelihood

ISMapper 2.0 IS mapping software https://github.com/jhawkey/IS_mapper (52)

Lighter 1.1.2 Fast and memory-efficient sequencing https://github.com/mourisl/Lighter (53)

error corrector

MAFFT 7.455 Multiple alignment program for amino https://mafft.cbrc.jp/alignment/software/ (54)

acid or nucleotide sequences

Mash 2.2.2 Fast genome and metagenome distance https://github.com/marbl/Mash (7, 55)

estimation using MinHash maskrc-svg 0.5 Masks recombination as detected by https://github.com/kwongj/maskrc-svg (56)

ClonalFrameML or Gubbins and draws

an SVG

McCortex 1.0 De novo genome assembly and https://github.com/mcveanlab/mccortex (57)

multisample variant calling

MEGAHIT 1.2.9 Ultra-fast and memory-efficient (meta- https://github.com/voutcn/megahit (58)

)genome assembler

MinCED 0.4.2 Mining in Environmental https://github.com/ctSkennerton/minced (59)

Datasets

Minimap2 2.17 A versatile pairwise aligner for genomic https://github.com/lh3/minimap2 (60)

and spliced nucleotide sequences ncbi-genome- 0.2.12 Scripts to download genomes from the https://github.com/kblin/ncbi-genome-download (24) bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

download NCBI FTP servers

Nextflow 19.10.0 A DSL for data-driven computational https://github.com/nextflow-io/nextflow (2)

pipelines phyloFlash 3.3b3 Rapidly reconstruct the SSU rRNAs and https://github.com/HRGV/phyloFlash (61)

explore phylogenetic composition of an

illumina (meta)genomic dataset.

Pigz 2.3.4 A parallel implementation of gzip for https://zlib.net/pigz/ (62)

modern multi-processor, multi-core

machines

Pilon 1.23 An automated genome assembly https://github.com/broadinstitute/pilon/ (63)

improvement and variant detection tool pplacer 1.1.alpha Phylogenetic placement and https://github.com/matsen/pplacer (64)

19 downstream analysis

Prodigal 2.6.3 Fast, reliable protein-coding gene https://github.com/hyattpd/Prodigal (65)

prediction for prokaryotic genomes.

Prokka 1.4.5 Rapid prokaryotic genome annotation https://github.com/tseemann/prokka (28)

Roary 3.13.0 Rapid large-scale pan https://github.com/sanger-pathogens/Roary (6)

genome analysis samclip 0.2 Filter SAM file for soft and hard clipped https://github.com/tseemann/samclip (66)

alignments

SAMtools 1.9 Tools for manipulating next-generation https://github.com/samtools/samtools (67)

sequencing data

Seqtk 1.3 A fast and lightweight tool for processing https://github.com/lh3/seqtk (68)

sequences in the FASTA or FASTQ

format bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

Shovill 1.0.9se Faster assembly of Illumina reads https://github.com/tseemann/shovill (69)

SKESA 2.3.0 Strategic Kmer Extension for Scrupulous https://github.com/ncbi/SKESA (70)

Assemblies

Snippy 4.4.5 Rapid haploid variant calling and core https://github.com/tseemann/snippy (71)

genome alignment

SnpEff 4.3.1 Genomic variant annotations and http://snpeff.sourceforge.net/ (72)

functional effect prediction toolbox snp-dists 0.6.3 Pairwise SNP distance matrix from a https://github.com/tseemann/snp-dists (73)

FASTA sequence alignment

SNP-sites 2.5.1 Rapidly extracts SNPs from a multi- https://github.com/sanger-pathogens/snp-sites (74)

FASTA alignment

Sourmash 3.2.0 Compute and compare MinHash https://github.com/dib-lab/sourmash (10)

signatures for DNA data sets

SPAdes 3.13.0 An assembly toolkit containing various https://github.com/ablab/spades (75)

assembly pipelines

Trimmomatic 0.39 A flexible read trimming tool for Illumina http://www.usadellab.org/cms/index.php (76)

NGS data vcf-annotator 0.5 Add biological annotations to variants in https://github.com/rpetit3/vcf-annotator (77)

a VCF file

Vcflib 1.0.0rc3 A simple C++ library for parsing and https://github.com/vcflib/vcflib (78)

manipulating VCF files

Velvet 1.2.10 Short read de novo assembler using de https://github.com/dzerbino/velvet (79)

Bruijn graphs

VSEARCH 2.14.1 Versatile open-source tool for https://github.com/torognes/vsearch (80)

bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

vt 2015.11.1 A tool set for short variant discovery in https://github.com/atks/vt (81)

0 genetic sequence data

133

134 BaAP: Acquiring FASTQs

135 Bactopia provides multiple ways for users to provide their FASTQ formatted sequences. Input

136 FASTQs can be local or downloaded from public repositories.

137

138 Local sequences can be processed one at a time or in batches. To process a single sample, the

139 user provides the path to their FASTQ(s) and a sample name. For multiple samples, this

140 method does not make efficient use of Nextflow’s queue system. Alternatively, users can

141 provide a “file of filenames” (FOFN), which is a tab-delimited file with information about samples

142 and paths to the corresponding FASTQ(s). By using the FOFN method, Nextflow queues each

143 sample and makes efficient use of available resources. A subcommand (bactopia prepare) was

144 created to automate the creation of a FOFN.

145

146 Raw sequences available from public repositories (e.g. European Nucleotide Archive (ENA),

147 Sequence Read Archive (SRA), or DNA Data Bank of Japan (DDBJ)) can also be processed by

148 Bactopia. Sequences associated with a provided Experiment accession (e.g. DRX, ERX, SRX)

149 are downloaded and processed exactly as local sequences would be. A subcommand (bactopia

150 search) was created, which allows users to query ENA to create a list of Experiment accessions

151 from the ENA Data Warehouse API (82) associated with a BioProject accession, Taxon ID, or

152 organism name.

153 bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

154 BaAP: Validating FASTQs

155 The path for input FASTQ(s) is validated, and if necessary, sequences from public repositories

156 are downloaded using fastq-dl (40). Once validated, the FASTQ input(s) are tested to determine

157 if they meet a minimum threshold for continued processing. All BaAP steps expect to use

158 Illumina sequence data, which is the great majority of genome projects currently generated.

159 FASTQ files that are explicitly marked as non-Illumina or have properties that suggest non-

160 Illumina (e.g read-length, error profile) are excluded. By default, input FASTQs must exceed

161 2,241,820 bases (20x coverage of the smallest bacterial genome, Nasuia deltocephalinicola

162 (83)) and 7,472 reads (minimum required base pairs / 300 bp, the longest available reads from

163 Illumina). If estimated, the genome size must be between 100,000 bp and 18,040,666 bp which

164 is based off the range of known bacterial genome sizes (N. deltocephalinicola,

165 GCF_000442605, 112,091 bp; Minicystis rosea, GCF_001931535, 16,040,666 bp). Failure to

166 pass these requirements excludes the samples from further subsequent analysis. The threshold

167 values can be adjusted by the user at runtime.

168

169 BaAP: FastQ Quality Control and generation of pFASTQ

170 Input FASTQs that pass the validation steps undergo quality control steps to remove poor

171 quality reads. BBDuk, a component of BBTools (32), removes Illumina adapters, phiX

172 contaminants and filters reads based on length and quality. Base calls are corrected using

173 Lighter (53). At this stage, the default procedure is to downsample the FASTQ file to an average

174 100x genome coverage (if over 100x) with Reformat (from BBTools). This step, which was used

175 in StAP (3), significantly saves compute time at little final cost to assembly or SNP calling

176 accuracy. The genome size for coverage calculation is either provided by the user or estimated

177 based on the FASTQ data by Mash (7). The user can provide their own value for downsampling

178 FASTQs or disable it completely. Summary statistics before and after QC are created using bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

179 FastQC (41) and fastq-scan (42). After QC, the original FASTQs are no longer used and only

180 the processed FASTQs (pFASTQ) are used in subsequent analysis.

181

182 BaAP: Assembly, Reference Mapping and Decomposition

183 BaAP uses Shovill (69) to create a draft de novo assembly with MEGAHIT (58), SKESA (70)

184 (default), SPAdes (75) or Velvet (79) and make corrections using Pilon (63) from the pFASTQ.

185 Summary statistics for the draft assembly are created using assembly-scan (30). If the total size

186 of the draft assembly fails to meet a user-specified minimum size, further assembly-based

187 analyses are discontinued. Otherwise, a BLAST+ (23) nucleotide database is created from the

188 contigs. The draft assembly is also annotated using Prokka (28). If available at runtime, Prokka

189 will first annotate with a clustered RefSeq protein set, followed by its default databases. The

190 annotated genes and are then subjected to antimicrobial resistance prediction with

191 AMRFinder+ (16).

192

193 For each pFASTQ, sketches are created using Mash (k=21,31) and Sourmash (10)

194 (k=21,31,51). McCortex (57) is used to count 31-mers in the pFASTQ.

195

196 BaAP: Optional Steps

197 At runtime, Bactopia checks for BaDs specified by the command line (if any) and adjusts the

198 settings of the pipeline accordingly. Examples of processes executed only if a BaDs is specified

199 include Ariba (12) analysis for each available reference dataset; sequence containment

200 screening against RefSeq (8) with mash screen (55) and against GenBank (11) with sourmash

201 lca gather (10), and PLSDB (9), with mash screen and BLAST+. The sequence type (ST) of the

202 sample is determined with BLAST+ and Ariba. The nearest reference RefSeq genome, based

203 on mash (7) distance, is downloaded with ncbi-genome-download (24), and variants are called bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

204 with Snippy (71). Alternatively, one or more reference genomes can be provided by the user.

205 Users can also provide sequences for: sequence alignment and per-base coverage with BWA

206 (35, 84) and Bedtools (34), BLAST+ alignment, or insertion site determination with ISMapper

207 (52).

208

209 Bactopia Tools

210 After BaAP has successfully completed, it will create a directory for each strain with

211 subdirectories for each analysis result. The directory structure is independent of project or

212 options chosen. Bactopia Tools (BaTs) are a set of comparative analysis workflows written

213 using Nextflow that take advantage of the predictable output structure from BaAP. Each BaT is

214 created from the same framework and a subcommand (bactopia tools create) is available to

215 simplify the creation of future BaTs.

216

217 Five BaTs were used for analyses in this article. The summary BaT, outputs a summary report

218 of the set of samples and a list of samples that failed to meet thresholds set by the user. This

219 summary includes basic sequence and assembly stats as well as technical (pass/fail)

220 information. The roary BaT creates a pan-genome of the set of samples with Roary (6), with the

221 option to include RefSeq (8) completed genomes. The fastani BaT determines the pairwise

222 average nucleotide identity (ANI) for each sample with FastANI (5). The phyloflash BaT

223 reconstructs 16S rRNA gene sequences with phyloFlash (61). The gtdb BaT assigns taxonomic

224 classifications from the Genome Taxonomy Database (GTDB) (4) with GTDB-tk (46). Each

225 Bactopia tool has a separate Nextflow workflow with its own conda environment, Docker image

226 and Singularity image.

227 bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

228 Comparison to Similar Open Source Software

229 At the time of writing (February 2020), we knew of only three other actively-maintained open

230 source generalist bacterial genomic workflow softwares that encompassed a similar range of

231 functionality to Bactopia: ASA3P (85), TORMES (86) and the currently unpublished Nullarbor

232 (87). The versions of these programs used many of the same component software (e.g. Prokka,

233 SPAdes, BLAST+, Roary) but differed in the philosophies underlying their design (Table 2). This

234 made head-to-head runtime comparisons somewhat meaningless, as each was aimed at

235 different analysis scenarios and produced different output. Bactopia was the most open-ended

236 and flexible, allowing the user to customize input databases and providing a platform for

237 downstream analysis by different BaTs rather than built-in pangenome and phylogeny creation.

238 Bactopia also had some features not implemented in the others, such as SRA/ENA search and

239 download and automated reference genome selection for identifying variants. Both Bactopia

240 and ASA3P are highly scalable, each can be seamlessly executed on local, cluster, and cloud

241 environments with little effort required by the user. ASA3P was the only program to implement

242 long-read data assembly. TORMES was the only program to include a user customizable

243 RMarkdown for report and to have optional analyses specifically for Escherichia and

244 Salmonella. Nullarbor was the only program to implement a pre-screen method for filtering out

245 potential biological outliers prior to full analysis.

246

247 Table 2 - A comparison of bacterial genome analysis workflows.

Feature Bactopia ASA3P Nullarbor TORMES

Version 1.3.0 1.2.2 2.0.20191013 1.0

Release Date February 19th, 2020 November 20th, 2019 October 13th, 2019 February 4th, 2019 bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

Latest Commit February 23th, 2020 February 19th, 2020 February 23th, 2020 January 30th, 2020

Sequence Illumina Illumina, Nanopore, Illumina Illumina

Technology PacBio

Single End Reads Yes Yes No No

Workflow Nextflow Groovy Perl + Make Bash

Resume If Stopped Yes No Yes No

Reuse Existing Yes No Yes No

Runs For

Expanded Analysis

Built In HPC Yes Yes No No

Cluster and Cloud

Capability

Individual Program Yes No Yes No

Adjustable

Parameters

Batch processing Yes Yes Yes Yes from Config File

Single Sample Yes No Yes No

Processing from

Command Line bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

Sequence Depth Yes No Yes No

Downsample

Automatic Yes No No No

Reference

Selection For

Variant Detection

Data Download Yes No No No from SRA/ENA

Species k-mers,16S, ANI k-mers, 16S, ANI k-mers k-mers

Identification

Comparative Separate Process Built In Process Built In Process Built In Process

Analysis

Summary Text HTML HTML R Markdown

Package Manager Bioconda - Bioconda & Brew Conda YAML

Container Available Yes Yes Yes No

Documentation Website PDF Manual README README

Github Repository bactopia/bactopia oschwengers/asap tseemann/nullarbor nmquijada/tormes

248

249 Use case: the Lactobacillus genus

250 We performed a Bactopia analysis of publicly-available raw Illumina data labelled as belonging

251 to the Lactobacillus genus. Lactobacillus is an important component of the human microbiome, bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

252 and cultured samples have been sequenced by several research groups over the past few

253 years. L. crispatus and other species are often the majority bacterial genus of the human vagina

254 and are associated with low pH and reduction in pathogen burden (88). Samples of the genus

255 are used in the food industry for fermentation in the production of yoghurt, kimchi, kombucha

256 and other common items. Lactobacillus is a common probiotic, although recent genomic-based

257 transmission studies showed that bloodstream infections can follow after ingestion by

258 immunocompromised patients (89).

259

260 Running BaAP

261 In November 2019, we initiated Bactopia analysis using the following three commands:

262

263 # Build Lactobacillus dataset

264 bactopia datasets ~/bactopia-datasets --species 'Lactobacillus' --

265 include_genus --cpus 10

266

267 # Query ENA for all Lactobacillus (tax id 1578) sequence projects

268 bactopia search 1578 --prefix lactobacillus

269 # this creates a file called ‘lactobacillus-accessions.txt’

270

271 # Process Lactobacillus samples

272 mkdir ~/bactopia

273 cd ~/bactopia

274 bactopia --accessions ~/lactobacillus-accessions.txt --datasets ~/bactopia-

275 datasets --species lactobacillus --coverage 100 --cpus 4 --min_genome_size

276 1000000 --max_genome_size 4200000 bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

277

278 The bactopia datasets subcommand automated downloading of BaDs. With these parameters,

279 we downloaded and formatted the following datasets: Ariba (12) reference databases for the

280 Comprehensive Antibiotic Resistance Database (CARD) and the core Virulence Factor

281 Database (VFDB) (14, 20), RefSeq Mash sketch (7, 8), GenBank Sourmash signatures (10, 11),

282 PLSDB database and sketch (9), and a clustered protein set and Mash sketch from

283 completed Lactobacillus genomes available from NCBI Assembly (RefSeq). This took 25

284 minutes to complete.

285

286 The bactopia search subcommand produced a list of 2,030 Experiment accessions that had

287 been labelled as “Lactobacillus” (taxonomy ID: 1578) (Supplementary Data 1). After filtering for

288 only Illumina sequencing, 1,664 Experiment accessions remained (Supplementary Data 2).

289

290 The main bactopia command automated BaAP processing of the list of accessions using the

291 downloaded BaTs. Here we chose a standard maximum coverage per genome of 100x, based

292 on the estimated genome size. We used the range of genome sizes (1.2 Mb to 3.7Mb) for the

293 completed Lactobacillus genomes to require that the estimated genome size for each sample be

294 between 1Mbp and 4.2 Mbp.

295

296 Samples were processed on a 96-core SLURM cluster with 512 GB of available RAM. Analysis

297 took approximately 2.5 days to complete, with an estimated runtime of 30 minutes per sample,

298 (determined by adding up the median process runtime, 17 total, in BaAP). No individual process

299 used more than 8GB of memory, with all but 5 using less than 1 GB. Nextflow (2) recorded

300 detailed statistics on resource usage, including CPU, memory, job duration and I/O.

301 (Supplementary Data 3).

302 bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

303 Analysis of Lactobacillus genomes using BaTs

304 The BaAP outputted a directory of directories named after the unique Experiment accessions for

305 each sample. Within each sample directory were subdirectories for the output of each analysis

306 run. This data structure was recognized by BaTs for subsequent analysis.

307

308 We used BaT summary to generate a summary report of our analysis. The report included an

309 overview of sequence quality, assembly statistics, and predicted antimicrobial resistances and

310 virulence factors. It also output a list of samples that failed to meet minimum sequencing depth

311 and/or quality thresholds.

312

313 bactopia tools summary --bactopia ~/bactopia --prefix lactobacillus

314

315 BaT summary grouped samples as “Gold”, “Silver”, “Bronze”, “Exclude” or “QC Failure”, based

316 on BaAP completion, minimum sequencing coverage, per-read sequencing mean quality,

317 minimum mean read length, and assembly quality (Table 3; Supplementary Figure 2). We

318 used the default values for these cutoffs to group our samples. Gold samples had greater than

319 100x coverage, per-read mean quality greater than Q30, mean read length greater than 95bp,

320 and an assembly with less than 100 contigs. Silver samples had greater than 50x coverage,

321 per-read mean quality greater than Q20, mean read length greater than 75bp, and an assembly

322 with less than 200 contigs. Bronze samples had greater than 20x coverage, per-read mean

323 quality greater than Q12, mean read length greater than 49bp, and an assembly with less than

324 500 contigs. 106 samples (Exclude and QC Failure) were excluded from further analysis

325 (Supplementary Table 1). 48 samples that failed to meet the minimum thresholds for Bronze

326 quality were grouped as Exclude. 58 samples that were not processed by BaAP due to

327 sequencing related errors or the estimated genome size were grouped as QC Failure. Of these, bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

328 one (Experiment accession SRX4526092) was labeled as paired-end but did not have both sets

329 of reads, one (Experiment accession SRX1490246) was identified to be an assembly converted

330 to FASTQ format, and 14 had insufficient sequencing depth. The remaining 42 samples,

331 unprocessed by BaAP, had an estimated genome size which exceeded 4.2Mbp (set at runtime).

332 We queried these samples against available GenBank and RefSeq sketches using Mash screen

333 and Sourmash lca gather. There were 36 samples that contained evidence for Lactobacillus but

334 also sequences for other bacterial species, phage, viruses and plants genomes. There were 6

335 samples that contained no evidence for Lactobacillus, 4 had matches to multiple bacterial

336 species and 2 to only Saccharomyces cerevisiae.

337

338 There were 1,558 Gold, Silver or Bronze quality samples (Table 3) used for further analysis. For

339 these we found that, on average, the assembled genome size was about 12% smaller than the

340 estimated genome size (Supplementary Figure 3, Table 3). If we assume the assembled

341 genome size is a better indicator of a sample’s genome size, the average coverage before QC

342 increases from 220x to 268x. In this use case, the Lactobacillus genus, it was necessary to

343 estimate genome sizes, but when dealing with samples from a single species it may be better to

344 provide a known genome size.

345

346 Table 3 - Summary of Lactobacillus genome sequencing projects quality and coverage

Quality Count Original Post-Bactopia Per-Read Read Contig Percent of

Rank Coverage Coverage Quality Length Count Assembled Genome

(Median) (Median) Score (Median) (Median) Size compared to

(Median) Estimated Genome

Size (Median) bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

Gold 967 213x 100x Q35 100 52 92%

Silver 386 160x 100x Q35 100 110 93%

Bronze 205 102x 100x Q34 100 90 93%

Exclude 48 26x 22x Q34 100 706 93%

QC 58 ------

Failure

347

348 For visualization of the phylogenetic relationships of the samples, we used the phyloflash and

349 gtdb BaTs.

350

351 bactopia tools phyloflash --phyloflash ~/bactopia-datasets/16s/138 --bactopia

352 ~/bactopia --cpus 16 --exclude ~/bactopia-tool/summary/lactobacillus-

353 exclude.txt

354

355 bactopia tools gtdb --gtdb ~/bactopia-datasets/gtdb/db --bactopia ~/bactopia

356 --cpus 48 --exclude ~/bactopia-tool/summary/lactobacillus-exclude.txt

357

358 The gtdb BaT used GTDB-Tk (46) to assign a taxonomic classification to each sample. GTDB-

359 Tk used the assembly to predict genes with Prodigal (65), identify marker genes (4) for

360 phylogenetic inference with HMMER3 (47), and find the maximum-likelihood placement of each

361 sample on the GTDB-Tk reference tree with pplacer (64). A taxonomic classification was

362 assigned to 1,554 samples, and 4 samples failed classification due to insufficient marker gene

363 coverage or marker genes with multiple hits.

364 bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

365 The phyloflash BaT used the phyloFlash tool (61) to reconstruct a 16S rRNA gene from each

366 sample that was used for phylogenetic reconstruction (Figure 2). The 16S rRNA was

367 reconstructed from a SPAdes (75) assembly and annotated against the SILVA (92) rRNA

368 database (v138) database for 1,470 samples. There were 88 samples that were excluded from

369 the phylogeny: 12 samples that did not meet the requirement of a mean read length of 50bp, 17

370 samples in which a 16S could not be reconstructed, 19 samples that had a mismatch in

371 assembly and mapped read taxon designations, and 40 samples that had 16S genes

372 reconstructed for multiple species. A was created with IQ-TREE (51, 90, 91)

373 based on a multiple sequence alignment of the reconstructed 16S genes with MAFFT (54).

374 Taxonomic classifications from GTDB-Tk were used to annotate the 16S with iTOL (93).

375

376 Figure 2 - Maximum-likelihood phylogeny from reconstructed 16S rRNA genes

377 A phylogenetic representation 1,470 samples using IQ-Tree (51, 90, 91). A) A tree of the full set of

378 samples. The outer ring represents the genus assigned by GTDB-Tk, with Blue being

379 Lactobacillus and all other assignments Red. B) The same tree as A, but with the non-

380 Lactobacillus clade collapsed. Major groups of Lactobacillus groups (depicted with a letter) and

381 the most sequenced Lactobacillus species have been labeled. The inner ring represents the

382 average nucleotide identity (ANI), determined by FastANI (5) of samples to L. crispatus. The tree

383 was built from a multiple sequence alignment (54) of 16S genes reconstructed by phyloFlash (61)

384 with 1,281 parsimony informative sites. The likelihood score for the consensus tree constructed

385 from 1,000 bootstrap trees was -54,698. Taxonomic classifications were assigned by GTDB-Tk

386 (46).

387 bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

388 bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

389

390

391

392 A recent analysis of completed genomes in the NCBI found 239 discontinuous de novo

393 Lactobacillus species using a 94% ANI cutoff (94). Based on GTDB taxonomic classification,

394 which applies a 95% ANI cutoff, we identified 161 distinct Lactobacillus species in 1,554

395 samples. The five most sequenced Lactobacillus species, 45% of the total, were L. rhamnosus

396 (n=225), L. paracasei (n=180), L. gasseri (n=132), L. plantarum (n=86) and L. fermentum

397 (n=80). Within these five species the assembled genomes sizes were remarkably consistent

398 (Supplementary Figure 4). There were 58 samples that were not classified as Lactobacillus, of bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

399 which 34 were classified as Streptococcus pneumoniae by both 16S and GTDB

400 (Supplementary Table 2).

401

402 We found that in 505 (~33%) out of 1,554 taxonomically classifications by 16S and GTDB were

403 in conflict with the taxonomy according to the NCBI SRA, illustrating the importance of the

404 unbiased approach to understanding sample context. In samples that had both a 16S and

405 GTDB taxonomic classification, there was disagreement in 154 out of 1,467 samples. Of these,

406 47% were accounted for by the recently described L. paragasseri (95) (n=72). This possibly

407 highlights a lag in the reclassification of assemblies in the NCBI Assembly database.

408

409 Analysis of the pangenome of the entire genus using a tool such as Roary (6), would return only

410 a few core genes, owing to sequence divergence of evolutionary distant species. However,

411 because the roary BaT can be supplied a list of individual samples it is possible to isolate the

412 analysis to the species level. As an example of using BaTs to focus on a particular group within

413 the larger set of results, we chose L. crispatus, a species commonly isolated from the human

414 vagina and also found in the guts/feces of poultry. We used the fastani BaT to estimate the ANI

415 of all samples against a randomly selected L. crispatus completed genome (NCBI Assembly

416 Accession: GCF_003795065) with FastANI (5). A cutoff of greater than 95% ANI was used to

417 categorize a sample as L. crispatus. A pan-genome analysis was conducted on only the

418 samples categorized as L. crispatus using the roary BaT. The roary BaT downloaded all

419 available completed L. crispatus genomes with ncbi-genome-download (24), formatted the

420 completed genomes with Prokka (28), created a pan-genome with Roary (6), identified and

421 masked recombination with ClonalFrameML (36) and maskrc-svg (56), created a phylogenetic

422 tree with IQ-TREE (51, 90, 91) and a pair-wise SNP distance matrix with snp-dists (73).

423 bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

424 bactopia tools fastani --bactopia ~/bactopia --exclude ~/bactopia-

425 tool/summary/lactobacillus-exclude.txt --accession GCF_003795065.1 --

426 refseq_only --minFraction 0.0

427

428 # Identify samples with > 95% ANI to L. crispatus

429 awk '{if ($3 > 95){print $0}}' ~/bactopia-tool/fastani/fastani.tsv | grep

430 "RX" > ~/crispatus-include.txt

431

432 bactopia tools roary --bactopia ~/bactopia --cpus 20 --include ~/crispatus-

433 include.txt --species "lactobacillus crispatus" --n

434

435 ANI analysis revealed 38 samples as having >96.1% ANI to L. crispatus, with no other sample

436 greater than 83.1%. Four completed L. crispatus genomes were also included in the analysis

437 (Table 4) for a total of 42 genomes. The pan-genome of L. crispatus was revealed to have

438 7,037 gene families and 972 core genes (Figure 3). Similar to a recent analysis by Pan et al

439 (96), L. crispatus was separated into two main phylogenetic groups, one associated with human

440 vaginal isolates, the other of more mixed provenance, and including chicken, turkey and human

441 gut isolates.

442

443 Table 4 - L crispatus genomes used in pan-genome analysis

444 Lactobacillus crispatus samples (n=42) used in the pan-genome analysis. An accession

445 corresponding to the BioProject, BioSample, and either the NCBI Assembly accession (beginning

446 with GCF) or SRA Experiment accession is given for each sample. The host and source were

447 collected from metadata associated with the BioSample or available publications. In cases when a

448 host and/or source was not explicitly stated, it was inferred from available metadata (denoted by

449 *). bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

450

BioProject BioSample Accession Host Source Reference

PRJEB8104 SAMEA3319334 ERX1126086 Human* urine*

PRJEB8104 SAMEA3319350 ERX1126089 Human* urine*

PRJEB8104 SAMEA3319265 ERX1126106 Human* urine*

PRJEB8104 SAMEA3319366 ERX1126138 Human* urine*

PRJEB8104 SAMEA3319373 ERX1126140 Human* urine*

PRJEB8104 SAMEA3319383 ERX1126143 Human* urine*

PRJEB8104 SAMEA3319392 ERX1126150 Human* urine*

PRJEB22112 SAMEA104208649 ERX2150228 Human* urine*

PRJEB22112 SAMEA104208650 ERX2150229 Human* urine*

PRJEB3060 SAMEA1920319 ERX271950 Human* unknown

PRJEB3060 SAMEA1920326 ERX271958 Human* unknown

PRJEB3060 SAMEA1920319 ERX450852 Human* unknown

PRJEB3060 SAMEA1920326 ERX450860 Human* unknown

PRJNA50051 SAMN00109860 SRX026143 Human* vaginal*

PRJNA272101 SAMN03854351 SRX1090887 Human urine (97)

PRJNA50053 SAMN00829399 SRX130900 Human* vaginal*

PRJNA50057 SAMN00829123 SRX130912 Human* vaginal*

PRJNA50067 SAMN00829125 SRX130914 Human* vaginal*

PRJNA52107 SAMN01057066 SRX155504 Human* vaginal*

PRJNA52105 SAMN01057067 SRX155505 Human* vaginal*

PRJNA52107 SAMN01057066 SRX155863 Human* vaginal* bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

PRJNA52105 SAMN01057067 SRX155875 Human* vaginal*

PRJNA379934 SAMN06624125 SRX2660270 Human eye

PRJNA222257 SAMN02369387 SRX456245 Human eye

PRJNA231221 SAMN11056458 SRX5949263 Human vaginal (98)

PRJNA547620 SAMN11973370 SRX5986001 Human vaginal (99)

PRJNA547620 SAMN11973369 SRX5986002 Human vaginal (99)

PRJNA547620 SAMN11973371 SRX5986003 Human vaginal (99)

PRJNA557339 SAMN12395213 SRX6613945 Human vaginal (100)

PRJNA563077 SAMN12667791 SRX6959881 Human gut (96)

PRJNA563077 SAMN12667801 SRX6959883 Chicken gut (96)

PRJNA563077 SAMN12667803 SRX6959885 Human gut (96)

PRJNA563077 SAMN12667804 SRX6959886 Turkey gut (96)

PRJNA563077 SAMN12667805 SRX6959887 Human eye (96)

PRJNA563077 SAMN12667793 SRX6959888 Chicken gut (96)

PRJNA563077 SAMN12667794 SRX6959889 Chicken gut (96)

PRJNA563077 SAMN12667795 SRX6959890 Chicken gut (96)

PRJNA563077 SAMN12667796 SRX6959891 Chicken gut (96)

PRJNA563077 SAMN12667797 SRX6959892 Chicken gut (96)

PRJNA563077 SAMN12667798 SRX6959893 Chicken gut (96)

PRJNA563077 SAMN12667799 SRX6959894 Chicken gut (96)

PRJNA563077 SAMN12667800 SRX6959895 Chicken gut (96)

PRJNA531669 SAMN11372136 GCF_009769205 Chicken gut (101)

PRJNA231221 SAMN11056458 GCF_009730275 Human vaginal (98) bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

PRJNA431864 SAMN08409124 GCF_003971565 Human vaginal (102)

PRJNA499123 SAMN10343598 GCF_003795065 Human vaginal (103)

451

452 Figure 3 - Core-genome maximum-likelihood phylogeny of Lactobacillus crispatus.

453 A core-genome phylogenetic representation using IQ-Tree (51, 90, 91) of 42 L. crispatus samples .

454 The putatively recombinant positions predicted using ClonalFrameML (36) were removed from thee

455 alignment with maskrc-svg (56). The tree was built from 972 core genes identified by Roary with

456 9,209 parsimony informative sites. The log likelihood score for the consensus tree constructed

457 from 1,000 bootstrap trees was −1,418,106.

458 bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

459

460 Lastly, we looked at patterns of antibiotic resistance across the genus using a table, generated

461 by the summary BaT, of resistance genes and loci called by AMRFinder+ (104). Only 79 out of

462 1,496 Lactobacillus samples defined by GTDB-Tk (46) were found to have predicted resistance

463 using AMRFinder+. The most common resistance categories were tetracyclines (67 samples),

464 followed by macrolides, lincosamides and aminoglycosides (16, 15 and 11, respectively).

465 Species with the highest proportion of resistance included L. amylovorus (12/14 tetracycline

466 resistant) and L. crispatus (10/42 tetracycline resistant). Only 3 genomes of L. amylophilus were

467 included in the study but each contained matches to genes for macrolide, lincosamide and

468 tetracycline resistance. The linking thread between these species is they are each commonly

469 isolated from agricultural animals. The high proportion of L. crispatus samples isolated from

470 chickens that were tetracycline resistant has been previously observed (105, 106) (Figure 3).

471

472 A recent analysis of 184 Lactobacillus type strain genomes by Campedelli et al (107) found a

473 higher percentage of type strains with aminoglycoside (20/184), tetracycline (18/184),

474 erythromycin (6/184) and clindamycin (60/184) resistance. Forty-two of the type strains had

475 chloramphenicol resistance genes, where here, the AMRFinder+ returned only 1/1,467. These

476 differences probably reflect a combination of the different sampling bias of the studies, and the

477 Campedelli et al strategy of using a relaxed threshold for hits to maximize sensitivity (blastp

478 matches against the CARD database with acid sequence identity of 30% and query coverage of

479 70% (107)). Resistance is probably undercalled in both methods because of a lack of well-

480 characterized resistance loci from the Lactobacillus genus to use as comparison.

481 bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

482 Conclusions

483 Bactopia is a flexible workflow for bacterial . It can be run on a laptop for a single

484 bacterial sample, but critically, the underlying Nextflow framework allows it to make efficient use

485 of large clusters and cloud-compute environments to process the many thousands of genomes

486 that are currently being generated. For users that are not familiar with bacterial genomic tools

487 and/or who require a standardized pipeline, Bactopia is a “one-stop-shop” that can be easily

488 deployed using conda, Docker and Singularity containers. For researchers with particular

489 interest in individual species or genera, BaDs can be highly customized with taxon-specific

490 databases.

491

492 The current version of Bactopia does not support long-read data, but this is planned for future

493 implementation. We also plan to implement more comparative analyses in the form of additional

494 BaTs. With a framework set in place for developing BaTs, it should be possible to make a

495 toolshed of workflows that not only can be used for all bacteria but are also customized for

496 annotating genes and loci specific for particular species.

497

498 Acknowledgements

499 We would like to thank Torsten Seemann, Oliver Schwengers, Narciso Quijada, Michelle Su,

500 Michelle Wright, Matt Plumb, Sean Wang, Ahmed Babiker and Monica Farley for their helpful

501 suggestions and feedback. We would also like to acknowledge our gratitude to the many

502 scientists and their funders who provided genome sequences to the public domain, ENA and

503 SRA for storing and organizing the data, and the authors of the open source software tools and

504 datasets used in this work.

505 bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

506 Support for this project came from the award CDC U50CK000485 Emerging Infection Program

507 (EIP) PPHF/ACA: Enhancing Epidemiology and Laboratory Capacity” funding from the Emory

508 Public Health Bioinformatics Fellowship.

509

510 Availability of data

511 Raw Illumina sequences used in this study were acquired from Experiment accessions under

512 the BioProject accessions PRJDB1101, PRJDB1726, PRJDB4156, PRJDB4955, PRJDB5065,

513 PRJDB5206, PRJDB6480, PRJDB6495, PRJEB10572, PRJEB11980, PRJEB14693, PRJEB18589,

514 PRJEB19875, PRJEB21025, PRJEB21680, PRJEB22112, PRJEB22252, PRJEB23845, PRJEB24689,

515 PRJEB24698, PRJEB24699, PRJEB24700, PRJEB24701, PRJEB24713, PRJEB24715, PRJEB25194,

516 PRJEB2631, PRJEB26638, PRJEB2824, PRJEB29398, PRJEB29504, PRJEB2977, PRJEB3012,

517 PRJEB3060, PRJEB31213, PRJEB31289, PRJEB31301, PRJEB31307, PRJEB5094, PRJEB8104,

518 PRJEB8721, PRJEB9718, PRJNA165565, PRJNA176000, PRJNA176001, PRJNA183044,

519 PRJNA184888, PRJNA185359, PRJNA185406, PRJNA185584, PRJNA185632, PRJNA185633,

520 PRJNA188920, PRJNA188921, PRJNA196697, PRJNA212644, PRJNA217366, PRJNA218804,

521 PRJNA219157, PRJNA222257, PRJNA224116, PRJNA227106, PRJNA227335, PRJNA231221,

522 PRJNA234998, PRJNA235015, PRJNA235017, PRJNA247439, PRJNA247440, PRJNA247441,

523 PRJNA247442, PRJNA247443, PRJNA247444, PRJNA247445, PRJNA247446, PRJNA247452,

524 PRJNA254854, PRJNA255080, PRJNA257137, PRJNA257138, PRJNA257139, PRJNA257141,

525 PRJNA257142, PRJNA257182, PRJNA257185, PRJNA257853, PRJNA257876, PRJNA258355,

526 PRJNA258500, PRJNA267549, PRJNA269805, PRJNA269831, PRJNA269832, PRJNA269860,

527 PRJNA269905, PRJNA270961, PRJNA270962, PRJNA270963, PRJNA270964, PRJNA270965,

528 PRJNA270966, PRJNA270967, PRJNA270968, PRJNA270969, PRJNA270970, PRJNA270972,

529 PRJNA270973, PRJNA270974, PRJNA272101, PRJNA272102, PRJNA283920, PRJNA289613,

530 PRJNA29003, PRJNA291681, PRJNA296228, PRJNA296248, PRJNA296274, PRJNA296298,

531 PRJNA296309, PRJNA296751, PRJNA296754, PRJNA298448, PRJNA299992, PRJNA300015,

532 PRJNA300023, PRJNA300088, PRJNA300119, PRJNA300123, PRJNA300179, PRJNA302242, bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

533 PRJNA303235, PRJNA303236, PRJNA305242, PRJNA306257, PRJNA309616, PRJNA312743,

534 PRJNA315676, PRJNA316969, PRJNA322958, PRJNA322959, PRJNA322960, PRJNA322961,

535 PRJNA336518, PRJNA342061, PRJNA342757, PRJNA347617, PRJNA348789, PRJNA376205,

536 PRJNA377666, PRJNA379934, PRJNA381357, PRJNA382771, PRJNA388578, PRJNA392822,

537 PRJNA397632, PRJNA400793, PRJNA434600, PRJNA436228, PRJNA474823, PRJNA474907,

538 PRJNA476494, PRJNA477598, PRJNA481120, PRJNA484967, PRJNA492883, PRJNA493554,

539 PRJNA496358, PRJNA50051, PRJNA50053, PRJNA50055, PRJNA50057, PRJNA50059, PRJNA50061,

540 PRJNA50063, PRJNA50067, PRJNA50115, PRJNA50117, PRJNA50125, PRJNA50133, PRJNA50135,

541 PRJNA50137, PRJNA50139, PRJNA50141, PRJNA50159, PRJNA50161, PRJNA50163, PRJNA50165,

542 PRJNA50167, PRJNA50169, PRJNA50173, PRJNA504605, PRJNA504734, PRJNA505088,

543 PRJNA52105, PRJNA52107, PRJNA52121, PRJNA525939, PRJNA530250, PRJNA533291,

544 PRJNA533837, PRJNA542049, PRJNA542050, PRJNA542054, PRJNA543187, PRJNA544527,

545 PRJNA547620, PRJNA552757, PRJNA554696, PRJNA554698, PRJNA557339, PRJNA562050,

546 PRJNA563077, PRJNA573690, PRJNA577465, PRJNA578299, PRJNA68459, and PRJNA84.

547

548 Links

549 ● Website/Docs - https://bactopia.github.io/

550 ● Github - https://www.github.com/bactopia/bactopia/

551 ● Zenodo Snapshot - https://doi.org/10.5281/zenodo.3691353

552 ● Bioconda - https://bioconda.github.io/recipes/bactopia/README.html

553 ● Containers

554 ○ Docker - https://cloud.docker.com/u/bactopia/

555 ○ Singularity - https://cloud.sylabs.io/library/rpetit3/bactopia

556 bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

557 Supplementary Tables

558 Supplementary Table 1 - Lactobacillus excluded from analysis

559

Sample Exclusion Reason

DRX061123 Not processed, reason: Poor estimate of genome size

DRX061147 Not processed, reason: Poor estimate of genome size

DRX143657 Not processed, reason: Poor estimate of genome size

ERX034817 Not processed, reason: Poor estimate of genome size

ERX1046386 Not processed, reason: Low depth of sequencing

ERX1046387 Not processed, reason: Low depth of sequencing

ERX1046388 Not processed, reason: Low depth of sequencing

ERX1046389 Not processed, reason: Low depth of sequencing

Failed to pass minimum cutoffs, reason: Low coverage (14.83x, expect >= 20x);Too

ERX1046390 many contigs (783, expect <= 500)

ERX1046391 Failed to pass minimum cutoffs, reason: Too many contigs (631, expect <= 500)

Failed to pass minimum cutoffs, reason: Low coverage (11.54x, expect >= 20x);Too

ERX1046392 many contigs (755, expect <= 500)

Failed to pass minimum cutoffs, reason: Low coverage (17.59x, expect >= 20x);Too

ERX1046393 many contigs (544, expect <= 500)

ERX1046397 Not processed, reason: Low depth of sequencing

ERX1046398 Not processed, reason: Low depth of sequencing

ERX1046399 Not processed, reason: Low depth of sequencing

ERX1046400 Failed to pass minimum cutoffs, reason: Low coverage (11.97x, expect >= 20x);Too bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

many contigs (810, expect <= 500)

Failed to pass minimum cutoffs, reason: Low coverage (18.17x, expect >= 20x);Too

ERX1046401 many contigs (559, expect <= 500)

Failed to pass minimum cutoffs, reason: Low coverage (13.30x, expect >= 20x);Too

ERX1046402 many contigs (857, expect <= 500)

ERX1046407 Failed to pass minimum cutoffs, reason: Low coverage (18.63x, expect >= 20x)

ERX1046416 Failed to pass minimum cutoffs, reason: Low coverage (19.20x, expect >= 20x)

ERX1275836 Not processed, reason: Poor estimate of genome size

ERX1275870 Not processed, reason: Low number of reads;Low depth of sequencing

ERX1275925 Not processed, reason: Poor estimate of genome size

ERX1275959 Not processed, reason: Low number of reads;Low depth of sequencing

ERX178660 Not processed, reason: Poor estimate of genome size

ERX178661 Failed to pass minimum cutoffs, reason: Too many contigs (1079, expect <= 500)

ERX178663 Not processed, reason: Poor estimate of genome size

ERX2438180 Not processed, reason: Poor estimate of genome size

ERX271964 Failed to pass minimum cutoffs, reason: Too many contigs (900, expect <= 500)

ERX271983 Failed to pass minimum cutoffs, reason: Too many contigs (515, expect <= 500)

ERX2866884 Not processed, reason: Poor estimate of genome size

ERX303991 Not processed, reason: Poor estimate of genome size

ERX359696 Not processed, reason: Poor estimate of genome size

ERX359703 Not processed, reason: Poor estimate of genome size

ERX359718 Not processed, reason: Poor estimate of genome size

ERX359724 Not processed, reason: Low number of reads;Low depth of sequencing bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

ERX359774 Not processed, reason: Poor estimate of genome size

ERX359777 Failed to pass minimum cutoffs, reason: Too many contigs (674, expect <= 500)

ERX399759 Not processed, reason: Poor estimate of genome size

ERX450866 Failed to pass minimum cutoffs, reason: Too many contigs (902, expect <= 500)

ERX450885 Failed to pass minimum cutoffs, reason: Too many contigs (554, expect <= 500)

ERX529055 Not processed, reason: Poor estimate of genome size

ERX529175 Not processed, reason: Poor estimate of genome size

SRX059832 Failed to pass minimum cutoffs, reason: Short read length (35.00bp, expect >= 49 bp)

SRX130889 Not processed, reason: Low depth of sequencing

SRX130890 Not processed, reason: Low depth of sequencing

SRX130892 Failed to pass minimum cutoffs, reason: Too many contigs (618, expect <= 500)

SRX130895 Failed to pass minimum cutoffs, reason: Too many contigs (1050, expect <= 500)

SRX130896 Failed to pass minimum cutoffs, reason: Too many contigs (610, expect <= 500)

Failed to pass minimum cutoffs, reason: Low coverage (18.31x, expect >= 20x);Too

SRX130897 many contigs (940, expect <= 500)

SRX130898 Not processed, reason: Low depth of sequencing

Failed to pass minimum cutoffs, reason: Low coverage (17.02x, expect >= 20x);Too

SRX130903 many contigs (677, expect <= 500)

SRX130904 Failed to pass minimum cutoffs, reason: Too many contigs (719, expect <= 500)

SRX130910 Failed to pass minimum cutoffs, reason: Low coverage (18.77x, expect >= 20x)

SRX130911 Failed to pass minimum cutoffs, reason: Too many contigs (948, expect <= 500)

SRX130916 Failed to pass minimum cutoffs, reason: Low coverage (16.95x, expect >= 20x)

SRX1490246 Not processed, reason: Low number of reads bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

SRX1684187 Not processed, reason: Poor estimate of genome size

SRX1684189 Not processed, reason: Poor estimate of genome size

SRX1687039 Not processed, reason: Poor estimate of genome size

SRX1842889 Failed to pass minimum cutoffs, reason: Too many contigs (1016, expect <= 500)

SRX1949176 Not processed, reason: Poor estimate of genome size

Failed to pass minimum cutoffs, reason: Short read length (35.00bp, expect >= 49

SRX1950180 bp);Too many contigs (962, expect <= 500)

SRX1950181 Failed to pass minimum cutoffs, reason: Short read length (35.00bp, expect >= 49 bp)

SRX1970179 Not processed, reason: Poor estimate of genome size

SRX200229 Not processed, reason: Poor estimate of genome size

SRX200230 Not processed, reason: Poor estimate of genome size

SRX247326 Not processed, reason: Poor estimate of genome size

SRX2582464 Not processed, reason: Poor estimate of genome size

SRX2940498 Not processed, reason: Poor estimate of genome size

SRX2940499 Not processed, reason: Poor estimate of genome size

SRX3075577 Not processed, reason: Poor estimate of genome size

Failed to pass minimum cutoffs, reason: Low coverage (18.90x, expect >= 20x);Too

SRX3146668 many contigs (784, expect <= 500)

SRX3155954 Not processed, reason: Poor estimate of genome size

SRX318181 Failed to pass minimum cutoffs, reason: Too many contigs (692, expect <= 500)

SRX377715 Failed to pass minimum cutoffs, reason: Too many contigs (787, expect <= 500)

SRX422995 Not processed, reason: Low number of reads;Low depth of sequencing

SRX4336892 Not processed, reason: Poor estimate of genome size bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

SRX4526088 Failed to pass minimum cutoffs, reason: Low coverage (16.08x, expect >= 20x)

SRX4526092 Not processed, reason: Paired-end read count mismatch

SRX4997533 Not processed, reason: Poor estimate of genome size

SRX5116733 Not processed, reason: Poor estimate of genome size

SRX5116734 Not processed, reason: Poor estimate of genome size

SRX5116735 Not processed, reason: Poor estimate of genome size

SRX5116736 Not processed, reason: Poor estimate of genome size

SRX5116737 Not processed, reason: Poor estimate of genome size

SRX5395058 Not processed, reason: Poor estimate of genome size

SRX5489807 Not processed, reason: Poor estimate of genome size

SRX5992480 Failed to pass minimum cutoffs, reason: Too many contigs (843, expect <= 500)

SRX5992985 Failed to pass minimum cutoffs, reason: Too many contigs (1220, expect <= 500)

Failed to pass minimum cutoffs, reason: Low coverage (9.19x, expect >= 20x);Too many

SRX5992988 contigs (1440, expect <= 500)

SRX5992989 Not processed, reason: Poor estimate of genome size

SRX6454026 Failed to pass minimum cutoffs, reason: Too many contigs (533, expect <= 500)

SRX6454030 Failed to pass minimum cutoffs, reason: Too many contigs (796, expect <= 500)

SRX6454035 Failed to pass minimum cutoffs, reason: Too many contigs (687, expect <= 500)

SRX6454040 Failed to pass minimum cutoffs, reason: Too many contigs (792, expect <= 500)

SRX6454043 Failed to pass minimum cutoffs, reason: Too many contigs (801, expect <= 500)

Failed to pass minimum cutoffs, reason: Low coverage (16.59x, expect >= 20x);Too

SRX6959882 many contigs (855, expect <= 500)

SRX7004909 Failed to pass minimum cutoffs, reason: Too many contigs (999, expect <= 500) bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

SRX761992 Failed to pass minimum cutoffs, reason: Low coverage (18.30x, expect >= 20x)

Failed to pass minimum cutoffs, reason: Low coverage (15.33x, expect >= 20x);Too

SRX762105 many contigs (560, expect <= 500)

SRX762278 Not processed, reason: Poor estimate of genome size

SRX762279 Failed to pass minimum cutoffs, reason: Low coverage (16.51x, expect >= 20x)

SRX762385 Failed to pass minimum cutoffs, reason: Too many contigs (1132, expect <= 500)

SRX762507 Failed to pass minimum cutoffs, reason: Low coverage (16.75x, expect >= 20x)

SRX762687 Failed to pass minimum cutoffs, reason: Low coverage (19.54x, expect >= 20x)

560

561 Supplementary Table 2 - Samples with a non-Lactobacillus taxonomic classification

562

sample sra gtdb

ERX2000473 Lactobacillus jensenii Actinomyces viscosus

ERX1275832 Lactobacillus rhamnosus Aerococcus urinae

ERX1275921 Lactobacillus rhamnosus Aerococcus urinae

ERX2000484 Lactobacillus iners Bifidobacterium vaginale

ERX1275866 Lactobacillus rhamnosus Bifidobacterium vaginale

ERX1275955 Lactobacillus rhamnosus Bifidobacterium vaginale

ERX1275856 Lactobacillus fermentum Campylobacter ureolyticus

ERX1275945 Lactobacillus fermentum Campylobacter ureolyticus

SRX301579 Lactobacillus catenefornis DSM 20559 Eggerthia catenaformis bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

ERX2000470 Lactobacillus jensenii Facklamia hominis

ERX034808 Lactobacillus sp. KLE1615 sp900066985

SRX2118809 Lactobacillus rogosae Lachnospira rogosae

ERX1275881 Lactobacillus jensenii Microbacterium sp001595495

ERX1275970 Lactobacillus jensenii Microbacterium sp001595495

SRX963052 Lactobacillus sp. HMSC12B03 Neisseria bacilliformis

ERX1275835 Lactobacillus iners Pseudoglutamicibacter cumminsii

ERX1275924 Lactobacillus iners Pseudoglutamicibacter cumminsii

ERX178670 Lactobacillus casei Staphylococcus epidermidis

SRX244574 Lactobacillus gasseri ADL-351 Streptococcus agalactiae

ERX1275883 Lactobacillus sp. Streptococcus agalactiae

ERX1275972 Lactobacillus sp. Streptococcus agalactiae

ERX2000463 Lactobacillus johnsonii Streptococcus parasanguinis

ERX2000481 Lactobacillus crispatus Streptococcus pasteurianus

ERX3310397 Lactobacillus acetotolerans Streptococcus pneumoniae

ERX3310420 Lactobacillus acidophilus Streptococcus pneumoniae

ERX3310369 Lactobacillus agilis Streptococcus pneumoniae

ERX3310458 Lactobacillus alimentarius Streptococcus pneumoniae

ERX3310319 Lactobacillus amylophilus Streptococcus pneumoniae

ERX3310443 Lactobacillus amylovorus Streptococcus pneumoniae

ERX3310479 Lactobacillus animalis Streptococcus pneumoniae bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

ERX3310410 Lactobacillus aviarius Streptococcus pneumoniae

ERX3310408 Lactobacillus bifermentans Streptococcus pneumoniae

ERX3310303 Lactobacillus brevis Streptococcus pneumoniae

ERX3310449 Lactobacillus buchneri Streptococcus pneumoniae

ERX3310425 Lactobacillus casei Streptococcus pneumoniae

ERX3310459 Lactobacillus coryniformis Streptococcus pneumoniae

ERX3310264 Lactobacillus delbrueckii subsp. bulgaricus Streptococcus pneumoniae

ERX3310316 Lactobacillus delbrueckii Streptococcus pneumoniae

ERX3310336 Lactobacillus farciminis Streptococcus pneumoniae

ERX3310280 Lactobacillus fermentum Streptococcus pneumoniae

ERX3310372 Lactobacillus fructivorans Streptococcus pneumoniae

ERX3310460 Lactobacillus helveticus Streptococcus pneumoniae

ERX3310328 Lactobacillus hilgardii Streptococcus pneumoniae

ERX3310291 Lactobacillus mali Streptococcus pneumoniae

ERX3310432 Lactobacillus oris Streptococcus pneumoniae

ERX3310305 Lactobacillus paracasei Streptococcus pneumoniae

ERX3310374 Lactobacillus pentosus Streptococcus pneumoniae

ERX3310445 Lactobacillus plantarum Streptococcus pneumoniae

ERX3310452 Lactobacillus ruminis Streptococcus pneumoniae

ERX3310353 Lactobacillus sakei Streptococcus pneumoniae

ERX3310387 Lactobacillus sakei L45 Streptococcus pneumoniae bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

ERX3310490 Lactobacillus salivarius Streptococcus pneumoniae

ERX3310309 Lactobacillus sanfranciscensis Streptococcus pneumoniae

ERX3310430 Lactobacillus sharpeae Streptococcus pneumoniae

ERX3310388 Lactobacillus sp. 30A Streptococcus pneumoniae

ERX3310483 Lactobacillus vaginalis Streptococcus pneumoniae

ERX3310456 Lactobacillus vermiforme Streptococcus pneumoniae

ERX2000443 Lactobacillus gasseri Winkia sp002849225

563 bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

564 Supplementary Figures

565 Supplementary Figure 1 - Bactopia Analysis Pipeline Workflow

566

567 bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

568 Supplementary Figure 2 - Sequencing quality ranks per year 2011–2019 of

569 Lactobacillus genome projects.

570 Genome projects were grouped into three increasing quality ranks: Bronze, Silver and Gold. The

571 rank was based on coverage, read length, per-read quality and total assembled contigs. The

572 highest rank, Gold, represented 62% (n=967) of the available Lactobacillus genome projects. The

573 remaining genomes were 25% Silver (n=386) and 13% Bronze (n=205). Between the years of 2011

574 and 2019, Gold rank samples consistently outnumbered Silver and Bronze except for 2011 and

575 2015 . However it is likely the total number of Gold rank samples is underrepresented due

576 coverage reduction being based on the estimated genome size.

577

578

579

580

581 bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

582 Supplementary Figure 3 - Comparison of estimated genome size and assembled

583 genome size.

584 The assembled genome size (Y-axis) and estimated genome size (X-axis) was plotted for each

585 sample. The color of the dot is determined by the rank of the sample. The solid black line

586 represents a 1:1 ratio between the assembled genome size and the estimated genome size. The

587 genome size was estimated for each sample by Mash (7) using the raw sequences.

588

589

590

591

592 bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

593 Supplementary Figure 4 - Assembled genome sizes are consistent within species.

594

595

596

597 Supplementary Data

598 Supplementary Data 1 - Results returned after querying ENA for Lactobacillus

599

600 Supplementary Data 2 - SRA/ENA Experiment accessions processed by Bactopia

601

602 Supplementary Data 3 - Nexflow runtime report for Lactobacillus genomes processed

603 by Bactopia

604 bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

605 References

606 1. Grüning B, Dale R, Sjödin A, Rowe J, Chapman BA, Tomkins-Tinch CH, Valieris R, The

607 Bioconda Team, Köster J. 2017. Bioconda: A sustainable and comprehensive software

608 distribution for the life sciences. bioRxiv.

609 2. Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. 2017.

610 Nextflow enables reproducible computational workflows. Nat Biotechnol 35:316–319.

611 3. Petit RA 3rd, Read TD. 2018. Staphylococcus aureus viewed from the perspective of

612 40,000+ genomes. PeerJ 6:e5261.

613 4. Parks DH, Chuvochina M, Waite DW, Rinke C, Skarshewski A, Chaumeil P-A, Hugenholtz

614 P. 2018. A standardized bacterial taxonomy based on genome phylogeny substantially

615 revises the tree of life. Nat Biotechnol 36:996–1004.

616 5. Jain C, Rodriguez-R LM, Phillippy AM, Konstantinidis KT, Aluru S. 2018. High throughput

617 ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nat Commun

618 9:5114.

619 6. Page AJ, Cummins CA, Hunt M, Wong VK, Reuter S, Holden MTG, Fookes M, Falush D,

620 Keane JA, Parkhill J. 2015. Roary: Rapid large-scale prokaryote pan genome analysis.

621 Bioinformatics.

622 7. Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, Phillippy AM.

623 2016. Mash: fast genome and metagenome distance estimation using MinHash. Genome

624 Biol 17:132.

625 8. O’Leary NA, Wright MW, Brister JR, Ciufo S, Haddad D, McVeigh R, Rajput B, Robbertse

626 B, Smith-White B, Ako-Adjei D, Astashyn A, Badretdin A, Bao Y, Blinkova O, Brover V, bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

627 Chetvernin V, Choi J, Cox E, Ermolaeva O, Farrell CM, Goldfarb T, Gupta T, Haft D,

628 Hatcher E, Hlavina W, Joardar VS, Kodali VK, Li W, Maglott D, Masterson P, McGarvey

629 KM, Murphy MR, O’Neill K, Pujar S, Rangwala SH, Rausch D, Riddick LD, Schoch C,

630 Shkeda A, Storz SS, Sun H, Thibaud-Nissen F, Tolstoy I, Tully RE, Vatsan AR, Wallin C,

631 Webb D, Wu W, Landrum MJ, Kimchi A, Tatusova T, DiCuccio M, Kitts P, Murphy TD,

632 Pruitt KD. 2016. Reference sequence (RefSeq) database at NCBI: current status,

633 taxonomic expansion, and functional annotation. Nucleic Acids Res 44:D733–45.

634 9. Galata V, Fehlmann T, Backes C, Keller A. 2019. PLSDB: a resource of complete bacterial

635 plasmids. Nucleic Acids Res 47:D195–D202.

636 10. Titus Brown C, Irber L. 2016. sourmash: a library for MinHash sketching of DNA. JOSS

637 1:27.

638 11. Clark K, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW. 2016. GenBank. Nucleic Acids

639 Res 44:D67–72.

640 12. Hunt M, Mather AE, Sánchez-Busó L, Page AJ, Parkhill J, Keane JA, Harris SR. 2017.

641 ARIBA: rapid antimicrobial resistance genotyping directly from sequencing reads. Microb

642 Genom 3:e000131.

643 13. Gupta SK, Padmanabhan BR, Diene SM, Lopez-Rojas R, Kempf M, Landraud L, Rolain J-

644 M. 2014. ARG-ANNOT, a new bioinformatic tool to discover antibiotic resistance genes in

645 bacterial genomes. Antimicrob Agents Chemother 58:212–220.

646 14. McArthur AG, Waglechner N, Nizam F, Yan A, Azad MA, Baylay AJ, Bhullar K, Canova MJ,

647 De Pascale G, Ejim L, Kalan L, King AM, Koteva K, Morar M, Mulvey MR, O’Brien JS,

648 Pawlowski AC, Piddock LJ V, Spanogiannopoulos P, Sutherland AD, Tang I, Taylor PL,

649 Thaker M, Wang W, Yan M, Yu T, Wright GD. 2013. The comprehensive antibiotic bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

650 resistance database. Antimicrob Agents Chemother 57:3348–3357.

651 15. Lakin SM, Dean C, Noyes NR, Dettenwanger A, Ross AS, Doster E, Rovira P, Abdo Z,

652 Jones KL, Ruiz J, Belk KE, Morley PS, Boucher C. 2017. MEGARes: an antimicrobial

653 resistance database for high throughput sequencing. Nucleic Acids Res 45:D574–D580.

654 16. Feldgarden M, Brover V, Haft DH, Prasad AB, Slotta DJ, Tolstoy I, Tyson GH, Zhao S, Hsu

655 C-H, McDermott PF, Tadesse DA, Morales C, Simmons M, Tillman G, Wasilenko J, Folster

656 JP, Klimke W. 2019. Validating the NCBI AMRFinder Tool and Resistance Gene Database

657 Using Antimicrobial Resistance Genotype-Phenotype Correlations in a Collection of

658 NARMS Isolates. Antimicrob Agents Chemother.

659 17. Carattoli A, Zankari E, García-Fernández A, Voldby Larsen M, Lund O, Villa L, Møller

660 Aarestrup F, Hasman H. 2014. In silico detection and typing of plasmids using

661 PlasmidFinder and plasmid multilocus sequence typing. Antimicrob Agents Chemother

662 58:3895–3903.

663 18. Zankari E, Hasman H, Cosentino S, Vestergaard M, Rasmussen S, Lund O, Aarestrup FM,

664 Larsen MV. 2012. Identification of acquired antimicrobial resistance genes. J Antimicrob

665 Chemother 67:2640–2644.

666 19. Inouye M, Dashnow H, Raven L-A, Schultz MB, Pope BJ, Tomita T, Zobel J, Holt KE. 2014.

667 SRST2: Rapid genomic surveillance for public health and hospital microbiology labs.

668 Genome Med 6:90.

669 20. Chen L, Zheng D, Liu B, Yang J, Jin Q. 2016. VFDB 2016: hierarchical and refined dataset

670 for big data analysis--10 years on. Nucleic Acids Res 44:D694–7.

671 21. Joensen KG, Scheutz F, Lund O, Hasman H, Kaas RS, Nielsen EM, Aarestrup FM. 2014.

672 Real-time whole-genome sequencing for routine typing, surveillance, and outbreak bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

673 detection of verotoxigenic Escherichia coli. J Clin Microbiol 52:1501–1510.

674 22. Jolley KA, Bray JE, Maiden MCJ. 2018. Open-access bacterial :

675 BIGSdb software, the PubMLST.org website and their applications. Wellcome Open Res

676 3:124.

677 23. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL. 2009.

678 BLAST+: architecture and applications. BMC Bioinformatics 10:421.

679 24. Blin K. ncbi-genome-download - Scripts to download genomes from the NCBI FTP servers.

680 Github.

681 25. Cock PJA, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, Friedberg I, Hamelryck T,

682 Kauff F, Wilczynski B, de Hoon MJL. 2009. Biopython: freely available Python tools for

683 computational molecular biology and bioinformatics. Bioinformatics 25:1422–1423.

684 26. Li W, Godzik A. 2006. Cd-hit: a fast program for clustering and comparing large sets of

685 protein or nucleotide sequences. Bioinformatics 22:1658–1659.

686 27. Fu L, Niu B, Zhu Z, Wu S, Li W. 2012. CD-HIT: accelerated for clustering the next-

687 generation sequencing data. Bioinformatics 28:3150–3152.

688 28. Seemann T. 2014. Prokka: rapid prokaryotic genome annotation. Bioinformatics 30:2068–

689 2069.

690 29. Laslett D, Canback B. 2004. ARAGORN, a program to detect tRNA genes and tmRNA

691 genes in nucleotide sequences. Nucleic Acids Res 32:11–16.

692 30. Petit RA III. assembly-scan: generate basic stats for an assembly. Github.

693 31. Seemann T. Barrnap: Bacterial ribosomal RNA predictor. Github. bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

694 32. Bushnell B. BBMap short read aligner, and other bioinformatic tools. SourceForge.

695 33. Danecek P. BCFtools - Utilities for variant calling and manipulating VCFs and BCFs.

696 Github.

697 34. Quinlan AR, Hall IM. 2010. BEDTools: a flexible suite of utilities for comparing genomic

698 features. Bioinformatics 26:841–842.

699 35. Li H, Durbin R. 2010. Fast and accurate long-read alignment with Burrows-Wheeler

700 transform. Bioinformatics 26:589–595.

701 36. Didelot X, Wilson DJ. 2015. ClonalFrameML: efficient inference of recombination in whole

702 bacterial genomes. PLoS Comput Biol 11:e1004041.

703 37. Iannone R. 2018. DiagrammeR: Graph/network visualization. R package 1.

704 38. Miller CS, Baker BJ, Thomas BC, Singer SW, Banfield JF. 2011. EMIRGE: reconstruction

705 of full-length ribosomal genes from microbial community short read sequencing data.

706 Genome Biol 12:R44.

707 39. Price MN, Dehal PS, Arkin AP. 2010. FastTree 2 – Approximately Maximum-Likelihood

708 Trees for Large Alignments. PLoS One 5:e9490.

709 40. Petit RA III. fastq-dl - Download FASTQ files from SRA or ENA repositories. Github.

710 41. Andrews S, Krueger F, Seconds-Pichon A, Biggins F, Wingett S. 2016. FastQC A Quality

711 Control tool for High Throughput Sequence Data. Babraham Bioinformatics. 2012.

712 42. Petit RA III. fastq-scan: generate summary statistics of input FASTQ sequences. Github.

713 43. Magoč T, Salzberg SL. 2011. FLASH: fast length adjustment of short reads to improve

714 genome assemblies. Bioinformatics 27:2957–2963. bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

715 44. Garrison E, Marth G. 2012. Haplotype-based variant detection from short-read sequencing.

716 arXiv [q-bioGN].

717 45. Tange O. 2018. GNU Parallel 2018.

718 46. Chaumeil P-A, Mussig AJ, Hugenholtz P, Parks DH. 2019. GTDB-Tk: a toolkit to classify

719 genomes with the Genome Taxonomy Database. Bioinformatics.

720 47. Eddy SR. 2009. A new generation of search tools based on probabilistic

721 inference. Genome Inform 23:205–211.

722 48. Eddy SR. 2011. Accelerated Profile HMM Searches. PLoS Comput Biol 7:e1002195.

723 49. Wheeler TJ, Eddy SR. 2013. nhmmer: DNA homology search with profile HMMs.

724 Bioinformatics 29:2487–2489.

725 50. Nawrocki EP, Eddy SR. 2013. Infernal 1.1: 100-fold faster RNA homology searches.

726 Bioinformatics 29:2933–2935.

727 51. Nguyen L-T, Schmidt HA, von Haeseler A, Minh BQ. 2015. IQ-TREE: a fast and effective

728 stochastic algorithm for estimating maximum-likelihood phylogenies. Mol Biol Evol 32:268–

729 274.

730 52. Hawkey J, Hamidian M, Wick RR, Edwards DJ, Billman-Jacobe H, Hall RM, Holt KE. 2015.

731 ISMapper: identifying transposase insertion sites in bacterial genomes from short read

732 sequence data. BMC Genomics 16:667.

733 53. Song L, Florea L, Langmead B. 2014. Lighter: fast and memory-efficient sequencing error

734 correction without counting. Genome Biol 15:509.

735 54. Katoh K, Standley DM. 2013. MAFFT multiple sequence alignment software version 7: bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

736 improvements in performance and usability. Mol Biol Evol 30:772–780.

737 55. Ondov BD, Starrett GJ, Sappington A, Kostic A, Koren S, Buck CB, Phillippy AM. 2019.

738 Mash Screen: high-throughput sequence containment estimation for genome discovery.

739 Genome Biol 20:232.

740 56. Kwong J. maskrc-svg - Masks recombination as detected by ClonalFrameML or Gubbins

741 and draws an SVG. Github.

742 57. Turner I, Garimella KV, Iqbal Z, McVean G. 2018. Integrating long-range connectivity

743 information into de Bruijn graphs. Bioinformatics 34:2556–2565.

744 58. Li D, Liu C-M, Luo R, Sadakane K, Lam T-W. 2015. MEGAHIT: an ultra-fast single-node

745 solution for large and complex metagenomics assembly via succinct de Bruijn graph.

746 Bioinformatics 31:1674–1676.

747 59. Skennerton CT. MinCED: Mining CRISPRs in Environmental Datasets. Github.

748 60. Li H. 2018. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics

749 34:3094–3100.

750 61. Gruber-Vodicka HR, Seah BKB, Pruesse E. 2019. phyloFlash – Rapid SSU rRNA profiling

751 and targeted assembly from metagenomes. bioRxiv.

752 62. Adler M. 2015. pigz: A parallel implementation of gzip for modern multi-processor, multi-

753 core machines. Jet Propulsion Laboratory.

754 63. Walker BJ, Abeel T, Shea T, Priest M, Abouelliel A, Sakthikumar S, Cuomo CA, Zeng Q,

755 Wortman J, Young SK, Earl AM. 2014. Pilon: An Integrated Tool for Comprehensive

756 Microbial Variant Detection and Genome Assembly Improvement. PLoS One 9:e112963. bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

757 64. Matsen FA, Kodner RB, Armbrust EV. 2010. pplacer: linear time maximum-likelihood and

758 Bayesian phylogenetic placement of sequences onto a fixed reference tree. BMC

759 Bioinformatics 11:538.

760 65. Hyatt D, Chen G-L, Locascio PF, Land ML, Larimer FW, Hauser LJ. 2010. Prodigal:

761 prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics

762 11:119.

763 66. Seemann T. Samclip: Filter SAM file for soft and hard clipped alignments. Github.

764 67. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin

765 R, 1000 Data Processing Subgroup. 2009. The Sequence Alignment/Map

766 format and SAMtools. Bioinformatics 25:2078–2079.

767 68. Li H. 2012. seqtk Toolkit for processing sequences in FASTA/Q formats. GitHub.

768 69. Seemann T. Shovill: De novo assembly pipeline for Illumina paired reads. Github.

769 70. Souvorov A, Agarwala R, Lipman DJ. 2018. SKESA: strategic k-mer extension for

770 scrupulous assemblies. Genome Biol 19:153.

771 71. Seemann T. Snippy: fast bacterial variant calling from NGS reads. Github.

772 72. Cingolani P, Platts A, Wang LL, Coon M, Nguyen T, Wang L, Land SJ, Lu X, Ruden DM.

773 2012. A program for annotating and predicting the effects of single nucleotide

774 polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118;

775 iso-2; iso-3. Fly 6:80–92.

776 73. Seemann T. snp-dists - Pairwise SNP distance matrix from a FASTA sequence alignment.

777 Github. bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

778 74. Page AJ, Taylor B, Delaney AJ, Soares J, Seemann T, Keane JA, Harris SR. 2016. SNP-

779 sites: rapid efficient extraction of SNPs from multi-FASTA alignments. Microb Genom

780 2:e000056.

781 75. Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, Lesin VM, Nikolenko

782 SI, Pham S, Prjibelski AD, Pyshkin AV. SPAdes: A New Genome Assembly Algorithm and

783 Its Applications to Single-Cell Sequencing | Journal of Computational Biology. Mary Ann

784 Liebert, Inc, publishers.

785 76. Bolger AM, Lohse M, Usadel B. 2014. Trimmomatic: a flexible trimmer for Illumina

786 sequence data. Bioinformatics 30:2114–2120.

787 77. Petit RA III. VCF-Annotator: Add biological annotations to variants in a VCF file. Github.

788 78. Vcflib: A C++ library for parsing and manipulating VCF files. Github.

789 79. Zerbino DR, Birney E. 2008. Velvet: Algorithms for de novo short read assembly using de

790 Bruijn graphs. Genome Res 18:821–829.

791 80. Rognes T, Flouri T, Nichols B, Quince C, Mahé F. 2016. VSEARCH: a versatile open

792 source tool for metagenomics. PeerJ 4:e2584.

793 81. Tan A, Abecasis GR, Kang HM. 2015. Unified representation of genetic variants.

794 Bioinformatics 31:2202–2204.

795 82. Toribio AL, Alako B, Amid C, Cerdeño-Tarrága A, Clarke L, Cleland I, Fairley S, Gibson R,

796 Goodgame N, Ten Hoopen P, Jayathilaka S, Kay S, Leinonen R, Liu X, Martínez-Villacorta

797 J, Pakseresht N, Rajan J, Reddy K, Rosello M, Silvester N, Smirnov D, Vaughan D, Zalunin

798 V, Cochrane G. 2017. European Nucleotide Archive in 2016. Nucleic Acids Res 45:D32–

799 D36. bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

800 83. Bennett GM, Moran NA. 2013. Small, smaller, smallest: the origins and evolution of ancient

801 dual symbioses in a Phloem-feeding insect. Genome Biol Evol 5:1675–1688.

802 84. Li H. 2013. Aligning sequence reads, clone sequences and assembly contigs with BWA-

803 MEM. arXiv [q-bioGN].

804 85. Schwengers O, Hoek A, Fritzenwanker M, Falgenhauer L, Hain T, Chakraborty T,

805 Goesmann A. 2019. ASA3P: An automatic and scalable pipeline for the assembly,

806 annotation and higher level analysis of closely related bacterial isolates. bioRxiv.

807 86. Quijada NM, Rodríguez-Lázaro D, Eiros JM, Hernández M. 2019. TORMES: an automated

808 pipeline for whole bacterial genome analysis. Bioinformatics 35:4207–4212.

809 87. da Silva A Bulach DM Schultz MB Kwong JC Howden BP. STG. Nullarbor. Github.

810 88. Fettweis JM, Serrano MG, Brooks JP, Edwards DJ, Girerd PH, Parikh HI, Huang B, Arodz

811 TJ, Edupuganti L, Glascock AL, Xu J, Jimenez NR, Vivadelli SC, Fong SS, Sheth NU, Jean

812 S, Lee V, Bokhari YA, Lara AM, Mistry SD, Duckworth RA, Bradley SP, Koparde VN,

813 Orenda XV, Milton SH, Rozycki SK, Matveyev AV, Wright ML, Huzurbazar SV, Jackson

814 EM, Smirnova E, Korlach J, Tsai Y-C, Dickinson MR, Brooks JL, Drake JI, Chaffin DO,

815 Sexton AL, Gravett MG, Rubens CE, Wijesooriya NR, Hendricks-Muñoz KD, Jefferson KK,

816 Strauss JF, Buck GA. 2019. The vaginal microbiome and preterm birth. Nat Med.

817 89. Yelin I, Flett KB, Merakou C, Mehrotra P, Stam J, Snesrud E, Hinkle M, Lesho E, McGann

818 P, McAdam AJ, Sandora TJ, Kishony R, Priebe GP. 2019. Genomic and epidemiological

819 evidence of bacterial transmission from probiotic capsule to blood in ICU patients. Nat Med.

820 90. Kalyaanamoorthy S, Minh BQ, Wong TKF, von Haeseler A, Jermiin LS. 2017. ModelFinder:

821 fast model selection for accurate phylogenetic estimates. Nat Methods 14:587–589. bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

822 91. Hoang DT, Chernomor O, von Haeseler A, Minh BQ, Vinh LS. 2018. UFBoot2: Improving

823 the Ultrafast Bootstrap Approximation. Mol Biol Evol 35:518–522.

824 92. Quast C, Pruesse E, Yilmaz P, Gerken J, Schweer T, Yarza P, Peplies J, Glöckner FO.

825 2013. The SILVA ribosomal RNA gene database project: improved data processing and

826 web-based tools. Nucleic Acids Res 41:D590–6.

827 93. Letunic I, Bork P. 2016. Interactive tree of life (iTOL) v3: an online tool for the display and

828 annotation of phylogenetic and other trees. Nucleic Acids Res 44:W242–5.

829 94. Wittouck S, Wuyts S, Meehan CJ, van Noort V, Lebeer S. 2019. A Genome-Based Species

830 Taxonomy of the Lactobacillus Genus Complex. mSystems 4.

831 95. Tanizawa Y, Tada I, Kobayashi H, Endo A, Maeno S, Toyoda A, Arita M, Nakamura Y,

832 Sakamoto M, Ohkuma M, Tohno M. 2018. Lactobacillus paragasseri sp. nov., a sister taxon

833 of Lactobacillus gasseri, based on whole-genome sequence analyses. Int J Syst Evol

834 Microbiol 68:3512–3517.

835 96. Pan M, Hidalgo-Cantabrana C, Barrangou R. 2020. Host and body site-specific adaptation

836 of Lactobacillus crispatus genomes. NAR Genom Bioinform 2.

837 97. Weimer CM, Deitzler GE, Robinson LS, Park S, Hallsworth-Pepin K, Wollam A, Mitreva M,

838 Lewis WG, Lewis AL. 2016. Genome Sequences of 12 Bacterial Isolates Obtained from the

839 Urine of Pregnant Women. Genome Announc 4.

840 98. Sichtig H, Minogue T, Yan Y, Stefan C, Hall A, Tallon L, Sadzewicz L, Nadendla S, Klimke

841 W, Hatcher E, Shumway M, Aldea DL, Allen J, Koehler J, Slezak T, Lovell S, Schoepp R,

842 Scherf U. 2019. FDA-ARGOS is a database with public quality-controlled reference

843 genomes for diagnostic use and regulatory science. Nat Commun 10:3313. bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

844 99. Bassis CM, Bullock KA, Sack DE, Saund K, Pirani A, Snitkin ES, Alaniz VI, Quint EH,

845 Young VB, Bell JD. 2019. Evidence that vertical transmission of the vaginal microbiota can

846 persist into adolescence. bioRxiv.

847 100. Clabaut M, Boukerb AM, Racine P-J, Pichon C, Kremser C, Picot J-P, Karsybayeva M,

848 Redziniak G, Chevalier S, Feuilloley MGJ. 2020. Draft Genome Sequence of Lactobacillus

849 crispatus CIP 104459, Isolated from a Vaginal Swab. Microbiol Resour Announc 9.

850 101. Richards PJ, Flaujac Lafontaine GM, Connerton PL, Liang L, Asiani K, Fish NM,

851 Connerton IF. 2020. Galacto-Oligosaccharides Modulate the Juvenile Gut Microbiome and

852 Innate Immunity To Improve Broiler Chicken Performance. mSystems 5.

853 102. Chang D-H, Rhee M-S, Lee S-K, Chung I-H, Jeong H, Kim B-C. 2019. Complete

854 Genome Sequence of Lactobacillus crispatus AB70, Isolated from a Vaginal Swab from a

855 Healthy Pregnant Korean Woman. Microbiol Resour Announc 8.

856 103. McComb E, Holm J, Ma B, Ravel J. 2019. Complete Genome Sequence of Lactobacillus

857 crispatus CO3MRSI1. Microbiol Resour Announc 8.

858 104. Feldgarden M, Brover V, Haft DH, Prasad AB, Slotta DJ, Tolstoy I, Tyson GH, Zhao S,

859 Hsu C-H, McDermott PF, Tadesse DA, Morales C, Simmons M, Tillman G, Wasilenko J,

860 Folster JP, Klimke W. 2019. Using the NCBI AMRFinder Tool to Determine Antimicrobial

861 Resistance Genotype-Phenotype Correlations Within a Collection of NARMS Isolates.

862 bioRxiv.

863 105. Cauwerts K, Pasmans F, Devriese LA, Haesebrouck F, Decostere A. 2006. Cloacal

864 Lactobacillus isolates from broilers often display resistance toward tetracycline antibiotics.

865 Microb Drug Resist 12:284–288.

866 106. Dec M, Urban-Chmiel R, Stępień-Pyśniak D, Wernicki A. 2017. Assessment of antibiotic bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

867 susceptibility in Lactobacillus isolates from chickens. Gut Pathog 9:54.

868 107. Campedelli I, Mathur H, Salvetti E, Clarke S, Rea MC, Torriani S, Ross RP, Hill C,

869 O’Toole PW. 2018. Genus-wide assessment of antibiotic resistance in Lactobacillus spp.

870 Appl Environ Microbiol.

871