A Flexible Pipeline for Complete Analysis of Bacterial

bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. 1 Bactopia: a flexible pipeline for complete 2 analysis of bacterial genomes 3 4 5 Robert A. Petit III1 and Timothy D. Read1 6 1Division of Infectious Diseases, Department of Medicine, Emory University School of 7 Medicine. 8 9 Corresponding Author: 10 Timothy Read1 11 12 Email address: [email protected] 13 14 Abstract 15 Sequencing of bacterial genomes using Illumina technology has become such a standard 16 procedure that often data are generated faster than can be conveniently analyzed. We created 17 a new series of pipelines called Bactopia, built using Nextflow workflow software, to provide 18 efficient comparative genomic analyses for bacterial species or genera. Bactopia consists of a 19 dataset setup step (Bactopia Datasets; BaDs) where a series of customizable datasets are 20 created for the species of interest; the Bactopia Analysis Pipeline (BaAP), which performs 21 quality control, genome assembly and several other functions based on the available datasets bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. 22 and outputs the processed data to a structured directory format; and a series of Bactopia Tools 23 (BaTs) that perform specific post-processing on some or all of the processed data. BaTs include 24 pan-genome analysis, computing average nucleotide identity between samples, extracting and 25 profiling the 16S genes and taxonomic classification using highly conserved genes. It is 26 expected that the number of BaTs will increase to fill specific applications in the future. As a 27 demonstration, we performed an analysis of 1,664 public Lactobacillus genomes, focusing on L. 28 crispatus, a species that is a common part of the human vaginal microbiome. Bactopia is an 29 open source system that can scale from projects as small as one bacterial genome to 30 thousands that allows for great flexibility in choosing comparison datasets and options for 31 downstream analysis. Bactopia code can be accessed at 32 https://www.github.com/bactopia/bactopia. 33 34 Introduction 35 Sequencing a bacterial genome, an activity that once required the infrastructure of a dedicated 36 genome center, is now a routine task that even a small laboratory can undertake. A large 37 number of open source software tools have been created to handle various parts of the process 38 of using raw read data for functions such as SNP calling and de novo assembly. As a result of 39 dedicated community efforts, it has recently become much easier to locally install these 40 bioinformatic tools through package managers (Bioconda (1), Brew) or through the use of 41 software containers (Docker, Singularity). Despite these advances, producers of bacterial 42 sequence data face a bewildering array of choices when considering how to perform analysis, 43 particularly when large numbers of genomes are involved and processing efficiency and 44 scalability become major factors. 45 bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. 46 Efficient bacterial multi-genome analysis has been hampered by three missing functionalities. 47 First, is the need to have “workflows of workflows'' that can integrate analyses and provide a 48 simplified way to start with a collection of raw genome data, remove low quality sequences and 49 perform the basic analytic steps of de novo assembly, mapping to reference sequence and 50 taxonomic assignment. Second, is the desire to incorporate user-specific knowledge of the 51 species into the input of the main genome analysis pipeline. While many microbiologists are not 52 expert bioinformaticians, they are experts in the organisms they study. Third, is the need to 53 create an output format from the main pipeline that could be used for future customized 54 downstream analysis such as pan-genome analysis and basic visualization of phylogenies. 55 56 Here we introduce Bactopia, an integrated suite of workflows for flexible analysis of Illumina 57 genome sequencing projects of bacteria from the same taxon. Bactopia is based on the 58 Nextflow workflow software (2), and is designed to be scalable, allowing projects as small as a 59 single genome to be run on a local desktop, or many thousands of genomes to be run as a 60 batch on a cloud infrastructure. Running multiple tasks on a single platform standardizes the 61 underlying data quality used for gene and variant calling between projects run in different 62 laboratories. This structure also simplifies the user experience. In Bactopia, complex multi- 63 genome analysis can be run in a small number of commands. However, there are myriad 64 options for fine-tuning datasets used for analysis and the functions of the system. The 65 underlying Nextflow structure ensures reproducibility. To illustrate the functionality of the system 66 we performed a Bactopia analysis of 1,664 public genome projects of the Lactobacillus genus, 67 an important component of the microbiome of humans and animals. 68 bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. 69 Design and implementation 70 Bactopia links together open source bioinformatics software, available from Bioconda (1), using 71 Nextflow (2). Nextflow was chosen for its flexibility: Bactopia can be run locally, on clusters, or 72 on cloud platforms with simple parameter changes. It also manages the parallel execution of 73 tasks and creates checkpoints allowing users to resume jobs. Nextflow automates installation of 74 the component software of the workflow through integration with Bioconda. For ease of 75 deployment, Bactopia can either be installed through Bioconda, a Docker container, or a 76 Singularity container. All the software programs used by Bactopia (version v1.3.0) described in 77 this manuscript are listed in Table 1 with their individual version numbers. 78 79 There are three main components of Bactopia (Figure 1): 1) Bactopia Datasets (BaDs), a 80 framework for formatting organism-specific datasets to be used by the downstream analysis 81 pipeline; 2) Bactopia Analysis Pipeline (BaAP), a customizable workflow for the analysis of 82 individual bacterial genome projects that is an extension and generalization of the previously 83 published Staphylococcus aureus-specific “Staphopia Analysis Pipeline” (StAP) (3). The inputs 84 to BaAP are FASTQ files from bacterial Illumina sequencing projects, either imported from the 85 National Centers for Biotechnology Information (NCBI) Short Read Archive (SRA) database or 86 provided locally, and any reference data in the BaDs; 3) Bactopia Tools (BaTs), a set of 87 workflows that use the output files from a BaAP project to run genomic analysis on multiple 88 genomes. For this project we used BaTs to a) summarize the results of running multiple 89 bacterial genomes through BaAP, b) extract 16S sequences and create a phylogeny, c) assign 90 taxonomic classifications with the Genome Taxonomy Database (4), d) subset Lactobacillus 91 crispatus samples by average nucleotide identity (ANI) with FastANI (5), e) run pan-genome 92 analysis for L. crispatus using Roary (6) and create a core-genome phylogeny. bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. 93 Figure 1 - Bactopia overview 94 95 96 Bactopia Datasets 97 The Bactopia pipeline can be run without downloading and formatting Bactopia Datasets 98 (BaDs). However, providing them enriches the downstream analysis. Bactopia can import 99 specific existing public datasets, as well as accessible user-provided datasets in the appropriatete 100 format. A subcommand (bactopia datasets) was created to automate downloading, building, andnd 101 (or) configuring these datasets for Bactopia. 102 103 BaDs can be grouped into those that are general and those that are user-supplied. General 104 datasets include: a Mash (7) sketch of the NCBI RefSeq (8) and PLSDB (9) databases and a 105 Sourmash (10) signature of microbial genomes (includes viral and fungal) from the NCBI 106 GenBank (11) database. Ariba (12), a software for detecting genes in raw read (FASTQ) files, 107 uses a number of default reference databases for virulence and antibiotic resistance. The bioRxiv preprint doi: https://doi.org/10.1101/2020.02.28.969394; this version posted March 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

Load more