MOSGA 2: Comparative Genomics and Validation Tools
bioRxiv preprint doi: https://doi.org/10.1101/2021.07.29.454382; this version posted July 30, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

Roman Martin¹, Hagen Dreßler¹, Georges Hattab¹, Thomas Hackl², Matthias G. Fischer², and Dominik Heider¹

¹Department of Mathematics and Computer Science, University of Marburg, 35032 Marburg, Germany
²Department of Biomolecular Mechanisms, Max Planck Institute for Medical Research, Heidelberg 69120, Germany

Due to the rapidly growing amount of available genomic information, the need for accessible and easy-to-use analysis tools is increasing. To facilitate eukaryotic genome annotations, we created MOSGA. In this work, we show how MOSGA 2 builds on it by including several advanced analyses for genomic data. Since the genomic data quality greatly impacts the annotation quality, we included multiple tools to validate and ensure high-quality user-submitted genome assemblies. Moreover, thanks to the integration of comparative genomics methods, users can benefit from a broader genomic view by analyzing multiple genomic data sets simultaneously. Further, we demonstrate the new functionalities of MOSGA 2 with different use cases and practical examples. MOSGA 2 extends the already established application to the quality control of the genomic data and integrates and analyzes multiple genomes in a larger context, e.g., by phylogenetics.

Genome annotation | Comparative Genomics | Phylogenetics | Quality control | Framework | Pipeline | Workflow

Correspondence: [email protected]

Introduction

The constantly increasing number of available whole-genome sequences, mainly derived by high-throughput sequencing, provides more and more biological insights (1). In particular, the sequencing of new bacterial, archaeal, and viral genomes is now routine practice. It has accelerated discoveries in microbial diversity and evolution, providing new insight into microbiome function, human health, and biogeochemical cycling (2–6). However, the generation of high-quality eukaryotic genomes has been hampered in the past by sequencing costs and assembly complexity, limiting sequenced eukaryotes to those of medical or economic interest (7). More recently, advances in Illumina high-throughput sequencing and emerging long-read sequencing technologies developed by Pacific Biosciences and Oxford Nanopore have also rendered the generation of genome assemblies of large eukaryotes a routine task (8–10). These ongoing trends in data availability, advanced techniques, and improved affordability call for easier access for scientists to analyze these sequence data. To address the challenging issue of correctly annotating new genomes, we developed the Modular Open-Source Genome Annotator (MOSGA) (11).

While the amount of available genomic data increases continuously, the quality of these data is as important as their availability. Low-quality genome assemblies will introduce errors (12) and, as such, affect the annotation quality. In addition, genome assemblies can be incomplete or contaminated, which does not necessarily affect the annotation process but will hinder the submission process and may lead to wrong conclusions. This study presents the new functionalities in MOSGA 2, such as new workflows for comparative genomics and the introduction of validation tools into the framework. In brief, MOSGA 2 incorporates comparative genomic workflows derived from Hackl et al. (13) and validation tools from Pirovano et al. (14).

We initially developed MOSGA to facilitate genome annotation, including an easy-to-use web interface based on a robust, scalable, modular, and reproducible platform. Besides new genome annotations, genome assemblies or multiple genomes can be placed in a bigger context like comparative genomics. Therefore, we integrated several tools into MOSGA 2 to calculate and output the phylogenetic trees, the average nucleotide identity, the completeness of the genomes, and the protein-coding gene comparison based on previously performed protein-coding gene annotations. In addition, we incorporated tools to identify the occurrence of contamination and present a new scaffold-detection method for organellar sequences present in genome assemblies.

Software changes

Phylogenetics. MOSGA 2 integrates a workflow based on nine established tools to preprocess and compute phylogenetic trees on multiple genomes. Preprocessing steps include selecting a database for genes and constructing and trimming a multiple sequence alignment (MSA). Therefore, we included BUSCO (15, 16) and EukCC (17) to identify single-copy genes in genomes for the phylogenetic computation. The user has to choose one of these tools or can even combine both to obtain potential phylogenetic markers. MOSGA 2 concatenates the chosen gene marker sequences into a new FASTA file. The user has to select a program for the MSA construction, such as MAFFT (18, 19), ClustalW (20, 21), or MUSCLE (22). For an optional MSA trimming, we integrated trimAl (23) and ClipKIT (24). Phylogenetic relationships are then reconstructed by maximum likelihood or distance algorithms using RAxML (25) or FastME (26), respectively, and the resulting trees are visualized with ggtree (27) and ape (28). Tree rooting is optional and can be defined by marking an uploaded genome assembly as an outgroup.

Average nucleotide identity. For whole-genome similarity metrics, MOSGA 2 includes a comparative genomics workflow that calculates the Average Nucleotide Identity (ANI) across all genomes. To achieve this, we integrated FastANI (29), which compares the genomes against each other. Further, the ANI values are represented as a heatmap.

Protein-coding genes comparison. MOSGA 2 integrates a comparative genomics workflow that compares protein-coding genes from all uploaded genomes against each other. This comparison requires a previous annotation of protein-coding genes, which can be achieved via the annotation pipeline or by importing already annotated genomes as GBFF (GenBank flat format) files. This allows, for example, a comparison between different gene prediction tools or between reference and experimental annotations. Technically, MOSGA 2 extracts the protein-coding sequences and matches them against each other. Matches above a defined threshold will be binned back to a genome, and the average coding content similarities between the genomes are displayed as a heatmap. This analysis allows consistency checks for gene predictions across different genomes. This method compares the nucleotide sequence of protein-coding genes and has similarities with the concept of Average Amino Acid Identity.

Genome completeness. The identification of single-copy genes is a crucial step for phylogenetic analysis. Therefore, we integrated BUSCO and EukCC into MOSGA 2. While BUSCO's data source is OrthoDB (31), EukCC relies on PANTHER (32). These tools can estimate the completeness of assemblies, and MOSGA 2 integrates them for validation into the annotation and the comparative genomics workflow. Genome completeness results for each genome are visualized together in the comparative genomics workflow and separately in the annotation workflow.

[...] prediction, MOSGA 2 offers BlobTools (33) to estimate the taxonomic source of each scaffold. Such sources may be helpful to identify putative biological contaminants. Additionally, MOSGA 2 includes NCBI's VecScreen based on the UniVec database to identify adapter sequences.

Organellar DNA scanner. MOSGA 2 is optimized for annotating nuclear DNA sequences from eukaryotic cells and is less suited for organellar DNA, such as mitochondrial and plastid genomes. To identify organellar DNA in genome assemblies, MOSGA 2 combines information from GC content and from plastid and mitochondrial reference protein databases with RNA prediction algorithms such as barrnap and tRNAscan-SE 2.0 (34). At the end of each annotation job, MOSGA 2 creates a relative ranking of the most likely scaffolds. The scoring is an arithmetic calculation based on the density of organellar-specific genes, the numbers of tRNAs and rRNAs, and the GC-content variance. To create the plastid and mitochondrial reference protein databases, we clustered the respective RefSeq (35) databases with MMseqs2 (36) and kept only the representative sequences of these clusters to remove redundancy.

Taxonomy search. In several MOSGA annotation jobs, we observed that multiple users did not select the most suitable gene-prediction models for the given data. Depending on the selection of gene prediction tools, this task could be challenging since, for example, the gene predictor Augustus (37) already includes 80 species-specific models. Identifying the most suitable models requires knowledge or an educated guess about each listed species and their relatedness. To support the user during this task, we implemented a tax-