bioRxiv preprint doi: https://doi.org/10.1101/2021.07.29.454382; this version posted July 30, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

MOSGA 2: Comparative and validation tools

Roman Martin1, Hagen Dreßler1, Georges Hattab1, Thomas Hackl2, Matthias G. Fischer2, and Dominik Heider1,

1Department of Mathematics and Computer Science, University of Marburg, 35032 Marburg, Germany 2Department of Biomolecular Mechanisms, Max Planck Institute for Medical Research, Heidelberg 69120, Germany

Due to the highly growing number of available genomic infor- which does not necessarily affect the annotation process but mation, the need for accessible and easy-to-use analysis tools is will hinder the submission process and may lead to wrong increasing. To facilitate eukaryotic genome annotations, we cre- conclusions. This study presents the new functionalities in ated MOSGA. In this work, we show how MOSGA 2 is devel- MOSGA 2, such as new workflows for oped by including several advanced analyses for genomic data. and the introduction of validation tools into the framework. Since the genomic data quality greatly impacts the annotation In brief, MOSGA 2 incorporates comparative genomic work- quality, we included multiple tools to validate and ensure high- flows derived from Hackl et al. (13) and validation tools from quality user-submitted genome assemblies. Moreover, thanks to the integration of comparative genomics methods, users can Pirovano et al. (14). benefit from a broader genomic view by analyzing multiple ge- We initially developed MOSGA to facilitate genome anno- nomic data sets simultaneously. Further, we demonstrate the tation, including an easy-to-use web interface based on a new functionalities of MOSGA 2 by different use-cases and robust, scalable, modular, and reproducible platform. Be- practical examples. MOSGA 2 extends the already established sides new genome annotations, genome assemblies or mul- application to the quality control of the genomic data and inte- tiple genomes can be placed in a bigger context like com- grates and analyzes multiple genomes in a larger context, e.g., parative genomics. Therefore, we integrated several tools by phylogenetics. into MOSGA 2 to calculate and output the phylogenetic

Genome annotation | Comparative Genomics | Phylogenetics | Quality control trees, the average nucleotide identity, the completeness of | Framework | Pipeline | Workflow the genomes, and the protein-coding gene comparison based Correspondence: [email protected] on previously performed protein-coding gene annotations. In addition, we incorporated tools to identify the occurrence of contamination and present a new scaffold-detection method Introduction for organellar sequences present in genome assemblies. The constantly increasing number of available whole- genome sequences, mainly derived by high-throughput se- Software changes quencing, provides more and more biological insights (1). In Phylogenetics. MOSGA 2 integrates a workflow based on particular, the sequencing of new bacterial, archaeal, and vi- nine established tools to preprocess and compute phyloge- ral genomes is now routine practice. It has accelerated dis- netic trees on multiple genomes. Preprocessing steps include coveries in microbial diversity and evolution, providing new selecting a database for genes, constructing and trimming insight into function, human health, and biogeo- multiple sequence alignment (MSA). chemical cycling (2–6). However, the generation of high- Therefore, we included BUSCO (15, 16) and EukCC (17) quality eukaryotic genomes has been hampered inDRAFT the past to identify single-copy genes in genomes for the phyloge- by sequencing costs and assembly complexity, limiting se- netic computation. The user has to choose from one of these quenced eukaryotes to those of medical or economical inter- tools or even combine both as potential phylogenetic mark- est (7). More recently, advances in Illumina high-throughput ers. MOSGA 2 concatenates the chosen gene marker se- sequencing and emerging long-read sequencing technologies quences into a new FASTA file. The user has to select a pro- developed by Pacific Biosciences and Oxford Nanopore have gram for the MSA construction, such as MAFFT (18, 19), also rendered the generation of genome assemblies of large ClustalW (20, 21), or MUSCLE (22). For an optional eukaryotes a routine task (8–10). These proceeding trends MSA trimming, we integrated trimAl (23) and ClipKIT (24). in data availability, advanced techniques, and improved af- Phylogenetic relationships are then reconstructed by maxi- fordability require more facilitated access for scientists to mum likelihood or distance algorithms using RAxML (25) or analyze these sequence data. To address the challenging is- FastME (26), respectively, and the resulting tree are visual- sue of correctly annotating new genomes, we developed the ized by ggtree (27) and ape (28). Tree rooting is optional and Modular Open-Source Genome Annotator (MOSGA) (11). can be defined by marking an uploaded genome assembly as While the number of available genomic data increases con- an outgroup. tinuously, the quality of these data is as important as their availability. Low-quality genome assemblies will introduce Average nucleotide identity. For whole-genome similarity errors (12) and, as such, affect the annotation quality. In addi- metrics, MOSGA 2 includes a comparative genomics work- tion, genome assemblies can be incomplete or contaminated, flow that calculates the Average Nucleotide Identity (ANI)

Martin et al. | bioRχiv | July 30, 2021 | 1–6 bioRxiv preprint doi: https://doi.org/10.1101/2021.07.29.454382; this version posted July 30, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

across all genomes. To achieve this, we integrated Fas- prediction, MOSGA 2 offers BlobTools (33) to estimate the tANI (29), which compares the genomes against each other. taxonomical source of each scaffold. Such sources may be Further, the ANI values are represented as a heatmap. helpful to identify putative biological contaminants. Addi- tionally, MOSGA 2 includes NCBI’s VecScreen based on the Protein-coding genes comparison. MOSGA 2 integrates UniVec database to identify adapter sequences. a comparative genomics workflow that compares protein- coding genes from all uploaded genomes against each other. Organellar DNA scanner. MOSGA 2 is optimized for This comparison requires a previous annotation of protein- annotating nuclear DNA sequences from eukaryotic cells coding genes, which can be achieved via the annotation and is less suited for organellar DNA, such as mitochon- pipeline or by importing already annotated genomes as GBFF drial and plastid genomes. To identify organellar DNA (GenBank flat format) files. This allows, for example, in genome assemblies, MOSGA 2 combines information a comparison between different tools or from GC-content, plastid and mitochondrial reference pro- between reference and experimental annotations. Techni- tein databases with RNA prediction algorithms such as bar- cally, MOSGA 2 extracts the protein-coding sequences and rnap and tRNAscan-SE 2.0 (34). At the end of each an- matches them against each other. Matches above a defined notation job, MOSGA 2 creates a relative ranking with the threshold will be binned back to a genome, and the aver- most likely scaffolds. The scoring is an arithmetic calculation age coding content similarities between the genomes are dis- based on the density of organellar-specific genes, the num- played as a heatmap. This analysis allows consistency checks bers of tRNAs, rRNAs, and the GC-content variance. To cre- for gene predictions across different genomes. This method ate the plastid and mitochondria reference proteins databases, compares the nucleotide sequence of protein-coding genes we clustered respective RefSeq (35) databases with MM- and has similarities with the concept of Average Amino Acids Seq2 (36) and only kept representative sequences of these Identity. clusters into our databases to remove redundancy.

Genome completeness. The identification of single-copy Taxonomy search. In several MOSGA annotation jobs, we genes is a crucial step for phylogenetic analysis. Therefore, observed that multiple users did not select the best suitable we integrated BUSCO and EukCC into MOSGA 2. While gene-prediction models for the given data. Depending on the BUSCO’s data source is OrthoDB (31), EukCC relies on selection of gene prediction tools, this task could be chal- PANTHER (32). These tools can estimate the completeness lenging since, for example, the gene predictor Augustus (37) of assemblies, and MOSGA 2 integrates them for validation currently includes already 80 species-specific models. Identi- into the annotation and the comparative genomics workflow. fying the most suitable models requires knowledge or an ed- Genome completeness results for each genome are visualized ucated guess about each listed species and their relatedness. together in the comparative genomics workflow and the an- To support the user during this task, we implemented a tax- notation workflow separately. onomy search for taxonomy-related options. To do so, users select the species name for the uploaded genome assemblies Contamination detection. To detect potential contamina- and MOSGA 2 searches for each tool’s best putative species- tion in a genome assembly, such as sequences from other or- or lineage-specific parameter. Internally, MOSGA 2 con- ganisms or residual sequencing adapters, we integrated two tains a trimmed version of the NCBI taxonomy database (38) validation tools to identify putative contamination. In the and searches for the shortest weighted distance between two case of a gene prediction workflow with RNA-seq based gene given nodes. This feature is available for the gene-prediction tools Augustus, GlimmerHMM (39), and SNAP (40) and the 0.019 DRAFTS. cerevisiae validation tool BUSCO. 0.009

0.015 Annotation quality. S. paradoxus By default, each finished MOSGA 0.009 100 genome annotation is validated by NCBI’s tbl2asn. In

0.031 MOSGA 2, we inserted additional multiple filters that im- 0.002 S. mikatae 100 prove NCBI compatibility, mainly following Pirovano et al. (14). We integrated additional filters checking the suggested 0.026 0.033 S. kudriavzevii 100 sizes for exons, introns, and the completeness of protein- coding sequences; this includes internal stop-codons and cor- 0 0.033 100 S. arboricola rect start- and stop-codons.

0.011 S. eubayanus Integration of existing annotations. MOSGA 2 can im- port existing genome annotations in GenBank flat format 0.01 S. uvarum (GBFF) and, therefore, can combine or complete existing an- notations with output from additional prediction tools. Re- Fig. 1. Multi-gene phylogenetic tree of seven Saccharomyces species. This phylo- sults from prediction tools can be visually compared using genetic tree output of MOSGA 2 was computed using BUSCO MAFFT, traimAl, and RAxML based on 173 identified BUSCO genes. The tree topologies are identical to JBrowse (41). The GBFF file support is not limited to anno- the one described in Peter et al. (30). Dashed lines indicate a higher distance than tation jobs, but can also be used for comparative genomics the 90th quantile.

2 | bioRχiv Martin et al. | MOSGA 2 bioRxiv preprint doi: https://doi.org/10.1101/2021.07.29.454382; this version posted July 30, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

BUSCOs category ANI a Complete (single-copy) a 80.45 80.90 81.74 80.26 81.05 81.05 100 Complete (duplicated) Fragmented b Missing b 80.41 79.65 80.81 82.36 88.23 79.66 95 c c 80.84 79.68 80.67 79.59 80.09 91.16

90 d d 81.74 80.89 80.67 80.77 81.60 80.94 a) S. arboricola b) S. cerevisiae e e 80.35 82.34 79.68 80.80 83.61 79.64 85 c) S. eubayanus d) S. kudriavzevii f f 81.09 88.22 80.09 81.65 83.60 80.22 e) S. mikatae 80 f) S. paradoxus g g 81.05 79.59 91.14 80.92 79.69 80.19 g) S. uvarum f c a e g 0 20 40 60 80 100 120 140 160 180 200 220 240 260 b d Number of BUSCOs

Fig. 2. Merged visualization of BUSCO and FastANI results. Visual representation from the BUSCO analysis based on the Eukaryota OrthoDB lineage shows the genome- completeness of seven different yeast species and the heatmap from Average Nucleotide Identity (ANI) analysis. tasks or to mix different file formats, which are subsequently genus or that these orthologs are indeed absent from Saccha- interpreted by our tool. romyces genomes. According to the BUSCO genome com- pleteness analysis shown in Figure 2, the genome assembly of External Application Programming Interfaces. As an- S. mikatae seems to be incomplete. This could be confirmed other new feature, we introduced three APIs to established by applying an additional EukCC analysis for genome com- external tools: g:Profiler g:GOST (42) for functional en- pleteness (see Figure S1). The incompleteness of S. mikatae richment analysis, Integrated Interactions Database (43), and is likely to be derived from the high number of scaffolds the STRING database (44) for Protein-Protein Interactions and indicates a low-quality genome assembly. In this con- Analysis. MOSGA 2 submits predicted protein identifiers text, EukCC detected a maximum silent contamination value from functional annotation to these tools by enabling multi- over 40 % for this assembly. Furthermore, MOSGA 2 cal- ple APIs in the annotation mode and passing the results back culated the Average Nucleotide Identity that is shown in Fig- into the job submission. ure 2. S. cerevisiae’s genome shows a high similarity with to S. paradoxus genome and S. eubayanus to S. uvarum. In- Gene-prediction. To improve one of the main tasks of deed, phylogenetic tree analysis indicates a closer relatedness MOSGA, we included two new workflows to predict protein- of these species. coding genes with BRAKER 2 (45). New genes can be found based on protein- or orthology-based evidence derived from Protein-coding genes comparison. To demonstrate the the OrthoDB database. protein-coding gene comparison feature of MOSGA 2, we used Augustus v3.4.0 to predict protein-coding sequences Results in S. paradoxus and five S. cerevisiae strains. The chosen To demonstrate the usability of MOSGA 2, we will demon- assemblies are listed in Table S2. Protein-coding predic- strate several features with exemplary cases. tion consistency can be checked independently from the used software, especially for annotating different strains from the Phylogenetics, genome completeness and Average same species or identifying contaminant species. Depend- Nucleotide Identity. We performed a phylogeneticDRAFT analysis ing on the selected threshold for acceptance of a gene match, based on seven different genome assemblies from the Sac- the strictness increases, and the genes for each genome are charomyces genus (see Table S1). BUSCO served as the mostly only matching with their original genome, as shown source for gene detection with the Eukaryota lineage Or- in Figure 3. By decreasing the identity threshold, the total thoDB dataset. MOSGA 2 concatenated the genes, produced matches became more undifferentiable. Yet, it is important a multiple-sequence alignment with MAFFT, and trimmed to note that it is possible to differentiate between S. para- the MSA with trimAl. The trimmed MSA is used to calculate doxus and S. cerevisiae genes similarities. Additionally, we the phylogenetics with RAxML. The resulting phylogenetic could observe that S. cerevisiae strain SK1 has more gene tree in Figure 1 displays a branching topology that is iden- similarities with S288C, while HLJ167, Y12 and sake001 are tical to previous analyses (30). We marked S. uvarum as an more similar to each other. We could confirm this observation outgroup to define the tree rooting. Since BUSCO provided by calculating the phylogenetic tree with 1873 BUSCOs for the gene source for the phylogenetic analysis in this exam- these genomes based on the BUSCO Saccharomycetes Or- ple, MOSGA 2 evaluated the genome completeness for each thoDB data source, shown in Figure S3. genome that is shown in Figure 2. The distribution of com- mon missed and unique missed BUSCOs for each genome is Organellar DNA scanner. We applied our organellar DNA shown in Figure S2. Most of the missing BUSCOs were com- scanner on twenty diverse eukaryotic genomes to quickly mon to all genomes, indicating that the Eukaryota lineages identify the mitochondrial, chloroplastid, or other organelles’ do not entirely cover all species from the Saccharomyces scaffolds. We chose genome assemblies from various eu-

Martin et al. | MOSGA 2 bioRχiv | 3 bioRxiv preprint doi: https://doi.org/10.1101/2021.07.29.454382; this version posted July 30, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

S. paradoxus S. paradoxus S. paradoxus

S. cerevisiae SK1 S. cerevisiae SK1 S. cerevisiae SK1

S. cerevisiae S288C S. cerevisiae S288C S. cerevisiae S288C

S. cerevisiae HLJ167 S. cerevisiae sake001 S. cerevisiae sake001

S. cerevisiae Y12 S. cerevisiae HLJ167 S. cerevisiae HLJ167

S. cerevisiae sake001 S. cerevisiae Y12 S. cerevisiae Y12 SK1 SK1 Y12 SK1 S288C S288C HLJ167 S288C HLJ167 HLJ167 sake001 sake001 sake001 S. paradoxus S. paradoxus S. paradoxus S. cerevisiae Y12 S. cerevisiae S. cerevisiae Y12 S. cerevisiae S. cerevisiae S. cerevisiae S. cerevisiae S. cerevisiae S. cerevisiae S. cerevisiae S. cerevisiae S. cerevisiae S. cerevisiae S. cerevisiae S. cerevisiae

threshold = 100 threshold = 99 threshold = 90

Fig. 3. Gene similarity matrix of six different Saccharomyces strains and species based on Augustus predicted protein-coding genes. Depending on the selected threshold for the binning, the similarities differentiate more or less strongly between each genome. Values are relative and normalized by the genome sizes.

karyotic taxa, including plants, nematodes, protists, fungi, vertebrates, and mammalians, to evaluate our scoring. For that purpose, we performed for each genome individual an- notations jobs with default settings but without any protein- coding gene prediction. The results of the organellar DNA scanner are represented in a table containing all scaffolds and putative indicators for the presence of an organellar scaffold. Table 1. Summarized results from the organellar DNA scanner of twenty eukary- To facilitate the interpretation of this table, we introduced a otic genome assemblies. Asterisks mark assemblies that contain multiple unplaced simple scoring system that considers all putative indicators. scaffolds. +++ identifies an organellar scaffold as the first suggestion, ++ identifies The highest-ranking score sorts the resulting scoring table an organellar scaffold within the first one percent of the suggestions, + identifies an organellar scaffold within the first five percent of the suggestions. The total result rows, shown for Nannochloropsis oceanica in Table S3. In ranking is shown in Table S3. 16 out of 20 assemblies, the first hit belonged to one of the organellar scaffolds. An overview of the results is shown in Species No. Scaffolds Result Table 1 and a detailed table considering the type of the or- Arabidopsis thaliana 7 +++ ganelles is represented in Table S4. The scoring generally Apis mellifera 11 +++ performs worse if the number of unplaced contigs increases. Babesia microti 6 +++ We validated the positive matches by comparing our identi- Bos taurus 2211 +++ fied first-hit scaffold with the NCBI Genome database. Caenorhabditis elegans 7 +++ Cafeteria burkhardae 170 +++ Taxonomy search. The integrated taxonomy search in Cardiosporidium cionae 2204 ++ MOSGA 2 tries to identify the most suitable model or lin- Corvus cornix 113 +++ eage for the annotation workflow. Depending on the com- Danio rerio 1917 +++ pleteness of the NCBI taxonomy and the available models, Drosophila melanogaster 1870 +++ it can quickly identify the putative best parameterDRAFT for each Homo sapiens 164 +++ tool. For example, defining Saccharomyces cerevisiae as the Ipomoea triloba 17 +++ input genome species, MOSGA 2 selects the Saccharomyces Nannochloropsis oceanica 32 +++ Augustus species model and the Saccharomycetes OrthoDB Plasmodium falciparum 15 +++ data source for BUSCO. As another example, a search for Prunus dulcis 692 ++ Taphrina alni results in an Augustus model match for Pneu- Saccharomyces cerevisiae 17 +++ mocystis jirovecii (Figure S4). Both species belong to the Salmo salar 241573 ++ subphylum Taphrinomycotina. The search depends on the Strongylocentrotus purpuratus 871 +++ completeness of the NCBI-defined taxonomy. In other cases, Tribolium castaneum 2149 ++ if the distance is similar, the selected match will be the first Zea mays 687 + hit.

Discussion In MOSGA 2, we streamline the annotation process by pro- viding validation methods and tools. New functionalities en- able user-friendly analyses of genome completeness, contam- ination, or optimal selected parameters. While improving

4 | bioRχiv Martin et al. | MOSGA 2 bioRxiv preprint doi: https://doi.org/10.1101/2021.07.29.454382; this version posted July 30, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. the annotation process, we additionally implemented com- istration requirements. The calculated results for the parative genomics workflows and thus extended the scope of given examples are available under mosga.mathematik. MOSGA’s capabilities. uni-marburg.de/phylo for the Saccharomyces phylogenetics We demonstrated that MOSGA 2 generates reliable phylo- example, mosga.mathematik.uni-marburg.de/genecomp for genetic results using the Saccharomyces genus as an exam- the Saccharomyces strains gene comparison and mosga. ple. Depending on the chosen parameters, variances have to mathematik.uni-marburg.de/organelles for the organellar be expected, but in this analysis, we only used the default DNA scanner. MOSGA 2 settings and selected BUSCO as the gene source. We presented that the protein-coding genes comparison anal- Competing interests ysis helps to identify differences in samples based on pre- viously predicted protein genes or inconsistent gene predic- There is NO Competing Interest. tions. Furthermore, besides quality controls such as the genome Funding completeness analysis, we showed that MOSGA 2 could fa- cilitate data preparation by identifying organellar scaffolds This work was supported by the LOEWE program of the inside a genome assembly. In most cases, MOSGA 2 could State of Hesse (Germany) in the MOSLA research cluster. identify the organellar scaffold directly, although this de- pends on the assembly quality. The precision of the or- Author contributions statement ganellar DNA scanner decreases in samples with many scaf- folds, which could indicate problems with biological con- R.M. wrote the manuscript, designed and developed the tamination. Moreover, we demonstrated that the taxonomy framework. H.D. implemented the comparative genomics search feature supports the user in finding the most appro- workflows and assisted in the development. G.H. supported priate model. This search generally relies heavily on NCBI the tool development and data visualizations, he proofread taxonomy quality. This is especially relevant when taxonom- and revised the manuscript. T.H. provided the database for ical information is incomplete. the organellar DNA scanner and ideas for implementation During MOSGA 2 development, we implemented feedback and revised the manuscript. M.G.F. revised the manuscript. and observations from MOSGA users, such as wrongly cho- D.H. supervised the project, discussed the results, and re- sen species models or eukaryotic assemblies with organellar vised the manuscript. All authors read and approved the final scaffolds. MOSGA 2 improved in terms of quality and user- manuscript. friendliness by implementing validation tools and functions ACKNOWLEDGEMENTS like the taxonomy search and the organellar DNA scanner. The authors thank the anonymous reviewers for their valuable suggestions. This Furthermore, we integrated new workflows that allow com- work was supported by the BMBF-funded de.NBI Cloud within the German Network for Infrastructure (de.NBI). parative genomics analyses and expand the scope of MOSGA analysis for a wider range of applications. Our comparative genomics workflows are based on state-of-the-art tools. For Bibliography output visualization, we have followed the guidelines of Hat- 1. Eric W. Sayers, Mark Cavanaugh, Karen Clark, James Ostell, Kim D. Pruitt, and Ilene tab et al. (46), and for improved genome annotation quality Karsch-Mizrachi. GenBank. Nucleic Acids Research, 47(D1):D94–D99, 2019. ISSN and validation tools, the advice of Pirovano et al. (14). In to- 13624962. doi: 10.1093/nar/gky989. 2. Paul M. Berube, Steven J. Biller, Thomas Hackl, Shane L. Hogle, Brandon M. Satinsky, tal, we increased the number of implemented workflow rules Jamie W. Becker, Rogier Braakman, Sara B. Collins, Libusha Kelly, Jessie Berta-Thompson, from 63 to 129. Figure S5 and Figure S6 show two job ex- Allison Coe, Kristin Bergauer, Heather A. Bouman, Thomas J. Browning, Daniele De Corte, Christel Hassler, Yotam Hulata, Jeremy E. Jacquot, Elizabeth W. Maas, Thomas amples for comparative genomics and the annotationDRAFT work- Reinthaler, Eva Sintes, Taichi Yokokawa, Debbie Lindell, Ramunas Stepanauskas, and Sal- flow. Although MOSGA 2 grows in complexity, it relies on lie W. Chisholm. Single cell genomes of Prochlorococcus, Synechococcus, and sympatric microbes from diverse marine environments. Scientific Data, 5(1):180154, dec 2018. ISSN our established architecture that allows high scalability and 2052-4463. doi: 10.1038/sdata.2018.154. reproducibility thanks to the multiple-layer design and the 3. Maria G Pachiadaki, Julia M Brown, Joseph Brown, Oliver Bezuidt, Paul M Berube, Steven J Biller, Nicole J Poulton, Michael D Burkart, James J La Clair, Sallie W Chisholm, and Ramu- Snakemake workflow engine (11). We implemented several nas Stepanauskas. Charting the Complexity of the Marine Microbiome through Single-Cell Snakemake and comprehensive data analysis tests that Git- Genomics. Cell, 179(7):1623–1635.e11, 2019. ISSN 1097-4172. doi: 10.1016/j.cell.2019. 11.017. Lab performs on source-code contribution to maintain the 4. Robert M Bowers, Nikos C Kyrpides, Ramunas Stepanauskas, Miranda Harmon-Smith, code quality. We encourage scientists to send us implemen- Devin Doud, T B K Reddy, Frederik Schulz, Jessica Jarett, Adam R Rivers, Emiley A Eloe- Fadrosh, Susannah G Tringe, Natalia N Ivanova, Alex Copeland, Alicia Clum, Eric D Be- tation and workflow suggestions to support the development craft, Rex R Malmstrom, Bruce Birren, Mircea Podar, Peer Bork, George M Weinstock, of MOSGA 2. George M Garrity, Jeremy A Dodsworth, Shibu Yooseph, Granger Sutton, Frank O Glöckner, Jack A Gilbert, William C Nelson, Steven J Hallam, Sean P Jungbluth, Thijs J G Ettema, Scott Tighe, Konstantinos T Konstantinidis, Wen-Tso Liu, Brett J Baker, Thomas Rattei, Jonathan A Eisen, Brian Hedlund, Katherine D McMahon, Noah Fierer, Rob Knight, Rob Data availability Finn, Guy Cochrane, Ilene Karsch-Mizrachi, Gene W Tyson, Christian Rinke, Alla Lapidus, Folker Meyer, Pelin Yilmaz, Donovan H Parks, A Murat Eren, Lynn Schriml, Jillian F Banfield, Our source code is MIT-licensed freely available on GitLab Philip Hugenholtz, and Tanja Woyke. Minimum information about a single amplified genome (gitlab.com/mosga/mosga) and Zenodo (doi: 10.5281/zen- (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea. Nature Biotechnology, 35(8):725–731, aug 2017. ISSN 1087-0156. doi: 10.1038/nbt.3893. odo.5121228). An online version of MOSGA 2 is avail- 5. Frederik Schulz, Lauren Alteio, Danielle Goudeau, Elizabeth M. Ryan, Feiqiao B. Yu, able under mosga.mathematik.uni-marburg.de. This web- Rex R. Malmstrom, Jeffrey Blanchard, and Tanja Woyke. Hidden diversity of soil gi- ant viruses. Nature Communications, 9(1):4881, dec 2018. ISSN 2041-1723. doi: site is free and open to all users, and there are no reg- 10.1038/s41467-018-07335-2.

Martin et al. | MOSGA 2 bioRχiv | 5 bioRxiv preprint doi: https://doi.org/10.1101/2021.07.29.454382; this version posted July 30, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

6. Shijie Zhao, Tami D. Lieberman, Mathilde Poyet, Kathryn M. Kauffman, Sean M. Gibbons, and evolutionary analyses in R. Bioinformatics (Oxford, England), 35(3):526–528, 2019. Mathieu Groussin, Ramnik J. Xavier, and Eric J. Alm. Adaptive Evolution within Gut Mi- ISSN 1367-4811. doi: 10.1093/bioinformatics/bty633. crobiomes of Healthy People. Cell Host & Microbe, 25(5):656–667.e8, may 2019. ISSN 29. Chirag Jain, Luis M. Rodriguez-R, Adam M. Phillippy, Konstantinos T. Konstantinidis, and 19313128. doi: 10.1016/j.chom.2019.03.007. Srinivas Aluru. High throughput ANI analysis of 90K prokaryotic genomes reveals clear 7. Javier del Campo, Michael E. Sieracki, Robert Molestina, Patrick Keeling, Ramon Massana, species boundaries. Nature Communications, 9(1):1–8, 2018. ISSN 20411723. doi: 10. and Iñaki Ruiz-Trillo. The others: our biased perspective of eukaryotic genomes. Trends 1038/s41467-018-07641-9. in Ecology and Evolution, 29(5):252–259, may 2014. ISSN 01695347. doi: 10.1016/j.tree. 30. Jackson Peter, Matteo De Chiara, Anne Friedrich, Jia-Xing Yue, David Pflieger, Anders 2014.03.006. Bergström, Anastasie Sigwalt, Benjamin Barre, Kelle Freel, Agnès Llored, Corinne Cruaud, 8. Lina Sun, Tian Gao, Feilong Wang, Zuliang Qin, Longxia Yan, Wenjing Tao, Minghui Li, Karine Labadie, Jean-Marc Aury, Benjamin Istace, Kevin Lebrigand, Pascal Barbry, Stefan Canbiao Jin, Li Ma, Thomas D. Kocher, and Deshou Wang. Chromosome-level genome Engelen, Arnaud Lemainque, Patrick Wincker, Gianni Liti, and Joseph Schacherer. Genome assembly of a cyprinid fish Onychostoma macrolepis by integration of nanopore sequencing, evolution across 1,011 Saccharomyces cerevisiae isolates. Nature, 556(7701):339–344, Bionano and Hi-C technology. Molecular Ecology Resources, pages 1755–0998.13190, jul 2018. ISSN 1476-4687. doi: 10.1038/s41586-018-0030-5. 2020. ISSN 1755-098X. doi: 10.1111/1755-0998.13190. 31. Evgenia V Kriventseva, Dmitry Kuznetsov, Fredrik Tegenfeldt, Mosè Manni, Renata Dias, 9. Gergo Palfalvi, Thomas Hackl, Niklas Terhoeven, Tomoko F. Shibata, Tomoaki Nishiyama, Felipe A Simão, and Evgeny M Zdobnov. OrthoDB v10: sampling the diversity of animal, Markus Ankenbrand, Dirk Becker, Frank Förster, Matthias Freund, Anda Iosip, Ines Kreuzer, plant, fungal, protist, bacterial and viral genomes for evolutionary and functional annotations Franziska Saul, Chiharu Kamida, Kenji Fukushima, Shuji Shigenobu, Yosuke Tamada, of orthologs. Nucleic acids research, 47(D1):D807–D811, 2019. ISSN 1362-4962. doi: Lubomir Adamec, Yoshikazu Hoshi, Kunihiko Ueda, Traud Winkelmann, Jörg Fuchs, Ingo 10.1093/nar/gky1053. Schubert, Rainer Schwacke, Khaled Al-Rasheid, Jörg Schultz, Mitsuyasu Hasebe, and 32. Huaiyu Mi, Anushya Muruganujan, and Paul D Thomas. PANTHER in 2013: modeling the Rainer Hedrich. Genomes of the Venus Flytrap and Close Relatives Unveil the Roots of evolution of gene function, and other gene attributes, in the context of phylogenetic trees. Plant Carnivory. Current Biology, 30(12):2312–2320.e5, jun 2020. ISSN 09609822. doi: Nucleic acids research, 41(Database issue):D377–86, jan 2013. ISSN 1362-4962. doi: 10.1016/j.cub.2020.04.051. 10.1093/nar/gks1118. 10. Graham Wiley and Matthew J. Miller. A Highly Contiguous Genome for the Golden-Fronted 33. Dominik R Laetsch and Mark L Blaxter. BlobTools: Interrogation of genome assemblies. Woodpecker ( Melanerpes aurifrons ) via Hybrid Oxford Nanopore and Short Read Assem- F1000Research, 6:1287, jul 2017. ISSN 2046-1402. doi: 10.12688/f1000research.12232.1. bly. G3: Genes|Genomes|Genetics, 10(6):1829–1836, jun 2020. ISSN 2160-1836. 34. Patricia P. Chan, Brian Y. Lin, Allysia J. Mak, and Todd M. Lowe. tRNAscan-SE 2.0: Im- doi: 10.1534/g3.120.401059. proved Detection and Functional Classification of Transfer RNA Genes. bioRxiv, (6):614032, 11. Roman Martin, Thomas Hackl, Georges Hattab, Matthias G Fischer, and Dominik Heider. 2019. doi: 10.1101/614032. MOSGA: Modular Open-Source Genome Annotator. Bioinformatics, 36(22-23):5514–5515, 35. Nuala A. O’Leary, Mathew W. Wright, J. Rodney Brister, Stacy Ciufo, Diana Haddad, Rich apr 2021. ISSN 1367-4803. doi: 10.1093/bioinformatics/btaa1003. McVeigh, Bhanu Rajput, Barbara Robbertse, Brian Smith-White, Danso Ako-Adjei, Alexan- 12. Steven L Salzberg, Adam M Phillippy, Aleksey Zimin, Daniela Puiu, Tanja Magoc, Sergey der Astashyn, Azat Badretdin, Yiming Bao, Olga Blinkova, Vyacheslav Brover, Vyacheslav Koren, Todd J Treangen, Michael C Schatz, Arthur L Delcher, Michael Roberts, Guillaume Chetvernin, Jinna Choi, Eric Cox, Olga Ermolaeva, Catherine M. Farrell, Tamara Goldfarb, Marçais, Mihai Pop, and James A Yorke. GAGE: A critical evaluation of genome assemblies Tripti Gupta, Daniel Haft, Eneida Hatcher, Wratko Hlavina, Vinita S. Joardar, Vamsi K. Ko- and assembly algorithms. Genome research, 22(3):557–67, mar 2012. ISSN 1549-5469. dali, Wenjun Li, Donna Maglott, Patrick Masterson, Kelly M. McGarvey, Michael R. Murphy, doi: 10.1101/gr.131383.111. Kathleen O’Neill, Shashikant Pujar, Sanjida H. Rangwala, Daniel Rausch, Lillian D. Riddick, 13. Thomas Hackl, Roman Martin, Karina Barenhoff, Sarah Duponchel, Dominik Heider, and Conrad Schoch, Andrei Shkeda, Susan S. Storz, Hanzhen Sun, Francoise Thibaud-Nissen, Matthias G. Fischer. Four high-quality draft genome assemblies of the marine heterotrophic Igor Tolstoy, Raymond E. Tully, Anjana R. Vatsan, Craig Wallin, David Webb, Wendy Wu, nanoflagellate Cafeteria roenbergensis. Scientific Data, 7(1):29, dec 2020. ISSN 2052- Melissa J. Landrum, Avi Kimchi, Tatiana Tatusova, Michael DiCuccio, Paul Kitts, Terence D. 4463. doi: 10.1038/s41597-020-0363-4. Murphy, and Kim D. Pruitt. Reference sequence (RefSeq) database at NCBI: Current status, 14. Walter Pirovano, Marten Boetzer, Martijn F.L. Derks, and Sandra Smit. NCBI-compliant taxonomic expansion, and functional annotation. Nucleic Acids Research, 44(D1):D733– genome submissions: Tips and tricks to save time and money. Briefings in Bioinformatics, D745, 2016. ISSN 13624962. doi: 10.1093/nar/gkv1189. 18(2):179–182, 2017. ISSN 14774054. doi: 10.1093/bib/bbv104. 36. Martin Steinegger and Johannes Söding. MMseqs2 enables sensitive protein sequence 15. Felipe A. Simão, Robert M. Waterhouse, Panagiotis Ioannidis, Evgenia V. Kriventseva, and searching for the analysis of massive data sets. Nature Biotechnology, 35(11):1026–1028, Evgeny M. Zdobnov. BUSCO: assessing genome assembly and annotation completeness 2017. ISSN 15461696. doi: 10.1038/nbt.3988. with single-copy orthologs. Bioinformatics, 31(19):3210–3212, oct 2015. ISSN 1367-4803. 37. Mario Stanke and Burkhard Morgenstern. AUGUSTUS: a web server for gene prediction doi: 10.1093/bioinformatics/btv351. in eukaryotes that allows user-defined constraints. Nucleic acids research, 33(Web Server 16. Robert M. Waterhouse, Mathieu Seppey, Felipe A. Simao, Mosè Manni, Panagiotis Ioan- issue):W465–7, jul 2005. ISSN 1362-4962. doi: 10.1093/nar/gki458. nidis, Guennadi Klioutchnikov, Evgenia V. Kriventseva, and Evgeny M. Zdobnov. BUSCO 38. Conrad L. Schoch, Stacy Ciufo, Mikhail Domrachev, Carol L. Hotton, Sivakumar Kannan, applications from quality assessments to gene prediction and phylogenomics. Molecular Bi- Rogneda Khovanskaya, Detlef Leipe, Richard McVeigh, Kathleen O’Neill, Barbara Rob- ology and Evolution, 35(3):543–548, 2018. ISSN 15371719. doi: 10.1093/molbev/msx319. bertse, Shobha Sharma, Vladimir Soussov, John P. Sullivan, Lu Sun, Seán Turner, and 17. Paul Saary, Alex L Mitchell, and Robert D Finn. Estimating the quality of eukaryotic Ilene Karsch-Mizrachi. NCBI Taxonomy: A comprehensive update on curation, resources genomes recovered from metagenomic analysis with EukCC. Genome Biology, 21(1):244, and tools. Database, 2020(2):1–21, 2020. ISSN 17580463. doi: 10.1093/database/baaa062. dec 2020. ISSN 1474-760X. doi: 10.1186/s13059-020-02155-4. 39. W. H. Majoros, M. Pertea, and S. L. Salzberg. TigrScan and GlimmerHMM: Two open 18. K. Katoh. MAFFT: a novel method for rapid multiple sequence alignment based on fast source ab initio eukaryotic gene-finders. Bioinformatics, 20(16):2878–2879, 2004. ISSN Fourier transform. Nucleic Acids Research, 30(14):3059–3066, jul 2002. ISSN 13624962. 13674803. doi: 10.1093/bioinformatics/bth315. doi: 10.1093/nar/gkf436. 40. Ian Korf. Gene finding in novel genomes. BMC Bioinformatics, 5:59, may 2004. ISSN 19. K. Katoh and D. M. Standley. MAFFT Multiple Sequence Alignment Software Version 7: 14712105. doi: 10.1186/1471-2105-5-59. Improvements in Performance and Usability. Molecular Biology and Evolution, 30(4):772– 41. Robert Buels, Eric Yao, Colin M. Diesh, Richard D. Hayes, Monica Munoz-Torres, Gregg 780, apr 2013. ISSN 0737-4038. doi: 10.1093/molbev/mst010. Helt, David M. Goodstein, Christine G. Elsik, Suzanna E. Lewis, Lincoln Stein, and Ian H. 20. Julie D. Thompson, Desmond G. Higgins, and Toby J. Gibson. CLUSTAL W: improving Holmes. JBrowse: A dynamic web platform for genome visualization and analysis. Genome the sensitivity of progressive multiple sequence alignment through sequence weighting, Biology, 17(1):1–12, 2016. ISSN 1474760X. doi: 10.1186/s13059-016-0924-1. position-specific gap penalties and weight matrix choice. Nucleic Acids Research, 22(22): 42. Uku Raudvere, Liis Kolberg, Ivan Kuzmin, Tambet Arak, Priit Adler, Hedi Peterson, and 4673–4680, 1994. ISSN 0305-1048. doi: 10.1093/nar/22.22.4673. DRAFTJaak Vilo. g:Profiler: a web server for functional enrichment analysis and conversions of 21. M.A. Larkin, G. Blackshields, N.P. Brown, R. Chenna, P.A. McGettigan, H. McWilliam, gene lists (2019 update). Nucleic Acids Research, 47(W1):W191–W198, jul 2019. ISSN F. Valentin, I.M. Wallace, A. Wilm, R. Lopez, J.D. Thompson, T.J. Gibson, and D.G. Hig- 0305-1048. doi: 10.1093/nar/gkz369. gins. Clustal W and Clustal X version 2.0. Bioinformatics, 23(21):2947–2948, nov 2007. 43. Max Kotlyar, Chiara Pastrello, Zara Malik, and Igor Jurisica. IID 2018 update: context- ISSN 1367-4803. doi: 10.1093/bioinformatics/btm404. specific physical protein–protein interactions in human, model organisms and domesticated 22. Robert C Edgar. MUSCLE: multiple sequence alignment with high accuracy and high species. Nucleic Acids Research, 47(D1):D581–D589, jan 2019. ISSN 0305-1048. doi: throughput. Nucleic acids research, 32(5):1792–7, 2004. ISSN 1362-4962. doi: 10.1093/ 10.1093/nar/gky1037. nar/gkh340. 44. Damian Szklarczyk, Annika L Gable, David Lyon, Alexander Junge, Stefan Wyder, Jaime 23. S. Capella-Gutierrez, J. M. Silla-Martinez, and T. Gabaldon. trimAl: a tool for automated Huerta-Cepas, Milan Simonovic, Nadezhda T Doncheva, John H Morris, Peer Bork, Lars J alignment trimming in large-scale phylogenetic analyses. Bioinformatics, 25(15):1972– Jensen, and Christian von Mering. STRING v11: protein–protein association networks 1973, aug 2009. ISSN 1367-4803. doi: 10.1093/bioinformatics/btp348. with increased coverage, supporting functional discovery in genome-wide experimental 24. Jacob L. Steenwyk, Thomas J. Buida, Yuanning Li, Xing-Xing Shen, and Antonis Rokas. datasets. Nucleic Acids Research, 47(D1):D607–D613, jan 2019. ISSN 0305-1048. doi: ClipKIT: A multiple sequence alignment trimming software for accurate phylogenomic infer- 10.1093/nar/gky1131. ence. PLOS Biology, 18(12):e3001007, dec 2020. ISSN 1545-7885. doi: 10.1371/journal. 45. Tomáš Br˚una,Katharina J Hoff, Alexandre Lomsadze, Mario Stanke, and Mark Borodovsky. pbio.3001007. BRAKER2: automatic eukaryotic genome annotation with GeneMark-EP+ and AUGUSTUS 25. Alexandros Stamatakis. RAxML version 8: a tool for phylogenetic analysis and post-analysis supported by a protein database. NAR Genomics and Bioinformatics, 3(1):1–11, jan 2021. of large phylogenies. Bioinformatics (Oxford, England), 30(9):1312–3, may 2014. ISSN ISSN 2631-9268. doi: 10.1093/nargab/lqaa108. 1367-4811. doi: 10.1093/bioinformatics/btu033. 46. Georges Hattab, Theresa-Marie Rhyne, and Dominik Heider. Ten simple rules to colorize 26. Vincent Lefort, Richard Desper, and Olivier Gascuel. FastME 2.0: A Comprehensive, Ac- biological data visualization. PLOS Computational Biology, 16(10):e1008259, oct 2020. curate, and Fast Distance-Based Phylogeny Inference Program. Molecular biology and ISSN 1553-7358. doi: 10.1371/journal.pcbi.1008259. evolution, 32(10):2798–800, oct 2015. ISSN 1537-1719. doi: 10.1093/molbev/msv150. 27. Guangchuang Yu, David K. Smith, Huachen Zhu, Yi Guan, and Tommy Tsan-Yuk Lam. ggtree : an R package for visualization and annotation of phylogenetic trees with their co- variates and other associated data. Methods in Ecology and Evolution, 8(1):28–36, jan 2017. ISSN 2041-210X. doi: 10.1111/2041-210X.12628. 28. Emmanuel Paradis and Klaus Schliep. ape 5.0: an environment for modern phylogenetics

6 | bioRχiv Martin et al. | MOSGA 2