<<

bioRxiv preprint doi: https://doi.org/10.1101/2021.02.28.433233; this version posted February 28, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

Informative Regions In Viral Genomes

J. Leonardo Moreno-Gallego1,2, and Alejandro Reyes2,3,

1Department of Microbiome Science, Max Planck Institute for Developmental Biology, Tübingen 72076, Germany 2Max Planck Tandem Group in Computational Biology, Department of Biological Sciences, Universidad de los Andes, Bogotá 111711, Colombia 3The Edison family Center for Genome Sciences and Systems Biology, Washington University School of Medicine, Saint Louis, MO 63108, USA

Viruses, far from being just parasites affecting hosts’ fitness sity. Only about 25% of sequences in viromes from marine are major players in any microbial ecosystem. In spite of their environments have a match (e-value <= 0.001) to a known broad abundance, viruses, in particular remain sequence (Breitbart and Rohwer 2005; Hurwitz et al. 2016; largely unknown since only about 20% of sequences obtained López-Bueno et al. 2009). The large diversity and the lack of from viral community DNA surveys could be annotated by com- universal molecular markers make it difficult to organize and parison to public databases. In order to shed some light into characterize known and new viral genomes. this genetic dark matter we expanded the search of ortholo- gous groups as potential markers to viral taxonomy from bac- Currently, computational tools such as VIROME teriophages and included eukaryotic viruses, establishing a set (Wommack et al. 2012), MetaVir2 (Roux et al. 2014), of 31,150 ViPhOGs. To this, we examine the non-redundant and MG-RAST (Keegan et al. 2016) use available molec- viral diversity stored in public databases, predict proteins in ular databases to analyze genomes and metagenomes. Nev- genomes lacking such information and all annotated and pre- ertheless, they are limited by considering only the fraction dicted proteins were used to identify potential protein domains. of the data generated that has significant similarity to pre- Clustering of domains and unannotated regions into ortholo- viously annotated data. New approaches based not only on gous groups was done using cogSoft. Finally, we employed a annotated sequence comparisons, but also on all the avail- supervised classification method by using a Random Forest im- able information promise to be useful for analyzing individ- plementation to classify genomes into their taxonomy and found ual viral genomes and viromes. For instance, Skewe-Cox et that the presence/absence of ViPhOGs is significantly associ- ated with their taxonomy. Furthermore, we established a set of al., used protein sequence clustering to generate viral profile 1,457 ViPhOGs that given their importance for the classification Hidden Markov Models (“vFams”) that were subsequently could be considered as markers/signatures for the different tax- used for classifying highly divergent sequences (Skewes- onomic groups defined by the ICTV at the Order, Family, and Cox et al. 2014). Furthermore, the identification of highly Genus levels. Both, the ViPhOGs and informative ViPhOGs sets conserved genes in specific taxonomic groups has been an- are available at the Open Science Framwork (OSF) under the other approach for the taxonomic classification of viral se- ViPhOGs project (https://osf.io/2zd9r/). quences. For instance, diversity analyses of cyanophages,

Viruses | Phages | Orthologous gropus | RandomForest | ViPhOGs algae viruses, T4 and T7 phages have been conducted fol- Correspondence: [email protected] [email protected] lowing this method, and they enabled the characterization of the viral diversity in the studied environments (Zhong et al. 2002; Short and Suttle 2005; Fujihara et al. 2010). However, Introduction the mentioned studies were restricted to specific families and Viruses are entities widely spread all around the biosphere. It ecosystems. is estimated that viral particles are ten times more abundant Using another approach, Kristensen et al. constructed a than other types of microorganisms and although theirDRAFT inclu- collection of phage orthologous groups (“POGs”) from bac- sion in a new domain remains controversial, it is clear that terial and archeal viruses (Kristensen et al. 2011; Kristensen they are not merely parasites (Ignacio-Espinoza et al. 2013; et al. 2013). Orthologous gene sets are widely used as a Martínez Martínez et al. 2014; Koonin et al. 2015). Viruses powerful technique in comparative genomics and for viruses actively participate in ecosystem remodelling, population dy- it has been suggested that marker genes could be obtained namics and a wide variety of ecological, biogeochemical, ge- using this technique (Kristensen et al. 2011; Kristensen et netic and physiological processes (Kristensen et al. 2010). al. 2013). Based on the same concept, eggNOG has imple- Despite their importance and abundance, the viral di- mented in its latest version a database of orthologous groups versity has not been well characterized. Difficulties in the focused on viruses (Viral OGs) (Powell et al. 2014). How- isolation of pure cultures and the description of viral cycles ever, they do not include the whole breadth of viral diver- are common limitations in research, since less than 1% sity represented in the public databases, since genomes of of environmental microorganisms can be grown in the labo- eukaryotic viruses were not included by either of the men- ratory (Hugenholtz et al. 1998). Next generation sequencing tioned studies. Given the large amount of currently available techniques enabled us to partially overcome these difficul- genetic information, it is imperative to develop and imple- ties, revolutionizing this field of virology. Deep sequencing ment new tools to reliably and efficiently analyze these data surveys of viral communities (viromes) have revealed a di- to better describe viral diversity. versity beyond all expectations, and they have evidenced the Computational techniques have been used previously lack of knowledge we currently have on global viral diver- for similar issues in biology. Machine learning methods, for

Moreno-Gallego and Reyes | bioRχiv | February 28, 2021 | 1–11 bioRxiv preprint doi: https://doi.org/10.1101/2021.02.28.433233; this version posted February 28, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. example, are algorithms which learn through the experience, ison of the genome length distribution of phages and eu- attempting to classify information according to shared fea- karyotic viruses shows that dsDNA and ssDNA phages tend tures. These techniques allow us to extract patterns, trends to have larger genomes than dsDNA and ssDNA eukaryotic and finally analyze the information using a non-deterministic viruses (Mannwhitney test when comparing dsDNA viruses: way. Supervised learning algorithms such as Random For- p-value=7.19e-06; Mannwhitney test when comparing ss- est, Support Vector Machine and Neural Networks have been DNA viruses: p-value=3.08e-64). In the case of ssRNA successfully introduced to solve complex biological prob- viruses, eukaryotic viruses tend to have larger genomes than lems, such as image analysis, microarray expression analysis, phages (Mannwhitney test when comparing ssRNA viruses: QTLs analysis, detection of transcription start sites, epitopes p-value=7.89e-16) (Figure 1). detection, protein identification and function, among others The set of 14,057 non-redundant genomes code for a (Libbrecht and Noble 2015; Vervier et al. 2016). total of 442,007 proteins, where we observe that the number In this work we used the methodology proposed by of genes is directly proportional to the length of the genome. Kristensen et al., for the identification of gene and domain or- Interestingly, a linear regression suggests a gene density of thologous groups from related viral sequences. We expanded 12 proteins per kilobase in the case of phages, while in the the reach of this approach by incorporating genomes of eu- case of eukaryotic viruses the gene density is only about 2.5 karyotic viruses and applying Random Forest as a machine proteins per kilobase; indicating a lower gene density for eu- learning strategy to identify taxonomically informative or- karyotic viruses in comparison with phages (Figure 2). thologous groups. We denominated the set of orthologous groups as ViPhOGs (Virus and Phages Orthologous Groups).

Results The viral diversity represented in public (NCBI) databases. We searched for all viral genomes stored in ei- ther RefSeq or Genbank databases (April 2015) and ob- tained 50,728 entries by using the selected set of queries (see methods). Depuration of the search results led to the ex- clusion of 6,617 entries, as some of the keywords in their descriptions indicated that they did not correspond to com- plete genomes. Entries kept following the depuration step were clustered based on sequence identity in order to col- lapse near identical viral sequences, which resulted in an overall 57% reduction that reflects a very high redundancy in Fig. 1. Genome length distribution. Violin plots show the genome length dis- the searched databases. sequences decreased tribution of the non-redundant viruses used in the study. Viruses are grouped by from 3,573 to 2,071 (57.9%), while sequences of eukaryotic the type of genetic material and type of host. The inner boxplot shows the median (white circle) and interquartile range (whisker plots). viruses went from 40,538 to 13,011 (32.1%). Finally, a sec- ond de-replication at the protein level was conducted follow- ing the prediction of genes for those genome accessions with- Viruses and Phages Orthologous Groups (ViPhOGs). out protein annotations (see methods).This process led to a fi- We searched for domains in all identified and predicted pro- nal reduced set of 14,057 entries, comprising 1,974 bacterio- teins using InterProScan (Jones et al. 2014). Domains were phages and 12,083 viruses. Those accessions areDRAFT considered found only in 52.59% of the proteins (232,033 proteins), as the non-redundant viral diversity stored in NCBI public which means that even for the sequences stored in public databases (Table S1). databases half of the information belongs to the viral dark According to the type of genetic material stated in the matter. In an attempt to gain further information, unanno- description of the accessions, these were categorized into: tated proteins were used as a query against vFams, but only double stranded DNA (dsDNA), single stranded DNA (ss- 39,344 (8.9%) proteins had a significant match to entries in DNA), double stranded RNA (dsRNA), single stranded RNA this database. -despite the sense- (ssRNA) and retro-transcribing viruses (rt- Given the proportion of unannotated sequences, protein viruses). A total of 1,122 accessions didn’t have a complete regions without domain or vFam annotation were also con- taxonomic annotation at the moment of the study; those ac- sidered for further analysis, meaning that the final set of vi- cessions were found to be either unclassified phages (66), ral regions consisted of: 365,368 annotated regions (309,251 unclassified viruses (80), satellites (203) or assemblies from InterPro domains + 56,137 vFam matches), 157,591 unanno- marine metagenomes (773). The longest genome belonged to tated regions of at least 40 residues (69,333 regions in be- the dsDNA virus salinus (NC_022098) with a tween annotated domains + 88,258 regions in between vFam genome length of 2,473,870 bp, whereas the smallest genome matches) and 170,637 unannotated proteins (proteins with no belongs to the ssRNA Lucerne transient streak virus hit to vFam or InterProScan). Viruses and Phages Ortholo- with 324 bp. In general, DNA viruses are larger than RNA gous Groups (ViPhOGs) were built from the symmetric best viruses (Mannwhitney test: p-value=1.12e-135). A compar- matches between viral regions using the software COGsoft

2 | bioRχiv Moreno-Gallego and Reyes et al. | ViPhOGs bioRxiv preprint doi: https://doi.org/10.1101/2021.02.28.433233; this version posted February 28, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

composed of orthologs instead of paralogs. This is also ev- idenced in Table S2, where most of the ViPhOGs have the same number of genomes as regions, indicating that each genome contributed only one region to each orthologous group.

A Random Forest classifier correctly classifies viral genomes according to the presence of ViPhOGs. We used the Scikit-learn Random Forest implementation to test if ViPhOGs can be used as features to predict taxonomy (see methods). From the model testing, it was determined that in general a model with 60 estimators had a reasonably good balance between classification-score and computation time, as models with 60 estimators result in a high classification score and less variance (Figure 4). Therefore, a Random For- est classifier with 60 estimators was run separately for each of the evaluated taxonomic levels. For each taxonomic level, all genomes classified into a taxon at the analysed level were used. For the Order level the algorithm received a matrix of 1,031 ViPhOGs and 4,698 genomes to classify into 7 Orders. In the case of Family the matrix contained 11,328 ViPhOGs and 11,978 genomes from 84 different Families, and for the Genus level, the size of the matrix was 20,310 ViPhOGs and 10,151 genomes from 335 different genera. For each case, matrices were split in 100 train/test (70:30 distribution) sets. The mean accuracy score achieved was 99.06%, 95.60%, and 89.58% for Order, Family and Genus, respectively (Figures Fig. 2. Gene density. Scatter plot showing the gene density of (A) eukaryotic 5, 6 and 7). viruses and (B) phages. Viruses are colored by the type of genetic material and each dot represents a genome. Best-fit lines with 95% confidence intervals from As the classification score suggests, the chosen algo- linear regression are plotted. rithm excelled at accurately classifying the genomes into their respective taxonomic groups; where most of the mis- (Kristensen et al. 2010) (see methods). A total of 31,150 or- classification cases were a small proportion of the genomes thologous groups with at least three members were obtained. represented by each of the assessed taxons (Figure 8). The Interestingly, most of the ViPhOGs were built from a single 10 most common classification mistakes per taxonomic level type of viral regions: unannotated proteins (9,953), unanno- are shown in the Table S3. Although we observed that the tated regions (8,103) or annotated regions (8,023). Among classification error does not perfectly correlate with the total all possible combinations of viral region types, the highest number of available genomes for the classification, it is ev- number of clusters was obtained for the combination of unan- ident that the lower the number of genomes available for a notated proteins and unannotated regions (2,309) (Figure 3). given taxon, the higher the classification error (Figure 9). This suggests that although the large majority of regions and proteins in viral genomes are uncharacterized, theyDRAFT are con- Informormative ViPhOGs: signatures of the taxonomy served among the different viruses used. of viral genomes. As the good performance of the models The median amount of regions clustered in a ViPhOG suggests, there is a set of ViPhOGs that allows to identify was 5 (IQR:3,11) with the largest ViPhOG having 3,440 with high accuracy the taxonomic group of each genome. regions from 1,180 different genomes of both phages and Therefore, we aimed at minimizing the number of ViPhOGs eukaryotic viruses. This large ViPhOG contained regions capable of reaching high accuracy of classification for each mainly annotated as Helicases. However, it was not the only given taxonomic level (see methods). That approach allowed ViPhOG that comprised a rather large number of regions, as a us to determine that a reduced set of 20 ViPhOGs was enough total of 1,081 ViPhOGs contained more than 100 regions (Ta- to achieve a high accuracy score for the Order level. For the ble S2). In terms of the host type, we found 14,746 ViPhOGs Family and Genus levels, the number of ViPhOGs needed represented exclusively by eukaryotic viral genomes, 10,100 was 388 and 1,392, respectively (Figure 10). We designated ViPhOGs represented only by phage genomes and the re- those ViPhOGs as “Informative ViPhOGs”. Their taxonomic maining 6,304 ViPhOGs were represented by both phages assignment and their functional annotation (if available) is and eukaryotic viral genomes. As a ViPhOG may include presented in the Table S4. paralogs, any given genome can contribute with several re- We used the set of Informative ViPhOGs to build an gions to a single ViPhOG. However, the number of regions UPGMA tree that delineates the viral diversity clustering, per genome for each ViPhOG was on average 1.008 (max: based on the presence/absence of these genomic character- 9.162), which indicates that the vast majority of ViPhOGs are istics (see methods). Genomes that belong to a defined Or-

Moreno-Gallego and Reyes et al. | ViPhOGs bioRχiv | 3 bioRxiv preprint doi: https://doi.org/10.1101/2021.02.28.433233; this version posted February 28, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

Fig. 3. ViPhOGs composition: Most of the orthologous regions are made of unknown regions. Plot showing the ViPhOGs composition according to the type of region it has (vFam, InterProScan, unannotated region, unannotated protein). The horizontal bar plot represents the number of ViPhOGs that possess at least one region of the indicated type. Filled dots indicate which combination of types is being considered and the vertical bar plot shows the number of ViPhOGs for each combination.

DRAFT Fig. 4. Model exploration. Scatter plot showing how the number of estimators/trees of the random forest (x-axis) affects the classification score (blue dots, left-y-axis) and computational time (orange dots, right y-axis). Whiskers represent standard deviation. The red dashed rectangle highlights the number of estimators used for the final model. der constitute a defined clade in the tree. Furthermore, in all the family and the Order , cases except for , there was a consistent branch- which includes the families Rudiviridae and Lipothrixviri- ing of the Orders containing the corresponding Families and dae, form a single clade. Furthermore, most of the families Genera designated by the International Committee on Taxon- of archaeal viruses had very few representatives, revealing a omy of Viruses (ICTV) (Figure 11). A closer look at the tree bias in the explored diversity. revealed some interesting viral features. (ii) Characteristics shared between Eukaryotic and Bac- (i) Bacteriophages. The Family Tectiviridae (dsDNA terial viruses. The Order appears as a sister viruses) shares a clade with the genus Rosemblanvirus and clade of a subset of the Caudovirales, in particular, mem- Salasvirus, both members of the Family (also bers of the Family. Moreover, the Nucleo- dsDNA viruses from the order Caudovirales), while the Cytoplastmatic Large DNA Viruses (NCLDV) group is en- other bacteriophage families Inoviridae, (both closed by the Caudovirales clade. We looked for ViPhOGs ssDNA), and Leviviridae (ssRNA) appear as independent present in members of all NCLDV families and found clades without shared characteristics among them or mem- ViPhOGs number 937 and 1598, which are associated with bers of the Order Caudovirales. Regarding archaeal viruses, helicase domains and, as mentioned before, prevalent among

4 | bioRχiv Moreno-Gallego and Reyes et al. | ViPhOGs bioRxiv preprint doi: https://doi.org/10.1101/2021.02.28.433233; this version posted February 28, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

tated proteins, all derived from the non-redundant diversity of viruses stored in public databases. This strategy proved to be useful given that a large majority of the ViPhOGs were con- stituted solely or in combination of unannotated regions or unannotated proteins. Finally, we established a comprehen- sive set of 31,150 orthologous groups that we denominated ViPhOGs. As the ICTV provides a single classification scheme that reflects the evolutionary relationship among viruses, we evaluated the possibility that the presence/absence of ViPhOGs in viral genomes reflected the ICTV taxonomy us- ing a machine learning approach. The low misclassification scores reached by the Random Forest algorithm suggested that the use of ViPhOGs as features for performing taxo- nomic classification of viruses had great potential. There- fore, we determined the subset of informative ViPhOGs that could be considered as markers/signatures for the different taxonomic groups defined by the ICTV at the Order, Family, and Genus levels. Fig. 5. Random Forest accurately classifies genomes into their respective We found a high degree of agreement between clus- Orders. Heatmap representing the confusion matrix obtained after classifying viral genomes at the Order level. The color code indicates the proportion of genomes of tering identified using informative ViPhOGs and the mono- the Order in the x-axis classified as a genome of the Order in the y-axis. phyletic Orders described by ICTV. Example of those were the (Gorbalenya et al. 2006), Ligamenvirales dsDNA viruses in general; ViPhOG 821, which codes for a ri- (Prangishvili and Krupovic 2012), (Afonso bonucleotide reductase and is shared mainly among members et al. 2016), and (Martelli et al. 2007) whose of the families Myoviridae, and ; branches shows a clear separation in accordance to the pro- ViPhOG 865, a serine/threonine kinase domain, and ViPhOG posed Families and Genera. 72, an EF binding domain, both also very common in mem- Importantly, the current approach has been consistent bers of Herpesviridae. with recent changes in the ICTV taxonomy. For example (iii) RNA viruses. Those from the family Chrysoviri- the family , whose members were considered dae (dsRNA) grouped with (dsRNA), in particular as a subfamily of the family up till 2016 with the Genus that also infects fungi. Interestingly, (Afonso et al. 2016; Rima et al. 2017). Our classifi- there was a clade formed by different positive sense ssRNA cation mechanism used the ICTV classification from 2014, viruses, which had no other common taxonomic assignment. but was capable of showing the separation of the family This clade included the Families , Nodaviri- Paramyxoviridae. The tree clearly showed how viruses from dae, , , Togaviridae, , the Genera Metapneumovirus and Pneumovirus (now known and the Families of the Order Tymovirales. as Metapneumovirus and Orthopneumovirus, respectively) form a separate clade (now Family Pneumoviridae), whose Discussion sister clade is the Paramyxoviridae Family. In recent years, as metagenomics has revealed aDRAFT great di- Despite the absence of a link between several ss- versity of phages from different biomes, and evidenced the RNA(+) families in the ICTV taxonomy of 2014, in the huge unexplored diversity that the viral world holds. The ViPhOG-based tree built here families Tombusviridae, No- use of genetic signatures to characterize the diversity of spe- daviridae, Bromoviridae, Virgaviridae, Closteroviridae and cific viruses and phages have been successfully applied to de- Hepeviridae were grouped together with the families of scribe specific groups of viruses in diverse contexts (Dwivedi the Order Tymovirales. In the most recent ICTV taxon- et al. 2013; Sakowski et al. 2014). Different strategies omy the families Bromoviridae, Virgaviridae, Togaviridae have been described including POGs (Kristensen et al. 2011; and Closteroviridae were assigned to the Order Martellivi- Kristensen et al. 2013), vFams (Skewes-Cox et al. 2014), rales; and the family Hepesviridae to the Order Hepelivi- viralOGs from EggNOG (Huerta-Cepas et al. 2016) and rales. These two new Orders (Hepelivirales and Martellivi- pVOGs (Grazziotin et al. 2017). Here, we followed the rales), together with the Tymovirales, belong now to the Class methodology by Kristensen et al., and took it a step further by Alsuviricetes and the Phylum Kitrinoviricota. Furthermore, extending the search of orthologous groups beyond bacterial in our tree the families Tombusviridae and are and archeal viruses to also include eukaryotic viruses, which a sister clade of what seems now to be the Alsuviricetes consolidated a final set of 14,057 non-redundant genomes. clade. Tombusviridae and Nodaviridae belong now to the Or- We took a non-waste-information approach in order to ders Tolivirales (Class:Tolucaviricetes) and Nodamuvirales get orthologous groups among (i) annotated domains, (ii) (class:Magsaviricetes), respectively. All classes, together unannotated regions of annotated proteins and (iii) unanno- with Alsuviricetes, constitute now the Kingdom Kitrinoviri-

Moreno-Gallego and Reyes et al. | ViPhOGs bioRχiv | 5 bioRxiv preprint doi: https://doi.org/10.1101/2021.02.28.433233; this version posted February 28, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

Fig. 6. Random Forest accurately classifies genomes into their respective Families. Heatmap representing the confusion matrix obtained after classifying viral genomes at the Family level. The color code indicates the proportion of genomesDRAFT of the Family in the x-axis classified as a genome of the Family in the y-axis. cota. This suggests that a tree generated with conserved Genus Vesiculovirus that were confounded with members of amino acid features could identify basal evolutionary rela- the Genus . Both Genera belong to the Family tionships among viruses matching the new scope of the ICTV . Interestingly, this pair of genera share more (International Committee on Taxonomy of Viruses. 2020). ViPhOGs between each other than against any other mem- ber of the Family Rhabdoviridae. This observation could be Misclassification cases were very limited and more the basis of a more in-depth study which could potentially common in the lowest taxonomic level (Genus) than in lead to the suggestion of both genera being part of a new the highest taxonomic level (Order). Although we did not sub-family which separates them from the rest of the fam- observe a perfect negative correlation between the num- ily. Lastly, regarding misclassification events, we want to ac- ber of genomes available and the number of misclassifi- knowledge that there is still a place for improvement of the cation cases, we did observe that Genera like Mupapillo- classification models. We identified misclassification cases mavirus, Yetapoxvirus, and Families where a taxon was misclassified and the confusion does not such as and with two or appear to be directed by genomic relatedness. As an example three genomes available per taxa were frequently misclassi- we chose to discuss the case of the Family. fied. Another kind of misclassification event occurred be- This Family had 123 representative genomes in our database tween related taxa, such was the case for viruses from the

6 | bioRχiv Moreno-Gallego and Reyes et al. | ViPhOGs bioRxiv preprint doi: https://doi.org/10.1101/2021.02.28.433233; this version posted February 28, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. and in 10% of the cases it was misclassified as Myoviridae. Only a few representative genomes of each Family have (at most) 3 ViPhOGs in common (ViPhOGs number 731, 1158, and 269). Those 3 ViPhOGs are not informative ViPhOGs for Myoviridae, and appear to be present in several different viral families and clades, therefore there is no clear answer to why the classifier confused these two unrelated families.

Fig. 7. Random Forest accurately classifies genomes into their respective Genera. Heatmap representing the confusion matrix obtained after classifying viral Fig. 8. High misclassification is rare. (A) Distribution of the misclassification genomes at the Genus level. The color code indicates the proportion of genomes of cases combined for all taxonomic levels. (B) Box plots representing the distribution the genus in the x-axis classified as a genome of the genus in the y-axis. Genome of the misclassification cases per taxonomic level. Whiskers indicate 1.5 x IQR names were removed for ease of interpretation. (Interquartile range), median is indicated in red.

One of the major strengths of the presented work is that, and molecular (e.g. composition of the virus genome and in addition to genomes of prokaryotic and archeal viruses, we sequence similarity) features is a robust system able to de- included genomes of eukaryotic viruses. As expected, not a pict the evolutionary relationships among viruses. The infor- single ViPhOG was present in all viral genomes. Viruses do mative ViPhOGs dataset is, therefore, nothing but a reflec- not encode for ribosomes or any other universal markers that tion of the efforts done to establish a taxonomic system for allow the study of their phylogenetic relationships. Further- viruses and the strength of machine learning algorithms that more, it has been accepted that viruses have not evolved from were able to depict patterns among a comprehensive dataset. a single common ancestor (Holmes 2011; Koonin et al. 2015; We consider that the result, the ViPhOGs and the informa- Iranzo et al. 2016; Koonin and Yutin 2018), whichDRAFT might be tive ViPhOGs datasets, may be used as a start point to hy- reflected in the high number of polytomies observed in the in- pothesize about the genetic relationships among known viral formative ViPhOGs tree. Besides the absence of an universal groups and as an useful tool to attempt to characterize and ViPhOG, a not negligible number of ViPhOGs were formed define the viral dark matter that is being exposed via metage- by regions from phages and eukaryotic viruses. Further anal- nomics. We released the ViPhOGs dataset hoping that: (i) yses would be needed to determine if the fact that a ViPhOG the community can use it as a tool to explore the genetic rela- is shared between eukaryotic and prokaryotic/archeal viruses tionships among viral clades encouraging viral research, (ii) is due to functional convergence, or if it is because those to facilitate the exploration of specific viral groups by the use viruses presumably have an evolutionary relationship as is the of its ViPhOGs, and (iii) to obtain viral profiles in specific case for Herpesviridae and (Baker et al. 2005; biomes. We want to encourage the community to exploit the Rixon and Schmid 2014; Andrade-Martínez et al. 2019) or as benefits of the use of this comprehensive set of orthologous ssRNA(+) viruses, which presumably co-evolved with their groups in a world of fast evolving entities that quickly lose hosts before they split into (Koonin et al. 2015; their protein sequence conservation. Wolf et al. 2018). The fact that a machine learning approach, based solely on genomic features reached a high score when classifying Methods viruses in their assigned taxa, highlights how the viral taxon- Dataset. All viral genomes stored at the NCBI public omy based on ecological (e.g. pathogenicity and host range) databases in April 2015 were retrieved based on the queries

Moreno-Gallego and Reyes et al. | ViPhOGs bioRχiv | 7 bioRxiv preprint doi: https://doi.org/10.1101/2021.02.28.433233; this version posted February 28, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

Fig. 9. Misclassification in terms of the number of genomes available. Scatter plot showing (A) the number of genomes in a training iteration or (B) total number of genomes available for each taxa versus the misclassification percentage. Each dot represents a taxonomical entry and the color represents the correspondent tax- onomic level. previously used by Kristensen et al. (Kristensen et al. 2013). To download the genomes of viruses infect- ing eukaryotic cells (hereafter viruses) available on RefSeq the query used was: viruses[Organism] NOT cellular organisms[Organism] AND srcdbrefseq[P roperties] NOT vhost [Filter] AND "complete genome". Fig. 10. Selection of the informative ViPhOGs. Scatter plots showing number DRAFTof importance ranked ViPhOGs versus the classification score for each taxonomic The query to download the genomes of viruses infecting level: (A) Genus, (B) Family, (C) Order. Dashed lines indicate the number of ranked bacterial cells (hereafter phages) was: viruses[Organism] ViPhOGs chosen as informative ViPhOGs. That is, the minimum number of ranked NOT cellular -organisms[Organism] ViPhOGs that get the highest classification score. Whiskers represent standard deviation. AND srcdbrefseq[P roperties] AND vhost bacteria[Filter] AND "complete genome". To download complete genomes stored in the Genbank quence identity over the full length of the shorter sequence database but not in the RefSeq database, the negation of the using CD-HIT-EST (Li and Godzik 2006). Only the repre- proposition srcdb_refseq[Properties] was used in sentative sequence of each cluster, the longest one according the queries. to CD-HIT documentation, was used for further analysis. Fi- Following the retrieval of viral and phage genome se- nally, dereplication at the protein level: as in Kristensen et al. quences, a series of filtering steps was applied to the query re- (Kristensen et al. 2011), to remove redundancy of sequences sults to remove incomplete genome sequences and to reduce that are not in synteny but are essentially the same genome, the redundancy in the dataset. First, keyword depuration: all genomes were clustered according to their protein content us- entries with the words “segment”, “ORF”, ”Gene”, ”mutant”, ing a complete linkage approach. To identify shared proteins ”protein”, ”complete sequence”, “Region”, “CDS”, “UTR”, among genomes, all proteins from all genomes were grouped “recombinant” or “terminal repeat” were removed. Second, at a global sequence identity of 99% using CD-HIT. Genomes genomic dereplication: genomes were clustered to a 95% se- coding for 20 proteins or less must share all the proteins to be

8 | bioRχiv Moreno-Gallego and Reyes et al. | ViPhOGs bioRxiv preprint doi: https://doi.org/10.1101/2021.02.28.433233; this version posted February 28, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

DRAFT

Fig. 11. UPGMA tree representation of the non-redundant viral diversity. Unrooted (center) and circular middle point rooted (outer circle) representations of an UPGMA tree of the non-redundant viral diversity, based on the presence/absence of informative ViPhOGs. Colored branches highlight ICTV designated Orders. Bold black branches highlight phage families without an order assignment. Names between the trees indicate the name of the ICTV taxonomic order colored in the same color for the branches. Tip labels indicate the Family of each genome and are colored to facilitate their differentiation within an Order not to provide a different color to each family. clustered while genomes coding more than 20 proteins must gal (v.2.6.3) (Hyatt et al. 2010). The protein prediction was share at least 90% of their proteins to be clustered. carried out separately for viruses and phages; and in the par- Nucleotide and protein sequences from the genomes ticular case of GeneMark, it was possible to specify if the that passed the aforementioned filters were considered as the genome was single or double stranded according to the tax- non-redundant viral diversity available in NCBI at the time onomy annotation of each genome. Predicted proteins per this study was conducted and were used for further analyses. genome were de-replicated using CD-HIT at 99% sequence identity to collapse the predictions made by the 3 packages. Gene prediction. Genes were predicted for genomes with- out any annotation using Glimmer (Delcher et al. 1999) Domain prediction and ViPhOGs. To split proteins into as implemented in RAST-tk (Brettin et al. 2015), Gene- their component domains we first used InterProScan, which MarkS (v.2.0) (Borodovsky and Lomsadze 2014) and Prodi- combines several signature recognition methods to predict

Moreno-Gallego and Reyes et al. | ViPhOGs bioRχiv | 9 bioRxiv preprint doi: https://doi.org/10.1101/2021.02.28.433233; this version posted February 28, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

the presence of functional domains (Jones et al. 2014). Do- Selection of informative ViPhOGs. In order to identify mains were extracted and protein regions without domain an- the most informative ViPhOGs for each classification model, notations and comprising at least 40 residues were also kept. features were ranked in descending order according to their In an attempt to get an annotation for proteins and pro- mean Gini importance. Then each model was run again sev- tein regions without InterProScan annotations, we used them eral times, but each time the number of features was reduced 1 as queries against vFams. a database of Hidden Markov Mod- by 5 of the number used in the previous run to exclude the els (HMMs) built from viral RefSeq proteins (Skewes-Cox et least important ViPhOGs. This process was repeated until the al. 2014). Protein sequences that matched entries in the vFam model was runned with the 4 most important ViPhOGs. Fi- database inherited the corresponding annotations, whenever nally, the classification score of each iteration was plotted as a these were available. function of the number of features, and the smallest set of fea- After this process, complete proteins without any do- tures that reached the highest mean classification score was main annotation, protein regions of at least 40 residues, and selected as the set of informative ViPhOGs for each model. domains identified by either InterProScan or vFams were To depict how the viral diversity is related (or not), we built considered for further analysis and referred as viral regions an UPGMA tree based on the presence/absence of informa- from now on. Orthologous groups were built from the sym- tive ViPhOGs for each genome. metric best matches between viral regions using the soft- ACKNOWLEDGEMENTS ware COGsoft (Kristensen et al. 2010), and only consider- We thank the IT Services Department and ExaCore-IT Core-facility of the Vice ing matches with an E-value < 0.1 and that covered at least Presidency for Research & Creation at the Universidad de Los Andes for high- performance computing services. JLM-G received funding support for the realiza- 50% of the viral region lengths. The clusters of orthologous tion of this project from the Young Investigator award from Colciencias, proyecto groups built from viral regions of viruses and phages are semilla from the School of Sciences at Universidad de los Andes, proyecto FAPA from AR internal funding at Universidad de los Andes, and by the Max Plack tandem hereafter denominated ViPhOGs (Viruses and Phages clus- group in Computational Biology, from Universidad de los Andes. Special thanks to ters of Orthologous Groups). Guillermo Rangel Pineros for feedback about the manuscript and to all members of the Biología Computacional y Ecologia Microbiana lab (BCEM) for helpful dis- cussions, insights and, overall, for being constant providers of happiness and fun through the development of this work. Random Forest classifiers. To test if the ViPhOGs can be AUTHOR CONTRIBUTIONS used as a set of features that defines every virus or phage J.L.M-G and A.R. conceived the study, analyze the data and wrote the manuscript. genome in our dataset, we aimed to correctly assign viral COMPETING FINANCIAL INTERESTS taxa to each genome according to the presence/absence of The authors declare no competing interests. ViPhOGs. To solve this supervised learning task, we used the scikit-learn implementation of the Random Forest Classifier algorithm (Pedregosa et al. 2011) to, independently, perform the classification process at three different taxonomic levels: Order, Family and Genus. For each taxonomic level half of REFERENCES Afonso, Claudio L., Gaya K. Amarasinghe, Krisztián Bányai, Y¯ımíng Bào, the genomes were randomly chosen for training, while the Christopher F. Basler, Sina Bavari, Nicolás Bejerman, et al. 2016. “Taxonomy of other half were used for testing the classifiers. Although ran- the Order Mononegavirales: Update 2016.” Archives of Virology 161 (8): 2351–60. domly chosen, we constrained the selection of genomes to Andrade-Martínez, Juan S., J. Leonardo Moreno-Gallego, and Alejandro Reyes. 2019. “Defining a Core Genome for the Herpesvirales and Exploring Their Evolu- balance the taxa represented in both the training and testing tionary Relationship with the Caudovirales.” Scientific Reports 9 (1): 11342. sets. As a consequence, taxons with a single representative Baker, Matthew L., Wen Jiang, Frazer J. Rixon, and Wah Chiu. 2005. “Common Ancestry of Herpesviruses and Tailed DNA Bacteriophages.” Journal of Virology 79 were not included in the classification process. (23): 14967–70. To evaluate the effect of the number of estimators in Borodovsky, Mark, and Alex Lomsadze. 2014. “Gene Identification in Prokary- otic Genomes, Phages, Metagenomes, and EST Sequences with GeneMarkS the classification we varied the number of estimatorsDRAFT from 10 Suite.” Current Protocols in Microbiology 32 (February): Unit 1E.7. to 100 (by increments of 10 on each test), using 50 random Breitbart, Mya, and Forest Rohwer. 2005. “Here a Virus, There a Virus, Every- where the Same Virus?” Trends in Microbiology 13 (6): 278–84. training/testing sets for each model. The mean of the gener- Brettin, Thomas, James J. Davis, Terry Disz, Robert A. Edwards, Svetlana alization score and the elapsed real time were the variables Gerdes, Gary J. Olsen, Robert Olson, et al. 2015. “RASTtk: A Modular and Extensi- ble Implementation of the RAST Algorithm for Building Custom Annotation Pipelines analysed to set the optimal number of estimators in the final and Annotating Batches of Genomes.” Scientific Reports 5 (February): 8365. model. Delcher, A. L., D. Harmon, S. Kasif, O. White, and S. L. Salzberg. 1999. “Im- proved Microbial Gene Identification with GLIMMER.” Nucleic Acids Research 27 After establishing the number of estimators for the (23): 4636–41. model, we sought to reduce the number of features or Dwivedi, Bhakti, Bingjie Xue, Daniel Lundin, Robert A. Edwards, and Mya Breit- bart. 2013. “A Bioinformatic Analysis of Ribonucleotide Reductase Genes in Phage ViPhOGs used for the classification. A set of ViPhOGs was Genomes and Metagenomes.” BMC Evolutionary Biology 13 (February): 33. pre-selected by calculating both sensitivity/Precision (SP) Fujihara, Shun, Jun Murase, Cho Cho Tun, Tomoya Matsuyama, Makoto Ike- naga, Susumu Asakawa, and Makoto Kimura. 2010. “Low Diversity of T4-Type and mutual information (MI) metrics as in Reyes 2015 (Reyes Bacteriophages in Applied Rice Straw, Residues and Rice Roots in Japanese et al. 2015). The ViPhOGs selected as features were those Rice Soils: Estimation from Major Capsid Gene (g23) Composition.” Soil Science & Plant Nutrition 56 (6): 800–812. showing both, (i) a low SP index (high sensitivity and preci- Gorbalenya, Alexander E., Luis Enjuanes, John Ziebuhr, and Eric J. Snijder. sion for the evaluated taxa) and (ii) a high MI index for the 2006. “Nidovirales: Evolving the Largest RNA Virus Genome.” Virus Research 117 (1): 17–37. evaluated taxa. The selected set of ViPhOGs were used for Grazziotin, Ana Laura, Eugene V. Koonin, and David M. Kristensen. 2017. the Random Forest algorithm to solve the classification prob- “Prokaryotic Virus Orthologous Groups (pVOGs): A Resource for Comparative Ge- nomics and Protein Family Annotation.” Nucleic Acids Research 45 (D1): D491–98. lems. This time, 100 random training/testing sets were used Holmes, Edward C. 2011. “What Does Virus Evolution Tell Us about Virus Ori- for each model. gins?” Journal of Virology 85 (11): 5247–51.

10 | bioRχiv Moreno-Gallego and Reyes et al. | ViPhOGs bioRxiv preprint doi: https://doi.org/10.1101/2021.02.28.433233; this version posted February 28, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

Huerta-Cepas, Jaime, Damian Szklarczyk, Kristoffer Forslund, Helen Cook, Da- Sakowski, Eric G., Erik V. Munsell, Mara Hyatt, William Kress, Shannon J. vide Heller, Mathias C. Walter, Thomas Rattei, et al. 2016. “eggNOG 4.5: A Hierar- Williamson, Daniel J. Nasko, Shawn W. Polson, and K. Eric Wommack. 2014. chical Orthology Framework with Improved Functional Annotations for Eukaryotic, “Ribonucleotide Reductases Reveal Novel Viral Diversity and Predict Biological Prokaryotic and Viral Sequences.” Nucleic Acids Research 44 (D1): D286–93. and Ecological Features of Unknown Marine Viruses.” Proceedings of the National Hugenholtz, P., B. M. Goebel, and N. R. Pace. 1998. “Impact of Culture- Academy of Sciences of the United States of America 111 (44): 15786–91. Independent Studies on the Emerging Phylogenetic View of Bacterial Diversity.” Short, Cindy M., and Curtis A. Suttle. 2005. “Nearly Identical Bacteriophage Journal of Bacteriology 180 (18): 4765–74. Structural Gene Sequences Are Widely Distributed in Both Marine and Freshwater Hurwitz, Bonnie L., Jana M. U’Ren, and Ken Youens-Clark. 2016. “Computa- Environments.” Applied and Environmental Microbiology 71 (1): 480–86. tional Prospecting the Great Viral Unknown.” FEMS Microbiology Letters 363 (10). Skewes-Cox, Peter, Thomas J. Sharpton, Katherine S. Pollard, and Joseph L. https://doi.org/10.1093/femsle/fnw077. DeRisi. 2014. “Profile Hidden Markov Models for the Detection of Viruses within Hyatt, Doug, Gwo-Liang Chen, Philip F. Locascio, Miriam L. Land, Frank W. Metagenomic Sequence Data.” PloS One 9 (8): e105067. Larimer, and Loren J. Hauser. 2010. “Prodigal: Prokaryotic Gene Recognition and Vervier, Kévin, Pierre Mahé, Maud Tournoud, Jean-Baptiste Veyrieras, and Translation Initiation Site Identification.” BMC Bioinformatics 11 (March): 119. Jean-Philippe Vert. 2016. “Large-Scale Machine Learning for Metagenomics Se- Ignacio-Espinoza, J. Cesar, Sergei A. Solonenko, and Matthew B. Sullivan. quence Classification.” Bioinformatics 32 (7): 1023–32. 2013. “The Global Virome: Not as Big as We Thought?” Current Opinion in Vi- Wolf, Yuri I., Darius Kazlauskas, Jaime Iranzo, Adriana Lucía-Sanz, rology 3 (5): 566–71. Jens H. Kuhn, Mart Krupovic, Valerian V. Dolja, and Eugene V. Koonin. International Committee on Taxonomy of Viruses Executive Committee. 2020. 2018. “Origins and Evolution of the Global RNA Virome.” mBio 9 (6). “The New Scope of Virus Taxonomy: Partitioning the Virosphere into 15 Hierarchical https://doi.org/10.1128/mBio.02329-18. Ranks.” Nature Microbiology 5 (5): 668–74. Wommack, K. Eric, Jaysheel Bhavsar, Shawn W. Polson, Jing Chen, Michael Du- Iranzo, Jaime, Mart Krupovic, and Eugene V. Koonin. 2016. “The Double- mas, Sharath Srinivasiah, Megan Furman, Sanchita Jamindar, and Daniel J. Nasko. Stranded DNA Virosphere as a Modular Hierarchical Network of Gene Sharing.” 2012. “VIROME: A Standard Operating Procedure for Analysis of Viral Metagenome mBio 7 (4). https://doi.org/10.1128/mBio.00978-16. Sequences.” Standards in Genomic Sciences 6 (3): 427–39. Jones, Philip, David Binns, Hsin-Yu Chang, Matthew Fraser, Weizhong Li, Craig Zhong, Yan, Feng Chen, Steven W. Wilhelm, Leo Poorvin, and Robert E. Hod- McAnulla, Hamish McWilliam, et al. 2014. “InterProScan 5: Genome-Scale Protein son. 2002. “Phylogenetic Diversity of Marine Cyanophage Isolates and Natural Function Classification.” Bioinformatics 30 (9): 1236–40. Virus Communities as Revealed by Sequences of Viral Capsid Assembly Protein Keegan, Kevin P., Elizabeth M. Glass, and Folker Meyer. 2016. “MG-RAST, a Gene g20.” Applied and Environmental Microbiology 68 (4): 1576–84. Metagenomics Service for Analysis of Microbial Community Structure and Func- tion.” Methods in Molecular Biology 1399: 207–33. Koonin, Eugene V., Valerian V. Dolja, and Mart Krupovic. 2015. “Origins and Evolution of Viruses of Eukaryotes: The Ultimate Modularity.” Virology 479-480 (May): 2–25. Koonin, Eugene V., and Natalya Yutin. 2018. “Multiple Evo- lutionary Origins of Giant Viruses.” F1000Research 7 (November). https://doi.org/10.12688/f1000research.16248.1. Kristensen, David M., Xixu Cai, and Arcady Mushegian. 2011. “Evolutionarily Conserved Orthologous Families in Phages Are Relatively Rare in Their Prokaryotic Hosts.” Journal of Bacteriology 193 (8): 1806–14. Kristensen, David M., Lavanya Kannan, Michael K. Coleman, Yuri I. Wolf, Alexander Sorokin, Eugene V. Koonin, and Arcady Mushegian. 2010. “A Low- Polynomial Algorithm for Assembling Clusters of Orthologous Groups from Interge- nomic Symmetric Best Matches.” Bioinformatics 26 (12): 1481–87. Kristensen, David M., Arcady R. Mushegian, Valerian V. Dolja, and Eugene V. Koonin. 2010. “New Dimensions of the Virus World Discovered through Metage- nomics.” Trends in Microbiology 18 (1): 11–19. Kristensen, David M., Alison S. Waller, Takuji Yamada, Peer Bork, Arcady R. Mushegian, and Eugene V. Koonin. 2013. “Orthologous Gene Clusters and Taxon Signature Genes for Viruses of Prokaryotes.” Journal of Bacteriology 195 (5): 941–50. Libbrecht, Maxwell W., and William Stafford Noble. 2015. “Machine Learning Applications in Genetics and Genomics.” Nature Reviews. Genetics 16 (6): 321–32. Li, Weizhong, and Adam Godzik. 2006. “Cd-Hit: A Fast Program for Clustering and Comparing Large Sets of Protein or Nucleotide Sequences.” Bioinformatics 22 (13): 1658–59. López-Bueno, Alberto, Javier Tamames, David Velázquez, Andrés Moya, Anto- nio Quesada, and Antonio Alcamí. 2009. “High Diversity of the Viral Community from an Antarctic Lake.” Science 326 (5954): 858–61. Martelli, Giovanni P., Michael J. Adams, Jan F. Kreuze, and Valerian V. Dolja. 2007. “Family Flexiviridae: A Case Study in Virion and Genome Plasticity.” Annual Review of Phytopathology 45: 73–100. Martínez Martínez, Joaquín, Brandon K. Swan, and William H. Wilson. 2014. “Marine Viruses, a Genetic Reservoir Revealed by Targeted Viromics.”DRAFT The ISME Journal 8 (5): 1079–88. Pedregosa, Fabian, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, et al. 2011. “Scikit-Learn: Ma- chine Learning in Python.” Journal of Machine Learning Research: JMLR 12 (85): 2825–30. Powell, Sean, Kristoffer Forslund, Damian Szklarczyk, Kalliopi Trachana, Alexan- der Roth, Jaime Huerta-Cepas, Toni Gabaldón, et al. 2014. “eggNOG v4.0: Nested Orthology Inference across 3686 Organisms.” Nucleic Acids Research 42 (Database issue): D231–39. Prangishvili, David, and Mart Krupovic. 2012. “A New Proposed Taxon for Double-Stranded DNA Viruses, the Order ‘Ligamenvirales.’” Archives of Virology 157 (4): 791–95. Reyes, Alejandro, Laura V. Blanton, Song Cao, Guoyan Zhao, Mark Manary, Indi Trehan, Michelle I. Smith, et al. 2015. “Gut DNA Viromes of Malawian Twins Discordant for Severe Acute Malnutrition.” Proceedings of the National Academy of Sciences of the United States of America 112 (38): 11941–46. Rima, Bert, Peter Collins, Andrew Easton, Ron Fouchier, Gael Kurath, Robert A. Lamb, Benhur Lee, et al. 2017. “ICTV Virus Taxonomy Profile: Pneumoviridae.” The Journal of General Virology 98 (12): 2912–13. Rixon, Frazer J., and Michael F. Schmid. 2014. “Structural Similarities in DNA Packaging and Delivery Apparatuses in Herpesvirus and dsDNA Bacteriophages.” Current Opinion in Virology 5 (April): 105–10. Roux, Simon, Jeremy Tournayre, Antoine Mahul, Didier Debroas, and François Enault. 2014. “Metavir 2: New Tools for Viral Metagenome Comparison and As- sembled Virome Analysis.” BMC Bioinformatics 15 (March): 76.

Moreno-Gallego and Reyes et al. | ViPhOGs bioRχiv | 11