Exploring the Caudovirales: Evaluation of their Internal Classification and Potential Relationships with the Tectiviridae

Juan S. Andrade-Martínez1,2, Alejandro Reyes1,2,3

1. Research Group on Computational Biology and Microbial Ecology, Department of Biological Sciences, Universidad de los Andes, Bogotá, Colombia. 2. Max Planck Tandem Group in Computational Biology, Universidad de los Andes, Bogotá, Colombia. 3. Center for Genome Sciences and Systems Biology, Department of Pathology and Immunology, Washington University in Saint Louis, Saint Louis, MO, 63108, USA.

Abstract

The Caudovirales are the most abundant dsDNA viruses, infecting both Bacteria and Archaea. Recently developed distance- and network-based approaches have called into question the morphology-based classification of the three traditional Caudovirales families (Myoviridae, Siphoviridae, and Podoviridae) and have suggested an evolutionary relationship between this order and the phage family Tectiviridae. In that context, the present work aimed to use clusters of viral domain orthologous groups (VDOGs) and k-mers to determine whether the current Caudovirales classification is evolutionarily reasonable, and to explore the possibility of a common ancestry between Caudovirales and Tectiviridae. For this, we employed over 4000 Caudovirales and 15 Tectiviridae complete genomes obtained from the NCBI Assembly Database. These entries were dereplicated at the genome and protein level, yielding a set of representative proteomes. The latter were screened through a Hidden Markov Model search against a viral domain orthologous groups database to determine which proteomes harbored which VDOGs. A k-mer search was also conducted to establish which k-mers with lengths between 6 and 15 were abundant in the clades of interest. The representative features (k-mers or VDOGs) of the clades were determined, and dendrograms were constructed from them using a Neighbor-Joining approach. All dendrograms based on k-mers generated an almost perfect distinction between the outgroups and the Caudovirales and Tectiviridae. In contrast, the VDOG-only dendrogram showed that most Caudovirales subfamilies are monophyletic, while none of the dendrograms showed monophyletic Caudovirales families. Overall, our results support the hypothesis that the classification of the three traditional Caudovirales families needs to be revised, suggest the existence of a common ancestry for Caudovirales and Tectiviridae, and benchmark the use of VDOGs and k-mers for phylogenetic analyses.

Keywords

Caudovirales, Tectiviridae, orthologous protein clusters, viral phylogenetics

Introduction

It is estimated that, for any given environment, the number of viral particles is up to 10 times that of prokaryotic cells (Koonin, Dolja, & Krupovic, 2015). More notable than their abundance, however, is the high viral diversity, which manifests itself in the plurality of structures, genome sizes, strategies of replication and expression, and virion morphologies (Koonin, Dolja, et al., 2015; Koonin, Krupovic, & Yutin, 2015). Viral evolutionary patterns are difficult to elucidate since such biological entities have high rates of mutation and infection (Hendrix, 2008). For instance, it is estimated that phages are responsible for up to 10^24 productive infections per second in marine ecosystems (Hendrix, 2008). Additionally, in phages, vertical transmission is the primary mode of genetic information transfer only among highly related viruses, whilst at greater evolutionary distances horizontal transmission predominates (Kristensen et al., 2013). In spite of this, significant advances have recently been made in the reconstruction of the evolutionary history of the main viral clades. For the RNA viruses, a draft tree is available which connects the orders , , , and the family (Koonin, Dolja, et al., 2015; Koonin, Krupovic, et al., 2015). For the DNA viruses, there are two recognized relationships: that of the bacteria-infecting family Tectiviridae and its descendants, the Polintoviruses, which include the proposed eukaryote-infecting order Megavirales; and that of the Caudovirales and their putative descendants in eukaryotes, the Herpesvirales (Koonin, Dolja, et al., 2015; McGeoch, Davison, Dolan, Gatherer, & Sevilla-Reyes, 2008; Selvarajan Sigamani, Zhao, Kamau, Baines, & Tang, 2013). The Caudovirales, or tailed phages, are the most abundant dsDNA viruses, infecting both Bacteria and Archaea (H.-W. Ackermann, 1998).
Their non-enveloped virion is composed of a head, a protein shell that protects the DNA molecule, and a tail, a protein tube involved in DNA delivery to the host cytoplasm (King, Adams, Carstens, & Lefkowitz, 2011a). The virion (Figure 1a) can also harbor additional attachments, such as terminal fibers or base plates on the tail (H.-W. Ackermann, 1998; King et al., 2011a). Under its traditional classification, this group comprised 3 families, defined by their virion morphology (Figure 1b): Myoviridae, with long contractile tails; Siphoviridae, with long non-contractile tails; and Podoviridae, with short non-contractile tails (H.-W. Ackermann, 1998). Phages of the order can harbor as few as 27 and as many as over 600 genes, which are usually clustered in operons based on their functions (King et al., 2011a). However, due to the low number of fully annotated genomes, a general architecture for the order has not yet been defined (King et al., 2011a). Until now, evolutionary analyses of this group have suggested that they arose shortly after the origin of cellular life (H. W. Ackermann, 2003). Nevertheless, there are two as yet unanswered questions regarding this clade. The first one deals with their evolutionary relationship with the aforementioned Tectiviridae and other related clades: its elucidation would lead to a first draft of a tree delineating the transition of the main groups of dsDNA phages to eukaryotic hosts. Nonetheless, at this point the only common characteristic identified, apart from the use of dsDNA (Koonin, Dolja, et al., 2015), is the presence in both of an icosahedral capsid in their non-enveloped virions (H. W. Ackermann, 2003). Morphologically, however, the Tectiviridae are tailless (Figure 2), producing instead a tail-like structure during injection into the host, and harbor spikes at the vertices of their capsid (King, Adams, Carstens, & Lefkowitz, 2011b).
The second unanswered question is related to the classification of the Caudovirales: over the past years the taxonomy of the tailed phages has changed dramatically, including the creation of subfamilies for the family Siphoviridae (Adriaenssens et al., 2017) and, under the latest and as yet unpublished ICTV 2017 release (available at: http://ictv.global/taxonomyReleases.asp), the creation of an additional family, the Ackermannviridae, whose members come from unclassified Caudovirales phages (Figure 3a). The Tectiviridae have also experienced changes, with the creation of a second genus for the family: Betatectivirus (Figure 3b).

With the objective of overcoming the hurdles generated by high mutation rates, two approaches have been proposed in recent years for the construction of viral phylogenies. The first one calculates k-mer frequencies in complete genomes, which are then used to generate a distance matrix (Zhang, Jun, Leuze, Ussery, & Nookaew, 2017). Based on the latter, a dendrogram can be created through a neighbor-joining approach (Zhang et al., 2017). In contrast, the second one starts with the generation of viral domain orthologous groups (VDOGs), created using best reciprocal BLAST hits (Moreno-Gallego & Reyes, 2016). Once defined, the presence/absence of these clusters in each clade of interest can be employed to determine representative (characteristic) clusters for a given group, and hence can be used for taxonomic analyses, such as the definition of core genomes or the generation of distance-based dendrograms or phylogenetic trees (Moreno-Gallego & Reyes, 2016; Andrade-Martínez & Reyes, 2017). Methods based on bipartite networks of homologous genes and representative genomes (Iranzo, Krupovic, & Koonin, 2016), and distance-based procedures which incorporate both homologous gene groups and synteny, in particular the GRAViTy pipeline (Aiewsakun, Adriaenssens, Lavigne, Kropinski, & Simmonds, 2018), have also been proposed with promising results.
In fact, based on the findings from these studies, the morphology-based classification of the Caudovirales into the three traditional families, Siphoviridae, Myoviridae, and Podoviridae, has recently been called into question. It has been determined that members of these families do not constitute robust, monophyletic groups in network or distance-based analyses (Aiewsakun et al., 2018; Iranzo et al., 2016; Andrade-Martínez & Reyes, 2017). These techniques have also provided indirect evidence for a relationship between the Caudovirales and Tectiviridae: the Megavirales, members of the Polintoviruses group, seem to harbor ribonucleotide reductase and helicase genes similar to those of Caudovirales (Iranzo et al., 2016). Moreover, a dendrogram produced through GRAViTy analysis of all dsDNA phages clusters the Tectiviridae in the same branch as an offshoot of the Podoviridae, which is nonetheless located outside the main Caudovirales branch (Aiewsakun et al., 2018). Unfortunately, Zhang et al. (2017) do not provide information regarding the location of Tectiviridae in their k-mer dendrogram, so no conclusions can be drawn from this k-mer-only approach.

In this context, the present work aims to answer the aforementioned questions regarding the Caudovirales. Specifically, we employed the VDOGs approach, bolstered with whole-genome k-mer frequency information, to determine whether there is an evolutionary relationship between the Caudovirales and Tectiviridae, and to evaluate the agreement between the current classification of the Caudovirales and those genome-based metrics. In the process of accomplishing these objectives, we also generated a pipeline for dereplication and proteome prediction of viral genomes.

Materials and methods

Genome Data: Complete genomes from the order Caudovirales and the family Tectiviridae were downloaded from the National Center for Biotechnology Information (NCBI) Assembly database.
The search was conducted in May 2018, using as keywords the taxonomy ID of each clade: 548681 for the Herpesvirales and 10656 for the Tectiviridae. The length of the genomes of the Caudovirales ranges between 17 and 725 kb (H.-W. Ackermann, 1998); all entries whose length fell outside this range were excluded. For the Tectiviridae, the average length is 15 kb (De-Groot et al., 2012). In order to exclude entries with an abnormal length, the mean and standard deviation of the genome lengths of the clade were determined. Non-RefSeq entries whose size was more than three standard deviations from the mean were subjected to a BLAST search against themselves to determine whether the potentially abnormal length was due to sequencing or assembly errors that added spurious duplications to the sequence. If this was not the case, they were retained. All entries were downloaded as complete genome FASTA files and as GenBank entries. Therefore, along with the sequence itself, the accession number, length, and taxonomic classification of each entry were also available. Outdated classifications were updated based on the most recent taxonomy release (Adriaenssens et al., 2017) using a custom script.

Dereplication and Proteome Prediction: Genomes were clustered into groups with 95% identity or more using CD-HIT (Huang, Niu, Gao, Fu, & Li, 2010; Li & Godzik, 2006). The representative genomes yielded by this software after the clustering were then subjected to proteome prediction employing RASTtk (Overbeek et al., 2014). Entries with fewer than 15 predicted proteins were discarded (along with their genome clusters). For proteome dereplication, all predicted proteins from all representative genomes were pooled and then subjected to a CD-HIT analysis, generating clusters with at least 95% similarity. In this context, two proteomes are said to share a protein if each has at least 1 predicted sequence in a given cluster.
Based on this, two proteomes were grouped in the same (proteome) cluster if one of three conditions was fulfilled: i) both had fewer than 20 proteins and shared all of them; ii) one had fewer than 20 proteins and shared all of them with the other; iii) both had more than 20 proteins and shared more than 90% of them. A representative proteome of each of these new clusters (the one with the most predicted sequences), and its associated representative genome, were selected and employed for further analyses.

Dereplication Script: In order to streamline the genome and proteome dereplication methodology delineated above, a custom bash script was written to perform all the steps involved in this procedure. The bioinformatic problem that the script was designed to tackle is the following:
• Input: A set of genome sequences and the taxonomic classification of their organisms.
• Output: A dereplicated set of proteomes which reflects the gene and protein diversity found in the original genome set.
In other words, the script was created to be used in any analysis that requires the dereplication of a set of genomes and the prediction of their proteome sequences. These steps have already been used in two previous works from our research group (Moreno-Gallego & Reyes, 2019; Andrade-Martínez & Reyes, 2017) and can take up to a month to complete when employing separate scripts; hence the need to automate the process.

VDOG Analysis: Each of the peptides from the representative proteomes was screened, using HMMER (Eddy, 1998), against a VDOG HMM database (constructed in Moreno-Gallego & Reyes, 2016). If a given sequence was hit by an HMM with an e-value lower than or equal to 1e-10, it was determined that such protein, and thus its corresponding proteome, harbored the associated VDOG. Using the results of the HMM search, a matrix of presence/absence of each VDOG in each representative proteome was constructed.
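The construction of the presence/absence matrix described above can be illustrated with a minimal Python sketch. It assumes the HMM search results have already been parsed into (proteome, VDOG, e-value) tuples; the function name `build_presence_matrix`, the identifiers, and the input layout are hypothetical and are not taken from the pipeline itself. Only the 1e-10 e-value cutoff comes from the text.

```python
from collections import defaultdict

EVALUE_CUTOFF = 1e-10  # cutoff stated in the text


def build_presence_matrix(hits, proteome_ids, cutoff=EVALUE_CUTOFF):
    """Return (vdog_ids, matrix) where matrix[i][j] is 1 if proteome i harbors VDOG j.

    hits: iterable of (proteome_id, vdog_id, evalue) tuples (assumed input layout).
    proteome_ids: all representative proteomes, so proteomes without hits keep a row.
    """
    found = defaultdict(set)
    for proteome, vdog, evalue in hits:
        # A hit counts only if its e-value is lower than or equal to the cutoff.
        if evalue <= cutoff:
            found[proteome].add(vdog)

    # Columns: every VDOG detected in at least one proteome, in sorted order.
    vdog_ids = sorted({v for vdogs in found.values() for v in vdogs})
    matrix = [
        [1 if v in found.get(p, set()) else 0 for v in vdog_ids]
        for p in proteome_ids
    ]
    return vdog_ids, matrix


# Toy example with made-up identifiers.
example_hits = [
    ("P1", "VDOG_a", 1e-30),
    ("P1", "VDOG_b", 1e-5),   # discarded: e-value above the cutoff
    ("P2", "VDOG_b", 1e-10),  # kept: equal to the cutoff
]
vdogs, matrix = build_presence_matrix(example_hits, ["P1", "P2", "P3"])
```

Passing the full list of proteomes separately keeps a row of zeros for proteomes with no qualifying hit, which is one possible design choice rather than a documented behavior of the original scripts.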
K-mer frequencies: K-mer frequencies for values of k between 6 and 15 were computed for the genomes of the representative proteomes with the software Jellyfish (Marçais & Kingsford, 2011). The minimum number of counts L that a k-mer had to show to be reported for a given genome by Jellyfish was set using the formula:

$$L = \frac{2G - k + 1}{4^{k}}$$

which yields the number of copies of each individual k-mer of length k expected to be found by chance in a genome of length G. Here, G was set as the median length of all genomes considered, multiplied by 2 since different k-mers can be found in the same positions of each strand. L was rounded up to the nearest integer. After the initial tally, k-mers were further filtered based on their abundance in each of the clades of interest: the order Caudovirales and its families and subfamilies (Figure 3a), and the family Tectiviridae and its genera (Figure 3b). In particular, if a given k-mer was present in 60% or more of the representative genomes of at least one of the studied taxa, it was retained; otherwise, it was discarded.

Variable Selection: VDOGs and k-mer frequencies for each given value of k were dereplicated based on three successive analyses. Note that the number of Caudovirales representatives utilized (3760 after dereplication) was considerably higher than that of Tectiviridae representatives (11 after dereplication), that only 1 viral order (Caudovirales) was being studied, and that only one of the clades (Caudovirales) had defined subfamilies (as opposed to Tectiviridae). That being the case, and considering that some of the analyses performed required as input the taxonomic classification of the proteomes involved, 3 taxonomic levels were defined to carry them out:
• 1st Level (Order level): Including the order Caudovirales and the family Tectiviridae.
• 2nd Level (Family level): Including the families of Caudovirales (Figure 3a) and the family Tectiviridae.
• 3rd Level (Subfamily level): Including all the subfamilies within Caudovirales (Figure 3a) and the two genera of Tectiviridae (Figure 3b).
Note that all three levels, while being called order, family, and subfamily level, harbor taxonomic ranks which are not orders, families, or subfamilies respectively. This denomination is used for simplicity, but the terms 1st, 2nd, and 3rd are employed where confusion could arise. All analyses which required taxonomic information were done separately at each of the three levels.

Variable selection proceeded through three filters. First, the coefficient of variation (CV) of the distribution of presence/absence, for VDOGs, or frequency, for k-mers, of each feature across all representative proteomes was determined. Next, the Shannon entropy of each of these variables was computed three times, based on their distribution, either of presence/absence or frequency, in the representative proteomes as compared to the taxonomic classification of the latter at each of the 3 aforementioned taxonomic levels. If a VDOG or k-mer had a CV between the 5th and 35th percentiles, and an entropy (in at least one of the taxonomic levels) between the 75th and 95th percentiles, of the distribution of these values for all VDOGs or all k-mers of the same length, then it was retained. Otherwise it was discarded. Note that values of CV or entropy that are too small or too big, respectively, are indicative of VDOGs or k-mers with random or uniform distributions across all proteomes, which are not informative for taxonomic analyses. Each group of k-mers (per value of k) or VDOGs which passed the previous filter was used as the initial pool of features for a set of 100 Random Forest classifiers whose training set, amounting to 50% of the data, was selected randomly through a stratified shuffle split. The average importance of each feature across all classifiers was obtained, and the variables were ranked based on their importance.
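The coefficient-of-variation part of the first filter described earlier in this subsection can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the population standard deviation is used, features with a mean of zero are skipped, and percentiles are taken with the "inclusive" interpolation of the standard library; none of these choices is specified in the text. The entropy condition, which also depends on the taxonomic labels, is not reproduced here.

```python
import statistics


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (None if the mean is 0)."""
    mean = statistics.fmean(values)
    if mean == 0:
        return None
    return statistics.pstdev(values) / mean


def cv_percentile_filter(features, low_pct=5, high_pct=35):
    """Keep the features whose CV lies between two percentiles of all CVs.

    features: dict mapping a feature name to its values across proteomes
    (0/1 presence for VDOGs, frequencies for k-mers); assumed input layout.
    """
    cvs = {}
    for name, values in features.items():
        cv = coefficient_of_variation(values)
        if cv is not None:
            cvs[name] = cv

    # quantiles(..., n=100) returns the 99 cut points for percentiles 1..99.
    cuts = statistics.quantiles(list(cvs.values()), n=100, method="inclusive")
    low, high = cuts[low_pct - 1], cuts[high_pct - 1]
    return {name for name, cv in cvs.items() if low <= cv <= high}


# Toy example: 21 made-up features with CVs of roughly 0.00, 0.05, ..., 1.00.
toy = {f"f{i}": [1 - i / 20, 1 + i / 20] for i in range(21)}
kept = cv_percentile_filter(toy)
```

In the procedure described in the text, a feature is retained only if it passes both this CV window and the corresponding entropy window (75th to 95th percentile) for at least one taxonomic level.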
Following this, a new Random Forest classifier, which employed 75% of the data as training set, was constructed multiple times, discarding in each iteration the 10% least important features, as established by the ranking constructed previously, until only 2 variables remained. The variables which produced the classifier with the best performance, as measured by the F1-score on the test set, per taxonomic level considered, were retained, while the rest were discarded. Both the shuffle split and the Random Forest classifiers were implemented using Scikit-Learn (Pedregosa et al., 2012). The F1-score was chosen instead of accuracy to cope with the variation in the number of representative proteomes per clade of interest. A decision-tree method was chosen since both numerical and categorical variables, k-mer frequencies and presence/absence of VDOGs respectively, were considered. The random selection of variables for each of the classifiers which is performed in the Random Forest also allows redundant features to be assigned a high importance. This is desirable at this step since redundant features can be biologically relevant and informative (for example, genes that manifest synteny should have correlated presence/absence of their VDOGs). Moreover, filtering of correlated variables is done after this step (see below), so redundancy can safely be ignored at this point.

Finally, the remaining k-mer frequencies and VDOGs were joined together and submitted to a correlation analysis. First, the correlation coefficient ρ between all pairs of features was determined, and then an average linkage process was carried out, in which the distance d between each pair of features was set using the following formula:

$$d = \begin{cases} 1 - |\rho| & \text{if } p < 0.01 \\ 1 - |\rho| + p & \text{if } p > 0.01 \end{cases}$$

where p is the p-value of the correlation coefficient test. Note that d can range between a minimum of 0 for fully and significantly correlated variables, and a maximum of 2 for completely uncorrelated variables.
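As a small illustration, the distance defined above can be written as a Python function. This is a sketch rather than the code used in the study: the function name `correlation_distance` is hypothetical, and since the formula leaves the case p = 0.01 undefined, it is assigned here to the penalized branch as an assumption.

```python
ALPHA = 0.01  # significance threshold used in the formula


def correlation_distance(rho, p_value, alpha=ALPHA):
    """Distance between two features from their correlation and its p-value.

    Significant correlations (p < alpha) give d = 1 - |rho|.
    Otherwise the p-value is added as a penalty: d = 1 - |rho| + p.
    Assumption: p == alpha is treated as non-significant.
    """
    base = 1.0 - abs(rho)
    if p_value < alpha:
        return base
    return base + p_value


# The extremes mentioned in the text.
d_min = correlation_distance(1.0, 0.0)  # fully and significantly correlated
d_max = correlation_distance(0.0, 1.0)  # completely uncorrelated
```

With this definition, a strongly correlated pair that fails the significance test (for example ρ = 0.95 with p = 0.5) receives a distance of about 0.55 instead of 0.05, so it is less likely to be merged with other features during the linkage step.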
Using this criterion, clusters were formed with features at a distance of 0.1 or less amongst themselves. For clusters with three or more elements, the variable with the highest average correlation coefficient with the rest of the members of the group was selected as the centroid. For this process, non-significant correlations were set to have a ρ of 0. For clusters with two variables, 4 criteria were considered: highest entropy for each of the three taxonomic level distributions, and lowest CV. If either of the variables was selected under at least 3 of these 4 criteria, it was deemed the centroid. If both features were selected by 2 criteria each, both were retained. All singletons, that is, members of clusters with 1 element, were retained.

Representative VDOGs and Frequencies: Representative VDOGs are those that can be used as taxonomic signatures of a given viral clade. The same definition applies to representative k-mer frequencies. In order to determine which VDOGs and frequencies were representative of each of the clades of interest, three criteria were employed: i) the sum of the negative log2 value of the precision and the negative log2 value of the sensitivity that each VDOG had when employed as the sole criterion for classifying a given proteome as a member of a clade of interest, hereon referred to as the precision-sensitivity value (precision-sensitivity criterion); ii) mutual information, an entropy-based method which allowed the detection of unique and abundant VDOGs for each clade (mutual information criterion); iii) the inclusion or not of each VDOG as a feature in a Random Forest classifier, trained using Scikit-Learn (Pedregosa et al., 2012) with 100 trees, which aimed at determining the taxonomy of a proteome based on the presence/absence of all VDOGs (Random Forest criterion).
In particular, a VDOG or k-mer was deemed representative of a particular clade if it was included as a feature of the best Random Forest classifier and at least one of two conditions was fulfilled: i) it had a precision-sensitivity value of 2 or higher, which, as established by Moreno-Gallego & Reyes (2016), indicates a high precision and sensitivity for classifying one of the clades of interest, or ii) it could be classified as an outlier in the overall distribution of mutual information values for VDOGs in that clade based on the result of a modified Z-score test, where an outlier, following the procedure of Iglewicz & Hoaglin (1993), is defined as a k-mer or VDOG with a modified Z-score > 3.5. This procedure was carried out at the three taxonomic levels described previously. Note that, because all k-mers were filtered by abundance, all representative frequencies are also abundant for at least one of the taxa considered. This was not necessarily the case for representative VDOGs.

Dendrogram of Caudovirales-Tectiviridae: The values of all representative features in each of the representative proteomes (presence/absence for VDOGs or values of k-mer frequencies) were then used to construct a new matrix. For all possible pairs of proteomes, the Jaccard distance was computed and used to create a dendrogram for the Caudovirales and Tectiviridae through the Neighbor-Joining (NJ) algorithm. Proteomes from the ssDNA family Parvoviridae and the dsDNA phage order Ligamenvirales (obtained from Moreno-Gallego & Reyes, 2016) were used as outgroups (Table 4). The family Parvoviridae was selected since its members seem to have arisen from bacterial plasmids and RNA+ viruses, thus being completely unrelated to the dsDNA phages, including Caudovirales and Tectiviridae (Koonin, Dolja, et al., 2015).
On the other hand, the Crenarchaeota-infecting dsDNA order Ligamenvirales has been shown not to cluster with the Tectiviridae or Caudovirales in previous analyses (Aiewsakun et al., 2018; Iranzo et al., 2016), and thus in principle should be unrelated to these two groups. Dendrograms were constructed considering only representative VDOGs, only representative k-mers, and both representative VDOGs and k-mers.

Results

Genomes and Representative Proteomes: In total, 4439 genomes were initially downloaded from the NCBI Assembly database, 4424 from the Caudovirales and 15 from the Tectiviridae. With the exception of 79 sequences, all of the Caudovirales genomes had lengths within the expected interval of 17 to 725 kb, and were thus retained. For the Tectiviridae, although all genomes had a length within three standard deviations of the mean value (15.1 kb), 11 out of 13 had lengths between 14.9 kb and 15 kb. The remaining two, with accession numbers MG159787 and KY817360, had lengths of approximately 16.5 and 17.2 kb, so they were nonetheless subjected to a BLAST search against themselves. For KY817360, no significant hits were found apart from the complete match of the genome against itself, while for MG159787 there were two additional hits against a 326 bp region, which could account for a repetition, but whose length is insufficient to explain the difference of 1.5 kb with the rest of the genomes. Hence, both sequences were retained. After dereplicating entries at the nucleotide and the protein level, a total of 3771 representative proteomes were kept. A summary of their distribution in all clades of interest, including the subfamilies of Caudovirales and the genera of Tectiviridae, is shown in Tables 1 and 2. In total, 3760 Caudovirales representative proteomes were generated, 73 from the family Ackermannviridae, 917 from the family Myoviridae, 2139 from the family Siphoviridae, and 50 unclassified Caudovirales.
For the Tectiviridae, the process yielded 11 representative proteomes, 3 from the genus Betatectivirus, 6 from the genus Alphatectivirus, and 2 unclassified species.

Variable Selection: A total of 98,538 proteins comprise the representative proteomes generated by the dereplication process, with a total of 17,091 VDOGs being identified in at least one of the peptide sequences (Table 3). For k-mer frequencies, the values established for L, the minimum counts needed for a k-mer to be reported in a given genome by Jellyfish, were 18 for k=6; 5 for k=7; 3 for k=8, 9, and 10; and 2 for k=11, 12, 13, 14, and 15. It is important to note that L was increased by 2 for values of k between 8 and 10, and by 1 for values between 11 and 15, due to the high number of k-mers reported for lower thresholds. The abundance criterion, presence in at least 60% of the members of a clade of interest, was fulfilled by a total of 61,208 k-mers (Table 3), 28,179 of which were octamers (46%). The first filtering process, based on CV and entropy values analyzed for VDOGs and for each value of k separately, reduced the total number of features to 15,131. Reduction for each category ranged between 12 and 23% of the original features. Figure 4 shows the CV and entropy distributions obtained for the case of hexamers. A similar shape was observed for the rest of the values of k, as well as for VDOGs. The Random Forest analysis decreased the overall number of features by roughly a third (Table 3). The biggest reductions were seen for k=7 and k=9, where 14.1% and 17.4% of the variables which passed the first filter were retained. All 4127 VDOGs considered for this analysis were retained, since they jointly gave the best performance for the classifier in the three taxonomic levels considered, reaching F1-scores of over 0.95 in all cases (data not shown).
In contrast, there was ample variation in the performance of the classifiers trained with k-mer data: while the F1-score for the first taxonomic level (order Caudovirales and family Tectiviridae) was above 0.99 in all cases, even with the reduced number of features available for the higher values of k, it dropped below 0.6 for the second level (families of Caudovirales and family Tectiviridae), and below 0.35 for the third (subfamilies of Caudovirales and genera of Tectiviridae) (data not shown). Finally, the correlation dereplication left a total of 7468 variables for representative selection (Table 3). The most notable change was the reduction of the number of VDOGs to almost a quarter. Additionally, the number of k-mers for k=13 and k=14 diminished to less than a third, and no features with k=15 were retained.

Representative Features and Caudovirales-Tectiviridae Dendrogram: The precision-sensitivity, mutual information, and Random Forest criteria yielded a total of 1058 representative features, most of which (Table 3) were VDOGs (468) and hexamers (427). A single nonamer was deemed representative, while no k-mers from k=10 to k=15 were selected. The use of k-mers, VDOGs, or both produced different results in the classification dendrograms (Figures 5 to 7). Since the process of representative feature selection aims to isolate taxonomic signatures for the clades analyzed, here Tectiviridae and Caudovirales, it is expected that outgroups are not only nested on the external branches, but also poorly classified. This is also due to the fact that the outgroups selected have no putative relationships with the analyzed clades: as stated above, the family Parvoviridae has RNA ancestors (Koonin, Dolja, et al., 2015), while synteny and orthologous gene analyses indicate that the dsDNA order Ligamenvirales is unrelated to the Caudovirales or Tectiviridae (Aiewsakun et al., 2018; Iranzo et al., 2016).
This behavior is seen partly, but not completely, in all three graphs: when considering both k-mers and VDOGs, outgroups are indeed nested in the outermost branches (Figure 5). However, in the k-mer-only graph, not only does the main Parvoviridae group cluster with three Caudovirales genomes, but a small group of 9 Parvoviridae and 1 Ligamenvirales proteomes is clustered inside the main Caudovirales branch (Figure 6). For the VDOG-only dendrogram, outgroups have a poor resolution in general (Figure 7), and a good number of Caudovirales proteomes are included in the outermost branch. The Caudovirales-Tectiviridae relation and the family and subfamily resolution also vary depending on the input employed. Dendrograms which utilize k-mers produce a well-defined branch with Caudovirales and Tectiviridae (Figures 5 & 6), separated from the outgroups, while this is not the case for the VDOG-only graph (Figure 7). In none of the three cases do the Tectiviridae form a monophyletic group, being either partly clustered in the outgroup branch and partly in the Caudovirales branches (VDOG-only dendrogram, Figure 7), or scattered amongst Caudovirales branches (k-mer-only and VDOG and k-mer dendrograms, Figures 5 & 6). The same occurs for the Caudovirales families: none of them form consistent monophyletic groups, as evidenced by the fact that their subfamilies are not grouped in the same branches in the VDOG-only dendrogram (Figure 7). In contrast, in the VDOG-only dendrogram the subfamilies tend to be clustered in branches with either only representatives of that clade (as is the case for the subfamilies Mclasvirinae and Pclasivirinae) or with both representatives and proteomes with undetermined family (in the case of the subfamily ).
The only two notable exceptions are the subfamily Bclasvirinae, which shares a branch with the single subfamily Mclasvirinae proteome and unclassified members of the families Siphoviridae and Myoviridae, and the subfamilies Autographvirinae and , which form two distinct branches each.

Pipeline Construction: The final design of the pipeline scripts takes as input the Assembly genome FASTA and GenBank files, and performs the taxonomy adjustment, genome dereplication, and proteome prediction and dereplication steps as described in Materials and Methods. It then returns the set of multi-FASTA files of the representative proteomes, as well as a table with the accession, length, adjusted taxonomy, and (if non-representative) representative proteome of all initial entries (Figures 8 & 9). The set of files and scripts needed for the process is stored in a compressed zip file with a general bash script (pipeline1.sh) and a workspace folder. To run it, the user must simply download from the Assembly database the folders with the compressed files that contain the genomes of their clade of interest in FASTA and GenBank format, place them in the general folder with the names “[Clade]_Fasta” and “[Clade]_GenBank” respectively, and then execute the general script. The script is designed to run with the modules available in the high-performance cluster at Universidad de los Andes. All the software required for the process (including CD-HIT and RASTtk) is already installed on that server.

Discussion

VDOGs and k-mers, when to use which? Apart from analyzing the internal classification of the Caudovirales and exploring a putative relation with Tectiviridae, this study aimed at bolstering the VDOG approach, employed previously (Andrade-Martínez & Reyes, 2017; Moreno-Gallego & Reyes, 2016), to increase the resolution of viral classification.
The choice of k-mer frequencies stemmed from the need to cope with the main shortcoming of orthologous protein groups: their inability to consider regulatory sequences (Andrade-Martínez & Reyes, 2017). Both of the dendrograms constructed with k-mer frequencies, either alone or with VDOGs, achieved a good separation of the outgroups, Parvoviridae and Ligamenvirales, from the Caudovirales and Tectiviridae: with the exception of a small outgroup cluster in the k-mer only case and 3 Caudovirales genomes within the main outgroup branch, the studied clades were placed in a distinct branch from that of the outgroups. As stated above, k-mer only Random Forest classifiers constructed during the internal dereplication step were able to separate the analyzed clades with ease. Moreover, only k-mers were selected as representative features for the first (order Caudovirales and family Tectiviridae) taxonomic level. VDOGs, in contrast, achieved a good performance overall, especially in the second (families of Caudovirales and Tectiviridae) and third (subfamilies of Caudovirales and genera of Tectiviridae) taxonomic levels. As a matter of fact, the inclusion of k-mers in the Random Forest analysis for the selection of representative features achieved an F1-score of approximately 0.92 (data not shown), lower than the ones obtained during the VDOG-only Random Forest internal dereplication step. While it is true that a good deal of Caudovirales proteomes were clustered with the outgroups in the VDOG-only dendrogram, this was probably due to the lack of representative VDOGs for the first taxonomic level. The use of VDOGs for viral taxonomic analyses has been applied successfully before in studies analyzing long-range (order and family) and short-range (subfamily and genus) relationships (Aiewsakun et al., 2018; Moreno-Gallego & Reyes, 2016). On the other hand, the use of k-mer frequencies has only been tested for long-range relationships (Zhang et al., 2017). 
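To make the notion of k-mer frequencies concrete, the following is a minimal sketch of how they could be computed for a single nucleotide sequence. It is an illustration only and not the counting tool used in this work: the function name `kmer_frequencies`, the toy sequence, and the decisions to count on one strand and to skip windows containing ambiguous bases are all assumptions made for the example.

```python
from collections import Counter

def kmer_frequencies(sequence, k):
    """Relative frequency of every k-mer in `sequence` (hypothetical helper).

    Assumptions: single strand only, windows with non-ACGT characters skipped.
    """
    seq = sequence.upper()
    valid = set("ACGT")
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= valid
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}

# Toy sequence, invented for illustration (not a real genome).
toy_sequence = "ATGCGATATGCGN"
freqs = kmer_frequencies(toy_sequence, 3)
for kmer, freq in sorted(freqs.items()):
    print(kmer, round(freq, 2))
```

Frequency vectors of this kind, computed per genome and per value of k, are the type of feature that a classifier or a distance-based dendrogram can then take as input.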
Previous results, coupled with what is seen here, allow us to suggest the following:

• K-mer approaches are optimal for long-range evolutionary relationships, since little data is needed to achieve a clear-cut or almost clear-cut distinction between clades of interest. Out of the 1058 representative features generated in this study, only 58 were selected as representative for Caudovirales or Tectiviridae, and Random Forest classifiers with as few as 3 k-mers, constructed during the internal dereplication procedure, achieved F1-scores of over 0.995 when distinguishing between these two clades (data not shown).

• Orthologous protein cluster approaches work at all taxonomic levels, and, in light of the poor k-mer performance at closer evolutionary distances, are particularly useful for taxonomic analyses considering subfamily or genus relationships.

The utilization of k-mers for phylogenetic analyses could be preferable since it has the advantage of speed: without parallelization, the HMM search conducted against the VDOG database with the Caudovirales and Tectiviridae genomes required days to complete, while k-mer detection for the whole range of k=6 to k=15 was finished in a matter of hours. The process can be further delayed if the orthologous groups need to be defined from the working data, as was the case for the GRAViTy software (Aiewsakun et al., 2018) and for bipartite networks (Iranzo et al., 2016). It is important to note that these results could be biased due to the fact that only k-mers for values of k between 6 and 9 were deemed representative, with a single representative k-mer for k=9 (Table 3). In general, higher values of k are more useful and abundant for studying either bigger genomes or smaller taxonomic ranks (Zhang et al., 2017). Their absence here is reflected by the high number of discarded k-mers in the 60% abundance filters for k=9 to k=15, where at most 27% of the features were retained, and in some cases as few as 7%. In contrast, the retention rates of k=6, 7, and 8 were 99%, 91%, and 47% respectively. 
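The retention rates quoted above can be recovered from the counts reported in Table 3. The snippet below is a small worked check, assuming that the rate is the number of k-mers remaining after the abundance filter divided by the number remaining after the preceding L filter, truncated to a whole percentage; the variable and function names are chosen here for illustration.

```python
# Counts copied from Table 3: k -> (after L filter, after abundance filter).
table3_counts = {
    6: (4034, 4033),
    7: (15902, 14502),
    8: (59138, 28179),
    9: (125823, 9783),
    10: (6845, 1864),
    11: (10154, 2083),
    12: (3227, 492),
    13: (1170, 170),
    14: (486, 68),
    15: (245, 34),
}

def retention_percent(before, after):
    """Percentage of features kept by a filter, truncated to an integer."""
    return 100 * after // before

retention = {k: retention_percent(b, a) for k, (b, a) in table3_counts.items()}
high_k = [retention[k] for k in range(9, 16)]

for k in sorted(retention):
    print(f"k={k}: {retention[k]}% retained")
print("k=9..15 range:", min(high_k), "to", max(high_k))
```

Under that assumption the values for k=6, 7, and 8 come out as 99, 91, and 47, and the values for k=9 to k=15 range from 7 to 27, in agreement with the figures given in the text.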
That being the case, further analyses considering classification of viral genera and species are needed to verify the proposed hypothesis.

Caudovirales and Tectiviridae:

The order Caudovirales and family Tectiviridae are members of the two dsDNA viral lineages which made a transition to eukaryote hosts (Koonin, Dolja, et al., 2015). The Herpesvirales originated from the order Caudovirales, most probably from the Tevenvirinae subfamily of the family Myoviridae (Andrade-Martínez & Reyes, 2017), while the Polintoviruses originated from the Tectiviridae or one of its related archaea-infecting families (Corticoviridae and Sphaerolipoviridae) (Koonin, Krupovic, et al., 2015). The elucidation of their relationship would lead to a unified explanation of the transition of dsDNA viruses from prokaryote to eukaryote hosts. Under the premise that the k-mer strategy is preferable for elucidating higher order relationships, the results obtained suggest the existence of an evolutionary link between the two clades: all dendrograms which considered k-mers grouped the Caudovirales and Tectiviridae inside a single branch. Even though the Tectiviridae do not form a monophyletic group in these cases, this is most probably due to the aforementioned deficiency of k-mers for resolving lower taxonomic levels. In fact, employing a set of representative VDOGs for the first taxonomic level, derived from a preliminary analysis which did not employ internal variable selection and considered the RNA orders Picornavirales and Nidovirales as outgroups, yields two related Caudovirales branches, one of which harbors the Tectiviridae along with a set of Podoviridae genomes (Supplementary Figure S1), similar to the results obtained through GRAViTy (Aiewsakun et al., 2018). The subfamily discrimination of this tree is poor, however, likely due to the combination of VDOGs coming from distinct analyses, but it nonetheless provides evidence of a relationship between these clades. 
Moreover, under the Jaccard distance metric employed for the generation of the VDOG and k-mer dendrogram, the average distance between members of the same clade (Caudovirales or Tectiviridae) is significantly higher than the average distance between members of the Caudovirales and members of the Tectiviridae (Supplementary Figure S2). In other words, there are species of Caudovirales which are more closely related to species of Tectiviridae than to other species of Caudovirales. Were there to be an evolutionary link between the two groups considered, the results generated strongly suggest that Tectiviridae originated from a branch of Caudovirales, probably a subclade of Podoviridae (Aiewsakun et al., 2018). It would be tempting to propose that a group of Podoviridae gradually lost their tails to create the modern Tectiviridae virion morphology, but the generation of anomalous virions is frequent amongst all clades of the Caudovirales (Ackermann, 1998). That being the case, further research is needed to confirm the putative relation delineated here, and, if there is a relation, to assign a sister clade to the Tectiviridae within the Caudovirales.

Caudovirales:

The original classification of the Caudovirales into three families, Podoviridae, Myoviridae, and Siphoviridae, was based mostly on morphological criteria (Aiewsakun et al., 2018). Previous orthologous group-based analyses of the Caudovirales have put into question a common ancestry for the members of each of these clades, while determining that current subfamilies tend to form consistent, monophyletic groups (Aiewsakun et al., 2018; Andrade-Martínez & Reyes, 2017; Iranzo et al., 2016). This was also the case here, with most subfamilies being either monophyletic or clustering in branches with viruses with unassigned subfamily, where the latter can be regarded as putative members of the subfamilies they group with. 
The only exceptions observed to this rule, as stated above, are the Bclasvirinae, Autographivirinae, and Tevenvirinae. The main branch which harbors the Bclasvirinae, however, contains only unclassified Siphoviridae and Myoviridae genomes and the single Mclasvirinae representative, suggesting that all of these proteomes could be grouped into a single clade (either extending the Bclasvirinae or creating a family). The separation of the Tevenvirinae and of the Autographivirinae into two branches was not seen before, however (Andrade-Martínez & Reyes, 2017), and neither was that of the Ackermannviridae family (Cvivirinae and Aglimvirinae), which forms a monophyletic group under GRAViTy analysis (Aiewsakun et al., 2018). These shortcomings of the current methodology will be addressed in future research. In spite of this, the results obtained support the need for a reevaluation of the classification of the Caudovirales, at least at the family level, based on a consistent measure of distance and a unified set of criteria. The use of orthologous group approaches employed in this and other studies strongly suggests that a measure of distance could be derived from the presence/absence of such clusters. While the results obtained here argue against the application of k-mer frequencies for a reevaluation of family taxonomy, synteny analyses could be instrumental in deriving such a measure, and have already been used to support the lack of consistency in Caudovirales families (Aiewsakun et al., 2018). It remains to be seen whether a combined VDOG-synteny or VDOG-k-mer strategy, or a VDOG-only strategy, yields better results and provides a framework for future phage classification.

Conclusions

K-mer frequency analyses for the elucidation of viral phylogenetic relationships, while fast to carry out, seem to produce consistent results at higher phylogenetic distances only. In contrast, orthologous group-based approaches, in this case VDOGs, yield good results for all levels. 
Both k-mer frequencies and VDOG strategies support the hypothesis that the phage family Tectiviridae and the order Caudovirales are evolutionarily related, but the details of such a relationship remain to be determined. If this common ancestry hypothesis is confirmed in the future, then the two main dsDNA viral lineages would have a common origin, and therefore all eukaryote-infecting dsDNA viruses would share a common ancestor. Finally, it is imperative to reassess the current family classification of the Caudovirales, since the three traditional families do not seem to be evolutionarily consistent according to the VDOG and k-mer analyses performed here. Future studies should aim at suggesting clear-cut criteria for establishing tailed-phage families starting from the already defined monophyletic subfamilies.

Bibliography

Ackermann, H.-W. (1998). Tailed Bacteriophages: The Order Caudovirales. Advances in Virus Research, 51, 135–201. https://doi.org/10.1016/S0065-3527(08)60785-X

Ackermann, H. W. (2003). Bacteriophage observations and evolution. Research in Microbiology, 154(4), 245–251. https://doi.org/10.1016/S0923-2508(03)00067-6

Adriaenssens, E. M., Krupovic, M., Knezevic, P., Ackermann, H. W., Barylski, J., Brister, J. R., … Kuhn, J. H. (2017). Taxonomy of prokaryotic viruses: 2016 update from the ICTV bacterial and archaeal viruses subcommittee. Archives of Virology, 162(4), 1153–1157. https://doi.org/10.1007/s00705-016-3173-4

Aiewsakun, P., Adriaenssens, E. M., Lavigne, R., Kropinski, A. M., & Simmonds, P. (2018). Evaluation of the genomic diversity of viruses infecting bacteria, archaea and eukaryotes using a common bioinformatic platform: steps towards a unified taxonomy. Journal of General Virology, 1–13. https://doi.org/10.1099/jgv.0.001110

Aiewsakun, P., & Simmonds, P. (2018). The genomic underpinnings of eukaryotic virus taxonomy: Creating a sequence-based framework for family-level virus classification. Microbiome, 6(1), 1–24. https://doi.org/10.1186/s40168-018-0422-7

Andrade-Martínez, J. S., & Reyes, A. (2017). Defining a Core Genome for the Herpesvirales and Elucidating their Evolutionary Relationship with the Caudovirales (Tesis de Pregrado). Universidad de los Andes, Bogotá, Colombia.

De-Groot, R. J., Baker, S. C., Baric, R., Enjuanes, L., Gorbalenya, A. E., Holmes, K. V., … Ziebuhr, J. (2012). Virus Taxonomy; Ninth Report of the International Committee on Taxonomy of Viruses. Elsevier. https://doi.org/10.1016/B978-0-12-384684-6.00115-4

Eddy, S. (1998). Profile hidden Markov models. Bioinformatics, 14(9), 755–763. https://doi.org/btb114 [pii]

Hendrix, R. W. (2008). Evolution of dsDNA Tailed Phages. Origin and Evolution of Viruses (Second Edi). Elsevier Ltd. https://doi.org/10.1016/B978-0-12-374153-0.00010-2

Huang, Y., Niu, B., Gao, Y., Fu, L., & Li, W. (2010). CD-HIT Suite: A web server for clustering and comparing biological sequences. Bioinformatics, 26(5), 680–682. https://doi.org/10.1093/bioinformatics/btq003

Iranzo, J., Krupovic, M., & Koonin, E. V. (2016). The double-stranded DNA virosphere as a modular hierarchical network of gene sharing. MBio, 7(4), 1–21. https://doi.org/10.1128/mBio.00978-16

King, A. M. Q., Adams, M. J., Carstens, E. B., & Lefkowitz, E. J. (2011a). Caudovirales. In A. M. Q. King, M. J. Adams, E. B. Carstens, & E. J. Lefkowitz (Eds.), Virus Taxonomy: Ninth Report of the International Committee on Taxonomy of Viruses (1st ed., Vol. 1, pp. 39–45). London: Elsevier Inc. https://doi.org/10.1016/B978-0-12-384684-6.00001-X

King, A. M. Q., Adams, M. J., Carstens, E. B., & Lefkowitz, E. J. (2011b). Tectiviridae. In A. M. Q. King, M. J. Adams, E. B. Carstens, & E. J. Lefkowitz (Eds.), Virus Taxonomy: Ninth Report of the International Committee on Taxonomy of Viruses (1st ed., Vol. 1, pp. 317–321). London: Elsevier Inc. https://doi.org/10.1016/B978-0-12-384684-6.00030-6

Koonin, E. V., Dolja, V. V., & Krupovic, M. (2015). Origins and evolution of viruses of eukaryotes: The ultimate modularity. Virology, 479–480, 2–25. https://doi.org/10.1016/j.virol.2015.02.039

Koonin, E. V., Krupovic, M., & Yutin, N. (2015). Evolution of double-stranded DNA viruses of eukaryotes: From bacteriophages to transposons to giant viruses. Annals of the New York Academy of Sciences, 1341(1), 10–24. https://doi.org/10.1111/nyas.12728

Kristensen, D. M., Waller, A. S., Yamada, T., Bork, P., Mushegian, A. R., & Koonin, E. V. (2013). Orthologous gene clusters and taxon signature genes for viruses of prokaryotes. Journal of Bacteriology, 195(5), 941–950. https://doi.org/10.1128/JB.01801-12

Li, W., & Godzik, A. (2006). Cd-hit: A fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 22(13), 1658–1659. https://doi.org/10.1093/bioinformatics/btl158

Marçais, G., & Kingsford, C. (2011). A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics, 27(6), 764–770. https://doi.org/10.1093/bioinformatics/btr011

McGeoch, D. J., Davison, A. J., Dolan, A., Gatherer, D., & Sevilla-Reyes, E. E. (2008). Molecular Evolution of the Herpesvirales. Origin and Evolution of Viruses (Second Edi). Elsevier Ltd. https://doi.org/10.1016/B978-0-12-374153-0.00020-5

Moreno-Gallego, J. L., & Reyes, A. (2016). En Búsqueda de Regiones Informativas en Genomas Virales (Tesis de Maestría). Universidad de los Andes, Bogotá, Colombia.

Overbeek, R., Olson, R., Pusch, G. D., Olsen, G. J., Davis, J. J., Disz, T., … Stevens, R. (2014). The SEED and the Rapid Annotation of microbial genomes using Subsystems Technology (RAST). Nucleic Acids Research, 42(D1), 206–214. https://doi.org/10.1093/nar/gkt1226

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., … Duchesnay, É. (2012). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825–2830. https://doi.org/10.1007/s13398-014-0173-7.2

Selvarajan Sigamani, S., Zhao, H., Kamau, Y. N., Baines, J. D., & Tang, L. (2013). The structure of the DNA-packaging terminase pUL15 nuclease domain suggests an evolutionary lineage among eukaryotic and prokaryotic viruses. Journal of Virology, 87(12), 7140–8. https://doi.org/10.1128/JVI.00311-13

Zhang, Q., Jun, S.-R., Leuze, M., Ussery, D., & Nookaew, I. (2017). Viral Phylogenomics Using an Alignment-Free Method: A Three-Step Approach to Determine Optimal Length of k-mer. Scientific Reports, 7(December 2016), 40712. https://doi.org/10.1038/srep40712

Table 1. Distribution of the initial genomes (in parentheses) and representative proteomes for the order Caudovirales, its families, and its subfamilies.

Order         n            Family            n            Subfamily          n
Caudovirales  3760 (4424)  Ackermannviridae  73 (84)      Aglimvirinae       64 (75)
                                                          Cvivirinae         9 (9)
                           Myoviridae        917 (1142)                      35 (43)
                                                          Ounavirinae        25 (25)
                                                          Eucampyvirinae     7 (9)
                                                          Vequintavirinae    15 (15)
                                                          Tevenvirinae       198 (273)
                                                          Spounavirinae      45 (61)
                                                          Unclassified       637 (777)
                           Podoviridae       581 (688)    Autographivirinae  175 (197)
                                                                             30 (33)
                                                          Sepvirinae         13 (15)
                                                          Unclassified       363 (443)
                           Siphoviridae      2139 (2458)  Arquatrovirinae    13 (14)
                                                          Bclasvirinae       86 (102)
                                                          Chebruvirinae      5 (5)
                                                          Dclasvirinae       1 (1)
                                                          Guernseyvirinae    20 (21)
                                                          Mclasvirinae       2 (2)
                                                          Mccleskeyvirinae   2 (2)
                                                          Nclasvirinae       8 (8)
                                                          Nymbaxtervirinae   4 (4)
                                                          Pclasvirinae       7 (7)
                                                          Tunavirinae        30 (30)
                                                          Unclassified       1961 (2268)
                           Unclassified      50 (52)

Table 2. Distribution of the initial genomes (in parentheses) and representative proteomes for the family Tectiviridae and its genera.

Family        n        Genus           n
Tectiviridae  11 (15)  Betatectivirus  3 (4)
                       Tectivirus      6 (9)
                       Unclassified    2 (2)

Table 3. Number of potential k-mers, per value of k, and VDOGs to be detected, and amount remaining after each of the filtering steps, and after representative feature selection.

K-mer  Potential  L Filter  Abundance Filter  CV-Entropy Filter  RF Filter  Correlation Filter  Representative
6      4.1E+03    4034      4033              672                508        427                 427
7      1.6E+04    15902     14502             2491               352        333                 146
8      6.6E+04    59138     28179             4901               4072       4071                16
9      2.6E+05    125823    9783              1966               344        344                 1
10     1.0E+06    6845      1864              434                266        262                 0
11     4.2E+06    10154     2083              432                266        260                 0
12     1.7E+07    3227      492               59                 39         36                  0
13     6.7E+07    1170      170               29                 29         8                   0
14     2.7E+08    486       68                11                 11         2                   0
15     1.1E+09    245       34                9                  9          0                   0
VDOGs  3.1E+04    17091     17091             4127               4127       1725                468
Total  1.4E+09    244115    78299             15131              10023      7468                1058
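The "Potential" column of Table 3 for the k-mer rows is consistent with the number of distinct nucleotide words of length k, that is, 4^k. The short check below reproduces those values under that assumption (four bases, no collapsing of reverse complements); it is a reading of the table inferred from the numbers, not a statement of how the column was computed.

```python
# Number of distinct nucleotide k-mers for each k considered (assumed to be 4**k).
potential = {k: 4 ** k for k in range(6, 16)}
total_kmers = sum(potential.values())

# Same scientific-notation format as Table 3, e.g. 4096 -> "4.1E+03".
formatted = {k: f"{n:.1E}" for k, n in potential.items()}

for k in sorted(potential):
    print(k, formatted[k])
print("sum over k=6..15:", f"{total_kmers:.1E}")
```

The sum over k=6 to k=15 is about 1.4E+09, so adding the 3.1E+04 potential VDOGs leaves the reported total unchanged at that precision.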

Table 4. Distribution of the genomes used as outgroups in the representative feature dendrograms.

Order           n   Family        n    Subfamily     n
                    Parvoviridae  172                137
                                       Densovirinae  34
                                       Unclassified  1
Ligamenvirales  14  Rudiviridae   6
                                  8

Figure 1. A: General morphology of a Caudovirales virion (Swiss Institute of Bioinformatics, 2013). Tail fibers and baseplates are examples of attachments not harbored by all Caudovirales. B: Traditional Caudovirales families (Ninjatacoshell, 2014). The length and contractility of the tail are the hallmark criteria for classification of the traditional families: Myoviridae virions have long contractile tails, those of Siphoviridae have long non-contractile tails, and those of Podoviridae have short non-contractile tails.

Figure 2. General morphology of a Tectiviridae virion (Swiss Institute of Bioinformatics, 2013).

Figure 3. Taxonomy of the clades analyzed in this study. A: Under the most recent classification the Caudovirales are composed of 4 families: Ackermannviridae (2 subfamilies), Myoviridae (5 subfamilies), Podoviridae (3 subfamilies), and Siphoviridae (11 subfamilies). B: The family Tectiviridae has no subfamilies, and 2 genera: Tectivirus (or Alphatectivirus) and Betatectivirus.

Figure 4. Coefficient of variation (CV) and Entropy distributions for k=6, for each of the taxonomic levels considered: Caudovirales and Tectiviridae (A), Tectiviridae and the families of Caudovirales (B), and the genera of Tectiviridae and subfamilies of Caudovirales (C).

Figure 5. Dendrogram of the Caudovirales and Tectiviridae constructed with representative k-mer frequencies and VDOGs. Branches are colored if more than 60% of the species located there belong to the given clade. Amongst the Tectiviridae, only the Tectivirus form a monophyletic group, consisting of 5 of the 6 representative genomes.

Figure 6. Dendrogram of the Caudovirales and Tectiviridae constructed with only representative k-mer frequencies. Branches are colored if more than 60% of the species located there belong to the given clade. As in the dendrogram built from all representative features, for the Tectiviridae only the Tectivirus genus forms a monophyletic group, consisting of 5 of the 6 representative genomes.

Figure 7. Dendrogram of the Caudovirales and Tectiviridae constructed with representative VDOGs only. The complete color key is located on the following page. Clades are colored in branches where at least 60% of the proteomes belong to the clade in question. For the Tectiviridae, the Betatectivirus are grouped in a branch with unclassified Caudovirales proteomes, while the Tectivirus are scattered over the Caudovirales branch. The two Mccleskeyvirinae proteomes clustered with the outgroups and are thus not shown in an individual color. The single Dclasvirinae proteome is grouped in a branch with unclassified Siphoviridae, which is colored according to the key. The Ackermannviridae subfamilies are clustered within groups of unclassified members of other families, hence the discordant coloring and the lack of a region with the Ackermannviridae color.

Figure 8. Outline of the general steps carried out by Pipeline 1.

Figure 9. Specific procedure for the proteome prediction and dereplication step of Pipeline 1.


Supplementary Figure S1. Preliminary dendrogram constructed with VDOGs for the Caudovirales and Tectiviridae using the RNA orders Nidovirales and Picornavirales as outgroups. This graph considers all VDOGs which passed at least 1 of the representative feature criteria for the 1st taxonomic level (Caudovirales and Tectiviridae). A: The dendrogram achieves a good resolution of the outgroups and Caudovirales and Tectiviridae, which cluster in two main branches. B: Zooming in to the uppermost of these branches shows that the Tectiviridae and its genera form monophyletic clades.


Supplementary Figure S2. Box-plot of the Jaccard distance obtained based on the representative features between members of the same clade, as determined by the 1st taxonomic Level, that is, Caudovirales and Tectiviridae; between members of different clades (between members of Caudovirales and members of Tectiviridae); and between either Caudovirales or Tectiviridae species with the outgroups (Ligamenvirales and Parvoviridae). The double asterisk (**) denotes significant relations, with p-value < 0.01, as determined by a Mann-Whitney test. The plus or minus signs below indicate the direction of the relationship. Note that, as expected, the average distance between the outgroups and either members of Caudovirales or Tectiviridae is significantly higher than both of the other two cases, indicating good outgroup resolution. On the contrary, the average distance between members of different clades is significantly smaller than that of members of the same clade, implying that there are Caudovirales and Tectiviridae genomes more related to members of the other clade of interest than to members of their same clade.
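The comparison summarized in Supplementary Figure S2 can be illustrated with a minimal sketch of the Jaccard distance between presence/absence feature sets, followed by the mean distance within and between clades. The genome labels and feature names below are invented toy data, and the helper names are hypothetical; the significance test used in the figure (Mann-Whitney) is not reproduced here.

```python
from itertools import combinations
from statistics import mean

def jaccard_distance(a, b):
    """1 - |intersection| / |union| for two sets of present features."""
    union = a | b
    if not union:
        return 0.0
    return 1 - len(a & b) / len(union)

def mean_distances(genomes):
    """Mean Jaccard distance for same-clade and different-clade genome pairs."""
    within, between = [], []
    for (_, (clade_a, feats_a)), (_, (clade_b, feats_b)) in combinations(genomes.items(), 2):
        d = jaccard_distance(feats_a, feats_b)
        (within if clade_a == clade_b else between).append(d)
    return mean(within), mean(between)

# Toy data: label -> (clade, set of representative features present).
toy_genomes = {
    "caudo_1": ("Caudovirales", {"k1", "k2", "k3"}),
    "caudo_2": ("Caudovirales", {"k1", "k4", "k5"}),
    "tecti_1": ("Tectiviridae", {"k1", "k2", "k3", "k6"}),
}

within_mean, between_mean = mean_distances(toy_genomes)
print("within clades:", round(within_mean, 3))
print("between clades:", round(between_mean, 3))
```

In this toy case the between-clade mean is smaller than the within-clade mean, which is the pattern the figure reports for the real data; with real distance lists, a two-sample test such as `scipy.stats.mannwhitneyu` could then be applied to the two groups.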