Informative Regions in Viral Genomes
Total Page:16
File Type:pdf, Size:1020Kb
bioRxiv preprint doi: https://doi.org/10.1101/2021.02.28.433233; this version posted February 28, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. Informative Regions In Viral Genomes J. Leonardo Moreno-Gallego1,2, and Alejandro Reyes2,3, 1Department of Microbiome Science, Max Planck Institute for Developmental Biology, Tübingen 72076, Germany 2Max Planck Tandem Group in Computational Biology, Department of Biological Sciences, Universidad de los Andes, Bogotá 111711, Colombia 3The Edison family Center for Genome Sciences and Systems Biology, Washington University School of Medicine, Saint Louis, MO 63108, USA Viruses, far from being just parasites affecting hosts’ fitness sity. Only about 25% of sequences in viromes from marine are major players in any microbial ecosystem. In spite of their environments have a match (e-value <= 0.001) to a known broad abundance, viruses, in particular bacteriophages remain sequence (Breitbart and Rohwer 2005; Hurwitz et al. 2016; largely unknown since only about 20% of sequences obtained López-Bueno et al. 2009). The large diversity and the lack of from viral community DNA surveys could be annotated by com- universal molecular markers make it difficult to organize and parison to public databases. In order to shed some light into characterize known and new viral genomes. this genetic dark matter we expanded the search of ortholo- gous groups as potential markers to viral taxonomy from bac- Currently, computational tools such as VIROME teriophages and included eukaryotic viruses, establishing a set (Wommack et al. 2012), MetaVir2 (Roux et al. 2014), of 31,150 ViPhOGs. To this, we examine the non-redundant and MG-RAST (Keegan et al. 2016) use available molec- viral diversity stored in public databases, predict proteins in ular databases to analyze genomes and metagenomes. Nev- genomes lacking such information and all annotated and pre- ertheless, they are limited by considering only the fraction dicted proteins were used to identify potential protein domains. of the data generated that has significant similarity to pre- Clustering of domains and unannotated regions into ortholo- viously annotated data. New approaches based not only on gous groups was done using cogSoft. Finally, we employed a annotated sequence comparisons, but also on all the avail- supervised classification method by using a Random Forest im- able information promise to be useful for analyzing individ- plementation to classify genomes into their taxonomy and found ual viral genomes and viromes. For instance, Skewe-Cox et that the presence/absence of ViPhOGs is significantly associ- ated with their taxonomy. Furthermore, we established a set of al., used protein sequence clustering to generate viral profile 1,457 ViPhOGs that given their importance for the classification Hidden Markov Models (“vFams”) that were subsequently could be considered as markers/signatures for the different tax- used for classifying highly divergent sequences (Skewes- onomic groups defined by the ICTV at the Order, Family, and Cox et al. 2014). Furthermore, the identification of highly Genus levels. Both, the ViPhOGs and informative ViPhOGs sets conserved genes in specific taxonomic groups has been an- are available at the Open Science Framwork (OSF) under the other approach for the taxonomic classification of viral se- ViPhOGs project (https://osf.io/2zd9r/). quences. For instance, diversity analyses of cyanophages, Viruses | Phages | Orthologous gropus | RandomForest | ViPhOGs algae viruses, T4 and T7 phages have been conducted fol- Correspondence: [email protected] [email protected] lowing this method, and they enabled the characterization of the viral diversity in the studied environments (Zhong et al. 2002; Short and Suttle 2005; Fujihara et al. 2010). However, Introduction the mentioned studies were restricted to specific families and Viruses are entities widely spread all around the biosphere. It ecosystems. is estimated that viral particles are ten times more abundant Using another approach, Kristensen et al. constructed a than other types of microorganisms and although theirDRAFT inclu- collection of phage orthologous groups (“POGs”) from bac- sion in a new life domain remains controversial, it is clear that terial and archeal viruses (Kristensen et al. 2011; Kristensen they are not merely parasites (Ignacio-Espinoza et al. 2013; et al. 2013). Orthologous gene sets are widely used as a Martínez Martínez et al. 2014; Koonin et al. 2015). Viruses powerful technique in comparative genomics and for viruses actively participate in ecosystem remodelling, population dy- it has been suggested that marker genes could be obtained namics and a wide variety of ecological, biogeochemical, ge- using this technique (Kristensen et al. 2011; Kristensen et netic and physiological processes (Kristensen et al. 2010). al. 2013). Based on the same concept, eggNOG has imple- Despite their importance and abundance, the viral di- mented in its latest version a database of orthologous groups versity has not been well characterized. Difficulties in the focused on viruses (Viral OGs) (Powell et al. 2014). How- isolation of pure cultures and the description of viral cycles ever, they do not include the whole breadth of viral diver- are common limitations in virus research, since less than 1% sity represented in the public databases, since genomes of of environmental microorganisms can be grown in the labo- eukaryotic viruses were not included by either of the men- ratory (Hugenholtz et al. 1998). Next generation sequencing tioned studies. Given the large amount of currently available techniques enabled us to partially overcome these difficul- genetic information, it is imperative to develop and imple- ties, revolutionizing this field of virology. Deep sequencing ment new tools to reliably and efficiently analyze these data surveys of viral communities (viromes) have revealed a di- to better describe viral diversity. versity beyond all expectations, and they have evidenced the Computational techniques have been used previously lack of knowledge we currently have on global viral diver- for similar issues in biology. Machine learning methods, for Moreno-Gallego and Reyes | bioRχiv | February 28, 2021 | 1–11 bioRxiv preprint doi: https://doi.org/10.1101/2021.02.28.433233; this version posted February 28, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. example, are algorithms which learn through the experience, ison of the genome length distribution of phages and eu- attempting to classify information according to shared fea- karyotic viruses shows that dsDNA and ssDNA phages tend tures. These techniques allow us to extract patterns, trends to have larger genomes than dsDNA and ssDNA eukaryotic and finally analyze the information using a non-deterministic viruses (Mannwhitney test when comparing dsDNA viruses: way. Supervised learning algorithms such as Random For- p-value=7.19e-06; Mannwhitney test when comparing ss- est, Support Vector Machine and Neural Networks have been DNA viruses: p-value=3.08e-64). In the case of ssRNA successfully introduced to solve complex biological prob- viruses, eukaryotic viruses tend to have larger genomes than lems, such as image analysis, microarray expression analysis, phages (Mannwhitney test when comparing ssRNA viruses: QTLs analysis, detection of transcription start sites, epitopes p-value=7.89e-16) (Figure 1). detection, protein identification and function, among others The set of 14,057 non-redundant genomes code for a (Libbrecht and Noble 2015; Vervier et al. 2016). total of 442,007 proteins, where we observe that the number In this work we used the methodology proposed by of genes is directly proportional to the length of the genome. Kristensen et al., for the identification of gene and domain or- Interestingly, a linear regression suggests a gene density of thologous groups from related viral sequences. We expanded 12 proteins per kilobase in the case of phages, while in the the reach of this approach by incorporating genomes of eu- case of eukaryotic viruses the gene density is only about 2.5 karyotic viruses and applying Random Forest as a machine proteins per kilobase; indicating a lower gene density for eu- learning strategy to identify taxonomically informative or- karyotic viruses in comparison with phages (Figure 2). thologous groups. We denominated the set of orthologous groups as ViPhOGs (Virus and Phages Orthologous Groups). Results The viral diversity represented in public (NCBI) databases. We searched for all viral genomes stored in ei- ther RefSeq or Genbank databases (April 2015) and ob- tained 50,728 entries by using the selected set of queries (see methods). Depuration of the search results led to the ex- clusion of 6,617 entries, as some of the keywords in their descriptions indicated that they did not correspond to com- plete genomes. Entries kept following the depuration step were clustered based on sequence identity in order to col- lapse near identical viral sequences, which resulted in an overall 57% reduction that reflects a very high redundancy in Fig. 1. Genome length distribution. Violin plots show the genome length dis- the