Phylogenetic analysis of Chordopoxvirinae with alignment-free approach

Tatiana S. Nepomnyashchikh, D.V. Antonets, T.V. Tregubchak, A.N. Shvalov, E.V. Gavrilova, S.N. Shchelkunov, State Research Center of Virology and Biotechnology “Vector”, Novosibirsk, Koltsovo, Russia, [email protected]

Rinat A. Maksyutov [email protected]

Poxviruses are a family of large DNA viruses that replicate in the cytoplasm. The family is subdivided into the Entomopoxvirinae and Chordopoxvirinae subfamilies, which infect insects and chordates, respectively. The Chordopoxvirinae subfamily includes eight genera, among them Molluscipoxvirus. The most notorious poxvirus is variola virus (VARV), the causative agent of smallpox, which was completely eradicated by a global vaccination program [1]. Chordopoxvirus genomes contain a highly conserved central region (~100 kbp) that encodes structural proteins and proteins essential for RNA and DNA synthesis, protein processing and virion assembly. This region is flanked by highly variable terminal regions that carry genes for host range proteins, virulence factors and immunomodulators, which are non-essential for virus growth in vitro. Despite this shared organization, genome length varies from 139 kbp in Orf virus to 289 kbp in Fowlpox virus (FPV), and GC content ranges from 25.3% in capripoxviruses up to 65% [1].
In recent years there has been growing interest in alignment-free approaches that use k-mers as genomic features. Their primary advantage is that they enable quick and efficient genome-scale comparisons at low computational cost. Alignment-free methods can be used to compare sequences from highly divergent organisms, incomplete draft genomes, genomes of different lengths, etc. [2,3]. Sims et al. demonstrated that feature frequency profile (FFP) analysis applied to different eukaryotic and prokaryotic sequences was highly consistent with the currently accepted taxonomy [4]. For prokaryotes, alignment-free trees constructed from whole-genome sequences capture taxonomic classification even better than 16S rRNA alignment-based trees [5]. Recently, k-mer-based methods were successfully used to build a cladogram of 3905 RefSeq viral genomes [2], to analyze the phylogeny of Ebola viruses and human papillomaviruses [6], and to reconstruct influenza A virus phylogeny from neuraminidase amino acid sequences [7].
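To make the idea of k-mers as genomic features concrete, the following minimal Python sketch computes a k-mer frequency profile for a nucleotide sequence and compares two profiles. The function names `kmer_profile` and `js_divergence` are hypothetical, and the use of Jensen-Shannon divergence as the distance is an assumption made for illustration; the abstract does not state which distance measure was applied.

```python
from collections import Counter
from math import log2


def kmer_profile(seq, k):
    """Return relative frequencies of the k-mers in seq.

    Windows containing anything other than A, C, G or T are skipped
    (an assumption about how ambiguous bases are handled).
    """
    seq = seq.upper()
    valid = set("ACGT")
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= valid
    )
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two sparse profiles.

    The value is 0 for identical profiles and 1 for profiles that
    share no k-mers.
    """
    div = 0.0
    for kmer in set(p) | set(q):
        pi, qi = p.get(kmer, 0.0), q.get(kmer, 0.0)
        mi = (pi + qi) / 2  # frequency in the mixture distribution
        if pi:
            div += 0.5 * pi * log2(pi / mi)
        if qi:
            div += 0.5 * qi * log2(qi / mi)
    return div
```

Because each profile is normalized by the number of counted k-mers, two genomes of different lengths can be compared directly, which is the property that makes such profiles attractive for genomes with large length variation.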
The large size of poxvirus genomes, their profound length variation and their complex organization, with repetitive regions and recombination sites, make it difficult to produce accurate whole-genome alignments, especially for distantly related poxvirus species; we therefore decided to use alignment-free approaches. We first concentrated on a single group of closely related viruses: 78 variola virus genomes were analyzed, including three ancient VARV strains recently extracted and sequenced by German scientists. The VARV genomes were initially analyzed with a conventional alignment-based approach. Genome sequences were aligned with either MAFFT [8] or Mauve [9]. The resulting alignments were of comparable quality, although the one produced with Mauve showed lower positional Shannon entropy values and fewer variable sites. Nevertheless, the neighbor-joining trees built from the two alignments were almost identical. The analyzed VARV strains grouped according to geographic region and time of isolation, and two main groups were distinguished. The larger one contained African strains, which in turn formed a separate cluster, and strains BSH-74, BSH-75, IND-53 and NEP-73. This group also included strains IND-64, KUW-67, AFG-70, SYR-72, IRN-72, YUSL-72 and PKN-69, which formed a separate clade; the Japanese strains JPN-46 and JPN-51 and the Indian strain IND-53 New Delhi formed another separate cluster. The second group comprised strains GAR-66, BRZ-66, UK-52 Butler, SL-69, GUI-69, BEN-69 and NIG-69; the African strains of this group likewise formed a separate cluster. Strain V563 (LT706528), isolated from remains about 60 years old, fell into the first, most populated group. Strain V1588 (LT706529), isolated from remains about 160 years old, turned out to be evolutionarily closer to the second group.
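The two alignment quality indicators mentioned above, positional Shannon entropy and the number of variable sites, can be illustrated with a short Python sketch. This is only one plausible way to compute them: the helper names `column_entropy` and `alignment_summary` are hypothetical, and treating the gap character as an ordinary state is an assumption, since the abstract does not describe how gaps were handled.

```python
from collections import Counter
from math import log2


def column_entropy(column):
    """Shannon entropy (bits) of one alignment column.

    Gaps are counted as an ordinary character state (an assumption).
    """
    counts = Counter(column)
    n = len(column)
    return sum((c / n) * log2(n / c) for c in counts.values())


def alignment_summary(aligned):
    """Return (mean positional entropy, number of variable sites)."""
    lengths = {len(s) for s in aligned}
    if len(lengths) != 1 or 0 in lengths:
        raise ValueError("need non-empty aligned sequences of equal length")
    # Transpose the alignment so that each tuple is one column.
    columns = list(zip(*(s.upper() for s in aligned)))
    entropies = [column_entropy(col) for col in columns]
    variable = sum(1 for col in columns if len(set(col)) > 1)
    return sum(entropies) / len(entropies), variable
```

A fully conserved column has entropy 0, so an alignment with lower mean entropy and fewer variable sites places more identical residues in the same columns.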
The most ancient isolate, VD21 (BK010317), extracted from an almost 400-year-old mummy, stood apart from all “modern” strains, although it was more closely related to the second group. Trees built from alignments of only the central conserved regions of VARV genomes were almost identical to those based on whole-genome alignments. We then built neighbor-joining trees using k-mer frequency profiles as genomic features. With k ranging from 6 to 9, the resulting trees were highly concordant with those built with the alignment-based approach, and even tetranucleotide composition analysis yielded cladograms showing the same major groups observed in the alignment-based trees. A preliminary analysis of 240 chordopoxvirus genomes also showed good agreement with taxonomy, even with k set to 6. Thus, the k-mer-based approach can be used to study evolutionary relationships within the Poxviridae family. Next, we will extend our analysis to include all poxviral genomes published to date.
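For readers unfamiliar with the tree-building step, the sketch below shows a textbook neighbor-joining procedure applied to a precomputed distance matrix, such as one derived from pairwise comparison of k-mer frequency profiles. It is an illustrative implementation only, not the software used in this work; the function name `neighbor_joining` and the input layout (a list of labels plus a symmetric list-of-lists matrix) are assumptions.

```python
def neighbor_joining(names, dist):
    """Build a neighbor-joining tree and return it as a Newick string.

    names: list of taxon labels.
    dist:  symmetric distance matrix (list of lists) in the same order.
    """
    if len(names) < 2:
        raise ValueError("need at least two taxa")
    nodes = list(names)
    # Dict-of-dicts keyed by node label, so merged nodes can be added.
    d = {a: {b: dist[i][j] for j, b in enumerate(names)}
         for i, a in enumerate(names)}
    while len(nodes) > 2:
        n = len(nodes)
        # Net divergence of every node from all other current nodes.
        r = {a: sum(d[a][b] for b in nodes) for a in nodes}
        # Join the pair minimising Q(a, b) = (n - 2) d(a, b) - r(a) - r(b).
        a, b = min(
            ((x, y) for i, x in enumerate(nodes) for y in nodes[i + 1:]),
            key=lambda p: (n - 2) * d[p[0]][p[1]] - r[p[0]] - r[p[1]],
        )
        # Branch lengths from the new internal node to a and to b.
        la = 0.5 * d[a][b] + (r[a] - r[b]) / (2 * (n - 2))
        lb = d[a][b] - la
        new = f"({a}:{la:.4f},{b}:{lb:.4f})"
        # Distances from the new node to every remaining node.
        d[new] = {new: 0.0}
        for c in nodes:
            if c not in (a, b):
                d[new][c] = d[c][new] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        nodes = [c for c in nodes if c not in (a, b)] + [new]
    a, b = nodes
    # NJ trees are unrooted; the last edge is split evenly only to
    # obtain a valid Newick string.
    half = d[a][b] / 2
    return f"({a}:{half:.4f},{b}:{half:.4f});"
```

The returned Newick string can be loaded into any standard tree viewer. For real analyses of dozens or hundreds of genomes, an established phylogenetics package would normally be preferred over a hand-written routine like this one.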

1. C. Gubser et al. (2004) Poxvirus genomes: a phylogenetic analysis, J. Gen. Virol., 85:105–117.

2. Q. Zhang et al. (2017) Viral Phylogenomics Using an Alignment-Free Method: A Three-Step Approach to Determine Optimal Length of k-mer, Sci. Rep., 7:40712.

3. G.-B. Han, D.-H. Cho (2018) Genome classification improvements based on k-mer intervals in sequences, Genomics, S0888-7543(18)30447-6.

4. G.E. Sims et al. (2009) Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions, Proc. Natl. Acad. Sci., 106:2677–2682.

5. O. Bonham-Carter et al. (2014) Alignment-free genetic sequence comparisons: a review of recent approaches by word analysis, Brief. Bioinform., 15:890–905.

6. F. Utro et al. (2019) A Quantitative and Qualitative Characterization of k-mer Based Alignment-Free Phylogeny Construction. In: Bartoletti M. et al. (eds) Computational Intelligence Methods for Bioinformatics and Biostatistics. CIBB 2017. Lecture Notes in Computer Science, Springer, 10834:19–31.

7. Y. Zhang et al. (2018) Phylogenetic analysis of protein sequences based on a novel k-mer natural vector method, Genomics, in press (DOI: 10.1016/j.ygeno.2018.08.010).

8. K. Katoh, D.M. Standley (2013) MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability, Mol. Biol. Evol., 30:772–780.

9. A.C.E. Darling et al. (2004) Mauve: Multiple Alignment of Conserved Genomic Sequence With Rearrangements, Genome Res., 14:1394–1403.