Fast Whole-Genome Phylogeny by Compression: the COVID-19 Case


Rudi L. Cilibrasi, Centrum Wiskunde & Informatica
Paul M.B. Vitányi, Centrum Wiskunde & Informatica

Research Article
Keywords: Compression, Phylogeny, Covid-19 virus
Posted Date: August 30th, 2021
DOI: https://doi.org/10.21203/rs.3.rs-851224/v1
License: This work is licensed under a Creative Commons Attribution 4.0 International License.

Abstract: We analyze the whole-genome phylogeny and taxonomy of the SARS-CoV-2 virus, causing the COVID-19 disease, using compression in the form of the alignment-free NCD (Normalized Compression Distance) method to assess similarity. We compare the SARS-CoV-2 virus with a database of over 6,500 viruses. The results comprise that the SARS-CoV-2 virus is closest in that database to the RaTG13 virus and rather close to the bat SARS-like corona viruses bat-SL-CoVZXC21 and bat-SL-CoVZC45. Over 6,500 viruses are identified (given by their registration code) with larger NCD’s. The NCD’s are compared with the NCD’s between the mtDNA’s of familiar species. We treat the question whether Pangolins are involved in the SARS-CoV-2 virus. The NCD method or shortly the compression method is simpler and possibly faster than any other whole-genome method, which makes it the ideal tool to explore phylogeny. Here we use it for the complex case of determining this similarity between the COVID-19 virus SARS-CoV-2 and many other viruses. The resulting phylogeny and taxonomy closely matches earlier efforts by alignment-based methods and a machine-learning method, providing the most compelling evidence to date for the compression method showing that one can achieve equivalent results both simply and fast.

Index terms: Compression, Phylogeny, Covid-19 virus.

I. INTRODUCTION

In the 2019, 2020, and 2021 pandemic of the COVID-19 disease caused by the SARS-CoV-2 virus, studies of the phylogeny and taxonomy of this virus use essentially two methods: alignment-based analysis, for example [21], and alignment-free machine-learning [25].

In alignment-based methods one customarily takes two or more sequences over a bounded alphabet of letters and makes them identical by changing letters, introducing gaps and so on, where each of these operations carries a cost. The total cost is computed by adding these sub costs. The alignment with the least total cost is preferable. Problems are amongst others that similar regions on sequences being compared may be far apart causing massive alignment costs and that the computation cost of an alignment may be too high or forbidding. Alignment methods have many more drawbacks which make alignment-free methods more attractive [32], [30], [34].

Machine learning algorithms learn a concept using an input consisting of many examples of that concept. It is alignment-free but uses complicated algorithms with many parameters to set. This was used in the phylogeny analysis of the COVID-19 virus in [25] together with decision trees and other devices.

The purpose of this study is to identify the closest viruses to the SARS-CoV-2 virus isolate and to determine their structural relations (taxonomy) in phylogeny trees 1 and 2 and the Normalized Compression Distances (NCD’s) in Table I using compression. We compare this method with the other methods in terms of results and speed. The method is alignment-free, based on lossless compression of the whole-genome sequences of base pairs of the involved viruses, and is called the normalized compression distance (NCD) method or shortly the compression method since it computes the NCD’s between pairs of genomes.

We computed the NCD’s between 6,751 viruses and a selected SARS-CoV-2 virus but we can only visualize part of them. With this method we quantify the proximity relations between pairs of viruses by their NCD’s and compare them to similar relations between the mtDNA’s of familiar mammal species in order to gain an intuition as to what they mean.

Phylogeny analysis is widely used but there are challenges of, amongst others, accuracy and speed. The NCD as a measure of similarity seems a good idea to solve these challenges. The compression method determines the similarity of two sequences using a single formula once, and thus does not use many examples. We apply it here to the complex problem of SARS-CoV-2 phylogeny and verify whether the results using this method are comparable with those from the other used methods. The conclusion is that the NCD method is an alignment-free method that gives accurate phylogeny results, while being possibly faster than both alignment-based methods and previously used alignment-free methods.

Earlier studies pointed to the origin of the SARS-CoV-2 virus as being from Bats. It is thought to belong to lineage B (Sarbecovirus) of Betacoronavirus. From phylogenetic analysis and genome organization it was identified as a SARS-like coronavirus, and to have the highest similarity to the SARS bat coronavirus RaTG13 and being similar to the two bat SARS-like corona viruses bat-SL-CoVZXC21 and bat-SL-CoVZC45, see for example [21], [25].

We computed the NCD’s of 6,751+15,409=22,160 virus pairs plus the NCD’s used for Figures 1 and 2. Altogether, this took about 5–10 hours in a combined run on a home desktop computer (a mini-computer called Meerkat from a Linux computer company called System76). It uses less than 2 seconds per pair which comes to more than 2000 pairs per hour. The same program can be re-used for different phylogenies and questions. This is possibly the easiest and fastest method to establish whole-genome phylogeny.

Figures 1 and 2 were done with the PHYLIP (PHYLogeny Inference Package) [33]. The programming to obtain the NCD matrices underlying the phylogenetic trees in the figures is easy; essentially compute the NCD’s for all pairs of viruses to be compared. The comparisons between the NCD’s of viruses and the NCD’s of the mtDNA’s of species are made on the basis of the Table in Figure 9 of [5].

Author Affiliations

RC is a research associate with the Centre for Mathematics and Computer Science (CWI), Science Park 123, 1098 XG Amsterdam, The Netherlands. PV is with CWI and the University of Amsterdam, Fac. FNWI, Science Park 904, 1098 XH Amsterdam, The Netherlands. He is the corresponding author.

II. METHOD

A string is a sequence of finite length over a finite alphabet, here A,C,G,T. To assess the similarity of two strings we use a simple formula using the compressed versions of both strings and of the concatenation of them to compute the Normalized Compression Distance (NCD) between them. This distance is a quantity in between 0 (identical) and 1 (totally different) expressing similarity. The compression is by a lossless compressor (‘lossless’ means that the original can be reconstructed from the compressed version, that is, nothing of the compressed string gets lost). The distance was proposed in [16], [5] and especially the last reference where the name NCD was coined.

In [3], [11], [4] the compression method is validated and shown to work better for better compressors. We used zpaq as compression algorithm since [13] “bench marked 430 settings of 48 representative compressors (including 29 specialized sequence compressors and 19 general-purpose compressors) on 27 representative FASTA-formatted test data sets including individual genomes of various sizes, DNA and RNA data sets, and standard protein data sets.” Figure 1.A of the cited paper compares the compressors with respect to their compression ratio: zpaq is the best of the considered general-purpose compressors and competitive with the best considered specialized sequence compressors.

HOW WE USED THE METHOD

Using the compression method to compute the phylogeny of the [...]

[...] data cleaning to a set of unique 15,578 viral sequences with the known nucleotides A,C,G, and T. This was further reduced to 15,409 sequences. Each viral sequence is an RNA sequence of around 30,000 base pairs. The total size of all sequence data together is in the order of two gigabyte. The 6,751 unique sequences obtained from the authors of [25] were over the letters A,C,G, and T already.

SELECTION OF A SARS-COV-2 VIRUS ISOLATE AS STANDARD FOR THE NCD’S

As a representative of the SARS-CoV-2 viruses we selected the most common one in the data set as the basis for the NCD against the 6,751 non-SARS-CoV-2 viruses. It occurred in the sorted virus list 105 times. In Table I it appears at the top with an NCD of 0.00362117 and has the official name gisaid_hcov-19_2020_07_17_22.fasta<hCoV-19/USA/WI-WSLH-200082/2020|EPI_ISL_471246|2020-04-08. The NCD of the selected SARS-CoV-2 virus against itself should have been 0 but because of the unavoidable compression/computation inaccuracy it is slightly more.

The worst-case NCD across all the remaining sequences of the SARS-CoV-2 virus to the selected SARS-CoV-2 virus is 0.044986 and the average NCD is 0.009879.
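As an illustration of the Method section, the following is a minimal sketch of an NCD computation for pairs of sequences. It assumes the standard definition from the NCD literature, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the compressed length; the formula itself is not spelled out in the text above. The paper uses zpaq; this sketch swaps in LZMA only because it ships with Python's standard library, so the absolute distances will differ from the published ones. The function names and the synthetic sequences are hypothetical and are not the authors' program or data.

```python
import lzma
import random
from itertools import combinations


def compressed_size(data: bytes) -> int:
    """Size in bytes of the LZMA-compressed input (assumed stand-in for zpaq)."""
    return len(lzma.compress(data))


def ncd(x: bytes, y: bytes) -> float:
    """NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)  # compress the concatenation of both strings
    return (cxy - min(cx, cy)) / max(cx, cy)


def ncd_pairs(genomes: dict[str, bytes]) -> dict[tuple[str, str], float]:
    """NCD for every unordered pair of named sequences."""
    return {
        (a, b): ncd(genomes[a], genomes[b])
        for a, b in combinations(sorted(genomes), 2)
    }


def random_genome(length: int, rng: random.Random) -> bytes:
    """Hypothetical test data: a uniformly random string over A, C, G, T."""
    return bytes(rng.choice(b"ACGT") for _ in range(length))


def mutate(seq: bytes, rate: float, rng: random.Random) -> bytes:
    """Hypothetical test data: redraw each base with probability `rate`."""
    return bytes(rng.choice(b"ACGT") if rng.random() < rate else b for b in seq)


if __name__ == "__main__":
    rng = random.Random(0)
    base = random_genome(30_000, rng)
    genomes = {
        "base": base,
        "near": mutate(base, 0.01, rng),
        "far": random_genome(30_000, rng),
    }
    print("self:", ncd(base, base))
    for pair, distance in ncd_pairs(genomes).items():
        print(pair, round(distance, 4))
```

With a real compressor the distance of a sequence to itself is small but usually not exactly 0, which parallels the remark above that the selected isolate scores slightly more than 0 against itself. The pairwise dictionary is the kind of input one would reshape into a distance matrix for a tree-building package; the compressor's window must be large enough to span the concatenated pair (about 60,000 bases here), which is one reason a small-window compressor would be a poor substitute.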