Fast Whole-Genome Phylogeny by Compression: the COVID-19 Case

Rudi L. Cilibrasi, Centrum Wiskunde & Informatica
Paul M.B. Vitányi ( [email protected] ), Centrum Wiskunde & Informatica

Research Article
Keywords: Compression, Phylogeny, Covid-19 virus
Posted Date: August 30th, 2021
DOI: https://doi.org/10.21203/rs.3.rs-851224/v1
License: This work is licensed under a Creative Commons Attribution 4.0 International License.

Abstract: We analyze the whole-genome phylogeny and taxonomy of the SARS-CoV-2 virus, which causes the COVID-19 disease, using compression in the form of the alignment-free NCD (Normalized Compression Distance) method to assess similarity. We compare the SARS-CoV-2 virus with a database of over 6,500 viruses. The results show that the SARS-CoV-2 virus is closest in that database to the RaTG13 virus and rather close to the bat SARS-like coronaviruses bat-SL-CoVZXC21 and bat-SL-CoVZC45. Over 6,500 viruses are identified (given by their registration code) with larger NCDs. The NCDs are compared with the NCDs between the mtDNAs of familiar species. We address the question of whether pangolins are involved in the SARS-CoV-2 virus. The NCD method or, for short, the compression method is simpler and possibly faster than any other whole-genome method, which makes it the ideal tool to explore phylogeny. Here we use it for the complex case of determining this similarity between the COVID-19 virus SARS-CoV-2 and many other viruses. The resulting phylogeny and taxonomy closely match earlier efforts by alignment-based methods and a machine-learning method, providing the most compelling evidence to date for the compression method and showing that one can achieve equivalent results both simply and fast.

Index terms: Compression, Phylogeny, Covid-19 virus.

I. INTRODUCTION

In the 2019, 2020, and 2021 pandemic of the COVID-19 disease caused by the SARS-CoV-2 virus, studies of the phylogeny and taxonomy of this virus use essentially two methods: alignment-based analysis, for example [21], and alignment-free machine learning [25].

In alignment-based methods one customarily takes two or more sequences over a bounded alphabet of letters and makes them identical by changing letters, introducing gaps and so on, where each of these operations carries a cost. The total cost is computed by adding these sub-costs. The alignment with the least total cost is preferable. Problems include that similar regions on the sequences being compared may be far apart, causing massive alignment costs, and that the computation cost of an alignment may be too high or prohibitive. Alignment methods have many more drawbacks, which make alignment-free methods more attractive [32], [30], [34].

Machine-learning algorithms learn a concept from an input consisting of many examples of that concept. This approach is alignment-free but uses complicated algorithms with many parameters to set. It was used in the phylogeny analysis of the COVID-19 virus in [25] together with decision trees and other devices.

The purpose of this study is to identify the closest viruses to the SARS-CoV-2 virus isolate and to determine their structural relations (taxonomy) in phylogeny trees 1 and 2 and the Normalized Compression Distances (NCDs) in Table I using compression. We compare this method with the other methods in terms of results and speed. The method is alignment-free, based on lossless compression of the whole-genome sequences of base pairs of the involved viruses, and is called the normalized compression distance (NCD) method or, for short, the compression method, since it computes the NCDs between pairs of genomes.

We computed the NCDs between 6,751 viruses and a selected SARS-CoV-2 virus, but we can only visualize part of them. With this method we quantify the proximity relations between pairs of viruses by their NCDs and compare them to similar relations between the mtDNAs of familiar mammal species in order to gain an intuition as to what they mean.

Phylogeny analysis is widely used, but there are challenges of, among others, accuracy and speed. The NCD as a measure of similarity seems a good idea to solve these challenges. The compression method determines the similarity of two sequences using a single formula once, and thus does not use many examples. We apply it here to the complex problem of SARS-CoV-2 phylogeny and verify whether the results using this method are comparable with those from the other methods used. The conclusion is that the NCD method is an alignment-free method that gives accurate phylogeny results, while being possibly faster than both alignment-based methods and previously used alignment-free methods.

Earlier studies pointed to the origin of the SARS-CoV-2 virus as being from bats. It is thought to belong to lineage B (Sarbecovirus) of Betacoronavirus. From phylogenetic analysis and genome organization it was identified as a SARS-like coronavirus with the highest similarity to the SARS bat coronavirus RaTG13, and as similar to the two bat SARS-like coronaviruses bat-SL-CoVZXC21 and bat-SL-CoVZC45; see for example [21], [25].

We computed the NCDs of 6,751+15,409=22,160 virus pairs plus the NCDs used for Figures 1 and 2. Altogether, this took about 5 to 10 hours in a combined run on a home desktop computer (a mini-computer called Meerkat from a Linux computer company called System76). It uses less than 2 seconds per pair, which comes to more than 2000 pairs per hour. The same program can be re-used for different phylogenies and questions. This is possibly the easiest and fastest method to establish whole-genome phylogeny.

Figures 1 and 2 were done with PHYLIP (PHYLogeny Inference Package) [33]. The programming to obtain the NCD matrices underlying the phylogenetic trees in the figures is easy: essentially, compute the NCDs for all pairs of viruses to be compared. The comparisons between the NCDs of viruses and the NCDs of the mtDNAs of species are made on the basis of the Table in Figure 9 of [5].

Author Affiliations

RC is a research associate with the Centre for Mathematics and Computer Science (CWI), Science Park 123, 1098 XG Amsterdam, The Netherlands. PV is with CWI and the University of Amsterdam, Fac. FNWI, Science Park 904, 1098 XH Amsterdam, The Netherlands. He is the corresponding author; address: CWI, Science Park 123, 1098 XG Amsterdam, The Netherlands.

II. METHOD

A string is a sequence of finite length over a finite alphabet, here A,C,G,T. To assess the similarity of two strings we use a simple formula using the compressed versions of both strings and of their concatenation to compute the Normalized Compression Distance (NCD) between them. This distance is a quantity between 0 (identical) and 1 (totally different) expressing similarity. The compression is by a lossless compressor ('lossless' means that the original can be reconstructed from the compressed version, that is, nothing of the original string gets lost). The distance was proposed in [16] and [5]; the name NCD was coined in the latter.

In [3], [11], [4] the compression method is validated and shown to work better for better compressors. We used zpaq as compression algorithm since [13] "bench marked 430 settings of 48 representative compressors (including 29 specialized sequence compressors and 19 general-purpose compressors) on 27 representative FASTA-formatted test data sets including individual genomes of various sizes, DNA and RNA data sets, and standard protein data sets." Figure 1.A of the cited paper compares the compressors with respect to their compression ratio: zpaq is the best of the considered general-purpose compressors and competitive with the best considered specialized sequence compressors.

HOW WE USED THE METHOD

Using the compression method to compute the phylogeny of the [...]

[...] data cleaning to a set of 15,578 unique viral sequences with the known nucleotides A,C,G, and T. This was further reduced to 15,409 sequences. Each viral sequence is an RNA sequence of around 30,000 base pairs. The total size of all sequence data together is on the order of two gigabytes. The 6,751 unique sequences obtained from the authors of [25] were over the letters A,C,G, and T already.

SELECTION OF A SARS-COV-2 VIRUS ISOLATE AS STANDARD FOR THE NCD'S

As a representative of the SARS-CoV-2 viruses we selected the most common one in the data set as the basis for the NCD against the 6,751 non-SARS-CoV-2 viruses. It occurred in the sorted virus list 105 times. In Table I it appears at the top with an NCD of 0.00362117 and has the official name gisaid_hcov-19_2020_07_17_22.fasta<hCoV-19/USA/WI-WSLH-200082/2020|EPI_ISL_471246|2020-04-08. The NCD of the selected SARS-CoV-2 virus against itself should have been 0, but because of the unavoidable compression/computation inaccuracy it is slightly more.

The worst-case NCD across all the remaining sequences of the SARS-CoV-2 virus to the selected SARS-CoV-2 virus is 0.044986 and the average NCD is 0.009879.
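The single-formula computation described under METHOD, and the observation that a sequence's distance to itself is small but not exactly 0, can be illustrated with a minimal sketch. It assumes the standard NCD definition from the compression-distance literature, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the compressed length. The study itself used zpaq; the sketch below swaps in LZMA from Python's standard library purely for convenience, and it runs on synthetic random sequences rather than real genomes. All function names are hypothetical and chosen for illustration.

```python
import lzma
import random

ALPHABET = "ACGT"


def compressed_size(seq: str) -> int:
    """Byte length of the losslessly compressed sequence.

    Assumption: LZMA stands in for zpaq, the compressor used in the study.
    """
    return len(lzma.compress(seq.encode("ascii")))


def ncd(x: str, y: str) -> float:
    """Normalized Compression Distance between two sequences."""
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)  # compress the concatenation of both
    return (cxy - min(cx, cy)) / max(cx, cy)


def random_genome(length: int, rng: random.Random) -> str:
    """Synthetic sequence of uniformly random bases (not real data)."""
    return "".join(rng.choice(ALPHABET) for _ in range(length))


def mutate(seq: str, rate: float, rng: random.Random) -> str:
    """Substitute each base with a different one with probability `rate`."""
    out = []
    for base in seq:
        if rng.random() < rate:
            out.append(rng.choice([b for b in ALPHABET if b != base]))
        else:
            out.append(base)
    return "".join(out)


rng = random.Random(0)  # fixed seed so the run is repeatable
reference = random_genome(30_000, rng)   # about the length of one viral genome
variant = mutate(reference, 0.01, rng)   # close relative: 1% substitutions
unrelated = random_genome(30_000, rng)   # independent sequence

d_self = ncd(reference, reference)
d_variant = ncd(reference, variant)
d_unrelated = ncd(reference, unrelated)

print(f"NCD(reference, reference) = {d_self:.4f}")
print(f"NCD(reference, variant)   = {d_variant:.4f}")
print(f"NCD(reference, unrelated) = {d_unrelated:.4f}")
```

In this sketch the self-distance comes out slightly above 0 because the compressor still spends a few bytes encoding the repeated copy, the mutated copy lands a little further away, and the unrelated sequence lands near 1. A distance matrix for a tree-building package would be obtained by calling ncd on every pair of sequences; the actual values depend on the compressor, so they are not comparable with the zpaq-based numbers reported in the text.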
