Fast Whole-Genome Phylogeny by Compression: the COVID-19 case

Rudi L. Cilibrasi, Centrum Wiskunde & Informatica
Paul M.B. Vitányi (corresponding author), Centrum Wiskunde & Informatica

Research Article

Keywords: Compression, Phylogeny, Covid-19

Posted Date: August 30th, 2021

DOI: https://doi.org/10.21203/rs.3.rs-851224/v1

License: This work is licensed under a Creative Commons Attribution 4.0 International License.


Abstract: We analyze the whole-genome phylogeny and taxonomy of the SARS-CoV-2 virus, causing the COVID-19 disease, using compression in the form of the alignment-free NCD (Normalized Compression Distance) method to assess similarity. We compare the SARS-CoV-2 virus with a database of over 6,500 viruses. The results show that the SARS-CoV-2 virus is closest in that database to the RaTG13 virus and rather close to the bat SARS-like corona viruses bat-SL-CoVZXC21 and bat-SL-CoVZC45. Over 6,500 viruses are identified (given by their registration code) with larger NCD's. The NCD's are compared with the NCD's between the mtDNA's of familiar species. We treat the question whether Pangolins are involved in the SARS-CoV-2 virus. The NCD method or, in short, the compression method is simpler and possibly faster than any other whole-genome method, which makes it the ideal tool to explore phylogeny. Here we use it for the complex case of determining this similarity between the COVID-19 virus SARS-CoV-2 and many other viruses. The resulting phylogeny and taxonomy closely match earlier efforts by alignment-based methods and a machine-learning method, providing the most compelling evidence to date for the compression method and showing that one can achieve equivalent results both simply and fast.

Index terms: Compression, Phylogeny, Covid-19 virus.

I. INTRODUCTION

In the 2019, 2020, and 2021 pandemic of the COVID-19 disease caused by the SARS-CoV-2 virus, studies of the phylogeny and taxonomy of this virus use essentially two methods: alignment-based analysis, for example [21], and alignment-free machine-learning [25].

In alignment-based methods one customarily takes two or more sequences over a bounded alphabet of letters and makes them identical by changing letters, introducing gaps and so on, where each of these operations carries a cost. The total cost is computed by adding these sub costs. The alignment with the least total cost is preferable. Problems are, amongst others, that similar regions on the sequences being compared may be far apart, causing massive alignment costs, and that the computation cost of an alignment may be too high or forbidding. Alignment methods have many more drawbacks which make alignment-free methods more attractive [32], [30], [34].

Machine learning algorithms learn a concept using an input consisting of many examples of that concept. Machine learning is alignment-free but uses complicated algorithms with many parameters to set. This was used in the phylogeny analysis of the COVID-19 virus in [25] together with decision trees and other devices.

The purpose of this study is to identify the closest viruses to the SARS-CoV-2 virus isolate and to determine their structural relations (taxonomy) in the phylogeny trees of Figures 1 and 2 and the Normalized Compression Distances (NCD's) in Table I using compression. We compare this method with the other methods in terms of results and speed. The method is alignment-free, based on lossless compression of the whole-genome sequences of base pairs of the involved viruses, and is called the normalized compression distance (NCD) method or, in short, the compression method since it computes the NCD's between pairs of genomes.

We computed the NCD's between 6,751 viruses and a selected SARS-CoV-2 virus but we can only visualize part of them. With this method we quantify the proximity relations between pairs of viruses by their NCD's and compare them to similar relations between the mtDNA's of familiar mammal species in order to gain an intuition as to what they mean.

Phylogeny analysis is widely used but there are challenges of, amongst others, accuracy and speed. The NCD as a measure of similarity seems a good idea to solve these challenges. The compression method determines the similarity of two sequences using a single formula once, and thus does not use many examples. We apply it here to the complex problem of SARS-CoV-2 phylogeny and verify whether the results using this method are comparable with those from the other methods used. The conclusion is that the NCD method is an alignment-free method that gives accurate phylogeny results, while being possibly faster than both alignment-based methods and previously used alignment-free methods.

Earlier studies pointed to the origin of the SARS-CoV-2 virus as being from Bats. It is thought to belong to lineage B (Sarbecovirus) of Betacoronavirus. From phylogenetic analysis and genome organization it was identified as a SARS-like coronavirus, and to have the highest similarity to the SARS bat coronavirus RaTG13 and being similar to the two bat SARS-like corona viruses bat-SL-CoVZXC21 and bat-SL-CoVZC45, see for example [21], [25].

We computed the NCD's of 6,751+15,409=22,160 virus pairs plus the NCD's used for Figures 1 and 2. Altogether, this took about 5–10 hours in a combined run on a home desktop computer (a mini-computer called Meerkat from a Linux computer company called System76). It uses less than 2 seconds per pair which comes to more than 2000 pairs per hour. The same program can be re-used for different phylogenies and questions. This is possibly the easiest and fastest method to establish whole-genome phylogeny.

Figures 1 and 2 were done with PHYLIP (PHYLogeny Inference Package) [33]. The programming to obtain the NCD matrices underlying the phylogenetic trees in the figures is easy; essentially compute the NCD's for all pairs of viruses to be compared. The comparisons between the NCD's of viruses and the NCD's of the mtDNA's of species are made on the basis of the Table in Figure 9 of [5].

Author Affiliations

RC is a research associate with the Centre for Mathematics and Computer Science (CWI), Science Park 123, 1098 XG Amsterdam, The Netherlands. PV is with CWI and the University of Amsterdam, Fac. FNWI, Science Park 904, 1098 XH Amsterdam, The Netherlands. PV is the corresponding author.

II. METHOD

A string is a sequence of finite length over a finite alphabet, here A,C,G,T. To assess the similarity of two strings we use a simple formula using the compressed versions of both strings and of the concatenation of them to compute the Normalized Compression Distance (NCD) between them. This distance is a quantity in between 0 (identical) and 1 (totally different) expressing similarity. The compression is by a lossless compressor ('lossless' means that the original can be reconstructed from the compressed version, that is, nothing of the compressed string gets lost). The distance was proposed in [16], [5], and especially the last reference where the name NCD was coined.

In [3], [11], [4] the compression method is validated and shown to work better for better compressors. We used zpaq as compression algorithm since [13] "bench marked 430 settings of 48 representative compressors (including 29 specialized sequence compressors and 19 general-purpose compressors) on 27 representative FASTA-formatted test data sets including individual genomes of various sizes, DNA and RNA data sets, and standard data sets." Figure 1.A of the cited paper compares the compressors with respect to their compression ratio: zpaq is the best of the considered general-purpose compressors and competitive with the best considered specialized sequence compressors.

HOW WE USED THE METHOD

Using the compression method to compute the phylogeny of the SARS-CoV-2 virus, we first obtained a list of over 6,500 viruses we wanted to compare the SARS-CoV-2 virus to, without duplicates, partially sequenced viruses, and SARS-CoV-2 viruses (except for a single one, as it turned out). A particular SARS-CoV-2 virus isolate served as standard against which to determine the NCD's of the other viruses. We selected this SARS-CoV-2 virus isolate from the many, at least 15,500, examples available on 17 July 2020 in the GISAID database. We computed for each virus in the list of over 6,500 viruses its NCD with the selected SARS-CoV-2 virus isolate. As compression program we used zpaq. Subsequently we ordered the resulting NCD's from the smallest to the largest. The virus with the smallest NCD distance to the SARS-CoV-2 virus isolate was regarded as the most similar to that virus. We took the 60 viruses which had the least NCD's with the selected SARS-CoV-2 virus isolate and computed the phylogeny of those viruses and the SARS-CoV-2 virus isolate. Next we compared the last mentioned virus with 37 close viruses including Pangolin ones, to determine the relation of the SARS-CoV-2 virus with the Pangolin viruses.

III. MATERIALS

We downloaded the data in two parts and stored them on the public repository GitHub. The two data sets were as follows. One data set was obtained from the authors of the machine-learning-approach study [25] and consisted of the 6,751 virus sequences used in that paper. There was one SARS-CoV-2 virus among them.

The second data set of 66,899 SARS-CoV-2 virus sequences was downloaded from the hCoV19 directory of GISAID [10] on July 17th, 2020. After data cleaning all 21 virus sequences with /2017, /2018, or /2019 in the name, presumably wrongly classified, were not included. This last set contains three Pangolin viruses with NCD's against the selected SARS-CoV-2 virus of respectively 0.738175, 0.874027, and 0.873367. All of the GISAID data with /2020 in the directory are SARS-CoV-2 viruses. Because one of them had the registration code "hcov-19," identical with the name of the directory itself, we reduced the data set by removing that sequence: gisaid_hcov-19_2020_07_17_22.fasta.fai hCoV-19 29868 1713295574 80 82. Details of the data cleaning are deferred to the Supplemental Material.

We retained altogether 66,897+6,751=73,648 raw sequences that were considered in our effort. The GISAID download reduced after data cleaning to a set of 15,578 unique viral sequences with the known nucleotides A,C,G, and T. This was further reduced to 15,409 sequences. Each viral sequence is an RNA sequence of around 30,000 base pairs. The total size of all sequence data together is in the order of two gigabytes. The 6,751 unique sequences obtained from the authors of [25] were over the letters A,C,G, and T already.

SELECTION OF A SARS-COV-2 VIRUS ISOLATE AS STANDARD FOR THE NCD'S

As a representative of the SARS-CoV-2 viruses we selected the most common one in the data set as the basis for the NCD against the 6,751 non-SARS-CoV-2 viruses. It occurred in the sorted virus list 105 times. In Table I it appears at the top with an NCD of 0.00362117 and has the official name gisaid_hcov-19_2020_07_17_22.fasta<hCoV-19/USA/WI-WSLH-200082/2020|EPI_ISL_471246|2020-04-08. The NCD of the selected SARS-CoV-2 virus against itself should have been 0 but because of the unavoidable compression/computation inaccuracy it is slightly more.

The worst-case NCD across all the remaining sequences of the SARS-CoV-2 virus to the selected SARS-CoV-2 virus is 0.044986 and the average NCD is 0.009879. The worst-case sequence [22] can be found at gisaid_hcov-19_2020_07_17_22.fasta<hCoV-19/USA/OH-OSUP0019/2020|EPI_ISL_427291|2020-03-31 with registration code EPI_ISL_427291. This shows that the 15,409 SARS-CoV-2 viruses retrieved from GISAID on July 17, 2020 with /2020 in the name all have sequences that hardly differ from one another.

The results presented in this paper are so much larger than this worst-case NCD that they do not change under subtracting/adding the worst-case NCD to the selected SARS-CoV-2 virus amongst all SARS-CoV-2 viruses. Since the NCD is a metric and thus satisfies the triangle property [5, Theorem 6.2], the main results will not be unduly upset by a different choice of selected SARS-CoV-2 virus since the distances involved vary only by at most ±0.044986. Let us illustrate this. Let x, y, z be objects in a metric space and let us denote the distance between x and y by d(x, y), the distance between x and z by d(x, z), and that between y and z by d(y, z). Then |d(x, z) − d(y, z)| ≤ d(x, y). If y is the selected SARS-CoV-2 virus, x is an arbitrary SARS-CoV-2 virus, and z is one of the non-SARS-CoV-2 viruses, then d(x, y) ≤ 0.044986, so that d(x, z) and d(y, z) differ by at most 0.044986. Thus, the results presented are unaffected by the choice of selected SARS-CoV-2 virus.

IV. RESULTS

SORTED NCD'S

In Table I the 60 least NCD's are displayed of all NCD's between the selected SARS-CoV-2 virus and the 6,751 non-SARS-CoV-2 viruses sorted from small to large. The selected SARS-CoV-2 virus appears on the top of the list with NCD toward itself of 0.00362117. It should be 0 as explained before and is due to a small error margin in the computation.

NCD          virus name
0.00362117   selected SARS CoV 2 EPI_ISL_471246
0.0111034    MN908947.3 alt. SARS CoV 2
0.44486      BetaCoV/bat/Yunnan/RaTG13/2013|EPI_ISL_402131 EPI_ISL_402131
0.788416     MG772933.1 bat SL CoVZC45
0.791082     MG772934.1 bat SL CoVZXC21
0.917493     KF569996 Coronaviridae 785
0.917563     KC881006 Coronaviridae 783
0.91801      KC881005 Coronaviridae 782
0.918257     FJ882963 Coronaviridae 726
0.918381     AY278554 Riboviria 2953
0.918447     EU371561 Riboviria 3205
0.918497     AY278488 Riboviria 2951
0.918531     AY278741 Riboviria 2954
0.918553     AY278491 Riboviria 2952
0.918565     NC_004718 Coronaviridae 806
0.918597     EU371563 Riboviria 3207
0.918605     FJ882935 Coronaviridae 722
0.918658     AY357075 Riboviria 2979
0.918669     EU371559 Riboviria 3203
0.918691     DQ640652 Riboviria 3098
0.918724     EU371562 Riboviria 3206
0.918796     AY350750 Riboviria 2977
0.918829     AY864805 Riboviria 3030
0.918945     FJ882945 Coronaviridae 724
0.919072     EU371560 Riboviria 3204
0.919117     AY864806 Riboviria 3031
0.919182     EU371564 Riboviria 3208
0.919221     AY394850 Riboviria 2981
0.919244     FJ882942 Coronaviridae 723
0.919486     AY357076 Riboviria 2980
0.91954      FJ882954 Coronaviridae 725
0.91993      KF367457 Coronaviridae 784
0.920486     AY515512 Riboviria 2987
0.921053     JX993988 Coronaviridae 779
0.923045     GQ153543 Coronaviridae 751
0.923151     GQ153542 Coronaviridae 750
0.92541      DQ648857 Riboviria 3101
0.925706     GQ153547 Coronaviridae 755
0.925802     JX993987 Coronaviridae 778
0.925844     GQ153544 Coronaviridae 752
0.925951     GQ153548 Coronaviridae 756
0.925982     GQ153545 Coronaviridae 753
0.926013     GQ153540 Coronaviridae 748
0.92613      GQ153546 Coronaviridae 754
0.92615      GQ153539 Coronaviridae 747
0.92615      GQ153541 Coronaviridae 749
0.926681     DQ412043 Riboviria 3074
0.931577     DQ412042 Riboviria 3073
0.932533     DQ648856 Riboviria 3100
0.952228     NC_014470 Coronaviridae 823
0.994546     NC_025217 Coronaviridae 835
0.994897     NC_034440 Coronaviridae 847
0.994986     FJ938057 Coronaviridae 734
0.994986     AY646283 Riboviria 3003
0.995078     EF065512 Riboviria 3126
0.995078     EF065511 Riboviria 3125
0.995078     EF065510 Riboviria 3124
0.995086     EF065506 Riboviria 3120
0.995086     EF065507 Riboviria 3121
0.995086     EF065505 Riboviria 3119

TABLE I. FIRST 60 ITEMS IN THE SORTED LIST OF USED 6,751 VIRUS SEQUENCES.

The next virus is the only SARS-CoV-2 virus in the database of 6,751 virus sequences, but not the selected one, with NCD=0.0111034. This gives confirmation that the NCD's in the list are accurately calculated since the number is so very small. The code is MN908947.3. It is isolate Wuhan-Hu-1, complete genome GenBank: MN908947.3 of 29903 bp ss-RNA linear VRL 18-MAR-2020 of the family Viruses; Riboviria; Orthornavirae; Pisuviricota; Pisoniviricetes; Nidovirales; Cornidovirineae; Coronaviridae; Orthocoronavirinae; Betacoronavirus; Sarbecovirus.

The following virus has an NCD of 0.444846 with the selected SARS-CoV-2 virus and is the closest apart from the above SARS-CoV-2 virus. It is Sarbecovirus/EPI_ISL_402131.fasta/<BetaCoV/bat/Yunnan/RaTG13/2013|EPI_ISL_402131. The first part is the classification of the subfamily of viruses, the code EPI_ISL_402131 is that of the virus itself which one can use with Google to obtain further information. The virus isolate is sampled from the Rhinolophus affinis, a medium-size Asian bat of the Yunnan Province, PRC China, in 2013. It is 29855 bp RNA linear VRL 24-MAR-2020, and its final registration code is MN996532. The human coronavirus genome shares at least 96.2% of its identity with its bat relative, while its similarity rate with the human strain of the SARS virus is much lower, only 80.3% [7]. The NCD distance between the selected SARS-CoV-2 virus and this virus is about the same as that between the mtDNA's of the Chimpanzee and the Pygmy Chimpanzee according to the Table in Figure 9 of [5]. It is known as Bat coronavirus RaTG13 of the family Viruses; Riboviria; Orthornavirae; Pisuviricota; Pisoniviricetes; Nidovirales; Cornidovirineae; Coronaviridae; Orthocoronavirinae; Betacoronavirus; Sarbecovirus.

The next three viruses have respectively NCD=0.788416 for Coronaviridae/CoVZC45.fasta/<MG772933.1, NCD=0.791082 for Coronaviridae/CoVZXC21.fasta/<MG772934.1, and at a larger distance NCD=0.917493 for Coronaviridae/Coronaviridae 783.fasta/<KC881006. Here the part "fasta/<KC881006" means that the registration code is KC881006.

The first of these three viruses is the Bat SARS-like coronavirus isolate bat-SL-CoVZC45, complete genome at 29802 bp RNA linear VRL 05-FEB-2020 of the previous family of viruses. Its NCD with the selected SARS-CoV-2 virus is slightly higher than the mtDNA NCD between the Human and the Gorilla at 0.737 and slightly lower than the mtDNA NCD between the Human and the Orangutan at 0.834.

The second of these three viruses is the Bat SARS-like coronavirus isolate bat-SL-CoVZXC21, complete genome at 29732 bp RNA linear VRL 05-FEB-2020 of the same family. The same comparison as before of the NCD between this virus and the selected SARS-CoV-2 virus with the NCD's of mtDNA's between species holds also for this second virus.

The third virus of the three is the Bat SARS-like coronavirus Rs3367, complete genome at 29792 bp again of the same family. The NCD between this virus and the selected SARS-CoV-2 virus is slightly lower than the NCD between the Human mtDNA and the Blue Whale mtDNA at 0.920 and slightly higher than the NCD between the mtDNA of the Finback Whale and the mtDNA of the Brown Bear at 0.915.

In the sense of the NCD therefore the SARS-CoV-2 virus is likely from the family Viruses; Riboviria; Orthornavirae; Pisuviricota; Pisoniviricetes; Nidovirales; Cornidovirineae; Coronaviridae; Orthocoronavirinae; Betacoronavirus; Sarbecovirus. Figure 1 shows the phylogeny directed binary tree built from the NCD matrix in the Supplemental Material of the 60 virus sequences in Table I.

Fig. 1. These 60 viruses have the least NCD with the SARS-CoV-2 virus. The figure therefore gives a visual representation of the structural relations close to the SARS-CoV-2 virus. We use human mtDNA as outgroup; it is about the same size as the viruses and it can be assumed to be completely different. Using an unrelated sequence which is called the "outgroup" is a common method to determine where the root of a phylogenetic tree is: where it joins the tree there is the root. Since S(T)=0.998948 the tree almost perfectly represents the NCD distance matrix concerned in the Supplemental Material.

The S(T) value in Figure 1 tells how well the tree represents the distance matrix concerned [5]. To clarify, let there be n objects. They have n × n NCD's. Hence they exist in n-dimensional space. Mapping that space onto two dimensions gives distortions of the distances involved. A flat map representing the earth sphere gives such problems of distortion of distances. A tree can represent the n × n distances between n objects more easily than a more demanding 2-dimensional map. The S(T) value tells how well these distances are preserved (0 is not at all and 1 is perfect).

In Figure 1 all sequences are labeled as they occur in the data of [25] together with their registration code. The most interesting are the 11th to 15th sequences of the tree from the top of the page. The 13th is EPI_ISL_402131 which is the bat/Yunnan/RaTG13/2013, that is, the RaTG13 Bat Coronavirus sampled in Yunnan in 2013. The 14th label is the selected SARS-CoV-2 virus which occurs 105 times in the virus database of GISAID, that is, EPI_ISL_428253. The 15th label is MN908947 which is the SARS-CoV-2 virus Wuhan Hu-1 from the Wuhan Seafood market collected Dec. 2019, submitted 05-JAN-2020, and reported in Nature, 579(7798), 265–269, 270–273 (2020). This is the only SARS-CoV-2 virus in the 6,751 sequences obtained from the authors of [25]. The 11th and 12th labels are the CoVZC45 Bat Coronavirus and the CoVZXC21 Bat Coronavirus. Numbers 11, 12, 13, 14, and 15 have the least NCD distances to the selected SARS-CoV-2 virus. The entire 60 × 60 NCD matrix underlying this tree is too large to display here but is supplied in the Supplemental Material.

37 VIRUSES CLOSE TO THE SELECTED SARS-COV-2 VIRUS

Figure 2 contains the selected SARS-CoV-2 sequence and all the GISAID sequences ending in /2017, /2018, and /2019 which includes the three Pangolin viruses in the second paragraph of Section III on Materials and the Data Cleaning section of the Supplemental Material. Added are a dozen close sequences from all GISAID sequences and a dozen close sequences from the machine learning approach study [25] data. The ladders in the directed binary tree are usually there to accommodate more than two outgoing branches from a single node. The entire 37 × 37 NCD matrix is too large to display here but is supplied in the Supplemental Material.

Fig. 2. The phylogenetic directed binary tree built from 37 virus sequences with the human mtDNA as outgroup to determine the root.

V. THE PANGOLIN CONNECTION

As we saw in the section on Data Cleaning in the Supplemental Material, the GISAID hCoV19 database is littered with Pangolin viruses (now called Pangolin-CoV) from 2017, 2018, and 2019. Several studies, e.g. [17], [35], hold that while the SARS-CoV-2 virus probably originates from bats it may have been transmitted to another animal and/or recombined with a virus there and transmitted zoonotically to humans. The other animal is most often identified as the Pangolin. The compression method shows that the NCD's between the Pangolin-related virus and the human SARS-CoV-2 virus are far apart. However, they are not farther apart than 0.738175 to 0.874027. This is between the NCD distance of the mtDNA of a Human to a Gorilla up to the mtDNA of a Grey Seal to a Blue Whale. The bat-SL-CoVZXC21 and bat-SL-CoVZC45 viruses are at NCD=0.791082 and NCD=0.788416. That is in between the mtDNA NCD distance of a Human versus a Gorilla at 0.737 and the mtDNA distance of a Human versus an Orangutan at 0.834. Thus, according to the NCD results, the Pangolin and Bat origins are possible for the SARS-CoV-2 virus while the Bat origin is more likely and modification by the Pangolin is a possibility. But the RaTG13 virus has an NCD=0.444846 distance with the selected SARS-CoV-2 virus which is close to one half of the Pangolin NCD's given here. As noted before, this is comparable with the NCD between the Chimpanzee and the Pygmy Chimpanzee according to the Table in Figure 9 of [5]. Also in the tree of Figure 2 the Pangolin viruses are generally far from the selected SARS-CoV-2 virus. Hence the hypothesis that the Pangolin species is the origin of the SARS-CoV-2 virus, or an intermediary between the Bat and the Human, is perhaps unlikely if we follow these NCD results.

VI. DISCUSSION

We determined the phylogeny and taxonomy of the SARS-CoV-2 virus. Earlier studies using alignment-based methods have suggested that the SARS-CoV-2 virus originated from bats. Bats are a known reservoir of viruses that can be zoonotically transmitted to humans [19]. The machine-learning approach, an alignment-free method [25], came to the same conclusion. The current, completely different, approach agrees with this and has as runner-up the Pangolin.

The method used here [16], [5] is based on compression, alignment-free, remarkably simple, very fast, and quantifies a distance in a number called the NCD in between 0 and 1, where 0 is identical and 1 is totally different. It uses only the viral DNA/RNA sequences [5]. In practice the compression method performs accurately [16], [5], [3], [4], [11].

To compare the compression method with the alignment-free method based on machine learning [25] we used the database of the latter of over 6,500 unique viral sequences. To select the particular SARS-CoV-2 virus against which these viral sequences are compared we used a database of about 15,500 unique SARS-CoV-2 viruses. The NCD analysis places the RaTG13 bat virus closest to the SARS-CoV-2 virus followed by the two SARS-like corona viruses bat-SL-CoVZXC21 and bat-SL-CoVZC45, just like the machine learning method and alignment-based methods. Additionally, some Pangolin viruses are close and over 6,500 viruses identified by their registration code in the database of [25] are farther away (the codes of slightly less than 60 of them are given in Table I).

For alignment-based methods on viruses of the database of [25] such a feat is complex or infeasible. On the other hand these methods yield biological interpretations. The alignment-free machine-learning method [25] determines the taxonomy of a single virus sequence and gives different phylogeny trees based on different aspects of it. The compression method gives a single phylogenetic tree for a set of many DNA/RNA sequences. The compression method is domain-independent and requires no parameters to be set, apart from the compression algorithm used. Obtaining essentially the same main positive results as the earlier studies, and agreeing with the generally believed hypothesis, this method is less complicated than the previous methods. It yields quantitative evidence that can be compared with the NCD's among the mtDNA's of familiar mammals. Since the method is uncomplicated and very fast it is useful as an exploratory investigation into the phylogeny and taxonomy of viruses of new epidemic outbreaks.

REFERENCES

[1] C.H. Bennett, P. Gács, M. Li, P.M.B. Vitányi and W. Zurek, Information Distance, IEEE Trans. Information Theory, 44:4(1998), 1407–1423.
[2] Y. Cao, A. Janke, P.J. Waddell, M. Westerman, O. Takenaka, S. Murata, N. Okada, S. Pääbo and M. Hasegawa, Conflict among individual mitochondrial proteins in resolving the phylogeny of Eutherian orders, J. Mol. Evol., 47(1998), 307–322.
[3] M. Cebrian, M. Alfonseca and A. Ortega, Common pitfalls using the normalized compression distance: What to watch out for in a compressor, Commun. Inf. Syst., 5:4(2005), 367–384.
[4] M. Cebrian, M. Alfonseca and A. Ortega, The normalized compression distance is resistant to noise, IEEE Trans. Inform. Theory, 53:5(2007), 1895–1900.
[5] R.L. Cilibrasi and P.M.B. Vitányi, Clustering by compression, IEEE Trans. Information Theory, 51:4(2005), 1523–1545.
[6] R.L. Cilibrasi and P.M.B. Vitányi, A fast quartet tree heuristic for hierarchical clustering, Pattern Recognition, 44(2011), 662–677.
[7] C. Ceraolo and F.M. Giorgi, J. Med. Virol., 92(2020), 522–528.
[8] J.G. Cleary and I.H. Witten, Data compression using adaptive coding and partial string matching, IEEE Trans. Communications, 32:4(1984), 396–402.
[9] P. Gács, On the symmetry of algorithmic information, Soviet Math. Dokl., 15(1974), 1477–1480; Correction, Ibid., 15(1974), 1480.
[10] GISAID at www.gisaid.org
[11] E.J. Keogh, S. Lonardi, C.A. Ratanamahatana, L. Wei, S.H. Lee, and J. Handley, Compression-based data mining of sequential data, Data Mining and Knowledge Discovery, 14:1(2007), 99–129.
[12] A.N. Kolmogorov, Three approaches to the quantitative definition of information, Problems Inform. Transmission, 1:1(1965), 1–7.
[13] K. Kryukov, M. Takahashi Ueda, S. Nakagawa, T. Imanishi, Sequence Compression Benchmark (SCB) database–A comprehensive evaluation of reference-free compressors for FASTA-formatted sequences, GigaScience, 9:7(2020), giaa072, https://doi.org/10.1093/gigascience/giaa072.
[14] T.G. Ksiazek, et al., A Novel Coronavirus Associated with Severe Acute Respiratory Syndrome, New England J. Medicine, Published at www.nejm.org April 10, 2003 (10.1056/NEJMoa030781).
[15] M. Li, J.H. Badger, X. Chen, S. Kwong, P. Kearney and H. Zhang, An information-based sequence distance and its application to whole mitochondrial genome phylogeny, Bioinformatics, 17:2(2001), 149–154.
[16] M. Li, X. Chen, X. Li, B. Ma and P.M.B. Vitányi, The similarity metric, IEEE Trans. Information Theory, 50:12(2004), 3250–3264.
[17] X. Li, E.E. Giorgi, M. Honnayakanahalli Marichannegowda, B. Foley, C. Xiao, X.-P. Kong, Y. Chen, S. Gnanakaran, B. Korber and F. Gao, Emergence of SARS-CoV-2 through recombination and strong purifying selection, Science Advances, doi: 10.1126/sciadv.abb9153.
[18] M. Li and P.M.B. Vitányi, An Introduction to Kolmogorov Complexity and Its Applications, 1st Ed – 4th Ed., Springer, New York, 1993–2019.
[19] A.D. Luis, D.T.S. Hayman, T.J. O'Shea, P.M. Cryan, A.T. Gilbert, J.R.C. Pulliam, J.N. Mills, M.E. Timonin, C.K.R. Willis, A.A. Cunningham, A.R. Fooks, Ch.E. Rupprecht, J.L.N. Wood and C.T. Webb, A comparison of bats and rodents as reservoirs of zoonotic viruses: are bats special? Proc. R. Soc. B., 280(2013), 20122753, http://doi.org/10.1098/rspb.2012.2753.
[20] M. Nykter, N.D. Price, M. Aldana, S.A. Ramsey, S.A. Kauffman, L.E. Hood, O. Yli-Harja, and I. Shmulevich, Gene expression dynamics in the macrophage exhibit criticality, Proc. National Academy of Sciences of the USA (PNAS), 105:6(2008), 1897–1900. https://doi.org/10.1073/pnas.0711525105
[21] D. Paraskevis, E.G. Kostaki, G. Magiorkinis, G. Panayiotakopoulos, G. Sourvinos and S. Tsiodras, Full-genome evolutionary analysis of the novel corona virus (2019-nCoV) rejects the hypothesis of emergence as a result of a recent recombination event, Infection, Genetics and Evolution, 79(2020), 104212. https://doi.org/10.1016/j.meegid.2020.104212 pmid:32004758.
[22] L. Peñarrubia, M. Ruiz, R. Porco, S.N. Rao, M. Juanola-Falgarona, D. Manissero, M. López-Fontanals and J. Pareja, Multiple assays in a real-time RT-PCR SARS-CoV-2 panel can mitigate the risk of loss of sensitivity by new genomic variants during the COVID-19 outbreak, Int. J. Infectious Diseases, 97(2020), 225–229.
[23] C.E. Shannon, The mathematical theory of communication, Bell System Tech. J., 27(1948), 379–423, 623–656.
[24] B. Rannala and Z. Yang, Phylogenetic Inference Using Whole Genomes, Annual Review of Genomics and Human Genetics, 9(2008), 217–231.
[25] G.S. Randhawa, M.P.M. Soltysiak, H. El Roz, C.P.E. de Souza, K.A. Hill and L. Kari, Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: COVID-19 case study, PLoS ONE, 15:4(2020), e0232391. https://doi.org/10.1371/journal.pone.0232391.
[26] S.A. Terwijn, L. Torenvliet and P.M.B. Vitányi, Nonapproximability of the Normalized Information Distance, J. Comput. System Sciences, 77:4(2011), 738–742.
[27] A.M. Turing, On computable numbers, with an application to the Entscheidungsproblem, Proc. London Mathematical Society, 42:2(1936), 230–265, "Correction", 43(1937), 544–546.
[28] P.M.B. Vitányi, Similarity and denoising, Phil. Trans. R. Soc. A, 371(2013), 20120091, http://doi.org/10.1098/rsta.2012.0091
[29] P.M.B. Vitányi, F.J. Balbach, R.L. Cilibrasi and M. Li, Normalized information distance, pp. 45–82 in Information Theory and Statistical Learning, F. Emmert-Streib, M. Dehmer, Eds., Springer, New York, 2009.
[30] S. Vinga and J. Almeida, Alignment-free sequence comparison–a review, Bioinformatics, 19:4(2003), 513–523. pmid:12611807
[31] L. Wang and T. Jiang, On the complexity of multiple sequence alignment, J. Comp. Biol., 1:4(1994), 337–348. http://doi.org/10.1089/cmb.1994.1.337.
[32] Wikipedia: Alignment-free sequence analysis. Accessed July 2, 2020.
[33] Wikipedia: PHYLIP. Accessed July 4, 2020.
[34] A. Zielezinski, S. Vinga, J. Almeida and W.M. Karlowski, Alignment-free sequence comparison: benefits, applications, and tools, Genome Biology, 18(2017), Article 186. pmid:28974235
[35] T. Zhang, Q. Wu and Z. Zhang, Probable Pangolin origin of SARS-CoV-2 associated with the COVID-19 outbreak, Current Biology, 30:7(2020, March), doi: 10.1016/j.cub.2020.03.022

Acknowledgement

The authors thank Gurjit Randhawa for the database of viruses used in [25].

Data Archival

The analytic software which was used is available at GitHub, source code URI: https://github.com/rudi-cilibrasi/ncd-covid

We downloaded the data broadly in two parts, stored them on the public repository GitHub and linked to it here: https://github.com/rudi-cilibrasi/ncd-covid#downloading-input-data

We also clarified the instructions for downloading the GISAID data in the same page and provided a link. Users will have to register a free account with GISAID since they have a strict "no redistribution" policy.
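To make the computation described in Section II and in "How we used the method" concrete, the following is a minimal Python sketch. It is not the code in the repository above. It assumes the standard NCD formula of [5], NCD(x, y) = (C(xy) − min{C(x), C(y)}) / max{C(x), C(y)}, where C(·) is the length of the compressed input, and it substitutes the `lzma` module of the Python standard library for zpaq so that it runs without extra dependencies. The function names `compressed_size`, `ncd`, and `rank_by_ncd` are hypothetical and chosen only for illustration.

```python
import lzma


def compressed_size(data: bytes) -> int:
    """Length in bytes of the losslessly compressed input.

    Assumption: LZMA stands in for zpaq, the compressor used in the paper.
    """
    return len(lzma.compress(data))


def ncd(x: str, y: str) -> float:
    """Normalized Compression Distance between two sequences.

    Implements (C(xy) - min(C(x), C(y))) / max(C(x), C(y)) as in [5].
    """
    bx, by = x.encode("ascii"), y.encode("ascii")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)  # compress the concatenation of x and y
    return (cxy - min(cx, cy)) / max(cx, cy)


def rank_by_ncd(reference: str, candidates: dict[str, str]) -> list[tuple[str, float]]:
    """Return (name, NCD) pairs sorted from smallest to largest NCD.

    The first entry is the candidate regarded as most similar to the reference.
    """
    scores = [(name, ncd(reference, seq)) for name, seq in candidates.items()]
    return sorted(scores, key=lambda pair: pair[1])
```

Calling `rank_by_ncd(selected_genome, {"code_1": genome_1, "code_2": genome_2})` with nucleotide strings over A, C, G, T (the argument names here are placeholders) yields a list in the same order as a sorted table of NCD's. Because the compressor differs from zpaq, the numerical values should not be expected to match those in Table I; as with the selected virus in that table, the NCD of a sequence against itself comes out slightly above 0 rather than exactly 0.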

Author Contributions

Conceptualization: Paul Vitányi; Data curation: Rudi Cilibrasi; Formal analysis: Paul Vitányi; Methodology: Rudi Cilibrasi, Paul Vitányi; Resources: Rudi Cilibrasi; Software: Rudi Cilibrasi; Supervision: Paul Vitányi; Writing: Paul Vitányi.

Author declaration

There are no competing interests.

Rudi L. Cilibrasi received his B.S. with honors from the California Institute of Technology in 1996. He has programmed computers for over two decades, both in academia and in industry with various companies in Silicon Valley, including Microsoft, in diverse areas such as machine learning, data compression, process control, VLSI design, computer graphics, computer security, and networking protocols. He received his PhD from the University of Amsterdam in 2007. He is a co-author of papers on the normalized compression distance and several other papers. He helped create the first publicly downloadable Normalized Compression/Google Distance software, and is maintaining http://www.complearn.org now. Home page: http://cilibrar.com.

Paul M.B. Vitányi received his Ph.D. from the Free University of Amsterdam (1978). He is a CWI Fellow at the national research institute for mathematics and computer science in the Netherlands, CWI, and Professor of Computer Science at the University of Amsterdam. He served on the editorial boards of Distributed Computing, Information Processing Letters, Theory of Computing Systems, Parallel Processing Letters, International Journal of Foundations of Computer Science, Entropy, Information, Journal of Computer and Systems Sciences (guest editor), and elsewhere. He has worked on cellular automata, computational complexity, distributed and parallel computing, machine learning and prediction, physics of computation, Kolmogorov complexity, information theory, and quantum computing, publishing more than 200 research papers and some books. He received a Knighthood (Ridder in de Orde van de Nederlandse Leeuw) and is a member of the Academia Europaea. Together with Ming Li he pioneered applications of Kolmogorov complexity and co-authored "An Introduction to Kolmogorov Complexity and its Applications," Springer-Verlag, New York, 1993 (3rd Edition 2008), parts of which have been translated into Chinese, Russian and Japanese.

Supplementary Files

This is a list of supplementary files associated with this preprint.

si.pdf