
Phylogenetic study of 2019-nCoV by using alignment-free method

Yang Gao*1, Tao Li*2, and Liaofu Luo‡3,4

1 Baotou National Rare Earth Hi-Tech Industrial Development Zone, Baotou, China

2 College of Life Sciences, Inner Mongolia Agricultural University, Hohhot, China

3 Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot, China

4 School of Life Science and Technology, Inner Mongolia University of Science and Technology, Baotou, China

*These authors contributed equally to this work.

‡Corresponding author.

E-mail: [email protected] (LL)

Abstract

The origin and early spread of 2019-nCoV are studied by phylogenetic analysis using the alignment-free IC-PIC method, which is based on DNA/RNA sequence information correlation (IC) and partial information correlation (PIC). The topology of the phylogenetic tree of Betacoronavirus is remarkably consistent with biologists' systematics, classifies 2019-nCoV as a Sarbecovirus within Betacoronavirus, and supports the assumption that these novel viruses are of bat origin with pangolin as one of the possible intermediate hosts. The novel branch of the phylogenetic tree shows location-virus linkage. The placement of the root of the early 2019-nCoV tree is studied carefully in NJ consensus trees by introducing different out-groups (bat-related coronaviruses, pangolin coronaviruses, HIV viruses, etc.) and comparing with UPGMA consensus trees. Several oldest branches (lineages) of the 2019-nCoV tree are deduced, which suggests that COVID-19 may have begun to spread in several regions of the world before its outbreak in Wuhan.

Introduction

Coronaviruses are single-stranded, positive-sense, enveloped RNA viruses that are distributed broadly among humans, other mammals, and birds and that cause respiratory, enteric, hepatic, and neurologic diseases [1,2]. The family contains four genera, namely Alphacoronavirus, Betacoronavirus, Gammacoronavirus, and Deltacoronavirus [3]. Six coronavirus species are known to cause human disease [4]. Among them, HCoV-OC43, HCoV-HKU1, SARS-CoV, and Middle East respiratory syndrome coronavirus (MERS-CoV) belong to the genus Betacoronavirus [5-8]. The other two, HCoV-229E and HCoV-NL63, belong to the genus Alphacoronavirus [4,5]. At the end of 2019, a pneumonia named COVID-19, caused by 2019-nCoV (SARS-CoV-2), broke out in Wuhan, China and then spread rapidly all over the world. The source of transmission and the possible intermediate animal vectors are urgently sought. What kind of coronavirus is SARS-CoV-2? How did the spread of COVID-19 among humans begin? Both problems can be addressed by phylogenetic analysis of the 2019 novel coronaviruses and other related genomes.

On the first problem, alignment-based sequence analyses have revealed many new findings [9-11]. Phylogenetic analyses of the RNA-dependent RNA polymerase (RdRp) gene, the spike gene, and full-length genomes found that SARS-CoV-2 is most closely related to two bat SARS-like coronaviruses, bat-SL-CoVZXC21 and bat-SL-CoVZC45, and SARS-CoV-2 is thought to belong to the subgenus Sarbecovirus of Betacoronavirus [12–16]. It was also found that SARS-CoV-2 has the highest similarity to the bat coronavirus RaTG13, sampled from a Rhinolophus affinis bat in Yunnan in 2013 [17, 18]. While bats may be the reservoir host for various coronaviruses, recent works discovered that the genomes of Malayan pangolin coronaviruses show high similarity to 2019-nCoV, and therefore pangolins (Manis javanica) may be an intermediate host for coronaviruses [19, 20]. However, little is known about the second problem. Recently, this problem has been addressed by Forster et al. [21] and Yu et al. [22] through phylogenetic analysis.

Regarding the calculation method used in phylogenetic analysis, although alignment-based approaches often give high accuracy and may reveal the relationships among sequences, they meet huge challenges when recombination, shuffling, and rearrangement events occur frequently in genome evolution. Moreover, whole-genome multiple alignments are very time-consuming and expensive in memory usage. To address these limitations, alignment-free genome comparison methods are becoming an attractive alternative [23]. Several approaches derived from information theory, with emphasis on the base correlation property, have been proposed for sequence comparison [24-27]. Average mutual information (AMI) profiles were used to group HIV-1 viruses. The base correlation method has been used to infer coronavirus phylogeny and to analyze Hepatitis E virus genotyping and subtyping [25, 26]. Previously, we proposed the IC-PIC method and studied the phylogeny of a wide range of dsDNA viruses, papillomaviruses, and parvoviruses [27]. Recently, Randhawa et al. [28] and Gao et al. [29], using alignment-free methods, confirmed that SARS-CoV-2 belongs to Betacoronavirus and that its genomic similarity to the subgenus Sarbecovirus suggests a possible bat origin.

With the wide application of genome sequence data, we are eager to know what imprint characterizes a genome. We suggested that the AMI and the k-departed base correlation can be regarded as the signature of a given genome sequence [29]. The average mutual information is called the information correlation (IC), defined by

$$
D_k = -2\sum_{i} p_i \log_2 p_i + \sum_{i,j} p_{i(k)j} \log_2 p_{i(k)j}
$$

and the k-departed base correlation is called the partial information correlation (PIC), defined by

$$
F_{i(k)j} = \left( p_{i(k)j} - p_i p_j \right)^2
$$

where $p_i$ means the probability of base i in the sequence and $p_{i(k)j}$ means the joint probability of the base pair ij separated by distance k (k = 0, 1, 2, …). In the following we study the SARS-CoV-2 phylogeny by using the IC-PIC algorithm based on the above set of genome sequence signatures.
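As a consistency check on the definition of IC, the short derivation below shows that $D_k$ reduces to the usual mutual information between bases separated by distance k. It assumes, which the text above does not state explicitly, that both marginals of the joint probability equal the single-base frequencies; for a long sequence this holds up to end effects.

$$
\begin{aligned}
\sum_{j} p_{i(k)j} &= p_i, \qquad \sum_{i} p_{i(k)j} = p_j \\
-2\sum_{i} p_i \log_2 p_i &= -\sum_{i,j} p_{i(k)j} \log_2 \left( p_i p_j \right) \\
D_k &= \sum_{i,j} p_{i(k)j} \log_2 \frac{p_{i(k)j}}{p_i p_j} \;\ge\; 0
\end{aligned}
$$

Under this assumption $D_k$ vanishes exactly when $p_{i(k)j} = p_i p_j$ for every base pair, in which case every PIC $F_{i(k)j}$ is zero as well.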

Materials and Methods

The SARS-CoV-2 genomes used in Fig 1 and in Fig 2 to 6 were downloaded from the GISAID platform (https://gisaid.org) on February 7 and February 29, 2020, respectively. Sequences containing NNs were discarded. Other sequences were downloaded from the NCBI database (http://www.ncbi.nlm.nih.gov/). Note: although the GISAID database now contains more than five thousand entries, to study the early spread of SARS-CoV-2 we use only the 136 genome sequences collected before February 29.

The genome sequence is converted into an IC-PIC matrix with 17 rows (1 IC and 16 PICs, one for each base-pair category, at a given k) and d columns (one for each distance k between the two bases of a pair, k = 0, 1, …, d-1). The only parameter in the algorithm is the upper limit of d, denoted K. K is determined from the best-fit construction of the tree. In general, the deduced tree changes with K and becomes stable at some large value. In deducing the consensus tree for Betacoronavirus in Fig 1 and for 2019-nCoV in Fig 2 to 6 we take K=50 and K=100, respectively.
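The construction of the 17 x d matrix described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code of the IC-PIC web server. The function names are hypothetical, and two conventions are assumptions: k = 0 is taken to mean adjacent bases, and the joint probability is estimated as the pair count divided by the number of available pairs.

```python
from collections import Counter
from itertools import product
from math import log2

BASES = "ACGT"
PAIRS = ["".join(p) for p in product(BASES, repeat=2)]  # 16 ordered base pairs


def ic_pic_column(seq, k):
    """Return [D_k, F_AA, ..., F_TT] for one distance k."""
    gap = k + 1  # assumption: k = 0 means neighbouring bases
    n = len(seq)
    counts = Counter(seq)
    p = {b: counts[b] / n for b in BASES}
    n_pairs = n - gap
    pair_counts = Counter(seq[i] + seq[i + gap] for i in range(n_pairs))
    joint = {ij: pair_counts[ij] / n_pairs for ij in PAIRS}
    # IC: D_k = -2 * sum_i p_i log2 p_i + sum_ij p_i(k)j log2 p_i(k)j
    ic = -2 * sum(p[b] * log2(p[b]) for b in BASES if p[b] > 0)
    ic += sum(q * log2(q) for q in joint.values() if q > 0)
    # PIC: F_i(k)j = (p_i(k)j - p_i p_j)^2 for each of the 16 pairs
    pics = [(joint[ij] - p[ij[0]] * p[ij[1]]) ** 2 for ij in PAIRS]
    return [ic] + pics


def ic_pic_matrix(seq, d):
    """Return the 17 x d IC-PIC matrix as a list of 17 rows."""
    # Keep only unambiguous bases; RNA 'U' is mapped to 'T'.
    seq = "".join(c for c in seq.upper().replace("U", "T") if c in BASES)
    if len(seq) <= d:
        raise ValueError("sequence must be longer than d")
    columns = [ic_pic_column(seq, k) for k in range(d)]
    return [list(row) for row in zip(*columns)]  # transpose to 17 rows


matrix = ic_pic_matrix("ACGT" * 50, 5)
print(len(matrix), len(matrix[0]))  # 17 rows, 5 columns
```

Row 0 holds the IC values and rows 1 to 16 hold the PICs in the order AA, AC, ..., TT; the row order is a choice of this sketch and does not affect the Euclidean distance as long as it is the same for all genomes.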

The work is carried out on the IC-PIC web server [29]. After the input data are uploaded in FASTA format, the parameter K is set, and the UPGMA or Neighbor-Joining (NJ) option is chosen, the server runs the program and deduces a phylogenetic tree for each d (d = 1 to K in steps of 1). The evolutionary distance between any two genomes is calculated as the Euclidean distance between their respective 17×d IC-PIC matrices. Then a UPGMA tree or an unrooted NJ tree is generated. Finally, the K phylogenetic trees are combined into a consensus tree. All trees were constructed with the NEIGHBOR and CONSENSE programs in the PHYLIP package [30]. The robustness of the tree topology was estimated by branch support. Note that the NJ tree obtained with the NEIGHBOR algorithm is unrooted. The placement of the root is one of the most difficult parts of estimating a tree. However, for gene trees with a known species tree, the out-group approach can be used, provided that the out-group is known for sure or can be assumed appropriately, as in the present case. In constructing the evolutionary tree for Betacoronavirus we introduce Gammacoronavirus as the out-group. In constructing the evolutionary tree for 2019-nCoV we assume several possible out-groups for comparison. In addition, we construct the UPGMA tree as supplementary evidence for the root found by the out-group approach. The biggest difference between the NJ tree and the UPGMA tree is that the latter assumes a constant rate of evolution across lineages, i.e., a molecular clock; because this assumption is often violated in empirical datasets, the approach is usually considered sub-optimal. Another problem concerns the accuracy of UPGMA tree construction. Since the UPGMA algorithm always merges the pair of sequences with the smallest distance, wrong pairings may occur in some cases when the distances between pairs of sequences differ only slightly and the distances are not measured accurately enough.
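The distance and UPGMA steps described above can be illustrated with a minimal pure-Python sketch. It is not the PHYLIP NEIGHBOR program used in this work; the function names and the nested-tuple tree representation are hypothetical choices made for illustration. Ties in the smallest distance are resolved here by input order, which mirrors the sensitivity to near-equal distances noted in the text.

```python
from math import sqrt


def matrix_distance(a, b):
    """Euclidean distance between two equally shaped matrices (lists of rows)."""
    return sqrt(sum((x - y) ** 2
                    for row_a, row_b in zip(a, b)
                    for x, y in zip(row_a, row_b)))


def distance_table(matrices):
    """Symmetric table of pairwise distances between IC-PIC matrices."""
    return [[matrix_distance(a, b) for b in matrices] for a in matrices]


def upgma(names, dist):
    """Cluster by UPGMA and return the topology as nested tuples."""
    n = len(names)
    clusters = {i: (names[i], 1) for i in range(n)}  # id -> (subtree, size)
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        # Merge the closest pair; the first minimum wins when distances tie.
        i, j = min(d, key=d.get)
        del d[(i, j)]
        (tree_i, size_i), (tree_j, size_j) = clusters.pop(i), clusters.pop(j)
        for m in clusters:
            d_im = d.pop((min(i, m), max(i, m)))
            d_jm = d.pop((min(j, m), max(j, m)))
            # Size-weighted average of the distances to the merged clusters.
            d[(m, next_id)] = (size_i * d_im + size_j * d_jm) / (size_i + size_j)
        clusters[next_id] = ((tree_i, tree_j), size_i + size_j)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree


names = ["A", "B", "C", "D"]
dist = [[0, 1, 6, 6],
        [1, 0, 6, 6],
        [6, 6, 0, 2],
        [6, 6, 2, 0]]
print(upgma(names, dist))  # (('A', 'B'), ('C', 'D'))
```

Because UPGMA assumes a constant rate of evolution, the last merge in this sketch defines a root, whereas an NJ tree built from the same distances would still need an out-group to be rooted.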

Results

The whole-genome-based phylogenetic trees for Betacoronavirus and 2019-nCoV are deduced with the IC-PIC method and given in Fig 1 and Fig 2 to 6, respectively. To reconstruct the phylogenetic trees, the sequence data of 40 Betacoronaviruses from five subgenera and 136 SARS-CoV-2 genomes are used. Each consensus tree is derived from 50 trees (K=50, Fig 1) or 100 trees (K=100, Fig 2 to 6) based on IC-PIC matrices. The robustness of the tree topology was estimated by branch support, but the trees are not drawn to scale.
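To make the notion of branch support in a consensus of K trees concrete, the sketch below counts how often each clade appears across a set of rooted trees. It is a simplified illustration under stated assumptions, not the algorithm of the PHYLIP CONSENSE program: trees are represented as nested tuples of leaf names (a hypothetical representation), and for unrooted NJ trees one would count bipartitions rather than clades.

```python
from collections import Counter


def clades(tree):
    """Return (leaf set, set of clades) for a nested-tuple rooted tree."""
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaves = frozenset()
    found = set()
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found |= child_clades
    found.add(leaves)  # the clade defined by this internal node
    return leaves, found


def branch_support(trees):
    """Fraction of trees in which each clade occurs."""
    counts = Counter()
    for tree in trees:
        _, found = clades(tree)
        counts.update(found)
    return {clade: n / len(trees) for clade, n in counts.items()}


# Three toy trees standing in for the K trees obtained at d = 1..K.
trees = [(("A", "B"), ("C", "D")),
         (("A", "B"), ("C", "D")),
         (("A", "C"), ("B", "D"))]
support = branch_support(trees)
majority = [sorted(c) for c, s in support.items() if s > 0.5]
```

In this toy example the clade {A, B} occurs in two of the three trees, so it would be kept by a majority-rule consensus, while {A, C} occurs in only one and would be dropped.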

The phylogenetic tree of the genus Betacoronavirus is shown in Fig 1. From Fig 1 we found: 1) The tree topology is remarkably consistent with biologists' systematics, in which Betacoronavirus contains five subgenera, namely Sarbecovirus, Merbecovirus, Nobecovirus, Embecovirus and Hibecovirus. Each of these subgenera forms its own monophyletic clade in the figure. 2) SARS viruses form an early clade in the subgenus Sarbecovirus. In addition to the SARS and SARS-like lineages, Sarbecovirus includes four other clades, namely bat-SL-CoVZXC21 and bat-SL-CoVZC45, which fall outside the SARS-like-CoV clade; the pangolin coronaviruses; the clade of RaTG13; and the 2019-nCoVs. 3) Six pangolin coronaviruses from Guangxi and from Guangdong form a monophyletic clade, which differs from the phylogenetic tree obtained by Lam et al. [19]. 4) Twenty-three SARS-CoV-2 sequences form a monophyletic group located close to RaTG13. This clade is the newest lineage in the subgenus Sarbecovirus, indicating an independent evolutionary event of 2019-nCoVs in nature. To summarize, the tree topology agrees well with alignment-based phylogenetic trees. We conclude that SARS-CoV-2 belongs to the subgenus Sarbecovirus and support the assumption that the novel viruses are of bat origin with pangolin as one of the intermediate hosts.

The phylogenetic trees of 2019-nCoVs are given in Fig 2 to Fig 6. To describe the early spread, Fig 2 to 5 are drawn for the 124 SARS-CoV-2 genomes collected in GISAID before February 23, 2020. Then, for comparison, in Fig 6 the data set is enlarged to the 136 genomes collected up to February 29. Since the branching order of the IC-PIC tree shows location-virus linkage, the picture can in principle describe the details of how the disease spreads from one place to another. Fig 2, Fig 4 and Fig 5 are NJ consensus trees drawn with different input orders of the 2019-nCoV sequences and with different out-groups, namely bat SARS-like coronaviruses, pangolin coronaviruses and HIV viruses, respectively.

The phylogenetic analyses in Fig 2, Fig 4 and Fig 5 show that the roots of the three evolutionary trees are located in the same place: the root region includes five sequences from Korea (accession IDs EPI-ISL-412869, 412872, 412871, 412870, 407193), three from Australia (EPI-ISL-408976, 407893, 408977), one from Singapore (EPI-ISL-410535), one from Italy (EPI-ISL-410546), and one from Germany (EPI-ISL-406862). The collection dates of these 11 viruses in GISAID are nearly the same, from January 22 to January 31. To give supplementary evidence on the location of the root, the UPGMA tree is constructed and compared with the NJ tree. Fig 3 is an example. Using the same input order of the 124 virus sequences as in Fig 2, we found that the root of the UPGMA tree in Fig 3 is basically the same as that of the NJ tree. However, there are some errors. For example, two sequences, USA/CA6 and Shenzhen/SZTH-001, mistakenly enter the root region in Fig 3. This may be caused by errors in the distances input to the UPGMA algorithm.

Furthermore, to study how the root placement changes as the data set is enlarged, we construct the NJ consensus tree of 136 sequences. As shown in the example in Fig 6, the root of the tree is still distributed over the three regions of Korea, Australia and Europe, as in Fig 2. There are 16 virus genomes in total in the root region, and the newly added ones are obviously the second generation of the roots in Fig 2.

We notice that the phylogenetic trees in Fig 2, Fig 4 and Fig 5 show nearly the same topology although different out-groups have been used. In addition to the same root, the three trees share a common top branch of about 35 sequences representing the Wuhan outbreak of COVID-19. Below it there are a branch of about 18 sequences composed mainly of USA COVID-19 events, a branch of Japan COVID-19 events, and a complex network of SARS-CoV-2 in south China, Hong Kong, Singapore and other cities. All these branches are derived from the root of the phylogenetic tree. Finally, it is worth noting that although nearly half of the virus infection events (GISAID, before February 29, 2020) happened in Wuhan and elsewhere in China, none of them entered the root region of the phylogenetic tree (Fig 2 and Fig 4 to 6). We know that the GISAID collection only began on December 24, 2019. If more early data are collected, we will be able to reach a more complete conclusion. However, in spite of the incompleteness of the early GISAID data, it is reasonable to assume that the novel virus had crept into humans and begun to spread in the root regions before its outbreak in Wuhan. As for why the same virus suddenly occurred and spread rapidly in areas with different geographical locations, this is obviously related to the globalization of the economy and of social interaction. Therefore, early effective prevention and control of epidemics must rely on international cooperation.

Summary

The alignment-free IC-PIC method, based on DNA/RNA sequence information correlation and partial information correlation, is an effective tool for the reconstruction of phylogenetic trees. We confirm the current taxonomic classification of 2019-nCoV in the subgenus Sarbecovirus within Betacoronavirus through alignment-free comparative genomics with the IC-PIC method (Fig 1). The obtained IC-PIC tree supports the assumption that 2019-nCoV is of bat origin with pangolin as one of the possible intermediate hosts. Moreover, the IC-PIC tree of 2019-nCoV (Fig 2 to 6) can give the details of the early spread of the disease. By placing the root in NJ consensus trees of early viruses with different out-groups and comparing with UPGMA trees, several oldest clades of the tree can be deduced, which suggests that COVID-19 may have begun to spread in these locations before its outbreak in Wuhan. This finding also provides a clue in searching for the origin of COVID-19. If some mammals are intermediate hosts of 2019-nCoV, one should search for them first in the locations of these evolutionarily oldest clades.

Acknowledgements

We gratefully acknowledge the authors and the originating and submitting laboratories of the sequences from GISAID's EpiFlu(TM) Database, on which this research is based.

References

1. Weiss SR, Leibowitz JL. Coronavirus pathogenesis. Adv Virus Res 2011;81:85-164.

2. Masters PS, Perlman S. Coronaviridae. In: Knipe DM, Howley PM, eds. Fields virology. 6th ed. Lippincott Williams & Wilkins, 2013:825-58.

3. de Groot RJ, Baker SC, Baric R, et al. Ninth report of the international committee on taxonomy of viruses, Elsevier Academic Press: Amsterdam, 2012; pp. 806–828.

4. Su S, Wong G, Shi W, et al. Epidemiology, genetic recombination, and pathogenesis of coronaviruses. Trends Microbiol 2016;24:490-502.

5. Cui J, Li F, Shi ZL. Origin and evolution of pathogenic coronaviruses. Nat Rev Microbiol 2019;17:181-92.

6. Zhong NS, Zheng BJ, Li YM, et al. Epidemiology and cause of severe acute respiratory syndrome (SARS) in Guangdong, China, in February, 2003. Lancet 2003;362:1353-8.

7. Drosten C, Günther S, Preiser W, et al. Identification of a novel coronavirus in patients with severe acute respiratory syndrome. N Engl J Med 2003;348:1967-76.

8. Zaki AM, Van Boheemen S, Bestebroer TM, Osterhaus AD, Fouchier RA. Isolation of a novel coronavirus from a man with pneumonia in Saudi Arabia. N Engl J Med 2012;367:1814-20.

9. Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 1970; 48: 443–453.

10. Smith TF, Waterman MS. Identification of common molecular subsequences. J. Mol. Biol. 1981; 147:195–197.

11. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J. Mol. Biol. 1990; 215: 403–410.

12. Lu RJ, Zhao X, Li J, et al. Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding. Lancet 2020, doi:10.1016/S0140-6736(20)30251-8.

13. Chan JF-W, Yuan SF, Kok K-H, et al. A familial cluster of pneumonia associated with the 2019 novel coronavirus indicating person-to-person transmission: a study of a family cluster. Lancet 2020, doi:10.1016/S0140-6736(20)30154-9.

14. Hu B, Zeng LP, Yang XL, et al. Discovery of a rich gene pool of bat SARS-related coronaviruses provides new insights into the origin of SARS coronavirus. PLoS Pathog. 2017, 13.

15. Dong N, Yang XM, Ye LW, et al. Genomic and protein structure modelling analysis depicts the origin and infectivity of 2019-nCoV, a new coronavirus which caused a pneumonia outbreak in Wuhan, China. bioRxiv 2020, doi:10.1101/2020.01.20.913368.

16. Wu F, Zhao S, Yu B, et al. Complete genome characterisation of a novel coronavirus associated with severe human respiratory disease in Wuhan, China. bioRxiv 2020, doi:10.1101/2020.01.24.919183.

17. Paraskevis D, Kostaki EG, Magiorkinis G, et al. Full-genome evolutionary analysis of the novel corona virus (2019-nCoV) rejects the hypothesis of emergence as a result of a recent recombination event. bioRxiv 2020, doi:10.1101/2020.01.26.920249.

18. Zhou P, Yang XL, Wang XG, et al. A pneumonia outbreak associated with a new coronavirus of probable bat origin. Nature 2020, https://doi.org/10.1038/s41586-020-2012-7.

19. Lam T. Tsan-Yuk, Shum M. Ho-Hin, Zhu HC, et al. Identification of 2019-nCoV related coronaviruses in Malayan pangolins in southern China. bioRxiv 2020, doi:10.1101/2020.02.13.945485.

20. Xiao KP, Zhai JQ, Feng YY, et al. Isolation and characterization of 2019-nCoV-like coronavirus from Malayan pangolins. bioRxiv 2020, doi:10.1101/2020.02.17.951335.

21. Forster P, Forster L, Renfrew C, Forster M. Phylogenetic network analysis of SARS-CoV-2 genomes. PNAS 2020. www.pnas.org/cgi/doi/10.1073/pnas.2004999117.

22. Yu WB, Tang GD, Zhang L, Corlett RT. Decoding evolution and transmissions of novel pneumonia coronavirus using the whole genomic data. 2020. ChinaXiv:202002.00033.

23. Vinga S, Almeida J. Alignment-free sequence comparison-a review. Bioinformatics. 2003, 19(4): 513–523.

24. Bauer M, Schuster SM, Sayood K. The average mutual information profile as a genomic signature. BMC Bioinformatics. 2008, 9, 48.

25. Liu ZH, Meng JH, Sun X. A novel feature-based method for whole genome phylogenetic analysis without alignment: application to HEV genotyping and subtyping. Biochem Biophys Res Commun. 2008, 368:223–230.

26. Liu ZH, Sun X. Coronavirus phylogeny based on base-base correlation. Int J Bioinform Res Appl. 2008, 4(2): 211–220.

27. Gao Y, Luo LF. Genome-based phylogeny of dsDNA viruses by a novel alignment-free method. Gene, 2011, 492: 309–314.

28. Randhawa GS, Soltysiak MPM, et al. Machine learning-based analysis of genomes suggests associations between Wuhan 2019-nCoV and bat Betacoronaviruses. bioRxiv 2020, doi:10.1101/2020.02.03.932350.

29. IC-PIC web server can be found in

30. Felsenstein J. PHYLIP-Phylogeny inference package (ver. 3.69). Cladistics. 1989, 5: 164–166.

Figure captions

Figure 1 - The consensus tree of Betacoronavirus. The consensus tree of 40 Betacoronaviruses and 23 2019-nCoVs is derived from 50 trees based on IC-PIC matrices, with Gammacoronavirus as out-group. The tree is not drawn to scale. The robustness of the tree topology was estimated by branch support.

Figure 2 - The NJ consensus tree of 2019-nCoV with bat SARS-like coronaviruses as out-group. The consensus tree of 124 SARS-CoV-2 genomes is derived from 100 trees based on IC-PIC matrices, with bat-SL-CoVZXC21 and bat-SL-CoVZC45 as out-group. The tree is not drawn to scale. The robustness of the tree topology was estimated by branch support. The predicted roots are marked in black.

Figure 3 - The UPGMA consensus tree of 2019-nCoV. The UPGMA tree is constructed for 124 SARS-CoV-2 genomes by using the same input order as in the NJ tree of Fig 2. The tree supports the root placement of Fig 2 (see text).

Figure 4 - The NJ consensus tree of 2019-nCoV with pangolin coronaviruses as out-group. The consensus tree of 124 SARS-CoV-2 genomes is derived from 100 trees based on IC-PIC matrices, with pangolin coronaviruses as out-group. The tree is not drawn to scale. The robustness of the tree topology was estimated by branch support. The tree topology is the same as in Fig 2. The predicted roots are marked in black. If the South Korea sequence neighboring the four Korea sequences is taken as a root, then the eleven sequences in the root region are exactly the same as the predicted roots in Fig 2.

Figure 5 - The NJ consensus tree of 2019-nCoV with HIV viruses as out-group. The consensus tree of 124 SARS-CoV-2 genomes is derived from 100 trees based on IC-PIC matrices, with HIV as out-group. The tree is not drawn to scale. The robustness of the tree topology was estimated by branch support. The tree topology is the same as in Fig 2. The predicted roots are marked in black. If the South Korea sequence neighboring the four Korea sequences is taken as a root, then the eleven sequences in the root region are exactly the same as the predicted roots in Fig 2.

Figure 6 - The NJ consensus tree of 136 2019-nCoV with bat SARS-like coronaviruses as out-group. The consensus tree of 136 SARS-CoV-2 genomes is derived from 100 trees based on IC-PIC matrices, with bat-SL-CoVZXC21 and bat-SL-CoVZC45 as out-group. The predicted roots are marked in black.