Phylogenetic study of 2019-nCoV by using alignment-free method

Yang Gao*1, Tao Li*2, and Liaofu Luo‡3,4

1 Baotou National Rare Earth Hi-Tech Industrial Development Zone, Baotou, China
2 College of Life Sciences, Inner Mongolia Agricultural University, Hohhot, China
3 Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot, China
4 School of Life Science and Technology, Inner Mongolia University of Science and Technology, Baotou, China

* These authors contributed equally to this work.
‡ Corresponding author. E-mail: [email protected] (LL)

Abstract

The origin and early spread of 2019-nCoV are studied by phylogenetic analysis using the IC-PIC alignment-free method, which is based on DNA/RNA sequence information correlation (IC) and partial information correlation (PIC). The topology of the phylogenetic tree of Betacoronavirus is remarkably consistent with biologists’ systematics; it classifies 2019-nCoV as a Sarbecovirus of the genus Betacoronavirus and supports the assumption that these novel viruses are of bat origin, with the pangolin as one of the possible intermediate hosts. The novel-virus branch of the phylogenetic tree shows location-virus linkage. The placement of the root of the early 2019-nCoV tree is studied carefully with the Neighbor-Joining consensus algorithm by introducing different out-groups (bat-related coronaviruses, pangolin coronaviruses, HIV viruses, etc.) and by comparing with UPGMA consensus trees. Several oldest branches (lineages) of the 2019-nCoV tree are deduced, which means that COVID-19 may have begun to spread in several regions of the world before its outbreak in Wuhan.

Introduction

Coronaviruses are single-stranded, positive-sense, enveloped RNA viruses that are distributed broadly among humans, other mammals, and birds and that cause respiratory, enteric, hepatic, and neurologic diseases [1,2]. The family Coronaviridae contains four genera, namely Alphacoronavirus, Betacoronavirus, Deltacoronavirus, and Gammacoronavirus [3].
Six coronavirus species are known to cause human disease [4]. Among them, HCoV-OC43, HCoV-HKU1, SARS-CoV, and Middle East respiratory syndrome coronavirus (MERS-CoV) belong to the genus Betacoronavirus [5-8]. The other two, HCoV-229E and HCoV-NL63, belong to the genus Alphacoronavirus [4,5]. At the end of 2019, a pneumonia named COVID-19, caused by 2019-nCoV (SARS-CoV-2), broke out in Wuhan, China, and then spread rapidly all over the world. People urgently seek the source of transmission and the possible intermediate animal vectors. What kind of coronavirus is SARS-CoV-2? How did the spread of COVID-19 among humans begin? Both problems can be addressed through phylogenetic analysis of the 2019 novel coronaviruses and other related genomes.

On the first problem, alignment-based sequence analyses have yielded many new findings [9-11]. Phylogenetic analyses of the RNA-dependent RNA polymerase (RdRp) protein, the spike proteins, and full-length genomes found that SARS-CoV-2 is most closely related to two bat SARS-like coronaviruses, bat-SL-CoVZXC21 and bat-SL-CoVZC45, and SARS-CoV-2 is thought to belong to the subgenus Sarbecovirus of Betacoronavirus [12–16]. It was also found that SARS-CoV-2 has the highest similarity to the bat coronavirus RaTG13, sampled from a Rhinolophus affinis bat in Yunnan in 2013 [17, 18]. While bats may be the reservoir host for various coronaviruses, recent work found that the genomes of Malayan pangolin coronaviruses show high similarity to 2019-nCoV, and therefore pangolins (Manis javanica) may be another intermediate host for coronaviruses [19, 20].

However, little is known about the second problem. Recently it has attracted keen interest from researchers: Forster et al. [21] and Yu et al. [22] studied it through phylogenetic network analysis.
Regarding the computational methods used in phylogenetic analysis, although alignment-based approaches often give high accuracy and may reveal the relationships among sequences, they face great challenges when recombination, shuffling, and rearrangement events occur frequently in genome evolution. In addition, whole-genome multiple alignments are very time-consuming and expensive in memory usage. To address these limitations, alignment-free genome comparison methods are becoming an attractive alternative [23]. Several approaches derived from information theory, with emphasis on the base correlation property, have been proposed for sequence comparison [24-27]. The average mutual information (AMI) profiles were used to group HIV-1 viruses. The base correlation method has been used to infer coronavirus phylogeny and to analyze hepatitis E virus genotyping and subtyping [25, 26]. Previously, we proposed the IC-PIC method and studied the phylogeny of dsDNA viruses, papillomaviruses, and parvoviruses over a wide range [27]. Recently, Randhawa et al. [28] and Gao et al. [29], using alignment-free methods, confirmed that SARS-CoV-2 belongs to Betacoronavirus and that its genomic similarity to the subgenus Sarbecovirus implies a possible bat origin.

With the wide application of genome sequence data, we are eager to know what imprint characterizes a genome. We suggested that the AMI and the k-departed base correlation can be regarded as the signature of a given genome sequence [29]. The average mutual information is called the information correlation (IC), defined by

$$D_{k+2} = -2\sum_{i} p_i \log_2 p_i + \sum_{i,j} p_{i(k)j} \log_2 p_{i(k)j}$$

and the k-departed base correlation is called the partial information correlation (PIC), defined by

$$F_{i(k)j} = \left(p_{i(k)j} - p_i\,p_j\right)^2$$

where $p_i$ denotes the probability of base i in the sequence and $p_{i(k)j}$ denotes the joint probability of the base pair ij separated by distance k (k = 0, 1, 2, …).
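As a brief consistency check on the two definitions, the following sketch rewrites the IC, written here as $D_{k+2}$, in the standard mutual-information form. It assumes that the marginals of the joint distribution $p_{i(k)j}$ coincide with the single-base probabilities $p_i$; this holds only up to end effects in a long sequence and is an assumption made for this illustration, not a statement taken from the text.

```latex
% Assumption (illustrative): \sum_j p_{i(k)j} = p_i and \sum_i p_{i(k)j} = p_j,
% i.e. end effects of the finite sequence are neglected.
\begin{align*}
D_{k+2} &= -2\sum_{i} p_i \log_2 p_i + \sum_{i,j} p_{i(k)j}\,\log_2 p_{i(k)j} \\
        &= -\sum_{i,j} p_{i(k)j}\,\log_2\!\left(p_i\,p_j\right)
           + \sum_{i,j} p_{i(k)j}\,\log_2 p_{i(k)j} \\
        &= \sum_{i,j} p_{i(k)j}\,\log_2 \frac{p_{i(k)j}}{p_i\,p_j} \;\ge\; 0 .
\end{align*}
% If the two positions are statistically independent, p_{i(k)j} = p_i p_j,
% so D_{k+2} = 0 and F_{i(k)j} = (p_{i(k)j} - p_i p_j)^2 = 0 for all 16 pairs.
```

Under this assumption both quantities vanish for uncorrelated positions: the IC summarizes the overall departure from independence at distance k, while the 16 PIC values resolve that departure by base pair.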
In the following we study the SARS-CoV-2 phylogeny using the IC-PIC algorithm, which is based on the above set of genome-sequence signatures.

Materials and Methods

The SARS-CoV-2 genomes used in Figure 1 and in Figures 2 to 6 were downloaded from the GISAID (https://gisaid.org) platform on February 7 and February 29, 2020, respectively. Sequences containing NNs were discarded. Other sequences were downloaded from the NCBI database (http://www.ncbi.nlm.nih.gov/). Note: although the GISAID database now contains more than five thousand entries, to study the early spread of SARS-CoV-2 we use only the 136 genome sequences collected before Feb 29.

Each genome sequence is converted into an IC-PIC matrix with 17 rows (1 IC for a given k and 16 PICs for the different base correlation categories) and d columns (one for each distance k between the bases of a pair, k = 0, 1, ..., d-1). The only parameter of the algorithm is the range of d, denoted K. K is determined from the best-fit construction of the tree. In general the deduced tree changes with K and becomes stable at some large value. In deducing the consensus tree for Betacoronavirus in Fig 1 and for 2019-nCoV in Fig 2 to 6 we take K=50 and K=100, respectively.

The work is carried out on the IC-PIC web server [29]. After the input data are uploaded in FASTA format, the parameter K is set, and the UPGMA or Neighbor-Joining (NJ) option is chosen, the server runs the program and deduces one phylogenetic tree for each d (d = 1 to K in steps of 1). The evolutionary distance between any two genomes is calculated as the Euclidean distance between their respective 17×d IC-PIC matrices. A UPGMA tree or an unrooted NJ tree is then generated. Finally, the K phylogenetic trees are combined into a consensus tree. All trees were constructed using the NEIGHBOR and CONSENSE programs in the PHYLIP package [30]. The robustness of the tree topology was estimated by branch support.
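The matrix construction and distance described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the IC-PIC web server's implementation: the function names (`icpic_column`, `icpic_matrix`, `icpic_distance`) are hypothetical, the distance k is assumed to count the bases lying between the two members of a pair (so k = 0 pairs adjacent bases), and only unambiguous A/C/G/T sequences are accepted.

```python
import math
from itertools import product

BASES = "ACGT"
PAIRS = ["".join(p) for p in product(BASES, repeat=2)]  # AA, AC, ..., TT


def icpic_column(seq, k):
    """Return [IC, 16 PICs] for one distance k (hypothetical helper).

    Assumption: k is the number of bases between the paired positions,
    so the pair is (seq[n], seq[n + k + 1]).
    """
    gap = k + 1
    n_pairs = len(seq) - gap
    if n_pairs <= 0:
        raise ValueError("sequence is too short for this k")
    if set(seq) - set(BASES):
        raise ValueError("only unambiguous A/C/G/T bases are supported")

    # Single-base probabilities p_i.
    p = {b: seq.count(b) / len(seq) for b in BASES}

    # Joint probabilities p_i(k)j of base pairs separated by distance k.
    counts = dict.fromkeys(PAIRS, 0)
    for left, right in zip(seq, seq[gap:]):
        counts[left + right] += 1
    joint = {ij: c / n_pairs for ij, c in counts.items()}

    # IC = -2 * sum_i p_i log2 p_i + sum_ij p_ij log2 p_ij (zero terms skipped).
    single_terms = sum(v * math.log2(v) for v in p.values() if v > 0)
    joint_terms = sum(v * math.log2(v) for v in joint.values() if v > 0)
    ic = -2.0 * single_terms + joint_terms

    # PIC_ij = (p_ij - p_i * p_j)^2 for each of the 16 base pairs.
    pics = [(joint[ij] - p[ij[0]] * p[ij[1]]) ** 2 for ij in PAIRS]
    return [ic] + pics


def icpic_matrix(seq, d):
    """Build the 17 x d matrix whose columns correspond to k = 0 .. d-1."""
    columns = [icpic_column(seq, k) for k in range(d)]
    return [list(row) for row in zip(*columns)]


def icpic_distance(seq_a, seq_b, d):
    """Euclidean distance between the two 17 x d IC-PIC matrices."""
    ma, mb = icpic_matrix(seq_a, d), icpic_matrix(seq_b, d)
    return math.sqrt(
        sum((x - y) ** 2 for ra, rb in zip(ma, mb) for x, y in zip(ra, rb))
    )
```

In a workflow like the one described, such pairwise distances would be computed for every pair of genomes at each d from 1 to K, and each resulting distance matrix passed to a tree-building program before the K trees are combined into a consensus.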
Note that the NJ tree obtained with the NEIGHBOR algorithm is unrooted. Placing the root is one of the most difficult parts of estimating a tree. However, for gene trees with a known species tree the out-group approach is available, since in the present case the out-group is either known for certain or can be assumed appropriately. In constructing the evolutionary tree for Betacoronavirus we introduce Gammacoronavirus as the out-group. In constructing the evolutionary tree for 2019-nCoV we assume several possible out-groups for comparison. In parallel, we construct the UPGMA tree as supplementary evidence for the root obtained by the out-group approach.

The biggest difference between the NJ tree and the UPGMA tree is that the latter assumes a constant rate of evolution across lineages, i.e., a molecular clock; because this assumption is often violated in empirical datasets, UPGMA is usually considered sub-optimal. Another problem concerns the accuracy of UPGMA tree construction. Because the UPGMA algorithm always merges the pair of sequences with the smallest distance, wrong pairings may occur in some clades when the distances between pairs differ only slightly and are not measured accurately enough.

Results

The whole-genome-based phylogenetic trees for Betacoronavirus and 2019-nCoV are deduced with the IC-PIC method and given in Fig 1 and Fig 2 to 6, respectively. To reconstruct the phylogenetic trees, the sequence data of 40 Betacoronaviruses from five subgenera and of 136 SARS-CoV-2 genomes are used. Each consensus tree is derived from 50 (K=50 for Fig 1) or 100 (K=100 for Fig 2 to 6) trees based on the IC-PIC matrices. The robustness of the tree topology was estimated by branch support, but the trees are not drawn to scale. The phylogenetic tree of the genus Betacoronavirus is shown in Fig. 1.
From Fig 1 we find: 1) The tree topology is remarkably consistent with biologists’ systematics, in which Betacoronavirus contains five subgenera, namely Sarbecovirus, Hibecovirus, Nobecovirus, Embecovirus and Merbecovirus.