Understanding genomic diversity, pan-genome, and evolution of SARS-CoV-2

Arohi Parlikar*, Kishan Kalia*, Shruti Sinha, Sucheta Patnaik, Neeraj Sharma, Sai Gayatri Vemuri and Gaurav Sharma Institute of Bioinformatics and Applied Biotechnology (IBAB), Bengaluru, Karnataka, India * These authors contributed equally to this work.

ABSTRACT Coronovirus disease 2019 (COVID-19) infection, which originated from Wuhan, China, has seized the whole world in its grasp and created a huge pandemic situation before humanity. Since December 2019, genomes of numerous isolates have been sequenced and analyzed for testing confirmation, epidemiology, and evolutionary studies. In the first half of this article, we provide a detailed review of the history and origin of COVID-19, followed by the taxonomy, nomenclature and genome organization of its causative agent Severe Acute Respiratory Syndrome-related Coronavirus-2 (SARS-CoV-2). In the latter half, we analyze subgenus Sarbecovirus (167 SARS-CoV-2, 312 SARS-CoV, and 5 Pangolin CoV) genomes to understand their diversity, origin, and evolution, along with pan-genome analysis of genus Betacoronavirus members. Whole-genome sequence-based phylogeny of subgenus Sarbecovirus genomes reasserted the fact that SARS-CoV-2 strains evolved from their common ancestors putatively residing in bat or pangolin hosts. We predicted a few country-specific patterns of relatedness and identified mutational hotspots with high, medium and low probability based on genome alignment of 167 SARS-CoV-2 strains. A total of 100-nucleotide segment-based homology studies revealed that the majority of the SARS-CoV-2 genome segments are close to Bat CoV, followed by some to Pangolin CoV, and some are unique ones. Open pan-genome of genus Betacoronavirus members indicates the diversity contributed by the novel Submitted 29 April 2020 emerging in this group. Overall, the exploration of the diversity of these isolates, Accepted 29 June 2020 mutational hotspots and pan-genome will shed light on the evolution and Published 17 July 2020 pathogenicity of SARS-CoV-2 and help in developing putative methods of diagnosis Corresponding author and treatment. Gaurav Sharma, [email protected] Academic editor Subjects Bioinformatics, Evolutionary Studies, Genomics, Microbiology, Virology Abhishek Kumar Keywords COVID-19, SARS, Coronavirus, Pandemic, Viral disease, Genome, Bioinformatics, Genomics, Mutations Additional Information and Declarations can be found on page 24 INTRODUCTION DOI 10.7717/peerj.9576 The emergence of coronavirus disease 2019 (COVID-19) has sent people from across the Copyright world into a state of high alert, and they are trying to survive complete or partial lockdowns 2020 Parlikar et al. (Lescure et al., 2020). The causative agent of this pandemic is a novel Severe Acute Distributed under Respiratory Syndrome (SARS) Coronavirus, Severe Acute Respiratory Syndrome-related Creative Commons CC-BY 4.0 Coronavirus-2 (SARS-CoV-2). Since 2000, the world has witnessed two major coronavirus outbreaks: the first, of SARS, caused by SARS-CoV, in 2002 in China (Zhong et al., 2003),

How to cite this article Parlikar A, Kalia K, Sinha S, Patnaik S, Sharma N, Vemuri SG, Sharma G. 2020. Understanding genomic diversity, pan-genome, and evolution of SARS-CoV-2. PeerJ 8:e9576 DOI 10.7717/peerj.9576 and the second, of Middle East Respiratory Syndrome (MERS) in 2012 in Saudi Arabia (Zaki et al., 2012). Amongst coronaviruses, SARS-CoV and MERS are known as highly pathogenic viruses that cause pneumonia and respiratory system infections (Coleman & Frieman, 2014). Besides these, several other low pathogenic strains are also known that cause mild respiratory diseases and infect majorly the upper respiratory tract (Han et al., 2020). After the diagnosis of the first reported case in Wuhan, China, and genome sequencing of the Wuhan Hu-1 strain (SARS-CoV-2 reference genome), scientists across the world are striving to develop vaccines and diagnostic kits and understand its evolution and epidemiology. Several potential drug molecules to combat COVID-19 are undergoing clinical trials. As no vaccine or drugs are available at present, the only ways known to contain transmission are social distancing, regular washing of hands, and covering mouth and nose while coughing or sneezing, thereby, preventing direct or indirect physical contact and containing the transfer of infected respiratory droplets. This study comprises a detailed overview of the history and origin of COVID-19 along with the taxonomy, nomenclature and genome structure of its causative agent SARS-CoV-2. Following that, we analyze 167 SARS-CoV-2, 312 SARS-CoV and five Pangolin CoV genomes to study their genomic variability, evolution and mutation hotspots. We also characterize the pan-genome of the genus Betacoronavirus (including the recently sequenced Pangolin CoV) to classify their proteins-coding regions into core, accessory, and unique categories within each genome group.

TAXONOMY AND NOMENCLATURE Coronaviruses are members of family that include enveloped positive sense ssRNA containing viruses, taxonomically classified amongst order in the realm . Like other viruses, they thrive in the gray area of living and non-living as they do not have the machinery to survive outside a living host cell. Therefore, they cannot replicate outside the host. The Riboviruses or realm Riboviria members replicate by utilizing RNA-dependent RNA-polymerases (RdRps). Their genetic material that is, RNA serves as messenger RNA (mRNA), directly translating into proteins and undergoes genetic recombination in the presence of another viral genome in the host cell (Barr & Fearns, 2010). Members of order Nidovirales have a positive sense linear (capped and polyadenylated) RNA molecule, known for causing severe infections. The word “Nido” stands for “nest” implying that all Nidoviruses express subgenomic mRNAs (sgRNA) in a nested form. They are known to infect hosts within three important classes in , namely (Mammalia), birds (Aves) and fishes (Pisces). Coronaviruses (family Coronaviridae; subfamily Orthocoronavirinae) commonly infect both mammals and birds; (family Tobaniviridae; subfamily Torovirinae) infect specifically mammals and Bafinivirus (family Tobaniviridae; subfamily Piscanivirinae) infect fishes. Coronavirus virions are spherical, Torovirus are crescent-shaped, and Bafinivirus are rod-like; however, all members of this family are adorned with a crown or club-shaped surface proteins called the Spike proteins. Because of their crown-like morphology observed in electron micrographs, they are named Coronaviruses. Family Coronaviridae has a single

Parlikar et al. (2020), PeerJ, DOI 10.7717/peerj.9576 2/31 subfamily Orthocoronavirinae and its members form a monophyletic clade. They are further classified into four defined genera based on phylogenetic studies of conserved genome regions and their serological cross-reactivity: Alphacoronavirus, Betacoronavirus, Gammacoronavirus and Deltacoronavirus. Under genus Betacoronavirus, five lineages diverge: Lineage A (subgenus: Embecovirus) includes HCoV-OC43 and HCoV-HKU1, Lineage B (subgenus: Sarbecovirus) includes SARS-CoV and SARS-CoV-2, Lineage C (subgenus: Merbecovirus) contains Bat coronavirus BtCoV-HKU4 and BtCoV-HKU5, Lineage D (subgenus: Nobecovirus) includes Rousettus Bat coronavirus BtCoV-HKU9 and lineage E (subgenus: Hibecovirus) includes Bat Hp-betacoronavirus/Zhejiang2013. The subgenus Sarbecovirus members tend to undergo deep recombination events leading to the formation of new alleles (Boni et al., 2020) and zoonotic transfer.

ORIGIN AND HISTORY OF THE CORONAVIRUS INFECTIONS The first Coronavirus associated disease was described in 1931 (Peiris, 2012). The viruses were first isolated from humans in the UK and USA around the same time. The first isolated specimen was B814, taken from a boy exhibiting symptoms of the common cold in 1960 and cultured in human embryonic tracheal organ culture (Tyrrell & Bynoe, 1966). The specimen was deemed distinct from Adeno-, Entero- or Rhinoviruses, being ether labile and propagated only in organ cultures (Kendall, Bynoe & Tyrrell, 1962; Tyrrell & Bynoe, 1966). Later, the gradual discoveries of 229E in 1966 (Hamre & Procknow, 1966), OC43 in 1967 (McIntosh, Becker & Chanock, 1967), SARS-CoV in 2002, Bovine CoV in 2006, MERS-CoV in 2012 and SARS-CoV-2 in 2019 (Zhu et al., 2020) were major significant steps in coronavirus history. Coronaviruses have been associated with mild respiratory infections and cold in humans and animals (specifically poultry and livestock). These viruses were not considered treacherous until the SARS epidemic that emerged in China in November 2002. The associated SARS-CoV species were first classified as a separate clade under Betacoronavirus after this outbreak, post which new species including those of human coronaviruses (hCoVs) have been identified. The 2002 SARS epidemic ultimately infected 8,096 people, with 774 deaths. Later, MERS-CoV emerged in Saudi Arabia in September 2012, causing 2,494 confirmed cases of infection and 858 deaths across 27 countries (https://www.who.int/emergencies/diseases/en/). The current pandemic, COVID-19, is caused by the , initially designated “2019 novel coronavirus” or “2019-nCoV”, later re-named SARS-CoV-2, segregating it as a novel SARS species (Huang et al., 2020). Now, it forms a new clade in the subgenus Sarbecovirus and is established as the 7th member of the family which infects humans (Simmonds et al., 2017; Gorbalenya et al., 2020; Zhu et al., 2020). The first patients of COVID-19 were identified in late December 2019 when several local health centers reported clusters of patients with pneumonia caused by an unknown pathogen, epidemiologically linked to living animal and seafood wholesale market in Wuhan. The Chinese Center for Disease Control and Prevention called a rapid action team for the epidemiological investigation. Bronchoalveolar-lavage fluids were propagated on human respiratory tract epithelial cell

Parlikar et al. (2020), PeerJ, DOI 10.7717/peerj.9576 3/31 cultures for 4–6 weeks. Extracted nucleic acid was identified as viral RNA by real-time reverse transcription PCR targeting the RdRp region of pan-BetaCoV. The electron micrographs revealed the presence of distinctive spikes of length 9–12 nm, establishing morphological resemblance with the Coronaviridae family (Zhu et al., 2020). Following this, SARS-CoV-2 infection has been spreading worldwide. As of 29 April 2020 (submission date), 213 countries and territories around the world are under surveillance, with more than 3.15 million confirmed cases and over 218,490 reported fatalities, and these numbers continue to exponentially increase day by day. Since December 2019, numerous SARS-CoV-2 genomes have been sequenced, which allows researchers to address many critical questions regarding its origin, transmission, epidemiology, and most importantly to design vaccines and viral detection kits. Viral genome sequencing and multiple sequence alignment of SARS-CoV-2 genomes have revealed its identity with Bat CoV RaTG13 (MN996532) and Pangolin-CoV (Cui, Li & Shi, 2019; Boni et al., 2020; Rehman et al., 2020). A comparative phylogenetic study revealed that the Pangolin-CoV genes shared a high level of sequence identity with SARS-CoV-2 genes, specifically orf1b, the spike (S), orf7a and orf10 genes (Zhang, Wu & Zhang, 2020). Higher identity of S protein sequence implies the functional similarity between Pangolin-CoV and SARS-CoV-2 as compared to the RaTg1 strain (Zhang, Wu & Zhang, 2020) further suggesting Pangolin as an intermediate host for infection and natural reservoir of SARS-CoV-2-like strains.

CORONAVIRUS GENOME ORGANIZATION Coronaviruses belong to family Coronaviridae and comprise enveloped viruses that replicate in the cytoplasm of animal host cells; distinguished by the presence of a single-stranded positive-sense RNA genome (about 30 kb in length) (Marra et al., 2003). Coronaviruses possess the largest genomes among all known RNA viruses (Fehr & Perlman, 2015), ranging from 25.32 kb in Porcine Deltacoronavirus PD-CoV to 31.775 kb in Beluga whale coronavirus SW1 (information available on NCBI Virus webpage https://www.ncbi.nlm.nih.gov/labs/virus/). SARS-CoV-2 is a spherical or pleomorphic enveloped particle, containing single-stranded, positive-sense RNA associated with a nucleoprotein, lying inside a capsid composed of matrix protein (Lai et al., 2020). The SARS-CoV-2 reference genome (NC_045512; Wuhan Hu-1 strain) is 29,903 nucleotides long, which consists of A (8,954 nt, 29.94%), G (5,863 nt, 19.61%), C (5,492 nt, 18.37%), T (9,594 nt, 32.08%). Compared to SARS-CoV-2, the SARS-CoV reference genome (NC_004718; Tor1 strain) is a little larger, that is, 29,751 nucleotides and its nucleotide composition is marginally different: A (8,481 nt, 28.50), G (6,187 nt, 20.80%), C (5,940 nt, 19.97%), T (9,143 nt, 30.73%). We identified that the GC content (average of all genomes under this study) of SARS-CoV, SARS-CoV-2, and Pangolin CoV is 40.81%, 38% and 38.52% respectively. Like other Betacoronavirus members, this genome contains two flanking untranslated regions (UTRs), 5′-UTR (265 nt) and 3′-UTR (229 nt) sequences. A typical CoV genome contains at least six ORFs (Graham et al., 2008). The genomes of all coronaviruses usually encode four well-conserved and characterized structural proteins:

Parlikar et al. (2020), PeerJ, DOI 10.7717/peerj.9576 4/31 S (spike), E (envelope), M (membrane) and N (nucleocapsid) (Graham et al., 2008; Fuk-Woo Chan et al., 2020), encoded by sgRNA 2, 4, 5 and 9a respectively present at 3′end and sharing 1/3 of the genome in SARS-CoV (Graham et al., 2008). SARS-CoV-2 Proteins: The SARS-CoV-2 genome has 12 protein-coding regions, which encode two categories of proteins: first, Structural proteins, which give characteristic structure to the virus and are involved in viral entry, and second, Non-structural (NS) or accessory proteins, which help in viral replication (Marra et al., 2003; Graham et al., 2008; Hu et al., 2018; Fuk-Woo Chan et al., 2020). Most of the information available so far about these proteins is based on SARS-CoV and other members of the genus Betacoronavirus. STRUCTURAL PROTEINS Spike protein The S protein is a 1,273 aa trimeric, cell-surface glycoprotein consisting of two subunits (S1 and S2), encoded by the gene S. The S1 subunit is responsible for receptor binding (Hu et al., 2018). This is also important for mediating the fusion of viral and host membranes. Both these processes are critical for virus entry into host cells (Tan, Lim & Hong, 2005). SARS-CoV-2 has a polybasic cleavage site RRAR, at the junction of S1 and S2 subunits, which enables effective cleavage by Furin and other proteases (Andersen et al., 2020; Zhang, Wu & Zhang, 2020). Furthermore, one proline residue is also inserted at the leading cleavage site of SARS-CoV-2, making “PRRA”, the final inserted sequence. Comparison of this site within different Betacoronavirus members revealed that the insertion of a Furin cleavage site at the S1–S2 junction of SARS-CoV-2 enhances cell-cell fusion (Follis, York & Nunberg, 2006; De Haan et al., 2008; Andersen et al., 2020). Similarly, an effective cleavage of the polybasic cleavage site in Hemagglutinin esterase protein facilitated the inter-species transmission of MERS-like coronaviruses from bats to humans (Andersen et al., 2020). Majorly, variations in the S protein are responsible for two attributes, tissue tropism and host ranges of different CoVs (Hu et al., 2018). It was observed that S protein underwent several drastic changes during the viral infection. The S protein of SARS-CoV-2 is more vulnerable to mutations, especially in the amino acids associated with the spike protein-cell receptor interface. Interestingly, the amino acid sequence represented ~19% changes with four major insertions and ~81% sequence similarity in contrast to SARS-CoV (Ahmed, Quadeer & McKay, 2020; Rehman et al., 2020).

Nucleocapsid protein N proteins are 419 aa long phosphoproteins weighing ~46kDa. These have helix binding properties. Coronavirus N proteins possess three easily distinguishable and highly conserved domains; the N-terminal domain (NTD) (N1b), the C-terminal domain (CTD) (N2b) and the N3 region (Grossoehme et al., 2009; Cong et al., 2019). The N protein plays a vital role in virion structure formation as it is localized in both the replication- transcription region and the ERGIC (Endoplasmic reticulum-Golgi apparatus Intermediate Compartment), the site of virion assembly (Tok & Tatar, 2017).

Parlikar et al. (2020), PeerJ, DOI 10.7717/peerj.9576 5/31 The N protein of SARS-CoV is primarily expressed during the early stages of SARS-CoV infection (Surjit & Lal, 2008; Grossoehme et al., 2009; Cong et al., 2019). It is important as a diagnostic marker as it induces a strong immune response (Hu et al., 2018). Evolutionary analysis has shown 89–91% sequence homology between the N proteins in the SARS-CoV and Bat SL-CoV (Hu et al., 2018; Ahmed, Quadeer & McKay, 2020). The N protein can bind to Nsp3 protein to help bind the genome to replication/ transcription complex (RTC) (Cong et al., 2019) and package the encapsulated genome into virions (Chen, Liu & Guo, 2020). IBV, SARS CoV and MHV N protein undergo phosphorylation and allow discrimination of non-viral mRNA binding from viral mRNA binding (Cong et al., 2019). N protein is also an antagonist of interferon (IFN) (Kopecky-Bromberg et al., 2007) and virus-encoded repressor of RNA interference (RNAi), therefore, it appears to benefit the viral replication (Chen, Liu & Guo, 2020).

Membrane protein The most abundant structural protein in the genome is the membrane (M) glycoprotein, which spans across the membrane bilayer three times. Thus, M glycoproteins have three transmembrane regions, leaving a short NH2-terminal domain outside the virus and a long COOH terminus (cytoplasmic domain) inside the virion (Tok & Tatar, 2017; Mousavizadeh & Ghasemi, 2020). The M protein plays a key role in regenerating virions in the cell. M proteins undergo glycosylation in the Golgi apparatus which is crucial for the virion to fuze into the cell and to make antigenic proteins. As mentioned before, N protein forms a complex by binding to genomic RNA and M protein triggers the formation of interacting virions in the endoplasm (Tok & Tatar, 2017).

Envelope glycoprotein E glycoproteins are composed of approximately 76–109 amino acids in other coronavirus species. E protein plays a crucial role in the assembly and morphogenesis of virions within the host. About 30 amino acids in the N-terminus of the E protein enable binding to the viral membranes. Co-expression of E and M proteins with mammalian expression vectors enable the formation of virus-like structures within the cell (Tok & Tatar, 2017).

NON-STRUCTURAL (ACCESSORY) PROTEINS ORF1ab and ORF1a The SARS 5′ proximal gene 1 (~22 kb) comprises two long overlapping open reading frames, orf1a and orf1ab, encoding for polyproteins 1a and 1ab. These polyproteins are cleaved by viral proteases that is, PLpro (papain-like protease) and 3CLpro (chymotrypsin- like protease) into 16 non-structural proteins, Nsp1–Nsp16. Expression of orf1ab involves a (−1) ribosomal frameshift upstream of orf1a stop codon, thus forming the full-length ORF1ab. ORF1a polyprotein of ~500 kDa encodes for Nsp 1–11, while ORF1ab polyprotein of ~800 kDa encodes for all Nsp1–16. The ORF1a and ORF1ab polyproteins undergo post-translational modifications to form mature proteins and hence are not

Parlikar et al. (2020), PeerJ, DOI 10.7717/peerj.9576 6/31 detected during infection (Graham et al., 2008; Lei, Kusov & Hilgenfeld, 2018). A brief description of these proteins is provided below:

i. Nsp1: SARS-CoV Nsp1 is an N-terminal protein coded by the first gene of ORF1ab, which plays an important role in SARS-CoV pathogenesis by inhibiting the host gene expression via binding to the small subunit of the ribosome and then truncating the translation activity. Nsp1 induces endonucleolytic RNA cleavage of a capped host mRNA, making it translationally incompetent (Tanaka et al., 2012). Nsp1 inhibits the type-1 IFN expression and other antiviral signaling pathways, thus suppressing the innate immune system. The viral mRNA is resistant to Nsp1- mediated RNA cleavage in the infected host cells, however, the mechanism through which it achieves this is unknown (Tanaka et al., 2012). ii. Nsp2: the function of Nsp2 in SARS-CoV is unknown. The experimental results show that the deletion of the nsp2 gene from MHV and SARS-CoV was tolerated with modest growth and some RNA defect. Also, there was no observable effect on the polyprotein processing in the mutants. Another evidence shows that neither the Nsp2 protein nor the nsp2 gene is involved in pathogenesis (Graham et al., 2005). iii. Nsp3: the multi-domain SARS-CoV Nsp3 is a 215 kDa glycosylated transmembrane multidomain protein involved in viral replication and transcription. It may serve as a scaffolding protein for numerous other proteins (Angelini et al., 2013). The organization of various Nsp3 domains differs amongst the Coronavirus genomes. However, eight domains are common to all CoVs: ubiquitin-like domain 1 (Ubl1), the glutamate-rich acidic domain also known as “hypervariable region”,anX macrodomain, ubiquitin-like domain 2 (Ubl2), PL2pro (papain-like protease 2), ectodomain 3Ecto, also known as “zinc-finger domain”, and domains Y1 and CoV-Y (unknown functions). Nsp3 releases itself, Nsp1 and Nsp2 from the polyproteins. It alters the post-translational modification of host proteins to antagonize the innate immune response by de-MARylating, de-PARylating, de-ubiquitinating, or de-ISGylating the host proteins. Meanwhile, Nsp3 modifies itself by the N-glycosylation of the ectodomain and can also interact with host proteins (such as RCHY1) to enhance viral survival (Lei, Kusov & Hilgenfeld, 2018). iv. Nsp4: it is known that Coronaviruses induce double-membrane vesicles (DMVs). Any alteration in the DMVs’ morphology impairs the RNA synthesis and therefore growth of Nsp4 mutants (Gadlage et al., 2010). v. Nsp5: the Nsp5 protease also known as 3CLpro or Mpro, cleaves the Nsp peptides at 11 cleavage sites (Stobart et al., 2013). Nsp5 of Porcine Deltacoronavirus (PDCoV) is a type I IFN antagonist. It disrupts the IFN signaling pathway by cleaving the NF-κB essential modulator (NEMO), thus impairing the host’s ability to activate the IFN response (Zhu et al., 2017). vi. Nsp6: Nsp6, along with Nsp3 and Nsp4, has membrane proliferation ability. Thus, it can induce perinuclear vesicles localized around the microtubule-organizing center and DMV formation (Angelini et al., 2013). The coronavirus Nsp6 protein

Parlikar et al. (2020), PeerJ, DOI 10.7717/peerj.9576 7/31 restricts the autophagosome expansion either directly or indirectly through starvation or chemical inhibition of MTOR signaling (Cottam, Whelband & Wileman, 2014). vii. Nsp7: the Nsp12 RNA-dependent RNA polymerase cannot act on its own and depends upon stimulation by a complex of Nsp7 and Nsp8. Thus, Nsp7 acts as one of the cofactors along with Nsp8 to stimulate Nsp12. It is hypothesized that Nsp7 and Nsp8 heterodimers stabilize the Nsp12 regions involved in RNA binding; also, Nsp8 acts as a minor RdRp (Kirchdoerfer & Ward, 2019). viii. Nsp8: Nsp8 is a 22 kDa non-canonical polymerase shown to be capable of de novo RNA synthesis with low fidelity on single-stranded RNA templates. The ‘main’ RdRp in Coronaviruses is the Nsp12 which employs a primer-dependent initiation mechanism. These observations led to a hypothesis that Nsp8 would act as an RNA primase and synthesize a short primer that will be extended by Nsp12 (Te Velthuis, Van den Worm & Snijder, 2012). ix. Nsp9: Nsp9 is an RNA-binding subunit in the replication complex in all coronaviruses (Zeng et al., 2018). x. Nsp10: SARS-CoV Nsp10 binds to and stimulates the exoribonucleases, Nsp14, and Nsp16, thus playing an important role in the replication-transcription complex (RTC) formation (Bouvet et al., 2014). Nsp10 acts as an essential co-factor in triggering Nsp16 2′O-MTase activity, which suggests its involvement in the regulation of capping of viral RNA (Lugari et al., 2010). xi. Nsp11: the exact function of Nsp11 in Coronaviridae is unknown. Arterivirus Nsp11 was identified as NendoU (Nidoviral uridylate-specific endoribonucleases) that plays a role in the viral life cycle. Nsp11 could be essential to produce helicase (Woo et al., 2005). xii. Nsp12: Nsp12 is a 102 kDa RNA-dependent RNA polymerase (RdRp) featuring all conserved motifs of the known RdRps. It is the most conserved protein in coronaviruses and assumes a center stage in the viral RTC. The Nsp7/Nsp8 complex enhances the binding of Nsp12 to RNA. Nsp8 is the second, non-canonical RdRp in coronaviruses (Subissi et al., 2014). xiii. Nsp13: Nsp13 is a 66.5 kDa multi-functional protein. The N terminal has a zinc-binding domain while the C-terminal harbors a helicase domain-containing conserved motif of superfamily-1 helicases (Subissi et al., 2014). xiv. Nsp14: Nsp14 is a 60 kDa bifunctional enzyme. Its N-terminal is a 3′–5′ exoribonuclease involved in the mismatch repair system, improving the fidelity of virus replication via RNA proofreading. Yeast trans-complementation studies have shown that the C-terminal domain of SARS-CoV Nsp14 is an S-adenosylmethionine-dependent guanine-N7-methyltransferase (MTase) (Subissi et al., 2014; Zeng et al., 2016) with no RNA sequence specificity (Subissi et al., 2014). xv. Nsp15: SARS-CoV Nsp15 is a uridine specific ribonuclease that cleaves the 3′ of uridylates, producing 2′–3′ cyclic phosphate ends (Subissi et al., 2014).

Parlikar et al. (2020), PeerJ, DOI 10.7717/peerj.9576 8/31 xvi. Nsp16: Nsp16 is a 2′-O-Methyl Transferase (Zeng et al., 2016) and requires Nsp10 as a stimulatory factor to exhibit its MTase activity (Chen et al., 2011). Eukaryotic mRNA has a unique 5′ cap; to evade the host machinery, the viral RNA must be made indistinguishable from the host mRNA by capping the viral RNA (Menachery, Debbink & Baric, 2014). Both Nsp14 and Nsp16 are involved in the modification of viral RNA cap structure (Zeng et al., 2016) to maintain the viral RNA stability, ensure protein translation and immune escape (Lugari et al., 2010; Subissi et al., 2014). The SARS-CoV genome encodes two SAM-dependent methyltransferases (Zeng et al., 2016); Nsp14 N7 methyltransferase, and Nsp16 2′-O-methyl transferase, that methylates the RNA at N7 of guanosine and ribose 2′O sites, respectively (Chen et al., 2011). This process also involves Nsp10 for stabilizing the SAM binding region and RNA binding of Nsp16 (Subissi et al., 2014).

ORF3a protein The orf3a gene lies between S and E genes and encodes for a transmembrane protein (McBride & Fielding, 2012). SARS-CoV-2 ORF3a protein shows 97.82% homology to NS3 of Bat coronavirus RaTG13 and 72% homology to SARS-CoV ORF3a protein (Issa et al., 2020). It is localized to the cell membrane and partly to the ER and the Golgi perinuclear space in the host cell. ORF3a is known to activate PERK (PKR-like ER kinase) signaling pathway through which the viral particles escape ER-associated degradation and the resulting ER stress also induces apoptosis. ORF3a of both SARS-CoV and SARS-CoV-2 have an APA3_viroporin (a pro-apoptosis protein) conserved domain (McBride & Fielding, 2012; Issa et al., 2020). ORF3a has been shown to activate NF-κB and JNK. This leads to the upregulation of the chemokine named RANTES (Regulated upon Activation, Normal T Cell Expressed and Secreted) along with IL-8 in A549 and HEK293T cells (Liu et al., 2014). The extracellular N-terminus of protein can evoke a humoral immune response. Binding of ORF3a protein to caveolin-1 may be required for viral uptake and release (Liu et al., 2014). ORF3a can augment the activation of the p38 MAPK pathway and induce the mitochondria to leak inducing apoptosis (Liu et al., 2014).

ORF6 protein SARS-CoV ORF6/Protein 6 is a 63-aa polypeptide. It has an amphipathic 1–40 aa N-terminal portion and a highly polar C-terminal. The amino acid residues 2–32 in the N-terminal form an a-helix which is embedded in the cell membrane. There are two signal sequences in the C-terminal; aa 49–52 sequence YSEL targets proteins for incorporating into endosomes and the acidic tail which signals ER export. This protein is localized in the ER and Golgi apparatus. ORF6 is incorporated into VLPs, when co-expressed with SARS-CoV S, M and E structural proteins. It enhances viral replication, thus serving an important role in pathogenesis during SARS-CoV infection (Liu et al., 2014). ORF6 is also well known as a β-interferon antagonist; its overexpression inhibits nuclear import of STAT1 in IFN-β treated cells. Along with ORF3b and N protein, the SARS-CoV ORF6 inhibits activation of IRF-3 via phosphorylation and binding of IRF-3 to

Parlikar et al. (2020), PeerJ, DOI 10.7717/peerj.9576 9/31 a promoter with IRF-3 binding sites. IRF-3 is an important protein for the expression of IFN. ORF6 may disturb the ER/Golgi transport necessary for the interferon response (Kopecky-Bromberg et al., 2007). During SARS-CoV infection, the C-terminal domain of ORF6 modulates the host protein nuclear transport and type-I interferon signaling, thus playing an important role in immune evasion (Liu et al., 2014).

ORF7a protein SARS-CoV ORF7a/ Protein 7a is a 122 aa type-I transmembrane protein (Liu et al., 2014). It consists of 15 aa N terminal signal peptide sequence, an 81 aa ectodomain, 21 aa C terminal transmembrane domain, and a cytoplasmic tail of 5 aa residues. The ectodomain of ORF7a binds to human LFA-1 (lymphocyte function-related antigen 1) on Jurkat cells via the alpha integrin-I domain of LFA-1. This suggests that probably LFA-1 is a binding factor or receptor for SARS-CoV on human leukocytes (McBride & Fielding, 2012). SARS-CoV ORF7a is localized in the ER-Golgi Intermediate Compartment (ERGIC), the assembly point of coronaviruses. It interacts with M and E structural proteins, indicating a possible role in viral assembly during SARS-CoV replication. In mammalian cells infected with SARS-CoV, ORF7a has been known to induce caspase-dependent apoptosis by cleaving PARP (poly(ADP-ribose) polymerase), an apoptotic marker. Its pro-apoptotic nature is a result of the interaction between its

transmembrane domain with Bcl-XL, an anti-apoptotic protein of the Bcl-2 family (Liu et al., 2014). Like ORF3a, overexpression of ORF7a activates NF-κB and c-Jun N- terminal kinase (JNK), augmenting the production of pro-inflammatory cytokines such as interleukin 8 (IL-8) and RANTES (Liu et al., 2014). ORF7a overexpression induced apoptosis can occur via a caspase-3 dependent pathway as well as p38 MAPK pathway (McBride & Fielding, 2012).

ORF7b protein SARS-CoV ORF7b/Protein 7b is a 44 aa long, highly hydrophobic, integral transmembrane protein with a luminal N-terminal and cytoplasmic C-terminal. The expression of ORF7a is reduced significantly when the orf7a start codon (upstream of 7b) is mutated to a strong Kozak sequence or when an additional start codon AUG is inserted upstream of the orf7b start codon. Thus, ORF7b may be expressed by “leaky scanning” of the ribosome (Liu et al., 2014). The Golgi-restricted localization of ORF7b was attributed to the transmembrane domain (Liu et al., 2014). However, the role of ORF7a and ORF7b during the replication of SARS-CoV remains uncertain (McBride & Fielding, 2012).

ORF8 protein Human SARS-CoV-2 isolated from early patients, Civet SARS-CoV and other bat SARSr-CoV contains the full-length ORF8 (Fuk-Woo Chan et al., 2020). Two genomes of genus Betacoronavirus isolated from horseshoe bat, SARS-Rf-BatCoV YNLF_31C and YNLF_34C are 93% identical to the human and civet SARS-CoV genome. However, all Human SARS-CoV isolates from mid and late-phase patients contain a signature 29

Parlikar et al. (2020), PeerJ, DOI 10.7717/peerj.9576 10/31 nucleotide deletion in the orf8 gene, splitting it into orf8a and orf8b. This suggests that ORF8 may play a role in interspecies transmission. The generation of civet SARSr-CoV could be a result of a potential recombination event between SARSr-Rf-BatCoVs and SARSr-Rs-BatCoVs identified around ORF8 (Lau et al., 2015). Interestingly, the new SARS-CoV-2 ORF8 shares very less homology with the conserved ORF8 or ORF8b isolated from Human SARS-CoV or its related viruses, Civet SARS-CoV, Bat-CoV YNLF_31C andYNLF_34C. The ORF8 lacks any known functional domain or motif. SARS-CoV ORF8b has an aggregation motif VLVVL (75–79aa) which is known to activate NLRP3 inflammasomes and trigger intracellular stress pathways (Fuk-Woo Chan et al., 2020), but this is not conserved in SARS-CoV-2. Notably, both ORF3 and ORF8 in SARS-CoV-2 are highly divergent from the interferon antagonist ORF3a and inflammasome activator ORF8b in SARS-CoV (Yuen et al., 2020).

ORF10 protein The SARS-CoV-2 ORF10 protein has no homologs in SARS-CoV and any other members of genus Betacoronavirus (Xu et al., 2020). Viruses are known to sabotage ubiquitination pathways for replication and pathogenesis. The ORF10 interacts with specifically the CUL2ZYG11B complex and the rest of the members of the Cullin-2 (CUL2) RING E3 ligase complex. ZYG11B degrades substrates with exposed N-terminal glycine residues. By studying the ORF10 interactome, it was observed that it interacts the most with the CULZYG11B complex (Gordon et al., 2020). This evidence shows that ORF10 might bind to this complex and hijack its ubiquitination of restriction factors (Gordon et al., 2020).

METHODOLOGY (GENOMES, DATABASES, AND TOOLS USED IN THIS STUDY) Complete genomes, belonging to SARS-CoV-2 (167 genomes), SARS-CoV (312 genomes) and Pangolin CoV (five genomes) were downloaded from NCBI Virus webpage (https://www.ncbi.nlm.nih.gov/labs/virus/) as on 29 March 2020. Similarly, RefSeq assemblies of 56 suborder Cornidovirineae genomes (including 18 from Betacoronavirus genus) along with the genome of Paguronivirus-1 (as an outgroup) were downloaded as on 1 April 2020. The latest version of the Virus database was downloaded from the NCBI Virus page as on 6 April 2020. For alignments of the studied genomes, as required for phylogeny and mutational studies, we have used MUSCLE v3.8.1551 and MEGAX (Edgar, 2004; Kumar et al., 2018) tools as per their default settings. To generate maximum likelihood phylogenies of genome-based alignments, RAxML v8.2.12 (Stamatakis, 2014) was used to perform rapid bootstrap analysis and search for bestscoring ML tree in one program run (“-f a” parameter) using GTRGAMMA as nucleotide substitution model (-m) with 100 bootstrap values. After obtaining the newik trees from RAxML, an online iTOL platform (Letunic & Bork, 2019) was used to visualize phylogenetic trees as shown in this study. For each experiment in respective result sections, we have provided comprehensive details about the genomes under study, research methodology, and used parameters.

Parlikar et al. (2020), PeerJ, DOI 10.7717/peerj.9576 11/31 For the pangenome study, Proteinortho v6.0.12 (Lechner et al., 2011) was used with an E value cutoff 1E−05 for each identified hit along with other default settings. It utilizes diamond v0.8.36.98 and BLAST 2.8.1+ (Christiam et al., 2009; Buchfink, Xie & Huson, 2014) to identify ortholog proteins using reciprocal blast strategy. For the identification of pairwise average nucleotide identity (ANI) values within all genomes under study, PYANI v0.1.2 (Pritchard et al., 2016) program was used. Throughout this study, the Wuhan Hu-1 strain has been used as a reference for SARS-CoV-2 genomes. RESULTS AND DISCUSSION Human SARS-CoV-2 genome is quite disparate as compared to other Betacoronavirus genomes The SARS outbreak in 2003 and MERS in 2012 had already set a precedent for widespread genome analysis, therefore, with the recent emergence of COVID-19 in November 2019, extensive sequencing and genome analysis of SARS-CoV-2 isolates are being performed. Since February 2020, several Bat and Pangolin coronaviruses have also been sequenced to understand the probable origin of SARS-CoV-2. In this study, we have analyzed 312 SARS-CoV, 167 SARS-CoV-2 and 5 Pangolin CoV genomes to understand their genomic conservation, unique genes, mutational hotspots and respective evolution and origin.

Phylogeny relationship at suborder Cornidovirineae level SARS-CoV-2 is a member of suborder Cornidovirineae, family Coronaviridae, genus Betacoronavirus, and subgenus Sarbecovirus. In this study, we have tried to understand the strain diversity and evolutionary relationships at different taxonomy levels in a top-bottom classification scheme. For this analysis, the complete whole-genome sequences of 56 suborder Cornidovirineae reference genomes were analyzed, which included members from five genera i.e. 23 Alphacoronavirus,20Betacoronavirus,10Deltacoronavirus and three Gammacoronavirus and the genome of Paguronivirus-1 was included as an outgroup for this study. Their genome length varied from 25,425-31,686 nucleotides. ClustalW alignment (using MEGA-X) generated an alignment of 35,601 nucleotides off which 17,284 conserved sites were stripped and used to generate an ML-based phylogeny using RAxML. We noted that this phylogeny is unambiguously able to demarcate all the genera into their respective monophyletic clades (Fig. S1). The significant variations amongst branch length distances indicate the relative diversity within each genus.

Phylogeny relationship at genus Betacoronavirus level We generated the phylogeny of 18 Betacoronavirus reference strains (along with five Pangolin CoV strains) using their complete genomes ranging within 29,114–31,526 nucleotides. ClustalW based whole genome alignment using MEGAX generated an alignment of length 33,346 from which 24,555 conserved nucleotide sites were stripped and used to generate ML phylogeny (Fig. S2). Three distinct clades were seen in the phylogeny: the first clade includes strains from subgenus Sarbecovirus, Nobecovirus and Hibecovirus, second with Embecovirus members, and the third clade of members of

Parlikar et al. (2020), PeerJ, DOI 10.7717/peerj.9576 12/31 subgenus Merbecovirus. Within the first clade, both members of subgenus Sarbecovirus (SARS-CoV and SARS-CoV-2) are grouped with Pangolin CoV strains, which is supported by their percentage identity. All Pangolin CoV genomes are >99.8% identical to each other whereas their closest relatives are the SARS-CoV-2 reference strain (~85% identity), followed by the SARS-CoV reference strain (~79% identity). These three groups are closest (~57% identity) to their sister branch occupied by subgenus Hibecovirus strain: Bat Hp-betacoronavirus Zhejiang 2013. Two reference strains of another subgenus Nobecovirus (~67% identity with each other) form a separate clade that is closest (~52% identity with other group members) to the above-mentioned clade comprising subgenus Sarbecovirus and Hibecovirus. The second clade only has representatives of subgenus Embecovirus including Murine hepatitis virus, Bovine CoV, Rabbit CoV, Rat CoV, etc. Some of these strains form respective subclades in the phylogeny based on their closeness. Strains belonging to subgenus Merbecovirus that include the MERS, form the third clade. MERS genomes (England 1 and MERS Co.) are 99.7% identical to each other and therefore form a separate sub-clade. As observed from their phylogenetic branch lengths and percentage identity matrix, all subgenera are quite distinct from each other. As expected, members of each subgenus are closer to each other, therefore forming respective monophyletic clades.

Phylogeny relationship at subgenus Sarbecovirus level To understand the similarities and differences among the strains of subgenus Sarbecovirus, we analyzed the complete genomes of 103 SARS-CoV-2 and 312 SARS-CoV isolates including the reference strains of Wuhan Hu-1 (SARS-CoV-2) and Tor2 (SARS-CoV). Since Pangolin CoV belongs to subgenus Sarbecovirus, we also included 5 Pangolin CoV genome sequences in our dataset and aligned them using MUSCLE. The genome size of SARS-CoV, Pangolin CoV and SARS-CoV-2 strains were in the range of 29,013–30,311 nucleotides, 29,795–29,806 nucleotides and 29,325–29,945 nucleotides, respectively. The alignment of length 31,718 nucleotides was stripped to a conserved region of 26,762 nucleotides (sites with gaps in between them were excluded) which was used to generate a maximum likelihood phylogeny using RAxML (Fig. 1; Fig. S3). We noted from the phylogeny that SARS-CoV-2 strains form a single monophyletic clade as compared to the SARS-CoV. SARS-CoV strain RaTG13 procured from the bat in 2013 from Yunnan, China is closest to SARS-CoV-2 strains, suggesting that they might have diverged from a common ancestor. Average Nucleotide Identity (ANI) studies also indicated that the SARS-CoV-2 reference genome and the Bat RaTG13 branch share 96.11% identity at the genome level, a finding supported by earlier reports (Lv et al., 2020; Zhou et al., 2020). The next closest strains in the phylogeny were those of Pangolin CoV, all forming a subclade. Other close relatives are the SARS-CoV strains Bat CoVZC45 and Bat CoVZXC21, procured from Zhoushan, eastern China in 2018. The phylogeny suggests that Pangolin CoVs are closer to SARS-CoV-2 strains as compared to CoVZC45 and CoVZXC21, however, genome-wide percentage identity suggests the opposite—the genomes of both CoVZC45 and CoVZXC21 are ~89% identical to SARS-CoV-2, as opposed to its 86% identity with Pangolin CoVs.

Parlikar et al. (2020), PeerJ, DOI 10.7717/peerj.9576 13/31

Tree scale: 0.1

1

01

11

2

20 3

2 anxi 8

7

1 3 - 1

-1 6

- 3

0

1 1 C 4 3 -2

Taxonomy 3- 1 5

4C - 1

nnan 3-3 1 2

1 0 3 1 0 3- - U

U 13 3 9 2 U3 3-

2 U

B 1 6 U 3 -

2012 U

K

K U 3- -

e

0 X

Ra U 3 47 K 45 K U 0

K 3

1 H

1 U XC

0 K

2 S

Y H HK

p-Yu K U H

Rp-Sha HKU3 p3

237

K

t JL

2 H HuB2013 f

t K 526

NLF 3 NLF H

t 02 m

t t 72 55

-

4 t C K t H V a

at H H VZ 2

R

a

a t

Y H at E

s at H VZC 408

a 2

YNLF 31 YNLF H

at t

92 o

B L

SARS-CoV-2 L at Rs4 a

42 t Ba 5

t t B

B s at

at at Ba HKU t

Ba t E 3

a

B 4

B KY Bat Bat a 672 at t -P

P

a 5b

a B e

As6

t R t

R a

a

a a B

013 B a B P

Rs a B 5L

n a B

a at a

a B a na t Co

n

Bat Bat

2 n i n rk

in a B at C 46 na Bat R Bat

t t Bat Bat a Bat R Bat a Bat 279-2005 Bat i n

i i B

1 in n

n a Rs

-00 C014 i Ba TG1 a

Rf40 h A -P

h GX F

na B na

Bat a n

084 X-P1 h

h na

h

hina B in

a a GX- Z

i a h na B GX-

hi

hi a B

C

N 1

Bat Bat hina n

Ba Chi

NA

Rs3367 1

C

C h

Ch a B

i n S -m

h

Chi

SH hi

y

t YN t C

C i a oV C

Bat hi V R Z-002a 2020 0

1

ina B ina 1 d

1 C

hina hina GX

1 h s

SARS-CoV Chi t

7327 a

t 1 C C

Bat 1 ina ina in na C 1 - C -

4 1 C

1 C

- 1 U

- C o -

i

h in m 1

na -

1 Chin 1 a

- 1 a C en 4

China Ba China - 7

B oV G

1 C h AZ1

S 7 S

Chin - 1 1 S-

R T

a Bat a 1 Co u

S-1

S

s940 S

4231 0

oV S Ch

n t

Rs S- B S 1

-1 China Bat GX20 Bat China -1

Ba PCoV

S -

C S-1

i

1 A a a

-1 S

R -

R S- N

R P H TX

48

R 1 afo -1 C -1

t R S- S-1 S-1 -

R

Bat WIV Bat RS-1 C

RS-1 NA RS-1

S- R R

Ch S um S

-1 a

R - Ch in P

n R 1 K n um C S S

1 Ch 1 1 C

A Rs

t t A R

Chi ARS R S - a A

A

1

- n

RS A

S m

AR R P A

a H t i

3 -1 ARS-1 ARS-1 na R 3

S

S

AR n

S

SAR R S S

-1 S S

Ba SA u

S-

IQTC

AR i

S

a

1 China China 1 3

S A lin PC h

B

S- SA

1 Ch 1 5

S

ina Bat Rs4 Bat ina olin -0 A RS-1 S

- A

3 2 lin um ARS goli -

1 Chin 1 A h

Pangolin CoV SAR 7 SA R Hum

3 o

2 5

2 ARS C H

h 2

7 g 5

9 S

4 A S

4 S um se -B n 4 H

S 9 S 10

1 AR C

1 g

0 angol n 3 8

-1 Chi -1 n

9 hina H um

5 8 1

0 m 1

-1China S 02

SAR 5 3 S U H 043 043

9 SAR 9 8 12

8 3

ZS-C n a 2

S P

9

a Bat Rs Bat a

na B na 1 4 a

3 4

01 2

China Ba China u -1 Chi -1 Z SAR

Z 3 m i 5 3

hina hina 5

3

3 2 a

H

A 2

SAR 3 SARS Pa S-

n 4

-1 C -1 4

401 5 SA

5

1 1 -A

RS

2 D

48 ango USA u 5 23

SARS 4 ZS-A

h 3 iw

7 P m

7 2 C

RS -

-0 .1 na

9 H

808 S 808 5

G 3815 SA 3815 8 1

SAR 3 R

1

4

civet020 -2 C i 153542 SA U 9 a

1 5

N 5 H u 6 4 6

C 4 0 na

648857 SA 648857 2 a m

02 084200 S

9 3 RS-2

A P G 1 1 0 NA

153541 5 1

14 - T h 71

6 SARS 6 166

1 3 6.1 A

A

5 5 Q Chi J Q

J Q 2

J473811 SA J473811 64 H u W6

143 Q 153540 SA a

-03 5

071615 S 071615 3.1

X

SA 1 in NA Q 26

8 5 3 S C A Q HC-GZ-3

GZ0 S 04 SA N ARS Q RS

A

4.1 P

S-

N m 17147

- U Q n

DQ648856 SARS-1 NA Bat 273-2005 DQ648856 SARS-1 NA

K 1 D1

K

1 1 G

K a

G

92 SARS- 92 J Q -2 W5

G h 17 DQ412

NA 72934 SA SA 6

714 3 S- Chi i H - DQ 1 DQ 5.1

Z

SARS 14 - D Q KJ473814

S-1 28 A P88680

D

KF569996 KF569996 72933 A m Q153546

R G 33 2 n

JX993988 JX993988 352407 S SARS- 001 9

S

4 7 a ARS-1 China ARS-1

8 U

1 G 0

7 S

KJ47 -2 i m WA 7

na G R

DQ

-1 -1 7 NA

SZ

417 Q 7 2

u C Hu N

R 3 AR SARS

Y4 G

K 816

G

- -2 8 4

A G A

KP886 u S

1 G 9 R S Ch h 8-

in

-1 N -1 GQ

S KY41

6

5 40 5 Y

NA a 3 S

44

A G G W K

5 033 -2 C 4 S S 457 H

KY 4033 262 0 C h ia um W

-GZ-81

KY A R A -1 -1

S

1 0 6 SA

-03 H

R

ARS

Chi M

hina Mon hina

K J a N 4

4-22 4 8 4 S -2

on M 0 R d 2 SA 2

KY4 1 C 9 -U

H

S T S

-13 A H W 4

S A2 n MT0

3 3 C F 2

C 7 5 S m W

6 SA 6 R

T 9 A 4 M MT0

H KY4

M

171

-1 -1 SA

1 S A Hum WA1-F

SA 0 5 5

P M 9

na na

KU9736 - 4 975 W 15 S-2 hi 715 R A

A

Chin N

S M

Host RS-1

KJ47 0 7 5 S 2 In S UW

1 Hu 1

88100 A 3 U

SAR

C MT0 A R S-2 um

KC881006 938 06 6 1 2 - S-1 C S-1

PC4 m

A

SAR S Z-D 9 -1 150

C4 M W

KF367 1

6 S

-03 U

S A R S- A H

Chi

1 1 0 6 9 -2

na NA na MN 3 u

KY4 N P S R

AR S

9 WA6-UW3 9 - -MDH3 i 2

Y41 A7 -S

Mon Mon MN9 S S

on 2 R

KC T1 79 4 S -

417 H 1 1

500

9 China China Y417 A h

S

n n n-0 9 9 M MDH

C

K 06 3 ina NA NA ina

7 0 4 R S 2 U U SA m W

4

997

o M SA - M

K A

1 1

0

S- H

2 T 75 2 5 1-

C 041 Hum

4 -

9 R 2

ina ina MT1 A U a

KY

2 S M

AR

1 3 - MN3

t0

-03

0

SAR Hu nna

S

Y39 M 42 - Ch 848 A KY417 1

0

10

556 S 556

3 1 2

in T1 35 U S

a

C-SZ-7 LR 5 u NA

0 m

A 0

Human R

0

A

-1 Ch T 1

n ARS Y

18 3 S ive

H

-94 M 2 USA Hum

i Y394

AY3

et RS -

1

c

9 1

Z 350 S

39 A - ARS

h M Ch

39 v Y27 A 61- RS-

na na 880 S

on on Hu

i USA

5

i 1 -2 n

1 MT

C -S

A Y 572038 S 572038

- 4 S A

h RS

M 050493 SA S 2 Hum MN

SA US

MT m KMS B

A 1 Hum

SZ-

a NA a

07 SA 07

685 A L2 6 -1

- MT R

ARS 2 2 Mo

AY n

Y5

RS

4 - US A1

CF 985325 A a 7 S

S MT

RS- Hu A

NA Y5

hi S

n 3

17 SARS 17 4 USA a HC

A 1

1

i

591 MT020 S

A hina um I TW

Bat SAR Y613947 6 2 R

9

SA n

n - C L h

4

50 50 7 2 m C

RS-1 C RS-1 MN T020881 SA C H I 59 1 3 A S- 9 1

o A

J959

SARS

C

A

a N a 28

M 8 S-

hina Mon c Mon hina

R S 2

F 163720 SA 6

39 71 m

Y5

8 ARS-2 Chi

S 5

SZ16 1

94 Hu

in

B0

C 3

- MT T SA

4

a M a A 9

1 S- A

Y54 2 A h

n

61 6 S

SA

9 1 - Hu um

1 M

in R -Cc A ARS-1 S

-

C 9

N MA1

3 7

c A A3 -17

915 915

S U

S MT

1 -2 U

AY l H

63

ARS 16371

Mo

5 4

Ch 33 -B

R 63 S 2

8 MT1 24

S

b

HSZ

Pangolin AY613 8 - US - pa

1 1

6 1 SA R

54 iseA

B 8 SARS S Hum

Kong e

RS-1 RS-1

4 MT A 914 914 T1

AY6

8 CA

SA u

S- Y e

HSZ 0 R N A

6 M Hum C

ng

03 A

R 1S

SZ- 1 SA Cr

SA 995 A A-1

Y 2

a NA a

-Cb

35

o 6 A uis e 1

H MT1 -0 NA NA A r

72 6 57 S S- m s 9 Y545 S Hum

H 2 US i

W 7 1 2S

MT188341 2SAR 2 1

A T04 SARS-2R U A - 0

HSZ 2 S- Hu ru -

ina ina S V07 5720

SARS-1 NA SARS-1 M 44 3

2 C AY5

NA oV

1 Chin 1 um C 45919

512 SA 512 446 SA A Y

6

- -19

Monkey NA na

NA NA WI

5 MT H 1 S m

A 8 V S

hi -2 U J

864 nC

Y 8 A

-1 Ch -1

na na

6 MT0 S

C A

i 6 Hu

B S

S

Y515

2 2 R -2 U um nCo

1 1 88 SAR Hum

gKong gKong MN99 A -

A 7 A S

20

n S H

NA S

AR 62 SARS-2 R -2 U o

1 Ch 1 MN98871 S um

1 U AY68

R m

- T0 A S 2 H

Y304486 SARS-1 HongKong NA SZ3 NA HongKong SARS-1 Y304486 a BJ

995 SAR 995

0398

S M S R 2 H

A

hina T N

SA

94 S 94 A S- t CA8 IV0

NA Civets QXC C

9 e

01 5 5

C2 M i

AY304488 SARS- AY304488

4 m

1 A 8 A

SAR 715

Y394 027063 S-2 China V tNam a0

N N MT0270

ARS-1 e m W S- 39

A

2

QX T R i Hu

03 SAR - u 86 86 S

A 911 S R Y

A A C00101P J -1

04 M S V H P A 3949

4 4 N

0 S

B 17 49

R 2

84 SA

2 - a alenci

SA Y

02 7 A-21 9

RS

1

A -1 N -1

55 MT159 V

J S

A J-

5 3 A

8 N

S 59 se 15 B

SA i 0

Y3

S -2 U Hum

R 1

MT1 A 2

8 65 AR 1NA -1

A um

Mouse 27 T 7 ChinA

4 9 N

A LL A S H

Y

06 seA- NA NA 7

6 -2

SA M -13

1 1

8

A 2

- N9 US ui arket

9 73 A

ARS

J01 9 SARS Y8 NA

S ain m

7

20 M Cr

ina N ina -2 A

S 05

B 3

864 T1 ARS Hum Cru 1

-1 -

0 S

3 5

AR S od-

Y u

WH M

NA

A

H

S 60 -2 Sp Cruise 1 Ch 1

s 27 AR Hum - 306

Y46 n

u MT192 S NA 10 a

A 2 USA seafo

M

490 65 5 um

RS-

1

Pig TJF - H

Y46

4 SARS S- uh

78 MT 99

0 A 76

A um

NA R

2 J 2 USA

SA N H N 06

8487 SARS 8487 B

Y 53

-1 -1 S-

A M USA a

12

A um W 8

S R 27 n

4 2

S-1 H

R - A-

Y

8 SARS 8 MT192 08 SA a e 95 2 A

A 9 s

5 S-1 N S-1 i

S

A N A 4 RS in

NA NA NA NA 2 Chi

R 13 SA - h ru

N MT1986

7848

AY -1

A 18 SA S

A C C

S 062 062 T YE618

Unknown Y2 M -2 ina Hum WIV

A S

-1 N -1 SAR TK 01

772 9711 Hum C

SAR Y

4624 S 4624 A

RS MT1597 um

A QT 2

A 996 SAR S-2 Ch H 0

354 7 2 US Y65 U

9 MT15 1 AR m I A

7 5 pan u

2 R75 5 S S-2

Y L 4 a H

A in

508724 S 508724 C 0 SAR h Hum NT

Y N 9 A 996530706 iwan an2 MN RS-2 C J

T159 A 0.72 M 9905 SARS-2 Ja SA Hum Hum CA5 90 S ARS-2 TaU LC52 2 MN2-MDH2

232 76 S RS- Finland um IV05 0.68 MT1 SA A H m W 64 RS-2 u HU01 MT06610 SA na H

RS-2 US 09 0. T027

M 0781 SA S-2 Chi ina Hum W

64 R Ch

2om5 MT02 8340 2 d CGMH-CGU-01

1 1 18 S- -9

N MT 6529 SA China Hum WH- seA

0.6 SAR

99 8 Crui 22

A15-Exo RS-2 N1 d3om2 N1 MN

o

M Taiwan uHumm seA-

s s i

u 0.5 H

5-Ex N98866 M d4ym1 A 1

M 3631 SA US Cru -19

MA SARS-2 eA

xoN1

s

USA is m5

u

1 Hum

- 59

6

M S MT09

15-E Cru

d4y 27 R

A USA

0.5 m

A

N1

M SARS-2

o

USA s 3 0 -2 S

x

1 1 Hu 6

- MT19

-E 7 Mu

S RS

3om

R A

d USA

A15

A m CA

M eA-16 2

oN1 oN1

1 USA 1 Q890538 s

0 S 0 4 S-2

x MT159

- 909 S is 4

H

m R S A Hu ru

Mu 0.

S

3o

905 184

AR U d

8 7 SA m C

S -2

N1 MA15-E

USA MT S U02 o s 90 Hu HQ

4

-1

x 4

531 A

0

Mu S om1

9

5-E SAR 3

8

1 T18 U d

8

Q8

SARS M Hum WH A-10 N1 H 0.4

MA 25 S-2

1 USA 1 a e o

03 in is

Ex AR

29 h Mus 9

5- T044 S C A

2

SARS- M

18

JF

1 d2om4 1

MA1 9714

s

-1 US -1 Hum Cru eA-

u oN

4

2904 15 A is

9

M T SARS-2

2 M 9 F2

m 6 US Cru

J

A15-Ex

2y

USA m

05 SARS 05 1 d u

-

9 886

S 0.4 9

R

oN1

Mus M Mus SARS-2 x

A

E MN

F292

A A S 7

J

5- 0 2 USA H m SH01

39

US

A1 97 S-

1 5 M 05

d2ym3 4 Hu

1

s T AR a A-

0.36 S in

Mu M se i

xoN1

HQ89 ru 7 SARS- 7 716

5-E -2 Ch 3

USA 9 C 1

5 S 9

m T1 R um -201

RS-1 M SA A H 03 N1 d4y N1

HQ89053 - Mus MA Mus

SA 15 H 7

2 12 -2 US W USA

0.32

A15-Exo RS

M A ym2 MT12 Q8905 a Hum

H n us

M

SARS-1 Chi seA-7

oN1 d4 oN1

28 S-2 Crui

905 MT159720 S m

MA15-Ex m1 HQ8 Hu

SARS-1 USA USA SARS-1

31 SAR d2o 3 5

0. 019 USA iseA-6

9053

USA Mus USA MT

HQ8 SARS-2

28

s MA15-ExoN1 MA15-ExoN1 s 05 SARS-1 m4

Mu -5

USA Hum Cru

d4y T1597 iseA

890532 M S-2 ru

xoN1 C

HQ AR

S-1 USA S-1 m 15-E

0.24 u

A

4 SAR 4

4 9722 S 1

ym Mus M Mus

d2 -2 USA H 89053

SA MT15 S

HQ SAR ruiseA-1

ARS-1 U ARS-1

A15-ExoN1 721

M 159 SA Hum C

3

Mus Mus MT F292902 S F292902 USA

J ARS-2 U m WH-01-2019

SARS-1 SARS-1 5-ExoN1 d2om 5-ExoN1 0

A1 159708 S na Hu

s M s T 90529 90529 M hi

Mu

HQ8

.2 A-3

1 USA 1 SARS-2 C ise

N1 d2ym5 N1

xo 529 m Cru

5-E u

536 SARS- 536

0.1

MA1 MT019 A H

s s US HQ890

Mu

USA

1

RS- 719 SARS-2 V04

SA WI

30 30 MT159 Hum 15-ExoN1 d2ym1 15-ExoN1 a

HQ8905 Chin us MA us 6 M

1 USA 1 SARS-2

0.1

ARS- 996528 Z-1

1 MN xoN

HQ890526 S HQ890526 -2 China Hum H

Mus MA15-E Mus RS

1 USA 1 et

S- 039873 SA ark

943 SAR 943 MT 2

882 seafood-m

FJ um

5-ExoN1 P3pp4 5-ExoN1

1 hina H

A

0.0 -2 C

us M us

M SARS USA

1 1 998

SARS-

LR757

82953

FJ8 CA9

1 d3om5 1 Hum

Mus MA15-ExoN Mus 5 SARS-2 USA 8 -1 USA -1 1883

RS MT1

JF292906 SA JF292906

0. um CruiseA-12

2 2 USA H

ExoN1 d2om ExoN1 RS-

Mus MA15- Mus A US

1 MT159709 SA 0 35 SARS- 35 8905

HQ 19

4 -2 China Hum WH-04-20 p5 p 3 1 P 1 532 SARS xoN -E Mus MA15 Mus MT019

A S U

942 SARS-1 942

FJ882 MT019533 SARS-2 China Hum WH-05

FJ882962 SARS-1 USA Mus MA15-ExoN1 P3pp10 MA15-ExoN1 Mus USA SARS-1 FJ882962 MT159718 SARS-2 USA Hum CruiseA-2

F J 8 8 2 9 5 1

S ARS - 1

US A

Mu s

MA1

5 -Ex o

N1

P 3 p p

3 MT159712 SARS-2 USA FJ Hum 8 Cru

8 i 2 se

9 A-1

5 4 9

S

A

R S-

1

U SA

Mus

M A 1

5 - E x

oN 1

P 3

pp6 MT012098 SA

F

J882928 S J882928 RS-2 India Hum

A 29 R

S

-

1 U 1

SA

N A

Exo N

1 P 1

1p

p

1 MT039887 SA

KF5

1 R

4 S-

4 2 U 0 5 SA H

S

AR um

S WI1

-

1

US

A

N

A

E

x

oN1

c

5

MT1

2

P

2 0 2

- 3 KF 2 2

0 9

1 3 5 0 S

1 A

4 RS 3 9 -

1 2

SA China

RS-1 Hum

USA IQTC0

NA 3

Ex MT o

N 1

1 2

K c 3

5 F514410 S F514410 2

9

9 1

P2 SA

0

-2 R 0 S-

1

0 2

A Ch

R in

S a

-1 U -1 Hum

S

A IQ

N M TC0

A T019530 S

E 2

K xoN

F

5

1 c8P 1 1

4

4 A

0 R 2 1

0-2 S

S -2 C

A 009

R hina H

S

-1

U um WH

S M

A T

N 1

K

A 8 -

F514406 F514406 4

0

E 9 2-

x 1 2

o 0 S 019 N

1 A

c R

5 S S

5 -2 A

P R

2 U

0

S S

-20 A -

1 1 H

1 M U

0 T u

SA m

K 18 C

F5

N 49 ru A

1 13 is

4

ExoN eA

4 SA

0 -2

2 R 3

1 1

SA S-2

c13P1-

R U

S- LC SA 1

2

H U 00

K 5 um SA 2 F

9 8

514393 SA 514393 2

N 33 Cr

A ui

Ex SAR seA

o -26

N S-

1 2

R

c L N S

5 C A

-

6

K

1 U 1 52

P2 H

F5 8232 SA um H 0

S

- 1 20 A

4 N

1 u- 4

0

1 A D

6

Exo P-

R

S S K

A - ng-

N M R 2

1 c5 10P c5 1 1

S T1 N

K 9

- A - 1

F514412 F514412 8 H 0

4 2

U 9 u 7-

S 1 m R

A N A

20- 2 S Hu NA

2

A A -

010 D

S R

E P-

A S

x MT1 - K o

R 2 K n

N

S U g

F 1

- 63716 S -

19- 1 1

5

c A

1

5

U H 0

4

8

S 20-

4

P u

A 1

2 S m R

4 0

N A N

- C

A 2

S M R r A 0

A u E S

1 T1 KF5 i

0 R

xoN -2 se

S 2 U A

- 6 - 1 1 1 S

1 8 2

A 5 4

U c13P 0

3 8 S H

9 S

A N A um

5 A 2

S MT

0- RS W A ARS

F

2009

A

E

J8829 0 -2 3-

x 6 B

o 6 U

-

N 1

1 5 r W a 1

US 6 zil 1

26 S 26

c S

5

A H

P M ARS

u 2

NA

K T09 A 0 m

- F R

2

- S

E 514389 SA 514389 0

S 35 2 P 0

x

- It 0 9

o

1 US 1 7 a 2

N1 1 ly

S Hu c M

A ARS-

8

KF N NA

P 994 m 1

R - 5144 INMI1

2

E 2 S S

0 46 x 0

-

1 1

9

o 8 w N 1

U MT ed 1

1 SA

S

SARS-1 USA NA Exo NA USA SARS-1

K e

A n

F5 039 R

N S Hu

A

14 890 SARS--2 U E m

4

xo 0

03 MT S 1

N A FJ

S H

1 c8P10- 1 0 A

8 0 u

82

R 754 m

S 2 So C

93

-1 A 4 A

2009 0 2 KF Y S

U u

SAR N1 c N1 3 A

S 0 R thKo 514

A 4 S

N 4

13 -2

S 95 SARS r

A

3 e

-1 P10- DQ E

9 A a

K US

0 S 0 us Hum x

F

oN 6

200 tr

514417 SA 514417 40

A

A a

1 SNU01 NA ExoN1 NA

R lia 652

9 -1 HongKong NA c5

S A H

- 1

K Y39

1 U 1 SA um

P F

20

51

S 49 RS- V

-2

A

442 IC R

N

0 A 83 S

S 0

10 1

F Y39499 1 A

- Chin

0 1 U 1

J

E

S

8 A

x

8 R A SA

o G

2 S- a

N R

9 A Hu Z50

1 c 1 S 5 F 2

N Y 1 C

0 S -1 J882

A 3 m G 5 S

E 9

U A h

4P 4

AR R in xoN

S 9 DH

2 A S a 95

A 9

0

FJ N

S Y 0 -1 C

-

1

N -

4 4 2

- 3 S A B

c5 3P20- c5 1

01 8

A 94 H J S

U ARS

8

E h H01

0

A

2 9 in S S

x AY RS

9 89 Z

A FJ882931 FJ882931

o a 2

4 - N

N 1 N -

4 3 A -1 -1

1 c 1 SA A HZ 9 C A

2

S 4

01

U E h

A

5P A 9 RS-1 i

S

0 n

x RS

FJ Y 9 S

10-20 A A

o 3 3 a 2

N

8 N

N

- 94 S -C

1 1

8 S C A

A A

P

2

US

A AY 98 h 0

RS H

Ex

9 FJ 3

9 i

R n Z pp

5 7 S

A 3 a S S

o - 6

8 1

9 2

6 N1 -1 US -1

N

NA

8

S 4 -

0 AR NA A E

2 A 9 ARS

F 9 P

Y2 9 HZ

E

J8

4 S

3 1 NA

x

A N A p

0 8 -1 C S

oN S

p

8

- 27

S 2 1

46 AY AR HGZ8

2 -D

A F 1 1 A E A

9

US 5 h

J R P3

2 3 2 S ina

88

9

S 507 xoN

A S -1

pp2

S L2

-

AY N

2

1 A

NA

FJ C

A

96 5 R A U

1 P 1 h 3

R 4 0 H

S

8 SA i

E

0 SA 0 8 na

S S

8 -

3pp12 ZS

x A 52 1

-1

2

oN1

F AR

N Y3 Hong NA HZS2

9 2 U

J 77

A A

4 -

RS

8 S Fb

1 S 5

Ex 8

P SA -

A 7

A 1

S 29

JX 3

-1 Y50 07 Ko

oN

pp5

AR

N N

R

5 U

162

A 6 S-1 A n - 5 5 1

SA F g S-1

3 A 2 SA

E N

K c P3

S Y39499892 N

x A

0

F

A

o N

N A

pp3 R US

87 87 P 5 8

R

N A

A S Su1

1 AY UM

1

S- S JF

- N

4 E

S P 7

A A 1

4 1 1

x NA A

A 4 C N

3 0

2 RS

o

0 85

p U

R Si 0

92 A

N 7

A

p 1

F SARS-1 Ch S

S

Y3 N

1 -1 n

2 1 SA

J882946 SA J882946

E

92

A

-1 -1

7 o1-1

P3pp Ta A

x

N 5 8

o EU

2 PUM

RS U

JF2 7

A E A N S

SA i

SA 0 wa 1

1 A

3 3715597

-1

4

P 5 RS

x EU

92 n

R C0

NA FJ882936 SA FJ882936

3 o

U SA N

S- pp8

N

921 -1

SA 3 in A 3

1

E

1 EU37157 R a NA

NA F

R P U T

x

1 SAR

N S

S W J o 3

S 5

SA - A

p

N1 c5P N1

882933 S 882933 6 1

- 5

A E NA p

1 USA NA USA 1

F 4 N

E L

1 R

N U S C1 J

x

9 S A

S 3715 S o 8

A Ex A EU3 6 A - -1 U -1

N 1 N i

8 N

FJ88 2 S R no

10

R 1

2 A

c S-1

9

S A 3

o E 63 PU S A

2 -

5 7

AR

F

- N 1 N

1 U 1 7 A

U w 15 RS

2934 SA 2934 1 7 J N

1 M

SAR

N S A P2 3

8

c S

tic- AY 7 6 A A N Z C FJ

8 - SA N SA

A wtic- A

-

5 0 1 NA

0 156 0 1

29 R J02

P

-2

MB 34 2

US S

8

1 AY S A

S

0 F

4

8 A -

1

- 5 1 B

A 1 7

J 2

P N

1 RS 0

A

R 34 9

J

w

88293

9 S

US A N M S 3p

NA w NA 8

FJ A 1 S

3 Y A A A

t 8

B 5 8 -1

-1 -1 5

ic

p 3459 9 RS- B 2 c1P

R

8 A

SARS S N

13 AY J - -

FJ88 8 N 1 8 U MB

S

NA 6 A A A 182- 2 29

9

tic-M SA

-1 3 RS 1 N N B 1 A

SA

FJ882937 FJ882937

P

SA 6 8

J

3 Y502931

U

w 4

N 2 7 A 18

29 -1 3pp2 2 A

-1

S 69 R

t A

B P B RS-1 RS-1 SARS

K

A BJ i SA

A 2-8

c S-1 NA 49 49 Y5 H

wt USA NA USA

F5

- 9

MB MB o

N A 1

3 RS

KF5 0 S n 8

S p

A H BJ18

i Y3139

1 2

c- 2 ARS-1 g p6

U

ARS-1

4 a P A

w 9 - ongK Ko

K

-1

M 1

SA

3 SA

1 Y 2 S

ti

1

F5 9

B

p

US 9 H

c 2b 4 36 n

AR A

wtic 4

K p

-M on g P3p 3 RS

NA Y S

1 06 SA 1

F

SA

9 2 ong N

A USA NA USA 5 A 4 AY NA S

B

K g

5

6 6 -1

0 A 4

NA NA R -1 -1

P p wt K

- 1 SARS 9

F RS

MB P MB 2 0

29 A 5 S N A 44 8 o

K

3 T N

514423 514423 9 0 i

U Y 0 SAR - n G c-MB c-MB

p 3 R a A SA F5 1 A

-1 wtic

S 292 g 1 p A 5 i 0

KF514 0 S w

7 A 9 T

A A T 3

U Y5 02932

3 S N

1 -1 C a an

S W G01 w

pp2

- AY338174 SARS-1 NA

R A FJ8 N

4 iw 1 4 -

SA AR S

P3pp

MB P MB 0 C

A

S 4

USA

A

tic A 2 S -1 NA a NA 3 A 0 KF51

- SARS-1 SARS-1 hi R 1

G

wt

1 9 ARS- S n N N Y3

4 -MB P -MB

82938 SAR 82938

418 SARS 418

S na 02

U A 2

S - A

K TW8

16

SA i

- 1

3p 3

c-M Y

NA AR 1 4

SA A

F Ta

W A NA N

U K 3381758 SA

p1 T

4392 S 4392 514

RS Y 3 1 A T

3

F5

N S

B

S i

A W

K

ic 1

4 394 wa W

pp2 R -1 T TWC2G P3pp1

A

U Y 4

A a 6 -

F

1439

41 A T c 1 S D

K

N

S 3 i

WT n

5 SA T w

ic 1

Y3 6

3 9 9 - A U F5144

- AY3950

1 1 P2 a 3

A

K an NA 9

1 USA 1

ARS 5 9

S

c S i

4

S

NA R T wan

S 95000

F 0 9 W ic

3 J -1 -1 0-2

3 KF51 A

9

8 0 A a TW

P S- NA

A

5 N8

c

9

S

Ti JQ316 RS iwa S

2

N 1

1 KF

US R

7

WT

0 ARS

1

-1 US -1 1

0 c S

AR 4

A

15 SA 15 0

SAR 5 NA 8P2

S A T TW - 7

NA K c

4

9 - n 2

WT 4 NA

A

5 01 S AR -1 Y 1

i

44 a

0 0

1

F5 AY

c SAR

K 2

S

14 NA NA NA

0 2 iwa

P TW9

8 NA 1

8 -

W

c3P20-2009 0-2 USA USA -1 U -1 9

F51 GU 1

22 SA 22 7 1 A

10 KF S-1 TC1 SA ic

143 1 6

S 3

40 84

NA

RS-1 RS-1 A 9 S A C TW1 T

c J 9485 n N

- -

0

w S NA

2

1 6 hi ic c3P1- ic

514388 514388 Y 5 R F

9

10 China

1 G

S

442 9 RS JF2

0 A -1 China

tic

98 98 5

N

U

2 3 SA S-1

WT n

09 1

A

S

9 G U5 R A

JF29 3 TC2 0 92 2 RS A

S a

P2

-

N

ARS S -1

U

MB JX U5 1 3 0 S- TC

SA

HQ89

1 SARS 1

A RS-

9291 NA

WTic WTic 533 1 6 ARS

SA 9 ic c1 4 c1 ic A 0 C

U A S

-1 USA -1

H 1 5 1 N

- 19 3

WTi 16 5 hina N

2 N

RS Y A U

2009 AY2 8 S

JF292916 JF292916 Q S

2 A A

-1 LC2 SA

0 42743 3 1 A

N A 63 R

9

A

S

H

1 39 S K

8 3 A -1 NA L 2

W

UK NA

c1 6P c1

- A W A

0 0542 SARS-1 USA SARS-1 0542

08 08 Y50 S

N A U LC5

90

HQ890 6 Q ARS

1 U 1

c RS-1 A

RS-1 US RS-1 P N

SAR 9

Tic A Y5 28 4 S -

S Ch

JF29

-1 -1 1

A 20-20 890 c1

RS- 1 S A 5 NA A

SA A Y A A HQ

WT

T 4 C3

S N r

45 2925 S

LC

US A 0

Y28 5 9 S RS i J

c

2 A i

1P20-201 51 -1

NA n e

A A

c AY28 2 U

S-1 U S-1

2 Y 0 0-20 F2929

J A

RS

5

1 a c

WT S RS

SAR

NA WTic NA H

8 92

P1 A 290 c1 10 2 ic JF292

A F

U N

44 28 R - N HK 4

5 A S

9 SARS N

S AY 9 1 K

JF

W Y46 3797

2

A 46 SA 46

c A

-1 S A

A NA WTi NA

0

0 7 -1 S

i A R A

ARS 2 U-39849 H 10

7P2

92 37

c c1 c Chi U

-200 SAR

Tic Tic

2

5 -1

A WH

A Mu A JF29

7 Y2 6

ARS S N M 292 NA S

SA 3 3

Q HKU-39849

U

P A -

4 S-1 Ch JF

SAR Y 3

11 SARS-1 USA Mus MA Mus USA SARS-1 11 95 -1 N

9 A 101 1 7 S A A on

1 JF2

3 S 3 U

SA

890541 Y 0

0 5 n 9

1

91 A 9 c 94

6

JF -2 U M - WT

9

-1 USA -1 Y SA A R -2

3 S T 8

2 AY 28 1 i a

JF

2 5 SAR

8 RS-1 920 920

S- 1 60

USA Y n

c1

P 4

0 A RS-1 - W

s MA s 5 S HK FJ882

9

A 010

P T

2 9

SARS-

7 3 SA c c1 c

A 2 H

u 1 T A a

M J 9 29 FJ 0

20 5

9 Y 5 29

1 R -

2 3798

S

9 DQ 0

FJ8

1 0 C

2 UOB

s MA15 s a

i

9 X163923 5 um

RS 59 1 NA 10P

S JX1 S Hu

us us 2 FJ8 c

9 0-2

H

1 5 8 5 S- U-39

-1 USA -1 S -2 iwa 92

Mu A

USA

F 5 T

2910 2910 aiw

S 8

c1 c1

9 0

A A A 14

FJ

5 S 5

590 9 R um

Mu S-

2P20- JX 9

M Y

J U

01

1 09 S SARS 8

09 S 09

ARS-1 JX 8 8 AR 1 a m - M

Mus MA15 d3om1 MA15 Mus Y3949 00

M

RS-1 U RS-1 R T T H

9 J S 82952 MK 0

1 U 1

5 d3o 5

882 S

s M s 3

2

SA AY i

MK06

20 98 8 A

S 1 n 8 6 K 9 a

88 A or

MK06 wa

0 Sin an 13

5P20 X S-1 KU S

A 9 -1 Singa 8 K N

M 163 1 HKU-3 ARS- 9 9

9 AR

Mus MA1 Mus AY

A i N A 2 392 s 45 S 45

1 D 1 90 S R

DQ497 Y S 4

06 AY283796 2 S S

FJ882963 w

-

15

A A 1 AR

A 4 2

48 AY 062

A C

d4y 17

A

A

USA 9 20

SAR

Y278 S

SA A K062183 K062183 SAR A

295 3 6 N 9

A

A A A 062182 SAR A

AY559093 SARS-1 Singapore NA Sin845 A1 S-

M ARS n

M Q 2 ARS- -1 i A -

961 6 -

a

Y 9 20 Y

RS S n g Y32

5 FP

Mus Mus 3 Y5

Y 297028 SAR 0 5 392 S-1

Y Y559085 SA Y559085

21 Y559086 SA

Y H Y

YA Y R

m4 741 3 n T

10 92 S

A SARS-1 U SARS-1 7

d A T

us MA1 us 4 SA N

-20 1 4 S g ap 984

8

- 5 ARS

714217 S 714217 S 1

559097 S 1 78

5 d 5

SA 5

10

USA NA ARR 5

1

5 9

5 5 A

2 0 u 5 S 9

2181 S 2181 A Mus Mus

1

179 8 1

m a

SARS-1 US SARS-1 A W

5 NA 1-

2184 2184 7 S 7 6

-1 A -1 S

8 5

5 Ge 9

USA 5

ym

SARS 2

5 6 SA

5

5 4 S S 5 ss

5 S o A

U 2 p RS-

Mus MA15 d MA15 Mus 9 R 10

-1 8 5 T

3977 W2 84

741 741 2

0 S 0 S

9

9 108

S-1 U S-1 19 1 9

9 d2

2 7 ARS-1 R

9 1 9 7

9 9 U SARS r M N

-1 -1 5 -1 or Mu p

008 SARS- 008 0 0 S

RS A e 0

0

om1 i W4 0 i SA

USA S

0 1 - r OD

0

0 0 RS- T 3 H ARS

9 S-

Mu n a ore 9

9 S S

A1 S 1

5 5

SAR ingap ma

8 95 95 -1 SA M SA

MA A e

8 R

8 8

9

8 NA W 8

SARS-1 USA SARS-1 A o A

USA 5 S

1 US 1 S ga

ARS-1 ARS-1 95

6 AR i N HARROD

3

M

5 5 1 1 SAR R

d2o

4 Si nga

4 2

-1 U -1 7

m s MA15 s SARS-

Mu A S NA Si -

SA RROD SARS S

R i FR SARS S A

5 5 3

SA 00

AR n A S

ny 1

SAR

s s d2om

SA Mus SA SA S po SARS-1 Sin SARS-1 NA

- 1 U

S R -1

4 S S

S S n

us MA1 us A R S Si - 15 d2ym2 1 g

- in

A Ca

ARS S

5 d 5 A - 0 M S-1 S-1

1 So

M 1 NA

A SA N g A

SA USA A S

m A

A A

US - N a

u R S 1 p

s M s

A R or 0

M R re n S-1 S-1

S R -

US 1 g H

R

S-1 US S-1 a R

us S p RS A R - 1 U 3

s R

R Sin

R or

R R d4y

3 S - 2

Mu S- US a n A S- 1 D 2

us MA1 us p um F -

S A

1 n e

S

o S-1 I S-1 US

M S -

15 15

A

U 748 00

S-

-1 USA -1 S-

S NA

S

S

MA1 S

5 - po 2

o -1

1 N USA

Mus S ore 1 1 e A A N

MA15 ada NA

A -1 N

USA -1 1 N r

-1 S -

-1 1

3 C 6 A Mus Mus A 1 Mus M Mus

d 1 U 1 - SA

m2 NA

- 2

USA Hum USA - - e

- A T

A NA 0 001 m1

us 1

1 Sin 1 C

15 d4y 15 1 Sin

1 NA

1

1

o 77 NA 1

s C A

3 A d

MA15

Mus MA15 Mus r A 500 00 USA 5 d2ym1 5 ana

S A

A

S

N S

e N

MA m3 o Si

3 A N

S N Sin2 r

S S S

t S

i a GZ- N 5 d4ym4 5

MA15 hina NA NA

S i T or2- A i a 0

Hum Ur Hum m2

Hu aly N aly

om ingapor n G A A

M N i n nP NA Urban NA

na A T i i i i A S 2

n nk

nga A

Hum Urb Hum n

A n

n n gap da or2-FP1-1 5 5

d2ym g A Si

Hum Ur Hum i

H g

ga Z Hu

g

1 g

g A Z ZM g

g g T A1

CV7 S nP1

Hum CDC-20 Hum

5 2

a da S

MA15 T

M f d

m m

A a

m C

5 P3 5 - ap u o F d n

ap

a a a 774 a u

i

p o B

2ym p H pore N pore N n

pore

1 or P P4 m Urban m r2- p

us MA-15 us p

p

A p or inP rt 5

5 P3pp4 5 o 4

P3 r

m P2 m

U J01 P o H

5 um Y

o o A ZJ

o 1 o

Urba o o 2-F

-

ym r 2

HSR 1

P r e 5

rba - -1 re N re b

4 r r e u F

r

r r

p e N

P e 1 - 3

e N

e e

e e 5 pp3

ani ic ani m FP P p N

3pp 0912

N

P

3

NA T 3

an NA

6 N

bani icSA bani N N 1- N

n A 0

i A or2 1- 0 pp A T

A 1

A Sin A - A

ni icSA ni A A

i A 3 1 8

1 i i or2

S Si -

1 7 S 0 Sin

0912 5

Si 1 Sin847 i i

c i S

5 S S

S S 08 SA c 1

in 1

in3408L n 0

icSARS-C7-MA

SARS i

i i

i i S n

n8 8

n n n 2 n

8 95

0301157

ARS-C3 3 5

RS 848 849 679

3 8 8 3 5

4 1

4 7 4

5 7

0 RS-C3 0

2

6 6

2 2

-M RS-C7

8 5

5

V

V

A

-

M A

Figure 1 Maximum Likelihood (ML) phylogeny representing relationship amongst 312 SARS-CoV, 103 SARS-CoV-2, and five Pangolin CoV strains. The whole-genome sequences of 420 isolates were aligned using MUSCLE and stripped to include the highly conserved alignments across all strains. The final alignment was subjected to RAxML to generate the ML phylogeny utilizing the GTRGAMMA model of nucleotide substitution with 100 bootstrap replicates. The phylogeny is depicted with branch length consideration. The inner-circle represents the taxonomy of all strains (depicting SARS-CoV, SARS-CoV-2 and Pangolin CoV). The outermost circle represents the respective host of each strain. Inner Blue and Gray alternative dashed lines represent an internal tree scale with a branch length increment of 0.04 from inside to outside. Full-size  DOI: 10.7717/peerj.9576/fig-1

Parlikar et al. (2020), PeerJ, DOI 10.7717/peerj.9576 14/31 We also examined the ANI scores of the above-identified closest strains of SARS-CoV-2 with all SARS-CoV, SARS-CoV-2 and Pangolin CoV genomic sequences used in this study (Table S2). We found that the Bat RaTG13 strain is 95.97–96.12% identical when compared to all SARS-CoV-2 genomes under study. Amongst the SARS-CoV strains, Bat CoVZC45 and Bat CoVZXC21 are its closest neighbors with 89% genome identity, indicating a common ancestor in bat coronaviruses, whereas, with Pangolin CoV, they share lowest (~86%) nucleotide identity. Bat CoVZC45 and Bat CoVZXC21 are ~97% identical to each other, whereas their identity with SARS-CoV-2 (~89%) is higher than that with SARS-CoV (86–89%) and lowest with Pangolin CoV (~85%). Similarly, Pangolin CoVs share maximum identity with Bat RaTG13 SARS-CoV (86.50%) followed by SARS-CoV-2 (86.30–86.38%) and lowest with other SARS-CoV. Therefore, we can argue that as SARS-CoV-2, Bat RaTG13 and Pangolin CoV have considerable genome similarity, they possibly diverged from a common ancestor and SARS-CoV-2 was later transmitted to humans through recombination and transformation events via an unknown intermediate host.

Phylogeny relationship at SARS-CoV-2 level SARS-CoV-2 strains, like other viruses, have a high mutation rate for better adaptability and survival. Therefore, one of our aims was to understand the diversity among SARS-CoV-2 isolates. Whole-genome sequences of 167 SARS-CoV-2 were aligned using MUSCLE (alignment length 29,950 nucleotides) and the conserved sites of length 29,725 were used to generate a maximum-likelihood phylogeny using RAxML (Fig. 2; Fig. S4). Irrespective of high similarity amongst all (>99.90% identity), several subclades can be identified from the phylogeny based on their distance from each other, depicting their subtypes. Our data has 97 isolates from the USA, 46 from China, five from Japan, four from Spain, three from Taiwan, two from Vietnam and India, one isolate each from Sweden, South Korea, Pakistan, Nepal, Italy, Finland, Brazil and Australia. From the genome sequence data of 15 distinct geographical locations, we were able to identify a few phylogenetic clusters depicting country-specific patterns. We identified multiple clusters per country suggesting the existence of multiple subtypes. The KMS1 isolate from China was found to be the most distinct one, followed by WA-UW230, WA-UW194, WA-UW218 isolates from the USA. The Wuhan Hu-1 isolate shares a sister clade with the USA Cruise samples, indicative of high similarity. Isolates sampled at the University of Washington (WA, USA) form two separate subclades within two different clades in the phylogeny, suggesting that even amongst people residing in a limited area in the state of Washington, multiple strains exist and are causing COVID-19. Furthermore, both the WA subclades have Valencia-Spain isolates as the closest branch/subclade, suggesting that at least two individuals from these two locations independently encountered each other and transmitted different subtypes of the virus. This study also included two distinct genomes sampled from India which were distributed by the phylogeny into two separate clusters signifying the existence of different subtypes.

Parlikar et al. (2020), PeerJ, DOI 10.7717/peerj.9576 15/31 Tree scale: 0.001

9

-

7 2

A

Geo Location: 7

9

17 93 e

4 a

- 1 s

37

i 0

4

Z

W1

3 u 4 1

H r

U -002

-UW1 Z-62 005b

Z-90 27 Z - et

C

UW2 a k

ruiseA-

- H S

0 TU0 - n L1

1 ar A WA A

China i WA-

A-UW2

L2

h S na A-UW2 m

WA-UW233 I

A A I KU U-SZ

A n-01

A WA-UW240 A W a IQTC0

38 A

hina H S

C K U

W hi

A W A A-UW2 S

3 H

S od

A in

U

2 C

A-UW224 US a H

9 0 USA CA1

W S U iwan N 2

USA A CA7 9 3

1 a afo

U S

3 1P

USA USA T 6 WA-UW2 e

7

A W A TX1

715 USA C USA 715 1 U s

0 US 450 709 71

3 9 hina a Yunna 976 U 976 Chin 7 7 A A 9 92 Ch

3 8

USA US 973 2 5

1 5 4467 75 C AZ1 979 46449

WA-UW2 8 4257

2 1 1

46490 46490 15 66 WA-UW

A 9 05 6

T 1 T 384 2

51 232

6

6

A-UW2 8 5

T

2 N99 54 USA 1 Chin a Wuhan

T251 M

MT M 0

MT2 MT253706 C 5262 alencia7

A PC001 A MT246 MT25

T T1

6481 US 6481

467 US 467 0 MT25 MN

M M a

V 7

2 China KMS1 M

6 MT04 09 USA n

M M 06 i 6460 USA WA-UW203 USA 6460 MT24648 MT US

974 US 974

437 na 10 3 5

MT246453 USA WA-UW196 USA MT246453 MT10 4995 h

WA-UW194 MT246484 USA USA MT246484

USA W USA MN93

T1 0

-481

India WH-0 MT24

6480 USA 6480 MN9

A Chi

Z

MT24 4

S M 26610 hina 235

2765 0493 Indi na 23 MT24 2 MS- U MT

A

MT251 MN9974

522 Spain Spain 522 041 na 231

MT2 3 MT 5

451 451 alencia8

PBC

MT19 MT05 43 Chi V

hina HZ-162 hina

MT246470 LR757995 C alencia003GOS 983

78 Spain ValenciaV 983

MT23 S

700 China H China 700 MT13 -1 T135044 C a WIV05 a Spain 19- Japan 34419 9

Z 1350 in China I China

MT246 M T

696 C 696 135042351 Chi LC5 M FDAAR

T253 T

23 2S M

9530 9530 M DAARGO

0 A 1-F6

MT253

6529 Chin 6529 MT 8652 Spa S WA

VietNam 01S VietNam TU02

667 USA U

MT01 MT23352319 Spain A

3697 China H China 3697 6

N99

etNam etNam 12

Vietnam 5 WA1 M US

Vi MT 24 -49

2772 T2 A

Z

1

M MT WA1-A

na H na

MT19 MT233526 A-UW219 176 Taiwan N Taiwan 176

2 Chi 2 MT192773 MT192773 MT020881 5

N985325 US USA WWA-UW202 MT066 M A

Pakistan 6476 12 MT25370 MT020880 USA WA-UW23

459 US A MN988668 China WHU0 China MN988668 MT24 A-UW2 31

0.00095 T246 978 US A W W2 M US A-U 4

469 A W 0 0. MT2516 Sweden 0008 MT24 WA-UW2

5 46488 US A 0.00075 -UW220

MT2 6461 US WA 9 0.00065 MT24 477 USA 246 W3

MT 6486 USA WA-UW22A6-U 0.000 W

Japan MT24

55 0.00045 MT163718 USA MT163719 USAA WA7-UW4 MN3-MDH3

0. 8339 US 00035 MT18 A MN1-MDH1 Italy 8341 US

0. MT18 15 00025 A-UW2

MT246472 USA W

0. 00015 A WA2 MT152824 US

UW198 0. A-

Brazil 0000 MT246455 USA W

5 MT246475 USA WA-UW218

MT246471 USA WA-UW214 M T159 712

US

A Cruise

A- 1

4 MT2464 MT2 64 USA

53 WA-UW207 703

China HZ-551 China MT246478 US

LR7

57

996 A WA-UW

China China 221

W

u

han seafood mar seafood han MT24 MT

2 6489 USA 537

k

e 04 China China 04

t WA-UW2

HZ- MT251975 USA WA-UW23932 MN996 5 76

530 Ch 530

i n MT a MT159

WI 24

V06 6473

719 USA C USA 719 MT2 USA W

M A-UW2 T12

ruiseA-3 5198

12 0 16

15 Chi 15 MT USA

MT1 2464 W

n A

a a -

5 UW24 SH0

9 57 USA W 711 US 711

1 MT2 2 MN988669 China MN988669

A 46479 Cruis A-UW20

MT MT1

eA-1 24 USA 0 3

59721 USA Cruis USA 59721 MT25 646 WA-UW2 WHU02 M 6 U

T0 SA 2

19531 China China 19531 MT24 1977 WA-UW 2 L

C52 US

e 6 A WA-U 2 8

A-5 MT16 474 USA 09 2 MT

3

2 Japan 2

IPBC

184912 M 3 W

LC534418 LC534418 717 2

AM T2464 WA-UW2134

1 USA S

U -WH- 9-02 MT24 MT24

S 82

A CruiseA- A W

0

03 MT2464 U A4-U 7 J LC52 645 apan 047 SA WA-

2 W2

9 MT18 MT1 USA 8 19-031

Pa

2 233

5 62 UW2

ki MT

MN99 W 84

stan Gilgit1 stan

4907 US

Jap A-UW19525

913 M 0 A WA M 1

an T 953 US

T12 652

US

1

1 LC5

MT1232 2329 -

9- 2 A UW2

A

8 Ch 8 3 C

02 C

Cruise

2 M ruis

MT03 2 h

7

9 T25 990 0 ina I 05 3

in MT253 Chi eA China IQTC03 China

MT01

a

91 5 Japan

A P -19

WIV0 3

9887 9887 na BCAMS- - MT15

MT 707 Chin 2 China IQTC02 China

6 IQ 2 MT15

MT09

1 710 4

098 India 29 India 098 T

18

U 9705 USA TKY C0

MT25 NC 04551 WH- S

8

A MT0442 C 1 3

MT192759 T MT192759 9 a HZ-6

363 h E WI1

5 0

MT18 722 ina 6182 4 USA MN996531 China China MN996531

37

1 MT

MT253701 Ch MT253701 H

C US 38

0 MT159 C Z-91 CA9 MT15 5

hin 15 8 58 US A r

MT15 u

Chin

MT01 3 2 Cruis iseA-

a MT159720 USA Cruise9716 USA4 LR757

WH MT 0 China MT15 ai

9

a H a MN9965 7 U A CA MT184 wan CGMH wan

70 9709 US e 7

9 08 SA MN2-MDH2

-0 M

MT184 159 A-

529 529

Z-60 MT 6 USA 6 MT159

9

998 China Wuhan seafood Wuhan China 998 MT039 T U 6 MT159 6

9 MT03 10

i

M Wuha

na H na 707 U SA

714 714 M Cr

MT 0

WIV07

Ch MT MT2

T 910 US 910 MT007 7 6053

MT184909 M T0

MT2 Crui u

1

CruiseA-8 MN9 T204UACA USA MT027064 2

908 908 27 C A

MT MT020781 Finla MT020781

ina IP ina MT0661 M

15 M

M M iseA

849 Z-48 6 USA CruiseA-16 USA 718 USA 718 T09 0 -C Crui

2706 9873 888 U 88 Ne

717 USA Cruis USA 717 SA

T 27062 US

T163 5 T T n-H 97

GU s

039890 USA

019

37 53

US hina eA

1 2 -18

1 9

3 Crui B

A CruiseA-23 A seA-12

13 US 13 2 4 1 544 Australia

-01 4

5 08 CAM u-1

69 3 -11 A

USA 6 6 468 China H

533 71 7 S pa CA8

Cruis WI 8 US 4

56

China seA 8 Ch 8 16 A A

C

0 5

USA USA S l 6 -

S M

So A CA4 V0 4

8 4 ruiseA

A

C CruiseA-24 U -WH-01 Ita weden 01 A

U 1-TW -

CruiseA A1 1

SA

B U eA-21 IP hina C 2

i S 0 n u

ly INMI1

Cr A

r S Z-

A

a th

HZ-7 CA2

a

eA-1 WA3

n

H A 3

u 1 -2 z

K

d 29 d

is

i Z

market orea

5 W VIC01 l

eA-22

-1

B

9 S -15

A

-UW CAM

85

P Ja

- S

U 0

n 2 NU01

W

1 S-

1 WH-05

9

7

Figure 2 ML phylogeny representing relationship amongst 167 SARS-CoV-2 strains. The whole-genome sequences of 167 isolates were aligned using MUSCLE, stripped to include only conserved alignments, and subjected to RAxML to generate the ML phylogeny utilizing the GTRGAMMA model of nucleotide substitution with 100 bootstrap replicates. The phylogeny is depicted with branch length consideration. The inner circle represents the respective geo-location of each strain. Inner Blue and Gray alternative dashed lines represent an internal tree scale with a branch length increment of 0.00005 from inside to outside. Full-size  DOI: 10.7717/peerj.9576/fig-2

As strains from diverse locations continue to be sequenced and analyzed, more reliable country-specific patterns showing relatedness can be obtained. Similar to several cutting edge studies (Woo et al., 2010; Fuk-Woo Chan et al., 2020; Zhang, Wu & Zhang, 2020; Rehman et al., 2020), this study also provides an example of how phylogenetic analysis can help generate clusters of identical strains, further enabling the identification of signature mutations amongst those diverse groups.

Parlikar et al. (2020), PeerJ, DOI 10.7717/peerj.9576 16/31 Mutations are continuously diversifying the Human SARS-CoV-2 strains Like the previous section, the whole genome sequence alignment of 167 SARS-CoV-2 isolates were used for this study. As anticipated, ~100 nucleotides at both N and C-terminals were highly diverse and, therefore, changes in these sites were not considered in this study. For the rest of the alignment, the percentage of nucleotide occurrence at each site (along with gap and other non-standard nucleotides) was calculated using the SARS-CoV-2 Wuhan Hu-1 genome as the reference (Fig. 3). The mutational probability was classified into four categories based on random distribution patterns: high (>19% variation), medium (4–19% variation), low (2–3% variation) and very low (<2% variation). Overall, 415 mutation sites were identified, among which non-standard nucleotides were present in the genome at 125 sites, which have been ignored in this study. We ultimately selected 290 sites with a confirmed variation. Amongst these, we found 43 sites with high, medium and low mutation probabilities within 9 out of 12 protein-coding regions in SARS-CoV-2 genomes, which also included the genes (S, M and N), identified as core proteins across genus Betacoronavirus. Amongst all mutation sites, we found 21 sites within the gene encoding Spike protein, 18 sites in the gene encoding Nucleocapsid protein and three sites in gene encoding M protein, suggesting that even these highly conserved protein-coding regions (core proteins as later identified via pan-genome analysis) are capable of mutating. Also, we recognized 142 mutational sites in orf1a, 50 in orf1b, 6 in orf3a, 2 in the envelope protein-coding gene, 4 in orf6, and orf8, 3 in orf7b and 1 in orf10. We also identified 36 sites within non-coding regions. Amongst sites with high mutation probability, C8,782T in orf1a showed ~35% variation from C to T, suggesting a transition mutation. Likewise, T28,144C in orf8 also has ~35% transition mutation variations. We found three hotspot mutational sites within the orf1b region of orf1ab, which has three transitions, two C to T (C17,747T and C18,060T; ~20% variation) and one A to G (A17,858G; 19% variation). In the spike protein-coding region, we were able to identify one transition site (A23,403G nucleotide) with medium (11%) mutational probability. Overall, we were able to identify nine hotspots with more than two consecutive probable mutational nucleotides in vicinity (197–210 (non-coding), 508–522 (orf1a), 686–694 (orf1a), 4,879–4,881 (orf1a), 20,298–20,300 (orf1b), 21,385–21,389 (orf1b), 21,991–21,993 (S), 28,878–28,883 (N), 29,750–29,759 (non-coding)). Most of these hotspot sites showed ~1–2% mutational probabilities and none of these have high or medium mutational probability as discussed in Fig. 3. We believe these genomic locations are the probable sites of evolutionary divergence, and therefore govern the evolutionary capabilities of these viruses. We also analyzed transition and transversion possibilities within all these mutational combinations. Out of the 290 identified mutational sites, 244 had a nucleotide substitution, distributed into 158 transitions and 86 transversions. Transition events are known to be more frequent and less likely to cause a change at the protein level as compared to transversions (Sanjuán et al., 2010). Most transition events are silent mutations, while transversion events may cause change at the protein level and therefore impact the

Parlikar et al. (2020), PeerJ, DOI 10.7717/peerj.9576 17/31 Gap Mutational Probability

High Others Medium Low T

Very Low G

C

A

No

1 9 0 4 6 Si 1 1 2 7 Si 2 mutation 1 0 S 1 1 1 Site 1 1 t 2 0 S 0 0 it t 0 19 e 0 0 Site e 29759 198

0 018 20 0 0 0 S i e 0 2 0 6

te 297 00140 2 20 Site 29 0 0 8 29758 0 0 ite 29754 e e 00 Site 29752 9 e 0 0 e 0 0 t t 2 t Sit 0 2 1

i 2 7 i te e 0 0 ite S 9757 it 1 i e 00204 0 2 te

Site 00104 00205 74 S it S12 02 Si Si t 4 S it 9 S te 0 2 e 2975 S i 021 S Site i te 54 7 S e 2 te e Si i te 0 4 S it 56 S i t e 0 297 5 S 753 S t 0 S ite e i te 0 00 0 02 5 S 9 8 S it 29 2 Si i te e 0 0332 9742 S i e 0 035 4 9 S ite e 29635 S t S i 0 0 50 0 Sit i 295 50 1 Sit e 0 5 t 736 t 0 1 Si e 293 29527 S i 0 1 e 29 Site te e 0 S te 2925 S t e 0 5 Site ite2 292 53 Si t e 005100 512 Si t 0 S Si 513 3 03 ite S ite 01 Si te i S e 00 00 6 S te 9 i 29095 30 4 Si te Sit te 28883 2 14 Sit e 00514 051 17 Si it 0 18 S e 891 0 te 00515 S it S e 005 e 2888 Si 519 0 ite 6 e 005 2 Site 2 2 Sit t 5 S 2 8881 Si te 00 2 i e Si ite 288548878 S it 22 8863 orf10 S te 00005215 Si te i 00 4 t S te 0614 Site e 2 28 Si 6 7 ite S 86 92 S e 0 006568 7 it 2 Sit 0 8 Si e 283840957 te 0 06 Sitete 28 Si te 28144 Si te 0 00688 Sit 78 0689 0 e Si ite Site 279 27964077 N S 91 S Site 0 0069 ite ite 006 Sit 27802 S 06923 S e 278 5 Sitee 0 ite 2 7 Sit 0069 Site 2 06 ite Site 27377 S te 00694 752522 Si Sit te 00832 e 2 orf8 Si e 0083384 Si 732784 Sit 9 te te 00805 Site 2722272 Si 102 Site 27 76 orf7b Site 01 Site 011348 Site 26930466 te 0 5 Site Si 0138 7 Site Site 26354267296 orf6 te 013969 Si 5 Site 2 Site 014154 Site 2 6326 te 0 48 Si 015 Site 614 M Site 259794 1691 Site 2581 Site 0 2091 e 0 9 Site 257 0 E Sit 0226 Site 75 Site 2277 255 Site 0 Site 2549463 02446 Site 24351 orf3a Site 32 Site 026 Site 24325 Site 02717 Site 71 24034 Site 029 Site 037 23952 Site 03 Site 23403 Site 03099 Site 231 Site 03177 85 9 Site 22988 Site 0325 Site 22984 Site 03373 Site 22785 Site 03411 Site 22432 S Site 03518 Site 22303 Site 03738 te 22224 Site 03778 Si Site 03992 ite 22104 S Site 04288 Site 22033 93 Site 0430 Site 219 S 7 ite 04402 Site 21992 Site 21991 04642 Site 84 Site 04 ite 217 Site 04880879 S 1707 Site 2 orf1a Site 04 21644 Si 881 Site 5 te 05 Site 0 052 Site 215721389 S 5062 Site 7 ite 0 138 Site 0 5084 Site 2 386 5572 21 5 Site 0 Site 138 Si 57 2 te 16 Site Sit 057 e 058484 Site 2131621147 Site ite 137 S 0 5 S 21 ite 6026 Sit 06 Site 2098036 S e 06 031 Site 09 3 it 0 2 82 S e 0 35 Site 0 ite 63 te 2 00 Sit 06 10 Si 203 99 Sitee 0663650 e 02 8 1 Sit 2 Si 0 81 S te 06669 Site 2 ite 5 te 2029 20 S 81 Si te 031 ite 06 9 Si Site 0 0 968 69 orf1b S 69 Site 20 ite 07 9 ite 19610 S 701 6 S 175 Si ite 0 19 te 1 6 Sitet e192 9065 S 7 05 018 Siteite 0 47 Si 19 786 9 Site 1 Site 08 ite 18975877 4 S 0 0 6 S e 1 it 8 01 it 18 Sit e 0876838 S 88 S 087 8 i e Site Si te 0 e 18689860349 S te 0 8 82 Sitei t1 i 9 S 181 060 8 Sit te 0 9 8 te 1 5 Sit 91570 7 8 47 e 09159 34 Si ite S 0 S te 18 S it e 474 e 9 Si te 17 23 Site i094t 092741 i 17 10 Sit e 0 8 S ite 177 Sit 09 947 0 S 174 ite e 7 S e S 174 Si i e 0 0 477 4 e 7373 Site 100 te Sit 24 2 t 9 e 17376 7 1 S e 099 09561 5 91 Sit t S 9 i e 1 1 87 7 ite 534 1 S it 17000 7 S ite 1 5 S i S ite e 169 68 8 Si t 1 Si i e 10507 1 S 1 t 0 24 16 Si te 1 e Sit e S te 0319 2 3 Si 1 Site 1607 Si ite te 3 6 18 5 S 10 2 Sit 15927 Si t 1 116 e S it te e Site S 1 1 e 54 S 8 Si i t e 1132 12 Site 16467t 5 S i 2 Sit te e 1 11410 Si i 3 1 3 S te Sit te 15607 15324 S S te 07 2 S 1 S i S it t 1 Si e 4408 9 1 1750 33 2 i e 8 ite 123 14657 2 4229 4 4 i te e 1 1 1 ite 1246 i S i t 60 e t 12041 176

t te e 1480 t e 773 7 121 Site 15597 1 1 7 195 7 3 1 0 685 e 12 e Sit e 12

e 5 5 5 5 13072 2 5 1

Si ite t 1 127 2600 12464 2

1 2 1 2 Sit 2 2 2 2 23 160 S te 13226 2 4 e 1 1 15 2 126 1 2 208 6

1 Si t 1 21

1

Site 02

4 Si ite te 12 4 e 1258 5 te e 78 e Site 132 e te 7 Si t 9 te 3 t t S i 5

i Si i 7 3 Site 1 Si Si S Si Sit S S

Figure 3 Mutation sites and hotspots distribution amongst 167 SARS-CoV-2 genomes as compared to Wuhan Hu-1. MUSCLE-based alignment was used to analyze the distribution of all nucleotides (including gap and other nucleotides) at each alignment site and the relative percentage of each nucleotide is represented here in a circular manner respectively for each nucleotide. The innermost circle represents the respective orf and non-coding DNA according to the location as mentioned in the third circle from inside. The second circle represents the mutational probability score of each locus distributed in high, medium, low and very low categories. Full-size  DOI: 10.7717/peerj.9576/fig-3

pathogenesis of the disease (Lyons & Lauring, 2017). Therefore, we suggest that amongst SARS-CoV-2 genomes, transversion events, that are occurring in significant numbers, would provide more evolutionary diversity and possibly lead to divergence, as compared to transition events.

Parlikar et al. (2020), PeerJ, DOI 10.7717/peerj.9576 18/31 Evolutionary relationship amongst SARS-CoV-2, Bat CoV and Pangolin CoV To re-examine the closest hosts of these viruses, we performed homology studies on small genome fragments (100 nucleotides) to identify their close relatives. We segmented each of the SARS-CoV-2, Pangolin CoV, SARS-CoV and MERS genome (reference genome of each category) into 100 nucleotide fragments, identified their closest homologs in the virus database, filtered the strains belonging to its taxonomy, selected top 50 hits per segment, calculated the host CoV distribution and represented it as a heatmap for each segment/strain along with the closest (topmost) hit per segment/strain (Fig. 4). MERS-CoV is already known to originate from the bat, however, the infection spread via camels either directly/indirectly (Azhar et al., 2014). On excluding Camel CoVs from the data, we found several segments showing maximum similarity representation in Bat CoVs, followed by Hedgehog and Human CoV strains, however, most of the parts were uniquely present in MERS. As expected, when Camel CoV strains were included, they were the closest best hit as indicated in the “MERS Closest” category (Fig. 4). When we further investigated the same patterns in the SARS-CoV genome, we identified the Bat CoV as the closest strain for most of the reference segments, followed by SARS-CoV-2 and Pangolin CoV. In the closest category, we found Bat CoV as the closest suggesting their evolutionary origin. However, many segments do not show any closeness with any strain, indicating their uniqueness in the reference genome. We performed the same analysis on the Pangolin CoV strain GX-P5E and found that several segments have close homology-distribution within SARS-CoV and SARS-CoV-2 genomes. Interestingly, as visible in the “Pangolin CoV Closest” category, several hits show maximum similarity with Bat SARS-CoV, followed by SARS-CoV-2 and a few Human CoVs. Likewise, SARS-CoV-2 genome fragments show maximum homology representation in Bat CoV followed by Pangolin CoV. In the “closest category”, Bat CoV indisputably stands out as the closest host, which is corroborated by whole-genome phylogeny and the percentage identity matrix studies. However, the presence of several Pangolin CoV homologous segments in this analysis indicates that certain specific segments in the SARS-CoV-2 genome share a closer relationship with Pangolin CoV strains as compared to Bat CoVs. Overall, our analysis indicates that the SARS-CoV-2 genome has a close relationship with Bat CoV and Pangolin CoV, along with the presence of a few unique segments. This reasserts that SARS-CoV-2 strains might have evolved from a common ancestor of Bat SARS-CoV and Pangolin CoV strains and during evolution, several segments in the SARS-CoV-2 genome might have diverged as compared to Bat SARS-CoV and Pangolin CoV. The recognized unique genomic segments in this study may be used as potential targets for diagnostic kits.

Pan-genome (Core, accessory and unique gene) analysis Regardless of their approximately similar genome size, the members of Betacoronavirus harbor huge diversity (5–14 CDS) in the number of protein-coding regions, as per the

Parlikar et al. (2020), PeerJ, DOI 10.7717/peerj.9576 19/31 SARS-CoV-2 Closest SARS-CoV-2 Unique SARS-CoV-2 Closest Human-CoV

-2 -2 Pangolin-CoV Bat CoV Bat-CoV Pangolin CoV Closest Human Pangolin-CoV Unique

SARS-CoV Other-CoV Pangolin Human-CoV

SARS2 Unique CoV SARS-CoV-2 Bat-CoV

SARS-CoV Closest Pangolin CoV Closest Pangolin- SARS-CoV Unique Bird/Rabbit-CoV

Bat-CoV Pangolin-CoV oV SARS-CoV-2 SARS-CoV-2 Bat-CoV

MERS Closest Human-CoV SARS-C MERS Unique Pangolin-CoV Unique Bird-CoV Human-CoV

Hedgehog-CoV RS-CoV

SARS-CoV Closest E Bat-CoV M

0

0 0 0 0 0 0 0 0 0 0 0 0 0 3010 0 0 0 0 0 0 3 0 10 0 2 0 2 5 6 000 0

29 3 4 7 00 8 40 0 0 1 Bat 2 13 500 0 1 1 9 0 1 1 1 6 1 1 1 0 2 98 1 0 9 1 7 0 0 0 1 0 0 1 29 0 1 1 9 1 0 1 8 0 0 01 0 1 1 1 100 1 0 1 3 2 0 0 2 3 7 4 00 5 0 6 001 1 0 3 7 19 0 2 9 601 8 1 1 0 1 901 2 0 1 30 0 1 1 2 00 2 9 5 3 1 1 2 2 01 120 0 00 2 9 4 0 200 1 5 0 1 2 0 1401 6 00 01 1 21 23 4 2 9 3 0 1 29 29 9 0 1 0 0 2 0 1 9 0 1 8 1 2 2900910 9 0 17 9 1 0 0 1 2 8 0 1 0 2 00 0 1 1 26007 00 1 700 0 1 1 25 288 8 2 9 200 0 2 00 0 1 2920 6 0 2 9 2 9 5 220 3 9 0 28 0 1 0 01 1 28 2 0 SARS1 Unique 9 4 0 0 2 1 2 01 1 3 240 100 8 7 2 0 0 25 60 1 3 2 01 2 9100 00 0 0 1 3 00 85 6 2 3200 0 28 0 2 9 270 0 1 3 0 0 2830 0 1 2 28889000 0 28 9 0 4 0 4 1 0 2 01 1 3 5 0 2 0 0 30 3 0 8201 1 2 8 1 3 60 0 2 8607 00 31 20 0 1 7 0 81 1 284028 00 3 3 0 3 0 28001 3 3 8 0 2 0 28300 5 4 01 1 3 0 0 7901 1 0 0 3 0 9 27801 28 0 35 3 00 0 28 0 36 01 1 4 2 20 7 0 10 00 2 77 2 3 8 4 2 760101 8 1 0 3 01 4 00 2 2 00 00 9 01 75 27 79 3 1 43 00 2 0 40 0 1 7 0 27700 0 0 44 27304 1 80 0 41 2 00 MERS Closest 01 27 0 4 01 1 4500 0 27 2 600 43 70 2 201 1 27407 1 46 0 71 5 44050 1 4 2 0 4 0 00 7 01 2730 0 6 0 2 00 272 0 4 00 69 1 4701 480 5 0 26 0 2 0 0 801 149 0 801 1 2 71 0 4 0 5100 26 7000 49 2 0 2 70 26900 0 001 5 30 66011 2 26 0 5 00 Camel 26 510101 5 4 5 80 52 1 5 500 264001 26667 0 30 5 0 00 5 01 60 26 1 4 5 00 26201301 26 00 5 01 7 2 500 55 1 5 00 2 64 60 58 0 61 26 00 5 1 90 26 01 26 30 570 1 5 00 0 0 6000 259 1 26 200 58 01 2 01 2 10 59 6100 5801 2560 0 6001 6200 2570 00 1 300 1 900 610 6 0 2560 258 6201 40 2 1 2570000 1 6 00 5501 630 65 25 25600 6401 600 401 01 6 2530 2550 65 6700 2 1 2540 0 6601 800 5201 0 1 6 25101 252530 670 6900 0 6801 Representation Score 25001 200 24 2510 6901 70007100 901 2 0 7001 24 5000 801 2490 7101 7200300 2470 0 7201 7 1 24800 1 7400 24601 24700 730 0 24501 24 7401 750 600 7501 7600 0 24401 24500 700 24 7601 7 301 24400 7701 7800 24201 24300 7801 7900 24101 24200 7901 8000 24001 24100 8001 8100 23901 24000 8101 8200 10 23801 23900 8201 8300 23701 23800 8301 8400 23601 23700 8401 8500 23501 23600 8501 8600 23401 23500 8601 8700 23400 8701 88 23301 3300 88 00 3201 2 01 8900 20 2 23200 8901 23101 00 900 9000 01 231 1 910 230 000 9101 0 01 23 0 92 9200 229 1 2290 01 9 2280 2800 9301 94300 01 2 00 94 00 227 227 01 601 600 950 9500 22 0 9 1 9 30 01 22250 601 600 225 1 2 00 97 97 240 224 01 00 2 01 300 9801 980 223 22 0 9 9 0 201 20 901 900 22 22 00 10 10 101 221 0 1 001 00 22 1 00 01 101 0 200 22 00 10 01 00 2 01 10 201 10 10 19 219 3 20 40 2 801 10 01 3 0 21 218001700 105401 10 00 1 40 2170101 2 00 10 01 05 0 216 15 10 601 1060000 2 400 7 2150101 21600 00 10 01 10700 4 21 3 0 10 80 1 21 1 21 0 9 1 08 1 12 0 11 0 1 0 21300 2 10 111 0 1 1 09 0 12 1 00 01 1 0 50 2 10 21 0 11 0 1 0 0 1 01 210 0 1 20 1 1 00 2 0 09 00 1 1 10 1 2 8 0 1 3 1 1112 0 2 1 0 11 14 01 1 0 209010 20 7 0 3 0 08 20 600 0 116 5 1 140000 2 701 0 0 11 01 1 1 01 2 01 1500 20 1 205 11 70 1 206 0 0 1 8 1 1 6 05 1 0 1 19010 12000 1 0 2 40 2 2 1 1 70 0 0 203000 00 1 0 11 18 60 2 1 21 0 0 0 2 00 1 201 0 1 2 0 1 9 0 20301 204000 0 00 0 2 2 1 1 0 202011 9 0 1 30 0 2 0 01 1 2 8 00 1 240 1 12 1 20 9 1 2 1 12 1 0 90 1 19 1 97 0 12 2 5 2 2 0 200 1 0 6 0 1 3 00 19 80 1 5 00 1 1 1 0 9 0 9 0 1 2 7 01 1 4 1 7 01 4 290 80 0 2 00 0 9 1 30 13 1 2 5 1 19 0 131 1 0 1 200 1 00 1 2 6 196 501 019600 9 128 0 0 4 19 13 3 1 12 70 0 70 19 1 910 0 1 2 0 1 9 13000 301 01 1 9000 0 0 1350134 30 0 1 9 00 0 1 9 2 1 0 0 13 13 1 01 0 13701 1 1 0 9 1 18900 7 138 01 1 0 8 60 0 0 6 1 3 1 1 9 188 0 0 1 1 01 8 0 1 01 3 2 00 1 0 141 3 1 01 1 1 85 4 0 4001 3 30 00 3 0 0 1 9 19001 0 1 1 1 8 0 1 0 136 3 40 8 2 1 4 0 189 7 0 1 1 1 4 1 13700 5 0 1 1 8 1460 4 2 1 138 188 6 8 0 1 4 3 01 0 0 1 1 1 0 14 1 50 1 4 0 18 8 800 0 1 4 00 0 1 0 1 5 0 1410 1 40 0 700 0 15201 153 1 3 1 1 15301 15400 4 0 1 0 15401 15 7 7800 6 15501 5 0 1 40 01 15601 15700 0 0 15701 51 1 1 801 900 8 3 17900 5 0 1 0 18 1 0 9 1 2 0 400 0 0 0 1 1 1 1 0 0 3 0 8 5 0 0 5 0 4 1 0 1 17 7 200 0 1 14700 1 0 80 0 0 1 8 17 10 0 1 4 0 1 0 0 0 7 0 01 20 0 8 0 9 0 1 7 9 0 1 4 8 01 17 8 1 1 7 70 1 3 0 0 600 4 7 5 4 3 0 2 0 8 1 6 1 1 46 4 1 1 6 14900 0 79 1 6 1 5 6 1 0 6 15 1 6 1 01 1 6 1 1 6 4 0 601 1 1 1 0 1 0 1 0 1 16 1560 5 7701 1 5 01 1 15800 1 8 0 1 178 1 1 5 00 0 1 4 1 1 0 0 1 0 1 7 1 0 1 1 0 2 1 17 7 0 1 20 1 1 5 1 6 0 1 00 1 0 0 0 0 500 0 7 0 1 9 0 0 0 9 0 0 8 0 00 173 7 7 0 1 5 6 4 0 3 2 0 1 0 171 6 0 6 1 66 0 1 6 6 6 0 6 0 1 6 1 1 16 1 1 1 1 90 1

100

Figure 4 Host distribution of each 100-nucleotide segment in the reference genomes of MERS, SARS-CoV, Pangolin CoV, and SARS-CoV-2. Each 100-nucleotide segment per strain was subjected to BLASTn against the virus database. Self-category hits were removed from the analysis, host information of top 50 hits was identified, and the relative percentage is represented here as a heatmap for each segment in each genome. In the closest category, the host information of best hit (after removing self-hit) is represented in the form of a multi-color stripe. The innermost circle represents the location of each segment to start from 1 to 100 nucleotides to the total genome length of the genome. Full-size  DOI: 10.7717/peerj.9576/fig-4

available NCBI genome annotations. SARS-CoV and SARS-CoV-2, belonging to subgenus Sarbecovirus, have 14 and 12 protein-coding regions respectively, revealing the diversity within the same subgenus. Similarly, Pangolin CoV (studied strain: GX-P5E), which also belongs to subgenus Sarbecovirus, has only nine protein-coding regions. To identify the

Parlikar et al. (2020), PeerJ, DOI 10.7717/peerj.9576 20/31 Figure 5 Pan-genome (Core, accessory, and unique) analysis amongst 19 Betacoronavirus reference genomes (including one Pangolin CoV). The X-axis represents the strain combinations as black-filled circles, whereas Y-axis represents the distribution of genes in the respective combinations and inside each vertical bar, those gene names are written. For each strain, their respective subgenus category has been pointed out on the left side as the colored scale in the corner. In the left-down corner, the number of genes encoded by each genome is depicted as a horizontal bar. In the lowermost panel, pan-genome categories (core, accessory and unique genes) are shown. Full-size  DOI: 10.7717/peerj.9576/fig-5

presence of (co-)orthologous proteins present across all these genomes, a bidirectional (reciprocal) best BLAST hit-based homology detection strategy was followed using the tool Proteinortho (Lechner et al., 2011)(Fig. 5; Fig. S5). A total of 44 protein-coding region clusters represent the pan-genome of nineteen studied genus Betacoronavirus members amongst which, 3 protein-coding regions (S, M and N proteins) were found to be conserved across all genomes, representing the core genome. A similar analysis has been performed on all Betacoronavirus species; however, it does not include Pangolin CoV (Alam et al., 2020). Our study also recognized that the pan-genome of genus Betacoronavirus is open, which means that with the addition of a new species, the pan-genome is continuously increasing (Fig. S6). The increase in the pan-genome at each step is approximately proportional to the increase in unique genes. Since each newly identified strain has additional unique gene sequences, it is effectively impossible to predict their complete pan-genome. As expected, the members of each subgenus, which are closest to each other, have only a few variations, however, with the addition of a new subgenus member(s), the pan-genome expands immediately. orf1ab, the largest coding region, codes for two polyproteins ORF1a and ORF1ab (266–13,468,13,468–21,555) because of the “leaky mechanism” of the ribosome, a (-1) frameshift just upstream of the orf1a translation termination codon. All CoVs, except for strain MHV A59 of subgenus Embecovirus, translate the full-length orf1ab. MHV A59,

Parlikar et al. (2020), PeerJ, DOI 10.7717/peerj.9576 21/31 however, has separate polyproteins from orf1a and orf1b. Due to the absence of ribosomal slippage reported in several coronaviruses, ORF1a is encoded by only 9 out of 18 genomes, making it an accessory protein cluster of the pan-genome. ORF1b polyprotein is encoded as a part of orf1ab in most of the species belonging to genus Betacoronavirus, again except for strain MHV A59, where it is encoded separately as RdRp, forming a unique gene. Envelope protein (76–82 aa across Betacoronavirus) is another accessory protein encoded by 15 out of 18 genomes. In another cluster, we identified the presence of small envelope protein in two out of those three genomes where E protein is absent (both in subgenus Nobecovirus). No E protein-encoding region is found in the MHV A59 genome annotations. The gene encoding ORF3 is distributed into two subunits (orf3a and orf3b) in SARS-CoV (via an internal ribosomal entry mechanism), whereas only the first subunit (orf3a; 276 aa) is present in SARS-CoV-2. The gene for ORF3a is conserved only amongst subgenus Sarbecovirus organisms. This protein activates the pro-transcription of gene IL-1β and helps its maturation; both signals are prerequisites for NLRP3 (NOD-, LRR- and pyrin domain-containing protein 3) inflammasome activation during infection (Siu et al., 2019). Conservation of ORF3a across SARS-CoV-2, SARS-CoV and Pangolin CoVs provides a clue about its unique infection strategy. ORF3b (155 aa) is only present in SARS-CoV and absent in all other coronaviruses. It has been suggested that ORF3b supports SARS-CoV infection by inhibiting the type-1 Interferon (IFN) synthesis and therefore, IFN signaling (McBride & Fielding, 2012). ORF6 is also a subgenus Sarbecovirus specific protein (conserved across SARS-CoV, SARS-CoV-2 and Pangolin CoV) localized in the Endoplasmic Reticulum or the Golgi membrane. During infection, it interacts with and interrupts the nuclear import complex formation. During infection-caused interferon signaling, this interruption inhibits STAT1 transportation to the nucleus and blocks the expression of genes involved, therefore mimicking an antiviral state (Frieman et al., 2007). Another accessory protein of the Betacoronavirus pan-genome is ORF7a, which is a unique protein amongst subgenus Sarbecovirus members. In SARS-CoV, it is known that bone marrow stromal antigen-2 (BST-2 or tetherin or CD317) can restrict the virus release from inside the cell causing partial inhibition of infection. However, ORF7a can inhibit the action of this protein, thus promoting the infection (Taylor et al., 2015). ORF7b is present in both SARS-CoV-2 and SARS-CoV but absent in Pangolin CoV NCBI annotations. Like orf7b in SARS-CoV, its initiation codon overlaps with orf7a via ribosome leaky scanning and it works as a structural component (Schaecher, Mackenzie & Pekosz, 2007). orf8 gene (365 nucleotides) is uniquely present in SARS-CoV-2 and Pangolin CoV but absent in SARS-CoV. Distinctively, orf8 in SARS-CoV is separated into two proteins ORF8a (119 nucleotides) and ORF8b (254 nucleotides). ORF8 in SARS-CoV-2 is highly divergent from the inflammasome activator ORF8b in SARS-CoV (Yuen et al., 2020). The function of this protein is not clear yet, but it has been suggested that overall ORF8ab or ORF8a and ORF8b together modulate viral replication and pathogenesis in unknown fashion (McBride & Fielding, 2012). ORF10 is a unique protein in SARS-CoV-2 strains. We also performed the homology analysis of ORF10 protein

Parlikar et al. (2020), PeerJ, DOI 10.7717/peerj.9576 22/31 against the NCBI Non-redundant database and found no hits, which suggests that this protein is specific to SARS-CoV-2. This analysis also revealed that some proteins are specific to a subgenus. Subgenus Embecovirus has a unique combination of proteins, namely HE, ORF4, internal protein, NS2a and NS3a, that are present in most genomes of this subgenus. These proteins might be considered unique to subgenus Embecovirus as compared to the rest of the genus Betacoronavirus. Hemagglutinin Esterase glycoprotein (HE) is a nonessential protein with sialic acid-binding and acetyl esterase activity that create tinier spikes (second attachment factor) on the viral envelope in several MHV strains. orf4 (also annotated as orf5a, ns12.9) encodes a nonstructural accessory protein that is, known to function as type I interferon antagonist. The internal gene is encoded within the N gene in MHV CoV; however, its function is not well-known. Similarly, subgenus Merbecovirus members also have some unique proteins (NS3a, NS3b, NS4b, ORF5), that are absent in other subgenera. We also identified unique proteins in each organism, which were absent in other Betacoronavirus. Overall, it can be argued that although genus Betacoronavirus members have ~10 proteins coding regions per genome, they have conserved similarity in only ~10% (3–4 proteins) of the genes and their diversity (~40% accessory genes and ~50% unique genes) spans the rest of the genome, leading to their different pathogenesis across diverse hosts and distinct evolution. We identified that even among subgenus Sarbecovirus members (SARS-CoV, SARS-CoV-2 and Pangolin CoV), orf3b, orf7b, orf8, orf9b and orf10 are uniquely distributed within at least one genome, therefore depicting wide diversity within the same subgenus.

CONCLUSIONS Humanity has already encountered coronavirus infections twice in the past two decades, in the form of the SARS and MERS epidemics. The latest coronavirus infection, COVID-19, caused by the highly pathogenic and transmissible SARS-CoV-2, turned into a pandemic in sheer three months. This work provides a brief review of the taxonomy, history, origin, and genome organization of the virus, followed by comparative genomics research focused on understanding the evolutionary relationship amongst 167 SARS-CoV-2, 312 SARS-CoV and 5 Pangolin CoV genome sequences as available on 29 March 2020. We identified that SARS-CoV-2 strains form a monophyletic clade distinct from SARS-CoV and Pangolin CoV. This study reaffirmed that they are closest to the Bat CoV RaTG13 strain followed by Pangolin CoV, suggesting that SARS-CoV-2 evolved from a common ancestor putatively residing in bat or pangolin hosts. We identified several mutation sites and hotspots within SARS-CoV genomes with high, medium and low probabilities. Homology studies based on 100 nucleotide segments in the SARS-CoV-2 reference genome pointed out their close relationships with Bat and Pangolin CoVs, along with the unique nature of a few segments. We recognized that 44 protein-coding regions constitute the pan-genome for nineteen genus Betacoronavirus strains. Moreover, their pan-genome is open, highlighting the wide diversity provided by newly identified novel strains. Even members of subgenus Sarbecovirus are diverse relative to each other due to

Parlikar et al. (2020), PeerJ, DOI 10.7717/peerj.9576 23/31 the relative presence of unique protein-coding regions orf3b, orf7b, orf8 and orf9b and orf10. Overall, this review and research highlight the diversity within the available SARS-CoV-2 genomes, their potential mutational sites hotspots, and probable evolutionary relationship with other coronaviruses, which might further assist in our understanding of their evolution, epidemiology and pathogenicity.

ACKNOWLEDGEMENTS We acknowledge Tanvi Kale, IBAB MSc (2018–20) student for graciously offering her artwork to use as a cover image for this manuscript.

ADDITIONAL INFORMATION AND DECLARATIONS

Funding Department of Science and Technology (DST), Government of India and IBAB, Bengaluru provided financial and infrastructure support to Gaurav Sharma. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Grant Disclosures The following grant information was disclosed by the authors: DST-INSPIRE Faculty Award from Department of Science and Technology (DST), Government of India to GS.

Competing Interests The authors declare that they have no competing interests.

Author Contributions  Arohi Parlikar performed the experiments, analyzed the data, authored or reviewed drafts of the paper, and approved the final draft.  Kishan Kalia performed the experiments, analyzed the data, prepared figures and/or tables, authored or reviewed drafts of the paper, and approved the final draft.  Shruti Sinha performed the experiments, analyzed the data, authored or reviewed drafts of the paper, and approved the final draft.  Sucheta Patnaik performed the experiments, analyzed the data, authored or reviewed drafts of the paper, and approved the final draft.  Neeraj Sharma performed the experiments, analyzed the data, prepared figures and/or tables, authored or reviewed drafts of the paper, and approved the final draft.  Sai Gayatri Vemuri performed the experiments, analyzed the data, authored or reviewed drafts of the paper, and approved the final draft.  Gaurav Sharma conceived and designed the experiments, performed the experiments, analyzed the data, prepared figures and/or tables, authored or reviewed drafts of the paper, and approved the final draft.

Parlikar et al. (2020), PeerJ, DOI 10.7717/peerj.9576 24/31 Data Availability The following information was supplied regarding data availability: The strains and genomes used in this research were retrieved from NCBI Virus (https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/) on 29 March 2020. This information is provided in Table S1.

Supplemental Information Supplemental information for this article can be found online at http://dx.doi.org/10.7717/ peerj.9576#supplemental-information.

REFERENCES Ahmed SF, Quadeer AA, McKay MR. 2020. Preliminary identification of potential vaccine targets for the COVID-19 coronavirus (SARS-CoV-2) based on SARS-CoV immunological studies. Viruses 12(3):254 DOI 10.3390/v12030254. Alam I, Kamau A, Kulmanov M, Arold ST, Pain A, Gojobori T, Duarte CM. 2020. Functional pangenome analysis provides insights into the origin, function and pathways to therapy of SARS-CoV-2 coronavirus. bioRxiv DOI 10.1101/2020.02.17.952895. Andersen KG, Rambaut A, Lipkin WI, Holmes EC, Garry RF. 2020. The proximal origin of SARS-CoV-2. Nature Medicine 26:450–452 DOI 10.1038/s41591-020-0820-9. Angelini MM, Akhlaghpour M, Neuman BW, Buchmeier MJ. 2013. Severe acute respiratory syndrome coronavirus nonstructural proteins 3, 4, and 6 induce double-membrane vesicles. mBio 4(4):e00524-13 DOI 10.1128/mBio.00524-13. Azhar EI, El-Kafrawy SA, Farraj SA, Hassan AM, Al-Saeed MS, Hashem AM, Madani TA. 2014. Evidence for camel-to-human transmission of MERS coronavirus. New England Journal of Medicine 370(26):2499–2505 DOI 10.1056/NEJMoa1401505. Barr JN, Fearns R. 2010. How RNA viruses maintain their genome integrity. Journal of General Virology 91(6):1373–1387 DOI 10.1099/vir.0.020818-0. Boni MF, Lemey P, Jiang X, Lam TT-Y, Perry B, Castoe T, Rambaut A, Robertson DL. 2020. Evolutionary origins of the SARS-CoV-2 sarbecovirus lineage responsible for the COVID-19 pandemic. bioRxiv DOI 10.1101/2020.03.30.015008. Bouvet M, Lugari A, Posthuma CC, Zevenhoven JC, Bernard S, Betzi S, Imbert I, Canard B, Guillemot JC, Lécine P, Pfefferle S, Drosten C, Snijder EJ, Decroly E, Morelli X. 2014. Coronavirus Nsp10, a critical co-factor for activation of multiple replicative enzymes. Journal of Biological Chemistry 289(37):25783–25796 DOI 10.1074/jbc.M114.577353. Buchfink B, Xie C, Huson DH. 2014. Fast and sensitive protein alignment using DIAMOND. Nature Methods 12(1):59–60 DOI 10.1038/nmeth.3176. Chen Y, Liu Q, Guo D. 2020. Emerging coronaviruses: genome structure, replication, and pathogenesis. Journal of Medical Virology 92(4):418–423 DOI 10.1002/jmv.25681. Chen Y, Su C, Ke M, Jin X, Xu L, Zhang Z, Wu A, Sun Y, Yang Z, Tien P, Ahola T, Liang Y, Liu X, Guo D. 2011. Biochemical and structural insights into the mechanisms of SARS coronavirus RNA ribose 2′-O-Methylation by nsp16/nsp10 protein complex. PLOS Pathogens 7(10):e1002294 DOI 10.1371/journal.ppat.1002294. Christiam C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL. 2009. BLAST+: architecture and applications. BMC Bioinformatics 10:421 DOI 10.1186/1471-2105-10-421.

Parlikar et al. (2020), PeerJ, DOI 10.7717/peerj.9576 25/31 Coleman CM, Frieman MB. 2014. Coronaviruses: important emerging human pathogens. Journal of Virology 88(10):5209–5212 DOI 10.1128/JVI.03488-13. Cong Y, Ulasli M, Schepers H, Mauthe M, V’kovski P, Kriegenburg F, Thiel V, De Haan CAM, Reggiori F. 2019. Nucleocapsid protein recruitment to replication-transcription complexes plays a crucial role in coronaviral life cycle. Journal of Virology 94(4):e01925-19 DOI 10.1128/JVI.01925-19. Cottam EM, Whelband MC, Wileman T. 2014. Coronavirus NSP6 restricts autophagosome expansion. Autophagy 10(8):1426–1441 DOI 10.4161/auto.29309. Cui J, Li F, Shi ZL. 2019. Origin and evolution of pathogenic coronaviruses. Nature Reviews Microbiology 17(3):181–192 DOI 10.1038/s41579-018-0118-9. De Haan CAM, Haijema BJ, Schellen P, Schreur PW, Te Lintelo E, Vennema H, Rottier PJM. 2008. Cleavage of group 1 coronavirus spike proteins: how furin cleavage is traded off against heparan sulfate binding upon cell culture adaptation. Journal of Virology 82(12):6078–6083 DOI 10.1128/JVI.00074-08. Edgar RC. 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research 32(5):1792–1797 DOI 10.1093/nar/gkh340. Fehr AR, Perlman S. 2015. Coronaviruses: an overview of their replication and pathogenesis. In: Maier H, Bickerton E, Britton P, eds. Coronaviruses: Methods and Protocols. New York: Springer, 1–23. Follis KE, York J, Nunberg JH. 2006. Furin cleavage of the SARS coronavirus spike glycoprotein enhances cell–cell fusion but does not affect virion entry. Virology 350(2):358–369 DOI 10.1016/j.virol.2006.02.003. Frieman M, Yount B, Heise M, Kopecky-Bromberg SA, Palese P, Baric RS. 2007. Severe acute respiratory syndrome coronavirus ORF6 antagonizes STAT1 function by sequestering nuclear import factors on the rough endoplasmic reticulum/golgi membrane. Journal of Virology 81(18):9812–9824 DOI 10.1128/JVI.01012-07. Fuk-Woo Chan J, Kok K-H, Zhu Z, Chu H, Kai-Wang To K, Yuan S, Yuen K-Y. 2020. Genomic characterization of the 2019 novel human-pathogenic coronavirus isolated from a patient with atypical pneumonia after visiting Wuhan. Emerging Microbes & Infections 9(1):221–236 DOI 10.1080/22221751.2020.1719902. Gadlage MJ, Sparks JS, Beachboard DC, Cox RG, Doyle JD, Stobart CC, Denison MR. 2010. Murine hepatitis virus nonstructural protein 4 regulates virus-induced membrane modifications and replication complex function. Journal of Virology 84(1):280–290 DOI 10.1128/JVI.01772-09. Gorbalenya AE, Baker SC, Baric RS, De Groot RJ, Drosten C, Gulyaeva AA, Haagmans BL, Lauber C, Leontovich AM, Neuman BW, Penzar D, Perlman S, Poon LLM, Samborskiy DV, Sidorov IA, Sola I, Ziebuhr J. 2020. The species severe acute respiratory syndrome-related coronavirus: classifying 2019-nCoV and naming it SARS-CoV-2. Nature Microbiology 5(4):536–544 DOI 10.1038/s41564-020-0695-z. Gordon DE, Jang GM, Bouhaddou M, Xu J, Obernier K, White KM, O’Meara MJ, Rezelj VV, Guo JZ, Swaney DL, Tummino TA, Huettenhain R, Kaake RM, Richards AL, Tutuncuoglu B, Foussard H, Batra J, Haas K, Modak M, Kim M, Haas P, Polacco BJ, Braberg H, Fabius JM, Eckhardt M, Soucheray M, Bennett MJ, Cakir M, McGregor MJ, Li Q, Meyer B, Roesch F, Vallet T, Mac Kain A, Miorin L, Moreno E, Naing ZZC, Zhou Y, Peng S, Shi Y, Zhang Z, Shen W, Kirby IT, Melnyk JE, Chorba JS, Lou K, Dai SA, Barrio- Hernandez I, Memon D, Hernandez-Armenta C, Lyu J, Mathy CJP, Perica T, Pilla KB, Ganesan SJ, Saltzberg DJ, Rakesh R, Liu X, Rosenthal SB, Calviello L, Venkataramanan S,

Parlikar et al. (2020), PeerJ, DOI 10.7717/peerj.9576 26/31 Liboy-Lugo J, Lin Y, Huang X-P, Liu Y, Wankowicz SA, Bohn M, Safari M, Ugur FS, Koh C, Savar NS, Tran QD, Shengjuler D, Fletcher SJ, O’Neal MC, Cai Y, Chang JCJ, Broadhurst DJ, Klippsten S, Sharp PP, Wenzell NA, Kuzuoglu D, Wang H-Y, Trenker R, Young JM, Cavero DA, Hiatt J, Roth TL, Rathore U, Subramanian A, Noack J, Hubert M, Stroud RM, Frankel AD, Rosenberg OS, Verba KA, Agard DA, Ott M, Emerman M, Jura N, von Zastrow M, Verdin E, Ashworth A, Schwartz O, d’Enfert C, Mukherjee S, Jacobson M, Malik HS, Fujimori DG, Ideker T, Craik CS, Floor SN, Fraser JS, Gross JD, Sali A, Roth BL, Ruggero D, Taunton J, Kortemme T, Beltrao P, Vignuzzi M, García-Sastre A, Shokat KM, Shoichet BK, Krogan NJ. 2020. A SARS-CoV-2 protein interaction map reveals targets for drug repurposing. Epub ahead of print 30 April 2020. Nature DOI 10.1038/s41586-020-2286-9. Graham RL, Sims AC, Brockway SM, Baric RS, Denison MR. 2005. The nsp2 replicase proteins of murine hepatitis virus and severe acute respiratory syndrome coronavirus are dispensable for viral replication. Journal of Virology 79(21):13399–13411 DOI 10.1128/JVI.79.21.13399-13411.2005. Graham RL, Sparks JS, Eckerle LD, Sims AC, Denison MR. 2008. SARS coronavirus replicase proteins in pathogenesis. Virus Research 133(1):88–100 DOI 10.1016/j.virusres.2007.02.017. Grossoehme NE, Li L, Keane SC, Liu P, Dann CE, Leibowitz JL, Giedroc DP. 2009. Coronavirus N protein N-terminal domain (NTD) specifically binds the transcriptional regulatory sequence (TRS) and melts TRS-cTRS RNA duplexes. Journal of Molecular Biology 394(3):544–557 DOI 10.1016/j.jmb.2009.09.040. Hamre D, Procknow JJ. 1966. A new virus isolated from the human respiratory tract. Proceedings of the Society for Experimental Biology and Medicine 121(1):190–193 DOI 10.3181/00379727-121-30734. Han Q, Lin Q, Jin S, You L. 2020. Coronavirus 2019-nCoV: a brief perspective from the front line. Journal of Infection 80(4):373–377 DOI 10.1016/j.jinf.2020.02.010. Hu D, Zhu C, Ai L, He T, Wang Y, Ye F, Yang L, Ding C, Zhu X, Lv R, Zhu J, Hassan B, Feng Y, Tan W, Wang C. 2018. Genomic characterization and infectivity of a novel SARS-like coronavirus in Chinese bats. Emerging Microbes & Infections 7(1):1–10 DOI 10.1038/s41426-018-0155-5. Huang C, Wang Y, Li X, Ren L, Zhao J, Hu Y, Zhang L, Fan G, Xu J, Gu X, Cheng Z, Yu T, Xia J, Wei Y, Wu W, Xie X, Yin W, Li H, Liu M, Xiao Y, Gao H, Guo L, Xie J, Wang G, Jiang R, Gao Z, Jin Q, Wang J, Cao B. 2020. Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China. Lancet 395(10223):497–506 DOI 10.1016/S0140-6736(20)30183-5. Issa E, Merhi G, Panossian B, Salloum T, Tokajian S. 2020. SARS-CoV-2 and ORF3a: nonsynonymous mutations, functional domains, and viral pathogenesis. mSystems 5(3):e00266-20 DOI 10.1128/mSystems.00266-20. Kendall EJC, Bynoe ML, Tyrrell DAJ. 1962. Virus isolations from common colds occurring in a residential school. BMJ 2(5297):82–86 DOI 10.1136/bmj.2.5297.82. Kirchdoerfer RN, Ward AB. 2019. Structure of the SARS-CoV nsp12 polymerase bound to nsp7 and nsp8 co-factors. Nature Communications 10(1):1–9 DOI 10.1038/s41467-019-10280-3. Kopecky-Bromberg SA, Martinez-Sobrido L, Frieman M, Baric RA, Palese P. 2007. Severe acute respiratory syndrome coronavirus open reading frame (ORF) 3b, ORF 6, and nucleocapsid proteins function as interferon antagonists. Journal of Virology 81(2):548–557 DOI 10.1128/JVI.01782-06.

Parlikar et al. (2020), PeerJ, DOI 10.7717/peerj.9576 27/31 Kumar S, Stecher G, Li M, Knyaz C, Tamura K. 2018. MEGA X: molecular evolutionary genetics analysis across computing platforms. Molecular Biology and Evolution 35(6):1547–1549 DOI 10.1093/molbev/msy096. Lai AL, Millet JK, Daniel S, Freed JH, Whittaker GR. 2020. The SARS-CoV fusion peptide forms an extended bipartite fusion platform that perturbs membrane order in a calcium-dependent manner. Journal of Molecular Biology 429(24):3875–3892 DOI 10.1016/j.jmb.2017.10.017. Lau SKP, Feng Y, Chen H, Luk HKH, Yang W-H, Li KSM, Zhang Y-Z, Huang Y, Song Z-Z, Chow W-N, Fan RYY, Ahmed SS, Yeung HC, Lam CSF, Cai J-P, Wong SSY, Chan JFW, Yuen K-Y, Zhang H-L, Woo PCY, Perlman S. 2015. Severe acute respiratory syndrome (SARS) coronavirus ORF8 protein is acquired from SARS-related coronavirus from greater horseshoe bats through recombination. Journal of Virology 89(20):10532–10547 DOI 10.1128/JVI.01048-15. Lechner M, Findeiß S, Steiner L, Marz M, Stadler PF, Prohaska SJ. 2011. Proteinortho: detection of (Co-)orthologs in large-scale analysis. BMC Bioinformatics 12(1):124 DOI 10.1186/1471-2105-12-124. Lei J, Kusov Y, Hilgenfeld R. 2018. Nsp3 of coronaviruses: structures and functions of a large multi-domain protein. Antiviral Research 149:58–74 DOI 10.1016/j.antiviral.2017.11.001. Lescure FX, Bouadma L, Nguyen D, Parisey M, Wicky PH, Behillil S, Gaymard A, Bouscambert-Duchamp M, Donati F, Le Hingrat Q, Enouf V, Houhou-Fidouh N, Valette M, Mailles A, Lucet JC, Mentre F, Duval X, Descamps D, Malvy D, Timsit JF, Lina B, Van-der-Werf S, Yazdanpanah Y. 2020. Clinical and virological data of the first cases of COVID-19 in Europe: a case series. Lancet Infectious Diseases 20:697–706 DOI 10.1016/S1473-3099(20)30200-0. Letunic I, Bork P. 2019. Interactive tree of life (iTOL) v4: recent updates and new developments. Nucleic Acids Research 47(W1):W256–W259 DOI 10.1093/nar/gkz239. Liu DX, Fung TS, Chong KK-L, Shukla A, Hilgenfeld R. 2014. Accessory proteins of SARS-CoV and other coronaviruses. Antiviral Research 109:97–109 DOI 10.1016/j.antiviral.2014.06.013. Lugari A, Betzi S, Decroly E, Bonnaud E, Hermant A, Guillemot JC, Debarnot C, Borg JP, Bouvet M, Canard B, Morelli X, Lécine P. 2010. Molecular mapping of the RNA cap 2′-O-methyltransferase activation interface between severe acute respiratory syndrome coronavirus nsp10 and nsp16. Journal of Biological Chemistry 285(43):33230–33241 DOI 10.1074/jbc.M110.120014. Lv L, Li G, Chen J, Liang X, Li Y. 2020. Comparative genomic analysis revealed specific mutation pattern between human coronavirus SARS-CoV-2 and Bat-SARSr-CoV RaTG13. bioRxiv DOI 10.1101/2020.02.27.969006. Lyons DM, Lauring AS. 2017. Evidence for the selective basis of transition-to-transversion substitution bias in two RNA viruses. Molecular Biology and Evolution 34(12):3205–3215 DOI 10.1093/molbev/msx251. Marra MA, Jones SJM, Astell CR, Holt RA, Brooks-Wilson A, Butterfield YSN, Khattra J, Asano JK, Barber SA, Chan SY, Cloutier A, Coughlin SM, Freeman D, Girn N, Griffith OL, Leach SR, Mayo M, McDonald H, Montgomery SB, Pandoh PK, Petrescu AS, Robertson AG, Schein JE, Siddiqui A, Smailus DE, Stott JM, Yang GS, Plummer F, Andonov A, Artsob H, Bastien N, Bernard K, Booth TF, Bowness D, Czub M, Drebot M, Fernando L, Flick R, Garbutt M, Gray M, Grolla A, Jones S, Feldmann H, Meyers A, Kabani A, Li Y, Normand S, Stroher U, Tipples GA, Tyler S, Vogrig R, Ward D, Watson B, Brunham RC, Krajden M, Petric M, Skowronski DM, Upton C, Roper RL. 2003. The genome sequence of the SARS-associated coronavirus. Science 300(5624):1399–1404 DOI 10.1126/science.1085953.

Parlikar et al. (2020), PeerJ, DOI 10.7717/peerj.9576 28/31 McBride R, Fielding B. 2012. The role of severe acute respiratory syndrome (SARS)-coronavirus accessory proteins in virus pathogenesis. Viruses 4(11):2902–2923 DOI 10.3390/v4112902. McIntosh K, Becker WB, Chanock RM. 1967. Growth in suckling-mouse brain of “IBV-like” viruses from patients with upper respiratory tract disease. Proceedings of the National Academy of Sciences of the United States of America 58(6):2268–2273 DOI 10.1073/pnas.58.6.2268. Menachery VD, Debbink K, Baric RS. 2014. Coronavirus non-structural protein 16: evasion, attenuation, and possible treatments. Virus Research 194:191–199 DOI 10.1016/j.virusres.2014.09.009. Mousavizadeh L, Ghasemi S. 2020. Genotype and phenotype of COVID-19: their roles in pathogenesis. Epub ahead of print 31 March 2020. Journal of Microbiology, Immunology and Infection DOI 10.1016/j.jmii.2020.03.022. Peiris JSM. 2012. Coronaviruses. Medical Microbiology (Eighteenth Edition) 12:587–593 DOI 10.1016/B978-0-7020-4089-4.00072-X. Pritchard L, Glover RH, Humphris S, Elphinstone JG, Toth IK. 2016. Genomics and taxonomy in diagnostics for food security: soft-rotting enterobacterial plant pathogens. Analytical Methods 8(1):12–24 DOI 10.1039/C5AY02550H. Rehman S, Shafique L, Ihsan A, Liu Q. 2020. Evolutionary trajectory for the emergence of novel coronavirus SARS-CoV-2. Pathogens 9(3):240 DOI 10.3390/pathogens9030240. Sanjuán R, Nebot MR, Chirico N, Mansky LM, Belshaw R. 2010. Viral mutation rates. Journal of Virology 84(19):9733–9748 DOI 10.1128/JVI.00694-10. Schaecher SR, Mackenzie JM, Pekosz A. 2007. The ORF7b protein of severe acute respiratory syndrome coronavirus (SARS-CoV) is expressed in virus-infected cells and incorporated into SARS-CoV particles. Journal of Virology 81(2):718–731 DOI 10.1128/JVI.01691-06. Simmonds P, Adams MJ, Benkő Mária, Breitbart M, Brister JR, Carstens EB, Davison AJ, Delwart E, Gorbalenya AE, Harrach B, Hull R, King AMQ, Koonin EV, Krupovic M, Kuhn JH, Lefkowitz EJ, Nibert ML, Orton R, Roossinck MJ, Sabanadzovic S, Sullivan MB, Suttle CA, Tesh RB, Van der Vlugt RA, Varsani A, Zerbini FM. 2017. Virus taxonomy in the age of metagenomics. Nature Reviews Microbiology 15(3):161–168 DOI 10.1038/nrmicro.2016.177. Siu K, Yuen K, Castano-Rodriguez C, Ye Z, Yeung M, Fung S, Yuan S, Chan C, Yuen K, Enjuanes L, Jin D. 2019. Severe acute respiratory syndrome coronavirus ORF3a protein activates the NLRP3 inflammasome by promoting TRAF3-dependent ubiquitination of ASC. FASEB Journal 33(8):8865–8877 DOI 10.1096/fj.201802418R. Stamatakis A. 2014. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30(9):1312–1313 DOI 10.1093/bioinformatics/btu033. Stobart CC, Sexton NR, Munjal H, Lu X, Molland KL, Tomar S, Mesecar AD, Denison MR. 2013. Chimeric exchange of coronavirus nsp5 proteases (3CLpro) identifies common and divergent regulatory determinants of protease activity. Journal of Virology 87(23):12611–12618 DOI 10.1128/JVI.02050-13. Subissi L, Imbert I, Ferron F, Collet A, Coutard B, Decroly E, Canard B. 2014. SARS-CoV ORF1b-encoded nonstructural proteins 12–16: replicative enzymes as antiviral targets. Antiviral Research 101:122–130 DOI 10.1016/j.antiviral.2013.11.006. Surjit M, Lal SK. 2008. The SARS-CoV nucleocapsid protein: a protein with multifarious activities. Infection, Genetics and Evolution 8(4):397–405 DOI 10.1016/j.meegid.2007.07.004. Tan YJ, Lim SG, Hong W. 2005. Characterization of viral proteins encoded by the SARS-coronavirus genome. Antiviral Research 65(2):69–78 DOI 10.1016/j.antiviral.2004.10.001.

Parlikar et al. (2020), PeerJ, DOI 10.7717/peerj.9576 29/31 Tanaka T, Kamitani W, DeDiego ML, Enjuanes L, Matsuura Y. 2012. Severe acute respiratory syndrome coronavirus nsp1 facilitates efficient propagation in cells through a specific translational shutoff of host mRNA. Journal of Virology 86(20):11128–11137 DOI 10.1128/JVI.01700-12. Taylor JK, Coleman CM, Postel S, Sisk JM, Bernbaum JG, Venkataraman T, Sundberg EJ, Frieman MB. 2015. Severe acute respiratory syndrome coronavirus ORF7a inhibits bone marrow stromal antigen 2 virion tethering through a novel mechanism of glycosylation interference. Journal of Virology 89(23):11820–11833 DOI 10.1128/JVI.02274-15. Te Velthuis AJW, Van den Worm SHE, Snijder EJ. 2012. The SARS-coronavirus nsp7+nsp8 complex is a unique multimeric RNA polymerase capable of both de novo initiation and primer extension. Nucleic Acids Research 40(4):1737–1747 DOI 10.1093/nar/gkr893. Tok TT, Tatar G. 2017. Structures and functions of coronavirus proteins : molecular modeling of viral nucleoprotein. International Journal of Virology & Infectious Diseases 2:1–7. Tyrrell DA, Bynoe ML. 1966. Cultivation of viruses from a high proportion of patients with colds. Lancet 1(7428):76–77 DOI 10.1016/S0140-6736(66)92364-6. Woo PCY, Huang Y, Lau SKP, Tsoi HW, Yuen KY. 2005. In silico analysis of ORF1ab in coronavirus HKU1 genome reveals a unique putative cleavage site of coronavirus HKU1 3C-like protease. Microbiology and Immunology 49(10):899–908 DOI 10.1111/j.1348-0421.2005.tb03681.x. Woo PCY, Huang Y, Lau SKP, Yuen KY. 2010. Coronavirus genomics and bioinformatics analysis. Viruses 2(8):1805–1820 DOI 10.3390/v2081803. Xu J, Zhao S, Teng T, Abdalla AE, Zhu W, Xie L, Wang YGX. 2020. Systematic comparison of two animal-to-human transmitted human coronaviruses: SARS-CoV-2 and SARS-CoV. Viruses 12(2):244 DOI 10.3390/v12020244. Yuen KS, Ye ZW, Fung SY, Chan CP, Jin DY. 2020. SARS-CoV-2 and COVID-19: the most important research questions. Cell & Bioscience 10:40 DOI 10.1186/s13578-020-00404-4. Zaki AM, Van Boheemen S, Bestebroer TM, Osterhaus ADME, Fouchier RAM. 2012. Isolation of a novel coronavirus from a man with pneumonia in Saudi Arabia. New England Journal of Medicine 367(19):1814–1820 DOI 10.1056/NEJMoa1211721. Zeng Z, Deng F, Shi K, Ye G, Wang G, Fang L, Xiao S, Fu Z, Peng G. 2018. Dimerization of coronavirus nsp9 with diverse modes enhances its nucleic acid binding affinity. Journal of Virology 92(17):e00692-18 DOI 10.1128/JVI.00692-18. Zeng C, Wu A, Wang Y, Xu S, Tang Y, Jin X, Wang S, Qin L, Sun Y, Fan C, Snijder EJ, Neuman BW, Chen Y, Ahola T, Guo D. 2016. Identification and characterization of a ribose 2′-O-Methyltransferase encoded by the ronivirus branch of nidovirales. Journal of Virology 90(15):6675–6685 DOI 10.1128/JVI.00658-16. Zhang T, Wu Q, Zhang Z. 2020. Probable pangolin origin of SARS-CoV-2 associated with the COVID-19 outbreak. Current Biology 30(7):1346–1351.e2 DOI 10.1016/j.cub.2020.03.022. Zhong NS, Zheng BJ, Li YM, Poon LLM, Xie ZH, Chan KH, Li PH, Tan SY, Chang Q, Xie JP, Liu XQ, Xu J, Li DX, Yuen KY, Peiris JSM, Guan Y. 2003. Epidemiology and cause of severe acute respiratory syndrome (SARS) in Guangdong. People’s Republic of China, in February 362:1353–1358 DOI 10.1016/S0140-6736(03)14630-2. Zhou P, Yang X-L, Wang X-G, Hu B, Zhang L, Zhang W, Si H-R, Zhu Y, Li B, Huang C-L, Chen H-D, Chen J, Luo Y, Guo H, Jiang R-D, Liu M-Q, Chen Y, Shen X-R, Wang X, Zheng X-S, Zhao K, Chen Q-J, Deng F, Liu L-L, Yan B, Zhan F-X, Wang Y-Y, Xiao G-F, Shi Z-L. 2020. A pneumonia outbreak associated with a new coronavirus of probable bat origin. Nature 579(7798):270–273 DOI 10.1038/s41586-020-2012-7.

Parlikar et al. (2020), PeerJ, DOI 10.7717/peerj.9576 30/31 Zhu X, Fang L, Wang D, Yang Y, Chen J, Ye X, Foda MF, Xiao S. 2017. Porcine deltacoronavirus nsp5 inhibits interferon-$β$ production through the cleavage of NEMO. Virology 502:33–38 DOI 10.1016/j.virol.2016.12.005. Zhu N, Zhang D, Wang W, Li X, Yang B, Song J, Zhao X, Huang B, Shi W, Lu R, Niu P, Zhan F, Ma X, Wang D, Xu W, Wu G, Gao GF, Tan W. 2020. A novel coronavirus from patients with pneumonia in China, 2019. New England Journal of Medicine 382(8):727–733 DOI 10.1056/NEJMoa2001017.

Parlikar et al. (2020), PeerJ, DOI 10.7717/peerj.9576 31/31