Unique features concerning the proximal origin of SARS-CoV-2

Murat Seyran1,2, Damiano Pizzol3, Parise Adadi4, Tarek Mohamed Abd El-Aziz5,6, Sk. Sarif Hassan7, Antonio Soares5, Ramesh Kandimalla8,9, Kenneth Lundstrom10, Murtaza Tambuwala11, Alaa A. A. Aljabali 12, Amos Lal13, Gajendra Kumar Azad14, Pabitra Pal Choudhury15, Vladimir N. Uversky16, Samendra P. Sherchan17, Bruce D. Uhal18, Nima Rezaei19,20,*, Adam M. Brufsky21,*

1Doctoral studies in natural and technical sciences (SPL 44), University of Vienna, Austria.

2Infection, Malignancy and Autoimmunity (NIIMA), Universal Scientific Education and Research Network (USERN), Austria. [email protected], [email protected]

3Italian Agency for Development Cooperation - Khartoum, Sudan Street 33, Al Amarat, Sudan. [email protected]

4Department of Food Science, University of Otago, Dunedin 9054, New Zealand. [email protected], [email protected]

5Department of Cellular and Integrative Physiology, University of Texas Health Science Center at San Antonio, 7703 Floyd Curl Dr, San Antonio, TX 78229-3900, USA. [email protected], [email protected]

6Zoology Department, Faculty of Science, Minia University, El-Minia 61519, Egypt.

7Department of Mathematics, Pingla Thana Mahavidyalaya, Maligram, Paschim Medinipur, 721140, West Bengal, India. [email protected]

8CSIR-Indian Institute of Chemical Technology Uppal Road, Tarnaka, Hyderabad-500007, Telangana State, India.

9Kakatiya Medical College/MGM-Hospital, DME/TSPSC, Hyderabad, Warangal, Telangana State- 506007, India. [email protected], [email protected]

10PanTherapeutics, Rte de Lavaux 49, CH1095 Lutry, Switzerland. [email protected]

11School of Pharmacy and Pharmaceutical Science, Ulster University, Coleraine BT52 1SA, Northern Ireland, UK. [email protected]

12Department of Pharmaceutical Sciences, Yarmouk University—Faculty of Pharmacy, Irbid 566, Jordan

1 13Division of Pulmonary and Critical Care Medicine, Mayo Clinic, Rochester, Minnesota, USA. [email protected]

14Department of Zoology, Patna University, Patna-800005, Bihar, India. [email protected]

15Applied Statistics Unit Indian Statistical Institute, 203 B T Road, Kolkata 700108, West Bengal, India. [email protected]

16Department of Molecular Medicine, Morsani College of Medicine, University of South Florida, Tampa, FL 33612, USA. [email protected]

17Department of Environmental Health Sciences, Tulane University, New Orleans, LA, 70112, USA. [email protected]

18Department of Physiology, Michigan State University, East Lansing, MI 48824, USA. [email protected]

19Center for Immunodeficiencies, Pediatrics Center of Excellence, Children’s Medical Center, Tehran University of Medical Sciences, Tehran, Iran.

20Network of Immunity in Infection, Malignancy and Autoimmunity (NIIMA), Universal Scientific Education and Research Network (USERN), Tehran, Iran. [email protected], [email protected]

21University of Pittsburgh School of Medicine, Department of Medicine, Division of Hematology/Oncology, UPMC Hillman Cancer Center, Pittsburgh, PA, USA. [email protected]

*Corresponding author

Abstract There is a consensus that SARS-CoV-2 originated naturally from BatCoVs. Although based on phylogenetic analysis SARS-CoV-2 seems to be related to BatCoVs RaTG13 or RmYN02, its exact origin of lineage is not clear. In fact, the flat and non-sunken surface of the sialic acid-binding domain of SARS-CoV-2 spike protein conflicts with the general adaptation and survival pattern observed for all other (CoVs). Unlike RaTG13, for SARS-CoV-2 a recombination apparently occured between the S1/S2 domains of spike protein that enabled furin protease utilization. Interestingly, the clinical strains of SARS-CoV-2 do not have any further recombination

2 that would be in conflict with the recombination models of CoVs. Despite millions of SARS-CoV-2 infections, the spike protein receptor-binding domain (RBD) has not accumulated a high-frequency substitution, differentiating SARS-CoV-2 from other CoVs that have positive selection sites on their RBDs, calling into question the sources of the origin of the apparently optimized angiotensin- converting enzyme 2 (ACE2) binding of SARS-CoV-2. The SARS-CoV-2 host tropism pattern has these major discrepancies compared to other CoVs, raising questions concerning the proximal origin of SARS-CoV-2.

Keywords: Severe Acute Respiratory Syndrome 2, Canyon Hypothesis, Template- switching model, Receptor Binding Domain, Positive Selection Site

Human Pathogenic CoVs and SARS-CoV-2

The possible natural origin of Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) from Bat CoV RaTG13 was explained by Andersen and colleagues 1. However, another Bat CoV (RmYN02) was recently claimed to be more genetically related to the SARS-CoV-2 than RaTG13, based on the whole genome phylogenetic analysis 1,4. Therefore, there is no genomic consensus on the exact lineage origin of SARS-CoV-2 1,4. SARS-CoV-2 is the seventh CoV capable of infecting humans, but the first and only human coronavirus (HCoV) with pandemic potential 5. During the host tropism, bat or rodent CoVs have demonstrated certain changes in their S protein receptor- binding domain (RBD) and glycan-binding N-terminal domain (NTD) 3,6. SARS-CoV-2, unlike other CoVs, does not have those signature changes.

SARS-CoV-2 NTD composition conflicting with the ″Canyon Hypothesis″

The formation of canyons, depression zones, or cavities on the surfaces of the influenza , human rhinovirus, and Meningo virus is explained by the "Canyon Hypothesis" suggesting that the host cell receptor attachment site on a viral surface is hidden from the immune surveillance in deep surface depressions or "canyons" 7. In CoVs (except SARS-CoV-2), the S protein NTD glycan- binding domain has several sugar-binding modes, with a common feature being the hidden localization of the corresponding binding sites in cavities to limit their access to antibodies and immune cells 6. This pattern of CoVs was considered as an evolutionary measure to limit recognition of these active sites by host immune system 3.

3 The genomic comparative analysis of six HCoVs with their parent bat or rodent CoVs indicated the presence of several deletions, inserts, and recombination in their S protein NTD to evade from the host glycan-binding immune receptors, being compatible with the "Canyon Hypothesis" 3,6,8. HCoV- 229E S protein, being compared to its parent virus bat CoV, has a 187 amino acid deletion within the NTD that aims to provide means to evade the host immune C-type lectin receptors (CLRs) detecting glycan-binding 8. Similarly, in the MERS-CoV, sialic acid binding site of the S Protein NTD is located 280 Å2 beneath the protein surface 9. However, the flat pattern of the SARS-CoV-2 S protein NTD is conflicting with the evolutionary host tropism strategy, not only regarding the HcoVs, but also many other human viral pathogens 2,3,6,7.

SARS-CoV-2 recombination pattern is not compatible with the Template Switching (Copy- Choice) mechanism

The high rate of RNA recombination in CoVs can be explained by a template-switching (copy- choice) mechanism 10. CoV RNA show discontinuous and non-progressive replication in its host cells, with replication pause at specific RNA sites creating blocks of free RNA that could fuse into another set of CoV replication, leading to the formation of different recombinant CoV strains 10. These pauses in RNA replication of CoVs could allow recombination as a result of mixed infection with other CoVs, other viruses, or even the host RNA sequences. As an example, mouse hepatitis coronavirus S protein NTD sialic acid binding domain is considered to have originated from recombination human galectin RNA sequences 14.

HCoV-HKU1, when compared to the genetically related HCoV-OC43, shows a high rate of recombination in specific ORFs; e.g., P65 to NSP10, suggesting the presence of several RNA replications stops in these domains 11. Recombination can also occur with deletions in the CoVs RNA. Both HCoV-229E and the genetically related Alpaca CoV show a range from 185 to 404 nucleotide deletions in the S1 region compared to genetically related BatCoV 12. Strikingly, ORF8 was deleted in HCoV-229E compared to related alpaca and parent bat viruses, suggesting that the alpaca might serve as the first inter-host of the bat virus 12.

The SARS-CoV-2 S Protein S1/S2 site corresponding to the furin recognition motif does not exist in other ″lineage B″ beta-coronavirus; e.g., -CoV or RaTG13, suggesting that S protein S1/ S2 is not a recombination hot spot or RNA termination based on the template switching (copy- choice) model 1,10. Additionally, clinical isolates of SARS-CoV-2 S protein have not indicated any

4 recombination in this area, suggesting that the S1/S2 site furin recognition motif insertion represents a one-time event of special recombination.

RmYN02, which is genetically the closest related virus to SARS-CoV-2, was proposed to have an insert like the SARS-CoV-2 S1/S2 S protein 4. However, based on the RBD, RmYN02 is phylogenetically not related to SARS-CoV-2, and the insert between the S1 and S2 domains is not the PRRA found in SARS-CoV-2, but P(deletion)AA, and the function of this insertion in RmYN02 as a furin cleavage site is not clear 4. In addtition, another genetically closest related SARS-CoV-2 virus, RmYN01, does not have that insert in the S1/S2 region 4. The S Protein of RmYN02 also has many deletions (e.g., residues 473 to 480 and 485 to 490) that do not exist in the genetically closest related SARS-CoV-2 4.

SARS-CoV-2 RBD is not high frequency positive selection site

The CoVs genomes have hot spots with a high frequency of amino acid substitutions, also called positive selection sites that favor host tropism, antibody resistance, or immune evasion 5. For example, analysis of the S protein RBD in HCoV-229E strains revealed the presence of 36 amino acid substitutions suggesting the positive selection site of RBD and highlighting the viral attempt of low-pathogenicity for adaptation 13. During its limited infection period between 2002 and 2004, RBD mutations K479N and S487T in the SARS-CoV S protein mediated host change from civets to humans and human-to-human transmission, respectively 14. These two mutations contributed significantly to the SARS-CoV epidemic from 2002 to 2003 14. SARS-CoV, which is genetically the most related CoV to the SARS-CoV-2, had a completely different pattern of mutations through its timeline of infection. During the SARS epidemic, SARS-CoV initially had low-pathogenicity strains, with the substitutions of the residues of S protein NTD 147, 228, 240, RBD 479, as well as S2 821 and 1080 15.

However, SARS-CoV further had high-pathogenicity strains with substitutions of the residues of S protein RBD 360, 462, 472, 480, 487, S1 Subunit 609, 613, 665, and S2 subunit 743, 765, and 116315. The SARS-CoV epidemic strains emerged with the further substitutions of the residues of S protein NTD 227, 244 RBD 344, and S2 subunit 778 15. SARS-CoV thus had many adaptation mutations in its S protein NTD and RBD during its limited number of infections.

5 Similarly, more pathogenic strains of HCoV-NL63 have an I507 L substitution in the S protein RBD 16. Interestingly, clinical SARS-CoV-2 isolates have the only single frequent mutation, D614G in S Protein 17. Therefore, based on the mutation rate and patterns in clinical isolates of SARS-CoV- 2, S Protein is not a hot spot for SARS-Cov-2, unlike other human CoVs.

The mutation rate composition of SARS-CoV-2 within its own genome and mutation dynamics during the infections in clinical isolates. SARS-CoV-2 has a high frequency of mutations on ORF8 and ORF1a (nt 8742) showing mutation rates of 30.53% and 29.47%, but the overall mutation rate of the entire genome is 7.23% 17,20. Similarly, the mutation rate in the RBD of the spike in SARS- CoV-2 is low, but a high-frequency mutation on S protein D614G was detected in 71% of 28,637 sequences 17,21. SARS-CoV-2 has a high mutational frequency oft he non-structural genes ORF8 and ORF1a but not on the S protein RBD 17,21. The highly conserved genomic composition of RBD is unique to coronavirus in the clinical isolates of SARS-CoV-2 and is not compatible with the adaptation pattern of genetically related CoVs.

Results and Discussion

SARS-CoV-2 is the seventh HCoV, but the first HCoV with the pandemic potential. SARS-CoV disappeared before causing a pandemic and MERS-CoV is mostly endemic to the Arabian Peninsula with some additional limited traveler infections that caused outbreaks in South Korea 3,5. The SARS-CoV-2 host tropism pattern has discrepancies compared to other CoVs. The SARS- CoV-2 S protein NTD contains flat surface glycan-binding sites, which are in conflict with the Canyon Hypothesis. Furthermore, it appears that the SARS-CoV-2 recombination pattern inserting a furin cleavage site during host tropism represents a unique and one-time event in conflict with the genomic composition of genetically related CoVs and template switching (copy-choice) mechanisms. The SARS-CoV-2 genome is almost identical to that of the Bat CoV RaTG13, but its mutation within the RBD appears to represent another one-time event as well. The optimal RBD that defines the high affinity of S protein binding to ACE2 does not have any furhter high-frequency mutations in any of the clinical strains isolated through time. The RBD of SARS-CoV-2, unlike other human pathogenic CoVs, is not a positive selection site. These unique features of SARS-CoV- 2 raises number of questions regarding the proximal origin of the virus and the answers to these questions will allow better understanding of the origin and the pathophysiology of this virus and will provide means to properly fight this pandemic with dramatic health, social and economic impacts.

6 Author Contributions M. S. conceived the study. A. M. B, V. N. U, A. L., and K. L. provided critical review. N. R., D. P., A. L. S. P. S.. V. N. U, K. L.T. M. A. edited the article, M. S., P. A., T. M. A. formatted the article. All authors M. S., D. P., P. A., T. M. A., S. S. H., A. S., R. K., K. L., D. M., M. T., A. L., P. P. C., V. N. U., S. P. S., B. D. U., N. R., A. M. B. and G. Z. K. of the consortium interpreted the results and approved the final version for submission.

Competing Interests statement The authors declare no competing interests.

References

1. Andersen, K.G., Rambaut, A., Lipkin, W.I., Holmes, E.C. & Garry, R.F. The proximal origin of SARS-CoV-2. Nature Medicine 26, 450-452 (2020). 2. Fantini, J., Di Scala, C., Chahinian, H. & Yahi, N. Structural and molecular modelling studies reveal a new mechanism of action of chloroquine and hydroxychloroquine against SARS-CoV-2 infection. International Journal of Antimicrobial Agents 55, 105960 (2020). 3. Hulswit, R.J., de Haan, C.A. & Bosch, B.J. Coronavirus Spike Protein and Tropism Changes. Advances in virus research 96, 29-57 (2016). 4. Zhou, H., et al. A Novel Bat Coronavirus Closely Related to SARS-CoV-2 Contains Natural Insertions at the S1/S2 Cleavage Site of the Spike Protein. Current Biology 30, 2196-2203.e2193 (2020). 5. Forni, D., Cagliani, R., Clerici, M. & Sironi, M. Molecular Evolution of Human Coronavirus Genomes. Trends in microbiology 25, 35-48 (2017). 6. Peng, G., et al. Crystal structure of mouse coronavirus receptor-binding domain complexed with its murine receptor. Proceedings of the National Academy of Sciences 108, 10696 (2011). 7. Rossmann, M.G. The canyon hypothesis. Hiding the host cell receptor attachment site on a viral surface from immune surveillance. J Biol Chem 264, 14587-14590 (1989). 8. Li, Z., et al. The human coronavirus HCoV-229E S-protein structure and receptor binding. Elife 8, e51230 (2019). 9. Park, Y.-J., et al. Structures of MERS-CoV spike glycoprotein in complex with sialoside attachment receptors. Nature structural & molecular biology 26, 1151-1157 (2019).

7 10. Makino, S., Keck, J.G., Stohlman, S.A. & Lai, M.M. High-frequency RNA recombination of murine coronaviruses. Journal of virology 57, 729-737 (1986). 11. Woo, P.C.Y., et al. Characterization and complete genome sequence of a novel coronavirus, coronavirus HKU1, from patients with pneumonia. Journal of virology 79, 884-895 (2005). 12. Corman, V.M., et al. Evidence for an Ancestral Association of with Bats. Journal of Virology 89, 11858-11870 (2015). 13. Chibo, D. & Birch, C. Analysis of human coronavirus 229E spike and nucleoprotein genes demonstrates genetic drift between chronologically distinct strains. The Journal of general virology 87, 1203-1208 (2006). 14. Li, F. Receptor Recognition Mechanisms of Coronaviruses: a Decade of Structural Studies. Journal of Virology 89, 1954-1964 (2015). 15. Kan, B., et al. Molecular evolution analysis and geographic investigation of severe acute respiratory syndrome coronavirus-like virus in palm civets at an animal market and on farms. Journal of virology 79, 11892-11900 (2005). 16. Wang, Q., et al. Structural and Functional Basis of SARS-CoV-2 Entry by Using Human ACE2. Cell 181, 894-904.e899 (2020). 17. Pachetti, M., et al. Emerging SARS-CoV-2 mutation hot spots include a novel RNA-dependent- RNA polymerase variant. Journal of Translational Medicine 18, 179 (2020). 20. Mercatelli, D. & Giorgi, F.M. Geographic and Genomic Distribution of SARS-CoV-2 Mutations. Frontiers in Microbiology 11(2020). 21. Korber, B., et al. Tracking Changes in SARS-CoV-2 Spike: Evidence that D614G Increases Infectivity of the COVID-19 Virus. Cell, S0092-8674(0020)30820-30825 (2020).

(Caption Figure 1. The structural model of SARS-CoV-2 Spike Protein with the three points of discussion).

8 9 10