A Selective Sweep in the Spike Gene Has Driven SARS-CoV-2 Human Adaptation

bioRxiv preprint doi: https://doi.org/10.1101/2021.02.13.431090; this version posted February 17, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

Lin Kang1, Guijuan He2, Amanda K. Sharp3, Xiaofeng Wang2, Anne M. Brown3,4, Pawel Michalak1,5,6*, James Weger-Lucarelli7*

1 Edward Via College of Osteopathic Medicine, Monroe, LA, 71203, USA
2 School of Plant and Environmental Sciences, Virginia Tech, Blacksburg, Virginia 24061
3 Department of Biochemistry, Virginia Tech, Blacksburg, Virginia, USA
4 Research and Informatics, University Libraries, Blacksburg, Virginia, USA
5 Center for One Health Research, Virginia-Maryland College of Veterinary Medicine, Blacksburg, VA, 24060, USA
6 Institute of Evolution, Haifa University, Haifa, 3498838, Israel
7 Department of Biomedical Sciences and Pathobiology, Virginia Tech, VA-MD Regional College of Veterinary Medicine, Blacksburg, VA, United States of America
* Corresponding authors

Summary

While SARS-CoV-2 likely has animal origins1, the viral genetic changes necessary to adapt this animal-derived ancestral virus to humans are largely unknown, mostly due to low levels of sequence polymorphism and the notorious difficulties in experimental manipulations of coronavirus genomes. We scanned more than 182,000 SARS-CoV-2 genomes for selective sweep signatures and found that a distinct footprint of positive selection is located around a non-synonymous change (A1114G; T372A) within the Receptor-Binding Domain of the Spike protein, which likely played a critical role in overcoming species barriers and accomplishing interspecies transmission from animals to humans.
Structural analysis indicated that the substitution of threonine with an alanine in SARS-CoV-2 concomitantly removes a predicted glycosylation site at N370, resulting in more favorable binding predictions to human ACE2, the cellular receptor. Using a novel bacteria-free cloning system for manipulating RNA virus genomes, we experimentally validated that this SARS-CoV-2-unique substitution significantly increases replication in human cells relative to its putative ancestral variant. Notably, this mutation’s impact on virus replication in human cells was much greater than that of the Spike D614G mutant, which has been widely reported to have been selected for during human-to-human transmission2,3.

Introduction

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the causative agent of Coronavirus Disease 2019 (COVID-19), has caused over 60 million infections with at least 1.3 million deaths worldwide as of early November 20204. The virus was first described in late 2019 in Wuhan, China, and quickly spread globally1. SARS-CoV-2 is closely related to SARS-CoV, which caused a more limited outbreak in several countries in 20035,6; however, several bat and pangolin-derived viruses are even more closely related to SARS-CoV-2, indicative of a zoonotic origin7–9. Bat coronavirus RaTG13, originally isolated in China from Rhinolophus affinis bats in 2013, shares 96% nucleotide identity with SARS-CoV-2 across the genome and ~97% amino acid identity in the Spike (S) protein, which mediates receptor binding and membrane fusion, and is the key coronavirus determinant of host tropism10. Similarly, several viruses found in Malayan pangolins (Manis javanica) are closely related to SARS-CoV-2, with up to 97.4% amino acid concordance in the receptor-binding domain (RBD) of the S protein8,9. However, the exact origin and mechanism of cross-species transmission of the SARS-CoV-2 progenitor are still unknown.

In the past two decades, the emergence of severe acute respiratory syndrome coronavirus (SARS-CoV)6,11,12 and Middle East respiratory syndrome coronavirus (MERS-CoV)13 in humans and of swine acute diarrhoea syndrome coronavirus (SADS-CoV) in pigs has highlighted the epidemic potential of coronaviruses14. Typically, only modest changes to a virus are required to initiate adaptation to a new host; for example, only two amino acid changes were necessary to produce a dramatic difference in human adaptation in both SARS-CoV and MERS-CoV S proteins15,16. This phenomenon is readily observed in other viruses: Ebola viruses’ human adaptation following spill-over from bats was at least partly mediated by a single alanine-to-valine mutation at position 82 in the glycoprotein17,18. Similarly, individual amino acid changes have been associated with recent outbreaks of several RNA viruses: chikungunya virus19, West Nile virus20,21, and Zika virus22. While an individual mutation that likely increases replication of SARS-CoV-2 in humans has been identified (a single aspartic acid to glycine change at position 614 in the S protein2,3), this occurred after emergence into humans, and the genetic determinants of SARS-CoV-2’s expansion from an animal reservoir into humans remain entirely unknown.

For a virus recently acquired through a cross-species transmission, rapid evolution and a strong signature of positive selection are expected. For example, several rounds of adaptive changes have been demonstrated in SARS-CoV genomes during the short SARS epidemic in 2002–200323,24.
However, in its brief epidemic, SARS-CoV-2 has been characterized by relatively low genetic variation, concealing signals of positive selection and leading to contradictory reports of limited positive selection25, “relaxed” selection26, or even negative (purifying) selection27,28. These results, however, are based on dN/dS tests that are traditionally designed for eukaryotic interspecies comparisons, and thus ill-equipped to detect hallmark signatures of positive selection in viral lineages with limited sequence divergence29.

Here, we employ highly sensitive methods enabling detection of selective sweeps, in which a selectively favorable mutation spreads all or part of the way through the population, causing a reduction in the level of sequence variability at nearby genomic sites30. With unprecedented statistical power that leverages information from more than 182,000 SARS-CoV-2 genomes, we demonstrate that positive selection has played a critical role in the adaptive evolution of SARS-CoV-2, manifested as selective sweeps in Spike and several other regions, also providing candidate mutations for further analysis and interventions. Given its role in coronavirus host tropism, we hypothesized and experimentally validated that the selective sweep identified in the S protein involves an adaptive mutation increasing replication in human lung cells, which, in turn, could facilitate more efficient human-to-human transmission.

Results

Selective sweeps analysis identified a Spike region with high confidence from 182,792 sequences.

OmegaPlus31 and RAiSD32 were used to find putative selective sweep regions in 182,792 SARS-CoV-2 genomes downloaded from the publicly available GISAID EpiCov database (www.gisaid.org). Eight selective sweep regions were detected, including four in ORF1ab and four in the Spike region (Fig. 1 & Table 1).
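As a rough illustration of how two independent scans can be combined into candidate regions, the sketch below keeps only positions that are outliers in both score tracks and merges neighbouring positions into intervals. It does not call OmegaPlus or RAiSD; the function names, the toy scores, the `max_gap` merging rule, and the reading of the 0.05 cutoff in Fig. 1 as "top 5% of each statistic" are all assumptions made for this example.

```python
def top_fraction_positions(scores, fraction=0.05):
    """Return the positions whose score falls in the top `fraction` of values."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return {pos for pos, _ in ranked[:keep]}


def merge_into_regions(positions, max_gap=1):
    """Merge positions into (start, end) intervals, joining gaps <= max_gap."""
    regions = []
    for pos in sorted(positions):
        if regions and pos - regions[-1][1] <= max_gap:
            regions[-1][1] = pos  # extend the current interval
        else:
            regions.append([pos, pos])  # start a new interval
    return [tuple(region) for region in regions]


def common_sweep_regions(scores_a, scores_b, fraction=0.05, max_gap=1):
    """Intervals built from positions that are outliers in BOTH score tracks."""
    common = (top_fraction_positions(scores_a, fraction)
              & top_fraction_positions(scores_b, fraction))
    return merge_into_regions(common, max_gap)


# Toy score tracks over 100 positions (hypothetical values, not real output).
toy_a = {p: 1.0 for p in range(1, 101)}
toy_b = {p: 1.0 for p in range(1, 101)}
for p in range(40, 45):
    toy_a[p] = 10.0  # first scan peaks at positions 40-44
for p in range(41, 46):
    toy_b[p] = 8.0   # second scan peaks at positions 41-45

print(common_sweep_regions(toy_a, toy_b))
```

Requiring agreement between two statistics is a conservative choice: a peak seen by only one method is dropped, which trades sensitivity for fewer false positives.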
The Spike protein plays an important role in the receptor recognition and cell membrane fusion process during viral infection, and this protein is highly conserved among all coronaviruses. Next, we screened genomic sites in the Spike region that may be involved in the adaptive evolution of SARS-CoV-2 in the new host by comparing the non-synonymous differences between SARS-CoV-2 and four other Sarbecovirus members (one pangolin coronavirus and three bat coronaviruses; see Materials and Methods). A total of six such sites were identified (Supplementary Table 1); notably, only a single site (A1114G) was centrally located in one of the sweep regions, whereby the threonine at amino acid position 372 of the Spike protein in the four Sarbecovirus members was substituted with alanine (Thr372Ala) in human SARS-CoV-2. Out of the 182,792 SARS-CoV-2 genomes, no sequence polymorphism was found at this position (1114G), suggesting a rapid fixation of this mutation via a hard sweep. The alternative, putatively ancestral, coronavirus variant (A1114) was perfectly conserved in Sarbecovirus members from bats and pangolin.

Fig 1. Selective sweeps analysis. (a) Selective sweep regions (shown as red blocks) identified in 182,792 SARS-CoV-2 genomes, using OmegaPlus (blue lines) and RAiSD (yellow lines). The common outliers (0.05 cutoff, purple dots) from the two methods were used to define selective sweep regions. (b) Non-synonymous difference (Thr372Ala) between SARS-CoV-2 and four other Sarbecovirus members found in the putative selective sweep region (22,529-22,862).
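The relationship between the nucleotide label A1114G, the amino acid label Thr372Ala, and the genomic sweep interval 22,529-22,862 can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes that 1114 is counted from the first base of the Spike coding sequence and that the Spike coding sequence starts at genome position 21,563 (the coordinate in the NC_045512.2 reference, which the text above does not state). The helper names are hypothetical.

```python
# Assumption: 1-based start of the Spike coding sequence in NC_045512.2.
SPIKE_CDS_START = 21563

# Standard genetic code: all ACN codons encode Thr and all GCN codons encode Ala.
CODON_FAMILY = {"AC": "Thr", "GC": "Ala"}


def codon_of(cds_pos):
    """Map a 1-based coding-sequence position to (codon number, position in codon)."""
    return (cds_pos - 1) // 3 + 1, (cds_pos - 1) % 3 + 1


def to_genome(cds_pos, cds_start=SPIKE_CDS_START):
    """Convert a 1-based coding-sequence position to a 1-based genome position."""
    return cds_start + cds_pos - 1


def in_region(genome_pos, start, end):
    """True if genome_pos lies inside the closed interval [start, end]."""
    return start <= genome_pos <= end


codon_number, codon_position = codon_of(1114)
genome_position = to_genome(1114)

print("codon:", codon_number, "position in codon:", codon_position)
print("genome position:", genome_position)
print("inside 22,529-22,862:", in_region(genome_position, 22529, 22862))

# An A-to-G change at the first codon position turns any ACN codon into GCN.
for third_base in "ACGT":
    before = CODON_FAMILY["AC"]
    after = CODON_FAMILY["GC"]
    print("AC" + third_base, before, "->", "GC" + third_base, after)
```

Under these assumptions, position 1114 is the first base of codon 372, so an A-to-G change there is exactly the kind of change that converts a threonine codon into an alanine codon, and the site falls inside the Spike sweep interval quoted in Fig. 1b.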
Table 1: Putative sweep regions (the region containing the Spike A1114G position is marked with an asterisk).

Start     End       Region
7,445     7,711     ORF1ab
8,426     8,542     ORF1ab
12,978    13,350    ORF1ab
14,907    15,027    ORF1ab
22,529    22,862    Spike *
23,132    23,196    Spike
24,225    24,319    Spike
24,456    24,712    Spike

Structure-based analysis of SARS-CoV-2 S protein variants.
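The Summary notes that replacing threonine 372 with alanine removes a predicted glycosylation site at N370. The reason is that N-linked glycosylation is predicted from the sequon N-X-S/T (X being any residue except proline), so the residue two positions after the asparagine decides whether the motif exists. The sketch below is a minimal illustration, not the structural analysis used in the study: the three-residue segments are toy inputs, and the middle residue (written here as S) is a placeholder assumed for the example.

```python
import re

# N-linked glycosylation sequon: Asn, any residue except Pro, then Ser or Thr.
# The lookahead lets overlapping motifs be reported as well.
SEQUON = re.compile(r"(?=N[^P][ST])")


def sequon_sites(segment, first_residue_number=1):
    """Return residue numbers of Asn residues that start an N-X-S/T sequon."""
    return [m.start() + first_residue_number for m in SEQUON.finditer(segment)]


# Toy segments covering residues 370-372; the middle residue is a placeholder.
ancestral_like = "NST"  # Thr at 372, as in the bat and pangolin viruses
human_like = "NSA"      # Ala at 372, as in SARS-CoV-2

print("Thr372:", sequon_sites(ancestral_like, first_residue_number=370))
print("Ala372:", sequon_sites(human_like, first_residue_number=370))
```

With threonine at 372 the motif is intact and N370 is reported; with alanine the motif is lost, which is consistent with the predicted loss of the glycosylation site described in the Summary.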
