Emergence and Evolution of a Prevalent New SARS-Cov-2 Variant in the United States
Total Page:16
File Type:pdf, Size:1020Kb
bioRxiv preprint doi: https://doi.org/10.1101/2021.01.11.426287; this version posted January 19, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. Emergence and Evolution of a Prevalent New SARS-CoV-2 Variant in the United States Adrian A. Pater1, Michael S. Bosmeny2,†, Christopher L. Barkau2,†, Katy N. Ovington2, 5 Ramadevi Chilamkurthy2, Mansi Parasrampuria2, Seth B. Eddington2, Abadat O. Yinusa1, Adam A. White2, Paige E. Metz2, Rourke J. Sylvain2, Madison M. Hebert1, Scott W. Benzinger1, Koushik Sinha3, and Keith T. Gagnon1,2,* 1Southern Illinois University, Chemistry and Biochemistry, Carbondale, Illinois, USA, 62901. 2Southern Illinois University School of Medicine, Biochemistry and Molecular Biology, 10 Carbondale, Illinois, USA, 62901. 3Southern Illinois University School of Computing, Carbondale, Illinois, USA, 62901. *Correspondence to: [email protected]. †Equally contributing authors. 15 Abstract Genomic surveillance can lead to early identification of novel viral variants and inform pandemic response. Using this approach, we identified a new variant of the SARS-CoV-2 virus that emerged in the United States (U.S.). The earliest sequenced genomes of this variant, referred to as 20C-US, can be traced to Texas in late May of 2020. This variant circulated in the U.S. 20 uncharacterized for months and rose to recent prevalence during the third pandemic wave. It initially acquired five novel, relatively unique non-synonymous mutations. 20C-US is continuing to acquire multiple new mutations, including three independently occurring spike protein mutations. Monitoring the ongoing evolution of 20C-US, as well as other novel emerging variants, will be essential for understanding SARS-CoV-2 host adaptation and predicting 25 pandemic outcomes. 30 1 bioRxiv preprint doi: https://doi.org/10.1101/2021.01.11.426287; this version posted January 19, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. Introduction In early 2020 the World Health Organization declared that coronavirus disease 2019 (COVID- 19), a potentially fatal respiratory infection caused by SARS-CoV-2, was a global pandemic (1). 5 The high number of SARS-CoV-2 infections worldwide over time has presented the virus with ample opportunity to acquire new mutations. It has been suggested that some mutations already present a fitness advantage for the virus. Notably, the D614G mutation, observed early in the pandemic, is thought to increase the transmissibility of the virus (2, 3). The N501Y mutation of the spike protein (S) has been implicated in the rapid spread of new variants in the United 10 Kingdom and South Africa (4-6). A growing number of spike protein mutations could enable immune evasion and reduced vaccine efficacy (7). The U.S. has experienced a surge in cases during the third pandemic wave over the fall of 2020 and winter of 2020/2021. While many variables are likely to drive the increase in cases, it is 15 possible that emergence of a more fit or transmissible SARS-CoV-2 variant could be a contributing factor (5). Restrictions in population movement during a global pandemic, as well as the rapid acquisition of multiple mutations, could drive emergence of novel region-specific variants. This evolutionary paradigm might explain the rise of distinct SARS-CoV-2 variants now being observed around the world during the COVID-19 pandemic (5, 6, 8). 20 Here we report the characterization of a SARS-CoV-2 variant, 20C-US, that emerged in and has remained mostly confined to the U.S. Its quiet rise to prominence among other circulating variants in the late summer and early fall of 2020 coincides with the third U.S. pandemic wave. Based on existing genomic data, we predict that this variant may already be the most dominant 25 variant of SARS-CoV-2 in the U.S., likely accounting for the majority of COVID-19 cases. In addition to the five signature mutations of the 20C-US variant, new mutations continue to accrue. These include protease, nucleocapsid, and spike protein mutations that highlight the ongoing evolution of SARS-CoV-2. 30 Results: Genomic and phylogenetic and characterization of 20C-US, a prevalent new SARS-CoV-2 variant in the U.S. 2 bioRxiv preprint doi: https://doi.org/10.1101/2021.01.11.426287; this version posted January 19, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. In response to anticipated genetic changes occurring in the SARS-CoV-2 virus, we began sequencing viral genomes for genomic epidemiology and surveillance. With sequencing focused on the U.S. upper Midwest in the state of Illinois, we generated full genome sequences from samples taken beginning in March 2020 to present. During phylogenetic reconstruction with our 5 Illinois genome sequences, a particular branch within the 20C clade became noticeably more pronounced (Fig. 1A). We identified five closely co-occurring signature mutations that appeared synapomorphic to the new clade within 20C. These mutations resulted in amino acid changes of N1653D and R2613C in ORF1b, G172V in ORF3a, and P67S and P199L in the nucleocapsid (N) gene, the last of which also introduces a stop codon mutation at position Q46 of ORF14 10 (Table 1). Table 1. Signature 20C-US mutations. Mutation Nucleotide Amino Acid Protein Protein Name Change Change Name Function RNA proofreading, RNA capping, ORF1b:N1653D A18424G N1653D ORF1b, nsp14 methyltransferase ORF1b:R2613C C21304T R2613C ORF1b, nsp16 RNA capping, methyltransferase cellular immune modulation, virus ORF3a:G172V G25907T G172V ORF3a maturation and exit N:P67S C28472T P67S nucleocapsid viral particle structure, RNA binding N:P199L C28869T P199L nucleocapsid viral particle structure, RNA binding ORF14:Q64* C28869T STOP ORF14 unknown 15 To better understand this new variant at the national level, we randomly subsampled approximately 3.3% of available U.S. genomes (1905 out of 57754 genomes) from the GISAID database and constructed a phylogenetic tree using the Nexstrain pipeline (9) (Fig. 1B). Analysis of each mutation within the tree and visualization of geographic distribution revealed that a 20 3 bioRxiv preprint doi: https://doi.org/10.1101/2021.01.11.426287; this version posted January 19, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. Fig. 1. Phylogenetic reconstruction and geographic visualization of SARS-CoV-2 variant 20C- US in Illinois, the United States, and globally. Phylogenetic reconstruction of SARS-CoV2 using (A) 350 genomes sequenced from Illinois from March 2020 through December 2020 and (B) 5 genome sequences randomly subsampled from the U.S. at ~3.3% of all U.S. genomes in the GISAID database (as of Jan. 4, 2021). (C) Geographic visualization of the fraction of 20C-US variant genomes in the U.S. during the indicated 2-month intervals. (D) Phylogenetic tree reconstruction of SARS-CoV-2 during the 2-month interval of Nov. 1 to Dec. 31, 2020 using 3819 randomly subsampled global genomes from GISAID (as of Jan. 4, 2021). (E) Global map 10 view of the fraction of 20C-US variant genomes from around the world. For all trees and maps, the signature mutation ORF1b:N1653D was used to highlight the 20C-US variant. substantial fraction of genomes comprised this new variant for most U.S. states (Fig. 1C). substantial fraction of genomes comprised this new variant for most U.S. states (Fig. 1C). When 15 a unique, signature mutation for the new variant, ORF1b:N1653D, was visualized on a globally subsampled phylogenetic tree, a subclade of the 20C branch of the B.1.2 lineage corresponding 4 bioRxiv preprint doi: https://doi.org/10.1101/2021.01.11.426287; this version posted January 19, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. to the new variant became apparent (Fig. 1D). Geographic visualization from Nov. 1 to Dec. 31, 2020 revealed that nearly half of all recently sequenced SARS-CoV-2 genomes from the U.S. are the new variant of interest (Fig. 1E). Furthermore, this variant has thus far only been reported at very low levels in a handful of other countries, including Mexico, Australia, New Zealand, 5 Singapore, Thailand, Taiwan, Poland and Israel. Therefore, we have termed this new variant 20C-US. Tracing a timeline and geographic origin of the 20C-US variant We attempted to establish a timeline and geographic origin for the emergence of 20C-US by 10 tracing the appearance of mutations in sequenced genomes. We selected all SARS-CoV-2 genomes from the global GISAID database that possessed the five hallmark mutations of 20C- US as well as two prerequisite mutations that appear key to the 20C-US lineage based on phylogenetic reconstructions, which are ORF8:S24L and ORF1a:L3352F. The majority of genomes bearing these two mutations were from Minnesota and Louisiana during March and 15 April, 2020. The first signature mutation of 20C-US, ORF3a:G172V, then appears in four genomes from Louisiana and one from Arizona in April.