Downloaded by guest on September 26, 2021 hsc,Uiest fTise 45 ret,Italy; Trieste, Sweden; 34151 Trieste, of University Physics, China; 210023, a genes viral ( Zeng eight Hong-Li between epistasis SARS-CoV-2 reveals 50,000 genomes than more of analysis Global togycreae,i nerysuy(8.I osntshow not www.pnas.org/cgi/doi/10.1073/pnas.2012331117 does It global be our (28). in to second study ranked found is early but today, additionally an correlations prominent was in pair correlated, docu- variations strongly –ORF8 single-locus been to have The comes nsp12, sites it , (22–27). when these literature , of the Most genes in mented ORF3a. between in and and ORF3a, nsp14 and ORF8, between nsp13, nsp2 and out genes stand nsp4 in located predictions genes sites eight polymorphic that of find pairs eliminate We to inher- (21). (shared technique phylogenetics itance) to this attributed (20). of be evolution can enhancement that sequence predictions geno- recent acid a bacterial amino (19) apply of connect We landscapes models fitness to enhance coevolutionary the to 18) through and infer phenotypes (17, This to and Gag 16). used types (15, HIV-1 been (DCA) earlier of analysis has landscape coupling techniques SARS-CoV- direct of 51,676 a family from of by contributions determined state such genomes have largest 2 We the the 14). of in (13, list a fitness be to to to contributions viral related directly epistatic expected a and (10–12), of be in (QLE) equilibrium amount genotypes quasi-linkage therefore large of can a distribution population exhibit The general, (6–9). in recombination tar- Coronaviruses, drug (5). potential and gets hotspots mutational been identify Data already to has leveraged Influenza and collection sequences, All whole-genome increasing SARS-CoV-2 rapidly Sharing of a contains on (4) Initiative repository (GISAID) Global The worldwide a respiratory therefore priority. acute is non- (SARS-CoV-2) severe 2 coronavirus by coronavirus syndrome new met or the vaccine significant been against for search been treatment has The countries. has many crisis there in disruption health 3) economic the (2, interventions As pharmacological millions. hurt has T SARS-CoV-2 use search to to pathogens. prospect viral point recombinant starting the of genes a weaknesses opens combinatorial in as for paper pairs loci The linked between epistatically nsp14. and prominent and nsp4, nsp12, nsp2, and nsp13, and ORF8 nsp2, ORF3a between genes nsp6, infer viral and we in Altogether, loci mutation. between involving synonymous interactions a interactions holding four locus hold- and both one nsp13, mutations, gene lies in nonsynonymous loci locus ing We two mutations. one between nonsynonymous interactions where holding find pairs also loci both between ORF3a, are gene three in which inter- eight of find We actions, loci. epistatic polymorphic from infer fitness to to equilibrium contributions together quasi-linkage analysis of assumption coupling an direct with used and four until dates, repository cutoff (GISAID) Ini- different Data Global Influenza All the Sharing in on syndrome deposited tiative respiratory genomes (SARS-CoV-2) acute development 2 severe coronavirus We vaccine complete pathogenesis. all microbial and considered of have understanding drug gene deeper to guide infer lead to and can tool powerful which a interactions, is analysis epistasis Genome-wide Sweden Stockholm, 17177 Institutet, Jos Karolinska by Biology, Tumor Edited and Cell Microbiology, of Department Research, Microbiome Translational for f eateto netosDsae,Isiueo imdcn,ShgesaAaey nvriyo ohnug 03 ohnug wdn and Sweden; Gothenburg, 40530 Gothenburg, of University Academy, Sahlgrenska Biomedicine, of Institute Diseases, Infectious of Department e nryTcnlg niern aoaoyo ins rvne colo cec,NnigUiest fPssadTlcmuiain,Nanjing, Telecommunications, and Posts of University Nanjing Science, of School Province, Jiangsu of Laboratory Engineering Technology Energy New epnei ftedsaeCVD1 a ofrldto and led (1) far people so 852,000 than has more COVID-19 of disease deaths confirmed the the of pandemic he e ru fCmlxSsesadSaitclPyis eateto hoeia hsc,PyisFcly nvriyo aaa 00 aaa Cuba; Havana, 10400 Havana, of University Faculty, Physics Physics, Theoretical of Department Physics, Statistical and Systems Complex of Group | .Ouhc ieUiest,Hutn X n prvdNvme ,22 rcie o eiwJn 5 2020) 15, June review for (received 2020 2, November approved and TX, Houston, University, Rice Onuchic, N. e ´ epistasis b odcIsiuefrTertclPyis oa nttt fTcnlg n tchl nvriy 09 tchl,Sweden; Stockholm, 10691 University, Stockholm and Technology of Institute Royal Physics, Theoretical for Institute Nordic | recombination ) a,b ioDichio Vito , | ietculn analysis coupling direct d b,c,d eateto opttoa cec n ehooy laoaUiest etr 09 Stockholm, 10691 Center, University AlbaNova Technology, and Science Computational of Department di Rodr Edwin , ge Horta ´ ıguez doi:10.1073/pnas.2012331117/-/DCSupplemental equal the or for than is less red is the loci two where between 1, distance Fig. (the in effects are close dataset software 2020 circos August 8 by for visualized 1 The Table loci. in PLM of listed pairs the epistases between pair-wise contains effects of column strength last the The indicating scores pair. following the higher The in with locus. locus position the genomic that on information at similar provide prevalent type lists columns most two mutation column second the of third prevalent, and name The (most nucleotide) the the to. allele and indicates belongs major/minor pair column pair-wise it the the protein each second in SARS-CoV-2 of position The the genomic index 200. lower the top with is locus the column in first interaction The dataset. 2020 loci Ns. terminal or gaps both excluding mutations with the links and regions the coding in present located We dataset). other for hsatcecnan uprigifrainoln at online information supporting contains article This Submission.y 1 Direct PNAS a is article This under distributed BY).y is (CC article performed access open E.R.H. This and interest.y competing paper. y V.D., no the declare wrote authors H.-L.Z., E.A. The and K.T., research; H.-L.Z., and data; designed analyzed K.T. E.A. research; contributions: Author (unknown N or gap as the available of (data fraction large symbols range, a nucleotide) short with very them have of them 200 most of top and many the sequence, of reference most (29) and inspection 5 1” links the 50 in Manual sites top noncoding the DCA. involve of half perform about that to shows standard method a scores, computational obtained (PLM) were maximization loci pseudo-likelihood between from interactions effective predicted The Results SARS-CoV-2 more of sequenced. orders been different as have associa- stable genomes a pair-wise remained brings have highlights that paper and tions this correlations, of than analysis perspective epistasis The analysis. owo orsodnemyb drse.Eal [email protected] Email: addressed. be may correspondence whom To ag olcin fgnm eune ofidcombinatorial pathogens. find recombinant to highly very of sequences leverage weaknesses genome to of prospect be collections the learning up large can model opens global selection paper rate a The natural high analysis, technique. coupling a of direct under effects by evolving detected such population recombination, a contribu- In of epistatic fitness. of to effects tions show high-quality sequences of these investigated whether number have We available. increasing are sequences whole-genome continuously the emer- and health by large public worldwide caused a gency is pandemic COVID-19 The Significance nTbe1 els h infiatlnsfrte8August 8 the for links significant the list we 1, Table In e as Thorell Kaisa , f,g β crnvrsSR-o-.Avery A SARS-CoV-2. -coronavirus n rkAurell Erik and , 0 r3 or raieCmosAtiuinLcne4.0 License Attribution Commons Creative . y 0 eino h “Wuhan-Hu- the on region https://www.pnas.org/lookup/suppl/ aae S3 Dataset NSLts Articles Latest PNAS d,1 n nrf 30 ref. in and c eatetof Department g Center | f8 of 1

APPLIED PHYSICAL POPULATION SCIENCES BIOLOGY Table 1. Significant links with rank within the top 200 between of 50 samples. We retain these eight predicted epistatic interac- pair-wise loci for the 8 August 2020 dataset tions in the sampled populations of SARS-CoV-2 genomes. The top ones listed in Table 3 are marked by red bars in Fig. 2A. Locus 1 Locus 2 Epistatic interactions obtained from DCA reflect pair-wise sta- Mutation Mutation PLM tistical associations, but not correlations. As reviewed in ref. 31, Rank∗ Protein† type‡ Protein type score and described in SI Appendix, Methods of DCA, DCA is based on a global probabilistic model, and therefore ranks interdepen- 1 1059-nsp2 C| T-non. 25563-ORF3a G| T-non. 1.7191 dency differently than correlations. Fig. 3 compared to Fig. 2 2 28882-N G| A-syn. 28883-N G| C-non. 1.4996 shows that the distribution of correlation scores is qualitatively 3 28881-N G| A-non. 28882-N G| A-syn. 1.4816 different from the distributions of DCA scores in the GISAID 4 28881-N G| A-non. 28883-N G| C-non. 1.4783 dataset. Fig. 4 further shows that the ranks of the epistatic inter- 5 8782-nsp4 C| T-syn. 28144-ORF8 T| C-non. 1.4471 actions predicted in Table 3 have remained stable, while the 9 14805-nsp12 C| T-syn. 26144-ORF3a G| T-non. 1.1392 corresponding correlations have merged into the background. 12 3037- T| C-syn. 14408-nsp12 T| C-non. 1.0291 The first-ranked interaction between 1059 and 25563 is 13 18877-nsp14 C| T-syn. 25563-ORF3a G| T-non. 1.0131 between a (C/T), resulting in the T85I nonsynonymous mutation 14 3037-nsp3 T| C-syn. 23403-S G| A-non. 1.0114 in gene nsp2, and a (G/T), resulting in the Q57H nonsynony- 17 14408-nsp12 T| C-non. 23403-S G| A-non. 0.9917 mous mutation in gene ORF3a. The nsp2, expressed as part of 21 1059-nsp2 C| T-non. 18877-nsp14 C| T-syn. 0.9197 the ORF1a polyprotein, binds to host proteins prohibitin 1 and 26 17858-nsp13 A| G-non. 18060-nsp14 C| T-syn. 0.8624 prohibitin 2 (PHB1 and PHB2) in SARS-CoV (32). The vari- 27 17747-nsp13 C| T-non. 17858-nsp13 A| G-non. 0.8553 ations in site 1,059 have been predicted to modify nsp2 RNA 36 17747-nsp13 C| T-non. 18060-nsp14 C| T-syn. 0.7780 secondary structure (33) and have previously been reported to 47 11083-nsp6 G| T-non. 26144-ORF3a G| T-non. 0.7340 cooccur together with the Q57H variant in ORF3a in a dataset 63 20268-nsp15 A| G-syn. 25563-ORF3a G| T-non. 0.6474 of SARS-CoV-2 genomes from the United States (34). ORF3a, 134 11083-nsp6 G| T-non. 14805-nsp12 C| T-syn. 0.5040 also known as ExoN1 hypothetical protein sars3a, forms a cation 147 11083-nsp6 G| T-non. 28144-ORF8 T| C-non. 0.4928 channel of which the structure in SARS-CoV-2 is known by cryo- 168 8782-nsp4 C| T-syn. 11083-nsp6 G| T-non. 0.4770 electron microscopy (cryo-EM) (35). In SARS-CoV, ORF3a been shown to up-regulate expression of fibrinogen subunits ∗Indices of significant links in the top 200 with both terminals located inside a coding region, inferred by PLM. The analogous table for the 2 May 2020 FGA, FGB, and FGG in host lung epithelial cells (36), to form dataset is shown in SI Appendix, Table S6. an ion channel which modulates virus release (37), and to acti- †Information on locus 1 includes index in the reference sequence, and the vate the NLRP3 inflammasome (38), and has been found to protein it belongs to. The convention used is that locus 1 (“starting locus”) induce apoptosis (39). The Q57H variant was reported early in is the locus of lowest genomic position in the pair. the COVID-19 pandemic (40) and occurs in the first transmem- ‡Information on mutations of locus 1 includes the first and second brane alpha helix, TM1 (35), where it changes the amino acid prevalent nucleotide at this locus, and mutation type: synonymous(syn.)/ glutamine (Q) with a noncharged polar side chain to histidine nonsynonymous(non.). (H), which has a positively charged polar side chain. This amino acid is at the interface of interaction between the two dimeric subunits of ORF3a that forms the constrictions of the ion chan- to three loci), while blue is for distant effects. Analogous results nel, but the Q57H alteration does not seem to change the ion for the 2 May 2020 dataset is listed in SI Appendix, Table S6, and channel properties compared to wild-type 3a (35). Nevertheless, for the 1 April 2020 and 8 April 2020 datasets in ref. 30. its incidence is increasing in SARS-CoV-2 genomes in the United To check whether the interactions can be explained by phy- States (34), and the effect of Q57H may therefore affect the vir- logeny (inherited variations), we used two randomization strate- ulence in other beneficial ways than changing the conductance gies, “profile” and “phylogeny” of the multiple sequence align- properties of the ion pore. ments (MSAs). Profile preserves the distribution over alleles The association between 8782 and 28144 (rank 5), reported at every locus but does so independently at each locus. Profile early in SARS-CoV-2 studies (28), is between a (C/T) synony- hence destroys all systematic covariations between loci. Phy- mous mutation in the gene nsp4, and a (T/C) nonsynonymous logeny additionally preserves the genetic distance between each mutation resulting in the L84S alteration in the gene ORF8. The pair of sequences. Viral genealogies inferred from this informa- first of these genes participates in the assembly of virally induced tion are therefore unchanged under this randomization. PLM cytoplasmic double-membrane vesicles necessary for viral repli- scores run on these two types of randomized data (scram- cation. The site 8782 is located in a region annotated as CpG rich bled MSAs) are a background from which the significance of and is the site of a CpG for the major allele (C); it has the minor the interactions from the original data can be assessed. Each (T) allele in other related viruses (28). Orf8 has been implicated randomization strategy is repeated 50 times with different real- in regulating the immune response (41, 42). The L84S variant izations of the scrambling; see SI Appendix, Figs. S1–S3 and ref. is, together with the C8782T nsp4 mutation, characterizing the 30. As shown in Fig. 2, the distribution of PLM scores using phy- GISAID clade S (43). logeny and profile are qualitatively different from PLM scores The interaction between 14805 and 26144 (rank 9) leads to of the original MSA, with progressively fewer interactions at nonsynonymous alterations in nsp12 (T455I; note that the ref- high score values. With profile randomization, no interactions erence is Y) and ORF3a (G251V), respectively. The G251V predicted by PLM appear with scores standing out from the back- has been reported by many studies and is defining the GISAID ground. Phylogeny randomization, on the other hand, preserves V clade (43) together with the L37F nsp6 variant (position some interactions found by PLM in a fraction of the realizations 11083, rank 47). The widely reported G251V variant is, unfor- of the random background. Table 2 lists interactions predicted by tunately, outside of the proposed cryo-EM structure (35), and it PLM that appear in some phylogeny randomizations with scores is unknown how this glycine to valine substitution affects protein large compared to the background. In the following analysis, we function. The nsp12 is the RNA-dependent RNA polymerase, have not retained them; see SI Appendix, Figs. S1–S3 for cir- and the T455I substitution is found where the reference Wuhan- cos visualizations. Table 3 lists the eight interactions found by Hu-1 has a tyrosine residue in one of the alpha helices of the PLM which either do not appear in any phylogeny randomiza- polymerase “finger” domain (44). Threonine can, similarly to tion with scores that stand out from the background, or, in the tyrosine, be phosphorylated but also glycosylated, it is polar and case of (1059–25563), shows up three times in the top 200 out uncharged, and it can form hydrogen bonds that may stabilize

2 of 8 | www.pnas.org/cgi/doi/10.1073/pnas.2012331117 Zeng et al. Downloaded by guest on September 26, 2021 Downloaded by guest on September 26, 2021 ege al. et Zeng is replication machinery proofreading for nsp14/nsp10 responsible The activity, (46). exonuclease proofreading is domain domain protein N7-methyltransferase nsp14 a an The has and nsp8. that protein and bifunctional nsp7, com- a replicase, polymerase stable a nsp12 RNA form with with nsp13 interaction RNA-dependent of plex molecules direct the two in which the of in not holoenzyme, in at is members located that nsp14 other interact- both protein gene as the are the in T541C 36 of (L7L) and part and Rec2A P504L mutation 26 The synonymous ranks 18060. with C/T position list a same the These with protein. in only ing one the reappear within is loci 3 It two for Table (32). RNA codes in duplex that interaction unwinds nsp13 epistatic and that gene P504L) the helicase within in a T541C) resulting in resulting (C/T, (A/G, mutations nonsynonymous two the that alpha two suggests between model loop unordered (45). early an helices in an situated propagation but is and variant structure, L37P replication validated nsp6 experimentally viral of no for autophagosomes currently model favor is induce There in to (32). cells SARS-CoV shown host been the has in nsp6 The variant. aromatic and the than nonpolar smaller is are hand, residues tyrosine. other the the both and on (30). uncharged, Isoleucine, repository GitHub are helix. the links in colorful alpha available The are the datasets distance. 2020 longer April of 8 links and show 2020, April lines 1 blue 2020, bp); May 3 2 to the for equal plots or circos than Analogous less 1. Table (distance in links listed short-distance as show pairs lines same the Red 200. to 51 top indicate 1. Fig. h neato ewe 74 n 75 rn 7 sbetween is 27) (rank 17858 and 17747 between interaction The L37F nsp6 the is G251V of partner interaction second The o 0 infiatpiws psae rmte8Ags 00dtstbtenlc ncdn ein.Clrdlnsidct o 0 n rylines gray and 50, top indicate lines Colored regions. coding in loci between dataset 2020 August 8 the from epistases pairwise significant 200 Top ehv etdfrtescn yeb admzn S of MSA randomizing by (phylogeny). type effects second inherited the to for also tested and have epistasis We to due be number, in the few 1. but are Table in predictions all predictions remaining 19 many out the i.e., with MSA, filtering regions the noncoding in After in gaps contributions variations sequences. and epistatic the effects infer strongest from by to described fitness DCA well used to is and genotypes cutoff QLE, of different have Kimura’s distribution we the to recombination, that extensive up has assumed sequences coronavirus GISAID this whole-genome in As all dates. deposited considered SARS-CoV-2 have we of work, this In Discussion usage. RNA change codon interactions, to or as proposed modification, other be RNA nsp14, the can structure, but in secondary of unknown, mutations is alterations virus synonymous the C/T affect the protein synonymous in muta- nsp2 this as the of on well effect how the knowledge for Also, evidence the no tion. is As there poor, L280L) 18877. is (C/T, structure position mutation synonymous nsp14, a carrying position in locus nsp2 a in T541C) and (C/T, 1059 mutation nonsynonymous a known. rying not are interaction this complex, of replication–transcription details exact the the with but interact to thought oaitosbtenall itiuin tdfeetlc can loci different at distributions allele between Covariations car- locus a between link a is 21) (rank interaction final The NSLts Articles Latest PNAS | f8 of 3

APPLIED PHYSICAL POPULATION SCIENCES BIOLOGY Fig. 2. Histograms of PLM scores for (A) original 8 August 2020 dataset, (B) a phylogenetic randomized sample, and (C) a profile randomized sample. The blue bars are for all scores, while the red ones are for the top 50 largest scores. Red arrows in A indicate links listed in Table 3. The largest PLM score is pointed to by red arrows for random samples in B and C. None of them is located inside a coding region, and none of them appear in Tables 1 and 3.

sequences such that pair-wise distances between sequences are pathogens colonizing a host; for earlier use relating to HIV and left invariant. We find that the top link 1059–25563 appears three using DCA techniques, see ref. 50. The additive and epistatic times in 50 phylogeny randomizing samples, although with much contributions to fitness of the virus which we find describe the lower rank. The other predicted epistatic contributions disap- virus in its human host and therefore likely reflect host–pathogen pear under phylogenetic randomization, except for pairs in the interactions to a large extent. A conceptual simplification made triple (3037, 14408, 23403) which appear in from 20 to 35% of is that all hosts have been assumed equivalent. In future method- 50 randomizations. After eliminating these links as well as links ological studies, it would be of interest to consider possible between adjacent loci (28881, 28882, 28883, which appear in effects of evolution in a collection of landscapes, representing from 14 to 16% in 50 samples), we are left with eight predic- different hosts, and to correlate such dynamics to host geno- tions listed in Table 3. We consider it likely that these retained types. As this requires other data than are available on GISAID, interactions are due to epistasis, and not to inherited covariation. and which are less abundant at this time, we leave this for future An analogous investigation on a smaller dataset obtained with work. On the other hand, it is unlikely that the inferred couplings an earlier cutoff date (2 May 2020) and reported in SI Appendix, involve the host as a temporal variable, due to the much faster Tables S6 and S7 and Fig. S6 yielded six retained predictions, time scale of the evolution of the virus. involving, however, the same eight viral genes. The question Epistatic interactions are pair-wise statistical associations, but on epistasis vs. effects of inheritance (phylogeny) clearly merits are not simply correlations. The interaction between sites 8782 further investigation and testing as more data become available. and 28144, which is the second largest in Table 3, was identified Biological fitness is a many-sided concept and can also include as a very strong correlation in a very early study (28). As shown in aspects of game and cooperation (47–49). A fitness landscape Table 4, this correlation has generally decreased over time (using describes the propensity of an individual to propagate its geno- data with successively later cutoff dates). In the alternative global type in the absence of strategic interactions with other geno- model learning method of DCA which we use in the present types, and has traditionally been used to model the evolution of work, the score of statistical interdependency of this pair has

4 of 8 | www.pnas.org/cgi/doi/10.1073/pnas.2012331117 Zeng et al. Downloaded by guest on September 26, 2021 Downloaded by guest on September 26, 2021 34 ik itdi al r hw ee h icsposfrtesignificant the for Mutation plots in circos available are The samples here. randomized 50 shown all are of links 1 epistatic Table in listed links G| ∗ 28881-N Mutation 16 14 ratio Appearance dataset 2020 August 8 ratio the appearance on an (with appeared that ≥ 200 Top 2. Table C ple omn ape eune rdcswhich predicts sequences al. et sampled Zeng large. very many is diseases, to other applied many and DCA COVID-19 3 in Table targets, in listed proteins the for in drugs summarized of are account Potential on cost. COVID-19 appro- large like not its disease is nsp12, pandemic for a listed against Ponatinib, priate 57. and ref. same nsp2 in targeting the listed drugs are in One practical nsp6 nsp12. and locus no and approved first nsp6, already which nsp2, more The genes or for 57. in respectively, ORF3a, ref. lies, in in pairs three listed lies are locus drugs second potential the pairs remaining two 26144), and three 26144), (14805, (11083, the between 25563), of (1059, are all mutations, For nonsynonymous or 3 of gene. Table same mutation the in synonymous in predictions mutations target one eight to the either of predicted com- involve out or recent Five known guide (57). the drugs a SARS-CoV-2 scanned such of we offers compilation treatment, analysis prehensive drug our combinatorial whether search to addi- than To more a effect. possibility have tive may the simultaneously to drugs points both then using fitness that Epistasis the variants. modulating loci, respective both the around of are targets there on Suppose act locus. that another drugs at compensated mutation or a epistasis) by (positive epistasis) enhanced (sign is locus one at tion SARS-CoV-2 against vac- including coronavirus (53), of (54–56). date target main to the development phylogeny been cine has under protein up Spike see show weak; The interactions quite also are epistatic in that that any or research, pairs randomization found COVID-19 only not Spike, in have involving we avenue study, future present this promising the While a time) 52). indi- 50, be over (18, (HIV-positive may patient AIDS) controllers toward one elite slowly of in progressing viduals system dominate immune to the fitness strains tend and vitro escape that in on strains with observations found well (HIV clinical variabil- was correlate with protein sequence it and Gag by measurements, studies, HIV-1 implied the of in mutations series ity of a combinations In that development. vaccine for effect. QLE phylogenetic a a from nature recombination, since and in by work, different mixed DCA is thoroughly our population through later to a The apply recovered in phylogeny. assumption not interaction, to does due epistatic therefore is an 51 effect ref. the in made that interpretation criticism work the support same not the do they in 28, ref. pair hence in this data made in our first interdependency While statistical 4. of observation Fig. the see support second dates; or cutoff first different ranked four consistently over is pair the and large, remained † 20 22 20 naioai oain hsmtto sD1Gi Spike. in D614G is mutation this notation, acid amino In h nie fsmlswt hlgn admzto hc rsrethe preserve which randomization phylogeny with samples of indices The it hlgn ape r osdrdi total. in considered are samples phylogeny Fifty eetees h ubro obntoso oeta drug potential of combinations of number the Nevertheless, muta- a by fitness of loss that means interaction epistatic An targets candidate find to applied been have techniques DCA nsmlswt hlgn admzto taeybased strategy randomization phylogeny with samples in 10%) ∗ rti yePoentype Protein type Protein % , 40-s1 T| 14408-nsp12 07np T| T| 3037-nsp3 3037-nsp3 88- G| 28881-N 88- G| 28882-N olwn e.57. ref. following S5, Table Apppendix, SI ou ou 2 Locus 1 Locus -o.23403-S C-non. -o.283NG| G| 28883-N 28882-N A-non. A-non. -y.23403-S C-syn. -y.148np2T| 14408-nsp12 C-syn. -y.283NG| 28883-N A-syn. . S4 Table Appendix, SI † † IAppendix SI G| G| C-non. A-non. A-non. C-non. C-non. A-syn. . Rank 7103np 3FL 64-R3 G251V(G) alleles. minor and L7L(L) major between number 26144-ORF3a the by once specified randomizations; T541C(Y) phylogeny ranks with (6%) 47) L280L(L) see L7L(L) and 50 ‡ 200; 29 18060-nsp14 to (experiments of twice 51 L37F(L) and in out 34, L84S(L) rank 3 17858-nsp13 with 23) in (experiment 18877-nsp14 appears here. G251V(G) link P504L(P) listed 18060-nsp14 not †This are 2 11083-nsp6 Table in P504L(P) 28144-ORF8 randomization 26144-ORF3a T85I(T) 17747-nsp13 T541C(Y) ∗ 17747-nsp13 S76S(S) T455I(Y) 47 17858-nsp13 1059-nsp2 36 14805-nsp12 27 8782-nsp4 26 21 9 5 1 The 2020. August 8 as and (30). the repository 2020 given for May MSA are 2 reference used dates prealigned cutoff MSAs collec- the with as the datasets 2020 took pre- largest April we two and 8 a Here, date alignment sequences, cutoff resources. the with To computational 10,000 tion accelerate 2020. on than to burden April more recommended the 8 reduce is with and MSA 2020 datasets reference April larger aligned 1 two the dates for the cutoff 59) align with (58, datasets (MAFFT) Transform smaller Fourier two Fast using alignment sequence MSA. collection respective the for 51,676 selected of and numbers 10,587, dates. The 2020, 3,490, (30). May 2,268, in repository 2 given Github are is 2020, the genomes used on April sequences available 8 GISAID also of 2020, is list and April The 1 2020. August are 8 date dates and collection The bps the database. to of according GISAID (number investigation in our lengths for full used are and datasets quality Four high with (4) database Data. Methods and Materials the ORF3a, COVID-19 gene of viral manifestations severe in (37–39). to loci disease related eight is involve of which predictions out of of action three viral that list both note our of finally of sequences We we continu- pathogens. genome 57, the bacterial of ref. given and availability in value and have drugs increase could direct potential ing approach any of to general lists lead the not the hope did on above based discussed scan suggestions the possibil- if be of Even explosion would be combinatorial ities. there the however, and to would, shortcut information, pairs analogous sample no potential on all based so, equal nec- If ranked unconditionally virus. be from the treatment to for drug essary these for search investigation. assuming a positions, start detailed also conserved more can one further, that used note for be We can combinations and rank fitness, in to dependencies mutual have genes/loci entry of al .Ptnilysgicn psai ik nTbe1 and 1, Table in mutations links acid epistatic amino significant corresponding Potentially 3. Table “012345.” by represented “-NACGT” with . states, . minors “KFY. six the minors are while there the “N” 2) to “12345”; by transformed represented are “-” gaps . . The “KFY. 1) follows: as ways Filtering. MSA treat to introduced minorities. some “-” or gap insertions, alignment or the deletions or nucleotide (N), nucleotide” known “not mn cdi h eeec euneWhnH- tteposition the at Wuhan-Hu-1 sequence reference the in acid Amino † anpeito:egteittclns h ik rsre yphylogeny by preserved links The links. epistatic eight prediction: Main h S sabgmatrix big a is MSA The N eoi eune hc r lge over aligned are which sequences genomic ∗ σ Sswr osrce ihteoln lgmn evrMultiple server alignment online the with constructed were MSAs eaaye h osnu eune eoie nteGISAID the in deposited sequences consensus the analyzed We i n r apdt N”Tu v ttsrmi,wee“AG’are “NACG”’ where remain, states five Thus “N.” to mapped are ” fmatrix of 09np T85I(T 1059-nsp2 rti uainPoenmutation Protein mutation Protein eoefitrn,w rnfr h S ntodifferent two in MSA the transform we filtering, Before . S3 and S2 Figs. Appendix , SI S ou 1 Locus sete n ftefu uloie A ,G ) or T), G, C, (A, nucleotides four the of one either is n r loaalbeo h Github the on available also are and S2, Dataset mn cdAioacid Amino acid Amino S = {σ i n ‡ |i 56-R3 Q57H(Q) 25563-ORF3a ) = 1, . . . , NSLts Articles Latest PNAS r apdt N”Then “N.” to mapped are ” L, L n = oiin 1,2) Each 21). (16, positions 1, ou 2 Locus . . . , N composed }, aae S1, Dataset ≈ 0 000). 30, | f8 of 5

APPLIED PHYSICAL POPULATION SCIENCES BIOLOGY 1 P(σ1, ... , σL) = exp{−H(σ1, ... , σL)}, [3] Z

with an “energy function” X X H(σ1, ... , σL) = hi(σi) + Jij(σi, σj). [4] i ij

In the above, Jij can be related to the epistatic contribution to fitness between loci i and j with alleles σi and σj (11–13). The quantity hi is sim- ilarly a function of allele σi which depends on both additive and epistatic contributions to fitness involving locus i. It has been verified in in silico test- ing that, when the terms in Eq. 4 can be recovered, this is a means to infer epistatic fitness from samples of genotypes in a population (14). In the bac- terial realm, this approach was used earlier to infer epistatic contributions to fitness in the human pathogens Streptococcus pneumonia (61) and Neisseria gonorrhoeae (62), both of which are characterized by a high rate of recom- bination. The method was also tested on data on the bacterial pathogen Vibrio parahemolyticus (63). In that study, the results from DCA were not superior to an analysis based on Fisher exact test; see SI Appendix, Different Quantifications of Correlations for a discussion. This is consistent with the approach taken here, as V. parahemolyticus has a low rate of recombina- tion. Further details on the QLE state of evolving populations are given in SI Appendix, Quasi-Linkage Equilibrium (QLE) and Its Range of Validity. Fig. 3. Frobenius norm of pair-wise correlations between loci for the origi- nal 8 August 2020 dataset. The score pointed by the red arrow corresponds Inference Method for Epistasis between Loci. The basic assumption of mod- to the link of 1059 and 25563. eling the filtered MSA is that it is composed of independent samples that follow the Gibbs–Boltzmann distribution Eq. 3 with H as in Eq. 4. Higher- order interactions are also possible to include, but we ignore them here The following criteria are used for prefiltering of the MSA from the 8 (64). This assumption is a simplification of the biological reality; however, it August 2020 dataset. If, for one locus, the same nucleotide is found in more provides an efficient way to extract information from massive data. than 96.5% of this column, or if the sum of the percentages of A, C, G, and On the other hand, in the context of inference from protein sequences, T at this position is less than 20%, then this locus will be deleted. For each it has been argued that the one encoded in Eqs. 3 and 4 is the minimal sequence, if the percentage of a certain nucleotide is more than 80%, or if generative model, that is, capable not only of reproducing the empiri- the sum of the percentages of A, C, G, and T in this sequence is less than cal frequencies and correlations but also of generating new sequences 20%, then this sequence will be deleted. With this filtering criteria, many indistinguishable from natural sequences (16, 65, 66). loci but no sequences are deleted. There are left 51,676 sequences and 689 Many techniques have been developed to infer the direct couplings in Eq. loci. 3, as reviewed in ref. 31 and references therein; see also SI Appendix, Meth- ods of DCA. We employ the PLM method (13, 60, 67–70) to infer the epistatic B-effective Number. To mitigate the effects of dependent samplings, it is effects between loci from the aligned MSA. PLM estimates parameters from standard practice to attach to each collected genome sequence σ(b) a conditional probabilities of one sequence conditioned on all of the others. For a Potts model with multiple states q > 2, this conditional probability is weight wb (15, 16, 60), which normalizes its impact on the inference pro- cedure. An efficient way to measure the similarity between two sequences   (a) (b) P σ and σ is to compute the fraction of identical nucleotides and com- exp hi(σi) + j=6 i Jij(σi, σj) P(σ |σ ) = , [5] pare it with a preassigned threshold value x in the range 0 ≤ x ≤ 1. The i \i P  P  (b) u exp hi(u) + j6=i Jij(u, σj) weight of a sequence σ can be set as wb = 1/mb, with mb as the number of sequences in the MSA that are similar to σ(b), with u = {0, 1, 2, 3, 4, 5} as the possible state of σi. Eq. 5 depends on a much (a) (b) smaller parameter set compared with that in Eq. 3. This leads to a much m = |{a ∈ {1, ... , B}}: overlap(σ , σ ) ≥ x|; [1] b faster inference procedure of parameters compared with the maximum like- lihood method. With a given independent sample set, one can maximize the here, overlap is the fraction of loci where the two sequences are identical. corresponding log-likelihood function The B-effective number of the transformed sequences is defined as

B X Beff = wb. [2] b=1

We compare the Beff value with different x for the filtered MSA with q = 5 and q = 6. As shown in Fig. 5, the dataset with six states shows larger Beff number for all tested x. We thus perform our analysis on the dataset with q = 6 states, where “-NACGT” is represented by “012345.” The reweighting procedure partially addresses a point raised (51), that sequenced viral genomes are not a random sample of the global population. That is, even if sequencing is biased by the country they occur in and by contact tracing, sufficiently similar genomes will have lower weight, and so each will contribute less to predictions.

Elements of QLE. The phenomenon of QLE was discovered by M. Kimura (10) while investigating the steady-state distribution over two biallelic loci evolving under mutation, recombination, and selection, with both additive and epistatic contributions to fitness. In the absence of epistasis, such a system evolves toward linkage equilibrium (LE) where the distribution of Fig. 4. Ranks for significant epistatic effects with data collection date (1 alleles at the two loci are independent. The covariance of alleles at the April 2020, 8 April 2020, 2 May 2020, and 8 August 2020) by PLM (dashed two loci then vanishes. In the presence of pair-wise epistasis and sufficiently lines) and correlation analysis (solid lines). The ranks of the PLM scores are high rate of recombination, the steady-state distribution takes form of a almost constant, while the ranks of the correlations vary significantly and Gibbs–Boltzmann form mostly drop as more data accumulate (later cutoff dates).

6 of 8 | www.pnas.org/cgi/doi/10.1073/pnas.2012331117 Zeng et al. Downloaded by guest on September 26, 2021 Downloaded by guest on September 26, 2021 “AG”.Tenme fsae sdtrie ytetasomciei of criteria transform the by MSA. determined prefiltered is the states of number The (“NACGT”). old ege al. et Zeng com- unlimited and terms data the unlimited resources, Given putational independent. are Analysis. loci different Correlation to Relation implementation the in (60) PLM loci parameter between regularization of interactions with 71 version ref. asymmetric in available the PLM run then we where 5. Fig. ml ubro edn eandpeitosaogamc agrstof set larger much a among predictions retained leading of number small Distributions. Background Randomized (72). software “circos” by performed is visualization epistases The “Wuhan-Hu-1,” these nucleotides. corresponding sequence of the reference of a positions the With scores identify pairs. significant we 200 the top from on the information focus loci, between we important sequences, extract genomic To SARS-CoV-2 loci. massive pair-wise between Scores. paradigm PLM with Analysis Epistasis 30. of ref. plots in Circos available matrices. are scores correlation correlation of on norm based Frobenuis interactions of in instead Results MI S1 (MI). using Table information and categor- mutual S5 two is Fig. between variables dependency statistical correlation random in of ical recovered score stably different be A cannot analysis. DCA by retrieved interactions ing locus interaction qualita- third their a The with if interact inference well. even model as coupled”) global zero rectly and be Eqs. analysis would on correlation based correlations between of of difference strength norm tive of Frobenius The score zero. a be also would as defined covariances, locus–locus The zero. be then .M Kraemer M. https://www.2. pandemic. (COVID-19) disease Coronavirus Organization, Health World 1. 9eiei nChina. in epidemic 19 2020. September 2 Accessed who.int/emergencies/diseases/novel-coronavirus-2019. x e denotes Red . s B aestesqecs(ape) rm1to 1 from (samples), sequences the labels PL eff i ubro h uut22 rfitrddtstwt thresh- with dataset prefiltered 2020 August 8 the of number h h feto ua oiiyadcnrlmaue nteCOVID- the on measures control and mobility human of effect The al., et i , 3 {J and c ij ij } q (a,  = hwta h eutde o usatal hneif change substantially not does result the that show = − Science 4 b) tts(-AG”,adbakdenotes black and (“-NACGT”), states 6 sta w loci two that is N N 1 1 = i k X X and D aai al n i.4so httelead- the that show 4 Fig. and 4 Table in Data . s s 9–9 (2020). 493–497 368, 1 σ h log i j ,a i r crdb h rbnu norm. Frobenius the by scored are J  1 ij nL,tedsrbtoso lee over alleles of distributions the LE, In σ X σ L rcdr ilsaflyconnected fully a yields procedure PLM j u nEq. in i (s) ,b E  exp + a oass h aiiyo a of validity the assess to way A − i N

1   4 and 1 J X σ h ij nerdfo h aawould data the from inferred i i s ,a (u) c szr,poie hyboth they provided zero, is ij j

X (a, j a ecreae (“indi- correlated be may 6=i D + N 1 vridcs(a, indices over b) ihtefitrdMSA, filtered the With . X σ J j 6=i ij j ,b (σ λ E J i (s) = ij , (u, , .Teinferred The 0.1. σ j (s) σ j (s) ) IAppendix, SI q )   = , states 5 as b) [6] [7] 9033-s328144-ORF8 3037-nsp3 3980 n noainpormMCRS-06udrGatAreet734439 Agreement research Grant H2020 under EU MSCARISE-2016 INFERNET. the program by and innovation funding Posts Universitario and acknowledges Collegio E.R.H. of and Fonda.” Trieste) by University “Luciano supported of was (University Nanjing V.D. Scholarship of of Extra-Erasmus work The Foundation NY217013). (Grant Research Telecommunications (Grant 2018, Province Scientific of “Viral Studies Jiangsu China Overseas and of for of Sweden) Scholarship Foundation Foundation Government (Solna, Jiangsu Science Science BK20170895), Natural Natural Labs National 11705097), by Life work (Grant sponsored the The and for was suggestions. of and Elofsson H.-L.Z. input Science of use Arne for program” the the thanks research evolution E.A. sharing of sequence Also, for participants E.R.H. Weigt other with Martin developed and “phylogeny” discussions, numerous for ACKNOWLEDGMENTS. Availability. Data from in PLM found by Principles. recovered strate- be randomization terms can the in on due gies details up More are “phylogeny.” show by interactions also scrambled MSAs epistatic should predicted they phylogeny, the to therefore if would Conversely, information unchanged. Phy- distance changed be invariant. sec- intersequence are kept from the are rows inferred distances and In logeny intersequence columns randomized size. that where is so sample but schedule MSA simultaneously finite original annealing the by simulated “phylogeny,” a caused as by noise are to samples referred the such strategy, to from ond inferred due couplings nonzero DCA and destroys only This correlations, sites. of all all “profile,” kind for of as statistics all permutation single-column to independent the but refer conserving random columns, finite-size we its by which by MSA strategy, input given orig- first the couplings The from randomizes phylogeny. spurious obtained by of results or disentangling the effects allows to them MSA, comparing inal by and generated ensembles strategies, sequences are artificial these on interactions) DCA pre- (epistatic Running are (75). couplings relations removed coevolutionary phylogenetic intrinsic (or) MSA, while and the patterns served, on conservation strategies its randomization that distinct such two perform we scores, PLM Randomization. with Scores PLM 74. distributions and background 13 refs. randomized in on described relying are genome-scale studies to RNA applied earlier noncoding (PLM) two a DCA of MSAs, in context energies the In binding data, (73). RNA–RNA pipeline randomized discovery predicted to for applied values done procedure retained same at was the the appear as comparing from would values by values largest addressed such the be samples, to can of problem from number different, This large are a random. they in because when, large, are case 0.3603 values the and retained or of measure) cutoff, subset some small some 0.3603 to a (by according made large were empirical 0.2821 selection case, an 0.3609 if any retained be in also then, would are The backgrounds. predictions randomized to retained compare to is predictions 0.3837 discarded mostly 28144-ORF8 28882-N omitted. are regions 0.3844 coding outside 25563-ORF3a 28883-N is one 0.3842 least at which 28883-N of loci between 8782-nsp4 1059-nsp2 23403-S 28881-N 14408-nsp12 28881-N 23403-S 3969 28882-N 2394 1071 14408-nsp12 585 3037-nsp3 584 3037-nsp3 581 460 458 455 al .Tp1 ik on ycreainaayi ntecoding 2020 the Rank August in 8 analysis until correlation dataset by the found for links region 10 Top 4. Table .Y h,J caly IAD lbliiitv nsaigaliflez aaFo vision data–From influenza all sharing on initiative Global GISAID: McCauley, J. Shu, Y. 4. Salje H. 3. ∗ oreality. to (2020). akfrtp1 ik srne ycreainaayi.Correlations analysis. correlation by ranked as links 10 top for Rank ∗ siaigtebre fSR-o- nFrance. in SARS-CoV-2 of burden the Estimating al., et uoSurveill. Euro p ou rti ou rti rbnu score Frobenius protein 2 Locus protein 1 Locus au.Tepolmi hshwt itnus h aewhere case the distinguish to how thus is problem The value. 23403-S l td aaaeicue nteatceand article the in included are data study All IAppendix , SI 09 (2017). 30494 22, etakPos atnWitadRbroMulet Roberto and Weigt Martin Profs. thank We oudrtn h aueo h o 200 top the of nature the understand To hlgntcRnoiaino DCA: of Randomization Phylogenetic 28144-ORF8 NSLts Articles Latest PNAS Science IAppendix SI 0.1487 0.1486 0.1803 208–211 369, | f8 of 7 .

APPLIED PHYSICAL POPULATION SCIENCES BIOLOGY 5. M. Pachetti et al., Emerging SARS-CoV-2 mutation hot spots include a novel RNA- 40. E. Issa, G. Merhi, B. Panossian, T. Salloum, S. Tokajian, SARS-CoV-2 and ORF3a: Non- dependent-RNA polymerase variant. J. Transl. Med. 18, 179 (2020). synonymous mutations, functional domains, and viral pathogenesis. mSystems 5, 6. M. M. Lai, D. Cavanagh, The of coronaviruses. Adv. Virus Res. 48, e00266-20 (2020). 1–100 (1997). 41. Y. Zhang et al., The ORF8 protein of SARS-CoV-2 mediates immune evasion through 7. R. L. Graham, R. S. Baric, Recombination, reservoirs, and the modular spike: potently downregulating MHC-I. bioRxiv:10.1101/2020.05.24.111823 (24 May 2020). Mechanisms of coronavirus cross-species transmission. J. Virol. 84, 3134–3146 42. J. Y. Li et al., The ORF6, ORF8 and nucleocapsid proteins of SARS-CoV-2 inhibit type i (2010). interferon signaling pathway. Virus Res. 286, 198074 (2020). 8. J. Gribble et al., The coronavirus proofreading exoribonuclease mediates extensive 43. D. Mercatelli, F. M. Giorgi, Geographic and genomic distribution of SARS-CoV-2 viral recombination. bioRxiv:10.1101/2020.04.23.057786 (25 April 2020). mutations. Front. Microbiol. 11, 1800 (2020). 9. X. Li et al., Emergence of SARS-CoV-2 through recombination and strong purifying 44. J. Chen et al., Structural basis for helicase-polymerase coupling in the SARS-CoV-2 selection. Sci. Adv. 6, eabb9153 (2020). replication-transcription complex. Cell 182, 1560–1573 (2020). 10. M. Kimura, Attainment of quasi linkage equilibrium when gene frequencies are 45. D. Benvenuto et al., Evolutionary analysis of SARS-CoV-2: How mutation of non- changing by natural selection. Genetics 52, 875–890 (1965). structural protein 6 (NSP6) could affect viral autophagy. J. Infect. 81, e24–e27 11. R. A. Neher, B. I. Shraiman, Competition between recombination and epistasis can (2020). cause a transition from allele to genotype selection. Proc. Natl. Acad. Sci. U.S.A. 106, 46. M. R. Denison, R. L. Graham, E. F. Donaldson, L. D. Eckerle, R. S. Baric, Coronaviruses: 6866–6871 (2009). An RNA proofreading machine regulates replication fidelity and diversity. RNA Biol. 12. R. A. Neher, B. I. Shraiman, Statistical genetics and evolution of quantitative traits. 8, 270–279 (2011). Rev. Mod. Phys. 83, 1283–1300 (2011). 47. J. Maynard Smith, Evolution and the Theory of Games (Cambridge University Press, 13. C. Y. Gao, F. Cecconi, A. Vulpiani, H. J. Zhou, E. Aurell, DCA for genome-wide epistasis 1982). analysis: The statistical genetics perspective. Phys. Biol. 16, 026002 (2019). 48. M. A. Nowak, K. Sigmund, Evolutionary dynamics of biological games. Science 303, 14. H. L. Zeng, E. Aurell, Inferring genetic fitness from genomic data. Phys. Rev. E 101, 793–799 (2004). 052409 (2020). 49. J. C. Claussen, A. Traulsen, Cyclic dominance and biodiversity in well-mixed 15. F. Morcos et al., Direct-coupling analysis of residue coevolution captures native con- populations. Phys. Rev. Lett. 100, 058104 (2008). tacts across many protein families. Proc. Natl. Acad. Sci. U.S.A. 108, E1293–E1301 50. A. Ferguson et al., Translating hiv sequences into quantitative fitness landscapes (2011). predicts viral vulnerabilities for rational immunogen design. Immunity 38, 606–617 16. S. Cocco, C. Feinauer, M. Figliuzzi, R. Monasson, M. Weigt, Inverse statistical (2013). physics of protein sequences: A key issues review. Rep. Prog. Phys. 81, 032601 51. O. A. MacLean, R. J. Orton, J. B. Singer, D. L. Robertson, No evidence for distinct types (2018). in the evolution of SARS-CoV-2. Virus Evol. 6, veaa034 (2020). 17. K. Shekhar et al., Spin models inferred from patient-derived viral sequence data 52. V. Dahirel et al., Coordinate linkage of HIV evolution reveals regions of immunologi- faithfully describe HIV fitness landscapes. Phys. Rev. E 88, 062705 (2013). cal vulnerability. Proc. Natl. Acad. Sci. U.S.A. 108, 11530–11535 (2011). 18. J. K. Mann et al., The fitness landscape of HIV-1 Gag: Advanced modeling approaches 53. L. V. Tse, R. M. Meganck, R. L. Graham, R. S. Baric, The current and future state of vac- and validation of model predictions by in vitro testing. PLoS Comput. Biol. 10, cines, antivirals and gene therapies against emerging coronaviruses. Front. Microbiol. e1003776 (2014). 11, 658 (2020). 19. R. R. Cheng et al., Connecting the sequence-space of bacterial signaling proteins 54. F. Amanat, F. Krammer, SARS-CoV-2 vaccines: Status report. Immunity 52, 583–589 to phenotypes using coevolutionary landscapes. Mol. Biol. Evol. 33, 3054–3064 (2020). (2016). 55. T. T. Le et al., The COVID-19 vaccine development landscape. Nat. Rev. Drug Discov. 20. J. A. de la Paz, C. M. Nartey, M. Yuvaraj, F. Morcos, Epistatic contributions promote 19, 305–306 (2020). the unification of incompatible models of neutral molecular evolution. Proc. Natl. 56. T. T. Le, J. P. Cramer, R. Chen, S. Mayhew, Evolution of the COVID-19 vaccine Acad. Sci. U.S.A. 117, 5873–5882 (2020). development landscape. Nat. Rev. Drug Discov. 19, 667–668 (2020). 21. R. Horta, P. Barrat-Charlaix, M. Weigt, Toward inferring Potts models for phylogenet- 57. D. Gordon et al., A SARS-CoV-2 protein interaction map reveals targets for drug ically correlated sequence data. Entropy 21, 1090 (2019). repurposing. Nature 16, 026002 (2020). 22. P. Forster, L. Forster, C. Renfrew, M. Forster, Phylogenetic network analysis of SARS- 58. K. Katoh, J. Rozewicki, K. D. Yamada, MAFFT online service: Multiple sequence CoV-2 genomes. Proc. Natl. Acad. Sci. U.S.A. 117, 9241–9243 (2020). alignment, interactive sequence choice and visualization. Briefings Bioinform. 20, 23. R. A. Khailany, M. Safdar, M. Ozaslan, Genomic characterization of a novel SARS-CoV- 1160–1166 (2017). 2. Gene Rep. 19, 100682 (2020). 59. S. Kuraku, C. M. Zmasek, O. Nishimura, K. Katoh, aLeaves facilitates on-demand 24. H. Y. Cai, K. K. Cai, J. Li, Identification of novel missense mutations in a large number exploration of metazoan gene family trees on MAFFT sequence alignment server with of recent SARS-CoV-2 genome sequences. http://doi.org/10.20944/preprints202004. enhanced interactivity. Nucleic Acids Res. 41, W22–W28 (2013). 0482.v1 (28 April 2020). 60. M. Ekeberg, T. Hartonen, E. Aurell, Fast pseudolikelihood maximization for direct- 25. X. Deng et al., A genomic survey of SARS-CoV-2 reveals multiple introductions coupling analysis of protein structure from many homologous amino-acid sequences. into Northern California without a predominant lineage. medRxiv:10.1101/2020. J. Comput. Phys. 276, 341–356 (2014). 03.27.20044925 (30 March 2020). 61. M. J. Skwark et al., Interacting networks of resistance, virulence and core machinery 26. J. Phelan et al., Controlling the SARS-CoV-2 outbreak, insights from large scale whole genes identified by genome-wide epistasis analysis. PLoS Genet. 13, e1006508 (2017). genome sequences generated across the world. bioRxiv:10.1101/2020.04.28.066977 62. B. Schubert, R. Maddamsetti, J. Nyman, M. R. Farhat, D. S. Marks, Genome-wide dis- (29 April 2020). covery of epistatic loci affecting antibiotic resistance in Neisseria gonorrhoeae using 27. P. Sashittal, Y. Luo, J. Peng, M. El-Kebir, Characterization of SARS-CoV-2 viral diversity evolutionary couplings. Nat. Microbiol. 4, 328–338 (2019). within and across hosts. bioRxiv:10.1101/2020.05.07.083410 (13 May 2020). 63. Y. Cui et al., The landscape of coadaptation in Vibrio parahaemolyticus. eLife 9, 28. X. Tang et al., On the origin and continuing evolution of SARS-CoV-2. Nat. Sci. Rev. 7, e54136 (2020). 1012–1023 (2020). 64. E. Schneidman, M. J. Berry, R. Segev, W. Bialek, Weak pairwise correlations imply 29. F. Wu et al., A new coronavirus associated with human respiratory disease in China. strongly correlated network states in a neural population. Nature 440, 1007–1012 Nature 579, 265–269 (2020). (2006). 30. H. L. Zeng, Data from “hlzeng/Filtered MSA SARS CoV 2.” Github. https://github. 65. W. P. Russ, D. M. Lowery, P. Mishra, M. B. Yaffe, R. Ranganathan, Natural-like function com/hlzeng/Filtered MSA SARS CoV 2. Accessed 13 November 2020. in artificial WW domains. Nature 437, 579–583 (2005). 31. H. C. Nguyen, R. Zecchina, J. Berg, Inverse statistical problems: From the inverse Ising 66. M. Socolich et al., Evolutionary information for specifying a protein fold. Nature 437, problem to data science. Adv. Phys. 66, 197–261 (2017). 512–518 (2005). 32. F. K. Yoshimoto, The proteins of severe acute respiratory syndrome coronavirus-2 67. J. Besag, Statistical analysis of non-lattice data. Statistician 24, 179–195 (1975). (SARS CoV-2 or n-COV19), the cause of COVID-19. Protein J. 39, 198–216 (2020). 68. Ravikumar P., Wainwright M. J., Lafferty J. D. (2010) High-dimensional ising model 33. A. H. Rad, A. D. McLellan, Implications of SARS-CoV-2 mutations for genomic RNA selection using `1-regularized logistic regression. Ann. Stat. 38, 1287–1319. structure and host microRNA targeting. bioRxiv:10.1101/2020.05.15.098947 (16 May 69. E. Aurell, M. Ekeberg, Inverse Ising inference using all the data. Phys. Rev. Lett. 108, 2020). 090201 (2012). 34. R. Wang et al., Characterizing SARS-CoV-2 mutations in the United States. 70. M. Ekeberg, C. Lovkvist,¨ Y. Lan, M. Weigt, E. Aurell, Improved contact prediction arXiv:2007.12692 (24 July 2020). in proteins: Using pseudolikelihoods to infer Potts models. Phys. Rev. E 87, 012707 35. D. M. Kern et al., Cryo-EM structure of the SARS-CoV-2 3a ion channel in lipid (2013). nanodiscs. bioRxiv:10.1101/2020.06.17.156554 (18 June 2020). 71. C. Y. Gao, Data from “gaochenyi/cc-plm.” Github. http://github.com/gaochenyi/CC- 36. Y. J. Tan et al., The severe acute respiratory syndrome coronavirus 3a protein up- PLM. Accessed 13 November 2020. regulates expression of fibrinogen in lung epithelial cells. J. Virol. 79, 10083–10087 72. M. I. Krzywinski et al., Circos: An information aesthetic for comparative genomics. (2005). Genome Res. 19, 1639–1645 (2009). 37. W. Lu et al., Severe acute respiratory syndrome-associated coronavirus 3a protein 73. P. Mandin, F. Repoila, M. Vergassola, T. Geissmann, P. Cossart, Identification of new forms an ion channel and modulates virus release. Proc. Natl. Acad. Sci. U.S.A. 103, noncoding in Listeria monocytogenes and prediction of mRNA targets. Nucleic 12540–12545 (2006). Acids Res. 35, 962–974 (2007). 38. K. L. Siu et al., Severe acute respiratory syndrome coronavirus ORF3a protein acti- 74. Y. Xu, S. Puranen, J. Corander, Y. Kabashima, Inverse finite-size scaling for vates the NLRP3 inflammasome by promoting TRAF3-dependent ubiquitination of high-dimensional significance analysis. Phys. Rev. E 97, 062112 (2018). ASC. FASEB J. 33, 8865–8877 (2019). 75. E. R. Horta, M. Weigt, Phylogenetic correlations have limited effect on coevolution- 39. Y. Ren et al., The ORF3a protein of SARS-CoV-2 induces apoptosis in cells. Cell. Mol. based contact prediction in proteins. bioRxiv:10.1101/2020.08.12.247577 (13 August Immunol. 17, 881–883 (2020). 2020).

8 of 8 | www.pnas.org/cgi/doi/10.1073/pnas.2012331117 Zeng et al. Downloaded by guest on September 26, 2021