<<

Manuscript submitted to eLife

Mapping mutational effects along the evolutionary landscape of HIV envelope

Hugh K. Haddox1,2,†, Adam S. Dingens1,2,†, Sarah K. Hilton1,3, Julie Overbaugh4, Jesse D. Bloom1,3,*

*For correspondence: [email protected] (JDB)
†These authors contributed equally to this work
1Basic Sciences Division and Computational Biology Program, Fred Hutchinson Cancer Research Center, Seattle, WA; 2Molecular and Cellular Biology PhD program, University of Washington, Seattle, WA; 3Department of Genome Sciences, University of Washington, Seattle, WA; 4Human Biology Division and Epidemiology Program, Fred Hutchinson Cancer Research Center, Seattle, WA

Abstract The immediate evolutionary space accessible to HIV is largely determined by how single amino-acid mutations affect fitness. These mutational effects can shift as the virus evolves. However, the prevalence of such shifts in mutational effects remains unclear. Here we quantify the effects on viral growth of all amino-acid mutations to two HIV envelope (Env) proteins that differ at >100 residues. Most mutations similarly affect both Envs, but the amino-acid preferences of a minority of sites have clearly shifted. These shifted sites usually prefer a specific amino acid in one Env, but tolerate many amino acids in the other. Surprisingly, shifts are only slightly enriched at sites that have substituted between the Envs, and many occur at residues that do not even contact substitutions. Therefore, long-range epistasis can unpredictably shift Env’s mutational tolerance during HIV evolution, although the amino-acid preferences of most sites are conserved between moderately diverged viral strains.

Introduction

HIV’s envelope (Env) evolves very rapidly. The major group of HIV-1 that is responsible for the current pandemic originated from a virus that entered the human population ~100 years ago (Sharp and Hahn, 2011; Worobey et al., 2008; Faria et al., 2014). The descendants of this virus have evolved so rapidly that their Envs now have as little as 65% protein identity (Lynch et al., 2009). For comparison, protein orthologs shared between humans and mice have only diverged to a median identity of 78% over 90 million years (Waterston et al., 2002; Hedges et al., 2006). Env’s rapid evolution has dire consequences for anti-HIV immunity, since it erodes the efficacy of most neutralizing antibodies (Albert et al., 1990; Wei et al., 2003; Richman et al., 2003; Burton et al., 2005).

Because of this public-health importance, numerous studies have experimentally characterized aspects of the “evolutionary landscape” that Env traverses. The immediate evolutionary space accessible to any given Env is largely defined by the effects on viral fitness of all single amino-acid mutations to Env. Most mutational studies have measured how just a small number of these mutations affect viral growth in cell culture, although it has recently become possible to use deep mutational scanning to measure the effects of many (Al-Mawsawi et al., 2014; Duenas-Decamp et al., 2016) or even all (Haddox et al., 2016) single amino-acid mutations to an Env variant.

But interpreting these studies in the context of Env evolution requires addressing a fundamental question: How informative are mutational studies of a single protein variant about constraints on long-term evolution? During protein evolution, substitutions at one site can change the effect of mutations at other sites (Natarajan et al., 2013; Gong et al., 2013; Harms and Thornton, 2014; Podgornaia and Laub, 2015; Starr and Thornton, 2016; Klink and Bazykin, 2017). We will follow the nomenclature of Pollock et al. (2012) to refer to these changes in mutational effects as shifts in a site’s amino-acid preferences. Such shifts can accumulate as substitutions become entrenched via epistatic interactions with subsequent changes (Starr et al., 2017; Pollock et al., 2012; Shah et al., 2015; Bazykin, 2015), although the magnitude of these shifts is usually limited (Doud et al., 2015; Chan et al., 2017; Ashenberg et al., 2013; Risso et al., 2014). Given that the Envs of circulating HIV strains represent a vast collection of homologs that often differ at >100 residues, shifts in amino-acid preferences could make the outcome of any study highly dependent on the Env used. Indeed, epistasis among a few combinations of Env mutations has been experimentally demonstrated (da Silva et al., 2010), and epistatic fitness landscapes have been computationally inferred for a variety of HIV proteins (Kouyos et al., 2012; Ferguson et al., 2013; Mann et al., 2014; Barton et al., 2015) including Env (Louie et al., 2018). However, the only protein-wide experimental studies of how amino-acid preferences shift during evolution have examined proteins that are structurally far simpler than Env, which forms a large heavily glycosylated heterotrimeric complex that transitions through multiple conformational states (Munro et al., 2014; Ozorowski et al., 2017).

Here we use an improved version of a previously described deep mutational scanning strategy (Haddox et al., 2016) to measure the effects on viral growth of all single amino-acid mutations to two transmitted-founder virus Envs that differ by >100 mutations. We compare these complete maps of mutational effects to identify sites that have shifted in their amino-acid preferences between the Envs. Most sites show no detectable shifts, but 30 sites have clearly shifted preferences. These shifted sites usually prefer a specific amino acid in one Env but have shifted to tolerate many amino acids in the other Env. The shifted sites cluster in structure, but are often distant from any amino-acid substitutions that distinguish the two Envs, demonstrating the action of long-range epistasis. By aggregating our measurements for both Envs, we identify sites that evolve faster or slower in nature than expected given the functional constraints measured in the lab, probably due to pressure for immune evasion. Overall, our work provides complete across-strain maps of mutational effects that inform analyses of Env’s evolution and function.

Results

Two Envs from clade A transmitted-founder viruses

The viruses most relevant to HIV’s long-term evolution are those that are transmitted from human to human. However, the only prior work that has measured how all Env amino-acid mutations affect HIV growth is a study by some of us (Haddox et al., 2016) that used a late-stage lab-passaged CXCR4-tropic virus (LAI; Peden et al., 1991). The properties of Env can vary substantially between such late-stage viruses and the transmitted-founder viruses relevant to HIV’s long-term evolution (Sagar et al., 2006; Wilen et al., 2011; Parrish et al., 2013; Ronen et al., 2015). We therefore selected Envs from two transmitted-founder viruses, BG505.W6M.C2.T332N and BF520.W14M.C2 (hereafter referred to as BG505 and BF520), that were isolated from HIV-infected infants shortly after mother-to-child transmission (Nduati et al., 2000; Wu et al., 2006; Goo et al., 2014). The BG505 Env has been extensively studied from a structural standpoint (Julien et al., 2013; Lyumkis et al., 2013; Pancera et al., 2014; Huang et al., 2014; Sanders et al., 2015; Stewart-Jones et al., 2016; Gristick et al., 2016), and variants of this Env are being tested as vaccine immunogens (Sanders et al., 2013, 2015; de Taeye et al., 2015). We used the T332N variant of BG505 Env because it has a common glycosylation site that is targeted by many anti-HIV antibodies (Sanders et al., 2013). The BF520 Env was isolated from an infant who developed an early broad anti-HIV antibody response (Goo et al., 2014; Simonich et al., 2016). We have previously created comprehensive codon-mutant libraries of the BF520 Env and used them to map HIV antibody escape (Dingens et al., 2017), but these BF520 libraries have not been characterized with respect to how mutations affect viral growth.

[Figure 1 image: phylogenetic tree with BG505 and BF520 labeled; scale bar, 0.1 codon substitutions per site.]

Figure 1. Phylogenetic tree showing the relationship of BG505 and BF520 to other clade A Envs. The tree shows the 69 Envs in the alignment in Figure 1–source data 1, which is a subsample of clade A sequences from the group M alignment in the Los Alamos HIV sequence database (http://www.hiv.lanl.gov). Sites not mutagenized in our experiments (the signal peptide and cytoplasmic tail) or that are poorly alignable were masked as indicated in Figure 1–source data 2, leaving 616 alignable sites. The pairwise identity of BG505 and BF520 to other sequences at alignable sites is in Figure 1–Figure supplement 1. The tree topology was inferred using RAxML (Stamatakis, 2014) under the GTRCAT model of nucleotide substitution, and branch lengths were optimized under the M0 Goldman-Yang model (Yang et al., 2000) using phydms (Hilton et al., 2017).
Figure 1–Figure supplement 1. Pairwise identity of all Env sequences to BG505 and BF520.
Figure 1–source data 1. The alignment of clade A env coding sequences is in cladeA_alignment.fasta.
Figure 1–source data 2. The 240 Env sites masked in all phylogenetic analyses because they were not mutagenized in our experiments or are poorly alignable are listed in alignment_mask.csv.

Both BG505 and BF520 are from clade A of the major (M) group of HIV-1. Figure 1 shows the phylogenetic relationship among these two Envs and other clade A sequences. BG505 and BF520 are identical at 721 of the 836 pairwise-alignable protein sites (86.2% identity). However, in our experiments we mutagenized only the ectodomain and transmembrane domain of Env, and excluded the signal peptide and cytoplasmic tail. The reason is that we measure how Env mutations affect viral growth, which is influenced both by the functionality of Env protein molecules and by their expression level. Mutations in the signal peptide and cytoplasmic tail commonly affect Env expression level (Chakrabarti et al., 1989; Yuste et al., 2004; Li et al., 1994), so we excluded these regions with the goal of reducing the degree to which we simply identified mutations that affected Env expression. In the ectodomain and transmembrane domain of Env, BG505 and BF520 are identical at 549 of the 616 sites (89.1% identity) that are alignable across clade A Envs (Figure 1–source data 1, Figure 1–source data 2). The divergence between BG505 and BF520 therefore offers ample opportunity to investigate mutational shifts during Env evolution.
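The identity figures quoted above are ratios of identical sites to compared sites. The following is a minimal sketch of that calculation, assuming two protein sequences already aligned to equal length, with "-" for gaps and an optional set of masked column indices. The function names and the toy sequences are hypothetical and are not taken from the analysis pipeline used in this work.

```python
def pairwise_identity(seq_a, seq_b, masked=frozenset()):
    """Return (n_identical, n_compared) over aligned columns.

    Columns that are masked, or that contain a gap ("-") in either
    sequence, are skipped. Inputs are assumed to be pre-aligned.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    n_identical = 0
    n_compared = 0
    for i, (a, b) in enumerate(zip(seq_a, seq_b)):
        if i in masked or a == "-" or b == "-":
            continue
        n_compared += 1
        if a == b:
            n_identical += 1
    return n_identical, n_compared


def percent_identity(n_identical, n_compared):
    """Convert counts of identical and compared sites to a percentage."""
    return 100.0 * n_identical / n_compared


# Toy alignment (hypothetical): column 3 has a gap, column 6 is masked.
print(pairwise_identity("MKV-LTA", "MRVALTG", masked={6}))  # (4, 5)

# Counts reported in the text for BG505 versus BF520.
print(round(percent_identity(721, 836), 1))  # 86.2
print(round(percent_identity(549, 616), 1))  # 89.1
```

Skipping gapped columns is one of several reasonable conventions; a different treatment of gaps would change the denominator.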

Deep mutational scanning of each Env

We have previously described a deep mutational scanning strategy for measuring how all amino-acid mutations to Env affect HIV growth in cell culture, and applied this strategy to the late-stage lab-adapted LAI strain (Haddox et al., 2016). Here we made several modifications to this earlier strategy to apply it to transmitted-founder Envs and to reduce the experimental noise. This last consideration is especially important when comparing Envs, since it is only possible to reliably detect differences that exceed the magnitude of the experimental noise. Our modified deep mutational scanning strategy is in Figure 2A. This approach had the following substantive changes: instead of SupT1 cells, we used SupT1.CCR5 cells (SupT1 cells that express CCR5 in addition to CXCR4; Boyd et al., 2015) to support growth of viruses with transmitted-founder, CCR5-tropic Envs; we used more virions for the first passage (≥3×10⁶ versus 5×10⁵ infectious units per library) to avoid bottlenecking library diversity; and rather than performing a full second passage, we did just a short high-MOI infection to enable recovery of env genes from infectious virions without bottlenecking (Figure 2A). We performed this deep mutational scanning in full biological triplicate for both BG505 and BF520 (Figure 2B). Our libraries encompassed all codon mutations to all sites in Env except for the signal peptide and cytoplasmic tail.

Figure 2. Deep mutational scanning workflow. (A) We made libraries of proviral HIV plasmids with random codon-level mutations in the env gene. The number of mutations per gene approximately followed a Poisson distribution with a mean between 1 and 1.5 (Figure 2–Figure supplement 1). We transfected the plasmids into 293T cells to generate mutant viruses, which lack a genotype-phenotype link since cells are multiply transfected. To establish a genotype-phenotype link and select for Env variants that support HIV growth, we passaged the libraries in SupT1.CCR5 cells for four days at a low multiplicity of infection (MOI) of 0.01. To isolate the env genes from only viruses that encoded a functional Env protein, we infected the passaged libraries into SupT1.CCR5 cells at high MOI and harvested reverse-transcribed non-integrated viral DNA after 12 hours. We then deep sequenced the env genes from these final samples as well as the initial plasmid library, using molecular barcoding to reduce sequencing errors. We also deep sequenced identically handled wildtype controls to estimate error rates. Using these sequencing data, we estimated the preference for each of the 20 amino acids at each site in Env. These data are represented in logo plots, with the height of each letter proportional to that site’s preference for that amino acid. (B) We conducted this experiment in full biological triplicate for both BG505 and BF520, beginning each replicate with independent creation of the plasmid mutant library. These replicates therefore account for all sources of noise and error in the experiments.
Figure 2–Figure supplement 1. Sanger sequencing of selected clones from the mutant plasmid libraries.

The deep mutational scanning effectively selected for functional Envs, as evidenced by strong purifying selection against stop codons. Figure 3A shows the average frequency of mutations across Env in the plasmid mutant libraries, the mutant viruses, and wildtype controls as determined from the deep sequencing. The mutant viruses show clear selection against stop codons and many nonsynonymous mutations (Figure 3A). This selection is more apparent if we correct for the background error rates estimated from the wildtype controls (Figure 3–source data 2). The error-corrected frequencies of stop codons drop to 3%–16% of their original values (Figure 3–source data 2), with the residual stop codons probably due to some non-functional virions surviving through complementation by other co-infecting virions. The error-corrected frequencies of nonsynonymous mutations also drop substantially (to 43%–49% of their original values), whereas the frequencies of synonymous mutations drop only slightly (to 85%–95% of their original values). These trends are consistent with the fact that nonsynonymous mutations are often deleterious, whereas synonymous mutations often (although certainly not always, see Zanini and Neher, 2013) have only mild effects on viral growth. Figure 3A only summarizes one aspect of the deep mutational

scanning data, but Supplementary files 1 and 2 contain detailed plots showing all aspects of the data (read depth, per-site mutation rate, etc.) as generated by the dms_tools2 software (Bloom, 2015, https://jbloomlab.github.io/dms_tools2/).

We used the deep mutational scanning data to estimate the preference of each site in Env for each amino acid via the analysis method described in Bloom (2015). As graphically illustrated in Figure 2A, the preferences for each site are normalized to sum to one. Our libraries were mutagenized at 670 sites in BG505 and 662 sites in BF520, so 670 × 20 = 13,400 and 662 × 20 = 13,240 preferences were estimated for each Env, respectively. The correlations between the preferences from different experimental replicates are in Figure 3B, and the preferences themselves are in Figure 3–source data 1. These replicate-to-replicate correlations are substantially higher than those for the deep mutational scanning of LAI Env by Haddox et al. (2016), which had replicate-to-replicate Pearson correlations of only R = 0.45 to 0.50.

While the measurements are well correlated across all replicates for both BG505 and BF520, the replicates for BG505 are more correlated with each other than with replicates for BF520, and vice versa (Figure 3B, compare red and blue versus gray plots). This fact hints that there are some shifts in amino-acid preferences between the two Envs, something that is investigated with more statistical rigor later in this paper. Note also that there is a trend for highly preferred amino acids to be more strongly preferred in BG505 than BF520 (most high-preference points in the gray plots in Figure 3B fall above the diagonal); however, this trend does not necessarily reflect differences between the Envs. Rather, there were modest differences in the stringency of selection between our deep mutational scans of BG505 and BF520 (Figure 3–source data 2 shows that purifying selection better purged stop codons in BG505). In the next section, we correct for these experimental differences by calibrating each dataset to match the stringency of selection in nature.

[Figure 3 image. Panel B, Pearson R between replicate preference measurements (both axes run from 0 to 1):
           BF520-1  BF520-2  BF520-3  BG505-1  BG505-2
BF520-2    0.59
BF520-3    0.64     0.60
BG505-1    0.57     0.58     0.57
BG505-2    0.59     0.59     0.59     0.76
BG505-3    0.59     0.59     0.59     0.77     0.78]

Figure 3. The deep mutational scanning selects for functional Envs and yields measurements that are well correlated among replicates. (A) The average per-codon mutation frequency when sequencing plasmids encoding wildtype Env (DNA), plasmid mutant libraries (mutDNA), mutant viruses after the final infection (mutvirus), and virus generated from wildtype plasmids (virus). Mutations are categorized as nonsynonymous, synonymous, or stop codon. The DNA samples show that sequencing errors are rare, and the virus samples show that viral-replication errors are well below the frequency of mutations in the mutDNA samples. Comparing the mutvirus to mutDNA shows clear purifying selection against stop codons and some nonsynonymous mutations, particularly after subtracting the background error rates given by the virus and DNA samples (Figure 3–source data 2). More extensive plots from the analysis of the deep sequencing data are in Supplementary files 1 and 2. (B) Correlations between replicates in the measured preferences of each site in Env for all 20 amino acids. Blue indicates replicate measurements on BF520, red indicates replicate measurements on BG505, and gray indicates across-Env measurements of BF520 versus BG505. R is the Pearson correlation coefficient. The numerical values for the preferences are in Figure 3–source data 1. Figure 3–Figure supplement 1 shows the correlations using contour rather than scatter plots.
Figure 3–Figure supplement 1. Correlations plotted on a contour rather than a scatter plot.
Figure 3–source data 1. Preferences for each replicate and averages are in all_prefs_unscaled.zip.
Figure 3–source data 2. Average frequencies of nonsynonymous, synonymous, and stop-codon mutations as plotted in Figure 3 are in avgmutfreqs.csv. There is only one DNA sample for BF520, which is listed three times, once with each BF520 replicate. We calculate the error-corrected pre-selection mutation frequency as the mutDNA frequency minus the DNA frequency, and the error-corrected post-selection mutation frequency as the mutvirus frequency minus the virus frequency. We use these error-corrected frequencies to calculate the percent of mutations remaining after selection.
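The error correction used for the mutation frequencies (Figure 3–source data 2) subtracts the matched wildtype-control frequency from each library frequency, and the percent remaining is the ratio of the corrected post-selection to pre-selection values. The following is a minimal sketch of that arithmetic. The function names are hypothetical, the input frequencies are invented for illustration and are not the measured values, and flooring negative differences at zero is an added assumption rather than something stated in the text.

```python
def error_corrected(sample_freq, control_freq):
    """Subtract the wildtype-control frequency from a library frequency.

    Flooring at zero is an assumption of this sketch, used so that
    noise in the controls cannot produce a negative frequency.
    """
    return max(sample_freq - control_freq, 0.0)


def percent_remaining(mutdna, dna, mutvirus, virus):
    """Percent of mutations remaining after selection.

    pre-selection  = mutDNA   - DNA   (plasmid library minus wildtype plasmid)
    post-selection = mutvirus - virus (mutant virus minus wildtype virus)
    """
    pre = error_corrected(mutdna, dna)
    post = error_corrected(mutvirus, virus)
    if pre <= 0.0:
        raise ValueError("pre-selection frequency does not exceed background")
    return 100.0 * post / pre


# Hypothetical per-codon frequencies, ordered (mutDNA, DNA, mutvirus, virus).
illustrative = {
    "nonsynonymous": (1.1e-3, 1.0e-4, 6.0e-4, 1.0e-4),
    "synonymous": (6.0e-5, 1.0e-5, 5.5e-5, 1.0e-5),
    "stop": (5.0e-5, 1.0e-6, 6.0e-6, 2.0e-6),
}
for category, freqs in illustrative.items():
    print(category, round(percent_remaining(*freqs), 1))
```

Applied to the values in avgmutfreqs.csv, this calculation would give the percentages of mutations remaining that are quoted in the text for each mutation class.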

Amino-acid preferences of the Envs and their relationship to HIV evolution

The most immediate question is how authentically the experimental measurements describe the actual selection on Env function in nature. Direct comparisons between experimentally measured amino-acid preferences and amino-acid frequencies in natural sequences are confounded by the fact that the natural sequences are evolutionarily related. This problem can be overcome by making the comparison in a phylogenetic context to account for the evolutionary relationships among sequences. Specifically, we used our deep mutational scanning data to construct experimentally informed codon models (ExpCMs) for Env’s evolution.

An ExpCM is a phylogenetic substitution model that incorporates the functional constraints measured in a deep mutational scanning experiment (Hilton et al., 2017). If the experiment captures much of the actual evolutionary constraint on a gene, then an ExpCM will describe the gene’s natural evolution better than a standard phylogenetic codon substitution model. The reason is that standard codon substitution models (such as the commonly used Goldman-Yang style models; Yang et al., 2000) only model functional constraint via a single parameter that represents the rate of fixation of nonsynonymous protein-altering mutations relative to synonymous ones; this parameter is called dN/dS or ω. In contrast, an ExpCM accounts for the preference of each site for each of the 20 amino acids under the functional selection in the deep mutational scan, and then additionally adds an ω parameter that represents the relative rate of nonsynonymous to synonymous substitutions after accounting for these functional constraints (Bloom, 2017; Hilton et al., 2017). Importantly, since we expect some sites in Env to be under diversifying selection from immunity, we extended the ExpCMs described in Hilton et al. (2017) to draw ω from a gamma distribution, as is commonly done for codon-substitution models (Yang et al., 2000).

Table 1 shows that ExpCMs informed by the deep mutational scanning of either BG505 or BF520 describe the natural evolution of Env vastly better than a standard codon substitution model. In addition to the improved fit of the ExpCMs, we can also interpret the ω parameter. Recall that for standard codon substitution models, ω is simply the rate of fixation of nonsynonymous mutations relative to synonymous ones. For such models, the gene-wide average ω is almost always <1, since purifying selection purges many functionally deleterious amino-acid mutations even for adaptively evolving proteins (Murrell et al., 2015). Indeed, Table 1 shows that Env’s gene-wide average ω is <1 for a standard model. But for ExpCMs, ω is the relative rate of nonsynonymous to synonymous substitutions after accounting for functional constraints measured in the deep mutational scanning (Bloom, 2017). For the ExpCMs, the gene-wide average ω is >1 (Table 1), indicating that external selection (e.g., from immunity) drives Env to fix amino-acid mutations faster than expected under a null model that only accounts for functional constraints on the protein.

ExpCMs also have a stringency parameter that relates selection in the experiments to that in nature. Essentially, this parameter indicates how strongly natural selection prefers the amino acids that are preferred in the deep mutational scanning (Hilton et al., 2017). A stringency parameter >1 indicates that natural selection prefers the same amino acids as the experiments, but with greater stringency. Both ExpCMs have stringency parameters >1 (Table 1). This finding makes sense, since the stop-codon analysis in the previous section suggests that the experimental selections are more lax than natural selection on HIV. For the rest of the paper, we use the experimentally measured preferences re-scaled by the stringency parameters in Table 1. We do this to distinguish genuine differences between the two Envs from mere variation in the strength of selection between the two sets of experiments. Re-scaling both sets of preferences to optimally describe Env evolution in nature is a principled way to standardize the measurements; see Hilton et al. (2017) and the Methods section entitled “Re-scaling the preferences” for a more detailed explanation.

A qualitative way to assess if the deep mutational scanning authentically describes selection on Env function is to visually compare the measurements with existing knowledge. Figure 4 and Figure 5 show the re-scaled across-replicate average of the amino-acid preferences for each Env. At sites of known functional importance, these preferences are usually consistent with prior knowledge. For instance, residues T257, D368, E370, W427, and D457 are important for Env binding to CD4 (Olshevsky et al., 1990), and all these amino acids are highly preferred in our deep mutational scanning (Figure 4 and Figure 5). Likewise, Env has 10 disulfide bonds (linking sites 54-74, 119-205, 126-196, 131-157, 218-247, 228-239, 296-331, 378-445, 385-418, and 598-604), most of which are important for function (van Anken et al., 2008), and the cysteines at these sites are highly preferred in our deep mutational scanning.

Model            ΔAIC    LogLikelihood  nParams  stringency  ω̄    ω_α  ω_β  nsites ω_r > 1  nsites ω_r < 1
ExpCM BF520      0.0     -35218.8       7        2.8         1.4  1.0  0.7  66              35
ExpCM BG505      269.0   -35353.3       7        2.1         1.3  0.9  0.7  65              53
Goldman-Yang M5  3455.1  -36941.4       12       nan         0.8  0.6  0.7  14              211

Table 1. Evolutionary models informed by the deep mutational scanning describe HIV’s evolution in nature much better than a standard substitution model. Shown are the results of maximum likelihood fitting of substitution models to the clade A phylogeny in Figure 1. Experimentally informed codon models (ExpCM, Hilton et al., 2017) utilizing the across-replicate average of the deep mutational scanning describe Env’s natural evolution far better than a standard codon substitution model (the M5 model of Yang et al., 2000) as judged by comparing the Akaike information criterion (ΔAIC, Posada and Buckley, 2004). Both ExpCMs have a stringency parameter >1. All models draw ω from a gamma distribution, and the table shows the mean (ω̄) and shape parameters (ω_α and ω_β) of this distribution. The last two columns show the number of sites evolving faster (ω_r > 1) or slower (ω_r < 1) than expected at a false discovery rate of 0.05, as determined using the approach in Bloom (2017) (see also the last section of the Results). Analyses were performed using phydms (Hilton et al., 2017, http://jbloomlab.github.io/phydms/). Table 1–source data 1 shows the results for additional substitution models.
Table 1–source data 1. Results for phylogenetic models where ω is not drawn from a gamma distribution or where the preferences are averaged across sites to eliminate the site specificity are in modelcomparison.md.
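The ΔAIC column of Table 1 can be checked from the reported log likelihoods and parameter counts using the standard definition AIC = 2k − 2 ln L. The following is a minimal sketch of that check. The function names are hypothetical, and the inputs are the rounded values printed in Table 1, so the results can differ from the table in the last digit.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


def delta_aic(models):
    """Return each model's AIC minus the AIC of the best (lowest-AIC) model.

    `models` maps a model name to (log_likelihood, n_params).
    """
    scores = {name: aic(lnl, k) for name, (lnl, k) in models.items()}
    best = min(scores.values())
    return {name: score - best for name, score in scores.items()}


# Rounded values as printed in Table 1: (log likelihood, number of parameters).
table1 = {
    "ExpCM BF520": (-35218.8, 7),
    "ExpCM BG505": (-35353.3, 7),
    "Goldman-Yang M5": (-36941.4, 12),
}
for name, delta in delta_aic(table1).items():
    print(name, round(delta, 1))
```

With these rounded inputs the Goldman-Yang M5 value comes out about 0.1 higher than the 3455.1 in the table, which is consistent with the table having been computed from unrounded log likelihoods.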
The deep mutational scanning is also consistent with prior knowledge about sites that are tolerant of mutations. For instance, Env has five variable loops that mostly evolve under weak constraint in nature (Starcich et al., 1986; Zolla-Pazner and Cardozo, 2010)—and most sites in these loops are mutationally tolerant in our deep mutational scanning (see sites indicated by gray overlay bars in Figure 4 and Figure 5, such as 132 to 195). It is beyond the scope of this paper to catalog associations between our measurements and all other prior mutational studies of Env, but the concordance of our findings with the above mutational studies, and the fact that our data improve phylogenetic models of Env’s natural evolution, suggest that our experiments do a reasonable job of authentically measuring functional selection on Env.
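The re-scaling of preferences by the stringency parameter used throughout this section can be sketched as follows. The sketch assumes that re-scaling raises each preference at a site to the power of the stringency parameter and then renormalizes the site to sum to one, which is our reading of how ExpCMs apply this parameter (Hilton et al., 2017). The function name and the three-letter toy site are hypothetical.

```python
def rescale_preferences(prefs, stringency):
    """Re-scale one site's amino-acid preferences by a stringency parameter.

    `prefs` maps amino acid -> preference, with values summing to one.
    Each preference is raised to the power `stringency` and the site is
    renormalized. A stringency above 1 sharpens the site toward its most
    preferred amino acids; a stringency below 1 flattens it.
    """
    if any(p < 0 for p in prefs.values()):
        raise ValueError("preferences must be non-negative")
    powered = {aa: p ** stringency for aa, p in prefs.items()}
    total = sum(powered.values())
    return {aa: value / total for aa, value in powered.items()}


# Toy site with three amino acids (a real site has 20 preferences).
site = {"A": 0.5, "C": 0.3, "D": 0.2}

# Stringency values reported in Table 1 for BF520 (2.8) and BG505 (2.1).
for label, beta in [("BF520", 2.8), ("BG505", 2.1)]:
    rescaled = rescale_preferences(site, beta)
    print(label, {aa: round(p, 3) for aa, p in rescaled.items()})
```

Because both stringency parameters in Table 1 exceed 1, this transformation makes the most preferred amino acid at each site more dominant while leaving the rank order of amino acids at the site unchanged.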


[Figure image: logo plot of per-site amino-acid preferences, beginning at site 31. Overlay bars indicate the region (R: gp120-other, gp120-variable, gp41) and ω_r (log10 P for ω_r < 1 or ω_r > 1).]
I M K N A H I L K L Y MF RS I I I V W F W G L I W W FV A E H D I M I KM Y E Q K Y K N K L F F F A H E Y P E G M L W L I C V V I R A E M A P E MS N Q C Y S N F Y I P E S A N E E L M H D Y W T C F K Y Y NA N C TM G M DK Q Y T TL W K A Q I C A D P Y G C E I Q C SF T M T M Y G A V M G D W K N M A G L S F R K R D I Q K M C S H H M S E N K C C V R I D L G V Y Y Q W S M Q Y W C M C G W VY G H Y V W C A Y A L F N V M Y FY L G I M F A I H Y F V N M N C S C C D E K I W I V I V T A H T R I CWT Y S Y P D A R F G N D C P H K

Y E D T P W E WF M N P AD I W W E G C C K D W F A E H I M F W W S L I D PW DC S Y H Y G L R D R S K A Y I W D V I G R N A N M G P N H C P Q F K H R I NC M GW Y 286 291 296 301 306 313 317a 322 327 332 337 342 347 352 357 362 367 R ωr DLEVTTHSFNCGGEFFYCNTSGLFNSTWI SNTSVQGSNSTGSNDS I TLPCR IKQI INMWQRIGQAMYAPPIQGVIRCVSNITGL I L DHEYMQEFWYEMM LPL R PD MS N W GWG M IDWSCQGT Y P T S FR E PVH T N Y L N A C L Y HT MYF WWLY GN DMH FS Y F A YM R E M D I Q A A Y D A W Y Y P N H NNR V YT F I V Y G E W W T Y G H I N Q Q S V M D N W I M D PQ K APS L TICD R G Q Q QGM N V H H S V I R G TN E I Q FG I FT E V M A D AE E Y M T T L EDI W NE W NT I S Q F W I H K F T L R FG Q T F S P V A G G V A NN I M Q R Q AS Y D L V W D NS M P PAVA TM Q Y F S N M L G K R T K R Y I T G N EE H P H H G G S W E GY F N I P N G F A T I Q T FW Y R Q S A SQ RW I IKQ RT K H S K L HT I Y R W L Q G Y S S C V N Y F K H R V FH N M K QP VY E Q K F A I I H H P M SF M S L A F A G E A F S S MYV HQ FL K T VV S F G R RV YI A H VP KS W K E K VP G K FT S H H T S C I T V I A KN N W G D C C L L KR P WRMV F R H SA N I F F HI E R FY E L DT HL VL SL S K V L NS T Q W PYC Q Q N FT R E L Q T T I H A G FIKL I W T R T N S MAG Y V P I S N Y A H S R SA CG V HM P A C A M V Y C K Q W S R Q R H R E KS A C Y V R IQ V M I H I M L KR N R D Y Y D M M M I V R V M K M IP D T M AM N R T A E ML LM Q P T W A L V G W LSH H L H A L K F R D L E H W A I DAYP K N L T V K E LVMHP GI D CG LS TV WMKHQE S I A Q M V L C S R L AL K DH I Q T T E A KQ I Q E L R KF N S T F F G N Y E AWHR N C W P Q G M P C DF H P V Q NC KM F S GV S H S W D W VD N P M QN E M G N A C D A A R M S G DL W A C K G V D M E F MV V R G M F F Y HI K LAN A L A CR R H MW K A E M H Q N T Q T E NW A V T Y R D M G R Y S N K M Q C Q Q Q K F Q M N L E T R W F N Q H FG CL C H T A M S Y E M C D I H YS C T F C K W W S M C G NE F P F F M A I K T W P Y Q H F S E A W Y L V Y G T S Q C CT T D E N F C L C CD M Q G E D KI D C Y F V Q H H V L C W L QM F YQ A HI L H S L L A M L D M M Y I A M M I E A N 
QT E C P M E KQ K G W NI PCF M V S W K V H S K A Y K H K W S D Y Q C V N R D N C D F V F V EY M D TH I M D E G L C I N L A R Y T T F N E M R M T F D I F K H GA W V G K F D WV W L F F C R W Y D G V S N C I I L W G Y K I N M D V E H C E Y E CG I E E L A I R TW R W A I R L S

A Q H CCFHF KCQKFSY C 372 377 382 387 392 397 403 408 413 418 423 428 433 438 443 448 R ωr ILTRDGGSTNSTTETFRPGGGDMRDNWRSELYKYKVVKIEPLGVAPTRAKRRVVGREKRAVG I GAVFLGFLGAAGSTMGAASMT GEAA P V N S N D Q Q QHPDY Y Q LV P S Q Y PG I T R L L V R W AH A H A A F S PY KKC S M Y Y R V R N G AN N C S QP TY S GQGG R S K F TC YP S A L V AA A V P V K E E G ER T D W A L EK C SF V L R MT N R N A TF MA M V G R L LD Q D A K KY V S H K K I Q M VA V Q G R H E Y T L A T A D Q PS C V MN F F C T Q M Y D T M F F D I GA PT NM Q VT G T W D V I W M A A I V L Y C A SI K A P R Q G P IH SA H T Q V E E K V TV T L I V T L W L A Q I A YS T GF G A V Y V MTY I S AMP H AI N S S Q HR H K H P D M P Q S W I S A A T I SR G A V F Y Q A L L SM E M I L I FC N T V D F E N Q H N E P S E WM K FC WSG E Y G R C T W M T C H IQ S FV V VGT W LS I M A L LE I T C V S L A D IF H A F H SM SV W MW T RL C T T W GYNL N I T M D KP S S M M F G F NY N Q V Q ID T G H V R S T F D A H M W I H C NW C L TL T C DG T I QM L W G N R A F L G S G H D E G L V E T V S L QD V T C G I Q H L W S A G C AI N WF A S I A M I E Q NH F A W N A S Y E C L M I K L T M T E V D W F A S T M L L E L A M L S W F L L V T S F M I A HGI M F T V G T K V S G F A A Y V V E DN Q T G Q F I R Q Y R E G G V W Q CK E IE M W V T L N L T S D I R K V I T M R L M F G Y Y I F S G T H CM H N T W Y H M V M L M C M W Y L L M N ML IF SI M GV FM Y M T A M Y K Y C N E H L E A W H W L GV Q F M Y N H Q W F W R Q Y W M E Y N L AL R Q Q C T Q S M E H H HQG S V A S S L C C I H H Y H F C F N K KQ K E K Q M F K H Y V V Q I Y D G S V W S M W E F W V E I F C Y F Y L V CC A I T A M I A H C E Y F PW W D N N M A C FC L K S M C T W D T F Q L H E Y F K F W V F H W L P H M C I E WP IS D C C Y G W L I I W W D L I L R K F HC W D I Y F C G R K I Q MN N M TM 453 458 463 468 473 478 483 488 493 498 503 508 513 518 523 528 533 R ωr LTVQARNLLSG I VQQQSNLLRA I EAQQHLLKLTVWGI KQLQARVLAVERYLRDQQLLG IWGCSGKL I CTTNVPWNSSWSNRNLS M Q A F R HKH E R C E M I T A R W Y G T CG C K VM N A R R I S K S SYRA H C Q 
TR M R C PV I L W W G S T V G E A Q H H TV A C P D K R N L W S S T W L SK N AF R M R PQ L K M Q F V K V I A Q F R M L S I S V Y K P R L H A Q F E N Q LY M I A Y F Y V K Q L N L WH I LR S L Q LA N T E VMLT TTT I A Q M S D YN L L Q F S A V H C E N A T Q H E G TA I M W H L Q I Q T F I Q C T A Q SK W A R R T C Q G M Q P I Q S K S K M V R V P E S H T S L YQ WN D L F Y S W M SW L D V H T K Y H H I T R C H G A Y M Q V R H M M T W C PD L Y M S M I A G C A L G Q T H V N E S T F H S F S I L L H Y LH A S T V K M T L N L S D S L RT M Y E I Q C M K L D V FF H M LR FV H C M L M H L A A M SQ D S N K K G T M VR II N F H E M I Q Q C MY C T M WC P M T R HC LE R W T W AVFY HW I Q M GN C Y S T G MQ W S YY S I A S T IS L T Q C R VI V I M Y T S G F N Q QV C V T CT L F VA AY TGW F L R V N K A Y Q A H H W I S Y S V Q T C F A D Q Q A L W N N N Y S I L QV F W VY Q H GAV R G E M T Q NF T D A I Q E N N R Y T LH I T LT A W K M W I R N M M C T ED CV D N K RT N F H LD W M M VL S H R M N W V M E D T M L S D G E C Q K VT W Q L M DLV R M E P W F H H Y I G G P G H Q M V G F CY I K F F H A R VK T Q P H E L C D A D C I Y C I K E C E K A G K WH R Y C G T Q L K E C Q Q W Y YDG V M C W K A M F N H A T F V LW V V V F C D T L S E S W M I N F Y Q C F C C I F F P C F G T Y K F I V E K C F KPV K N C Y A H I L F N YD M A H M S A A E N F Y N Y V S G F Q Y DL K VI F V D V A I KY D K L E W I G A K L A V M F V E H L Q N W P C G I F L E T H N LW R T C F D C F Q E S Q W CA K Q K V G Y I Y R K S E A H W D F R G H N C M C F K G V I G C E D N M Y M W S F I F N S YR R D P V E W D P A F RS D N E D H A F K WN TA I K T

R C L L W W M 538 543 548 553 558 563 568 573 578 583 588 593 598 603 608 613 618 R ωr EIWDNMTWLQWDKEISNYTQI IYGLLEESQNQQEKNEQDLLALDKWASLWNWFDI SNWLWYIKIFIMIVGGL IGLRIVFAVL Q NR G TWH CDN GE W E T N F TSV TCWRVPMCIF HVT M G QG W K F IM D M Y S L R M T G YWS RA GIEM C A L N I I G FATEI N K D CC FNH V SK W L L TW FVN A N Y L V I WC W S Q N A LS A V R W CF K S L E L F E D T H FL G E QNW F Q G P I V ST A R Q A H LG E Q FV F L F S N Q Y Y S G L T W W F G W Q F MM E D T C VY L C W C S H IA I H R RT S T F F M S D M L SY I W YMGYAI A SK I A S C P FI I T Y W F D M E S M F K M Y Y Y I C Q C YN MC M F Y A D D Y RD S T A Y Q LC HG S R R R K L F M R D F GTF N NII WR H SF N V M L CV VGPL V F TK C H A E L Y M H T W FH L A K R H M Y I Q Q AA Y P F SN S V F T H W T S Q G Q F C I T HS IW V S T A TA Y MQT I S S S N LD SN C S K M V W ML DSRF L M Q DV E Q T Q L I AD E I D I L S T L Q H M W E F Y I S G Y Q W A NH Y V F WM DS ASMA A T V M HE Q T M T AGVA P TL E I L E NDQ WHL Q EE Y RKMI T G S I H G NK F E W M L YD E C I I WA K K W Q G W K C DC T W NM S VN N H SV G H I DM VL Q L G A K W I A L F YH QE L S LQ S I L M Y AVYE CL N SI T S F NV M HL SV L W Y M V K Y E TLI I V R A IC D N FL H L I Q R Q KH H E I P F H TH P W G W P L M H H P CY MN V F K N CY F H LTKW F C K YEY MD P WQ H VQ YC S VW D V QV V M MD A TW V H I P A C K T H N F N I F K A Y I A Q M Q Q L N K V K H M C W D E M C E I E D H S QW F Q Y P T I K Q I M C M M L M G S M W W Q V V D D V S A L S Y S L V E R S N C H L H W D Q KM S L P C R Q Q D R G Q H E L R E M W C M CR C A E K V P A R Y M C W TM Q HL R G L A K L V G N K K C Q W VA W C F N C N Q W S T T M R N KR D A Y A L G V C C D D HA SW T V C C L S H E D K CY A Y D S Y Y M H I A Y M V A M M S D H M I I D Y Q N R NQ M V W W M C E F W A VG E R E Y L S K K H Y GC C PL F F C L M T KN Q I CT Q M S I YT F DH T D I T T N A T E W H F Y N AL R A I R A C N T M V H A H Q T S V R V Q F Y CA G C N E M D N S G I R F G E E W V A A G I PT L N C Y V C G R A A Q Y E N H E R R YW H H W G P C I K Y 
E L AN C Y MIQ M V P N V I F V W K

D I E N D W R P 623 628 633 638 643 648 653 658 663 668 673 678 683 688 693 698

Figure 4. Amino-acid preferences for the BG505 Env. At each site, the height of each letter is proportional to that site’s preference for that amino acid. The top color bar indicates the region of Env (gp120 variable loop, gp120 not variable loop, or gp41). The lower color bar indicates the evidence that the site evolves faster (ωr > 1) or slower (ωr < 1) than expected given the experiments (see the last section of the Results and Bloom, 2017). We report the P-value for ωr ≠ 1 rather than the value of ωr itself since point estimates of ωr are unreliable for individual sites due to low numbers of observations, making the P-value a better indicator of the strength of the statistical evidence for faster or slower than expected evolution (Kosakovsky Pond and Frost, 2005; Murrell et al., 2012). The letters above the logos indicate the wildtype amino acid in BG505. Sites are numbered using the HXB2 scheme (Korber et al., 1998). This logo plot shows the site-specific amino-acid preferences for BG505 after averaging the replicates and re-scaling by the stringency parameter in Table 1. The figure was generated using dms_tools2 (Bloom, 2015, https://jbloomlab.github.io/dms_tools2/), which in turn utilizes weblogo (Crooks et al., 2004). The numerical values of the preferences are in Figure 4–source data 1, the mapping from sequential to HXB2 numbering is in Figure 4–source data 2, and the ωr values are in Figure 4–source data 3.

Figure 4–source data 1. The numerical values of the amino-acid preferences plotted in this figure are in rescaled_BG505_prefs.csv.

Figure 4–source data 2. The sequence of BG505 Env and mapping from sequential (original column) to HXB2 numbering (new column) is in BG505_to_HXB2.csv.

Figure 4–source data 3. The ωr values and associated P-values for BG505 in HXB2 numbering are in BG505_omegabysite.tsv.
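The re-scaling by a stringency parameter mentioned in the caption can be illustrated with a minimal sketch. It assumes the common convention in which each preference at a site is raised to the power of the stringency parameter and the site is then renormalized to sum to one; that convention, the function name `rescale_prefs`, and the toy values (including the parameter value 2.0, which is not taken from Table 1) are illustrative assumptions rather than details stated in this section.

```python
import numpy as np

def rescale_prefs(prefs, beta):
    """Raise preferences to the power beta and renormalize each site.

    Hypothetical helper: assumes the re-scaling is pi**beta / sum(pi**beta),
    applied independently to each site (last axis).
    """
    prefs = np.asarray(prefs, dtype=float)
    scaled = prefs ** beta
    return scaled / scaled.sum(axis=-1, keepdims=True)

# Toy site with three amino acids (real sites have 20).
site = np.array([0.7, 0.2, 0.1])
rescaled = rescale_prefs(site, beta=2.0)  # beta > 1 sharpens the preferences
print(rescaled)
```

Under this assumed convention, a parameter above 1 exaggerates the preference for the favored amino acid, a parameter below 1 flattens the distribution, and a value of 1 leaves the preferences unchanged.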


[Logo plot of the BF520 Env amino-acid preferences; image not reproduced. Legend: the region bar (R) marks gp120-variable, gp120-other, or gp41; the ωr bar shows log10 P for ωr < 1 (left) or ωr > 1 (right) on a scale from -4 to 0.]

Figure 5. Amino-acid preferences for the BF520 Env. This figure is the same as Figure 4 except that it shows the data for BF520 instead of BG505. The numerical values of the preferences are in Figure 5–source data 1, the mapping from sequential to HXB2 numbering is in Figure 5–source data 2, and the ωr values are in Figure 5–source data 3.

Figure 5–source data 1. The numerical values of the amino-acid preferences plotted in this figure are in rescaled_BF520_prefs.csv.

Figure 5–source data 2. The sequence of BF520 Env and mapping from sequential (original column) to HXB2 numbering (new column) is in BF520_to_HXB2.csv.

Figure 5–source data 3. The ωr values and associated P-values for BF520 in HXB2 numbering are in BF520_omegabysite.tsv.

Shifts in amino-acid preferences between BG505 and BF520

The most fundamental question that we seek to address is how similar the amino-acid preferences are between the two Envs. We have already noted that Figure 3B shows that the preferences are more correlated for replicate measurements on the same Env than for replicate measurements on different Envs. However, simply comparing correlation coefficients does not identify specific sites where mutational effects have shifted, nor does it quantify the magnitude of any shifts. We therefore used a more rigorous approach to identify sites where the amino-acid preferences differ between BG505 and BF520 by an amount that exceeds the noise in our experiments. We first re-scaled the preferences from each experimental replicate by the stringency parameter for


[Figure 6 image not reproduced; the recoverable panel contents follow.]

Panel A. Example calculation at four sites (three replicates per Env), where RMSDbetween - RMSDwithin = RMSDcorrected:

site   RMSDbetween   RMSDwithin   RMSDcorrected
598    0.04          0.03         0.01
528    0.35          0.19         0.16
288    0.73          0.27         0.46
512    0.70          0.18         0.52

Panel B. Distributions of the shift in preferences (RMSDcorrected; x-axis from -0.2 to 1.0), from top to bottom: BG505 vs HA, BF520 vs HA, randomized BG505 vs BF520, and BG505 vs BF520. Shifts ≥ 0.22 are significant at an FDR of 0.1.

Panel C. Sites with significant shifts, sorted by shift, with the wildtype amino acid in each Env:

site   shift   BG505   BF520
512    0.52    A       A
516    0.50    G       G
599    0.47    S       S
288    0.46    F       L
165    0.43    L       I
307    0.37    I       V
420    0.32    I       I
309    0.31    I       L
605    0.30    T       T
505    0.30    V       V
436    0.29    A       A
69     0.29    W       W
582    0.29    A       A
525    0.29    A       A
587    0.28    L       L
381    0.28    E       E
118    0.28    P       P
64     0.27    E       E
618    0.27    N       S
395    0.27    W       W
583    0.27    V       V
499    0.26    T       T
626    0.26    M       M
164    0.26    E       V
520    0.26    L       I
542    0.24    R       R
113    0.24    D       D
75     0.23    V       V
493    0.23    P       P
434    0.22    M       I

Figure 6. Env sites with shifted amino-acid preferences between BG505 and BF520. Note that the preferences have been re-scaled using the stringency parameters in Table 1 to enable direct comparison across Envs. (A) Calculation of the corrected distance between the amino-acid preferences of BG505 and BF520 at four example sites. We have triplicate measurements for each Env. We calculate the distance between each pair of replicate measurements, and group these into comparisons between the two Envs and comparisons between replicates of the same Env. We compute the root-mean-square distance (RMSD) for both sets of comparisons, which we denote as RMSDbetween and RMSDwithin. The latter quantity is a measure of experimental noise. The noise-corrected distance between Envs at a site, RMSDcorrected, is simply the distance between the two Envs minus this noise. (B) The bottom distribution (orange) shows the corrected distances between BG505 and BF520 at all alignable sites (see Figure 6–source data 1 for numerical values). The next distribution (blue) is a null generated by computing the corrected distances on all randomizations of the replicates among Envs. The top two distributions (green) compare Env to the non-homologous influenza hemagglutinin (HA) protein (Doud and Bloom, 2016), simply putting sites into correspondence based on sequence number. We compute the P-value that a site has shifted between BG505 and BF520 as the fraction of the null distribution that exceeds that shift, and identify significant shifts at a false discovery rate (FDR) of 0.1 using the method of Benjamini and Hochberg (1995). Using this approach, 30 of the 659 sites have significant shifts (corrected distance ≥ 0.22). (C) All sites that have significantly shifted their amino-acid preferences at an FDR of 0.1. For each site, the logo stacks show the across-replicate average preferences for BG505 and BF520. The wildtype amino acid for each Env is indicated by the small black letters above each logo plot; note that the wildtype amino acid is frequently but not always the most preferred one. The sites are sorted by the magnitude of the shift.
Figure 6–source data 1. The corrected distances between BG505 and BF520 at each site are in BG505_to_BF520_prefs_dist.csv.

that Env from Table 1 to calibrate all measurements to the stringency of natural selection. We then identified the 659 sites in the mutagenized regions of Env that are pairwise alignable between BG505 and BF520 (Figure 6-source data 1). For each site, we calculated the shift in amino-acid preferences between Envs using an approach similar to that of Doud et al. (2015) as illustrated in


Figure 6A. This approach calculates the magnitude of the shift after correcting for experimental noise by comparing the differences in preferences between replicates for BG505 and BF520 to the differences between replicates for the same Env. Figure 6A shows this calculation for a site that has not shifted (site 598, which strongly prefers cysteine in both Envs), the most shifted site (512, which shifts from being mutationally tolerant in BG505 to strongly preferring alanine in BF520), and two other sites with more intermediate behaviors. The overall distribution of shifts between BG505 and BF520 is shown in Figure 6B. Most sites have relatively small shifts (close to zero), although there is a long tail of sites with large shifts. This tail reaches its upper value with site 512, which has a shift of 0.52 out of a maximum possible of 1.0. How should we interpret this distribution—have mutational effects shifted a lot, or not very much? We can establish an upper-bound for how much sites might shift by comparing Env to a non-homologous protein. Figure 6B shows the distribution of shifts when comparing Env to influenza’s hemagglutinin protein, which has previously had its amino-acid preferences measured by deep mutational scanning (Doud and Bloom, 2016) . Most sites have large shifts between Env and hemagglutinin, with the typical shift being ∼0.4 and some approaching the maximum value of 1.0. We can also establish a lower-bound by creating a null distribution for the expected shifts if all differences are simply due to experimental noise. This null distribution is created by randomizing the experimental replicates among Envs. Figure 6B shows that the null distribution is more peaked at zero than the real distribution, and does not have the same prominent tail of sites with large shifts. 
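The noise-corrected shift and its randomization null can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used for the analysis: it assumes that the distance between two preference vectors is half the sum of their absolute differences (so that it ranges from 0 to 1), that re-scaling by the stringency parameter means raising each preference to that power and renormalizing, and that the null is built by pooling the replicates and enumerating every split into two equal groups. All function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def rescale(prefs, beta):
    """Raise each preference to the power beta and renormalize to sum to 1.

    Assumed form of the stringency re-scaling; beta would come from Table 1.
    """
    powered = [p ** beta for p in prefs]
    total = sum(powered)
    return [p / total for p in powered]


def distance(p, q):
    """Half the summed absolute difference between two preference vectors.

    Assumed metric: 0 for identical vectors, 1 for non-overlapping ones.
    """
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def rms(values):
    """Root-mean-square of a list of distances."""
    return sqrt(sum(v * v for v in values) / len(values))


def rmsd_corrected(reps_a, reps_b):
    """RMSD_between minus RMSD_within for one site.

    reps_a and reps_b are lists of replicate preference vectors, one list per Env.
    """
    between = [distance(a, b) for a in reps_a for b in reps_b]
    within = [distance(x, y)
              for reps in (reps_a, reps_b)
              for x, y in combinations(reps, 2)]
    return rms(between) - rms(within)


def null_shifts(reps_a, reps_b):
    """Corrected distances for every reassignment of the pooled replicates.

    This enumeration includes the true assignment and its mirror image;
    whether the original analysis excluded them is not stated in the text.
    """
    pooled = reps_a + reps_b
    n = len(reps_a)
    indices = range(len(pooled))
    shifts = []
    for group in combinations(indices, n):
        first = [pooled[i] for i in group]
        second = [pooled[i] for i in indices if i not in group]
        shifts.append(rmsd_corrected(first, second))
    return shifts
```

With three replicates per Env, each site contributes nine between-Env comparisons and six within-Env comparisons, and the pooled replicates can be split into two groups of three in 20 ways.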
The answer to the question of how much mutational effects have shifted is therefore nuanced: they have substantially shifted at some sites, but remain vastly more similar between the two Envs than between two unrelated proteins. We can use the null distribution to identify sites where the shifts between BG505 and BF520 are significantly larger than the noise in our experiments (Figure 6B). There are 30 such sites at a false discovery rate of 0.1. Figure 6C shows the amino-acid preferences of these significantly shifted sites for each Env. For the majority of shifted sites, one Env prefers a specific amino acid whereas the other Env tolerates many amino acids; for instance, see sites 512, 516, 599, 165, 605 and 505 in Figure 6C. Such broadening and narrowing of a site’s mutational tolerance is frequently linked to changes in protein stability, with a more stable protein typically being more mutationally tolerant (Wang et al., 2002; Bloom et al., 2006; Gong et al., 2013; Kumar et al., 2017). Work with engineered Env protein in the form of “SOSIP” trimers (Binley et al., 2000; Sanders et al., 2002) has shown that BG505 SOSIP is more thermostable than BF520 SOSIP (Verkerke et al., 2016). Consistent with this fact, sites with altered mutational tolerance are often (although not always, see sites 165 and 520 in Figure 6C) more mutationally tolerant in BG505. Differences in Env’s expression level might also contribute to a general broadening or narrowing of tolerance to subsequent mutations. The reason is that our experiments select for viral growth (which is affected by both Env function and expression), so it is possible that some of the shifts are due to epistatic mutational effects on expression rather than function. However, not all of the significantly shifted sites show a simple pattern of broadening or narrowing of mutational tolerance.
For instance, site 288 does not alter its mutational tolerance but instead flips its narrow amino-acid preference from phenylalanine in BG505 to leucine in BF520 (Figure 6C). Thus, there is variation in both the extent and types of shifts observed.
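The significance call described above (an empirical P-value from the null distribution, followed by Benjamini-Hochberg control of the false discovery rate) can be sketched as follows. This is an illustrative sketch with hypothetical function names; it treats a null value equal to the observed shift as exceeding it, which is an assumption and slightly conservative.

```python
def empirical_p(shift, null):
    """Fraction of null shifts that are at least as large as the observed shift."""
    return sum(1 for x in null if x >= shift) / len(null)


def benjamini_hochberg(pvals, fdr=0.1):
    """Flag P-values that are significant under the Benjamini-Hochberg procedure.

    Returns a list of booleans in the same order as pvals.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    cutoff_rank = 0
    # Step-up rule: find the largest rank k with p_(k) <= fdr * k / m.
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= fdr * rank / m:
            cutoff_rank = rank
    significant = [False] * m
    # Every P-value ranked at or below the cutoff is called significant.
    for i in order[:cutoff_rank]:
        significant[i] = True
    return significant
```

Because the procedure is step-up, a P-value that misses its own threshold is still called significant when a larger P-value further down the ranking meets its threshold.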

Structural and evolutionary properties of shifted sites What distinguishes the sites that have undergone significant shifts? First, we analyzed the distribution of shifted sites in the context of Env’s three-dimensional structure. Env’s structure is highly conformationally dynamic and undergoes large changes upon binding and membrane fusion. In an effort to account for these dynamics, we examined multiple conformational states of Env: the closed pre-fusion state (Stewart-Jones et al., 2016), the open CD4-bound state (Ozorowski et al., 2017), and the post-fusion six-helix bundle (Weissenhorn et al., 1997). Figure 7A shows the locations of the shifted sites on the crystal structure of Env in the closed pre-fusion state. There


[Figure 7 graphic. Panel A: structure of one Env monomer, shown in two views related by a 120º rotation. Panel B: relative solvent accessibility of significantly shifted versus not-shifted sites in the closed pre-fusion structure (5FYL; 28 vs 549 sites, P = 0.15) and the CD4-bound structure (5VN3; 27 vs 493 sites, P = 0.16). Panel C: distance (Å) to the closest shifted site for significantly shifted versus not-shifted sites in 5FYL (28 vs 549 sites, P = 0.07), 5VN3 (27 vs 493 sites, P = 0.03), and the minimum distance across 5FYL and 5VN3 (29 vs 559 sites, P = 0.01). Panel D: shift in preferences for sites with a substitution, sites that contact a substitution, and sites not near a substitution in 5FYL (83, 204, and 290 sites), 5VN3 (61, 149, and 310 sites), and the minimum distance across both structures (83, 223, and 282 sites); the P values in this panel range from 0.21 to 0.44.]

Figure 7. Characteristics of significantly shifted sites. (A) One monomer of the closed pre-fusion Env trimer (PDB 5FYL; Stewart-Jones et al., 2016) is colored from white to red according to the magnitude of the mutational shift at each site (red indicates large shift). Sites that are significantly shifted according to Figure 6B are in spheres, and all other sites are in cartoon representation. (B) There is no significant difference in the relative solvent accessibility of sites that have and have not undergone significant shifts. This observation holds for both the closed trimer conformation in (A) and the CD4-bound trimer conformation (PDB 5VN3; Ozorowski et al., 2017). The absolute solvent accessibility of each site was calculated using DSSP (Kabsch and Sander, 1983) and normalized to a relative solvent accessibility using the absolute accessibilities from Tien et al. (2013). (C) Sites of significant shifts are clustered in the structures of both the closed and open Env trimers. The left two plots show the distance of each significantly shifted and not-shifted site to the closest other shifted site in the indicated structure. The right-most plot shows the minimum distance across both conformations. The trend for shifts to cluster becomes stronger when considering the minimum distance, suggesting multiple conformations contribute to this trend. (D) Large mutational shifts are not strongly enriched at sites that have substituted between BG505 and BF520, or at sites that contact sites that have substituted. The plots show the magnitudes of the shifts among structurally resolved sites that have substituted between BG505 and BF520, the non-substituted sites that physically contact a substitution in the indicated structure(s) (any non-hydrogen atom within 3.5 Å), and all other sites. Figure 7–source data 1 shows that there is a borderline-significant tendency of significantly shifted sites to have substituted. All plots only show sites that are structurally resolved in the indicated structure(s). Structural distances and solvent accessibilities were calculated using all monomers in the trimer. P values were calculated using the Mann-Whitney U test. Figure 7–Figure Supplement 1 and Figure 7–Figure Supplement 2 zoom in on some relevant clusters of sites.
Figure 7–Figure supplement 1. Cluster of shifted sites in the post-fusion six-helix bundle of Env.
Figure 7–Figure supplement 2. Clusters of shifted sites in highly dynamic regions of Env.
Figure 7–source data 1. The sites of significant shifts in Figure 6B are somewhat more likely to have substituted between BG505 and BF520. This association is borderline statistically significant, with P = 0.055 using a Fisher’s exact test on the contingency table in shifts_vs_subs_table.csv.

is no visually obvious tendency for shifted sites to preferentially be on Env’s surface or in its core, and statistical analysis of both the closed and open states of Env (Figure 7B) finds no association between a site’s relative solvent accessibility and whether its amino-acid preferences have shifted. We did not attempt to analyze the association between solvent accessibility and shift for the post-fusion six-helix bundle because crystal structures of this conformation only contain ∼80 Env residues (Weissenhorn et al., 1997; Chan et al., 1997; Tan et al., 1997). However, Figure 7A does suggest that the sites of significant shifts tend to cluster in Env’s structure. A statistical analysis confirms that there is clustering of shifted sites for the closed and open conformations, with the effect being strongest when we define contacts based on the closest inter-residue distance across these two conformations (Figure 7C). Therefore, the factors


that drive shifts in Env’s mutational tolerance often affect physically interacting clusters of residues in a coordinated fashion. We also investigated clustering of shifted sites in the post-fusion six-helix bundle. Because structures of this conformation only resolve the coordinates of ∼80 residues, we did not perform a statistical analysis. However, a qualitative analysis revealed that three of the four shifted sites that are resolved in the post-fusion conformation cluster at one end of the helical bundle (Figure 7–Figure Supplement 1). An obvious hypothesis is that strongly shifted sites have substituted between BG505 and BF520, or physically contact such substitutions. According to this hypothesis, substitutions would alter the local physicochemical environment of the substituted site and its neighbors, thereby shifting the amino-acid preferences of sites in the physical cluster. But surprisingly, for both the closed and open conformations, the typical magnitude of shifts is not significantly larger at sites that have substituted, or at sites that contact sites that have experienced substitutions (Figure 7D). For the six-helix bundle, there are five structurally resolved substituted sites, one of which is adjacent to the cluster of shifted sites (Figure 7–Figure Supplement 1). The number of resolved shifted and substituted sites in this structure is too small for a meaningful statistical analysis of the type in Figure 7D. However, the cluster of shifted and substituted sites in the six-helix bundle is also present in the closed and open states (Figure 7–Figure Supplement 1), and so is included in the statistical analyses in Figure 7D. There is a borderline trend for the significantly shifted sites to be more likely to have substituted between BG505 and BF520 (Figure 7-source data 1), but most shifted sites have not substituted (only 8 of the 30 shifted sites differ in amino-acid identity between the two Envs). 
The lack of strong enrichment in shifts at substituted sites contrasts with previous protein-wide experimental (Doud et al., 2015) and simulation-based (Pollock et al., 2012; Shah et al., 2015) studies of shifting amino-acid preferences, which found that shifts were dramatically more pronounced at sites of substitutions. The difference may arise because these earlier studies examined proteins that are fairly conformationally static (absolutely so in the case of the simulations). The fact that Env is extremely complex and conformationally dynamic (Munro et al., 2014; Ozorowski et al., 2017) may increase the opportunities for long-range epistasis to enable substitutions at one site to shift the amino-acid preferences of distant sites. Indeed, many of the shifted sites cluster within regions of Env that are highly conformationally dynamic. Figure 7–Figure Supplement 2 shows the structural context of these clusters in finer detail. One cluster is at the trimer apex where two of Env’s variable loops pack against one another and against an adjacent protomer. These interactions are likely involved in regulating the transition between conformational states, and upon CD4 binding, these loops become highly disordered (Guttman et al., 2014; Ozorowski et al., 2017). Mutations at two of the shifted sites in this cluster (165 and 307) have been shown to cause Env to assume aberrant conformations, suggesting that these sites can strongly modulate Env’s dynamics (Lee et al., 2017). Strikingly, this cluster of shifted sites may reflect previously observed differences in the conformational dynamics of this region between these two Envs; the V2 region of the BF520 SOSIP trimer is more accessible to deuterium exchange than that of the BG505 SOSIP trimer (Verkerke et al., 2016). The other cluster of shifted sites is near a network of hydrophobic amino acids that has been proposed to help transmit the large-scale conformational change that takes place upon CD4 binding (Ozorowski et al., 2017).
One of the shifted sites (site 69) overlaps with this network, and mutations at another (site 64) have been shown to strongly modulate the relative stability of the open and closed conformations (de Taeye et al., 2015). In total, these two clusters consist of nearly half of the shifted sites (13 out of 30). One hypothesis for why so many shifted sites cluster in these regions is that their dynamic nature allows long-range epistatic interactions to be readily propagated between substituted sites and distant shifted sites. It is difficult to discern exactly how these interactions might occur, but there is certainly a trend for sites that are conformationally dynamic to also be sites that show shifts in their amino-acid preferences during evolution.
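The borderline association between shifted and substituted sites noted above rests on a Fisher’s exact test of a 2×2 contingency table. A minimal sketch of the two-sided test is given below, using only the Python standard library. The example table is an assumption reconstructed from counts quoted in the text (30 shifted sites, 8 of which have substituted, and 92 substituted sites among the 659 alignable sites); it may not match the table in shifts_vs_subs_table.csv, so the P-value it gives should not be read as a reproduction of the reported one. The function name is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same margins
    that are no more probable than the observed table.
    """
    row1 = a + b
    col1 = a + c
    col2 = b + d
    total = comb(col1 + col2, row1)

    def weight(x):
        # Unnormalized probability of a table whose top-left cell equals x.
        return comb(col1, x) * comb(col2, row1 - x)

    observed = weight(a)
    lowest = max(0, row1 - col2)
    highest = min(row1, col1)
    # Integer weights make the "no more probable than observed" test exact.
    extreme = sum(weight(x) for x in range(lowest, highest + 1)
                  if weight(x) <= observed)
    return extreme / total


# Assumed table (rows: shifted, not shifted; columns: substituted, not substituted).
p_value = fisher_exact_two_sided(8, 22, 84, 545)
```

Working with integer binomial coefficients avoids the floating-point ties that can otherwise affect which tables count as at least as extreme as the observed one.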


[Figure 8 graphic. Density plots of mutational effect (x-axis from -10 to 5) for four sets of mutations: BG505 all mutations, BG505→BF520, BF520 all mutations, and BF520→BG505.]

Figure 8. Entrenchment of substitutions during Env evolution. There are 12,521 possible amino-acid mutations at the 659 mutagenized sites alignable between BG505 and BF520. The blue densities show the effects of all these mutations to each Env. The orange densities show the effects of just the 92 mutations that convert BG505 to BF520 or vice versa. In the absence of entrenchment, mutating a site in BG505 to its identity in BF520 should have the opposite effect of mutating the site in BF520 to its identity in BG505. In this case, we would expect the BF520→BG505 distribution to be the mirror image of the BG505→BF520 distribution, and both distributions should be centered around zero if the two Envs are equivalently functional. Instead, mutating a site in either Env to its identity in the other Env tends to be deleterious, indicating that substitutions are often entrenched in the Env in which they have fixed. The effect of a mutation is quantified as the log of the ratio of the site’s preference for the mutant amino acid to the preference for the wildtype amino acid.

Entrenchment of substitutions modestly contributes to mutational shifts One idea that has recently gained support in the protein-evolution field is that substitutions become “entrenched” by subsequent evolution (Pollock et al., 2012; Shah et al., 2015; Starr et al., 2017). Entrenchment is the tendency of a mutational reversion to become increasingly unfavorable as a sequence evolves. Given two homologs, if there is no entrenchment then the effect of mutating a site in the first homolog to its identity in the second will simply be the opposite of mutating the site in the second homolog to its identity in the first. But if there is entrenchment, then both mutations will be unfavorable, since the site is entrenched at its preferred identity in each homolog. Figure 8 shows the distribution of effects for mutating all sites that differ between BG505 and BF520 to the identity in the other Env. As expected under entrenchment, the average effect of these mutations is deleterious—although there are a substantial number of sites where the mutational flips are not deleterious. We can get some sense of the magnitude of the entrenchment by comparing the effects of the BG505↔BF520 mutations to the distribution of effects of all possible amino-acid mutations (Figure 8). This comparison shows that even unfavorable inter-Env mutational flips are generally more favorable than random amino-acid mutations. Therefore, entrenchment occurs for some but not all substitutions that distinguish BG505 and BF520, and the magnitude of entrenchment is less than the effect of a typical random mutation. Entrenchment of substitutions therefore contributes to some of the mutational shifts. But given that many of these shifts occur at sites that do not even differ between the Envs (Figure 7D), entrenchment of substitutions is clearly not the only cause of the shifting amino-acid preferences.
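The entrenchment comparison can be illustrated with a short Python sketch that computes the effect of a mutation as the log of the ratio of the preference for the mutant amino acid to the preference for the wildtype amino acid, as in Figure 8. The base of the logarithm is not stated in the text, so it is left as a parameter (natural log by default, which is an assumption). The function names and the preference values are hypothetical and serve only to show the two scenarios.

```python
from math import e, log


def mutational_effect(prefs, wildtype, mutant, base=e):
    """Log ratio of the preference for the mutant to that for the wildtype.

    prefs maps amino-acid letters to preferences at one site of one Env.
    """
    return log(prefs[mutant] / prefs[wildtype], base)


def flip_effects(prefs_env1, prefs_env2, wt_env1, wt_env2, base=e):
    """Effects of the two reciprocal flips at a site that differs between Envs.

    Returns (effect of wt_env1 -> wt_env2 in Env 1,
             effect of wt_env2 -> wt_env1 in Env 2).
    """
    forward = mutational_effect(prefs_env1, wt_env1, wt_env2, base)
    reverse = mutational_effect(prefs_env2, wt_env2, wt_env1, base)
    return forward, reverse


# Hypothetical site with wildtype F in Env 1 and wildtype L in Env 2.
# No entrenchment: both Envs share the same preferences, so the effects mirror.
shared = {"F": 0.8, "L": 0.2}
mirror = flip_effects(shared, shared, "F", "L")

# Entrenchment: each Env prefers its own wildtype, so both flips are deleterious.
entrenched = flip_effects({"F": 0.9, "L": 0.1}, {"F": 0.1, "L": 0.9}, "F", "L")
```

In the first scenario the two effects sum to zero, whereas in the second both are negative, which is the signature of entrenchment described in the text.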


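[Figure 9 graphic. Panels A (BG505) and B (BF520): structure of one Env monomer, shown in views related by a 120º rotation. Panel C: scatter plot of log10 Q for the BG505 and BF520 experiments (R = 0.85; axes from -10 to 10, running from ωr < 1 to ωr > 1), with points marked according to whether the site is an N-glycan (True or False). Panel D: structural view, related by a 120º rotation, with labeled sites 58, 64, 66, 69, 70, 72, 73, 110, 111, 112, 113, 427, and 435.]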

Figure 9. Sites in Env that evolve faster or slower in nature than expected given the functional constraints measured in the lab. We calculated the statistical evidence that each site evolves faster (ωr > 1) or slower (ωr < 1) than expected given the experimentally measured amino-acid preferences using the method of Bloom (2017). (A) One monomer of the Env trimer (PDB 5FYL; Stewart-Jones et al., 2016) is colored from blue to white to red based on the strength of evidence that sites evolve slower than expected (blue), as expected (white) or faster than expected (red) given the BG505 experiments. Sites for which we lack ωr estimates are colored black. Sites where the rate of evolution is significantly different than expected at a false discovery rate of 0.05 are shown in spheres. (B) Like (A) but using the data from the BF520 experiments. For both Envs, sites that evolve significantly slower or faster than expected are often on Env’s surface (Figure 9–Figure Supplement 1). (C) The results are similar regardless of whether the BG505 or BF520 experiments are used. Many of the sites of slower-than-expected evolution are asparagines in N-linked glycosylation motifs (Figure 9–Figure Supplement 2). All sites that evolve slower than expected for both experimental datasets are in Figure 9–Figure Supplement 3. (D) A large cluster of sites that evolve slower than expected is likely involved in Env’s transition between open and closed conformational states. Gray boxes indicate sites that Ozorowski et al. (2017, PDB 5VN3) proposed form a hydrophobic network that regulates the conformational change; blue boxes and sticks indicate sites that evolve slower than expected. All analyses used the phylogenetic tree in Figure 1. The ωr and Q-values are in Figure 9–source data 1.
Figure 9–Figure supplement 1. Relative solvent accessibilities of sites evolving faster or slower than expected.
Figure 9–Figure supplement 2. Amino-acid preferences and alignment frequencies for glycosylation motifs.
Figure 9–Figure supplement 3. Amino-acid preferences and alignment frequencies of sites that evolve slower than expected.
Figure 9–source data 1. The ωr and Q-values are in merged_omegabysite.csv.

Comparing selection in the lab to natural selection Our experiments measure the effects of mutations on viral growth in a T-cell line in the lab. But HIV actually evolves in humans, where additional selection pressures on Env are undoubtedly present. For instance, antibody pressure might increase the rate of evolution at some sites (Albert et al., 1990; Wei et al., 2003; Richman et al., 2003), whereas pressure to mask certain epitopes (Kwong et al., 2002) might add constraint at other sites. Comparing selection in our experiments to natural selection can identify sites that are under such additional pressures during HIV’s actual evolution in humans. We determined whether each site in Env evolves faster or slower in nature than expected given


three models: that evolution is purely neutral (all nonsynonymous and synonymous mutations have equivalent effects), that sites are under the protein-level constraint measured in our experiments with BG505, or that sites are under the constraint measured with BF520. The first model used a standard dN/dS test (the “FEL” method of Kosakovsky Pond and Frost, 2005), whereas the other two models are conceptually similar but account for the experimentally measured amino-acid preferences as described by Bloom (2017). All three models test if individual sites evolve faster or slower than expected, but they “expect” different things: the dN/dS model expects nonsynonymous and synonymous mutations to fix at the same rate, while the ExpCMs expect the rate at a site to depend on the experimentally measured functional constraints. In all cases, the evidence that a site r evolves differently than expected is statistically summarized by the P-value that ωr is > or < 1. The standard dN/dS model finds hundreds of sites that evolve slower than expected under neutral evolution (Table 1, ωr < 1), and only a handful of sites that evolve faster than expected under neutral evolution (Table 1, ωr > 1). This finding is unsurprising, since it is well known that Env is under functional constraint. In contrast, ExpCMs that test the rates of evolution relative to the experimentally measured constraints find far fewer sites that evolve slower than expected, but many more sites that evolve faster (Table 1). The sites that evolve slower or faster than expected from the experiments are shown in Figure 9A,B, and overlaid on the logo plots in Figure 4 and Figure 5 as the ωr values. The identified sites are similar regardless of whether we use the experiments with BG505 or BF520 (Figure 9C).
The reason the results are similar for both experimental datasets is that (as discussed above) the amino-acid preferences of most sites are similar in both Envs, suggesting that either dataset provides a reasonable approximation of the site-specific functional constraints across the clade A Envs in Figure 1. What causes some sites to evolve faster or slower in nature than expected from the experiments? The answer in both cases is likely to be immune selection. Most of the sites of faster-than-expected evolution are on the surface of Env (Figure 9A,B and Figure 9–Figure Supplement 1). Env’s escape from autologous neutralizing antibodies often involves amino-acid substitutions in surface-exposed regions (Moore et al., 2009), including at many of the sites that evolve faster than expected. Since our deep mutational scanning did not impose antibody pressure, sites where substitutions are antibody-driven will evolve faster in nature than expected from the experiments. Interestingly, immune selection also offers a plausible explanation for the sites that evolve slower than expected. In addition to escaping immunity via substitutions at antibody-binding footprints, Env is notorious for employing a range of more general strategies to reduce its susceptibility to antibodies. These strategies include shielding immunogenic regions with glycans (Wei et al., 2003; Stewart-Jones et al., 2016; Gristick et al., 2016) or hiding them by adopting a closed protein conformation (Kwong et al., 2002; Guttman et al., 2015; Ozorowski et al., 2017). Sites that contribute to such general immune-evasion strategies will be under a constraint in nature that is not present in our experiments, and indeed, such sites evolve more slowly than expected from our experiments. For instance, we find very little selection to maintain most glycans in our cell-culture experiments.
Of the 21 N-linked glycosylation sites shared between BG505 and BF520, only four are under strong selection to maintain the glycan in our experiments, despite the fact that most are conserved in nature (Figure 9C and Figure 9–Figure Supplement 2). This finding concords with prior literature suggesting that these glycans are selected primarily for their role in immune evasion (Pugach et al., 2004; Wang et al., 2013; Rathore et al., 2017). Similarly, a network of sites that helps regulate Env’s transition between open and closed conformations that have different antibody susceptibilities (Figure 9D) also evolves slower in nature than expected from our experiments. Therefore, we can distinguish evolutionary patterns that are shaped by simple selection for Env function from those that are due to the additional complex pressures imposed during human infections.
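Identifying the N-linked glycosylation sites shared between two Envs can be sketched as below. The sketch assumes the usual sequon definition (asparagine, then any residue except proline, then serine or threonine), which the text does not spell out, and it assumes that the two sequences are supplied as rows of a pairwise alignment with "-" marking gaps. The function names and example sequences are hypothetical, and this is not the procedure used to obtain the count of 21 shared sites.

```python
import re

# Zero-width lookahead so that overlapping motifs (for example "NNST") are all found.
SEQUON = re.compile(r"(?=N[^P][ST])")


def sequon_positions(seq):
    """0-based indices of asparagines that start an N-X-S/T motif (X is not P)."""
    return [match.start() for match in SEQUON.finditer(seq)]


def sequon_columns(aligned):
    """Alignment columns of sequon asparagines in one gapped sequence."""
    columns = [i for i, char in enumerate(aligned) if char != "-"]
    ungapped = aligned.replace("-", "")
    return {columns[pos] for pos in sequon_positions(ungapped)}


def shared_sequons(aligned1, aligned2):
    """Sorted alignment columns where both sequences have a sequon asparagine."""
    return sorted(sequon_columns(aligned1) & sequon_columns(aligned2))
```

Motifs are searched in the ungapped sequence and then mapped back to alignment columns, so that a gap inside a motif in the alignment does not hide a sequon that is contiguous in the actual protein.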


Discussion We have experimentally measured the preference for each amino acid at each site in the ectodomain and transmembrane domain of two Envs under selection for viral growth in cell culture. These amino-acid preference maps are generally consistent with prior knowledge about sites that are important for protein properties such as receptor binding or disulfide-mediated stability. However, the main value of these maps comes not from comparing them with prior knowledge, but from the fact that such prior knowledge encompasses just a small fraction of the vast mutational space available to Env. Because Env evolves so rapidly, every study of this protein must be placed in an evolutionary context, and our comprehensive amino-acid preference maps potentially enable this in ways that prior piecemeal studies of mutations cannot. But these maps come with a potentially serious caveat: each one is measured for just a single Env variant. The major question that our study aimed to answer is whether the maps are still useful for evolutionary questions, or whether Env’s amino-acid preferences shift so rapidly that each map only applies to the specific HIV strain for which it was measured. This question is reminiscent of one that was grappled with in the early days of protein crystallography, when it first became possible to build maps of a protein’s structure. Because it was not (and is still not) possible to crystallize every variant of a protein, it was necessary to determine whether protein structures could be usefully generalized among homologs. Fortunately for the utility of structural biology, it soon became apparent that closely homologous proteins have similar structures (Chothia and Lesk, 1986; Sander and Schneider, 1991). 
This rough generalizability of protein structures holds even for a protein as conformationally complex as Env. Although there are many examples of mutations that alter aspects of Env’s conformation and dynamics (Kwong et al., 2000; White et al., 2010; Almond et al., 2010; Davenport et al., 2013), SOSIP trimer structures from diverse Env strains remain highly similar in most respects (Julien et al., 2015; Pugach et al., 2015; Stewart-Jones et al., 2016; Verkerke et al., 2016; Gristick et al., 2016). Our results show that amino-acid preference maps of Env also have a useful level of conservation for many purposes. From a qualitative perspective, the amino-acid preferences look mostly similar between BG505 and BF520, and so provide a valuable reference for estimating which mutations are likely to be tolerated at each site in diverse HIV strains. Indeed, we anticipate that the complete maps of mutational effects in Figure 4 and Figure 5 will be useful for future sequence-structure-function studies. From an analytical perspective, a powerful use of our maps is to identify sites that evolve differently in nature than is required by the simple selection for viral growth imposed in our experiments, and the identified sites are largely the same regardless of whether the analysis uses an amino-acid preference map from BG505 or BF520. Of course, from the perspective of protein evolution, the most interesting sites are the exceptions to the general conservation of amino-acid preferences. Consistent with studies of other proteins (Natarajan et al., 2013; Harms and Thornton, 2014; Doud et al., 2015; Starr et al., 2017), we find a subset of sites that change markedly in which mutations they tolerate. Some shifted sites simply accommodate more amino acids in the more stable BG505 Env, a type of shift that has been well-documented for other proteins (Wang et al., 2002; Bloom et al., 2006; Gong et al., 2013; Kumar et al., 2017).
But interestingly, there is no strong trend for shifts to be enhanced at sites that differ between BG505 and BF520. Recent studies of protein evolution have focused on the idea that substitutions become “entrenched” as sites shift to accommodate new amino acids (Pollock et al., 2012; Shah et al., 2015; Bazykin, 2015; Starr et al., 2017). Indeed, a prior protein-wide comparison of amino-acid preferences across homologs of influenza nucleoprotein found a significant enrichment of shifts at sites of substitutions (Doud et al., 2015). But although there is some entrenchment of differences between BG505 and BF520, this is not the major factor behind the shifts in amino-acid preferences: the majority of sites that have shifted between BG505 and BF520 actually have the same wildtype amino acid in both Envs even though the preferences have shifted. This rather surprising result might be due to Env’s exceptional conformational complexity: mutations can

17 of 30 Manuscript submitted to eLife

cause long-range alterations in Env’s conformation (Kwong et al., 2000; White et al., 2010; Almond et al., 2010; Davenport et al., 2013), so it seems plausible that they might also shift mutational tolerance at distant sites. Regardless of the exact mechanism, our large-scale datasets of mutational effects in multiple viral strains should be useful for efforts to computationally parameterize “fitness landscapes” of Env (Kouyos et al., 2012; Ferguson et al., 2013; Mann et al., 2014; Barton et al., 2015; Louie et al., 2018). Our experiments provide highly quantitative data on the mutational tolerance of Env under selection for viral growth in cell culture. These data are amenable to rigorous functional and evolutionary analyses. Here we have shown how these data can be compared between Envs to identify sites where mutational tolerance shifts with viral genotype, or between experiments and nature to identify sites under different pressure in the lab and in humans. Future experiments that modulate selection pressures in other relevant ways should provide further insight into the forces that drive and constrain HIV’s evolution.

Methods and Materials
Creation of codon-mutant libraries
Our codon-mutant libraries mutagenized all sites in env to all 64 codons, except that the signal peptide and cytoplasmic tail were not mutagenized. The rationale for excluding these regions is that they are not part of Env’s ectodomain, and are prone to mutations that strongly modulate Env’s expression level (Chakrabarti et al., 1989; Yuste et al., 2004; Li et al., 1994).
The codon-mutant libraries were generated using the approach originally described in Bloom (2014a), with the modification of Dingens et al. (2017) to ensure more uniform primer melting temperatures. The computer script used to design the mutagenesis primers (along with some detailed implementation notes) is at https://github.com/jbloomlab/CodonTilingPrimers. For BF520, the three libraries are the same ones described by Dingens et al. (2017). For BG505, we created three libraries for this study. The wildtype BG505 sequence used for these libraries is in Supplemental file 3. The BG505 mutagenesis primers are in Supplemental file 4. The end primers for the BG505 mutagenesis were: 5’-tgaaggcaaaactactggtccgtctcgagcagaagacagtggcaatgaga-3’ and 5’-gctacaaatgcatataacagcgtctcattctttccctaacctcaggcca-3’.
As with BF520, we cloned the BG505 env libraries into the env locus of the full-length proviral genome of HIV strain Q23 (another subtype-A transmitted/founder virus; Poss and Overbaugh, 1999) using the high-efficiency cloning vector described in Dingens et al. (2017). For this cloning, we digested the cloning vector with BsmBI, and then used PCR to elongate the amplicons to include 30 nucleotides at each end that were identical in sequence to the ends of the BsmBI-digested vector. The primers for this PCR were: 5’-agataggttaattgagagaataagagaaagagcagaagacagtggcaatgagagtgatgg-3’ and 5’-ctcctggtgctgctggaggggcacgtctcattctttccctaacctcaggccatcc-3’.
Next, we used NEBuilder HiFi DNA Assembly (NEB, E2621S) to clone the env amplicons into the BsmBI-digested plasmids. We purified the assembled products using Agencourt AMPure XP beads (Beckman Coulter, A63880) at a bead-to-sample ratio of 1.5, and then transformed the purified products into Stellar electrocompetent cells (Takara, 636765). The transformations yielded between 1.5 and 3.6 million unique clones for each of the three replicate libraries, as estimated by plating 1:2,000 dilutions of the transformations. We scraped the plated colonies and maxiprepped the plasmid DNA; unlike in Dingens et al. (2017), we did not include a 4-hour outgrowth step after the scraping step. For the wildtype controls, we maxiprepped three independent cultures of wildtype BG505 env cloned into the same Q23 proviral plasmid. See Figure 2–Figure Supplement 1 and Figure 3A for information on the average mutation rate in these libraries as estimated by Sanger sequencing and deep sequencing, respectively.

Generation and passaging of viruses
For BG505, we generated mutant virus libraries from the proviral plasmid libraries by transfecting 293T cells in three 6-well plates (so 18 wells total per library) with a per-well mixture of 2 μg plasmid DNA, 6 μl FuGENE 6 Transfection Reagent (Promega, E269A), and 100 μl DMEM. The 293T cells were seeded at 5 × 10^5 cells/well in D10 media (DMEM supplemented with 10% FBS, 1% 200 mM L-glutamine, and 1% of a solution of 10,000 units/mL penicillin and 10,000 μg/mL streptomycin) the day before transfection, such that they were approximately 50% confluent at the time of transfection. In parallel, we generated wildtype viruses by transfecting one 6-well plate of 293T cells with each wildtype plasmid replicate. At 2 days post-transfection, we harvested the transfection supernatant, passed it through a 0.2 μm filter to remove cells, treated the supernatant with DNAse to digest residual plasmid DNA as in Haddox et al. (2016), and froze aliquots at -80 °C. We thawed and titered aliquots using the TZM-bl assay in the presence of 10 μg/mL DEAE-dextran as described in Dingens et al. (2017).
We conducted the low MOI viral passage illustrated in Figure 2A in SupT1.CCR5 cells (obtained from Dr. James Hoxie; Boyd et al., 2015). The SupT1.CCR5 cells tested negative for mycoplasma. The SupT1.CCR5 cell line was previously created by engineering the parental SupT1 cell line to express CCR5 (Boyd et al., 2015). We used antibody staining followed by flow cytometry to validate that
our stock of SupT1.CCR5 cells expressed CCR5, CXCR4, and CD4. There is no validated STR profile for SupT1.CCR5 cells. However, we performed STR profiling on our stock of cells and compared the results to the ATCC SupT1 (ATCC #CRL-1942) reference profile. We found that 11 of 14 alleles plus both amelogenin alleles matched the reference, with no additional mismatched alleles in the SupT1.CCR5 profile. Given the known instability of lymphoma cell lines (Inoue et al., 2000), this level of identity suggests that the SupT1.CCR5 cells are indeed related to the parental SupT1 cells (Capes-Davis et al., 2013).
During this passage, cells were maintained in R10 media, which has the same composition as the D10 described above, except RPMI-1640 (GE Healthcare Life Sciences, SH30255.01) is used in the place of DMEM. In addition, the media contained 10 μg/mL DEAE-dextran to enhance viral infection. We infected cells with 4 million (for replicate 1) or 5 million (for replicates 2 and 3) TZM-bl infectious units of mutant virus at an MOI of 0.01, with cells at a starting concentration of 1 million cells/mL in vented tissue-culture flasks (Fisher Scientific, 14-826-80). At day 1 post-infection, we pelleted cells, aspirated the supernatant, and resuspended cell pellets in the same volume of fresh media still including the DEAE-dextran. At 2 days post-infection, we doubled the volume of each culture with fresh media still including DEAE-dextran. At 4 days post-infection, we pelleted cells, passed the viral supernatant through a 0.2 μm filter, concentrated the virus ~30-fold using ultracentrifugation as described in Dingens et al. (2017), and froze aliquots at -80 °C. In parallel, for each replicate, we also passaged 2 × 10^5 (for replicate 1) or 5 × 10^5 (for replicates 2 and 3) TZM-bl infectious units of wildtype virus using the same procedure.
To obtain final titers for our concentrated virus, we thawed one of the aliquots stored at -80 °C and titered using the TZM-bl assay in the presence of 10 μg/mL DEAE-dextran. For the final short-duration infection illustrated in Figure 2A, for each replicate we infected 10^6 TZM-bl infectious units into 10^6 SupT1.CCR5 cells in the presence of 100 μg/mL DEAE-dextran (note that this is a 10-fold higher concentration of DEAE-dextran than for the other steps, meaning that the effective MOI of infection is higher if DEAE-dextran has the expected effect of enhancing viral infection). Three hours post-infection, we pelleted the cells and resuspended them in fresh media without any DEAE-dextran. At 12 hours post-infection, we pelleted cells, washed them once with PBS, and then used a miniprep kit to harvest reverse-transcribed unintegrated viral DNA (Haddox et al., 2016).
The generation, passaging and deep sequencing of BF520 was done in a highly similar fashion, except that we only had a single replicate of the wildtype control. Note that the final passaged BF520 mutant libraries analyzed here actually correspond to the “no-antibody” controls described in Dingens et al. (2017), but that study did not analyze the initial plasmid mutant libraries relative to these passaged viruses, and so was not able to provide measurements of the amino-acid preferences.
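The scale of the low-MOI passage follows from simple arithmetic on the numbers given above. The sketch below works through it in Python. The function names are hypothetical, and the Poisson model of infection is an assumption made only for illustration (the text does not state it). The cell counts and volumes it computes are implied by the stated infectious units, MOI, and cell density, and are not values reported in the paper.

```python
import math


def cells_needed(infectious_units, moi):
    """Target cells required so that infectious_units / cells equals the MOI."""
    return infectious_units / moi


def culture_volume_ml(n_cells, cells_per_ml):
    """Culture volume (mL) that holds n_cells at the given density."""
    return n_cells / cells_per_ml


def fraction_multiply_infected(moi):
    """Among infected cells, the fraction hit by two or more virions.

    Assumption: virions land on cells independently (Poisson, mean = MOI).
    """
    p_zero = math.exp(-moi)
    p_one = moi * math.exp(-moi)
    return (1.0 - p_zero - p_one) / (1.0 - p_zero)


# Infectious units per replicate, MOI, and cell density are taken from the text.
MOI = 0.01
CELLS_PER_ML = 1e6
for replicate, units in [(1, 4e6), (2, 5e6), (3, 5e6)]:
    cells = cells_needed(units, MOI)
    volume = culture_volume_ml(cells, CELLS_PER_ML)
    print(f"replicate {replicate}: {cells:.2e} cells in about {volume:.0f} mL")

print(f"multiply infected fraction: {fraction_multiply_infected(MOI):.4f}")
```

Under the Poisson assumption, only about 0.5% of infected cells would receive more than one virion at an MOI of 0.01, which is why a low MOI helps keep each viral genome linked to the Env protein it encodes.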

Illumina deep sequencing
We deep sequenced all of the samples shown in Figure 3A: the plasmid mutant libraries and wildtype plasmid controls, and the cDNA from the final mutant viruses and wildtype virus controls. In order to increase the sequence accuracy, we used a barcoded-subamplicon sequencing strategy. This general strategy was originally applied in the context of deep mutational scanning by Wu et al. (2014), and the specific protocol used in our work is described in Doud and Bloom (2016) (see also https://jbloomlab.github.io/dms_tools2/bcsubamp.html). The primers used for BG505 are in Supplementary file 5. The primers used for BF520 are in Dingens et al. (2017).
The data generated by the Illumina deep sequencing are on the Sequence Read Archive under the accession numbers provided at the beginning of the Jupyter notebook in Supplementary files 1 and 2.

Analysis of deep-sequencing data
We analyzed the deep-sequencing data using the dms_tools2 software package (Bloom, 2015, https://jbloomlab.github.io/dms_tools2/, version 2.2.4). The algorithm that goes from the deep-sequencing counts to the amino-acid preferences is that described in Bloom (2015) (see also https://jbloomlab.github.io/dms_tools2/prefs.html).
A Jupyter notebook that performs the entire analysis including generation of most of the figures in this paper is in Supplementary file 1. An HTML rendering of this notebook is in Supplementary file 2. A repository containing all of this code is also available at https://github.com/jbloomlab/EnvMutationalShiftsPaper (Haddox et al., 2018, copy archived at https://github.com/elifesciences-publications/EnvMutationalShiftsPaper).
The Jupyter notebooks in Supplementary files 1 and 2 also contain numerous plots that summarize relevant aspects of the deep sequencing such as read depth, per-codon mutation frequency, mutation types, etc. Supplementary file 1 also contains text files and CSV files with the numerical values shown in these plots.
Citations are also owed to weblogo (Crooks et al., 2004, http://weblogo.threeplusone.com/) and ggseqlogo (Wagih, 2017, https://omarwagih.github.io/ggseqlogo/), which were used in the generation of the logoplots.

Alignments and phylogenetic analyses of Env sequences
A basic description of the process used to generate the clade A sequence alignment in Figure 1-source data 1, the alignment mask in Figure 1-source data 2, and the phylogenetic tree in Figure 1 is provided in the legend to that figure. An algorithmic description of how the alignment and tree were generated is in Supplementary files 1 and 2.
For fitting of the phylogenetic substitution models, we used phydms (Hilton et al., 2017, http://jbloomlab.github.io/phydms/, version 2.2.1) to optimize the substitution model parameters and branch lengths on the fixed tree topology in Figure 1. The Goldman-Yang (or YNGKP) model used in Table 1 is the M5 variant described by Yang et al. (2000), with the equilibrium codon frequencies determined empirically using the CF3x4 method (Pond et al., 2010). For the ExpCM shown in Table 1, we extended the models with empirical nucleotide frequencies described in Hilton et al. (2017) to also allow $\omega$ to be drawn from discrete gamma-distributed categories exactly as for the M5 model. These ExpCM with gamma-distributed $\omega$ were implemented in phydms using the equations provided by Yang (1994) (see also http://jbloomlab.github.io/phydms/implementation.html#models-with-a-gamma-distributed-model-parameter). The preferences were re-scaled by the stringency parameters in Table 1 as described in Hilton et al. (2017). For both the M5 model and the ExpCM with a gamma-distributed $\omega$, we used four categories for the discretized gamma distribution.
Table 1-source-data 1 shows the results for a wider set of models than those used in Table 1. These include the M0 model of Yang et al. (2000), ExpCM without a gamma-distributed $\omega$, and ExpCM in which the amino-acid preferences are averaged across sites as a control to ensure that the improved performance of these models is due to their site-specificity.
Note that for these Env alignments, using a gamma-distributed $\omega$ is very important in order for the ExpCMs to outperform the M5 model; we suspect this is because there are many sites of strong diversifying selection.
For detection of sites with faster or slower than expected evolution, we used the approach in Bloom (2017), which is exactly modeled on the FEL approach of Kosakovsky Pond and Frost (2005) but extended to ExpCM. This approach estimates a $P$-value that $\omega_r$ is not equal to one for each site $r$ using a likelihood-ratio test. The actual point estimates of $\omega_r$ are unreliable for individual sites due to the limited number of observations, so we report the $P$-value that $\omega_r$ is not equal to one, which is a better indication of the strength of the statistical evidence for faster or slower than expected evolution (Kosakovsky Pond and Frost, 2005; Murrell et al., 2012). For the Q-values and false discovery rate testing, we considered the tests for $\omega_r > 1$ and $\omega_r < 1$ separately. Supplementary files 1 and 2 contain the code that runs phydms to reproduce all of these analyses.
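As an illustration of the false discovery rate step, the sketch below computes Q-values with the Benjamini-Hochberg procedure (Benjamini and Hochberg, 1995), applied separately to sites whose point estimate suggests faster and slower than expected evolution. This is one plausible reading of the separate testing described above, not the phydms implementation. The function names and the example numbers are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg Q-values in the same order as pvalues."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so each Q-value is monotone in rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


def qvalues_by_direction(sites):
    """Correct sites with omega > 1 and omega < 1 as two separate families.

    sites: list of (site_label, omega_point_estimate, pvalue) tuples.
    Sites with omega exactly equal to one are left out of both families.
    """
    families = {
        "faster": [(s, p) for s, w, p in sites if w > 1],
        "slower": [(s, p) for s, w, p in sites if w < 1],
    }
    result = {}
    for label, family in families.items():
        qs = benjamini_hochberg([p for _, p in family])
        result[label] = {s: q for (s, _), q in zip(family, qs)}
    return result


# Made-up values for illustration only; these are not results from the paper.
example_sites = [("siteA", 2.0, 0.01), ("siteB", 0.5, 0.02), ("siteC", 3.0, 0.04)]
print(qvalues_by_direction(example_sites))
```

Treating the two directions as separate families means that a large number of slowly evolving sites does not dilute the evidence for the (typically fewer) sites of diversifying selection, and vice versa.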

Re-scaling the preferences
The amino-acid preferences that are directly extracted from the deep sequencing data essentially give the enrichment or depletion of each mutation, normalized to sum to one at each site (Bloom, 2015, https://jbloomlab.github.io/dms_tools2/prefs.html). However, the extent to which any mutation is enriched or depleted is a combination of two factors: the inherent effect of that mutation, and the “stringency” of the experimental selection. For instance, if the selection is weak, then deleterious mutations will only be slightly depleted; conversely, if selection is strong, then deleterious mutations will be greatly depleted.
The fact that the preferences depend on the stringency of the experimental selection is important if we want to compare results between Envs. The reason is that our goal is to identify differences in the inherent effects of mutations between Envs, not simply find differences due to variation in experimental stringency. Of course, we have done our best to perform the experiments for BG505 and BF520 equivalently, but because these are different viruses with different growth rates, it is impossible to exactly match the experimental stringencies. This can be seen in Figure 3-source data 2, which shows that stop codons were more depleted for BG505 than BF520, indicating that selection in our experiments was more stringent for BG505.
How should we best re-scale the preferences? Raising them to a power is a sensible approach. To see why, imagine a mutation that is depleted 3-fold after 2 rounds of viral growth. If our experiment instead allowed $2 \times 2 = 4$ rounds of viral growth, then the mutation would be depleted $3^2 = 9$-fold. More generally, if a mutation is enriched in frequency by $x$-fold after $n$ rounds of viral growth, then it will be enriched in frequency by $x^{\beta}$-fold after $\beta \times n$ rounds of viral growth.
Since the amino-acid preferences are conceptually equivalent to the re-normalized enrichments of mutations (Bloom, 2015, https://jbloomlab.github.io/dms_tools2/prefs.html), it therefore makes sense that the re-scaled preference $\pi_{r,a}$ for amino-acid $a$ at site $r$ should be related to the directly measured preference $\hat{\pi}_{r,a}$ by $\pi_{r,a} \propto \left(\hat{\pi}_{r,a}\right)^{\beta}$. And indeed, this is exactly the re-scaling scheme described in Hilton et al. (2017) that we use to re-scale our preferences for BG505 and BF520.
The last point is how to choose the re-scaling parameter $\beta$ for each Env. The dependence on stringency that we have described above for our experiments is also a feature of natural evolution: the expected frequency of a substitution during evolution depends not only on the inherent fitness effect of that mutation, but also on the effective population size, which is conceptually somewhat similar to the stringency of selection. In a mutation-selection phylogenetic model of evolution, if the amino-acid preferences are taken to represent the “fitness effects” of mutations, then the exponential scaling parameter $\beta$ is proportional to the effective population size (Halpern and Bruno, 1998; McCandlish and Stoltzfus, 2014; Bloom, 2014b). Therefore, fitting the $\beta$ parameter using a phylogenetic approach enables standardization of the preferences for the two Envs, and re-scales the preferences so that they best match the actual stringency of selection observed in nature (Hilton et al., 2017).
Note that in practice this re-scaling scheme is roughly equivalent to a more heuristic approach that has been used by Gray et al. (2017) and others. In this heuristic approach, the log-transformed enrichment ratios from different experiments are adjusted so that the distributions have equal spreads. Since multiplying log-transformed enrichment ratios by a constant is equivalent to exponentiating amino-acid preferences, these two re-scaling procedures apply the same mathematical transformation.
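The re-scaling described above amounts to raising each preference at a site to the power of the stringency parameter (called beta in the code) and re-normalizing so that the site's preferences again sum to one. A minimal sketch follows. The function name and the three-letter toy alphabet are hypothetical, and the numbers are invented for illustration; real sites carry preferences for all 20 amino acids, and the fitted stringency parameters are those in Table 1.

```python
def rescale_preferences(prefs, beta):
    """Raise each preference to the power beta, then re-normalize to sum to one.

    prefs: dict mapping amino acid -> directly measured preference at one site.
    beta:  stringency parameter; beta > 1 sharpens the preferences and
           beta < 1 flattens them.
    """
    powered = {aa: p ** beta for aa, p in prefs.items()}
    total = sum(powered.values())
    return {aa: value / total for aa, value in powered.items()}


# Toy site with made-up preferences (not data from the paper).
measured = {"A": 0.6, "G": 0.3, "P": 0.1}
for beta in (0.5, 1.0, 2.0):
    rescaled = rescale_preferences(measured, beta)
    print(beta, {aa: round(p, 3) for aa, p in rescaled.items()})
```

Because the normalization constant cancels in ratios, the ratio of any two re-scaled preferences equals the ratio of the measured preferences raised to the power beta. This is the same relationship as in the 3-fold versus 9-fold depletion example.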

Identifying sites of shifted amino-acid preference
When identifying shifts in amino-acid preferences between the two Envs, we needed a way to quantify differences between the Envs while accounting for the fact that our measurements are noisy. The approach we use is based closely on that of Doud et al. (2015), and is illustrated graphically in Figure 6A. The $\mathrm{RMSD}_{\mathrm{corrected}}$ value is our measure of the magnitude of the shift. Figure 6A, its legend, and the associated text completely explain these calculations with the following exception: they do not detail how the “distance” between any two preference measurements was calculated. The distance between preferences at each site was simply defined as half of the sum of the absolute values of the differences between preferences for each amino acid. Specifically, for a given site $r$, let $\pi^{i}_{r,a}$ be the re-scaled preference for amino-acid $a$ in homolog $i$ (e.g., BG505) and let $\pi^{j}_{r,a}$ be the re-scaled preference for that same amino acid in homolog $j$ (e.g., BF520). Then the distance between the homologs at this site is simply $D^{i,j}_{r} = \frac{1}{2} \sum_{a} \left| \pi^{i}_{r,a} - \pi^{j}_{r,a} \right|$. The factor of $\frac{1}{2}$ is used so that the distance will always fall between zero and one.
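The distance defined above (half of the summed absolute differences between the two homologs' preferences at a site) can be sketched in a few lines. The function name and the toy preferences are hypothetical; only the distance itself is shown, not the noise correction that yields the RMSDcorrected value.

```python
def preference_distance(prefs_1, prefs_2):
    """Half the sum of absolute differences between two preference vectors.

    prefs_1, prefs_2: dicts mapping amino acid -> re-scaled preference at one
    site in each homolog. Each dict is expected to sum to one; an amino acid
    missing from one dict is treated as having preference zero.
    """
    amino_acids = set(prefs_1) | set(prefs_2)
    return 0.5 * sum(
        abs(prefs_1.get(aa, 0.0) - prefs_2.get(aa, 0.0)) for aa in amino_acids
    )


# Made-up preferences for a toy three-letter alphabet (not data from the paper).
homolog_1 = {"A": 0.6, "G": 0.3, "P": 0.1}
homolog_2 = {"A": 0.2, "G": 0.3, "P": 0.5}
print(preference_distance(homolog_1, homolog_2))
```

The distance is zero when the two homologs have identical preferences at the site and reaches one only when they prefer entirely non-overlapping sets of amino acids.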

Analysis of entrenchment
For the analysis in Figure 8, the results are presented in terms of the mutational effects rather than the amino-acid preferences. If $\pi_{r,a}$ is the preference of site $r$ for amino-acid $a$ and $\pi_{r,a'}$ is the preference for amino-acid $a'$ (both re-scaled by the stringency parameters in Table 1), then the estimated effect of the mutation from $a$ to $a'$ is simply $\log\left(\frac{\pi_{r,a'}}{\pi_{r,a}}\right)$.
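The mutational effect defined above is the logarithm of the ratio of the mutant to the wildtype preference. The sketch below computes it for one site. The function name and example values are hypothetical, and the base of the logarithm is left as a parameter because the text does not specify it; the natural logarithm is used as the default purely as an assumption.

```python
import math


def mutational_effect(prefs, wildtype, mutant, base=math.e):
    """Log ratio of the mutant preference to the wildtype preference.

    prefs: dict mapping amino acid -> re-scaled preference at one site.
    base:  base of the logarithm (an assumption; not specified in the text).
    Negative values mean the mutant amino acid is less preferred.
    """
    return math.log(prefs[mutant] / prefs[wildtype], base)


# Made-up preferences for a toy two-letter alphabet (not data from the paper).
site_prefs = {"A": 0.8, "G": 0.2}
print(mutational_effect(site_prefs, wildtype="A", mutant="G"))
print(mutational_effect(site_prefs, wildtype="G", mutant="A"))
```

By construction the effect of a mutation and of its reversion have equal magnitude and opposite sign when both are computed from the same preferences, so any asymmetry between the two homologs reflects a difference in their preferences.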

Data and code availability
All code and input data required to reproduce all analyses in this paper are in Supplementary file 1 (see also Supplementary file 2). A repository containing all of this code is also available at https://github.com/jbloomlab/EnvMutationalShiftsPaper (Haddox et al., 2018, copy archived at https://github.com/elifesciences-publications/EnvMutationalShiftsPaper). The deep sequencing data are on the Sequence Read Archive with the accession numbers listed in Supplementary files 1 and 2.

Acknowledgments
We thank Michael Doud and Orr Ashenberg for computer code that formed the basis for some of the analyses. We thank Andrew Ward for pointing out to us that some of the sites with slower-than-expected rates of evolution are involved in Env’s conformational changes upon receptor binding. We thank Kelly Lee for helpful input about the relative stabilities of the BG505 and BF520 Envs. We thank the Fred Hutch Core for performing the Illumina deep sequencing.

References
Al-Mawsawi LQ, Wu NC, Olson CA, Shi VC, Qi H, Zheng X, Wu TT, Sun R. High-throughput profiling of point mutations across the HIV-1 genome. Retrovirology. 2014; 11(1):124.

Albert J, Abrahamsson B, Nagy K, Aurelius E, Gaines H, Nyström G, Fenyö EM. Rapid development of isolate-specific neutralizing antibodies after primary HIV-1 infection and consequent emergence of virus variants which resist neutralization by autologous sera. AIDS. 1990; 4(2):107–112.

Almond D, Kimura T, Kong X, Swetnam J, Zolla-Pazner S, Cardozo T. Structural conservation predominates over sequence variability in the crown of HIV type 1’s V3 loop. AIDS Research and Human Retroviruses. 2010; 26(6):717–723.

van Anken E, Sanders RW, Liscaljet IM, Land A, Bontjer I, Tillemans S, Nabatov AA, Paxton WA, Berkhout B, Braakman I. Only five of 10 strictly conserved disulfide bonds are essential for folding and eight for function of the HIV-1 envelope glycoprotein. Molecular Biology of the Cell. 2008; 19(10):4298–4309.

Ashenberg O, Gong LI, Bloom JD. Mutational effects on stability are largely conserved during protein evolution. Proceedings of the National Academy of Sciences. 2013; 110(52):21071–21076.

Barton JP, Kardar M, Chakraborty AK. Scaling laws describe memories of host–pathogen riposte in the HIV population. Proceedings of the National Academy of Sciences. 2015; 112(7):1965–1970.

Bazykin GA. Changing preferences: deformation of single position amino acid fitness landscapes and evolution of proteins. Biology Letters. 2015; 11(10):20150315.

Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B. 1995; p. 289–300.

Binley JM, Sanders RW, Clas B, Schuelke N, Master A, Guo Y, Kajumo F, Anselma DJ, Maddon PJ, Olson WC, et al. A recombinant human immunodeficiency virus type 1 envelope glycoprotein complex stabilized by an intermolecular disulfide bond between the gp120 and gp41 subunits is an antigenic mimic of the trimeric virion-associated structure. Journal of Virology. 2000; 74(2):627–643.

Bloom JD. An experimentally determined evolutionary model dramatically improves phylogenetic fit. Molecular Biology and Evolution. 2014; 31(8):1956–1978.

Bloom JD. An experimentally informed evolutionary model improves phylogenetic fit to divergent lactamase homologs. Molecular Biology and Evolution. 2014; 31(10):2753–2769.

Bloom JD. Software for the analysis and visualization of deep mutational scanning data. BMC Bioinformatics. 2015; 16:168.

Bloom JD. Identification of positive selection in genes is greatly improved by using experimentally informed site-specific models. Biology Direct. 2017; 12(1):1.

Bloom JD, Labthavikul ST, Otey CR, Arnold FH. Protein stability promotes evolvability. Proceedings of the National Academy of Sciences. 2006; 103(15):5869–5874.

Boyd DF, Peterson D, Haggarty BS, Jordan AP, Hogan MJ, Goo L, Hoxie JA, Overbaugh J. Mutations in HIV-1 envelope that enhance entry with the macaque CD4 receptor alter antibody recognition by disrupting quaternary interactions within the trimer. Journal of Virology. 2015; 89(2):894–907.

Burton DR, Stanfield RL, Wilson IA. Antibody vs. HIV in a clash of evolutionary titans. Proceedings of the National Academy of Sciences of the United States of America. 2005; 102(42):14943–14948.

Capes-Davis A, Reid YA, Kline MC, Storts DR, Strauss E, Dirks WG, Drexler HG, MacLeod RA, Sykes G, Kohara A, et al. Match criteria for human cell line authentication: where do we draw the line? International Journal of Cancer. 2013; 132(11):2510–2519.

Chakrabarti L, Emerman M, Tiollais P, Sonigo P. The cytoplasmic domain of simian immunodeficiency virus transmembrane protein modulates infectivity. Journal of Virology. 1989; 63(10):4395–4403.

Chan DC, Fass D, Berger JM, Kim PS. Core structure of gp41 from the HIV envelope glycoprotein. Cell. 1997; 89(2):263–273.

Chan YH, Venev SV, Zeldovich KB, Matthews CR. Correlation of fitness landscapes from three orthologous TIM barrels originates from sequence and structure constraints. Nature Communications. 2017; 8:14614.

Chothia C, Lesk AM. The relation between the divergence of sequence and structure in proteins. The EMBO Journal. 1986; 5(4):823.

Crooks GE, Hon G, Chandonia JM, Brenner SE. WebLogo: a sequence logo generator. Genome Research. 2004; 14(6):1188–1190.

Davenport TM, Guttman M, Guo W, Cleveland B, Kahn M, Hu SL, Lee KK. Isolate-specific differences in the conformational dynamics and antigenicity of HIV-1 gp120. Journal of Virology. 2013; 87(19):10855–10873.

Dingens AS, Haddox HK, Overbaugh J, Bloom JD. Comprehensive mapping of HIV-1 escape from a broadly neutralizing antibody. Cell Host & Microbe. 2017; 21:777–787.

Doud MB, Ashenberg O, Bloom JD. Site-specific amino acid preferences are mostly conserved in two closely related protein homologs. Molecular Biology and Evolution. 2015; 32(11):2944–2960.

Doud MB, Bloom JD. Accurate measurement of the effects of all amino-acid mutations to influenza hemagglutinin. Viruses. 2016; 8:155.

Duenas-Decamp M, Jiang L, Bolon D, Clapham PR. Saturation Mutagenesis of the HIV-1 Envelope CD4 Binding Loop Reveals Residues Controlling Distinct Trimer Conformations. PLoS Pathogens. 2016; 12(11):e1005988.

Faria NR, Rambaut A, Suchard MA, Baele G, Bedford T, Ward MJ, Tatem AJ, Sousa JD, Arinaminpathy N, Pépin J, et al. The early spread and epidemic ignition of HIV-1 in human populations. Science. 2014; 346(6205):56–61.

Ferguson AL, Mann JK, Omarjee S, Ndung’u T, Walker BD, Chakraborty AK. Translating HIV sequences into quantitative fitness landscapes predicts viral vulnerabilities for rational immunogen design. Immunity. 2013; 38(3):606–617.

Gong LI, Suchard MA, Bloom JD. Stability-mediated epistasis constrains the evolution of an influenza protein. eLife. 2013; 2:e00631.

Goo L, Chohan V, Nduati R, Overbaugh J. Early development of broadly neutralizing antibodies in HIV-1-infected infants. Nature Medicine. 2014; 20(6):655–658.

Gray VE, Hause RJ, Fowler DM. Analysis of Large-Scale Mutagenesis Data To Assess the Impact of Single Amino Acid Substitutions. Genetics. 2017; 207(1):53.

Gristick HB, von Boehmer L, West Jr AP, Schamber M, Gazumyan A, Golijanin J, Seaman MS, Fätkenheuer G, Klein F, Nussenzweig MC, et al. Natively glycosylated HIV-1 Env structure reveals new mode for antibody recognition of the CD4-binding site. Nature Structural & Molecular Biology. 2016; 23(10):906–915.

Guttman M, Cupo A, Julien J, Sanders R, Wilson I, Moore J, Lee K. Antibody potency relates to the ability to recognize the closed, pre-fusion form of HIV Env. Nature Communications. 2015; 6:6144–6144.

Guttman M, Garcia NK, Cupo A, Matsui T, Julien JP, Sanders RW, Wilson IA, Moore JP, Lee KK. CD4-induced activation in a soluble HIV-1 Env trimer. Structure. 2014; 22(7):974–984.

Haddox HK, Dingens AS, Bloom JD. Experimental estimation of the effects of all amino-acid mutations to HIV’s envelope protein on viral replication in cell culture. PLoS Pathogens. 2016; 12(12):e1006114.

Haddox HK, Dingens AS, Hilton SK, Overbaugh J, Bloom JD. Computer code for “Mapping mutational effects along the evolutionary landscape of HIV envelope”. Github. 2018; commit 2355e76. https://github.com/jbloomlab/EnvMutationalShiftsPaper.

Halpern AL, Bruno WJ. Evolutionary distances for protein-coding sequences: modeling site-specific residue frequencies. Molecular Biology and Evolution. 1998; 15(7):910–917.

Harms MJ, Thornton JW. Historical contingency and its biophysical basis in glucocorticoid receptor evolution. Nature. 2014; 512(7513):203.

Hedges SB, Dudley J, Kumar S. TimeTree: a public knowledge-base of divergence times among organisms. Bioinformatics. 2006; 22(23):2971–2972.

Hilton SK, Doud MB, Bloom JD. phydms: Software for phylogenetic analyses informed by deep mutational scanning. PeerJ. 2017; 5:e3657.

Huang J, Kang BH, Pancera M, Lee JH, Tong T, Feng Y, Imamichi H, Georgiev IS, Chuang GY, Druz A, et al. Broad and potent HIV-1 neutralization by a human antibody that binds the gp41-gp120 interface. Nature. 2014; 515(7525):138–142.

Inoue K, Kohno T, Takakura S, Hayashi Y, Mizoguchi H, Yokota J. Frequent microsatellite instability and BAX mutations in T cell acute lymphoblastic leukemia cell lines. Leukemia Research. 2000; 24(3):255–262.

Julien JP, Cupo A, Sok D, Stanfield RL, Lyumkis D, Deller MC, Klasse PJ, Burton DR, Sanders RW, Moore JP, et al. Crystal structure of a soluble cleaved HIV-1 envelope trimer. Science. 2013; 342(6165):1477–1483.

Julien JP, Lee JH, Ozorowski G, Hua Y, de la Peña AT, de Taeye SW, Nieusma T, Cupo A, Yasmeen A, Golabek M, et al. Design and structure of two HIV-1 clade C SOSIP.664 trimers that increase the arsenal of native-like Env immunogens. Proceedings of the National Academy of Sciences. 2015; 112(38):11947–11952.

Kabsch W, Sander C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers. 1983; 22(12):2577–2637.

Klink G, Bazykin G. Parallel evolution of metazoan mitochondrial proteins. Genome Biology and Evolution. 2017; 9:1341–1350.

Korber B, Foley BT, Kuiken C, Pillai SK, Sodroski JG, et al. Numbering positions in HIV relative to HXB2CG. Human Retroviruses and AIDS. 1998; 3:102–111.

Kosakovsky Pond SL, Frost SD. Not so different after all: a comparison of methods for detecting amino acid sites under selection. Molecular Biology and Evolution. 2005; 22(5):1208–1222.

Kouyos RD, Leventhal GE, Hinkley T, Haddad M, Whitcomb JM, Petropoulos CJ, Bonhoeffer S. Exploring the complexity of the HIV-1 fitness landscape. PLoS Genetics. 2012; 8(3):e1002551.

Kumar A, Natarajan C, Moriyama H, Witt CC, Weber RE, Fago A, Storz JF. Stability-mediated epistasis restricts accessible mutational pathways in the functional evolution of avian hemoglobin. Molecular Biology and Evolution. 2017; 34(5):1240–1251.

Kwong PD, Doyle ML, Casper DJ, Cicala C, Leavitt SA, Majeed S, Steenbeke TD, Venturi M, Chaiken I, Fung M, et al. HIV-1 evades antibody-mediated neutralization through conformational masking of receptor-binding sites. Nature. 2002; 420(6916):678–682.

Kwong PD, Wyatt R, Majeed S, Robinson J, Sweet RW, Sodroski J, Hendrickson WA. Structures of HIV-1 gp120 envelope glycoproteins from laboratory-adapted and primary isolates. Structure. 2000; 8(12):1329–1339.

Lee JH, Andrabi R, Su CY, Yasmeen A, Julien JP, Kong L, Wu NC, McBride R, Sok D, Pauthner M, et al. A broadly neutralizing antibody targets the dynamic HIV envelope trimer apex via a long, rigidified, and anionic β-hairpin structure. Immunity. 2017; 46(4):690–702.

Li Y, Luo L, Thomas DY, Kang OY. Control of expression, glycosylation, and secretion of HIV-1 gp120 by homologous and heterologous signal sequences. Virology. 1994; 204(1):266–278.

Louie RH, Kaczorowski KJ, Barton JP, Chakraborty AK, McKay MR. Fitness landscape of the human immunodeficiency virus envelope protein that is targeted by antibodies. Proceedings of the National Academy of Sciences. 2018; p. 201717765.

Lynch RM, Shen T, Gnanakaran S, Derdeyn CA. Appreciating HIV type 1 diversity: subtype differences in Env. AIDS Research and Human Retroviruses. 2009; 25(3):237–248.

Lyumkis D, Julien JP, de Val N, Cupo A, Potter CS, Klasse PJ, Burton DR, Sanders RW, Moore JP, Carragher B, et al. Cryo-EM structure of a fully glycosylated soluble cleaved HIV-1 envelope trimer. Science. 2013; 342(6165):1484–1490.

Mann JK, Barton JP, Ferguson AL, Omarjee S, Walker BD, Chakraborty A, Ndung’u T. The fitness landscape of HIV-1 gag: advanced modeling approaches and validation of model predictions by in vitro testing. PLoS Computational Biology. 2014; 10(8):e1003776.

McCandlish DM, Stoltzfus A. Modeling evolution using the probability of fixation: history and implications. The Quarterly Review of Biology. 2014; 89(3):225–252.

Moore PL, Gray ES, Morris L. Specificity of the autologous neutralizing antibody response. Current Opinion in HIV and AIDS. 2009; 4(5):358.

Munro JB, Gorman J, Ma X, Zhou Z, Arthos J, Burton DR, Koff WC, Courter JR, Smith AB, Kwong PD, et al. Conformational dynamics of single HIV-1 envelope trimers on the surface of native virions. Science. 2014; 346(6210):759–763.

Murrell B, Weaver S, Smith MD, Wertheim JO, Murrell S, Aylward A, Eren K, Pollner T, Martin DP, Smith DM, et al. Gene-wide identification of episodic selection. Molecular Biology and Evolution. 2015; 32(5):1365–1371.

Murrell B, Wertheim JO, Moola S, Weighill T, Scheffler K, Pond SLK. Detecting individual sites subject to episodic diversifying selection. PLoS Genetics. 2012; 8(7):e1002764.

Natarajan C, Inoguchi N, Weber RE, Fago A, Moriyama H, Storz JF. Epistasis among adaptive mutations in deer mouse hemoglobin. Science. 2013; 340(6138):1324–1327.

Nduati R, John G, Mbori-Ngacha D, Richardson B, Overbaugh J, Mwatha A, Ndinya-Achola J, Bwayo J, Onyango FE, Hughes J, et al. Effect of breastfeeding and formula feeding on transmission of HIV-1: a randomized clinical trial. JAMA. 2000; 283(9):1167–1174.

Olshevsky U, Helseth E, Furman C, Li J, Haseltine W, Sodroski J. Identification of individual human immunodeficiency virus type 1 gp120 amino acids important for CD4 receptor binding. Journal of Virology. 1990; 64(12):5701–5707.

Ozorowski G, Pallesen J, de Val N, Lyumkis D, Cottrell CA, Torres JL, Copps J, Stanfield RL, Cupo A, Pugach P, et al. Open and closed structures reveal allostery and pliability in the HIV-1 envelope spike. Nature. 2017; 547(7663):360–363.

Pancera M, Zhou T, Druz A, Georgiev IS, Soto C, Gorman J, Huang J, Acharya P, Chuang GY, Ofek G, et al. Structure and immune recognition of trimeric pre-fusion HIV-1 Env. Nature. 2014; 514(7523):455–461.

Parrish NF, Gao F, Li H, Giorgi EE, Barbian HJ, Parrish EH, Zajic L, Iyer SS, Decker JM, Kumar A, et al. Phenotypic properties of transmitted founder HIV-1. Proceedings of the National Academy of Sciences. 2013; 110(17):6626–6633.

Peden K, Emerman M, Montagnier L. Changes in growth properties on passage in tissue culture of viruses derived from infectious molecular clones of HIV-1 LAI, HIV-1 MAL, and HIV-1 ELI. Virology. 1991; 185(2):661–672.

Podgornaia AI, Laub MT. Pervasive degeneracy and epistasis in a protein-protein interface. Science. 2015; 347(6222):673–677.

Pollock DD, Thiltgen G, Goldstein RA. Amino acid coevolution induces an evolutionary Stokes shift. Proceedings of the National Academy of Sciences. 2012; 109(21):E1352–E1359.

Pond SK, Delport W, Muse SV, Scheffler K. Correcting the bias of empirical frequency parameter estimators in codon models. PLoS One. 2010; 5(7):e11230.

Posada D, Buckley TR. Model selection and model averaging in phylogenetics: advantages of Akaike information criterion and Bayesian approaches over likelihood ratio tests. Systematic Biology. 2004; 53(5):793–808.

Poss M, Overbaugh J. Variants from the diverse virus population identified at seroconversion of a clade A human immunodeficiency virus type 1-infected woman have distinct biological properties. Journal of Virology. 1999; 73(7):5255–5264.

Pugach P, Kuhmann SE, Taylor J, Marozsan AJ, Snyder A, Ketas T, Wolinsky SM, Korber BT, Moore JP. The prolonged culture of human immunodeficiency virus type 1 in primary lymphocytes increases its sensitivity to neutralization by soluble CD4. Virology. 2004; 321(1):8–22.

Pugach P, Ozorowski G, Cupo A, Ringe R, Yasmeen A, de Val N, Derking R, Kim HJ, Korzun J, Golabek M, et al. A native-like SOSIP.664 trimer based on an HIV-1 subtype B env gene. Journal of Virology. 2015; 89(6):3380–3395.

Rathore U, Saha P, Kesavardhana S, Kumar AA, Datta R, Devanarayanan S, Das R, Mascola JR, Varadarajan R. Glycosylation of the core of the HIV-1 envelope subunit protein gp120 is not required for native trimer formation or viral infectivity. Journal of Biological Chemistry. 2017; 292(24):10197–10219.

Richman DD, Wrin T, Little SJ, Petropoulos CJ. Rapid evolution of the neutralizing antibody response to HIV type 1 infection. Proceedings of the National Academy of Sciences. 2003; 100(7):4144–4149.

Risso VA, Manssour-Triedo F, Delgado-Delgado A, Arco R, Barroso-delJesus A, Ingles-Prieto A, Godoy-Ruiz R, Gavira JA, Gaucher EA, Ibarra-Molero B, et al. Mutational studies on resurrected ancestral proteins reveal conservation of site-specific amino acid preferences throughout evolutionary history. Molecular Biology and Evolution. 2014; 32(2):440–455.

Ronen K, Sharma A, Overbaugh J. HIV transmission biology: translation for HIV prevention. AIDS. 2015; 29:2219–2227.

Sagar M, Wu X, Lee S, Overbaugh J. Human immunodeficiency virus type 1 V1-V2 envelope loop sequences expand and add glycosylation sites over the course of infection, and these modifications affect antibody neutralization sensitivity. Journal of Virology. 2006; 80(19):9586–9598.

Sander C, Schneider R. Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins: Structure, Function, and Bioinformatics. 1991; 9(1):56–68.

Sanders RW, Derking R, Cupo A, Julien JP, Yasmeen A, de Val N, Kim HJ, Blattner C, de la Peña AT, Korzun J, et al. A next-generation cleaved, soluble HIV-1 Env trimer, BG505 SOSIP.664 gp140, expresses multiple epitopes for broadly neutralizing but not non-neutralizing antibodies. PLoS Pathogens. 2013; 9(9):e1003618.

Sanders RW, Van Gils MJ, Derking R, Sok D, Ketas TJ, Burger JA, Ozorowski G, Cupo A, Simonich C, Goo L, et al. HIV-1 neutralizing antibodies induced by native-like envelope trimers. Science. 2015; 349(6244):aac4223.

Sanders RW, Vesanen M, Schuelke N, Master A, Schiffner L, Kalyanaraman R, Paluch M, Berkhout B, Maddon PJ, Olson WC, et al. Stabilization of the soluble, cleaved, trimeric form of the envelope glycoprotein complex of human immunodeficiency virus type 1. Journal of Virology. 2002; 76(17):8875–8889.

Shah P, McCandlish DM, Plotkin JB. Contingency and entrenchment in protein evolution under purifying selection. Proceedings of the National Academy of Sciences. 2015; 112(25):E3226–E3235.

Sharp PM, Hahn BH. Origins of HIV and the AIDS pandemic. Cold Spring Harbor Perspectives in Medicine. 2011; 1(1):a006841.

da Silva J, Coetzer M, Nedellec R, Pastore C, Mosier DE. Fitness epistasis and constraints on adaptation in a human immunodeficiency virus type 1 protein region. Genetics. 2010; 185(1):293–303.

Simonich CA, Williams KL, Verkerke HP, Williams JA, Nduati R, Lee KK, Overbaugh J. HIV-1 neutralizing antibodies with limited hypermutation from an infant. Cell. 2016; 166(1):77–87.

Stamatakis A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics. 2014; 30:1312–1313.

Starcich BR, Hahn BH, Shaw GM, McNeely PD, Modrow S, Wolf H, Parks ES, Parks WP, Josephs SF, Gallo RC, et al. Identification and characterization of conserved and variable regions in the envelope gene of HTLV-III/LAV, the retrovirus of AIDS. Cell. 1986; 45(5):637–648.

Starr TN, Flynn JM, Mishra P, Bolon DN, Thornton JW. Pervasive contingency and entrenchment in a billion years of Hsp90 evolution. bioRxiv. 2017; p. 189803.

Starr TN, Thornton JW. Epistasis in protein evolution. Protein Science. 2016; 25(7):1204–1218.

Stewart-Jones GB, Soto C, Lemmin T, Chuang GY, Druz A, Kong R, Thomas PV, Wagh K, Zhou T, Behrens AJ, et al. Trimeric HIV-1-Env Structures Define Glycan Shields from Clades A, B, and G. Cell. 2016; 165(4):813–826.

de Taeye SW, Ozorowski G, de la Peña AT, Guttman M, Julien JP, van den Kerkhof TL, Burger JA, Pritchard LK, Pugach P, Yasmeen A, et al. Immunogenicity of Stabilized HIV-1 Envelope Trimers with Reduced Exposure of Non-neutralizing Epitopes. Cell. 2015; 163(7):1702–1715.

Tan K, Liu Jh, Wang Jh, Shen S, Lu M. Atomic structure of a thermostable subdomain of HIV-1 gp41. Proceedings of the National Academy of Sciences. 1997; 94(23):12303–12308.

Tien MZ, Meyer AG, Sydykova DK, Spielman SJ, Wilke CO. Maximum allowed solvent accessibilites of residues in proteins. PLoS One. 2013; 8(11):e80635.

Verkerke HP, Williams JA, Guttman M, Simonich CA, Liang Y, Filipavicius M, Hu SL, Overbaugh J, Lee KK. Epitope-independent purification of native-like envelope trimers from diverse HIV-1 isolates. Journal of Virology. 2016; 90(20):9471–9482.

Wagih O. ggseqlogo: a versatile R package for drawing sequence logos. Bioinformatics. 2017; 33(22):3645–3647.

Wang W, Nie J, Prochnow C, Truong C, Jia Z, Wang S, Chen XS, Wang Y. A systematic study of the N-glycosylation sites of HIV-1 envelope protein on infectivity and antibody-mediated neutralization. Retrovirology. 2013; 10(1):1.

Wang X, Minasov G, Shoichet BK. Evolution of an antibiotic resistance enzyme constrained by stability and activity trade-offs. Journal of Molecular Biology. 2002; 320(1):85–95.

Waterston R, Lindblad-Toh K, Birney E, Rogers J, Abril J, Agarwal P, Agarwala R, Ainscough R, Alexandersson M, An P, et al. Initial sequencing and comparative analysis of the mouse genome. Nature. 2002; 420(6915):520–562.

Wei X, Decker JM, Wang S, Hui H, Kappes JC, Wu X, Salazar-Gonzalez JF, Salazar MG, Kilby JM, Saag MS, et al. Antibody neutralization and escape by HIV-1. Nature. 2003; 422(6929):307–312.

Weissenhorn W, Dessen A, Harrison S, Skehel J, Wiley D. Atomic structure of the ectodomain from HIV-1 gp41. Nature. 1997; 387(6631):426.

White TA, Bartesaghi A, Borgnia MJ, Meyerson JR, de la Cruz MJV, Bess JW, Nandwani R, Hoxie JA, Lifson JD, Milne JL, et al. Molecular architectures of trimeric SIV and HIV-1 envelope glycoproteins on intact viruses: strain-dependent variation in quaternary structure. PLoS Pathogens. 2010; 6(12):e1001249.

Wilen CB, Parrish NF, Pfaff JM, Decker JM, Henning EA, Haim H, Petersen JE, Wojcechowskyj JA, Sodroski J, Haynes BF, et al. Phenotypic and immunologic comparison of clade B transmitted/founder and chronic HIV-1 envelope glycoproteins. Journal of Virology. 2011; 85(17):8514–8527.

Worobey M, Gemmel M, Teuwen DE, Haselkorn T, Kunstman K, Bunce M, Muyembe JJ, Kabongo JMM, Kalengayi RM, Van Marck E, et al. Direct evidence of extensive diversity of HIV-1 in Kinshasa by 1960. Nature. 2008; 455(7213):661–664.

Wu NC, Young AP, Al-Mawsawi LQ, Olson CA, Feng J, Qi H, Chen SH, Lu IH, Lin CY, Chin RG, et al. High-throughput profiling of influenza A virus hemagglutinin gene at single-nucleotide resolution. Scientific Reports. 2014; 4:4942.

Wu X, Parast AB, Richardson BA, Nduati R, John-Stewart G, Mbori-Ngacha D, Rainwater SM, Overbaugh J. Neutralization escape variants of human immunodeficiency virus type 1 are transmitted from mother to infant. Journal of Virology. 2006; 80(2):835–844.

Yang Z. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. Journal of Molecular Evolution. 1994; 39(3):306–314.

Yang Z, Nielsen R, Goldman N, Pedersen AMK. Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics. 2000; 155(1):431–449.

Yuste E, Reeves JD, Doms RW, Desrosiers RC. Modulation of Env content in virions of simian immunodeficiency virus: correlation with cell surface expression and virion infectivity. Journal of Virology. 2004; 78(13):6775–6785.

Zanini F, Neher RA. Quantifying selection against synonymous mutations in HIV-1 env evolution. Journal of Virology. 2013; 87(21):11843–11850.

Zolla-Pazner S, Cardozo T. Structure–function relationships of HIV-1 envelope sequence-variable regions refocus vaccine design. Nature Reviews Immunology. 2010; 10(7):527–535.

Supplementary file 1. The code to perform all steps in the analysis is in analysis_code.zip. Specifically, this file contains a Jupyter notebook that performs the analysis, all required input data, and all reasonably sized output files. The Jupyter notebook downloads the deep sequencing data, processes it with the dms_tools2 software (Bloom, 2015, https://jbloomlab.github.io/dms_tools2/), and also performs a variety of downstream analyses that generate most of the figures for this paper.

Supplementary file 2. An HTML rendering of the Jupyter notebook that performs the computational analysis. The actual notebook is in Supplementary file 1, but if you just want to look at the analysis rather than run it, then you may prefer this file instead. In particular, the notebook contains plots detailing the deep sequencing data analysis as generated using the dms_tools2 software (Bloom, 2015, https://jbloomlab.github.io/dms_tools2/).

Supplementary file 3. The sequence of the wildtype BG505 env used in our study is in FASTA format in the file BG505_env.fasta.

Supplementary file 4. The sequences of the primers used for the BG505 codon mutagenesis are in the file BG505_codon_mutagenesis_primers.txt.

Supplementary file 5. The primers used for the BG505 barcoded-subamplicon sequencing are in the file BG505_bcsubamp_primers.txt.

[Figure: two histograms of clade A protein identity at alignable sites, one for identity to BF520 and one for identity to BG505. x-axis: protein identity (0.80 to 0.88); y-axis: number of sequences (0 to 15).]

Figure 1–Figure supplement 1. The histograms show the pairwise amino-acid identity of each Env to all other sequences in the clade A alignment in Figure 1-source data 1 after masking the sites delineated in Figure 1-source data 2. There are 616 non-masked sites. The pairwise protein identity between BG505 and BF520 is 86.2% (721 of 836 sites identical) when considering all sites, and 89.1% (549 of 616 sites identical) when considering just the non-masked sites.
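The identity calculation described in this caption can be illustrated with a minimal sketch. It assumes two pre-aligned sequences of equal length and a set of masked column indices; the function name, the toy sequences, and the treatment of gap characters as ordinary mismatches are hypothetical choices for illustration, not the authors' implementation. The last line simply re-derives the percentages quoted in the caption from the stated counts.

```python
def pairwise_identity(seq_a, seq_b, masked=frozenset()):
    """Return (n_identical, n_compared, fraction) over non-masked alignment columns.

    Assumption: a gap character is treated like any other character, so a
    gap opposite a residue counts as a mismatch.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = [i for i in range(len(seq_a)) if i not in masked]
    identical = sum(seq_a[i] == seq_b[i] for i in compared)
    return identical, len(compared), identical / len(compared)


# Toy aligned sequences (hypothetical, not real Env sequences).
a = "MRVKGIQ-NW"
b = "MRVMGIQRNC"
print(pairwise_identity(a, b))                 # all columns
print(pairwise_identity(a, b, masked={7, 9}))  # columns 7 and 9 masked

# Percentages quoted in the caption, recomputed from the stated counts.
print(round(100 * 721 / 836, 1), round(100 * 549 / 616, 1))
```

Masking removes columns from both the numerator and the denominator, which is why the identity over non-masked sites can be higher than the identity over all sites.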

Figure 2–Figure supplement 1. We Sanger sequenced 44 clones of BG505 Env sampled roughly evenly from each of the three replicate mutant plasmid libraries. (A) There was an average of 1.5 mutant codons per clone, with the number of mutations per clone roughly following a Poisson distribution. (B) The mutant codons had a mix of single-, double-, and triple-nucleotide changes, with more single-nucleotide changes than expected. (C) Nucleotide frequencies were fairly uniform in the mutant codons. (D) Mutations were distributed roughly evenly along the mutagenized region of env (30-699 in the sequential numbering scheme used in this plot). (E) For clones with multiple mutations, we computed the pairwise distance in primary sequence between each codon mutation and plotted the cumulative distribution of these distances (red line). We also simulated the expected distribution of pairwise distances if mutations occurred entirely independently (blue line). The observed distribution is close to the expected distribution. Comparable data for the BF520 libraries are provided in Dingens et al. (2017).
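The independence simulation in panel (E) can be sketched as follows. This is a minimal sketch under stated assumptions: mutated sites are drawn uniformly at random, without replacement, from sites 30 to 699, and each simulated clone carries the same number of mutations as one observed clone. The function names and the toy list of mutation counts are hypothetical, and the authors' own simulation may differ in its details.

```python
import bisect
import itertools
import random


def pairwise_distances(sites):
    """All pairwise distances in primary sequence between mutated sites of one clone."""
    return [abs(i - j) for i, j in itertools.combinations(sites, 2)]


def simulate_independent(mutations_per_clone, first_site=30, last_site=699,
                         n_sims=1000, seed=0):
    """Pool pairwise distances from clones whose mutations are placed independently.

    Assumption: sites are sampled uniformly without replacement within the
    mutagenized region; clones with fewer than two mutations contribute nothing.
    """
    rng = random.Random(seed)
    region = range(first_site, last_site + 1)
    distances = []
    for _ in range(n_sims):
        for k in mutations_per_clone:
            if k >= 2:
                distances.extend(pairwise_distances(rng.sample(region, k)))
    return sorted(distances)


def ecdf(sorted_values, x):
    """Empirical cumulative distribution: fraction of values <= x."""
    return bisect.bisect_right(sorted_values, x) / len(sorted_values)


# Hypothetical mutation counts for clones with multiple mutations.
observed_counts = [2, 3, 2, 4]
expected = simulate_independent(observed_counts)
for x in (50, 200, 400):
    print(x, ecdf(expected, x))
```

Comparing `ecdf` of the observed distances against `ecdf` of the simulated ones at the same cutoffs gives the two curves that the caption describes.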

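[Figure: pairwise comparison of amino-acid preferences between replicates, shown as KDE contour plots; both axes run from 0 to 1. Correlation (R) for each pair of replicates:]

BF520-2: R = 0.59 (vs. BF520-1)
BF520-3: R = 0.64 (vs. BF520-1), 0.60 (vs. BF520-2)
BG505-1: R = 0.57 (vs. BF520-1), 0.58 (vs. BF520-2), 0.57 (vs. BF520-3)
BG505-2: R = 0.59 (vs. BF520-1), 0.59 (vs. BF520-2), 0.59 (vs. BF520-3), 0.76 (vs. BG505-1)
BG505-3: R = 0.59 (vs. BF520-1), 0.59 (vs. BF520-2), 0.59 (vs. BF520-3), 0.77 (vs. BG505-1), 0.78 (vs. BG505-2)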

Figure 3–Figure supplement 1. This figure shows the same data as in Figure 3B except that it shows KDE contour plots rather than points. As is obvious from this representation, the vast majority of the points fall near the origin (very low preference in both replicates), and most of the correlation signal is therefore due to the relatively modest number of amino acids that have high preference at any given site. This is expected, since most amino acid mutations at most sites will be strongly deleterious, and so have low preference.

Figure 7–Figure supplement 1. The left-most panel shows the six-helix bundle of Env (PDB 1ENV; Weissenhorn et al., 1997), focusing on the central three helices, each of which is derived from a different Env monomer and which pack together to form the bundle’s core. The other three helices that make up this bundle are located in another part of the structure. There are ∼80 residues per monomer that are resolved in this structure, including four significantly shifted sites and five substituted sites. Of these, three of the shifted sites (582, 583, and 587) and one substituted site (588) form a cluster at one end of the bundle. We show the side chains for each of these sites colored according to whether the site has shifted (red) or substituted (yellow). We wondered whether this cluster was unique to the six-helix bundle, but found that it is also present in roughly the same structural arrangement in both the closed pre-fusion conformation (center panel) and the open CD4-bound conformation (right panel; note that some of the surrounding residues in both the closed and open structures have been hidden for the sake of clarity). As shown in Figure 6C, each of the shifted sites is more tolerant of mutations in BG505 than in BF520. The two shifted sites that point inward towards the bundle’s core (583 and 587) may be important for packing. Based on these structures, it is difficult to discern if or how the substituted site may impact the preferences at the shifted sites nearby.

Figure 7–Figure supplement 2. A fine-grained view of two clusters of shifted sites, both of which are in conformationally dynamic regions of Env. In each structure, we only show side chains for shifted sites or other sites of interest, which include substituted sites and sites that are proposed to take part in a hydrophobic network that helps control Env’s dynamics (Ozorowski et al., 2017). These side chains are colored according to the key in the upper right. (A) A cluster of four shifted sites at the apex of the closed pre-fusion conformation of Env (PDB 5FYL; Stewart-Jones et al., 2016), a region that is not resolved in the open CD4-bound conformation. These sites are in Env’s first/second variable loop (164 and 165) and its third variable loop (307 and 309), which pack against one another and against an adjacent Env protomer (the inter-protomer boundary is indicated by a dashed line). Each of the shifted sites has substituted, and these sites are in the immediate vicinity of other sites that have substituted but not shifted. Thus, it seems likely that these shifts at least partially arise from short-range epistatic interactions within this cluster. However, longer-range epistatic interactions with more distant sites also seem plausible given the highly dynamic nature of this region. (B) A cluster of shifted sites that are adjacent to a network of hydrophobic residues (blue/purple) that help mediate the conformational change from the closed pre-fusion state to the open CD4-bound state (Ozorowski et al., 2017). One of the shifted sites (purple) is in this network. These sites are shown in the context of both structural states, the latter of which is from PDB 5VN3 (Ozorowski et al., 2017). Nearly one third of the shifted residues cluster within this structural region (9/30). In contrast to panel (A), only a few of the shifted sites are adjacent to substitutions or have substituted themselves. This trend is found in both the closed and open conformations, suggesting that the shifts may be primarily due to long-range epistatic interactions.

[Figure: box plots with overlaid points, one panel for the closed pre-fusion structure (5FYL) and one for the CD4-bound state (5VN3). x-axis: selection type at Q < 0.05 (neither, diversifying, purifying); y-axis: relative solvent accessibility (0.0 to 1.0); legend: N-glycan (False/True).]

Figure 9–Figure supplement 1. Sites are grouped by whether they have ω_r > 1 (diversifying selection) or ω_r < 1 (purifying selection) at Q < 0.05 in both Envs, or whether they fall into neither of these categories. Relative solvent accessibilities were calculated as in Figure 7. As can be seen from these box plots with overlaid points for each site, sites of both diversifying and purifying selection tend to have higher relative solvent accessibility than other sites.

[Figure: for each N-linked glycosylation motif, three logo plots (BF520, BG505, alignment) over the motif positions N, X, and S/T. Motifs shown, named by the residue number of the asparagine: 88, 133, 137, 156, 160, 197, 234, 262, 276, 301, 332, 339, 355, 363, 386, 392, 406, 448, 611, 625, 637.]

Figure 9–Figure supplement 2. This plot shows all sites that are N-linked glycosylation motifs in both BG505 and BF520. Each motif is named by the residue number of the asparagine, and the preferences (averaged across replicates) for each Env are shown as well as the frequencies of amino acids across the clade A alignment at each of the three positions in the motif (N-X-S/T). As can be seen from this plot, the experiments measure relatively broad mutational tolerance at many sites where the natural Env sequences have a strongly conserved motif. We suspect this is because many glycans serve as a shield against immunity in nature, a function that is not required in cell culture.

[Figure: for each site, three logo plots (BF520, BG505, alignment). Sites shown, with Qmax in parentheses:]

47 (0.0011), 53 (0.015), 58 (0.00071), 64 (0.00029), 66 (0.048), 70 (0.0042), 73 (0.036)
110 (0.007), 113 (0.00044), 126 (0.049), 127 (0.035), 204 (0.048), 219 (0.0081), 374 (0.046)
386 (0.00044), 429 (0.0095), 439 (0.045), 448 (0.0038), 559 (0.00062), 569 (0.02), 611 (0.024)
624 (7.9e−06), 664 (0.016), 685 (0.0075), 691 (0.0034), 694 (0.048)

Figure 9–Figure supplement 3. This plot shows all sites that are evolving more slowly than expected in natural sequences given the preferences measured in both Envs. Specifically, it shows all sites with ω_r < 1 at Q < 0.05 for the ExpCMs for both BG505 and BF520. For each site, the plots show the preferences averaged across replicates and re-scaled for each Env, as well as the frequencies of amino acids in the clade A Env alignment. The Q-value indicated (Qmax) is the larger of the Q-values for the two Envs.
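The site-selection rule in this caption can be illustrated with a minimal sketch. It assumes that the per-site fits for each Env are available as a mapping from site to a pair (site-specific rate parameter, Q-value); the function name, the data layout, and the toy numbers are hypothetical and are not taken from the authors' analysis code.

```python
def sites_slower_than_expected(fits_a, fits_b, q_cutoff=0.05):
    """Return {site: Qmax} for sites with rate parameter < 1 in both Envs and Qmax < q_cutoff.

    fits_a, fits_b: dict mapping site -> (omega, q_value) for each Env's model
    (hypothetical layout). Qmax is the larger of the two Q-values, so a site
    passes only if it is significant under both models.
    """
    selected = {}
    for site in sorted(fits_a.keys() & fits_b.keys()):
        omega_a, q_a = fits_a[site]
        omega_b, q_b = fits_b[site]
        q_max = max(q_a, q_b)
        if omega_a < 1 and omega_b < 1 and q_max < q_cutoff:
            selected[site] = q_max
    return selected


# Toy values (hypothetical, not the paper's data).
bg505_fits = {1: (0.2, 0.002), 2: (1.8, 0.01), 3: (0.5, 0.2)}
bf520_fits = {1: (0.3, 0.0004), 2: (0.4, 0.03), 3: (0.6, 0.01)}
print(sites_slower_than_expected(bg505_fits, bf520_fits))
```

Taking the maximum of the two Q-values is equivalent to requiring each Env's Q-value to fall below the cutoff separately.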