
Trends in Antibody Sequence Changes during the Somatic Hypermutation Process. Louis A. Clark, Skanth Ganesan, Sarah Papp and Herman W. T. van Vlijmen. This information is current as of September 27, 2021. J Immunol 2006; 177:333-340; doi: 10.4049/jimmunol.177.1.333 http://www.jimmunol.org/content/177/1/333

References This article cites 32 articles, 9 of which you can access for free at: http://www.jimmunol.org/content/177/1/333.full#ref-list-1



The Journal of Immunology is published twice each month by The American Association of Immunologists, Inc., 1451 Rockville Pike, Suite 650, Rockville, MD 20852 Copyright © 2006 by The American Association of Immunologists All rights reserved. Print ISSN: 0022-1767 Online ISSN: 1550-6606. The Journal of Immunology

Trends in Antibody Sequence Changes during the Somatic Hypermutation Process

Louis A. Clark,1 Skanth Ganesan, Sarah Papp, and Herman W. T. van Vlijmen1

Probable germline gene sequences from thousands of aligned mature Ab sequences are inferred using simple computational matching to known V(D)J genes. Comparison of the germline to mature sequences in a structural region-dependent fashion allows insights into the methods that nature uses to mature Abs during the somatic hypermutation process. Four factors determine the residue type patterns: biases in the germline, accessibility from single base permutations, location of mutation hotspots, and functional pressures during selection. Germline repertoires at positions that commonly contact the Ag are biased toward tyrosine, serine, and tryptophan. These residue types have a high tendency to be present in mutation hotspot motifs, and their abundance is decreased during maturation by a net conversion to other types. The heavy use of tyrosines on mature Ab interfaces is thus a reflection of the germline composition rather than being due to selection during maturation. Potentially stabilizing changes such as increased proline usage and a small number of double cysteine mutations capable of forming disulfide bonds are ascribed to somatic hypermutation. Histidine is the only residue type for which usage increases in each of the interface, core, and surface regions. The net overall effect is a conversion from residue types that could provide nonspecific initial binding into a diversity of types that improve affinity and stability. Average mutation probabilities are ~4% for core residues, ~5% for surface residues, and ~12% for residues in common Ag-contacting positions, excepting those coded by the D gene. The Journal of Immunology, 2006, 177: 333–340.

Biogen Idec Inc., Cambridge, MA 02142

Received for publication November 4, 2005. Accepted for publication April 19, 2006.

The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.

1 Address correspondence and reprint requests to Dr. Louis A. Clark and Dr. Herman W. T. van Vlijmen, Biogen Idec Inc., 14 Cambridge Center, Cambridge, MA 02142. E-mail addresses: [email protected] and [email protected]

The versatility of the mammalian adaptive immune system rests on its ability to generate high-affinity Abs to foreign Ags. The set of naive B cells expresses a repertoire of surface-bound Ab precursors called B cell receptors or BCRs, a subset of which can bind most new Ags with low affinity. To make this possible, the V(D)J recombination process has evolved to mix V, D, and J genes during cell development to provide a large sequence diversity in the germline L and H chains. Considering only the number of human V, D, and J genes and neglecting a gain in diversity from the joining mechanisms, one gets an estimate of 320 L chain variable domain and 11,000 H chain variants. The assumption that any H chain can pair with any L chain yields 3.5 × 10^6 theoretical specificities (1). Additional adaptability comes from the somatic hypermutation process, where the recombined Ab genes undergo error-prone replication during an in vivo selection process. Those B cells that produce Abs with higher affinities for their Ags selectively proliferate, leading to a further enormous increase in diversity and a fine tuning of the affinity.

Much of somatic hypermutation research to date has focused on defining the molecular mechanism. This aspect has been reviewed recently (2–5) and, although much remains to be learned, certain sequence-related aspects are clear. AID (activation-induced cytidine deaminase) deaminates cytidine residues, converting them to uracil and initiating the somatic hypermutation process (6). Further processing of the local DNA may involve excision of the uracil base and repair by certain error-prone polymerases (see reviews in Refs. 3–5). The overall process favors single-base transitions over transversions at an ~3:1 ratio (7). Insertions and deletions also occur but are considerably less common (8, 9). Certain four-base DNA sequence motifs, called hotspots, are correlated with the mutation locations. The two most commonly cited four-base motifs are RGYW (10) and its inverse repeat, WRCY (11), where R denotes a purine base, Y a pyrimidine base, and W an A or a T base.

It is not always clear how the mutations selected during the affinity maturation process contribute to improving binding affinity or other properties such as selectivity or stability. Only a few studies exist that look at the effect of single somatic hypermutations (12). One expects the residues in contact with the Ag to be the most critical, but often large affinity improvements come from non-surface mutations. Daugherty et al. (13) used phage display on Escherichia coli to improve the affinity of a single-chain Fv molecule to cardiac glycoside digoxigenin and found that all of the affinity enhancements occurred at non-Ag-contacting residues. Similarly, Zahnd et al. (14) used ribosome display to improve single-chain Fv binding to a peptide Ag and found that the most effective mutation was not at the interface with the Ag. From our own work redesigning Ab-Ag interfaces (34), we find that even Ag-contacting mutation effects are not always rationalizable or predictable from three-dimensional structure-based energetic calculations. Thus, there is great potential to learn from the natural affinity maturation process.

The work presented here seeks to better define the results of the somatic hypermutation process and to investigate sequence-related aspects of protein-protein recognition. Given a mature sequence, a probable germline sequence resulting from the recombination of species-specific V, D, and J genes (15, 16) can be derived using sequence matching and consideration of the known mechanistic rules of V(D)J recombination (see Materials and Methods). Once the germline chain sequence is known, simple comparison with the aligned mature sequence yields the position and type of mutation that occurred during the affinity maturation process.

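The germline-to-mature comparison described in the Introduction can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `call_mutations`, the gap character, and the two toy sequences are hypothetical, and the sketch assumes both sequences have already been aligned to the same length.

```python
from typing import List, Tuple


def call_mutations(germline: str, mature: str, gap: str = "-") -> List[Tuple[int, str, str]]:
    """Return (position, germline_residue, mature_residue) for each mismatch.

    Positions are 1-based alignment columns. Columns where either sequence
    has a gap are skipped, because no germline precursor can be assigned.
    """
    if len(germline) != len(mature):
        raise ValueError("sequences must be aligned to the same length")
    mutations = []
    for pos, (g, m) in enumerate(zip(germline, mature), start=1):
        if g == gap or m == gap:
            continue
        if g != m:
            mutations.append((pos, g, m))
    return mutations


# Hypothetical toy sequences, for illustration only.
germline = "QVQLVQSGAEVKKPGAS"
mature = "QVQLVQSGTEVKRPGAS"
for pos, g, m in call_mutations(germline, mature):
    print(f"{g}{pos}{m}")
```

Each reported triple gives the position and the type of change (here A to T and K to R), which is the information tallied in the Results.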

Tomlinson et al. (17) have previously used a similar type of analysis to analyze the diversity of amino acids at specific positions in the germline and mature Ab sequences. They found that the frequency of somatic hypermutation and the diversity of the germline sequences is highest in the CDRs. Rather than focus on the mutation frequencies, we examine the type of mutation and its functional implications deduced from the location in the structure. The results indicate that residue type changes during the somatic hypermutation process are significant and have underlying functional rationales.

Materials and Methods

Germline V(D)J fitting procedure

The V(D)J germline fitting algorithm is intended to provide reasonable solutions for the germline Ab sequence based on the mature protein sequence for the large majority of cases. It is recognized that no such fitting procedure can be completely free of ambiguities. In some cases, the length of the D region or the lack of residues in many publicly available sequences gives insufficient information for a certain fit. The flexible joining mechanisms of the natural process also limit the reliability of the method. For example, the natural N nucleotide addition process inserts random bases into the sequence, making the concept of a precursor gene at those positions inapplicable. In all situations, the default algorithm behavior is to assume that there was no mutation and that the germline sequence at ambiguous positions is identical with the mature sequence.

The gene fitting algorithms for the H and L chains share some basic features. In each case, the mature protein sequence is converted to a most probable DNA sequence using mouse or human codon probability tables. Comparisons with the germline V(D)J gene sequences must be done with the DNA sequence, because the germline genes may be read in multiple reading frames (only D and J genes) depending on their position in the final sequence and because some sequences, especially D genes, are too short to reliably match using the reduced information in the protein sequences. Readers interested in the most exhaustive germline fitting procedure for the D gene region should consult the work of Monod et al. (18). Once the most probable mature DNA sequence is determined, V, J, and, if appropriate, D genes from the known library of genes are sequentially fit to the sequence. V(D)J genes are taken from the ImMunoGeneTics (IMGT) database (15, 16). The fitting is done exhaustively for individual gene possibilities by sliding the library gene along the mature DNA sequence and recording the number of matching bases at each offset. Although simple and easy to code, this approach is more reliable than a BLAST search because of the short sequences involved. It is affordable because of the small number of possible germline V(D)J genes; analysis of a L/H chain pair takes a couple minutes. Fitting for the whole data set can be done in 1–2 days on a computer cluster and could be done considerably faster if the method and code were to be optimized. It should be noted that a useful, fast, and accessible algorithm for blasting against the V genes (IgBLAST) is available on the National Center for Biotechnology Information website (www.ncbi.nlm.nih.gov/igblast).

The L and H chain matching is handled using a similar exhaustive procedure. In each case the V genes are matched first near the N terminus of the sequence, and the best match is retained. The unmatched portion of the sequence is then extracted, and the best J gene match is found using a similar procedure. For the L chains, this process is repeated for each possible V gene to account for the possibility that an overly long V gene was chosen because it simply adds extra base matches near the C-terminal end rather than improving the final fit. There are cases where a V gene with a lower match score leads to a better overall fit because it allows room for a better-fitting J gene. The H chain is handled similarly; for each possible V gene match the entire set of possible J genes is placed at the end of the sequence. For each V-J combination, the remaining unmatched sequence is fit to the D gene library. Most but not all H chains have a D gene segment inserted between the V gene and J gene portions.

Variability in the natural joining mechanisms is accounted for by allowing two amino acids worth of flexibility during fitting. For example, after the V gene segment is fit, the unmatched portion to which the J gene will be fit is extended two residues (six bases) of mature DNA sequence into the V gene region. This is done to allow for the possibility that part of the germline V gene was deleted during splicing. For a similar reason the unfit stretch of DNA in the D gene region is extended six bases into the bordering V and J gene regions. Even with this flexibility, many D genes are often excluded because the gap between the V and J genes is too small. Conversely, if there is a gap between segments, then bases unmatched by the best gene are always assumed to be identical with those in the mature sequence, and no mutations are recorded.

Algorithm testing was performed in cases where the V(D)J fitting procedure was performed manually. Additionally, D gene fitting was tested on cases where the DNA sequence was known. In the small number of cases examined, translating from the protein sequence to the most probable DNA sequence was sufficient to distinguish between available D gene sequences. Typically, one or two related D genes had match scores clearly separated from all other lower-scoring D gene matches.

In some cases the algorithm clearly returns nonsensical matches, as indicated by long stretches of poor base matches. The failures can be traced back to humanized Abs, to Abs from species outside the human and mouse libraries used, or to insertions/deletions. Germline sequence determinations requiring >20 amino acid mutations relative to the mature sequence are discarded on the assumption that the problem is inappropriate to the sequence or that the algorithm has failed. Sequences with more than five consecutive mutated residues are discarded for similar reasons. Some sequences have so little information in the D and J gene regions that fitting is impractical. For this reason, solutions having mutations at >40% of the D and J gene region positions are also discarded.

Calculation of mutation probabilities

A simple estimate for the probability of making a transition from residue type i to residue type j ($P_{ij}$) with a single base change can be calculated using the codon definitions and their known usage frequencies. A given number of codons, $M$ ($N$), code for each residue type i (j). For a given i to j transition, the probability is the sum of individual possible codon m to n transitions (Equation 1).

$$P_{ij} = \sum_{m=1}^{M} \sum_{n=1}^{N} \left(\frac{1}{12}\right) f_{im}\,\Delta_{mn} \qquad (1)$$

The factor (1/12) is the probability of picking one of the three bases in the codon (1/3) times the probability of picking the correct change (1/4). The frequency ($f_{im}$) is the fraction of residue type i that is coded using codon m in human genes. If codons m and n differ by only a single base, then such a transition could occur, and the function $\Delta_{mn}$ is unity. This simple transition probability ($P_{ij}$) can be renormalized using the residue type usage in the germline genes to form the background of Fig. 1A.

An estimate for the probability of random mutations leading to a cysteine pair capable of forming a new disulfide bond ($P_{ds}$) can be calculated from the probability of making a single cysteine mutation ($P_{cys} = 1/19$), the probability of a residue pair being close enough to form a disulfide ($P_{pair}$), and the average number of mutation pairs per chain ($N_{pairs}$) (Equation 2).

$$P_{ds} = N_{pairs}\,P_{cys}\,P_{cys}\,P_{pair} \qquad (2)$$

An average of eight mutations per chain yields 28 unique pairs ($N_{pairs}$). Approximately 4% ($P_{pair} \simeq 0.04$) of residue pairs have Cβ-Cβ separation distances of <6 Å. This leads to an estimate of 68 disulfide-producing events ($P_{ds} \simeq 0.003$) in the ~22,320 sequences examined.

Regional assignments from sequence position

All Ab sequences were first aligned to the set (AAAAA, Aho's Amazing Atlas of Antibody Anatomy) provided by Honegger et al. (19). We use this alignment and numbering system because it leaves gaps for almost all loop lengths, gives a structural correspondence between light and heavy positions, and provides numerical values for all positions. The Kabat system uses alphanumeric numbering for some loop positions and can be cumbersome in some cases.

Position definitions were derived from Protein Data Bank structures of Abs in contact with peptide or protein Ags. Ab-Ag interface positions are occupied by residues that face the Ag and are within 12 Å of the nearest Ag atom for at least one Protein Data Bank structure. Because the positions were assigned based on a small set of structures, residues occupying Ab-Ag positions do not always contact the Ag for every sequence and may be solvent exposed. Residue positions on the VL-VH interface were defined similarly using an 8-Å cutoff but were required to meet this requirement in at least half of the structures. Core positions are those that point inward to the core of the variable domain and typically expose <10 Å². The surface positions are taken to be the remainder of the occupied positions. Some positions may have more than one assignment, particularly those that may both contact the Ag and be present at the VL-VH interface.
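The exhaustive sliding match described under "Germline V(D)J fitting procedure" can be sketched as follows. This is an illustrative reading of the text rather than the authors' implementation: the names `best_offset` and `best_gene` and the toy library are hypothetical, and the sketch only considers placements in which the library gene lies entirely inside the target sequence, which is a simplification of the flexible end handling the paper describes.

```python
from typing import Dict, Tuple


def best_offset(gene: str, target: str) -> Tuple[int, int]:
    """Slide `gene` along `target`; return (offset, matching_bases) of the best placement.

    Returns (0, -1) when the gene is longer than the target, since no
    fully contained placement exists under this simplification.
    """
    best = (0, -1)
    for offset in range(len(target) - len(gene) + 1):
        window = target[offset:offset + len(gene)]
        score = sum(1 for a, b in zip(gene, window) if a == b)
        if score > best[1]:
            best = (offset, score)
    return best


def best_gene(library: Dict[str, str], target: str) -> Tuple[str, int, int]:
    """Return (gene_name, offset, matching_bases) for the best-scoring library gene."""
    name, (offset, score) = max(
        ((n, best_offset(seq, target)) for n, seq in library.items()),
        key=lambda item: item[1][1],
    )
    return name, offset, score


# Hypothetical toy library and target; real use would draw genes from IMGT.
library = {"J_a": "ACGTAC", "J_b": "TTGACC"}
print(best_gene(library, "GGACGTACTT"))
```

In the procedure described above, a step of this kind would be applied first to the V genes, then to the J genes on the unmatched remainder, and finally to the D genes for H chains.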
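Equations 1 and 2 can be evaluated with a short Python sketch. Two points are assumptions of this sketch and not part of the paper: codon usage is taken as uniform across the synonymous codons of each residue unless a usage table is supplied (the paper uses human codon usage), and all function names are hypothetical. With the rounded inputs quoted in the text, the disulfide estimate comes out near 69 events, close to the 68 reported.

```python
from itertools import product
from math import comb

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codon -> one-letter residue ('*' marks a stop codon).
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codons_for(residue):
    """All codons that encode the given one-letter residue."""
    return [c for c, aa in CODON_TABLE.items() if aa == residue]


def delta(m, n):
    """Delta_mn of Equation 1: 1 if codons m and n differ at exactly one base, else 0."""
    return 1 if sum(a != b for a, b in zip(m, n)) == 1 else 0


def p_transition(i, j, usage=None):
    """Equation 1. `usage` maps codon -> f_im; uniform usage is ASSUMED when omitted."""
    src, dst = codons_for(i), codons_for(j)
    total = 0.0
    for m in src:
        f_im = usage[m] if usage is not None else 1.0 / len(src)
        total += sum((1.0 / 12.0) * f_im * delta(m, n) for n in dst)
    return total


def p_disulfide(mutations_per_chain=8, p_cys=1 / 19, p_pair=0.04):
    """Equation 2, with N_pairs taken as the number of unique mutation pairs."""
    n_pairs = comb(mutations_per_chain, 2)
    return n_pairs * p_cys * p_cys * p_pair


print(p_transition("K", "R"), p_transition("R", "K"))
print(p_disulfide(), p_disulfide() * 22320)
```

Under the uniform-usage assumption the K to R probability is about three times the R to K probability, which reflects the 6:2 ratio of arginine to lysine codons.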

Results

A large number of mature Ab chain sequences are available from public databases (20–22), including Ig sequences from IgBLAST (subset of the “nr” database with similarity of at least 50% over one-third of germline V gene length). After removal of duplicates and sequences not labeled as mouse or human, 23,116 heavy and 11,095 λ or κ variable domain sequences were fit to known germline V, D, and J genes using the methodology described in the Materials and Methods section. Seventy-two percent of the light and 62% of the heavy sequences yielded plausible (see Materials and Methods section for definition) germline solutions and contributed to the results presented in this section. Approximately half of the sequences that contribute to the final dataset are human. All results are for the combined mouse and human datasets unless noted. Due to the higher uncertainty of fitting the short D gene segments to the mature sequence, all mutation information from the D gene region is omitted unless specifically noted.

Full dataset: mutation type and location

A comparison of the residue mutation type frequency with that observed in regular evolutionary relationships shows some basic similarity. Fig. 1A shows the prevalence of residue type change mutations in the full dataset (circles) using a color scale. For comparison, the expected mutation frequency has been calculated assuming a single DNA base change per codon and human codon usage. These expected frequencies are visible as the background color in each box. There is a clear qualitative relationship between expected and observed frequencies. An additional comparison with expected mutation frequencies comes from sequence alignment scoring matrices. Mutation frequencies can be derived from a multiple sequence alignment by observing residue type variability at a given position. If this information is compiled in a residue type-specific fashion and averaged over all positions, then a global view of how easily a given residue type can substitute for another can be compiled. The BLOSUM62 data provides such mutation frequencies derived from protein sequences with >62% sequence homology (23).

There is a similarly loose relationship between the observed frequency and the mutation frequency expected from the BLOSUM62 data as shown in Fig. 1B. From these tests we conclude that the residue type changes during somatic hypermutation are similar overall to those seen during evolution but may show deviation in the finer details.

FIGURE 1. Residue type change mutations for the full dataset. A, Number of germline (left column) to mature (top row) mutations for the full dataset, including human and mouse L and H chains. Example: the red circle in the center can be interpreted to mean that the S→T (germline→mature) type mutation occurs with a high frequency during the somatic hypermutation process. Amino acids are ordered by hydrophobicity according to the octanol scale. The background color in each box represents the expected pair mutation frequency assuming human codon usage, one base change per codon, and germline residue type frequencies (see Materials and Methods). B, Comparison between observed residue type change mutation frequency and that expected from the data used to derive the BLOSUM62 scoring matrix (23).

In this overall dataset, there is some conservation of hydrophobicity (clustering around the diagonal in Fig. 1A), a relatively small number of mutations involving proline, cysteine, and tryptophan amino acids, and a number of scattered higher-probability mutation types. Comparison of observed Ab (Fig. 1A, overlaid circles) with the expected (Fig. 1A, background squares) mutations indicates that at least some of the higher-probability mutation types, such as T→S and A→V, have similar frequencies. Others, such as F↔L and Y→G transitions, are less frequently and more frequently seen, respectively, in Abs. Further trends will become more pronounced in region-specific subsets of the data.

There is often a clear imbalance in the frequency of mutations from a given residue type to another (X→Y) compared with the reverse (Y→X). This imbalance is apparent in the lack of symmetry in the matrix and is presented in Fig. 2, which shows the same data processed to show deviations from the average of X→Y and Y→X entries. Part of the imbalance can be accounted for by differences in codon usage. For example, K→R mutations are 3.3 times more common than R→K mutations, and this difference could be explained by the 6:2 ratio of R to K codons and slightly lower relative usage of arginine in the germline. Other imbalances, such as A→V, have no codon number or germline usage bias. For comparison, the BLOSUM matrices and other residue substitution scoring matrices are symmetric by construction.

FIGURE 2. Deviations in amino acid somatic hypermutation symmetries from germline (left column) to mature (top row) mutations for the full dataset. Entry values ($M_{ij}$) are calculated from those in Fig. 1 ($A_{ij}$) using $M_{ij} = (A_{ij} - A_{ji})/(0.5A_{ij} + 0.5A_{ji})$. Original matrix entries were zeroed (filtered out) if $A_{ij}$ or $A_{ji}$ was <5% of the maximum entry. Amino acids are ordered by hydrophobicity according to the octanol scale.

Fig. 3 shows the probability of finding a mutation at a given position in the variable domain of the L chain or the H chain. Each residue position has been assigned to an Ab structural region using the sequence alignment and crystal structures (see Materials and Methods). There is a clear relationship between the position assignment and the mutation probability. Ab-Ag interface and solvent-exposed surface positions are more likely to be targets of productive (nonsilent) mutations. The distributions agree qualitatively with those from Tomlinson et al. (17), but it is now possible to see that the previously unanalyzed H3 loop has more mutations than the other CDR loops even when mutations in the D gene region are omitted. In agreement with these results, we also see a sizable mutation probability (>10%) for the first three positions on the N terminus of the L chain. Despite being outside of the traditional CDR loops, these residues are often in contact with the Ag

(results not shown). Average mutation frequencies broken down by region are 4, 7, 12, and 5% for the core, VL-VH interface, Ab-Ag interface, and surface, respectively.

FIGURE 3. Probability of finding a mutation at a given position in the variable domain of the L (top panel) and H (bottom panel) chains. Residues are numbered using the Honegger numbering scheme (19). The CDR loops are framed in red lines, and position definitions are given as colored bars below the histogram. The gray bars include the less reliable D gene region predictions for the H chain.

The germline sequences were analyzed for four-base RGYW and WRCY (see Ref. 3 for definitions) intrinsic mutation hotspots. There is a proximity relationship between hotspot and mutation position. Fig. 4 shows the distance expressed in the number of bases between the mutated codon center and the nearest base of a hotspot. For example, a distance of two bases means that the codon precedes the four-base hotspot and that they are adjacent with no bases separating them. At a distance of −1, the codon is after the hotspot with its first base forming the last base of the hotspot. The mean distances to the RGYW and WRCY hotspots are 12 and 10 bases, respectively, and mutations can be seen to be much more likely near hotspots. There is a clear periodicity of 3 in Fig. 4. This means that the first three bases (for example, RGY) of the hotspots tend to be in-frame. Specific analysis shows that 62% of RGYW hotspots and 70% of WRCY hotspots are in-frame.

Despite the tendency for mutations to be near hotspots within specific Ab sequences, there is no apparent relationship between the overall frequency of mutation at a given position and the probability of a hotspot occurring at that position. The CDRs do not have a higher density of intrinsic mutation hotspots (data not shown). It is not rare for the probability of finding a hotspot to approach unity, meaning that some residue positions are nearly always part of hotspots.

FIGURE 4. Distance from the mutation to nearest mutation hotspot. The distance is calculated as the difference between the base number at the center of the mutated codon and the closest base in the nearest hotspot of a given type. A periodicity of 3 is clearly seen, indicating that the beginnings of the hotspots tend to be in-frame. The zero-distance peak implies that the codon center is inside the hotspot. It has been normalized to account for the four degenerate possibilities.

Region-specific mutation type and location

The region of the variable domain where a mutation occurs strongly determines its effect, if any, on the function of the Ab. To a first approximation, the residues on the Ab-Ag interface determine specificity and affinity, whereas those in the core partially determine stability. Mutations at the interface between the two variable domains will also contribute to stability (24) and binding affinity (25) and may contribute to the relative orientation of the domains. Surface mutations outside the interfaces may affect solvation properties and aggregation propensity but are expected to have little effect as long as polar surface area is maintained.

Fig. 5 gives the distribution of the number of mutations within each region. The residue positions were distributed into the Ab-Ag interface, VL-VH interface, core, and surface regions using analysis of Protein Data Bank Ab-Ag complex structures as described in the Materials and Methods section. It can be seen that the mutations on the Ab-Ag interface span the broadest range, with a mean of 3.1 mutations per chain. Surface and VL-VH mutations are less frequent at 2.3 and 1.5 mutations per chain on average. Core mutations are least frequent at 1.3 mutations per chain. The distributions for the L and H chains are qualitatively similar (not shown). For the H chain, region-specific averages are ~30% higher than those in the L chain. On average, L and H chains have 6.8 and 8.6 mutations per chain, respectively.

FIGURE 5. Number of mutations per variable domain broken into contributions from each region. All D gene mutations are omitted from the statistics.

As a reference point for describing composition changes, Fig. 6 shows the region-specific amino acid compositions in the mature sequence. The germline compositions are qualitatively similar. As should be expected, polar residues are favored on the surface and hydrophobic residues are favored in the core regions. One of the largest contributions to Ab-Ag interface composition comes from tyrosine at 15%. Glycine and serine are also strong contributors at

11 and 18%, respectively. As a comparison for only the interface with the Ag, the data of Lo Conte et al. (26) from an analysis of Ab-Ag crystal structures are also shown. Their numbers are the percentages of the interface area contributed by each amino acid type and, thus, are lower in value for smaller residues such as glycine and serine.

FIGURE 6. Region-specific residue compositions of mature sequences with amino acids ranked according to the octanol hydrophobicity scale. Data from an analysis by Lo Conte et al. (26) of amino acid contributions to Ab-Ag interface area are displayed as lines.

The net creation and removal of amino acid types from the germline to the mature sequence varies considerably by residue type. Figs. 7 and 8 give the percentage increase of each amino acid type in the overall, surface, core, and Ab-Ag interface regions for mouse and human sequences, respectively. Some of the most striking net changes occur at the Ab-Ag interface. Tyrosine residues undergo the greatest change in absolute numbers, partially because they are so prevalent in the germline structure. They are reduced by 13% (10% in mouse and 18% in human) from their germline frequency. The loss is distributed throughout the variable domain but is more pronounced in the CDRs and, in particular, at Ag-contacting positions of the H3 loop. Tyrosine is over-represented in the germline JH genes for both human and mouse. This has been noted by Zemlin et al. (Fig. 3 of Ref. 27). One of the six human JH genes (JH6) and its alleles can code for strings of 4–6 tyrosines. The JH6 gene is used in 22% of the germline sequence solutions found in our analysis. The JH4 gene can code for two tyrosines separated by two residues and is used in 37% of human H chain germline gene fitting solutions. This overall tyrosine change is large enough to skew the average side chain size change at the H

FIGURE 7. Amino acid composition changes during the mouse somatic hypermutation process expressed as percentage changes from the germline. For reference, the expected direction (+ or −) of net change from the simple model (see Materials and Methods) is shown above each bar. Error bars are at one SD. Bars for rare residue types (<50 examples in applicable structural region of germline) are omitted.

Proline usage changes in L and H chains are seen mainly in turn regions and are qualitatively very similar between chains. Residue positions 15, 48, and 98 (Kabat light, 14, 40, and 80; Kabat heavy, 14, 41, and 84) are in the β-hairpin turns nearest the constant domain and show pronounced mutation activity. The turn in the L3 and H3 loops tends to gain prolines at position 135 (Kabat light, 95e; Kabat heavy, 100h). Similarly, prolines tend to appear at position 52 (Kabat light, 44; Kabat heavy, 45), which is a kinked position in the strand that forms part of the VL-VH interface.

Other interesting composition trends occur during maturation. The most striking is an increase in the relative number of cysteine residues on the surface. The low abundance of cysteines on the surface means that 181 net mutation events in the human dataset translate to a 178% increase (see Fig. 8B). One expects that additional cysteines could lead to creation of nonnative disulfide bonds and inhibit expression. Presumably, the new cysteines observed in this work do not have a destabilizing effect large enough to be selected against. In a small number (<50) of cases, two cysteines are apparently created, and in 12 cases they have sequence positions that could lead to the formation of an additional stabilizing disulfide bond. A probability estimate of this event occurring randomly indicates that six times more potential disulfide bonds should have been observed (see Materials and Methods). The discrepancy could be partially attributed to the observed order of magnitude lower probability of mutating to a cysteine than the random mutation probability used in the calculation. Histidine residues are gained consistently in each of the regions for both species except on the surface of the mouse Abs. There is an ~80% increase in histidine usage on both the surface and Ab-Ag interface of human Abs during maturation. For the mouse, the increase on the Ab-Ag interface is less pronounced at 36%. This increase in the number of histidines is the largest of the core residue position changes (Figs. 7C and 8C), where comparatively little composition change is seen.

FIGURE 8. Amino acid composition changes during the human somatic hypermutation process expressed as percentage changes from the germline. For reference, the expected direction (+ or −) of net change from the simple model (see Materials and Methods) is shown above each bar. Error bars are at one SD.

It is reasonable to ask whether the residue types in which the hotspots occur (RGYW or WRCY) bias the residue composition changes. According to Fig. 4, the highest probability location for a mutation is inside a hotspot. Consequently, one might expect a tendency for residue types that are consistently mutated to be more frequently present in hotspots. Fig. 9 shows the fraction of each germline residue type in each of the RGYW or WRCY motifs. More than 50% of the tyrosine residues in the germline are located in WRCY motifs. Generally, the residue types that are lost on the Ab-Ag interface (glutamine, serine, tyrosine, and tryptophan) all have some of the highest propensities to be coded in hotspots.

chain to a net loss of 3.4 heavy atoms over the whole Ab-Ag interface. For comparison, there is a net gain of 1.1 heavy atoms in the corresponding region of the L chain. Glutamine, serine, and tryptophan residues are also depleted on the interface. Several residue types are used more frequently in the mature Ab on the interface with the Ag. Phenylalanine is the most enhanced residue at the interface, increasing by ϳ80% (31% in mouse and 104% in human) from the germline composition. Proline usage also increases during maturation by 42% (32% in mouse and 50% in human), as does histidine (36% in mouse and 82% in human). FIGURE 9. Tendencies to find specific residue types in the mutation The residues showing these large increases in mature Ab use are hotspot motifs. The bar heights are the fraction of a specific germline almost absent in the germline but occur significantly in mature residue type that is coded within the four-base (two incomplete residues) sequences (see Fig. 6). motif. The Journal of Immunology 339
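The hotspot tally described for Fig. 9 can be illustrated with a minimal sketch. It assumes that a residue counts as "coded within" a hotspot when its codon shares at least `min_overlap` bases with an RGYW or WRCY match; that criterion, the function names, and the toy sequence are illustrative assumptions, not the procedure used in the original analysis.

```python
import re
from collections import defaultdict

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# IUPAC codes: R = A/G, Y = C/T, W = A/T.
# Zero-width lookaheads let overlapping motif occurrences all be found.
HOTSPOT_PATTERNS = {
    "RGYW": re.compile(r"(?=[AG]G[CT][AT])"),
    "WRCY": re.compile(r"(?=[AT][AG]C[CT])"),
}


def hotspot_mask(dna):
    """Return a per-base list: True where the base lies inside any motif match."""
    mask = [False] * len(dna)
    for pattern in HOTSPOT_PATTERNS.values():
        for match in pattern.finditer(dna):
            for i in range(match.start(), match.start() + 4):
                mask[i] = True
    return mask


def hotspot_fraction_by_residue(dna, min_overlap=1):
    """Fraction of each residue type whose codon overlaps a hotspot.

    Assumption (hypothetical criterion): a codon is 'in a hotspot' when at
    least `min_overlap` of its three bases fall inside a motif match.
    The sequence is read in frame 0; stop codons are skipped.
    """
    dna = dna.upper()
    mask = hotspot_mask(dna)
    total = defaultdict(int)
    in_hotspot = defaultdict(int)
    for start in range(0, len(dna) - len(dna) % 3, 3):
        residue = CODON_TABLE.get(dna[start:start + 3])
        if residue is None or residue == "*":
            continue
        total[residue] += 1
        if sum(mask[start:start + 3]) >= min_overlap:
            in_hotspot[residue] += 1
    return {residue: in_hotspot[residue] / total[residue] for residue in total}


# Toy sequence (hypothetical): codons AGC (S), TAC (Y), TGG (W), GGG (G).
toy = "AGCTACTGGGGG"
print(hotspot_fraction_by_residue(toy))
print(hotspot_fraction_by_residue(toy, min_overlap=2))
```

In the toy sequence, AGCT matches both motifs and TACT matches WRCY, so the serine and tyrosine codons lie fully inside hotspots while the tryptophan codon is touched by a single base. Raising `min_overlap` shows how sensitive the tally is to the overlap criterion.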

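Accessibility from single base changes is one of the four factors named at the outset, and it can be checked directly against the standard genetic code. The sketch below enumerates the residue types reachable from any codon of a given residue by exactly one base substitution. The function names are illustrative and are not taken from the original analysis.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def single_base_neighbors(codon):
    """All nine codons that differ from `codon` at exactly one position."""
    neighbors = []
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                neighbors.append(codon[:pos] + base + codon[pos + 1:])
    return neighbors


def reachable_residues(residue):
    """Residue types ('*' = stop) reachable from `residue` by one base change."""
    reachable = set()
    for codon, encoded in CODON_TABLE.items():
        if encoded == residue:
            for neighbor in single_base_neighbors(codon):
                reachable.add(CODON_TABLE[neighbor])
    reachable.discard(residue)  # synonymous changes are not residue changes
    return reachable


tyrosine_targets = reachable_residues("Y")
for target in "DHGNSF":
    print(target, target in tyrosine_targets)
```

For tyrosine (TAT, TAC) the reachable set is aspartate, histidine, asparagine, serine, phenylalanine, cysteine, and stop. Glycine is absent because every glycine codon (GGN) differs from the tyrosine codons at two positions.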
FIGURE 10. Intrinsic mutation probabilities for germline residue types on the Ab-Ag interface. Numbers given in the scale bar are the probabilities to mutate to the mature (column index) residue, given that the germline (row index) residue is a specific type. The tyrosine stripe is most pronounced in the H chain mutation map. There are higher than background tendencies for mutations to take tyrosine (Y) in the germline to aspartate (D), glycine (G), asparagine (N), serine (S), and phenylalanine (F). The background is identical with that used in Fig. 1, but is not normalized by germline residue type usage.

The most important region for Ab affinity maturation is the interface with the Ag. As seen in Figs. 7A and 8A, trends in the residue usage are apparent, but no information is offered about the specific mutations that occur from given germline residue types. Fig. 10 shows the intrinsic mutation probabilities for residues on the interface between the Ab and the Ag. On the Ab-Ag interface, some pronounced residue type changes are E↔D and K→R, which are charge conserving. Others, such as N↔D and Q→E, conserve size but change charge or change predominantly the hydrophobicity, such as S↔T and F→L.

A germline tyrosine residue at a potentially Ag-contacting position has a 13% chance to mutate to another residue type (see Figs. 7A and 8A for species-specific numbers). This tendency is also apparent in Fig. 10, which shows that tyrosines in the germline have a substantial intrinsic tendency to mutate. In the unnormalized Ab-Ag interface data, the horizontal tyrosine stripe is the most pronounced feature. Conversions from tyrosine and serine to other residue types are the dominant mutation types occurring on the Ab-Ag interface. When germline tyrosine is mutated, it has a tendency to convert to aspartate, histidine, glycine, asparagine, serine, and phenylalanine. Of these residue types, each can be reached with a single DNA base change, except glycine. When a germline serine is mutated, larger residues such as asparagine and threonine, both still capable of hydrogen bonding, are common replacements. Tryptophans are often converted to leucines and tyrosines, but also may become small residues such as serine and glycine.

Discussion

With the results presented here, it is possible to speculate on the underlying reasons for the mutation propensities and what they tell us about Ab maturation. Why is tyrosine, serine, and tryptophan usage on the Ab-Ag interface decreased during the maturation process? Similarly, why is there a substantial relative increase in the usage of histidine, proline, and phenylalanine?

One of the most striking results shown is the overall tendency for germline tyrosine, serine, and tryptophan residues to mutate to other residue types. Germline genes are biased to place these versatile residue types on the Ab-Ag interface. We speculate that the tyrosines are there to promote low-affinity binding of the naive germline Abs to new Ags. Tyrosine can provide hydrogen bonding and substantial hydrophobic interactions. Tryptophan is the most hydrophobic of residues and arguably the best for nonspecific interactions. Serine can hydrogen bond while minimizing potential steric conflicts. Both tyrosine and serine have a polar character and are likely to help maintain solution stability. In support of these ideas, Fellouse et al. (28, 29) have used phage display techniques to show that tyrosine and a combination of small flexible residues are sufficient to recognize a number of Ags, including human and mouse vascular endothelial growth factor molecules.

Not only does nature bias the interface with promiscuous binding residue types, but it provides a maturation strategy capable of making efficient and productive changes to them. Tyrosine, serine, and glycine residues are well represented in germline mutation hotspots (Fig. 9), increasing their tendency to mutate. The presence of serines in hotspots has been previously noted and suggested to be a strategy for improving the targeting of CDRs (30). If one examines into which residue types the tyrosines mutate, one sees that they most often become the small residues glycine and serine. Aspartate and asparagine, two short-chain charged and uncharged residues, are also favored, perhaps to provide ion-pair and H-bond interactions. Phenylalanine may be favored because of its similarity to tyrosine and greater hydrophobicity. Tryptophan often becomes arginine, glycine, serine, and leucine. It may be converted to eliminate steric conflicts and minimize nonspecific interactions that could lead to aggregation.

The low usage of histidine and methionine at the Ab-Ag interface (Fig. 6) is puzzling. Next to proline and cysteine, which often play specific structural and functional roles, histidine and methionine are the least used on the interface. The work of Lo Conte et al. (26) shows that these residues are used less on Ab-Ag interfaces than on other protein-protein interfaces. One possible explanation for the low use of histidine and methionine is the hypothesis that they have only recently been introduced into the amino acid repertoire on an evolutionary time scale (31). These amino acids may be underused relative to their potential. If under-utilization in the germline can explain their low usage, then one would expect their usage to increase in the mature Abs. Figs. 7 and 8 show that this is consistently the case for histidine in most regions, but only at the Ab-Ag interface for methionine. A more structurally oriented hypothesis can also be used to explain low methionine usage on the interface. Methionine has a long side chain and is likely to lose comparatively more entropy on binding relative to its gain in enthalpy, resulting in net lower binding affinity. Relative to methionine, histidine usage may be favored at interfaces, because it can provide both hydrogen bonding and hydrophobic interactions.

One may also speculate on the usage of certain amino acids to modify loop characteristics with benefits for Ab stability and affinity. Proline residues are frequently created on the Ab-Ag interface (Figs. 7A and 8A), particularly in the H3 loop where they could stabilize beneficial loop conformations. Loop preorganization should reduce the entropy cost of binding and increase affinity (see Ref. 32 for a possible example). Similarly, Ab maturation sometimes leads to the introduction of disulfide bonds, which may stabilize the Ig fold or result in subtle conformational changes that lead to higher affinity. An example may be the anti-gp120 peptide complex structure (1ACY), where an additional disulfide is formed between CDR H1 and H2, which actually contacts the Ag (33). It is also notable that many of the excess tyrosines on the Ab-Ag interface of the germline chains are converted to glycines (see Fig. 10). In this case, the creation of glycines may allow for the backbone flexibility necessary for the remaining tyrosines or other nearby Ag-directed side chains to better contact the Ag.

Ab-Ag interaction systems are ideal for examining factors and strategies for improving binding affinity at protein-protein interfaces. The trends uncovered and discussed in this work illuminate the strategies that nature uses to bias immature Ab properties and subsequently refine them during the affinity maturation process. Residue type changes are clearly biased (Fig. 1A) by which codons are accessible given a single base change. Serine, tyrosine, and tryptophan are over-represented in the germline at common Ab-Ag positions and are some of the most frequently coded in the known mutation hotspot motifs. Remaining influences on the residue type changes are presumed to result from functional selection during the affinity maturation process. These functional selection aspects make the somatic hypermutation process a good model for evolution on an accelerated time scale.

Acknowledgments

The advice and supporting environment provided to L.A.C. as postdoctoral fellow at Biogen Idec are greatly appreciated. Helpful comments from Juswinder Singh and Alexey Lugovskoy are appreciated.

Disclosures

L. A. Clark is a current employee and S. Ganesan, S. Papp, and H. W. T. van Vlijmen are past employees of Biogen Idec Inc. L. A. Clark and H. W. T. van Vlijmen are stockholders in the company.

References

1. Janeway, C. A., P. Travers, M. Walport, and M. Shlomchik. 2001. Immunobiology, 5th Ed. Garland, New York, p. 132.
2. Diaz, M., and P. Casali. 2002. Somatic immunoglobulin hypermutation. Curr. Opin. Immunol. 14: 235–240.
3. Franklin, A., and R. V. Blanden. 2004. On the molecular mechanism of hypermutation of rearranged immunoglobulin genes. Immunol. Cell Biol. 82: 557–567.
4. Neuberger, M. S., J. M. Di Noia, R. C. Beale, G. T. Williams, Z. Yang, and C. Rada. 2005. Somatic hypermutation at A.T pairs: polymerase error versus dUTP incorporation. Nat. Rev. Immunol. 5: 171–178.
5. Samaranayake, M., J. M. Bujnicki, M. Carpenter, and A. S. Bhagwat. 2006. Evaluation of molecular models for the affinity maturation of antibodies: roles of cytosine deamination by AID and DNA repair. Chem. Rev. 106: 700–719.
6. Pham, P., R. Bransteitter, J. Petruska, and M. F. Goodman. 2003. Processive AID-catalysed cytosine deamination on single-stranded DNA simulates somatic hypermutation. Nature 424: 103–107.
7. Betz, A. G., C. Rada, R. Pannell, C. Milstein, and M. S. Neuberger. 1993. Passenger transgenes reveal intrinsic specificity of the antibody hypermutation mechanism: clustering, polarity, and specific hot spots. Proc. Natl. Acad. Sci. USA 90: 2385–2388.
8. Wilson, P. C., O. D. Bouteiller, Y. J. Liu, K. Potter, J. Banchereau, J. D. Capra, and V. Pascual. 1998. Somatic hypermutation introduces insertions and deletions into immunoglobulin V genes. J. Exp. Med. 187: 59–70.
9. de Wildt, R. M., W. J. van Venrooij, G. Winter, R. M. Hoet, and I. M. Tomlinson. 1999. Somatic insertions and deletions shape the human antibody repertoire. J. Mol. Biol. 294: 701–710.
10. Rogozin, I. B., and N. A. Kolchanov. 1992. Somatic hypermutagenesis in immunoglobulin genes. II. Influence of neighbouring base sequences on mutagenesis. Biochim. Biophys. Acta 1171: 11–18.
11. Doerner, T., S. J. Foster, N. L. Farner, and P. E. Lipsky. 1998. Somatic hypermutation of human immunoglobulin H chain genes: targeting of RGYW motifs on both DNA strands. Eur. J. Immunol. 28: 3384–3396.
12. England, P., R. Nageotte, M. Renard, A. L. Page, and H. Bedouelle. 1999. Functional characterization of the somatic hypermutation process leading to antibody D1.3, a high affinity antibody directed against lysozyme. J. Immunol. 162: 2129–2136.
13. Daugherty, P. S., G. Chen, B. L. Iverson, and G. Georgiou. 2000. Quantitative analysis of the effect of the mutation frequency on the affinity maturation of single chain Fv antibodies. Proc. Natl. Acad. Sci. USA 97: 2029–2034.
14. Zahnd, C., S. Spinelli, B. Luginbuhl, P. Amstutz, C. Cambillau, and A. Plückthun. 2004. Directed in vitro evolution and crystallographic analysis of a peptide-binding single chain antibody fragment (scFv) with low picomolar affinity. J. Biol. Chem. 279: 18870–18877.
15. Lefranc, M. P., V. Giudicelli, C. Ginestoux, N. Bosc, G. Folch, D. Guiraudou, J. Jabado-Michaloud, S. Magris, D. Scaviner, V. Thouvenin, et al. 2004. IMGT-ONTOLOGY for immunogenetics and immunoinformatics (genes are from ImMunoGeneTics website: imgt.cines.fr). In Silico Biol. 4: 17–29.
16. Lefranc, M.-P., and G. Lefranc. 2001. The Immunoglobulin Facts Book. Academic Press, London.
17. Tomlinson, I. M., G. Walter, P. T. Jones, P. H. Dear, E. L. Sonnhammer, and G. Winter. 1996. The imprint of somatic hypermutation on the repertoire of human germline V genes. J. Mol. Biol. 256: 813–817.
18. Monod, M. Y., V. Giudicelli, D. Chaume, and M. P. Lefranc. 2004. IMGT/JunctionAnalysis: the first tool for the analysis of the immunoglobulin and T cell receptor complex V-J and V-D-J JUNCTIONs. Bioinformatics 20: I379–I385.
19. Honegger, A., and A. Plückthun. 2001. Yet another numbering scheme for immunoglobulin variable domains: an automatic modeling and analysis tool. J. Mol. Biol. 309: 657–670.
20. Johnson, G., and T. T. Wu. 2001. Kabat database and its applications: future directions (www.kabatdatabase.com). Nucleic Acids Res. 29: 205–206.
21. Wu, C. H., L. S. Yeh, H. Huang, L. Arminski, J. Castro-Alvear, Y. Chen, Z. Hu, P. Kourtesis, R. S. Ledley, B. E. Suzek, et al. 2003. The protein information resource. Nucleic Acids Res. 31: 345–347.
22. Bairoch, A., R. Apweiler, C. H. Wu, W. C. Barker, B. Boeckmann, S. Ferro, E. Gasteiger, H. Huang, R. Lopez, M. Magrane, et al. 2005. The universal protein resource (UniProt). Nucleic Acids Res. 33: D154–D159.
23. Henikoff, S., and J. G. Henikoff. 1992. Amino acid substitution matrices from protein blocks (ftp.ncbi.nih.gov/repository/blocks/unix/blosum/BLOSUM/). Proc. Natl. Acad. Sci. USA 89: 10915–10919.
24. Worn, A., and A. Plückthun. 1999. Different equilibrium stability behavior of ScFv fragments: identification, classification, and improvement by protein engineering. Biochemistry 38: 8739–8750.
25. Khalifa, M. B., M. Weidenhaupt, L. Choulier, J. Chatellier, N. Rauffer-Bruyere, D. Altschuh, and T. Vernet. 2000. Effects on interaction kinetics of mutations at the VH-VL interface of Fabs depend on the structural context. J. Mol. Recognit. 13: 127–139.
26. Lo Conte, L., C. Chothia, and J. Janin. 1999. The atomic structure of protein-protein recognition sites. J. Mol. Biol. 285: 2177–2198.
27. Zemlin, M., M. Klinger, J. Link, C. Zemlin, K. Bauer, J. A. Engler, H. W. J. Schroeder, and P. M. Kirkham. 2003. Expressed murine and human CDR-H3 intervals of equal length exhibit distinct repertoires that differ in their amino acid composition and predicted range of structures. J. Mol. Biol. 334: 733–749.
28. Fellouse, F. A., C. Wiesmann, and S. S. Sidhu. 2004. Synthetic antibodies from a four-amino-acid code: a dominant role for tyrosine in antigen recognition. Proc. Natl. Acad. Sci. USA 101: 12467–12472.
29. Fellouse, F. A., B. Li, D. M. Compaan, A. A. Peden, S. G. Hymowitz, and S. S. Sidhu. 2005. Molecular recognition by a binary code. J. Mol. Biol. 348: 1153–1162.
30. Wagner, S. D., C. Milstein, and M. S. Neuberger. 1995. Codon bias targets mutation. Nature 376: 732.
31. Jordan, I. K., F. A. Kondrashov, I. A. Adzhubei, Y. I. Wolf, E. V. Koonin, A. S. Kondrashov, and S. Sunyaev. 2005. A universal trend of amino acid gain and loss in protein evolution. Nature 433: 633–638.
32. Wedemayer, G. J., P. A. Patten, L. H. Wang, P. G. Schultz, and R. C. Stevens. 1997. Structural insights into the evolution of an antibody combining site. Science 276: 1665–1669.
33. Ghiara, J. B., E. A. Stura, R. L. Stanfield, A. T. Profy, and I. A. Wilson. 1994. Crystal structure of the principal neutralization site of HIV-1. Science 264: 82–85.
34. Clark, L. A., P. A. Boriack-Sjodin, J. Eldredge, C. Fitch, B. Friedman, K. J. Hanf, M. Jarpe, S. F. Liparoto, Y. Li, A. Lugovskoy, et al. 2006. Affinity enhancement of an in vivo matured therapeutic antibody using structure-based computational design. Protein Sci. 15: 949–960.