The selection of acceptable

Rajkumar Sasidharan*†‡ and Cyrus Chothia*

*Medical Research Council Laboratory of Molecular Biology, Hills Road, Cambridge CB2 2QH, United Kingdom; and †Molecular Biophysics and Biochemistry Department, Yale University, New Haven, CT 06520

Communicated by I. M. Gelfand, Rutgers, The State University of New Jersey, Piscataway, NJ, April 27, 2007 (received for review January 31, 2007)

We have determined the general constraints that govern sequence divergence in proteins that retain entirely, or very largely, the same structure and function. To do this we collected data from three different groups of orthologous sequences: those found in humans and mice, in humans and chickens, and in Escherichia coli and Salmonella enterica. In total, these organisms have 21,738 suitable pairs of orthologs, and these contain nearly 2 million mutations. The three groups differ greatly in the taxa from which they come and/or in the time that separates them from their last common ancestor. Nevertheless, the results we obtain from the three different groups are strikingly similar. For each group, the orthologous sequence pairs were assigned to six different divergence categories on the basis of their sequence identities. For categories with the same divergence, common accepted mutations have similar frequencies and rank orders in the three groups. With divergence, the width of the range of common mutations grows in the same manner in each group. We examined the distribution of mutations in protein structures. With increasing divergence, mutations increase at different rates in the buried, intermediate, and exposed regions of protein structures in a manner that explains the exponential relationship between the divergence of structure and sequence. This work implies that commonly allowed mutations are selected by a set of general constraints that are well defined and whose nature varies with divergence.

codon frequencies | distribution of mutations in protein structure | sequence-structure divergence

Author contributions: R.S. and C.C. designed research, performed research, analyzed data, and wrote the paper.
The authors declare no conflict of interest.
Abbreviations: ASA, accessible surface area; h_c, human_chicken; h_m, human_mouse; e_s, E. coli_S. enterica.
‡To whom correspondence should be addressed. E-mail: [email protected].
This article contains supporting information online at www.pnas.org/cgi/content/full/0703737104/DC1.
© 2007 by The National Academy of Sciences of the USA

The sequences of almost all proteins diverge, to a greater or lesser extent, during the course of evolution. Here we describe an investigation into the nature of the selection process that governs divergence in those proteins that retain entirely, or very largely, the same fold and function. We try to answer a number of questions for which, to our knowledge, there are few or no detailed answers at present. Does divergence of proteins in very different organisms involve the same or different acceptable mutations? How does the selection of mutations vary with divergence? What is the distribution of mutations in protein structures, and how does it vary with divergence?

Proteins adapt to mutations by changes in their structure (1–6). For homologous proteins that have mutated up to 60% of their residues, the accommodation of mutations occurs largely through small shifts in the relative position of regions that maintain the same or very similar conformations: there are relatively few insertions, deletions, or changes in local conformation. For proteins with divergences greater than this, the accommodation of mutations increasingly involves insertions, deletions, and/or changes in local conformation: for homologous pairs of proteins with sequence identities of 20%, it is common for regions that have different conformations to form half of each structure (2).

These observations mean that, to determine the nature of the mutations that are acceptable to proteins that retain the same function and fold, the best data come from pairs of orthologous proteins whose sequences have diverged by not more than ≈60% and are similar in length. Three groups of orthologous pairs of sequences are used in this work: those from humans and mice, those from humans and chickens, and those from Escherichia coli and Salmonella enterica. These groups have large differences. The first two involve proteins from higher eukaryotes, and the third involves those from prokaryotes. The time of divergence from the common ancestor of humans and chickens (310 million years) is more than three times that for humans and mice (90 million years) and for E. coli and S. enterica (≈100 million years) (7, 8).

Results and Discussion

Three Sets of Orthologous Proteins That Have Similar Structures. We performed two rounds of all-against-all comparisons of protein sequences using BLASTP (9) for three groups of organisms: human_mouse (h_m), human_chicken (h_c), and E. coli_S. enterica (e_s). The protein sequences predicted from the genome sequences of humans (10), mice (11), and chickens (12) were obtained from Ensembl (13), and those from the genome sequences of E. coli (14) and S. enterica (15) were obtained from the National Center for Biotechnology Information (16). For each group, pairs of sequences that made reciprocal best matches are taken to be orthologous. To check that the two sequences in each pair have not been subject to large insertions or deletions, we removed those pairs in which the length of the shorter sequence is <80% of that of the longer sequence. We also removed pairs where the sequence divergence is >60%.

These comparisons gave 11,160 pairs of h_m orthologs, 7,685 for the h_c group, and 2,893 for the e_s group. The total number of orthologous pairs is 21,738, and the total number of mutations found in these pairs is 1,955,309 (Table 1).

For each group, the sequence divergence of each pair was computed as the percentage of residues that are not identical. Then, according to their divergence, each pair was placed in one of six categories: <10%, 10–20%, 20–30%, 30–40%, 40–50%, or 50–60%. Thus, taken together, the three groups of orthologs have in all 3 × 6 = 18 categories. The distribution of h_m and e_s orthologs in these categories is exponential: many pairs are found in the <10% category, and decreasingly few are found in the other categories. For the h_c orthologs, the distribution is that of a skewed bell curve [see supporting information (SI)]. The different shapes of the distributions arise from the differences in times that have elapsed since the three pairs of organisms diverged from their common ancestor.

General Characteristics of the Mutations in the Three Groups of Orthologs. Here we describe the general characteristics of the mutations found in each category and determine the extent to which they are common to the three groups of orthologs.

When counting the frequency of a mutation type, we are not concerned with the direction of the mutation. Thus, a count of VI mutations in the h_m orthologs will include both those that

10080–10085 | PNAS | June 12, 2007 | vol. 104 | no. 24 | www.pnas.org/cgi/doi/10.1073/pnas.0703737104

Table 1. h_m, h_c, and e_s orthologs of similar size: The number of sequences, known structures, and mutations

No. of sequences and structures in ranges of different divergence

Orthologs  Quantity                           Total     <10%   10–20%   20–30%   30–40%   40–50%   50–60%

h_m        Sequences                         11,160    4,852    3,575    1,725      706      234       68
           Mutations in sequences           810,305  139,447  286,052  220,208  105,727   45,548   13,323
           Known structures                     275      131       80       39       17        7        1
           Mutations in structures           10,122    1,802    3,505    3,035    1,055      636       89
h_c        Sequences                          7,685    1,323    1,901    1,813    1,398      868      382
           Mutations in sequences         1,037,022   35,718  146,384  234,619  264,745  229,153  126,403
           Known structures                     225       48       58       55       38       17        9
           Mutations in structures           15,130      526    2,592    4,101    4,002    2,325    1,584
e_s        Sequences                          2,893    1,541      935      244       84       46       43
           Mutations in sequences           107,982   30,168   41,219   16,660    7,324    5,828    6,783
           Known structures                     495      313      146       27        4        2        3
           Mutations in structures           13,733    5,389    5,683    1,797      291      104      469
Total no. of sequences                       21,738    7,716    6,411    3,782    2,188    1,148      493
Total no. with known structure                  995      492      284      121       59       26       13
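The filtering and binning that produce the categories of Table 1 can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, the sequences are assumed to be already aligned, columns containing a gap character are assumed to be skipped, and the bin boundaries are assumed to be half-open (a pair at exactly 10% falls in 10–20%, and a pair at exactly 60% is kept in 50–60%).

```python
# Hypothetical helper names; the text specifies only the 80% length rule,
# the 60% divergence cut-off, and the six 10%-wide categories.
BINS = ["<10%", "10-20%", "20-30%", "30-40%", "40-50%", "50-60%"]


def passes_length_filter(len_a, len_b, min_ratio=0.80):
    """Keep a pair only if the shorter sequence is >= 80% of the longer one."""
    return min(len_a, len_b) >= min_ratio * max(len_a, len_b)


def divergence_percent(aligned_a, aligned_b):
    """Percentage of compared residues that are not identical.

    Assumption: both strings come from the same alignment, and columns
    that contain a gap ('-') in either sequence are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no residues to compare")
    differing = sum(1 for a, b in pairs if a != b)
    return 100.0 * differing / len(pairs)


def divergence_category(percent):
    """Return the category label, or None if the pair diverges by more than 60%."""
    if percent > 60.0:
        return None
    return BINS[min(int(percent // 10), len(BINS) - 1)]
```

For example, `divergence_category(divergence_percent("ACDEFGHIKL", "ACDEFGHIKV"))` places a pair with one difference in ten residues in the 10–20% category under the boundary convention assumed here.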

have a valine in the human sequence and an isoleucine in the mouse sequence and vice versa. Given this definition, there are 20 × (20 − 1)/2 = 190 different mutation types.

All, or almost all, of the 190 possible types of mutations are observed in each category. As described above, the orthologs in each group are assigned to one of six divergence categories (Table 1). In Table 2 we list the frequencies of the different types of mutation that are seen in the <10% category of the h_m group. Inspection of this table shows that all 190 possible mutation types occur (although to very different extents; see below). Examination of the other 17 categories shows that in another 10 categories all types of mutations are found (see SI).

The other seven are the <10% category in the h_c group and all of the six categories in the e_s group. For these we find that between two and six of the 190 possible types of mutation are absent (see SI). The absent mutations are as follows, with the number of categories in which the mutation type is not observed given in parentheses: KC (two), KF (one), HC (one), PC (one), CM (three), KW (three), DW (five), NW (one), PW (three), and IW (three). Of these 10 types of mutations, all require a change in all three nucleotides and nine involve C or W residues, which are the two residues that occur least frequently in proteins. These mutations also involve radical changes in chemical character and/or shape. The absence of these rare mutations in the e_s categories most probably arises because the total number of mutations in these categories is much smaller than those in the h_m and h_c categories (Table 1).

The frequencies of mutations have an exponential character that is related to the extent to which the sequences have diverged. An examination of the frequencies of different mutations in the 18 categories shows

Table 2. The 139,447 mutations in the 4,852 h_m orthologs whose sequence identities have diverged by up to 10%

         K      R      H      Q      E      D      N      S      T      A      G      P      C      V      I      L      M      F      Y
s K
s R  5,235
s H     87  1,639
s Q  1,173  2,337  2,057
s E    853    173     79  1,301
s D     92     85    225     92  7,942
s N    743    149    876    112    155  1,588
n S    237    708    245    208    195    372  6,965
n T    487    261     72    113    158    108  1,326  5,433
n A    149    184     63    199    874    509    256  5,830 10,021
n G    142  1,085     73    137  1,218  1,122    515  5,141    424  3,044
n P     76    481    497  1,593    127     52     78  5,073  1,888  2,918    156
n C     13    404    119     30     12     20     53  1,082     58     96    356     65
b V     67     88     31     74    325    117     43    445  1,048  5,531  1,017    263     51
b I     77     68     31     39     51     22    156    390  1,748    523     96     83     26  8,995
b L    109    479    354    743    104     44     63  1,064    313    489    148  2,128    100  2,912  2,229
b M    135     98     17     49     54     11     29    126  1,203    339     69     60     13  2,524  1,513  2,135
b F     31     35     64     29     25     14     26    461     58     88     46     74    200    337    295  2,044     51
b Y     20     66    954     53     23     90    112    253     22     32     24     35    505     46     30     97      8  1,119
b W     15    266     12     55     22      4      9     74     11     23     79     24     96     19     13    122      9     27     27

s, hydrophilic residues; n, neutral residues; b, hydrophobic residues; boldface in the original, the 30 most frequent mutations (those with counts of 1,203 or more).
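Counts like those in Table 2 treat a mutation and its reverse as one type, which is why 20 residues give 190 types. A minimal sketch of such direction-independent counting is shown below; the function names are hypothetical, the input is assumed to be pairs of already aligned sequences, and naming each type with its residues in the row order of Table 2 is a convention chosen here, not necessarily the one the text uses when it names mutations.

```python
from collections import Counter
from itertools import combinations

# Residue order taken from the header of Table 2.
AMINO_ACIDS = "KRHQEDNSTAGPCVILMFYW"


def mutation_types():
    """All unordered pairs of different residues: 20 * 19 / 2 of them."""
    return [a + b for a, b in combinations(AMINO_ACIDS, 2)]


def count_mutations(aligned_pairs):
    """Count mutation types without regard to direction.

    Assumption: each element of aligned_pairs is a pair of equal-length
    aligned sequences; positions holding gaps or non-standard residue
    codes are skipped.
    """
    order = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    counts = Counter()
    for seq_a, seq_b in aligned_pairs:
        for a, b in zip(seq_a, seq_b):
            if a == b or a not in order or b not in order:
                continue
            first, second = sorted((a, b), key=order.__getitem__)
            counts[first + second] += 1
    return counts
```

With this convention, `count_mutations([("VI", "IV")])` records two mutations of the single type `"VI"`, one for each direction of the valine and isoleucine exchange.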

that to good approximation they fit exponential distributions with only small deviations at the top and bottom of the distribution. These distributions have an R² value that ranges from 0.96 to 0.98. In categories that have low levels of divergence, the more common mutation types occur with a frequency that is greater than is found in categories that have high divergence. This is seen in the exponents of the distributions. For the three <10% categories, the exponents of the distribution are in the range of −0.0325 ± 0.0018. For the three 50–60% categories, the exponents are in the range of −0.0206 ± 0.0009 (see SI for additional data).

To determine the implications of this behavior in more detail, we first ranked the mutation types in each category in descending order of their frequencies. We then plotted for each category the cumulative contribution to the total number of mutations made by the ranked mutations as we move from the most frequent to the least frequent. Fig. 1 shows the results of this calculation for the six categories in the h_c group. Plots for the h_m and e_s categories are identical in appearance (see SI). They show that for the three <10% divergence categories, the 28 or 29 most frequent mutation types form 75% of all mutations. For the three 50–60% categories, between 64 and 67 of the most frequent mutation types make up this percentage. Intermediate categories have intermediate numbers of mutation types that form this percentage (Fig. 1).

[Figure: Human-Chicken Mutations. x axis, Mutation Rank Order (1–180); y axis, Cumulative Percentage (0–100); one curve for each divergence category: <10%, 10–20%, 20–30%, 30–40%, 40–50%, 50–60%.]

The number of frequent mutation types that form 75% of all mutations in sets of orthologs whose divergences differ

Divergence of ortholog datasets   0–10%   10–20%   20–30%   30–40%   40–50%   50–60%
human-mouse                          29       37       44       52       59       65
human-chicken                        28       38       46       54       61       67
E. coli-S. enterica                  29       37       45       50       57       64

Fig. 1. The cumulative contributions made by mutations when placed in descending rank order of their frequencies. To produce the figure, for the six h_c categories, each type of mutation was placed in descending rank order according to its frequency. Then, moving down the list, the frequencies were summed, and their cumulative contributions were calculated as a percentage of all mutations. In the <10% h_c category, the 28 most frequent mutation types form 75% of all mutations; for the 50–60% category, this proportion is formed by the 67 most frequent types. The numbers of frequent mutation types that form this proportion in the intermediate h_c categories and in the h_m and e_s categories are given in the table.

Similarities and Differences in the Rank Order of Mutations. In this section we investigate the extent to which individual mutations occur with similar frequencies in the different groups. Close similarities would imply that they are subject to similar constraints. We pay particular attention to the sets of mutation types that occur with high frequency and make the major contribution to the total number of mutations. For mutations with low frequencies, small differences in their absolute number can produce large but not very significant differences in their rank order.

We describe calculations carried out on the categories that are the least and most divergent, i.e., the <10% and the 50–60% sets. As might be expected from the data discussed in the previous section, calculations on mutations in the intermediate categories give intermediate results.

Correlation of the rank order of mutation types. Inspection of h_m, h_c, and e_s categories that have the same divergence shows that many of the individual mutation types have a similar rank order. To obtain an overview of these similarities, we determined the correlation coefficients of the rank orders of the 190 mutation types in the different categories. For the <10% categories, the correlation coefficient of the rank orders in the h_m and h_c categories is 0.96; for h_m and e_s it is 0.88, and for h_c and e_s it is 0.91. The same calculation for the 50–60% categories shows that the correlation coefficient for h_m and h_c is 0.96, for h_m and e_s it is 0.82, and for h_c and e_s it is 0.89.

The rank order of the frequent types of mutation. We next examined the rank order of the frequent mutation types in the three groups. The overall position of each mutation type in the three <10% categories and the three 50–60% categories was first determined. To do this we summed the rank order of each type of mutation in the categories that have the same divergence. The mutation types were placed in ascending order of this sum. In Fig. 2 we plot the overall rank order of the 30 most frequent mutations in the <10% and 50–60% categories against the rank order of these mutations in the individual h_m, h_c, and e_s categories.

The 30 most frequent overall mutation types in the <10% categories form some 75% of all mutations. Of these 30, there are 26 or 27 types that are also among the 30 most frequent in the individual h_m, h_c, and e_s categories (Fig. 2). This means that the same small subset of mutation types forms close to 75% of all mutations in all of the three <10% categories. Their rank positions differ, on average, by four positions in the h_m and h_c <10% categories, by nine positions in the h_m and e_s categories, and by eight positions in the h_c and e_s categories (see Fig. 2).

The 65 most frequent overall mutation types in the 50–60% categories form some 75% of all mutations. Of these, there are 59, 61, and 58 that are also among the top 65 of, respectively, the h_m, h_c, and e_s categories. Thus, as before, the same subset of mutation types forms close to 75% of all mutations in each of the three categories. The rank positions of the 65 types of mutations are roughly similar: they differ, on average, by eight positions in the h_m and h_c 50–60% categories, by 18 positions in the h_m and e_s categories, and by 13 positions in the h_c and e_s categories.

Changes in the rank order of mutations as a function of divergence. Examination of the 30 most frequent mutations in the <10% and 50–60% categories (Fig. 2) shows that 22 mutation types are common to both sets. Some of these common mutations have systematic differences in their rank order in the two categories. For example, the mutation NS in the h_m, h_c, and e_s <10% categories is found at rank positions 4, 5, and 10, respectively, whereas in the three 50–60% categories the rank positions are 18, 13, and 14. Conversely, VL is found at positions 13, 12, and

Fig. 2. The overall rank position of the 30 most frequent mutation types in the <10% (Left) and the 50–60% (Right) categories. Along the x axis we list, in descending order, the mutation types that, overall, are the 30 most frequent in the <10% and 50–60% categories. The overall positions are determined by summing, for categories with the same divergence range, the rank order of each mutation type in the h_m, h_c, and e_s data sets (see text). The rank order of the 30 mutations in the individual h_m, h_c, and e_s categories is given by the numbers on the y axis.
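The rank summation described in this caption can be sketched as follows. This is an illustration under stated assumptions rather than the authors' procedure: the function names are hypothetical, each group is assumed to be given as a mapping from mutation type to count, and ties are broken alphabetically, a choice the source does not specify. The counts in the usage note are invented toy values.

```python
def rank_order(counts):
    """Map each mutation type to its rank, with 1 for the most frequent.

    Assumption: ties in frequency are broken alphabetically by type name.
    """
    ordered = sorted(counts, key=lambda m: (-counts[m], m))
    return {m: position + 1 for position, m in enumerate(ordered)}


def overall_order(groups):
    """Order mutation types by the sum of their ranks across groups.

    groups is a list of count mappings (for example h_m, h_c, and e_s
    categories of the same divergence) that share the same keys.
    """
    ranks = [rank_order(counts) for counts in groups]
    keys = set(ranks[0])
    if any(set(r) != keys for r in ranks):
        raise ValueError("all groups must cover the same mutation types")
    rank_sums = {m: sum(r[m] for r in ranks) for m in keys}
    return sorted(rank_sums, key=lambda m: (rank_sums[m], m))
```

With toy counts `{"AT": 100, "VI": 90, "NS": 50, "KW": 1}`, `{"AT": 80, "VI": 95, "NS": 60, "KW": 2}`, and `{"AT": 70, "VI": 60, "NS": 65, "KW": 0}`, the rank sums are 4, 6, 8, and 12, so `overall_order` returns the types in the order AT, VI, NS, KW.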

13 in the <10% categories and at positions 6, 6, and 3 in the 50–60% categories (Fig. 2). The frequencies of the observed mutations are the number of mutations that remain after selection. Selection for the <10% categories is, of course, more rigorous than that for more divergent categories, which allow larger structural changes (2). Thus, these differences in rank order imply that, before selection, VL mutations actually occur more frequently than NS, but for highly conserved proteins VL is less acceptable than NS and has been removed by selection to a greater extent.

Eight mutation types, QH, VM, NT, IM, TI, RH, NH, and QP, are present in the top 30 of the <10% categories but not in the top 30 of the 50–60% categories. Conversely, EK, TV, LS, AL, ES, DS, KN, and EG occur in the top 30 of the 50–60% categories but not in the top 30 of the <10% categories (Fig. 2). Note that those unique to the <10% categories conserve their chemical character, shape, and/or volume to a greater extent than those unique to the 50–60% categories. The mutations TV, AL, ES, and DS require at least two nucleotide changes. Most of these are likely to be produced by a combination of frequent single mutations: for example, the combination of the frequent single-nucleotide mutations TA and AV will produce TV mutations. For the unique mutations that can be produced by single-nucleotide changes, the process behind their presence or absence in the two categories is, of course, the same as that outlined for NS and VL in the previous paragraph.

Variations in the mutation frequencies related to codon bias. Most of the frequent types of mutations in the categories with the same divergence have similar rank orders, but there are a few that do not. Differences in the total number of codons for an amino acid, and in the relative frequencies of the different codons of those residues that have more than one, might be expected to affect the extent to which mutations can occur.

In different genomes, identical codons can occur with different frequencies (17). Previous work has shown that the biases in the frequencies of codons are correlated with biases in the frequencies of their tRNAs. As a result of this, the expression of a protein is more efficient if the coding regions of its gene contain a high proportion of the more frequent codons (18–20). Thus, in relation to expression, mutations between codons with similar frequency would have a close-to-neutral effect. If efficient expression is important for a protein, a mutation that substituted a rare codon for a frequent one would be deleterious to some degree, whereas the substitution of a frequent codon for a rare one could be advantageous for expression but would be uncommon because of the low number of rare codons.

Codon frequencies in humans, mice, and chickens are very similar (the correlation coefficients are 0.96–0.99). The codon frequencies in E. coli and Salmonella are similar to each other (the correlation coefficient is 0.83), but some are very different from those in the vertebrates (the correlation coefficients between the two sets of codons are between 0.44 and 0.57).

In the h_m, h_c, and e_s <10% categories, the mutation SP is found at rank positions 10, 11, and 35, respectively (Fig. 2). Proline and serine have four codons each. In Table 3 we list these codons, their frequencies in the different genomes, and an indication of the four single-nucleotide changes that give SP mutations.

The total frequency of serine and proline codons in the three animal genomes is 110 per 1,000, and in the bacterial genomes it is 78, i.e., 30% smaller (see Table 3).

The individual frequencies of the four proline codons in the animals are quite different from those in the bacteria. In animals the four single-nucleotide SP mutations involve transitions between codons of similar high (three cases) or low (one case) frequencies (Table 3). In the bacteria, on the other hand, the four SP mutations involve transitions between codons that have low frequencies or between one with a high frequency and one with a low frequency. This means that in the e_s orthologs single-nucleotide SP mutations are likely to be less frequent than those in h_m and h_c orthologs.

Table 3. The frequency of codons that produce SP mutations by change in a single nucleotide

Codon frequencies (per 1,000 codons)

Residue  Codon    Human  Mouse              Human  Mouse    Codon  Residue

S        UCU       15.1   16.1      ↔        17.4   18.4    CCU    P
         UCC       17.7   18.1      ↔        19.9   18.3    CCC
         UCA       12.2   11.6      ↔        16.9   17.1    CCA
         UCG        4.5    4.3      ↔         7.0    6.2    CCG

Residue  Codon  E. coli  S. enterica      E. coli  S. enterica  Codon  Residue

S        UCU        8.5   19.5      ↔         7.0    6.3    CCU    P
         UCC        8.6   15.4      ↔         5.5    3.9    CCC
         UCA        7.1    8.6      ↔         8.5    6.3    CCA
         UCG        8.9    6.5      ↔        23.3   11.9    CCG
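The totals quoted in the text can be checked against Table 3 with a few lines of arithmetic. The sketch below uses only the values printed in the table (human and mouse for the animals, since those are the columns shown); the dictionary layout and function names are illustrative choices, not part of the source.

```python
# Codon frequencies per 1,000 codons, copied from Table 3.
# Serine codons are listed in the order UCU, UCC, UCA, UCG and
# proline codons in the order CCU, CCC, CCA, CCG.
TABLE3 = {
    "human": {"S": [15.1, 17.7, 12.2, 4.5], "P": [17.4, 19.9, 16.9, 7.0]},
    "mouse": {"S": [16.1, 18.1, 11.6, 4.3], "P": [18.4, 18.3, 17.1, 6.2]},
    "E. coli": {"S": [8.5, 8.6, 7.1, 8.9], "P": [7.0, 5.5, 8.5, 23.3]},
    "S. enterica": {"S": [19.5, 15.4, 8.6, 6.5], "P": [6.3, 3.9, 6.3, 11.9]},
}


def total_frequency(genome):
    """Summed frequency of the listed serine and proline codons."""
    entry = TABLE3[genome]
    return sum(entry["S"]) + sum(entry["P"])


def mean_total(genomes):
    """Average of the summed frequencies over several genomes."""
    return sum(total_frequency(g) for g in genomes) / len(genomes)


animal = mean_total(["human", "mouse"])
bacterial = mean_total(["E. coli", "S. enterica"])
reduction = 1.0 - bacterial / animal  # fraction by which bacteria are lower
```

The animal totals come to about 110 per 1,000 and the bacterial totals to about 78, a reduction of roughly 30%, in line with the figures given in the text.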

Examination of the other mutations that have large differences in their rank positions indicates that, in several cases, differences in codon frequency are a contributing factor. In a few cases, however, rank differences cannot be linked in a simple way to codon frequencies.

Distribution of Mutations in Different Regions of Protein Structures. In this section we discuss the distribution of mutations in protein structures and how this distribution varies with sequence divergence. To obtain data for this analysis, the sequences that form the three groups of orthologs were matched to the entries in the Protein Data Bank (21) to find which of them have had their structures determined. We were able to find at least one structure for 275 orthologous pairs of the h_m group, for 225 orthologous pairs of the h_c group, and for 495 orthologous pairs of the e_s group. The total number of structures, 995, matches 5% of the total orthologous pairs. Although this proportion is small, the constancy of the results described below indicates that they are likely to be representative of a large proportion of the orthologous pairs in the three groups.

Buried, intermediate, and exposed regions in proteins. Examination in 1965 of the first pair of homologous proteins to have their structure determined, myoglobin and hemoglobin, clearly showed that sites buried within the protein are less susceptible to mutations than those on the surface (22). This observation has been confirmed by numerous subsequent studies. It implies that the structural property that is most directly related to mutability is the extent to which tertiary interactions restrict adaptation to change in the size, shape, and/or chemical character of residues produced by mutations. (Residues directly involved in function will, of course, be more sensitive to mutation, but the number of such residues is usually small, and, in this article, the proteins being considered have conserved functions.)

Accessible surface area (ASA) is a measure of the extent to which atomic groups or residues in proteins are accessible to the solvent or buried in the interior (23). Previous analyses of homologous proteins have shown that residues whose ASAs are <20 Å² tend to be conserved to a greater extent than those whose ASAs are larger (1, 24–26). Based on these observations we defined a residue at a site as being in one of three regions: "buried" if the ASA value is between 0 Å² and 20 Å², "intermediate" if the ASA value is between 20 Å² and 60 Å², and "exposed" for those with ASA values ≥60 Å².

In our set of structures, the average protein has some 300 residues, of which 42% are buried, 25% are intermediate, and 33% are exposed.

The ASA of mutation sites. On the basis of the ASA values of the residues in the structures, we assigned the sites subject to mutations to the buried, intermediate, or exposed region. At sites not buried in the interior, mutations will produce residues with different ASAs. However, the average value of the change in ASA for acceptable mutations is ≈27 Å² (26). This means that most mutations are likely to either leave the site in the same ASA category or move it to an adjacent one.

We assigned 38,985 mutation sites to one of the three ASA regions. Of these, 10,122 come from the h_m group, 15,130 come from the h_c group, and 13,733 come from the e_s group.

Distribution of mutations. For each structure we counted, in the buried, intermediate, and exposed regions, the total number of residues and the number of mutations. Using these data we then determined the proportion of residues that is mutated in each region. This calculation was carried out for all six divergence categories in each of the three groups. To give one example: in the buried regions of structures in the <10% categories, the proportion of mutated residues is 2.0% in the h_m group, 2.8% in the h_c group, and 2.5% in the e_s group. Thus, the midpoint and the range of these values is 2.4 ± 0.4%. In Table 4 we give the results of this calculation for each ASA region and each divergence category.

Inspection of the data in Table 4 shows that h_m, h_c, and e_s categories with the same divergence have very similar distributions of mutations in the three ASA regions. On average, the

Table 4. Proportion (%) of residues that are mutated in different protein regions

ASA of the       Proportion of residues      Sequence divergence categories,* %
mutation         in each region of the
sites, Å²        average protein, %          <10          10–20        20–30        30–40        40–50        50–60

0–20             42                          2.4 ± 0.4    6.7 ± 1.0    12.4 ± 1.2   17.7 ± 1.8   29.3 ± 0.7   39.6 ± 2.8
20–60            25                          5.0 ± 0.6    13.5 ± 0.4   23.7 ± 1.5   33.2 ± 0.9   44.0 ± 1.7   55.7 ± 1.9
60+              33                          9.8 ± 1.8    24.3 ± 1.1   38.2 ± 1.2   49.8 ± 1.2   57.5 ± 0.2   67.8 ± 1.1

For the four categories <10%, 10–20%, 20–30%, and 30–40%, the values are derived from structures of h_m, h_c, and e_s orthologs. For the 40–50% category there are only sufficient structures available for h_m and h_c proteins, and for the 50–60% category there are only sufficient structures for h_c and e_s structures (see Table 1).
*Note that the proportion of mutated residues increases 16.5-fold between the values for the <10% and 50–60% categories when ASA is 0–20 Å². For ASA values of 60+ Å², the proportion of mutated residues increases 7-fold.
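The fold increases quoted in the footnote follow directly from the first and last columns of Table 4. The short sketch below recomputes them from the tabulated midpoints; the dictionary keys and variable names are illustrative, and the uncertainty ranges in the table are ignored.

```python
# Midpoint proportions (%) of mutated residues from Table 4, given as
# (value in the <10% category, value in the 50-60% category).
TABLE4_ENDPOINTS = {
    "buried": (2.4, 39.6),        # ASA 0-20 square angstroms
    "intermediate": (5.0, 55.7),  # ASA 20-60 square angstroms
    "exposed": (9.8, 67.8),       # ASA 60+ square angstroms
}

# Factor by which the mutated proportion grows from the least to the
# most divergent category in each region.
fold_increase = {
    region: high / low for region, (low, high) in TABLE4_ENDPOINTS.items()
}

# Ratio of exposed to buried mutated proportions at low divergence.
exposed_to_buried_low = TABLE4_ENDPOINTS["exposed"][0] / TABLE4_ENDPOINTS["buried"][0]
```

The ratios come to about 16.5 for buried sites, about 11 for intermediate sites, and about 7 for exposed sites, and the exposed region starts with roughly four times the mutated proportion of the buried region.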

range about the mean is ±1.2% and there is only one entry where it is >2.0%. The number of structures available for the 40–50% and 50–60% categories is small, 26 and 13, respectively (Table 1), but the data they provide clearly follow the trend observed in categories that have many more structures. These results show that the distribution of mutations in the structures, and the way in which it changes with divergence, follows a remarkably similar pattern in all three groups.

A striking feature is how, with an increase in divergence, mutations increase at different rates in the three regions. On going from the <10% category to the 50–60% category, the proportion of residues with mutations in the buried regions increases by a factor of ≈16.5, in the intermediate regions by a factor of ≈11, and in the surface regions by a factor of ≈7 (Table 4).

This behavior arises from a combination of two factors. The first is how residues are distributed between the three regions: the average structure has some 300 residues of which 126 are buried, 75 are intermediate, and 99 are exposed. The second is that, at low divergence, the mutations in the exposed region are approximately four times more acceptable than mutations in the buried region (Table 4). This means, of course, that initially mutations accumulate rapidly in the exposed region. But, as divergence increases, new surface mutations will increasingly occur in residues that have already been mutated. In the buried region, by contrast, which has more residues and a smaller proportion with mutations, new mutations will mainly occur at new sites.

The divergence of structure and sequence in homologous proteins is related by the exponential relationship $\Delta = 0.40e^{1.87H}$, where $\Delta$ is the root mean square difference (in angstroms) in the position of main chain atoms that have the same local conformation and $H$ is the proportion of mutated residues (2). Residues at buried sites tend to have more contacts than those on the surface, and their mutation will usually have a greater effect on structure. The high relative rate with which mutations increase, with divergence, in the buried regions is the basis for this exponential relationship.

Conclusion

The functions and structure of individual proteins impose different constraints on their evolution. However, the very similar overall patterns of divergence that we have seen in three very different groups of orthologs show that individual responses of most proteins are variations on a common set of selective constraints. These constraints on proteins govern the types of frequent mutations that are acceptable, their distribution in protein structures, and how they change with divergence. Their features, which have been described above in some detail, can be summarized in the following terms: (i) All types of mutations are acceptable in proteins of any divergence, but (ii) the frequencies of the acceptable mutation types follow an exponential distribution. The exponent of the distribution decreases with divergence: in proteins whose sequences have diverged by <10% some 30 types of mutations form 75% of all mutations; in proteins that have diverged by 50–60% the equivalent number is close to 65. (iii) In proteins that have diverged by up to 10%, ≈30 mutation types are conservative and usually have very similar rank orders. In proteins with 50–60% divergence, there are ≈60 common mutation types that form close to 75% of all mutations. They are less conserved in shape and/or chemical character, and the rank order of their frequencies tends to be roughly similar. This increase in the diversity of mutations comes in part from the mutation of mutations. We have discussed previously the acceptance by proteins of rare deleterious mutations (24). (iv) The distribution of mutations in protein structures varies systematically with divergence. In proteins with low divergence, mutations in the interior are under strong selection that removes all but a few conservative changes. With increasing divergence, mutations in the interior become more widespread and closer in number to what is found in the intermediate and exposed regions. It is this relative increase in the proportion of buried mutations that is responsible for the exponential relationship between sequence and structural divergence.

These conclusions have significant implications for the use of mutation matrices, e.g., the Dayhoff point accepted mutation (PAM) matrices that describe the probabilities of amino acid mutations for a given period of evolution (27). These matrices are based on a model of evolution in which amino acids mutate randomly and independently of one another. The mutation profile that we observe is very similar to PAM matrices at low levels of sequence divergence. In the Dayhoff model, a matrix for divergent sequences is derived from a matrix for closely related sequences by taking the matrix to a power. This assumes that each position is equally mutable. However, as we show from our structural analysis of point mutations, different regions of a protein accept mutations at different rates with increasing divergence, and this complex behavior is not represented by the PAM matrices.

We thank our colleagues and the referees for comments on the manuscript. R.S. acknowledges financial help from the Cambridge Nehru Trust (New Delhi, India) and the Medical Research Council Laboratory of Molecular Biology (Cambridge, U.K.).
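
As an illustration of two quantitative points made in the discussion above, the following Python sketch evaluates the exponential relationship between structural and sequence divergence (2) and shows, in a deliberately simple toy model, why the proportion of mutated sites can grow by a larger factor in a slowly mutating region than in a rapidly mutating one. Only the constants 0.40 and 1.87 and the approximately fourfold difference in acceptability come from the text. The function names, the independent-site saturation model, and the two time points are assumptions made for illustration; the sketch is not the authors' analysis and is not expected to reproduce the factors in Table 4.

```python
import math


def rms_deviation(h: float) -> float:
    """RMS main-chain difference (angstroms) for a proportion h of mutated residues.

    Uses the relationship quoted in the text: delta = 0.40 * exp(1.87 * h).
    """
    if not 0.0 <= h <= 1.0:
        raise ValueError("h must be a proportion between 0 and 1")
    return 0.40 * math.exp(1.87 * h)


def fraction_mutated(rate: float, t: float) -> float:
    """Toy model (assumption): proportion of sites mutated at least once by time t.

    Each site is assumed to accept mutations independently at a constant rate,
    so repeated hits on the same site do not raise the proportion any further.
    """
    return 1.0 - math.exp(-rate * t)


def fold_increase(rate: float, t_low: float, t_high: float) -> float:
    """Factor by which the proportion of mutated sites grows between two times."""
    return fraction_mutated(rate, t_high) / fraction_mutated(rate, t_low)


if __name__ == "__main__":
    # Midpoints of the six divergence categories, expressed as proportions.
    for h in (0.05, 0.15, 0.25, 0.35, 0.45, 0.55):
        print(f"H = {h:.2f}  delta = {rms_deviation(h):.2f} angstroms")

    # Hypothetical relative rates: exposed sites about four times more
    # acceptable than buried sites. The time points are arbitrary.
    t_low, t_high = 0.02, 0.4
    for region, rate in (("buried", 1.0), ("exposed", 4.0)):
        factor = fold_increase(rate, t_low, t_high)
        print(f"{region}: proportion mutated grows by a factor of {factor:.1f}")
```

In this toy model the faster region saturates first, so its proportion of mutated sites grows by a smaller factor over the same interval, which is the qualitative behavior described for the exposed and buried regions.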

1. Lesk AM, Chothia C (1980) J Mol Biol 136:225–270.
2. Chothia C, Lesk AM (1986) EMBO J 5:823–826.
3. Matthews BW (1987) Biochemistry 26:6885–6888.
4. Serrano L, Day AG, Fersht AR (1993) J Mol Biol 233:305–312.
5. Flores TP, Orengo CA, Moss DS, Thornton JM (1993) Protein Sci 2:1811–1826.
6. Koehl P, Levitt M (2002) J Mol Biol 323:551–562.
7. Hedges SB (2002) Nat Rev Genet 3:838–849.
8. Feng DF, Cho G, Doolittle RF (1997) Proc Natl Acad Sci USA 94:13028–13033.
9. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) J Mol Biol 215:403–410.
10. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, et al. (2001) Nature 409:860–921.
11. Waterston RH, Lindblad-Toh K, Birney E, Rogers J, Abril JF, Agarwal P, Agarwala R, Ainscough R, Alexandersson M, An P, et al. (2002) Nature 420:520–562.
12. Hillier LW, Miller W, Birney E, Warren W, Hardison RC, Ponting CP, Bork P, Burt DW, Groenen MA, Delany ME, et al. (2004) Nature 432:695–716.
13. Hubbard T, Andrews D, Caccamo M, Cameron G, Chen Y, Clamp M, Clarke L, Coates G, Cox T, Cunningham F, et al. (2005) Nucleic Acids Res 33:D447–D453.
14. Blattner FR, Plunkett G, III, Bloch CA, Perna NT, Burland V, Riley M, Collado-Vides J, Glasner JD, Rode CK, Mayhew GF, et al. (1997) Science 277:1453–1474.
15. McClelland M, Sanderson KE, Spieth J, Clifton SW, Latreille P, Courtney L, Porwollik S, Ali J, Dante M, Du F, et al. (2001) Nature 413:852–856.
16. Maglott D, Ostell J, Pruitt KD, Tatusova T (2005) Nucleic Acids Res 33:D54–D58.
17. Nakamura Y, Gojobori T, Ikemura T (2000) Nucleic Acids Res 28:292.
18. Grantham R, Gautier C, Gouy M, Mercier R, Pave A (1980) Nucleic Acids Res 8:r49–r62.
19. Ikemura T (1981) J Mol Biol 151:389–409.
20. Akashi H (2003) Genetics 164:1291–1303.
21. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE (2000) Nucleic Acids Res 28:235–242.
22. Perutz MF, Kendrew JC, Watson HC (1965) J Mol Biol 13:669–678.
23. Lee B, Richards FM (1971) J Mol Biol 55:379–400.
24. Chothia C, Gelfand I, Kister A (1998) J Mol Biol 278:457–479.
25. Hill EE, Morea V, Chothia C (2002) J Mol Biol 322:205–233.
26. Miller S, Janin J, Lesk AM, Chothia C (1987) J Mol Biol 196:641–656.
27. Dayhoff MO, Schwartz RM, Orcutt BC (1978) in Atlas of Protein Sequence and Structure, ed Dayhoff MO (Natl Biomed Res Foundation, Washington, DC), Vol 5, pp 345–352.

Sasidharan and Chothia PNAS | June 12, 2007 | vol. 104 | no. 24 | 10085