Finding important sites in sequences

Peter J. Bickel†‡, Katherina J. Kechris†, Philip C. Spector†, Gary J. Wedemayer§, and Alexander N. Glazer¶

Departments of †Statistics, §Chemistry, and ¶Molecular and Cell Biology, University of California, Berkeley, CA 94720

This contribution is part of the special series of Inaugural Articles by members of the National Academy of Sciences elected on May 1, 2001.

Contributed by Alexander N. Glazer, August 22, 2002

By using sequence information from an aligned protein family, a procedure is exhibited for finding sites that may be functionally or structurally critical to the protein. Features based on sequence conservation within subfamilies in the alignment and associations between sites are used to select the sites. The sites are subject to statistical evaluation correcting for phylogenetic bias in the collection of sequences. This method is applied to two families: the phycobiliproteins, light-harvesting proteins in cyanobacteria, red algae, and cryptomonads, and the globins that function in oxygen storage and transport. The sites identified by the procedure are located in key structural positions and merit further experimental study.

Fundamental problems in proteomics include both identifying and understanding the role of the essential sites that determine the structure and proper functioning of the molecule. A thorough evaluation of the importance of all sequence sites involves extremely time-consuming and laborious biochemical experimental methods. Our goal is to determine a candidate set of such sites by considering only the sequence information from a protein family alignment. In this article we use two particular features in the sequence data for identifying sites, develop statistical evaluation methods, and apply the procedure to two well studied protein families: the phycobiliproteins and the globins. Finally, we show that the collection of sites identified by our procedure is distinguishable with respect to specific structural attributes.

Methods

It is generally accepted that the residues most critical to the function of a protein are the most rigorously conserved (1). For example, the Phe and His residues that are involved in heme binding are both conserved in the aligned sequences of all known functional myoglobins (Mbs), α- and β-globins, the globins of invertebrates, and in the plant leghemoglobins. Besides such globally conserved examples of functionally critical positions, other sites of interest include those that show residue variation and are responsible for functional differences within a family. The chief difficulty, of course, is locating such sites. Suppose that we are given a functionally distinguished subfamily of a protein family. It seems plausible that a site playing a role in this particular function will be conserved within the family. On the other hand, we may ask that such a site should distinguish the subfamily from all other sequences in the protein family, which is accomplished by requiring that all other sequences have a residue different from the one conserved in the subfamily in the specified site. In practice, there may not be such functionally distinguished subfamilies so that we need to identify subfamilies and sites simultaneously.

More formally, we are given a family F of sequences and a biologically functional subfamily F_o ⊂ F. We look for sites that satisfy the following criteria: (i) are conserved in F_o and (ii) have a very different residue distribution in the complement of the family, F_o^c. In practice, the functional subfamilies F_o may not be known.

Thus, we enumerate all subfamilies in the family according to these criteria using the strictest version of criterion ii: all the sequences in the complement F_o^c have a different residue than the one conserved in F_o. A site-residue (S-R) that satisfies this condition is called a strong S-R for F_o. Subfamilies that have at least two strong S-Rs are called strong-motif families, and the set of corresponding S-Rs is called a strong motif. All such families and sites can be found in O(n²L²) operations (see Appendix for algorithm). Clearly, corresponding to any single S-R there exists a subfamily having it as a strong-motif pair; thus we limit our search to strong-motif families with at least two S-Rs. However, it is also evident that there may be no subfamilies having strong motifs of length ≥2.‖ Fig. 1 shows an example illustrating these definitions.

The first feature differentiates sites by their residue identity within and outside subfamilies. There is no explicit evaluation of relationships between the sites, although by construction strong-motif sites are perfectly covarying within the strong-motif family. By our second feature for site selection, we explore relationships between the strong-motif sites. We look for the strong-motif site pairs that, in general, are strongly covarying.

A strong association between two sites may not necessarily indicate a structural interaction. From the data, it is impossible to make the causal inference that sites directly influence each other or to infer the direction of the influence. There are any number of situations that may create a strong association between sites besides direct contact. Furthermore, there is evidence that long-range interactions are also critical for the proper functioning of a protein (2, 3). In conclusion, when a site appears in highly significant associations, we consider this as stronger evidence for being potentially critical to the molecule. Thus, we will consider the strong-motif sites that are in statistically significant pairs as potentially important sites to the molecule.

All pairs between strong-motif sites i and j, within each detected strong motif, are evaluated for statistical covariation by using the measures described in ref. 4. The statistics used in ref. 4 measure the degree of association between two sites.

Assessing Significance

We shall now elaborate on how the statistical significance of the covariation features discussed above is determined. We can assess significance under the following two sets of assumptions, which we contrast below. We conclude that the second set is more stringent than the first and rely on its resulting assessment for our selection of sites.

Set I
A1: Sites evolve independently.
A2′: Sequences are independent.

Set II
A1′: Sites evolve independently and identically.

Abbreviations: Mb, myoglobin; S-R, site-residue; AP, allophycocyanin; PC, phycocyanin; PE, phycoerythrin; PEC, phycoerythrocyanin.

‡To whom correspondence should be addressed at: Department of Statistics, Evans Hall, University of California, Berkeley, CA 94720-3860. E-mail: [email protected].

‖We are indebted to the referee of a previous version of this paper for pointing out that our notion of strong motif is related to the notion of "compatibility" of characters in cladistics introduced by LeQuesne in 1969 (39) and Estabrook et al. (40). For the simplest case when two characters (sites) each have only two possible values (residues), compatibility corresponds to the requirement that in the 2 × 2 table formed from these sites in the manner indicated above at least one cell have entry 0. This is weaker than our requirement that only the two diagonal or antidiagonal cells be nonzero. For more than two states, compatibility corresponds to the possibility of using the two characters in the construction of an evolutionary tree in which neither character can mutate back. The problem of finding the largest sets of compatible characters (maximal cliques) is believed to be non-polynomial (NP) hard as opposed to our O(n²L²) problem.

14764–14771 | PNAS | November 12, 2002 | vol. 99 | no. 23 | www.pnas.org/cgi/doi/10.1073/pnas.222508899

Fig. 1. An example illustrating the strong-motif algorithm.
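The strong S-R and strong-motif definitions given in Methods can be illustrated with a minimal Python sketch. This is not the Appendix algorithm: it swaps in a simple hashing approach that groups each (site, residue) pair by the exact set of sequences carrying it. The function name `strong_motifs`, the toy alignment, the treatment of gaps as "not a residue", and the skipping of residues conserved in every sequence are all assumptions made for illustration.

```python
from collections import defaultdict


def strong_motifs(alignment, gap="-"):
    """Return {subfamily: [(site, residue), ...]} for subfamilies with >= 2 strong S-Rs.

    A subfamily is a frozenset of row indices into `alignment`; sites are
    zero-based column indices. Hypothetical helper, not the authors' code.
    """
    n_seqs = len(alignment)
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences must have the same aligned length")

    # Map each carrier set (the rows sharing one residue at one site) to the
    # S-Rs that single it out. Such an S-R is strong for that set by
    # definition: every member has the residue, no sequence outside it does.
    strong_srs = defaultdict(list)
    for site in range(length):
        carriers = defaultdict(set)
        for row, seq in enumerate(alignment):
            if seq[site] != gap:  # assumption: a gap never counts as a residue
                carriers[seq[site]].add(row)
        for residue, rows in carriers.items():
            if len(rows) < n_seqs:  # assumption: skip universally conserved residues
                strong_srs[frozenset(rows)].append((site, residue))

    # Keep only strong-motif families, i.e. at least two strong S-Rs.
    return {fam: srs for fam, srs in strong_srs.items() if len(srs) >= 2}


toy = ["ACDE", "ACDF", "GHDE", "GHDF"]  # made-up alignment, four sequences
motifs = strong_motifs(toy)
for family, srs in sorted(motifs.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(family), srs)
```

In the toy alignment, sequences 0 and 1 share the strong motif A0, C1 and sequences 2 and 3 share G0, H1; the residue at site 3 splits the family differently and yields only single S-Rs, so it contributes no strong motif.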

A2: Sequences evolve under an evolutionary model on a phylogenetic tree.

The assumption of independence is clearly consistent with changes in the genome due to point mutations but not insertions and deletions. It is made explicitly or implicitly by existing approaches to finding functional motifs such as MEME (5) and EMOTIF (6). Thus, A1 seems reasonable.

A2′ cannot be true, because all proteins correspond to leaves of an evolutionary tree. It is made implicitly in the work of Stormo and Hartzell (7) and Lawrence and Reilly (8) leading up to MEME. The extent to which A2′ provides a good approximation depends on parameters such as the time to the most recent common ancestor of the species under consideration and the rate of mutation at sites assumed to be neutral in the sequences coding for the proteins, parameters which are not readily ascertainable.

Of course, identical evolution A1′ is an unrealistic assumption. Most nonneutral sites evolve at different rates and with dissimilar residue distributions due to inhomogeneous selection pressures. However, we use it for the null model that sites are not functionally important and hence neutral. Essentially, A1 is more realistic than A1′ and A2 is more realistic than A2′.

Although we found assumption A2′ held up well in a sample of unrelated bacterial sequences, when we contrasted A2 and A2′ on the phycobiliproteins, it became clear that Set I was untenable as a null hypothesis. In particular, if sequences were assumed to be independent, then ≈8% of all pairs show highly significant statistical association, whereas if we account for the phylogeny as detailed below, only 0.1% of the pairs are found significant at the same level. Thus, the strong covariation exhibited by many of the pairs is no longer significant once the dependence structure of the sequences is taken into account. We therefore settle on assessing significance using A1′ and A2. In using A2, we follow the lead of work such as Akmaev et al. (9).

Specifically, we evaluate the significance of the observed covariation statistics under the null model designated by A1′ and A2. For A2 we need to specify the evolutionary relationships among the sequences, in the form of a phylogenetic tree, and an evolutionary model. The phylogenetic tree is specific to the protein family, i.e., the phycobiliproteins or globins. We used the neighbor-joining method (10) with PAM distances (11), implemented in PHYLIP (12), to estimate the tree for the particular family. The tree was constructed by using sites not in strong motifs. Strong-motif sites were omitted to minimize the bias introduced by simulating from a tree that is estimated by the same features that we later evaluate for statistical significance. For both sample data sets, the reconstructed phylogenies using all sites or only strong-motif sites were similar. We assumed the Dayhoff evolutionary model (11) for changes along a branch. Generation of the tree by the neighbor-joining method and the Dayhoff assumption were made for simplicity. We do not expect our conclusions to be sensitive to these choices for estimating the evolution of the sequences in the absence of selection.

After having specified the phylogenetic tree for the family and an evolutionary model, which assumes no covariation between sites, we studied the behavior of our statistics under the null model using simulations. Following the procedure of Wollenberg and Atchley (13), a complete set of family sequences was generated B = 100,000 times (see below) under these assumptions by using the simulation software PSEQ-GEN (14). That is, the residues at each site were generated by simulating independent and identical evolution, A1′, along the tree with evolutionary changes specified by the Dayhoff model, A2. By comparing the observed covariation statistics M_ij (4) with those in the simulated data M^b_ij, b = 1, ..., B, we can test the hypothesis that there is no association between sites i and j. The most stringent rule is to consider the i–j pairs for which M_ij is larger than any of the M^b_ij. This, in effect, sets the significance level, ε*, of our hypothesis test to 1/(B + 1). By setting the significance level to be very low, we control the number of erroneous covariation declarations.

We are restricting the tests to strong-motif pairs within the strong motifs, which corresponds to simultaneously testing K = 347 hypotheses in the phycobiliproteins and K = 232 hypotheses in the globins. To guarantee overall significance for the final number of tests, we use the Bonferroni principle. That is, when we consider site pairs at significance level ε*, the chance that we make a false covariability call for any of the K pairs is at most Kε*. We shall, in what follows, take B = 10⁵ such that the simultaneous significance level Kε* < 0.01, which corresponds to a true overall significance level of less than 3.47 × 10⁻³ for the K = 347 hypotheses we shall consider in the phycobiliproteins and 2.32 × 10⁻³ for the K = 232 hypotheses in the globins.

Data

The first application and more thorough analysis of this method was to the phycobiliproteins. Phycobiliproteins are a family of highly conserved light-harvesting proteins present in prokaryotes (cyanobacteria and some prochlorophytes) and

eukaryotes (red algae and cryptomonads). The building block of each of the phycobiliproteins is an αβ heterodimer. In this analysis, we compare 105 amino acid sequences of the α and β subunits of the quantitatively major phycobiliproteins found in cyanobacteria and red algae, allophycocyanin (apcA and apcB), C- and R-phycocyanin (cpcA and cpcB; rpcA and rpcB), phycoerythrocyanin (pecA and pecB), and C-, B-, and R-phycoerythrin (cpeA and cpeB; bpeA and bpeB; and rpeA and rpeB), and cryptomonad biliprotein β subunits (CR-peB) taken from the GenBank and the Swiss-Prot databases (see Table 23, which is published as supporting information on the PNAS web site, www.pnas.org; see below). The designations for the sequences given above in parentheses are those used in the databases. The classification, structure, and assembly of the phycobiliproteins, the structures and positions of attachment of their open-chain tetrapyrrole (bilin) prosthetic groups, and the functions of these proteins have been reviewed extensively (15–18). The β subunit of cryptomonad phycobiliproteins is related closely to the β subunits of red algal phycoerythrins (PEs) (19), and sequences of three such polypeptides are included in the analysis.

The analysis was repeated on an alignment of 154 vertebrate globin sequences to examine whether the procedure is generalizable to other examples. The globin family consists of both Mbs and hemoglobins (Hbs), which are responsible for oxygen binding and transport (20, 21). Mb is a monomer with highly compact structure primarily composed of α helices. Attached to the polypeptide is a heme, the prosthetic group to which oxygen binds. Adult Hb contains two αβ heterodimers. Each subunit is structurally similar to Mb. Although the amino acid sequences of the α subunit, β subunit, and Mb are quite different, they have very similar structures (and functions), and therefore it is suitable to apply the procedure to the combined families. Organismal sources and accession numbers are provided in Table 24, which is published as supporting information on the PNAS web site (see below).

Protein sequences were aligned by using CLUSTALW (22). The alignment was also visually inspected so that known conserved sites were aligned properly. A uniform 190-residue length with a maximum of four gaps allowed alignment of all the phycobiliprotein α and β subunits. The globin sequences were aligned to a uniform length of 161 positions with a maximum of six gaps.

Fig. 2. An example of an aligned phycobiliprotein sequence from Rhodella violacea PE-β subunit (A) and an aligned Mb sequence from Physeter catodon (sperm whale) (B).

The residue identity and numbering uses the conventional single-letter amino acid abbreviation followed by the residue number in the aligned sequence. For example, an alanine residue in position 4 from the amino terminus is designated A4. The residue numbering is zero-based, although some polypeptide sequences align with no residue in the initial position. Thus, depending on the polypeptide, a residue designated as A4 may be either the fourth or fifth residue from the amino terminus.

Figs. 4–15, Tables 6–42, and accompanying text are published as supporting information on the PNAS web site, www.pnas.org, and contain the following materials: (i) the locations of strong-motif sites in selected biliprotein and globins; (ii) information on bilin and heme solvent accessibilities after truncation of residues one at a time; (iii) contacts and bonds between side-chain atoms of amino acid residues at strong-motif sites and neighboring atoms; (iv) aligned biliprotein and globin sequences and accession codes; (v) conversion tables for biliprotein and globin residue and bilin/heme numbering; (vi) contact profiles and associated P values for biliproteins and globins; and (vii) the

Table 1. Distribution of the sites selected by our procedure, the important sites, amongst the phycobiliprotein α and β subunit families and the globin families

Family      Important sites

Phycobiliprotein α and β
  AP-α      E15 V67 M80 T81 T83 I110 Y119 P125
  PC-α      T3 P4 G80 K83 G105 Y110
  PEC-α     T3 P4 G80 K83 G105 Y110
  CPE-α     K83 G105 W110 V118 Y119 P125
  BPE-α     K83 G105 W110 V118 Y119 P125
  RPE-α     K83 G105 W110 V118 Y119 P125
  AP-β      D3 A83 S105 R110 T118 Y119 P125
  PC-β      D3 A83 S105 R110 T118 Y119 P125
  PEC-β     D3 F67 Q81 A83 S105 R110 T118 Y119 P125
  CPE-β     D3 A83 S105 R110 T118 Y119 P125
  BPE-β     D3 A83 S105 R110 T118 Y119 P125
  RPE-β     D3 A83 S105 R110 T118 Y119 P125
  CR-PE-β   D3 A83 S105 R110 T118 Y119 P125

Globin
  Hb-α      L104 P133
  Hb-β      W44 G53 L104 H105 P133 H155
  Mb        E24 E44 (L,I,P) 92* H104 (I,N,H) 109* I114 D149

*Sites occurring in small strong-motif families (less than four members) within the larger functional subfamilies.
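The sites in Table 1 are those that pass the Monte Carlo threshold described under Assessing Significance. The arithmetic behind that threshold can be checked with a short Python sketch; the function names are hypothetical, and the code only reproduces the stated relations ε* = 1/(B + 1) and the Bonferroni bound Kε*, not the simulation itself.

```python
def per_test_level(num_simulations):
    """Smallest attainable Monte Carlo level when the observed statistic
    must exceed all B simulated values: 1 / (B + 1)."""
    return 1.0 / (num_simulations + 1)


def bonferroni_bound(num_tests, level):
    """Upper bound on the chance of at least one false call among K tests."""
    return num_tests * level


def beats_all_simulations(observed, simulated):
    """Most stringent rule: the observed statistic is larger than every
    statistic obtained from the simulated null data sets."""
    return all(observed > value for value in simulated)


B = 100_000  # number of simulated data sets, as stated in the text
eps = per_test_level(B)
for family, K in (("phycobiliproteins", 347), ("globins", 232)):
    # K is the number of strong-motif pairs tested in each family.
    print(family, K, bonferroni_bound(K, eps))
```

Both bounds come out just below the quoted 3.47 × 10⁻³ and 2.32 × 10⁻³, because the divisor is B + 1 rather than B.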


Fig. 3. The amino acid environment of residue 110 within biliprotein α and β subunits. (A–D) The 110 residue in α subunits of AP (A), C-PC (B), PEC (C), and B-PE (D). (E–H) The 110 residue in the corresponding β subunits. All bonds and van der Waals contacts with interatomic separations between 2.7 and 4.0 Å are shown as dotted lines; for hydrogen bonds, this is the separation distance between donor and acceptor atoms. Each image is labeled on the lower right with the respective protein and subunit. The residues found at position 110 are particularly noteworthy, because they participate in several strong motifs: Tyr-110 is conserved in all PC and PEC-α subunits, Trp-110 is found in all C-, B-, and R-PE-α subunits, Ile-110 is conserved in all AP-α subunits, and Arg-110 occurs in all biliprotein β subunits. In C (PEC-α), the internal cavity lies behind Tyr-A110; in G (PEC-β), the internal cavity lies below and to the right of Arg-B110; in D (B-PE-α), the internal cavity is behind Trp-C110; and in H (B-PE-β), the internal cavity lies above Arg-B110.

Table 2. Important residues involved in intersubunit or subunit–linker interactions or contacts to bilins

  AP-α      E15 V67 M80 T81 T83 I110 Y119 P125
  PC-α      T3 P4 G80 K83 G105 Y110
  PEC-α     T3 P4 G80 K83 G105 Y110
  BPE-α     K83 G105 W110 V118 Y119 P125
  AP-β      D3 A83 S105 R110 T118 Y119 P125
  PC-β      D3 A83 S105 R110 T118 Y119 P125
  PEC-β     D3 F67 Q81 A83 S105 R110 T118 Y119 P125
  BPE-β     D3 A83 S105 R110 T118 Y119 P125

Underlined residues are at or very near interfaces between subunits or interfaces between subunits and linkers. Italicized residues have contacts to bilins. Residues that are both in contact with bilins and at or very near subunit or linker-subunit interfaces are bold.

atom type and radii libraries submitted to the web-based GETAREA program for calculation of solvent accessibilities.

Results

An example of an aligned phycobiliprotein sequence with the residues numbered is shown in Fig. 2A. The full set of aligned sequences is provided in Tables 23 and 24. Before application of the strong-motif search algorithm to the 105 aligned phycobiliprotein sequences, residues common to all sequences were excluded. These universally conserved residues were D13, C84, R86, D87, R93, Y97, G102, and G114. This set of conserved residues is a signature that defines all members of the phycobiliprotein α and β subunit family regardless of the subclass, bilins, or organismal origin. Application of the strong-motif search algorithm to the "edited" sequences identified 35 strong motifs ranging from 2 to 13 residues. In the set of strong motifs, there were K = 347 strong-motif pairs that were checked for covariation. Of these, 10 pairs (containing residues at 12 sites) were found to be covarying statistically at a significance level of 3.47 × 10⁻³. By our definitions, each site in these pairs is potentially "important" to the molecule. Also, by construction, each site is in at least one strong-motif pair. In this application, all of the 12 sites and their associated residues are contained in pairs within strong motifs corresponding to structural or functional subfamilies as defined by their spectroscopic properties (Table 1).

By closely inspecting the structures of representatives of the four phycobiliprotein subfamilies, we explored possible functional roles for these sites. The structures of proteins in each of the four classes [allophycocyanin (AP), phycocyanin (PC), phycoerythrocyanin (PEC), and PE] have been determined by x-ray crystallography, but the structure of a phycobiliprotein–linker complex has only been reported for AP (23–26). Nearest neighbors and specific interactions (hydrogen bonds, electrostatic interactions, etc.) were examined for each site (Supporting Information, Tables 12–22).

Residues at 5 of the 12 important sites interacted with each other or with one of the residues completely conserved in the aligned phycobiliprotein sequences (see above). D3 in the PC-β and PEC-β subunits are hydrogen-bonded to T3 in the corresponding α subunits. R110, present in all of the β subunits, forms a salt linkage to D13, a conserved residue in all α and β subunits (see Fig. 3). T118, present in all β subunits, forms a hydrogen bond to the main chain carbonyl of G114, a conserved residue in all α and β subunits. Y119, present in all AP, PC, PEC, and PE subunits except PC-α and PEC-α, is hydrogen-bonded to the invariant D87 in all cases except in BPE-α.

Further examination of the three-dimensional structures revealed that of 56 residues at important sites, 39 (70%) are involved in intersubunit or subunit–linker interactions or in contacts with bilins (Table 2). Overall, 21 (38%) are at or very near interfaces between subunits or interfaces between subunits and linker. In addition, 23 of the residues have contacts to bilins (41%). From calculations of the change in bilin solvent accessibility after truncation of residues one at a time, truncation of residues at 32% of the important sites (M80, G80, T83, K83, A83, T118, Y119, and P125) increased accessibility of solvent to a bilin (Table 3, Supporting Information, and Figs. 10–13). These residues are expected to affect the spectroscopic properties of the proximal bilin through their influence, inter alia, on the polarizability of its environment and conformational mobility in the ground and excited states.

For many of the sites, there were no readily interpretable critical interactions with neighboring residues. Nevertheless, it is apparent that the important sites are clustered and not distributed randomly across the surface (Supporting Information, Figs. 4–7). Important sites are located in the interior and near the centrally located β bilin, which serves as the "terminal energy acceptor" (16). Most sites are in key regions for the protein: α/β interfaces, near the linker, or in close proximity to the bilin.

We analyze these features statistically as follows. Define a residue site being in contact if at least one side-chain atom of the residue is within 4 Å of an atom from another subunit, linker, or bilin. The number of contacts and P values for all sites in the α and β subunits in the AP trimer are listed in Table 31. For example, in the AP-α A subunit, residues at 5 of the 8 important sites are in contact with either a subunit interface (one), bilin (two), or at least two such interfaces (two). For the 152 sites not designated as important by our criteria, 26 are in contact with a subunit interface, 11 with a bilin, and 4 with at least two such interfaces. Mainly β subunit residues come in contact with the linker because of the asymmetrical location of the linker in the internal cavity of the AP-(αβ)₃–linker complex. The types of contacts for all sites were tallied for

Table 3. Inferred consequences of single-residue truncations on accessibility of bilins to solvent

  AP-α      E15 V67 M80 T81 T83 I110 Y119 P125
  PC-α      T3 P4 G80 K83 G105 Y110
  PEC-α     T3 P4 G80 K83 G105 Y110
  BPE-α     K83 G105 W110 V118 Y119 P125
  AP-β      D3 A83 S105 R110 T118 Y119 P125
  PC-β      D3 A83 S105 R110 T118 Y119 P125
  PEC-β     D3 F67 Q81 A83 S105 R110 T118 Y119 P125
  BPE-β     D3 A83 S105 R110 T118 Y119 P125

Truncation of bold residues increases accessibility of solvent to a bilin.
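The contact counts quoted for the AP-α A subunit (5 of 8 important sites in contact; 26 + 11 + 4 = 41 of the 152 other sites in contact) can be fed into a one-sided hypergeometric tail as a rough illustration of this kind of test. The sketch below is a simplification and an assumption: it collapses the three contact types into a single "in contact" class, whereas the reported P values are based on the full contact profile, so its result need not match those in the supporting tables. The function name is hypothetical.

```python
from math import comb


def hypergeom_upper_tail(population, successes, draws, observed):
    """P(X >= observed) when `draws` items are sampled without replacement
    from `population` items, of which `successes` are marked."""
    total = comb(population, draws)
    upper = min(draws, successes)
    favourable = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(observed, upper + 1)
    )
    return favourable / total


important_sites = 8              # important sites in the AP-alpha A subunit
important_in_contact = 5         # of which in contact (1 + 2 + 2)
other_sites = 152                # sites not designated as important
other_in_contact = 26 + 11 + 4   # interface, bilin, at least two interfaces

p = hypergeom_upper_tail(
    population=important_sites + other_sites,
    successes=important_in_contact + other_in_contact,
    draws=important_sites,
    observed=important_in_contact,
)
print(p)
```

The question asked here is how likely it is to see at least 5 contact sites among 8 sites drawn at random from the 160 sites of the subunit, 46 of which are in contact.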

each subunit and will be referred to as a contact profile for the subunit (Tables 31–34).

We evaluated the statistical significance of the observed contact profile for each subunit in AP to test whether there was an association between our selection method and sites that are in critical contacts. For all subunits except F, the individual P values from the hypergeometric test were <0.02. By using the Bonferroni principle to control the probability of any incorrect statements for the six tests, the association between the sites designated important by our procedure and the sites in critical interface regions was significant at the 0.05 level for all subunits except C and F. This analysis was repeated on the other three protein structures (Tables 32–34). The results were not significant, but the comparisons are incomplete because contacts to the linker can no longer be observed because of the absence of the linker in these crystallographic structures.

Before the strongly covarying sites were filtered out, this test was also performed on the contact profile of the 12 strong-motif sites in the AP-α subunit and the 22 strong-motif sites in the AP-β subunit. These sites occur in strong motifs that define, with at most one exception, all the AP-α sequences and all the AP-β sequences, respectively. For that list, the P values from the hypergeometric test were all >0.02 except for subunits C and E (Table 37). Accounting for multiple testing up to the 0.05 level, only in subunit E is a statistically significant association detected between the strong-motif sites and the interface sites. Thus, the second cut of sites, based on statistical covariation, is useful for refining the strong-motif site list to find structurally distinguishable positions. This trend was not observed in the other three structures (Tables 38–40), possibly because of the unobservable linker contacts as discussed above.

The analysis was repeated on the 154 globin sequences. An example of an aligned sequence with residues numbered is shown in Fig. 2B. The full set of aligned sequences is provided in Supporting Information (Table 24). Before application of the strong-motif search algorithm, the two conserved residues, L96 and H100, were removed. The algorithm identified 27 strong motifs up to 16 sites long. In the set of strong motifs, there were K = 232 strong-motif pairs that were checked for covariation. According to our procedure, 11 of these pairs were found to be statistically covarying at a significance level of 2.32 × 10⁻³. In these pairs, there are 11 different sites. They are potentially important sites for the molecule by our definitions. All 11 sites are contained in pairs within strong motifs that correspond to the functional subfamilies or to smaller taxonomic groups contained in the larger functional subfamilies (Table 1).

The crystal structures of human Hb and sperm whale Mb were obtained from the Protein Data Bank database (27, 28). As with the phycobiliproteins, the functional and structural roles could not be explained for all the sites by examining the immediate environment of the site. We again tested for an association between our procedure and the selection of sites that have intersubunit or heme contacts. The number of contacts and P values for each α and β subunit in the Hb-α₂β₂ tetramer are listed in Table 35, which is published as supporting information on the PNAS web site. The P values for subunits B and D are <0.01. Controlling for overall significance, the association between the sites selected by our procedure and the sites in the critical interface regions was significant at the 0.03 level for subunits B and D. This analysis also was performed on the contact profile of the 18 strong-motif sites in the α subunit and the 16 strong-motif sites in the β subunit (Supporting Information, Table 41). These sites occur in strong motifs that define, with at most one exception, the functional subfamilies or smaller taxonomic groups contained in the functional subfamilies. Now, no subunit exhibits a strong association up to a significance level of 0.3.

For Mb, the P value is 0.07 for testing the association between the important sites and the sites in interface regions (Table 36). Although this would not be considered statistically significant, the P value for the test increases to 0.52 (Table 42) when using all strong-motif sites that define Mb with at most one exception or smaller taxonomic groups contained in the subfamily. So similarly as in the Hbs, filtering the strong-motif sites based on their occurrence in statistically significant associations refined this list of sites with respect to structural features.

Extensive databases provide access to the information on the consequences of many point mutations on the structure and function of the Hb molecule (http://globin.cse.psu.edu/ and www.ncbi.nlm.nih.gov/Omim/). Tables 4 and 5 list the mutations that occur at sites we have identified as important in Hb. Twenty-two point mutations occur at 7 of the 8 important sites. No mutations have been reported at the eighth site, Hb-α P133. Twenty of these mutations affect the structure and/or function of the Hb molecule. The remaining two mutations are predicted on the basis of structure analysis to decrease the stability of the β chain.

Table 4. Effect of human Hb mutations at important residue sites on the structure and/or function of the Hb molecule

  Hb-α      L104(91) P133(124)
  Hb-β      W44(37) G53(46) L104(91) H105(97) P133(124) H155(146)

The residue numbers in parentheses are those used for these residues in the conventional numbering scheme for human Hbs. At residues identified in bold, point mutations affect the structure and/or function of the Hb molecule.

Table 5. Known mutants

  αLeu91    Leu91Pro
  αPro124   No mutants reported
  βTrp37    Trp37Ser, Trp37Arg, Trp37Gly
  βGly46    Gly46Arg, Gly46Glu (The predicted effect of these mutations is to decrease the stability of the β chain.)
  βLeu96    Leu96Val, Leu96Pro
  βHis97    His97Gln, His97Tyr, His97Pro, His97Leu
  βPro124   Pro124Arg, Pro124Ser, Pro124Gln, Pro124Leu
  βHis146   His146Arg, His146Leu, His146Asp, His146Pro, His146Gln, His146Tyr

For all sites in both Hb-α and Hb-β, we tallied the number of "normal" and "nonnormal" mutations as defined by an automated search of the word "normal" in the summary page for each mutation in the database referenced above. Differences in the ratio of normal to nonnormal mutations in the important sites versus the sites not designated as important were not significant. This result is not surprising given the small sample sizes involved, particularly in Hb-α where there are only two important sites. Thus, despite a trend in the rate of nonnormal mutations in the important sites and in the percentage of important sites with interface contacts versus this percentage in sites not designated as important, small P values are difficult to attain because of the small absolute size of the set of important sites. However, although the results for some subunits in the contact profile and mutation analyses are not overwhelmingly significant in the statistical sense, these separate analyses provide different sources of evidence, unlinked to the method of selection, all reinforcing the qualitative trend that the important sites differentiate themselves with respect to critical locations in the structure in ways that correspond to what one would expect of functionally important sites.

Discussion

This work is related to other motif-finding and covariation methods, and we will elaborate on the similarities and differences. Local alignments between a query sequence and a database by programs such as BLAST (29) initially look for short stretches of similarities between the pair of sequences and extend

Bickel et al. PNAS ͉ November 12, 2002 ͉ vol. 99 ͉ no. 23 ͉ 14769 Downloaded by guest on October 2, 2021 them to search for a longer alignment. This is not a motif-finding appear in the fossil record between 1 and 1.3 billion years ago (37). method per se but can highlight commonalities on a pairwise level Because the phycobiliproteins of cyanobacteria and of red algal between family sequences. chloroplasts share common ancestry, the ancestral phycobilipro- Programs such as MEME (5) and PRATT (30) successively extract teins antedate the appearance of red algae in the fossil record. The groups of sites of interest appearing in large subsets of sequences phycobiliprotein ␣-type and ␤-type subunits can be readily distin- from a collection of unaligned sequences without a query. This guished based on amino acid sequence and bilin type and number. method produces a local alignment in the form of an ungapped They can be subdivided further on the basis of organismal origin. position-dependent residue frequency matrix in MEME or a regular The groupings produced by this approach correspond extremely expression with preconstrained length in PRATT. Such methods are well with those based on the distribution of strong motifs. In useful when the signal in a set of related sequences is too weak for particular, distinctive combinations of 21 of the 35 motifs segregate global multiple alignment methods. A common feature of these precisely with previously assigned groupings: PE-␣ and PE-␤, methods is that the patterns pointed out are constrained to a short PEC-␣ and PEC-␤, PC-␣ and PC-␤, and AP-␣ and AP-␤ (15–18). region of sites. In MEME, the sites appearing in the motif are It is reasonable to suppose that each of the strong motifs in the consecutive. This is perhaps unavoidable in unaligned sequences phycobiliproteins possibly defines residues of critical importance to without further information. 
The sites in the motifs discovered by the members of the cluster of proteins in which it occurs. Never- these methods are also of interest, because the motif likely corre- theless, it is also very possible that many strong motifs are acci- sponds to a region of activity such as a structural domain. In dental—a byproduct of a shared evolutionary past. From the point contrast, the strong-motif algorithm is based on an alignment, and of view of the neutral theory of evolution (38), we are interested in thus sites are not restricted to be consecutive or within a certain separating out the structurally and functionally important positions bounded region but require a family to be alignable (i.e., adequate from the selectively neutral positions. degree of homology between the sequences). We use features between sites as the next criterion to distinguish Programs such as EMOTIF (6) extract the most representative the sites. A large degree of statistical covariation may be an motifs, based on a particular measure, appearing in large subfam- important signal for some sort of activity at these sites, although the ilies of aligned sequences. These motifs may involve a group of sites manner in which they affect each other or how they are similarly dispersed throughout the sequence. For families of even modest affected cannot be deduced. We evaluate whether the strength of size, such enumeration-type motif-finding methods yield enor- the association between the sites would still exist if the phylogeny mously large numbers of patterns. The statistical significance is accounted for. It is also very possible that sites filtered out by this associated with such patterns is usually unclear, because the search procedure are important to the molecule. By using stringent process is not usually taken into account. 
Similarly, the strong-motif statistical cutoffs, we ensure that the selected sites distinguish algorithm enumerates patterns but identifies all patterns of sites themselves from the rest according to our criteria, but we do not that covary in the most extreme fashion possible and uniquely claim that they are the only ones of any interest. identify the subfamilies in which they are conserved. In our example As documented in Results, in the structural analysis our selected data, we found that the number of patterns discovered by this sites differentiate themselves with respect to their preferential algorithm is fairly moderate, such that all strong motifs and strong- locations at intersubunit contacts, interactions with completely motif families can be examined individually. conserved residues, or with each other and interactions with Many methods have been developed to explore the relationship prosthetic groups. In human Hb, inherited mutations that occur at between two neighboring sites in the three-dimensional structure of the selected sites are deleterious to structure and function of the the protein and the degree of association between the two positions molecule. More comprehensive evaluations, such as site-specific in the alignment (31–34). The reasoning is that if two sites are mutagenesis studies, are beyond the scope of this article. In light of interacting or in close contact with each other, then the evolution our results, we propose this method as a means of identifying of one site over time should affect the evolution of the other site (for candidate sites for such experiments. an excellent actual example, see ref. 35). In earlier work, statistical significance was assessed assuming that the sequences were inde- Appendix ϫ pendent (under A20). Later research incorporated the phylogenetic I. As in Bickel et al. 
(4), form for each pair of sites i1, i2 an m1 m2 relationships between the sequences, our A2, for more realistic contingency table, where m1 is the number of amino acid residues evaluations (33, 34). appearing in the family F at site i1 and m2 that appear at site i2. The To evaluate their performance, most methods have focused on cell corresponding to amino acid j1 at i1 and j2 at i2 has as entry the population behavior. Pairs of sites that are ‘‘strongly’’ covarying are number of sequences in F having (i1, j1),(i2, j2). Then any cell having generally closer in three-dimensional space than the ‘‘weakly’’ s as an entry and 0 appearing in all other cells in its row and column covarying pairs. This research has shown that some correlation corresponds to a strong motif {(i1, j1),(i2, j2)} for the s sequences between distance and the covariation measures exists for pairs, but counted in that cell. it is weak (33). We also found this to be true in our applications. However, in our method the purpose of statistical association is not II. List all couples of S-R pairs corresponding to cells with s to relate it directly to the physical distance between sites but as an appearing in them and 0s in all other cells in the same row and interesting sequence feature that distinguishes sites. column. The phycobiliprotein family of AP, PC, PEC, and PE polypep- tides was chosen for the first application of the strong-motif search III. For the given s, define an equivalence relation between S-R pairs  ' algorithm because there is high homology between the amino acid (i1, j1) and (i2, j2), i1 i2 by (i1, j1) (i2, j2)iff{(i1, j1),(i2, j2)} appears sequences of these proteins, and many sequences of each class of in the list developed in II. phycobiliprotein have been reported. The structures and sites of covalent attachment of the bilin prosthetic groups are known. The IV. 
This equivalence relation partitions the set of all S-R pairs into three-dimensional structures of representatives of all four classes disjoint sets S1,..., St. List all sequences corresponding to any Յ Յ have been determined at high resolution and show strong overall member of Sk and call it Fk,1 k t. It is a consequence of the similarity (23–26, 36). The ␣␤ building blocks of all four proteins construction that all equivalent S-R pairs yield the same set of ␣␤ ␣␤ form higher order assemblies, ( )3 and ( )6, with similar qua- sequences Fk. Each Fk is of cardinality s, and Sk is precisely the ϭ ternary structures. Finally, these proteins occur in both prokaryotic strong motif of Fk. This algorithm repeated for s 1,...,n yields organisms (cyanobacteria) and eukaryotes (red algae and the all strong motifs of length Ն2 and the corresponding subfamilies. cryptomonads). Organisms classified morphologically as red algae Note that determining all the Sk and Fk for each s takes on the

14770 ͉ www.pnas.org͞cgi͞doi͞10.1073͞pnas.222508899 Bickel et al. Downloaded by guest on October 2, 2021 order of n(L) ա nL2 operations. Doing this for s ϭ 1,...,n thus tions, and Eric Lander and an anonymous referee for making us relate 2 takes on the order of n2L2 operations. our work to phylogeny. We also thank the Molecular Graphics Labo- ratory in the Department of Chemistry (University of California, We are grateful to Professor Robert Huber for providing coordinates for Berkeley) for the use of its computers. This work was supported by grants the structures of the B-PEs from Porphyridium cruentum and Porphy- from the Lucille P. Markey Charitable Trust (to A.N.G.) and the W. M. ridium sordidum as well as the phycoerythrocyanin (PEC) from Mas- Keck Foundation (to A.N.G.). Research was partially supported by tigocladus laminosus. We thank Sunil Aggarwal for help with assembling National Science Foundation Grant DMS 9802960 (to P.J.B.) and sequence data, Herman Chernoff for many helpful remarks and ques- National Science Foundation Graduate Research Fellowship (to K.J.K.).
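Steps I–IV of the Appendix can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `strong_motifs` is hypothetical, the alignment is assumed to be a list of equal-length strings, gap characters are treated as ordinary residues, and all values of s are handled in a single pass instead of looping over s = 1, ..., n. The equivalence relation of step III is tracked with a small union-find structure, which is a convenience chosen here and not something the Appendix prescribes.

```python
from collections import Counter, defaultdict
from itertools import combinations


def strong_motifs(alignment):
    """Map each strong-motif family (frozenset of sequence indices) to its
    sorted list of (site, residue) pairs. Only motifs with at least two
    site-residue (S-R) pairs are reported. Hypothetical helper for illustration.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    # Union-find over S-R pairs: an implementation choice for step III.
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for i1, i2 in combinations(range(length), 2):
        # Step I: contingency table for this pair of sites, with its margins.
        table = Counter((seq[i1], seq[i2]) for seq in alignment)
        row = Counter(seq[i1] for seq in alignment)
        col = Counter(seq[i2] for seq in alignment)
        for (j1, j2), s in table.items():
            # Step II: the cell holds s and every other cell in its row and
            # column is 0, i.e. the cell equals both of its marginal totals.
            if row[j1] == s and col[j2] == s:
                union((i1, j1), (i2, j2))  # Step III

    # Step IV: equivalence classes S_k and the sequences F_k carrying them.
    classes = defaultdict(list)
    for sr in parent:
        classes[find(sr)].append(sr)

    motifs = {}
    for members in classes.values():
        site, residue = members[0]
        family = frozenset(
            k for k, seq in enumerate(alignment) if seq[site] == residue
        )
        motifs[family] = sorted(members)
    return motifs
```

For the toy alignment `["AKLM", "AKLV", "GRLM", "GRIV"]`, the sketch reports two strong motifs: sequences 0 and 1 share `(0, 'A'), (1, 'K')`, and sequences 2 and 3 share `(0, 'G'), (1, 'R')`. Sites 2 and 3 contribute nothing, because no residue there is carried by exactly the same set of sequences as a residue at another site.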

1. Page, R. D. M. & Holmes, E. C. (1998) Molecular Evolution: A Phylogenetic Approach (Blackwell, Oxford), pp. 228–279.
2. Lockless, S. W. & Ranganathan, R. (1999) Science 286, 295–299.
3. Manning, J. M., Dumoulin, A., Manning, L. R., Chen, W., Padovan, J. C., Chait, B. T. & Popowicz, A. (1999) Trends Biochem. Sci. 24, 211–212.
4. Bickel, P. J., Cosman, P. C., Olshen, R. A., Spector, P. C., Rodrigo, A. G. & Mullins, J. T. (1996) AIDS Res. Hum. Retroviruses 12, 1401–1411.
5. Bailey, T. L. & Elkan, C. (1995) Mach. Learn. 21, 51–83.
6. Nevill-Manning, C. G., Wu, T. D. & Brutlag, D. L. (1998) Proc. Natl. Acad. Sci. USA 87, 118–122.
7. Stormo, G. D. & Hartzell, G. W. (1989) Proc. Natl. Acad. Sci. USA 86, 1183–1187.
8. Lawrence, C. E. & Reilly, A. W. (1990) Proteins 7, 41–51.
9. Akmaev, V. R., Kelley, S. T. & Stormo, G. D. (2000) 16, 501–512.
10. Saitou, N. & Nei, M. (1987) Mol. Biol. Evol. 4, 406–425.
11. Dayhoff, M. O., Schwartz, R. M. & Orcutt, B. C. (1978) in Atlas of Protein Sequence and Structure, ed. Dayhoff, M. O. (Natl. Biomed. Res. Found., Washington, DC), Vol. 5, Suppl. 3, pp. 345–352.
12. Felsenstein, J. (1989) Cladistics 5, 164–166.
13. Wollenberg, K. R. & Atchley, W. R. (2000) Proc. Natl. Acad. Sci. USA 97, 3288–3291.
14. Grassly, N., Adachi, J. & Rambaut, A. (1997) Comput. Appl. Biosci. 13, 559–560.
15. Glazer, A. N. (1985) Annu. Rev. Biophys. Biophys. Chem. 14, 47–77.
16. Glazer, A. N. (1989) J. Biol. Chem. 264, 1–4.
17. Glazer, A. N. (1994) in Advances in Molecular and Cell Biology, eds. Bittar, E. E. & Barber, J. (Jai, Greenwich, CT), pp. 119–149.
18. Sidler, W. A. (1994) in The Molecular Biology of Cyanobacteria, ed. Bryant, D. A. (Kluwer, Dordrecht, The Netherlands), pp. 139–216.
19. Glazer, A. N. & Wedemayer, G. J. (1995) Photosynth. Res. 46, 93–105.
20. Fermi, G. & Perutz, M. F. (1981) in Atlas of Molecular Structures in Biology, eds. Phillips, D. C. & Richards, F. M. (Clarendon, Oxford).
21. Dickerson, R. E. & Geis, I. (1983) Hemoglobin: Structure, Function, Evolution, and Pathology (Benjamin/Cummings, Menlo Park, CA).
22. Thompson, J. D., Higgins, D. G. & Gibson, T. J. (1994) Nucleic Acids Res. 22, 4673–4680.
23. Reuter, W., Wiegand, G., Huber, R. & Than, M. E. (1999) Proc. Natl. Acad. Sci. USA 96, 1363–1368.
24. Duerring, M., Schmidt, G. B. & Huber, R. (1991) J. Mol. Biol. 217, 577–592.
25. Rümbeli, R., Schirmer, T., Bode, W., Sidler, W. & Zuber, H. (1985) J. Mol. Biol. 186, 197–200.
26. Ficner, R. & Huber, R. (1993) Eur. J. Biochem. 218, 103–106.
27. Takano, T. (1984) in Methods and Applications in Crystallographic Computing, eds. Hall, S. R. & Ashida, T. (Oxford Univ. Press, Oxford), p. 262.
28. Tame, J. & Vallone, B. (2000) Acta Crystallogr. D 56, 805–811.
29. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. (1990) J. Mol. Biol. 215, 403–410.
30. Jonassen, I., Collins, J. F. & Higgins, D. G. (1995) Protein Sci. 4, 1587–1595.
31. Altschuh, D., Lesk, A. M., Bloomer, A. C. & Klug, A. (1987) J. Mol. Biol. 193, 693–707.
32. Neher, E. (1994) Proc. Natl. Acad. Sci. USA 91, 98–102.
33. Shindyalov, I. N., Kolchanov, N. A. & Sander, C. (1994) Protein Eng. 7, 349–358.
34. Pollock, D. D., Taylor, W. R. & Goldman, N. (1999) J. Mol. Biol. 287, 187–198.
35. Zhang, J. & Rosenberg, H. F. (2002) Proc. Natl. Acad. Sci. USA 99, 5486–5491.
36. Wilk, K. E., Harrop, S. J., Jankova, L., Edler, D., Keenan, G., Sharples, F., Hiller, R. G. & Curmi, P. M. (1999) Proc. Natl. Acad. Sci. USA 96, 8901–8906.
37. Schopf, J. W. (1978) Sci. Am. 239, 111–138.
38. Kimura, M. (1979) Proc. Natl. Acad. Sci. USA 76, 3440–3444.
39. LeQuesne, W. J. (1982) Zool. J. Linn. Soc. 74, 267–275.
40. Estabrook, G. F., Johnson, C. S., Jr., & McMorris, F. R. (1976) Math. Biosci. 29, 181–187.

Bickel et al. PNAS | November 12, 2002 | vol. 99 | no. 23 | 14771