<<

Protein Science (1993), 2, 884-899. Cambridge University Press. Printed in the USA. Copyright 0 1993 The Society

-~~ ~~~~ ~ Families and the structural relatedness among globular

DAVID P. YEE AND KEN A. DILL Department of Pharmaceutical Chemistry, University of California, San Francisco, California94143-1204 (RECEIVEDJanuary 6, 1993; REVISEDMANUSCRIPT RECEIVED February 18, 1993)

Abstract Protein structures come in families. Are families “closely knit” or “loosely knit” entities? We describe a mea- sure of relatedness among polymer conformations. Based on weighted distance maps, this measure differs from existing measures mainly in two respects: (1) it is computationally fast, and (2) it can compare any two proteins, regardless of their relative chain lengths or degree of similarity. It does not require finding relative alignments. The measure is used here to determine the dissimilarities between all 12,403 possible pairs of 158 diverse protein structures from the Brookhaven (PDB). Combined with minimal spanning trees and hier- archical clustering methods,this measure is used to define structural families. It is also useful for rapidly searching a dataset of protein structures for specific substructural motifs.By using an analogy to distributions of Euclid- ean distances, we find that protein families are not tightly knit entities. Keywords: ; relatedness; structural comparison; substructure searches

Pioneering work over the past 20 years has shown that positions after superposition. RMS is a useful distance proteins fall into families of related structures (Levitt & metric for comparingstructures that arenearly identical: Chothia, 1976; Richardson, 1981; Richardson & Richard- for example, when refining or comparing structures ob- son, 1989; Chothia & Finkelstein, 1990). How many fam- tained from X-ray crystallography or NMR experiments. ilies are there? Are thefamilies “tightly knit” or“loosely However, RMS is of limited value as a general measure knit”? That is, do two proteins within a family have much of similarity because it is a “maximum likelihood estima- greater structural similarity than two proteins from dif- tor” of the standarddeviation between two structuresonly ferent families? If so, they are tightly knit. What can we if the individual errors are Gaussian-distributedwith zero learn about the forces of protein folding and evolution mean (Beers, 1957). The Gaussian distribution assump- from observing how proteins cluster into families? tion can be reframed as an assumption that the differ- In order to address these questions, it is necessary to encesbetween two compared structures arise from have a suitable measure of the structural similarity be- fluctuations that obeya square-law potential. A square- tween proteins, because a “family” relationship can only law potential is only a good approximation forsmall con- be defined in termsof some degree of similarity. Several formational deviations. If two structures are not in the measures of structural similarity have been developed same energy well, or if errors are large,RMS will lose its (Remington & Matthews, 1978; Taylor & Orengo, 1989; underlying justification. In addition, theuse of an RMS Rackovsky, 1990; Sali & Blundell, 1990). There is no un- distance criterion to compare two protein structures re- derlying fundamental principle dictating that one similar- quires making assignments in which atom i of protein 1 ity measure is better than others. Ultimately, the concept “is equivalent to” atomjof protein 2. When comparing of “similarity” is based upon some criterion arbitrarily proteins with little sequence identity or unequal chain chosen for a particular purpose (Maggiora & Johnson, lengths, this requires making arbitrary decisions. 1990). For example, a common measure of structural sim- Some similarity measures require making structural ilarity is the rootmean square (RMS) deviation of atomic alignments of one protein with the other (Taylor & Or- engo, 1989; Sali & Blundell, 1990). When there is a bio- ~ ~ ~ ~ ~~~~~ ~ ~~~~~ logical or evolutionary basis for making these alignments, Reprint requests to: Ken A. Dill, Department of Pharmaceutical Chemistry, Box 1204, University of California, San Francisco, Califor- such methods have the advantageof allowing a high de- nia 94143-1204. gree of structural discrimination among highly similar 884 Families of globular proteins 885 proteins. For proteins that are not highly similar, how- is a reasonable measure of relatedness. For example, in ever, making alignments requires making certain arbitrary one test we show that it is a useful tool for searching da- choices about thepossible locations of insertions and de- tabases of protein structures to locate specified substruc- letions and thechoices of gap penalties. These decisions tures within proteins. We then apply this measureto the can be computationally intensive. pairwise comparison of 158 diverse protein structures and Rackovsky (1990) has developed a similarity measure use clustering algorithms to identify families. Finally, we that compares distributions of conformations of chain compare the dissimilarity distribution of protein struc- segments up to four residues in length. Whereas it cap- tures to simulations of points distributed in d-dimensional tures structural information of residues close together Euclidean spaces to explore the tightnesswith which pro- in sequence, our interest here is to capture information teins cluster into families. about contacting residues at all separations along the chain. CONGENEAL: A dissimilarity measure Our purpose here is better served by yet a different measure of structural relatedness. The following ques- The CONGENEAL dissimilarity measure compares the tions motivate theneed for a different measure. What is weighted distance maps of two polymer conformations the shape of protein conformationalspace? What is a use- (see Fig. 1). The weighted distance map of a protein chain ful “reaction coordinate” alongwhich a protein folds to conformation that has N residues is an N x N matrix in its native state? In models of proteins, such as those in- which each matrix element (i,j) is a weight, w,equal to volving chains on lattices, how similaris a model confor- the distance, d,,,, between the a-carbons of residues i mation to the truenative conformation? To addressthese and j, raised to a power -p (p> 0): questions, we need a similarity measure for which the two most important criteria are (1) that it must be able to com- pare any two conformations, no matter how different, and (2) that it must entail making thefewest possible ar- Two residues that are adjacent in space are assigned a bitrary decisions. Furthermore, the measure must avoid large weight, whereas two residues that are far apart in comparing structures based on microscopic details such space have a small weight. Because the matrixis symmet- as hydrogen bondangles, since these are not appropriate ric, it is only necessary to compute the upper(or lower) for some low-resolution models. Many such problems do triangle of the matrix. Thedifference between the weighted not involve insertions, deletions, or gaps, and therefore distance map and the contact map, first introduced by do not require that a similarity measure have sophisticated Liljas and Rossmann (1974), is that, in the former, the alignment machinery. weights are assigned from a continuous range of values If an algorithm that measured structural relatedness whereas, in the latter, only weights of 0 or 1 are used to were computationally efficient enough, it could also be indicate whether or not a pair of residues are adjacent. put to other uses. For example, since the number of The distancedependence in Equation 1 resembles that known protein structures is P = 100-1,000 (depending of intermolecular forces. For the purpose of dissimilar- on whether we choose all known structures, or whether ity measures, however, there are no underlying principles they are selected in some way to avoid repeats of nearly identical molecules), the number of pairwise comparisons involved is [Px (P- l)] /2 = 104-106. If we could com- pute all these pairwise “distances,” we could measure the interrelatedness among proteins to learn how they clus- ter into families. Different similarity measures make dif- ferent trade-offs betweenspeed, number of arbitrary decisions, and discrimination. By choosing a measure that is as simple, fast to compute, and nonarbitrary as pos- sible, we trade off the degree of discrimination among highly similar proteins obtained by other measures, but the latter is less important for our purposes. The outline of this paper is as follows. We first introduce the algorithm for measuring dissimilarity. (It measures “dissimilarity” because it is 0 for identical structures, and increases as the structuralsimilarity between two proteins diverges.) We call it CONGENEAL (CONformational GENEALogy) because it compares conformations and can Fig. 1. Weighted distance maps of crambin with p = 2 on the left and generate family trees describing their relatedness. Much p = 6 on the right. The heightof the peaks corresponds to the magni- of this paper is devoted to showing that CONGENEAL tude of the weight, w. 886 D.P. Yee and K.A. Dill that direct us to choose a particular value of p. We have imply some structural features that are not present in investigated p = 1, 2, 4, and 6. When p = 6, only the the actual conformation (e.g., a helix can move from the closest neighbors contribute to theweighted distance map. N-terminus end of a conformation to the C-terminus Whenp = 2, pairs of residues separated in space by greater end), a randomization procedure is used to ensure that the distances also contribute. We compare different values of dissimilarity score and alignment do not contain artifacts p below, but the qualitativeresults are found tobe sensi- from using periodic weighted distance maps. The ran- bly independent of p. We mainly use p = 2. domization procedure is as follows. For any comparison We now describe how the dissimilarity score is obtained of residue pairs, (i,j)R,(i’,;’)s, involving a wrapped- from the weighted distance maps for two conformations. around residue pair, a difference weight is not added di- Given two proteins, R and S, let r;, be the distance be- rectly to the dissimilarity score. Instead, these weights are tween residues i and; in protein R and let sjjbe the dis- collected in separate bins based on contact order (the con- tance between residues i and; in protein S. First, consider tact order forresidue pair (i,;) is defined as 1; - i I, i.e., a simple case. When R and S have the same chain length, the separationof the residues along the chain). Thebinned N, and have a direct residue-to-residue alignment, the dis- distance weights from one conformation are then ran- similarity between the two proteins is given by: domly matched with the binned distance weights from the other conformation and then added to the dissimilarity NN score. In practice, the offset that gives rise to the best alignment of two proteins containsfew references to non- existent (i.e., wrapped-around) residue pairs, and several different methods that we tried for treating them gave similar results. When comparing a larger protein with a smaller one, If two proteins have identical weighted distance maps, CONGENEAL finds the part of the large protein thatis then d(R, S) = 0. most similar to theweighted distance mapof the smaller Now, in orderto compare proteins with different chain protein. This feature makes CONGENEAL useful for lengths and unknownalignments, we define a score based rapidly finding specific substructures within different pro- upon sliding one map across another,similar to a corre- teins in a structural database.To search a database fora lation function. That is, if two proteins, R and S, have specific substructure, CONGENEALis used to generate chain lengths M and N, respectively, where A4 IN, then scores between the substructure and each protein in the we calculate a series of dissimilarities as follows: database: a small score for some alignmentwith a given protein locates that motif within the protein.

Validation of the dissimilarity measure

How can one validate a dissimilarity measure? For any where the rangeof “offsets,” 7,of one weighted distance two proteins, different measures can predict different de- map relative to the othervaries from -M/2 to N - MI2 grees of relatedness. As noted before, there is no funda- for a total of Ndifferentalignments. The dissimilarity be- mentally correct measure of relatedness. Therefore, the tween the proteins R and S is then obtained by finding the validation of a dissimilarity measure ultimately depends offset for which the similarity is greatest: on whether it seems sensible in light of other knowledge. In the section below, we characterize CONGENEAL in d(R,S) = min(d’(R,S,T)). (4) the following ways:

Thisprocedure alone, however, is not sufficient to Pairwise tests. (a) Finding sequence alignments. specify a score, sincethe sliding of distance maps means When two different proteins contain the same sub- that some (i,;) pairs of one conformation will sometimes structure, a dissimilarity measure should find the go unpaired with (i’,;’) pairs in the other. For example, sequence alignment for which the structures most if 7 = -5, then the residue pair (l,3)R of protein R will closely superimpose. (b)Using aprobeprotein struc- be compared to a nonexistent pair (-4,-2)s of protein ture to search the database. A dissimilarity measure S. Therefore, there are two additional steps in the scor- should find related proteins or substructures in a ing method. First, the weighted distance maps are made search of a structural database. periodic (i.e., “wrapped around”) so that residue pairs are Cluster analysis. We compare 158 proteins pairwise defined for all offsets. In this way, the number of com- and apply clustering algorithms to ask whether the pared residue pairs is the same for all alignments. Second, dissimilarity measure findssensible family relation- because a “wrapped-around’’ weighted distance map may ships among them. We use two types of clustering Families of globular proteins 887

methods: minimal spanning trees and hierarchical orp = 6. The point atwhich the score dipsto a minimum trees based on agglomerative clustering. (1) indicates the degree of similarity between the two pro- teins, and (2) gives the offset (Le., shift) of onesequence All protein coordinates were obtained from the Brook- starting position relativeto theother sequence for which haven Protein Data Bank (PDB) (Bernstein et al., 1977; the structures bear closest resemblance. Abola et al., 1987). The set of 158 protein structures was For the threepairwise protein comparisons mentioned derived from Appendix 3 ofProtein Architecture (Lesk, above, CONGENEAL finds the expected relationships. 1991). Table 1lists the proteins and their PDB filenames. Sperm whale is found to be similar to the A chain of human hemoglobinwith an offset of -6 resi- dues. On the other hand, Figure2B indicates that there Pairwise tests is no similarity between sperm whale myoglobinand the Finding sequence alignments orange subunit of superoxide dismutase. The dissimilarity To first choose a few examples, it is reasonable to be- measure finds hen egg white lysozyme and T4 bacterio- lieve that sperm whale myoglobin (lmbd) and theA chain phage lysozyme to have only a small degree of similarity. of human (1hho-a) are closely related pro- In this case,the best score is obtained at anoffset of -26 teins, that sperm whale myoglobin and the orange subunit residues, in agreement with the observations of Reming- of superoxide dismutase(2sod) are unrelated, and that the ton and Matthews (1978) and Rossmann and Argos lysozymes from T4 bacteriophage (31zm) and from hen (1976), who noted that when residues 1-80 of the phage egg white (llyz) are only distantly related.Figure 2 shows lysozyme are aligned with residues 27-106 of thehen egg the dissimilarity score as a function of alignment for these white lysozyme, there is overlap of the active sites. three comparisons using CONGENEAL with eitherp = 2 Using a probe to search the database

When a probe protein or substructureis scanned through a protein database, a dissimilarity measure should prop- A erly rank order the proteinsby their similarity to the probe. Below we show three examples - the helix--helix DNA binding motif, the EF hand calcium binding motif, and the fold- for which the dissimilarity score identi- fies closely related protein conformations. 0 -70 -10 50 -70 -10 50 Offset Offset 1. DNA binding motif. A number of proteins are known to have similar helix-turn-helix substructures that bind DNA. X Cro, X repressor, 434 Cro, 434 repressor, trp repressor, and catabolitegene activator protein (CAP) all have sequence similarity in a region of 22 amino acids, 32$21 corresponding to the helix-turn-helix 32" (Ohlendorf et al., 1983). How widely distributed is the $*1 helix-turn-helix motif throughout the protein database? 2 IWe use CONGENEAL tosearch the dataset for thehelix- 2-70 u -10 50 -70 -10 50 Offset Offset turn-helix conformation. In our search, the helix-turn- helix substructure is defined as the 23-residue stretch from 434 Cro startingwith methionine 15 and ending with gly- c q"-l cine 37. As a simple test, Figure 3 shows the result of aligning the 434 Cro helix-turn-helix substructure with the full 434 Cro protein. The deepest minimum in Fig- ure 3 correctly identifiesthe properalignment with itself, and the score of0 indicates that it is an exact match. The 0N -60 0 40 other three minima correspond to thethree other turnsbe- j2W -60 0 40 Offset Offset tween helices found in Cro. The 434 Cro helix-turn-helix DNA binding substruc- Fig. 2. Plots of dissimilarity score versus alignment. For A-C, p = 2. ture was then scanned across the dataset of158 proteins. For D-F, p = 6. A and D show the comparison of sperm whale myo- The distribution of dissimilarity scores is shown in Fig- globin with human hemoglobin.B and E show the comparison of two unrelated proteins: sperm whale myoglobin and superoxide dismutase. ure 4. Several proteins are found tohave helix-turn-helix C and F show the comparison of two weakly similar proteins: T4 (bac- substructures similar to thatof 434 Cro. Table 2 lists the teriophage) lysozyme and hen egg white lysozyme. 20 proteins with the greatest similaritiesto thetarget sub- 888 D.P. Yee and K.A. Dill

Table 1. Key to 158-protein dataset . ~" - ." No. of No. of Code residues Protein name nameCode Protein residues

~ ~~ " .~ ~ ~ ____ "~ ~~ ~ ~~ 451c 82 cssl 3fxn 138 Flavodoxin 155c 134 Cytochrome csso 3gap-c 208 Catabolite gene activator protein (closed 256b 106 Cytochrome bs62 form) 1 aat 288 Aspartate aminotransferase 3gap-o 205 Catabolite gene activator protein (open labp 306 L-Arabinose binding protein form) 2abx 74 or- 2gb 309 Galactose-binding protein 2act 218 Actinidin lgcr 174 y-Crystallin 1 acx 107 Actinoxanthin lgdl-o 334 Glyceraldehyde-3-phosphate dehydrogenase 6adh-a 374 dehydrogenase 2gls-a 468 Glutamine synthetase 3adk 194 Adenylate kinase lgpl-a 184 Glutathione peroxidase 2ait 74 Tendamistat 3grs 46 1 Glutathione reductase lalc 122 a-Lactalbumin lhho-a 141 Human hemoglobin 2alp 198 a-Lytic protease 1 hip 85 High potential iron protein 4ape 330 Endothiapepsin 1 hkg 457 Hexokinase 7api 339 a,-Antitrypsin 2hla-h 270 Human class 1 histocompatibility complex 3aPP 323 Penicillopepsin (heavy) 2apr 325 Rhizopuspepsin 2hla-m 99 Human class 1 histocompatibility complex 2atc-c 305 Aspartate transcarbamylase (regulatory ((3-2-microglobulin) subunit) Zhmg- 1 328 lnfluenza hemagglutinin (HAl) 2atc-r 152 Aspartate transcarbamylase (catalytic 2hmg-2 175 lnfluenza hemagglutinin (HA2) subunit) lhmq-a 113 Hemerythin 2aza-a 129 Azurin (Alcaligenes denitrificans) 1 hoe 74 a-Amylase inhibitor 3b5c 85 Cytochrome b5 3hvp 99 HIV protease 1 bds 43 Sea anemone antiviral protein 2ilb 153 Interleukin-10 3blm 257 0-Lactamase 3icb I5 Intestinal calcium-binding protein 1 bp2 123 A2 4ins-a 21 2Zn insulin 3c2c 112 Cytochrome c2 1 kga 173 2-Keto-3-deoxy-6-phosphogluconate 2ca2 256 Carbonic anhydrase aldolase 8cat-a 498 Beef liver catalase 21bp 346 Leucine-binding protein lcbp 86 Cucumber basic protein 31dh 329 Dogfish lactate dehydrogenase 1 cc5 83 Cytochrome cs 21h4 153 Lupin 1ccr 111 Rice cytochrome c 21iv 344 Leucine/isoleucine/valine-binding protein 2ccy-a 127 Cytochrome c' llrd 87 X Repressor 2cdv 107 Cytochrome cj llyz 129 Hen egg white lysozyme 2ci2 65 Barley chymotrypsin inhibitor llzl 130 Human lysozyme 3cln 143 31zm 164 T4 lysozyme 1 cms 323 Chymosin B 1mbd 153 Sperm whale myoglobin 2cna 237 Concanavalin A 4mdh 334 Malate dehydrogenase 5cpa 307 Carboxypeptidase A Zmev-vpl 268 Mengo virus VPI 2CPP 405 CAM 2mev~vp2 249 Mengo virus VP2 5cpv 108 Carp 2mev-vp3 23 I Mengo virus VP3 lcrn 46 Crambin 4mlt 26 Mellitin lcro-o 66 X Cro lmon-a 44 Monellin (A chain) 2cro 65 434 Cro 1 nxb 62 B 1cse 274 Subtilisin carlsberg 20vo 56 Ovomucoid, third domain lctf 68 C-terminal domain of ribosomal protein 2pab 114 Prealbumin L7/L12 9PaP 212 Papain lctx 71 a- 2paz 123 Pseudoazurin Scyt 103 Tuna cytochrome c 1PCY 99 Plastocyanin 2CYP 293 Cytochrome c peroxidase 4pep 326 Pepsin 3dfr 162 Dihydrofolate reductase Ipfk-c 320 Phosphofructokinase (closed form) 5ebx 62 Erabutoxin A Ipfk-o 320 Phosphofructokinase (open form) I ecd 136 3pgk 416 Phosphoglycerate kinase lefm 130 Elongation factor TU 3pgm 230 Phosphoglycerate mutase 2enl 436 Enolase lphh 394 p-Hydroxybenzoate hydroxylase Zest 240 Porcine elastase lphy 126 Photoreactive yellow protein letu 141 Elongation factor TU 2pka 232 Kallikrein A 2#4-h 229 Fob KOL (heavy chain) 2plvLvpl 283 Polio virus VPI 2#4-1 216 Fob KOL (light chain) 2plv_vp2 268 Polio virus VP2 lfcl 206 F, fragment of immunoglobulin 2plv-vp3 235 Polio virus VP3 4fd I 106 Ferredoxin lpp2-r 122 phospholipase (continued) Families of globular proteins 889

Table 1. Continued

No. of No. of Code residues Protein name nameCode Proteinresidues

- ~ 1PPt 36 Avian pancreatic polypeptide 124 7rsa A lprc-c 332 Photosynthetic reaction center 5rub-a 260 Rubisco Rhodopseudomonas viridis C subunit 5rxn 54 Rubredoxin Iprc-1 273 Photosynthetic reaction center R. viridis L 4sbv-a 199 Southern bean mosaic virus subunit 2sga 181 Streptomyces griseus proteinase A Iprc-m 323 Iprc-m Photosynthetic reaction center R. viridis M 3sgb 185 S. griseus proteinase B subunit lsn3 65 Scorpion neurotoxin lprc-h 258 Photosynthetic reaction center R. viridis H 2sns 141 Staphylococcal subunit 2sod-0 151 Superoxide dismutase (orange subunit) 2prk 279 2prk Proteinase K 1srx 108 Thioredoxin lpte 348 Carboxypeptidase/transpeptidase 2ssi 107 Streptomyces subtilisin inhibitor 5pti 58 Bovine pancreatic trypsin inhibitor 2stv 184 Satellite necrosis virus 4PtP 223 4PtP Trypsin 2taa 47 8 Taka-amylase 1PYP 280 Pyrophosphatase 2tbv 286 Tomato bushy stunt virus r69 63 1 r69 434 Repressor (N-terminal domain) 279 1tec Thermitase Irbb-a 124 Irbb-a Ribonuclease B thi 1 207 Thaumatin I lrei 107 Immunoglobulin V, domain 1 tim 247 Chicken triosephosphate lrhd 293 Rhodanese 2tmv 154 Tobacco mosaic virus 2rhe 114 Immunoglobulin Vh domain 4tnc 160 4rhv-vpl 273 Rhinovirus VPI tnf 1 152 Tumor necrosis factor 4rhv-vp2 255 Rhinovirus VP2 lubq 76 Ubiquitin 4rhv-vp3 2364rhv-vp3 Rhinovirus VP3 lutg 70 Uteroglobin 3rn3 124 Ribonuclease A 9wga 171 Wheat germ agglutinin lrns 72 Ribonuclease S 1wrp 102 trp repressor 2rnt 104 , 4xia 393 Xylose isomerase 3rp2 224 Rat mast cell protease 2yhx 457 Hexokinase

~~ .- -~ " -~

structure. All seven proteins known to have the DNA vation of Mondragon et al. (1989) that the aminotermi- binding helix-turn-helix substructure are in this group. nal domain of 434 repressor is remarkably similar to 434 In all seven cases, the predicted alignment of the substruc- Cro and that thesubstructures are virtually identical. The ture with the protein is identical to the alignment pro- DNA binding protein that is least similar to 434 Cro is duced by sequence analysis (Ohlendorf etal., 1983). The trp repressor. The latter differs from the protein with the most similar substructure (other than 434 other DNA binding proteins in two respects: (1) the end Cro) is 434 repressor. This is consistent with the obser- of the first helix is more open, and(2) the orientation of

I 0.00 0.20 0.40 0.60 0.80 Dissimilarity Score

Fig. 4. Histogram showing the distributionof dissimilarity scores when Fig. 3. Dissimilarity score versus alignment of the 434 Cro DNA bind- the 434 Cro DNAbinding substructure is compared with a dataset of ing substructure with the complete structure of 434 Cro. 158 proteins. 890 D.P. Yee and K.A. Dill

Table 2. Proteins with helix-turn-helix motif teins or non-DNA binding proteins with the DNA bind- ing substructure of 434 Cro (see also Kinemage 1). PDB Protein name filenameCONGENEAL RMS 2. Calcium binding: TheEF hand. Another well-charac- 434 Cro 2cro 0.000 0.000 terized substructural motif is the EF hand calcium binding 434 Repressor (N-terminal conformation, first described by Kretsinger and Nockolds domain) 1r69 0.052 0.380 (1973) from carp muscle calcium binding parvalbumin X Cro Icro-o 0.068 0.585 X Repressor llrd 0.099 0.830 (5cpv). The EFhand is also found inseveral other proteins Enolase 2enl 0.110 1.634 that bind calcium, including calmodulin (3cln), tropo- Catabolite gene activator nin C (4tnc), and intestinal calcium binding protein (3icb). protein (open form) 3gap-o 0.120 1.135 Although CONGENEAL finds relatively few substruc- Catabolite gene activator tures identical to EF hands,many proteins contain sub- protein (closed form) 3gap-c 0.126 1.068 Cytochrome P450 CAM 2CPP 0.135 1.742 structures that are fairly similar. C-terminal domain of We define the EFhand substructure in carp parvalbu- ribosomal protein L7/L12 lctf 0.170 1.938 min as the 29 residues from asparagine 79 to lysine 107. Xylose isomerase 4xia 0.177 2.047 The E helix is 12 residues (79-90), the loop is 8 residues Cytochrome c peroxidase 2CY P 0.178 2.053 (91-98), and the F helix is 9 residues long (99-107). Fig- Photosynthetic reaction center R. viridis M subunit lprc-m 0.178 2.963 ure 6 shows the result of aligning this substructure with Photosynthetic reaction center the complete structure of carp parvalbumin. There are R. viridis L subunit lprc-l 0.189 2.835 five minima, corresponding to thejoining regions between trp repressor lwrp 0.197 1 .I29 the six helices of parvalbumin (labeled A-F): AB, BC?, Beef liver catalase 8cat-a 0.203 2.652 CD, DE, andEF. Strong matches are found at twoposi- Erythrocruorin 1ecd 0.205 2.978 Proteinase K 2prk 0.205 2.543 tions, corresponding to C-loop-D and E-loop-F. Both of Glutamine synthetase 2gls-a 0.207 3.694 these substructures are in the EF hand conformation. Hemerythin lhmq-a 0.209 4.206 Kretsinger and Nockolds (1973) suggested that A-loop-B Sperm whale myoglobin 1 mbd 0.210 4.531 is related to the EF hand, but ourresults do not show sig- nificant structural similarity between A-loop-B and the EF hand substructure. In fact, theA and B helices are ori- ented nearly parallel to one another,whereas the helices in an EF hand are nearly perpendicular. the second helix in the helix-turn-helix substructure is We then use the EF hand substructure as a probe to constrained by the binding of L-tryptophan (Schevitz et al., search the dataset of 158 proteins. Figure 7shows the dis- 1985). tribution of dissimilarities. The fourbest scoring proteins The seven DNA binding proteins rank 1, 2, 3, 4, 6, 7, are all calcium binding proteins (see Table 3). The EF and 14 in similarity to the probe helix-turn-helix struc- hand in troponin C was found to be the most similar to tural motif. Some non-DNAbinding proteins also score the parvalbumin structure; the dissimilarity scores as a well for thepresence of the helix-turn-helix substructure function of alignment are shown in Figure 8. Minima (see Table 2). In many cases, the best match of the sub- identify the four EF hand substructures in troponin C. structure to a protein occurs when the helix-turn-helix The two minima on the right in Figure 8 indicate two sub- substructure is aligned with the last half of a long helix, structures that are the most similar to the parvalbumin a turn, and the first few residues of the following helix. substructure and correspond to the EF handsnearest the Two of the proteinsidentified here as having a substruc- C-terminal end of the protein. The two minima on theleft ture similar to that of the probe substructure have been identify two N-terminalEF hands that areless similar to previously noted by Richardson and Richardson (1988). the parvalbumin EF hand. Interestingly, the N-terminal They found that cytochromec peroxidase and ribosomal EF hands do not bind calcium (Satyshur et al., 1988). L7/L12 protein contain conformationssimilar to the DNA The method correctly identifies bovine intestinal cal- binding helix pairs in gene activator and repressor pro- cium binding protein as being similar to the EF hand. On teins. In thepresent analysis, the non-DNAbinding pro- the other hand,given that carp parvalbumin andbovine tein that hasa substructure mostsimilar to the probe DNA intestinal calcium binding protein are related, the RMS binding substructure is yeast enolase. Enolase has two do- deviation of C, positions between their EF hands seems mains consisting of (1) a three-stranded 0 meander and to be surprisingly large. The large RMS deviation arises four a-helices and (2) an eightfold a/p barrel (Lebioda because the residues at bothends of the substructure have et al., 1989). The helix-turn-helix substructure of 434 Cro different conformations in the two proteins. The next aligns with enolase near the end of the N-terminal do- most similar substructure to the EF hand is in T4 lyso- main. Figure 5 shows the top eight alignments found by zyme (see Kinemage 2); this similarity was first noted by CONGENEAL for substructures from DNA binding pro- Tufty and Kretsinger (1975). Families of globular proteins 89 1

Fig. 5. Structural alignment of 434 Cro DNA binding motif to top8 matches identified by CON- GENEAL and trp repressor, which scored 14th. 434 Cro DNA binding substructure is shown in green, DNA binding proteins are shown in red, and non-DNA binding proteins are shown in blue. trp repressor, the DNA binding protein least similar to the target 434 Cro substructure, is shown in cyan.

3. Searching protein structuresfor functional subunits. are predicted to have at least one short loop conforma- The ability to perform fast searches for substructures tion similar to the calcium binding loop, five proteins are within proteins allows for searching a database of protein clearly distinct as being more similar than theother pro- structures for specific functional subunits. We show an teins. They include the four calcium binding proteins example of usingCONGENEAL to find possible calcium identified above. Hence, the weighted distance map for binding proteins in the dataset. In theEF hand substruc- this loop region is a good identifier of the calcium bind- ture, the calcium is bound to residues within the loop ing motif. In addition, galactose binding protein scores region. Therefore, we search the dataset for thepresence well for the presence of a calcium binding loop. Consis- of the E helix, the F helix, and the loop.E and F arecy- tent with this finding, galactose binding protein was re- helices, so all proteins with cy-helices score well for the ported to have a calcium (Vyas et al., 1987) presence of the E and F helix substructures (data not that resembles the EF hand without the helices. T4 lyso- shown). Figure 9 shows the distribution of dissimilarities zyme, which scored 5th for the presence of an EFhand, with the calcium binding loop. Although most proteins scores 69th for thepresence of the calcium binding loop. Although T4 lysozyme has two helices similarin orienta- tion to the EF hand helices inparvalbumin, the intervening loop is clearlynot in the calcium binding conformation.

-10 10 30 50 70 90 0.00 0.20 0.40 0.60 0.80 Off set Dissimilarity Score

Fig. 6. Dissimilarity score versus alignment of the carp parvalbumin EF Fig. 7. Histogram showing the distribution of dissimilarity scores when hand substructure with the complete structure of carp parvalbumin. the EF hand substructure is compared with a dataset of 158 proteins. 892 D.I? Yee and K.A. Dill

Table 3. Proteins with EF hand

~ - ~ .. -"_ Protein name PDB codeCONGENEAL RMS

- .~~ ~ - " Carp parvalbumin0.002 5cpv 0.000 Troponin C 4tnc 0.054 0.644 Calmodulin 3cln 0.064 0.987 Intestinal calcium-binding protein 3icb 0.135 2.868 T4 lysozyme 31zm 0.227 2.994 Sperm whale myoglobin lmbd 0.251 5.199 Human hemoglobin Ihho-a 0.253 5.143 Subtilisin carlsberg lcse 0.255 4.107 0.00 0.20 0.40 0.60 0.80 Cytochrome P450 CAM 2CPP 0.261 6.142 Dissimilarity Score Erythrocruorin lecd 0.264 4.991 Lupin leghemoglobin 21h4 0.265 4.612 Fig. 9. Histogram showing the distributionof dissimilarity scores when Hemerythin 1 hmq-a 0.266 5.050 the calcium binding loop from carp parvalbumin is compared with a Enolase 2enl 0.269 4.524 dataset of 158 proteins. Thermitase 1tec 0.270 4.159 Cytochrome c peroxidase 2CYP 0.275 4.565 frp repressor 1 wrp 0.277 4.272 Cytochrome bS6* 256b 0.277 4.474 Proteinase K 2prk 0.280 5.069 ments(2fb4-h, 2fb4-1, lfcl),tumor necrosis factor Leucine-binding protein 21bp 0.282 3.518 (ltnf), monellin (lmon-a), and Cu,Zn superoxide dis- Malate dehydrogenase 4mdh 0.282 5.295 mutase (2sod). The dissimilarity distribution from CON-

~~ ~~ -~ GENEAL resembles an earlier comparison madeby Bowie et al. (1991) of sperm whale myoglobin versus a protein dataset based on their three-dimensional (3D) profiling method. Their method shows the degreeto which the se- 4. . Figure 10 shows the dissimilarities of sperm quences of other globins are compatiblewith the structure whale myoglobin (lmbd) to theset of 158 protein struc- of sperm whale myoglobin. Our method shows the degree tures. The four most similar structures are all globins: to which the structures of other globins are similar to the sperm whale myoglobin (lmbd), erythrocruorin (lecd), structure of sperm whale myoglobin. At least for sperm human hemoglobin (lhho-a), and leghemoglobin (21h4). whale myoglobin, the distributions from 3D profiling and The next most similar proteins are all dominated by a- CONGENEAL bear considerable resemblance. helices. They include uteroglobin (lutg), trp repressor (lwrp), calcium binding protein (3icb), and cytochrome Most closely related proteins b562(256b). The proteins least similar to myoglobin are all P-sheet proteins: they include immunoglobulin frag- As another test, we compare the 158 proteins pairwise ([Px (P- l)]/2 = 12,403 tests) and ask which pairs are the most closely related. Of course, because this is not a "selected" set of unrelated conformations. some of these

-10 20 50 80 110 140 I 0.00 0.20 0.40 0.60 0.80 Offset Dissimilarity Score Fig. 8. Dissimilarity score versus alignment of the EF hand substruc- ture with troponin C. The four minima correspond to theEF four hand Fig. 10. Histogram showing the distribution of dissimilarity scores when substructures in troponin C. sperm whale myoglobin is compared with a dataset of 158 proteins. Families of globular proteins 893 proteins are quite similar; these highly similar pairs are structures of ribonucleaseA (3rn3,7rsa), ribonuclease B controls thatwe study here. Table 4lists the 20 most sim- (lrbb-a), and ribonuclease S (lrns). Each pair of struc- ilar protein pairs. Not surprisingly, severalof these pairs tures involves only a small structural variation. For ex- represent the same protein in two different conforma- ample, A and B have identical tions. For example, six of the closest structural similari- sequences, but differ by a polysaccharide moiety that is ties are pairs ofribonucleases. The dataset contains four attached to asparagine 34 of ribonuclease B. ribonuclease structures: two independently determined Ribonuclease B is about as similar to ribonuclease A as the two ribonuclease A are to each other. This result is consistent with the conclusion of Williams et al. (1987) that the conformation of ribonuclease B is not signifi- Table 4. Most closely related proteins in dataset

-~ ~ cantly different from that of ribonuclease A. The small ~ variability occurs mostly in the &sheet regions. Protein name PDB code Score CONGENEAL also finds thecorrect alignment of ribo- nuclease S with the other ribonucleases (i.e., at anoffset Ribonuclease A 7rsa 0.014 Ribonuclease A 3rn3 of 21 residues). All the ribonuclease structures are quite similar to each other, but ribonuclease S is the least sim- Phosphofructokinase (open form) Ipfk-o 0.035 ilar among them. The deviations of ribonuclease S rela- Phosphofructokinase (closed form) Ipfl-c tive to the other ribonuclease structures are attributable Rice cytochrome c lccr 0.052 to the contacts formedby residues 21-23 with the rest of Tuna cytochrome c Scyt the protein. Because ribonuclease S is formed by cleav- Ribonuclease A 3rn3 0.056 age of ribonuclease A between alanine20 and serine 21, Irbb-a Ribonuclease B the conformations of residues 21-23 presumably readjust Ribonuclease A 7rsa 0.057 in response to the cleavage of the between Ribonuclease B Irbb-a residues 20 and 21. 434 Cro 2cro 0.072 CONGENEAL finds other highly similar pairs. It finds 434 Repressor (N-terminal domain) 1r69 rice cytochrome c to be very similar to tunacytochrome c. Erabutoxin A Sebx The rice structure has 8 additional residues at the N- 0.076 Neurotoxin B 1 nxb terminus, and there are43 substitutions in the other 103 Fob KOL (light chain) 2fo4-1 residues. Despite these sequence differences, the structures 0.079 Immunoglobulin Vx domain 2rhe are found tobe nearly identical (Matthews, 1985). Other Ribonuclease A 3rn3 sets of proteins that are found to be highly similar by 0.103 Ribonuclease S lrns CONGENEAL include DNAbinding proteins (434 Cro, Ribonuclease A 7rsa 434 repressor, h repressor), (erabutoxin A, 0.103 Ribonuclease S lrns neurotoxin B), immunoglobulins (FabKOL, immuno- Hexokinase 2yhx globulin V, domain), and viral VP3 domains (rhinovirus 0.114 Hexokinase 1 hkg VP3, polio virus VP3). Ribonuclease B lrbb-a CONGENEAL has limitations. First,it does not treat 0.116 Ribonuclease S 1 rns insertions, deletions, or gaps. An example of this limita- tion is in the comparison of a/p barrels such as triose CAP (closed form) 3gap-c 0.125 CAP (open form) 3gap-o phosphate isomerase (TIM). It has been suggested that all the known a//3 barrels may have diverged from a com- Tendamistat 2ait 0.136 a-Amylase inhibitor 1 hoe mon ancestor (Farber & Petsko, 1990). If so, and if the process of evolutionary divergence involves changing loop Leucine-binding protein 21bp 0. I40 lengths while retainingsecondary structural domains, Leu/ile/Val-binding protein 21iv then evolutionary “distance” requires a similarity measure Human lysozyme llzl 0.222 that carries only weak penalties for changing lengths of Hen egg white lysozyme 1 lyz loops between domains. Although some similarity meth- Rhinovirus VP3 4rhv-vp3 0.231 ods do this (Taylor & Orengo, 1989), CONGENEAL does Polio virus VP3 2plv-vp3 not, and therefore would not be useful as a measure of X Repressor llrd 0.237 evolutionary divergence by this mechanism. Hence, again 434 Repressor (N-terminal domain) 1r69 we caution thatdifferent similarity measures will find dif- X Repressor llrd 0.289 ferent degrees of relatedness among proteins and will find 434 Cro 2cro different family clusters, but thereis no unique right way Elongation factor TU letu to dothis. And we note that the approach takenin CON- 0.277 Elongation factor TU lefm GENEAL, while it is disadvantageous for measuring evo- lutionary divergence by this mechanism, is advantageous 894 D.P. Yee and K.A. Dill for other purposes, because it is based on making no as- structures by two different methods:a minimal spanning sumptions about mechanisms of how one conformation tree and a hierarchical method. Different similarity mea- is caused to differ from another. Such a need arises in the sures and clustering methods can lead to different, but comparison of conformations of a given sequence, in equally valid, divisions of proteins into families. which case there are no gaps, insertions, ordeletions, or in the comparison of very different conformations that Clustering by minimal spanning trees may not be related by a known evolutionary mechanism, First, we construct a minimal spanning tree, which is in which case we believe it may often be preferable to a graph that provides one way to describe relatedness measure similarity with an algorithm having a minimum among proteins. Consider a graph in which each one of number of degrees of freedom. the P protein structures is represented by a node. Every Second, when comparing sets of proteins with differ- possible pair of nodes is connected by an edge. Each edge ent alignments and chainlengths, the dissimilarity mea- is weighted by the dissimilarity score relating the two pro- sure is not a true distance metric. That is, as with many teins. Hence, there are [ P x (P- l)]/2 edges. A span- other similarity measures, the triangle inequality law, ning tree is a subgraph in which there are only P - 1 edges connecting the Pvertices (proteins). A minimal spanning d(a,b)+ d(b,c)2 d(a,c), tree is a spanning tree inwhich the sumof the weights of the edges is as small as possible. Thus the only connec- can be violated. For example, in the pairwise comparison tivity is among the mostsimilar proteins. We construct a of a sheet (S), a helix (H), anda protein consisting of both minimal spanning treeusing Kruskal’s algorithm (Horo- a sheet and a helix (P): witz & Sahni, 1978), as follows. First, the pairwise scores are sorted from most similar to least similar. The tree is d(S,P) = 0 and d(P,H)= 0, then constructed edge by edge. The first edge is defined as the proteinpair with the lowest score (highest similar- but ity). The second edge is chosen to be the next lowest score that does not lead to a cycle in the graph. If a cycle were d(S,H)> 0. formed, then there would be more than one pathbetween two vertices, and at least P edges for P vertices, thus vi- Third, as with other contact-map based approaches, olating the criterion ofa spanning tree. Theprocess con- CONGENEAL does not distinguish structures by their tinues until there are P - 1 edges. chiralities. A molecule is indistinguishable from its mir- Figure 11 shows the minimal spanning tree for the set ror image. For comparingmolecules with consistent chi- of 158 protein structures based on the CONGENEAL dis- ralities, such as tworeal proteins, this is not a limitation. similarity measure. A treeis unique provided that no two For comparing a lattice model and a real protein, how- edges have the same weight. Note that nomeaning should ever, chiral errors will not be detected. In a most general be attributed to the edge lengths shown inthe figure,be- way, CONGENEAL only attempts to characterize dis- cause they are not drawn in proportionto their respective tances pertinent to nonlocal interactions. In this sense, dissimilarity weights. An edge connecting two proteins right-handed and left-handed helices are similar. When implies structural similarity between the two proteins. it is important todistinguish them, CONGENEALis not By this clustering method, proteins are found to col- appropriate. lect around hubs, which may be thoughtof as consensus family structures or structural paradigms. For example, monellin (lmon-a), uteroglobin (lutg), crambin (lcrn), Protein clustering into families and X repressor (llrd) areall hubs. Many other proteins CONGENEAL is a measure that computes the structural are connected to each hub. Each hub represents some similarity between any two compactpolymer conforma- characteristic topological feature (i.e., somespecific pro- tions. We have shown a few tests indicating where it is tein fold). For example, the A chain of monellin forms sensibly consistent with other knowledge. We now use this three strands of an antiparallel P-sheet. Any protein in the measure to study how it divides proteins into families. We dataset that has three strands in a similar conformation define a family as a set of structures that collectively share will score well when compared tomonellin, and may be a high degree of similarity to one another. The concept connected to themonellin hub. Similarly, uteroglobin, a of family carries the implication that there arerelatively binding proteinconsisting of four a-helices, sharp boundaries between families. Given a measure of is a hub for structures with similar helical and turn fea- similarity, there are several different methods for identi- tures. Crambin has both P-sheet and a-helix and serves fying clustering. As with similarity measures, there are no as a hub for proteins with similar secondary structural right or wrong clustering methods. In order to determine features. whether the families obtained are sensitive to thechoice Proteins are found cluster to into families, often around of clustering method, we study the clustering of protein hubs. For example, the globins cluster together. Lyso- Families of globular proteins 895

Minimal Spanning

158 Protein Structures

Fig. 11. Minimal spanning tree of 158-protein dataset. Proteins are referenced by the codes listed in Table 1. An edge connect- ing two proteins implies structural similarity between them. Edge lengths are not proportional to the dissimilarity between proteins. As a guide, the general locations of some major family relationships are indicated in bold.

zymes from hen egg white and humans cluster with Q- the greatest similarity. To determine group similarities, all lactalbumin. Other proteinclusters include (1) viral VP3 pairwise dissimilarities between groups arecalculated; this domains, (2) cytochrome c structures, (3) immunoglob- generates [ M x (M- 1)] /2 average dissimilarities for M ulin domains, (4) aspartic proteases, (5) trypsin-like serine groups. The groupdissimilarity is the average of the dis- proteases, and (6) subtilisin-like serine proteases. Inter- similarities between the members of one group with re- estingly, T4 bacteriophage lysozyme is separated by four spect to the members of another group. The merging nodes from the otherlysozymes. In this case, despite the process is repeated until all groups are combined into a structural similarity of the active sites, the remainder of single group. Themerging process is a sequence of deci- the structure of T4 phage lysozyme is different from that sions that can be represented as a tree(see Fig. 12). Nodes of the other lysozymes. at a given level in this treerepresent a given degree of dis- similarity. Hierarchical clustering As with the spanning tree, the hierarchical method In order to learn whether the family partitions found finds that immunoglobulins, serine proteases, ribonucle- by CONGENEAL depend on theclustering method, we ases, globins, aspartic proteases, and viral VP domains now consider a differentclustering algorithm for collect- form families. Hence, the general division into these fam- ing proteins intofamilies. Here we use hierarchical cluster- ilies appears tobe relatively independent of the cluster- ing, which successively groups proteins intoincreasingly ing method, although the details differ. larger sets. At first, there arePproteins in P groups. Step One interesting consequence of the hierarchical cluster- 1 is to combine the twomost similar proteinsto form the ing is evident from Figure 12. It leads to a partitioning of first group; there arenow (P- 2) single-protein groups families in which sheet structures are concentratedat the and onetwo-protein group.This is recorded as thefirst de- top of Figure 12 and helix structures are concentrated at cision. Step 2 is to combine the two groupsthat now have the bottom. According to this partitioning, sheet struc- 896 D.P. Yee and K.A. Dill

Dissimilarity 00 02 04 06 08

vini Domains 3

serine Protease8 1

seritm Protease8 2

Lysorymes

d.3 M 2-. X Calcium Binding

QloMns

Fig. 12. Hierarchical clustering of 158 proteins into a relatedness tree. The mean dissimilarities between the group members are indicated by diamonds at the branch points of the tree. Proteins are referenced by the codes listed in Table 1. As a guide, some major family relationships are indicated. Families of globular proteins 897 tures are less relatedto one another than arehelical struc- Figure 13. We create varying degrees of clustering as fol- tures. Helical structures are morerelated to one another lows. First, we assume there aref families of points in a because of the regular pattern of close contacts formed d-dimensional space. We randomly generate f points that by residues in the helical conformation. represent the family centers. Within each family, we then generate P/f points which are Gaussian-distributed around each family center. That is, the probability dis- Are proteins tightly clustered? tribution for a point x within a family is given by: Are protein families tight or loose forms of organization? “Tight” organization means that anytwo proteins within 1 -d2 (X, C)

P(X) = ~ a family are much more similar than any two proteins auk 2u2k2 1’ from different families. There would be a sharp bound- ary between protein families. “Loose” organization means where c is a family center, d(x, c) is the distance between that two proteins within a family may be only slightly x and c, k is the average distance between any two fam- more similar than two proteins taken fromdifferent fam- ily centers, and u is the parameter that controls the degree ilies. The boundaries between protein families would not of clustering. When u equals 1, the standard deviation of be sharp and might even overlap. points around a family center is equal to the average dis- We assess tightness of protein families by studying the tance between the family centers.As u decreases, the tight- shape of the histogram of pairwise dissimilarities (Fig.13). ness of the clustering increases. As an example, Figure 14 Qualitatively, if proteins are tightly clustered, the histo- shows scatter plots of points distributed in two dimensions gram in Figure 13 would have two peaks: one represent- around 15 “families” with three different values of u. ing the high similarities within families and the other After randomly generating points as described above, representing the low similarities between families. The we calculate all the pairwise distances between the points distribution of dissimilarities for the 158 protein dataset within each set. Figure 15 shows histograms for N = 200, structures, however, shows mainly a single broad peak, d = 7, f = 25, and varying u. When u is small (tight clus- which indicates a wide range of relatedness among pro- tering) the histograms have two peaks, as expected. The teins. There is only a very small peak indicating high sim- ilarities and tight families. The mean dissimilarity is 0.737, indicating that two arbitrary proteins are relatively un- related. By this qualitative criterion, protein structural A no clustering families are only loose entities. A second way to assess the tightness of clustering draws on the analogy between (1) P proteins as points separated by their pairwise dissimilarities and (2)a set of P points distributed in a d-dimensional space separated by their Euclidean distances. To pursue this analogy, we generate points in Euclidean spaces with varying degrees of clus- = 0.10 tering in several different dimensionalities, d. The distri- €3 t3 bution of distances between the points is compared to the distributions of pairwise protein dissimilarities shown in

C CT = 0.05

Fig. 14. Degrees of clustering using a Euclidean distance analogy: in- 0.40 0.00 0.20 0.60 0.80 creasing the tightness of clustering from A to B to C. Scatter plots of Dissimilarity Score points randomly distributedaround family centers. Family centers are represented by large diamonds. Points associated with a family center Fig.- 13. Histogram showing distribution of 12,403 pairwise compari- are represented as small squares. In A-C,the number of familv centers sons of the 158-protein dataset. is 15 and the total number of points is 200. 898 D.P. Yee and K.A. Dill A

(5 = 0.05

JI 0.0 1.0 0.0 1.0 0.0 1.0 Distance Distance Distance Distance Distance

Fig. 15. Distribution of pairwise distances between points randomly distributed between 25 families within a seven-dimensional sphere.

leftmost peak is due tointrafamily distances and theright- together than areinterfamily centers, protein families are most peak is due to interfamily distances. not sufficiently tightly knitto avoid considerable overlap One method to compare theshapes of two distribution between families, and individuals cannot be unambigu- functions is the quantile-quantile (Q-Q) plot (Chambers ously assigned to families. This degree of clustering is in- et al., 1983). A Q-Q curve plots the sorted values of one dicated schematically in Figure 14B for a two-dimensional distribution against the sorted values of a second distri- Euclidean space. By systematically varying the clustering bution. If two distributions have the same shape, and dif- parameter, u, and the dimensionality,d, we find the op- fer only by a multiplicative factor that scales the width, timal dimensionality to be about d = 7. It is not clear to or if they differ by a constant factor that shifts themean us if this dimensionality in the Euclidean space analogy values, then a Q-Q plot gives a straight line. Figure 16A has any physical meaning since our dissimilarity measure shows a Q-Q plot of the histogram of Figure 13 versus a is not a true metric. This comparison should be viewed nonclustered distribution of points in Euclidean space. simply as an analogy. The deviations at both extremes of the plot indicate that Our results are consistent with those of Rackovsky protein structures are more broadly distributedthan would (1990). His similarity measure, which is based on local be predicted by the completely nonclustered distribution. conformational preferences, also orders proteins from Hence, a nonclustered uniform distributionof points is not helices to sheets and finds families to be only loosely knit a good model for the distribution protein dissimilarities. entities. On the other hand,Figure 16C shows that proteins are also not well represented as being very tightly clustered (a I0.05). When the distributions are tightly clustered, Conclusions the Euclidean distribution underestimates the numberof similar protein pairs in therange of dissimilarities between We have described a simple quantity for characterizing the 0.40 and 0.60, and overestimates them for distances less structural similarity between any two compact polymer than 0.40. The closest correspondence between the Eu- or protein conformations. Based on differences between clidean distances and protein similarities is obtained for weighted distance maps, it requires no alignments or gap values around u = 0.10. This is the case for which the penalties and makes few assumptions or arbitrary deci- Q-Q plot is most linear (see Fig. 16B). While this value sions about polymer structure. It is computationally fast. of u implies that family members are considerablycloser The only parameter is the exponentp in the distancede-

n

0.2 0.6 1 .O 0.2 0.6 1 .O 0.2 0.6 1 .O database distribution database distribution database distribution

Fig. 16. Q-Q plots comparing protein dataset pairwise dissimilarity distribution with distributions of random point pairwise distances. For the random point distributions, f = 25 and d = 7. Families of globular proteins 899 pendence of the weights. The results are not very sensi- Bowie, J.U., Luthy, R., & Eisenberg, D. (1991). A method to identify tive to this parameter. protein sequences that fold into a known three-dimensional struc- ture. Science 253, 164-170. The method can compare any two conformations, no Chambers, J.M., Cleveland, W.S., Kleiner, B., &Tukey, P.A. (1983). matter how similar or different, and does not require Graphical Methods for Data Analysis. Wadsworth International identical chain lengths. It is intended for the purpose of Group, Duxbury Press, Boston. Chothia, C. & Finkelstein, A.V. (1990). The classification and origins comparing diverse polymer conformations. Several tests of protein folding patterns. Annu. Rev. Biochem. 59, 1007-1039. show that therelatedness among proteins reportedby this Farber, G.K. & Petsko, G.A. (1990). The evolution of cu/p barrel en- measure is sensibly consistent with existing knowledge. zymes. Trends Biochem. Sci. 15, 228-234. Horowitz, E. & Sahni, S. (1978). Fundamentals of Computer Algo- This method can be usedto rapidly search througha da- rithms. Computer Science Press, New York. tabase of protein structures to find specified substructures Kretsinger, R.H. & Nockolds, C.E. (1973). Carp muscle calcium bind- and to seek given functional components in other pro- ing protein 11. Structure determination and general description. J. Bioi. Chem. 248, 3313-3326. teins. For example, it searches the protein dataset in min- Lebioda, L., Stec, B., & Brewer, J.M. (1989). The structure of yeast utes to find possible calcium binding motifs similar to the enolase at 2.25-A resolution. J. Biol. Chem. 264, 3685-3693. EF hand. Lesk, A.M. (1991). Protein Architecture [Practical Approach Series]. IRL Press, New York. Such a similarity measure can beused to test algorithms Levitt, M. & Chothia, C. (1976). Structural patterns in globular proteins. of protein folding for which generated conformations Nature 261, 552-558. may be distant from the native structure. Hence, the Liljas, A. & Rossmann, M.G. (1974). Recognition of structural domains in globular proteins. J. Mol. Bioi. 85, 177-181. CONGENEAL measure canserve as a sort of “reaction Maggiora, G.M. &Johnson, M.A. (1990). Introduction of similarity in coordinate’’ for nativeness. For such problems, gaps are chemistry. In Concepts and Applications of Molecular Similarity unimportant. (Maggiora, G.M. &Johnson, M.A., Eds.), pp. 1-13. Wiley, New York. We combine this measurewith two differentclustering Matthews, F.S. (1985). The structure, function, and evolution of cyto- methods to identify protein families. We then ask how chromes. Prog. Biophys. Mol. Biol. 45, 1-56. tightly clustered are families by drawing an analogy with Mondragon, A., Subbiah, S., Almo, S.C., Drottar, M., & Harrison, S.C. (1989). Structure of the amino-terminal domainof phage 434 points distributed in Euclidean space. The analogy indi- repressor at 2.0 A resolution. J. Mol. Bioi. 205, 189-200. cates that protein families are only loosely knit entities, Ohlendorf, D.H., Anderson, W.F., & Matthews, B.W. (1983). Many and that individual proteins may often notbe unambig- gene-regulatory proteins appear to have a similar a-helical fold that binds DNA and evolved from a common precursor. J. Mol. Evol. uously assignable to a unique family. 19, 109-114. Rackovsky, S. (1990). Quantitative organization of the known protein X-ray structures. I. Methods and short-length scale results. Proteins Supplementary material Struct. Funct. Genet. 7, 378-402. Remington, S.J. & Matthews, B.W. (1978). A general method to assess This work would not have been possible without the similarity of protein structures, with applications to T4 bacteriophage enormous amount of effort required to experimentally lysozyme. Proc. Nail. Acad. Sci. USA 75, 2180-2184. determine the protein structuresused in this analysis. Ref- Richardson, J.S. (1981). The anatomy andtaxonomy of protein struc- ture. Adv. Protein Chem. 34, 167-339. erences for all the structures obtained from thePDB are Richardson, J.S. & Richardson, D.C. (1988). Helix lap-joints as ion- available on theDiskette Appendix andby request from binding sites: DNA-binding motifs and Ca-binding “EF hands” are the authors. related by charge and sequence reversal. Proteins Struct. Funct. Genet. 4, 229-239. Richardson, J.S. & Richardson, D.C. (1989). Principles and patterns of Acknowledgments protein conformation. In Prediction of Protein Structure and the Principles of Protein Conformation (Fasman, G.D., Ed.), pp. 1-98. We thank Sarina Bromberg, Fred Cohen, Kai Yue, and Chris Plenum, New York. Carreras for many helpful discussions, and the DARPA Uni- Rossmann, M.G. & Argos, P. (1976). Exploring structural homology of proteins. J. Mol. Biol. 105, 75-95. versity Research Initiative program and the NIH for financial Sali, A. & Blundell, T.L. (1990). Definition of general topological equiv- support. Molecular graphics images were produced using the alence in protein structures. J. Mol. Bioi. 212, 403-428. MidasPIus software system from theComputer Graphics Lab- Satyshur, K.A., Rao, S.T., Pyzalska, D., Drendel, W., Greaser, M., & oratory, University of California, San Francisco. Sundaralingam, M. (1988). Refinedstructure ofchicken skeletal mus- cle troponin C in the two-calcium start at 2-A resolution. J. Bioi. Chem. 263, 1628-1647. References Schevitz, R.W., Otwinowski, Z., Joachimiak, A., Lawson, C.L., & Sigler, P.B. (1985). The three-dimensional structure of trp repres- Abola, E.E., Bernstein, F.C., Bryant, S.H., Koetzle, T.F., & Weng, sor. Nature 317, 782-786. J. (1987). Protein Data Bank. In CrystallographicDatabases- Taylor, W.R. & Orengo, C.A. (1989). Protein structure alignment. J. Information Content, Software Systems, Scientific Applications Mol. Biol. 208, 1-22. (Allen, F.H., Bergerhoff, G., & Sievers, R., Eds.), pp. 107-132. Data Tufty, R.M. & Kretsinger, R.H. (1975). Troponin and parvalbumin cal- Commission of the International Union of Crystallography, Bonn/ cium binding regions predicted in myosin light chain and T4 lyso- Cambridge/Chester. zyme. Science 187, 167-169. Beers, Y. (1957). Introduction to the Theory of Error. Addison-Wesley, Vyas, N.K., Vyas, M.N., & Quiocho, EA. (1987). A novel calcium bind- Reading, Massachusetts. ing site in the galactose-binding protein of bacterial transport and Bernstein, F.C., Koetzle, T.F., Williams, G.J.B., Meyer, E.F., Jr., Brice, chemotaxis. Nature 327, 635-638. M.D., Rodgers, J.R., Kennard, O., Shimanouchi, T., & Tasumi, M. Williams, R.L., Greene, S.M., & McPherson, A. (1987). The crystal (1977). The Protein Data Bank: A computer-based archivalfile for structure of ribonuclease B at 2.5 A resolution. J. Bioi. Chem.262, macromolecular structures. J. Mol. Bioi. 112, 535-542. 16020- 16031.