<<

p J. Mol. Bd-(1991) 219, 555-565

Amino Acid Substitution Matrices from an , Information Theoretic Perspective Stephen F. Altschul National Center for Biotechnology Information National Library of Medicine National Institutes of Health Bethesda, MD 20894, U.S.S.

(Received 1 October 1990; accepted 12 February 1991)

Protein sequence alignments have become an important tool for molecular biologists. Local alignments are frequently constructed with the aid of a “substitution score ” that specifies a scorefor aligning each pair of residues. Over the years, manydifferent substitution matrices have been proposed, based on a wide variety of rationales. Statistical results, however, demonstrate that any such matrix is i.mplicitly a “log-odds” matrix, with a specific targetdistribution for aligned pairs of amino acid residues. Inthe light of information theory, itis possible to express the scores of a in bits and to see that different matrices are better adapted to different purposes. The most widely used matrix for sequence comparison has been the PAM-250 matrix. It is argued that for database searches the PAM-,I20 matrix generally is more appropriate, while for comparing two specific with.suspecte4 homology the PAM-200 matrix is indicated. Examples discussed include the lipocalins, human a,B-glycoprotein, the cysticfibrosis transmembrane conductance regulator and the globins. Keywords: homology; sequence comparison; statistical significance; alignment algorithms; pattern recognition

2. Introduction . (Smith & Waterman, 1981; Goad & Kanehisa, 1982; Sellers, 1984). This has the General methods for protein sequence comparison advantage of placing no a priori restrictions on the were introduced to molecular biology 20 years ago length of the local alignments sought. Most data- and have since gained widespreaduse. Most early base search methods have been based on such local attemptsto measure protein sequencesimilarity alignments(Lipman & Pearson, 1985; Pearson & ‘‘$’vused on global sequencealignments, in which Lipman, 1988; Altschul et aE.,1990). 1.vclry residue of the two sequences compared had to To evaluate local alignments, scores generally are participate(Needleman & Wunsch, 19TO; Sellers, assigned to each aligned pair of residues (the set of 1954; Sankoff C Kruskal, 1983). However, hecause such scores is called a substitution matrix), aswell as distantly related proteins may share only isolated to residues aligned with nulls: the score of the regions of similarity, e.g. in the vicinity of an active overall alignment is then taken to be the sum of site, attention has shifted to local as opposed to these scores. Specifying an appropriate amino acid global sequence similarity measures. The basic idea substitution matrix is central to protein comparison is to consider only relatively conserved sub- methodsand much effort has been devoted to xquences; dissimilar regions do not contribute toor defining, analyzingand refining such matrices ,thtract from the measure of similarity. Local sirni- (SIcl,achlan, 1971; Dayhoff et al., 1978; Schwartz & iurity mar be studied in a variety of ways. These lhyhoff, 1975; Feng et al., 1985; Rao, 1987: Risler et include measuresbased on the longest matching al.. 1988). One hope has been to find a matrix best segments of two sequences with a specified number adaptedtodistinguishing distant evolutionary or proportion of mismatches (Arratia’et al., 1986; relationships from chancesimilarities. Recent Xrratia & Waterman, 1989), as well as methods that mathematicalresults (Karlin & Altschul, 1990; compare all segments of a fixed, predefined Karlin et al., 1990) allow all substitution matrices to “window”length (McLachlan, 1971). The most be\%wed in a common light,and provide a common practice, however, is to consider segments rationale for selecting particular sets of “optimal” 1;f’ all lengths,and choose those that optimize scores for local protein sequence comparison. 2. The Statistical Significance of Local Sequence Alignments

C;lohal alignments are of essentially no IIS~unless they can aIlow gaps. but this is not true for local alignments. The ability to choose segments with arbiora.r?- starting positions in each sequence means that biologically significant regions frequently may be aligned without the need to introduce gaps. Ij'hile: in general. it. is desirable to allow gaps in low1 dignments.doins so greatlydecreases their mathematical tractabilit?. Theresults described here applv rigorously only t.0 1oc:al alignmen1.s that. lack gap. i.e. to segments 01' ecpa.1 I~ngthfrom each of the two sequcncrs cmmpsretl. Somc rcwnt di~,ti~- base search tools have focusccl on tintling st~c:halign- ments (-4ltsc:hul & Lipmm, 1990; Alt.sc:hnl r!f 0.1.. Not.ic:c* thiLt. multiplying all the wares of a sul)st.itu- , 1990). Howrc-er. the statisLics Of optimal s(wrt's fi)r tion matris 1)y somr ~~ositivc t:ollstiLl1t (Ioos not lot:al ;Aignments that include gaps (Smith e! a/., nff1Ac:t. tllc re1;~t.i~~s(:ores of' any sl~i,;lli~,rn~nc~l~ts.Two 19S.5; li'aterman et d., 1957) are 1)roully ;~nalogous mat riws n?lat.c'tl 1)y s11(*h f'ncAtor (:Hn, lhrrrf(~re,be I to those for thc no-gap case (Karlin Rr Altsc:huI: c:onsidercd , c:ssc+ntialIy t~(Iuiv;~hi.Tnslwc*tion of ; 1990; Karlin el al., IYYO), where more precise resull,s cqna.tion (1 ) re\T(!;LIs that multiplying dlworm by (I ' are availaMe. Therdore, one may hope that many of ;~lsohas the c?ffr:ct of dividing 2. I)? 11. Tht. t);lrameter 1 the hicideas prcsenfsd t)elow will ge~~eraliw1.0 1. mil?., t.lrc?rc\fore,I)r viewed :I t~t11r:dsc-;~l(. for 4 local alignments that include gaps. any sroring system; its c1ery)er meaning wi\: be ,d Formally, we assume that the aligned amino acids discassc~lhelow. c ai and ai are assigned thesubstitution score sij. !$ C;ivw~two random protein stlquw1ws as tlcscribed .i Given two protein sequences, thepair of (~11151 above, how many distinct, 01' -'loc*ally optimal" length segments that, when aligned,have the (Sellers.,* 1984) MSPs with score at least S are greatest. aggregate score we call the Maximal expect@ to occur simply by chance? This number is Segment Pair (MSPt). An MSP may be' of any well apbroximated by the formula: length; its score is the MSP score. Since any two protein sequences, related or un- K.1' c - As related, a-ill have some MSP score, it is important to where N is the product of the sequenws' len$hs, know how great a score onecan expect to find and h' is an explicit]? calculable parameter (liarlin simpIy by chance. To address this questionone & Altschul, 1990; Karlin et nl.. 1990). When needs some model of chance. Thesimplest is to comparing a single randomsequence withall the. . assume that in the twoproteins compared, the sequences in a database, settingiV to the product of' amino acid ai appearsrandomly with the prob- the query sequence length and the database length ability pi. These probabilities are chosen to reflect (in residues) yields an upper bound on the number the observed frequencies of the aminoacids in of distinct MSPs with score at least S .that the actual proteins. For simplicity of discussion we will search is especkd to yield. assume both proteins share the same amino acid probabilitydistribution; more generally, one can allow them to have different distributions. A random protein sequence is. simply one constructed 3. Optimal Substitution Matrices for Local according to this model. For the sake of the statistical theory, we need to Formula (2) allows us to tell when a segm makc two crucial but reasonable assumptions about has a significantly high score. However, it doe the substitution scores. The first is that there be at assist in choosing an appropriatesuhsti least one positive score and the second is that the matrix in the first place. A second class of r expected score xi,jpipjs,j be negative. Because we however, has direct bearing on this question. Th permit the length of a segment pair to be adjusted state that among MSPs from the comparison to optimize its score, boththese assumptions are random sequences, the amino acids ai an a necessary also from practical perspective. If there aligned with frequency approaching qij = p wereno positive scores, the MSP would always (Arratia et or., 1988; Kartin 8: Altschul, 1990; consist of a single pair of' residues (or none at all, if et al., 1990; Dembo & Karlin, 1991). this were permitted), and such an alignment is not Given any snbstitution matrix and rando of interest. If the expected score for two random tein model,one may easily calculatethe residues were positive, extending a segment pair as target frequencies, qij, just described. Sotice by the definition of J. in equation (I), these t t Abbreviations used: MSP. Maximal Segmrnt hir: frequencies sum to 1. Sow among ztlig~ment Ig. immunoglobulin. sentingdistant homologies, the amino acids Amino Acid Substitution Matrices -537 with certain characteristic frequencies, Only doubtfulthat any “target distribution” theorem e correspond to a matrix’s target frequencies, can be proved. It may be possible to make a convincing case for a particular substitution matrix in the global alignment context, but the argument will most likely have to be different from that for Any substitutionmatrix has an implicit set of local alignments(Karlin & Altschul, 1990). The arget frequencies for aligned amino acids. Writing sameapplies to substitution matrices used with thescores of thematrix in terms of itstarget fixed-length windows for studying local similarities frequencies, one has: (McI,achlan, 1971; Argos, 1987; Stormo & Hartzell, 1989): a fixed quantity can be added to all entries of such a matrix with no essential effect. It is notable ,qij = (In %)/A. (3) that while the PAM matrices were developed origin- Pi Pj ally for global sequence comparison (Dayhoff et al., 19SS), their statistical ,theory has blossomed in the In other words, the score for an amino acid pair can local alignment context. be written as the logarithm to some base of that pair‘s target frequency divided by the background frequency with which the pair occurs. Such a ratio compares theprobability of an’ event owurring 5. Local Alignment Scores as Measures undertwo alternativehypotheses and is culled a of Information likelihood or oddsratio. Scores thatare the Multiplying a substitution matrix by a constant logarithm of odds ratios are called log-odds scores. changes A but does not alter the matrix’s implicit Adding such scores can be thought of as multiplying target frequencies. By appropriate scaling, one may the correspondingprobabilities, which is appro- therefore select the parameter A at will. Writing the priate for independent events. so that the totalscore matrix in log-odds form, such scaling corresponds rtmains a log-odds score. merely to using a different implicit base for the Log-odds matrices have been advocated in a logarithm. One natural choice for ,I is 1, so that all number of contexts, (Dayhoff et al., 197s; Gribskov scores become natural logarithms. Perhaps more et ai., 1987; Qtormo & Hartzell, 1989). The widely appealing is to choose A = In 2 = 0.693, so that the used PAM matrices (Dayhoff *et al., 1;978), for base for the log-odds matrix becomes 2. This lends a instance, are explicitly of this form. Other substitu- particularly intuitive appeal to formula (2). Setting tion matrices. though based on a wide *variety of the expected number of MSPs with score at least S rationales. are all log-odds matrices, but with equal to p, and solving for S, one finds: implicit rather Lhan explicittarget frequencies. T:-:-refore,while onemay criticize the method K s = log, - +log2 x. (4) drwiI.)c:tl hy Ih~hdfet al. for estimatingappro- P priate target. frryuencies (Wilbur, 19S5), the most direct way to derive superior matrices appears to 1)e For typical substitution matrices, K is found to be through the refined estimation of amino acid pair near 0-1, and an alignment may be considered target ant1 txwkground frequencies ratherthan significant when y is 005. Therefore the right-hand through any f~~nrlamttntallydifferent approach. side of equation (4) generally is dominated hy the termlogz N. In other words, the score needed to distinguish an MSP from chance is approximately the number of bits needed to specify where the aZSP cf Substitution Matrices for Global Alignments st.arts in each of the two sequences being compared. l\,’hile we have heen wnsitleringsut)stit~rtion (One bit can be thought of as the answer to a single matrices irl the conlest of local secluenw cwmpari- yes-noquestion; it is theamount of information SOII. thry mtr’ Iwt.rnployec1 for glotwl alignment as needed to distinguish between 2 possibilities. It well (Xe(~!lIrlnun& Wunsch. I!)liO; Sell~rs,1974; 1)ecomes apparent that, in general. logz 9 bits of Schwartz 8 Ih~hoR,l!I‘iX). There is a fundamental informationare needed to distingrlish among S difference. twwever, hetween the use of such possibilities.) matrices in tlwse two contexts. For global align- For comparing two proteins of length 290 amino ments. as previously. mult.iplying all scores bya acid residues, about 16 bits of information are ti. 4 positive number has no effect on the relative ,required; for comparing one such protein to a st -wsof rliff’rrtwt afignrnents. But adding a fised sequencedatabase cont.sining 4,000,000 residues. (pantity tu the score for aligning any pair of about 30 bits are needed. When cast in this light, resit1ut.s (and @ to the wore for aligning a residue alignment scores are notarbitrary numbers. By with a null) likewise has no effect. Scoring systems appropriate scaling (multiplying by i./0-693) they that may hr transformed into one another by means takeon the units of bits,and rough significance of these two rules are,for all practical purposes, calculationscan be performed in one’s head. equivalerrt. ~nfortunatel~,the new transformation Furthermore, when so normatized. different amino mw1S that no unique log-odds interpretation of acid substitution matrices may he directly glolx~lsubstitution matrices is possible, and it is compared. 558 S. F. A It.schul

6. The Relative Entropy of a Substitution Matrix The abovereview of previous resuks has provided us nit.h the necessary tools forthe analysis tha.t. follows. The ultimate goal is to decide which substi- t.ution matrices are the most appropriate for. data- base searching and for detailed pairwise sequence comparison. Given a random protein model and a substitution matrix: one may calculate the target frequencies qij characteristic of thealignments for which the> matrix is optimized. A useful quantity toconsider is the average score [information) per residue pair in these a.lignments. .Assuming the substitution matrix is normalized as described above, this value is simp]!: .. .- log2 -.Qij cies for data.l)ase sea.rches. I -& rrj Assuming the model des(-ribed hy 1)ayhoff et a/, -., t,J Pi Pj (1978), Table. 1 lists the relative entropy II implicit Xotice that H depends bothon thesubstitution in a range of T'AY matrices. As arguedabove, matrix and on the randomprotein model. Tn distinguishing an alignmentfrom chancer in a search information theoretic terms, H is the relative of a t_vpical current protein database using an

entropy of the target and background distributions. average length protein requires ahout 30 bits of ~ The origin of the name need not be of concern. The information. Accordingly, fur an alignment of - important point is that, for an alignment character- segments separated by a given PAM distal] ized by the target frequencies qij, H measures the (:an calcu1at.e the minimum length necessary to rise averageinformation available per position to above background noise; these lengths are recorded . ,: distinguish the alignment from chance. Intuitively, in Table j. For instance. at a distance of 250 PAMs,

H,relatively cant, such an alignment wouidneed to I short aliGments with the target distributioncan be length greater than about 83 residues. >Ian?-binlngi- distinguished from chance, while, if the value of I1 is cally interesting regions of protein similarity ;',re 4 lower, longer alignments are necessary. much shorter than this, andaccordingly neeri a standpoint. From a study of mutations between a ment position, while one of leng-11 50 residues nil] large number of closely relatedproteins, Dayhoff need about 075 bit. Table I shows that such align-' and co-workers proposed a stochastic model of pro- ments will not be detectable if theirconstituent

The relative entropy H of PAM matrices .e

PA2u significant Min. PAM Min. significant distance H (bits) length (30 bits) distance H (bits) length (30 bits)

0 1.17 '8 180 060 51 -4 10 343 9 190 055 55 7 '10 295 11 200 051 59 30 255 12 210 048 63 40 2.26 14 220 045 68 50 2.00 15 230 042 73 . .$ 60 1.79 17 2M 039 78

1.30 24 90 1.30 270 032 94 1.18 26 100 1.18 280 030 100 1 110 1-08 28 2m 0.28 107 1.20 098 02531 300 113 . .1 130 090 . 34 310 025 190 ai 140 082 35 , 320 024 125 150 076 40 330 0-22 134 .c.* 070 43 021 141 - .. 160 340 :g 150 065 ' 47 350 0-20 I49 Am ino Acid Substitution MatricesSubstitution Acid Amino 559

Table 2 The average score (in bits) per alignment position when using given PAM'mtrices to compare segments in fact separated by a variety of P,4M distances

PAX mat.rixActual PAM distance D of segments , -1.I employed 40 80 120 160 200 240 280 3'20

40 226 1.31 0.62 0.10 -0.30 -061 -0.86 -1.06 80 i.14 1.44 0.92 0.53 0.23 -0.02 -021 -037 120 1.93 1.39 0.98 0.67 @e2 022 0.06 -0.07 160 1.71 1.28 0.95 0.70- 0.50 033 0.20 0.09 200 1.5 1 1.16 0.90 068 051 038 0.26 01s 240 1.3'2 1.05 0.82 065 0.51 0.39 0.29 0.21. 280 1.17 0.94 0.75 0.60 048 0.38 030 oP3 320 1.03 0.84 068 0-56 046 0.37 030 024 segments have diverged by more then about. i5 and For example, the relative entropy for D = 160 is 150 PA$ls. respectively. 0-70bit, but any PAM matrix in the range 120 to 200 yields at least 067 bit per position. In practice, how near the optimai is it important be? 7. PAM Matrices Database Searching and to for As argued above, for a given distance there Two-sequence Comparison PAM is a critical length at which alignments arejust Therelative entropy associated with a specific distinguishable fromchance in atypical current PAM distance indicates how much information per database search; these lengthsare recorded in Table :mition is optimally available. For u. giwtr align- 1. For the sake of analysis, we will assume that it is ment. one mnattain such a score only by using the worth performing an extra search (using a different appropriate.P.4.M matrix, but, of course, before the PAM matrix) only if it is able to increase the score alignment is found it will not beknown which for sucha critical alignment by about two bits, matrix that is. It has therefore been'propowd that a corresponding to a factor of 4 in significance. Since a variety of PAM matrices beused for 'database critical alignment has about 30 bits of information: searches (Collins et al., 1988). We seek here to we will therefore be satisfied using a PAM matrix analyze how many such matrices are necessary, and that yields a score greater than 93% of the optimal which should he used. achievable. Using data such as those shown in Table Suppose one uses a matrix optimized for PAM 2, onecan calculate for which PAM distances D .iistance rM to compare two homologous protein (and thus for which critical lengths) a given matrix segments that areactually separated by PAM iM is appropriate; the results are recorded in Table distance D. For a range of values of 1V and D, the 3. Our experience has shown that perhaps the most averagescore achieved per alignment position is typical lengths for distant local alignments arethose shown in Table 2. Xotice that for any given matrix for which the PAM-I20 matrix gives near-optimal :tl, the smaller the actual distanceD, the higher the scores, i.e. lengths 19 to 50 residues. Therefore, if score. On the other hand, for a specific distance D, one wishes to use a single standardmatrix for t.he highest score corresponds to thematrix with database searches, the PAM-I20matrix (Table 4) is PAM distance M = D;this score is just the relative a reisonable choice. This matrixmay, however, ' ,Itropy discussed above. Using a PAM matrix with miss short but strong or long but weak similarities .ii near D,however, can yield a near-optimal score. that contain sufficient informationto be found. Accordingly, Table 3 shows that to compiement the PAM-I20 matrix,the PAM40 and PAM-230 (or traditional' PAM-250) metricescan be used. Table 3 Additional matrices should improve the detectionof Ranyrs of locnl alignment lengths for which various distant similarities onfy marginally (i.e. raise their PAY ntdrices are appropriate scores by at most 2 bits). Tf, rather than searching a database witha query 9SU0 rtticiency range 87 yo efficiency range sequence, one wishes to comparetwo specific !'.U for tiatatwe searching for I-sequencv sequences for whichone already has evidence of -1;ttris (Wbits) comparison (I6 bits) - relatedness, the background noise is great.ly 40 9 to t t 4 to I4 decreased. As discussed above, for two proteins of x0 19 to 34 6 to 22 typical length,about 16 bits are needed to It0 I9 to 50 ? to 33 distinguish a local alignment from chance. 160 tti to 50 It to 46 Accordingly, applying the same criteria as before, a 1(W) 36 kJ 94 16 to 8-1 240 4i t~ 123 I1 to x0 matrixshould be considered adequate for those 280 ciu to 15.5 27 to 101 PAJI distances at which it yields an average score 322) 53 to 192 34 to It4 within Si?/, of the optimal. In Table 3, we list the :3riO 94 10 233 42 to 149 range of critical lengthsover which various PAM matrices areappropriate for detailcdpsirwiw and PAM-] 20 scores for Mel's representing tlisr;...!t sequence comparison. As a single matrix,the relationships to four different query sequences. PA>.!-200 spansthe most typical range of local all cases, we consider relationships near the limit alignment lengths, i.e.16 to 6% . residues. what can be distinguished from chance in a se Alternatively, if two different matrices are to be of the PIR protein squence database(Release 26.0; used, the PAM-80 and PAM-250, which together 7,348,350 residues). It will be noticed that the high- spanalignment lengths ti to 85 residues, or the est chaince Pm-250 scores are consistently slightly PAM-120 and PAM-320 matrices, which span smaller than the highestchance PAM-I20 scores. lengths 9 to I24 residues, appear to be appropriat.e This is primarily attributable to the fact that the pairs. parameter K discussed above is about half as large Since it isconvenient to express substitution for the former scores as for the latter. Furtherrn(:;e, matrices ils integers, and since a probability factor since neither the PIR database nor a given query of.2 between score levels is too rough, the units for sequence ever precisely fits the random protein the PAM-120 matrix shown in Table 4 are half bits. model described by Dayhoff el al. (1958), the para- The scores in the original PAM-250 matrix (Dayhoff meter 2 variesslightly from one comparison to et al.: 1978) were scaled as lox log,,. Because another. Therefore, while we will treat the PAM-120 ' 10/(h 10) z 3/(h 2) to within 04%, a unit score in scores from Table 4 as half bits, .and the PAM thatmatrix can bethought of as approximately scores of Dayhoff et al. (1978) as one-third bits, it one-third of a bit. should be noted thatthis is always a slight approximation. 8. Biological Examples (a) Lipocal ins As discussed, the particular PAM matrixthat best distinguishes distant homologies from chance We used the BLASTprogram (.4ltschul et similarities found in a database search depends on 1990) to search the PIR database with huma thenature of the homologies present,and this poprotein D precursor (PIR code LPHUD; Draynyna cannot be known a priori. However, it is frequently el al., 1985), using both the PAM-250 (Dayhoff et ai.-' the case that distantly related proteins will share 1978) and PAM-180 (Table 3) substitution matrices- isolated st.retches of relatively conserved amino acid Huma.n apolipoprotein D precursor is a 189 resirhe residues, corresponding to activesites or other glycoprotein that belongs to the lipocalin important structural features. It has been observed (a2-microglobulin) superfamily, which contains pro- that in general the mutationsalong genes coding for teins that exhibit a wide range of functions re1 proteins arenot Poisson-distributed (Uzzell & to their ability to bind small hydrophobic Corbin, 1971; Holmquist et al., 1983), suggesting The similarities among these proteins and their that short, conserved regions are to be expected. As logical roles have been ana.lped (Peitsch 8 Bogu shown in Table 3, this means that the widely used 1990), and crystal structures are avilahk PAM-250 matrix generally will not be optimal for several members of the superfamily (Cowan et locating distant relationships. 1990). Three proteinsin the superfamily a In the examples below, we compare the PAM-250 androgen-dependent epididymal protein (PI Amino Acid Substitution Xatrices 36 1

p‘ Optimal PAM-250 Optimal PAM-120

PIR code optimal PAM-250 alignment score (bits) score (bits) i- LPHW 25 LGKCPNPPVQENFDVNKYLGRWYEI 49

SQRTAD 12 IAAGTEGAVVKDFDISKE‘LGFWYEI 36 27.0 33 -5

~32202 27 HDTVQPNFQQDKFLGRWY 44 25.7 33.5

HCHU 28 NIQVQENFNISRIYGKWYNL 47 23.0 30.5

Highest chance alignment score : 27.0 29.0

FIR code of sequence involved:SO0758 SO0758

. (b) fluman CY ,B-glycoprotein We ‘searchedthe PTR database with human a,B-glycoprotein (PTR code OMHCIB; Tshioka ~t al., Igt)Ci), a plasma glycoprotein of unknown func- tion, and a member of the immunoglobulin superfa.- mily. Lsing the PAM-950 matris, the only protein in the database with an MSP that rises above hark- ground noise is pig T’o2 F protein (PTR code P1,0030: Van de Weghe et al., ISXS), which achieves a score of 32.3 bits. As shown in Table 6. the score for this known homology (\‘an de K’eghe at 01.. 1988) rises to 45.0 bits when the T’X.\.I-l20 rnatris is used instead. In uitlition, two proteins with irl1munoglol)ulin domains, kinase-related trans- formingprotein prevursor (1’1 R (*ode SOO474: Qiu vt . nl.. 1988) andhuman T~Kchain precursor V-TTI region (PIR code KSHUVH; Pech & Zachau. 19x4). uc.hicv seores of 290 and 28.6 bits, resppctively. Table 6 illustrates that both these similarities are only justdistinguishable from chance, and that using the PAM-250 matrix both similarities drop in score by at least four bits.

(c) The cystic Jihrosis transmembrane conductance reydator The muse of cystic fibrosis has been traced to mutations in a protein that bears striking similarity to manyproteins involved in thetransport of substancesacross the cell membrane (PIR code .430300: Riordan et al., 1989). Characteristic features of the protein are two (ATP)- I)inding folds (Higgins el al., 1986). When the PIR database is searchedwith X30300, many related ! Optimal PAM-250 Optimal PAM-”20i PI?. code Optimal PAM-250 alignmentscore (bits) score (bits) - OMHU 1 B 1 AIEYETQPSLWAESESLLKPLANVTLTCQA 30

PL0030 1 ALFLDPPPNLWAEAQSLLEPWTSQS32.3 30 45 .O

OkMJlB 171 LSEPSATVTIEELAAPPPPVLMHHGESSQVLEPGNKVTLTCVAPLS 216

SO0474 18 LRGQTATSQPSASPGEPSPPSIHPAQSELiVEAGDTLSLTCIDP 61 25.0 29.0

KSHUVH 15 LPDTTREIVMTQSPPTLSLSPGERVTLSCRXQS 48 22.0 28.5

Highest chance alignmentscore: 27.0 28 -0

PIR code of sequenceinvolved: 540102 WGSMHH I I

3 proteinsmay be identified easilyusing either the alignments jump in score hy 4 to almost 12 bits, ! 1’.4M-250 or the PAM-I20 substitmtoion matrix. givitlg,nll but one a score greater than the highest Ilowever, several distant relationships present are chance PAM-120 score of 3349 bits. (The boundaries . ? harder to dctect. In Table 5 are shown four optimal of th(optima1 alignments change slightly under the i: PAM-250 alignments,representing homologies to alternatescdiing scheme.) No biologically signifi- 3 each of the two A30300 nucleotide-binding folds. cant. similarity is distinguished by the PAM-250 Snrw of these alignments has a PAM-250 score as matrix that. is not. found using the PAM-120. The ,I great as the highestchance score of 31.3 bits. In relalively high chance scores found in this exarcpie contrast, when the PAM-120 matrix is used. t.hr are partly attributable to the Icngth of the yuery

Table 7 Four MSPs representing distant relationships, from search,es of the PIR protein swpence dutdase (release 2641) dhcystic fibrosis transmemhrnn,e conductance rc{plator ( PlK code A30300)

optimal PAM-250 Optimal PAM-120

PIR code Optimal PAM-250 alignmentscore score(bits) ( bits)

A30300 438 TPVLIiDIk?FKIERGQLLAVAGSTGAGKTSLLMHIMGELEPSEGKI 482

SO5328 i8 VSKDINLEIQDGEFVVFVGPSGCGKSTLLRMIAGLETVTSGDL 60 28.3 40.0 BVECDA 11 ~KNINLVIPRDKLIVGLSGSGKSSL 40 . 24.7 35.0 - A30300 1219 YTEGGNAILENISFSISPGQRVGLLGRTGSGXSTSWLRLLNTEGEI 1267

QRECFB 19 FRVPGXRLRPLSLrrPAGKGLIG~GSGKST~GR 59 29.3 35 .O

QREBOT 31 DGDVTAVNDLNE’TLRAGETLGIVGESGSGKSQSIUJGMG~TNGRI 77 28.3 . 32.5

Highest chancescore: alignment 3i.3 33.0 I PIX code of sequenceinvolved:

sensitivc! proteinsimilarity searches. Science, 227, 14:35-1441. .\lc:Lac-hl;~n,A. I). (]XI). Tests for comparing related, amina acid sqmwen. Cytochrame c and cytochrome cs\,. J. Mol. Bid. 61. 409-424. Seedlcmall, S. 13. & Wunsch. C. D. (1950). X general mrthod applicnhlr~to the search For similarities in the amino acid sequsnres of two proteins. ./. Mol. Bid:' 48. 443453. Osorio-Keese, M. E.? Keese, P. L Gibhs, .4. (1989).' Kurleotidessqornw of thegenome of eggplant mosaic tymovirus. I~i~ology,172, 54'7-554. J'ark, Y.M. & Stauffer. G.V. (1989). DXA sequence of the mclC gene and its flanking regions from Sal4 lyphimwium LT2 and homology withthe cow- spondingsequence of .hcherichiu coli. Mol. Gm Genet. 216, 164-lti9. Patthy, L. (1 987) Detecting hamalo= of distantly Alated proteins with consensus sequences. J. Mol. Bifd!.1 565-577. I'earson. W. R. &, Lipman,.D. J. (1988). Improved for biological sequence comparison. Proc. Xd. Sei., f'.S.A. 85, 2444-2448. Pech, bl. 8: Zachau, H. G. (1984). Immunoglobulin ge of different subgroups are interdigitated within t VK locus. Nd.Acids Res. 12, 9229-9236. Peitsch, M. C. & Boguski, M. S. (1990). Is apolipo D a mammalian bilin-binding protein? Xew B 2, 197-206. Qiu, F., Ray, P., Brown, K., Barker, P. E., Jhani5.a Ruddle, F. H. & Besmer, P. (1988). P ' structure of c-kit: relationship with the CSF-11 receptor kinase familp-oncogenic activation of v- involvesdeletion of extracellulardomain an terminus. EMBO J. 7, 1003-1011. Rajkovic, A.. Simonsen, J. X.. Davis, R. E. 8: F. M. (1989). Molecular clonhg and sequence of 3-hydroxy-3-methylglutaryl-coenzyme reductasefrom the human parasite Schis Amino Acid Substitution .1lcltrices 565

mansoni. PTOC.Xat. Ad. Sci., 1T.S.A. 86. 92 17- Smith, T. I?. & Waterman, >I. S. (1981).Identification of common molecular subsequences. J. Jfol. Biol. 147, 0, J, K. M. (1987). Xew scoring matrix for amino acid 195-197. . residueexchanges based on residuecharacteristic Smith, T. F.,Waterman, 31. S. I% Burks. C. (1985). The physicalparameters. Int. J. Pept. Protein Res. 29, statisticaldistribution of nucleicacid similarities. :Vucl. Acids Res. 13, 645-656. .fiichardson, M., Dihorth, 31. .J. gE Scawen, 31. D. (1975). Stormo. C. D. & Hartzell, G. W., 111 (1989). Identifying : The amino acidsequence of leghaernoglobin T from protein-binding sites from unaligned DXA fragments. rootnodules of broadbean (Viciu ja6a I,.). PEDS Prooc. Nut. ilcud. Sci., D'.S.R . 86, 1 153-1 187. Lelterx, 51. 33-35. Suzuki, T.(1989). Amino acid sequence of a major globin Rjordan, J. R.. Rommens, J. M.,Kerem. B. S., Alon, X,, from thesea cucumber Puracuudina chilensis. Rozmahel. R.. Gnelczak, Z., Zielenski. .J.. Lok, S.. Biochim. Biophys. Acta, 998, 292-296. Plavsic. X., Chou, J. L., Drumm. 11. L.. Ianouzzi. Taylor, W. R. (1986). Identification of protein sequence M. C.. Collins. F. S. & Tsui, L. C. (1989). homology by consensus template alignment. J. Mol. . Identification of the cysticfibrosis gene: cloning and Bid. 188, 233-258. characterization of complementary DXA. I3 tzence. :' Urade. Y., Nagata, A. Suzuki. Y. & Hayaishi, 0. (L98!)). 245, 1066-1OiS. Primarystructure of ratbrain prostaglandin D Risler. J. L., Delorrne. M. 0.:Delscroix. H. & Henaut. A. synthetasededuced trom cDXA sequence. .I. Biol. :1!j88). Aminoacid substitutions in struct.urally I.'/K~L.264, 1041-1045. dated proteins. 4 patternrecognition approach. Uzzell! T. & Corbin, K. W. (1971). Fittingdiscrete '. Determination of 8 new and stiicient scoring matrix. probabilitydistributions to evolutionary eyents. ' J. MOL. Bid. 204. 1019-1029. Science. 172, 1089- 1096. ' Sankoff, D. Jt Krtlskal. .1. R. (1.983). Tim Wnr7~s.Shiny Van Weghe,de A.. Coppieters, W.. Bauw. C.. Vanderkerckhove. .I. k Bouquet. Y. (1988). The homology Iwtweell the serum proteins PO:! in pig. Xk in horseand a,B-glycoprotein in human. Comp. Biochem. Physiol. WB,751-756. Watrman. SI. S.. (hrtlon. I,. C Arrat.is. R. (l!)Si). I'hase tri~nsitic~~~sill srqurnw matches and nucleic wid struc.trrre. /'roc. .Val. drod. Sri.. i'.S..4. 84. I?:$!)- 1 243. \\'iIl~ur. \V. J. (1!)%5). On the. P.411 matris moclrl r1f protrin . .Vd. Aid. Ed. 2. 4:34--Hi. %;rlac.;ritl. 11.. (hnmlw. A,. (;wrrt?ro. 11. C:.. Mattialiano. R. '.I.. MaIImtida. F. iy: Jinwnw.. A. (IWG). S~wleotitlt: seque~weof the hygromypitl I3 phospl~o- t r;~nsfrrase gr11e !ram S/repl'lont!/rc.s Ir!/~"'.~~"llip//.s. .Y/tr/. A1:itl.s HPS. 14. IM5-15Xl.