Amino Acid Substitution Matrices from an Information Theoretic Perspective
J. Mol. Biol. (1991) 219, 555-565

Stephen F. Altschul

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, U.S.A.

(Received 1 October 1990; accepted 12 February 1991)

Protein sequence alignments have become an important tool for molecular biologists. Local alignments are frequently constructed with the aid of a “substitution score matrix” that specifies a score for aligning each pair of amino acid residues. Over the years, many different substitution matrices have been proposed, based on a wide variety of rationales. Statistical results, however, demonstrate that any such matrix is implicitly a “log-odds” matrix, with a specific target distribution for aligned pairs of amino acid residues. In the light of information theory, it is possible to express the scores of a substitution matrix in bits and to see that different matrices are better adapted to different purposes. The most widely used matrix for protein sequence comparison has been the PAM-250 matrix. It is argued that for database searches the PAM-120 matrix generally is more appropriate, while for comparing two specific proteins with suspected homology the PAM-200 matrix is indicated. Examples discussed include the lipocalins, human α1B-glycoprotein, the cystic fibrosis transmembrane conductance regulator and the globins.

Keywords: homology; sequence comparison; statistical significance; alignment algorithms; pattern recognition

1. Introduction

General methods for protein sequence comparison were introduced to molecular biology 20 years ago and have since gained widespread use. Most early attempts to measure protein sequence similarity focused on global sequence alignments, in which every residue of the two sequences compared had to participate (Needleman & Wunsch, 1970; Sellers, 1974; Sankoff & Kruskal, 1983). However, because distantly related proteins may share only isolated regions of similarity, e.g. in the vicinity of an active site, attention has shifted to local as opposed to global sequence similarity measures. The basic idea is to consider only relatively conserved subsequences; dissimilar regions do not contribute to or subtract from the measure of similarity. Local similarity may be studied in a variety of ways. These include measures based on the longest matching segments of two sequences with a specified number or proportion of mismatches (Arratia et al., 1986; Arratia & Waterman, 1989), as well as methods that compare all segments of a fixed, predefined “window” length (McLachlan, 1971). The most common practice, however, is to consider segments of all lengths, and choose those that optimize a similarity measure (Smith & Waterman, 1981; Goad & Kanehisa, 1982; Sellers, 1984). This has the advantage of placing no a priori restrictions on the length of the local alignments sought. Most database search methods have been based on such local alignments (Lipman & Pearson, 1985; Pearson & Lipman, 1988; Altschul et al., 1990).

To evaluate local alignments, scores generally are assigned to each aligned pair of residues (the set of such scores is called a substitution matrix), as well as to residues aligned with nulls: the score of the overall alignment is then taken to be the sum of these scores. Specifying an appropriate amino acid substitution matrix is central to protein comparison methods and much effort has been devoted to defining, analyzing and refining such matrices (McLachlan, 1971; Dayhoff et al., 1978; Schwartz & Dayhoff, 1978; Feng et al., 1985; Rao, 1987; Risler et al., 1988). One hope has been to find a matrix best adapted to distinguishing distant evolutionary relationships from chance similarities. Recent mathematical results (Karlin & Altschul, 1990; Karlin et al., 1990) allow all substitution matrices to be viewed in a common light, and provide a rationale for selecting particular sets of “optimal” scores for local protein sequence comparison.

2. The Statistical Significance of Local Sequence Alignments

Global alignments are of essentially no use unless they can allow gaps, but this is not true for local alignments. The ability to choose segments with arbitrary starting positions in each sequence means that biologically significant regions frequently may be aligned without the need to introduce gaps. While, in general, it is desirable to allow gaps in local alignments, doing so greatly decreases their mathematical tractability. The results described here apply rigorously only to local alignments that lack gaps, i.e. to segments of equal length from each of the two sequences compared. Some recent database search tools have focused on finding such alignments (Altschul & Lipman, 1990; Altschul et al., 1990). However, the statistics of optimal scores for local alignments that include gaps (Smith et al., 1985; Waterman et al., 1987) are broadly analogous to those for the no-gap case (Karlin & Altschul, 1990; Karlin et al., 1990), where more precise results are available. Therefore, one may hope that many of the basic ideas presented below will generalize to local alignments that include gaps.

Formally, we assume that the aligned amino acids $a_i$ and $a_j$ are assigned the substitution score $s_{ij}$. Given two protein sequences, the pair of equal length segments that, when aligned, have the greatest aggregate score we call the Maximal Segment Pair (MSP†). An MSP may be of any length; its score is the MSP score.

† Abbreviations used: MSP, Maximal Segment Pair; Ig, immunoglobulin.

Since any two protein sequences, related or unrelated, will have some MSP score, it is important to know how great a score one can expect to find simply by chance. To address this question one needs some model of chance. The simplest is to assume that in the two proteins compared, the amino acid $a_i$ appears randomly with the probability $p_i$. These probabilities are chosen to reflect the observed frequencies of the amino acids in actual proteins. For simplicity of discussion we will assume both proteins share the same amino acid probability distribution; more generally, one can allow them to have different distributions. A random protein sequence is simply one constructed according to this model.

For the sake of the statistical theory, we need to make two crucial but reasonable assumptions about the substitution scores. The first is that there be at least one positive score and the second is that the expected score $\sum_{i,j} p_i p_j s_{ij}$ be negative. Because we permit the length of a segment pair to be adjusted to optimize its score, both these assumptions are necessary also from a practical perspective. If there were no positive scores, the MSP would always consist of a single pair of residues (or none at all, if this were permitted), and such an alignment is not of interest. If the expected score for two random residues were positive, extending a segment pair as [...]

Notice that multiplying all the scores of a substitution matrix by some positive constant does not affect the relative scores of any subalignments. Two matrices related by such a factor can, therefore, be considered essentially equivalent. Inspection of equation (1) reveals that multiplying all scores by a constant $a$ also has the effect of dividing $\lambda$ by $a$. The parameter $\lambda$ may, therefore, be viewed as a natural scale for any scoring system; its deeper meaning will be discussed below.

Given two random protein sequences as described above, how many distinct, or “locally optimal” (Sellers, 1984) MSPs with score at least $S$ are expected to occur simply by chance? This number is well approximated by the formula:

$$K N e^{-\lambda S} \qquad (2)$$

where $N$ is the product of the sequences' lengths, and $K$ is an explicitly calculable parameter (Karlin & Altschul, 1990; Karlin et al., 1990). When comparing a single random sequence with all the sequences in a database, setting $N$ to the product of the query sequence length and the database length (in residues) yields an upper bound on the number of distinct MSPs with score at least $S$ that the search is expected to yield.

3. Optimal Substitution Matrices for Local Sequence Alignment

Formula (2) allows us to tell when a segment pair has a significantly high score. However, it does not assist in choosing an appropriate substitution matrix in the first place. A second class of results, however, has direct bearing on this question. These state that among MSPs from the comparison of random sequences, the amino acids $a_i$ and $a_j$ are aligned with frequency approaching $q_{ij} = p_i p_j e^{\lambda s_{ij}}$ (Arratia et al., 1988; Karlin & Altschul, 1990; Karlin et al., 1990; Dembo & Karlin, 1991).

Given any substitution matrix and random protein model, one may easily calculate the target frequencies, $q_{ij}$, just described. Notice that, by the definition of $\lambda$ in equation (1), these target frequencies sum to 1. Now among alignments representing distant homologies, the amino acids [...] with certain characteristic frequencies. Only [...] correspond to a matrix's target frequencies, [...]

Any substitution matrix has an implicit set of target frequencies for aligned amino acids. Writing the scores of the matrix in terms of its target frequencies, one has:

$$s_{ij} = \left( \ln \frac{q_{ij}}{p_i p_j} \right) / \lambda. \qquad (3)$$

In other words, the score for an amino acid pair can [...]

[...] doubtful that any “target distribution” theorem can be proved. It may be possible to make a convincing case for a particular substitution matrix in the global alignment context, but the argument will most likely have to be different from that for local alignments (Karlin & Altschul, 1990). The same applies to substitution matrices used with fixed-length windows for studying local similarities (McLachlan, 1971; Argos, 1987; Stormo & Hartzell, 1989): a fixed quantity can be added to all entries of such a matrix with no essential effect. It is notable that while the PAM matrices were developed originally for global sequence comparison (Dayhoff et al., 1978), their statistical theory has blossomed in the local alignment context.
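The relations among the scale parameter, the target frequencies and formula (2) can be illustrated numerically. The following Python sketch is offered only as an illustration under stated assumptions. Equation (1) is not reproduced in this section, so the sketch assumes it is the normalization condition $\sum_{i,j} p_i p_j e^{\lambda s_{ij}} = 1$, which is what the statement that the target frequencies sum to 1 implies. The four-letter alphabet, the +1/-1 scores and all function names are hypothetical and are not taken from the paper, and the parameter K is treated as a given input rather than calculated.

```python
import math

def solve_lambda(p, s, iterations=200):
    """Return the positive root of sum_ij p_i p_j exp(lam * s_ij) = 1.

    ASSUMPTION: this normalization is taken to be the paper's equation (1).
    """
    n = len(p)
    pairs = [(p[i] * p[j], s[i][j]) for i in range(n) for j in range(n)]
    # The two assumptions the statistical theory places on the scores.
    if max(score for _, score in pairs) <= 0:
        raise ValueError("at least one score must be positive")
    if sum(w * score for w, score in pairs) >= 0:
        raise ValueError("the expected score must be negative")

    def f(lam):
        return sum(w * math.exp(lam * score) for w, score in pairs) - 1.0

    # f(0) = 0 with negative slope (the expected score); f is convex and
    # eventually positive, so there is exactly one root with lam > 0.
    lo, hi = 0.0, 1.0
    while f(hi) <= 0.0:          # expand until the root is bracketed
        lo, hi = hi, 2.0 * hi
    for _ in range(iterations):  # bisection keeps f(lo) <= 0 < f(hi)
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def target_frequencies(p, s, lam):
    """q_ij = p_i p_j exp(lam * s_ij); these sum to 1 when lam is the root above."""
    n = len(p)
    return [[p[i] * p[j] * math.exp(lam * s[i][j]) for j in range(n)]
            for i in range(n)]

def score_in_bits(score, lam):
    """Rescale a raw score; for a matrix entry s_ij this is log2(q_ij / (p_i p_j))."""
    return lam * score / math.log(2.0)

def expected_msp_count(K, N, lam, S):
    """Formula (2): K * N * exp(-lam * S). K is supplied by the caller."""
    return K * N * math.exp(-lam * S)

# Hypothetical toy model: four equiprobable letters, match +1, mismatch -1.
p = [0.25] * 4
s = [[1 if i == j else -1 for j in range(4)] for i in range(4)]
lam = solve_lambda(p, s)             # the exact root for this toy model is ln 3
q = target_frequencies(p, s, lam)
print(lam, sum(map(sum, q)), score_in_bits(1, lam))
```

For this toy model the normalization condition reduces to a quadratic in $e^{\lambda}$ whose positive non-trivial root gives $\lambda = \ln 3$, so a match is worth $\log_2 3 \approx 1.58$ bits and the target frequencies place total weight 3/4 on matches. Bisection is used here only because it needs nothing beyond the standard library; any bracketing root finder would serve.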