<<

From: ISMB-96 Proceedings. Copyright © 1996, AAAI (www.aaai.org). All rights reserved.

A Sequence Similarity Search Algorithm Based on a Probabilistic Interpretation of an Alignment Scoring System

Philipp Bucher and Kay Hofmann,

Swiss Institute for Experimental Cancer Research (ISREC) Ch. des boveresses 155, CH-1066Epalinges, Switzerland { pbucber,khofmann } @isrec-sun 1 .unil.ch

Abstract to define a metric over the sequence space (e.g. Sellers 1974). Wepresent a probabilistic interpretation of local Here, we advocate a different view of local sequence sequencealignment methodswhere the alignment scor- alignment methods, in which the scoring system plays ing system(ASS) plays the role of a stochastic process defining a probability distribution over all sequence the role of a stochastic process generating pairs of pairs. Anexplicit algorithmis given to computethe pro- related sequences. Based on such an interpretation, we bability of two sequencesgiven an ASS.Based on this propose a modified version of a Smith-Watermanalgo- definition, a modified version of the Smith-Waterman rithm where the score computedfor two sequences is a local similarity search algorithm has been devised, whichassesses sequencerelationships by log likelihood log likelihood ratio of two probabilities, one given the ratios. Whentested on classical examplessuch as glo- scoring system, and one given a null-model. The bins or G-protein-coupled receptors, the new method mathematical concepts underlying this approach is provedto be up to an order of magnitudemore sensitive closely related to maximumlikelihood estimation for than the native Smith-Watermanalgorithm. global sequence alignments (e.g. Bishop & Thompson 1986). The performance of the new database search Introduction technique is assessed by test protocols previously used for the evaluation of alternative alignment scoring sys- The comparison of a new protein sequence against a tems. database of knownproteins is perhaps the most impor- tant computer application in molecular sequence analysis. It is generally accepted that the Smith- Review of the Native Smith-Waterman Watermanlocal similarity search algorithm (Smith Algorithm Waterman1981) is the most sensitive technique to dis- cover significant weak similarities between two Let a=ala2"’’a,,, and b=blb2"’’b,~ be two sequences. The more frequently used heuristic algo- sequences of residues from an alphabet S containing N elements. A local between two rithms implemented in the programs FASTA(Pearson such sequences is defined by an alignment path 1990) and BLAST(Altschul et al. 1990) can be con- sidered approximations or special cases of a full represented as an ordered set of index pairs: Smith-Waterman algorithm offering high speed in U = (Xl,Yl), (x2,Y2), " (Xl, l) E ach index pair exchangefor reduced sensitivity. identifies a pair of matched residues in the sequence alignment. A valid path u for sequences a and b The Smith-Waterman algorithm maximizes an align- satisfies the following conditions: xk+l>Xk, y~-+l>yk, ment scoring function over all possible local alignments x , Yl<-rt . between two sequences. The scoring function depends I<_m An alignment scoring system (ASS.) consists of on a set of parameters referred to as an alignment scor- substitution matrix and a gap weighting function. The ing system, which consists of a residue substitution substitution matrix defines substitution scores s(a,b) matrix and a linear gap penalty function. Optimal for pairs of residues (a,b)~ 2. The gap weighting alignment scores computed by a Smith-Waterman or function assigns weights w(k)= ot + ~k to insertions related sequence alignment algorithm have traditionally and deletions of length k>l. For gap length zero, we been viewed and interpreted as measures of similarity or distance (Smith, Waterman, & Fitch 1981). Con- define w(0) = cerning their mathematical properties, they were shown

44 ISMB-96 The scoring system assigns an alignment score Sa to PrObHMM(a) = Pr obH~(a,u) any local sequence alignment (a,b,u): paths u l over all possible paths through the model. SA (a,b,u) = k)~.~s (axk,by (1) The probability of a sequence pair a,b being gen- k=l erated by an ASSis the sum of the probabilities I-I of sequence pair a,b being generated via a partic- - ~.~w(Xk+l--Xk--1)+w (Y~+l--Yk--1) ular path u: k=l ProbAss(a,b ) = ~ ProbAss(a,b,u) Note that the alignment score can be described as the paths u stun of two components, a sequence-dependent match over all valid local alignment paths. score: I (iii) In database applications, membership of a SM(a,b,u) = S( axk,by,) (2a) sequence a to an HMM-definedsequence class is k=l estimated by an HMMscorewhich has the form and a sequence-independent gap score: of a log likelihood ratio: l-I Prob~M(a) HMMscore(a) = log Sa(u) = - w(x~+l--x~--l)+wfyk+ryk--1) (2 b) Probnull(a) k=l These two definitions will be useful in the formulation where Probn,t/(a) is the probability of sequence of the probabilistic version of the Smith-Waterman given a specific null-model. algorithm. The native Smith-Waterman algorithm com- In a probabilistic Smith-Waterman search, an putes the optimal local alignment score for two ASS-defined kind of similarity between two sequences: sequences a,b will be estimated by a PSWscore of the following type: SWscore (a,b) = max SA (a,b,u), (3) paths u ProbAss(a,b) PSWscore (a,b) = log which serves as a measure of sequence similarity in Probnult(a,b) database search applications. An efficient procedure to What remains to be done in order to implement the compute SWscore(a,b) is described in (Gotoh 1982). method suggested by this analogy, is to chose an appropriate null-model for randomsequence pairs, and Probabilistic Smith-Waterman (PSW) to work out a reasonable definition for the probability Algorithm of a local sequence alignment Probass(a,b,u) based the definition of the alignment score. The probabilistic version of the Smith-Watermanalgo- The null-model we choose has the general form: rithm is based on an analogy between alignment scor- ing systems (ASSs) and hidden Markov models Probnun(a,b) PL(m,n)Po(a,b) (4) (HMMs),a class of statistical models that have recently where PL(m,n) is a length distribution over sequence been introduced to molecular sequence analysis (Baldi length pair classes, and Po(a,b) is the null-model pro- et al. 1994, Krogh et al. 1994). This analogy comprises bability of sequence pair a,b given the length pair pair the following ideas and assumptions: class n ,m, whichis defined as follows: (i) An HMMdefines a probability distribution over the sequence space by means of a stochastic pro- Po(a,b) = fi p(ai) fi p(bj) (5) i=l j=l cess involving a randomwalk through the model. An ASSdefines a probability distribution over the where p (a) denotes the null-model probability of resi- space of sequence pairs by means of a stochastic due a. The null-model thus essentially consists of a process involving a random walk through an residue probability distribution over the alphabet S. alignment path matrix. The distribution of sequence length pair classes will be the same for the null-model and the ASS-defined (ii) The probability of a sequence a being generated probability distribution. For reasons that will become by an HMMis the sum of the probabilities of clear later, we require that there is a logarithmic base z sequence a being generated via a particular path such that: U:

Bucher 45 s~a~) p(a)p(b)z = 1. (6) B(n,m) = ~ sa

Pt.(n,m) Pa (a,b,u) palhs u Combiningequations 8, 9, and 11, we obtain the fol- PL (n ,m )Po(a,b) lowing intuitively plausible expression for the PSWscore: Z PA (a,b,u) paths u Z Z SA(a’b’u) Po(a,b) pathsu PSWscore (a,b) (13) For sequences of a given length pair class s#u~ a e Sm, b e S", the probability of a sequence align- paths u ment (a,b,u) will be defined as follows: Both sumsin the log likelihood ratio can efficiently be zSA(~,b,u) P0(a,b) PA (a,b,u) (9) computed by special cases of the forward algorithm B (n ,m used for computation of HMMscores (Krogh et al. where SA (a,b,u) is the local alignment score of (a,b,u) 1994, Rabiner 1989). Numerical recipes are given in as defined by equation 1, and B(n,m) is a length nor- Figure 1. malization term whose functions is to ensure that the probabilities of all sequence pairs of length pair class Performance Evaluation of the PSW m ,n sum to one. The definition of the alignment pro- Algorithm bability is natural if one interprets Smith-Waterman scores as log likelihood ratios, as suggested by Altschul In order to comparethe sensitivities of the PSWalgo- (Altschul 1991). The length normalization term obvi- rithm and the native Smith-Waterman algorithm, we ously has to satisfy: performed parallel database searches on SWISS-PROT release 32 (Bairoch & Apweiler 1996) with prototype SA (a,b,u) B(n,m) ~ Z Po(a,b) (10) query sequences from known protein families and m aeS ,b~S" .paths u domains, including the globins, the hsp20 family, cyto- Wewill showthat the expression for B (re,n) simplifies chrome C, the G protein-coupled receptor family, and to SH2, SH3 as domain examples. The results of these tests are shownin Table 1.

46 ISMB-96 Ez SA (a,b,u) paths u PSWscore (a,b) ~’~. Z Sc(u) paths u

Algorithm to compute: ~i~sa(a z’b’u) Algorithmso(u) to compute: ~ z paths u pathsu

s(a pb 1) Ml,l <-- Z M1,~<--- 1; 11.1~- O; li,l~--- 0; Dl,l ~-- 0. D1,1~--- O. for i@-2 tom: for i ~---2 to m: s(ai’b Mi,l~-- 1; Mi,l<._ Z l). ) ~ Ii,l~-- Z-aZ-flMi-i,l + Z-~IIi-l,l; li.~- z-~z + z-~l~_~.~; Di,I<--- O. Di,l~--- O. for j ~---2 to n : for j~-2 ton:

Mi,j <-- Zs(apbY); Mj.j(-- 1; ll.j~- O; ll,j<-- O; -f~ D. I,j ~--- z"Ctz-flM l,)_l + z-~D I,j_I Di.j~--- z-az + z-f~Dl.j-v for i<--2tom; j~---2 ton: for i~---2tom, j~--2 ton: i,s b.)(a Mi,j~. 1 -I.-FMi_I,j_ I + li_l,j_ 1 + Di_l.j_l; Mi,j(" Z J [ l + Mi-l.j-l + Ii-l,j-l + Di-l.j-l ] --IS --a ~,--a ~., I ida"- Z [Z Mi-l,j + [i-l.j + ~ Di-l,j J -[J + li "~ Z z-aMi I ¯ .t- [i I ¯ z-~Di I " + Oi ,j ~¢-- Z-I~L[z-~Mi,j_I Di ,j-I; ¯ Dij~--Z-~ [z-CtMij_l+Dij_ll.

:E z E M,,j z Mi,j paths u I’O’~m,l

Figure 1: Algorithm to compute the PSWscorefor two sequences a ~ Sm, b e Sn .

With a single query sequence, one typically finds (Pearson 1991) implementing the native Smith- between 50% and 90%of all true positives in these Waterman algorithm, and an experimental program examples. The number of missed sequences depends implementing PSW. BLASTwas used with the default on the divergence of the sequence family, as well as on substitution matrix BLOSUM62. The scoring system stringency of the significance criterion applied. For used for SSEARCHand PSWconsisted of a 10Logl0- cross-standardization, we define equivalent stringency scaled BLOSUM45 matrix and a gap weighting func- levels by a fixed number of false positives accepted. tion w(k)=8 + 4k. The BLOSUM45 was chosen for Such an approach maynot be totally representative of a the latter two methods because it performed better than real application where one does not knowthe status of the default BLOSUM62 matrix used by SSEARCH. the sequences in advance, but it constitutes the only The gap weighting function chosen corresponds to the way of ensuring a fair comparison between methods default settings in SSEARCH. that express similarity scores on genuinely different The results shown in Table 1 document a robust scales. trend of increased sensitivity of the probabilistic For each family and domain, we compared the Smith-Waterman algorithm over the native version. results obtained with three different database search The gain in performance is particularly impressive for programs: BLAST(Altschul et al. 1990), SSEARCH the globin and HSP20 families where the sequence

Bucher 47 similarities often extend over the entire length of the where a superior performance of PSWover SSEARCH proteins compared. Note that in these examples, one is only observed at the highest selectivity level. These would have to accept a 10 times higher false positive results, if conformedby further experiments, indicate rate with the native Smith-Watermanalgorithm in order that PSWis the currently most sensitive method to to find as many true members as with the PSWalgo- detect weaksimilarities between two protein sequences. rithm. The increased sensitivity is less evident, or even debatable, in the domain examples SH2 and SH3,

Experiment # of true positives missed at N false positives accepted Family Method N=0 N=10 N=100 N=1000 Globin PSW 36 17 10 4 P02023 Smith-Waterman 59 39 19 8 674 members BLAST 95 77 62 50 HSP20 PSW 18 8 2 1 P02515 Smith-Waterman 33 16 11 3 129 members BLAST 43 30 19 14 Cytochrome C PSW 132 113 87 71 P00001 Smith-Waterman 131 131 112 82 243 members BLAST 129 118 103 80 GPCreceptors PSW 76 73 65 52 P30542 Smith-Waterman 86 74 67 48 497 members BLAST 108 95 77 6O SH2-domain PSW 14 9 0 0 P12931 (150-247) Smith-Waterman 17 5 0 0 119 members BLAST 21 20 13 1 SH3-domain PSW 20 14 1 1 P12931 (83-144) Smith-Waterman 25 10 6 1 131 members BLAST 30 22 9 2

Table 1. Comparative performance evaluation of PSW,Smith-Waterman, and BLASTalgorithms. Data and methods: Database searches were performed on SWISS@ROTrelease 32 (Bairoch & Apweiler, 1996) con- mining 49340 entries. The following entries were used as query sequences: P02023: human]3-globin, P02515: Droso- phila m. heat shock protein 22, P00001: humanCytochrome C, P30542: humanadenosine A! receptor, P12931: human proto-oncoprotein Src. The classification of the sequence families and domains is based on PROSITErelease 32 (Bairoch, Bucher, & Hofmann1996). The list of G-protein-coupled (GPC) receptors was compiled from the four SITE entries PS00237, PS00649, PS00979, PS00238. Five additional putative proteins from C. elegans corresponding to SWISS-PROTentries P41590, P34488, P46564, I:’46568, P46567, were also considered true positives. Blast searches were performed with the default parameter settings (BLOSUM62 matrix). PSWand Smith-Waterman searches were performed with a BLOSUM45 matrix and the default gap weights of the program SSEARCH(see text).

48 ISMB~6 Discussion Despite obvious parallels to earlier work, our presen- tation of a probabilistic sequence alignment algorithm We have presented preliminary evidence that current includes several important innovations. One is the non- methods for pairwise sequence alignments can be trivial extension from global to local alignment mode. improved by interpreting an alignment scoring system Another one is its application to a new problem, as a probabilistic model, applying concepts and sequence similarity search. The question addressed by methods that have been introduced to molecular the PSWalgorithm is fundamentally different from sequence analysis in the context of hidden Markov those addressed by maximumlikehood methods applied models. The fact that we observe an increase of sensi- in molecular evolution. The goal of our method is to tivity of the PSWalgorithm over the native Smith- reach a qualitative decision between two alternatives, Waterman algorithm using a scoring system that has presence or absence of significant local sequence simi- only been optimized for the latter one, lends further larity between two sequences, whereas previous max- credibility to the this conclusion. It appears likely that imumlikelihood applications were aimed at quantita- an additional improvement of the performance of PSW tive estimation of evolutionary parameters such as can be achieved by fine-tuning the alignment parame- divergence time or the ratio of indel versus substitution ters specifically for this method. mutation frequencies. This difference explains the pres- Increased sensitivity is not the only benefit resulting ence of a null-model as integral part of our approach, from a probabilistic interpretation of a sequence align- as well as the consistent absence of such a null-model ment method. Another advantage is the availability of in previous work. simple statistical tests, e.g. Milosavljevi~’s algorithmic Another important difference in our formulation of significance test (Milosavljevi6 & Jurka 1993), the probabilistic alignment problem comparedto previ- assess the significance of database search scores. More- ous ones is the absence of an underlying model of over, the fact that PSWscores are scaled as absolute sequence evolution. By dissociating the alignment con- log likelihood ratios of two probabilities, and thus are cepts from a mandatory phylogenetic interpretation we not influenced by sequence length effects or the relative remove an important conceptual obstacle to its applica- log-scale of the scoring system, facilitates the optimiza- bility in contexts where sequence similarity does not tion of the gap weighting parameters. The probabilistic necessarily mean homology.Therefore, the theoretical frameworkalso suggests methods to correct for residue framework of the PSWalgorithm can accommodate composition effects, both regional and global, via com- amino acid substitution matrices derived from an evolu- parison to null-models that retain certain statistical pro- tionary model (e.g. Dayhoff, Schwartz, & Orcutt 1978) perties of the analyzed sequences. as well as indel frequency models based on structural It needs to be mentionedthat we are not the first to superpositions (Pascarella & Argos 1992). formulate a probabilistic sequence alignment method. Sequencesimilarity search is of course not the only Maximumlikelihood approaches have been applied to possible application of a probabilistic alignment various problems in the field of molecular evolution, method generalized to local alignment mode. Most including the estimation of the divergence time for two questions addressed by previous applications of max- DNA sequences (Bishop & Thompson 1986), the imumlikelihood, or of the mathematically related parti- optimization of the parameters of a likelihood model of tion function calculations, can also be formulated with sequence evolution (Thorne, Kishino, & Felsenstein regard to the local alignment problem. Logical combi- 1991, 1992), and the assessment of the reliability of nations of this kind include the design of an molecular sequence alignments (Thorne & Churchill expectation-maximization (EM) algorithm to optimize 1995). The computations performed in these applica- the parameters of a local alignment scoring system for tions rely on the same principles as the PSWalgorithm distantly related protein sequences, as well as the in that they achieve the summation of probabilities development of a dot matrix method reflecting residue over an astronomically large numberof sequence align- association probabilities rather than local segmentpair ments by meansof an efficient recursive procedure pro- similarities. ven to yield the identical result. McCaskill’s method (McCaskill 1990) to compute global and restricted par- Implementation Notes tition functions for RNAsecondary structures falls also into the same class of algorithms and is conceptually The results shownin this paper were generated with an related to probability through the notion of entropy. experimental program implementing a space-efficient version of the algorithms shownin Fig. 1, making use of the recursion introduced by Gotoh (Gotoh 1982).

Bucher 49 The program, which is written in FORTRAN77 and Bairoch, A.; Bucher, P.; and Hofmann, K. 1996. The runs on various UNIXplatforms, accepts as command PROSITEdatabase, its status in 1995. Nucl. Acids Res. line parameters a query sequence, a substitution matrix, 24:189-196. two gap probability parameters, a maximumlength of Baldi, P.; Chauvin, Y.; Hunkapiller, T; and McClure, database sequences to be processed, and a cut-off value M.A. 1994. Hidden Markov models of biological pri- for PSWscores to be reported in the output. The mary sequence information. Proc. Natl. Acad. Sci. sequence database is read from the standard input. The USA91:1059-1063. program starts by computing the sequence-independent quantity 2.J z SG (a,b,u) ror~ aa relevant length pair com- Bishop, M.J., and Thompson, E.A. 1986. Maximum pathsu likelihood alignment of DNAsequences. J. Mol. Biol. binations, making a single path through the algorithm 190:159-165. shownon the right side of Figure 1, and then processes Dayhoff, M.O.; Schwartz, R.M.; and Orcutt, B.C. 1978. the sequence database sequentially by applying the A model of evolutionary change in proteins. In Atlas algorithm shown on the left side of Figure 1 to each of Protein Sequence and Structure (Dayhoffl M.O., ed.) individual sequence. vol 5, suppl. 3, pp. 345-352. WashingtonDC: National Since no effort has been invested to optimize the Biomedical Research Foundation. speed of the program, no realistic time-efficiency com- Gotoh, O. 1982. An improved algorithm for matching parisons between the PSW and the native Smith- biological sequences. J. Mol. Biol. 162:705-708. Waterman algorithm can be made at this point. The current experimental program is at least ten times Henikoff, S., and Henikoff, J.G. 1992. Amino acid slower than Pearson’s program SSEARCH(Pearson1 substitution matrices from protein blocks. Proc. Natl. 1991). Comparison with an HMMsearch program Acad. Sci. USA89:10915-10919. (Hughey & Krogh 1995) performing essentially the Henikoff, S., and Henikoffl J.G. 1993. Performance same type of computations, suggests that a time- evaluation of amino acid substitution matrices. Pro- efficiency trimmed PSWimplementation sacrificing teins 17:49-61. some arithmetic precision will only take between two Hughey, R., and Krogh, A. 1995. SAM: Sequence and three times more time than publicly available alignment and modeling software system. Technical Smith-Watermandatabase search programs. Report, USCS-CRL-95-7, University of California, The experimental PSWimplementation used in this Santa Cruz. work is available upon request from the authors. We are in the process of preparing a faster public version Karlin, S., and Altschul, S.F. 1990. Methodsfor assess- of the program which will be made available from our ing the statistical significance of molecular sequence ftp site (URLftp://ulrec3.unil.ch). features by using general scoring schemes. Proc. Natl. Acad. Sci. USA87:2264-2268. Acknowledgements Krogh, A.; Brown, M.; Mian, I.S.; Sj/51anderl K.; and Haussler, D. 1994. Hidden Markov models in compu- This work was supported in part by grant 31-37687.93 tational biology. J. Mol. Biol. 235:1501-1531. from the Swiss National Science Foundation. McCaskill, J.S. 1990. The equilibrium partition func- tion and base pair binding probabilities for RNAsecon- References dary structure. Biopolymers 29:1105-1119. Altschul, S.F. 1991. Aminoacid substitution matrices Milosavljevi6, A., and Jurka, J. 1993. Discovering sim- from an information theoretic perspective. J. biol. ple DNAsequences by the algorithmic significance Biol. 219:555-565. method. CABIOS9:407-411. Altschul, S.F.; Gish, W.; Miller, W.; Myers, E.W.; and Pascarella, S., and Argos, P. (1992). Analysis Lipman, D.J. 1990. Basic local alignment search tool. insertions/deletions in protein structures. J. Mol. Biol. J. Mol. Biol. 215:403-410. 224:461-471. Bairoch, A., and Apweiler, R. 1996. The SWISS- Pearson, W.R. 1990. Rapid and sensitive sequence PROTprotein sequence data bank and its new supple- comparison with FASTPand FASTA. Methods Enzy- ment TREMBL.Nucl. Acids Res. 24:21-25. tool. 183:63-98.

50 ISMB-96 Pearson, W.R. 1991. Searching protein sequence libraries: Comparisonof the sensitivity and selectivity of the Smith-Waterman and FASTA algorithms. Genomics 11:635-650. Rabiner, L.R. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77:257-286. Sellers, P.H. 1974. On the theory and computation of evolutionary distances. SIAM J. Appl. Math 26:787- 793. Smith, T.F., and Waterman, M.S. 1981. Identification of commonmolecular subsequences. J. Mol. Biol., 147:195-197. Smith, T.F.; Waterman, M.S.; and Fitch, W.M.1981. Comparative biosequence metrics. J. Mol. Evol. 18:38- 46. Thorne, J.L., and Churchill, G.A. 1995. Estimation and reliability of molecular sequence alignments. Biometrics 51:100-113. Thorne, J.L.; Kishino, H.; and Felsenstein, J. 1991. An evolutionary model for maximumlikelihood alignment of DNAsequences. J. Mol. Evol. 33:114-124. Thorne, J.L.; Kishino, H.; and Felsenstein, J. 1992. Inching toward reality: An improved likelihood model of sequence evolution. J. Mol. Evol. 34:3-16.

Bucher 51