A Sequence Similarity Search Algorithm Based on a Probabilistic Interpretation of an Alignment Scoring System

From: ISMB-96 Proceedings. Copyright © 1996, AAAI (www.aaai.org). All rights reserved. A Sequence Similarity Search Algorithm Based on a Probabilistic Interpretation of an Alignment Scoring System Philipp Bucher and Kay Hofmann, Swiss Institute for Experimental Cancer Research (ISREC) Ch. des boveresses 155, CH-1066Epalinges, Switzerland { pbucber,khofmann } @isrec-sun 1 .unil.ch Abstract to define a metric over the sequence space (e.g. Sellers 1974). Wepresent a probabilistic interpretation of local Here, we advocate a different view of local sequence sequencealignment methodswhere the alignment scor- alignment methods, in which the scoring system plays ing system(ASS) plays the role of a stochastic process defining a probability distribution over all sequence the role of a stochastic process generating pairs of pairs. Anexplicit algorithmis given to computethe pro- related sequences. Based on such an interpretation, we bability of two sequencesgiven an ASS.Based on this propose a modified version of a Smith-Watermanalgo- definition, a modified version of the Smith-Waterman rithm where the score computedfor two sequences is a local similarity search algorithm has been devised, whichassesses sequencerelationships by log likelihood log likelihood ratio of two probabilities, one given the ratios. Whentested on classical examplessuch as glo- scoring system, and one given a null-model. The bins or G-protein-coupled receptors, the new method mathematical concepts underlying this approach is provedto be up to an order of magnitudemore sensitive closely related to maximumlikelihood estimation for than the native Smith-Watermanalgorithm. global sequence alignments (e.g. Bishop & Thompson 1986). The performance of the new database search Introduction technique is assessed by test protocols previously used for the evaluation of alternative alignment scoring sys- The comparison of a new protein sequence against a tems. database of knownproteins is perhaps the most impor- tant computer application in molecular sequence analysis. It is generally accepted that the Smith- Review of the Native Smith-Waterman Watermanlocal similarity search algorithm (Smith Algorithm Waterman1981) is the most sensitive technique to dis- cover significant weak similarities between two Let a=ala2"’’a,,, and b=blb2"’’b,~ be two sequences. The more frequently used heuristic algo- sequences of residues from an alphabet S containing N elements. A local sequence alignment between two rithms implemented in the programs FASTA(Pearson such sequences is defined by an alignment path 1990) and BLAST(Altschul et al. 1990) can be con- sidered approximations or special cases of a full represented as an ordered set of index pairs: Smith-Waterman algorithm offering high speed in U = (Xl,Yl), (x2,Y2), " (Xl, l) E ach index pair exchangefor reduced sensitivity. identifies a pair of matched residues in the sequence alignment. A valid path u for sequences a and b The Smith-Waterman algorithm maximizes an align- satisfies the following conditions: xk+l>Xk, y~-+l>yk, ment scoring function over all possible local alignments x , Yl<-rt . between two sequences. The scoring function depends I<_m An alignment scoring system (ASS.) consists of on a set of parameters referred to as an alignment scor- substitution matrix and a gap weighting function. The ing system, which consists of a residue substitution substitution matrix defines substitution scores s(a,b) matrix and a linear gap penalty function. Optimal for pairs of residues (a,b)~ 2. The gap weighting alignment scores computed by a Smith-Waterman or function assigns weights w(k)= ot + ~k to insertions related sequence alignment algorithm have traditionally and deletions of length k>l. For gap length zero, we been viewed and interpreted as measures of similarity or distance (Smith, Waterman, & Fitch 1981). Con- define w(0) = cerning their mathematical properties, they were shown 44 ISMB-96 The scoring system assigns an alignment score Sa to PrObHMM(a) = Pr obH~(a,u) any local sequence alignment (a,b,u): paths u l over all possible paths through the model. SA (a,b,u) = k)~.~s (axk,by (1) The probability of a sequence pair a,b being gen- k=l erated by an ASSis the sum of the probabilities I-I of sequence pair a,b being generated via a partic- - ~.~w(Xk+l--Xk--1)+w (Y~+l--Yk--1) ular path u: k=l ProbAss(a,b ) = ~ ProbAss(a,b,u) Note that the alignment score can be described as the paths u stun of two components, a sequence-dependent match over all valid local alignment paths. score: I (iii) In database applications, membership of a SM(a,b,u) = S( axk,by,) (2a) sequence a to an HMM-definedsequence class is k=l estimated by an HMMscorewhich has the form and a sequence-independent gap score: of a log likelihood ratio: l-I Prob~M(a) HMMscore(a) = log Sa(u) = - w(x~+l--x~--l)+wfyk+ryk--1) (2 b) Probnull(a) k=l These two definitions will be useful in the formulation where Probn,t/(a) is the probability of sequence of the probabilistic version of the Smith-Waterman given a specific null-model. algorithm. The native Smith-Waterman algorithm com- In a probabilistic Smith-Waterman search, an putes the optimal local alignment score for two ASS-defined kind of similarity between two sequences: sequences a,b will be estimated by a PSWscore of the following type: SWscore (a,b) = max SA (a,b,u), (3) paths u ProbAss(a,b) PSWscore (a,b) = log which serves as a measure of sequence similarity in Probnult(a,b) database search applications. An efficient procedure to What remains to be done in order to implement the compute SWscore(a,b) is described in (Gotoh 1982). method suggested by this analogy, is to chose an appropriate null-model for randomsequence pairs, and Probabilistic Smith-Waterman (PSW) to work out a reasonable definition for the probability Algorithm of a local sequence alignment Probass(a,b,u) based the definition of the alignment score. The probabilistic version of the Smith-Watermanalgo- The null-model we choose has the general form: rithm is based on an analogy between alignment scoring systems (ASSs) and hidden Markov models Probnun(a,b) PL(m,n)Po(a,b) (4) (HMMs),a class of statistical models that have recently where PL(m,n) is a length distribution over sequence been introduced to molecular sequence analysis (Baldi length pair classes, and Po(a,b) is the null-model pro- et al. 1994, Krogh et al. 1994). This analogy comprises bability of sequence pair a,b given the length pair pair the following ideas and assumptions: class n ,m, whichis defined as follows: (i) An HMMdefines a probability distribution over the sequence space by means of a stochastic pro- Po(a,b) = fi p(ai) fi p(bj) (5) i=l j=l cess involving a randomwalk through the model. An ASSdefines a probability distribution over the where p (a) denotes the null-model probability of resi- space of sequence pairs by means of a stochastic due a. The null-model thus essentially consists of a process involving a random walk through an residue probability distribution over the alphabet S. alignment path matrix. The distribution of sequence length pair classes will be the same for the null-model and the ASS-defined (ii) The probability of a sequence a being generated probability distribution. For reasons that will become by an HMMis the sum of the probabilities of clear later, we require that there is a logarithmic base z sequence a being generated via a particular path such that: U: Bucher 45 s~a~) p(a)p(b)z = 1. (6) B(n,m) = ~ sa<u) (11) aES,beS paths u Note that log-odds substitution matrices, such as those where So(u) is the gap score of alignment path u from the PAM or BLOSUMseries (Henikoff defined by equation 2b, if the null-model satisfies the Henikoff 1992), have known mutational equilibrium constraint imposed by equation 6. Let us first rewrite compositions and logarithmic bases satisfying the the expressionfor B (m,n ) as follows: above condition. If a different residue composition is ac¯ u)[.-, s ~-, sMCa,b,~)~. .. B(m,n) = 2~ ] 2.~ z r0ta,D ) (12) used as null-model, there will always be a unique solu- paths a tion for z solving the above equation, if the substitution LaeSm,beS. matrix satisfies the conditions necessary for local where S~(a,b,u) is the match score of the sequence sequence alignments (Altschul 1991). alignment (a,b,u) as defined by equation 2a. The inner The ASS-definedprobability distribution has the gen- sum in the above expression can be rewritten as shown eral form: below, using the notation v iv2 ¯ ’ ’ vm-~ , wlw2.., wn_ for the residues of sequences a and b, b) (7) t ProbAss(a, = PL (m ,n )PA (a,b) which are not part of a matchedresidue pair defined by where PA(a,b) is the ASS-defined probability path u: sequence pair a,b, given the length pair class n,m. The zsM(a’b’u)P0(a,b precise nature of the length pair class distribution is ) re,be n s (at ,bvk) irrelevant, as long as it is the same for the null-model aE S S t m-I n -I l and the ASS, since the term Pt.(m,n) cancels itself in =sumI"[P (avk)YIp (bw,)lIP (ax,)P (by,)z the definition of the PSWscore" aEsm. bESn k=l k=l k=l ProbAss(a,b) PSWscore - (8) p (b ~’C p’~ (a)p’t’ (b)z Prob.~u(a,b) a~S.b~S Pt.(n,m) Pa (a,b,u) palhs u Combiningequations 8, 9, and 11, we obtain the fol- PL (n ,m )Po(a,b) lowing intuitively plausible expression for the PSWscore: Z PA (a,b,u) paths u Z Z SA(a’b’u) Po(a,b) pathsu PSWscore (a,b) (13) For sequences of a given length pair class s#u~ a e Sm, b e S", the probability of a sequence align- paths u ment (a,b,u) will be defined as follows: Both sumsin the log likelihood ratio can efficiently be zSA(~,b,u) P0(a,b) PA (a,b,u) (9) computed by special cases of the forward algorithm B (n ,m used for computation of HMMscores (Krogh et al.

Load more